Log File Analysis
Logs contain very detailed information about events happening on computers. The extra detail they provide comes with additional complexity that we need to handle ourselves. For example, a single pageview may generate many log lines, and a session can consist of several pageviews.
Another important characteristic of log files is that they are usually not big. They are massive.
So we also need to cater for their large size, as well as how rapidly they grow.
TL;DR
>>> import advertools as adv
>>> import pandas as pd
>>> adv.logs_to_df(
... log_file="access.log",
... output_file="access_logs.parquet",
... errors_file="log_errors.csv",
... log_format="common",
... fields=None,
... )
>>> logs_df = pd.read_parquet("access_logs.parquet")
How to run the logs_to_df() function:

log_file
: The path to the log file you are trying to analyze.

output_file
: The path where you want the parsed and compressed file to be saved. Only the parquet format is supported.

errors_file
: You will almost certainly have log lines that don't conform to the format you chose, so all lines that weren't properly parsed go to this file. This file also contains the error messages, so you know what went wrong and how you might fix it. In some cases you might simply take these "errors" and parse them again; they might not really be errors, but lines in a different format, or temporary debug messages.

log_format
: The format your logs were written in. Logs can be (and are) formatted in many ways, and there is no right or wrong way. However, there are defaults, and a few popular formats that most servers use, so it is likely that your file is in one of them. This parameter can take any one of the pre-defined formats, for example "common" or "combined", or a regular expression that you provide. This means you can parse any log format, as long as entries are single lines and not formatted as JSON.

date_format
: The date format string that the log file uses. The supported default formats come with default date formats, but in some cases yours might differ. You can use standard Python date format strings. For example, to parse the string "2024-01-01" you can use %Y-%m-%d. If the pattern is correct, the output file's datetime column will be saved as a datetime column; otherwise, it will be saved as a string.

fields
: If you selected one of the supported formats, there is no need to provide a value for this parameter. You have to provide a list of fields if you supply a custom (regex) format. The fields become the names of the columns of the resulting DataFrame, so you can distinguish between them (client, time, status code, response size, etc.). See the sketch below for a custom-format example.
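If your server writes logs in a format of its own, you can still parse it by passing a regex together with a matching list of fields. A minimal sketch, where the log layout, the regex, the field names, and the file paths are all illustrative (none of them are built-in defaults):

>>> import advertools as adv
>>> # Hypothetical layout: <ip> [<datetime>] "<request>" <status> <size>
>>> adv.logs_to_df(
...     log_file="custom.log",
...     output_file="custom_logs.parquet",
...     errors_file="custom_log_errors.csv",
...     log_format=r'^(\S+) \[(.*?)\] "(.*?)" (\d{3}) (\S+)$',
...     fields=["client", "datetime", "request", "status", "size"],
...     date_format="%d/%b/%Y:%H:%M:%S %z",
... )

Each regex group maps, in order, to one name in fields, and those names become the DataFrame columns.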
Supported Log Formats
common
combined (a.k.a "extended")
common_with_vhost
nginx_error
apache_error
Log File Analysis - Data Preparation
We go through an example where we prepare the data for analysis. Here is the plan:

1. Parse the log file into a DataFrame saved to disk with a .parquet extension. A side effect is that your log file is also compressed down to 5% - 15% of its original size, and once in this format it becomes super efficient to query and analyze. Function used: logs_to_df.
2. Convert data types as needed (optional). Most importantly, converting the datetime column into a datetime object helps a lot in querying the data. Other possibilities include converting to categorical data types for more efficient storage and querying. Function used: pandas.to_datetime.
3. Get the hostnames of the IP addresses of the clients sending requests. We can then easily add a hostname column to the original DataFrame. Function used: reverse_dns_lookup.
4. Parse and split URL columns into their respective components. Typically we have request, which is the resource/URL requested, as well as referer, which shows us where the request was referred from. Function used: url_to_df.
5. Parse user agents if available. This allows us to analyze by user-agent family, operating system, bot/non-bot, version, and any other combination we want.
6. Combine all the data together, save it back to a new .parquet file, and start analyzing.
!head data/sample_log.log
66.249.73.72 - - [16/Feb/2022:00:18:53 +0000] "GET / HTTP/1.1" 200 1095 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
109.237.103.118 - - [16/Feb/2022:00:20:39 +0000] "GET /.env HTTP/1.1" 404 209 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
45.12.223.214 - - [16/Feb/2022:00:23:45 +0000] "GET / HTTP/1.0" 200 2240 "http://adver.tools/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36"
51.68.77.249 - - [16/Feb/2022:00:26:23 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "advertools/0.13.0"
51.68.77.249 - - [16/Feb/2022:00:26:23 +0000] "HEAD / HTTP/1.1" 200 0 "-" "advertools/0.13.0"
192.241.211.176 - - [16/Feb/2022:00:31:16 +0000] "GET /login HTTP/1.1" 404 209 "-" "Mozilla/5.0 zgrab/0.x"
66.249.73.69 - - [16/Feb/2022:00:48:56 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.72 - - [16/Feb/2022:00:48:56 +0000] "GET /staging/urlytics/ HTTP/1.1" 200 520 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.75 - - [16/Feb/2022:00:49:38 +0000] "GET /staging/urlytics/_dash-component-suites/dash/html/dash_html_components.v2_0_0m1638886228.min.js HTTP/1.1" 200 154258 "http://www.adver.tools/staging/urlytics/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/98.0.4758.80 Safari/537.36"
66.249.73.75 - - [16/Feb/2022:00:49:39 +0000] "GET /staging/urlytics/_dash-layout HTTP/1.1" 200 2547 "http://www.adver.tools/staging/urlytics/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/98.0.4758.80 Safari/537.36"
import advertools as adv
import pandas as pd
from ua_parser import user_agent_parser
pd.options.display.max_columns = None
adv.logs_to_df(log_file='data/sample_log.log',
               output_file='data/adv_logs.parquet',
               errors_file='data/adv_errors.txt',
               log_format='combined')
Read the parquet file into a pandas DataFrame, and convert the datetime column into a datetime object.
logs_df = pd.read_parquet('data/adv_logs.parquet')
logs_df['datetime'] = pd.to_datetime(logs_df['datetime'],
                                     format='%d/%b/%Y:%H:%M:%S %z')
logs_df
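The plan also mentioned optionally converting some columns to more compact types. A minimal sketch, assuming the combined format produced the method and status columns as in this example:

# Low-cardinality columns compress well as pandas categories
for col in ['method', 'status']:
    logs_df[col] = logs_df[col].astype('category')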
Perform a reverse DNS lookup on the IP addresses in the client column:
%%time
host_df = adv.reverse_dns_lookup(logs_df['client'])
print(f'Rows, columns: {host_df.shape}')
host_df.head(15)
# Rows, columns: (1210, 9)
# CPU times: user 745 ms, sys: 729 ms, total: 1.47 s
# Wall time: 21.1 s
| | ip_address | count | cum_count | perc | cum_perc | hostname | aliaslist | ipaddrlist | errors |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 143.244.132.225 | 426 | 426 | 0.0701004 | 0.0701004 | | | | [Errno 1] Unknown host |
| 1 | 45.146.164.110 | 290 | 716 | 0.0477209 | 0.117821 | | | | [Errno 1] Unknown host |
| 2 | 46.177.196.171 | 192 | 908 | 0.0315945 | 0.149416 | ppp046177196171.access.hol.gr | 171.196.177.46.in-addr.arpa | 46.177.196.171 | |
| 3 | 185.22.173.83 | 182 | 1090 | 0.029949 | 0.179365 | | | | [Errno 1] Unknown host |
| 4 | 152.32.226.223 | 171 | 1261 | 0.0281389 | 0.207504 | | | | [Errno 1] Unknown host |
| 5 | 94.200.35.174 | 154 | 1415 | 0.0253415 | 0.232845 | | | | [Errno 1] Unknown host |
| 6 | 89.47.44.105 | 130 | 1545 | 0.0213921 | 0.254237 | ppp089047044105.access.hol.gr | 105.44.47.89.in-addr.arpa | 89.47.44.105 | |
| 7 | 94.200.92.2 | 119 | 1664 | 0.019582 | 0.273819 | | | | [Errno 1] Unknown host |
| 8 | 143.244.132.234 | 113 | 1777 | 0.0185947 | 0.292414 | | | | [Errno 1] Unknown host |
| 9 | 217.100.98.101 | 81 | 1858 | 0.0133289 | 0.305743 | d9646265.static.ziggozakelijk.nl | 101.98.100.217.in-addr.arpa | 217.100.98.101 | |
| 10 | 203.163.243.241 | 79 | 1937 | 0.0129998 | 0.318743 | | | | [Errno 1] Unknown host |
| 11 | 66.249.73.135 | 77 | 2014 | 0.0126707 | 0.331414 | crawl-66-249-73-135.googlebot.com | 135.73.249.66.in-addr.arpa | 66.249.73.135 | |
| 12 | 194.163.179.92 | 60 | 2074 | 0.00987329 | 0.341287 | vmi660635.contaboserver.net | 92.179.163.194.in-addr.arpa | 194.163.179.92 | |
| 13 | 66.249.73.137 | 58 | 2132 | 0.00954418 | 0.350831 | crawl-66-249-73-137.googlebot.com | 137.73.249.66.in-addr.arpa | 66.249.73.137 | |
| 14 | 109.70.100.30 | 58 | 2190 | 0.00954418 | 0.360375 | tor-exit-anonymizer.appliedprivacy.net | 30.100.70.109.in-addr.arpa | 109.70.100.30 | |
Add a new hostname column by matching IP addresses to their hostnames.
ip_host_dict = dict(zip(host_df['ip_address'], host_df['hostname']))
logs_df['hostname'] = [ip_host_dict[ip] for ip in logs_df['client']]
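Having hostnames alongside the raw IPs makes it easy to sanity-check who is really behind bot traffic. A rough sketch (a substring check on the resolved name, not a full verification):

# Requests whose client IP resolves to a googlebot.com hostname
logs_df['hostname'].str.contains('googlebot.com', regex=False, na=False).sum()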
Split the request URLs into their components.
request_url_df = adv.url_to_df(logs_df['request'])
request_url_df = request_url_df.add_prefix('request_')
request_url_df.head(10)
request_url_df.head(10), showing the informative columns only. The requests are paths without a scheme or netloc, so request_path mirrors request_url, and the over eighty request_query_* columns (one per query parameter seen anywhere in the file), the deeper request_dir_* columns, and the other URL components are NaN for these rows and are omitted here:

| | request_url | request_dir_1 | request_dir_2 | request_last_dir |
|---|---|---|---|---|
| 0 | / | | | |
| 1 | /.env | .env | | .env |
| 2 | / | | | |
| 3 | /robots.txt | robots.txt | | robots.txt |
| 4 | / | | | |
| 5 | /login | login | | login |
| 6 | /robots.txt | robots.txt | | robots.txt |
| 7 | /staging/urlytics/ | staging | urlytics | urlytics |
| 8 | /staging/urlytics/_dash-component-suites/dash/html/dash_html_components.v2_0_0m1638886228.min.js | staging | urlytics | dash_html_components.v2_0_0m1638886228.min.js |
| 9 | /staging/urlytics/_dash-layout | staging | urlytics | _dash-layout |
Do the same for the URLs in the referer column.
referer_url_df = adv.url_to_df(logs_df['referer'])
referer_url_df = referer_url_df.add_prefix('referer_')
referer_url_df.head(10)
referer_url_df.head(10) (referer_query, referer_fragment, referer_hostname, referer_port, and referer_dir_3 are omitted below for brevity):

| | referer_url | referer_scheme | referer_netloc | referer_path | referer_dir_1 | referer_dir_2 | referer_last_dir |
|---|---|---|---|---|---|---|---|
| 0 | - | | | - | - | | - |
| 1 | - | | | - | - | | - |
| 2 | http://adver.tools/ | http | adver.tools | / | | | |
| 3 | - | | | - | - | | - |
| 4 | - | | | - | - | | - |
| 5 | - | | | - | - | | - |
| 6 | - | | | - | - | | - |
| 7 | - | | | - | - | | - |
| 8 | http://www.adver.tools/staging/urlytics/ | http | www.adver.tools | /staging/urlytics/ | staging | urlytics | urlytics |
| 9 | http://www.adver.tools/staging/urlytics/ | http | www.adver.tools | /staging/urlytics/ | staging | urlytics | urlytics |
Parse the user_agent column.
ua_df = pd.json_normalize([user_agent_parser.Parse(ua) for ua in logs_df['user_agent']])
ua_df.columns = 'ua_' + ua_df.columns.str.replace(r'user_agent\.', '', regex=True)
ua_df.head(10)
| | ua_string | ua_family | ua_major | ua_minor | ua_patch | ua_os.family | ua_os.major | ua_os.minor | ua_os.patch | ua_os.patch_minor | ua_device.family | ua_device.brand | ua_device.model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | Googlebot | 2 | 1 | | Android | 6 | 0 | 1 | | Spider | Spider | Smartphone |
| 1 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 | Chrome | 81 | 0 | 4044 | Linux | | | | | Other | | |
| 2 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36 | Chrome | 90 | 0 | 4430 | Windows | 10 | | | | Other | | |
| 3 | advertools/0.13.0 | Other | | | | Other | | | | | Other | | |
| 4 | advertools/0.13.0 | Other | | | | Other | | | | | Other | | |
| 5 | Mozilla/5.0 zgrab/0.x | Other | | | | Other | | | | | Other | | |
| 6 | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | Googlebot | 2 | 1 | | Other | | | | | Spider | Spider | Desktop |
| 7 | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | Googlebot | 2 | 1 | | Other | | | | | Spider | Spider | Desktop |
| 8 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/98.0.4758.80 Safari/537.36 | Googlebot | 2 | 1 | | Other | | | | | Spider | Spider | Desktop |
| 9 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/98.0.4758.80 Safari/537.36 | Googlebot | 2 | 1 | | Other | | | | | Spider | Spider | Desktop |
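Before combining, a quick sanity check of the bot vs. non-bot split can be done from the parsed device family (a sketch; "Spider" is the device family ua-parser assigns to bots):

# Share of requests coming from bots vs. everything else
ua_df['ua_device.family'].eq('Spider').value_counts(normalize=True)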
Combine all data into one DataFrame and save to a new .parquet file.
(pd.concat([logs_df, request_url_df, referer_url_df, ua_df], axis=1)
.to_parquet('data/adv_logs_final.parquet', index=False, version='2.4'))
Start the analysis.
The advantage of using the parquet format is that the file doesn't need to be loaded into memory; it can be queried from disk, just like querying a database. This means you only load the columns you select and the rows that satisfy certain conditions. For example, we can load the ua_device.family and ua_family columns, keeping only the rows where ua_device.family is equal to "Spider". We then count the values in the ua_family column to get the top bots accessing our website.
top_bots = pd.read_parquet(
    'data/adv_logs_final.parquet',
    filters=[('ua_device.family', '==', 'Spider')],
    columns=['ua_device.family', 'ua_family'])['ua_family'].value_counts()
top_bots[:15]
top_bots[:15]
Googlebot 499
PetalBot 46
AhrefsBot 42
Chrome 29
YandexBot 29
LinkedInBot 23
Baiduspider 18
DotBot 17
Twitterbot 16
bingbot 12
MJ12bot 12
Java 10
Nutch 8
masscan 6
FacebookBot 4
Name: ua_family, dtype: int64
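The same on-disk filtering works for any other slice. For example, a sketch for counting the pages Googlebot requested most often (request_last_dir comes from the URL-splitting step above):

googlebot_pages = pd.read_parquet(
    'data/adv_logs_final.parquet',
    filters=[('ua_family', '==', 'Googlebot')],
    columns=['ua_family', 'request_last_dir'])
googlebot_pages['request_last_dir'].value_counts().head(10)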
Happy analyzing!
Parse and Analyze Crawl Logs in a Dataframe
While crawling with the crawl()
function, the process produces logs for
every page crawled, scraped, redirected, and even blocked by robots.txt rules.
By default, those logs can be seen on the command line, as their default destination is stdout.
A good practice is to set a LOG_FILE
so you can save those logs to a text
file, and review them later. There are several reasons why you might want to do
that:
Blocked URLs: The crawler obeys robots.txt rules by default, and when it encounters pages that it shouldn't crawl, it doesn't. However, this is logged as an event, and you can easily extract a list of blocked URLs from the logs.
Crawl errors: You might also get some errors while crawling, and it can be interesting to know which URLs generated errors.
Filtered pages: Pages that were discovered but weren't crawled because they don't belong to the domain (or sub-domains) of the provided url_list, or are on external domains altogether.
This can simply be done by specifying a file name through the optional custom_settings parameter of crawl:

>>> import advertools as adv
>>> adv.crawl('https://example.com',
...           output_file='example.jl',
...           follow_links=True,
...           custom_settings={'LOG_FILE': 'example.log'})
If you run it this way, all logs will be saved to the file you chose, example.log in this case.
Now, you can use the crawllogs_to_df() function to open the logs in a DataFrame:
>>> import advertools as adv
>>> logs_df = adv.crawllogs_to_df("example.log")
The DataFrame might contain the following columns:
time: The timestamp for the process
middleware: The middleware responsible for this process, whether it is the core engine, the scraper, error handler and so on.
level: The logging level (DEBUG, INFO, etc.)
message: A single word summarizing what this row represents, "Crawled", "Scraped", "Filtered", and so on.
domain: The domain name of filtered (not crawled) pages, typically URLs outside the current website.
method: The HTTP method used in this process (GET, PUT, etc.)
url: The URL currently under process.
status: HTTP status code, 200, 404, etc.
referer: The referring URL, where applicable.
method_to: In redirect rows, the HTTP method used to request the URL redirected to.
redirect_to: The URL redirected to.
method_from: In redirect rows, the HTTP method used to request the URL redirected from.
redirect_from: The URL redirected from.
blocked_urls: The URLs that were not crawled due to robots.txt rules.
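With these columns in place, pulling out the interesting rows is a matter of simple filters. A sketch, using the blocked_urls and message columns described above:

>>> # URLs blocked by robots.txt rules
>>> blocked = logs_df[logs_df['blocked_urls'].notna()]
>>> # Rows representing crawl errors
>>> crawl_errors = logs_df[logs_df['message'] == 'error']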
- crawllogs_to_df(logs_file_path)
Convert a crawl logs file to a DataFrame.
An interesting option while using the crawl function is to specify a destination file to save the logs of the crawl process itself. This contains additional information about each crawled, scraped, blocked, or redirected URL.

What you would most likely use this for is to get a list of URLs blocked by robots.txt rules. These can be found under the column blocked_urls. Crawling errors are also interesting, and can be found in rows where message is equal to "error".

>>> import advertools as adv
>>> adv.crawl('https://example.com',
...           output_file='example.jl',
...           follow_links=True,
...           custom_settings={'LOG_FILE': 'example.log'})
>>> logs_df = adv.crawllogs_to_df('example.log')
- Parameters:
logs_file_path (str) -- The path to the logs file.
- Returns DataFrame crawl_logs_df:
A DataFrame summarizing the logs.
- logs_to_df(log_file, output_file, errors_file, log_format, date_format=None, fields=None, encoding='utf-8')
Parse and compress any log file into a DataFrame format.
Convert a log file to a parquet file in DataFrame format, and save all errors (lines not conforming to the chosen log format) to a separate errors_file text file. Any non-JSON log format is possible, provided you have the right regex for it. A few default formats are provided and can be used. Check out adv.LOG_FORMATS and adv.LOG_FIELDS for the available formats and fields.

- Parameters:
log_file (str) -- The path to the log file.
output_file (str) -- The path to the desired output file. Must have a ".parquet" extension, and must not have the same path as an existing file.
errors_file (str) -- The path where the parsing errors are stored. Any text format works; CSV is recommended so it can easily be opened with any CSV reader, using "@@" as the separator.
log_format (str) -- Either the name of one of the supported log formats, or a regex of your own format.
date_format (str) -- The date format in strftime format, in case it differs from the default of the chosen log_format.
fields (list) -- A list of fields, which will become the names of columns in output_file. Only required if you provide a custom (regex) log_format.
encoding (str) -- The encoding of the log file. It defaults to utf-8, but you might need to try others in case of errors (latin-1, utf-16, etc.)
Examples
>>> import advertools as adv
>>> import pandas as pd
>>> adv.logs_to_df(
...     log_file="access.log",
...     output_file="access_logs.parquet",
...     errors_file="log_errors.csv",
...     log_format="common",
...     fields=None,
... )
>>> logs_df = pd.read_parquet("access_logs.parquet")
You can now analyze logs_df as a normal pandas DataFrame.