Parse and Analyze Crawl Logs in a Dataframe (experimental)

While crawling with the crawl() function, the process produces logs for every page crawled, scraped, redirected, and even blocked by robots.txt rules.

By default, those logs can be seen on the command line, as their default destination is stdout.

A good practice is to set a LOG_FILE so you can save those logs to a text file, and review them later. There are several reasons why you might want to do that:

  • Blocked URLs: The crawler obeys robots.txt rules by default, so when it encounters pages that it shouldn't crawl, it skips them. Each skipped page is logged as an event, and you can easily extract a list of blocked URLs from the logs.

  • Crawl errors: You might also get some errors while crawling, and it can be interesting to know which URLs generated errors.

  • Filtered pages: Those are pages that were discovered but weren't crawled because they are not a sub-domain of the provided url_list, or happen to be on external domains altogether.

This can simply be done by specifying a file name through the optional custom_settings parameter of crawl:

>>> import advertools as adv
>>> adv.crawl('https://example.com',
...           output_file='example.jl',
...           follow_links=True,
...           custom_settings={'LOG_FILE': 'example.log'})

If you run it this way, all logs will be saved to the file you chose, example.log in this case.

Now, you can use the crawllogs_to_df() function to open the logs in a DataFrame:

>>> import advertools as adv
>>> logs_df = adv.crawllogs_to_df('example.log')
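
A quick first look shows which of the columns described below were actually produced by your crawl. This is only a minimal sketch; the output depends entirely on what happened during the crawl:

>>> logs_df.head()
>>> logs_df.columns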

The DataFrame might contain the following columns:

  • time: The timestamp for the process

  • middleware: The middleware responsible for this process, whether it is the core engine, the scraper, the error handler, and so on.

  • level: The logging level (DEBUG, INFO, etc.)

  • message: A single word summarizing what this row represents, "Crawled", "Scraped", "Filtered", and so on.

  • domain: The domain name of filtered (not crawled) pages, typically URLs outside the current website.

  • method: The HTTP method used in this process (GET, PUT, etc.)

  • url: The URL currently under process.

  • status: HTTP status code, 200, 404, etc.

  • referer: The referring URL, where applicable.

  • method_to: In redirect rows, the HTTP method used to crawl the URL being redirected to.

  • redirect_to: The URL redirected to.

  • method_from: In redirect rows, the HTTP method used to crawl the URL being redirected from.

  • redirect_from: The URL redirected from.

  • blocked_urls: The URLs that were not crawled due to robots.txt rules.
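
Once the logs are in a DataFrame, these columns can be sliced with regular pandas operations. The following is a minimal sketch; the message values come from the list above, and which columns are populated depends on your crawl:

>>> logs_df['message'].value_counts()  # overview: Crawled, Scraped, Filtered, ...
>>> logs_df[logs_df['message'] == 'Filtered'][['domain', 'url']]  # discovered but not crawled
>>> logs_df[logs_df['redirect_from'].notna()][['redirect_from', 'redirect_to', 'status']]  # redirect pairs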

crawllogs_to_df(logs_file_path)

Convert a crawl logs file to a DataFrame.

An interesting option while using the crawl function is to specify a destination file to save the logs of the crawl process itself. This file contains additional information about each crawled, scraped, blocked, or redirected URL.

What you would most likely use this for is to get a list of URLs blocked by robots.txt rules. These can be found under the column blocked_urls. Crawling errors are also interesting, and can be found in rows where message is equal to "error" (see the example below).

>>> import advertools as adv
>>> adv.crawl('https://example.com',
...           output_file='example.jl',
...           follow_links=True,
...           custom_settings={'LOG_FILE': 'example.log'})
>>> logs_df = adv.crawllogs_to_df('example.log')
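
For example, to pull out the blocked URLs and the error rows mentioned above (a sketch, assuming blocked_urls is empty for rows that were not blocked):

>>> logs_df['blocked_urls'].dropna()        # URLs blocked by robots.txt rules
>>> logs_df[logs_df['message'] == 'error']  # rows representing crawl errors
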
Parameters

logs_file_path (str) -- The path to the logs file.

Returns

crawl_logs_df (DataFrame) -- A DataFrame summarizing the logs.