🕷 Python Status Code Checker with Response Headers

A mini crawler that only makes HEAD requests to a known list of URLs. It uses Scrapy under the hood, which means you get all its power in a simplified interface for a simple and specific use-case.

The crawl_headers() function can be used to make those requests for various quality assurance and analysis reasons. Because HEAD requests don't download the response body, the crawl is very light on servers and the process is very fast.

The function is straightforward and easy to use. All you need is a list of URLs and a file path where you want to save the output (in .jl format):

import advertools as adv
import pandas as pd

url_list = ['https://advertools.readthedocs.io', 'https://adver.tools',
            'https://www.dashboardom.com', 'https://povertydata.org']
adv.crawl_headers(url_list, 'output_file.jl')
headers_df = pd.read_json('output_file.jl', lines=True)

headers_df

	url	crawl_time	status	download_timeout	download_slot	download_latency	protocol	body	resp_headers_content-length	resp_headers_server	resp_headers_date	resp_headers_content-type	resp_headers_content-encoding	request_headers_accept	request_headers_accept-language	request_headers_user-agent	request_headers_accept-encoding	resp_headers_vary	redirect_times	redirect_ttl	redirect_urls	redirect_reasons	resp_headers_x-amz-id-2	resp_headers_x-amz-request-id	resp_headers_last-modified	resp_headers_etag	resp_headers_x-served	resp_headers_x-backend	resp_headers_x-rtd-project	resp_headers_x-rtd-version	resp_headers_x-rtd-path	resp_headers_x-rtd-domain	resp_headers_x-rtd-version-method	resp_headers_x-rtd-project-method	resp_headers_referrer-policy	resp_headers_permissions-policy	resp_headers_strict-transport-security	resp_headers_cf-cache-status	resp_headers_age	resp_headers_expires	resp_headers_cache-control	resp_headers_expect-ct	resp_headers_cf-ray	resp_headers_alt-svc	resp_headers_via
0	https://adver.tools	2022-02-11 02:32:26	200	180	adver.tools	0.0270483	HTTP/1.1	nan	0	nginx/1.18.0 (Ubuntu)	Fri, 11 Feb 2022 02:32:26 GMT	text/html; charset=utf-8	gzip	text/html,application/xhtml+xml,application/xml;q=0.9,...;q=0.8	en	advertools/0.13.0.rc2	gzip, deflate	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
1	https://povertydata.org	2022-02-11 02:32:26	200	180	povertydata.org	0.06442	HTTP/1.1	nan	13270	nginx/1.18.0 (Ubuntu)	Fri, 11 Feb 2022 02:32:26 GMT	text/html; charset=utf-8	gzip	text/html,application/xhtml+xml,application/xml;q=0.9,...;q=0.8	en	advertools/0.13.0.rc2	gzip, deflate	Accept-Encoding	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
2	https://advertools.readthedocs.io/en/master/	2022-02-11 02:32:26	200	180	advertools.readthedocs.io	0.0271282	HTTP/1.1	nan	0	cloudflare	Fri, 11 Feb 2022 02:32:26 GMT	text/html	gzip	text/html,application/xhtml+xml,application/xml;q=0.9,...;q=0.8	en	advertools/0.13.0.rc2	gzip, deflate	Accept-Encoding	1	19	https://advertools.readthedocs.io	302	rNKT7MYjJ7hcnSvbnZg9qdqizeFfTx9YtZ3/gwNLj8M99yumuCgdd6YTm/iBMO9hrZTAi/iYl50=	EE0DJX6Z511TGX88	Thu, 10 Feb 2022 17:04:27 GMT	W/"14c904a172315a4922f4d28948b916c2"	Nginx-Proxito-Sendfile	web-i-0710e93d610dd8c3e	advertools	master	/proxito/html/advertools/master/index.html	advertools.readthedocs.io	path	subdomain	no-referrer-when-downgrade	interest-cohort=()	max-age=31536000; includeSubDomains; preload	HIT	1083	Fri, 11 Feb 2022 04:32:26 GMT	public, max-age=7200	max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"	6dba2aae6b424107-PRG	h3=":443"; ma=86400, h3-29=":443"; ma=86400	nan
3	https://www.dashboardom.com	2022-02-11 02:32:26	200	180	www.dashboardom.com	0.118614	HTTP/1.1	nan	26837	gunicorn/19.9.0	Fri, 11 Feb 2022 02:32:26 GMT	text/html; charset=utf-8	nan	text/html,application/xhtml+xml,application/xml;q=0.9,...;q=0.8	en	advertools/0.13.0.rc2	gzip, deflate	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	1.1 vegur

You can customize the crawling behavior with the optional custom_settings parameter. Please check the crawl strategies page for tips on how you can do that.
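As a rough sketch of what such customization might look like, the snippet below builds a settings dictionary and passes it to crawl_headers through custom_settings. The setting names are standard Scrapy settings, but the values, the user-agent string, the log file name, and the helper names build_settings and run_crawl are illustrative assumptions rather than recommendations from this page.

```python
def build_settings(delay_seconds=1.0, per_domain=2):
    """Return a dict of Scrapy settings for a gentler crawl (illustrative values)."""
    return {
        "DOWNLOAD_DELAY": delay_seconds,               # seconds to wait between requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": per_domain,  # parallel requests per domain
        "USER_AGENT": "my-status-checker/0.1",         # hypothetical user agent
        "LOG_FILE": "headers_crawl.log",               # hypothetical log file path
    }


def run_crawl(url_list, output_file="output_file.jl"):
    """Run the HEAD crawl with the settings above (needs advertools and network access)."""
    import advertools as adv  # imported lazily so build_settings works on its own

    adv.crawl_headers(url_list, output_file, custom_settings=build_settings())


# Building the dictionary does not start a crawl; call run_crawl(...) for that.
settings = build_settings(delay_seconds=0.5)
```

Keeping the settings in a small helper makes it easy to reuse the same configuration across scheduled runs.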

Here are some of the common reasons for using a HEAD crawler:

Checking status codes: One of the most important maintenance tasks you should be doing continuously. It's very easy to set up an automated script that checks status codes for a few hundred or a few thousand URLs on a periodic basis. You can easily build some rules and alerts based on the status codes you get.
Status codes of page elements: Yes, your page returns a 200 OK status, but what about all the elements/components of the page? Images, links (internal and external), hreflang, canonical, URLs in metatags, script URLs, URLs in various structured data elements like Twitter, OpenGraph, and JSON-LD are some of the most important ones to check as well.
Getting search engine directives: Those directives can be set using meta tags as well as response headers. This crawler gets all available response headers so you can check for search engine-specific ones, like noindex for example.
Getting image sizes: You might want to crawl a list of image URLs and get their metadata. The response header Content-Length contains the length of the page in bytes. With images, it contains the size of the image. This can be an extremely efficient way of analyzing image sizes (and other metadata) without having to download those images, which could consume a lot of bandwidth. Look out for the column resp_headers_content-length.
Getting image types: The resp_headers_content-type column gives you an indication of the type of content of the page (or image when crawling image URLs); text/html, image/jpeg and image/png are some such content types.
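Two of the checks above, status code alerts and search engine directives, could be sketched roughly as follows. The rows are made-up sample data in the same jsonlines shape as the crawl output, the helper names are hypothetical, and the column name resp_headers_x-robots-tag is an assumption based on the resp_headers_ prefix pattern seen in the output above.

```python
import json

# Made-up sample rows, one JSON object per line, mimicking the crawl output file.
SAMPLE_JL = """\
{"url": "https://example.com/a", "status": 200, "resp_headers_content-type": "text/html"}
{"url": "https://example.com/b", "status": 404, "resp_headers_content-type": "text/html"}
{"url": "https://example.com/c", "status": 200, "resp_headers_x-robots-tag": "noindex, nofollow"}
{"url": "https://example.com/d", "status": 503}
"""


def read_rows(jl_text):
    """Parse jsonlines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jl_text.splitlines() if line.strip()]


def status_alerts(rows):
    """Return (url, status) pairs for responses with a 4xx or 5xx status."""
    return [(row["url"], row["status"]) for row in rows if row["status"] >= 400]


def noindex_urls(rows):
    """Return URLs whose X-Robots-Tag response header contains noindex."""
    header = "resp_headers_x-robots-tag"  # assumed column name
    return [row["url"] for row in rows
            if "noindex" in str(row.get(header, "")).lower()]


rows = read_rows(SAMPLE_JL)
alerts = status_alerts(rows)    # URLs that need attention
blocked = noindex_urls(rows)    # URLs excluded from indexing via headers
```

With a real crawl, the text would come from reading the output file instead of SAMPLE_JL.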

class HeadersSpider(*args: Any, **kwargs: Any)

Bases: Spider

custom_settings = {'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 8, 'HTTPERROR_ALLOW_ALL': True, 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'advertools/0.18.0'}

errback(failure)

name = 'headers_spider'

parse(response)

start_requests()

crawl_headers(url_list, output_file, custom_settings=None)

Crawl a list of URLs using the HEAD method.

This function helps in analyzing a set of URLs by getting status codes, download latency, all response headers, and some other metadata about the crawled URLs.

Since the full page is not downloaded, these requests are very light on servers and very fast. You can of course modify the speed through various settings.

Typically, status code checking is an ongoing task that needs to be done and managed. Automated alerts can be easily created based on certain status codes. Another interesting piece of information is the Content-Length response header. This gives you the size of the response body without having to download the whole page. It can also be very interesting with image URLs. Downloading all images can be expensive and time consuming. Being able to get image sizes without having to download them can help a lot in making decisions about optimizing those images. Other fields can also be interesting to analyze, depending on what response headers you get.
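As a minimal sketch of the image size idea, the helper below picks out large images from crawl rows using only the Content-Type and Content-Length headers. The rows, the 100,000-byte threshold, and the function name largest_images are illustrative assumptions; the column names follow the resp_headers_ pattern of the crawl output.

```python
def largest_images(rows, min_bytes=100_000):
    """Return (url, size_in_bytes) for images at or above min_bytes, largest first."""
    found = []
    for row in rows:
        content_type = str(row.get("resp_headers_content-type", ""))
        length = row.get("resp_headers_content-length")
        # Skip non-image responses and responses without a Content-Length header.
        if not content_type.startswith("image/") or length is None:
            continue
        size = int(length)  # the header value may arrive as a string or a number
        if size >= min_bytes:
            found.append((row["url"], size))
    return sorted(found, key=lambda pair: pair[1], reverse=True)


# Made-up rows standing in for records read from the crawl output file.
sample_rows = [
    {"url": "https://example.com/logo.png",
     "resp_headers_content-type": "image/png",
     "resp_headers_content-length": "20480"},
    {"url": "https://example.com/hero.jpg",
     "resp_headers_content-type": "image/jpeg",
     "resp_headers_content-length": 850000},
    {"url": "https://example.com/banner.jpg",
     "resp_headers_content-type": "image/jpeg",
     "resp_headers_content-length": "310000"},
    {"url": "https://example.com/",
     "resp_headers_content-type": "text/html; charset=utf-8",
     "resp_headers_content-length": 999999},
    {"url": "https://example.com/icon.png",
     "resp_headers_content-type": "image/png"},
]

big_images = largest_images(sample_rows)
```

Servers are not required to send Content-Length, so rows without it are skipped rather than treated as zero-sized.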

Parameters:

url_list (str, list) -- One or more URLs to crawl.
output_file (str) -- The path to the output of the crawl. Only jsonlines is supported, to allow for dynamic values. Make sure your file ends with ".jl", e.g. output_file.jl.
custom_settings (dict) -- A dictionary of optional custom settings that you might want to add to the spider's functionality. There are over 170 settings for all kinds of options. For details please refer to the spider settings documentation.

Examples

>>> import advertools as adv
>>> url_list = ['https://example.com/A', 'https://example.com/B',
...             'https://example.com/C', 'https://example.com/D',
...             'https://example.com/E']

>>> adv.crawl_headers(url_list, 'output_file.jl')
>>> import pandas as pd
>>> crawl_df = pd.read_json('output_file.jl', lines=True)