🤖 Analyze and Test robots.txt Files on a Large Scale
Even though they are tiny in size, robots.txt files contain potent instructions that can block major sections of your site, which is exactly what they are supposed to do. The problem is that it is just as easy to block the wrong section by mistake.
It is therefore important to check whether certain pages (or groups of pages) are blocked for a certain user-agent by a certain robots.txt file. Ideally, you would run the same check for all possible user-agents. Even better, you would run the check for a large number of pages, in every possible combination with those user-agents.
To get the robots.txt file into an easily readable format, you can use the robotstxt_to_df() function, which returns it as a DataFrame.
import advertools as adv
amazon = adv.robotstxt_to_df('https://www.amazon.com/robots.txt')
amazon
| | directive | content | etag | robotstxt_last_modified | robotstxt_url | download_date |
|---|---|---|---|---|---|---|
| 0 | User-agent | * | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| 1 | Disallow | /exec/obidos/account-access-login | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| 2 | Disallow | /exec/obidos/change-style | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| 3 | Disallow | /exec/obidos/flex-sign-in | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| 4 | Disallow | /exec/obidos/handle-buy-box | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| ... | ... | ... | ... | ... | ... | ... |
| 146 | Disallow | /hp/video/mystuff | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| 147 | Disallow | /gp/video/profiles | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| 148 | Disallow | /hp/video/profiles | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| 149 | User-agent | EtaoSpider | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
| 150 | Disallow | / | "a850165d925db701988daf7ead7492d3" | 2021-10-28 17:51:39+00:00 | https://www.amazon.com/robots.txt | 2022-02-11 19:33:03.200689+00:00 |
The returned DataFrame contains columns for directives, their content, the URL of the robots.txt file, as well as the date it was downloaded.
directive: The main commands. Allow, Disallow, Sitemap, Crawl-delay, User-agent, and so on.
content: The details of each of the directives.
robotstxt_last_modified: The date when the robots.txt file was last modified, if provided (according to the Last-modified response header).
etag: The entity tag (ETag) of the response, if provided.
robotstxt_url: The URL of the robots.txt file.
download_date: The date and time when the file was downloaded.
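Once you have the file as a DataFrame, a quick way to get a feel for it is to count the directives it contains. A minimal sketch, assuming the amazon DataFrame created above:
# count how many times each directive type appears in the file
amazon['directive'].value_counts()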
Alternatively, you can provide a list of robots.txt URLs if you want to download them all in one go. This might be useful if:
You are analyzing an industry and want to keep an eye on many different websites.
You are analyzing a website with many sub-domains, and want to get all the robots files together.
You are trying to understand a company that has many websites under different domains and sub-domains.
In this case you simply provide a list of URLs instead of a single one.
robots_urls = ['https://www.google.com/robots.txt',
'https://twitter.com/robots.txt',
'https://facebook.com/robots.txt']
googtwfb = adv.robotstxt_to_df(robots_urls)
# How many lines does each robots file have?
googtwfb.groupby('robotstxt_url')['directive'].count()
robotstxt_url
https://facebook.com/robots.txt 541
https://twitter.com/robots.txt 108
https://www.google.com/robots.txt 289
Name: directive, dtype: int64
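To compare how these files are composed, you could also cross-tabulate the directives by file. A small sketch using pandas, assuming the googtwfb DataFrame from above:
import pandas as pd
# count each directive type (Allow, Disallow, Sitemap, ...) per robots.txt file
pd.crosstab(googtwfb['robotstxt_url'], googtwfb['directive'])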
# Display the first five rows of each of the robots files:
googtwfb.groupby('robotstxt_url').head()
| | directive | content | robotstxt_last_modified | robotstxt_url | download_date |
|---|---|---|---|---|---|
| 0 | User-agent | * | 2022-02-07 22:30:00+00:00 | https://www.google.com/robots.txt | 2022-02-11 19:52:13.375724+00:00 |
| 1 | Disallow | /search | 2022-02-07 22:30:00+00:00 | https://www.google.com/robots.txt | 2022-02-11 19:52:13.375724+00:00 |
| 2 | Allow | /search/about | 2022-02-07 22:30:00+00:00 | https://www.google.com/robots.txt | 2022-02-11 19:52:13.375724+00:00 |
| 3 | Allow | /search/static | 2022-02-07 22:30:00+00:00 | https://www.google.com/robots.txt | 2022-02-11 19:52:13.375724+00:00 |
| 4 | Allow | /search/howsearchworks | 2022-02-07 22:30:00+00:00 | https://www.google.com/robots.txt | 2022-02-11 19:52:13.375724+00:00 |
| 289 | comment | Google Search Engine Robot | NaT | https://twitter.com/robots.txt | 2022-02-11 19:52:13.461815+00:00 |
| 290 | comment | ========================== | NaT | https://twitter.com/robots.txt | 2022-02-11 19:52:13.461815+00:00 |
| 291 | User-agent | Googlebot | NaT | https://twitter.com/robots.txt | 2022-02-11 19:52:13.461815+00:00 |
| 292 | Allow | /?_escaped_fragment_ | NaT | https://twitter.com/robots.txt | 2022-02-11 19:52:13.461815+00:00 |
| 293 | Allow | /*?lang= | NaT | https://twitter.com/robots.txt | 2022-02-11 19:52:13.461815+00:00 |
| 397 | comment | Notice: Collection of data on Facebook through automated means is | NaT | https://facebook.com/robots.txt | 2022-02-11 19:52:13.474456+00:00 |
| 398 | comment | prohibited unless you have express written permission from Facebook | NaT | https://facebook.com/robots.txt | 2022-02-11 19:52:13.474456+00:00 |
| 399 | comment | and may only be conducted for the limited purpose contained in said | NaT | https://facebook.com/robots.txt | 2022-02-11 19:52:13.474456+00:00 |
| 400 | comment | permission. | NaT | https://facebook.com/robots.txt | 2022-02-11 19:52:13.474456+00:00 |
| 401 | comment | See: http://www.facebook.com/apps/site_scraping_tos_terms.php | NaT | https://facebook.com/robots.txt | 2022-02-11 19:52:13.474456+00:00 |
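Since the files live in a single DataFrame, you can slice them in other ways too, for example to see which user-agents each file addresses. A short sketch, again assuming the googtwfb DataFrame:
# list the unique user-agents declared in each robots.txt file
(googtwfb
 [googtwfb['directive'] == 'User-agent']
 .groupby('robotstxt_url')['content']
 .unique())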
Bulk robots.txt Tester
This tester is designed to work on a large scale. The robotstxt_test()
function runs a test for a given robots.txt file, checking which of the
provided user-agents can fetch which of the provided URLs, paths, or patterns.
import advertools as adv
adv.robotstxt_test(
robotstxt_url='https://www.amazon.com/robots.txt',
user_agents=['Googlebot', 'baiduspider', 'Bingbot'],
urls=['/', '/hello', '/some-page.html'])
As a result, you get a DataFrame with a row for each combination of (user-agent, URL) indicating whether or not that particular user-agent can fetch the given URL.
Some reasons why you might want to do that:
SEO audits: especially for large websites with many URL patterns and many rules for different user-agents.
Development: a developer or site owner about to make large changes.
Research: interest in the crawling strategies of certain companies.
User-agents
In reality there are only two groups of user-agents that you need to worry about:
User-agents listed in the robots.txt file: For each one of those you need to check whether or not they are blocked from fetching a certain URL (or pattern).
*: The * user-agent includes all other user-agents, so checking the rules that apply to it should take care of the rest.
robots.txt Testing Approach
Get the robots.txt file that you are interested in
Extract the user-agents from it
Specify the URLs you are interested in testing
Run the robotstxt_test() function
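Putting these steps together, a minimal end-to-end sketch might look like the following (the robots.txt URL and paths here are hypothetical placeholders; the rest of this section works through the same steps in detail for Facebook's file):
import advertools as adv

# 1. get the robots.txt file
robots_df = adv.robotstxt_to_df('https://example.com/robots.txt')

# 2. extract the user-agents it mentions
user_agents = (robots_df
               [robots_df['directive'] == 'User-agent']
               ['content'].drop_duplicates().tolist())

# 3. specify the URLs (or paths) you want to test
urls_to_test = ['/', '/some-page', '/some-directory/']

# 4. run the test
test_df = adv.robotstxt_test('https://example.com/robots.txt',
                             user_agents, urls_to_test)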
fb_robots = adv.robotstxt_to_df('https://www.facebook.com/robots.txt')
fb_robots
| | directive | content | robotstxt_url | download_date |
|---|---|---|---|---|
| 0 | comment | Notice: Collection of data on Facebook through automated means is | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| 1 | comment | prohibited unless you have express written permission from Facebook | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| 2 | comment | and may only be conducted for the limited purpose contained in said | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| 3 | comment | permission. | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| 4 | comment | See: http://www.facebook.com/apps/site_scraping_tos_terms.php | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| ... | ... | ... | ... | ... |
| 536 | Allow | /ajax/pagelet/generic.php/PagePostsSectionPagelet | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| 537 | Allow | /careers/ | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| 538 | Allow | /safetycheck/ | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| 539 | User-agent | * | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
| 540 | Disallow | / | https://www.facebook.com/robots.txt | 2022-02-12 00:48:58.951053+00:00 |
Now that we have downloaded the file, we can easily extract the list of user-agents that it contains.
fb_useragents = (fb_robots
[fb_robots['directive']=='User-agent']
['content'].drop_duplicates()
.tolist())
fb_useragents
['Applebot',
'baiduspider',
'Bingbot',
'Discordbot',
'facebookexternalhit',
'Googlebot',
'Googlebot-Image',
'ia_archiver',
'LinkedInBot',
'msnbot',
'Naverbot',
'Pinterestbot',
'seznambot',
'Slurp',
'teoma',
'TelegramBot',
'Twitterbot',
'Yandex',
'Yeti',
'*']
Quite a long list!
As a small and quick test, I'm interested in checking the home page, a random profile page (/bbc), and the groups and hashtag pages.
urls_to_test = ['/', '/bbc', '/groups', '/hashtag/']
fb_test = adv.robotstxt_test('https://www.facebook.com/robots.txt',
                             fb_useragents, urls_to_test)
fb_test
| | robotstxt_url | user_agent | url_path | can_fetch |
|---|---|---|---|---|
| 0 | https://www.facebook.com/robots.txt | * | / | False |
| 1 | https://www.facebook.com/robots.txt | * | /bbc | False |
| 2 | https://www.facebook.com/robots.txt | * | /groups | False |
| 3 | https://www.facebook.com/robots.txt | * | /hashtag/ | False |
| 4 | https://www.facebook.com/robots.txt | Applebot | / | True |
| ... | ... | ... | ... | ... |
| 75 | https://www.facebook.com/robots.txt | seznambot | /hashtag/ | True |
| 76 | https://www.facebook.com/robots.txt | teoma | / | True |
| 77 | https://www.facebook.com/robots.txt | teoma | /bbc | True |
| 78 | https://www.facebook.com/robots.txt | teoma | /groups | True |
| 79 | https://www.facebook.com/robots.txt | teoma | /hashtag/ | True |
For twenty user-agents and four URLs each, we received a total of eighty test results. You can immediately see that all user-agents not explicitly listed (denoted by *) are not allowed to fetch any of the provided URLs.
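For a quick overview of all eighty results at once, you could also pivot the DataFrame into a user-agent by URL matrix. A small sketch using pandas, assuming the fb_test DataFrame from above:
# one row per user-agent, one column per tested path, with True/False for can_fetch
fb_test.pivot(index='user_agent', columns='url_path', values='can_fetch')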
Let's see who is and who is not allowed to fetch the home page.
fb_test.query('url_path == "/"')
| | robotstxt_url | user_agent | url_path | can_fetch |
|---|---|---|---|---|
| 0 | https://www.facebook.com/robots.txt | * | / | False |
| 4 | https://www.facebook.com/robots.txt | Applebot | / | True |
| 8 | https://www.facebook.com/robots.txt | Bingbot | / | True |
| 12 | https://www.facebook.com/robots.txt | Discordbot | / | False |
| 16 | https://www.facebook.com/robots.txt | Googlebot | / | True |
| 20 | https://www.facebook.com/robots.txt | Googlebot-Image | / | True |
| 24 | https://www.facebook.com/robots.txt | LinkedInBot | / | False |
| 28 | https://www.facebook.com/robots.txt | Naverbot | / | True |
| 32 | https://www.facebook.com/robots.txt | Pinterestbot | / | False |
| 36 | https://www.facebook.com/robots.txt | Slurp | / | True |
| 40 | https://www.facebook.com/robots.txt | TelegramBot | / | False |
| 44 | https://www.facebook.com/robots.txt | Twitterbot | / | True |
| 48 | https://www.facebook.com/robots.txt | Yandex | / | True |
| 52 | https://www.facebook.com/robots.txt | Yeti | / | True |
| 56 | https://www.facebook.com/robots.txt | baiduspider | / | True |
| 60 | https://www.facebook.com/robots.txt | facebookexternalhit | / | False |
| 64 | https://www.facebook.com/robots.txt | ia_archiver | / | False |
| 68 | https://www.facebook.com/robots.txt | msnbot | / | True |
| 72 | https://www.facebook.com/robots.txt | seznambot | / | True |
| 76 | https://www.facebook.com/robots.txt | teoma | / | True |
I'll leave it to you to figure out why LinkedIn and Pinterest are not allowed to crawl the home page but Google and Apple are, because I have no clue!
- robotstxt_test(robotstxt_url, user_agents, urls)[source]
Given a robotstxt_url, check which of the user_agents is allowed to fetch which of the urls.
All the combinations of user_agents and urls will be checked and the results returned in one DataFrame.
- Parameters:
robotstxt_url (str) -- The URL of the robots.txt file.
user_agents (str, list) -- One or more user agents.
urls (str, list) -- One or more paths (relative) or URLs (absolute) to check.
- Returns:
robotstxt_test_df -- A DataFrame with the test results for each user-agent/URL combination.
- Return type:
pandas.DataFrame
Examples
>>> robotstxt_test(
...     "https://facebook.com/robots.txt",
...     user_agents=["*", "Googlebot", "Applebot"],
...     urls=["/", "/bbc", "/groups", "/hashtag/"],
... )
                      robotstxt_url user_agent   url_path  can_fetch
0   https://facebook.com/robots.txt          *          /      False
1   https://facebook.com/robots.txt          *       /bbc      False
2   https://facebook.com/robots.txt          *    /groups      False
3   https://facebook.com/robots.txt          *  /hashtag/      False
4   https://facebook.com/robots.txt   Applebot          /       True
5   https://facebook.com/robots.txt   Applebot       /bbc       True
6   https://facebook.com/robots.txt   Applebot    /groups       True
7   https://facebook.com/robots.txt   Applebot  /hashtag/      False
8   https://facebook.com/robots.txt  Googlebot          /       True
9   https://facebook.com/robots.txt  Googlebot       /bbc       True
10  https://facebook.com/robots.txt  Googlebot    /groups       True
11  https://facebook.com/robots.txt  Googlebot  /hashtag/      False
- robotstxt_to_df(robotstxt_url, output_file=None)[source]
Download the contents of robotstxt_url into a DataFrame.
- Parameters:
robotstxt_url (str, list) -- One or more URLs of the robots.txt file(s)
output_file (str) -- Optional file path to save the robots.txt files, mainly useful for downloading > 500 files. The files are appended as soon as they are downloaded. Only the ".jl" extension is supported.
- Returns:
robotstxt_df -- A DataFrame containing directives, their content, the URL and time of download
- Return type:
pandas.DataFrame
Examples
>>> robotstxt_to_df("https://www.twitter.com/robots.txt")
    directive content                        robotstxt_url                     download_date
0  User-agent       *  https://www.twitter.com/robots.txt  2020-09-27 21:57:23.702814+00:00
1    Disallow       /  https://www.twitter.com/robots.txt  2020-09-27 21:57:23.702814+00:00
You can also use it to download multiple robots files by passing a list of URLs.
>>> robotstxt_to_df(
...     ["https://www.google.com/robots.txt", "https://www.twitter.com/robots.txt"]
... )
      directive                             content    robotstxt_last_modified                        robotstxt_url                     download_date
0    User-agent                                   *  2021-01-11 21:00:00+00:00    https://www.google.com/robots.txt  2021-01-16 14:08:50.087985+00:00
1      Disallow                             /search  2021-01-11 21:00:00+00:00    https://www.google.com/robots.txt  2021-01-16 14:08:50.087985+00:00
2         Allow                       /search/about  2021-01-11 21:00:00+00:00    https://www.google.com/robots.txt  2021-01-16 14:08:50.087985+00:00
3         Allow                      /search/static  2021-01-11 21:00:00+00:00    https://www.google.com/robots.txt  2021-01-16 14:08:50.087985+00:00
4         Allow              /search/howsearchworks  2021-01-11 21:00:00+00:00    https://www.google.com/robots.txt  2021-01-16 14:08:50.087985+00:00
283  User-agent                 facebookexternalhit  2021-01-11 21:00:00+00:00    https://www.google.com/robots.txt  2021-01-16 14:08:50.087985+00:00
284       Allow                             /imgres  2021-01-11 21:00:00+00:00    https://www.google.com/robots.txt  2021-01-16 14:08:50.087985+00:00
285     Sitemap  https://www.google.com/sitemap.xml  2021-01-11 21:00:00+00:00    https://www.google.com/robots.txt  2021-01-16 14:08:50.087985+00:00
286  User-agent                                   *                        NaT  https://www.twitter.com/robots.txt  2021-01-16 14:08:50.468588+00:00
287    Disallow                                   /                        NaT  https://www.twitter.com/robots.txt  2021-01-16 14:08:50.468588+00:00
For research purposes and if you want to download more than ~500 files, you might want to use output_file to save results as they are downloaded. The file extension should be ".jl", and robots files are appended to that file as soon as they are downloaded, in case you lose your connection, or maybe your patience!
>>> robotstxt_to_df(
...     [
...         "https://example.com/robots.txt",
...         "https://example.com/robots.txt",
...         "https://example.com/robots.txt",
...     ],
...     output_file="robots_output_file.jl",
... )
To open the file as a DataFrame:
>>> import pandas as pd
>>> robotsfiles_df = pd.read_json("robots_output_file.jl", lines=True)