🕷 Python SEO Crawler / Spider

A customizable crawler to analyze SEO and content of pages and websites.

This is provided by the crawl() function, which is customized for SEO and content analysis and is highly configurable. The crawler uses Scrapy, so you get its performance and speed, as well as its flexibility and customization options.

There are two main approaches to crawl:

  1. Discovery (spider mode): You know the website to crawl, so you provide a url_list (one or more URLs), and you want the crawler to go through the whole website(s) by following all available links.

  2. Pre-determined (list mode): You have a known set of URLs that you want to crawl and analyze, without following links or discovering new URLs.

Discovery Crawling Approach

The simplest way to use the function is to provide a list of one or more URLs and the crawler will go through all of the reachable pages.

>>> import advertools as adv
>>> adv.crawl('https://example.com', 'my_output_file.jl', follow_links=True)

That's it! To open the file:

>>> import pandas as pd
>>> crawl_df = pd.read_json('my_output_file.jl', lines=True)

What this does:

  • Check the site's robots.txt file and get the crawl rules, which means that your crawl will be affected by these rules and the user agent you are using. Check the details below on how to change settings and user agents to control this.

  • Starting with the provided URL(s) go through all links and parse pages.

  • For each URL extract the most important SEO elements.

  • Save them to my_output_file.jl.

  • The column headers of the output file (once you import it as a DataFrame) would be the names of the elements (title, h1, h2, etc.).

Jsonlines is the supported output format because of its flexibility in allowing different values for different scraped pages, and appending independent items to the output files.

Note

When the crawler parses pages it saves the data to the specified file by appending, and not overwriting. Otherwise it would have to store all the data in memory, which might crash your computer. A good practice is to have a separate output_file for every crawl with a descriptive name sitename_crawl_YYYY_MM_DD.jl for example. If you use the same file you will probably get duplicate data in the same file.
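
The naming practice above can be sketched with the standard library. The helper name crawl_output_path and the site name "example" are hypothetical and only serve as an illustration:

```python
from datetime import date


def crawl_output_path(site_name: str, crawl_date: date) -> str:
    """Build a file name such as 'example_crawl_2020_05_21.jl'."""
    return f"{site_name}_crawl_{crawl_date:%Y_%m_%d}.jl"


# One file per crawl: the date in the name keeps separate runs apart.
output_file = crawl_output_path("example", date.today())
print(output_file)
```

The resulting string can then be passed as the output file of the crawl, so that a crawl run on a later day does not append to an older file.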

Extracted On-Page SEO Elements

The names of these elements become the headers (column names) of the output_file.

Element

Remarks

url

The response URL that was actually crawled. This might be different from the requested URL in case of a redirect for example. Please check the redirect_* columns for more information.

title

The <title> tag(s)

viewport

The viewport meta tag if available

charset

The charset meta tag if available

meta_desc

Meta description

canonical

The canonical tag if available

alt_href

The href attribute of rel=alternate tags

alt_hreflang

The language codes of the alternate links

og:*

Open Graph data

twitter:*

Twitter card data

jsonld_*

JSON-LD data if available. In case multiple snippets occur, the respective column names will include a number to distinguish them, jsonld_1_{item_a}, jsonld_1_{item_b}, etc. Note that the first snippet will not contain a number, so the numbering starts with "1", starting from the second snippet. The same applies to OG and Twitter cards.

h1...h6

<h1> through <h6> tag(s), whichever is available

links_url

The URLs of the links on the page

links_text

The link text (anchor text)

links_nofollow

Boolean, whether or not the link is a nofollow link. Note that this only tells if the link itself contains a rel="nofollow" attribute. The page might indicate "nofollow" using meta robots or X-Robots-Tag, which you have to check separately.

nav_links_text

The anchor text of all links in the <nav> tag if available

nav_links_url

The links in the <nav> tag if available

header_links_text

The anchor text of all links in the <header> tag if available

header_links_url

The links in the <header> tag if available

footer_links_text

The anchor text of all links in the <footer> tag if available

footer_links_url

The links in the <footer> tag if available

body_text

The text in the <p>, <span>, and <li> tags within <body>

size

The page size in bytes

download_latency

The amount of time it took to get the page HTML, in seconds.

download_timeout

The amount of time (in secs) that the downloader will wait before timing out. Defaults to 180.

redirect_times

The number of times the page was redirected if available

redirect_ttl

The default maximum number of redirects the crawler allows

redirect_urls

The chain of URLs from the requested URL to the one actually fetched

redirect_reasons

The type of redirection(s) 301, 302, etc.

depth

The depth of the current URL, relative to the first URLs where crawling started. The first pages to be crawled have a depth of zero, pages linked from there, a depth of one, etc.

status

Response status code (200, 404, etc.)

img_*

All available <img> tag attributes. 'alt', 'crossorigin', 'height', 'ismap', 'loading', 'longdesc', 'referrerpolicy', 'sizes', 'src', 'srcset', 'usemap', and 'width' (excluding global HTML attributes like style and draggable)

ip_address

IP address

crawl_time

Date and time the page was crawled

resp_headers_*

All available response headers (last modified, server, etc.)

request_headers_*

All available request headers (user-agent, encoding, etc.)

Note

All elements that may appear multiple times on a page (like heading tags, or images, for example), will be joined with two "@" signs @@. For example, "first H2 tag@@second H2 tag@@third tag" and so on. Once you open the file, you simply have to split by @@ to get the elements as a list.
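
As a minimal sketch of that splitting step, assuming pandas is installed, the example below uses a small hand-made DataFrame in place of a real crawl file. The column name h2_list is made up for the illustration:

```python
import pandas as pd

# Toy stand-in for a crawl DataFrame; real data would come from
# pd.read_json("my_output_file.jl", lines=True).
crawl_df = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/b"],
        "h2": ["first H2 tag@@second H2 tag@@third tag", None],
    }
)

# Split the joined string into a list; missing values stay missing.
crawl_df["h2_list"] = crawl_df["h2"].str.split("@@")

# One row per heading, which is often easier to count or filter.
h2_long = crawl_df[["url", "h2_list"]].explode("h2_list").dropna()
print(h2_long)
```

Keeping the original joined column next to the split one makes it easy to compare both forms while exploring.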

Here is a sample file of a crawl of this site (output truncated for readability):

>>> import pandas as pd
>>> site_crawl = pd.read_json('path/to/file.jl', lines=True)
>>> site_crawl.head()
                               url                           title                       meta_desc                              h1                              h2                              h3                        body_text  size  download_timeout              download_slot  download_latency  redirect_times  redirect_ttl                   redirect_urls redirect_reasons  depth  status                      links_href                      links_text                         img_src                         img_alt    ip_address           crawl_time              resp_headers_date resp_headers_content-type     resp_headers_last-modified resp_headers_vary    resp_headers_x-ms-request-id resp_headers_x-ms-version resp_headers_x-ms-lease-status resp_headers_x-ms-blob-type resp_headers_access-control-allow-origin   resp_headers_x-served resp_headers_x-backend resp_headers_x-rtd-project resp_headers_x-rtd-version         resp_headers_x-rtd-path  resp_headers_x-rtd-domain resp_headers_x-rtd-version-method resp_headers_x-rtd-project-method resp_headers_strict-transport-security resp_headers_cf-cache-status  resp_headers_age           resp_headers_expires resp_headers_cache-control          resp_headers_expect-ct resp_headers_server   resp_headers_cf-ray      resp_headers_cf-request-id          request_headers_accept request_headers_accept-language      request_headers_user-agent request_headers_accept-encoding          request_headers_cookie
0   https://advertools.readthedocs            advertools —  Python  Get productive as an online ma  advertools@@Indices and tables  Online marketing productivity                              NaN   Generate keywords for SEM camp   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN  https://advertools.readthedocs            [302]    NaN     NaN  #@@readme.html@@advertools.kw_  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:35  Thu, 21 May 2020 10:39:35 GMT                 text/html  Wed, 20 May 2020 12:26:23 GMT   Accept-Encoding  720a8581-501e-0043-01a2-2e77d2                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web00007c                 advertools                     master  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:35 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596daca7dbaa7e9e-BUD  02d86a3cea00007e9edb0cf2000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
1   https://advertools.readthedocs            advertools —  Python                             NaN                      advertools         Change Log - advertools  0.9.1 (2020-05-19)@@0.9.0 (202   Ability to specify robots.txt    NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  index.html@@readme.html@@adver  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:36  Thu, 21 May 2020 10:39:35 GMT                 text/html  Wed, 20 May 2020 12:26:23 GMT   Accept-Encoding  4f7bea3b-701e-0039-3f44-2f1d9f                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web00007h                 advertools                     master  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:35 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596daca9bcab7e9e-BUD  02d86a3e0e00007e9edb0d72000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
2   https://advertools.readthedocs            advertools —  Python  Get productive as an online ma  advertools@@Indices and tables  Online marketing productivity                              NaN   Generate keywords for SEM camp   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  #@@readme.html@@advertools.kw_  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:36  Thu, 21 May 2020 10:39:35 GMT                 text/html  Wed, 20 May 2020 12:26:36 GMT   Accept-Encoding  98b729fa-e01e-00bf-24c3-2e494d                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web00007c                 advertools                     latest  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:35 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596daca9bf26d423-BUD  02d86a3e150000d423322742000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
3   https://advertools.readthedocs    advertools package —  Python                             NaN              advertools package     Submodules@@Module contents                             NaN   Top-level package for advertoo   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  index.html@@readme.html@@adver  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:36  Thu, 21 May 2020 10:39:35 GMT                 text/html  Wed, 20 May 2020 12:26:25 GMT   Accept-Encoding  7a28ef3b-801e-00c2-24c3-2ed585                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web000079                 advertools                     master  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:35 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596daca9bddb7ec2-BUD  02d86a3e1300007ec2a808a2000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
4   https://advertools.readthedocs   Python Module Index —  Python                             NaN             Python Module Index                             NaN                             NaN            © Copyright 2020, Eli   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  index.html@@readme.html@@adver  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               _static/minus.png                               -  104.17.32.82  2020-05-21 10:39:36  Thu, 21 May 2020 10:39:35 GMT                 text/html  Wed, 20 May 2020 12:26:23 GMT   Accept-Encoding  75911c9e-201e-00e6-34c3-2e4ccb                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web00007g                 advertools                     master  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:35 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596daca9b91fd437-BUD  02d86a3e140000d437b81532000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
66  https://advertools.readthedocs  advertools.url_builders —  Pyt                             NaN  Source code for advertools.url                             NaN                             NaN            © Copyright 2020, Eli   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  ../../index.html@@../../readme  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:39  Thu, 21 May 2020 10:39:38 GMT                 text/html  Wed, 20 May 2020 12:26:36 GMT   Accept-Encoding  d99f2368-c01e-006f-18c3-2ef5ef                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web00007a                 advertools                     latest  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:38 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596dacbbb8afd437-BUD  02d86a494f0000d437b828b2000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
67  https://advertools.readthedocs  advertools.kw_generate —  Pyth                             NaN  Source code for advertools.kw_                             NaN                             NaN            © Copyright 2020, Eli   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  ../../index.html@@../../readme  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:39  Thu, 21 May 2020 10:39:39 GMT                 text/html  Wed, 20 May 2020 12:26:36 GMT   Accept-Encoding  85855c48-c01e-00ce-13c3-2e3b74                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web00007g                 advertools                     latest  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:39 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596dacbd980bd423-BUD  02d86a4a7f0000d423323b42000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
68  https://advertools.readthedocs  advertools.ad_from_string —  P                             NaN  Source code for advertools.ad_                             NaN                             NaN            © Copyright 2020, Eli   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  ../../index.html@@../../readme  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:39  Thu, 21 May 2020 10:39:39 GMT                 text/html  Wed, 20 May 2020 12:26:36 GMT   Accept-Encoding  b0aef497-801e-004a-1647-2f6d5c                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web00007k                 advertools                     latest  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:39 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596dacbd980cd423-BUD  02d86a4a7f0000d423209db2000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
69  https://advertools.readthedocs  advertools.ad_create —  Python                             NaN  Source code for advertools.ad_                             NaN                             NaN            © Copyright 2020, Eli   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  ../../index.html@@../../readme  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:39  Thu, 21 May 2020 10:39:39 GMT                 text/html  Wed, 20 May 2020 12:26:36 GMT   Accept-Encoding  9dfdd38a-101e-00a1-7ec3-2e93a0                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web00007c                 advertools                     latest  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:39 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596dacbd99847ec2-BUD  02d86a4a7f00007ec2a811f2000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004
70  https://advertools.readthedocs      advertools.emoji —  Python                             NaN  Source code for advertools.emo                             NaN                             NaN            © Copyright 2020, Eli   NaN               NaN  advertools.readthedocs.io               NaN             NaN           NaN                             NaN              NaN    NaN     NaN  ../../index.html@@../../readme  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                             NaN                             NaN  104.17.32.82  2020-05-21 10:39:40  Thu, 21 May 2020 10:39:39 GMT                 text/html  Wed, 20 May 2020 12:26:36 GMT   Accept-Encoding  2ad504a1-101e-000b-03c3-2e454f                2009-09-19                       unlocked                   BlockBlob                                        *  Nginx-Proxito-Sendfile              web000079                 advertools                     latest  /proxito/media/html/advertools  advertools.readthedocs.io                              path                         subdomain         max-age=31536000; includeSubDo                          HIT               NaN  Thu, 21 May 2020 11:39:39 GMT       public, max-age=3600  max-age=604800, report-uri="ht          cloudflare  596dacbd9fb97e9e-BUD  02d86a4a7f00007e9edb13a2000000  text/html,application/xhtml+xm                              en  Mozilla/5.0 (Windows NT 10.0;                    gzip, deflate  __cfduid=d76b68d148ddec1efd004

Pre-Determined Crawling Approach (List Mode)

Sometimes you might have a fixed set of URLs for which you want to scrape and analyze SEO or content performance. Some ideas:

SERP Data

Let's say you just ran serp_goog and got a bunch of top-ranking pages that you would like to analyze, and see how that relates to their SERP ranking.

You simply provide the url_list parameter and again specify the output_file. This will only crawl the specified URLs, and will not follow any links.

Now you have the SERP DataFrame, as well as the crawl output file. All you have to do is to merge them by the URL columns, and end up with a richer dataset.
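
The merge could look roughly like the sketch below, assuming pandas is installed. Both DataFrames are small hand-made stand-ins, and the SERP column names "link" and "rank" are assumptions made for this illustration, so check the actual column names in your own data:

```python
import pandas as pd

# Toy stand-ins; the column names "link" and "rank" are assumed for the SERP data.
serp_df = pd.DataFrame(
    {
        "link": ["https://example.com/a", "https://example.com/b"],
        "rank": [1, 2],
    }
)
crawl_df = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/b"],
        "title": ["Page A", "Page B"],
        "status": [200, 200],
    }
)

# Left join keeps every SERP row, even if a page failed to crawl.
serp_crawl = serp_df.merge(crawl_df, how="left", left_on="link", right_on="url")
print(serp_crawl[["rank", "link", "title", "status"]])
```

Since the url column holds the response URL, pages that were redirected may not match their SERP URL directly. In that case the redirect_urls column might be a better key.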

News Articles

You want to follow the latest news of a certain publication, and you extract their latest news URLs from their news sitemap using sitemap_to_df. You provide those URLs and crawl them only.

Google Analytics / Google Search Console

Since they provide reports for URLs, you can also combine them with the ones crawled and end up with a better perspective. You might be interested in knowing more about high bounce-rate pages, pages that convert well, pages that get less traffic than you think they should and so on. You can simply export those URLs and crawl them.

Any tool that has data about a set of URLs can be used.

Again running the function is as simple as providing a list of URLs, as well as a filepath where you want the result saved.

>>> adv.crawl(url_list, 'output_file.jl', follow_links=False)

The difference between the two approaches is the follow_links parameter. If you keep it as False (the default), the crawler will only go through the provided URLs. Otherwise, it will discover pages by following links on the pages that it crawls. So how do you make sure that the crawler doesn't try to crawl the whole web when follow_links is True? The optional allowed_domains parameter gives you the ability to control this. If you don't specify it, it defaults to the domains in url_list and their sub-domains, if any. Note that you have to set this parameter if you want to crawl only certain sub-domains.
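
To see which domains a given url_list covers before deciding on allowed_domains, a small helper such as the one below may be useful. The function domains_from_urls is hypothetical and written for this illustration. It is not how the crawler derives its default internally:

```python
from urllib.parse import urlsplit


def domains_from_urls(url_list):
    """Return the unique host names of the given URLs, in input order."""
    domains = []
    for url in url_list:
        host = urlsplit(url).netloc
        if host and host not in domains:
            domains.append(host)
    return domains


url_list = [
    "https://example.com/products",
    "https://blog.example.com/post-1",
    "https://example.com/about",
]
allowed = domains_from_urls(url_list)
print(allowed)

# Restricting a crawl to a single sub-domain would then look like this
# (not executed here, since it needs advertools and network access):
# adv.crawl(url_list, "output_file.jl", follow_links=True,
#           allowed_domains=["blog.example.com"])
```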

Custom Extraction with CSS and XPath Selectors

The above approaches are generic, and are useful for exploratory SEO audits and the output is helpful for most cases.

But what if you want to extract special elements that are not included in the default output? This is extremely important, as there are key elements on pages that you need to additionally extract and analyze. Some examples might be tags, social media shares, product price or availability, comments, and pretty much any element on a page that might be of interest to you.

For this you can use two special parameters for CSS and/or XPath selectors. You simply provide a dictionary {'name_1': 'selector_1', 'name_2': 'selector_2'} where the keys become the column names, and the values (selectors) will be used to extract the required elements.

I mostly rely on SelectorGadget which is a really great tool for getting the CSS/XPath selectors of required elements. In some pages it can get really tricky to figure that out. Other resources for learning more about selectors:

Once you have determined the elements that you want to extract and figured out what their names are going to be, you simply pass them as arguments to css_selectors and/or xpath_selectors as dictionaries, as described above.

Let's say you want to extract the links in the sidebar of this page. By default you would get all the links from the page, but you want to put those in the sidebar in a separate column. It seems that the CSS selector for them is .toctree-l1 .internal, and the XPath equivalent is //*[contains(concat( " ", @class, " " ), concat( " ", "toctree-l1", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "internal", " " ))]. Note that this selects the element (the whole link object), which is not typically what you might be interested in.

So with CSS you need to append ::text or ::attr(href) if you want the text of the links or the href attribute respectively. Similarly with XPath, you will need to append /text() or /@href to the selector to get the same.

>>> adv.crawl(
...     "https://advertools.readthedocs.io/en/master/advertools.spider.html",
...     "output_file.jl",
...     css_selectors={
...         "sidebar_links": ".toctree-l1 .internal::text",
...         "sidebar_links_url": ".toctree-l1 .internal::attr(href)",
...     },
... )

Or, instead of css_selectors you can add a similar dictionary for the xpath_selectors argument:

>>> adv.crawl(
...     "https://advertools.readthedocs.io/en/master/advertools.spider.html",
...     "output_file.jl",
...     xpath_selectors={
...         "sidebar_links": '//*[contains(concat( " ", @class, " " ), concat( " ", "toctree-l1", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "internal", " " ))]/text()',
...         "sidebar_links_url": '//*[contains(concat( " ", @class, " " ), concat( " ", "toctree-l1", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "internal", " " ))]/@href',
...     },
... )

Spider Custom Settings and Additional Functionality

In addition to what you can control regarding the items you can extract, you can also customize the behaviour of the spider and set rules for crawling so you can control it even further.

This is provided by the custom_settings parameter. It is optional, and takes a dictionary of settings and their values. Scrapy provides a very large number of settings, and they are all available through this parameter (assuming some conditions for some of the settings).

Here are some examples that you might find interesting:

  • CONCURRENT_REQUESTS_PER_DOMAIN Defaults to 8, and controls the number of simultaneous requests to be performed for each domain. You might want to lower this if you don't want to put too much pressure on the website's server, and you probably don't want to get blocked!

  • DEFAULT_REQUEST_HEADERS You can change this if you need to.

  • DEPTH_LIMIT How deep your crawl will be allowed. The default has no limit.

  • DOWNLOAD_DELAY Similar to the first option. Controls the amount of time in seconds for the crawler to wait between consecutive pages of the same website. It can also take fractions of a second (0.4, 0.75, etc.)

  • LOG_FILE If you want to save your crawl logs to a file, which is strongly recommended, you can provide a path to it here.

  • USER_AGENT If you want to identify yourself differently while crawling. This is affected by the robots.txt rules, so you would be potentially allowed/disallowed from certain pages based on your user-agent.

  • CLOSESPIDER_ERRORCOUNT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT, CLOSESPIDER_TIMEOUT Stop crawling after that many errors, items, pages, or seconds. These can be very useful to limit your crawling in certain cases. I particularly like to use CLOSESPIDER_PAGECOUNT when exploring a new website, and also to make sure that my selectors are working as expected. So for your first few crawls you might set this to five hundred for example and explore the crawled pages. Then when you are confident things are working fine, you can remove this restriction. CLOSESPIDER_ERRORCOUNT can also be very useful while exploring, just in case you get unexpected errors.

The next page contains a number of strategies and recipes for crawling with code examples and explanations.

Usage

A very simple dictionary to be added to your function call:

>>> adv.crawl(
...     "http://example.com",
...     "output_file.jl",
...     custom_settings={
...         "CLOSESPIDER_PAGECOUNT": 100,
...         "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
...         "USER_AGENT": "custom-user-agent",
...     },
... )

Please refer to the spider settings documentation for the full details.
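
The other settings listed above can be combined in the same dictionary. The sketch below uses illustrative values for a careful first exploration of a site. The file names and the run_crawl wrapper are made up for this example, and running the crawl needs advertools installed and network access:

```python
# Hypothetical "polite exploration" settings; the values are illustrative.
custom_settings = {
    "DOWNLOAD_DELAY": 0.5,  # wait half a second between pages of a site
    "DEPTH_LIMIT": 3,  # do not go deeper than three links from the start
    "CLOSESPIDER_PAGECOUNT": 500,  # stop after roughly 500 pages
    "CLOSESPIDER_ERRORCOUNT": 10,  # stop early if errors pile up
    "LOG_FILE": "example_crawl.log",  # keep the crawl logs for debugging
}


def run_crawl():
    """Start the crawl; needs advertools installed and network access."""
    import advertools as adv

    adv.crawl(
        "https://example.com",
        "example_crawl.jl",
        follow_links=True,
        custom_settings=custom_settings,
    )
```

Once the selectors and the output look right, the CLOSESPIDER_* entries can be removed to crawl without those limits.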

crawl(url_list, output_file, follow_links=False, allowed_domains=None, exclude_url_params=None, include_url_params=None, exclude_url_regex=None, include_url_regex=None, css_selectors=None, xpath_selectors=None, custom_settings=None, meta=None)[source]

Crawl a website or a list of URLs based on the supplied options.

Parameters:
  • url_list (url, list) -- One or more URLs to crawl. If follow_links is True, the crawler will start with these URLs and follow all links on pages recursively.

  • output_file (str) -- The path to the output of the crawl. Only jsonlines is supported, to allow for dynamic values. Make sure your file ends with ".jl", e.g. output_file.jl.

  • follow_links (bool) -- Defaults to False. Whether or not to follow links on crawled pages.

  • allowed_domains (list) -- A list of the allowed domains to crawl. This ensures that the crawler does not attempt to crawl the whole web. If not specified, it defaults to the domains of the URLs provided in url_list and all their sub-domains. You can also specify a list of sub-domains, if you want to only crawl those.

  • exclude_url_params (list, bool) -- A list of URL parameters to exclude while following links. If a link contains any of those parameters, don't follow it. Setting it to True will exclude links containing any parameter.

  • include_url_params (list) -- A list of URL parameters to include while following links. If a link contains any of those parameters, follow it. Having the same parameters to include and exclude raises an error.

  • exclude_url_regex (str) -- A regular expression of a URL pattern to exclude while following links. If a link matches the regex don't follow it.

  • include_url_regex (str) -- A regular expression of a URL pattern to include while following links. If a link matches the regex follow it.

  • css_selectors (dict) -- A dictionary mapping names to CSS selectors. The names will become column headers, and the selectors will be used to extract the required data/content.

  • xpath_selectors (dict) -- A dictionary mapping names to XPath selectors. The names will become column headers, and the selectors will be used to extract the required data/content.

  • custom_settings (dict) -- A dictionary of optional custom settings that you might want to add to the spider's functionality. There are over 170 settings for all kinds of options. For details please refer to the spider settings documentation.

  • meta (dict) -- Additional data to pass to the crawler; add arbitrary metadata, set custom request headers per URL, and/or enable some third party plugins.
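
The exclude_url_params rule described above can be sketched with the standard library alone. This is only an illustration of the documented behavior, not the crawler's actual implementation; the helper name is_excluded and the sample links are made up for this example.

```python
from urllib.parse import parse_qs, urlsplit


def is_excluded(url, exclude_url_params):
    """Return True if a link would be skipped under the documented rule.

    Hypothetical helper for illustration; not part of advertools.
    """
    # keep_blank_values=True so that "?page=" still counts as having "page"
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if exclude_url_params is True:
        # True means: skip links that contain any parameter at all
        return bool(params)
    # Otherwise skip links that contain any of the listed parameters
    return any(p in params for p in (exclude_url_params or []))


# Made-up links as they might be found on a crawled page
links = [
    "https://example.com/shoes",
    "https://example.com/shoes?color=red",
    "https://example.com/shoes?page=2&sort=price",
]

# With a list, only links containing a listed parameter are dropped
followed_list = [u for u in links if not is_excluded(u, ["sort"])]

# With True, every link that has a query parameter is dropped
followed_true = [u for u in links if not is_excluded(u, True)]
```

Here followed_list keeps the first two links, while followed_true keeps only the link without a query string.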

Examples

Crawl a website and let the crawler discover as many pages as available:

>>> import advertools as adv
>>> adv.crawl("http://example.com", "output_file.jl", follow_links=True)
>>> import pandas as pd
>>> crawl_df = pd.read_json("output_file.jl", lines=True)
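
A jsonlines file holds one JSON object per line, which is why it is read with lines=True. The sketch below uses only the standard library to show how records with different keys can be appended to, and read back from, such a file. The records and the file name are made up and are not real crawl output.

```python
import json
import os
import tempfile

# Two made-up records with different keys, as can happen when pages differ
records = [
    {"url": "https://example.com/", "title": "Home", "h1": "Welcome"},
    {"url": "https://example.com/a", "title": "A", "h2": "Details"},
]

path = os.path.join(tempfile.mkdtemp(), "sample_output.jl")

# Append one JSON object per line; earlier lines are never rewritten
for record in records:
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

# The union of keys across all lines is what ends up as columns in a table
columns = sorted({key for row in rows for key in row})
```

Because every line is independent, a page that lacks an element simply omits that key, and the missing value only shows up as empty once the lines are combined into a table.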

Crawl a known set of pages (on a single or multiple sites) without following links (just crawl the specified pages) or "list mode":

>>> adv.crawl(
...     [
...         "http://example.com/product",
...         "http://example.com/product2",
...         "https://anotherexample.com",
...         "https://anotherexample.com/hello",
...     ],
...     "output_file.jl",
...     follow_links=False,
... )

Crawl a website, and in addition to standard SEO elements, also get the required CSS selectors. Here we will get three additional columns: price, author, and author_url. Note that for links, and for all other selectors, you need to specify whether you want the text (::text) or an attribute such as href (::attr(href)).

>>> adv.crawl(
...     "http://example.com",
...     "output_file.jl",
...     css_selectors={
...         "price": ".a-color-price::text",
...         "author": ".contributorNameID::text",
...         "author_url": ".contributorNameID::attr(href)",
...     },
... )

Using the meta parameter:

Add custom metadata to the crawl using the meta parameter for tracking/context purposes. If you supply {"purpose": "pre-launch test"}, then you will get a column called "purpose", and all its values will be "pre-launch test" in the crawl DataFrame.

>>> adv.crawl(
...     "https://example.com",
...     "output_file.jl",
...     meta={"purpose": "pre-launch test"},
... )

Or maybe mention which device(s) you crawled with, which is much easier than reading the user-agent string:

>>> adv.crawl(
...     "https://example.com",
...     "output.jsonl",
...     custom_settings={
...         "USER_AGENT": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1"
...     },
...     meta={"device": "Apple iPhone 12 Pro (Safari)"},
... )

Of course you can combine any such metadata any way you want:

>>> {"device": "iphone", "purpose": "initial audit", "crawl_country": "us", ...}

Custom request headers: Supply custom request headers per URL with the special key custom_headers. Its value is a dictionary whose keys are URLs, and each URL's value is a dictionary containing that URL's own custom request headers.

>>> adv.crawl(
...     URL_LIST,
...     OUTPUT_FILE,
...     meta={
...         "custom_headers": {
...             "URL_A": {"HEADER_1": "VALUE_1", "HEADER_2": "VALUE_1"},
...             "URL_B": {"HEADER_1": "VALUE_2", "HEADER_2": "VALUE_2"},
...             "URL_C": {"HEADER_1": "VALUE_3"},
...         }
...     },
... )

OR:

>>> meta = {
...     "custom_headers": {
...         "https://example.com/A": {"If-None-Match": "Etag A"},
...         "https://example.com/B": {
...             "If-None-Match": "Etag B",
...             "User-Agent": "custom UA",
...         },
...         "https://example.com/C": {
...             "If-None-Match": "Etag C",
...             "If-Modified-Since": "Thu, 17 Oct 2024 16:24:00 GMT",
...         },
...     }
... }
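
A dictionary with this structure does not have to be typed by hand. As a minimal sketch, assuming you already have the ETag of each URL from an earlier crawl, it can be assembled in a loop. The names previous_etags and extra_headers, and their values, are hypothetical and used only for this example.

```python
# Hypothetical input: ETags recorded for each URL in an earlier crawl
previous_etags = {
    "https://example.com/A": "Etag A",
    "https://example.com/B": "Etag B",
}

# Optional per-URL additions (also made up for this sketch)
extra_headers = {
    "https://example.com/B": {"User-Agent": "custom UA"},
}

# Build one header dictionary per URL
custom_headers = {}
for url, etag in previous_etags.items():
    headers = {"If-None-Match": etag}
    headers.update(extra_headers.get(url, {}))
    custom_headers[url] = headers

# This is the value to pass as the meta parameter
meta = {"custom_headers": custom_headers}
```

The resulting meta value has the same shape as the hand-written dictionaries above: URLs as keys, each mapped to its own request headers.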

Long lists of request headers: In some cases you might have a very long list of headers, which might raise an "Argument list too long" error. In this case you can provide the path of a Python script that contains a dictionary for the headers. Keep in mind:

  • The dictionary has to be named custom_headers with the same structure mentioned above

  • The file has to be a Python script, having the extension ".py"

  • The script can generate the dictionary programmatically to make it easier to incorporate in various workflows

  • The path to the file can be absolute or relative to where the command is run from.

    >>> meta = {"custom_headers": "my_custom_headers.py"}
    

    OR

    >>> meta = {"custom_headers": "/full/path/to/my_custom_headers.py"}
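
One possible way to produce such a script from a dictionary you already have in memory is to write it out as a Python literal. The sketch below does that and then runs the file to confirm it defines custom_headers. The URLs, header values, and temporary location are placeholders, and the check with runpy is only a sanity test, not a description of how the crawler loads the file.

```python
import os
import pprint
import runpy
import tempfile

# Placeholder URLs and header values, made up for this sketch
custom_headers = {
    f"https://example.com/page-{i}": {"If-None-Match": f"etag-{i}"}
    for i in range(1, 1001)
}

# The file must have the ".py" extension
script_path = os.path.join(tempfile.mkdtemp(), "my_custom_headers.py")

# Write the dictionary as a Python literal assigned to the required name
with open(script_path, "w", encoding="utf-8") as f:
    f.write("custom_headers = " + pprint.pformat(custom_headers) + "\n")

# Sanity check: run the script and confirm it defines the same dictionary
loaded = runpy.run_path(script_path)["custom_headers"]

# Pass the path, rather than the dictionary itself, through meta
meta = {"custom_headers": script_path}
```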
    

Use with third-party plugins like scrapy-playwright: To enable it, set {"playwright": True} in meta, together with any other settings.