Crawling and Scraping Analysis

After crawling a website, or a list of URLs, you most likely want to analyze the data and gain a better understanding of the website's structure, strategy, and content. You probably also want to check for technical issues the site might have.

This module provides a few ready-made functions to help in analyzing crawl data.


Certain columns in the crawl DataFrame can be analyzed separately and independently, like page size and status codes. They can of course also be analyzed together with other columns, like URL and title, to put their data in context, as in the sketch below.
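
A minimal sketch (assuming the crawl output contains the standard url, title, status, and size columns, and a hypothetical output_file.jl):

>>> import pandas as pd
>>> crawldf = pd.read_json("output_file.jl", lines=True)
>>> crawldf["status"].value_counts()  # distribution of status codes
>>> crawldf[["url", "title", "status", "size"]].sort_values("size", ascending=False).head(10)  # largest pages, in context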

There are also groups of columns that together describe one full aspect of a website, yet are spread across several columns. For example:

Analyzing crawled images

Every crawled URL typically contains multiple images, and every image in turn has multiple attributes (src, alt, width, etc.). The number of images per URL varies, and not all images have the same attributes. So we need a way to unpack all these data points into a tidy (long-form) DataFrame to get an idea of how images (and their attributes) are used and distributed across the crawled URLs.

Once you have read a crawl output file into a DataFrame, you can summarize the images in this DataFrame as follows:

>>> import advertools as adv
>>> import pandas as pd
>>> crawldf = pd.read_json("path/to/output_file.jl", lines=True)
>>> img_df = adv.crawlytics.images(crawldf)
>>> img_df

|   | url | img_src | img_alt | img_loading | img_sizes | img_decoding | img_width | img_height | img_border |
| 0 | https://www.nytimes.com/ | /vi-assets/static-assets/icon-the-morning_144x144-b12a6923b6ad9102b766352261b1a847.webp | The Morning Logo | nan | nan | nan | nan | nan | nan |
| 0 | https://www.nytimes.com/ | /vi-assets/static-assets/icon-the-upshot_144x144-0b1553ff703bbd07ac8fe73e6d215888.webp | The Upshot Logo | nan | nan | nan | nan | nan | nan |
| 0 | https://www.nytimes.com/ | https://static01.nyt.com/images/2017/01/29/podcasts/the-daily-album-art/the-daily-album-art-square320-v5.jpg?quality=75&auto=webp&disable=upscale | The Daily Logo | nan | nan | nan | nan | nan | nan |
| 1 | https://www.nytimes.com/newsletters/morning-briefing-europe | https://static.nytimes.com/email-images/NYT-Newsletters-Europe-Icon-500px.jpg | morning briefing | nan | nan | nan | nan | nan | nan |
| 2 | https://www.nytimes.com/newsletters/australia-letter | https://static.nytimes.com/email-images/NYT-Newsletters-AustraliaLetter-Icon-500px.jpg | australia-letter | nan | nan | nan | nan | nan | nan |
| 3 | https://www.nytimes.com/newsletters/the-interpreter | https://static.nytimes.com/email-images/NYT-Newsletters-SONL-TheInterpreter-Icon-500px.jpg | the interpreter | nan | nan | nan | nan | nan | nan |
| 4 | https://www.nytimes.com/section/world/middleeast | https://static01.nyt.com/images/2024/01/25/multimedia/25israel-1-hbcz/25israel-1-hbcz-thumbWide.jpg?quality=75&auto=webp&disable=upscale | nan | nan | (min-width: 1024px) 205px, 150px | async | 150 | 100 | nan |
| 4 | https://www.nytimes.com/section/world/middleeast | https://static01.nyt.com/images/2024/01/25/multimedia/25israel-hamas-icj-case-explain-wjth/25israel-hamas-icj-case-explain-wjth-thumbWide.jpg?quality=75&auto=webp&disable=upscale | nan | nan | (min-width: 1024px) 205px, 150px | async | 150 | 100 | nan |
| 4 | https://www.nytimes.com/section/world/middleeast | https://static01.nyt.com/images/2024/01/25/multimedia/25israel-hamas-qatar-israel-ctbv/25israel-hamas-qatar-israel-ctbv-thumbWide.jpg?quality=75&auto=webp&disable=upscale | nan | nan | (min-width: 1024px) 205px, 150px | async | 150 | 100 | nan |

As you can see, for every URL all available image attributes are listed, and in many cases those attributes are empty because they were not used for that particular image. Also note that each image is represented on its own row and mapped to the URL on which it was found (the first column). This is why the same URL is repeated: once per image. You can use those URLs to get more data about a URL of interest from the crawl DataFrame.
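
For example, a minimal sketch for joining the image data back to the crawl DataFrame to see each image next to its page's title (assuming the standard url and title columns):

>>> img_with_titles = img_df.merge(crawldf[["url", "title"]], on="url", how="left")
>>> img_with_titles[["url", "title", "img_src", "img_alt"]].head()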

Let's get a quick overview of how the various image attributes are used in this DataFrame. We do this by checking whether each value is notna and taking the average per column.

>>> img_df.notna().mean().sort_values(ascending=False).to_frame().round(2)

| url          | 1    |
| img_src      | 0.99 |
| img_alt      | 0.99 |
| img_width    | 0.86 |
| img_height   | 0.86 |
| img_srcset   | 0.25 |
| img_sizes    | 0.25 |
| img_decoding | 0.25 |
| img_loading  | 0.01 |
| img_border   | 0    |

We can now see that almost all (99%) of our images have src and alt attributes, about 86% have width and height, and so on. This immediately gives us an overview of how images are managed on the site. We can easily estimate the scale of any issues and plan our work accordingly, as in the sketch below.
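
For instance, to list the images that are missing alt text, together with the pages they appear on (using only the columns shown above):

>>> missing_alt = img_df[img_df["img_src"].notna() & img_df["img_alt"].isna()]
>>> missing_alt[["url", "img_src"]]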

Analyzing the redirects of a crawled website

Like images and links, the information about redirects is presented using a group of columns:

>>> redirect_df = adv.crawlytics.redirects(crawldf)
>>> redirect_df

|     | url | status | order | type | download_latency | redirect_times |
|   0 | https://nytimes.com | 301 | 1 | requested | 0.220263 | 1 |
|   0 | https://www.nytimes.com/ | 200 | 2 | crawled | 0.220263 | 1 |
|  26 | https://www.nytimes.com/privacy/privacy-policy | 301 | 1 | requested | 0.079844 | 1 |
|  26 | https://help.nytimes.com/hc/en-us/articles/10940941449492 | 403 | 2 | crawled | 0.079844 | 1 |
| 105 | https://www.nytimes.com/es/privacy/privacy-policy | 301 | 1 | requested | 0.0630789 | 1 |
| 105 | https://help.nytimes.com/hc/en-us/articles/13537530305428 | 403 | 2 | crawled | 0.0630789 | 1 |
| 218 | https://nytimes.com/spotlight/privacy-project-data-protection | 301 | 1 | requested | 0.852014 | 1 |
| 218 | https://www.nytimes.com/spotlight/privacy-project-data-protection | 200 | 2 | crawled | 0.852014 | 1 |
| 225 | https://nytimes.com/spotlight/privacy-project-regulation-solutions | 301 | 1 | requested | 0.732559 | 1 |
| 310 | http://nytimes.com/by/sahil-chinoy | 301 | 1 | requested | 0.435062 | 2 |
| 310 | https://nytimes.com/by/sahil-chinoy | 301 | 2 | intermediate | 0.435062 | 2 |
| 310 | https://www.nytimes.com/by/sahil-chinoy | 200 | 3 | crawled | 0.435062 | 2 |

Here each redirect is represented using a group of columns as well as a group of rows. The columns show the attributes of a redirect (status code, the order of the URL in the redirect chain, the type of the URL in the redirect context, download latency in seconds, and the number of redirects in this specific chain). Since a redirect contains multiple URLs, each of those URLs is represented on its own row. You can use the index of this DataFrame to connect a redirect back to the crawl DataFrame in case you want more context about it, as in the sketch below.
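
A minimal sketch, relying on the fact that the index values of redirect_df point back to rows of crawldf, for pulling more context on chains with more than one redirect:

>>> long_chains = redirect_df[redirect_df["redirect_times"] > 1]
>>> crawldf.loc[long_chains.index.unique(), ["url", "status", "title"]]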

Let's now see what can be done with large crawl files.

Handling very large crawl files

You might often end up crawling a large website, and the crawl output file might be as large as your available memory (or even larger), making it impossible to analyze.

We have several options available to us:

  1. Read a subset of columns

  2. Convert the jsonlines file to parquet

  3. Explore the available column names and their respective data types of a parquet file

The jl_subset function reads only the subset of columns that you want, massively reducing memory consumption. If you only need a small set of columns, you can read just the columns of interest, write them to a new file, and delete the old, large crawl file.

>>> crawl_subset = adv.crawlytics.jl_subset(
...     filepath="/path/to/output_file.jl",
...     columns=[col1, col2, ...],
...     regex=column_regex,
... )

You can use the columns parameter to specify exactly which columns you want. You can also use a regular expression to specify a set of columns. Here are some examples of regular expressions that you might typically want to use:

  • "img_": Get all image columns, including all availabe <img> attributes.

  • "jsonld_": Get all JSON-LD columns.

  • "resp_headers_": Response headers.

  • "request_headers_": Request headers.

  • "h\d": Heading columns, h1..h6.

  • "redirect_": Columns containing redirect information.

  • "links_": Columns containing link information.

An important characteristic of these groups of columns is that you most likely don't know how many there are or what they might include, so a regular expression can save a lot of time.

You can use the columns and regex parameters together, or either one on its own, depending on your needs, as in the sketch below.
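
A minimal sketch (file names are hypothetical, and writing to parquet assumes pyarrow is installed), combining an explicit column list with a regex and saving the much smaller result so the large .jl file can be deleted:

>>> img_subset = adv.crawlytics.jl_subset(
...     filepath="large_crawl.jl",
...     columns=["url", "title"],
...     regex="img_",
... )
>>> img_subset.to_parquet("large_crawl_img.parquet", index=False)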

Compressing large crawl files

Another strategy for dealing with large jsonlines (.jl) files is to convert them to the highly performant .parquet format. You simply provide the path of the existing .jl file and a path for the desired .parquet file:

>>> adv.crawlytics.jl_to_parquet("output_file.jl", "output_file.parquet")

Now you have a much smaller file on disk, and you can use the full power of parquet to efficiently read (and filter) columns and rows. Check the pandas.read_parquet documentation (https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html) for details.
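
For example (assuming pyarrow is installed, and that the crawl contains a status column), you can read just a few columns, and even filter rows while reading:

>>> pd.read_parquet("output_file.parquet", columns=["url", "title", "status"])
>>> pd.read_parquet("output_file.parquet", columns=["url", "status"], filters=[("status", "==", 404)])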

Exploring the columns and data types of parquet files

Another simple function gives us a DataFrame of the available columns in a parquet file. One of the main advantages of using parquet is that you can select which columns you want to read.

>>> adv.crawlytics.parquet_columns("output_file.parquet")  # first 15 columns only

|    | column          | type   |
|  0 | url             | string |
|  1 | title           | string |
|  2 | meta_desc       | string |
|  3 | viewport        | string |
|  4 | charset         | string |
|  5 | h1              | string |
|  6 | h2              | string |
|  7 | h3              | string |
|  8 | canonical       | string |
|  9 | alt_href        | string |
| 10 | alt_hreflang    | string |
| 11 | og:url          | string |
| 12 | og:type         | string |
| 13 | og:title        | string |
| 14 | og:description  | string |

Check how many columns we have of each type.

>>> adv.crawlytics.parquet_columns("nyt_crawl.parquet")["type"].value_counts()

|   | type | count |
| 0 | string | 215 |
| 1 | double | 22 |
| 2 | list<element: string> | 5 |
| 3 | int64 | 4 |
| 4 | list<element: struct<@context: string, @type: string, position: int64, url: string>> | 2 |
| 5 | list<element: struct<@context: string, @type: string, contentUrl: string, creditText: string, url: string>> | 2 |
| 6 | list<element: struct<@context: string, @type: string, caption: string, contentUrl: string, creditText: string, height: int64, url: string, width: int64>> | 1 |
| 7 | timestamp[ns] | 1 |
| 8 | list<element: struct<@context: string, @type: string, item: string, name: string, position: int64>> | 1 |
| 9 | list<element: struct<@context: string, @type: string, name: string, url: string>> | 1 |
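
You can combine parquet_columns with pandas.read_parquet to load only the columns you care about, for example the JSON-LD columns, if the crawl contains any (a sketch reusing the file name above):

>>> col_types = adv.crawlytics.parquet_columns("nyt_crawl.parquet")
>>> jsonld_cols = col_types[col_types["column"].str.contains("jsonld")]["column"].tolist()
>>> jsonld_df = pd.read_parquet("nyt_crawl.parquet", columns=jsonld_cols)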

Module functions

compare(df1, df2, column, keep_equal=False)[source]

Compare common URLs in two crawl DataFrames with respect to column.

There are three main options that you might select for comparison:

  • Numeric column: You get the difference between changed values, both as a numeric difference and as a fraction.

  • String column: You get the values that changed.

  • The "url" column: You get two boolean columns df1 and df2 with True if the respective URL was found in that DataFrame, and False otherwise. This allows for easily checking which URLs were present in both crawls, or only in one of them.

Parameters:
  • df1 (pandas.DataFrame) -- The DataFrame of the first crawl

  • df2 (pandas.DataFrame) -- The DataFrame of the second crawl

  • column (str) -- The name of the column that you want to compare

  • keep_equal (bool, default False) -- Whether or not to keep unchanged values in the result DataFrame

Returns:

comparison_df -- The values will depend on the data type of the selected column; please see above.

Return type:

pandas.DataFrame

Examples

>>> import advertools as adv
>>> import pandas as pd
>>> df1 = pd.read_json("output_file1.jl", lines=True)
>>> df2 = pd.read_json("output_file2.jl", lines=True)
>>> adv.crawlytics.compare(df1, df2, "size")

|   | url | size_x | size_y | diff | diff_perc |
| 0 | https://example.com/page_2 | 299218 | 317541 | 18323 | 0.0612363 |
| 1 | https://example.com/page_5 | 214891 | 208886 | -6005 | -0.0279444 |
| 2 | https://example.com/page_7 | 257442 | 251437 | -6005 | -0.0233256 |
| 3 | https://example.com/page_8 | 230403 | 224398 | -6005 | -0.026063 |
| 4 | https://example.com/page_9 | 222242 | 216237 | -6005 | -0.0270201 |
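
To check which URLs exist in only one of the two crawls, compare the "url" column itself, as described above; the result has the boolean columns df1 and df2:

>>> url_presence = adv.crawlytics.compare(df1, df2, "url")
>>> url_presence[url_presence["df1"] & ~url_presence["df2"]]  # URLs present in the first crawl but not the second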

images(crawldf)[source]

Summarize crawled images from a crawl DataFrame.

Parameters:

crawldf (pandas.DataFrame) -- A crawl DataFrame as a result of the advertools.crawl function.

Returns:

img_summary -- A DataFrame containing all available img tags and their attributes, mapped to their respective URLs, where each image is represented in a separate row.

Return type:

pandas.DataFrame

Examples

>>> import advertools as adv
>>> import pandas as pd
>>> crawldf = pd.read_json("output_file.jl", lines=True)
>>> image_df = adv.crawlytics.images(crawldf)
>>> image_df

|   | url | img_src | img_alt | img_loading | img_sizes | img_decoding | img_width | img_height | img_border |
| 0 | https://www.nytimes.com/ | /vi-assets/static-assets/icon-the-morning_144x144-b12a6923b6ad9102b766352261b1a847.webp | The Morning Logo | nan | nan | nan | nan | nan | nan |
| 0 | https://www.nytimes.com/ | /vi-assets/static-assets/icon-the-upshot_144x144-0b1553ff703bbd07ac8fe73e6d215888.webp | The Upshot Logo | nan | nan | nan | nan | nan | nan |
| 0 | https://www.nytimes.com/ | https://static01.nyt.com/images/2017/01/29/podcasts/the-daily-album-art/the-daily-album-art-square320-v5.jpg?quality=75&auto=webp&disable=upscale | The Daily Logo | nan | nan | nan | nan | nan | nan |
| 1 | https://www.nytimes.com/newsletters/morning-briefing-europe | https://static.nytimes.com/email-images/NYT-Newsletters-Europe-Icon-500px.jpg | morning briefing | nan | nan | nan | nan | nan | nan |
| 2 | https://www.nytimes.com/newsletters/australia-letter | https://static.nytimes.com/email-images/NYT-Newsletters-AustraliaLetter-Icon-500px.jpg | australia-letter | nan | nan | nan | nan | nan | nan |
| 3 | https://www.nytimes.com/newsletters/the-interpreter | https://static.nytimes.com/email-images/NYT-Newsletters-SONL-TheInterpreter-Icon-500px.jpg | the interpreter | nan | nan | nan | nan | nan | nan |
| 4 | https://www.nytimes.com/section/world/middleeast | https://static01.nyt.com/images/2024/01/25/multimedia/25israel-1-hbcz/25israel-1-hbcz-thumbWide.jpg?quality=75&auto=webp&disable=upscale | nan | nan | (min-width: 1024px) 205px, 150px | async | 150 | 100 | nan |
| 4 | https://www.nytimes.com/section/world/middleeast | https://static01.nyt.com/images/2024/01/25/multimedia/25israel-hamas-icj-case-explain-wjth/25israel-hamas-icj-case-explain-wjth-thumbWide.jpg?quality=75&auto=webp&disable=upscale | nan | nan | (min-width: 1024px) 205px, 150px | async | 150 | 100 | nan |
| 4 | https://www.nytimes.com/section/world/middleeast | https://static01.nyt.com/images/2024/01/25/multimedia/25israel-hamas-qatar-israel-ctbv/25israel-hamas-qatar-israel-ctbv-thumbWide.jpg?quality=75&auto=webp&disable=upscale | nan | nan | (min-width: 1024px) 205px, 150px | async | 150 | 100 | nan |

jl_subset(filepath, columns=None, regex=None, chunksize=500)[source]

Read a jl file extracting selected columns and/or columns matching regex.

Parameters:
  • filepath (str) -- The path of the .jl (jsonlines) file to read.

  • columns (list) -- An optional list of column names that you want to read.

  • regex (str) -- An optional regular expression of the pattern of columns to read.

  • chunksize (int) -- How many rows to read per chunk.

Examples

>>> import advertools as adv

Read only the columns "url" and "meta_desc":

>>> adv.crawlytics.jl_subset("output_file.jl", columns=["url", "meta_desc"])

Read columns matching the regex "jsonld":

>>> adv.crawlytics.jl_subset("output_file.jl", regex="jsonld")

Read the columns "url" and "meta_desc" as well as columns matching "jsonld":

>>> adv.crawlytics.jl_subset(
...     "output_file.jl", columns=["url", "meta_desc"], regex="jsonld"
... )
Returns:

df_subset -- A DataFrame containing the selected columns and/or the columns matching regex.

Return type:

pandas.DataFrame

jl_to_parquet(jl_filepath, parquet_filepath)[source]

Convert a jsonlines crawl file to the parquet format.

Parameters:
  • jl_filepath (str) -- The path of an existing .jl file.

  • parquet_filepath (str) -- The path where you want the new file to be saved.

Examples

>>> import advertools as adv
>>> adv.crawlytics.jl_to_parquet("output_file.jl", "output_file.parquet")

links(crawldf, internal_url_regex=None)[source]

Summarize links from a crawl DataFrame.

Parameters:
  • crawldf (DataFrame) -- A DataFrame of a website crawled with advertools.

  • internal_url_regex (str) -- A regular expression for identifying whether a link is internal or not. For example, if your website is example.com, this would be "example.com".

Returns:

link_df

Return type:

pandas.DataFrame

Examples

>>> import advertools as adv
>>> import pandas as pd
>>> crawldf = pd.read_json("output_file.jl", lines=True)
>>> link_df = adv.crawlytics.links(crawldf)
>>> link_df

|   | url | link | text | nofollow | internal |
| 0 | https://www.nytimes.com/ | https://www.nytimes.com/#site-content | Skip to content | False | True |
| 0 | https://www.nytimes.com/ | https://www.nytimes.com/#site-index | Skip to site index | False | True |
| 0 | https://www.nytimes.com/ | https://www.nytimes.com/#after-dfp-ad-top | SKIP ADVERTISEMENT | False | True |
| 1 | https://www.nytimes.com/newsletters/morning-briefing-europe | https://www.nytimes.com/newsletters/morning-briefing-europe#site-content | Skip to content | False | True |
| 1 | https://www.nytimes.com/newsletters/morning-briefing-europe | https://www.nytimes.com/newsletters/morning-briefing-europe#site-index | Skip to site index | False | True |
| 1 | https://www.nytimes.com/newsletters/morning-briefing-europe | https://www.nytimes.com/ | nan | False | True |
| 2 | https://www.nytimes.com/newsletters/australia-letter | https://www.nytimes.com/newsletters/australia-letter#site-content | Skip to content | False | True |
| 2 | https://www.nytimes.com/newsletters/australia-letter | https://www.nytimes.com/newsletters/australia-letter#site-index | Skip to site index | False | True |
| 2 | https://www.nytimes.com/newsletters/australia-letter | https://www.nytimes.com/ | nan | False | True |
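
The internal column is based on the internal_url_regex parameter; a sketch of a call that marks nytimes.com links as internal:

>>> link_df = adv.crawlytics.links(crawldf, internal_url_regex="nytimes.com")
>>> link_df["internal"].mean()  # proportion of internal links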

parquet_columns(filepath)[source]

Get column names and datatypes of a parquet file.

Parameters:

filepath (str) -- The path of the file whose column names and types you want to get.

Returns:

columns_types -- A DataFrame with two columns "column" and "type".

Return type:

pandas.DataFrame
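
Examples

A brief example, mirroring the usage shown earlier (the file name is hypothetical):

>>> import advertools as adv
>>> adv.crawlytics.parquet_columns("output_file.parquet")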

redirects(crawldf)[source]

Create a tidy DataFrame for the redirects in crawldf with the columns:

  • url: All the URLs in the redirect (chain).

  • status: The status code of each URL.

  • type: "requested", "inermediate", or "crawled".

  • order: 1, 2, 3... up to the number of urls in the redirect chain.

  • redirect_times: The number of redirects in the chain (URLs in the chain minus one).

Parameters:

crawldf (pandas.DataFrame) -- A DataFrame of an advertools crawl file

Examples

>>> import advertools as adv
>>> import pandas as pd
>>> crawldf = pd.read_json("output_file.jl", lines=True)
>>> redirect_df = adv.crawlytics.redirects(crawldf)
>>> redirect_df

|     | url | status | order | type | download_latency | redirect_times |
|   0 | https://nytimes.com | 301 | 1 | requested | 0.220263 | 1 |
|   0 | https://www.nytimes.com/ | 200 | 2 | crawled | 0.220263 | 1 |
|  26 | https://www.nytimes.com/privacy/privacy-policy | 301 | 1 | requested | 0.079844 | 1 |
|  26 | https://help.nytimes.com/hc/en-us/articles/10940941449492 | 403 | 2 | crawled | 0.079844 | 1 |
| 105 | https://www.nytimes.com/es/privacy/privacy-policy | 301 | 1 | requested | 0.0630789 | 1 |
| 105 | https://help.nytimes.com/hc/en-us/articles/13537530305428 | 403 | 2 | crawled | 0.0630789 | 1 |

running_crawls()[source]

Get details of currently running spiders.

Get a DataFrame showing the following details:

  • pid: Process ID. Use this to identify (or stop) the spider that you want.

  • started: The time when this spider started.

  • elapsed: The elapsed time since the spider started.

  • %mem: The percentage of memory that this spider is consuming.

  • %cpu: The percentage of CPU that this spider is consuming.

  • command: The command that was used to start this spider. Use this to identify the spider(s) that you want to know about.

  • output_file: The path to the output file for each running crawl job.

  • crawled_urls: The current number of lines in output_file.

Examples

While a crawl is running:

>>> import advertools as adv
>>> adv.crawlytics.running_crawls()

|   | pid | started | elapsed | %mem | %cpu | command | output_file | crawled_urls |
| 0 | 195720 | 21:41:14 | 00:11 | 1.1 | 103 | /opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider /opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py -a url_list=https://cnn.com -a allowed_domains=cnn.com -a follow_links=True -a exclude_url_params=None -a include_url_params=None -a exclude_url_regex=None -a include_url_regex=None -a css_selectors=None -a xpath_selectors=None -o cnn.jl -s CLOSESPIDER_PAGECOUNT=200 | cnn.jl | 30 |

After a few moments:

>>> adv.crawlytics.running_crawls()

|   | pid | started | elapsed | %mem | %cpu | command | output_file | crawled_urls |
| 0 | 195720 | 21:41:14 | 00:27 | 1.2 | 96.7 | /opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider /opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py -a url_list=https://cnn.com -a allowed_domains=cnn.com -a follow_links=True -a exclude_url_params=None -a include_url_params=None -a exclude_url_regex=None -a include_url_regex=None -a css_selectors=None -a xpath_selectors=None -o cnn.jl -s CLOSESPIDER_PAGECOUNT=200 | cnn.jl | 72 |

After starting a new crawl:

>>> adv.crawlytics.running_crawls()

|   | pid | started | elapsed | %mem | %cpu | command | output_file | crawled_urls |
| 0 | 195720 | 21:41:14 | 01:02 | 1.6 | 95.7 | /opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider /opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py -a url_list=https://cnn.com -a allowed_domains=cnn.com -a follow_links=True -a exclude_url_params=None -a include_url_params=None -a exclude_url_regex=None -a include_url_regex=None -a css_selectors=None -a xpath_selectors=None -o cnn.jl -s CLOSESPIDER_PAGECOUNT=200 | cnn.jl | 154 |
| 1 | 195769 | 21:42:09 | 00:07 | 0.4 | 83.8 | /opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider /opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py -a url_list=https://nytimes.com -a allowed_domains=nytimes.com -a follow_links=True -a exclude_url_params=None -a include_url_params=None -a exclude_url_regex=None -a include_url_regex=None -a css_selectors=None -a xpath_selectors=None -o nyt.jl -s CLOSESPIDER_PAGECOUNT=200 | nyt.jl | 17 |
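
If you need to stop one of these spiders, a sketch (not an advertools function) on Unix-like systems is to send it an interrupt signal using its pid; Scrapy treats this as a request to shut down gracefully:

>>> import os
>>> import signal
>>> os.kill(195720, signal.SIGINT)  # pid taken from the running_crawls() output above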