Download, Parse, and Analyze XML Sitemaps
One of the fastest and easiest ways to get insights on a website's content is to simply download its XML sitemap(s).
Sitemaps are also important SEO tools as they reveal a lot of information about the website, and help search engines in indexing those pages. You might want to run an SEO audit and check if the URLs in the sitemap properly correspond to the actual URLs of the site, so this would be an easy way to get them.
Sitemaps basically contain a log of publishing activity, and if they have rich URLs then you can do some good analysis on their content over time as well.
The sitemap_to_df() function is very simple to use, and only requires the URL of a sitemap, a sitemap index, or even a robots.txt file. It goes through the sitemap(s) and returns a DataFrame containing all the tags and their information.
loc: The location of the URLs of the sitemaps.
lastmod: The datetime when each URL was last modified, if available.
sitemap: The URL of the sitemap from which the URL on this row was retrieved.
etag: The entity tag of the response header, if provided.
sitemap_last_modified: The datetime when the sitemap file was last modified, if provided.
sitemap_size_mb: The size of the sitemap in megabytes (1MB = 1,024 x 1,024 bytes).
download_date: The datetime when the sitemap was downloaded.
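To make the mapping from tags to columns concrete, here is a minimal standard-library sketch that turns a small inline sitemap into one row per URL and computes the size in megabytes the same way sitemap_size_mb is defined. It is not how sitemap_to_df() is implemented; the inline XML (including the raw lastmod format) and the helper name sitemap_rows are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny inline sitemap so the example runs offline. The raw <lastmod>
# format is an assumption; the URLs come from the BBC example on this page.
SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.bbc.com/arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2</loc>
    <lastmod>2009-06-20T14:10:48Z</lastmod>
  </url>
  <url>
    <loc>https://www.bbc.com/arabic/middleeast/2009/06/090620_iraq_blast_tc2</loc>
    <lastmod>2009-06-20T21:07:43Z</lastmod>
  </url>
</urlset>"""


def sitemap_rows(xml_bytes):
    """Return one dict per <url> entry, keyed by tag name without namespace."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for url in root:
        # '{namespace}loc' -> 'loc'
        rows.append({child.tag.split('}')[-1]: (child.text or '').strip()
                     for child in url})
    return rows


rows = sitemap_rows(SITEMAP_XML)
# Size in megabytes: bytes divided by 1,024 x 1,024.
size_mb = len(SITEMAP_XML) / (1024 * 1024)
print(rows[0]['loc'], rows[0]['lastmod'], round(size_mb, 6))
```

Each dict corresponds to one row of the resulting table, and any extra tags inside a url entry would simply become extra keys.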
Sitemap Index
Large websites typically have a sitemapindex file, which contains links to all other regular sitemaps that belong to the site. The sitemap_to_df() function retrieves all sub-sitemaps recursively by default.
In some cases, especially with very large sites, it might be better to first
get the sitemap index, explore its structure, and then decide which sitemaps
you want to get, or if you want them all. Even with smaller websites, it still
might be interesting to get the index only and see how it is structured.
This behavior can be modified by the recursive parameter, which is set to True by default. Set it to False if you want only the index file.
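The idea of inspecting an index before downloading everything can be sketched with the standard library alone. The example below tells a sitemap index apart from a regular sitemap by its root tag and lists the sub-sitemaps it links to. The index content and the helper name list_sub_sitemaps are hypothetical, and this is only an illustration of the concept, not the library's code.

```python
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Hypothetical sitemap index used only for illustration.
INDEX_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""


def list_sub_sitemaps(xml_bytes):
    """Return sub-sitemap URLs if this is a sitemap index, else an empty list."""
    root = ET.fromstring(xml_bytes)
    if not root.tag.endswith('sitemapindex'):
        return []
    return [loc.text.strip() for loc in root.findall('sm:sitemap/sm:loc', NS)]


sub_sitemaps = list_sub_sitemaps(INDEX_XML)
# Keep only the sub-sitemaps of interest before downloading anything else.
wanted = [url for url in sub_sitemaps if 'posts' in url]
print(wanted)
```

The selected URLs could then be passed one by one to sitemap_to_df().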
Another interesting thing you might want to do is to provide a robots.txt URL, and set recursive=False to get all available sitemap index files.
>>> sitemap_to_df('https://example.com/robots.txt', recursive=False)
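A robots.txt file advertises sitemaps through lines that start with "Sitemap:". As a rough illustration of what gets discovered from such a file, the sketch below collects those lines from a robots.txt body. The file content and the function name sitemaps_from_robots are made up for the example, and the parsing is a simplification rather than the library's own logic.

```python
def sitemaps_from_robots(robots_txt):
    """Collect the URLs declared in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons are kept.
        key, sep, value = line.partition(':')
        if sep and key.strip().lower() == 'sitemap':
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt body.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap_index.xml
sitemap: https://example.com/news/sitemap.xml
"""

print(sitemaps_from_robots(ROBOTS_TXT))
```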
Let's now go through a quick example of what can be done with sitemaps. We can start by getting one of the BBC's sitemaps.
Regular XML Sitemaps
import advertools as adv
bbc_sitemap = adv.sitemap_to_df('https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml')
bbc_sitemap.head(10)
| | loc | lastmod | sitemap | etag | sitemap_last_modified | sitemap_size_mb | download_date |
|---|---|---|---|---|---|---|---|
| 0 | https://www.bbc.com/arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2 | 2009-06-20 14:10:48+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 1 | https://www.bbc.com/arabic/middleeast/2009/06/090620_iraq_blast_tc2 | 2009-06-20 21:07:43+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 2 | https://www.bbc.com/arabic/business/2009/06/090622_me_worldbank_tc2 | 2009-06-22 12:41:48+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 3 | https://www.bbc.com/arabic/multimedia/2009/06/090624_me_inpictures_brazil_tc2 | 2009-06-24 15:27:24+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 4 | | 2009-06-18 15:32:54+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 5 | https://www.bbc.com/arabic/multimedia/2009/06/090625_sf_tamim_verdict_tc2 | 2009-06-25 09:46:39+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 6 | https://www.bbc.com/arabic/middleeast/2009/06/090623_iz_cairo_russia_tc2 | 2009-06-23 13:10:56+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 7 | https://www.bbc.com/arabic/sports/2009/06/090622_me_egypt_us_tc2 | 2009-06-22 15:37:07+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 8 | https://www.bbc.com/arabic/sports/2009/06/090624_mz_wimbledon_tc2 | 2009-06-24 13:57:18+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
| 9 | https://www.bbc.com/arabic/worldnews/2009/06/090623_mz_leaders_lifespan_tc2 | 2009-06-23 13:24:23+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00 |
print(bbc_sitemap.shape)
print(bbc_sitemap.dtypes)
(49999, 7)
loc object
lastmod datetime64[ns, UTC]
sitemap object
etag object
sitemap_last_modified datetime64[ns, UTC]
sitemap_size_mb float64
download_date datetime64[ns, UTC]
dtype: object
Since lastmod is a datetime object, we can easily use it for various time-related operations.
Here we look at how many articles have been published (last modified) per year.
bbc_sitemap.set_index('lastmod').resample('A')['loc'].count()
lastmod
2008-12-31 00:00:00+00:00 2287
2009-12-31 00:00:00+00:00 47603
2010-12-31 00:00:00+00:00 0
2011-12-31 00:00:00+00:00 0
2012-12-31 00:00:00+00:00 0
2013-12-31 00:00:00+00:00 0
2014-12-31 00:00:00+00:00 0
2015-12-31 00:00:00+00:00 0
2016-12-31 00:00:00+00:00 0
2017-12-31 00:00:00+00:00 0
2018-12-31 00:00:00+00:00 0
2019-12-31 00:00:00+00:00 99
2020-12-31 00:00:00+00:00 10
Freq: A-DEC, Name: loc, dtype: int64
As the majority are in 2009 with a few in other years, it seems these were later updated, but we would have to check to verify (in this special case BBC's URLs contain date information, which can be compared to lastmod to check if there is a difference between them).
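The comparison between the date embedded in the URL and lastmod can be sketched as follows. The two (loc, lastmod) pairs are copied from the first rows of the table above. The helper slug_date is a hypothetical name, and it assumes the slug starts with YYMMDD and that all years fall in 2000-2099.

```python
import re
from datetime import date, datetime

# (loc, lastmod) pairs copied from the first rows of the table above.
rows = [
    ('https://www.bbc.com/arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2',
     '2009-06-20 14:10:48+00:00'),
    ('https://www.bbc.com/arabic/business/2009/06/090622_me_worldbank_tc2',
     '2009-06-22 12:41:48+00:00'),
]


def slug_date(url):
    """Read the YYMMDD prefix of the last path segment (assumes years 2000-2099)."""
    match = re.search(r'/(\d{2})(\d{2})(\d{2})_[^/]*$', url)
    if match is None:
        return None
    yy, mm, dd = (int(part) for part in match.groups())
    return date(2000 + yy, mm, dd)


for loc, lastmod in rows:
    modified = datetime.fromisoformat(lastmod).date()
    published = slug_date(loc)
    # Days between the date in the URL and the last modification date.
    print(loc.rsplit('/', 1)[-1], (modified - published).days)
```

A difference of zero days suggests the page was not modified after publication, while a large difference points to a later update.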
We can take a look at a sample of the URLs to get the URL template that they use.
bbc_sitemap['loc'].sample(10).tolist()
['https://www.bbc.com/russian/rolling_news/2009/06/090628_rn_pakistani_soldiries_ambush',
'https://www.bbc.com/urdu/pakistan/2009/04/090421_mqm_speaks_rza',
'https://www.bbc.com/arabic/middleeast/2009/07/090723_ae_silwan_tc2',
'https://www.bbc.com/portuguese/noticias/2009/07/090729_iraquerefenbritsfn',
'https://www.bbc.com/portuguese/noticias/2009/06/090623_egitomilitaresfn',
'https://www.bbc.com/portuguese/noticias/2009/03/090302_gazaconferenciaml',
'https://www.bbc.com/portuguese/noticias/2009/07/090715_hillary_iran_cq',
'https://www.bbc.com/vietnamese/culture/2009/04/090409_machienhuu_revisiting',
'https://www.bbc.com/portuguese/noticias/2009/05/090524_paquistaoupdateg',
'https://www.bbc.com/arabic/worldnews/2009/06/090629_om_pakistan_report_tc2']
It seems the pattern is
https://www.bbc.com/{language}/{topic}/{YYYY}/{MM}/{YYMMDD_article_title}
This is quite a rich structure, full of useful information. We can analyze the URL structure using the url_to_df function:
url_df = adv.url_to_df(bbc_sitemap['loc'])
url_df
| | url | scheme | netloc | path | query | fragment | dir_1 | dir_2 | dir_3 | dir_4 | dir_5 | dir_6 | dir_7 | last_dir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.bbc.com/arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2 | https | www.bbc.com | /arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2 | | | arabic | middleeast | 2009 | 06 | 090620_as_iraq_explosion_tc2 | nan | nan | 090620_as_iraq_explosion_tc2 |
| 1 | https://www.bbc.com/arabic/middleeast/2009/06/090620_iraq_blast_tc2 | https | www.bbc.com | /arabic/middleeast/2009/06/090620_iraq_blast_tc2 | | | arabic | middleeast | 2009 | 06 | 090620_iraq_blast_tc2 | nan | nan | 090620_iraq_blast_tc2 |
| 2 | https://www.bbc.com/arabic/business/2009/06/090622_me_worldbank_tc2 | https | www.bbc.com | /arabic/business/2009/06/090622_me_worldbank_tc2 | | | arabic | business | 2009 | 06 | 090622_me_worldbank_tc2 | nan | nan | 090622_me_worldbank_tc2 |
| 3 | https://www.bbc.com/arabic/multimedia/2009/06/090624_me_inpictures_brazil_tc2 | https | www.bbc.com | /arabic/multimedia/2009/06/090624_me_inpictures_brazil_tc2 | | | arabic | multimedia | 2009 | 06 | 090624_me_inpictures_brazil_tc2 | nan | nan | 090624_me_inpictures_brazil_tc2 |
| 4 | | https | www.bbc.com | /arabic/business/2009/06/090618_tomtest | | | arabic | business | 2009 | 06 | 090618_tomtest | nan | nan | 090618_tomtest |
| 49994 | https://www.bbc.com/vietnamese/world/2009/08/090831_dalailamataiwan | https | www.bbc.com | /vietnamese/world/2009/08/090831_dalailamataiwan | | | vietnamese | world | 2009 | 08 | 090831_dalailamataiwan | nan | nan | 090831_dalailamataiwan |
| 49995 | https://www.bbc.com/vietnamese/world/2009/09/090901_putin_regret_pact | https | www.bbc.com | /vietnamese/world/2009/09/090901_putin_regret_pact | | | vietnamese | world | 2009 | 09 | 090901_putin_regret_pact | nan | nan | 090901_putin_regret_pact |
| 49996 | https://www.bbc.com/vietnamese/culture/2009/09/090901_tiananmen_movie | https | www.bbc.com | /vietnamese/culture/2009/09/090901_tiananmen_movie | | | vietnamese | culture | 2009 | 09 | 090901_tiananmen_movie | nan | nan | 090901_tiananmen_movie |
| 49997 | https://www.bbc.com/vietnamese/pictures/2009/08/090830_ugc_ddh_sand | https | www.bbc.com | /vietnamese/pictures/2009/08/090830_ugc_ddh_sand | | | vietnamese | pictures | 2009 | 08 | 090830_ugc_ddh_sand | nan | nan | 090830_ugc_ddh_sand |
| 49998 | https://www.bbc.com/vietnamese/business/2009/09/090901_japecontask | https | www.bbc.com | /vietnamese/business/2009/09/090901_japecontask | | | vietnamese | business | 2009 | 09 | 090901_japecontask | nan | nan | 090901_japecontask |
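To see how a single URL maps to the columns in this table, the following standard-library sketch splits one URL into its components and numbered directories. It only approximates the layout shown above and is not the implementation of url_to_df; the helper name split_url is an assumption.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into components plus dir_1..dir_N and last_dir."""
    parts = urlsplit(url)
    dirs = [segment for segment in parts.path.split('/') if segment]
    row = {'url': url, 'scheme': parts.scheme, 'netloc': parts.netloc,
           'path': parts.path, 'query': parts.query, 'fragment': parts.fragment}
    for position, segment in enumerate(dirs, start=1):
        row[f'dir_{position}'] = segment
    # The last directory is repeated in its own column for convenience.
    row['last_dir'] = dirs[-1] if dirs else ''
    return row


row = split_url('https://www.bbc.com/vietnamese/world/2009/08/090831_dalailamataiwan')
print(row['dir_1'], row['dir_2'], row['last_dir'])
```

Because these URLs have five path segments, dir_5 and last_dir hold the same value.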
It seems that dir_1 is where they have the language information, so we can easily count how many articles they have per language:
url_df['dir_1'].value_counts()
russian 14022
persian 10968
portuguese 5403
urdu 5068
mundo 5065
vietnamese 3561
arabic 2984
hindi 1677
turkce 706
ukchina 545
Name: dir_1, dtype: int64
We can also get a subset of articles written in a certain language, and see how many articles they publish per month, week, year, etc.
(bbc_sitemap[bbc_sitemap['loc']
.str.contains('/russian/')]
.set_index('lastmod')
.resample('M')['loc'].count())
lastmod
2009-04-30 00:00:00+00:00 1506
2009-05-31 00:00:00+00:00 2910
2009-06-30 00:00:00+00:00 3021
2009-07-31 00:00:00+00:00 3250
2009-08-31 00:00:00+00:00 2769
...
2019-09-30 00:00:00+00:00 8
2019-10-31 00:00:00+00:00 17
2019-11-30 00:00:00+00:00 11
2019-12-31 00:00:00+00:00 24
2020-01-31 00:00:00+00:00 6
Freq: M, Name: loc, Length: 130, dtype: int64
The topic or category of the article seems to be in dir_2, for which we can do the same and count the values.
url_df['dir_2'].value_counts()[:20]
rolling_news 9044
world 5050
noticias 4224
iran 3682
pakistan 2103
afghanistan 1959
multimedia 1657
internacional 1555
sport 1350
international 1293
india 1285
america_latina 1274
business 1204
cultura_sociedad 913
middleeast 874
worldnews 872
russia 841
radio 769
science 755
football 674
Name: dir_2, dtype: int64
There is much more you can do, and a lot depends on the URL structure, which you have to explore first to find the right operations to run.
For example, we can use the last_dir column, which contains the slugs of the articles: split each slug on underscores, drop the leading date token, explode the resulting lists into a single pd.Series, and count the values. This way we see how many times each word occurred across the article slugs. The same code can also be run after filtering for articles in a particular language to get a more meaningful list of words.
url_df['last_dir'].str.split('_').str[1:].explode().value_counts()[:20]
rn 8808
tc2 3153
iran 1534
video 973
obama 882
us 862
china 815
ir88 727
russia 683
si 640
np 638
afghan 632
ka 565
an 556
iraq 554
pakistan 547
nh 533
cq 520
zs 510
ra 491
Name: last_dir, dtype: int64
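The per-language variant of this count can be sketched with the standard library. The URLs below are copied from the sample printed earlier on this page, and slug_tokens is a hypothetical helper that mirrors the split, drop-the-date, and count steps for one language directory only.

```python
from collections import Counter
from urllib.parse import urlsplit

# A few URLs copied from the sample printed earlier on this page.
urls = [
    'https://www.bbc.com/arabic/middleeast/2009/07/090723_ae_silwan_tc2',
    'https://www.bbc.com/arabic/worldnews/2009/06/090629_om_pakistan_report_tc2',
    'https://www.bbc.com/portuguese/noticias/2009/07/090715_hillary_iran_cq',
    'https://www.bbc.com/urdu/pakistan/2009/04/090421_mqm_speaks_rza',
]


def slug_tokens(urls, language):
    """Count slug tokens (minus the leading date) for one language directory."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split('/') if s]
        if segments and segments[0] == language:
            # Drop the first token, which is the YYMMDD date.
            counts.update(segments[-1].split('_')[1:])
    return counts


arabic = slug_tokens(urls, 'arabic')
print(arabic.most_common(3))
```

Filtering first keeps tokens from other languages out of the counts, which makes recurring suffixes such as tc2 easier to spot.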
This was a quick overview and data preparation for a sample sitemap. Once you are familiar with the sitemap's structure, you can more easily start analyzing the content.
Note
There is currently a bug with tags that contain multiple values in sitemaps. If an image column in a news sitemap contains multiple images, only the last one is retrieved. The same applies to any other sitemap that has a tag with multiple values.
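If all values of a repeated tag are needed, one possible workaround is to parse the affected sitemap directly. The sketch below uses only the standard library and a hypothetical sitemap entry with two images. It is an independent illustration, not a fix to the library, and images_per_url is a made-up helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'image': 'http://www.google.com/schemas/sitemap-image/1.1',
}

# Hypothetical entry with two images, for illustration only.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/article</loc>
    <image:image><image:loc>https://example.com/a.jpg</image:loc></image:image>
    <image:image><image:loc>https://example.com/b.jpg</image:loc></image:image>
  </url>
</urlset>"""


def images_per_url(xml_bytes):
    """Map each <loc> to the list of all its <image:loc> values."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for url in root.findall('sm:url', NS):
        loc = url.findtext('sm:loc', default='', namespaces=NS).strip()
        result[loc] = [img.text.strip()
                       for img in url.findall('image:image/image:loc', NS)]
    return result


print(images_per_url(XML))
```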
News Sitemaps
nyt_news = adv.sitemap_to_df('https://www.nytimes.com/sitemaps/new/news.xml.gz')
print(nyt_news.shape)
# (5085, 16)
nyt_news
| | loc | lastmod | news | news_publication | publication_name | publication_language | news_publication_date | news_title | news_keywords | image | image_loc | sitemap | etag | sitemap_last_modified | sitemap_size_mb | download_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.nytimes.com/interactive/2021/us/ottawa-ohio-covid-cases.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2021-01-27T17:00:00Z | Ottawa County, Ohio Covid Case and Exposure Risk Tracker | Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates | | | | 0cff645fbb74c21791568b78a888967d | 2022-02-12 20:17:31+00:00 | 0.0774069 | 2022-02-12 20:18:39.744247+00:00 |
| 1 | https://www.nytimes.com/interactive/2021/us/hopewell-virginia-covid-cases.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2021-01-27T17:00:00Z | Hopewell, Virginia Covid Case and Exposure Risk Tracker | Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates | | | | 0cff645fbb74c21791568b78a888967d | 2022-02-12 20:17:31+00:00 | 0.0774069 | 2022-02-12 20:18:39.744247+00:00 |
| 2 | https://www.nytimes.com/interactive/2021/us/box-butte-nebraska-covid-cases.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2021-01-27T17:00:00Z | Box Butte County, Nebraska Covid Case and Exposure Risk Tracker | Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates | | | | 0cff645fbb74c21791568b78a888967d | 2022-02-12 20:17:31+00:00 | 0.0774069 | 2022-02-12 20:18:39.744247+00:00 |
| 3 | https://www.nytimes.com/interactive/2021/us/stearns-minnesota-covid-cases.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2021-01-27T17:00:00Z | Stearns County, Minnesota Covid Case and Exposure Risk Tracker | Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates | | | | 0cff645fbb74c21791568b78a888967d | 2022-02-12 20:17:31+00:00 | 0.0774069 | 2022-02-12 20:18:39.744247+00:00 |
| 4 | https://www.nytimes.com/interactive/2021/us/benton-iowa-covid-cases.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2021-01-27T17:00:00Z | Benton County, Iowa Covid Case and Exposure Risk Tracker | Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates | | | | 0cff645fbb74c21791568b78a888967d | 2022-02-12 20:17:31+00:00 | 0.0774069 | 2022-02-12 20:18:39.744247+00:00 |
| 5080 | https://www.nytimes.com/interactive/2021/us/hodgeman-kansas-covid-cases.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2021-01-27T17:00:00Z | Hodgeman County, Kansas Covid Case and Exposure Risk Tracker | Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates | | | | f53301c8286f9bf59ef297f0232dcfc1 | 2022-02-12 20:17:31+00:00 | 0.914107 | 2022-02-12 20:18:39.995323+00:00 |
| 5081 | https://www.nytimes.com/interactive/2021/us/miller-georgia-covid-cases.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2021-01-27T17:00:00Z | Miller County, Georgia Covid Case and Exposure Risk Tracker | Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates | | | | f53301c8286f9bf59ef297f0232dcfc1 | 2022-02-12 20:17:31+00:00 | 0.914107 | 2022-02-12 20:18:39.995323+00:00 |
| 5082 | | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2020-11-03T17:00:00Z | West Virginia First Congressional District Results: David McKinley vs. Natalie Cline | Elections, Presidential Election of 2020, United States, internal-election-open, House of Representatives, West Virginia | | | | f53301c8286f9bf59ef297f0232dcfc1 | 2022-02-12 20:17:31+00:00 | 0.914107 | 2022-02-12 20:18:39.995323+00:00 |
| 5083 | https://www.nytimes.com/interactive/2020/11/03/us/elections/results-maine-senate.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2020-11-03T17:00:00Z | Maine Senate Results: Susan Collins Defeats Sara Gideon | Elections, Presidential Election of 2020, United States, internal-election-open, Senate, Maine | | | | f53301c8286f9bf59ef297f0232dcfc1 | 2022-02-12 20:17:31+00:00 | 0.914107 | 2022-02-12 20:18:39.995323+00:00 |
| 5084 | https://www.nytimes.com/interactive/2021/us/randolph-missouri-covid-cases.html | 2022-02-12 00:00:00+00:00 | | | The New York Times | en | 2021-01-27T17:00:00Z | Randolph County, Missouri Covid Case and Exposure Risk Tracker | Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates | | | | f53301c8286f9bf59ef297f0232dcfc1 | 2022-02-12 20:17:31+00:00 | 0.914107 | 2022-02-12 20:18:39.995323+00:00 |
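The news_keywords column holds several comma-separated keywords per article, so it can be split and counted much like the URL slugs earlier. Here is a small standard-library sketch that uses two values copied from the table above. Splitting on commas is an assumption that would break for keywords that themselves contain a comma.

```python
from collections import Counter

# news_keywords values copied from two rows of the table above.
news_keywords = [
    'Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates',
    'Elections, Presidential Election of 2020, United States, internal-election-open, Senate, Maine',
]

# Split each value on commas, trim whitespace, and count every keyword.
keyword_counts = Counter(
    keyword.strip()
    for value in news_keywords
    for keyword in value.split(',')
)
print(keyword_counts.most_common(1))
```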
Video Sitemaps
wired_video = adv.sitemap_to_df('https://www.wired.com/video/sitemap.xml')
print(wired_video.shape)
# (2955, 14)
wired_video
| | loc | video | video_thumbnail_loc | video_title | video_description | video_content_loc | video_duration | video_publication_date | video_expiration_date | lastmod | sitemap | etag | sitemap_size_mb | download_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | Autocomplete Interview - Owen Wilson Answers The Web’s Most Searched Questions | Owen Wilson takes the WIRED Autocomplete Interview and answers the internet's most searched questions about himself. How did Owen Wilson break his nose? How many movies is he in with Ben Stiller? Is Owen in every Wes Anderson movie? Is he a good skateboarder? Owen answers all these questions and much more! | | 645 | 2022-02-11T17:00:00+00:00 | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 1 | https://www.wired.com/video/watch/wired-news-and-science-samsung-s22 | | | Currents - Samsung S22 Ultra Explained in 3 Minutes | Julian Chokkattu, Reviews Editor for WIRED, walks us through a few of the Samsung S22 Ultra's new features. | | 184 | 2022-02-10T17:00:00+00:00 | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 2 | https://www.wired.com/video/watch/first-look-samsung-galaxy-unpacked-2022 | | | First Look: Samsung Galaxy Unpacked 2022 | Samsung has debuted three new smartphones—the Galaxy S22 Ultra, S22+, S22—and three Android tablets in various sizes at Samsung Unpacked 2022. WIRED's Julian Chokkattu takes a look at the newest features. | | 373 | 2022-02-09T15:00:00+00:00 | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 3 | | | | Reinventing With Data \| WIRED Brand Lab | Produced by WIRED Brand Lab with AWS \| What can the Seattle Seahawks winning strategy teach businesses? Swami Sivasubramanian, VP of AI at Amazon Web Services helps us to understand how the Seattle Seahawks are using data and AI to remain a top performing team in the NFL, and how their process of data capture, storage, and machine learning to gain strategic insights is a model for making better business decision across industries. | | 292 | 2022-02-09T13:00:00+00:00 | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 4 | https://www.wired.com/video/watch/seth-rogen-answers-the-webs-most-searched-questions | | | Autocomplete Interview - Seth Rogen Answers The Web’s Most Searched Questions | "Pam & Tommy" star Seth Rogen takes the WIRED Autocomplete Interview once again and answers the internet's most searched questions about himself. Who does Seth Rogen look like? Does Seth have a podcast? Does he sell pottery? Does he celebrate Christmas? Does he play Call of Duty? Pam & Tommy premieres February 2 on Hulu (finale on March 9) | | 635 | 2022-02-08T17:00:00+00:00 | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 2950 | | nan | nan | nan | nan | nan | nan | nan | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 2951 | | nan | nan | nan | nan | nan | nan | nan | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 2952 | | nan | nan | nan | nan | nan | nan | nan | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 2953 | | nan | nan | nan | nan | nan | nan | nan | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
| 2954 | | nan | nan | nan | nan | nan | nan | nan | nan | NaT | | W/90b11f47f8b2ab57cb180cbd3c6f06f9 | 2.86199 | 2022-02-12 20:24:55.841851+00:00 |
- sitemap_to_df(sitemap_url, max_workers=8, recursive=True)
Retrieve all URLs and other available tags of one or more sitemaps and put them in a DataFrame.
You can also pass the URL of a sitemap index, or a link to a robots.txt file.
- Parameters
sitemap_url (url) -- The URL of a sitemap, either a regular sitemap, a sitemap index, or a link to a robots.txt file. In the case of a sitemap index or robots.txt, the function will go through all the sub sitemaps and retrieve all the included URLs in one DataFrame.
max_workers (int) -- The maximum number of workers to use for threading. The higher the faster, but with high numbers you risk being blocked and/or missing some data as you might appear like an attacker.
recursive (bool) -- Whether or not to follow and import all sub-sitemaps (in case you have a sitemap index), or to only import the given sitemap. This might be useful in case you want to explore what sitemaps are available after which you can decide which ones you are interested in.
- Return sitemap_df
A pandas DataFrame containing all URLs, as well as other tags if available (lastmod, changefreq, priority, or others found in news, video, or image sitemaps).
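To illustrate what a cap such as max_workers means in practice, the sketch below downloads several hypothetical sub-sitemaps with at most two threads running at once. The fetch function is a stand-in that does no network access, and using ThreadPoolExecutor here is an assumption for illustration, not a statement about how the library schedules its downloads.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a real download; returns a fake row so no network is needed."""
    return {'sitemap': url, 'n_chars': len(url)}


# Hypothetical sub-sitemap URLs.
sitemap_urls = [f'https://example.com/sitemap-{i}.xml' for i in range(1, 6)]

# At most two downloads run at the same time; map() keeps the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch, sitemap_urls))

print([row['sitemap'] for row in results])
```

A lower cap sends fewer simultaneous requests to the server, which trades speed for a smaller chance of being blocked.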