Download, Parse, and Analyze XML Sitemaps

One of the fastest and easiest ways to get insights on a website's content is to simply download its XML sitemap(s).

Sitemaps are also important SEO tools: they reveal a lot of information about a website and help search engines index its pages. If you are running an SEO audit and want to check whether the URLs in the sitemap correspond to the actual URLs of the site, downloading the sitemap is an easy way to get them.

Sitemaps basically contain a log of publishing activity, and if they have rich URLs then you can do some good analysis on their content over time as well.

The sitemap_to_df() function is very simple to use, and only requires the URL of a sitemap, a sitemap index, or even a robots.txt file. It goes through the sitemap(s) and returns a DataFrame containing all the tags and their information. The main columns are:

  • loc: The location of the URLs listed in the sitemaps.

  • lastmod: The datetime of the date when each URL was last modified, if available.

  • sitemap: The URL of the sitemap from which the URL on this row was retrieved.

  • etag: The entity tag of the response header, if provided.

  • sitemap_last_modified: The datetime when the sitemap file was last modified, if provided.

  • sitemap_size_mb: The size of the sitemap in megabytes (1 MB = 1,024 x 1,024 bytes).

  • download_date: The datetime when the sitemap was downloaded.
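As a rough illustration of where columns such as loc, lastmod, and sitemap come from, the sketch below parses a tiny, made-up sitemap with Python's standard library. It is not how advertools is implemented; the sample XML, the example.com URLs, and the sitemap_rows helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A tiny, made-up sitemap used only for illustration.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url>"
    "<loc>https://example.com/a</loc>"
    "<lastmod>2021-11-05T20:52:56+00:00</lastmod>"
    "</url>"
    "<url>"
    "<loc>https://example.com/b</loc>"
    "</url>"
    "</urlset>"
)


def sitemap_rows(xml_text, sitemap_url):
    """Return one dict per <url> entry, with None for a missing <lastmod>."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", NS):
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        rows.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": datetime.fromisoformat(lastmod) if lastmod else None,
            "sitemap": sitemap_url,  # which sitemap this row came from
        })
    return rows


rows = sitemap_rows(SAMPLE_SITEMAP, "https://example.com/sitemap.xml")
for row in rows:
    print(row)
```

Each dict corresponds to one row of the resulting table; entries without a lastmod tag simply get a missing value.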

Sitemap Index

Large websites typically have a sitemapindex file, which contains links to all other regular sitemaps that belong to the site. The sitemap_to_df() function retrieves all sub-sitemaps recursively by default. In some cases, especially with very large sites, it might be better to first get the sitemap index, explore its structure, and then decide which sitemaps you want to get, or whether you want them all. Even with smaller websites, it can still be interesting to get only the index and see how it is structured.

This behavior can be modified by the recursive parameter, which is set to True by default. Set it to False if you want only the index file.
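To make the difference concrete, here is a minimal sketch of what following (or not following) a sitemap index could look like. It assumes a hypothetical in-memory dictionary of documents instead of real HTTP requests, and the collect_locs helper is invented for this example; it only mirrors the idea behind the recursive parameter, not the library's actual code.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical in-memory "site": URL -> XML text, standing in for HTTP responses.
DOCS = {
    "https://example.com/sitemap_index.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/posts.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/posts.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/post-1</loc></url>"
        "<url><loc>https://example.com/post-2</loc></url>"
        "</urlset>"
    ),
    "https://example.com/pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    ),
}


def collect_locs(url, recursive=True):
    """Return the <loc> values under `url`, following an index if asked to."""
    root = ET.fromstring(DOCS[url])
    if root.tag.endswith("sitemapindex"):
        children = [el.text for el in root.findall("sm:sitemap/sm:loc", NS)]
        if not recursive:
            return children  # only the sub-sitemap URLs listed in the index
        locs = []
        for child in children:
            locs.extend(collect_locs(child, recursive=True))
        return locs
    # A regular sitemap: return the page URLs it lists.
    return [el.text for el in root.findall("sm:url/sm:loc", NS)]


index_only = collect_locs("https://example.com/sitemap_index.xml", recursive=False)
all_urls = collect_locs("https://example.com/sitemap_index.xml")
print(index_only)
print(all_urls)
```

With recursive=False only the two sub-sitemap URLs come back, which is usually enough to decide which of them are worth downloading.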

Another interesting thing you might want to do is to provide a robots.txt URL, and set recursive=False to get all available sitemap index files.

>>> sitemap_to_df("https://example.com/robots.txt", recursive=False)
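For context, sitemaps are declared in robots.txt on lines that start with "Sitemap:". The following sketch shows one way such lines could be extracted from the text of a robots.txt file; the sample content and the sitemaps_from_robots helper are made up for illustration and are not part of advertools.

```python
def sitemaps_from_robots(robots_text):
    """Return the URLs declared on `Sitemap:` lines of a robots.txt body."""
    urls = []
    for line in robots_text.splitlines():
        # Split on the first colon only, so the URL's own colons are kept.
        field, _, value = line.partition(":")
        # Field names are matched case-insensitively.
        if field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Made-up robots.txt content.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap_index.xml
sitemap: https://example.com/news_sitemap.xml
"""

found = sitemaps_from_robots(SAMPLE_ROBOTS)
print(found)
```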

Let's now go through a quick example of what can be done with sitemaps. We can start by getting one of the BBC's sitemaps.

Regular XML Sitemaps

import advertools as adv

bbc_sitemap = adv.sitemap_to_df('https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml')
bbc_sitemap.head(10)

 | loc | lastmod | sitemap | etag | sitemap_last_modified | sitemap_size_mb | download_date
0 | https://www.bbc.com/arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2 | 2009-06-20 14:10:48+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
1 | https://www.bbc.com/arabic/middleeast/2009/06/090620_iraq_blast_tc2 | 2009-06-20 21:07:43+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
2 | https://www.bbc.com/arabic/business/2009/06/090622_me_worldbank_tc2 | 2009-06-22 12:41:48+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
3 | https://www.bbc.com/arabic/multimedia/2009/06/090624_me_inpictures_brazil_tc2 | 2009-06-24 15:27:24+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
4 | https://www.bbc.com/arabic/business/2009/06/090618_tomtest | 2009-06-18 15:32:54+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
5 | https://www.bbc.com/arabic/multimedia/2009/06/090625_sf_tamim_verdict_tc2 | 2009-06-25 09:46:39+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
6 | https://www.bbc.com/arabic/middleeast/2009/06/090623_iz_cairo_russia_tc2 | 2009-06-23 13:10:56+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
7 | https://www.bbc.com/arabic/sports/2009/06/090622_me_egypt_us_tc2 | 2009-06-22 15:37:07+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
8 | https://www.bbc.com/arabic/sports/2009/06/090624_mz_wimbledon_tc2 | 2009-06-24 13:57:18+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00
9 | https://www.bbc.com/arabic/worldnews/2009/06/090623_mz_leaders_lifespan_tc2 | 2009-06-23 13:24:23+00:00 | https://www.bbc.com/sitemaps/https-sitemap-com-archive-1.xml | e7e15811c65f406f89f89fe10aef29f5 | 2021-11-05 20:52:56+00:00 | 7.63124 | 2022-02-12 01:37:39.461037+00:00

print(bbc_sitemap.shape)
print(bbc_sitemap.dtypes)
(49999, 7)

loc                                   object
lastmod                  datetime64[ns, UTC]
sitemap                               object
etag                                  object
sitemap_last_modified    datetime64[ns, UTC]
sitemap_size_mb                      float64
download_date            datetime64[ns, UTC]
dtype: object

Since lastmod is a datetime object, we can easily use it for various time-related operations. Here we look at how many articles have been published (last modified) per year.

bbc_sitemap.set_index('lastmod').resample('A')['loc'].count()
lastmod
2008-12-31 00:00:00+00:00     2287
2009-12-31 00:00:00+00:00    47603
2010-12-31 00:00:00+00:00        0
2011-12-31 00:00:00+00:00        0
2012-12-31 00:00:00+00:00        0
2013-12-31 00:00:00+00:00        0
2014-12-31 00:00:00+00:00        0
2015-12-31 00:00:00+00:00        0
2016-12-31 00:00:00+00:00        0
2017-12-31 00:00:00+00:00        0
2018-12-31 00:00:00+00:00        0
2019-12-31 00:00:00+00:00       99
2020-12-31 00:00:00+00:00       10
Freq: A-DEC, Name: loc, dtype: int64

Most URLs were last modified in 2009, with a few in 2008, 2019, and 2020, which suggests that some articles were updated after they were published. We would have to check to verify this. In this particular case, BBC's URLs contain date information, which can be compared to lastmod to see whether the two differ.
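One possible way to run that check is sketched below. It assumes the date appears as a six-digit YYMMDD token at the start of the slug, and it uses a small made-up DataFrame (the second URL and both new column names are hypothetical) rather than the full sitemap.

```python
import pandas as pd

# Made-up rows shaped like the sitemap DataFrame (only the columns needed here).
sample = pd.DataFrame({
    "loc": [
        "https://www.bbc.com/arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2",
        "https://www.bbc.com/russian/rolling_news/2009/06/090628_rn_example",
    ],
    "lastmod": pd.to_datetime(
        ["2009-06-20 14:10:48+00:00", "2019-10-01 08:00:00+00:00"], utc=True
    ),
})

# Pull the YYMMDD token that starts the slug and parse it as a UTC date.
url_date = pd.to_datetime(
    sample["loc"].str.extract(r"/(\d{6})_", expand=False),
    format="%y%m%d",
    utc=True,
)

# Whole days between the date in the URL and the reported last modification.
sample["days_after_publishing"] = (
    sample["lastmod"].dt.normalize() - url_date
).dt.days
sample["updated_later"] = sample["days_after_publishing"] > 0
print(sample[["days_after_publishing", "updated_later"]])
```

Rows where updated_later is True are candidates for articles that were modified after their original publication date.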

We can take a look at a sample of the URLs to get the URL template that they use.

bbc_sitemap['loc'].sample(10).tolist()
['https://www.bbc.com/russian/rolling_news/2009/06/090628_rn_pakistani_soldiries_ambush',
'https://www.bbc.com/urdu/pakistan/2009/04/090421_mqm_speaks_rza',
'https://www.bbc.com/arabic/middleeast/2009/07/090723_ae_silwan_tc2',
'https://www.bbc.com/portuguese/noticias/2009/07/090729_iraquerefenbritsfn',
'https://www.bbc.com/portuguese/noticias/2009/06/090623_egitomilitaresfn',
'https://www.bbc.com/portuguese/noticias/2009/03/090302_gazaconferenciaml',
'https://www.bbc.com/portuguese/noticias/2009/07/090715_hillary_iran_cq',
'https://www.bbc.com/vietnamese/culture/2009/04/090409_machienhuu_revisiting',
'https://www.bbc.com/portuguese/noticias/2009/05/090524_paquistaoupdateg',
'https://www.bbc.com/arabic/worldnews/2009/06/090629_om_pakistan_report_tc2']

It seems the pattern is

https://www.bbc.com/{language}/{topic}/{YYYY}/{MM}/{YYMMDD_article_title}

This is quite a rich structure, full of useful information. We can analyze the URL structure using the url_to_df function:

url_df = adv.url_to_df(bbc_sitemap['loc'])
url_df

 | url | scheme | netloc | path | query | fragment | dir_1 | dir_2 | dir_3 | dir_4 | dir_5 | dir_6 | dir_7 | last_dir
0 | https://www.bbc.com/arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2 | https | www.bbc.com | /arabic/middleeast/2009/06/090620_as_iraq_explosion_tc2 |  |  | arabic | middleeast | 2009 | 06 | 090620_as_iraq_explosion_tc2 | nan | nan | 090620_as_iraq_explosion_tc2
1 | https://www.bbc.com/arabic/middleeast/2009/06/090620_iraq_blast_tc2 | https | www.bbc.com | /arabic/middleeast/2009/06/090620_iraq_blast_tc2 |  |  | arabic | middleeast | 2009 | 06 | 090620_iraq_blast_tc2 | nan | nan | 090620_iraq_blast_tc2
2 | https://www.bbc.com/arabic/business/2009/06/090622_me_worldbank_tc2 | https | www.bbc.com | /arabic/business/2009/06/090622_me_worldbank_tc2 |  |  | arabic | business | 2009 | 06 | 090622_me_worldbank_tc2 | nan | nan | 090622_me_worldbank_tc2
3 | https://www.bbc.com/arabic/multimedia/2009/06/090624_me_inpictures_brazil_tc2 | https | www.bbc.com | /arabic/multimedia/2009/06/090624_me_inpictures_brazil_tc2 |  |  | arabic | multimedia | 2009 | 06 | 090624_me_inpictures_brazil_tc2 | nan | nan | 090624_me_inpictures_brazil_tc2
4 | https://www.bbc.com/arabic/business/2009/06/090618_tomtest | https | www.bbc.com | /arabic/business/2009/06/090618_tomtest |  |  | arabic | business | 2009 | 06 | 090618_tomtest | nan | nan | 090618_tomtest
49994 | https://www.bbc.com/vietnamese/world/2009/08/090831_dalailamataiwan | https | www.bbc.com | /vietnamese/world/2009/08/090831_dalailamataiwan |  |  | vietnamese | world | 2009 | 08 | 090831_dalailamataiwan | nan | nan | 090831_dalailamataiwan
49995 | https://www.bbc.com/vietnamese/world/2009/09/090901_putin_regret_pact | https | www.bbc.com | /vietnamese/world/2009/09/090901_putin_regret_pact |  |  | vietnamese | world | 2009 | 09 | 090901_putin_regret_pact | nan | nan | 090901_putin_regret_pact
49996 | https://www.bbc.com/vietnamese/culture/2009/09/090901_tiananmen_movie | https | www.bbc.com | /vietnamese/culture/2009/09/090901_tiananmen_movie |  |  | vietnamese | culture | 2009 | 09 | 090901_tiananmen_movie | nan | nan | 090901_tiananmen_movie
49997 | https://www.bbc.com/vietnamese/pictures/2009/08/090830_ugc_ddh_sand | https | www.bbc.com | /vietnamese/pictures/2009/08/090830_ugc_ddh_sand |  |  | vietnamese | pictures | 2009 | 08 | 090830_ugc_ddh_sand | nan | nan | 090830_ugc_ddh_sand
49998 | https://www.bbc.com/vietnamese/business/2009/09/090901_japecontask | https | www.bbc.com | /vietnamese/business/2009/09/090901_japecontask |  |  | vietnamese | business | 2009 | 09 | 090901_japecontask | nan | nan | 090901_japecontask
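To give an idea of where columns such as dir_1, dir_2, and last_dir come from, here is a simplified sketch that splits a single URL with the standard library. It is an approximation for illustration, not the url_to_df implementation, and the split_url helper is a made-up name.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into parts plus numbered path directories (dir_1, dir_2, ...)."""
    parts = urlsplit(url)
    # Non-empty path segments, in order.
    dirs = [segment for segment in parts.path.split("/") if segment]
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    for position, segment in enumerate(dirs, start=1):
        row[f"dir_{position}"] = segment
    row["last_dir"] = dirs[-1] if dirs else None
    return row


row = split_url("https://www.bbc.com/arabic/business/2009/06/090618_tomtest")
print(row)
```

URLs with fewer directories than the deepest one in the list end up with missing values in the higher-numbered dir_ columns, which is why dir_6 and dir_7 show nan above.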

It seems that dir_1 holds the language information, so we can easily count how many articles they have per language:

url_df['dir_1'].value_counts()
russian       14022
persian       10968
portuguese     5403
urdu           5068
mundo          5065
vietnamese     3561
arabic         2984
hindi          1677
turkce          706
ukchina         545
Name: dir_1, dtype: int64

We can also get a subset of articles written in a certain language, and see how many articles they publish per month, week, year, etc.

(bbc_sitemap[bbc_sitemap['loc']
 .str.contains('/russian/')]
 .set_index('lastmod')
 .resample('M')['loc'].count())
lastmod
2009-04-30 00:00:00+00:00    1506
2009-05-31 00:00:00+00:00    2910
2009-06-30 00:00:00+00:00    3021
2009-07-31 00:00:00+00:00    3250
2009-08-31 00:00:00+00:00    2769
                             ...
2019-09-30 00:00:00+00:00       8
2019-10-31 00:00:00+00:00      17
2019-11-30 00:00:00+00:00      11
2019-12-31 00:00:00+00:00      24
2020-01-31 00:00:00+00:00       6
Freq: M, Name: loc, Length: 130, dtype: int64

The topic or category of the article seems to be in dir_2, so we can count its values in the same way.

url_df['dir_2'].value_counts()[:20]
rolling_news        9044
world               5050
noticias            4224
iran                3682
pakistan            2103
afghanistan         1959
multimedia          1657
internacional       1555
sport               1350
international       1293
india               1285
america_latina      1274
business            1204
cultura_sociedad     913
middleeast           874
worldnews            872
russia               841
radio                769
science              755
football             674
Name: dir_2, dtype: int64
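The two columns can also be combined. The sketch below counts articles per language and topic pair on a small made-up DataFrame that stands in for url_df; the sample values are invented and only the column names come from the output above.

```python
import pandas as pd

# Made-up stand-in for url_df with only the two directory columns used here.
sample_dirs = pd.DataFrame({
    "dir_1": ["russian", "russian", "russian", "persian", "persian"],
    "dir_2": ["rolling_news", "rolling_news", "russia", "iran", "world"],
})

# Count articles for every (language, topic) pair, largest first.
topic_counts = (
    sample_dirs.groupby(["dir_1", "dir_2"])
    .size()
    .sort_values(ascending=False)
)
print(topic_counts)
```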

There is much more you can do, and a lot depends on the URL structure, which you have to explore in order to choose the right operations.

For example, we can take the last_dir column, which contains the slugs of the articles, split each slug on underscores, drop the leading date token, explode the resulting lists into one word per row, and count the values. This way we see how many times each word occurred across the article slugs. The same code can also be run after filtering for articles in a particular language to get a more meaningful list of words.

url_df['last_dir'].str.split('_').str[1:].explode().value_counts()[:20]
rn          8808
tc2         3153
iran        1534
video        973
obama        882
us           862
china        815
ir88         727
russia       683
si           640
np           638
afghan       632
ka           565
an           556
iraq         554
pakistan     547
nh           533
cq           520
zs           510
ra           491
Name: last_dir, dtype: int64
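Filtering by language first could look like the following sketch. It runs on a small made-up DataFrame standing in for url_df (the slugs are invented, following the YYMMDD_words pattern), and keeps only rows whose dir_1 is "russian" before counting.

```python
import pandas as pd

# Made-up stand-in for url_df; slugs follow the YYMMDD_words pattern.
sample_slugs = pd.DataFrame({
    "dir_1": ["russian", "russian", "arabic"],
    "last_dir": ["090628_rn_iran_vote", "090629_rn_iran", "090620_iraq_blast_tc2"],
})

russian_words = (
    sample_slugs.loc[sample_slugs["dir_1"].eq("russian"), "last_dir"]
    .str.split("_")
    .str[1:]      # drop the leading date token
    .explode()    # one word per row
    .value_counts()
)
print(russian_words)
```

Short tokens such as "rn" or "tc2" look like internal codes rather than words, so they may need to be dropped before interpreting the counts.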

This was a quick overview and data preparation for a sample sitemap. Once you are familiar with the sitemap's structure, you can more easily start analyzing the content.

Note

There is currently a bug with tags that contain multiple values in sitemaps. If an image column in a news sitemap contains multiple images, only the last one is retrieved. The same applies to any other sitemap that has a tag with multiple values.
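If all values are needed, one possible workaround is to parse the affected sitemap directly. The sketch below is independent of advertools: it uses the standard library on a made-up sitemap entry with two images, and the image_locs_by_url helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

# Made-up sitemap entry carrying two images.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
    'xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">'
    "<url><loc>https://example.com/story</loc>"
    "<image:image><image:loc>https://example.com/1.png</image:loc></image:image>"
    "<image:image><image:loc>https://example.com/2.png</image:loc></image:image>"
    "</url></urlset>"
)


def image_locs_by_url(xml_text):
    """Map each page URL to the list of every <image:loc> found under it."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        result[loc] = [el.text for el in url.findall("image:image/image:loc", NS)]
    return result


images = image_locs_by_url(SAMPLE)
print(images)
```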

News Sitemaps

nyt_news = adv.sitemap_to_df('https://www.nytimes.com/sitemaps/new/news.xml.gz')
print(nyt_news.shape)
# (5085, 16)
nyt_news

Row 0
  loc: https://www.nytimes.com/interactive/2021/us/ottawa-ohio-covid-cases.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2021-01-27T17:00:00Z
  news_title: Ottawa County, Ohio Covid Case and Exposure Risk Tracker
  news_keywords: Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates
  image:
  image_loc: https://static01.nyt.com/images/2020/03/29/us/ohio-coronavirus-cases-promo-1585539358901/ohio-coronavirus-cases-promo-1585539358901-articleLarge-v274.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-6.xml.gz
  etag: 0cff645fbb74c21791568b78a888967d
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.0774069
  download_date: 2022-02-12 20:18:39.744247+00:00

Row 1
  loc: https://www.nytimes.com/interactive/2021/us/hopewell-virginia-covid-cases.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2021-01-27T17:00:00Z
  news_title: Hopewell, Virginia Covid Case and Exposure Risk Tracker
  news_keywords: Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates
  image:
  image_loc: https://static01.nyt.com/images/2020/03/29/us/virginia-coronavirus-cases-promo-1585539536519/virginia-coronavirus-cases-promo-1585539536519-articleLarge-v271.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-6.xml.gz
  etag: 0cff645fbb74c21791568b78a888967d
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.0774069
  download_date: 2022-02-12 20:18:39.744247+00:00

Row 2
  loc: https://www.nytimes.com/interactive/2021/us/box-butte-nebraska-covid-cases.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2021-01-27T17:00:00Z
  news_title: Box Butte County, Nebraska Covid Case and Exposure Risk Tracker
  news_keywords: Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates
  image:
  image_loc: https://static01.nyt.com/images/2020/03/29/us/nebraska-coronavirus-cases-promo-1585539237156/nebraska-coronavirus-cases-promo-1585539237156-articleLarge-v281.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-6.xml.gz
  etag: 0cff645fbb74c21791568b78a888967d
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.0774069
  download_date: 2022-02-12 20:18:39.744247+00:00

Row 3
  loc: https://www.nytimes.com/interactive/2021/us/stearns-minnesota-covid-cases.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2021-01-27T17:00:00Z
  news_title: Stearns County, Minnesota Covid Case and Exposure Risk Tracker
  news_keywords: Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates
  image:
  image_loc: https://static01.nyt.com/images/2020/03/29/us/minnesota-coronavirus-cases-promo-1585539172701/minnesota-coronavirus-cases-promo-1585539172701-articleLarge-v282.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-6.xml.gz
  etag: 0cff645fbb74c21791568b78a888967d
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.0774069
  download_date: 2022-02-12 20:18:39.744247+00:00

Row 4
  loc: https://www.nytimes.com/interactive/2021/us/benton-iowa-covid-cases.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2021-01-27T17:00:00Z
  news_title: Benton County, Iowa Covid Case and Exposure Risk Tracker
  news_keywords: Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates
  image:
  image_loc: https://static01.nyt.com/images/2020/03/29/us/iowa-coronavirus-cases-promo-1585539039190/iowa-coronavirus-cases-promo-1585539039190-articleLarge-v286.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-6.xml.gz
  etag: 0cff645fbb74c21791568b78a888967d
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.0774069
  download_date: 2022-02-12 20:18:39.744247+00:00

Row 5080
  loc: https://www.nytimes.com/interactive/2021/us/hodgeman-kansas-covid-cases.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2021-01-27T17:00:00Z
  news_title: Hodgeman County, Kansas Covid Case and Exposure Risk Tracker
  news_keywords: Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates
  image:
  image_loc: https://static01.nyt.com/images/2020/03/29/us/kansas-coronavirus-cases-promo-1585539054298/kansas-coronavirus-cases-promo-1585539054298-articleLarge-v285.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-2.xml.gz
  etag: f53301c8286f9bf59ef297f0232dcfc1
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.914107
  download_date: 2022-02-12 20:18:39.995323+00:00

Row 5081
  loc: https://www.nytimes.com/interactive/2021/us/miller-georgia-covid-cases.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2021-01-27T17:00:00Z
  news_title: Miller County, Georgia Covid Case and Exposure Risk Tracker
  news_keywords: Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates
  image:
  image_loc: https://static01.nyt.com/images/2020/03/29/us/georgia-coronavirus-cases-promo-1585538956622/georgia-coronavirus-cases-promo-1585538956622-articleLarge-v290.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-2.xml.gz
  etag: f53301c8286f9bf59ef297f0232dcfc1
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.914107
  download_date: 2022-02-12 20:18:39.995323+00:00

Row 5082
  loc: https://www.nytimes.com/interactive/2020/11/03/us/elections/results-west-virginia-house-district-1.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2020-11-03T17:00:00Z
  news_title: West Virginia First Congressional District Results: David McKinley vs. Natalie Cline
  news_keywords: Elections, Presidential Election of 2020, United States, internal-election-open, House of Representatives, West Virginia
  image:
  image_loc: https://static01.nyt.com/images/2020/11/03/us/elections/eln-promo-race-west-virginia-house-1WINNER-mckinleyd/eln-promo-race-west-virginia-house-1WINNER-mckinleyd-articleLarge.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-2.xml.gz
  etag: f53301c8286f9bf59ef297f0232dcfc1
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.914107
  download_date: 2022-02-12 20:18:39.995323+00:00

Row 5083
  loc: https://www.nytimes.com/interactive/2020/11/03/us/elections/results-maine-senate.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2020-11-03T17:00:00Z
  news_title: Maine Senate Results: Susan Collins Defeats Sara Gideon
  news_keywords: Elections, Presidential Election of 2020, United States, internal-election-open, Senate, Maine
  image:
  image_loc: https://static01.nyt.com/images/2020/11/03/us/elections/eln-promo-race-maine-senateWINNER-collinss/eln-promo-race-maine-senateWINNER-collinss-articleLarge.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-2.xml.gz
  etag: f53301c8286f9bf59ef297f0232dcfc1
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.914107
  download_date: 2022-02-12 20:18:39.995323+00:00

Row 5084
  loc: https://www.nytimes.com/interactive/2021/us/randolph-missouri-covid-cases.html
  lastmod: 2022-02-12 00:00:00+00:00
  news:
  news_publication:
  publication_name: The New York Times
  publication_language: en
  news_publication_date: 2021-01-27T17:00:00Z
  news_title: Randolph County, Missouri Covid Case and Exposure Risk Tracker
  news_keywords: Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates
  image:
  image_loc: https://static01.nyt.com/images/2020/03/29/us/missouri-coronavirus-cases-promo-1585539206866/missouri-coronavirus-cases-promo-1585539206866-articleLarge-v282.png
  sitemap: https://www.nytimes.com/sitemaps/new/news-2.xml.gz
  etag: f53301c8286f9bf59ef297f0232dcfc1
  sitemap_last_modified: 2022-02-12 20:17:31+00:00
  sitemap_size_mb: 0.914107
  download_date: 2022-02-12 20:18:39.995323+00:00
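The news_keywords values are comma-separated strings, so one simple thing to try is tallying individual keywords across rows. The sketch below does this with the standard library on two keyword strings copied from the rows above; on the real data the same idea would be applied to the whole news_keywords column.

```python
from collections import Counter

# Two news_keywords values copied from the rows above.
keywords = [
    "Coronavirus (2019-nCoV), States (US), Deaths (Fatalities), United States, Disease Rates",
    "Elections, Presidential Election of 2020, United States, internal-election-open, Senate, Maine",
]

# Split each cell on commas, trim whitespace, and tally across rows.
keyword_counts = Counter(
    keyword.strip()
    for cell in keywords
    for keyword in cell.split(",")
    if keyword.strip()
)
print(keyword_counts.most_common(3))
```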

Video Sitemaps

wired_video = adv.sitemap_to_df('https://www.wired.com/video/sitemap.xml')
print(wired_video.shape)
# (2955, 14)
wired_video

Row 0
  loc: https://www.wired.com/video/watch/autocomplete-inverviews-owen-wilson-answers-the-webs-most-searched-questions
  video:
  video_thumbnail_loc: http://dwgyu36up6iuz.cloudfront.net/heru80fdn/image/upload/c_fill,d_placeholder_thescene.jpg,fl_progressive,g_face,h_180,q_80,w_320/v1644595412/wired_autocomplete-inverviews-owen-wilson-answers-the-webs-most-searched-questions.jpg
  video_title: Autocomplete Interview - Owen Wilson Answers The Web’s Most Searched Questions
  video_description: Owen Wilson takes the WIRED Autocomplete Interview and answers the internet's most searched questions about himself. How did Owen Wilson break his nose? How many movies is he in with Ben Stiller? Is Owen in every Wes Anderson movie? Is he a good skateboarder? Owen answers all these questions and much more!
  video_content_loc: http://dp8hsntg6do36.cloudfront.net/62067f085577c277dd9acf42/39687acb-505b-4c69-94f1-afaa7cb5e636low.mp4
  video_duration: 645
  video_publication_date: 2022-02-11T17:00:00+00:00
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 1
  loc: https://www.wired.com/video/watch/wired-news-and-science-samsung-s22
  video:
  video_thumbnail_loc: http://dwgyu36up6iuz.cloudfront.net/heru80fdn/image/upload/c_fill,d_placeholder_thescene.jpg,fl_progressive,g_face,h_180,q_80,w_320/v1644418652/wired_wired-news-and-science-samsung-s22.jpg
  video_title: Currents - Samsung S22 Ultra Explained in 3 Minutes
  video_description: Julian Chokkattu, Reviews Editor for WIRED, walks us through a few of the Samsung S22 Ultra's new features.
  video_content_loc: http://dp8hsntg6do36.cloudfront.net/6203cd7b5577c23d19622259/fe546b9b-a320-4883-9cbd-0d790f23c36dlow.mp4
  video_duration: 184
  video_publication_date: 2022-02-10T17:00:00+00:00
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 2
  loc: https://www.wired.com/video/watch/first-look-samsung-galaxy-unpacked-2022
  video:
  video_thumbnail_loc: http://dwgyu36up6iuz.cloudfront.net/heru80fdn/image/upload/c_fill,d_placeholder_thescene.jpg,fl_progressive,g_face,h_180,q_80,w_320/v1644381627/wired_first-look-samsung-galaxy-unpacked-2022.jpg
  video_title: First Look: Samsung Galaxy Unpacked 2022
  video_description: Samsung has debuted three new smartphones—the Galaxy S22 Ultra, S22+, S22—and three Android tablets in various sizes at Samsung Unpacked 2022. WIRED's Julian Chokkattu takes a look at the newest features.
  video_content_loc: http://dp8hsntg6do36.cloudfront.net/620345a15577c23d46622256/d74930cf-11e1-466e-b023-1d9b91664204low.mp4
  video_duration: 373
  video_publication_date: 2022-02-09T15:00:00+00:00
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 3
  loc: https://www.wired.com/video/watch/reinventing-with-data
  video:
  video_thumbnail_loc: http://dwgyu36up6iuz.cloudfront.net/heru80fdn/image/upload/c_fill,d_placeholder_thescene.jpg,fl_progressive,g_face,h_180,q_80,w_320/v1642801328/wired_reinventing-with-data.jpg
  video_title: Reinventing With Data | WIRED Brand Lab
  video_description: Produced by WIRED Brand Lab with AWS | What can the Seattle Seahawks winning strategy teach businesses? Swami Sivasubramanian, VP of AI at Amazon Web Services helps us to understand how the Seattle Seahawks are using data and AI to remain a top performing team in the NFL, and how their process of data capture, storage, and machine learning to gain strategic insights is a model for making better business decision across industries.
  video_content_loc: http://dp8hsntg6do36.cloudfront.net/619bd9be1d75db41adee6b58/d4889b15-4f34-41b0-b935-0c79465a9793low.mp4
  video_duration: 292
  video_publication_date: 2022-02-09T13:00:00+00:00
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 4
  loc: https://www.wired.com/video/watch/seth-rogen-answers-the-webs-most-searched-questions
  video:
  video_thumbnail_loc: http://dwgyu36up6iuz.cloudfront.net/heru80fdn/image/upload/c_fill,d_placeholder_thescene.jpg,fl_progressive,g_face,h_180,q_80,w_320/v1644335726/wired_seth-rogen-answers-the-webs-most-searched-questions.jpg
  video_title: Autocomplete Interview - Seth Rogen Answers The Web’s Most Searched Questions
  video_description: "Pam & Tommy" star Seth Rogen takes the WIRED Autocomplete Interview once again and answers the internet's most searched questions about himself. Who does Seth Rogen look like? Does Seth have a podcast? Does he sell pottery? Does he celebrate Christmas? Does he play Call of Duty? Pam & Tommy premieres February 2 on Hulu (finale on March 9)
  video_content_loc: http://dp8hsntg6do36.cloudfront.net/6201430a1d75db06ae1f62e8/488ed635-91d0-4281-9e64-34be9bf74f00low.mp4
  video_duration: 635
  video_publication_date: 2022-02-08T17:00:00+00:00
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 2950
  loc: https://www.wired.com/video/genres/how-to
  video: nan
  video_thumbnail_loc: nan
  video_title: nan
  video_description: nan
  video_content_loc: nan
  video_duration: nan
  video_publication_date: nan
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 2951
  loc: https://www.wired.com/video/genres/movies-tv
  video: nan
  video_thumbnail_loc: nan
  video_title: nan
  video_description: nan
  video_content_loc: nan
  video_duration: nan
  video_publication_date: nan
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 2952
  loc: https://www.wired.com/video/genres/events
  video: nan
  video_thumbnail_loc: nan
  video_title: nan
  video_description: nan
  video_content_loc: nan
  video_duration: nan
  video_publication_date: nan
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 2953
  loc: https://www.wired.com/video/genres/promotion
  video: nan
  video_thumbnail_loc: nan
  video_title: nan
  video_description: nan
  video_content_loc: nan
  video_duration: nan
  video_publication_date: nan
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

Row 2954
  loc: https://www.wired.com/video/genres/transportation
  video: nan
  video_thumbnail_loc: nan
  video_title: nan
  video_description: nan
  video_content_loc: nan
  video_duration: nan
  video_publication_date: nan
  video_expiration_date: nan
  lastmod: NaT
  sitemap: https://www.wired.com/video/sitemap.xml
  etag: W/90b11f47f8b2ab57cb180cbd3c6f06f9
  sitemap_size_mb: 2.86199
  download_date: 2022-02-12 20:24:55.841851+00:00

sitemap_to_df(sitemap_url, max_workers=8, recursive=True)

Retrieve all URLs and other available tags of one or more sitemaps and put them in a DataFrame.

You can also pass the URL of a sitemap index, or a link to a robots.txt file.

Parameters:
  • sitemap_url (url) -- The URL of a sitemap, either a regular sitemap, a sitemap index, or a link to a robots.txt file. In the case of a sitemap index or robots.txt, the function will go through all the sub sitemaps and retrieve all the included URLs in one DataFrame.

  • max_workers (int) -- The maximum number of workers to use for threading. The higher the faster, but with high numbers you risk being blocked and/or missing some data as you might appear like an attacker.

  • recursive (bool) -- Whether or not to follow and import all sub-sitemaps (in case you have a sitemap index), or to only import the given sitemap. This might be useful in case you want to explore what sitemaps are available after which you can decide which ones you are interested in.

Return sitemap_df:

A pandas DataFrame containing all URLs, as well as other tags if available (lastmod, changefreq, priority, or others found in news, video, or image sitemaps).
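To illustrate what a max_workers setting controls in general, the sketch below downloads several sub-sitemaps through a bounded thread pool. It is a generic pattern, not the library's code: the URLs are hypothetical, and fake_fetch stands in for a real HTTP request so that nothing is actually downloaded.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sub-sitemap URLs; no network request is made in this sketch.
SUB_SITEMAPS = [f"https://example.com/sitemap-{n}.xml" for n in range(1, 6)]


def fake_fetch(url):
    """Stand-in for an HTTP download; returns the URL and a pretend size."""
    return url, len(url)


def fetch_all(urls, max_workers=8):
    """Fetch every URL with at most `max_workers` threads running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the input order regardless of completion order.
        return list(pool.map(fake_fetch, urls))


results = fetch_all(SUB_SITEMAPS, max_workers=2)
print(results)
```

A lower max_workers value means fewer simultaneous requests to the server, which is slower but less likely to be treated as abusive traffic.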