Split, Parse, and Analyze URL Structure

Extracting information from URLs can be a little tedious, yet very important. Because URLs follow a standard format, we can extract a lot of information from them in a fairly structured manner.

There are many situations in which you have many URLs that you want to better understand:

  • Analytics reports: Whichever analytics system you use, whether Google Analytics, Search Console, or any other tool that reports on URLs, your reports can be enhanced by splitting the URLs, so that each one in effect becomes four or five data points as opposed to one.

  • Crawl datasets: The result of any crawl you run typically contains the URLs, which can benefit from the same enhancement.

  • SERP datasets: These are essentially datasets of URLs.

  • Extracted URLs: Extracting URLs from social media posts is one way to better understand those posts, and splitting those URLs further can also help.

  • XML sitemaps: Right after downloading a sitemap (or several), splitting its URLs can give a better perspective on the dataset (a sitemap example appears further below).

The main function here is url_to_df(), which as the name suggests, converts URLs to DataFrames.

import advertools as adv

urls = ['https://netloc.com/path_1/path_2?price=10&color=blue#frag_1',
        'https://netloc.com/path_1/path_2?price=15&color=red#frag_2',
        'https://netloc.com/path_1/path_2/path_3?size=sm&color=blue#frag_1',
        'https://netloc.com/path_1?price=10&color=blue']
adv.url_to_df(urls)

                                                                  url  scheme      netloc                   path                query fragment
0        https://netloc.com/path_1/path_2?price=10&color=blue#frag_1   https  netloc.com         /path_1/path_2  price=10&color=blue   frag_1
1         https://netloc.com/path_1/path_2?price=15&color=red#frag_2   https  netloc.com         /path_1/path_2   price=15&color=red   frag_2
2  https://netloc.com/path_1/path_2/path_3?size=sm&color=blue#frag_1   https  netloc.com  /path_1/path_2/path_3   size=sm&color=blue   frag_1
3                      https://netloc.com/path_1?price=10&color=blue   https  netloc.com                /path_1  price=10&color=blue      nan

    dir_1   dir_2   dir_3 last_dir query_color query_price query_size
0  path_1  path_2     nan   path_2        blue          10        nan
1  path_1  path_2     nan   path_2         red          15        nan
2  path_1  path_2  path_3   path_3        blue         nan         sm
3  path_1     nan     nan   path_1        blue          10        nan

A more elaborate example on how to analyze URLs shows how you might use this function after obtaining a set of URLs.
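For instance, pairing url_to_df() with advertools' sitemap_to_df() covers the XML sitemap case from the list above. This is a minimal sketch; the sitemap URL is a made-up placeholder:

import advertools as adv

# Download the sitemap into a DataFrame (the URL below is hypothetical).
sitemap = adv.sitemap_to_df('https://example.com/sitemap.xml')

# The "loc" column holds the page URLs; split them into their components.
url_df = adv.url_to_df(sitemap['loc'])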

The resulting DataFrame contains the following columns:

  • url: The original URLs are listed as a reference. They are decoded for easier reading, and you can set decode=False if you want to retain the original encoding.

  • scheme: Self-explanatory. Note that you can also provide relative URLs like /category/sub-category?one=1&two=2, in which case the url, scheme, and netloc columns would be empty. You can mix relative and absolute URLs as well (see the sketch after this list).

  • netloc: The network location is the sub-domain (optional) together with the domain and top-level domain and/or the country domain.

  • path: The slug of the URL, excluding the query parameters and fragments if any. The path is also split into directories dir_1/dir_2/dir_3/... to make it easier to categorize and analyze the URLs.

  • last_dir: The last directory of each of the URLs. This is usually the part that contains information about the page itself (blog post title, product name, etc.), with the preceding directories providing metadata (category, sub-category, author name, etc.). In many cases the URLs don't all have the same number of directories, so their last directories end up unaligned across the dir_* columns. This column gathers all of them in one place.

  • query: If query parameters are available, they are given in this column, but more importantly they are parsed and included in separate columns, where each parameter has its own column (named after the parameter's key). As in the example above, the query price=10&color=blue becomes two columns, one for price and the other for color. If any other URLs in the dataset contain the same parameters, their values will be populated in the same column, and NA otherwise.

  • fragment: The final part of the URL after the hash mark #, linking to a part in the page.

  • query_*: The query parameter names are prepended with query_ to make it easy to filter them, and to avoid name collisions with other columns (if some URL contains a query parameter called "url", for example). In the unlikely event of a repeated parameter in the same URL, its values are delimited by two "@" signs, e.g. one@@two@@three. It's unusual, but it happens.

  • hostname and port: If a port is present, a column for it will be included, and if the hostname differs from netloc, it also gets its own column.
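As a quick illustration of a few of the points above (relative URLs, repeated parameters, and ports), here is a minimal sketch; the URLs are made up for demonstration:

import advertools as adv

mixed_urls = ['https://example.com:8080/products?color=blue',  # port: adds hostname/port columns
              '/category/sub-category?one=1&two=2',            # relative: url, scheme, netloc empty
              'https://example.com/?tag=one&tag=two']          # repeated parameter: "one@@two"
adv.url_to_df(mixed_urls)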

Query Parameters

The great thing about query parameters is that their names are (mostly!) descriptive, so once each parameter has its own column, you can easily understand what data it contains. You can then sort the products by price, filter by destination, get the red and blue items, and so on (a short sketch follows).
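Assuming url_df holds the DataFrame from the example at the top of this page, such sorting and filtering with pandas might look like this:

import advertools as adv

url_df = adv.url_to_df(urls)  # `urls` as defined in the first example

# Query parameter values are strings, so convert before sorting by price.
url_df.sort_values(by='query_price', key=lambda s: s.astype(float))

# Filter for the blue items.
url_df[url_df['query_color'] == 'blue']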

The URL Path (Directories)

Here things are not as straightforward, and there is no way to know what the first or second directory is supposed to indicate. In general, I can think of three main situations that you can encounter while analyzing directories.

  • Consistent URLs: This is the simplest case, where all URLs follow the same structure. /en/product1 clearly shows that the first directory indicates the language of the page. So it can also make sense to rename those columns once you have discovered their meaning.

  • Inconsistent URLs: This is similar to the previous situation: all URLs follow the same pattern, with a few exceptions. Take the following URLs for example:

    • /topic1/title-of-article-1

    • /es/topic1/title-of-article-2

    • /es/topic2/title-of-article-3

    • /topic2/title-of-article-4

    You can see that they follow the pattern /language/topic/article-title, except for English, which is not explicitly mentioned; its articles can be identified by having two directories instead of three, as the "/es/" URLs do. If these URLs are split as they are, you will end up with dir_1 containing "topic1", "es", "es", and "topic2", which distorts the data. What you actually want is "en", "es", "es", "en". In such cases, after making sure you have the right rules and patterns, you might create special columns or replace/insert values to make the URLs consistent, bringing them to a state similar to the first example (see the sketch below).

  • URLs of different types: In many cases you will find that sites have different types of pages with completely different roles on the site.

    • /blog/post-1-title.html

    • /community/help/topic_1

    • /community/help/topic_2

    Here, once you split the directories, you will see that they don't align properly (because of their different lengths), and they can't be compared easily. A good approach is to split your dataset, for example into one for blog posts and another for community content.

The ideal case is for the path part of every URL to split into the same set of directories across the dataset, with the right data in the right columns and NA otherwise. Failing that, split the dataset and analyze each part separately.
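As a minimal sketch of the inconsistent-URLs case, assuming the example article URLs above and that "es" is the only explicit language code, you might normalize and rename like this:

import advertools as adv

article_urls = ['/topic1/title-of-article-1',
                '/es/topic1/title-of-article-2',
                '/es/topic2/title-of-article-3',
                '/topic2/title-of-article-4']

# Insert the implicit default language "en" wherever the first
# directory is not a known language code, so all URLs share one pattern.
known_langs = {'es'}
normalized = [url if url.split('/')[1] in known_langs else '/en' + url
              for url in article_urls]

url_df = adv.url_to_df(normalized)

# Once the meaning of each directory is clear, rename the columns.
url_df = url_df.rename(columns={'dir_1': 'language', 'dir_2': 'topic'})

For the different-types case, you would instead filter on dir_1 (url_df[url_df['dir_1'] == 'blog'], for example) and analyze each subset as its own DataFrame.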

Analyzing a large number of URLs

You might encounter a very long list of URLs with log files, big XML sitemaps, crawls of big websites, and so on. You can still use url_to_df, but you might consume a massive amount of memory, in some cases making it impossible to process the data. For these cases you can use the output_file parameter. All you have to do is provide a path for the output file, which has to have the .parquet extension. This lets you compress the data, analyze it far more efficiently, and refer back to the same dataset without going through the whole process again (it can take a few minutes with big datasets).

import advertools as adv
import pandas as pd

# Split the URLs and write the result to a parquet file instead of
# returning a DataFrame held in memory.
adv.url_to_df([url_1, url_2, ...], output_file="output_file.parquet")

# Read back only the columns (and rows) you need.
pd.read_parquet("output_file.parquet", columns=["scheme"])
pd.read_parquet("output_file.parquet", columns=["dir_1", "dir_2"])
pd.read_parquet("output_file.parquet", columns=["dir_1", "dir_2"],
                filters=[("dir_1", "in", ["news", "politics"])])
url_to_df(urls, decode=True, output_file=None)

Split the given URLs into their components and return a DataFrame.

Each URL component gets its own column, and query parameters and directories are also parsed and given their own columns.

Parameters:
  • urls (list, pandas.Series) -- A list of URLs to split into components

  • decode (bool, default True) -- Whether or not to decode the given URLs

  • output_file (str) -- The path where the output DataFrame should be saved (must have a .parquet extension)

Returns:

urldf -- A DataFrame with a column for each URL component

Return type:

pandas.DataFrame