Split, Parse, and Analyze URL Structure
Extracting information from URLs can be a little tedious, yet it is very important. Because URLs follow a standard, we can extract a lot of information from them in a fairly structured manner.
There are many situations in which you have many URLs that you want to better understand:
Analytics reports: Whichever analytics system you use, whether Google Analytics, Search Console, or any other tool that reports on URLs, your reports can be enhanced by splitting URLs, so that each URL in effect becomes four or five data points instead of one.
Crawl datasets: The result of any crawl you run typically contains the URLs, which can benefit from the same enhancement.
SERP datasets: These are basically about URLs.
Extracted URLs: Extracting URLs from social media posts is one thing you might want to do to better understand those posts, and further splitting URLs can also help.
XML sitemaps: Right after downloading one or more sitemaps, splitting their URLs further can give a better perspective on the dataset.
The main function here is url_to_df(), which, as the name suggests, converts URLs to DataFrames.
>>> urls
['https://net.location.com/path_1/path_2?price=10&color=blue#frag_1',
... 'https://net.location.com/path_1/path_2?price=15&color=red#frag_2']
>>> url_to_df(urls)
                                                                 url scheme            netloc            path                query fragment   dir_1   dir_2 query_price query_color
0  https://net.location.com/path_1/path_2?price=10&color=blue#frag_1  https  net.location.com  /path_1/path_2  price=10&color=blue   frag_1  path_1  path_2          10        blue
1   https://net.location.com/path_1/path_2?price=15&color=red#frag_2  https  net.location.com  /path_1/path_2   price=15&color=red   frag_2  path_1  path_2          15         red
url: The original URLs are listed as a reference. They are decoded for easier reading, and you can set decode=False if you want to retain the original encoding.
scheme: Self-explanatory. Note that you can also provide relative URLs such as /category/sub-category?one=1&two=2, in which case the url, scheme and netloc columns would be empty. You can mix relative and absolute URLs as well.
netloc: The network location is the sub-domain (optional) together with the domain and top-level domain and/or the country domain.
path: The slug of the URL, excluding the query parameters and fragments if any. The path is also split into directories dir_1, dir_2, dir_3... to make it easier to categorize and analyze the URLs.
query: If query parameters are available they are given in this column, but more importantly they are parsed and included in separate columns, where each parameter has its own column (with the keys being the names). As in the example above, the query price=10&color=blue becomes two columns, one for price and the other for color. If any other URLs in the dataset contain the same parameters, their values will be populated in the same column, and NA otherwise.
fragment: The final part of the URL after the hash mark #, linking to a part in the page.
query_*: The query parameter names are prepended with query_ to make it easy to filter them out, and to avoid any name collisions with other columns, for example if some URL contains a query parameter called "url". In the unlikely event of a parameter being repeated in the same URL, its values are delimited by two "@" signs, as in one@@two@@three. It's unusual, but it happens.
hostname and port: If a port is available, a column for ports will be shown, and if the hostname is different from netloc, it will also have its own column.
The great thing about parameters is that the names are descriptive (mostly!) and once given a certain column you can easily understand what data they contain. Once this is done, you can sort the products by price, filter by destination, get the red and blue items, and so on.
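To make the column layout described above more concrete, here is a minimal sketch that splits a single URL using only Python's standard library. It is not how url_to_df() is implemented; split_url is a hypothetical helper written for illustration, and it does not handle the decode option, ports, or hostnames.

```python
from urllib.parse import parse_qs, urlsplit


def split_url(url):
    """Split one URL into a flat dict of components (illustrative helper)."""
    parts = urlsplit(url)
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    # Non-empty path segments become dir_1, dir_2, ...
    segments = [s for s in parts.path.split("/") if s]
    for i, segment in enumerate(segments, start=1):
        row[f"dir_{i}"] = segment
    # Each parameter gets a query_<name> key; repeated values are joined by "@@".
    for name, values in parse_qs(parts.query).items():
        row[f"query_{name}"] = "@@".join(values)
    return row


row = split_url("https://net.location.com/path_1/path_2?price=10&color=blue#frag_1")
print(row["dir_1"], row["dir_2"], row["query_price"], row["query_color"])
# path_1 path_2 10 blue
```

Collecting one such dict per URL and passing the list to a DataFrame constructor would give one row per URL, with missing keys showing up as NA.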
The URL Path (Directories)
Here things are not as straightforward, and there is no way to know what the first or second directory is supposed to indicate. In general, I can think of three main situations that you can encounter while analyzing directories.
Consistent URLs: This is the simplest case, where all URLs follow the same structure. /en/product1 clearly shows that the first directory indicates the language of the page. So it can also make sense to rename those columns once you have discovered their meaning.
Inconsistent URLs: This is similar to the previous situation. All URLs follow the same pattern with a few exceptions. Take the following URLs for example:
You can see that they follow the pattern /language/topic/article-title, except for English, which is not explicitly mentioned, but its articles can be identified by having two directories instead of the three we have for "/es/". If URLs are split in this case, you will end up with dir_1 having "topic1", "es", "es", and "topic2", which distorts the data. What you actually want is "en", "es", "es", "en". In such cases, after making sure you have the right rules and patterns, you might create special columns or replace/insert values to make them consistent, and get them to a state similar to the first example.
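One way to apply such a rule is sketched below. The URLs, the set of language codes, and the normalized_dirs helper are all made up for illustration; the assumption is that any path without a known language prefix is English.

```python
from urllib.parse import urlsplit

# Hypothetical set of language codes the site uses as a first directory.
KNOWN_LANGUAGES = {"es", "fr", "de"}
# Assumption: paths without a language prefix are English.
DEFAULT_LANGUAGE = "en"


def normalized_dirs(url):
    """Return the path directories with an explicit language in first position."""
    dirs = [d for d in urlsplit(url).path.split("/") if d]
    if not dirs or dirs[0] not in KNOWN_LANGUAGES:
        dirs.insert(0, DEFAULT_LANGUAGE)
    return dirs


# Made-up URLs following the /language/topic/article-title pattern.
urls = [
    "https://example.com/topic1/article-a",
    "https://example.com/es/topic1/article-b",
    "https://example.com/es/topic2/article-c",
    "https://example.com/topic2/article-d",
]
languages = [normalized_dirs(u)[0] for u in urls]
print(languages)  # ['en', 'es', 'es', 'en']
```

After this step every URL has the same number of directories, so the first position can safely be treated as a language column.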
URLs of different types: In many cases you will find that sites have different types of pages with completely different roles on the site.
Here, once you split the directories, you will see that they don't align properly (because of different lengths), and they can't be compared easily. A good approach is to split your dataset into one for blog posts and another for community content for example.
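Splitting the dataset can be sketched as grouping URLs by their first directory. The URLs below are invented, and group_by_first_dir is a hypothetical helper; on a real site the rule that separates page types may need to be more elaborate than this.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_first_dir(urls):
    """Group URLs by their first path directory ('' when there is none)."""
    groups = defaultdict(list)
    for url in urls:
        dirs = [d for d in urlsplit(url).path.split("/") if d]
        key = dirs[0] if dirs else ""
        groups[key].append(url)
    return dict(groups)


# Made-up URLs for two page types with different path depths.
urls = [
    "https://example.com/blog/post-title",
    "https://example.com/blog/another-post",
    "https://example.com/community/help/topic/replies",
    "https://example.com/community/help/topic_2/replies",
]
groups = group_by_first_dir(urls)
for name, members in groups.items():
    print(name, len(members))
```

Each group can then be split and analyzed on its own, so that its directory columns line up.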
The ideal case for the path part of the URL is to be split into the same number of directories across the dataset, with the right data in the right columns and NA otherwise. Alternatively, split the dataset and analyze each part separately.
- url_to_df(urls, decode=True)
Split the given URLs into their components to a DataFrame.
Each component will have its own column, and query parameters and directories will also be parsed, with each one given its own column.
urls (url) -- A list of URLs to split into components
decode (bool) -- Whether or not to decode the given URLs
Returns: split (DataFrame) -- A DataFrame with a column for each component