# Source code for advertools.urlytics

"""
.. _urlytics:

Split, Parse, and Analyze URL Structure
=======================================

Extracting information from URLs can be a little tedious, yet very important.
Because URLs follow a standard structure, we can extract a lot of information
from them in a fairly structured manner.

There are many situations in which you have many URLs that you want to better
understand:

* **Analytics reports**: Whichever analytics system you use, whether Google
  Analytics, search console, or any other reporting tool that reports on URLs,
  your reports can be enhanced by splitting URLs, and in effect becoming four
  or five data points as opposed to one.
* :ref:`Crawl datasets <crawl>`: The result of any crawl you run typically
  contains the URLs, which can benefit from the same enhancement.
* :ref:`SERP datasets <serp>`: These are essentially collections of URLs.
* :ref:`Extracted URLs <extract>`: Extracting URLs from social media posts is
  one thing you might want to do to better understand those posts, and further
  splitting URLs can also help.
* :ref:`XML sitemaps <sitemaps>`: Right after downloading one or more
  sitemaps, splitting their URLs further can give a better perspective on the
  dataset.

The main function here is :func:`url_to_df`, which as the name suggests,
converts URLs to DataFrames.

.. thebe-button::
    Run this code

.. code-block::
    :class: thebe, thebe-init

    import advertools as adv

    urls = ['https://netloc.com/path_1/path_2?price=10&color=blue#frag_1',
            'https://netloc.com/path_1/path_2?price=15&color=red#frag_2',
            'https://netloc.com/path_1/path_2/path_3?size=sm&color=blue#frag_1',
            'https://netloc.com/path_1?price=10&color=blue']
    adv.url_to_df(urls)

====  =================================================================  ========  ==========  =====================  ===================  ==========  =======  =======  =======  ==========  =============  =============  ============
  ..  url                                                                scheme    netloc      path                   query                fragment    dir_1    dir_2    dir_3    last_dir    query_color      query_price  query_size
====  =================================================================  ========  ==========  =====================  ===================  ==========  =======  =======  =======  ==========  =============  =============  ============
   0  https://netloc.com/path_1/path_2?price=10&color=blue#frag_1        https     netloc.com  /path_1/path_2         price=10&color=blue  frag_1      path_1   path_2   nan      path_2      blue                      10  nan
   1  https://netloc.com/path_1/path_2?price=15&color=red#frag_2         https     netloc.com  /path_1/path_2         price=15&color=red   frag_2      path_1   path_2   nan      path_2      red                       15  nan
   2  https://netloc.com/path_1/path_2/path_3?size=sm&color=blue#frag_1  https     netloc.com  /path_1/path_2/path_3  size=sm&color=blue   frag_1      path_1   path_2   path_3   path_3      blue                      nan  sm
   3  https://netloc.com/path_1?price=10&color=blue                       https     netloc.com  /path_1                price=10&color=blue              path_1   nan      nan      path_1      blue                      10  nan
====  =================================================================  ========  ==========  =====================  ===================  ==========  =======  =======  =======  ==========  =============  =============  ============

A more elaborate example of :ref:`how to analyze URLs <sitemaps>` shows how
you might use this function after obtaining a set of URLs.

The resulting DataFrame contains the following columns:

* **url**: The original URLs are listed as a reference. They are decoded for
  easier reading, and you can set ``decode=False`` if you want to retain the
  original encoding.
* **scheme**: Self-explanatory (`https`, for example). Note that you can also
  provide relative URLs like `/category/sub-category?one=1&two=2`, in which
  case the `url`, `scheme`, and `netloc` columns would be empty. You can mix
  relative and absolute URLs as well.
* **netloc**: The network location is the sub-domain (optional) together with
  the domain and top-level domain and/or the country domain.
* **path**: The slug of the URL, excluding the query parameters and fragments
  if any. The path is also split into directories ``dir_1/dir_2/dir_3/...`` to
  make it easier to categorize and analyze the URLs.
* **last_dir**: The last directory of each of the URLs. This is usually the
  part that contains information about the page itself (blog post title,
  product name, etc.) with previous directories providing meta data (category,
  sub-category, author name, etc.). In many cases you don't have all URLs with
  the same number of directories, so they end up unaligned. This extracts all
  ``last_dir``'s in one column.
* **query**: If query parameters are available they are given in this column,
  but more importantly they are parsed and included in separate columns, where
  each parameter has its own column (with the keys being the names). As in the
  example above, the query `price=10&color=blue` becomes two columns, one for
  price and the other for color. If any other URLs in the dataset contain the
  same parameters, their values will be populated in the same column, with
  `NA` for the URLs that don't contain them.
* **fragment**: The final part of the URL after the hash mark `#`, linking to a
  part in the page.
* **query_***: The query parameter names are prepended with `query_` to make
  it easy to filter them out, and to avoid name collisions with other columns,
  if some URL contains a query parameter called "url" for example. In the
  unlikely event of a repeated parameter in the same URL, its values are
  delimited by two "@" signs: `one@@two@@three`. It's unusual, but it happens.
* **hostname and port**: If available a column for ports will be shown, and if
  the hostname is different from `netloc` it would also have its own column.
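As a rough illustration of the parsing rules above, here is a small sketch
using the standard library directly (not advertools itself); the URL is made
up for this example, and shows how a repeated parameter would be joined with
"@@":

```python
from urllib.parse import parse_qs, urlsplit

# A hypothetical URL with a repeated "tag" parameter and a fragment:
url = "https://example.com/shop/shoes?color=blue&tag=new&tag=sale#reviews"

split = urlsplit(url)

# Prefix parameter names with "query_" and join repeated values with "@@",
# mirroring the conventions described above:
params = {"query_" + k: "@@".join(v) for k, v in parse_qs(split.query).items()}

print(split.netloc)    # example.com
print(split.fragment)  # reviews
print(params)          # {'query_color': 'blue', 'query_tag': 'new@@sale'}
```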

Query Parameters
----------------

The great thing about parameters is that the names are descriptive (mostly!)
and once given a certain column you can easily understand what data they
contain. Once this is done, you can sort the products by price, filter by
destination, get the red and blue items, and so on.
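Once each parameter has its own column, these become ordinary pandas
operations. A minimal sketch, using a hand-built DataFrame with the `query_*`
column naming described above (the data is made up for illustration):

```python
import pandas as pd

urldf = pd.DataFrame({
    "url": ["/p1?price=10&color=blue", "/p2?price=15&color=red", "/p3?color=blue"],
    "query_price": [10.0, 15.0, None],
    "query_color": ["blue", "red", "blue"],
})

# Sort products by price, cheapest first (missing prices go last by default):
by_price = urldf.sort_values("query_price")

# Keep only the blue items:
blue = urldf[urldf["query_color"].eq("blue")]
print(blue["url"].tolist())  # ['/p1?price=10&color=blue', '/p3?color=blue']
```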

The URL Path (Directories)
--------------------------

Here things are not as straightforward, and there is no way to know what the
first or second directory is supposed to indicate. In general, I can think of
three main situations that you can encounter while analyzing directories.

* **Consistent URLs**: This is the simplest case, where all URLs follow the
  same structure. `/en/product1` clearly shows that the first directory
  indicates the language of the page. So it can also make sense to rename those
  columns once you have discovered their meaning.

* **Inconsistent URLs**: This is similar to the previous situation: all URLs
  follow the same pattern, with a few exceptions. Take the following URLs for
  example:

    * /topic1/title-of-article-1
    * /es/topic1/title-of-article-2
    * /es/topic2/title-of-article-3
    * /topic2/title-of-article-4

  You can see that they follow the pattern `/language/topic/article-title`,
  except for English, which is not explicitly mentioned, but its articles can
  be identified by having two instead of three directories, as we have for
  "/es/". If URLs are split in this case, you will end up with `dir_1` having
  "topic1", "es", "es", and "topic2", which distorts the data. What you
  actually want is "en", "es", "es", "en". In such cases, after making sure you
  have the right rules and patterns, you might create special columns or
  replace/insert values to make them consistent, and get them to a state
  similar to the first example.

* **URLs of different types**: In many cases you will find that sites have
  different types of pages with completely different roles on the site.

    * /blog/post-1-title.html
    * /community/help/topic_1
    * /community/help/topic_2

  Here, once you split the directories, you will see that they don't align
  properly (because of different lengths), and they can't be compared easily. A
  good approach is to split your dataset into one for blog posts and another
  for community content for example.
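For the inconsistent-URL case above, one possible way to normalize the URLs
before splitting is to insert the implicit language directory. This is only a
sketch, assuming "es" is the one explicit language prefix and everything else
is English (a made-up rule for this example):

```python
urls = [
    "/topic1/title-of-article-1",
    "/es/topic1/title-of-article-2",
    "/es/topic2/title-of-article-3",
    "/topic2/title-of-article-4",
]

known_langs = {"es"}

def normalize(url):
    # Insert the implicit "/en" when the first directory is not a language code:
    first_dir = url.strip("/").split("/")[0]
    if first_dir not in known_langs:
        return "/en" + url
    return url

fixed = [normalize(u) for u in urls]
print(fixed)
# ['/en/topic1/title-of-article-1', '/es/topic1/title-of-article-2',
#  '/es/topic2/title-of-article-3', '/en/topic2/title-of-article-4']
```

After this step every URL follows `/language/topic/article-title`, so the
split directories align the way the first example does.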

The ideal case is for the `path` part of the URL to be split into directories
of equal length across the dataset, with the right data in the right columns
and `NA` otherwise. Alternatively, split the dataset and analyze each part
separately.
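Splitting a mixed dataset by page type can be a simple boolean filter on the
`dir_1` column. A minimal sketch with a hand-built DataFrame standing in for
the output of `url_to_df`:

```python
import pandas as pd

urldf = pd.DataFrame({
    "url": [
        "/blog/post-1-title.html",
        "/community/help/topic_1",
        "/community/help/topic_2",
    ],
    "dir_1": ["blog", "community", "community"],
})

# One sub-dataset per page type, analyzed separately:
blog_df = urldf[urldf["dir_1"].eq("blog")]
community_df = urldf[urldf["dir_1"].eq("community")]
print(len(blog_df), len(community_df))  # 1 2
```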

Analyzing a large number of URLs
--------------------------------

Very long lists of URLs are something you might encounter with log files, big
XML sitemaps, crawls of large websites, and so on. You can still use
``url_to_df``, but it might consume a massive amount of memory, in some cases
making it impossible to process the data. For these cases you can use the
``output_file`` parameter.
All you have to do is provide a path for this output file, and it has to have the
.parquet extension. This allows you to compress the data, analyze it way more
efficiently, and you can refer back to the same dataset without having to go through
the process again (it can take a few minutes with big datasets).

.. code-block:: python

    import advertools as adv
    import pandas as pd
    adv.url_to_df([url_1, url_2, ...], output_file="output_file.parquet")
    pd.read_parquet("output_file.parquet", columns=["scheme"])
    pd.read_parquet("output_file.parquet", columns=["dir_1", "dir_2"])
    pd.read_parquet("output_file.parquet", columns=["dir_1", "dir_2"], filters=[("dir_1", "in", ["news", "politics"])])

"""  # noqa: E501

import os
from tempfile import TemporaryDirectory
from urllib.parse import parse_qs, unquote, urlsplit

import numpy as np
import pandas as pd

def _url_to_df(urls, decode=True):
    """Split the given URLs into their components to a DataFrame.

    Each column will have its own component, and query parameters and
    directories will also be parsed and given special columns each.

    :param url urls: A list of URLs to split into components
    :param bool decode: Whether or not to decode the given URLs
    :return DataFrame split: A DataFrame with a column for each component
    """
    if isinstance(urls, str):
        urls = [urls]
    decode = unquote if decode else lambda x: x
    split_list = []
    for url in urls:
        split = urlsplit(decode(url))
        port = split.port
        hostname = split.hostname if split.hostname != split.netloc else None
        split = split._asdict()
        if hostname:
            split["hostname"] = hostname
        if port:
            split["port"] = port
        parsed_query = parse_qs(split["query"])
        parsed_query = {
            "query_" + key: "@@".join(val) for key, val in parsed_query.items()
        }
        split.update(**parsed_query)
        dirs = split["path"].strip("/").split("/")
        if dirs[0]:
            dir_cols = {"dir_{}".format(n): d for n, d in enumerate(dirs, 1)}
            split.update(**dir_cols)
        split_list.append(split)
    df = pd.DataFrame(split_list)

    query_df = df.filter(regex="query_")
    if not query_df.empty:
        sorted_q_params = query_df.notna().mean().sort_values(ascending=False).index
        query_df = query_df[sorted_q_params]
        df = df.drop(query_df.columns, axis=1)
    dirs_df = df.filter(regex="^dir_")
    if not dirs_df.empty:
        df = df.drop(dirs_df.columns, axis=1)
        dirs_df = dirs_df.assign(last_dir=dirs_df.ffill(axis=1).iloc[:, -1:].squeeze())
    df = pd.concat([df, dirs_df, query_df], axis=1).replace("", np.nan)
    url_list_df = pd.DataFrame({"url": [decode(url) for url in urls]})
    final_df = pd.concat([url_list_df, df], axis=1)
    return final_df

def url_to_df(urls, decode=True, output_file=None):
    """Split the given URLs into their components to a DataFrame.

    Each column will have its own component, and query parameters and
    directories will also be parsed and given special columns each.

    Parameters
    ----------
    urls : list, pandas.Series
        A list of URLs to split into components
    decode : bool, default True
        Whether or not to decode the given URLs
    output_file : str
        The path where to save the output DataFrame with a .parquet extension

    Returns
    -------
    urldf : pandas.DataFrame
        A DataFrame with a column for each URL component
    """
    if output_file is not None:
        if output_file.rsplit(".")[-1] != "parquet":
            raise ValueError("Your output_file has to have a .parquet extension.")
    step = 1000
    sublists = (urls[sub : sub + step] for sub in range(0, len(urls), step))
    with TemporaryDirectory() as tmpdir:
        for i, sublist in enumerate(sublists):
            urldf = _url_to_df(sublist, decode=decode)
            urldf.index = range(i * step, (i * step) + len(urldf))
            urldf.to_parquet(f"{tmpdir}/{i:08}.parquet", index=True, version="2.6")
        final_df_list = [
            pd.read_parquet(f"{tmpdir}/{tmpfile}") for tmpfile in os.listdir(tmpdir)
        ]
        final_df = pd.concat(final_df_list).sort_index()
        if output_file is not None:
            final_df.to_parquet(output_file, index=False, version="2.6")
        else:
            return final_df