Source code for advertools.reverse_dns_lookup

"""
.. _reverse_dns_lookup:

.. raw:: html

    <script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.4/require.min.js"></script>

Getting the host names for a list of IP addresses can help verify the
authenticity of those addresses. You typically want to do this as part
of a :ref:`log file analysis <logs>` pipeline, where requests made to your
server claim to come from a certain user agent or bot. Performing a
:func:`reverse_dns_lookup` on those IP addresses gets you the actual host name
that each one belongs to.
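
As a rough illustration of that verification step, the sketch below checks
whether a returned host name falls under a domain you trust for a given bot.
The helper name ``hostname_matches`` and the suffix tuple are assumptions made
for this example and are not part of advertools; adjust the suffixes to
whatever the bot operator documents.

```python
# Hypothetical helper, not part of advertools.
# Assumed suffixes: adapt them to the domains the bot operator publishes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def hostname_matches(hostname, suffixes=GOOGLE_SUFFIXES):
    # Failed lookups yield no host name (None/NaN), so treat them as unverified.
    if not isinstance(hostname, str):
        return False
    # PTR records may end with a trailing dot; ignore it and the letter case.
    cleaned = hostname.rstrip(".").lower()
    return cleaned.endswith(tuple(suffixes))


print(hostname_matches("crawl-66-249-66-194.googlebot.com"))  # True
print(hostname_matches("mail.garda.ir"))  # False
```

A matching suffix is only part of the check. A stricter pipeline would also
resolve the host name forward and confirm that it maps back to the same IP.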

The :func:`reverse_dns_lookup` function is like running the `host` command
from the command line, but at scale:

.. code-block:: bash

    $ host 66.249.80.0
    0.80.249.66.in-addr.arpa domain name pointer google-proxy-66-249-80-0.google.com.
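
In Python, the same single lookup can be sketched with the standard library's
``socket.gethostbyaddr``. The wrapper name ``lookup_one`` and its ``resolver``
parameter are assumptions made for this illustration; the parameter exists only
so that the real call can be swapped out, for example in tests.

```python
import socket


def lookup_one(ip, resolver=socket.gethostbyaddr):
    # Return the primary host name, or the error text if the lookup fails.
    try:
        hostname, aliaslist, ipaddrlist = resolver(ip)
        return {"ip": ip, "hostname": hostname, "error": None}
    except OSError as e:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return {"ip": ip, "hostname": None, "error": str(e)}
```

Calling ``lookup_one("66.249.80.0")`` needs network access, and the result
depends on the DNS resolver that your machine uses.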


Because you usually have many duplicated IP addresses to check, this function
makes the process practical and efficient compared to running the command
thousands of times from the command line.
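
The idea of looking up each distinct address only once, and doing so
concurrently, can be sketched as follows. This is a simplified illustration
rather than the function's actual implementation; ``lookup_unique``, its
``resolver`` parameter, and the default of 20 workers are hypothetical choices.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def lookup_unique(ip_list, resolver=socket.gethostbyaddr, max_workers=20):
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique_ips = list(dict.fromkeys(ip_list))

    def safe_lookup(ip):
        try:
            return resolver(ip)[0]  # first element is the primary host name
        except OSError:
            return None  # unresolved addresses map to None

    # Threads suit this task because each lookup mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        hostnames = list(executor.map(safe_lookup, unique_ips))
    return dict(zip(unique_ips, hostnames))
```

With this approach, an address that appears a thousand times in a log file
still triggers a single lookup.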

Running the function is simple: you supply a list of the IP addresses that you
have. Make sure to **keep the duplicates**, because the function handles them
for you and also provides counts and some statistics on the frequency of the
IPs:

.. container:: thebe

    .. thebe-button::
        Run this code

    .. code-block::
        :class: thebe, thebe-init

        import advertools as adv
        ip_list = ['66.249.66.194', '66.249.66.194', '66.249.66.194',
                   '66.249.66.91', '66.249.66.91', '130.185.74.243',
                   '31.56.96.51', '5.211.97.39']

        host_df = adv.reverse_dns_lookup(ip_list)
        host_df

====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
  ..  ip_address        count    cum_count    perc    cum_perc  hostname                           aliaslist                    ipaddrlist      errors
====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
   0  66.249.66.194         3            3   0.375       0.375  crawl-66-249-66-194.googlebot.com  194.66.249.66.in-addr.arpa   66.249.66.194
   1  66.249.66.91          2            5   0.25        0.625  crawl-66-249-66-91.googlebot.com   91.66.249.66.in-addr.arpa    66.249.66.91
   2  130.185.74.243        1            6   0.125       0.75   mail.garda.ir                      243.74.185.130.in-addr.arpa  130.185.74.243
   3  31.56.96.51           1            7   0.125       0.875  31-56-96-51.shatel.ir              51.96.56.31.in-addr.arpa     31.56.96.51
   4  5.211.97.39           1            8   0.125       1                                                                                      [Errno 1] Unknown host
====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================

As you can see, in addition to the hostname, aliaslist, and ipaddrlist for
each IP you supplied, you also get counts (absolute and cumulative) as well as
percentages (absolute and cumulative). Lookups that fail have their message
reported in the errors column. Together, these give you a good overview of the
relative importance of each IP and can help you focus attention where needed.
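
To make the frequency columns concrete, here is a small standard-library
sketch of how ``count``, ``cum_count``, ``perc``, and ``cum_perc`` relate to
each other. The function itself uses pandas for this; the ``ip_frequencies``
helper below is a hypothetical name used only for illustration.

```python
from collections import Counter
from itertools import accumulate


def ip_frequencies(ip_list):
    # Most frequent addresses first; ties keep first-seen order.
    ranked = Counter(ip_list).most_common()
    counts = [count for _, count in ranked]
    total = sum(counts)
    cum_counts = list(accumulate(counts))
    return [
        {
            "ip_address": ip,
            "count": count,
            "cum_count": cum,
            "perc": count / total,
            "cum_perc": cum / total,
        }
        for (ip, count), cum in zip(ranked, cum_counts)
    ]
```

For the eight addresses in the example above, the first row has a ``count`` of
3 and a ``perc`` of 0.375, and the last row's ``cum_perc`` reaches 1.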
"""  # noqa: E501

import platform
import socket
from concurrent import futures

import pandas as pd

system = platform.system()

_default_max_workders = 60 if system == "Darwin" else 600


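def _single_request(ip):
    # Resolve one IP; list values are joined with "@@" to fit in a single cell.
    try:
        hostname, aliaslist, ipaddrlist = socket.gethostbyaddr(ip)
        return [ip, hostname, "@@".join(aliaslist), "@@".join(ipaddrlist)]
    except Exception as e:
        # Failed lookups return a fifth element holding the error message.
        return [ip, None, None, None, str(e)]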



def reverse_dns_lookup(ip_list, max_workers=_default_max_workders):
    """Return the hostname, aliaslist, and ipaddrlist for a list of IP
    addresses.

    This is mainly useful for a long list of IP addresses that typically
    contains duplicates, and it retrieves the information quickly. It is
    roughly equivalent to running the `host` command on the command line many
    times:

    .. code-block:: bash

        $ host 66.249.80.0
        0.80.249.66.in-addr.arpa domain name pointer google-proxy-66-249-80-0.google.com.

    Parameters
    ----------
    ip_list : list
      A list of IP addresses.
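    max_workers : int
      The maximum number of concurrent workers to use (processes on macOS,
      threads on other platforms). Defaults to 60 on macOS and 600 elsewhere.

    Returns
    -------
    host_df : pandas.DataFrame
      One row per unique IP address with its ``hostname``, ``aliaslist``, and
      ``ipaddrlist``, an ``errors`` column when any lookup fails, and counts
      and percentages (absolute and cumulative) that give an overview of the
      most frequent IPs.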

    Examples
    --------
    >>> import advertools as adv
    >>> ip_list = [
    ...     "66.249.66.194",
    ...     "66.249.66.194",
    ...     "66.249.66.194",
    ...     "66.249.66.91",
    ...     "66.249.66.91",
    ...     "130.185.74.243",
    ...     "31.56.96.51",
    ...     "5.211.97.39",
    ... ]
    >>> adv.reverse_dns_lookup(ip_list)

    ====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
      ..  ip_address        count    cum_count    perc    cum_perc  hostname                           aliaslist                    ipaddrlist      errors
    ====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
       0  66.249.66.194         3            3   0.375       0.375  crawl-66-249-66-194.googlebot.com  194.66.249.66.in-addr.arpa   66.249.66.194
       1  66.249.66.91          2            5   0.25        0.625  crawl-66-249-66-91.googlebot.com   91.66.249.66.in-addr.arpa    66.249.66.91
       2  130.185.74.243        1            6   0.125       0.75   mail.garda.ir                      243.74.185.130.in-addr.arpa  130.185.74.243
       3  31.56.96.51           1            7   0.125       0.875  31-56-96-51.shatel.ir              51.96.56.31.in-addr.arpa     31.56.96.51
       4  5.211.97.39           1            8   0.125       1                                                                                      [Errno 1] Unknown host
    ====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
    """  # noqa: E501
    socket.setdefaulttimeout(8)
    # Count each unique IP (most frequent first), then add cumulative stats.
    count_df = pd.Series(ip_list).value_counts().reset_index()
    count_df.columns = ["ip_address", "count"]
    count_df["cum_count"] = count_df["count"].cumsum()
    count_df["perc"] = count_df["count"].div(count_df["count"].sum())
    count_df["cum_perc"] = count_df["perc"].cumsum()

    hosts = []
    if system == "Darwin":
        with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
            for host in executor.map(_single_request, count_df["ip_address"]):
                hosts.append(host)
    else:
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            for host in executor.map(_single_request, count_df["ip_address"]):
                hosts.append(host)
    df = pd.DataFrame(hosts)
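    columns = ["ip", "hostname", "aliaslist", "ipaddrlist", "errors"]
    # Rows carry a fifth "errors" element only when a lookup failed, so the
    # frame has four columns when every lookup succeeded.
    if df.shape[1] == 4:
        columns = columns[:-1]
    df.columns = columns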
    final_df = pd.merge(
        count_df, df, left_on="ip_address", right_on="ip", how="left"
    ).drop("ip", axis=1)
    return final_df