Source code for advertools.reverse_dns_lookup

"""
.. _reverse_dns_lookup:

.. raw:: html

    <script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.4/require.min.js"></script>

Getting the host names for a list of IP addresses can help verify the
authenticity of those addresses. You typically do this as part of a
:ref:`log file analysis <logs>` pipeline, where requests made to your server
claim to come from a certain user agent or bot. Running
:func:`reverse_dns_lookup` on those IP addresses returns the host names that
they actually belong to.

The :func:`reverse_dns_lookup` function does the equivalent of running the
`host` command from the command line, but for many IP addresses at once:

.. code-block:: bash

    $ host 66.249.80.0
    0.80.249.66.in-addr.arpa domain name pointer google-proxy-66-249-80-0.google.com.


Because you usually have a large number of IP addresses to check, many of them
duplicates, this function makes the process practical and efficient compared
to running the command thousands of times from the command line.
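
For a single address, the lookup described above can be sketched with the
standard library. This is only an illustration under stated assumptions:
``lookup_one`` is a hypothetical helper name, and the PTR name is derived
with ``ipaddress`` purely to show where the ``in-addr.arpa`` name in the
``host`` output comes from:

.. code-block:: python

    import ipaddress
    import socket

    ip = "66.249.80.0"

    # The record name that `host` queries: octets reversed, plus in-addr.arpa
    ptr_name = ipaddress.ip_address(ip).reverse_pointer
    print(ptr_name)  # 0.80.249.66.in-addr.arpa


    def lookup_one(ip):
        """Hypothetical helper: return (hostname, error) for one IP."""
        try:
            hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
            return hostname, None
        except OSError as e:  # socket.herror and socket.gaierror are OSErrors
            return None, str(e)

Calling ``lookup_one(ip)`` needs network access, so its result depends on the
resolver and on the DNS records published at the time of the call.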

Running the function is simple: supply the list of IP addresses that you
have. Make sure to **keep the duplicates**, because the function handles them
for you and also provides counts and some statistics on the frequency of the
IPs:

.. container:: thebe

    .. thebe-button::
        Run this code

    .. code-block::
        :class: thebe, thebe-init

        import advertools as adv
        ip_list = ['66.249.66.194', '66.249.66.194', '66.249.66.194',
                '66.249.66.91', '66.249.66.91', '130.185.74.243',
                '31.56.96.51', '5.211.97.39']

        host_df = adv.reverse_dns_lookup(ip_list)
        host_df

====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
  ..  ip_address        count    cum_count    perc    cum_perc  hostname                           aliaslist                    ipaddrlist      errors
====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
   0  66.249.66.194         3            3   0.375       0.375  crawl-66-249-66-194.googlebot.com  194.66.249.66.in-addr.arpa   66.249.66.194
   1  66.249.66.91          2            5   0.25        0.625  crawl-66-249-66-91.googlebot.com   91.66.249.66.in-addr.arpa    66.249.66.91
   2  130.185.74.243        1            6   0.125       0.75   mail.garda.ir                      243.74.185.130.in-addr.arpa  130.185.74.243
   3  31.56.96.51           1            7   0.125       0.875  31-56-96-51.shatel.ir              51.96.56.31.in-addr.arpa     31.56.96.51
   4  5.211.97.39           1            8   0.125       1                                                                                      [Errno 1] Unknown host
====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================

As you can see, in addition to getting hostnames, aliaslist, and ipaddrlist for
the IPs you supplied, you also get counts (individual and cumulative) as well
as percentages (individual and cumulative, expressed as fractions of the
total). This gives you a good overview of the relative importance of each IP
and helps you focus attention where it is needed.
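
To act on the verification use case described at the top, one option is to
check whether each resolved hostname falls under a domain you expect for the
claimed bot. The following is a minimal sketch, not part of this module:
``hostname_matches`` is a hypothetical helper, and the suffix list is an
assumption taken from the hostnames shown in the examples above:

.. code-block:: python

    def hostname_matches(hostname, allowed_suffixes):
        """Hypothetical helper: True if hostname is under a given suffix."""
        if not isinstance(hostname, str) or not hostname:
            return False  # failed lookups have no hostname
        hostname = hostname.rstrip(".").lower()
        return any(
            hostname == suffix or hostname.endswith("." + suffix)
            for suffix in allowed_suffixes
        )


    suffixes = ["googlebot.com", "google.com"]  # assumed for illustration
    hostnames = ["crawl-66-249-66-194.googlebot.com", "mail.garda.ir", None]
    print([hostname_matches(h, suffixes) for h in hostnames])
    # [True, False, False]

The same helper could be applied to the result with
``host_df["hostname"].map(lambda h: hostname_matches(h, suffixes))``.
Matching on ``"." + suffix`` rather than on the bare suffix keeps a name such
as ``evilgooglebot.com`` from passing the check.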
"""  # noqa: E501

import platform
import socket
from concurrent import futures

import pandas as pd

system = platform.system()

_default_max_workders = 60 if system == "Darwin" else 600


def _single_request(ip):
    try:
        hostname, aliaslist, ipaddrlist = socket.gethostbyaddr(ip)
        return [ip, hostname, "@@".join(aliaslist), "@@".join(ipaddrlist)]
    except Exception as e:
        return [ip, None, None, None, str(e)]


def reverse_dns_lookup(ip_list, max_workers=_default_max_workders):
    """Return the hostname, aliaslist, and ipaddrlist for a list of IP
    addresses.

    This is mainly useful for a long list of typically duplicated IP addresses
    and helps in getting the information very fast. It is basically the
    equivalent of running the `host` command on the command line many times:

    .. code-block:: bash

        $ host advertools.readthedocs.io
        advertools.readthedocs.io has address 104.17.32.82

    Parameters
    ----------
    ip_list : list
        A list of IP addresses.
    max_workers : int
        The maximum number of concurrent workers to use (processes on macOS,
        threads on other systems).

    You also get a simple report about the counts of the IPs to get an overview
    of the top ones.

    Examples
    --------
    >>> import advertools as adv
    >>> ip_list = [
    ...     "66.249.66.194",
    ...     "66.249.66.194",
    ...     "66.249.66.194",
    ...     "66.249.66.91",
    ...     "66.249.66.91",
    ...     "130.185.74.243",
    ...     "31.56.96.51",
    ...     "5.211.97.39",
    ... ]
    >>> adv.reverse_dns_lookup(ip_list)

    ====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
      ..  ip_address        count    cum_count    perc    cum_perc  hostname                           aliaslist                    ipaddrlist      errors
    ====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
       0  66.249.66.194         3            3   0.375       0.375  crawl-66-249-66-194.googlebot.com  194.66.249.66.in-addr.arpa   66.249.66.194
       1  66.249.66.91          2            5   0.25        0.625  crawl-66-249-66-91.googlebot.com   91.66.249.66.in-addr.arpa    66.249.66.91
       2  130.185.74.243        1            6   0.125       0.75   mail.garda.ir                      243.74.185.130.in-addr.arpa  130.185.74.243
       3  31.56.96.51           1            7   0.125       0.875  31-56-96-51.shatel.ir              51.96.56.31.in-addr.arpa     31.56.96.51
       4  5.211.97.39           1            8   0.125       1                                                                                      [Errno 1] Unknown host
    ====  ==============  =======  ===========  ======  ==========  =================================  ===========================  ==============  ======================
    """  # noqa: E501
    socket.setdefaulttimeout(8)
    # Count each unique IP so that every address is looked up only once.
    count_df = pd.Series(ip_list).value_counts().reset_index()
    count_df.columns = ["ip_address", "count"]
    count_df["cum_count"] = count_df["count"].cumsum()
    count_df["perc"] = count_df["count"].div(count_df["count"].sum())
    count_df["cum_perc"] = count_df["perc"].cumsum()

    hosts = []
    if system == "Darwin":
        with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
            for _ip, host in zip(
                ip_list, executor.map(_single_request, count_df["ip_address"])
            ):
                hosts.append(host)
    else:
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            for host in executor.map(_single_request, count_df["ip_address"]):
                hosts.append(host)

    df = pd.DataFrame(hosts)
    columns = ["ip", "hostname", "aliaslist", "ipaddrlist", "errors"]
    # Rows have a fifth "errors" element only when a lookup failed.
    if df.shape[1] == 4:
        columns = columns[:-1]
    df.columns = columns

    final_df = pd.merge(
        count_df, df, left_on="ip_address", right_on="ip", how="left"
    ).drop("ip", axis=1)
    return final_df