advertools Command Line Interface (CLI)
Once you install advertools with python3 -m pip install advertools, you should have access to the command line interface and can run the available commands.
You only need Python 3 installed; no Python programming is required to use the CLI.
Run advertools --help or adv -h to get access to the documentation.
For the documentation of a specific command, run advertools <command> --help, for example advertools sitemaps --help or adv crawl -h.
convert a robots.txt file (or list of file URLs) to a table in a CSV format
usage:
advertools robots [-h] [url ...]
convert a robots.txt file (or list of file URLs) to a table in a CSV format
you can provide a web URL, or the URL of a file on your local machine, e.g. file:///Users/path/to/robots.txt
advertools robots https://www.google.com/robots.txt
multiple robots files:
advertools robots https://www.google.com/robots.txt https://www.google.jo/robots.txt https://www.google.es/robots.txt
use output redirection ">" to save to a CSV file:
advertools robots https://www.google.com/robots.txt > google_robots.csv
run the command for a long list of robots files saved in a text file (robotslist.txt):
advertools robots < robotslist.txt > multi_robots.csv
- positional arguments:
url a robots.txt URL (or a list of URLs) (default: None)
- optional arguments:
- -h, --help
show this help message and exit
download, parse, and save an XML sitemap to a table in a CSV file
usage:
advertools sitemaps [-h] [-r {0,1}] [-s SEPARATOR] [sitemap_url]
download, parse, and save an XML sitemap to a table in a CSV file
- positional arguments:
sitemap_url the URL of the XML sitemap (regular or sitemap index) (default: None)
- optional arguments:
- -h, --help
show this help message and exit
- -r {0,1}, --recursive {0,1}
whether or not to fetch sub-sitemaps if it is a sitemap index file (default: 1)
- -s SEPARATOR, --separator SEPARATOR
the separator with which to separate columns of the output (default: ,)
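example (the URL here is a placeholder; as with the robots command, the output table is assumed to go to standard output, so you can save it with redirection):
advertools sitemaps https://www.example.com/sitemap.xml > sitemap.csv
the same, but without fetching sub-sitemaps in case the URL points to a sitemap index:
advertools sitemaps https://www.example.com/sitemap.xml --recursive 0 > sitemap.csv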
split a list of URLs into their components: scheme, netloc, path, query, etc.
usage:
advertools urls [-h] [url_list ...]
split a list of URLs into their components: scheme, netloc, path, query, etc.
- positional arguments:
url_list a list of URLs to parse (default: None)
- optional arguments:
- -h, --help
show this help message and exit
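example (placeholder URLs; as with the robots command, the output table is assumed to go to standard output):
advertools urls https://example.com/ https://example.com/category/product-page > url_components.csv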
crawl a list of known URLs using the HEAD method
usage:
advertools headers [-h] [-s [CUSTOM_SETTINGS ...]] [url_list ...] output_file
crawl a list of known URLs using the HEAD method; return status codes and all available response headers
- positional arguments:
url_list a list of URLs (default: None)
output_file filepath - where to save the output (.jl)
- optional arguments:
- -h, --help
show this help message and exit
- -s [CUSTOM_SETTINGS ...], --custom-settings [CUSTOM_SETTINGS ...]
settings that modify the behavior of the crawler; settings should be separated by spaces, and each setting name and value should be separated by an equal sign '=' without spaces between them
example:
- advertools headers https://example.com example.jl --custom-settings LOG_FILE=logs.log CLOSESPIDER_TIMEOUT=20
(default: None)
parse, compress and convert a log file to a DataFrame in the .parquet format
usage:
advertools logs [-h] [-f [FIELDS ...]] log_file output_file errors_file log_format
parse, compress and convert a log file to a DataFrame in the .parquet format
- positional arguments:
log_file filepath - the log file
output_file filepath - where to save the output (.parquet)
errors_file filepath - where to save the error lines (.txt)
log_format the format of the logs; available defaults are: common, combined, common_with_vhost, nginx_error, apache_error. Supply a special regex instead if you have a different format
- optional arguments:
- -h, --help
show this help message and exit
- -f [FIELDS ...], --fields [FIELDS ...]
in case you have a special log format, provide a list of the field names, which will become column names in the parsed compressed file (default: None)
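example (file names are placeholders), parsing a log file in the combined format:
advertools logs access.log logs_output.parquet log_errors.txt combined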
perform a reverse DNS lookup on a list of IP addresses
usage:
advertools dns [-h] [ip_list ...]
perform a reverse DNS lookup on a list of IP addresses
- positional arguments:
ip_list a list of IP addresses (default: None)
- optional arguments:
- -h, --help
show this help message and exit
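example (example IP addresses and a placeholder file name; as with the robots command, the output table is assumed to go to standard output):
advertools dns 8.8.8.8 1.1.1.1 > host_names.csv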
generate a table of SEM keywords by supplying a list of products and a list of intent words
usage:
advertools semkw [-h] [-t [{exact,phrase,modified,broad} ...]] [-l MAX_LEN] [-c {0,1}] [-m {0,1}] [-n CAMPAIGN_NAME] products words
generate a table of SEM keywords by supplying a list of products and a list of intent words
- positional arguments:
products a file containing the products that you sell, one per line
words a file containing the intent words/phrases that you want to combine with products, one per line
- optional arguments:
- -h, --help
show this help message and exit
- -t [{exact,phrase,modified,broad} ...], --match-types [{exact,phrase,modified,broad} ...]
- -l MAX_LEN, --max-len MAX_LEN
the number of words that should be combined with products (default: 3)
- -c {0,1}, --capitalize-adgroups {0,1}
whether or not to capitalize ad group names in the output file (default: 1)
- -m {0,1}, --order-matters {0,1}
do you want combinations and permutations, or just combinations? "buy product" and "product buy" or just "buy product"? (default: 1)
- -n CAMPAIGN_NAME, --campaign-name CAMPAIGN_NAME
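example (file names are placeholders; the keyword table is assumed to go to standard output), generating exact and phrase match keywords:
advertools semkw products.txt words.txt --match-types exact phrase --max-len 2 > sem_keywords.csv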
get stopwords of the selected language
usage:
advertools stopwords [-h] {arabic,azerbaijani,bengali,catalan,chinese,croatian,danish,dutch,english,finnish,french,german,greek,hebrew,hindi,hungarian,indonesian,irish,italian,japanese,kazakh,nepali,norwegian,persian,polish,portuguese,romanian,russian,sinhala,spanish,swedish,tagalog,tamil,tatar,telugu,thai,turkish,ukrainian,urdu,vietnamese}
get stopwords of the selected language
- positional arguments:
{arabic,azerbaijani,bengali,catalan,chinese,croatian,danish,dutch,english,finnish,french,german,greek,hebrew,hindi,hungarian,indonesian,irish,italian,japanese,kazakh,nepali,norwegian,persian,polish,portuguese,romanian,russian,sinhala,spanish,swedish,tagalog,tamil,tatar,telugu,thai,turkish,ukrainian,urdu,vietnamese}
- optional arguments:
- -h, --help
show this help message and exit
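example:
advertools stopwords english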
get word counts of a text list optionally weighted by a number list
usage:
advertools wordfreq [-h] [-n NUMBER_LIST] [-r REGEX] [-l PHRASE_LEN] [-s [STOPWORDS ...]] [text_list ...]
get word counts of a text list optionally weighted by a number list
words (tokens) can be tokenized using any pattern with the --regex option, and word/phrase lengths can also be modified using the --phrase-len option
- positional arguments:
text_list a text list, one document (sentence, tweet, etc.) per line (default: None)
- optional arguments:
- -h, --help
show this help message and exit
- -n NUMBER_LIST, --number-list NUMBER_LIST
filepath - a file containing the number list, one number per line (default: None)
- -r REGEX, --regex REGEX
a regex to tokenize words (default: None)
- -l PHRASE_LEN, --phrase-len PHRASE_LEN
the phrase (token) length to split words (the n in n-grams) (default: 1)
- -s [STOPWORDS ...], --stopwords [STOPWORDS ...]
a list of stopwords to exclude when counting; defaults to English stopwords. Run advertools stopwords english to get them, and change the language to get other stopwords (default: None)
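example (file names are placeholders; the text list is read from standard input here, and the counts are assumed to go to standard output), counting two-word phrases:
advertools wordfreq --phrase-len 2 < text_list.txt > word_counts.csv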
search for emoji using a regex
usage:
advertools emoji [-h] regex
search for emoji using a regex
- positional arguments:
regex pattern to search for emoji
- optional arguments:
- -h, --help
show this help message and exit
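example, searching for emoji that match the pattern "heart":
advertools emoji heart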
tokenize documents (phrases, keywords, tweets, etc.) into tokens of the desired length
usage:
advertools tokenize [-h] [-l LENGTH] [-s SEPARATOR] [text_list ...]
tokenize documents (phrases, keywords, tweets, etc.) into tokens of the desired length
- positional arguments:
text_list filepath - a file containing the text list, one document (sentence, tweet, etc.) per line (default: None)
- optional arguments:
- -h, --help
show this help message and exit
- -l LENGTH, --length LENGTH
the length of tokens (the n in n-grams) (default: 1)
- -s SEPARATOR, --separator SEPARATOR
the character with which to separate the tokens (default: ,)
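example (file names are placeholders; the output is assumed to go to standard output), splitting each document into 2-word tokens:
advertools tokenize text_list.txt --length 2 > tokens.csv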
SEO crawler
usage:
advertools crawl [-h] [-l FOLLOW_LINKS] [-d [ALLOWED_DOMAINS ...]] [--exclude-url-params [EXCLUDE_URL_PARAMS ...]] [--include-url-params [INCLUDE_URL_PARAMS ...]] [--exclude-url-regex EXCLUDE_URL_REGEX] [--include-url-regex INCLUDE_URL_REGEX] [--css-selectors [CSS_SELECTORS ...]] [--xpath-selectors [XPATH_SELECTORS ...]] [--custom-settings [CUSTOM_SETTINGS ...]] [url_list ...] output_file
SEO crawler
- positional arguments:
url_list one or more URLs to crawl (default: None)
output_file filepath - where to save the output (.jl)
- optional arguments:
- -h, --help
show this help message and exit
- -l FOLLOW_LINKS, --follow-links FOLLOW_LINKS
whether or not to follow links encountered on crawled pages (default: 0)
- -d [ALLOWED_DOMAINS ...], --allowed-domains [ALLOWED_DOMAINS ...]
while following links, only links on these domains will be followed (default: None)
- --exclude-url-params [EXCLUDE_URL_PARAMS ...]
a list of URL parameters to exclude while following links; if a link contains any of those parameters, don't follow it. Setting it to True will exclude links containing any parameter (default: None)
- --include-url-params [INCLUDE_URL_PARAMS ...]
a list of URL parameters to include while following links; if a link contains any of those parameters, follow it. Having the same parameters to include and exclude raises an error (default: None)
- --exclude-url-regex EXCLUDE_URL_REGEX
a regular expression of a URL pattern to exclude while following links; if a link matches the regex, don't follow it (default: None)
- --include-url-regex INCLUDE_URL_REGEX
a regular expression of a URL pattern to include while following links; if a link matches the regex, follow it (default: None)
- --css-selectors [CSS_SELECTORS ...]
a dictionary mapping names to CSS selectors; the names will become column headers, and the selectors will be used to extract the required data/content (default: None)
- --xpath-selectors [XPATH_SELECTORS ...]
a dictionary mapping names to XPath selectors; the names will become column headers, and the selectors will be used to extract the required data/content (default: None)
- --custom-settings [CUSTOM_SETTINGS ...]
a dictionary of optional custom settings that you might want to add to the spider's functionality. There are over 170 settings for all kinds of options; for details please refer to the Scrapy spider settings: https://docs.scrapy.org/en/latest/topics/settings.html (default: None)
crawl a website starting from its home page:
advertools crawl https://example.com example_output.jl --follow-links 1
crawl a list of pages (list mode):
advertools crawl url_1 url_2 url_3 example_output.jl
OR if you have a long list in a file (url_list.txt):
advertools crawl < url_list.txt example_output.jl
stop crawling after having processed 1,000 pages:
advertools crawl https://example.com example_output.jl --follow-links 1 --custom-settings CLOSESPIDER_PAGECOUNT=1000