advertools.cli_docs.cli module
Convert a robots.txt file (or a list of robots.txt URLs) to a table in CSV format
Usage
advertools robots [-h] [url ...]
Convert a robots.txt file (or a list of robots.txt URLs) to a table in CSV format.
You can provide a web URL or a local file URL, for example:
file:///Users/path/to/robots.txt
Examples
Single robots.txt file:
advertools robots https://www.google.com/robots.txt
Multiple robots.txt files:
advertools robots https://www.google.com/robots.txt https://www.google.jo/robots.txt https://www.google.es/robots.txt
Save to a CSV file using output redirection:
advertools robots https://www.google.com/robots.txt > google_robots.csv
Run the command on a long list of robots.txt URLs saved in a text file (robotslist.txt):
advertools robots < robotslist.txt > multi_robots.csv
Arguments
Positional arguments:
url: A robots.txt URL (or a list of URLs). Default: None.
Optional arguments:
-h, --help: Show this help message and exit.
Download, parse, and save an XML sitemap to a table in a CSV file
Usage
advertools sitemaps [-h] [-r {0,1}] [-s SEPARATOR] [sitemap_url]
Positional arguments:
sitemap_url: The URL of the XML sitemap (regular or sitemap index). Default: None.
Optional arguments:
-h, --help: Show this help message and exit.
-r {0,1}, --recursive {0,1}: Whether or not to fetch sub-sitemaps if the URL points to a sitemap index file. Default: 1.
-s SEPARATOR, --separator SEPARATOR: The separator with which to separate columns of the output. Default: ,.
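Examples
Single sitemap, assuming the output table is written to standard output as with the robots command (hypothetical URL and filename):
advertools sitemaps https://www.example.com/sitemap.xml > sitemap.csv
Sitemap index without fetching its sub-sitemaps:
advertools sitemaps -r 0 https://www.example.com/sitemap_index.xml > sitemap_index.csv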
Split a list of URLs into their components: scheme, netloc, path, query, etc.
Usage
advertools urls [-h] [url_list ...]
Positional arguments:
url_list: A list of URLs to parse. Default: None.
Optional arguments:
-h, --help: Show this help message and exit.
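Example
Split a few URLs into their components, assuming the resulting table is written to standard output (hypothetical URLs and filename):
advertools urls "https://www.example.com/page?color=blue" https://www.example.com/about > url_components.csv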
Crawl a list of known URLs using the HEAD method
Usage
advertools headers [-h] [-s [CUSTOM_SETTINGS ...]] [url_list ...] output_file
Positional arguments:
url_list: A list of URLs. Default: None.
output_file: Filepath - where to save the output (.jl).
Optional arguments:
-h, --help: Show this help message and exit.
-s [CUSTOM_SETTINGS ...], --custom-settings [CUSTOM_SETTINGS ...]: Settings that modify the behavior of the crawler. Settings should be separated by spaces, and each setting name and value should be separated by an equal sign = without spaces between them.
Example:
advertools headers https://example.com example.jl --custom-settings LOG_FILE=logs.log CLOSESPIDER_TIMEOUT=20
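Assuming the URL list can also be redirected from a text file, as with the robots and crawl commands (hypothetical filenames):
advertools headers < url_list.txt headers_output.jl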
Parse, compress and convert a log file to a DataFrame in the .parquet format
Usage
advertools logs [-h] [-f [FIELDS ...]] log_file output_file errors_file log_format
Positional arguments:
log_file: Filepath - the log file.
output_file: Filepath - where to save the output (.parquet).
errors_file: Filepath - where to save the error lines (.txt).
log_format: The format of the logs. Available defaults are: common, combined, common_with_vhost, nginx_error, apache_error. Supply a special regex instead if you have a different format.
Optional arguments:
-h, --help: Show this help message and exit.
-f [FIELDS ...], --fields [FIELDS ...]: Provide a list of field names to become column names in the parsed compressed file. Default: None.
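Example
Parse an access log in the combined format (hypothetical filenames):
advertools logs access.log access_logs.parquet log_errors.txt combined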
Perform a reverse DNS lookup on a list of IP addresses
Usage
advertools dns [-h] [ip_list ...]
Description
Perform a reverse DNS lookup on a list of IP addresses.
Positional arguments:
ip_list: A list of IP addresses. Default: None.
Optional arguments:
-h, --help: Show this help message and exit.
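Example
Reverse-resolve a few addresses, assuming the resulting table is written to standard output (the IP addresses are public DNS resolvers used for illustration; the filename is hypothetical):
advertools dns 8.8.8.8 1.1.1.1 > host_names.csv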
Generate a table of SEM keywords by supplying a list of products and a list of intent words
Usage
advertools semkw [-h] [-t [{exact,phrase,modified,broad} ...]] [-l MAX_LEN] [-c {0,1}] [-m {0,1}] [-n CAMPAIGN_NAME] products words
Description
Generate a table of SEM keywords by supplying a list of products and a list of intent words.
Positional arguments:
products: A file containing the products that you sell, one per line.
words: A file containing the intent words/phrases that you want to combine with products, one per line.
Optional arguments:
-h, --help: Show this help message and exit.
-t [{exact,phrase,modified,broad} ...]: Specify match types.
-l MAX_LEN, --max-len MAX_LEN: Number of words to combine with products. Default: 3.
-c {0,1}, --capitalize-adgroups {0,1}: Whether or not to capitalize ad group names in the output file. Default: 1.
-m {0,1}, --order-matters {0,1}: Whether the order of words matters; if 1, generate permutations as well as combinations, otherwise combinations only. Default: 1.
-n CAMPAIGN_NAME, --campaign-name CAMPAIGN_NAME: The campaign name.
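Example
Generate exact and phrase match keywords, assuming the keyword table is written to standard output (hypothetical filenames and campaign name):
advertools semkw products.txt words.txt -t exact phrase -n SEM_Campaign > sem_keywords.csv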
Get stopwords of the selected language
Usage
advertools stopwords [-h] {arabic,azerbaijani,bengali,catalan,chinese,croatian,danish,dutch,english,finnish,french,german,greek,hebrew,hindi,hungarian,indonesian,irish,italian,japanese,kazakh,nepali,norwegian,persian,polish,portuguese,romanian,russian,sinhala,spanish,swedish,tagalog,tamil,tatar,telugu,thai,turkish,ukrainian,urdu,vietnamese}
Description
Get stopwords of the selected language.
Positional arguments:
{arabic,azerbaijani,bengali,catalan,chinese,croatian,danish,dutch,english,finnish,french,german,greek,hebrew,hindi,hungarian,indonesian,irish,italian,japanese,kazakh,nepali,norwegian,persian,polish,portuguese,romanian,russian,sinhala,spanish,swedish,tagalog,tamil,tatar,telugu,thai,turkish,ukrainian,urdu,vietnamese}: The language to retrieve stopwords for.
Optional arguments:
-h, --help: Show this help message and exit.
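Example
Print the English stopword list (output assumed to go to standard output):
advertools stopwords english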
Get word counts of a text list, optionally weighted by a number list
Usage
advertools wordfreq [-h] [-n NUMBER_LIST] [-r REGEX] [-l PHRASE_LEN] [-s [STOPWORDS ...]] [text_list ...]
Description
Get word counts of a text list, optionally weighted by a number list.
Positional arguments:
text_list: A text list, one document (sentence, tweet, etc.) per line. Default: None.
Optional arguments:
-h, --help: Show this help message and exit.
-n NUMBER_LIST, --number-list NUMBER_LIST: Filepath for a file containing the number list, one number per line. Default: None.
-r REGEX, --regex REGEX: A regex to tokenize words. Default: None.
-l PHRASE_LEN, --phrase-len PHRASE_LEN: The phrase (token) length to split words (the n in n-grams). Default: 1.
-s [STOPWORDS ...], --stopwords [STOPWORDS ...]: A list of stopwords to exclude. Default: None.
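Example
Count two-word phrases, assuming the text list can be redirected from a file as with the other commands (hypothetical filenames):
advertools wordfreq -l 2 < tweets.txt > word_frequencies.csv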
Search for emoji using a regex
Usage
advertools emoji [-h] regex
Description
Search for emoji using a regex.
Positional arguments:
regex: Pattern to search for emoji.
Optional arguments:
-h, --help: Show this help message and exit.
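Example
Search for emoji matching a pattern (the pattern is illustrative):
advertools emoji heart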
Tokenize documents (phrases, keywords, tweets, etc.) into tokens of the desired length
Usage
advertools tokenize [-h] [-l LENGTH] [-s SEPARATOR] [text_list ...]
Description
Tokenize documents (phrases, keywords, tweets, etc.) into tokens of the desired length.
Positional arguments:
text_list: Filepath containing the text list, one document (sentence, tweet, etc.) per line. Default: None.
Optional arguments:
-h, --help: Show this help message and exit.
-l LENGTH, --length LENGTH: Length of tokens (the n in n-grams). Default: 1.
-s SEPARATOR, --separator SEPARATOR: Character to separate the tokens. Default: ,.
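Example
Split each document into two-word tokens, assuming the output is written to standard output (hypothetical filenames):
advertools tokenize keywords.txt -l 2 > bigrams.csv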
SEO crawler
Usage
advertools crawl [-h] [-l FOLLOW_LINKS] [-d [ALLOWED_DOMAINS ...]]
[--exclude-url-params [EXCLUDE_URL_PARAMS ...]]
[--include-url-params [INCLUDE_URL_PARAMS ...]]
[--exclude-url-regex EXCLUDE_URL_REGEX]
[--include-url-regex INCLUDE_URL_REGEX]
[--css-selectors [CSS_SELECTORS ...]]
[--xpath-selectors [XPATH_SELECTORS ...]]
[--custom-settings [CUSTOM_SETTINGS ...]]
[url_list ...] output_file
Description
SEO crawler to extract and collect information from web pages.
Positional arguments:
url_list: One or more URLs to crawl. Default: None.
output_file: Filepath - where to save the output (.jl).
Optional arguments:
-h, --help: Show this help message and exit.
-l FOLLOW_LINKS, --follow-links FOLLOW_LINKS: Whether or not to follow links encountered on crawled pages. Default: 0.
-d [ALLOWED_DOMAINS ...], --allowed-domains [ALLOWED_DOMAINS ...]: Only follow links on these domains while crawling. Default: None.
--exclude-url-params [EXCLUDE_URL_PARAMS ...]: List of URL parameters to exclude while following links. If a link contains any of these parameters, it will not be followed. Default: None.
--include-url-params [INCLUDE_URL_PARAMS ...]: List of URL parameters to include while following links. If a link contains any of these parameters, it will be followed. Default: None.
--exclude-url-regex EXCLUDE_URL_REGEX: A regular expression pattern to exclude certain URLs from being followed. Default: None.
--include-url-regex INCLUDE_URL_REGEX: A regular expression pattern to include certain URLs to be followed. Default: None.
--css-selectors [CSS_SELECTORS ...]: A dictionary mapping column names to CSS selectors. The selectors will be used to extract the required data/content. Default: None.
--xpath-selectors [XPATH_SELECTORS ...]: A dictionary mapping column names to XPath selectors. The selectors will be used to extract the required data/content. Default: None.
--custom-settings [CUSTOM_SETTINGS ...]: Custom settings to modify the behavior of the crawler. Over 170 settings are available for configuration. See Scrapy documentation for details: https://docs.scrapy.org/en/latest/topics/settings.html. Default: None.
Examples
Crawl a website starting from its homepage:
advertools crawl https://example.com example_output.jl --follow-links 1
Crawl a list of pages in list mode:
advertools crawl url_1 url_2 url_3 example_output.jl
If you have a long list of URLs in a file (url_list.txt):
advertools crawl < url_list.txt example_output.jl
Stop crawling after processing 1,000 pages:
advertools crawl https://example.com example_output.jl --follow-links 1 --custom-settings CLOSESPIDER_PAGECOUNT=1000