Image Crawler and Downloader
Experimental feature - expect changes
This is a crawler that downloads all images on a given list of URLs. Using
crawl_images()
is straightforward:
>>> import advertools as adv
>>> adv.crawl_images([URL_1, URL_2, URL_3, ...], "output_dir")
This would go to the supplied URLs, download all images found on those pages, and
place them in output_dir.
You can set a few conditions to modify the behavior:
min_width
: The minimum width in pixels for an image to be downloaded. This is mainly to avoid downloading logos, tracking pixels, navigational elements rendered as images, and so on.
min_height
: The minimum height in pixels for an image to be downloaded.
include_img_regex
: A regular expression that the image path needs to match for it to be downloaded. After checking the patterns of the image URLs, for example, you might want to download only images that contain "sports", or any other pattern. Or maybe the images of interest are under the /economy/ folder and you only want those.
custom_settings
: Just like other crawl functions, you can set any custom settings you want to control the crawler's behavior. Some examples include changing the User-Agent, (dis)obeying robots.txt rules, and so on. More options and code details can be found in the crawling strategies page.
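As a rough illustration of how the include_img_regex condition behaves (the actual matching happens inside the spider; the pattern and URLs here are made up):

```python
import re

# Hypothetical image URLs found on a crawled page
img_urls = [
    "https://example.com/economy/chart.png",
    "https://example.com/static/logo.svg",
    "https://example.com/economy/inflation.jpg",
]

# Keep only images whose path matches the pattern, mirroring
# what include_img_regex="/economy/" would select
selected = [url for url in img_urls if re.search("/economy/", url)]
```

Here `selected` keeps only the two images under the /economy/ folder; the logo is skipped.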
To run the crawl_images()
function you need to set an output_dir
. This is
where all images will be downloaded. You also get a summary file with details about the
downloaded images. You can read this file through the special function
summarize_crawled_imgs()
to get a few more details about those images.
>>> adv.summarize_crawled_imgs("path/to/output_dir")
Image file names
The downloaded images naturally need to be given names, and each name is taken from
the slug of the image URL, excluding any query parameters or slashes.
The full URLs of those images can be found in the summary file, and you can access
them through summarize_crawled_imgs()
. This also shows the page where each image was
found, as you can see in the table above.
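The naming rule can be sketched with a small hypothetical helper (image_filename is not part of advertools; it only mirrors the slug rule described above):

```python
from urllib.parse import urlsplit

def image_filename(img_url):
    # Hypothetical helper: take the last path segment of the image URL
    # (its slug), dropping query parameters and trailing slashes
    path = urlsplit(img_url).path
    return path.rstrip("/").rsplit("/", 1)[-1]

image_filename("https://example.com/imgs/team-photo.jpg?width=600")
```

Under this rule, the query string `?width=600` and the `/imgs/` folder are both dropped, leaving `team-photo.jpg` as the file name.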
- class AdvImagesPipeline(store_uri: str | PathLike[str], download_func: Callable[[Request, Spider], Response] | None = None, settings: Settings | dict[str, Any] | None = None, *, crawler: Crawler | None = None)[source]
Bases:
ImagesPipeline
- class ImageSpider(*args: Any, **kwargs: Any)[source]
Bases:
Spider
- custom_settings: dict[_SettingsKeyT, Any] | None = {'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 8, 'HTTPERROR_ALLOW_ALL': True, 'ITEM_PIPELINES': {<class 'advertools.image_spider.AdvImagesPipeline'>: 1}, 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'advertools/0.16.4'}
- include_img_regex = None
- name: str = 'image_spider'
- class ImgItem(*args: Any, **kwargs: Any)[source]
Bases:
Item
- fields: dict[str, Field] = {'image_location': {}, 'image_urls': {}, 'images': {}}
A dictionary containing all declared fields for this Item, not only those populated. The keys are the field names and the values are the
Field
objects used in the Item declaration.
- crawl_images(start_urls, output_dir, min_width=0, min_height=0, include_img_regex=None, custom_settings=None)[source]
Download all images available on start_urls and save them to output_dir.
THIS FUNCTION IS STILL EXPERIMENTAL. Expect many changes.
- Parameters:
start_urls (list) -- A list of URLs from which you want to download available images.
output_dir (str) -- The directory where you want the images to be saved.
min_width (int) -- The minimum width in pixels for an image to be downloaded.
min_height (int) -- The minimum height in pixels for an image to be downloaded.
include_img_regex (str) -- A regular expression to select image src URLs. Use this to restrict downloads to image files whose path matches this regex.
custom_settings (dict) -- Additional settings to customize the crawling behavior.
- summarize_crawled_imgs(image_dir)[source]
Provide a DataFrame of image locations and image URLs resulting from crawl_images.
Running the crawl_images function creates a summary CSV file of the downloaded images. This function parses that file and provides a two-column DataFrame:
image_location: The URL of the page from which the image was downloaded.
image_urls: The URL of the image file that was downloaded.
- Parameters:
image_dir (str) -- The path to the directory that you provided to crawl_images
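Conceptually, the summary file is a two-column CSV as described above. A minimal sketch of parsing such a file with the standard library (the file contents here are made up; in practice you would simply call summarize_crawled_imgs):

```python
import csv
import io

# A made-up summary file matching the two columns described above
summary_csv = io.StringIO(
    "image_location,image_urls\n"
    "https://example.com/economy/,https://example.com/economy/chart.png\n"
    "https://example.com/economy/,https://example.com/economy/inflation.jpg\n"
)

# Each row maps the crawled page (image_location) to an image
# that was downloaded from it (image_urls)
rows = list(csv.DictReader(summary_csv))
```

summarize_crawled_imgs returns the same information as a DataFrame, with one row per downloaded image.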