Image Crawler and Downloader

Experimental feature - expect changes

This is a crawler that downloads all images found on a given list of URLs. Using crawl_images() is straightforward:

>>> import advertools as adv
>>> adv.crawl_images([URL_1, URL_2, URL_3, ...], "output_dir")

This would go to the supplied URLs and download all images found on those URLs, and place them in output_dir.

You can set a few conditions to modify the behavior:

  • min_width: The minimum width in pixels for an image to be downloaded. This is mainly to avoid downloading logos, tracking pixels, navigational elements, and other small images.

  • min_height: The minimum height in pixels for an image to be downloaded.

  • include_img_regex: A regular expression that the image path needs to match in order for it to be downloaded. For example, after inspecting the URL patterns of the images on a site, you might want to download only images whose paths contain "sports", or only images under the /economy/ folder.

  • custom_settings: Just like other crawl functions, you can set any custom settings you want to control the crawler's behavior. Examples include changing the User-Agent and (dis)obeying robots.txt rules. More options and code details can be found in the crawling strategies page.
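To illustrate the include_img_regex condition, here is a sketch of how a pattern such as "/economy/" selects matching image URLs, using Python's re module (the URLs are made up for the example, and this is not the library's internal implementation):

```python
import re

# Hypothetical image URLs found on a crawled page.
image_urls = [
    "https://example.com/images/sports/match-1.jpg",
    "https://example.com/images/economy/chart.png",
    "https://example.com/static/logo.svg",
]

# Keep only images whose URL matches the pattern, which is what
# include_img_regex restricts downloads to.
pattern = "/economy/"
selected = [url for url in image_urls if re.search(pattern, url)]
print(selected)
```

With this pattern, only the image under the /economy/ folder would be downloaded.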

To run the crawl_images() function you need to set an output_dir, which is where all images will be downloaded. A summary file with details about the downloaded images is saved there as well. You can read this file through the special function summarize_crawled_imgs() to get a few more details about those images.

>>> adv.summarize_crawled_imgs("path/to/output_dir")

   image_location                                                                 image_urls
0  https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh  https://img.buzzfeed.com/buzzfeed-static/static/user_images/6r1oxXOpC_large.jpg?downsize=120:*&output-format=jpg&output-quality=auto
0  https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh  https://img.buzzfeed.com/buzzfeed-static/static/2024-03/18/16/asset/fce856744ed8/sub-buzz-1303-1710779249-1.jpg
0  https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh  data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
0  https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh  https://img.buzzfeed.com/buzzfeed-static/static/2024-03/18/16/asset/245ecfa321e9/sub-buzz-894-1710779358-1.jpg
1  https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh  https://img.buzzfeed.com/buzzfeed-static/static/2017-12/12/13/user_images/buzzfeed-prod-web-03/chelseastewart-v2-5590-1513102854-0_large.jpg?downsize=120:*&output-format=jpg&output-quality=auto
1  https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh  https://img.buzzfeed.com/buzzfeed-static/static/2024-03/21/19/asset/ea6298160040/sub-buzz-1093-1711048323-1.jpg?downsize=700%3A%2A&output-quality=auto&output-format=auto
1  https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh  data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
1  https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh  data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFQAAAA7CAMAAADSF118AAAAP1BMVEUAAADIGxPOHBK5EwDFGhi5Fwi8GRTEGhe7EQDMHR7////vyMfddnm5Hx334+Py8fHdj5DLVVXnq6zJOTzVbG1s8SkwAAAACXRSTlMAv4Eo10JnqA8IHfydAAABJUlEQVRYw93Y64rCMBCG4czk5FSzdav3f63bDaxfV4Qm+AXR96/wMNj0kLhtPib9LcutYA8K+F1rKXqH4KmIPZVIOvwnszEqumFjMVLB3+YsRiv8zRqMWHa1ZNQiBuUV3Jo3cn5FlY3qimY2KitajB3+UmLRxRGovgmqTj4HXc69aN5Hj9PcyYqzfXSavk58tJMNTWgv24pW9kpE0fGbioKlomCZKNgLEUXLhYiiMx+dT+xJ8SxgoCDZ6EJcp7jsPBQLlIbiVmpEwy7aS1poeZ30PvqlAQVJRGeQtLfp1dBLPyb0bdDER+OYL2nHR7E34yUjtjw6ZMc3am/KXlSpoodCHrQWiWbxI85Q6Kc9pneHSCmHJ0VJGPPuAC3LWqO/OURL0aEfg76m8Izrt6EAAAAASUVORK5CYII=
2  https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh  https://img.buzzfeed.com/buzzfeed-static/static/2021-06/3/16/user_images/a824550933a9/tomiobaro-v2-2174-1622738336-41_large.jpg?downsize=120:*&output-format=jpg&output-quality=auto
2  https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh  https://img.buzzfeed.com/buzzfeed-static/static/2024-03/19/13/asset/6634db63f453/sub-buzz-576-1710855734-6.jpg?downsize=700%3A%2A&output-quality=auto&output-format=auto
2  https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh  https://img.buzzfeed.com/buzzfeed-static/static/2024-03/19/13/asset/cb8db05df7e7/sub-buzz-1743-1710855790-4.jpg
2  https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh  data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7

Image file names

Downloaded images naturally need to be given file names. Each name is taken from the slug of the image URL (its last path segment), excluding any query parameters or slashes. The full URLs of those images are kept in the summary file, and you can access them through summarize_crawled_imgs(). As the table above shows, the summary also records the page on which each image was found.
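The naming rule described above can be sketched roughly as follows, using one of the image URLs from the table (a simplified illustration of the described behavior, not the library's exact implementation):

```python
from urllib.parse import urlsplit

def image_file_name(image_url):
    # urlsplit separates the path from the query string, so query
    # parameters are dropped; keep the last path segment (the slug).
    return urlsplit(image_url).path.rstrip("/").rsplit("/", 1)[-1]

name = image_file_name(
    "https://img.buzzfeed.com/buzzfeed-static/static/user_images/"
    "6r1oxXOpC_large.jpg?downsize=120:*&output-format=jpg&output-quality=auto"
)
print(name)  # 6r1oxXOpC_large.jpg
```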

class AdvImagesPipeline(store_uri, download_func=None, settings=None)[source]

Bases: ImagesPipeline

file_path(request, response=None, info=None, *, item=None)[source]

Returns the path where downloaded media should be stored

class ImageSpider(*args: Any, **kwargs: Any)[source]

Bases: Spider

custom_settings: dict | None = {'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 8, 'HTTPERROR_ALLOW_ALL': True, 'ITEM_PIPELINES': {<class 'advertools.image_spider.AdvImagesPipeline'>: 1}, 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'advertools/0.14.2'}
include_img_regex = None
name: str = 'image_spider'
parse(response)[source]
start_requests()[source]
class ImgItem(*args: Any, **kwargs: Any)[source]

Bases: Item

fields: Dict[str, Field] = {'image_location': {}, 'image_urls': {}, 'images': {}}
crawl_images(start_urls, output_dir, min_width=0, min_height=0, include_img_regex=None, custom_settings=None)[source]

Download all images available on start_urls and save them to output_dir.

THIS FUNCTION IS STILL EXPERIMENTAL. Expect many changes.

Parameters:
  • start_urls (list) -- A list of URLs from which you want to download available images.

  • output_dir (str) -- The directory where you want the images to be saved.

  • min_width (int) -- The minimum width in pixels for an image to be downloaded.

  • min_height (int) -- The minimum height in pixels for an image to be downloaded.

  • include_img_regex (str) -- A regular expression to select image src URLs. Only images whose src URL matches this regex will be downloaded.

  • custom_settings (dict) -- Additional settings to customize the crawling behaviour.

summarize_crawled_imgs(image_dir)[source]

Provide a DataFrame of image locations and image URLs resulting from crawl_images.

Running the crawl_images function creates a summary CSV file of the downloaded images. This function parses that file and provides a two-column DataFrame:

  • image_location: The URL of the page from which the image was downloaded.

  • image_urls: The URL of the image file that was downloaded.

Parameters:

image_dir (str) -- The path to the output directory that you provided to crawl_images.
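Since the summary also lists inline data: URIs (as in the table above), it is often useful to filter those out of the resulting DataFrame. A small sketch, assuming only the two-column structure described here (the rows are made up for the example):

```python
import pandas as pd

# Hypothetical rows mirroring the structure of the summarize_crawled_imgs output.
df = pd.DataFrame({
    "image_location": [
        "https://example.com/page-1",
        "https://example.com/page-1",
        "https://example.com/page-2",
    ],
    "image_urls": [
        "https://example.com/img/photo-1.jpg",
        "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
        "https://example.com/img/photo-2.jpg",
    ],
})

# Keep only rows whose image URL points to an actual file, not an inline data URI.
real_files = df[~df["image_urls"].str.startswith("data:")]
print(real_files["image_urls"].tolist())
```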