
Scaling crawlers

As we build our crawler, we may want to control how many tasks it performs at any given time, that is, how many requests it makes to the websites we are trying to scrape. Crawlee offers several options to fine-tune the number of parallel tasks, limit the number of requests per minute, and optimize scaling based on available system resources.

tip

All of these options are available across all crawlers provided by Crawlee. In this guide, we use the BeautifulSoupCrawler as an example. You should also explore the ConcurrencySettings class, which groups all of these options.

Max tasks per minute

The max_tasks_per_minute setting in ConcurrencySettings controls how many total tasks the crawler can process per minute. It ensures that tasks are spread evenly throughout the minute, preventing a sudden burst at the max_concurrency limit followed by idle time. By default, this is set to Infinity, meaning the crawler can run at full speed, limited only by max_concurrency. Use this option if you want to throttle your crawler to avoid overwhelming the target website with continuous requests.

import asyncio

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main() -> None:
    concurrency_settings = ConcurrencySettings(
        # Set the maximum number of concurrent requests the crawler can run to 100.
        max_concurrency=100,
        # Limit the total number of requests to 10 per minute to avoid overwhelming
        # the target website.
        max_tasks_per_minute=10,
    )

    crawler = BeautifulSoupCrawler(
        # Apply the defined concurrency settings to the crawler.
        concurrency_settings=concurrency_settings,
    )


if __name__ == '__main__':
    asyncio.run(main())
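
The snippet above only configures the crawler; nothing is fetched yet. To see the throttling in action, you would also register a request handler and start the crawl with crawler.run(). Below is a minimal sketch of that, using https://crawlee.dev purely as a placeholder start URL:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Throttle the crawler to at most 10 requests per minute.
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=10),
    )

    # The default handler runs once for every request the crawler processes.
    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # Start the crawl; requests are spread evenly across each minute.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())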

Minimum and maximum concurrency

The min_concurrency and max_concurrency options in the ConcurrencySettings define the minimum and maximum number of parallel tasks that can run at any given time. By default, crawlers start with a single parallel task and gradually scale up to the maximum number of concurrent requests.

Avoid setting minimum concurrency too high

If you set min_concurrency too high compared to the available system resources, the crawler may run very slowly or even crash. It is recommended to stick with the default value and let the crawler automatically adjust concurrency based on the system's available resources.

Desired concurrency

The desired_concurrency option in the ConcurrencySettings specifies the initial number of parallel tasks to start with, assuming sufficient resources are available. It defaults to the same value as min_concurrency.

import asyncio

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main() -> None:
    concurrency_settings = ConcurrencySettings(
        # Start with 8 concurrent tasks, as long as resources are available.
        desired_concurrency=8,
        # Maintain a minimum of 5 concurrent tasks to ensure steady crawling.
        min_concurrency=5,
        # Limit the maximum number of concurrent tasks to 10 to prevent
        # overloading the system.
        max_concurrency=10,
    )

    crawler = BeautifulSoupCrawler(
        # Use the configured concurrency settings for the crawler.
        concurrency_settings=concurrency_settings,
    )


if __name__ == '__main__':
    asyncio.run(main())

Autoscaled pool

The AutoscaledPool manages a pool of asynchronous, resource-intensive tasks that run in parallel. It automatically starts new tasks only when there is enough free CPU and memory. To monitor system resources, it leverages the Snapshotter and SystemStatus classes. If any task raises an exception, the error is propagated, and the pool is stopped. Every crawler uses an AutoscaledPool under the hood.
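
You normally never create the pool yourself; every crawler constructs one internally from your ConcurrencySettings. Purely to illustrate the idea, here is a simplified, standalone sketch of an autoscaled pool. The run_autoscaled function, the psutil-based memory check, and the fetch placeholder are illustrative stand-ins (assuming psutil is installed), not Crawlee's actual AutoscaledPool, Snapshotter, or SystemStatus API:

import asyncio
from collections.abc import Coroutine
from typing import Any

import psutil  # stand-in for Crawlee's resource snapshots; assumed to be installed


async def run_autoscaled(
    coros: list[Coroutine[Any, Any, None]],
    max_concurrency: int = 10,
    memory_limit_percent: float = 80.0,
) -> None:
    """Run coroutines in parallel, starting new ones only while resources allow."""
    pending: set[asyncio.Task[None]] = set()

    while coros or pending:
        # Start new tasks only below the concurrency cap and while the system
        # still has memory headroom (a crude stand-in for SystemStatus).
        while (
            coros
            and len(pending) < max_concurrency
            and psutil.virtual_memory().percent < memory_limit_percent
        ):
            pending.add(asyncio.create_task(coros.pop(0)))

        if pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                task.result()  # propagate exceptions, like the real pool does
        else:
            await asyncio.sleep(0.1)  # overloaded: wait for resources to free up


async def fetch(i: int) -> None:
    await asyncio.sleep(0.5)  # placeholder for an actual request
    print(f'task {i} finished')


if __name__ == '__main__':
    asyncio.run(run_autoscaled([fetch(i) for i in range(20)]))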