Scaling crawlers
As we build our crawler, we may want to control how many tasks it performs at any given time, that is, how many requests it makes to the websites we are scraping. Crawlee offers several options to fine-tune the number of parallel tasks, limit the number of requests per minute, and optimize scaling based on available system resources.
All of these options are available across all crawlers provided by Crawlee. In this guide, we are using the BeautifulSoupCrawler as an example. You should also explore the ConcurrencySettings class.
Max tasks per minute
The max_tasks_per_minute setting in ConcurrencySettings controls how many total tasks the crawler can process per minute. It ensures that tasks are spread evenly throughout the minute, preventing a sudden burst at the max_concurrency limit followed by idle time. By default, it is set to Infinity, meaning the crawler can run at full speed, limited only by max_concurrency. Use this option if you want to throttle your crawler to avoid overwhelming the target website with continuous requests.
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler


async def main() -> None:
    concurrency_settings = ConcurrencySettings(
        # Set the maximum number of concurrent requests the crawler can run to 100.
        max_concurrency=100,
        # Limit the total number of requests to 10 per minute to avoid overwhelming
        # the target website.
        max_tasks_per_minute=10,
    )

    crawler = BeautifulSoupCrawler(
        # Apply the defined concurrency settings to the crawler.
        concurrency_settings=concurrency_settings,
    )


if __name__ == '__main__':
    asyncio.run(main())
Minimum and maximum concurrency
The min_concurrency and max_concurrency options in the ConcurrencySettings define the minimum and maximum number of parallel tasks that can run at any given time. By default, crawlers start with a single parallel task and gradually scale up to the limit set by max_concurrency.
If you set min_concurrency too high compared to the available system resources, the crawler may run very slowly or even crash. It is recommended to stick with the default value and let the crawler automatically adjust concurrency based on the system's available resources.
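A minimal sketch of that recommendation: cap only the upper bound and leave the minimum at its default so the crawler can scale itself. The value of 50 is arbitrary and only illustrative.

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler


async def main() -> None:
    # Keep min_concurrency at its default and only cap the upper bound;
    # the crawler scales up automatically while resources allow it.
    concurrency_settings = ConcurrencySettings(max_concurrency=50)

    crawler = BeautifulSoupCrawler(
        concurrency_settings=concurrency_settings,
    )


if __name__ == '__main__':
    asyncio.run(main())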
Desired concurrency
The desired_concurrency option in the ConcurrencySettings specifies the initial number of parallel tasks to start with, assuming sufficient resources are available. It defaults to the same value as min_concurrency.
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler


async def main() -> None:
    concurrency_settings = ConcurrencySettings(
        # Start with 8 concurrent tasks, as long as resources are available.
        desired_concurrency=8,
        # Maintain a minimum of 5 concurrent tasks to ensure steady crawling.
        min_concurrency=5,
        # Limit the maximum number of concurrent tasks to 10 to prevent
        # overloading the system.
        max_concurrency=10,
    )

    crawler = BeautifulSoupCrawler(
        # Use the configured concurrency settings for the crawler.
        concurrency_settings=concurrency_settings,
    )


if __name__ == '__main__':
    asyncio.run(main())
Autoscaled pool
The AutoscaledPool manages a pool of asynchronous, resource-intensive tasks that run in parallel. It automatically starts new tasks only when there is enough free CPU and memory. To monitor system resources, it leverages the Snapshotter and SystemStatus classes. If any task raises an exception, the error is propagated and the pool is stopped. Every crawler uses an AutoscaledPool under the hood.
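In practice you rarely create an AutoscaledPool yourself; you pass ConcurrencySettings to a crawler and its internal pool handles the scaling while Snapshotter and SystemStatus monitor resources. The following is a minimal end-to-end sketch under that assumption; the start URL and the request handler are purely illustrative.

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # The crawler's internal AutoscaledPool applies these settings while
    # monitoring CPU and memory through Snapshotter and SystemStatus.
    crawler = BeautifulSoupCrawler(
        concurrency_settings=ConcurrencySettings(max_concurrency=10),
    )

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Illustrative handler: log the URL of each page being crawled.
        context.log.info(f'Crawling {context.request.url}')

    # Illustrative start URL.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())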