Version: Next

Scaling our crawlers


As we build our crawler, we might want to control how many requests we send to the website at a time. Crawlee provides several options to fine-tune how many parallel requests are made at any time, how many requests are made per minute, and how scaling works based on the available system resources.

tip

All of these options are available on all crawlers Crawlee provides, but for this guide we'll be using the CheerioCrawler. We can see all options that are available here.

`maxRequestsPerMinute`

This controls how many total requests can be made per minute. It counts the number of requests made every second, to ensure there is no burst of requests at the maxConcurrency limit followed by a long period of waiting. By default, it is set to Infinity, which means the crawler will keep scaling up to maxConcurrency. We would set this if we wanted our crawler to work at full throughput without hitting the website we're crawling with non-stop requests.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Let the crawler know it can run up to 100 requests concurrently at any time
    maxConcurrency: 100,
    // ...but also ensure the crawler never exceeds 250 requests per minute
    maxRequestsPerMinute: 250,
});
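One way to picture the per-second counting described above is a sliding window of 60 one-second buckets whose sum must stay below the per-minute limit. The sketch below is an illustration under that assumption, not Crawlee's source code; `PerMinuteBudget`, `tick`, and `tryStart` are hypothetical names.

```javascript
// Illustrative only: a per-minute budget tracked in 60 one-second buckets.
class PerMinuteBudget {
    constructor(maxPerMinute) {
        this.maxPerMinute = maxPerMinute;
        // buckets[0] is the current second; buckets[59] is the oldest second still counted.
        this.buckets = new Array(60).fill(0);
    }

    // Call once per elapsed second to slide the window forward by one bucket.
    tick() {
        this.buckets.unshift(0);
        this.buckets.pop();
    }

    // Records the request and returns true if the last 60 seconds are still under budget.
    tryStart() {
        const usedInWindow = this.buckets.reduce((sum, n) => sum + n, 0);
        if (usedInWindow >= this.maxPerMinute) return false;
        this.buckets[0] += 1;
        return true;
    }
}

// Try to start 300 requests within the same second against a budget of 250.
const budget = new PerMinuteBudget(250);
let started = 0;
for (let i = 0; i < 300; i++) {
    if (budget.tryStart()) started++;
}
// Only 250 of the 300 attempts are admitted; the rest must wait for the window to slide.
```

In this model, further requests are admitted only once older buckets fall out of the 60-second window.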

`minConcurrency` and `maxConcurrency`

These control how many parallel requests can be run at any time. By default, crawlers will start with one parallel request at a time and scale up over time to a maximum of 200 requests at a time.

Don't set minConcurrency too high!

Setting this option too high compared to the available system resources will make your crawler run extremely slowly, or might even crash it.

It's recommended to leave it at the default value and let the crawler scale up and down automatically based on available resources instead.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Start the crawler right away and ensure there are always at least 5 concurrent requests running
    minConcurrency: 5,
    // Ensure the crawler never exceeds 15 concurrent requests running at any time
    maxConcurrency: 15,
});

Advanced options

While the options above should be enough for most users, if we want to dig deeper into the configuration of the autoscaling pool (the internal utility in Crawlee that lets crawlers scale up and down), we can do so through the autoscaledPoolOptions object available in the crawler options.

Complex options up ahead!

This section is super advanced and, unless you test the changes extensively and know what you're doing, it's better to leave these options to their defaults, as they are most likely going to work fine without much fuss.

With that warning aside, if we're feeling adventurous, this is how we would pass these options when using a crawler:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Pass in advanced options by providing them in the autoscaledPoolOptions
    autoscaledPoolOptions: {
        // ...
    },
});

`desiredConcurrency`

This option specifies the number of requests that should run in parallel when the crawler starts, assuming that many are available. It defaults to the same value as minConcurrency.

`desiredConcurrencyRatio`

The minimum ratio of concurrency to reach before more scaling up is allowed (a number between 0 and 1). By default, it is set to 0.95.

We can think of this as the point at which the autoscaling pool can attempt to scale up (or down), monitor whether anything changes, and correct course if necessary.
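As a rough sketch of how such a ratio can gate scaling, the helper below allows a scale-up only when the running concurrency has reached the given fraction of the desired concurrency. `mayScaleUp` is a hypothetical function, not part of Crawlee's API. Rounding the threshold down is an assumption made for this illustration, and a real pool would also consider system load.

```javascript
// Hypothetical helper: decides whether scaling up is allowed, using the ratio as a threshold.
function mayScaleUp({
    currentConcurrency,
    desiredConcurrency,
    maxConcurrency = 200,
    desiredConcurrencyRatio = 0.95, // default value as stated in this guide
}) {
    // Assumption: the threshold is rounded down to a whole number of requests.
    const threshold = Math.floor(desiredConcurrency * desiredConcurrencyRatio);
    return desiredConcurrency < maxConcurrency && currentConcurrency >= threshold;
}

// 9 of 10 desired slots are busy, so the threshold of floor(9.5) = 9 is met.
const busyEnough = mayScaleUp({ currentConcurrency: 9, desiredConcurrency: 10 });
// Only 5 of 10 desired slots are busy, so there is no reason to scale up yet.
const tooIdle = mayScaleUp({ currentConcurrency: 5, desiredConcurrency: 10 });
// Already at maxConcurrency, so scaling up is never allowed.
const atCeiling = mayScaleUp({ currentConcurrency: 200, desiredConcurrency: 200 });
```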

`scaleUpStepRatio` and `scaleDownStepRatio`

These values define the fractional amount of desired concurrency to be added or subtracted as the autoscaling pool scales up or down. Both of these values default to 0.05.

Every time the autoscaled pool attempts to scale up or down, this fraction is added to or subtracted from the current concurrency. Together with desiredConcurrencyRatio and maxConcurrency, the result determines how many requests can run concurrently.
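To make the step arithmetic concrete, here is a minimal sketch that assumes the step is the desired concurrency multiplied by the ratio, rounded up to a whole request, and clamped to the configured limits. `scaleUp` and `scaleDown` are hypothetical helpers written for this guide, and the rounding is an assumption rather than a documented guarantee.

```javascript
// Hypothetical helpers illustrating fractional scaling steps.
function scaleUp(desired, { scaleUpStepRatio = 0.05, maxConcurrency = 200 } = {}) {
    // Assumption: round the step up so that scaling always moves by at least one request.
    const step = Math.ceil(desired * scaleUpStepRatio);
    return Math.min(maxConcurrency, desired + step);
}

function scaleDown(desired, { scaleDownStepRatio = 0.05, minConcurrency = 1 } = {}) {
    const step = Math.ceil(desired * scaleDownStepRatio);
    return Math.max(minConcurrency, desired - step);
}

const grown = scaleUp(50);     // step is ceil(2.5) = 3, so 50 becomes 53
const capped = scaleUp(199);   // step is ceil(9.95) = 10, capped at maxConcurrency (200)
const shrunk = scaleDown(50);  // step is 3, so 50 becomes 47
const floored = scaleDown(1);  // never drops below minConcurrency (1)
```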

`maybeRunIntervalSecs`

Indicates how often the autoscaling pool should check if more requests can be started and, if that's true, starts a new request if there are any available. This value is represented in seconds, and defaults to 0.5.

info

Changing this has no effect for requests that are fired immediately after the previous ones are finished. However, it will influence how fast new requests will be started after the autoscaled pool scales up.

`loggingIntervalSecs`

This option lets us control how often the autoscaled pool logs its current state (the current concurrency ratio, desired ratios, whether the system is overloaded, and so on).

We can disable logging altogether by setting this to null. By default, it is set to 60 seconds.

`autoscaleIntervalSecs`

This option lets us control how often the autoscaling pool should check if it can and should scale up or down. This value is represented in seconds, and defaults to 10.

tip

It's recommended you keep this value between 5 and 20 seconds.

Be careful with how low, or high, you set this option

Setting this option too low might severely impact our crawling performance. Conversely, setting it too high might leave performance on the table that could have been used to crawl more requests.

With that said, if you configure this alongside scaleUpStepRatio and scaleDownStepRatio, you could make your crawler scale up at a slower interval, but with more requests at a time when it does.
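The trade-off between checking less often and taking bigger steps can be estimated with a small simulation. The sketch below assumes every autoscale check results in a scale-up and that each step is the desired concurrency multiplied by scaleUpStepRatio, rounded up. Both are simplifications, and `checksToReach` and `secondsToReach` are hypothetical helpers, not Crawlee functions.

```javascript
// Counts how many autoscale checks it takes to grow from `start` to `target`,
// assuming every check scales up. Requires start >= 1 and scaleUpStepRatio > 0.
function checksToReach(start, target, scaleUpStepRatio) {
    let desired = start;
    let checks = 0;
    while (desired < target) {
        const step = Math.ceil(desired * scaleUpStepRatio);
        desired = Math.min(target, desired + step);
        checks++;
    }
    return checks;
}

// Converts the number of checks into elapsed seconds for a given set of options.
function secondsToReach(start, target, options) {
    return checksToReach(start, target, options.scaleUpStepRatio) * options.autoscaleIntervalSecs;
}

// Two option sets that could be passed as autoscaledPoolOptions.
const frequentSmallSteps = { autoscaleIntervalSecs: 10, scaleUpStepRatio: 0.05 };
const rareBigSteps = { autoscaleIntervalSecs: 20, scaleUpStepRatio: 0.2 };

console.log('frequent, small steps:', secondsToReach(10, 50, frequentSmallSteps), 'seconds');
console.log('rare, big steps:', secondsToReach(10, 50, rareBigSteps), 'seconds');
```

Real crawls will not scale up at every check, so these numbers are only useful for comparing configurations against each other.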

`maxTasksPerMinute`

info

This option can also be set by specifying maxRequestsPerMinute in your crawler options, which is a shortcut provided for visibility and ease of access.
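As a small illustration, the two option objects below express the same limit in both forms. They are plain objects kept free of imports so the snippet runs on its own, and the value 120 is an arbitrary example. Either object would be passed to a crawler constructor such as `new CheerioCrawler(options)`.

```javascript
// Form 1: the top-level shortcut on the crawler options.
const viaShortcut = {
    maxRequestsPerMinute: 120,
};

// Form 2: the same limit set directly on the autoscaled pool.
const viaPoolOptions = {
    autoscaledPoolOptions: {
        maxTasksPerMinute: 120,
    },
};
```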
