Configuration
Configuration is a class holding Crawlee configuration parameters. By default, you don't need to set or change any of them, but for certain use cases you might want to do so, e.g. in order to change the default storage directory, or enable verbose error logging, and so on.
There are three ways of changing the configuration parameters:
- adding a crawlee.json file to your project
- setting environment variables
- using the Configuration class
You can also combine all of the above, but keep in mind that the precedence of these three options is as follows:
crawlee.json < constructor options < environment variables.
crawlee.json is a baseline. The options provided in the Configuration constructor will override the options provided in the JSON. Environment variables will override both.
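The precedence rule can be pictured as a plain object merge in which later sources win. The following is only a minimal sketch of that idea, not Crawlee's actual implementation; the Options interface and the resolveOptions function are hypothetical names introduced for illustration.

```typescript
// Hypothetical subset of options, for illustration only.
interface Options {
    persistStateIntervalMillis?: number;
    logLevel?: string;
}

// Merge the three sources, lowest precedence first:
// crawlee.json < constructor options < environment variables.
function resolveOptions(fromJson: Options, fromConstructor: Options, fromEnv: Options): Options {
    return { ...fromJson, ...fromConstructor, ...fromEnv };
}

const resolved = resolveOptions(
    { persistStateIntervalMillis: 10_000, logLevel: 'DEBUG' }, // values from crawlee.json
    { persistStateIntervalMillis: 30_000 }, // values passed to the Configuration constructor
    { logLevel: 'ERROR' }, // value taken from an environment variable
);

// The constructor value wins over crawlee.json for the interval,
// and the environment value wins over both for the log level.
console.log(resolved);
```

In this sketch the interval ends up as 30000 and the log level as ERROR, because each later source only overrides the keys it actually sets.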
crawlee.json
The first option for configuring Crawlee is the crawlee.json file. Specify the ConfigurationOptions in the file, place it in the root of your project, and Crawlee will use the provided options as its global configuration.
{
  "persistStateIntervalMillis": 10000,
  "logLevel": "DEBUG"
}
With crawlee.json you don't need to do anything else in the code:
import { CheerioCrawler, sleep } from 'crawlee';
// We are not importing nor passing
// the Configuration to the crawler.
// We are not assigning any env vars either.
const crawler = new CheerioCrawler();
crawler.router.addDefaultHandler(async ({ request }) => {
    // for the first request we wait for 5 seconds,
    // and add the second request to the queue
    if (request.url === 'https://www.example.com/1') {
        await sleep(5_000);
        await crawler.addRequests(['https://www.example.com/2']);
    }
    // for the second request we wait for 10 seconds,
    // and abort the run
    if (request.url === 'https://www.example.com/2') {
        await sleep(10_000);
        process.exit(0);
    }
});
await crawler.run(['https://www.example.com/1']);
If you run this example (assuming a crawlee.json file with persistStateIntervalMillis and logLevel is placed in the root of your project), you will find the SDK_CRAWLER_STATISTICS file in the default key-value store.
It shows 1 finished request and a crawler runtime of ~10 seconds.
This confirms that the state was persisted after 10 seconds, as set in crawlee.json.
You should also see DEBUG logs in addition to INFO ones in your terminal, because logLevel was set to DEBUG in crawlee.json. This means Crawlee picked up both options correctly.
Environment Variables
Another way of configuring Crawlee is setting environment variables. The following is a list of the environment variables used by Crawlee that are available to the user.
Important env vars
The following environment variables have a large impact on how Crawlee works. Setting or unsetting them can change its behavior significantly.
CRAWLEE_STORAGE_DIR
Defines the path to a local directory where KeyValueStore, Dataset, and RequestQueue store their data. By default, it is set to ./storage.
CRAWLEE_DEFAULT_DATASET_ID
The default dataset has ID default. Setting this environment variable overrides the default dataset ID with the provided value.
CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID
The default key-value store has ID default. Setting this environment variable overrides the default key-value store ID with the provided value.
CRAWLEE_DEFAULT_REQUEST_QUEUE_ID
The default request queue has ID default. Setting this environment variable overrides the default request queue ID with the provided value.
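To see how the storage directory and the three default IDs fit together, here is a rough sketch that derives the default storage locations from an environment map. It is an illustration under assumptions, not Crawlee's code: the defaultStoragePaths function is hypothetical, and the subdirectory names (datasets, key_value_stores, request_queues) are assumed rather than taken from the text above.

```typescript
type StorageEnv = Record<string, string | undefined>;

// Hypothetical helper: computes where the default storages would live.
// The subdirectory names below are an assumption made for this sketch.
function defaultStoragePaths(env: StorageEnv) {
    const storageDir = env['CRAWLEE_STORAGE_DIR'] ?? './storage';
    const datasetId = env['CRAWLEE_DEFAULT_DATASET_ID'] ?? 'default';
    const keyValueStoreId = env['CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID'] ?? 'default';
    const requestQueueId = env['CRAWLEE_DEFAULT_REQUEST_QUEUE_ID'] ?? 'default';

    return {
        dataset: `${storageDir}/datasets/${datasetId}`,
        keyValueStore: `${storageDir}/key_value_stores/${keyValueStoreId}`,
        requestQueue: `${storageDir}/request_queues/${requestQueueId}`,
    };
}

// With nothing set, everything falls back to ./storage and the ID "default".
console.log(defaultStoragePaths({}));

// Overriding the directory and one ID changes only the affected paths.
console.log(defaultStoragePaths({
    CRAWLEE_STORAGE_DIR: '/tmp/my-crawl',
    CRAWLEE_DEFAULT_DATASET_ID: 'products',
}));
```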
CRAWLEE_PURGE_ON_START
Storage directories are purged by default. If set to false, local storage directories are not purged automatically at the start of the crawler run or before a storage is opened explicitly (e.g. via Dataset.open()). This is useful, for example, when we want to add more items to a dataset with each run and keep the previously scraped items.
CRAWLEE_CONTAINERIZED
This variable is only effective when the systemInfoV2 experiment is enabled.
Changes how Crawlee measures its CPU and memory usage and limits. If unset, Crawlee determines whether it is containerized by using the isContainerized utility function, which looks for common features of containerized environments:
- A file at /.dockerenv.
- A file at /proc/self/cgroup containing docker.
- A value for the KUBERNETES_SERVICE_HOST environment variable.
If isLambda returns true, isContainerized will return false regardless of these other checks.
When this variable is set, it is used in place of isContainerized.
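The decision described above can be sketched as a pure function over a few probe results. This is an illustration only, not the real isContainerized utility: the Probes interface and both function names are hypothetical, and the way the CRAWLEE_CONTAINERIZED value is parsed (true or 1 meaning enabled) is an assumption.

```typescript
// Hypothetical inputs standing in for the file system and environment checks.
interface Probes {
    env: Record<string, string | undefined>;
    dockerEnvFileExists: boolean; // is there a file at /.dockerenv?
    cgroupContents: string; // contents of /proc/self/cgroup, or '' if unavailable
    isLambda: boolean; // result of the isLambda check
}

// Mirrors the three checks listed above, with the Lambda exception first.
function detectContainerized(probes: Probes): boolean {
    if (probes.isLambda) return false;
    return probes.dockerEnvFileExists
        || probes.cgroupContents.includes('docker')
        || Boolean(probes.env['KUBERNETES_SERVICE_HOST']);
}

// When CRAWLEE_CONTAINERIZED is set, it replaces the detection entirely.
// Assumption: 'true' and '1' mean containerized, anything else means not.
function resolveContainerized(probes: Probes): boolean {
    const override = probes.env['CRAWLEE_CONTAINERIZED'];
    if (override !== undefined) {
        return ['true', '1'].includes(override.toLowerCase());
    }
    return detectContainerized(probes);
}

const bareMetal: Probes = { env: {}, dockerEnvFileExists: false, cgroupContents: '', isLambda: false };
console.log(resolveContainerized(bareMetal));
console.log(resolveContainerized({ ...bareMetal, env: { KUBERNETES_SERVICE_HOST: '10.0.0.1' } }));
```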
Convenience env vars
The next group includes env vars that can help achieve certain goals without having to change our code, such as temporarily switching log level to DEBUG or enabling verbose logging for errors.
CRAWLEE_HEADLESS
If set to 1, web browsers launched by Crawlee will run in the headless mode. We can still override
this setting in the code, e.g. by passing the headless: true option to the launchPuppeteer() function. By default, the browsers
are launched in headful mode, i.e. with windows.
CRAWLEE_LOG_LEVEL
Specifies the minimum log level, which can be one of the following values (in order of severity):
DEBUG, INFO, WARNING, ERROR and OFF. By default, the log level is set to INFO,
which means that DEBUG messages are not printed to console. See the utils.log
namespace for logging utilities.
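As a rough illustration of how a minimum log level filters messages, the sketch below orders the levels as listed above and prints a message only if its level is at least the configured minimum. The shouldPrint function is a hypothetical name for this example and is not part of the utils.log namespace.

```typescript
// Levels in order of severity, as listed above.
const LEVELS = ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'OFF'] as const;
type Level = (typeof LEVELS)[number];
type MessageLevel = Exclude<Level, 'OFF'>;

// A message is printed when its level is at least as severe as the minimum.
// With the minimum set to OFF, no message level can reach it, so nothing prints.
function shouldPrint(messageLevel: MessageLevel, minLevel: Level = 'INFO'): boolean {
    return LEVELS.indexOf(messageLevel) >= LEVELS.indexOf(minLevel);
}

console.log(shouldPrint('DEBUG')); // default minimum is INFO, so DEBUG is hidden
console.log(shouldPrint('DEBUG', 'DEBUG')); // visible once the minimum is lowered to DEBUG
console.log(shouldPrint('ERROR', 'OFF')); // OFF silences everything
```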
CRAWLEE_VERBOSE_LOG
Enables verbose logging if set to true. Unless it is explicitly set to true, errors thrown from inside the request handler are logged as a warning containing only the error message, as long as we know the request will be retried. The same applies to some known errors (such as timeout errors). Disabled by default.
CRAWLEE_MEMORY_MBYTES
Sets the amount of system memory in megabytes to be used by the AutoscaledPool.
It is used to limit the number of concurrently running tasks. By default, the max amount of memory
to be used is set to one quarter of total system memory, i.e. on a system with 8192 MB of memory,
the autoscaling feature will only use up to 2048 MB of memory.
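The arithmetic behind the default can be written out as a small helper. This is a sketch under assumptions rather than the AutoscaledPool logic itself: the maxUsedMemoryMbytes function is hypothetical, and both the rounding and the handling of invalid values are choices made for this example.

```typescript
// Hypothetical helper: returns the memory limit in megabytes.
// envValue stands in for the raw CRAWLEE_MEMORY_MBYTES string, if any.
function maxUsedMemoryMbytes(totalSystemMbytes: number, envValue?: string): number {
    if (envValue !== undefined && envValue !== '') {
        const parsed = Number(envValue);
        // Assumption: only positive, finite numbers are accepted as an override.
        if (Number.isFinite(parsed) && parsed > 0) return parsed;
    }
    // Default: one quarter of total system memory (rounding up is an assumption).
    return Math.ceil(totalSystemMbytes / 4);
}

console.log(maxUsedMemoryMbytes(8192)); // 2048, matching the example above
console.log(maxUsedMemoryMbytes(8192, '4096')); // 4096, the explicit override
```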
Configuration class
The last option to adjust Crawlee configuration is to use the Configuration class in the code.
Global Configuration
By default, there is a global singleton instance of the Configuration class, which is used by the crawlers and by other classes that depend on configurable behavior. In most cases you don't need to adjust any options there, but if needed, you can access it via the Configuration.getGlobalConfig() function and then get and set the ConfigurationOptions.
import { CheerioCrawler, Configuration, sleep } from 'crawlee';
// Get the global configuration
const config = Configuration.getGlobalConfig();
// Set the 'persistStateIntervalMillis' option
// of global configuration to 10 seconds
config.set('persistStateIntervalMillis', 10_000);
// Note that we are not passing the configuration to the crawler,
// as it's using the global configuration
const crawler = new CheerioCrawler();
crawler.router.addDefaultHandler(async ({ request }) => {
    // For the first request we wait for 5 seconds,
    // and add the second request to the queue
    if (request.url === 'https://www.example.com/1') {
        await sleep(5_000);
        await crawler.addRequests(['https://www.example.com/2']);
    }
    // For the second request we wait for 10 seconds,
    // and abort the run
    if (request.url === 'https://www.example.com/2') {
        await sleep(10_000);
        process.exit(0);
    }
});
await crawler.run(['https://www.example.com/1']);
This is pretty much the same example we used to show crawlee.json usage;
the only difference is that we're now using the global configuration.
If you run this example, you will find the SDK_CRAWLER_STATISTICS file in the default key-value store as before,
showing the same number of finished requests (one) and the same crawler runtime (~10 seconds).
This confirms that the provided parameters worked: the state was persisted after 10 seconds, as set in the global configuration.
If you run the same example with the two Configuration-related lines of code commented out, there will be
no SDK_CRAWLER_STATISTICS file in the default key-value store:
since we did not change persistStateIntervalMillis, Crawlee used the default value of 60 seconds,
and the crawler was forcefully aborted after ~15 seconds of runtime, before it persisted the state for the first time.
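The timing argument can be checked with a little arithmetic. The sketch below is a simplified model and does not use Crawlee: it assumes the state is persisted at exact multiples of the interval counted from the start of the run, and the persistTimes and finishedBy helpers are hypothetical names.

```typescript
// Times (in ms since start) at which the state would be persisted before the abort.
// Assumption: persistence happens at exact multiples of the interval.
function persistTimes(intervalMillis: number, abortAtMillis: number): number[] {
    if (intervalMillis <= 0) throw new Error('intervalMillis must be positive');
    const times: number[] = [];
    for (let t = intervalMillis; t <= abortAtMillis; t += intervalMillis) {
        times.push(t);
    }
    return times;
}

// Number of requests already finished at a given moment.
function finishedBy(timeMillis: number, finishTimes: number[]): number {
    return finishTimes.filter((finishedAt) => finishedAt <= timeMillis).length;
}

// The first request finishes after ~5 s; the second one never finishes,
// because the process exits ~15 s into the run.
const finishTimes = [5_000];
const abortAt = 15_000;

// Interval of 10 s: one snapshot at 10 s, containing 1 finished request.
const withTenSeconds = persistTimes(10_000, abortAt);
console.log(withTenSeconds, finishedBy(withTenSeconds[0], finishTimes));

// Default interval of 60 s: no snapshot is taken before the abort.
console.log(persistTimes(60_000, abortAt));
```

Under these assumptions the 10-second interval yields exactly one snapshot, taken at ~10 seconds with one finished request, while the 60-second default yields none.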
Custom configuration
Alternatively, you can create a custom configuration. In this case you need to pass it to the class that is going to use it, e.g. to the crawler. Let's adjust the previous example:
import { CheerioCrawler, Configuration, sleep } from 'crawlee';
// Create new configuration
const config = new Configuration({
    // Set the 'persistStateIntervalMillis' option to 10 seconds
    persistStateIntervalMillis: 10_000,
});
// Now we need to pass the configuration to the crawler
const crawler = new CheerioCrawler({}, config);
crawler.router.addDefaultHandler(async ({ request }) => {
    // for the first request we wait for 5 seconds,
    // and add the second request to the queue
    if (request.url === 'https://www.example.com/1') {
        await sleep(5_000);
        await crawler.addRequests(['https://www.example.com/2']);
    }
    // for the second request we wait for 10 seconds,
    // and abort the run
    if (request.url === 'https://www.example.com/2') {
        await sleep(10_000);
        process.exit(0);
    }
});
await crawler.run(['https://www.example.com/1']);
If you run this example, it works exactly the same as before,
with the same SDK_CRAWLER_STATISTICS file in the default key-value store after the run,
showing the same number of finished requests and the same crawler runtime.
If you do not pass the configuration to the crawler, there will again be
no SDK_CRAWLER_STATISTICS file in the default key-value store, this time for a different reason.
Without an explicitly passed configuration,
the crawler uses the global configuration, which has the default persistStateIntervalMillis.
So again, the run is aborted before the state is persisted for the first time.
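The fallback behavior can be modeled with two tiny stand-in classes. This is a toy model and not Crawlee's code: MiniConfiguration and MiniCrawler are hypothetical names, and only the fallback to the global instance and the 60-second default are taken from the text above.

```typescript
// Toy stand-in for a configuration holding a single option.
class MiniConfiguration {
    private static globalConfig?: MiniConfiguration;
    private readonly intervalMillis?: number;

    constructor(options: { persistStateIntervalMillis?: number } = {}) {
        this.intervalMillis = options.persistStateIntervalMillis;
    }

    // Lazily created singleton, playing the role of the global configuration.
    static getGlobalConfig(): MiniConfiguration {
        if (!MiniConfiguration.globalConfig) {
            MiniConfiguration.globalConfig = new MiniConfiguration();
        }
        return MiniConfiguration.globalConfig;
    }

    // Falls back to the default of 60 seconds when the option is not set.
    get persistStateIntervalMillis(): number {
        return this.intervalMillis ?? 60_000;
    }
}

// Toy stand-in for a crawler: uses the passed configuration, else the global one.
class MiniCrawler {
    readonly config: MiniConfiguration;

    constructor(config: MiniConfiguration = MiniConfiguration.getGlobalConfig()) {
        this.config = config;
    }
}

const custom = new MiniConfiguration({ persistStateIntervalMillis: 10_000 });

// Passing the custom configuration makes the crawler use the 10 s interval.
console.log(new MiniCrawler(custom).config.persistStateIntervalMillis);

// Forgetting to pass it silently falls back to the global default of 60 s.
console.log(new MiniCrawler().config.persistStateIntervalMillis);
```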