Configuration

Configuration settings for the Crawlee project.

This class stores common configurable parameters for Crawlee. Default values are provided for all settings, so typically, no adjustments are necessary. However, you may modify settings for specific use cases, such as changing the default storage directory, the default storage IDs, the timeout for internal operations, and more.

Settings can also be configured via environment variables, prefixed with CRAWLEE_.
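The prefix convention can be illustrated with a small standard-library sketch. It is not the library's own loading code: the helper name `collect_crawlee_env` is hypothetical, and the assumption that each variable name is the upper-cased property name after `CRAWLEE_` should be checked against the library before relying on it.

```python
PREFIX = "CRAWLEE_"


def collect_crawlee_env(environ):
    """Return {lower-cased setting name: raw string value} for prefixed variables.

    Hypothetical helper for illustration only; the real Configuration class
    reads and parses environment variables itself.
    """
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }


# A stand-in for os.environ, so the example does not depend on the real environment.
example_env = {
    "CRAWLEE_LOG_LEVEL": "DEBUG",
    "CRAWLEE_PURGE_ON_START": "false",
    "PATH": "/usr/bin",
}

settings = collect_crawlee_env(example_env)
print(settings)
```

Values arrive as raw strings here; converting them to `bool`, `int`, or `timedelta` is left to the configuration class.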

Methods

get_global_configuration

  • get_global_configuration(): Self
  • Retrieve the global instance of the configuration.

    Kept mostly for backward compatibility. Prefer service_locator.get_configuration() instead.


    Returns Self

Properties

available_memory_ratio

available_memory_ratio: float

The maximum proportion of system memory to use. If memory_mbytes is not provided, this ratio is used to calculate the maximum memory. This option is utilized by the Snapshotter.
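The fallback described above amounts to a simple calculation, sketched below. The function name `max_memory_mbytes`, the `total_system_mbytes` parameter, and the sample ratio of 0.25 are illustrative assumptions, not the Snapshotter's actual code or defaults.

```python
def max_memory_mbytes(total_system_mbytes, memory_mbytes=None, available_memory_ratio=0.25):
    """Pick the memory ceiling in megabytes.

    Hypothetical helper: an explicit memory_mbytes wins; otherwise the
    ceiling is derived from total system memory and the ratio.
    """
    if memory_mbytes is not None:
        return memory_mbytes
    return int(total_system_mbytes * available_memory_ratio)


# With 16 GiB of system memory and no explicit limit, the ratio decides.
derived = max_memory_mbytes(16384)

# An explicit limit takes precedence over the ratio.
explicit = max_memory_mbytes(16384, memory_mbytes=2048)
```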

default_browser_path

default_browser_path: str | None

Specifies the path to the browser executable. Currently used primarily by Playwright-based features. This option is passed directly to Playwright's browser_type.launch method as the executable_path argument. For more details, refer to the Playwright documentation: https://playwright.dev/docs/api/class-browsertype#browser-type-launch.

default_dataset_id

default_dataset_id: str

The default Dataset ID. This option is utilized by the storage client.

default_key_value_store_id

default_key_value_store_id: str

The default KeyValueStore ID. This option is utilized by the storage client.

default_request_queue_id

default_request_queue_id: str

The default RequestQueue ID. This option is utilized by the storage client.

disable_browser_sandbox

disable_browser_sandbox: bool

Disables the sandbox for the browser. Currently primarily for Playwright-based features. This option is passed directly to Playwright's browser_type.launch method as chromium_sandbox. For more details, refer to the Playwright documentation: https://playwright.dev/docs/api/class-browsertype#browser-type-launch.

headless

headless: bool

Whether to run the browser in headless mode. Currently primarily for Playwright-based features. This option is passed directly to Playwright's browser_type.launch method as headless. For more details, refer to the Playwright documentation: https://playwright.dev/docs/api/class-browsertype#browser-type-launch.

internal_timeout

internal_timeout: timedelta | None

Timeout for the internal asynchronous operations.

log_level

log_level: Literal[DEBUG, INFO, WARNING, ERROR, CRITICAL]

The logging level.

max_client_errors

max_client_errors: int

The maximum number of client errors (HTTP 429) allowed before the system is considered overloaded. This option is used by the Snapshotter.

max_event_loop_delay

max_event_loop_delay: timedelta_ms

The maximum event loop delay. If the event loop delay exceeds this value, the event loop is considered overloaded. This option is used by the Snapshotter.

max_used_cpu_ratio

max_used_cpu_ratio: float

The maximum CPU usage ratio. If the CPU usage exceeds this value, the system is considered overloaded. This option is used by the Snapshotter.

max_used_memory_ratio

max_used_memory_ratio: float

The maximum memory usage ratio. If the memory usage exceeds this ratio, the system is considered overloaded. This option is used by the Snapshotter.
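The four overload thresholds (client errors, event loop delay, CPU ratio, memory ratio) share one pattern: a measured value is compared against a configured maximum. The sketch below shows that comparison only. The `Thresholds` class, the `overloaded_resources` function, and all numeric values are hypothetical, and treating every limit as a strict "greater than" check is an assumption rather than a description of how the Snapshotter is implemented.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Thresholds:
    """Illustrative container mirroring the four threshold settings."""
    max_used_cpu_ratio: float
    max_used_memory_ratio: float
    max_event_loop_delay: timedelta
    max_client_errors: int


def overloaded_resources(t, cpu_ratio, memory_ratio, loop_delay, client_errors):
    """Return the names of resources whose measurement exceeds its threshold."""
    checks = {
        "cpu": cpu_ratio > t.max_used_cpu_ratio,
        "memory": memory_ratio > t.max_used_memory_ratio,
        "event_loop": loop_delay > t.max_event_loop_delay,
        "client_errors": client_errors > t.max_client_errors,
    }
    return [name for name, is_over in checks.items() if is_over]


# Example values only; they are not the library's defaults.
thresholds = Thresholds(
    max_used_cpu_ratio=0.95,
    max_used_memory_ratio=0.9,
    max_event_loop_delay=timedelta(milliseconds=50),
    max_client_errors=1,
)

result = overloaded_resources(
    thresholds,
    cpu_ratio=0.97,
    memory_ratio=0.5,
    loop_delay=timedelta(milliseconds=20),
    client_errors=3,
)
```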

memory_mbytes

memory_mbytes: int | None

The maximum used memory in megabytes. This option is utilized by the Snapshotter.

model_config

model_config: Undefined

persist_state_interval

persist_state_interval: timedelta_ms

Interval at which PersistState events are emitted. These events ensure that state is persisted during the crawler run. This option is utilized by the EventManager.
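Interval-driven emission can be pictured as a loop that sleeps for the configured interval and then fires an event. The following is a minimal asyncio sketch of that idea, not the EventManager's implementation; `emit_periodically`, the `count` parameter, and the 10 ms interval are assumptions chosen to keep the example short.

```python
import asyncio
from datetime import timedelta


async def emit_periodically(interval, emit, count):
    """Call emit() once per interval, count times in total.

    Hypothetical helper: a real event manager would run until stopped
    rather than for a fixed number of iterations.
    """
    for _ in range(count):
        await asyncio.sleep(interval.total_seconds())
        emit()


events = []

asyncio.run(
    emit_periodically(
        timedelta(milliseconds=10),
        lambda: events.append("PersistState"),
        count=3,
    )
)
```

A shorter interval persists state more often at the cost of more frequent writes; a longer one risks losing more progress if the run is interrupted.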

persist_storage

persist_storage: bool

Whether to persist the storage. This option is utilized by the MemoryStorageClient.

purge_on_start

purge_on_start: bool

Whether to purge the storage on start. This option is utilized by the MemoryStorageClient.

storage_dir

storage_dir: str

The path to the storage directory. This option is utilized by the MemoryStorageClient.

system_info_interval

system_info_interval: timedelta_ms

Interval at which SystemInfo events are emitted. The event represents the current status of the system. This option is utilized by the LocalEventManager.

write_metadata

write_metadata: bool

Whether to write the storage metadata. This option is utilized by the MemoryStorageClient.