Configuration
Index
Methods
- get_global_configuration
Properties
- available_memory_ratio
- default_browser_path
- default_dataset_id
- default_key_value_store_id
- default_request_queue_id
- disable_browser_sandbox
- headless
- internal_timeout
- log_level
- max_client_errors
- max_event_loop_delay
- max_used_cpu_ratio
- max_used_memory_ratio
- memory_mbytes
- model_config
- persist_state_interval
- persist_storage
- purge_on_start
- storage_dir
- system_info_interval
- write_metadata
Methods
get_global_configuration
Retrieve the global instance of the configuration.
This method exists mostly for backwards compatibility. It is recommended to use service_locator.get_configuration() instead.
Returns Self
Properties
available_memory_ratio
The maximum proportion of system memory to use. If memory_mbytes is not provided, this ratio is used to calculate the maximum memory. This option is utilized by the Snapshotter.
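The fallback between memory_mbytes and available_memory_ratio can be sketched as follows. This is a minimal illustration, not Crawlee's actual implementation: the helper name resolve_max_memory_bytes and the total_system_memory_bytes parameter are hypothetical, and the conversion of 1 MB to 1024 * 1024 bytes is an assumption.

```python
from typing import Optional


def resolve_max_memory_bytes(
    memory_mbytes: Optional[int],
    available_memory_ratio: float,
    total_system_memory_bytes: int,
) -> int:
    """Hypothetical helper: pick the memory limit the way the text describes."""
    if memory_mbytes is not None:
        # An explicit limit takes precedence over the ratio.
        return memory_mbytes * 1024 * 1024
    # Otherwise derive the limit from a share of the system memory.
    return int(total_system_memory_bytes * available_memory_ratio)


total = 8 * 1024**3  # hypothetical machine with 8 GiB of RAM
explicit_limit = resolve_max_memory_bytes(2048, 0.25, total)
ratio_limit = resolve_max_memory_bytes(None, 0.25, total)
```

In this sketch an explicit memory_mbytes always wins, and the ratio only matters when no explicit limit is given.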
default_browser_path
Specifies the path to the browser executable. Currently used primarily by Playwright-based features. This option is passed directly to Playwright's browser_type.launch method as the executable_path argument. For more details, refer to the Playwright documentation:
https://playwright.dev/docs/api/class-browsertype#browser-type-launch.
default_dataset_id
The default Dataset ID. This option is utilized by the storage client.
default_key_value_store_id
The default KeyValueStore ID. This option is utilized by the storage client.
default_request_queue_id
The default RequestQueue ID. This option is utilized by the storage client.
disable_browser_sandbox
Disables the sandbox for the browser. Currently used primarily by Playwright-based features. This option is passed directly to Playwright's browser_type.launch method as the chromium_sandbox argument. For more details, refer to the Playwright documentation:
https://playwright.dev/docs/api/class-browsertype#browser-type-launch.
headless
Whether to run the browser in headless mode. Currently used primarily by Playwright-based features. This option is passed directly to Playwright's browser_type.launch method as the headless argument. For more details, refer to the Playwright documentation:
https://playwright.dev/docs/api/class-browsertype#browser-type-launch.
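The mapping of default_browser_path, disable_browser_sandbox, and headless onto browser_type.launch arguments can be sketched as below. This is an illustration under stated assumptions, not Crawlee's code: the helper build_launch_options and the example path are hypothetical, and the sketch assumes that disabling the sandbox corresponds to chromium_sandbox=False.

```python
from typing import Any, Dict, Optional


def build_launch_options(
    default_browser_path: Optional[str],
    disable_browser_sandbox: bool,
    headless: bool,
) -> Dict[str, Any]:
    """Hypothetical helper mapping the settings to browser_type.launch keyword arguments."""
    options: Dict[str, Any] = {
        "headless": headless,
        # Assumption: disabling the sandbox means chromium_sandbox=False.
        "chromium_sandbox": not disable_browser_sandbox,
    }
    if default_browser_path is not None:
        # Only override the executable when a path was configured.
        options["executable_path"] = default_browser_path
    return options


launch_options = build_launch_options(
    default_browser_path="/usr/bin/chromium",  # hypothetical path
    disable_browser_sandbox=True,
    headless=True,
)
# With Playwright installed, these could be passed on as
# browser_type.launch(**launch_options).
```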
internal_timeout
Timeout for the internal asynchronous operations.
log_level
The logging level.
max_client_errors
The maximum number of client errors (HTTP 429) allowed before the system is considered overloaded. This option is used by the Snapshotter.
max_event_loop_delay
The maximum event loop delay. If the event loop delay exceeds this value, the system is considered overloaded. This option is used by the Snapshotter.
max_used_cpu_ratio
The maximum CPU usage ratio. If the CPU usage exceeds this value, the system is considered overloaded. This option is used by the Snapshotter.
max_used_memory_ratio
The maximum memory usage ratio. If the memory usage exceeds this ratio, the system is considered overloaded. This option is used by the Snapshotter.
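The "exceeds this value" checks described for max_event_loop_delay, max_used_cpu_ratio, and max_used_memory_ratio can be sketched as simple comparisons. This is a minimal sketch, not the Snapshotter's implementation: the OverloadThresholds class, the overloaded_resources function, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class OverloadThresholds:
    """Hypothetical container; the values used below are not Crawlee defaults."""

    max_used_cpu_ratio: float
    max_used_memory_ratio: float
    max_event_loop_delay: timedelta


def overloaded_resources(
    thresholds: OverloadThresholds,
    cpu_ratio: float,
    memory_ratio: float,
    event_loop_delay: timedelta,
) -> List[str]:
    """Return the names of the resources whose measurement exceeds its threshold."""
    overloaded = []
    if cpu_ratio > thresholds.max_used_cpu_ratio:
        overloaded.append("cpu")
    if memory_ratio > thresholds.max_used_memory_ratio:
        overloaded.append("memory")
    if event_loop_delay > thresholds.max_event_loop_delay:
        overloaded.append("event_loop")
    return overloaded


thresholds = OverloadThresholds(
    max_used_cpu_ratio=0.9,
    max_used_memory_ratio=0.8,
    max_event_loop_delay=timedelta(milliseconds=50),
)
result = overloaded_resources(
    thresholds,
    cpu_ratio=0.95,
    memory_ratio=0.5,
    event_loop_delay=timedelta(milliseconds=80),
)
```

Here the CPU ratio and the event loop delay are above their limits while memory is not, so only those two resources are reported.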
memory_mbytes
The maximum used memory in megabytes. This option is utilized by the Snapshotter.
model_config
persist_state_interval
Interval at which PersistState events are emitted. The event ensures that state is persisted during the crawler run. This option is utilized by the EventManager.
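Emitting an event at a fixed interval can be sketched with asyncio as below. This is only an illustration of the idea, not the EventManager's implementation: the emit_periodically function, the fixed tick count, and the 10 ms interval are hypothetical.

```python
import asyncio
from datetime import timedelta
from typing import Callable, List


async def emit_periodically(
    interval: timedelta,
    emit: Callable[[], None],
    ticks: int,
) -> None:
    """Hypothetical loop: call emit() once per interval, for a fixed number of ticks."""
    for _ in range(ticks):
        await asyncio.sleep(interval.total_seconds())
        emit()


events: List[str] = []
asyncio.run(
    emit_periodically(
        interval=timedelta(milliseconds=10),  # illustrative value, not a Crawlee default
        emit=lambda: events.append("PersistState"),
        ticks=3,
    )
)
```

A shorter interval means state is saved more often at the cost of more frequent writes.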
persist_storage
Whether to persist the storage. This option is utilized by the MemoryStorageClient.
purge_on_start
Whether to purge the storage on start. This option is utilized by the MemoryStorageClient.
storage_dir
The path to the storage directory. This option is utilized by the MemoryStorageClient.
system_info_interval
Interval at which SystemInfo events are emitted. The event represents the current status of the system. This option is utilized by the LocalEventManager.
write_metadata
Whether to write the storage metadata. This option is utilized by the MemoryStorageClient.
Configuration settings for the Crawlee project.
This class stores common configurable parameters for Crawlee. Default values are provided for all settings, so typically, no adjustments are necessary. However, you may modify settings for specific use cases, such as changing the default storage directory, the default storage IDs, the timeout for internal operations, and more.
Settings can also be configured via environment variables, prefixed with CRAWLEE_.
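The prefix convention can be sketched with the standard library alone. This is a simplified illustration, not how Crawlee parses its settings: the helper read_prefixed_settings is hypothetical, and the variable names CRAWLEE_STORAGE_DIR and CRAWLEE_PURGE_ON_START assume that a variable is the prefix followed by the upper-cased property name, which the text above does not state explicitly.

```python
from typing import Dict, Mapping

ENV_PREFIX = "CRAWLEE_"


def read_prefixed_settings(environ: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical helper: collect CRAWLEE_-prefixed variables as raw strings."""
    settings = {}
    for name, value in environ.items():
        if name.startswith(ENV_PREFIX):
            # Assumption: the setting name is the lower-cased remainder.
            settings[name[len(ENV_PREFIX):].lower()] = value
    return settings


example_env = {
    "CRAWLEE_STORAGE_DIR": "./my_storage",  # assumed variable name
    "CRAWLEE_PURGE_ON_START": "false",  # assumed variable name
    "PATH": "/usr/bin",  # unrelated variable, ignored
}
settings = read_prefixed_settings(example_env)
```

In a real process the mapping passed in would typically be os.environ, and the raw strings would still need converting to the types each property expects.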