crawlee
Index
Main Classes
- BasicCrawler
- BeautifulSoupCrawler
- BrowserPool
- Configuration
- CurlImpersonateHttpClient
- Dataset
- EventManager
- HttpCrawler
- HttpxHttpClient
- KeyValueStore
- LocalEventManager
- MemoryStorageClient
- ParselCrawler
- PlaywrightBrowserController
- PlaywrightBrowserPlugin
- PlaywrightCrawler
- Request
- RequestQueue
- Session
- SessionPool
- Statistics
Helper Classes
- _AutoscaledPoolRun
- _BaseListPage
- _BaseStorageMetadata
- _CurlImpersonateResponse
- _HttpxResponse
- _HttpxTransport
- _NewUrlFunction
- _ProxyTierTracker
- _Services
- AddRequestsFunction
- AddRequestsFunctionCall
- AddRequestsKwargs
- AutoscaledPool
- BaseBrowserController
- BaseBrowserPlugin
- BaseDatasetClient
- BaseDatasetCollectionClient
- BaseHttpClient
- BaseKeyValueStoreClient
- BaseKeyValueStoreCollectionClient
- BaseRequestData
- BaseRequestQueueClient
- BaseRequestQueueCollectionClient
- BaseStorage
- BaseStorageClient
- BasicCrawlerOptions
- BasicCrawlingContext
- BatchRequestsOperationResponse
- BeautifulSoupCrawlingContext
- BoundedSet
- ByteSize
- CachedRequest
- ClientSnapshot
- ConcurrencySettings
- ContextPipeline
- CpuInfo
- CpuSnapshot
- CrawleeLogFormatter
- CrawleePage
- CrawleeRequestData
- DatasetClient
- DatasetCollectionClient
- DatasetItemsListPage
- DatasetListPage
- DatasetMetadata
- EnqueueLinksFunction
- ErrorGroup
- ErrorTracker
- EventAbortingData
- EventExitData
- EventLoopSnapshot
- EventManagerOptions
- EventMigratingData
- EventPersistStateData
- EventSystemInfoData
- ExportDataCsvKwargs
- ExportDataJsonKwargs
- ExportToFunction
- ExportToKwargs
- FinalStatistics
- GetDataFunction
- GetDataKwargs
- GetKeyValueStoreFromRequestHandlerFunction
- GetKeyValueStoreFunction
- Glob
- HeaderGenerator
- HttpCrawlingContext
- HttpCrawlingResult
- HttpHeaders
- HttpResponse
- KeyValueStoreChangeRecords
- KeyValueStoreClient
- KeyValueStoreCollectionClient
- KeyValueStoreInterface
- KeyValueStoreKeyInfo
- KeyValueStoreListKeysPage
- KeyValueStoreListPage
- KeyValueStoreMetadata
- KeyValueStoreRecord
- KeyValueStoreRecordMetadata
- KeyValueStoreValue
- LoadRatioInfo
- LRUCache
- MemoryInfo
- MemorySnapshot
- ParselCrawlingContext
- PlaywrightCrawlingContext
- PlaywrightPreNavigationContext
- ProcessedRequest
- ProlongRequestLockResponse
- ProxyConfiguration
- ProxyInfo
- PushDataFunction
- PushDataFunctionCall
- PushDataKwargs
- RecurringTask
- RequestHandlerRunResult
- RequestList
- RequestProcessingRecord
- RequestProvider
- RequestQueueClient
- RequestQueueCollectionClient
- RequestQueueHead
- RequestQueueHeadState
- RequestQueueHeadWithLocks
- RequestQueueListPage
- RequestQueueMetadata
- RequestState
- RequestWithLock
- Router
- SendRequestFunction
- SessionModel
- SessionPoolModel
- Snapshotter
- StatisticsPersistedState
- StatisticsState
- SystemInfo
- SystemStatus
- TimerResult
- UnprocessedRequest
- UserData
Errors
Methods
- callback
- compute_short_hash
- compute_unique_key
- compute_weighted_avg
- configure_logger
- convert_to_absolute_url
- create
- create_dataset_from_directory
- create_kvs_from_directory
- create_rq_from_directory
- crypto_random_object_id
- determine_file_extension
- extract_query_params
- filter_out_none_values_recursively
- find_or_create_client_by_id_or_name_inner
- force_remove
- force_rename
- get_configuration
- get_configuration_if_set
- get_configured_log_level
- get_cpu_info
- get_event_manager
- get_memory_info
- get_or_create_inner
- get_storage_client
- infinite_scroll
- is_content_type
- is_file_or_bytes
- is_url_absolute
- json_dumps
- maybe_extract_enum_member_value
- maybe_parse_body
- measure_time
- normalize_url
- open_storage
- persist_metadata_if_enabled
- raise_on_duplicate_storage
- raise_on_non_existing_storage
- remove_storage_from_cache
- set_cloud_storage_client
- set_configuration
- set_default_storage_client_type
- set_event_manager
- set_local_storage_client
- unique_key_to_request_id
- validate_http_url
- wait_for
- wait_for_all_tasks_for_finish
Properties
- __version__
- AsyncListener
- BrowserType
- cli
- CLOUDFLARE_RETRY_CSS_SELECTORS
- COMMON_ACCEPT
- COMMON_ACCEPT_LANGUAGE
- CreateSessionFunctionType
- ErrorHandler
- EventData
- FailedRequestHandler
- HttpMethod
- HttpPayload
- KvsValueType
- Listener
- logger
- METADATA_FILENAME
- PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA
- PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_MOBILE
- PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_PLATFORM
- PW_CHROMIUM_HEADLESS_DEFAULT_USER_AGENT
- PW_FIREFOX_HEADLESS_DEFAULT_USER_AGENT
- PW_WEBKIT_HEADLESS_DEFAULT_USER_AGENT
- RequestHandler
- ResourceClient
- ResourceCollectionClient
- RETRY_CSS_SELECTORS
- ROTATE_PROXY_ERRORS
- Snapshot
- StorageClientType
- SyncListener
- T
- TCrawlingContext
- TEMPLATE_LIST_URL
- timedelta_ms
- TMiddlewareCrawlingContext
- TResource
- TResourceClient
- TStatisticsState
- USER_AGENT_POOL
- user_data_adapter
- WrappedListener
Constants
Errors
is_status_code_client_error
Parameters
value: int
Returns bool
is_status_code_error
Returns True for 4xx or 5xx status codes, False otherwise.
Parameters
value: int
Returns bool
is_status_code_server_error
Returns True for 5xx status codes, False otherwise.
Parameters
value: int
Returns bool
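The three status-code predicates can be pictured as simple range checks. The following is a minimal sketch, not the library's implementation; the `*_sketch` function names are hypothetical, and the ranges follow the 4xx and 5xx descriptions on this page.

```python
def is_client_error_sketch(value: int) -> bool:
    """Return True for 4xx status codes (hypothetical stand-in)."""
    return 400 <= value <= 499


def is_server_error_sketch(value: int) -> bool:
    """Return True for 5xx status codes (hypothetical stand-in)."""
    return 500 <= value <= 599


def is_error_sketch(value: int) -> bool:
    """Return True for either a 4xx or a 5xx status code."""
    return is_client_error_sketch(value) or is_server_error_sketch(value)
```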
Methods
callback
Crawlee is a web scraping and browser automation library.
Parameters
version: bool = False
Returns None
compute_short_hash
Computes a hexadecimal SHA-256 hash of the provided data and returns a substring (prefix) of it.
Parameters
data: bytes
The binary data to be hashed.
keyword-only length: int = 8
The length of the hash to be returned.
Returns str
A substring (prefix) of the hexadecimal hash of the data.
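As an illustration of the behavior described above, a prefix of a hexadecimal SHA-256 digest can be produced with the standard library. This is a sketch under the assumption that the hash is taken over the raw bytes; `short_hash_sketch` is a hypothetical name, not the library function.

```python
import hashlib


def short_hash_sketch(data: bytes, *, length: int = 8) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]


print(short_hash_sketch(b'abc'))  # prints "ba7816bf"
```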
compute_unique_key
Compute a unique key for caching & deduplication of requests.
This function computes a unique key by normalizing the provided URL and method. If use_extended_unique_key is True and a payload is provided, the payload is hashed and included in the key. Otherwise, the unique key is just the normalized URL. Additionally, if HTTP headers are provided, the whitelisted headers are hashed and included in the key.
Parameters
url: str
The request URL.
method: HttpMethod = 'GET'
The HTTP method.
headers: HttpHeaders | None = None
The HTTP headers.
payload: HttpPayload | None = None
The data to be sent as the request body.
keyword-only keep_url_fragment: bool = False
A flag indicating whether to keep the URL fragment.
keyword-only use_extended_unique_key: bool = False
A flag indicating whether to include a hashed payload in the key.
Returns str
A string representing the unique key for the request.
compute_weighted_avg
Computes a weighted average of an array of numbers, complemented by an array of weights.
Parameters
values: list[float]
List of values.
weights: list[float]
List of weights.
Returns float
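A weighted average divides the sum of value-weight products by the sum of the weights. The sketch below shows that arithmetic; `weighted_avg_sketch` is a hypothetical name, and the error handling for mismatched lengths or a zero total weight is an assumption rather than documented behavior.

```python
from typing import List


def weighted_avg_sketch(values: List[float], weights: List[float]) -> float:
    """Compute sum(v * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError('values and weights must have the same length')
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError('total weight must not be zero')
    return sum(v * w for v, w in zip(values, weights)) / total_weight


print(weighted_avg_sketch([10.0, 20.0], [3.0, 1.0]))  # prints 12.5
```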
configure_logger
Parameters
logger: logging.Logger
configuration: Configuration
keyword-only remove_old_handlers: bool = False
Returns None
convert_to_absolute_url
Convert a relative URL to an absolute URL using a base URL.
Parameters
base_url: str
relative_url: str
Returns str
create
Bootstrap a new Crawlee project.
Parameters
optional project_name: Optional[str] = typer.Argument(default=None, help='The name of the project and the directory that will be created to contain it. If none is given, you will be prompted.', show_default=False)
optional template: Optional[str] = typer.Option(default=None, help='The template to be used to create the project. If none is given, you will be prompted.', show_default=False)
Returns None
create_dataset_from_directory
Parameters
storage_directory: str
memory_storage_client: MemoryStorageClient
id: str | None = None
name: str | None = None
Returns DatasetClient
create_kvs_from_directory
Parameters
storage_directory: str
memory_storage_client: MemoryStorageClient
id: str | None = None
name: str | None = None
Returns KeyValueStoreClient
create_rq_from_directory
Parameters
storage_directory: str
memory_storage_client: MemoryStorageClient
id: str | None = None
name: str | None = None
Returns RequestQueueClient
crypto_random_object_id
Generates a random object ID.
Parameters
length: int = 17
Returns str
determine_file_extension
Determine the file extension for a given MIME content type.
Parameters
content_type: str
The MIME content type string.
Returns str | None
A string representing the determined file extension without a leading dot, or None if no extension could be determined.
extract_query_params
Extract query parameters from a given URL.
Parameters
url: str
Returns dict[str, list[str]]
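The `dict[str, list[str]]` return type matches what the standard library's `urllib.parse.parse_qs` produces, so the behavior can be sketched as follows. `extract_query_params_sketch` is a hypothetical name, and whether the library uses `parse_qs` internally is an assumption.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlparse


def extract_query_params_sketch(url: str) -> Dict[str, List[str]]:
    """Map each query parameter name to the list of its values."""
    return parse_qs(urlparse(url).query)


print(extract_query_params_sketch('https://example.com/?a=1&a=2&b=3'))
```

Repeated parameters such as `a` above are collected into a single list, which is why the values are lists rather than plain strings.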
filter_out_none_values_recursively
Recursively filters out None values from a dictionary.
Parameters
dictionary: dict
The dictionary to filter.
keyword-only remove_empty_dicts: bool = False
Flag indicating whether to remove empty nested dictionaries.
Returns dict | None
A copy of the dictionary with all None values (and potentially empty dictionaries) removed.
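The recursive filtering can be sketched as below. `filter_none_sketch` is a hypothetical name; returning None for a dictionary that ends up empty when `remove_empty_dicts` is set is an assumption made to match the `dict | None` return type.

```python
from typing import Any, Optional


def filter_none_sketch(dictionary: dict, *, remove_empty_dicts: bool = False) -> Optional[dict]:
    """Return a copy of `dictionary` without None values, recursing into nested dicts."""
    result = {}
    for key, value in dictionary.items():
        if isinstance(value, dict):
            # Nested dictionaries are filtered with the same settings.
            value = filter_none_sketch(value, remove_empty_dicts=remove_empty_dicts)
        if value is not None:
            result[key] = value
    # Assumption: an empty result collapses to None only when requested.
    if not result and remove_empty_dicts:
        return None
    return result
```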
find_or_create_client_by_id_or_name_inner
Locates or creates a new storage client based on the given ID or name.
This method attempts to find a storage client in the memory cache first. If not found, it tries to locate a storage directory by name. If still not found, it searches through storage directories for a matching ID or name in their metadata. If none exists, and the specified ID is 'default', it checks for a default storage directory. If a storage client is found or created, it is added to the memory cache. If no storage client can be located or created, the method returns None.
Parameters
resource_client_class: type[TResourceClient]
The class of the resource client.
memory_storage_client: MemoryStorageClient
The memory storage client used to store and retrieve storage clients.
id: str | None = None
The unique identifier for the storage client.
name: str | None = None
The name of the storage client.
Returns TResourceClient | None
The found or created storage client, or None if no client could be found or created.
force_remove
Removes a file, suppressing the FileNotFoundError if it does not exist.
JS-like rm(filename, { force: true }).
Parameters
filename: str
The path to the file to be removed.
Returns None
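Suppressing FileNotFoundError on removal can be sketched with `contextlib.suppress`. This is a synchronous illustration with a hypothetical name (`force_remove_sketch`); the library function may be implemented differently, for example as a coroutine.

```python
import contextlib
import os


def force_remove_sketch(filename: str) -> None:
    """Remove `filename`, doing nothing if the file does not exist."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(filename)
```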
force_rename
Renames a directory, ensuring that the destination directory is removed if it exists.
Parameters
src_dir: str
The source directory path.
dst_dir: str
The destination directory path.
Returns None
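One way to picture the rename is to delete the destination tree first and then move the source into its place. The sketch below is synchronous and uses a hypothetical name (`force_rename_sketch`); skipping the rename when the source is missing is an assumption, not documented behavior.

```python
import os
import shutil


def force_rename_sketch(src_dir: str, dst_dir: str) -> None:
    """Rename `src_dir` to `dst_dir`, removing any existing `dst_dir` first."""
    if os.path.exists(dst_dir):
        shutil.rmtree(dst_dir, ignore_errors=True)
    # Assumption: a missing source directory is silently ignored.
    if os.path.exists(src_dir):
        os.rename(src_dir, dst_dir)
```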
get_configuration
Get the configuration object.
Returns Configuration
get_configuration_if_set
Get the configuration object, or None if it hasn't been set yet.
Returns Configuration | None
get_configured_log_level
Parameters
configuration: Configuration
Returns int
get_cpu_info
Retrieves the current CPU usage.
It utilizes the psutil library. The function psutil.cpu_percent() returns a float representing the current system-wide CPU utilization as a percentage.
Returns CpuInfo
get_event_manager
Get the event manager.
Returns EventManager
get_memory_info
Retrieves the current memory usage of the process and its children.
It utilizes the psutil library.
Returns MemoryInfo
get_or_create_inner
Retrieve a named storage, or create a new one when it doesn't exist.
Parameters
keyword-only memory_storage_client: MemoryStorageClient
The memory storage client.
keyword-only storage_client_cache: list[TResourceClient]
The cache of storage clients.
keyword-only resource_client_class: type[TResourceClient]
The class of the storage to retrieve or create.
keyword-only name: str | None = None
The name of the storage to retrieve or create.
keyword-only id: str | None = None
ID of the storage to retrieve or create.
Returns TResourceClient
The retrieved or newly-created storage.
get_storage_client
Get the storage client instance for the current environment.
Parameters
keyword-only client_type: StorageClientType | None = None
Allows retrieving a specific storage client type, regardless of where we are running.
Returns BaseStorageClient
The current storage client instance.
infinite_scroll
Scroll to the bottom of a page, handling loading of additional items.
Parameters
page: Page
Returns None
is_content_type
Check if the provided content type string matches the specified ContentType.
Parameters
content_type_enum: ContentType
content_type: str
Returns bool
is_file_or_bytes
Determine if the input value is a file-like object or bytes.
This function checks whether the provided value is an instance of bytes, bytearray, or io.IOBase (file-like). The method is simplified for common use cases and may not cover all edge cases.
Parameters
value: Any
The value to be checked.
Returns bool
True if the value is either a file-like object or bytes, False otherwise.
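The check described above amounts to a single `isinstance` call against bytes, bytearray, and io.IOBase. A minimal sketch, with the hypothetical name `is_file_or_bytes_sketch`:

```python
import io
from typing import Any


def is_file_or_bytes_sketch(value: Any) -> bool:
    """Return True for bytes, bytearray, or file-like (io.IOBase) values."""
    return isinstance(value, (bytes, bytearray, io.IOBase))
```

Note that a plain `str` is not treated as bytes here, so text must be encoded before it passes this check.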
is_url_absolute
Check if a URL is absolute.
Parameters
url: str
Returns bool
json_dumps
Serialize an object to a JSON-formatted string with specific settings.
Parameters
obj: Any
The object to serialize.
Returns str
A string containing the JSON representation of the input object.
maybe_extract_enum_member_value
Extract the value of an enumeration member if it is an Enum, otherwise return the original value.
Parameters
maybe_enum_member: Any
Returns Any
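The enum unwrapping can be illustrated in a couple of lines. The function name and the `Color` enum below are hypothetical and exist only for this sketch.

```python
from enum import Enum
from typing import Any


def extract_enum_value_sketch(maybe_enum_member: Any) -> Any:
    """Return `.value` for Enum members and the input unchanged otherwise."""
    if isinstance(maybe_enum_member, Enum):
        return maybe_enum_member.value
    return maybe_enum_member


class Color(Enum):  # hypothetical enum used only for the example
    RED = 'red'


print(extract_enum_value_sketch(Color.RED))  # prints "red"
```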
maybe_parse_body
Parse the response body based on the content type.
Parameters
body: bytes
content_type: str
Returns Any
measure_time
Measure the execution time (wall-clock and CPU) between the start and end of the with-block.
Returns Iterator[TimerResult]
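A context manager that measures both wall-clock and CPU time can be sketched with the standard library. `TimerResultSketch` and its `wall` and `cpu` fields are hypothetical stand-ins for TimerResult, whose actual fields are not listed on this page, and the choice of `time.process_time()` for CPU time is an assumption.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TimerResultSketch:
    wall: Optional[float] = None  # elapsed wall-clock seconds
    cpu: Optional[float] = None   # elapsed CPU seconds of this process


@contextmanager
def measure_time_sketch() -> Iterator[TimerResultSketch]:
    """Fill in the yielded result when the with-block exits."""
    result = TimerResultSketch()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield result
    finally:
        result.wall = time.perf_counter() - wall_start
        result.cpu = time.process_time() - cpu_start


with measure_time_sketch() as elapsed:
    sum(range(10_000))
```

The result object is only populated after the block ends, so it should be read outside the `with` statement.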
normalize_url
Normalizes a URL.
This function cleans and standardizes a URL by removing leading and trailing whitespaces, converting the scheme and netloc to lower case, stripping unwanted tracking parameters (specifically those beginning with 'utm_'), sorting the remaining query parameters alphabetically, and optionally retaining the URL fragment. The goal is to ensure that URLs that are functionally identical but differ in trivial ways (such as parameter order or casing) are treated as the same.
Parameters
url: str
The URL to be normalized.
keyword-only keep_url_fragment: bool = False
Flag to determine whether the fragment part of the URL should be retained.
Returns str
A string containing the normalized URL.
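The normalization steps listed above can be sketched with `urllib.parse`. This is an approximation under stated assumptions, not the library's code: `normalize_url_sketch` is a hypothetical name, and details such as the handling of blank values or percent-encoding may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url_sketch(url: str, *, keep_url_fragment: bool = False) -> str:
    """Strip whitespace, lowercase scheme and netloc, drop utm_* params, sort the rest."""
    parsed = urlparse(url.strip())
    params = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.startswith('utm_')
    ]
    query = urlencode(sorted(params))
    fragment = parsed.fragment if keep_url_fragment else ''
    return urlunparse(
        (parsed.scheme.lower(), parsed.netloc.lower(), parsed.path, parsed.params, query, fragment)
    )


print(normalize_url_sketch(' HTTPS://Example.COM/path?b=2&utm_source=x&a=1#frag '))
# prints "https://example.com/path?a=1&b=2"
```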
open_storage
Open either a new storage or restore an existing one and return it.
Parameters
keyword-only storage_class: type[TResource]
keyword-only storage_client: BaseStorageClient | None = None
keyword-only configuration: Configuration | None = None
keyword-only id: str | None = None
keyword-only name: str | None = None
Returns TResource
persist_metadata_if_enabled
Updates or writes metadata to a specified directory.
The function writes a given metadata dictionary to a JSON file within a specified directory. The writing process is skipped if write_metadata is False. Before writing, it ensures that the target directory exists, creating it if necessary.
Parameters
keyword-only data: dict
A dictionary containing metadata to be written.
keyword-only entity_directory: str
The directory path where the metadata file should be stored.
keyword-only write_metadata: bool
A boolean flag indicating whether the metadata should be written to file.
Returns None
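The skip, create-directory, and write steps can be sketched as follows. This is a synchronous illustration with hypothetical names; in particular, the file name `__metadata__.json` is an assumption, since the value of METADATA_FILENAME is not shown on this page.

```python
import json
import os

METADATA_FILENAME_SKETCH = '__metadata__.json'  # assumed file name, for illustration only


def persist_metadata_sketch(*, data: dict, entity_directory: str, write_metadata: bool) -> None:
    """Write `data` as JSON into `entity_directory` unless writing is disabled."""
    if not write_metadata:
        return
    # Ensure the target directory exists before writing.
    os.makedirs(entity_directory, exist_ok=True)
    path = os.path.join(entity_directory, METADATA_FILENAME_SKETCH)
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=2)
```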
raise_on_duplicate_storage
Raise an error indicating that a storage with the provided key name and value already exists.
Parameters
client_type: StorageTypes
key_name: str
value: str
Returns NoReturn
raise_on_non_existing_storage
Raise an error indicating that a storage with the provided id does not exist.
Parameters
client_type: StorageTypes
id: str | None
Returns NoReturn
remove_storage_from_cache
Remove a storage from cache by ID or name.
Parameters
keyword-only storage_class: type
keyword-only id: str | None = None
keyword-only name: str | None = None
Returns None
set_cloud_storage_client
Set the cloud storage client instance.
Parameters
cloud_client: BaseStorageClient
The cloud storage client instance.
Returns None
set_configuration
Set the configuration object.
Parameters
configuration: Configuration
Returns None
set_default_storage_client_type
Set the default storage client type.
Parameters
client_type: StorageClientType
Returns None
set_event_manager
Set the event manager.
Parameters
event_manager: EventManager
Returns None
set_local_storage_client
Set the local storage client instance.
Parameters
local_client: BaseStorageClient
The local storage client instance.
Returns None
unique_key_to_request_id
Generate a deterministic request ID based on a unique key.
Parameters
unique_key: str
The unique key to convert into a request ID.
keyword-only request_id_length: int = 15
The length of the request ID.
Returns str
A URL-safe, truncated request ID based on the unique key.
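A deterministic, URL-safe, truncated ID can be derived by hashing the key and encoding the digest. The scheme below (SHA-256, Base64, removal of `+`, `/`, and `=`) is an assumption chosen for illustration, and `request_id_sketch` is a hypothetical name; the library's actual derivation is not described on this page.

```python
import base64
import hashlib
import re


def request_id_sketch(unique_key: str, *, request_id_length: int = 15) -> str:
    """Derive a deterministic, URL-safe ID of at most `request_id_length` characters."""
    digest = hashlib.sha256(unique_key.encode('utf-8')).digest()
    encoded = base64.b64encode(digest).decode('ascii')
    # Drop the characters that are not URL-safe in standard Base64.
    url_safe = re.sub(r'[+/=]', '', encoded)
    return url_safe[:request_id_length]
```

Because the ID depends only on the key, the same unique key always maps to the same request ID.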
validate_http_url
Validate the given HTTP URL.
Parameters
value: str | None
Returns str | None
wait_for
Wait for an async operation to complete.
If the wait times out, TimeoutError is raised and the future is cancelled. Optionally retry on error.
Parameters
operation: Callable[[], Awaitable[T]]
A function that returns the future to wait for.
keyword-only timeout: timedelta
How long to wait before cancelling the future.
keyword-only timeout_message: str | None = None
Message to be included in the TimeoutError in case of timeout.
keyword-only max_retries: int = 1
How many times the operation should be attempted.
keyword-only logger: Logger
Used to report information about retries as they happen.
Returns T
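The timeout-and-retry behavior can be sketched on top of `asyncio.wait_for`. This is an illustration with a hypothetical name (`wait_for_sketch`), and it omits the logger parameter. It assumes that a timeout is raised immediately while other exceptions are retried up to `max_retries` attempts; the page only states that retrying on error is optional.

```python
import asyncio
from datetime import timedelta
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar('T')


async def wait_for_sketch(
    operation: Callable[[], Awaitable[T]],
    *,
    timeout: timedelta,
    timeout_message: Optional[str] = None,
    max_retries: int = 1,
) -> T:
    """Await `operation()` with a timeout, retrying on non-timeout errors."""
    for attempt in range(1, max_retries + 1):
        try:
            # asyncio.wait_for cancels the awaited operation on timeout.
            return await asyncio.wait_for(operation(), timeout.total_seconds())
        except asyncio.TimeoutError as exc:
            raise asyncio.TimeoutError(timeout_message) from exc
        except Exception:
            if attempt == max_retries:
                raise
    raise RuntimeError('max_retries must be at least 1')
```

A new awaitable is created for every attempt, which is why the function takes a callable rather than an already-created future.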
wait_for_all_tasks_for_finish
Wait for all tasks to finish or until the timeout is reached.
Parameters
tasks: Sequence[asyncio.Task]
A sequence of asyncio tasks to wait for.
keyword-only logger: Logger
Logger to use for reporting.
keyword-only timeout: timedelta | None = None
How long should we wait before cancelling the tasks.
Returns None
Properties
__version__
AsyncListener
BrowserType
cli
CLOUDFLARE_RETRY_CSS_SELECTORS
COMMON_ACCEPT
COMMON_ACCEPT_LANGUAGE
CreateSessionFunctionType
ErrorHandler
EventData
FailedRequestHandler
HttpMethod
HttpPayload
KvsValueType
Listener
logger
METADATA_FILENAME
PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA
PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_MOBILE
PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_PLATFORM
PW_CHROMIUM_HEADLESS_DEFAULT_USER_AGENT
PW_FIREFOX_HEADLESS_DEFAULT_USER_AGENT
PW_WEBKIT_HEADLESS_DEFAULT_USER_AGENT
RequestHandler
ResourceClient
ResourceCollectionClient
RETRY_CSS_SELECTORS
CSS selectors for elements that should trigger a retry, as the crawler is likely getting blocked.
ROTATE_PROXY_ERRORS
Content of proxy errors that should trigger a retry, as the proxy is likely getting blocked / is malfunctioning.
is_status_code_client_error returns True for 4xx status codes, False otherwise.