Index
Main Classes
- BasicCrawler
- BeautifulSoupCrawler
- BrowserPool
- Configuration
- CurlImpersonateHttpClient
- Dataset
- EventManager
- HttpCrawler
- HttpxHttpClient
- KeyValueStore
- LocalEventManager
- MemoryStorageClient
- ParselCrawler
- PlaywrightBrowserController
- PlaywrightBrowserPlugin
- PlaywrightCrawler
- Request
- RequestQueue
- Session
- SessionPool
- Statistics
Helper Classes
- _AutoscaledPoolRun
- _BaseListPage
- _BaseStorageMetadata
- _CurlImpersonateResponse
- _HttpxResponse
- _HttpxTransport
- _NewUrlFunction
- _ProxyTierTracker
- _Services
- AddRequestsFunction
- AddRequestsFunctionCall
- AddRequestsKwargs
- AutoscaledPool
- BaseBrowserController
- BaseBrowserPlugin
- BaseDatasetClient
- BaseDatasetCollectionClient
- BaseHttpClient
- BaseKeyValueStoreClient
- BaseKeyValueStoreCollectionClient
- BaseRequestData
- BaseRequestQueueClient
- BaseRequestQueueCollectionClient
- BaseStorage
- BaseStorageClient
- BasicCrawlerOptions
- BasicCrawlingContext
- BatchRequestsOperationResponse
- BeautifulSoupCrawlingContext
- BoundedSet
- ByteSize
- CachedRequest
- ClientSnapshot
- ConcurrencySettings
- ContextPipeline
- CpuInfo
- CpuSnapshot
- CrawleeLogFormatter
- CrawleePage
- CrawleeRequestData
- DatasetClient
- DatasetCollectionClient
- DatasetItemsListPage
- DatasetListPage
- DatasetMetadata
- EnqueueLinksFunction
- EventAbortingData
- EventExitData
- EventLoopSnapshot
- EventManagerOptions
- EventMigratingData
- EventPersistStateData
- EventSystemInfoData
- ExportToFunction
- ExportToKwargs
- FinalStatistics
- GetDataFunction
- GetDataKwargs
- GetKeyValueStoreFromRequestHandlerFunction
- GetKeyValueStoreFunction
- Glob
- HeaderGenerator
- HttpCrawlingContext
- HttpCrawlingResult
- HttpHeaders
- HttpResponse
- KeyValueStoreChangeRecords
- KeyValueStoreClient
- KeyValueStoreCollectionClient
- KeyValueStoreInterface
- KeyValueStoreKeyInfo
- KeyValueStoreListKeysPage
- KeyValueStoreListPage
- KeyValueStoreMetadata
- KeyValueStoreRecord
- KeyValueStoreRecordMetadata
- KeyValueStoreValue
- LoadRatioInfo
- LRUCache
- MemoryInfo
- MemorySnapshot
- ParselCrawlingContext
- PlaywrightCrawlingContext
- ProcessedRequest
- ProlongRequestLockResponse
- ProxyConfiguration
- ProxyInfo
- PushDataFunction
- PushDataFunctionCall
- PushDataKwargs
- RecurringTask
- RequestHandlerRunResult
- RequestList
- RequestProcessingRecord
- RequestProvider
- RequestQueueClient
- RequestQueueCollectionClient
- RequestQueueHead
- RequestQueueHeadState
- RequestQueueHeadWithLocks
- RequestQueueListPage
- RequestQueueMetadata
- RequestState
- RequestWithLock
- Router
- SendRequestFunction
- SessionModel
- SessionPoolModel
- Snapshotter
- StatisticsPersistedState
- StatisticsState
- SystemInfo
- SystemStatus
- TimerResult
- UnprocessedRequest
- UserData
Errors
- AbortError
- ContextPipelineFinalizationError
- ContextPipelineInitializationError
- ContextPipelineInterruptedError
- ErrorGroup
- ErrorHandler
- ErrorTracker
- HttpStatusCodeError
- is_status_code_client_error
- is_status_code_error
- is_status_code_server_error
- ProxyError
- RequestHandlerError
- ROTATE_PROXY_ERRORS
- ServiceConflictError
- SessionError
- UserDefinedErrorHandlerError
Methods
- callback
- compute_short_hash
- compute_unique_key
- compute_weighted_avg
- configure_logger
- convert_to_absolute_url
- create
- create_dataset_from_directory
- create_kvs_from_directory
- create_rq_from_directory
- crypto_random_object_id
- determine_file_extension
- extract_query_params
- filter_out_none_values_recursively
- find_or_create_client_by_id_or_name_inner
- force_remove
- force_rename
- get_configuration
- get_configuration_if_set
- get_configured_log_level
- get_cpu_info
- get_event_manager
- get_memory_info
- get_or_create_inner
- get_storage_client
- infinite_scroll
- is_content_type
- is_file_or_bytes
- is_url_absolute
- json_dumps
- maybe_extract_enum_member_value
- maybe_parse_body
- measure_time
- normalize_url
- open_storage
- persist_metadata_if_enabled
- raise_on_duplicate_storage
- raise_on_non_existing_storage
- remove_storage_from_cache
- set_cloud_storage_client
- set_configuration
- set_default_storage_client_type
- set_event_manager
- set_local_storage_client
- unique_key_to_request_id
- validate_http_url
- wait_for
- wait_for_all_tasks_for_finish
Properties
- __version__
- AsyncListener
- BrowserType
- cli
- CLOUDFLARE_RETRY_CSS_SELECTORS
- COMMON_ACCEPT
- COMMON_ACCEPT_LANGUAGE
- CreateSessionFunctionType
- EventData
- FailedRequestHandler
- HttpMethod
- HttpPayload
- HttpQueryParams
- KvsValueType
- Listener
- logger
- METADATA_FILENAME
- PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA
- PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_MOBILE
- PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_PLATFORM
- PW_CHROMIUM_HEADLESS_DEFAULT_USER_AGENT
- PW_FIREFOX_HEADLESS_DEFAULT_USER_AGENT
- PW_WEBKIT_HEADLESS_DEFAULT_USER_AGENT
- RequestHandler
- ResourceClient
- ResourceCollectionClient
- RETRY_CSS_SELECTORS
- Snapshot
- StorageClientType
- SyncListener
- T
- TCrawlingContext
- TEMPLATE_LIST_URL
- timedelta_ms
- TMiddlewareCrawlingContext
- TResource
- TResourceClient
- TStatisticsState
- USER_AGENT_POOL
- user_data_adapter
- WrappedListener
Constants
Errors
ErrorHandler
is_status_code_client_error
Parameters
value: int
Returns bool
is_status_code_error
Parameters
value: int
Returns bool
is_status_code_server_error
Parameters
value: int
Returns bool
ROTATE_PROXY_ERRORS
Methods
callback
Crawlee is a web scraping and browser automation library.
Parameters
version: Annotated[bool, typer.Option('-V', '--version', is_flag=True, help='Print Crawlee version')] = False
Returns None
compute_short_hash
Computes a hexadecimal SHA-256 hash of the provided data and returns a substring (prefix) of it.
Parameters
data: bytes
length: int = 8 (keyword-only)
Returns str
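The description above can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation; the name compute_short_hash_sketch is hypothetical and only mirrors the documented signature.

```python
import hashlib


def compute_short_hash_sketch(data: bytes, *, length: int = 8) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]


# Hashing the same bytes always yields the same prefix.
short = compute_short_hash_sketch(b'abc', length=12)
print(short)
```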
compute_unique_key
Computes a unique key for caching & deduplication of requests.
This function computes a unique key by normalizing the provided URL and method. If use_extended_unique_key is True and a payload is provided, the payload is hashed and included in the key. Otherwise, the unique key is just the normalized URL.
Parameters
url: str
method: HttpMethod = 'GET'
payload: HttpPayload | None = None
keep_url_fragment: bool = False (keyword-only)
use_extended_unique_key: bool = False (keyword-only)
Returns str
compute_weighted_avg
Computes a weighted average of an array of numbers, complemented by an array of weights.
Parameters
values: list[float]
weights: list[float]
Returns float
Weighted average.
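A weighted average divides the sum of value-weight products by the sum of the weights. The sketch below shows that arithmetic; the function name and the input validation are assumptions, not taken from the library.

```python
def compute_weighted_avg_sketch(values: list[float], weights: list[float]) -> float:
    """Return sum(v * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError('values and weights must have the same length')
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError('the sum of weights must be non-zero')
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# (1*1 + 2*1 + 3*2) / (1 + 1 + 2) = 9 / 4
average = compute_weighted_avg_sketch([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
print(average)
```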
configure_logger
Parameters
logger: logging.Logger
configuration: Configuration
remove_old_handlers: bool = False (keyword-only)
Returns None
convert_to_absolute_url
Convert a relative URL to an absolute URL using a base URL.
Parameters
base_url: str
relative_url: str
Returns str
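Resolving a relative URL against a base can be done with urllib.parse.urljoin from the standard library. This is a sketch of the idea under that assumption; it does not claim to match how the library resolves URLs internally, and the function name is hypothetical.

```python
from urllib.parse import urljoin


def convert_to_absolute_url_sketch(base_url: str, relative_url: str) -> str:
    """Resolve `relative_url` against `base_url` using standard URL joining rules."""
    return urljoin(base_url, relative_url)


absolute = convert_to_absolute_url_sketch('https://example.com/docs/', '../api?x=1')
print(absolute)
```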
create
Bootstrap a new Crawlee project.
Parameters
project_name: Optional[str] = typer.Argument(default=None, help='The name of the project and the directory that will be created to contain it. If none is given, you will be prompted.', show_default=False) (optional)
template: Optional[str] = typer.Option(default=None, help='The template to be used to create the project. If none is given, you will be prompted.', show_default=False) (optional)
Returns None
create_dataset_from_directory
Parameters
storage_directory: str
memory_storage_client: MemoryStorageClient
id: str | None = None
name: str | None = None
Returns DatasetClient
create_kvs_from_directory
Parameters
storage_directory: str
memory_storage_client: MemoryStorageClient
id: str | None = None
name: str | None = None
Returns KeyValueStoreClient
create_rq_from_directory
Parameters
storage_directory: str
memory_storage_client: MemoryStorageClient
id: str | None = None
name: str | None = None
Returns RequestQueueClient
crypto_random_object_id
Generates a random object ID.
Parameters
length: int = 17
Returns str
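One way to generate a random ID of a given length is to draw characters from an alphanumeric alphabet with the secrets module. The alphabet and the function name below are assumptions made for illustration; the library may use a different character set.

```python
import secrets
import string

# Assumed alphabet: ASCII letters and digits.
ALPHABET = string.ascii_letters + string.digits


def crypto_random_object_id_sketch(length: int = 17) -> str:
    """Return a random alphanumeric string of `length` characters."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))


object_id = crypto_random_object_id_sketch()
print(object_id)
```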
determine_file_extension
Determine the file extension for a given MIME content type.
Parameters
content_type: str
Returns str | None
extract_query_params
Extract query parameters from a given URL.
Parameters
url: str
Returns dict[str, list[str]]
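The return type dict[str, list[str]] matches what urllib.parse.parse_qs produces, where every key maps to a list of values. The sketch below assumes that approach; the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def extract_query_params_sketch(url: str) -> dict[str, list[str]]:
    """Return the query parameters of `url`, with repeated keys collected into lists."""
    return parse_qs(urlparse(url).query)


params = extract_query_params_sketch('https://example.com/search?a=1&a=2&b=3')
print(params)
```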
filter_out_none_values_recursively
Recursively filters out None values from a dictionary.
Parameters
dictionary: dict
remove_empty_dicts: bool = False (keyword-only)
Returns dict | None
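The following sketch shows one plausible reading of the signature above. It assumes that an empty dictionary is returned as None when remove_empty_dicts is True, which would explain the dict | None return type. The function name is hypothetical and the behavior is an assumption, not confirmed by the reference.

```python
from __future__ import annotations

from typing import Any


def filter_out_none_sketch(dictionary: dict, *, remove_empty_dicts: bool = False) -> dict | None:
    """Drop keys whose value is None, descending into nested dictionaries."""
    result: dict[Any, Any] = {}
    for key, value in dictionary.items():
        if isinstance(value, dict):
            value = filter_out_none_sketch(value, remove_empty_dicts=remove_empty_dicts)
        if value is not None:
            result[key] = value
    # Assumption: an empty result collapses to None only when requested.
    if not result and remove_empty_dicts:
        return None
    return result


data = {'a': 1, 'b': None, 'c': {'d': None}}
kept_empty = filter_out_none_sketch(data)
dropped_empty = filter_out_none_sketch(data, remove_empty_dicts=True)
print(kept_empty, dropped_empty)
```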
find_or_create_client_by_id_or_name_inner
Locates or creates a new storage client based on the given ID or name.
This method attempts to find a storage client in the memory cache first. If not found, it tries to locate a storage directory by name. If still not found, it searches through storage directories for a matching ID or name in their metadata. If none exists, and the specified ID is 'default', it checks for a default storage directory. If a storage client is found or created, it is added to the memory cache. If no storage client can be located or created, the method returns None.
Parameters
resource_client_class: type[TResourceClient]
memory_storage_client: MemoryStorageClient
id: str | None = None
name: str | None = None
Returns TResourceClient | None
force_remove
Removes a file, suppressing the FileNotFoundError if it does not exist.
JS-like rm(filename, { force: true }).
Parameters
filename: str
Returns None
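Suppressing FileNotFoundError while removing a file can be sketched with contextlib.suppress. The function name is hypothetical, and the temporary file exists only for the demonstration.

```python
import contextlib
import os
import tempfile


def force_remove_sketch(filename: str) -> None:
    """Remove `filename`, doing nothing if it does not exist."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(filename)


# Create a throwaway file, then remove it twice; the second call is a no-op.
fd, path = tempfile.mkstemp()
os.close(fd)
force_remove_sketch(path)
force_remove_sketch(path)
```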
force_rename
Renames a directory, ensuring that the destination directory is removed if it exists.
Parameters
src_dir: str
dst_dir: str
Returns None
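Renaming over an existing destination can be sketched by removing the destination tree first. This sketch assumes a missing source directory is silently ignored, which the reference does not state. The function name is hypothetical.

```python
import os
import shutil
import tempfile


def force_rename_sketch(src_dir: str, dst_dir: str) -> None:
    """Rename `src_dir` to `dst_dir`, removing `dst_dir` first if it exists."""
    # Assumption: nothing happens when the source directory is missing.
    if os.path.exists(src_dir):
        if os.path.exists(dst_dir):
            shutil.rmtree(dst_dir, ignore_errors=True)
        os.rename(src_dir, dst_dir)


# Demonstration in a temporary directory.
base = tempfile.mkdtemp()
src = os.path.join(base, 'src')
dst = os.path.join(base, 'dst')
os.makedirs(src)
os.makedirs(dst)
open(os.path.join(src, 'new.txt'), 'w').close()
open(os.path.join(dst, 'old.txt'), 'w').close()
force_rename_sketch(src, dst)
```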
get_configuration
Get the configuration object.
Returns Configuration
get_configuration_if_set
Get the configuration object, or None if it hasn't been set yet.
Returns Configuration | None
get_configured_log_level
Parameters
configuration: Configuration
Returns int
get_cpu_info
Retrieves the current CPU usage.
It utilizes the psutil library. The function psutil.cpu_percent() returns a float representing the current system-wide CPU utilization as a percentage.
Returns CpuInfo
get_event_manager
Get the event manager.
Returns EventManager
get_memory_info
Retrieves the current memory usage of the process and its children.
It utilizes the psutil library.
Returns MemoryInfo
get_or_create_inner
Retrieve a named storage, or create a new one when it doesn't exist.
Parameters
memory_storage_client: MemoryStorageClient (keyword-only)
storage_client_cache: list[TResourceClient] (keyword-only)
resource_client_class: type[TResourceClient] (keyword-only)
name: str | None = None (keyword-only)
id: str | None = None (keyword-only)
Returns TResourceClient
get_storage_client
Get the storage client instance for the current environment.
Parameters
client_type: StorageClientType | None = None (keyword-only)
Returns BaseStorageClient
infinite_scroll
Scroll to the bottom of a page, handling loading of additional items.
Parameters
page: Page
Returns None
is_content_type
Check if the provided content type string matches the specified ContentType.
Parameters
content_type_enum: ContentType
content_type: str
Returns bool
is_file_or_bytes
Determine if the input value is a file-like object or bytes.
This function checks whether the provided value is an instance of bytes, bytearray, or io.IOBase (file-like). The method is simplified for common use cases and may not cover all edge cases.
Parameters
value: Any
Returns bool
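The check described above amounts to a single isinstance call against bytes, bytearray, and io.IOBase. A minimal sketch with a hypothetical function name:

```python
import io
from typing import Any


def is_file_or_bytes_sketch(value: Any) -> bool:
    """Return True for bytes, bytearray, or file-like (io.IOBase) objects."""
    return isinstance(value, (bytes, bytearray, io.IOBase))


print(is_file_or_bytes_sketch(b'raw'), is_file_or_bytes_sketch('text'))
```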
is_url_absolute
Check if a URL is absolute.
Parameters
url: str
Returns bool
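A common way to test whether a URL is absolute is to require both a scheme and a network location after parsing. The sketch below assumes that definition; the library's own check may differ in edge cases, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def is_url_absolute_sketch(url: str) -> bool:
    """Return True when `url` has both a scheme and a network location."""
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)


print(is_url_absolute_sketch('https://example.com/a'), is_url_absolute_sketch('/a/b'))
```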
json_dumps
Serialize an object to a JSON-formatted string with specific settings.
Parameters
obj: Any
Returns str
maybe_extract_enum_member_value
Extract the value of an enumeration member if it is an Enum, otherwise return the original value.
Parameters
maybe_enum_member: Any
Returns Any
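The behavior described above can be sketched as an isinstance check against enum.Enum. The Color enum and the function name are hypothetical and exist only for the demonstration.

```python
from enum import Enum
from typing import Any


def maybe_extract_enum_member_value_sketch(maybe_enum_member: Any) -> Any:
    """Return `.value` for Enum members and the input unchanged otherwise."""
    if isinstance(maybe_enum_member, Enum):
        return maybe_enum_member.value
    return maybe_enum_member


class Color(Enum):  # hypothetical enum used only for illustration
    RED = 'red'


print(maybe_extract_enum_member_value_sketch(Color.RED))
print(maybe_extract_enum_member_value_sketch('plain'))
```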
maybe_parse_body
Parse the response body based on the content type.
Parameters
body: bytes
content_type: str
Returns Any
measure_time
Measure the execution time (wall-clock and CPU) between the start and end of the with-block.
Returns Iterator[TimerResult]
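A context manager that records wall-clock and CPU time can be sketched as follows. The TimerResultSketch dataclass and its wall and cpu fields are assumptions standing in for TimerResult, whose actual fields are not listed in this reference.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TimerResultSketch:
    """Hypothetical stand-in for TimerResult; values are filled in on exit."""
    wall: Optional[float] = None
    cpu: Optional[float] = None


@contextmanager
def measure_time_sketch() -> Iterator[TimerResultSketch]:
    """Measure wall-clock and CPU seconds spent inside the with-block."""
    result = TimerResultSketch()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield result
    finally:
        # Populate the result even if the block raised.
        result.wall = time.perf_counter() - wall_start
        result.cpu = time.process_time() - cpu_start


with measure_time_sketch() as elapsed:
    sum(i * i for i in range(100_000))

print(elapsed.wall, elapsed.cpu)
```

The result object is yielded before the measurement is complete, so its fields are only meaningful after the with-block exits.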
normalize_url
Normalizes a URL.
This function cleans and standardizes a URL by removing leading and trailing whitespaces, converting the scheme and netloc to lower case, stripping unwanted tracking parameters (specifically those beginning with 'utm_'), sorting the remaining query parameters alphabetically, and optionally retaining the URL fragment. The goal is to ensure that URLs that are functionally identical but differ in trivial ways (such as parameter order or casing) are treated as the same.
Parameters
url: str
keep_url_fragment: bool = False (keyword-only)
Returns str
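The normalization steps listed above can be sketched with urllib.parse. This is an illustration of the described steps under stated assumptions, not the library's code: the function name is hypothetical, blank parameter values are kept, and parameters are sorted by key and then by value.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url_sketch(url: str, *, keep_url_fragment: bool = False) -> str:
    """Normalize `url` following the steps described in the reference text."""
    # Strip surrounding whitespace before parsing.
    parsed = urlparse(url.strip())
    # Drop tracking parameters whose names start with 'utm_'.
    params = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.startswith('utm_')
    ]
    # Sort the remaining parameters alphabetically.
    query = urlencode(sorted(params))
    return urlunparse((
        parsed.scheme.lower(),
        parsed.netloc.lower(),
        parsed.path,
        parsed.params,
        query,
        parsed.fragment if keep_url_fragment else '',
    ))


first = normalize_url_sketch('  HTTPS://Example.COM/path?b=2&utm_source=x&a=1#top  ')
second = normalize_url_sketch('https://example.com/path?a=1&b=2')
print(first == second)
```

Two URLs that differ only in casing, parameter order, or tracking parameters end up with the same normalized form, which is what makes the result usable for deduplication.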
open_storage
Open either a new storage or restore an existing one and return it.
Parameters
storage_class: type[TResource] (keyword-only)
storage_client: BaseStorageClient | None = None (keyword-only)
configuration: Configuration | None = None (keyword-only)
id: str | None = None (keyword-only)
name: str | None = None (keyword-only)
Returns TResource
persist_metadata_if_enabled
Updates or writes metadata to a specified directory.
The function writes a given metadata dictionary to a JSON file within a specified directory. The writing process is skipped if write_metadata is False. Before writing, it ensures that the target directory exists, creating it if necessary.
Parameters
data: dict (keyword-only)
entity_directory: str (keyword-only)
write_metadata: bool (keyword-only)
Returns None
raise_on_duplicate_storage
Raise an error indicating that a storage with the provided key name and value already exists.
Parameters
client_type: StorageTypes
key_name: str
value: str
Returns NoReturn
raise_on_non_existing_storage
Raise an error indicating that a storage with the provided id does not exist.
Parameters
client_type: StorageTypes
id: str | None
Returns NoReturn
remove_storage_from_cache
Remove a storage from cache by ID or name.
Parameters
storage_class: type (keyword-only)
id: str | None = None (keyword-only)
name: str | None = None (keyword-only)
Returns None
set_cloud_storage_client
Set the cloud storage client instance.
Parameters
cloud_client: BaseStorageClient
Returns None
set_configuration
Set the configuration object.
Parameters
configuration: Configuration
Returns None
set_default_storage_client_type
Set the default storage client type.
Parameters
client_type: StorageClientType
Returns None
set_event_manager
Set the event manager.
Parameters
event_manager: EventManager
Returns None
set_local_storage_client
Set the local storage client instance.
Parameters
local_client: BaseStorageClient
Returns None
unique_key_to_request_id
Generate a deterministic request ID based on a unique key.
Parameters
unique_key: str
request_id_length: int = 15 (keyword-only)
Returns str
validate_http_url
Validate the given HTTP URL.
Raises: pydantic.ValidationError: If the URL is not valid.
Parameters
value: str | None
Returns str | None
wait_for
Wait for an async operation to complete.
If the wait times out, TimeoutError is raised and the future is cancelled. Optionally retry on error.
Parameters
operation: Callable[[], Awaitable[T]]
timeout: timedelta (keyword-only)
timeout_message: str | None = None (keyword-only)
max_retries: int = 1 (keyword-only)
logger: Logger (keyword-only)
Returns T
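The timeout-and-retry behavior described above can be sketched on top of asyncio.wait_for. This sketch assumes max_retries counts total attempts, that timeouts are never retried, and that other exceptions are retried until attempts run out; none of these details are confirmed by the reference. The names wait_for_sketch and flaky are hypothetical.

```python
from __future__ import annotations

import asyncio
import logging
from datetime import timedelta
from typing import Awaitable, Callable, TypeVar

T = TypeVar('T')


async def wait_for_sketch(
    operation: Callable[[], Awaitable[T]],
    *,
    timeout: timedelta,
    timeout_message: str | None = None,
    max_retries: int = 1,
    logger: logging.Logger,
) -> T:
    """Await `operation()` with a timeout, retrying on non-timeout errors."""
    if max_retries < 1:
        raise ValueError('max_retries must be at least 1')
    for attempt in range(1, max_retries + 1):
        try:
            # asyncio.wait_for cancels the awaited operation on timeout.
            return await asyncio.wait_for(operation(), timeout.total_seconds())
        except asyncio.TimeoutError as exc:
            if timeout_message is None:
                raise
            raise asyncio.TimeoutError(timeout_message) from exc
        except Exception as exc:
            if attempt == max_retries:
                raise
            logger.warning('Attempt %d failed (%r), retrying', attempt, exc)
    raise RuntimeError('unreachable')


attempts = {'count': 0}


async def flaky() -> str:
    """Fail on the first call and succeed on the second."""
    attempts['count'] += 1
    if attempts['count'] == 1:
        raise ConnectionError('transient failure')
    return 'ok'


result = asyncio.run(
    wait_for_sketch(
        flaky,
        timeout=timedelta(seconds=1),
        max_retries=2,
        logger=logging.getLogger('wait_for_demo'),
    )
)
print(result)
```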
wait_for_all_tasks_for_finish
Wait for all tasks to finish or until the timeout is reached.
Parameters
tasks: Sequence[asyncio.Task]
logger: Logger (keyword-only)
timeout: timedelta | None = None (keyword-only)
Returns None
Properties
__version__
AsyncListener
BrowserType
cli
CLOUDFLARE_RETRY_CSS_SELECTORS
COMMON_ACCEPT
COMMON_ACCEPT_LANGUAGE
CreateSessionFunctionType
EventData
FailedRequestHandler
HttpMethod
HttpPayload
HttpQueryParams
KvsValueType
Listener
logger
METADATA_FILENAME
PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA
PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_MOBILE
PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_PLATFORM
PW_CHROMIUM_HEADLESS_DEFAULT_USER_AGENT
PW_FIREFOX_HEADLESS_DEFAULT_USER_AGENT
PW_WEBKIT_HEADLESS_DEFAULT_USER_AGENT
RequestHandler
ResourceClient
ResourceCollectionClient
RETRY_CSS_SELECTORS
CSS selectors for elements that should trigger a retry, as the crawler is likely getting blocked.
ROTATE_PROXY_ERRORS
Content of proxy errors that should trigger a retry, as the proxy is likely getting blocked or is malfunctioning.