Errors

ErrorHandler

ErrorHandler:

is_status_code_client_error

  • is_status_code_client_error(value): bool
  • Parameters

    • value: int

    Returns bool

is_status_code_error

  • is_status_code_error(value): bool
  • Parameters

    • value: int

    Returns bool

is_status_code_server_error

  • is_status_code_server_error(value): bool
  • Parameters

    • value: int

    Returns bool
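
The three status-code checks above are listed without descriptions. A minimal sketch of how such checks are commonly written, assuming the usual HTTP ranges (4xx for client errors, 5xx for server errors); the function names are hypothetical and this is not taken from the library source:

```python
def sketch_is_client_error(value: int) -> bool:
    """Assumed rule: 4xx status codes are client errors."""
    return 400 <= value <= 499


def sketch_is_server_error(value: int) -> bool:
    """Assumed rule: 5xx status codes are server errors."""
    return 500 <= value <= 599


def sketch_is_error(value: int) -> bool:
    """Assumed rule: any 4xx or 5xx status code is an error."""
    return sketch_is_client_error(value) or sketch_is_server_error(value)
```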

ROTATE_PROXY_ERRORS

ROTATE_PROXY_ERRORS:

Proxy error message contents that should trigger a retry, because the proxy is likely blocked or malfunctioning.

Methods

callback

  • callback(version): None
  • Crawlee is a web scraping and browser automation library.


    Parameters

    • version: Annotated[bool, typer.Option('-V', '--version', is_flag=True, help='Print Crawlee version')] = False

    Returns None

compute_short_hash

  • compute_short_hash(data, *, length): str
  • Computes a hexadecimal SHA-256 hash of the provided data and returns a substring (prefix) of it.


    Parameters

    • data: bytes
    • length: int = 8 (keyword-only)

    Returns str
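
The description above (hex SHA-256, then a prefix) can be sketched with the standard library. The function name below is hypothetical; only the behaviour stated in the description is assumed:

```python
import hashlib


def short_hash_sketch(data: bytes, *, length: int = 8) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]
```

Because the digest is hexadecimal, each returned character carries 4 bits, so the default of 8 characters keeps 32 bits of the hash.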

compute_unique_key

  • compute_unique_key(url, method, payload, *, keep_url_fragment, use_extended_unique_key): str
  • Computes a unique key for caching & deduplication of requests.

    This function computes a unique key by normalizing the provided URL and method. If use_extended_unique_key is True and a payload is provided, the payload is hashed and included in the key. Otherwise, the unique key is just the normalized URL.


    Parameters

    • url: str
    • method: HttpMethod = 'GET'
    • payload: HttpPayload | None = None
    • keep_url_fragment: bool = False (keyword-only)
    • use_extended_unique_key: bool = False (keyword-only)

    Returns str
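
A rough sketch of the branching described above. It assumes the URL has already been normalized, that the payload is raw bytes, and that the extended key has the layout `METHOD(hash):url`; that layout and the function name are illustrative assumptions, not the documented format:

```python
import hashlib
from typing import Optional


def unique_key_sketch(
    normalized_url: str,
    method: str = "GET",
    payload: Optional[bytes] = None,
    *,
    use_extended_unique_key: bool = False,
) -> str:
    """Build a deduplication key from an already-normalized URL."""
    if use_extended_unique_key and payload is not None:
        # Hash the payload so that large bodies do not bloat the key.
        payload_hash = hashlib.sha256(payload).hexdigest()[:8]
        # Hypothetical key layout, chosen only for illustration.
        return f"{method.upper()}({payload_hash}):{normalized_url}"
    # Otherwise the key is just the normalized URL.
    return normalized_url
```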

compute_weighted_avg

  • compute_weighted_avg(values, weights): float
  • Computes the weighted average of a list of numbers, using a matching list of weights.


    Parameters

    • values: list[float]
    • weights: list[float]

    Returns float

    Weighted average.
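
For reference, a weighted average divides the sum of value-times-weight products by the sum of the weights. A minimal sketch follows; the function name and the input validation are assumptions, since the reference above does not say how mismatched lengths or zero total weight are handled:

```python
from typing import List


def weighted_avg_sketch(values: List[float], weights: List[float]) -> float:
    """Return sum(v * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        # Assumed behaviour: reject inputs that cannot be paired up.
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumed behaviour: a zero total weight has no defined average.
        raise ValueError("total weight must be non-zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```

For example, values `[1, 2, 3]` with weights `[1, 1, 2]` give (1 + 2 + 6) / 4 = 2.25.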

configure_logger

  • configure_logger(logger, configuration, *, remove_old_handlers): None
  • Parameters

    • logger: logging.Logger
    • configuration: Configuration
    • remove_old_handlers: bool = False (keyword-only)

    Returns None

convert_to_absolute_url

  • convert_to_absolute_url(base_url, relative_url): str
  • Convert a relative URL to an absolute URL using a base URL.


    Parameters

    • base_url: str
    • relative_url: str

    Returns str
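
Resolving a relative URL against a base is what `urllib.parse.urljoin` from the standard library does. The sketch below shows that behaviour under a hypothetical name; whether the library function is implemented this way is an assumption:

```python
from urllib.parse import urljoin


def to_absolute_sketch(base_url: str, relative_url: str) -> str:
    """Resolve `relative_url` against `base_url` using standard URL resolution."""
    return urljoin(base_url, relative_url)


# A relative path replaces the last path segment of the base URL.
print(to_absolute_sketch("https://example.com/docs/page.html", "img/logo.png"))
```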

create

  • create(project_name, template): None
  • Bootstrap a new Crawlee project.


    Parameters

    • project_name: Optional[str] = typer.Argument(default=None, help='The name of the project and the directory that will be created to contain it. If none is given, you will be prompted.', show_default=False) (optional)
    • template: Optional[str] = typer.Option(default=None, help='The template to be used to create the project. If none is given, you will be prompted.', show_default=False) (optional)

    Returns None

create_dataset_from_directory

  • create_dataset_from_directory(storage_directory, memory_storage_client, id, name): DatasetClient
  • Parameters

    • storage_directory: str
    • memory_storage_client: MemoryStorageClient
    • id: str | None = None
    • name: str | None = None

    Returns DatasetClient

create_kvs_from_directory

  • create_kvs_from_directory(storage_directory, memory_storage_client, id, name): KeyValueStoreClient
  • Parameters

    • storage_directory: str
    • memory_storage_client: MemoryStorageClient
    • id: str | None = None
    • name: str | None = None

    Returns KeyValueStoreClient

create_rq_from_directory

  • create_rq_from_directory(storage_directory, memory_storage_client, id, name): RequestQueueClient
  • Parameters

    • storage_directory: str
    • memory_storage_client: MemoryStorageClient
    • id: str | None = None
    • name: str | None = None

    Returns RequestQueueClient

crypto_random_object_id

  • crypto_random_object_id(length): str
  • Generates a random object ID.


    Parameters

    • length: int = 17

    Returns str
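
One way to generate a random ID of a given length is to draw characters from a cryptographically secure source. In the sketch below the alphabet (ASCII letters and digits) and the function name are assumptions; the reference above only states the default length of 17:

```python
import secrets
import string

# Assumed alphabet; the real function may use a different character set.
_ALPHABET = string.ascii_letters + string.digits


def random_object_id_sketch(length: int = 17) -> str:
    """Return `length` characters drawn from a cryptographically secure RNG."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(length))
```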

determine_file_extension

  • determine_file_extension(content_type): str | None
  • Determine the file extension for a given MIME content type.


    Parameters

    • content_type: str

    Returns str | None
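
A comparable lookup can be sketched with the standard `mimetypes` module. This is an illustration under assumptions: the function name is hypothetical, parameters such as `charset` are stripped before the lookup, and the result keeps the leading dot, which may differ from what the library returns:

```python
import mimetypes
from typing import Optional


def file_extension_sketch(content_type: str) -> Optional[str]:
    """Map a MIME content type to a file extension, or None if it is unknown."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime_type)
```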

extract_query_params

  • extract_query_params(url): dict[str, list[str]]
  • Extract query parameters from a given URL.


    Parameters

    • url: str

    Returns dict[str, list[str]]
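
The return type `dict[str, list[str]]` matches what `urllib.parse.parse_qs` produces, so the behaviour can be sketched as follows (hypothetical function name; the library's handling of blank values is not stated above and is not assumed here):

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlparse


def query_params_sketch(url: str) -> Dict[str, List[str]]:
    """Return the query parameters of `url`, collecting repeated keys into lists."""
    return parse_qs(urlparse(url).query)
```

Repeated keys are preserved: `?a=1&a=2&b=3` yields two values for `a` and one for `b`.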

filter_out_none_values_recursively

  • filter_out_none_values_recursively(dictionary, *, remove_empty_dicts): dict | None
  • Recursively filters out None values from a dictionary.


    Parameters

    • dictionary: dict
    • remove_empty_dicts: bool = False (keyword-only)

    Returns dict | None
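
A sketch of recursive None filtering. It assumes that `remove_empty_dicts` drops nested dictionaries that end up empty and returns None when the top-level result is empty, which would explain the `dict | None` return type; the function name is hypothetical:

```python
from typing import Any, Dict, Optional


def filter_none_sketch(
    dictionary: Dict[Any, Any], *, remove_empty_dicts: bool = False
) -> Optional[Dict[Any, Any]]:
    """Return a copy of `dictionary` without None values, at any nesting depth."""
    result: Dict[Any, Any] = {}
    for key, value in dictionary.items():
        if value is None:
            continue
        if isinstance(value, dict):
            value = filter_none_sketch(value, remove_empty_dicts=remove_empty_dicts)
            if value is None:
                # The nested dict became empty and was dropped.
                continue
        result[key] = value
    if remove_empty_dicts and not result:
        return None
    return result
```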

find_or_create_client_by_id_or_name_inner

  • find_or_create_client_by_id_or_name_inner(resource_client_class, memory_storage_client, id, name): TResourceClient | None
  • Locates or creates a new storage client based on the given ID or name.

    This method attempts to find a storage client in the memory cache first. If not found, it tries to locate a storage directory by name. If still not found, it searches through storage directories for a matching ID or name in their metadata. If none exists, and the specified ID is 'default', it checks for a default storage directory. If a storage client is found or created, it is added to the memory cache. If no storage client can be located or created, the method returns None.


    Parameters

    • resource_client_class: type[TResourceClient]
    • memory_storage_client: MemoryStorageClient
    • id: str | None = None
    • name: str | None = None

    Returns TResourceClient | None

force_remove

  • async force_remove(filename): None
  • Removes a file, suppressing FileNotFoundError if the file does not exist.

    Equivalent to rm(filename, { force: true }) in JavaScript.


    Parameters

    • filename: str

    Returns None
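
The "remove and ignore a missing file" behaviour can be sketched with `contextlib.suppress`. Running the blocking call in a worker thread via `asyncio.to_thread` (Python 3.9+) is an assumption about how an async variant might be written, and the function name is hypothetical:

```python
import asyncio
import contextlib
import os


async def force_remove_sketch(filename: str) -> None:
    """Delete `filename`; do nothing if it does not exist."""
    with contextlib.suppress(FileNotFoundError):
        # os.remove blocks, so run it off the event loop.
        await asyncio.to_thread(os.remove, filename)
```

Other errors, such as PermissionError, still propagate; only the missing-file case is silenced.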

force_rename

  • async force_rename(src_dir, dst_dir): None
  • Renames a directory, ensuring that the destination directory is removed if it exists.


    Parameters

    • src_dir: str
    • dst_dir: str

    Returns None

get_configuration

  • get_configuration(): Configuration
  • Get the configuration object.


    Returns Configuration

get_configuration_if_set

  • get_configuration_if_set(): Configuration | None
  • Get the configuration object, or None if it hasn't been set yet.


    Returns Configuration | None

get_configured_log_level

  • get_configured_log_level(configuration): int
  • Parameters

    • configuration: Configuration

    Returns int

get_cpu_info

  • get_cpu_info(): CpuInfo
  • Retrieves the current CPU usage.

    Uses the psutil library: psutil.cpu_percent() returns a float representing the current system-wide CPU utilization as a percentage.


    Returns CpuInfo

get_event_manager

  • get_event_manager(): EventManager
  • Get the event manager.


    Returns EventManager

get_memory_info

  • get_memory_info(): MemoryInfo
  • Retrieves the current memory usage of the process and its children.

    It utilizes the psutil library.


    Returns MemoryInfo

get_or_create_inner

  • async get_or_create_inner(*, memory_storage_client, storage_client_cache, resource_client_class, name, id): TResourceClient
  • Retrieve a named storage, or create a new one when it doesn't exist.


    Parameters

    • memory_storage_client: MemoryStorageClient (keyword-only)
    • storage_client_cache: list[TResourceClient] (keyword-only)
    • resource_client_class: type[TResourceClient] (keyword-only)
    • name: str | None = None (keyword-only)
    • id: str | None = None (keyword-only)

    Returns TResourceClient

get_storage_client

  • get_storage_client(*, client_type): BaseStorageClient
  • Get the storage client instance for the current environment.


    Parameters

    • client_type: StorageClientType | None = None (keyword-only)

    Returns BaseStorageClient

infinite_scroll

  • async infinite_scroll(page): None
  • Scroll to the bottom of a page, handling loading of additional items.


    Parameters

    • page: Page

    Returns None

is_content_type

  • is_content_type(content_type_enum, content_type): bool
  • Check if the provided content type string matches the specified ContentType.


    Parameters

    • content_type_enum: ContentType
    • content_type: str

    Returns bool

is_file_or_bytes

  • is_file_or_bytes(value): bool
  • Determine if the input value is a file-like object or bytes.

    This function checks whether the provided value is an instance of bytes, bytearray, or io.IOBase (file-like). The method is simplified for common use cases and may not cover all edge cases.


    Parameters

    • value: Any

    Returns bool
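
The check described above is a single `isinstance` call. A sketch under a hypothetical name:

```python
import io
from typing import Any


def is_file_or_bytes_sketch(value: Any) -> bool:
    """Return True for bytes, bytearray, or file-like objects derived from io.IOBase."""
    return isinstance(value, (bytes, bytearray, io.IOBase))
```

Note that `str` is not matched, and neither are file-like objects that do not inherit from `io.IOBase`, which is one of the edge cases the description mentions.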

is_url_absolute

  • is_url_absolute(url): bool
  • Check if a URL is absolute.


    Parameters

    • url: str

    Returns bool
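
A common definition of "absolute" is that the URL has both a scheme and a network location. The sketch below uses that definition as an assumption; the library may apply a different rule, and the function name is hypothetical:

```python
from urllib.parse import urlparse


def is_url_absolute_sketch(url: str) -> bool:
    """Return True when `url` has both a scheme and a network location."""
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)
```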

json_dumps

  • async json_dumps(obj): str
  • Serialize an object to a JSON-formatted string with specific settings.


    Parameters

    • obj: Any

    Returns str

maybe_extract_enum_member_value

  • maybe_extract_enum_member_value(maybe_enum_member): Any
  • Extract the value of an enumeration member if it is an Enum, otherwise return the original value.


    Parameters

    • maybe_enum_member: Any

    Returns Any
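
The behaviour described above can be sketched as follows. The function and enum names are hypothetical and exist only for the illustration:

```python
from enum import Enum
from typing import Any


def enum_value_or_self_sketch(maybe_enum_member: Any) -> Any:
    """Return `.value` for Enum members and the input unchanged otherwise."""
    if isinstance(maybe_enum_member, Enum):
        return maybe_enum_member.value
    return maybe_enum_member


class ColorSketch(Enum):
    """Hypothetical enum used only in this example."""

    RED = "red"
```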

maybe_parse_body

  • maybe_parse_body(body, content_type): Any
  • Parse the response body based on the content type.


    Parameters

    • body: bytes
    • content_type: str

    Returns Any

measure_time

  • measure_time(): Iterator[TimerResult]
  • Measure the execution time (wall-clock and CPU) between the start and end of the with-block.


    Returns Iterator[TimerResult]
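
The return type `Iterator[TimerResult]` suggests a generator-based context manager. The sketch below shows that pattern; the result class, its `wall` and `cpu` field names, and the choice of clocks are assumptions made for illustration:

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TimerResultSketch:
    """Hypothetical result holder; the fields are filled when the block exits."""

    wall: Optional[float] = None
    cpu: Optional[float] = None


@contextmanager
def measure_time_sketch() -> Iterator[TimerResultSketch]:
    """Measure wall-clock and CPU seconds spent inside the with-block."""
    result = TimerResultSketch()
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    try:
        yield result
    finally:
        # Recorded even if the block raises.
        result.wall = time.monotonic() - wall_start
        result.cpu = time.process_time() - cpu_start


with measure_time_sketch() as elapsed:
    sum(range(100_000))
print(elapsed.wall, elapsed.cpu)
```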

normalize_url

  • normalize_url(url, *, keep_url_fragment): str
  • Normalizes a URL.

    This function cleans and standardizes a URL by removing leading and trailing whitespace, converting the scheme and netloc to lower case, stripping unwanted tracking parameters (specifically those beginning with 'utm_'), sorting the remaining query parameters alphabetically, and optionally retaining the URL fragment. The goal is to ensure that URLs that are functionally identical but differ in trivial ways (such as parameter order or casing) are treated as the same.


    Parameters

    • url: str
    • keep_url_fragment: bool = False (keyword-only)

    Returns str
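
The normalization steps listed above can be sketched with `urllib.parse`. This follows the description only and uses a hypothetical function name; details such as how blank or repeated parameters are treated are assumptions:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url_sketch(url: str, *, keep_url_fragment: bool = False) -> str:
    """Normalize a URL following the steps in the description above."""
    parsed = urlparse(url.strip())
    # Drop tracking parameters, then sort the rest alphabetically.
    params = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.startswith("utm_")
    ]
    query = urlencode(sorted(params))
    fragment = parsed.fragment if keep_url_fragment else ""
    return urlunparse(
        (
            parsed.scheme.lower(),
            parsed.netloc.lower(),
            parsed.path,
            parsed.params,
            query,
            fragment,
        )
    )


print(normalize_url_sketch(" HTTPS://Example.com/path?b=2&utm_source=x&a=1#frag "))
```

With the sample input, the two remaining parameters are reordered to `a=1&b=2`, the host is lowercased, and the fragment is dropped unless `keep_url_fragment=True`.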

open_storage

  • async open_storage(*, storage_class, storage_client, configuration, id, name): TResource
  • Open either a new storage or restore an existing one and return it.


    Parameters

    • storage_class: type[TResource] (keyword-only)
    • storage_client: BaseStorageClient | None = None (keyword-only)
    • configuration: Configuration | None = None (keyword-only)
    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)

    Returns TResource

persist_metadata_if_enabled

  • async persist_metadata_if_enabled(*, data, entity_directory, write_metadata): None
  • Updates or writes metadata to a specified directory.

    The function writes a given metadata dictionary to a JSON file within a specified directory. The writing process is skipped if write_metadata is False. Before writing, it ensures that the target directory exists, creating it if necessary.


    Parameters

    • data: dict (keyword-only)
    • entity_directory: str (keyword-only)
    • write_metadata: bool (keyword-only)

    Returns None
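
A sketch of the behaviour described above: skip when disabled, create the directory if needed, then write JSON. The metadata file name, the JSON formatting, the use of `asyncio.to_thread` (Python 3.9+), and the function name are all assumptions, not the library's actual values:

```python
import asyncio
import json
import os
from typing import Any, Dict

# Hypothetical file name; the real METADATA_FILENAME value is not shown above.
METADATA_FILENAME_SKETCH = "metadata.json"


async def persist_metadata_sketch(
    *, data: Dict[str, Any], entity_directory: str, write_metadata: bool
) -> None:
    """Write `data` as JSON into `entity_directory` unless writing is disabled."""
    if not write_metadata:
        return

    def _write() -> None:
        # Create the target directory first, then write the file.
        os.makedirs(entity_directory, exist_ok=True)
        path = os.path.join(entity_directory, METADATA_FILENAME_SKETCH)
        with open(path, "w", encoding="utf-8") as file:
            json.dump(data, file, ensure_ascii=False, indent=2)

    # File I/O blocks, so run it in a worker thread.
    await asyncio.to_thread(_write)
```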

raise_on_duplicate_storage

  • raise_on_duplicate_storage(client_type, key_name, value): NoReturn
  • Raise an error indicating that a storage with the provided key name and value already exists.


    Parameters

    • client_type: StorageTypes
    • key_name: str
    • value: str

    Returns NoReturn

raise_on_non_existing_storage

  • raise_on_non_existing_storage(client_type, id): NoReturn
  • Raise an error indicating that a storage with the provided id does not exist.


    Parameters

    • client_type: StorageTypes
    • id: str | None

    Returns NoReturn

remove_storage_from_cache

  • remove_storage_from_cache(*, storage_class, id, name): None
  • Remove a storage from cache by ID or name.


    Parameters

    • storage_class: type (keyword-only)
    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)

    Returns None

set_cloud_storage_client

  • set_cloud_storage_client(cloud_client): None
  • Set the cloud storage client instance.


    Parameters

    • cloud_client: BaseStorageClient

    Returns None

set_configuration

  • set_configuration(configuration): None
  • Set the configuration object.


    Parameters

    • configuration: Configuration

    Returns None

set_default_storage_client_type

  • set_default_storage_client_type(client_type): None
  • Set the default storage client type.


    Parameters

    • client_type: StorageClientType

    Returns None

set_event_manager

  • set_event_manager(event_manager): None
  • Set the event manager.


    Parameters

    • event_manager: EventManager

    Returns None

set_local_storage_client

  • set_local_storage_client(local_client): None
  • Set the local storage client instance.


    Parameters

    • local_client: BaseStorageClient

    Returns None

unique_key_to_request_id

  • unique_key_to_request_id(unique_key, *, request_id_length): str
  • Generate a deterministic request ID based on a unique key.


    Parameters

    • unique_key: str
    • request_id_length: int = 15 (keyword-only)

    Returns str
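
"Deterministic" here means the same unique key always maps to the same ID. One way to get that property is to hash the key and keep a fixed-length alphanumeric prefix. The hash function, encoding, and function name below are assumptions made for illustration:

```python
import base64
import hashlib
import re


def request_id_sketch(unique_key: str, *, request_id_length: int = 15) -> str:
    """Derive a stable alphanumeric ID from `unique_key`."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    # Keep only letters and digits so that the ID is safe in file names and URLs.
    alphanumeric = re.sub(r"[^A-Za-z0-9]", "", encoded)
    return alphanumeric[:request_id_length]
```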

validate_http_url

  • validate_http_url(value): str | None
  • Validate the given HTTP URL.

    Raises: pydantic.ValidationError: If the URL is not valid.


    Parameters

    • value: str | None

    Returns str | None

wait_for

  • async wait_for(operation, *, timeout, timeout_message, max_retries, logger): T
  • Wait for an async operation to complete.

    If the wait times out, TimeoutError is raised and the future is cancelled. The operation can optionally be retried on error.


    Parameters

    • operation: Callable[[], Awaitable[T]]
    • timeout: timedelta (keyword-only)
    • timeout_message: str | None = None (keyword-only)
    • max_retries: int = 1 (keyword-only)
    • logger: Logger (keyword-only)

    Returns T
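
A sketch of the timeout-plus-retry pattern, built on `asyncio.wait_for`. It assumes that `max_retries` counts total attempts, that a timeout is not retried, and that other exceptions are retried until the attempts run out; these choices and the function name are illustrative, not confirmed by the reference above:

```python
import asyncio
import logging
from datetime import timedelta
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def wait_for_sketch(
    operation: Callable[[], Awaitable[T]],
    *,
    timeout: timedelta,
    timeout_message: Optional[str] = None,
    max_retries: int = 1,
    logger: logging.Logger,
) -> T:
    """Await `operation()` with a timeout, retrying on non-timeout errors."""
    for attempt in range(1, max_retries + 1):
        try:
            # asyncio.wait_for cancels the awaited operation on timeout.
            return await asyncio.wait_for(operation(), timeout.total_seconds())
        except asyncio.TimeoutError as exc:
            raise TimeoutError(timeout_message or "Operation timed out") from exc
        except Exception:
            if attempt == max_retries:
                raise
            logger.warning("Attempt %d of %d failed, retrying", attempt, max_retries)
    raise ValueError("max_retries must be at least 1")
```

`operation` is a zero-argument callable rather than an awaitable so that each retry can start a fresh coroutine.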

wait_for_all_tasks_for_finish

  • async wait_for_all_tasks_for_finish(tasks, *, logger, timeout): None
  • Wait for all tasks to finish or until the timeout is reached.


    Parameters

    • tasks: Sequence[asyncio.Task]
    • logger: Logger (keyword-only)
    • timeout: timedelta | None = None (keyword-only)

    Returns None

Properties

__version__

__version__:

AsyncListener

AsyncListener:

BrowserType

BrowserType:

cli

cli:

CLOUDFLARE_RETRY_CSS_SELECTORS

CLOUDFLARE_RETRY_CSS_SELECTORS:

COMMON_ACCEPT

COMMON_ACCEPT:

COMMON_ACCEPT_LANGUAGE

COMMON_ACCEPT_LANGUAGE:

CreateSessionFunctionType

CreateSessionFunctionType:

EventData

EventData:

FailedRequestHandler

FailedRequestHandler:

HttpMethod

HttpMethod: TypeAlias

HttpPayload

HttpPayload: TypeAlias

HttpQueryParams

HttpQueryParams: TypeAlias

KvsValueType

KvsValueType:

Listener

Listener:

logger

logger:

METADATA_FILENAME

METADATA_FILENAME:

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA:

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_MOBILE

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_MOBILE:

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_PLATFORM

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_PLATFORM:

PW_CHROMIUM_HEADLESS_DEFAULT_USER_AGENT

PW_CHROMIUM_HEADLESS_DEFAULT_USER_AGENT:

PW_FIREFOX_HEADLESS_DEFAULT_USER_AGENT

PW_FIREFOX_HEADLESS_DEFAULT_USER_AGENT:

PW_WEBKIT_HEADLESS_DEFAULT_USER_AGENT

PW_WEBKIT_HEADLESS_DEFAULT_USER_AGENT:

RequestHandler

RequestHandler:

ResourceClient

ResourceClient:

ResourceCollectionClient

ResourceCollectionClient:

RETRY_CSS_SELECTORS

RETRY_CSS_SELECTORS:

CSS selectors for elements that should trigger a retry, as the crawler is likely getting blocked.

Snapshot

Snapshot:

StorageClientType

StorageClientType:

SyncListener

SyncListener:

T

T:

TCrawlingContext

TCrawlingContext:

TEMPLATE_LIST_URL

TEMPLATE_LIST_URL:

timedelta_ms

timedelta_ms:

TMiddlewareCrawlingContext

TMiddlewareCrawlingContext:

TResource

TResource:

TResourceClient

TResourceClient:

TStatisticsState

TStatisticsState:

USER_AGENT_POOL

USER_AGENT_POOL:

user_data_adapter

user_data_adapter:

WrappedListener

WrappedListener: