crawlee

Errors

is_status_code_client_error

  • is_status_code_client_error(value): bool
  • Returns True for 4xx status codes, False otherwise.


    Parameters

    • value: int

    Returns bool

is_status_code_error

  • is_status_code_error(value): bool
  • Returns True for 4xx or 5xx status codes, False otherwise.


    Parameters

    • value: int

    Returns bool

is_status_code_server_error

  • is_status_code_server_error(value): bool
  • Returns True for 5xx status codes, False otherwise.


    Parameters

    • value: int

    Returns bool
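The three predicates above amount to simple range checks. The following is a minimal sketch of that behavior based only on the descriptions in this section; it is not the library's own source.

```python
def is_status_code_client_error(value: int) -> bool:
    """Return True for 4xx status codes, False otherwise."""
    return 400 <= value <= 499


def is_status_code_server_error(value: int) -> bool:
    """Return True for 5xx status codes, False otherwise."""
    return 500 <= value <= 599


def is_status_code_error(value: int) -> bool:
    """Return True for 4xx or 5xx status codes, False otherwise."""
    return is_status_code_client_error(value) or is_status_code_server_error(value)
```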

Methods

callback

  • callback(version): None
  • Crawlee is a web scraping and browser automation library.


    Parameters

    • version: bool = False

    Returns None

compute_short_hash

  • compute_short_hash(data, *, length): str
  • Computes a hexadecimal SHA-256 hash of the provided data and returns a substring (prefix) of it.


    Parameters

    • data: bytes

      The binary data to be hashed.

    • keyword-only length: int = 8

      The length of the hash to be returned.

    Returns str

    A substring (prefix) of the hexadecimal hash of the data.
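A minimal sketch of the described behavior using the standard library's hashlib. It follows the description above (hexadecimal SHA-256 digest, truncated to a prefix) and is not copied from the library.

```python
import hashlib


def compute_short_hash(data: bytes, *, length: int = 8) -> str:
    """Return the first `length` characters of the hex SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]
```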

compute_unique_key

  • compute_unique_key(url, method, headers, payload, *, keep_url_fragment, use_extended_unique_key): str
  • Compute a unique key for caching & deduplication of requests.

    This function computes a unique key by normalizing the provided URL and method. If use_extended_unique_key is True and a payload is provided, the payload is hashed and included in the key. Otherwise, the unique key is just the normalized URL. Additionally, if HTTP headers are provided, the whitelisted headers are hashed and included in the key.


    Parameters

    • url: str

      The request URL.

    • method: HttpMethod = 'GET'

      The HTTP method.

    • headers: HttpHeaders | None = None

      The HTTP headers.

    • payload: HttpPayload | None = None

      The data to be sent as the request body.

    • keyword-only keep_url_fragment: bool = False

      A flag indicating whether to keep the URL fragment.

    • keyword-only use_extended_unique_key: bool = False

      A flag indicating whether to include a hashed payload in the key.

    Returns str

    A string representing the unique key for the request.
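The sketch below illustrates the general idea of an extended unique key. The function name sketch_unique_key, the "|" separator layout, and the stand-in URL normalization are assumptions made for illustration; header hashing is omitted, and the real key format may differ.

```python
import hashlib
from typing import Optional


def sketch_unique_key(
    url: str,
    method: str = 'GET',
    payload: Optional[bytes] = None,
    *,
    use_extended_unique_key: bool = False,
) -> str:
    # Stand-in for full URL normalization (assumption: only whitespace is trimmed here).
    normalized_url = url.strip()
    if not use_extended_unique_key:
        # The plain unique key is just the normalized URL.
        return normalized_url
    # Hypothetical layout: METHOD|payload-hash|normalized-url.
    payload_hash = hashlib.sha256(payload or b'').hexdigest()[:8]
    return f'{method.upper()}|{payload_hash}|{normalized_url}'
```

With the extended key, two POST requests to the same URL with different bodies produce different keys, so neither is dropped as a duplicate.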

compute_weighted_avg

  • compute_weighted_avg(values, weights): float
  • Computes the weighted average of a list of numbers, using a matching list of weights.


    Parameters

    • values: list[float]

      List of values.

    • weights: list[float]

      List of weights.

    Returns float

    The weighted average of the values.
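A minimal sketch of a weighted average, sum(value * weight) / sum(weight). The validation and the choice of ValueError are assumptions; the documentation above does not say how invalid input is handled.

```python
from typing import List


def compute_weighted_avg(values: List[float], weights: List[float]) -> float:
    """Return sum(v * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        # Assumption: mismatched lengths are rejected.
        raise ValueError('values and weights must have the same length')
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: a zero total weight is rejected rather than dividing by zero.
        raise ValueError('the total weight must not be zero')
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```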

configure_logger

  • configure_logger(logger, configuration, *, remove_old_handlers): None
  • Parameters

    • logger: logging.Logger
    • configuration: Configuration
    • keyword-only remove_old_handlers: bool = False

    Returns None

convert_to_absolute_url

  • convert_to_absolute_url(base_url, relative_url): str
  • Convert a relative URL to an absolute URL using a base URL.


    Parameters

    • base_url: str
    • relative_url: str

    Returns str
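Resolving a relative URL against a base can be sketched with urllib.parse.urljoin from the standard library. Whether the library uses urljoin internally is an assumption.

```python
from urllib.parse import urljoin


def convert_to_absolute_url(base_url: str, relative_url: str) -> str:
    """Resolve `relative_url` against `base_url`."""
    return urljoin(base_url, relative_url)
```

Note that the trailing slash on the base matters: a base of https://example.com/docs/ keeps the docs segment, while https://example.com/docs replaces it.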

create

  • create(project_name, template): None
  • Bootstrap a new Crawlee project.


    Parameters

    • optional project_name: Optional[str] = None

      The name of the project and the directory that will be created to contain it. If none is given, you will be prompted.

    • optional template: Optional[str] = None

      The template to be used to create the project. If none is given, you will be prompted.

    Returns None

create_dataset_from_directory

  • create_dataset_from_directory(storage_directory, memory_storage_client, id, name): DatasetClient

create_kvs_from_directory

  • create_kvs_from_directory(storage_directory, memory_storage_client, id, name): KeyValueStoreClient

create_rq_from_directory

  • create_rq_from_directory(storage_directory, memory_storage_client, id, name): RequestQueueClient

crypto_random_object_id

  • crypto_random_object_id(length): str
  • Generates a random object ID.


    Parameters

    • length: int = 17

    Returns str
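One way to generate such an ID is with the standard library's secrets module. The alphanumeric alphabet used here is an assumption; the documentation above only specifies the default length.

```python
import secrets
import string


def crypto_random_object_id(length: int = 17) -> str:
    """Return a random ID built from a cryptographically secure source."""
    alphabet = string.ascii_letters + string.digits  # assumed alphabet
    return ''.join(secrets.choice(alphabet) for _ in range(length))
```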

determine_file_extension

  • determine_file_extension(content_type): str | None
  • Determine the file extension for a given MIME content type.


    Parameters

    • content_type: str

      The MIME content type string.

    Returns str | None

    A string representing the determined file extension without a leading dot, or None if no extension could be determined.
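A minimal sketch using the standard library's mimetypes module. Stripping parameters such as "; charset=utf-8" before the lookup is an assumption about how such input would be handled.

```python
import mimetypes
from typing import Optional


def determine_file_extension(content_type: str) -> Optional[str]:
    """Map a MIME content type to an extension without the leading dot."""
    # Drop parameters such as "; charset=utf-8" (assumed behavior).
    mime_type = content_type.split(';', 1)[0].strip().lower()
    extension = mimetypes.guess_extension(mime_type)
    return extension.lstrip('.') if extension else None
```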

extract_query_params

  • extract_query_params(url): dict[str, list[str]]
  • Extract query parameters from a given URL.


    Parameters

    • url: str

    Returns dict[str, list[str]]
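The return type dict[str, list[str]] matches what urllib.parse.parse_qs produces, so the function can be sketched as follows. That the library relies on parse_qs is an assumption.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlparse


def extract_query_params(url: str) -> Dict[str, List[str]]:
    """Return the query parameters of `url`, with repeated keys collected into lists."""
    return parse_qs(urlparse(url).query)
```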

filter_out_none_values_recursively

  • filter_out_none_values_recursively(dictionary, *, remove_empty_dicts): dict | None
  • Recursively filters out None values from a dictionary.


    Parameters

    • dictionary: dict

      The dictionary to filter.

    • keyword-only remove_empty_dicts: bool = False

      Flag indicating whether to remove empty nested dictionaries.

    Returns dict | None

    A copy of the dictionary with all None values (and potentially empty dictionaries) removed.
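A minimal recursive sketch of the filtering described above. Returning None when the top-level result is empty and remove_empty_dicts is set is an assumption inferred from the dict | None return type.

```python
from typing import Optional


def filter_out_none_values_recursively(
    dictionary: dict,
    *,
    remove_empty_dicts: bool = False,
) -> Optional[dict]:
    """Return a copy of `dictionary` without None values, descending into nested dicts."""
    result = {}
    for key, value in dictionary.items():
        if isinstance(value, dict):
            # A nested dict that ends up empty becomes None when remove_empty_dicts is set.
            value = filter_out_none_values_recursively(value, remove_empty_dicts=remove_empty_dicts)
        if value is not None:
            result[key] = value
    if not result and remove_empty_dicts:
        return None
    return result
```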

find_or_create_client_by_id_or_name_inner

  • find_or_create_client_by_id_or_name_inner(resource_client_class, memory_storage_client, id, name): TResourceClient | None
  • Locates or creates a new storage client based on the given ID or name.

    This method attempts to find a storage client in the memory cache first. If not found, it tries to locate a storage directory by name. If still not found, it searches through storage directories for a matching ID or name in their metadata. If none exists, and the specified ID is 'default', it checks for a default storage directory. If a storage client is found or created, it is added to the memory cache. If no storage client can be located or created, the method returns None.


    Parameters

    • resource_client_class: type[TResourceClient]

      The class of the resource client.

    • memory_storage_client: MemoryStorageClient

      The memory storage client used to store and retrieve storage clients.

    • id: str | None = None

      The unique identifier for the storage client.

    • name: str | None = None

      The name of the storage client.

    Returns TResourceClient | None

    The found or created storage client, or None if no client could be found or created.

force_remove

  • async force_remove(filename): None
  • Removes a file, suppressing the FileNotFoundError if it does not exist.

    Analogous to rm(filename, { force: true }) in JavaScript.


    Parameters

    • filename: str

      The path to the file to be removed.

    Returns None
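A minimal sketch of the same behavior, assuming the blocking os.remove call is moved to a worker thread with asyncio.to_thread (Python 3.9+). How the library performs the removal internally is not stated above.

```python
import asyncio
import contextlib
import os


async def force_remove(filename: str) -> None:
    """Remove a file, ignoring the case where it does not exist."""
    with contextlib.suppress(FileNotFoundError):
        # Run the blocking call in a thread so the event loop stays responsive.
        await asyncio.to_thread(os.remove, filename)
```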

force_rename

  • async force_rename(src_dir, dst_dir): None
  • Renames a directory, ensuring that the destination directory is removed if it exists.


    Parameters

    • src_dir: str

      The source directory path.

    • dst_dir: str

      The destination directory path.

    Returns None
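A minimal sketch under the same assumption of offloading blocking calls with asyncio.to_thread (Python 3.9+). The handling of a missing source directory is not described above; this sketch simply lets os.rename raise.

```python
import asyncio
import os
import shutil


async def force_rename(src_dir: str, dst_dir: str) -> None:
    """Rename `src_dir` to `dst_dir`, removing an existing destination first."""
    if await asyncio.to_thread(os.path.exists, dst_dir):
        # Remove the existing destination directory and everything in it.
        await asyncio.to_thread(shutil.rmtree, dst_dir, ignore_errors=True)
    await asyncio.to_thread(os.rename, src_dir, dst_dir)
```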

get_configuration

get_configuration_if_set

  • Get the configuration object, or None if it hasn't been set yet.


    Returns Configuration | None

get_configured_log_level

  • get_configured_log_level(configuration): int

get_cpu_info

  • Retrieves the current CPU usage.

    It utilizes the psutil library. Function psutil.cpu_percent() returns a float representing the current system-wide CPU utilization as a percentage.


    Returns CpuInfo

get_event_manager

get_memory_info

  • Retrieves the current memory usage of the process and its children.

    It utilizes the psutil library.


    Returns MemoryInfo

get_or_create_inner

  • async get_or_create_inner(*, memory_storage_client, storage_client_cache, resource_client_class, name, id): TResourceClient
  • Retrieve a named storage, or create a new one when it doesn't exist.


    Parameters

    • keyword-only memory_storage_client: MemoryStorageClient

      The memory storage client.

    • keyword-only storage_client_cache: list[TResourceClient]

      The cache of storage clients.

    • keyword-only resource_client_class: type[TResourceClient]

      The class of the storage to retrieve or create.

    • keyword-only name: str | None = None

      The name of the storage to retrieve or create.

    • keyword-only id: str | None = None

      ID of the storage to retrieve or create.

    Returns TResourceClient

    The retrieved or newly-created storage.

get_storage_client

  • Get the storage client instance for the current environment.


    Parameters

    • keyword-only client_type: StorageClientType | None = None

      Allows retrieving a specific storage client type, regardless of where we are running.

    Returns BaseStorageClient

    The current storage client instance.

infinite_scroll

  • async infinite_scroll(page): None
  • Scroll to the bottom of a page, handling loading of additional items.


    Parameters

    • page: Page

    Returns None

is_content_type

  • is_content_type(content_type_enum, content_type): bool
  • Check if the provided content type string matches the specified ContentType.


    Parameters

    Returns bool

is_file_or_bytes

  • is_file_or_bytes(value): bool
  • Determine if the input value is a file-like object or bytes.

    This function checks whether the provided value is an instance of bytes, bytearray, or io.IOBase (file-like). The method is simplified for common use cases and may not cover all edge cases.


    Parameters

    • value: Any

      The value to be checked.

    Returns bool

    True if the value is either a file-like object or bytes, False otherwise.
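The check described above translates directly into a single isinstance call. A minimal sketch:

```python
import io
from typing import Any


def is_file_or_bytes(value: Any) -> bool:
    """Return True for bytes, bytearray, or io.IOBase (file-like) instances."""
    return isinstance(value, (bytes, bytearray, io.IOBase))
```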

is_url_absolute

  • is_url_absolute(url): bool
  • Check if a URL is absolute.


    Parameters

    • url: str

    Returns bool
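One common way to test for an absolute URL is to require both a scheme and a network location. The sketch below makes that assumption; the library's exact criterion is not stated above.

```python
from urllib.parse import urlparse


def is_url_absolute(url: str) -> bool:
    """Return True when `url` has both a scheme and a network location."""
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)
```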

json_dumps

  • async json_dumps(obj): str
  • Serialize an object to a JSON-formatted string with specific settings.


    Parameters

    • obj: Any

      The object to serialize.

    Returns str

    A string containing the JSON representation of the input object.

maybe_extract_enum_member_value

  • maybe_extract_enum_member_value(maybe_enum_member): Any
  • Extract the value of an enumeration member if it is an Enum, otherwise return the original value.


    Parameters

    • maybe_enum_member: Any

    Returns Any
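A minimal sketch of the described behavior. The Color enum is a hypothetical example type used only to show the call.

```python
from enum import Enum
from typing import Any


def maybe_extract_enum_member_value(maybe_enum_member: Any) -> Any:
    """Return `.value` for Enum members and the input unchanged otherwise."""
    if isinstance(maybe_enum_member, Enum):
        return maybe_enum_member.value
    return maybe_enum_member


class Color(Enum):  # hypothetical enum for illustration
    RED = 'red'
```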

maybe_parse_body

  • maybe_parse_body(body, content_type): Any
  • Parse the response body based on the content type.


    Parameters

    • body: bytes
    • content_type: str

    Returns Any

measure_time

  • Measure the execution time (wall-clock and CPU) between the start and end of the with-block.


    Returns Iterator[TimerResult]
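A timing context manager of this kind can be sketched with contextlib.contextmanager. The TimerResult fields (wall, cpu) and the choice of time.perf_counter and time.process_time are assumptions; the documentation above only states that wall-clock and CPU time are measured.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TimerResult:
    wall: Optional[float] = None  # elapsed wall-clock seconds (assumed field name)
    cpu: Optional[float] = None   # elapsed CPU seconds (assumed field name)


@contextmanager
def measure_time() -> Iterator[TimerResult]:
    """Fill a TimerResult with the time spent inside the with-block."""
    result = TimerResult()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield result
    finally:
        # The fields are populated only once the with-block has exited.
        result.wall = time.perf_counter() - wall_start
        result.cpu = time.process_time() - cpu_start
```

The result object is yielded before the measurement finishes, so its fields should be read after the with-block, not inside it.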

normalize_url

  • normalize_url(url, *, keep_url_fragment): str
  • Normalizes a URL.

    This function cleans and standardizes a URL by removing leading and trailing whitespace, converting the scheme and netloc to lower case, stripping unwanted tracking parameters (specifically those beginning with 'utm_'), sorting the remaining query parameters alphabetically, and optionally retaining the URL fragment. The goal is to ensure that URLs that are functionally identical but differ in trivial ways (such as parameter order or casing) are treated as the same.


    Parameters

    • url: str

      The URL to be normalized.

    • keyword-only keep_url_fragment: bool = False

      Flag to determine whether the fragment part of the URL should be retained.

    Returns str

    A string containing the normalized URL.
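The normalization steps listed above can be sketched with urllib.parse. This is an illustration of those steps, not the library's source; details such as the re-encoding of query values by urlencode are side effects of this particular sketch.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url(url: str, *, keep_url_fragment: bool = False) -> str:
    """Trim, lower-case scheme and netloc, drop utm_* parameters, and sort the query."""
    parsed = urlparse(url.strip())
    # Keep every query parameter except tracking parameters starting with 'utm_'.
    params = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.startswith('utm_')
    ]
    query = urlencode(sorted(params))
    fragment = parsed.fragment if keep_url_fragment else ''
    return urlunparse((
        parsed.scheme.lower(),
        parsed.netloc.lower(),
        parsed.path,
        parsed.params,
        query,
        fragment,
    ))
```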

open_storage

  • async open_storage(*, storage_class, storage_client, configuration, id, name): TResource
  • Open either a new storage or restore an existing one and return it.


    Parameters

    • keyword-only storage_class: type[TResource]
    • keyword-only storage_client: BaseStorageClient | None = None
    • keyword-only configuration: Configuration | None = None
    • keyword-only id: str | None = None
    • keyword-only name: str | None = None

    Returns TResource

persist_metadata_if_enabled

  • async persist_metadata_if_enabled(*, data, entity_directory, write_metadata): None
  • Updates or writes metadata to a specified directory.

    The function writes a given metadata dictionary to a JSON file within a specified directory. The writing process is skipped if write_metadata is False. Before writing, it ensures that the target directory exists, creating it if necessary.


    Parameters

    • keyword-only data: dict

      A dictionary containing metadata to be written.

    • keyword-only entity_directory: str

      The directory path where the metadata file should be stored.

    • keyword-only write_metadata: bool

      A boolean flag indicating whether the metadata should be written to file.

    Returns None
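A minimal sketch of the write path described above. The METADATA_FILENAME value used here is a hypothetical placeholder (the constant's actual value is not shown in this reference), and the JSON formatting options are assumptions.

```python
import asyncio
import json
import os

METADATA_FILENAME = '__metadata__.json'  # hypothetical value for illustration


async def persist_metadata_if_enabled(*, data: dict, entity_directory: str, write_metadata: bool) -> None:
    """Write `data` as JSON into `entity_directory` unless writing is disabled."""
    if not write_metadata:
        return
    # Make sure the target directory exists before writing.
    await asyncio.to_thread(os.makedirs, entity_directory, exist_ok=True)
    path = os.path.join(entity_directory, METADATA_FILENAME)

    def _write() -> None:
        with open(path, 'w', encoding='utf-8') as file:
            json.dump(data, file, ensure_ascii=False, indent=2)

    await asyncio.to_thread(_write)
```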

raise_on_duplicate_storage

  • raise_on_duplicate_storage(client_type, key_name, value): NoReturn
  • Raise an error indicating that a storage with the provided key name and value already exists.


    Parameters

    Returns NoReturn

raise_on_non_existing_storage

  • raise_on_non_existing_storage(client_type, id): NoReturn
  • Raise an error indicating that a storage with the provided id does not exist.


    Parameters

    Returns NoReturn

remove_storage_from_cache

  • remove_storage_from_cache(*, storage_class, id, name): None
  • Remove a storage from cache by ID or name.


    Parameters

    • keyword-only storage_class: type
    • keyword-only id: str | None = None
    • keyword-only name: str | None = None

    Returns None

set_cloud_storage_client

  • set_cloud_storage_client(cloud_client): None
  • Set the cloud storage client instance.


    Parameters

    Returns None

set_configuration

  • set_configuration(configuration): None
  • Set the configuration object.


    Parameters

    Returns None

set_default_storage_client_type

  • set_default_storage_client_type(client_type): None
  • Set the default storage client type.


    Parameters

    Returns None

set_event_manager

  • set_event_manager(event_manager): None
  • Set the event manager.


    Parameters

    Returns None

set_local_storage_client

  • set_local_storage_client(local_client): None
  • Set the local storage client instance.


    Parameters

    Returns None

unique_key_to_request_id

  • unique_key_to_request_id(unique_key, *, request_id_length): str
  • Generate a deterministic request ID based on a unique key.


    Parameters

    • unique_key: str

      The unique key to convert into a request ID.

    • keyword-only request_id_length: int = 15

      The length of the request ID.

    Returns str

    A URL-safe, truncated request ID based on the unique key.
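One way to derive a deterministic, URL-safe ID from a key is to hash it, Base64-encode the digest, drop the characters that are not URL-safe, and truncate. The specific choice of SHA-256 and Base64 below is an assumption; the documentation above only promises a deterministic, URL-safe, truncated result.

```python
import base64
import hashlib
import re


def unique_key_to_request_id(unique_key: str, *, request_id_length: int = 15) -> str:
    """Derive a deterministic, URL-safe ID from `unique_key`."""
    digest = hashlib.sha256(unique_key.encode('utf-8')).digest()
    encoded = base64.b64encode(digest).decode('ascii')
    # Remove '+', '/' and '=' so the result is safe to embed in URLs.
    url_safe = re.sub(r'[+/=]', '', encoded)
    return url_safe[:request_id_length]
```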

validate_http_url

  • validate_http_url(value): str | None
  • Validate the given HTTP URL.


    Parameters

    • value: str | None

    Returns str | None

wait_for

  • async wait_for(operation, *, timeout, timeout_message, max_retries, logger): T
  • Wait for an async operation to complete.

    If the wait times out, TimeoutError is raised and the future is cancelled. Optionally retry on error.


    Parameters

    • operation: Callable[[], Awaitable[T]]

      A function that returns the future to wait for.

    • keyword-only timeout: timedelta

      How long to wait before cancelling the future.

    • keyword-only timeout_message: str | None = None

      Message to include in the TimeoutError if the wait times out.

    • keyword-only max_retries: int = 1

      How many times the operation should be attempted.

    • keyword-only logger: Logger

      Used to report information about retries as they happen.

    Returns T
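A minimal sketch of a timeout-plus-retry wrapper built on asyncio.wait_for. Two choices here are assumptions rather than documented behavior: a timeout is raised immediately instead of being retried, and a default message is used when timeout_message is None.

```python
import asyncio
import logging
from datetime import timedelta
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar('T')


async def wait_for(
    operation: Callable[[], Awaitable[T]],
    *,
    timeout: timedelta,
    timeout_message: Optional[str] = None,
    max_retries: int = 1,
    logger: logging.Logger,
) -> T:
    """Await `operation()` with a timeout, retrying on non-timeout errors."""
    if max_retries < 1:
        raise ValueError('max_retries must be at least 1')
    for attempt in range(1, max_retries + 1):
        try:
            # asyncio.wait_for cancels the awaited operation when the timeout expires.
            return await asyncio.wait_for(operation(), timeout.total_seconds())
        except asyncio.TimeoutError as exc:
            # Assumption: a timeout is reported immediately and is not retried.
            raise asyncio.TimeoutError(timeout_message or 'Operation timed out') from exc
        except Exception as exc:
            if attempt == max_retries:
                raise
            logger.warning('Attempt %d of %d failed: %r. Retrying.', attempt, max_retries, exc)
    raise AssertionError('unreachable')
```

Passing a callable rather than an awaitable lets each retry start a fresh operation, since a coroutine object cannot be awaited twice.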

wait_for_all_tasks_for_finish

  • async wait_for_all_tasks_for_finish(tasks, *, logger, timeout): None
  • Wait for all tasks to finish or until the timeout is reached.


    Parameters

    • tasks: Sequence[asyncio.Task]

      A sequence of asyncio tasks to wait for.

    • keyword-only logger: Logger

      Logger to use for reporting.

    • keyword-only timeout: timedelta | None = None

      How long to wait before cancelling the tasks.

    Returns None
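A minimal sketch based on asyncio.wait, which returns the finished and still-pending tasks once the timeout elapses. Cancelling the pending tasks follows the timeout description above; the log messages and the decision not to raise afterwards are assumptions.

```python
import asyncio
import logging
from datetime import timedelta
from typing import Optional, Sequence


async def wait_for_all_tasks_for_finish(
    tasks: Sequence[asyncio.Task],
    *,
    logger: logging.Logger,
    timeout: Optional[timedelta] = None,
) -> None:
    """Wait for `tasks`, cancelling whatever is still running after `timeout`."""
    if not tasks:
        # asyncio.wait rejects an empty collection, so return early.
        return
    seconds = timeout.total_seconds() if timeout is not None else None
    _, pending = await asyncio.wait(tasks, timeout=seconds)
    if pending:
        logger.warning('Timeout reached; cancelling %d unfinished task(s).', len(pending))
        for task in pending:
            task.cancel()
        # Let the cancellations propagate before returning.
        await asyncio.gather(*pending, return_exceptions=True)
```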

Properties

__version__

__version__:

AsyncListener

AsyncListener:

BrowserType

BrowserType:

cli

cli:

CLOUDFLARE_RETRY_CSS_SELECTORS

CLOUDFLARE_RETRY_CSS_SELECTORS:

COMMON_ACCEPT

COMMON_ACCEPT:

COMMON_ACCEPT_LANGUAGE

COMMON_ACCEPT_LANGUAGE:

CreateSessionFunctionType

CreateSessionFunctionType:

ErrorHandler

ErrorHandler:

EventData

EventData:

FailedRequestHandler

FailedRequestHandler:

HttpMethod

HttpMethod: TypeAlias

HttpPayload

HttpPayload: TypeAlias

KvsValueType

KvsValueType:

Listener

Listener:

logger

logger:

METADATA_FILENAME

METADATA_FILENAME:

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA:

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_MOBILE

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_MOBILE:

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_PLATFORM

PW_CHROMIUM_HEADLESS_DEFAULT_SEC_CH_UA_PLATFORM:

PW_CHROMIUM_HEADLESS_DEFAULT_USER_AGENT

PW_CHROMIUM_HEADLESS_DEFAULT_USER_AGENT:

PW_FIREFOX_HEADLESS_DEFAULT_USER_AGENT

PW_FIREFOX_HEADLESS_DEFAULT_USER_AGENT:

PW_WEBKIT_HEADLESS_DEFAULT_USER_AGENT

PW_WEBKIT_HEADLESS_DEFAULT_USER_AGENT:

RequestHandler

RequestHandler:

ResourceClient

ResourceClient:

ResourceCollectionClient

ResourceCollectionClient:

RETRY_CSS_SELECTORS

RETRY_CSS_SELECTORS:

CSS selectors for elements that should trigger a retry, as the crawler is likely getting blocked.

ROTATE_PROXY_ERRORS

ROTATE_PROXY_ERRORS:

Content of proxy errors that should trigger a retry, as the proxy is likely getting blocked / is malfunctioning.

Snapshot

Snapshot:

StorageClientType

StorageClientType:

SyncListener

SyncListener:

T

T:

TCrawlingContext

TCrawlingContext:

TEMPLATE_LIST_URL

TEMPLATE_LIST_URL:

timedelta_ms

timedelta_ms:

TMiddlewareCrawlingContext

TMiddlewareCrawlingContext:

TResource

TResource:

TResourceClient

TResourceClient:

TStatisticsState

TStatisticsState:

USER_AGENT_POOL

USER_AGENT_POOL:

user_data_adapter

user_data_adapter:

WrappedListener

WrappedListener: