Request

Represents a request in the Crawlee framework, containing the necessary information for crawling operations.

The Request class is one of the core components in Crawlee, utilized by various components such as request providers, HTTP clients, crawlers, and more. It encapsulates the essential data for executing web requests, including the URL, HTTP method, headers, payload, and user data. The user data allows custom information to be stored and persisted throughout the request lifecycle, including its retries.

Key functionality includes managing the request's identifier (id) and unique key (unique_key), which is used for request deduplication, as well as controlling retries, handling state management, and configuring session rotation and proxy handling.

The recommended way to create a new instance is by using the Request.from_url constructor, which automatically generates a unique key and identifier based on the URL and request parameters.

Usage

from crawlee import Request

request = Request.from_url('https://crawlee.dev')

Methods

__eq__

  • __eq__(*, other): bool
  • Compare all relevant fields of the Request class, excluding deprecated fields json_ and order_no.

    TODO: Remove this method once the issue is resolved. https://github.com/apify/crawlee-python/issues/94


    Parameters

    • optional, keyword-only other: object

    Returns bool

crawl_depth

  • crawl_depth(*, new_value): None
  • Parameters

    • optional, keyword-only new_value: int

    Returns None

enqueue_strategy

  • enqueue_strategy(*, new_enqueue_strategy): None
  • Parameters

    • optional, keyword-only new_enqueue_strategy: EnqueueStrategy

    Returns None

forefront

  • forefront(*, new_value): None
  • Parameters

    • optional, keyword-only new_value: bool

    Returns None

from_base_request_data

  • from_base_request_data(*, base_request_data, id): Self
  • Create a complete Request object based on a BaseRequestData instance.


    Parameters

    • optional, keyword-only base_request_data: BaseRequestData
    • optional, keyword-only id: str | None = None

    Returns Self

from_url

  • from_url(*, url, method, headers, payload, label, unique_key, id, keep_url_fragment, use_extended_unique_key, always_enqueue, kwargs): Self
  • Create a new Request instance from a URL.

    This is the recommended constructor for creating new Request instances. It generates a Request object from a given URL, with additional options to customize the HTTP method, payload, unique key, and other request properties. If no unique_key or id is provided, they are computed automatically from the URL, method, and payload; the computation also depends on the keep_url_fragment and use_extended_unique_key flags.


    Parameters

    • optional, keyword-only url: str

      The URL of the request.

    • optional, keyword-only method: HttpMethod = 'GET'

      The HTTP method of the request.

    • optional, keyword-only headers: (HttpHeaders | dict[str, str]) | None = None

      The HTTP headers of the request.

    • optional, keyword-only payload: (HttpPayload | str) | None = None

      The data to be sent as the request body. Typically used with 'POST' or 'PUT' requests.

    • optional, keyword-only label: str | None = None

      A custom label to differentiate between request types. This is stored in user_data, and it is used for request routing (different requests go to different handlers).

    • optional, keyword-only unique_key: str | None = None

      A unique key identifying the request. If not provided, it is automatically computed based on the URL and other parameters. Requests with the same unique_key are treated as identical.

    • optional, keyword-only id: str | None = None

      A unique identifier for the request. If not provided, it is automatically generated from the unique_key.

    • optional, keyword-only keep_url_fragment: bool = False

      Determines whether the URL fragment (e.g., `#section`) should be included in the unique_key computation. This is only relevant when unique_key is not provided.

    • optional, keyword-only use_extended_unique_key: bool = False

      Determines whether to include the HTTP method and payload in the unique_key computation. This is only relevant when unique_key is not provided.

    • optional, keyword-only always_enqueue: bool = False

      If set to True, the request will be enqueued even if it is already present in the queue. Using this is not allowed when a custom unique_key is also provided and will result in a ValueError.

    • optional, keyword-only kwargs: Any

    Returns Self
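
    For illustration, here is a hedged sketch combining the parameters above; the URLs, payload, and label values are invented for the example and are not part of the API.

    from crawlee import Request

    # Plain GET request; unique_key and id are computed from the URL.
    request = Request.from_url('https://crawlee.dev/docs/quick-start')

    # POST request with a payload and a routing label; the method and payload
    # are only reflected in the unique_key when use_extended_unique_key=True.
    search_request = Request.from_url(
        'https://crawlee.dev/api/search',
        method='POST',
        payload='{"query": "crawler"}',
        label='SEARCH',
        use_extended_unique_key=True,
    )

    # Keep the URL fragment in the unique_key computation, so pages that differ
    # only in their fragment are treated as distinct requests.
    section_request = Request.from_url(
        'https://crawlee.dev/docs/introduction#setting-up',
        keep_url_fragment=True,
    )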

get_query_param_from_url

  • get_query_param_from_url(*, param, default): str | None
  • Get the value of a specific query parameter from the URL.


    Parameters

    • optional, keyword-only param: str
    • optional, keyword-only default: str | None = None

    Returns str | None
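
    For example, assuming a request whose URL carries a query string (the URL and parameter names below are illustrative):

    from crawlee import Request

    request = Request.from_url('https://crawlee.dev/search?query=proxy&page=2')

    print(request.get_query_param_from_url(param='query'))                  # 'proxy'
    print(request.get_query_param_from_url(param='missing', default='n/a'))  # 'n/a'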

last_proxy_tier

  • last_proxy_tier(*, new_value): None
  • Parameters

    • optional, keyword-only new_value: int

    Returns None

max_retries

  • max_retries(*, new_max_retries): None
  • Parameters

    • optional, keyword-only new_max_retries: int

    Returns None

session_rotation_count

  • session_rotation_count(*, new_session_rotation_count): None
  • Parameters

    • optional, keyword-only new_session_rotation_count: int

    Returns None

state

  • state(*, new_state): None
  • Parameters

    • optional, keyword-only new_state: RequestState

    Returns None

Properties

crawl_depth

crawl_depth: int

The depth of the request in the crawl tree.

crawlee_data

crawlee_data: CrawleeRequestData

Crawlee-specific configuration stored in the user_data.

enqueue_strategy

enqueue_strategy: EnqueueStrategy

The strategy used when enqueueing the request.

forefront

forefront: bool

Indicates whether the request should be enqueued at the front of the queue.

handled_at

handled_at: datetime | None

Timestamp when the request was handled.

headers

headers: HttpHeaders

HTTP request headers.

id

id: str

A unique identifier for the request. Note that this is not used for deduplication, and should not be confused with unique_key.

json_

json_: str | None

Deprecated internal field, do not use it.

Should be removed as part of https://github.com/apify/crawlee-python/issues/94.

label

label: str | None

A string used to differentiate between arbitrary request types.

last_proxy_tier

last_proxy_tier: int | None

The last proxy tier used to process the request.

loaded_url

loaded_url: str | None

URL of the web page that was loaded. This can differ from the original URL in case of redirects.

max_retries

max_retries: int | None

Crawlee-specific limit on the number of retries of the request.

method

method: HttpMethod

HTTP request method.

model_config

model_config: Undefined

no_retry

no_retry: bool

If set to True, the request will not be retried in case of failure.

order_no

order_no: Decimal | None

Deprecated internal field, do not use it.

Should be removed as part of https://github.com/apify/crawlee-python/issues/94.

payload

payload: HttpPayload | None

HTTP request payload.

TODO: Re-check the need for Validator and Serializer once the issue is resolved. https://github.com/apify/crawlee-python/issues/94

retry_count

retry_count: int

Number of times the request has been retried.

session_rotation_count

session_rotation_count: int | None

Crawlee-specific number of finished session rotations for the request.

state

state: RequestState | None

Crawlee-specific request handling state.

unique_key

unique_key: str

A unique key identifying the request. Two requests with the same unique_key are considered as pointing to the same URL.

If unique_key is not provided, then it is automatically generated by normalizing the URL. For example, the URL of HTTP://www.EXAMPLE.com/something/ will produce the unique_key of http://www.example.com/something.

Pass an arbitrary non-empty text value to the unique_key property to override the default behavior and specify which URLs shall be considered equal.
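
A short sketch of this behavior, based on the normalization example above (the custom key value is invented for illustration):

from crawlee import Request

# Differently written URLs normalize to the same unique_key, so the second
# request would be deduplicated against the first.
a = Request.from_url('HTTP://www.EXAMPLE.com/something/')
b = Request.from_url('http://www.example.com/something')
print(a.unique_key == b.unique_key)  # True

# A custom unique_key overrides URL-based deduplication entirely.
c = Request.from_url(
    'http://www.example.com/something?session=abc',
    unique_key='product-42',
)
print(c.unique_key)  # 'product-42'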

url

url: str

The URL of the web page to crawl. Must be a valid HTTP or HTTPS URL, and may include query parameters and fragments.

user_data

user_data: dict[str, JsonSerializable]

Custom user data assigned to the request. Use this to save any request-related data within the request's scope, keeping it accessible on retries, failures, etc.
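
As a final sketch, user_data can carry arbitrary JSON-serializable values that travel with the request; this assumes user_data supports dict-style assignment, as its dict[str, JsonSerializable] type suggests (the keys and values are illustrative):

from crawlee import Request

request = Request.from_url('https://crawlee.dev', label='HOMEPAGE')

# Attach custom data; it stays with the request across retries and failures.
request.user_data['category'] = 'docs'
request.user_data['discovered_at_depth'] = 0

print(request.user_data['category'])  # 'docs'
print(request.label)                  # 'HOMEPAGE' (the label is stored in user_data, see above)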