Skip to main content

RequestQueue

crawlee.storages._request_queue.RequestQueue

Represents a queue storage for managing HTTP requests in web crawling operations.

The RequestQueue class handles a queue of HTTP requests, each identified by a unique URL, to facilitate structured web crawling. It supports both breadth-first and depth-first crawling strategies, allowing for recursive crawling starting from an initial set of URLs. Each URL in the queue is uniquely identified by a unique_key, which can be customized to allow the same URL to be added multiple times under different keys.

Data can be stored either locally or in the cloud. It depends on the setup of underlying storage client. By default a MemoryStorageClient is used, but it can be changed to a different one.

By default, data is stored using the following path structure:

{CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/{REQUEST_ID}.json
  • {CRAWLEE_STORAGE_DIR}: The root directory for all storage data specified by the environment variable.
  • {QUEUE_ID}: The identifier for the request queue, either "default" or as specified.
  • {REQUEST_ID}: The unique identifier for each request in the queue.

The RequestQueue supports both creating new queues and opening existing ones by id or name. Named queues persist indefinitely, while unnamed queues expire after 7 days unless specified otherwise. The queue supports mutable operations, allowing URLs to be added and removed as needed.

Usage:

rq = await RequestQueue.open(name='my_rq')

Index

Constructors

__init__

  • __init__(id, name, configuration, client, event_manager): None
  • Parameters

    • id: str
    • name: str | None
    • configuration: Configuration
    • client: BaseStorageClient
    • event_manager: EventManager

    Returns None

Methods

add_request

  • async add_request(request, *, forefront): ProcessedRequest
  • Parameters

    • request: str | Request
    • forefront: bool = Falsekeyword-only

    Returns ProcessedRequest

add_requests_batched

  • async add_requests_batched(requests, *, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Parameters

    • requests: Sequence[str | Request]
    • batch_size: int = 1000keyword-only
    • wait_time_between_batches: timedelta = timedelta(seconds=1)keyword-only
    • wait_for_all_requests_to_be_added: bool = Falsekeyword-only
    • wait_for_all_requests_to_be_added_timeout: timedelta | None = Nonekeyword-only

    Returns None

drop

  • async drop(*, timeout): None
  • Parameters

    • timeout: timedelta | None = Nonekeyword-only

    Returns None

fetch_next_request

  • async fetch_next_request(): Request | None
  • Return the next request in the queue to be processed.

    Once you successfully finish processing of the request, you need to call RequestQueue.mark_request_as_handled to mark the request as handled in the queue. If there was some error in processing the request, call RequestQueue.reclaim_request instead, so that the queue will give the request to some other consumer in another call to the fetch_next_request method.

    Note that the None return value does not mean the queue processing finished, it means there are currently no pending requests. To check whether all requests in queue were finished, use RequestQueue.is_finished instead.


    Returns Request | None

get_handled_count

  • async get_handled_count(): int
  • Returns int

get_info

  • async get_info(): RequestQueueMetadata | None
  • Get an object containing general information about the request queue.


    Returns RequestQueueMetadata | None

get_request

  • async get_request(request_id): Request | None
  • Retrieve a request from the queue.


    Parameters

    • request_id: str

    Returns Request | None

get_total_count

  • async get_total_count(): int
  • Returns int

is_empty

  • async is_empty(): bool
  • Check whether the queue is empty.


    Returns bool

    True if the next call to RequestQueue.fetch_next_request would return None, otherwise False.

is_finished

  • async is_finished(): bool
  • Check whether the queue is finished.

    Due to the nature of distributed storage used by the queue, the function might occasionally return a false negative, but it will never return a false positive.


    Returns bool

    True if all requests were already handled and there are no more left. False otherwise.

mark_request_as_handled

  • async mark_request_as_handled(request): ProcessedRequest | None
  • Mark a request as handled after successful processing.

    Handled requests will never again be returned by the RequestQueue.fetch_next_request method.


    Parameters

    • request: Request

    Returns ProcessedRequest | None

open

  • async open(*, id, name, configuration, storage_client): RequestQueue
  • Parameters

    • id: str | None = Nonekeyword-only
    • name: str | None = Nonekeyword-only
    • configuration: Configuration | None = Nonekeyword-only
    • storage_client: BaseStorageClient | None = Nonekeyword-only

    Returns RequestQueue

reclaim_request

  • async reclaim_request(request, *, forefront): ProcessedRequest | None
  • Reclaim a failed request back to the queue.

    The request will be returned for processing later again by another call to RequestQueue.fetch_next_request.


    Parameters

    • request: Request
    • forefront: bool = Falsekeyword-only

    Returns ProcessedRequest | None

Properties

id

id: str

name

name: str | None