RequestQueue

crawlee.storages.request_queue.RequestQueue

Represents a queue storage for HTTP requests to crawl.

Manages a queue of requests with unique URLs for structured deep web crawling with support for both breadth-first and depth-first orders. This queue is designed for crawling websites by starting with initial URLs and recursively following links. Each URL is uniquely identified by a unique_key field, which can be overridden to add the same URL multiple times under different keys.

Local storage path (if CRAWLEE_STORAGE_DIR is set): {CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/{REQUEST_ID}.json, where {QUEUE_ID} is the request queue's ID (default or specified) and {REQUEST_ID} is the request's ID.
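The path template above can be sketched with pathlib. This is only an illustration of the documented layout; the helper name request_file_path and the example IDs are hypothetical and not part of crawlee.

```python
from pathlib import Path


def request_file_path(storage_dir: str, queue_id: str, request_id: str) -> Path:
    """Build {CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/{REQUEST_ID}.json."""
    return Path(storage_dir) / "request_queues" / queue_id / f"{request_id}.json"


# Hypothetical values, chosen only to show the resulting shape of the path.
example_path = request_file_path("/data/storage", "default", "abc123")
print(example_path.as_posix())
```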

Queues can be created or opened by ID or name. Named queues are retained indefinitely; unnamed queues expire after 7 days unless specified otherwise. The queue is mutable: URLs can be added and deleted.

Usage: rq = await RequestQueue.open(id='my_rq_id')

Constructors

__init__

  • __init__(id, name, configuration, client): None
  • Parameters

    • id: str
    • name: str | None
    • configuration: Configuration
    • client: BaseStorageClient

    Returns None

Methods

add_request

  • async add_request(request, *, forefront): ProcessedRequest
  • Adds a request to the RequestQueue while managing deduplication and positioning within the queue.

    Deduplication relies on the request's unique_key field. If unique_key is already set, it is left unchanged; otherwise it is generated from the request's url, method, and payload fields. Generation is controlled by two flags: keep_url_fragment determines whether the URL fragment is included, and use_extended_unique_key determines whether the request's method and payload are included.

    The request is added to the forefront (beginning) of the queue if forefront is True, and to the back otherwise. The returned ProcessedRequest describes the outcome, including whether the request was already present or already handled.


    Parameters

    • request: Request | BaseRequestData | str
    • forefront: bool = False (keyword-only)

    Returns ProcessedRequest

    • requestId: The ID of the request.
    • uniqueKey: The unique key associated with the request.
    • wasAlreadyPresent (bool)
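The deduplication and positioning rules described for add_request can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not crawlee's implementation: sketch_unique_key, ToyQueue, and the key format are hypothetical, and the real key generation may normalize URLs and hash payloads differently.

```python
import hashlib
from collections import deque
from urllib.parse import urldefrag


def sketch_unique_key(url, method="GET", payload=None, *,
                      keep_url_fragment=False, use_extended_unique_key=False):
    """Illustrative key derivation; the format is an assumption."""
    if not keep_url_fragment:
        url = urldefrag(url).url  # drop "#fragment"
    if not use_extended_unique_key:
        return url
    digest = hashlib.sha256(payload or b"").hexdigest()[:8]
    return f"{method.upper()}|{digest}|{url}"


class ToyQueue:
    """Hypothetical stand-in that deduplicates by unique key."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def add(self, url, *, unique_key=None, forefront=False):
        # An explicit unique_key is kept as-is; otherwise one is generated.
        key = unique_key or sketch_unique_key(url)
        if key in self.seen:
            return {"unique_key": key, "was_already_present": True}
        self.seen.add(key)
        if forefront:
            self.pending.appendleft(url)  # beginning of the queue
        else:
            self.pending.append(url)  # back of the queue
        return {"unique_key": key, "was_already_present": False}


q = ToyQueue()
first = q.add("https://example.com/a#top")
dup = q.add("https://example.com/a#bottom")  # same key once the fragment is dropped
urgent = q.add("https://example.com/b", forefront=True)
again = q.add("https://example.com/a", unique_key="a-second-visit")
print(dup["was_already_present"], list(q.pending))
```

Overriding the key, as in the last call, is how the same URL can be enqueued more than once.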

add_requests_batched

  • async add_requests_batched(requests, *, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Parameters

    • requests: Sequence[str | BaseRequestData | Request]
    • batch_size: int = 1000 (keyword-only)
    • wait_time_between_batches: timedelta = timedelta(seconds=1) (keyword-only)
    • wait_for_all_requests_to_be_added: bool = False (keyword-only)
    • wait_for_all_requests_to_be_added_timeout: timedelta | None = None (keyword-only)

    Returns None
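The reference lists the parameters of add_requests_batched without describing them. The sketch below shows one plausible reading of batch_size and wait_time_between_batches: split the input into chunks and pause between chunks. It is an assumption based on the parameter names only; add_in_batches and record_batch are hypothetical helpers, and the real method may also process later batches in the background.

```python
import asyncio
from datetime import timedelta


async def add_in_batches(urls, add_batch, *, batch_size=1000,
                         wait_time_between_batches=timedelta(seconds=1)):
    """Feed urls to add_batch in chunks, sleeping between chunks."""
    for start in range(0, len(urls), batch_size):
        await add_batch(urls[start:start + batch_size])
        # Skip the pause after the final chunk.
        if start + batch_size < len(urls):
            await asyncio.sleep(wait_time_between_batches.total_seconds())


batches = []


async def record_batch(batch):
    # Stand-in for the storage call; it only records what it receives.
    batches.append(list(batch))


urls = [f"https://example.com/page/{i}" for i in range(5)]
asyncio.run(add_in_batches(urls, record_batch, batch_size=2,
                           wait_time_between_batches=timedelta(milliseconds=10)))
print([len(b) for b in batches])
```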

drop

  • async drop(*, timeout): None
  • Parameters

    • timeout: timedelta | None = None (keyword-only)

    Returns None

ensure_head_is_non_empty

  • async ensure_head_is_non_empty(*, ensure_consistency, limit, iteration): bool
  • Ensure that the queue head is non-empty.

    The method ensures that the queue head contains items. It may request more items than are currently in progress to guarantee that at least one item is present in the head of the queue.


    Parameters

    • ensure_consistency: bool = False (keyword-only)
    • limit: int | None = None (keyword-only)
    • iteration: int = 0 (keyword-only)

    Returns bool

fetch_next_request

  • async fetch_next_request(): Request | None
  • Return the next request in the queue to be processed.

    Once you have successfully processed the request, call RequestQueue.mark_request_as_handled to mark it as handled in the queue. If processing failed, call RequestQueue.reclaim_request instead, so that the queue hands the request to another consumer in a later call to fetch_next_request.

    Note that a None return value does not mean that queue processing has finished; it only means that there are currently no pending requests. To check whether all requests in the queue have been handled, use RequestQueue.is_finished instead.


    Returns Request | None
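The fetch, handle, then mark-or-reclaim cycle described above can be sketched as a consumer loop. To keep the example runnable without any storage backend, it uses ToyRequestQueue, a hypothetical in-memory stand-in that mirrors only the method names from this page; it is not crawlee's implementation, and flaky_handler is an invented handler that fails once.

```python
import asyncio


class ToyRequestQueue:
    """Hypothetical in-memory stand-in mirroring the documented method names."""

    def __init__(self, urls):
        self._pending = list(urls)
        self._in_progress = set()
        self._handled = set()

    async def fetch_next_request(self):
        if not self._pending:
            return None  # nothing pending right now; not necessarily finished
        url = self._pending.pop(0)
        self._in_progress.add(url)
        return url

    async def mark_request_as_handled(self, url):
        self._in_progress.discard(url)
        self._handled.add(url)

    async def reclaim_request(self, url, *, forefront=False):
        self._in_progress.discard(url)
        if forefront:
            self._pending.insert(0, url)
        else:
            self._pending.append(url)

    async def is_finished(self):
        return not self._pending and not self._in_progress


async def consume(queue, handler):
    while not await queue.is_finished():
        request = await queue.fetch_next_request()
        if request is None:
            await asyncio.sleep(0.01)  # other consumers may still reclaim work
            continue
        try:
            await handler(request)
        except Exception:
            await queue.reclaim_request(request)  # retry later
        else:
            await queue.mark_request_as_handled(request)


attempts = {}


async def flaky_handler(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url.endswith("/flaky") and attempts[url] == 1:
        raise RuntimeError("simulated transient failure")


queue = ToyRequestQueue(["https://example.com/ok", "https://example.com/flaky"])
asyncio.run(consume(queue, flaky_handler))
print(attempts)
```

The loop terminates on is_finished rather than on a None result, which matches the note that None only signals that nothing is pending at the moment.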

get_handled_count

  • async get_handled_count(): int
  • Returns int

get_info

  • async get_info(): RequestQueueMetadata | None
  • Get an object containing general information about the request queue.


    Returns RequestQueueMetadata | None

get_request

  • async get_request(request_id): Request | None
  • Retrieve a request from the queue.


    Parameters

    • request_id: str

    Returns Request | None

get_total_count

  • async get_total_count(): int
  • Returns int

is_empty

  • async is_empty(): bool
  • Check whether the queue is empty.


    Returns bool

    True if the next call to RequestQueue.fetch_next_request would return None, otherwise False.

is_finished

  • async is_finished(): bool
  • Check whether the queue is finished.

    Because the queue uses distributed storage, this method might occasionally return a false negative (False although the queue is actually finished), but it will never return a false positive (True while requests remain).


    Returns bool

    True if all requests were already handled and there are no more left. False otherwise.

mark_request_as_handled

  • async mark_request_as_handled(request): ProcessedRequest | None
  • Mark a request as handled after successful processing.

    Handled requests will never again be returned by the RequestQueue.fetch_next_request method.


    Parameters

    • request: Request

    Returns ProcessedRequest | None

open

  • async open(*, id, name, configuration): RequestQueue
  • Parameters

    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)
    • configuration: Configuration | None = None (keyword-only)

    Returns RequestQueue

reclaim_request

  • async reclaim_request(request, *, forefront): ProcessedRequest | None
  • Reclaim a failed request back to the queue.

    The request will be returned for processing again by a later call to RequestQueue.fetch_next_request.


    Parameters

    • request: Request
    • forefront: bool = False (keyword-only)

    Returns ProcessedRequest | None

Properties

id

id: str

name

name: str | None