RequestQueue
crawlee.storages._request_queue.RequestQueue
Index
Constructors
__init__
Parameters
id: str
name: str | None
configuration: Configuration
client: BaseStorageClient
event_manager: EventManager
Returns None
Methods
add_request
Parameters
request: str | Request
forefront: bool = False (keyword-only)
Returns ProcessedRequest
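A minimal sketch of enqueuing a single request, assuming the asynchronous API and the public import path crawlee.storages; the URL is illustrative. Since the request parameter accepts str | Request, a plain URL string is used here.

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the default request queue (see the `open` class method below).
    rq = await RequestQueue.open()

    # A plain URL string is accepted; `forefront=True` places the request
    # at the front of the queue instead of the back.
    processed = await rq.add_request('https://example.com', forefront=True)
    print(processed)


asyncio.run(main())
```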
add_requests_batched
Parameters
requests: Sequence[str | Request]
batch_size: int = 1000 (keyword-only)
wait_time_between_batches: timedelta = timedelta(seconds=1) (keyword-only)
wait_for_all_requests_to_be_added: bool = False (keyword-only)
wait_for_all_requests_to_be_added_timeout: timedelta | None = None (keyword-only)
Returns None
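A sketch of enqueuing many URLs at once, assuming the asynchronous API; the URL pattern, batch size, and wait time are illustrative values only.

```python
import asyncio
from datetime import timedelta

from crawlee.storages import RequestQueue


async def main() -> None:
    rq = await RequestQueue.open()

    # Illustrative list of URLs to enqueue in batches.
    urls = [f'https://example.com/page/{i}' for i in range(5000)]

    await rq.add_requests_batched(
        urls,
        batch_size=500,
        wait_time_between_batches=timedelta(seconds=2),
        wait_for_all_requests_to_be_added=True,
    )


asyncio.run(main())
```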
drop
Parameters
timeout: timedelta | None = None (keyword-only)
Returns None
fetch_next_request
Return the next request in the queue to be processed.
Once you successfully finish processing the request, call RequestQueue.mark_request_as_handled to mark it as handled in the queue. If an error occurs while processing the request, call RequestQueue.reclaim_request instead, so that the queue gives the request to another consumer in a later call to the fetch_next_request method.
Note that a None return value does not mean that queue processing is finished; it only means there are currently no pending requests. To check whether all requests in the queue have been handled, use RequestQueue.is_finished instead.
Returns Request | None
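A sketch of the consumer loop described above (fetch, then mark as handled on success or reclaim on failure), assuming the asynchronous API; the processing step and sleep interval are placeholders.

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    rq = await RequestQueue.open()
    await rq.add_request('https://example.com')

    # Loop until the queue reports that every request has been handled.
    while not await rq.is_finished():
        request = await rq.fetch_next_request()

        if request is None:
            # No pending requests right now; others may still be in progress.
            await asyncio.sleep(1)
            continue

        try:
            ...  # process the request here, e.g. download and parse request.url
        except Exception:
            # Processing failed: give the request back to the queue for a retry.
            await rq.reclaim_request(request)
        else:
            # Processing succeeded: the request will not be returned again.
            await rq.mark_request_as_handled(request)


asyncio.run(main())
```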
get_handled_count
Returns int
get_info
Get an object containing general information about the request queue.
Returns RequestQueueMetadata | None
get_request
Retrieve a request from the queue.
Parameters
request_id: str
Returns Request | None
get_total_count
Returns int
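A small sketch of inspecting queue progress with the count and metadata helpers, assuming the asynchronous API; the printed format is arbitrary.

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    rq = await RequestQueue.open()
    await rq.add_request('https://example.com')

    handled = await rq.get_handled_count()
    total = await rq.get_total_count()
    print(f'Handled {handled} of {total} requests.')

    # General information about the queue (may be None).
    info = await rq.get_info()
    print(info)


asyncio.run(main())
```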
is_empty
Check whether the queue is empty.
Returns bool
True if the next call to RequestQueue.fetch_next_request would return None, otherwise False.
is_finished
Check whether the queue is finished.
Due to the nature of the distributed storage used by the queue, the function might occasionally return a false negative, but it will never return a false positive.
Returns bool
True if all requests have already been handled and there are none left, False otherwise.
mark_request_as_handled
Mark a request as handled after successful processing.
Handled requests will never again be returned by the RequestQueue.fetch_next_request method.
Parameters
request: Request
Returns ProcessedRequest | None
open
Parameters
id: str | None = None (keyword-only)
name: str | None = None (keyword-only)
configuration: Configuration | None = None (keyword-only)
storage_client: BaseStorageClient | None = None (keyword-only)
Returns RequestQueue
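A sketch of opening a queue by name and later removing it, assuming the asynchronous API; the queue name is hypothetical.

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open (or create) a named queue; named queues persist until dropped.
    rq = await RequestQueue.open(name='my-crawl-queue')

    await rq.add_request('https://example.com')

    # Remove the queue and its data once it is no longer needed.
    await rq.drop()


asyncio.run(main())
```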
reclaim_request
Reclaim a failed request back to the queue.
The request will be returned for processing again later by another call to RequestQueue.fetch_next_request.
Parameters
request: Request
forefront: bool = False (keyword-only)
Returns ProcessedRequest | None
Represents a queue storage for managing HTTP requests in web crawling operations.
The RequestQueue class handles a queue of HTTP requests, each identified by a unique URL, to facilitate structured web crawling. It supports both breadth-first and depth-first crawling strategies, allowing recursive crawling that starts from an initial set of URLs. Each URL in the queue is uniquely identified by a unique_key, which can be customized to allow the same URL to be added multiple times under different keys.
Data can be stored either locally or in the cloud, depending on the setup of the underlying storage client. By default a MemoryStorageClient is used, but it can be changed to a different one.
By default, data is stored using the following path structure:
{CRAWLEE_STORAGE_DIR}: The root directory for all storage data, specified by the environment variable.
{QUEUE_ID}: The identifier for the request queue, either "default" or as specified.
{REQUEST_ID}: The unique identifier for each request in the queue.
The RequestQueue supports both creating new queues and opening existing ones by id or name. Named queues persist indefinitely, while unnamed queues expire after 7 days unless specified otherwise. The queue supports mutable operations, allowing URLs to be added and removed as needed.
Usage:
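A minimal usage sketch, assuming the asynchronous API and the public import path crawlee.storages; the URL and processing step are placeholders.

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the default, unnamed request queue.
    rq = await RequestQueue.open()

    # Enqueue a URL and process it.
    await rq.add_request('https://crawlee.dev')
    request = await rq.fetch_next_request()

    if request is not None:
        ...  # crawl the page here
        await rq.mark_request_as_handled(request)


asyncio.run(main())
```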