Version: Next

ThrottlingRequestManager

A request manager that wraps another and enforces per-domain delays.

Requests for explicitly configured domains are routed into dedicated sub-managers at insertion time — each request lives in exactly one manager, eliminating duplication and simplifying deduplication.

When fetch_next_request() is called, it returns requests from the sub-manager whose domain has been waiting the longest. If all configured domains are throttled, it falls back to the inner manager for non-throttled domains. If the inner manager is also empty and all sub-managers are throttled, it sleeps until the earliest cooldown expires.

Delay sources:

  • HTTP 429 responses (via record_domain_delay)
  • robots.txt crawl-delay directives (via set_crawl_delay)

The class is generic over the wrapped manager type. The request_manager_opener callback is used to construct per-domain sub-managers at insertion time, so every sub-manager shares the same RequestManager subclass and backing store as inner. The opener must accept alias, storage_client, and configuration keyword arguments (as RequestQueue.open does) and return the same concrete subclass as inner.

Usage

from crawlee.crawlers import BasicCrawler
from crawlee.request_loaders import ThrottlingRequestManager
from crawlee.storages import RequestQueue

queue = await RequestQueue.open()
throttler = ThrottlingRequestManager(
    inner=queue,
    domains=['api.example.com', 'slow-site.org'],
    request_manager_opener=RequestQueue.open,
)
crawler = BasicCrawler(request_manager=throttler)

Methods

__init__

  • __init__(inner, *, domains, request_manager_opener, service_locator, base_delay, max_delay): None
  • Initialize the throttling manager.


    Parameters

    • inner: TRequestManager

      The underlying request manager to wrap (typically a RequestQueue). Requests for non-throttled domains are stored here.

    • keyword-only domains: Sequence[str]

      Explicit list of domain hostnames to throttle. Only requests matching these domains will be routed to per-domain sub-managers. Matching is case-insensitive (hostnames are lowercased) and exact: subdomain wildcards such as *.example.com are not supported — list each subdomain explicitly if needed.

    • keyword-only request_manager_opener: _RequestManagerOpener[TRequestManager]

      Async callable used to create per-domain sub-managers at insertion time. Must accept alias, storage_client, and configuration keyword arguments and return the same concrete subclass as inner (e.g. RequestQueue.open when inner is a RequestQueue).

    • optional keyword-only service_locator: ServiceLocator | None = None

      Service locator for creating sub-managers. If not provided, defaults to the global service locator, ensuring consistency with the crawler's storage backend.

    • optional keyword-only base_delay: timedelta = timedelta(seconds=2)

      Initial delay after the first 429 response from a domain.

    • optional keyword-only max_delay: timedelta = timedelta(seconds=60)

      Maximum delay between requests to a rate-limited domain.

    Returns None

add_request

  • async add_request(request, *, forefront): ProcessedRequest | None
  • Add a request, routing it to the appropriate manager.

    Requests for explicitly configured domains are routed directly to their per-domain sub-manager. All other requests go to the inner manager.


    Parameters

    • request: str | Request
    • optional keyword-only forefront: bool = False

    Returns ProcessedRequest | None
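The routing decision described above can be illustrated with a small standalone sketch (the helper below is illustrative only, not part of the class's API). Per the domains parameter documentation, matching is lowercase and exact, so subdomains of a configured domain fall through to the inner manager.

```python
from urllib.parse import urlparse

# Hypothetical routing decision; actual ThrottlingRequestManager internals
# may differ. Hostname matching is lowercase and exact.
throttled_domains = {'api.example.com', 'slow-site.org'}

def route(url: str) -> str:
    host = (urlparse(url).hostname or '').lower()
    return f'sub-manager:{host}' if host in throttled_domains else 'inner'

print(route('https://API.EXAMPLE.COM/v1/items'))  # sub-manager:api.example.com
print(route('https://sub.api.example.com/'))      # inner (exact match only)
```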

add_requests

  • async add_requests(requests, *, forefront, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add multiple requests, routing each to the appropriate manager.


    Parameters

    • requests: Sequence[str | Request]
    • optional keyword-only forefront: bool = False
    • optional keyword-only batch_size: int = 1000
    • optional keyword-only wait_time_between_batches: timedelta = timedelta(seconds=1)
    • optional keyword-only wait_for_all_requests_to_be_added: bool = False
    • optional keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None

    Returns None

drop

  • async drop(): None
  • Remove persistent state either from the Apify Cloud storage or from the local database.


    Returns None

fetch_next_request

  • async fetch_next_request(): Request | None
  • Fetch the next request, respecting per-domain delays.

    Sub-managers are checked in order of longest-overdue domain first (sorted by throttled_until ascending). If all configured domains are throttled, falls back to the inner manager for non-throttled domains. If the inner manager is also empty and all sub-managers are throttled, waits until either the earliest domain becomes available or new work is added (whichever comes first).


    Returns Request | None
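The documented selection order (longest-overdue domain first) amounts to sorting ready domains by their throttled_until timestamp, ascending. A sketch under an assumed data model, not the class's real implementation:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical throttle timestamps per configured domain: among domains
# whose cooldown has expired, pick the one that expired first.
now = datetime.now(timezone.utc)
throttled_until = {
    'api.example.com': now - timedelta(seconds=30),  # ready, waited longest
    'slow-site.org': now - timedelta(seconds=5),     # ready
    'busy.example': now + timedelta(seconds=10),     # still throttled
}

ready = sorted(
    (d for d, t in throttled_until.items() if t <= now),
    key=lambda d: throttled_until[d],
)
next_domain = ready[0] if ready else None  # None -> fall back to inner manager
print(next_domain)  # api.example.com
```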

get_handled_count

  • async get_handled_count(): int
  • Get the number of requests that have been handled.

get_total_count

  • async get_total_count(): int
  • Get an offline approximation of the total number of requests in the loader (i.e. pending + handled).


    Returns int

is_empty

  • async is_empty(): bool
  • Return True if there are no more requests in the loader (there might still be unfinished requests).


    Returns bool

is_finished

  • async is_finished(): bool
  • Return True if all requests have been handled.

mark_request_as_handled

  • async mark_request_as_handled(request): ProcessedRequest | None
  • Mark a request as handled after successful processing.

purge

  • async purge(): None
  • Empty the inner manager and all sub-managers, and reset transient per-domain throttle state.

    The configured domain list and any robots.txt-derived crawl_delay are preserved; only the dynamic backoff state (consecutive 429 counter and throttled_until) is cleared. Sub-managers are kept around so they don't need to be re-opened on the next request — they're just emptied.


    Returns None

reclaim_request

  • async reclaim_request(request, *, forefront): ProcessedRequest | None
  • Reclaim a failed request back to the source so that it can be returned for processing again later.

    It is possible to modify the request data by supplying an updated request as a parameter.


    Parameters

    • request: Request
    • optional keyword-only forefront: bool = False

    Returns ProcessedRequest | None

record_domain_delay

  • record_domain_delay(url, *, retry_after): bool
  • Record a 429 Too Many Requests response for the domain of the given URL.

    Increments the consecutive 429 count and calculates the next allowed request time using exponential backoff or the Retry-After value.


    Parameters

    • url: str

      The URL that received a 429 response.

    • optional keyword-only retry_after: timedelta | None = None

      Optional delay from the Retry-After header. If provided, it takes priority over the calculated exponential backoff.

    Returns bool

    True if the URL's domain is configured for throttling and the delay was applied; False if the domain is not in the configured domains list, in which case the call is a no-op.

record_success

  • record_success(url): None
  • Record a successful request, resetting the backoff state for that domain.


    Parameters

    • url: str

      The URL that received a successful response.

    Returns None

set_crawl_delay

  • set_crawl_delay(url, delay_seconds): None
  • Set the robots.txt crawl-delay for a domain.

    The delay is locked once set so robots.txt re-fetches (e.g. after LRU eviction) can't change the in-flight dispatch cadence and cause oscillation mid-crawl. Subsequent calls for the same domain are no-ops.


    Parameters

    • url: str

      A URL from the domain to throttle.

    • delay_seconds: int

      The crawl-delay value in seconds.

    Returns None
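The lock-once semantics described above can be sketched with a plain dictionary (this mirrors the documented behaviour only, not the real implementation): the first delay recorded for a domain wins and later calls are ignored.

```python
# Hypothetical sketch of lock-once crawl-delay semantics: the first value
# set for a domain is kept; subsequent calls (e.g. after a robots.txt
# re-fetch) are no-ops, so the dispatch cadence cannot oscillate mid-crawl.
crawl_delays: dict[str, int] = {}

def set_crawl_delay(domain: str, delay_seconds: int) -> None:
    crawl_delays.setdefault(domain, delay_seconds)  # locked once set

set_crawl_delay('slow-site.org', 10)
set_crawl_delay('slow-site.org', 1)   # ignored: delay already locked
print(crawl_delays['slow-site.org'])  # 10
```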

to_tandem

  • async to_tandem(request_manager): RequestManagerTandem
  • Combine the loader with a request manager to support adding and reclaiming requests.


    Parameters

    • optional request_manager: RequestManager | None = None

      Request manager to combine the loader with. If None is given, the default request queue is used.

    Returns RequestManagerTandem