ThrottlingRequestManager
Hierarchy
- RequestManager
- ThrottlingRequestManager
Methods
__init__
Initialize the throttling manager.
Parameters
inner: TRequestManager
The underlying request manager to wrap (typically a RequestQueue). Requests for non-throttled domains are stored here.
keyword-only domains: Sequence[str]
Explicit list of domain hostnames to throttle. Only requests matching these domains will be routed to per-domain sub-managers. Matching is case-insensitive (hostnames are lowercased) and exact: subdomain wildcards such as *.example.com are not supported, so list each subdomain explicitly if needed.
keyword-only request_manager_opener: _RequestManagerOpener[TRequestManager]
Async callable used to create per-domain sub-managers at insertion time. Must accept alias, storage_client, and configuration keyword arguments and return the same concrete subclass as inner (e.g. RequestQueue.open when inner is a RequestQueue).
optional keyword-only service_locator: ServiceLocator | None = None
Service locator for creating sub-managers. If not provided, defaults to the global service locator, ensuring consistency with the crawler's storage backend.
optional keyword-only base_delay: timedelta = timedelta(seconds=2)
Initial delay after the first 429 response from a domain.
optional keyword-only max_delay: timedelta = timedelta(seconds=60)
Maximum delay between requests to a rate-limited domain.
Returns None
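The exact, case-insensitive matching described for domains can be pictured with a small standalone sketch. It does not use the library: normalize_domains and is_throttled_domain are hypothetical helper names, and extracting the hostname with urllib.parse.urlparse is an assumption about how such matching might be done, not a description of the actual implementation.

```python
from urllib.parse import urlparse


def normalize_domains(domains):
    """Lowercase configured hostnames so lookups are case-insensitive."""
    return {d.lower() for d in domains}


def is_throttled_domain(url, configured):
    """Return True only when the URL's hostname exactly matches a configured one."""
    host = urlparse(url).hostname  # .hostname is already lowercased by urlparse
    return host is not None and host in configured


# Each subdomain must be listed explicitly; there is no wildcard expansion.
configured = normalize_domains(["Example.com", "api.example.com"])
```

With this set, www.example.com would not match, because only example.com and api.example.com were listed.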
add_request
Add a request, routing it to the appropriate manager.
Requests for explicitly configured domains are routed directly to their per-domain sub-manager. All other requests go to the inner manager.
Parameters
request: str | Request
optional keyword-only forefront: bool = False
Returns ProcessedRequest | None
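The routing rule above can be illustrated with plain in-memory queues. This is a toy model under stated assumptions: RoutingSketch is a hypothetical class, deques stand in for the per-domain sub-managers and the inner manager, and the returned "sub"/"inner" label exists only to make the routing visible (the real method returns ProcessedRequest | None).

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class RoutingSketch:
    """Toy router: one queue per configured domain plus a shared inner queue."""

    def __init__(self, domains):
        self.domains = {d.lower() for d in domains}
        self.sub_queues = defaultdict(deque)  # stand-in for per-domain sub-managers
        self.inner = deque()                  # stand-in for the inner manager

    def add_request(self, url, *, forefront=False):
        host = urlparse(url).hostname
        # Configured domains go to their own queue; everything else goes to inner.
        target = self.sub_queues[host] if host in self.domains else self.inner
        if forefront:
            target.appendleft(url)  # forefront requests jump the queue
        else:
            target.append(url)
        return "inner" if target is self.inner else "sub"
```

Because each URL is appended to exactly one queue, a request never exists in two managers at once.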
add_requests
Add multiple requests, routing each to the appropriate manager.
Parameters
requests: Sequence[str | Request]
optional keyword-only forefront: bool = False
optional keyword-only batch_size: int = 1000
optional keyword-only wait_time_between_batches: timedelta = timedelta(seconds=1)
optional keyword-only wait_for_all_requests_to_be_added: bool = False
optional keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None
Returns None
drop
Remove persistent state either from the Apify Cloud storage or from the local database.
Returns None
fetch_next_request
Fetch the next request, respecting per-domain delays.
Sub-managers are checked in order of longest-overdue domain first (sorted by throttled_until ascending). If all configured domains are throttled, falls back to the inner manager for non-throttled domains. If the inner manager is also empty and all sub-managers are throttled, waits until either the earliest domain becomes available or new work is added (whichever comes first).
Returns Request | None
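The selection order can be sketched as two pure functions. This is an illustration only: pick_next_domain and time_until_next_available are hypothetical names, the dictionaries stand in for internal state, and the real method is asynchronous and also consults the inner manager, which is omitted here.

```python
from datetime import datetime, timedelta, timezone


def pick_next_domain(throttled_until, pending, now):
    """Return the available domain that has been overdue the longest, or None.

    throttled_until: dict of domain -> datetime when it may be fetched again
    pending: dict of domain -> number of queued requests
    """
    # Ascending sort puts the earliest throttled_until (longest overdue) first.
    for domain in sorted(throttled_until, key=throttled_until.get):
        if throttled_until[domain] <= now and pending.get(domain, 0) > 0:
            return domain
    return None


def time_until_next_available(throttled_until, pending, now):
    """Wait needed when every domain that has work is still throttled."""
    waits = [
        until - now
        for domain, until in throttled_until.items()
        if pending.get(domain, 0) > 0 and until > now
    ]
    return min(waits) if waits else timedelta(0)
```

A caller would try pick_next_domain first, fall back to the inner queue when it returns None, and only then sleep for time_until_next_available (or until new work arrives).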
get_handled_count
Get the number of requests in the loader that have been handled.
Returns int
get_total_count
Get an offline approximation of the total number of requests in the loader (i.e. pending + handled).
Returns int
is_empty
Return True if there are no more requests in the loader (there might still be unfinished requests).
Returns bool
is_finished
Return True if all requests have been handled.
Returns bool
mark_request_as_handled
Mark a request as handled after successful processing (or after giving up on retries).
Parameters
request: Request
Returns ProcessedRequest | None
purge
Empty the inner manager and all sub-managers, and reset transient per-domain throttle state.
The configured domain list and any robots.txt-derived crawl_delay are preserved; only the dynamic backoff state (consecutive 429 counter and throttled_until) is cleared. Sub-managers are kept around so they don't need to be re-opened on the next request; they're just emptied.
Returns None
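What purge keeps and what it clears can be made concrete with a small model. This is a sketch under assumptions: DomainStateSketch and its field names are hypothetical, and deques stand in for the sub-managers and the inner manager.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DomainStateSketch:
    crawl_delay: Optional[float] = None          # robots.txt-derived, survives purge
    consecutive_429: int = 0                     # dynamic backoff state, cleared
    throttled_until: float = 0.0                 # dynamic backoff state, cleared
    queue: deque = field(default_factory=deque)  # stand-in for the sub-manager


def purge(states, inner):
    """Empty every queue and reset dynamic backoff; keep domains and crawl_delay."""
    inner.clear()
    for state in states.values():
        state.queue.clear()  # emptied in place, so the same object is reused
        state.consecutive_429 = 0
        state.throttled_until = 0.0
```

The per-domain entries stay in the dictionary after the call, which mirrors the point that sub-managers are emptied rather than re-opened.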
reclaim_request
Reclaim a failed request back to the source so that it can be returned for processing again later.
It is possible to modify the request data by supplying an updated request as a parameter.
Parameters
request: Request
optional keyword-only forefront: bool = False
Returns ProcessedRequest | None
record_domain_delay
Record a 429 Too Many Requests response for the domain of the given URL.
Increments the consecutive 429 count and calculates the next allowed request time using exponential backoff or the Retry-After value.
Parameters
url: str
The URL that received a 429 response.
optional keyword-only retry_after: timedelta | None = None
Optional delay from the Retry-After header. If provided, it takes priority over the calculated exponential backoff.
Returns bool
True if the URL's domain is configured for throttling and the delay was applied; False if the domain is not in the configured domains list, in which case the call is a no-op.
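The delay calculation can be sketched as follows. The exact backoff schedule is not stated in this reference, so the doubling formula below is an assumption, as is the hypothetical function name next_delay; only the defaults (2 s base, 60 s cap) and the priority of Retry-After come from the text above. Whether the library also caps a Retry-After value at max_delay is not specified, so the sketch leaves it uncapped.

```python
from datetime import timedelta


def next_delay(
    consecutive_429,
    *,
    base_delay=timedelta(seconds=2),
    max_delay=timedelta(seconds=60),
    retry_after=None,
):
    """Delay to apply after the Nth consecutive 429 response (N >= 1)."""
    if retry_after is not None:
        # An explicit Retry-After value takes priority over the computed backoff.
        return retry_after
    # Assumed schedule: base_delay doubles with each further 429, capped at max_delay.
    return min(base_delay * (2 ** (consecutive_429 - 1)), max_delay)
```

Under this assumed schedule the delays would run 2 s, 4 s, 8 s, 16 s, 32 s, and then stay at 60 s until a success resets the counter.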
record_success
Record a successful request, resetting the backoff state for that domain.
Parameters
url: str
The URL that received a successful response.
Returns None
set_crawl_delay
Set the robots.txt crawl-delay for a domain.
The delay is locked once set so robots.txt re-fetches (e.g. after LRU eviction) can't change the in-flight dispatch cadence and cause oscillation mid-crawl. Subsequent calls for the same domain are no-ops.
Parameters
url: str
A URL from the domain to throttle.
delay_seconds: int
The crawl-delay value in seconds.
Returns None
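The lock-once behaviour can be illustrated with a minimal model. CrawlDelaySketch and its get helper are hypothetical, and the sketch ignores the configured domain list; it only shows that the first value recorded for a hostname wins.

```python
from urllib.parse import urlparse


class CrawlDelaySketch:
    """Stores one crawl-delay per hostname; the first value set is final."""

    def __init__(self):
        self._crawl_delay = {}  # hostname -> seconds, written at most once

    def set_crawl_delay(self, url, delay_seconds):
        host = urlparse(url).hostname
        if host is None or host in self._crawl_delay:
            return  # already locked (or unparsable URL): the call is a no-op
        self._crawl_delay[host] = delay_seconds

    def get(self, url):
        """Look up the locked delay for the URL's hostname, if any."""
        return self._crawl_delay.get(urlparse(url).hostname)
```

A second call with a different value for the same hostname leaves the stored delay unchanged, which is what keeps the dispatch cadence stable if robots.txt is fetched again mid-crawl.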
to_tandem
Combine the loader with a request manager to support adding and reclaiming requests.
Parameters
optional request_manager: RequestManager | None = None
Request manager to combine the loader with. If None is given, the default request queue is used.
Returns RequestManagerTandem
A request manager that wraps another and enforces per-domain delays.
Requests for explicitly configured domains are routed into dedicated sub-managers at insertion time — each request lives in exactly one manager, eliminating duplication and simplifying deduplication.
When fetch_next_request() is called, it returns requests from the sub-manager whose domain has been waiting the longest. If all configured domains are throttled, it falls back to the inner manager for non-throttled domains. If the inner manager is also empty and all sub-managers are throttled, it sleeps until the earliest cooldown expires.
Delay sources:
- HTTP 429 responses (via record_domain_delay)
- robots.txt crawl-delay directives (via set_crawl_delay)
The class is generic over the wrapped manager type. The request_manager_opener callback is used to construct per-domain sub-managers at insertion time, so every sub-manager shares the same RequestManager subclass and backing store as inner. The opener must accept alias, storage_client, and configuration keyword arguments (as RequestQueue.open does) and return the same concrete subclass as inner.
Usage