Version: Next

ThrottlingRequestManager

A request manager that wraps another and enforces per-domain delays.

Requests for explicitly configured domains are routed into dedicated sub-managers at insertion time — each request lives in exactly one manager, eliminating duplication and simplifying deduplication.

When fetch_next_request() is called, it returns requests from the sub-manager whose domain has been waiting the longest. If all configured domains are throttled, it falls back to the inner manager for non-throttled domains. If the inner manager is also empty and all sub-managers are throttled, it sleeps until the earliest cooldown expires.

Delay sources:

  • HTTP 429 responses (via record_domain_delay)
  • robots.txt crawl-delay directives (via set_crawl_delay)

The class is generic over the wrapped manager type. The request_manager_opener callback is used to construct per-domain sub-managers at insertion time, so every sub-manager shares the same RequestManager subclass and backing store as inner. The opener must accept alias, storage_client, and configuration keyword arguments (as RequestQueue.open does) and return the same concrete subclass as inner.

Usage

from crawlee.crawlers import BasicCrawler
from crawlee.request_loaders import ThrottlingRequestManager
from crawlee.storages import RequestQueue

queue = await RequestQueue.open()
throttler = ThrottlingRequestManager(
    inner=queue,
    domains=['api.example.com', 'slow-site.org'],
    request_manager_opener=RequestQueue.open,
)
crawler = BasicCrawler(request_manager=throttler)

Methods

__init__

  • __init__(inner, *, domains, request_manager_opener, service_locator, base_delay, max_delay): None
  • Initialize the throttling manager.


    Parameters

    • inner: TRequestManager

      The underlying request manager to wrap (typically a RequestQueue). Requests for non-throttled domains are stored here.

    • keyword-only domains: Sequence[str]

      Explicit list of domain hostnames to throttle. Only requests matching these domains will be routed to per-domain sub-managers. Matching is case-insensitive (hostnames are lowercased) and exact: subdomain wildcards such as *.example.com are not supported — list each subdomain explicitly if needed.

    • keyword-only request_manager_opener: _RequestManagerOpener[TRequestManager]

      Async callable used to create per-domain sub-managers at insertion time. Must accept alias, storage_client, and configuration keyword arguments and return the same concrete subclass as inner (e.g. RequestQueue.open when inner is a RequestQueue).

    • optional keyword-only service_locator: ServiceLocator | None = None

      Service locator for creating sub-managers. If not provided, defaults to the global service locator, ensuring consistency with the crawler's storage backend.

    • optional keyword-only base_delay: timedelta = timedelta(seconds=2)

      Initial delay after the first 429 response from a domain.

    • optional keyword-only max_delay: timedelta = timedelta(seconds=60)

      Maximum delay between requests to a rate-limited domain.

    Returns None

add_request

  • async add_request(request, *, forefront): ProcessedRequest | None
  • Add a request, routing it to the appropriate manager.

    Requests for explicitly configured domains are routed directly to their per-domain sub-manager. All other requests go to the inner manager.


    Parameters

    • request: str | Request
    • optional keyword-only forefront: bool = False

    Returns ProcessedRequest | None
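The routing decision described above can be illustrated with a small standalone sketch (the helper below is illustrative only, not part of the class's API). Per the domains parameter documentation, matching is lowercase and exact, so subdomains of a configured domain fall through to the inner manager.

```python
from urllib.parse import urlparse

# Hypothetical routing decision; actual ThrottlingRequestManager internals
# may differ. Hostname matching is lowercase and exact.
throttled_domains = {'api.example.com', 'slow-site.org'}

def route(url: str) -> str:
    host = (urlparse(url).hostname or '').lower()
    return f'sub-manager:{host}' if host in throttled_domains else 'inner'

print(route('https://API.EXAMPLE.COM/v1/items'))  # sub-manager:api.example.com
print(route('https://sub.api.example.com/'))      # inner (exact match only)
```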

add_requests

  • async add_requests(requests, *, forefront, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add multiple requests, routing each to the appropriate manager.


    Parameters

    • requests: Sequence[str | Request]
    • optional keyword-only forefront: bool = False
    • optional keyword-only batch_size: int = 1000
    • optional keyword-only wait_time_between_batches: timedelta = timedelta(seconds=1)
    • optional keyword-only wait_for_all_requests_to_be_added: bool = False
    • optional keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None

    Returns None

drop

  • async drop(): None
  • Remove persistent state either from the Apify Cloud storage or from the local database.


    Returns None

fetch_next_request

  • async fetch_next_request(): Request | None
  • Fetch the next request, respecting per-domain delays.

    Sub-managers are checked in order of longest-overdue domain first (sorted by throttled_until ascending). If all configured domains are throttled, falls back to the inner manager for non-throttled domains. If the inner manager is also empty and all sub-managers are throttled, waits until either the earliest domain becomes available or new work is added (whichever comes first).


    Returns Request | None
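The documented selection order (longest-overdue domain first) amounts to sorting ready domains by their throttled_until timestamp, ascending. A sketch under an assumed data model, not the class's real implementation:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical throttle timestamps per configured domain: among domains
# whose cooldown has expired, pick the one that expired first.
now = datetime.now(timezone.utc)
throttled_until = {
    'api.example.com': now - timedelta(seconds=30),  # ready, waited longest
    'slow-site.org': now - timedelta(seconds=5),     # ready
    'busy.example': now + timedelta(seconds=10),     # still throttled
}

ready = sorted(
    (d for d, t in throttled_until.items() if t <= now),
    key=lambda d: throttled_until[d],
)
next_domain = ready[0] if ready else None  # None -> fall back to inner manager
print(next_domain)  # api.example.com
```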

get_handled_count

  • async get_handled_count(): int
  • Get the number of requests that have been handled.

get_total_count

  • async get_total_count(): int
  • Get an offline approximation of the total number of requests in the loader (i.e. pending + handled).


    Returns int

is_empty

  • async is_empty(): bool
  • Return True if there are no more requests in the loader (there might still be unfinished requests).


    Returns bool

is_finished

  • async is_finished(): bool
  • Return True if all requests have been handled.

mark_request_as_handled

  • async mark_request_as_handled(request): ProcessedRequest | None
  • Mark a request as handled after successful processing.

purge

  • async purge(): None
  • Empty the inner manager and all sub-managers, and reset transient per-domain throttle state.

    The configured domain list and any robots.txt-derived crawl_delay are preserved; only the dynamic backoff state (consecutive 429 counter and throttled_until) is cleared. Sub-managers are kept around so they don't need to be re-opened on the next request — they're just emptied.


    Returns None

reclaim_request

  • async reclaim_request(request, *, forefront): ProcessedRequest | None
  • Reclaim a failed request back to the source so that it can be returned for processing again later.

    It is possible to modify the request data by supplying an updated request as a parameter.


    Parameters

    • request: Request
    • optional keyword-only forefront: bool = False

    Returns ProcessedRequest | None

record_domain_delay

  • record_domain_delay(url, *, retry_after): bool
  • Record a 429 Too Many Requests response for the domain of the given URL.

    Increments the consecutive 429 count and calculates the next allowed request time using exponential backoff or the Retry-After value.


    Parameters

    • url: str

      The URL that received a 429 response.

    • optional keyword-only retry_after: timedelta | None = None

      Optional delay from the Retry-After header. If provided, it takes priority over the calculated exponential backoff.

    Returns bool

    True if the URL's domain is configured for throttling and the delay was applied; False if the domain is not in the configured domains list, in which case the call is a no-op.

record_success

  • record_success(url): None
  • Record a successful request, resetting the backoff state for that domain.


    Parameters

    • url: str

      The URL that received a successful response.

    Returns None

set_crawl_delay

  • set_crawl_delay(url, delay_seconds): None
  • Set the robots.txt crawl-delay for a domain.

    The delay is locked once set so robots.txt re-fetches (e.g. after LRU eviction) can't change the in-flight dispatch cadence and cause oscillation mid-crawl. Subsequent calls for the same domain are no-ops.


    Parameters

    • url: str

      A URL from the domain to throttle.

    • delay_seconds: int

      The crawl-delay value in seconds.

    Returns None
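The lock-once semantics described above can be sketched with a plain dictionary (this mirrors the documented behaviour only, not the real implementation): the first delay recorded for a domain wins and later calls are ignored.

```python
# Hypothetical sketch of lock-once crawl-delay semantics: the first value
# set for a domain is kept; subsequent calls (e.g. after a robots.txt
# re-fetch) are no-ops, so the dispatch cadence cannot oscillate mid-crawl.
crawl_delays: dict[str, int] = {}

def set_crawl_delay(domain: str, delay_seconds: int) -> None:
    crawl_delays.setdefault(domain, delay_seconds)  # locked once set

set_crawl_delay('slow-site.org', 10)
set_crawl_delay('slow-site.org', 1)   # ignored: delay already locked
print(crawl_delays['slow-site.org'])  # 10
```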

to_tandem

  • async to_tandem(request_manager): RequestManagerTandem
  • Combine the loader with a request manager to support adding and reclaiming requests.


    Parameters

    • optional request_manager: RequestManager | None = None

      Request manager to combine the loader with. If None is given, the default request queue is used.

    Returns RequestManagerTandem