# Request throttling
When crawling websites that enforce rate limits (HTTP 429) or specify a `crawl-delay` in their `robots.txt`, you need a way to throttle requests per domain without blocking unrelated domains. The `ThrottlingRequestManager` provides exactly this.
## Overview
The `ThrottlingRequestManager` wraps a `RequestManager` (typically a `RequestQueue`) and manages per-domain throttling. You specify which domains to throttle at initialization, and the manager automatically:

- Routes requests for listed domains into dedicated sub-managers at insertion time.
- Enforces delays from HTTP 429 responses (exponential backoff) and `robots.txt` `crawl-delay` directives.
- Schedules fairly by fetching from the domain that has been waiting the longest.
- Sleeps intelligently when all configured domains are throttled, instead of busy-waiting.

Requests for domains not in the configured list pass through to the main queue without any throttling.
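As a mental model, the routing decision is just a hostname check at insertion time. The sketch below is a hypothetical illustration of that check; `THROTTLED_DOMAINS` and `route` are not part of the library's API:

```python
from urllib.parse import urlparse

# Hypothetical illustration of insertion-time routing; the real
# ThrottlingRequestManager performs this check internally.
THROTTLED_DOMAINS = {'api.example.com', 'slow-site.org'}


def route(url: str) -> str:
    """Return which queue a request URL would be routed to."""
    host = urlparse(url).hostname or ''
    return f'sub-manager:{host}' if host in THROTTLED_DOMAINS else 'main-queue'


print(route('https://api.example.com/data'))  # sub-manager:api.example.com
print(route('https://fast-site.com/page1'))   # main-queue
```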
## Basic usage
To use request throttling, create a `ThrottlingRequestManager` with the domains you want to throttle and pass it as the `request_manager` to your crawler:
```python
import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.request_loaders import ThrottlingRequestManager
from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the default request queue.
    queue = await RequestQueue.open()

    # Wrap it with ThrottlingRequestManager for specific domains. The throttler
    # uses the same storage backend as the underlying queue.
    throttler = ThrottlingRequestManager(
        queue,
        domains=['api.example.com', 'slow-site.org'],
        request_manager_opener=RequestQueue.open,
    )

    # Pass the throttler as the crawler's request manager.
    crawler = BasicCrawler(request_manager=throttler)

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # Add requests. Listed domains are routed directly to their throttled
    # sub-managers. Others go to the inner manager.
    await throttler.add_requests(
        [
            'https://api.example.com/data',
            'https://api.example.com/users',
            'https://slow-site.org/page1',
            'https://fast-site.com/page1',  # Not throttled
        ]
    )

    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
```
## How it works
- **Insertion-time routing:** When you add requests via `add_request` or `add_requests`, each request is checked against the configured domain list. Matching requests go directly into a per-domain sub-manager; all others go to the inner manager. This eliminates request duplication entirely.
- **429 backoff:** When the crawler detects an HTTP 429 response, the `ThrottlingRequestManager` records an exponential backoff delay for that domain (starting at 2 s and doubling up to 60 s). If the response includes a `Retry-After` header, that value takes priority.
- **Crawl-delay:** If `robots.txt` specifies a `crawl-delay`, the manager enforces a minimum interval between requests to that domain.
- **Fair scheduling:** `fetch_next_request` sorts available sub-managers by how long each domain has been waiting, ensuring no domain is starved.
The `ThrottlingRequestManager` is an opt-in feature. If you don't pass it to your crawler, requests are processed normally without any per-domain throttling.