Request router

The Router class manages request flow and coordinates the execution of user-defined logic in Crawlee projects. It routes incoming requests to appropriate user-defined handlers based on labels, manages error scenarios, and provides hooks for pre-navigation execution. The Router serves as the orchestrator for all crawling operations, ensuring that each request is processed by the correct handler according to its type and label.

Request handlers

Request handlers are user-defined functions that process individual requests and their corresponding responses. Each handler receives a crawling context as its primary argument, which provides access to the current request, response data, and utility methods for data extraction, link enqueuing, and storage operations. Handlers determine how different types of pages are processed and how data is extracted and stored.

note

The code examples in this guide use ParselCrawler for demonstration, but the Router works with all crawler types.
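In code, a request handler is simply an async function registered on a router and keyed by an optional label. The following fragment is a minimal sketch of the handler shape only; complete, runnable examples follow in the sections below.

from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


# 'DETAIL' is an illustrative label; any string works.
@router.handler('DETAIL')
async def detail_handler(context: ParselCrawlingContext) -> None:
    # The context exposes the current request, the parsed response selector,
    # and helpers such as push_data() and enqueue_links().
    await context.push_data({'url': context.request.url})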

Built-in router

Every crawler instance includes a built-in Router accessible through the crawler.router property. This approach simplifies initial setup and covers basic use cases where request routing requirements are straightforward.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = ParselCrawler(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    # Use the crawler's built-in router to define a default handler
    @crawler.router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract page title
        title = context.selector.css('title::text').get() or 'No title found'

        # Extract and save basic page data
        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
            }
        )

        # Find and enqueue product links for further crawling
        await context.enqueue_links(selector='a[href*="/products/"]', label='PRODUCT')

    # Start crawling
    await crawler.run(['https://warehouse-theme-metal.myshopify.com/'])


if __name__ == '__main__':
    asyncio.run(main())

The default handler processes all requests that either lack a label or have a label for which no specific handler has been registered.

Custom router

Applications that require explicit control over router configuration, or that reuse a router across multiple crawler instances, can create custom Router instances. A custom router can be configured independently and attached to any crawler instance as needed, which enables a modular application architecture.

You can also implement a custom request router class from scratch or by inheriting from Router. This allows you to define custom routing logic or manage request handlers in a different way.
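As a minimal sketch of the inheritance approach, the subclass below wraps the label-based dispatch with extra logging. It assumes the crawler invokes the router by awaiting it with the crawling context (i.e. through Router's __call__), which is how a Router instance is used when passed as request_handler; the class name and log message are illustrative.

from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router


class LoggingRouter(Router[ParselCrawlingContext]):
    """Illustrative router subclass that logs every dispatched request."""

    async def __call__(self, context: ParselCrawlingContext) -> None:
        # Log the request, then fall back to the regular label-based dispatch.
        context.log.info(f'Routing {context.request.url} (label={context.request.label})')
        await super().__call__(context)

Handlers are registered on such a subclass exactly as on a plain Router, and the instance is passed to the crawler as request_handler.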

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router


async def main() -> None:
    # Create a custom router instance
    router = Router[ParselCrawlingContext]()

    # Define only a default handler
    @router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract page title
        title = context.selector.css('title::text').get() or 'No title found'

        # Extract and save basic page data
        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
            }
        )

        # Find and enqueue product links for further crawling
        await context.enqueue_links(
            selector='a[href*="/products/"]',
            label='PRODUCT',  # Note: no handler for this label, will use default
        )

    # Create crawler with the custom router
    crawler = ParselCrawler(
        request_handler=router,
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    # Start crawling
    await crawler.run(['https://warehouse-theme-metal.myshopify.com/'])


if __name__ == '__main__':
    asyncio.run(main())

Advanced routing by labels

More complex crawling projects often require different processing logic for various page types. The router supports label-based routing, which allows registration of specialized handlers for specific content categories. This pattern enables clean separation of concerns and targeted processing logic for different URL patterns or content types.

import asyncio

from crawlee import Request
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router


async def main() -> None:
    # Create a custom router instance
    router = Router[ParselCrawlingContext]()

    # Define the default handler (fallback for requests without specific labels)
    @router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing homepage: {context.request.url}')

        # Extract page title
        title = context.selector.css('title::text').get() or 'No title found'

        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
                'page_type': 'homepage',
            }
        )

        # Find and enqueue collection/category links
        await context.enqueue_links(selector='a[href*="/collections/"]', label='CATEGORY')

    # Define a handler for category pages
    @router.handler('CATEGORY')
    async def category_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing category page: {context.request.url}')

        # Extract category information
        category_title = context.selector.css('h1::text').get() or 'Unknown Category'
        product_count = len(context.selector.css('.product-item').getall())

        await context.push_data(
            {
                'url': context.request.url,
                'type': 'category',
                'category_title': category_title,
                'product_count': product_count,
                'handler': 'category',
            }
        )

        # Enqueue product links from this category
        await context.enqueue_links(selector='a[href*="/products/"]', label='PRODUCT')

    # Define a handler for product detail pages
    @router.handler('PRODUCT')
    async def product_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing product page: {context.request.url}')

        # Extract detailed product information
        product_data = {
            'url': context.request.url,
            'name': context.selector.css('h1::text').get(),
            'price': context.selector.css('.price::text').get(),
            'description': context.selector.css('.product-description p::text').get(),
            'images': context.selector.css('.product-gallery img::attr(src)').getall(),
            'in_stock': bool(context.selector.css('.add-to-cart-button').get()),
            'handler': 'product',
        }

        await context.push_data(product_data)

    # Create crawler with the router
    crawler = ParselCrawler(
        request_handler=router,
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    # Start crawling with some initial requests
    await crawler.run(
        [
            # Will use default handler
            'https://warehouse-theme-metal.myshopify.com/',
            # Will use category handler
            Request.from_url(
                'https://warehouse-theme-metal.myshopify.com/collections/all',
                label='CATEGORY',
            ),
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())

Error handlers

Crawlee provides error handling mechanisms to manage request processing failures. It distinguishes between recoverable errors that may succeed on retry and permanent failures that require alternative handling strategies.

Error handler

The error handler executes when an exception occurs during request processing, before any retry attempt. It receives the crawling context together with the error and can implement custom recovery logic, modify request parameters, or decide whether the request should be retried at all, giving you fine-grained control over failure scenarios.

import asyncio

from crawlee.crawlers import BasicCrawlingContext, ParselCrawler, ParselCrawlingContext
from crawlee.errors import HttpStatusCodeError

# HTTP status code constants
TOO_MANY_REQUESTS = 429


async def main() -> None:
    # Create a crawler instance
    crawler = ParselCrawler(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract product information (might fail for some pages)
        product_name = context.selector.css('h1[data-testid="product-title"]::text').get()
        if not product_name:
            raise ValueError('Product name not found - might be a non-product page')

        price = context.selector.css('.price::text').get()
        await context.push_data(
            {
                'url': context.request.url,
                'product_name': product_name,
                'price': price,
            }
        )

    # Error handler - called when an error occurs during request processing
    @crawler.error_handler
    async def error_handler(context: BasicCrawlingContext, error: Exception) -> None:
        error_name = type(error).__name__
        context.log.warning(f'Error occurred for {context.request.url}: {error_name}')

        # You can modify the request or context here before retry
        if (
            isinstance(error, HttpStatusCodeError)
            and error.status_code == TOO_MANY_REQUESTS
        ):
            context.log.info('Rate limited - will retry with delay')
            # You could modify headers, add delay, etc.
        elif isinstance(error, ValueError):
            context.log.info('Parse error - marking request as no retry')
            context.request.no_retry = True

    # Start crawling
    await crawler.run(
        [
            'https://warehouse-theme-metal.myshopify.com/products/on-running-cloudmonster-2-mens',
            # Might cause parse error
            'https://warehouse-theme-metal.myshopify.com/collections/mens-running',
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())

Failed request handler

The failed request handler executes when a request has exhausted all retry attempts and is considered permanently failed. This handler serves as the final opportunity to log failures, store failed requests for later analysis, create alternative requests, or implement fallback processing strategies.

import asyncio

from crawlee.crawlers import BasicCrawlingContext, ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Create a crawler instance with retry settings
    crawler = ParselCrawler(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
        max_request_retries=2,  # Allow 2 retries before failing
    )

    @crawler.router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract product information
        product_name = context.selector.css('h1[data-testid="product-title"]::text').get()
        if not product_name:
            product_name = context.selector.css('h1::text').get() or 'Unknown Product'

        price = context.selector.css('.price::text').get() or 'Price not available'

        await context.push_data(
            {
                'url': context.request.url,
                'product_name': product_name,
                'price': price,
                'status': 'success',
            }
        )

    # Failed request handler - called when request has exhausted all retries
    @crawler.failed_request_handler
    async def failed_handler(context: BasicCrawlingContext, error: Exception) -> None:
        context.log.error(
            f'Failed to process {context.request.url} after all retries: {error}'
        )

        # Save failed request information for analysis
        await context.push_data(
            {
                'failed_url': context.request.url,
                'label': context.request.label,
                'error_type': type(error).__name__,
                'error_message': str(error),
                'retry_count': context.request.retry_count,
                'status': 'failed',
            }
        )

    # Start crawling with some URLs that might fail
    await crawler.run(
        [
            'https://warehouse-theme-metal.myshopify.com/products/on-running-cloudmonster-2-mens',
            # This will likely fail
            'https://warehouse-theme-metal.myshopify.com/invalid-url',
            'https://warehouse-theme-metal.myshopify.com/products/valid-product',
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())
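
The handler above only records the failure. Because it receives the full crawling context, it can also schedule alternative work, for example by enqueuing a fallback request. The fragment below is a sketch of that idea, intended to be registered with @crawler.failed_request_handler; the mirror host it targets is purely illustrative.

from crawlee import Request
from crawlee.crawlers import BasicCrawlingContext


async def failed_handler(context: BasicCrawlingContext, error: Exception) -> None:
    context.log.error(f'Giving up on {context.request.url}: {error}')

    # Illustrative fallback: request the same path from a hypothetical mirror host.
    fallback_url = context.request.url.replace(
        'warehouse-theme-metal.myshopify.com', 'mirror.example.com'
    )
    await context.add_requests([Request.from_url(fallback_url, label=context.request.label)])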

Pre-navigation hooks

Pre-navigation hooks execute before each request is processed, providing an opportunity to configure request parameters, modify browser settings, or implement request-specific optimizations. Typical uses include viewport configuration, resource blocking, timeout management, header customization, custom proxy rotation, and request interception.

HTTP crawler

HTTP crawlers support pre-navigation hooks that execute before making HTTP requests. These hooks enable request modification, header configuration, and other HTTP-specific optimizations.

import asyncio

from crawlee import HttpHeaders
from crawlee.crawlers import BasicCrawlingContext, ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.pre_navigation_hook
    async def setup_request(context: BasicCrawlingContext) -> None:
        # Add custom headers before making the request
        context.request.headers |= HttpHeaders(
            {
                'User-Agent': 'Crawlee Bot 1.0',
                'Accept': 'text/html,application/xhtml+xml',
            },
        )

    @crawler.router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        # Extract basic page information
        title = context.selector.css('title::text').get()
        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
            }
        )

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/'])


if __name__ == '__main__':
    asyncio.run(main())

Playwright crawler

Playwright crawlers provide extensive pre-navigation capabilities that allow browser page configuration before navigation. These hooks can modify browser behavior and configure page settings.

import asyncio

from crawlee.crawlers import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
    PlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.pre_navigation_hook
    async def setup_page(context: PlaywrightPreNavCrawlingContext) -> None:
        # Set viewport size for consistent rendering
        await context.page.set_viewport_size({'width': 1280, 'height': 720})

        # Block unnecessary resources to speed up crawling
        await context.block_requests(
            extra_url_patterns=[
                '*.png',
                '*.jpg',
                '*.jpeg',
                '*.gif',
                '*.svg',
                '*.css',
                '*.woff',
                '*.woff2',
                '*.ttf',
                '*google-analytics*',
                '*facebook*',
                '*twitter*',
            ]
        )

        # Set custom user agent
        await context.page.set_extra_http_headers(
            {
                'User-Agent': 'Mozilla/5.0 (compatible; Crawlee Bot)',
            }
        )

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
            }
        )

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Adaptive Playwright crawler

The AdaptivePlaywrightCrawler implements a dual-hook system with common hooks that execute for all requests and Playwright-specific hooks that execute only when browser automation is required. This makes it a good fit for projects that need to handle both static and dynamic content.

import asyncio

from crawlee import HttpHeaders
from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.pre_navigation_hook
    async def common_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Common pre-navigation hook - runs for both HTTP and browser requests.
        context.request.headers |= HttpHeaders(
            {'Accept': 'text/html,application/xhtml+xml'},
        )

    @crawler.pre_navigation_hook(playwright_only=True)
    async def browser_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Playwright-specific pre-navigation hook - runs only when browser is used.
        await context.page.set_viewport_size({'width': 1280, 'height': 720})
        if context.block_requests:
            await context.block_requests(extra_url_patterns=['*.css', '*.js'])

    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Extract title using the unified context interface.
        title_tag = context.parsed_content.find('title')
        title = title_tag.get_text() if title_tag else None

        # Extract other data consistently across both modes.
        links = [a.get('href') for a in context.parsed_content.find_all('a', href=True)]

        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
                'links': links,
            }
        )

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Conclusion

This guide introduced you to the Router class and how to organize your crawling logic. You learned how to use built-in and custom routers, implement request handlers with label-based routing, handle errors with error and failed request handlers, and configure pre-navigation hooks for different crawler types.

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!