
Request loaders

The request_loaders sub-package extends the functionality of the RequestQueue, providing additional tools for managing URLs and requests. If you are new to Crawlee and unfamiliar with the RequestQueue, consider starting with the Storages guide first. Request loaders define how requests are fetched and stored, enabling use cases such as reading URLs from files or external APIs, or combining multiple sources into one.

Overview

The request_loaders sub-package introduces the following abstract classes:

  • RequestLoader: The base interface for reading requests during a crawl.
  • RequestManager: Extends RequestLoader with write capabilities for adding and reclaiming requests.
  • RequestManagerTandem: Combines a read-only RequestLoader with a writable RequestManager.

And specific request loader implementations:

  • RequestList: A lightweight implementation for managing a static list of URLs.
  • SitemapRequestLoader: A specialized loader that reads URLs from XML sitemaps with filtering capabilities.

The relationships between these components and the RequestQueue are straightforward: RequestList and SitemapRequestLoader implement the read-only RequestLoader interface, RequestQueue implements the writable RequestManager, and RequestManagerTandem combines a loader with a manager.

Request loaders

The RequestLoader interface defines the foundation for fetching requests during a crawl. It provides abstract methods for basic operations such as retrieving requests, marking them as handled, and checking their status. Concrete implementations, such as RequestList, build on this interface to handle specific scenarios. You can create your own custom loader that reads from an external file, web endpoint, database, or any other data source. For more details, refer to the RequestLoader API reference.
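
As an illustration of such a custom loader, here is a minimal sketch backed by a plain Python list (a real one might read from a file or database instead). It implements fetch_next_request and mark_request_as_handled, which this guide relies on, plus count and status members whose exact names and signatures are assumptions here; verify them against the RequestLoader API reference before building on this.

import asyncio

from crawlee import Request
from crawlee.request_loaders import RequestLoader


class UrlListLoader(RequestLoader):
    """A hypothetical read-only loader serving requests from a fixed list of URLs.

    A real implementation could read from a file, a database, or a web endpoint instead.
    """

    def __init__(self, urls: list[str]) -> None:
        self._pending = [Request.from_url(url) for url in urls]
        self._total = len(self._pending)
        self._handled = 0

    # The member names below are assumptions based on the operations described
    # in the RequestLoader API reference; check the reference for the exact set.
    async def get_total_count(self) -> int:
        return self._total

    async def get_handled_count(self) -> int:
        return self._handled

    async def is_empty(self) -> bool:
        return not self._pending

    async def is_finished(self) -> bool:
        return self._handled >= self._total

    async def fetch_next_request(self) -> Request | None:
        return self._pending.pop(0) if self._pending else None

    async def mark_request_as_handled(self, request: Request) -> None:
        self._handled += 1


async def main() -> None:
    loader = UrlListLoader(['https://crawlee.dev/', 'https://apify.com/'])

    while request := await loader.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())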

Request list

The RequestList can accept an asynchronous generator as input, allowing requests to be streamed rather than loaded into memory all at once. This can significantly reduce memory usage, especially when working with large sets of URLs; a sketch of this streaming usage follows the basic example below.

Here is a basic example of working with the RequestList:

import asyncio

from crawlee.request_loaders import RequestList


async def main() -> None:
    # Open the request list; if it does not exist, it will be created.
    # Leave name empty to use the default request list.
    request_list = RequestList(
        name='my-request-list',
        requests=[
            'https://apify.com/',
            'https://crawlee.dev/',
            'https://crawlee.dev/python/',
        ],
    )

    # Fetch and process requests from the list.
    while request := await request_list.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await request_list.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
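
To illustrate the streaming behavior mentioned above, here is a sketch that feeds the RequestList from an asynchronous generator instead of an in-memory list. It assumes the same requests parameter accepts an async generator, as this section describes; the generated URLs are purely illustrative.

import asyncio
from collections.abc import AsyncGenerator

from crawlee.request_loaders import RequestList


async def stream_urls() -> AsyncGenerator[str, None]:
    # Illustrative source only: in practice the URLs could be read lazily
    # from a file, a database cursor, or a paginated API.
    for page in range(1, 4):
        yield f'https://crawlee.dev/?page={page}'


async def main() -> None:
    # The generator is consumed lazily, so the full set of URLs never has
    # to be held in memory at once.
    request_list = RequestList(
        name='my-streamed-request-list',
        requests=stream_urls(),
    )

    # Fetch and process requests from the list.
    while request := await request_list.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await request_list.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())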

Sitemap request loader

The SitemapRequestLoader is a specialized request loader that reads URLs from XML sitemaps. It's particularly useful when you want to crawl a website systematically by following its sitemap structure. The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The SitemapRequestLoader provides streaming processing of sitemaps, ensuring efficient memory usage without loading the entire sitemap into memory.

import asyncio
import re

from crawlee.http_clients import ImpitHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # Create an HTTP client for fetching sitemaps.
    async with ImpitHttpClient() as http_client:
        # Create a sitemap request loader with URL filtering.
        sitemap_loader = SitemapRequestLoader(
            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
            http_client=http_client,
            # Exclude all URLs that do not contain 'blog'.
            exclude=[re.compile(r'^((?!blog).)*$')],
            max_buffer_size=500,  # Buffer up to 500 URLs in memory.
        )

        while request := await sitemap_loader.fetch_next_request():
            # Do something with it...

            # And mark it as handled.
            await sitemap_loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())

Request managers

The RequestManager extends RequestLoader with write capabilities. In addition to reading requests, a request manager can add and reclaim them. This is essential for dynamic crawling projects where new URLs may emerge during the crawl process, or when certain requests fail and need to be retried. For more details, refer to the RequestManager API reference.
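
For a concrete picture of those write capabilities, the sketch below uses RequestQueue (the RequestManager implementation covered in the Storages guide) to add a request at runtime, reclaim it on failure, and mark it as handled on success. The add_request, reclaim_request, fetch_next_request, and mark_request_as_handled calls follow the Storages guide; consult the RequestManager API reference for the exact signatures.

import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # RequestQueue is a request manager: it can serve requests and accept new ones.
    request_queue = await RequestQueue.open()

    # Write capability: add a request discovered at runtime.
    await request_queue.add_request('https://crawlee.dev/')

    while request := await request_queue.fetch_next_request():
        try:
            # Process the request here...
            ...
        except Exception:
            # Write capability: return a failed request to the queue so it can be retried.
            await request_queue.reclaim_request(request)
        else:
            # Mark successfully processed requests as handled.
            await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())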

Request manager tandem

The RequestManagerTandem class allows you to combine the read-only capabilities of a RequestLoader (like RequestList) with the read-write capabilities of a RequestManager (like RequestQueue). This is useful for scenarios where you need to load initial requests from a static source (such as a file or database) and dynamically add or retry requests during the crawl. Additionally, it provides deduplication capabilities, ensuring that requests are not processed multiple times.

Under the hood, RequestManagerTandem checks whether the read-only loader still has pending requests. If so, each new request from the loader is transferred to the manager. Any newly added or reclaimed requests go directly to the manager side.

Request list with request queue

This section describes the combination of the RequestList and RequestQueue classes. This setup is particularly useful when you have a static list of URLs that you want to crawl, but also need to handle dynamic requests discovered during the crawl process. The RequestManagerTandem class facilitates this combination, with the RequestLoader.to_tandem method available as a convenient shortcut. Requests from the RequestList are processed first by being enqueued into the default RequestQueue, which handles persistence and retries for failed requests.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.request_loaders import RequestList


async def main() -> None:
    # Create a static request list.
    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])

    # Convert the request list to a request manager using the to_tandem method.
    # It is a tandem with the default request queue.
    request_manager = await request_list.to_tandem()

    # Create a crawler and pass the request manager to it.
    crawler = ParselCrawler(
        request_manager=request_manager,
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # New links will be enqueued directly to the queue.
        await context.enqueue_links()

    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
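
If you prefer not to use the to_tandem shortcut, the same setup can be assembled by hand: open a RequestQueue yourself and wrap both objects in a RequestManagerTandem. The sketch below assumes the tandem is constructed from the loader and the manager in that order, and that RequestManagerTandem is importable from crawlee.request_loaders; check the RequestManagerTandem API reference if in doubt.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.request_loaders import RequestList, RequestManagerTandem
from crawlee.storages import RequestQueue


async def main() -> None:
    # Create a static request list.
    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])

    # Open the default request queue, which provides the write side.
    request_queue = await RequestQueue.open()

    # Combine the read-only loader with the read-write manager explicitly.
    request_manager = RequestManagerTandem(request_list, request_queue)

    # Create a crawler and pass the request manager to it.
    crawler = ParselCrawler(
        request_manager=request_manager,
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # New links will be enqueued directly to the queue.
        await context.enqueue_links()

    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())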

Sitemap request loader with request queue

Similar to the RequestList example above, you can combine a SitemapRequestLoader with a RequestQueue using the RequestManagerTandem class. This setup is particularly useful when you want to crawl URLs from a sitemap while also handling dynamic requests discovered during the crawl process. URLs from the sitemap are processed first by being enqueued into the default RequestQueue, which handles persistence and retries for failed requests.

import asyncio
import re

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # Create an HTTP client for fetching sitemaps.
    async with HttpxHttpClient() as http_client:
        # Create a sitemap request loader with URL filtering.
        sitemap_loader = SitemapRequestLoader(
            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
            http_client=http_client,
            # Include only URLs that contain 'docs'.
            include=[re.compile(r'.*docs.*')],
            max_buffer_size=500,  # Buffer up to 500 URLs in memory.
        )

        # Convert the sitemap loader to a request manager using the to_tandem method.
        # It is a tandem with the default request queue.
        request_manager = await sitemap_loader.to_tandem()

        # Create a crawler and pass the request manager to it.
        crawler = ParselCrawler(
            request_manager=request_manager,
            max_requests_per_crawl=10,  # Limit the max requests per crawl.
        )

        @crawler.router.default_handler
        async def handler(context: ParselCrawlingContext) -> None:
            # New links will be enqueued directly to the queue.
            await context.enqueue_links()

        await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())

Conclusion

This guide explained the request_loaders sub-package, which extends the functionality of the RequestQueue with additional tools for managing URLs and requests. You learned about the RequestLoader, RequestManager, and RequestManagerTandem classes, as well as the RequestList and SitemapRequestLoader implementations. You also saw practical examples of how to work with these classes to handle various crawling scenarios.

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!