
SitemapRequestLoader

A request loader that reads URLs from sitemap(s).

The loader fetches and parses sitemaps in the background, allowing crawling to start before all URLs are loaded. It supports filtering URLs using glob and regex patterns.
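The background-loading idea described above can be illustrated with a bounded queue: a producer task fills the buffer while the consumer starts working immediately. This is a minimal standard-library sketch, not the loader's actual implementation. `parse_sitemap`, `produce`, and `crawl` are hypothetical names, and the use of `asyncio.Queue` is an assumption made for illustration.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncIterator


async def parse_sitemap(urls: list[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for sitemap parsing: yields URLs one at a time."""
    for url in urls:
        await asyncio.sleep(0)  # Simulate I/O between sitemap entries.
        yield url


async def produce(urls: list[str], buffer: asyncio.Queue) -> None:
    """Fill the buffer in the background; put() waits while the buffer is full."""
    async for url in parse_sitemap(urls):
        await buffer.put(url)
    await buffer.put(None)  # Sentinel: no more URLs.


async def crawl(urls: list[str], max_buffer_size: int = 2) -> list[str]:
    """Consume URLs as soon as they are buffered, before loading completes."""
    buffer: asyncio.Queue = asyncio.Queue(maxsize=max_buffer_size)
    producer = asyncio.create_task(produce(urls, buffer))

    seen: list[str] = []
    while (url := await buffer.get()) is not None:
        seen.append(url)  # A real crawler would process the URL here.

    await producer
    return seen


if __name__ == "__main__":
    print(asyncio.run(crawl(["https://example.com/a", "https://example.com/b"])))
```

Bounding the queue keeps memory use flat for very large sitemaps, which is the same trade-off the `max_buffer_size` parameter exposes.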

Methods

__init__

  • __init__(sitemap_urls, http_client, *, proxy_info, include, exclude, max_buffer_size, parse_sitemap_options): None
  • Initialize the sitemap request loader.


    Parameters

    • sitemap_urls: list[str]

      URLs of the sitemaps to load requests from.

    • http_client: HttpClient

      The HttpClient instance to use for fetching sitemaps.

    • optional keyword-only proxy_info: ProxyInfo | None = None

      Optional proxy to use for fetching sitemaps.

    • optional keyword-only include: list[re.Pattern[Any] | Glob] | None = None

      List of glob or regex patterns that a URL must match to be included.

    • optional keyword-only exclude: list[re.Pattern[Any] | Glob] | None = None

      List of glob or regex patterns; URLs matching any of them are excluded.

    • optional keyword-only max_buffer_size: int = 200

      Maximum number of URLs to buffer in memory.

    • optional keyword-only parse_sitemap_options: ParseSitemapOptions | None = None

      Options for parsing sitemaps, such as SitemapSource and max_urls.

    Returns None
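To make the `include` and `exclude` parameters concrete, here is a rough standard-library sketch of how such filtering could behave. It is not the loader's implementation. `GlobPattern` is a hypothetical stand-in for the library's `Glob` type (matched here with `fnmatch`, whose `*` also matches `/`), and the assumption that exclusion takes precedence over inclusion is ours.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class GlobPattern:
    """Hypothetical stand-in for the library's Glob type."""

    pattern: str


def matches(url: str, pattern: re.Pattern[str] | GlobPattern) -> bool:
    """Return True if the URL matches a single glob or regex pattern."""
    if isinstance(pattern, GlobPattern):
        return fnmatchcase(url, pattern.pattern)
    return pattern.search(url) is not None


def is_allowed(
    url: str,
    include: list[re.Pattern[str] | GlobPattern] | None = None,
    exclude: list[re.Pattern[str] | GlobPattern] | None = None,
) -> bool:
    """Apply exclude patterns first (assumed precedence), then include patterns."""
    if exclude and any(matches(url, p) for p in exclude):
        return False
    if include:
        return any(matches(url, p) for p in include)
    return True  # No include list: everything not excluded passes.


if __name__ == "__main__":
    include = [re.compile(r"/docs/")]
    exclude = [GlobPattern("*/draft/*")]
    for candidate in ("https://example.com/docs/intro", "https://example.com/docs/draft/x"):
        print(candidate, is_allowed(candidate, include, exclude))
```

Mixing compiled regexes and globs in one list mirrors the `list[re.Pattern[Any] | Glob]` type of both parameters.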

abort_loading

  • async abort_loading(): None
  • Abort the sitemap loading process.


    Returns None

fetch_next_request

  • async fetch_next_request(): Request | None

get_handled_count

  • async get_handled_count(): int

get_total_count

  • async get_total_count(): int

is_empty

  • async is_empty(): bool

is_finished

  • async is_finished(): bool

mark_request_as_handled
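The methods above are typically used together in a fetch-and-acknowledge loop. The sketch below shows that call pattern against an in-memory stand-in so it runs without network access. `FakeLoader`, `FakeRequest`, and `drain` are hypothetical. The assumptions that `mark_request_as_handled` takes the fetched request, that `fetch_next_request` returns `None` when nothing is left, and that `is_empty` and `is_finished` differ only by in-progress requests are inferred from the method names, not confirmed by this page.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical minimal request carrying only a URL."""

    url: str


class FakeLoader:
    """Hypothetical in-memory stand-in exposing the loader's method names."""

    def __init__(self, urls: list[str]) -> None:
        self._pending = [FakeRequest(u) for u in urls]
        self._total = len(urls)
        self._in_progress: set[str] = set()
        self._handled = 0

    async def fetch_next_request(self) -> FakeRequest | None:
        if not self._pending:
            return None
        request = self._pending.pop(0)
        self._in_progress.add(request.url)
        return request

    async def mark_request_as_handled(self, request: FakeRequest) -> None:
        self._in_progress.discard(request.url)
        self._handled += 1

    async def get_handled_count(self) -> int:
        return self._handled

    async def get_total_count(self) -> int:
        return self._total

    async def is_empty(self) -> bool:
        return not self._pending  # Nothing left to fetch.

    async def is_finished(self) -> bool:
        return not self._pending and not self._in_progress  # All acknowledged.


async def drain(loader: FakeLoader) -> list[str]:
    """Fetch every request, process it, and mark it as handled."""
    processed: list[str] = []
    while (request := await loader.fetch_next_request()) is not None:
        processed.append(request.url)  # Real processing would happen here.
        await loader.mark_request_as_handled(request)
    return processed


if __name__ == "__main__":
    print(asyncio.run(drain(FakeLoader(["https://example.com/a"]))))
```

Acknowledging each request only after it has been processed keeps the handled count accurate and lets `is_finished` distinguish between "nothing left to fetch" and "everything done".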

to_tandem

  • Combine the loader with a request manager to support adding and reclaiming requests.


    Parameters

    • optional request_manager: RequestManager | None = None

      Request manager to combine the loader with. If None is given, the default request queue is used.

    Returns RequestManagerTandem