SitemapRequestLoaderState

State model for persisting sitemap request loader data.

The crawler processes one sitemap at a time. The current sitemap is stored in in_progress_sitemap_url. The parse_sitemap function parses the sitemap and returns its elements as an async iterator. Each element retrieved from the iterator is processed based on its type. If the element is a NestedSitemap, its URL is added to pending_sitemap_urls unless it has already been processed (that is, unless it is in processed_sitemap_urls). If the element is a SitemapUrl, the system checks whether its URL already exists in current_sitemap_processed_urls. If it does, the URL was queued before the loader was restarted from a saved state, so it is skipped.

If the URL is new, it is first added to url_queue, then to current_sitemap_processed_urls, and total_count is incremented by 1. When all elements from the current sitemap iterator have been processed, in_progress_sitemap_url is set to None, the sitemap URL is added to processed_sitemap_urls, and current_sitemap_processed_urls is cleared. The next sitemap is retrieved from pending_sitemap_urls, skipping any URLs that already exist in processed_sitemap_urls. If pending_sitemap_urls is empty, completed is set to True.
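The bookkeeping described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: the State dataclass, the two element classes, and the functions process_sitemap and next_sitemap are hypothetical stand-ins, and the elements are passed as an ordinary iterable rather than the async iterator that parse_sitemap returns.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class NestedSitemap:  # hypothetical stand-in for the element type named above
    url: str


@dataclass
class SitemapUrl:  # hypothetical stand-in for the element type named above
    url: str


@dataclass
class State:  # simplified stand-in holding only the fields used here
    url_queue: deque[str] = field(default_factory=deque)
    pending_sitemap_urls: deque[str] = field(default_factory=deque)
    processed_sitemap_urls: set[str] = field(default_factory=set)
    current_sitemap_processed_urls: set[str] = field(default_factory=set)
    in_progress_sitemap_url: Optional[str] = None
    total_count: int = 0
    completed: bool = False


def process_sitemap(state: State, elements: Iterable[object]) -> None:
    """Consume every element of the sitemap in in_progress_sitemap_url."""
    for item in elements:
        if isinstance(item, NestedSitemap):
            # Queue nested sitemaps that have not been processed yet.
            if item.url not in state.processed_sitemap_urls:
                state.pending_sitemap_urls.append(item.url)
        elif isinstance(item, SitemapUrl):
            # Already queued before a restart: skip it.
            if item.url in state.current_sitemap_processed_urls:
                continue
            state.url_queue.append(item.url)
            state.current_sitemap_processed_urls.add(item.url)
            state.total_count += 1

    # The iterator is exhausted: close out the current sitemap.
    finished = state.in_progress_sitemap_url
    state.in_progress_sitemap_url = None
    if finished is not None:
        state.processed_sitemap_urls.add(finished)
    state.current_sitemap_processed_urls.clear()


def next_sitemap(state: State) -> Optional[str]:
    """Pick the next unprocessed sitemap, or mark the state as completed."""
    while state.pending_sitemap_urls:
        url = state.pending_sitemap_urls.popleft()
        if url not in state.processed_sitemap_urls:
            state.in_progress_sitemap_url = url
            return url
    state.completed = True
    return None
```

Adding each URL to url_queue before recording it in current_sitemap_processed_urls follows the order given in the text, so a URL is only ever marked as seen once it is already queued.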

When fetch_next_request is called, a URL is extracted from url_queue and placed in in_progress. When mark_request_as_handled is called for the extracted URL, it is removed from in_progress and handled_count is incremented by 1.
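As a rough illustration of this hand-off, the sketch below uses plain functions and URL strings. It is an assumption-laden simplification: the State dataclass is a hypothetical stand-in, the real loader methods are not plain functions like these, and returning None for an empty queue as well as ignoring unknown URLs are choices made for this example only.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class State:  # simplified stand-in holding only the fields used here
    url_queue: deque[str] = field(default_factory=deque)
    in_progress: set[str] = field(default_factory=set)
    handled_count: int = 0


def fetch_next_request(state: State) -> Optional[str]:
    """Move the next URL from url_queue into in_progress and return it."""
    if not state.url_queue:
        return None  # assumption: nothing to hand out right now
    url = state.url_queue.popleft()
    state.in_progress.add(url)
    return url


def mark_request_as_handled(state: State, url: str) -> None:
    """Drop a previously fetched URL from in_progress and count it."""
    if url in state.in_progress:
        state.in_progress.remove(url)
        state.handled_count += 1
```

Because a URL sits in in_progress between the two calls, a state persisted at that moment still records that the URL was handed out but not finished.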

During initial startup or restart after persistence, state validation occurs in _get_state. If pending_sitemap_urls is empty, in_progress_sitemap_url is None, and completed is False, this indicates a fresh start. In this case, the URLs in self._sitemap_urls are moved to pending_sitemap_urls. Otherwise, the system is restarting from a persisted state. If in_progress contains any URLs, they are moved back to url_queue and in_progress is cleared.
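The validation step can be sketched as follows. This is a minimal sketch under stated assumptions, not the body of _get_state: the State dataclass and the function restore_or_init_state are hypothetical, and re-queued URLs are appended to the back of url_queue in sorted order purely to keep the example deterministic, since the text does not say where they are placed.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass
class State:  # simplified stand-in holding only the fields used here
    url_queue: deque[str] = field(default_factory=deque)
    in_progress: set[str] = field(default_factory=set)
    pending_sitemap_urls: deque[str] = field(default_factory=deque)
    in_progress_sitemap_url: Optional[str] = None
    completed: bool = False


def restore_or_init_state(state: State, sitemap_urls: Sequence[str]) -> State:
    """Seed a fresh state, or recover interrupted requests after a restart."""
    fresh_start = (
        not state.pending_sitemap_urls
        and state.in_progress_sitemap_url is None
        and not state.completed
    )
    if fresh_start:
        # Nothing was persisted yet: start from the configured sitemaps.
        state.pending_sitemap_urls.extend(sitemap_urls)
        return state

    # Restart from persisted state: requests that were handed out but never
    # marked as handled go back into the queue (position is an assumption).
    if state.in_progress:
        state.url_queue.extend(sorted(state.in_progress))
        state.in_progress.clear()
    return state
```

Returning interrupted URLs to url_queue means a request that was in flight when the process stopped is fetched again rather than lost.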

Properties

completed

completed: bool

Whether all sitemaps have been fully processed.

current_sitemap_processed_urls

current_sitemap_processed_urls: set[str]

URLs from the current sitemap that have been added to the queue.

handled_count

handled_count: int

Number of URLs that have been successfully handled.

in_progress

in_progress: set[str]

Set of request URLs currently being processed.

in_progress_sitemap_url

in_progress_sitemap_url: str | None

The sitemap URL currently being processed.

model_config

model_config: Undefined

pending_sitemap_urls

pending_sitemap_urls: deque[str]

Queue of sitemap URLs that need to be fetched and processed.

processed_sitemap_urls

processed_sitemap_urls: set[str]

Set of sitemap URLs that have already been fully processed.

total_count

total_count: int

Total number of URLs found and added to the queue from all processed sitemaps.

url_queue

url_queue: deque[str]

Queue of URLs extracted from sitemaps and ready for processing.
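Since the fields above are deques and sets, which JSON cannot represent directly, a persisted form has to store them as lists. The following standard-library sketch shows one way such a round trip could look. It is only an illustration: StateSketch, dump_state, and load_state are hypothetical names, the defaults are assumptions, and the actual model may serialize its fields differently.

```python
import json
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

# Field names grouped by container type, taken from the property list above.
DEQUE_FIELDS = ("url_queue", "pending_sitemap_urls")
SET_FIELDS = (
    "in_progress",
    "processed_sitemap_urls",
    "current_sitemap_processed_urls",
)


@dataclass
class StateSketch:  # hypothetical stand-in mirroring the documented fields
    url_queue: deque[str] = field(default_factory=deque)
    in_progress: set[str] = field(default_factory=set)
    pending_sitemap_urls: deque[str] = field(default_factory=deque)
    in_progress_sitemap_url: Optional[str] = None
    current_sitemap_processed_urls: set[str] = field(default_factory=set)
    processed_sitemap_urls: set[str] = field(default_factory=set)
    completed: bool = False
    total_count: int = 0
    handled_count: int = 0


def dump_state(state: StateSketch) -> str:
    """Serialize the state to JSON, turning deques and sets into lists."""
    data = dict(vars(state))
    for name in DEQUE_FIELDS:
        data[name] = list(data[name])  # keeps queue order
    for name in SET_FIELDS:
        data[name] = sorted(data[name])  # sorted only for stable output
    return json.dumps(data)


def load_state(text: str) -> StateSketch:
    """Rebuild the state from JSON, restoring the container types."""
    data = json.loads(text)
    for name in DEQUE_FIELDS:
        data[name] = deque(data[name])
    for name in SET_FIELDS:
        data[name] = set(data[name])
    return StateSketch(**data)
```

Queue order matters for url_queue and pending_sitemap_urls, so they are written as lists in their existing order, while the sets carry no order and can be written in any.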