SitemapRequestLoaderState
Index
Properties
completed
Whether all sitemaps have been fully processed.
current_sitemap_processed_urls
URLs from the current sitemap that have been added to the queue.
handled_count
Number of URLs that have been successfully handled.
in_progress
Set of request URLs currently being processed.
in_progress_sitemap_url
The sitemap URL currently being processed.
model_config
pending_sitemap_urls
Queue of sitemap URLs that need to be fetched and processed.
processed_sitemap_urls
Set of processed sitemap URLs.
total_count
Total number of URLs found and added to the queue from all processed sitemaps.
url_queue
Queue of URLs extracted from sitemaps and ready for processing.
State model for persisting sitemap request loader data.
The crawler processes one sitemap at a time. The current sitemap is stored in in_progress_sitemap_url. The parse_sitemap function parses the sitemap and returns elements as an async iterator. Each element retrieved from the iterator is processed based on its type. If the element is a NestedSitemap, its URL is added to pending_sitemap_urls if it hasn't been processed yet (not in processed_sitemap_urls). If the element is a SitemapUrl, the system checks whether it already exists in current_sitemap_processed_urls. If it exists, the loader was restarted from a saved state and the URL is skipped.

If the URL is new, it is first added to url_queue, then to current_sitemap_processed_urls, and total_count is incremented by 1. When all elements from the current sitemap iterator have been processed, in_progress_sitemap_url is set to None, the sitemap URL is added to processed_sitemap_urls, and current_sitemap_processed_urls is cleared. The next sitemap is retrieved from pending_sitemap_urls, skipping any URLs that already exist in processed_sitemap_urls. If pending_sitemap_urls is empty, completed is set to True.

When fetch_next_request is called, a URL is extracted from url_queue and placed in in_progress. When mark_request_as_handled is called for the extracted URL, it is removed from in_progress and handled_count is incremented by 1.

During initial startup or restart after persistence, state validation occurs in _get_state. If both pending_sitemap_urls and in_progress_sitemap_url are empty and completed is False, this indicates a fresh start. In this case, self._sitemap_urls are moved to pending_sitemap_urls. Otherwise, the system is restarting from a persisted state. If in_progress contains any URLs, they are moved back to url_queue and in_progress is cleared.
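
The state transitions described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation. LoaderState is a hypothetical dataclass that reuses the documented field names. The functions process_sitemap, next_sitemap and restore are invented for illustration, and fetch_next_request and mark_request_as_handled are simplified synchronous stand-ins that only borrow the names of the documented loader methods. Parsed sitemap elements are represented as plain (kind, url) tuples instead of NestedSitemap and SitemapUrl objects, and the order in which in-progress URLs are re-queued on restart is an assumption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Iterable, Optional, Tuple


@dataclass
class LoaderState:
    """Hypothetical stand-in for the state model; field names follow the docs."""
    url_queue: deque = field(default_factory=deque)
    in_progress: set = field(default_factory=set)
    pending_sitemap_urls: deque = field(default_factory=deque)
    in_progress_sitemap_url: Optional[str] = None
    current_sitemap_processed_urls: set = field(default_factory=set)
    processed_sitemap_urls: set = field(default_factory=set)
    completed: bool = False
    total_count: int = 0
    handled_count: int = 0


def restore(state: LoaderState, initial_sitemap_urls: Iterable[str]) -> None:
    """Validate state on startup: seed a fresh state or re-queue in-flight URLs."""
    fresh = (not state.pending_sitemap_urls
             and state.in_progress_sitemap_url is None
             and not state.completed)
    if fresh:
        state.pending_sitemap_urls.extend(initial_sitemap_urls)
    elif state.in_progress:
        # Assumption: re-queued URLs go to the back of the queue, in set order.
        state.url_queue.extend(state.in_progress)
        state.in_progress.clear()


def next_sitemap(state: LoaderState) -> Optional[str]:
    """Pop the next unprocessed sitemap URL, or mark the state completed."""
    while state.pending_sitemap_urls:
        url = state.pending_sitemap_urls.popleft()
        if url not in state.processed_sitemap_urls:
            return url
    state.completed = True
    return None


def process_sitemap(state: LoaderState, sitemap_url: str,
                    elements: Iterable[Tuple[str, str]]) -> None:
    """Consume parsed elements, given as ("nested" | "url", url) tuples."""
    state.in_progress_sitemap_url = sitemap_url
    for kind, url in elements:
        if kind == "nested":
            if url not in state.processed_sitemap_urls:
                state.pending_sitemap_urls.append(url)
        elif url not in state.current_sitemap_processed_urls:
            state.url_queue.append(url)
            state.current_sitemap_processed_urls.add(url)
            state.total_count += 1
    # The whole sitemap has been consumed: reset the per-sitemap bookkeeping.
    state.in_progress_sitemap_url = None
    state.processed_sitemap_urls.add(sitemap_url)
    state.current_sitemap_processed_urls.clear()


def fetch_next_request(state: LoaderState) -> Optional[str]:
    """Move one URL from the queue into in_progress and return it."""
    if not state.url_queue:
        return None
    url = state.url_queue.popleft()
    state.in_progress.add(url)
    return url


def mark_request_as_handled(state: LoaderState, url: str) -> None:
    """Remove a fetched URL from in_progress and count it as handled."""
    if url in state.in_progress:
        state.in_progress.remove(url)
        state.handled_count += 1


state = LoaderState()
restore(state, ["https://example.com/sitemap.xml"])  # fresh start
sitemap = next_sitemap(state)
process_sitemap(state, sitemap, [
    ("nested", "https://example.com/sitemap-2.xml"),
    ("url", "https://example.com/a"),
    ("url", "https://example.com/b"),
])
url = fetch_next_request(state)
mark_request_as_handled(state, url)
print(state.total_count, state.handled_count)  # prints: 2 1
```

If the process stopped after a later fetch_next_request call but before the matching mark_request_as_handled, calling restore on the persisted state would move that URL from in_progress back to url_queue, so it would be fetched again rather than lost.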