SitemapRequestLoaderState
Index
Properties
completed
Whether all sitemaps have been fully processed.
current_sitemap_processed_urls
URLs from the current sitemap that have been added to the queue.
handled_count
Number of URLs that have been successfully handled.
in_progress
Set of request URLs currently being processed.
in_progress_sitemap_url
The sitemap URL currently being processed.
model_config
pending_sitemap_urls
Queue of sitemap URLs that need to be fetched and processed.
processed_sitemap_urls
Set of processed sitemap URLs.
total_count
Total number of URLs found and added to the queue from all processed sitemaps.
url_queue
Queue of URLs extracted from sitemaps and ready for processing.
State model for persisting sitemap request loader data.
The crawler processes one sitemap at a time. The current sitemap is stored in in_progress_sitemap_url. The parse_sitemap function parses the sitemap and returns elements as an async iterator. Each element retrieved from the iterator is processed based on its type. If the element is a NestedSitemap, its URL is added to pending_sitemap_urls if it hasn't been processed yet (not in processed_sitemap_urls). If the element is a SitemapUrl, the system checks whether it already exists in current_sitemap_processed_urls. If it exists, the loader was restarted from a saved state and the URL is skipped.

If the URL is new, it is first added to url_queue, then to current_sitemap_processed_urls, and total_count is incremented by 1. When all elements from the current sitemap iterator have been processed, in_progress_sitemap_url is set to None, the sitemap URL is added to processed_sitemap_urls, and current_sitemap_processed_urls is cleared. The next sitemap is retrieved from pending_sitemap_urls, skipping any URLs that already exist in processed_sitemap_urls. If pending_sitemap_urls is empty, completed is set to True.

When fetch_next_request is called, a URL is extracted from url_queue and placed in in_progress. When mark_request_as_handled is called for the extracted URL, it is removed from in_progress and handled_count is incremented by 1.

During initial startup or restart after persistence, state validation occurs in _get_state. If both pending_sitemap_urls and in_progress_sitemap_url are empty and completed is False, this indicates a fresh start. In this case, self._sitemap_urls are moved to pending_sitemap_urls. Otherwise, the system is restarting from a persisted state. If in_progress contains any URLs, they are moved back to url_queue and in_progress is cleared.
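
The state transitions described above can be sketched in plain Python. This is only an illustrative model under stated assumptions, not the library's implementation. StateSketch, restore, process_sitemap and load_all are hypothetical names. NestedSitemap and SitemapUrl are minimal stand-ins for the real element types. The field types (deque, set) are guesses based on the property descriptions. The sketch is synchronous and takes already-parsed elements from a dict, whereas the real loader consumes the async iterator returned by parse_sitemap.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NestedSitemap:  # stand-in for the real element type
    loc: str


@dataclass(frozen=True)
class SitemapUrl:  # stand-in for the real element type
    loc: str


@dataclass
class StateSketch:  # hypothetical; field types are assumptions
    url_queue: deque[str] = field(default_factory=deque)
    in_progress: set[str] = field(default_factory=set)
    pending_sitemap_urls: deque[str] = field(default_factory=deque)
    in_progress_sitemap_url: str | None = None
    current_sitemap_processed_urls: set[str] = field(default_factory=set)
    processed_sitemap_urls: set[str] = field(default_factory=set)
    completed: bool = False
    total_count: int = 0
    handled_count: int = 0


def restore(state: StateSketch, sitemap_urls: list[str]) -> None:
    """Validation step on startup or restart (the role of _get_state)."""
    fresh = (not state.pending_sitemap_urls
             and state.in_progress_sitemap_url is None
             and not state.completed)
    if fresh:
        state.pending_sitemap_urls.extend(sitemap_urls)
    elif state.in_progress:
        # Requests that were in flight when the state was saved go back to the queue.
        state.url_queue.extend(state.in_progress)
        state.in_progress.clear()


def process_sitemap(state: StateSketch, sitemap_url: str, elements: list) -> None:
    state.in_progress_sitemap_url = sitemap_url
    for element in elements:
        if isinstance(element, NestedSitemap):
            if element.loc not in state.processed_sitemap_urls:
                state.pending_sitemap_urls.append(element.loc)
        elif element.loc not in state.current_sitemap_processed_urls:
            state.url_queue.append(element.loc)
            state.current_sitemap_processed_urls.add(element.loc)
            state.total_count += 1
        # else: URL was already queued before a restart, so it is skipped.
    state.in_progress_sitemap_url = None
    state.processed_sitemap_urls.add(sitemap_url)
    state.current_sitemap_processed_urls.clear()


def load_all(state: StateSketch, sitemaps: dict[str, list]) -> None:
    """`sitemaps` maps a sitemap URL to its parsed elements (illustrative input)."""
    if state.in_progress_sitemap_url is not None:
        # Assumption: a sitemap interrupted by a restart is processed again.
        url = state.in_progress_sitemap_url
        process_sitemap(state, url, sitemaps.get(url, []))
    while state.pending_sitemap_urls:
        sitemap_url = state.pending_sitemap_urls.popleft()
        if sitemap_url in state.processed_sitemap_urls:
            continue
        process_sitemap(state, sitemap_url, sitemaps.get(sitemap_url, []))
    state.completed = True


def fetch_next_request(state: StateSketch) -> str | None:
    if not state.url_queue:
        return None
    url = state.url_queue.popleft()
    state.in_progress.add(url)
    return url


def mark_request_as_handled(state: StateSketch, url: str) -> None:
    if url in state.in_progress:
        state.in_progress.remove(url)
        state.handled_count += 1


sitemaps = {
    "https://example.com/index.xml": [NestedSitemap("https://example.com/a.xml")],
    "https://example.com/a.xml": [
        SitemapUrl("https://example.com/1"),
        SitemapUrl("https://example.com/2"),
    ],
}
state = StateSketch()
restore(state, ["https://example.com/index.xml"])
load_all(state, sitemaps)
url = fetch_next_request(state)
mark_request_as_handled(state, url)
print(state.total_count, state.handled_count, state.completed)  # prints: 2 1 True
```

In this run the index sitemap contributes only a nested sitemap, so total_count counts just the two page URLs found in the nested one. Calling restore again on a state that still has entries in in_progress would push them back onto url_queue, which mirrors the restart behaviour described above.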