Resuming a paused crawl
This example demonstrates how to resume a crawl from its last state when running locally, for example after it was unexpectedly terminated. If each run should continue from where the previous one left off, configure this with the purge_on_start option in Configuration.
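If you prefer not to mutate the configuration returned by service_locator.get_configuration(), you can also build a Configuration explicitly and register it before any crawler or storage is created. The snippet below is a minimal sketch; the crawlee.configuration import path and the service_locator.set_configuration call are assumptions based on the current Crawlee for Python API, so verify them against your installed version.

from crawlee import service_locator
from crawlee.configuration import Configuration  # assumed import path

# Register a configuration with purging disabled, so every storage and crawler
# created afterwards keeps its data between runs.
service_locator.set_configuration(Configuration(purge_on_start=False))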
Use the code below and perform two sequential runs. During the first run, stop the crawler by pressing CTRL+C; the second run will then resume crawling from where it stopped.
import asyncio

from crawlee import ConcurrencySettings, service_locator
from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)

# Disable clearing the `RequestQueue`, `KeyValueStore` and `Dataset` on each run.
# This makes the scraper continue from where it left off in the previous run.
# The recommended way to achieve this behavior is setting the environment variable
# `CRAWLEE_PURGE_ON_START=0`.
configuration = service_locator.get_configuration()
configuration.purge_on_start = False


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Let's slow down the crawler for the demonstration.
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=20)
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # List of links to crawl.
    requests = [
        'https://crawlee.dev',
        'https://crawlee.dev/python/docs',
        'https://crawlee.dev/python/docs/examples',
        'https://crawlee.dev/python/docs/guides',
        'https://crawlee.dev/python/docs/quick-start',
    ]

    await crawler.run(requests)


if __name__ == '__main__':
    asyncio.run(main())
Perform the first run, interrupting the crawler with CTRL+C after two links have been processed. Then start the second run to resume crawling and process the remaining three links.
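Pressing CTRL+C raises a KeyboardInterrupt inside asyncio.run(), so the first run usually ends with a traceback. If you want the interruption to exit quietly, one option, not part of the original example, is to catch it at the entry point:

if __name__ == '__main__':
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        # The request queue is persisted as the crawl progresses, so a second
        # run with purge_on_start disabled picks up the remaining requests.
        print('Crawl interrupted, run the script again to resume.')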
Alternatively, you can set the environment variable CRAWLEE_PURGE_ON_START=0 instead of setting configuration.purge_on_start = False in code. For example, when running the crawler:
CRAWLEE_PURGE_ON_START=0 python -m best_crawler
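Since the configuration is loaded from the environment, you can check that the variable was picked up before starting a crawl. This is a small sketch, assuming CRAWLEE_PURGE_ON_START=0 is already set in the shell:

from crawlee import service_locator

# With CRAWLEE_PURGE_ON_START=0 in the environment, the loaded configuration
# should already have purging disabled.
configuration = service_locator.get_configuration()
print(configuration.purge_on_start)  # Expected: False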