Stopping a crawler with the stop method
This example demonstrates how to use the stop method of BasicCrawler to stop a crawler once it finds what it is looking for. The method is available to all crawlers that inherit from BasicCrawler; in the example below it is shown on BeautifulSoupCrawler. Simply call crawler.stop() and the crawler will not continue to crawl through new requests, although requests that are already being processed concurrently will be finished. The stop method also accepts an optional reason argument, a string that is included in the logs; it can improve log readability, especially if you have multiple different conditions that trigger a stop.
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create an instance of the BeautifulSoupCrawler class, a crawler that automatically
    # loads the URLs and parses their HTML using the BeautifulSoup library.
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    # The handler receives a context parameter, providing various properties and
    # helper methods. Here are a few key ones we use for demonstration:
    # - request: an instance of the Request class containing details such as the URL
    #   being crawled and the HTTP method used.
    # - soup: the BeautifulSoup object containing the parsed HTML of the response.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Create custom condition to stop crawler once it finds what it is looking for.
        if 'crawlee' in context.request.url:
            crawler.stop(reason='Manual stop of crawler after finding `crawlee` in the url.')

        # Extract data from the page.
        data = {
            'url': context.request.url,
        }

        # Push the extracted data to the default dataset. In local configuration,
        # the data will be stored as JSON files in ./storage/datasets/default.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
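
If you trigger a stop from several places, distinct reason strings make the logs easier to follow. Below is a minimal sketch (not part of the original example) that combines two stop conditions, each with its own reason. The max_items threshold and the items_collected counter are illustrative names introduced here for the sketch, not part of the Crawlee API.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # Hypothetical threshold used only for this sketch.
    max_items = 20
    items_collected = 0

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        nonlocal items_collected

        # Condition 1: stop as soon as a URL of interest is found.
        if 'crawlee' in context.request.url:
            crawler.stop(reason='Found a URL containing `crawlee`.')

        # Store the URL and count it toward the item limit.
        await context.push_data({'url': context.request.url})
        items_collected += 1

        # Condition 2: stop once enough items have been collected.
        if items_collected >= max_items:
            crawler.stop(reason=f'Collected {max_items} items.')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

As noted above, requests that are already being processed when stop is called will still be finished, so the handler may run a few more times before the crawler actually shuts down.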