
Examples

๐Ÿ“„๏ธ BeautifulSoup crawler

This example demonstrates how to use BeautifulSoupCrawler to crawl a list of URLs, load each URL with a plain HTTP request, parse the HTML using the BeautifulSoup library, and extract some data from it: the page title and all `<h1>`, `<h2>` and `<h3>` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup library, you can easily navigate the HTML structure and retrieve the data you need with minimal code. It also shows how you can add an optional pre-navigation hook to the crawler. Pre-navigation hooks are user-defined functions that execute before the request is sent.
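
A minimal sketch of this pattern, assuming a recent Crawlee for Python release (the import path and hook decorator may differ between versions, and the start URL is a placeholder):

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    # Pre-navigation hook: runs before each HTTP request is sent.
    @crawler.pre_navigation_hook
    async def hook(context) -> None:
        context.log.info(f'About to fetch {context.request.url}')

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # context.soup is the parsed BeautifulSoup document.
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
            'headings': [h.get_text() for h in context.soup.find_all(['h1', 'h2', 'h3'])],
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```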

๐Ÿ“„๏ธ Crawl specific links on website

This example demonstrates how to crawl a website while targeting specific patterns of links. Using the enqueue_links helper, you can pass include or exclude parameters to refine your crawling strategy. This ensures that only the links matching the specified patterns are added to the RequestQueue. Both include and exclude accept lists of globs or regular expressions. This functionality is great for focusing on the relevant sections of a website and avoiding unnecessary or irrelevant content.
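
For illustration, a handler along these lines (the glob patterns are placeholders):

```python
from crawlee import Glob
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()


@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    # Only links matching `include` and not matching `exclude` are
    # added to the request queue; both accept globs or regexes.
    await context.enqueue_links(
        include=[Glob('https://example.com/docs/**')],
        exclude=[Glob('https://example.com/docs/archive/**')],
    )
```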

๐Ÿ“„๏ธ Stopping a Crawler with stop method

This example demonstrates how to use the stop method of BasicCrawler to stop the crawler once it finds what it is looking for. The method is available to all crawlers that inherit from BasicCrawler, and the example below demonstrates it with BeautifulSoupCrawler. Simply call crawler.stop() and the crawler will not continue on to new requests; requests that are already being processed concurrently are allowed to finish. You can also pass an optional reason argument, a string that is included in the logs, which improves their readability, especially if you have several different conditions for triggering a stop.
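
A short sketch of that idea, assuming a recent Crawlee for Python release; the stop condition here is hypothetical:

```python
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()


@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    # Hypothetical condition: stop once a page contains the phrase we want.
    if 'the data we need' in context.soup.get_text():
        # The optional reason string is included in the crawler's logs.
        crawler.stop(reason='Found the page we were looking for')
        return
    await context.enqueue_links()
```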

๐Ÿ“„๏ธ Parsel crawler

This example shows how to use ParselCrawler to crawl a website or a list of URLs. Each URL is loaded with a plain HTTP request and the response is parsed using the Parsel library, which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. XPath lets us extract data from all kinds of complex HTML structures. In this example, we use Parsel to crawl github.com and extract the page title, the URL, and any emails found on the page. The default handler scrapes data from the current page and enqueues all the links found there for continuous scraping. It also shows how you can add an optional pre-navigation hook to the crawler. Pre-navigation hooks are user-defined functions that execute before the request is sent.
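
A condensed sketch of that flow, again assuming a recent Crawlee for Python release (the email regex is a deliberately simple placeholder, not an exhaustive pattern):

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(max_requests_per_crawl=10)

    # Pre-navigation hook: runs before each request is sent.
    @crawler.pre_navigation_hook
    async def hook(context) -> None:
        context.log.info(f'Fetching {context.request.url}')

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # context.selector is a Parsel Selector with CSS and XPath support.
        await context.push_data({
            'url': context.request.url,
            'title': context.selector.xpath('//title/text()').get(),
            # Simple placeholder regex applied to the raw response text.
            'emails': context.selector.re(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
        })
        # Enqueue every link found on the page for continuous scraping.
        await context.enqueue_links()

    await crawler.run(['https://github.com'])


if __name__ == '__main__':
    asyncio.run(main())
```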