Add data to dataset
This example demonstrates how to store extracted data in datasets using the context.push_data helper function. If the specified dataset does not already exist, it is created automatically. You can also save data to custom datasets by passing the dataset_id or dataset_name parameter to push_data.
BeautifulSoup crawler
This example demonstrates how to use BeautifulSoupCrawler to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the BeautifulSoup library, and extract some data from it: the page title and all `<h1>`, `<h2>`, and `<h3>` tags. This setup is well suited to scraping specific elements from web pages. With BeautifulSoup, you can navigate the HTML structure and retrieve the data you need with minimal code. The example also shows how to add an optional pre-navigation hook to the crawler. Pre-navigation hooks are user-defined functions that execute before the request is sent.
Capture screenshots using Playwright
This example demonstrates how to capture screenshots of web pages using PlaywrightCrawler and store them in the key-value store.
Crawl all links on website
This example uses the enqueue_links helper to add new links to the RequestQueue as the crawler navigates from page to page. By automatically discovering and enqueuing all links on a given page, the crawler can systematically scrape an entire website. This approach is ideal for web scraping tasks where you need to collect data from multiple interconnected pages.
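The enqueue-and-follow loop described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not Crawlee's implementation: the `SITE` mapping, the `crawl_all_links` function, and the example.com URLs are hypothetical stand-ins for real pages and for the RequestQueue.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links found on that page.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl_all_links(start_url: str) -> list:
    """Visit every page reachable from start_url exactly once."""
    queue = deque([start_url])  # stands in for the RequestQueue
    seen = {start_url}          # URLs that were already enqueued
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)     # a real request handler would extract data here
        for link in SITE.get(url, []):
            if link not in seen:  # enqueue each newly discovered link only once
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    print(crawl_all_links("https://example.com/"))
```

The `seen` set is what keeps the crawl from looping forever on pages that link back to each other.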
Crawl multiple URLs
This example demonstrates how to crawl a specified list of URLs using different crawlers. You'll learn how to set up the crawler, define a request handler, and run the crawler with multiple URLs. This setup is useful for scraping data from multiple pages or websites concurrently.
Crawl specific links on website
This example demonstrates how to crawl a website while targeting specific patterns of links. By utilizing the enqueue_links helper, you can pass include or exclude parameters to improve your crawling strategy. This approach ensures that only the links matching the specified patterns are added to the RequestQueue. Both include and exclude support lists of globs or regular expressions. This functionality is great for focusing on relevant sections of a website and avoiding scraping unnecessary or irrelevant content.
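To make the include and exclude semantics concrete, here is a rough standard-library sketch of pattern-based link filtering. It is an assumption-laden illustration rather than Crawlee's own matching code: `matches` and `filter_links` are hypothetical helpers, and the glob behaviour is that of Python's `fnmatch`, which may differ in detail from the glob rules the library applies.

```python
import re
from fnmatch import fnmatchcase
from typing import Union

Pattern = Union[str, re.Pattern]  # a glob string or a compiled regular expression


def matches(url: str, pattern: Pattern) -> bool:
    """Return True if the URL matches a glob string or a compiled regex."""
    if isinstance(pattern, re.Pattern):
        return pattern.search(url) is not None
    return fnmatchcase(url, pattern)


def filter_links(urls, include=None, exclude=None):
    """Keep URLs that match any include pattern and no exclude pattern."""
    kept = []
    for url in urls:
        if include and not any(matches(url, p) for p in include):
            continue  # an include list was given and nothing matched
        if exclude and any(matches(url, p) for p in exclude):
            continue  # explicitly excluded
        kept.append(url)
    return kept


if __name__ == "__main__":
    links = [
        "https://example.com/docs/intro",
        "https://example.com/docs/api/v1",
        "https://example.com/blog/post-1",
        "https://example.com/docs/legacy/old",
    ]
    print(filter_links(
        links,
        include=["https://example.com/docs/*"],  # glob
        exclude=[re.compile(r"/legacy/")],       # regular expression
    ))
```

In this sketch an exclude match always wins over an include match, which is one reasonable reading of the behaviour described above.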
Crawl website with relative links
When crawling a website, you may encounter various types of links that you wish to include in your crawl. To facilitate this, we provide the enqueue_links method on the crawler context, which will automatically find and add these links to the crawler's RequestQueue. This method simplifies the process of handling different types of links, including relative links, by automatically resolving them based on the page's context.
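Resolving a relative link against the page it was found on can be illustrated with `urllib.parse.urljoin` from the standard library. This sketch only shows the resolution step and makes no claim about how enqueue_links performs it internally; `resolve_links` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_links(page_url: str, hrefs: list) -> list:
    """Turn href values found on a page into absolute URLs."""
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == "__main__":
    page = "https://example.com/docs/guide/index.html"
    hrefs = [
        "intro.html",               # relative to the current directory
        "../api/",                  # relative to the parent directory
        "/about",                   # relative to the site root
        "https://other.example/x",  # already absolute, left unchanged
    ]
    for absolute in resolve_links(page, hrefs):
        print(absolute)
```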
Keep a Crawler alive waiting for more requests
This example demonstrates how to keep a crawler alive even when there are currently no requests to process, by using the keep_alive=True argument of BasicCrawler.__init__. This option is available to all crawlers that inherit from BasicCrawler; the example below uses BeautifulSoupCrawler. To stop a crawler that was started with keep_alive=True, call crawler.stop().
Stopping a Crawler with stop method
This example demonstrates how to use the stop method of BasicCrawler to stop the crawler once it finds what it is looking for. The method is available to all crawlers that inherit from BasicCrawler; the example below uses BeautifulSoupCrawler. Call crawler.stop() to stop the crawler. It will not start any new requests, but requests that are already being processed concurrently will be finished. The stop method accepts an optional string argument, reason, which is written to the logs. This improves log readability, especially when several different conditions can trigger a stop.
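The stop behaviour described above (no new requests are started, in-flight requests finish, and a reason is recorded) can be mimicked in a few lines of asyncio. `MiniCrawler` is a hypothetical stand-in written only for this illustration; it is not part of Crawlee and does not reflect how BasicCrawler is implemented, and the default reason string is an arbitrary placeholder.

```python
import asyncio


class MiniCrawler:
    """Hypothetical stand-in that mimics only the stop() behaviour."""

    def __init__(self, urls):
        self._pending = list(urls)
        self._stopped = False
        self.stop_reason = None
        self.finished = []

    def stop(self, reason="stop requested"):
        # Remember the reason (for logging) and refuse to start new requests.
        self._stopped = True
        self.stop_reason = reason

    async def _handle(self, url):
        await asyncio.sleep(0)  # simulate I/O and yield to the other tasks
        self.finished.append(url)
        if url.endswith("/target"):
            self.stop(reason="Found the target page")

    async def run(self, concurrency=2):
        while self._pending and not self._stopped:
            size = min(concurrency, len(self._pending))
            batch = [self._pending.pop(0) for _ in range(size)]
            # Requests in this batch are already in flight, so they all finish
            # even if one of them calls stop() along the way.
            await asyncio.gather(*(self._handle(url) for url in batch))


if __name__ == "__main__":
    crawler = MiniCrawler([
        "https://example.com/target",
        "https://example.com/a",
        "https://example.com/b",
    ])
    asyncio.run(crawler.run())
    print(crawler.finished, crawler.stop_reason)
```

Here the second URL is processed because it was started together with the target page, while the third is never requested.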
Export entire dataset to file
This example demonstrates how to use the BasicCrawler.export_data method to export the entire default dataset to a single file. The method supports the CSV and JSON formats.
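As a rough picture of what exporting a dataset to a single CSV or JSON file involves, the following standard-library sketch writes a list of dictionaries to a file whose format is chosen by its extension. `export_items` is a hypothetical helper for illustration only; it is not the export_data method, and choosing the format by file extension is an assumption of this sketch.

```python
import csv
import json
from pathlib import Path


def export_items(items, path):
    """Write a list of dicts to a single .json or .csv file."""
    path = Path(path)
    if path.suffix == ".json":
        path.write_text(json.dumps(items, indent=2), encoding="utf-8")
    elif path.suffix == ".csv":
        # Use the union of all keys, in first-seen order, as the header row.
        fieldnames = list(dict.fromkeys(key for item in items for key in item))
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(items)
    else:
        raise ValueError(f"Unsupported export format: {path.suffix}")
```

Calling `export_items(items, "results.csv")` and `export_items(items, "results.json")` with the same list would produce one file per format.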
Fill and submit web form
This example demonstrates how to fill in and submit a web form using HttpCrawler. The same approach applies to any crawler that inherits from it, such as BeautifulSoupCrawler or ParselCrawler.
Parsel crawler
This example shows how to use ParselCrawler to crawl a website or a list of URLs. Each URL is loaded using a plain HTTP request, and the response is parsed using the Parsel library, which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. XPath makes it possible to extract data from complex HTML structures. In this example, we use Parsel to crawl github.com and extract the page title, the URL, and any email addresses found on the page. The default handler scrapes data from the current page and enqueues all links found on it so that the crawl continues. The example also shows how to add an optional pre-navigation hook to the crawler. Pre-navigation hooks are user-defined functions that execute before the request is sent.
Playwright crawler
This example demonstrates how to use PlaywrightCrawler to recursively scrape the Hacker News website using headless Chromium and Playwright.
Playwright crawler with block requests
This example demonstrates how to optimize your PlaywrightCrawler performance by blocking unnecessary network requests.
Playwright crawler with Camoufox
This example demonstrates how to integrate Camoufox into PlaywrightCrawler using BrowserPool with a custom PlaywrightBrowserPlugin.