Examples

๐Ÿ“„๏ธ Crawl specific links on website

This example demonstrates how to crawl a website while targeting specific patterns of links. Using the enqueue_links helper, you can pass include or exclude parameters to refine your crawling strategy, so that only the links matching the specified patterns are added to the RequestQueue. Both include and exclude accept lists of globs or regular expressions. This is useful for focusing on the relevant sections of a website and avoiding unnecessary or irrelevant content.
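The filtering behind include/exclude can be sketched with the standard library alone. This is a hypothetical stand-in for what enqueue_links does internally, not the Crawlee API: a URL is kept only if it matches at least one include glob and no exclude glob (the URLs and patterns below are made-up sample data).

```python
from fnmatch import fnmatch

def matches_patterns(url: str, include: list[str], exclude: list[str]) -> bool:
    """Keep a URL only if it matches an include glob and no exclude glob."""
    if not any(fnmatch(url, pattern) for pattern in include):
        return False
    return not any(fnmatch(url, pattern) for pattern in exclude)

links = [
    'https://example.com/docs/intro',
    'https://example.com/docs/archive/2020',
    'https://example.com/blog/post',
]
include = ['https://example.com/docs/*']
exclude = ['https://example.com/docs/archive/*']

# Only links under /docs/ that are not in /docs/archive/ survive.
kept = [url for url in links if matches_patterns(url, include, exclude)]
```

With regular expressions instead of globs, the same shape applies with re.search in place of fnmatch.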

๐Ÿ“„๏ธ Parsel crawler

This example shows how to use ParselCrawler to crawl a website or a list of URLs. Each URL is loaded with a plain HTTP request, and the response is parsed using the Parsel library, which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. XPath makes it possible to extract data from all kinds of complex HTML structures. In this example, we use Parsel to crawl github.com and extract the page title, the URL, and any emails found on the page. The default handler scrapes data from the current page and enqueues all the links it finds, so crawling continues.
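The extraction step of such a handler can be sketched with the standard library alone. This is not the Parsel API (Parsel would use CSS or XPath selectors on the response); it is a stdlib sketch of the same idea, pulling the title and e-mail addresses out of a made-up HTML snippet with regular expressions.

```python
import re

# Made-up sample page standing in for a fetched HTTP response body.
html = """
<html>
  <head><title>Example Org</title></head>
  <body>
    <p>Contact us at support@example.com or sales@example.com.</p>
  </body>
</html>
"""

# Page title: the text between <title> tags.
title_match = re.search(r'<title>(.*?)</title>', html, re.DOTALL)
title = title_match.group(1).strip() if title_match else None

# Emails: a simple pattern for local-part@domain.tld.
emails = re.findall(r'[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}', html)
```

In the real crawler, title and emails would be stored via the context's dataset, and the handler would then call enqueue_links to keep the crawl going.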