Real-world project
Hey, guys, you know, it's cool that we can scrape the
<title>
elements of web pages, but that's not very useful. Can we finally scrape some real data and save it somewhere in a machine-readable format? Because that's why I started reading this tutorial in the first place!
We hear you, young padawan! First, learn how to crawl, you must. Only then, walk through data, you can!
Making a production-grade crawlerโ
Making a production-grade crawler is not difficult, but there are many pitfalls of scraping that can catch you off guard. So for the real world project you'll learn how to scrape an Warehouse store example instead of the Crawlee website. It contains a list of products of different categories, and each product has its own detail page.
The website requires JavaScript rendering, which allows us to showcase more features of Crawlee. We've also added some helpful tips that prepare you for the real-world issues that you will surely encounter when scraping at scale.
If you're not interested in crawling theory, feel free to skip to the next chapter and get right back to coding.
Drawing a planโ
Sometimes scraping is really straightforward, but most of the time, it really pays off to do a bit of research first and try to answer some of these questions:
- How is the website structured?
- Can I scrape it only with HTTP requests (read "with some
HttpCrawler
, e.g.BeautifulSoupCrawler
")? - Do I need a headless browser for something?
- Are there any anti-scraping protections in place?
- Do I need to parse the HTML or can I get the data otherwise, such as directly from the website's API?
For the purposes of this tutorial, let's assume that the website cannot be scraped with HttpCrawler
. It actually can, but we would have to dive a bit deeper than this introductory guide allows. So for now we will make things easier for you, scrape it with PlaywrightCrawler
, and you'll learn about headless browsers in the process.
Choosing the data you needโ
A good first step is to figure out what data you want to scrape and where to find it. For the time being, let's just agree that we want to scrape all products from all categories available on the all collections page of the store and for each product we want to get its:
- URL
- Manufacturer
- SKU
- Title
- Current price
- Stock available
You will notice that some information is available directly on the list page, but for details such as "SKU" we'll also need to open the product's detail page.
The start URL(s)โ
This is where you start your crawl. It's convenient to start as close to the data as possible. For example, it wouldn't make much sense to start at https://warehouse-theme-metal.myshopify.com and look for a collections
link there, when we already know that everything we want to extract can be found at the https://warehouse-theme-metal.myshopify.com/collections page.
Exploring the pageโ
Let's take a look at the https://warehouse-theme-metal.myshopify.com/collections page more carefully. There are some categories on the page, and each category has a list of items. On some category pages, at the bottom you will notice there are links to the next pages of results. This is usually called the pagination.
Categories and sortingโ
When you click the categories, you'll see that they load a page of products filtered by that category. By going through a few categories and observing the behavior, we can also observe that we can sort by different conditions (such as Best selling
, or Price, low to high
), but for this example, we will not be looking into those.
Be careful, because on some websites, like amazon.com, this is not true and the sum of products in categories is actually larger than what's available without filters. Learn more in our tutorial on scraping websites with limited pagination.
Paginationโ
The pagination of the demo Warehouse Store is simple enough. When switching between pages, you will see that the URL changes to:
https://warehouse-theme-metal.myshopify.com/collections/headphones?page=2
Try clicking on the link to page 4. You'll see that the pagination links update and show more pages. But can you trust that this will include all pages and won't stop at some point?
Similarly to the issue with filters explained above, the existence of pagination does not guarantee that you can simply paginate through all the results. Always test your assumptions about pagination. Otherwise, you might miss a chunk of results, and not even know about it.
At the time of writing the Headphones
collection results counter showed 75 results - products. Quick count of products on one page of results makes 24. 6 rows times 4 products. This means that there are 4 pages of results.
If you're not convinced, you can visit a page somewhere in the middle, like https://warehouse-theme-metal.myshopify.com/collections/headphones?page=2
and see how the pagination looks there.
The crawling strategyโ
Now that you know where to start and how to find all the collection details, let's look at the crawling process.
- Visit the store page containing the list of categories (our start URL).
- Enqueue all links to all categories.
- Enqueue all product pages from the current page.
- Enqueue links to next pages of results.
- Open the next page in queue.
- When it's a results list page, go to 2.
- When it's a product page, scrape the data.
- Repeat until all results pages and all products have been processed.
PlaywrightCrawler
will make sure to visit the pages for you, if you provide the correct requests, and you already know how to enqueue pages, so this should be fairly easy. Nevertheless, there are few more tricks that we'd like to showcase.
Sanity checkโ
Let's check that everything is set up correctly before writing the scraping logic itself. You might realize that something in your previous analysis doesn't quite add up, or the website might not behave exactly as you expected.
The example below creates a new crawler that visits the start URL and prints the text content of all the categories on that page. When you run the code, you will see the very badly formatted content of the individual category card.
import asyncio
# Instead of BeautifulSoupCrawler let's use Playwright to be able to render JavaScript.
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
async def main() -> None:
crawler = PlaywrightCrawler()
@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
# Wait for the collection cards to render on the page. This ensures that
# the elements we want to interact with are present in the DOM.
await context.page.wait_for_selector('.collection-block-item')
# Execute a function within the browser context to target the collection card elements
# and extract their text content, trimming any leading or trailing whitespace.
category_texts = await context.page.eval_on_selector_all(
'.collection-block-item',
'(els) => els.map(el => el.textContent.trim())',
)
# Log the extracted texts.
for i, text in enumerate(category_texts):
context.log.info(f'CATEGORY_{i + 1}: {text}')
await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])
if __name__ == '__main__':
asyncio.run(main())
If you're wondering how to get that .collection-block-item
selector. We'll explain it in the next chapter on DevTools.