Crawling the Store
To crawl the whole example Warehouse Store and find all the data, you first need to visit all the pages with products - going through all categories available and also all the product detail pages.
Crawling the listing pages
In previous lessons, you used the enqueueLinks() function like this:
await enqueueLinks();
While useful in that scenario, you need something different now. Instead of finding all the <a href=".."> elements with links to the same hostname, you need to find only the specific ones that will take your crawler to the next page of results. Otherwise, the crawler will visit a lot of other pages that you're not interested in. Using the power of DevTools and yet another enqueueLinks() parameter, this becomes fairly easy.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        // Only run this logic on the main category listing, not on sub-pages.
        if (request.label !== 'CATEGORY') {
            // Wait for the category cards to render,
            // otherwise enqueueLinks wouldn't enqueue anything.
            await page.waitForSelector('.collection-block-item');
            // Add links to the queue, but only from
            // elements matching the provided selector.
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
The code should look pretty familiar to you. It's a very simple requestHandler that logs the currently processed URL to the console and enqueues more links. But there are also a few new, interesting additions. Let's break it down.
The selector parameter of enqueueLinks()
When you previously used enqueueLinks(), you did not provide a selector parameter, and that was fine, because you wanted the default value, 'a', which finds all <a> elements. But now, you need to be more specific. There are multiple <a> links on the Categories page, and you're only interested in those that will take your crawler to the available list of results. Using DevTools, you'll find that you can select the links you need with the .collection-block-item selector, which selects all the elements that have the class="collection-block-item" attribute.
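To make the effect of the selector concrete, here is a minimal sketch in plain JavaScript. It is only a toy model, not how Crawlee or the browser evaluates CSS selectors: links are plain objects, and the hrefs and the footer-link class are made up for illustration.

```javascript
// Hypothetical links found on a listing page. The hrefs and the
// "footer-link" class are invented for this example.
const links = [
    { href: '/collections/sales', classes: ['collection-block-item'] },
    { href: '/collections/tvs', classes: ['collection-block-item'] },
    { href: '/pages/about', classes: ['footer-link'] },
    { href: '/cart', classes: [] },
];

// Modelling the default selector 'a': every link is a candidate.
const allLinks = links.map((link) => link.href);

// Modelling a class selector: keep only elements that carry the class.
const selectByClass = (className) =>
    links
        .filter((link) => link.classes.includes(className))
        .map((link) => link.href);

const categoryLinks = selectByClass('collection-block-item');

console.log(`all links: ${allLinks.length}`);
console.log(`category links: ${categoryLinks.join(', ')}`);
```

In this model, only the two category links survive the filter, which is the narrowing that the selector parameter is meant to achieve on the real page.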
The label of enqueueLinks()
You will see label used often throughout Crawlee, as it's a convenient way of labelling a Request instance for quick identification later. You can access it with request.label and it's a string. You can name your requests any way you want. Here, we used the label CATEGORY to note that we're enqueueing pages that represent a category of products. The enqueueLinks() function will add this label to all requests before enqueueing them to the RequestQueue. Why this is useful will become obvious in a minute.
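One way to picture what a label makes possible is a lookup from label to handler. The sketch below is a toy model in plain JavaScript: the requests are plain objects rather than Crawlee Request instances, and the handlers map, defaultHandler and handle names are hypothetical.

```javascript
// Toy model: one handler per label. These are not Crawlee APIs.
const handlers = new Map([
    ['CATEGORY', (request) => `listing page: ${request.url}`],
    ['DETAIL', (request) => `product page: ${request.url}`],
]);

// Requests without a label (such as a start URL) use the default handler.
const defaultHandler = (request) => `start page: ${request.url}`;

function handle(request) {
    const handler = handlers.get(request.label) ?? defaultHandler;
    return handler(request);
}

console.log(handle({ url: '/collections' }));
console.log(handle({ url: '/collections/tvs', label: 'CATEGORY' }));
console.log(handle({ url: '/products/tv-a', label: 'DETAIL' }));
```

Because the label travels with the request, the code that processes a page does not need to inspect the URL to decide what kind of page it is.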
Crawling the detail pages
In a similar fashion, you need to collect all the URLs of the product detail pages, because only there can you scrape all the data you need. The following code only repeats the concepts you already know for another set of links.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        if (request.label === 'DETAIL') {
            // We're not doing anything with the details yet.
        } else if (request.label === 'CATEGORY') {
            // We are now on a category page. We can use this to paginate through and enqueue all products,
            // as well as any subsequent pages we find.
            await page.waitForSelector('.product-item > a');
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            // Now we need to find the "Next" button and enqueue the next page of results (if it exists).
            const nextButton = await page.$('a.pagination__next');
            if (nextButton) {
                await enqueueLinks({
                    selector: 'a.pagination__next',
                    label: 'CATEGORY', // <= note the same label
                });
            }
        } else {
            // This means we're on the start page, with no label.
            // On this page, we just want to enqueue all the category pages.
            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
The crawling code is now complete. When you run the code, you'll see the crawler visit all the listing URLs and all the detail URLs.
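To see why this handler reaches every listing and detail page, it can help to trace the same branching logic over a tiny in-memory site. The following is a sketch under stated assumptions, not Crawlee code: the site object, its paths, and the crawl and enqueue helpers are all hypothetical, and a simple Set stands in for the deduplication that a request queue performs.

```javascript
// Hypothetical site map: each URL lists the links its page exposes.
// None of these paths are taken from the real store.
const site = {
    '/collections': { categories: ['/collections/tvs'] },
    '/collections/tvs': {
        products: ['/products/tv-a', '/products/tv-b'],
        next: '/collections/tvs?page=2',
    },
    '/collections/tvs?page=2': { products: ['/products/tv-c'], next: null },
    '/products/tv-a': {},
    '/products/tv-b': {},
    '/products/tv-c': {},
};

function crawl(startUrl) {
    const queue = [{ url: startUrl, label: undefined }];
    const seen = new Set([startUrl]); // each URL is queued at most once
    const visited = [];

    // Stand-in for enqueueLinks(): queue unseen URLs with the given label.
    const enqueue = (urls, label) => {
        for (const url of urls) {
            if (seen.has(url)) continue;
            seen.add(url);
            queue.push({ url, label });
        }
    };

    while (queue.length > 0) {
        const request = queue.shift();
        const pageData = site[request.url];
        visited.push(`${request.label ?? 'START'} ${request.url}`);

        if (request.label === 'DETAIL') {
            // Product page: nothing further to enqueue.
        } else if (request.label === 'CATEGORY') {
            // Listing page: enqueue products, then the next page if present.
            enqueue(pageData.products, 'DETAIL');
            if (pageData.next) enqueue([pageData.next], 'CATEGORY');
        } else {
            // Start page: enqueue every category.
            enqueue(pageData.categories, 'CATEGORY');
        }
    }
    return visited;
}

const visited = crawl('/collections');
console.log(visited.join('\n'));
```

In this model, the second listing page is reached only through the "next" link of the first one, and it keeps the CATEGORY label, so its products are enqueued by the same branch.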
Next steps
This concludes the Crawling lesson, because you have taught the crawler to visit all the pages it needs. Let's continue with scraping data.