Scraping the Store

In the Real-world project chapter, you created a list of the information you wanted to collect about the products in the example Warehouse store. Let's review that list and figure out how to access the data.

URL
Manufacturer
SKU
Title
Current price
Stock available

Scraping the URL, Manufacturer and SKU

Some information is available without even touching the product detail pages. We already have the URL: it's request.url. Looking at it carefully, we can also extract the manufacturer from the URL, because all product URLs start with /products/<manufacturer>. We can split the string and be on our way.

request.loadedUrl vs request.url

You can use request.loadedUrl as well. Remember the difference: request.url is what you enqueue, request.loadedUrl is what gets processed (after possible redirects).

// request.url = https://warehouse-theme-metal.myshopify.com/products/sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440

const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'
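
The plain string split above assumes the URL carries no query string or trailing slash. A slightly more defensive variant, shown here only as a sketch, parses the URL with the standard URL class first. The helper name extractManufacturer and the ?variant=1 query string are made up for illustration and are not part of Crawlee or the store.

```javascript
// Hypothetical helper (not part of Crawlee): derive the manufacturer from a product URL.
function extractManufacturer(productUrl) {
    // The standard URL class separates the path from any query string or hash.
    const { pathname } = new URL(productUrl);
    // Drop empty segments so a trailing slash does not produce an empty slug.
    const segments = pathname.split('/').filter(Boolean);
    const slug = segments[segments.length - 1] ?? '';
    // Same assumption as above: the manufacturer is the first dash-separated word.
    return slug.split('-')[0];
}

const exampleManufacturer = extractManufacturer(
    'https://warehouse-theme-metal.myshopify.com/products/sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440?variant=1',
);
console.log(exampleManufacturer); // 'sennheiser'
```

Like the original split, this still returns only the first word of the slug, so it shares the same limitation for manufacturers with a dash in their name.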

Storing information

Whether to store this information separately in the resulting dataset is a matter of preference. Whoever uses the dataset can easily parse the manufacturer from the URL, so should you duplicate the data unnecessarily? Our opinion is that unless the increased data consumption would be too large to bear, it's better to make the dataset as rich as possible. For example, someone might want to filter by manufacturer.

Adapt and extract

One thing you may notice is that the manufacturer might have a - in its name. If that's the case, your best bet is extracting it from the details page instead, but it's not mandatory. At the end of the day, you should always adjust and pick the best solution for your use case and the website you are crawling.

Now it's time to add more data to the results. Let's open one of the product detail pages, for example the Sony XBR-950G page and use our DevTools-Fu 🥋 to figure out how to get the title of the product.

Title

By using the element selector tool, you can see that the title is in an <h1> tag, as titles should be. The <h1> tag is enclosed in a <div> with the class product-meta. We can leverage this to create a combined selector .product-meta h1. It selects any <h1> element that is a descendant of an element with the class product-meta.

Verifying selectors with DevTools

Remember that you can press CTRL+F (or CMD+F on Mac) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. Always verify your scraping process and assumptions using the DevTools. It's faster than changing the crawler code all the time.

To get the title, find it using Playwright and a .product-meta h1 locator. The locator selects the <h1> element you're looking for, and textContent() throws if the locator matches more than one element. That's good. It's usually better to crash the crawler than silently return bad data.

const title = await page.locator('.product-meta h1').textContent();
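
In the same spirit of failing loudly, note that textContent() can resolve to null, and that the text may come back empty or padded with whitespace. The following is a minimal sketch of a guard for that case. The helper name requireText is hypothetical and not a Crawlee or Playwright API.

```javascript
// Hypothetical helper: fail loudly when a scraped field is missing or blank.
function requireText(value, fieldName) {
    // textContent() can resolve to null, so guard before trimming.
    const text = (value ?? '').trim();
    if (text === '') {
        throw new Error(`Missing value for "${fieldName}"`);
    }
    return text;
}

// In the request handler, this could wrap the locator result, for example:
// const title = requireText(await page.locator('.product-meta h1').textContent(), 'title');
console.log(requireText('  Sony XBR-950G  ', 'title')); // 'Sony XBR-950G'
```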

SKU

Using the DevTools, you can find that the product SKU is inside a <span> tag with a class product-meta__sku-number. And since there's no other <span> with that class on the page, you can safely use it.

const sku = await page.locator('span.product-meta__sku-number').textContent();

Current price

DevTools can tell you that the currentPrice can be found in a <span> element with the price class. But it also shows that the price is nested as raw text alongside another <span> element with the visually-hidden class. You don't want that, so you need to filter it out, and the hasText helper can be used for that.

const priceElement = page
    .locator('span.price')
    .filter({
        hasText: '$',
    })
    .first();

const currentPriceString = await priceElement.textContent();
const rawPrice = currentPriceString.split('$')[1];
const price = Number(rawPrice.replaceAll(',', ''));

It might look a little too complex at first glance, but let's walk through what you did. First off, you find the right part of the price span (specifically the actual price) by filtering the element that has the $ sign in it. When you do that, you will get a string similar to Sale price$1,398.00. This, by itself, is not that useful, so you extract the actual numeric part by splitting by the $ sign.

Once you do that, you have a string that represents the price, but you want a number. You get it by removing all the commas from the string and then parsing the result with Number().
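
The string handling described above can be tried in isolation, without a browser. The sketch below wraps the same split and replace steps in a function and adds explicit failures for unexpected input. The name parsePrice and the error messages are illustrative assumptions, not part of the tutorial's crawler.

```javascript
// Hypothetical helper: turn a string such as 'Sale price$1,398.00' into a number.
function parsePrice(priceText) {
    // Everything after the first '$' is the numeric part, e.g. '1,398.00'.
    const rawPrice = priceText.split('$')[1];
    if (rawPrice === undefined || rawPrice.trim() === '') {
        throw new Error(`No price found after "$" in: ${priceText}`);
    }
    // Remove thousands separators so Number() can parse the value.
    const price = Number(rawPrice.replaceAll(',', '').trim());
    if (Number.isNaN(price)) {
        throw new Error(`Could not parse price from: ${priceText}`);
    }
    return price;
}

console.log(parsePrice('Sale price$1,398.00')); // 1398
```

Throwing on malformed input keeps NaN or 0 from slipping into the dataset unnoticed when the page layout changes.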

Stock available

The last property is availableInStock. There is a <span> with the product-form__inventory class, and it contains the text In stock. You can use the hasText helper again to filter out the right element.

const inStockElement = page
    .locator('span.product-form__inventory')
    .filter({
        hasText: 'In stock',
    })
    .first();

const inStock = (await inStockElement.count()) > 0;

For this, all that matters is whether the element exists or not, so you can use the count() method to check if there are any elements that match the selector. If there are, the product is in stock.

And there you have it! All the needed data. For the sake of completeness, let's add all the properties together, and you're good to go.

const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

const title = await page.locator('.product-meta h1').textContent();
const sku = await page.locator('span.product-meta__sku-number').textContent();

const priceElement = page
    .locator('span.price')
    .filter({
        hasText: '$',
    })
    .first();

const currentPriceString = await priceElement.textContent();
const rawPrice = currentPriceString.split('$')[1];
const price = Number(rawPrice.replaceAll(',', ''));

const inStockElement = page
    .locator('span.product-form__inventory')
    .filter({
        hasText: 'In stock',
    })
    .first();

const inStock = (await inStockElement.count()) > 0;

Trying it out

You have everything that is needed, so grab your newly created scraping logic, dump it into your original requestHandler() and see the magic happen!

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        if (request.label === 'DETAIL') {
            const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
            const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

            const title = await page.locator('.product-meta h1').textContent();
            const sku = await page.locator('span.product-meta__sku-number').textContent();

            const priceElement = page
                .locator('span.price')
                .filter({
                    hasText: '$',
                })
                .first();

            const currentPriceString = await priceElement.textContent();
            const rawPrice = currentPriceString?.split('$')[1];
            const price = Number(rawPrice?.replaceAll(',', ''));

            const inStockElement = page
                .locator('span.product-form__inventory')
                .filter({
                    hasText: 'In stock',
                })
                .first();

            const inStock = (await inStockElement.count()) > 0;

            const results = {
                url: request.url,
                manufacturer,
                title,
                sku,
                currentPrice: price,
                availableInStock: inStock,
            };

            console.log(results);
        } else if (request.label === 'CATEGORY') {
            // We are now on a category page. We can use this to paginate through and enqueue all products,
            // as well as any subsequent pages we find

            await page.waitForSelector('.product-item > a');
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            // Now we need to find the "Next" button and enqueue the next page of results (if it exists)
            const nextButton = await page.$('a.pagination__next');
            if (nextButton) {
                await enqueueLinks({
                    selector: 'a.pagination__next',
                    label: 'CATEGORY', // <= note the same label
                });
            }
        } else {
            // This means we're on the start page, with no label.
            // On this page, we just want to enqueue all the category pages.

            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },

    // Let's limit our crawls to make our tests shorter and safer.
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

When you run the crawler, you will see the crawled URLs and their scraped data printed to the console. The output will look something like this:

{
    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-str-za810es-7-2-channel-hi-res-wi-fi-network-av-receiver",
    "manufacturer": "sony",
    "title": "Sony STR-ZA810ES 7.2-Ch Hi-Res Wi-Fi Network A/V Receiver",
    "sku": "SON-692802-STR-DE",
    "currentPrice": 698,
    "availableInStock": true
}

Next steps

Next, you'll see how to save the data you scraped to the disk for further processing.
