Refactoring
It may seem that the data is extracted and the crawler is done, but honestly, this is just the beginning. For the sake of brevity, we've completely omitted error handling, proxies, logging, architecture, tests, documentation and other things that reliable software should have. The good news is that error handling is mostly done by Crawlee itself, so no worries on that front, unless you need some custom magic.
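If you ever do need that custom magic, here's a minimal sketch of where it could plug in. The option names come from the Crawlee docs; the retry count and the handler bodies are purely illustrative:

```js
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Crawlee already retries failed requests; this only tweaks how many times.
    maxRequestRetries: 5,
    async requestHandler({ request }) {
        // ...your scraping logic...
    },
    // Called once a request has exhausted all its retries, a good place
    // for custom reporting or cleanup.
    failedRequestHandler({ request }, error) {
        log.error(`Request ${request.url} failed too many times: ${error.message}`);
    },
});
```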
If you've got this far and are wondering where all the stealthy anti-blocking and bot-protection-evading features are, you're right: we haven't shown you any. And that's the point! They are applied automatically with the default configuration.
That does not mean the default configuration can handle everything, but it should get you far enough. If you want to learn more, browse the Avoid getting blocked, Proxy management and Session management guides.
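For a taste of what those guides cover, here's a rough sketch of plugging your own proxies and a session pool into a crawler. The proxy URLs below are placeholders, not real servers:

```js
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder URLs, substitute the ones from your proxy provider.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // The session pool keeps cookies and proxies paired up and rotates away
    // from sessions that start getting blocked.
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```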
Anyway, to promote good coding practices, let's look at how you can use a Router to better structure your crawler code.
Routing
In the following code we've made several changes:
- Split the code into multiple files.
- Replaced console.log with the Crawlee logger for nicer, colourful logs.
- Added a Router to make our routing cleaner, without if clauses.
In our main.mjs file, we place the general structure of the crawler:
import { PlaywrightCrawler, log } from 'crawlee';
import { router } from './routes.mjs';
// This is better set with CRAWLEE_LOG_LEVEL env var
// or a configuration option. This is just for show.
log.setLevel(log.LEVELS.DEBUG);
log.debug('Setting up crawler.');
const crawler = new PlaywrightCrawler({
    // Instead of the long requestHandler with
    // if clauses we provide a router instance.
    requestHandler: router,
});
await crawler.run(['https://apify.com/store']);
Then in a separate routes.mjs file:
import { createPlaywrightRouter, Dataset } from 'crawlee';
// createPlaywrightRouter() is only a helper to get better
// intellisense and typings. You can use Router.create() too.
export const router = createPlaywrightRouter();
// This replaces the request.label === DETAIL branch of the if clause.
router.addHandler('DETAIL', async ({ request, page, log }) => {
    log.debug(`Extracting data: ${request.url}`);
    const urlParts = request.url.split('/').slice(-2);
    const modifiedTimestamp = await page.locator('time[datetime]').getAttribute('datetime');
    const runsRow = page.locator('ul.ActorHeader-userMedallion > li').filter({ hasText: 'Runs' });
    const runCountString = await runsRow.textContent();
    const results = {
        url: request.url,
        uniqueIdentifier: urlParts.join('/'),
        owner: urlParts[0],
        title: await page.locator('.ActorHeader-identificator h1').textContent(),
        description: await page.locator('p.ActorHeader-description').textContent(),
        modifiedDate: new Date(Number(modifiedTimestamp)),
        runCount: runCountString.replace('Runs ', ''),
    };
    log.debug(`Saving data: ${request.url}`);
    await Dataset.pushData(results);
});
// This is a fallback route which will handle the start URL
// as well as the LIST labeled URLs.
router.addDefaultHandler(async ({ request, page, enqueueLinks, log }) => {
    log.debug(`Enqueueing pagination: ${request.url}`);
    await page.waitForSelector('.ActorStorePagination-buttons a');
    await enqueueLinks({
        selector: '.ActorStorePagination-buttons a',
        label: 'LIST',
    });
    log.debug(`Enqueueing actor details: ${request.url}`);
    await page.waitForSelector('div[data-test="actorCard"] a');
    await enqueueLinks({
        selector: 'div[data-test="actorCard"] a',
        label: 'DETAIL', // <= note the different label
    });
});
Let's describe the changes a bit more in detail. We hope that in the end, you will agree that this structure makes the crawler more readable and manageable.
Splitting your code into multiple files
There's no reason not to split your code into multiple files and keep your logic separate. Less code in a single file means less code you need to think about at any time, and that's good. We would most likely go even further and split even the routes into separate files.
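As a sketch of what that further split could look like (the file layout below is just one possible arrangement, not something the CLI generates for you), each handler gets its own module and routes.mjs only wires them together:

```js
// routes/detail.mjs (hypothetical file) - exports just the DETAIL handler.
export async function detailHandler({ request, page, log }) {
    log.debug(`Extracting data: ${request.url}`);
    // ...the same extraction logic shown in the DETAIL handler above...
}
```

```js
// routes.mjs - now only maps labels to handlers.
import { createPlaywrightRouter } from 'crawlee';
import { detailHandler } from './routes/detail.mjs';

export const router = createPlaywrightRouter();
router.addHandler('DETAIL', detailHandler);
```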
Using Crawlee log instead of console.log
We won't go to great lengths here talking about the log object from Crawlee, because you can read all about it in the documentation, but there's one thing we need to stress: log levels.
Crawlee's log has multiple log levels, such as log.debug, log.info or log.warning. Using them not only makes your logs more readable, it also lets you selectively turn some levels off, either by calling the log.setLevel() function or by setting the CRAWLEE_LOG_LEVEL environment variable. Thanks to this, you can add a lot of debug logs to your crawler without polluting your output when they're not needed, while keeping them ready to help when you run into issues.
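To make the idea concrete, here's a tiny sketch (the messages are made up; the levels and the environment variable are the documented ones):

```js
import { log } from 'crawlee';

// Anything below the configured level is simply dropped from the output.
log.debug('Fine-grained details, hidden unless you opt in.');
log.info('Normal progress messages.');
log.warning('Something looks off, but the crawl continues.');
log.error('Something actually failed.');

// Either of these turns the debug messages on:
log.setLevel(log.LEVELS.DEBUG);
// ...or run the crawler with CRAWLEE_LOG_LEVEL=DEBUG set in the environment.
```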
Using a router to structure your crawling
At first, using a simple if / else statement to select different logic based on the crawled pages might seem more readable, but trust us, it becomes far less appealing when you're working with more than two different kinds of pages, and it definitely starts to fall apart when the logic for each page spans tens or hundreds of lines of code.
It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand-line requestHandler() where everything interacts with everything and variables can be used anywhere is not pretty, and it's a pain to debug. That's why we prefer the separation of routes into their own files.
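The payoff is that supporting a new page type becomes additive: you register one more handler in routes.mjs instead of growing an if / else chain. The CATEGORY label and the selector below are hypothetical, just to show the shape:

```js
// Appended to routes.mjs - the existing handlers stay untouched.
router.addHandler('CATEGORY', async ({ request, enqueueLinks, log }) => {
    log.debug(`Enqueueing items from category: ${request.url}`);
    await enqueueLinks({
        selector: 'a.item-card', // hypothetical selector
        label: 'DETAIL',
    });
});
```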
Learning more about web scraping
If you want to learn more about web scraping and browser automation, check out the Apify Academy. It's full of courses and tutorials on the topic, from beginner to advanced. And the best thing: it's free and open source ❤️
Running your crawler in the Cloud
Now that you have your crawler ready, it's the right time to think about where you want to run it. If you used the CLI to bootstrap your project, you already have a Dockerfile ready. To read more about how to run this Dockerfile in the cloud, check out the Apify Platform guide.
Thank you!
That's it! Thanks for reading the whole introduction, and if there's anything wrong, please let us know on GitHub or in our Discord community. Happy scraping!