Skip to main content
Version: Next

Refactoring

It may seem that the data is extracted and the crawler is done, but honestly, this is just the beginning. For the sake of brevity, we've completely omitted error handling, proxies, logging, architecture, tests, documentation and other stuff that a reliable software should have. The good thing is, error handling is mostly done by Crawlee itself, so no worries on that front, unless you need some custom magic.

Navigating automatic bot-protextion avoidance

You might be wondering about the anti-blocking, bot-protection avoiding stealthy features and why we haven't highlighted them yet. The reason is straightforward: these features are automatically used within the default configuration, providing a smooth start without manual adjustments. However, the default configuration, while powerful, may not cover every scenario.

If you want to learn more, browse the Avoid getting blocked, Proxy management and Session management guides.

Anyway, to promote good coding practices, let's look at how you can use a Router to better structure your crawler code.

Routing

In the following code we've made several changes:

  • Split the code into multiple files.
  • Replaced console.log with the Crawlee logger for nicer, colourful logs.
  • Added a Router to make our routing cleaner, without if clauses.

In our main.mjs file, we place the general structure of the crawler:

src/main.mjs
import { PlaywrightCrawler, log } from 'crawlee';
import { router } from './routes.mjs';

// This is better set with CRAWLEE_LOG_LEVEL env var
// or a configuration option. This is just for show 😈
log.setLevel(log.LEVELS.DEBUG);

log.debug('Setting up crawler.');
const crawler = new PlaywrightCrawler({
// Instead of the long requestHandler with
// if clauses we provide a router instance.
requestHandler: router,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

Then in a separate routes.mjs file:

src/routes.mjs
import { createPlaywrightRouter, Dataset } from 'crawlee';

// createPlaywrightRouter() is only a helper to get better
// intellisense and typings. You can use Router.create() too.
export const router = createPlaywrightRouter();

// This replaces the request.label === DETAIL branch of the if clause.
router.addHandler('DETAIL', async ({ request, page, log }) => {
log.debug(`Extracting data: ${request.url}`);
const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

const title = await page.locator('.product-meta h1').textContent();
const sku = await page
.locator('span.product-meta__sku-number')
.textContent();

const priceElement = page
.locator('span.price')
.filter({
hasText: '$',
})
.first();

const currentPriceString = await priceElement.textContent();
const rawPrice = currentPriceString.split('$')[1];
const price = Number(rawPrice.replaceAll(',', ''));

const inStockElement = page
.locator('span.product-form__inventory')
.filter({
hasText: 'In stock',
})
.first();

const inStock = (await inStockElement.count()) > 0;

const results = {
url: request.url,
manufacturer,
title,
sku,
currentPrice: price,
availableInStock: inStock,
};

log.debug(`Saving data: ${request.url}`);
await Dataset.pushData(results);
});

router.addHandler('CATEGORY', async ({ page, enqueueLinks, request, log }) => {
log.debug(`Enqueueing pagination for: ${request.url}`);
// We are now on a category page. We can use this to paginate through and enqueue all products,
// as well as any subsequent pages we find

await page.waitForSelector('.product-item > a');
await enqueueLinks({
selector: '.product-item > a',
label: 'DETAIL', // <= note the different label
});

// Now we need to find the "Next" button and enqueue the next page of results (if it exists)
const nextButton = await page.$('a.pagination__next');
if (nextButton) {
await enqueueLinks({
selector: 'a.pagination__next',
label: 'CATEGORY', // <= note the same label
});
}
});

// This is a fallback route which will handle the start URL
// as well as the LIST labeled URLs.
router.addDefaultHandler(async ({ request, page, enqueueLinks, log }) => {
log.debug(`Enqueueing categories from page: ${request.url}`);
// This means we're on the start page, with no label.
// On this page, we just want to enqueue all the category pages.

await page.waitForSelector('.collection-block-item');
await enqueueLinks({
selector: '.collection-block-item',
label: 'CATEGORY',
});
});

Let's explore the changes in more detail. We believe these modification will enhance the readability and manageability of the crawler.

Splitting your code into multiple files

There's no reason not to split your code into multiple files and keep your logic separate. Less code in a single file means less code you need to think about at any time, and that's good. We would most likely go even further and split even the routes into separate files.

Using Crawlee log instead of console.log

We won't go to great lengths here to talk about log object from Crawlee, because you can read all about it in the documentation, but there's just one thing that we need to stress: log levels.

Crawlee log has multiple log levels, such as log.debug, log.info or log.warning. It not only makes your log more readable, but it also allows selective turning off of some levels by either calling the log.setLevel() function or by setting the CRAWLEE_LOG_LEVEL environment variable. Thanks to this you can add a lot of debug logs to your crawler without polluting your log when they're not needed, but ready to help when you encounter issues.

Using a router to structure your crawling

Initially, using a simple if/else statement for selecting different logic based on the crawled pages might appear more readable. However, this approach can become cumbersome with more than two types of pages, especially when the logic for each page extends over dozens or even hundreds of lines of code.

It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand line long requestHandler() where everything interacts with everything and variables can be used everywhere is not a beautiful thing to do and a pain to debug. That's why we prefer the separation of routes into their own files.

Next steps

In the next and final step, you'll see how to deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a Dockerfile ready, and the next section will show you how to deploy it to the Apify Platform with ease.