
Avoid getting blocked

A scraper might get blocked for numerous reasons. Let's narrow it down to two main ones. The first is a bad or blocked IP address. This topic is covered in the proxy management guide. The second reason, which this guide explores in more detail, is browser fingerprints (or signatures).

A browser fingerprint is a collection of browser attributes and other distinguishing features that can reveal whether a browser is driven by a bot or a real user. Moreover, these features are often unique enough for a website to track the same browser across different IP addresses. This is the main reason why scrapers should change browser fingerprints while doing browser-based scraping. Doing so should significantly reduce blocking.
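To make the tracking idea concrete, here is a minimal sketch of how a site could reduce a handful of browser attributes to one identifier. It is an illustration only: the `BrowserAttributes` shape, the `fingerprintId` helper and the choice of a 32-bit FNV-1a hash are assumptions made for this example, not how Crawlee or any particular website computes fingerprints. Real fingerprinting scripts combine far more signals.

```typescript
// Hypothetical subset of attributes a website could read from a browser.
interface BrowserAttributes {
    userAgent: string;
    platform: string;
    languages: string[];
    screen: { width: number; height: number };
    timezone: string;
}

// 32-bit FNV-1a hash over UTF-16 code units, used here only for brevity.
function fnv1a(input: string): string {
    let hash = 0x811c9dc5;
    for (let i = 0; i < input.length; i++) {
        hash ^= input.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash.toString(16);
}

// Join the attributes into one canonical string and hash it.
function fingerprintId(attrs: BrowserAttributes): string {
    const canonical = [
        attrs.userAgent,
        attrs.platform,
        attrs.languages.join(','),
        `${attrs.screen.width}x${attrs.screen.height}`,
        attrs.timezone,
    ].join('|');
    return fnv1a(canonical);
}

const baseAttrs: BrowserAttributes = {
    userAgent: 'Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0',
    platform: 'Linux x86_64',
    languages: ['en-US', 'en'],
    screen: { width: 1920, height: 1080 },
    timezone: 'Europe/Prague',
};

// Same browser attributes seen from two different IP addresses.
const visitA = { ip: '203.0.113.10', attrs: baseAttrs };
const visitB = { ip: '198.51.100.7', attrs: { ...baseAttrs } };
// A visit whose fingerprint was changed (different browser version).
const visitC = {
    ip: '203.0.113.10',
    attrs: { ...baseAttrs, userAgent: 'Mozilla/5.0 (X11; Linux x86_64) Chrome/121.0' },
};

const idA = fingerprintId(visitA.attrs);
const idB = fingerprintId(visitB.attrs);
const idC = fingerprintId(visitC.attrs);

console.log({ idA, idB, idC });
```

Visits A and B produce the same identifier even though their IP addresses differ, so rotating proxies alone does not hide the browser. Visit C, with one attribute changed, produces a different identifier, which is the effect that rotating fingerprints aims for.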

Using browser fingerprints

Changing browser fingerprints can be a tedious job. Luckily, Crawlee provides this feature with zero configuration: fingerprints are enabled by default in PlaywrightCrawler and PuppeteerCrawler. Whenever we build a scraper with one of these crawlers, fingerprints are generated for the default browser and operating system out of the box.

Customizing browser fingerprints

In certain cases we want to narrow down the fingerprints used, e.g. to a certain operating system, locale or browser. Crawlee supports this too: the fingerprint generation can be customized to reflect a particular browser version and more. Let's take a look at the example below:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // this is the default
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: [{
                    name: 'edge',
                    minVersion: 96,
                }],
                devices: [
                    'desktop',
                ],
                operatingSystems: [
                    'windows',
                ],
            },
        },
    },
    // ...
});
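The text above also mentions narrowing fingerprints by locale, which the example does not show. The following sketch builds the same kind of options as a plain object so it can be inspected on its own. It assumes the generator options accept a `locales` array of language tags next to `browsers`, `devices` and `operatingSystems`; that option name and the chosen values are assumptions to check against the fingerprint generator's documentation.

```typescript
// Plain options object mirroring the structure used in the example above.
// `locales` is an assumed option name; verify it before relying on it.
const browserPoolOptions = {
    useFingerprints: true,
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            browsers: [{ name: 'chrome' }],
            devices: ['desktop'],
            operatingSystems: ['macos'],
            locales: ['en-US', 'en'],
        },
    },
};

console.log(JSON.stringify(browserPoolOptions, null, 4));
```

If the assumption holds, this object would be passed as `browserPoolOptions` when constructing the crawler, in the same place as in the example above.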

Disabling browser fingerprints

Conversely, sometimes we want to disable browser fingerprints entirely. This is easy to do with Crawlee too. All we have to do is set the useFingerprints option of browserPoolOptions to false:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: false,
    },
    // ...
});

Related links