Skip to main content
Version: Next

Session Management

‚ÄčSessionPool is a class that allows us to handle the rotation of proxy IP addresses along with cookies and other custom settings in Crawlee.

The main benefit of using Session pool is that we can filter out blocked or non-working proxies, so our actor does not retry requests over known blocked/non-working proxies. Another benefit of using SessionPool is that we can store information tied tightly to an IP address, such as cookies, auth tokens, and particular headers. Having our cookies and other identifiers used only with a specific IP will reduce the chance of being blocked. The last but not least benefit is the even rotation of IP addresses - SessionPool picks the session randomly, which should prevent burning out a small pool of available IPs.

Now let's take a look at the examples of how to use Session pool:

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({ /* opts */ });

const crawler = new CheerioCrawler({
// To use the proxy IP session rotation logic, you must turn the proxy usage on.
proxyConfiguration,
// Activates the Session pool (default is true).
useSessionPool: true,
// Overrides default Session pool configuration.
sessionPoolOptions: { maxPoolSize: 100 },
// Set to true if you want the crawler to save cookies per session,
// and set the cookie header to request automatically (default is true).
persistCookiesPerSession: true,
async requestHandler({ session, $ }) {
const title = $('title').text();

if (title === 'Blocked') {
session.retire();
} else if (title === 'Not sure if blocked, might also be a connection error') {
session.markBad();
} else {
// session.markGood() - this step is done automatically in BasicCrawler.
}
},
});

These are the basics of configuring SessionPool. Please, bear in mind that a Session pool needs time to find working IPs and build up the pool, so we will probably see a lot of errors until it becomes stabilized.