Proxy Management
IP address blocking is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy to use but powerful tools which can work around IP blocking. The most powerful weapon in our anti IP blocking arsenal is a proxy server.
With Crawlee we can use our own proxy servers or proxy servers acquired from third-party providers.
Check out the avoid blocking guide for more information about blocking.
Quick start
If we already have proxy URLs of our own, we can start using them immediately in only a few lines of code.
import { ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://proxy-1.com',
'http://proxy-2.com',
]
});
const proxyUrl = await proxyConfiguration.newUrl();
Examples of how to use our proxy URLs with crawlers are shown below in Crawler integration section.
Proxy Configuration
All our proxy needs are managed by the ProxyConfiguration
class. We create an instance using the ProxyConfiguration
constructor
function based on the provided options. See the ProxyConfigurationOptions
for all the possible constructor options.
Crawler integration
ProxyConfiguration
integrates seamlessly into HttpCrawler
, CheerioCrawler
, JSDOMCrawler
, PlaywrightCrawler
and PuppeteerCrawler
.
- HttpCrawler
- CheerioCrawler
- JSDOMCrawler
- PlaywrightCrawler
- PuppeteerCrawler
import { HttpCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});
const crawler = new HttpCrawler({
proxyConfiguration,
// ...
});
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});
const crawler = new CheerioCrawler({
proxyConfiguration,
// ...
});
import { JSDOMCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});
const crawler = new JSDOMCrawler({
proxyConfiguration,
// ...
});
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
// ...
});
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});
const crawler = new PuppeteerCrawler({
proxyConfiguration,
// ...
});
Our crawlers will now use the selected proxies for all connections.
IP Rotation and session management
proxyConfiguration.newUrl()
allows us to pass a sessionId
parameter. It will then be used to create a sessionId
-proxyUrl
pair, and subsequent newUrl()
calls with the same sessionId
will always return the same proxyUrl
. This is extremely useful in scraping, because we want to create the impression of a real user. See the session management guide and SessionPool
class for more information on how keeping a real session helps us avoid blocking.
When no sessionId
is provided, our proxy URLs are rotated round-robin.
- HttpCrawler
- CheerioCrawler
- JSDOMCrawler
- PlaywrightCrawler
- PuppeteerCrawler
- Standalone
import { HttpCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new HttpCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
proxyConfiguration,
// ...
});
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new CheerioCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
proxyConfiguration,
// ...
});
import { JSDOMCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new JSDOMCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
proxyConfiguration,
// ...
});
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new PlaywrightCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
proxyConfiguration,
// ...
});
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new PuppeteerCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
proxyConfiguration,
// ...
});
import { ProxyConfiguration, SessionPool } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const sessionPool = await SessionPool.open({
/* opts */
});
const session = await sessionPool.getSession();
const proxyUrl = await proxyConfiguration.newUrl(session.id);
Inspecting current proxy in Crawlers
HttpCrawler
, CheerioCrawler
, JSDOMCrawler
, PlaywrightCrawler
and PuppeteerCrawler
grant access to information about the currently used proxy
in their requestHandler
using a proxyInfo
object.
With the proxyInfo
object, we can easily access the proxy URL.
- HttpCrawler
- CheerioCrawler
- JSDOMCrawler
- PlaywrightCrawler
- PuppeteerCrawler
import { HttpCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new HttpCrawler({
proxyConfiguration,
async requestHandler({ proxyInfo }) {
console.log(proxyInfo);
},
// ...
});
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new CheerioCrawler({
proxyConfiguration,
async requestHandler({ proxyInfo }) {
console.log(proxyInfo);
},
// ...
});
import { JSDOMCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new JSDOMCrawler({
proxyConfiguration,
async requestHandler({ proxyInfo }) {
console.log(proxyInfo);
},
// ...
});
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
async requestHandler({ proxyInfo }) {
console.log(proxyInfo);
},
// ...
});
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
/* opts */
});
const crawler = new PuppeteerCrawler({
proxyConfiguration,
async requestHandler({ proxyInfo }) {
console.log(proxyInfo);
},
// ...
});