Version: 3.10

PuppeteerCrawlerOptions

Properties

optional autoscaledPoolOptions

autoscaledPoolOptions?: AutoscaledPoolOptions

Custom options passed to the underlying AutoscaledPool constructor.

NOTE: The runTaskFunction and isTaskReadyFunction options are provided by the crawler and cannot be overridden. However, we can provide a custom implementation of isFinishedFunction.
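As a minimal sketch of the note above, the options object below supplies a custom isFinishedFunction and leaves runTaskFunction and isTaskReadyFunction to the crawler. The pendingJobs counter is hypothetical, and the comment about when the pool calls the function is an assumption about AutoscaledPool rather than something stated on this page.

```javascript
// Hypothetical counter of external jobs that may still add requests.
let pendingJobs = 2;

const crawlerOptions = {
    autoscaledPoolOptions: {
        // Assumption: the pool consults this once no tasks remain and
        // finishes when the returned promise resolves to true.
        isFinishedFunction: async () => pendingJobs === 0,
    },
};

// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```

With this shape, the crawl would keep waiting until the external work counted by pendingJobs has completed.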

optional browserPoolOptions

browserPoolOptions?: Partial<BrowserPoolOptions<BrowserPlugin<CommonLibrary, undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<BrowserPoolHooks<BrowserController<PuppeteerNode, PuppeteerLaunchOptions, Browser, PuppeteerNewPageOptions, Page>, LaunchContext<PuppeteerNode, PuppeteerLaunchOptions, Browser, PuppeteerNewPageOptions, Page>, Page>>

Custom options passed to the underlying BrowserPool constructor. We can tweak those to fine-tune browser management.
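To illustrate the kind of fine-tuning this option allows, here is a hedged sketch of a browserPoolOptions object. The numeric values are arbitrary, and the hook shown assumes BrowserPool's postPageCreateHooks receives the newly created page as its first argument.

```javascript
const crawlerOptions = {
    browserPoolOptions: {
        // Arbitrary illustrative values, not recommendations.
        maxOpenPagesPerBrowser: 5,
        retireBrowserAfterPageCount: 50,
        // Hook run after each page is created; sets a fixed viewport.
        postPageCreateHooks: [
            async (page) => {
                await page.setViewport({ width: 1280, height: 720 });
            },
        ],
    },
};

// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```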

optional errorHandler

errorHandler?: BrowserErrorHandler<PuppeteerCrawlingContext<Dictionary>>

User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for requests that have failed fewer than maxRequestRetries times.

The function receives the BrowserCrawlingContext (the actual context will be enhanced with the crawler-specific properties) as the first argument, where the request corresponds to the request to be retried. The second argument is the Error instance that represents the last error thrown during processing of the request.
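A minimal sketch of such a handler follows. It assumes the context exposes request and log, and it uses the hypothetical userData key lastErrorMessage to carry information into the next attempt.

```javascript
// Sketch of an errorHandler: runs before a retry and may modify the request.
const errorHandler = async ({ request, log }, error) => {
    // Record why the previous attempt failed; lastErrorMessage is a made-up key.
    request.userData.lastErrorMessage = error.message;
    log.warning(`Retrying ${request.url}: ${error.message}`);
};

const crawlerOptions = { errorHandler, maxRequestRetries: 3 };

// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```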

optional experiments

experiments?: CrawlerExperiments

Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time.

optional failedRequestHandler

failedRequestHandler?: BrowserErrorHandler<PuppeteerCrawlingContext<Dictionary>>

A function to handle requests that failed more than maxRequestRetries times.

The function receives the BrowserCrawlingContext (the actual context will be enhanced with the crawler-specific properties) as the first argument, where the request corresponds to the failed request. The second argument is the Error instance that represents the last error thrown during processing of the request.
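As a hedged illustration, the handler below collects permanently failed requests in memory so they can be inspected after the crawl. The failedUrls array is hypothetical; a real crawler might store these records elsewhere.

```javascript
// Hypothetical in-memory record of requests that exhausted their retries.
const failedUrls = [];

const failedRequestHandler = async ({ request, log }, error) => {
    failedUrls.push({
        url: request.url,
        retryCount: request.retryCount,
        message: error.message,
    });
    log.error(`Request ${request.url} failed permanently: ${error.message}`);
};

const crawlerOptions = { failedRequestHandler };

// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```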

optional headless

headless?: boolean | "new" | "old"

Whether to run the browser in headless mode. Defaults to true. Can also be set via Configuration.

optional ignoreShadowRoots

ignoreShadowRoots?: boolean

Whether to ignore custom elements (and their #shadow-roots) when processing the page content via parseWithCheerio helper. By default, they are expanded automatically. Use this option to disable this behavior.

optional keepAlive

keepAlive?: boolean

Keeps the crawler alive even if the RequestQueue becomes empty. By default, crawler.run() resolves once the queue is empty. With keepAlive: true, it keeps running and waits for more requests to arrive. Use crawler.teardown() to exit the crawler.

optional launchContext

launchContext?: PuppeteerLaunchContext

Options used by launchPuppeteer to start new Puppeteer instances.
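A small configuration sketch of what a launch context might look like. The Chromium flags are illustrative only, and the useChrome and launchOptions fields are assumed from PuppeteerLaunchContext rather than listed on this page.

```javascript
const crawlerOptions = {
    headless: true,
    launchContext: {
        // Assumed field: false keeps the Chromium build bundled with Puppeteer.
        useChrome: false,
        // Forwarded to Puppeteer's launch call; flags are examples, not recommendations.
        launchOptions: {
            args: ['--no-sandbox', '--disable-setuid-sandbox'],
        },
    },
};

// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```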

optional maxConcurrency

maxConcurrency?: number

Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool maxConcurrency option.

optional maxRequestRetries

maxRequestRetries?: number = 3

Indicates how many times the request is retried if requestHandler fails.

optional maxRequestsPerCrawl

maxRequestsPerCrawl?: number

Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers.

NOTE: In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.

optional maxRequestsPerMinute

maxRequestsPerMinute?: number

The maximum number of requests per minute the crawler should run. By default, this is set to Infinity, but we can pass any positive integer. Shortcut for the AutoscaledPool maxTasksPerMinute option.

optional maxSessionRotations

maxSessionRotations?: number = 10

Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website.

The session rotations are not counted towards the maxRequestRetries limit.

optional minConcurrency

minConcurrency?: number

Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool minConcurrency option.

WARNING: If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If unsure, it's better to keep the default value and let the concurrency scale up automatically.
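The concurrency and rate options can be combined in one options object. The sketch below uses arbitrary values, and checkLimits is a hypothetical helper (not part of Crawlee) that catches obviously inconsistent settings before the crawler is constructed; its fallback values are placeholders for this helper only.

```javascript
const crawlerOptions = {
    minConcurrency: 1,
    maxConcurrency: 10,
    maxRequestsPerMinute: 120,
    maxRequestsPerCrawl: 500,
};

// Hypothetical sanity check; the defaults here are not Crawlee's defaults.
function checkLimits({ minConcurrency = 1, maxConcurrency = Infinity, maxRequestsPerMinute = Infinity }) {
    if (minConcurrency > maxConcurrency) {
        throw new RangeError('minConcurrency must not exceed maxConcurrency');
    }
    const rateIsValid = maxRequestsPerMinute === Infinity
        || (Number.isInteger(maxRequestsPerMinute) && maxRequestsPerMinute > 0);
    if (!rateIsValid) {
        throw new RangeError('maxRequestsPerMinute must be a positive integer or Infinity');
    }
    return true;
}

checkLimits(crawlerOptions);
// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```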

optional navigationTimeoutSecs

navigationTimeoutSecs?: number

Timeout in which page navigation needs to finish, in seconds.

optional persistCookiesPerSession

persistCookiesPerSession?: boolean

Defines whether the cookies should be persisted for sessions. This can only be used when useSessionPool is set to true.

optional postNavigationHooks

postNavigationHooks?: PuppeteerHook[]

Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts crawlingContext as the only parameter. Example:

postNavigationHooks: [
    async (crawlingContext) => {
        const { page } = crawlingContext;
        if (hasCaptcha(page)) {
            await solveCaptcha(page);
        }
    },
]

optional preNavigationHooks

preNavigationHooks?: PuppeteerHook[]

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, crawlingContext and gotoOptions, which are passed to the page.goto() function the crawler calls to navigate. Example:

preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        const { page } = crawlingContext;
        await page.evaluate((attr) => { window.foo = attr; }, 'bar');
    },
]

Modifying pageOptions is supported only in Playwright incognito. See PrePageCreateHook.
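Because gotoOptions is passed on to page.goto(), a hook can also change how the navigation itself behaves. The following sketch assumes Puppeteer's waitUntil option and its setExtraHTTPHeaders page method; the header value is only an example.

```javascript
const preNavigationHooks = [
    async (crawlingContext, gotoOptions) => {
        const { page } = crawlingContext;
        // Send an extra header with every request made by this page.
        await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US' });
        // Mutating gotoOptions changes what the crawler passes to page.goto().
        gotoOptions.waitUntil = 'domcontentloaded';
    },
];

const crawlerOptions = { preNavigationHooks };

// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```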

optional proxyConfiguration

proxyConfiguration?: ProxyConfiguration

If set, the crawler routes all connections through the provided proxy URLs, rotating them according to the configuration.

optional requestHandler

requestHandler?: BrowserRequestHandler<PuppeteerCrawlingContext<Dictionary>>

Function that is called to process each request.

The function receives the BrowserCrawlingContext (the actual context will be enhanced with the crawler-specific properties) as an argument, where:

The function must return a promise, which is then awaited by the crawler.

If the function throws an exception, the crawler will try to re-crawl the request later, up to maxRequestRetries times. If all the retries fail, the crawler calls the function provided to the failedRequestHandler parameter. To make this work, we should always let our function throw exceptions rather than catch them. The exceptions are logged to the request using the Request.pushErrorMessage() function.
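The retry behavior described above depends on errors propagating out of the handler. The sketch below assumes the context provides page, request and log; the results array and the empty-title check are hypothetical and serve only to show where an exception would be thrown.

```javascript
// Hypothetical in-memory store for scraped data.
const results = [];

const requestHandler = async ({ page, request, log }) => {
    const title = await page.title();
    if (!title) {
        // Throwing (rather than swallowing the problem) lets the crawler retry.
        throw new Error(`Empty title on ${request.url}`);
    }
    results.push({ url: request.url, title });
    log.info(`${request.url}: ${title}`);
};

const crawlerOptions = { requestHandler, maxRequestRetries: 3 };

// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```

Wrapping the body in a try/catch that only logs the error would hide the failure from the crawler, so the request would never be retried.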

optional requestHandlerTimeoutSecs

requestHandlerTimeoutSecs?: number = 60

Timeout in which the function passed as requestHandler needs to finish, in seconds.

optional requestList

requestList?: RequestList

Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.

Alternatively, the requests parameter of crawler.run() can be used to enqueue the initial requests. It is a shortcut for calling crawler.addRequests() before crawler.run().

optional requestQueue

requestQueue?: RequestProvider

Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.

Alternatively, the requests parameter of crawler.run() can be used to enqueue the initial requests. It is a shortcut for calling crawler.addRequests() before crawler.run().

optional retryOnBlocked

retryOnBlocked?: boolean

If set to true, the crawler will automatically try to bypass any detected bot protection.

Currently supports:

optional sameDomainDelaySecs

sameDomainDelaySecs?: number = 0

Indicates how long (in seconds) to wait before crawling another request to the same domain.

optional sessionPoolOptions

sessionPoolOptions?: SessionPoolOptions

The configuration options for SessionPool to use.

optional statisticsOptions

statisticsOptions?: StatisticsOptions

Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store.

optional statusMessageCallback

statusMessageCallback?: StatusMessageCallback<BasicCrawlingContext<Dictionary>, BasicCrawler<BasicCrawlingContext<Dictionary>>>

Allows overriding the default status message. The callback needs to call crawler.setStatusMessage() explicitly. The default status message is provided in the parameters.

const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});

optional statusMessageLoggingInterval

statusMessageLoggingInterval?: number

Defines the interval, in seconds, at which setStatusMessage is called.

optional useSessionPool

useSessionPool?: boolean

The crawler will initialize the SessionPool with the corresponding sessionPoolOptions. The session instance will then be available in the requestHandler.
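A hedged sketch of how the session-related options might fit together. The maxPoolSize value is arbitrary, the "access denied" title check is a made-up blocking heuristic, and session.retire() is assumed from the Session class rather than described on this page.

```javascript
const crawlerOptions = {
    useSessionPool: true,
    // Only meaningful together with useSessionPool: true.
    persistCookiesPerSession: true,
    sessionPoolOptions: { maxPoolSize: 20 },
    requestHandler: async ({ page, session }) => {
        const title = await page.title();
        // Hypothetical heuristic for detecting a block page.
        if (/access denied/i.test(title)) {
            session.retire(); // stop using this session for further requests
            throw new Error('Blocked, retrying with another session');
        }
    },
};

// Intended use (not executed here): new PuppeteerCrawler(crawlerOptions)
```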