Version: 3.17

AdaptivePlaywrightCrawlerOptions

Hierarchy

Omit<PlaywrightCrawlerOptions, 'requestHandler' | 'handlePageFunction' | 'preNavigationHooks' | 'postNavigationHooks'>
- AdaptivePlaywrightCrawlerOptions

Properties

optional inherited autoscaledPoolOptions

autoscaledPoolOptions?: AutoscaledPoolOptions

Custom options passed to the underlying AutoscaledPool constructor.

NOTE: The runTaskFunction option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of isFinishedFunction and isTaskReadyFunction.

optional inherited browserPoolOptions

browserPoolOptions?: Partial<BrowserPoolOptions<BrowserPlugin<CommonLibrary, undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<BrowserPoolHooks<BrowserController<BrowserType<{}>, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: ... | ...; certPath?: ... | ...; key?: ... | ...; keyPath?: ... | ...; origin: string; passphrase?: ... | ...; pfx?: ... | ...; pfxPath?: ... | ... }[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir?: string; showActions?: { duration?: ...; fontSize?: ...; position?: ... }; size?: { height: ...; width: ... } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }[]; origins: { localStorage: ...; origin: ... 
}[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; viewport?: null | { height: number; width: number } }, Page>, LaunchContext<BrowserType<{}>, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: ... | ...; certPath?: ... | ...; key?: ... | ...; keyPath?: ... | ...; origin: string; passphrase?: ... | ...; pfx?: ... | ...; pfxPath?: ... | ... }[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir?: string; showActions?: { duration?: ...; fontSize?: ...; position?: ... }; size?: { height: ...; width: ... } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }[]; origins: { localStorage: ...; origin: ... }[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; viewport?: null | { height: number; width: number } }, Page>, Page>>

Custom options passed to the underlying BrowserPool constructor. We can tweak those to fine-tune browser management.

optional inherited errorHandler

errorHandler?: BrowserErrorHandler<PlaywrightCrawlingContext<Dictionary>>

User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for requests that have failed fewer than maxRequestRetries times.

The function receives the BrowserCrawlingContext (actual context will be enhanced with the crawler specific properties) as the first argument, where the request corresponds to the request to be retried. Second argument is the Error instance that represents the last error thrown during processing of the request.
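The following is a minimal sketch of an errorHandler that annotates the request before it is retried. The SketchRequest and SketchContext interfaces are local stand-ins introduced for illustration only (the real crawling context carries many more properties), and the lastErrorMessage key is a hypothetical name.

```typescript
// Local stand-ins for the Crawlee types (assumptions for this sketch).
interface SketchRequest {
    url: string;
    retryCount: number;
    userData: Record<string, unknown>;
}

interface SketchContext {
    request: SketchRequest;
}

// Runs before each retry: store the reason for the last failure on the request,
// so that the next attempt can inspect it. `lastErrorMessage` is a made-up key.
const errorHandler = async ({ request }: SketchContext, error: Error): Promise<void> => {
    request.userData.lastErrorMessage = error.message;
};
```

A function of this shape would be passed as the errorHandler option when constructing the crawler.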

optional inherited experiments

experiments?: CrawlerExperiments

Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time.

optional inherited failedRequestHandler

failedRequestHandler?: BrowserErrorHandler<PlaywrightCrawlingContext<Dictionary>>

A function to handle requests that failed more than option.maxRequestRetries times.

The function receives the BrowserCrawlingContext (actual context will be enhanced with the crawler specific properties) as the first argument, where the request corresponds to the failed request. Second argument is the Error instance that represents the last error thrown during processing of the request.
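As an illustration, a failedRequestHandler could collect the requests that were given up on. This sketch uses local stand-in types (FailedRequestSketch, FailedContextSketch) and a hypothetical in-memory failures array; none of these names come from Crawlee.

```typescript
// Local stand-ins for the Crawlee types (assumptions for this sketch).
interface FailedRequestSketch {
    url: string;
    retryCount: number;
}

interface FailedContextSketch {
    request: FailedRequestSketch;
}

// Hypothetical in-memory report of requests that exhausted their retries.
const failures: { url: string; retries: number; reason: string }[] = [];

// Called once per request after the retry limit has been exceeded.
const failedRequestHandler = async ({ request }: FailedContextSketch, error: Error): Promise<void> => {
    failures.push({ url: request.url, retries: request.retryCount, reason: error.message });
};
```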

optional inherited headless

headless?: boolean | 'new' | 'old'

Whether to run browser in headless mode. Defaults to true. Can be also set via Configuration.

optional inherited httpClient

httpClient?: BaseHttpClient

HTTP client implementation for the sendRequest context helper and for plain HTTP crawling. Defaults to a new instance of GotScrapingHttpClient.

optional inherited ignoreIframes

ignoreIframes?: boolean

Whether to ignore iframes when processing the page content via parseWithCheerio helper. By default, iframes are expanded automatically. Use this option to disable this behavior.

optional inherited ignoreShadowRoots

ignoreShadowRoots?: boolean

Whether to ignore custom elements (and their #shadow-roots) when processing the page content via parseWithCheerio helper. By default, they are expanded automatically. Use this option to disable this behavior.

optional inherited keepAlive

keepAlive?: boolean

Keeps the crawler alive even if the RequestQueue becomes empty. By default, crawler.run() resolves once the queue is empty. With keepAlive: true it keeps running and waits for more requests to come. Use crawler.stop() to exit the crawler gracefully, or crawler.teardown() to stop it immediately.

optional inherited launchContext

launchContext?: PlaywrightLaunchContext

The same options as used by launchPlaywright.

optional inherited maxConcurrency

maxConcurrency?: number

Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool maxConcurrency option.

optional inherited maxCrawlDepth

maxCrawlDepth?: number

Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to 0 will only process the initial requests, skipping all links enqueued by crawlingContext.enqueueLinks and crawlingContext.addRequests. Passing 1 will process the initial requests and all links enqueued by crawlingContext.enqueueLinks and crawlingContext.addRequests in the handler for initial requests.
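The depth semantics can be illustrated with a small simulation. This is a sketch of the rule described above, not Crawlee's implementation: the LinkGraph type, the urlsProcessed function, and the sample site are all hypothetical.

```typescript
// Hypothetical link graph: each URL maps to the links its handler would enqueue.
type LinkGraph = Record<string, string[]>;

// Returns the URLs that get processed for a given depth limit.
// Initial requests have depth 0; links enqueued from depth d have depth d + 1.
function urlsProcessed(graph: LinkGraph, startUrls: string[], maxCrawlDepth: number): string[] {
    const seen = new Set<string>(startUrls);
    const processed: string[] = [];
    let frontier = [...startUrls];
    for (let depth = 0; depth <= maxCrawlDepth && frontier.length > 0; depth++) {
        const next: string[] = [];
        for (const url of frontier) {
            processed.push(url);
            for (const link of graph[url] ?? []) {
                if (!seen.has(link)) {
                    seen.add(link);
                    next.push(link);
                }
            }
        }
        frontier = next;
    }
    return processed;
}

const site: LinkGraph = { '/': ['/a', '/b'], '/a': ['/a/1'], '/b': [], '/a/1': [] };

const depth0 = urlsProcessed(site, ['/'], 0); // only the initial request
const depth1 = urlsProcessed(site, ['/'], 1); // initial request plus its direct links
const unlimited = urlsProcessed(site, ['/'], Infinity); // models an unset limit
```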

optional inherited maxRequestRetries

maxRequestRetries?: number = 3

Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (requestHandler, preNavigationHooks, postNavigationHooks).

This limit does not apply to retries triggered by session rotation (see maxSessionRotations).

optional inherited maxRequestsPerCrawl

maxRequestsPerCrawl?: number

Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers.

NOTE: In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.

optional inherited maxRequestsPerMinute

maxRequestsPerMinute?: number

The maximum number of requests per minute the crawler should run. By default, this is set to Infinity, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool maxTasksPerMinute option.
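When choosing a value, it can help to estimate how long a crawl will take at a given rate limit. The helper below is a rough back-of-the-envelope sketch with a hypothetical name; it ignores concurrency, retries, and page load times, so it only gives an approximate lower bound.

```typescript
// Approximate lower bound on crawl duration implied by the rate limit alone.
function minimumCrawlMinutes(totalRequests: number, maxRequestsPerMinute: number): number {
    if (!(maxRequestsPerMinute > 0)) {
        throw new RangeError('maxRequestsPerMinute must be a positive number');
    }
    return totalRequests / maxRequestsPerMinute;
}

const limited = minimumCrawlMinutes(600, 120); // 600 requests at 120 per minute
const unlimitedRate = minimumCrawlMinutes(600, Infinity); // the default imposes no bound
```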

optional inherited maxSessionRotations

maxSessionRotations?: number = 10

Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website.

The session rotations are not counted towards the maxRequestRetries limit.

optional inherited minConcurrency

minConcurrency?: number

Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool minConcurrency option.

WARNING: If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If unsure, it's better to keep the default value and let the concurrency scale up automatically.

optional inherited navigationTimeoutSecs

navigationTimeoutSecs?: number

Timeout in which page navigation needs to finish, in seconds.

optional inherited onSkippedRequest

onSkippedRequest?: SkippedRequestCallback

When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped:

- based on the robots.txt file,
- because they don't match enqueueLinks filters,
- because they are redirected to a URL that doesn't match the enqueueLinks strategy,
- or because the maxRequestsPerCrawl limit has been reached.

optional inherited persistCookiesPerSession

persistCookiesPerSession?: boolean

Defines whether the cookies should be persisted for sessions. This can only be used when useSessionPool is set to true.

optional postNavigationHooks

postNavigationHooks?: AdaptiveHook[]

Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts a subset of the crawling context. If you attempt to access the page property during HTTP-only crawling, an exception will be thrown. If it's not caught, the request will be transparently retried in a browser.

optional preNavigationHooks

preNavigationHooks?: AdaptiveHook[]

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies. The function accepts a subset of the crawling context. If you attempt to access the page property during HTTP-only crawling, an exception will be thrown. If it's not caught, the request will be transparently retried in a browser.
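A hook that never touches the page property works the same way in both HTTP-only and browser mode. The sketch below assumes only that the hook receives a context with a request; the HookContextSketch type and the navigationSource key are hypothetical names introduced for illustration.

```typescript
// Local stand-in for the subset of the crawling context passed to hooks
// (an assumption for this sketch; the real subset has more properties).
interface HookContextSketch {
    request: { url: string; userData: Record<string, unknown> };
    page?: unknown; // deliberately never read below
}

// Tags the request before navigation without accessing `page`,
// so it does not force a retry in the browser.
const tagRequest = async ({ request }: HookContextSketch): Promise<void> => {
    request.userData.navigationSource = 'pre-navigation-hook';
};

// Hooks are evaluated sequentially, in array order.
const preNavigationHooks = [tagRequest];
```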

optional preventDirectStorageAccess

preventDirectStorageAccess?: boolean

Prevent direct access to storage in request handlers (only allow using context helpers). Defaults to true.

optional inherited proxyConfiguration

proxyConfiguration?: ProxyConfiguration

If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration.

optional renderingTypeDetectionRatio

renderingTypeDetectionRatio?: number

Specifies the frequency of rendering type detection checks - 0.1 means roughly 10% of requests. Defaults to 0.1 (so 10%).
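To make "roughly 10% of requests" concrete, the sketch below models the ratio as a per-request probability. This is an illustration of the idea only and is not taken from Crawlee's source; the shouldRunDetection function and its injectable random parameter are hypothetical.

```typescript
// Decides whether a given request should trigger a rendering type detection.
// `random` is injectable so that the behavior can be tested deterministically.
function shouldRunDetection(ratio: number, random: () => number = Math.random): boolean {
    return random() < ratio;
}

// Counts detections over evenly spaced sample values in [0, 1).
function countDetections(ratio: number, samples: number): number {
    let count = 0;
    for (let i = 0; i < samples; i++) {
        if (shouldRunDetection(ratio, () => i / samples)) count++;
    }
    return count;
}
```

With a ratio of 0 no request triggers detection under this model, and with a ratio of 1 every request does.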

optional renderingTypePredictor

renderingTypePredictor?: Pick<RenderingTypePredictor, 'predict' | 'storeResult' | 'initialize'>

A custom rendering type predictor.

optional requestHandler

requestHandler?: (crawlingContext) => Awaitable<void>

Function that is called to process each request.

The function receives the AdaptivePlaywrightCrawlingContext as an argument, and it must refrain from calling code with side effects, other than the methods of the crawling context. Any other side effects may be invoked repeatedly by the crawler, which can lead to inconsistent results.

The function must return a promise, which is then awaited by the crawler.

If the function throws an exception, the crawler will try to re-crawl the request later, up to option.maxRequestRetries times.

Type declaration

- (crawlingContext): Awaitable<void>
- Parameters
  - crawlingContext: { request: LoadedRequest<Request<Dictionary>> } & Omit<AdaptivePlaywrightCrawlerContext<Dictionary>, 'request'>
  Returns Awaitable<void>
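Because the handler may be invoked more than once for the same request, a safe pattern is to produce output only through the crawling context. The sketch below assumes a context with request and a pushData helper; HandlerContextSketch is a local stand-in type, not the real AdaptivePlaywrightCrawlingContext.

```typescript
// Local stand-in for the crawling context (an assumption for this sketch).
interface HandlerContextSketch {
    request: { url: string };
    pushData: (item: Record<string, unknown>) => Promise<void>;
}

// All output goes through the context helper. The handler does not write to
// outer variables, files, or external services, so a repeated run is harmless.
const requestHandler = async ({ request, pushData }: HandlerContextSketch): Promise<void> => {
    const { hostname, pathname } = new URL(request.url);
    await pushData({ url: request.url, hostname, pathname });
};
```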

optional inherited requestHandlerTimeoutSecs

requestHandlerTimeoutSecs?: number = 60

Timeout in which the function passed as requestHandler needs to finish, in seconds.

optional inherited requestList

requestList?: IRequestList

Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.

Alternatively, requests parameter of crawler.run() could be used to enqueue the initial requests - it is a shortcut for running crawler.addRequests() before the crawler.run().

optional inherited requestManager

requestManager?: IRequestManager

Allows explicitly configuring a request manager. Mutually exclusive with the requestQueue and requestList options.

This enables explicitly configuring the crawler to use RequestManagerTandem, for instance. If using this, the type of BasicCrawler.requestQueue may not be fully compatible with the RequestProvider class.

optional inherited requestQueue

requestQueue?: RequestProvider

Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.

Alternatively, requests parameter of crawler.run() could be used to enqueue the initial requests - it is a shortcut for running crawler.addRequests() before the crawler.run().

optional inherited respectRobotsTxtFile

respectRobotsTxtFile?: boolean | { userAgent?: string }

If set to true, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via enqueueLinks.

If an object is provided, it may contain a userAgent property to specify which user-agent should be used when checking the robots.txt file. If not provided, the default user-agent * will be used.

optional resultChecker

resultChecker?: (result) => boolean

An optional callback that is called on dataset items found by the request handler in plain HTTP mode. If it returns false, the request is retried in a browser. If no callback is specified, every dataset item is considered valid.

Type declaration

- (result): boolean
- Parameters
  - result: RequestHandlerResult
  Returns boolean
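A result checker might accept an HTTP-only result only if every item carries the fields the scraper needs. In this sketch the datasetItems property on the result is an assumption about the shape of RequestHandlerResult, modeled by the local CheckableResult type, and the title field is hypothetical.

```typescript
// Local model of the handler result (the `datasetItems` shape is an assumption).
interface CheckableResult {
    datasetItems: ReadonlyArray<{ item: Record<string, unknown> }>;
}

// Accepts the plain HTTP result only if at least one item was produced and
// every item has a non-empty string `title`. Returning false triggers a
// retry in a browser.
const resultChecker = (result: CheckableResult): boolean =>
    result.datasetItems.length > 0 &&
    result.datasetItems.every(({ item }) => {
        const title = item.title;
        return typeof title === 'string' && title.trim() !== '';
    });
```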

optional resultComparator

resultComparator?: (resultA, resultB) => boolean | 'equal' | 'different' | 'inconclusive'

An optional callback used in rendering type detection. On each detection, the result of the plain HTTP run is compared to that of the browser run. If a callback is provided, the contract is as follows:

- If the callback returns true or 'equal', the results are considered equal and the target site is considered static.
- If it returns false or 'different', the target site is considered client-rendered.
- If it returns 'inconclusive', the detection result won't be used.

If no result comparator is specified but there is a resultChecker, any site where the resultChecker returns true is considered static. If neither resultComparator nor resultChecker is specified, a deep comparison of the returned dataset items is used by default.

Type declaration

- (resultA, resultB): boolean | 'equal' | 'different' | 'inconclusive'
- Parameters
  - resultA: RequestHandlerResult
  - resultB: RequestHandlerResult
  Returns boolean | 'equal' | 'different' | 'inconclusive'
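The contract can be sketched with a comparator that looks only at item titles and declines to decide when neither run produced any data. As before, the datasetItems shape is an assumption modeled by the local ComparableResult type, and the title field is hypothetical.

```typescript
// Local model of the handler result (the `datasetItems` shape is an assumption).
interface ComparableResult {
    datasetItems: ReadonlyArray<{ item: Record<string, unknown> }>;
}

type ComparisonVerdict = boolean | 'equal' | 'different' | 'inconclusive';

// Extracts the titles of all items in a stable order.
const sortedTitles = (result: ComparableResult): string[] =>
    result.datasetItems.map(({ item }) => String(item.title ?? '')).sort();

// Compares the HTTP-only result with the browser result by title only.
const resultComparator = (resultA: ComparableResult, resultB: ComparableResult): ComparisonVerdict => {
    if (resultA.datasetItems.length === 0 && resultB.datasetItems.length === 0) {
        return 'inconclusive'; // nothing to compare, so skip this detection
    }
    const a = sortedTitles(resultA);
    const b = sortedTitles(resultB);
    const same = a.length === b.length && a.every((title, i) => title === b[i]);
    return same ? 'equal' : 'different';
};
```

Comparing a subset of fields like this may be useful when items contain values that legitimately differ between runs, such as timestamps.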

optional inherited retryOnBlocked

retryOnBlocked?: boolean

If set to true, the crawler will automatically try to bypass any detected bot protection.

Currently supports:

optional inherited sameDomainDelaySecs

sameDomainDelaySecs?: number = 0

Indicates how much time (in seconds) to wait before crawling another same domain request.

optional inherited sessionPoolOptions

sessionPoolOptions?: SessionPoolOptions

The configuration options for SessionPool to use.

optional shouldPropagateError

shouldPropagateError?: (error, context) => Awaitable<boolean> = (error, context) => Awaitable<boolean>

An optional callback that decides whether an error thrown during the plain HTTP request handler should be propagated (instead of falling back to browser navigation).

If the callback returns true, the error is thrown, triggering the standard retry mechanism. If the callback returns false (or is not provided), the error is logged and the crawler falls back to browser navigation (default behavior).

Type declaration

- (error, context): Awaitable<boolean>
- Parameters
  - error: Error
  - context: PlaywrightCrawlingContext<Dictionary>
  Returns Awaitable<boolean>
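One possible policy is to propagate errors that point to a bug in the handler code, since rerunning the same handler in a browser is unlikely to help. The sketch below is only an example of such a policy; the choice of TypeError and ReferenceError as the propagated classes is an assumption, and the context parameter is typed loosely because it is not used.

```typescript
// Returns true for errors that most likely indicate a programming mistake,
// so they go through the standard retry mechanism instead of a browser fallback.
// Returning a plain boolean is allowed because the return type is Awaitable<boolean>.
const shouldPropagateError = (error: Error, _context?: unknown): boolean =>
    error instanceof TypeError || error instanceof ReferenceError;
```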

optional inherited statisticsOptions

statisticsOptions?: StatisticsOptions

Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store.

optional inherited statusMessageCallback

statusMessageCallback?: StatusMessageCallback<BasicCrawlingContext<Dictionary>, BasicCrawler<BasicCrawlingContext<Dictionary>>>

Allows overriding the default status message. The callback needs to call crawler.setStatusMessage() explicitly. The default status message is provided in the parameters.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The callback is responsible for calling setStatusMessage() itself.
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});

optional inherited statusMessageLoggingInterval

statusMessageLoggingInterval?: number

Defines the length of the interval for calling the setStatusMessage in seconds.

optional inherited useSessionPool

useSessionPool?: boolean

Basic crawler will initialize the SessionPool with the corresponding sessionPoolOptions. The session instance will then be available in the requestHandler.
