AdaptivePlaywrightCrawlerOptions
Hierarchy
- Omit<PlaywrightCrawlerOptions, requestHandler | handlePageFunction>
- AdaptivePlaywrightCrawlerOptions
Index
Properties
- autoscaledPoolOptions
- browserPoolOptions
- errorHandler
- experiments
- failedRequestHandler
- headless
- ignoreIframes
- ignoreShadowRoots
- keepAlive
- launchContext
- maxConcurrency
- maxRequestRetries
- maxRequestsPerCrawl
- maxRequestsPerMinute
- maxSessionRotations
- minConcurrency
- navigationTimeoutSecs
- persistCookiesPerSession
- postNavigationHooks
- preNavigationHooks
- proxyConfiguration
- renderingTypeDetectionRatio
- renderingTypePredictor
- requestHandler
- requestHandlerTimeoutSecs
- requestList
- requestQueue
- resultChecker
- resultComparator
- retryOnBlocked
- sameDomainDelaySecs
- sessionPoolOptions
- statisticsOptions
- statusMessageCallback
- statusMessageLoggingInterval
- useSessionPool
Properties
optional autoscaledPoolOptions
optional browserPoolOptions
Custom options passed to the underlying BrowserPool constructor. These can be tweaked to fine-tune browser management.
optional errorHandler
User-provided function that allows modifying the request object before it gets retried by the crawler. It is executed before each retry for requests that have failed fewer than maxRequestRetries times.
The function receives the BrowserCrawlingContext (the actual context will be enhanced with crawler-specific properties) as the first argument, where the request corresponds to the request to be retried. The second argument is the Error instance that represents the last error thrown during processing of the request.
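A minimal sketch of such a handler follows. The `RetryRequest` and `RetryContext` types are local stand-ins written for this example, not Crawlee's real `BrowserCrawlingContext`, and the `lastError` key in `userData` is a hypothetical name chosen for illustration.

```typescript
// Local stand-ins for illustration only; the real context carries many more properties.
interface RetryRequest {
    url: string;
    retryCount: number;
    userData: Record<string, unknown>;
}

interface RetryContext {
    request: RetryRequest;
}

// Records the last failure on the request so that the next attempt can inspect it.
// `lastError` is a hypothetical userData key, not something Crawlee defines.
const errorHandler = async ({ request }: RetryContext, error: Error): Promise<void> => {
    request.userData['lastError'] = error.message;
};
```

A function of this shape could be passed as the errorHandler option; the change to the request is then visible to the next retry.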
optional experiments
Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time.
optional failedRequestHandler
A function to handle requests that failed more than maxRequestRetries times.
The function receives the BrowserCrawlingContext (the actual context will be enhanced with crawler-specific properties) as the first argument, where the request corresponds to the failed request. The second argument is the Error instance that represents the last error thrown during processing of the request.
optional headless
Whether to run the browser in headless mode. Defaults to true.
Can also be set via Configuration.
optional ignoreIframes
Whether to ignore iframes when processing the page content via the parseWithCheerio helper.
By default, iframes are expanded automatically. Use this option to disable this behavior.
optional ignoreShadowRoots
Whether to ignore custom elements (and their #shadow-roots) when processing the page content via the parseWithCheerio helper.
By default, they are expanded automatically. Use this option to disable this behavior.
optional keepAlive
Keeps the crawler alive even if the RequestQueue becomes empty.
By default, crawler.run() resolves once the queue is empty. With keepAlive: true, it keeps running and waits for more requests to come. Use crawler.teardown() to exit the crawler.
optional launchContext
The same options as used by launchPlaywright.
optional maxConcurrency
Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool maxConcurrency option.
optional maxRequestRetries
Indicates how many times the request is retried if requestHandler fails.
optional maxRequestsPerCrawl
Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers.
NOTE: In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
optional maxRequestsPerMinute
The maximum number of requests per minute the crawler should run.
By default, this is set to Infinity, but we can pass any positive, non-zero integer.
Shortcut for the AutoscaledPool maxTasksPerMinute option.
optional maxSessionRotations
Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website.
The session rotations are not counted towards the maxRequestRetries limit.
optional minConcurrency
Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool minConcurrency option.
WARNING: If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If unsure, it's better to keep the default value, and the concurrency will scale up automatically.
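The concurrency and rate options described above can be combined in one options object. The sketch below uses a local `ThrottlingOptions` interface that only mirrors the option names from this page; it is not Crawlee's full options type, and the numbers are example values rather than recommendations.

```typescript
// Local subset of the option names documented above; not Crawlee's full interface.
interface ThrottlingOptions {
    minConcurrency?: number;
    maxConcurrency?: number;
    maxRequestsPerMinute?: number;
    maxRequestsPerCrawl?: number;
    maxRequestRetries?: number;
    maxSessionRotations?: number;
}

// Example values only; tune them to the target site and the available hardware.
const throttling: ThrottlingOptions = {
    maxConcurrency: 10, // upper bound on parallel requests
    maxRequestsPerMinute: 120, // about two requests per second overall
    maxRequestsPerCrawl: 500, // safety cap against runaway crawls
    maxRequestRetries: 2, // retries after requestHandler failures
    maxSessionRotations: 5, // not counted towards maxRequestRetries
    // minConcurrency is left unset so the pool scales up on its own.
};
```

An object like this could be spread into the options passed to the crawler constructor.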
optional navigationTimeoutSecs
Timeout in which page navigation needs to finish, in seconds.
optional persistCookiesPerSession
Defines whether the cookies should be persisted for sessions.
This can only be used when useSessionPool is set to true.
optional postNavigationHooks
Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful.
The function accepts crawlingContext as the only parameter.
Example:
    postNavigationHooks: [
        async (crawlingContext) => {
            const { page } = crawlingContext;
            // hasCaptcha and solveCaptcha are user-supplied helpers, not part of Crawlee.
            if (hasCaptcha(page)) {
                await solveCaptcha(page);
            }
        },
    ]
optional preNavigationHooks
Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, crawlingContext and gotoOptions, which are passed to the page.goto() function the crawler calls to navigate.
Example:
    preNavigationHooks: [
        async (crawlingContext, gotoOptions) => {
            const { page } = crawlingContext;
            // Set a property on the page's window object before navigation.
            await page.evaluate((attr) => { window.foo = attr; }, 'bar');
        },
    ]
Modifying pageOptions is supported only in Playwright incognito.
See PrePageCreateHook.
optional proxyConfiguration
If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration.
optional renderingTypeDetectionRatio
Specifies the frequency of rendering type detection checks - 0.1 means roughly 10% of requests. Defaults to 0.1 (so 10%).
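To make the ratio concrete, here is a sketch of probabilistic sampling. It only illustrates what "roughly 10% of requests" means and is not Crawlee's actual implementation; the `shouldDetectRenderingType` function and its injectable `random` parameter are hypothetical.

```typescript
// Returns true for roughly `ratio` of calls. `random` is injectable so the
// behaviour can be checked deterministically.
function shouldDetectRenderingType(ratio: number, random: () => number = Math.random): boolean {
    return random() < ratio;
}

// Count how many of 10000 simulated requests would trigger a detection run.
let detections = 0;
for (let i = 0; i < 10000; i++) {
    if (shouldDetectRenderingType(0.1)) detections++;
}
console.log(`${detections} of 10000 simulated requests triggered detection`);
```

With a ratio of 0.1 the printed count varies from run to run but stays close to 1000.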
optional renderingTypePredictor
A custom rendering type predictor.
optional requestHandler
Function that is called to process each request.
The function receives the AdaptivePlaywrightCrawlerContext as an argument, and it must refrain from calling code with side effects, other than the methods of the crawling context. Any other side effects may be invoked repeatedly by the crawler, which can lead to inconsistent results.
The function must return a promise, which is then awaited by the crawler.
If the function throws an exception, the crawler will try to re-crawl the request later, up to maxRequestRetries times.
Type declaration
Parameters
crawlingContext: { request: LoadedRequest<Request<Dictionary>> } & Omit<AdaptivePlaywrightCrawlerContext, request>
Returns Awaitable<void>
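The side-effect rule above can be sketched as follows. `HandlerContext` is a deliberately small local stand-in for the real crawling context, declared here only so the example is self-contained; the handler limits itself to the context's own methods.

```typescript
// Minimal stand-in for the crawling context; the real one exposes more helpers.
interface HandlerContext {
    request: { url: string };
    pushData: (item: Record<string, unknown>) => Promise<void>;
    enqueueLinks: () => Promise<unknown>;
}

// All side effects go through the context, so repeated invocations by the
// crawler (for example a plain HTTP run and a browser run) stay consistent.
const requestHandler = async ({ request, pushData, enqueueLinks }: HandlerContext): Promise<void> => {
    await pushData({ url: request.url });
    await enqueueLinks();
};

// Pattern to avoid: a module-level counter would be incremented on every
// invocation, including the repeated ones.
// let pagesSeen = 0;
// const badHandler = async () => { pagesSeen++; };
```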
optional requestHandlerTimeoutSecs
Timeout in which the function passed as requestHandler needs to finish, in seconds.
optional requestList
Static list of URLs to be processed.
If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.
Alternatively, the requests parameter of crawler.run() can be used to enqueue the initial requests. It is a shortcut for calling crawler.addRequests() before crawler.run().
optional requestQueue
Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites.
If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.
Alternatively, the requests parameter of crawler.run() can be used to enqueue the initial requests. It is a shortcut for calling crawler.addRequests() before crawler.run().
optional resultChecker
An optional callback that is called on dataset items found by the request handler in plain HTTP mode. If it returns false, the request is retried in a browser. If no callback is specified, every dataset item is considered valid.
Type declaration
Parameters
result: RequestHandlerResult
Returns boolean
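A possible checker is sketched below. The `CheckableResult` interface is an assumption about the part of RequestHandlerResult that the callback reads (a `datasetItems` array whose entries wrap the pushed `item`), and the `title` field is hypothetical.

```typescript
// Assumed shape: only the part of the result this sketch needs.
interface CheckableResult {
    datasetItems: ReadonlyArray<{ item: Record<string, unknown> }>;
}

// Accept the plain HTTP result only if it produced at least one item and
// every item has a non-empty `title` (a hypothetical field).
const resultChecker = (result: CheckableResult): boolean =>
    result.datasetItems.length > 0 &&
    result.datasetItems.every(({ item }) => {
        const title = item['title'];
        return typeof title === 'string' && title.length > 0;
    });
```

Returning false from a checker like this would cause the request to be retried in a browser, as described above.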
optional resultComparator
An optional callback used in rendering type detection. On each detection, the result of the plain HTTP run is compared to that of the browser one.
If the callback returns true, the results are considered equal and the target site is considered static.
If no result comparator is specified, but there is a resultChecker, any site where the resultChecker returns true is considered static.
If neither resultComparator nor resultChecker is specified, a deep comparison of returned dataset items is used as a default.
Type declaration
Parameters
resultA: RequestHandlerResult
resultB: RequestHandlerResult
Returns boolean
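As an illustration, a comparator might look only at the fields that matter for the crawl instead of comparing whole items. The `ComparableResult` interface below is an assumed, simplified shape of RequestHandlerResult, and `title` is a hypothetical field.

```typescript
// Assumed shape: only the part of the result this sketch needs.
interface ComparableResult {
    datasetItems: ReadonlyArray<{ item: Record<string, unknown> }>;
}

// Extract the hypothetical `title` field from every item, in order.
const titles = (result: ComparableResult): string[] =>
    result.datasetItems.map(({ item }) => String(item['title']));

// Two runs count as equal when they yield the same titles in the same order.
const resultComparator = (resultA: ComparableResult, resultB: ComparableResult): boolean => {
    const a = titles(resultA);
    const b = titles(resultB);
    return a.length === b.length && a.every((title, i) => title === b[i]);
};
```

Ignoring volatile fields such as timestamps in this way can keep them from making a static page look dynamic.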
optional retryOnBlocked
If set to true, the crawler will automatically try to bypass any detected bot protection.
Currently supports:
optional sameDomainDelaySecs
Indicates how much time (in seconds) to wait before crawling another same domain request.
optional sessionPoolOptions
The configuration options for SessionPool to use.
optional statisticsOptions
Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store.
optional statusMessageCallback
Allows overriding the default status message. The callback needs to call crawler.setStatusMessage() explicitly.
The default status message is provided in the parameters.
    const crawler = new CheerioCrawler({
        statusMessageCallback: async (ctx) => {
            return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
        },
        statusMessageLoggingInterval: 1, // defaults to 10s
        async requestHandler({ $, enqueueLinks, request, log }) {
            // ...
        },
    });
optional statusMessageLoggingInterval
Defines the interval, in seconds, at which setStatusMessage is called.
optional useSessionPool
Basic crawler will initialize the SessionPool with the corresponding sessionPoolOptions.
The session instance will then be available in the requestHandler.
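The session-related options are typically set together, since persistCookiesPerSession only takes effect when useSessionPool is true. The sketch below uses a local `SessionSettings` interface that mirrors the option names on this page; it is not Crawlee's own type, and the pool size is an example value.

```typescript
// Local subset of the session-related option names; not Crawlee's full interface.
interface SessionSettings {
    useSessionPool?: boolean;
    persistCookiesPerSession?: boolean;
    sessionPoolOptions?: { maxPoolSize?: number };
}

// Example values only.
const sessionSettings: SessionSettings = {
    useSessionPool: true, // required for persistCookiesPerSession to apply
    persistCookiesPerSession: true,
    sessionPoolOptions: { maxPoolSize: 50 }, // cap on the number of sessions kept in the pool
};
```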
autoscaledPoolOptions: Custom options passed to the underlying AutoscaledPool constructor.