HttpCrawlerOptions <Context>
Hierarchy
- BasicCrawlerOptions<Context>
  - HttpCrawlerOptions
 
Index
Properties
- additionalHttpErrorStatusCodes
- additionalMimeTypes
- autoscaledPoolOptions
- errorHandler
- experiments
- failedRequestHandler
- forceResponseEncoding
- handlePageFunction
- httpClient
- ignoreHttpErrorStatusCodes
- ignoreSslErrors
- keepAlive
- maxConcurrency
- maxCrawlDepth
- maxRequestRetries
- maxRequestsPerCrawl
- maxRequestsPerMinute
- maxSessionRotations
- minConcurrency
- navigationTimeoutSecs
- onSkippedRequest
- persistCookiesPerSession
- postNavigationHooks
- preNavigationHooks
- proxyConfiguration
- requestHandler
- requestHandlerTimeoutSecs
- requestList
- requestManager
- requestQueue
- respectRobotsTxtFile
- retryOnBlocked
- sameDomainDelaySecs
- sessionPoolOptions
- statisticsOptions
- statusMessageCallback
- statusMessageLoggingInterval
- suggestResponseEncoding
- useSessionPool
Properties
optionaladditionalHttpErrorStatusCodes
optionaladditionalMimeTypes
An array of MIME types you want the crawler to load and process. By default, only text/html and application/xhtml+xml MIME types are supported.
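The MIME type rule above can be sketched as a small check. This is a hypothetical model only, not Crawlee's implementation: the function name `isSupportedMimeType` and the exact-match comparison are assumptions made for illustration.

```typescript
// Defaults as stated in the description above.
const DEFAULT_MIME_TYPES = ['text/html', 'application/xhtml+xml'];

// Hypothetical helper: decides whether a response with the given Content-Type
// would be processed, given the user's additionalMimeTypes.
function isSupportedMimeType(contentType: string, additionalMimeTypes: string[] = []): boolean {
    // Strip parameters such as "; charset=utf-8" and normalize case.
    const mimeType = (contentType.split(';')[0] ?? '').trim().toLowerCase();
    const allowed = DEFAULT_MIME_TYPES.concat(additionalMimeTypes);
    return allowed.some((candidate) => candidate.toLowerCase() === mimeType);
}
```

Under this model, `application/json` is rejected unless it is listed in `additionalMimeTypes`.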
optionalinheritedautoscaledPoolOptions
Custom options passed to the underlying AutoscaledPool constructor.
NOTE: The runTaskFunction option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of isFinishedFunction and isTaskReadyFunction.
optionalinheritederrorHandler
User-provided function that allows modifying the request object before it gets retried by the crawler.
It's executed before each retry for requests that have failed fewer than maxRequestRetries times.
The function receives the BasicCrawlingContext as the first argument,
where the request corresponds to the request to be retried.
Second argument is the Error instance that
represents the last error thrown during processing of the request.
optionalinheritedexperiments
Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time.
optionalinheritedfailedRequestHandler
A function to handle requests that failed more than maxRequestRetries times.
The function receives the BasicCrawlingContext as the first argument,
where the request corresponds to the failed request.
Second argument is the Error instance that
represents the last error thrown during processing of the request.
optionalforceResponseEncoding
By default this crawler will extract correct encoding from the HTTP response headers. Use forceResponseEncoding
to force a certain encoding, disregarding the response headers.
To only provide a default for missing encodings, use HttpCrawlerOptions.suggestResponseEncoding.
// Will force windows-1250 encoding even if headers say otherwise
forceResponseEncoding: 'windows-1250'
optionalhandlePageFunction
An alias for HttpCrawlerOptions.requestHandler
Soon to be removed, use requestHandler instead.
optionalinheritedhttpClient
HTTP client implementation for the sendRequest context helper and for plain HTTP crawling.
Defaults to a new instance of GotScrapingHttpClient
optionalignoreHttpErrorStatusCodes
An array of HTTP response Status Codes to be excluded from error consideration. By default, status codes >= 500 trigger errors.
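As a rough illustration of how ignoreHttpErrorStatusCodes and additionalHttpErrorStatusCodes interact with the default ">= 500" rule, consider the sketch below. It is a hypothetical model, not Crawlee's code: the helper `isErrorStatus` is invented here, and the precedence when a code appears in both lists (ignored wins) is an assumption.

```typescript
// Hypothetical model of the rule described above; not Crawlee's actual implementation.
interface StatusCodeOptions {
    additionalHttpErrorStatusCodes?: number[];
    ignoreHttpErrorStatusCodes?: number[];
}

function isErrorStatus(statusCode: number, options: StatusCodeOptions = {}): boolean {
    const additional = options.additionalHttpErrorStatusCodes ?? [];
    const ignored = options.ignoreHttpErrorStatusCodes ?? [];

    // Assumption: explicitly ignored codes are never treated as errors.
    if (ignored.indexOf(statusCode) !== -1) return false;
    // Codes listed as additional errors are treated as errors.
    if (additional.indexOf(statusCode) !== -1) return true;
    // Default rule stated in the docs: status codes >= 500 trigger errors.
    return statusCode >= 500;
}
```

For example, passing `{ additionalHttpErrorStatusCodes: [404] }` makes a 404 count as an error in this model, while `{ ignoreHttpErrorStatusCodes: [503] }` lets a 503 through.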
optionalignoreSslErrors
If set to true, SSL certificate errors will be ignored.
optionalinheritedkeepAlive
Keeps the crawler alive even if the RequestQueue becomes empty.
By default, the crawler.run() will resolve once the queue is empty. With keepAlive: true it will keep running,
waiting for more requests to come. Use crawler.stop() to exit the crawler gracefully, or crawler.teardown() to stop it immediately.
optionalinheritedmaxConcurrency
Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the
AutoscaledPool maxConcurrency option.
optionalinheritedmaxCrawlDepth
Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed.
Setting this to 0 will only process the initial requests, skipping all links enqueued by crawlingContext.enqueueLinks and crawlingContext.addRequests.
Passing 1 will process the initial requests and all links enqueued by crawlingContext.enqueueLinks and crawlingContext.addRequests in the handler for initial requests.
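The depth semantics above can be sketched with a small simulation. Everything here is hypothetical: `LinkGraph` stands in for "the links each page's handler would enqueue", and `simulateCrawl` is an illustrative helper, not part of Crawlee.

```typescript
// Hypothetical link graph: each URL maps to the URLs its handler would enqueue.
type LinkGraph = Record<string, string[]>;

function simulateCrawl(graph: LinkGraph, startUrls: string[], maxCrawlDepth: number = Infinity): string[] {
    const seen = new Set<string>(startUrls);
    const queue = startUrls.map((url) => ({ url, depth: 0 })); // initial requests are depth 0
    const processed: string[] = [];

    while (queue.length > 0) {
        const current = queue.shift();
        if (!current) break;
        processed.push(current.url);

        for (const link of graph[current.url] ?? []) {
            // A link found at `depth` would be processed at `depth + 1`; skip it past the limit.
            if (current.depth + 1 > maxCrawlDepth || seen.has(link)) continue;
            seen.add(link);
            queue.push({ url: link, depth: current.depth + 1 });
        }
    }
    return processed;
}

// Sample graph used for illustration only.
const sampleGraph: LinkGraph = { a: ['b', 'c'], b: ['d'], c: [], d: ['e'] };
```

With `sampleGraph` and the single start URL `a`, a limit of 0 processes only `a`, a limit of 1 processes `a`, `b`, and `c`, and leaving the limit unset processes all five pages.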
optionalinheritedmaxRequestRetries
Specifies the maximum number of retries allowed for a request if its processing fails.
This includes retries due to navigation errors or errors thrown from user-supplied functions
(requestHandler, preNavigationHooks, postNavigationHooks).
This limit does not apply to retries triggered by session rotation
(see maxSessionRotations).
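To make the relationship between maxRequestRetries, errorHandler, and failedRequestHandler concrete, here is a simplified synchronous model. It is a sketch under assumptions, not Crawlee's implementation: `runWithRetries` and `RetryOutcome` are hypothetical names, session rotations are ignored entirely, and it assumes a request is attempted at most 1 + maxRequestRetries times.

```typescript
interface RetryOutcome {
    attempts: number;          // how many times the handler ran
    errorHandlerCalls: number; // how many times the errorHandler step was reached
    failed: boolean;           // true if the failedRequestHandler step was reached
}

function runWithRetries(handler: (attempt: number) => void, maxRequestRetries: number): RetryOutcome {
    let retryCount = 0;
    let errorHandlerCalls = 0;

    for (;;) {
        try {
            handler(retryCount + 1);
            return { attempts: retryCount + 1, errorHandlerCalls, failed: false };
        } catch (_error) {
            if (retryCount >= maxRequestRetries) {
                // Retries exhausted: this is where failedRequestHandler would run.
                return { attempts: retryCount + 1, errorHandlerCalls, failed: true };
            }
            // Still under the limit: this is where errorHandler would run before the retry.
            errorHandlerCalls += 1;
            retryCount += 1;
        }
    }
}
```

In this model, a handler that always throws with `maxRequestRetries` set to 2 runs three times, reaches the errorHandler step twice, and then ends in the failedRequestHandler step.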
optionalinheritedmaxRequestsPerCrawl
Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers.
NOTE: In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
optionalinheritedmaxRequestsPerMinute
The maximum number of requests per minute the crawler should run.
By default, this is set to Infinity, but we can pass any positive, non-zero integer.
Shortcut for the AutoscaledPool maxTasksPerMinute option.
optionalinheritedmaxSessionRotations
Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website.
The session rotations are not counted towards the maxRequestRetries limit.
optionalinheritedminConcurrency
Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the
AutoscaledPool minConcurrency option.
WARNING: If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically.
optionalnavigationTimeoutSecs
Timeout in which the HTTP request to the resource needs to finish, given in seconds.
optionalinheritedonSkippedRequest
When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped
- based on robots.txt file,
- because they don't match enqueueLinks filters,
- because they are redirected to a URL that doesn't match the enqueueLinks strategy,
- or because the maxRequestsPerCrawl limit has been reached.
optionalpersistCookiesPerSession
Automatically saves cookies to Session. Works only if Session Pool is used.
It parses cookies from the response "set-cookie" header and saves or updates them for the session. Once the session is used for the next request, it passes the "Cookie" header to the request with the session cookies.
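The round trip described above (read "set-cookie" from a response, then send "Cookie" on the next request) can be sketched as follows. `SimpleCookieJar` is a hypothetical stand-in for a session's cookie storage, not Crawlee's Session class, and it deliberately ignores cookie attributes such as Path, Domain, and Expires.

```typescript
// Hypothetical stand-in for per-session cookie storage.
class SimpleCookieJar {
    private cookies = new Map<string, string>();

    // Save or update cookies from "set-cookie" response header values.
    storeFromSetCookie(setCookieHeaders: string[]): void {
        for (const header of setCookieHeaders) {
            // Only the leading "name=value" pair is kept; attributes are ignored in this sketch.
            const pair = header.split(';')[0] ?? '';
            const separator = pair.indexOf('=');
            if (separator <= 0) continue;
            const name = pair.slice(0, separator).trim();
            const value = pair.slice(separator + 1).trim();
            this.cookies.set(name, value);
        }
    }

    // Build the "Cookie" request header value for the next request in the same session.
    toCookieHeader(): string {
        return Array.from(this.cookies, ([name, value]) => `${name}=${value}`).join('; ');
    }
}
```

After storing `sid=abc; Path=/; HttpOnly` and `lang=en`, the jar yields `sid=abc; lang=en`; a later `sid=xyz` updates the existing cookie rather than adding a second one.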
optionalpostNavigationHooks
Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful.
The function accepts crawlingContext as the only parameter.
Example:
postNavigationHooks: [
    async (crawlingContext) => {
        // ...
    },
]
optionalpreNavigationHooks
Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies
or browser properties before navigation. The function accepts two parameters, crawlingContext and gotOptions,
which are passed to the requestAsBrowser() function the crawler calls to navigate.
Example:
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // ...
    },
]
Modifying pageOptions is supported only in Playwright incognito.
See PrePageCreateHook.
optionalproxyConfiguration
If set, this crawler will be configured for all connections to use Apify Proxy or your own Proxy URLs provided and rotated according to the configuration. For more information, see the documentation.
optionalinheritedrequestHandler
User-provided function that performs the logic of the crawler. It is called for each URL to crawl.
The function receives the BasicCrawlingContext as an argument,
where the request represents the URL to crawl.
The function must return a promise, which is then awaited by the crawler.
If the function throws an exception, the crawler will try to re-crawl the
request later, up to maxRequestRetries times.
If all the retries fail, the crawler calls the function
provided to the failedRequestHandler parameter.
To make this work, we should always
let our function throw exceptions rather than catch them.
The exceptions are logged to the request using the
Request.pushErrorMessage() function.
optionalinheritedrequestHandlerTimeoutSecs
Timeout in which the function passed as requestHandler needs to finish, in seconds.
optionalinheritedrequestList
Static list of URLs to be processed.
If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.
Alternatively, the requests parameter of crawler.run() could be used to enqueue the initial requests - it is a shortcut for running crawler.addRequests() before crawler.run().
optionalinheritedrequestManager
Allows explicitly configuring a request manager. Mutually exclusive with the requestQueue and requestList options.
This enables explicitly configuring the crawler to use RequestManagerTandem, for instance.
If using this, the type of BasicCrawler.requestQueue may not be fully compatible with the RequestProvider class.
optionalinheritedrequestQueue
Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites.
If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.
Alternatively, the requests parameter of crawler.run() could be used to enqueue the initial requests - it is a shortcut for running crawler.addRequests() before crawler.run().
optionalinheritedrespectRobotsTxtFile
If set to true, the crawler will automatically try to fetch the robots.txt file for each domain,
and skip those that are not allowed. This also prevents disallowed URLs from being added via enqueueLinks.
optionalinheritedretryOnBlocked
If set to true, the crawler will automatically try to bypass any detected bot protection.
Currently supports:
optionalinheritedsameDomainDelaySecs
Indicates how much time (in seconds) to wait before crawling another request to the same domain.
optionalinheritedsessionPoolOptions
The configuration options for SessionPool to use.
optionalinheritedstatisticsOptions
Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store.
optionalinheritedstatusMessageCallback
Allows overriding the default status message. The callback needs to call crawler.setStatusMessage() explicitly.
The default status message is provided in the parameters.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
optionalinheritedstatusMessageLoggingInterval
Defines the length of the interval for calling the setStatusMessage in seconds.
optionalsuggestResponseEncoding
By default this crawler will extract correct encoding from the HTTP response headers.
Sadly, some websites send invalid headers. Responses from those sites are treated as UTF-8.
If those sites actually use a different encoding, the response will be corrupted. You can use
suggestResponseEncoding to fall back to a certain encoding if you know that your target website uses it.
To force a certain encoding, disregarding the response headers, use HttpCrawlerOptions.forceResponseEncoding
// Will fall back to windows-1250 encoding if none found
suggestResponseEncoding: 'windows-1250'
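The difference between forceResponseEncoding and suggestResponseEncoding comes down to precedence, which the sketch below tries to capture. It is a hypothetical model rather than Crawlee's implementation: `resolveEncoding` is an invented helper, and reading the encoding from a `charset` parameter of the Content-Type header is a simplifying assumption.

```typescript
interface EncodingOptions {
    forceResponseEncoding?: string;
    suggestResponseEncoding?: string;
}

// Hypothetical helper: picks the encoding used to decode a response body.
function resolveEncoding(contentTypeHeader: string | undefined, options: EncodingOptions = {}): string {
    // A forced encoding wins regardless of what the headers say.
    if (options.forceResponseEncoding) return options.forceResponseEncoding;

    // Look for a `charset=...` parameter in the Content-Type header.
    const match = /charset\s*=\s*"?([^\s;"]+)"?/i.exec(contentTypeHeader ?? '');
    const charset = match ? match[1] : undefined;
    if (charset) return charset.toLowerCase();

    // No usable charset: fall back to the suggestion, then to UTF-8.
    return options.suggestResponseEncoding ?? 'utf-8';
}
```

In this model a suggested encoding is used only when the header carries no charset, whereas a forced encoding overrides even a valid one.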
optionalinheriteduseSessionPool
Basic crawler will initialize the SessionPool with the corresponding sessionPoolOptions.
The session instance will then be available in the requestHandler.
additionalHttpErrorStatusCodes: An array of additional HTTP response status codes to be treated as errors. By default, status codes >= 500 trigger errors.