BasicCrawlerOptions <Context>
Hierarchy
- BasicCrawlerOptions
- HttpCrawlerOptions
Properties
optional autoscaledPoolOptions
optional errorHandler
User-provided function that allows modifying the request object before it gets retried by the crawler. It is executed before each retry for requests that have failed fewer than maxRequestRetries times.
The function receives the BasicCrawlingContext as the first argument, where the request corresponds to the request to be retried. The second argument is the Error instance that represents the last error thrown during processing of the request.
optional failedRequestHandler
A function to handle requests that failed more than maxRequestRetries times.
The function receives the BasicCrawlingContext as the first argument, where the request corresponds to the failed request. The second argument is the Error instance that represents the last error thrown during processing of the request.
optional keepAlive
Keeps the crawler alive even when the RequestQueue becomes empty. By default, crawler.run() resolves once the queue is empty. With keepAlive: true, it keeps running and waits for more requests to arrive. Use crawler.teardown() to exit the crawler.
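As a rough illustration of the difference, the following self-contained TypeScript sketch models a run loop with and without keepAlive. The class KeepAliveSketch, its in-memory queue, and the 10 ms idle poll are assumptions made for this example and are not Crawlee's implementation; only the names keepAlive, run(), addRequests(), and teardown() come from the description above.

```typescript
// Toy model only: an in-memory array stands in for the RequestQueue.
class KeepAliveSketch {
    readonly handled: string[] = [];
    private readonly queue: string[] = [];
    private readonly keepAlive: boolean;
    private stopped = false;

    constructor(keepAlive: boolean) {
        this.keepAlive = keepAlive;
    }

    addRequests(urls: string[]): void {
        this.queue.push(...urls);
    }

    // Plays the documented role of crawler.teardown(): ends a keepAlive run.
    teardown(): void {
        this.stopped = true;
    }

    async run(): Promise<void> {
        while (!this.stopped) {
            const url = this.queue.shift();
            if (url !== undefined) {
                this.handled.push(url); // "process" the request
                continue;
            }
            // The queue is empty: resolve unless keepAlive asks us to wait.
            if (!this.keepAlive) return;
            await new Promise<void>((resolve) => setTimeout(resolve, 10));
        }
    }
}
```

In this model, a run started with keepAlive set to false resolves as soon as the queue drains, while a run started with keepAlive set to true picks up requests added later and resolves only after teardown() is called.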
optional maxConcurrency
Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool maxConcurrency option.
optional maxRequestRetries
Indicates how many times the request is retried if requestHandler fails.
optional maxRequestsPerCrawl
Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers.
NOTE: In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
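One way to picture the overshoot mentioned in the note is the toy model below. It assumes that requests run in batches of a fixed size and that the limit is compared only against requests that have already finished. Both points are simplifying assumptions for illustration, and simulateCrawl is a hypothetical helper, not part of any library.

```typescript
// Toy model: requests run in batches of `concurrency`, and the limit is
// checked only against requests that have already finished (an assumption).
function simulateCrawl(totalUrls: number, maxRequestsPerCrawl: number, concurrency: number): number {
    let pending = totalUrls;
    let finished = 0;
    while (pending > 0 && finished < maxRequestsPerCrawl) {
        const batch = Math.min(concurrency, pending);
        pending -= batch;
        finished += batch; // every request in the batch runs to completion
    }
    return finished;
}

console.log(simulateCrawl(100, 10, 4)); // 12: a third batch starts while only 8 have finished
console.log(simulateCrawl(100, 10, 1)); // 10: no overshoot without parallelism
```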
optional maxRequestsPerMinute
The maximum number of requests per minute the crawler should run. By default, this is set to Infinity, but we can pass any positive integer. Shortcut for the AutoscaledPool maxTasksPerMinute option.
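To make the effect of a per-minute cap concrete, here is a minimal sliding-window limiter. It is a sketch of the general technique under stated assumptions and does not reproduce how AutoscaledPool enforces maxTasksPerMinute. The class PerMinuteLimiter and its tryStart method are hypothetical names, and timestamps are passed in explicitly so the behaviour is easy to follow.

```typescript
// Illustrative sliding-window limiter; class and method names are hypothetical.
class PerMinuteLimiter {
    private readonly maxRequestsPerMinute: number;
    private readonly startTimes: number[] = [];

    constructor(maxRequestsPerMinute: number = Infinity) {
        if (!(maxRequestsPerMinute > 0)) {
            throw new RangeError('maxRequestsPerMinute must be a positive number');
        }
        this.maxRequestsPerMinute = maxRequestsPerMinute;
    }

    // Returns true (and records the start) if a request may begin at `nowMs`.
    tryStart(nowMs: number): boolean {
        const windowStart = nowMs - 60_000;
        // Forget requests that started a minute or more ago.
        for (;;) {
            const oldest = this.startTimes[0];
            if (oldest === undefined || oldest > windowStart) break;
            this.startTimes.shift();
        }
        if (this.startTimes.length >= this.maxRequestsPerMinute) return false;
        this.startTimes.push(nowMs);
        return true;
    }
}
```

With a limit of 2, a third request inside the same minute is refused and has to wait until the oldest recorded start falls out of the 60-second window. With the default of Infinity, the limiter never refuses a request.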
optional minConcurrency
Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool minConcurrency option.
WARNING: If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If we are not sure, it is better to keep the default value and let the concurrency scale up automatically.
optional requestHandler
User-provided function that performs the logic of the crawler. It is called for each URL to crawl.
The function receives the BasicCrawlingContext as an argument, where the request represents the URL to crawl.
The function must return a promise, which is then awaited by the crawler.
If the function throws an exception, the crawler will try to re-crawl the request later, up to maxRequestRetries times. If all the retries fail, the crawler calls the function provided to the failedRequestHandler parameter. To make this work, we should always let our function throw exceptions rather than catch them. The exceptions are logged to the request using the Request.pushErrorMessage() function.
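The interplay between requestHandler, errorHandler, failedRequestHandler, and maxRequestRetries can be sketched with a simplified, self-contained model. The types SketchRequest, SketchContext, and SketchOptions and the function processWithRetries are hypothetical stand-ins written for this example. The real crawler also involves a queue, sessions, and timeouts that are left out here, and the errorMessages array only loosely imitates what Request.pushErrorMessage() records.

```typescript
// Hypothetical stand-ins for the documented types; not Crawlee's real classes.
interface SketchRequest {
    url: string;
    retryCount: number;
    errorMessages: string[];
}

interface SketchContext {
    request: SketchRequest;
}

interface SketchOptions {
    requestHandler: (ctx: SketchContext) => Promise<void>;
    errorHandler?: (ctx: SketchContext, error: Error) => Promise<void>;
    failedRequestHandler?: (ctx: SketchContext, error: Error) => Promise<void>;
    maxRequestRetries: number;
}

async function processWithRetries(url: string, options: SketchOptions): Promise<SketchRequest> {
    const request: SketchRequest = { url, retryCount: 0, errorMessages: [] };
    const ctx: SketchContext = { request };

    for (;;) {
        try {
            await options.requestHandler(ctx);
            return request; // handled successfully
        } catch (err) {
            const error = err instanceof Error ? err : new Error(String(err));
            request.errorMessages.push(error.message);

            if (request.retryCount < options.maxRequestRetries) {
                // Retries remain: let errorHandler adjust the request first.
                request.retryCount++;
                await options.errorHandler?.(ctx, error);
                continue;
            }
            // Retries are exhausted: hand the request to failedRequestHandler.
            await options.failedRequestHandler?.(ctx, error);
            return request;
        }
    }
}
```

In this model, a handler that always throws with maxRequestRetries set to 2 runs three times in total, errorHandler runs twice, and failedRequestHandler runs once. A requestHandler that swallowed its own exceptions would look successful on the first attempt, which is why the description above recommends letting exceptions propagate.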
optional requestHandlerTimeoutSecs
Timeout in which the function passed as requestHandler needs to finish, in seconds.
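A common way to enforce such a limit on a promise-returning function is to race it against a timer. The sketch below shows that general technique. The helper runWithTimeout and its error message are assumptions made for this example, and the code is not taken from the crawler.

```typescript
// Rejects if `handler` does not settle within `timeoutSecs` seconds.
// Hypothetical helper illustrating the technique, not a library function.
async function runWithTimeout(handler: () => Promise<void>, timeoutSecs: number): Promise<void> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_resolve, reject) => {
        timer = setTimeout(
            () => reject(new Error(`requestHandler timed out after ${timeoutSecs} seconds.`)),
            timeoutSecs * 1000,
        );
    });
    try {
        // Whichever promise settles first decides the outcome.
        await Promise.race([handler(), timeout]);
    } finally {
        clearTimeout(timer); // avoid a dangling timer when the handler wins
    }
}
```

Note that racing does not cancel the losing handler. Work that has already started keeps running in the background unless the handler itself supports cancellation.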
optional requestList
Static list of URLs to be processed.
If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.
Alternatively, the requests parameter of crawler.run() can be used to enqueue the initial requests. It is a shortcut for running crawler.addRequests() before crawler.run().
optional requestQueue
Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites.
If not provided, the crawler will open the default request queue when the crawler.addRequests() function is called.
Alternatively, the requests parameter of crawler.run() can be used to enqueue the initial requests. It is a shortcut for running crawler.addRequests() before crawler.run().
optional sessionPoolOptions
The configuration options for SessionPool to use.
optional useSessionPool
Basic crawler will initialize the SessionPool with the corresponding sessionPoolOptions. The session instance will then be available in the requestHandler.
The autoscaledPoolOptions property takes custom options that are passed to the underlying AutoscaledPool constructor.
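Taken together, several of the options on this page might be combined as in the sketch below. The interfaces SketchContext and SketchOptions are local, hypothetical stand-ins that mirror only the documented property names, so the snippet type-checks without any package. In a real project the object would be passed to the crawler's constructor instead. All values are illustrative and are not recommended defaults.

```typescript
// Hypothetical local stand-ins mirroring the documented property names.
interface SketchContext {
    request: { url: string };
}

interface SketchOptions {
    requestHandler: (ctx: SketchContext) => Promise<void>;
    failedRequestHandler?: (ctx: SketchContext, error: Error) => Promise<void>;
    maxRequestRetries?: number;
    maxRequestsPerCrawl?: number;
    maxRequestsPerMinute?: number;
    minConcurrency?: number;
    maxConcurrency?: number;
    requestHandlerTimeoutSecs?: number;
    keepAlive?: boolean;
    useSessionPool?: boolean;
}

const visited: string[] = [];

const crawlerOptions: SketchOptions = {
    maxRequestRetries: 2,
    maxRequestsPerCrawl: 100, // guards against runaway crawls
    maxRequestsPerMinute: 120,
    minConcurrency: 1,
    maxConcurrency: 10,
    requestHandlerTimeoutSecs: 30,
    keepAlive: false,
    useSessionPool: true,
    // Errors are left uncaught on purpose so that retries can happen.
    requestHandler: async ({ request }) => {
        visited.push(request.url);
    },
    failedRequestHandler: async ({ request }, error) => {
        console.error(`Giving up on ${request.url}: ${error.message}`);
    },
};
```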