# Crawlee for JavaScript · Build reliable crawlers. Fast. - [Build reliable web scrapers. Fast.](https://crawlee.dev/index.md) ## blog - [Crawlee Blog - learn how to build better scrapers](https://crawlee.dev/blog.md) - [Archive](https://crawlee.dev/blog/archive.md) - [Authors](https://crawlee.dev/blog/authors.md) - [Current problems and mistakes of web scraping in Python and tricks to solve them!](https://crawlee.dev/blog/common-problems-in-web-scraping.md) - [Launching Crawlee Blog](https://crawlee.dev/blog/crawlee-blog-launch.md) - [Crawlee for Python v0.5](https://crawlee.dev/blog/crawlee-for-python-v05.md) - [Crawlee for Python v0.6](https://crawlee.dev/blog/crawlee-for-python-v06.md) - [Crawlee for Python v1](https://crawlee.dev/blog/crawlee-for-python-v1.md) - [How to build a price tracker with Crawlee and Apify](https://crawlee.dev/blog/crawlee-python-price-tracker.md) - [Reverse engineering GraphQL persistedQuery extension](https://crawlee.dev/blog/graphql-persisted-query.md) - [How to scrape Amazon products](https://crawlee.dev/blog/how-to-scrape-amazon.md) - [How to scrape infinite scrolling webpages with Python](https://crawlee.dev/blog/infinite-scroll-using-python.md) - [Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers](https://crawlee.dev/blog/launching-crawlee-python.md) - [How to create a LinkedIn job scraper in Python with Crawlee](https://crawlee.dev/blog/linkedin-job-scraper-python.md) - [Building a Netflix show recommender using Crawlee and React](https://crawlee.dev/blog/netflix-show-recommender.md) - [Crawlee Blog - learn how to build better scrapers](https://crawlee.dev/blog/page/2.md) - [Crawlee Blog - learn how to build better scrapers](https://crawlee.dev/blog/page/3.md) - [How Crawlee uses tiered proxies to avoid getting blocked](https://crawlee.dev/blog/proxy-management-in-crawlee.md) - [How to scrape Bluesky with Python](https://crawlee.dev/blog/scrape-bluesky-using-python.md) - [How to scrape Crunchbase using Python in 2024 (Easy Guide)](https://crawlee.dev/blog/scrape-crunchbase-python.md) - [How to scrape Google Maps data using Python](https://crawlee.dev/blog/scrape-google-maps.md) - [How to scrape Google search results with Python](https://crawlee.dev/blog/scrape-google-search.md) - [How to scrape TikTok using Python](https://crawlee.dev/blog/scrape-tiktok-python.md) - [Optimizing web scraping: Scraping auth data using JSDOM](https://crawlee.dev/blog/scrape-using-jsdom.md) - [How to scrape YouTube using Python [2025 guide]](https://crawlee.dev/blog/scrape-youtube-python.md) - [Web scraping of a dynamic website using Python with HTTP Client](https://crawlee.dev/blog/scraping-dynamic-websites-using-python.md) - [Scrapy vs. Crawlee](https://crawlee.dev/blog/scrapy-vs-crawlee.md) - [Inside implementing SuperScraper with Crawlee](https://crawlee.dev/blog/superscraper-with-crawlee.md) - [Tags](https://crawlee.dev/blog/tags.md) - [10 posts tagged with "community"](https://crawlee.dev/blog/tags/community.md) - [One post tagged with "proxy"](https://crawlee.dev/blog/tags/proxy.md) - [12 tips on how to think like a web scraping expert](https://crawlee.dev/blog/web-scraping-tips.md) ## js - [Build reliable web scrapers. 
Fast.](https://crawlee.dev/js.md) - [API](https://crawlee.dev/js/api.md) - [@crawlee/basic](https://crawlee.dev/js/api/basic-crawler.md) - [Changelog](https://crawlee.dev/js/api/basic-crawler/changelog.md) - [BasicCrawler ](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) - [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) - [BasicCrawlerOptions ](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) - [BasicCrawlingContext ](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) - [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) - [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) - [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) - [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) - [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) - [StatusMessageCallbackParams ](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) - [@crawlee/browser](https://crawlee.dev/js/api/browser-crawler.md) - [Changelog](https://crawlee.dev/js/api/browser-crawler/changelog.md) - [abstractBrowserCrawler ](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) - [BrowserCrawlerOptions ](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md) - [BrowserCrawlingContext ](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) - [BrowserLaunchContext ](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) - [@crawlee/browser-pool](https://crawlee.dev/js/api/browser-pool.md) - [Changelog](https://crawlee.dev/js/api/browser-pool/changelog.md) - [abstractBrowserController ](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) - [BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) - [abstractBrowserPlugin ](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md) - [BrowserPool ](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) - [LaunchContext ](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) - [PlaywrightBrowser](https://crawlee.dev/js/api/browser-pool/class/PlaywrightBrowser.md) - [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) - [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) - [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) - [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) - [constBROWSER_CONTROLLER_EVENTS](https://crawlee.dev/js/api/browser-pool/enum/BROWSER_CONTROLLER_EVENTS.md) - [constBROWSER_POOL_EVENTS](https://crawlee.dev/js/api/browser-pool/enum/BROWSER_POOL_EVENTS.md) - [BrowserName](https://crawlee.dev/js/api/browser-pool/enum/BrowserName.md) - [constDeviceCategory](https://crawlee.dev/js/api/browser-pool/enum/DeviceCategory.md) - [constOperatingSystemsName](https://crawlee.dev/js/api/browser-pool/enum/OperatingSystemsName.md) - [BrowserControllerEvents ](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md) - [BrowserPluginOptions ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md) - [BrowserPoolEvents 
](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md) - [BrowserPoolHooks ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md) - [BrowserPoolNewPageInNewBrowserOptions ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageInNewBrowserOptions.md) - [BrowserPoolNewPageOptions ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageOptions.md) - [BrowserPoolOptions ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md) - [BrowserSpecification](https://crawlee.dev/js/api/browser-pool/interface/BrowserSpecification.md) - [CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md) - [CreateLaunchContextOptions ](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md) - [FingerprintGenerator](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGenerator.md) - [FingerprintGeneratorOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGeneratorOptions.md) - [FingerprintOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md) - [GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) - [LaunchContextOptions ](https://crawlee.dev/js/api/browser-pool/interface/LaunchContextOptions.md) - [@crawlee/cheerio](https://crawlee.dev/js/api/cheerio-crawler.md) - [Changelog](https://crawlee.dev/js/api/cheerio-crawler/changelog.md) - [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) - [createCheerioRouter](https://crawlee.dev/js/api/cheerio-crawler/function/createCheerioRouter.md) - [CheerioCrawlerOptions ](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md) - [CheerioCrawlingContext ](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md) - [@crawlee/core](https://crawlee.dev/js/api/core.md) - [Changelog](https://crawlee.dev/js/api/core/changelog.md) - [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) - [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) - [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) - [Dataset ](https://crawlee.dev/js/api/core/class/Dataset.md) - [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) - [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) - [abstractEventManager](https://crawlee.dev/js/api/core/class/EventManager.md) - [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) - [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) - [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) - [externalLog](https://crawlee.dev/js/api/core/class/Log.md) - [externalLogger](https://crawlee.dev/js/api/core/class/Logger.md) - [externalLoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) - [externalLoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) - [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) - [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) - [externalPseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) - [RecoverableState ](https://crawlee.dev/js/api/core/class/RecoverableState.md) - [Request ](https://crawlee.dev/js/api/core/class/Request.md) - [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) - 
[RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) - [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) - [abstractRequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) - [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) - [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) - [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) - [Router ](https://crawlee.dev/js/api/core/class/Router.md) - [Session](https://crawlee.dev/js/api/core/class/Session.md) - [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) - [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) - [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) - [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) - [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) - [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) - [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) - [constEventType](https://crawlee.dev/js/api/core/enum/EventType.md) - [externalLogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) - [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) - [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) - [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) - [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) - [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) - [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) - [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) - [useState](https://crawlee.dev/js/api/core/function/useState.md) - [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) - [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) - [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) - [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) - [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) - [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) - [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) - [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) - [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) - [CrawlingContext ](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) - [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) - [DatasetConsumer ](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) - [DatasetContent ](https://crawlee.dev/js/api/core/interface/DatasetContent.md) - [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) - [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) - [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) - [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) - [DatasetMapper ](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) - 
[DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) - [DatasetReducer ](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) - [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) - [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) - [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) - [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) - [HttpRequest ](https://crawlee.dev/js/api/core/interface/HttpRequest.md) - [HttpRequestOptions ](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) - [HttpResponse ](https://crawlee.dev/js/api/core/interface/HttpResponse.md) - [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) - [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) - [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) - [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) - [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) - [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) - [externalLoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) - [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) - [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) - [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) - [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) - [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) - [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) - [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) - [RecoverableStateOptions ](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) - [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) - [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) - [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) - [RequestOptions ](https://crawlee.dev/js/api/core/interface/RequestOptions.md) - [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) - [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) - [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) - [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) - [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) - [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) - [RestrictedCrawlingContext ](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) - [RouterHandler ](https://crawlee.dev/js/api/core/interface/RouterHandler.md) - [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) - [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) - [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) - [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) - 
[SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) - [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) - [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) - [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) - [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) - [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) - [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) - [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) - [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) - [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) - [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) - [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) - [@crawlee/http](https://crawlee.dev/js/api/http-crawler.md) - [Changelog](https://crawlee.dev/js/api/http-crawler/changelog.md) - [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) - [HttpCrawler ](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) - [ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) - [createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) - [createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) - [MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) - [FileDownloadCrawlingContext ](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) - [HttpCrawlerOptions ](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) - [HttpCrawlingContext ](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) - [@crawlee/jsdom](https://crawlee.dev/js/api/jsdom-crawler.md) - [Changelog](https://crawlee.dev/js/api/jsdom-crawler/changelog.md) - [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) - [createJSDOMRouter](https://crawlee.dev/js/api/jsdom-crawler/function/createJSDOMRouter.md) - [JSDOMCrawlerOptions ](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md) - [JSDOMCrawlingContext ](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md) - [@crawlee/linkedom](https://crawlee.dev/js/api/linkedom-crawler.md) - [Changelog](https://crawlee.dev/js/api/linkedom-crawler/changelog.md) - [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) - [createLinkeDOMRouter](https://crawlee.dev/js/api/linkedom-crawler/function/createLinkeDOMRouter.md) - [LinkeDOMCrawlerEnqueueLinksOptions](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerEnqueueLinksOptions.md) - [LinkeDOMCrawlerOptions ](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md) - [LinkeDOMCrawlingContext ](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md) - [@crawlee/memory-storage](https://crawlee.dev/js/api/memory-storage.md) - [Changelog](https://crawlee.dev/js/api/memory-storage/changelog.md) - [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) - [MemoryStorageOptions](https://crawlee.dev/js/api/memory-storage/interface/MemoryStorageOptions.md) - 
[@crawlee/playwright](https://crawlee.dev/js/api/playwright-crawler.md) - [Changelog](https://crawlee.dev/js/api/playwright-crawler/changelog.md) - [AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) - [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) - [RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md) - [createAdaptivePlaywrightRouter](https://crawlee.dev/js/api/playwright-crawler/function/createAdaptivePlaywrightRouter.md) - [createPlaywrightRouter](https://crawlee.dev/js/api/playwright-crawler/function/createPlaywrightRouter.md) - [launchPlaywright](https://crawlee.dev/js/api/playwright-crawler/function/launchPlaywright.md) - [AdaptivePlaywrightCrawlerContext ](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md) - [AdaptivePlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerOptions.md) - [PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) - [PlaywrightCrawlingContext ](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md) - [PlaywrightHook](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightHook.md) - [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) - [PlaywrightRequestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightRequestHandler.md) - [playwrightClickElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md) - [playwrightUtils](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md) - [@crawlee/puppeteer](https://crawlee.dev/js/api/puppeteer-crawler.md) - [Changelog](https://crawlee.dev/js/api/puppeteer-crawler/changelog.md) - [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) - [createPuppeteerRouter](https://crawlee.dev/js/api/puppeteer-crawler/function/createPuppeteerRouter.md) - [launchPuppeteer](https://crawlee.dev/js/api/puppeteer-crawler/function/launchPuppeteer.md) - [PuppeteerCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md) - [PuppeteerCrawlingContext ](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md) - [PuppeteerHook](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerHook.md) - [PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) - [PuppeteerRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerRequestHandler.md) - [puppeteerClickElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md) - [puppeteerRequestInterception](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md) - [puppeteerUtils](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md) - [@crawlee/types](https://crawlee.dev/js/api/types.md) - [Changelog](https://crawlee.dev/js/api/types/changelog.md) - [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) - [BrowserLikeResponse](https://crawlee.dev/js/api/types/interface/BrowserLikeResponse.md) - [Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md) - [DatasetClient 
](https://crawlee.dev/js/api/types/interface/DatasetClient.md) - [DatasetClientListOptions](https://crawlee.dev/js/api/types/interface/DatasetClientListOptions.md) - [DatasetClientUpdateOptions](https://crawlee.dev/js/api/types/interface/DatasetClientUpdateOptions.md) - [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) - [DatasetCollectionClientOptions](https://crawlee.dev/js/api/types/interface/DatasetCollectionClientOptions.md) - [DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md) - [DatasetInfo](https://crawlee.dev/js/api/types/interface/DatasetInfo.md) - [DatasetStats](https://crawlee.dev/js/api/types/interface/DatasetStats.md) - [DeleteRequestLockOptions](https://crawlee.dev/js/api/types/interface/DeleteRequestLockOptions.md) - [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) - [KeyValueStoreClientGetRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientGetRecordOptions.md) - [KeyValueStoreClientListData](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListData.md) - [KeyValueStoreClientListOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListOptions.md) - [KeyValueStoreClientUpdateOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientUpdateOptions.md) - [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) - [KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md) - [KeyValueStoreItemData](https://crawlee.dev/js/api/types/interface/KeyValueStoreItemData.md) - [KeyValueStoreRecord](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecord.md) - [KeyValueStoreRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecordOptions.md) - [KeyValueStoreStats](https://crawlee.dev/js/api/types/interface/KeyValueStoreStats.md) - [ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md) - [ListAndLockOptions](https://crawlee.dev/js/api/types/interface/ListAndLockOptions.md) - [ListOptions](https://crawlee.dev/js/api/types/interface/ListOptions.md) - [PaginatedList ](https://crawlee.dev/js/api/types/interface/PaginatedList.md) - [ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md) - [ProlongRequestLockOptions](https://crawlee.dev/js/api/types/interface/ProlongRequestLockOptions.md) - [ProlongRequestLockResult](https://crawlee.dev/js/api/types/interface/ProlongRequestLockResult.md) - [QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md) - [RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) - [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) - [RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) - [RequestQueueHeadItem](https://crawlee.dev/js/api/types/interface/RequestQueueHeadItem.md) - [RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md) - [RequestQueueOptions](https://crawlee.dev/js/api/types/interface/RequestQueueOptions.md) - [RequestQueueStats](https://crawlee.dev/js/api/types/interface/RequestQueueStats.md) - [RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md) - [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) - 
[UnprocessedRequest](https://crawlee.dev/js/api/types/interface/UnprocessedRequest.md) - [UpdateRequestSchema](https://crawlee.dev/js/api/types/interface/UpdateRequestSchema.md) - [@crawlee/utils](https://crawlee.dev/js/api/utils.md) - [Changelog](https://crawlee.dev/js/api/utils/changelog.md) - [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) - [Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md) - [chunk](https://crawlee.dev/js/api/utils/function/chunk.md) - [createRequestDebugInfo](https://crawlee.dev/js/api/utils/function/createRequestDebugInfo.md) - [downloadListOfUrls](https://crawlee.dev/js/api/utils/function/downloadListOfUrls.md) - [extractUrls](https://crawlee.dev/js/api/utils/function/extractUrls.md) - [extractUrlsFromCheerio](https://crawlee.dev/js/api/utils/function/extractUrlsFromCheerio.md) - [getCgroupsVersion](https://crawlee.dev/js/api/utils/function/getCgroupsVersion.md) - [getMemoryInfo](https://crawlee.dev/js/api/utils/function/getMemoryInfo.md) - [getObjectType](https://crawlee.dev/js/api/utils/function/getObjectType.md) - [gotScraping](https://crawlee.dev/js/api/utils/function/gotScraping.md) - [htmlToText](https://crawlee.dev/js/api/utils/function/htmlToText.md) - [isContainerized](https://crawlee.dev/js/api/utils/function/isContainerized.md) - [isDocker](https://crawlee.dev/js/api/utils/function/isDocker.md) - [isLambda](https://crawlee.dev/js/api/utils/function/isLambda.md) - [parseOpenGraph](https://crawlee.dev/js/api/utils/function/parseOpenGraph.md) - [parseSitemap](https://crawlee.dev/js/api/utils/function/parseSitemap.md) - [sleep](https://crawlee.dev/js/api/utils/function/sleep.md) - [DownloadListOfUrlsOptions](https://crawlee.dev/js/api/utils/interface/DownloadListOfUrlsOptions.md) - [ExtractUrlsOptions](https://crawlee.dev/js/api/utils/interface/ExtractUrlsOptions.md) - [MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md) - [OpenGraphProperty](https://crawlee.dev/js/api/utils/interface/OpenGraphProperty.md) - [ParseSitemapOptions](https://crawlee.dev/js/api/utils/interface/ParseSitemapOptions.md) - [social](https://crawlee.dev/js/api/utils/namespace/social.md) - [Deployment guides](https://crawlee.dev/js/docs/deployment.md) - [Apify Platform](https://crawlee.dev/js/docs/deployment/apify-platform.md) - [Browsers on AWS Lambda](https://crawlee.dev/js/docs/deployment/aws-browsers.md) - [Cheerio on AWS Lambda](https://crawlee.dev/js/docs/deployment/aws-cheerio.md) - [Browsers in GCP Cloud Run](https://crawlee.dev/js/docs/deployment/gcp-browsers.md) - [Cheerio on GCP Cloud Functions](https://crawlee.dev/js/docs/deployment/gcp-cheerio.md) - [Examples](https://crawlee.dev/js/docs/examples.md) - [Accept user input](https://crawlee.dev/js/docs/examples/accept-user-input.md) - [Add data to dataset](https://crawlee.dev/js/docs/examples/add-data-to-dataset.md) - [Basic crawler](https://crawlee.dev/js/docs/examples/basic-crawler.md) - [Capture a screenshot using Puppeteer](https://crawlee.dev/js/docs/examples/capture-screenshot.md) - [Cheerio crawler](https://crawlee.dev/js/docs/examples/cheerio-crawler.md) - [Crawl all links on a website](https://crawlee.dev/js/docs/examples/crawl-all-links.md) - [Crawl multiple URLs](https://crawlee.dev/js/docs/examples/crawl-multiple-urls.md) - [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) - [Crawl a single URL](https://crawlee.dev/js/docs/examples/crawl-single-url.md) - [Crawl a 
sitemap](https://crawlee.dev/js/docs/examples/crawl-sitemap.md) - [Crawl some links on a website](https://crawlee.dev/js/docs/examples/crawl-some-links.md) - [Using Puppeteer Stealth Plugin (puppeteer-extra) and playwright-extra](https://crawlee.dev/js/docs/examples/crawler-plugins.md) - [Export entire dataset to one file](https://crawlee.dev/js/docs/examples/export-entire-dataset.md) - [Download a file](https://crawlee.dev/js/docs/examples/file-download.md) - [Download a file with Node.js streams](https://crawlee.dev/js/docs/examples/file-download-stream.md) - [Fill and Submit a Form using Puppeteer](https://crawlee.dev/js/docs/examples/forms.md) - [HTTP crawler](https://crawlee.dev/js/docs/examples/http-crawler.md) - [JSDOM crawler](https://crawlee.dev/js/docs/examples/jsdom-crawler.md) - [Dataset Map and Reduce methods](https://crawlee.dev/js/docs/examples/map-and-reduce.md) - [Playwright crawler](https://crawlee.dev/js/docs/examples/playwright-crawler.md) - [Using Firefox browser with Playwright crawler](https://crawlee.dev/js/docs/examples/playwright-crawler-firefox.md) - [Puppeteer crawler](https://crawlee.dev/js/docs/examples/puppeteer-crawler.md) - [Puppeteer recursive crawl](https://crawlee.dev/js/docs/examples/puppeteer-recursive-crawl.md) - [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) - [Experiments](https://crawlee.dev/js/docs/experiments.md) - [Request Locking](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) - [System Infomation V2](https://crawlee.dev/js/docs/experiments/experiments-system-infomation-v2.md) - [Guides](https://crawlee.dev/js/docs/guides.md) - [Avoid getting blocked](https://crawlee.dev/js/docs/guides/avoid-blocking.md) - [CheerioCrawler guide](https://crawlee.dev/js/docs/guides/cheerio-crawler-guide.md) - [Configuration](https://crawlee.dev/js/docs/guides/configuration.md) - [Using a custom HTTP client (Experimental)](https://crawlee.dev/js/docs/guides/custom-http-client.md) - [Running in Docker](https://crawlee.dev/js/docs/guides/docker-images.md) - [Got Scraping](https://crawlee.dev/js/docs/guides/got-scraping.md) - [Impit HTTP Client](https://crawlee.dev/js/docs/guides/impit-http-client.md) - [JavaScript rendering](https://crawlee.dev/js/docs/guides/javascript-rendering.md) - [JSDOMCrawler guide](https://crawlee.dev/js/docs/guides/jsdom-crawler-guide.md) - [motivation](https://crawlee.dev/js/docs/guides/motivation.md) - [Parallel Scraping Guide](https://crawlee.dev/js/docs/guides/parallel-scraping.md) - [Proxy Management](https://crawlee.dev/js/docs/guides/proxy-management.md) - [Request Storage](https://crawlee.dev/js/docs/guides/request-storage.md) - [Result Storage](https://crawlee.dev/js/docs/guides/result-storage.md) - [Running in web server](https://crawlee.dev/js/docs/guides/running-in-web-server.md) - [Scaling our crawlers](https://crawlee.dev/js/docs/guides/scaling-crawlers.md) - [Session Management](https://crawlee.dev/js/docs/guides/session-management.md) - [TypeScript Projects](https://crawlee.dev/js/docs/guides/typescript-project.md) - [Introduction](https://crawlee.dev/js/docs/introduction.md) - [Adding more URLs](https://crawlee.dev/js/docs/introduction/adding-urls.md) - [Crawling the Store](https://crawlee.dev/js/docs/introduction/crawling.md) - [Running your crawler in the Cloud](https://crawlee.dev/js/docs/introduction/deployment.md) - [First crawler](https://crawlee.dev/js/docs/introduction/first-crawler.md) - [Getting some real-world 
data](https://crawlee.dev/js/docs/introduction/real-world-project.md) - [Refactoring](https://crawlee.dev/js/docs/introduction/refactoring.md) - [Saving data](https://crawlee.dev/js/docs/introduction/saving-data.md) - [Scraping the Store](https://crawlee.dev/js/docs/introduction/scraping.md) - [Setting up](https://crawlee.dev/js/docs/introduction/setting-up.md) - [Quick Start](https://crawlee.dev/js/docs/quick-start.md) - [Upgrading](https://crawlee.dev/js/docs/upgrading.md) - [Upgrading to v1](https://crawlee.dev/js/docs/upgrading/upgrading-to-v1.md) - [Upgrading to v2](https://crawlee.dev/js/docs/upgrading/upgrading-to-v2.md) - [Upgrading to v3](https://crawlee.dev/js/docs/upgrading/upgrading-to-v3.md) ## search - [Search the documentation](https://crawlee.dev/search.md) ## Optional - [Crawlee for Python llms.txt](https://crawlee.dev/python/llms.txt) - [Crawlee for Python llms-full.txt](https://crawlee.dev/python/llms-full.txt) --- # Full Documentation Content ## [Crawlee for Python v1](https://crawlee.dev/blog/crawlee-for-python-v1.md) September 15, 2025 · 15 min read [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) [Vlada Dusek](https://github.com/vdusek) Developer of Crawlee for Python We launched Crawlee for Python in beta mode in [July 2024](https://www.crawlee.dev/blog/launching-crawlee-python). Over the past year, we received many early adopters, tremendous interest in the library from the Python community, more than 6000 stars on GitHub, a dozen contributors, and many feature requests. After months of development, polishing, and community feedback, the library is leaving beta and entering a production/stable development status. **We are happy to announce Crawlee for Python v1.0.** From now on, Crawlee for Python will strictly follow [semantic versioning](https://www.semver.org/). You can now rely on it as a stable foundation for your crawling and scraping projects, knowing that breaking changes will only occur in major releases. 
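Since the project now follows semantic versioning, you can pin your dependency to the v1 major line and keep receiving new features and bug fixes without unexpected breaking changes. A minimal sketch of such a version constraint (assuming a plain installation from PyPI; any extras for specific crawler types would be added to the package name):

```
pip install "crawlee>=1.0,<2.0"
```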
## What's new in Crawlee for Python v1[​](#whats-new-in-crawlee-for-python-v1 "Direct link to What's new in Crawlee for Python v1") * [New storage client system](#new-storage-client-system) * [Adaptive Playwright crawler](#adaptive-playwright-crawler) * [Impit HTTP client](#impit-http-client) * [Sitemap request loader](#sitemap-request-loader) * [Robots exclusion standard](#robots-exclusion-standard) * [Fingerprinting](#fingerprinting) * [Open telemetry](#open-telemetry) ![Crawlee for Python v1.0](/assets/images/crawlee_v100-d491a6c5406c55e0bfcdc9b39b81b7ae.webp) [**Read More**](https://crawlee.dev/blog/crawlee-for-python-v1.md) --- ### 2025[​](#2025 "Direct link to 2025") * [January 3](https://crawlee.dev/blog/scrape-crunchbase-python.md) [ - ](https://crawlee.dev/blog/scrape-crunchbase-python.md) [How to scrape Crunchbase using Python in 2024 (Easy Guide)](https://crawlee.dev/blog/scrape-crunchbase-python.md) * [January 10](https://crawlee.dev/blog/crawlee-for-python-v05.md) [ - ](https://crawlee.dev/blog/crawlee-for-python-v05.md) [Crawlee for Python v0.5](https://crawlee.dev/blog/crawlee-for-python-v05.md) * [March 5](https://crawlee.dev/blog/superscraper-with-crawlee.md) [ - ](https://crawlee.dev/blog/superscraper-with-crawlee.md) [Inside implementing SuperScraper with Crawlee](https://crawlee.dev/blog/superscraper-with-crawlee.md) * [March 6](https://crawlee.dev/blog/crawlee-for-python-v06.md) [ - ](https://crawlee.dev/blog/crawlee-for-python-v06.md) [Crawlee for Python v0.6](https://crawlee.dev/blog/crawlee-for-python-v06.md) * [March 20](https://crawlee.dev/blog/scrape-bluesky-using-python.md) [ - ](https://crawlee.dev/blog/scrape-bluesky-using-python.md) [How to scrape Bluesky with Python](https://crawlee.dev/blog/scrape-bluesky-using-python.md) * [April 8](https://crawlee.dev/blog/crawlee-python-price-tracker.md) [ - ](https://crawlee.dev/blog/crawlee-python-price-tracker.md) [How to build a price tracker with Crawlee and Apify](https://crawlee.dev/blog/crawlee-python-price-tracker.md) * [April 25](https://crawlee.dev/blog/scrape-tiktok-python.md) [ - ](https://crawlee.dev/blog/scrape-tiktok-python.md) [How to scrape TikTok using Python](https://crawlee.dev/blog/scrape-tiktok-python.md) * [July 14](https://crawlee.dev/blog/scrape-youtube-python.md) [ - ](https://crawlee.dev/blog/scrape-youtube-python.md) [How to scrape YouTube using Python \[2025 guide\]](https://crawlee.dev/blog/scrape-youtube-python.md) * [September 15](https://crawlee.dev/blog/crawlee-for-python-v1.md) [ - ](https://crawlee.dev/blog/crawlee-for-python-v1.md) [Crawlee for Python v1](https://crawlee.dev/blog/crawlee-for-python-v1.md) --- # Authors * [![Percival Villalva](https://avatars.githubusercontent.com/u/70678259?v=4)](https://github.com/PerVillalva) ## [Percival Villalva](https://github.com/PerVillalva) 1 Community Member of Crawlee [](https://github.com/PerVillalva "GitHub") * [![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) ## [Saurav Jain](https://github.com/souravjain540) 8 Developer Community Manager [](https://x.com/sauain "X")[](https://github.com/souravjain540 "GitHub") * [![Arindam Majumder](https://avatars.githubusercontent.com/u/109217591?v=4)](https://github.com/Arindam200) ## [Arindam Majumder](https://github.com/Arindam200) 1 Community Member of Crawlee [](https://x.com/Arindam_1729 "X")[](https://github.com/Arindam200 "GitHub") * [![Ayush Thakur](https://avatars.githubusercontent.com/u/43995654?v=4)](https://github.com/ayush2390) ## 
[Ayush Thakur](https://github.com/ayush2390) 1 Community Member of Crawlee [](https://x.com/JSAyushThakur "X")[](https://github.com/ayush2390 "GitHub") * [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) ## [Max](https://github.com/Mantisus) 8 Community Member of Crawlee and web scraping expert [](https://github.com/Mantisus "GitHub") * [![Lukáš Průša](./img/lukasp.webp)](https://github.com/Patai5) ## [Lukáš Průša](https://github.com/Patai5) 1 Junior Web Automation Engineer [](https://github.com/Patai5 "GitHub") * [![Matěj Volf](https://avatars.githubusercontent.com/u/31281386?v=4)](https://github.com/mvolfik) ## [Matěj Volf](https://github.com/mvolfik) 1 Web Automation Engineer [](https://github.com/mvolfik "GitHub") * [![Satyam Tripathi](https://avatars.githubusercontent.com/u/69134468?v=4)](https://github.com/triposat) ## [Satyam Tripathi](https://github.com/triposat) 1 Community Member of Crawlee [](https://github.com/triposat "GitHub") * [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) ## [Vlada Dusek](https://github.com/vdusek) 3 Developer of Crawlee for Python [](https://github.com/vdusek "GitHub") * [![Radoslav Chudovský](https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512)](https://github.com/chudovskyr) ## [Radoslav Chudovský](https://github.com/chudovskyr) 1 Web Automation Engineer [](https://github.com/chudovskyr "GitHub") --- # Current problems and mistakes of web scraping in Python and tricks to solve them! August 20, 2024 · 17 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert ## Introduction[​](#introduction "Direct link to Introduction") Greetings! I'm [Max](https://apify.com/mantisus), a Python developer from Ukraine, a developer with expertise in web scraping, data analysis, and processing. My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as [Import.io](https://www.import.io/) and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when [`requests`](https://requests.readthedocs.io/en/latest/) and [`lxml`](https://lxml.de/)/[`beautifulsoup`](https://beautiful-soup-4.readthedocs.io/en/latest/) were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :) note One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). As a freelancer, I've built small solutions and large, complex data mining systems for products over the years. Today, I want to discuss the realities of [web scraping with Python in 2024](https://blog.apify.com/web-scraping-python/). We'll look at the mistakes I sometimes see and the problems you'll encounter and offer solutions to some of them. Let's get started. Just take `requests` and `beautifulsoup` and start making a lot of money... No, this is not that kind of article. ## 1. "I got a 200 response from the server, but it's an unreadable character set."[​](#1-i-got-a-200-response-from-the-server-but-its-an-unreadable-character-set "Direct link to 1. 
\"I got a 200 response from the server, but it's an unreadable character set.\"") Yes, it can be surprising. But I've seen this message from customers and developers six years ago, four years ago, and in 2024. I read a post on Reddit just a few months ago about this issue. Let's look at a simple code example. This will work for `requests`, [`httpx`](https://www.python-httpx.org/), and [`aiohttp`](https://docs.aiohttp.org/en/stable/client.html#aiohttp-client) with a clean installation and no extensions. ``` import httpx url = 'https://www.wayfair.com/' headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate, br, zstd", "Connection": "keep-alive", } response = httpx.get(url, headers=headers) print(response.content[:10]) ``` The print result will be similar to: ``` b'\x83\x0c\x00\x00\xc4\r\x8e4\x82\x8a' ``` It's not an error - it's a perfectly valid server response. It's encoded somehow. The answer lies in the `Accept-Encoding` header. In the example above, I just copied it from my browser, so it lists all the compression methods my browser supports: "gzip, deflate, br, zstd". The Wayfair backend supports compression with "br", which is [Brotli](https://github.com/google/brotli), and uses it as the most efficient method. This can happen if none of the libraries listed above have `Brotli` among their standard dependencies. However, they all support decompression from this format if you already have `Brotli` installed. Therefore, it's sufficient to install the appropriate library: ``` pip install Brotli ``` This will allow the same print call to return readable, decompressed HTML instead of the raw bytes shown above. --- # Launching Crawlee Blog 3 min read [![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager Hey, crawling masters! I’m Saurav, Developer Community Manager at Apify, and I’m thrilled to announce that we’re launching the Crawlee blog today 🎉 We launched Crawlee, the successor to our Apify SDK, in [August 2022](https://blog.apify.com/announcing-crawlee-the-web-scraping-and-browser-automation-library/) to make the best web scraping and automation library for Node.js developers who like to write code in JavaScript or TypeScript. Since then, our dev community has grown exponentially. I’m proud to tell you that we have **over 11,500 Stars on GitHub**, over **6,000 community members on our Discord**, and over **125,000 downloads monthly on npm**. We’re now the most popular web scraping and automation library for Node.js developers 👏 ## Changes in Crawlee since the launch[​](#changes-in-crawlee-since-the-launch "Direct link to Changes in Crawlee since the launch") Crawlee has progressively evolved with the introduction of key features to enhance web scraping and automation: * [v3.1](https://github.com/apify/crawlee/releases/tag/v3.1.0) added an [error tracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) for analyzing and summarizing failed requests. * The [v3.3](https://github.com/apify/crawlee/releases/tag/v3.3.0) update brought an `exclude` option to the `enqueueLinks` helper and integrated status messages. This improved usability on the Apify platform with automatic summary updates in the console UI.
* [v3.4](https://github.com/apify/crawlee/releases/tag/v3.4.0) introduced the [`linkedom` crawler](https://crawlee.dev/js/api/linkedom-crawler.md), offering a new parsing option. * The [v3.5](https://github.com/apify/crawlee/releases/tag/v3.5.0) update optimized link enqueuing for efficiency. * [v3.6](https://github.com/apify/crawlee/releases/tag/v3.6.0) launched experimental support for a [new request queue API](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md), enabling parallel execution and improved scalability for multiple scrapers working concurrently. All of this marked significant strides in making web scraping more efficient and robust. ## Future of Crawlee\![​](#future-of-crawlee "Direct link to Future of Crawlee!") The Crawlee team is actively developing an adaptive crawling feature to revolutionize how Crawlee interacts with and navigates through websites. We just launched [v3.8](https://github.com/apify/crawlee/releases/tag/v3.8.0) with experimental support for the new [adaptive crawler type](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md). ## Support us on GitHub.[​](#support-us-on-github "Direct link to Support us on GitHub.") Before I tell you about our upcoming plans for Crawlee Blog, I recommend you check out Crawlee if you haven’t already. We are open-source. You can see our [source code here](https://github.com/apify/crawlee/). If you like Crawlee, then please don’t forget to give us a star on GitHub. ![Crawlee\_presentation\_final](https://github.com/souravjain540/crawlee-first-blog/assets/53312820/051ec8a3-86a7-4109-8fb3-135e399cbe93) ## Crawlee Blog and upcoming plans\![​](#crawlee-blog-and-upcoming-plans "Direct link to Crawlee Blog and upcoming plans!") The first step to achieving this goal is to reach out to the broader developer community through our content. The Crawlee blog aims to be the best informational hub for Node.js developers interested in web scraping and automation. **What to expect:** * How-to tutorials on making web crawlers, scrapers, and automation applications using Crawlee. * Thought leadership content on web crawling. * Crawlee feature updates and changes. * Community content collaboration. We’ll be posting content monthly for our dev community, so stay tuned! If you have ideas on specific content topics and want to give us input, please [join our Discord community](https://apify.com/discord) and tag me with your ideas. Also, we encourage collaboration with the community, so if you have some interesting pieces of content related to Crawlee, let us know in Discord, and we’ll feature them on our blog. 😀 In the meantime, you might want to check out this article on [Crawlee data storage types](https://blog.apify.com/crawlee-data-storage-types/) on the Apify Blog. --- # Crawlee for Python v0.5 January 10, 2025 · 7 min read [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) [Vlada Dusek](https://github.com/vdusek) Developer of Crawlee for Python Crawlee for Python v0.5 is now available! This is our biggest release to date, bringing new ported functionality from [Crawlee for JavaScript](https://github.com/apify/crawlee), brand-new features that are exclusive to the Python library (for now), a new consolidated package structure, and a bunch of bug fixes and further improvements.
## Getting started[​](#getting-started "Direct link to Getting started") You can upgrade to the latest version straight from [PyPI](https://pypi.org/project/crawlee/): ``` pip install --upgrade crawlee ``` Check out the full changelog on our [website](https://www.crawlee.dev/python/docs/changelog#050-2025-01-02) to see all the details. If you are updating from an older version, make sure to follow our [Upgrading to v0.5](https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v05) guide for a smooth upgrade. ## New package structure[​](#new-package-structure "Direct link to New package structure") We have introduced a new consolidated package structure. The goal is to streamline the development experience, help you find the crawlers you are looking for faster, and improve the IDE's code suggestions while importing. ### Crawlers[​](#crawlers "Direct link to Crawlers") We have grouped all crawler classes (and their corresponding crawling context classes) into a single sub-package called `crawlers`. Here is a quick example of how the imports have changed: ``` - from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext + from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext ``` Look how you can see all the crawlers that we have, isn't that cool! ![Import from crawlers subpackage.](/assets/images/import_crawlers-32dc36ba69192c5d936cbc8c05a9b946.webp) ### Storage clients[​](#storage-clients "Direct link to Storage clients") Similarly, we have moved all storage client classes under `storage_clients` sub-package. For instance: ``` - from crawlee.memory_storage_client import MemoryStorageClient + from crawlee.storage_clients import MemoryStorageClient ``` This consolidation makes it clearer where each class belongs and ensures that your IDE can provide better autocompletion when you are looking for the right crawler or storage client. ## Continued parity with Crawlee JS[​](#continued-parity-with-crawlee-js "Direct link to Continued parity with Crawlee JS") We are constantly working toward feature parity with our JavaScript library, [Crawlee JS](https://github.com/apify/crawlee). With v0.5, we have brought over more functionality: ### HTML to text context helper[​](#html-to-text-context-helper "Direct link to HTML to text context helper") The `html_to_text` crawling context helper simplifies extracting text from an HTML page by automatically removing all tags and returning only the raw text content. It's available in the [`ParselCrawlingContext`](https://www.crawlee.dev/python/api/class/ParselCrawlingContext#html_to_text) and [`BeautifulSoupCrawlingContext`](https://www.crawlee.dev/python/api/class/BeautifulSoupCrawlingContext#html_to_text). ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext async def main() -> None: crawler = ParselCrawler() @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info('Crawling: %s', context.request.url) text = context.html_to_text() # Continue with the processing... await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` In this example, we use a [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler) to fetch a webpage, then invoke `context.html_to_text()` to extract clean text for further processing. 
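The extracted text can then be handled like any other scraped value, for example by storing it in the default dataset with the `push_data` context helper. The following is only a sketch of one way to continue the handler above; the stored field names are illustrative:

```
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # Strip all tags and keep only the readable text of the page.
        text = context.html_to_text()
        # Save the cleaned text together with its source URL to the default dataset.
        await context.push_data({'url': context.request.url, 'text': text})

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```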
### Use state[​](#use-state "Direct link to Use state") The [`use_state`](https://www.crawlee.dev/python/api/class/UseStateFunction) crawling context helper makes it simple to create and manage persistent state values within your crawler. It ensures that all state values are automatically persisted. It enables you to maintain data across different crawler runs, restarts, and failures. It acts as a convenient abstraction for interaction with [`KeyValueStore`](https://www.crawlee.dev/python/api/class/KeyValueStore). ``` import asyncio from crawlee import Request from crawlee.configuration import Configuration from crawlee.crawlers import ParselCrawler, ParselCrawlingContext async def main() -> None: # Create a crawler with purge_on_start disabled to retain state across runs. crawler = ParselCrawler( configuration=Configuration(purge_on_start=False), ) @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info(f'Crawling {context.request.url}') # Retrieve or initialize the state with a default value. state = await context.use_state('state', default_value={'runs': 0}) # Increment the run count. state['runs'] += 1 # Create a request with always_enqueue enabled to bypass deduplication and ensure it is processed. request = Request.from_url('https://crawlee.dev/', always_enqueue=True) # Run the crawler with the start request. await crawler.run([request]) # Fetch the persisted state from the key-value store. kvs = await crawler.get_key_value_store() state = await kvs.get_auto_saved_value('state') crawler.log.info(f'Final state after run: {state}') if __name__ == '__main__': asyncio.run(main()) ``` Please note that the `use_state` is an experimental feature. Its behavior and interface may evolve in future versions. ## Brand new features[​](#brand-new-features "Direct link to Brand new features") In addition to porting features from JS, we are introducing new, Python-first functionalities that will eventually make their way into Crawlee JS in the coming months. ### Crawler's stop method[​](#crawlers-stop-method "Direct link to Crawler's stop method") The [`BasicCrawler`](https://www.crawlee.dev/python/api/class/BasicCrawler), and by extension, all crawlers that inherit from it, now has a [`stop`](https://www.crawlee.dev/python/api/class/BasicCrawler#stop) method. This makes it easy to halt the crawling when a specific condition is met, for instance, if you have found the data you were looking for. ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext async def main() -> None: crawler = ParselCrawler() @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info('Crawling: %s', context.request.url) # Extract and enqueue links from the page. await context.enqueue_links() title = context.selector.css('title::text').get() # Condition when you want to stop the crawler, e.g. you # have found what you were looking for. 
if 'Crawlee for Python' in title: context.log.info('Condition met, stopping the crawler.') await crawler.stop() await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` ### Request loaders[​](#request-loaders "Direct link to Request loaders") There are new classes [`RequestLoader`](https://www.crawlee.dev/python/api/class/RequestLoader), [`RequestManager`](https://www.crawlee.dev/python/api/class/RequestManager) and [`RequestManagerTandem`](https://www.crawlee.dev/python/api/class/RequestManagerTandem) that manage how Crawlee accesses and stores requests. They allow you to use other component (service) as a source for requests and optionally you can combine it with a [`RequestQueue`](https://www.crawlee.dev/python/api/class/RequestQueue). They let you plug in any request source, and combine the external data sources with Crawlee's standard `RequestQueue`. You can learn more about these new features in the [Request loaders guide](https://www.crawlee.dev/python/docs/guides/request-loaders). ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.request_loaders import RequestList, RequestManagerTandem from crawlee.storages import RequestQueue async def main() -> None: rl = RequestList( [ 'https://crawlee.dev', 'https://apify.com', # Long list of URLs... ], ) rq = await RequestQueue.open() # Combine them into a single request source. tandem = RequestManagerTandem(rl, rq) crawler = ParselCrawler(request_manager=tandem) @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info(f'Crawling {context.request.url}') # ... await crawler.run() if __name__ == '__main__': asyncio.run(main()) ``` In this example we combine a [`RequestList`](https://www.crawlee.dev/python/api/class/RequestList) with a [`RequestQueue`](https://www.crawlee.dev/python/api/class/RequestQueue). However, instead of the `RequestList` you can use any other class that implements the [`RequestLoader`](https://www.crawlee.dev/python/api/class/RequestLoader) interface to suit your specific requirements. ### Service locator[​](#service-locator "Direct link to Service locator") The [`ServiceLocator`](https://www.crawlee.dev/python/api/class/ServiceLocator) is primarily an internal mechanism for managing the services that Crawlee depends on. Specifically, the [`Configuration`](https://www.crawlee.dev/python/api/class/ServiceLocator), [`StorageClient`](https://www.crawlee.dev/python/api/class/ServiceLocator), and [`EventManager`](https://www.crawlee.dev/python/api/class/ServiceLocator). By swapping out these components, you can adapt Crawlee to suit different runtime environments. You can use the service locator explicitly: ``` import asyncio from crawlee import service_locator from crawlee.configuration import Configuration from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.events import LocalEventManager from crawlee.storage_clients import MemoryStorageClient async def main() -> None: service_locator.set_configuration(Configuration()) service_locator.set_storage_client(MemoryStorageClient()) service_locator.set_event_manager(LocalEventManager()) crawler = ParselCrawler() # ... 
if __name__ == '__main__': asyncio.run(main()) ``` Or pass the services directly to the crawler instance, and they will be set under the hood: ``` import asyncio from crawlee.configuration import Configuration from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.events import LocalEventManager from crawlee.storage_clients import MemoryStorageClient async def main() -> None: crawler = ParselCrawler( configuration=Configuration(), storage_client=MemoryStorageClient(), event_manager=LocalEventManager(), ) # ... if __name__ == '__main__': asyncio.run(main()) ``` ## Conclusion[​](#conclusion "Direct link to Conclusion") We are excited to share that Crawlee v0.5 is here. If you have any questions or feedback, please open a [GitHub discussion](https://github.com/apify/crawlee-python/discussions). If you encounter any bugs, or have an idea for a new feature, please open a [GitHub issue](https://github.com/apify/crawlee-python/issues). --- # Crawlee for Python v0.6 March 6, 2025 · 4 min read [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) [Vlada Dusek](https://github.com/vdusek) Developer of Crawlee for Python Crawlee for Python v0.6 is here, and it's packed with new features and important bug fixes. If you're upgrading from a previous version, please take a moment to review the breaking changes detailed below to ensure a smooth transition. ![Crawlee for Python v0.6.0](/assets/images/crawlee_v060-5cdf895baf62d5ab5beea47ce6502dec.webp) ## Getting started[​](#getting-started "Direct link to Getting started") You can upgrade to the latest version straight from [PyPI](https://www.pypi.org/project/crawlee/): ``` pip install --upgrade crawlee ``` Check out the full changelog on our [website](https://www.crawlee.dev/python/docs/changelog#060-2025-03-03) to see all the details. If you are updating from an older version, make sure to follow our [Upgrading to v0.6](https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v06) guide. ## Adaptive Playwright crawler[​](#adaptive-playwright-crawler "Direct link to Adaptive Playwright crawler") The new [`AdaptivePlaywrightCrawler`](https://www.crawlee.dev/python/api/class/AdaptivePlaywrightCrawler) is a hybrid solution that combines the best of two worlds: full browser rendering with [Playwright](https://www.playwright.dev/) and lightweight HTTP-based crawling (using, for example, [`BeautifulSoupCrawler`](https://www.crawlee.dev/python/api/class/BeautifulSoupCrawler) or [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler)). It automatically switches between the two methods based on real-time analysis of the target page, helping you achieve lower crawl costs and improved performance when crawling a variety of websites. The example below demonstrates how the `AdaptivePlaywrightCrawler` can handle both static and dynamic content. 
``` import asyncio from datetime import timedelta from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext async def main() -> None: crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser( max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'browser_type': 'chromium'}, ) @crawler.router.default_handler async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None: # Do some processing using `parsed_content` context.log.info(context.parsed_content.title) # Locate element h2 within 5 seconds h2 = await context.query_selector_one('h2', timedelta(seconds=5)) # Do stuff with element found by the selector context.log.info(h2) # Find more links and enqueue them. await context.enqueue_links() # Save some data. await context.push_data({'Visited url': context.request.url}) await crawler.run(['https://www.crawlee.dev/']) if __name__ == '__main__': asyncio.run(main()) ``` Check out our [Adaptive Playwright crawler guide](https://www.crawlee.dev/python/docs/guides/adaptive-playwright-crawler) for more details on how to use this new crawler. ## Browserforge fingerprints[​](#browserforge-fingerprints "Direct link to Browserforge fingerprints") To help you avoid detection and blocking, Crawlee now integrates the [browserforge](https://www.github.com/daijro/browserforge) library - intelligent browser header & fingerprint generator. This feature simulates real browser behavior by automatically randomizing HTTP headers and fingerprints, making your crawling sessions significantly more resilient against anti-bot measures. With [browserforge](https://www.github.com/daijro/browserforge) fingerprints enabled by default, your crawler sends realistic HTTP headers and user-agent strings. HTTP-based crawlers, which use [`HttpxHttpClient`](https://www.crawlee.dev/python/api/class/HttpxHttpClient) by default benefit from these adjustments, while the [`CurlImpersonateHttpClient`](https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient) employs its own stealthy techniques. The [`PlaywrightCrawler`](https://www.crawlee.dev/python/docs/guides/playwright-crawler) adjusts HTTP headers and browser fingerprints accordingly. Together, these improvements make your crawlers much harder to detect. Below is an example of using `PlaywrightCrawler`, which now benefits from the [browserforge](https://www.github.com/daijro/browserforge) library: ``` import asyncio from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext async def main() -> None: # The browserforge fingerprints and headers are used by default. crawler = PlaywrightCrawler() @crawler.router.default_handler async def handler(context: PlaywrightCrawlingContext) -> None: url = context.request.url context.log.info(f'Crawling URL: {url}') # Decode and log the response body, which contains the headers we sent. headers = (await context.response.body()).decode() context.log.info(f'Response headers: {headers}') # Extract and log the User-Agent and UA data used in the browser context. ua = await context.page.evaluate('() => window.navigator.userAgent') ua_data = await context.page.evaluate('() => window.navigator.userAgentData') context.log.info(f'Navigator user-agent: {ua}') context.log.info(f'Navigator user-agent data: {ua_data}') # The endpoint httpbin.org/headers returns the request headers in the response body. 
await crawler.run(['https://www.httpbin.org/headers']) if __name__ == '__main__': asyncio.run(main()) ``` For further details on utilizing [browserforge](https://www.github.com/daijro/browserforge) to avoid blocking, please refer to our [Avoid getting blocked guide](https://www.crawlee.dev/python/docs/guides/avoid-blocking). ## CLI dependencies[​](#cli-dependencies "Direct link to CLI dependencies") In v0.6, we've reduced the size of the core package by moving CLI (template creation) dependencies to optional extras. This change reduces the package footprint, keeping the base installation lightweight. To use Crawlee's CLI for creating new projects, simply install the package with the CLI extras. For example, to create a new project from a template using `pipx`, run: ``` pipx run 'crawlee[cli]' create my-crawler ``` Or with `uvx`: ``` uvx 'crawlee[cli]' create my-crawler ``` This change ensures that while the core package remains lean, you can still opt in to CLI functionality when bootstrapping new projects. ## Conclusion[​](#conclusion "Direct link to Conclusion") We are excited to share that Crawlee v0.6 is here. If you have any questions or feedback, please open a [GitHub discussion](https://www.github.com/apify/crawlee-python/discussions). If you encounter any bugs, or have an idea for a new feature, please open a [GitHub issue](https://www.github.com/apify/crawlee-python/issues). --- # Crawlee for Python v1 September 15, 2025 · 15 min read [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) [Vlada Dusek](https://github.com/vdusek) Developer of Crawlee for Python We launched Crawlee for Python in beta mode in [July 2024](https://www.crawlee.dev/blog/launching-crawlee-python). Over the past year, we received many early adopters, tremendous interest in the library from the Python community, more than 6000 stars on GitHub, a dozen contributors, and many feature requests. After months of development, polishing, and community feedback, the library is leaving beta and entering a production/stable development status. **We are happy to announce Crawlee for Python v1.0.** From now on, Crawlee for Python will strictly follow [semantic versioning](https://www.semver.org/). You can now rely on it as a stable foundation for your crawling and scraping projects, knowing that breaking changes will only occur in major releases. ## What's new in Crawlee for Python v1[​](#whats-new-in-crawlee-for-python-v1 "Direct link to What's new in Crawlee for Python v1") * [New storage client system](#new-storage-client-system) * [Adaptive Playwright crawler](#adaptive-playwright-crawler) * [Impit HTTP client](#impit-http-client) * [Sitemap request loader](#sitemap-request-loader) * [Robots exclusion standard](#robots-exclusion-standard) * [Fingerprinting](#fingerprinting) * [Open telemetry](#open-telemetry) ![Crawlee for Python v1.0](/assets/images/crawlee_v100-d491a6c5406c55e0bfcdc9b39b81b7ae.webp) ## Getting started[​](#getting-started "Direct link to Getting started") You can upgrade to the latest version straight from [PyPI](https://www.pypi.org/project/crawlee/): ``` pip install --upgrade crawlee ``` Check out the full changelog on our [website](https://www.crawlee.dev/python/docs/changelog#100-2025-09-15) to see all the details. If you are updating from an older version, make sure to follow our [Upgrading to v1](https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v1) guide. 
## New storage client system[​](#new-storage-client-system "Direct link to New storage client system") One of the biggest architectural changes in Crawlee v1 is the introduction of a new storage client system. Until now, datasets, key–value stores, and request queues were handled in slightly different ways depending on where they were stored. With v1, this has been unified under a single, consistent interface. This means that whether you're storing data in memory, on the local file system, in a database, on the Apify platform, or even using a custom backend, the API remains the same. The result is less duplication, better extensibility, and a cleaner developer experience. It also opens the door for the community to build and share their own storage client implementations. For example, here's how to set up a crawler with a file-system–backed storage client, which persists data locally: ``` from crawlee.configuration import Configuration from crawlee.crawlers import ParselCrawler from crawlee.storage_clients import FileSystemStorageClient # Create a new instance of storage client. storage_client = FileSystemStorageClient() # Create a configuration with custom settings. configuration = Configuration( storage_dir='./my_storage', purge_on_start=False, ) # And pass them to the crawler. crawler = ParselCrawler( storage_client=storage_client, configuration=configuration, ) ``` And here's an example of using a memory-only storage client, useful for testing or short-lived crawls: ``` from crawlee.crawlers import ParselCrawler from crawlee.storage_clients import MemoryStorageClient # Create a new instance of storage client. storage_client = MemoryStorageClient() # And pass it to the crawler. crawler = ParselCrawler(storage_client=storage_client) ``` With this new design, switching between storage backends is as simple as swapping out a client, without changing your crawling logic. To dive deeper into configuration, advanced usage (e.g. using different storage clients for specific storage instances), and even how to write your own storage client, see the [Storages](https://www.crawlee.dev/python/docs/guides/storages) and [Storage clients](https://www.crawlee.dev/python/docs/guides/storage-clients) guides. ### New experimental SQL storage client[​](#new-experimental-sql-storage-client "Direct link to New experimental SQL storage client") Crawlee v1 introduces an experimental [`SqlStorageClient`](https://www.crawlee.dev/python/api/class/SqlStorageClient) that enables persistent storage using SQL databases. Currently, SQLite and PostgreSQL are supported. This storage backend supports concurrent access from multiple crawler processes, enabling distributed crawling scenarios. The SQL storage client uses [SQLAlchemy 2+](https://www.sqlalchemy.org/) under the hood, providing automatic schema creation, connection pooling, and database-specific optimizations. It maintains the same interface as other storage clients, making it easy to switch between different storage backends without changing your crawling logic. The client uses a context manager to ensure proper connection handling: ``` import asyncio from crawlee.crawlers import ParselCrawler from crawlee.storage_clients import SqlStorageClient async def main() -> None: # Create SQL storage client (defaults to SQLite). async with SqlStorageClient() as storage_client: # Pass it to the crawler. crawler = ParselCrawler(storage_client=storage_client) # ... 
define your handlers and crawling logic await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` For PostgreSQL, simply provide a connection string: ``` import asyncio from crawlee.crawlers import ParselCrawler from crawlee.storage_clients import SqlStorageClient async def main() -> None: async with SqlStorageClient( connection_string='postgresql+asyncpg://user:pass@localhost/crawlee_db' ) as storage_client: crawler = ParselCrawler(storage_client=storage_client) # ... define your handlers and crawling logic await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` Since this is an experimental feature, the implementation may evolve in future releases as we gather feedback from the community. ## Adaptive Playwright crawler[​](#adaptive-playwright-crawler "Direct link to Adaptive Playwright crawler") Some websites can be scraped quickly with plain HTTP requests, while others require the full power of a browser to render dynamic content. Traditionally, you had to decide upfront whether to use one of the lightweight HTTP-based crawlers ([`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler) or [`BeautifulSoupCrawler`](https://www.crawlee.dev/python/api/class/BeautifulSoupCrawler)) or a browser-based [`PlaywrightCrawler`](https://www.crawlee.dev/python/api/class/PlaywrightCrawler). Crawlee v1 introduces the [`AdaptivePlaywrightCrawler`](https://www.crawlee.dev/python/api/class/AdaptivePlaywrightCrawler), which automatically chooses the right approach for each page. The adaptive crawler uses a detection mechanism: it compares the results of plain HTTP requests with those of a browser-rendered version of the same page. If both match, it can continue with the faster HTTP approach; if differences appear, it falls back to browser-based crawling. Over time, it builds confidence about which rendering type is needed for different pages, occasionally re-checking with the browser to ensure its predictions stay correct. This makes your crawls faster and cheaper, while still allowing you to reliably handle complex, dynamic websites. In practice, you get the best of both worlds: speed on simple pages and robustness on modern, JavaScript-heavy sites. For advanced options, such as customizing the detection strategy, see the [Adaptive Playwright crawler guide](https://www.crawlee.dev/python/docs/guides/adaptive-playwright-crawler). Here's a simplified example using the static [Parsel](https://www.github.com/scrapy/parsel) parser for HTTP responses, and falling back to [Playwright](https://www.playwright.dev/) only when needed: ``` import asyncio from datetime import timedelta from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext async def main() -> None: crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser() @crawler.router.default_handler async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None: # Locate element h2 within 5 seconds h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000)) # Do stuff with element found by the selector context.log.info(h2) await crawler.run(['https://crawlee.dev/']) if __name__ == '__main__': asyncio.run(main()) ``` In this example, pages that don't need JavaScript rendering will be processed through the fast HTTP client, while others will be automatically handled with Playwright. You don't need to write two different crawlers or guess in advance which method to use - Crawlee adapts dynamically. 
For more details and configuration options, see the [Adaptive Playwright crawler](https://www.crawlee.dev/python/docs/guides/adaptive-playwright-crawler) guide. ## Impit HTTP client[​](#impit-http-client "Direct link to Impit HTTP client") Crawlee v1 introduces a brand-new default HTTP client: [`ImpitHttpClient`](https://www.crawlee.dev/python/api/class/ImpitHttpClient), powered by the [Impit](https://www.github.com/apify/impit) library. Written in Rust and exposed to Python through bindings, it delivers better performance, async-first design, HTTP/3 support, and browser impersonation. It can impersonate real browsers out of the box, which makes your crawlers harder to detect and block by common anti-bot systems. This means fewer false positives, more resilient crawls, and less need for complicated workarounds. Impit is also developed as an open-source project by Apify, so you can dive into the internals or contribute improvements yourself. By default, Crawlee now uses [`ImpitHttpClient`](https://www.crawlee.dev/python/api/class/ImpitHttpClient) under the hood. But you can also create your own instance, configure it to your needs (e.g. enable HTTP/3 or choose a specific browser profile), and pass it into your crawler. Here's an example of explicitly using [`ImpitHttpClient`](https://www.crawlee.dev/python/api/class/ImpitHttpClient) with a [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler): ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.http_clients import ImpitHttpClient async def main() -> None: http_client = ImpitHttpClient( # Optional additional keyword arguments for `impit.AsyncClient`. http3=True, browser='firefox', verify=True, ) crawler = ParselCrawler( http_client=http_client, # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=10, ) # Define the default request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: ParselCrawlingContext) -> None: context.log.info(f'Processing {context.request.url} ...') # Enqueue all links from the page. await context.enqueue_links() # Extract data from the page. data = { 'url': context.request.url, 'title': context.selector.css('title::text').get(), } # Push the extracted data to the default dataset. await context.push_data(data) # Run the crawler with the initial list of URLs. await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` With the [`ImpitHttpClient`](https://www.crawlee.dev/python/api/class/ImpitHttpClient), you get stealth without extra dependencies or plugins. Check out the [HTTP clients](https://www.crawlee.dev/python/docs/guides/http-clients) guide for more details and advanced configuration options. ## Sitemap request loader[​](#sitemap-request-loader "Direct link to Sitemap request loader") Many websites expose their structure through sitemaps. These files provide a clear list of all available URLs, and are often the most efficient way to discover content on a site. In previous Crawlee versions, you had to fetch and parse these XML files manually before feeding them into your crawler. With Crawlee v1, that's no longer necessary. The new [`SitemapRequestLoader`](https://www.crawlee.dev/python/api/class/SitemapRequestLoader) lets you load URLs directly from a sitemap into your request queue, with options for filtering and batching. 
This makes it much easier to start large-scale crawls where sitemaps already provide full coverage of the site. Here's an example that loads a sitemap, filters out only documentation pages, and processes them with a [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler): ``` import asyncio import re from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.http_clients import ImpitHttpClient from crawlee.request_loaders import SitemapRequestLoader async def main() -> None: # Create an HTTP client for fetching the sitemap. http_client = ImpitHttpClient() # Create a sitemap request loader with filtering rules. sitemap_loader = SitemapRequestLoader( sitemap_urls=['https://crawlee.dev/sitemap.xml'], http_client=http_client, include=[re.compile(r'.*docs.*')], # Only include URLs containing 'docs'. max_buffer_size=500, # Keep up to 500 URLs in memory before processing. ) # Convert the sitemap loader into a request manager linked # to the default request queue. request_manager = await sitemap_loader.to_tandem() # Create a crawler and pass the request manager to it. crawler = ParselCrawler( request_manager=request_manager, max_requests_per_crawl=10, # Limit the max requests per crawl. ) @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info(f'Processing {context.request.url}') # New links will be enqueued directly to the queue. await context.enqueue_links() # Extract data using Parsel's XPath and CSS selectors. data = { 'url': context.request.url, 'title': context.selector.xpath('//title/text()').get(), } # Push extracted data to the dataset. await context.push_data(data) await crawler.run() if __name__ == '__main__': asyncio.run(main()) ``` By connecting the [`SitemapRequestLoader`](https://www.crawlee.dev/python/api/class/SitemapRequestLoader) directly with a crawler, you can skip the boilerplate of parsing XML and just focus on extracting data. For more details, see the [Request loaders](https://www.crawlee.dev/python/docs/guides/request-loaders) guide. ## Robots exclusion standard[​](#robots-exclusion-standard "Direct link to Robots exclusion standard") Respecting [`robots.txt`](https://en.wikipedia.org/wiki/Robots.txt) is an important part of responsible web crawling. This simple file lets website owners declare which parts of their site should not be crawled by automated agents. Crawlee v1 makes it trivial to follow these rules: just set the `respect_robots_txt_file` option on your crawler, and Crawlee will automatically check the file before issuing requests. This not only helps you build ethical crawlers, but can also save time and bandwidth by skipping disallowed or irrelevant pages. For example, login pages, search results, or admin sections are often excluded in [`robots.txt`](https://www.en.wikipedia.org/wiki/Robots.txt), and Crawlee will handle that for you automatically. Here's a minimal example showing how a [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler) obeys the robots exclusion standard: ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext async def main() -> None: # Create a new crawler instance with robots.txt compliance enabled. crawler = ParselCrawler( respect_robots_txt_file=True, ) # Define the default request handler. @crawler.router.default_handler async def request_handler(context: ParselCrawlingContext) -> None: context.log.info(f'Processing {context.request.url}') # Extract the data from website. 
data = { 'url': context.request.url, 'title': context.selector.xpath('//title/text()').get(), } # Push extracted data to the dataset. await context.push_data(data) # Run the crawler with the list of start URLs. # The crawler will check the robots.txt file before making requests. # In this example, "https://news.ycombinator.com/login" will be skipped # because it's disallowed in the site's robots.txt file. await crawler.run( ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login'] ) if __name__ == '__main__': asyncio.run(main()) ``` With this option enabled, you don't need to manually check which URLs are allowed. Crawlee will handle it, letting you focus on the crawling logic and data extraction. For a more information, see the [Respect robots.txt file](https://www.crawlee.dev/python/docs/examples/respect-robots-txt-file) documentation page. ## Fingerprinting[​](#fingerprinting "Direct link to Fingerprinting") Modern websites often rely on browser fingerprinting to distinguish real users from automated traffic. Instead of just checking the [User-Agent](https://www.developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent) header, they combine dozens of subtle signals - supported fonts, canvas rendering, WebGL features, media devices, screen resolution, and more. Together, these form a unique [device fingerprint](https://www.en.wikipedia.org/wiki/Device_fingerprint) that can easily expose headless browsers or automation frameworks. Without fingerprinting, Playwright sessions tend to look identical and are more likely to be flagged by anti-bot systems. Crawlee v1 integrates with the [`FingerprintGenerator`](https://www.crawlee.dev/python/api/class/FingerprintGenerator) to automatically inject realistic, randomized fingerprints into every [`PlaywrightCrawler`](https://www.crawlee.dev/python/api/class/PlaywrightCrawler) session. This modifies HTTP headers, browser APIs, and other low-level signals so that each crawler run looks like a real browser on a real device. Using fingerprinting in Crawlee is straightforward: create a fingerprint generator with your desired options and pass it to the crawler. ``` import asyncio from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext from crawlee.fingerprint_suite import ( DefaultFingerprintGenerator, HeaderGeneratorOptions, ScreenOptions, ) async def main() -> None: # Use default fingerprint generator with desired fingerprint options. # Generator will generate real looking browser fingerprint based on the options. # Unspecified fingerprint options will be automatically selected by the generator. fingerprint_generator = DefaultFingerprintGenerator( header_options=HeaderGeneratorOptions(browsers=['chrome']), screen_options=ScreenOptions(min_width=400), ) crawler = PlaywrightCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=10, # Headless mode, set to False to see the browser in action. headless=False, # Browser types supported by Playwright. browser_type='chromium', # Fingerprint generator to be used. By default no fingerprint generation is done. fingerprint_generator=fingerprint_generator, ) # Define the default request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: PlaywrightCrawlingContext) -> None: context.log.info(f'Processing {context.request.url} ...') # Find a link to the next page and enqueue it if it exists. 
await context.enqueue_links(selector='.morelink') # Run the crawler with the initial list of URLs. await crawler.run(['https://news.ycombinator.com/']) if __name__ == '__main__': asyncio.run(main()) ``` In this example, each Playwright instance starts with a unique, realistic fingerprint. From the website’s perspective, the crawler behaves like a real browser session, reducing the chance of detection or blocking. For more details and examples, see the [Avoid getting blocked](https://www.crawlee.dev/python/docs/guides/avoid-blocking) guide and the [Playwright crawler with fingerprint generator](https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-fingeprint-generator) documentation page. ## Open telemetry[​](#open-telemetry "Direct link to Open telemetry") Running crawlers in production means you often want more than just logs - you need visibility into what the crawler is doing, how it's performing, and where bottlenecks occur. Crawlee v1 adds basic [OpenTelemetry](https://www.opentelemetry.io/) instrumentation via [`CrawlerInstrumentor`](https://www.crawlee.dev/python/api/class/CrawlerInstrumentor), giving you a standardized way to collect traces and metrics from your crawlers. With [OpenTelemetry](https://www.opentelemetry.io/) enabled, Crawlee automatically records information such as: * Requests and responses (including timings, retries, and errors). * Resource usage events (memory, concurrency, system snapshots). * Lifecycle events from crawlers, routers, and handlers. These signals can be exported to any OpenTelemetry-compatible backend (e.g. [Jaeger](https://www.jaegertracing.io/), [Prometheus](https://www.prometheus.io/), or [Grafana](https://www.grafana.com/)), where you can monitor real-time dashboards or analyze traces to understand crawler performance. 
Here's a minimal example: ``` import asyncio from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter from opentelemetry.sdk.resources import Resource from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import SimpleSpanProcessor from opentelemetry.trace import set_tracer_provider from crawlee.crawlers import BasicCrawlingContext, ParselCrawler, ParselCrawlingContext from crawlee.otel import CrawlerInstrumentor from crawlee.storages import Dataset, KeyValueStore, RequestQueue def instrument_crawler() -> None: resource = Resource.create( { 'service.name': 'ExampleCrawler', 'service.version': '1.0.0', 'environment': 'development', } ) # Set up the OpenTelemetry tracer provider and exporter provider = TracerProvider(resource=resource) otlp_exporter = OTLPSpanExporter(endpoint='localhost:4317', insecure=True) provider.add_span_processor(SimpleSpanProcessor(otlp_exporter)) set_tracer_provider(provider) # Instrument the crawler with OpenTelemetry CrawlerInstrumentor( instrument_classes=[RequestQueue, KeyValueStore, Dataset] ).instrument() async def main() -> None: instrument_crawler() crawler = ParselCrawler(max_requests_per_crawl=100) kvs = await KeyValueStore.open() @crawler.pre_navigation_hook async def pre_nav_hook(_: BasicCrawlingContext) -> None: # Simulate some pre-navigation processing await asyncio.sleep(0.01) @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: await context.push_data({'url': context.request.url}) await kvs.set_value(key='url', value=context.request.url) await context.enqueue_links() await crawler.run(['https://crawlee.dev/']) if __name__ == '__main__': asyncio.run(main()) ``` Once configured, your traces and metrics can be exported using standard OpenTelemetry exporters (e.g. OTLP, console, or custom backends). This makes it much easier to integrate Crawlee into existing monitoring pipelines. For more details on available options and examples of exporting traces, see the [Trace and monitor crawlers](https://www.crawlee.dev/python/docs/guides/trace-and-monitor-crawlers) guide. ## A message from the Crawlee team[​](#a-message-from-the-crawlee-team "Direct link to A message from the Crawlee team") Last but not least, we want to thank our open-source community members who tried Crawlee for Python in its beta version and helped us improve it for the scraping and automation community. We would appreciate it if you could check out the latest version and [give us a star on GitHub](https://www.github.com/apify/crawlee-python/) if you like the new features. If you have any questions or feedback, please open a [GitHub discussion](https://www.github.com/apify/crawlee-python/discussions) or [join our Discord community](https://www.apify.com/discord/) to get support or talk to fellow Crawlee users. If you encounter any bugs or have an idea for a new feature, please open a [GitHub issue](https://www.github.com/apify/crawlee-python/issues). --- # How to build a price tracker with Crawlee and Apify April 8, 2025 · 11 min read [![Percival Villalva](https://avatars.githubusercontent.com/u/70678259?v=4)](https://github.com/PerVillalva) [Percival Villalva](https://github.com/PerVillalva) Community Member of Crawlee Build a price tracker with Crawlee for Python to scrape product details, export data in multiple formats, and send email alerts for price drops, then deploy and schedule it as an Apify Actor. 
![Crawlee for Python Price Tracker](/assets/images/crawlee-python-price-tracker-8ffc0121eee82024852513938dd525ab.webp) In this tutorial, we’ll build a price tracker using Crawlee for Python and Apify. By the end, you’ll have an Apify Actor that scrapes product details from a webpage, exports the data in various formats (CSV, Excel, JSON, and more), and sends an email alert when the product’s price falls below your specified threshold. ## 1. Project Setup[​](#1-project-setup "Direct link to 1. Project Setup") Our first step is to install the [Apify CLI](https://docs.apify.com/cli/docs). You can do this using either Homebrew or NPM with the following commands: ### Homebrew[​](#homebrew "Direct link to Homebrew") ``` brew install apify-cli ``` ### Via NPM[​](#via-npm "Direct link to Via NPM") ``` npm -g install apify-cli ``` Next, let’s run the following commands to use one of Apify’s pre-built templates. This will streamline the setup process and get us coding right away: ``` apify create price-tracking-actor ``` A dropdown list will appear. To follow along with this tutorial, select **`Python`** and the **`Crawlee + BeautifulSoup`** template. Once the template is installed, navigate to the newly created folder and open it in your preferred IDE. ![actor-templates](/assets/images/actor-templates-88fa253dabe612261cb2fe95430c4c04.webp) Navigate to **`src/main.py`** in your project, and you’ll find that a significant amount of boilerplate code has already been generated for you. If you’re new to Apify or Crawlee, don’t worry - it’s not as complex as it might seem. This pre-written code is designed to save you time and jumpstart your development process. ![crawlee-bs4-template](/assets/images/crawlee-bs4-template-528a9eee4ab1c859feb2ed42e3328045.webp) In fact, this template comes with fully functional code that scrapes the Apify homepage. To test it out, simply run the command **`apify run`**. Within a few seconds, you’ll see the **`storage/datasets`** directory populate with the scraped data in JSON format. ![json-data](/assets/images/json-data-9ec19a8958775e66dcd094d0d46faa90.webp) ## 2. Customizing the template[​](#2-customizing-the-template "Direct link to 2. Customizing the template") Now that our project is set up, let’s customize the template to scrape our target website: [Raspberry Pi 5 (8GB RAM) on Central Computer](https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html). First, in the `src/main.py` file, find the `crawler.run(start_urls)` call and replace `start_urls` with the URL of the target website, as shown below: ``` await crawler.run(['https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html']) ``` Normally, you could let users specify a URL through the Actor input, and the Actor would prioritize it. However, since we’re scraping a specific page, we’ll just use the hardcoded URL for simplicity. Keep in mind that dynamic input is still an option if you want to make the Actor more flexible later. ### Extracting the Product’s Name and Price[​](#extracting-the-products-name-and-price "Direct link to Extracting the Product’s Name and Price") Finally, let’s modify our template to extract key elements from the page, such as the product name and price. Starting with the **product name**, inspect the [target page](https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html) using DevTools to find suitable selectors for targeting the element. 
![product-name](/assets/images/product-name-dbaba09d2d06b4b8a6b9a340698739af.webp) Next, create a `product_name_element` variable to hold the element selected with the CSS selectors found on the page and update the `data` dictionary with the element’s text contents. Also, remove the line of code that previously made the Actor crawl the Apify website, as we now want it to scrape only a single page. Your `request_handler` function should look similar to the example below: ``` @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Select the product name and price elements. product_name_element = context.soup.find('div', class_='productname') # Extract the desired data. data = { 'url': context.request.url, 'product_name': product_name_element.text.strip() if product_name_element else None, } # Store the extracted data to the default dataset. await context.push_data(data) # Enqueue additional links found on the current page. # await context.enqueue_links() -> REMOVE THIS LINE ``` It’s a good practice to test our code after every significant change to ensure it works as expected. Run `apify run` again, but this time, add the `--purge` flag to prevent the newly scraped data from mixing with previous runs: ``` apify run --purge ``` Navigate to `storage/datasets`, and you should find a file with the scraped content: ``` { "url": "https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html", "product_name": "Raspberry Pi 5 8GB RAM Board" } ``` Now that you’ve got the hang of it, let’s do the same thing for the price: `79.99`. ![product\_price.png](/assets/images/product-price-fa3ab906b4a95258251defe78c19b6d3.webp) In the code below, you’ll notice a slight difference: instead of extracting the element’s text content, we’re retrieving the value of its `data-price-amount` attribute. This approach avoids capturing the dollar sign `($)` that would otherwise come with the text. If you prefer working with text content instead, that’s perfectly fine; you can simply use `.replace('$', '')` to remove the dollar sign. Also, keep in mind that the extracted price will be a `string` by default. To perform numerical comparisons, we need to convert it to a `float`. This conversion will allow us to accurately compare the price values later on. Here’s how the updated code looks so far: ``` # main.py # ...previous code @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Select the product name and price elements. product_name_element = context.soup.find('div', class_='productname') product_price_element = context.soup.find('span', id='product-price-395001') # Extract the desired data. data = { 'url': context.request.url, 'product_name': product_name_element.text.strip() if product_name_element else None, 'price': float(product_price_element['data-price-amount']) if product_price_element else None, } # Store the extracted data to the default dataset. await context.push_data(data) ``` Again, try running it with `apify run --purge` and check if you get output similar to the example below: ``` { "url": "https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html", "product_name": "Raspberry Pi 5 8GB RAM Board", "price": 79.99 } ``` That’s it for the extraction part! Below is the complete code we’ve written so far. 
> 💡 **TIP:** If you’d like to get some more practice, try scraping additional elements such as the **`model`**, **`Item #`**, or **`stock availability (In stock)`**. ``` # main.py from apify import Actor from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main() -> None: # Enter the context of the Actor. async with Actor: # Create a crawler. crawler = BeautifulSoupCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=50, ) # Define a request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Select the product name and price elements. product_name_element = context.soup.find('div', class_='productname') product_price_element = context.soup.find('span', id='product-price-395001') # Extract the desired data. data = { 'url': context.request.url, 'product_name': product_name_element.text.strip() if product_name_element else None, 'price': float(product_price_element['data-price-amount']) if product_price_element else None, } # Store the extracted data to the default dataset. await context.push_data(data) # Run the crawler with the starting requests. await crawler.run(['https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html']) ``` ## 3. Sending an Email Alert[​](#3-sending-an-email-alert "Direct link to 3. Sending an Email Alert") From this point forward, you’ll need an **Apify account**. You can create one for free [here](https://console.apify.com/sign-up). We need an Apify account because we’ll be making an API call to a pre-existing Actor from the **Apify Store**, the “Send Email Actor”, to handle notifications. Apify’s email system takes care of sending alerts, so we don’t have to worry about handling **2FA** in our automation. ``` # main.py # ...previous code # Define a price threshold price_threshold = 80 # Call the "Send Email" Actor when the price goes below the threshold if data['price'] < price_threshold: actor_run = await Actor.start( actor_id="apify/send-mail", run_input={ "to": "your_email@email.com", "subject": "Python Price Alert", "text": f"The price of '{data['product_name']}' has dropped below ${price_threshold} and is now ${data['price']}.\n\nCheck it out here: {data['url']}", }, ) Actor.log.info(f"Email sent with run ID: {actor_run.id}") ``` In the code above, we’re using the **Apify Python SDK**, which is already included in our project, to call the “Send Email” Actor with the required input. To make this API call work, you’ll need to log in to your Apify account from the terminal using your **`APIFY_API_TOKEN`**. To get your **`APIFY_API_TOKEN`**, sign up for an Apify account, then navigate to **Settings → API & Integrations**, and copy your **Personal API token**. ![apify-api-token](/assets/images/apify-api-token-eb76078df32c242a7f064ab71e63c7fa.webp) Next, enter the following command in the terminal inside your **Price Tracking Project**: ``` apify login ``` Select `Enter API Token Manually`, paste the token you copied from your account, and hit Enter. 
![apify-login](...) 
You’ll see a confirmation that you’re now logged into your Apify account. When you run the code, the API token will be automatically inferred from your account, allowing you to use the **Send Email Actor**. If you encounter any issues, double-check that your code matches the one below: ``` from apify import Actor from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main() -> None: # Enter the context of the Actor. async with Actor: # Create a crawler. crawler = BeautifulSoupCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=50, ) # Define a request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Select the product name and price elements. product_name_element = context.soup.find('div', class_='productname') product_price_element = context.soup.find('span', id='product-price-395001') # Extract the desired data. data = { 'url': context.request.url, 'product_name': product_name_element.text.strip() if product_name_element else None, 'price': float(product_price_element['data-price-amount']) if product_price_element else None, } price_threshold = 80 if data['price'] < price_threshold: actor_run = await Actor.start( actor_id="apify/send-mail", run_input={ "to": "your_email@gmail.com", "subject": "Python Price Alert", "text": f"The price of '{data['product_name']}' has dropped below ${price_threshold} and is now ${data['price']}.\n\nCheck it out here: {data['url']}", }, ) Actor.log.info(f"Email sent with run ID: {actor_run.id}") # Store the extracted data to the default dataset. await context.push_data(data) # Run the crawler with the starting requests. 
await crawler.run(['https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html']) ``` > 🔖 Replace the placeholder email address with your actual email, the one where you want to receive notifications. Make sure it matches the email you used to register your **Apify account**. Then, run the code using: ``` apify run --purge ``` If everything works correctly, you should receive an email like the one below in your inbox. ![price-alert](/assets/images/price-alet-530cccd85b681fd98e32a81e4f52e488.webp) ## 4. Deploying your Actor[​](#4-deploying-your-actor "Direct link to 4. Deploying your Actor") It’s time to deploy your Actor to the cloud, allowing it to take full advantage of the Apify Platform’s features. Fortunately, this process is incredibly simple. Since you’re already logged into your account, just run the following command: ``` apify push ``` In just a few seconds, you’ll find your newly created Actor in your Apify account by navigating to **Actors → Development → Price Tracking Actor**. ![price-tracking-actor](/assets/images/price-tracking-actor-c91e4f5243ea20363d2621424d89985f.webp) Note that the **Start URLs** input has been reset to **apify.com**, so be sure to replace it with our target website. Once updated, click the green ***Save & Start*** button at the bottom of the page to run your Actor. After the run completes, you’ll see a **preview of the results** in the ***Output*** tab. You can also export your data in multiple formats from the ***Storage*** tab. ![actor-run](/assets/images/actor-run-faa6f7deb56846b88c7d446e9eb05e1d.webp) **Export dataset:** ![actor-export-dataset](/assets/images/export-dataset-9d56cd86006ff21fbbd695a72cd5529c.webp) ## 5. Schedule your runs[​](#5-schedule-your-runs "Direct link to 5. Schedule your runs") Now, a **price monitoring script** wouldn’t be very effective unless it ran on a schedule, automatically checking the product’s price and notifying us when it drops below the threshold. Since our Actor is already deployed on **Apify**, scheduling it to run, say, every hour, is incredibly simple. On your Actor page, click the three dots in the top-right corner of the screen and select **“Schedule Actor.”** ![schedule-run](/assets/images/schedule-run-3c2c1975cb23d5f4bdbe8116172a2a47.webp) Next, choose how often you want your Actor to run, and that’s it! Your script will now run in the cloud, continuously monitoring the product’s price and sending you an email notification whenever it goes on sale. ![actor-schedule](/assets/images/actor-schedule-2fe3df75d91fa3270776f814ed6888dc.webp) ## That’s a wrap\![​](#thats-a-wrap "Direct link to That’s a wrap!") Congratulations on completing this tutorial! I hope you enjoyed getting your feet wet with Crawlee and feel confident enough to tweak the code to build your own price tracker. We’ve only scratched the surface of what Apify and Crawlee can do. As a next step, join our [Discord community](https://discord.com/invite/jyEM2PRvMU) to connect with other web scraping developers and stay up to date with the latest news about Crawlee and Apify! 
---

# Reverse engineering GraphQL persistedQuery extension

November 15, 2024 · 5 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

[![Matěj Volf](https://avatars.githubusercontent.com/u/31281386?v=4)](https://github.com/mvolfik) [Matěj Volf](https://github.com/mvolfik) Web Automation Engineer

GraphQL is a query language for getting deeply nested structured data from a website's backend, similar to MongoDB queries. The request is usually a POST to some general `/graphql` endpoint with a body like this:

![GraphQL Query](/assets/images/graphql-a3962ed441b2a078e43c8158ad64336a.webp)

When scraping data from websites using GraphQL, it’s common to inspect the network requests in developer tools to find the exact queries being used. However, on some websites, you might notice that the GraphQL query itself isn’t visible in the request. Instead, you only see a cryptic hash value. This can be confusing and makes it harder to understand how data is being requested from the server.

This is because some websites use a feature called ["persisted queries"](https://www.apollographql.com/docs/apollo-server/performance/apq/). It's a performance optimization that reduces the amount of data sent with each request by replacing the full query text with a precomputed hash. While this improves website speed and efficiency, it introduces challenges for scraping because the query text isn’t readily available.

![Persisted Query Reverse Engineering](/assets/images/graphql-persisted-query-6e36e61d76503e617fe4e7651bdf53a3.webp)

TLDR: the client computes the sha256 hash of the `query` text and only sends that hash. In addition, you can possibly fit all of this into the query string of a GET request, making it easily cacheable. Below is an example request from Zillow:

![Request from Zillow](/assets/images/zillow-ebd03223cb4ed6af11e972135e854851.webp)

As you can see, it’s just some metadata about the persistedQuery extension, the hash of the query, and variables to be embedded in the query. Here’s another request from expedia.com, sent as a POST, but with the same extension:

![Expedia Query](/assets/images/expedia-2e5f3670fa2a7fe4b27c9e5f93e5ec5a.webp)

This primarily optimizes website performance, but it creates several challenges for web scraping:

* GET requests are usually more prone to being blocked.
* Hidden query parameters: we don’t know the full query, so if the website responds with a “Persisted query not found” error (asking us to send the query in full, not just the hash), we can’t send it.
* Once the website changes even a little bit and the clients start asking for a new query - even though the old one might still work, the server will very soon forget its ID/hash, and your request with this hash will never work again, since you can’t “remind” the server of the full query text.

For various reasons, you might need to extract the entire GraphQL query text, but this can be tricky. While you could inspect the website’s JavaScript to find the query text, it’s often dynamically constructed from multiple fragments, making it hard to piece together. Instead, we’ll take a more direct approach: tricking the client application (e.g., the browser) into revealing the full query. When the client uses a hash that the server doesn't recognize, the server typically responds with an error message like `PersistedQueryNotFound`.
This prompts the client to resend the full query in a subsequent request. By intercepting and modifying the original request to include an invalid hash, we can trigger this behavior and capture the complete query text. This method avoids digging through JavaScript and relies on the natural error-handling flow of the client-server interaction.

For exactly this use case, a perfect tool exists: [mitmproxy](https://mitmproxy.org/), an open-source Python library that intercepts requests made by your own devices, websites, or apps and allows you to modify them with simple Python scripts.

Download `mitmproxy`, and prepare a Python script like this:

```
import json


def request(flow):
    try:
        dat = json.loads(flow.request.text)
        dat[0]["extensions"]["persistedQuery"]["sha256Hash"] = "0d9e"  # any bogus hex string here
        flow.request.text = json.dumps(dat)
    except:
        pass
```

This defines a hook that `mitmproxy` will run on every request: it tries to load the request's JSON body, modifies the hash to an arbitrary value, and writes the updated JSON as a new body of the request.

We also need to make sure we reroute our browser requests to `mitmproxy`. For this purpose we are going to use a browser extension called [FoxyProxy](https://chromewebstore.google.com/detail/foxyproxy/gcknhkkoolaabfmlnjonogaaifnjlfnp?hl=en). It is available in both Firefox and Chrome. Just add a route with these settings:

![mitmproxy settings](/assets/images/mitmprpxy-1e6b253c473a57f3451077aae16640b6.webp)

Now we can run `mitmproxy` with this script: `mitmweb -s script.py`

This will open a browser tab where you can watch all the intercepted requests in real time.

![Browser tab](/assets/images/browser-408715fa1be9f079c6672f7f3ae59644.webp)

If you open that particular request and look at the query in the request section, you will see that a garbage value has replaced the hash.

![Replaced hash](/assets/images/request-6f8330f873c988f6dd07d358130627bd.webp)

Now, if you visit Zillow, open that particular request again, and go to the response section, you will see that the client receives the PersistedQueryNotFound error.

![Persisted query error](/assets/images/error-2b5eed861143a45328231c6629406454.webp)

The front end of Zillow reacts by sending the whole query as a POST request.

![POST request](/assets/images/query-b793b6bbe82994b3d38a565204f82e11.webp)

We extract the query and hash directly from this POST request. To ensure that the Zillow server does not forget about this hash, we periodically run this POST request with the exact same query and hash. This ensures that the scraper continues to work even when the server's cache is cleaned or reset or the website changes.

## Conclusion[​](#conclusion "Direct link to Conclusion")

Persisted queries are a powerful optimization tool for GraphQL APIs, enhancing website performance by minimizing payload sizes and enabling GET request caching. However, they also pose significant challenges for web scraping, primarily due to the reliance on server-stored hashes and the potential for those hashes to become invalid.

Using `mitmproxy` to intercept and manipulate GraphQL requests offers an efficient way to reveal the full query text without delving into complex client-side JavaScript. By forcing the server to respond with a `PersistedQueryNotFound` error, we can capture the full query payload and utilize it for scraping purposes. Periodically running the extracted query ensures the scraper remains functional, even when server-side cache resets occur or the website evolves.
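As a hedged illustration of that keep-alive idea, and of the sha256 mechanics described above, here is a minimal Python sketch. The endpoint, operation name, query text, and variables are hypothetical placeholders; in practice you would substitute the values captured from the intercepted POST request, and the snippet assumes the `requests` package is installed:

```
import hashlib

import requests  # assumes the `requests` package is available

GRAPHQL_ENDPOINT = 'https://example.com/graphql'  # hypothetical endpoint
OPERATION_NAME = 'ExampleOperation'  # hypothetical, taken from the captured request
FULL_QUERY = 'query ExampleOperation($id: ID!) { item(id: $id) { name } }'  # captured query text
VARIABLES = {'id': '123'}  # captured variables

# The persistedQuery extension identifies a query by the sha256 hash of its text.
query_hash = hashlib.sha256(FULL_QUERY.encode('utf-8')).hexdigest()

payload = {
    'operationName': OPERATION_NAME,
    'query': FULL_QUERY,  # sending the full text "reminds" the server of this hash
    'variables': VARIABLES,
    'extensions': {'persistedQuery': {'version': 1, 'sha256Hash': query_hash}},
}

# Re-running this request periodically keeps the hash warm in the server's cache,
# so later hash-only requests keep resolving.
response = requests.post(GRAPHQL_ENDPOINT, json=payload)
print(response.status_code)
```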
--- # How to scrape Amazon products March 27, 2024 · 11 min read [![Lukáš Průša](/assets/images/lukasp-e0c7202aabdcc50c75cf45603be990a0.webp)](https://github.com/Patai5) [Lukáš Průša](https://github.com/Patai5) Junior Web Automation Engineer ## Introduction[​](#introduction "Direct link to Introduction") Amazon is one of the largest and most complex websites, which means scraping it is pretty challenging. Thankfully, the Crawlee library makes things a little easier, with utilities like JSON file outputs, automatic scaling, and request queue management. In this guide, we'll be extracting information from Amazon product pages using the power of [TypeScript](https://www.typescriptlang.org) in combination with the [Cheerio](https://cheerio.js.org) and [Crawlee](https://crawlee.dev) libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process. ![How to scrape Amazon using Typescript, Cheerio, and Crawlee](/assets/images/how-to-scrape-amazon-b6c5753f8b985c94a3d4cc372048f79d.webp) ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") You'll find the journey smoother if you have a decent grasp of the TypeScript language and a fundamental understanding of [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML) structure. A familiarity with Cheerio and Crawlee is advised but optional. This guide is built to introduce these tools and their use cases in an approachable manner. Crawlee is open-source with nearly 12,000 stars on GitHub. You can check out the [source code here](https://github.com/apify/crawlee). Feel free to play with Crawlee with the inbuilt templates that they offer. ## Writing the scraper[​](#writing-the-scraper "Direct link to Writing the scraper") To begin with, let's identify the product fields that we're interested in scraping: * Product Title * Price * List Price * Review Rating * Review Count * Image URLs * Product Overview Attributes ![Image highlighting the product fields to be scraped on Amazon](/assets/images/fields-to-scrape-e30b9a71e42a7b6baed85d7936fbb165.webp) For now, our focus will be solely on the scraping part. In a later section, we'll shift our attention to Crawlee, our crawling tool. Let's begin! ### Scraping the individual data points[​](#scraping-the-individual-data-points "Direct link to Scraping the individual data points") Our first step will be to utilize [browser DevTools](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Tools_and_setup/What_are_browser_developer_tools) to inspect the layout and discover the [CSS selectors](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors) for the data points we aim to scrape. (by default on [Chrome](https://developer.chrome.com/docs/devtools), press `Ctrl + Shift + C`) For example, let's take a look at how we find the selector for the product title: ![Amazon product title selector in DevTools](/assets/images/dev-tools-example-b243683d8baf93e34bbce102986d37b5.webp) The product title selector we've deduced is `span#productTitle`. This selector targets all `span` elements with the id of `productTitle`. Luckily, there's only one such element on the page - exactly what we're after. We can find the selectors for the remaining data points using the same principle combined with a sprinkle of trial and error. 
Next, let's write a function that takes a [Cheerio object](https://cheerio.js.org/docs/api/interfaces/CheerioAPI) of the product page as input and outputs our extracted data in a structured format. Initially, we'll focus on scraping simple data points. We'll leave the more complex ones, like image URLs and the product attributes overview, for later.

```
import { CheerioAPI } from 'cheerio';

type ProductDetails = {
    title: string;
    price: string;
    listPrice: string;
    reviewRating: string;
    reviewCount: string;
};

/**
 * CSS selectors for the product details. Feel free to figure out different variations of these selectors.
 */
const SELECTORS = {
    TITLE: 'span#productTitle',
    PRICE: 'span.priceToPay',
    LIST_PRICE: 'span.basisPrice .a-offscreen',
    REVIEW_RATING: '#acrPopover a > span',
    REVIEW_COUNT: '#acrCustomerReviewText',
} as const;

/**
 * Scrapes the product details from the given Cheerio object.
 */
export const extractProductDetails = ($: CheerioAPI): ProductDetails => {
    const title = $(SELECTORS.TITLE).text().trim();
    const price = $(SELECTORS.PRICE).first().text();
    const listPrice = $(SELECTORS.LIST_PRICE).first().text();
    const reviewRating = $(SELECTORS.REVIEW_RATING).first().text();
    const reviewCount = $(SELECTORS.REVIEW_COUNT).first().text();

    return { title, price, listPrice, reviewRating, reviewCount };
};
```

## Improving the scraper[​](#improving-the-scraper "Direct link to Improving the scraper")

At this point, our scraper extracts all fields as strings, which isn't ideal for numerical fields like prices and review counts - we'd rather have those as numbers. Simple casting from string to numbers will only work for some fields. In some cases, such as processing the price fields, we must clean the string and remove unnecessary characters before conversion.

To address this, we'll write a utility function that parses a number from a string. We'll also have another function to find the first element matching our selector and return it parsed as a number.

```
import { CheerioAPI } from 'cheerio';

/**
 * Parses a number from a string by removing all non-numeric characters.
 * - Keeps the decimal point.
 */
const parseNumberValue = (rawString: string): number => {
    return Number(rawString.replace(/[^\d.]+/g, ''));
};

/**
 * Parses a number value from the first element matching the given selector.
 */
export const parseNumberFromSelector = ($: CheerioAPI, selector: string): number => {
    const rawValue = $(selector).first().text();
    return parseNumberValue(rawValue);
};
```

With `parseNumberFromSelector` from the utilities above, we can now update and simplify the main scraping function, `extractProductDetails`.

```
import { CheerioAPI } from 'cheerio';
import { parseNumberFromSelector } from './utils.js';

type ProductDetails = {
    title: string;
    price: number;        //
    listPrice: number;    // updated to numbers
    reviewRating: number; //
    reviewCount: number;  //
};

...

/**
 * Scrapes the product details from the given Cheerio object.
 */
export const extractProductDetails = ($: CheerioAPI): ProductDetails => {
    const title = $(SELECTORS.TITLE).text().trim();

    const price = parseNumberFromSelector($, SELECTORS.PRICE);
    const listPrice = parseNumberFromSelector($, SELECTORS.LIST_PRICE);
    const reviewRating = parseNumberFromSelector($, SELECTORS.REVIEW_RATING);
    const reviewCount = parseNumberFromSelector($, SELECTORS.REVIEW_COUNT);

    return { title, price, listPrice, reviewRating, reviewCount };
};
```

### Scraping the advanced data points[​](#scraping-the-advanced-data-points "Direct link to Scraping the advanced data points")

As we progress in our scraping journey, it's time to focus on the more complex data fields, like image URLs and the product attributes overview. To extract data from these fields, we must utilize the `map` function to iterate over all matching elements and fetch data from each. Let's start with image URLs.

```
const SELECTORS = {
    ...
    IMAGES: '#altImages .item img',
} as const;

/**
 * Extracts the product image URLs from the given Cheerio object.
 * - We have to iterate over the image elements and extract the `src` attribute.
 */
const extractImageUrls = ($: CheerioAPI): string[] => {
    const imageUrls = $(SELECTORS.IMAGES)
        .map((_, imageEl) => $(imageEl).attr('src'))
        .get(); // `get()` - Retrieve all elements matched by the Cheerio object, as an array. Removes `undefined` values.

    return imageUrls;
};
```

Extracting images is relatively simple yet still deserves a separate function for clarity. We'll now parse the product attributes overview.

```
type ProductAttribute = {
    label: string;
    value: string;
};

const SELECTORS = {
    ...
    PRODUCT_ATTRIBUTE_ROWS: '#productOverview_feature_div tr',
    ATTRIBUTES_LABEL: 'td:nth-of-type(1) span',
    ATTRIBUTES_VALUE: 'td:nth-of-type(2) span',
} as const;

/**
 * Extracts the product attributes from the given Cheerio object.
 * - We have to iterate over the attribute rows and extract both label and value for each row.
 */
const extractProductAttributes = ($: CheerioAPI): ProductAttribute[] => {
    const attributeRowEls = $(SELECTORS.PRODUCT_ATTRIBUTE_ROWS).get();

    const attributeRows = attributeRowEls.map((rowEl) => {
        const label = $(rowEl).find(SELECTORS.ATTRIBUTES_LABEL).text();
        const value = $(rowEl).find(SELECTORS.ATTRIBUTES_VALUE).text();

        return { label, value };
    });

    return attributeRows;
};
```

We've now effectively crafted our scraping functions. Here's the complete `scraper.ts` file:

```
import { CheerioAPI } from 'cheerio';
import { parseNumberFromSelector } from './utils.js';

type ProductAttribute = {
    label: string;
    value: string;
};

type ProductDetails = {
    title: string;
    price: number;
    listPrice: number;
    reviewRating: number;
    reviewCount: number;
    imageUrls: string[];
    attributes: ProductAttribute[];
};

/**
 * CSS selectors for the product details. Feel free to figure out different variations of these selectors.
 */
const SELECTORS = {
    TITLE: 'span#productTitle',
    PRICE: 'span.priceToPay',
    LIST_PRICE: 'span.basisPrice .a-offscreen',
    REVIEW_RATING: '#acrPopover a > span',
    REVIEW_COUNT: '#acrCustomerReviewText',
    IMAGES: '#altImages .item img',
    PRODUCT_ATTRIBUTE_ROWS: '#productOverview_feature_div tr',
    ATTRIBUTES_LABEL: 'td:nth-of-type(1) span',
    ATTRIBUTES_VALUE: 'td:nth-of-type(2) span',
} as const;

/**
 * Extracts the product image URLs from the given Cheerio object.
 * - We have to iterate over the image elements and extract the `src` attribute.
 */
const extractImageUrls = ($: CheerioAPI): string[] => {
    const imageUrls = $(SELECTORS.IMAGES)
        .map((_, imageEl) => $(imageEl).attr('src'))
        .get(); // `get()` - Retrieve all elements matched by the Cheerio object, as an array. Removes `undefined` values.

    return imageUrls;
};

/**
 * Extracts the product attributes from the given Cheerio object.
 * - We have to iterate over the attribute rows and extract both label and value for each row.
 */
const extractProductAttributes = ($: CheerioAPI): ProductAttribute[] => {
    const attributeRowEls = $(SELECTORS.PRODUCT_ATTRIBUTE_ROWS).get();

    const attributeRows = attributeRowEls.map((rowEl) => {
        const label = $(rowEl).find(SELECTORS.ATTRIBUTES_LABEL).text();
        const value = $(rowEl).find(SELECTORS.ATTRIBUTES_VALUE).text();

        return { label, value };
    });

    return attributeRows;
};

/**
 * Scrapes the product details from the given Cheerio object.
 */
export const extractProductDetails = ($: CheerioAPI): ProductDetails => {
    const title = $(SELECTORS.TITLE).text().trim();

    const price = parseNumberFromSelector($, SELECTORS.PRICE);
    const listPrice = parseNumberFromSelector($, SELECTORS.LIST_PRICE);
    const reviewRating = parseNumberFromSelector($, SELECTORS.REVIEW_RATING);
    const reviewCount = parseNumberFromSelector($, SELECTORS.REVIEW_COUNT);

    const imageUrls = extractImageUrls($);
    const attributes = extractProductAttributes($);

    return { title, price, listPrice, reviewRating, reviewCount, imageUrls, attributes };
};
```

Next up is the task of making the scraping part functional. Let's implement the crawling part using Crawlee.

## Crawling the product pages[​](#crawling-the-product-pages "Direct link to Crawling the product pages")

We'll utilize the features that Crawlee offers to crawl the product pages. As we mentioned at the beginning, it considerably simplifies web scraping with JSON file outputs, automatic scaling, and request queue management. Our next stepping stone is to wrap our scraping logic within Crawlee, thereby implementing the crawling part of our process.

```
import { CheerioCrawler, CheerioCrawlingContext, log } from 'crawlee';
import { extractProductDetails } from './scraper.js';

/**
 * Performs the logic of the crawler. It is called for each URL to crawl.
 * - Passed to the crawler using the `requestHandler` option.
 */
const requestHandler = async (context: CheerioCrawlingContext) => {
    const { $, request } = context;
    const { url } = request;

    log.info(`Scraping product page`, { url });
    const extractedProduct = extractProductDetails($);

    log.info(`Scraped product details for "${extractedProduct.title}", saving...`, { url });
    await crawler.pushData(extractedProduct);
};

/**
 * The crawler instance. Crawlee provides a few different crawlers, but we'll use CheerioCrawler, as it's very fast and simple to use.
 * - Alternatively, we could use a full browser crawler like `PlaywrightCrawler` to imitate a real browser.
 */
const crawler = new CheerioCrawler({ requestHandler });

await crawler.run(['https://www.amazon.com/dp/B0BV7XQ9V9']);
```

The code now successfully extracts the product details from the given URLs. We've integrated our scraping function into Crawlee, and it's ready to scrape.
Here's an example of the extracted data:

```
{
    "title": "ASUS ROG Strix G16 (2023) Gaming Laptop, 16” 16:10 FHD 165Hz, GeForce RTX 4070, Intel Core i9-13980HX, 16GB DDR5, 1TB PCIe SSD, Wi-Fi 6E, Windows 11, G614JI-AS94, Eclipse Gray",
    "price": 1799.99,
    "listPrice": 1999.99,
    "reviewRating": 4.3,
    "reviewCount": 372,
    "imageUrls": [
        "https://m.media-amazon.com/images/I/41EWnXeuMzL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/51gAOHZbtUL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/51WLw+9ItgL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/41D-FN8qjLL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/41X+oNPvdkL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/41X6TCWz69L._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/31rphsiD0lL.SS40_BG85,85,85_BR-120_PKdp-play-icon-overlay__.jpg"
    ],
    "attributes": [
        { "label": "Brand", "value": "ASUS" },
        { "label": "Model Name", "value": "ROG Strix G16" },
        { "label": "Screen Size", "value": "16 Inches" },
        { "label": "Color", "value": "Eclipse Gray" },
        { "label": "Hard Disk Size", "value": "1 TB" },
        { "label": "CPU Model", "value": "Intel Core i9" },
        { "label": "Ram Memory Installed Size", "value": "16 GB" },
        { "label": "Operating System", "value": "Windows 11 Home" },
        { "label": "Special Feature", "value": "Anti Glare Coating" },
        { "label": "Graphics Card Description", "value": "Dedicated" }
    ]
}
```

## How to avoid getting blocked when scraping Amazon[​](#how-to-avoid-getting-blocked-when-scraping-amazon "Direct link to How to avoid getting blocked when scraping Amazon")

With a giant website like Amazon, one is bound to face some issues with blocking. Let's discuss how to handle them.

Amazon frequently presents annoying CAPTCHAs or warning screens that may detect or block your scraper. We can counter this inconvenience by implementing a mechanism to detect and handle these blocks. As soon as we stumble upon one, we retry the request.

```
import { CheerioAPI } from 'cheerio';

const CAPTCHA_SELECTOR = '[action="/errors/validateCaptcha"]';

/**
 * Handles the captcha blocking. Throws an error if a captcha is displayed.
 * - Crawlee automatically retries any requests that throw an error.
 * - Status code blocking (e.g. Amazon's `503`) is handled automatically by Crawlee.
 */
export const handleCaptchaBlocking = ($: CheerioAPI) => {
    const isCaptchaDisplayed = $(CAPTCHA_SELECTOR).length > 0;
    if (isCaptchaDisplayed) throw new Error('Captcha is displayed! Retrying...');
};
```

Make a small tweak in the request handler to use `handleCaptchaBlocking`:

```
import { handleCaptchaBlocking } from './blocking-detection.js';

const requestHandler = async (context: CheerioCrawlingContext) => {
    const { request, $ } = context;
    const { url } = request;

    handleCaptchaBlocking($); // Alternatively, we can put this into the crawler's `postNavigationHooks`

    log.info(`Scraping product page`, { url });
    ...
};
```

While Crawlee's browser-like user-agent headers prevent blocking to a certain extent, this is only partially effective for a site as vast as Amazon.

### Using proxies[​](#using-proxies "Direct link to Using proxies")

The use of proxies marks another significant tactic in evading blocking. You'll be pleased to know that Crawlee excels in this domain, supporting both [custom proxies](https://crawlee.dev/js/docs/guides/proxy-management.md) and [Apify proxies](https://apify.com/proxy).
Here's an example of how to use Apify's [residential proxies](https://docs.apify.com/platform/proxy/residential-proxy), which are highly effective in preventing blocking:

```
import { ProxyConfiguration } from 'apify';

const proxyConfiguration = new ProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US', // Optionally, you can specify the proxy country code.
    // This is useful for sites like Amazon, which display different content based on the user's location.
});

const crawler = new CheerioCrawler({ requestHandler, proxyConfiguration });

...
```

### Using headless browsers to scrape Amazon[​](#using-headless-browsers-to-scrape-amazon "Direct link to Using headless browsers to scrape Amazon")

For more advanced scraping, you can use a headless browser like [Playwright](https://crawlee.dev/js/docs/examples/playwright-crawler.md) to scrape Amazon. This method is more effective in preventing blocking and can handle websites with complex JavaScript interactions.

To use Playwright with Crawlee, we can replace the `CheerioCrawler` with `PlaywrightCrawler`:

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({ requestHandler, proxyConfiguration });

...
```

And update our Cheerio-dependent code to work within Playwright:

```
import { PlaywrightCrawlingContext } from 'crawlee';

const requestHandler = async (context: PlaywrightCrawlingContext) => {
    const { request, parseWithCheerio } = context;
    const { url } = request;

    const $ = await parseWithCheerio(); // Get the Cheerio object for the page.
    ...
};
```

## Conclusion and next steps[​](#conclusion-and-next-steps "Direct link to Conclusion and next steps")

You've now journeyed through the basic and advanced terrains of web scraping Amazon product pages using the capabilities of TypeScript, Cheerio, and Crawlee. It can seem like a lot to digest but don't worry! With more practice, each step will become more familiar and intuitive - until you become a web scraping ninja.

So go ahead and start experimenting. If you want to learn more, check out our detailed tutorial on building a [HackerNews scraper using Crawlee](https://blog.apify.com/crawlee-web-scraping-tutorial/). For more extensive web scraping abilities, check out pre-built scrapers from Apify, like [Amazon Web Scraper](https://apify.com/junglee/amazon-crawler)!

---

# How to scrape infinite scrolling webpages with Python

August 27, 2024 · 7 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

Hello, Crawlee Devs, and welcome back to another tutorial on the Crawlee Blog. This tutorial will teach you how to scrape infinite-scrolling websites using Crawlee for Python.

For context, infinite-scrolling pages are a modern alternative to classic pagination. When users scroll to the bottom of the webpage instead of choosing the next page, the page automatically loads more data, and users can scroll more.

As a big sneakerhead, I'll take the Nike shoes infinite-scrolling [website](https://www.nike.com/) as an example, and we'll scrape thousands of sneakers from it.

![How to scrape infinite scrolling pages with Python](/assets/images/infinite-scroll-de1fd1c1791fdf8f6b5614a947ccc878.webp)

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.
## Prerequisites and bootstrapping the project[​](#prerequisites-and-bootstrapping-the-project "Direct link to Prerequisites and bootstrapping the project")

Let's start the tutorial by creating a new Crawlee for Python project with this command:

```
pipx run crawlee create nike-crawler
```

note

Before going ahead, if you like reading this blog, we would be really happy if you gave [Crawlee for Python a star on GitHub!](https://github.com/apify/crawlee-python/)

We will scrape using headless browsers. Select `PlaywrightCrawler` in the terminal when Crawlee for Python asks for it.

After installation, Crawlee for Python will create boilerplate code for you. Navigate into the project folder and then run this command to install all the dependencies:

```
poetry install
```

## How to scrape infinite scrolling webpages[​](#how-to-scrape-infinite-scrolling-webpages "Direct link to How to scrape infinite scrolling webpages")

1. Handling accept cookie dialog
2. Adding request of all shoes links
3. Extract data from product details
4. Accept Cookies context manager
5. Handling infinite scroll on the listing page
6. Exporting data to CSV format

### Handling accept cookie dialog[​](#handling-accept-cookie-dialog "Direct link to Handling accept cookie dialog")

After all the necessary installations, we'll start looking into the files and configuring them accordingly. When you look into the folder, you'll see many files, but for now, let's focus on `main.py` and `routes.py`.

In `main.py`, let's change the target location to the Nike website. Then, just to see how scraping will happen, we'll add `headless = False` to the `PlaywrightCrawler` parameters. Let's also increase the maximum requests per crawl option to 100 to see the power of parallel scraping in Crawlee for Python.

The final code will look like this:

```
from crawlee.playwright_crawler import PlaywrightCrawler

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    crawler = PlaywrightCrawler(
        headless=False,
        request_handler=router,
        max_requests_per_crawl=100,
    )

    await crawler.run(
        [
            'https://nike.com/',
        ]
    )
```

Now coming to `routes.py`, let's remove:

```
await context.enqueue_links()
```

as we don't want to scrape the whole website.

Now, run the crawler using the command:

```
poetry run python -m nike-crawler
```

The cookie dialog blocks us from crawling more than one page's worth of shoes, so let's get it out of our way. We can handle the cookie dialog by going to Chrome dev tools and looking at the `test_id` of the "accept cookies" button, which is `dialog-accept-button`.

Now, let's remove the `context.push_data` call that was left there from the project template and add the code to accept the dialog in `routes.py`. The updated code will look like this:

```
from crawlee.router import Router
from crawlee.playwright_crawler import PlaywrightCrawlingContext

router = Router[PlaywrightCrawlingContext]()


@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    """Default request handler."""
    # Wait for the popup to be visible to ensure it has loaded on the page.
    await context.page.get_by_test_id('dialog-accept-button').click()
```

### Adding request of all shoes links[​](#adding-request-of-all-shoes-links "Direct link to Adding request of all shoes links")

Now, if you hover over the top bar and see all the sections, i.e., man, woman, and kids, you'll notice the “All shoes” section. As we want to scrape all the sneakers, this section interests us.
Let's use `get_by_test_id` with the filter of `has_text='All shoes'` and add all the links with the text “All shoes” to the request handler. Let's add this code to the existing `routes.py` file:

```
    shoe_listing_links = (
        await context.page.get_by_test_id('link').filter(has_text='All shoes').all()
    )
    await context.add_requests(
        [
            Request.from_url(url, label='listing')
            for link in shoe_listing_links
            if (url := await link.get_attribute('href'))
        ]
    )


@router.handler('listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for shoe listings."""
```

### Extract data from product details[​](#extract-data-from-product-details "Direct link to Extract data from product details")

Now that we have all the links to the pages with the title “All Shoes,” the next step is to scrape all the products on each page and the information provided on them. We'll extract each shoe's URL, title, price, and description. Again, let's go to dev tools and extract each parameter's relevant `test_id`. After scraping each of the parameters, we'll use the `context.push_data` function to add it to the local storage.

Now let's add the following code to the `listing_handler` and update it in the `routes.py` file:

```
@router.handler('listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for shoe listings."""
    await context.enqueue_links(selector='a.product-card__link-overlay', label='detail')


@router.handler('detail')
async def detail_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for shoe details."""
    title = await context.page.get_by_test_id(
        'product_title',
    ).text_content()

    price = await context.page.get_by_test_id(
        'currentPrice-container',
    ).first.text_content()

    description = await context.page.get_by_test_id(
        'product-description',
    ).text_content()

    await context.push_data(
        {
            'url': context.request.loaded_url,
            'title': title,
            'price': price,
            'description': description,
        }
    )
```

### Accept Cookies context manager[​](#accept-cookies-context-manager "Direct link to Accept Cookies context manager")

Since we're dealing with multiple browser pages with multiple links and we want to do infinite scrolling, we may encounter an accept cookie dialog on each page. This will prevent loading more shoes via infinite scroll.

We'll need to check for cookies on every page, as each one may be opened with a fresh session (no stored cookies) and we'll get the accept cookie dialog even though we already accepted it in another browser window. However, if we don't get the dialog, we want the request handler to work as usual.

To solve this problem, we'll try to deal with the dialog in a parallel task that will run in the background. A context manager is a nice abstraction that will allow us to reuse this logic in all the router handlers. So, let's build a context manager (the `asyncio`, `contextlib`, and Playwright imports below are the ones this snippet needs):

```
import asyncio
from contextlib import asynccontextmanager, suppress

from playwright.async_api import Page
from playwright.async_api import TimeoutError as PlaywrightTimeoutError


@asynccontextmanager
async def accept_cookies(page: Page):
    # Try to click the cookie dialog's accept button in a background task.
    task = asyncio.create_task(page.get_by_test_id('dialog-accept-button').click())
    try:
        yield
    finally:
        if not task.done():
            task.cancel()

        with suppress(asyncio.CancelledError, PlaywrightTimeoutError):
            await task
```

This context manager will make sure we're accepting the cookie dialog if it exists before scrolling and scraping the page.
Let's implement it in the `routes.py` file, and the updated code is [here](https://github.com/janbuchar/crawlee-python-demo/blob/6ca6f7f1d1bbbf789a3b86f14bec492cf756251e/crawlee-python-webinar/routes.py).

### Handling infinite scroll on the listing page[​](#handling-infinite-scroll-on-the-listing-page "Direct link to Handling infinite scroll on the listing page")

Now for the last and most interesting part of the tutorial! How to handle the infinite scroll of each shoe listing page and make sure our crawler is scrolling and scraping the data constantly.

This tutorial is taken from the webinar held on August 5th where Jan Buchar, Senior Python Engineer at Apify, gave a live demo about this use case. Watch the tutorial here: [YouTube video player](https://www.youtube.com/embed/ip8Ii0eLfRY?si=7ZllUhMhuC7VC23B\&start=667)

To handle infinite scrolling in Crawlee for Python, we just need to make sure the page is loaded, which is done by waiting for the `network_idle` load state, and then use the `infinite_scroll` helper function which will keep scrolling to the bottom of the page as long as that makes additional items appear.

Let's add two lines of code to the `listing` handler:

```
@router.handler('listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for shoe listings."""
    async with accept_cookies(context.page):
        await context.page.wait_for_load_state('networkidle')
        await context.infinite_scroll()
        await context.enqueue_links(
            selector='a.product-card__link-overlay',
            label='detail',
        )
```

## Exporting data to CSV format[​](#exporting-data-to-csv-format "Direct link to Exporting data to CSV format")

As we want to store all the shoe data into a CSV file, we can just add a call to the `export_data` helper into the `main.py` file just after the crawler run:

```
await crawler.export_data('shoes.csv')
```

## Working crawler and its code[​](#working-crawler-and-its-code "Direct link to Working crawler and its code")

Now, we have a crawler ready that can scrape all the shoes from the Nike website while handling infinite scrolling and many other problems, like the cookies dialog.

You can find the complete working crawler code here on the [GitHub repository](https://github.com/janbuchar/crawlee-python-demo).

Learn more about Crawlee for Python from our latest step by step [tutorial](https://blog.apify.com/crawlee-for-python-tutorial/).

If you have any doubts regarding this tutorial or using Crawlee for Python, feel free to [join our discord community](https://apify.com/discord/) and ask fellow developers or the Crawlee team.
Today, [Crawlee, built in TypeScript,](https://github.com/apify/crawlee) has nearly **13,000 stars on GitHub**, with 90 open-source contributors worldwide building the best web scraping and automation library.

Since the launch, the feedback we’ve received most often [\[1\]](https://discord.com/channels/801163717915574323/999250964554981446/1138826582581059585)[\[2\]](https://discord.com/channels/801163717915574323/801163719198638092/1137702376267059290)[\[3\]](https://discord.com/channels/801163717915574323/1090592836044476426/1103977818221719584) has been to build Crawlee in Python so that the Python community can use all the features the JavaScript community does.

With all these requests in mind and to simplify the life of Python web scraping developers, **we’re launching [Crawlee for Python](https://github.com/apify/crawlee-python) today.** The new library is still in **beta**, and we are looking for **early adopters**.

![Crawlee for Python is looking for early adopters](/assets/images/early-adopters-0c5f38327dd8e5fad85dc127dcabc1f0.webp)

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.

## Why use Crawlee instead of a random HTTP library with an HTML parser?[​](#why-use-crawlee-instead-of-a-random-http-library-with-an-html-parser "Direct link to Why use Crawlee instead of a random HTTP library with an HTML parser?")

* Unified interface for HTTP & headless browser crawling.
  * HTTP - HTTPX with BeautifulSoup,
  * Headless browser - Playwright.
* Automatic parallel crawling based on available system resources.
* Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
* Automatic retries on errors or when you’re getting blocked.
* Integrated proxy rotation and session management.
* Configurable request routing - direct URLs to the appropriate handlers.
* Persistent queue for URLs to crawl.
* Pluggable storage of both tabular data and files.

## Understanding the why behind the features of Crawlee[​](#understanding-the-why-behind-the-features-of-crawlee "Direct link to Understanding the why behind the features of Crawlee")

### Out-of-the-box support for headless browser crawling (Playwright).[​](#out-of-the-box-support-for-headless-browser-crawling-playwright "Direct link to Out-of-the-box support for headless browser crawling (Playwright).")

While libraries like Scrapy require the additional installation of middleware, such as [`scrapy-playwright`](https://github.com/scrapy-plugins/scrapy-playwright), which still doesn’t work on Windows, Crawlee for Python supports a unified interface for HTTP & headless browsers.

Using a headless browser to download web pages and extract data, `PlaywrightCrawler` is ideal for crawling websites that require JavaScript execution. For websites that don’t require JavaScript, consider using the `BeautifulSoupCrawler`, which utilizes raw HTTP requests and will be much faster.
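Before the Playwright example below, here is a minimal sketch of what the HTTP-based approach with `BeautifulSoupCrawler` might look like. The target URL and the request limit are illustrative only, and depending on your Crawlee for Python version the crawler classes may live in `crawlee.crawlers` instead of the module used here:

```
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a crawler instance that uses plain HTTP requests and BeautifulSoup parsing.
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=50,  # illustrative limit
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract the page title straight from the parsed HTML - no browser involved.
        data = {
            'request_url': context.request.url,
            'page_title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```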
```
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = PlaywrightCrawler(
        # headless=False,
        # browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            'request_url': context.request.url,
            'page_url': context.page.url,
            'page_title': await context.page.title(),
            'page_content': (await context.page.content())[:10000],
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

The above example uses Crawlee’s built-in `PlaywrightCrawler` to crawl [https://crawlee.dev/](https://crawlee.dev/index.md) and extract the page title and content.

### Small learning curve[​](#small-learning-curve "Direct link to Small learning curve")

In other libraries like Scrapy, when you run a command to create a new project, you get many files. Then you need to learn about the architecture, including various components (spiders, middlewares, pipelines, etc.). [The learning curve is very steep](https://crawlee.dev/blog/scrapy-vs-crawlee.md#language-and-development-environments).

While building Crawlee, we made sure that the learning curve and the setup would be as fast as possible. With [ready-made templates](https://github.com/apify/crawlee-python/tree/master/templates) and only a single file to add your code to, it's very easy to start building a scraper. You might need to learn a little about request handlers and storage, but that’s all.

### Complete type hint coverage[​](#complete-type-hint-coverage "Direct link to Complete type hint coverage")

We know how much developers like their code to be high-quality, readable, and maintainable. That's why the whole code base of Crawlee is fully type-hinted. Thanks to that, you should have better autocompletion in your IDE, enhancing developer experience while developing your scrapers using Crawlee. Type hinting should also reduce the number of bugs thanks to static type checking.

![Crawlee\_Python\_Type\_Hint](/assets/images/crawlee-python-type-hint-90bb0ec4fb86916d8a6b2512a80f965b.webp)

### Based on Asyncio[​](#based-on-asyncio "Direct link to Based on Asyncio")

Crawlee is fully asynchronous and based on [Asyncio](https://docs.python.org/3/library/asyncio.html). For scraping frameworks, where many IO-bound operations occur, this is crucial to achieving high performance. Also, thanks to Asyncio, integration with other applications or the rest of your system should be easy.

How is this different from the Scrapy framework, which is also asynchronous? Scrapy relies on the "legacy" Twisted framework. Integrating Scrapy with modern Asyncio-based applications can be challenging, often requiring more effort and debugging [\[1\]](https://stackoverflow.com/questions/49201915/debugging-scrapy-project-in-visual-studio-code).

## Power of open source community and early adopters giveaway[​](#power-of-open-source-community-and-early-adopters-giveaway "Direct link to Power of open source community and early adopters giveaway")

Crawlee for Python is fully open-sourced and the codebase is available on the [GitHub repository of Crawlee for Python](https://github.com/apify/crawlee-python). We have already started receiving initial and very [valuable contributions from the Python community](https://github.com/apify/crawlee-python/pull/226).
> Early adopters also said:
>
> “Crawlee for Python development team did a great job in building the product, it makes things faster for a Python developer.”
>
> \~ [Maksym Bohomolov](https://apify.com/mantisus)

There’s still room for improvement. Feel free to open issues, make pull requests, and [star the repository](https://github.com/apify/crawlee-python/) to spread the word to other developers.

**We will award the first 10 pieces of feedback** that add value and are accepted by our team with an exclusive Crawlee for Python swag (the first Crawlee for Python swag ever). Check out the [GitHub issue here](https://github.com/apify/crawlee-python/issues/269/).

With such contributions, we’re excited and looking forward to building an amazing library for the Python community.

Check out a step-by-step guide on how to use Crawlee for Python in our [latest tutorial](https://blog.apify.com/crawlee-for-python-tutorial/).

[Join our Discord community](https://apify.com/discord) with nearly 8,000 web scraping developers, where our team would be happy to help you with any problems or discuss any use case for Crawlee for Python.

---

# How to create a LinkedIn job scraper in Python with Crawlee

October 14, 2024 · 7 min read

[![Arindam Majumder](https://avatars.githubusercontent.com/u/109217591?v=4)](https://github.com/Arindam200) [Arindam Majumder](https://github.com/Arindam200) Community Member of Crawlee

## Introduction[​](#introduction "Direct link to Introduction")

In this article, we will build a web application that scrapes LinkedIn for job postings using Crawlee and Streamlit. We will create a LinkedIn job scraper in Python using Crawlee for Python to extract the company name, job title, time of posting, and link to the job posting from dynamically received user input through the web application.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord).

By the end of this tutorial, you’ll have a fully functional web application that you can use to scrape job postings from LinkedIn.
*Linkedin Job Scraper*
Let's begin.

## Prerequisites[​](#prerequisites "Direct link to Prerequisites")

Let's start by creating a new Crawlee for Python project with this command:

```
pipx run crawlee create linkedin-scraper
```

Select `PlaywrightCrawler` in the terminal when Crawlee asks for it.

After installation, Crawlee for Python will create boilerplate code for you. You can change the directory (`cd`) to the project folder and run this command to install dependencies:

```
poetry install
```

We are going to begin editing the files provided to us by Crawlee so we can build our scraper.

note

Before going ahead, if you like reading this blog, we would be really happy if you gave [Crawlee for Python a star on GitHub](https://github.com/apify/crawlee-python/)!
## Building the LinkedIn job Scraper in Python with Crawlee[​](#building-the-linkedin-job-scraper-in-python-with-crawlee "Direct link to Building the LinkedIn job Scraper in Python with Crawlee")

In this section, we will be building the scraper using the Crawlee for Python package. To learn more about Crawlee, check out their [documentation](https://www.crawlee.dev/python/docs/quick-start).

### 1. Inspecting the LinkedIn job Search Page[​](#1-inspecting-the-linkedin-job-search-page "Direct link to 1. Inspecting the LinkedIn job Search Page")

Open LinkedIn in your web browser and sign out from the website (if you already have an account logged in). You should see an interface like this.

![LinkedIn Homepage](/assets/images/linkedin-homepage-8bec2b6a9ae97a18a7e49d4275c14cee.webp)

Navigate to the jobs section, search for a job and location of your choice, and copy the URL.

![LinkedIn Jobs Page](/assets/images/linkedin-jobs-44e352d2233de5adb7af9838b75b9895.webp)

You should have something like this:

`https://www.linkedin.com/jobs/search?keywords=Backend%20Developer&location=Canada&geoId=101174742&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0`

We're going to focus on the search parameters, which is the part that goes after '?'. The keyword and location parameters are the most important ones for us. The job title the user supplies will be input to the keyword parameter, while the location the user supplies will go into the location parameter. Lastly, the `geoId` parameter will be removed while we keep the other parameters constant.

We are going to be making changes to our `main.py` file. Copy and paste the code below in your `main.py` file.

```
from urllib.parse import urlencode, urljoin

from crawlee.playwright_crawler import PlaywrightCrawler

from .routes import router


async def main(title: str, location: str, data_name: str) -> None:
    base_url = "https://www.linkedin.com/jobs/search"

    # URL encode the parameters
    params = {
        "keywords": title,
        "location": location,
        "trk": "public_jobs_jobs-search-bar_search-submit",
        "position": "1",
        "pageNum": "0",
    }
    encoded_params = urlencode(params)

    # Encode parameters into a query string
    query_string = '?' + encoded_params

    # Combine base URL with the encoded query string
    encoded_url = urljoin(base_url, "") + query_string

    # Initialize the crawler
    crawler = PlaywrightCrawler(
        request_handler=router,
    )

    # Run the crawler with the initial list of URLs
    await crawler.run([encoded_url])

    # Save the data in a CSV file
    output_file = f"{data_name}.csv"
    await crawler.export_data(output_file)
```

Now that we have encoded the URL, the next step for us is to adjust the generated router to handle LinkedIn job postings.

### 2. Routing your crawler[​](#2-routing-your-crawler "Direct link to 2. Routing your crawler")

We will be making use of two handlers for our application:

* **Default handler** - the `default_handler` handles the start URL.
* **Job listing** - the `job_listing` handler extracts the individual job details.

The Playwright crawler is going to crawl through the job posting page and extract the links to all job postings on the page.

![Identifying elements](/assets/images/elements-a634b50a7ad31ae15db61e1a06f5125e.webp)

When you examine the job postings, you will discover that the job posting links are inside an ordered list with a class named `jobs-search__results-list`. We will then extract the links using the Playwright locator object and add them to the `job_listing` route for processing.
``` # routes.py import re from crawlee import Request from crawlee.playwright_crawler import PlaywrightCrawlingContext from crawlee.router import Router router = Router[PlaywrightCrawlingContext]() @router.default_handler async def default_handler(context: PlaywrightCrawlingContext) -> None: """Default request handler.""" # select all the links for the job posting on the page hrefs = await context.page.locator('ul.jobs-search__results-list a').evaluate_all("links => links.map(link => link.href)") # add all the links to the job listing route await context.add_requests( [Request.from_url(rec, label='job_listing') for rec in hrefs] ) ``` Now that we have the job listings, the next step is to scrape their details. We'll extract each job’s title, company's name, time of posting, and the link to the job post. Open your dev tools to extract each element using its CSS selector. ![Inspecting elements](/assets/images/inspect-90f77b162804bd1163b16bb23b315ed8.webp) After scraping each of the listings, we'll remove special characters from the text to make it clean and push the data to local storage using the `context.push_data` function. ``` @router.handler('job_listing') async def listing_handler(context: PlaywrightCrawlingContext) -> None: """Handler for job listings.""" await context.page.wait_for_load_state('load') job_title = await context.page.locator('div.top-card-layout__entity-info h1.top-card-layout__title').text_content() company_name = await context.page.locator('span.topcard__flavor a').text_content() time_of_posting = await context.page.locator('div.topcard__flavor-row span.posted-time-ago__text').text_content() await context.push_data( { # we are making use of regex to remove special characters from the extracted texts 'title': re.sub(r'[\s\n]+', '', job_title), 'Company name': re.sub(r'[\s\n]+', '', company_name), 'Time of posting': re.sub(r'[\s\n]+', '', time_of_posting), 'url': context.request.loaded_url, } ) ``` ### 3. Creating your application[​](#3-creating-your-application "Direct link to 3. Creating your application") For this project, we will be using Streamlit for the web application. Before we proceed, we are going to create a new file named `app.py` in the project directory. In addition, ensure you have [Streamlit](https://docs.streamlit.io/get-started/installation) installed in your global Python environment before proceeding with this section. ``` import streamlit as st import subprocess # Streamlit form for inputs st.title("LinkedIn Job Scraper") with st.form("scraper_form"): title = st.text_input("Job Title", value="backend developer") location = st.text_input("Job Location", value="newyork") data_name = st.text_input("Output File Name", value="backend_jobs") submit_button = st.form_submit_button("Run Scraper") if submit_button: # Run the scraping script with the form inputs command = f"""poetry run python -m linkedin-scraper --title "{title}" --location "{location}" --data_name "{data_name}" """ with st.spinner("Crawling in progress..."): # Execute the command and display the results result = subprocess.run(command, shell=True, capture_output=True, text=True) st.write("Script Output:") st.text(result.stdout) if result.returncode == 0: st.success(f"Data successfully saved in {data_name}.csv") else: st.error(f"Error: {result.stderr}") ``` The Streamlit web application takes in the user's input and uses the Python `subprocess` module to run the Crawlee scraping script. ### 4. Testing your app[​](#4-testing-your-app "Direct link to 4. Testing your app") Before we test the application, we need to make a little modification to the `__main__` file so that it accommodates the command line arguments.
``` import asyncio import argparse from .main import main def get_args(): # ArgumentParser object to capture command-line arguments parser = argparse.ArgumentParser(description="Crawl LinkedIn job listings") # Define the arguments parser.add_argument("--title", type=str, required=True, help="Job title") parser.add_argument("--location", type=str, required=True, help="Job location") parser.add_argument("--data_name", type=str, required=True, help="Name for the output CSV file") # Parse the arguments return parser.parse_args() if __name__ == '__main__': args = get_args() # Run the main function with the parsed command-line arguments asyncio.run(main(args.title, args.location, args.data_name)) ``` We will start the Streamlit application by running this command in the terminal: ``` streamlit run app.py ``` This is what the application should look like in the browser: ![Running scraper](/assets/images/running-555ab15f009be751f516aabd99e6c574.webp) You will get this interface showing you that the scraping has been completed: ![Filling input form](/assets/images/form-774ee8d03c87acfc38d3012d38a9c4ce.webp) To access the scraped data, go over to your project directory and open the CSV file. ![CSV file with all scraped LinkedIn jobs](/assets/images/excel-23850449d4d74099a1264cd93ca8565b.webp) You should have something like this as the output of your CSV file. ## Conclusion[​](#conclusion "Direct link to Conclusion") In this tutorial, we have learned how to build an application that can scrape job posting data from LinkedIn using Crawlee. Have fun building great scraping applications with Crawlee. You can find the complete working crawler code on the [GitHub repository](https://github.com/Arindam200/LinkedIn_Scraping). --- # Building a Netflix show recommender using Crawlee and React June 10, 2024 · 8 min read [![Ayush Thakur](https://avatars.githubusercontent.com/u/43995654?v=4)](https://github.com/ayush2390) [Ayush Thakur](https://github.com/ayush2390) Community Member of Crawlee In this blog, we'll guide you through the process of using Vite and Crawlee to build a website that recommends Netflix shows based on their categories and genres. To do that, we will first scrape the shows and categories from Netflix using Crawlee, and then visualize the scraped data in a React app built with Vite. By the end of this guide, you'll have a functional web show recommender that can provide Netflix show suggestions. note One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). ![How to scrape Netflix using Crawlee and React to build a show recommender](/assets/images/create-netflix-show-recommender-c429467c4a972badaa0b8ab414454250.webp) Let’s get started! ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") To use Crawlee, you need to have Node.js 16 or newer. tip If you like the posts on the Crawlee blog so far, please consider [giving Crawlee a star on GitHub](https://github.com/apify/crawlee), it helps us to reach and help more developers. You can install the latest version of Node.js from the [official website](https://nodejs.org/en/). This great [Node.js installation guide](https://blog.apify.com/how-to-install-nodejs/) gives you tips to avoid issues later on. ## Creating a React app[​](#creating-a-react-app "Direct link to Creating a React app") First, we will create a React app (for the front end) using Vite.
Run this command in the terminal to create it: ``` npx create-vite@latest ``` You can check out the [Vite Docs](https://vitejs.dev/guide/) for more details on how to create a React app. Once the React app is created, open it in VS Code. ![react](/assets/images/react-646682cf5586bf230bf98086a4323845.webp) This will be the structure of your React app. Run the `npm run dev` command in the terminal to run the app. ![viteandreact](/assets/images/viteandreact-57c4bb4028b4d6b7cc9a22b32b70d3f7.webp) This will be the output displayed. ## Adding Scraper code[​](#adding-scraper-code "Direct link to Adding Scraper code") As per our project requirements, we will scrape the genres and the titles of the shows available on Netflix. Let’s start building the scraper code. ### Installation[​](#installation "Direct link to Installation") Run this command to install `crawlee`: ``` npm install crawlee ``` Crawlee utilizes Cheerio for HTML parsing and scraping of static websites. While faster and [less resource-intensive](https://crawlee.dev/js/docs/guides/scaling-crawlers.md), it can only scrape websites that do not require JavaScript rendering, making it unsuitable for SPAs (single-page applications). In this tutorial, we can extract the data from the HTML structure, so we will go with Cheerio. For extracting data from SPAs or JavaScript-rendered websites, Crawlee also supports headless browser libraries like [Playwright](https://playwright.dev/) and [Puppeteer](https://pptr.dev/). After installing the libraries, it’s time to create the scraper code. Create a file in the `src` directory and name it `scraper.js`. The entire scraper code will be created in this file. ### Scraping genres and shows[​](#scraping-genres-and-shows "Direct link to Scraping genres and shows") To scrape the genres and shows, we will utilize the [browser DevTools](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Tools_and_setup/What_are_browser_developer_tools) to identify the tags and CSS selectors targeting the genre elements on the Netflix website. We can capture the HTML structure and call `$(element)` to query the element's subtree. ![genre](/assets/images/genre-cea03ab54c084a8df3139bf584920062.webp) Here, we can observe that the name of the genre is captured by a `span` tag with the `nm-collections-row-name` class. So we can use the `span.nm-collections-row-name` selector to capture this and similar elements. ![title](/assets/images/title-b56306b68714d95cc9e45168906a045f.webp) Similarly, we can observe that the title of the show is captured by the `span` tag having the `nm-collections-title-name` class. So we can use the `span.nm-collections-title-name` selector to capture this and similar elements. ``` // Use parseWithCheerio for efficient HTML parsing const $ = await parseWithCheerio(); // Extract genre and shows directly from the HTML structure const data = $('[data-uia="collections-row"]') .map((_, el) => { const genre = $(el) .find('[data-uia="collections-row-title"]') .text() .trim(); const items = $(el) .find('[data-uia="collections-title"]') .map((_, itemEl) => $(itemEl).text().trim()) .get(); return { genre, items }; }) .get(); const genres = data.map((d) => d.genre); const shows = data.map((d) => d.items); ``` In the code snippet given above, we are using `parseWithCheerio` to parse the HTML content of the current page and extract the `genres` and `shows` information from the HTML structure using Cheerio. This will give us the `genres` and `shows` arrays, containing the lists of genres and shows, respectively.
### Storing data[​](#storing-data "Direct link to Storing data") Now we have all the data that we want for our project and it’s time to store or save the scraped data. To store the data, Crawlee comes with a `pushData()` method. The [pushData()](https://crawlee.dev/js/docs/introduction/saving-data.md) method creates a storage folder in the project directory and stores the scraped data in JSON format. ``` await pushData({ genres: genres, shows: shows, }); ``` This will save the `genres` and `shows` arrays as values in the `genres` and `shows` keys. Here’s the full code that we will use in our project: ``` import { CheerioCrawler, log, Dataset } from "crawlee"; const crawler = new CheerioCrawler({ requestHandler: async ({ request, parseWithCheerio, pushData }) => { log.info(`Processing: ${request.url}`); // Use parseWithCheerio for efficient HTML parsing const $ = await parseWithCheerio(); // Extract genre and shows directly from the HTML structure const data = $('[data-uia="collections-row"]') .map((_, el) => { const genre = $(el) .find('[data-uia="collections-row-title"]') .text() .trim(); const items = $(el) .find('[data-uia="collections-title"]') .map((_, itemEl) => $(itemEl).text().trim()) .get(); return { genre, items }; }) .get(); // Prepare data for pushing const genres = data.map((d) => d.genre); const shows = data.map((d) => d.items); await pushData({ genres: genres, shows: shows, }); }, // Limit crawls for efficiency maxRequestsPerCrawl: 20, }); await crawler.run(["https://www.netflix.com/in/browse/genre/1191605"]); await Dataset.exportToJSON("results"); ``` Now, we will run Crawlee to scrape the website. Before running Crawlee, we need to tweak the `package.json` file. We will add the `start` script targeting the `scraper.js` file to run Crawlee. Add the following code in `'scripts'` object: ``` "start": "node src/scraper.js" ``` and save it. Now run this command to run Crawlee to scrape the data: ``` npm start ``` After running this command, you will see a `storage` folder with the `key_value_stores/default/results.json` file. The scraped data will be stored in JSON format in this file. Now we can use this JSON data and display it in the `App.jsx` component to create the project. In the `App.jsx` component, we will import `jsonData` from the `results.json` file: ``` import { useState } from "react"; import "./App.css"; import jsonData from "../storage/key_value_stores/default/results.json"; function HeaderAndSelector({ handleChange }) { return ( <>

      <h1>Netflix Web Show Recommender</h1>
      {/* Dropdown of scraped genres; the selected option's index is passed up through handleChange */}
      <select onChange={handleChange}>
        <option value="">Select a genre</option>
        {jsonData[0].genres.map((genre, index) => (
          <option key={index} value={index}>
            {genre}
          </option>
        ))}
      </select>
    </>
  );
}

function App() {
  const [count, setCount] = useState(null);

  const handleChange = (event) => {
    const value = event.target.value;
    if (value) setCount(parseInt(value));
  };

  // Validate count to ensure it is within the bounds of the jsonData.shows array
  const isValidCount = count !== null && count <= jsonData[0].shows.length;

  return (
    <div>
      <HeaderAndSelector handleChange={handleChange} />
      {isValidCount && (
        <>
          {/* First column: the first 20 shows of the selected genre */}
          <ul>
            {jsonData[0].shows[count].slice(0, 20).map((show, index) => (
              <li key={index}>{show}</li>
            ))}
          </ul>
          {/* Second column: the remaining shows of the selected genre */}
          <ul>
            {jsonData[0].shows[count].slice(20).map((show, index) => (
              <li key={index}>{show}</li>
            ))}
          </ul>
        </>
      )}
    </div>
); } export default App; ``` In this code snippet, the `genres` array is used to display the list of genres. Users can select their desired genre, and based on that, a list of web shows available on Netflix will be displayed using the `shows` array. Make sure to update the CSS in the `App.css` file from here: and download and save this image file in the main project folder: [Download Image](https://raw.githubusercontent.com/ayush2390/web-show-recommender/main/Netflix.png) Our project is ready! ## Result[​](#result "Direct link to Result") Now, to run your project on localhost, run this command: ``` npm run dev ``` This command will run your project on localhost. Here is a demo of the project: ![result](/assets/images/result-021f50e0c1a5870d2448701c8ca6042d.gif) Project link - In this project, we used Crawlee to scrape Netflix; similarly, Crawlee can be used to scrape single-page applications (SPAs) and JavaScript-rendered websites. The best part is all of this can be done while coding in JavaScript/TypeScript and using a single library. If you want to learn more about Crawlee, go through the [documentation](https://crawlee.dev/js/docs/quick-start.md) and this step-by-step [Crawlee web scraping tutorial](https://blog.apify.com/crawlee-web-scraping-tutorial/) from Apify. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- ## [How to scrape Google search results with Python](https://crawlee.dev/blog/scrape-google-search.md) December 2, 2024 · 7 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert Scraping `Google Search` delivers essential `SERP analysis`, SEO optimization, and data collection capabilities. Modern scraping tools make this process faster and more reliable. note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). In this guide, we'll create a Google Search scraper using [`Crawlee for Python`](https://github.com/apify/crawlee-python) that can handle result ranking and pagination. We'll create a scraper that: * Extracts titles, URLs, and descriptions from search results * Handles multiple search queries * Tracks ranking positions * Processes multiple result pages * Saves data in a structured format ![How to scrape Google search results with Python](/assets/images/google-search-a91bfdf17a4c2860798444b1be56f625.webp) **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) [**Read More**](https://crawlee.dev/blog/scrape-google-search.md) --- ## [Building a Netflix show recommender using Crawlee and React](https://crawlee.dev/blog/netflix-show-recommender.md) June 10, 2024 · 8 min read [![Ayush Thakur](https://avatars.githubusercontent.com/u/43995654?v=4)](https://github.com/ayush2390) [Ayush Thakur](https://github.com/ayush2390) Community Member of Crawlee In this blog, we'll guide you through the process of using Vite and Crawlee to build a website that recommends Netflix shows based on their categories and genres. To do that, we will first scrape the shows and categories from Netflix using Crawlee, and then visualize the scraped data in a React app built with Vite. By the end of this guide, you'll have a functional web show recommender that can provide Netflix show suggestions.
note One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). ![How to scrape Netflix using Crawlee and React to build a show recommender](/assets/images/create-netflix-show-recommender-c429467c4a972badaa0b8ab414454250.webp) **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) [**Read More**](https://crawlee.dev/blog/netflix-show-recommender.md) --- # How Crawlee uses tiered proxies to avoid getting blocked June 24, 2024 · 4 min read [![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager Hello Crawlee community, We are back with another blog, this time explaining how Crawlee rotates proxies and prevents crawlers from getting blocked. Proxies vary in quality, speed, reliability, and cost. There are a [few types of proxies](https://blog.apify.com/types-of-proxies/), such as datacenter and residential proxies. Datacenter proxies are cheaper but, on the other hand, more prone to getting blocked, and vice versa with residential proxies. It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it. note If you like reading this blog, we would be really happy if you gave [Crawlee a star on GitHub!](https://github.com/apify/crawlee/) ## What are tiered proxies?[​](#what-are-tiered-proxies "Direct link to What are tiered proxies?") Tiered proxies are a method of organizing and using different types of proxies based on their quality, speed, reliability, and cost. Tiered proxies allow you to rotate between a mix of proxy types to optimize your scraping activities. You categorize your proxies into different tiers based on their quality. For example: * **High-tier proxies**: Fast, reliable, and expensive. Best for critical tasks where you need high performance. * **Mid-tier proxies**: Moderate speed and reliability. A good balance between cost and performance. * **Low-tier proxies**: Slow and less reliable but cheap. Useful for less critical tasks or high-volume scraping. ## Features:[​](#features "Direct link to Features:") * **Tracking errors**: The system monitors errors (e.g. failed requests, retries) for each domain. * **Adjusting tiers**: Higher-tier proxies are used if a domain shows more errors. Conversely, if a domain performs well with a high-tier proxy, the system will occasionally test lower-tier proxies. If successful, it continues using the lower tier, optimizing costs. * **Forgetting old errors**: Old errors are given less weight over time, allowing the system to adjust tiers dynamically as proxies' performance changes. ## Working[​](#working "Direct link to Working") The `tieredProxyUrls` option in Crawlee's `ProxyConfigurationOptions` allows you to define a list of proxy URLs organized into tiers. Each tier represents a different level of quality, speed, and reliability. 
## Usage[​](#usage "Direct link to Usage") **Fallback Mechanism**: Crawlee starts with the first tier of proxies. If proxies in the current tier fail, it will switch to the next tier. ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ tieredProxyUrls: [ ['http://tier1-proxy1.example.com', 'http://tier1-proxy2.example.com'], ['http://tier2-proxy1.example.com', 'http://tier2-proxy2.example.com'], ['http://tier3-proxy1.example.com', 'http://tier3-proxy2.example.com'], ], }); const crawler = new CheerioCrawler({ proxyConfiguration, requestHandler: async ({ request, response }) => { // Handle the request }, }); await crawler.addRequests([ { url: 'https://example.com/critical' }, { url: 'https://example.com/important' }, { url: 'https://example.com/regular' }, ]); await crawler.run(); ``` ## How tiered proxies use Session Pool under the hood[​](#how-tiered-proxies-use-session-pool-under-the-hood "Direct link to How tiered proxies use Session Pool under the hood") A session pool is a way to manage multiple [sessions](https://crawlee.dev/js/api/core/class/Session.md) on a website so you can distribute your requests across them, reducing the chances of being detected and blocked. You can imagine each session like a different human user with its own IP address. When you use tiered proxies, each proxy tier works with the [session pool](https://crawlee.dev/js/api/core/class/SessionPool.md) to enhance request distribution and manage errors effectively. ![Diagram explaining how tiered proxies use Session Pool under the hood](/assets/images/session-pool-working-a2dee3e83a3444b1330081044b0a234a.webp) For each request, the crawler instance asks the `ProxyConfiguration` which proxy it should use. `ProxyConfiguration` also keeps track of the request domains, and if it sees more requests being retried or, say, more errors, it returns higher proxy tiers. In each request, we must pass the `sessionId` and the request URL to the proxy configuration to get the needed proxy URL from one of the tiers. Choosing which session to pass is where SessionPool comes in. The session pool automatically creates a pool of sessions, rotates them, and uses one of them for each request, mimicking human-like behavior and reducing the chances of getting blocked. ## Conclusion: using proxies efficiently[​](#conclusion-using-proxies-efficiently "Direct link to Conclusion: using proxies efficiently") This inbuilt feature is similar to what Scrapy's `scrapy-rotating-proxies` plugin offers to its users. The tiered proxy configuration dynamically adjusts proxy usage based on real-time performance data, optimizing cost and performance. The session pool ensures requests are distributed across multiple sessions, mimicking human behavior and reducing detection risk. We hope this gives you a better understanding of how Crawlee manages proxies and sessions to make your scraping tasks more effective. As always, we welcome your feedback. [Join our developer community on Discord](https://apify.com/discord) to ask any questions about Crawlee or tell us how you use it. **Tags:** * [proxy](https://crawlee.dev/blog/tags/proxy.md) --- # How to scrape Bluesky with Python March 20, 2025 · 15 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert [Bluesky](https://bsky.app/) is an emerging social network developed by former members of the [Twitter](https://x.com/) (now X) development team.
The platform has been showing significant growth recently, reaching 140.3 million visits according to [SimilarWeb](https://www.similarweb.com/website/bsky.app/#traffic). Like X, Bluesky generates a vast amount of data that can be used for analysis. In this article, we’ll explore how to collect this data using [Crawlee for Python](https://github.com/apify/crawlee-python). note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you’d like to contribute articles like these, please reach out to us on our [discord channel](https://apify.com/discord). ![Banner article](/assets/images/scrape-bluesky-using-python-723c9a74dadb375da06226b1a6a29e10.webp) Key steps we will cover: 1. Project setup 2. Development of the Bluesky crawler in Python 3. Create Apify Actor for Bluesky crawler 4. Conclusion and repository access ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") * Basic understanding of web scraping concepts * Python 3.9 or higher * [UV](https://docs.astral.sh/uv/) version 0.6.0 or higher * Crawlee for Python v0.6.5 or higher * Bluesky account for API access ### Project setup[​](#project-setup "Direct link to Project setup") In this project, we’ll use UV for package management and a specific Python version installed through UV. UV is a fast and modern package manager written in Rust. 1. If you don’t have UV installed yet, follow the [guide](https://docs.astral.sh/uv/getting-started/installation/) or use this command: ``` curl -LsSf https://astral.sh/uv/install.sh | sh ``` 2. Install standalone Python using UV: ``` uv python install 3.13 ``` 3. Create a new project and install Crawlee for Python: ``` uv init bluesky-crawlee --package cd bluesky-crawlee uv add crawlee ``` We’ve created a new isolated Python project with all the necessary dependencies for Crawlee. ## Development of the Bluesky crawler in Python[​](#development-of-the-bluesky-crawler-in-python "Direct link to Development of the Bluesky crawler in Python") note Before going ahead with the project, I'd like to ask you to star Crawlee for Python on [GitHub](https://github.com/apify/crawlee-python/); it helps us to spread the word to fellow scraper developers. ### 1. Identifying the data source[​](#1-identifying-the-data-source "Direct link to 1. Identifying the data source") When accessing the [search page](https://bsky.app/search?q=apify), you'll see data displayed, but be aware of a key limitation: the site only allows viewing the first page of results, preventing access to any additional pages. ![Search Limit](/assets/images/search_limit-c8ee1da0dc9b48fdb6fb125600519ee3.webp) Fortunately, Bluesky provides a well-documented [API](https://docs.bsky.app/docs/get-started) that is accessible to any registered user without additional permissions. This is what we’ll use for data collection. ### 2. Creating a session for API interaction[​](#2-creating-a-session-for-api-interaction "Direct link to 2. Creating a session for API interaction") note For secure API interaction, you need to create a dedicated app password instead of using your main account password. Go to Settings -> Privacy and Security -> [App Passwords](https://bsky.app/settings/app-passwords) and click *Add App Password*. Important: Save the generated password, as it won’t be visible after creation.
Next, create environment variables to store your credentials: * Your application password * Your user identifier (found in your profile and Bluesky URL, for example: [`mantisus.bsky.social`](https://bsky.app/profile/mantisus.bsky.social)) ``` export BLUESKY_APP_PASSWORD=your_app_password export BLUESKY_IDENTIFIER=your_identifier ``` Using the [createSession](https://docs.bsky.app/docs/api/com-atproto-server-create-session), [deleteSession](https://docs.bsky.app/docs/api/com-atproto-server-delete-session) endpoints and [`httpx`](https://www.python-httpx.org/), we can create a session for API interaction. Let us create a class with the necessary methods: ``` import asyncio import json import os import traceback import httpx from yarl import URL from crawlee import ConcurrencySettings, Request from crawlee.configuration import Configuration from crawlee.crawlers import HttpCrawler, HttpCrawlingContext from crawlee.http_clients import HttpxHttpClient from crawlee.storages import Dataset # Environment variables for authentication # BLUESKY_APP_PASSWORD: App-specific password generated from Bluesky settings # BLUESKY_IDENTIFIER: Your Bluesky handle (e.g., username.bsky.social) BLUESKY_APP_PASSWORD = os.getenv('BLUESKY_APP_PASSWORD') BLUESKY_IDENTIFIER = os.getenv('BLUESKY_IDENTIFIER') class BlueskyApiScraper: """A scraper class for extracting data from Bluesky social network using their official API. This scraper manages authentication, concurrent requests, and data collection for both posts and user profiles. It uses separate datasets for storing post and user information. """ def __init__(self) -> None: self._crawler: HttpCrawler | None = None self._users: Dataset | None = None self._posts: Dataset | None = None # Variables for storing session data self._service_endpoint: str | None = None self._user_did: str | None = None self._access_token: str | None = None self._refresh_token: str | None = None self._handle: str | None = None def create_session(self) -> None: """Create credentials for the session.""" url = 'https://bsky.social/xrpc/com.atproto.server.createSession' headers = { 'Content-Type': 'application/json', } data = {'identifier': BLUESKY_IDENTIFIER, 'password': BLUESKY_APP_PASSWORD} response = httpx.post(url, headers=headers, json=data) response.raise_for_status() data = response.json() self._service_endpoint = data['didDoc']['service'][0]['serviceEndpoint'] self._user_did = data['didDoc']['id'] self._access_token = data['accessJwt'] self._refresh_token = data['refreshJwt'] self._handle = data['handle'] def delete_session(self) -> None: """Delete the current session.""" url = f'{self._service_endpoint}/xrpc/com.atproto.server.deleteSession' headers = {'Content-Type': 'application/json', 'authorization': f'Bearer {self._refresh_token}'} response = httpx.post(url, headers=headers) response.raise_for_status() ``` The session expires after 2 hours, so if you plan for your crawler to run longer, you should also add a method for [refresh](https://docs.bsky.app/docs/api/com-atproto-server-refresh-session). ### 3. Configuring Crawlee for Python for data collection[​](#3-configuring-crawlee-for-python-for-data-collection "Direct link to 3. Configuring Crawlee for Python for data collection") Since we’ll be using the official API, we do not need to worry about being blocked by Bluesky. However, we should be careful with the number of requests to avoid overloading Bluesky's servers, so we will configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). 
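The refresh method itself isn't shown in this post, so here is a minimal sketch of what it could look like, mirroring the `delete_session` method above and assuming the `refreshSession` endpoint returns fresh `accessJwt` and `refreshJwt` tokens as described in the linked documentation (the `refresh_session` method name is illustrative):

```
    def refresh_session(self) -> None:
        """Refresh the expiring access token using the stored refresh token (illustrative sketch)."""
        url = f'{self._service_endpoint}/xrpc/com.atproto.server.refreshSession'
        # Like deleteSession, the refresh endpoint is authorized with the refresh token, not the access token
        headers = {'Content-Type': 'application/json', 'authorization': f'Bearer {self._refresh_token}'}
        response = httpx.post(url, headers=headers)
        response.raise_for_status()
        data = response.json()
        # Store the newly issued tokens; any HTTP client headers built from the old access token
        # would also need to be rebuilt after this call.
        self._access_token = data['accessJwt']
        self._refresh_token = data['refreshJwt']
```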
We’ll also configure [`HttpxHttpClient`](https://www.crawlee.dev/python/api/class/HttpxHttpClient) to use custom headers with the current session's `Authorization`. We’ll use 2 endpoints for data collection: [searchPosts](https://docs.bsky.app/docs/api/app-bsky-feed-search-posts) for posts and [getProfile](https://docs.bsky.app/docs/api/app-bsky-actor-get-profile). If you plan to scale the crawler, you can use [getProfiles](https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles) for user data, but in this case, you’ll need to implement deduplication logic. When each link is unique, Crawlee for Python handles this for you. When collecting data, I’d like to separately collect user and post data, so we’ll use different [`Dataset`](https://www.crawlee.dev/python/api/class/Dataset) instances for storage. ``` async def init_crawler(self) -> None: """Initialize the crawler.""" if not self._user_did: raise ValueError('Session not created.') # Initialize the datasets purge the data if it is not empty self._users = await Dataset.open(name='users', configuration=Configuration(purge_on_start=True)) self._posts = await Dataset.open(name='posts', configuration=Configuration(purge_on_start=True)) # Initialize the crawler self._crawler = HttpCrawler( max_requests_per_crawl=100, http_client=HttpxHttpClient( # Set headers for API requests headers={ 'Content-Type': 'application/json', 'Authorization': f'Bearer {self._access_token}', 'Connection': 'Keep-Alive', 'accept-encoding': 'gzip, deflate, br, zstd', } ), # Configuring concurrency of crawling requests concurrency_settings=ConcurrencySettings( min_concurrency=10, desired_concurrency=10, max_concurrency=30, max_tasks_per_minute=200, ), ) self._crawler.router.default_handler(self._search_handler) # Handler for search requests self._crawler.router.handler(label='user')(self._user_handler) # Handler for user requests ``` ### 4. Implementing handlers for data collection[​](#4-implementing-handlers-for-data-collection "Direct link to 4. Implementing handlers for data collection") Now we can implement the handler for searching posts. We’ll save the retrieved posts in `self._posts` and create requests for user data, placing them in the crawler's queue. We also need to handle pagination by forming the link to the next search page. 
``` async def _search_handler(self, context: HttpCrawlingContext) -> None: context.log.info(f'Processing search {context.request.url} ...') data = json.loads(context.http_response.read()) if 'posts' not in data: context.log.warning(f'No posts found in response: {context.request.url}') return user_requests = {} posts = [] profile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile') for post in data['posts']: # Add user request if not already added in current context if post['author']['did'] not in user_requests: user_requests[post['author']['did']] = Request.from_url( url=str(profile_url.with_query(actor=post['author']['did'])), user_data={'label': 'user'}, ) posts.append( { 'uri': post['uri'], 'cid': post['cid'], 'author_did': post['author']['did'], 'created': post['record']['createdAt'], 'indexed': post['indexedAt'], 'reply_count': post['replyCount'], 'repost_count': post['repostCount'], 'like_count': post['likeCount'], 'quote_count': post['quoteCount'], 'text': post['record']['text'], 'langs': '; '.join(post['record'].get('langs', [])), 'reply_parent': post['record'].get('reply', {}).get('parent', {}).get('uri'), 'reply_root': post['record'].get('reply', {}).get('root', {}).get('uri'), } ) await self._posts.push_data(posts) # Push a batch of posts to the dataset await context.add_requests(list(user_requests.values())) if cursor := data.get('cursor'): next_url = URL(context.request.url).update_query({'cursor': cursor}) # Use yarl for update the query string await context.add_requests([str(next_url)]) ``` When receiving user data, we'll store it in the corresponding Dataset `self._users` ``` async def _user_handler(self, context: HttpCrawlingContext) -> None: context.log.info(f'Processing user {context.request.url} ...') data = json.loads(context.http_response.read()) user_item = { 'did': data['did'], 'created': data['createdAt'], 'avatar': data.get('avatar'), 'description': data.get('description'), 'display_name': data.get('displayName'), 'handle': data['handle'], 'indexed': data.get('indexedAt'), 'posts_count': data['postsCount'], 'followers_count': data['followersCount'], 'follows_count': data['followsCount'], } await self._users.push_data(user_item) ``` ### 5. Saving data to files[​](#5-saving-data-to-files "Direct link to 5. Saving data to files") For saving results, we will use the [`write_to_json`](https://www.crawlee.dev/python/api/class/Dataset#write_to_json). ``` async def save_data(self) -> None: """Save the data.""" if not self._users or not self._posts: raise ValueError('Datasets not initialized.') with open('users.json', 'w') as f: await self._users.write_to_json(f, indent=4) with open('posts.json', 'w') as f: await self._posts.write_to_json(f, indent=4) ``` ### 6. Running the crawler[​](#6-running-the-crawler "Direct link to 6. Running the crawler") We have everything needed to complete the crawler. We just need a method to execute the crawling - let us call it `crawl` ``` async def crawl(self, queries: list[str]) -> None: """Crawl the given URL.""" if not self._crawler: raise ValueError('Crawler not initialized.') search_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.feed.searchPosts') await self._crawler.run([str(search_url.with_query(q=query)) for query in queries]) ``` Let's finalize the code: ``` async def run() -> None: """Main execution function that orchestrates the crawling process. Creates a scraper instance, manages the session, and handles the complete crawling lifecycle including proper cleanup on completion or error. 
""" scraper = BlueskyApiScraper() scraper.create_session() try: await scraper.init_crawler() await scraper.crawl(['python', 'apify', 'crawlee']) await scraper.save_data() except Exception: traceback.print_exc() finally: scraper.delete_session() def main() -> None: """Entry point for the crawler application.""" asyncio.run(run()) ``` If you check your `pyproject.toml`, you will see that UV created an entrypoint for running `bluesky-crawlee = "bluesky_crawlee:main"`, so we can run our crawler simply by executing: ``` uv run bluesky-crawlee ``` Let's look at sample results: Posts ![Posts Example](/assets/images/posts-9156686b24a69b73efbc3915f1c8d18e.webp) Users ![Users Example](/assets/images/users-d896c9f24165a0e970d2b26c54def9eb.webp) ## Create Apify Actor for Bluesky crawler[​](#create-apify-actor-for-bluesky-crawler "Direct link to Create Apify Actor for Bluesky crawler") We already have a fully functional implementation for local execution. Let us explore how to adapt it for running on the [Apify Platform](https://apify.com/) and transform in [Apify Actor](https://docs.apify.com/platform/actors). An Actor is a simple and efficient way to deploy your code in the cloud infrastructure on the Apify Platform. You can flexibly interact with the Actor, [schedule regular runs](https://docs.apify.com/platform/schedules) for monitoring data, or [integrate](https://docs.apify.com/platform/integrations) with other tools to build data processing flows. First, create an `.actor` directory with platform configuration files: ``` mkdir .actor && touch .actor/{actor.json,Dockerfile,input_schema.json} ``` Then add [Apify SDK for Python](https://docs.apify.com/sdk/python/) as a project dependency: ``` uv add apify ``` ### Configure Dockerfile[​](#configure-dockerfile "Direct link to Configure Dockerfile") We’ll use the official [Apify Docker image](https://docs.apify.com/academy/deploying-your-code/docker-file) along with recommended [UV practices for Docker](https://docs.astral.sh/uv/guides/integration/docker/): ``` FROM apify/actor-python:3.13 ENV PATH='/app/.venv/bin:$PATH' WORKDIR /app COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ COPY pyproject.toml uv.lock ./ RUN uv sync --frozen --no-install-project --no-editable -q --no-dev COPY . . RUN uv sync --frozen --no-editable -q --no-dev CMD ["bluesky-crawlee"] ``` Here, `bluesky-crawlee` refers to the entrypoint specified in `pyproject.toml`. ### Define project metadata in actor.json[​](#define-project-metadata-in-actorjson "Direct link to Define project metadata in actor.json") The `actor.json` file contains project metadata for Apify Platform. Follow the [documentation for proper configuration](https://docs.apify.com/platform/actors/development/actor-definition/actor-json): ``` { "actorSpecification": 1, "name": "Bluesky-Crawlee", "title": "Bluesky - Crawlee", "minMemoryMbytes": 128, "maxMemoryMbytes": 2048, "description": "Scrape data products from bluesky", "version": "0.1", "meta": { "templateId": "bluesky-crawlee" }, "input": "./input_schema.json", "dockerfile": "./Dockerfile" } ``` ### Define Actor input parameters[​](#define-actor-input-parameters "Direct link to Define Actor input parameters") Our crawler requires several external parameters. 
Let’s define them: * identifier: User's Bluesky identifier (encrypted for security) * appPassword: Bluesky app password (encrypted) * queries: List of search queries for crawling * maxRequestsPerCrawl: Optional limit for testing * mode: Choose between collecting posts or data about the users who post on specific topics Configure the input schema following the [specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1): ``` { "title": "Bluesky - Crawlee", "type": "object", "schemaVersion": 1, "properties": { "identifier": { "title": "Bluesky identifier", "description": "Bluesky identifier for API login", "type": "string", "editor": "textfield", "isSecret": true }, "appPassword": { "title": "Bluesky app password", "description": "Bluesky app password for API", "type": "string", "editor": "textfield", "isSecret": true }, "maxRequestsPerCrawl": { "title": "Max requests per crawl", "description": "Maximum number of requests for crawling", "type": "integer" }, "queries": { "title": "Queries", "type": "array", "description": "Search queries", "editor": "stringList", "prefill": [ "apify" ], "example": [ "apify", "crawlee" ] }, "mode": { "title": "Mode", "type": "string", "description": "Collect posts or users who post on a topic", "enum": [ "posts", "users" ], "default": "posts" } }, "required": [ "identifier", "appPassword", "queries", "mode" ] } ``` ### Update project code[​](#update-project-code "Direct link to Update project code") Remove environment variables and parameterize the code according to the Actor input parameters. Replace named datasets with the default dataset. Add Actor logging: ``` # __init__.py import logging from apify.log import ActorLogFormatter handler = logging.StreamHandler() handler.setFormatter(ActorLogFormatter()) apify_client_logger = logging.getLogger('apify_client') apify_client_logger.setLevel(logging.INFO) apify_client_logger.addHandler(handler) apify_logger = logging.getLogger('apify') apify_logger.setLevel(logging.DEBUG) apify_logger.addHandler(handler) ``` Update imports and entry point code: ``` import asyncio import json import traceback from dataclasses import dataclass import httpx from apify import Actor from yarl import URL from crawlee import ConcurrencySettings, Request from crawlee.crawlers import HttpCrawler, HttpCrawlingContext from crawlee.http_clients import HttpxHttpClient @dataclass class ActorInput: """Actor input schema.""" identifier: str app_password: str queries: list[str] mode: str max_requests_per_crawl: int | None = None async def run() -> None: """Main execution function that orchestrates the crawling process. Creates a scraper instance, manages the session, and handles the complete crawling lifecycle including proper cleanup on completion or error.
""" async with Actor: raw_input = await Actor.get_input() actor_input = ActorInput( identifier=raw_input.get('identifier', ''), app_password=raw_input.get('appPassword', ''), queries=raw_input.get('queries', []), mode=raw_input.get('mode', 'posts'), max_requests_per_crawl=raw_input.get('maxRequestsPerCrawl') ) scraper = BlueskyApiScraper(actor_input.mode, actor_input.max_requests_per_crawl) try: scraper.create_session(actor_input.identifier, actor_input.app_password) await scraper.init_crawler() await scraper.crawl(actor_input.queries) except httpx.HTTPError as e: Actor.log.error(f'HTTP error occurred: {e}') raise except Exception as e: Actor.log.error(f'Unexpected error: {e}') traceback.print_exc() finally: scraper.delete_session() def main() -> None: """Entry point for the scraper application.""" asyncio.run(run()) ``` Update methods with Actor input parameters: ``` class BlueskyApiScraper: """A scraper class for extracting data from Bluesky social network using their official API. This scraper manages authentication, concurrent requests, and data collection for both posts and user profiles. It stores the collected post and user information in the Actor's default dataset. """ def __init__(self, mode: str, max_request: int | None) -> None: self._crawler: HttpCrawler | None = None self.mode = mode self.max_request = max_request # Variables for storing session data self._service_endpoint: str | None = None self._user_did: str | None = None self._access_token: str | None = None self._refresh_token: str | None = None self._handle: str | None = None def create_session(self, identifier: str, password: str) -> None: """Create credentials for the session.""" url = 'https://bsky.social/xrpc/com.atproto.server.createSession' headers = { 'Content-Type': 'application/json', } data = {'identifier': identifier, 'password': password} response = httpx.post(url, headers=headers, json=data) response.raise_for_status() data = response.json() self._service_endpoint = data['didDoc']['service'][0]['serviceEndpoint'] self._user_did = data['didDoc']['id'] self._access_token = data['accessJwt'] self._refresh_token = data['refreshJwt'] self._handle = data['handle'] ``` Implement mode-aware data collection logic: ``` async def _search_handler(self, context: HttpCrawlingContext) -> None: """Handle search requests based on mode.""" context.log.info(f'Processing search {context.request.url} ...') data = json.loads(context.http_response.read()) if 'posts' not in data: context.log.warning(f'No posts found in response: {context.request.url}') return user_requests = {} posts = [] profile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile') for post in data['posts']: if self.mode == 'users' and post['author']['did'] not in user_requests: user_requests[post['author']['did']] = Request.from_url( url=str(profile_url.with_query(actor=post['author']['did'])), user_data={'label': 'user'}, ) elif self.mode == 'posts': posts.append( { 'uri': post['uri'], 'cid': post['cid'], 'author_did': post['author']['did'], 'created': post['record']['createdAt'], 'indexed': post['indexedAt'], 'reply_count': post['replyCount'], 'repost_count': post['repostCount'], 'like_count': post['likeCount'], 'quote_count': post['quoteCount'], 'text': post['record']['text'], 'langs': '; '.join(post['record'].get('langs', [])), 'reply_parent': post['record'].get('reply', {}).get('parent', {}).get('uri'), 'reply_root': post['record'].get('reply', {}).get('root', {}).get('uri'), } ) if self.mode == 'posts': await context.push_data(posts) else: await
context.add_requests(list(user_requests.values())) if cursor := data.get('cursor'): next_url = URL(context.request.url).update_query({'cursor': cursor}) await context.add_requests([str(next_url)]) ``` Update the user handler for the default dataset: ``` async def _user_handler(self, context: HttpCrawlingContext) -> None: """Handle user profile requests.""" context.log.info(f'Processing user {context.request.url} ...') data = json.loads(context.http_response.read()) user_item = { 'did': data['did'], 'created': data['createdAt'], 'avatar': data.get('avatar'), 'description': data.get('description'), 'display_name': data.get('displayName'), 'handle': data['handle'], 'indexed': data.get('indexedAt'), 'posts_count': data['postsCount'], 'followers_count': data['followersCount'], 'follows_count': data['followsCount'], } await context.push_data(user_item) ``` ### Deploy[​](#deploy "Direct link to Deploy") Use the official [Apify CLI](https://docs.apify.com/cli/) to upload your code: Authenticate using your API token from [Apify Console](https://console.apify.com/settings/integrations): ``` apify login ``` Choose "Enter API token manually" and paste your token. Push the project to the platform: ``` apify push ``` Now you can configure runs on the Apify Platform. Let’s perform a test run: Fill in the input parameters: ![Actor Input](/assets/images/input_actor-20bb99df05dea1b2e799d92d6e3750f5.webp) Check that logging works correctly: ![Actor Log](/assets/images/actor_log-c74fa12a02ea0ff9ec3f77cfcb02bc52.webp) View results in the dataset: ![Dataset Results](/assets/images/actor_results-dca44d296e6897737ef338a19b7b2177.webp) If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this [publishing guide](https://docs.apify.com/platform/actors/publishing) for [Apify Store](https://apify.com/store). ## Conclusion and repository access[​](#conclusion-and-repository-access "Direct link to Conclusion and repository access") We’ve created an efficient crawler for Bluesky using the official API. If you want to learn more about this topic for regular data extraction from Bluesky, I recommend exploring [custom feed generation](https://docs.bsky.app/docs/starter-templates/custom-feeds) - I think it opens up some interesting possibilities. And if you need to quickly create a crawler that can retrieve data for various queries, you now have everything you need. You can find the complete code in the [repository](https://github.com/Mantisus/bluesky-crawlee). If you enjoyed this blog, feel free to support Crawlee for Python by starring the [repository](https://github.com/apify/crawlee-python) or joining the maintainer team. Have questions or want to discuss implementation details? Join our [Discord](https://discord.com/invite/jyEM2PRvMU) - our community of 10,000+ developers is there to help. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # How to scrape Crunchbase using Python in 2024 (Easy Guide) January 3, 2025 · 13 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert Python developers know the drill: you need reliable company data, and Crunchbase has it. This guide shows you how to build an effective [Crunchbase](https://www.crunchbase.com/) scraper in Python that gets you the data you need. Crunchbase tracks details that matter: locations, business focus, founders, and investment histories.
Manual extraction from such a large dataset isn't practical -automation is essential for transforming this information into an analyzable format. By the end of this blog, we'll explore three different ways to extract data from Crunchbase using [`Crawlee for Python`](https://github.com/apify/crawlee-python). We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us better understand how important it is to properly [choose the right data source](https://www.crawlee.dev/blog/web-scraping-tips#1-choosing-a-data-source-for-the-project). note This guide comes from a developer in our growing community. Have you built interesting projects with Crawlee? Join us on [Discord](https://discord.com/invite/jyEM2PRvMU) to share your experiences and blog ideas - we value these contributions from developers like you. ![How to Scrape Crunchbase Using Python](/assets/images/scrape_crunchbase-28a71b5380492fe6618bbd9c90989543.webp) Key steps we'll cover: 1. Project setup 2. Choosing the data source 3. Implementing sitemap-based crawler 4. Analysis of search-based approach and its limitations 5. Implementing the official API crawler 6. Conclusion and repository access ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") * Python 3.9 or higher * Familiarity with web scraping concepts * Crawlee for Python `v0.5.0` * poetry `v2.0` or higher ### Project setup[​](#project-setup "Direct link to Project setup") Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (`Playwright` and `Beautifulsoup`), so we'll set up the project manually. 1. Install [`Poetry`](https://python-poetry.org/) ``` pipx install poetry ``` 2. Create and navigate to the project folder. ``` mkdir crunchbase-crawlee && cd crunchbase-crawlee ``` 3. Initialize the project using Poetry, leaving all fields empty. ``` poetry init ``` When prompted: * For "Compatible Python versions", enter: `>={your Python version},<4.0` (For example, if you're using Python 3.10, enter: `>=3.10,<4.0`) * Leave all other fields empty by pressing Enter * Confirm the generation by typing "yes" 4. Add and install Crawlee with necessary dependencies to your project using `Poetry.` ``` poetry add crawlee[parsel,curl-impersonate] ``` 5. Complete the project setup by creating the standard file structure for `Crawlee for Python` projects. ``` mkdir crunchbase-crawlee && touch crunchbase-crawlee/{__init__.py,__main__.py,main.py,routes.py} ``` After setting up the basic project structure, we can explore different methods of obtaining data from Crunchbase. ### Choosing the data source[​](#choosing-the-data-source "Direct link to Choosing the data source") While we can extract target data directly from the [company page](https://www.crunchbase.com/organization/apify), we need to choose the best way to navigate the site. A careful examination of Crunchbase's structure shows that we have three main options for obtaining data: 1. [`Sitemap`](https://www.crunchbase.com/www-sitemaps/sitemap-index.xml) - for complete site traversal. 2. [`Search`](https://www.crunchbase.com/discover/organization.companies) - for targeted data collection. 3. [Official API](https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-getting-started) - recommended method. Let's examine each of these approaches in detail. 
## Scraping Crunchbase using sitemap and Crawlee for Python[​](#scraping-crunchbase-using-sitemap-and-crawlee-for-python "Direct link to Scraping Crunchbase using sitemap and Crawlee for Python") `Sitemap` is a standard way of site navigation used by crawlers like [`Google`](https://google.com/), [`Ahrefs`](https://ahrefs.com/), and other search engines. All crawlers must follow the rules described in [`robots.txt`](https://www.crunchbase.com/robots.txt). Let's look at the structure of Crunchbase's Sitemap: ![Sitemap first lvl](/assets/images/sitemap_lvl_one-553a6b9df5c5d3c35a8987878456fe7b.webp) As you can see, links to organization pages are located inside second-level `Sitemap` files, which are compressed using `gzip`. The structure of one of these files looks like this: ![Sitemap second lvl](/assets/images/sitemap_lvl_two-8f3213f305713ebf8bf91b32febfa234.webp) The `lastmod` field is particularly important here. It allows tracking which companies have updated their information since the previous data collection. This is especially useful for regular data updates. ### 1. Configuring the crawler for scraping[​](#1-configuring-the-crawler-for-scraping "Direct link to 1. Configuring the crawler for scraping") To work with the site, we'll use [`CurlImpersonateHttpClient`](https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient), which impersonates a `Safari` browser. While this choice might seem unexpected for working with a sitemap, it's necessitated by Crunchbase's protection features. The reason is that Crunchbase uses [Cloudflare](https://www.cloudflare.com/) to protect against automated access. This is clearly visible when analyzing traffic on a company page: ![Cloudflare Link](/assets/images/cloudflare_link-bf8b6ba2c873ccb31463258e5964e39b.webp) An interesting feature is that `challenges.cloudflare` is executed after loading the document with data. This means we receive the data first, and only then JavaScript checks if we're a bot. If our HTTP client's fingerprint is sufficiently similar to a real browser, we'll successfully receive the data. Cloudflare [also analyzes traffic at the sitemap level](https://developers.cloudflare.com/waf/custom-rules/use-cases/allow-traffic-from-verified-bots/). If our crawler doesn't look legitimate, access will be blocked. That's why we impersonate a real browser. To prevent blocks due to overly aggressive crawling, we'll configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). When scaling this approach, you'll likely need proxies. Detailed information about proxy setup can be found in the [documentation](https://www.crawlee.dev/python/docs/guides/proxy-management). We'll save our scraping results in `JSON` format. 
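As a side note on the proxy point above: when you do need proxies, adding them is a small change to the crawler setup shown next. Here is a minimal sketch, assuming you already have your own proxy URLs (the example hostnames are placeholders, and the exact options are described in the proxy management documentation linked above):

```
# main.py (optional) - wiring proxies into the crawler configured below
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://my-proxy-1.example.com:8000',
        'http://my-proxy-2.example.com:8000',
    ],
)

# Then pass it to the crawler alongside the other options:
# crawler = ParselCrawler(..., proxy_configuration=proxy_configuration)
```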
Here's how the basic crawler configuration looks: ``` # main.py from crawlee import ConcurrencySettings, HttpHeaders from crawlee.crawlers import ParselCrawler from crawlee.http_clients import CurlImpersonateHttpClient from .routes import router async def main() -> None: """The crawler entry point.""" concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=50) http_client = CurlImpersonateHttpClient( impersonate='safari17_0', headers=HttpHeaders( { 'accept-language': 'en', 'accept-encoding': 'gzip, deflate, br, zstd', } ), ) crawler = ParselCrawler( request_handler=router, max_request_retries=1, concurrency_settings=concurrency_settings, http_client=http_client, max_requests_per_crawl=30, ) await crawler.run(['https://www.crunchbase.com/www-sitemaps/sitemap-index.xml']) await crawler.export_data_json('crunchbase_data.json') ``` ### 2. Implementing sitemap navigation[​](#2-implementing-sitemap-navigation "Direct link to 2. Implementing sitemap navigation") Sitemap navigation happens in two stages. In the first stage, we need to get a list of all files containing organization information: ``` # routes.py from crawlee.crawlers import ParselCrawlingContext from crawlee.router import Router from crawlee import Request router = Router[ParselCrawlingContext]() @router.default_handler async def default_handler(context: ParselCrawlingContext) -> None: """Default request handler.""" context.log.info(f'default_handler processing {context.request} ...') requests = [ Request.from_url(url, label='sitemap') for url in context.selector.xpath('//loc[contains(., "sitemap-organizations")]/text()').getall() ] # Since this is a tutorial, I don't want to upload more than one sitemap link await context.add_requests(requests, limit=1) ``` In the second stage, we process second-level sitemap files stored in `gzip` format. This requires a special approach as the data needs to be decompressed first: ``` # routes.py from gzip import decompress from parsel import Selector @router.handler('sitemap') async def sitemap_handler(context: ParselCrawlingContext) -> None: """Sitemap gzip request handler.""" context.log.info(f'sitemap_handler processing {context.request.url} ...') data = context.http_response.read() data = decompress(data) selector = Selector(data.decode()) requests = [Request.from_url(url, label='company') for url in selector.xpath('//loc/text()').getall()] await context.add_requests(requests) ``` ### 3. Extracting and saving data[​](#3-extracting-and-saving-data "Direct link to 3. Extracting and saving data") Each company page contains a large amount of information. For demonstration purposes, we'll focus on the main fields: `Company Name`, `Short Description`, `Website`, and `Location`. 
One of Crunchbase's advantages is that all data is stored in `JSON` format within the page: ![Company Data](/assets/images/data_json-7c79a7387510a995f29ba5ce157f0845.webp) This significantly simplifies data extraction - we only need to use one `Xpath` selector to get the `JSON`, and then apply [`jmespath`](https://jmespath.org/) to extract the needed fields: ``` # routes.py @router.handler('company') async def company_handler(context: ParselCrawlingContext) -> None: """Company request handler.""" context.log.info(f'company_handler processing {context.request.url} ...') json_selector = context.selector.xpath('//*[@id="ng-state"]/text()') await context.push_data( { 'Company Name': json_selector.jmespath('HttpState.*.data[].properties.identifier.value').get(), 'Short Description': json_selector.jmespath('HttpState.*.data[].properties.short_description').get(), 'Website': json_selector.jmespath('HttpState.*.data[].cards.company_about_fields2.website.value').get(), 'Location': '; '.join( json_selector.jmespath( 'HttpState.*.data[].cards.company_about_fields2.location_identifiers[].value' ).getall() ), } ) ``` The collected data is saved in `Crawlee for Python`'s internal storage using the `context.push_data` method. When the crawler finishes, we export all collected data to a JSON file: ``` # main.py await crawler.export_data_json('crunchbase_data.json') ``` ### 4. Running the project[​](#4-running-the-project "Direct link to 4. Running the project") With all components in place, we need to create an entry point for our crawler: ``` # __main__.py import asyncio from .main import main if __name__ == '__main__': asyncio.run(main()) ``` Execute the crawler using Poetry: ``` poetry run python -m crunchbase-crawlee ``` ### 5. Finally, characteristics of using the sitemap crawler[​](#5-finally-characteristics-of-using-the-sitemap-crawler "Direct link to 5. Finally, characteristics of using the sitemap crawler") The sitemap approach has its distinct advantages and limitations. It's ideal in the following cases: * When you need to collect data about all companies on the platform * When there are no specific company selection criteria * If you have sufficient time and computational resources However, there are significant limitations to consider: * Almost no ability to filter data during collection * Requires constant monitoring of Cloudflare blocks * Scaling the solution requires proxy servers, which increases project costs ## Using search for scraping Crunchbase[​](#using-search-for-scraping-crunchbase "Direct link to Using search for scraping Crunchbase") The limitations of the sitemap approach might point to search as the next solution. However, Crunchbase applies tighter security measures to its search functionality compared to its public pages. The key difference lies in how Cloudflare protection works. While we receive data before the `challenges.cloudflare` check when accessing a company page, the search API requires valid `cookies` that have passed this check. Let's verify this in practice. Open the following link in Incognito mode: ``` ``` When analyzing the traffic, we'll see the following pattern: ![Search Protect](/assets/images/search_protect-3b4a1a1934d54c12ac210217919b8b88.webp) The sequence of events here is: 1. First, the page is blocked with code `403` 2. Then the `challenges.cloudflare` check is performed 3. 
Only after successfully passing the check do we receive data with code `200` Automating this process would require a `headless` browser capable of bypassing [`Cloudflare Turnstile`](https://www.cloudflare.com/application-services/products/turnstile/). The current version of `Crawlee for Python` (v0.5.0) doesn't provide this functionality, although it's planned for future development. You can extend the capabilities of Crawlee for Python by integrating [`Camoufox`](https://camoufox.com/) following this [example.](https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox) ## Working with the official Crunchbase API[​](#working-with-the-official-crunchbase-api "Direct link to Working with the official Crunchbase API") Crunchbase provides a [free API](https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-using-api) with basic functionality. Paid subscription users get expanded data access. Complete documentation for available endpoints can be found in the [official API specification](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api). ### 1. Setting up API access[​](#1-setting-up-api-access "Direct link to 1. Setting up API access") To start working with the API, follow these steps: 1. [Create a Crunchbase account](https://www.crunchbase.com/register) 2. Go to the Integrations section 3. Create a Crunchbase Basic API key Although the documentation states that key activation may take up to an hour, it usually starts working immediately after creation. ### 2. Configuring the crawler for API work[​](#2-configuring-the-crawler-for-api-work "Direct link to 2. Configuring the crawler for API work") An important API feature is the limit - no more than 200 requests per minute, but in the free version, this number is significantly lower. Taking this into account, let's configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). Since we're working with the official API, we don't need to mask our HTTP client. We'll use the standard ['HttpxHttpClient'](https://www.crawlee.dev/python/api/class/HttpxHttpClient) with preset headers. First, let's save the API key in an environment variable: ``` export CRUNCHBASE_TOKEN={YOUR KEY} ``` Here's how the crawler configuration for working with the API looks: ``` # main.py import os from crawlee.crawlers import HttpCrawler from crawlee.http_clients import HttpxHttpClient from crawlee import ConcurrencySettings, HttpHeaders from .routes import router CRUNCHBASE_TOKEN = os.getenv('CRUNCHBASE_TOKEN', '') async def main() -> None: """The crawler entry point.""" concurrency_settings = ConcurrencySettings(max_tasks_per_minute=60) http_client = HttpxHttpClient( headers=HttpHeaders({'accept-encoding': 'gzip, deflate, br, zstd', 'X-cb-user-key': CRUNCHBASE_TOKEN}) ) crawler = HttpCrawler( request_handler=router, concurrency_settings=concurrency_settings, http_client=http_client, max_requests_per_crawl=30, ) await crawler.run( ['https://api.crunchbase.com/api/v4/autocompletes?query=apify&collection_ids=organizations&limit=25'] ) await crawler.export_data_json('crunchbase_data.json') ``` ### 3. Processing search results[​](#3-processing-search-results "Direct link to 3. Processing search results") For working with the API, we'll need two main endpoints: 1. [get\_autocompletes](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Autocomplete/get_autocompletes) - for searching 2. 
[get\_entities\_organizations\_\_entity\_id](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Entity/get_entities_organizations__entity_id_) - for getting data

First, let's implement search results processing:

```
import json

from crawlee import Request
from crawlee.crawlers import HttpCrawlingContext
from crawlee.router import Router

router = Router[HttpCrawlingContext]()


@router.default_handler
async def default_handler(context: HttpCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'default_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    requests = []
    for entity in data['entities']:
        permalink = entity['identifier']['permalink']
        requests.append(
            Request.from_url(
                url=f'https://api.crunchbase.com/api/v4/entities/organizations/{permalink}?field_ids=short_description%2Clocation_identifiers%2Cwebsite_url',
                label='company',
            )
        )

    await context.add_requests(requests)
```

### 4. Extracting company data[​](#4-extracting-company-data "Direct link to 4. Extracting company data")

After getting the list of companies, we extract detailed information about each one:

```
@router.handler('company')
async def company_handler(context: HttpCrawlingContext) -> None:
    """Company request handler."""
    context.log.info(f'company_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    await context.push_data(
        {
            'Company Name': data['properties']['identifier']['value'],
            'Short Description': data['properties']['short_description'],
            'Website': data['properties'].get('website_url'),
            'Location': '; '.join([item['value'] for item in data['properties'].get('location_identifiers', [])]),
        }
    )
```

### 5. Advanced location-based search[​](#5-advanced-location-based-search "Direct link to 5. Advanced location-based search")

If you need more flexible search capabilities, the API provides a special [`search`](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Search/post_searches_organizations) endpoint.
Here's an example of searching for all companies in Prague: ``` payload = { 'field_ids': ['identifier', 'location_identifiers', 'short_description', 'website_url'], 'limit': 200, 'order': [{'field_id': 'rank_org', 'sort': 'asc'}], 'query': [ { 'field_id': 'location_identifiers', 'operator_id': 'includes', 'type': 'predicate', 'values': ['e0b951dc-f710-8754-ddde-5ef04dddd9f8'], }, {'field_id': 'facet_ids', 'operator_id': 'includes', 'type': 'predicate', 'values': ['company']}, ], } serialiazed_payload = json.dumps(payload) await crawler.run( [ Request.from_url( url='https://api.crunchbase.com/api/v4/searches/organizations', method='POST', payload=serialiazed_payload, use_extended_unique_key=True, headers=HttpHeaders({'Content-Type': 'application/json'}), label='search', ) ] ) ``` For processing search results and pagination, we use the following handler: ``` @router.handler('search') async def search_handler(context: HttpCrawlingContext) -> None: """Search results handler with pagination support.""" context.log.info(f'search_handler processing {context.request.url} ...') data = json.loads(context.http_response.read()) last_entity = None results = [] for entity in data['entities']: last_entity = entity['uuid'] results.append( { 'Company Name': entity['properties']['identifier']['value'], 'Short Description': entity['properties']['short_description'], 'Website': entity['properties'].get('website_url'), 'Location': '; '.join([item['value'] for item in entity['properties'].get('location_identifiers', [])]), } ) if results: await context.push_data(results) if last_entity: payload = json.loads(context.request.payload) payload['after_id'] = last_entity payload = json.dumps(payload) await context.add_requests( [ Request.from_url( url='https://api.crunchbase.com/api/v4/searches/organizations', method='POST', payload=payload, use_extended_unique_key=True, headers=HttpHeaders({'Content-Type': 'application/json'}), label='search', ) ] ) ``` ### 6. Finally, free API limitations[​](#6-finally-free-api-limitations "Direct link to 6. Finally, free API limitations") The free version of the API has significant limitations: * Limited set of available endpoints * Autocompletes function only works for company searches * Not all data fields are accessible * Limited search filtering capabilities Consider a paid subscription for production-level work. The API provides the most reliable way to access Crunchbase data, even with its rate constraints. ## What’s your best path forward?[​](#whats-your-best-path-forward "Direct link to What’s your best path forward?") We've explored three different approaches to obtaining data from Crunchbase: 1. **Sitemap** - for large-scale data collection 2. **Search** - difficult to automate due to Cloudflare protection 3. **Official API** - the most reliable solution for commercial projects Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version. The complete source code is available in my [repository](https://github.com/Mantisus/crunchbase-crawlee). Have questions or want to discuss implementation details? Join our [Discord](https://discord.com/invite/jyEM2PRvMU) - our community of developers is there to help. 
**Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # How to scrape Google Maps data using Python December 13, 2024 · 12 min read [![Satyam Tripathi](https://avatars.githubusercontent.com/u/69134468?v=4)](https://github.com/triposat) [Satyam Tripathi](https://github.com/triposat) Community Member of Crawlee Millions of people use Google Maps daily, leaving behind a goldmine of data just waiting to be analyzed. In this guide, I'll show you how to build a reliable scraper using Crawlee and Python to extract locations, ratings, and reviews from Google Maps, all while handling its dynamic content challenges. note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). ## What data will we extract from Google Maps?[​](#what-data-will-we-extract-from-google-maps "Direct link to What data will we extract from Google Maps?") We’ll collect information about hotels in a specific city. You can also customize your search to meet your requirements. For example, you might search for "hotels near me", "5-star hotels in Bombay", or other similar queries. ![Google Maps Data Screenshot](/assets/images/scrape-google-maps-with-crawlee-screenshot-data-to-scrape-00e7e4e3498679b8a7611eafd0a1bfbe.webp) We’ll extract important data, including the hotel name, rating, review count, price, a link to the hotel page on Google Maps, and all available amenities. Here’s an example of what the extracted data will look like: ``` { "name": "Vividus Hotels, Bangalore", "rating": "4.3", "reviews": "633", "price": "₹3,667", "amenities": [ "Pool available", "Free breakfast available", "Free Wi-Fi available", "Free parking available" ], "link": "https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..." } ``` ## Building a Google Maps scraper[​](#building-a-google-maps-scraper "Direct link to Building a Google Maps scraper") Let's build a Google Maps scraper step-by-step. note Crawlee requires Python 3.9 or later. ### 1. Setting up your environment[​](#1-setting-up-your-environment "Direct link to 1. Setting up your environment") First, let's set up everything you’ll need to run the scraper. Open your terminal and run these commands: ``` # Create and activate a virtual environment python -m venv google-maps-scraper # Windows: .\google-maps-scraper\Scripts\activate # Mac/Linux: source google-maps-scraper/bin/activate # We plan to use Playwright with Crawlee, so we need to install both: pip install crawlee "crawlee[playwright]" playwright install ``` *If you're new to **Crawlee**, check out its easy-to-follow documentation. It’s available for both [Node.js](https://www.crawlee.dev/js/docs/quick-start) and [Python](https://www.crawlee.dev/python/docs/quick-start).* note Before going ahead with the project, I'd like to ask you to star Crawlee for Python on [GitHub](https://github.com/apify/crawlee-python/), it helps us to spread the word to fellow scraper developers. ### 2. Connecting to Google Maps[​](#2-connecting-to-google-maps "Direct link to 2. Connecting to Google Maps") Let's see the steps to connect to Google Maps. **Step 1: Setting up the crawler** The first step is to configure the crawler. We're using [`PlaywrightCrawler`](https://www.crawlee.dev/python/api/class/PlaywrightCrawler) from Crawlee, which gives us powerful tools for automated browsing. 
We set `headless=False` to make the browser visible during scraping and allow 5 minutes for the pages to load. ``` from crawlee.playwright_crawler import PlaywrightCrawler from datetime import timedelta # Initialize crawler with browser visibility and timeout settings crawler = PlaywrightCrawler( headless=False, # Shows the browser window while scraping request_handler_timeout=timedelta( minutes=5 ), # Allows plenty of time for page loading ) ``` **Step 2: Handling each page** This function defines how each page is handled when the crawler visits it. It uses `context.page` to navigate to the target URL. ``` async def scrape_google_maps(context): """ Establishes connection to Google Maps and handles the initial page load """ page = context.page await page.goto(context.request.url) context.log.info(f"Processing: {context.request.url}") ``` **Step 3: Launching the crawler** Finally, the main function brings everything together. It creates a search URL, sets up the crawler, and starts the scraping process. ``` import asyncio async def main(): # Prepare the search URL search_query = "hotels in bengaluru" start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}" # Tell the crawler how to handle each page it visits crawler.router.default_handler(scrape_google_maps) # Start the scraping process await crawler.run([start_url]) if __name__ == "__main__": asyncio.run(main()) ``` Let’s combine the above code snippets and save them in a file named `gmap_scraper.py`: ``` from crawlee.playwright_crawler import PlaywrightCrawler from datetime import timedelta import asyncio async def scrape_google_maps(context): """ Establishes connection to Google Maps and handles the initial page load """ page = context.page await page.goto(context.request.url) context.log.info(f"Processing: {context.request.url}") async def main(): """ Configures and launches the crawler with custom settings """ # Initialize crawler with browser visibility and timeout settings crawler = PlaywrightCrawler( headless=False, # Shows the browser window while scraping request_handler_timeout=timedelta( minutes=5 ), # Allows plenty of time for page loading ) # Tell the crawler how to handle each page it visits crawler.router.default_handler(scrape_google_maps) # Prepare the search URL search_query = "hotels in bengaluru" start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}" # Start the scraping process await crawler.run([start_url]) if __name__ == "__main__": asyncio.run(main()) ``` Run the code using: ``` $ python3 gmap_scraper.py ``` When everything works correctly, you'll see the output like this: ![Connect to page](/assets/images/scrape-google-maps-with-crawlee-screenshot-connect-to-page-6d6391022d64446a161825935a307d8d.png) ### 3. Import dependencies and defining Scraper Class[​](#3-import-dependencies-and-defining-scraper-class "Direct link to 3. 
Import dependencies and defining Scraper Class") Let's start with the basic structure and necessary imports: ``` import asyncio from datetime import timedelta from typing import Dict, Optional, Set from crawlee.playwright_crawler import PlaywrightCrawler from playwright.async_api import Page, ElementHandle ``` The `GoogleMapsScraper` class serves as the main scraper engine: ``` class GoogleMapsScraper: def __init__(self, headless: bool = True, timeout_minutes: int = 5): self.crawler = PlaywrightCrawler( headless=headless, request_handler_timeout=timedelta(minutes=timeout_minutes), ) self.processed_names: Set[str] = set() async def setup_crawler(self) -> None: self.crawler.router.default_handler(self._scrape_listings) ``` This initialization code sets up two crucial components: 1. A `PlaywrightCrawler` instance configured to run either headlessly (without a visible browser window) or with a visible browser 2. A set to track processed business names, preventing duplicate entries The `setup_crawler` method configures the crawler to use our main scraping function as the default handler for all requests. ### 4. Understanding Google Maps internal code structure[​](#4-understanding-google-maps-internal-code-structure "Direct link to 4. Understanding Google Maps internal code structure") Before we dive into scraping, let's understand exactly what elements we need to target. When you search for hotels in Bengaluru, Google Maps organizes hotel information in a specific structure. Here's a detailed breakdown of how to locate each piece of information. **Hotel name:** ![Hotel name](/assets/images/scrape-google-maps-with-crawlee-screenshot-name-d1fcc59eb4e3eec109fcbf5be0237fbc.webp) **Hotel rating:** ![Hotel rating](/assets/images/scrape-google-maps-with-crawlee-screenshot-ratings-7748ca46b1e14126de728add8313d286.webp) **Hotel review count:** ![Hotel Review Count](/assets/images/scrape-google-maps-with-crawlee-screenshot-reviews-521c92ebf7eeefb615659e0cd9cce6eb.webp) **Hotel URL:** ![Hotel URL](/assets/images/scrape-google-maps-with-crawlee-screenshot-url-ef8f37822fe579765ece5c37c1f8fdeb.webp) **Hotel Price:** ![Hotel Price](/assets/images/scrape-google-maps-with-crawlee-screenshot-price-a2ab8516020bfcbfd6054d889f871743.webp) **Hotel amenities:** This returns multiple elements as each hotel has several amenities. We'll need to iterate through these. ![Hotel amenities](/assets/images/scrape-google-maps-with-crawlee-screenshot-amenities-8a138b2fc9d7c4fad6a81bec55ee5db7.webp) **Quick tips:** * Always verify these selectors before scraping, as Google might update them. * Use Chrome DevTools (F12) to inspect elements and confirm selectors. * Some elements might not be present for all hotels (like prices during the off-season). ### 5. Scraping Google Maps data using identified selectors[​](#5-scraping-google-maps-data-using-identified-selectors "Direct link to 5. Scraping Google Maps data using identified selectors") Let's build a scraper to extract detailed hotel information from Google Maps. First, create the core scraping function to handle data extraction. 
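To keep the selectors from the screenshots above in one place, here's a small reference map. The `LISTING_SELECTORS` name is purely illustrative - the scraper below uses these literals inline - and the values should be re-checked in DevTools before each run, since Google regenerates its class names:

```
# Reference only: CSS selectors for a single listing card, as used below.
LISTING_SELECTORS = {
    "listing_card": ".Nv2PK",   # container of one search result
    "name": ".qBF1Pd",          # hotel name
    "rating": ".MW4etd",        # star rating
    "reviews": ".UY7F9",        # review count (wrapped in parentheses)
    "price": ".wcldff",         # nightly price (may be absent)
    "link": "a.hfpxzc",         # link to the hotel's Google Maps page
    "amenities": ".dc6iWb",     # one element per amenity
}
```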
*gmap\_scraper.py:*

```
async def _extract_listing_data(self, listing: ElementHandle) -> Optional[Dict]:
    """Extract structured data from a single listing element."""
    try:
        name_el = await listing.query_selector(".qBF1Pd")
        if not name_el:
            return None

        name = await name_el.inner_text()
        if name in self.processed_names:
            return None

        elements = {
            "rating": await listing.query_selector(".MW4etd"),
            "reviews": await listing.query_selector(".UY7F9"),
            "price": await listing.query_selector(".wcldff"),
            "link": await listing.query_selector("a.hfpxzc"),
            "address": await listing.query_selector(".W4Efsd:nth-child(2)"),
            "category": await listing.query_selector(".W4Efsd:nth-child(1)"),
        }

        amenities = []
        amenities_els = await listing.query_selector_all(".dc6iWb")
        for amenity in amenities_els:
            amenity_text = await amenity.get_attribute("aria-label")
            if amenity_text:
                amenities.append(amenity_text)

        place_data = {
            "name": name,
            "rating": await elements["rating"].inner_text() if elements["rating"] else None,
            "reviews": (await elements["reviews"].inner_text()).strip("()") if elements["reviews"] else None,
            "price": await elements["price"].inner_text() if elements["price"] else None,
            "address": await elements["address"].inner_text() if elements["address"] else None,
            "category": await elements["category"].inner_text() if elements["category"] else None,
            "amenities": amenities if amenities else None,
            "link": await elements["link"].get_attribute("href") if elements["link"] else None,
        }

        self.processed_names.add(name)
        return place_data
    except Exception as e:
        # No crawling context is available inside this helper, so report the error directly
        print(f"Error extracting listing data: {e}")
        return None
```

In the code:

* `query_selector`: Returns the first DOM element matching a CSS selector, useful for single items like a name or rating
* `query_selector_all`: Returns all matching elements, ideal for multiple items like amenities
* `inner_text()`: Extracts the text content of an element
* Some hotels might not have all the information available - we handle this by storing `None` for the missing fields

When you run this script, you'll see output similar to this:

```
{
    "name": "GRAND KALINGA HOTEL",
    "rating": "4.2",
    "reviews": "1,171",
    "price": "\u20b91,760",
    "link": "https://www.google.com/maps/place/GRAND+KALINGA+HOTEL/data=!4m10!3m9!1s0x3bae160e0ce07789:0xb15bf736f4238e6a!5m2!4m1!1i2!8m2!3d12.9762259!4d77.5786043!16s%2Fg%2F11sp32pz28!19sChIJiXfgDA4WrjsRao4j9Db3W7E?authuser=0&hl=en&rclk=1",
    "amenities": [
        "Pool available",
        "Free breakfast available",
        "Free Wi-Fi available",
        "Free parking available"
    ]
}
```

### 6. Managing Infinite Scrolling[​](#6-managing-infinite-scrolling "Direct link to 6. Managing Infinite Scrolling")

Google Maps uses infinite scrolling to load more results as users scroll down. We handle this with a dedicated method that scrolls the results feed and detects when we've hit the bottom.
Copy-paste this new function in the `gmap_scraper.py` file:

```
async def _load_more_items(self, page: Page) -> bool:
    """Scroll down to load more items."""
    try:
        feed = await page.query_selector('div[role="feed"]')
        if not feed:
            return False

        prev_scroll = await feed.evaluate("(element) => element.scrollTop")
        await feed.evaluate("(element) => element.scrollTop += 800")
        await page.wait_for_timeout(2000)

        new_scroll = await feed.evaluate("(element) => element.scrollTop")
        if new_scroll <= prev_scroll:
            return False

        await page.wait_for_timeout(1000)
        return True
    except Exception as e:
        # No crawling context is available inside this helper, so report the error directly
        print(f"Error during scroll: {e}")
        return False
```

Run this code using:

```
$ python3 gmap_scraper.py
```

You should see an output like this:

![scrape-google-maps-with-crawlee-screenshot-handle-pagination](/assets/images/scrape-google-maps-with-crawlee-screenshot-handle-pagination-319232595ced535f175346ae0003e32f.webp)

### 7. Scrape Listings[​](#7-scrape-listings "Direct link to 7. Scrape Listings")

The main scraping function ties everything together. It scrapes listings from the page by repeatedly extracting data and scrolling.

```
async def _scrape_listings(self, context) -> None:
    """Main scraping function to process all listings"""
    try:
        page = context.page
        print(f"\nProcessing URL: {context.request.url}\n")

        await page.wait_for_selector(".Nv2PK", timeout=30000)
        await page.wait_for_timeout(2000)

        while True:
            listings = await page.query_selector_all(".Nv2PK")
            new_items = 0

            for listing in listings:
                place_data = await self._extract_listing_data(listing)
                if place_data:
                    await context.push_data(place_data)
                    new_items += 1
                    print(f"Processed: {place_data['name']}")

            if new_items == 0 and not await self._load_more_items(page):
                break

            if new_items > 0:
                await self._load_more_items(page)

        print(f"\nFinished processing! Total items: {len(self.processed_names)}")
    except Exception as e:
        print(f"Error in scraping: {str(e)}")
```

The scraper uses Crawlee's built-in storage system to manage scraped data. When you run the scraper, it creates a `storage` directory in your project with several key components:

* `datasets/`: Contains the scraped results in JSON format
* `key_value_stores/`: Stores crawler state and metadata
* `request_queues/`: Manages URLs to be processed

The `push_data()` method we use in our scraper sends the data to Crawlee's dataset storage as you can see below:

![Crawlee push\_data](/assets/images/How-to-scrape-Google-Maps-data-using-Python-and-Crawlee-metadata-a27257a5ffffad0fdcc598064445fe57.webp)

### 8. Running the Scraper[​](#8-running-the-scraper "Direct link to 8. 
Running the Scraper") Finally, we need functions to execute our scraper: ``` async def run(self, search_query: str) -> None: """Execute the scraper with a search query""" try: await self.setup_crawler() start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}" await self.crawler.run([start_url]) await self.crawler.export_data_json('gmap_data.json') except Exception as e: print(f"Error running scraper: {str(e)}") async def main(): """Entry point of the script""" scraper = GoogleMapsScraper(headless=True) search_query = "hotels in bengaluru" await scraper.run(search_query) if __name__ == "__main__": asyncio.run(main()) ``` This data is automatically stored and can later be exported to a JSON file using: ``` await self.crawler.export_data_json('gmap_data.json') ``` Here's what your exported JSON file will look like: ``` [ { "name": "Vividus Hotels, Bangalore", "rating": "4.3", "reviews": "633", "price": "₹3,667", "amenities": [ "Pool available", "Free breakfast available", "Free Wi-Fi available", "Free parking available" ], "link": "https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..." } ] ``` ### 9. Using proxies for Google Maps scraping[​](#9-using-proxies-for-google-maps-scraping "Direct link to 9. Using proxies for Google Maps scraping") When scraping Google Maps at scale, using proxies is very helpful. Here are a few key reasons why: 1. **Avoid IP blocks**: Google Maps can detect and block IP addresses that make an excessive number of requests in a short time. Using proxies helps you stay under the radar. 2. **Bypass rate limits**: Google implements strict limits on the number of requests per IP address. By rotating through multiple IPs, you can maintain a consistent scraping pace without hitting these limits. 3. **Access location-specific data**: Different regions may display different data on Google Maps. Proxies allow you to view listings as if you are browsing from any specific location. Here's a simple implementation using Crawlee's built-in proxy management. Update your previous code with this to use proxy settings. ``` from crawlee.playwright_crawler import PlaywrightCrawler from crawlee.proxy_configuration import ProxyConfiguration # Configure your proxy settings proxy_configuration = ProxyConfiguration( proxy_urls=[ "http://username:password@proxy.provider.com:12345", # Add more proxy URLs as needed ] ) # Initialize crawler with proxy support crawler = PlaywrightCrawler( headless=True, request_handler_timeout=timedelta(minutes=5), proxy_configuration=proxy_configuration, ) ``` Here, I use a proxy to scrape hotel data in New York City. ![Using a proxy](/assets/images/scrape-google-maps-with-crawlee-screenshot-proxies-5c4dece0247a87e7d338328c472cea74.webp) Here's an example of data scraped from New York City hotels using proxies: ``` { "name": "The Manhattan at Times Square Hotel", "rating": "3.1", "reviews": "8,591", "price": "$120", "amenities": [ "Free parking available", "Free Wi-Fi available", "Air-conditioned available", "Breakfast available" ], "link": "https://www.google.com/maps/place/..." } ``` ### 10. Project: Interactive hotel analysis dashboard[​](#10-project-interactive-hotel-analysis-dashboard "Direct link to 10. Project: Interactive hotel analysis dashboard") After scraping hotel data from Google Maps, you can build an interactive dashboard that helps analyze hotel trends. 
Here’s a preview of how the dashboard works: ![Final dashboard](/assets/images/scrape-google-maps-with-crawlee-screenshot-hotel-analysis-dashboard-c14806409a7c1db63943f58d855aa07e.webp) Find the complete info for this dashboard on GitHub: [Hotel Analysis Dashboard](https://github.com/triposat/Hotel-Analytics-Dashboard). ### 11. Now you’re ready to put everything into action\![​](#11-now-youre-ready-to-put-everything-into-action "Direct link to 11. Now you’re ready to put everything into action!") Take a look at the complete scripts in my GitHub Gist: * [Basic Scraper](https://gist.github.com/triposat/9a6fb03130f3c4332bab71b72a973940) * [Code with Proxy Integration](https://gist.github.com/triposat/6c554b13c787a55348b48b6bfc5459c0) * [Hotel Analysis Dashboard](https://gist.github.com/triposat/13ce4b05c36512e69b5602833e781a6c) To make it all work: 1. **Run the basic scraper or proxy-integrated scraper**: This will collect the hotel data and store it in a JSON file. 2. **Run the dashboard script**: Load your JSON data and view it interactively in the dashboard. ## Wrapping up and next steps[​](#wrapping-up-and-next-steps "Direct link to Wrapping up and next steps") You've successfully built a comprehensive Google Maps scraper that collects and processes hotel data, presenting it through an interactive dashboard. Now you’ve learned about: * Using Crawlee with Playwright to navigate and extract data from Google Maps * Using proxies to scale up scraping without getting blocked * Storing the extracted data in JSON format * Creating an interactive dashboard to analyze hotel data We’ve handpicked some great resources to help you further explore web scraping: * [Scrapy vs. Crawlee: Choosing the right tool](https://www.crawlee.dev/blog/scrapy-vs-crawlee) * [Mastering proxy management with Crawlee](https://www.crawlee.dev/blog/proxy-management-in-crawlee) * [Think like a web scraping expert: 12 pro tips](https://www.crawlee.dev/blog/web-scraping-tips) * [Building a LinkedIn job scraper](https://www.crawlee.dev/blog/linkedin-job-scraper-python) **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # How to scrape Google search results with Python December 2, 2024 · 7 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert Scraping `Google Search` delivers essential `SERP analysis`, SEO optimization, and data collection capabilities. Modern scraping tools make this process faster and more reliable. note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). In this guide, we'll create a Google Search scraper using [`Crawlee for Python`](https://github.com/apify/crawlee-python) that can handle result ranking and pagination. 
We'll create a scraper that:

* Extracts titles, URLs, and descriptions from search results
* Handles multiple search queries
* Tracks ranking positions
* Processes multiple result pages
* Saves data in a structured format

![How to scrape Google search results with Python](/assets/images/google-search-a91bfdf17a4c2860798444b1be56f625.webp)

## Prerequisites[​](#prerequisites "Direct link to Prerequisites")

* Python 3.9 or higher
* Basic understanding of HTML and CSS selectors
* Familiarity with web scraping concepts
* Crawlee for Python v0.4.2 or higher

### Project setup[​](#project-setup "Direct link to Project setup")

1. Install Crawlee with the required dependencies:

```
pip install 'crawlee[beautifulsoup,curl-impersonate]'
```

2. Create a new project using the Crawlee CLI:

```
pipx run crawlee create crawlee-google-search
```

3. When prompted, select `Beautifulsoup` as your template type.

4. Navigate to the project directory and complete the installation:

```
cd crawlee-google-search
poetry install
```

## Development of the Google Search scraper in Python[​](#development-of-the-google-search-scraper-in-python "Direct link to Development of the Google Search scraper in Python")

### 1. Defining data for extraction[​](#1-defining-data-for-extraction "Direct link to 1. Defining data for extraction")

First, let's define our extraction scope. Google's search results now include maps, notable people, company details, videos, common questions, and many other elements. We'll focus on analyzing standard search results with rankings. Here's what we'll be extracting:

![Search Example](/assets/images/search_example-53f4fdf556178b9478a8d4f3e3816669.webp)

Let's verify whether we can extract the necessary data from the page's HTML code, or whether we need deeper analysis or `JS` rendering. Note that this check is sensitive to the HTML tags the data is nested in:

![Check Html](/assets/images/check_html-e243b1a0eff6d4404b9034863969bedc.webp)

Based on the data obtained from the page, all necessary information is present in the HTML code. Therefore, we can use [`beautifulsoup_crawler`](https://www.crawlee.dev/python/docs/examples/beautifulsoup-crawler). The fields we'll extract:

* Search result titles
* URLs
* Description text
* Ranking positions

### 2. Configure the crawler[​](#2-configure-the-crawler "Direct link to 2. Configure the crawler")

First, let's create the crawler configuration. We'll use [`CurlImpersonateHttpClient`](https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient) as our `http_client` with preset `headers` and `impersonate` relevant to the [`Chrome`](https://www.google.com/intl/en/chrome/) browser. We'll also configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings) to control scraping aggressiveness. This is crucial to avoid getting blocked by Google. If you need to extract data more intensively, consider setting up [`ProxyConfiguration`](https://www.crawlee.dev/python/api/class/ProxyConfiguration).
``` from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient from crawlee import ConcurrencySettings, HttpHeaders async def main() -> None: concurrency_settings = ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=200) http_client = CurlImpersonateHttpClient(impersonate="chrome124", headers=HttpHeaders({"referer": "https://www.google.com/", "accept-language": "en", "accept-encoding": "gzip, deflate, br, zstd", "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" })) crawler = BeautifulSoupCrawler( max_request_retries=1, concurrency_settings=concurrency_settings, http_client=http_client, max_requests_per_crawl=10, max_crawl_depth=5 ) await crawler.run(['https://www.google.com/search?q=Apify']) ``` ### 3. Implementing data extraction[​](#3-implementing-data-extraction "Direct link to 3. Implementing data extraction") First, let's analyze the HTML code of the elements we need to extract: ![Check Html](/assets/images/html_example-ccefa4ed63c38812ac5b8ca7b5122c8c.webp) There's an obvious distinction between *readable* ID attributes and *generated* class names and other attributes. When creating selectors for data extraction, you should ignore any generated attributes. Even if you've read that Google has been using a particular generated tag for N years, you shouldn't rely on it - this reflects your experience in writing robust code. Now that we understand the HTML structure, let's implement the extraction. As our crawler deals with only one type of page, we can use `router.default_handler` for processing it. Within the handler, we'll use `BeautifulSoup` to iterate through each search result, extracting data such as `title`, `url`, and `text_widget` while saving the results. ``` @crawler.router.default_handler async def default_handler(context: BeautifulSoupCrawlingContext) -> None: """Default request handler.""" context.log.info(f'Processing {context.request} ...') for item in context.soup.select("div#search div#rso div[data-hveid][lang]"): data = { 'title': item.select_one("h3").get_text(), "url": item.select_one("a").get("href"), "text_widget": item.select_one("div[style*='line']").get_text(), } await context.push_data(data) ``` ### 4. Handling pagination[​](#4-handling-pagination "Direct link to 4. Handling pagination") Since Google results depend on the IP geolocation of the search request, we can't rely on link text for pagination. We need to create a more sophisticated CSS selector that works regardless of geolocation and language settings. The `max_crawl_depth` parameter controls how many pages our crawler should scan. Once we have our robust selector, we simply need to get the next page link and add it to the crawler's queue. To write more efficient selectors, learn the basics of [CSS](https://www.w3schools.com/cssref/css_selectors.php) and [XPath](https://www.w3schools.com/xml/xpath_syntax.asp) syntax. ``` await context.enqueue_links(selector="div[role='navigation'] td[role='heading']:last-of-type > a") ``` ### 5. Exporting data to CSV format[​](#5-exporting-data-to-csv-format "Direct link to 5. Exporting data to CSV format") Since we want to save all search result data in a convenient tabular format like CSV, we can simply add the export\_data method call right after running the crawler: ``` await crawler.export_data_csv("google_search.csv") ``` ### 6. 
Finalizing the Google Search scraper[​](#6-finalizing-the-google-search-scraper "Direct link to 6. Finalizing the Google Search scraper") While our core crawler logic works, you might have noticed that our results currently lack ranking position information. To complete our scraper, we need to implement proper ranking position tracking by passing data between requests using `user_data` in [`Request`](https://www.crawlee.dev/python/api/class/Request). Let's modify the script to handle multiple queries and track ranking positions for search results analysis. We'll also set the crawling depth as a top-level variable. Let's move the `router.default_handler` to `routes.py` to match the project structure: ``` # crawlee-google-search.main from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient from crawlee import Request, ConcurrencySettings, HttpHeaders from .routes import router QUERIES = ["Apify", "Crawlee"] CRAWL_DEPTH = 2 async def main() -> None: """The crawler entry point.""" concurrency_settings = ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=200) http_client = CurlImpersonateHttpClient(impersonate="chrome124", headers=HttpHeaders({"referer": "https://www.google.com/", "accept-language": "en", "accept-encoding": "gzip, deflate, br, zstd", "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" })) crawler = BeautifulSoupCrawler( request_handler=router, max_request_retries=1, concurrency_settings=concurrency_settings, http_client=http_client, max_requests_per_crawl=100, max_crawl_depth=CRAWL_DEPTH ) requests_lists = [Request.from_url(f"https://www.google.com/search?q={query}", user_data = {"query": query}) for query in QUERIES] await crawler.run(requests_lists) await crawler.export_data_csv("google_ranked.csv") ``` Let's also modify the handler to add `query` and `order_no` fields and basic error handling: ``` # crawlee-google-search.routes from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext from crawlee.router import Router router = Router[BeautifulSoupCrawlingContext]() @router.default_handler async def default_handler(context: BeautifulSoupCrawlingContext) -> None: """Default request handler.""" context.log.info(f'Processing {context.request.url} ...') order = context.request.user_data.get("last_order", 1) query = context.request.user_data.get("query") for item in context.soup.select("div#search div#rso div[data-hveid][lang]"): try: data = { "query": query, "order_no": order, 'title': item.select_one("h3").get_text(), "url": item.select_one("a").get("href"), "text_widget": item.select_one("div[style*='line']").get_text(), } await context.push_data(data) order += 1 except AttributeError as e: context.log.warning(f'Attribute error for query "{query}": {str(e)}') except Exception as e: context.log.error(f'Unexpected error for query "{query}": {str(e)}') await context.enqueue_links(selector="div[role='navigation'] td[role='heading']:last-of-type > a", user_data={"last_order": order, "query": query}) ``` And we're done! Our Google Search crawler is ready. 
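Because every field ends up in a flat CSV, quick post-processing is easy. As an optional sketch (pandas is not part of the project dependencies, so install it separately if you want to try this), you could compute the average ranking position per query like this:

```
# Optional analysis sketch using pandas (an extra dependency, not used by the crawler).
import pandas as pd

df = pd.read_csv('google_ranked.csv')

# Average ranking position of each query across the crawled pages
print(df.groupby('query')['order_no'].mean())

# Top three results for each query
top = df.sort_values(['query', 'order_no']).groupby('query').head(3)
print(top[['query', 'order_no', 'title', 'url']])
```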
Let's look at the results in the `google_ranked.csv` file: ![Results CSV](/assets/images/results-03c51354b4347837a24ec6977a442ce8.webp) The code repository is available on [`GitHub`](https://github.com/Mantisus/crawlee-google-search) ## Scrape Google Search results with Apify[​](#scrape-google-search-results-with-apify "Direct link to Scrape Google Search results with Apify") If you're working on a large-scale project requiring millions of data points, like the project featured in this [article about Google ranking analysis](https://backlinko.com/search-engine-ranking) - you might need a ready-made solution. Consider using [`Google Search Results Scraper`](https://www.apify.com/apify/google-search-scraper) by the Apify team. It offers important features such as: * Proxy support * Scalability for large-scale data extraction * Geolocation control * Integration with external services like [`Zapier`](https://zapier.com/), [`Make`](https://www.make.com/), [`Airbyte`](https://airbyte.com/), [`LangChain`](https://www.langchain.com/) and others You can learn more in the Apify [blog](https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/) ## What will you scrape?[​](#what-will-you-scrape "Direct link to What will you scrape?") In this blog, we've explored step-by-step how to create a Google Search crawler that collects ranking data. How you analyze this dataset is up to you! As a reminder, you can find the full project code on [`GitHub`](https://github.com/Mantisus/crawlee-google-search). I'd like to think that in 5 years I'll need to write an article on "How to extract data from the best search engine for LLMs", but I suspect that in 5 years this article will still be relevant. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # How to scrape TikTok using Python April 25, 2025 · 12 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert [TikTok](https://www.tiktok.com/) users generate tons of data that are valuable for analysis. Which hashtags are trending now? What is an influencer's engagement rate? What topics are important for a content creator? You can find answers to these and many other questions by analyzing TikTok data. However, for analysis, you need to extract the data in a convenient format. In this blog, we'll explore how to scrape TikTok using [Crawlee for Python](https://github.com/apify/crawlee-python). note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on our [Discord channel](https://apify.com/discord). ![How to scrape TikTok using Python](/assets/images/main_image-94d608c24b2e8970cac1d9040b8290a5.webp) Key steps we'll cover: 1. [Project setup](https://www.crawlee.dev/blog/scrape-tiktok-python#1-project-setup) 2. [Analyzing TikTok and determining a scraping strategy](https://www.crawlee.dev/blog/scrape-tiktok-python#2-analyzing-tiktok-and-determining-a-scraping-strategy) 3. [Configuring Crawlee](https://www.crawlee.dev/blog/scrape-tiktok-python#3-configuring-crawlee) 4. [Extracting TikTok data](https://www.crawlee.dev/blog/scrape-tiktok-python#4-extracting-tiktok-data) 5. [Creating TikTok Actor on the Apify platform](https://www.crawlee.dev/blog/scrape-tiktok-python#5-creating-tiktok-actor-on-apify-platform) 6. 
[Deploying to Apify](https://www.crawlee.dev/blog/scrape-tiktok-python#6-deploying-to-apify) ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") * Python 3.9 or higher * Familiarity with web scraping concepts * Crawlee for Python `v0.6.0` or higher * [uv](https://docs.astral.sh/uv/) `v0.6` or higher * An Apify account ## 1. Project setup[​](#1-project-setup "Direct link to 1. Project setup") note Before going ahead with the project, I'd like to ask you to star Crawlee for Python on [GitHub](https://github.com/apify/crawlee-python/), it helps us to spread the word to fellow scraper developers. In this project, we'll use uv for package management and a specific Python version will be installed through uv. Uv is a fast and modern package manager written in Rust. If you don't have uv installed yet, just follow the [guide](https://docs.astral.sh/uv/getting-started/installation/) or use this command: ``` curl -LsSf https://astral.sh/uv/install.sh | sh ``` To create the project, run: ``` uvx crawlee['cli'] create tiktok-crawlee ``` In the `cli` menu that opens, select: 1. `Playwright` 2. `Httpx` 3. `uv` 4. Leave the default value - `https://crawlee.dev` 5. `y` Or, just run the command: ``` uvx crawlee['cli'] create tiktok-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev' ``` Creating the project may take a few minutes. After installation is complete, navigate to the project folder: ``` cd tiktok-crawlee ``` ## 2. Analyzing TikTok and determining a scraping strategy[​](#2-analyzing-tiktok-and-determining-a-scraping-strategy "Direct link to 2. Analyzing TikTok and determining a scraping strategy") TikTok uses quite a lot of JavaScript on its site, both for displaying content and for analyzing user behavior, including detecting and blocking crawlers. Therefore, for crawling TikTok, we'll use a headless browser with [Playwright](https://playwright.dev/python/). To load new elements on a user's page, TikTok uses infinite scrolling. You may already be familiar with this method from this [article](https://www.crawlee.dev/blog/infinite-scroll-using-python). Let's look at what happens under the hood when we scroll a TikTok page. I recommend studying network activity in [DevTools](https://developer.chrome.com/docs/devtools) to understand what requests are going to the server. ![Backend Network](/assets/images/load_elems-b739afc4d1d682c6fa2944275e1f8a9f.webp) Let's examine the HTML structure to understand if navigating to elements will be difficult. ![Selectors](/assets/images/selectors-80c3c3aa2697ef3c0f8b2422e7367d65.webp) Well, this looks quite simple. If using [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors), `[data-e2e="user-post-item"] a` is sufficient. Let's look at what a video page response looks like to see what data we can extract. ![Video Response](/assets/images/html_response-4344e00324cd04aa52a5a8b257d48eaf.webp) It seems that the HTML code contains JSON with all the data we're interested in. Great! ## 3. Configuring Crawlee[​](#3-configuring-crawlee "Direct link to 3. Configuring Crawlee") Now that we understand our scraping strategy, let's set up Crawlee for scraping TikTok. Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a `max_items` parameter that will limit the maximum number of elements for each search and pass it in `user_data` when forming a [Request](https://www.crawlee.dev/python/api/class/Request). 
We'll limit the intensity of scraping by setting `max_tasks_per_minute` in [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). This will help us reduce the likelihood of being blocked by TikTok. We'll set `browser_type` to `firefox`, as it performed better for TikTok in my tests. TikTok may request permissions to access device data, so we'll explicitly limit all [permissions](https://playwright.dev/python/docs/api/class-browser#browser-new-context-option-permissions) by passing the appropriate parameter to `browser_new_context_options`. Scrolling pages can take a long time, so we should increase the time limit for processing a single request using `request_handler_timeout`. ``` # main.py from datetime import timedelta from apify import Actor from crawlee import ConcurrencySettings, Request from crawlee.crawlers import PlaywrightCrawler from .routes import router async def main() -> None: """The crawler entry point.""" # When creating the template, we confirmed Apify integration. # However, this isn't important for us at this stage. async with Actor: max_items = 20 # Create a crawler with the necessary settings crawler = PlaywrightCrawler( # Limit scraping intensity by setting a limit on requests per minute concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50), # We'll configure the `router` in the next step request_handler=router, # You can use `False` during development. But for production, it's always `True` headless=True, max_requests_per_crawl=100, # Increase the timeout for the request handling pipeline request_handler_timeout=timedelta(seconds=120), browser_type='firefox', # Limit any permissions to device data browser_new_context_options={'permissions': []}, ) # Run the crawler to collect data from several user pages await crawler.run( [ Request.from_url('https://www.tiktok.com/@apifyoffice', user_data={'limit': max_items}), Request.from_url('https://www.tiktok.com/@authorbrandonsanderson', user_data={'limit': max_items}), ] ) ``` Someone might ask, "What about configurations to avoid fingerprint blocking?!!!" My answer is, "Crawlee for Python has already done that for you." Depending on your deployment environment, you may need to add a proxy. We'll come back to this in the last section. ## 4. Extracting TikTok data[​](#4-extracting-tiktok-data "Direct link to 4. Extracting TikTok data") After configuration, let's move on to navigation and data extraction. For infinite scrolling, we'll use the built-in helper function ['infinite\_scroll'](https://www.crawlee.dev/python/api/class/PlaywrightCrawlingContext#infinite_scroll). But instead of waiting for scrolling to complete, which in some cases can take a really long time, we'll use Python's `asyncio` capabilities to make it a background task. Also, with deeper investigation, you may encounter a TikTok page that doesn't load user videos, but only shows a button and an error message. ![Error Page](/assets/images/went_wrong-413878d9f5a4331add12544c0a25ccd7.webp) It's very important to handle this case. Also during testing, I discovered that you need to interact with scrolling, otherwise when using `infinite_scroll`, new elements don't load. I think this is a TikTok bug. Let's start with a simple function to extract video links. It will help avoid code duplication. 
``` # routes.py import asyncio import json from playwright.async_api import Page from crawlee import Request from crawlee.crawlers import PlaywrightCrawlingContext from crawlee.router import Router router = Router[PlaywrightCrawlingContext]() # Helper function that extracts all loaded video links async def extract_video_links(page: Page) -> list[Request]: """Extract all loaded video links from the page.""" links = [] for post in await page.query_selector_all('[data-e2e="user-post-item"] a'): post_link = await post.get_attribute('href') if post_link and '/video/' in post_link: links.append(Request.from_url(post_link, label='video')) return links ``` Now we can move on to the main handler that will process TikTok user pages. ``` # routes.py # Main handler used for TikTok user pages @router.default_handler async def default_handler(context: PlaywrightCrawlingContext) -> None: """Handle request without specific label.""" context.log.info(f'Processing {context.request.url} ...') # Get the limit for video elements from `user_data` limit = context.request.user_data.get('limit', 10) if not isinstance(limit, int): raise TypeError('Limit must be an integer') # Wait until the button or at least a video loads, if the connection is slow check_locator = context.page.locator('[data-e2e="user-post-item"], main button').first await check_locator.wait_for() # If the button loaded, click it to initiate video loading if button := await context.page.query_selector('main button'): await button.click() # Perform interaction with scrolling await context.page.press('body', 'PageDown') # Start `infinite_scroll` as a background task scroll_task: asyncio.Task[None] = asyncio.create_task(context.infinite_scroll()) # Wait until scrolling is completed or until the limit is reached while not scroll_task.done(): requests = await extract_video_links(context.page) # If we've already reached the limit, interrupt scrolling and exit the loop if len(requests) >= limit: scroll_task.cancel() break # Switch the asynchronous context to allow other tasks to execute await asyncio.sleep(0.2) else: requests = await extract_video_links(context.page) # Limit the number of requests to the limit value requests = requests[:limit] # If the page wasn't properly processed for some reason and didn't find any links, # then I want to raise an error for retry if not requests: raise RuntimeError('No video links found') await context.add_requests(requests) ``` The final stage is handling the video page. 
``` # routes.py @router.handler(label='video') async def video_handler(context: PlaywrightCrawlingContext) -> None: """Handle request with the label 'video'.""" context.log.info(f'Processing video {context.request.url} ...') # Extract the element containing JSON with data json_element = await context.page.query_selector('#__UNIVERSAL_DATA_FOR_REHYDRATION__') if json_element: # Extract JSON and convert it to a dictionary text_data = await json_element.text_content() json_data = json.loads(text_data) data = json_data['__DEFAULT_SCOPE__']['webapp.video-detail']['itemInfo']['itemStruct'] # Create result item result_item = { 'author': { 'nickname': data['author']['nickname'], 'id': data['author']['id'], 'handle': data['author']['uniqueId'], 'signature': data['author']['signature'], 'followers': data['authorStats']['followerCount'], 'following': data['authorStats']['followingCount'], 'hearts': data['authorStats']['heart'], 'videos': data['authorStats']['videoCount'], }, 'description': data['desc'], 'tags': [item['hashtagName'] for item in data['textExtra'] if item['hashtagName']], 'hearts': data['stats']['diggCount'], 'shares': data['stats']['shareCount'], 'comments': data['stats']['commentCount'], 'plays': data['stats']['playCount'], } # Save the result to the dataset await context.push_data(result_item) else: # If the data wasn't received, we raise an error for retry raise RuntimeError('No JSON data found') ``` The crawler is ready for local launch. To run it, execute the command: ``` uv run python -m tiktok_crawlee ``` You can view the saved results in the `dataset` folder, path `./storage/datasets/default/`. Example record: ``` { "author": { "nickname": "apifyoffice", "id": "7095709566285480965", "handle": "apifyoffice", "signature": "🤖 web scraping and AI 🤖\n\ncheck out our open positions at ✨apify.it/jobs✨", "followers": 118, "following": 3, "hearts": 1975, "videos": 33 }, "description": ""Fun" is the top word Apifiers used to describe our culture. Here's what else came to their minds 🎤 #workculture #teambuilding #interview #czech #ilovemyjob ", "tags": [ "workculture", "teambuilding", "interview", "czech", "ilovemyjob" ], "hearts": 7, "shares": 1, "comments": 1, "plays": 448 } ``` ## 5. Creating TikTok Actor on the [Apify platform](https://apify.com/)[​](#5-creating-tiktok-actor-on-the-apify-platform "Direct link to 5-creating-tiktok-actor-on-the-apify-platform") For deployment, we'll use the [Apify platform](https://apify.com/). It's a simple and effective environment for cloud deployment, allowing efficient interaction with your crawler. Call it via [API](https://docs.apify.com/api/v2/), [schedule tasks](https://docs.apify.com/platform/schedules), [integrate](https://docs.apify.com/platform/integrations) with various services, and much more. To deploy to the Apify platform, we need to adapt our project for the [Apify Actor](https://apify.com/actors) structure. Create an `.actor` folder with the necessary files. ``` mkdir .actor && touch .actor/{actor.json,input_schema.json} ``` Move the `Dockerfile` from the root folder to `.actor`. ``` mv Dockerfile .actor ``` Let's fill in the empty files: The `actor.json` file contains project metadata for the Apify platform. 
Follow the [documentation for proper configuration](https://docs.apify.com/platform/actors/development/actor-definition/actor-json): ``` { "actorSpecification": 1, "name": "TikTok-Crawlee", "title": "TikTok - Crawlee", "minMemoryMbytes": 2048, "description": "Scrape video elements from TikTok user pages", "version": "0.1", "meta": { "templateId": "tiktok-crawlee" }, "input": "./input_schema.json", "dockerfile": "./Dockerfile" } ``` Actor input parameters are defined using `input_schema.json`, which is specified [here](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1). Let's define input parameters for our crawler: * `maxItems` - this should be an externally configurable parameter. * `urls` - these are links to TikTok user pages, the starting points for our crawler's scraping * `proxySettings` - proxy settings, since without a proxy you'll be using the datacenter IP that Apify uses. ``` { "title": "TikTok Crawlee", "type": "object", "schemaVersion": 1, "properties": { "urls": { "title": "List URLs", "type": "array", "description": "Direct URLs to pages TikTok profiles.", "editor": "stringList", "prefill": ["https://www.tiktok.com/@apifyoffice"] }, "maxItems": { "type": "integer", "editor": "number", "title": "Limit search results", "description": "Limits the maximum number of results, applies to each search separately.", "default": 10 }, "proxySettings": { "title": "Proxy configuration", "type": "object", "description": "Select proxies to be used by your scraper.", "prefill": { "useApifyProxy": true }, "editor": "proxy" } }, "required": ["urls"] } ``` Let's update the code to accept input parameters. ``` # main.py from datetime import timedelta from apify import Actor from crawlee.crawlers import PlaywrightCrawler from crawlee import ConcurrencySettings from crawlee import Request from .routes import router async def main() -> None: """The crawler entry point.""" async with Actor: # Accept input parameters passed when starting the Actor actor_input = await Actor.get_input() max_items = actor_input.get('maxItems', 0) requests = [Request.from_url(url, user_data={'limit': max_items}) for url in actor_input.get('urls', [])] proxy = await Actor.create_proxy_configuration(actor_proxy_input=actor_input.get('proxySettings')) crawler = PlaywrightCrawler( concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50), proxy_configuration=proxy, request_handler=router, headless=True, request_handler_timeout=timedelta(seconds=120), browser_type='firefox', browser_new_context_options={'permissions': []} ) await crawler.run(requests) ``` That's it, the project is ready for deployment. ## 6. Deploying to Apify[​](#6-deploying-to-apify "Direct link to 6. Deploying to Apify") Use the official [Apify CLI](https://docs.apify.com/cli/) to upload your code: Authenticate using your API token from [Apify Console](https://console.apify.com/settings/integrations): ``` apify login ``` Choose "Enter API token manually" and paste your token. Push the project to the platform: ``` apify push ``` Now you can configure runs on the Apify platform. 
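Besides configuring runs in Apify Console, you can also start the deployed Actor and read its results programmatically. Below is a minimal sketch using the `apify-client` package; the Actor ID `<your-username>/tiktok-crawlee` and the token placeholder are assumptions, so adjust them to your account:

```python
from apify_client import ApifyClient

client = ApifyClient(token='<YOUR_APIFY_TOKEN>')

# Start the Actor and wait for the run to finish
run = client.actor('<your-username>/tiktok-crawlee').call(
    run_input={
        'urls': ['https://www.tiktok.com/@apifyoffice'],
        'maxItems': 5,
    },
)

# Read the scraped items from the run's default dataset
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item['author']['nickname'], item['plays'])
```

The same pattern works for schedules and integrations that call the Actor from your own code.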
Let's perform a test run: Fill in the input parameters: ![Actor Input](/assets/images/input_actor-33501c94f9a90c5e28c272016a7d5ec9.webp) Check that logging works correctly: ![Actor Log](/assets/images/actor_log-4301af07fb3f21631f98802876e6b3f5.webp) View results in the dataset: ![Dataset Results](/assets/images/actor_results-7ab9904db12130be0317320c43070b71.webp) If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this [publishing guide](https://docs.apify.com/platform/actors/publishing) for [Apify Store](https://apify.com/store). ## Conclusion[​](#conclusion "Direct link to Conclusion") We've created a good foundation for crawling TikTok using Crawlee for Python and Playwright. If you want to improve the project, I would recommend adding error handling and dealing with CAPTCHAs to reduce the likelihood of being blocked by TikTok. Even as it stands, though, this is a solid starting point for working with TikTok that lets you collect data right away. You can find the complete code in the [repository](https://github.com/Mantisus/tiktok-crawlee). If you enjoyed this blog, feel free to support Crawlee for Python by starring the [repository](https://github.com/apify/crawlee-python) or joining the maintainer team. Have questions or want to discuss implementation details? Join our [Discord](https://discord.com/invite/jyEM2PRvMU) - our community of 10,000+ developers is there to help. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # Optimizing web scraping: Scraping auth data using JSDOM September 30, 2024 · 8 min read [![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager As scraping developers, we sometimes need to extract authentication data like temporary keys to perform our tasks. However, it is not as simple as that. Usually, the data is in the HTML or in XHR network requests, but sometimes it is computed on the page. In that case, we can either reverse-engineer the computation, which takes a lot of time spent deobfuscating scripts, or run the JavaScript that computes it. Normally, we would use a browser for that, but it is expensive. Crawlee provides support for running a browser scraper and a Cheerio scraper in parallel, but that is very complex and expensive in terms of compute resource usage. JSDOM helps us run page JavaScript with fewer resources than a browser, though slightly more than Cheerio. This article discusses a new approach that we use in one of our Actors to obtain authentication data that the TikTok Ads Creative Center web application generates in the browser, without actually running a browser, by using JSDOM instead.
*Figure: JSDOM-based approach to scraping*
## Analyzing the website[​](#analyzing-the-website "Direct link to Analyzing the website") When you visit this URL: `https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pc/en` you will see a list of hashtags with their live ranking, the number of posts they have, a trend chart, creators, and analytics. You can also filter by industry, set the time period, and use a checkbox to show only trends that are new to the top 100. ![tiktok-trends](/assets/images/tiktok-trends-1b92bf04848ae6c440eb1e9fabb55a41.webp) Our goal here is to extract the top 100 hashtags from the list with the given filters. There are two possible approaches: using [`CheerioCrawler`](https://crawlee.dev/js/docs/guides/cheerio-crawler-guide.md), or browser-based scraping. Cheerio gives results faster but does not work with JavaScript-rendered websites. Cheerio is not the best option here because the [Creative Center](https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en) is a web application whose data source is an API, so we can only get the hashtags initially present in the HTML, not all 100 that we need.
The second approach is to use libraries like Puppeteer or Playwright for browser-based scraping and automate collecting all of the hashtags, but in our experience, that takes a lot of time for such a small task. Now comes the new approach that we developed to make this process much better than browser-based crawling and very close to CheerioCrawler-based crawling. ## JSDOM Approach[​](#jsdom-approach "Direct link to JSDOM Approach") note Before diving deep into this approach, I would like to give credit to [Alexey Udovydchenko](https://apify.com/alexey), Web Automation Engineer at Apify, for developing this approach. Kudos to him! In this approach, we are going to make API calls to `https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list` to get the required data. Before making calls to this API, we will need a few required headers (auth data), so we will first make a call to `https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en`. We will start by creating a function that builds the URL for the API call, makes the call, and gets the data. ``` export const createStartUrls = (input) => { const { days = '7', country = '', resultsLimit = 100, industry = '', isNewToTop100, } = input; const filterBy = isNewToTop100 ? 'new_on_board' : ''; return [ { url: `https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list?page=1&limit=50&period=${days}&country_code=${country}&filter_by=${filterBy}&sort_by=popular&industry_id=${industry}`, headers: { // required headers }, userData: { resultsLimit }, }, ]; }; ``` In the above function, we create the start URL for the API call, which includes the various parameters we talked about earlier. After creating the URL according to the parameters, it calls `creative_radar_api` and fetches all the results. But it won't work until we get the headers. So, let's create a function that first creates a session using `sessionPool` and `proxyConfiguration`. ``` export const createSessionFunction = async ( sessionPool, proxyConfiguration, ) => { const proxyUrl = await proxyConfiguration.newUrl(Math.random().toString()); const url = 'https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en'; // need url with data to generate token const response = await gotScraping({ url, proxyUrl }); const headers = await getApiUrlWithVerificationToken( response.body.toString(), url, ); if (!headers) { throw new Error(`Token generation blocked`); } log.info(`Generated API verification headers`, Object.values(headers)); return new Session({ userData: { headers, }, sessionPool, }); }; ``` In this function, the main goal is to call `https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en` and get the headers in return. To get the headers, we use the `getApiUrlWithVerificationToken` function. note Before going ahead, I want to mention that Crawlee natively supports JSDOM through the [JSDOM Crawler](https://crawlee.dev/js/api/jsdom-crawler.md). It provides a framework for parallel crawling of web pages using plain HTTP requests and the jsdom DOM implementation. Because it uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth.
Let’s see how we are going to create the `getApiUrlWithVerificationToken` function: ``` const getApiUrlWithVerificationToken = async (body, url) => { log.info(`Getting API session`); const virtualConsole = new VirtualConsole(); const { window } = new JSDOM(body, { url, contentType: 'text/html', runScripts: 'dangerously', resources: 'usable' || new CustomResourceLoader(), // ^ 'usable' faster than custom and works without canvas pretendToBeVisual: false, virtualConsole, }); virtualConsole.on('error', () => { // ignore errors caused by fake XMLHttpRequest }); const apiHeaderKeys = ['anonymous-user-id', 'timestamp', 'user-sign']; const apiValues = {}; let retries = 10; // api calls made outside of fetch, hack below is to get URL without actual call window.XMLHttpRequest.prototype.setRequestHeader = (name, value) => { if (apiHeaderKeys.includes(name)) { apiValues[name] = value; } if (Object.values(apiValues).length === apiHeaderKeys.length) { retries = 0; } }; window.XMLHttpRequest.prototype.open = (method, urlToOpen) => { if ( ['static', 'scontent'].find((x) => urlToOpen.startsWith(`https://${x}`), ) ) log.debug('urlToOpen', urlToOpen); }; do { await sleep(4000); retries--; } while (retries > 0); await window.close(); return apiValues; }; ``` In this function, we create a virtual console that uses `CustomResourceLoader` to run the background process and replace the browser with JSDOM. For this particular example, we need three mandatory headers to make the API call: `anonymous-user-id`, `timestamp`, and `user-sign`. Using `XMLHttpRequest.prototype.setRequestHeader`, we check whether the mentioned headers are being set; if so, we take their values and keep retrying until we have all of them. Then, the most important part: we use `XMLHttpRequest.prototype.open` to extract the auth data and make calls without actually using a browser or exposing the bot activity. At the end of `createSessionFunction`, it returns a session with the required headers. Now, coming to our main code, we will use CheerioCrawler and `preNavigationHooks` to inject the headers we got from the earlier function into the `requestHandler`. ``` const crawler = new CheerioCrawler({ sessionPoolOptions: { maxPoolSize: 1, createSessionFunction: async (sessionPool) => createSessionFunction(sessionPool, proxyConfiguration), }, preNavigationHooks: [ (crawlingContext) => { const { request, session } = crawlingContext; request.headers = { ...request.headers, ...session.userData?.headers, }; }, ], proxyConfiguration, }); ``` Finally, in the request handler, we make the call using the headers and determine how many calls are needed to fetch all the data, handling pagination. ``` async requestHandler(context) { const { log, request, json } = context; const { userData } = request; const { itemsCounter = 0, resultsLimit = 0 } = userData; if (!json.data) { throw new Error('BLOCKED'); } const { data } = json; const items = data.list; const counter = itemsCounter + items.length; const dataItems = items.slice( 0, resultsLimit && counter > resultsLimit ?
resultsLimit - itemsCounter : undefined, ); await context.pushData(dataItems); const { pagination: { page, total }, } = data; log.info( `Scraped ${dataItems.length} results out of ${total} from search page ${page}`, ); const isResultsLimitNotReached = counter < Math.min(total, resultsLimit); if (isResultsLimitNotReached && data.pagination.has_more) { const nextUrl = new URL(request.url); nextUrl.searchParams.set('page', page + 1); await crawler.addRequests([ { url: nextUrl.toString(), headers: request.headers, userData: { ...request.userData, itemsCounter: itemsCounter + dataItems.length, }, }, ]); } } ``` One important thing to note here is that this code is written so that we can make any number of API calls. In this particular example, we just made one request with a single session, but you can make more if you need to. When the first API call completes, it creates the second one. Again, you can make more calls if needed, but we stopped at two. To make things clearer, here is how the code flow looks: ![code flow](/assets/images/code-flow-9b59d77892326bdf8ae27f1e99489c9e.webp) ## Conclusion[​](#conclusion "Direct link to Conclusion") This approach gives us a third way to extract authentication data without actually using a browser, passing the data to CheerioCrawler. It significantly improves performance and reduces the RAM requirement by 50%. While browser-based scraping is about ten times slower than pure Cheerio, JSDOM does it just 3-4 times slower, which makes it 2-3 times faster than browser-based scraping. The project's codebase is already [uploaded here](https://github.com/souravjain540/tiktok-trends). The code is written as an Apify Actor; you can find more about it [here](https://docs.apify.com/academy/getting-started/creating-actors), but you can also run it without using the Apify SDK. If you have any doubts or questions about this approach, reach out to us on our [Discord server](https://apify.com/discord). --- # How to scrape YouTube using Python \[2025 guide] July 14, 2025 · 23 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert In this guide, we'll explore how to efficiently collect data from YouTube using [Crawlee for Python](https://github.com/apify/crawlee-python). The scraper will extract video metadata, video statistics, and transcripts - giving you structured YouTube data perfect for content analysis, ML training, or trend monitoring. note One of our community members wrote this guide as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on Apify's [Discord channel](https://apify.com/discord). ![How to scrape YouTube using Python](/assets/images/youtube_banner-fb73d10d52bbf13a89f3c0d66d2eff5b.webp) Key steps we'll cover: 1. [Project setup](https://www.crawlee.dev/blog/scrape-youtube-python#1-project-setup) 2. [Analyzing YouTube and determining a scraping strategy](https://www.crawlee.dev/blog/scrape-youtube-python#2-analyzing-youtube-and-determining-a-scraping-strategy) 3. [Configuring Crawlee](https://www.crawlee.dev/blog/scrape-youtube-python#3-configuring-crawlee) 4. [Extracting YouTube data](https://www.crawlee.dev/blog/scrape-youtube-python#4-extracting-youtube-data) 5. [Enhancing the scraper capabilities](https://www.crawlee.dev/blog/scrape-youtube-python#5-enhancing-the-scraper-capabilities) 6.
[Creating a YouTube Actor on the Apify platform](https://www.crawlee.dev/blog/scrape-youtube-python#6-creating-a-youtube-actor-on-the-apify-platform) 7. [Deploying to Apify](https://www.crawlee.dev/blog/scrape-youtube-python#7-deploying-to-apify) ## What you’ll need to get started[​](#what-youll-need-to-get-started "Direct link to What you’ll need to get started") * Python 3.10 or higher * Familiarity with web scraping concepts * Crawlee for Python `v0.6.0` or higher * [uv](https://docs.astral.sh/uv/) `v0.7` or higher ## 1. Project setup[​](#1-project-setup "Direct link to 1. Project setup") note Before starting the project, I'd like to ask you to star Crawlee for Python on [GitHub](https://github.com/apify/crawlee-python/). This will help us spread the word to fellow scraper developers. In this project, we'll use uv for package management and a specific Python version will be installed through uv. If you don't have uv installed yet, just follow the [guide](https://docs.astral.sh/uv/getting-started/installation/) or use this command: ``` curl -LsSf https://astral.sh/uv/install.sh | sh ``` To create the project, run: ``` uvx crawlee['cli'] create youtube-crawlee ``` In the `cli` menu that opens, select: 1. `Playwright` 2. `Httpx` 3. `uv` 4. Leave the default value - `https://crawlee.dev` 5. `y` Or, just run the command: ``` uvx crawlee['cli'] create youtube-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev' ``` Or, if you prefer to use `pipx`. ``` pipx run crawlee['cli'] create youtube-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev' ``` Creating the project may take a few minutes. After installation is complete, navigate to the project folder: ``` cd youtube-crawlee ``` ## 2. Analyzing YouTube and determining a scraping strategy[​](#2-analyzing-youtube-and-determining-a-scraping-strategy "Direct link to 2. Analyzing YouTube and determining a scraping strategy") If you're working on a small project to extract data from YouTube, you should use the [YouTube API](https://developers.google.com/youtube/v3/docs/search/list) to get your data. However, the API has very strict quotas, with no more than [10,000 units per day](https://developers.google.com/youtube/v3/determine_quota_cost). This allows you to get just 100 search pages, and you can't increase this limit. If your project requires more data than the API allows, you'll need to use crawling. Let's examine the site to develop an optimal crawling strategy. Let's study YouTube navigation using [Apify's YouTube channel](https://www.youtube.com/@Apify) as an example to better understand the features and data extraction points. YouTube uses infinite scrolling to load new elements on the page, similar to what we discussed in the corresponding [article](https://www.crawlee.dev/blog/infinite-scroll-using-python) from the [Apify](https://apify.com/) team. Let's look at how this works using [DevTools](https://developer.chrome.com/docs/devtools) and the [Network](https://developer.chrome.com/docs/devtools/network/) tab. ![Load Request](/assets/images/load_request-c583830dda107ae55fb6426d7b96e569.webp) If we look at the response structure, we can see that YouTube uses [JSON](https://www.json.org) to transmit data, but its structure is quite complex to navigate. 
![Load Response](/assets/images/load_response-7061bb91cadc904d54073c033f3f0a20.webp) Therefore, we'll use [Playwright](https://playwright.dev/python/docs/intro) for crawling, which will help us avoid parsing complex JSON responses. But if you want to practice crawling complex websites, try implementing a crawler based on an HTTP client, like in this [article](https://www.crawlee.dev/blog/scraping-dynamic-websites-using-python). Let's analyze the selectors for getting video links using the [Elements](https://developer.chrome.com/docs/devtools/elements/) tab: ![Selectors](/assets/images/selectors-745f5daab12810cc998990e4c066afdf.webp) It looks like we're interested in `a` tags with the attribute `id="video-title-link"`! Let's look at the video page to understand better how YouTube transmits data. As expected, we see data in JSON format. ![Video Response](/assets/images/video_json-44affd2ba348740caa8d1bc79ba9a8a9.webp) Now let's get the transcript link. Click on the subtitles button in the player to trigger the transcript request. ![Transcript Request](/assets/images/transcript_request-77c78163912afe398161b431c20cb733.webp) Let's verify that we can access the transcript via this link. Remove the `fmt=json3` parameter from the URL and open it in your browser. Removing the `fmt` parameter is necessary to get the data in a convenient XML format instead of the complex JSON3 format. ![Transcript Response](/assets/images/transcript_response-06133506fa3559a10cfc43912d1af67c.webp) If you live in a country where [GDPR](https://gdpr-info.eu/) applies, you'll need to handle the following pop-up before you can access the data: ![GDPR](/assets/images/GDPR-103ec4d5f927916f704ec1d4d597bd82.webp) After our analysis, we now understand: * **Navigation strategy**: How to navigate the channel page to retrieve all videos using infinite scroll. * **Video metadata extraction**: How to extract video statistics, title, description, publish date, and other metadata from video pages. * **Transcript access**: How to obtain the correct transcript link. * **Data formats**: Transcript data is available in XML format, which is easier to parse than JSON3 * **Regional considerations**: Special handling required for GDPR consent in European countries With this knowledge, we're ready to implement the YouTube scraper using Crawlee for Python. ## 3. Configuring Crawlee[​](#3-configuring-crawlee "Direct link to 3. Configuring Crawlee") Configuring Crawlee for YouTube is very similar to configuring it for [TikTok](https://www.crawlee.dev/blog/scrape-tiktok-python), but with some key differences. Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a `max_items` parameter that will limit the maximum number of elements for each search, and pass it in `user_data` when forming a [Request](https://www.crawlee.dev/python/api/class/Request). We'll limit the intensity of scraping by setting `max_tasks_per_minute` in [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). This will help us reduce the likelihood of being blocked by YouTube. Scrolling pages can take a long time, so we’ll increase the time limit for processing a single request using `request_handler_timeout`. 
Since we won't be saving images, videos, and similar media content during crawling, we can block requests to them using [`block_requests`](https://www.crawlee.dev/python/api/class/BlockRequestsFunction) and [`pre_navigation_hook`](https://www.crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook). Also, to handle the `GDPR` page only once, we'll use [`use_state`](https://www.crawlee.dev/python/api/class/UseStateFunction) to pass the appropriate cookies between sessions, ensuring all requests have the necessary cookies. ``` # main.py from datetime import timedelta from apify import Actor from crawlee import ConcurrencySettings, Request from crawlee.crawlers import PlaywrightCrawler from .hooks import pre_hook from .routes import router async def main() -> None: """The crawler entry point.""" async with Actor: # Create a crawler instance with the router crawler = PlaywrightCrawler( # Limit scraping intensity by setting a limit on requests per minute concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50), # We'll configure the `router` in the next step request_handler=router, # Increase the timeout for the request handling pipeline request_handler_timeout=timedelta(seconds=120), # Runs browser without visual interface headless=True, # Limit requests per crawl for testing purposes max_requests_per_crawl=100, ) # Set the maximum number of items to scrape per youtube channel max_items = 1 # Set the list of channels to scrape channels = ['Apify'] # Set hook for prepare context before navigation on each request crawler.pre_navigation_hook(pre_hook) await crawler.run( [ Request.from_url(f'https://www.youtube.com/@{channel}/videos', user_data={'limit': max_items}) for channel in channels ] ) ``` Let's prepare the `pre_hook` function to block requests and set cookies (the cookie collection process will be explained in the extraction section): ``` # hooks.py from crawlee.crawlers import PlaywrightPreNavCrawlingContext async def pre_hook(context: PlaywrightPreNavCrawlingContext) -> None: """Prepare context before navigation.""" crawler_state = await context.use_state() # Check if there are previously collected cookies in the crawler state and set them for the session if 'cookies' in crawler_state and context.session: cookies = crawler_state['cookies'] # Set cookies for the session context.session.cookies.set_cookies_from_playwright_format(cookies) # Block requests to resources that aren't needed for parsing # This is similar to the default value, but we don't block `css` as it is needed for Player loading await context.block_requests( url_patterns=['.webp', '.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff', '.pdf', '.zip'] ) ``` ## 4. Extracting YouTube data[​](#4-extracting-youtube-data "Direct link to 4. Extracting YouTube data") After configuration, let's move on to navigation and data extraction. For infinite scrolling, we'll use the built-in helper function ['infinite\_scroll'](https://www.crawlee.dev/python/api/class/PlaywrightCrawlingContext#infinite_scroll). But instead of waiting for scrolling to complete, which in some cases can take a really long time, we'll use Python's `asyncio` capabilities to make it a background task. The `GDPR` page requiring consent for cookie usage is on the domain `consent.youtube.com`, which might cause an error when forming a [Request](https://www.crawlee.dev/python/api/class/Request) for a video page. 
Therefore, we need to use a helper function for the `transform_request_function` parameter in [`extract_links`](https://www.crawlee.dev/python/api/class/ExtractLinksFunction). This function will check each extracted URL. If it contains 'consent.youtube', we'll replace it with '[www.youtube](http://www.youtube)'. This will allow us to get the correct URL for the video page. ``` # routes.py from __future__ import annotations import asyncio import xml.etree.ElementTree as ET from typing import TYPE_CHECKING from yarl import URL from crawlee import Request, RequestOptions, RequestTransformAction from crawlee.crawlers import PlaywrightCrawlingContext from crawlee.router import Router if TYPE_CHECKING: from playwright.async_api import Request as PlaywrightRequest from playwright.async_api import Route as PlaywrightRoute router = Router[PlaywrightCrawlingContext]() def request_domain_transform(request_param: RequestOptions) -> RequestOptions | RequestTransformAction: """Transform request before adding it to the queue.""" if 'consent.youtube' in request_param['url']: request_param['url'] = request_param['url'].replace('consent.youtube', 'www.youtube') return request_param return 'unchanged' ``` Let's implement a function that will intercept transcript requests for later modification and processing in the crawler: ``` # routes.py async def extract_transcript_url(context: PlaywrightCrawlingContext) -> str | None: """Extract the transcript URL from request intercepted by Playwright.""" # Create a Future to store the transcript URL transcript_future: asyncio.Future[str] = asyncio.Future() # Define a handler for the transcript request # This will be called when the page requests the transcript async def handle_transcript_request(route: PlaywrightRoute, request: PlaywrightRequest) -> None: # Set the result of the future with the transcript URL if not transcript_future.done(): transcript_future.set_result(request.url) await route.fulfill(status=200) # Set up a route to intercept requests to the transcript API await context.page.route('**/api/timedtext**', handle_transcript_request) # Click the subtitles button to trigger the transcript request await context.page.click('.ytp-subtitles-button') # Wait for the transcript URL to be captured # The future will resolve when handle_transcript_request is called return await transcript_future ``` Now, let's create the main handler that will navigate to the channel page, perform infinite scrolling, and extract links to videos. 
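The handler below runs `infinite_scroll` as a background `asyncio` task and keeps extracting links until either scrolling finishes or the limit is reached. Stripped of the Crawlee specifics, the pattern looks like this; a generic sketch, not code from the project:

```python
import asyncio


async def scroll_forever() -> None:
    """Stand-in for context.infinite_scroll()."""
    while True:
        await asyncio.sleep(0.5)


async def collect_with_limit(limit: int) -> int:
    found = 0
    scroll_task = asyncio.create_task(scroll_forever())
    while not scroll_task.done():
        found += 1  # stand-in for extracting newly loaded links
        if found >= limit:
            scroll_task.cancel()  # stop scrolling once we have enough
            break
        await asyncio.sleep(0.2)  # yield control so scrolling can progress
    return found


print(asyncio.run(collect_with_limit(5)))
```

In the real handler, the loop body calls `extract_links`, deduplicates the requests, and truncates them to the limit, as shown below.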
``` # routes.py @router.default_handler async def default_handler(context: PlaywrightCrawlingContext) -> None: """Handle requests that do not match any specific handler.""" context.log.info(f'Processing {context.request.url} ...') # Get the limit from user_data, default to 10 if not set limit = context.request.user_data.get('limit', 10) if not isinstance(limit, int): raise TypeError('Limit must be an integer') # Wait for the page to load await context.page.locator('h1').first.wait_for(state='attached') # Check if there's a GDPR popup on the page requiring consent for cookie usage cookies_button = context.page.locator('button[aria-label*="Accept"]').first if await cookies_button.is_visible(): await cookies_button.click() # Save cookies for later use with other sessions # You can learn more about `SOCS` cookies from - https://policies.google.com/technologies/cookies?hl=en-US cookies_state = [cookie for cookie in await context.page.context.cookies() if cookie['name'] == 'SOCS'] crawler_state = await context.use_state() crawler_state['cookies'] = cookies_state # Wait until at least one video loads await context.page.locator('a[href*="watch"]').first.wait_for() # Create a background task for infinite scrolling scroll_task: asyncio.Task[None] = asyncio.create_task(context.infinite_scroll()) # Scroll the page to the end until we reach the limit or finish scrolling while not scroll_task.done(): # Extract links to videos requests = await context.extract_links( selector='a[href*="watch"]', label='video', transform_request_function=request_domain_transform, strategy='same-domain', ) # Create a dictionary to avoid duplicates requests_map = {request.id: request for request in requests} # If the limit is reached, cancel the scrolling task if len(requests_map) >= limit: scroll_task.cancel() break # Switch the asynchronous context to allow other tasks to execute await asyncio.sleep(0.2) else: # If the scroll task is done, we can safely assume that we have reached the end of the page requests = await context.extract_links( selector='a[href*="watch"]', label='video', transform_request_function=request_domain_transform, strategy='same-domain', ) requests_map = {request.id: request for request in requests} requests = list(requests_map.values()) requests = requests[:limit] # Add the requests to the queue await context.enqueue_links(requests=requests) ``` Let's take a closer look at the parameters used in [`extract_links`](https://www.crawlee.dev/python/api/class/ExtractLinksFunction#Methods): * `selector` - selector for extracting links to videos. We expected that we could use `id="video-title-link"`, but YouTube uses different page formats with different selectors, so the selector `a[href*="watch"]` will be more universal. * `label` - pointer for the router that will be used to handle the video page. * `transform_request_function` - function to transform the request before adding it to the queue. We use it to replace the domain `consent.youtube` with `www.youtube`, which helps avoid errors when processing the video page. * `strategy` - strategy for extracting links. We use `same-domain` to extract links to any subdomain of `youtube.com`. Let's move on to the handler for video pages. In it, we'll extract video data and also look at how to get and process the video transcript link. 
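Notice that the video handler does not always push its results immediately: when a transcript is available, it forwards the partially assembled item through `user_data` and lets the transcript handler enrich and save it. As a standalone pattern, simplified from the handlers that follow (the example URLs are placeholders):

```python
from crawlee import Request

# In the first handler: forward the partial item instead of saving it right away
item = {'url': 'https://example.com/video', 'title': 'Some title'}
request = Request.from_url(
    'https://example.com/transcript',
    label='transcript',
    user_data={'video_data': item},
)

# In the handler registered for the 'transcript' label:
# video_data = context.request.user_data.get('video_data', {})
# video_data['transcript'] = '...'
# await context.push_data(video_data)
```

This keeps exactly one `push_data` call per video, whether or not a transcript was found.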
``` # routes.py @router.handler('video') async def video_handler(context: PlaywrightCrawlingContext) -> None: """Handle video requests.""" context.log.info(f'Processing video {context.request.url} ...') # extract video data from the page video_data = await context.page.evaluate('window.ytInitialPlayerResponse') main_data = { 'url': context.request.url, 'title': video_data['videoDetails']['title'], 'description': video_data['videoDetails']['shortDescription'], 'channel': video_data['videoDetails']['author'], 'channel_id': video_data['videoDetails']['channelId'], 'video_id': video_data['videoDetails']['videoId'], 'duration': video_data['videoDetails']['lengthSeconds'], 'keywords': video_data['videoDetails']['keywords'], 'view_count': video_data['videoDetails']['viewCount'], 'like_count': video_data['microformat']['playerMicroformatRenderer']['likeCount'], 'is_shorts': video_data['microformat']['playerMicroformatRenderer']['isShortsEligible'], 'publish_date': video_data['microformat']['playerMicroformatRenderer']['publishDate'], } # Try to extract the transcript URL try: transcript_url = await asyncio.wait_for(extract_transcript_url(context), timeout=20) except asyncio.TimeoutError: transcript_url = None if transcript_url: transcript_url = str(URL(transcript_url).without_query_params('fmt')) context.log.info(f'Found transcript URL: {transcript_url}') await context.add_requests( [Request.from_url(transcript_url, label='transcript', user_data={'video_data': main_data})] ) else: await context.push_data(main_data) ``` Note that if we want to extract the video transcript, we need to get the link to the transcript file and pass the video data to the next handler before it's saved to the [`Dataset`](https://www.crawlee.dev/python/api/class/Dataset). The final stage is processing the transcript. YouTube uses [XML](https://www.w3schools.com/xml/) to transmit transcript data, so we need to use a library to parse XML, such as [`xml.etree.ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html). ``` # routes.py @router.handler('transcript') async def transcript_handler(context: PlaywrightCrawlingContext) -> None: """Handle transcript requests.""" context.log.info(f'Processing transcript {context.request.url} ...') # Get the main video data extracted in `video_handler` video_data = context.request.user_data.get('video_data', {}) try: # Get XML data from the response root = ET.fromstring(await context.response.text()) # Extract text elements from XML transcript_data = [text_element.text.strip() for text_element in root.findall('.//text') if text_element.text] # Enrich video data by adding the transcript video_data['transcript'] = '\n'.join(transcript_data) # Save the data to Dataset await context.push_data(video_data) except ET.ParseError: context.log.warning('Incorect XML Response') # Save the video data without the transcript await context.push_data(video_data) ``` After collecting the data, we need to save the results to a file. 
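The snippet below does this with `export_data_json`. If a spreadsheet-friendly file is more convenient, the CSV counterpart should work the same way; a hedged one-liner, assuming the `export_data_csv` method is available in your Crawlee for Python version:

```python
# main.py (alternative to the JSON export shown below)
await crawler.export_data_csv('youtube.csv')
```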
Just add the following code to the end of the `main` function in `main.py`: ``` # main.py # Export the data from Dataset to JSON format await crawler.export_data_json('youtube.json') ``` To run the crawler, use the command: ``` uv run python -m youtube_crawlee ``` Example result record: ``` { "url": "https://www.youtube.com/watch?v=r-1J94tk5Fo", "title": "Facebook Marketplace API - Scrape Data Based on LOCATION, CATEGORY and SEARCH", "description": "See how you can export Facebook Marketplace listings to Excel, CSV or JSON with the Facebook Marketplace API 🛍️ Input one or more URLs to scrape price, description, images, delivery info, seller data, location, listing status, and much more 📊\n\nWith the Facebook Marketplace Downloader, you can:\n🛒 **Extract listings and seller details** from any public Marketplace category or search query.\n📷 **Scrape product details**, including images, prices, descriptions, locations, and timestamps.\n💰 **Get thousands of marketplace listings** quickly and efficiently.\n📦 **Export results** via API or in JSON, CSV, or Excel with all listing details.\n\n🛍️ Facebook Marketplace Search API 👉 https://apify.it/3E5NLz4\n📱 Explore other Facebook Scrapers 👉 https://apify.it/43Bae1f\n\n*Why scrape Facebook Marketplace data?* 🤔\n💰 Price & Demand Analysis – Track product pricing trends and demand fluctuations.\n📊 Competitor Insights – Monitor listings from competitors to adjust pricing and strategy.\n📍 Location-Based Market Trends – Identify popular products in specific regions.\n🔎 Product Availability Monitoring – Detect shortages or oversupply in certain categories.\n📈 Reselling Opportunities – Find underpriced items for profitable flips.\n🛍 Consumer Behavior Insights – Understand what products and features attract buyers.\n💡 Trend Spotting – Discover emerging products before they go mainstream.\n📝 Market Research – Gather data for academic, business, or personal research.\n\n*How to* scrape *facebook marketplace? 🧑‍🏫* \nStep 1. 
Find the Facebook Marketplace dataset tool on Apify Store\nStep 2: Click ‘Try for free’\nStep 3: Input a URL\nStep 4: Fine tune the input\nStep 5: Start the Actor and get your data!\n\n*Useful links 🧑‍💻*\n📚 Read more about Scraping Facebook data: https://apify.it/43wyth9\n🧑‍💻 Sign up for Apify: https://apify.it/42e8nNu\n🧩 Integrate the Actor with other tools: https://apify.it/43Ustiz\n📱 Browse other Social Media Scrapers on Apify Store: https://apify.it/4jhq7i8\n\n*Follow us 🤳*\nhttps://www.linkedin.com/company/apifytech\nhttps://twitter.com/apify\nhttps://www.tiktok.com/@apifytech\nhttps://discord.com/invite/jyEM2PRvMU\n\n*Timestamps ⌛️*\n00:00 Introduction\n01:27 Input\n02:17 Run\n02:26 Export\n02:41 Scheduling\n02:54 Integrations\n03:00 API\n03:13 Other Meta Scrapers\n03:26 Like and subscribe!\n\n#webscraping #instagram", "channel": "Apify", "channel_id": "UCTgwcoeGGKmZ3zzCXN2qo_A", "video_id": "r-1J94tk5Fo", "duration": "226", "keywords": [ "web scraping platform", "web automation", "scrapers", "Apify", "web crawling", "web scraping", "data extraction", "best web scraping tool", "API", "how to extract data from any website", "web scraping tutorial", "web scrape", "data collection tool", "RPA", "web integration", "how to turn website into API", "JSON", "python web scraping", "web scraping python", "web api integration", "how to turn website into api", "scraping", "apify", "data extraction tools", "how to web scrape", "web scraping javascript", "web scraping tool" ], "view_count": "765", "like_count": "8", "is_shorts": false, "publish_date": "2025-04-03T05:33:18-07:00", "transcript": "Hi, Theo here. In this video, I’ll \nshow you how to scrape structured\ndata from Facebook Marketplace by location, \ncategory, or specific search query. You’ll\nbe able to extract listing details like price, \ndescription, images, delivery info, seller data,\nlocation, and listing status — using a \ntool called Facebook Marketplace Scraper.\nHere’s what you can do with it. \nIf you're reselling, flipping,\nor deal hunting, scraping helps you track \nprices, spot trends, and catch underpriced\nor free items early. Looking for a rental \nor house? Compare listings across cities,\ncheck historical prices, and avoid wasting \ntime on overpriced options. Selling on\nMarketplace? Analyze top-performing listings, \noptimize keywords, and price competitively.\nFor businesses, scraping \nenables competitor tracking,\ndynamic pricing, real estate \nresearch, fraud detection,\nand brand protection — like spotting counterfeit \nor unauthorized listings before they do damage.\nThe best part is you don’t need to \njump through hoops to get this data:\nFacebook Marketplace Scraper makes things simple: \nno login, no cookies, no browser extension.\nIt runs in the cloud, and you can export \nresults in JSON, CSV, Excel — or use the API.\nLet’s see how it works.\nFirst, head to the link in the description, \nwhich’ll take you to Facebook Marketplace\nScraper’s README. Click on `try \nfor free`, which will send you to\nthe `Login page` and you can get started \nwith a free Apify account - don’t worry,\nthere’s no limit on the free plan and \nno credit card will ever be required.\nAfter logging in, you’ll land on the Actor’s \ninput page. While you can configure this through\neither the intuitive UI or JSON, we’ll \nstick with the UI option to keep it easy.\nFor scraping Facebook Marketplace, you’re gonna \nneed the URL from Facebook. You can use a URL of\na search term, location or an item category. 
For \nthis tutorial, we’re gonna go with an iPhone. So\nlet’s open up Facebook Marketplace, input a search \nterm and then copy the URL from the toolbar and\npaste it in the input. You can add more via the \nadd button, edit them in bulk or import the URLs\nas a text file. Next, you can limit how many \nposts you want to scrape. And that’s it.\nBefore running your Actor, it’s a great idea \nto save your configuration and create a task.\nThis will come in handy for scheduling or \nintegrating your Actor with other tools,\nor if you plan to work with \nmultiple configurations.\nNow that we have the `input`, let’s run \nthe Actor by hitting START. You can watch\nyour results appear in Overview or switch to \nthe Log tab to see more details about run.\nNow that your run is finished, we can get the \ndata via the Export button. You can choose your\npreffered format, and select which fields you want \nto include or exclude in your dataset. Then just\nhit Download and you have your dataset file. Let \nme show you what this looks like in JSON format.\nIf you want to automate your workflow \neven more, you can schedule your Facebook\nMarketplace Scraper to run at regular intervals. \nChoose your task and hit schedule. You can set\nthe frequency of how often you want to run \nthe Actor. You can even connect your Actor\nto other cloud services, such as Google \nDrive, Make, or any other Apify Actor.\nYou can also run this scraper locally via \nAPI. You can find the code in Node.js,\nPython, or curl in the API \ndrop down menu in the top-right\ncorner. To learn more about retrieving data \nprogramatically, check out our video on it.\nNeed more Facebook or Instagram data? \nCheck out our other scrapers in Apify\nStore. We have got dozens of meta \nscrapers, links are in the description.\nIf you prefer video tutorials, we have a playlist \ncovering different Instagram scraping use cases.\nAnd that’s all for today! Let us know what you \nthink about the Facebok Marketplace Scraper.\nRemember, if you come across any issues, make \nsure to report them to our team in Apify Console.\nIf you found this helpful, give us a thumbs \nup and subscribe. Don't forget to hit the\nbell to stay updated on new tutorials. Thanks for \nwatching! So long, and thanks for all the likes" } ``` ## 5. Enhancing the scraper capabilities[​](#5-enhancing-the-scraper-capabilities "Direct link to 5. Enhancing the scraper capabilities") As with any project working with a large site like YouTube, you may encounter various issues that need to be resolved. Currently, the Crawlee for Python documentation contains many guides and examples to help you with this. * Use [`Camoufox`](https://camoufox.com/), a project compatible with Playwright, which allows you to get a browser configuration that's more resistant to blocking, and you can easily [integrate it with Crawlee for Python](https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox). * Improve error handling and logging for unusual cases so you can easily debug and maintain the project; the guide on [error handling](https://www.crawlee.dev/python/docs/guides/error-handling) is a good place to start. * Add proxy support to avoid blocks from YouTube. You can use [Apify Proxy](https://apify.com/proxy) and [`ProxyConfiguration`](https://www.crawlee.dev/python/api/class/ProxyConfiguration); you can learn more in this guide in the [documentation](https://www.crawlee.dev/python/docs/guides/proxy-management#proxy-configuration). 
* Make your crawler a web service that crawls pages by user request, using [FastAPI](https://fastapi.tiangolo.com/) and following this [guide](https://www.crawlee.dev/python/docs/guides/running-in-web-server). ## 6. Creating YouTube Actor on the Apify platform[​](#6-creating-youtube-actor-on-the-apify-platform "Direct link to 6. Creating YouTube Actor on the Apify platform") For deployment, we'll use the [Apify platform](https://apify.com/). It's a simple and effective environment for cloud deployment, allowing efficient interaction with your crawler. Call it via [API](https://docs.apify.com/api/v2/), [schedule tasks](https://docs.apify.com/platform/schedules), [integrate](https://docs.apify.com/platform/integrations) with various services, and much more. To deploy to the Apify platform, we need to adapt our project for the [Apify Actor](https://apify.com/actors) structure. Create an `.actor` folder with the necessary files. ``` mkdir .actor && touch .actor/{actor.json,input_schema.json} ``` Move the `Dockerfile` from the root folder to `.actor`. ``` mv Dockerfile .actor ``` Let's fill in the empty files: The `actor.json` file contains project metadata for the Apify platform. Follow the [documentation for proper configuration](https://docs.apify.com/platform/actors/development/actor-definition/actor-json): ``` { "actorSpecification": 1, "name": "YouTube-Crawlee", "title": "YouTube - Crawlee", "minMemoryMbytes": 2048, "description": "Scrape video stats, metadata and transcripts from videos in YouTube channels", "version": "0.1", "meta": { "templateId": "youtube-crawlee" }, "input": "./input_schema.json", "dockerfile": "./Dockerfile" } ``` Actor input parameters are defined using `input_schema.json`, which is specified [here](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1). Let's define input parameters for our crawler: * `maxItems` - maximum number of videos per channel for scraping. * `channelNames` - these are the YouTube channel names to scrape. * `proxySettings` - proxy settings, since without a proxy, you'll be using the datacenter IP that Apify uses. ``` { "title": "YouTube Crawlee", "type": "object", "schemaVersion": 1, "properties": { "channelNames": { "title": "List Channel Names", "type": "array", "description": "Channel names for extraction video stats, metadata and transcripts.", "editor": "stringList", "prefill": ["Apify"] }, "maxItems": { "type": "integer", "editor": "number", "title": "Limit search results", "description": "Limits the maximum number of results, applies to each search separately.", "default": 10 }, "proxySettings": { "title": "Proxy configuration", "type": "object", "description": "Select proxies to be used by your scraper.", "prefill": { "useApifyProxy": true }, "editor": "proxy" } }, "required": ["channelNames"] } ``` Let's update the code to accept input parameters. 
``` # main.py async def main() -> None: """The crawler entry point.""" async with Actor: # Get the input parameters from the Actor actor_input = await Actor.get_input() max_items = actor_input.get('maxItems', 0) channels = actor_input.get('channelNames', []) proxy = await Actor.create_proxy_configuration(actor_proxy_input=actor_input.get('proxySettings')) crawler = PlaywrightCrawler( concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50), request_handler=router, request_handler_timeout=timedelta(seconds=120), headless=True, max_requests_per_crawl=100, proxy_configuration=proxy ) ``` And delete export to JSON from the `main` function, as the Apify platform will handle data storage in the [Dataset](https://docs.apify.com/platform/storage/dataset). That's it, the project is ready for deployment. ## 7. Deploying to Apify[​](#7-deploying-to-apify "Direct link to 7. Deploying to Apify") Use the official [Apify CLI](https://docs.apify.com/cli/) to upload your code: Authenticate using your API token from [Apify Console](https://console.apify.com/settings/integrations): ``` apify login ``` Choose "Enter API token manually" and paste your token. Push the project to the platform: ``` apify push ``` Now you can configure runs on the Apify platform. Let's perform a test run: Fill in the input parameters: ![Actor Input](/assets/images/input_actor-6bdab40eb022bcb34ad63da770f4dcea.webp) View results in the dataset: ![Dataset Results](/assets/images/actor_results-36a0c08c154c59a9fb3887222c5926f2.webp) If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this [publishing guide](https://docs.apify.com/platform/actors/publishing) for [Apify Store](https://apify.com/store). ## Conclusion[​](#conclusion "Direct link to Conclusion") We've created a good foundation for crawling YouTube using Crawlee for Python and Playwright. If you're just starting your journey in crawling, this will be an excellent project for learning and practice. You can use it as a basis for creating more complex crawlers that will collect data from YouTube. If this is your first project using Crawlee for Python, check out all the documentation links provided in this article; it will help you better understand how Crawlee for Python works and how you can use it for your projects. You can find the complete code in the [repository](https://github.com/Mantisus/youtube-crawlee) If you enjoyed this blog, feel free to support Crawlee for Python by starring the [repository](https://github.com/apify/crawlee-python) or joining the maintainer team. Do you have questions or want to discuss the details of the implementation? Join our [Discord](https://discord.com/invite/jyEM2PRvMU)—our community of 11,000+ developers is there to help. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # Web scraping of a dynamic website using Python with HTTP Client September 12, 2024 · 15 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert Dynamic websites that use JavaScript for content rendering and backend interaction often create challenges for web scraping. The traditional approach to solving this problem is browser emulation, but it's not very efficient in terms of resource consumption. note One of our community members wrote this blog as a contribution to Crawlee Blog. 
If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord).

In this article, we'll explore an alternative method based on in-depth site analysis and the use of an HTTP client. We'll go through the entire process from analyzing a dynamic website to implementing an efficient web crawler using the [`Crawlee for Python`](https://www.crawlee.dev/python/) framework.

![How to scrape dynamic websites in Python](/assets/images/dynamic-websites-d9a83deff0729330b2d3de2d1481cd6a.webp)

## What you'll learn in this tutorial[​](#what-youll-learn-in-this-tutorial "Direct link to What you'll learn in this tutorial")

Our subject of study is the [Accommodation for Students](https://www.accommodationforstudents.com) website. Using this example, we'll examine the specifics of analyzing sites built with the Next.js framework and implement a crawler capable of efficiently extracting data without using browser emulation.

By the end of this article, you will have:

* A clear understanding of how to analyze sites with dynamic content rendered using JavaScript.
* An understanding of how to implement a crawler based on Crawlee for Python.
* Insight into some of the details of working with sites that use [`Next.js`](https://nextjs.org/).
* A link to a GitHub repository with the full crawler implementation code.

## Website analysis[​](#website-analysis "Direct link to Website analysis")

To track all requests, open your Dev Tools on the `Network` tab before entering the site, since some data may be transmitted only when the site is first opened. As the site is intended for students in the UK, let's go to London. We'll start the analysis from the [search page](https://www.accommodationforstudents.com/search-results?location=London&beds=0&occupancy=min&minPrice=0&maxPrice=500&latitude=51.509865&longitude=-0.118092&geo=false&page=1).

Interacting with elements on the page, you'll quickly notice a request of this type:

```
https://www.accommodationforstudents.com/search?limit=22&skip=0&random=false&mode=text&numberOfBedrooms=0&occupancy=min&countryCode=gb&location=London&sortBy=price&order=asc
```

![Request type](/assets/images/request-185e9cf4845c0b0f07c004d155563ea7.webp)

If we look at the format of the received response, we'll immediately notice that it comes in [`JSON`](https://www.json.org/json-en.html) format.

![JSON response](/assets/images/json-a85571ceba8b80c314af9a159db15511.webp)

Great, we're getting data in a structured format that's very convenient to work with. We can see the total number of results, and the links to the listings are in the `url` attribute of each `properties` element.

Let's also take a look at the server response headers.
![server response]
* `content-type: application/json; charset=utf-8` - It tells us that the server response comes in JSON format, which we've already confirmed visually.
* `content-encoding: gzip` - It tells us that the response was compressed using [`gzip`](https://www.gnu.org/software/gzip/), and therefore we should use appropriate decompression in our crawler.
* `server: cloudflare` - The site is hosted on [Cloudflare](https://www.cloudflare.com/) servers and uses their protection. We should consider this when creating our crawler.

Great, let's also look at the parameters used in the search API request and make hypotheses about what they're responsible for:

* `limit: 22` - The number of elements we get per request.
* `skip: 0` - The element from which we'll start getting data; important for pagination.
* `random: false` - We don't change the random sorting, as we benefit from strict sorting.
* `mode: text` - An unusual parameter. If you decide to conduct several experiments, you'll find that it can take the following values: `text`, `fallback`, `geo`. Interestingly, the `geo` value completely changes the output, returning about 5400 options. I assume it's used to search by coordinates, and if we don't pass any coordinates, we get all the available results.
* `numberOfBedrooms: 0` - Filter by the number of bedrooms.
* `occupancy: min` - Filter by occupancy.
* `countryCode: gb` - The country code; in our case, it's Great Britain.
* `location: London` - The search location.
* `sortBy: price` - The field by which sorting is performed.
* `order: asc` - The sorting order.

But there's another important point to pay attention to. Let's look at our link in the browser bar, which looks like this:

```
https://www.accommodationforstudents.com/search-results?location=London&beds=0&occupancy=min&minPrice=0&maxPrice=500&latitude=51.509865&longitude=-0.118092&geo=false&page=1
```

In it, we see the coordinate parameters `latitude` and `longitude`, which don't participate in any way when interacting with the backend, and the `geo` parameter with a `false` value. This also confirms our hypothesis regarding the `mode` parameter, and it is quite useful if you want to extract all data from the site.

Great. We can get the site's search data in a convenient JSON format, and we have flexible parameters that guarantee data extraction, whether we want everything available on the site or only listings for a specific city. Let's move on to analyzing the property page.
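Before we do, here's a minimal sketch of what a direct request to this search endpoint might look like. It assumes the `httpx` library, which is not part of the article's final crawler, and it reuses the parameter and key names (`properties`, `url`) we observed in DevTools; it's only an illustration of the API shape, since a plain HTTP client may well be blocked by Cloudflare.

```
# Illustrative only: query the search endpoint found in DevTools.
# Assumes the `httpx` package; key names (`properties`, `url`) are taken
# from the response observed above and may differ in practice.
import httpx

params = {
    'limit': 22,
    'skip': 0,  # offset used for pagination
    'random': 'false',
    'mode': 'text',
    'numberOfBedrooms': 0,
    'occupancy': 'min',
    'countryCode': 'gb',
    'location': 'London',
    'sortBy': 'price',
    'order': 'asc',
}

response = httpx.get('https://www.accommodationforstudents.com/search', params=params)
response.raise_for_status()

data = response.json()  # httpx transparently decodes the gzip-encoded body
for listing in data.get('properties', []):
    print(listing.get('url'))
```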
Since the listing opens in a new window after clicking on it, make sure you have the `Auto-open DevTools for popups` option enabled in Dev Tools.

Unfortunately, after analyzing all requests, we don't see any interesting interaction with the backend. All listing data is obtained in one request containing HTML code and JSON elements.

![Listing data contained in HTML code and JSON elements](/assets/images/listing-d32f6a4dabca952c5150d3e4705028fb.webp)

After carefully studying the page's source code, we can say that all the data we're interested in is in the JSON located in the `script` tag that has an `id` attribute with the value `__NEXT_DATA__`. We can easily extract this JSON using a regular expression or an HTML parser.

We already have everything necessary to build the crawler at this analysis stage. We know how to get data from the search, how pagination works, how to go from the search to the listing page, and where to extract the data we're interested in on the listing page. But there's one obvious inconvenience: we get search data as JSON, while listing data comes as an HTML page with the JSON embedded inside it. This isn't a problem, but it is an inconvenience and means higher traffic consumption, as such an HTML page weighs much more than just the JSON. Let's continue our analysis.

The data in `__NEXT_DATA__` signals that the site uses the Next.js framework. Each framework has its own established internal patterns, parameters, and features. Let's analyze the listing page again by refreshing it and looking at the `.js` files we receive.

![Javascript files](/assets/images/javascript-52ed58b7cb5fca440f94193ff7687de3.webp)

We're interested in the file containing `_buildManifest.js` in its name. The link to it changes regularly, so I'll provide an example:

```
https://www.accommodationforstudents.com/_next/static/B5yLvSqNOvFysuIu10hQ5/_buildManifest.js
```

This file contains all possible routes available on the site. After careful study, we can see a link format like `/property/[id]`, which is clearly related to the property page. After reading more about Next.js, we can get the final link: `https://www.accommodationforstudents.com/_next/data/[build_id]/property/[id].json`. This link has two variables:

1. `build_id` - The current build of the `Next.js` application; it can be obtained from `__NEXT_DATA__` on any application page. In the example link for `_buildManifest.js`, its value is `B5yLvSqNOvFysuIu10hQ5`.
2. `id` - The identifier of the property object whose data we're interested in.

Let's form a link and study the result in the browser.

![Study the result in browser](/assets/images/result-64a7188999fd127f0b6e26bf94a4a7e5.webp)

As you can see, we now get the listing results in JSON format. Search also goes through `Next.js`, so let's get a similar link for it, so that our future crawler interacts with only one API. It transforms from the link you see in the browser bar and will look like this:

```
https://www.accommodationforstudents.com/_next/data/[build_id]/search-results.json?location=[location]&page=[page]
```

I think you immediately noticed that I excluded some of the search parameters; I did this because we simply don't need them. Coordinates aren't used in basic interaction with the backend. I plan for the crawler to search by location, so I keep the location and pagination parameters.

Let's summarize our analysis:

1. For search pages, we'll use links of the format `https://www.accommodationforstudents.com/_next/data/[build_id]/search-results.json?location=[location]&page=[page]`
2. For listing pages, we'll use links of the format `https://www.accommodationforstudents.com/_next/data/[build_id]/property/[id].json`
3. We need to get the `build_id`; we'll use the main page of the site and a simple regular expression for this.
4. We need an HTTP client that allows bypassing Cloudflare, and we don't need any HTML parsers, as we'll get all target data from JSON.

## Crawler implementation[​](#crawler-implementation "Direct link to Crawler implementation")

I'm using Crawlee for Python version `0.3.5`. This is important, as the library is developing actively and will have more capabilities in higher versions, but this is an ideal moment to show how we can work with it for complex projects.

The library already has support for an HTTP client that allows bypassing Cloudflare - [`CurlImpersonateHttpClient`](https://github.com/apify/crawlee-python/blob/v0.3.6/src/crawlee/http_clients/curl_impersonate.py). Since we have to work with JSON responses, we could use [`parsel_crawler`](https://github.com/apify/crawlee-python/tree/v0.3.5/src/crawlee/parsel_crawler), added in version `0.3.0`, but I think this is excessive for such tasks; besides, I like the high speed of [`orjson`](https://github.com/ijl/orjson). Therefore, we'll need to implement our own crawler rather than using one of the ready-made ones. As a sample crawler, we'll use [`beautifulsoup_crawler`](https://github.com/apify/crawlee-python/tree/v0.3.5/src/crawlee/beautifulsoup_crawler).

Let's install the necessary dependencies.

```
pip install crawlee[curl-impersonate]==0.3.5
pip install "orjson>=3.10.7,<4.0.0"
```

I'm using [`orjson`](https://pypi.org/project/orjson/) instead of the standard [`json`](https://docs.python.org/3/library/json.html) module due to its high performance, which is especially noticeable in asynchronous applications.

Well, let's implement our `custom_crawler`. Let's define the `CustomContext` class with the necessary attributes.

```
# custom_context.py
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

from crawlee.basic_crawler import BasicCrawlingContext
from crawlee.http_crawler import HttpCrawlingResult

if TYPE_CHECKING:
    from collections.abc import Callable


@dataclass(frozen=True)
class CustomContext(HttpCrawlingResult, BasicCrawlingContext):
    """Crawling context used by CustomCrawler."""

    page_data: dict | None

    # Not `EnqueueLinksFunction`, because we are breaking the protocol:
    # we are not working with HTML and we are not using selectors.
    enqueue_links: Callable
```

Note that in my context, `enqueue_links` is just `Callable`, not [`EnqueueLinksFunction`](https://github.com/apify/crawlee-python/blob/v0.3.5/src/crawlee/_types.py#L162). This is because we won't be using selectors and extracting links from HTML, which violates the agreed protocol. Still, I want the syntax in my crawler to be as close to the standardized one as possible.

Let's move on to the crawler functionality in the `CustomCrawler` class.
``` # custom_crawler.py from __future__ import annotations import logging from re import search from typing import TYPE_CHECKING, Any, Unpack from crawlee import Request from crawlee.basic_crawler import ( BasicCrawler, BasicCrawlerOptions, BasicCrawlingContext, ContextPipeline, ) from crawlee.errors import SessionError from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient from crawlee.http_crawler import HttpCrawlingContext from orjson import loads from afs_crawlee.constants import BASE_TEMPLATE, HEADERS from .custom_context import CustomContext if TYPE_CHECKING: from collections.abc import AsyncGenerator, Iterable class CustomCrawler(BasicCrawler[CustomContext]): """A crawler that fetches the request URL using `curl_impersonate` and parses the result with `orjson` and `re`.""" def __init__( self, *, impersonate: str = 'chrome124', additional_http_error_status_codes: Iterable[int] = (), ignore_http_error_status_codes: Iterable[int] = (), **kwargs: Unpack[BasicCrawlerOptions[CustomContext]], ) -> None: self._build_id = None self._base_url = BASE_TEMPLATE kwargs['_context_pipeline'] = ( ContextPipeline() .compose(self._make_http_request) .compose(self._handle_blocked_request) .compose(self._parse_http_response) ) # Initialize curl_impersonate http client using TLS preset and necessary headers kwargs.setdefault( 'http_client', CurlImpersonateHttpClient( additional_http_error_status_codes=additional_http_error_status_codes, ignore_http_error_status_codes=ignore_http_error_status_codes, impersonate=impersonate, headers=HEADERS, ), ) kwargs.setdefault('_logger', logging.getLogger(__name__)) super().__init__(**kwargs) ``` In `__init__`, we define that we'll use `CurlImpersonateHttpClient` as the `http_client`. Another important element is `_context_pipeline`, which defines the sequence of methods through which our context passes. `_make_http_request` - is completely identical to `BeautifulSoupCrawler` `_handle_blocked_request` - since we get all data through the API, only the server response status will signal about blocking. 
``` async def _handle_blocked_request(self, crawling_context: CustomContext) -> AsyncGenerator[CustomContext, None]: if self._retry_on_blocked: status_code = crawling_context.http_response.status_code if crawling_context.session and crawling_context.session.is_blocked_status_code(status_code=status_code): raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}') yield crawling_context ``` `_parse_http_response` - a function that encapsulates the main logic of parsing responses ``` async def _parse_http_response(self, context: HttpCrawlingContext) -> AsyncGenerator[CustomContext, None]: page_data = None if context.http_response.headers['content-type'] == 'text/html; charset=utf-8': # Get Build ID for Next js from the start page of the site, form a link to next.js endpoints build_id = search(rb'"buildId":"(.{21})"', context.http_response.read()).group(1) self._build_id = build_id.decode('UTF-8') self._base_url = self._base_url.format(build_id=self._build_id) else: # Convert json to python dictionary page_data = context.http_response.read() page_data = page_data.decode('ISO-8859-1').encode('utf-8') page_data = loads(page_data) async def enqueue_links( *, path_template: str, items: list[str], user_data: dict[str, Any] | None = None, label: str | None = None ) -> None: requests = list[Request]() user_data = user_data if user_data else {} for item in items: link_user_data = user_data.copy() if label is not None: link_user_data.setdefault('label', label) if link_user_data.get('label') == 'SEARCH': link_user_data['location'] = item url = self._base_url + path_template.format(item=item, **user_data) requests.append(Request.from_url(url, user_data=link_user_data)) await context.add_requests(requests) yield CustomContext( request=context.request, session=context.session, proxy_info=context.proxy_info, enqueue_links=enqueue_links, add_requests=context.add_requests, send_request=context.send_request, push_data=context.push_data, log=context.log, http_response=context.http_response, page_data=page_data, ) ``` As you can see, if the server response comes in HTML, we get the `build_id` using a simple regular expression. This condition should be executed once for the first link and is necessary to interact further with the Next.js API. In all other cases, we simply convert JSON to a Python `dict` and save it in the context. In `enqueue_links`, I create logic for generating links based on string templates and input parameters. That's it: our custom Crawler Class for Crawlee for Python is ready, it's based on the `CurlImpersonateHttpClient` client, works with JSON responses instead of HTML, and implements the link generation logic we need. Let's finalize it by defining public classes for import. ``` # init.py from .custom_crawler import CustomCrawler from .types import CustomContext __all__ = ['CustomCrawler', 'CustomContext'] ``` Now that we have the crawler functionality, let's implement routing and data extraction from the site. We'll use the [`official documentation`](https://www.crawlee.dev/python/docs/introduction/refactoring) as a template. 
``` # router.py from crawlee.router import Router from .constants import LISTING_PATH, SEARCH_PATH, TARGET_LOCATIONS from .custom_crawler import CustomContext router = Router[CustomContext]() @router.default_handler async def default_handler(context: CustomContext) -> None: """Handle the start URL to get the Build ID and create search links.""" context.log.info(f'default_handler is processing {context.request.url}') await context.enqueue_links( path_template=SEARCH_PATH, items=TARGET_LOCATIONS, label='SEARCH', user_data={'page': 1} ) @router.handler('SEARCH') async def search_handler(context: CustomContext) -> None: """Handle the SEARCH URL generates links to listings and to the next search page.""" context.log.info(f'search_handler is processing {context.request.url}') max_pages = context.page_data['pageProps']['initialPageCount'] current_page = context.request.user_data['page'] if current_page < max_pages: await context.enqueue_links( path_template=SEARCH_PATH, items=[context.request.user_data['location']], label='SEARCH', user_data={'page': current_page + 1}, ) else: context.log.info(f'Last page for {context.request.user_data["location"]} location') listing_ids = [ listing['property']['id'] for group in context.page_data['pageProps']['initialListings']['groups'] for listing in group['results'] if listing.get('property') ] await context.enqueue_links(path_template=LISTING_PATH, items=listing_ids, label='LISTING') @router.handler('LISTING') async def listing_handler(context: CustomContext) -> None: """Handle the LISTING URL extracts data from the listings and saving it to a dataset.""" context.log.info(f'listing_handler is processing {context.request.url}') listing_data = context.page_data['pageProps']['viewModel']['propertyDetails'] if not listing_data['exists']: context.log.info(f'listing_handler, data is not available for url {context.request.url}') return property_data = { 'property_id': listing_data['id'], 'property_type': listing_data['propertyType'], 'location_latitude': listing_data['coordinates']['lat'], 'location_longitude': listing_data['coordinates']['lng'], 'address1': listing_data['address']['address1'], 'address2': listing_data['address']['address2'], 'city': listing_data['address']['city'], 'postcode': listing_data['address']['postcode'], 'bills_included': listing_data.get('terms', {}).get('billsIncluded'), 'description': listing_data.get('description'), 'bathrooms': listing_data.get('numberOfBathrooms'), 'number_rooms': len(listing_data['rooms']) if listing_data.get('rooms') else None, 'rent_ppw': listing_data.get('terms', {}).get('rentPpw', {}).get('value', None), } await context.push_data(property_data) ``` Let's define our `main` function, which will launch the crawler. ``` # main.py from .custom_crawler import CustomCrawler from .router import router async def main() -> None: """The main function that starts crawling.""" crawler = CustomCrawler(max_requests_per_crawl=50, request_handler=router) # Run the crawler with the initial list of URLs. await crawler.run(['https://www.accommodationforstudents.com/']) await crawler.export_data('results.json') ``` Let's look at the results. ![Final results file](/assets/images/final-results-f14b378f9aa0cbd5d1185301c49a222e.webp) As I prefer to manage my projects as packages and use `pyproject.toml` according to [PEP 518](https://peps.python.org/pep-0518/), the final structure of our project will look like this. 
![PEP 518 file structure]
## Conclusion[​](#conclusion "Direct link to Conclusion")

In this project, we went through the entire cycle of crawler development, from analyzing a rather interesting dynamic site to a full implementation of a crawler using `Crawlee for Python`. You can view the full project code on [GitHub](https://github.com/Mantisus/crawlee_python_example).

I would also like to hear your comments and thoughts on which web scraping topic you'd like to see covered in the next article. Feel free to comment here in the article or contact me in the [Crawlee developer community](https://apify.com/discord) on Discord.

If you are looking to learn how to start scraping using Crawlee for Python, check out our [latest tutorial here](https://blog.apify.com/crawlee-for-python-tutorial/).

You can find me on the following platforms: [GitHub](https://github.com/Mantisus), [LinkedIn](https://www.linkedin.com/in/max-bohomolov/), [Apify](https://apify.com/mantisus), [Upwork](https://www.upwork.com/freelancers/mantisus), [Contra](https://contra.com/mantisus).

Thank you for your attention. I hope you found this information useful.

**Tags:**

* [community](https://crawlee.dev/blog/tags/community.md)

---

# Scrapy vs. Crawlee

April 23, 2024 · 12 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

Hey, crawling masters! Welcome to another post on the Crawlee blog; this time, we are going to compare Scrapy, one of the oldest and most popular web scraping libraries in the world, with Crawlee, a relative newcomer. This article will answer your questions about when to use Scrapy and help you decide when it would be better to use Crawlee instead. This article will be the first in a series comparing various technical aspects of Crawlee with Scrapy.

## Introduction:[​](#introduction "Direct link to Introduction:")

[Scrapy](https://scrapy.org/) is an open-source Python-based web scraping framework that extracts data from websites. With Scrapy, you create spiders, which are autonomous scripts to download and process web content. The limitation of Scrapy is that it does not work very well with JavaScript rendered websites, as it was designed for static HTML pages. We will come back to this comparison later in the article.

Crawlee is also an open-source library that originated as the [Apify SDK](https://docs.apify.com/sdk/js/). Crawlee has the advantage of being the newest library in the market, so it already has many features that Scrapy lacks, like autoscaling, headless browsing, working with JavaScript rendered websites without any plugins, and many more features, which we are going to explain later on.

## Feature comparison[​](#feature-comparison "Direct link to Feature comparison")

We'll start comparing Scrapy and Crawlee by looking at language and development environments, and then at features that make the scraping process easier for developers, like autoscaling, headless browsing, queue management, and more.

### Language and development environments[​](#language-and-development-environments "Direct link to Language and development environments")

Scrapy is written in Python, making it easier for the data science community to integrate it with various tools. While Scrapy offers very detailed documentation, it can take a lot of work to get started with Scrapy.
One of the reasons why it is considered not so beginner-friendly[\[1\]](https://towardsdatascience.com/web-scraping-with-scrapy-theoretical-understanding-f8639a25d9cd)[\[2\]](https://www.accordbox.com/blog/scrapy-tutorial-1-scrapy-vs-beautiful-soup/#:~:text=Since%20Scrapy%20does%20no%20only,to%20become%20a%20Scrapy%20expert.)[\[3\]](https://www.udemy.com/tutorial/scrapy-tutorial-web-scraping-with-python/scrapy-vs-beautiful-soup-vs-selenium//1000) is its [complex architecture](https://docs.scrapy.org/en/latest/topics/architecture.html), which consists of various components like spiders, middleware, item pipelines, and settings. These can be challenging for beginners.

Crawlee is one of the few web scraping and automation libraries that supports JavaScript and TypeScript. Crawlee supports a CLI just like Scrapy, but it also provides [pre-built templates](https://github.com/apify/crawlee/tree/master/packages/templates/templates) in TypeScript and JavaScript with support for Playwright and Puppeteer. These templates help beginners quickly understand the file structure and how it works.

### Headless browsing and JS rendering[​](#headless-browsing-and-js-rendering "Direct link to Headless browsing and JS rendering")

Scrapy does not support headless browsers natively, but it supports them through its plugin system. Similarly, it does not support scraping JavaScript rendered websites out of the box, but its plugin system makes this possible. One of the best examples is its [Playwright plugin](https://github.com/scrapy-plugins/scrapy-playwright/tree/main).

Apify Store is a JavaScript rendered website, so we will scrape it in this example using the `scrapy-playwright` integration. For installation and the changes to `settings.py`, please follow the instructions in the `scrapy-playwright` [repository on GitHub](https://github.com/scrapy-plugins/scrapy-playwright/tree/main?tab=readme-ov-file#installation). Then, create a spider with this code to scrape the data:

spider.py

```
import scrapy


class ActorSpider(scrapy.Spider):
    name = 'actor_spider'
    start_urls = ['https://apify.com/store']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"playwright": True, "playwright_include_page": True},
                callback=self.parse_playwright
            )

    async def parse_playwright(self, response):
        page = response.meta['playwright_page']
        await page.wait_for_selector('.ActorStoreItem-title-wrapper')
        actor_card = await page.query_selector('.ActorStoreItem-title-wrapper')
        if actor_card:
            actor_text = await actor_card.text_content()
            yield {
                'actor': actor_text.strip() if actor_text else 'N/A'
            }
        await page.close()
```

One of the drawbacks of this plugin is its [lack of native support for Windows](https://github.com/scrapy-plugins/scrapy-playwright/tree/main?tab=readme-ov-file#lack-of-native-support-for-windows).

In Crawlee, you can scrape JavaScript rendered websites using the built-in headless [Puppeteer](https://github.com/puppeteer/puppeteer/) and [Playwright](https://github.com/microsoft/playwright) browsers. It is important to note that, by default, Crawlee scrapes in headless mode. If you don't want headless, then just set `headless: false`.
* Playwright * Puppeteer crawler.js ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ page }) { const actorCard = page.locator('.ActorStoreItem-title-wrapper').first(); const actorText = await actorCard.textContent(); await crawler.pushData({ 'actor': actorText }); }, }); await crawler.run(['https://apify.com/store']); ``` crawler.js ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ page }) { await page.waitForSelector('.ActorStoreItem-title-wrapper'); const actorText = await page.$eval('.ActorStoreItem-title-wrapper', (el) => { return el.textContent; }); await crawler.pushData({ 'actor': actorText }); }, }); await crawler.run(['https://apify.com/store']); ``` ### Autoscaling support[​](#autoscaling-support "Direct link to Autoscaling support") Autoscaling refers to the capability of a library to automatically adjusting the number of concurrent tasks (such as browser instances, HTTP requests, etc.) based on the current load and system resources. This feature is particularly useful when handling web scraping and crawling tasks that may require dynamically scaled resources to optimize performance, manage system load, and handle rate limitations efficiently. Scrapy does not have built-in autoscaling capabilities, but it can be done using external services like [Scrapyd](https://scrapyd.readthedocs.io/en/latest/) or deployed in a distributed manner with Scrapy Cluster. Crawlee has [built-in autoscaling](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) with `AutoscaledPool`. It increases the number of requests that are processed concurrently within one crawler. ### Queue management[​](#queue-management "Direct link to Queue management") Scrapy supports both breadth-first and depth-first crawling strategies using a disk-based queuing system. By default, it uses the LIFO queue for the pending requests, which means it is using depth-first order, but if you want to use breadth-first order, you can do it by changing these settings: settings.py ``` DEPTH_PRIORITY = 1 SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue" SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue" ``` Crawlee uses breadth-first by default and you can override it on a per-request basis by using the `forefront: true` argument in `addRequest` and its derivatives. If you use `forefront: true` for all requests, it becomes a depth-first process. ### CLI support[​](#cli-support "Direct link to CLI support") Scrapy has a [powerful command-line interface](https://docs.scrapy.org/en/latest/topics/commands.html#command-line-tool) that offers functionalities like starting a project, generating spiders, and controlling the crawling process. Scrapy CLI comes with Scrapy. Just run this command, and you are good to go: ``` pip install scrapy ``` Crawlee also [includes a CLI tool](https://crawlee.dev/js/docs/quick-start.md#installation-with-crawlee-cli) (`crawlee-cli`) that facilitates project setup, crawler creation and execution, streamlining the development process for users familiar with Node.js environments. The command for installation is: ``` npx crawlee create my-crawler ``` ### Proxy rotation and storage management[​](#proxy-rotation-and-storage-management "Direct link to Proxy rotation and storage management") Scrapy handles it via custom middleware. You have to install their [`scrapy-rotating-proxies`](https://pypi.org/project/scrapy-rotating-proxies/) package using pip. 
### CLI support[​](#cli-support "Direct link to CLI support")

Scrapy has a [powerful command-line interface](https://docs.scrapy.org/en/latest/topics/commands.html#command-line-tool) that offers functionalities like starting a project, generating spiders, and controlling the crawling process. The Scrapy CLI ships with Scrapy itself. Just run this command, and you are good to go:

```
pip install scrapy
```

Crawlee also [includes a CLI tool](https://crawlee.dev/js/docs/quick-start.md#installation-with-crawlee-cli) (`crawlee-cli`) that facilitates project setup, crawler creation, and execution, streamlining the development process for users familiar with Node.js environments. To scaffold a new project, run:

```
npx crawlee create my-crawler
```

### Proxy rotation and storage management[​](#proxy-rotation-and-storage-management "Direct link to Proxy rotation and storage management")

Scrapy handles proxy rotation via custom middleware. You have to install the [`scrapy-rotating-proxies`](https://pypi.org/project/scrapy-rotating-proxies/) package using pip:

```
pip install scrapy-rotating-proxies
```

Then, in the `settings.py` file, add the middleware to `DOWNLOADER_MIDDLEWARES` and specify the list of proxy servers in `ROTATING_PROXY_LIST`. For example:

settings.py

```
DOWNLOADER_MIDDLEWARES = {
    # Lower value means higher priority
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'scrapy_rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # Add more proxies as needed
]
```

Now create a spider with the scraping code for any site, and the `ROTATING_PROXY_LIST` in `settings.py` will determine which proxy is used for each request. The middleware initially treats every proxy as valid. When a request is made, it selects a proxy from the list of available proxies; the selection isn't purely sequential but is influenced by the recent history of proxy performance. The middleware has mechanisms to detect when a proxy might be banned or rendered ineffective. When such conditions are detected, the proxy is temporarily deactivated and put into a cooldown period. After the cooldown period expires, the proxy is reconsidered for use.

In Crawlee, you can [use your own proxy servers](https://crawlee.dev/js/docs/guides/proxy-management.md) or proxy servers acquired from third-party providers. If you already have your proxy URLs, you can start using them like this:

crawler.js

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com',
        'http://proxy2.example.com',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // ...
});
```

Crawlee also has [`SessionPool`](https://crawlee.dev/js/api/core/class/SessionPool.md), a built-in allocation system for proxies. It handles the rotation, creation, and persistence of user-like sessions, creating a pool of session instances that are rotated randomly.

### Data storage[​](#data-storage "Direct link to Data storage")

One of the most frequently required features when implementing scrapers is being able to store the scraped data as an "export file". Scrapy provides this functionality out of the box with [`Feed Exports`](https://docs.scrapy.org/en/latest/topics/feed-exports.html), which let it generate feeds with the scraped items using multiple serialization formats and storage backends. It supports CSV, JSON, JSON Lines, and XML. To use it, modify your `settings.py` file and enter:

settings.py

```
# To store in CSV format
FEEDS = {
    'data/crawl_data.csv': {'format': 'csv', 'overwrite': True}
}

# OR to store in JSON format
FEEDS = {
    'data/crawl_data.json': {'format': 'json', 'overwrite': True}
}
```

Crawlee's storage can be divided into two categories: request storage (Request Queue and Request List) and results storage (Datasets and Key-Value Stores). Both are stored locally, by default in the `./storage` directory. Also, remember that Crawlee, by default, clears its storages before starting a crawler run. This prevents old data from interfering with new crawling sessions.
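If you need to keep data from previous runs, this purge can be disabled. One way to do it is via the `purgeOnStart` configuration option; a minimal sketch (the request handler body is illustrative):

```
import { CheerioCrawler, Configuration } from 'crawlee';

// Disable the automatic purge of ./storage at startup.
// (Setting the CRAWLEE_PURGE_ON_START environment variable to 0 works too.)
const config = new Configuration({ purgeOnStart: false });

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
}, config);
```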
Let's see how Crawlee stores the result:

* You can use local storage with a dataset

crawler.js

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        const title = await page.title();
        const price = await page.textContent('.price');
        await crawler.pushData({
            url: request.url,
            title,
            price,
        });
    },
});

await crawler.run(['http://example.com']);
```

* Using a Key-Value Store

crawler.js

```
import { KeyValueStore } from 'crawlee';

// ... Code to crawl the data
await KeyValueStore.setValue('key', { foo: 'bar' });
```

### Anti-blocking and fingerprints[​](#anti-blocking-and-fingerprints "Direct link to Anti-blocking and fingerprints")

In Scrapy, handling anti-blocking strategies like [IP rotation](https://pypi.org/project/scrapy-rotated-proxy/) and [user-agent rotation](https://python.plainenglish.io/rotating-user-agent-with-scrapy-78ca141969fe) requires custom solutions via middleware and plugins.

Crawlee provides HTTP crawling and [browser fingerprints](https://crawlee.dev/js/docs/guides/avoid-blocking.md) with zero configuration necessary; fingerprints are enabled by default and available in `PlaywrightCrawler` and `PuppeteerCrawler`, but they also work with `CheerioCrawler` and the other HTTP crawlers.

### Error handling[​](#error-handling "Direct link to Error handling")

Both libraries support error-handling practices like automatic retries, logging, and custom error handling.

In Scrapy, you can handle errors using middleware and [signals](https://docs.scrapy.org/en/latest/topics/signals.html). There are also [exceptions](https://docs.scrapy.org/en/latest/topics/exceptions.html) like `IgnoreRequest`, which can be raised by the Scheduler or any downloader middleware to indicate that the request should be ignored. Similarly, a spider callback can raise `CloseSpider` to close the spider.

Scrapy has built-in support for retrying failed requests. You can configure the retry policy (e.g., the number of retries, retrying on particular HTTP codes) via settings such as `RETRY_TIMES`, as shown in the example:

settings.py

```
RETRY_ENABLED = True
RETRY_TIMES = 2  # Number of retry attempts
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524]  # HTTP error codes to retry
```

In Crawlee, you can also set up a custom error handler. For retries, `maxRequestRetries` controls how often Crawlee will retry a request before marking it as failed. To set it up, you just need to add the following option to your crawler:

crawler.js

```
const crawler = new CheerioCrawler({
    maxRequestRetries: 3, // Crawler will retry a failed request three times.
    // ...
});
```

There is also `noRetry`. If set to `true`, the request will not be automatically retried. Crawlee also provides a built-in [logging mechanism](https://crawlee.dev/js/api/core/class/Log.md) via `log`, allowing you to log warnings, errors, and other information effectively.

### Deployment using Docker[​](#deployment-using-docker "Direct link to Deployment using Docker")

Scrapy can be containerized using Docker, though it typically requires manual setup to create Dockerfiles and configure environments. Crawlee, on the other hand, includes [ready-to-use Docker configurations](https://crawlee.dev/js/docs/guides/docker-images.md), making deployment straightforward across various environments without additional configuration.

## Community[​](#community "Direct link to Community")

Both projects are open source. Scrapy benefits from a large and well-established community.
It has been around since 2008 and has attracted a lot of attention among developers, particularly those in the Python ecosystem. Crawlee started its journey as the Apify SDK in 2018. It now has more than [12K stars on GitHub](https://github.com/apify/crawlee), a community of more than 7,000 developers in its [Discord Community](https://apify.com/discord), and is used by the TypeScript and JavaScript community.

## So which is better - Scrapy or Crawlee?[​](#so-which-is-better---scrapy-or-crawlee "Direct link to So which is better - Scrapy or Crawlee?")

Both frameworks can handle a wide range of scraping tasks, and the best choice depends on specific technical needs like language preference, project requirements, and ease of use.

If you are comfortable with Python and want to work only with it, go with Scrapy. It has very detailed documentation, and it is one of the oldest and most stable libraries in the space.

But if you want to explore, or are already comfortable with, TypeScript or JavaScript, our recommendation is Crawlee. With valuable features like a single interface for HTTP requests and headless browsing (which makes it work well with JavaScript-rendered websites), autoscaling, and fingerprint support, it is an excellent choice for scraping websites that are complex, resource-intensive, JavaScript-heavy, or protected by blocking methods.

As promised, this is just the first of many articles comparing Scrapy and Crawlee. In upcoming articles, you will learn more about the technical details. Meanwhile, if you want to learn more about Crawlee, read our [introduction to Crawlee](https://crawlee.dev/js/docs/introduction.md) or Apify's [Crawlee web scraping tutorial](https://blog.apify.com/crawlee-web-scraping-tutorial/).

---

# Inside implementing SuperScraper with Crawlee

March 5, 2025 · 6 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

[![Radoslav Chudovský](https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512)](https://github.com/chudovskyr) [Radoslav Chudovský](https://github.com/chudovskyr) Web Automation Engineer

[SuperScraper](https://github.com/apify/super-scraper) is an open-source [Actor](https://docs.apify.com/platform/actors) that combines features from various web scraping services, including [ScrapingBee](https://www.scrapingbee.com/), [ScrapingAnt](https://scrapingant.com/), and [ScraperAPI](https://www.scraperapi.com/). A key capability is its standby mode, which runs the Actor as a persistent API server. This removes the usual start-up times - a common pain point in many systems - and lets users make direct API calls to interact with the system immediately.

This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.

![SuperScraper](/assets/images/superscraper-8d24da63227f97df70998e8900b3a901.webp)

### What is SuperScraper?[​](#what-is-superscraper "Direct link to What is SuperScraper?")

SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests.

### How to enable standby mode[​](#how-to-enable-standby-mode "Direct link to How to enable standby mode")

To activate standby mode, you must configure the Actor's settings so it listens for incoming requests.
![Activating Actor standby mode](/assets/images/actor-standby-9b094dde2615b70afb82685d56c8d74e.webp)

### Server setup[​](#server-setup "Direct link to Server setup")

The project uses the Node.js `http` module to create a server that listens on the desired port. After the server starts, a check ensures users are interacting with it correctly by sending requests instead of running it traditionally. This keeps SuperScraper operating as a persistent server.

### Handling multiple crawlers[​](#handling-multiple-crawlers "Direct link to Handling multiple crawlers")

SuperScraper processes user requests using multiple instances of Crawlee's [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). Since each `PlaywrightCrawler` instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting. For example, if the user sends one request for "normal" proxies and one request for residential US proxies, a separate crawler needs to be created for each proxy configuration. To solve this, we store the crawlers in a key-value map, where the key is a stringified proxy configuration.

```
const crawlers = new Map();
```

Here's the part of the code that gets executed when a new request from the user arrives: if a crawler for this proxy configuration exists in the map, it will be used; otherwise, a new crawler gets created. Then, we add the request to the crawler's queue so it can be processed.

```
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key)
    ? crawlers.get(key)!
    : await createAndStartCrawler(crawlerOptions);

await crawler.addRequests([request]);
```

The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the `MemoryStorage` client. This approach is used for two key reasons:

1. **Performance**: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.
2. **Isolation**: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.

```
export const createAndStartCrawler = async (crawlerOptions: CrawlerOptions = DEFAULT_CRAWLER_OPTIONS) => {
    const client = new MemoryStorage({ persistStorage: false });
    const queue = await RequestQueue.open(undefined, { storageClient: client });

    const proxyConfig = await Actor.createProxyConfiguration(crawlerOptions.proxyConfigurationOptions);

    const crawler = new PlaywrightCrawler({
        keepAlive: true,
        proxyConfiguration: proxyConfig,
        maxRequestRetries: 4,
        requestQueue: queue,
    });
};
```

At the end of the function, we start the crawler and log a message if it terminates for any reason. Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return the crawler.

```
crawler.run().then(
    () => log.warning(`Crawler ended`, crawlerOptions),
    () => {},
);

crawlers.set(JSON.stringify(crawlerOptions), crawler);
log.info('Crawler ready 🚀', crawlerOptions);

return crawler;
```

### Mapping standby HTTP requests to Crawlee requests[​](#mapping-standby-http-requests-to-crawlee-requests "Direct link to Mapping standby HTTP requests to Crawlee requests")

The server is created with a request listener function that takes two arguments: the user's request and a response object. The response object is used to send scraped data back to the user.
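A minimal sketch of that wiring (illustrative only, not the actual SuperScraper source; the port and the placeholder response body are assumptions):

```
import { createServer } from 'node:http';

const server = createServer(async (req, res) => {
    // In SuperScraper, this is where the query parameters are parsed into
    // crawler options and the request is handed to the right crawler.
    res.writeHead(202, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: 'accepted', url: req.url }));
});

// The port is illustrative; on the Apify platform, the standby port is provided by the runtime.
server.listen(3000);
```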
These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between the request and its corresponding response object; it is used as `request.uniqueKey`.

```
const responses = new Map();
```

**Saving response objects**

The following function stores a response object in the key-value map:

```
export function addResponse(responseId: string, response: ServerResponse) {
    responses.set(responseId, response);
}
```

**Updating crawler logic to store responses**

Here's the updated logic for fetching/creating the corresponding crawler for a given proxy configuration, with a call to store the response object:

```
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key)
    ? crawlers.get(key)!
    : await createAndStartCrawler(crawlerOptions);

addResponse(request.uniqueKey!, res);
await crawler.requestQueue!.addRequest(request);
```

**Sending scraped data back**

Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:

```
export const sendSuccResponseById = (responseId: string, result: unknown, contentType: string) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }
    res.writeHead(200, { 'Content-Type': contentType });
    res.end(result);
    responses.delete(responseId);
};
```

**Error handling**

There is similar logic to send a response back if an error occurs during scraping:

```
export const sendErrorResponseById = (responseId: string, result: string, statusCode: number = 500) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(result);
    responses.delete(responseId);
};
```

**Adding timeouts during migrations**

During migration, SuperScraper adds timeouts to pending responses to handle termination cleanly.

```
export const addTimeoutToAllResponses = (timeoutInSeconds: number = 60) => {
    const migrationErrorMessage = {
        errorMessage: 'Actor had to migrate to another server. Please, retry your request.',
    };

    // Iterate over the keys of the Map of pending responses.
    const responseKeys = [...responses.keys()];

    for (const key of responseKeys) {
        setTimeout(() => {
            sendErrorResponseById(key, JSON.stringify(migrationErrorMessage));
        }, timeoutInSeconds * 1000);
    }
};
```

### Managing migrations[​](#managing-migrations "Direct link to Managing migrations")

SuperScraper handles migrations by timing out active responses to prevent lingering requests during server transitions.

```
Actor.on('migrating', () => {
    addTimeoutToAllResponses(60);
});
```

Users receive clear feedback during server migrations, maintaining stable operation.

### Build your own[​](#build-your-own "Direct link to Build your own")

This guide showed how to build and manage a standby web scraper using Apify's platform and Crawlee. The implementation handles multiple proxy configurations through `PlaywrightCrawler` instances while managing request-response cycles efficiently to support diverse scraping needs. Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.
To get started, explore the project on [GitHub](https://github.com/apify/super-scraper) or learn more about [Crawlee](https://crawlee.dev/index.md) to build your own scalable web scraping tools.

---

## [How to scrape YouTube using Python \[2025 guide\]](https://crawlee.dev/blog/scrape-youtube-python.md)

July 14, 2025 · 23 min read

[![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert

In this guide, we'll explore how to efficiently collect data from YouTube using [Crawlee for Python](https://github.com/apify/crawlee-python). The scraper will extract video metadata, video statistics, and transcripts - giving you structured YouTube data perfect for content analysis, ML training, or trend monitoring.

note

One of our community members wrote this guide as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on Apify's [Discord channel](https://apify.com/discord).

![How to scrape YouTube using Python](/assets/images/youtube_banner-fb73d10d52bbf13a89f3c0d66d2eff5b.webp)

Key steps we'll cover:

1. [Project setup](https://www.crawlee.dev/blog/scrape-youtube-python#1-project-setup)
2. [Analyzing YouTube and determining a scraping strategy](https://www.crawlee.dev/blog/scrape-youtube-python#2-analyzing-youtube-and-determining-a-scraping-strategy)
3. [Configuring Crawlee](https://www.crawlee.dev/blog/scrape-youtube-python#3-configuring-crawlee)
4. [Extracting YouTube data](https://www.crawlee.dev/blog/scrape-youtube-python#4-extracting-youtube-data)
5. [Enhancing the scraper capabilities](https://www.crawlee.dev/blog/scrape-youtube-python#5-enhancing-the-scraper-capabilities)
6. [Creating a YouTube Actor on the Apify platform](https://www.crawlee.dev/blog/scrape-youtube-python#6-creating-a-youtube-actor-on-the-apify-platform)
7. [Deploying to Apify](https://www.crawlee.dev/blog/scrape-youtube-python#7-deploying-to-apify)

**Tags:**

* [community](https://crawlee.dev/blog/tags/community.md)

[**Read More**](https://crawlee.dev/blog/scrape-youtube-python.md)

---

## [How Crawlee uses tiered proxies to avoid getting blocked](https://crawlee.dev/blog/proxy-management-in-crawlee.md)

June 24, 2024 · 4 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

Hello Crawlee community,

We are back with another blog, this time explaining how Crawlee rotates proxies and prevents crawlers from getting blocked.

Proxies vary in quality, speed, reliability, and cost. There are a [few types of proxies](https://blog.apify.com/types-of-proxies/), such as datacenter and residential proxies. Datacenter proxies are cheaper but more prone to getting blocked, while residential proxies are the opposite. It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let's take a look at it.
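As a quick preview of what that looks like in code, here is a minimal sketch using the `tieredProxyUrls` option of `ProxyConfiguration` (the proxy URLs are placeholders):

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Tiers are ordered from cheapest to most reliable; Crawlee starts low and
// only climbs to a higher tier when it detects blocking.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        ['http://cheap-datacenter-proxy.example.com:8000'],
        ['http://expensive-residential-proxy.example.com:8000'],
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, log }) {
        log.info(`Scraped ${request.url}`);
    },
});
```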
**Tags:**

* [proxy](https://crawlee.dev/blog/tags/proxy.md)

[**Read More**](https://crawlee.dev/blog/proxy-management-in-crawlee.md)

---

# 12 tips on how to think like a web scraping expert

November 10, 2024 · 13 min read

[![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert

Typically, tutorials focus on the technical aspects, on what you can replicate: "Start here, follow this path, and you'll end up here." This is great for learning a particular technology, but it's sometimes difficult to understand why the author decided to do things a certain way or what guides their development process.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [Discord channel](https://apify.com/discord).

In this blog, I'll discuss the general rules and principles that guide me when I work on web scraping projects and allow me to achieve great results. So, let's explore the mindset of a web scraping developer.

![How to think like a web scraping expert](/assets/images/scraping-tips-8c538d5ae19dc1737b083169ad2a203b.webp)

## 1. Choosing a data source for the project[​](#1-choosing-a-data-source-for-the-project "Direct link to 1. Choosing a data source for the project")

When you start working on a project, you likely have a target site from which you need to extract specific data. Check what possibilities this site or application provides for data extraction. Here are some possible options:

* `Official API` - the site may provide a free official API through which you can get all the necessary data. This is the best option for you. For example, you can consider this approach if you need to extract data from [`Yelp`](https://docs.developer.yelp.com/docs/fusion-intro)
* `Website` - in this case, we study the website, its structure, as well as the ways the frontend and backend interact
* `Mobile Application` - in some cases, there's no website or API at all, or the mobile application provides more data; in that case, don't forget about the [`man-in-the-middle`](https://blog.apify.com/using-a-man-in-the-middle-proxy-to-scrape-data-from-a-mobile-app-api-e954915f979d/) approach

If one data source fails, try accessing another available source. For example, for `Yelp`, all three options are available, and if the `Official API` doesn't suit you for some reason, you can try the other two.

## 2. Check [`robots.txt`](https://developers.google.com/search/docs/crawling-indexing/robots/intro) and [`sitemap`](https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap)[​](#2-check-robotstxt-and-sitemap "Direct link to 2-check-robotstxt-and-sitemap")

I think everyone knows about `robots.txt` and `sitemap` one way or another, but I regularly see people simply forgetting about them. If you're hearing about these for the first time, here's a quick explanation:

* `robots` is the established name for crawlers in SEO. Usually, this refers to crawlers of major search engines like Google and Bing, or services like Ahrefs and ChatGPT.
* `robots.txt` is a file describing the allowed behavior for robots. It includes permitted crawler user-agents, wait time between page scans, patterns of pages forbidden for scanning, and more. These rules are typically based on which pages should be indexed by search engines and which should not.
* `sitemap` describes the site structure to make it easier for robots to navigate. It also helps in scanning only the content that needs updating, without creating unnecessary load on the site.

Since you're not [`Google`](http://google.com/) or any other popular search engine, the robot rules in `robots.txt` will likely be against you. But combined with the `sitemap`, this is a good place to study the site structure, the expected interaction with robots, and non-browser user-agents. In some situations, it simplifies data extraction from the site. For example, using the [`sitemap`](https://www.crawlee.dev/sitemap.xml) for the [Crawlee website](http://www.crawlee.dev/), you can easily get direct links to posts both for the entire lifespan of the blog and for a specific period. One simple check, and you don't need to implement pagination logic.

## 3. Don't neglect site analysis[​](#3-dont-neglect-site-analysis "Direct link to 3. Don't neglect site analysis")

Thorough site analysis is an important prerequisite for creating an effective web scraper, especially if you're not planning to use browser automation. However, such analysis takes time, sometimes a lot of it. It's also worth noting that the time spent on analysis and searching for a more optimal crawling solution doesn't always pay off - you might spend hours only to discover that the most obvious approach was the best all along. Therefore, it's wise to set limits on your initial site analysis. If you don't see a better path within the allocated time, revert to simpler approaches. As you gain more experience, you'll more often be able to tell early on, based on the technologies used on the site, whether it's worth dedicating more time to analysis or not.

Also, in projects where you need to extract data from a site just once, thorough site analysis can sometimes eliminate the need to write scraper code altogether. Here's an example of such a site - `https://ricebyrice.com/nl/pages/find-store`.

![Ricebyrice](/assets/images/ricebyrice_base-433dcb67f3debf8855b0043fb87a63c3.webp)

By analyzing it, you'll easily discover that all the data can be obtained with a single request. You simply need to copy this data from your browser into a JSON file, and your task is complete.

![Ricebyrice Response](/assets/images/ricebyrice_response-77221911846c701f7abd865673867d60.webp)

## 4. Maximum interactivity[​](#4-maximum-interactivity "Direct link to 4. Maximum interactivity")

When analyzing a site, switch sorting options and pages, and interact with various elements of the site while watching the `Network` tab in your browser's [Dev Tools](https://developer.chrome.com/docs/devtools). This will allow you to better understand how the site interacts with the backend, what framework it's built on, and what behavior can be expected from it.

## 5. Data doesn't appear out of thin air[​](#5-data-doesnt-appear-out-of-thin-air "Direct link to 5. Data doesn't appear out of thin air")

This is obvious, but it's important to keep in mind while working on a project. If you see some data or request parameters, it means they were obtained somewhere earlier: possibly in another request, possibly they were already on the website page, possibly they were formed using JS from other parameters. But they are always somewhere.

If you don't understand where the data on the page comes from, or the data used in a request, follow these steps:

1. Sequentially check all requests the site made before this point.
2. Examine their responses, headers, and cookies.
3. Use your intuition: Could this parameter be a timestamp? Could it be another parameter in a modified form?
4. Does it resemble any standard hashes or encodings?

Practice makes perfect here. As you become familiar with different technologies and frameworks and their expected behaviors, you'll find it easier to understand how things work and how data is transferred. This accumulated knowledge will significantly improve your ability to trace and understand data flow in web applications.

## 6. Data is cached[​](#6-data-is-cached "Direct link to 6. Data is cached")

You may notice that when opening the same page several times, the requests transmitted to the server differ: possibly something was cached and is already stored on your computer. Therefore, it's recommended to analyze the site in incognito mode, as well as to switch browsers.

This situation is especially relevant for mobile applications, which may store some data in storage on the device. Therefore, when analyzing mobile applications, you may need to clear the cache and storage.

## 7. Learn more about the framework[​](#7-learn-more-about-the-framework "Direct link to 7. Learn more about the framework")

If during the analysis you discover that the site uses a framework you haven't encountered before, take some time to learn about it and its features. For example, if you notice a site is built with Next.js, understanding how it handles routing and data fetching could be crucial for your scraping strategy.

You can learn about these frameworks through official documentation or by using LLMs like [`ChatGPT`](https://openai.com/chatgpt/) or [`Claude`](https://claude.ai/). These AI assistants are excellent at explaining framework-specific concepts. Here's an example of how you might query an LLM about Next.js:

```
I am in the process of optimizing my website using Next.js. Are there any files passed to the browser that describe all internal routing and how links are formed?

Restrictions:
- Accompany your answers with code samples
- Use this message as the main message for all subsequent responses
- Reference only those elements that are available on the client side, without access to the project code base
```

You can create similar queries for backend frameworks as well. For instance, with GraphQL, you might ask about available fields and query structures. These insights can help you understand how to better interact with the site's API and what data is potentially available.

For effective work with LLMs, I recommend at least studying the basics of [`prompt engineering`](https://parlance-labs.com/education/prompt_eng/berryman.html).

## 8. Reverse engineering[​](#8-reverse-engineering "Direct link to 8. Reverse engineering")

Web scraping goes hand in hand with reverse engineering. You study the interactions of the frontend and backend, and you may need to study the code to better understand how certain parameters are formed.

But in some cases, reverse engineering may require more knowledge, effort, and time, or have a high degree of complexity. At this point, you need to decide whether you need to delve into it or whether it's better to change the data source or, for example, the technologies used. Most likely, this will be the moment when you decide to abandon HTTP web scraping and switch to a headless browser.

The main principle of most web scraping protections is not to make web scraping impossible, but to make it expensive.
Let's just look at what the response to a search on [`zoopla`](https://www.zoopla.co.uk/) looks like:

![Zoopla Search Response](/assets/images/zoopla_response-c6997e953965244f6293d44d2562f2dd.webp)

## 9. Testing requests to endpoints[​](#9-testing-requests-to-endpoints "Direct link to 9. Testing requests to endpoints")

After identifying the endpoints you need for extracting the target data, make sure you get a correct response when making a request. If you get a response from the server other than 200, or data different from what you expected, then you need to figure out why. Here are some possible reasons:

* You need to pass some parameters, for example cookies, or specific technical headers
* The site requires that, when accessing this endpoint, there is a corresponding `Referrer` header
* The site expects that the headers will follow a certain order. I've encountered this only a couple of times, but I have encountered it
* The site uses protection against web scraping, for example with `TLS fingerprint`

And there are many other possible reasons, each of which requires separate analysis.

## 10. Experiment with request parameters[​](#10-experiment-with-request-parameters "Direct link to 10. Experiment with request parameters")

Explore what results you get when changing request parameters, if any. Some parameters may be missing from the request but still supported on the server side, for example, `order`, `sort`, `per_page`, `limit`, and others. Try adding them and see if the behavior changes. This is especially relevant for sites using [`graphql`](https://graphql.org/).

Let's consider this [`example`](https://restoran.ua/en/posts?subsection=0). If you analyze the site, you'll see a request that can be reproduced with the following code; I've formatted it a bit to improve readability:

```
import requests

url = "https://restoran.ua/graphql"

data = {
    "operationName": "Posts_PostsForView",
    "variables": {"sort": {"sortBy": ["startAt_DESC"]}},
    "query": """query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
            where: $where
            sort: $sort
            pagination: $pagination
            search: $search
            token: $token
        ) {
            id
            title: ukTitle
            summary: ukSummary
            slug
            startAt
            endAt
            newsFeed
            events
            journal
            toProfessionals
            photoHeader {
                address: mobile
                __typename
            }
            coordinates(slice: $coordinates_slice) {
                lng
                lat
                __typename
            }
            __typename
        }
    }"""
}

response = requests.post(url, json=data)

print(response.json())
```

Now I'll update it to get results in 2 languages at once and, most importantly, along with the internal text of the publications:

```
import requests

url = "https://restoran.ua/graphql"

data = {
    "operationName": "Posts_PostsForView",
    "variables": {"sort": {"sortBy": ["startAt_DESC"]}},
    "query": """query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
            where: $where
            sort: $sort
            pagination: $pagination
            search: $search
            token: $token
        ) {
            id
            uk_title: ukTitle
            en_title: enTitle
            summary: ukSummary
            slug
            startAt
            endAt
            newsFeed
            events
            journal
            toProfessionals
            photoHeader {
                address: mobile
                __typename
            }
            mixedBlocks {
                index
                en_text: enText
                uk_text: ukText
                __typename
            }
            coordinates(slice: $coordinates_slice) {
                lng
                lat
                __typename
            }
            __typename
        }
    }"""
}

response = requests.post(url, json=data)

print(response.json())
```

As you can see, a small update of the request parameters allows me not to worry about visiting
the internal page of each publication. You have no idea how many times this trick has saved me.

If you see `graphql` in front of you and don't know where to start, then my advice about documentation and LLMs works here too.

## 11. Don't be afraid of new technologies[​](#11-dont-be-afraid-of-new-technologies "Direct link to 11. Don't be afraid of new technologies")

I know how easy it is to master a few tools and just keep using them because they work. I've fallen into this trap more than once myself. But modern sites use modern technologies that have a significant impact on web scraping, and in response, new tools for web scraping are emerging. Learning these may greatly simplify your next project, and may even solve some problems that were insurmountable for you. I wrote about some tools [`earlier`](https://www.crawlee.dev/blog/common-problems-in-web-scraping). I especially recommend paying attention to [`curl_cffi`](https://curl-cffi.readthedocs.io/en/latest/) and the frameworks [`botasaurus`](https://www.omkar.cloud/botasaurus/) and [`Crawlee for Python`](https://www.crawlee.dev/python/).

## 12. Help open-source libraries[​](#12-help-open-source-libraries "Direct link to 12. Help open-source libraries")

Personally, I only recently came to realize the importance of this. All the tools I use for my work are either open-source developments or based on open-source. Web scraping literally lives thanks to open-source, and this is especially noticeable if you're a `Python` developer who has realized that pure `Python` is in a pretty sad state when you need to deal with `TLS fingerprint` - and again, open-source saved us here. It seems to me that the least we could do is invest a little of our knowledge and skills in supporting open-source.

I chose to support [`Crawlee for Python`](https://www.crawlee.dev/python/) - and no, not because they allowed me to write in their blog, but because it shows excellent development dynamics and is aimed at making life easier for web crawler developers. It allows for faster crawler development by taking care of and hiding under the hood such critical aspects as session management, session rotation when blocked, managing concurrency of asynchronous tasks (if you write asynchronous code, you know what a pain this can be), and much more.

tip

If you like the blog so far, please consider [giving Crawlee a star on GitHub](https://github.com/apify/crawlee); it helps us to reach and help more developers.

And what choice will you make?

## Conclusion[​](#conclusion "Direct link to Conclusion")

I think some things in the article were obvious to you, and some things you already follow yourself, but I hope you learned something new too. If most of them were new, then try using these rules as a checklist in your next project.

I would be happy to discuss the article. Feel free to comment here, in the article, or contact me in the [Crawlee developer community](https://apify.com/discord) on Discord. You can also find me on the following platforms: [Github](https://github.com/Mantisus), [Linkedin](https://www.linkedin.com/in/max-bohomolov/), [Apify](https://apify.com/mantisus), [Upwork](https://www.upwork.com/freelancers/mantisus), [Contra](https://contra.com/mantisus).

Thank you for your attention :)

**Tags:**

* [community](https://crawlee.dev/blog/tags/community.md)
---

# Build reliable web scrapers. Fast.

Crawlee is a web scraping library for JavaScript and Python. It handles blocking, crawling, proxies, and browsers for you.

[Get started](https://crawlee.dev/js/docs/quick-start.md) [Star](https://github.com/apify/crawlee)

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        await pushData({ title, url: request.loadedUrl });

        await enqueueLinks();
    },

    // Uncomment this option to see the browser window.
    // headless: false,
});

await crawler.run(['https://crawlee.dev']);
```
Or start with a template from our CLI:

`$ npx crawlee create my-crawler`

Built with 🤍 by Apify. Forever free and open-source.

## What are the benefits?

### Unblock websites by default

Crawlee crawls stealthily with zero configuration, but you can customize its behavior to overcome any protection. Real-world fingerprints included. [Learn more](https://crawlee.dev/js/docs/guides/avoid-blocking.md)

```
{
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            browsers: ['chrome', 'firefox'],
            devices: ['mobile'],
            locales: ['en-US'],
        },
    },
},
```

### Work with your favorite tools

Crawlee integrates BeautifulSoup, Cheerio, Puppeteer, Playwright, and other popular open-source tools. No need to learn new syntax. [Learn more](https://crawlee.dev/js/docs/quick-start.md#choose-your-crawler)

### One API for headless and HTTP

Switch between HTTP and headless without big rewrites thanks to a shared API. Or even let the Adaptive crawler decide if JS rendering is needed. [Learn more](https://crawlee.dev/js/api/core.md)

```
const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ querySelector, enqueueLinks }) {
        // The crawler detects if JS rendering is needed
        // to extract this data. If not, it will use HTTP
        // for follow-up requests to save time and costs.
        const $prices = await querySelector('span.price');
        await enqueueLinks();
    },
});
```

## What else is in Crawlee?

### [Auto scaling](https://crawlee.dev/js/docs/guides/scaling-crawlers.md)

Crawlers automatically adjust concurrency based on available system resources. Avoid memory errors in small containers and run faster in large ones.

### [Smart proxy rotation](https://crawlee.dev/js/docs/guides/proxy-management.md)

Crawlee uses a pool of sessions represented by different proxies to maintain the proxy performance and keep IPs healthy.
Blocked proxies are removed from the pool automatically.

### [Queue and storage](https://crawlee.dev/js/docs/guides/request-storage.md)

Pause and resume crawlers thanks to a persistent queue of URLs and storage for structured data.

### [Handy scraping utils](https://crawlee.dev/js/api/utils.md)

Sitemaps, infinite scroll, contact extraction, large asset blocking, and many more utils included.

### [Routing & middleware](https://crawlee.dev/js/api/core/class/Router.md)

Keep your code clean and organized while managing complex crawls with a built-in router that streamlines the process.

## Deploy to cloud

Crawlee, by Apify, works anywhere, but Apify offers the best experience. Easily turn your project into an [Actor](https://apify.com/actors) - a serverless micro-app with built-in infra, proxies, and storage. [Deploy to Apify](https://crawlee.dev/js/docs/deployment/apify-platform.md)

1. Install Apify SDK and Apify CLI.
2. Add `Actor.init()` to the beginning and `Actor.exit()` to the end of your code.
3. Use the Apify CLI to push the code to the Apify platform.

## Crawlee helps you build scrapers faster

### Zero setup required

Copy a code example, install Crawlee, and go. No CLI required, no complex file structure, no boilerplate. [Get started](https://crawlee.dev/js/docs/quick-start.md)

### Reasonable defaults

Unblocking, proxy rotation, and other core features are already turned on. But they are also very configurable. [Learn more](https://crawlee.dev/js/docs/guides/configuration.md)

### Helpful community

Join our Discord community of over 10k developers and get fast answers to your web scraping questions. [Join Discord](https://discord.gg/jyEM2PRvMU)

## Get started now!

Crawlee won't fix broken selectors for you (yet), but it makes building and maintaining reliable crawlers faster and easier - so you can focus on what matters most.
[Get started](https://crawlee.dev/js/docs/quick-start.md)
---

# API

### Packages

* [@crawlee/core 3.15.0](https://crawlee.dev/js/api/core.md)
* [@crawlee/cheerio 3.15.0](https://crawlee.dev/js/api/cheerio-crawler.md)
* [@crawlee/playwright 3.15.0](https://crawlee.dev/js/api/playwright-crawler.md)
* [@crawlee/puppeteer 3.15.0](https://crawlee.dev/js/api/puppeteer-crawler.md)
* [@crawlee/jsdom 3.15.0](https://crawlee.dev/js/api/jsdom-crawler.md)
* [@crawlee/linkedom 3.15.0](https://crawlee.dev/js/api/linkedom-crawler.md)
* [@crawlee/basic 3.15.0](https://crawlee.dev/js/api/basic-crawler.md)
* [@crawlee/http 3.15.0](https://crawlee.dev/js/api/http-crawler.md)
* [@crawlee/browser 3.15.0](https://crawlee.dev/js/api/browser-crawler.md)
* [@crawlee/memory-storage 3.15.0](https://crawlee.dev/js/api/memory-storage.md)
* [@crawlee/browser-pool 3.15.0](https://crawlee.dev/js/api/browser-pool.md)
* [@crawlee/utils 3.15.0](https://crawlee.dev/js/api/utils.md)
* [@crawlee/types 3.15.0](https://crawlee.dev/js/api/types.md)

---

# @crawlee/basic

Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.

`BasicCrawler` is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If we want a crawler that already facilitates this functionality, we should consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md).

`BasicCrawler` invokes the user-provided [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object, which represents a single URL to crawl.
The [Request](https://crawlee.dev/js/api/core/class/Request.md) objects are fed from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [`requestList`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestList) or [`requestQueue`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestQueue) constructor options, respectively. If neither `requestList` nor `requestQueue` options are provided, the crawler will open the default request queue either when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called, or if `requests` parameter (representing the initial requests) of the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function is provided. If both [`requestList`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestList) and [`requestQueue`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestQueue) options are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes if there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#autoscaledPoolOptions) parameter of the `BasicCrawler` constructor. For user convenience, the [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) and [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) options of the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor are available directly in the `BasicCrawler` constructor. 
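For example, a minimal sketch of those concurrency shortcuts (the values and handler body are illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    // Shortcuts for the underlying AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```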
## Example usage[​](#example-usage "Direct link to Example usage") ``` import { BasicCrawler, Dataset } from 'crawlee'; // Create a crawler instance const crawler = new BasicCrawler({ async requestHandler({ request, sendRequest }) { // 'request' contains an instance of the Request class // Here we simply fetch the HTML of the page and store it to a dataset const { body } = await sendRequest({ url: request.url, method: request.method, body: request.payload, headers: request.headers, }); await Dataset.pushData({ url: request.url, html: body, }) }, }); // Enqueue the initial requests and run the crawler await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ## Index[**](#Index) ### Crawlers * [**BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/basic-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/basic-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/basic-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/basic-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/basic-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/basic-crawler.md#BaseHttpResponseData) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/basic-crawler.md#BLOCKED_STATUS_CODES) * [**checkStorageAccess](https://crawlee.dev/js/api/basic-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/basic-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/basic-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/basic-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/basic-crawler.md#Cookie) * [**CrawlingContext](https://crawlee.dev/js/api/basic-crawler.md#CrawlingContext) * [**CreateSession](https://crawlee.dev/js/api/basic-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/basic-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/basic-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/basic-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/basic-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/basic-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/basic-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/basic-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/basic-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/basic-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/basic-crawler.md#ErrnoException) * [**ErrorSnapshotter](https://crawlee.dev/js/api/basic-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/basic-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/basic-crawler.md#ErrorTrackerOptions) * 
[**EventManager](https://crawlee.dev/js/api/basic-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/basic-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/basic-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/basic-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/basic-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/basic-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/basic-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/basic-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/basic-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/basic-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/basic-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/basic-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/basic-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/basic-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/basic-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/basic-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/basic-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/basic-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/basic-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/basic-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/basic-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/basic-crawler.md#log) * [**Log](https://crawlee.dev/js/api/basic-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/basic-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/basic-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/basic-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/basic-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/basic-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/basic-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/basic-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/basic-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/basic-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/basic-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/basic-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/basic-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/basic-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/basic-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/basic-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/basic-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/basic-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/basic-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/basic-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/basic-crawler.md#QueueOperationInfo) * 
[**RecordOptions](https://crawlee.dev/js/api/basic-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/basic-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/basic-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/basic-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/basic-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/basic-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/basic-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/basic-crawler.md#Request) * [**RequestHandlerResult](https://crawlee.dev/js/api/basic-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/basic-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/basic-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/basic-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/basic-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/basic-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/basic-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/basic-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/basic-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/basic-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/basic-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/basic-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/basic-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/basic-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/basic-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/basic-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/basic-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/basic-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/basic-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/basic-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/basic-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/basic-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/basic-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/basic-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/basic-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/basic-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/basic-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/basic-crawler.md#SkippedRequestCallback) * 
[**SkippedRequestReason](https://crawlee.dev/js/api/basic-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/basic-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/basic-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/basic-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/basic-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/basic-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/basic-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/basic-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/basic-crawler.md#StatisticState) * [**StorageClient](https://crawlee.dev/js/api/basic-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/basic-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/basic-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/basic-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/basic-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/basic-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/basic-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/basic-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/basic-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/basic-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/basic-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/basic-crawler.md#withCheckedStorageAccess) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) * [**BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) * [**CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) * [**CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) * [**CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) * [**ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) * [**RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) * [**StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports 
[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### 
[**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### 
[**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports 
[KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### 
[**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### 
[**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### 
[**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### 
[**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### 
[**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler **ErrorHandler\: (inputs, error) => Awaitable\ #### Type parameters * **Context**: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) = LoadedContext<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) & [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md)> #### Type declaration * * **(inputs, error): 
Awaitable\<void> - #### Parameters * ##### inputs: LoadedContext\<Context> * ##### error: Error #### Returns Awaitable\<void> ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler **RequestHandler\: (inputs) => Awaitable\<void> #### Type parameters * **Context**: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) = LoadedContext<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) & [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md)> #### Type declaration * * **(inputs): Awaitable\<void> - #### Parameters * ##### inputs: LoadedContext\<Context> #### Returns Awaitable\<void> ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback **StatusMessageCallback\: (params) => Awaitable\<void> #### Type parameters * **Context**: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) = [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) * **Crawler**: [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\<any> = [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\<Context> #### Type declaration * * **(params): Awaitable\<void> - #### Parameters * ##### params: [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md)\<Context, Crawler> #### Returns Awaitable\<void> ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)constBASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS **BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS: 10 = 10 Additional number of seconds used in [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) and [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) to set a reasonable [`requestHandlerTimeoutSecs`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandlerTimeoutSecs) for [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) that would not impair functionality (i.e. not time out before the crawlers themselves).
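A minimal sketch of how these handler types are typically wired into a `BasicCrawler` constructor. The timeout value and log messages are illustrative, and the typed imports assume the `ErrorHandler` and `RequestHandler` aliases exported by `@crawlee/basic` (documented above) are also re-exported by the `crawlee` metapackage:

```
import { BasicCrawler, type ErrorHandler, type RequestHandler } from 'crawlee';

// A request handler typed with the exported RequestHandler alias.
const requestHandler: RequestHandler = async ({ request, log }) => {
    log.info(`Handling ${request.url}`);
};

// An error handler invoked before each retry of a failed request.
const errorHandler: ErrorHandler = async ({ request, log }, error) => {
    log.warning(`Retrying ${request.url} after error: ${error.message}`);
};

const crawler = new BasicCrawler({
    requestHandler,
    errorHandler,
    // Give the handler up to 30 seconds; higher-level crawlers such as
    // CheerioCrawler add BASIC_CRAWLER_TIMEOUT_BUFFER_SECS on top of this
    // when deriving their own internal timeouts.
    requestHandlerTimeoutSecs: 30,
});

await crawler.run(['https://www.example.com/']);
```

---

# Changelog

All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.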
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * await retries inside `_timeoutAndRetry` ([#3206](https://github.com/apify/crawlee/issues/3206)) ([9c1cf6d](https://github.com/apify/crawlee/commit/9c1cf6d68acd356af8b7dbd682141357d789e3fb)), closes [/github.com/apify/crawlee/pull/3188#discussion\_r2410256271](https://github.com//github.com/apify/crawlee/pull/3188/issues/discussion_r2410256271) * use shared enqueue links wrapper in `AdaptivePlaywrightCrawler` ([#3188](https://github.com/apify/crawlee/issues/3188)) ([9569d19](https://github.com/apify/crawlee/commit/9569d191933325d93f6c66754274b63fd272fc59)) ### Features[​](#features "Direct link to Features") * support custom `userAgent` with `respectRobotsTxtFile` ([#3226](https://github.com/apify/crawlee/issues/3226)) ([354252d](https://github.com/apify/crawlee/commit/354252dee44c5ea618a12e087acb24b9e0f555c7)), closes [#3222](https://github.com/apify/crawlee/issues/3222) ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") ### Features[​](#features-1 "Direct link to Features") * export cheerio types in all crawler packages ([#3204](https://github.com/apify/crawlee/issues/3204)) ([f05790b](https://github.com/apify/crawlee/commit/f05790b8c4e77056fd3cdbdd6d6abe3186ddf104)) ### Performance Improvements[​](#performance-improvements "Direct link to Performance Improvements") * don't await `crawler.setStatusMessage` ([#3207](https://github.com/apify/crawlee/issues/3207)) ([1a67ffb](https://github.com/apify/crawlee/commit/1a67ffbf22e0ecf034d30a2215c4bd0f0ecbf41e)) ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * use correct config for storage classes to avoid memory leaks ([#3144](https://github.com/apify/crawlee/issues/3144)) ([911a2eb](https://github.com/apify/crawlee/commit/911a2eb45cdb5e3fc0e6a96471af86b43bc828bf)) # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * don't fail `exportData` calls on empty datasets ([#3115](https://github.com/apify/crawlee/issues/3115)) ([298f170](https://github.com/apify/crawlee/commit/298f170ef032f76d5b252e2a08971bfd161a7ef5)), closes [#2734](https://github.com/apify/crawlee/issues/2734) * respect `maxCrawlDepth` with a custom enqueueLinks `transformRequestFunction` ([#3159](https://github.com/apify/crawlee/issues/3159)) ([e2ecb74](https://github.com/apify/crawlee/commit/e2ecb745da6105d8d083b30b8b68197e53b1cf84)) ### Features[​](#features-2 "Direct link to Features") * add `collectAllKeys` option for `BasicCrawler.exportData` ([#3129](https://github.com/apify/crawlee/issues/3129)) ([2ddfc9c](https://github.com/apify/crawlee/commit/2ddfc9c6108207d3289ee92fe3c5b646611cc508)), closes [#3007](https://github.com/apify/crawlee/issues/3007) * add `TandemRequestProvider` for combined `RequestList` and `RequestQueue` usage ([#2914](https://github.com/apify/crawlee/issues/2914)) ([4ca450f](https://github.com/apify/crawlee/commit/4ca450f08b9fb69ae3b2ba3fc66361f14631b15b)), closes [#2499](https://github.com/apify/crawlee/issues/2499) ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 
3141-2025-08-05") **Note:** Version bump only for package @crawlee/basic # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * validation of iterables when adding requests to the queue ([#3091](https://github.com/apify/crawlee/issues/3091)) ([529a1dd](https://github.com/apify/crawlee/commit/529a1dd57278efef4fb2013e79a09fd1bc8594a5)), closes [#3063](https://github.com/apify/crawlee/issues/3063) ### Features[​](#features-3 "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/basic ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Features[​](#features-4 "Direct link to Features") * Accept (Async)Iterables in `addRequests` methods ([#3013](https://github.com/apify/crawlee/issues/3013)) ([a4ab748](https://github.com/apify/crawlee/commit/a4ab74852c3c60bdbc96035f54b16d125220f699)), closes [#2980](https://github.com/apify/crawlee/issues/2980) * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/basic ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/basic ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/basic ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * Optimize request unlocking to get rid of unnecessary unlock calls ([#2963](https://github.com/apify/crawlee/issues/2963)) ([a433037](https://github.com/apify/crawlee/commit/a433037f307ed3490a1ef5df334f1f9a9044510d)) ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * respect `autoscaledPoolOptions.isTaskReadyFunction` option ([#2948](https://github.com/apify/crawlee/issues/2948)) 
([fe2d206](https://github.com/apify/crawlee/commit/fe2d206b46afabb18c83e8af11fa03f085f4cd4e)), closes [#2922](https://github.com/apify/crawlee/issues/2922) * **statistics:** track actual request.retryCount in Statistics ([#2940](https://github.com/apify/crawlee/issues/2940)) ([c9f7f54](https://github.com/apify/crawlee/commit/c9f7f5494ac4895a30b283a5defe382db0cdea26)) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-5 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-6 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * Simplified RequestQueueV2 implementation ([#2775](https://github.com/apify/crawlee/issues/2775)) ([d1a094a](https://github.com/apify/crawlee/commit/d1a094a47eaecbf367b222f9b8c14d7da5d3e03a)), closes [#2767](https://github.com/apify/crawlee/issues/2767) [#2700](https://github.com/apify/crawlee/issues/2700) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * destructure `CrawlerRunOptions` before passing them to `addRequests` ([#2803](https://github.com/apify/crawlee/issues/2803)) ([02a598c](https://github.com/apify/crawlee/commit/02a598c2a501957f04ca3a2362bcee289ef861c0)), closes [#2802](https://github.com/apify/crawlee/issues/2802) * graceful `BasicCrawler` tidy-up on `CriticalError` ([#2817](https://github.com/apify/crawlee/issues/2817)) ([53331e8](https://github.com/apify/crawlee/commit/53331e82ee66274316add7cadb4afec1ce2d4bcf)), closes [#2807](https://github.com/apify/crawlee/issues/2807) ### Features[​](#features-7 "Direct link to Features") * stopping the crawlers gracefully with `BasicCrawler.stop()` ([#2792](https://github.com/apify/crawlee/issues/2792)) ([af2966f](https://github.com/apify/crawlee/commit/af2966f65caeaf4273fd0a8ab583a7857e4330ab)), closes [#2777](https://github.com/apify/crawlee/issues/2777) ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * log status message timeouts to debug level ([55ee44a](https://github.com/apify/crawlee/commit/55ee44aaf5e73c2a9d96d973a4aae111ab2e0025)) # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Features[​](#features-8 "Direct link to Features") * allow using other HTTP clients 
([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * check `.isFinished()` before `RequestList` reads ([#2695](https://github.com/apify/crawlee/issues/2695)) ([6fa170f](https://github.com/apify/crawlee/commit/6fa170fbe16c326307b8a58c09c07f64afb64bb2)) * **core:** trigger `errorHandler` for session errors ([#2683](https://github.com/apify/crawlee/issues/2683)) ([7d72bcb](https://github.com/apify/crawlee/commit/7d72bcb36f32933c6251382e5efd28a284e9267d)), closes [#2678](https://github.com/apify/crawlee/issues/2678) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/basic ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/basic ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * **RequestQueueV2:** remove `inProgress` cache, rely solely on locked states ([#2601](https://github.com/apify/crawlee/issues/2601)) ([57fcb08](https://github.com/apify/crawlee/commit/57fcb0804a9f1268039d1e2b246c515ceca7e405)) ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/basic # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-9 "Direct link to Features") * Sitemap-based request list implementation ([#2498](https://github.com/apify/crawlee/issues/2498)) ([7bf8f0b](https://github.com/apify/crawlee/commit/7bf8f0bcd4cc81e02c7cc60e82dfe7a0cdd80938)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * add missing `useState` implementation into crawling context ([eec4a71](https://github.com/apify/crawlee/commit/eec4a71769f1236ca0876a4a32288241b1b63db1)) * make `crawler.log` publicly accessible ([#2526](https://github.com/apify/crawlee/issues/2526)) ([3e9e665](https://github.com/apify/crawlee/commit/3e9e6652c0b5e4d0c2707985abbad7d80336b9af)) * respect `crawler.log` when creating child logger for `Statistics` ([0a0d75d](https://github.com/apify/crawlee/commit/0a0d75d40b5f78b329589535bbe3e0e84be76a7e)), closes [#2412](https://github.com/apify/crawlee/issues/2412) ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 
3103-2024-06-07") ### Features[​](#features-10 "Direct link to Features") * log desired concurrency in the default status message ([9f0b796](https://github.com/apify/crawlee/commit/9f0b79684d9e27e6ba29634e7da2e9a095367eda)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/basic ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/basic # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-15 "Direct link to Bug Fixes") * `EnqueueStrategy.All` erroring with links using unsupported protocols ([#2389](https://github.com/apify/crawlee/issues/2389)) ([8db3908](https://github.com/apify/crawlee/commit/8db39080b7711ba3c27dff7fce1170ddb0ee3d05)) * do not drop statistics on migration/resurrection/resume ([#2462](https://github.com/apify/crawlee/issues/2462)) ([8ce7dd4](https://github.com/apify/crawlee/commit/8ce7dd4ae6a3718dac95e784a53bd5661c827edc)) ### Features[​](#features-11 "Direct link to Features") * implement ErrorSnapshotter for error context capture ([#2332](https://github.com/apify/crawlee/issues/2332)) ([e861dfd](https://github.com/apify/crawlee/commit/e861dfdb451ae32fb1e0c7749c6b59744654b303)), closes [#2280](https://github.com/apify/crawlee/issues/2280) * make `RequestQueue` v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) ([#2390](https://github.com/apify/crawlee/issues/2390)) ([41ae8ab](https://github.com/apify/crawlee/commit/41ae8abec1da811ae0750ac2d298e77c1e3b7b55)), closes [#2388](https://github.com/apify/crawlee/issues/2388) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") ### Bug Fixes[​](#bug-fixes-16 "Direct link to Bug Fixes") * don't call `notify` in `addRequests()` ([#2425](https://github.com/apify/crawlee/issues/2425)) ([c4d5446](https://github.com/apify/crawlee/commit/c4d54469120648a592b6898f849154fda60e3d59)), closes [#2421](https://github.com/apify/crawlee/issues/2421) ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/basic # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Bug Fixes[​](#bug-fixes-17 "Direct link to Bug Fixes") * notify autoscaled pool about newly added requests ([#2400](https://github.com/apify/crawlee/issues/2400)) ([a90177d](https://github.com/apify/crawlee/commit/a90177d5207794be1d6e401d746dd4c6e5961976)) ### Features[​](#features-12 "Direct link to Features") * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/basic ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/basic # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Bug Fixes[​](#bug-fixes-18 
"Direct link to Bug Fixes") * declare missing dependencies on `csv-stringify` and `fs-extra` ([#2326](https://github.com/apify/crawlee/issues/2326)) ([718959d](https://github.com/apify/crawlee/commit/718959dbbe1fa69f948d0b778d0f54d9c493ab25)), closes [/github.com/redabacha/crawlee/blob/2f05ed22b203f688095300400bb0e6d03a03283c/.eslintrc.json#L50](https://github.com//github.com/redabacha/crawlee/blob/2f05ed22b203f688095300400bb0e6d03a03283c/.eslintrc.json/issues/L50) ### Features[​](#features-13 "Direct link to Features") * accessing crawler state, key-value store and named datasets via crawling context ([#2283](https://github.com/apify/crawlee/issues/2283)) ([58dd5fc](https://github.com/apify/crawlee/commit/58dd5fcc25f31bb066402c46e48a9e5e91efd5c5)) * adaptive playwright crawler ([#2316](https://github.com/apify/crawlee/issues/2316)) ([8e4218a](https://github.com/apify/crawlee/commit/8e4218ada03cf485751def46f8c465b2d2a825c7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/basic ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/basic ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/basic # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Features[​](#features-14 "Direct link to Features") * allow configuring crawler statistics ([#2213](https://github.com/apify/crawlee/issues/2213)) ([9fd60e4](https://github.com/apify/crawlee/commit/9fd60e4036dce720c71f2d169a8eccbc4c813a96)), closes [#1789](https://github.com/apify/crawlee/issues/1789) * check enqueue link strategy post redirect ([#2238](https://github.com/apify/crawlee/issues/2238)) ([3c5f9d6](https://github.com/apify/crawlee/commit/3c5f9d6056158e042e12d75b2b1b21ef6c32e618)), closes [#2173](https://github.com/apify/crawlee/issues/2173) * log cause with `retryOnBlocked` ([#2252](https://github.com/apify/crawlee/issues/2252)) ([e19a773](https://github.com/apify/crawlee/commit/e19a773693cfc5e65c1e2321bfc8b73c9844ea8b)), closes [#2249](https://github.com/apify/crawlee/issues/2249) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/basic ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/basic # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Features[​](#features-15 "Direct link to Features") * **core:** add `crawler.exportData()` helper ([#2166](https://github.com/apify/crawlee/issues/2166)) ([c8c09a5](https://github.com/apify/crawlee/commit/c8c09a54a712689969ff1f6bddf70f12a2a22670)) * got-scraping v4 ([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/basic ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) 
(2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") ### Bug Fixes[​](#bug-fixes-19 "Direct link to Bug Fixes") * add warning when we detect use of RL and RQ, but RQ is not provided explicitly ([#2115](https://github.com/apify/crawlee/issues/2115)) ([6fb1c55](https://github.com/apify/crawlee/commit/6fb1c5568a0bf3b6fa38045161866a32b13310ca)), closes [#1773](https://github.com/apify/crawlee/issues/1773) * ensure the status message cannot stuck the crawler ([#2114](https://github.com/apify/crawlee/issues/2114)) ([9034f08](https://github.com/apify/crawlee/commit/9034f08106f53a70205695076e874f04f632c5bb)) * RQ request count is consistent after migration ([#2116](https://github.com/apify/crawlee/issues/2116)) ([9ab8c18](https://github.com/apify/crawlee/commit/9ab8c1874f52acc3f0337fdabd36321d0fb40b86)), closes [#1855](https://github.com/apify/crawlee/issues/1855) [#1855](https://github.com/apify/crawlee/issues/1855) ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/basic ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-20 "Direct link to Bug Fixes") * session pool leaks memory on multiple crawler runs ([#2083](https://github.com/apify/crawlee/issues/2083)) ([b96582a](https://github.com/apify/crawlee/commit/b96582a200e25ec11124da1f7f84a2b16b64d133)), closes [#2074](https://github.com/apify/crawlee/issues/2074) [#2031](https://github.com/apify/crawlee/issues/2031) ### Features[​](#features-16 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") ### Features[​](#features-17 "Direct link to Features") * remove side effect from the deprecated error context augmentation ([#2069](https://github.com/apify/crawlee/issues/2069)) ([f9fb5c4](https://github.com/apify/crawlee/commit/f9fb5c42ecb14f8d0845a15982d204bd2b5b228f)) ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-21 "Direct link to Bug Fixes") * **browser-pool:** improve error handling when browser is not found ([#2050](https://github.com/apify/crawlee/issues/2050)) ([282527f](https://github.com/apify/crawlee/commit/282527f31bb366a4e52463212f652dcf6679b6c3)), closes [#1459](https://github.com/apify/crawlee/issues/1459) * clean up `inProgress` cache when delaying requests via `sameDomainDelaySecs` ([#2045](https://github.com/apify/crawlee/issues/2045)) ([f63ccc0](https://github.com/apify/crawlee/commit/f63ccc018c9e9046531287c47d11283a8e71a6ad)) * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) * respect current config when creating implicit `RequestQueue` instance ([845141d](https://github.com/apify/crawlee/commit/845141d921c10dd5fb121a499bb1b24f5eb3ff04)), closes [#2043](https://github.com/apify/crawlee/issues/2043) ### Features[​](#features-18 "Direct link to 
Features") * **core:** add default dataset helpers to `BasicCrawler` ([#2057](https://github.com/apify/crawlee/issues/2057)) ([e2a7544](https://github.com/apify/crawlee/commit/e2a7544ddf775db023ca25553d21cb73484fcd8c)) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/basic ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") ### Features[​](#features-19 "Direct link to Features") * exceeding maxSessionRotations calls failedRequestHandler ([#2029](https://github.com/apify/crawlee/issues/2029)) ([b1cb108](https://github.com/apify/crawlee/commit/b1cb108882ab28d956adfc3d77ba9813507823f6)), closes [#2028](https://github.com/apify/crawlee/issues/2028) # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-20 "Direct link to Features") * add support for `sameDomainDelay` ([#2003](https://github.com/apify/crawlee/issues/2003)) ([e796883](https://github.com/apify/crawlee/commit/e79688324790e5d07fc11192769cf051617e96e4)), closes [#1993](https://github.com/apify/crawlee/issues/1993) * **basic-crawler:** allow configuring the automatic status message ([#2001](https://github.com/apify/crawlee/issues/2001)) ([3eb4e4c](https://github.com/apify/crawlee/commit/3eb4e4c558b4bc0673fbff75b1db19c46004a1da)) * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Bug Fixes[​](#bug-fixes-22 "Direct link to Bug Fixes") * **basic-crawler:** limit `internalTimeoutMillis` in addition to `requestHandlerTimeoutMillis` ([#1981](https://github.com/apify/crawlee/issues/1981)) ([8122622](https://github.com/apify/crawlee/commit/8122622c3054a0e0e0c1869ba462276cbead8090)), closes [#1766](https://github.com/apify/crawlee/issues/1766) ### Features[​](#features-21 "Direct link to Features") * **core:** add `RequestQueue.addRequestsBatched()` that is non-blocking ([#1996](https://github.com/apify/crawlee/issues/1996)) ([c85485d](https://github.com/apify/crawlee/commit/c85485d6ca2bb61cfebb24a2ad99e0b3ba5c069b)), closes [#1995](https://github.com/apify/crawlee/issues/1995) * retryOnBlocked detects blocked webpage ([#1956](https://github.com/apify/crawlee/issues/1956)) ([766fa9b](https://github.com/apify/crawlee/commit/766fa9b88029e9243a7427075384c1abe85c70c8)) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/basic # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/basic ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") ### Bug Fixes[​](#bug-fixes-23 "Direct link to Bug Fixes") * set status message every 5 seconds and log it via debug level ([#1918](https://github.com/apify/crawlee/issues/1918)) ([32aede6](https://github.com/apify/crawlee/commit/32aede6bbaa25b402e6e9cee9d3aa44722b1cfd0)) ### Features[​](#features-22 "Direct link to Features") * **core:** add 
`Request.maxRetries` to allow overriding the `maxRequestRetries` ([#1925](https://github.com/apify/crawlee/issues/1925)) ([c5592db](https://github.com/apify/crawlee/commit/c5592db0f8094de27c46ad993bea2c1ab1f61385)) ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Bug Fixes[​](#bug-fixes-24 "Direct link to Bug Fixes") * respect config object when creating `SessionPool` ([#1881](https://github.com/apify/crawlee/issues/1881)) ([db069df](https://github.com/apify/crawlee/commit/db069df80bc183c6b861c9ac82f1e278e57ea92b)) ### Features[​](#features-23 "Direct link to Features") * allow running single crawler instance multiple times ([#1844](https://github.com/apify/crawlee/issues/1844)) ([9e6eb1e](https://github.com/apify/crawlee/commit/9e6eb1e32f582a8837311aac12cc1d657432f3fa)), closes [#765](https://github.com/apify/crawlee/issues/765) * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-25 "Direct link to Bug Fixes") * start status message logger after the crawl actually starts ([5d1df7a](https://github.com/apify/crawlee/commit/5d1df7aae00d0d6ca29338723f92b77cff667354)) * status message - total requests ([#1842](https://github.com/apify/crawlee/issues/1842)) ([710f734](https://github.com/apify/crawlee/commit/710f7347623619057e99abf539f0ccf78de41bbc)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Features[​](#features-24 "Direct link to Features") * add basic support for `setStatusMessage` ([#1790](https://github.com/apify/crawlee/issues/1790)) ([c318980](https://github.com/apify/crawlee/commit/c318980ec11d211b1a5c9e6bdbe76198c5d895be)) * move the status message implementation to Crawlee, noop in storage ([#1808](https://github.com/apify/crawlee/issues/1808)) ([99c3fdc](https://github.com/apify/crawlee/commit/99c3fdc18030b7898e6b6d149d6d94fab7881f09)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/basic ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/basic # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-26 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") ### Bug Fixes[​](#bug-fixes-27 "Direct link to Bug Fixes") * session.markBad() on requestHandler error ([#1709](https://github.com/apify/crawlee/issues/1709)) ([e87eb1f](https://github.com/apify/crawlee/commit/e87eb1f2ccd9585f8d53cb03ec671cedf23a06b4)), closes [#1635](https://github.com/apify/crawlee/issues/1635) 
[/github.com/apify/crawlee/blob/5ff04faa85c3a6b6f02cd58a91b46b80610d8ae6/packages/browser-crawler/src/internals/browser-crawler.ts#L524](https://github.com//github.com/apify/crawlee/blob/5ff04faa85c3a6b6f02cd58a91b46b80610d8ae6/packages/browser-crawler/src/internals/browser-crawler.ts/issues/L524) ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") ### Bug Fixes[​](#bug-fixes-28 "Direct link to Bug Fixes") * remove memory leaks from migration event handling ([#1679](https://github.com/apify/crawlee/issues/1679)) ([49bba25](https://github.com/apify/crawlee/commit/49bba252ebc348b61eac3895155361f7d394db36)), closes [#1670](https://github.com/apify/crawlee/issues/1670) ### Features[​](#features-25 "Direct link to Features") * always show error origin if inside the userland ([#1677](https://github.com/apify/crawlee/issues/1677)) ([bbe9045](https://github.com/apify/crawlee/commit/bbe9045d550f95138d570522f6f469eae2d146d0)) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/basic ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/basic # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/basic ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/basic --- # BasicCrawler \ Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. `BasicCrawler` is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If we want a crawler that already facilitates this functionality, we should consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). `BasicCrawler` invokes the user-provided [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object, which represents a single URL to crawl. The [Request](https://crawlee.dev/js/api/core/class/Request.md) objects are fed from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [`requestList`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestList) or [`requestQueue`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestQueue) constructor options, respectively. If neither `requestList` nor `requestQueue` options are provided, the crawler will open the default request queue either when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called, or if `requests` parameter (representing the initial requests) of the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function is provided. 
If both [`requestList`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestList) and [`requestQueue`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestQueue) options are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes if there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#autoscaledPoolOptions) parameter of the `BasicCrawler` constructor. For user convenience, the [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) and [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) options of the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor are available directly in the `BasicCrawler` constructor. **Example usage:** ``` import { BasicCrawler, Dataset } from 'crawlee'; // Create a crawler instance const crawler = new BasicCrawler({ async requestHandler({ request, sendRequest }) { // 'request' contains an instance of the Request class // Here we simply fetch the HTML of the page and store it to a dataset const { body } = await sendRequest({ url: request.url, method: request.method, body: request.payload, headers: request.headers, }); await Dataset.pushData({ url: request.url, html: body, }) }, }); // Enqueue the initial requests and run the crawler await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ### Hierarchy * *BasicCrawler* * [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L615)constructor * ****new BasicCrawler**\(options, config): [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\ - All `BasicCrawler` parameters are passed via an options object. 
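As a rough sketch, the optional second argument lets us pass an explicit [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) instead of relying on the global one; the `persistStorage` option used below is only illustrative:

```
import { BasicCrawler, Configuration } from 'crawlee';

// Illustrative configuration: do not persist storage between runs.
const config = new Configuration({ persistStorage: false });

const crawler = new BasicCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
}, config);

await crawler.run(['https://example.com']);
```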
*** #### Parameters * ##### options: [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md)\ = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\ ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L617)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)hasFinishedBefore **hasFinishedBefore: boolean = false ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlylog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\> = ... Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). 
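For example, a minimal sketch of registering handlers on the default router instead of passing a `requestHandler` (the label and URLs are illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler();

// Handle requests that were enqueued with the label 'DETAIL'.
crawler.router.addHandler('DETAIL', async ({ request, log }) => {
    log.info(`Detail page: ${request.url}`);
});

// Fallback for requests without a matching label.
crawler.router.addDefaultHandler(async ({ request, log }) => {
    log.info(`Processing ${request.url}`);
});

await crawler.run(['https://example.com']);
```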
### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)running **running: boolean = false ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlystats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)addRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)exportData * ****exportData**\(path, format, options): Promise\ - Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)getData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). 
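A small sketch of reading the collected records back, assuming data was stored earlier via `pushData` (the URL is illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ request, pushData }) {
        // Store one record per crawled URL in the default dataset.
        await pushData({ url: request.url });
    },
});

await crawler.run(['https://example.com']);

// Read everything back from the default dataset.
const { items } = await crawler.getData();
console.log(`The dataset holds ${items.length} records`);
```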
*** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)getDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)getRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)pushData * ****pushData**(data, datasetIdOrName): Promise\ - Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)run * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)setStatusMessage * ****setStatusMessage**(message, options): Promise\ - This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)stop * ****stop**(message): void - Gracefully stops the current run of the crawler. 
All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)useState * ****useState**\(defaultValue): Promise\ - #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # createBasicRouter ### Callable * ****createBasicRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) of our [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md). Defaults to the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. ``` import { BasicCrawler, createBasicRouter } from 'crawlee'; const router = createBasicRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new BasicCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # BasicCrawlerOptions \ ### Hierarchy * *BasicCrawlerOptions* * [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**httpClient](#httpClient) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**onSkippedRequest](#onSkippedRequest) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. 
> *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalerrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)\ User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. The second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalfailedRequestHandler **failedRequestHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)\ A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. The second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md). ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalkeepAlive **keepAlive? : boolean Allows keeping the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty.
With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalmaxConcurrency **maxConcurrency? : number Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalmaxCrawlDepth **maxCrawlDepth? : number Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalmaxRequestRetries **maxRequestRetries? : number = 3 Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalmaxRequestsPerMinute **maxRequestsPerMinute? : number The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalmaxSessionRotations **maxSessionRotations? : number = 10 Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalminConcurrency **minConcurrency? 
: number Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalrequestHandler **requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)\> User-provided function that performs the logic of the crawler. It is called for each URL to crawl. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. 
> Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables, for instance, configuring the crawler to use `RequestManagerTandem`. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalrespectRobotsTxtFile **respectRobotsTxtFile? : boolean If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip URLs that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalretryOnBlocked **retryOnBlocked? : boolean If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Customizes the way statistics collection works, such as the logging interval or whether to output them to the Key-Value store.
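To illustrate how several of the options above fit together, here is a sketch of one possible configuration (the limits and URL are arbitrary):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    // Stop after 500 processed requests to guard against runaway crawls.
    maxRequestsPerCrawl: 500,
    // Retry failed requests up to 5 times before failedRequestHandler is called.
    maxRequestRetries: 5,
    // Never process more than 20 requests in parallel.
    maxConcurrency: 20,
    // Fetch robots.txt for each domain and skip disallowed URLs.
    respectRobotsTxtFile: true,
    // Wait 2 seconds between requests to the same domain.
    sameDomainDelaySecs: 2,
    async requestHandler({ request, log }) {
        log.info(`Crawling ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
```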
### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.

```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```

### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionaluseSessionPool **useSessionPool? : boolean The basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # BasicCrawlingContext \ ### Hierarchy * [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)<[BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md), UserData> * *BasicCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**pushData](#pushData) * [**sendRequest](#sendRequest) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from CrawlingContext.addRequests Add requests directly to the request queue.
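For example, a minimal sketch of enqueueing follow-up URLs from inside the request handler (the URLs and label are illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ request, addRequests, log }) {
        log.info(`Processing ${request.url}`);
        // Enqueue follow-up requests straight from the handler.
        await addRequests([
            'https://example.com/page-2',
            { url: 'https://example.com/detail', label: 'DETAIL' },
        ]);
    },
});

await crawler.run(['https://example.com/page-1']);
```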
*** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\> Inherited from CrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from CrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from CrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from CrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from CrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from CrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from CrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from CrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. 
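A small sketch of the intended usage, assuming a simple counter-like state object (the shape is illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ request, useState }) {
        // The same object is shared across handler runs and persisted with the crawler state.
        const state = await useState({ processed: 0 });
        state.processed += 1;
        console.log(`${request.url} is request number ${state.processed}`);
    },
});

await crawler.run(['https://example.com']);
```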
*** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L96)enqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Overrides CrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ urls: [...], }); }, ``` *** #### Parameters * ##### optionaloptions: { baseUrl?: string; exclude?: readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[]; forefront?: boolean; globs?: readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[]; label?: string; limit?: number; onSkippedRequest?: [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback); pseudoUrls?: readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[]; regexps?: readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[]; requestQueue?: [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md); robotsTxtFile?: Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed>; selector?: string; skipNavigation?: boolean; strategy?: [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin; transformRequestFunction?: [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md); urls: readonly string\[]; userData?: Dictionary; waitForAllRequestsToBeAdded?: boolean } All `enqueueLinks()` parameters are passed via an options object. * ##### optionalbaseUrl: string A base URL that will be used to resolve relative URLs when using Cheerio. Ignored when using Puppeteer, since the relative URL resolution is done inside the browser automatically. * ##### optionalexclude: readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. * ##### optionalforefront: boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). 
By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. * ##### optionalglobs: readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use the `regexps` property directly. If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the function enqueues the links with the same subdomain. * ##### optionallabel: string Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this option. * ##### optionallimit: number Limits the number of actually enqueued URLs to this value. Useful for testing across the entire crawling scope. * ##### optionalonSkippedRequest: [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped because of the robots.txt file, because they don't match the enqueueLinks filters, or because the maxRequestsPerCrawl limit has been reached. * ##### optionalpseudoUrls: readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] *NOTE:* In future versions of the SDK, this option will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use the `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues the links with the same subdomain. * **@deprecated** prefer using `globs` or `regexps` instead * ##### optionalregexps: readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the function enqueues the links with the same subdomain. * ##### optionalrequestQueue: [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) A request queue to which the URLs will be enqueued.
* ##### optionalrobotsTxtFile: Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed> RobotsTxtFile instance for the current request that triggered the `enqueueLinks`. If provided, disallowed URLs will be ignored. * ##### optionalselector: string A CSS selector matching links to be enqueued. * ##### optionalskipNavigation: boolean = false If set to `true`, tells the crawler to skip navigation and process the request directly. * ##### optionalstrategy: [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin The strategy to use when enqueueing the URLs. Depending on the strategy you select, we will only check certain parts of the URLs found. Here is a diagram of each URL part and its name:

```
Protocol          Domain
┌────┐          ┌─────────┐
https://example.crawlee.dev/...
│       └─────────────────┤
│             Hostname    │
│                         │
└─────────────────────────┘
         Origin
```

* ##### optionaltransformRequestFunction: [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `keepUrlFragment: true` to the `request` object, URL fragments will not be removed when `uniqueKey` is computed. **Example:**

```
{
    transformRequestFunction: (request) => {
        request.userData.foo = 'bar';
        request.keepUrlFragment = true;
        return request;
    }
}
```

Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this function, so some request options returned by `transformRequestFunction` may be overwritten by those pattern-based options. * ##### urls: readonly string\[] An array of URLs to enqueue. * ##### optionaluserData: Dictionary Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. * ##### optionalwaitForAllRequestsToBeAdded: boolean By default, only the first batch (1000) of found requests will be added to the queue before resolving the call. You can use this option to wait until all of them have been added. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to a [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from CrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset.
* ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from CrawlingContext.sendRequest Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that.

```
async requestHandler({ sendRequest }) {
    const { body } = await sendRequest({
        // override headers only
        headers: { ... },
    });
},
```

*** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> --- # CrawlerAddRequestsOptions ### Hierarchy * [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) * *CrawlerAddRequestsOptions* * [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ## Index[**](#Index) ### Properties * [**batchSize](#batchSize) * [**forefront](#forefront) * [**waitBetweenBatchesMillis](#waitBetweenBatchesMillis) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#batchSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L975)optionalinheritedbatchSize **batchSize? : number = 1000 Inherited from AddRequestsBatchedOptions.batchSize ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from AddRequestsBatchedOptions.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#waitBetweenBatchesMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L980)optionalinheritedwaitBetweenBatchesMillis **waitBetweenBatchesMillis? : number = 1000 Inherited from AddRequestsBatchedOptions.waitBetweenBatchesMillis ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L970)optionalinheritedwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean = false Inherited from AddRequestsBatchedOptions.waitForAllRequestsToBeAdded Whether to wait for all the provided requests to be added, instead of waiting just for the initial batch of up to `batchSize`.
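For illustration, a short sketch of passing these options to [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests); the `requests` array and the concrete values are placeholders:

```
// Assuming `requests` is an array of URLs or request objects.
await crawler.addRequests(requests, {
    batchSize: 500,                    // add requests in batches of 500 instead of the default 1000
    waitBetweenBatchesMillis: 2000,    // pause 2 seconds between batches
    waitForAllRequestsToBeAdded: true, // resolve only after every request has been added
});
```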
--- # CrawlerAddRequestsResult ### Hierarchy * [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) * *CrawlerAddRequestsResult* ## Index[**](#Index) ### Properties * [**addedRequests](#addedRequests) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#addedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L984)inheritedaddedRequests **addedRequests: [ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[] Inherited from AddRequestsBatchedResult.addedRequests ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L1001)inheritedwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded: Promise<[ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[]> Inherited from AddRequestsBatchedResult.waitForAllRequestsToBeAdded A promise which will resolve with the rest of the requests that were added to the queue. Alternatively, we can set [`waitForAllRequestsToBeAdded`](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md#waitForAllRequestsToBeAdded) to `true` in the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) options. **Example:** ``` // Assuming `requests` is a list of requests. const result = await crawler.addRequests(requests); // If we want to wait for the rest of the requests to be added to the queue: await result.waitForAllRequestsToBeAdded; ``` --- # CrawlerExperiments A set of options that you can toggle to enable experimental features in Crawlee. NOTE: These options will not respect semantic versioning and may be removed or changed at any time. Use at your own risk. If you do use these and encounter issues, please report them to us. ## Index[**](#Index) ### Properties * [**requestLocking](#requestLocking) ## Properties[**](#Properties) ### [**](#requestLocking)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L418)optionalrequestLocking **requestLocking? : boolean * **@deprecated** This experiment is now enabled by default, and this flag will be removed in a future release. If you encounter issues due to this change, please: * report it to us: * set `requestLocking` to `false` in the `experiments` option of the crawler --- # CrawlerRunOptions ### Hierarchy * [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) * *CrawlerRunOptions* ## Index[**](#Index) ### Properties * [**batchSize](#batchSize) * [**forefront](#forefront) * [**purgeRequestQueue](#purgeRequestQueue) * [**waitBetweenBatchesMillis](#waitBetweenBatchesMillis) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#batchSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L975)optionalinheritedbatchSize **batchSize? : number = 1000 Inherited from CrawlerAddRequestsOptions.batchSize ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from CrawlerAddRequestsOptions.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. 
* while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#purgeRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2045)optionalpurgeRequestQueue **purgeRequestQueue? : boolean = true Whether to purge the RequestQueue before running the crawler again. Defaults to true, so it is possible to reprocess failed requests. When disabled, only new requests will be considered. Note that even a failed request is considered as handled. ### [**](#waitBetweenBatchesMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L980)optionalinheritedwaitBetweenBatchesMillis **waitBetweenBatchesMillis? : number = 1000 Inherited from CrawlerAddRequestsOptions.waitBetweenBatchesMillis ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L970)optionalinheritedwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean = false Inherited from CrawlerAddRequestsOptions.waitForAllRequestsToBeAdded Whether to wait for all the provided requests to be added, instead of waiting just for the initial batch of up to `batchSize`. --- # CreateContextOptions ## Index[**](#Index) ### Properties * [**proxyInfo](#proxyInfo) * [**request](#request) * [**session](#session) ## Properties[**](#Properties) ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2032)optionalproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2030)request **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2031)optionalsession **session? 
: [Session](https://crawlee.dev/js/api/core/class/Session.md) --- # StatusMessageCallbackParams \ ## Index[**](#Index) ### Properties * [**crawler](#crawler) * [**message](#message) * [**previousState](#previousState) * [**state](#state) ## Properties[**](#Properties) ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L123)crawler **crawler: Crawler ### [**](#message)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L125)message **message: string ### [**](#previousState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L124)previousState **previousState: [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#state)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L122)state **state: [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) --- # @crawlee/browser Provides a simple framework for parallel crawling of web pages using headless browsers with [Puppeteer](https://github.com/puppeteer/puppeteer) and [Playwright](https://github.com/microsoft/playwright). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `BrowserCrawler` uses headless (or even headful) browsers to download web pages and extract data, it is useful for crawling websites that require JavaScript to be executed. If the target website doesn't need JavaScript, you should consider using the [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented by the [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [`requestList`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestList) or [`requestQueue`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestQueue) constructor options, respectively. If neither the `requestList` nor the `requestQueue` option is provided, the crawler will open the default request queue either when the [`crawler.addRequests()`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md#addRequests) function is called, or if the `requests` parameter (representing the initial requests) of the [`crawler.run()`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md#run) function is provided. If both the [`requestList`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestList) and [`requestQueue`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestQueue) options are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `BrowserCrawler` opens a new browser page (i.e.
tab or window) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by the user as the [`requestHandler`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#autoscaledPoolOptions) parameter of the `BrowserCrawler` constructor. For user convenience, the [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) and [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) options of the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor are available directly in the `BrowserCrawler` constructor. > *NOTE:* the pool of browser instances is internally managed by the [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class. ## Index[**](#Index) ### Crawlers * [**BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/browser-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/browser-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/browser-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/browser-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/browser-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/browser-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/browser-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/browser-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/browser-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/browser-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/browser-crawler.md#BLOCKED_STATUS_CODES) * [**checkStorageAccess](https://crawlee.dev/js/api/browser-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/browser-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/browser-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/browser-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/browser-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/browser-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/browser-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/browser-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/browser-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/browser-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/browser-crawler.md#createBasicRouter) * 
[**CreateContextOptions](https://crawlee.dev/js/api/browser-crawler.md#CreateContextOptions) * [**CreateSession](https://crawlee.dev/js/api/browser-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/browser-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/browser-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/browser-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/browser-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/browser-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/browser-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/browser-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/browser-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/browser-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/browser-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/browser-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/browser-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/browser-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/browser-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/browser-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/browser-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/browser-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/browser-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/browser-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/browser-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/browser-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/browser-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/browser-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/browser-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/browser-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/browser-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/browser-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/browser-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/browser-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/browser-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/browser-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/browser-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/browser-crawler.md#LoadedRequest) * 
[**LocalEventManager](https://crawlee.dev/js/api/browser-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/browser-crawler.md#log) * [**Log](https://crawlee.dev/js/api/browser-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/browser-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/browser-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/browser-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/browser-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/browser-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/browser-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/browser-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/browser-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/browser-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/browser-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/browser-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/browser-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/browser-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/browser-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/browser-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/browser-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/browser-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/browser-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/browser-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/browser-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/browser-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/browser-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/browser-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/browser-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/browser-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/browser-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/browser-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/browser-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/browser-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/browser-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/browser-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/browser-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/browser-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/browser-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/browser-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestProviderOptions) * 
[**RequestQueue](https://crawlee.dev/js/api/browser-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/browser-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/browser-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/browser-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/browser-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/browser-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/browser-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/browser-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/browser-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/browser-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/browser-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/browser-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/browser-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/browser-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/browser-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/browser-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/browser-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/browser-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/browser-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/browser-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/browser-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/browser-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/browser-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/browser-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/browser-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/browser-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/browser-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/browser-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/browser-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/browser-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/browser-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/browser-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/browser-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/browser-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/browser-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/browser-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/browser-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/browser-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/browser-crawler.md#SystemStatusOptions) * 
[**TieredProxy](https://crawlee.dev/js/api/browser-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/browser-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/browser-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/browser-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/browser-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/browser-crawler.md#withCheckedStorageAccess) * [**BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md) * [**BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) * [**BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) * [**BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler) * [**BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook) * [**BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### 
[**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### 
[**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### 
[**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports 
[HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### 
[**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports 
[PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports 
[RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports 
[RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### 
[**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### 
[**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#BrowserErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L67)BrowserErrorHandler **BrowserErrorHandler<Context>: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<Context> #### Type parameters * **Context**: [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) = [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) ### [**](#BrowserHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L70)BrowserHook **BrowserHook<Context, GoToOptions>: (crawlingContext, gotoOptions) => Awaitable<void> #### Type parameters * **Context** = [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) * **GoToOptions**: Dictionary | undefined = Dictionary #### Type declaration * * **(crawlingContext, gotoOptions): Awaitable<void> - #### Parameters * ##### crawlingContext: Context * ##### gotoOptions: GoToOptions #### Returns Awaitable<void> ### [**](#BrowserRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L64)BrowserRequestHandler **BrowserRequestHandler<Context>: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<Context> #### Type parameters * **Context**: [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) = [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/browser ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/browser ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/browser # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/browser ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/browser # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Features[​](#features "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/browser ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Features[​](#features-1 "Direct link to Features") * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/browser ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/browser ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/browser ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/browser ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/browser ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct 
link to 3132-2025-04-08") ### Features[​](#features-2 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * don't double increment session usage count in `BrowserCrawler` ([#2908](https://github.com/apify/crawlee/issues/2908)) ([3107e55](https://github.com/apify/crawlee/commit/3107e5511142a3579adc2348fcb6a9dcadd5c0b9)), closes [#2851](https://github.com/apify/crawlee/issues/2851) * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-3 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/browser ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/browser ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/browser # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/browser ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * **puppeteer:** rename `ignoreHTTPSErrors` to `acceptInsecureCerts` to support v23 ([#2684](https://github.com/apify/crawlee/issues/2684)) ([f3927e6](https://github.com/apify/crawlee/commit/f3927e6c3487deef4a2a6b0face04d3742ecd5dd)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/browser ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/browser ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/browser ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/browser # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-4 "Direct link to Features") * add `iframe` expansion to `parseWithCheerio` in browsers ([#2542](https://github.com/apify/crawlee/issues/2542)) ([328d085](https://github.com/apify/crawlee/commit/328d08598807782b3712bd543e394fe9a000a85d)), closes 
[#2507](https://github.com/apify/crawlee/issues/2507) * add `ignoreIframes` opt-out from the Cheerio iframe expansion ([#2562](https://github.com/apify/crawlee/issues/2562)) ([474a8dc](https://github.com/apify/crawlee/commit/474a8dc06a567cde0651d385fdac9c350ddf4508)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * declare missing peer dependencies in `@crawlee/browser` package ([#2532](https://github.com/apify/crawlee/issues/2532)) ([3357c7f](https://github.com/apify/crawlee/commit/3357c7fc5ab071b12f72097c190dbee9990e3751)) * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/browser ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") **Note:** Version bump only for package @crawlee/browser ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/browser ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/browser # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/browser ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/browser ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") ### Features[​](#features-5 "Direct link to Features") * `browserPerProxy` browser launch option ([#2418](https://github.com/apify/crawlee/issues/2418)) ([df57b29](https://github.com/apify/crawlee/commit/df57b2965ac8c8b3adf807e3bad8a649814fa213)) # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Features[​](#features-6 "Direct link to Features") * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) * better `newUrlFunction` for ProxyConfiguration ([#2392](https://github.com/apify/crawlee/issues/2392)) ([330598b](https://github.com/apify/crawlee/commit/330598b348ad27bc7c73732294a14b655ccd3507)), closes [#2348](https://github.com/apify/crawlee/issues/2348) [#2065](https://github.com/apify/crawlee/issues/2065) * expand #shadow-root elements automatically in `parseWithCheerio` helper ([#2396](https://github.com/apify/crawlee/issues/2396)) ([a05b3a9](https://github.com/apify/crawlee/commit/a05b3a93a9b57926b353df0e79d846b5024c42ac)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/browser ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) 
(2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/browser # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-7 "Direct link to Features") * adaptive playwright crawler ([#2316](https://github.com/apify/crawlee/issues/2316)) ([8e4218a](https://github.com/apify/crawlee/commit/8e4218ada03cf485751def46f8c465b2d2a825c7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/browser ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/browser ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/browser # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * `retryOnBlocked` doesn't override the blocked HTTP codes ([#2243](https://github.com/apify/crawlee/issues/2243)) ([81672c3](https://github.com/apify/crawlee/commit/81672c3d1db1dcdcffb868de5740addff82cf112)) ### Features[​](#features-8 "Direct link to Features") * check enqueue link strategy post redirect ([#2238](https://github.com/apify/crawlee/issues/2238)) ([3c5f9d6](https://github.com/apify/crawlee/commit/3c5f9d6056158e042e12d75b2b1b21ef6c32e618)), closes [#2173](https://github.com/apify/crawlee/issues/2173) * log cause with `retryOnBlocked` ([#2252](https://github.com/apify/crawlee/issues/2252)) ([e19a773](https://github.com/apify/crawlee/commit/e19a773693cfc5e65c1e2321bfc8b73c9844ea8b)), closes [#2249](https://github.com/apify/crawlee/issues/2249) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/browser ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Features[​](#features-9 "Direct link to Features") * **puppeteer:** enable `new` headless mode ([#1910](https://github.com/apify/crawlee/issues/1910)) ([7fc999c](https://github.com/apify/crawlee/commit/7fc999cf4658ca69b97f16d434444081998470f4)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/browser ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/browser ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/browser ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/browser ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-10 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) 
([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/browser ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/browser ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * log original error message on session rotation ([#2022](https://github.com/apify/crawlee/issues/2022)) ([8a11ffb](https://github.com/apify/crawlee/commit/8a11ffbdaef6b2fe8603aac570c3038f84c2f203)) # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-11 "Direct link to Features") * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Features[​](#features-12 "Direct link to Features") * retryOnBlocked detects blocked webpage ([#1956](https://github.com/apify/crawlee/issues/1956)) ([766fa9b](https://github.com/apify/crawlee/commit/766fa9b88029e9243a7427075384c1abe85c70c8)) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/browser # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * respect `` when enqueuing ([#1936](https://github.com/apify/crawlee/issues/1936)) ([aeef572](https://github.com/apify/crawlee/commit/aeef57231c84671374ed0309b7b95fa9ce9a6e8b)) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/browser ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") **Note:** Version bump only for package @crawlee/browser ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") **Note:** Version bump only for package @crawlee/browser # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * ignore invalid URLs in `enqueueLinks` in browser crawlers ([#1803](https://github.com/apify/crawlee/issues/1803)) 
([5ac336c](https://github.com/apify/crawlee/commit/5ac336c5b83b212fd6281659b8ceee091e259ff1)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/browser ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/browser # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/browser ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/browser ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/browser ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/browser # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/browser ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Features[​](#features-13 "Direct link to Features") * enable tab-as-a-container for Firefox ([#1456](https://github.com/apify/crawlee/issues/1456)) ([ae5ba4f](https://github.com/apify/crawlee/commit/ae5ba4f15fd6d14f444486234753ce1781c74cc8)) --- # abstractBrowserCrawler \ Provides a simple framework for parallel crawling of web pages using headless browsers with [Puppeteer](https://github.com/puppeteer/puppeteer) and [Playwright](https://github.com/microsoft/playwright). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `BrowserCrawler` uses headless (or even headful) browsers to download web pages and extract data, it is useful for crawling of websites that require to execute JavaScript. If the target website doesn't need JavaScript, we should consider using the [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented by the [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [`requestList`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestList) or [`requestQueue`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestQueue) constructor options, respectively. 
If neither the `requestList` nor the `requestQueue` option is provided, the crawler will open the default request queue either when the [`crawler.addRequests()`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md#addRequests) function is called, or when the `requests` parameter (representing the initial requests) of the [`crawler.run()`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md#run) function is provided. If both the [`requestList`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestList) and [`requestQueue`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestQueue) options are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `BrowserCrawler` opens a new browser page (i.e. tab or window) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by the user as the [`requestHandler`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#autoscaledPoolOptions) parameter of the `BrowserCrawler` constructor. For user convenience, the [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) and [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) options of the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor are available directly in the `BrowserCrawler` constructor. > *NOTE:* the pool of browser instances is internally managed by the [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class.
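Below is a minimal usage sketch of the workflow described above. Because `BrowserCrawler` is abstract, the sketch uses the concrete [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) subclass; it assumes the `crawlee` and `playwright` packages are installed, and the start URL, the blocked resource pattern, and the navigation timeout are illustrative values rather than defaults.

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Shortcut for the underlying AutoscaledPool maxConcurrency option.
    maxConcurrency: 10,
    // Pre-navigation hooks match the BrowserHook type: (crawlingContext, gotoOptions) => Awaitable<void>.
    preNavigationHooks: [
        async ({ page }, gotoOptions) => {
            // Illustrative: skip image downloads to speed up crawling.
            await page.route('**/*.{png,jpg,jpeg}', (route) => route.abort());
            if (gotoOptions) gotoOptions.timeout = 30_000;
        },
    ],
    // Called for every successfully loaded page (the requestHandler option).
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        log.info(`Processing ${request.url}`);
        await pushData({ url: request.url, title: await page.title() });
        // Enqueue discovered links (same hostname by default) into the default RequestQueue.
        await enqueueLinks();
    },
    // Matches the BrowserErrorHandler type; invoked after all retries have failed.
    failedRequestHandler({ request, log }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
    },
});

// Passing the initial requests to run() opens the default request queue automatically.
await crawler.run(['https://crawlee.dev']);
```

Work that should happen on every navigation (blocking resources, adjusting `gotoOptions`) belongs in the hooks, while per-page extraction stays in the `requestHandler`.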
### Hierarchy * [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\ * *BrowserCrawler* * [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) * [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) ## Index[**](#Index) ### Properties * [**autoscaledPool](#autoscaledPool) * [**browserPool](#browserPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**launchContext](#launchContext) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from BasicCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#browserPool)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L329)browserPool **browserPool: [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)\, ReturnType<[InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray)\\[number]\[createController]>, ReturnType<[InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray)\\[number]\[createLaunchContext]>, Parameters\\[number]\[createController]>\[newPage]>\[0], [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\\[number]\[createController]>\[newPage]>>> A reference to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class that manages the crawler's browsers. ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L364)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from BasicCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from BasicCrawler.hasFinishedBefore ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L331)launchContext **launchContext: [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BasicCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L324)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BasicCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BasicCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\> = ... Inherited from BasicCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from BasicCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? 
: [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from BasicCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from BasicCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from BasicCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from BasicCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from BasicCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). 
*** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from BasicCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from BasicCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BasicCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from BasicCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from BasicCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. 
*** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from BasicCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from BasicCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # BrowserCrawlerOptions \ ### Hierarchy * Omit<[BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md), requestHandler | handleRequestFunction | failedRequestHandler | handleFailedRequestFunction | errorHandler> * *BrowserCrawlerOptions* * [PuppeteerCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md) * [PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**browserPoolOptions](#browserPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**headless](#headless) * [**httpClient](#httpClient) * [**ignoreIframes](#ignoreIframes) * [**ignoreShadowRoots](#ignoreShadowRoots) * [**keepAlive](#keepAlive) * [**launchContext](#launchContext) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? 
: [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from Omit.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#browserPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L194)optionalbrowserPoolOptions **browserPoolOptions? : Partial<[BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md)<[BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)<[CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md), undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<[BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)<\_\_BrowserControllerReturn, \_\_LaunchContextReturn, [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\>>> Custom options passed to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) constructor. We can tweak those to fine-tune browser management. ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L163)optionalerrorHandler **errorHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)\ User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from Omit.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L174)optionalfailedRequestHandler **failedRequestHandler? 
: [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)\ A function to handle requests that failed more than `option.maxRequestRetries` times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (the actual context will be enhanced with crawler-specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the failed request. The second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L260)optionalheadless **headless? : boolean | new | old Whether to run the browser in headless mode. Defaults to `true`. Can also be set via [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md). ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from Omit.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md). ### [**](#ignoreIframes)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L272)optionalignoreIframes **ignoreIframes? : boolean Whether to ignore `iframes` when processing the page content via the `parseWithCheerio` helper. By default, `iframes` are expanded automatically. Use this option to disable this behavior. ### [**](#ignoreShadowRoots)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L266)optionalignoreShadowRoots **ignoreShadowRoots? : boolean Whether to ignore custom elements (and their #shadow-roots) when processing the page content via the `parseWithCheerio` helper. By default, they are expanded automatically. Use this option to disable this behavior. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from Omit.keepAlive Allows keeping the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L90)optionallaunchContext **launchContext? : [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from Omit.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. 
Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from Omit.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from Omit.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from Omit.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from Omit.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from Omit.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from Omit.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. 
Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If unsure, it's better to keep the default value, and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L248)optionalnavigationTimeoutSecs **navigationTimeoutSecs? : number Timeout in which page navigation needs to finish, in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from Omit.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on the robots.txt file, 2. because they don't match the enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached. ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L254)optionalpersistCookiesPerSession **persistCookiesPerSession? : boolean Defines whether the cookies should be persisted for sessions. This can only be used when `useSessionPool` is set to `true`. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L243)optionalpostNavigationHooks **postNavigationHooks? : [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook)\\[] Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. **Example:** ``` postNavigationHooks: [ async (crawlingContext) => { const { page } = crawlingContext; if (hasCaptcha(page)) { await solveCaptcha(page); } }, ] ``` ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L224)optionalpreNavigationHooks **preNavigationHooks? : [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook)\\[] Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotoOptions`, which are passed to the `page.goto()` function the crawler calls to navigate. **Example:** ``` preNavigationHooks: [ async (crawlingContext, gotoOptions) => { const { page } = crawlingContext; await page.evaluate((attr) => { window.foo = attr; }, 'bar'); gotoOptions.timeout = 60_000; gotoOptions.waitUntil = 'domcontentloaded'; }, ] ``` Modifying `pageOptions` is supported only in Playwright incognito contexts. 
See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook) ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L201)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration. ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L119)optionalrequestHandler **requestHandler? : [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler)\> Function that is called to process each request. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as an argument, where: * [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) is an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) object with details about the URL to open, HTTP method etc; * [`page`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#page) is an instance of the Puppeteer [Page](https://pptr.dev/api/puppeteer.page) or Playwright [Page](https://playwright.dev/docs/api/class-page); * [`browserController`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#browserController) is an instance of the [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md); * [`response`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#response) is an instance of the Puppeteer [Response](https://pptr.dev/api/puppeteer.httpresponse) or Playwright [Response](https://playwright.dev/docs/api/class-response), which is the main resource response as returned by the respective `page.goto()` function. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to the [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from Omit.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? 
: [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from Omit.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from Omit.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from Omit.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from Omit.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip URLs that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from Omit.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from Omit.sameDomainDelaySecs Indicates how long (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? 
: [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from Omit.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from Omit.statisticsOptions Customize the way statistics collection works, such as the logging interval or whether to output the statistics to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from Omit.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters. ``` const crawler = new CheerioCrawler({ statusMessageCallback: async (ctx) => { return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG' }, statusMessageLoggingInterval: 1, // defaults to 10s async requestHandler({ $, enqueueLinks, request, log }) { // ... }, }); ``` ### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from Omit.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from Omit.useSessionPool The basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). 
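For orientation, here is a minimal, hedged sketch of how these options are usually supplied through one of the concrete subclasses. It assumes `PlaywrightCrawler` from the main `crawlee` package; the option names mirror the properties documented in this interface, while the start URL, hook and handler bodies are illustrative only.

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Inherited BasicCrawlerOptions
    maxRequestsPerCrawl: 100, // safety limit against runaway crawls
    maxConcurrency: 10, // upper bound for the AutoscaledPool
    // BrowserCrawlerOptions
    headless: true,
    navigationTimeoutSecs: 30,
    launchContext: {
        launchOptions: { args: ['--disable-gpu'] }, // passed to the browser launcher
    },
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            gotoOptions.waitUntil = 'domcontentloaded';
        },
    ],
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}: ${await page.title()}`);
        await enqueueLinks();
    },
    failedRequestHandler({ request, log }, error) {
        log.error(`Request ${request.url} failed too many times: ${error.message}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```

Options that are left out fall back to the defaults listed above, e.g. `maxRequestRetries = 3` and `requestHandlerTimeoutSecs = 60`.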
--- # BrowserCrawlingContext \ ### Hierarchy * [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)\ * *BrowserCrawlingContext* * [PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md) * [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md) ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**browserController](#browserController) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**page](#page) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**pushData](#pushData) * [**sendRequest](#sendRequest) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from CrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#browserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L59)browserController **browserController: ProvidedController ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: Crawler Inherited from CrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from CrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from CrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from CrawlingContext.log A preconfigured logger for the request handler. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L60)page **page: Page ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? 
: [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from CrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from CrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L61)optionalresponse **response? : Response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from CrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from CrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from CrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from CrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. 
*** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from CrawlingContext.sendRequest Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> --- # BrowserLaunchContext \ ### Hierarchy * [BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md)\ * *BrowserLaunchContext* * [PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) * [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launcher](#launcher) * [**launchOptions](#launchOptions) * [**proxyUrl](#proxyUrl) * [**useChrome](#useChrome) * [**useIncognitoPages](#useIncognitoPages) * [**userAgent](#userAgent) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L40)optionalbrowserPerProxy **browserPerProxy? : boolean Overrides BrowserPluginOptions.browserPerProxy If set to `true`, the crawler respects the proxy url generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L54)optionalexperimentalContainersexperimental **experimentalContainers? : boolean Overrides BrowserPluginOptions.experimentalContainers Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. ### [**](#launcher)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L82)optionallauncher **launcher? : Launcher The type of browser to be launched. By default, `chromium` is used. Other browsers like `webkit` or `firefox` can be used. * **@example** ``` // import the browser from the library first import { firefox } from 'playwright'; ``` For more details, check out the [example](https://crawlee.dev/js/docs/examples/playwright-crawler-firefox.md). ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L54)optionalinheritedlaunchOptions **launchOptions? : TOptions Inherited from BrowserPluginOptions.launchOptions Options that will be passed down to the automation library. E.g. 
`puppeteer.launch(launchOptions);`. This is a good place to set options that you want to apply as defaults. To dynamically override those options per-browser, see the `preLaunchHooks` of [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md). ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L22)optionalproxyUrl **proxyUrl? : string Overrides BrowserPluginOptions.proxyUrl URL to an HTTP proxy server. It must define the port number, and it may also contain proxy username and password. * **@example** ``` `http://bob:pass123@proxy.example.com:1234`. ``` ### [**](#useChrome)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L32)optionaluseChrome **useChrome? : boolean = false If `true` and the `executablePath` option of [`launchOptions`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md#launchOptions) is not set, the launcher will launch full Google Chrome browser available on the machine rather than the bundled Chromium. The path to Chrome executable is taken from the `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable if provided, or defaults to the typical Google Chrome executable location specific for the operating system. ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L47)optionaluseIncognitoPages **useIncognitoPages? : boolean = false Overrides BrowserPluginOptions.useIncognitoPages With this option selected, all pages will be opened in a new incognito browser context. This means they will not share cookies nor cache and their resources will not be throttled by one another. ### [**](#userAgent)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L68)optionaluserAgent **userAgent? : string The `User-Agent` HTTP header used by the browser. If not provided, the function sets `User-Agent` to a reasonable default to reduce the chance of detection of the crawler. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L61)optionaluserDataDir **userDataDir? : string Overrides BrowserPluginOptions.userDataDir Sets the [User Data Directory](https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md) path. The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state. If not specified, a temporary directory is used instead. --- # @crawlee/browser-pool Browser Pool is a small, but powerful and extensible library, that allows you to seamlessly control multiple headless browsers at the same time with only a little configuration, and a single function call. Currently, it supports [Puppeteer](https://github.com/puppeteer/puppeteer), [Playwright](https://github.com/microsoft/playwright), and it can be easily extended with plugins. We created Browser Pool because we regularly needed to execute tasks concurrently in many headless browsers and their pages, but we did not want to worry about launching browsers, closing browsers, restarting them after crashes and so on. We also wanted to easily and reliably manage the whole browser/page lifecycle. You can use Browser Pool for scraping the internet at scale, testing your website in multiple browsers at the same time or launching web automation robots. 
## Installation[​](#installation "Direct link to Installation") Use NPM or Yarn to install `@crawlee/browser-pool`. Note that `@crawlee/browser-pool` does not come preinstalled with browser automation libraries. This allows you to choose your own libraries and their versions, and it also makes `@crawlee/browser-pool` much smaller. Run this command to install `@crawlee/browser-pool` and the `playwright` browser automation library. ``` npm install @crawlee/browser-pool playwright ``` ## Usage[​](#usage "Direct link to Usage") This simple example shows how to open a page in a browser using Browser Pool. We use the provided `PlaywrightPlugin` to wrap a Playwright installation of your own. By calling `browserPool.newPage()` you launch a new Firefox browser and open a new page in that browser. ``` import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool'; import playwright from 'playwright'; const browserPool = new BrowserPool({ browserPlugins: [new PlaywrightPlugin(playwright.chromium)], }); // Launches Chromium with Playwright and returns a Playwright Page. const page1 = await browserPool.newPage(); // You can interact with the page as you're used to. await page1.goto('https://example.com'); // When you're done, close the page. await page1.close(); // Opens a second page in the same browser. const page2 = await browserPool.newPage(); // When everything's finished, tear down the pool. await browserPool.destroy(); ``` ## Launching multiple browsers[​](#launching-multiple-browsers "Direct link to Launching multiple browsers") The basic example shows how to launch a single browser, but the purpose of Browser Pool is to launch many browsers. This is done automatically in the background. You only need to provide the relevant plugins and call `browserPool.newPage()`. ``` import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool'; import playwright from 'playwright'; const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin(playwright.chromium), new PlaywrightPlugin(playwright.firefox), new PlaywrightPlugin(playwright.webkit), ], }); // Open 4 pages in 3 browsers. The browsers are launched // in a round-robin fashion based on the plugin order. const chromiumPage = await browserPool.newPage(); const firefoxPage = await browserPool.newPage(); const webkitPage = await browserPool.newPage(); const chromiumPage2 = await browserPool.newPage(); // Don't forget to close pages / destroy pool when you're done. ``` This round-robin way of opening pages may not be useful for you, if you need to consistently run tasks in multiple environments. For that, there's the `newPageWithEachPlugin` function. ``` import { BrowserPool, PlaywrightPlugin, PuppeteerPlugin } from '@crawlee/browser-pool'; import playwright from 'playwright'; import puppeteer from 'puppeteer'; const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin(playwright.chromium), new PuppeteerPlugin(puppeteer), ], }); const pages = await browserPool.newPageWithEachPlugin(); const promises = pages.map(async page => { // Run some task with each page // pages are in order of plugins: // [playwrightPage, puppeteerPage] await page.close(); }); await Promise.all(promises); // Continue with some more work. ``` ## Features[​](#features "Direct link to Features") Besides a simple interface for launching browsers, Browser Pool includes other helpful features that make browser management more convenient. 
### Simple configuration[​](#simple-configuration "Direct link to Simple configuration") You can easily set the maximum number of pages that can be open in a given browser and also the maximum number of pages to process before a browser [is retired](#graceful-browser-closing). ``` const browserPool = new BrowserPool({ maxOpenPagesPerBrowser: 20, retireBrowserAfterPageCount: 100, }); ``` You can configure the browser launch options either right in the plugins: ``` const playwrightPlugin = new PlaywrightPlugin(playwright.chromium, { launchOptions: { headless: true, } }) ``` Or dynamically in [pre-launch hooks](#lifecycle-management-with-hooks): ``` const browserPool = new BrowserPool({ preLaunchHooks: [(pageId, launchContext) => { if (pageId === 'headful') { launchContext.launchOptions.headless = false; } }] }); ``` ### Proxy management[​](#proxy-management "Direct link to Proxy management") When scraping at scale or testing websites from multiple geolocations, one often needs to use proxy servers. Setting up an authenticated proxy in Puppeteer can be cumbersome, so we created a helper that does all the heavy lifting for you. Simply provide a proxy URL with authentication credentials, and you're done. It works the same for Playwright too. ``` const puppeteerPlugin = new PuppeteerPlugin(puppeteer, { proxyUrl: 'http://:@proxy.com:8000' }); ``` > We plan to extend this by adding a proxy-per-page functionality, allowing you to rotate proxies per page, rather than per browser. ### Lifecycle management with hooks[​](#lifecycle-management-with-hooks "Direct link to Lifecycle management with hooks") Browser Pool allows you to manage the full browser / page lifecycle by attaching hooks to the most important events. Asynchronous hooks are supported, and their execution order is guaranteed. The first parameter of each hook is either a `pageId` for the hooks executed before a `page` is created or a `page` afterward. This is useful to keep track of which hook was triggered by which `newPage()` call. ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin(playwright.chromium), ], preLaunchHooks: [(pageId, launchContext) => { // You can use pre-launch hooks to make dynamic changes // to the launchContext, such as changing a proxyUrl // or updating the browser launchOptions pageId === 'my-page' // true }], postPageCreateHooks: [(page, browserController) => { // It makes sense to make global changes to pages // in post-page-create hooks. For example, you can // inject some JavaScript library, such as jQuery. browserPool.getPageId(page) === 'my-page' // true }] }); await browserPool.newPage({ id: 'my-page' }); ``` > See the API Documentation for all hooks and their arguments. ### Manipulating playwright context using `pageOptions` or `launchOptions`[​](#manipulating-playwright-context-using-pageoptions-or-launchoptions "Direct link to manipulating-playwright-context-using-pageoptions-or-launchoptions") Playwright allows customizing multiple browser attributes by browser context. You can customize some of them once the context is created, but some need to be customized within its creation. This part of the documentation should explain how you can effectively customize the browser context. First of all, let's take a look at what kind of context strategy you chose. You can choose between two strategies by `useIncognitoPages` `LaunchContext` option. Suppose you decide to keep `useIncognitoPages` default `false` and create a shared context across all pages launched by one browser. 
In this case, you should pass the `contextOptions` as a `launchOptions` since the context is created within the new browser launch. The `launchOptions` corresponds to these [playwright options](https://playwright.dev/docs/api/class-browsertype#browsertypelaunchpersistentcontextuserdatadir-options). As you can see, these options contain not only ordinary playwright launch options but also the context options. If you set `useIncognitoPages` to `true`, you will create a new context within each new page, which allows you to handle each page its cookies and application data. This approach allows you to pass the context options as `pageOptions` because a new context is created once you create a new page. In this case, the `pageOptions` corresponds to these [playwright options](https://playwright.dev/docs/api/class-browser#browsernewpageoptions). **Changing context options with `LaunchContext`:** This will only work if you keep the default value for `useIncognitoPages` (`false`). ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin( playwright.chromium, { launchOptions: { deviceScaleFactor: 2, }, }, ), ], }); ``` **Changing context options with `browserPool.newPage` options:** ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin( playwright.chromium, { useIncognitoPages: true, // You must turn on incognito pages. launchOptions: { // launch options headless: false, devtools: true, }, }, ), ], }); // Launches Chromium with Playwright and returns a Playwright Page. const page = await browserPool.newPage({ pageOptions: { // context options deviceScaleFactor: 2, colorScheme: 'light', locale: 'de-DE', }, }); ``` **Changing context options with `prePageCreateHooks` options:** ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin( playwright.chromium, { useIncognitoPages: true, launchOptions: { // launch options headless: false, devtools: true, }, }, ), ], prePageCreateHooks: [ (pageId, browserController, pageOptions) => { pageOptions.deviceScaleFactor = 2; pageOptions.colorScheme = 'dark'; pageOptions.locale = 'de-DE'; // You must modify the 'pageOptions' object, not assign to the variable. // pageOptions = {deviceScaleFactor: 2, ...etc} => This will not work! }, ], }); // Launches Chromium with Playwright and returns a Playwright Page. const page = await browserPool.newPage(); ``` ### Single API for common operations[​](#single-api-for-common-operations "Direct link to Single API for common operations") Puppeteer and Playwright handle some things differently. Browser Pool attempts to remove those differences for the most common use-cases. ``` // Playwright const cookies = await context.cookies(); await context.addCookies(cookies); // Puppeteer const cookies = await page.cookies(); await page.setCookie(...cookies); // BrowserPool uses the same API for all plugins const cookies = await browserController.getCookies(page); await browserController.setCookies(page, cookies); ``` ### Graceful browser closing[​](#graceful-browser-closing "Direct link to Graceful browser closing") With Browser Pool, browsers are not closed, but retired. A retired browser will no longer open new pages, but it will wait until the open pages are closed, allowing your running tasks to finish. If a browser gets stuck in limbo, it will be killed after a timeout to prevent hanging browser processes. ### Changing browser fingerprints a.k.a. 
browser signatures[​](#changing-browser-fingerprints-aka-browser-signatures "Direct link to Changing browser fingerprints a.k.a. browser signatures") > Fingerprints are enabled by default since v3. Changing browser fingerprints is beneficial for avoiding getting blocked and simulating real user browsers. With Browser Pool, you can do this otherwise complicated technique by enabling the `useFingerprints` option. The fingerprints are by default tied to the respective proxy urls to not use the same unique fingerprint from various IP addresses. You can disable this behavior in the [`fingerprintOptions`](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md). In the `fingerprintOptions`, You can also control which fingerprints are generated. You can control parameters as browser, operating system, and browser versions. The `browser-pool` module exports three constructors. One for `BrowserPool` itself and two for the included Puppeteer and Playwright plugins. **Example:** ``` import { BrowserPool, PuppeteerPlugin, PlaywrightPlugin } from '@crawlee/browser-pool'; import puppeteer from 'puppeteer'; import playwright from 'playwright'; const browserPool = new BrowserPool({ browserPlugins: [ new PuppeteerPlugin(puppeteer), new PlaywrightPlugin(playwright.chromium), ] }); ``` ## Index[**](#Index) ### Enumerations * [**BROWSER\_CONTROLLER\_EVENTS](https://crawlee.dev/js/api/browser-pool/enum/BROWSER_CONTROLLER_EVENTS.md) * [**BROWSER\_POOL\_EVENTS](https://crawlee.dev/js/api/browser-pool/enum/BROWSER_POOL_EVENTS.md) * [**BrowserName](https://crawlee.dev/js/api/browser-pool/enum/BrowserName.md) * [**DeviceCategory](https://crawlee.dev/js/api/browser-pool/enum/DeviceCategory.md) * [**OperatingSystemsName](https://crawlee.dev/js/api/browser-pool/enum/OperatingSystemsName.md) ### Classes * [**BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) * [**BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) * [**BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md) * [**BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) * [**LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) * [**PlaywrightBrowser](https://crawlee.dev/js/api/browser-pool/class/PlaywrightBrowser.md) * [**PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) * [**PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) * [**PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) * [**PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) ### Interfaces * [**BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md) * [**BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md) * [**BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md) * [**BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md) * [**BrowserPoolNewPageInNewBrowserOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageInNewBrowserOptions.md) * [**BrowserPoolNewPageOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageOptions.md) * [**BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md) * 
[**BrowserSpecification](https://crawlee.dev/js/api/browser-pool/interface/BrowserSpecification.md) * [**CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md) * [**CreateLaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md) * [**FingerprintGenerator](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGenerator.md) * [**FingerprintGeneratorOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGeneratorOptions.md) * [**FingerprintOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md) * [**GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) * [**LaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/LaunchContextOptions.md) ### Type Aliases * [**InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray) * [**PostLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PostLaunchHook) * [**PostPageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCloseHook) * [**PostPageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCreateHook) * [**PreLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PreLaunchHook) * [**PrePageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCloseHook) * [**PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook) * [**UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise) ### Variables * [**DEFAULT\_USER\_AGENT](https://crawlee.dev/js/api/browser-pool.md#DEFAULT_USER_AGENT) ## Type Aliases[**](<#Type Aliases>) ### [**](#InferBrowserPluginArray)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/utils.ts#L15)InferBrowserPluginArray **InferBrowserPluginArray\: Input extends readonly \[infer FirstValue, ...infer Rest] | \[infer FirstValue, ...infer Rest] ? FirstValue extends [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) ? [InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray)\ : FirstValue extends [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) ? [InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray)\ : never : Input extends \[] ? Result : Input extends readonly infer U\[] ? \[U] extends \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) | [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] ? U\[] : never : Result #### Type parameters * **Input**: readonly unknown\[] * **Result**: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\[] = \[] ### [**](#PostLaunchHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L136)PostLaunchHook **PostLaunchHook\: (pageId, browserController) => void | Promise\ Post-launch hooks are executed as soon as a browser is launched. The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) To guarantee order of execution before other hooks in the same browser, the [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) methods cannot be used until the post-launch hooks complete. 
If you attempt to call `await browserController.close()` from a post-launch hook, it will deadlock the process. This API is subject to change. *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) #### Type declaration * * **(pageId, browserController): void | Promise\ - #### Parameters * ##### pageId: string * ##### browserController: BC #### Returns void | Promise\ ### [**](#PostPageCloseHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L186)PostPageCloseHook **PostPageCloseHook\: (pageId, browserController) => void | Promise\ Post-page-close hooks allow you to do page related clean up. The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) #### Type declaration * * **(pageId, browserController): void | Promise\ - #### Parameters * ##### pageId: string * ##### browserController: BC #### Returns void | Promise\ ### [**](#PostPageCreateHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L164)PostPageCreateHook **PostPageCreateHook\: (page, browserController) => void | Promise\ Post-page-create hooks are called right after a new page is created and all internal actions of Browser Pool are completed. This is the place to make changes to a page that you would like to apply to all pages. Such as injecting a JavaScript library into all pages. The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) * **Page** = [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\> #### Type declaration * * **(page, browserController): void | Promise\ - #### Parameters * ##### page: Page * ##### browserController: BC #### Returns void | Promise\ ### [**](#PreLaunchHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L125)PreLaunchHook **PreLaunchHook\: (pageId, launchContext) => void | Promise\ Pre-launch hooks are executed just before a browser is launched and provide a good opportunity to dynamically change the launch options. The hooks are called with two arguments: `pageId`: `string` and `launchContext`: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) *** #### Type parameters * **LC**: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) #### Type declaration * * **(pageId, launchContext): void | Promise\ - #### Parameters * ##### pageId: string * ##### launchContext: LC #### Returns void | Promise\ ### [**](#PrePageCloseHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L176)PrePageCloseHook **PrePageCloseHook\: (page, browserController) => void | Promise\ Pre-page-close hooks give you the opportunity to make last second changes in a page that's about to be closed, such as saving a snapshot or updating state. 
The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) * **Page** = [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\> #### Type declaration * * **(page, browserController): void | Promise\ - #### Parameters * ##### page: Page * ##### browserController: BC #### Returns void | Promise\ ### [**](#PrePageCreateHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L150)PrePageCreateHook **PrePageCreateHook\: (pageId, browserController, pageOptions) => void | Promise\ Pre-page-create hooks are executed just before a new page is created. They are useful to make dynamic changes to the browser before opening a page. The hooks are called with three arguments: `pageId`: `string`, `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) and `pageOptions`: `object|undefined` - This only works if the underlying `BrowserController` supports new page options. So far, new page options are only supported by `PlaywrightController` in incognito contexts. If the page options are not supported by `BrowserController` the `pageOptions` argument is `undefined`. *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) * **PO** = Parameters\\[0] #### Type declaration * * **(pageId, browserController, pageOptions): void | Promise\ - #### Parameters * ##### pageId: string * ##### browserController: BC * ##### optionalpageOptions: PO #### Returns void | Promise\ ### [**](#UnwrapPromise)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/utils.ts#L5)UnwrapPromise **UnwrapPromise\: T extends PromiseLike\ R> ? [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\ : T #### Type parameters * **T** ## Variables[**](#Variables) ### [**](#DEFAULT_USER_AGENT)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L20)constDEFAULT\_USER\_AGENT **DEFAULT\_USER\_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10\_15\_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10\_15\_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36' The default User Agent used by `PlaywrightCrawler`, `launchPlaywright`, 'PuppeteerCrawler' and 'launchPuppeteer' when Chromium/Chrome browser is launched: * in headless mode, * without using a fingerprint, * without specifying a user agent. Last updated on 2022-05-05. After you update it here, please update it also in jsdom-crawler.ts --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/browser-pool ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * correctly apply `launchOptions` with `useIncognitoPages` ([#3181](https://github.com/apify/crawlee/issues/3181)) ([84a4b70](https://github.com/apify/crawlee/commit/84a4b709ee59d9edbcdc9a19559fefa4e9139ba4)), closes [/github.com/apify/crawlee/issues/3173#issuecomment-3346728227](https://github.com//github.com/apify/crawlee/issues/3173/issues/issuecomment-3346728227) [#3173](https://github.com/apify/crawlee/issues/3173) [#3173](https://github.com/apify/crawlee/issues/3173) ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/browser-pool # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/browser-pool ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/browser-pool # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * don't retire browsers with long-running `pre|postLaunchHooks` prematurely ([#3062](https://github.com/apify/crawlee/issues/3062)) ([681660e](https://github.com/apify/crawlee/commit/681660e35a1ceaca5e96a7f61d5a7c66ec32bcde)) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * await `_createPageForBrowser` in browser pool ([#2950](https://github.com/apify/crawlee/issues/2950)) 
([27ba74b](https://github.com/apify/crawlee/commit/27ba74bacfcaa0467e7d97eb27d6a9c1d9cea9be)), closes [#2789](https://github.com/apify/crawlee/issues/2789) * Fix trailing slash removal in BrowserPool ([#2921](https://github.com/apify/crawlee/issues/2921)) ([c1fc439](https://github.com/apify/crawlee/commit/c1fc439e8e9cf74808912c66a1915f1bfd345b5f)), closes [#2878](https://github.com/apify/crawlee/issues/2878) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/browser-pool # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/browser-pool ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/browser-pool ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/browser-pool # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * update `fingerprintGeneratorOptions` types ([#2705](https://github.com/apify/crawlee/issues/2705)) ([fcb098d](https://github.com/apify/crawlee/commit/fcb098d6357b69e6d1790765076e4fe4146c8143)), closes [/github.com/apify/fingerprint-suite/blob/c61814e6ba8822543deb0ce6c03e0a0249933629/packages/fingerprint-generator/src/fingerprint-generator.ts#L73](https://github.com//github.com/apify/fingerprint-suite/blob/c61814e6ba8822543deb0ce6c03e0a0249933629/packages/fingerprint-generator/src/fingerprint-generator.ts/issues/L73) [/github.com/apify/fingerprint-suite/blob/c61814e6ba8822543deb0ce6c03e0a0249933629/packages/header-generator/src/header-generator.ts#L87](https://github.com//github.com/apify/fingerprint-suite/blob/c61814e6ba8822543deb0ce6c03e0a0249933629/packages/header-generator/src/header-generator.ts/issues/L87) [#2703](https://github.com/apify/crawlee/issues/2703) ### Features[​](#features "Direct link to Features") * allow using other HTTP clients ([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/browser-pool ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/browser-pool ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/browser-pool ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/browser-pool ## 
[3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/browser-pool # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/browser-pool ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/browser-pool ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/browser-pool ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * increase timeout for retiring inactive browsers ([#2523](https://github.com/apify/crawlee/issues/2523)) ([195f176](https://github.com/apify/crawlee/commit/195f1766a03293db19caa33f9fc3d4ab08081f71)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/browser-pool ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/browser-pool # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/browser-pool ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/browser-pool ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") ### Features[​](#features-1 "Direct link to Features") * `browserPerProxy` browser launch option ([#2418](https://github.com/apify/crawlee/issues/2418)) ([df57b29](https://github.com/apify/crawlee/commit/df57b2965ac8c8b3adf807e3bad8a649814fa213)) # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Features[​](#features-2 "Direct link to Features") * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) * better `newUrlFunction` for ProxyConfiguration ([#2392](https://github.com/apify/crawlee/issues/2392)) ([330598b](https://github.com/apify/crawlee/commit/330598b348ad27bc7c73732294a14b655ccd3507)), closes [#2348](https://github.com/apify/crawlee/issues/2348) [#2065](https://github.com/apify/crawlee/issues/2065) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * fix detection of older puppeteer versions ([890669b](https://github.com/apify/crawlee/commit/890669b0b3eef94d00ad69aa022e13b3109a660c)), closes [#2370](https://github.com/apify/crawlee/issues/2370) * **puppeteer:** improve detection of older versions ([98d4e86](https://github.com/apify/crawlee/commit/98d4e8664a54c1a134446a1b6ab9042d14ed8629)) ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 
"Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/browser-pool # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * **puppeteer:** add 'process' to the browser bound methods ([#2329](https://github.com/apify/crawlee/issues/2329)) ([2750ba6](https://github.com/apify/crawlee/commit/2750ba646ef3c1d51eacdd8e7d67be0e14fb2a97)) * **puppeteer:** support `puppeteer@v22` ([#2337](https://github.com/apify/crawlee/issues/2337)) ([3cc360a](https://github.com/apify/crawlee/commit/3cc360a1ea94147133f9785d65834f360f7b42a7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/browser-pool ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/browser-pool ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/browser-pool # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * **browser-pool:** respect user options before assigning fingerprints ([#2190](https://github.com/apify/crawlee/issues/2190)) ([f050776](https://github.com/apify/crawlee/commit/f050776a916a0530aca6727a447a49252e643417)), closes [#2164](https://github.com/apify/crawlee/issues/2164) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/browser-pool ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Features[​](#features-3 "Direct link to Features") * **puppeteer:** enable `new` headless mode ([#1910](https://github.com/apify/crawlee/issues/1910)) ([7fc999c](https://github.com/apify/crawlee/commit/7fc999cf4658ca69b97f16d434444081998470f4)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * **BrowserPool:** ignore `--no-sandbox` flag for webkit launcher ([#2148](https://github.com/apify/crawlee/issues/2148)) ([1eb2f08](https://github.com/apify/crawlee/commit/1eb2f08a3cdead5dd21ffde4162d403175a4594c)), closes [#1797](https://github.com/apify/crawlee/issues/1797) * provide more detailed error messages for browser launch errors ([#2157](https://github.com/apify/crawlee/issues/2157)) ([f188ebe](https://github.com/apify/crawlee/commit/f188ebe0b4ae7594225ef37d8160d175d4535ccd)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct 
link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * allow to use any version of puppeteer or playwright ([#2102](https://github.com/apify/crawlee/issues/2102)) ([0cafceb](https://github.com/apify/crawlee/commit/0cafceb2966d430dd1b2a1b619fe66da1c951f4c)), closes [#2101](https://github.com/apify/crawlee/issues/2101) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **browser-pool:** improve error handling when browser is not found ([#2050](https://github.com/apify/crawlee/issues/2050)) ([282527f](https://github.com/apify/crawlee/commit/282527f31bb366a4e52463212f652dcf6679b6c3)), closes [#1459](https://github.com/apify/crawlee/issues/1459) * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/browser-pool # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) **Note:** Version bump only for package @crawlee/browser-pool ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/browser-pool ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/browser-pool # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/browser-pool ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/browser-pool ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") **Note:** Version bump only for package @crawlee/browser-pool ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") **Note:** Version bump only for package @crawlee/browser-pool # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) **Note:** Version bump only for package @crawlee/browser-pool ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/browser-pool ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/browser-pool # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug 
Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * update playwright to 1.29.2 and make peer dep. less strict ([#1735](https://github.com/apify/crawlee/issues/1735)) ([c654fcd](https://github.com/apify/crawlee/commit/c654fcdea06fb203b7952ed97650190cc0e74394)), closes [#1723](https://github.com/apify/crawlee/issues/1723) ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/browser-pool ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/browser-pool ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/browser-pool # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/browser-pool ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Features[​](#features-4 "Direct link to Features") * enable tab-as-a-container for Firefox ([#1456](https://github.com/apify/crawlee/issues/1456)) ([ae5ba4f](https://github.com/apify/crawlee/commit/ae5ba4f15fd6d14f444486234753ce1781c74cc8)) --- # abstractBrowserController \ The `BrowserController` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerController` or `PlaywrightController` extend. Second, it defines the public interface of the specialized classes which provide only private methods. Therefore, we do not keep documentation for the specialized classes, because it's the same for all of them. ### Hierarchy * TypedEmitter<[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\> * *BrowserController* * [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) * [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) ## Index[**](#Index) ### Properties * [**activePages](#activePages) * [**browser](#browser) * [**browserPlugin](#browserPlugin) * [**id](#id) * [**isActive](#isActive) * [**lastPageOpenedAt](#lastPageOpenedAt) * [**launchContext](#launchContext) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**totalPages](#totalPages) * [**defaultMaxListeners](#defaultMaxListeners) ### Methods * [**addListener](#addListener) * [**close](#close) * [**emit](#emit) * [**eventNames](#eventNames) * [**getCookies](#getCookies) * [**getMaxListeners](#getMaxListeners) * [**kill](#kill) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setCookies](#setCookies) * [**setMaxListeners](#setMaxListeners) ## Properties[**](#Properties) ### [**](#activePages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L73)activePages **activePages: number = 0 ### [**](#browser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L52)browser **browser: LaunchResult = ... Browser representation of the underlying automation library. 
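For illustration, a minimal sketch of reaching this underlying library object from outside the pool, assuming a Playwright-backed setup; the controller is obtained with `BrowserPool.getBrowserControllerByPage()`, and `contexts()` is Playwright's own API rather than part of `BrowserController`:

```
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';
import playwright from 'playwright';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(playwright.chromium)],
});

const page = await browserPool.newPage();

// The controller wraps the launched browser; `browser` is the library-native
// object, so library-specific calls such as contexts() work on it directly.
const browserController = browserPool.getBrowserControllerByPage(page);
console.dir(browserController?.browser.contexts());

await browserPool.destroy();
```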
### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L47)browserPlugin **browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ The `BrowserPlugin` instance used to launch the browser. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L42)id **id: string = ... ### [**](#isActive)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L71)isActive **isActive: boolean = false ### [**](#lastPageOpenedAt)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L77)lastPageOpenedAt **lastPageOpenedAt: number = ... ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L57)launchContext **launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ = ... The configuration the browser was launched with. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L63)optionalproxyTier **proxyTier? : number The proxy tier tied to this browser controller. `undefined` if no tiered proxy is used. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L69)optionalproxyUrl **proxyUrl? : string The proxy URL used by the browser controller. This is set every time the browser controller uses proxy (even the tiered one). `undefined` if no proxy is used ### [**](#totalPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L75)totalPages **totalPages: number = 0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L10)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from TypedEmitter.defaultMaxListeners ## Methods[**](#Methods) ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L11)externalinheritedaddListener * ****addListener**\(event, listener): this - Inherited from TypedEmitter.addListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L131)close * ****close**(): Promise\ - Gracefully closes the browser and makes sure there will be no lingering browser processes. Emits 'browserClosed' event. 
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L19)externalinheritedemit * ****emit**\(event, ...args): boolean - Inherited from TypedEmitter.emit #### Parameters * ##### externalevent: U * ##### externalrest...args: Parameters<[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]> #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L20)externalinheritedeventNames * ****eventNames**\(): U\[] - Inherited from TypedEmitter.eventNames #### Returns U\[] ### [**](#getCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L181)getCookies * ****getCookies**(page): Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> - #### Parameters * ##### page: NewPageResult #### Returns Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L24)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from TypedEmitter.getMaxListeners #### Returns number ### [**](#kill)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L156)kill * ****kill**(): Promise\ - Immediately kills the browser process. Emits 'browserClosed' event. *** #### Returns Promise\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L21)externalinheritedlistenerCount * ****listenerCount**(type): number - Inherited from TypedEmitter.listenerCount #### Parameters * ##### externaltype: BROWSER\_CLOSED #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L22)externalinheritedlisteners * ****listeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] - Inherited from TypedEmitter.listeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L18)externalinheritedoff * ****off**\(event, listener): this - Inherited from TypedEmitter.off #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L17)externalinheritedon * ****on**\(event, listener): this - Inherited from TypedEmitter.on #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L16)externalinheritedonce * ****once**\(event, listener): this - Inherited from TypedEmitter.once #### Parameters * ##### externalevent: U * ##### externallistener: 
[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L12)externalinheritedprependListener * ****prependListener**\(event, listener): this - Inherited from TypedEmitter.prependListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L13)externalinheritedprependOnceListener * ****prependOnceListener**\(event, listener): this - Inherited from TypedEmitter.prependOnceListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L23)externalinheritedrawListeners * ****rawListeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] - Inherited from TypedEmitter.rawListeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L15)externalinheritedremoveAllListeners * ****removeAllListeners**(event): this - Inherited from TypedEmitter.removeAllListeners #### Parameters * ##### externaloptionalevent: BROWSER\_CLOSED #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L14)externalinheritedremoveListener * ****removeListener**\(event, listener): this - Inherited from TypedEmitter.removeListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#setCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L177)setCookies * ****setCookies**(page, cookies): Promise\ - #### Parameters * ##### page: NewPageResult * ##### cookies: [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L25)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from TypedEmitter.setMaxListeners #### Parameters * ##### externaln: number #### Returns this --- # BrowserLaunchError Errors of `CriticalError` type will shut down the whole crawler. 
Error handlers catching CriticalError should avoid logging it, as it will be logged by Node.js itself at the end ### Hierarchy * [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) * *BrowserLaunchError* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L295)publicconstructor * ****new BrowserLaunchError**(...args): [BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) - Overrides CriticalError.constructor #### Parameters * ##### rest...args: \[message?: string, options?: ErrorOptions] #### Returns [BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from CriticalError.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from CriticalError.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from CriticalError.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from CriticalError.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from CriticalError.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from CriticalError.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. 
The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from CriticalError.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from CriticalError.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # abstractBrowserPlugin \ The `BrowserPlugin` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerPlugin` or `PlaywrightPlugin` extend. Second, it allows the user to configure the automation libraries and feed them to [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) for use. ### Hierarchy * *BrowserPlugin* * [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) * [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launchOptions](#launchOptions) * [**library](#library) * [**name](#name) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ### Methods * [**createController](#createController) * [**createLaunchContext](#createLaunchContext) * [**launch](#launch) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L129)constructor * ****new BrowserPlugin**\(library, options): [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ - #### Parameters * ##### library: Library * ##### options: [BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md)\ = {} #### Returns [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L127)optionalbrowserPerProxy **browserPerProxy? 
: boolean ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L125)experimentalContainers **experimentalContainers: boolean ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L117)launchOptions **launchOptions: LibraryOptions ### [**](#library)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L115)library **library: Library ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L113)name **name: string = ... ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L119)optionalproxyUrl **proxyUrl? : string ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L123)useIncognitoPages **useIncognitoPages: boolean ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L121)optionaluserDataDir **userDataDir? : string ## Methods[**](#Methods) ### [**](#createController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L181)createController * ****createController**(): [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ - #### Returns [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ ### [**](#createLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L154)createLaunchContext * ****createLaunchContext**(options): [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ - Creates a `LaunchContext` with all the information needed to launch a browser. Aside from library specific launch options, it also includes internal properties used by `BrowserPool` for management of the pool and extra features. *** #### Parameters * ##### options: [CreateLaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md)\ = {} #### Returns [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ ### [**](#launch)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L188)launch * ****launch**(launchContext): Promise\ - Launches the browser using provided launch context. *** #### Parameters * ##### launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ = ... #### Returns Promise\ --- # BrowserPool \ The `BrowserPool` class is the most important class of the `browser-pool` module. It manages opening and closing of browsers and their pages and its constructor options allow easy configuration of the browsers' and pages' lifecycle. The most important and useful constructor options are the various lifecycle hooks. Those allow you to sequentially call a list of (asynchronous) functions at each stage of the browser / page lifecycle. 
**Example:** ``` import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool'; import playwright from 'playwright'; const browserPool = new BrowserPool({ browserPlugins: [new PlaywrightPlugin(playwright.chromium)], preLaunchHooks: [(pageId, launchContext) => { // do something before a browser gets launched launchContext.launchOptions.headless = false; }], postLaunchHooks: [(pageId, browserController) => { // manipulate the browser right after launch console.dir(browserController.browser.contexts()); }], prePageCreateHooks: [(pageId, browserController) => { if (pageId === 'my-page') { // make changes right before a specific page is created } }], postPageCreateHooks: [async (page, browserController) => { // update some or all new pages await page.evaluate(() => { // now all pages will have 'foo' window.foo = 'bar' }) }], prePageCloseHooks: [async (page, browserController) => { // collect information just before a page closes await page.screenshot(); }], postPageCloseHooks: [(pageId, browserController) => { // clean up or log after a job is done console.log('Page closed: ', pageId) }] }); ``` ### Hierarchy * TypedEmitter<[BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\> * *BrowserPool* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**activeBrowserControllers](#activeBrowserControllers) * [**browserPlugins](#browserPlugins) * [**closeInactiveBrowserAfterMillis](#closeInactiveBrowserAfterMillis) * [**fingerprintCache](#fingerprintCache) * [**fingerprintGenerator](#fingerprintGenerator) * [**fingerprintInjector](#fingerprintInjector) * [**fingerprintOptions](#fingerprintOptions) * [**maxOpenPagesPerBrowser](#maxOpenPagesPerBrowser) * [**operationTimeoutMillis](#operationTimeoutMillis) * [**pageCounter](#pageCounter) * [**pageIds](#pageIds) * [**pages](#pages) * [**pageToBrowserController](#pageToBrowserController) * [**postLaunchHooks](#postLaunchHooks) * [**postPageCloseHooks](#postPageCloseHooks) * [**postPageCreateHooks](#postPageCreateHooks) * [**preLaunchHooks](#preLaunchHooks) * [**prePageCloseHooks](#prePageCloseHooks) * [**prePageCreateHooks](#prePageCreateHooks) * [**retireBrowserAfterPageCount](#retireBrowserAfterPageCount) * [**retiredBrowserControllers](#retiredBrowserControllers) * [**startingBrowserControllers](#startingBrowserControllers) * [**useFingerprints](#useFingerprints) * [**defaultMaxListeners](#defaultMaxListeners) ### Methods * [**addListener](#addListener) * [**closeAllBrowsers](#closeAllBrowsers) * [**destroy](#destroy) * [**emit](#emit) * [**eventNames](#eventNames) * [**getBrowserControllerByPage](#getBrowserControllerByPage) * [**getMaxListeners](#getMaxListeners) * [**getPage](#getPage) * [**getPageId](#getPageId) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**newPage](#newPage) * [**newPageInNewBrowser](#newPageInNewBrowser) * [**newPageWithEachPlugin](#newPageWithEachPlugin) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**retireAllBrowsers](#retireAllBrowsers) * [**retireBrowserByPage](#retireBrowserByPage) * [**retireBrowserController](#retireBrowserController) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### 
[**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L338)constructor * ****new BrowserPool**\(options): [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)\ - Overrides TypedEmitter\>.constructor #### Parameters * ##### options: Options & [BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)\ #### Returns [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)\ ## Properties[**](#Properties) ### [**](#activeBrowserControllers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L322)activeBrowserControllers **activeBrowserControllers: Set\ = ... ### [**](#browserPlugins)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L305)browserPlugins **browserPlugins: BrowserPlugins ### [**](#closeInactiveBrowserAfterMillis)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L309)closeInactiveBrowserAfterMillis **closeInactiveBrowserAfterMillis: number ### [**](#fingerprintCache)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L327)optionalfingerprintCache **fingerprintCache? : QuickLRU\ ### [**](#fingerprintGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L326)optionalfingerprintGenerator **fingerprintGenerator? : FingerprintGenerator ### [**](#fingerprintInjector)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L325)optionalfingerprintInjector **fingerprintInjector? : FingerprintInjector ### [**](#fingerprintOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L311)fingerprintOptions **fingerprintOptions: [FingerprintOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md) ### [**](#maxOpenPagesPerBrowser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L306)maxOpenPagesPerBrowser **maxOpenPagesPerBrowser: number ### [**](#operationTimeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L308)operationTimeoutMillis **operationTimeoutMillis: number ### [**](#pageCounter)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L318)pageCounter **pageCounter: number = 0 ### [**](#pageIds)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L320)pageIds **pageIds: WeakMap\ = ... ### [**](#pages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L319)pages **pages: Map\ = ... ### [**](#pageToBrowserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L324)pageToBrowserController **pageToBrowserController: WeakMap\ = ... 
### [**](#postLaunchHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L313)postLaunchHooks **postLaunchHooks: [PostLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PostLaunchHook)\\[] ### [**](#postPageCloseHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L317)postPageCloseHooks **postPageCloseHooks: [PostPageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCloseHook)\\[] ### [**](#postPageCreateHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L315)postPageCreateHooks **postPageCreateHooks: [PostPageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCreateHook)\\[] ### [**](#preLaunchHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L312)preLaunchHooks **preLaunchHooks: [PreLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PreLaunchHook)\\[] ### [**](#prePageCloseHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L316)prePageCloseHooks **prePageCloseHooks: [PrePageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCloseHook)\\[] ### [**](#prePageCreateHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L314)prePageCreateHooks **prePageCreateHooks: [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook)\\[] ### [**](#retireBrowserAfterPageCount)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L307)retireBrowserAfterPageCount **retireBrowserAfterPageCount: number ### [**](#retiredBrowserControllers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L323)retiredBrowserControllers **retiredBrowserControllers: Set\ = ... ### [**](#startingBrowserControllers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L321)startingBrowserControllers **startingBrowserControllers: Set\ = ... ### [**](#useFingerprints)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L310)optionaluseFingerprints **useFingerprints? : boolean ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L10)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from TypedEmitter.defaultMaxListeners ## Methods[**](#Methods) ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L11)externalinheritedaddListener * ****addListener**\(event, listener): this - Inherited from TypedEmitter.addListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#closeAllBrowsers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L649)closeAllBrowsers * ****closeAllBrowsers**(): Promise\ - Closes all managed browsers without waiting for pages to close. *** #### Returns Promise\ ### [**](#destroy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L661)destroy * ****destroy**(): Promise\ - Closes all managed browsers and tears down the pool. 
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L19)externalinheritedemit * ****emit**\(event, ...args): boolean - Inherited from TypedEmitter.emit #### Parameters * ##### externalevent: U * ##### externalrest...args: Parameters<[BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]> #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L20)externalinheritedeventNames * ****eventNames**\(): U\[] - Inherited from TypedEmitter.eventNames #### Returns U\[] ### [**](#getBrowserControllerByPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L525)getBrowserControllerByPage * ****getBrowserControllerByPage**(page): undefined | BrowserControllerReturn - Retrieves a [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) for a given page. This is useful when you're working only with pages and need to access the browser manipulation functionality. You could access the browser directly from the page, but that would circumvent `BrowserPool` and most likely cause weird things to happen, so please always use `BrowserController` to control your browsers. The function returns `undefined` if the browser is closed. *** #### Parameters * ##### page: PageReturn Browser plugin page #### Returns undefined | BrowserControllerReturn ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L24)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from TypedEmitter.getMaxListeners #### Returns number ### [**](#getPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L535)getPage * ****getPage**(id): undefined | PageReturn - If you provided a custom ID to one of your pages or saved the randomly generated one, you can use this function to retrieve the page. If the page is no longer open, the function will return `undefined`. *** #### Parameters * ##### id: string #### Returns undefined | PageReturn ### [**](#getPageId)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L545)getPageId * ****getPageId**(page): undefined | string - Page IDs are used throughout `BrowserPool` as a method of linking events. You can use a page ID to track the full lifecycle of the page. It is created even before a browser is launched and stays with the page until it's closed. 
*** #### Parameters * ##### page: PageReturn #### Returns undefined | string ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L21)externalinheritedlistenerCount * ****listenerCount**(type): number - Inherited from TypedEmitter.listenerCount #### Parameters * ##### externaltype: keyof [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\ #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L22)externalinheritedlisteners * ****listeners**\(type): [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]\[] - Inherited from TypedEmitter.listeners #### Parameters * ##### externaltype: U #### Returns [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]\[] ### [**](#newPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L437)newPage * ****newPage**(options): Promise\ - Opens a new page in one of the running browsers or launches a new browser and opens a page there, if no browsers are active, or their page limits have been exceeded. *** #### Parameters * ##### options: [BrowserPoolNewPageOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageOptions.md)\ = {} #### Returns Promise\ ### [**](#newPageInNewBrowser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L465)newPageInNewBrowser * ****newPageInNewBrowser**(options): Promise\ - Unlike newPage, `newPageInNewBrowser` always launches a new browser to open the page in. Use the `launchOptions` option to configure the new browser. *** #### Parameters * ##### options: [BrowserPoolNewPageInNewBrowserOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageInNewBrowserOptions.md)\ = {} #### Returns Promise\ ### [**](#newPageWithEachPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L499)newPageWithEachPlugin * ****newPageWithEachPlugin**(optionsList): Promise\ - Opens new pages with all available plugins and returns an array of pages in the same order as the plugins were provided to `BrowserPool`. This is useful when you want to run a script in multiple environments at the same time, typically in testing or website analysis. 
**Example:** ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin(playwright.chromium), new PlaywrightPlugin(playwright.firefox), new PlaywrightPlugin(playwright.webkit), ] }); const pages = await browserPool.newPageWithEachPlugin(); const [chromiumPage, firefoxPage, webkitPage] = pages; ``` *** #### Parameters * ##### optionsList: Omit<[BrowserPoolNewPageOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageOptions.md)\, browserPlugin>\[] = \[] #### Returns Promise\ ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L18)externalinheritedoff * ****off**\(event, listener): this - Inherited from TypedEmitter.off #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L17)externalinheritedon * ****on**\(event, listener): this - Inherited from TypedEmitter.on #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L16)externalinheritedonce * ****once**\(event, listener): this - Inherited from TypedEmitter.once #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L12)externalinheritedprependListener * ****prependListener**\(event, listener): this - Inherited from TypedEmitter.prependListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L13)externalinheritedprependOnceListener * ****prependOnceListener**\(event, listener): this - Inherited from TypedEmitter.prependOnceListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L23)externalinheritedrawListeners * ****rawListeners**\(type): [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]\[] - Inherited from TypedEmitter.rawListeners #### Parameters * ##### externaltype: U #### Returns [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L15)externalinheritedremoveAllListeners * ****removeAllListeners**(event): this - Inherited from TypedEmitter.removeAllListeners #### Parameters * ##### externaloptionalevent: keyof BrowserPoolEvents\ #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L14)externalinheritedremoveListener * 
**removeListener**(event, listener): this - Inherited from TypedEmitter.removeListener

#### Parameters

* ##### external event: U
* ##### external listener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\[U]

#### Returns this

### [**](#retireAllBrowsers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L639)retireAllBrowsers

* **retireAllBrowsers**(): void - Removes all active browsers from the pool. The browsers will be closed after all their pages are closed.

***

#### Returns void

### [**](#retireBrowserByPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L630)retireBrowserByPage

* **retireBrowserByPage**(page): void - Removes a browser from the pool. It will be closed after all its pages are closed.

***

#### Parameters

* ##### page: PageReturn

#### Returns void

### [**](#retireBrowserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L613)retireBrowserController

* **retireBrowserController**(browserController): void - Removes a browser controller from the pool. The underlying browser will be closed after all its pages are closed.

***

#### Parameters

* ##### browserController: BrowserControllerReturn

#### Returns void

### [**](#setMaxListeners)external inherited setMaxListeners

* **setMaxListeners**(n): this - Inherited from TypedEmitter.setMaxListeners

#### Parameters

* ##### external n: number

#### Returns this

---

# LaunchContext

## Index[**](#Index)

### Constructors

* [**constructor](#constructor)

### Properties

* [**browserPerProxy](#browserPerProxy)
* [**browserPlugin](#browserPlugin)
* [**experimentalContainers](#experimentalContainers)
* [**fingerprint](#fingerprint)
* [**id](#id)
* [**launchOptions](#launchOptions)
* [**proxyTier](#proxyTier)
* [**useIncognitoPages](#useIncognitoPages)
* [**userDataDir](#userDataDir)

### Accessors

* [**proxyUrl](#proxyUrl)

### Methods

* [**extend](#extend)

## Constructors[**](#Constructors)

### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L85)constructor

* **new LaunchContext**(options): [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)

#### Parameters

* ##### options: [LaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/LaunchContextOptions.md)

#### Returns [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)

## Properties[**](#Properties)

### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L74)optional browserPerProxy

**browserPerProxy?**: boolean

### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L71)browserPlugin

**browserPlugin**: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)

### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L75)experimentalContainers

**experimentalContainers**: boolean

### [**](#fingerprint)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L82)optional fingerprint

**fingerprint?**: BrowserFingerprintWithHeaders

### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L70)optional id

**id?**: string

### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L72)launchOptions

**launchOptions**: LibraryOptions

### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L77)optional proxyTier

**proxyTier?**: number

### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L73)useIncognitoPages

**useIncognitoPages**: boolean

### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L76)userDataDir

**userDataDir**: string

## Accessors[**](#Accessors)

### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L131)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L148)proxyUrl

* **get proxyUrl**(): undefined | string
* **set proxyUrl**(url): void

- Returns the proxy URL of the browser.

***

#### Returns undefined | string

- Sets a proxy URL for the browser. Use `undefined` to unset an existing proxy URL.

***

#### Parameters

* ##### url: undefined | string

#### Returns void

## Methods[**](#Methods)

### [**](#extend)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L117)extend

* **extend**(fields): void - Extend the launch context with any extra fields. This is useful to keep state information relevant to the browser being launched. It ensures that no internal fields are overridden and should be used instead of property assignment.

***

#### Parameters

* ##### fields: T

#### Returns void

---

# PlaywrightBrowser

Browser wrapper created to have consistent API with persistent and non-persistent contexts.
### Hierarchy * EventEmitter * *PlaywrightBrowser* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\[asyncDispose\]](#\[asyncDispose]) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**browserType](#browserType) * [**close](#close) * [**contexts](#contexts) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**isConnected](#isConnected) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**newBrowserCDPSession](#newBrowserCDPSession) * [**newContext](#newContext) * [**newPage](#newPage) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**startTracing](#startTracing) * [**stopTracing](#stopTracing) * [**version](#version) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L19)constructor * ****new PlaywrightBrowser**(options): [PlaywrightBrowser](https://crawlee.dev/js/api/browser-pool/class/PlaywrightBrowser.md) - Overrides EventEmitter.constructor #### Parameters * ##### options: BrowserOptions #### Returns [PlaywrightBrowser](https://crawlee.dev/js/api/browser-pool/class/PlaywrightBrowser.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from EventEmitter.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from EventEmitter.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from EventEmitter.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. 
Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from EventEmitter.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#\[asyncDispose])[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L33)\[asyncDispose] * ****\[asyncDispose]**(): Promise\ - #### Returns Promise\ ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from EventEmitter.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from EventEmitter.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#browserType)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L58)browserType * ****browserType**(): BrowserType<{}> - #### Returns BrowserType<{}> ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L37)close * ****close**(): Promise\ - #### Returns Promise\ ### [**](#contexts)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L41)contexts * ****contexts**(): BrowserContext\[] - #### Returns BrowserContext\[] ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from EventEmitter.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from EventEmitter.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from EventEmitter.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. 
* **@since** v1.0.0 *** #### Returns number ### [**](#isConnected)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L45)isConnected * ****isConnected**(): boolean - #### Returns boolean ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from EventEmitter.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. * **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from EventEmitter.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#newBrowserCDPSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L70)newBrowserCDPSession * ****newBrowserCDPSession**(): Promise\ - #### Returns Promise\ ### [**](#newContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L66)newContext * ****newContext**(): Promise\ - #### Returns Promise\ ### [**](#newPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L62)newPage * ****newPage**(...args): Promise\ - #### Parameters * ##### rest...args: \[] #### Returns Promise\ ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from EventEmitter.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from EventEmitter.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. 
``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from EventEmitter.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from EventEmitter.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from EventEmitter.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. 
* ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from EventEmitter.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from EventEmitter.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from EventEmitter.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. 
// Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from EventEmitter.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#startTracing)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L74)startTracing * ****startTracing**(): Promise\ - #### Returns Promise\ ### [**](#stopTracing)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L78)stopTracing * ****stopTracing**(): Promise\ - #### Returns Promise\ ### [**](#version)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L49)version * ****version**(): string - #### Returns string ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from EventEmitter.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. 
``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. ### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from EventEmitter.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from EventEmitter.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from EventEmitter.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. 
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from EventEmitter.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from EventEmitter.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from EventEmitter.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # PlaywrightController The `BrowserController` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerController` or `PlaywrightController` extend. Second, it defines the public interface of the specialized classes which provide only private methods. Therefore, we do not keep documentation for the specialized classes, because it's the same for all of them. 
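To make the role of the controller concrete, here is a minimal, hedged sketch of obtaining the controller that manages a page and calling a few of the methods documented below (`setCookies`, `getCookies`, `close`). It assumes a `BrowserPool.getBrowserControllerByPage()` helper and the `@crawlee/browser-pool` exports used earlier in this reference; the cookie shape is illustrative only.

```
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
});

const page = await browserPool.newPage();

// Look up the controller that manages the browser this page was opened in.
// getBrowserControllerByPage() is assumed from the BrowserPool API.
const controller = browserPool.getBrowserControllerByPage(page);

// The controller exposes library-agnostic helpers regardless of whether
// the underlying browser is driven by Playwright or Puppeteer.
await controller.setCookies(page, [
    { name: 'session', value: 'abc123', domain: 'example.com', path: '/' },
]);
console.log(await controller.getCookies(page));

await page.close();
await controller.close(); // graceful shutdown; kill() is the forceful variant
```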
### Hierarchy * [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\\[0], Browser> * *PlaywrightController* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**activePages](#activePages) * [**browser](#browser) * [**browserPlugin](#browserPlugin) * [**id](#id) * [**isActive](#isActive) * [**lastPageOpenedAt](#lastPageOpenedAt) * [**launchContext](#launchContext) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**totalPages](#totalPages) * [**defaultMaxListeners](#defaultMaxListeners) ### Methods * [**addListener](#addListener) * [**close](#close) * [**emit](#emit) * [**eventNames](#eventNames) * [**getCookies](#getCookies) * [**getMaxListeners](#getMaxListeners) * [**kill](#kill) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setCookies](#setCookies) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L91)constructor * ****new PlaywrightController**(browserPlugin): [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) - Inherited from BrowserController< BrowserType, SafeParameters\\[0], Browser >.constructor #### Parameters * ##### browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> #### Returns [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) ## Properties[**](#Properties) ### 
[**](#activePages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L73)inheritedactivePages **activePages: number = 0 Inherited from BrowserController.activePages ### [**](#browser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L52)inheritedbrowser **browser: Browser = ... Inherited from BrowserController.browser Browser representation of the underlying automation library. ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L47)inheritedbrowserPlugin **browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> Inherited from BrowserController.browserPlugin The `BrowserPlugin` instance used to launch the browser. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L42)inheritedid **id: string = ... Inherited from BrowserController.id ### [**](#isActive)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L71)inheritedisActive **isActive: boolean = false Inherited from BrowserController.isActive ### [**](#lastPageOpenedAt)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L77)inheritedlastPageOpenedAt **lastPageOpenedAt: number = ... 
Inherited from BrowserController.lastPageOpenedAt ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L57)inheritedlaunchContext **launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> = ... Inherited from BrowserController.launchContext The configuration the browser was launched with. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L63)optionalinheritedproxyTier **proxyTier? : number Inherited from BrowserController.proxyTier The proxy tier tied to this browser controller. `undefined` if no tiered proxy is used. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L69)optionalinheritedproxyUrl **proxyUrl? : string Inherited from BrowserController.proxyUrl The proxy URL used by the browser controller. This is set every time the browser controller uses proxy (even the tiered one). 
`undefined` if no proxy is used ### [**](#totalPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L75)inheritedtotalPages **totalPages: number = 0 Inherited from BrowserController.totalPages ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L10)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from BrowserController.defaultMaxListeners ## Methods[**](#Methods) ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L11)externalinheritedaddListener * ****addListener**\(event, listener): this - Inherited from BrowserController.addListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L131)inheritedclose * ****close**(): Promise\ - Inherited from BrowserController.close Gracefully closes the browser and makes sure there will be no lingering browser processes. Emits 'browserClosed' event. 
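A small, hedged sketch of a graceful shutdown built on `close()`. It assumes the `getBrowserControllerByPage()` helper used above and relies only on the `'browserClosed'` event name mentioned in the description; the listener signature is not shown.

```
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
});
const page = await browserPool.newPage();
const controller = browserPool.getBrowserControllerByPage(page); // assumed helper

// close() resolves once the browser has shut down and, per the description
// above, emits the 'browserClosed' event.
controller.once('browserClosed', () => console.log('browser closed'));

await page.close();
await controller.close();
```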
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L19)externalinheritedemit * ****emit**\(event, ...args): boolean - Inherited from BrowserController.emit #### Parameters * ##### externalevent: U * ##### externalrest...args: Parameters<[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U]> #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L20)externalinheritedeventNames * ****eventNames**\(): U\[] - Inherited from BrowserController.eventNames #### Returns U\[] ### [**](#getCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L181)inheritedgetCookies * ****getCookies**(page): Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> - Inherited from BrowserController.getCookies #### Parameters * ##### page: Page #### Returns Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L24)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from BrowserController.getMaxListeners #### Returns number ### [**](#kill)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L156)inheritedkill * ****kill**(): Promise\ - Inherited from BrowserController.kill Immediately kills the browser process. Emits 'browserClosed' event. 
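As a hedged sketch of how `kill()` can complement `close()`, the snippet below prefers the graceful path and hard-kills the browser process only if shutdown stalls. The 30-second timeout and the `getBrowserControllerByPage()` helper are assumptions, not part of this reference.

```
import { setTimeout as sleep } from 'node:timers/promises';
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
});
const page = await browserPool.newPage();
const controller = browserPool.getBrowserControllerByPage(page); // assumed helper

await page.close();
try {
    // Give the graceful close() 30 seconds before giving up on it.
    await Promise.race([
        controller.close(),
        sleep(30_000).then(() => Promise.reject(new Error('close() timed out'))),
    ]);
} catch {
    await controller.kill(); // hard-kills the browser process
}
```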
*** #### Returns Promise\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L21)externalinheritedlistenerCount * ****listenerCount**(type): number - Inherited from BrowserController.listenerCount #### Parameters * ##### externaltype: BROWSER\_CLOSED #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L22)externalinheritedlisteners * ****listeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? 
: null | { height: number; width: number } }, Page>\[U]\[] - Inherited from BrowserController.listeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U]\[] ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L18)externalinheritedoff * ****off**\(event, listener): this - Inherited from BrowserController.off #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; 
serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L17)externalinheritedon * ****on**\(event, listener): this - Inherited from BrowserController.on #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L16)externalinheritedonce * ****once**\(event, listener): this - Inherited from BrowserController.once #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: 
boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L12)externalinheritedprependListener * ****prependListener**\(event, listener): this - Inherited from BrowserController.prependListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L13)externalinheritedprependOnceListener * ****prependOnceListener**\(event, listener): this - 
Inherited from BrowserController.prependOnceListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L23)externalinheritedrawListeners * ****rawListeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>\[U]\[] - Inherited from BrowserController.rawListeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... 
}\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U]\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L15)externalinheritedremoveAllListeners * ****removeAllListeners**(event): this - Inherited from BrowserController.removeAllListeners #### Parameters * ##### externaloptionalevent: BROWSER\_CLOSED #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L14)externalinheritedremoveListener * ****removeListener**\(event, listener): this - Inherited from BrowserController.removeListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#setCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L177)inheritedsetCookies * ****setCookies**(page, cookies): Promise\ - Inherited from BrowserController.setCookies #### Parameters * ##### page: Page * ##### cookies: [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L25)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from BrowserController.setMaxListeners #### Parameters * ##### externaln: number #### Returns this --- # PlaywrightPlugin The `BrowserPlugin` serves two purposes. 
First, it is the base class that specialized controllers like `PuppeteerPlugin` or `PlaywrightPlugin` extend. Second, it allows the user to configure the automation libraries and feed them to [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) for use. ### Hierarchy * [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\\[0], PlaywrightBrowser> * *PlaywrightPlugin* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**\_containerProxyServer](#_containerProxyServer) * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launchOptions](#launchOptions) * [**library](#library) * [**name](#name) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ### Methods * [**createController](#createController) * [**createLaunchContext](#createLaunchContext) * [**launch](#launch) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L129)constructor * ****new PlaywrightPlugin**(library, options): [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) - Inherited from BrowserPlugin.constructor #### Parameters * ##### library: BrowserType<{}> * ##### options: [BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md)\ = {} #### Returns [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) ## Properties[**](#Properties) ### [**](#_containerProxyServer)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-plugin.ts#L42)optional\_containerProxyServer **\_containerProxyServer? : { ipToProxy: Map\; port: number; close: any } #### Type declaration * ##### ipToProxy: Map\ * ##### port: number * ##### close: function * ****close**(closeConnections): Promise\ *** * #### Parameters * ##### closeConnections: boolean #### Returns Promise\ ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L127)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from BrowserPlugin.browserPerProxy ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L125)inheritedexperimentalContainers **experimentalContainers: boolean Inherited from BrowserPlugin.experimentalContainers ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L117)inheritedlaunchOptions **launchOptions: undefined | LaunchOptions Inherited from BrowserPlugin.launchOptions ### [**](#library)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L115)inheritedlibrary **library: BrowserType<{}> Inherited from BrowserPlugin.library ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L113)inheritedname **name: string = ... Inherited from BrowserPlugin.name ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L119)optionalinheritedproxyUrl **proxyUrl? 
: string Inherited from BrowserPlugin.proxyUrl ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L123)inheriteduseIncognitoPages **useIncognitoPages: boolean Inherited from BrowserPlugin.useIncognitoPages ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L121)optionalinheriteduserDataDir **userDataDir? : string Inherited from BrowserPlugin.userDataDir ## Methods[**](#Methods) ### [**](#createController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L181)inheritedcreateController * ****createController**(): [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? 
: null | { height: number; width: number } }, Page> - Inherited from BrowserPlugin.createController #### Returns [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> ### [**](#createLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L154)inheritedcreateLaunchContext * ****createLaunchContext**(options): [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> - Inherited from BrowserPlugin.createLaunchContext Creates a `LaunchContext` with all the information needed to launch a browser. Aside from library specific launch options, it also includes internal properties used by `BrowserPool` for management of the pool and extra features. *** #### Parameters * ##### options: [CreateLaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> = {} #### Returns [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; 
server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> ### [**](#launch)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L188)inheritedlaunch * ****launch**(launchContext): Promise\ - Inherited from BrowserPlugin.launch Launches the browser using provided launch context. *** #### Parameters * ##### launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> = ... #### Returns Promise\ --- # PuppeteerController The `BrowserController` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerController` or `PlaywrightController` extend. Second, it defines the public interface of the specialized classes which provide only private methods. Therefore, we do not keep documentation for the specialized classes, because it's the same for all of them. 
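Because every specialized controller implements the same public `BrowserController` interface, usage does not depend on the underlying automation library. Below is a minimal sketch, assuming Playwright is installed and using a placeholder URL: it launches a browser through a `PlaywrightPlugin`, looks up the controller that owns a page, and calls the shared cookie and shutdown methods. The identical calls work with a `PuppeteerController`.

```ts
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';
import { chromium } from 'playwright';

// The pool launches browsers through the plugin and keeps one controller per browser.
const pool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
});

const page = await pool.newPage();
await page.goto('https://example.com'); // placeholder URL

// Every page is owned by a controller exposing the shared interface documented here.
const controller = pool.getBrowserControllerByPage(page);
console.log(await controller?.getCookies(page));

// close() shuts the browser down gracefully; kill() is the immediate variant.
await controller?.close();
await pool.destroy();
```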
### Hierarchy * [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ * *PuppeteerController* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**activePages](#activePages) * [**browser](#browser) * [**browserPlugin](#browserPlugin) * [**id](#id) * [**isActive](#isActive) * [**lastPageOpenedAt](#lastPageOpenedAt) * [**launchContext](#launchContext) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**totalPages](#totalPages) * [**defaultMaxListeners](#defaultMaxListeners) ### Methods * [**addListener](#addListener) * [**close](#close) * [**emit](#emit) * [**eventNames](#eventNames) * [**getCookies](#getCookies) * [**getMaxListeners](#getMaxListeners) * [**kill](#kill) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setCookies](#setCookies) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L91)constructor * ****new PuppeteerController**(browserPlugin): [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) - Inherited from BrowserController< typeof Puppeteer, PuppeteerTypes.LaunchOptions, PuppeteerTypes.Browser, PuppeteerNewPageOptions >.constructor #### Parameters * ##### browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ #### Returns [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) ## Properties[**](#Properties) ### [**](#activePages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L73)inheritedactivePages **activePages: number = 0 Inherited from BrowserController.activePages ### [**](#browser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L52)inheritedbrowser **browser: Browser = ... Inherited from BrowserController.browser Browser representation of the underlying automation library. ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L47)inheritedbrowserPlugin **browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ Inherited from BrowserController.browserPlugin The `BrowserPlugin` instance used to launch the browser. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L42)inheritedid **id: string = ... Inherited from BrowserController.id ### [**](#isActive)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L71)inheritedisActive **isActive: boolean = false Inherited from BrowserController.isActive ### [**](#lastPageOpenedAt)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L77)inheritedlastPageOpenedAt **lastPageOpenedAt: number = ... 
Inherited from BrowserController.lastPageOpenedAt ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L57)inheritedlaunchContext **launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ = ... Inherited from BrowserController.launchContext The configuration the browser was launched with. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L63)optionalinheritedproxyTier **proxyTier? : number Inherited from BrowserController.proxyTier The proxy tier tied to this browser controller. `undefined` if no tiered proxy is used. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L69)optionalinheritedproxyUrl **proxyUrl? : string Inherited from BrowserController.proxyUrl The proxy URL used by the browser controller. This is set every time the browser controller uses a proxy (even a tiered one). `undefined` if no proxy is used. ### [**](#totalPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L75)inheritedtotalPages **totalPages: number = 0 Inherited from BrowserController.totalPages ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L10)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from BrowserController.defaultMaxListeners ## Methods[**](#Methods) ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L11)externalinheritedaddListener * ****addListener**\(event, listener): this - Inherited from BrowserController.addListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L131)inheritedclose * ****close**(): Promise\ - Inherited from BrowserController.close Gracefully closes the browser and makes sure there will be no lingering browser processes. Emits 'browserClosed' event.
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L19)externalinheritedemit * ****emit**\(event, ...args): boolean - Inherited from BrowserController.emit #### Parameters * ##### externalevent: U * ##### externalrest...args: Parameters<[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]> #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L20)externalinheritedeventNames * ****eventNames**\(): U\[] - Inherited from BrowserController.eventNames #### Returns U\[] ### [**](#getCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L181)inheritedgetCookies * ****getCookies**(page): Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> - Inherited from BrowserController.getCookies #### Parameters * ##### page: Page #### Returns Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L24)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from BrowserController.getMaxListeners #### Returns number ### [**](#kill)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L156)inheritedkill * ****kill**(): Promise\ - Inherited from BrowserController.kill Immediately kills the browser process. Emits 'browserClosed' event. *** #### Returns Promise\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L21)externalinheritedlistenerCount * ****listenerCount**(type): number - Inherited from BrowserController.listenerCount #### Parameters * ##### externaltype: BROWSER\_CLOSED #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L22)externalinheritedlisteners * ****listeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] - Inherited from BrowserController.listeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L18)externalinheritedoff * ****off**\(event, listener): this - Inherited from BrowserController.off #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L17)externalinheritedon * ****on**\(event, listener): this - Inherited from BrowserController.on #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L16)externalinheritedonce * ****once**\(event, listener): this - Inherited from BrowserController.once #### Parameters * ##### 
externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L12)externalinheritedprependListener * ****prependListener**\(event, listener): this - Inherited from BrowserController.prependListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L13)externalinheritedprependOnceListener * ****prependOnceListener**\(event, listener): this - Inherited from BrowserController.prependOnceListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L23)externalinheritedrawListeners * ****rawListeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] - Inherited from BrowserController.rawListeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L15)externalinheritedremoveAllListeners * ****removeAllListeners**(event): this - Inherited from BrowserController.removeAllListeners #### Parameters * ##### externaloptionalevent: BROWSER\_CLOSED #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L14)externalinheritedremoveListener * ****removeListener**\(event, listener): this - Inherited from BrowserController.removeListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#setCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L177)inheritedsetCookies * ****setCookies**(page, cookies): Promise\ - Inherited from BrowserController.setCookies #### Parameters * ##### page: Page * ##### cookies: [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L25)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from BrowserController.setMaxListeners #### Parameters * ##### externaln: number #### Returns this --- # PuppeteerPlugin The `BrowserPlugin` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerPlugin` or `PlaywrightPlugin` extend. Second, it allows the user to configure the automation libraries and feed them to [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) for use. 
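A minimal configuration sketch, assuming Puppeteer is installed (the proxy URL is only a placeholder): the plugin pairs the automation library with default `BrowserPluginOptions`, and `BrowserPool` then uses it to launch and manage browsers.

```ts
import { BrowserPool, PuppeteerPlugin } from '@crawlee/browser-pool';
import puppeteer from 'puppeteer';

// Defaults set here apply to every browser the pool launches with this plugin.
const puppeteerPlugin = new PuppeteerPlugin(puppeteer, {
    launchOptions: { headless: true }, // forwarded to puppeteer.launch()
    useIncognitoPages: true,           // isolate each page in its own context
    // proxyUrl: 'http://user:pass@proxy.example.com:8000', // placeholder proxy
});

const pool = new BrowserPool({ browserPlugins: [puppeteerPlugin] });
const page = await pool.newPage();
await page.goto('https://crawlee.dev');
await pool.destroy();
```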
### Hierarchy * [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ * *PuppeteerPlugin* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launchOptions](#launchOptions) * [**library](#library) * [**name](#name) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ### Methods * [**createController](#createController) * [**createLaunchContext](#createLaunchContext) * [**launch](#launch) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L129)constructor * ****new PuppeteerPlugin**(library, options): [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) - Inherited from BrowserPlugin.constructor #### Parameters * ##### library: PuppeteerNode * ##### options: [BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md)\ = {} #### Returns [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L127)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from BrowserPlugin.browserPerProxy ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L125)inheritedexperimentalContainers **experimentalContainers: boolean Inherited from BrowserPlugin.experimentalContainers ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L117)inheritedlaunchOptions **launchOptions: LaunchOptions Inherited from BrowserPlugin.launchOptions ### [**](#library)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L115)inheritedlibrary **library: PuppeteerNode Inherited from BrowserPlugin.library ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L113)inheritedname **name: string = ... Inherited from BrowserPlugin.name ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L119)optionalinheritedproxyUrl **proxyUrl? : string Inherited from BrowserPlugin.proxyUrl ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L123)inheriteduseIncognitoPages **useIncognitoPages: boolean Inherited from BrowserPlugin.useIncognitoPages ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L121)optionalinheriteduserDataDir **userDataDir? 
: string Inherited from BrowserPlugin.userDataDir ## Methods[**](#Methods) ### [**](#createController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L181)inheritedcreateController * ****createController**(): [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ - Inherited from BrowserPlugin.createController #### Returns [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ ### [**](#createLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L154)inheritedcreateLaunchContext * ****createLaunchContext**(options): [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ - Inherited from BrowserPlugin.createLaunchContext Creates a `LaunchContext` with all the information needed to launch a browser. Aside from library specific launch options, it also includes internal properties used by `BrowserPool` for management of the pool and extra features. *** #### Parameters * ##### options: [CreateLaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md)\ = {} #### Returns [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ ### [**](#launch)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L188)inheritedlaunch * ****launch**(launchContext): Promise\ - Inherited from BrowserPlugin.launch Launches the browser using provided launch context. *** #### Parameters * ##### launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ = ... #### Returns Promise\ --- # constBROWSER\_CONTROLLER\_EVENTS ## Index[**](#Index) ### Enumeration Members * [**BROWSER\_CLOSED](#BROWSER_CLOSED) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#BROWSER_CLOSED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L11)BROWSER\_CLOSED **BROWSER\_CLOSED: browserClosed --- # constBROWSER\_POOL\_EVENTS ## Index[**](#Index) ### Enumeration Members * [**BROWSER\_CLOSED](#BROWSER_CLOSED) * [**BROWSER\_LAUNCHED](#BROWSER_LAUNCHED) * [**BROWSER\_RETIRED](#BROWSER_RETIRED) * [**PAGE\_CLOSED](#PAGE_CLOSED) * [**PAGE\_CREATED](#PAGE_CREATED) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#BROWSER_CLOSED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L4)BROWSER\_CLOSED **BROWSER\_CLOSED: browserClosed ### [**](#BROWSER_LAUNCHED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L2)BROWSER\_LAUNCHED **BROWSER\_LAUNCHED: browserLaunched ### [**](#BROWSER_RETIRED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L3)BROWSER\_RETIRED **BROWSER\_RETIRED: browserRetired ### [**](#PAGE_CLOSED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L7)PAGE\_CLOSED **PAGE\_CLOSED: pageClosed ### [**](#PAGE_CREATED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L6)PAGE\_CREATED **PAGE\_CREATED: pageCreated --- # BrowserName ## Index[**](#Index) ### Enumeration Members * [**chrome](#chrome) * [**edge](#edge) * [**firefox](#firefox) * [**safari](#safari) ## Enumeration Members[**](<#Enumeration Members>) ### 
[**](#chrome)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L24)chrome **chrome: chrome ### [**](#edge)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L27)edge **edge: edge ### [**](#firefox)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L25)firefox **firefox: firefox ### [**](#safari)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L26)safari **safari: safari --- # constDeviceCategory ## Index[**](#Index) ### Enumeration Members * [**desktop](#desktop) * [**mobile](#mobile) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#desktop)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L72)desktop **desktop: desktop Describes desktop computers and laptops. These devices usually have larger, horizontal screens and load full-sized versions of websites. ### [**](#mobile)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L68)mobile **mobile: mobile Describes mobile devices (mobile phones, tablets...). These devices usually have smaller, vertical screens and load lighter versions of websites. > Note: Generating `android` and `ios` devices will not work without setting the device to `mobile` first. --- # constOperatingSystemsName ## Index[**](#Index) ### Enumeration Members * [**android](#android) * [**ios](#ios) * [**linux](#linux) * [**macos](#macos) * [**windows](#windows) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#android)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L56)android **android: android `android` is (mostly) a mobile operating system. You can use this option only together with the `mobile` device category. ### [**](#ios)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L60)ios **ios: ios `ios` is a mobile operating system. You can use this option only together with the `mobile` device category. 
### [**](#linux)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L50)linux **linux: linux ### [**](#macos)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L51)macos **macos: macos ### [**](#windows)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L52)windows **windows: windows --- # BrowserControllerEvents \ ## Index[**](#Index) ### Properties * [**browserClosed](#browserClosed) ## Properties[**](#Properties) ### [**](#browserClosed)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L22)browserClosed **browserClosed: (controller) => void #### Type declaration * * **(controller): void - #### Parameters * ##### controller: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ #### Returns void --- # BrowserPluginOptions \ ### Hierarchy * *BrowserPluginOptions* * [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launchOptions](#launchOptions) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L84)optionalbrowserPerProxy **browserPerProxy? : boolean If set to `true`, the crawler respects the proxy url generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L73)optionalexperimentalContainersexperimental **experimentalContainers? : boolean Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L54)optionallaunchOptions **launchOptions? : LibraryOptions Options that will be passed down to the automation library. E.g. `puppeteer.launch(launchOptions);`. This is a good place to set options that you want to apply as defaults. To dynamically override those options per-browser, see the `preLaunchHooks` of [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md). ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L60)optionalproxyUrl **proxyUrl? : string Automation libraries configure proxies differently. This helper allows you to set a proxy URL without worrying about specific implementations. It also allows you to use an authenticated proxy without extra code. ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L67)optionaluseIncognitoPages **useIncognitoPages? : boolean = false By default, pages share the same browser context. If set to `true`, each page uses its own context that is destroyed once the page is closed or crashes.
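For illustration only, here is a minimal sketch of how these plugin options might be combined when creating a `PlaywrightPlugin` (assuming Playwright is installed; the proxy URL and launch options are placeholders, not recommended values):

```
import { chromium } from 'playwright';
import { PlaywrightPlugin } from '@crawlee/browser-pool';

// A sketch only: launchOptions are passed straight to chromium.launch(),
// proxyUrl is applied to every browser launched by this plugin, and
// useIncognitoPages gives each page its own short-lived browser context.
const plugin = new PlaywrightPlugin(chromium, {
    launchOptions: { headless: true },
    proxyUrl: 'http://user:password@proxy.example.com:8000', // placeholder
    useIncognitoPages: true,
});
```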
### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L77)optionaluserDataDir **userDataDir? : string Path to a User Data Directory, which stores browser session data like cookies and local storage. --- # BrowserPoolEvents \ ## Index[**](#Index) ### Properties * [**browserLaunched](#browserLaunched) * [**browserRetired](#browserRetired) * [**pageClosed](#pageClosed) * [**pageCreated](#pageCreated) ## Properties[**](#Properties) ### [**](#browserLaunched)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L33)browserLaunched **browserLaunched: (browserController) => void | Promise\ #### Type declaration * * **(browserController): void | Promise\ - #### Parameters * ##### browserController: BC #### Returns void | Promise\ ### [**](#browserRetired)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L32)browserRetired **browserRetired: (browserController) => void | Promise\ #### Type declaration * * **(browserController): void | Promise\ - #### Parameters * ##### browserController: BC #### Returns void | Promise\ ### [**](#pageClosed)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L31)pageClosed **pageClosed: (page) => void | Promise\ #### Type declaration * * **(page): void | Promise\ - #### Parameters * ##### page: Page #### Returns void | Promise\ ### [**](#pageCreated)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L30)pageCreated **pageCreated: (page) => void | Promise\ #### Type declaration * * **(page): void | Promise\ - #### Parameters * ##### page: Page #### Returns void | Promise\ --- # BrowserPoolHooks \ ## Index[**](#Index) ### Properties * [**postLaunchHooks](#postLaunchHooks) * [**postPageCloseHooks](#postPageCloseHooks) * [**postPageCreateHooks](#postPageCreateHooks) * [**preLaunchHooks](#preLaunchHooks) * [**prePageCloseHooks](#prePageCloseHooks) * [**prePageCreateHooks](#prePageCreateHooks) ## Properties[**](#Properties) ### [**](#postLaunchHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L212)optionalpostLaunchHooks **postLaunchHooks? : [PostLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PostLaunchHook)\\[] Post-launch hooks are executed as soon as a browser is launched. The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) To guarantee order of execution before other hooks in the same browser, the [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) methods cannot be used until the post-launch hooks complete. If you attempt to call `await browserController.close()` from a post-launch hook, it will deadlock the process. This API is subject to change. ### [**](#postPageCloseHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L245)optionalpostPageCloseHooks **postPageCloseHooks? : [PostPageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCloseHook)\\[] Post-page-close hooks allow you to do page related clean up. 
The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) ### [**](#postPageCreateHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L231)optionalpostPageCreateHooks **postPageCreateHooks? : [PostPageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCreateHook)\\[] Post-page-create hooks are called right after a new page is created and all internal actions of Browser Pool are completed. This is the place to make changes to a page that you would like to apply to all pages. Such as injecting a JavaScript library into all pages. The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) ### [**](#preLaunchHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L202)optionalpreLaunchHooks **preLaunchHooks? : [PreLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PreLaunchHook)\\[] Pre-launch hooks are executed just before a browser is launched and provide a good opportunity to dynamically change the launch options. The hooks are called with two arguments: `pageId`: `string` and `launchContext`: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) ### [**](#prePageCloseHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L239)optionalprePageCloseHooks **prePageCloseHooks? : [PrePageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCloseHook)\\[] Pre-page-close hooks give you the opportunity to make last second changes in a page that's about to be closed, such as saving a snapshot or updating state. The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) ### [**](#prePageCreateHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L222)optionalprePageCreateHooks **prePageCreateHooks? : [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook)\\[0]>\[] Pre-page-create hooks are executed just before a new page is created. They are useful to make dynamic changes to the browser before opening a page. The hooks are called with three arguments: `pageId`: `string`, `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) and `pageOptions`: `object|undefined` - This only works if the underlying `BrowserController` supports new page options. So far, new page options are only supported by `PlaywrightController` in incognito contexts. If the page options are not supported by `BrowserController` the `pageOptions` argument is `undefined`. --- # BrowserPoolNewPageInNewBrowserOptions \ ## Index[**](#Index) ### Properties * [**browserPlugin](#browserPlugin) * [**id](#id) * [**launchOptions](#launchOptions) * [**pageOptions](#pageOptions) ## Properties[**](#Properties) ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L912)optionalbrowserPlugin **browserPlugin? : BP Provide a plugin to launch the browser. If none is provided, one of the pool's available plugins will be used. 
If you configured `BrowserPool` to rotate multiple libraries, such as both Puppeteer and Playwright, you should always set the `browserPlugin` when using the `launchOptions` option. The plugin will not be added to the list of plugins used by the pool. You can either use one of those to launch a specific browser, or provide a completely new configuration. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L894)optionalid **id? : string Assign a custom ID to the page. If you don't, a random string ID will be generated. ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L916)optionallaunchOptions **launchOptions? : BP\[launchOptions] Options that will be used to launch the new browser. ### [**](#pageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L899)optionalpageOptions **pageOptions? : PageOptions Some libraries (Playwright) allow you to open new pages with specific options. Use this property to set those options. --- # BrowserPoolNewPageOptions \ ## Index[**](#Index) ### Properties * [**browserPlugin](#browserPlugin) * [**id](#id) * [**pageOptions](#pageOptions) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) ## Properties[**](#Properties) ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L878)optionalbrowserPlugin **browserPlugin? : BP Choose a plugin to open the page with. If none is provided, one of the pool's available plugins will be used. It must be one of the plugins the browser pool was created with. If you wish to start a browser with a different configuration, see the `newPageInNewBrowser` function. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L864)optionalid **id? : string Assign a custom ID to the page. If you don't, a random string ID will be generated. ### [**](#pageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L869)optionalpageOptions **pageOptions? : PageOptions Some libraries (Playwright) allow you to open new pages with specific options. Use this property to set those options. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L886)optionalproxyTier **proxyTier? : number Proxy tier. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L882)optionalproxyUrl **proxyUrl? : string Proxy URL. --- # BrowserPoolOptions \ ## Index[**](#Index) ### Properties * [**browserPlugins](#browserPlugins) * [**closeInactiveBrowserAfterSecs](#closeInactiveBrowserAfterSecs) * [**fingerprintOptions](#fingerprintOptions) * [**maxOpenPagesPerBrowser](#maxOpenPagesPerBrowser) * [**operationTimeoutSecs](#operationTimeoutSecs) * [**retireBrowserAfterPageCount](#retireBrowserAfterPageCount) * [**retireInactiveBrowserAfterSecs](#retireInactiveBrowserAfterSecs) * [**useFingerprints](#useFingerprints) ## Properties[**](#Properties) ### [**](#browserPlugins)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L67)browserPlugins **browserPlugins: readonly Plugin\[] Browser plugins are wrappers of browser automation libraries that allow `BrowserPool` to control browsers with those libraries. `browser-pool` comes with a `PuppeteerPlugin` and a `PlaywrightPlugin`.
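As a rough sketch (not an authoritative recipe, and assuming Playwright is installed), plugins and the sizing options below might be wired into a pool like this; the pre-launch hook, described in the hooks section above, simply adjusts the launch options:

```
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
    maxOpenPagesPerBrowser: 20,
    retireBrowserAfterPageCount: 100,
    // Pre-launch hooks receive (pageId, launchContext) and may adjust launch options.
    preLaunchHooks: [
        async (pageId, launchContext) => {
            launchContext.launchOptions = { ...launchContext.launchOptions, headless: true };
        },
    ],
});

const page = await browserPool.newPage();
// ... work with the page, then shut the pool down.
await browserPool.destroy();
```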
### [**](#closeInactiveBrowserAfterSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L101)optionalcloseInactiveBrowserAfterSecs **closeInactiveBrowserAfterSecs? : number = 300 Browsers normally close immediately after their last page is processed. However, there could be situations where this does not happen. Browser Pool makes sure all inactive browsers are closed regularly, to free resources. ### [**](#fingerprintOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L116)optionalfingerprintOptions **fingerprintOptions? : [FingerprintOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md) ### [**](#maxOpenPagesPerBrowser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L74)optionalmaxOpenPagesPerBrowser **maxOpenPagesPerBrowser? : number = 20 Sets the maximum number of pages that can be open in a browser at the same time. Once reached, a new browser will be launched to handle the excess. ### [**](#operationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L93)optionaloperationTimeoutSecs **operationTimeoutSecs? : number = 15 As we know from experience, async operations of the underlying libraries, such as launching a browser or opening a new page, can get stuck. To prevent `BrowserPool` from getting stuck, we add a timeout to those operations and you can configure it with this option. ### [**](#retireBrowserAfterPageCount)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L84)optionalretireBrowserAfterPageCount **retireBrowserAfterPageCount? : number = 100 Browsers tend to get bloated after processing a lot of pages. This option configures the maximum number of processed pages after which the browser will automatically retire and close. A new browser will launch in its place. The browser might be retired sooner if the connected [Session](https://crawlee.dev/js/api/core/class/Session.md) is retired. You can change session retirement behavior using [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). ### [**](#retireInactiveBrowserAfterSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L111)optionalretireInactiveBrowserAfterSecs **retireInactiveBrowserAfterSecs? : number = 10 Browsers are marked as retired after they have been inactive for a certain amount of time. This option sets the interval at which the browsers are checked and retired if they are inactive. Retired browsers are closed after all their pages are closed. ### [**](#useFingerprints)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L115)optionaluseFingerprints **useFingerprints? : boolean = true --- # BrowserSpecification ## Index[**](#Index) ### Properties * [**httpVersion](#httpVersion) * [**maxVersion](#maxVersion) * [**minVersion](#minVersion) * [**name](#name) ## Properties[**](#Properties) ### [**](#httpVersion)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L46)optionalhttpVersion **httpVersion? : 1 | 2 HTTP version to be used for header generation (the headers differ depending on the version). ### [**](#maxVersion)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L42)optionalmaxVersion **maxVersion? : number Maximum version of browser used. 
### [**](#minVersion)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L38)optionalminVersion **minVersion? : number Minimum version of browser used. ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L34)name **name: [BrowserName](https://crawlee.dev/js/api/browser-pool/enum/BrowserName.md) String representing the browser name. --- # CommonLibrary Each plugin expects an instance of the object with the `.launch()` property. For Puppeteer, it is the `puppeteer` module itself, whereas for Playwright it is one of the browser types, such as `playwright.chromium`. `BrowserPlugin` does not include the library, so you can choose any version or fork of the library. It also keeps the `browser-pool` installation small. ## Index[**](#Index) ### Properties * [**name](#name) * [**product](#product) ### Methods * [**launch](#launch) ## Properties[**](#Properties) ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L33)optionalname **name? : () => string #### Type declaration * * **(): string - #### Returns string ### [**](#product)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L31)optionalproduct **product? : string ## Methods[**](#Methods) ### [**](#launch)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L32)launch * ****launch**(opts): Promise\ - #### Parameters * ##### optionalopts: Dictionary #### Returns Promise\ --- # CreateLaunchContextOptions \ ### Hierarchy * Partial\, browserPlugin>> * *CreateLaunchContextOptions* ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**id](#id) * [**launchOptions](#launchOptions) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L43)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from Partial.browserPerProxy If set to `true`, the crawler respects the proxy url generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L54)optionalinheritedexperimentalContainersexperimental **experimentalContainers? : boolean Inherited from Partial.experimentalContainers Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L27)optionalinheritedid **id? : string Inherited from Partial.id To make identification of `LaunchContext` easier, `BrowserPool` assigns the `LaunchContext` an `id` that's equal to the `id` of the page that triggered the browser launch. This is useful, because many pages share a single launch context (single browser). ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L36)optionalinheritedlaunchOptions **launchOptions?
: LibraryOptions Inherited from Partial.launchOptions The actual options the browser was launched with, after changes. Those changes would be typically made in pre-launch hooks. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L60)optionalinheritedproxyTier **proxyTier? : number Inherited from Partial.proxyTier ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L59)optionalinheritedproxyUrl **proxyUrl? : string Inherited from Partial.proxyUrl ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L48)optionalinheriteduseIncognitoPages **useIncognitoPages? : boolean Inherited from Partial.useIncognitoPages By default pages share the same browser context. If set to `true` each page uses its own context that is destroyed once the page is closed or crashes. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L58)optionalinheriteduserDataDir **userDataDir? : string Inherited from Partial.userDataDir Path to a User Data Directory, which stores browser session data like cookies and local storage. --- # FingerprintGenerator ## Index[**](#Index) ### Properties * [**getFingerprint](#getFingerprint) ## Properties[**](#Properties) ### [**](#getFingerprint)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L7)getFingerprint **getFingerprint: (fingerprintGeneratorOptions) => [GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) #### Type declaration * * **(fingerprintGeneratorOptions): [GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) - #### Parameters * ##### optionalfingerprintGeneratorOptions: [FingerprintGeneratorOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGeneratorOptions.md) #### Returns [GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) --- # FingerprintGeneratorOptions ### Hierarchy * Partial\ * *FingerprintGeneratorOptions* ## Index[**](#Index) ### Properties * [**browserListQuery](#browserListQuery) * [**browsers](#browsers) * [**devices](#devices) * [**httpVersion](#httpVersion) * [**locales](#locales) * [**mockWebRTC](#mockWebRTC) * [**operatingSystems](#operatingSystems) * [**screen](#screen) * [**slim](#slim) * [**strict](#strict) ## Properties[**](#Properties) ### [**](#browserListQuery)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L66)externaloptionalinheritedbrowserListQuery **browserListQuery? : string Inherited from Partial.browserListQuery Browser generation query based on the real world data. For more info see the [query docs](https://github.com/browserslist/browserslist#full-list). If `browserListQuery` is passed the `browsers` array is ignored. ### [**](#browsers)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L60)externaloptionalinheritedbrowsers **browsers? : BrowsersType Inherited from Partial.browsers List of BrowserSpecifications to generate the headers for, or one of `chrome`, `edge`, `firefox` and `safari`. ### [**](#devices)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L74)externaloptionalinheriteddevices **devices? 
: (desktop | mobile)\[] Inherited from Partial.devices List of devices to generate the headers for. ### [**](#httpVersion)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L85)externaloptionalinheritedhttpVersion **httpVersion? : 1 | 2 Inherited from Partial.httpVersion Http version to be used to generate headers (the headers differ depending on the version). Can be either 1 or 2. Default value is 2. ### [**](#locales)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L80)externaloptionalinheritedlocales **locales? : string\[] Inherited from Partial.locales List of at most 10 languages to include in the [Accept-Language](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language) request header in the language format accepted by that header, for example `en`, `en-US` or `de`. ### [**](#mockWebRTC)[**](https://undefined/apify/crawlee/blob/master/node_modules/fingerprint-generator/fingerprint-generator.d.ts#L99)externaloptionalinheritedmockWebRTC **mockWebRTC? : boolean Inherited from Partial.mockWebRTC ### [**](#operatingSystems)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L70)externaloptionalinheritedoperatingSystems **operatingSystems? : (windows | macos | linux | android | ios)\[] Inherited from Partial.operatingSystems List of operating systems to generate the headers for. ### [**](#screen)[**](https://undefined/apify/crawlee/blob/master/node_modules/fingerprint-generator/fingerprint-generator.d.ts#L93)externaloptionalinheritedscreen **screen? : { maxHeight? : number; maxWidth? : number; minHeight? : number; minWidth? : number } Inherited from Partial.screen Defines the screen dimensions of the generated fingerprint. **Note:** Using this option can lead to a substantial performance drop (\~0.0007s/fingerprint -> \~0.03s/fingerprint) *** #### Type declaration * ##### externaloptionalmaxHeight?: number * ##### externaloptionalmaxWidth?: number * ##### externaloptionalminHeight?: number * ##### externaloptionalminWidth?: number ### [**](#slim)[**](https://undefined/apify/crawlee/blob/master/node_modules/fingerprint-generator/fingerprint-generator.d.ts#L106)externaloptionalinheritedslim **slim? : boolean Inherited from Partial.slim Enables the slim mode for the fingerprint injection. This disables some performance-heavy evasions, but might decrease benchmark scores. Try enabling this if you are experiencing performance issues with the fingerprint injection. ### [**](#strict)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L91)externaloptionalinheritedstrict **strict? : boolean Inherited from Partial.strict If true, the generator will throw an error if it cannot generate headers based on the input. By default (strict: false), the generator will try to relax some requirements and generate headers based on the relaxed input. --- # FingerprintOptions Settings for the fingerprint generator and virtual session management system. > To set the specific fingerprint generation options (operating system, device type, screen dimensions), use the `fingerprintGeneratorOptions` property. 
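To make the relationship between these settings concrete, the following is a minimal sketch (values chosen purely for illustration, assuming Playwright is installed) of passing `fingerprintOptions` to `BrowserPool`; note the `mobile`/`android` pairing required by the device and operating-system enums described earlier:

```
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
    useFingerprints: true, // the default
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            // 'android' and 'ios' only work together with the 'mobile' device category.
            devices: ['mobile'],
            operatingSystems: ['android'],
            browsers: ['chrome'],
            locales: ['en-US'],
        },
        useFingerprintCache: true,
        fingerprintCacheSize: 10000,
    },
});
```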
## Index[**](#Index) ### Properties * [**fingerprintCacheSize](#fingerprintCacheSize) * [**fingerprintGeneratorOptions](#fingerprintGeneratorOptions) * [**useFingerprintCache](#useFingerprintCache) ## Properties[**](#Properties) ### [**](#fingerprintCacheSize)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L58)optionalfingerprintCacheSize **fingerprintCacheSize? : number = 10000 The maximum number of fingerprints that can be stored in the cache. Only relevant if `useFingerprintCache` is set to `true`. ### [**](#fingerprintGeneratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L45)optionalfingerprintGeneratorOptions **fingerprintGeneratorOptions? : [FingerprintGeneratorOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGeneratorOptions.md) Customizes the fingerprint generation by setting e.g. the device type, operating system or screen size. ### [**](#useFingerprintCache)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L51)optionaluseFingerprintCache **useFingerprintCache? : boolean = true Enables the virtual session management system. This ties every Crawlee session with a specific browser fingerprint, so your scraping activity seems more natural to the target website. --- # GetFingerprintReturn ## Index[**](#Index) ### Properties * [**fingerprint](#fingerprint) ## Properties[**](#Properties) ### [**](#fingerprint)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L11)fingerprint **fingerprint: BrowserFingerprintWithHeaders --- # LaunchContextOptions \ `LaunchContext` holds information about the launched browser. It's useful to retrieve the `launchOptions`, the proxy the browser was launched with or any other information user chose to add to the `LaunchContext` by calling its `extend` function. This is very useful to keep track of browser-scoped values, such as session IDs. ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**browserPlugin](#browserPlugin) * [**experimentalContainers](#experimentalContainers) * [**id](#id) * [**launchOptions](#launchOptions) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L43)optionalbrowserPerProxy **browserPerProxy? : boolean If set to `true`, the crawler respects the proxy url generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L31)browserPlugin **browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ The `BrowserPlugin` instance used to launch the browser. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L54)optionalexperimentalContainersexperimental **experimentalContainers? : boolean Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. 
### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L27)optionalid **id? : string To make identification of `LaunchContext` easier, `BrowserPool` assigns the `LaunchContext` an `id` that's equal to the `id` of the page that triggered the browser launch. This is useful, because many pages share a single launch context (single browser). ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L36)launchOptions **launchOptions: LibraryOptions The actual options the browser was launched with, after changes. Those changes would be typically made in pre-launch hooks. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L60)optionalproxyTier **proxyTier? : number ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L59)optionalproxyUrl **proxyUrl? : string ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L48)optionaluseIncognitoPages **useIncognitoPages? : boolean By default, pages share the same browser context. If set to `true`, each page uses its own context that is destroyed once the page is closed or crashes. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L58)optionaluserDataDir **userDataDir? : string Path to a User Data Directory, which stores browser session data like cookies and local storage. --- # @crawlee/cheerio Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [cheerio](https://www.npmjs.com/package/cheerio) HTML parser. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `CheerioCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless browser. `CheerioCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [Cheerio](https://www.npmjs.com/package/cheerio) and then invokes the user-provided [CheerioCrawlerOptions.requestHandler](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestHandler) to extract page data using a [jQuery](https://jquery.com/)-like interface to the parsed HTML DOM. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [CheerioCrawlerOptions.requestList](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestList) or [CheerioCrawlerOptions.requestQueue](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestQueue) constructor options, respectively.
If both [CheerioCrawlerOptions.requestList](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestList) and [CheerioCrawlerOptions.requestQueue](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, `CheerioCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [CheerioCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [CheerioCrawlerOptions.requestHandler](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `CheerioCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `CheerioCrawler` constructor. ## Example usage[​](#example-usage "Direct link to Example usage") ``` const crawler = new CheerioCrawler({ requestList, async requestHandler({ request, response, body, contentType, $ }) { const data = []; // Do some data extraction from the page with Cheerio. $('.some-collection').each((index, el) => { data.push({ title: $(el).find('.some-title').text() }); }); // Save the data to dataset. 
await Dataset.pushData({ url: request.url, html: body, data, }) }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ## Index[**](#Index) ### Crawlers * [**CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/cheerio-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/cheerio-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/cheerio-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/cheerio-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/cheerio-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/cheerio-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/cheerio-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/cheerio-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/cheerio-crawler.md#BLOCKED_STATUS_CODES) * [**ByteCounterStream](https://crawlee.dev/js/api/cheerio-crawler.md#ByteCounterStream) * [**checkStorageAccess](https://crawlee.dev/js/api/cheerio-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/cheerio-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/cheerio-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/cheerio-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/cheerio-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/cheerio-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/cheerio-crawler.md#CreateContextOptions) * [**createFileRouter](https://crawlee.dev/js/api/cheerio-crawler.md#createFileRouter) * [**createHttpRouter](https://crawlee.dev/js/api/cheerio-crawler.md#createHttpRouter) * [**CreateSession](https://crawlee.dev/js/api/cheerio-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/cheerio-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/cheerio-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetMapper) * 
[**DatasetOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/cheerio-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/cheerio-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/cheerio-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/cheerio-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/cheerio-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/cheerio-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/cheerio-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/cheerio-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/cheerio-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/cheerio-crawler.md#EventTypeName) * [**FileDownload](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownload) * [**FileDownloadCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadCrawlingContext) * [**FileDownloadErrorHandler](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadErrorHandler) * [**FileDownloadHook](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadHook) * [**FileDownloadOptions](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadOptions) * [**FileDownloadRequestHandler](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadRequestHandler) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/cheerio-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/cheerio-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/cheerio-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/cheerio-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/cheerio-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/cheerio-crawler.md#GotScrapingHttpClient) * [**HttpCrawler](https://crawlee.dev/js/api/cheerio-crawler.md#HttpCrawler) * [**HttpCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#HttpCrawlerOptions) * [**HttpCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#HttpCrawlingContext) * [**HttpErrorHandler](https://crawlee.dev/js/api/cheerio-crawler.md#HttpErrorHandler) * [**HttpHook](https://crawlee.dev/js/api/cheerio-crawler.md#HttpHook) * [**HttpRequest](https://crawlee.dev/js/api/cheerio-crawler.md#HttpRequest) * [**HttpRequestHandler](https://crawlee.dev/js/api/cheerio-crawler.md#HttpRequestHandler) * [**HttpRequestOptions](https://crawlee.dev/js/api/cheerio-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/cheerio-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/cheerio-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/cheerio-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/cheerio-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/cheerio-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/cheerio-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/cheerio-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/cheerio-crawler.md#KeyValueStoreOptions) * 
[**LoadedRequest](https://crawlee.dev/js/api/cheerio-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/cheerio-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/cheerio-crawler.md#log) * [**Log](https://crawlee.dev/js/api/cheerio-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/cheerio-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/cheerio-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/cheerio-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/cheerio-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/cheerio-crawler.md#MAX_POOL_SIZE) * [**MinimumSpeedStream](https://crawlee.dev/js/api/cheerio-crawler.md#MinimumSpeedStream) * [**NonRetryableError](https://crawlee.dev/js/api/cheerio-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/cheerio-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/cheerio-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/cheerio-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/cheerio-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/cheerio-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/cheerio-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/cheerio-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/cheerio-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/cheerio-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/cheerio-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/cheerio-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/cheerio-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/cheerio-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/cheerio-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/cheerio-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/cheerio-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/cheerio-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/cheerio-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/cheerio-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/cheerio-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/cheerio-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/cheerio-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/cheerio-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/cheerio-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestOptions) * 
[**RequestProvider](https://crawlee.dev/js/api/cheerio-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/cheerio-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/cheerio-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/cheerio-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/cheerio-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/cheerio-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/cheerio-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/cheerio-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/cheerio-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/cheerio-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/cheerio-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/cheerio-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/cheerio-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/cheerio-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/cheerio-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/cheerio-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/cheerio-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/cheerio-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/cheerio-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/cheerio-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/cheerio-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/cheerio-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/cheerio-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/cheerio-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/cheerio-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/cheerio-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/cheerio-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#StorageManagerOptions) * [**StreamHandlerContext](https://crawlee.dev/js/api/cheerio-crawler.md#StreamHandlerContext) * 
[**StreamingHttpResponse](https://crawlee.dev/js/api/cheerio-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/cheerio-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/cheerio-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/cheerio-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/cheerio-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/cheerio-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/cheerio-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/cheerio-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/cheerio-crawler.md#withCheckedStorageAccess) * [**CheerioCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md) * [**CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md) * [**CheerioErrorHandler](https://crawlee.dev/js/api/cheerio-crawler.md#CheerioErrorHandler) * [**CheerioHook](https://crawlee.dev/js/api/cheerio-crawler.md#CheerioHook) * [**CheerioRequestHandler](https://crawlee.dev/js/api/cheerio-crawler.md#CheerioRequestHandler) * [**createCheerioRouter](https://crawlee.dev/js/api/cheerio-crawler/function/createCheerioRouter.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### 
[**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#ByteCounterStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L116)ByteCounterStream Re-exports [ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### 
[**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#createFileRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L304)createFileRouter Re-exports [createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) ### [**](#createHttpRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L1068)createHttpRouter Re-exports [createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### 
[**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#FileDownload)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L184)FileDownload Re-exports [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) ### [**](#FileDownloadCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L52)FileDownloadCrawlingContext Re-exports [FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) ### [**](#FileDownloadErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L20)FileDownloadErrorHandler Re-exports [FileDownloadErrorHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadErrorHandler) ### 
[**](#FileDownloadHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L47)FileDownloadHook Re-exports [FileDownloadHook](https://crawlee.dev/js/api/http-crawler.md#FileDownloadHook) ### [**](#FileDownloadOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L34)FileDownloadOptions Re-exports [FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions) ### [**](#FileDownloadRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L57)FileDownloadRequestHandler Re-exports [FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L330)HttpCrawler Re-exports [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ### [**](#HttpCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L80)HttpCrawlerOptions Re-exports [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) ### [**](#HttpCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L255)HttpCrawlingContext Re-exports [HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) ### [**](#HttpErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L75)HttpErrorHandler Re-exports [HttpErrorHandler](https://crawlee.dev/js/api/http-crawler.md#HttpErrorHandler) ### [**](#HttpHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L194)HttpHook Re-exports [HttpHook](https://crawlee.dev/js/api/http-crawler.md#HttpHook) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports 
[HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L258)HttpRequestHandler Re-exports [HttpRequestHandler](https://crawlee.dev/js/api/http-crawler.md#HttpRequestHandler) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports 
[LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#MinimumSpeedStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L71)MinimumSpeedStream Re-exports [MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### 
[**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### 
[**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### 
[**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports 
[SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### 
[**](#StreamHandlerContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L25)StreamHandlerContext Re-exports [StreamHandlerContext](https://crawlee.dev/js/api/http-crawler.md#StreamHandlerContext) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#CheerioErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L26)CheerioErrorHandler **CheerioErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#CheerioHook)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L36)CheerioHook **CheerioHook\: InternalHttpHook<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#CheerioRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L82)CheerioRequestHandler 
**CheerioRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. ## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/cheerio ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/cheerio ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/cheerio # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/cheerio ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/cheerio # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Features[​](#features "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct 
link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-1 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-2 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * **cheerio:** don't decode HTML entities in `context.body` ([#2838](https://github.com/apify/crawlee/issues/2838)) ([32d6d0e](https://github.com/apify/crawlee/commit/32d6d0ee7e7eaad1a401f4884926f31e0f68cc55)), closes [#2401](https://github.com/apify/crawlee/issues/2401) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/cheerio ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/cheerio # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/cheerio ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/cheerio ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/cheerio ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/cheerio ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/cheerio ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/cheerio # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/cheerio ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/cheerio ## 
[3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/cheerio ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-3 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/cheerio ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/cheerio # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/cheerio ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/cheerio ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/cheerio # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/cheerio ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/cheerio ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/cheerio # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) **Note:** Version bump only for package @crawlee/cheerio ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/cheerio ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/cheerio ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/cheerio # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/cheerio ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/cheerio ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/cheerio # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/cheerio ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") 
**Note:** Version bump only for package @crawlee/cheerio ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/cheerio ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/cheerio ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-4 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/cheerio ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/cheerio ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/cheerio # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) **Note:** Version bump only for package @crawlee/cheerio ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/cheerio ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/cheerio # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * respect `` when enqueuing ([#1936](https://github.com/apify/crawlee/issues/1936)) ([aeef572](https://github.com/apify/crawlee/commit/aeef57231c84671374ed0309b7b95fa9ce9a6e8b)) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/cheerio ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-5 "Direct link to Features") * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Features[​](#features-6 "Direct link to Features") * add `parseWithCheerio` context helper to cheerio crawler 
([b336a73](https://github.com/apify/crawlee/commit/b336a739117a6e4180492ec9915ddce128376a2c)) * **jsdom:** add `parseWithCheerio` context helper ([c8f0796](https://github.com/apify/crawlee/commit/c8f0796aebc0dfa6e6d04740a0bb7d8ddd5b2d96)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * **CheerioCrawler:** pass `isXml` down to response parser ([#1807](https://github.com/apify/crawlee/issues/1807)) ([af7a5c4](https://github.com/apify/crawlee/commit/af7a5c4efa94a53e5bdfeca340a9d7223d7dfda4)), closes [#1794](https://github.com/apify/crawlee/issues/1794) * ignore invalid URLs in `enqueueLinks` in browser crawlers ([#1803](https://github.com/apify/crawlee/issues/1803)) ([5ac336c](https://github.com/apify/crawlee/commit/5ac336c5b83b212fd6281659b8ceee091e259ff1)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/cheerio ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/cheerio # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/cheerio ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/cheerio ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/cheerio ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/cheerio # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/cheerio ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/cheerio --- # CheerioCrawler Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [cheerio](https://www.npmjs.com/package/cheerio) HTML parser. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `CheerioCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient in terms of data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because those crawlers load pages using a full-featured headless browser.
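As a rough sketch of that idea (the start URL and the glob pattern below are placeholders, not recommendations), a recursive crawl over plain HTTP can be as short as:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        log.info(`Visited ${request.url}: ${$('title').text()}`);
        // Newly discovered links go to the request queue, which drives the recursive crawl.
        await enqueueLinks({ globs: ['https://crawlee.dev/js/**'] });
    },
});

// A static list of start URLs seeds the crawl.
await crawler.run(['https://crawlee.dev/js']);
```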
`CheerioCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [Cheerio](https://www.npmjs.com/package/cheerio), and then invokes the user-provided [CheerioCrawlerOptions.requestHandler](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestHandler) to extract page data using a [jQuery](https://jquery.com/)-like interface to the parsed HTML DOM.

The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [CheerioCrawlerOptions.requestList](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestList) or [CheerioCrawlerOptions.requestQueue](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestQueue) constructor options, respectively. If both [CheerioCrawlerOptions.requestList](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestList) and [CheerioCrawlerOptions.requestQueue](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl.

We can use the `preNavigationHooks` option to adjust `gotOptions` before each request is made (the header tweak below is only an illustration):

```
preNavigationHooks: [
    (crawlingContext, gotOptions) => {
        // Adjust the got-scraping request options before the request is sent,
        // e.g. add or override HTTP headers:
        gotOptions.headers = { ...gotOptions.headers, 'accept-language': 'en-US' };
    },
]
```

By default, `CheerioCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [CheerioCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [CheerioCrawlerOptions.requestHandler](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestHandler).

New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `CheerioCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `CheerioCrawler` constructor.
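For instance, a crawler that also accepts XML feeds and caps its concurrency might be configured like this minimal sketch (the MIME types, limits and handler body are illustrative, not defaults):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Also process XML documents besides the default HTML/XHTML content types.
    additionalMimeTypes: ['application/xml', 'text/xml'],
    // Concurrency bounds passed through to the underlying AutoscaledPool.
    minConcurrency: 2,
    maxConcurrency: 20,
    async requestHandler({ request, contentType, $ }) {
        // `contentType` reflects what the server reported in the Content-Type header.
        console.log(`${request.url} was served as ${contentType.type}`);
        // `$` holds the Cheerio-parsed document.
    },
});

await crawler.run(['https://crawlee.dev']);
```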
**Example usage:**

```
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, response, body, contentType, $ }) {
        const data = [];

        // Do some data extraction from the page with Cheerio.
        $('.some-collection').each((index, el) => {
            data.push({ title: $(el).find('.some-title').text() });
        });

        // Save the data to dataset.
        await Dataset.pushData({
            url: request.url,
            html: body,
            data,
        });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

### Hierarchy * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)> * *CheerioCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L169)constructor * ****new CheerioCrawler**(options, config): [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) - Overrides HttpCrawler.constructor All `CheerioCrawler` parameters are passed via an options object. *** #### Parameters * ##### optionaloptions: [CheerioCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md)\ * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) #### Returns [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from HttpCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from HttpCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from HttpCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from HttpCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\, request>> = ... Inherited from HttpCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from HttpCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? 
: [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from HttpCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from HttpCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from HttpCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from HttpCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from HttpCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). 
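For example, after a run finishes we can read the collected records back through the crawler, or dump them to a file. A minimal sketch (the start URL, the stored fields and the `./results.json` output path are only illustrative):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        // Store the title of each crawled page in the default dataset.
        await pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.run(['http://www.example.com']);

// Retrieve the stored records (a thin wrapper around Dataset.getData()).
const { items } = await crawler.getData();
console.log(`Collected ${items.length} records`);

// Or export everything to a file; the format is inferred from the extension.
await crawler.exportData('./results.json');
```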
*** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from HttpCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from HttpCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from HttpCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from HttpCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from HttpCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. 
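It can also be called manually from anywhere that holds a reference to the crawler, for example to surface custom progress information. A minimal sketch (assuming `crawler` is an already constructed crawler instance; the message text is arbitrary):

```
// Report a custom status message; the `level` option controls the log level
// used for the message (it defaults to 'DEBUG').
await crawler.setStatusMessage('Finished the category pages, moving on to details...', { level: 'INFO' });
```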
*** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from HttpCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)inheriteduse * ****use**(extension): void - Inherited from HttpCrawler.use **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from HttpCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # createCheerioRouter ### Callable * ****createCheerioRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). Defaults to the [CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. 
``` import { CheerioCrawler, createCheerioRouter } from 'crawlee'; const router = createCheerioRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # CheerioCrawlerOptions \ ### Hierarchy * [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> * *CheerioCrawlerOptions* ## Index[**](#Index) ### Properties * [**additionalHttpErrorStatusCodes](#additionalHttpErrorStatusCodes) * [**additionalMimeTypes](#additionalMimeTypes) * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**forceResponseEncoding](#forceResponseEncoding) * [**handlePageFunction](#handlePageFunction) * [**httpClient](#httpClient) * [**ignoreHttpErrorStatusCodes](#ignoreHttpErrorStatusCodes) * [**ignoreSslErrors](#ignoreSslErrors) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**suggestResponseEncoding](#suggestResponseEncoding) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#additionalHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L186)optionalinheritedadditionalHttpErrorStatusCodes **additionalHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.additionalHttpErrorStatusCodes An array of additional HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be treated as errors. By default, status codes >= 500 trigger errors. ### [**](#additionalMimeTypes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L142)optionalinheritedadditionalMimeTypes **additionalMimeTypes? 
: string\[] Inherited from HttpCrawlerOptions.additionalMimeTypes An array of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) you want the crawler to load and process. By default, only `text/html` and `application/xhtml+xml` MIME types are supported. ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from HttpCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalinheritederrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> Inherited from HttpCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from HttpCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalinheritedfailedRequestHandler **failedRequestHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> Inherited from HttpCrawlerOptions.failedRequestHandler A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. 
The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#forceResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L166)optionalinheritedforceResponseEncoding **forceResponseEncoding? : string Inherited from HttpCrawlerOptions.forceResponseEncoding By default this crawler will extract correct encoding from the HTTP response headers. Use `forceResponseEncoding` to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use [HttpCrawlerOptions.suggestResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#suggestResponseEncoding) ``` // Will force windows-1250 encoding even if headers say otherwise forceResponseEncoding: 'windows-1250' ``` ### [**](#handlePageFunction)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L87)optionalinheritedhandlePageFunction **handlePageFunction? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.handlePageFunction An alias for [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler) Soon to be removed, use `requestHandler` instead. * **@deprecated** ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from HttpCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L180)optionalinheritedignoreHttpErrorStatusCodes **ignoreHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.ignoreHttpErrorStatusCodes An array of HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be excluded from error consideration. By default, status codes >= 500 trigger errors. ### [**](#ignoreSslErrors)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L97)optionalinheritedignoreSslErrors **ignoreSslErrors? : boolean Inherited from HttpCrawlerOptions.ignoreSslErrors If set to true, SSL certificate errors will be ignored. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? 
: boolean Inherited from HttpCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from HttpCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from HttpCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from HttpCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from HttpCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from HttpCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. 
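The crawl-limiting options above are commonly combined on a single crawler. A brief sketch of how that might look (the numbers are arbitrary examples, not recommended values):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 500,   // stop after roughly 500 pages have been opened
    maxConcurrency: 10,         // never run more than 10 requests in parallel
    maxRequestsPerMinute: 120,  // cap the overall throughput
    maxRequestRetries: 5,       // retry a failing request up to 5 times
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```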
### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from HttpCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from HttpCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L92)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from HttpCrawlerOptions.navigationTimeoutSecs Timeout in which the HTTP request to the resource needs to finish, given in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from HttpCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L174)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from HttpCrawlerOptions.persistCookiesPerSession Automatically saves cookies to Session. Works only if Session Pool is used. It parses cookie from response "set-cookie" header saves or updates cookies for session and once the session is used for next request. It passes the "Cookie" header to the request with the session cookies. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L136)optionalinheritedpostNavigationHooks **postNavigationHooks? : InternalHttpHook<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. 
Example:

```
postNavigationHooks: [
    async (crawlingContext) => {
        // ...
    },
]
```

### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L122)optionalinheritedpreNavigationHooks

**preNavigationHooks? : InternalHttpHook<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.preNavigationHooks

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate.

Example:

```
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // ...
    },
]
```

Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook)

### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L104)optionalinheritedproxyConfiguration

**proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawlerOptions.proxyConfiguration

If set, this crawler will be configured for all connections to use [Apify Proxy](https://console.apify.com/proxy) or your own Proxy URLs provided and rotated according to the configuration. For more information, see the [documentation](https://docs.apify.com/proxy).

### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalinheritedrequestHandler

**requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.requestHandler

User-provided function that performs the logic of the crawler. It is called for each URL to crawl.

The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl.

The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function.

### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs

**requestHandlerTimeoutSecs?
: number = 60 Inherited from HttpCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from HttpCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from HttpCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs to be added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from HttpCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. 
Currently supports:

* [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/)
* [**Google Search** Rate Limiting](https://www.google.com/sorry/)

### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs

**sameDomainDelaySecs? : number = 0 Inherited from HttpCrawlerOptions.sameDomainDelaySecs

Indicates how much time (in seconds) the crawler should wait before processing another request to the same domain.

### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions

**sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from HttpCrawlerOptions.sessionPoolOptions

The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use.

### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions

**statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from HttpCrawlerOptions.statisticsOptions

Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store.

### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback

**statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from HttpCrawlerOptions.statusMessageCallback

Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.

```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```

### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval

**statusMessageLoggingInterval? : number Inherited from HttpCrawlerOptions.statusMessageLoggingInterval

Defines the length of the interval for calling the `setStatusMessage` in seconds.

### [**](#suggestResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L155)optionalinheritedsuggestResponseEncoding

**suggestResponseEncoding? : string Inherited from HttpCrawlerOptions.suggestResponseEncoding

By default this crawler will extract the correct encoding from the HTTP response headers. Sadly, some websites use invalid headers; their responses are then decoded as UTF-8 by default. If those sites actually use a different encoding, the response will be corrupted.
You can use `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use [HttpCrawlerOptions.forceResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#forceResponseEncoding) ``` // Will fall back to windows-1250 encoding if none found suggestResponseEncoding: 'windows-1250' ``` ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from HttpCrawlerOptions.useSessionPool Basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will be than available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # CheerioCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\ * *CheerioCrawlingContext* ## Index[**](#Index) ### Properties * [**$](#$) * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#$)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L49)$ **$: CheerioAPI The [Cheerio](https://cheerio.js.org/) object with parsed HTML. Cheerio is available only for HTML and XML content types. ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The request body of the web page. The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for others MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type header: { type, encoding }`. 
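For example, if `additionalMimeTypes` lets JSON responses through, the handler can branch on the parsed content type. A rough sketch (the chosen MIME type and the logging are only illustrative):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Allow JSON responses in addition to the default HTML/XHTML content types.
    additionalMimeTypes: ['application/json'],
    async requestHandler({ request, contentType, body, json, $ }) {
        if (contentType.type === 'application/json') {
            // For JSON responses, `json` holds the parsed object and Cheerio is not available.
            console.log(request.url, json);
        } else {
            // For HTML/XML responses, `body` is a string and `$` is the parsed document.
            console.log(request.url, $('title').text(), body.length);
        }
    },
});
```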
*** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) Inherited from InternalHttpCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from InternalHttpCrawlingContext.id ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson **json: JSONData Inherited from InternalHttpCrawlingContext.json The parsed object from JSON string if the response contains the content type application/json. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from InternalHttpCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from InternalHttpCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from InternalHttpCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse **response: PlainResponse Inherited from InternalHttpCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from InternalHttpCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from InternalHttpCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. 
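For example, the state can hold a simple counter shared by all handler runs. A minimal sketch (the `pagesVisited` field is just an illustrative shape for the state object):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, useState, log }) {
        // Every handler run receives the same state object.
        const state = await useState({ pagesVisited: 0 });
        state.pagesVisited += 1;
        log.info(`${request.url} is page number ${state.pagesVisited}`);
    },
});
```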
*** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from InternalHttpCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L79)parseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.parseWithCheerio Returns Cheerio handle, this is here to unify the crawler API, so they all have this handy method. It has the same return type as the `$` context property, use it only if you are abstracting your workflow to support different context types in one handler. When provided with the `selector` argument, it will throw if it's not available. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from InternalHttpCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. 
* ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from InternalHttpCrawlingContext.sendRequest Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L63)waitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.waitForSelector Wait for an element matching the selector to appear. Timeout is ignored. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # @crawlee/core Core set of classes required for Crawlee. The [`crawlee`](https://www.npmjs.com/package/crawlee) package consists of several smaller packages, released separately under `@crawlee` namespace: * [`@crawlee/core`](https://crawlee.dev/js/api/core.md): the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes * [`@crawlee/cheerio`](https://crawlee.dev/js/api/cheerio-crawler.md): exports `CheerioCrawler` * [`@crawlee/playwright`](https://crawlee.dev/js/api/playwright-crawler.md): exports `PlaywrightCrawler` * [`@crawlee/puppeteer`](https://crawlee.dev/js/api/puppeteer-crawler.md): exports `PuppeteerCrawler` * [`@crawlee/linkedom`](https://crawlee.dev/js/api/linkedom-crawler.md): exports `LinkeDOMCrawler` * [`@crawlee/jsdom`](https://crawlee.dev/js/api/jsdom-crawler.md): exports `JSDOMCrawler` * [`@crawlee/basic`](https://crawlee.dev/js/api/basic-crawler.md): exports `BasicCrawler` * [`@crawlee/http`](https://crawlee.dev/js/api/http-crawler.md): exports `HttpCrawler` (which is used for creating [`@crawlee/jsdom`](https://crawlee.dev/js/api/jsdom-crawler.md) and [`@crawlee/cheerio`](https://crawlee.dev/js/api/cheerio-crawler.md)) * [`@crawlee/browser`](https://crawlee.dev/js/api/browser-crawler.md): exports `BrowserCrawler` (which is used for creating [`@crawlee/playwright`](https://crawlee.dev/js/api/playwright-crawler.md) and [`@crawlee/puppeteer`](https://crawlee.dev/js/api/puppeteer-crawler.md)) * [`@crawlee/memory-storage`](https://crawlee.dev/js/api/memory-storage.md): [`@apify/storage-local`](https://npmjs.com/package/@apify/storage-local) alternative * [`@crawlee/browser-pool`](https://crawlee.dev/js/api/browser-pool.md): previously [`browser-pool`](https://npmjs.com/package/browser-pool) package * [`@crawlee/utils`](https://crawlee.dev/js/api/utils.md): utility methods * 
[`@crawlee/types`](https://crawlee.dev/js/api/types.md): holds TS interfaces mainly about the [`StorageClient`](https://crawlee.dev/js/api/core/interface/StorageClient.md) ## Installing Crawlee[​](#installing-crawlee "Direct link to Installing Crawlee") Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`. If we don't care much about additional code being pulled in, we can just use the `crawlee` meta-package, which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. ``` npm install crawlee ``` Or if all we need is cheerio support, we can install only `@crawlee/cheerio`. ``` npm install @crawlee/cheerio ``` When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used. ``` npm install crawlee playwright # or npm install @crawlee/playwright playwright ``` Alternatively we can also use the `crawlee` meta-package which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. > Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser related utilities can be also found in the crawler packages (e.g. `@crawlee/playwright`). ## Index[**](#Index) ### Crawlers * [**Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### Result Stores * [**Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) * [**KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### Scaling * [**AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) * [**ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) * [**Session](https://crawlee.dev/js/api/core/class/Session.md) * [**SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) * [**Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) * [**SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### Sources * [**PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) * [**Request](https://crawlee.dev/js/api/core/class/Request.md) * [**RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) * [**RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) * [**RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### Other * [**RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) * [**EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) * [**EventType](https://crawlee.dev/js/api/core/enum/EventType.md) * [**LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * [**RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) * [**Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) * [**CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) * [**ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) * [**ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) * [**EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) * 
[**GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) * [**LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) * [**Log](https://crawlee.dev/js/api/core/class/Log.md) * [**Logger](https://crawlee.dev/js/api/core/class/Logger.md) * [**LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) * [**LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) * [**NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) * [**RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) * [**RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) * [**RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) * [**RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) * [**RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) * [**Router](https://crawlee.dev/js/api/core/class/Router.md) * [**SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) * [**SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) * [**BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) * [**BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) * [**ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) * [**ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) * [**Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) * [**CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) * [**CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) * [**DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) * [**DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) * [**DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) * [**DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) * [**DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) * [**DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) * [**DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) * [**DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) * [**ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) * [**FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) * [**HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) * [**HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) * [**HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) * [**IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) * 
[**IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) * [**IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) * [**KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) * [**LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) * [**PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) * [**ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) * [**QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) * [**RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) * [**RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) * [**RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) * [**RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) * [**RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) * [**RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) * [**RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) * [**RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) * [**ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) * [**ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) * [**RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) * [**SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) * [**SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) * [**SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) * [**SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) * [**SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) * [**StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) * [**StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) * [**StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) * [**StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) * [**StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) * [**StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) * [**SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) * 
[**SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) * [**TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) * [**UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) * [**EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) * [**LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) * [**PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) * [**RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) * [**RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) * [**RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) * [**SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) * [**Source](https://crawlee.dev/js/api/core.md#Source) * [**UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) * [**log](https://crawlee.dev/js/api/core.md#log) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) * [**checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) * [**enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) * [**processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) * [**purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) * [**tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) * [**useState](https://crawlee.dev/js/api/core/function/useState.md) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ## Other[**](#__CATEGORY__) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Renames and re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName **EventTypeName: [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) | systemInfo | persistState | migrating | aborting | exit ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest **GetUserDataFromRequest\: T extends [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Y> ? 
Y : never #### Type parameters * **T** ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput **GlobInput: string | [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject **GlobObject: { glob: string } & Pick<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md), method | payload | label | userData | headers> ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest **LoadedRequest\: WithRequired\ #### Type parameters * **R**: [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput **PseudoUrlInput: string | [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject **PseudoUrlObject: { purl: string } & Pick<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md), method | payload | label | userData | headers> ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler **RedirectHandler: (redirectResponse, updatedRequest) => void Type of a function called when an HTTP redirect takes place. It is allowed to mutate the `updatedRequest` argument. *** #### Type declaration * * **(redirectResponse, updatedRequest): void - #### Parameters * ##### redirectResponse: [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) * ##### updatedRequest: { headers: SimpleHeaders; url?: string | URL } * ##### headers: SimpleHeaders * ##### optionalurl: string | URL #### Returns void ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput **RegExpInput: RegExp | [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject **RegExpObject: { regexp: RegExp } & Pick<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md), method | payload | label | userData | headers> ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction **RequestListSourcesFunction: () => Promise\ #### Type declaration * * **(): Promise\ - #### Returns Promise\ ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike **RequestsLike: AsyncIterable<[Source](https://crawlee.dev/js/api/core.md#Source) | string> | Iterable<[Source](https://crawlee.dev/js/api/core.md#Source) | string> | ([Source](https://crawlee.dev/js/api/core.md#Source) | string)\[] ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes **RouterRoutes\: { \[ label in string | symbol ]: (ctx) => Awaitable\ } #### Type parameters * **Context** * **UserData**: Dictionary ### 
[**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback **SkippedRequestCallback: (args) => Awaitable\ #### Type declaration * * **(args): Awaitable\ - #### Parameters * ##### args: { reason: [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason); url: string } * ##### reason: [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) * ##### url: string #### Returns Awaitable\ ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason **SkippedRequestReason: robotsTxt | limit | filters | redirect | depth ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source **Source: (Partial<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)> & { regex? : RegExp; requestsFromUrl? : string }) | [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject **UrlPatternObject: { glob? : string; regexp? : RegExp } & Pick<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md), method | payload | label | userData | headers> ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)constBLOCKED\_STATUS\_CODES **BLOCKED\_STATUS\_CODES: number\[] = ... ### [**](#log)externalconstlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)constMAX\_POOL\_SIZE **MAX\_POOL\_SIZE: 1000 = 1000 ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)constPERSIST\_STATE\_KEY **PERSIST\_STATE\_KEY: SDK\_SESSION\_POOL\_STATE = 'SDK\_SESSION\_POOL\_STATE'
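As promised in the Installing Crawlee section above, here is a minimal sketch of what using an installed package looks like. It assumes only that the `crawlee` meta-package has been installed as shown earlier; the start URL, the scraped fields, and the `maxRequestsPerCrawl` limit are illustrative choices, not requirements.

```
// Minimal sketch: crawl a site with CheerioCrawler and store results in the default Dataset.
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Invoked for every fetched page; `$` is the Cheerio handle over the parsed HTML.
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`${title} (${request.loadedUrl})`);
        // Push one record per page into the default Dataset (see Result Stores above).
        await Dataset.pushData({ url: request.loadedUrl, title });
        // Enqueue links discovered on the page into the default RequestQueue.
        await enqueueLinks();
    },
    // Illustrative safety limit so the example terminates quickly.
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://crawlee.dev']);
```

The same shape applies to `PlaywrightCrawler` or `PuppeteerCrawler` from the same meta-package, provided `playwright` or `puppeteer` is installed alongside it as described above.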
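The `SkippedRequestCallback` and `SkippedRequestReason` types above can be used to react to links that Crawlee decides not to enqueue (see the `onSkippedRequest` option introduced in 3.13.2 in the changelog below). The following is only a hedged sketch: the callback body and the exact wiring into a crawler are illustrative assumptions.

```
// Sketch of a skipped-request callback, typed with the exported SkippedRequestCallback.
import type { SkippedRequestCallback } from '@crawlee/core';

const reportSkippedRequest: SkippedRequestCallback = ({ url, reason }) => {
    // `reason` is a SkippedRequestReason: robotsTxt | limit | filters | redirect | depth.
    console.warn(`Request skipped (${reason}): ${url}`);
};

// Hypothetical wiring, assuming a crawler accepts it via the `onSkippedRequest` option:
// const crawler = new CheerioCrawler({ onSkippedRequest: reportSkippedRequest, /* ... */ });
```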
---

# Changelog

All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.

## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/core ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * enable `systemInfoV2` by default ([#3208](https://github.com/apify/crawlee/issues/3208)) ([617a343](https://github.com/apify/crawlee/commit/617a343d4f594635adfff3c41a3632a19144749a)) ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * use correct config for storage classes to avoid memory leaks ([#3144](https://github.com/apify/crawlee/issues/3144)) ([911a2eb](https://github.com/apify/crawlee/commit/911a2eb45cdb5e3fc0e6a96471af86b43bc828bf)) ### Performance Improvements[​](#performance-improvements "Direct link to Performance Improvements") * Improve glob performance by reusing minimatch objects ([#3168](https://github.com/apify/crawlee/issues/3168)) ([e5632e2](https://github.com/apify/crawlee/commit/e5632e2700198d75ca955ef3d2ffb609dbf0f050)) # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * `proxyUrls` list can contain `null` ([#3142](https://github.com/apify/crawlee/issues/3142)) ([dc39cc2](https://github.com/apify/crawlee/commit/dc39cc223a90d0c97c60c5a715bdf524cc32bbac)), closes [#3136](https://github.com/apify/crawlee/issues/3136) ### Features[​](#features "Direct link to Features") * add `collectAllKeys` option for `BasicCrawler.exportData` ([#3129](https://github.com/apify/crawlee/issues/3129)) ([2ddfc9c](https://github.com/apify/crawlee/commit/2ddfc9c6108207d3289ee92fe3c5b646611cc508)), closes [#3007](https://github.com/apify/crawlee/issues/3007) * add `TandemRequestProvider` for combined `RequestList` and `RequestQueue` usage ([#2914](https://github.com/apify/crawlee/issues/2914)) ([4ca450f](https://github.com/apify/crawlee/commit/4ca450f08b9fb69ae3b2ba3fc66361f14631b15b)), closes [#2499](https://github.com/apify/crawlee/issues/2499) ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/core # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * respect `exclude` option in `enqueueLinksByClickingElements` ([#3058](https://github.com/apify/crawlee/issues/3058)) ([013eb02](https://github.com/apify/crawlee/commit/013eb028b6ecf05f83f8790a4a6164b9c4873733)) * validation of iterables when adding requests to the queue ([#3091](https://github.com/apify/crawlee/issues/3091)) ([529a1dd](https://github.com/apify/crawlee/commit/529a1dd57278efef4fb2013e79a09fd1bc8594a5)), closes [#3063](https://github.com/apify/crawlee/issues/3063) ### Features[​](#features-1 "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 
31310-2025-07-09") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * improve enqueueLinks `limit` checking ([#3038](https://github.com/apify/crawlee/issues/3038)) ([2774124](https://github.com/apify/crawlee/commit/277412468dc00a385080c3570c24faac76e764ca)), closes [#3037](https://github.com/apify/crawlee/issues/3037) ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * Do not log 'malformed sitemap content' on network errors in `Sitemap.tryCommonNames` ([#3015](https://github.com/apify/crawlee/issues/3015)) ([64a090f](https://github.com/apify/crawlee/commit/64a090ffbba5c69730ec0616e415a1eadf4bc7b3)), closes [#2884](https://github.com/apify/crawlee/issues/2884) * Fix link filtering in enqueueLinks in AdaptivePlaywrightCrawler ([#3021](https://github.com/apify/crawlee/issues/3021)) ([8a3b6f8](https://github.com/apify/crawlee/commit/8a3b6f8847586eb3b0865fe93053468e1605399c)), closes [#2525](https://github.com/apify/crawlee/issues/2525) ### Features[​](#features-2 "Direct link to Features") * Accept (Async)Iterables in `addRequests` methods ([#3013](https://github.com/apify/crawlee/issues/3013)) ([a4ab748](https://github.com/apify/crawlee/commit/a4ab74852c3c60bdbc96035f54b16d125220f699)), closes [#2980](https://github.com/apify/crawlee/issues/2980) * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) * Persist rendering type detection results in `AdaptivePlaywrightCrawler` ([#2987](https://github.com/apify/crawlee/issues/2987)) ([76431ba](https://github.com/apify/crawlee/commit/76431badf8a55892303d9b53fe23e029fad9cb18)), closes [#2899](https://github.com/apify/crawlee/issues/2899) ### Features[​](#features-3 "Direct link to Features") * **dataset:** add collectAllKeys option for full CSV export ([#2945](https://github.com/apify/crawlee/issues/2945)) ([#3007](https://github.com/apify/crawlee/issues/3007)) ([3b629da](https://github.com/apify/crawlee/commit/3b629da9418c052419381087d3ab1871a5c8718b)) * support `KVS.listKeys()` `prefix` and `collection` parameters ([#3001](https://github.com/apify/crawlee/issues/3001)) ([5c4726d](https://github.com/apify/crawlee/commit/5c4726df96e358a9bbf44a0cd2760e4e269f0fae)), closes [#2974](https://github.com/apify/crawlee/issues/2974) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * use merged cookies correctly in `GotScrapingHttpClient` ([#3000](https://github.com/apify/crawlee/issues/3000)) ([a2985f2](https://github.com/apify/crawlee/commit/a2985f259f068fbe00aed931a812b8a8755282cb)), closes [#2991](https://github.com/apify/crawlee/issues/2991) ## 
[3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/core ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/core ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * **core:** respect `systemInfoV2` in snapshotter ([#2961](https://github.com/apify/crawlee/issues/2961)) ([4100eab](https://github.com/apify/crawlee/commit/4100eabf171d1dfc33ff312cbedf4e178d34ebdf)), closes [#2958](https://github.com/apify/crawlee/issues/2958) * **core:** use short timeouts for periodic `KVS.setRecord` calls ([#2962](https://github.com/apify/crawlee/issues/2962)) ([d31d90e](https://github.com/apify/crawlee/commit/d31d90e5288ea80b3ed6ec4a75a4b8f87686a2c4)) * Optimize request unlocking to get rid of unnecessary unlock calls ([#2963](https://github.com/apify/crawlee/issues/2963)) ([a433037](https://github.com/apify/crawlee/commit/a433037f307ed3490a1ef5df334f1f9a9044510d)) ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * Fix useState behavior in adaptive crawler ([#2941](https://github.com/apify/crawlee/issues/2941)) ([5282381](https://github.com/apify/crawlee/commit/52823818bd66995c1512b433e6d82755c487cb58)) * Persist SitemapRequestList state periodically ([#2923](https://github.com/apify/crawlee/issues/2923)) ([e6e7a9f](https://github.com/apify/crawlee/commit/e6e7a9feed5d8281c36a83fc5edc2f5cb6e783fd)), closes [#2897](https://github.com/apify/crawlee/issues/2897) * **statistics:** track actual request.retryCount in Statistics ([#2940](https://github.com/apify/crawlee/issues/2940)) ([c9f7f54](https://github.com/apify/crawlee/commit/c9f7f5494ac4895a30b283a5defe382db0cdea26)) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-4 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * don't double increment session usage count in `BrowserCrawler` ([#2908](https://github.com/apify/crawlee/issues/2908)) ([3107e55](https://github.com/apify/crawlee/commit/3107e5511142a3579adc2348fcb6a9dcadd5c0b9)), closes [#2851](https://github.com/apify/crawlee/issues/2851) * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-5 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) 
([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * Make log message in RequestQueue.isFinished more accurate ([#2848](https://github.com/apify/crawlee/issues/2848)) ([3d124ae](https://github.com/apify/crawlee/commit/3d124aee8f6fa096df0daafad4bb9d07b0ae4684)) * Simplified RequestQueueV2 implementation ([#2775](https://github.com/apify/crawlee/issues/2775)) ([d1a094a](https://github.com/apify/crawlee/commit/d1a094a47eaecbf367b222f9b8c14d7da5d3e03a)), closes [#2767](https://github.com/apify/crawlee/issues/2767) [#2700](https://github.com/apify/crawlee/issues/2700) ### Features[​](#features-6 "Direct link to Features") * improved cross platform metric collection ([#2834](https://github.com/apify/crawlee/issues/2834)) ([e41b2f7](https://github.com/apify/crawlee/commit/e41b2f744513dd80aa05336eedfa1c08c54d3832)), closes [#2771](https://github.com/apify/crawlee/issues/2771) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * **core:** type definition of Dataset.reduce ([#2774](https://github.com/apify/crawlee/issues/2774)) ([59bc6d1](https://github.com/apify/crawlee/commit/59bc6d12cbd9e81c06ee18d0a6390b7806e346ae)), closes [#2773](https://github.com/apify/crawlee/issues/2773) ### Features[​](#features-7 "Direct link to Features") * add support for parsing comma-separated list environment variables ([#2765](https://github.com/apify/crawlee/issues/2765)) ([4e50c47](https://github.com/apify/crawlee/commit/4e50c474f60df66585c6decf07532c790c8e63a7)) ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") ### Features[​](#features-8 "Direct link to Features") * `tieredProxyUrls` accept `null` for switching the proxy off ([#2743](https://github.com/apify/crawlee/issues/2743)) ([82f4ea9](https://github.com/apify/crawlee/commit/82f4ea99f632526649ad73e3246b9bdf63a6788a)), closes [#2740](https://github.com/apify/crawlee/issues/2740) # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * **core:** ensure correct column order in CSV export ([#2734](https://github.com/apify/crawlee/issues/2734)) ([b66784f](https://github.com/apify/crawlee/commit/b66784f89f011c2f972d73ec9cd47235a0411d1c)), closes [#2718](https://github.com/apify/crawlee/issues/2718) ### Features[​](#features-9 "Direct link to Features") * allow using other HTTP clients ([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * `forefront` request fetching in RQv2 ([#2689](https://github.com/apify/crawlee/issues/2689)) ([03951bd](https://github.com/apify/crawlee/commit/03951bdba8fb34f6bed00d1b68240ff7cd0bacbf)), closes [#2669](https://github.com/apify/crawlee/issues/2669) * **core:** accept `UInt8Array` in `KVS.setValue()` ([#2682](https://github.com/apify/crawlee/issues/2682)) 
([8ef0e60](https://github.com/apify/crawlee/commit/8ef0e60ca6fb2f4ec1b0d1aec6dcd53fcfb398b3)) * decode special characters in proxy `username` and `password` ([#2696](https://github.com/apify/crawlee/issues/2696)) ([0f0fcc5](https://github.com/apify/crawlee/commit/0f0fcc594685a29472b407a7c39d48b21f24375a)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") ### Bug Fixes[​](#bug-fixes-15 "Direct link to Bug Fixes") * `SitemapRequestList.teardown()` doesn't break `persistState` calls ([#2673](https://github.com/apify/crawlee/issues/2673)) ([fb2c5cd](https://github.com/apify/crawlee/commit/fb2c5cdaa47e2d3a91ade726cfba3091917a0137)), closes [/github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap\_request\_list.ts#L446](https://github.com//github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap_request_list.ts/issues/L446) [#2672](https://github.com/apify/crawlee/issues/2672) ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") ### Bug Fixes[​](#bug-fixes-16 "Direct link to Bug Fixes") * **RequestQueueV2:** reset recently handled cache too if the queue is pending for too long ([#2656](https://github.com/apify/crawlee/issues/2656)) ([51a69bc](https://github.com/apify/crawlee/commit/51a69bc1f2084c4d7ef3b7bdab3695b77af29540)) ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") ### Bug Fixes[​](#bug-fixes-17 "Direct link to Bug Fixes") * **RequestQueueV2:** remove `inProgress` cache, rely solely on locked states ([#2601](https://github.com/apify/crawlee/issues/2601)) ([57fcb08](https://github.com/apify/crawlee/commit/57fcb0804a9f1268039d1e2b246c515ceca7e405)) ### Features[​](#features-10 "Direct link to Features") * `globs` & `regexps` for `SitemapRequestList` ([#2631](https://github.com/apify/crawlee/issues/2631)) ([b5fd3a9](https://github.com/apify/crawlee/commit/b5fd3a9e3f6b189b86c0fb89a37b66c08ff3fe5d)) * resilient sitemap loading ([#2619](https://github.com/apify/crawlee/issues/2619)) ([1dd7660](https://github.com/apify/crawlee/commit/1dd76601e03de4541964116b3a77376e233ea22b)) ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/core # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-11 "Direct link to Features") * Sitemap-based request list implementation ([#2498](https://github.com/apify/crawlee/issues/2498)) ([7bf8f0b](https://github.com/apify/crawlee/commit/7bf8f0bcd4cc81e02c7cc60e82dfe7a0cdd80938)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-18 "Direct link to Bug Fixes") * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") ### Bug Fixes[​](#bug-fixes-19 "Direct link to Bug Fixes") * add 
`waitForAllRequestsToBeAdded` option to `enqueueLinks` helper ([925546b](https://github.com/apify/crawlee/commit/925546b31130076c2dec98a83a42d15c216589a0)), closes [#2318](https://github.com/apify/crawlee/issues/2318) * respect `crawler.log` when creating child logger for `Statistics` ([0a0d75d](https://github.com/apify/crawlee/commit/0a0d75d40b5f78b329589535bbe3e0e84be76a7e)), closes [#2412](https://github.com/apify/crawlee/issues/2412) ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Bug Fixes[​](#bug-fixes-20 "Direct link to Bug Fixes") * respect implicit router when no `requestHandler` is provided in `AdaptiveCrawler` ([#2518](https://github.com/apify/crawlee/issues/2518)) ([31083aa](https://github.com/apify/crawlee/commit/31083aa27ddd51827f73c7ac4290379ec7a81283)) * revert the scaling steps back to 5% ([5bf32f8](https://github.com/apify/crawlee/commit/5bf32f855ad84037e68dd9053930fa7be4267cac)) ### Features[​](#features-12 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/core ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") ### Bug Fixes[​](#bug-fixes-21 "Direct link to Bug Fixes") * investigate and temp fix for possible 0-concurrency bug in RQv2 ([#2494](https://github.com/apify/crawlee/issues/2494)) ([4ebe820](https://github.com/apify/crawlee/commit/4ebe820573b269c2d0a6eff20cfd7787debc63c0)) * provide URLs to the error snapshot ([#2482](https://github.com/apify/crawlee/issues/2482)) ([7f64145](https://github.com/apify/crawlee/commit/7f64145308dfdb3909d4fcf945759a7d6344e2f5)), closes [/github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key\_value\_store.ts#L25](https://github.com//github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts/issues/L25) # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-22 "Direct link to Bug Fixes") * `EnqueueStrategy.All` erroring with links using unsupported protocols ([#2389](https://github.com/apify/crawlee/issues/2389)) ([8db3908](https://github.com/apify/crawlee/commit/8db39080b7711ba3c27dff7fce1170ddb0ee3d05)) * **core:** conversion between tough cookies and browser pool cookies ([#2443](https://github.com/apify/crawlee/issues/2443)) ([74f73ab](https://github.com/apify/crawlee/commit/74f73ab77a94ecd285d587b7b3532443deda07b4)) * **core:** fire local `SystemInfo` events every second ([#2454](https://github.com/apify/crawlee/issues/2454)) ([1fa9a66](https://github.com/apify/crawlee/commit/1fa9a66388846505f84dcdea0393e7eaaebf84c3)) * **core:** use createSessionFunction when loading Session from persisted state ([#2444](https://github.com/apify/crawlee/issues/2444)) ([3c56b4c](https://github.com/apify/crawlee/commit/3c56b4ca1efe327138aeb32c39dfd9dd67b6aceb)) * double tier decrement in tiered proxy ([#2468](https://github.com/apify/crawlee/issues/2468)) ([3a8204b](https://github.com/apify/crawlee/commit/3a8204ba417936570ec5569dc4e4eceed79939c1)) ### Features[​](#features-13 "Direct link to 
Features") * implement ErrorSnapshotter for error context capture ([#2332](https://github.com/apify/crawlee/issues/2332)) ([e861dfd](https://github.com/apify/crawlee/commit/e861dfdb451ae32fb1e0c7749c6b59744654b303)), closes [#2280](https://github.com/apify/crawlee/issues/2280) * make `RequestQueue` v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) ([#2390](https://github.com/apify/crawlee/issues/2390)) ([41ae8ab](https://github.com/apify/crawlee/commit/41ae8abec1da811ae0750ac2d298e77c1e3b7b55)), closes [#2388](https://github.com/apify/crawlee/issues/2388) ### Performance Improvements[​](#performance-improvements-1 "Direct link to Performance Improvements") * improve scaling based on memory ([#2459](https://github.com/apify/crawlee/issues/2459)) ([2d5d443](https://github.com/apify/crawlee/commit/2d5d443da5fa701b21aec003d4d84797882bc175)) * optimize `RequestList` memory footprint ([#2466](https://github.com/apify/crawlee/issues/2466)) ([12210bd](https://github.com/apify/crawlee/commit/12210bd191b50c76ecca23ea18f3deda7b1517c6)) * optimize adding large amount of requests via `crawler.addRequests()` ([#2456](https://github.com/apify/crawlee/issues/2456)) ([6da86a8](https://github.com/apify/crawlee/commit/6da86a85d848cd1cf860a28e5f077b8b14cdb213)) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") ### Bug Fixes[​](#bug-fixes-23 "Direct link to Bug Fixes") * break up growing stack in `AutoscaledPool.notify` ([#2422](https://github.com/apify/crawlee/issues/2422)) ([6f2e6b0](https://github.com/apify/crawlee/commit/6f2e6b0ccb404ae66be372e87d762eed67c053bb)), closes [#2421](https://github.com/apify/crawlee/issues/2421) ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/core # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Bug Fixes[​](#bug-fixes-24 "Direct link to Bug Fixes") * include actual key in error message of KVS' `setValue` ([#2411](https://github.com/apify/crawlee/issues/2411)) ([9089bf1](https://github.com/apify/crawlee/commit/9089bf139b717fecc6e8220c65a4d389862bd073)) * notify autoscaled pool about newly added requests ([#2400](https://github.com/apify/crawlee/issues/2400)) ([a90177d](https://github.com/apify/crawlee/commit/a90177d5207794be1d6e401d746dd4c6e5961976)) ### Features[​](#features-14 "Direct link to Features") * `createAdaptivePlaywrightRouter` utility ([#2415](https://github.com/apify/crawlee/issues/2415)) ([cee4778](https://github.com/apify/crawlee/commit/cee477814e4901d025c5376205ad884c2fe08e0e)), closes [#2407](https://github.com/apify/crawlee/issues/2407) * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) * better `newUrlFunction` for ProxyConfiguration ([#2392](https://github.com/apify/crawlee/issues/2392)) ([330598b](https://github.com/apify/crawlee/commit/330598b348ad27bc7c73732294a14b655ccd3507)), closes [#2348](https://github.com/apify/crawlee/issues/2348) [#2065](https://github.com/apify/crawlee/issues/2065) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") ### Bug Fixes[​](#bug-fixes-25 "Direct link to Bug Fixes") * **core:** solve possible dead locks in 
`RequestQueueV2` ([#2376](https://github.com/apify/crawlee/issues/2376)) ([ffba095](https://github.com/apify/crawlee/commit/ffba095c8a74075901268cc49d970af4271d7abf)) * use 0 (number) instead of false as default for sessionRotationCount ([#2372](https://github.com/apify/crawlee/issues/2372)) ([667a3e7](https://github.com/apify/crawlee/commit/667a3e7a2be31abb94adbdb6119c4a8f3a751d69)) ### Features[​](#features-15 "Direct link to Features") * implement global storage access checking and use it to prevent unwanted side effects in adaptive crawler ([#2371](https://github.com/apify/crawlee/issues/2371)) ([fb3b7da](https://github.com/apify/crawlee/commit/fb3b7da402522ddff8c7394ac1253ba8aeac984c)), closes [#2364](https://github.com/apify/crawlee/issues/2364) ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") ### Bug Fixes[​](#bug-fixes-26 "Direct link to Bug Fixes") * fix crawling context type in `router.addHandler()` ([#2355](https://github.com/apify/crawlee/issues/2355)) ([d73c202](https://github.com/apify/crawlee/commit/d73c20240586aeeddaea99cd157771a01b61d917)) # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Bug Fixes[​](#bug-fixes-27 "Direct link to Bug Fixes") * `createRequests` works correctly with `exclude` (and nothing else) ([#2321](https://github.com/apify/crawlee/issues/2321)) ([048db09](https://github.com/apify/crawlee/commit/048db0964a57ac570320ad495425733128235491)) ### Features[​](#features-16 "Direct link to Features") * `KeyValueStore.recordExists()` ([#2339](https://github.com/apify/crawlee/issues/2339)) ([8507a65](https://github.com/apify/crawlee/commit/8507a65d1ad079f64c752a6ddb1d8fac9b494228)) * accessing crawler state, key-value store and named datasets via crawling context ([#2283](https://github.com/apify/crawlee/issues/2283)) ([58dd5fc](https://github.com/apify/crawlee/commit/58dd5fcc25f31bb066402c46e48a9e5e91efd5c5)) * adaptive playwright crawler ([#2316](https://github.com/apify/crawlee/issues/2316)) ([8e4218a](https://github.com/apify/crawlee/commit/8e4218ada03cf485751def46f8c465b2d2a825c7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") ### Bug Fixes[​](#bug-fixes-28 "Direct link to Bug Fixes") * **enqueueLinks:** filter out empty/nullish globs ([#2286](https://github.com/apify/crawlee/issues/2286)) ([84319b3](https://github.com/apify/crawlee/commit/84319b39efb5a921d0d5ec785db0147ec47f1243)) ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") ### Bug Fixes[​](#bug-fixes-29 "Direct link to Bug Fixes") * **RequestQueue:** always clear locks when a request is reclaimed ([#2263](https://github.com/apify/crawlee/issues/2263)) ([0fafe29](https://github.com/apify/crawlee/commit/0fafe290103655d450c61da78522491efde8a866)), closes [#2262](https://github.com/apify/crawlee/issues/2262) ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/core # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-30 "Direct link to Bug Fixes") * `retryOnBlocked` doesn't override the blocked HTTP codes ([#2243](https://github.com/apify/crawlee/issues/2243)) 
([81672c3](https://github.com/apify/crawlee/commit/81672c3d1db1dcdcffb868de5740addff82cf112)) * filter out empty globs ([#2205](https://github.com/apify/crawlee/issues/2205)) ([41322ab](https://github.com/apify/crawlee/commit/41322ab32d7db7baf61638d00fd7eaec9e5330f1)), closes [#2200](https://github.com/apify/crawlee/issues/2200) * make SessionPool queue up getSession calls to prevent overruns ([#2239](https://github.com/apify/crawlee/issues/2239)) ([0f5665c](https://github.com/apify/crawlee/commit/0f5665c473371bff5a5d3abee3c3a9d23f2aeb23)), closes [#1667](https://github.com/apify/crawlee/issues/1667) ### Features[​](#features-17 "Direct link to Features") * allow configuring crawler statistics ([#2213](https://github.com/apify/crawlee/issues/2213)) ([9fd60e4](https://github.com/apify/crawlee/commit/9fd60e4036dce720c71f2d169a8eccbc4c813a96)), closes [#1789](https://github.com/apify/crawlee/issues/1789) * check enqueue link strategy post redirect ([#2238](https://github.com/apify/crawlee/issues/2238)) ([3c5f9d6](https://github.com/apify/crawlee/commit/3c5f9d6056158e042e12d75b2b1b21ef6c32e618)), closes [#2173](https://github.com/apify/crawlee/issues/2173) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") ### Bug Fixes[​](#bug-fixes-31 "Direct link to Bug Fixes") * prevent race condition in KeyValueStore.getAutoSavedValue() ([#2193](https://github.com/apify/crawlee/issues/2193)) ([e340e2b](https://github.com/apify/crawlee/commit/e340e2b8764968d22a22bd67769676b9f2f1a2fb)) ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Bug Fixes[​](#bug-fixes-32 "Direct link to Bug Fixes") * **ts:** specify type explicitly for logger ([aec3550](https://github.com/apify/crawlee/commit/aec355022eb13f2624eeba20aeeb42dc0ad8365c)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-33 "Direct link to Bug Fixes") * add `skipNavigation` option to `enqueueLinks` ([#2153](https://github.com/apify/crawlee/issues/2153)) ([118515d](https://github.com/apify/crawlee/commit/118515d2ba534b99be2f23436f6abe41d66a8e07)) * **core:** respect some advanced options for `RequestList.open()` + improve docs ([#2158](https://github.com/apify/crawlee/issues/2158)) ([c5a1b07](https://github.com/apify/crawlee/commit/c5a1b07ad62957fbe2cf90938d1f27b1ca54534a)) * declare missing dependency on got-scraping in the core package ([cd2fd4d](https://github.com/apify/crawlee/commit/cd2fd4d584c3c23ea4f74c9b2f363a55200594c9)) * retry incorrect Content-Type when response has blocked status code ([#2176](https://github.com/apify/crawlee/issues/2176)) ([b54fb8b](https://github.com/apify/crawlee/commit/b54fb8bb7bc3575195ee676d21e5feb8f898ef47)), closes [#1994](https://github.com/apify/crawlee/issues/1994) ### Features[​](#features-18 "Direct link to Features") * **core:** add `crawler.exportData()` helper ([#2166](https://github.com/apify/crawlee/issues/2166)) ([c8c09a5](https://github.com/apify/crawlee/commit/c8c09a54a712689969ff1f6bddf70f12a2a22670)) * got-scraping v4 ([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/core ## 
[3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") ### Bug Fixes[​](#bug-fixes-34 "Direct link to Bug Fixes") * RQ request count is consistent after migration ([#2116](https://github.com/apify/crawlee/issues/2116)) ([9ab8c18](https://github.com/apify/crawlee/commit/9ab8c1874f52acc3f0337fdabd36321d0fb40b86)), closes [#1855](https://github.com/apify/crawlee/issues/1855) [#1855](https://github.com/apify/crawlee/issues/1855) ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") ### Bug Fixes[​](#bug-fixes-35 "Direct link to Bug Fixes") * **types:** re-export RequestQueueOptions as an alias to RequestProviderOptions ([#2109](https://github.com/apify/crawlee/issues/2109)) ([0900f76](https://github.com/apify/crawlee/commit/0900f76742475c19a777733462e38c5a3a9b86b7)) ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-36 "Direct link to Bug Fixes") * session pool leaks memory on multiple crawler runs ([#2083](https://github.com/apify/crawlee/issues/2083)) ([b96582a](https://github.com/apify/crawlee/commit/b96582a200e25ec11124da1f7f84a2b16b64d133)), closes [#2074](https://github.com/apify/crawlee/issues/2074) [#2031](https://github.com/apify/crawlee/issues/2031) * **types:** make return type of RequestProvider.open and RequestQueue(v2).open strict and accurate ([#2096](https://github.com/apify/crawlee/issues/2096)) ([dfaddb9](https://github.com/apify/crawlee/commit/dfaddb920d9772985e0b54e0ce029cc7d99b1efa)) ### Features[​](#features-19 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") ### Bug Fixes[​](#bug-fixes-37 "Direct link to Bug Fixes") * **core:** allow explicit calls to `purgeDefaultStorage` to wipe the storage on each call ([#2060](https://github.com/apify/crawlee/issues/2060)) ([4831f07](https://github.com/apify/crawlee/commit/4831f073e5639fdfb058588bc23c4b673be70929)) * various helpers opening KVS now respect Configuration ([#2071](https://github.com/apify/crawlee/issues/2071)) ([59dbb16](https://github.com/apify/crawlee/commit/59dbb164699774e5a6718e98d0a4e8f630f35323)) ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-38 "Direct link to Bug Fixes") * **browser-pool:** improve error handling when browser is not found ([#2050](https://github.com/apify/crawlee/issues/2050)) ([282527f](https://github.com/apify/crawlee/commit/282527f31bb366a4e52463212f652dcf6679b6c3)), closes [#1459](https://github.com/apify/crawlee/issues/1459) * crawler instances with different StorageClients do not affect each other ([#2056](https://github.com/apify/crawlee/issues/2056)) ([3f4c863](https://github.com/apify/crawlee/commit/3f4c86352bdbad1c6a8dd10a2c49a1889ca206fa)) * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes 
[#2040](https://github.com/apify/crawlee/issues/2040) ### Features[​](#features-20 "Direct link to Features") * **core:** add default dataset helpers to `BasicCrawler` ([#2057](https://github.com/apify/crawlee/issues/2057)) ([e2a7544](https://github.com/apify/crawlee/commit/e2a7544ddf775db023ca25553d21cb73484fcd8c)) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") ### Bug Fixes[​](#bug-fixes-39 "Direct link to Bug Fixes") * make the `Request` constructor options typesafe ([#2034](https://github.com/apify/crawlee/issues/2034)) ([75e7d65](https://github.com/apify/crawlee/commit/75e7d6554a1875e80e5c54f3877bb6e3daf6cdd7)) ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") ### Bug Fixes[​](#bug-fixes-40 "Direct link to Bug Fixes") * add `Request.maxRetries` to the `RequestOptions` interface ([#2024](https://github.com/apify/crawlee/issues/2024)) ([6433821](https://github.com/apify/crawlee/commit/6433821a59538b1f1cb4f29addd83a259ddda74f)) * log original error message on session rotation ([#2022](https://github.com/apify/crawlee/issues/2022)) ([8a11ffb](https://github.com/apify/crawlee/commit/8a11ffbdaef6b2fe8603aac570c3038f84c2f203)) # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Bug Fixes[​](#bug-fixes-41 "Direct link to Bug Fixes") * **core:** add requests from URL list (`requestsFromUrl`) to the queue in batches ([418fbf8](https://github.com/apify/crawlee/commit/418fbf89d8680f8c460e37cfbf3e521f45770eb2)), closes [#1995](https://github.com/apify/crawlee/issues/1995) * **core:** support relative links in `enqueueLinks` explicitly provided via `urls` option ([#2014](https://github.com/apify/crawlee/issues/2014)) ([cbd9d08](https://github.com/apify/crawlee/commit/cbd9d08065694b8c86e32c773875cecd41e5fcc9)), closes [#2005](https://github.com/apify/crawlee/issues/2005) ### Features[​](#features-21 "Direct link to Features") * **core:** use `RequestQueue.addBatchedRequests()` in `enqueueLinks` helper ([4d61ca9](https://github.com/apify/crawlee/commit/4d61ca934072f8bbb680c842d8b1c9a4452ee73a)), closes [#1995](https://github.com/apify/crawlee/issues/1995) * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Features[​](#features-22 "Direct link to Features") * **core:** add `RequestQueue.addRequestsBatched()` that is non-blocking ([#1996](https://github.com/apify/crawlee/issues/1996)) ([c85485d](https://github.com/apify/crawlee/commit/c85485d6ca2bb61cfebb24a2ad99e0b3ba5c069b)), closes [#1995](https://github.com/apify/crawlee/issues/1995) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") ### Bug Fixes[​](#bug-fixes-42 "Direct link to Bug Fixes") * **http-crawler:** replace `IncomingMessage` with `PlainResponse` for context's `response` ([#1973](https://github.com/apify/crawlee/issues/1973)) ([2a1cc7f](https://github.com/apify/crawlee/commit/2a1cc7f4f87f0b1c657759076a236a8f8d9b76ba)), closes [#1964](https://github.com/apify/crawlee/issues/1964) # 
[3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Features[​](#features-23 "Direct link to Features") * add LinkeDOMCrawler ([#1907](https://github.com/apify/crawlee/issues/1907)) ([1c69560](https://github.com/apify/crawlee/commit/1c69560fe7ef45097e6be1037b79a84eb9a06337)), closes [/github.com/apify/crawlee/pull/1890#issuecomment-1533271694](https://github.com//github.com/apify/crawlee/pull/1890/issues/issuecomment-1533271694) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") ### Features[​](#features-24 "Direct link to Features") * add support for `requestsFromUrl` to `RequestQueue` ([#1917](https://github.com/apify/crawlee/issues/1917)) ([7f2557c](https://github.com/apify/crawlee/commit/7f2557cdbbdee177db7c5970ae5a4881b7bc9b35)) * **core:** add `Request.maxRetries` to allow overriding the `maxRequestRetries` ([#1925](https://github.com/apify/crawlee/issues/1925)) ([c5592db](https://github.com/apify/crawlee/commit/c5592db0f8094de27c46ad993bea2c1ab1f61385)) ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Bug Fixes[​](#bug-fixes-43 "Direct link to Bug Fixes") * respect config object when creating `SessionPool` ([#1881](https://github.com/apify/crawlee/issues/1881)) ([db069df](https://github.com/apify/crawlee/commit/db069df80bc183c6b861c9ac82f1e278e57ea92b)) ### Features[​](#features-25 "Direct link to Features") * allow running single crawler instance multiple times ([#1844](https://github.com/apify/crawlee/issues/1844)) ([9e6eb1e](https://github.com/apify/crawlee/commit/9e6eb1e32f582a8837311aac12cc1d657432f3fa)), closes [#765](https://github.com/apify/crawlee/issues/765) * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) * support alternate storage clients when opening storages ([#1901](https://github.com/apify/crawlee/issues/1901)) ([661e550](https://github.com/apify/crawlee/commit/661e550dcf3609b75e2d7bc225c2f6914f45c93e)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-44 "Direct link to Bug Fixes") * **Storage:** queue up opening storages to prevent issues in concurrent calls ([#1865](https://github.com/apify/crawlee/issues/1865)) ([044c740](https://github.com/apify/crawlee/commit/044c740101dd0acd2248dee3702aec769ce0c892)) * try to detect stuck request queue and fix its state ([#1837](https://github.com/apify/crawlee/issues/1837)) ([95a9f94](https://github.com/apify/crawlee/commit/95a9f941836c020a3223fd309f11cff58bc50624)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-45 "Direct link to Bug Fixes") * ignore invalid URLs in `enqueueLinks` in browser crawlers ([#1803](https://github.com/apify/crawlee/issues/1803)) ([5ac336c](https://github.com/apify/crawlee/commit/5ac336c5b83b212fd6281659b8ceee091e259ff1)) ### Features[​](#features-26 "Direct link to Features") * **core:** add `exclude` option to `enqueueLinks` ([#1786](https://github.com/apify/crawlee/issues/1786)) ([2e833dc](https://github.com/apify/crawlee/commit/2e833dc4b0b82bb6741aa683f3fcba05244427df)), closes [#1785](https://github.com/apify/crawlee/issues/1785) ## 
[3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/core ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") ### Bug Fixes[​](#bug-fixes-46 "Direct link to Bug Fixes") * add `QueueOperationInfo` export to the core package ([5ec6c24](https://github.com/apify/crawlee/commit/5ec6c24ba31c11c0ff4db49a6461f112a70071b3)) # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-47 "Direct link to Bug Fixes") * clone `request.userData` when creating new request object ([#1728](https://github.com/apify/crawlee/issues/1728)) ([222ef59](https://github.com/apify/crawlee/commit/222ef59b646740ae46be011ea0bc3d11c51a553e)), closes [#1725](https://github.com/apify/crawlee/issues/1725) * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) * ensure CrawlingContext interface is inferred correctly in route handlers ([aa84633](https://github.com/apify/crawlee/commit/aa84633b1a2007c2e91bf012e944433b21243f2e)) * **utils:** add missing dependency on `ow` ([bf0e03c](https://github.com/apify/crawlee/commit/bf0e03cc6ddc103c9337de5cd8dce9bc86c369a3)), closes [#1716](https://github.com/apify/crawlee/issues/1716) ### Features[​](#features-27 "Direct link to Features") * **enqueueLinks:** add SameOrigin strategy and relax protocol matching for the other strategies ([#1748](https://github.com/apify/crawlee/issues/1748)) ([4ba982a](https://github.com/apify/crawlee/commit/4ba982a909a3c16004b24ef90c3da3ee4e075be0)) ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/core ## [3.1.2](https://github.com/apify/crawlee/compare/v3.1.1...v3.1.2) (2022-11-15)[​](#312-2022-11-15 "Direct link to 312-2022-11-15") ### Bug Fixes[​](#bug-fixes-48 "Direct link to Bug Fixes") * injectJQuery in context does not survive navs ([#1661](https://github.com/apify/crawlee/issues/1661)) ([493a7cf](https://github.com/apify/crawlee/commit/493a7cff569cb12cfd9aa5e0f4fcb9de686eb41f)) * make router error message more helpful for undefined routes ([#1678](https://github.com/apify/crawlee/issues/1678)) ([ab359d8](https://github.com/apify/crawlee/commit/ab359d84f2ebdac69441ae84dcade1bca7714390)) * **MemoryStorage:** correctly respect the desc option ([#1666](https://github.com/apify/crawlee/issues/1666)) ([b5f37f6](https://github.com/apify/crawlee/commit/b5f37f66a50b2d546eca24a699cf92cb683b7026)) * requestHandlerTimeout timing ([#1660](https://github.com/apify/crawlee/issues/1660)) ([493ea0c](https://github.com/apify/crawlee/commit/493ea0ce80e55ece5a8881a6aea6674918873b35)) * shallow clone browserPoolOptions before normalization ([#1665](https://github.com/apify/crawlee/issues/1665)) ([22467ca](https://github.com/apify/crawlee/commit/22467ca81ad9464d528495333f62a60f2ea0487c)) * support headfull mode in playwright js project template ([ea2e61b](https://github.com/apify/crawlee/commit/ea2e61bc3bfcc9a895a89ad6db415a398bd3b7db)) * support headfull mode in puppeteer js project template ([e6aceb8](https://github.com/apify/crawlee/commit/e6aceb81ed0762f25dde66ff94ccdf8c1a619f7d)) ### Features[​](#features-28 "Direct link to Features") * 
**jsdom-crawler:** add runScripts option ([#1668](https://github.com/apify/crawlee/issues/1668)) ([8ef90bc](https://github.com/apify/crawlee/commit/8ef90bc1c020ddee334dd9a9267f6b6298a27024)) ## [3.1.1](https://github.com/apify/crawlee/compare/v3.1.0...v3.1.1) (2022-11-07)[​](#311-2022-11-07 "Direct link to 311-2022-11-07") ### Bug Fixes[​](#bug-fixes-49 "Direct link to Bug Fixes") * `utils.playwright.blockRequests` warning message ([#1632](https://github.com/apify/crawlee/issues/1632)) ([76549eb](https://github.com/apify/crawlee/commit/76549eb250a39e961b7f567ad0610af136d1c79f)) * concurrency option override order ([#1649](https://github.com/apify/crawlee/issues/1649)) ([7bbad03](https://github.com/apify/crawlee/commit/7bbad0380cd6de3fdca79ba57e1fef1d22bd56f8)) * handle non-error objects thrown gracefully ([#1652](https://github.com/apify/crawlee/issues/1652)) ([c3a4e1a](https://github.com/apify/crawlee/commit/c3a4e1a9b7d0b80a8e889bdcb394fc0be3905c6f)) * mark session as bad on failed requests ([#1647](https://github.com/apify/crawlee/issues/1647)) ([445ae43](https://github.com/apify/crawlee/commit/445ae4321816bc418a83c02fb52e64df96bfb0a9)) * support reloading of sessions with lots of retries ([ebc89d2](https://github.com/apify/crawlee/commit/ebc89d2d69d5a2da6eb4e37de59ea39daf81f8f8)) * fix type errors when `playwright` is not installed ([#1637](https://github.com/apify/crawlee/issues/1637)) ([de9db0c](https://github.com/apify/crawlee/commit/de9db0c2b24019d2e1dd43206dd7f149ecdc679a)) * upgrade to ([#1623](https://github.com/apify/crawlee/issues/1623)) ([ce36d6b](https://github.com/apify/crawlee/commit/ce36d6bd60c7adb113759126b3cb15ca222e94d0)) ### Features[​](#features-29 "Direct link to Features") * add static `set` and `useStorageClient` shortcuts to `Configuration` ([2e66fa2](https://github.com/apify/crawlee/commit/2e66fa2fad84aee2dca08b386916b465a0c012a3)) * enable migration testing ([#1583](https://github.com/apify/crawlee/issues/1583)) ([ee3a68f](https://github.com/apify/crawlee/commit/ee3a68fff1fcdf941c9a1d3734107635e9a12049)) * **playwright:** disable animations when taking screenshots ([#1601](https://github.com/apify/crawlee/issues/1601)) ([4e63034](https://github.com/apify/crawlee/commit/4e63034c7b87de405edbd84f9b1803aa101f5c78)) # [3.1.0](https://github.com/apify/crawlee/compare/v3.0.4...v3.1.0) (2022-10-13) ### Bug Fixes[​](#bug-fixes-50 "Direct link to Bug Fixes") * add overload for `KeyValueStore.getValue` with defaultValue ([#1541](https://github.com/apify/crawlee/issues/1541)) ([e3cb509](https://github.com/apify/crawlee/commit/e3cb509cb433e72e058b08a323dc7564e858f547)) * add retry attempts to methods in CLI ([#1588](https://github.com/apify/crawlee/issues/1588)) ([9142e59](https://github.com/apify/crawlee/commit/9142e598de68cc86d82825823c87b82a52c7b305)) * allow `label` in `enqueueLinksByClickingElements` options ([#1525](https://github.com/apify/crawlee/issues/1525)) ([18b7c25](https://github.com/apify/crawlee/commit/18b7c25592eaaa4a9f97cacc6e7154528ce54bf6)) * **basic-crawler:** handle `request.noRetry` after `errorHandler` ([#1542](https://github.com/apify/crawlee/issues/1542)) ([2a2040e](https://github.com/apify/crawlee/commit/2a2040e13209aff5e64ee47194940182b686b3a7)) * build storage classes by using `this` instead of the class ([#1596](https://github.com/apify/crawlee/issues/1596)) ([2b14eb7](https://github.com/apify/crawlee/commit/2b14eb7240d10760518e047095766084a3d255e3)) * correct some typing exports ([#1527](https://github.com/apify/crawlee/issues/1527)) 
([4a136e5](https://github.com/apify/crawlee/commit/4a136e59e128f0a80ad4a1b98b87449647f23f43)) * do not hide stack trace of (retried) Type/Syntax/ReferenceErrors ([469b4b5](https://github.com/apify/crawlee/commit/469b4b58f1c19699d05da84f5f09a95d682421f0)) * **enqueueLinks:** ensure the enqueue strategy is respected alongside user patterns ([#1509](https://github.com/apify/crawlee/issues/1509)) ([2b0eeed](https://github.com/apify/crawlee/commit/2b0eeed3c5b0a69265f7d0567028e5707af4835b)) * **enqueueLinks:** prevent useless request creations when filtering by user patterns ([#1510](https://github.com/apify/crawlee/issues/1510)) ([cb8fe36](https://github.com/apify/crawlee/commit/cb8fe3664db1bd4cba9c2b2185e96bceddabb333)) * export `Cookie` from `crawlee` metapackage ([7b02ceb](https://github.com/apify/crawlee/commit/7b02cebc6920da9bd36d63802df0f7d6abec3887)) * handle redirect cookies ([#1521](https://github.com/apify/crawlee/issues/1521)) ([2f7fc7c](https://github.com/apify/crawlee/commit/2f7fc7cc1d27553d94a915667f0e6d2af599a80c)) * **http-crawler:** do not hang on POST without payload ([#1546](https://github.com/apify/crawlee/issues/1546)) ([8c87390](https://github.com/apify/crawlee/commit/8c87390e0db1924f463019cc55dfc265b12db2a9)) * remove undeclared dependency on core package from puppeteer utils ([827ae60](https://github.com/apify/crawlee/commit/827ae60d6c77e8c7271408493c3750a67ef8a9b4)) * support TypeScript 4.8 ([#1507](https://github.com/apify/crawlee/issues/1507)) ([4c3a504](https://github.com/apify/crawlee/commit/4c3a5045931a7f270bf8eda8a6417466b32fc99b)) * wait for persist state listeners to run when event manager closes ([#1481](https://github.com/apify/crawlee/issues/1481)) ([aa550ed](https://github.com/apify/crawlee/commit/aa550edf7e016497e8e0323e18b14bf32b416155)) ### Features[​](#features-30 "Direct link to Features") * add `Dataset.exportToValue` ([#1553](https://github.com/apify/crawlee/issues/1553)) ([acc6344](https://github.com/apify/crawlee/commit/acc6344f0e52854b4c4c833dbf7aede2547c111e)) * add `Dataset.getData()` shortcut ([522ed6e](https://github.com/apify/crawlee/commit/522ed6e209aea4aa8285ddbb336f027a36cfb6bc)) * add `utils.downloadListOfUrls` to crawlee metapackage ([7b33b0a](https://github.com/apify/crawlee/commit/7b33b0a582a75758cfca53e3ed92d6d3e392b601)) * add `utils.parseOpenGraph()` ([#1555](https://github.com/apify/crawlee/issues/1555)) ([059f85e](https://github.com/apify/crawlee/commit/059f85ebe577888d448b196f89d0f4ec1dff371e)) * add `utils.playwright.compileScript` ([#1559](https://github.com/apify/crawlee/issues/1559)) ([2e14162](https://github.com/apify/crawlee/commit/2e141625f27aa58e2195ab37ed2e31691b58f4c0)) * add `utils.playwright.infiniteScroll` ([#1543](https://github.com/apify/crawlee/issues/1543)) ([60c8289](https://github.com/apify/crawlee/commit/60c8289571f3b6bce908ef7d1636b59faebdbf87)), closes [#1528](https://github.com/apify/crawlee/issues/1528) * add `utils.playwright.saveSnapshot` ([#1544](https://github.com/apify/crawlee/issues/1544)) ([a4ceef0](https://github.com/apify/crawlee/commit/a4ceef044f0c5afdfd964dd1163a260463a60f52)) * add global `useState` helper ([#1551](https://github.com/apify/crawlee/issues/1551)) ([2b03177](https://github.com/apify/crawlee/commit/2b0317772a2bb0d29b73ff86719caf9db394d507)) * add static `Dataset.exportToValue` ([#1564](https://github.com/apify/crawlee/issues/1564)) ([a7c17d4](https://github.com/apify/crawlee/commit/a7c17d434559785d66c1220d22ea79961bda2eec)) * allow disabling storage persistence 
([#1539](https://github.com/apify/crawlee/issues/1539)) ([f65e3c6](https://github.com/apify/crawlee/commit/f65e3c6a7e1efc02fac5f32046bb27da5a1c8e78)) * bump puppeteer support to 17.x ([#1519](https://github.com/apify/crawlee/issues/1519)) ([b97a852](https://github.com/apify/crawlee/commit/b97a85282b64cfb6d48b0aa71f5cc79525a80295)) * **core:** add `forefront` option to `enqueueLinks` helper ([f8755b6](https://github.com/apify/crawlee/commit/f8755b633212138671a76a8d5e0af17c12d46e10)), closes [#1595](https://github.com/apify/crawlee/issues/1595) * don't close page before calling errorHandler ([#1548](https://github.com/apify/crawlee/issues/1548)) ([1c8cd82](https://github.com/apify/crawlee/commit/1c8cd82611e93e4991b49b8ba2f1842457875680)) * enqueue links by clicking for Playwright ([#1545](https://github.com/apify/crawlee/issues/1545)) ([3d25ade](https://github.com/apify/crawlee/commit/3d25adefa7570433a9fa636941684bc2701b8ddd)) * error tracker ([#1467](https://github.com/apify/crawlee/issues/1467)) ([6bfe1ce](https://github.com/apify/crawlee/commit/6bfe1ce0161f1e26f97e2b8e5c02ec9ca608fe30)) * make the CLI download directly from GitHub ([#1540](https://github.com/apify/crawlee/issues/1540)) ([3ff398a](https://github.com/apify/crawlee/commit/3ff398a2f114760d33c43b5bc0c2447e2e48a72e)) * **router:** add userdata generic to addHandler ([#1547](https://github.com/apify/crawlee/issues/1547)) ([19cdf13](https://github.com/apify/crawlee/commit/19cdf1380abdf9aa8f337a96a4666f8f650bad69)) * use JSON5 for `INPUT.json` to support comments ([#1538](https://github.com/apify/crawlee/issues/1538)) ([09133ff](https://github.com/apify/crawlee/commit/09133ffa744436b60fc452b4f97caf1a18ebfced)) ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Features[​](#features-31 "Direct link to Features") * bump puppeteer support to 15.1 ### Bug Fixes[​](#bug-fixes-51 "Direct link to Bug Fixes") * key value stores emitting an error when multiple write promises ran in parallel ([#1460](https://github.com/apify/crawlee/issues/1460)) ([f201cca](https://github.com/apify/crawlee/commit/f201cca4a99d1c8b3e87be0289d5b3b363048f09)) * fix dockerfiles in project templates ## [3.0.3](https://github.com/apify/crawlee/compare/v3.0.2...v3.0.3) (2022-08-11)[​](#303-2022-08-11 "Direct link to 303-2022-08-11") ### Fixes[​](#fixes "Direct link to Fixes") * add missing configuration to CheerioCrawler constructor ([#1432](https://github.com/apify/crawlee/pull/1432)) * sendRequest types ([#1445](https://github.com/apify/crawlee/pull/1445)) * respect `headless` option in browser crawlers ([#1455](https://github.com/apify/crawlee/pull/1455)) * make `CheerioCrawlerOptions` type more loose ([d871d8c](https://github.com/apify/crawlee/commit/d871d8caf22bc8d8ca1041e4975f3c95eae4b487)) * improve dockerfiles and project templates ([7c21a64](https://github.com/apify/crawlee/commit/7c21a646360d10453f17380f9882ac52d06fedb6)) ### Features[​](#features-32 "Direct link to Features") * add `utils.playwright.blockRequests()` ([#1447](https://github.com/apify/crawlee/pull/1447)) * http-crawler ([#1440](https://github.com/apify/crawlee/pull/1440)) * prefer `/INPUT.json` files for `KeyValueStore.getInput()` ([#1453](https://github.com/apify/crawlee/pull/1453)) * jsdom-crawler ([#1451](https://github.com/apify/crawlee/pull/1451)) * add `RetryRequestError` + add error to the context for BC ([#1443](https://github.com/apify/crawlee/pull/1443)) * add `keepAlive` to crawler options 
([#1452](https://github.com/apify/crawlee/pull/1452)) ## [3.0.2](https://github.com/apify/crawlee/compare/v3.0.1...v3.0.2) (2022-07-28)[​](#302-2022-07-28 "Direct link to 302-2022-07-28") ### Fixes[​](#fixes-1 "Direct link to Fixes") * regression in resolving the base url for enqueue link filtering ([1422](https://github.com/apify/crawlee/pull/1422)) * improve file saving on memory storage ([1421](https://github.com/apify/crawlee/pull/1421)) * add `UserData` type argument to `CheerioCrawlingContext` and related interfaces ([1424](https://github.com/apify/crawlee/pull/1424)) * always limit `desiredConcurrency` to the value of `maxConcurrency` ([bcb689d](https://github.com/apify/crawlee/commit/bcb689d4cb90835136295d879e710969ebaf29fa)) * wait for storage to finish before resolving `crawler.run()` ([9d62d56](https://github.com/apify/crawlee/commit/9d62d565c2ff8d058164c22333b07b7d2bf79ee0)) * using explicitly typed router with `CheerioCrawler` ([07b7e69](https://github.com/apify/crawlee/commit/07b7e69e1a7b7c89b8a5538279eb6de8be0effde)) * declare dependency on `ow` in `@crawlee/cheerio` package ([be59f99](https://github.com/apify/crawlee/commit/be59f992d2897ce5c02349bbcc62472d99bb2718)) * use `crawlee@^3.0.0` in the CLI templates ([6426f22](https://github.com/apify/crawlee/commit/6426f22ce53fcce91b1d8686577557bae09fc0e9)) * fix building projects with TS when puppeteer and playwright are not installed ([1404](https://github.com/apify/crawlee/pull/1404)) * enqueueLinks should respect full URL of the current request for relative link resolution ([1427](https://github.com/apify/crawlee/pull/1427)) * use `desiredConcurrency: 10` as the default for `CheerioCrawler` ([1428](https://github.com/apify/crawlee/pull/1428)) ### Features[​](#features-33 "Direct link to Features") * feat: allow configuring what status codes will cause session retirement ([1423](https://github.com/apify/crawlee/pull/1423)) * feat: add support for middlewares to the `Router` via `use` method ([1431](https://github.com/apify/crawlee/pull/1431)) ## [3.0.1](https://github.com/apify/crawlee/compare/v3.0.0...v3.0.1) (2022-07-26)[​](#301-2022-07-26 "Direct link to 301-2022-07-26") ### Fixes[​](#fixes-2 "Direct link to Fixes") * remove `JSONData` generic type arg from `CheerioCrawler` in ([#1402](https://github.com/apify/crawlee/pull/1402)) * rename default storage folder to just `storage` in ([#1403](https://github.com/apify/crawlee/pull/1403)) * remove trailing slash for proxyUrl in ([#1405](https://github.com/apify/crawlee/pull/1405)) * run browser crawlers in headless mode by default in ([#1409](https://github.com/apify/crawlee/pull/1409)) * rename interface `FailedRequestHandler` to `ErrorHandler` in ([#1410](https://github.com/apify/crawlee/pull/1410)) * ensure default route is not ignored in `CheerioCrawler` in ([#1411](https://github.com/apify/crawlee/pull/1411)) * add `headless` option to `BrowserCrawlerOptions` in ([#1412](https://github.com/apify/crawlee/pull/1412)) * processing custom cookies in ([#1414](https://github.com/apify/crawlee/pull/1414)) * enqueue link not finding relative links if the checked page is redirected in ([#1416](https://github.com/apify/crawlee/pull/1416)) * fix building projects with TS when puppeteer and playwright are not installed in ([#1404](https://github.com/apify/crawlee/pull/1404)) * calling `enqueueLinks` in browser crawler on page without any links in ([385ca27](https://github.com/apify/crawlee/commit/385ca27c4c50096f2e28bf0da369d6aaf849a73b)) * improve error message when no default route 
provided in ([04c3b6a](https://github.com/apify/crawlee/commit/04c3b6ac2fd151379d57e95bde085e2a098d1b76)) ### Features[​](#features-34 "Direct link to Features") * feat: add parseWithCheerio for puppeteer & playwright in ([#1418](https://github.com/apify/crawlee/pull/1418)) ## [3.0.0](https://github.com/apify/crawlee/compare/v2.3.2...v3.0.0) (2022-07-13)[​](#300-2022-07-13 "Direct link to 300-2022-07-13") This section summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3. ### Crawlee vs Apify SDK[​](#crawlee-vs-apify-sdk "Direct link to Crawlee vs Apify SDK") Up until version 3 of `apify`, the package contained both scraping related tools and Apify platform related helper methods. With v3 we are splitting the whole project into two main parts: * Crawlee, the new web-scraping library, available as `crawlee` package on NPM * Apify SDK, helpers for the Apify platform, available as `apify` package on NPM Moreover, the Crawlee library is published as several packages under `@crawlee` namespace: * `@crawlee/core`: the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes * `@crawlee/basic`: exports `BasicCrawler` * `@crawlee/cheerio`: exports `CheerioCrawler` * `@crawlee/browser`: exports `BrowserCrawler` (which is used for creating `@crawlee/playwright` and `@crawlee/puppeteer`) * `@crawlee/playwright`: exports `PlaywrightCrawler` * `@crawlee/puppeteer`: exports `PuppeteerCrawler` * `@crawlee/memory-storage`: `@apify/storage-local` alternative * `@crawlee/browser-pool`: previously `browser-pool` package * `@crawlee/utils`: utility methods * `@crawlee/types`: holds TS interfaces mainly about the `StorageClient` #### Installing Crawlee[​](#installing-crawlee "Direct link to Installing Crawlee") > As Crawlee is not yet released as `latest`, we need to install from the `next` distribution tag! Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`. ``` npm install crawlee@next ``` Or if all we need is cheerio support, we can install only @crawlee/cheerio ``` npm install @crawlee/cheerio@next ``` When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used. ``` npm install crawlee@next playwright # or npm install @crawlee/playwright@next playwright ``` Alternatively we can also use the `crawlee` meta-package which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. > Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser related utilities can be also found in the crawler packages (e.g. `@crawlee/playwright`). ### Full TypeScript support[​](#full-typescript-support "Direct link to Full TypeScript support") Both Crawlee and Apify SDK are full TypeScript rewrite, so they include up-to-date types in the package. 
For your TypeScript crawlers we recommend using our predefined TypeScript configuration from the `@apify/tsconfig` package. Don't forget to set the `module` and `target` to `ES2022` or above to be able to use top level await. > The `@apify/tsconfig` config has [`noImplicitAny`](https://www.typescriptlang.org/tsconfig#noImplicitAny) enabled, you might want to disable it during the initial development as it will cause build failures if you leave some unused local variables in your code. tsconfig.json ``` { "extends": "@apify/tsconfig", "compilerOptions": { "module": "ES2022", "target": "ES2022", "outDir": "dist", "lib": ["DOM"] }, "include": [ "./src/**/*" ] } ``` #### Docker build[​](#docker-build "Direct link to Docker build") For the `Dockerfile` we recommend using a multi-stage build, so you don't install the dev dependencies like TypeScript in your final image: Dockerfile ``` # using multistage build, as we need dev deps to build the TS source code FROM apify/actor-node:16 AS builder # copy all files, install all dependencies (including dev deps) and build the project COPY . ./ RUN npm install --include=dev \ && npm run build # create final image FROM apify/actor-node:16 # copy only necessary files COPY --from=builder /usr/src/app/package*.json ./ COPY --from=builder /usr/src/app/README.md ./ COPY --from=builder /usr/src/app/dist ./dist COPY --from=builder /usr/src/app/apify.json ./apify.json COPY --from=builder /usr/src/app/INPUT_SCHEMA.json ./INPUT_SCHEMA.json # install only prod deps RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional \ && echo "Installed NPM packages:" \ && (npm list --only=prod --no-optional --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # run compiled code CMD npm run start:prod ``` ### Browser fingerprints[​](#browser-fingerprints "Direct link to Browser fingerprints") Previously we had a magical `stealth` option in the puppeteer crawler that enabled several tricks aiming to mimic real users as much as possible. While this worked to a certain degree, we decided to replace it with generated browser fingerprints. In case we don't want to have dynamic fingerprints, we can disable this behaviour via `useFingerprints` in `browserPoolOptions`: ``` const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: false, }, }); ``` ### Session cookie method renames[​](#session-cookie-method-renames "Direct link to Session cookie method renames") Previously, if we wanted to get or add cookies for the session that would be used for the request, we had to call `session.getPuppeteerCookies()` or `session.setPuppeteerCookies()`. Since these methods could be used for any of our crawlers, not just `PuppeteerCrawler`, they have been renamed to `session.getCookies()` and `session.setCookies()` respectively. Otherwise, their usage is exactly the same! ### Memory storage[​](#memory-storage "Direct link to Memory storage") When we store some data or intermediate state (like the one `RequestQueue` holds), we now use `@crawlee/memory-storage` by default. It is an alternative to `@apify/storage-local` that stores the state in memory (as opposed to the SQLite database used by `@apify/storage-local`). While the state is stored in memory, it is also dumped to the file system, so we can observe it, and the existing data stored in the KeyValueStore (e.g. the `INPUT.json` file) is respected as well. 
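If we don't want the memory storage to write anything to disk at all (for example in tests), we can turn the persistence off. A minimal sketch, assuming the `persistStorage` option described in the `Configuration` reference further below (the equivalent env var is `CRAWLEE_PERSIST_STORAGE`):

```
import { CheerioCrawler, Configuration } from 'crawlee';

// keep all storages purely in memory - nothing gets dumped to the ./storage folder
const config = new Configuration({ persistStorage: false });

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
}, config);
```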
When we want to run the crawler on the Apify platform, we need to use `Actor.init` or `Actor.main`, which will automatically switch the storage client to `ApifyClient` when on the Apify platform. We can still use `@apify/storage-local`; to do so, install it first and pass it to the `Actor.init` or `Actor.main` options: > `@apify/storage-local` v2.1.0+ is required for Crawlee ``` import { Actor } from 'apify'; import { ApifyStorageLocal } from '@apify/storage-local'; const storage = new ApifyStorageLocal(/* options like `enableWalMode` belong here */); await Actor.init({ storage }); ``` ### Purging of the default storage[​](#purging-of-the-default-storage "Direct link to Purging of the default storage") Previously the state was preserved between local runs, and we had to use the `--purge` argument of the `apify-cli`. With Crawlee, this is now the default behaviour: we purge the storage automatically on every `Actor.init/main` call. We can opt out of it via `purge: false` in the `Actor.init` options. ### Renamed crawler options and interfaces[​](#renamed-crawler-options-and-interfaces "Direct link to Renamed crawler options and interfaces") Some options were renamed to better reflect what they do. We still support all the old parameter names too, but not at the TS level. * `handleRequestFunction` -> `requestHandler` * `handlePageFunction` -> `requestHandler` * `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs` * `handlePageTimeoutSecs` -> `requestHandlerTimeoutSecs` * `requestTimeoutSecs` -> `navigationTimeoutSecs` * `handleFailedRequestFunction` -> `failedRequestHandler` We also renamed the crawling context interfaces, so they follow the same convention and are more meaningful: * `CheerioHandlePageInputs` -> `CheerioCrawlingContext` * `PlaywrightHandlePageFunction` -> `PlaywrightCrawlingContext` * `PuppeteerHandlePageFunction` -> `PuppeteerCrawlingContext` ### Context aware helpers[​](#context-aware-helpers "Direct link to Context aware helpers") Some utilities previously available under the `Apify.utils` namespace are now moved to the crawling context and are *context aware*. This means they have some parameters automatically filled in from the context, like the current `Request` instance or current `Page` object, or the `RequestQueue` bound to the crawler. #### Enqueuing links[​](#enqueuing-links "Direct link to Enqueuing links") One common helper that received more attention is `enqueueLinks`. As mentioned above, it is context aware - we no longer need to pass in the `requestQueue` or `page` arguments (or the cheerio handle `$`). In addition to that, it now offers 3 enqueuing strategies: * `EnqueueStrategy.All` (`'all'`): Matches any URLs found * `EnqueueStrategy.SameHostname` (`'same-hostname'`): Matches any URLs that have the same subdomain as the base URL (default) * `EnqueueStrategy.SameDomain` (`'same-domain'`): Matches any URLs that have the same domain name. For example, `https://wow.an.example.com` and `https://example.com` will both be matched for a base URL of `https://example.com`. This means we can even call `enqueueLinks()` without any parameters. By default, it will go through all the links found on the current page and filter only those targeting the same subdomain. 
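If the default strategy is not what we need, we can pick one explicitly, either via the `EnqueueStrategy` enum or its string form. A minimal sketch:

```
import { EnqueueStrategy, PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        // equivalent to passing the string 'same-domain'
        await enqueueLinks({ strategy: EnqueueStrategy.SameDomain });
    },
});
```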
Moreover, we can specify patterns the URL should match via globs: ``` const crawler = new PlaywrightCrawler({ async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: ['https://apify.com/*/*'], // we can also use `regexps` and `pseudoUrls` keys here }); }, }); ``` ### Implicit `RequestQueue` instance[​](#implicit-requestqueue-instance "Direct link to implicit-requestqueue-instance") All crawlers now have the `RequestQueue` instance automatically available via the `crawler.getRequestQueue()` method. It will create the instance for you if it does not exist yet. This means we no longer need to create the `RequestQueue` instance manually, and we can just use the `crawler.addRequests()` method described below. > We can still create the `RequestQueue` explicitly; the `crawler.getRequestQueue()` method will respect that and return the instance provided via crawler options. ### `crawler.addRequests()`[​](#crawleraddrequests "Direct link to crawleraddrequests") We can now add multiple requests in batches. The newly added `addRequests` method will handle everything for us. It enqueues the first 1000 requests and resolves, while continuing with the rest in the background, again in smaller batches of 1000 items, so we don't fall into any API rate limits. This means the crawling will start almost immediately (within a few seconds at most), something previously possible only with a combination of `RequestQueue` and `RequestList`. ``` // will resolve right after the initial batch of 1000 requests is added const result = await crawler.addRequests([/* many requests, can be even millions */]); // if we want to wait for all the requests to be added, we can await the `waitForAllRequestsToBeAdded` promise await result.waitForAllRequestsToBeAdded; ``` ### Less verbose error logging[​](#less-verbose-error-logging "Direct link to Less verbose error logging") Previously, an error thrown from inside the request handler resulted in the full error object being logged. With Crawlee, we log only the error message as a warning as long as we know the request will be retried. If you want to enable verbose logging like in v2, use the `CRAWLEE_VERBOSE_LOG` env var. ### Removal of `requestAsBrowser`[​](#removal-of-requestasbrowser "Direct link to removal-of-requestasbrowser") In v1 we replaced the underlying implementation of `requestAsBrowser` to be just a proxy over calling [`got-scraping`](https://github.com/apify/got-scraping) - our custom extension to `got` that tries to mimic the real browsers as much as possible. With v3, we are removing `requestAsBrowser`, encouraging the use of [`got-scraping`](https://github.com/apify/got-scraping) directly. For easier migration, we also added the `context.sendRequest()` helper that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping): ``` const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { // we can use the options parameter to override gotScraping options const res = await sendRequest({ responseType: 'json' }); log.info('received body', res.body); }, }); ``` #### How to use `sendRequest()`?[​](#how-to-use-sendrequest "Direct link to how-to-use-sendrequest") See [the Got Scraping guide](https://crawlee.dev/js/docs/guides/got-scraping.md). #### Removed options[​](#removed-options "Direct link to Removed options") The `useInsecureHttpParser` option has been removed. It's permanently set to `true` in order to better mimic browsers' behavior. 
Got Scraping automatically performs protocol negotiation, hence we removed the `useHttp2` option. It's set to `true`, as 100% of browsers nowadays are capable of HTTP/2 requests, and more and more of the web is using it too. #### Renamed options[​](#renamed-options "Direct link to Renamed options") In the `requestAsBrowser` approach, some of the options were named differently. Here's a list of renamed options: ##### `payload`[​](#payload "Direct link to payload") This option represents the body to send. It could be a `string` or a `Buffer`. However, there is no `payload` option anymore. You need to use `body` instead. Or, if you wish to send JSON, `json`. Here's an example: ``` // Before: await Apify.utils.requestAsBrowser({ …, payload: 'Hello, world!' }); await Apify.utils.requestAsBrowser({ …, payload: Buffer.from('c0ffe', 'hex') }); await Apify.utils.requestAsBrowser({ …, json: { hello: 'world' } }); // After: await gotScraping({ …, body: 'Hello, world!' }); await gotScraping({ …, body: Buffer.from('c0ffe', 'hex') }); await gotScraping({ …, json: { hello: 'world' } }); ``` ##### `ignoreSslErrors`[​](#ignoresslerrors "Direct link to ignoresslerrors") It has been renamed to `https.rejectUnauthorized`. By default, it's set to `false` for convenience. However, if you want to make sure the connection is secure, you can do the following: ``` // Before: await Apify.utils.requestAsBrowser({ …, ignoreSslErrors: false }); // After: await gotScraping({ …, https: { rejectUnauthorized: true } }); ``` Please note: the meanings are opposite, so we needed to invert the values as well. ##### `header-generator` options[​](#header-generator-options "Direct link to header-generator-options") `useMobileVersion`, `languageCode` and `countryCode` no longer exist. Instead, you need to use `headerGeneratorOptions` directly: ``` // Before: await Apify.utils.requestAsBrowser({ …, useMobileVersion: true, languageCode: 'en', countryCode: 'US', }); // After: await gotScraping({ …, headerGeneratorOptions: { devices: ['mobile'], // or ['desktop'] locales: ['en-US'], }, }); ``` ##### `timeoutSecs`[​](#timeoutsecs "Direct link to timeoutsecs") In order to set a timeout, use `timeout.request` (which is in **milliseconds** now). ``` // Before: await Apify.utils.requestAsBrowser({ …, timeoutSecs: 30, }); // After: await gotScraping({ …, timeout: { request: 30 * 1000, }, }); ``` ##### `throwOnHttpErrors`[​](#throwonhttperrors "Direct link to throwonhttperrors") `throwOnHttpErrors` → `throwHttpErrors`. This option throws on unsuccessful HTTP status codes, for example `404`. By default, it's set to `false`. ##### `decodeBody`[​](#decodebody "Direct link to decodebody") `decodeBody` → `decompress`. This option decompresses the body. Defaults to `true` - please do not change this or websites will break (unless you know what you're doing!). ##### `abortFunction`[​](#abortfunction "Direct link to abortfunction") This function used to make the promise throw on specific responses if it returned `true`. However, it wasn't that useful. You probably want to cancel the request instead, which you can do in the following way: ``` const promise = gotScraping(…); promise.on('request', request => { // Please note this is not a Got Request instance, but a ClientRequest one. // https://nodejs.org/api/http.html#class-httpclientrequest if (request.protocol !== 'https:') { // Insecure request, abort. promise.cancel(); // If you set `isStream` to `true`, please use `stream.destroy()` instead. 
} }); const response = await promise; ``` ### Removal of browser pool plugin mixing[​](#removal-of-browser-pool-plugin-mixing "Direct link to Removal of browser pool plugin mixing") Previously, you were able to have a browser pool that would mix Puppeteer and Playwright plugins (or even your own custom plugins if you've built any). As of this version, that is no longer allowed, and creating such a browser pool will cause an error to be thrown (it's expected that all plugins that will be used are of the same type). ### Handling requests outside of browser[​](#handling-requests-outside-of-browser "Direct link to Handling requests outside of browser") One small feature worth mentioning is the ability to handle requests with browser crawlers outside the browser. To do that, we can use a combination of `Request.skipNavigation` and `context.sendRequest()`. Take a look at how to achieve this by checking out the [Skipping navigation for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example! ### Logging[​](#logging "Direct link to Logging") Crawlee exports the default `log` instance directly as a named export. We also have a scoped `log` instance provided in the crawling context - this one will log messages prefixed with the crawler name and should be preferred for logging inside the request handler. ``` const crawler = new CheerioCrawler({ async requestHandler({ log, request }) { log.info(`Opened ${request.loadedUrl}`); }, }); ``` ### Auto-saved crawler state[​](#auto-saved-crawler-state "Direct link to Auto-saved crawler state") Every crawler instance now has `useState()` method that will return a state object we can use. It will be automatically saved when `persistState` event occurs. The value is cached, so we can freely call this method multiple times and get the exact same reference. No need to worry about saving the value either, as it will happen automatically. ``` const crawler = new CheerioCrawler({ async requestHandler({ crawler }) { const state = await crawler.useState({ foo: [] as number[] }); // just change the value, no need to care about saving it state.foo.push(123); }, }); ``` ### Apify SDK[​](#apify-sdk "Direct link to Apify SDK") The Apify platform helpers can be now found in the Apify SDK (`apify` NPM package). It exports the `Actor` class that offers following static helpers: * `ApifyClient` shortcuts: `addWebhook()`, `call()`, `callTask()`, `metamorph()` * helpers for running on Apify platform: `init()`, `exit()`, `fail()`, `main()`, `isAtHome()`, `createProxyConfiguration()` * storage support: `getInput()`, `getValue()`, `openDataset()`, `openKeyValueStore()`, `openRequestQueue()`, `pushData()`, `setValue()` * events support: `on()`, `off()` * other utilities: `getEnv()`, `newClient()`, `reboot()` `Actor.main` is now just a syntax sugar around calling `Actor.init()` at the beginning and `Actor.exit()` at the end (plus wrapping the user function in try/catch block). All those methods are async and should be awaited - with node 16 we can use the top level await for that. In other words, following is equivalent: ``` import { Actor } from 'apify'; await Actor.init(); // your code await Actor.exit('Crawling finished!'); ``` ``` import { Actor } from 'apify'; await Actor.main(async () => { // your code }, { statusMessage: 'Crawling finished!' }); ``` `Actor.init()` will conditionally set the storage implementation of Crawlee to the `ApifyClient` when running on the Apify platform, or keep the default (memory storage) implementation otherwise. 
It will also subscribe to the websocket events (or mimic them locally). `Actor.exit()` will handle the tear down and calls `process.exit()` to ensure our process won't hang indefinitely for some reason. #### Events[​](#events "Direct link to Events") Apify SDK (v2) exports `Apify.events`, which is an `EventEmitter` instance. With Crawlee, the events are managed by [`EventManager`](https://crawlee.dev/js/api/core/class/EventManager.md) class instead. We can either access it via `Actor.eventManager` getter, or use `Actor.on` and `Actor.off` shortcuts instead. ``` -Apify.events.on(...); +Actor.on(...); ``` > We can also get the [`EventManager`](https://crawlee.dev/js/api/core/class/EventManager.md) instance via `Configuration.getEventManager()`. In addition to the existing events, we now have an `exit` event fired when calling `Actor.exit()` (which is called at the end of `Actor.main()`). This event allows you to gracefully shut down any resources when `Actor.exit` is called. ### Smaller/internal breaking changes[​](#smallerinternal-breaking-changes "Direct link to Smaller/internal breaking changes") * `Apify.call()` is now just a shortcut for running `ApifyClient.actor(actorId).call(input, options)`, while also taking the token inside env vars into account * `Apify.callTask()` is now just a shortcut for running `ApifyClient.task(taskId).call(input, options)`, while also taking the token inside env vars into account * `Apify.metamorph()` is now just a shortcut for running `ApifyClient.task(taskId).metamorph(input, options)`, while also taking the ACTOR\_RUN\_ID inside env vars into account * `Apify.waitForRunToFinish()` has been removed, use `ApifyClient.waitForFinish()` instead * `Actor.main/init` purges the storage by default * remove `purgeLocalStorage` helper, move purging to the storage class directly * `StorageClient` interface now has optional `purge` method * purging happens automatically via `Actor.init()` (you can opt out via `purge: false` in the options of `init/main` methods) * `QueueOperationInfo.request` is no longer available * `Request.handledAt` is now string date in ISO format * `Request.inProgress` and `Request.reclaimed` are now `Set`s instead of POJOs * `injectUnderscore` from puppeteer utils has been removed * `APIFY_MEMORY_MBYTES` is no longer taken into account, use `CRAWLEE_AVAILABLE_MEMORY_RATIO` instead * some `AutoscaledPool` options are no longer available: * `cpuSnapshotIntervalSecs` and `memorySnapshotIntervalSecs` has been replaced with top level `systemInfoIntervalMillis` configuration * `maxUsedCpuRatio` has been moved to the top level configuration * `ProxyConfiguration.newUrlFunction` can be async. `.newUrl()` and `.newProxyInfo()` now return promises. * `prepareRequestFunction` and `postResponseFunction` options are removed, use navigation hooks instead * `gotoFunction` and `gotoTimeoutSecs` are removed * removed compatibility fix for old/broken request queues with null `Request` props * `fingerprintsOptions` renamed to `fingerprintOptions` (`fingerprints` -> `fingerprint`). * `fingerprintOptions` now accept `useFingerprintCache` and `fingerprintCacheSize` (instead of `useFingerprintPerProxyCache` and `fingerprintPerProxyCacheSize`, which are now no longer available). This is because the cached fingerprints are no longer connected to proxy URLs but to sessions. 
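Putting the fingerprint-related renames together, here is a hedged sketch of what the new shape may look like (the cache values are only illustrative, and the exact nesting follows the `browserPoolOptions` shown earlier):

```
const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // the default
        fingerprintOptions: {
            // the cache is now keyed by session rather than by proxy URL
            useFingerprintCache: true,
            fingerprintCacheSize: 1_000,
        },
    },
});
```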
## [2.3.2](https://github.com/apify/crawlee/compare/v2.3.1...v2.3.2) (2022-05-05)[​](#232-2022-05-05 "Direct link to 232-2022-05-05") * fix: use default user agent for playwright with chrome instead of the default "headless UA" * fix: always hide webdriver of chrome browsers ## [2.3.1](https://github.com/apify/crawlee/compare/v2.3.0...v2.3.1) (2022-05-03)[​](#231-2022-05-03 "Direct link to 231-2022-05-03") * fix: `utils.apifyClient` early instantiation (#1330) * feat: `utils.playwright.injectJQuery()` (#1337) * feat: add `keyValueStore` option to `Statistics` class (#1345) * fix: ensure failed req count is correct when using `RequestList` (#1347) * fix: random puppeteer crawler (running in headful mode) failure (#1348) > This should help with the `We either navigate top level or have old version of the navigated frame` bug in puppeteer. * fix: allow returning falsy values in `RequestTransform`'s return type ## [2.3.0](https://github.com/apify/crawlee/compare/v2.2.2...v2.3.0) (2022-04-07)[​](#230-2022-04-07 "Direct link to 230-2022-04-07") * feat: accept more social media patterns (#1286) * feat: add multiple click support to `enqueueLinksByClickingElements` (#1295) * feat: instance-scoped "global" configuration (#1315) * feat: requestList accepts proxyConfiguration for requestsFromUrls (#1317) * feat: update `playwright` to v1.20.2 * feat: update `puppeteer` to v13.5.2 > We noticed that with this version of puppeteer actor run could crash with `We either navigate top level or have old version of the navigated frame` error (puppeteer issue [here](https://github.com/puppeteer/puppeteer/issues/7050)). It should not happen while running the browser in headless mode. In case you need to run the browser in headful mode (`headless: false`), we recommend pinning puppeteer version to `10.4.0` in actor `package.json` file. * feat: stealth deprecation (#1314) * feat: allow passing a stream to KeyValueStore.setRecord (#1325) * fix: use correct apify-client instance for snapshotting (#1308) * fix: automatically reset `RequestQueue` state after 5 minutes of inactivity, closes #997 * fix: improve guessing of chrome executable path on windows (#1294) * fix: prune CPU snapshots locally (#1313) * fix: improve browser launcher types (#1318) ### 0 concurrency mitigation[​](#0-concurrency-mitigation "Direct link to 0 concurrency mitigation") This release should resolve the 0 concurrency bug by automatically resetting the internal `RequestQueue` state after 5 minutes of inactivity. We now track last activity done on a `RequestQueue` instance: * added new request * started processing a request (added to `inProgress` cache) * marked request as handled * reclaimed request If we don't detect one of those actions in last 5 minutes, and we have some requests in the `inProgress` cache, we try to reset the state. We can override this limit via `CRAWLEE_INTERNAL_TIMEOUT` env var. This should finally resolve the 0 concurrency bug, as it was always about stuck requests in the `inProgress` cache. 
## [2.2.2](https://github.com/apify/crawlee/compare/v2.2.1...v2.2.2) (2022-02-14)[​](#222-2022-02-14 "Direct link to 222-2022-02-14") * fix: ensure `request.headers` is set * fix: lower `RequestQueue` API timeout to 30 seconds * improve logging for fetching next request and timeouts ## [2.2.1](https://github.com/apify/crawlee/compare/v2.2.0...v2.2.1) (2022-01-03)[​](#221-2022-01-03 "Direct link to 221-2022-01-03") * fix: ignore requests that are no longer in progress (#1258) * fix: do not use `tryCancel()` from inside sync callback (#1265) * fix: revert to puppeteer 10.x (#1276) * fix: wait when `body` is not available in `infiniteScroll()` from Puppeteer utils (#1238) * fix: expose logger classes on the `utils.log` instance (#1278) ## [2.2.0](https://github.com/apify/crawlee/compare/v2.1.0...v2.2.0) (2021-12-17)[​](#220-2021-12-17 "Direct link to 220-2021-12-17") ### Proxy per page[​](#proxy-per-page "Direct link to Proxy per page") Up until now, browser crawlers used the same session (and therefore the same proxy) for all requests from a single browser; now we get a new proxy for each session. This means that with incognito pages, each page will get a new proxy, aligning the behaviour with `CheerioCrawler`. This feature is not enabled by default. To use it, we need to enable the `useIncognitoPages` flag under `launchContext`: ``` new Apify.PlaywrightCrawler({ launchContext: { useIncognitoPages: true, }, // ... }) ``` > Note that currently there is a performance overhead for using `useIncognitoPages`. Use this flag at your own discretion. We are planning to enable this feature by default in SDK v3.0. ### Abortable timeouts[​](#abortable-timeouts "Direct link to Abortable timeouts") Previously, when a page function timed out, the task still kept running. This could lead to requests being processed multiple times. In v2.2 we now have abortable timeouts that will cancel the task as early as possible. ### Mitigation of zero concurrency issue[​](#mitigation-of-zero-concurrency-issue "Direct link to Mitigation of zero concurrency issue") Several new timeouts were added to the task function, which should help mitigate the zero concurrency bug. Namely, fetching of the next request information and reclaiming failed requests back to the queue are now executed with a timeout, with 3 additional retries before the task fails. The timeout is always at least 300s (5 minutes), or `requestHandlerTimeoutSecs` if that value is higher. 
### Full list of changes[​](#full-list-of-changes "Direct link to Full list of changes") * fix `RequestError: URI malformed` in cheerio crawler (#1205) * only provide Cookie header if cookies are present (#1218) * handle extra cases for `diffCookie` (#1217) * add timeout for task function (#1234) * implement proxy per page in browser crawlers (#1228) * add fingerprinting support (#1243) * implement abortable timeouts (#1245) * add timeouts with retries to `runTaskFunction()` (#1250) * automatically convert google spreadsheet URLs to CSV exports (#1255) ## [2.1.0](https://github.com/apify/crawlee/compare/v2.0.7...v2.1.0) (2021-10-07)[​](#210-2021-10-07 "Direct link to 210-2021-10-07") * automatically convert google docs share urls to csv download ones in request list (#1174) * use puppeteer emulating scrolls instead of `window.scrollBy` (#1170) * warn if apify proxy is used in proxyUrls (#1173) * fix `YOUTUBE_REGEX_STRING` being too greedy (#1171) * add `purgeLocalStorage` utility method (#1187) * catch errors inside request interceptors (#1188, #1190) * add support for cgroups v2 (#1177) * fix incorrect offset in `fixUrl` function (#1184) * support channel and user links in YouTube regex (#1178) * fix: allow passing `requestsFromUrl` to `RequestListOptions` in TS (#1191) * allow passing `forceCloud` down to the KV store (#1186), closes #752 * merge cookies from session with user provided ones (#1201), closes #1197 * use `ApifyClient` v2 (full rewrite to TS) ## [2.0.7](https://github.com/apify/crawlee/compare/v2.0.6...v2.0.7) (2021-09-08)[​](#207-2021-09-08 "Direct link to 207-2021-09-08") * Fix casting of int/bool environment variables (e.g. `APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE`), closes #956 * Fix incognito pages and user data dir (#1145) * Add `@ts-ignore` comments to imports of optional peer dependencies (#1152) * Use config instance in `sdk.openSessionPool()` (#1154) * Add a breaking callback to `infiniteScroll` (#1140) ## [2.0.6](https://github.com/apify/crawlee/compare/v2.0.5...v2.0.6) (2021-08-27)[​](#206-2021-08-27 "Direct link to 206-2021-08-27") * Fix deprecation messages logged from `ProxyConfiguration` and `CheerioCrawler`. * Update `got-scraping` to receive multiple improvements. ## [2.0.5](https://github.com/apify/crawlee/compare/v2.0.4...v2.0.5) (2021-08-24)[​](#205-2021-08-24 "Direct link to 205-2021-08-24") * Fix error handling in puppeteer crawler ## [2.0.4](https://github.com/apify/crawlee/compare/v2.0.3...v2.0.4) (2021-08-23)[​](#204-2021-08-23 "Direct link to 204-2021-08-23") * Use `sessionToken` with `got-scraping` ## [2.0.3](https://github.com/apify/crawlee/compare/v2.0.2...v2.0.3) (2021-08-20)[​](#203-2021-08-20 "Direct link to 203-2021-08-20") * **BREAKING IN EDGE CASES** - We removed `forceUrlEncoding` in `requestAsBrowser` because we found out that recent versions of the underlying HTTP client `got` already encode URLs and `forceUrlEncoding` could lead to weird behavior. We think of this as fixing a bug, so we're not bumping the major version. * Limit `handleRequestTimeoutMillis` to max valid value to prevent Node.js fallback to `1`. * Use `got-scraping@^3.0.1` * Disable SSL validation on MITM proxies * Limit `handleRequestTimeoutMillis` to max valid value ## [2.0.2](https://github.com/apify/crawlee/compare/v2.0.1...v2.0.2) (2021-08-12)[​](#202-2021-08-12 "Direct link to 202-2021-08-12") * Fix serialization issues in `CheerioCrawler` caused by parser conflicts in recent versions of `cheerio`. 
## [2.0.1](https://github.com/apify/crawlee/compare/v2.0.0...v2.0.1) (2021-08-06)[​](#201-2021-08-06 "Direct link to 201-2021-08-06") * Use `got-scraping` 2.0.1 until fully compatible. ## [2.0.0](https://github.com/apify/crawlee/compare/v1.3.4...v2.0.0) (2021-08-05)[​](#200-2021-08-05 "Direct link to 200-2021-08-05") * **BREAKING**: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy. * **BREAKING**: Bump `cheerio` to `1.0.0-rc.10` from `rc.3`. There were breaking changes in `cheerio` between the versions so this bump might be breaking for you as well. * Remove `LiveViewServer` which was deprecated before release of SDK v1. --- # AutoscaledPool Manages a pool of asynchronous resource-intensive tasks that are executed in parallel. The pool only starts new tasks if there is enough free CPU and memory available and the Javascript event loop is not blocked. The information about the CPU and memory usage is obtained by the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) class, which makes regular snapshots of system resources that may be either local or from the Apify cloud infrastructure in case the process is running on the Apify platform. Meaningful data gathered from these snapshots is provided to `AutoscaledPool` by the [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) class. Before running the pool, you need to implement the following three functions: [AutoscaledPoolOptions.runTaskFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction), [AutoscaledPoolOptions.isTaskReadyFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction) and [AutoscaledPoolOptions.isFinishedFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction). The auto-scaled pool is started by calling the [AutoscaledPool.run](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#run) function. The pool periodically queries the [AutoscaledPoolOptions.isTaskReadyFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction) function for more tasks, managing optimal concurrency, until the function resolves to `false`. The pool then queries the [AutoscaledPoolOptions.isFinishedFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction). If it resolves to `true`, the run finishes after all running tasks complete. If it resolves to `false`, it assumes there will be more tasks available later and keeps periodically querying for tasks. If any of the tasks throws then the [AutoscaledPool.run](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#run) function rejects the promise with an error. The pool evaluates whether it should start a new task every time one of the tasks finishes and also in the interval set by the `options.maybeRunIntervalSecs` parameter. **Example usage:** ``` const pool = new AutoscaledPool({ maxConcurrency: 50, runTaskFunction: async () => { // Run some resource-intensive asynchronous operation here. }, isTaskReadyFunction: async () => { // Tell the pool whether more tasks are ready to be processed. // Return true or false }, isFinishedFunction: async () => { // Tell the pool whether it should finish // or wait for more tasks to become available. 
// Return true or false } }); await pool.run(); ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Accessors * [**currentConcurrency](#currentConcurrency) * [**desiredConcurrency](#desiredConcurrency) * [**maxConcurrency](#maxConcurrency) * [**minConcurrency](#minConcurrency) ### Methods * [**abort](#abort) * [**notify](#notify) * [**pause](#pause) * [**resume](#resume) * [**run](#run) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L214)constructor * ****new AutoscaledPool**(options, config): [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) - #### Parameters * ##### options: [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ## Accessors[**](#Accessors) ### [**](#currentConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L354)currentConcurrency * **get currentConcurrency(): number - Gets the number of parallel tasks currently running in the pool. *** #### Returns number ### [**](#desiredConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L338)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L346)desiredConcurrency * **get desiredConcurrency(): number * **set desiredConcurrency(value): void - Gets the desired concurrency for the pool, which is an estimated number of parallel tasks that the system can currently support. *** #### Returns number - Sets the desired concurrency for the pool, i.e. the number of tasks that should be running in parallel if there's large enough supply of tasks. *** #### Parameters * ##### value: number #### Returns void ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L322)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L329)maxConcurrency * **get maxConcurrency(): number * **set maxConcurrency(value): void - Gets the maximum number of tasks running in parallel. *** #### Returns number - Sets the maximum number of tasks running in parallel. *** #### Parameters * ##### value: number #### Returns void ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L304)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L314)minConcurrency * **get minConcurrency(): number * **set minConcurrency(value): void - Gets the minimum number of tasks running in parallel. *** #### Returns number - Sets the minimum number of tasks running in parallel. *WARNING:* If you set this value too high with respect to the available system memory and CPU, your code might run extremely slow or crash. If you're not sure, just keep the default value and the concurrency will scale up automatically. *** #### Parameters * ##### value: number #### Returns void ## Methods[**](#Methods) ### [**](#abort)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L402)abort * ****abort**(): Promise\ - Aborts the run of the auto-scaled pool and destroys it. 
The promise returned from the [AutoscaledPool.run](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#run) function will immediately resolve, no more new tasks will be spawned and all running tasks will be left in their current state. Due to the nature of the tasks, auto-scaled pool cannot reliably guarantee abortion of all the running tasks, therefore, no abortion is attempted and some of the tasks may finish, while others may not. Essentially, auto-scaled pool doesn't care about their state after the invocation of `.abort()`, but that does not mean that some parts of their asynchronous chains of commands will not execute. *** #### Returns Promise\ ### [**](#notify)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L461)notify * ****notify**(): Promise\ - Explicitly check the queue for new tasks. The AutoscaledPool checks the queue for new tasks periodically, every `maybeRunIntervalSecs` seconds. If you want to trigger the processing immediately, use this method. *** #### Returns Promise\ ### [**](#pause)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L421)pause * ****pause**(timeoutSecs): Promise\ - Prevents the auto-scaled pool from starting new tasks, but allows the running ones to finish (unlike abort, which terminates them). Used together with [AutoscaledPool.resume](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#resume) The function's promise will resolve once all running tasks have completed and the pool is effectively idle. If the `timeoutSecs` argument is provided, the promise will reject with a timeout error after the `timeoutSecs` seconds. The promise returned from the [AutoscaledPool.run](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#run) function will not resolve when `.pause()` is invoked (unlike abort, which resolves it). *** #### Parameters * ##### optionaltimeoutSecs: number #### Returns Promise\ ### [**](#resume)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L453)resume * ****resume**(): void - Resumes the operation of the autoscaled-pool by allowing more tasks to be run. Used together with [AutoscaledPool.pause](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) Tasks will automatically start running again in `options.maybeRunIntervalSecs`. *** #### Returns void ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L362)run * ****run**(): Promise\ - Runs the auto-scaled pool. Returns a promise that gets resolved or rejected once all the tasks are finished or one of them fails. *** #### Returns Promise\ --- # Configuration `Configuration` is a value object holding Crawlee configuration. By default, there is a global singleton instance of this class available via `Configuration.getGlobalConfig()`. Places that depend on a configurable behaviour depend on this class, as they have the global instance as the default value. 
*Using global configuration:*

```
import { BasicCrawler, Configuration } from 'crawlee';

// Get the global configuration
const config = Configuration.getGlobalConfig();
// Set the 'persistStateIntervalMillis' option
// of global configuration to 10 seconds
config.set('persistStateIntervalMillis', 10_000);

// No need to pass the configuration to the crawler,
// as it's using the global configuration by default
const crawler = new BasicCrawler();
```

*Using custom configuration:*

```
import { BasicCrawler, Configuration } from 'crawlee';

// Create a new configuration
const config = new Configuration({ persistStateIntervalMillis: 30_000 });
// Pass the configuration to the crawler
const crawler = new BasicCrawler({ ... }, config);
```

The configuration provided via environment variables always takes precedence. We can also define a `crawlee.json` file in the project root directory, which serves as a baseline, so the options provided in the constructor will override those. In other words, the precedence is:

```
crawlee.json < constructor options < environment variables
```

## Supported Configuration Options

| Key | Environment Variable | Default Value |
| :--------------------------- | :-------------------------------------- | :------------ |
| `memoryMbytes` | `CRAWLEE_MEMORY_MBYTES` | - |
| `logLevel` | `CRAWLEE_LOG_LEVEL` | - |
| `headless` | `CRAWLEE_HEADLESS` | `true` |
| `defaultDatasetId` | `CRAWLEE_DEFAULT_DATASET_ID` | `'default'` |
| `defaultKeyValueStoreId` | `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` | `'default'` |
| `defaultRequestQueueId` | `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` | `'default'` |
| `persistStateIntervalMillis` | `CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS` | `60_000` |
| `purgeOnStart` | `CRAWLEE_PURGE_ON_START` | `true` |
| `persistStorage` | `CRAWLEE_PERSIST_STORAGE` | `true` |

## Advanced Configuration Options

| Key | Environment Variable | Default Value |
| :---------------------- | :-------------------------------- | :------------ |
| `inputKey` | `CRAWLEE_INPUT_KEY` | `'INPUT'` |
| `xvfb` | `CRAWLEE_XVFB` | - |
| `chromeExecutablePath` | `CRAWLEE_CHROME_EXECUTABLE_PATH` | - |
| `defaultBrowserPath` | `CRAWLEE_DEFAULT_BROWSER_PATH` | - |
| `disableBrowserSandbox` | `CRAWLEE_DISABLE_BROWSER_SANDBOX` | - |
| `availableMemoryRatio` | `CRAWLEE_AVAILABLE_MEMORY_RATIO` | `0.25` |
| `systemInfoV2` | `CRAWLEE_SYSTEM_INFO_V2` | `false` |
| `containerized` | `CRAWLEE_CONTAINERIZED` | - |

## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**storageManagers](#storageManagers) ### Methods * [**get](#get) * [**getEventManager](#getEventManager) * [**set](#set) * [**useEventManager](#useEventManager) * [**useStorageClient](#useStorageClient) * [**getEventManager](#getEventManager) * [**getGlobalConfig](#getGlobalConfig) * [**getStorageClient](#getStorageClient) * [**resetGlobalState](#resetGlobalState) * [**set](#set) * [**useStorageClient](#useStorageClient) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L318)constructor * ****new Configuration**(options): [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) - Creates a new `Configuration` instance with the provided options. Env vars will have precedence over those.
*** #### Parameters * ##### options: [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) = {} #### Returns [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ## Properties[**](#Properties) ### [**](#storageManagers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L313)publicreadonlystorageManagers **storageManagers: Map\> = ... ## Methods[**](#Methods) ### [**](#get)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L340)get * ****get**\(key, defaultValue): U - Returns configured value. First checks the environment variables, then provided configuration, fallbacks to the `defaultValue` argument if provided, otherwise uses the default value as described in the above section. *** #### Parameters * ##### key: T * ##### optionaldefaultValue: U #### Returns U ### [**](#getEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L421)getEventManager * ****getEventManager**(): [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) - #### Returns [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#set)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L391)set * ****set**(key, value): void - Sets value for given option. Only affects this `Configuration` instance, the value will not be propagated down to the env var. To reset a value, we can omit the `value` argument or pass `undefined` there. *** #### Parameters * ##### key: keyof [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) * ##### optionalvalue: any #### Returns void ### [**](#useEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L465)useEventManager * ****useEventManager**(events): void - #### Parameters * ##### events: [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) #### Returns void ### [**](#useStorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L457)useStorageClient * ****useStorageClient**(client): void - #### Parameters * ##### client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) #### Returns void ### [**](#getEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L491)staticgetEventManager * ****getEventManager**(): [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) - Gets default [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) instance. *** #### Returns [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#getGlobalConfig)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L472)staticgetGlobalConfig * ****getGlobalConfig**(): [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) - Returns the global configuration instance. It will respect the environment variables. *** #### Returns [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#getStorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L484)staticgetStorageClient * ****getStorageClient**(): [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) - Gets default [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) instance. 
*** #### Returns [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#resetGlobalState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L499)staticresetGlobalState * ****resetGlobalState**(): void - Resets global configuration instance. The default instance holds configuration based on env vars, if we want to change them, we need to first reset the global state. Used mainly for testing purposes. *** #### Returns void ### [**](#set)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L399)staticset * ****set**(key, value): void - Sets value for given option. Only affects the global `Configuration` instance, the value will not be propagated down to the env var. To reset a value, we can omit the `value` argument or pass `undefined` there. *** #### Parameters * ##### key: keyof [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) * ##### optionalvalue: any #### Returns void ### [**](#useStorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L461)staticuseStorageClient * ****useStorageClient**(client): void - #### Parameters * ##### client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) #### Returns void --- # CriticalError Errors of `CriticalError` type will shut down the whole crawler. Error handlers catching CriticalError should avoid logging it, as it will be logged by Node.js itself at the end ### Hierarchy * [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) * *CriticalError* * [BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1082)externalconstructor * ****new CriticalError**(message): [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) * ****new CriticalError**(message, options): [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) - Inherited from NonRetryableError.constructor #### Parameters * ##### externaloptionalmessage: string #### Returns [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from NonRetryableError.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from NonRetryableError.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from NonRetryableError.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? 
: string Inherited from NonRetryableError.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from NonRetryableError.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from NonRetryableError.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from NonRetryableError.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from NonRetryableError.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # Dataset \ The `Dataset` class represents a store for structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - you can only add new records to it but you cannot modify or remove existing records. Typically it is used to store crawling results. 
Do not instantiate this class directly, use the [Dataset.open](https://crawlee.dev/js/api/core/class/Dataset.md#open) function instead. `Dataset` stores its data either on local disk or in the Apify cloud, depending on whether the `APIFY_LOCAL_STORAGE_DIR` or `APIFY_TOKEN` environment variables are set. If the `APIFY_LOCAL_STORAGE_DIR` environment variable is set, the data is stored in the local directory in the following files: ``` {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` Note that `{DATASET_ID}` is the name or ID of the dataset. The default dataset has ID: `default`, unless you override it by setting the `APIFY_DEFAULT_DATASET_ID` environment variable. Each dataset item is stored as a separate JSON file, where `{INDEX}` is a zero-based index of the item in the dataset. If the `APIFY_TOKEN` environment variable is set but `APIFY_LOCAL_STORAGE_DIR` not, the data is stored in the [Apify Dataset](https://docs.apify.com/storage/dataset) cloud storage. Note that you can force usage of the cloud storage also by passing the `forceCloud` option to [Dataset.open](https://crawlee.dev/js/api/core/class/Dataset.md#open) function, even if the `APIFY_LOCAL_STORAGE_DIR` variable is set. **Example usage:** ``` // Write a single row to the default dataset await Dataset.pushData({ col1: 123, col2: 'val2' }); // Open a named dataset const dataset = await Dataset.open('some-name'); // Write a single row await dataset.pushData({ foo: 'bar' }); // Write multiple rows await dataset.pushData([ { foo: 'bar2', col2: 'val2' }, { col3: 123 }, ]); // Export the entirety of the dataset to one file in the key-value store await dataset.exportToCSV('MY-DATA'); ``` ## Index[**](#Index) ### Properties * [**client](#client) * [**config](#config) * [**id](#id) * [**log](#log) * [**name](#name) * [**storageObject](#storageObject) ### Methods * [**drop](#drop) * [**export](#export) * [**exportTo](#exportTo) * [**exportToCSV](#exportToCSV) * [**exportToJSON](#exportToJSON) * [**forEach](#forEach) * [**getData](#getData) * [**getInfo](#getInfo) * [**map](#map) * [**pushData](#pushData) * [**reduce](#reduce) * [**exportToCSV](#exportToCSV) * [**exportToJSON](#exportToJSON) * [**getData](#getData) * [**open](#open) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L235)client **client: [DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md)\ ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L244)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L233)id **id: string ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L237)log **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) = ... ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L234)optionalname **name? : string ### [**](#storageObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L236)optionalreadonlystorageObject **storageObject? 
: Record\ ## Methods[**](#Methods) ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L613)drop * ****drop**(): Promise\ - Removes the dataset either from the Apify cloud storage or from the local directory, depending on the mode of operation. *** #### Returns Promise\ ### [**](#export)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L323)export * ****export**(options): Promise\ - Returns all the data from the dataset. This will iterate through the whole dataset via the `listItems()` client method, which gives you only paginated results. *** #### Parameters * ##### options: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) = {} #### Returns Promise\ ### [**](#exportTo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L355)exportTo * ****exportTo**(key, options, contentType): Promise\ - Save the entirety of the dataset's contents into one file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) An optional options object where you can provide the dataset and target KVS name. * ##### optionalcontentType: string Only JSON and CSV are supported currently, defaults to JSON. #### Returns Promise\ ### [**](#exportToCSV)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L406)exportToCSV * ****exportToCSV**(key, options): Promise\ - Save entire default dataset's contents into one CSV file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: Omit<[DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md), fromDataset> An optional options object where you can provide the target KVS name. #### Returns Promise\ ### [**](#exportToJSON)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L396)exportToJSON * ****exportToJSON**(key, options): Promise\ - Save entire default dataset's contents into one JSON file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: Omit<[DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md), fromDataset> An optional options object where you can provide the target KVS name. #### Returns Promise\ ### [**](#forEach)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L484)forEach * ****forEach**(iteratee, options, index): Promise\ - Iterates over dataset items, yielding each in turn to an `iteratee` function. Each invocation of `iteratee` is called with two arguments: `(item, index)`. If the `iteratee` function returns a Promise then it is awaited before the next call. If it throws an error, the iteration is aborted and the `forEach` function throws the error. **Example usage** ``` const dataset = await Dataset.open('my-results'); await dataset.forEach(async (item, index) => { console.log(`Item at ${index}: ${JSON.stringify(item)}`); }); ``` *** #### Parameters * ##### iteratee: [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md)\ A function that is called for every item in the dataset. 
* ##### optionaloptions: [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) = {} All `forEach()` parameters. * ##### optionalindex: number = 0 Specifies the initial index number passed to the `iteratee` function. #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L303)getData * ****getData**(options): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Returns [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) object holding the items in the dataset based on the provided parameters. *** #### Parameters * ##### options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) = {} #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L458)getInfo * ****getInfo**(): Promise\ - Returns an object containing general information about the dataset. The function returns the same object as the Apify API Client's [getDataset](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-datasets-getDataset) function, which in turn calls the [Get dataset](https://apify.com/docs/api/v2#/reference/datasets/dataset/get-dataset) API endpoint. **Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-dataset", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), itemCount: 14, } ``` *** #### Returns Promise\ ### [**](#map)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L514)map * ****map**\(iteratee, options): Promise\ - Produces a new array of values by mapping each value in list through a transformation function `iteratee()`. Each invocation of `iteratee()` is called with two arguments: `(element, index)`. If `iteratee` returns a `Promise` then it's awaited before a next call. *** #### Parameters * ##### iteratee: [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md)\ * ##### optionaloptions: [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) = {} All `map()` parameters. #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L276)pushData * ****pushData**(data): Promise\ - Stores an object or an array of objects to the dataset. The function returns a promise that resolves when the operation finishes. It has no result, but throws on invalid args or other errors. **IMPORTANT**: Make sure to use the `await` keyword when calling `pushData()`, otherwise the crawler process might finish before the data is stored! The size of the data is limited by the receiving API and therefore `pushData()` will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size. The function internally chunks the array into separate items and pushes them sequentially. The chunking process is stable (keeps order of data), but it does not provide a transaction safety mechanism. 
Therefore, in the event of an uploading error (after several automatic retries), the function's Promise will reject and the dataset will be left in a state where some of the items have already been saved to the dataset while other items from the source array were not. To overcome this limitation, the developer may, for example, read the last item saved in the dataset and re-attempt the save of the data from this item onwards to prevent duplicates. *** #### Parameters * ##### data: Data | Data\[] Object or array of objects containing data to be stored in the default dataset. The objects must be serializable to JSON and the JSON representation of each object must be smaller than 9MB. #### Returns Promise\ ### [**](#reduce)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L544)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L565)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L584)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L586)reduce * ****reduce**(iteratee): Promise\ * ****reduce**(iteratee, memo, options): Promise\ * ****reduce**\(iteratee, memo, options): Promise\ - Reduces a list of values down to a single value. If no `memo` is provided, the first element of the dataset is used as the initial value, and each successive step of the reduction should be returned by `iteratee()`. The `iteratee()` is passed three arguments: the `memo`, `value` and `index` of the current element being folded into the reduction. The `iteratee` is first invoked on the second element of the list (`index = 1`), with the first element given as the memo parameter. After that, the rest of the elements in the dataset are passed to `iteratee`, with the result of the previous invocation as the memo. If `iteratee()` returns a `Promise`, it is awaited before the next call. If the dataset is empty, `reduce` will return `undefined`. *** #### Parameters * ##### iteratee: [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md)\ #### Returns Promise\ ### [**](#exportToCSV)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L429)staticexportToCSV * ****exportToCSV**(key, options): Promise\ - Save entire default dataset's contents into one CSV file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) An optional options object where you can provide the dataset and target KVS name. #### Returns Promise\ ### [**](#exportToJSON)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L416)staticexportToJSON * ****exportToJSON**(key, options): Promise\ - Save entire default dataset's contents into one JSON file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) An optional options object where you can provide the dataset and target KVS name.
#### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L692)staticgetData * ****getData**\(options): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Returns [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) object holding the items in the dataset based on the provided parameters. *** #### Parameters * ##### options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) = {} #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L635)staticopen * ****open**\(datasetIdOrName, options): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Opens a dataset and returns a promise resolving to an instance of the [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) class. Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the cloud. For more details and code examples, see the [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) class. *** #### Parameters * ##### optionaldatasetIdOrName: null | string ID or name of the dataset to be opened. If `null` or `undefined`, the function returns the default dataset associated with the crawler run. * ##### optionaloptions: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) = {} Storage manager options. #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> --- # ErrorSnapshotter ErrorSnapshotter class is used to capture a screenshot of the page and a snapshot of the HTML when an error occurs during web crawling. This functionality is opt-in, and can be enabled via the crawler options: ``` const crawler = new BasicCrawler({ // ... 
statisticsOptions: { saveErrorSnapshots: true, }, }); ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**BASE\_MESSAGE](#BASE_MESSAGE) * [**MAX\_ERROR\_CHARACTERS](#MAX_ERROR_CHARACTERS) * [**MAX\_FILENAME\_LENGTH](#MAX_FILENAME_LENGTH) * [**MAX\_HASH\_LENGTH](#MAX_HASH_LENGTH) * [**SNAPSHOT\_PREFIX](#SNAPSHOT_PREFIX) ### Methods * [**captureSnapshot](#captureSnapshot) * [**contextCaptureSnapshot](#contextCaptureSnapshot) * [**generateFilename](#generateFilename) * [**saveHTMLSnapshot](#saveHTMLSnapshot) ## Constructors[**](#Constructors) ### [**](#constructor)constructor * ****new ErrorSnapshotter**(): [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) - #### Returns [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ## Properties[**](#Properties) ### [**](#BASE_MESSAGE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L46)staticreadonlyBASE\_MESSAGE **BASE\_MESSAGE: An error occurred = 'An error occurred' ### [**](#MAX_ERROR_CHARACTERS)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L43)staticreadonlyMAX\_ERROR\_CHARACTERS **MAX\_ERROR\_CHARACTERS: 30 = 30 ### [**](#MAX_FILENAME_LENGTH)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L45)staticreadonlyMAX\_FILENAME\_LENGTH **MAX\_FILENAME\_LENGTH: 250 = 250 ### [**](#MAX_HASH_LENGTH)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L44)staticreadonlyMAX\_HASH\_LENGTH **MAX\_HASH\_LENGTH: 30 = 30 ### [**](#SNAPSHOT_PREFIX)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L47)staticreadonlySNAPSHOT\_PREFIX **SNAPSHOT\_PREFIX: ERROR\_SNAPSHOT = 'ERROR\_SNAPSHOT' ## Methods[**](#Methods) ### [**](#captureSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L52)captureSnapshot * ****captureSnapshot**(error, context): Promise\ - Capture a snapshot of the error context. *** #### Parameters * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) * ##### context: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)\ #### Returns Promise\ ### [**](#contextCaptureSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L105)contextCaptureSnapshot * ****contextCaptureSnapshot**(context, fileName): Promise\ - Captures a snapshot of the current page using the context.saveSnapshot function. This function is applicable for browser contexts only. Returns an object containing the filenames of the screenshot and HTML file. *** #### Parameters * ##### context: BrowserCrawlingContext * ##### fileName: string #### Returns Promise\ ### [**](#generateFilename)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L135)generateFilename * ****generateFilename**(error): string - Generate a unique fileName for each error snapshot. 
*** #### Parameters * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) #### Returns string ### [**](#saveHTMLSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L123)saveHTMLSnapshot * ****saveHTMLSnapshot**(html, keyValueStore, fileName): Promise\ - Save the HTML snapshot of the page, and return the fileName with the extension. *** #### Parameters * ##### html: string * ##### keyValueStore: [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) * ##### fileName: string #### Returns Promise\ --- # ErrorTracker This class tracks errors and computes a summary of information like: * where the errors happened * what the error names are * what the error codes are * what is the general error message This is extremely useful when there are dynamic error messages, such as argument validation. Since the structure of the `tracker.result` object differs when using different options, it's typed as `Record`. The most deep object has a `count` property, which is a number. It's possible to get the total amount of errors via the `tracker.total` property. ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**errorSnapshotter](#errorSnapshotter) * [**result](#result) * [**total](#total) ### Methods * [**add](#add) * [**addAsync](#addAsync) * [**captureSnapshot](#captureSnapshot) * [**getMostPopularErrors](#getMostPopularErrors) * [**getUniqueErrorCount](#getUniqueErrorCount) * [**reset](#reset) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L295)constructor * ****new ErrorTracker**(options): [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) - #### Parameters * ##### options: Partial<[ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md)> = {} #### Returns [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ## Properties[**](#Properties) ### [**](#errorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L293)optionalerrorSnapshotter **errorSnapshotter? : [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#result)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L289)result **result: Record\ ### [**](#total)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L291)total **total: number ## Methods[**](#Methods) ### [**](#add)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L339)add * ****add**(error): void - #### Parameters * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) #### Returns void ### [**](#addAsync)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L353)addAsync * ****addAsync**(error, context): Promise\ - This method is async, because it captures a snapshot of the error context. We added this new method to avoid breaking changes. 
*** #### Parameters * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) * ##### optionalcontext: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)\ #### Returns Promise\ ### [**](#captureSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L408)captureSnapshot * ****captureSnapshot**(storage, error, context): Promise\ - #### Parameters * ##### storage: Record\ * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) * ##### context: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)\ #### Returns Promise\ ### [**](#getMostPopularErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L388)getMostPopularErrors * ****getMostPopularErrors**(count): \[number, string\[]]\[] - #### Parameters * ##### count: number #### Returns \[number, string\[]]\[] ### [**](#getUniqueErrorCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L368)getUniqueErrorCount * ****getUniqueErrorCount**(): number - #### Returns number ### [**](#reset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L419)reset * ****reset**(): void - #### Returns void --- # abstractEventManager ### Hierarchy * *EventManager* * [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**config](#config) ### Methods * [**close](#close) * [**emit](#emit) * [**init](#init) * [**isInitialized](#isInitialized) * [**off](#off) * [**on](#on) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L30)constructor * ****new EventManager**(config): [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) - #### Parameters * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L30)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ## Methods[**](#Methods) ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L55)close * ****close**(): Promise\ - Clears the internal `persistState` event interval. This is automatically called at the end of `crawler.run()`. *** #### Returns Promise\ ### [**](#emit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L82)emit * ****emit**(event, ...args): void - #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### rest...args: unknown\[] #### Returns void ### [**](#init)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L38)init * ****init**(): Promise\ - Initializes the event manager by creating the `persistState` event interval. This is automatically called at the beginning of `crawler.run()`. 
*** #### Returns Promise\ ### [**](#isInitialized)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L86)isInitialized * ****isInitialized**(): boolean - #### Returns boolean ### [**](#off)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L74)off * ****off**(event, listener): void - #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### optionallistener: (...args) => any #### Returns void ### [**](#on)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L70)on * ****on**(event, listener): void - #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### listener: (...args) => any #### Returns void --- # GotScrapingHttpClient A HTTP client implementation based on the `got-scraping` library. ### Implements * [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**sendRequest](#sendRequest) * [**stream](#stream) ## Constructors[**](#Constructors) ### [**](#constructor)constructor * ****new GotScrapingHttpClient**(): [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) - #### Returns [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ## Methods[**](#Methods) ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L21)sendRequest * ****sendRequest**\(request): Promise<[HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md)\> - Implementation of BaseHttpClient.sendRequest Perform an HTTP Request and return the complete response. *** #### Parameters * ##### request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ #### Returns Promise<[HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md)\> ### [**](#stream)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L45)stream * ****stream**(request, handleRedirect): Promise<[StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md)> - Implementation of BaseHttpClient.stream Perform an HTTP Request and return after the response headers are received. The body may be read from a stream contained in the response. *** #### Parameters * ##### request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ * ##### optionalhandleRedirect: [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) #### Returns Promise<[StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md)> --- # KeyValueStore The `KeyValueStore` class represents a key-value store, a simple data storage that is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots, crawler inputs and outputs, web pages, PDFs or to persist the state of crawlers. Do not instantiate this class directly, use the [KeyValueStore.open](https://crawlee.dev/js/api/core/class/KeyValueStore.md#open) function instead. Each crawler run is associated with a default key-value store, which is created exclusively for the run. 
By convention, the crawler input and output are stored into the default key-value store under the `INPUT` and `OUTPUT` keys, respectively. Typically, input and output are JSON files, although they can be in any other format. To access the default key-value store directly, you can use the [KeyValueStore.getValue](https://crawlee.dev/js/api/core/class/KeyValueStore.md#getValue) and [KeyValueStore.setValue](https://crawlee.dev/js/api/core/class/KeyValueStore.md#setValue) convenience functions. To access the input, you can also use the KeyValueStore.getInput convenience function. `KeyValueStore` stores its data on a local disk. If the `CRAWLEE_STORAGE_DIR` environment variable is set, the data is stored in the local directory in the following files:

```
{CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT}
```

Note that `{STORE_ID}` is the name or ID of the key-value store. The default key-value store has ID: `default`, unless you override it by setting the `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` environment variable. The `{KEY}` is the key of the record and `{EXT}` corresponds to the MIME content type of the data value. **Example usage:**

```
// Get crawler input from the default key-value store.
const input = await KeyValueStore.getInput();

// Get some value from the default key-value store.
const otherValue = await KeyValueStore.getValue('my-key');

// Write crawler output to the default key-value store.
await KeyValueStore.setValue('OUTPUT', { myResult: 123 });

// Open a named key-value store
const store = await KeyValueStore.open('some-name');

// Write a record. JavaScript object is automatically converted to JSON,
// strings and binary buffers are stored as they are
await store.setValue('some-key', { foo: 'bar' });

// Read a record. Note that JSON is automatically parsed to a JavaScript object,
// text data returned as a string and other data is returned as binary buffer
const value = await store.getValue('some-key');

// Drop (delete) the store
await store.drop();
```

## Index[**](#Index) ### Properties * [**config](#config) * [**id](#id) * [**name](#name) * [**storageObject](#storageObject) ### Methods * [**drop](#drop) * [**forEachKey](#forEachKey) * [**getAutoSavedValue](#getAutoSavedValue) * [**getPublicUrl](#getPublicUrl) * [**getValue](#getValue) * [**recordExists](#recordExists) * [**setValue](#setValue) * [**getAutoSavedValue](#getAutoSavedValue) * [**open](#open) * [**recordExists](#recordExists) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L123)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L109)readonlyid **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L110)optionalreadonlyname **name? : string ### [**](#storageObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L111)optionalreadonlystorageObject **storageObject? : Record\ ## Methods[**](#Methods) ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L416)drop * ****drop**(): Promise\ - Removes the key-value store either from the Apify cloud storage or from the local directory, depending on the mode of operation.
*** #### Returns Promise\ ### [**](#forEachKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L452)forEachKey * ****forEachKey**(iteratee, options): Promise\ - Iterates over key-value store keys, yielding each in turn to an `iteratee` function. Each invocation of `iteratee` is called with three arguments: `(key, index, info)`, where `key` is the record key, `index` is a zero-based index of the key in the current iteration (regardless of `options.exclusiveStartKey`) and `info` is an object that contains a single property `size` indicating size of the record in bytes. If the `iteratee` function returns a Promise then it is awaited before the next call. If it throws an error, the iteration is aborted and the `forEachKey` function throws the error. **Example usage** ``` const keyValueStore = await KeyValueStore.open(); await keyValueStore.forEachKey(async (key, index, info) => { console.log(`Key at ${index}: ${key} has size ${info.size}`); }); ``` *** #### Parameters * ##### iteratee: [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) A function that is called for every key in the key-value store. * ##### optionaloptions: [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) = {} All `forEachKey()` parameters. #### Returns Promise\ ### [**](#getAutoSavedValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L249)getAutoSavedValue * ****getAutoSavedValue**\(key, defaultValue): Promise\ - #### Parameters * ##### key: string * ##### defaultValue: T = ... #### Returns Promise\ ### [**](#getPublicUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L487)getPublicUrl * ****getPublicUrl**(key): string - Returns a file URL for the given key. *** #### Parameters * ##### key: string #### Returns string ### [**](#getValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L161)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L194)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L227)getValue * ****getValue**\(key): Promise\ * ****getValue**\(key, defaultValue): Promise\ - Gets a value from the key-value store. The function returns a `Promise` that resolves to the record value, whose JavaScript type depends on the MIME content type of the record. Records with the `application/json` content type are automatically parsed and returned as a JavaScript object. Similarly, records with `text/plain` content types are returned as a string. For all other content types, the value is returned as a raw [`Buffer`](https://nodejs.org/api/buffer.html) instance. If the record does not exist, the function resolves to `null`. To save or delete a value in the key-value store, use the [KeyValueStore.setValue](https://crawlee.dev/js/api/core/class/KeyValueStore.md#setValue) function. **Example usage:** ``` const store = await KeyValueStore.open(); const buffer = await store.getValue('screenshot1.png'); ``` *** #### Parameters * ##### key: string Unique key of the record. It can be at most 256 characters long and only consist of the following characters: `a`-`z`, `A`-`Z`, `0`-`9` and `!-_.'()` #### Returns Promise\ Returns a promise that resolves to an object, string or [`Buffer`](https://nodejs.org/api/buffer.html), depending on the MIME content type of the record. 
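The overload list above also includes `getValue(key, defaultValue)`. Below is a minimal sketch of that variant, assuming it resolves to the provided default when the record is missing; the `crawl-state` key and its shape are illustrative only:

```
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

// If no 'crawl-state' record exists yet, the provided default object
// is returned instead of `null`.
const state = await store.getValue('crawl-state', { processedUrls: 0 });

console.log(state.processedUrls);
```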
### [**](#recordExists)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L242)recordExists * ****recordExists**(key): Promise\ - Tests whether a record with the given key exists in the key-value store without retrieving its value. *** #### Parameters * ##### key: string The queried record key. #### Returns Promise\ `true` if the record exists, `false` if it does not. ### [**](#setValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L342)setValue * ****setValue**\(key, value, options): Promise\ - Saves or deletes a record in the key-value store. The function returns a promise that resolves once the record has been saved or deleted. **Example usage:** ``` const store = await KeyValueStore.open(); await store.setValue('OUTPUT', { foo: 'bar' }); ``` Beware that the key can be at most 256 characters long and only contain the following characters: `a-zA-Z0-9!-_.'()` By default, `value` is converted to JSON and stored with the `application/json; charset=utf-8` MIME content type. To store the value with another content type, pass it in the options as follows: ``` const store = await KeyValueStore.open('my-text-store'); await store.setValue('RESULTS', 'my text data', { contentType: 'text/plain' }); ``` If you set custom content type, `value` must be either a string or [`Buffer`](https://nodejs.org/api/buffer.html), otherwise an error will be thrown. If `value` is `null`, the record is deleted instead. Note that the `setValue()` function succeeds regardless whether the record existed or not. To retrieve a value from the key-value store, use the [KeyValueStore.getValue](https://crawlee.dev/js/api/core/class/KeyValueStore.md#getValue) function. **IMPORTANT:** Always make sure to use the `await` keyword when calling `setValue()`, otherwise the crawler process might finish before the value is stored! *** #### Parameters * ##### key: string Unique key of the record. It can be at most 256 characters long and only consist of the following characters: `a`-`z`, `A`-`Z`, `0`-`9` and `!-_.'()` * ##### value: null | T Record data, which can be one of the following values: * If `null`, the record in the key-value store is deleted. * If no `options.contentType` is specified, `value` can be any JavaScript object and it will be stringified to JSON. * If `options.contentType` is set, `value` is taken as is and it must be a `String` or [`Buffer`](https://nodejs.org/api/buffer.html). For any other value an error will be thrown. * ##### optionaloptions: [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) = {} Record options. #### Returns Promise\ ### [**](#getAutoSavedValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L630)staticgetAutoSavedValue * ****getAutoSavedValue**\(key, defaultValue): Promise\ - #### Parameters * ##### key: string * ##### defaultValue: T = ... #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L506)staticopen * ****open**(storeIdOrName, options): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - Opens a key-value store and returns a promise resolving to an instance of the [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) class. Key-value stores are used to store records or files, along with their MIME content type. The records are stored and retrieved using a unique key. 
The actual data is stored either on a local filesystem or in the Apify cloud. For more details and code examples, see the [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) class. *** #### Parameters * ##### optionalstoreIdOrName: null | string ID or name of the key-value store to be opened. If `null` or `undefined`, the function returns the default key-value store associated with the crawler run. * ##### optionaloptions: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) = {} Storage manager options. #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#recordExists)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L625)staticrecordExists * ****recordExists**(key): Promise\ - Tests whether a record with the given key exists in the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) associated with the current crawler run. *** #### Parameters * ##### key: string The queried record key. #### Returns Promise\ `true` if the record exists, `false` if it does not. --- # LocalEventManager ### Hierarchy * [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) * *LocalEventManager* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**config](#config) ### Methods * [**close](#close) * [**emit](#emit) * [**init](#init) * [**isInitialized](#isInitialized) * [**off](#off) * [**on](#on) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L30)constructor * ****new LocalEventManager**(config): [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) - Inherited from EventManager.constructor #### Parameters * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L30)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from EventManager.config ## Methods[**](#Methods) ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L33)close * ****close**(): Promise\ - Overrides EventManager.close Clears the internal `persistState` event interval. This is automatically called at the end of `crawler.run()`. *** #### Returns Promise\ ### [**](#emit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L82)inheritedemit * ****emit**(event, ...args): void - Inherited from EventManager.emit #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### rest...args: unknown\[] #### Returns void ### [**](#init)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L18)init * ****init**(): Promise\ - Overrides EventManager.init Initializes the EventManager and sets up periodic `systemInfo` and `persistState` events. This is automatically called at the beginning of `crawler.run()`. 
*** #### Returns Promise\ ### [**](#isInitialized)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L86)inheritedisInitialized * ****isInitialized**(): boolean - Inherited from EventManager.isInitialized #### Returns boolean ### [**](#off)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L74)inheritedoff * ****off**(event, listener): void - Inherited from EventManager.off #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### optionallistener: (...args) => any #### Returns void ### [**](#on)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L70)inheritedon * ****on**(event, listener): void - Inherited from EventManager.on #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### listener: (...args) => any #### Returns void --- # externalLog The log instance enables level-aware logging of messages, and we advise using it instead of `console.log()` and its aliases in most development scenarios. A very useful use case for `log` is using `log.debug` liberally throughout the codebase to get useful logging messages only when the appropriate log level is set, while keeping the console tidy in production environments. The available logging levels are, in this order: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `OFF`, and can be referenced from the `log.LEVELS` constant, such as `log.LEVELS.ERROR`. To log messages to the system console, use the `log.level(message)` invocation, such as `log.debug('this is a debug message')`. To prevent writing of messages above a certain log level to the console, simply set the appropriate level. The default log level is `INFO`, which means that `DEBUG` messages will not be printed, unless enabled. **Example:**

```
import log from '@apify/log';
// importing from the Apify SDK or Crawlee is also supported:
// import { log } from 'apify';
// import { log } from 'crawlee';

log.info('Information message', { someData: 123 }); // prints message
log.debug('Debug message', { debugData: 'hello' }); // doesn't print anything

log.setLevel(log.LEVELS.DEBUG);
log.debug('Debug message'); // prints message

log.setLevel(log.LEVELS.ERROR);
log.debug('Debug message'); // doesn't print anything
log.info('Info message'); // doesn't print anything
log.error('Error message', { errorDetails: 'This is bad!' }); // prints message

try {
  throw new Error('Not good!');
} catch (e) {
  log.exception(e, 'Exception occurred', { errorDetails: 'This is really bad!' }); // prints message
}

log.setOptions({ prefix: 'My actor' });
log.info('I am running!'); // prints "My actor: I am running"

const childLog = log.child({ prefix: 'Crawler' });
childLog.info('I am crawling!'); // prints "My actor:Crawler: I am crawling"
```

Another very useful way of setting the log level is by setting the `APIFY_LOG_LEVEL` environment variable, such as `APIFY_LOG_LEVEL=DEBUG`. This way, no code changes are necessary to turn on your debug messages and start debugging right away. To add timestamps to your logs, you can override the default logger settings:

```
log.setOptions({
    logger: new log.LoggerText({ skipTime: false }),
});
```

You can customize your logging further by extending or replacing the default logger instances with your own implementations.
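The `getLevel()` method documented below is useful for checking whether a message will actually be printed before doing expensive work. A minimal sketch of that pattern (the `crawledPages` data is illustrative only):

```
import { log } from 'crawlee';

const crawledPages = [{ url: 'https://example.com', status: 200 }];

// LEVELS.DEBUG is a higher number than the default INFO level, so this
// guard passes only after log.setLevel(log.LEVELS.DEBUG) or when
// APIFY_LOG_LEVEL=DEBUG is set.
if (log.getLevel() >= log.LEVELS.DEBUG) {
    // Build the (potentially expensive) payload only when it will be printed.
    log.debug('Crawled pages snapshot', { snapshot: JSON.stringify(crawledPages) });
}
```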
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**LEVELS](#LEVELS) ### Methods * [**child](#child) * [**debug](#debug) * [**deprecated](#deprecated) * [**error](#error) * [**exception](#exception) * [**getLevel](#getLevel) * [**getOptions](#getOptions) * [**info](#info) * [**internal](#internal) * [**perf](#perf) * [**setLevel](#setLevel) * [**setOptions](#setOptions) * [**softFail](#softFail) * [**warning](#warning) * [**warningOnce](#warningOnce) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L136)externalconstructor * ****new Log**(options): [Log](https://crawlee.dev/js/api/core/class/Log.md) - #### Parameters * ##### externaloptionaloptions: Partial<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> #### Returns [Log](https://crawlee.dev/js/api/core/class/Log.md) ## Properties[**](#Properties) ### [**](#LEVELS)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L133)externalreadonlyLEVELS **LEVELS: typeof [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) Map of available log levels that's useful for easy setting of appropriate log levels. Each log level is represented internally by a number. Eg. `log.LEVELS.DEBUG === 5`. ## Methods[**](#Methods) ### [**](#child)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L168)externalchild * ****child**(options): [Log](https://crawlee.dev/js/api/core/class/Log.md) - Creates a new instance of logger that inherits settings from a parent logger. *** #### Parameters * ##### externaloptions: Partial<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> #### Returns [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#debug)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L195)externaldebug * ****debug**(message, data): void - Logs a `DEBUG` message. By default, it will not be written to the console. To see `DEBUG` messages in the console, set the log level to `DEBUG` either using the `log.setLevel(log.LEVELS.DEBUG)` method or using the environment variable `APIFY_LOG_LEVEL=DEBUG`. Data are stringified and appended to the message. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#deprecated)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L204)externaldeprecated * ****deprecated**(message): void - Logs given message only once as WARNING. It's used to warn user that some feature he is using has been deprecated. *** #### Parameters * ##### externalmessage: string #### Returns void ### [**](#error)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L173)externalerror * ****error**(message, data): void - Logs an `ERROR` message. Use this method to log error messages that are not directly connected to an exception. For logging exceptions, use the `log.exception` method. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#exception)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L178)externalexception * ****exception**(exception, message, data): void - Logs an `ERROR` level message with a nicely formatted exception. 
Note that the exception is the first parameter here and an additional message is only optional. *** #### Parameters * ##### externalexception: Error * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#getLevel)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L144)externalgetLevel * ****getLevel**(): number - Returns the currently selected logging level. This is useful for checking whether a message will actually be printed to the console before one actually performs a resource intensive operation to construct the message, such as querying a DB for some metadata that need to be added. If the log level is not high enough at the moment, it doesn't make sense to execute the query. *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L164)externalgetOptions * ****getOptions**(): Required<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> - Returns the logger configuration. *** #### Returns Required<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> ### [**](#info)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L188)externalinfo * ****info**(message, data): void - Logs an `INFO` message. `INFO` is the default log level so info messages will be always logged, unless the log level is changed. Data are stringified and appended to the message. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#internal)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L156)externalinternal * ****internal**(level, message, data, exception): void - #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: any #### Returns void ### [**](#perf)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L196)externalperf * ****perf**(message, data): void - #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#setLevel)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L155)externalsetLevel * ****setLevel**(level): void - Sets the log level to the given value, preventing messages from less important log levels from being printed to the console. Use in conjunction with the `log.LEVELS` constants such as ``` log.setLevel(log.LEVELS.DEBUG); ``` Default log level is INFO. *** #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) #### Returns void ### [**](#setOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L160)externalsetOptions * ****setOptions**(options): void - Configures logger. 
*** #### Parameters * ##### externaloptions: Partial<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> #### Returns void ### [**](#softFail)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L179)externalsoftFail * ****softFail**(message, data): void - #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#warning)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L183)externalwarning * ****warning**(message, data): void - Logs a `WARNING` level message. Data are stringified and appended to the message. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#warningOnce)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L200)externalwarningOnce * ****warningOnce**(message): void - Logs a `WARNING` level message only once. *** #### Parameters * ##### externalmessage: string #### Returns void --- # externalLogger This is an abstract class that should be extended by custom logger classes. this.\_log() method must be implemented by them. ### Hierarchy * EventEmitter * *Logger* * [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) * [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L33)externalconstructor * ****new Logger**(options): [Logger](https://crawlee.dev/js/api/core/class/Logger.md) - Overrides EventEmitter.constructor #### Parameters * ##### externaloptions: Record\ #### Returns [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from EventEmitter.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. 
* **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from EventEmitter.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from EventEmitter.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from EventEmitter.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. 
* **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L37)external\_log * ****\_log**(level, message, data, exception, opts): void - #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns void ### [**](#_outputWithConsole)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L36)external\_outputWithConsole * ****\_outputWithConsole**(level, line): void - #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from EventEmitter.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from EventEmitter.addListener Alias for `emitter.on(eventName, listener)`. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from EventEmitter.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from EventEmitter.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. 
``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from EventEmitter.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalgetOptions * ****getOptions**(): Record\ - #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from EventEmitter.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. * **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from EventEmitter.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externallog * ****log**(level, message, ...args): void - #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from EventEmitter.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from EventEmitter.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. 
``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from EventEmitter.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from EventEmitter.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from EventEmitter.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. 
* **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from EventEmitter.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from EventEmitter.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from EventEmitter.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. 
``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from EventEmitter.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalsetOptions * ****setOptions**(options): void - #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from EventEmitter.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. 
Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. ### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from EventEmitter.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from EventEmitter.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from EventEmitter.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. 
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from EventEmitter.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from EventEmitter.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from EventEmitter.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # externalLoggerJson This is an abstract class that should be extended by custom logger classes. this.\_log() method must be implemented by them. 
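As with the base `Logger`, a custom logger implements `_log()`. The sketch below is only illustrative: the `PlainLogger` name is hypothetical, and it assumes `Logger` and the `LogLevel` enum are re-exported from the `crawlee` package as listed in this reference.

```
import { Logger, LogLevel } from 'crawlee';

// Hypothetical logger that prefixes every line and delegates the actual
// console output to the helper inherited from Logger.
class PlainLogger extends Logger {
    _log(level, message, data, exception, opts) {
        const line = `[my-crawler] ${LogLevel[level]}: ${message}`;
        this._outputWithConsole(level, line);
        return line;
    }
}

// Hook it up for all subsequent log calls, e.g.:
// log.setOptions({ logger: new PlainLogger({}) });
```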
### Hierarchy * [Logger](https://crawlee.dev/js/api/core/class/Logger.md) * *LoggerJson* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L241)externalconstructor * ****new LoggerJson**(options): [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) - Overrides Logger.constructor #### Parameters * ##### externaloptionaloptions: {} #### Returns [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from Logger.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from Logger.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from Logger.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. 
The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from Logger.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L242)external\_log * ****\_log**(level, message, data, exception, opts): string - Overrides Logger.\_log #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns string ### [**](#_outputWithConsole)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L36)externalinherited\_outputWithConsole * ****\_outputWithConsole**(level, line): void - Inherited from Logger.\_outputWithConsole #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from Logger.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from Logger.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from Logger.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from Logger.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from Logger.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalinheritedgetOptions * ****getOptions**(): Record\ - Inherited from Logger.getOptions #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from Logger.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. 
* **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from Logger.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externalinheritedlog * ****log**(level, message, ...args): void - Inherited from Logger.log #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from Logger.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from Logger.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from Logger.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. 
``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from Logger.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from Logger.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from Logger.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). 
``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from Logger.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from Logger.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. 
This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from Logger.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalinheritedsetOptions * ****setOptions**(options): void - Inherited from Logger.setOptions #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from Logger.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. 
### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from Logger.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from Logger.getMaxListeners Returns the currently set maximum number of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from Logger.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `emitter.listenerCount()` instead.
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from Logger.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from Logger.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special `'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the `'error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from Logger.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # externalLoggerText This is an abstract class that should be extended by custom logger classes. The `this._log()` method must be implemented by them.
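For orientation, here is a minimal sketch of what such a subclass could look like. It relies only on the `_log()` signature documented further below; the import path and the idea of delegating formatting to `LoggerText` are assumptions, not an official recipe.

```js
// Illustrative sketch only. Assumes LoggerText is exported from @crawlee/core
// and that _log() returns the formatted log line, as the _log() entry below suggests.
import { LoggerText } from '@crawlee/core';

class PrefixedLogger extends LoggerText {
    _log(level, message, data, exception, opts) {
        // Delegate formatting to LoggerText, but tag every message with a prefix.
        return super._log(level, `[my-crawler] ${message}`, data, exception, opts);
    }
}
```

How such a logger is plugged into the logging subsystem depends on the Logger configuration; see the Logger class documentation for the exact wiring.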
### Hierarchy * [Logger](https://crawlee.dev/js/api/core/class/Logger.md) * *LoggerText* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L246)externalconstructor * ****new LoggerText**(options): [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) - Overrides Logger.constructor #### Parameters * ##### externaloptionaloptions: {} #### Returns [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from Logger.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from Logger.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from Logger.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all* `EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit.
The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from Logger.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L247)external\_log * ****\_log**(level, message, data, exception, opts): string - Overrides Logger.\_log #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns string ### [**](#_outputWithConsole)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L36)externalinherited\_outputWithConsole * ****\_outputWithConsole**(level, line): void - Inherited from Logger.\_outputWithConsole #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from Logger.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from Logger.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from Logger.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from Logger.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from Logger.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalinheritedgetOptions * ****getOptions**(): Record\ - Inherited from Logger.getOptions #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from Logger.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. 
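For illustration, a minimal sketch of both call forms with a plain Node.js `EventEmitter` (the optional `listener` argument is available on recent Node.js versions, per the description above):

```js
import { EventEmitter } from 'node:events';

const ee = new EventEmitter();
const handler = () => {};

ee.on('ping', handler);
ee.on('ping', handler);  // the same listener registered twice
ee.on('ping', () => {}); // a different listener

console.log(ee.listenerCount('ping'));          // Prints: 3
console.log(ee.listenerCount('ping', handler)); // Prints: 2
```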
* **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from Logger.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externalinheritedlog * ****log**(level, message, ...args): void - Inherited from Logger.log #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from Logger.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from Logger.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from Logger.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. 
``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from Logger.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from Logger.prependOnceListener Adds a **one-time** `listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from Logger.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`).
``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from Logger.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from Logger.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from `emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed.
This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from Logger.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalinheritedsetOptions * ****setOptions**(options): void - Inherited from Logger.setOptions #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from Logger.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. 
### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from Logger.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from Logger.getMaxListeners Returns the currently set maximum number of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from Logger.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `emitter.listenerCount()` instead.
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from Logger.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from Logger.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special `'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the `'error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from Logger.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # NonRetryableError Errors of `NonRetryableError` type will never be retried by the crawler.
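As a usage sketch (assuming a `CheerioCrawler` and that `NonRetryableError` is imported from the `crawlee` package; the "Not Found" check is a made-up example, not part of the API):

```js
import { CheerioCrawler, NonRetryableError } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Hypothetical check: a page that explicitly reports "Not Found"
        // will not get better on retry, so fail the request immediately.
        if ($('title').text().includes('Not Found')) {
            throw new NonRetryableError(`Page not found: ${request.url}`);
        }
        // ... normal extraction logic
    },
});
```

Throwing any other error lets the crawler retry the request according to its retry settings; throwing a `NonRetryableError` marks the request as failed right away.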
### Hierarchy * Error * *NonRetryableError* * [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1082)externalconstructor * ****new NonRetryableError**(message): [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) * ****new NonRetryableError**(message, options): [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) - Inherited from Error.constructor #### Parameters * ##### externaloptionalmessage: string #### Returns [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from Error.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from Error.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from Error.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from Error.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from Error.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from Error.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. 
For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from Error.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from Error.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # ProxyConfiguration Configures connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections. You can get information about the currently used proxy by inspecting the [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) property in your crawler's page function. There, you can inspect the proxy's URL and other attributes. If you want to use your own proxies, use the [ProxyConfigurationOptions.proxyUrls](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md#proxyUrls) option. Your list of proxy URLs will be rotated by the configuration if this option is provided. **Example usage:** ``` const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['...', '...'], }); const crawler = new CheerioCrawler({ // ... proxyConfiguration, requestHandler({ proxyInfo }) { const usedProxyUrl = proxyInfo.url; // Getting the proxy URL } }) ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**isManInTheMiddle](#isManInTheMiddle) ### Methods * [**newProxyInfo](#newProxyInfo) * [**newUrl](#newUrl) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L233)constructor * ****new ProxyConfiguration**(options): [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) - Creates a [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) instance based on the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections. ``` const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://user:pass@proxy-1.com', 'http://user:pass@proxy-2.com'], }); const crawler = new CheerioCrawler({ // ... 
proxyConfiguration, requestHandler({ proxyInfo }) { const usedProxyUrl = proxyInfo.url; // Getting the proxy URL } }) ``` *** #### Parameters * ##### options: [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) = {} #### Returns [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ## Properties[**](#Properties) ### [**](#isManInTheMiddle)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L204)isManInTheMiddle **isManInTheMiddle: boolean = false ## Methods[**](#Methods) ### [**](#newProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L274)newProxyInfo * ****newProxyInfo**(sessionId, options): Promise\ - This function creates a new [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) info object. It is used by CheerioCrawler and PuppeteerCrawler to generate proxy URLs and also to allow the user to inspect the currently used proxy via the requestHandler parameter `proxyInfo`. Use it if you want to work with a rich representation of a proxy URL. If you need the URL string only, use [ProxyConfiguration.newUrl](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newUrl). *** #### Parameters * ##### optionalsessionId: string | number Represents the identifier of user [Session](https://crawlee.dev/js/api/core/class/Session.md) that can be managed by the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) or you can use the Apify Proxy [Session](https://docs.apify.com/proxy#sessions) identifier. When the provided sessionId is a number, it's converted to a string. Property sessionId of [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) is always returned as a type string. All the HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier must not be longer than 50 characters and include only the following: `0-9`, `a-z`, `A-Z`, `"."`, `"_"` and `"~"`. * ##### optionaloptions: TieredProxyOptions #### Returns Promise\ Represents information about used proxy and its configuration. ### [**](#newUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L383)newUrl * ****newUrl**(sessionId, options): Promise\ - Returns a new proxy URL based on provided configuration options and the `sessionId` parameter. *** #### Parameters * ##### optionalsessionId: string | number Represents the identifier of user [Session](https://crawlee.dev/js/api/core/class/Session.md) that can be managed by the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) or you can use the Apify Proxy [Session](https://docs.apify.com/proxy#sessions) identifier. When the provided sessionId is a number, it's converted to a string. All the HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier must not be longer than 50 characters and include only the following: `0-9`, `a-z`, `A-Z`, `"."`, `"_"` and `"~"`. * ##### optionaloptions: TieredProxyOptions #### Returns Promise\ A string with a proxy URL, including authentication credentials and port number. For example, `http://bob:password123@proxy.example.com:8000` --- # externalPseudoUrl Represents a pseudo-URL (PURL) - a URL pattern used to find the matching URLs on a page or html document. 
A PURL is simply a URL with special directives enclosed in `[]` brackets. Currently, the only supported directive is `[RegExp]`, which defines a JavaScript-style regular expression to match against the URL. The `PseudoUrl` class can be constructed either using a pseudo-URL string or a regular expression (an instance of the `RegExp` object). With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use an appropriate `RegExp` object. Internally, the `PseudoUrl` class uses the `purlToRegExp` function, which parses the provided PURL and converts it to an instance of the `RegExp` object (in case it is not one already). For example, a PURL `http://www.example.com/pages/[(\w|-)*]` will match all of the following URLs: * `http://www.example.com/pages/` * `http://www.example.com/pages/my-awesome-page` * `http://www.example.com/pages/something` Be careful to correctly escape special characters in the pseudo-URL string. If either `[` or `]` is part of the normal query string, it must be encoded as `[\x5B]` or `[\x5D]`, respectively. For example, the following PURL: ``` http://www.example.com/search?do[\x5B]load[\x5D]=1 ``` will match the URL: ``` http://www.example.com/search?do[load]=1 ``` If the regular expression in the pseudo-URL contains a backslash character (`\`), you need to escape it with another backslash, as shown in the example below. **Example usage:** ``` // Using a pseudo-URL string const purl = new PseudoUrl('http://www.example.com/pages/[(\\w|-)+]'); // Using a regular expression const purl2 = new PseudoUrl(/http:\/\/www\.example\.com\/pages\/(\w|-)+/); if (purl.matches('http://www.example.com/pages/my-awesome-page')) console.log('Match!'); ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**regex](#regex) ### Methods * [**matches](#matches) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/pseudo_url/src/index.d.ts#L58)externalconstructor * ****new PseudoUrl**(purl): [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) - #### Parameters * ##### externalpurl: string | RegExp A pseudo-URL string or a regular expression object. Using a `RegExp` instance enables more granular control, such as making the matching case-sensitive. #### Returns [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ## Properties[**](#Properties) ### [**](#regex)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/pseudo_url/src/index.d.ts#L51)externalreadonlyregex **regex: RegExp ## Methods[**](#Methods) ### [**](#matches)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/pseudo_url/src/index.d.ts#L62)externalmatches * ****matches**(url): boolean - Determines whether a URL matches this pseudo-URL pattern. *** #### Parameters * ##### externalurl: string #### Returns boolean --- # RecoverableState \ A class for managing persistent recoverable state using a plain JavaScript object. This class facilitates state persistence to a `KeyValueStore`, allowing data to be saved and retrieved across migrations or restarts. It manages the loading, saving, and resetting of state data, with optional persistence capabilities. The state is represented by a plain JavaScript object that can be serialized to and deserialized from JSON. The class automatically hooks into the event system to persist state when needed.
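A rough usage sketch follows. The method calls mirror the API documented below, but the constructor option names (`defaultState`, `persistStateKey`, `persistenceEnabled`) and the import path are assumptions; check [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) for the actual shape.

```js
// Sketch only - the option names are assumptions, see RecoverableStateOptions.
import { RecoverableState } from '@crawlee/core';

const progress = new RecoverableState({
    defaultState: { processedUrls: 0 },
    persistStateKey: 'CRAWL_PROGRESS',
    persistenceEnabled: true,
});

await progress.initialize();            // load previously persisted state, if any
progress.currentValue.processedUrls++;  // mutate the plain state object directly
await progress.persistState();          // write the current state to the KeyValueStore
await progress.teardown();              // persist once more and stop listening for PERSIST_STATE
```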
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Accessors * [**currentValue](#currentValue) ### Methods * [**initialize](#initialize) * [**persistState](#persistState) * [**reset](#reset) * [**teardown](#teardown) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L93)constructor * ****new RecoverableState**\(options): [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md)\ - Initialize a new recoverable state object. *** #### Parameters * ##### options: [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md)\ Configuration options for the recoverable state #### Returns [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md)\ ## Accessors[**](#Accessors) ### [**](#currentValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L157)currentValue * **get currentValue(): TStateModel - Get the current state. *** #### Returns TStateModel ## Methods[**](#Methods) ### [**](#initialize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L115)initialize * ****initialize**(): Promise\ - Initialize the recoverable state. This method must be called before using the recoverable state. It loads the saved state if persistence is enabled and registers the object to listen for PERSIST\_STATE events. *** #### Returns Promise\ The loaded state object ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L191)persistState * ****persistState**(eventData): Promise\ - Persist the current state to the KeyValueStore. This method is typically called in response to a PERSIST\_STATE event, but can also be called directly when needed. *** #### Parameters * ##### optionaleventData: { isMigrating: boolean } Optional data associated with a PERSIST\_STATE event * ##### isMigrating: boolean #### Returns Promise\ ### [**](#reset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L171)reset * ****reset**(): Promise\ - Reset the state to the default values and clear any persisted state. Resets the current state to the default state and, if persistence is enabled, clears the persisted state from the KeyValueStore. *** #### Returns Promise\ ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L144)teardown * ****teardown**(): Promise\ - Clean up resources used by the recoverable state. If persistence is enabled, this method deregisters the object from PERSIST\_STATE events and persists the current state one last time. *** #### Returns Promise\ --- # Request \ Represents a URL to be crawled, optionally including HTTP method, headers, payload and other metadata. The `Request` object also stores information about errors that occurred during processing of the request. Each `Request` instance has the `uniqueKey` property, which can be either specified manually in the constructor or generated automatically from the URL. Two requests with the same `uniqueKey` are considered as pointing to the same web resource. 
This behavior applies to all Crawlee classes, such as [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md), [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). > To access and examine the actual request sent over HTTP, with all auto-filled headers, you can access the `response.request` object from the request handler. Example use: ``` const request = new Request({ url: 'http://www.example.com', headers: { Accept: 'application/json' }, }); ... request.userData.foo = 'bar'; request.pushErrorMessage(new Error('Request failed!')); ... const foo = request.userData.foo; ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**errorMessages](#errorMessages) * [**handledAt](#handledAt) * [**headers](#headers) * [**id](#id) * [**loadedUrl](#loadedUrl) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**userData](#userData) ### Accessors * [**crawlDepth](#crawlDepth) * [**label](#label) * [**maxRetries](#maxRetries) * [**sessionRotationCount](#sessionRotationCount) * [**skipNavigation](#skipNavigation) * [**state](#state) ### Methods * [**pushErrorMessage](#pushErrorMessage) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L140)constructor * ****new Request**\(options): [Request](https://crawlee.dev/js/api/core/class/Request.md)\ - `Request` parameters including the URL, HTTP method and headers, and others. *** #### Parameters * ##### options: [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)\ #### Returns [Request](https://crawlee.dev/js/api/core/class/Request.md)\ ## Properties[**](#Properties) ### [**](#errorMessages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L120)errorMessages **errorMessages: string\[] An array of error messages from request processing. ### [**](#handledAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L135)optionalhandledAt **handledAt? : string ISO datetime string that indicates the time when the request has been processed. It is `null` if the request has not been crawled yet. ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L123)optionalheaders **headers? : Record\ Object with HTTP headers. Key is the header name, value is its value. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L86)optionalid **id? : string Request ID ### [**](#loadedUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L99)optionalloadedUrl **loadedUrl? : string The URL that was actually loaded after redirects, if present. HTTP redirects are guaranteed to be included. When using [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), meta tag and JavaScript redirects may or may not be included, depending on their nature. This generally means that redirects that happen immediately will most likely be included, but delayed redirects will not.
### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L108)method **method: AllowedHttpMethods HTTP method, e.g. `GET` or `POST`. ### [**](#noRetry)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L114)noRetry **noRetry: boolean The `true` value indicates that the request will not be automatically retried on error. ### [**](#payload)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L111)optionalpayload **payload? : string HTTP request payload, e.g. for POST requests. ### [**](#retryCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L117)retryCount **retryCount: number Indicates the number of times the crawling of the request has been retried on error. ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L105)uniqueKey **uniqueKey: string A unique key identifying the request. Two requests with the same `uniqueKey` are considered as pointing to the same URL. ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L89)url **url: string URL of the web page to crawl. ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L129)userData **userData: UserData = ... Custom user data assigned to the request. ## Accessors[**](#Accessors) ### [**](#crawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L280)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L288)crawlDepth * **get crawlDepth(): number * **set crawlDepth(value): void - Depth of the request in the current crawl tree. Note that this is dependent on the crawler setup and might produce unexpected results when used with multiple crawlers. *** #### Returns number - Depth of the request in the current crawl tree. Note that this is dependent on the crawler setup and might produce unexpected results when used with multiple crawlers. *** #### Parameters * ##### value: number #### Returns void ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L308)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L313)label * **get label(): undefined | string * **set label(value): void - shortcut for getting `request.userData.label` *** #### Returns undefined | string - shortcut for setting `request.userData.label` *** #### Parameters * ##### value: undefined | string #### Returns void ### [**](#maxRetries)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L318)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L323)maxRetries * **get maxRetries(): undefined | number * **set maxRetries(value): void - Maximum number of retries for this request. Allows to override the global `maxRequestRetries` option of `BasicCrawler`. *** #### Returns undefined | number - Maximum number of retries for this request. Allows to override the global `maxRequestRetries` option of `BasicCrawler`. 
*** #### Parameters * ##### value: undefined | number #### Returns void ### [**](#sessionRotationCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L294)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L299)sessionRotationCount * **get sessionRotationCount(): number * **set sessionRotationCount(value): void - Indicates the number of times the crawling of the request has rotated the session due to a session or a proxy error. *** #### Returns number - Indicates the number of times the crawling of the request has rotated the session due to a session or a proxy error. *** #### Parameters * ##### value: number #### Returns void ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L263)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L268)skipNavigation * **get skipNavigation(): boolean * **set skipNavigation(value): void - Tells the crawler processing this request to skip the navigation and process the request directly. *** #### Returns boolean - Tells the crawler processing this request to skip the navigation and process the request directly. *** #### Parameters * ##### value: boolean #### Returns void ### [**](#state)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L332)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L337)state * **get state(): [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) * **set state(value): void - Describes the request's current lifecycle state. *** #### Returns [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) - Describes the request's current lifecycle state. *** #### Parameters * ##### value: [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) #### Returns void ## Methods[**](#Methods) ### [**](#pushErrorMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L370)pushErrorMessage * ****pushErrorMessage**(errorOrMessage, options): void - Stores information about an error that occurred during processing of this request. You should always use Error instances when throwing errors in JavaScript. Nevertheless, to improve the debugging experience when using third party libraries that may not always throw an Error instance, the function performs a type inspection of the passed argument and attempts to extract as much information as possible, since just throwing a bad type error makes any debugging rather difficult. *** #### Parameters * ##### errorOrMessage: unknown Error object or error message to be stored in the request. * ##### optionaloptions: [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) = {} #### Returns void --- # RequestHandlerResult experimental A partial implementation of [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) that stores parameters of calls to context methods for later inspection. 
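To make the recording behaviour concrete, here is a minimal sketch based on the constructor and accessors documented below. It assumes `RequestHandlerResult` and `Configuration` are importable from `@crawlee/core`, and the state key string is arbitrary; since the class is experimental, treat this as an illustration rather than a stable recipe.

```
import { Configuration, RequestHandlerResult } from '@crawlee/core';

// The second argument is an arbitrary key-value store key chosen for this sketch.
const result = new RequestHandlerResult(Configuration.getGlobalConfig(), 'CRAWLEE_STATE');

// Calls made through the result object are recorded for later inspection.
await result.pushData({ title: 'Example page' });
await result.addRequests(['https://example.com/next-page']);

console.log(result.datasetItems); // [{ item: { title: 'Example page' } }]
console.log(result.enqueuedUrls); // [{ url: 'https://example.com/next-page' }]
console.log(result.calls); // raw arguments of the pushData and addRequests calls
```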
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**addRequests](#addRequests) * [**getKeyValueStore](#getKeyValueStore) * [**pushData](#pushData) * [**useState](#useState) ### Accessors * [**calls](#calls) * [**datasetItems](#datasetItems) * [**enqueuedUrlLists](#enqueuedUrlLists) * [**enqueuedUrls](#enqueuedUrls) * [**keyValueStoreChanges](#keyValueStoreChanges) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L182)constructor * ****new RequestHandlerResult**(config, crawleeStateKey): [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) - experimental #### Parameters * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) * ##### crawleeStateKey: string #### Returns [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L266)addRequests **addRequests: (requestsLike, options) => Promise\ = ... #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> #### Returns Promise\ ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L275)getKeyValueStore **getKeyValueStore: (idOrName) => Promise\> = ... #### Type declaration * * **(idOrName): Promise\> - #### Parameters * ##### optionalidOrName: string #### Returns Promise\> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L262)pushData **pushData: (data, datasetIdOrName) => Promise\ = ... #### Type declaration * * **(data, datasetIdOrName): Promise\ - This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L270)useState **useState: \(defaultValue) => Promise\ = ... #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Accessors[**](#Accessors) ### [**](#calls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L190)calls * **get calls(): ReadonlyObjectDeep<{ addRequests: \[requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex? : RegExp; requestsFromUrl? 
: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[], options?: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)>]\[]; pushData: \[data: ReadonlyDeep\, datasetIdOrName?: string]\[] }> - experimental A record of calls to [RestrictedCrawlingContext.pushData](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#pushData), [RestrictedCrawlingContext.addRequests](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#addRequests), [RestrictedCrawlingContext.enqueueLinks](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#enqueueLinks) made by a request handler. *** #### Returns ReadonlyObjectDeep<{ addRequests: \[requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[], options?: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)>]\[]; pushData: \[data: ReadonlyDeep\, datasetIdOrName?: string]\[] }> ### [**](#datasetItems)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L212)datasetItems * **get datasetItems(): readonly ReadonlyObjectDeep<{ datasetIdOrName? : string; item: Dictionary }>\[] - experimental Items added to datasets by a request handler. *** #### Returns readonly ReadonlyObjectDeep<{ datasetIdOrName?: string; item: Dictionary }>\[] ### [**](#enqueuedUrlLists)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L244)enqueuedUrlLists * **get enqueuedUrlLists(): readonly ReadonlyObjectDeep<{ label? : string; listUrl: string }>\[] - experimental URL lists enqueued to the request queue by a request handler via [RestrictedCrawlingContext.addRequests](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#addRequests) using the `requestsFromUrl` option. *** #### Returns readonly ReadonlyObjectDeep<{ label?: string; listUrl: string }>\[] ### [**](#enqueuedUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L221)enqueuedUrls * **get enqueuedUrls(): readonly ReadonlyObjectDeep<{ label? : string; url: string }>\[] - experimental URLs enqueued to the request queue by a request handler, either via [RestrictedCrawlingContext.addRequests](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#addRequests) or [RestrictedCrawlingContext.enqueueLinks](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#enqueueLinks) *** #### Returns readonly ReadonlyObjectDeep<{ label?: string; url: string }>\[] ### [**](#keyValueStoreChanges)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L203)keyValueStoreChanges * **get keyValueStoreChanges(): ReadonlyObjectDeep\ : [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) }>>> - experimental A record of changes made to key-value stores by a request handler. *** #### Returns ReadonlyObjectDeep\: [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) }>>> --- # RequestList Represents a static list of URLs to crawl. The URLs can be provided either in code or parsed from a text file hosted on the web. 
`RequestList` is used by [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md), [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) as a source of URLs to crawl. Each URL is represented using an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) class. The list can only contain unique URLs. More precisely, it can only contain `Request` instances with distinct `uniqueKey` properties. By default, `uniqueKey` is generated from the URL, but it can also be overridden. To add a single URL to the list multiple times, corresponding [Request](https://crawlee.dev/js/api/core/class/Request.md) objects will need to have different `uniqueKey` properties. You can use the `keepDuplicateUrls` option to do this for you when initializing the `RequestList` from sources. `RequestList` doesn't have a public constructor, you need to create it with the asynchronous [RequestList.open](https://crawlee.dev/js/api/core/class/RequestList.md#open) function. After the request list is created, no more URLs can be added to it. Unlike [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), `RequestList` is static but it can contain even millions of URLs. > Note that `RequestList` can be used together with `RequestQueue` by the same crawler. In such cases, each request from `RequestList` is enqueued into `RequestQueue` first and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there is a large number of initial URLs, but more URLs would be added dynamically by the crawler. `RequestList` has an internal state where it stores information about which requests were already handled, which are in progress and which were reclaimed. The state may be automatically persisted to the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) by setting the `persistStateKey` option so that if the Node.js process is restarted, the crawling can continue where it left off. The automated persisting is launched upon receiving the `persistState` event that is periodically emitted by [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md). The internal state is closely tied to the provided sources (URLs). If the sources change on crawler restart, the state will become corrupted and `RequestList` will raise an exception. This typically happens when the sources is a list of URLs downloaded from the web. In such case, use the `persistRequestsKey` option in conjunction with `persistStateKey`, to make the `RequestList` store the initial sources to the default key-value store and load them after restart, which will prevent any issues that a live list of URLs might cause. **Basic usage:** ``` const requestList = await RequestList.open('my-request-list', [ 'http://www.example.com/page-1', { url: 'http://www.example.com/page-2', method: 'POST', userData: { foo: 'bar' }}, { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } }, ]); ``` **Advanced usage:** ``` const requestList = await RequestList.open(null, [ // Separate requests { url: 'http://www.example.com/page-1', method: 'GET', headers: { ... 
} }, { url: 'http://www.example.com/page-2', userData: { foo: 'bar' }}, // Bulk load of URLs from file `http://www.example.com/my-url-list.txt` // Note that all URLs must start with http:// or https:// { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } }, ], { // Persist the state to avoid re-crawling which can lead to data duplications. // Keep in mind that the sources have to be immutable or this will throw an error. persistStateKey: 'my-state', }); ``` ### Implements * [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ## Index[**](#Index) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**fetchNextRequest](#fetchNextRequest) * [**getState](#getState) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**length](#length) * [**markRequestHandled](#markRequestHandled) * [**persistState](#persistState) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L684)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Implementation of IRequestList.\[asyncIterator] Can be used to iterate over the `RequestList` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L657)fetchNextRequest * ****fetchNextRequest**(): Promise\> - Implementation of IRequestList.fetchNextRequest Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. First, the function gets a request previously reclaimed using the [RequestList.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestList.md#reclaimRequest) function, if there is any. Otherwise it gets the next request from sources. The function's `Promise` resolves to `null` if there are no more requests to process. *** #### Returns Promise\> ### [**](#getState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L626)getState * ****getState**(): [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) - Returns an object representing the internal state of the `RequestList` instance. Note that the object's fields can change in future releases. *** #### Returns [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L870)handledCount * ****handledCount**(): number - Implementation of IRequestList.handledCount Returns number of handled requests. *** #### Returns number ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L639)isEmpty * ****isEmpty**(): Promise\ - Implementation of IRequestList.isEmpty Resolves to `true` if the next call to [IRequestList.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestList.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the list is empty, there might be some pending requests currently being processed. 
*** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L648)isFinished * ****isFinished**(): Promise\ - Implementation of IRequestList.isFinished Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#length)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L861)length * ****length**(): number - Implementation of IRequestList.length Returns the total number of unique requests present in the `RequestList`. *** #### Returns number ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L704)markRequestHandled * ****markRequestHandled**(request): Promise\ - Implementation of IRequestList.markRequestHandled Marks request as handled after successful processing. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L504)persistState * ****persistState**(): Promise\ - Implementation of IRequestList.persistState Persists the current state of the `IRequestList` into the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). The state is persisted automatically in regular intervals, but calling this method manually is useful in cases where you want to have the most current state available after you pause or stop fetching its requests. For example after you pause or abort a crawl. Or just before a server migration. *** #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L718)reclaimRequest * ****reclaimRequest**(request): Promise\ - Implementation of IRequestList.reclaimRequest Reclaims request to the list if its processing failed. The request will become available in the next `this.fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L929)staticopen * ****open**(listNameOrOptions, sources, options): Promise<[RequestList](https://crawlee.dev/js/api/core/class/RequestList.md)> - Opens a request list and returns a promise resolving to an instance of the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that is already initialized. [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) represents a list of URLs to crawl, which is always stored in memory. To enable picking up where left off after a process restart, the request list sources are persisted to the key-value store at initialization of the list. Then, while crawling, a small state object is regularly persisted to keep track of the crawling status. For more details and code examples, see the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class. **Example usage:** ``` const sources = [ 'https://www.example.com', 'https://www.google.com', 'https://www.bing.com' ]; const requestList = await RequestList.open('my-name', sources); ``` *** #### Parameters * ##### listNameOrOptions: null | string | [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) Name of the request list to be opened, or the options object. 
Setting a name enables the `RequestList`'s state to be persisted in the key-value store. This is useful in case of a restart or migration. Since `RequestList` is only stored in memory, a restart or migration wipes it clean. Setting a name will enable the `RequestList`'s state to survive those situations and continue where it left off. The name will be used as a prefix in key-value store, producing keys such as `NAME-REQUEST_LIST_STATE` and `NAME-REQUEST_LIST_SOURCES`. If `null`, the list will not be persisted and will only be stored in memory. Process restart will then cause the list to be crawled again from the beginning. We suggest always using a name. * ##### optionalsources: RequestListSource\[] An array of sources of URLs for the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md). It can be either an array of strings, plain objects that define at least the `url` property, or an array of [Request](https://crawlee.dev/js/api/core/class/Request.md) instances. **IMPORTANT:** The `sources` array will be consumed (left empty) after [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) initializes. This is a measure to prevent memory leaks in situations when millions of sources are added. Additionally, the `requestsFromUrl` property may be used instead of `url`, which will instruct [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) to download the source URLs from a given remote location. The URLs will be parsed from the received response. In this case you can limit the URLs using `regex` parameter containing regular expression pattern for URLs to be included. For details, see the [RequestListOptions.sources](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#sources) * ##### optionaloptions: [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) = {} The [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) options. Note that the `listName` parameter supersedes the [RequestListOptions.persistStateKey](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#persistStateKey) and [RequestListOptions.persistRequestsKey](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#persistRequestsKey) options and the `sources` parameter supersedes the [RequestListOptions.sources](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#sources) option. #### Returns Promise<[RequestList](https://crawlee.dev/js/api/core/class/RequestList.md)> --- # RequestManagerTandem A request manager that combines a RequestList and a RequestQueue. It first reads requests from the RequestList and then, when needed, transfers them in batches to the RequestQueue. 
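The sketch below shows roughly how the tandem ties the two sources together, using the constructor and methods documented below; the import of `RequestManagerTandem` from the `crawlee` package is an assumption, and the comments describe the expected behavior based on the class description.

```
import { RequestList, RequestManagerTandem, RequestQueue } from 'crawlee';

// A static list of start URLs plus a dynamic queue for URLs discovered during the crawl.
const requestList = await RequestList.open('my-list', ['https://example.com/page-1']);
const requestQueue = await RequestQueue.open();

// The tandem serves requests from the list first and transfers them to the queue as needed.
const tandem = new RequestManagerTandem(requestList, requestQueue);

// Newly discovered requests are added through the tandem's queue side.
await tandem.addRequest({ url: 'https://example.com/discovered' });

const request = await tandem.fetchNextRequest();
if (request) {
    // ... process the request ...
    await tandem.markRequestHandled(request);
}
```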
### Implements * [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequestsBatched](#addRequestsBatched) * [**fetchNextRequest](#fetchNextRequest) * [**getPendingCount](#getPendingCount) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L27)constructor * ****new RequestManagerTandem**(requestList, requestQueue): [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) - #### Parameters * ##### requestList: [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) * ##### requestQueue: [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) #### Returns [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L122)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Implementation of IRequestManager.\[asyncIterator] Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L150)addRequest * ****addRequest**(requestLike, options): Promise\ - Implementation of IRequestManager.addRequest * **@inheritDoc** *** #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) #### Returns Promise\ ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L157)addRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - Implementation of IRequestManager.addRequestsBatched * **@inheritDoc** *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) * ##### optionaloptions: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L66)fetchNextRequest * ****fetchNextRequest**\(): Promise\> - Implementation of IRequestManager.fetchNextRequest Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. The function's `Promise` resolves to `null` if there are no more requests to process. 
*** #### Returns Promise\> ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L115)getPendingCount * ****getPendingCount**(): number - Implementation of IRequestManager.getPendingCount Get an offline approximation of the number of pending requests. *** #### Returns number ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L108)getTotalCount * ****getTotalCount**(): number - Implementation of IRequestManager.getTotalCount Get the total number of requests known to the request manager. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L100)handledCount * ****handledCount**(): Promise\ - Implementation of IRequestManager.handledCount Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L92)isEmpty * ****isEmpty**(): Promise\ - Implementation of IRequestManager.isEmpty Resolves to `true` if the next call to [IRequestManager.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestManager.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the provider is empty, there might be some pending requests currently being processed. *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L84)isFinished * ****isFinished**(): Promise\ - Implementation of IRequestManager.isFinished Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L133)markRequestHandled * ****markRequestHandled**(request): Promise\ - Implementation of IRequestManager.markRequestHandled Marks request as handled after successful processing. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L140)reclaimRequest * ****reclaimRequest**(request, options): Promise\ - Implementation of IRequestManager.reclaimRequest Reclaims request to the provider if its processing failed. The request will become available in the next `fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) #### Returns Promise\ --- # abstractRequestProvider Represents a provider of requests/URLs to crawl. 
### Hierarchy * *RequestProvider* * [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) * [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### Implements * [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) * [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**assumedHandledCount](#assumedHandledCount) * [**assumedTotalCount](#assumedTotalCount) * [**client](#client) * [**clientKey](#clientKey) * [**config](#config) * [**id](#id) * [**internalTimeoutMillis](#internalTimeoutMillis) * [**log](#log) * [**name](#name) * [**requestLockSecs](#requestLockSecs) * [**timeoutSecs](#timeoutSecs) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequests](#addRequests) * [**addRequestsBatched](#addRequestsBatched) * [**drop](#drop) * [**fetchNextRequest](#fetchNextRequest) * [**getInfo](#getInfo) * [**getPendingCount](#getPendingCount) * [**getRequest](#getRequest) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L135)constructor * ****new RequestProvider**(options, config): [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) - #### Parameters * ##### options: InternalRequestProviderOptions * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ## Properties[**](#Properties) ### [**](#assumedHandledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L117)assumedHandledCount **assumedHandledCount: number = 0 ### [**](#assumedTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L116)assumedTotalCount **assumedTotalCount: number = 0 ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L107)client **client: [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) ### [**](#clientKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L106)clientKey **clientKey: string = ... ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L137)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L103)id **id: string Implementation of IStorage.id ### [**](#internalTimeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L111)internalTimeoutMillis **internalTimeoutMillis: number = ... ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L110)log **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L104)optionalname **name? 
: string Implementation of IStorage.name ### [**](#requestLockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L112)requestLockSecs **requestLockSecs: number = ... ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L105)timeoutSecs **timeoutSecs: number = 30 ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L728)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Implementation of IRequestManager.\[asyncIterator] Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L191)addRequest * ****addRequest**(requestLike, options): Promise\ - Implementation of IRequestManager.addRequest Adds a request to the queue. If a request with the same `uniqueKey` property is already present in the queue, it will not be updated. You can find out whether this happened from the resulting [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) object. To add multiple requests to the queue by extracting links from a webpage, see the enqueueLinks helper function. *** #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) [Request](https://crawlee.dev/js/api/core/class/Request.md) object or vanilla object with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed Request. * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L275)addRequests * ****addRequests**(requestsLike, options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Adds requests to the queue in batches of 25. This method will wait till all the requests are added to the queue before resolving. You should prefer using `queue.addRequestsBatched()` or `crawler.addRequests()` if you don't want to block the processing, as those methods will only wait for the initial 1000 requests, start processing right after that happens, and continue adding more in the background. If a request passed in is already present due to its `uniqueKey` property being the same, it will not be updated. You can find out whether this happened by finding the request in the resulting [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. *** #### Parameters * ##### requestsLike: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) [Request](https://crawlee.dev/js/api/core/class/Request.md) objects or vanilla objects with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed requests if missing. * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. 
#### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L395)addRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - Implementation of IRequestManager.addRequestsBatched Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) = {} Options for the request queue #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L717)drop * ****drop**(): Promise\ - Removes the queue either from the Apify Cloud storage or from the local database, depending on the mode of operation. *** #### Returns Promise\ ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L563)abstractfetchNextRequest * ****fetchNextRequest**\(): Promise\> - Implementation of IRequestManager.fetchNextRequest Returns a next request in the queue to be processed, or `null` if there are no more pending requests. Once you successfully finish processing of the request, you need to call [RequestQueue.markRequestHandled](https://crawlee.dev/js/api/core/class/RequestQueue.md#markRequestHandled) to mark the request as handled in the queue. If there was some error in processing the request, call [RequestQueue.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#reclaimRequest) instead, so that the queue will give the request to some other consumer in another call to the `fetchNextRequest` function. Note that the `null` return value doesn't mean the queue processing finished, it means there are currently no pending requests. To check whether all requests in queue were finished, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished) instead. *** #### Returns Promise\> Returns the request object or `null` if there are no more pending requests. ### [**](#getInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L776)getInfo * ****getInfo**(): Promise\ - Returns an object containing general information about the request queue. The function returns the same object as the Apify API Client's [getQueue](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-requestQueues) function, which in turn calls the [Get request queue](https://apify.com/docs/api/v2#/reference/request-queues/queue/get-request-queue) API endpoint. 
**Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-queue", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), totalRequestCount: 25, handledRequestCount: 5, pendingRequestCount: 20, } ``` *** #### Returns Promise\ ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L173)getPendingCount * ****getPendingCount**(): number - Implementation of IRequestManager.getPendingCount Returns an offline approximation of the total number of pending requests in the queue. Survives restarts and Actor migrations. *** #### Returns number ### [**](#getRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L535)getRequest * ****getRequest**\(id): Promise\> - Gets the request from the queue specified by ID. *** #### Parameters * ##### id: string ID of the request. #### Returns Promise\> Returns the request object, or `null` if it was not found. ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L164)getTotalCount * ****getTotalCount**(): number - Implementation of IRequestManager.getTotalCount Returns an offline approximation of the total number of requests in the queue (i.e. pending + handled). Survives restarts and actor migrations. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L746)handledCount * ****handledCount**(): Promise\ - Implementation of IRequestManager.handledCount Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L663)isEmpty * ****isEmpty**(): Promise\ - Implementation of IRequestManager.isEmpty Resolves to `true` if the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) would return `null`, otherwise it resolves to `false`. Note that even if the queue is empty, there might be some pending requests currently being processed. If you need to ensure that there is no activity in the queue, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished). *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L674)abstractisFinished * ****isFinished**(): Promise\ - Implementation of IRequestManager.isFinished Resolves to `true` if all requests were already handled and there are no more left. Due to the nature of distributed storage used by the queue, the function may occasionally return a false negative, but it shall never return a false positive. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L571)markRequestHandled * ****markRequestHandled**(request): Promise\ - Implementation of IRequestManager.markRequestHandled Marks a request that was previously returned by the [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) function as handled after successful processing. Handled requests will never again be returned by the `fetchNextRequest` function. 
*** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L617)reclaimRequest * ****reclaimRequest**(request, options): Promise\ - Implementation of IRequestManager.reclaimRequest Reclaims a failed request back to the queue, so that it can be returned for processing later again by another call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). The request record in the queue is updated using the provided `request` parameter. For example, this lets you store the number of retries or error messages for the request. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L857)staticopen * ****open**(queueIdOrName, options): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Opens a request queue and returns a promise resolving to an instance of the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. For more details and code examples, see the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. *** #### Parameters * ##### optionalqueueIdOrName: null | string ID or name of the request queue to be opened. If `null` or `undefined`, the function returns the default request queue associated with the crawler run. * ##### optionaloptions: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) = {} Open Request Queue options. #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> --- # RequestQueue Represents a queue of URLs to crawl, which is used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each URL is represented using an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) class. The queue can only contain unique URLs. More precisely, it can only contain [Request](https://crawlee.dev/js/api/core/class/Request.md) instances with distinct `uniqueKey` properties. By default, `uniqueKey` is generated from the URL, but it can also be overridden. To add a single URL multiple times to the queue, corresponding [Request](https://crawlee.dev/js/api/core/class/Request.md) objects will need to have different `uniqueKey` properties. Do not instantiate this class directly, use the [RequestQueue.open](https://crawlee.dev/js/api/core/class/RequestQueue.md#open) function instead. 
`RequestQueue` is used by [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md), [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) as a source of URLs to crawl. Unlike [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md), `RequestQueue` supports dynamic adding and removing of requests. On the other hand, the queue is not optimized for operations that add or remove a large number of URLs in a batch. **Example usage:** ``` // Open the default request queue associated with the crawler run const queue = await RequestQueue.open(); // Open a named request queue const queueWithName = await RequestQueue.open('some-name'); // Enqueue few requests await queue.addRequest({ url: 'http://example.com/aaa' }); await queue.addRequest({ url: 'http://example.com/bbb' }); await queue.addRequest({ url: 'http://example.com/foo/bar' }, { forefront: true }); ``` ### Hierarchy * [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) * *RequestQueue* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**assumedHandledCount](#assumedHandledCount) * [**assumedTotalCount](#assumedTotalCount) * [**client](#client) * [**clientKey](#clientKey) * [**config](#config) * [**id](#id) * [**internalTimeoutMillis](#internalTimeoutMillis) * [**log](#log) * [**name](#name) * [**requestLockSecs](#requestLockSecs) * [**timeoutSecs](#timeoutSecs) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequests](#addRequests) * [**addRequestsBatched](#addRequestsBatched) * [**drop](#drop) * [**fetchNextRequest](#fetchNextRequest) * [**getInfo](#getInfo) * [**getPendingCount](#getPendingCount) * [**getRequest](#getRequest) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L70)constructor * ****new RequestQueue**(options, config): [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) - Overrides RequestProvider.constructor #### Parameters * ##### options: [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
#### Returns [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ## Properties[**](#Properties) ### [**](#assumedHandledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L117)inheritedassumedHandledCount **assumedHandledCount: number = 0 Inherited from RequestProvider.assumedHandledCount ### [**](#assumedTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L116)inheritedassumedTotalCount **assumedTotalCount: number = 0 Inherited from RequestProvider.assumedTotalCount ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L107)inheritedclient **client: [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) Inherited from RequestProvider.client ### [**](#clientKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L106)inheritedclientKey **clientKey: string = ... Inherited from RequestProvider.clientKey ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L137)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from RequestProvider.config ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L103)inheritedid **id: string Inherited from RequestProvider.id ### [**](#internalTimeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L111)inheritedinternalTimeoutMillis **internalTimeoutMillis: number = ... Inherited from RequestProvider.internalTimeoutMillis ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L110)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from RequestProvider.log ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L104)optionalinheritedname **name? : string Inherited from RequestProvider.name ### [**](#requestLockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L112)inheritedrequestLockSecs **requestLockSecs: number = ... Inherited from RequestProvider.requestLockSecs ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L105)inheritedtimeoutSecs **timeoutSecs: number = 30 Inherited from RequestProvider.timeoutSecs ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L728)inherited\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Inherited from RequestProvider.\[asyncIterator] Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L113)addRequest * ****addRequest**(requestLike, options): Promise\ - Overrides RequestProvider.addRequest Adds a request to the queue. 
If a request with the same `uniqueKey` property is already present in the queue, it will not be updated. You can find out whether this happened from the resulting [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) object. To add multiple requests to the queue by extracting links from a webpage, see the enqueueLinks helper function. *** #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) [Request](https://crawlee.dev/js/api/core/class/Request.md) object or vanilla object with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed Request. * ##### options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L127)addRequests * ****addRequests**(requestsLike, options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Overrides RequestProvider.addRequests Adds requests to the queue in batches of 25. This method will wait till all the requests are added to the queue before resolving. You should prefer using `queue.addRequestsBatched()` or `crawler.addRequests()` if you don't want to block the processing, as those methods will only wait for the initial 1000 requests, start processing right after that happens, and continue adding more in the background. If a request passed in is already present due to its `uniqueKey` property being the same, it will not be updated. You can find out whether this happened by finding the request in the resulting [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. *** #### Parameters * ##### requestsLike: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) [Request](https://crawlee.dev/js/api/core/class/Request.md) objects or vanilla objects with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed requests if missing. * ##### options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L395)inheritedaddRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - Inherited from RequestProvider.addRequestsBatched Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. 
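For example, a minimal sketch of non-blocking batched enqueueing (the URL list and option values are illustrative):

```
import { RequestQueue } from 'crawlee';

const urls = Array.from({ length: 5000 }, (_, i) => `https://example.com/page/${i}`);

const queue = await RequestQueue.open();

// Resolves after the initial batch; the remaining batches keep being added in the background.
const result = await queue.addRequestsBatched(
    urls.map((url) => ({ url })),
    { batchSize: 500, waitBetweenBatchesMillis: 1000 },
);

// Optionally block until every batch has been enqueued.
await result.waitForAllRequestsToBeAdded;
```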
*** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) = {} Options for the request queue #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L717)inheriteddrop * ****drop**(): Promise\ - Inherited from RequestProvider.drop Removes the queue either from the Apify Cloud storage or from the local database, depending on the mode of operation. *** #### Returns Promise\ ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L144)fetchNextRequest * ****fetchNextRequest**\(): Promise\> - Overrides RequestProvider.fetchNextRequest Returns a next request in the queue to be processed, or `null` if there are no more pending requests. Once you successfully finish processing of the request, you need to call [RequestQueue.markRequestHandled](https://crawlee.dev/js/api/core/class/RequestQueue.md#markRequestHandled) to mark the request as handled in the queue. If there was some error in processing the request, call [RequestQueue.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#reclaimRequest) instead, so that the queue will give the request to some other consumer in another call to the `fetchNextRequest` function. Note that the `null` return value doesn't mean the queue processing finished, it means there are currently no pending requests. To check whether all requests in queue were finished, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished) instead. *** #### Returns Promise\> Returns the request object or `null` if there are no more pending requests. ### [**](#getInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L776)inheritedgetInfo * ****getInfo**(): Promise\ - Inherited from RequestProvider.getInfo Returns an object containing general information about the request queue. The function returns the same object as the Apify API Client's [getQueue](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-requestQueues) function, which in turn calls the [Get request queue](https://apify.com/docs/api/v2#/reference/request-queues/queue/get-request-queue) API endpoint. **Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-queue", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), totalRequestCount: 25, handledRequestCount: 5, pendingRequestCount: 20, } ``` *** #### Returns Promise\ ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L173)inheritedgetPendingCount * ****getPendingCount**(): number - Inherited from RequestProvider.getPendingCount Returns an offline approximation of the total number of pending requests in the queue. Survives restarts and Actor migrations. *** #### Returns number ### [**](#getRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L535)inheritedgetRequest * ****getRequest**\(id): Promise\> - Inherited from RequestProvider.getRequest Gets the request from the queue specified by ID. 
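A minimal sketch, assuming the [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) returned by `addRequest` exposes the stored request's ID as `requestId`:

```
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

const { requestId } = await queue.addRequest({ url: 'https://example.com' });

// Look the stored request up again by its ID.
const request = await queue.getRequest(requestId);
console.log(request?.url); // 'https://example.com'
```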
*** #### Parameters * ##### id: string ID of the request. #### Returns Promise\> Returns the request object, or `null` if it was not found. ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L164)inheritedgetTotalCount * ****getTotalCount**(): number - Inherited from RequestProvider.getTotalCount Returns an offline approximation of the total number of requests in the queue (i.e. pending + handled). Survives restarts and actor migrations. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L746)inheritedhandledCount * ****handledCount**(): Promise\ - Inherited from RequestProvider.handledCount Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L663)inheritedisEmpty * ****isEmpty**(): Promise\ - Inherited from RequestProvider.isEmpty Resolves to `true` if the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) would return `null`, otherwise it resolves to `false`. Note that even if the queue is empty, there might be some pending requests currently being processed. If you need to ensure that there is no activity in the queue, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished). *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L204)isFinished * ****isFinished**(): Promise\ - Overrides RequestProvider.isFinished Resolves to `true` if all requests were already handled and there are no more left. Due to the nature of distributed storage used by the queue, the function may occasionally return a false negative, but it shall never return a false positive. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L196)markRequestHandled * ****markRequestHandled**(request): Promise\ - Overrides RequestProvider.markRequestHandled Marks a request that was previously returned by the [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) function as handled after successful processing. Handled requests will never again be returned by the `fetchNextRequest` function. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L286)reclaimRequest * ****reclaimRequest**(...args): Promise\ - Overrides RequestProvider.reclaimRequest Reclaims a failed request back to the queue, so that it can be returned for processing later again by another call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). The request record in the queue is updated using the provided `request` parameter. For example, this lets you store the number of retries or error messages for the request. 
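Taken together, `fetchNextRequest`, `markRequestHandled` and `reclaimRequest` support a manual consumption loop; a minimal sketch (crawlers normally drive this loop for you, and the processing step is hypothetical):

```
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

while (!(await queue.isFinished())) {
    const request = await queue.fetchNextRequest();
    if (!request) continue; // nothing pending right now, although the queue is not finished yet

    try {
        // ... process request.url here ...
        await queue.markRequestHandled(request);
    } catch {
        // Return the request to the queue so it can be retried later, possibly by another consumer.
        await queue.reclaimRequest(request);
    }
}
```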
*** #### Parameters * ##### rest...args: \[request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\, options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)] #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L556)staticopen * ****open**(...args): Promise<[RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md)> - Overrides RequestProvider.open Opens a request queue and returns a promise resolving to an instance of the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. For more details and code examples, see the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. *** #### Parameters * ##### rest...args: \[queueIdOrName?: null | string, options: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md)] ID or name of the request queue to be opened. If `null` or `undefined`, the function returns the default request queue associated with the crawler run. #### Returns Promise<[RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md)> --- # RequestQueueV1 Represents a queue of URLs to crawl, which is used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each URL is represented using an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) class. The queue can only contain unique URLs. More precisely, it can only contain [Request](https://crawlee.dev/js/api/core/class/Request.md) instances with distinct `uniqueKey` properties. By default, `uniqueKey` is generated from the URL, but it can also be overridden. To add a single URL multiple times to the queue, corresponding [Request](https://crawlee.dev/js/api/core/class/Request.md) objects will need to have different `uniqueKey` properties. Do not instantiate this class directly, use the [RequestQueue.open](https://crawlee.dev/js/api/core/class/RequestQueue.md#open) function instead. `RequestQueue` is used by [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md), [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) as a source of URLs to crawl. Unlike [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md), `RequestQueue` supports dynamic adding and removing of requests. On the other hand, the queue is not optimized for operations that add or remove a large number of URLs in a batch. `RequestQueue` stores its data either on local disk or in the Apify Cloud, depending on whether the `APIFY_LOCAL_STORAGE_DIR` or `APIFY_TOKEN` environment variable is set. If the `APIFY_LOCAL_STORAGE_DIR` environment variable is set, the queue data is stored in that directory in an SQLite database file. 
If the `APIFY_TOKEN` environment variable is set but `APIFY_LOCAL_STORAGE_DIR` is not, the data is stored in the [Apify Request Queue](https://docs.apify.com/storage/request-queue) cloud storage. Note that you can force usage of the cloud storage also by passing the `forceCloud` option to [RequestQueue.open](https://crawlee.dev/js/api/core/class/RequestQueue.md#open) function, even if the `APIFY_LOCAL_STORAGE_DIR` variable is set. **Example usage:** ``` // Open the default request queue associated with the crawler run const queue = await RequestQueue.open(); // Open a named request queue const queueWithName = await RequestQueue.open('some-name'); // Enqueue few requests await queue.addRequest({ url: 'http://example.com/aaa' }); await queue.addRequest({ url: 'http://example.com/bbb' }); await queue.addRequest({ url: 'http://example.com/foo/bar' }, { forefront: true }); ``` * **@deprecated** RequestQueue v1 is deprecated and will be removed in the future. Please use [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instead. ### Hierarchy * [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) * *RequestQueueV1* ## Index[**](#Index) ### Properties * [**assumedHandledCount](#assumedHandledCount) * [**assumedTotalCount](#assumedTotalCount) * [**client](#client) * [**clientKey](#clientKey) * [**config](#config) * [**id](#id) * [**internalTimeoutMillis](#internalTimeoutMillis) * [**log](#log) * [**name](#name) * [**requestLockSecs](#requestLockSecs) * [**timeoutSecs](#timeoutSecs) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequests](#addRequests) * [**addRequestsBatched](#addRequestsBatched) * [**drop](#drop) * [**fetchNextRequest](#fetchNextRequest) * [**getInfo](#getInfo) * [**getPendingCount](#getPendingCount) * [**getRequest](#getRequest) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Properties[**](#Properties) ### [**](#assumedHandledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L117)inheritedassumedHandledCount **assumedHandledCount: number = 0 Inherited from RequestProvider.assumedHandledCount ### [**](#assumedTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L116)inheritedassumedTotalCount **assumedTotalCount: number = 0 Inherited from RequestProvider.assumedTotalCount ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L107)inheritedclient **client: [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) Inherited from RequestProvider.client ### [**](#clientKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L106)inheritedclientKey **clientKey: string = ... Inherited from RequestProvider.clientKey ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L137)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from RequestProvider.config ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L103)inheritedid **id: string Inherited from RequestProvider.id ### [**](#internalTimeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L111)inheritedinternalTimeoutMillis **internalTimeoutMillis: number = ... Inherited from RequestProvider.internalTimeoutMillis ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L110)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from RequestProvider.log ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L104)optionalinheritedname **name? : string Inherited from RequestProvider.name ### [**](#requestLockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L112)inheritedrequestLockSecs **requestLockSecs: number = ... Inherited from RequestProvider.requestLockSecs ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L105)inheritedtimeoutSecs **timeoutSecs: number = 30 Inherited from RequestProvider.timeoutSecs ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L728)inherited\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Inherited from RequestProvider.\[asyncIterator] Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L191)inheritedaddRequest * ****addRequest**(requestLike, options): Promise\ - Inherited from RequestProvider.addRequest Adds a request to the queue. If a request with the same `uniqueKey` property is already present in the queue, it will not be updated. You can find out whether this happened from the resulting [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) object. To add multiple requests to the queue by extracting links from a webpage, see the enqueueLinks helper function. *** #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) [Request](https://crawlee.dev/js/api/core/class/Request.md) object or vanilla object with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed Request. * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L275)inheritedaddRequests * ****addRequests**(requestsLike, options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from RequestProvider.addRequests Adds requests to the queue in batches of 25. This method will wait till all the requests are added to the queue before resolving. 
You should prefer using `queue.addRequestsBatched()` or `crawler.addRequests()` if you don't want to block the processing, as those methods will only wait for the initial 1000 requests, start processing right after that happens, and continue adding more in the background. If a request passed in is already present due to its `uniqueKey` property being the same, it will not be updated. You can find out whether this happened by finding the request in the resulting [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. *** #### Parameters * ##### requestsLike: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) [Request](https://crawlee.dev/js/api/core/class/Request.md) objects or vanilla objects with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed requests if missing. * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L395)inheritedaddRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - Inherited from RequestProvider.addRequestsBatched Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) = {} Options for the request queue #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L717)inheriteddrop * ****drop**(): Promise\ - Inherited from RequestProvider.drop Removes the queue either from the Apify Cloud storage or from the local database, depending on the mode of operation. *** #### Returns Promise\ ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L128)fetchNextRequest * ****fetchNextRequest**\(): Promise\> - Overrides RequestProvider.fetchNextRequest Returns a next request in the queue to be processed, or `null` if there are no more pending requests. Once you successfully finish processing of the request, you need to call [RequestQueue.markRequestHandled](https://crawlee.dev/js/api/core/class/RequestQueue.md#markRequestHandled) to mark the request as handled in the queue. If there was some error in processing the request, call [RequestQueue.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#reclaimRequest) instead, so that the queue will give the request to some other consumer in another call to the `fetchNextRequest` function. 
Note that the `null` return value doesn't mean the queue processing finished, it means there are currently no pending requests. To check whether all requests in queue were finished, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished) instead. *** #### Returns Promise\> Returns the request object or `null` if there are no more pending requests. ### [**](#getInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L776)inheritedgetInfo * ****getInfo**(): Promise\ - Inherited from RequestProvider.getInfo Returns an object containing general information about the request queue. The function returns the same object as the Apify API Client's [getQueue](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-requestQueues) function, which in turn calls the [Get request queue](https://apify.com/docs/api/v2#/reference/request-queues/queue/get-request-queue) API endpoint. **Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-queue", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), totalRequestCount: 25, handledRequestCount: 5, pendingRequestCount: 20, } ``` *** #### Returns Promise\ ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L173)inheritedgetPendingCount * ****getPendingCount**(): number - Inherited from RequestProvider.getPendingCount Returns an offline approximation of the total number of pending requests in the queue. Survives restarts and Actor migrations. *** #### Returns number ### [**](#getRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L535)inheritedgetRequest * ****getRequest**\(id): Promise\> - Inherited from RequestProvider.getRequest Gets the request from the queue specified by ID. *** #### Parameters * ##### id: string ID of the request. #### Returns Promise\> Returns the request object, or `null` if it was not found. ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L164)inheritedgetTotalCount * ****getTotalCount**(): number - Inherited from RequestProvider.getTotalCount Returns an offline approximation of the total number of requests in the queue (i.e. pending + handled). Survives restarts and actor migrations. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L746)inheritedhandledCount * ****handledCount**(): Promise\ - Inherited from RequestProvider.handledCount Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L663)inheritedisEmpty * ****isEmpty**(): Promise\ - Inherited from RequestProvider.isEmpty Resolves to `true` if the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) would return `null`, otherwise it resolves to `false`. Note that even if the queue is empty, there might be some pending requests currently being processed. If you need to ensure that there is no activity in the queue, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished). 
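The counting and status helpers above can be combined into a simple progress report; a minimal sketch:

```
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Offline approximations, cheap enough for a monitoring loop.
const total = queue.getTotalCount();
const pending = queue.getPendingCount();

// Asynchronous checks against the underlying storage.
const handled = await queue.handledCount();
const empty = await queue.isEmpty();       // no request would be returned right now
const finished = await queue.isFinished(); // all requests were also handled

console.log(`Queue progress: ${handled}/${total} handled, ${pending} pending`, { empty, finished });
```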
*** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L311)isFinished * ****isFinished**(): Promise\ - Overrides RequestProvider.isFinished Resolves to `true` if all requests were already handled and there are no more left. Due to the nature of distributed storage used by the queue, the function may occasionally return a false negative, but it shall never return a false positive. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L368)markRequestHandled * ****markRequestHandled**(request): Promise\ - Overrides RequestProvider.markRequestHandled Marks a request that was previously returned by the [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) function as handled after successful processing. Handled requests will never again be returned by the `fetchNextRequest` function. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L338)reclaimRequest * ****reclaimRequest**(...args): Promise\ - Overrides RequestProvider.reclaimRequest Reclaims a failed request back to the queue, so that it can be returned for processing later again by another call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). The request record in the queue is updated using the provided `request` parameter. For example, this lets you store the number of retries or error messages for the request. *** #### Parameters * ##### rest...args: \[request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\, options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)] #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L398)staticopen * ****open**(...args): Promise<[RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueueV1.md)> - Overrides RequestProvider.open Opens a request queue and returns a promise resolving to an instance of the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. For more details and code examples, see the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. *** #### Parameters * ##### rest...args: \[queueIdOrName?: null | string, options: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md)] #### Returns Promise<[RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueueV1.md)> --- # RetryRequestError Errors of `RetryRequestError` type will always be retried by the crawler. *This error overrides the `maxRequestRetries` option, i.e. 
the request can be retried indefinitely until it succeeds.* ### Hierarchy * Error * *RetryRequestError* * [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L23)constructor * ****new RetryRequestError**(message): [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) - Overrides Error.constructor #### Parameters * ##### optionalmessage: string #### Returns [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from Error.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from Error.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from Error.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from Error.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from Error.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from Error.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. 
const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from Error.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from Error.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # Router \ Simple router that works based on request labels. This instance can then serve as a `requestHandler` of your crawler. ``` import { Router, CheerioCrawler, CheerioCrawlingContext } from 'crawlee'; const router = Router.create(); // We can also use factory methods for specific crawling contexts; the above is equivalent to: // import { createCheerioRouter } from 'crawlee'; // const router = createCheerioRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.run(); ``` Alternatively, we can use the default router instance from the crawler object: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler(); crawler.router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); crawler.router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); await crawler.run(); ``` For convenience, we can also define the routes right when creating the router: ``` import { CheerioCrawler, createCheerioRouter } from 'crawlee'; const crawler = new CheerioCrawler({ requestHandler: createCheerioRouter({ 'label-a': async (ctx) => { ... }, 'label-b': async (ctx) => { ... }, }), }); await crawler.run(); ``` Middlewares are also supported via the `router.use` method. There can be multiple middlewares for a single router; they are executed sequentially, in the same order in which they were registered. ``` crawler.router.use(async (ctx) => { ctx.log.info('...'); }); ``` ### Hierarchy * *Router* * [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ## Index[**](#Index) ### Methods * [**addDefaultHandler](#addDefaultHandler) * [**addHandler](#addHandler) * [**getHandler](#getHandler) * [**use](#use) * [**create](#create) ## Methods[**](#Methods) ### [**](#addDefaultHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L110)addDefaultHandler * ****addDefaultHandler**\(handler): void - Registers the default route handler. *** #### Parameters * ##### handler: (ctx) => Awaitable\ #### Returns void ### [**](#addHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L99)addHandler * ****addHandler**\(label, handler): void - Registers a new route handler for the given label.
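Handlers registered this way are matched against the label carried by each request, so labels assigned while enqueueing links determine which handler runs; a minimal sketch (the selector and label names are illustrative):

```
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Requests labelled 'DETAIL' are routed here.
router.addHandler('DETAIL', async ({ request, log }) => {
    log.info(`Detail page: ${request.url}`);
});

// Unlabelled requests fall back to the default handler.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com/']);
```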
*** #### Parameters * ##### label: string | symbol * ##### handler: (ctx) => Awaitable\ #### Returns void ### [**](#getHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L128)getHandler * ****getHandler**(label): (ctx) => Awaitable\ - Returns route handler for given label. If no label is provided, the default request handler will be returned. *** #### Parameters * ##### optionallabel: string | symbol #### Returns (ctx) => Awaitable\ * * **(ctx): Awaitable\ - #### Parameters * ##### ctx: Context #### Returns Awaitable\ ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L121)use * ****use**(middleware): void - Registers a middleware that will be fired before the matching route handler. Multiple middlewares can be registered, they will be fired in the same order. *** #### Parameters * ##### middleware: (ctx) => Awaitable\ #### Returns void ### [**](#create)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L177)staticcreate * ****create**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ - Creates new router instance. This instance can then serve as a `requestHandler` of your crawler. ``` import { Router, CheerioCrawler, CheerioCrawlingContext } from 'crawlee'; const router = Router.create(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # Session Sessions are used to store information such as cookies and can be used for generating fingerprints and proxy sessions. You can imagine each session as a specific user, with its own cookies, IP (via proxy) and potentially a unique browser fingerprint. Session internal state can be enriched with custom user data for example some authorization tokens and specific headers in general. ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**id](#id) * [**userData](#userData) ### Accessors * [**cookieJar](#cookieJar) * [**createdAt](#createdAt) * [**errorScore](#errorScore) * [**errorScoreDecrement](#errorScoreDecrement) * [**expiresAt](#expiresAt) * [**maxErrorScore](#maxErrorScore) * [**maxUsageCount](#maxUsageCount) * [**usageCount](#usageCount) ### Methods * [**getCookies](#getCookies) * [**getCookieString](#getCookieString) * [**getState](#getState) * [**isBlocked](#isBlocked) * [**isExpired](#isExpired) * [**isMaxUsageCountReached](#isMaxUsageCountReached) * [**isUsable](#isUsable) * [**markBad](#markBad) * [**markGood](#markGood) * [**retire](#retire) * [**retireOnBlockedStatusCodes](#retireOnBlockedStatusCodes) * [**setCookie](#setCookie) * [**setCookies](#setCookies) * [**setCookiesFromResponse](#setCookiesFromResponse) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L150)constructor * ****new Session**(options): [Session](https://crawlee.dev/js/api/core/class/Session.md) - Session configuration. 
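Rather than calling the constructor directly, sessions are usually obtained from a [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md); a minimal sketch of enriching one with custom user data (the token value is illustrative):

```
import { SessionPool } from 'crawlee';

const sessionPool = await SessionPool.open();
const session = await sessionPool.getSession();

// Arbitrary custom state, e.g. an authorization token obtained elsewhere.
session.userData.authToken = 'token-from-login';

// Report the outcome of using the session so the pool can score it.
session.markGood();
```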
*** #### Parameters * ##### options: [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) #### Returns [Session](https://crawlee.dev/js/api/core/class/Session.md) ## Properties[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L101)readonlyid **id: string ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L103)userData **userData: Dictionary ## Accessors[**](#Accessors) ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L143)cookieJar * **get cookieJar(): CookieJar - #### Returns CookieJar ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L135)createdAt * **get createdAt(): Date - #### Returns Date ### [**](#errorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L115)errorScore * **get errorScore(): number - #### Returns number ### [**](#errorScoreDecrement)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L127)errorScoreDecrement * **get errorScoreDecrement(): number - #### Returns number ### [**](#expiresAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L131)expiresAt * **get expiresAt(): Date - #### Returns Date ### [**](#maxErrorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L123)maxErrorScore * **get maxErrorScore(): number - #### Returns number ### [**](#maxUsageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L139)maxUsageCount * **get maxUsageCount(): number - #### Returns number ### [**](#usageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L119)usageCount * **get usageCount(): number - #### Returns number ## Methods[**](#Methods) ### [**](#getCookies)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L374)getCookies * ****getCookies**(url): [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] - Returns cookies in a format compatible with puppeteer/playwright and ready to be used with `page.setCookie`. *** #### Parameters * ##### url: string website url. Only cookies stored for this url will be returned #### Returns [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] ### [**](#getCookieString)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L385)getCookieString * ****getCookieString**(url): string - Returns cookies saved with the session in the typical key1=value1; key2=value2 format, ready to be used in a cookie header or elsewhere. *** #### Parameters * ##### url: string #### Returns string Represents `Cookie` header. ### [**](#getState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L256)getState * ****getState**(): [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) - Gets session state for persistence in KeyValueStore. *** #### Returns [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) Represents session internal state. ### [**](#isBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L209)isBlocked * ****isBlocked**(): boolean - Indicates whether the session is blocked. 
The session is blocked once it reaches the `maxErrorScore`. *** #### Returns boolean ### [**](#isExpired)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L218)isExpired * ****isExpired**(): boolean - Indicates whether the session is expired. Session expiration is determined by `maxAgeSecs`. Once the session is older than `createdAt + maxAgeSecs`, it is considered expired. *** #### Returns boolean ### [**](#isMaxUsageCountReached)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L226)isMaxUsageCountReached * ****isMaxUsageCountReached**(): boolean - Indicates whether the session has been used the maximum number of times. The maximum usage count can be changed via the `maxUsageCount` parameter. *** #### Returns boolean ### [**](#isUsable)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L234)isUsable * ****isUsable**(): boolean - Indicates whether the session can be used for further requests. A session is usable when it is not expired, not blocked, and its maximum usage count has not been reached. *** #### Returns boolean ### [**](#markBad)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L291)markBad * ****markBad**(): void - Increases the usage and error counts. Should be used when the session has been used unsuccessfully, for example because of a timeout. *** #### Returns void ### [**](#markGood)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L242)markGood * ****markGood**(): void - This method should be called after a successful session usage. It increases `usageCount` and potentially lowers the `errorScore` by the `errorScoreDecrement`. *** #### Returns void ### [**](#retire)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L278)retire * ****retire**(): void - Marks the session as blocked and emits an event on the `SessionPool`. This method should be used if the session usage was unsuccessful and you are sure that it is caused by the session configuration rather than an external factor, for example when the server returns a 403 status code. If the session does not work due to an external factor, such as a 5XX server error, you probably want to use the `markBad` method instead. *** #### Returns void ### [**](#retireOnBlockedStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L306)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L321)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L323)retireOnBlockedStatusCodes * ****retireOnBlockedStatusCodes**(statusCode): boolean * ****retireOnBlockedStatusCodes**(statusCode, additionalBlockedStatusCodes): boolean - With certain status codes (`401`, `403` or `429`) we can be certain that the target website is blocking us. This function conveniently retires the session when such a code is received. Optionally, the default status codes can be extended via the second parameter. *** #### Parameters * ##### statusCode: number HTTP status code. #### Returns boolean Whether the session was retired. ### [**](#setCookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L392)setCookie * ****setCookie**(rawCookie, url): void - Sets a cookie within this session for the specific URL.
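A minimal sketch of storing and reading cookies on a session (the URL and cookie values are illustrative):

```
import { SessionPool } from 'crawlee';

const sessionPool = await SessionPool.open();
const session = await sessionPool.getSession();

// Store a raw `Set-Cookie` string for a specific URL...
session.setCookie('sessionid=abc123; Path=/; HttpOnly', 'https://example.com');

// ...and read it back later as a `Cookie` header value for requests to that URL.
const cookieHeader = session.getCookieString('https://example.com');
console.log(cookieHeader); // e.g. 'sessionid=abc123'
```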
*** #### Parameters * ##### rawCookie: string * ##### url: string #### Returns void ### [**](#setCookies)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L365)setCookies * ****setCookies**(cookies, url): void - Saves an array with cookie objects to be used with the session. The objects should be in the format that [Puppeteer uses](https://pptr.dev/#?product=Puppeteer\&version=v2.0.0\&show=api-pagecookiesurls), but you can also use this function to set cookies manually: ``` [ { name: 'cookie1', value: 'my-cookie' }, { name: 'cookie2', value: 'your-cookie' } ] ``` *** #### Parameters * ##### cookies: [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] * ##### url: string #### Returns void ### [**](#setCookiesFromResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L341)setCookiesFromResponse * ****setCookiesFromResponse**(response): void - Saves cookies from an HTTP response to be used with the session. It expects an object with a `headers` property that's either an `Object` (typical Node.js responses) or a `Function` (Puppeteer Response). It then parses and saves the cookies from the `set-cookie` header, if available. *** #### Parameters * ##### response: [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) #### Returns void --- # SessionError Errors of `SessionError` type will trigger a session rotation. This error doesn't respect the `maxRequestRetries` option and has a separate limit of `maxSessionRotations`. ### Hierarchy * [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) * *SessionError* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L34)constructor * ****new SessionError**(message): [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) - Overrides RetryRequestError.constructor #### Parameters * ##### optionalmessage: string #### Returns [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from RetryRequestError.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from RetryRequestError.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from RetryRequestError.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? 
: string Inherited from RetryRequestError.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from RetryRequestError.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from RetryRequestError.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from RetryRequestError.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from RetryRequestError.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # SessionPool Handles the rotation, creation and persistence of user-like sessions. Creates a pool of [Session](https://crawlee.dev/js/api/core/class/Session.md) instances, that are randomly rotated. When some session is marked as blocked, it is removed and new one is created instead (the pool never returns an unusable session). Learn more in the [Session management guide](https://crawlee.dev/js/docs/guides/session-management.md). 
You can create one by calling the [SessionPool.open](https://crawlee.dev/js/api/core/class/SessionPool.md#open) function. The session pool is already integrated into crawlers, and it can significantly improve your scraper's performance with just two lines of code. **Example usage:**

```
const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    // ...
})
```

You can configure the pool with many options. See the [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session pool is persisted in the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) by default. If you want to share one pool across all runs, you have to specify [SessionPoolOptions.persistStateKeyValueStoreId](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md#persistStateKeyValueStoreId). **Advanced usage:**

```
const sessionPool = await SessionPool.open({
    maxPoolSize: 25,
    sessionOptions: {
        maxAgeSecs: 10,
        maxUsageCount: 150, // for example when you know that the site blocks after 150 requests.
    },
    persistStateKeyValueStoreId: 'my-key-value-store-for-sessions',
    persistStateKey: 'my-session-pool',
});

// Get a random session from the pool
const session1 = await sessionPool.getSession();
const session2 = await sessionPool.getSession();
const session3 = await sessionPool.getSession();

// Now you can mark each session as either failed or successful

// Marks the session as bad after unsuccessful usage -> it increases the error count (soft retire)
session1.markBad()

// Marks the session as successful.
session2.markGood()

// Retires the session -> the session is removed from the pool
session3.retire()
```

**Default session allocation flow:**

1. Until the `SessionPool` reaches `maxPoolSize`, new sessions are created, provided to the user and added to the pool
2. Blocked/retired sessions stay in the pool but are never provided to the user
3. Once the pool is full (live plus blocked session count reaches `maxPoolSize`), a random session from the pool is provided.
4. 
If a blocked session would be picked, instead all blocked sessions are evicted from the pool and a new session is created and provided ### Hierarchy * EventEmitter * *SessionPool* ## Index[**](#Index) ### Properties * [**config](#config) * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Accessors * [**retiredSessionsCount](#retiredSessionsCount) * [**usableSessionsCount](#usableSessionsCount) ### Methods * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**addSession](#addSession) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getSession](#getSession) * [**getState](#getState) * [**initialize](#initialize) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**off](#off) * [**on](#on) * [**once](#once) * [**persistState](#persistState) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**resetStore](#resetStore) * [**setMaxListeners](#setMaxListeners) * [**teardown](#teardown) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**open](#open) * [**setMaxListeners](#setMaxListeners) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L160)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from EventEmitter.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from EventEmitter.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from EventEmitter.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. 
The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from EventEmitter.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Accessors[**](#Accessors) ### [**](#retiredSessionsCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L224)retiredSessionsCount * **get retiredSessionsCount(): number - Gets count of retired sessions in the pool. *** #### Returns number ### [**](#usableSessionsCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L217)usableSessionsCount * **get usableSessionsCount(): number - Gets count of usable sessions in the pool. *** #### Returns number ## Methods[**](#Methods) ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from EventEmitter.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from EventEmitter.addListener Alias for `emitter.on(eventName, listener)`. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#addSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L264)addSession * ****addSession**(options): Promise\ - Adds a new session to the session pool. The pool automatically creates sessions up to the maximum size of the pool, but this allows you to add more sessions once the max pool size is reached. This also allows you to add session with overridden session options (e.g. 
with specific session id). *** #### Parameters * ##### optionaloptions: [Session](https://crawlee.dev/js/api/core/class/Session.md) | [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) = {} The configuration options for the session being added to the session pool. #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from EventEmitter.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from EventEmitter.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from EventEmitter.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L291)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L296)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L305)getSession * ****getSession**(): Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> * ****getSession**(sessionId): Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> - Gets session. If there is space for new session, it creates and returns new session. 
If the session pool is full, it picks a session from the pool, If the picked session is usable it is returned, otherwise it creates and returns a new one. *** #### Returns Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> ### [**](#getState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L348)getState * ****getState**(): { retiredSessionsCount: number; sessions: [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md)\[]; usableSessionsCount: number } - Returns an object representing the internal state of the `SessionPool` instance. Note that the object's fields can change in future releases. *** #### Returns { retiredSessionsCount: number; sessions: [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md)\[]; usableSessionsCount: number } * ##### retiredSessionsCount: number * ##### sessions: [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md)\[] * ##### usableSessionsCount: number ### [**](#initialize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L232)initialize * ****initialize**(): Promise\ - Starts periodic state persistence and potentially loads SessionPool state from [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). It is called automatically by the [SessionPool.open](https://crawlee.dev/js/api/core/class/SessionPool.md#open) function. *** #### Returns Promise\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from EventEmitter.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. * **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from EventEmitter.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from EventEmitter.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from EventEmitter.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. 
Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from EventEmitter.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L361)persistState * ****persistState**(options): Promise\ - Persists the current state of the `SessionPool` into the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). The state is persisted automatically in regular intervals. *** #### Parameters * ##### optionaloptions: [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Override the persistence options provided in the constructor #### Returns Promise\ ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from EventEmitter.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. 
* ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from EventEmitter.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from EventEmitter.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from EventEmitter.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from EventEmitter.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. 
If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#resetStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L336)resetStore * ****resetStore**(options): Promise\ - #### Parameters * ##### optionaloptions: [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Override the persistence options provided in the constructor #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from EventEmitter.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L388)teardown * ****teardown**(): Promise\ - Removes listener from `persistState` event. 
This function should be called after you are done with using the `SessionPool` instance. *** #### Returns Promise\ ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from EventEmitter.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. ### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from EventEmitter.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from EventEmitter.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. 
``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from EventEmitter.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. *** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from EventEmitter.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. 
console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from EventEmitter.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! 
``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L512)staticopen * ****open**(options, config): Promise<[SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md)> - Opens a SessionPool and returns a promise resolving to an instance of the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that is already initialized. For more details and code examples, see the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class. *** #### Parameters * ##### optionaloptions: [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) #### Returns Promise<[SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md)> ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from EventEmitter.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # SitemapRequestList A list of URLs to crawl parsed from a sitemap. The loading of the sitemap is performed in the background so that crawling can start before the sitemap is fully loaded. ### Implements * [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ## Index[**](#Index) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**fetchNextRequest](#fetchNextRequest) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**isSitemapFullyLoaded](#isSitemapFullyLoaded) * [**length](#length) * [**markRequestHandled](#markRequestHandled) * [**persistState](#persistState) * [**reclaimRequest](#reclaimRequest) * [**teardown](#teardown) * [**open](#open) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L572)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Implementation of IRequestList.\[asyncIterator] Can be used to iterate over the `RequestList` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. 
*** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L551)fetchNextRequest * ****fetchNextRequest**(): Promise\> - Implementation of IRequestList.fetchNextRequest Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. First, the function gets a request previously reclaimed using the [RequestList.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestList.md#reclaimRequest) function, if there is any. Otherwise it gets the next request from sources. The function's `Promise` resolves to `null` if there are no more requests to process. *** #### Returns Promise\> ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L465)handledCount * ****handledCount**(): number - Implementation of IRequestList.handledCount Returns number of handled requests. *** #### Returns number ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L458)isEmpty * ****isEmpty**(): Promise\ - Implementation of IRequestList.isEmpty Resolves to `true` if the next call to [IRequestList.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestList.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the list is empty, there might be some pending requests currently being processed. *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L449)isFinished * ****isFinished**(): Promise\ - Implementation of IRequestList.isFinished Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#isSitemapFullyLoaded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L358)isSitemapFullyLoaded * ****isSitemapFullyLoaded**(): boolean - Indicates whether the background processing of sitemap contents has successfully finished. If this is `false`, the background processing is either still in progress or was aborted. *** #### Returns boolean ### [**](#length)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L442)length * ****length**(): number - Implementation of IRequestList.length Returns the total number of unique requests present in the list. *** #### Returns number ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L607)markRequestHandled * ****markRequestHandled**(request): Promise\ - Implementation of IRequestList.markRequestHandled Marks request as handled after successful processing. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L472)persistState * ****persistState**(): Promise\ - Implementation of IRequestList.persistState Persists the current state of the `IRequestList` into the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). 
The state is persisted automatically in regular intervals, but calling this method manually is useful in cases where you want to have the most current state available after you pause or stop fetching its requests. For example after you pause or abort a crawl. Or just before a server migration. *** #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L584)reclaimRequest * ****reclaimRequest**(request): Promise\ - Implementation of IRequestList.reclaimRequest Reclaims request to the list if its processing failed. The request will become available in the next `this.fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L595)teardown * ****teardown**(): Promise\ - Aborts the internal sitemap loading, stops the processing of the sitemap contents and drops all the pending URLs. Calling `fetchNextRequest()` after this method will always return `null`. *** #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L414)staticopen * ****open**(options): Promise<[SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md)> - Open a sitemap and start processing it. Resolves to a new instance of `SitemapRequestList`, which **might not be fully loaded yet** - i.e. the sitemap might still be loading in the background. Track the loading progress using the `isSitemapFullyLoaded` property. *** #### Parameters * ##### options: [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) #### Returns Promise<[SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md)> --- # Snapshotter Creates snapshots of system resources at given intervals and marks the resource as either overloaded or not during the last interval. Keeps a history of the snapshots. It tracks the following resources: Memory, EventLoop, API and CPU. The class is used by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. When running on the Apify platform, the CPU and memory statistics are provided by the platform, as collected from the running Docker container. When running locally, `Snapshotter` makes its own statistics by querying the OS. CPU becomes overloaded locally when its current use exceeds the `maxUsedCpuRatio` option or when Apify platform marks it as overloaded. Memory becomes overloaded if its current use exceeds the `maxUsedMemoryRatio` option. It's computed using the total memory available to the container when running on the Apify platform and a quarter of total system memory when running locally. Max total memory when running locally may be overridden by using the `CRAWLEE_MEMORY_MBYTES` environment variable. Event loop becomes overloaded if it slows down by more than the `maxBlockedMillis` option. Client becomes overloaded when rate limit errors (429 - Too Many Requests), typically received from the request queue, exceed the set limit within the set interval. 
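To make the sampling API above more concrete, here is a minimal sketch that starts a `Snapshotter`, lets it collect a few snapshots, and then reads the recent memory and event loop history. The import path and the exact shape of the snapshot objects are assumptions; only the `start`, `stop` and `get*Sample` methods documented below are relied on.

```
import { Snapshotter } from '@crawlee/core';

// A minimal sketch: create a snapshotter with default options and sample its history.
const snapshotter = new Snapshotter();

await snapshotter.start();

// Let it capture a few snapshots (the snapshot intervals are configurable via SnapshotterOptions).
await new Promise((resolve) => setTimeout(resolve, 5_000));

// Read only the snapshots collected in the last 3 seconds;
// calling the getters without an argument returns the full snapshot history.
const memory = snapshotter.getMemorySample(3_000);
const eventLoop = snapshotter.getEventLoopSample(3_000);

console.log(`Collected ${memory.length} memory and ${eventLoop.length} event loop snapshots.`);

await snapshotter.stop();
```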
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**client](#client) * [**clientInterval](#clientInterval) * [**clientSnapshotIntervalMillis](#clientSnapshotIntervalMillis) * [**clientSnapshots](#clientSnapshots) * [**config](#config) * [**cpuSnapshots](#cpuSnapshots) * [**eventLoopInterval](#eventLoopInterval) * [**eventLoopSnapshotIntervalMillis](#eventLoopSnapshotIntervalMillis) * [**eventLoopSnapshots](#eventLoopSnapshots) * [**events](#events) * [**lastLoggedCriticalMemoryOverloadAt](#lastLoggedCriticalMemoryOverloadAt) * [**log](#log) * [**maxBlockedMillis](#maxBlockedMillis) * [**maxClientErrors](#maxClientErrors) * [**maxMemoryBytes](#maxMemoryBytes) * [**maxUsedMemoryRatio](#maxUsedMemoryRatio) * [**memorySnapshots](#memorySnapshots) * [**snapshotHistoryMillis](#snapshotHistoryMillis) ### Methods * [**getClientSample](#getClientSample) * [**getCpuSample](#getCpuSample) * [**getEventLoopSample](#getEventLoopSample) * [**getMemorySample](#getMemorySample) * [**start](#start) * [**stop](#stop) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L144)constructor * ****new Snapshotter**(options): [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) - #### Parameters * ##### optionaloptions: [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) = {} All `Snapshotter` configuration options. #### Returns [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L120)client **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#clientInterval)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L137)clientInterval **clientInterval: BetterIntervalID = ... ### [**](#clientSnapshotIntervalMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L124)clientSnapshotIntervalMillis **clientSnapshotIntervalMillis: number ### [**](#clientSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L134)clientSnapshots **clientSnapshots: ClientSnapshot\[] = \[] ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L121)config **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#cpuSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L131)cpuSnapshots **cpuSnapshots: CpuSnapshot\[] = \[] ### [**](#eventLoopInterval)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L136)eventLoopInterval **eventLoopInterval: BetterIntervalID = ... 
### [**](#eventLoopSnapshotIntervalMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L123)eventLoopSnapshotIntervalMillis **eventLoopSnapshotIntervalMillis: number ### [**](#eventLoopSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L132)eventLoopSnapshots **eventLoopSnapshots: EventLoopSnapshot\[] = \[] ### [**](#events)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L122)events **events: [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#lastLoggedCriticalMemoryOverloadAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L139)lastLoggedCriticalMemoryOverloadAt **lastLoggedCriticalMemoryOverloadAt: null | Date = null ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L119)log **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#maxBlockedMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L126)maxBlockedMillis **maxBlockedMillis: number ### [**](#maxClientErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L128)maxClientErrors **maxClientErrors: number ### [**](#maxMemoryBytes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L129)maxMemoryBytes **maxMemoryBytes: number ### [**](#maxUsedMemoryRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L127)maxUsedMemoryRatio **maxUsedMemoryRatio: number ### [**](#memorySnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L133)memorySnapshots **memorySnapshots: MemorySnapshot\[] = \[] ### [**](#snapshotHistoryMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L125)snapshotHistoryMillis **snapshotHistoryMillis: number ## Methods[**](#Methods) ### [**](#getClientSample)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L268)getClientSample * ****getClientSample**(sampleDurationMillis): ClientSnapshot\[] - Returns a sample of latest Client snapshots, with the size of the sample defined by the sampleDurationMillis parameter. If omitted, it returns a full snapshot history. *** #### Parameters * ##### optionalsampleDurationMillis: number #### Returns ClientSnapshot\[] ### [**](#getCpuSample)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L260)getCpuSample * ****getCpuSample**(sampleDurationMillis): CpuSnapshot\[] - Returns a sample of latest CPU snapshots, with the size of the sample defined by the sampleDurationMillis parameter. If omitted, it returns a full snapshot history. *** #### Parameters * ##### optionalsampleDurationMillis: number #### Returns CpuSnapshot\[] ### [**](#getEventLoopSample)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L252)getEventLoopSample * ****getEventLoopSample**(sampleDurationMillis): EventLoopSnapshot\[] - Returns a sample of latest event loop snapshots, with the size of the sample defined by the sampleDurationMillis parameter. If omitted, it returns a full snapshot history. 
*** #### Parameters * ##### optionalsampleDurationMillis: number #### Returns EventLoopSnapshot\[] ### [**](#getMemorySample)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L244)getMemorySample * ****getMemorySample**(sampleDurationMillis): MemorySnapshot\[] - Returns a sample of latest memory snapshots, with the size of the sample defined by the sampleDurationMillis parameter. If omitted, it returns a full snapshot history. *** #### Parameters * ##### optionalsampleDurationMillis: number #### Returns MemorySnapshot\[] ### [**](#start)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L192)start * ****start**(): Promise\ - Starts capturing snapshots at configured intervals. *** #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L229)stop * ****stop**(): Promise\ - Stops all resource capturing. *** #### Returns Promise\ --- # Statistics The statistics class provides an interface to collecting and logging run statistics for requests. All statistic information is saved on key value store under the key `SDK_CRAWLER_STATISTICS_*`, persists between migrations and abort/resurrect ## Index[**](#Index) ### Properties * [**errorTracker](#errorTracker) * [**errorTrackerRetry](#errorTrackerRetry) * [**id](#id) * [**requestRetryHistogram](#requestRetryHistogram) * [**state](#state) ### Methods * [**calculate](#calculate) * [**persistState](#persistState) * [**registerStatusCode](#registerStatusCode) * [**reset](#reset) * [**resetStore](#resetStore) * [**startCapturing](#startCapturing) * [**stopCapturing](#stopCapturing) * [**toJSON](#toJSON) ## Properties[**](#Properties) ### [**](#errorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L65)errorTracker **errorTracker: [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) An error tracker for final retry errors. ### [**](#errorTrackerRetry)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L70)errorTrackerRetry **errorTrackerRetry: [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) An error tracker for retry errors prior to the final retry. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L75)readonlyid **id: number = ... Statistic instance id. ### [**](#requestRetryHistogram)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L85)readonlyrequestRetryHistogram **requestRetryHistogram: number\[] = \[] Contains the current retries histogram. 
Index 0 means 0 retries, index 2, 2 retries, and so on ### [**](#state)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L80)state **state: [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) Current statistic state used for doing calculations on [Statistics.calculate](https://crawlee.dev/js/api/core/class/Statistics.md#calculate) calls ## Methods[**](#Methods) ### [**](#calculate)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L253)calculate * ****calculate**(): { crawlerRuntimeMillis: number; requestAvgFailedDurationMillis: number; requestAvgFinishedDurationMillis: number; requestsFailedPerMinute: number; requestsFinishedPerMinute: number; requestsTotal: number; requestTotalDurationMillis: number } - Calculate the current statistics *** #### Returns { crawlerRuntimeMillis: number; requestAvgFailedDurationMillis: number; requestAvgFinishedDurationMillis: number; requestsFailedPerMinute: number; requestsFinishedPerMinute: number; requestsTotal: number; requestTotalDurationMillis: number } * ##### crawlerRuntimeMillis: number * ##### requestAvgFailedDurationMillis: number * ##### requestAvgFinishedDurationMillis: number * ##### requestsFailedPerMinute: number * ##### requestsFinishedPerMinute: number * ##### requestsTotal: number * ##### requestTotalDurationMillis: number ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L320)persistState * ****persistState**(options): Promise\ - Persist internal state to the key value store *** #### Parameters * ##### optionaloptions: [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Override the persistence options provided in the constructor #### Returns Promise\ ### [**](#registerStatusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L198)registerStatusCode * ****registerStatusCode**(code): void - Increments the status code counter. 
*** #### Parameters * ##### code: number #### Returns void ### [**](#reset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L150)reset * ****reset**(): void - Sets the current statistic instance to pristine values *** #### Returns void ### [**](#resetStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L183)resetStore * ****resetStore**(options): Promise\ - #### Parameters * ##### optionaloptions: [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Override the persistence options provided in the constructor #### Returns Promise\ ### [**](#startCapturing)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L279)startCapturing * ****startCapturing**(): Promise\ - Initializes the key-value store for persisting the statistics and displays the current state at predefined intervals *** #### Returns Promise\ ### [**](#stopCapturing)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L302)stopCapturing * ****stopCapturing**(): Promise\ - Stops logging, removes event listeners, and then persists the state *** #### Returns Promise\ ### [**](#toJSON)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L404)toJSON * ****toJSON**(): [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) - Makes this class serializable when called with `JSON.stringify(statsInstance)` directly or through `keyValueStore.setValue('KEY', statsInstance)` *** #### Returns [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) --- # SystemStatus Provides a simple interface for reading the system status from a [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) instance. It only exposes two functions [SystemStatus.getCurrentStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getCurrentStatus) and [SystemStatus.getHistoricalStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getHistoricalStatus). The system status is calculated using a weighted average of overloaded messages in the snapshots, with the weights being the time intervals between the snapshots. Each resource is calculated separately and the system is overloaded whenever at least one resource is overloaded. The class is used by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. [SystemStatus.getCurrentStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getCurrentStatus) returns an object that represents the current status of the system. The length of the current timeframe in seconds is configurable by the `currentHistorySecs` option and represents the max age of snapshots to be considered for the calculation. [SystemStatus.getHistoricalStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getHistoricalStatus) returns an object that represents the long-term status of the system. It considers the full snapshot history available in the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) instance.
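You rarely need to construct this class yourself, because [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) wires it up internally, but it can be used standalone. The following is a minimal sketch that pairs a `SystemStatus` with a [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md); the `snapshotter` and `currentHistorySecs` fields of [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) are assumed here, and the 5-second window and the waiting period are arbitrary illustration values.

```
import { Snapshotter, SystemStatus } from 'crawlee';

// Start collecting CPU, memory, event loop and client snapshots.
const snapshotter = new Snapshotter();
await snapshotter.start();

// Evaluate those snapshots; `snapshotter` and `currentHistorySecs`
// are assumed SystemStatusOptions fields in this sketch.
const systemStatus = new SystemStatus({ snapshotter, currentHistorySecs: 5 });

// Let a few snapshots accumulate before reading the status.
await new Promise((resolve) => setTimeout(resolve, 10_000));

const current = systemStatus.getCurrentStatus();
const historical = systemStatus.getHistoricalStatus();
console.log('Idle in the last 5 seconds:', current.isSystemIdle);
console.log('Idle over the full snapshot history:', historical.isSystemIdle);

await snapshotter.stop();
```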
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**getCurrentStatus](#getCurrentStatus) * [**getHistoricalStatus](#getHistoricalStatus) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L128)constructor * ****new SystemStatus**(options): [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) - #### Parameters * ##### options: [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) = {} #### Returns [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ## Methods[**](#Methods) ### [**](#getCurrentStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L176)getCurrentStatus * ****getCurrentStatus**(): [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) - Returns an [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) object with the following structure: ``` { isSystemIdle: Boolean, memInfo: Object, eventLoopInfo: Object, cpuInfo: Object } ``` Where the `isSystemIdle` property is set to `false` if the system has been overloaded in the last `options.currentHistorySecs` seconds, and `true` otherwise. *** #### Returns [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#getHistoricalStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L196)getHistoricalStatus * ****getHistoricalStatus**(): [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) - Returns an [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) object with the following structure: ``` { isSystemIdle: Boolean, memInfo: Object, eventLoopInfo: Object, cpuInfo: Object } ``` Where the `isSystemIdle` property is set to `false` if the system has been overloaded in the full history of the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) (which is configurable in the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md)) and `true` otherwise. *** #### Returns [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) --- # EnqueueStrategy The different enqueueing strategies available. Depending on the strategy you select, we will only check certain parts of the URLs found. Here is a diagram of each URL part and their name: ``` Protocol Domain ┌────┐ ┌─────────┐ https://example.crawlee.dev/... │ └─────────────────┤ │ Hostname │ │ │ └─────────────────────────┘ Origin ``` * The `Protocol` is usually `http` or `https` * The `Domain` represents the path without any possible subdomains to a website. For example, `crawlee.dev` is the domain of `https://example.crawlee.dev/` * The `Hostname` is the full path to a website, including any subdomains. For example, `example.crawlee.dev` is the hostname of `https://example.crawlee.dev/` * The `Origin` is the combination of the `Protocol` and `Hostname`. 
For example, `https://example.crawlee.dev` is the origin of `https://example.crawlee.dev/` ## Index[**](#Index) ### Enumeration Members * [**All](#All) * [**SameDomain](#SameDomain) * [**SameHostname](#SameHostname) * [**SameOrigin](#SameOrigin) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#All)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L220)All **All: all Matches any URLs found ### [**](#SameDomain)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L238)SameDomain **SameDomain: same-domain Matches any URLs that have the same domain as the base URL. For example, `https://wow.an.example.com` and `https://example.com` will both be matched for a base url of `https://example.com`. > This strategy will match both `http` and `https` protocols regardless of the base URL protocol. ### [**](#SameHostname)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L229)SameHostname **SameHostname: same-hostname Matches any URLs that have the same hostname. For example, `https://wow.example.com/hello` will be matched for a base url of `https://wow.example.com/`, but `https://example.com/hello` will not be matched. > This strategy will match both `http` and `https` protocols regardless of the base URL protocol. ### [**](#SameOrigin)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L247)SameOrigin **SameOrigin: same-origin Matches any URLs that have the same hostname and protocol. For example, `https://wow.example.com/hello` will be matched for a base url of `https://wow.example.com/`, but `http://wow.example.com/hello` will not be matched. > This strategy will ensure the protocol of the base URL is the same as the protocol of the URL to be enqueued. 
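To make the strategies above concrete, here is a minimal sketch of picking one in a crawler's `enqueueLinks` call; the crawler type and the start URL are placeholders, and the string literals (`all`, `same-domain`, `same-hostname`, `same-origin`) accepted by the `enqueueLinks` options can be used in place of the enum members.

```
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Follow links on the same hostname only, e.g. other pages on
        // https://example.crawlee.dev/*, but not https://crawlee.dev/*.
        await enqueueLinks({
            strategy: EnqueueStrategy.SameHostname,
        });
    },
});

await crawler.run(['https://example.crawlee.dev/']);
```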
--- # constEventType ## Index[**](#Index) ### Enumeration Members * [**ABORTING](#ABORTING) * [**EXIT](#EXIT) * [**MIGRATING](#MIGRATING) * [**PERSIST\_STATE](#PERSIST_STATE) * [**SYSTEM\_INFO](#SYSTEM_INFO) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#ABORTING)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L13)ABORTING **ABORTING: aborting ### [**](#EXIT)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L14)EXIT **EXIT: exit ### [**](#MIGRATING)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L12)MIGRATING **MIGRATING: migrating ### [**](#PERSIST_STATE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L10)PERSIST\_STATE **PERSIST\_STATE: persistState ### [**](#SYSTEM_INFO)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L11)SYSTEM\_INFO **SYSTEM\_INFO: systemInfo --- # externalLogLevel ## Index[**](#Index) ### Enumeration Members * [**DEBUG](#DEBUG) * [**ERROR](#ERROR) * [**INFO](#INFO) * [**OFF](#OFF) * [**PERF](#PERF) * [**SOFT\_FAIL](#SOFT_FAIL) * [**WARNING](#WARNING) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#DEBUG)externalDEBUG **DEBUG: 5 ### [**](#ERROR)externalERROR **ERROR: 1 ### [**](#INFO)externalINFO **INFO: 4 ### [**](#OFF)externalOFF **OFF: 0 ### [**](#PERF)externalPERF **PERF: 6 ### [**](#SOFT_FAIL)externalSOFT\_FAIL **SOFT\_FAIL: 2 ### [**](#WARNING)externalWARNING **WARNING: 3 --- # RequestState ## Index[**](#Index) ### Enumeration Members * [**AFTER\_NAV](#AFTER_NAV) * [**BEFORE\_NAV](#BEFORE_NAV) * [**DONE](#DONE) * [**ERROR](#ERROR) * [**ERROR\_HANDLER](#ERROR_HANDLER) * [**REQUEST\_HANDLER](#REQUEST_HANDLER) * [**SKIPPED](#SKIPPED) * [**UNPROCESSED](#UNPROCESSED) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#AFTER_NAV)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L45)AFTER\_NAV **AFTER\_NAV: 2 ### [**](#BEFORE_NAV)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L44)BEFORE\_NAV **BEFORE\_NAV: 1 ### [**](#DONE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L47)DONE **DONE: 4 ### [**](#ERROR)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L49)ERROR **ERROR: 6 ### [**](#ERROR_HANDLER)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L48)ERROR\_HANDLER **ERROR\_HANDLER: 5 ### [**](#REQUEST_HANDLER)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L46)REQUEST\_HANDLER **REQUEST\_HANDLER: 3 ### [**](#SKIPPED)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L50)SKIPPED **SKIPPED: 7 ### 
[**](#UNPROCESSED)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L43)UNPROCESSED **UNPROCESSED: 0 --- # checkStorageAccess Invoke a storage access checker function defined using [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) higher up in the call stack. ### Callable * ****checkStorageAccess**(): undefined | void *** * #### Returns undefined | void --- # enqueueLinks ### Callable * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> *** * This function enqueues the urls provided to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) provided. If you want to automatically find and enqueue links, you should use the context-aware `enqueueLinks` function provided on the crawler contexts. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. **Example usage** ``` await enqueueLinks({ urls: aListOfFoundUrls, requestQueue, selector: 'a.product-detail', globs: [ 'https://www.example.com/handbags/*', 'https://www.example.com/purses/*' ], }); ``` *** #### Parameters * ##### options: { baseUrl?: string; exclude?: readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[]; forefront?: boolean; globs?: readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[]; label?: string; limit?: number; onSkippedRequest?: [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback); pseudoUrls?: readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[]; regexps?: readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[]; robotsTxtFile?: Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed>; selector?: string; skipNavigation?: boolean; strategy?: [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin; transformRequestFunction?: [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md); urls: readonly string\[]; userData?: Dictionary; waitForAllRequestsToBeAdded?: boolean } & { requestQueue: { addRequestsBatched: (requests, options) => Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> } } All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. 
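As a follow-up to the example above, here is a hedged sketch that also uses the `exclude` and `transformRequestFunction` options from the parameter list; the URLs, glob patterns and the `DETAIL` label are placeholders.

```
import { RequestQueue, enqueueLinks } from 'crawlee';

const requestQueue = await RequestQueue.open();

const foundUrls = [
    'https://www.example.com/handbags/item-1',
    'https://www.example.com/sale/expired-offer',
];

await enqueueLinks({
    urls: foundUrls,
    requestQueue,
    globs: ['https://www.example.com/**'],
    // Drop anything under /sale/ even though it matches the glob above.
    exclude: ['https://www.example.com/sale/**'],
    // Adjust every request before it is enqueued.
    transformRequestFunction: (request) => {
        request.label = 'DETAIL';
        request.userData = { ...request.userData, enqueuedAt: Date.now() };
        return request;
    },
});
```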
--- # filterRequestsByPatterns ### Callable * ****filterRequestsByPatterns**(requests, patterns, onSkippedUrl): [Request](https://crawlee.dev/js/api/core/class/Request.md)\[] *** * #### Parameters * ##### requests: [Request](https://crawlee.dev/js/api/core/class/Request.md)\\[] * ##### optionalpatterns: [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject)\[] * ##### optionalonSkippedUrl: (url) => void #### Returns [Request](https://crawlee.dev/js/api/core/class/Request.md)\[] --- # processHttpRequestOptions ### Callable * ****processHttpRequestOptions**\(\_\_namedParameters): [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ *** * Converts [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) to a [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md). *** #### Parameters * ##### \_\_namedParameters: [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md)\ #### Returns [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ --- # purgeDefaultStorages ### Callable * ****purgeDefaultStorages**(options): Promise\ * ****purgeDefaultStorages**(config, client): Promise\ *** * Cleans up the local storage folder (defaults to `./storage`) created when running code locally. Purging will remove all the files in all storages except for INPUT.json in the default KV store. Purging of storages is happening automatically when we run our crawler (or when we open some storage explicitly, e.g. via `RequestList.open()`). We can disable that via `purgeOnStart` [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) option or by setting `CRAWLEE_PURGE_ON_START` environment variable to `0` or `false`. This is a shortcut for running (optional) `purge` method on the StorageClient interface, in other words it will call the `purge` method of the underlying storage implementation we are currently using. You can make sure the storage is purged only once for a given execution context if you set `onlyPurgeOnce` to `true` in the `options` object *** #### Parameters * ##### optionaloptions: PurgeDefaultStorageOptions #### Returns Promise\ --- # tryAbsoluteURL ### Callable * ****tryAbsoluteURL**(href, baseUrl): string | undefined *** * Helper function used to validate URLs used when extracting URLs from a page *** #### Parameters * ##### href: string * ##### baseUrl: string #### Returns string | undefined --- # useState ### Callable * ****useState**\(name, defaultValue, options): Promise\ *** * Easily create and manage state values. All state values are automatically persisted. Values can be modified by simply using the assignment operator. *** #### Parameters * ##### optionalname: string The name of the store to use. * ##### defaultValue: State = ... If the store does not yet have a value in it, the value will be initialized with the `defaultValue` you provide. * ##### optionaloptions: [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) An optional object parameter where a custom `keyValueStoreName` and `config` can be passed in. #### Returns Promise\ --- # withCheckedStorageAccess Define a storage access checker function that should be used by calls to [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) in the callbacks. 
### Callable * ****withCheckedStorageAccess**\(checkFunction, callback): Promise\ *** * #### Parameters * ##### checkFunction: () => void The check function that should be invoked by [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) calls * ##### callback: () => Awaitable\ The code that should be invoked with the `checkFunction` setting #### Returns Promise\ --- # AddRequestsBatchedOptions ### Hierarchy * [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) * *AddRequestsBatchedOptions* * [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ## Index[**](#Index) ### Properties * [**batchSize](#batchSize) * [**forefront](#forefront) * [**waitBetweenBatchesMillis](#waitBetweenBatchesMillis) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#batchSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L975)optionalbatchSize **batchSize? : number = 1000 ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from RequestQueueOperationOptions.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#waitBetweenBatchesMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L980)optionalwaitBetweenBatchesMillis **waitBetweenBatchesMillis? : number = 1000 ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L970)optionalwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean = false Whether to wait for all the provided requests to be added, instead of waiting just for the initial batch of up to `batchSize`. --- # AddRequestsBatchedResult ### Hierarchy * *AddRequestsBatchedResult* * [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ## Index[**](#Index) ### Properties * [**addedRequests](#addedRequests) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#addedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L984)addedRequests **addedRequests: [ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[] ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L1001)waitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded: Promise<[ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[]> A promise which will resolve with the rest of the requests that were added to the queue. 
Alternatively, we can set [`waitForAllRequestsToBeAdded`](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md#waitForAllRequestsToBeAdded) to `true` in the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) options. **Example:** ``` // Assuming `requests` is a list of requests. const result = await crawler.addRequests(requests); // If we want to wait for the rest of the requests to be added to the queue: await result.waitForAllRequestsToBeAdded; ``` --- # AutoscaledPoolOptions ## Index[**](#Index) ### Properties * [**autoscaleIntervalSecs](#autoscaleIntervalSecs) * [**desiredConcurrency](#desiredConcurrency) * [**desiredConcurrencyRatio](#desiredConcurrencyRatio) * [**isFinishedFunction](#isFinishedFunction) * [**isTaskReadyFunction](#isTaskReadyFunction) * [**log](#log) * [**loggingIntervalSecs](#loggingIntervalSecs) * [**maxConcurrency](#maxConcurrency) * [**maxTasksPerMinute](#maxTasksPerMinute) * [**maybeRunIntervalSecs](#maybeRunIntervalSecs) * [**minConcurrency](#minConcurrency) * [**runTaskFunction](#runTaskFunction) * [**scaleDownStepRatio](#scaleDownStepRatio) * [**scaleUpStepRatio](#scaleUpStepRatio) * [**snapshotterOptions](#snapshotterOptions) * [**systemStatusOptions](#systemStatusOptions) * [**taskTimeoutSecs](#taskTimeoutSecs) ## Properties[**](#Properties) ### [**](#autoscaleIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L102)optionalautoscaleIntervalSecs **autoscaleIntervalSecs? : number = 10 Defines in seconds how often the pool should attempt to adjust the desired concurrency based on the latest system status. Setting it lower than 1 might have a severe impact on performance. We suggest using a value from 5 to 20. ### [**](#desiredConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L60)optionaldesiredConcurrency **desiredConcurrency? : number The desired number of tasks that should be running parallel on the start of the pool, if there is a large enough supply of them. By default, it is `minConcurrency`. ### [**](#desiredConcurrencyRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L66)optionaldesiredConcurrencyRatio **desiredConcurrencyRatio? : number = 0.90 Minimum level of desired concurrency to reach before more scaling up is allowed. ### [**](#isFinishedFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L38)optionalisFinishedFunction **isFinishedFunction? : () => Promise\ A function that is called only when there are no tasks to be processed. If it resolves to `true` then the pool's run finishes. Being called only when there are no tasks being processed means that as long as `isTaskReadyFunction()` keeps resolving to `true`, `isFinishedFunction()` will never be called. To abort a run, use the [AutoscaledPool.abort](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort) method. *** #### Type declaration * * **(): Promise\ - #### Returns Promise\ ### [**](#isTaskReadyFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L29)optionalisTaskReadyFunction **isTaskReadyFunction? : () => Promise\ A function that indicates whether `runTaskFunction` should be called. 
This function is called every time there is free capacity for a new task and it should indicate whether it should start a new task or not by resolving to either `true` or `false`. Besides its obvious use, it is also useful for task throttling to save resources. *** #### Type declaration * * **(): Promise\ - #### Returns Promise\ ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L129)optionallog **log? : [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#loggingIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L94)optionalloggingIntervalSecs **loggingIntervalSecs? : null | number = null | number Specifies a period in which the instance logs its state, in seconds. Set to `null` to disable periodic logging. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L53)optionalmaxConcurrency **maxConcurrency? : number = 200 The maximum number of tasks running in parallel. ### [**](#maxTasksPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L127)optionalmaxTasksPerMinute **maxTasksPerMinute? : number The maximum number of tasks per minute the pool can run. By default, this is set to `Infinity`, but you can pass any positive, non-zero integer. ### [**](#maybeRunIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L87)optionalmaybeRunIntervalSecs **maybeRunIntervalSecs? : number = 0.5 Indicates how often the pool should call the `runTaskFunction()` to start a new task, in seconds. This has no effect on starting new tasks immediately after a task completes. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L47)optionalminConcurrency **minConcurrency? : number = 1 The minimum number of tasks running in parallel. *WARNING:* If you set this value too high with respect to the available system memory and CPU, your code might run extremely slow or crash. If you're not sure, just keep the default value and the concurrency will scale up automatically. ### [**](#runTaskFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L21)optionalrunTaskFunction **runTaskFunction? : () => Promise\ A function that performs an asynchronous resource-intensive task. The function must either be labeled `async` or return a promise. *** #### Type declaration * * **(): Promise\ - #### Returns Promise\ ### [**](#scaleDownStepRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L80)optionalscaleDownStepRatio **scaleDownStepRatio? : number = 0.05 Defines the amount of desired concurrency to be subtracted with each scaling down. The minimum scaling step is one. ### [**](#scaleUpStepRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L73)optionalscaleUpStepRatio **scaleUpStepRatio? : number = 0.05 Defines the fractional amount of desired concurrency to be added with each scaling up. The minimum scaling step is one. ### [**](#snapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L114)optionalsnapshotterOptions **snapshotterOptions? 
: [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) Options to be passed down to the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) constructor. This is useful for fine-tuning the snapshot intervals and history. ### [**](#systemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L121)optionalsystemStatusOptions **systemStatusOptions? : [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) Options to be passed down to the [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) constructor. This is useful for fine-tuning the system status reports. If a custom snapshotter is set in the options, it will be used by the pool. ### [**](#taskTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L108)optionaltaskTimeoutSecs **taskTimeoutSecs? : number = 0 Timeout in which the `runTaskFunction` needs to finish, given in seconds. --- # BaseHttpClient Interface for user-defined HTTP clients to be used for plain HTTP crawling and for sending additional requests during a crawl. ### Implemented by * [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ## Index[**](#Index) ### Methods * [**sendRequest](#sendRequest) * [**stream](#stream) ## Methods[**](#Methods) ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L183)sendRequest * ****sendRequest**\(request): Promise<[HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md)\> - Perform an HTTP Request and return the complete response. *** #### Parameters * ##### request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ #### Returns Promise<[HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md)\> ### [**](#stream)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L190)stream * ****stream**(request, onRedirect): Promise<[StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md)> - Perform an HTTP Request and return after the response headers are received. The body may be read from a stream contained in the response. *** #### Parameters * ##### request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ * ##### optionalonRedirect: [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) #### Returns Promise<[StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md)> --- # BaseHttpResponseData HTTP response data, without a body, as returned by [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) methods. 
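Returning briefly to the [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) described above: the following is a minimal sketch of how the three user functions typically fit together when the pool is run directly. The URL list, the `running` counter and the use of the global `fetch` are purely illustrative.

```
import { AutoscaledPool } from 'crawlee';

const urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'];
let running = 0;

const pool = new AutoscaledPool({
    minConcurrency: 1,
    maxConcurrency: 10,
    // May a new task be started right now?
    isTaskReadyFunction: async () => urls.length > 0,
    // Called only when no task is being processed; `true` finishes the run.
    isFinishedFunction: async () => urls.length === 0 && running === 0,
    // The resource-intensive work itself.
    runTaskFunction: async () => {
        const url = urls.shift();
        if (!url) return;
        running += 1;
        try {
            const response = await fetch(url);
            console.log(`${url} -> ${response.status}`);
        } finally {
            running -= 1;
        }
    },
});

await pool.run();
```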
## Index[**](#Index) ### Properties * [**complete](#complete) * [**headers](#headers) * [**ip](#ip) * [**redirectUrls](#redirectUrls) * [**statusCode](#statusCode) * [**statusMessage](#statusMessage) * [**trailers](#trailers) * [**url](#url) ## Properties[**](#Properties) ### [**](#complete)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L141)complete **complete: boolean ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L138)headers **headers: SimpleHeaders ### [**](#ip)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L134)optionalip **ip? : string ### [**](#redirectUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L131)redirectUrls **redirectUrls: URL\[] ### [**](#statusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L135)statusCode **statusCode: number ### [**](#statusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L136)optionalstatusMessage **statusMessage? : string ### [**](#trailers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L139)trailers **trailers: SimpleHeaders ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L132)url **url: string --- # ClientInfo ## Index[**](#Index) ### Properties * [**actualRatio](#actualRatio) * [**isOverloaded](#isOverloaded) * [**limitRatio](#limitRatio) ## Properties[**](#Properties) ### [**](#actualRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L82)actualRatio **actualRatio: number ### [**](#isOverloaded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L80)isOverloaded **isOverloaded: boolean ### [**](#limitRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L81)limitRatio **limitRatio: number --- # ConfigurationOptions ## Index[**](#Index) ### Properties * [**availableMemoryRatio](#availableMemoryRatio) * [**chromeExecutablePath](#chromeExecutablePath) * [**containerized](#containerized) * [**defaultBrowserPath](#defaultBrowserPath) * [**defaultDatasetId](#defaultDatasetId) * [**defaultKeyValueStoreId](#defaultKeyValueStoreId) * [**defaultRequestQueueId](#defaultRequestQueueId) * [**disableBrowserSandbox](#disableBrowserSandbox) * [**eventManager](#eventManager) * [**headless](#headless) * [**inputKey](#inputKey) * [**logLevel](#logLevel) * [**maxUsedCpuRatio](#maxUsedCpuRatio) * [**memoryMbytes](#memoryMbytes) * [**persistStateIntervalMillis](#persistStateIntervalMillis) * [**persistStorage](#persistStorage) * [**purgeOnStart](#purgeOnStart) * [**storageClient](#storageClient) * [**storageClientOptions](#storageClientOptions) * [**systemInfoIntervalMillis](#systemInfoIntervalMillis) * [**systemInfoV2](#systemInfoV2) * [**xvfb](#xvfb) ## Properties[**](#Properties) ### [**](#availableMemoryRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L81)optionalavailableMemoryRatio **availableMemoryRatio? : number = 0.25 Sets the ratio, defining the amount of system memory that could be used by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md). 
When the memory usage is more than the provided ratio, the memory is considered overloaded. Alternative to `CRAWLEE_AVAILABLE_MEMORY_RATIO` environment variable. ### [**](#chromeExecutablePath)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L135)optionalchromeExecutablePath **chromeExecutablePath? : string Defines a path to Chrome executable. Alternative to `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable. ### [**](#containerized)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L178)optionalcontainerized **containerized? : boolean Used in place of `isContainerized()` when collecting system metrics. Alternative to `CRAWLEE_CONTAINERIZED` environment variable. ### [**](#defaultBrowserPath)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L142)optionaldefaultBrowserPath **defaultBrowserPath? : string Defines a path to default browser executable. Alternative to `CRAWLEE_DEFAULT_BROWSER_PATH` environment variable. ### [**](#defaultDatasetId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L41)optionaldefaultDatasetId **defaultDatasetId? : string = ‘default’ Default dataset id. Alternative to `CRAWLEE_DEFAULT_DATASET_ID` environment variable. ### [**](#defaultKeyValueStoreId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L57)optionaldefaultKeyValueStoreId **defaultKeyValueStoreId? : string = ‘default’ Default key-value store id. Alternative to `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` environment variable. ### [**](#defaultRequestQueueId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L65)optionaldefaultRequestQueueId **defaultRequestQueueId? : string = ‘default’ Default request queue id. Alternative to `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` environment variable. ### [**](#disableBrowserSandbox)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L149)optionaldisableBrowserSandbox **disableBrowserSandbox? : boolean Defines whether to disable browser sandbox by adding `--no-sandbox` flag to `launchOptions`. Alternative to `CRAWLEE_DISABLE_BROWSER_SANDBOX` environment variable. ### [**](#eventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L27)optionaleventManager **eventManager? : [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) = [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) Defines the Event Manager to be used. ### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L120)optionalheadless **headless? : boolean = true Defines whether web browsers launched by Crawlee will run in the headless mode. Alternative to `CRAWLEE_HEADLESS` environment variable. ### [**](#inputKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L112)optionalinputKey **inputKey? : string = ‘INPUT’ Defines the default input key, i.e. the key that is used to get the crawler input value from the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) associated with the current crawler run. Alternative to `CRAWLEE_INPUT_KEY` environment variable. ### [**](#logLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L157)optionallogLevel **logLevel? 
: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) Sets the log level to the given value. Alternative to `CRAWLEE_LOG_LEVEL` environment variable. ### [**](#maxUsedCpuRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L72)optionalmaxUsedCpuRatio **maxUsedCpuRatio? : number = 0.95 Sets the ratio, defining the maximum CPU usage. When the CPU usage is higher than the provided ratio, the CPU is considered overloaded. ### [**](#memoryMbytes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L89)optionalmemoryMbytes **memoryMbytes? : number Sets the amount of system memory in megabytes to be used by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md). By default, the maximum memory is set to one quarter of total system memory. Alternative to `CRAWLEE_MEMORY_MBYTES` environment variable. ### [**](#persistStateIntervalMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L97)optionalpersistStateIntervalMillis **persistStateIntervalMillis? : number = 60\_000 Defines the interval of emitting the `persistState` event. Alternative to `CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS` environment variable. ### [**](#persistStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L164)optionalpersistStorage **persistStorage? : boolean Defines whether the storage client used should persist the data it stores. Alternative to `CRAWLEE_PERSIST_STORAGE` environment variable. ### [**](#purgeOnStart)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L49)optionalpurgeOnStart **purgeOnStart? : boolean = true Defines whether to purge the default storage folders before starting the crawler run. Alternative to `CRAWLEE_PURGE_ON_START` environment variable. ### [**](#storageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L21)optionalstorageClient **storageClient? : [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) = [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) Defines storage client to be used. ### [**](#storageClientOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L33)optionalstorageClientOptions **storageClientOptions? : Dictionary Could be used to adjust the storage client behavior e.g. [MemoryStorageOptions](https://crawlee.dev/js/api/memory-storage/interface/MemoryStorageOptions.md) could be used to adjust the [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) behavior. ### [**](#systemInfoIntervalMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L103)optionalsystemInfoIntervalMillis **systemInfoIntervalMillis? : number = 1\_000 Defines the interval of emitting the `systemInfo` event. ### [**](#systemInfoV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L171)optionalsystemInfoV2 **systemInfoV2?
: boolean Defines whether to use the systemInfoV2 metric collection experiment. Alternative to `CRAWLEE_SYSTEM_INFO_V2` environment variable. ### [**](#xvfb)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L128)optionalxvfb **xvfb? : boolean = false Defines whether to run X virtual framebuffer on the web browsers launched by Crawlee. Alternative to `CRAWLEE_XVFB` environment variable. --- # Cookie ## Index[**](#Index) ### Properties * [**domain](#domain) * [**expires](#expires) * [**httpOnly](#httpOnly) * [**name](#name) * [**path](#path) * [**priority](#priority) * [**sameParty](#sameParty) * [**sameSite](#sameSite) * [**secure](#secure) * [**sourcePort](#sourcePort) * [**sourceScheme](#sourceScheme) * [**url](#url) * [**value](#value) ## Properties[**](#Properties) ### [**](#domain)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L20)optionaldomain **domain? : string Cookie domain. ### [**](#expires)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L40)optionalexpires **expires? : number Cookie expiration date, session cookie if not set ### [**](#httpOnly)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L32)optionalhttpOnly **httpOnly? : boolean True if cookie is http-only. ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L7)name **name: string Cookie name. ### [**](#path)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L24)optionalpath **path? : string Cookie path. ### [**](#priority)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L44)optionalpriority **priority? : Low | Medium | High Cookie Priority. ### [**](#sameParty)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L48)optionalsameParty **sameParty? : boolean True if cookie is SameParty. ### [**](#sameSite)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L36)optionalsameSite **sameSite? : Strict | Lax | None Cookie SameSite type. ### [**](#secure)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L28)optionalsecure **secure? : boolean True if cookie is secure. ### [**](#sourcePort)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L58)optionalsourcePort **sourcePort? : number Cookie source port. Valid values are `-1` or `1-65535`, `-1` indicates an unspecified port. An unspecified port value allows protocol clients to emulate legacy cookie scope for the port. This is a temporary ability and it will be removed in the future. ### [**](#sourceScheme)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L52)optionalsourceScheme **sourceScheme? : Unset | NonSecure | Secure Cookie source scheme type. ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L16)optionalurl **url? : string The request-URI to associate with the setting of the cookie. This value can affect the default domain, path, source port, and source scheme values of the created cookie. ### [**](#value)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L11)value **value: string Cookie value. 
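As a quick illustration of the fields above, here is a hedged sketch of a cookie object; the values are placeholders, and `expires` is assumed to follow the DevTools-style seconds-since-epoch convention.

```
import type { Cookie } from '@crawlee/types';

const sessionCookie: Cookie = {
    name: 'sessionid',
    value: 'abc123',
    domain: '.example.com',
    path: '/',
    secure: true,
    httpOnly: true,
    sameSite: 'Lax',
    // Assumed to be seconds since the epoch; omit `expires` to get a session cookie.
    expires: Math.floor(Date.now() / 1000) + 3600,
};
```

Objects shaped like this are what browser-related helpers (for example the cookie methods on the [Session](https://crawlee.dev/js/api/core/class/Session.md) class) typically accept; check the specific API for the exact signature.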
--- # CrawlingContext \ ### Hierarchy * [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md)\ * *CrawlingContext* * [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) * [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**pushData](#pushData) * [**sendRequest](#sendRequest) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from RestrictedCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)crawler **crawler: Crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)getKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Overrides RestrictedCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from RestrictedCrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from RestrictedCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from RestrictedCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. 
### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from RestrictedCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from RestrictedCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from RestrictedCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)enqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Overrides RestrictedCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from RestrictedCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)sendRequest * ****sendRequest**\(overrideOptions): Promise\> - Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. 
This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> --- # CreateSession Factory user-function which creates customized [Session](https://crawlee.dev/js/api/core/class/Session.md) instances. ### Callable * ****CreateSession**(sessionPool, options): [Session](https://crawlee.dev/js/api/core/class/Session.md) | Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> *** * #### Parameters * ##### sessionPool: [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Pool requesting the new session. * ##### optionaloptions: { sessionOptions?: [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) } * ##### optionalsessionOptions: [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) #### Returns [Session](https://crawlee.dev/js/api/core/class/Session.md) | Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> --- # DatasetConsumer \ User-function used in the `Dataset.forEach()` API. ### Callable * ****DatasetConsumer**(item, index): Awaitable\ *** * #### Parameters * ##### item: Data Current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry being processed. * ##### index: number Position of current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry. #### Returns Awaitable\ --- # DatasetContent \ ## Index[**](#Index) ### Properties * [**count](#count) * [**desc](#desc) * [**items](#items) * [**limit](#limit) * [**offset](#offset) * [**total](#total) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L746)count **count: number Count of dataset entries returned in this set. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L754)optionaldesc **desc? : boolean Should the results be in descending order. ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L752)items **items: Data\[] Dataset entries based on chosen format parameter. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L750)limit **limit: number Maximum number of dataset entries requested. ### [**](#offset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L748)offset **offset: number Position of the first returned entry in the dataset. ### [**](#total)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L744)total **total: number Total count of entries in the dataset. --- # DatasetDataOptions ## Index[**](#Index) ### Properties * [**clean](#clean) * [**desc](#desc) * [**fields](#fields) * [**limit](#limit) * [**offset](#offset) * [**skipEmpty](#skipEmpty) * [**skipHidden](#skipHidden) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#clean)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L128)optionalclean **clean? : boolean = false If `true` then the function returns only non-empty items and skips hidden fields (i.e. 
fields starting with `#` character). Note that the `clean` parameter is a shortcut for `skipHidden: true` and `skipEmpty: true` options. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L110)optionaldesc **desc? : boolean = false If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. ### [**](#fields)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L115)optionalfields **fields? : string\[] An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L103)optionallimit **limit? : number = 250000 Maximum number of array elements to return. ### [**](#offset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L97)optionaloffset **offset? : number = 0 Number of array elements that should be skipped at the start. ### [**](#skipEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L141)optionalskipEmpty **skipEmpty? : boolean = false If `true` then the function doesn't return empty items. Note that in this case the returned number of items might be lower than limit parameter and pagination must be done using the `limit` value. ### [**](#skipHidden)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L134)optionalskipHidden **skipHidden? : boolean = false If `true` then the function doesn't return hidden fields (fields starting with "#" character). ### [**](#unwind)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L121)optionalunwind **unwind? : string Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # DatasetExportOptions ### Hierarchy * Omit<[DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md), offset | limit> * *DatasetExportOptions* * [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ## Index[**](#Index) ### Properties * [**clean](#clean) * [**collectAllKeys](#collectAllKeys) * [**desc](#desc) * [**fields](#fields) * [**skipEmpty](#skipEmpty) * [**skipHidden](#skipHidden) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#clean)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L128)optionalinheritedclean **clean? : boolean = false Inherited from Omit.clean If `true` then the function returns only non-empty items and skips hidden fields (i.e. fields starting with `#` character). Note that the `clean` parameter is a shortcut for `skipHidden: true` and `skipEmpty: true` options. ### [**](#collectAllKeys)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L149)optionalcollectAllKeys **collectAllKeys? : boolean If true, includes all unique keys from all dataset items in the CSV export header. If omitted or false, only keys from the first item are used. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L110)optionalinheriteddesc **desc? : boolean = false Inherited from Omit.desc If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. 
### [**](#fields)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L115)optionalinheritedfields **fields? : string\[] Inherited from Omit.fields An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#skipEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L141)optionalinheritedskipEmpty **skipEmpty? : boolean = false Inherited from Omit.skipEmpty If `true` then the function doesn't return empty items. Note that in this case the returned number of items might be lower than limit parameter and pagination must be done using the `limit` value. ### [**](#skipHidden)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L134)optionalinheritedskipHidden **skipHidden? : boolean = false Inherited from Omit.skipHidden If `true` then the function doesn't return hidden fields (fields starting with "#" character). ### [**](#unwind)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L121)optionalinheritedunwind **unwind? : string Inherited from Omit.unwind Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # DatasetExportToOptions ### Hierarchy * [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) * *DatasetExportToOptions* ## Index[**](#Index) ### Properties * [**clean](#clean) * [**collectAllKeys](#collectAllKeys) * [**desc](#desc) * [**fields](#fields) * [**fromDataset](#fromDataset) * [**skipEmpty](#skipEmpty) * [**skipHidden](#skipHidden) * [**toKVS](#toKVS) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#clean)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L128)optionalinheritedclean **clean? : boolean = false Inherited from DatasetExportOptions.clean If `true` then the function returns only non-empty items and skips hidden fields (i.e. fields starting with `#` character). Note that the `clean` parameter is a shortcut for `skipHidden: true` and `skipEmpty: true` options. ### [**](#collectAllKeys)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L149)optionalinheritedcollectAllKeys **collectAllKeys? : boolean Inherited from DatasetExportOptions.collectAllKeys If true, includes all unique keys from all dataset items in the CSV export header. If omitted or false, only keys from the first item are used. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L110)optionalinheriteddesc **desc? : boolean = false Inherited from DatasetExportOptions.desc If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. ### [**](#fields)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L115)optionalinheritedfields **fields? : string\[] Inherited from DatasetExportOptions.fields An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#fromDataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L177)optionalfromDataset **fromDataset? : string ### [**](#skipEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L141)optionalinheritedskipEmpty **skipEmpty? 
: boolean = false Inherited from DatasetExportOptions.skipEmpty If `true` then the function doesn't return empty items. Note that in this case the returned number of items might be lower than limit parameter and pagination must be done using the `limit` value. ### [**](#skipHidden)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L134)optionalinheritedskipHidden **skipHidden? : boolean = false Inherited from DatasetExportOptions.skipHidden If `true` then the function doesn't return hidden fields (fields starting with "#" character). ### [**](#toKVS)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L178)optionaltoKVS **toKVS? : string ### [**](#unwind)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L121)optionalinheritedunwind **unwind? : string Inherited from DatasetExportOptions.unwind Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # DatasetIteratorOptions ### Hierarchy * Omit<[DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md), offset | limit | clean | skipHidden | skipEmpty> * *DatasetIteratorOptions* ## Index[**](#Index) ### Properties * [**desc](#desc) * [**fields](#fields) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L110)optionalinheriteddesc **desc? : boolean = false Inherited from Omit.desc If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. ### [**](#fields)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L115)optionalinheritedfields **fields? : string\[] Inherited from Omit.fields An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#unwind)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L121)optionalinheritedunwind **unwind? : string Inherited from Omit.unwind Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # DatasetMapper \ User-function used in the `Dataset.map()` API. ### Callable * ****DatasetMapper**(item, index): Awaitable\ *** * User-function used in the `Dataset.map()` API. *** #### Parameters * ##### item: Data Current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry being processed. * ##### index: number Position of current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry. #### Returns Awaitable\ --- # DatasetOptions ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**storageObject](#storageObject) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L738)client **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L736)id **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L737)optionalname **name? 
: string ### [**](#storageObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L739)optionalstorageObject **storageObject? : Record\ --- # DatasetReducer \ User-function used in the `Dataset.reduce()` API. ### Callable * ****DatasetReducer**(memo, item, index): Awaitable\ *** * #### Parameters * ##### memo: T Previous state of the reduction. * ##### item: Data Current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry being processed. * ##### index: number Position of current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry. #### Returns Awaitable\ --- # EnqueueLinksOptions ### Hierarchy * [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) * *EnqueueLinksOptions* ## Index[**](#Index) ### Properties * [**baseUrl](#baseUrl) * [**exclude](#exclude) * [**forefront](#forefront) * [**globs](#globs) * [**label](#label) * [**limit](#limit) * [**onSkippedRequest](#onSkippedRequest) * [**pseudoUrls](#pseudoUrls) * [**regexps](#regexps) * [**requestQueue](#requestQueue) * [**robotsTxtFile](#robotsTxtFile) * [**selector](#selector) * [**skipNavigation](#skipNavigation) * [**strategy](#strategy) * [**transformRequestFunction](#transformRequestFunction) * [**urls](#urls) * [**userData](#userData) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#baseUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L68)optionalbaseUrl **baseUrl? : string A base URL that will be used to resolve relative URLs when using Cheerio. Ignored when using Puppeteer, since the relative URL resolution is done inside the browser automatically. ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L94)optionalexclude **exclude? : readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from RequestQueueOperationOptions.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L83)optionalglobs **globs? : readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. 
The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the function enqueues the links with the same subdomain. ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L56)optionallabel **label? : string Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this option. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L36)optionallimit **limit? : number Limit the amount of actually enqueued URLs to this number. Useful for testing across the entire crawling scope. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L192)optionalonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. or because the maxRequestsPerCrawl limit has been reached ### [**](#pseudoUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L126)optionalpseudoUrls **pseudoUrls? : readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] *NOTE:* In future versions of SDK the options will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues the links with the same subdomain. * **@deprecated** prefer using `globs` or `regexps` instead ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L106)optionalregexps **regexps? : readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the function enqueues the links with the same subdomain. 
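Inside a crawler, these pattern options are usually passed to `enqueueLinks()` from the request handler. A short sketch with `CheerioCrawler`; the URL patterns and the `DETAIL` label are illustrative:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Enqueue only product pages, never the login section, and label
        // the new requests so a router can dispatch them later.
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            exclude: ['https://example.com/login/**'],
            label: 'DETAIL',
        });
    },
});

await crawler.run(['https://example.com']);
```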
### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L42)optionalrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) A request queue to which the URLs will be enqueued. ### [**](#robotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L183)optionalrobotsTxtFile **robotsTxtFile? : Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed> RobotsTxtFile instance for the current request that triggered the `enqueueLinks`. If provided, disallowed URLs will be ignored. ### [**](#selector)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L45)optionalselector **selector? : string A CSS selector matching links to be enqueued. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L62)optionalskipNavigation **skipNavigation? : boolean = false If set to `true`, tells the crawler to skip navigation and process the request directly. ### [**](#strategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L171)optionalstrategy **strategy? : [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin = [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin The strategy to use when enqueueing the URLs. Depending on the strategy you select, we will only check certain parts of the URLs found. Here is a diagram of each URL part and their name:

```
Protocol          Domain
┌────┐          ┌─────────┐
https://example.crawlee.dev/...
│       └─────────────────┤
│           Hostname      │
│                         │
└─────────────────────────┘
             Origin
```

### [**](#transformRequestFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L151)optionaltransformRequestFunction **transformRequestFunction? : [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `keepUrlFragment: true` to the `request` object, URL fragments will not be removed when `uniqueKey` is computed. **Example:**

```
{
    transformRequestFunction: (request) => {
        request.userData.foo = 'bar';
        request.keepUrlFragment = true;
        return request;
    }
}
```

Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this function, so options returned by `transformRequestFunction` may be overwritten by those pattern-based options. ### [**](#urls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L39)optionalurls **urls? : readonly string\[] An array of URLs to enqueue.
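The `strategy` option narrows enqueued links by the URL parts pictured in the diagram above. A minimal sketch using the `EnqueueStrategy` enum (the start URL is illustrative):

```
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ enqueueLinks }) {
        // Keep only links whose hostname matches the page being processed;
        // links to other (sub)domains are dropped.
        await enqueueLinks({
            strategy: EnqueueStrategy.SameHostname,
        });
    },
});

await crawler.run(['https://crawlee.dev']);
```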
### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L48)optionaluserData **userData? : Dictionary Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L177)optionalwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean By default, only the first batch (1000) of found requests will be added to the queue before resolving the call. You can use this option to wait for adding all of them. --- # ErrnoException Node.js Error interface ### Hierarchy * Error * *ErrnoException* ## Index[**](#Index) ### Properties * [**cause](#cause) * [**code](#code) * [**errno](#errno) * [**message](#message) * [**name](#name) * [**path](#path) * [**stack](#stack) * [**syscall](#syscall) ## Properties[**](#Properties) ### [**](#cause)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L14)optionalcause **cause? : any Overrides Error.cause ### [**](#code)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L11)optionalcode **code? : string | number ### [**](#errno)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L10)optionalerrno **errno? : number ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from Error.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from Error.name ### [**](#path)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L12)optionalpath **path? : string ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from Error.stack ### [**](#syscall)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L13)optionalsyscall **syscall? 
: string --- # ErrorTrackerOptions ## Index[**](#Index) ### Properties * [**saveErrorSnapshots](#saveErrorSnapshots) * [**showErrorCode](#showErrorCode) * [**showErrorMessage](#showErrorMessage) * [**showErrorName](#showErrorName) * [**showFullMessage](#showFullMessage) * [**showFullStack](#showFullStack) * [**showStackTrace](#showStackTrace) ## Properties[**](#Properties) ### [**](#saveErrorSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L24)saveErrorSnapshots **saveErrorSnapshots: boolean ### [**](#showErrorCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L18)showErrorCode **showErrorCode: boolean ### [**](#showErrorMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L22)showErrorMessage **showErrorMessage: boolean ### [**](#showErrorName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L19)showErrorName **showErrorName: boolean ### [**](#showFullMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L23)showFullMessage **showFullMessage: boolean ### [**](#showFullStack)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L21)showFullStack **showFullStack: boolean ### [**](#showStackTrace)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L20)showStackTrace **showStackTrace: boolean --- # FinalStatistics ## Index[**](#Index) ### Properties * [**crawlerRuntimeMillis](#crawlerRuntimeMillis) * [**requestAvgFailedDurationMillis](#requestAvgFailedDurationMillis) * [**requestAvgFinishedDurationMillis](#requestAvgFinishedDurationMillis) * [**requestsFailed](#requestsFailed) * [**requestsFailedPerMinute](#requestsFailedPerMinute) * [**requestsFinished](#requestsFinished) * [**requestsFinishedPerMinute](#requestsFinishedPerMinute) * [**requestsTotal](#requestsTotal) * [**requestTotalDurationMillis](#requestTotalDurationMillis) * [**retryHistogram](#retryHistogram) ## Properties[**](#Properties) ### [**](#crawlerRuntimeMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L95)crawlerRuntimeMillis **crawlerRuntimeMillis: number ### [**](#requestAvgFailedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L89)requestAvgFailedDurationMillis **requestAvgFailedDurationMillis: number ### [**](#requestAvgFinishedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L90)requestAvgFinishedDurationMillis **requestAvgFinishedDurationMillis: number ### [**](#requestsFailed)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L87)requestsFailed **requestsFailed: number ### [**](#requestsFailedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L92)requestsFailedPerMinute **requestsFailedPerMinute: number ### [**](#requestsFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L86)requestsFinished **requestsFinished: number ### [**](#requestsFinishedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L91)requestsFinishedPerMinute **requestsFinishedPerMinute: number ### 
[**](#requestsTotal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L94)requestsTotal **requestsTotal: number ### [**](#requestTotalDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L93)requestTotalDurationMillis **requestTotalDurationMillis: number ### [**](#retryHistogram)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L88)retryHistogram **retryHistogram: number\[] --- # HttpRequest \ HTTP Request as accepted by [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) methods. ### Hierarchy * *HttpRequest* * [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ## Index[**](#Index) ### Properties * [**body](#body) * [**cookieJar](#cookieJar) * [**encoding](#encoding) * [**followRedirect](#followRedirect) * [**headerGenerator](#headerGenerator) * [**headerGeneratorOptions](#headerGeneratorOptions) * [**headers](#headers) * [**insecureHTTPParser](#insecureHTTPParser) * [**maxRedirects](#maxRedirects) * [**method](#method) * [**proxyUrl](#proxyUrl) * [**responseType](#responseType) * [**sessionToken](#sessionToken) * [**signal](#signal) * [**throwHttpErrors](#throwHttpErrors) * [**timeout](#timeout) * [**url](#url) * [**useHeaderGenerator](#useHeaderGenerator) ## Properties[**](#Properties) ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L84)optionalbody **body? : string | Readable | Buffer\ | Generator\ | AsyncGenerator\ | FormDataLike ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L89)optionalcookieJar **cookieJar? : ToughCookieJar | PromiseCookieJar ### [**](#encoding)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L93)optionalencoding **encoding? : BufferEncoding ### [**](#followRedirect)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L90)optionalfollowRedirect **followRedirect? : boolean | (response) => boolean ### [**](#headerGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L101)optionalheaderGenerator **headerGenerator? : { getHeaders: (options) => Record\ } #### Type declaration * ##### getHeaders: (options) => Record\ * * **(options): Record\ - #### Parameters * ##### options: Record\ #### Returns Record\ ### [**](#headerGeneratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L99)optionalheaderGeneratorOptions **headerGeneratorOptions? : Record\ ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L83)optionalheaders **headers? : SimpleHeaders ### [**](#insecureHTTPParser)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L104)optionalinsecureHTTPParser **insecureHTTPParser? : boolean ### [**](#maxRedirects)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L91)optionalmaxRedirects **maxRedirects? : number ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L82)optionalmethod **method? 
: Method ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L98)optionalproxyUrl **proxyUrl? : string ### [**](#responseType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L94)optionalresponseType **responseType? : TResponseType ### [**](#sessionToken)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L105)optionalsessionToken **sessionToken? : object ### [**](#signal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L86)optionalsignal **signal? : AbortSignal ### [**](#throwHttpErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L95)optionalthrowHttpErrors **throwHttpErrors? : boolean ### [**](#timeout)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L87)optionaltimeout **timeout? : Partial\ ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L81)url **url: string | URL ### [**](#useHeaderGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L100)optionaluseHeaderGenerator **useHeaderGenerator? : boolean --- # HttpRequestOptions \ Additional options for HTTP requests that need to be handled separately before passing to [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md). ### Hierarchy * [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ * *HttpRequestOptions* ## Index[**](#Index) ### Properties * [**body](#body) * [**cookieJar](#cookieJar) * [**encoding](#encoding) * [**followRedirect](#followRedirect) * [**form](#form) * [**headerGenerator](#headerGenerator) * [**headerGeneratorOptions](#headerGeneratorOptions) * [**headers](#headers) * [**insecureHTTPParser](#insecureHTTPParser) * [**json](#json) * [**maxRedirects](#maxRedirects) * [**method](#method) * [**password](#password) * [**proxyUrl](#proxyUrl) * [**responseType](#responseType) * [**searchParams](#searchParams) * [**sessionToken](#sessionToken) * [**signal](#signal) * [**throwHttpErrors](#throwHttpErrors) * [**timeout](#timeout) * [**url](#url) * [**useHeaderGenerator](#useHeaderGenerator) * [**username](#username) ## Properties[**](#Properties) ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L84)optionalinheritedbody **body? : string | Readable | Buffer\ | Generator\ | AsyncGenerator\ | FormDataLike Inherited from HttpRequest.body ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L89)optionalinheritedcookieJar **cookieJar? : ToughCookieJar | PromiseCookieJar Inherited from HttpRequest.cookieJar ### [**](#encoding)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L93)optionalinheritedencoding **encoding? : BufferEncoding Inherited from HttpRequest.encoding ### [**](#followRedirect)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L90)optionalinheritedfollowRedirect **followRedirect? : boolean | (response) => boolean Inherited from HttpRequest.followRedirect ### [**](#form)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L117)optionalform **form? 
: Record\ A form to be sent in the HTTP request body (URL encoding will be used). ### [**](#headerGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L101)optionalinheritedheaderGenerator **headerGenerator? : { getHeaders: (options) => Record\ } Inherited from HttpRequest.headerGenerator #### Type declaration * ##### getHeaders: (options) => Record\ * * **(options): Record\ - #### Parameters * ##### options: Record\ #### Returns Record\ ### [**](#headerGeneratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L99)optionalinheritedheaderGeneratorOptions **headerGeneratorOptions? : Record\ Inherited from HttpRequest.headerGeneratorOptions ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L83)optionalinheritedheaders **headers? : SimpleHeaders Inherited from HttpRequest.headers ### [**](#insecureHTTPParser)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L104)optionalinheritedinsecureHTTPParser **insecureHTTPParser? : boolean Inherited from HttpRequest.insecureHTTPParser ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L119)optionaljson **json? : unknown Arbitrary object to be JSON-serialized and sent as the HTTP request body. ### [**](#maxRedirects)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L91)optionalinheritedmaxRedirects **maxRedirects? : number Inherited from HttpRequest.maxRedirects ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L82)optionalinheritedmethod **method? : Method Inherited from HttpRequest.method ### [**](#password)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L124)optionalpassword **password? : string Basic HTTP Auth password ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L98)optionalinheritedproxyUrl **proxyUrl? : string Inherited from HttpRequest.proxyUrl ### [**](#responseType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L94)optionalinheritedresponseType **responseType? : TResponseType Inherited from HttpRequest.responseType ### [**](#searchParams)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L114)optionalsearchParams **searchParams? : [SearchParams](https://crawlee.dev/js/api/utils.md#SearchParams) Search (query string) parameters to be appended to the request URL ### [**](#sessionToken)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L105)optionalinheritedsessionToken **sessionToken? : object Inherited from HttpRequest.sessionToken ### [**](#signal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L86)optionalinheritedsignal **signal? : AbortSignal Inherited from HttpRequest.signal ### [**](#throwHttpErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L95)optionalinheritedthrowHttpErrors **throwHttpErrors?
: boolean Inherited from HttpRequest.throwHttpErrors ### [**](#timeout)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L87)optionalinheritedtimeout **timeout? : Partial\ Inherited from HttpRequest.timeout ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L81)inheritedurl **url: string | URL Inherited from HttpRequest.url ### [**](#useHeaderGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L100)optionalinheriteduseHeaderGenerator **useHeaderGenerator? : boolean Inherited from HttpRequest.useHeaderGenerator ### [**](#username)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L122)optionalusername **username? : string Basic HTTP Auth username --- # HttpResponse \ HTTP response data as returned by the [BaseHttpClient.sendRequest](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md#sendRequest) method. ### Hierarchy * HttpResponseWithoutBody\ * *HttpResponse* ## Index[**](#Index) ### Properties * [**body](#body) * [**complete](#complete) * [**headers](#headers) * [**ip](#ip) * [**redirectUrls](#redirectUrls) * [**request](#request) * [**statusCode](#statusCode) * [**statusMessage](#statusMessage) * [**trailers](#trailers) * [**url](#url) ## Properties[**](#Properties) ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L156)body **body: [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md)\[TResponseType] ### [**](#complete)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L141)inheritedcomplete **complete: boolean Inherited from HttpResponseWithoutBody.complete ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L138)inheritedheaders **headers: SimpleHeaders Inherited from HttpResponseWithoutBody.headers ### [**](#ip)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L134)optionalinheritedip **ip? : string Inherited from HttpResponseWithoutBody.ip ### [**](#redirectUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L131)inheritedredirectUrls **redirectUrls: URL\[] Inherited from HttpResponseWithoutBody.redirectUrls ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L146)inheritedrequest **request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ Inherited from HttpResponseWithoutBody.request ### [**](#statusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L135)inheritedstatusCode **statusCode: number Inherited from HttpResponseWithoutBody.statusCode ### [**](#statusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L136)optionalinheritedstatusMessage **statusMessage? 
: string Inherited from HttpResponseWithoutBody.statusMessage ### [**](#trailers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L139)inheritedtrailers **trailers: SimpleHeaders Inherited from HttpResponseWithoutBody.trailers ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L132)inheritedurl **url: string Inherited from HttpResponseWithoutBody.url --- # IRequestList Represents a static list of URLs to crawl. ### Implemented by * [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) * [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ## Index[**](#Index) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**fetchNextRequest](#fetchNextRequest) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**length](#length) * [**markRequestHandled](#markRequestHandled) * [**persistState](#persistState) * [**reclaimRequest](#reclaimRequest) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L72)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, any, any> - Can be used to iterate over the `RequestList` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, any, any> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L66)fetchNextRequest * ****fetchNextRequest**(): Promise\> - Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. First, the function gets a request previously reclaimed using the [RequestList.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestList.md#reclaimRequest) function, if there is any. Otherwise it gets the next request from sources. The function's `Promise` resolves to `null` if there are no more requests to process. *** #### Returns Promise\> ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L47)handledCount * ****handledCount**(): number - Returns number of handled requests. *** #### Returns number ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L42)isEmpty * ****isEmpty**(): Promise\ - Resolves to `true` if the next call to [IRequestList.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestList.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the list is empty, there might be some pending requests currently being processed. *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L35)isFinished * ****isFinished**(): Promise\ - Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#length)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L30)length * ****length**(): number - Returns the total number of unique requests present in the list. 
*** #### Returns number ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L83)markRequestHandled * ****markRequestHandled**(request): Promise\ - Marks request as handled after successful processing. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L56)persistState * ****persistState**(): Promise\ - Persists the current state of the `IRequestList` into the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). The state is persisted automatically in regular intervals, but calling this method manually is useful in cases where you want to have the most current state available after you pause or stop fetching its requests. For example after you pause or abort a crawl. Or just before a server migration. *** #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L78)reclaimRequest * ****reclaimRequest**(request): Promise\ - Reclaims request to the list if its processing failed. The request will become available in the next `this.fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ --- # IRequestManager Represents a provider of requests/URLs to crawl. ### Implemented by * [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) * [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ## Index[**](#Index) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequestsBatched](#addRequestsBatched) * [**fetchNextRequest](#fetchNextRequest) * [**getPendingCount](#getPendingCount) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L84)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, any, any> - Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. 
*** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, any, any> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L97)addRequest * ****addRequest**(requestLike, options): Promise\ - #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) #### Returns Promise\ ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L99)addRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) * ##### optionaloptions: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L78)fetchNextRequest * ****fetchNextRequest**\(): Promise\> - Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. The function's `Promise` resolves to `null` if there are no more requests to process. *** #### Returns Promise\> ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L70)getPendingCount * ****getPendingCount**(): number - Get an offline approximation of the number of pending requests. *** #### Returns number ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L65)getTotalCount * ****getTotalCount**(): number - Get the total number of requests known to the request manager. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L60)handledCount * ****handledCount**(): Promise\ - Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L55)isEmpty * ****isEmpty**(): Promise\ - Resolves to `true` if the next call to [IRequestManager.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestManager.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the provider is empty, there might be some pending requests currently being processed. *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L48)isFinished * ****isFinished**(): Promise\ - Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L89)markRequestHandled * ****markRequestHandled**(request): Promise\ - Marks request as handled after successful processing. 
*** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L95)reclaimRequest * ****reclaimRequest**(request, options): Promise\ - Reclaims request to the provider if its processing failed. The request will become available in the next `fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) #### Returns Promise\ --- # IStorage ### Implemented by * [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ## Index[**](#Index) ### Properties * [**id](#id) * [**name](#name) ## Properties[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L15)id **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L16)optionalname **name? : string --- # KeyConsumer User-function used in the [KeyValueStore.forEachKey](https://crawlee.dev/js/api/core/class/KeyValueStore.md#forEachKey) method. ### Callable * ****KeyConsumer**(key, index, info): Awaitable\ *** * #### Parameters * ##### key: string Current [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) key being processed. * ##### index: number Position of the current key in [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). * ##### info: { size: number } Information about the current [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) entry. * ##### size: number Size of the value associated with the current key in bytes. #### Returns Awaitable\ --- # KeyValueStoreIteratorOptions ## Index[**](#Index) ### Properties * [**collection](#collection) * [**exclusiveStartKey](#exclusiveStartKey) * [**prefix](#prefix) ## Properties[**](#Properties) ### [**](#collection)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L770)optionalcollection **collection? : string Collection name to use for listing keys. ### [**](#exclusiveStartKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L762)optionalexclusiveStartKey **exclusiveStartKey? : string All keys up to this one (including) are skipped from the result. ### [**](#prefix)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L766)optionalprefix **prefix? : string If set, only keys that start with this prefix are returned. --- # KeyValueStoreOptions ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**storageObject](#storageObject) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L737)client **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L735)id **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L736)optionalname **name? 
: string ### [**](#storageObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L738)optionalstorageObject **storageObject? : Record\ --- # LoggerOptions ## Index[**](#Index) ### Properties * [**data](#data) * [**level](#level) * [**logger](#logger) * [**maxDepth](#maxDepth) * [**maxStringLength](#maxStringLength) * [**prefix](#prefix) * [**suffix](#suffix) ## Properties[**](#Properties) ### [**](#data)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L61)externaloptionaldata **data? : Record\ Additional data to be added to each log line. ### [**](#level)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L46)externaloptionallevel **level? : number Sets the log level to the given value, preventing messages from less important log levels from being printed to the console. Use in conjunction with the `log.LEVELS` constants. ### [**](#logger)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L59)externaloptionallogger **logger? : [Logger](https://crawlee.dev/js/api/core/class/Logger.md) Logger implementation to be used. The default is `log.LoggerText`, which logs messages as easily readable strings. Optionally you can use `log.LoggerJson`, which formats each log line as JSON. ### [**](#maxDepth)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L48)externaloptionalmaxDepth **maxDepth? : number Max depth of the data object that will be logged. Anything deeper than the limit will be stripped off. ### [**](#maxStringLength)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L50)externaloptionalmaxStringLength **maxStringLength? : number Max length of the string to be logged. Longer strings will be truncated. ### [**](#prefix)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L52)externaloptionalprefix **prefix? : null | string Prefix to be prepended to each logged line. ### [**](#suffix)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L54)externaloptionalsuffix **suffix? : null | string Suffix that will be appended to each logged line. --- # PersistenceOptions Persistence-related options to control how and when crawler's data gets persisted. ## Index[**](#Index) ### Properties * [**enable](#enable) ## Properties[**](#Properties) ### [**](#enable)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L46)optionalenable **enable? : boolean = true Use this flag to disable or enable periodic persistence to the key-value store. --- # ProxyConfigurationFunction ### Callable * ****ProxyConfigurationFunction**(sessionId, options): null | string | Promise\ *** * #### Parameters * ##### sessionId: string | number * ##### optionaloptions: { request?: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ } * ##### optionalrequest: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns null | string | Promise\ --- # ProxyConfigurationOptions ## Index[**](#Index) ### Properties * [**newUrlFunction](#newUrlFunction) * [**proxyUrls](#proxyUrls) * [**tieredProxyUrls](#tieredProxyUrls) ## Properties[**](#Properties) ### [**](#newUrlFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L29)optionalnewUrlFunction **newUrlFunction?
: [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) Custom function that allows you to generate the new proxy URL dynamically. It gets the `sessionId` as a parameter and an optional parameter with the `Request` object when applicable. Can return either a stringified proxy URL or `null` if the proxy should not be used. Can be asynchronous. This function is used to generate the URL when [ProxyConfiguration.newUrl](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newUrl) or [ProxyConfiguration.newProxyInfo](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newProxyInfo) is called. ### [**](#proxyUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L21)optionalproxyUrls **proxyUrls? : UrlList An array of custom proxy URLs to be rotated. Custom proxies are not compatible with Apify Proxy and an attempt to use both configuration options will cause an error to be thrown on initialize. ### [**](#tieredProxyUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L42)optionaltieredProxyUrls **tieredProxyUrls? : UrlList\[] An array of custom proxy URLs to be rotated, stratified in tiers. This is a more advanced version of `proxyUrls` that allows you to define a hierarchy of proxy URLs. If everything goes well, all the requests will be sent through the first proxy URL in the list. Whenever the crawler encounters a problem with the current proxy on the given domain, it will switch to a higher tier for this domain. The crawler periodically probes lower-tier proxies to check whether it can switch back down. This feature is useful when you have a set of proxies with different performance characteristics (speed, price, anti-bot performance, etc.) and you want to use the best one for each domain. Use `null` as a proxy URL to disable the proxy for the given tier. --- # ProxyInfo The main purpose of the ProxyInfo object is to provide information about the current proxy connection used by the crawler for the request. Outside of crawlers, you can get this object by calling [ProxyConfiguration.newProxyInfo](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newProxyInfo). **Example usage:**

```
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['...', '...'] // List of Proxy URLs to rotate
});

// Getting proxyInfo object by calling class method directly
const proxyInfo = await proxyConfiguration.newProxyInfo();

// In crawler
const crawler = new CheerioCrawler({
    // ...
    proxyConfiguration,
    requestHandler({ proxyInfo }) {
        // Getting used proxy URL
        const proxyUrl = proxyInfo.url;

        // Getting ID of used Session
        const sessionIdentifier = proxyInfo.sessionId;
    }
})
```

## Index[**](#Index) ### Properties * [**hostname](#hostname) * [**password](#password) * [**port](#port) * [**proxyTier](#proxyTier) * [**sessionId](#sessionId) * [**url](#url) * [**username](#username) ## Properties[**](#Properties) ### [**](#hostname)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L104)hostname **hostname: string Hostname of your proxy. ### [**](#password)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L99)password **password: string User's password for the proxy. ### [**](#port)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L109)port **port: string | number Proxy port.
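A hedged sketch of how `tieredProxyUrls` (documented above) works together with the `proxyTier` field listed below; the proxy URLs are placeholders, not real endpoints:

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Tier 0 disables the proxy, higher tiers use increasingly "strong" proxies.
// The crawler escalates to a higher tier per domain when it runs into
// blocking and periodically probes lower tiers to downshift again.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        [null],
        ['http://datacenter-proxy.example.com:8000'],
        ['http://residential-proxy.example.com:8000'],
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, proxyInfo, log }) {
        log.info(`${request.url} fetched via proxy tier ${proxyInfo?.proxyTier}`);
    },
});
```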
### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L114)optionalproxyTier **proxyTier? : number Proxy tier for the current proxy, if applicable (only for `tieredProxyUrls`). ### [**](#sessionId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L84)optionalsessionId **sessionId? : string The identifier of used [Session](https://crawlee.dev/js/api/core/class/Session.md), if used. ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L89)url **url: string The URL of the proxy. ### [**](#username)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L94)optionalusername **username? : string Username for the proxy. --- # PushErrorMessageOptions ## Index[**](#Index) ### Properties * [**omitStack](#omitStack) ## Properties[**](#Properties) ### [**](#omitStack)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L564)optionalomitStack **omitStack? : boolean = false Only push the error message without stack trace when true. --- # QueueOperationInfo A helper class that is used to report results from various [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) functions as well as enqueueLinks. ## Index[**](#Index) ### Properties * [**requestId](#requestId) * [**wasAlreadyHandled](#wasAlreadyHandled) * [**wasAlreadyPresent](#wasAlreadyPresent) ## Properties[**](#Properties) ### [**](#requestId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L15)requestId **requestId: string The ID of the added request ### [**](#wasAlreadyHandled)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L12)wasAlreadyHandled **wasAlreadyHandled: boolean Indicates if request was already marked as handled. ### [**](#wasAlreadyPresent)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L9)wasAlreadyPresent **wasAlreadyPresent: boolean Indicates if request was already present in the queue. --- # RecordOptions ## Index[**](#Index) ### Properties * [**contentType](#contentType) * [**doNotRetryTimeouts](#doNotRetryTimeouts) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L745)optionalcontentType **contentType? : string Specifies a custom MIME content type of the record. ### [**](#doNotRetryTimeouts)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L755)optionaldoNotRetryTimeouts **doNotRetryTimeouts? : boolean If set to `true`, the `set-record` API call will not be retried if it times out. ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L750)optionaltimeoutSecs **timeoutSecs? : number Specifies a custom timeout for the `set-record` API call, in seconds. 
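`RecordOptions` is the third argument of `KeyValueStore.setValue()`. A minimal sketch; the key name and payload are illustrative:

```
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

// Store a raw HTML snapshot under an explicit content type, give the
// underlying `set-record` call a longer timeout, and don't retry timeouts.
await store.setValue('page-snapshot.html', '<html>...</html>', {
    contentType: 'text/html',
    timeoutSecs: 60,
    doNotRetryTimeouts: true,
});
```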
--- # RecoverableStateOptions \ Options for configuring the RecoverableState ### Hierarchy * [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) * *RecoverableStateOptions* ## Index[**](#Index) ### Properties * [**config](#config) * [**defaultState](#defaultState) * [**deserialize](#deserialize) * [**logger](#logger) * [**persistenceEnabled](#persistenceEnabled) * [**persistStateKey](#persistStateKey) * [**persistStateKvsId](#persistStateKvsId) * [**persistStateKvsName](#persistStateKvsName) * [**serialize](#serialize) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L49)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Configuration instance to use ### [**](#defaultState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L39)defaultState **defaultState: TStateModel The default state used if no persisted state is found. A deep copy is made each time the state is used. ### [**](#deserialize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L62)optionaldeserialize **deserialize? : (serializedState) => TStateModel Optional function to transform a JSON-serialized object back to the state model. If not provided, JSON.parse is used. It is advisable to perform validation in this function and to throw an exception if it fails. *** #### Type declaration * * **(serializedState): TStateModel - #### Parameters * ##### serializedState: string #### Returns TStateModel ### [**](#logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L44)optionallogger **logger? : [Log](https://crawlee.dev/js/api/core/class/Log.md) A logger instance for logging operations related to state persistence ### [**](#persistenceEnabled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L15)optionalinheritedpersistenceEnabled **persistenceEnabled? : boolean Inherited from RecoverableStatePersistenceOptions.persistenceEnabled Flag to enable or disable state persistence ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L10)inheritedpersistStateKey **persistStateKey: string Inherited from RecoverableStatePersistenceOptions.persistStateKey The key under which the state is stored in the KeyValueStore ### [**](#persistStateKvsId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L27)optionalinheritedpersistStateKvsId **persistStateKvsId? : string Inherited from RecoverableStatePersistenceOptions.persistStateKvsId The identifier of the KeyValueStore to use for persistence. If neither a name nor an id are supplied, the default store will be used. ### [**](#persistStateKvsName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L21)optionalinheritedpersistStateKvsName **persistStateKvsName? : string Inherited from RecoverableStatePersistenceOptions.persistStateKvsName The name of the KeyValueStore to use for persistence. If neither a name nor an id are supplied, the default store will be used. ### [**](#serialize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L55)optionalserialize **serialize? : (state) => string Optional function to transform the state to a JSON string before persistence. 
If not provided, JSON.stringify will be used. *** #### Type declaration * * **(state): string - #### Parameters * ##### state: TStateModel #### Returns string --- # RecoverableStatePersistenceOptions ### Hierarchy * *RecoverableStatePersistenceOptions* * [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ## Index[**](#Index) ### Properties * [**persistenceEnabled](#persistenceEnabled) * [**persistStateKey](#persistStateKey) * [**persistStateKvsId](#persistStateKvsId) * [**persistStateKvsName](#persistStateKvsName) ## Properties[**](#Properties) ### [**](#persistenceEnabled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L15)optionalpersistenceEnabled **persistenceEnabled? : boolean Flag to enable or disable state persistence ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L10)persistStateKey **persistStateKey: string The key under which the state is stored in the KeyValueStore ### [**](#persistStateKvsId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L27)optionalpersistStateKvsId **persistStateKvsId? : string The identifier of the KeyValueStore to use for persistence. If neither a name nor an id are supplied, the default store will be used. ### [**](#persistStateKvsName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L21)optionalpersistStateKvsName **persistStateKvsName? : string The name of the KeyValueStore to use for persistence. If neither a name nor an id are supplied, the default store will be used. --- # RequestListOptions ## Index[**](#Index) ### Properties * [**keepDuplicateUrls](#keepDuplicateUrls) * [**persistRequestsKey](#persistRequestsKey) * [**persistStateKey](#persistStateKey) * [**proxyConfiguration](#proxyConfiguration) * [**sources](#sources) * [**sourcesFunction](#sourcesFunction) * [**state](#state) ## Properties[**](#Properties) ### [**](#keepDuplicateUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L233)optionalkeepDuplicateUrls **keepDuplicateUrls? : boolean = false By default, `RequestList` will deduplicate the provided URLs. Default deduplication is based on the `uniqueKey` property of passed source [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If the property is not present, it is generated by normalizing the URL. If present, it is kept intact. In any case, only one request per `uniqueKey` is added to the `RequestList` resulting in removal of duplicate URLs / unique keys. Setting `keepDuplicateUrls` to `true` will append an additional identifier to the `uniqueKey` of each request that does not already include a `uniqueKey`. Therefore, duplicate URLs will be kept in the list. It does not protect the user from having duplicates in user set `uniqueKey`s however. It is the user's responsibility to ensure uniqueness of their unique keys if they wish to keep more than just a single copy in the `RequestList`. ### [**](#persistRequestsKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L196)optionalpersistRequestsKey **persistRequestsKey? : string Identifies the key in the default key-value store under which the `RequestList` persists its Requests during the RequestList.initialize call. 
This is necessary if `persistStateKey` is set and the source URLs might potentially change, to ensure consistency of the source URLs and state object. However, it comes with some storage and performance overheads. If `persistRequestsKey` is not set, RequestList.initialize will always fetch the sources from their origin, check that they are consistent with the restored state (if any) and throw an error if they are not. ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L183)optionalpersistStateKey **persistStateKey? : string Identifies the key in the default key-value store under which `RequestList` periodically stores its state (i.e. which URLs were crawled and which not). If the crawler is restarted, `RequestList` will read the state and continue where it left off. If `persistStateKey` is not set, `RequestList` will always start from the beginning, and all the source URLs will be crawled again. ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L172)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Used to pass the proxy configuration for the `requestsFromUrl` objects. Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. ### [**](#sources)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L125)optionalsources **sources? : RequestListSource\[] An array of sources of URLs for the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md). It can be either an array of strings, plain objects that define at least the `url` property, or an array of [Request](https://crawlee.dev/js/api/core/class/Request.md) instances. **IMPORTANT:** The `sources` array will be consumed (left empty) after `RequestList` initializes. This is a measure to prevent memory leaks in situations when millions of sources are added. Additionally, the `requestsFromUrl` property may be used instead of `url`, which will instruct `RequestList` to download the source URLs from a given remote location. The URLs will be parsed from the received response. ``` [ // A single URL 'http://example.com/a/b', // Modify Request options { method: 'PUT', url: 'https://example.com/put', payload: JSON.stringify({ foo: 'bar' }) }, // Batch import of URLs from a file hosted on the web, // where the URLs should be requested using the HTTP POST request { method: 'POST', requestsFromUrl: 'http://example.com/urls.txt' }, // Batch import from remote file, using a specific regular expression to extract the URLs. { requestsFromUrl: 'http://example.com/urls.txt', regex: /https:\/\/example\.com\/.+/ }, // Get list of URLs from a Google Sheets document. Just add "/gviz/tq?tqx=out:csv" to the Google Sheet URL. // For details, see https://help.apify.com/en/articles/2906022-scraping-a-list-of-urls-from-a-google-sheets-document { requestsFromUrl: 'https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w/gviz/tq?tqx=out:csv' } ] ``` ### [**](#sourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L165)optionalsourcesFunction **sourcesFunction?
: [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) A function that will be called to get the sources for the `RequestList`, but only if `RequestList` was not able to fetch their persisted version (see [RequestListOptions.persistRequestsKey](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#persistRequestsKey)). It must return an `Array` of [Request](https://crawlee.dev/js/api/core/class/Request.md) or [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md). This is very useful in a scenario when getting the sources is a resource-intensive or time-consuming task, such as fetching URLs from multiple sitemaps or parsing URLs from large datasets. Using the `sourcesFunction` in combination with `persistStateKey` and `persistRequestsKey` will allow you to fetch and parse those URLs only once, saving valuable time when your crawler migrates or restarts. If both [RequestListOptions.sources](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#sources) and [RequestListOptions.sourcesFunction](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#sourcesFunction) are provided, the sources returned by the function will be added after the `sources`. **Example:** ``` // Let's say we want to scrape URLs extracted from sitemaps. const sourcesFunction = async () => { // With super large sitemaps, this operation could take very long // and big websites typically have multiple sitemaps. const sitemaps = await downloadHugeSitemaps(); return parseUrlsFromSitemaps(sitemaps); }; // Sitemaps can change in real-time, so it's important to persist // the URLs we collected. Otherwise we might lose our scraping // state in case of a crawler migration / failure / time-out. const requestList = await RequestList.open(null, [], { // The sourcesFunction is called now and the Requests are persisted. // If something goes wrong and we need to start again, RequestList // will load the persisted Requests from storage and will NOT // call the sourcesFunction again, saving time and resources. sourcesFunction, persistStateKey: 'state-key', persistRequestsKey: 'requests-key', }); ``` ### [**](#state)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L216)optionalstate **state? : [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) The state object that the `RequestList` will be initialized from. It is in the form returned by `RequestList.getState()`, such as follows: ``` { nextIndex: 5, nextUniqueKey: 'unique-key-5', inProgress: { 'unique-key-1': true, 'unique-key-4': true, }, } ``` Note that the preferred (and simpler) way to persist the state of crawling of the `RequestList` is to use the `stateKeyPrefix` parameter instead. --- # RequestListState Represents the state of a [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md). It can be used to resume a [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) which has been previously processed.
You can obtain the state by calling [RequestList.getState](https://crawlee.dev/js/api/core/class/RequestList.md#getState) and receive an object with the following structure: ``` { nextIndex: 5, nextUniqueKey: 'unique-key-5', inProgress: { 'unique-key-1': true, 'unique-key-4': true }, } ``` ## Index[**](#Index) ### Properties * [**inProgress](#inProgress) * [**nextIndex](#nextIndex) * [**nextUniqueKey](#nextUniqueKey) ## Properties[**](#Properties) ### [**](#inProgress)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L996)inProgress **inProgress: string\[] Array of request keys representing those that are being processed at the moment. ### [**](#nextIndex)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L990)nextIndex **nextIndex: number Position of the next request to be processed. ### [**](#nextUniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L993)nextUniqueKey **nextUniqueKey: null | string Key of the next request to be processed. --- # RequestOptions \ Specifies required and optional fields for constructing a [Request](https://crawlee.dev/js/api/core/class/Request.md). ## Index[**](#Index) ### Properties * [**crawlDepth](#crawlDepth) * [**headers](#headers) * [**keepUrlFragment](#keepUrlFragment) * [**label](#label) * [**maxRetries](#maxRetries) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**skipNavigation](#skipNavigation) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**useExtendedUniqueKey](#useExtendedUniqueKey) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#crawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L532)optionalcrawlDepth **crawlDepth? : number = 0 Depth of the request in the current crawl tree. Note that this is dependent on the crawler setup and might produce unexpected results when used with multiple crawlers. ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L484)optionalheaders **headers? : Record\ HTTP headers in the following format: ``` { Accept: 'text/html', 'Content-Type': 'application/json' } ``` ### [**](#keepUrlFragment)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L504)optionalkeepUrlFragment **keepUrlFragment? : boolean = false If `false` then the hash part of a URL is removed when computing the `uniqueKey` property. For example, this causes the `http://www.example.com#foo` and `http://www.example.com#bar` URLs to have the same `uniqueKey` of `http://www.example.com` and thus the URLs are considered equal. Note that this option only has an effect if `uniqueKey` is not set. ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L495)optionallabel **label? : string Shortcut for setting `userData: { label: '...' }`. ### [**](#maxRetries)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L544)optionalmaxRetries **maxRetries? : number Maximum number of retries for this request. Allows overriding the global `maxRequestRetries` option of `BasicCrawler`. ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L470)optionalmethod **method?
: 'get' | 'options' | 'post' | 'put' | 'patch' | 'head' | 'delete' | 'trace' | 'connect' | AllowedHttpMethods = 'GET' ### [**](#noRetry)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L518)optionalnoRetry **noRetry? : boolean = false The `true` value indicates that the request will not be automatically retried on error. ### [**](#payload)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L473)optionalpayload **payload? : string HTTP request payload, e.g. for POST requests. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L525)optionalskipNavigation **skipNavigation? : boolean = false If set to `true` then the crawler processing this request evaluates the `requestHandler` immediately without prior browser navigation. ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L467)optionaluniqueKey **uniqueKey? : string A unique key identifying the request. Two requests with the same `uniqueKey` are considered as pointing to the same URL. If `uniqueKey` is not provided, then it is automatically generated by normalizing the URL. For example, the URL of `HTTP://www.EXAMPLE.com/something/` will produce the `uniqueKey` of `http://www.example.com/something`. The `keepUrlFragment` option determines whether the URL hash fragment is included in the `uniqueKey` or not. The `useExtendedUniqueKey` option determines whether method and payload are included in the `uniqueKey`, producing a `uniqueKey` in the following format: `METHOD(payloadHash):normalizedUrl`. This is useful when requests point to the same URL, but with different methods and payloads. For example: form submits. Pass an arbitrary non-empty text value to the `uniqueKey` property to override the default behavior and specify which URLs shall be considered equal. ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L448)url **url: string URL of the web page to crawl. It must be a non-empty string. ### [**](#useExtendedUniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L512)optionaluseExtendedUniqueKey **useExtendedUniqueKey? : boolean = false If `true` then the `uniqueKey` is computed not only from the URL, but also from the method and payload properties. This is useful when making requests to the same URL that are differentiated by method or payload, such as form submit navigations in browsers. ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L490)optionaluserData **userData? : UserData Custom user data assigned to the request. Use this to save any request-related data to the request's scope, keeping them accessible on retries, failures etc.
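**Usage sketch:** a minimal illustration of how the options above affect `uniqueKey` computation when constructing [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The URLs, label and payload are made up for the example.

```
import { Request } from 'crawlee';

// Same normalized URL => same uniqueKey, so the queue treats them as duplicates.
const a = new Request({ url: 'HTTP://www.EXAMPLE.com/something/' });
const b = new Request({ url: 'http://www.example.com/something' });
// a.uniqueKey === b.uniqueKey === 'http://www.example.com/something'

// Distinguish a form submission from a plain GET to the same URL
// and route it to a dedicated handler via `label`.
const formSubmit = new Request({
    url: 'http://www.example.com/something',
    method: 'POST',
    payload: JSON.stringify({ query: 'shoes' }),
    useExtendedUniqueKey: true, // uniqueKey becomes 'POST(<payloadHash>):http://www.example.com/something'
    label: 'FORM_RESULT',       // shortcut for userData: { label: 'FORM_RESULT' }
    userData: { note: 'kept across retries and failures' },
});
```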
--- # RequestProviderOptions ### Hierarchy * *RequestProviderOptions* * [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**proxyConfiguration](#proxyConfiguration) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L910)client **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L908)id **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L909)optionalname **name? : string ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L917)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Used to pass the proxy configuration for the `requestsFromUrl` objects. Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. --- # RequestQueueOperationOptions ### Hierarchy * *RequestQueueOperationOptions* * [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) * [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalforefront **forefront? : boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. --- # RequestQueueOptions * **@deprecated** Use [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) instead. ### Hierarchy * [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) * *RequestQueueOptions* ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**proxyConfiguration](#proxyConfiguration) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L910)inheritedclient **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) Inherited from RequestProviderOptions.client ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L908)inheritedid **id: string Inherited from RequestProviderOptions.id ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L909)optionalinheritedname **name? 
: string Inherited from RequestProviderOptions.name ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L917)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from RequestProviderOptions.proxyConfiguration Used to pass the proxy configuration for the `requestsFromUrl` objects. Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. --- # RequestTransform Takes an Apify [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) object and changes its attributes in a desired way. This user-function is used by `enqueueLinks` to modify requests before enqueuing them. ### Callable * ****RequestTransform**(original): undefined | null | false | [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)\ *** - *** #### Parameters * ##### original: [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)\ Request options to be modified. #### Returns undefined | null | false | [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)\ The modified request options to enqueue. --- # ResponseLike ## Index[**](#Index) ### Properties * [**headers](#headers) * [**url](#url) ## Properties[**](#Properties) ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L9)optionalheaders **headers? : Record\ | () => Record\ ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L8)optionalurl **url? : string | () => string --- # ResponseTypes Maps permitted values of the `responseType` option on [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) to the types that they produce. ## Index[**](#Index) ### Properties * [**buffer](#buffer) * [**json](#json) * [**text](#text) ## Properties[**](#Properties) ### [**](#buffer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L42)buffer **buffer: Buffer\ ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L40)json **json: unknown ### [**](#text)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L41)text **text: string --- # RestrictedCrawlingContext \ ### Hierarchy * Record\ * *RestrictedCrawlingContext* * [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) * [AdaptivePlaywrightCrawlerContext](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md) ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**enqueueLinks](#enqueueLinks) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**session](#session) * [**useState](#useState) ### Methods * [**pushData](#pushData) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)addRequests **addRequests: (requestsLike, options) => Promise\ Add requests directly to the request queue.
*** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L80)enqueueLinks **enqueueLinks: (options) => Promise\ This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Type declaration * * **(options): Promise\ - #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise\ ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L101)getKeyValueStore **getKeyValueStore: (idOrName) => Promise\> Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise\> - #### Parameters * ##### optionalidOrName: string #### Returns Promise\> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)id **id: string ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)log **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)request **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)useState **useState: \(defaultValue) => Promise\ Returns the state - a piece of mutable persistent data shared across all the request handler runs. 
*** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)pushData * ****pushData**(data, datasetIdOrName): Promise\ - This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ --- # RouterHandler \ Simple router that works based on request labels. This instance can then serve as a `requestHandler` of your crawler. ``` import { Router, CheerioCrawler, CheerioCrawlingContext } from 'crawlee'; const router = Router.create(); // we can also use factory methods for specific crawling contexts, the above equals to: // import { createCheerioRouter } from 'crawlee'; // const router = createCheerioRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.run(); ``` Alternatively we can use the default router instance from the crawler object: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler(); crawler.router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); crawler.router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); await crawler.run(); ``` For convenience, we can also define the routes right when creating the router: ``` import { CheerioCrawler, createCheerioRouter } from 'crawlee'; const crawler = new CheerioCrawler({ requestHandler: createCheerioRouter({ 'label-a': async (ctx) => { ... }, 'label-b': async (ctx) => { ... }, }), }); await crawler.run(); ``` Middlewares are also supported via the `router.use` method. There can be multiple middlewares for a single router; they will be executed sequentially in the same order as they were registered. ``` crawler.router.use(async (ctx) => { ctx.log.info('...'); }); ``` ### Hierarchy * [Router](https://crawlee.dev/js/api/core/class/Router.md)\ * *RouterHandler* ### Callable * ****RouterHandler**(ctx): Awaitable\ *** * #### Parameters * ##### ctx: Context #### Returns Awaitable\ ## Index[**](#Index) ### Methods * [**addDefaultHandler](#addDefaultHandler) * [**addHandler](#addHandler) * [**getHandler](#getHandler) * [**use](#use) ## Methods[**](#Methods) ### [**](#addDefaultHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L110)inheritedaddDefaultHandler * ****addDefaultHandler**\(handler): void - Inherited from Router.addDefaultHandler Registers the default route handler. *** #### Parameters * ##### handler: (ctx) => Awaitable\ #### Returns void ### [**](#addHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L99)inheritedaddHandler * ****addHandler**\(label, handler): void - Inherited from Router.addHandler Registers a new route handler for the given label.
*** #### Parameters * ##### label: string | symbol * ##### handler: (ctx) => Awaitable\ #### Returns void ### [**](#getHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L128)inheritedgetHandler * ****getHandler**(label): (ctx) => Awaitable\ - Inherited from Router.getHandler Returns the route handler for the given label. If no label is provided, the default request handler will be returned. *** #### Parameters * ##### optionallabel: string | symbol #### Returns (ctx) => Awaitable\ * * **(ctx): Awaitable\ - #### Parameters * ##### ctx: Context #### Returns Awaitable\ ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L121)inheriteduse * ****use**(middleware): void - Inherited from Router.use Registers a middleware that will be fired before the matching route handler. Multiple middlewares can be registered; they will be fired in the same order. *** #### Parameters * ##### middleware: (ctx) => Awaitable\ #### Returns void --- # SessionOptions ## Index[**](#Index) ### Properties * [**cookieJar](#cookieJar) * [**createdAt](#createdAt) * [**errorScore](#errorScore) * [**errorScoreDecrement](#errorScoreDecrement) * [**expiresAt](#expiresAt) * [**id](#id) * [**log](#log) * [**maxAgeSecs](#maxAgeSecs) * [**maxErrorScore](#maxErrorScore) * [**maxUsageCount](#maxUsageCount) * [**sessionPool](#sessionPool) * [**usageCount](#usageCount) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L91)optionalcookieJar **cookieJar? : CookieJar ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L68)optionalcreatedAt **createdAt? : Date Date of creation. ### [**](#errorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L90)optionalerrorScore **errorScore? : number ### [**](#errorScoreDecrement)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L65)optionalerrorScoreDecrement **errorScoreDecrement? : number = 0.5 It is used for healing the session. For example: if your session is marked bad two times, but it is successful on the third attempt, its errorScore is decremented by this number. ### [**](#expiresAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L71)optionalexpiresAt **expiresAt? : Date Date of expiration. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L39)optionalid **id? : string ID of the session used for generating fingerprints. It is also used as the proxy session name. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L89)optionallog **log? : [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#maxAgeSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L45)optionalmaxAgeSecs **maxAgeSecs? : number = 3000 Number of seconds after which the session is considered expired. ### [**](#maxErrorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L57)optionalmaxErrorScore **maxErrorScore? : number = 3 Maximum error score before the session is marked as blocked. If the `errorScore` reaches `maxErrorScore`, the session is marked as blocked and thrown away. It starts at 0. Calling the `markBad` function increases the `errorScore` by 1.
Calling the `markGood` function decreases the `errorScore` by `errorScoreDecrement`. ### [**](#maxUsageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L84)optionalmaxUsageCount **maxUsageCount? : number = 50 A session should be used only a limited number of times. This number indicates how many times the session is going to be used before it is thrown away. ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L87)optionalsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) SessionPool instance. The session will emit the `sessionRetired` event on this instance. ### [**](#usageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L77)optionalusageCount **usageCount? : number = 0 Indicates how many times the session has been used. ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L48)optionaluserData **userData? : Dictionary Object where custom user data can be stored. For example, custom headers. --- # SessionPoolOptions ## Index[**](#Index) ### Properties * [**blockedStatusCodes](#blockedStatusCodes) * [**createSessionFunction](#createSessionFunction) * [**maxPoolSize](#maxPoolSize) * [**persistenceOptions](#persistenceOptions) * [**persistStateKey](#persistStateKey) * [**persistStateKeyValueStoreId](#persistStateKeyValueStoreId) * [**sessionOptions](#sessionOptions) ## Properties[**](#Properties) ### [**](#blockedStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L61)optionalblockedStatusCodes **blockedStatusCodes? : number\[] = \[401, 403, 429] Specifies which response status codes are considered as blocked. The session used for such a request will be marked as retired. ### [**](#createSessionFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L54)optionalcreateSessionFunction **createSessionFunction? : [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) Custom function that should return a `Session` instance. Any error thrown from this function will terminate the process. The function receives the `SessionPool` instance as a parameter. ### [**](#maxPoolSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L35)optionalmaxPoolSize **maxPoolSize? : number = 1000 Maximum size of the pool. Indicates how many sessions are rotated. ### [**](#persistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L69)optionalpersistenceOptions **persistenceOptions? : [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Control how and when to persist the state of the session pool. ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L47)optionalpersistStateKey **persistStateKey? : string = SESSION\_POOL\_STATE The session pool persists its state under this key in the key-value store. ### [**](#persistStateKeyValueStoreId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L41)optionalpersistStateKeyValueStoreId **persistStateKeyValueStoreId? : string Name or ID of the `KeyValueStore` where the `SessionPool` state is stored.
### [**](#sessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L38)optionalsessionOptions **sessionOptions? : [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) The configuration options for [Session](https://crawlee.dev/js/api/core/class/Session.md) instances. --- # SessionState Persistable [Session](https://crawlee.dev/js/api/core/class/Session.md) state. ## Index[**](#Index) ### Properties * [**cookieJar](#cookieJar) * [**createdAt](#createdAt) * [**errorScore](#errorScore) * [**errorScoreDecrement](#errorScoreDecrement) * [**expiresAt](#expiresAt) * [**id](#id) * [**maxErrorScore](#maxErrorScore) * [**maxUsageCount](#maxUsageCount) * [**usageCount](#usageCount) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L26)cookieJar **cookieJar: SerializedCookieJar ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L34)createdAt **createdAt: string ### [**](#errorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L28)errorScore **errorScore: number ### [**](#errorScoreDecrement)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L30)errorScoreDecrement **errorScoreDecrement: number ### [**](#expiresAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L33)expiresAt **expiresAt: string ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L25)id **id: string ### [**](#maxErrorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L29)maxErrorScore **maxErrorScore: number ### [**](#maxUsageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L32)maxUsageCount **maxUsageCount: number ### [**](#usageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L31)usageCount **usageCount: number ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L27)userData **userData: object --- # SitemapRequestListOptions ### Hierarchy * UrlConstraints * *SitemapRequestListOptions* ## Index[**](#Index) ### Properties * [**config](#config) * [**exclude](#exclude) * [**globs](#globs) * [**maxBufferSize](#maxBufferSize) * [**parseSitemapOptions](#parseSitemapOptions) * [**persistenceOptions](#persistenceOptions) * [**persistStateKey](#persistStateKey) * [**proxyUrl](#proxyUrl) * [**regexps](#regexps) * [**signal](#signal) * [**sitemapUrls](#sitemapUrls) * [**timeoutMillis](#timeoutMillis) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L105)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Crawlee configuration ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L46)optionalinheritedexclude **exclude? : readonly (RegExp | [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput))\[] Inherited from UrlConstraints.exclude An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be included. 
The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L35)optionalinheritedglobs **globs? : readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] Inherited from UrlConstraints.globs An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. The plain objects must include at least the `glob` property, which holds the glob pattern string. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the `SitemapRequestList` includes all the URLs from the sitemap. ### [**](#maxBufferSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L97)optionalmaxBufferSize **maxBufferSize? : number = 200 Maximum number of buffered URLs for the sitemap loading stream. If the buffer is full, the stream will pause until the buffer is drained. ### [**](#parseSitemapOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L101)optionalparseSitemapOptions **parseSitemapOptions? : Omit<[ParseSitemapOptions](https://crawlee.dev/js/api/utils/interface/ParseSitemapOptions.md), emitNestedSitemaps | maxDepth> Advanced options for the underlying `parseSitemap` call. ### [**](#persistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L76)optionalpersistenceOptions **persistenceOptions? : { enable? : boolean } Persistence-related options to control how and when crawler's data gets persisted. *** #### Type declaration * ##### optionalenable?: boolean Use this flag to disable or enable periodic persistence to key value store. ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L72)optionalpersistStateKey **persistStateKey? : string Key for persisting the state of the request list in the `KeyValueStore`. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L68)optionalproxyUrl **proxyUrl? : string Proxy URL to be used for sitemap loading. ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L57)optionalinheritedregexps **regexps? : readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] Inherited from UrlConstraints.regexps An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the `SitemapRequestList` includes all the URLs from the sitemap. ### [**](#signal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L86)optionalsignal **signal? : AbortSignal Abort signal to be used for sitemap loading. ### [**](#sitemapUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L64)sitemapUrls **sitemapUrls: string\[] List of sitemap URLs to parse. 
### [**](#timeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L90)optionaltimeoutMillis **timeoutMillis? : number Timeout for sitemap loading in milliseconds. If both `signal` and `timeoutMillis` are provided, either of them can abort the loading. --- # SnapshotResult ## Index[**](#Index) ### Properties * [**htmlFileName](#htmlFileName) * [**screenshotFileName](#screenshotFileName) ## Properties[**](#Properties) ### [**](#htmlFileName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L18)optionalhtmlFileName **htmlFileName? : string ### [**](#screenshotFileName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L17)optionalscreenshotFileName **screenshotFileName? : string --- # SnapshotterOptions ## Index[**](#Index) ### Properties * [**clientSnapshotIntervalSecs](#clientSnapshotIntervalSecs) * [**eventLoopSnapshotIntervalSecs](#eventLoopSnapshotIntervalSecs) * [**maxBlockedMillis](#maxBlockedMillis) * [**maxClientErrors](#maxClientErrors) * [**maxUsedMemoryRatio](#maxUsedMemoryRatio) * [**snapshotHistorySecs](#snapshotHistorySecs) ## Properties[**](#Properties) ### [**](#clientSnapshotIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L31)optionalclientSnapshotIntervalSecs **clientSnapshotIntervalSecs? : number = 1 Defines the interval of checking the current state of the remote API client. ### [**](#eventLoopSnapshotIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L24)optionaleventLoopSnapshotIntervalSecs **eventLoopSnapshotIntervalSecs? : number = 0.5 Defines the interval of measuring the event loop response time. ### [**](#maxBlockedMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L38)optionalmaxBlockedMillis **maxBlockedMillis? : number = 50 Maximum allowed delay of the event loop in milliseconds. Exceeding this limit overloads the event loop. ### [**](#maxClientErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L52)optionalmaxClientErrors **maxClientErrors? : number = 1 Defines the maximum number of new rate limit errors within the given interval. ### [**](#maxUsedMemoryRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L45)optionalmaxUsedMemoryRatio **maxUsedMemoryRatio? : number = 0.9 Defines the maximum ratio of total memory that can be used. Exceeding this limit overloads the memory. ### [**](#snapshotHistorySecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L59)optionalsnapshotHistorySecs **snapshotHistorySecs? : number = 60 Sets the interval in seconds for which a history of resource snapshots will be kept. Increasing this to very high numbers will affect performance. 
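**Usage sketch:** `SnapshotterOptions` are usually not passed to a Snapshotter directly; a crawler forwards them through `autoscaledPoolOptions.snapshotterOptions`. The values below are illustrative only and simply make autoscaling react earlier to memory pressure.

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    autoscaledPoolOptions: {
        // The AutoscaledPool creates the Snapshotter internally;
        // these options only tune how system load is sampled.
        snapshotterOptions: {
            eventLoopSnapshotIntervalSecs: 0.5,
            maxBlockedMillis: 100,   // tolerate a slightly slower event loop
            maxUsedMemoryRatio: 0.8, // scale down earlier under memory pressure
            snapshotHistorySecs: 30,
        },
    },
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```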
--- # StatisticPersistedState Format of the persisted stats ### Hierarchy * Omit<[StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md), statsPersistedAt> * *StatisticPersistedState* ## Index[**](#Index) ### Properties * [**crawlerFinishedAt](#crawlerFinishedAt) * [**crawlerLastStartTimestamp](#crawlerLastStartTimestamp) * [**crawlerRuntimeMillis](#crawlerRuntimeMillis) * [**crawlerStartedAt](#crawlerStartedAt) * [**errors](#errors) * [**requestAvgFailedDurationMillis](#requestAvgFailedDurationMillis) * [**requestAvgFinishedDurationMillis](#requestAvgFinishedDurationMillis) * [**requestMaxDurationMillis](#requestMaxDurationMillis) * [**requestMinDurationMillis](#requestMinDurationMillis) * [**requestRetryHistogram](#requestRetryHistogram) * [**requestsFailed](#requestsFailed) * [**requestsFailedPerMinute](#requestsFailedPerMinute) * [**requestsFinished](#requestsFinished) * [**requestsFinishedPerMinute](#requestsFinishedPerMinute) * [**requestsRetries](#requestsRetries) * [**requestsTotal](#requestsTotal) * [**requestsWithStatusCode](#requestsWithStatusCode) * [**requestTotalDurationMillis](#requestTotalDurationMillis) * [**requestTotalFailedDurationMillis](#requestTotalFailedDurationMillis) * [**requestTotalFinishedDurationMillis](#requestTotalFinishedDurationMillis) * [**retryErrors](#retryErrors) * [**statsId](#statsId) * [**statsPersistedAt](#statsPersistedAt) ## Properties[**](#Properties) ### [**](#crawlerFinishedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L507)inheritedcrawlerFinishedAt **crawlerFinishedAt: null | string | Date Inherited from Omit.crawlerFinishedAt ### [**](#crawlerLastStartTimestamp)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L489)crawlerLastStartTimestamp **crawlerLastStartTimestamp: number ### [**](#crawlerRuntimeMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L508)inheritedcrawlerRuntimeMillis **crawlerRuntimeMillis: number Inherited from Omit.crawlerRuntimeMillis ### [**](#crawlerStartedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L506)inheritedcrawlerStartedAt **crawlerStartedAt: null | string | Date Inherited from Omit.crawlerStartedAt ### [**](#errors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L510)inheritederrors **errors: Record\ Inherited from Omit.errors ### [**](#requestAvgFailedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L485)requestAvgFailedDurationMillis **requestAvgFailedDurationMillis: number ### [**](#requestAvgFinishedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L486)requestAvgFinishedDurationMillis **requestAvgFinishedDurationMillis: number ### [**](#requestMaxDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L503)inheritedrequestMaxDurationMillis **requestMaxDurationMillis: number Inherited from Omit.requestMaxDurationMillis ### [**](#requestMinDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L502)inheritedrequestMinDurationMillis **requestMinDurationMillis: number Inherited from Omit.requestMinDurationMillis ### 
[**](#requestRetryHistogram)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L483)requestRetryHistogram **requestRetryHistogram: number\[] ### [**](#requestsFailed)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L498)inheritedrequestsFailed **requestsFailed: number Inherited from Omit.requestsFailed ### [**](#requestsFailedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L500)inheritedrequestsFailedPerMinute **requestsFailedPerMinute: number Inherited from Omit.requestsFailedPerMinute ### [**](#requestsFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L497)inheritedrequestsFinished **requestsFinished: number Inherited from Omit.requestsFinished ### [**](#requestsFinishedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L501)inheritedrequestsFinishedPerMinute **requestsFinishedPerMinute: number Inherited from Omit.requestsFinishedPerMinute ### [**](#requestsRetries)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L499)inheritedrequestsRetries **requestsRetries: number Inherited from Omit.requestsRetries ### [**](#requestsTotal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L488)requestsTotal **requestsTotal: number ### [**](#requestsWithStatusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L512)inheritedrequestsWithStatusCode **requestsWithStatusCode: Record\ Inherited from Omit.requestsWithStatusCode ### [**](#requestTotalDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L487)requestTotalDurationMillis **requestTotalDurationMillis: number ### [**](#requestTotalFailedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L504)inheritedrequestTotalFailedDurationMillis **requestTotalFailedDurationMillis: number Inherited from Omit.requestTotalFailedDurationMillis ### [**](#requestTotalFinishedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L505)inheritedrequestTotalFinishedDurationMillis **requestTotalFinishedDurationMillis: number Inherited from Omit.requestTotalFinishedDurationMillis ### [**](#retryErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L511)inheritedretryErrors **retryErrors: Record\ Inherited from Omit.retryErrors ### [**](#statsId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L484)statsId **statsId: number ### [**](#statsPersistedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L490)statsPersistedAt **statsPersistedAt: string --- # StatisticsOptions Configuration for the [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) instance used by the crawler ## Index[**](#Index) ### Properties * [**config](#config) * [**keyValueStore](#keyValueStore) * [**log](#log) * [**logIntervalSecs](#logIntervalSecs) * [**logMessage](#logMessage) * [**persistenceOptions](#persistenceOptions) * [**saveErrorSnapshots](#saveErrorSnapshots) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L465)optionalconfig **config? 
: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Configuration instance to use. ### [**](#keyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L459)optionalkeyValueStore **keyValueStore? : [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) Key value store instance to persist the statistics. If not provided, the default one will be used when capturing starts. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L453)optionallog **log? : [Log](https://crawlee.dev/js/api/core/class/Log.md) = [Log](https://crawlee.dev/js/api/core/class/Log.md) Parent logger instance; the statistics will create a child logger from it. ### [**](#logIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L441)optionallogIntervalSecs **logIntervalSecs? : number = 60 Interval in seconds to log the current statistics. ### [**](#logMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L447)optionallogMessage **logMessage? : string = 'Statistics' Message to log with the current statistics. ### [**](#persistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L470)optionalpersistenceOptions **persistenceOptions? : [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Control how and when to persist the statistics. ### [**](#saveErrorSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L476)optionalsaveErrorSnapshots **saveErrorSnapshots? : boolean = false Save an HTML snapshot (and a screenshot, if possible) when an error occurs. 
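Crawlers built on top of `BasicCrawler` manage their own [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) instance, so these options are usually not constructed by hand. The sketch below assumes the crawler forwards a `statisticsOptions` constructor option to that instance (present in recent Crawlee versions; check yours before relying on it), and the concrete values are illustrative only:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Assumption: `statisticsOptions` is forwarded to the crawler's internal Statistics instance.
    statisticsOptions: {
        logIntervalSecs: 300, // log the statistics every 5 minutes instead of the default 60 s
        logMessage: 'Crawl statistics', // custom message to log with the statistics
        saveErrorSnapshots: true, // persist an HTML snapshot (and a screenshot if possible) on errors
    },
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```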
--- # StatisticState Contains the statistics state ## Index[**](#Index) ### Properties * [**crawlerFinishedAt](#crawlerFinishedAt) * [**crawlerRuntimeMillis](#crawlerRuntimeMillis) * [**crawlerStartedAt](#crawlerStartedAt) * [**errors](#errors) * [**requestMaxDurationMillis](#requestMaxDurationMillis) * [**requestMinDurationMillis](#requestMinDurationMillis) * [**requestsFailed](#requestsFailed) * [**requestsFailedPerMinute](#requestsFailedPerMinute) * [**requestsFinished](#requestsFinished) * [**requestsFinishedPerMinute](#requestsFinishedPerMinute) * [**requestsRetries](#requestsRetries) * [**requestsWithStatusCode](#requestsWithStatusCode) * [**requestTotalFailedDurationMillis](#requestTotalFailedDurationMillis) * [**requestTotalFinishedDurationMillis](#requestTotalFinishedDurationMillis) * [**retryErrors](#retryErrors) * [**statsPersistedAt](#statsPersistedAt) ## Properties[**](#Properties) ### [**](#crawlerFinishedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L507)crawlerFinishedAt **crawlerFinishedAt: null | string | Date ### [**](#crawlerRuntimeMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L508)crawlerRuntimeMillis **crawlerRuntimeMillis: number ### [**](#crawlerStartedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L506)crawlerStartedAt **crawlerStartedAt: null | string | Date ### [**](#errors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L510)errors **errors: Record\ ### [**](#requestMaxDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L503)requestMaxDurationMillis **requestMaxDurationMillis: number ### [**](#requestMinDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L502)requestMinDurationMillis **requestMinDurationMillis: number ### [**](#requestsFailed)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L498)requestsFailed **requestsFailed: number ### [**](#requestsFailedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L500)requestsFailedPerMinute **requestsFailedPerMinute: number ### [**](#requestsFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L497)requestsFinished **requestsFinished: number ### [**](#requestsFinishedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L501)requestsFinishedPerMinute **requestsFinishedPerMinute: number ### [**](#requestsRetries)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L499)requestsRetries **requestsRetries: number ### [**](#requestsWithStatusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L512)requestsWithStatusCode **requestsWithStatusCode: Record\ ### [**](#requestTotalFailedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L504)requestTotalFailedDurationMillis **requestTotalFailedDurationMillis: number ### [**](#requestTotalFinishedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L505)requestTotalFinishedDurationMillis **requestTotalFinishedDurationMillis: number ### 
[**](#retryErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L511)retryErrors **retryErrors: Record\ ### [**](#statsPersistedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L509)statsPersistedAt **statsPersistedAt: null | string | Date --- # StorageClient Represents a storage capable of working with datasets, KV stores and request queues. ### Implemented by * [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) ## Index[**](#Index) ### Properties * [**stats](#stats) ### Methods * [**dataset](#dataset) * [**datasets](#datasets) * [**keyValueStore](#keyValueStore) * [**keyValueStores](#keyValueStores) * [**purge](#purge) * [**requestQueue](#requestQueue) * [**requestQueues](#requestQueues) * [**setStatusMessage](#setStatusMessage) * [**teardown](#teardown) ## Properties[**](#Properties) ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L333)optionalstats **stats? : { rateLimitErrors: number\[] } #### Type declaration * ##### rateLimitErrors: number\[] ## Methods[**](#Methods) ### [**](#dataset)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L325)dataset * ****dataset**(id): [DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md)\ - #### Parameters * ##### id: string #### Returns [DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md)\ ### [**](#datasets)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L324)datasets * ****datasets**(): [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) - #### Returns [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) ### [**](#keyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L327)keyValueStore * ****keyValueStore**(id): [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) - #### Parameters * ##### id: string #### Returns [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) ### [**](#keyValueStores)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L326)keyValueStores * ****keyValueStores**(): [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) - #### Returns [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) ### [**](#purge)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L330)optionalpurge * ****purge**(): Promise\ - #### Returns Promise\ ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L329)requestQueue * ****requestQueue**(id, options): [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) - #### Parameters * ##### id: string * ##### optionaloptions: [RequestQueueOptions](https://crawlee.dev/js/api/types/interface/RequestQueueOptions.md) #### Returns [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) ### [**](#requestQueues)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L328)requestQueues * ****requestQueues**(): [RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) - #### Returns 
[RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L332)optionalsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - #### Parameters * ##### message: string * ##### optionaloptions: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) #### Returns Promise\ ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L331)optionalteardown * ****teardown**(): Promise\ - #### Returns Promise\ --- # StorageManagerOptions ## Index[**](#Index) ### Properties * [**config](#config) * [**proxyConfiguration](#proxyConfiguration) * [**storageClient](#storageClient) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L160)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) SDK configuration instance, defaults to the static register. ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L172)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Used to pass the proxy configuration for the `requestsFromUrl` objects. Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. ### [**](#storageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L165)optionalstorageClient **storageClient? : [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) Optional storage client that should be used to open storages. --- # StreamingHttpResponse HTTP response data as returned by the [BaseHttpClient.stream](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md#stream) method. ### Hierarchy * HttpResponseWithoutBody * *StreamingHttpResponse* ## Index[**](#Index) ### Properties * [**complete](#complete) * [**downloadProgress](#downloadProgress) * [**headers](#headers) * [**ip](#ip) * [**redirectUrls](#redirectUrls) * [**request](#request) * [**statusCode](#statusCode) * [**statusMessage](#statusMessage) * [**stream](#stream) * [**trailers](#trailers) * [**uploadProgress](#uploadProgress) * [**url](#url) ## Properties[**](#Properties) ### [**](#complete)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L141)inheritedcomplete **complete: boolean Inherited from HttpResponseWithoutBody.complete ### [**](#downloadProgress)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L164)readonlydownloadProgress **downloadProgress: Progress ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L138)inheritedheaders **headers: SimpleHeaders Inherited from HttpResponseWithoutBody.headers ### [**](#ip)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L134)optionalinheritedip **ip? 
: string Inherited from HttpResponseWithoutBody.ip ### [**](#redirectUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L131)inheritedredirectUrls **redirectUrls: URL\[] Inherited from HttpResponseWithoutBody.redirectUrls ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L146)inheritedrequest **request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md)> Inherited from HttpResponseWithoutBody.request ### [**](#statusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L135)inheritedstatusCode **statusCode: number Inherited from HttpResponseWithoutBody.statusCode ### [**](#statusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L136)optionalinheritedstatusMessage **statusMessage? : string Inherited from HttpResponseWithoutBody.statusMessage ### [**](#stream)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L163)stream **stream: Readable ### [**](#trailers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L139)inheritedtrailers **trailers: SimpleHeaders Inherited from HttpResponseWithoutBody.trailers ### [**](#uploadProgress)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L165)readonlyuploadProgress **uploadProgress: Progress ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L132)inheritedurl **url: string Inherited from HttpResponseWithoutBody.url --- # SystemInfo Represents the current status of the system. ## Index[**](#Index) ### Properties * [**clientInfo](#clientInfo) * [**cpuInfo](#cpuInfo) * [**eventLoopInfo](#eventLoopInfo) * [**isSystemIdle](#isSystemIdle) * [**memCurrentBytes](#memCurrentBytes) * [**memInfo](#memInfo) ## Properties[**](#Properties) ### [**](#clientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L16)clientInfo **clientInfo: [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#cpuInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L15)cpuInfo **cpuInfo: [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#eventLoopInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L14)eventLoopInfo **eventLoopInfo: [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#isSystemIdle)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L12)isSystemIdle **isSystemIdle: boolean If false, system is being overloaded. ### [**](#memCurrentBytes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L17)optionalmemCurrentBytes **memCurrentBytes? 
: number ### [**](#memInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L13)memInfo **memInfo: [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) --- # SystemStatusOptions ## Index[**](#Index) ### Properties * [**currentHistorySecs](#currentHistorySecs) * [**maxClientOverloadedRatio](#maxClientOverloadedRatio) * [**maxCpuOverloadedRatio](#maxCpuOverloadedRatio) * [**maxEventLoopOverloadedRatio](#maxEventLoopOverloadedRatio) * [**maxMemoryOverloadedRatio](#maxMemoryOverloadedRatio) * [**snapshotter](#snapshotter) ## Properties[**](#Properties) ### [**](#currentHistorySecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L40)optionalcurrentHistorySecs **currentHistorySecs? : number = 5 Defines max age of snapshots used in the [SystemStatus.getCurrentStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getCurrentStatus) measurement. ### [**](#maxClientOverloadedRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L68)optionalmaxClientOverloadedRatio **maxClientOverloadedRatio? : number = 0.3 Sets the maximum ratio of overloaded snapshots in a Client sample. If the sample exceeds this ratio, the system will be overloaded. ### [**](#maxCpuOverloadedRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L61)optionalmaxCpuOverloadedRatio **maxCpuOverloadedRatio? : number = 0.4 Sets the maximum ratio of overloaded snapshots in a CPU sample. If the sample exceeds this ratio, the system will be overloaded. ### [**](#maxEventLoopOverloadedRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L54)optionalmaxEventLoopOverloadedRatio **maxEventLoopOverloadedRatio? : number = 0.6 Sets the maximum ratio of overloaded snapshots in an event loop sample. If the sample exceeds this ratio, the system will be overloaded. ### [**](#maxMemoryOverloadedRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L47)optionalmaxMemoryOverloadedRatio **maxMemoryOverloadedRatio? : number = 0.2 Sets the maximum ratio of overloaded snapshots in a memory sample. If the sample exceeds this ratio, the system will be overloaded. ### [**](#snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L73)optionalsnapshotter **snapshotter? : [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) The `Snapshotter` instance to be queried for `SystemStatus`. --- # TieredProxy ## Index[**](#Index) ### Properties * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) ## Properties[**](#Properties) ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L47)optionalproxyTier **proxyTier? : number ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L46)proxyUrl **proxyUrl: null | string --- # UseStateOptions ## Index[**](#Index) ### Properties * [**config](#config) * [**keyValueStoreName](#keyValueStoreName) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L70)optionalconfig **config? 
: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#keyValueStoreName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L75)optionalkeyValueStoreName **keyValueStoreName? : null | string The name of the key-value store you'd like the state to be stored in. If not provided, the default store will be used. --- # @crawlee/http Provides a framework for the parallel crawling of web pages using plain HTTP requests. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. It is very fast and efficient in terms of data bandwidth. However, if the target website requires JavaScript to display its content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because those crawlers load the pages in a full-featured headless browser. **This crawler downloads each URL using a plain HTTP request and doesn't do any HTML parsing.** The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) or [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) constructor options, respectively. If both [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) and [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`:

```
preNavigationHooks: [
    (crawlingContext, gotOptions) => {
        // ...
    },
]
```

By default, `HttpCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [HttpCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. 
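Putting the hooks and content-type options above together, here is a minimal sketch of an `HttpCrawler` that also accepts JSON responses and tweaks the underlying `gotOptions`; the target URL and the timeout value are illustrative only:

```
import { HttpCrawler } from '@crawlee/http';

const crawler = new HttpCrawler({
    // Accept JSON responses in addition to the default text/html and application/xhtml+xml.
    additionalMimeTypes: ['application/json'],
    // Adjust the got-scraping request options before each navigation.
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            gotOptions.timeout = { request: 30000 }; // illustrative 30 s request timeout
        },
    ],
    async requestHandler({ request, body, log }) {
        log.info(`Fetched ${request.url} (${body.length} bytes)`);
    },
});

await crawler.run(['https://example.com/api/items.json']);
```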
All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `HttpCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `HttpCrawler` constructor. ## Example usage[​](#example-usage "Direct link to Example usage") ``` import { HttpCrawler, Dataset } from '@crawlee/http'; const crawler = new HttpCrawler({ requestList, async requestHandler({ request, response, body, contentType }) { // Save the data to dataset. await Dataset.pushData({ url: request.url, html: body, }); }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ## Index[**](#Index) ### Crawlers * [**HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/http-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/http-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/http-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/http-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/http-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/http-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/http-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/http-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/http-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/http-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/http-crawler.md#BLOCKED_STATUS_CODES) * [**checkStorageAccess](https://crawlee.dev/js/api/http-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/http-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/http-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/http-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/http-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/http-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/http-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/http-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/http-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/http-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/http-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/http-crawler.md#CreateContextOptions) * [**CreateSession](https://crawlee.dev/js/api/http-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/http-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/http-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/http-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/http-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetDataOptions) * 
[**DatasetExportOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/http-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/http-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/http-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/http-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/http-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/http-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/http-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/http-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/http-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/http-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/http-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/http-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/http-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/http-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/http-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/http-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/http-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/http-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/http-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/http-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/http-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/http-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/http-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/http-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/http-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/http-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/http-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/http-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/http-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/http-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/http-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/http-crawler.md#log) * [**Log](https://crawlee.dev/js/api/http-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/http-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/http-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/http-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/http-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/http-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/http-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/http-crawler.md#NonRetryableError) * 
[**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/http-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/http-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/http-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/http-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/http-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/http-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/http-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/http-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/http-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/http-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/http-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/http-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/http-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/http-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/http-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/http-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/http-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/http-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/http-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/http-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/http-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/http-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/http-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/http-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/http-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/http-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/http-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/http-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/http-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/http-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/http-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/http-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/http-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/http-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/http-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/http-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/http-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/http-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/http-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/http-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/http-crawler.md#ResponseTypes) * 
[**RestrictedCrawlingContext](https://crawlee.dev/js/api/http-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/http-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/http-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/http-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/http-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/http-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/http-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/http-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/http-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/http-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/http-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/http-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/http-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/http-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/http-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/http-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/http-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/http-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/http-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/http-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/http-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/http-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/http-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/http-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/http-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/http-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/http-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/http-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/http-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/http-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/http-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/http-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/http-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/http-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/http-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/http-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/http-crawler.md#withCheckedStorageAccess) * [**FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) * [**FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) * [**HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) * [**HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) * 
[**FileDownloadErrorHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadErrorHandler) * [**FileDownloadHook](https://crawlee.dev/js/api/http-crawler.md#FileDownloadHook) * [**FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions) * [**FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler) * [**HttpErrorHandler](https://crawlee.dev/js/api/http-crawler.md#HttpErrorHandler) * [**HttpHook](https://crawlee.dev/js/api/http-crawler.md#HttpHook) * [**HttpRequestHandler](https://crawlee.dev/js/api/http-crawler.md#HttpRequestHandler) * [**StreamHandlerContext](https://crawlee.dev/js/api/http-crawler.md#StreamHandlerContext) * [**ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) * [**createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) * [**createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) * [**MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### 
[**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### 
[**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### 
[**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports 
[HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### 
[**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports 
[PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports 
[RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports 
[RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### 
[**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### 
[**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#FileDownloadErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L20)FileDownloadErrorHandler **FileDownloadErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#FileDownloadHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L47)FileDownloadHook **FileDownloadHook\: InternalHttpHook<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#FileDownloadOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L34)FileDownloadOptions **FileDownloadOptions\: (Omit<[HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\>, requestHandler> & { requestHandler? : never; streamHandler? : StreamHandler }) | (Omit<[HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\>, requestHandler> & { requestHandler: [FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler); streamHandler? 
: never }) #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#FileDownloadRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L57)FileDownloadRequestHandler **FileDownloadRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#HttpErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L75)HttpErrorHandler **HttpErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: JsonValue = any ### [**](#HttpHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L194)HttpHook **HttpHook\: InternalHttpHook<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: JsonValue = any ### [**](#HttpRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L258)HttpRequestHandler **HttpRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: JsonValue = any ### [**](#StreamHandlerContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L25)StreamHandlerContext **StreamHandlerContext: Omit<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md), body | parseWithCheerio | json | addRequests | contentType> & { stream: Request } --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/http ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/http ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/http # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/http ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/http # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * retry on blocked status codes in `HttpCrawler` ([#3060](https://github.com/apify/crawlee/issues/3060)) ([b5fcd79](https://github.com/apify/crawlee/commit/b5fcd79324ed61c6591fbdc9ffba67b35dc54fde)), closes [/github.com/apify/crawlee/blob/f68d2a95d67cc6230122dc1a5226c57ca23d0ae7/packages/browser-crawler/src/internals/browser-crawler.ts#L481-L486](https://github.com//github.com/apify/crawlee/blob/f68d2a95d67cc6230122dc1a5226c57ca23d0ae7/packages/browser-crawler/src/internals/browser-crawler.ts/issues/L481-L486) [#3029](https://github.com/apify/crawlee/issues/3029) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/http ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Features[​](#features "Direct link to Features") * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") **Note:** Version bump only for package @crawlee/http ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/http ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * enable full cookie support for `ImpitHttpClient` ([#2991](https://github.com/apify/crawlee/issues/2991)) ([120f0a7](https://github.com/apify/crawlee/commit/120f0a7968670eaab14d217e12c09b4dba216d7d)) ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") ### Features[​](#features-1 "Direct link to Features") * add `MinimumSpeedStream` and `ByteCounterStream` helpers ([#2970](https://github.com/apify/crawlee/issues/2970)) ([921c4ee](https://github.com/apify/crawlee/commit/921c4ee3401bd41b8a197b955474bc297152e58b)) ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) 
(2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/http ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Features[​](#features-2 "Direct link to Features") * pass `response` to `FileDownload.streamHandler` ([#2930](https://github.com/apify/crawlee/issues/2930)) ([008c4c7](https://github.com/apify/crawlee/commit/008c4c7d879195a492bbbdb9dcda23acad4d51e1)) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/http ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * treat `406` as other `4xx` status codes in `HttpCrawler` ([#2907](https://github.com/apify/crawlee/issues/2907)) ([b0e6f6d](https://github.com/apify/crawlee/commit/b0e6f6d3fc4455de467baf666e0f67f8738cc57f)), closes [#2892](https://github.com/apify/crawlee/issues/2892) ### Features[​](#features-3 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/http ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/http ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/http # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Features[​](#features-4 "Direct link to Features") * allow using other HTTP clients ([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * **http-crawler:** avoid crashing when gotOptions.cache is on ([#2686](https://github.com/apify/crawlee/issues/2686)) ([1106d3a](https://github.com/apify/crawlee/commit/1106d3aeccd9d1aca8b2630d720d3ea6a1c955f6)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/http ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/http ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/http ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/http # 
[3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/http ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * make `crawler.log` publicly accessible ([#2526](https://github.com/apify/crawlee/issues/2526)) ([3e9e665](https://github.com/apify/crawlee/commit/3e9e6652c0b5e4d0c2707985abbad7d80336b9af)) ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-5 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/http ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/http # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Features[​](#features-6 "Direct link to Features") * add `FileDownload` "crawler" ([#2435](https://github.com/apify/crawlee/issues/2435)) ([d73756b](https://github.com/apify/crawlee/commit/d73756bb225d9ed8f58cf0a3b2e0ce96f6188863)) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/http ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/http # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Features[​](#features-7 "Direct link to Features") * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/http ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/http # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) **Note:** Version bump only for package @crawlee/http ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/http ## 
[3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/http ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/http # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Features[​](#features-8 "Direct link to Features") * check enqueue link strategy post redirect ([#2238](https://github.com/apify/crawlee/issues/2238)) ([3c5f9d6](https://github.com/apify/crawlee/commit/3c5f9d6056158e042e12d75b2b1b21ef6c32e618)), closes [#2173](https://github.com/apify/crawlee/issues/2173) * log cause with `retryOnBlocked` ([#2252](https://github.com/apify/crawlee/issues/2252)) ([e19a773](https://github.com/apify/crawlee/commit/e19a773693cfc5e65c1e2321bfc8b73c9844ea8b)), closes [#2249](https://github.com/apify/crawlee/issues/2249) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/http ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/http # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * retry incorrect Content-Type when response has blocked status code ([#2176](https://github.com/apify/crawlee/issues/2176)) ([b54fb8b](https://github.com/apify/crawlee/commit/b54fb8bb7bc3575195ee676d21e5feb8f898ef47)), closes [#1994](https://github.com/apify/crawlee/issues/1994) ### Features[​](#features-9 "Direct link to Features") * got-scraping v4 ([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/http ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/http ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/http ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") **Note:** Version bump only for package @crawlee/http ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/http ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") ### Bug 
Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * support `DELETE` requests in `HttpCrawler` ([#2039](https://github.com/apify/crawlee/issues/2039)) ([7ea5c41](https://github.com/apify/crawlee/commit/7ea5c4185b169ec933dcd8df2e85824a7e452913)), closes [#1658](https://github.com/apify/crawlee/issues/1658) ### Features[​](#features-10 "Direct link to Features") * Add options for custom HTTP error status codes ([#2035](https://github.com/apify/crawlee/issues/2035)) ([b50ef1a](https://github.com/apify/crawlee/commit/b50ef1ad51d6d7c7a71e7f40efdb2b1ef0f09291)), closes [#1711](https://github.com/apify/crawlee/issues/1711) ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * log original error message on session rotation ([#2022](https://github.com/apify/crawlee/issues/2022)) ([8a11ffb](https://github.com/apify/crawlee/commit/8a11ffbdaef6b2fe8603aac570c3038f84c2f203)) # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-11 "Direct link to Features") * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Features[​](#features-12 "Direct link to Features") * retryOnBlocked detects blocked webpage ([#1956](https://github.com/apify/crawlee/issues/1956)) ([766fa9b](https://github.com/apify/crawlee/commit/766fa9b88029e9243a7427075384c1abe85c70c8)) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **http-crawler:** replace `IncomingMessage` with `PlainResponse` for context's `response` ([#1973](https://github.com/apify/crawlee/issues/1973)) ([2a1cc7f](https://github.com/apify/crawlee/commit/2a1cc7f4f87f0b1c657759076a236a8f8d9b76ba)), closes [#1964](https://github.com/apify/crawlee/issues/1964) # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/http ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/http ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-13 "Direct link to Features") * **HttpCrawler:** add `parseWithCheerio` helper to `HttpCrawler` ([#1906](https://github.com/apify/crawlee/issues/1906)) ([ff5f76f](https://github.com/apify/crawlee/commit/ff5f76f9336c47c555c28038cdc72dc650bb5065)) * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * **jsdom:** use no-op `enqueueLinks` in http crawlers when parsing fails 
([fd35270](https://github.com/apify/crawlee/commit/fd35270e7da67a77eb60108e19294f0fd2016706)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) **Note:** Version bump only for package @crawlee/http ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/http ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/http # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/http ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/http ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/http ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/http # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/http ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/http --- # FileDownload Provides a framework for downloading files in parallel using plain HTTP requests. The URLs to download are fed either from a static list of URLs or they can be added on the fly from another crawler. Since `FileDownload` uses raw HTTP requests to download the files, it is very fast and bandwidth-efficient. However, it doesn't parse the content - if you need to e.g. extract data from the downloaded files, you might need to use [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead. `FileDownload` downloads each URL using a plain HTTP request and then invokes the user-provided FileDownloadOptions.requestHandler where the user can specify what to do with the downloaded data. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the FileDownloadOptions.requestList or FileDownloadOptions.requestQueue constructor options, respectively. If both FileDownloadOptions.requestList and FileDownloadOptions.requestQueue are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing.
This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`:

```
preNavigationHooks: [
    (crawlingContext, gotOptions) => {
        // ...
    },
]
```

New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `FileDownload` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `FileDownload` constructor.

## Example usage

```
import { writeFileSync } from 'node:fs';
import { FileDownload } from '@crawlee/http';

const crawler = new FileDownload({
    requestHandler({ body, request }) {
        // Save each file under a name derived from its URL.
        writeFileSync(request.url.replace(/[^a-z0-9\.]/gi, '_'), body);
    },
});

await crawler.run([
    'http://www.example.com/document.pdf',
    'http://www.example.com/sound.mp3',
    'http://www.example.com/video.mkv',
]);
```

### Hierarchy * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)> * *FileDownload* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L187)constructor * ****new FileDownload**(options): [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) - Overrides HttpCrawler.constructor #### Parameters * ##### options: [FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions)\ = {} #### Returns [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from HttpCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function.
We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from HttpCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from HttpCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from HttpCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\, request>> = ... Inherited from HttpCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). 
See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from HttpCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from HttpCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from HttpCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from HttpCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from HttpCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. 
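For illustration only (this sketch is not part of the generated reference above): a crawler can record one dataset item per downloaded file and then export the default dataset with `exportData`. The output file name `downloads.csv` and the recorded fields are arbitrary example choices:

```
import { FileDownload } from '@crawlee/http';

const crawler = new FileDownload({
    async requestHandler({ request, body, pushData }) {
        // Record one dataset item per downloaded file.
        await pushData({ url: request.url, bytes: body.length });
    },
});

await crawler.run(['http://www.example.com/document.pdf']);

// The format is inferred from the file extension, so this writes CSV.
await crawler.exportData('./downloads.csv');
```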
*** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from HttpCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from HttpCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from HttpCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from HttpCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from HttpCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. 
* ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from HttpCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from HttpCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)inheriteduse * ****use**(extension): void - Inherited from HttpCrawler.use **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from HttpCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # HttpCrawler Provides a framework for the parallel crawling of web pages using plain HTTP requests. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. It is very fast and bandwidth-efficient. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless Chrome browser. This crawler downloads each URL using a plain HTTP request and doesn't do any HTML parsing. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) or [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) constructor options, respectively.
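As a minimal sketch (not taken from the reference text; the URL below is a placeholder), a queue can be opened explicitly and passed in through the `requestQueue` option:

```
import { HttpCrawler, RequestQueue } from '@crawlee/http';

// Open the default request queue and seed it with a starting URL.
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'http://www.example.com/page-1' });

const crawler = new HttpCrawler({
    requestQueue,
    async requestHandler({ request, body }) {
        // Process the raw response body here.
        console.log(`Fetched ${request.url} (${body.length} bytes)`);
    },
});

await crawler.run();
```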
If both [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) and [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, this crawler only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [HttpCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For details, see [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the constructor. **Example usage:** ``` import { HttpCrawler, Dataset } from '@crawlee/http'; const crawler = new HttpCrawler({ requestList, async requestHandler({ request, response, body, contentType }) { // Save the data to dataset. 
await Dataset.pushData({ url: request.url, html: body, }); }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ### Hierarchy * [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\ * *HttpCrawler* * [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) * [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) * [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) * [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L373)constructor * ****new HttpCrawler**\(options, config): [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)\ - Overrides BasicCrawler.constructor All `HttpCrawlerOptions` parameters are passed via an options object. *** #### Parameters * ##### options: [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)\ = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)\ ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from BasicCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from BasicCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from BasicCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BasicCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BasicCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BasicCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\> = ... Inherited from BasicCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from BasicCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from BasicCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. 
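As a hedged sketch of the `router` property described above, handlers can be registered on the default router instead of passing a single `requestHandler`; the label and URLs are illustrative:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler();

// Requests without a label fall through to the default handler.
crawler.router.addDefaultHandler(async ({ request, enqueueLinks, log }) => {
    log.info(`Listing page: ${request.url}`);
    await enqueueLinks({ label: 'DETAIL' });
});

// Requests enqueued with the 'DETAIL' label end up here.
crawler.router.addHandler('DETAIL', async ({ request, pushData }) => {
    await pushData({ url: request.url });
});

await crawler.run(['https://example.com']);
```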
### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from BasicCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from BasicCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from BasicCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from BasicCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). 
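A hedged sketch of how `addRequests()`, `run()` and `exportData()` from this section fit together; the URLs and output file name are illustrative:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, pushData }) {
        await pushData({ url: request.url });
    },
});

// Resolves after the first batch; the rest is added in the background.
const { waitForAllRequestsToBeAdded } = await crawler.addRequests([
    'https://example.com/a',
    'https://example.com/b',
]);
await waitForAllRequestsToBeAdded;

await crawler.run();

// The 'csv' format is inferred from the file extension.
await crawler.exportData('results.csv');
```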
*** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from BasicCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from BasicCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BasicCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from BasicCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from BasicCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. 
*** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from BasicCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)use * ****use**(extension): void - **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from BasicCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # ByteCounterStream ### Callable * ****ByteCounterStream**(\_\_namedParameters): Transform *** * Creates a transform stream that logs the progress of the incoming data. This `Transform` calls the `logProgress` function every `loggingInterval` milliseconds with the number of bytes received so far. Can be used e.g. to log the progress of a download. *** #### Parameters * ##### \_\_namedParameters: { loggingInterval?: number; logTransferredBytes: (transferredBytes) => void } * ##### optionalloggingInterval: number = 5000 * ##### logTransferredBytes: (transferredBytes) => void #### Returns Transform Transform stream logging the progress of the incoming data. --- # createFileRouter ### Callable * ****createFileRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md). Defaults to the [FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. ``` import { FileDownload, createFileRouter } from 'crawlee'; const router = createFileRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new FileDownload({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # createHttpRouter ### Callable * ****createHttpRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md). 
Defaults to the [HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. ``` import { HttpCrawler, createHttpRouter } from 'crawlee'; const router = createHttpRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new HttpCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # MinimumSpeedStream ### Callable * ****MinimumSpeedStream**(\_\_namedParameters): Transform *** * Creates a transform stream that throws an error if the source data speed is below the specified minimum speed. This `Transform` checks the amount of data every `checkProgressInterval` milliseconds. If the stream has received less than `minSpeedKbps * historyLengthMs / 1000` bytes in the last `historyLengthMs` milliseconds, it will throw an error. Can be used e.g. to abort a download if the network speed is too slow. *** #### Parameters * ##### \_\_namedParameters: { checkProgressInterval?: number; historyLengthMs?: number; minSpeedKbps: number } * ##### optionalcheckProgressInterval: number = 5e3 * ##### optionalhistoryLengthMs: number = 10e3 * ##### minSpeedKbps: number #### Returns Transform Transform stream that monitors the speed of the incoming data. --- # FileDownloadCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\ * *FileDownloadCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The request body of the web page. 
The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for others MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type header: { type, encoding }`. *** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) Inherited from InternalHttpCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from InternalHttpCrawlingContext.id ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson **json: JSONData Inherited from InternalHttpCrawlingContext.json The parsed object from JSON string if the response contains the content type application/json. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from InternalHttpCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from InternalHttpCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from InternalHttpCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse **response: PlainResponse Inherited from InternalHttpCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? 
: [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from InternalHttpCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from InternalHttpCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from InternalHttpCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L252)inheritedparseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Inherited from InternalHttpCrawlingContext.parseWithCheerio Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it will throw if it's not available. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from InternalHttpCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. 
* ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from InternalHttpCrawlingContext.sendRequest Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L238)inheritedwaitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Inherited from InternalHttpCrawlingContext.waitForSelector Wait for an element matching the selector to appear. Timeout is ignored. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # HttpCrawlerOptions \ ### Hierarchy * [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md)\ * *HttpCrawlerOptions* * [CheerioCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md) * [JSDOMCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md) * [LinkeDOMCrawlerOptions](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md) ## Index[**](#Index) ### Properties * [**additionalHttpErrorStatusCodes](#additionalHttpErrorStatusCodes) * [**additionalMimeTypes](#additionalMimeTypes) * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**forceResponseEncoding](#forceResponseEncoding) * [**handlePageFunction](#handlePageFunction) * [**httpClient](#httpClient) * [**ignoreHttpErrorStatusCodes](#ignoreHttpErrorStatusCodes) * [**ignoreSslErrors](#ignoreSslErrors) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * 
[**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**suggestResponseEncoding](#suggestResponseEncoding) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#additionalHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L186)optionaladditionalHttpErrorStatusCodes **additionalHttpErrorStatusCodes? : number\[] An array of additional HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be treated as errors. By default, status codes >= 500 trigger errors. ### [**](#additionalMimeTypes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L142)optionaladditionalMimeTypes **additionalMimeTypes? : string\[] An array of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) you want the crawler to load and process. By default, only `text/html` and `application/xhtml+xml` MIME types are supported. ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from BasicCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalinheritederrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)\ Inherited from BasicCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? 
: [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from BasicCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalinheritedfailedRequestHandler **failedRequestHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)\ Inherited from BasicCrawlerOptions.failedRequestHandler A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#forceResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L166)optionalforceResponseEncoding **forceResponseEncoding? : string By default this crawler will extract correct encoding from the HTTP response headers. Use `forceResponseEncoding` to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use [HttpCrawlerOptions.suggestResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#suggestResponseEncoding) ``` // Will force windows-1250 encoding even if headers say otherwise forceResponseEncoding: 'windows-1250' ``` ### [**](#handlePageFunction)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L87)optionalhandlePageFunction **handlePageFunction? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)\> An alias for [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler) Soon to be removed, use `requestHandler` instead. * **@deprecated** ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from BasicCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L180)optionalignoreHttpErrorStatusCodes **ignoreHttpErrorStatusCodes? : number\[] An array of HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be excluded from error consideration. By default, status codes >= 500 trigger errors. ### [**](#ignoreSslErrors)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L97)optionalignoreSslErrors **ignoreSslErrors? : boolean If set to true, SSL certificate errors will be ignored. 
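A hedged configuration sketch combining the content-type and error-handling options above; the status codes and handler bodies are illustrative, not defaults:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    // also process JSON responses, not just text/html and application/xhtml+xml
    additionalMimeTypes: ['application/json'],
    // treat 403 as an error even though it is below 500
    additionalHttpErrorStatusCodes: [403],
    // and stop treating 503 as an error
    ignoreHttpErrorStatusCodes: [503],
    ignoreSslErrors: true,
    async requestHandler({ request, json }) {
        // json is populated for application/json responses
    },
    async failedRequestHandler({ request }, error) {
        console.error(`${request.url} failed too many times: ${error.message}`);
    },
});
```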
### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from BasicCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from BasicCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from BasicCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from BasicCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from BasicCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from BasicCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. 
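A hedged sketch combining the concurrency and limit options above; all numbers are illustrative and should be tuned for the target site:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    maxConcurrency: 20,        // never more than 20 parallel requests
    maxRequestsPerMinute: 120, // throttle overall throughput
    maxRequestsPerCrawl: 1000, // hard safety limit for the whole run
    maxRequestRetries: 5,      // retry a failing request up to 5 times
    maxCrawlDepth: 2,          // initial requests plus two levels of enqueued links
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks();
    },
});
```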
### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from BasicCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from BasicCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L92)optionalnavigationTimeoutSecs **navigationTimeoutSecs? : number Timeout in which the HTTP request to the resource needs to finish, given in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from BasicCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached. ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L174)optionalpersistCookiesPerSession **persistCookiesPerSession? : boolean Automatically saves cookies to the Session. Works only if the Session Pool is used. Cookies are parsed from the response `Set-Cookie` header and saved or updated on the session; the next time that session is used for a request, the stored cookies are sent in the `Cookie` request header. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L136)optionalpostNavigationHooks **postNavigationHooks? : InternalHttpHook\\[] Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. Example: ``` postNavigationHooks: [ async (crawlingContext) => { // ... }, ] ``` ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L122)optionalpreNavigationHooks **preNavigationHooks?
: InternalHttpHook\\[] Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate. Example: ``` preNavigationHooks: [ async (crawlingContext, gotOptions) => { // ... }, ] ``` Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook). ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L104)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) If set, this crawler will be configured for all connections to use [Apify Proxy](https://console.apify.com/proxy) or your own Proxy URLs provided and rotated according to the configuration. For more information, see the [documentation](https://docs.apify.com/proxy). ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalinheritedrequestHandler **requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)\> Inherited from BasicCrawlerOptions.requestHandler User-provided function that performs the logic of the crawler. It is called for each URL to crawl. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from BasicCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BasicCrawlerOptions.requestList Static list of URLs to be processed. 
If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from BasicCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BasicCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from BasicCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs to be added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from BasicCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from BasicCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another same domain request. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? 
: [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from BasicCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from BasicCrawlerOptions.statisticsOptions Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from BasicCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters. ``` const crawler = new CheerioCrawler({ statusMessageCallback: async (ctx) => { return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG' }, statusMessageLoggingInterval: 1, // defaults to 10s async requestHandler({ $, enqueueLinks, request, log }) { // ... }, }); ``` ### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from BasicCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling the `setStatusMessage` in seconds. ### [**](#suggestResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L155)optionalsuggestResponseEncoding **suggestResponseEncoding? : string By default this crawler will extract correct encoding from the HTTP response headers. Sadly, there are some websites which use invalid headers. Those are encoded using the UTF-8 encoding. If those sites actually use a different encoding, the response will be corrupted. You can use `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use [HttpCrawlerOptions.forceResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#forceResponseEncoding) ``` // Will fall back to windows-1250 encoding if none found suggestResponseEncoding: 'windows-1250' ``` ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? 
: boolean Inherited from BasicCrawlerOptions.useSessionPool Basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # HttpCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\>> * *HttpCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The request body of the web page. The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for other MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type header: { type, encoding }`. *** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\> Inherited from InternalHttpCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. 
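A hedged sketch of the `getKeyValueStore` context helper described above, used inside a request handler; the key name is illustrative:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, getKeyValueStore }) {
        // Default key-value store for this crawler; pass a name or id to get another one.
        const store = await getKeyValueStore();
        await store.setValue('last-crawled-url', request.url);
    },
});
```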
### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody

**body: string | Buffer\

Inherited from InternalHttpCrawlingContext.body

The body of the web page. The type depends on the `Content-Type` header of the web page:

* String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types
* Buffer for other MIME content types

### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType

**contentType: { encoding: BufferEncoding; type: string }

Inherited from InternalHttpCrawlingContext.contentType

Parsed `Content-Type` header: `{ type, encoding }`.

***

#### Type declaration

* ##### encoding: BufferEncoding
* ##### type: string

### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler

**crawler: [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\>

Inherited from InternalHttpCrawlingContext.crawler

### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore

**getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)>

Inherited from InternalHttpCrawlingContext.getKeyValueStore

Get a key-value store with the given name or id, or the default one for the crawler.

***

#### Type declaration

* **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)>

#### Parameters

* ##### optionalidOrName: string

#### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)>

### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid

**id: string

Inherited from InternalHttpCrawlingContext.id

### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson

**json: JSONData

Inherited from InternalHttpCrawlingContext.json

The object parsed from the JSON string, if the response has the `application/json` content type.

### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog

**log: [Log](https://crawlee.dev/js/api/core/class/Log.md)

Inherited from InternalHttpCrawlingContext.log

A preconfigured logger for the request handler.

### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo

**proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md)

Inherited from InternalHttpCrawlingContext.proxyInfo

An object with information about the proxy currently used by the crawler, as configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class.

### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest

**request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\

Inherited from InternalHttpCrawlingContext.request

The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object.

### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse

**response: PlainResponse

Inherited from InternalHttpCrawlingContext.response

### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession

**session? : [Session](https://crawlee.dev/js/api/core/class/Session.md)

Inherited from InternalHttpCrawlingContext.session

### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState

**useState: \(defaultValue) => Promise\

Inherited from InternalHttpCrawlingContext.useState

Returns the state - a piece of mutable persistent data shared across all the request handler runs.

***

#### Type declaration

* **\(defaultValue): Promise\

#### Parameters

* ##### optionaldefaultValue: State

#### Returns Promise\
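
As a brief, illustrative sketch (the state shape is made up), `useState` can keep a shared counter across request handler runs:

```
async requestHandler({ request, useState, log }) {
    // The default value is only used when the state is created for the first time.
    const state = await useState({ pagesProcessed: 0 });
    state.pagesProcessed += 1;
    log.info(`${request.url} is page number ${state.pagesProcessed}`);
},
```

Mutating the returned object is enough; the shared state is persisted for you, so there is no explicit save call.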
## Methods[**](#Methods)

### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks

* ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)>

Inherited from InternalHttpCrawlingContext.enqueueLinks

This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and to override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage.

**Example usage**

```
async requestHandler({ enqueueLinks }) {
    await enqueueLinks({
        globs: [
            'https://www.example.com/handbags/*',
        ],
    });
},
```

***

#### Parameters

* ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue>

  All `enqueueLinks()` parameters are passed via an options object.

#### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)>

Promise that resolves to a [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object.

### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L252)inheritedparseWithCheerio

* ****parseWithCheerio**(selector, timeoutMs): Promise\

Inherited from InternalHttpCrawlingContext.parseWithCheerio

Returns a Cheerio handle for `page.content()`, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When the `selector` argument is provided, the method throws if the selector is not available.

**Example usage:**

```
async requestHandler({ parseWithCheerio }) {
    const $ = await parseWithCheerio();
    const title = $('title').text();
},
```

***

#### Parameters

* ##### optionalselector: string
* ##### optionaltimeoutMs: number

#### Returns Promise\

### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData

* ****pushData**(data, datasetIdOrName): Promise\

Inherited from InternalHttpCrawlingContext.pushData

This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or to the one currently used by the crawler. Shortcut for `crawler.pushData()`.

***

#### Parameters

* ##### optionaldata: ReadonlyDeep\

  Data to be pushed to the default dataset.

* ##### optionaldatasetIdOrName: string

#### Returns Promise\
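
A minimal, illustrative sketch (the field names are arbitrary) of pushing one record per page to the default dataset:

```
async requestHandler({ request, pushData, body }) {
    await pushData({
        url: request.url,
        htmlLength: body.length,
    });
},
```

Passing a second argument, e.g. `pushData(item, 'my-named-dataset')`, writes to a named dataset instead of the default one.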
### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest

* ****sendRequest**\(overrideOptions): Promise\>

Inherited from InternalHttpCrawlingContext.sendRequest

Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that.

```
async requestHandler({ sendRequest }) {
    const { body } = await sendRequest({
        // override headers only
        headers: { ... },
    });
},
```

***

#### Parameters

* ##### optionaloverrideOptions: Partial\

#### Returns Promise\>

### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L238)inheritedwaitForSelector

* ****waitForSelector**(selector, timeoutMs): Promise\

Inherited from InternalHttpCrawlingContext.waitForSelector

Wait for an element matching the selector to appear. Timeout is ignored.

**Example usage:**

```
async requestHandler({ waitForSelector, parseWithCheerio }) {
    await waitForSelector('article h1');
    const $ = await parseWithCheerio();
    const title = $('title').text();
},
```

***

#### Parameters

* ##### selector: string
* ##### optionaltimeoutMs: number

#### Returns Promise\

---

# @crawlee/jsdom

Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [jsdom](https://www.npmjs.com/package/jsdom) DOM implementation. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.

Since `JSDOMCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless browser.

`JSDOMCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [JSDOM](https://www.npmjs.com/package/jsdom) and then invokes the user-provided [JSDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestHandler) to extract page data using the `window` object.

The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [JSDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestList) or [JSDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestQueue) constructor options, respectively. If both [JSDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestList) and [JSDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times.

The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl.

We can use the `preNavigationHooks` to adjust `gotOptions`:

```
preNavigationHooks: [
    (crawlingContext, gotOptions) => {
        // ...
    },
]
```

By default, `JSDOMCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types.
If you want the crawler to process other content types, use the [JSDOMCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [JSDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `JSDOMCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `JSDOMCrawler` constructor. ## Example usage[​](#example-usage "Direct link to Example usage") ``` const crawler = new JSDOMCrawler({ async requestHandler({ request, window }) { await Dataset.pushData({ url: request.url, title: window.document.title, }); }, }); await crawler.run([ 'http://crawlee.dev', ]); ``` ## Index[**](#Index) ### Crawlers * [**JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/jsdom-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/jsdom-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/jsdom-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/jsdom-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/jsdom-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/jsdom-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/jsdom-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/jsdom-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/jsdom-crawler.md#BLOCKED_STATUS_CODES) * [**ByteCounterStream](https://crawlee.dev/js/api/jsdom-crawler.md#ByteCounterStream) * [**checkStorageAccess](https://crawlee.dev/js/api/jsdom-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/jsdom-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/jsdom-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/jsdom-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/jsdom-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/jsdom-crawler.md#createBasicRouter) * 
[**CreateContextOptions](https://crawlee.dev/js/api/jsdom-crawler.md#CreateContextOptions) * [**createFileRouter](https://crawlee.dev/js/api/jsdom-crawler.md#createFileRouter) * [**createHttpRouter](https://crawlee.dev/js/api/jsdom-crawler.md#createHttpRouter) * [**CreateSession](https://crawlee.dev/js/api/jsdom-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/jsdom-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/jsdom-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/jsdom-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/jsdom-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/jsdom-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/jsdom-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/jsdom-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/jsdom-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/jsdom-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/jsdom-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/jsdom-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/jsdom-crawler.md#EventTypeName) * [**FileDownload](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownload) * [**FileDownloadCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadCrawlingContext) * [**FileDownloadErrorHandler](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadErrorHandler) * [**FileDownloadHook](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadHook) * [**FileDownloadOptions](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadOptions) * [**FileDownloadRequestHandler](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadRequestHandler) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/jsdom-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/jsdom-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/jsdom-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/jsdom-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/jsdom-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/jsdom-crawler.md#GotScrapingHttpClient) * [**HttpCrawler](https://crawlee.dev/js/api/jsdom-crawler.md#HttpCrawler) * [**HttpCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#HttpCrawlerOptions) * [**HttpCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#HttpCrawlingContext) * 
[**HttpErrorHandler](https://crawlee.dev/js/api/jsdom-crawler.md#HttpErrorHandler) * [**HttpHook](https://crawlee.dev/js/api/jsdom-crawler.md#HttpHook) * [**HttpRequest](https://crawlee.dev/js/api/jsdom-crawler.md#HttpRequest) * [**HttpRequestHandler](https://crawlee.dev/js/api/jsdom-crawler.md#HttpRequestHandler) * [**HttpRequestOptions](https://crawlee.dev/js/api/jsdom-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/jsdom-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/jsdom-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/jsdom-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/jsdom-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/jsdom-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/jsdom-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/jsdom-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/jsdom-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/jsdom-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/jsdom-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/jsdom-crawler.md#log) * [**Log](https://crawlee.dev/js/api/jsdom-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/jsdom-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/jsdom-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/jsdom-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/jsdom-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/jsdom-crawler.md#MAX_POOL_SIZE) * [**MinimumSpeedStream](https://crawlee.dev/js/api/jsdom-crawler.md#MinimumSpeedStream) * [**NonRetryableError](https://crawlee.dev/js/api/jsdom-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/jsdom-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/jsdom-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/jsdom-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/jsdom-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/jsdom-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/jsdom-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/jsdom-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/jsdom-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/jsdom-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/jsdom-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/jsdom-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/jsdom-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/jsdom-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/jsdom-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RecoverableStatePersistenceOptions) * 
[**RedirectHandler](https://crawlee.dev/js/api/jsdom-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/jsdom-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/jsdom-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/jsdom-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/jsdom-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/jsdom-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/jsdom-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/jsdom-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/jsdom-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/jsdom-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/jsdom-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/jsdom-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/jsdom-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/jsdom-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/jsdom-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/jsdom-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/jsdom-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/jsdom-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/jsdom-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/jsdom-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/jsdom-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/jsdom-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/jsdom-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/jsdom-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/jsdom-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/jsdom-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/jsdom-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/jsdom-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/jsdom-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SnapshotterOptions) * 
[**Source](https://crawlee.dev/js/api/jsdom-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/jsdom-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/jsdom-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/jsdom-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/jsdom-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/jsdom-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/jsdom-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/jsdom-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#StorageManagerOptions) * [**StreamHandlerContext](https://crawlee.dev/js/api/jsdom-crawler.md#StreamHandlerContext) * [**StreamingHttpResponse](https://crawlee.dev/js/api/jsdom-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/jsdom-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/jsdom-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/jsdom-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/jsdom-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/jsdom-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/jsdom-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/jsdom-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/jsdom-crawler.md#withCheckedStorageAccess) * [**JSDOMCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md) * [**JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md) * [**JSDOMErrorHandler](https://crawlee.dev/js/api/jsdom-crawler.md#JSDOMErrorHandler) * [**JSDOMHook](https://crawlee.dev/js/api/jsdom-crawler.md#JSDOMHook) * [**JSDOMRequestHandler](https://crawlee.dev/js/api/jsdom-crawler.md#JSDOMRequestHandler) * [**createJSDOMRouter](https://crawlee.dev/js/api/jsdom-crawler/function/createJSDOMRouter.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### 
[**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#ByteCounterStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L116)ByteCounterStream Re-exports [ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### 
[**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#createFileRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L304)createFileRouter Re-exports [createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) ### [**](#createHttpRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L1068)createHttpRouter Re-exports [createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports 
[DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### 
[**](#FileDownload)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L184)FileDownload Re-exports [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) ### [**](#FileDownloadCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L52)FileDownloadCrawlingContext Re-exports [FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) ### [**](#FileDownloadErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L20)FileDownloadErrorHandler Re-exports [FileDownloadErrorHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadErrorHandler) ### [**](#FileDownloadHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L47)FileDownloadHook Re-exports [FileDownloadHook](https://crawlee.dev/js/api/http-crawler.md#FileDownloadHook) ### [**](#FileDownloadOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L34)FileDownloadOptions Re-exports [FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions) ### [**](#FileDownloadRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L57)FileDownloadRequestHandler Re-exports [FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L330)HttpCrawler Re-exports [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ### [**](#HttpCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L80)HttpCrawlerOptions Re-exports [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) ### 
[**](#HttpCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L255)HttpCrawlingContext Re-exports [HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) ### [**](#HttpErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L75)HttpErrorHandler Re-exports [HttpErrorHandler](https://crawlee.dev/js/api/http-crawler.md#HttpErrorHandler) ### [**](#HttpHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L194)HttpHook Re-exports [HttpHook](https://crawlee.dev/js/api/http-crawler.md#HttpHook) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L258)HttpRequestHandler Re-exports [HttpRequestHandler](https://crawlee.dev/js/api/http-crawler.md#HttpRequestHandler) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### 
[**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#MinimumSpeedStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L71)MinimumSpeedStream Re-exports [MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### 
[**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### 
[**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### 
[**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions 
Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams 
Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamHandlerContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L25)StreamHandlerContext Re-exports [StreamHandlerContext](https://crawlee.dev/js/api/http-crawler.md#StreamHandlerContext) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#JSDOMErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L34)JSDOMErrorHandler **JSDOMErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * 
**JSONData**: Dictionary = any ### [**](#JSDOMHook)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L53)JSDOMHook **JSDOMHook\: InternalHttpHook<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#JSDOMRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L95)JSDOMRequestHandler **JSDOMRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. ## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/jsdom ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/jsdom ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/jsdom # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/jsdom ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/jsdom # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Features[​](#features "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package 
@crawlee/jsdom ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-1 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-2 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/jsdom ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/jsdom ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/jsdom # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/jsdom ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/jsdom ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/jsdom ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/jsdom ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/jsdom ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/jsdom # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/jsdom ## 
[3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * declare missing peer dependencies in `@crawlee/browser` package ([#2532](https://github.com/apify/crawlee/issues/2532)) ([3357c7f](https://github.com/apify/crawlee/commit/3357c7fc5ab071b12f72097c190dbee9990e3751)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/jsdom ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-3 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/jsdom ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/jsdom # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/jsdom ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/jsdom ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/jsdom # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/jsdom ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/jsdom ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/jsdom # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) **Note:** Version bump only for package @crawlee/jsdom ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/jsdom ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/jsdom ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/jsdom # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/jsdom ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/jsdom ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) 
(2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/jsdom # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/jsdom ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-4 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/jsdom # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) **Note:** Version bump only for package @crawlee/jsdom ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/jsdom ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") ### Features[​](#features-5 "Direct link to Features") * **jsdom,linkedom:** Expose document to crawler router context ([#1950](https://github.com/apify/crawlee/issues/1950)) ([4536dc2](https://github.com/apify/crawlee/commit/4536dc2900ee6d0acb562583ed8fca183df28e39)) # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/jsdom ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/jsdom ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-6 "Direct link to Features") * **HttpCrawler:** add `parseWithCheerio` helper to `HttpCrawler` 
([#1906](https://github.com/apify/crawlee/issues/1906)) ([ff5f76f](https://github.com/apify/crawlee/commit/ff5f76f9336c47c555c28038cdc72dc650bb5065)) * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * **jsdom:** add timeout to the window\.load wait when `runScripts` are enabled ([806de31](https://github.com/apify/crawlee/commit/806de31222e138ef2d8e2706536a7288423be3d4)) * **jsdom:** delay closing of the window and add some polyfills ([2e81618](https://github.com/apify/crawlee/commit/2e81618afb5f3890495e3e5fcfa037eb3319edc9)) * **jsdom:** use no-op `enqueueLinks` in http crawlers when parsing fails ([fd35270](https://github.com/apify/crawlee/commit/fd35270e7da67a77eb60108e19294f0fd2016706)) ### Features[​](#features-7 "Direct link to Features") * **jsdom:** add `parseWithCheerio` context helper ([c8f0796](https://github.com/apify/crawlee/commit/c8f0796aebc0dfa6e6d04740a0bb7d8ddd5b2d96)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * ignore invalid URLs in `enqueueLinks` in browser crawlers ([#1803](https://github.com/apify/crawlee/issues/1803)) ([5ac336c](https://github.com/apify/crawlee/commit/5ac336c5b83b212fd6281659b8ceee091e259ff1)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/jsdom ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/jsdom # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/jsdom ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") ### Features[​](#features-8 "Direct link to Features") * hideInternalConsole in JSDOMCrawler ([#1707](https://github.com/apify/crawlee/issues/1707)) ([8975f90](https://github.com/apify/crawlee/commit/8975f9088cf4dd38629c21e21061616fc1e7b003)) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/jsdom ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/jsdom # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/jsdom ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/jsdom --- # JSDOMCrawler Provides a framework for the parallel crawling of web pages using plain HTTP requests. 
The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. It is very fast and efficient in terms of data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless Chrome browser. This crawler downloads each URL using a plain HTTP request and doesn't do any HTML parsing. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) or [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) constructor options, respectively. If both [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) and [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, this crawler only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [HttpCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For details, see [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the constructor. **Example usage:** ``` import { HttpCrawler, Dataset } from '@crawlee/http'; const crawler = new HttpCrawler({ requestList, async requestHandler({ request, response, body, contentType }) { // Save the data to dataset. 
await Dataset.pushData({ url: request.url, html: body, }); }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ### Hierarchy * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)> * *JSDOMCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**\_runRequestHandler](#_runRequestHandler) * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**getVirtualConsole](#getVirtualConsole) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L191)constructor * ****new JSDOMCrawler**(options, config): [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) - Overrides HttpCrawler.constructor #### Parameters * ##### options: [JSDOMCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md)\ = {} * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) #### Returns [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from HttpCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from HttpCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from HttpCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from HttpCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\, request>> = ... Inherited from HttpCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from HttpCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? 
: [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from HttpCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from HttpCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#_runRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L318)\_runRequestHandler * ****\_runRequestHandler**(context): Promise\ - Overrides HttpCrawler.\_runRequestHandler Wrapper around `requestHandler` that opens and closes pages, etc. *** #### Parameters * ##### context: [JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\ #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from HttpCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via the `batchSize` option and the sleep time between batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from HttpCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports it to the specified format. Supported formats are currently 'json' and 'csv'; the format is inferred from the `path` automatically. 
*** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from HttpCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from HttpCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from HttpCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#getVirtualConsole)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L214)getVirtualConsole * ****getVirtualConsole**(): VirtualConsole - Returns the currently used `VirtualConsole` instance. Can be used to listen for the JSDOM's internal console messages. If the `hideInternalConsole` option is set to `true`, the messages aren't logged to the console by default, but the virtual console can still be listened to. **Example usage:** ``` const console = crawler.getVirtualConsole(); console.on('error', (e) => { log.error(e); }); ``` *** #### Returns VirtualConsole ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from HttpCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from HttpCrawler.run Runs the crawler. 
Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from HttpCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from HttpCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)inheriteduse * ****use**(extension): void - Inherited from HttpCrawler.use **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from HttpCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # createJSDOMRouter ### Callable * ****createJSDOMRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md). Defaults to the [JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. 
``` import { JSDOMCrawler, createJSDOMRouter } from 'crawlee'; const router = createJSDOMRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new JSDOMCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # JSDOMCrawlerOptions \ ### Hierarchy * [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> * *JSDOMCrawlerOptions* ## Index[**](#Index) ### Properties * [**additionalHttpErrorStatusCodes](#additionalHttpErrorStatusCodes) * [**additionalMimeTypes](#additionalMimeTypes) * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**forceResponseEncoding](#forceResponseEncoding) * [**handlePageFunction](#handlePageFunction) * [**hideInternalConsole](#hideInternalConsole) * [**httpClient](#httpClient) * [**ignoreHttpErrorStatusCodes](#ignoreHttpErrorStatusCodes) * [**ignoreSslErrors](#ignoreSslErrors) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**runScripts](#runScripts) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**suggestResponseEncoding](#suggestResponseEncoding) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#additionalHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L186)optionalinheritedadditionalHttpErrorStatusCodes **additionalHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.additionalHttpErrorStatusCodes An array of additional HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be treated as errors. By default, status codes >= 500 trigger errors. ### [**](#additionalMimeTypes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L142)optionalinheritedadditionalMimeTypes **additionalMimeTypes? 
: string\[] Inherited from HttpCrawlerOptions.additionalMimeTypes An array of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) you want the crawler to load and process. By default, only `text/html` and `application/xhtml+xml` MIME types are supported. ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from HttpCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalinheritederrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> Inherited from HttpCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from HttpCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalinheritedfailedRequestHandler **failedRequestHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> Inherited from HttpCrawlerOptions.failedRequestHandler A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. 
The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. The second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#forceResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L166)optionalinheritedforceResponseEncoding **forceResponseEncoding? : string Inherited from HttpCrawlerOptions.forceResponseEncoding By default, this crawler extracts the correct encoding from the HTTP response headers. Use `forceResponseEncoding` to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use [HttpCrawlerOptions.suggestResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#suggestResponseEncoding). ``` // Will force windows-1250 encoding even if headers say otherwise forceResponseEncoding: 'windows-1250' ``` ### [**](#handlePageFunction)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L87)optionalinheritedhandlePageFunction **handlePageFunction? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.handlePageFunction An alias for [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler). Soon to be removed; use `requestHandler` instead. * **@deprecated** ### [**](#hideInternalConsole)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L50)optionalhideInternalConsole **hideInternalConsole? : boolean Suppresses the logs from JSDOM's internal console. ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from HttpCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md). ### [**](#ignoreHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L180)optionalinheritedignoreHttpErrorStatusCodes **ignoreHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.ignoreHttpErrorStatusCodes An array of HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be excluded from error consideration. By default, status codes >= 500 trigger errors. ### [**](#ignoreSslErrors)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L97)optionalinheritedignoreSslErrors **ignoreSslErrors? : boolean Inherited from HttpCrawlerOptions.ignoreSslErrors If set to `true`, SSL certificate errors will be ignored. 
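The status-code and console options documented above can be combined when constructing the crawler. Below is a minimal sketch with purely illustrative values (the start URL and the specific status codes are arbitrary choices, not defaults): it treats 403 responses as retryable errors, lets 404 responses reach the request handler, and silences JSDOM's internal console.

```
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Treat 403 as an error so the request gets retried (illustrative choice).
    additionalHttpErrorStatusCodes: [403],
    // Do not treat 404 as an error; the requestHandler can inspect it instead.
    ignoreHttpErrorStatusCodes: [404],
    // Suppress JSDOM's internal console output; it can still be observed
    // via crawler.getVirtualConsole() if needed.
    hideInternalConsole: true,
    async requestHandler({ request, window, log }) {
        log.info(`${request.url}: ${window.document.title}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```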
### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from HttpCrawlerOptions.keepAlive Allows keeping the crawler alive even when the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) becomes empty. By default, `crawler.run()` will resolve once the queue is empty. With `keepAlive: true`, it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from HttpCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from HttpCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for the initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from HttpCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from HttpCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from HttpCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should process. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. 
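The concurrency and crawl-limit options above can all be passed directly to the constructor. A minimal sketch, assuming purely illustrative numbers and start URL (none of these values are defaults or recommendations):

```
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Concurrency bounds, forwarded to the underlying AutoscaledPool.
    minConcurrency: 2,
    maxConcurrency: 20,
    // Throttle the crawler to at most 120 requests per minute.
    maxRequestsPerMinute: 120,
    // Safety cap on the total number of pages opened in this run.
    maxRequestsPerCrawl: 1000,
    // Process the start URLs and the links found on them, but go no deeper.
    maxCrawlDepth: 1,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```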
### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from HttpCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from HttpCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If not sure, it's better to keep the default value and let the concurrency scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L92)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from HttpCrawlerOptions.navigationTimeoutSecs Timeout in which the HTTP request to the resource needs to finish, given in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from HttpCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests that are skipped 1. based on the robots.txt file, 2. because they don't match the enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, or 4. because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L174)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from HttpCrawlerOptions.persistCookiesPerSession Automatically saves cookies to the Session. Works only if the Session Pool is used. It parses cookies from the response "Set-Cookie" header and saves or updates the cookies for the session. When the session is then used for the next request, it passes the "Cookie" header with the session cookies to that request. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L136)optionalinheritedpostNavigationHooks **postNavigationHooks? : InternalHttpHook<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter.
Example: ``` postNavigationHooks: [ async (crawlingContext) => { // ... }, ] ``` ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L122)optionalinheritedpreNavigationHooks **preNavigationHooks? : InternalHttpHook<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.preNavigationHooks Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate. Example: ``` preNavigationHooks: [ async (crawlingContext, gotOptions) => { // ... }, ] ``` Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook) ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L104)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawlerOptions.proxyConfiguration If set, this crawler will be configured for all connections to use [Apify Proxy](https://console.apify.com/proxy) or your own proxy URLs, provided and rotated according to the configuration. For more information, see the [documentation](https://docs.apify.com/proxy). ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalinheritedrequestHandler **requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.requestHandler User-provided function that performs the logic of the crawler. It is called for each URL to crawl. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs?
: number = 60 Inherited from HttpCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from HttpCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from HttpCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from HttpCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection.
Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#runScripts)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L46)optionalrunScripts **runScripts? : boolean Download and run scripts. ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from HttpCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from HttpCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from HttpCrawlerOptions.statisticsOptions Customize the way statistics collection works, such as the logging interval or whether to output the statistics to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from HttpCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters. ``` const crawler = new CheerioCrawler({ statusMessageCallback: async (ctx) => { return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG' }, statusMessageLoggingInterval: 1, // defaults to 10s async requestHandler({ $, enqueueLinks, request, log }) { // ... }, }); ``` ### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from HttpCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#suggestResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L155)optionalinheritedsuggestResponseEncoding **suggestResponseEncoding? : string Inherited from HttpCrawlerOptions.suggestResponseEncoding By default, this crawler will extract the correct encoding from the HTTP response headers.
Sadly, some websites use invalid encoding headers. Their responses are then decoded as UTF-8 by default, so if such a site actually uses a different encoding, the response will be corrupted. You can use `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use [HttpCrawlerOptions.forceResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#forceResponseEncoding) ``` // Will fall back to windows-1250 encoding if none found suggestResponseEncoding: 'windows-1250' ``` ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from HttpCrawlerOptions.useSessionPool The crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # JSDOMCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\ * *JSDOMCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**document](#document) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) * [**window](#window) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The body of the web page. The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for other MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type` header: `{ type, encoding }`.
*** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) Inherited from InternalHttpCrawlingContext.crawler ### [**](#document)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L63)document **document: Document ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with a given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from InternalHttpCrawlingContext.id ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson **json: JSONData Inherited from InternalHttpCrawlingContext.json The object parsed from the JSON string, if the response has the `application/json` content type. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from InternalHttpCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from InternalHttpCrawlingContext.proxyInfo An object with information about the proxy currently used by the crawler, as configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from InternalHttpCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse **response: PlainResponse Inherited from InternalHttpCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session?
: [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from InternalHttpCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from InternalHttpCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ### [**](#window)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L62)window **window: DOMWindow ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from InternalHttpCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to a [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L92)parseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.parseWithCheerio Returns a Cheerio handle, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it will first look for the selector with a 5s timeout. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }, ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from InternalHttpCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`.
*** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from InternalHttpCrawlingContext.sendRequest Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L78)waitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.waitForSelector Wait for an element matching the selector to appear. Timeout defaults to 5s. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }, ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # @crawlee/linkedom Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [linkedom](https://www.npmjs.com/package/linkedom) DOM implementation. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `LinkeDOMCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless Chrome browser. `LinkeDOMCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [LinkeDOM](https://www.npmjs.com/package/linkedom) and then invokes the user-provided [LinkeDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestHandler) to extract page data using the `window` object. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [LinkeDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestList) or [LinkeDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestQueue) constructor options, respectively.
If both [LinkeDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestList) and [LinkeDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, `LinkeDOMCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [LinkeDOMCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [LinkeDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `LinkeDOMCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `LinkeDOMCrawler` constructor. 
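As a brief, hedged illustration of the paragraph above, the following sketch passes the `minConcurrency` and `maxConcurrency` shortcuts alongside other `AutoscaledPool` options to the constructor; the values are arbitrary, and `desiredConcurrency` is just one example of an `AutoscaledPoolOptions` field.

```
import { LinkeDOMCrawler } from '@crawlee/linkedom';

const crawler = new LinkeDOMCrawler({
    // Shortcuts for the corresponding AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Any other AutoscaledPool option can be passed here.
    autoscaledPoolOptions: {
        desiredConcurrency: 10,
    },
    async requestHandler({ request, window }) {
        // Extract page data from the `window` object here.
    },
});
```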
## Example usage[​](#example-usage "Direct link to Example usage") ``` const crawler = new LinkeDOMCrawler({ async requestHandler({ request, window }) { await Dataset.pushData({ url: request.url, title: window.document.title, }); }, }); await crawler.run([ 'http://crawlee.dev', ]); ``` ## Index[**](#Index) ### Crawlers * [**LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/linkedom-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/linkedom-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/linkedom-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/linkedom-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/linkedom-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/linkedom-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/linkedom-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/linkedom-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/linkedom-crawler.md#BLOCKED_STATUS_CODES) * [**ByteCounterStream](https://crawlee.dev/js/api/linkedom-crawler.md#ByteCounterStream) * [**checkStorageAccess](https://crawlee.dev/js/api/linkedom-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/linkedom-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/linkedom-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/linkedom-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/linkedom-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/linkedom-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/linkedom-crawler.md#CreateContextOptions) * [**createFileRouter](https://crawlee.dev/js/api/linkedom-crawler.md#createFileRouter) * [**createHttpRouter](https://crawlee.dev/js/api/linkedom-crawler.md#createHttpRouter) * [**CreateSession](https://crawlee.dev/js/api/linkedom-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/linkedom-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/linkedom-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetExportToOptions) * 
[**DatasetIteratorOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/linkedom-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/linkedom-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/linkedom-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/linkedom-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/linkedom-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/linkedom-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/linkedom-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/linkedom-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/linkedom-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/linkedom-crawler.md#EventTypeName) * [**FileDownload](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownload) * [**FileDownloadCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadCrawlingContext) * [**FileDownloadErrorHandler](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadErrorHandler) * [**FileDownloadHook](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadHook) * [**FileDownloadOptions](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadOptions) * [**FileDownloadRequestHandler](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadRequestHandler) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/linkedom-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/linkedom-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/linkedom-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/linkedom-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/linkedom-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/linkedom-crawler.md#GotScrapingHttpClient) * [**HttpCrawler](https://crawlee.dev/js/api/linkedom-crawler.md#HttpCrawler) * [**HttpCrawlerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#HttpCrawlerOptions) * [**HttpCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#HttpCrawlingContext) * [**HttpErrorHandler](https://crawlee.dev/js/api/linkedom-crawler.md#HttpErrorHandler) * [**HttpHook](https://crawlee.dev/js/api/linkedom-crawler.md#HttpHook) * [**HttpRequest](https://crawlee.dev/js/api/linkedom-crawler.md#HttpRequest) * [**HttpRequestHandler](https://crawlee.dev/js/api/linkedom-crawler.md#HttpRequestHandler) * [**HttpRequestOptions](https://crawlee.dev/js/api/linkedom-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/linkedom-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/linkedom-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/linkedom-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/linkedom-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/linkedom-crawler.md#KeyConsumer) * 
[**KeyValueStore](https://crawlee.dev/js/api/linkedom-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/linkedom-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/linkedom-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/linkedom-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/linkedom-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/linkedom-crawler.md#log) * [**Log](https://crawlee.dev/js/api/linkedom-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/linkedom-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/linkedom-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/linkedom-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/linkedom-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/linkedom-crawler.md#MAX_POOL_SIZE) * [**MinimumSpeedStream](https://crawlee.dev/js/api/linkedom-crawler.md#MinimumSpeedStream) * [**NonRetryableError](https://crawlee.dev/js/api/linkedom-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/linkedom-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/linkedom-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/linkedom-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/linkedom-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/linkedom-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/linkedom-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/linkedom-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/linkedom-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/linkedom-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/linkedom-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/linkedom-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/linkedom-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/linkedom-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/linkedom-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/linkedom-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/linkedom-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/linkedom-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/linkedom-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/linkedom-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/linkedom-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/linkedom-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestListOptions) * 
[**RequestListSourcesFunction](https://crawlee.dev/js/api/linkedom-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/linkedom-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/linkedom-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/linkedom-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/linkedom-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/linkedom-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/linkedom-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/linkedom-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/linkedom-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/linkedom-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/linkedom-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/linkedom-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/linkedom-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/linkedom-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/linkedom-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/linkedom-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/linkedom-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/linkedom-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/linkedom-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/linkedom-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/linkedom-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/linkedom-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/linkedom-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/linkedom-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/linkedom-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/linkedom-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/linkedom-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/linkedom-crawler.md#StatusMessageCallback) * 
[**StatusMessageCallbackParams](https://crawlee.dev/js/api/linkedom-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/linkedom-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#StorageManagerOptions) * [**StreamHandlerContext](https://crawlee.dev/js/api/linkedom-crawler.md#StreamHandlerContext) * [**StreamingHttpResponse](https://crawlee.dev/js/api/linkedom-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/linkedom-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/linkedom-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/linkedom-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/linkedom-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/linkedom-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/linkedom-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/linkedom-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/linkedom-crawler.md#withCheckedStorageAccess) * [**LinkeDOMCrawlerEnqueueLinksOptions](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerEnqueueLinksOptions.md) * [**LinkeDOMCrawlerOptions](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md) * [**LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md) * [**LinkeDOMErrorHandler](https://crawlee.dev/js/api/linkedom-crawler.md#LinkeDOMErrorHandler) * [**LinkeDOMHook](https://crawlee.dev/js/api/linkedom-crawler.md#LinkeDOMHook) * [**LinkeDOMRequestHandler](https://crawlee.dev/js/api/linkedom-crawler.md#LinkeDOMRequestHandler) * [**createLinkeDOMRouter](https://crawlee.dev/js/api/linkedom-crawler/function/createLinkeDOMRouter.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### 
[**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#ByteCounterStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L116)ByteCounterStream Re-exports [ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### 
[**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#createFileRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L304)createFileRouter Re-exports [createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) ### [**](#createHttpRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L1068)createHttpRouter Re-exports [createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports 
[DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#FileDownload)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L184)FileDownload Re-exports [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) ### 
[**](#FileDownloadCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L52)FileDownloadCrawlingContext Re-exports [FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) ### [**](#FileDownloadErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L20)FileDownloadErrorHandler Re-exports [FileDownloadErrorHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadErrorHandler) ### [**](#FileDownloadHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L47)FileDownloadHook Re-exports [FileDownloadHook](https://crawlee.dev/js/api/http-crawler.md#FileDownloadHook) ### [**](#FileDownloadOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L34)FileDownloadOptions Re-exports [FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions) ### [**](#FileDownloadRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L57)FileDownloadRequestHandler Re-exports [FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L330)HttpCrawler Re-exports [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ### [**](#HttpCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L80)HttpCrawlerOptions Re-exports [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) ### [**](#HttpCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L255)HttpCrawlingContext Re-exports [HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) ### 
[**](#HttpErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L75)HttpErrorHandler Re-exports [HttpErrorHandler](https://crawlee.dev/js/api/http-crawler.md#HttpErrorHandler) ### [**](#HttpHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L194)HttpHook Re-exports [HttpHook](https://crawlee.dev/js/api/http-crawler.md#HttpHook) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L258)HttpRequestHandler Re-exports [HttpRequestHandler](https://crawlee.dev/js/api/http-crawler.md#HttpRequestHandler) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### 
[**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#MinimumSpeedStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L71)MinimumSpeedStream Re-exports [MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### 
[**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### 
[**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports 
[RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### 
[**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports 
[StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamHandlerContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L25)StreamHandlerContext Re-exports [StreamHandlerContext](https://crawlee.dev/js/api/http-crawler.md#StreamHandlerContext) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#LinkeDOMErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L31)LinkeDOMErrorHandler **LinkeDOMErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#LinkeDOMHook)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L43)LinkeDOMHook **LinkeDOMHook\: 
InternalHttpHook<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#LinkeDOMRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L90)LinkeDOMRequestHandler **LinkeDOMRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. ## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/linkedom ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/linkedom ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/linkedom # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/linkedom ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/linkedom # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Features[​](#features "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 
"Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-1 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-2 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/linkedom ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/linkedom ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/linkedom # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/linkedom ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/linkedom ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/linkedom ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/linkedom ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/linkedom ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/linkedom # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/linkedom ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 
3105-2024-06-12") **Note:** Version bump only for package @crawlee/linkedom ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/linkedom ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-3 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/linkedom ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/linkedom # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/linkedom ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/linkedom ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/linkedom # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/linkedom ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/linkedom ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/linkedom # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) **Note:** Version bump only for package @crawlee/linkedom ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/linkedom ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/linkedom ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/linkedom # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/linkedom ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/linkedom ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/linkedom # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Features[​](#features-4 "Direct link to Features") * got-scraping v4 
([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-5 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/linkedom # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) **Note:** Version bump only for package @crawlee/linkedom ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/linkedom ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") ### Features[​](#features-6 "Direct link to Features") * **jsdom,linkedom:** Expose document to crawler router context ([#1950](https://github.com/apify/crawlee/issues/1950)) ([4536dc2](https://github.com/apify/crawlee/commit/4536dc2900ee6d0acb562583ed8fca183df28e39)) # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Features[​](#features-7 "Direct link to Features") * add LinkeDOMCrawler ([#1907](https://github.com/apify/crawlee/issues/1907)) ([1c69560](https://github.com/apify/crawlee/commit/1c69560fe7ef45097e6be1037b79a84eb9a06337)), closes [apify/crawlee#1890 (comment)](https://github.com/apify/crawlee/pull/1890#issuecomment-1533271694) --- # LinkeDOMCrawler Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [linkedom](https://www.npmjs.com/package/linkedom) DOM implementation. 
The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `LinkeDOMCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient in terms of data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because those crawlers load the pages using a full-featured headless browser. **Limitation**: This crawler does not support proxies and cookies yet (each request starts with an empty cookie store), and the user agent is always set to `Chrome`. `LinkeDOMCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [LinkeDOM](https://www.npmjs.com/package/linkedom) and then invokes the user-provided [LinkeDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestHandler) to extract page data using the `window` object. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [LinkeDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestList) or [LinkeDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestQueue) constructor options, respectively. If both [LinkeDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestList) and [LinkeDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, `LinkeDOMCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [LinkeDOMCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [LinkeDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. 
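Expanding on the `preNavigationHooks` snippet above, a minimal sketch of adjusting `gotOptions` per request could look as follows. The hook signature comes from the documentation above; the specific header and its value are illustrative placeholders, not recommended settings.

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            // gotOptions are the options handed to the underlying HTTP client
            // for this single request, so they can be tweaked per request here.
            gotOptions.headers = {
                ...gotOptions.headers,
                'accept-language': 'en-US,en;q=0.9', // placeholder value
            };
        },
    ],
    async requestHandler({ request, window, log }) {
        log.info(`Fetched ${request.url}: ${window.document.title}`);
    },
});
```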
All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `LinkeDOMCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `LinkeDOMCrawler` constructor. **Example usage:** ``` const crawler = new LinkeDOMCrawler({ async requestHandler({ request, window }) { await Dataset.pushData({ url: request.url, title: window.document.title, }); }, }); await crawler.run([ 'http://crawlee.dev', ]); ``` ### Hierarchy * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)> * *LinkeDOMCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**\_runRequestHandler](#_runRequestHandler) * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L373)constructor * ****new LinkeDOMCrawler**(options, config): [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) - Inherited from HttpCrawler.constructor All `HttpCrawlerOptions` parameters are passed via an options object. *** #### Parameters * ##### options: [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from HttpCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). 
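As a rough sketch of the concurrency options described above, the constructor shortcuts and `autoscaledPoolOptions` can be combined as shown below; the numbers and the nested pool option are illustrative assumptions, not recommendations.

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    // Convenience shortcuts for the underlying AutoscaledPool.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Other AutoscaledPool options can be passed through autoscaledPoolOptions.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9, // assumed option for illustration
    },
    async requestHandler({ request, window, pushData }) {
        await pushData({ url: request.url, title: window.document.title });
    },
});

// Once run() has started, crawler.autoscaledPool can be used to pause or abort the run.
await crawler.run(['https://crawlee.dev']);
```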
### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from HttpCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from HttpCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from HttpCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\, request>> = ... Inherited from HttpCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). 
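To make the `router` property above more tangible, here is a minimal, hedged sketch of registering handlers on the default router instead of passing a single `requestHandler`; the 'DETAIL' label is a placeholder chosen for this example.

```
import { LinkeDOMCrawler } from 'crawlee';

// No requestHandler is given, so the crawler falls back to its default router.
const crawler = new LinkeDOMCrawler();

// Placeholder label used only for this sketch.
crawler.router.addHandler('DETAIL', async ({ request, window, pushData }) => {
    await pushData({ url: request.url, title: window.document.title });
});

// Requests without a matching label end up in the default handler.
crawler.router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ label: 'DETAIL' });
});

await crawler.run(['https://crawlee.dev']);
```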
### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from HttpCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from HttpCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from HttpCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#_runRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L201)\_runRequestHandler * ****\_runRequestHandler**(context): Promise\ - Overrides HttpCrawler.\_runRequestHandler Wrapper around requestHandler that opens and closes pages etc. *** #### Parameters * ##### context: [LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\ #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from HttpCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from HttpCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. 
*** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from HttpCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from HttpCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from HttpCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from HttpCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from HttpCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. 
* ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from HttpCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from HttpCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)inheriteduse * ****use**(extension): void - Inherited from HttpCrawler.use **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from HttpCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # createLinkeDOMRouter ### Callable * ****createLinkeDOMRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md). Defaults to the [LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. 
```
import { LinkeDOMCrawler, createLinkeDOMRouter } from 'crawlee';

const router = createLinkeDOMRouter();

router.addHandler('label-a', async (ctx) => {
    ctx.log.info('...');
});

router.addDefaultHandler(async (ctx) => {
    ctx.log.info('...');
});

const crawler = new LinkeDOMCrawler({
    requestHandler: router,
});

await crawler.run();
```
*** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # LinkeDOMCrawlerEnqueueLinksOptions ### Hierarchy * Omit<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), urls | requestQueue> * *LinkeDOMCrawlerEnqueueLinksOptions* ## Index[**](#Index) ### Properties * [**baseUrl](#baseUrl) * [**exclude](#exclude) * [**forefront](#forefront) * [**globs](#globs) * [**label](#label) * [**limit](#limit) * [**onSkippedRequest](#onSkippedRequest) * [**pseudoUrls](#pseudoUrls) * [**regexps](#regexps) * [**robotsTxtFile](#robotsTxtFile) * [**selector](#selector) * [**skipNavigation](#skipNavigation) * [**strategy](#strategy) * [**transformRequestFunction](#transformRequestFunction) * [**userData](#userData) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#baseUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L68)optionalinheritedbaseUrl **baseUrl? : string Inherited from Omit.baseUrl A base URL that will be used to resolve relative URLs when using Cheerio. Ignored when using Puppeteer, since the relative URL resolution is done inside the browser automatically. ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L94)optionalinheritedexclude **exclude? : readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] Inherited from Omit.exclude An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from Omit.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L83)optionalinheritedglobs **globs? : readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] Inherited from Omit.globs An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. 
The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the function enqueues the links with the same subdomain. ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L56)optionalinheritedlabel **label? : string Inherited from Omit.label Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this option. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L36)optionalinheritedlimit **limit? : number Inherited from Omit.limit Limit the amount of actually enqueued URLs to this number. Useful for testing across the entire crawling scope. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L192)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from Omit.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. or because the maxRequestsPerCrawl limit has been reached ### [**](#pseudoUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L126)optionalinheritedpseudoUrls **pseudoUrls? : readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] Inherited from Omit.pseudoUrls *NOTE:* In future versions of SDK the options will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues the links with the same subdomain. * **@deprecated** prefer using `globs` or `regexps` instead ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L106)optionalinheritedregexps **regexps? : readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] Inherited from Omit.regexps An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. 
All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the function enqueues the links with the same subdomain. ### [**](#robotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L183)optionalinheritedrobotsTxtFile **robotsTxtFile? : Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed> Inherited from Omit.robotsTxtFile RobotsTxtFile instance for the current request that triggered the `enqueueLinks`. If provided, disallowed URLs will be ignored. ### [**](#selector)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L45)optionalinheritedselector **selector? : string Inherited from Omit.selector A CSS selector matching links to be enqueued. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L62)optionalinheritedskipNavigation **skipNavigation? : boolean = false Inherited from Omit.skipNavigation If set to `true`, tells the crawler to skip navigation and process the request directly. ### [**](#strategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L171)optionalinheritedstrategy **strategy? : [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin = [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin Inherited from Omit.strategy The strategy to use when enqueueing the URLs. Depending on the strategy you select, we will only check certain parts of the URLs found. Here is a diagram of each URL part and their name:
```
Protocol          Domain
┌────┐          ┌─────────┐
https://example.crawlee.dev/...
│       └─────────────────┤
│           Hostname      │
│                         │
└─────────────────────────┘
           Origin
```
### [**](#transformRequestFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L151)optionalinheritedtransformRequestFunction **transformRequestFunction? : [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Inherited from Omit.transformRequestFunction Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `keepUrlFragment: true` to the `request` object, URL fragments will not be removed when `uniqueKey` is computed. **Example:**
```
{
    transformRequestFunction: (request) => {
        request.userData.foo = 'bar';
        request.keepUrlFragment = true;
        return request;
    }
}
```
Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this function, so some request options returned by `transformRequestFunction` may be overwritten by those pattern-based options. 
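These options are typically combined inside a request handler. A minimal sketch of doing so with [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) follows; the selector, patterns, and label are illustrative only:

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks({
            selector: 'a.product',                          // which links to collect
            globs: ['https://www.example.com/products/**'], // only enqueue matching URLs
            exclude: [/\/reviews\//],                       // never enqueue review pages
            label: 'PRODUCT',                               // routes to a matching handler
            transformRequestFunction: (req) => {
                req.userData.referrer = request.url;        // attach custom userData
                return req;
            },
        });
    },
});
```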
### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L48)optionalinheriteduserData **userData? : Dictionary Inherited from Omit.userData Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L177)optionalinheritedwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean Inherited from Omit.waitForAllRequestsToBeAdded By default, only the first batch (1000) of found requests will be added to the queue before resolving the call. You can use this option to wait for adding all of them. --- # LinkeDOMCrawlerOptions \ ### Hierarchy * [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> * *LinkeDOMCrawlerOptions* ## Index[**](#Index) ### Properties * [**additionalHttpErrorStatusCodes](#additionalHttpErrorStatusCodes) * [**additionalMimeTypes](#additionalMimeTypes) * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**forceResponseEncoding](#forceResponseEncoding) * [**handlePageFunction](#handlePageFunction) * [**httpClient](#httpClient) * [**ignoreHttpErrorStatusCodes](#ignoreHttpErrorStatusCodes) * [**ignoreSslErrors](#ignoreSslErrors) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**suggestResponseEncoding](#suggestResponseEncoding) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#additionalHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L186)optionalinheritedadditionalHttpErrorStatusCodes **additionalHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.additionalHttpErrorStatusCodes An array of additional HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be treated as errors. By default, status codes >= 500 trigger errors. 
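For illustration, a hedged sketch of adding an extra error status code (the codes are examples only; `ignoreHttpErrorStatusCodes`, described below, is the complementary option):

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    // Also treat 403 responses as errors, so they get retried.
    additionalHttpErrorStatusCodes: [403],
    // Do not treat 503 responses as errors, even though codes >= 500 normally are.
    ignoreHttpErrorStatusCodes: [503],
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.url}`);
    },
});
```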
### [**](#additionalMimeTypes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L142)optionalinheritedadditionalMimeTypes **additionalMimeTypes? : string\[] Inherited from HttpCrawlerOptions.additionalMimeTypes An array of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) you want the crawler to load and process. By default, only `text/html` and `application/xhtml+xml` MIME types are supported. ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from HttpCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalinheritederrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> Inherited from HttpCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from HttpCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalinheritedfailedRequestHandler **failedRequestHandler? 
: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> Inherited from HttpCrawlerOptions.failedRequestHandler A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#forceResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L166)optionalinheritedforceResponseEncoding **forceResponseEncoding? : string Inherited from HttpCrawlerOptions.forceResponseEncoding By default this crawler will extract correct encoding from the HTTP response headers. Use `forceResponseEncoding` to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use [HttpCrawlerOptions.suggestResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#suggestResponseEncoding) ``` // Will force windows-1250 encoding even if headers say otherwise forceResponseEncoding: 'windows-1250' ``` ### [**](#handlePageFunction)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L87)optionalinheritedhandlePageFunction **handlePageFunction? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.handlePageFunction An alias for [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler) Soon to be removed, use `requestHandler` instead. * **@deprecated** ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from HttpCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L180)optionalinheritedignoreHttpErrorStatusCodes **ignoreHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.ignoreHttpErrorStatusCodes An array of HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be excluded from error consideration. By default, status codes >= 500 trigger errors. ### [**](#ignoreSslErrors)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L97)optionalinheritedignoreSslErrors **ignoreSslErrors? 
: boolean Inherited from HttpCrawlerOptions.ignoreSslErrors If set to true, SSL certificate errors will be ignored. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from HttpCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from HttpCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from HttpCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from HttpCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from HttpCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from HttpCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. 
Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from HttpCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from HttpCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L92)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from HttpCrawlerOptions.navigationTimeoutSecs Timeout in which the HTTP request to the resource needs to finish, given in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from HttpCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L174)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from HttpCrawlerOptions.persistCookiesPerSession Automatically saves cookies to the Session. Works only if the Session Pool is used. The crawler parses cookies from the response "set-cookie" header and saves or updates them on the session. When the session is used for the next request, the stored cookies are sent in its "Cookie" header. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L136)optionalinheritedpostNavigationHooks **postNavigationHooks? 
: InternalHttpHook<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. Example:
```
postNavigationHooks: [
    async (crawlingContext) => {
        // ...
    },
]
```
### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L122)optionalinheritedpreNavigationHooks **preNavigationHooks? : InternalHttpHook<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.preNavigationHooks Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate. Example:
```
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // ...
    },
]
```
Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook). ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L104)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawlerOptions.proxyConfiguration If set, this crawler will be configured for all connections to use [Apify Proxy](https://console.apify.com/proxy) or your own Proxy URLs provided and rotated according to the configuration. For more information, see the [documentation](https://docs.apify.com/proxy). ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalinheritedrequestHandler **requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.requestHandler User-provided function that performs the logic of the crawler. It is called for each URL to crawl. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. 
The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from HttpCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from HttpCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This makes it possible to configure the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from HttpCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. 
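Putting several of the options above together, a hedged configuration sketch (all values are illustrative, not recommended defaults):

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    respectRobotsTxtFile: true,     // skip URLs disallowed by robots.txt
    maxRequestsPerCrawl: 500,       // safety cap against runaway crawls
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 30,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Crawling ${request.url}`);
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```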
### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from HttpCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from HttpCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from HttpCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from HttpCrawlerOptions.statisticsOptions Customize the way statistics collection works, such as the logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from HttpCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.
```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```
### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from HttpCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#suggestResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L155)optionalinheritedsuggestResponseEncoding **suggestResponseEncoding? 
: string Inherited from HttpCrawlerOptions.suggestResponseEncoding By default this crawler will extract correct encoding from the HTTP response headers. Sadly, some websites use invalid headers; their responses are then decoded as UTF-8 by default, so if such a site actually uses a different encoding, the response will be corrupted. You can use `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use [HttpCrawlerOptions.forceResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#forceResponseEncoding).
```
// Will fall back to windows-1250 encoding if none found
suggestResponseEncoding: 'windows-1250'
```
### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from HttpCrawlerOptions.useSessionPool Basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # LinkeDOMCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\ * *LinkeDOMCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**document](#document) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) * [**window](#window) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The request body of the web page. 
The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for others MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type header: { type, encoding }`. *** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) Inherited from InternalHttpCrawlingContext.crawler ### [**](#document)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L58)document **document: Document ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from InternalHttpCrawlingContext.id ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson **json: JSONData Inherited from InternalHttpCrawlingContext.json The parsed object from JSON string if the response contains the content type application/json. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from InternalHttpCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from InternalHttpCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from InternalHttpCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. 
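A hedged sketch of a request handler that uses several of the context properties above (`document`, `contentType`, `request`); the selector and output fields are illustrative:

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    async requestHandler({ request, document, contentType, pushData, log }) {
        // `document` is the DOM parsed by LinkeDOM from the response body.
        const heading = document.querySelector('h1')?.textContent?.trim() ?? null;
        log.info(`Got ${contentType.type} from ${request.url}`);
        await pushData({ url: request.url, heading });
    },
});
```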
### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse **response: PlainResponse Inherited from InternalHttpCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from InternalHttpCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from InternalHttpCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ### [**](#window)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L52)window **window: Window ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from InternalHttpCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage**
```
async requestHandler({ enqueueLinks }) {
    await enqueueLinks({
        globs: [
            'https://www.example.com/handbags/*',
        ],
    });
},
```
*** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L87)parseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.parseWithCheerio Returns a Cheerio handle, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it will first look for the selector with a 5s timeout. 
**Example usage:**
```
async requestHandler({ parseWithCheerio }) {
    const $ = await parseWithCheerio();
    const title = $('title').text();
},
```
*** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from InternalHttpCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from InternalHttpCrawlingContext.sendRequest Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that.
```
async requestHandler({ sendRequest }) {
    const { body } = await sendRequest({
        // override headers only
        headers: { ... },
    });
},
```
*** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L73)waitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.waitForSelector Waits for an element matching the selector to appear. The timeout defaults to 5s. **Example usage:**
```
async requestHandler({ waitForSelector, parseWithCheerio }) {
    await waitForSelector('article h1');
    const $ = await parseWithCheerio();
    const title = $('title').text();
},
```
*** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # @crawlee/memory-storage ## Index[**](#Index) ### Classes * [**MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) ### Interfaces * [**MemoryStorageOptions](https://crawlee.dev/js/api/memory-storage/interface/MemoryStorageOptions.md) --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/memory-storage ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/memory-storage ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/memory-storage # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/memory-storage ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/memory-storage # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Features[​](#features "Direct link to Features") * support `KVS.listKeys()` `prefix` and `collection` parameters ([#3001](https://github.com/apify/crawlee/issues/3001)) ([5c4726d](https://github.com/apify/crawlee/commit/5c4726df96e358a9bbf44a0cd2760e4e269f0fae)), closes [#2974](https://github.com/apify/crawlee/issues/2974) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/memory-storage # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package 
@crawlee/memory-storage ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/memory-storage ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/memory-storage # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/memory-storage ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * `prolong-` and `deleteRequestLock` `forefront` option ([#2690](https://github.com/apify/crawlee/issues/2690)) ([cba8da3](https://github.com/apify/crawlee/commit/cba8da31312bcc4228662c79c4472e35278627c1)), closes [#2681](https://github.com/apify/crawlee/issues/2681) [#2689](https://github.com/apify/crawlee/issues/2689) [#2669](https://github.com/apify/crawlee/issues/2669) * respect `forefront` option in `MemoryStorage`'s `RequestQueue` ([#2681](https://github.com/apify/crawlee/issues/2681)) ([b0527f9](https://github.com/apify/crawlee/commit/b0527f948b73e3b74ac77e58f9184b34c1adab3a)), closes [#2669](https://github.com/apify/crawlee/issues/2669) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/memory-storage ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/memory-storage ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * **RequestQueueV2:** remove `inProgress` cache, rely solely on locked states ([#2601](https://github.com/apify/crawlee/issues/2601)) ([57fcb08](https://github.com/apify/crawlee/commit/57fcb0804a9f1268039d1e2b246c515ceca7e405)) * Use the correct mutex in memory storage RequestQueueClient ([#2623](https://github.com/apify/crawlee/issues/2623)) ([2fa8a29](https://github.com/apify/crawlee/commit/2fa8a29b815689f041f3d06cc0563e77e02e05f4)) ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/memory-storage # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/memory-storage ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/memory-storage ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/memory-storage ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") **Note:** Version bump only for package @crawlee/memory-storage ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 
3102-2024-06-03") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * improve fix for double extension in KVS with HTML files ([#2505](https://github.com/apify/crawlee/issues/2505)) ([157927d](https://github.com/apify/crawlee/commit/157927d67f42342c20fdf01ef81bdafd7095f0b8)), closes [#2419](https://github.com/apify/crawlee/issues/2419) ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/memory-storage # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * Fixed double extension for screenshots ([#2419](https://github.com/apify/crawlee/issues/2419)) ([e8b39c4](https://github.com/apify/crawlee/commit/e8b39c41764726280c995e52fa7d79a9240d993e)), closes [#1980](https://github.com/apify/crawlee/issues/1980) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/memory-storage ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/memory-storage # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/memory-storage ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/memory-storage ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/memory-storage # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-1 "Direct link to Features") * `KeyValueStore.recordExists()` ([#2339](https://github.com/apify/crawlee/issues/2339)) ([8507a65](https://github.com/apify/crawlee/commit/8507a65d1ad079f64c752a6ddb1d8fac9b494228)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/memory-storage ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/memory-storage ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/memory-storage # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * **MemoryStorage:** lock request JSON file when reading to support multiple process crawling ([#2215](https://github.com/apify/crawlee/issues/2215)) ([eb84ce9](https://github.com/apify/crawlee/commit/eb84ce9ce5540b72d5799b1f66c80938d57bc1cc)) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/memory-storage ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package 
@crawlee/memory-storage # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * **MemoryStorage:** ignore invalid files for request queues ([#2132](https://github.com/apify/crawlee/issues/2132)) ([fa58581](https://github.com/apify/crawlee/commit/fa58581b530ef3ad89bdd71403df2d2e4f06c59f)), closes [#1985](https://github.com/apify/crawlee/issues/1985) ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-2 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/memory-storage # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * cleanup worker stuff from memory storage to fix `vitest` ([#2004](https://github.com/apify/crawlee/issues/2004)) ([d2e098c](https://github.com/apify/crawlee/commit/d2e098cab62c700a5c58fcf43a5bcf9f492d71ec)), closes [#1999](https://github.com/apify/crawlee/issues/1999) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/memory-storage ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/memory-storage # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/memory-storage ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 
333-2023-05-31") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * **MemoryStorage:** handle EXDEV errors when purging storages ([#1932](https://github.com/apify/crawlee/issues/1932)) ([e656050](https://github.com/apify/crawlee/commit/e6560507243f5e2d0b126160616573f13e5998e1)) ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * **MemoryStorage:** cache requests in `RequestQueue` ([#1899](https://github.com/apify/crawlee/issues/1899)) ([063dcd1](https://github.com/apify/crawlee/commit/063dcd1c9e6652cd316cc0e8c4f4e4bbb70c246e)) ### Features[​](#features-3 "Direct link to Features") * RQv2 memory storage support ([#1874](https://github.com/apify/crawlee/issues/1874)) ([049486b](https://github.com/apify/crawlee/commit/049486b772cc2accd2d2d226d8c8726e5ab933a9)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **MemoryStorage:** handling of readable streams for key-value stores when setting records ([#1852](https://github.com/apify/crawlee/issues/1852)) ([a5ee37d](https://github.com/apify/crawlee/commit/a5ee37d7e245f004785fc03220e37aeafdfa0e81)), closes [#1843](https://github.com/apify/crawlee/issues/1843) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * **MemoryStorage:** request queues race conditions causing crashes ([#1806](https://github.com/apify/crawlee/issues/1806)) ([083a9db](https://github.com/apify/crawlee/commit/083a9db9ebcddd3fa886631234c790d4c5bcdf86)), closes [#1792](https://github.com/apify/crawlee/issues/1792) * **MemoryStorage:** RequestQueue should respect `forefront` ([#1816](https://github.com/apify/crawlee/issues/1816)) ([b68e86a](https://github.com/apify/crawlee/commit/b68e86a97954bcbe30fde802fed5f263016fffe2)), closes [#1787](https://github.com/apify/crawlee/issues/1787) * **MemoryStorage:** RequestQueue#handledRequestCount should update ([#1817](https://github.com/apify/crawlee/issues/1817)) ([a775e4a](https://github.com/apify/crawlee/commit/a775e4afea20d0b31492f44b90f61b6a903491b6)), closes [#1764](https://github.com/apify/crawlee/issues/1764) ### Features[​](#features-4 "Direct link to Features") * add basic support for `setStatusMessage` ([#1790](https://github.com/apify/crawlee/issues/1790)) ([c318980](https://github.com/apify/crawlee/commit/c318980ec11d211b1a5c9e6bdbe76198c5d895be)) * move the status message implementation to Crawlee, noop in storage ([#1808](https://github.com/apify/crawlee/issues/1808)) ([99c3fdc](https://github.com/apify/crawlee/commit/99c3fdc18030b7898e6b6d149d6d94fab7881f09)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * **MemoryStorage:** request queues saved in the wrong place ([#1779](https://github.com/apify/crawlee/issues/1779)) ([19409db](https://github.com/apify/crawlee/commit/19409dbd614560a73c97ef6e00997e482573d2ff)) ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/memory-storage # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug 
Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * Correctly compute `pendingRequestCount` in request queue ([#1765](https://github.com/apify/crawlee/issues/1765)) ([946535f](https://github.com/apify/crawlee/commit/946535f2338086e13c71ff70129e7a1f6bfd275d)), closes [/github.com/apify/crawlee/blob/master/packages/memory-storage/src/resource-clients/request-queue.ts#L291-L298](https://github.com//github.com/apify/crawlee/blob/master/packages/memory-storage/src/resource-clients/request-queue.ts/issues/L291-L298) * **KeyValueStore:** big buffers should not crash ([#1734](https://github.com/apify/crawlee/issues/1734)) ([2f682f7](https://github.com/apify/crawlee/commit/2f682f7ddd189cad11a3f5e7655ac6243444ff74)), closes [#1732](https://github.com/apify/crawlee/issues/1732) [#1710](https://github.com/apify/crawlee/issues/1710) * **memory-storage:** dont fail when storage already purged ([#1737](https://github.com/apify/crawlee/issues/1737)) ([8694027](https://github.com/apify/crawlee/commit/86940273dbac2d13294140962f816f66582684ff)), closes [#1736](https://github.com/apify/crawlee/issues/1736) * **utils:** add missing dependency on `ow` ([bf0e03c](https://github.com/apify/crawlee/commit/bf0e03cc6ddc103c9337de5cd8dce9bc86c369a3)), closes [#1716](https://github.com/apify/crawlee/issues/1716) ### Features[​](#features-5 "Direct link to Features") * **MemoryStorage:** read from fs if persistStorage is enabled, ram only otherwise ([#1761](https://github.com/apify/crawlee/issues/1761)) ([e903980](https://github.com/apify/crawlee/commit/e9039809a0c0af0bc086be1f1400d18aa45ae490)) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/memory-storage ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/memory-storage # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/memory-storage ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * key value stores emitting an error when multiple write promises ran in parallel ([#1460](https://github.com/apify/crawlee/issues/1460)) ([f201cca](https://github.com/apify/crawlee/commit/f201cca4a99d1c8b3e87be0289d5b3b363048f09)) --- # MemoryStorage Represents a storage capable of working with datasets, KV stores and request queues. 
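Before the detailed reference below, here is a minimal sketch of driving the client directly. The `getOrCreate()` and `setRecord()` calls follow the generic collection/resource client interfaces from `@crawlee/types`, so treat the exact call shapes and the store name as illustrative assumptions rather than excerpts from this page; when running locally, Crawlee constructs a `MemoryStorage` client like this for you automatically.

```
import { MemoryStorage } from '@crawlee/memory-storage';

// Keep everything in RAM only; with persistStorage: false nothing is
// mirrored to the local storage directory on disk.
const storageClient = new MemoryStorage({ persistStorage: false });

// Open (or create) a named key-value store and write a single record to it.
// getOrCreate() and setRecord() are assumed from the storage client
// interfaces in @crawlee/types.
const { id } = await storageClient.keyValueStores().getOrCreate('example-store');
await storageClient.keyValueStore(id).setRecord({
    key: 'OUTPUT',
    value: { finishedAt: new Date().toISOString() },
});

// Call teardown() at the end of the process so any pending writes settle.
await storageClient.teardown();
```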
### Implements * [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**datasetClientsHandled](#datasetClientsHandled) * [**datasetsDirectory](#datasetsDirectory) * [**keyValueStoresDirectory](#keyValueStoresDirectory) * [**keyValueStoresHandled](#keyValueStoresHandled) * [**localDataDirectory](#localDataDirectory) * [**persistStorage](#persistStorage) * [**requestQueuesDirectory](#requestQueuesDirectory) * [**requestQueuesHandled](#requestQueuesHandled) * [**writeMetadata](#writeMetadata) ### Methods * [**dataset](#dataset) * [**datasets](#datasets) * [**keyValueStore](#keyValueStore) * [**keyValueStores](#keyValueStores) * [**purge](#purge) * [**requestQueue](#requestQueue) * [**requestQueues](#requestQueues) * [**setStatusMessage](#setStatusMessage) * [**teardown](#teardown) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L52)constructor * ****new MemoryStorage**(options): [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) - #### Parameters * ##### options: [MemoryStorageOptions](https://crawlee.dev/js/api/memory-storage/interface/MemoryStorageOptions.md) = {} #### Returns [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) ## Properties[**](#Properties) ### [**](#datasetClientsHandled)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L49)readonlydatasetClientsHandled **datasetClientsHandled: DatasetClient\\[] = \[] ### [**](#datasetsDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L42)readonlydatasetsDirectory **datasetsDirectory: string ### [**](#keyValueStoresDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L43)readonlykeyValueStoresDirectory **keyValueStoresDirectory: string ### [**](#keyValueStoresHandled)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L48)readonlykeyValueStoresHandled **keyValueStoresHandled: KeyValueStoreClient\[] = \[] ### [**](#localDataDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L41)readonlylocalDataDirectory **localDataDirectory: string ### [**](#persistStorage)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L46)readonlypersistStorage **persistStorage: boolean ### [**](#requestQueuesDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L44)readonlyrequestQueuesDirectory **requestQueuesDirectory: string ### [**](#requestQueuesHandled)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L50)readonlyrequestQueuesHandled **requestQueuesHandled: RequestQueueClient\[] = \[] ### [**](#writeMetadata)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L45)readonlywriteMetadata **writeMetadata: boolean ## Methods[**](#Methods) ### [**](#dataset)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L93)dataset * ****dataset**\(id): [DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md)\ - Implementation of storage.StorageClient.dataset #### Parameters * ##### id: string #### Returns 
[DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md) ### [**](#datasets)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L86)datasets * ****datasets**(): [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) - Implementation of storage.StorageClient.datasets #### Returns [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) ### [**](#keyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L106)keyValueStore * ****keyValueStore**(id): [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) - Implementation of storage.StorageClient.keyValueStore #### Parameters * ##### id: string #### Returns [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) ### [**](#keyValueStores)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L99)keyValueStores * ****keyValueStores**(): [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) - Implementation of storage.StorageClient.keyValueStores #### Returns [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) ### [**](#purge)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L149)purge * ****purge**(): Promise\<void> - Implementation of storage.StorageClient.purge Cleans up the default storage directories before the run starts: * local directory containing the default dataset; * all records from the default key-value store in the local directory, except for the "INPUT" key; * local directory containing the default request queue.
*** #### Returns Promise\<void> ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L119)requestQueue * ****requestQueue**(id, options): [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) - Implementation of storage.StorageClient.requestQueue #### Parameters * ##### id: string * ##### options: [RequestQueueOptions](https://crawlee.dev/js/api/types/interface/RequestQueueOptions.md) = {} #### Returns [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) ### [**](#requestQueues)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L112)requestQueues * ****requestQueues**(): [RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) - Implementation of storage.StorageClient.requestQueues #### Returns [RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L134)setStatusMessage * ****setStatusMessage**(message, options): Promise\<void> - Implementation of storage.StorageClient.setStatusMessage #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\<void> ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L198)teardown * ****teardown**(): Promise\<void> - Implementation of storage.StorageClient.teardown This method should be called at the end of the process to ensure all data is saved. *** #### Returns Promise\<void> --- # MemoryStorageOptions ## Index[**](#Index) ### Properties * [**localDataDirectory](#localDataDirectory) * [**persistStorage](#persistStorage) * [**writeMetadata](#writeMetadata) ## Properties[**](#Properties) ### [**](#localDataDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L23)optionallocalDataDirectory **localDataDirectory? : string = process.env.CRAWLEE\_STORAGE\_DIR ?? './storage' Path to the directory where the data will also be saved. ### [**](#persistStorage)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L37)optionalpersistStorage **persistStorage? : boolean = true Whether the memory storage should also write its stored content to the disk. You can also disable this by setting the `CRAWLEE_PERSIST_STORAGE` environment variable to `false`. ### [**](#writeMetadata)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L29)optionalwriteMetadata **writeMetadata? : boolean = process.env.DEBUG?.includes('\*') ?? process.env.DEBUG?.includes('crawlee:memory-storage') ?? false Whether to also write optional metadata files when storing to disk. --- # @crawlee/playwright Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox and WebKit browsers with [Playwright](https://github.com/microsoft/playwright). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `Playwright` uses a headless browser to download web pages and extract data, it is useful for crawling websites that require JavaScript to be executed.
If the target website doesn't need JavaScript, consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [PlaywrightCrawlerOptions.requestList](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestList) or [PlaywrightCrawlerOptions.requestQueue](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestQueue) constructor options, respectively. If both [PlaywrightCrawlerOptions.requestList](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestList) and [PlaywrightCrawlerOptions.requestQueue](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `PlaywrightCrawler` opens a new Chrome page (i.e. tab) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by user as the [PlaywrightCrawlerOptions.requestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [PlaywrightCrawlerOptions.autoscaledPoolOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#autoscaledPoolOptions) parameter of the `PlaywrightCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) are available directly in the `PlaywrightCrawler` constructor. Note that the pool of Playwright instances is internally managed by the [BrowserPool](https://github.com/apify/browser-pool) class. 
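For instance, the concurrency shortcuts can sit next to the pass-through `autoscaledPoolOptions` like this; a minimal sketch where the concrete values and the `desiredConcurrencyRatio` tweak are illustrative assumptions, not recommended settings:

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Convenience shortcuts for the two most common AutoscaledPool options.
    minConcurrency: 2,
    maxConcurrency: 16,
    // Any other AutoscaledPool option can be passed through unchanged.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9,
    },
    async requestHandler({ page, enqueueLinks }) {
        console.log(await page.title());
        // Enqueue same-site links into the request queue for recursive crawling.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```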
## Example usage[​](#example-usage "Direct link to Example usage")

```
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // This function is called to extract data from a single web page
        // 'page' is an instance of Playwright.Page with page.goto(request.url) already called
        // 'request' is an instance of Request class with information about the page to load
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
    async failedRequestHandler({ request }) {
        // This function is called when the crawling of a request failed too many times
        await Dataset.pushData({
            url: request.url,
            succeeded: false,
            errors: request.errorMessages,
        });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

## Index[**](#Index) ### Crawlers * [**PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/playwright-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/playwright-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/playwright-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/playwright-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/playwright-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/playwright-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/playwright-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/playwright-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/playwright-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/playwright-crawler.md#BLOCKED_STATUS_CODES) * [**BrowserCrawler](https://crawlee.dev/js/api/playwright-crawler.md#BrowserCrawler) * [**BrowserCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler.md#BrowserCrawlerOptions) * [**BrowserCrawlingContext](https://crawlee.dev/js/api/playwright-crawler.md#BrowserCrawlingContext) * [**BrowserErrorHandler](https://crawlee.dev/js/api/playwright-crawler.md#BrowserErrorHandler) * [**BrowserHook](https://crawlee.dev/js/api/playwright-crawler.md#BrowserHook) * [**BrowserLaunchContext](https://crawlee.dev/js/api/playwright-crawler.md#BrowserLaunchContext) * [**BrowserRequestHandler](https://crawlee.dev/js/api/playwright-crawler.md#BrowserRequestHandler) * [**checkStorageAccess](https://crawlee.dev/js/api/playwright-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/playwright-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/playwright-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/playwright-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/playwright-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/playwright-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/playwright-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/playwright-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/playwright-crawler.md#CrawlerRunOptions) *
[**CrawlingContext](https://crawlee.dev/js/api/playwright-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/playwright-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/playwright-crawler.md#CreateContextOptions) * [**CreateSession](https://crawlee.dev/js/api/playwright-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/playwright-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/playwright-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/playwright-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/playwright-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/playwright-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/playwright-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/playwright-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/playwright-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/playwright-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/playwright-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/playwright-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/playwright-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/playwright-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/playwright-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/playwright-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/playwright-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/playwright-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/playwright-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/playwright-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/playwright-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/playwright-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/playwright-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/playwright-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/playwright-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/playwright-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/playwright-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/playwright-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/playwright-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/playwright-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/playwright-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/playwright-crawler.md#KeyValueStore) * 
[**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/playwright-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/playwright-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/playwright-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/playwright-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/playwright-crawler.md#log) * [**Log](https://crawlee.dev/js/api/playwright-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/playwright-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/playwright-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/playwright-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/playwright-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/playwright-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/playwright-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/playwright-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/playwright-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/playwright-crawler.md#PersistenceOptions) * [**PlaywrightDirectNavigationOptions](https://crawlee.dev/js/api/playwright-crawler.md#PlaywrightDirectNavigationOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/playwright-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/playwright-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/playwright-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/playwright-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/playwright-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/playwright-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/playwright-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/playwright-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/playwright-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/playwright-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/playwright-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/playwright-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/playwright-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/playwright-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/playwright-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/playwright-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/playwright-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/playwright-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/playwright-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/playwright-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/playwright-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/playwright-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestListOptions) * 
[**RequestListSourcesFunction](https://crawlee.dev/js/api/playwright-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/playwright-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/playwright-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/playwright-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/playwright-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/playwright-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/playwright-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/playwright-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/playwright-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/playwright-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/playwright-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/playwright-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/playwright-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/playwright-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/playwright-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/playwright-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/playwright-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/playwright-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/playwright-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/playwright-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/playwright-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/playwright-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/playwright-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/playwright-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/playwright-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/playwright-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/playwright-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/playwright-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/playwright-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/playwright-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/playwright-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/playwright-crawler.md#StatisticState) * 
[**StatusMessageCallback](https://crawlee.dev/js/api/playwright-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/playwright-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/playwright-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/playwright-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/playwright-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/playwright-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/playwright-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/playwright-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/playwright-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/playwright-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/playwright-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/playwright-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/playwright-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/playwright-crawler.md#withCheckedStorageAccess) * [**playwrightClickElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md) * [**playwrightUtils](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md) * [**AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) * [**RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md) * [**AdaptivePlaywrightCrawlerContext](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md) * [**AdaptivePlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerOptions.md) * [**PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) * [**PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md) * [**PlaywrightHook](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightHook.md) * [**PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) * [**PlaywrightRequestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightRequestHandler.md) * [**PlaywrightGotoOptions](https://crawlee.dev/js/api/playwright-crawler.md#PlaywrightGotoOptions) * [**RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) * [**createAdaptivePlaywrightRouter](https://crawlee.dev/js/api/playwright-crawler/function/createAdaptivePlaywrightRouter.md) * [**createPlaywrightRouter](https://crawlee.dev/js/api/playwright-crawler/function/createPlaywrightRouter.md) * [**launchPlaywright](https://crawlee.dev/js/api/playwright-crawler/function/launchPlaywright.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports 
[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#BrowserCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L314)BrowserCrawler Re-exports [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) ### [**](#BrowserCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L75)BrowserCrawlerOptions Re-exports [BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md) ### [**](#BrowserCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L52)BrowserCrawlingContext Re-exports [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) ### [**](#BrowserErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L67)BrowserErrorHandler Re-exports [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler) ### 
[**](#BrowserHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L70)BrowserHook Re-exports [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook) ### [**](#BrowserLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L14)BrowserLaunchContext Re-exports [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) ### [**](#BrowserRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L64)BrowserRequestHandler Re-exports [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### 
[**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### 
[**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports 
[HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### 
[**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#PlaywrightDirectNavigationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/index.ts#L9)PlaywrightDirectNavigationOptions Renames and re-exports [DirectNavigationOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#DirectNavigationOptions) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### 
[**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### 
[**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### 
[**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports 
[SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports 
[SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#PlaywrightGotoOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L26)PlaywrightGotoOptions **PlaywrightGotoOptions: Dictionary & Parameters\\[1] ### [**](#RenderingType)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L7)RenderingType **RenderingType: clientOnly | static --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * use shared enqueue links wrapper in `AdaptivePlaywrightCrawler` ([#3188](https://github.com/apify/crawlee/issues/3188)) ([9569d19](https://github.com/apify/crawlee/commit/9569d191933325d93f6c66754274b63fd272fc59)) ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/playwright ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/playwright # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/playwright ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/playwright # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * respect `exclude` option in `enqueueLinksByClickingElements` ([#3058](https://github.com/apify/crawlee/issues/3058)) ([013eb02](https://github.com/apify/crawlee/commit/013eb028b6ecf05f83f8790a4a6164b9c4873733)) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * call `onSkippedRequest` for `AdaptivePlaywrightCrawler.enqueueLinks` ([#3043](https://github.com/apify/crawlee/issues/3043)) ([fc23d34](https://github.com/apify/crawlee/commit/fc23d34ba7fa0daded253a0a958fe9b7bb32e5ca)), closes [#3026](https://github.com/apify/crawlee/issues/3026) [#3039](https://github.com/apify/crawlee/issues/3039) ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * Fix link filtering in enqueueLinks in AdaptivePlaywrightCrawler ([#3021](https://github.com/apify/crawlee/issues/3021)) ([8a3b6f8](https://github.com/apify/crawlee/commit/8a3b6f8847586eb3b0865fe93053468e1605399c)), closes [#2525](https://github.com/apify/crawlee/issues/2525) ### Features[​](#features "Direct link to Features") * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * Persist rendering type detection results in `AdaptivePlaywrightCrawler` ([#2987](https://github.com/apify/crawlee/issues/2987)) ([76431ba](https://github.com/apify/crawlee/commit/76431badf8a55892303d9b53fe23e029fad9cb18)), closes [#2899](https://github.com/apify/crawlee/issues/2899) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/playwright ## 
[3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * ensure `PlaywrightGotoOptions` won't result in `unknown` when playwright is not installed ([#2995](https://github.com/apify/crawlee/issues/2995)) ([93eba38](https://github.com/apify/crawlee/commit/93eba38b9cd88e543717f885b2c5644f63979bc9)), closes [#2994](https://github.com/apify/crawlee/issues/2994) * extract only `body` from `iframe` elements ([#2986](https://github.com/apify/crawlee/issues/2986)) ([c36166e](https://github.com/apify/crawlee/commit/c36166e24887ca6de12f0c60ef010256fa830c31)), closes [#2979](https://github.com/apify/crawlee/issues/2979) ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") ### Features[​](#features-1 "Direct link to Features") * Allow the AdaptivePlaywrightCrawler result comparator to signal an inconclusive result ([#2975](https://github.com/apify/crawlee/issues/2975)) ([7ba8906](https://github.com/apify/crawlee/commit/7ba8906158e2dbc474de1b1e89937562abe76877)) ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/playwright ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * Fix useState behavior in adaptive crawler ([#2941](https://github.com/apify/crawlee/issues/2941)) ([5282381](https://github.com/apify/crawlee/commit/52823818bd66995c1512b433e6d82755c487cb58)) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/playwright ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/playwright # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Features[​](#features-2 "Direct link to Features") * **playwright:** add `handleCloudflareChallenge` helper ([#2865](https://github.com/apify/crawlee/issues/2865)) ([9a1725f](https://github.com/apify/crawlee/commit/9a1725f7b87fb70194fc31858500cb35639fb964)) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/playwright ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/playwright # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * ignore errors from iframe content extraction ([#2714](https://github.com/apify/crawlee/issues/2714)) ([627e5c2](https://github.com/apify/crawlee/commit/627e5c2fbadce63c7e631217cd0e735597c0ce08)), closes [#2708](https://github.com/apify/crawlee/issues/2708) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * **core:** accept `UInt8Array` in 
`KVS.setValue()` ([#2682](https://github.com/apify/crawlee/issues/2682)) ([8ef0e60](https://github.com/apify/crawlee/commit/8ef0e60ca6fb2f4ec1b0d1aec6dcd53fcfb398b3)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/playwright ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/playwright ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/playwright ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/playwright # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-3 "Direct link to Features") * add `iframe` expansion to `parseWithCheerio` in browsers ([#2542](https://github.com/apify/crawlee/issues/2542)) ([328d085](https://github.com/apify/crawlee/commit/328d08598807782b3712bd543e394fe9a000a85d)), closes [#2507](https://github.com/apify/crawlee/issues/2507) * add `ignoreIframes` opt-out from the Cheerio iframe expansion ([#2562](https://github.com/apify/crawlee/issues/2562)) ([474a8dc](https://github.com/apify/crawlee/commit/474a8dc06a567cde0651d385fdac9c350ddf4508)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * allow creating new adaptive crawler instance without any parameters ([9b7f595](https://github.com/apify/crawlee/commit/9b7f595a2d70cab5c50e188581b21b0ef7e51780)) * fix detection of HTTP site when using the `useState` in adaptive crawler ([#2530](https://github.com/apify/crawlee/issues/2530)) ([7e195c1](https://github.com/apify/crawlee/commit/7e195c17cf1d9beae7f6f068fe505f1334a3a5b3)) * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **playwright:** allow passing new context options in `launchOptions` on type level ([0519d40](https://github.com/apify/crawlee/commit/0519d4099d257bbc40ed091c131a674ea5f8d731)), closes [#1849](https://github.com/apify/crawlee/issues/1849) ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * **adaptive-crawler:** log only once for the committed request handler execution ([#2524](https://github.com/apify/crawlee/issues/2524)) ([533bd3f](https://github.com/apify/crawlee/commit/533bd3f04671d54273f0861664d316269d08fbfb)) * respect implicit router when no `requestHandler` is provided in `AdaptiveCrawler` ([#2518](https://github.com/apify/crawlee/issues/2518)) ([31083aa](https://github.com/apify/crawlee/commit/31083aa27ddd51827f73c7ac4290379ec7a81283)) ### Features[​](#features-4 "Direct link to 
Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/playwright ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/playwright # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * do not drop statistics on migration/resurrection/resume ([#2462](https://github.com/apify/crawlee/issues/2462)) ([8ce7dd4](https://github.com/apify/crawlee/commit/8ce7dd4ae6a3718dac95e784a53bd5661c827edc)) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/playwright ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/playwright # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Features[​](#features-5 "Direct link to Features") * `createAdaptivePlaywrightRouter` utility ([#2415](https://github.com/apify/crawlee/issues/2415)) ([cee4778](https://github.com/apify/crawlee/commit/cee477814e4901d025c5376205ad884c2fe08e0e)), closes [#2407](https://github.com/apify/crawlee/issues/2407) * expand #shadow-root elements automatically in `parseWithCheerio` helper ([#2396](https://github.com/apify/crawlee/issues/2396)) ([a05b3a9](https://github.com/apify/crawlee/commit/a05b3a93a9b57926b353df0e79d846b5024c42ac)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") ### Features[​](#features-6 "Direct link to Features") * implement global storage access checking and use it to prevent unwanted side effects in adaptive crawler ([#2371](https://github.com/apify/crawlee/issues/2371)) ([fb3b7da](https://github.com/apify/crawlee/commit/fb3b7da402522ddff8c7394ac1253ba8aeac984c)), closes [#2364](https://github.com/apify/crawlee/issues/2364) ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/playwright # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-7 "Direct link to Features") * adaptive playwright crawler ([#2316](https://github.com/apify/crawlee/issues/2316)) ([8e4218a](https://github.com/apify/crawlee/commit/8e4218ada03cf485751def46f8c465b2d2a825c7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/playwright ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/playwright ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump 
only for package @crawlee/playwright # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/playwright ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/playwright ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Features[​](#features-8 "Direct link to Features") * **puppeteer:** enable `new` headless mode ([#1910](https://github.com/apify/crawlee/issues/1910)) ([7fc999c](https://github.com/apify/crawlee/commit/7fc999cf4658ca69b97f16d434444081998470f4)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * add `skipNavigation` option to `enqueueLinks` ([#2153](https://github.com/apify/crawlee/issues/2153)) ([118515d](https://github.com/apify/crawlee/commit/118515d2ba534b99be2f23436f6abe41d66a8e07)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/playwright ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/playwright ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/playwright ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * allow to use any version of puppeteer or playwright ([#2102](https://github.com/apify/crawlee/issues/2102)) ([0cafceb](https://github.com/apify/crawlee/commit/0cafceb2966d430dd1b2a1b619fe66da1c951f4c)), closes [#2101](https://github.com/apify/crawlee/issues/2101) ### Features[​](#features-9 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") ### Bug Fixes[​](#bug-fixes-15 "Direct link to Bug Fixes") * various helpers opening KVS now respect Configuration ([#2071](https://github.com/apify/crawlee/issues/2071)) ([59dbb16](https://github.com/apify/crawlee/commit/59dbb164699774e5a6718e98d0a4e8f630f35323)) ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-16 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/playwright ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 
"Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/playwright # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-10 "Direct link to Features") * add `closeCookieModals` context helper for Playwright and Puppeteer ([#1927](https://github.com/apify/crawlee/issues/1927)) ([98d93bb](https://github.com/apify/crawlee/commit/98d93bb6713ec219baa83db2ad2cd1d7621a3339)) * **core:** use `RequestQueue.addBatchedRequests()` in `enqueueLinks` helper ([4d61ca9](https://github.com/apify/crawlee/commit/4d61ca934072f8bbb680c842d8b1c9a4452ee73a)), closes [#1995](https://github.com/apify/crawlee/issues/1995) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/playwright ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/playwright # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Features[​](#features-11 "Direct link to Features") * infiniteScroll has maxScrollHeight limit ([#1945](https://github.com/apify/crawlee/issues/1945)) ([44997bb](https://github.com/apify/crawlee/commit/44997bba5bbf33ddb7dbac2f3e26d4bee60d4f47)) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/playwright ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-12 "Direct link to Features") * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-17 "Direct link to Bug Fixes") * infiniteScroll() not working in Firefox ([#1826](https://github.com/apify/crawlee/issues/1826)) ([4286c5d](https://github.com/apify/crawlee/commit/4286c5d29b94aec3f4d3835bbf36b7fafcaec8f0)), closes [#1821](https://github.com/apify/crawlee/issues/1821) * **jsdom:** delay closing of the window and add some polyfills ([2e81618](https://github.com/apify/crawlee/commit/2e81618afb5f3890495e3e5fcfa037eb3319edc9)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) **Note:** Version bump only for package @crawlee/playwright ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/playwright ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/playwright # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-18 "Direct link to Bug Fixes") * allow `userData` option in `enqueueLinksByClickingElements` ([#1749](https://github.com/apify/crawlee/issues/1749)) ([736f85d](https://github.com/apify/crawlee/commit/736f85d4a3b99a06d0f99f91e33e71976a9458a3)), closes [#1617](https://github.com/apify/crawlee/issues/1617) * declare missing 
dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) * update playwright to 1.29.2 and make peer dep. less strict ([#1735](https://github.com/apify/crawlee/issues/1735)) ([c654fcd](https://github.com/apify/crawlee/commit/c654fcdea06fb203b7952ed97650190cc0e74394)), closes [#1723](https://github.com/apify/crawlee/issues/1723) ### Features[​](#features-13 "Direct link to Features") * add `forefront` option to all `enqueueLinks` variants ([#1760](https://github.com/apify/crawlee/issues/1760)) ([a01459d](https://github.com/apify/crawlee/commit/a01459dffb51162e676354f0aa4811a1d36affa9)), closes [#1483](https://github.com/apify/crawlee/issues/1483) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/playwright ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/playwright ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/playwright ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/playwright # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/playwright ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Features[​](#features-14 "Direct link to Features") * enable tab-as-a-container for Firefox ([#1456](https://github.com/apify/crawlee/issues/1456)) ([ae5ba4f](https://github.com/apify/crawlee/commit/ae5ba4f15fd6d14f444486234753ce1781c74cc8)) --- # AdaptivePlaywrightCrawler experimental An extension of [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) that uses a more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible. 
**Example usage:**

```
const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ querySelector, pushData, enqueueLinks, request, log }) {
        // This function is called to extract data from a single web page
        const $prices = await querySelector('span.price');
        await pushData({
            url: request.url,
            price: $prices.filter(':contains("$")').first().text(),
        });
        await enqueueLinks({ selector: '.pagination a' });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

### Hierarchy * [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) * *AdaptivePlaywrightCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**browserPool](#browserPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**launchContext](#launchContext) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L282)constructor * ****new AdaptivePlaywrightCrawler**(options, config): [AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) - Overrides PlaywrightCrawler.constructor experimental #### Parameters * ##### options: [AdaptivePlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerOptions.md) = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPoolexperimental **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from PlaywrightCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort).
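For context, a minimal illustrative sketch (not part of the reference) of driving a running crawl through `crawler.autoscaledPool`. The option values, the timeout trigger and the `resume()` call are assumptions made only for this example; `pause()` and `abort()` are the members referenced in the note above.

```
import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ request, pushData }) {
        await pushData({ url: request.url });
    },
});

// Start the run without awaiting it, so the pool can be reached while it is active.
const finished = crawler.run(['http://www.example.com/page-1']);

setTimeout(async () => {
    // autoscaledPool is only defined once crawler.run() has been called.
    const pool = crawler.autoscaledPool;
    if (!pool) return;
    await pool.pause();    // let in-flight requests finish, start no new ones
    // ... wait for some external condition ...
    pool.resume();         // assumption for this sketch: resumes a paused pool
    // await pool.abort(); // alternatively, end the run early
}, 10_000);

await finished;
```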
### [**](#browserPool)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L329)inheritedbrowserPoolexperimental **browserPool: [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)<{ browserPlugins: \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] }, \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)], [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? 
: null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> Inherited from PlaywrightCrawler.browserPool A reference to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class that manages the crawler's browsers. ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L284)readonlyinheritedconfigexperimental **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from PlaywrightCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBeforeexperimental **hasFinishedBefore: boolean = false Inherited from PlaywrightCrawler.hasFinishedBefore ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L331)inheritedlaunchContextexperimental **launchContext: [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ Inherited from PlaywrightCrawler.launchContext ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlogexperimental **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from PlaywrightCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L324)optionalinheritedproxyConfigurationexperimental **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from PlaywrightCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestListexperimental **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from PlaywrightCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueueexperimental **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from PlaywrightCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L279)readonlyrouterexperimental **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<[AdaptivePlaywrightCrawlerContext](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md)\> = ... Overrides PlaywrightCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). 
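As a hedged illustration of the default router described above (the label, selector, and handler bodies are made up for this sketch), handlers can be registered on `crawler.router` instead of passing a `requestHandler`:

```
import { AdaptivePlaywrightCrawler } from 'crawlee';

// No requestHandler is given, so the crawler falls back to its default Router.
const crawler = new AdaptivePlaywrightCrawler({ renderingTypeDetectionRatio: 0.1 });

// Requests enqueued with { label: 'DETAIL' } are dispatched to this handler.
crawler.router.addHandler('DETAIL', async ({ request, pushData }) => {
    await pushData({ url: request.url });
});

// Fallback handler for requests without a matching label.
crawler.router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.detail', label: 'DETAIL' });
});

await crawler.run(['http://www.example.com/']);
```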
### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunningexperimental **running: boolean = false Inherited from PlaywrightCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPoolexperimental **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from PlaywrightCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L272)readonlystatsexperimental **stats: AdaptivePlaywrightCrawlerStatistics Overrides PlaywrightCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from PlaywrightCrawler.addRequests experimental Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from PlaywrightCrawler.exportData experimental Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. 
*** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from PlaywrightCrawler.getData experimental Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from PlaywrightCrawler.getDataset experimental Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from PlaywrightCrawler.getRequestQueue experimental #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from PlaywrightCrawler.pushData experimental Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from PlaywrightCrawler.run experimental Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. 
* ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from PlaywrightCrawler.setStatusMessage experimental This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from PlaywrightCrawler.stop experimental Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from PlaywrightCrawler.useState experimental #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # PlaywrightCrawler Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox and WebKit browsers with [Playwright](https://github.com/microsoft/playwright). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `Playwright` uses a headless browser to download web pages and extract data, it is useful for crawling websites that require JavaScript execution. If the target website doesn't need JavaScript, consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [PlaywrightCrawlerOptions.requestList](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestList) or [PlaywrightCrawlerOptions.requestQueue](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestQueue) constructor options, respectively. If both [PlaywrightCrawlerOptions.requestList](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestList) and [PlaywrightCrawlerOptions.requestQueue](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times.
The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `PlaywrightCrawler` opens a new browser page (i.e. tab) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by the user as the [PlaywrightCrawlerOptions.requestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [PlaywrightCrawlerOptions.autoscaledPoolOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#autoscaledPoolOptions) parameter of the `PlaywrightCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) are available directly in the `PlaywrightCrawler` constructor. Note that the pool of Playwright instances is internally managed by the [BrowserPool](https://github.com/apify/browser-pool) class. **Example usage:**

```
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // This function is called to extract data from a single web page.
        // 'page' is an instance of Playwright.Page with page.goto(request.url) already called.
        // 'request' is an instance of the Request class with information about the page to load.
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
    async failedRequestHandler({ request }) {
        // This function is called when the crawling of a request failed too many times.
        await Dataset.pushData({
            url: request.url,
            succeeded: false,
            errors: request.errorMessages,
        });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

### Hierarchy * [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md)<{ browserPlugins: \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] }, LaunchOptions, [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)> * *PlaywrightCrawler* * [AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**browserPool](#browserPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**launchContext](#launchContext) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L204)constructor * ****new
PlaywrightCrawler**(options, config): [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) - Overrides BrowserCrawler< { browserPlugins: \[PlaywrightPlugin] }, LaunchOptions, PlaywrightCrawlingContext >.constructor All `PlaywrightCrawler` parameters are passed via an options object. *** #### Parameters * ##### options: [PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from BrowserCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#browserPool)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L329)inheritedbrowserPool **browserPool: [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)<{ browserPlugins: \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] }, \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)], [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> Inherited from BrowserCrawler.browserPool A reference to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class that manages the crawler's browsers. ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L206)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from BrowserCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from BrowserCrawler.hasFinishedBefore ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L331)inheritedlaunchContext **launchContext: [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ Inherited from BrowserCrawler.launchContext ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BrowserCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L324)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from BrowserCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BrowserCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BrowserCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. 
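As a hedged illustration of this option, the sketch below opens a named [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) explicitly and hands it to the crawler instead of relying on the implicit default queue; the queue name and URL are placeholders.

```
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Open (or create) a named queue and seed it before the crawl starts.
const requestQueue = await RequestQueue.open('my-named-queue'); // placeholder name
await requestQueue.addRequest({ url: 'https://example.com' });

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Links discovered here are enqueued into the same queue,
        // which is what enables recursive crawling.
        await enqueueLinks();
    },
});

await crawler.run();
```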
### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\, request>> = ... Inherited from BrowserCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from BrowserCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from BrowserCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from BrowserCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from BrowserCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. 
*** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from BrowserCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from BrowserCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from BrowserCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from BrowserCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BrowserCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). 
*** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from BrowserCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from BrowserCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from BrowserCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from BrowserCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # RenderingTypePredictor experimental Stores rendering type information for previously crawled URLs and predicts the rendering type for URLs that have yet to be crawled and recommends when rendering type detection should be performed. 
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**initialize](#initialize) * [**predict](#predict) * [**storeResult](#storeResult) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L50)constructor * ****new RenderingTypePredictor**(\_\_namedParameters): [RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md) - experimental #### Parameters * ##### \_\_namedParameters: RenderingTypePredictorOptions #### Returns [RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md) ## Methods[**](#Methods) ### [**](#initialize)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L65)initialize * ****initialize**(): Promise\ - experimental Initialize the predictor by restoring persisted state. *** #### Returns Promise\ ### [**](#predict)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L72)publicpredict * ****predict**(\_\_namedParameters): { detectionProbabilityRecommendation: number; renderingType: [RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) } - experimental Predict the rendering type for a given URL and request label. *** #### Parameters * ##### \_\_namedParameters: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns { detectionProbabilityRecommendation: number; renderingType: [RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) } * ##### detectionProbabilityRecommendation: number * ##### renderingType: [RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) ### [**](#storeResult)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L99)publicstoreResult * ****storeResult**(\_\_namedParameters, renderingType): void - experimental Store the rendering type for a given URL and request label. This updates the underlying prediction model, which may be costly. *** #### Parameters * ##### \_\_namedParameters: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### renderingType: [RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) #### Returns void --- # createAdaptivePlaywrightRouter ### Callable * ****createAdaptivePlaywrightRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # createPlaywrightRouter ### Callable * ****createPlaywrightRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). Defaults to the [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. 
```
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();
router.addHandler('label-a', async (ctx) => {
    ctx.log.info('...');
});
router.addDefaultHandler(async (ctx) => {
    ctx.log.info('...');
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});
await crawler.run();
```

*** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # launchPlaywright ### Callable * ****launchPlaywright**(launchContext, config): Promise\ *** * Launches headless browsers using Playwright, pre-configured to work within the Apify platform. The function has the same return value as `browserType.launch()`. See [Playwright documentation](https://playwright.dev/docs/api/class-browsertype) for more details. The `launchPlaywright()` function alters the following Playwright options: * Passes the setting from the `CRAWLEE_HEADLESS` environment variable to the `headless` option, unless it was already defined by the caller or the `CRAWLEE_XVFB` environment variable is set to `1`. Note that the Apify Actor cloud platform automatically sets `CRAWLEE_HEADLESS=1` for all running actors. * Takes the `proxyUrl` option, validates it and adds it to `launchOptions` in a proper format. The proxy URL must define a port number and have one of the following schemes: `http://`, `https://`, `socks4://` or `socks5://`. If the proxy is HTTP (i.e. has the `http://` scheme) and contains a username or password, the `launchPlaywright` function sets up an anonymous HTTP proxy to make the proxy work with headless Chrome. For more information, read the [blog post about the proxy-chain library](https://blog.apify.com/how-to-make-headless-chrome-and-puppeteer-use-a-proxy-server-with-authentication-249a21a79212). To use this function, you need to have the [Playwright](https://www.npmjs.com/package/playwright) NPM package installed in your project. When running on the Apify Platform, you can achieve that simply by using the `apify/actor-node-playwright-*` base Docker image for your actor - see [Apify Actor documentation](https://docs.apify.com/actor/build#base-images) for details. *** #### Parameters * ##### optionallaunchContext: [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) Optional settings passed to `browserType.launch()`. In addition to [Playwright's options](https://playwright.dev/docs/api/class-browsertype?_highlight=launch#browsertypelaunchoptions) the object may contain our own [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) that enables additional features. * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns Promise\ Promise that resolves to Playwright's `Browser` instance.
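A minimal sketch of calling `launchPlaywright()` directly, outside of any crawler. The `proxyUrl` credentials are placeholders; `launchOptions` is forwarded to Playwright's `browserType.launch()` as described above.

```
import { launchPlaywright } from 'crawlee';

const browser = await launchPlaywright({
    launchOptions: { headless: true },
    // Placeholder authenticated proxy; launchPlaywright rewrites it through
    // an anonymized local proxy so the headless browser can use it.
    proxyUrl: 'http://user:password@proxy.example.com:8000',
});

const page = await browser.newPage();
await page.goto('https://crawlee.dev');
console.log(await page.title());
await browser.close();
```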
--- # AdaptivePlaywrightCrawlerContext \ ### Hierarchy * [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md)\ * *AdaptivePlaywrightCrawlerContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**enqueueLinks](#enqueueLinks) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**page](#page) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**querySelector](#querySelector) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from RestrictedCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L80)inheritedenqueueLinks **enqueueLinks: (options) => Promise\ Inherited from RestrictedCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Type declaration * * **(options): Promise\ - #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise\ ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L101)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise\> Inherited from RestrictedCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. 
*** #### Type declaration * * **(idOrName): Promise\> - #### Parameters * ##### optionalidOrName: string #### Returns Promise\> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from RestrictedCrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from RestrictedCrawlingContext.log A preconfigured logger for the request handler. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L109)page **page: Page Playwright Page object. If accessed in HTTP-only rendering, this will throw an error and make the AdaptivePlaywrightCrawlerContext retry the request in a browser. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from RestrictedCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from RestrictedCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L104)response **response: [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) The HTTP response, either from the HTTP client or from the initial request from playwright's navigation. ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from RestrictedCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from RestrictedCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L144)parseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it will first look for the selector with a 5s timeout. 
**Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from RestrictedCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#querySelector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L115)querySelector * ****querySelector**(selector, timeoutMs): Promise\> - Wait for an element matching the selector to appear and return a Cheerio object of matched elements. Timeout defaults to 5s. *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L130)waitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Wait for an element matching the selector to appear. Timeout defaults to 5s. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # AdaptivePlaywrightCrawlerOptions ### Hierarchy * Omit<[PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md), requestHandler | handlePageFunction | preNavigationHooks | postNavigationHooks> * *AdaptivePlaywrightCrawlerOptions* ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**browserPoolOptions](#browserPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**headless](#headless) * [**httpClient](#httpClient) * [**ignoreIframes](#ignoreIframes) * [**ignoreShadowRoots](#ignoreShadowRoots) * [**keepAlive](#keepAlive) * [**launchContext](#launchContext) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**preventDirectStorageAccess](#preventDirectStorageAccess) * [**proxyConfiguration](#proxyConfiguration) * [**renderingTypeDetectionRatio](#renderingTypeDetectionRatio) * [**renderingTypePredictor](#renderingTypePredictor) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * 
[**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**resultChecker](#resultChecker) * [**resultComparator](#resultComparator) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from Omit.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#browserPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L194)optionalinheritedbrowserPoolOptions **browserPoolOptions? : Partial<[BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md)<[BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)<[CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md), undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<[BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)<[BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : ... | ...; certPath? : ... | ...; key? : ... | ...; keyPath? : ... | ...; origin: string; passphrase? : ... | ...; pfx? : ... | ...; pfxPath? : ... | ... }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: ...; width: ... } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }\[]; origins: { localStorage: ...; origin: ... }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : ... | ...; certPath? : ... | ...; key? : ... | ...; keyPath? : ... | ...; origin: string; passphrase? : ... | ...; pfx? : ... | ...; pfxPath? : ... | ... }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: ...; width: ... } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }\[]; origins: { localStorage: ...; origin: ... }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, Page>> Inherited from Omit.browserPoolOptions Custom options passed to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) constructor. We can tweak those to fine-tune browser management. ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L163)optionalinheritederrorHandler **errorHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\> Inherited from Omit.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the request to be retried. 
Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from Omit.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L174)optionalinheritedfailedRequestHandler **failedRequestHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\> Inherited from Omit.failedRequestHandler A function to handle requests that failed more than `option.maxRequestRetries` times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L260)optionalinheritedheadless **headless? : boolean | new | old Inherited from Omit.headless Whether to run browser in headless mode. Defaults to `true`. Can be also set via [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md). ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from Omit.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreIframes)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L272)optionalinheritedignoreIframes **ignoreIframes? : boolean Inherited from Omit.ignoreIframes Whether to ignore `iframes` when processing the page content via `parseWithCheerio` helper. By default, `iframes` are expanded automatically. Use this option to disable this behavior. ### [**](#ignoreShadowRoots)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L266)optionalinheritedignoreShadowRoots **ignoreShadowRoots? : boolean Inherited from Omit.ignoreShadowRoots Whether to ignore custom elements (and their #shadow-roots) when processing the page content via `parseWithCheerio` helper. By default, they are expanded automatically. Use this option to disable this behavior. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? 
: boolean Inherited from Omit.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L33)optionalinheritedlaunchContext **launchContext? : [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) Inherited from Omit.launchContext The same options as used by [launchPlaywright](https://crawlee.dev/js/api/playwright-crawler/function/launchPlaywright.md). ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from Omit.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from Omit.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from Omit.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from Omit.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from Omit.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. 
By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from Omit.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from Omit.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L248)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from Omit.navigationTimeoutSecs Timeout in which page navigation needs to finish, in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from Omit.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L254)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from Omit.persistCookiesPerSession Defines whether the cookies should be persisted for sessions. This can only be used when `useSessionPool` is set to `true`. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L183)optionalpostNavigationHooks **postNavigationHooks? : AdaptiveHook\[] Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts a subset of the crawling context. If you attempt to access the `page` property during HTTP-only crawling, an exception will be thrown. 
If it's not caught, the request will be transparently retried in a browser. ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L176)optionalpreNavigationHooks **preNavigationHooks? : AdaptiveHook\[] Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies. The function accepts a subset of the crawling context. If you attempt to access the `page` property during HTTP-only crawling, an exception will be thrown. If it's not caught, the request will be transparently retried in a browser. ### [**](#preventDirectStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L221)optionalpreventDirectStorageAccess **preventDirectStorageAccess? : boolean Prevent direct access to storage in request handlers (only allow using context helpers). Defaults to `true` ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L201)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from Omit.proxyConfiguration If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration. ### [**](#renderingTypeDetectionRatio)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L189)optionalrenderingTypeDetectionRatio **renderingTypeDetectionRatio? : number Specifies the frequency of rendering type detection checks - 0.1 means roughly 10% of requests. Defaults to 0.1 (so 10%). ### [**](#renderingTypePredictor)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L215)optionalrenderingTypePredictor **renderingTypePredictor? : Pick<[RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md), predict | storeResult | initialize> A custom rendering type predictor ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L169)optionalrequestHandler **requestHandler? : (crawlingContext) => Awaitable\ Function that is called to process each request. The function receives the AdaptivePlaywrightCrawlingContext as an argument, and it must refrain from calling code with side effects, other than the methods of the crawling context. Any other side effects may be invoked repeatedly by the crawler, which can lead to inconsistent results. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to `option.maxRequestRetries` times. 
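As a hedged sketch (assuming the `crawlee` package exports `AdaptivePlaywrightCrawler` and that the adaptive context exposes the `parseWithCheerio`, `pushData`, and `enqueueLinks` helpers documented for this crawler), a side-effect-free handler might look like this:

```
import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ request, parseWithCheerio, pushData, enqueueLinks }) {
        // parseWithCheerio works both in HTTP-only and browser-rendered runs.
        const $ = await parseWithCheerio();
        await pushData({ url: request.url, title: $('title').text() });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```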
*** #### Type declaration * * **(crawlingContext): Awaitable\ - #### Parameters * ##### crawlingContext: { request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[AdaptivePlaywrightCrawlerContext](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md)\, request> #### Returns Awaitable\ ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from Omit.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from Omit.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from Omit.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from Omit.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from Omit.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. 
This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#resultChecker)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L196)optionalresultChecker **resultChecker? : (result) => boolean An optional callback that is called on dataset items found by the request handler in plain HTTP mode. If it returns false, the request is retried in a browser. If no callback is specified, every dataset item is considered valid. *** #### Type declaration * * **(result): boolean - #### Parameters * ##### result: [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) #### Returns boolean ### [**](#resultComparator)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L207)optionalresultComparator **resultComparator? : (resultA, resultB) => boolean | equal | different | inconclusive An optional callback used in rendering type detection. On each detection, the result of the plain HTTP run is compared to that of the browser one. If a callback is provided, the contract is as follows: If the callback returns true or 'equal', the results are considered equal and the target site is considered static. If it returns false or 'different', the target site is considered client-rendered. If it returns 'inconclusive', the detection result won't be used. If no result comparator is specified, but there is a `resultChecker`, any site where the `resultChecker` returns true is considered static. If neither `resultComparator` nor `resultChecker` is specified, a deep comparison of returned dataset items is used as a default. *** #### Type declaration * * **(resultA, resultB): boolean | equal | different | inconclusive - #### Parameters * ##### resultA: [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) * ##### resultB: [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) #### Returns boolean | equal | different | inconclusive ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from Omit.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from Omit.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from Omit.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions?
: [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from Omit.statisticsOptions Customize the way statistics are collected, such as the logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from Omit.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.

```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```

### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from Omit.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from Omit.useSessionPool The crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler).
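For illustration, a minimal sketch of wiring the session pool options above together (the pool size and the target URL are placeholder values):

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: { maxPoolSize: 50 }, // placeholder pool size
    async requestHandler({ session, request, pushData }) {
        // The session instance is available on the crawling context.
        await pushData({ url: request.url, sessionId: session?.id });
    },
});

await crawler.run(['https://example.com']); // placeholder URL
```

The same inherited options apply to the adaptive crawler and the other crawler classes as well.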
--- # PlaywrightCrawlerOptions ### Hierarchy * [BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md), { browserPlugins: \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] }> * *PlaywrightCrawlerOptions* ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**browserPoolOptions](#browserPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**headless](#headless) * [**httpClient](#httpClient) * [**ignoreIframes](#ignoreIframes) * [**ignoreShadowRoots](#ignoreShadowRoots) * [**keepAlive](#keepAlive) * [**launchContext](#launchContext) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from BrowserCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#browserPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L194)optionalinheritedbrowserPoolOptions **browserPoolOptions? 
: Partial<[BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md)<[BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)<[CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md), undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<[BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)<[BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : ... | ...; certPath? : ... | ...; key? : ... | ...; keyPath? : ... | ...; origin: string; passphrase? : ... | ...; pfx? : ... | ...; pfxPath? : ... | ... }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: ...; width: ... } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }\[]; origins: { localStorage: ...; origin: ... }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : ... | ...; certPath? : ... | ...; key? : ... | ...; keyPath? : ... | ...; origin: string; passphrase? : ... | ...; pfx? : ... | ...; pfxPath? : ... | ... }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: ...; width: ... } }; reducedMotion? : null | reduce | no-preference; screen? 
: { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }\[]; origins: { localStorage: ...; origin: ... }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, Page>> Inherited from BrowserCrawlerOptions.browserPoolOptions Custom options passed to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) constructor. We can tweak those to fine-tune browser management. ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L163)optionalinheritederrorHandler **errorHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\> Inherited from BrowserCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from BrowserCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L174)optionalinheritedfailedRequestHandler **failedRequestHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\> Inherited from BrowserCrawlerOptions.failedRequestHandler A function to handle requests that failed more than `option.maxRequestRetries` times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. 
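As a hedged illustration of how these two handlers relate (the retry bookkeeping and the log message are illustrative, not part of the API):

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Runs before each retry; here we just record the last error on the request.
    errorHandler: async ({ request }, error) => {
        request.userData.lastError = error.message;
    },
    // Runs once all retries (maxRequestRetries) are exhausted.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} failed too many times: ${error.message}`);
    },
    async requestHandler({ page, pushData }) {
        await pushData({ title: await page.title() });
    },
});
```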
### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L260)optionalinheritedheadless **headless? : boolean | new | old Inherited from BrowserCrawlerOptions.headless Whether to run browser in headless mode. Defaults to `true`. Can be also set via [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md). ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from BrowserCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreIframes)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L272)optionalinheritedignoreIframes **ignoreIframes? : boolean Inherited from BrowserCrawlerOptions.ignoreIframes Whether to ignore `iframes` when processing the page content via `parseWithCheerio` helper. By default, `iframes` are expanded automatically. Use this option to disable this behavior. ### [**](#ignoreShadowRoots)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L266)optionalinheritedignoreShadowRoots **ignoreShadowRoots? : boolean Inherited from BrowserCrawlerOptions.ignoreShadowRoots Whether to ignore custom elements (and their #shadow-roots) when processing the page content via `parseWithCheerio` helper. By default, they are expanded automatically. Use this option to disable this behavior. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from BrowserCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L33)optionallaunchContext **launchContext? : [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) Overrides BrowserCrawlerOptions.launchContext The same options as used by [launchPlaywright](https://crawlee.dev/js/api/playwright-crawler/function/launchPlaywright.md). ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from BrowserCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? 
: number Inherited from BrowserCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from BrowserCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from BrowserCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from BrowserCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from BrowserCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from BrowserCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. 
If unsure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L248)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from BrowserCrawlerOptions.navigationTimeoutSecs Timeout in which page navigation needs to finish, in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from BrowserCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached. ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L254)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from BrowserCrawlerOptions.persistCookiesPerSession Defines whether the cookies should be persisted for sessions. This can only be used when `useSessionPool` is set to `true`. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L124)optionalpostNavigationHooks **postNavigationHooks? : [PlaywrightHook](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightHook.md)\[] Overrides BrowserCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. Example:

```
postNavigationHooks: [
    async (crawlingContext) => {
        const { page } = crawlingContext;
        if (hasCaptcha(page)) {
            await solveCaptcha(page);
        }
    },
]
```

### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L107)optionalpreNavigationHooks **preNavigationHooks? : [PlaywrightHook](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightHook.md)\[] Overrides BrowserCrawlerOptions.preNavigationHooks Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotoOptions`, which are passed to the `page.goto()` function the crawler calls to navigate. Example:

```
preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        const { page } = crawlingContext;
        await page.evaluate((attr) => { window.foo = attr; }, 'bar');
    },
]
```

Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook). ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L201)optionalinheritedproxyConfiguration **proxyConfiguration?
: [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from BrowserCrawlerOptions.proxyConfiguration If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration. ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L59)optionalrequestHandler **requestHandler? : [PlaywrightRequestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightRequestHandler.md) Overrides BrowserCrawlerOptions.requestHandler Function that is called to process each request. The function receives the [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md) as an argument, where: * `request` is an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) object with details about the URL to open, HTTP method etc. * `page` is an instance of the `Playwright` [`Page`](https://playwright.dev/docs/api/class-page) * `browserController` is an instance of the [`BrowserController`](https://github.com/apify/browser-pool#browsercontroller), * `response` is an instance of the `Playwright` [`Response`](https://playwright.dev/docs/api/class-response), which is the main resource response as returned by `page.goto(request.url)`. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to `option.maxRequestRetries` times. If all the retries fail, the crawler calls the function provided to the `failedRequestHandler` parameter. To make this work, you should **always** let your function throw exceptions rather than catch them. The exceptions are logged to the request using the [Request.pushErrorMessage](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from BrowserCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BrowserCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? 
: [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from BrowserCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BrowserCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from BrowserCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from BrowserCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from BrowserCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from BrowserCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions?
: [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from BrowserCrawlerOptions.statisticsOptions Customize the way statistics are collected, such as the logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from BrowserCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.

```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```

### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from BrowserCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from BrowserCrawlerOptions.useSessionPool The crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler).
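To tie several of the options above together, a hedged configuration sketch (the proxy URL, limits, browser flag and target URL are placeholder values):

```
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy.example.com:8000'], // placeholder proxy URL
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    headless: true,
    respectRobotsTxtFile: true,
    maxRequestsPerCrawl: 100, // placeholder limit
    maxConcurrency: 10, // placeholder limit
    launchContext: {
        launchOptions: { args: ['--disable-gpu'] }, // illustrative Chromium flag
    },
    async requestHandler({ request, page, enqueueLinks, pushData }) {
        await pushData({ url: request.url, title: await page.title() });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```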
--- # PlaywrightCrawlingContext \ ### Hierarchy * [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md)<[PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), Page, Response, [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md), UserData> * PlaywrightContextUtils * *PlaywrightCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**browserController](#browserController) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**page](#page) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**blockRequests](#blockRequests) * [**closeCookieModals](#closeCookieModals) * [**compileScript](#compileScript) * [**enqueueLinks](#enqueueLinks) * [**enqueueLinksByClickingElements](#enqueueLinksByClickingElements) * [**handleCloudflareChallenge](#handleCloudflareChallenge) * [**infiniteScroll](#infiniteScroll) * [**injectFile](#injectFile) * [**injectJQuery](#injectJQuery) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**saveSnapshot](#saveSnapshot) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from BrowserCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#browserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L59)inheritedbrowserController **browserController: [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) Inherited from BrowserCrawlingContext.browserController ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) Inherited from BrowserCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from BrowserCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. 
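A hedged sketch of using this helper inside a request handler, in the same fragment style as the other examples here (the key name is arbitrary):

```
async requestHandler({ getKeyValueStore, request }) {
    // Without an argument, the crawler's default key-value store is opened.
    const store = await getKeyValueStore();
    await store.setValue('last-visited-url', request.url); // arbitrary key name
},
```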
*** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from BrowserCrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BrowserCrawlingContext.log A preconfigured logger for the request handler. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L60)inheritedpage **page: Page Inherited from BrowserCrawlingContext.page ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from BrowserCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from BrowserCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L61)optionalinheritedresponse **response? : Response Inherited from BrowserCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from BrowserCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from BrowserCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#blockRequests)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L878)inheritedblockRequests * ****blockRequests**(options): Promise\ - Inherited from PlaywrightContextUtils.blockRequests Forces the Playwright browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources. 
By default, the function will block all URLs including the following patterns:

```
[".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"]
```

If you want to extend this list further, use the `extraUrlPatterns` option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the `urlPatterns` option, which will override the defaults and block only URLs with your custom patterns. This function does not use Playwright's request interception and therefore does not interfere with browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception. The function will never block main document loads and their respective redirects. **Example usage**

```
preNavigationHooks: [
    async ({ blockRequests }) => {
        // Block all requests to URLs that include `adsbygoogle.js` and also all defaults.
        await blockRequests({
            extraUrlPatterns: ['adsbygoogle.js'],
        });
    },
],
```

*** #### Parameters * ##### optionaloptions: [BlockRequestsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#BlockRequestsOptions) #### Returns Promise\ ### [**](#closeCookieModals)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L997)inheritedcloseCookieModals * ****closeCookieModals**(): Promise\ - Inherited from PlaywrightContextUtils.closeCookieModals Tries to close cookie consent modals on the page. Based on the I Don't Care About Cookies browser extension. *** #### Returns Promise\ ### [**](#compileScript)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L992)inheritedcompileScript * ****compileScript**(scriptString, ctx): [CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) - Inherited from PlaywrightContextUtils.compileScript Compiles a Playwright script into an async function that may be executed at any time by providing it with the following object:

```
{
    page: Page,
    request: Request,
}
```

Where `page` is a Playwright [`Page`](https://playwright.dev/docs/api/class-page) and `request` is a [Request](https://crawlee.dev/js/api/core/class/Request.md). The function is compiled by using the `scriptString` parameter as the function's body, so any limitations to function bodies apply. The return value of the compiled function is the return value of the function body, i.e. of the `scriptString` parameter. As a security measure, no globals such as `process` or `require` are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore you should only use this function to execute sanitized or safe code. Custom context may also be provided using the `context` parameter. To improve security, make sure to pass only the objects that are really necessary to the context, preferably making secured copies beforehand.
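A hedged sketch of calling the compiled function from a request handler (the script body is illustrative):

```
async requestHandler({ compileScript, page, request }) {
    // The string becomes the body of an async function receiving { page, request }.
    const getTitle = compileScript('return page.title();');
    const title = await getTitle({ page, request });
    // ... use `title`
},
```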
*** #### Parameters * ##### scriptString: string * ##### optionalctx: Dictionary #### Returns [CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from BrowserCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L962)inheritedenqueueLinksByClickingElements * ****enqueueLinksByClickingElements**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from PlaywrightContextUtils.enqueueLinksByClickingElements The function finds elements matching a specific CSS selector in a Playwright page, clicks all those elements using a mouse move and a left mouse button click and intercepts all the navigation requests that are subsequently produced by the page. The intercepted requests, including their methods, headers and payloads are then enqueued to a provided [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This is useful to crawl JavaScript heavy pages where links are not available in `href` elements, but rather navigations are triggered in click handlers. If you're looking to find URLs in `href` attributes of the page, see enqueueLinks. Optionally, the function allows you to filter the target links' URLs using an array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) objects and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. **IMPORTANT**: To be able to do this, this function uses various mutations on the page, such as changing the Z-index of elements being clicked and their visibility. Therefore, it is recommended to only use this function as the last operation in the page. **USING HEADFUL BROWSER**: When using a headful browser, this function will only be able to click elements in the focused tab, effectively limiting concurrency to 1. 
In headless mode, full concurrency can be achieved. **PERFORMANCE**: Clicking elements with a mouse and intercepting requests is not a low-level operation that takes nanoseconds. It's not very CPU intensive, but it takes time. We strongly recommend limiting the scope of the clicking as much as possible by using a specific selector that targets only the elements that you assume or know will produce a navigation. You can certainly click everything by using the `*` selector, but be prepared to wait minutes to get results on a large and complex page. **Example usage** ``` async requestHandler({ enqueueLinksByClickingElements }) { await enqueueLinksByClickingElements({ selector: 'a.product-detail', globs: [ 'https://www.example.com/handbags/**', 'https://www.example.com/purses/**' ], }); }, ``` *** #### Parameters * ##### options: Omit<[EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md#EnqueueLinksByClickingElementsOptions), requestQueue | page> #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#handleCloudflareChallenge)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L1019)inheritedhandleCloudflareChallenge * ****handleCloudflareChallenge**(options): Promise\ - Inherited from PlaywrightContextUtils.handleCloudflareChallenge This helper tries to solve the Cloudflare challenge automatically by clicking on the checkbox. It will try to detect the Cloudflare page, click on the checkbox, and wait for 10 seconds (configurable via the `sleepSecs` option) for the page to load. Use this in the `postNavigationHooks`; failures result in a `SessionError`, which is automatically retried, so only successful requests get into the `requestHandler`. Works best with Camoufox. **Example usage** ``` postNavigationHooks: [ async ({ handleCloudflareChallenge }) => { await handleCloudflareChallenge(); }, ], ``` *** #### Parameters * ##### optionaloptions: HandleCloudflareChallengeOptions #### Returns Promise\ ### [**](#infiniteScroll)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L913)inheritedinfiniteScroll * ****infiniteScroll**(options): Promise\ - Inherited from PlaywrightContextUtils.infiniteScroll Scrolls to the bottom of a page, or until it times out. Loads dynamic content when it hits the bottom of a page, and then continues scrolling. *** #### Parameters * ##### optionaloptions: [InfiniteScrollOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InfiniteScrollOptions) #### Returns Promise\ ### [**](#injectFile)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L813)inheritedinjectFile * ****injectFile**(filePath, options): Promise\ - Inherited from PlaywrightContextUtils.injectFile Injects a JavaScript file into the current `page`. Unlike Playwright's `addScriptTag` function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies. File contents are cached for up to 10 files to limit file system access.
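**Example (sketch):** A minimal illustration of injecting a local helper script from the request handler; `./my-helpers.js` is a hypothetical file path, not part of the API:

```
async requestHandler({ injectFile, page }) {
    // Hypothetical helper file, injected regardless of the page's CORS policy.
    await injectFile('./my-helpers.js', { surviveNavigations: true });
    // The injected code is now available to page.evaluate() calls.
},
```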
*** #### Parameters * ##### filePath: string * ##### optionaloptions: [InjectFileOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InjectFileOptions) #### Returns Promise\ ### [**](#injectJQuery)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L840)inheritedinjectJQuery * ****injectJQuery**(): Promise\ - Inherited from PlaywrightContextUtils.injectJQuery Injects the [jQuery](https://jquery.com/) library into current `page`. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors. Beware that the injected jQuery object will be set to the `window.$` variable and thus it might cause conflicts with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect functionality of page's scripts. The injected jQuery will survive page navigations and reloads. **Example usage:** ``` async requestHandler({ page, injectJQuery }) { await injectJQuery(); const title = await page.evaluate(() => { return $('head title').text(); }); }); ``` Note that `injectJQuery()` does not affect the Playwright [`page.$()`](https://playwright.dev/docs/api/class-page#page-query-selector) function in any way. *** #### Returns Promise\ ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L907)inheritedparseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Inherited from PlaywrightContextUtils.parseWithCheerio Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it waits for it to be available first. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BrowserCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#saveSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L919)inheritedsaveSnapshot * ****saveSnapshot**(options): Promise\ - Inherited from PlaywrightContextUtils.saveSnapshot Saves a full screenshot and HTML of the current page into a Key-Value store. 
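**Example (sketch):** For illustration, saving a snapshot under a custom key from the request handler; the `PRODUCT-PAGE` key is arbitrary:

```
async requestHandler({ saveSnapshot }) {
    // Stores 'PRODUCT-PAGE.jpg' and 'PRODUCT-PAGE.html' in the default Key-Value store.
    await saveSnapshot({ key: 'PRODUCT-PAGE', saveHtml: true, saveScreenshot: true });
},
```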
*** #### Parameters * ##### optionaloptions: [SaveSnapshotOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#SaveSnapshotOptions) #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from BrowserCrawlingContext.sendRequest Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L893)inheritedwaitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Inherited from PlaywrightContextUtils.waitForSelector Wait for an element matching the selector to appear. Timeout defaults to 5s. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # PlaywrightHook ### Hierarchy * [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md), [PlaywrightGotoOptions](https://crawlee.dev/js/api/playwright-crawler.md#PlaywrightGotoOptions)> * *PlaywrightHook* ### Callable * ****PlaywrightHook**(crawlingContext, gotoOptions): Awaitable\ *** * #### Parameters * ##### crawlingContext: [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\ * ##### gotoOptions: Dictionary & { referer?: string; timeout?: number; waitUntil?: domcontentloaded | load | networkidle | commit } #### Returns Awaitable\ --- # PlaywrightLaunchContext Apify extends the launch options of Playwright. You can use any of the Playwright compatible [`LaunchOptions`](https://playwright.dev/docs/api/class-browsertype#browsertypelaunchoptions) options by providing the `launchOptions` property. 
**Example:** ``` // launch a headless Chrome (not Chromium) const launchContext = { // Apify helpers useChrome: true, proxyUrl: 'http://user:password@some.proxy.com', // Native Playwright options launchOptions: { headless: true, args: ['--some-flag'], } } ``` ### Hierarchy * [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ * *PlaywrightLaunchContext* ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launcher](#launcher) * [**launchOptions](#launchOptions) * [**proxyUrl](#proxyUrl) * [**useChrome](#useChrome) * [**useIncognitoPages](#useIncognitoPages) * [**userAgent](#userAgent) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L40)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from BrowserLaunchContext.browserPerProxy If set to `true`, the crawler respects the proxy URL generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L66)optionalexperimentalContainersexperimental **experimentalContainers? : boolean Overrides BrowserLaunchContext.experimentalContainers Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. ### [**](#launcher)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L79)optionallauncher **launcher? : BrowserType<{}> Overrides BrowserLaunchContext.launcher By default, this option uses `require("playwright").chromium`. If you want to use a different browser, you can pass it via this property, e.g. `require("playwright").firefox`. ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L33)optionallaunchOptions **launchOptions? : LaunchOptions & { acceptDownloads? : boolean; args? : string\[]; baseURL? : string; bypassCSP? : boolean; channel? : string; chromiumSandbox? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; devtools? : boolean; downloadsPath? : string; env? : {}; executablePath? : string; extraHTTPHeaders? : {}; firefoxUserPrefs? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; handleSIGHUP? : boolean; handleSIGINT? : boolean; handleSIGTERM? : boolean; hasTouch? : boolean; headless? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreDefaultArgs? : boolean | string\[]; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode?
: full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; slowMo? : number; strictSelectors? : boolean; timeout? : number; timezoneId? : string; tracesDir? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } } Overrides BrowserLaunchContext.launchOptions `browserType.launch` [options](https://playwright.dev/docs/api/class-browsertype#browser-type-launch) or `browserType.launchContextOptions` [options](https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context) ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L41)optionalproxyUrl **proxyUrl? : string Overrides BrowserLaunchContext.proxyUrl URL to a HTTP proxy server. It must define the port number, and it may also contain proxy username and password. Example: `http://bob:pass123@proxy.example.com:1234`. ### [**](#useChrome)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L52)optionaluseChrome **useChrome? : boolean = false Overrides BrowserLaunchContext.useChrome If `true` and `executablePath` is not set, Playwright will launch full Google Chrome browser available on the machine rather than the bundled Chromium. The path to Chrome executable is taken from the `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable if provided, or defaults to the typical Google Chrome executable location specific for the operating system. By default, this option is `false`. ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L59)optionaluseIncognitoPages **useIncognitoPages? : boolean = false Overrides BrowserLaunchContext.useIncognitoPages With this option selected, all pages will be opened in a new incognito browser context. This means they will not share cookies nor cache and their resources will not be throttled by one another. ### [**](#userAgent)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L68)optionalinheriteduserAgent **userAgent? : string Inherited from BrowserLaunchContext.userAgent The `User-Agent` HTTP header used by the browser. If not provided, the function sets `User-Agent` to a reasonable default to reduce the chance of detection of the crawler. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L73)optionaluserDataDir **userDataDir? : string Overrides BrowserLaunchContext.userDataDir Sets the [User Data Directory](https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md) path. The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state. If not specified, a temporary directory is used instead. 
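**Example (sketch):** The launch context is typically supplied through the `launchContext` option of the `PlaywrightCrawler` constructor. A minimal sketch, assuming Playwright's Firefox browser is installed:

```
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use Firefox instead of the default Chromium.
        launcher: firefox,
        useIncognitoPages: true,
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ request, page }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```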
--- # PlaywrightRequestHandler ### Hierarchy * [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler)\> * *PlaywrightRequestHandler* ### Callable * ****PlaywrightRequestHandler**(inputs): Awaitable\ *** * #### Parameters * ##### inputs: { request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>> } & Omit<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\, request>, request> #### Returns Awaitable\ --- # playwrightClickElements ## Index[**](#Index) ### References * [**enqueueLinksByClickingElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md#enqueueLinksByClickingElements) ### Interfaces * [**EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md#EnqueueLinksByClickingElementsOptions) ## References[**](#References) ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L225)enqueueLinksByClickingElements Re-exports [enqueueLinksByClickingElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#enqueueLinksByClickingElements) ## Interfaces[**](#Interfaces) ### [**](#EnqueueLinksByClickingElementsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L30)EnqueueLinksByClickingElementsOptions **EnqueueLinksByClickingElementsOptions: ### [**](#clickOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L56)optionalclickOptions **clickOptions? : { button? : left | right | middle; clickCount? : number; delay? : number; force? : boolean; modifiers? : (Alt | Control | ControlOrMeta | Meta | Shift)\[]; noWaitAfter? : boolean; position? : { x: number; y: number }; strict? : boolean; timeout? : number; trial? : boolean } Click options for use in Playwright click handler. *** #### Type declaration * ##### externaloptionalbutton?: left | right | middle Defaults to `left`. * ##### externaloptionalclickCount?: number defaults to 1. See \[UIEvent.detail]. * ##### externaloptionaldelay?: number Time to wait between `mousedown` and `mouseup` in milliseconds. Defaults to 0. * ##### externaloptionalforce?: boolean Whether to bypass the [actionability](https://playwright.dev/docs/actionability) checks. Defaults to `false`. * ##### externaloptionalmodifiers?: (Alt | Control | ControlOrMeta | Meta | Shift)\[] Modifier keys to press. Ensures that only these modifiers are pressed during the operation, and then restores current modifiers back. If not specified, currently pressed modifiers are used. "ControlOrMeta" resolves to "Control" on Windows and Linux and to "Meta" on macOS. * ##### externaloptionalnoWaitAfter?: boolean Actions that initiate navigations are waiting for these navigations to happen and for pages to start loading. You can opt out of waiting via setting this flag. You would only need this option in the exceptional cases such as navigating to inaccessible pages. Defaults to `false`. 
* **@deprecated** This option will default to `true` in the future. * ##### externaloptionalposition?: { x: number; y: number } A point to use relative to the top-left corner of element padding box. If not specified, uses some visible point of the element. * ##### externalx: number * ##### externaly: number * ##### externaloptionalstrict?: boolean When true, the call requires selector to resolve to a single element. If given selector resolves to more than one element, the call throws an exception. * ##### externaloptionaltimeout?: number Maximum time in milliseconds. Defaults to `0` - no timeout. The default value can be changed via `actionTimeout` option in the config, or by using the [browserContext.setDefaultTimeout(timeout)](https://playwright.dev/docs/api/class-browsercontext#browser-context-set-default-timeout) or [page.setDefaultTimeout(timeout)](https://playwright.dev/docs/api/class-page#page-set-default-timeout) methods. * ##### externaloptionaltrial?: boolean When set, this method only performs the [actionability](https://playwright.dev/docs/actionability) checks and skips the action. Defaults to `false`. Useful to wait until the element is ready for the action without performing it. Note that keyboard `modifiers` will be pressed regardless of `trial` to allow testing elements which are only visible when those keys are pressed. ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L83)optionalexclude **exclude? : readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L175)optionalforefront **forefront? : boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. ### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L72)optionalglobs **globs? : [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. 
### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L51)optionallabel **label? : string Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. ### [**](#maxWaitForPageIdleSecs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L165)optionalmaxWaitForPageIdleSecs **maxWaitForPageIdleSecs? : number = 5 This is the maximum period for which the function will keep tracking events, even if more events keep coming. Its purpose is to prevent a deadlock in the page by periodic events, often unrelated to the clicking itself. See `waitForPageIdleSecs` above for an explanation. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L34)page **page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. ### [**](#pseudoUrls)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L117)optionalpseudoUrls **pseudoUrls? : [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] *NOTE:* In future versions of SDK the options will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. * **@deprecated** prefer using `globs` or `regexps` instead ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L96)optionalregexps **regexps? : [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L39)requestQueue **requestQueue: [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) A request queue to which the URLs will be enqueued. 
### [**](#selector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L45)selector **selector: string A CSS selector matching elements to be clicked on. Unlike in enqueueLinks, there is no default value. This is to prevent suboptimal use of this function by using it too broadly. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L181)optionalskipNavigation **skipNavigation? : boolean = false If set to `true`, tells the crawler to skip navigation and process the request directly. ### [**](#transformRequestFunction)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L140)optionaltransformRequestFunction **transformRequestFunction? : [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `useExtendedUniqueKey: true` to the `request` object, `uniqueKey` will be computed from a combination of `url`, `method` and `payload` which enables crawling of websites that navigate using form submits (POST requests). **Example:** ``` { transformRequestFunction: (request) => { request.userData.foo = 'bar'; request.useExtendedUniqueKey = true; return request; } } ``` ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L48)optionaluserData **userData? : Dictionary Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. ### [**](#waitForPageIdleSecs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L157)optionalwaitForPageIdleSecs **waitForPageIdleSecs? : number = 1 Clicking in the page triggers various asynchronous operations that lead to new URLs being shown by the browser. It could be a simple JavaScript redirect or opening of a new tab in the browser. These events often happen only some time after the actual click. Requests typically take milliseconds while new tabs open in hundreds of milliseconds. To be able to capture all those events, the `enqueueLinksByClickingElements()` function repeatedly waits for the `waitForPageIdleSecs`. By repeatedly we mean that whenever a relevant event is triggered, the timer is restarted. As long as new events keep coming, the function will not return, unless the below `maxWaitForPageIdleSecs` timeout is reached. You may want to reduce this for example when you're sure that your clicks do not open new tabs, or increase when you're not getting all the expected URLs. --- # playwrightUtils A namespace that contains various utilities for [Playwright](https://github.com/microsoft/playwright) - the headless Chrome Node API. 
**Example usage:** ``` import { launchPlaywright, playwrightUtils } from 'crawlee'; // Navigate to https://example.com in Playwright with a POST request const browser = await launchPlaywright(); const page = await browser.newPage(); await playwrightUtils.gotoExtended(page, { url: 'https://example.com', method: 'POST', }); ``` ## Index[**](#Index) ### Interfaces * [**BlockRequestsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#BlockRequestsOptions) * [**CompiledScriptParams](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptParams) * [**DirectNavigationOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#DirectNavigationOptions) * [**InfiniteScrollOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InfiniteScrollOptions) * [**InjectFileOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InjectFileOptions) * [**SaveSnapshotOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#SaveSnapshotOptions) ### Type Aliases * [**CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) ### Functions * [**blockRequests](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#blockRequests) * [**closeCookieModals](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#closeCookieModals) * [**compileScript](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#compileScript) * [**enqueueLinksByClickingElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#enqueueLinksByClickingElements) * [**gotoExtended](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#gotoExtended) * [**infiniteScroll](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#infiniteScroll) * [**injectFile](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#injectFile) * [**injectJQuery](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#injectJQuery) * [**parseWithCheerio](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#parseWithCheerio) * [**registerUtilsToContext](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#registerUtilsToContext) * [**saveSnapshot](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#saveSnapshot) ## Interfaces[**](#Interfaces) ### [**](#BlockRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L64)BlockRequestsOptions **BlockRequestsOptions: ### [**](#extraUrlPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L76)optionalextraUrlPatterns **extraUrlPatterns? : string\[] If you just want to append to the default blocked patterns, use this property. ### [**](#urlPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L71)optionalurlPatterns **urlPatterns? : string\[] The patterns of URLs to block from being loaded by the browser. Only `*` can be used as a wildcard. It is also automatically added to the beginning and end of the pattern. This limitation is enforced by the DevTools protocol. `.png` is the same as `*.png*`.
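For instance, to drop the defaults and block only specific patterns, `urlPatterns` can be passed instead of `extraUrlPatterns` (a minimal sketch; the patterns shown are illustrative):

```
// Replaces the default list entirely; only URLs containing these substrings are blocked.
await playwrightUtils.blockRequests(page, {
    urlPatterns: ['.mp4', '.webm'],
});
```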
### [**](#CompiledScriptParams)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L315)CompiledScriptParams **CompiledScriptParams: ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L316)page **page: Page ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L317)request **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ ### [**](#DirectNavigationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L154)DirectNavigationOptions **DirectNavigationOptions: ### [**](#referer)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L174)optionalreferer **referer? : string Referer header value. If provided it will take preference over the referer header value set by page.setExtraHTTPHeaders(headers). ### [**](#timeout)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L161)optionaltimeout **timeout? : number Maximum operation time in milliseconds, defaults to 30 seconds, pass `0` to disable timeout. The default value can be changed by using the browserContext.setDefaultNavigationTimeout(timeout), browserContext.setDefaultTimeout(timeout), page.setDefaultNavigationTimeout(timeout) or page.setDefaultTimeout(timeout) methods. ### [**](#waitUntil)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L169)optionalwaitUntil **waitUntil? : domcontentloaded | load | networkidle When to consider operation succeeded, defaults to `load`. Events can be either: * `'domcontentloaded'` - consider operation to be finished when the `DOMContentLoaded` event is fired. * `'load'` - consider operation to be finished when the `load` event is fired. * `'networkidle'` - consider operation to be finished when there are no network connections for at least `500` ms. ### [**](#InfiniteScrollOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L364)InfiniteScrollOptions **InfiniteScrollOptions: ### [**](#buttonSelector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L392)optionalbuttonSelector **buttonSelector? : string Optionally checks and clicks a button if it appears while scrolling. This is required on some websites for the scroll to work. ### [**](#maxScrollHeight)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L375)optionalmaxScrollHeight **maxScrollHeight? : number = 0 How many pixels to scroll down. If 0, will scroll until bottom of page. ### [**](#scrollDownAndUp)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L387)optionalscrollDownAndUp **scrollDownAndUp? : boolean = false If true, it will scroll up a bit after each scroll down. This is required on some websites for the scroll to work. ### [**](#stopScrollCallback)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L397)optionalstopScrollCallback **stopScrollCallback? 
: () => unknown This function is called after every scroll and stops the scrolling process if it returns `true`. The function can be `async`. *** #### Type declaration * * **(): unknown - #### Returns unknown ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L369)optionaltimeoutSecs **timeoutSecs? : number = 0 How many seconds to scroll for. If 0, will scroll until bottom of page. ### [**](#waitForSecs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L381)optionalwaitForSecs **waitForSecs? : number = 4 How many seconds to wait for no new content to load before exiting. ### [**](#InjectFileOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L55)InjectFileOptions **InjectFileOptions: ### [**](#surviveNavigations)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L61)optionalsurviveNavigations **surviveNavigations? : boolean Enables the injected script to survive page navigations and reloads without the need to be re-injected manually. This does not mean, however, that internal state will be preserved. Just that it will be automatically re-injected on each navigation before any other scripts get the chance to execute. ### [**](#SaveSnapshotOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L506)SaveSnapshotOptions **SaveSnapshotOptions: ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L541)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Configuration of the crawler that will be used to save the snapshot. ### [**](#key)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L511)optionalkey **key? : string = 'SNAPSHOT' Key under which the screenshot and HTML will be saved. `.jpg` will be appended for the screenshot and `.html` for the HTML. ### [**](#keyValueStoreName)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L535)optionalkeyValueStoreName **keyValueStoreName? : null | string = null Name or ID of the Key-Value store where the snapshot is saved. By default, it is saved to the default Key-Value store. ### [**](#saveHtml)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L529)optionalsaveHtml **saveHtml? : boolean = true If true, it will save the full HTML of the current page as a record with `key` appended by `.html`. ### [**](#saveScreenshot)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L523)optionalsaveScreenshot **saveScreenshot? : boolean = true If true, it will save a full screenshot of the current page as a record with `key` appended by `.jpg`. ### [**](#screenshotQuality)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L517)optionalscreenshotQuality **screenshotQuality? : number = 50 The quality of the image, between 0 and 100. Higher quality images have bigger size and require more storage.
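Putting these options together, a sketch of a standalone call; the key and store name here are illustrative, not defaults:

```
await playwrightUtils.saveSnapshot(page, {
    key: 'CHECKOUT-PAGE',
    screenshotQuality: 80,
    // Hypothetical named store; omit to use the default Key-Value store.
    keyValueStoreName: 'debug-snapshots',
});
```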
## Type Aliases[**](<#Type Aliases>) ### [**](#CompiledScriptFunction)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L320)CompiledScriptFunction **CompiledScriptFunction: (params) => Promise\ #### Type declaration * * **(params): Promise\ - #### Parameters * ##### params: [CompiledScriptParams](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptParams) #### Returns Promise\ ## Functions[**](#Functions) ### [**](#blockRequests)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L291)blockRequests * ****blockRequests**(page, options): Promise\ - > This is a **Chromium-only feature.** > > Using this option with Firefox and WebKit browsers doesn't have any effect. To set up request blocking for these browsers, use `page.route()` instead. Forces the Playwright browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources. By default, the function will block all URLs including the following patterns: ``` [".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"] ``` If you want to extend this list further, use the `extraUrlPatterns` option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the `urlPatterns` option, which will override the defaults and block only URLs with your custom patterns. This function does not use Playwright's request interception and therefore does not interfere with browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception. The function will never block main document loads and their respective redirects. **Example usage** ``` import { launchPlaywright, playwrightUtils } from 'crawlee'; const browser = await launchPlaywright(); const page = await browser.newPage(); // Block all requests to URLs that include `adsbygoogle.js` and also all defaults. await playwrightUtils.blockRequests(page, { extraUrlPatterns: ['adsbygoogle.js'], }); await page.goto('https://cnn.com'); ``` *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. 
* ##### optionaloptions: [BlockRequestsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#BlockRequestsOptions) = {} #### Returns Promise\ ### [**](#closeCookieModals)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L659)closeCookieModals * ****closeCookieModals**(page): Promise\ - #### Parameters * ##### page: Page #### Returns Promise\ ### [**](#compileScript)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L348)compileScript * ****compileScript**(scriptString, context): [CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) - Compiles a Playwright script into an async function that may be executed at any time by providing it with the following object: ``` { page: Page, request: Request, } ``` Where `page` is a Playwright [`Page`](https://playwright.dev/docs/api/class-page) and `request` is a [Request](https://crawlee.dev/js/api/core/class/Request.md). The function is compiled by using the `scriptString` parameter as the function's body, so any limitations to function bodies apply. Return value of the compiled function is the return value of the function body = the `scriptString` parameter. As a security measure, no globals such as `process` or `require` are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore you should only use this function to execute sanitized or safe code. Custom context may also be provided using the `context` parameter. To improve security, make sure to only pass the really necessary objects to the context. Preferably making secured copies beforehand. *** #### Parameters * ##### scriptString: string * ##### context: Dictionary = ... #### Returns [CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L225)enqueueLinksByClickingElements * ****enqueueLinksByClickingElements**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - The function finds elements matching a specific CSS selector in a Playwright page, clicks all those elements using a mouse move and a left mouse button click and intercepts all the navigation requests that are subsequently produced by the page. The intercepted requests, including their methods, headers and payloads are then enqueued to a provided [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This is useful to crawl JavaScript heavy pages where links are not available in `href` elements, but rather navigations are triggered in click handlers. If you're looking to find URLs in `href` attributes of the page, see enqueueLinks. Optionally, the function allows you to filter the target links' URLs using an array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) objects and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. 
**IMPORTANT**: To be able to do this, this function uses various mutations on the page, such as changing the Z-index of elements being clicked and their visibility. Therefore, it is recommended to only use this function as the last operation in the page. **USING HEADFUL BROWSER**: When using a headful browser, this function will only be able to click elements in the focused tab, effectively limiting concurrency to 1. In headless mode, full concurrency can be achieved. **PERFORMANCE**: Clicking elements with a mouse and intercepting requests is not a low-level operation that takes nanoseconds. It's not very CPU intensive, but it takes time. We strongly recommend limiting the scope of the clicking as much as possible by using a specific selector that targets only the elements that you assume or know will produce a navigation. You can certainly click everything by using the `*` selector, but be prepared to wait minutes to get results on a large and complex page. **Example usage** ``` await playwrightUtils.enqueueLinksByClickingElements({ page, requestQueue, selector: 'a.product-detail', pseudoUrls: [ 'https://www.example.com/handbags/[.*]', 'https://www.example.com/purses/[.*]' ], }); ``` *** #### Parameters * ##### options: [EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md#EnqueueLinksByClickingElementsOptions) #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#gotoExtended)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L189)gotoExtended * ****gotoExtended**(page, request, gotoOptions): Promise\ - Extended version of Playwright's `page.goto()` allowing you to perform requests with an HTTP method other than GET, with custom headers and a POST payload. The URL, method, headers and payload are taken from the `request` parameter, which must be an instance of the Request class. *NOTE:* In recent versions of Playwright, using requests other than GET, overriding headers, or adding payloads disables the browser cache, which degrades performance. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### optionalgotoOptions: [DirectNavigationOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#DirectNavigationOptions) = {} Custom options for `page.goto()`. #### Returns Promise\ ### [**](#infiniteScroll)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L406)infiniteScroll * ****infiniteScroll**(page, options): Promise\ - Scrolls to the bottom of a page, or until it times out. Loads dynamic content when it hits the bottom of a page, and then continues scrolling. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object.
* ##### optionaloptions: [InfiniteScrollOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InfiniteScrollOptions) = {} #### Returns Promise\ ### [**](#injectFile)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L95)injectFile * ****injectFile**(page, filePath, options): Promise\ - Injects a JavaScript file into a Playwright page. Unlike Playwright's `addScriptTag` function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies. File contents are cached for up to 10 files to limit file system access. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. * ##### filePath: string File path * ##### optionaloptions: [InjectFileOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InjectFileOptions) = {} #### Returns Promise\ ### [**](#injectJQuery)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L149)injectJQuery * ****injectJQuery**(page, options): Promise\ - Injects the [jQuery](https://jquery.com/) library into a Playwright page. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors. Beware that the injected jQuery object will be set to the `window.$` variable and thus it might cause conflicts with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect functionality of page's scripts. The injected jQuery will survive page navigations and reloads by default. **Example usage:** ``` await playwrightUtils.injectJQuery(page); const title = await page.evaluate(() => { return $('head title').text(); }); ``` Note that `injectJQuery()` does not affect the Playwright [`page.$()`](https://playwright.dev/docs/api/class-page#page-query-selector) function in any way. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. * ##### optionaloptions: { surviveNavigations?: boolean } * ##### optionalsurviveNavigations: boolean Opt-out option to disable the JQuery reinjection after navigation. #### Returns Promise\ ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L610)parseWithCheerio * ****parseWithCheerio**(page, ignoreShadowRoots, ignoreIframes): Promise<[CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot)> - Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). **Example usage:** ``` const $ = await playwrightUtils.parseWithCheerio(page); const title = $('title').text(); ``` *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. 
* ##### ignoreShadowRoots: boolean = false * ##### ignoreIframes: boolean = false #### Returns Promise<[CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot)> ### [**](#registerUtilsToContext)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L1022)registerUtilsToContext * ****registerUtilsToContext**(context, crawlerOptions): void - #### Parameters * ##### context: [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\ * ##### crawlerOptions: [PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) #### Returns void ### [**](#saveSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L549)saveSnapshot * ****saveSnapshot**(page, options): Promise\ - Saves a full screenshot and HTML of the current page into a Key-Value store. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. * ##### optionaloptions: [SaveSnapshotOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#SaveSnapshotOptions) = {} #### Returns Promise\ --- # @crawlee/puppeteer Provides a simple framework for parallel crawling of web pages using headless Chrome with [Puppeteer](https://github.com/puppeteer/puppeteer). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `PuppeteerCrawler` uses headless Chrome to download web pages and extract data, it is useful for crawling of websites that require to execute JavaScript. If the target website doesn't need JavaScript, consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [PuppeteerCrawlerOptions.requestList](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestList) or [PuppeteerCrawlerOptions.requestQueue](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestQueue) constructor options, respectively. If both [PuppeteerCrawlerOptions.requestList](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestList) and [PuppeteerCrawlerOptions.requestQueue](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `PuppeteerCrawler` opens a new Chrome page (i.e. 
`PuppeteerCrawler` opens a new Chrome page (i.e. tab) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by the user as the [PuppeteerCrawlerOptions.requestHandler](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [PuppeteerCrawlerOptions.autoscaledPoolOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#autoscaledPoolOptions) parameter of the `PuppeteerCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) are available directly in the `PuppeteerCrawler` constructor. Note that the pool of Puppeteer instances is internally managed by the [BrowserPool](https://github.com/apify/browser-pool) class. ## Example usage[​](#example-usage "Direct link to Example usage")

```
import { Dataset, PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // This function is called to extract data from a single web page
        // 'page' is an instance of Puppeteer.Page with page.goto(request.url) already called
        // 'request' is an instance of the Request class with information about the page to load
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
    async failedRequestHandler({ request }) {
        // This function is called when the crawling of a request failed too many times
        await Dataset.pushData({
            url: request.url,
            succeeded: false,
            errors: request.errorMessages,
        });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

## Index[**](#Index) ### Crawlers * [**PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/puppeteer-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/puppeteer-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/puppeteer-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/puppeteer-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/puppeteer-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/puppeteer-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/puppeteer-crawler.md#BLOCKED_STATUS_CODES) * [**BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#BlockRequestsOptions) * [**BrowserCrawler](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserCrawler) * [**BrowserCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserCrawlerOptions) *
[**BrowserCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserCrawlingContext) * [**BrowserErrorHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserErrorHandler) * [**BrowserHook](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserHook) * [**BrowserLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserLaunchContext) * [**BrowserRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserRequestHandler) * [**checkStorageAccess](https://crawlee.dev/js/api/puppeteer-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/puppeteer-crawler.md#ClientInfo) * [**CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler.md#CompiledScriptFunction) * [**CompiledScriptParams](https://crawlee.dev/js/api/puppeteer-crawler.md#CompiledScriptParams) * [**Configuration](https://crawlee.dev/js/api/puppeteer-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/puppeteer-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/puppeteer-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#CreateContextOptions) * [**CreateSession](https://crawlee.dev/js/api/puppeteer-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/puppeteer-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/puppeteer-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/puppeteer-crawler.md#enqueueLinks) * [**EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#EnqueueLinksByClickingElementsOptions) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/puppeteer-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrorTracker) * 
[**ErrorTrackerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/puppeteer-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/puppeteer-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/puppeteer-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/puppeteer-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/puppeteer-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/puppeteer-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/puppeteer-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/puppeteer-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/puppeteer-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/puppeteer-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/puppeteer-crawler.md#HttpResponse) * [**InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#InfiniteScrollOptions) * [**InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#InjectFileOptions) * [**InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#InterceptHandler) * [**IRequestList](https://crawlee.dev/js/api/puppeteer-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/puppeteer-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/puppeteer-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/puppeteer-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/puppeteer-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/puppeteer-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/puppeteer-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/puppeteer-crawler.md#log) * [**Log](https://crawlee.dev/js/api/puppeteer-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/puppeteer-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/puppeteer-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/puppeteer-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/puppeteer-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/puppeteer-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/puppeteer-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/puppeteer-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/puppeteer-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/puppeteer-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/puppeteer-crawler.md#ProxyInfo) * 
[**PseudoUrl](https://crawlee.dev/js/api/puppeteer-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/puppeteer-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/puppeteer-crawler.md#PseudoUrlObject) * [**PuppeteerDirectNavigationOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PuppeteerDirectNavigationOptions) * [**purgeDefaultStorages](https://crawlee.dev/js/api/puppeteer-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/puppeteer-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/puppeteer-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/puppeteer-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/puppeteer-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/puppeteer-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/puppeteer-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/puppeteer-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/puppeteer-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/puppeteer-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#RouterHandler) * 
[**RouterRoutes](https://crawlee.dev/js/api/puppeteer-crawler.md#RouterRoutes) * [**SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SaveSnapshotOptions) * [**Session](https://crawlee.dev/js/api/puppeteer-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/puppeteer-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/puppeteer-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/puppeteer-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/puppeteer-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/puppeteer-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/puppeteer-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/puppeteer-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/puppeteer-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/puppeteer-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/puppeteer-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/puppeteer-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/puppeteer-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/puppeteer-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/puppeteer-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/puppeteer-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/puppeteer-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/puppeteer-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/puppeteer-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/puppeteer-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/puppeteer-crawler.md#withCheckedStorageAccess) * [**puppeteerClickElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md) * [**puppeteerRequestInterception](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md) * [**puppeteerUtils](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md) * [**PuppeteerCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md) * 
[**PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md) * [**PuppeteerHook](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerHook.md) * [**PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) * [**PuppeteerRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerRequestHandler.md) * [**PuppeteerGoToOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PuppeteerGoToOptions) * [**createPuppeteerRouter](https://crawlee.dev/js/api/puppeteer-crawler/function/createPuppeteerRouter.md) * [**launchPuppeteer](https://crawlee.dev/js/api/puppeteer-crawler/function/launchPuppeteer.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports 
[BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#BlockRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L10)BlockRequestsOptions Re-exports [BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#BlockRequestsOptions) ### [**](#BrowserCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L314)BrowserCrawler Re-exports [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) ### [**](#BrowserCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L75)BrowserCrawlerOptions Re-exports [BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md) ### [**](#BrowserCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L52)BrowserCrawlingContext Re-exports [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) ### [**](#BrowserErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L67)BrowserErrorHandler Re-exports [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler) ### [**](#BrowserHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L70)BrowserHook Re-exports [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook) ### [**](#BrowserLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L14)BrowserLaunchContext Re-exports [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) ### [**](#BrowserRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L64)BrowserRequestHandler Re-exports [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#CompiledScriptFunction)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L11)CompiledScriptFunction Re-exports [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) ### [**](#CompiledScriptParams)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L12)CompiledScriptParams Re-exports [CompiledScriptParams](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptParams) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### 
[**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports 
[DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksByClickingElementsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L20)EnqueueLinksByClickingElementsOptions Re-exports [EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#EnqueueLinksByClickingElementsOptions) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### 
[**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#InfiniteScrollOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L14)InfiniteScrollOptions Re-exports [InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InfiniteScrollOptions) ### [**](#InjectFileOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L15)InjectFileOptions Re-exports [InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InjectFileOptions) ### 
[**](#InterceptHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L6)InterceptHandler Re-exports [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports 
[LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#PuppeteerDirectNavigationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L13)PuppeteerDirectNavigationOptions Renames and re-exports [DirectNavigationOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#DirectNavigationOptions) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports 
[purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports 
[RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### 
[**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#SaveSnapshotOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L16)SaveSnapshotOptions Re-exports [SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#SaveSnapshotOptions) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason 
Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports 
[SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#PuppeteerGoToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L26)PuppeteerGoToOptions **PuppeteerGoToOptions: Parameters\\[1] --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/puppeteer ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/puppeteer ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/puppeteer # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/puppeteer ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/puppeteer # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * respect `exclude` option in `enqueueLinksByClickingElements` ([#3058](https://github.com/apify/crawlee/issues/3058)) ([013eb02](https://github.com/apify/crawlee/commit/013eb028b6ecf05f83f8790a4a6164b9c4873733)) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * extract only `body` from `iframe` elements ([#2986](https://github.com/apify/crawlee/issues/2986)) ([c36166e](https://github.com/apify/crawlee/commit/c36166e24887ca6de12f0c60ef010256fa830c31)), closes [#2979](https://github.com/apify/crawlee/issues/2979) ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/puppeteer # 
[3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/puppeteer ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/puppeteer ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/puppeteer # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * ignore errors from iframe content extraction ([#2714](https://github.com/apify/crawlee/issues/2714)) ([627e5c2](https://github.com/apify/crawlee/commit/627e5c2fbadce63c7e631217cd0e735597c0ce08)), closes [#2708](https://github.com/apify/crawlee/issues/2708) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * **core:** accept `UInt8Array` in `KVS.setValue()` ([#2682](https://github.com/apify/crawlee/issues/2682)) ([8ef0e60](https://github.com/apify/crawlee/commit/8ef0e60ca6fb2f4ec1b0d1aec6dcd53fcfb398b3)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/puppeteer ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/puppeteer ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/puppeteer ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/puppeteer # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features "Direct link to Features") * add `iframe` expansion to `parseWithCheerio` in browsers ([#2542](https://github.com/apify/crawlee/issues/2542)) ([328d085](https://github.com/apify/crawlee/commit/328d08598807782b3712bd543e394fe9a000a85d)), closes [#2507](https://github.com/apify/crawlee/issues/2507) * add `ignoreIframes` opt-out from the Cheerio iframe expansion ([#2562](https://github.com/apify/crawlee/issues/2562)) ([474a8dc](https://github.com/apify/crawlee/commit/474a8dc06a567cde0651d385fdac9c350ddf4508)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/puppeteer ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/puppeteer ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-1 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) 
([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/puppeteer ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/puppeteer # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/puppeteer ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/puppeteer ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/puppeteer # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * **puppeteer:** allow passing `networkidle` to `waitUntil` in `gotoExtended` ([#2399](https://github.com/apify/crawlee/issues/2399)) ([5d0030d](https://github.com/apify/crawlee/commit/5d0030d24858585715b0fac5568440f2b2346706)), closes [#2398](https://github.com/apify/crawlee/issues/2398) ### Features[​](#features-2 "Direct link to Features") * expand #shadow-root elements automatically in `parseWithCheerio` helper ([#2396](https://github.com/apify/crawlee/issues/2396)) ([a05b3a9](https://github.com/apify/crawlee/commit/a05b3a93a9b57926b353df0e79d846b5024c42ac)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/puppeteer ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/puppeteer # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * **puppeteer:** replace `page.waitForTimeout()` with `sleep()` ([52d7219](https://github.com/apify/crawlee/commit/52d7219acdc19b34a727e5d26f7f9288d27ca57f)), closes [#2335](https://github.com/apify/crawlee/issues/2335) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/puppeteer ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/puppeteer ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/puppeteer # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/puppeteer ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/puppeteer ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Features[​](#features-3 "Direct 
link to Features") * **puppeteer:** enable `new` headless mode ([#1910](https://github.com/apify/crawlee/issues/1910)) ([7fc999c](https://github.com/apify/crawlee/commit/7fc999cf4658ca69b97f16d434444081998470f4)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * add `skipNavigation` option to `enqueueLinks` ([#2153](https://github.com/apify/crawlee/issues/2153)) ([118515d](https://github.com/apify/crawlee/commit/118515d2ba534b99be2f23436f6abe41d66a8e07)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/puppeteer ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/puppeteer ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/puppeteer ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * allow to use any version of puppeteer or playwright ([#2102](https://github.com/apify/crawlee/issues/2102)) ([0cafceb](https://github.com/apify/crawlee/commit/0cafceb2966d430dd1b2a1b619fe66da1c951f4c)), closes [#2101](https://github.com/apify/crawlee/issues/2101) ### Features[​](#features-4 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * various helpers opening KVS now respect Configuration ([#2071](https://github.com/apify/crawlee/issues/2071)) ([59dbb16](https://github.com/apify/crawlee/commit/59dbb164699774e5a6718e98d0a4e8f630f35323)) ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/puppeteer ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/puppeteer # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-5 "Direct link to Features") * add `closeCookieModals` context helper for Playwright and Puppeteer ([#1927](https://github.com/apify/crawlee/issues/1927)) ([98d93bb](https://github.com/apify/crawlee/commit/98d93bb6713ec219baa83db2ad2cd1d7621a3339)) * **core:** use `RequestQueue.addBatchedRequests()` in `enqueueLinks` helper 
([4d61ca9](https://github.com/apify/crawlee/commit/4d61ca934072f8bbb680c842d8b1c9a4452ee73a)), closes [#1995](https://github.com/apify/crawlee/issues/1995) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/puppeteer ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/puppeteer # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Features[​](#features-6 "Direct link to Features") * infiniteScroll has maxScrollHeight limit ([#1945](https://github.com/apify/crawlee/issues/1945)) ([44997bb](https://github.com/apify/crawlee/commit/44997bba5bbf33ddb7dbac2f3e26d4bee60d4f47)) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/puppeteer ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-7 "Direct link to Features") * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **jsdom:** delay closing of the window and add some polyfills ([2e81618](https://github.com/apify/crawlee/commit/2e81618afb5f3890495e3e5fcfa037eb3319edc9)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) **Note:** Version bump only for package @crawlee/puppeteer ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/puppeteer ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/puppeteer # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * allow `userData` option in `enqueueLinksByClickingElements` ([#1749](https://github.com/apify/crawlee/issues/1749)) ([736f85d](https://github.com/apify/crawlee/commit/736f85d4a3b99a06d0f99f91e33e71976a9458a3)), closes [#1617](https://github.com/apify/crawlee/issues/1617) * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ### Features[​](#features-8 "Direct link to Features") * add `forefront` option to all `enqueueLinks` variants ([#1760](https://github.com/apify/crawlee/issues/1760)) ([a01459d](https://github.com/apify/crawlee/commit/a01459dffb51162e676354f0aa4811a1d36affa9)), closes [#1483](https://github.com/apify/crawlee/issues/1483) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/puppeteer ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 
"Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/puppeteer ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/puppeteer ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/puppeteer # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/puppeteer ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/puppeteer --- # PuppeteerCrawler Provides a simple framework for parallel crawling of web pages using headless Chrome with [Puppeteer](https://github.com/puppeteer/puppeteer). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `PuppeteerCrawler` uses headless Chrome to download web pages and extract data, it is useful for crawling of websites that require to execute JavaScript. If the target website doesn't need JavaScript, consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [PuppeteerCrawlerOptions.requestList](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestList) or [PuppeteerCrawlerOptions.requestQueue](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestQueue) constructor options, respectively. If both [PuppeteerCrawlerOptions.requestList](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestList) and [PuppeteerCrawlerOptions.requestQueue](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `PuppeteerCrawler` opens a new Chrome page (i.e. tab) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by user as the [PuppeteerCrawlerOptions.requestHandler](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [PuppeteerCrawlerOptions.autoscaledPoolOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#autoscaledPoolOptions) parameter of the `PuppeteerCrawler` constructor. 
For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) are available directly in the `PuppeteerCrawler` constructor. Note that the pool of Puppeteer instances is internally managed by the [BrowserPool](https://github.com/apify/browser-pool) class. **Example usage:** ``` const crawler = new PuppeteerCrawler({ async requestHandler({ page, request }) { // This function is called to extract data from a single web page // 'page' is an instance of Puppeteer.Page with page.goto(request.url) already called // 'request' is an instance of Request class with information about the page to load await Dataset.pushData({ title: await page.title(), url: request.url, succeeded: true, }) }, async failedRequestHandler({ request }) { // This function is called when the crawling of a request failed too many times await Dataset.pushData({ url: request.url, succeeded: false, errors: request.errorMessages, }) }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ### Hierarchy * [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md)<{ browserPlugins: \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)] }, LaunchOptions, [PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)> * *PuppeteerCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**browserPool](#browserPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**launchContext](#launchContext) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L148)constructor * ****new PuppeteerCrawler**(options, config): [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) - Overrides BrowserCrawler< { browserPlugins: \[PuppeteerPlugin] }, LaunchOptions, PuppeteerCrawlingContext >.constructor All `PuppeteerCrawler` parameters are passed via an options object. *** #### Parameters * ##### options: [PuppeteerCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md) = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? 
: [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from BrowserCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#browserPool)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L329)inheritedbrowserPool **browserPool: [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)<{ browserPlugins: \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)] }, \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)], [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | PuppeteerNewPageOptions, Page> Inherited from BrowserCrawler.browserPool A reference to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class that manages the crawler's browsers. ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L150)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from BrowserCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from BrowserCrawler.hasFinishedBefore ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L331)inheritedlaunchContext **launchContext: [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ Inherited from BrowserCrawler.launchContext ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BrowserCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L324)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from BrowserCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? 
: [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BrowserCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BrowserCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\, request>> = ... Inherited from BrowserCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from BrowserCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from BrowserCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from BrowserCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from BrowserCrawler.addRequests Adds requests to the queue in batches. 
By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from BrowserCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from BrowserCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from BrowserCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). 
*** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from BrowserCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BrowserCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from BrowserCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from BrowserCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from BrowserCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from BrowserCrawler.useState #### Parameters * ##### defaultValue: State = ... 
#### Returns Promise\ --- # createPuppeteerRouter ### Callable * ****createPuppeteerRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates a new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md). Defaults to the [PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. ``` import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee'; const router = createPuppeteerRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new PuppeteerCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # launchPuppeteer ### Callable * ****launchPuppeteer**(launchContext, config): Promise\ *** * Launches headless Chrome using Puppeteer, pre-configured to work within the Apify platform. The function has the same arguments and return value as `puppeteer.launch()`. See the [Puppeteer documentation](https://pptr.dev/api/puppeteer.launchoptions) for more details. The `launchPuppeteer()` function alters the following Puppeteer options: * Passes the setting from the `CRAWLEE_HEADLESS` environment variable to the `headless` option, unless it was already defined by the caller or the `CRAWLEE_XVFB` environment variable is set to `1`. Note that the Apify Actor cloud platform automatically sets `CRAWLEE_HEADLESS=1` for all running actors. * Takes the `proxyUrl` option, validates it and adds it to `args` as `--proxy-server=XXX`. The proxy URL must define a port number and have one of the following schemes: `http://`, `https://`, `socks4://` or `socks5://`. If the proxy is HTTP (i.e. has the `http://` scheme) and contains a username or password, the `launchPuppeteer` function sets up an anonymous HTTP proxy to make the proxy work with headless Chrome. For more information, read the [blog post about the proxy-chain library](https://blog.apify.com/how-to-make-headless-chrome-and-puppeteer-use-a-proxy-server-with-authentication-249a21a79212). To use this function, you need to have the [puppeteer](https://www.npmjs.com/package/puppeteer) NPM package installed in your project. When running on the Apify cloud, you can achieve that simply by using the `apify/actor-node-chrome` base Docker image for your actor - see the [Apify Actor documentation](https://docs.apify.com/actor/build#base-images) for details. *** #### Parameters * ##### optionallaunchContext: [PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) All `PuppeteerLauncher` parameters are passed via a launchContext object. If you want to pass custom `puppeteer.launch(options)` options, you can use the `PuppeteerLaunchContext.launchOptions` property. * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns Promise\ Promise that resolves to Puppeteer's `Browser` instance.
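For orientation, here is a minimal usage sketch of `launchPuppeteer()`. It assumes the `puppeteer` package is installed; the target URL and the commented-out proxy URL are placeholders, not values from this reference.

```
import { launchPuppeteer } from 'crawlee';

// Launch headless Chrome with Crawlee's pre-configured defaults.
const browser = await launchPuppeteer({
    launchOptions: { headless: true },
    // proxyUrl: 'http://user:password@proxy.example.com:8000', // placeholder proxy
});

// The returned value is a regular Puppeteer Browser instance.
const page = await browser.newPage();
await page.goto('https://www.example.com');
console.log(await page.title());
await browser.close();
```

For full crawls, prefer `PuppeteerCrawler`, which manages the browser pool, request queue and autoscaling for you and uses this launcher internally.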
--- # PuppeteerCrawlerOptions ### Hierarchy * [BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md)<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md), { browserPlugins: \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)] }> * *PuppeteerCrawlerOptions* ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**browserPoolOptions](#browserPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**headless](#headless) * [**httpClient](#httpClient) * [**ignoreIframes](#ignoreIframes) * [**ignoreShadowRoots](#ignoreShadowRoots) * [**keepAlive](#keepAlive) * [**launchContext](#launchContext) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from BrowserCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#browserPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L194)optionalinheritedbrowserPoolOptions **browserPoolOptions? 
: Partial<[BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md)<[BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)<[CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md), undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<[BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)<[BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, Page>> Inherited from BrowserCrawlerOptions.browserPoolOptions Custom options passed to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) constructor. We can tweak those to fine-tune browser management. ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L163)optionalinheritederrorHandler **errorHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\> Inherited from BrowserCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from BrowserCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L174)optionalinheritedfailedRequestHandler **failedRequestHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\> Inherited from BrowserCrawlerOptions.failedRequestHandler A function to handle requests that failed more than `option.maxRequestRetries` times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the failed request. 
Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L260)optionalinheritedheadless **headless? : boolean | new | old Inherited from BrowserCrawlerOptions.headless Whether to run browser in headless mode. Defaults to `true`. Can be also set via [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md). ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from BrowserCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreIframes)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L272)optionalinheritedignoreIframes **ignoreIframes? : boolean Inherited from BrowserCrawlerOptions.ignoreIframes Whether to ignore `iframes` when processing the page content via `parseWithCheerio` helper. By default, `iframes` are expanded automatically. Use this option to disable this behavior. ### [**](#ignoreShadowRoots)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L266)optionalinheritedignoreShadowRoots **ignoreShadowRoots? : boolean Inherited from BrowserCrawlerOptions.ignoreShadowRoots Whether to ignore custom elements (and their #shadow-roots) when processing the page content via `parseWithCheerio` helper. By default, they are expanded automatically. Use this option to disable this behavior. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from BrowserCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L33)optionallaunchContext **launchContext? : [PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) Overrides BrowserCrawlerOptions.launchContext Options used by [launchPuppeteer](https://crawlee.dev/js/api/puppeteer-crawler/function/launchPuppeteer.md) to start new Puppeteer instances. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from BrowserCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. 
### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from BrowserCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from BrowserCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from BrowserCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from BrowserCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from BrowserCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from BrowserCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. 
Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If unsure, it's better to keep the default value and let the concurrency scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L248)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from BrowserCrawlerOptions.navigationTimeoutSecs Timeout in which page navigation needs to finish, in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from BrowserCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on the robots.txt file, 2. because they don't match the enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached. ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L254)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from BrowserCrawlerOptions.persistCookiesPerSession Defines whether the cookies should be persisted for sessions. This can only be used when `useSessionPool` is set to `true`. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L69)optionalpostNavigationHooks **postNavigationHooks? : [PuppeteerHook](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerHook.md)\[] Overrides BrowserCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. Example: ``` postNavigationHooks: [ async (crawlingContext) => { const { page } = crawlingContext; if (hasCaptcha(page)) { await solveCaptcha(page); } }, ] ``` ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L52)optionalpreNavigationHooks **preNavigationHooks? : [PuppeteerHook](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerHook.md)\[] Overrides BrowserCrawlerOptions.preNavigationHooks Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotoOptions`, which are passed to the `page.goto()` function the crawler calls to navigate. Example: ``` preNavigationHooks: [ async (crawlingContext, gotoOptions) => { const { page } = crawlingContext; await page.evaluate((attr) => { window.foo = attr; }, 'bar'); }, ] ``` Modifying `pageOptions` is supported only in Playwright incognito mode.
See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook) ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L201)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from BrowserCrawlerOptions.proxyConfiguration If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration. ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L119)optionalinheritedrequestHandler **requestHandler? : [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\, request>> Inherited from BrowserCrawlerOptions.requestHandler Function that is called to process each request. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as an argument, where: * [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) is an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) object with details about the URL to open, HTTP method etc; * [`page`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#page) is an instance of the Puppeteer [Page](https://pptr.dev/api/puppeteer.page) or Playwright [Page](https://playwright.dev/docs/api/class-page); * [`browserController`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#browserController) is an instance of the [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md); * [`response`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#response) is an instance of the Puppeteer [Response](https://pptr.dev/api/puppeteer.httpresponse) or Playwright [Response](https://playwright.dev/docs/api/class-response), which is the main resource response as returned by the respective `page.goto()` function. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to the [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? 
: number = 60 Inherited from BrowserCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BrowserCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from BrowserCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BrowserCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from BrowserCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from BrowserCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. 
Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from BrowserCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another same domain request. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from BrowserCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from BrowserCrawlerOptions.statisticsOptions Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from BrowserCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters. ``` const crawler = new CheerioCrawler({ statusMessageCallback: async (ctx) => { return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG' }, statusMessageLoggingInterval: 1, // defaults to 10s async requestHandler({ $, enqueueLinks, request, log }) { // ... }, }); ``` ### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from BrowserCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling the `setStatusMessage` in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from BrowserCrawlerOptions.useSessionPool Basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). 
The session instance will be then available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # PuppeteerCrawlingContext \ ### Hierarchy * [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md)<[PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md), Page, HTTPResponse, [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md), UserData> * PuppeteerContextUtils * *PuppeteerCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**browserController](#browserController) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**page](#page) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**addInterceptRequestHandler](#addInterceptRequestHandler) * [**blockRequests](#blockRequests) * [**blockResources](#blockResources) * [**cacheResponses](#cacheResponses) * [**closeCookieModals](#closeCookieModals) * [**compileScript](#compileScript) * [**enqueueLinks](#enqueueLinks) * [**enqueueLinksByClickingElements](#enqueueLinksByClickingElements) * [**infiniteScroll](#infiniteScroll) * [**injectFile](#injectFile) * [**injectJQuery](#injectJQuery) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**removeInterceptRequestHandler](#removeInterceptRequestHandler) * [**saveSnapshot](#saveSnapshot) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from BrowserCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#browserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L59)inheritedbrowserController **browserController: [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) Inherited from BrowserCrawlingContext.browserController ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) Inherited from BrowserCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from BrowserCrawlingContext.getKeyValueStore Get a key-value store with the given name or ID, or the default one for the crawler. 
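For illustration, here is a minimal sketch of how this helper might be used inside a `requestHandler`; the store name `'page-titles'` and the key `'LAST_VISITED'` are placeholders, not part of the API:

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, getKeyValueStore }) {
        // Open (or create) a named key-value store; omit the argument to get the crawler's default store.
        const store = await getKeyValueStore('page-titles');
        // Persist a small record under an illustrative key.
        await store.setValue('LAST_VISITED', { url: request.url, title: await page.title() });
    },
});
```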
*** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from BrowserCrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BrowserCrawlingContext.log A preconfigured logger for the request handler. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L60)inheritedpage **page: Page Inherited from BrowserCrawlingContext.page ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from BrowserCrawlingContext.proxyInfo An object with information about the proxy currently used by the crawler, as configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from BrowserCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L61)optionalinheritedresponse **response? : HTTPResponse Inherited from BrowserCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from BrowserCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from BrowserCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#addInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1038)inheritedaddInterceptRequestHandler * ****addInterceptRequestHandler**(handler): Promise\ - Inherited from PuppeteerContextUtils.addInterceptRequestHandler Adds a request interception handler, similar to `page.on('request', handler);`, but with support for multiple parallel handlers. All the handlers are executed sequentially in the order in which they were added. Each of the handlers must call one of `request.continue()`, `request.abort()` or `request.respond()`. In addition, any of the handlers may modify the request object (method, postData, headers) by passing its overrides to `request.continue()`. If multiple handlers modify the same property, the last one wins. 
Headers are merged separately, so you can override only the value of a specific header. If one of the handlers calls `request.abort()` or `request.respond()`, the request is not propagated further to any of the remaining handlers. **Example usage:** ``` preNavigationHooks: [ async ({ addInterceptRequestHandler }) => { // Replace images with placeholder. await addInterceptRequestHandler((request) => { if (request.resourceType() === 'image') { return request.respond({ statusCode: 200, contentType: 'image/jpeg', body: placeholderImageBuffer, }); } return request.continue(); }); // Abort all the scripts. await addInterceptRequestHandler((request) => { if (request.resourceType() === 'script') return request.abort(); return request.continue(); }); // Change requests to post. await addInterceptRequestHandler((request) => { return request.continue({ method: 'POST', }); }); }, ], ``` *** #### Parameters * ##### handler: [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) Request interception handler. #### Returns Promise\ ### [**](#blockRequests)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L933)inheritedblockRequests * ****blockRequests**(options): Promise\ - Inherited from PuppeteerContextUtils.blockRequests Forces the Puppeteer browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources. By default, the function will block all URLs including the following patterns: ``` [".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"] ``` If you want to extend this list further, use the `extraUrlPatterns` option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the `urlPatterns` option, which will override the defaults and block only URLs with your custom patterns. This function does not use Puppeteer's request interception and therefore does not interfere with browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception. The function will never block main document loads and their respective redirects. **Example usage** ``` preNavigationHooks: [ async ({ blockRequests }) => { // Block all requests to URLs that include `adsbygoogle.js` and also all defaults. await blockRequests({ extraUrlPatterns: ['adsbygoogle.js'], }); }, ], ``` *** #### Parameters * ##### optionaloptions: [BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#BlockRequestsOptions) #### Returns Promise\ ### [**](#blockResources)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L940)inheritedblockResources * ****blockResources**(resourceTypes): Promise\ - Inherited from PuppeteerContextUtils.blockResources `blockResources()` has a high impact on performance in recent versions of Puppeteer. Until this is resolved, please use `utils.puppeteer.blockRequests()`. 
* **@deprecated** *** #### Parameters * ##### optionalresourceTypes: string\[] #### Returns Promise\ ### [**](#cacheResponses)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L956)inheritedcacheResponses * ****cacheResponses**(cache, responseUrlRules): Promise\ - Inherited from PuppeteerContextUtils.cacheResponses *NOTE:* In recent versions of Puppeteer, using this function entirely disables the browser cache, which results in sub-optimal performance. Until this is resolved, we suggest relying on the in-browser cache unless absolutely necessary. Enables caching of intercepted responses into a provided object. Automatically enables request interception in Puppeteer. *IMPORTANT*: Caching responses stores them in memory, so overly loose rules could cause memory leaks for longer-running crawlers. This issue should be resolved or at least mitigated in future iterations of this feature. * **@deprecated** *** #### Parameters * ##### cache: Dictionary\> Object in which responses are stored * ##### responseUrlRules: (string | RegExp)\[] List of rules that are used to check if the response should be cached. String rules are compared as page.url().includes(rule) while RegExp rules are evaluated as rule.test(page.url()). #### Returns Promise\ ### [**](#closeCookieModals)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1061)inheritedcloseCookieModals * ****closeCookieModals**(): Promise\ - Inherited from PuppeteerContextUtils.closeCookieModals Tries to close cookie consent modals on the page. Based on the I Don't Care About Cookies browser extension. *** #### Returns Promise\ ### [**](#compileScript)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L987)inheritedcompileScript * ****compileScript**(scriptString, ctx): [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) - Inherited from PuppeteerContextUtils.compileScript Compiles a Puppeteer script into an async function that may be executed at any time by providing it with the following object: ``` { page: Page, request: Request, } ``` Where `page` is a Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) and `request` is a [Request](https://crawlee.dev/js/api/core/class/Request.md). The function is compiled by using the `scriptString` parameter as the function's body, so any limitations to function bodies apply. The return value of the compiled function is the return value of the function body, i.e. of the `scriptString` parameter. As a security measure, no globals such as `process` or `require` are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore you should only use this function to execute sanitized or safe code. Custom context may also be provided using the `context` parameter. To improve security, make sure to pass only the objects that are really necessary to the context, preferably making secured copies beforehand. 
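As a hedged sketch of the behavior described above: the compiled function is invoked with the `page` and `request` from the crawling context, and its return value is whatever the script body returns. The script string below is illustrative only:

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, compileScript, log }) {
        // The string becomes the function body; `page` and `request` are available inside it.
        const getTitle = compileScript('return page.title();');
        const title = await getTitle({ page, request });
        log.info(`Title of ${request.url}: ${title}`);
    },
});
```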
*** #### Parameters * ##### scriptString: string * ##### optionalctx: Dictionary #### Returns [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from BrowserCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L893)inheritedenqueueLinksByClickingElements * ****enqueueLinksByClickingElements**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from PuppeteerContextUtils.enqueueLinksByClickingElements The function finds elements matching a specific CSS selector in a Puppeteer page, clicks all those elements using a mouse move and a left mouse button click and intercepts all the navigation requests that are subsequently produced by the page. The intercepted requests, including their methods, headers and payloads are then enqueued to a provided [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This is useful to crawl JavaScript heavy pages where links are not available in `href` elements, but rather navigations are triggered in click handlers. If you're looking to find URLs in `href` attributes of the page, see enqueueLinks. Optionally, the function allows you to filter the target links' URLs using an array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) objects and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. **IMPORTANT**: To be able to do this, this function uses various mutations on the page, such as changing the Z-index of elements being clicked and their visibility. Therefore, it is recommended to only use this function as the last operation in the page. **USING HEADFUL BROWSER**: When using a headful browser, this function will only be able to click elements in the focused tab, effectively limiting concurrency to 1. 
In headless mode, full concurrency can be achieved. **PERFORMANCE**: Clicking elements with a mouse and intercepting requests is not a low-level operation that takes nanoseconds. It's not very CPU intensive, but it takes time. We strongly recommend limiting the scope of the clicking as much as possible by using a specific selector that targets only the elements that you assume or know will produce a navigation. You can certainly click everything by using the `*` selector, but be prepared to wait minutes to get results on a large and complex page. **Example usage** ``` async requestHandler({ enqueueLinksByClickingElements }) { await enqueueLinksByClickingElements({ selector: 'a.product-detail', globs: [ 'https://www.example.com/handbags/**', 'https://www.example.com/purses/**', ], }); }, ``` *** #### Parameters * ##### options: Omit<[EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#EnqueueLinksByClickingElementsOptions), requestQueue | page> #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#infiniteScroll)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1051)inheritedinfiniteScroll * ****infiniteScroll**(options): Promise\ - Inherited from PuppeteerContextUtils.infiniteScroll Scrolls to the bottom of a page, or until it times out. Loads dynamic content when it hits the bottom of a page, and then continues scrolling. *** #### Parameters * ##### optionaloptions: [InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InfiniteScrollOptions) #### Returns Promise\ ### [**](#injectFile)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L794)inheritedinjectFile * ****injectFile**(filePath, options): Promise\ - Inherited from PuppeteerContextUtils.injectFile Injects a JavaScript file into the current `page`. Unlike Puppeteer's `addScriptTag` function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies. File contents are cached for up to 10 files to limit file system access. *** #### Parameters * ##### filePath: string * ##### optionaloptions: [InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InjectFileOptions) #### Returns Promise\ ### [**](#injectJQuery)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L821)inheritedinjectJQuery * ****injectJQuery**(): Promise\ - Inherited from PuppeteerContextUtils.injectJQuery Injects the [jQuery](https://jquery.com/) library into the current `page`. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors. Beware that the injected jQuery object will be set to the `window.$` variable and thus it might cause conflicts with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect the functionality of the page's scripts. The injected jQuery will survive page navigations and reloads. 
**Example usage:** ``` async requestHandler({ page, injectJQuery }) { await injectJQuery(); const title = await page.evaluate(() => { return $('head title').text(); }); }, ``` Note that `injectJQuery()` does not affect Puppeteer's [`page.$()`](https://pptr.dev/api/puppeteer.page._/) function in any way. *** #### Returns Promise\ ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L850)inheritedparseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Inherited from PuppeteerContextUtils.parseWithCheerio Returns a Cheerio handle for `page.content()`, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it waits for it to be available first. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }, ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BrowserCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#removeInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1045)inheritedremoveInterceptRequestHandler * ****removeInterceptRequestHandler**(handler): Promise\ - Inherited from PuppeteerContextUtils.removeInterceptRequestHandler Removes a request interception handler for the given page. *** #### Parameters * ##### handler: [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) Request interception handler. #### Returns Promise\ ### [**](#saveSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1056)inheritedsaveSnapshot * ****saveSnapshot**(options): Promise\ - Inherited from PuppeteerContextUtils.saveSnapshot Saves a full screenshot and HTML of the current page into a Key-Value store. *** #### Parameters * ##### optionaloptions: [SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#SaveSnapshotOptions) #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from BrowserCrawlingContext.sendRequest Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that. 
``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L836)inheritedwaitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Inherited from PuppeteerContextUtils.waitForSelector Wait for an element matching the selector to appear. The timeout defaults to 5s. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }, ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # PuppeteerHook ### Hierarchy * [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook)<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md), [PuppeteerGoToOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PuppeteerGoToOptions)> * *PuppeteerHook* ### Callable * ****PuppeteerHook**(crawlingContext, gotoOptions): Awaitable\ *** * #### Parameters * ##### crawlingContext: [PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\ * ##### gotoOptions: undefined | GoToOptions #### Returns Awaitable\ --- # PuppeteerLaunchContext Apify extends the launch options of Puppeteer. You can use any of the Puppeteer compatible [`LaunchOptions`](https://pptr.dev/api/puppeteer.launchoptions) options by providing the `launchOptions` property. **Example:** ``` // launch a headless Chrome (not Chromium) const launchContext = { // Apify helpers useChrome: true, proxyUrl: 'http://user:password@some.proxy.com', // Native Puppeteer options launchOptions: { headless: true, args: ['--some-flag'], } } ``` ### Hierarchy * [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)<[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)\[launchOptions], unknown> * *PuppeteerLaunchContext* ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launcher](#launcher) * [**launchOptions](#launchOptions) * [**proxyUrl](#proxyUrl) * [**useChrome](#useChrome) * [**useIncognitoPages](#useIncognitoPages) * [**userAgent](#userAgent) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L40)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from BrowserLaunchContext.browserPerProxy If set to `true`, the crawler respects the proxy URL generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L54)optionalinheritedexperimentalContainersexperimental **experimentalContainers? : boolean Inherited from BrowserLaunchContext.experimentalContainers Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. 
Unstable on Chromium. ### [**](#launcher)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L60)optionallauncher **launcher? : unknown Overrides BrowserLaunchContext.launcher An already-required module (`Object`). This enables usage of various Puppeteer wrappers such as `puppeteer-extra`. Take caution, because it can cause all kinds of unexpected errors and weird behavior. Crawlee is not tested with any other library besides `puppeteer` itself. ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L32)optionallaunchOptions **launchOptions? : LaunchOptions Overrides BrowserLaunchContext.launchOptions `puppeteer.launch` [options](https://pptr.dev/api/puppeteer.launchoptions) ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L40)optionalproxyUrl **proxyUrl? : string Overrides BrowserLaunchContext.proxyUrl URL to an HTTP proxy server. It must define the port number, and it may also contain proxy username and password. Example: `http://bob:pass123@proxy.example.com:1234`. ### [**](#useChrome)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L51)optionaluseChrome **useChrome? : boolean = false Overrides BrowserLaunchContext.useChrome If `true` and `executablePath` is not set, Puppeteer will launch the full Google Chrome browser available on the machine rather than the bundled Chromium. The path to the Chrome executable is taken from the `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable if provided, or defaults to the typical Google Chrome executable location specific to the operating system. By default, this option is `false`. ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L67)optionaluseIncognitoPages **useIncognitoPages? : boolean = false Overrides BrowserLaunchContext.useIncognitoPages With this option selected, all pages will be opened in a new incognito browser context. This means they will not share cookies or cache, and their resources will not be throttled by one another. ### [**](#userAgent)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L68)optionalinheriteduserAgent **userAgent? : string Inherited from BrowserLaunchContext.userAgent The `User-Agent` HTTP header used by the browser. If not provided, the function sets `User-Agent` to a reasonable default to reduce the chance of detection of the crawler. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L61)optionalinheriteduserDataDir **userDataDir? : string Inherited from BrowserLaunchContext.userDataDir Sets the [User Data Directory](https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md) path. The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state. If not specified, a temporary directory is used instead. 
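To show how these properties fit together, here is a minimal sketch of passing a launch context to `PuppeteerCrawler`; the flags and start URL are placeholders:

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        useChrome: true,             // use the locally installed Chrome instead of the bundled Chromium
        useIncognitoPages: true,     // isolate cookies and cache between pages
        launchOptions: {
            headless: true,
            args: ['--disable-gpu'], // placeholder flag
        },
    },
    async requestHandler({ request, page }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```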
--- # PuppeteerRequestHandler ### Hierarchy * [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler)\> * *PuppeteerRequestHandler* ### Callable * ****PuppeteerRequestHandler**(inputs): Awaitable\ *** * #### Parameters * ##### inputs: { request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>> } & Omit<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\, request>, request> #### Returns Awaitable\ --- # puppeteerClickElements ## Index[**](#Index) ### References * [**enqueueLinksByClickingElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#enqueueLinksByClickingElements) ### Interfaces * [**EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#EnqueueLinksByClickingElementsOptions) ### Functions * [**isTargetRelevant](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#isTargetRelevant) ## References[**](#References) ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L225)enqueueLinksByClickingElements Re-exports [enqueueLinksByClickingElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#enqueueLinksByClickingElements) ## Interfaces[**](#Interfaces) ### [**](#EnqueueLinksByClickingElementsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L30)EnqueueLinksByClickingElementsOptions **EnqueueLinksByClickingElementsOptions: ### [**](#clickOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L56)optionalclickOptions **clickOptions? : ClickOptions Click options for use in Puppeteer's click handler. ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L83)optionalexclude **exclude? : readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L175)optionalforefront **forefront? : boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. 
### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L72)optionalglobs **globs? : [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L51)optionallabel **label? : string Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. ### [**](#maxWaitForPageIdleSecs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L165)optionalmaxWaitForPageIdleSecs **maxWaitForPageIdleSecs? : number = 5 This is the maximum period for which the function will keep tracking events, even if more events keep coming. Its purpose is to prevent a deadlock in the page by periodic events, often unrelated to the clicking itself. See `waitForPageIdleSecs` above for an explanation. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L34)page **page: Page Puppeteer [`Page`](https://pptr.dev/#?product=Puppeteer\&show=api-class-page) object. ### [**](#pseudoUrls)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L117)optionalpseudoUrls **pseudoUrls? : [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] *NOTE:* In future versions of SDK the options will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. * **@deprecated** prefer using `globs` or `regexps` instead ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L96)optionalregexps **regexps? : [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. 
The plain objects must include at least the `regexp` property, which holds the regular expression. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L39)requestQueue **requestQueue: [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) A request queue to which the URLs will be enqueued. ### [**](#selector)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L45)selector **selector: string A CSS selector matching elements to be clicked on. Unlike in enqueueLinks, there is no default value. This is to prevent suboptimal use of this function by using it too broadly. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L181)optionalskipNavigation **skipNavigation? : boolean = false If set to `true`, tells the crawler to skip navigation and process the request directly. ### [**](#transformRequestFunction)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L140)optionaltransformRequestFunction **transformRequestFunction? : [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `useExtendedUniqueKey: true` to the `request` object, `uniqueKey` will be computed from a combination of `url`, `method` and `payload` which enables crawling of websites that navigate using form submits (POST requests). **Example:** ``` { transformRequestFunction: (request) => { request.userData.foo = 'bar'; request.useExtendedUniqueKey = true; return request; } } ``` ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L48)optionaluserData **userData? : Dictionary Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. ### [**](#waitForPageIdleSecs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L157)optionalwaitForPageIdleSecs **waitForPageIdleSecs? : number = 1 Clicking in the page triggers various asynchronous operations that lead to new URLs being shown by the browser. It could be a simple JavaScript redirect or opening of a new tab in the browser. These events often happen only some time after the actual click. Requests typically take milliseconds while new tabs open in hundreds of milliseconds. 
To be able to capture all those events, the `enqueueLinksByClickingElements()` function repeatedly waits for the `waitForPageIdleSecs`. By repeatedly we mean that whenever a relevant event is triggered, the timer is restarted. As long as new events keep coming, the function will not return, unless the below `maxWaitForPageIdleSecs` timeout is reached. You may want to reduce this, for example, when you're sure that your clicks do not open new tabs, or increase it when you're not getting all the expected URLs. ## Functions[**](#Functions) ### [**](#isTargetRelevant)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L417)isTargetRelevant * ****isTargetRelevant**(page, target): boolean - We're only interested in pages created by the page we're currently clicking in. There will generally be a lot of other targets being created in the browser. *** #### Parameters * ##### page: Page * ##### target: Target #### Returns boolean --- # puppeteerRequestInterception ## Index[**](#Index) ### Type Aliases * [**InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) ### Functions * [**addInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#addInterceptRequestHandler) * [**removeInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#removeInterceptRequestHandler) ## Type Aliases[**](<#Type Aliases>) ### [**](#InterceptHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_request_interception.ts#L9)InterceptHandler **InterceptHandler: (request) => unknown #### Type declaration * * **(request): unknown - #### Parameters * ##### request: PuppeteerRequest #### Returns unknown ## Functions[**](#Functions) ### [**](#addInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_request_interception.ts#L160)addInterceptRequestHandler * ****addInterceptRequestHandler**(page, handler): Promise\ - Adds a request interception handler, similar to `page.on('request', handler);`, but with support for multiple parallel handlers. All the handlers are executed sequentially in the order in which they were added. Each of the handlers must call one of `request.continue()`, `request.abort()` or `request.respond()`. In addition, any of the handlers may modify the request object (method, postData, headers) by passing its overrides to `request.continue()`. If multiple handlers modify the same property, the last one wins. Headers are merged separately, so you can override only the value of a specific header. If one of the handlers calls `request.abort()` or `request.respond()`, the request is not propagated further to any of the remaining handlers. **Example usage:** ``` // Replace images with placeholder. await addInterceptRequestHandler(page, (request) => { if (request.resourceType() === 'image') { return request.respond({ statusCode: 200, contentType: 'image/jpeg', body: placeholderImageBuffer, }); } return request.continue(); }); // Abort all the scripts. await addInterceptRequestHandler(page, (request) => { if (request.resourceType() === 'script') return request.abort(); return request.continue(); }); // Change requests to post. 
await addInterceptRequestHandler(page, (request) => { return request.continue({ method: 'POST', }); }); await page.goto('http://example.com'); ``` *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/#?product=Puppeteer\&show=api-class-page) object. * ##### handler: [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) Request interception handler. #### Returns Promise\ ### [**](#removeInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_request_interception.ts#L203)removeInterceptRequestHandler * ****removeInterceptRequestHandler**(page, handler): Promise\ - Removes request interception handler for given page. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/#?product=Puppeteer\&show=api-class-page) object. * ##### handler: [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) Request interception handler. #### Returns Promise\ --- # puppeteerUtils A namespace that contains various utilities for [Puppeteer](https://github.com/puppeteer/puppeteer) - the headless Chrome Node API. **Example usage:** ``` import { launchPuppeteer, utils } from 'crawlee'; // Open https://www.example.com in Puppeteer const browser = await launchPuppeteer(); const page = await browser.newPage(); await page.goto('https://www.example.com'); // Inject jQuery into a page await utils.puppeteer.injectJQuery(page); ``` ## Index[**](#Index) ### References * [**addInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#addInterceptRequestHandler) * [**removeInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#removeInterceptRequestHandler) ### Interfaces * [**BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#BlockRequestsOptions) * [**CompiledScriptParams](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptParams) * [**DirectNavigationOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#DirectNavigationOptions) * [**InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InfiniteScrollOptions) * [**InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InjectFileOptions) * [**SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#SaveSnapshotOptions) ### Type Aliases * [**CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) ### Functions * [**blockRequests](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#blockRequests) * [**blockResources](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#blockResources) * [**cacheResponses](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#cacheResponses) * [**closeCookieModals](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#closeCookieModals) * [**compileScript](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#compileScript) * [**enqueueLinksByClickingElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#enqueueLinksByClickingElements) * 
[**gotoExtended](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#gotoExtended) * [**infiniteScroll](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#infiniteScroll) * [**injectFile](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#injectFile) * [**injectJQuery](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#injectJQuery) * [**parseWithCheerio](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#parseWithCheerio) * [**saveSnapshot](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#saveSnapshot) ## References[**](#References) ### [**](#addInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1118)addInterceptRequestHandler Re-exports [addInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#addInterceptRequestHandler) ### [**](#removeInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1118)removeInterceptRequestHandler Re-exports [removeInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#removeInterceptRequestHandler) ## Interfaces[**](#Interfaces) ### [**](#BlockRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L84)BlockRequestsOptions **BlockRequestsOptions: ### [**](#extraUrlPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L96)optionalextraUrlPatterns **extraUrlPatterns? : string\[] If you just want to append to the default blocked patterns, use this property. ### [**](#urlPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L91)optionalurlPatterns **urlPatterns? : string\[] The patterns of URLs to block from being loaded by the browser. Only `*` can be used as a wildcard. It is also automatically added to the beginning and end of the pattern. This limitation is enforced by the DevTools protocol. `.png` is the same as `*.png*`. ### [**](#CompiledScriptParams)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L99)CompiledScriptParams **CompiledScriptParams: ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L100)page **page: Page ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L101)request **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ ### [**](#DirectNavigationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L50)DirectNavigationOptions **DirectNavigationOptions: ### [**](#referer)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L72)optionalreferer **referer? : string Referer header value. If provided it will take preference over the referer header value set by page.setExtraHTTPHeaders(headers). 
### [**](#timeout)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L57)optionaltimeout **timeout? : number Maximum operation time in milliseconds, defaults to 30 seconds, pass `0` to disable timeout. The default value can be changed by using the browserContext.setDefaultNavigationTimeout(timeout), browserContext.setDefaultTimeout(timeout), page.setDefaultNavigationTimeout(timeout) or page.setDefaultTimeout(timeout) methods. ### [**](#waitUntil)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L67)optionalwaitUntil **waitUntil? : domcontentloaded | load | networkidle | networkidle0 | networkidle2 When to consider operation succeeded, defaults to `load`. Events can be either: * `domcontentloaded` - consider operation to be finished when the `DOMContentLoaded` event is fired. * `load` - consider operation to be finished when the `load` event is fired. * `networkidle0` - consider operation to be finished when there are no network connections for at least `500` ms. * `networkidle2` - consider operation to be finished when there are no more than 2 network connections for at least `500` ms. * `networkidle` - alias for `networkidle0` ### [**](#InfiniteScrollOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L526)InfiniteScrollOptions **InfiniteScrollOptions: ### [**](#buttonSelector)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L554)optionalbuttonSelector **buttonSelector? : string Optionally checks and clicks a button if it appears while scrolling. This is required on some websites for the scroll to work. ### [**](#maxScrollHeight)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L537)optionalmaxScrollHeight **maxScrollHeight? : number = 0 How many pixels to scroll down. If 0, will scroll until bottom of page. ### [**](#scrollDownAndUp)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L549)optionalscrollDownAndUp **scrollDownAndUp? : boolean = false If true, it will scroll up a bit after each scroll down. This is required on some websites for the scroll to work. ### [**](#stopScrollCallback)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L559)optionalstopScrollCallback **stopScrollCallback? : () => unknown This function is called after every scroll and stops the scrolling process if it returns `true`. The function can be `async`. *** #### Type declaration * * **(): unknown - #### Returns unknown ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L531)optionaltimeoutSecs **timeoutSecs? : number = 0 How many seconds to scroll for. If 0, will scroll until bottom of page. ### [**](#waitForSecs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L543)optionalwaitForSecs **waitForSecs? : number = 4 How many seconds to wait for no new content to load before exit. 
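The options above are easiest to understand in combination. Below is a minimal, illustrative sketch of passing them to `utils.puppeteer.infiniteScroll()` (the function itself is documented further below); the `button.load-more` and `.item` selectors and the numeric limits are assumptions made up for the example, not defaults.

```
import { launchPuppeteer, utils } from 'crawlee';

const browser = await launchPuppeteer();
const page = await browser.newPage();
await page.goto('https://www.example.com');

// Scroll for at most 30 seconds, click a "Load more" button whenever it appears,
// and stop early once enough items have been rendered on the page.
await utils.puppeteer.infiniteScroll(page, {
    timeoutSecs: 30,
    waitForSecs: 4,
    buttonSelector: 'button.load-more', // assumed selector, adjust for your page
    stopScrollCallback: async () => {
        const itemCount = await page.$$eval('.item', (els) => els.length);
        return itemCount >= 100; // returning true stops the scrolling
    },
});
```

Using `stopScrollCallback` together with `timeoutSecs` bounds the scrolling both by the amount of content already loaded and by time.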
### [**](#InjectFileOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L75)InjectFileOptions **InjectFileOptions: ### [**](#surviveNavigations)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L81)optionalsurviveNavigations **surviveNavigations? : boolean Enables the injected script to survive page navigations and reloads without the need to be re-injected manually. This does not mean, however, that internal state will be preserved; it only means that the script will be automatically re-injected on each navigation before any other scripts get the chance to execute. ### [**](#SaveSnapshotOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L690)SaveSnapshotOptions **SaveSnapshotOptions: ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L725)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Configuration of the crawler that will be used to save the snapshot. ### [**](#key)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L695)optionalkey **key? : string = 'SNAPSHOT' Key under which the screenshot and HTML will be saved. `.jpg` will be appended for the screenshot and `.html` for the HTML. ### [**](#keyValueStoreName)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L719)optionalkeyValueStoreName **keyValueStoreName? : null | string = null Name or ID of the Key-Value store where the snapshot is saved. By default, it is saved to the default Key-Value store. ### [**](#saveHtml)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L713)optionalsaveHtml **saveHtml? : boolean = true If true, it will save the full HTML of the current page as a record with `key` appended by `.html`. ### [**](#saveScreenshot)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L707)optionalsaveScreenshot **saveScreenshot? : boolean = true If true, it will save a full screenshot of the current page as a record with `key` appended by `.jpg`. ### [**](#screenshotQuality)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L701)optionalscreenshotQuality **screenshotQuality? : number = 50 The quality of the image, between 0-100. Higher quality images have a bigger size and require more storage. 
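For a quick illustration of how these options are used, here is a minimal sketch of a `utils.puppeteer.saveSnapshot()` call (the function is documented further below); the key and the key-value store name are made-up values for the example.

```
import { launchPuppeteer, utils } from 'crawlee';

const browser = await launchPuppeteer();
const page = await browser.newPage();
await page.goto('https://www.example.com');

// Save both the HTML and a screenshot of the page under the key 'example-page',
// producing the records example-page.html and example-page.jpg.
await utils.puppeteer.saveSnapshot(page, {
    key: 'example-page',
    saveHtml: true,
    saveScreenshot: true,
    screenshotQuality: 60, // 0-100, higher means bigger files
    keyValueStoreName: 'debug-snapshots', // assumed store name; omit to use the default store
});
```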
## Type Aliases[**](<#Type Aliases>) ### [**](#CompiledScriptFunction)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L104)CompiledScriptFunction **CompiledScriptFunction: (params) => Promise\ #### Type declaration * * **(params): Promise\ - #### Parameters * ##### params: [CompiledScriptParams](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptParams) #### Returns Promise\ ## Functions[**](#Functions) ### [**](#blockRequests)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L282)blockRequests * ****blockRequests**(page, options): Promise\ - Forces the Puppeteer browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources. By default, the function will block all URLs including the following patterns: ``` [".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"] ``` If you want to extend this list further, use the `extraUrlPatterns` option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the `urlPatterns` option, which will override the defaults and block only URLs with your custom patterns. This function does not use Puppeteer's request interception and therefore does not interfere with the browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception. The function will never block main document loads and their respective redirects. **Example usage** ``` import { launchPuppeteer, utils } from 'crawlee'; const browser = await launchPuppeteer(); const page = await browser.newPage(); // Block all requests to URLs that include `adsbygoogle.js` and also all defaults. await utils.puppeteer.blockRequests(page, { extraUrlPatterns: ['adsbygoogle.js'], }); await page.goto('https://cnn.com'); ``` *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### optionaloptions: [BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#BlockRequestsOptions) = {} #### Returns Promise\ ### [**](#blockResources)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L334)blockResources * ****blockResources**(page, resourceTypes): Promise\ - #### Parameters * ##### page: Page * ##### resourceTypes: string\[] = ... #### Returns Promise\ ### [**](#cacheResponses)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L362)cacheResponses * ****cacheResponses**(page, cache, responseUrlRules): Promise\ - *NOTE:* In recent versions of Puppeteer, using this function entirely disables the browser cache, which results in sub-optimal performance. Until this is resolved, we suggest relying on the in-browser cache unless absolutely necessary. Enables caching of intercepted responses into a provided object. Automatically enables request interception in Puppeteer. *IMPORTANT*: Caching responses stores them to memory, so overly loose rules could cause memory leaks in longer-running crawlers. 
This issue should be resolved or at least mitigated in future iterations of this feature. * **@deprecated** *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### cache: Dictionary\> Object in which responses are stored. * ##### responseUrlRules: (string | RegExp)\[] List of rules that are used to check if the response should be cached. String rules are compared as `page.url().includes(rule)`, while RegExp rules are evaluated as `rule.test(page.url())`. #### Returns Promise\ ### [**](#closeCookieModals)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L781)closeCookieModals * ****closeCookieModals**(page): Promise\ - #### Parameters * ##### page: Page #### Returns Promise\ ### [**](#compileScript)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L440)compileScript * ****compileScript**(scriptString, context): [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) - Compiles a Puppeteer script into an async function that may be executed at any time by providing it with the following object: ``` { page: Page, request: Request, } ``` Where `page` is a Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) and `request` is a [Request](https://crawlee.dev/js/api/core/class/Request.md). The function is compiled by using the `scriptString` parameter as the function's body, so any limitations to function bodies apply. The return value of the compiled function is the return value of the function body, i.e. the `scriptString` parameter. As a security measure, no globals such as `process` or `require` are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore, you should only use this function to execute sanitized or safe code. Custom context may also be provided using the `context` parameter. To improve security, make sure to only pass the objects that are really necessary to the context, preferably making secured copies beforehand. *** #### Parameters * ##### scriptString: string * ##### context: Dictionary = ... #### Returns [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L225)enqueueLinksByClickingElements * ****enqueueLinksByClickingElements**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - The function finds elements matching a specific CSS selector in a Puppeteer page, clicks all those elements using a mouse move and a left mouse button click, and intercepts all the navigation requests that are subsequently produced by the page. The intercepted requests, including their methods, headers and payloads, are then enqueued to a provided [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This is useful to crawl JavaScript-heavy pages where links are not available in `href` elements, but rather navigations are triggered in click handlers. If you're looking to find URLs in `href` attributes of the page, see enqueueLinks. 
Optionally, the function allows you to filter the target links' URLs using an array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) objects and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. **IMPORTANT**: To be able to do this, this function uses various mutations on the page, such as changing the Z-index of elements being clicked and their visibility. Therefore, it is recommended to only use this function as the last operation in the page. **USING HEADFUL BROWSER**: When using a headful browser, this function will only be able to click elements in the focused tab, effectively limiting concurrency to 1. In headless mode, full concurrency can be achieved. **PERFORMANCE**: Clicking elements with a mouse and intercepting requests is not a low-level operation that takes nanoseconds. It's not very CPU intensive, but it takes time. We strongly recommend limiting the scope of the clicking as much as possible by using a specific selector that targets only the elements that you assume or know will produce a navigation. You can certainly click everything by using the `*` selector, but be prepared to wait minutes to get results on a large and complex page. **Example usage** ``` await utils.puppeteer.enqueueLinksByClickingElements({ page, requestQueue, selector: 'a.product-detail', pseudoUrls: [ 'https://www.example.com/handbags/[.*]', 'https://www.example.com/purses/[.*]' ], }); ``` *** #### Parameters * ##### options: [EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#EnqueueLinksByClickingElementsOptions) #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to a [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#gotoExtended)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L468)gotoExtended * ****gotoExtended**(page, request, gotoOptions): Promise\ - Extended version of Puppeteer's `page.goto()`, allowing you to perform requests with an HTTP method other than GET, with custom headers and a POST payload. The URL, method, headers and payload are taken from the `request` parameter, which must be an instance of the Request class. *NOTE:* In recent versions of Puppeteer, using requests other than GET, overriding headers or adding payloads disables the browser cache, which degrades performance. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### optionalgotoOptions: [DirectNavigationOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#DirectNavigationOptions) = {} Custom options for `page.goto()`. #### Returns Promise\ ### [**](#infiniteScroll)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L568)infiniteScroll * ****infiniteScroll**(page, options): Promise\ - Scrolls to the bottom of a page, or until it times out. Loads dynamic content when it hits the bottom of a page, and then continues scrolling. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. 
* ##### optionaloptions: [InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InfiniteScrollOptions) = {} #### Returns Promise\ ### [**](#injectFile)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L122)injectFile * ****injectFile**(page, filePath, options): Promise\ - Injects a JavaScript file into a Puppeteer page. Unlike Puppeteer's `addScriptTag` function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies. File contents are cached for up to 10 files to limit file system access. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### filePath: string File path. * ##### optionaloptions: [InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InjectFileOptions) = {} #### Returns Promise\ ### [**](#injectJQuery)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L175)injectJQuery * ****injectJQuery**(page, options): Promise\ - Injects the [jQuery](https://jquery.com/) library into a Puppeteer page. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors. Beware that the injected jQuery object will be set to the `window.$` variable and thus it might cause conflicts with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect the functionality of the page's scripts. The injected jQuery will survive page navigations and reloads by default. **Example usage:** ``` await utils.puppeteer.injectJQuery(page); const title = await page.evaluate(() => { return $('head title').text(); }); ``` Note that `injectJQuery()` does not affect Puppeteer's [`page.$()`](https://pptr.dev/api/puppeteer.page._/) function in any way. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### optionaloptions: { surviveNavigations?: boolean } * ##### optionalsurviveNavigations: boolean Opt-out option to disable the jQuery re-injection after navigation. #### Returns Promise\ ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L192)parseWithCheerio * ****parseWithCheerio**(page, ignoreShadowRoots, ignoreIframes): Promise<[CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot)> - Returns a Cheerio handle for `page.content()`, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). **Example usage:** ``` const $ = await utils.puppeteer.parseWithCheerio(page); const title = $('title').text(); ``` *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### ignoreShadowRoots: boolean = false * ##### ignoreIframes: boolean = false #### Returns Promise<[CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot)> ### [**](#saveSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L733)saveSnapshot * ****saveSnapshot**(page, options): Promise\ - Saves a full screenshot and HTML of the current page into a Key-Value store. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. 
* ##### optionaloptions: [SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#SaveSnapshotOptions) = {} #### Returns Promise\ --- # @crawlee/types ## Index[**](#Index) ### References * [**Cookie](https://crawlee.dev/js/api/types.md#Cookie) * [**QueueOperationInfo](https://crawlee.dev/js/api/types.md#QueueOperationInfo) * [**StorageClient](https://crawlee.dev/js/api/types.md#StorageClient) ### Interfaces * [**BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) * [**BrowserLikeResponse](https://crawlee.dev/js/api/types/interface/BrowserLikeResponse.md) * [**Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md) * [**DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md) * [**DatasetClientListOptions](https://crawlee.dev/js/api/types/interface/DatasetClientListOptions.md) * [**DatasetClientUpdateOptions](https://crawlee.dev/js/api/types/interface/DatasetClientUpdateOptions.md) * [**DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) * [**DatasetCollectionClientOptions](https://crawlee.dev/js/api/types/interface/DatasetCollectionClientOptions.md) * [**DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md) * [**DatasetInfo](https://crawlee.dev/js/api/types/interface/DatasetInfo.md) * [**DatasetStats](https://crawlee.dev/js/api/types/interface/DatasetStats.md) * [**DeleteRequestLockOptions](https://crawlee.dev/js/api/types/interface/DeleteRequestLockOptions.md) * [**KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) * [**KeyValueStoreClientGetRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientGetRecordOptions.md) * [**KeyValueStoreClientListData](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListData.md) * [**KeyValueStoreClientListOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListOptions.md) * [**KeyValueStoreClientUpdateOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientUpdateOptions.md) * [**KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) * [**KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md) * [**KeyValueStoreItemData](https://crawlee.dev/js/api/types/interface/KeyValueStoreItemData.md) * [**KeyValueStoreRecord](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecord.md) * [**KeyValueStoreRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecordOptions.md) * [**KeyValueStoreStats](https://crawlee.dev/js/api/types/interface/KeyValueStoreStats.md) * [**ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md) * [**ListAndLockOptions](https://crawlee.dev/js/api/types/interface/ListAndLockOptions.md) * [**ListOptions](https://crawlee.dev/js/api/types/interface/ListOptions.md) * [**PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md) * [**ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md) * [**ProlongRequestLockOptions](https://crawlee.dev/js/api/types/interface/ProlongRequestLockOptions.md) * [**ProlongRequestLockResult](https://crawlee.dev/js/api/types/interface/ProlongRequestLockResult.md) * [**QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md) * [**RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) * 
[**RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) * [**RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) * [**RequestQueueHeadItem](https://crawlee.dev/js/api/types/interface/RequestQueueHeadItem.md) * [**RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md) * [**RequestQueueOptions](https://crawlee.dev/js/api/types/interface/RequestQueueOptions.md) * [**RequestQueueStats](https://crawlee.dev/js/api/types/interface/RequestQueueStats.md) * [**RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md) * [**SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) * [**UnprocessedRequest](https://crawlee.dev/js/api/types/interface/UnprocessedRequest.md) * [**UpdateRequestSchema](https://crawlee.dev/js/api/types/interface/UpdateRequestSchema.md) ### Type Aliases * [**AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) ## References[**](#References) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L3)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L7)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L323)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ## Type Aliases[**](<#Type Aliases>) ### [**](#AllowedHttpMethods)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/utility-types.ts#L10)AllowedHttpMethods **AllowedHttpMethods: GET | HEAD | POST | PUT | DELETE | TRACE | OPTIONS | CONNECT | PATCH --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/types ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/types ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/types # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/types ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/types # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) **Note:** Version bump only for package @crawlee/types ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/types ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/types ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Features[​](#features "Direct link to Features") * support `KVS.listKeys()` `prefix` and `collection` parameters ([#3001](https://github.com/apify/crawlee/issues/3001)) ([5c4726d](https://github.com/apify/crawlee/commit/5c4726df96e358a9bbf44a0cd2760e4e269f0fae)), closes [#2974](https://github.com/apify/crawlee/issues/2974) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/types ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/types ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/types ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * **core:** use short timeouts for periodic `KVS.setRecord` calls ([#2962](https://github.com/apify/crawlee/issues/2962)) ([d31d90e](https://github.com/apify/crawlee/commit/d31d90e5288ea80b3ed6ec4a75a4b8f87686a2c4)) ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/types ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/types ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/types # 
[3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * Simplified RequestQueueV2 implementation ([#2775](https://github.com/apify/crawlee/issues/2775)) ([d1a094a](https://github.com/apify/crawlee/commit/d1a094a47eaecbf367b222f9b8c14d7da5d3e03a)), closes [#2767](https://github.com/apify/crawlee/issues/2767) [#2700](https://github.com/apify/crawlee/issues/2700) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/types ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/types # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/types ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/types ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/types ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/types ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/types ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/types # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/types ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/types ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/types ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") **Note:** Version bump only for package @crawlee/types ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/types ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/types # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/types ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/types ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for 
package @crawlee/types # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/types ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/types ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/types # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-1 "Direct link to Features") * `KeyValueStore.recordExists()` ([#2339](https://github.com/apify/crawlee/issues/2339)) ([8507a65](https://github.com/apify/crawlee/commit/8507a65d1ad079f64c752a6ddb1d8fac9b494228)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/types ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/types ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/types # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/types ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/types ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/types # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/types ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/types ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/types ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/types ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") **Note:** Version bump only for package @crawlee/types ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/types ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") **Note:** Version bump only for package @crawlee/types ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/types ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump 
only for package @crawlee/types # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-2 "Direct link to Features") * **basic-crawler:** allow configuring the automatic status message ([#2001](https://github.com/apify/crawlee/issues/2001)) ([3eb4e4c](https://github.com/apify/crawlee/commit/3eb4e4c558b4bc0673fbff75b1db19c46004a1da)) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/types ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/types # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/types ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/types ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-3 "Direct link to Features") * RQv2 memory storage support ([#1874](https://github.com/apify/crawlee/issues/1874)) ([049486b](https://github.com/apify/crawlee/commit/049486b772cc2accd2d2d226d8c8726e5ab933a9)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") **Note:** Version bump only for package @crawlee/types # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * **MemoryStorage:** RequestQueue#handledRequestCount should update ([#1817](https://github.com/apify/crawlee/issues/1817)) ([a775e4a](https://github.com/apify/crawlee/commit/a775e4afea20d0b31492f44b90f61b6a903491b6)), closes [#1764](https://github.com/apify/crawlee/issues/1764) ### Features[​](#features-4 "Direct link to Features") * add basic support for `setStatusMessage` ([#1790](https://github.com/apify/crawlee/issues/1790)) ([c318980](https://github.com/apify/crawlee/commit/c318980ec11d211b1a5c9e6bdbe76198c5d895be)) * move the status message implementation to Crawlee, noop in storage ([#1808](https://github.com/apify/crawlee/issues/1808)) ([99c3fdc](https://github.com/apify/crawlee/commit/99c3fdc18030b7898e6b6d149d6d94fab7881f09)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/types ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/types # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Features[​](#features-5 "Direct link to Features") * **MemoryStorage:** read from fs if persistStorage is enabled, ram only otherwise ([#1761](https://github.com/apify/crawlee/issues/1761)) ([e903980](https://github.com/apify/crawlee/commit/e9039809a0c0af0bc086be1f1400d18aa45ae490)) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/types ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/types # 3.1.0 
(2022-10-13) **Note:** Version bump only for package @crawlee/types ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/types --- # BatchAddRequestsResult ## Index[**](#Index) ### Properties * [**processedRequests](#processedRequests) * [**unprocessedRequests](#unprocessedRequests) ## Properties[**](#Properties) ### [**](#processedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L291)processedRequests **processedRequests: [ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[] ### [**](#unprocessedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L292)unprocessedRequests **unprocessedRequests: [UnprocessedRequest](https://crawlee.dev/js/api/types/interface/UnprocessedRequest.md)\[] --- # BrowserLikeResponse ## Index[**](#Index) ### Methods * [**headers](#headers) * [**url](#url) ## Methods[**](#Methods) ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L63)headers * ****headers**(): Dictionary\ - #### Returns Dictionary\ ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L62)url * ****url**(): string - #### Returns string --- # Dataset ### Hierarchy * [DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md) * *Dataset* ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**createdAt](#createdAt) * [**id](#id) * [**itemCount](#itemCount) * [**modifiedAt](#modifiedAt) * [**name](#name) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L27)inheritedaccessedAt **accessedAt: Date Inherited from DatasetCollectionData.accessedAt ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L25)inheritedcreatedAt **createdAt: Date Inherited from DatasetCollectionData.createdAt ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L23)inheritedid **id: string Inherited from DatasetCollectionData.id ### [**](#itemCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L46)itemCount **itemCount: number ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L26)inheritedmodifiedAt **modifiedAt: Date Inherited from DatasetCollectionData.modifiedAt ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L24)optionalinheritedname **name? 
: string Inherited from DatasetCollectionData.name --- # DatasetClient \ ## Index[**](#Index) ### Methods * [**delete](#delete) * [**downloadItems](#downloadItems) * [**get](#get) * [**listItems](#listItems) * [**pushItems](#pushItems) * [**update](#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L87)delete * ****delete**(): Promise\ - #### Returns Promise\ ### [**](#downloadItems)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L88)downloadItems * ****downloadItems**(...args): Promise\> - #### Parameters * ##### rest...args: unknown\[] #### Returns Promise\> ### [**](#get)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L85)get * ****get**(): Promise\ - #### Returns Promise\ ### [**](#listItems)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L89)listItems * ****listItems**(options): Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)\> - #### Parameters * ##### optionaloptions: [DatasetClientListOptions](https://crawlee.dev/js/api/types/interface/DatasetClientListOptions.md) #### Returns Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)\> ### [**](#pushItems)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L90)pushItems * ****pushItems**(items): Promise\ - #### Parameters * ##### items: string | Data | string\[] | Data\[] #### Returns Promise\ ### [**](#update)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L86)update * ****update**(newFields): Promise\> - #### Parameters * ##### newFields: [DatasetClientUpdateOptions](https://crawlee.dev/js/api/types/interface/DatasetClientUpdateOptions.md) #### Returns Promise\> --- # DatasetClientListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L62)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L63)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L64)optionaloffset **offset? : number --- # DatasetClientUpdateOptions ## Index[**](#Index) ### Properties * [**name](#name) ## Properties[**](#Properties) ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L58)optionalname **name? : string --- # DatasetCollectionClient Dataset collection client. 
## Index[**](#Index) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L54)getOrCreate * ****getOrCreate**(name): Promise<[DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md)> - #### Parameters * ##### optionalname: string #### Returns Promise<[DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md)> ### [**](#list)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L53)list * ****list**(): Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md)>> - #### Returns Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md)>> --- # DatasetCollectionClientOptions ## Index[**](#Index) ### Properties * [**storageDir](#storageDir) ## Properties[**](#Properties) ### [**](#storageDir)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L19)storageDir **storageDir: string --- # DatasetCollectionData ### Hierarchy * *DatasetCollectionData* * [Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md) ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**createdAt](#createdAt) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L27)accessedAt **accessedAt: Date ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L25)createdAt **createdAt: Date ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L23)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L26)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L24)optionalname **name? : string --- # DatasetInfo ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**createdAt](#createdAt) * [**id](#id) * [**itemCount](#itemCount) * [**modifiedAt](#modifiedAt) * [**name](#name) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L72)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L74)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L75)optionalactRunId **actRunId? : string ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L70)createdAt **createdAt: Date ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L68)id **id: string ### [**](#itemCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L73)itemCount **itemCount: number ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L71)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L69)optionalname **name? 
: string --- # DatasetStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L80)optionaldeleteCount **deleteCount? : number ### [**](#readCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L78)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L81)optionalstorageBytes **storageBytes? : number ### [**](#writeCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L79)optionalwriteCount **writeCount? : number --- # DeleteRequestLockOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L250)optionalforefront **forefront? : boolean --- # KeyValueStoreClient Key-value Store client. ## Index[**](#Index) ### Methods * [**delete](#delete) * [**deleteRecord](#deleteRecord) * [**get](#get) * [**getRecord](#getRecord) * [**listKeys](#listKeys) * [**recordExists](#recordExists) * [**setRecord](#setRecord) * [**update](#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L168)delete * ****delete**(): Promise\ - #### Returns Promise\ ### [**](#deleteRecord)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L173)deleteRecord * ****deleteRecord**(key): Promise\ - #### Parameters * ##### key: string #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L166)get * ****get**(): Promise\ - #### Returns Promise\ ### [**](#getRecord)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L171)getRecord * ****getRecord**(key, options): Promise\ - #### Parameters * ##### key: string * ##### optionaloptions: [KeyValueStoreClientGetRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientGetRecordOptions.md) #### Returns Promise\ ### [**](#listKeys)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L169)listKeys * ****listKeys**(options): Promise<[KeyValueStoreClientListData](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListData.md)> - #### Parameters * ##### optionaloptions: [KeyValueStoreClientListOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListOptions.md) #### Returns Promise<[KeyValueStoreClientListData](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListData.md)> ### [**](#recordExists)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L170)recordExists * ****recordExists**(key): Promise\ - #### Parameters * ##### key: string #### Returns Promise\ ### [**](#setRecord)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L172)setRecord * ****setRecord**(record, options): Promise\ - #### Parameters * ##### record: [KeyValueStoreRecord](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecord.md) * ##### optionaloptions: [KeyValueStoreRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecordOptions.md) #### Returns Promise\ ### 
[**](#update)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L167)update * ****update**(newFields): Promise\> - #### Parameters * ##### newFields: [KeyValueStoreClientUpdateOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientUpdateOptions.md) #### Returns Promise\> --- # KeyValueStoreClientGetRecordOptions ## Index[**](#Index) ### Properties * [**buffer](#buffer) * [**stream](#stream) ## Properties[**](#Properties) ### [**](#buffer)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L158)optionalbuffer **buffer? : boolean ### [**](#stream)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L159)optionalstream **stream? : boolean --- # KeyValueStoreClientListData ## Index[**](#Index) ### Properties * [**count](#count) * [**exclusiveStartKey](#exclusiveStartKey) * [**isTruncated](#isTruncated) * [**items](#items) * [**limit](#limit) * [**nextExclusiveStartKey](#nextExclusiveStartKey) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L149)count **count: number ### [**](#exclusiveStartKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L151)optionalexclusiveStartKey **exclusiveStartKey? : string ### [**](#isTruncated)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L152)isTruncated **isTruncated: boolean ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L154)items **items: [KeyValueStoreItemData](https://crawlee.dev/js/api/types/interface/KeyValueStoreItemData.md)\[] ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L150)limit **limit: number ### [**](#nextExclusiveStartKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L153)optionalnextExclusiveStartKey **nextExclusiveStartKey? : string --- # KeyValueStoreClientListOptions ## Index[**](#Index) ### Properties * [**collection](#collection) * [**exclusiveStartKey](#exclusiveStartKey) * [**limit](#limit) * [**prefix](#prefix) ## Properties[**](#Properties) ### [**](#collection)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L139)optionalcollection **collection? : string ### [**](#exclusiveStartKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L138)optionalexclusiveStartKey **exclusiveStartKey? : string ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L137)optionallimit **limit? : number ### [**](#prefix)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L140)optionalprefix **prefix? : string --- # KeyValueStoreClientUpdateOptions ## Index[**](#Index) ### Properties * [**name](#name) ## Properties[**](#Properties) ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L133)optionalname **name? : string --- # KeyValueStoreCollectionClient Key-value store collection client. 
## Index[**](#Index) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L118)getOrCreate * ****getOrCreate**(name): Promise<[KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md)> - #### Parameters * ##### optionalname: string #### Returns Promise<[KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md)> ### [**](#list)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L117)list * ****list**(): Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md)>> - #### Returns Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md)>> --- # KeyValueStoreInfo ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**createdAt](#createdAt) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**stats](#stats) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L107)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L108)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L109)optionalactRunId **actRunId? : string ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L105)createdAt **createdAt: Date ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L102)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L106)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L103)optionalname **name? : string ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L110)optionalstats **stats? : [KeyValueStoreStats](https://crawlee.dev/js/api/types/interface/KeyValueStoreStats.md) ### [**](#userId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L104)optionaluserId **userId? : string --- # KeyValueStoreItemData ## Index[**](#Index) ### Properties * [**key](#key) * [**size](#size) ## Properties[**](#Properties) ### [**](#key)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L144)key **key: string ### [**](#size)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L145)size **size: number --- # KeyValueStoreRecord ## Index[**](#Index) ### Properties * [**contentType](#contentType) * [**key](#key) * [**value](#value) ## Properties[**](#Properties) ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L124)optionalcontentType **contentType? 
: string ### [**](#key)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L122)key **key: string ### [**](#value)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L123)value **value: any --- # KeyValueStoreRecordOptions ## Index[**](#Index) ### Properties * [**doNotRetryTimeouts](#doNotRetryTimeouts) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#doNotRetryTimeouts)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L129)optionaldoNotRetryTimeouts **doNotRetryTimeouts? : boolean ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L128)optionaltimeoutSecs **timeoutSecs? : number --- # KeyValueStoreStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**listCount](#listCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L96)optionaldeleteCount **deleteCount? : number ### [**](#listCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L97)optionallistCount **listCount? : number ### [**](#readCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L94)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L98)optionalstorageBytes **storageBytes? : number ### [**](#writeCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L95)optionalwriteCount **writeCount? : number --- # ListAndLockHeadResult ### Hierarchy * [QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md) * *ListAndLockHeadResult* ## Index[**](#Index) ### Properties * [**hadMultipleClients](#hadMultipleClients) * [**items](#items) * [**limit](#limit) * [**lockSecs](#lockSecs) * [**queueHasLockedRequests](#queueHasLockedRequests) * [**queueModifiedAt](#queueModifiedAt) ## Properties[**](#Properties) ### [**](#hadMultipleClients)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L220)optionalinheritedhadMultipleClients **hadMultipleClients? : boolean Inherited from QueueHead.hadMultipleClients ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L221)inheriteditems **items: [RequestQueueHeadItem](https://crawlee.dev/js/api/types/interface/RequestQueueHeadItem.md)\[] Inherited from QueueHead.items ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L218)inheritedlimit **limit: number Inherited from QueueHead.limit ### [**](#lockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L236)lockSecs **lockSecs: number ### [**](#queueHasLockedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L237)optionalqueueHasLockedRequests **queueHasLockedRequests? 
: boolean ### [**](#queueModifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L219)inheritedqueueModifiedAt **queueModifiedAt: Date Inherited from QueueHead.queueModifiedAt --- # ListAndLockOptions ### Hierarchy * [ListOptions](https://crawlee.dev/js/api/types/interface/ListOptions.md) * *ListAndLockOptions* ## Index[**](#Index) ### Properties * [**limit](#limit) * [**lockSecs](#lockSecs) ## Properties[**](#Properties) ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L228)optionalinheritedlimit **limit? : number = 100 Inherited from ListOptions.limit ### [**](#lockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L232)lockSecs **lockSecs: number --- # ListOptions ### Hierarchy * *ListOptions* * [ListAndLockOptions](https://crawlee.dev/js/api/types/interface/ListAndLockOptions.md) ## Index[**](#Index) ### Properties * [**limit](#limit) ## Properties[**](#Properties) ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L228)optionallimit **limit? : number = 100 --- # PaginatedList \ ## Index[**](#Index) ### Properties * [**count](#count) * [**desc](#desc) * [**items](#items) * [**limit](#limit) * [**offset](#offset) * [**total](#total) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L34)count **count: number Count of dataset entries returned in this set. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L40)optionaldesc **desc? : boolean Should the results be in descending order. ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L42)items **items: Data\[] Dataset entries based on chosen format parameter. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L38)limit **limit: number Maximum number of dataset entries requested. ### [**](#offset)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L36)offset **offset: number Position of the first returned entry in the dataset. ### [**](#total)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L32)total **total: number Total count of entries in the dataset. --- # ProcessedRequest ## Index[**](#Index) ### Properties * [**requestId](#requestId) * [**uniqueKey](#uniqueKey) * [**wasAlreadyHandled](#wasAlreadyHandled) * [**wasAlreadyPresent](#wasAlreadyPresent) ## Properties[**](#Properties) ### [**](#requestId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L279)requestId **requestId: string ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L278)uniqueKey **uniqueKey: string ### [**](#wasAlreadyHandled)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L281)wasAlreadyHandled **wasAlreadyHandled: boolean ### [**](#wasAlreadyPresent)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L280)wasAlreadyPresent **wasAlreadyPresent: boolean --- # ProlongRequestLockOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) * [**lockSecs](#lockSecs) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L242)optionalforefront **forefront? 
: boolean ### [**](#lockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L241)lockSecs **lockSecs: number --- # ProlongRequestLockResult ## Index[**](#Index) ### Properties * [**lockExpiresAt](#lockExpiresAt) ## Properties[**](#Properties) ### [**](#lockExpiresAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L246)lockExpiresAt **lockExpiresAt: Date --- # QueueHead ### Hierarchy * *QueueHead* * [ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md) ## Index[**](#Index) ### Properties * [**hadMultipleClients](#hadMultipleClients) * [**items](#items) * [**limit](#limit) * [**queueModifiedAt](#queueModifiedAt) ## Properties[**](#Properties) ### [**](#hadMultipleClients)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L220)optionalhadMultipleClients **hadMultipleClients? : boolean ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L221)items **items: [RequestQueueHeadItem](https://crawlee.dev/js/api/types/interface/RequestQueueHeadItem.md)\[] ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L218)limit **limit: number ### [**](#queueModifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L219)queueModifiedAt **queueModifiedAt: Date --- # RequestOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L254)optionalforefront **forefront? : boolean --- # RequestQueueClient ## Index[**](#Index) ### Methods * [**addRequest](#addRequest) * [**batchAddRequests](#batchAddRequests) * [**delete](#delete) * [**deleteRequest](#deleteRequest) * [**deleteRequestLock](#deleteRequestLock) * [**get](#get) * [**getRequest](#getRequest) * [**listAndLockHead](#listAndLockHead) * [**listHead](#listHead) * [**prolongRequestLock](#prolongRequestLock) * [**update](#update) * [**updateRequest](#updateRequest) ## Methods[**](#Methods) ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L300)addRequest * ****addRequest**(request, options): Promise<[QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md)> - #### Parameters * ##### request: [RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md) * ##### optionaloptions: [RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) #### Returns Promise<[QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md)> ### [**](#batchAddRequests)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L301)batchAddRequests * ****batchAddRequests**(requests, options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - #### Parameters * ##### requests: [RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md)\[] * ##### optionaloptions: [RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> ### [**](#delete)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L298)delete * ****delete**(): Promise\ - #### Returns Promise\ ### 
[**](#deleteRequest)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L304)deleteRequest * ****deleteRequest**(id): Promise\ - #### Parameters * ##### id: string #### Returns Promise\ ### [**](#deleteRequestLock)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L307)deleteRequestLock * ****deleteRequestLock**(id, options): Promise\ - #### Parameters * ##### id: string * ##### optionaloptions: [DeleteRequestLockOptions](https://crawlee.dev/js/api/types/interface/DeleteRequestLockOptions.md) #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L296)get * ****get**(): Promise\ - #### Returns Promise\ ### [**](#getRequest)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L302)getRequest * ****getRequest**(id): Promise\ - #### Parameters * ##### id: string #### Returns Promise\ ### [**](#listAndLockHead)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L305)listAndLockHead * ****listAndLockHead**(options): Promise<[ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md)> - #### Parameters * ##### options: [ListAndLockOptions](https://crawlee.dev/js/api/types/interface/ListAndLockOptions.md) #### Returns Promise<[ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md)> ### [**](#listHead)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L299)listHead * ****listHead**(options): Promise<[QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md)> - #### Parameters * ##### optionaloptions: [ListOptions](https://crawlee.dev/js/api/types/interface/ListOptions.md) #### Returns Promise<[QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md)> ### [**](#prolongRequestLock)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L306)prolongRequestLock * ****prolongRequestLock**(id, options): Promise<[ProlongRequestLockResult](https://crawlee.dev/js/api/types/interface/ProlongRequestLockResult.md)> - #### Parameters * ##### id: string * ##### options: [ProlongRequestLockOptions](https://crawlee.dev/js/api/types/interface/ProlongRequestLockOptions.md) #### Returns Promise<[ProlongRequestLockResult](https://crawlee.dev/js/api/types/interface/ProlongRequestLockResult.md)> ### [**](#update)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L297)update * ****update**(newFields): Promise\> - #### Parameters * ##### newFields: { name?: string } * ##### optionalname: string #### Returns Promise\> ### [**](#updateRequest)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L303)updateRequest * ****updateRequest**(request, options): Promise<[QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md)> - #### Parameters * ##### request: [UpdateRequestSchema](https://crawlee.dev/js/api/types/interface/UpdateRequestSchema.md) * ##### optionaloptions: [RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) #### Returns Promise<[QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md)> --- # RequestQueueCollectionClient Request queue collection client. 
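As with the other storage collection clients, this interface is implemented by concrete storage clients such as `MemoryStorage`. Below is a minimal sketch, assuming the `@crawlee/memory-storage` implementation, a hypothetical queue named `crawl-queue`, and an example URL; it also exercises the request-level `RequestQueueClient` methods documented above:

```
import { MemoryStorage } from '@crawlee/memory-storage';

const storageClient = new MemoryStorage();

// Create (or reuse) a named queue via the collection client.
const { id } = await storageClient.requestQueues().getOrCreate('crawl-queue');

// Work with individual requests via the RequestQueueClient.
const queue = storageClient.requestQueue(id);
await queue.addRequest({ url: 'https://example.com', uniqueKey: 'https://example.com' });

// Fetch the queue head; listAndLockHead() additionally locks the returned requests.
const head = await queue.listHead({ limit: 10 });
console.log(head.items.map((item) => item.url));
```

In everyday crawler code the queue is usually managed through `RequestQueue.open()` and the crawler's `addRequests()` helper rather than through this client directly.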
## Index[**](#Index) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L206)getOrCreate * ****getOrCreate**(name): Promise<[RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md)> - #### Parameters * ##### name: string #### Returns Promise<[RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md)> ### [**](#list)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L205)list * ****list**(): Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md)>> - #### Returns Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md)>> --- # RequestQueueHeadItem ## Index[**](#Index) ### Properties * [**id](#id) * [**method](#method) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) ## Properties[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L210)id **id: string ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L214)method **method: [AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) ### [**](#retryCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L211)retryCount **retryCount: number ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L212)uniqueKey **uniqueKey: string ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L213)url **url: string --- # RequestQueueInfo ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**createdAt](#createdAt) * [**expireAt](#expireAt) * [**hadMultipleClients](#hadMultipleClients) * [**handledRequestCount](#handledRequestCount) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**pendingRequestCount](#pendingRequestCount) * [**stats](#stats) * [**totalRequestCount](#totalRequestCount) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L190)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L195)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L196)optionalactRunId **actRunId? : string ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L188)createdAt **createdAt: Date ### [**](#expireAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L191)optionalexpireAt **expireAt? : string ### [**](#hadMultipleClients)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L197)optionalhadMultipleClients **hadMultipleClients? 
: boolean ### [**](#handledRequestCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L193)handledRequestCount **handledRequestCount: number ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L185)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L189)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L186)optionalname **name? : string ### [**](#pendingRequestCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L194)pendingRequestCount **pendingRequestCount: number ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L198)optionalstats **stats? : [RequestQueueStats](https://crawlee.dev/js/api/types/interface/RequestQueueStats.md) ### [**](#totalRequestCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L192)totalRequestCount **totalRequestCount: number ### [**](#userId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L187)optionaluserId **userId? : string --- # RequestQueueOptions ## Index[**](#Index) ### Properties * [**clientKey](#clientKey) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#clientKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L311)optionalclientKey **clientKey? : string ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L312)optionaltimeoutSecs **timeoutSecs? : number --- # RequestQueueStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**headItemReadCount](#headItemReadCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L179)optionaldeleteCount **deleteCount? : number ### [**](#headItemReadCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L180)optionalheadItemReadCount **headItemReadCount? : number ### [**](#readCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L177)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L181)optionalstorageBytes **storageBytes? : number ### [**](#writeCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L178)optionalwriteCount **writeCount? : number --- # RequestSchema ### Hierarchy * *RequestSchema* * [UpdateRequestSchema](https://crawlee.dev/js/api/types/interface/UpdateRequestSchema.md) ## Index[**](#Index) ### Properties * [**errorMessages](#errorMessages) * [**handledAt](#handledAt) * [**headers](#headers) * [**id](#id) * [**loadedUrl](#loadedUrl) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#errorMessages)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L266)optionalerrorMessages **errorMessages? : string\[] ### [**](#handledAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L269)optionalhandledAt **handledAt? 
: string ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L267)optionalheaders **headers? : Dictionary\ ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L259)optionalid **id? : string ### [**](#loadedUrl)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L270)optionalloadedUrl **loadedUrl? : string ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L262)optionalmethod **method? : [AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) ### [**](#noRetry)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L264)optionalnoRetry **noRetry? : boolean ### [**](#payload)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L263)optionalpayload **payload? : string ### [**](#retryCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L265)optionalretryCount **retryCount? : number ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L261)uniqueKey **uniqueKey: string ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L260)url **url: string ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L268)optionaluserData **userData? : Dictionary --- # SetStatusMessageOptions ## Index[**](#Index) ### Properties * [**isStatusMessageTerminal](#isStatusMessageTerminal) * [**level](#level) ## Properties[**](#Properties) ### [**](#isStatusMessageTerminal)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L316)optionalisStatusMessageTerminal **isStatusMessageTerminal? : boolean ### [**](#level)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L317)optionallevel **level? : DEBUG | INFO | WARNING | ERROR --- # UnprocessedRequest ## Index[**](#Index) ### Properties * [**method](#method) * [**uniqueKey](#uniqueKey) * [**url](#url) ## Properties[**](#Properties) ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L287)optionalmethod **method? : [AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L285)uniqueKey **uniqueKey: string ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L286)url **url: string --- # UpdateRequestSchema ### Hierarchy * [RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md) * *UpdateRequestSchema* ## Index[**](#Index) ### Properties * [**errorMessages](#errorMessages) * [**handledAt](#handledAt) * [**headers](#headers) * [**id](#id) * [**loadedUrl](#loadedUrl) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#errorMessages)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L266)optionalinheritederrorMessages **errorMessages? : string\[] Inherited from RequestSchema.errorMessages ### [**](#handledAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L269)optionalinheritedhandledAt **handledAt? 
: string Inherited from RequestSchema.handledAt ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L267)optionalinheritedheaders **headers? : Dictionary\ Inherited from RequestSchema.headers ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L274)id **id: string Overrides RequestSchema.id ### [**](#loadedUrl)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L270)optionalinheritedloadedUrl **loadedUrl? : string Inherited from RequestSchema.loadedUrl ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L262)optionalinheritedmethod **method? : [AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) Inherited from RequestSchema.method ### [**](#noRetry)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L264)optionalinheritednoRetry **noRetry? : boolean Inherited from RequestSchema.noRetry ### [**](#payload)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L263)optionalinheritedpayload **payload? : string Inherited from RequestSchema.payload ### [**](#retryCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L265)optionalinheritedretryCount **retryCount? : number Inherited from RequestSchema.retryCount ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L261)inheriteduniqueKey **uniqueKey: string Inherited from RequestSchema.uniqueKey ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L260)inheritedurl **url: string Inherited from RequestSchema.url ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L268)optionalinheriteduserData **userData? 
: Dictionary Inherited from RequestSchema.userData --- # @crawlee/utils ## Index[**](#Index) ### References * [**RobotsFile](https://crawlee.dev/js/api/utils.md#RobotsFile) * [**tryAbsoluteURL](https://crawlee.dev/js/api/utils.md#tryAbsoluteURL) ### Namespaces * [**social](https://crawlee.dev/js/api/utils/namespace/social.md) ### Classes * [**RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) * [**Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md) ### Interfaces * [**DownloadListOfUrlsOptions](https://crawlee.dev/js/api/utils/interface/DownloadListOfUrlsOptions.md) * [**ExtractUrlsOptions](https://crawlee.dev/js/api/utils/interface/ExtractUrlsOptions.md) * [**MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md) * [**OpenGraphProperty](https://crawlee.dev/js/api/utils/interface/OpenGraphProperty.md) * [**ParseSitemapOptions](https://crawlee.dev/js/api/utils/interface/ParseSitemapOptions.md) ### Type Aliases * [**CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot) * [**SearchParams](https://crawlee.dev/js/api/utils.md#SearchParams) * [**SitemapUrl](https://crawlee.dev/js/api/utils.md#SitemapUrl) ### Variables * [**CLOUDFLARE\_RETRY\_CSS\_SELECTORS](https://crawlee.dev/js/api/utils.md#CLOUDFLARE_RETRY_CSS_SELECTORS) * [**RETRY\_CSS\_SELECTORS](https://crawlee.dev/js/api/utils.md#RETRY_CSS_SELECTORS) * [**ROTATE\_PROXY\_ERRORS](https://crawlee.dev/js/api/utils.md#ROTATE_PROXY_ERRORS) * [**URL\_NO\_COMMAS\_REGEX](https://crawlee.dev/js/api/utils.md#URL_NO_COMMAS_REGEX) * [**URL\_WITH\_COMMAS\_REGEX](https://crawlee.dev/js/api/utils.md#URL_WITH_COMMAS_REGEX) ### Functions * [**chunk](https://crawlee.dev/js/api/utils/function/chunk.md) * [**createRequestDebugInfo](https://crawlee.dev/js/api/utils/function/createRequestDebugInfo.md) * [**downloadListOfUrls](https://crawlee.dev/js/api/utils/function/downloadListOfUrls.md) * [**extractUrls](https://crawlee.dev/js/api/utils/function/extractUrls.md) * [**extractUrlsFromCheerio](https://crawlee.dev/js/api/utils/function/extractUrlsFromCheerio.md) * [**getCgroupsVersion](https://crawlee.dev/js/api/utils/function/getCgroupsVersion.md) * [**getMemoryInfo](https://crawlee.dev/js/api/utils/function/getMemoryInfo.md) * [**getObjectType](https://crawlee.dev/js/api/utils/function/getObjectType.md) * [**gotScraping](https://crawlee.dev/js/api/utils/function/gotScraping.md) * [**htmlToText](https://crawlee.dev/js/api/utils/function/htmlToText.md) * [**isContainerized](https://crawlee.dev/js/api/utils/function/isContainerized.md) * [**isDocker](https://crawlee.dev/js/api/utils/function/isDocker.md) * [**isLambda](https://crawlee.dev/js/api/utils/function/isLambda.md) * [**parseOpenGraph](https://crawlee.dev/js/api/utils/function/parseOpenGraph.md) * [**parseSitemap](https://crawlee.dev/js/api/utils/function/parseSitemap.md) * [**sleep](https://crawlee.dev/js/api/utils/function/sleep.md) ## References[**](#References) ### [**](#RobotsFile)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L122)RobotsFile Renames and re-exports [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L96)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ## Type Aliases[**](<#Type Aliases>) ### 
[**](#CheerioRoot)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/cheerio.ts#L7)CheerioRoot **CheerioRoot: ReturnType\ ### [**](#SearchParams)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/url.ts#L1)SearchParams **SearchParams: string | URLSearchParams | Record\ ### [**](#SitemapUrl)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L21)SitemapUrl **SitemapUrl: SitemapUrlData & { originSitemapUrl: string } ## Variables[**](#Variables) ### [**](#CLOUDFLARE_RETRY_CSS_SELECTORS)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/blocked.ts#L1)constCLOUDFLARE\_RETRY\_CSS\_SELECTORS **CLOUDFLARE\_RETRY\_CSS\_SELECTORS: string\[] = ... ### [**](#RETRY_CSS_SELECTORS)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/blocked.ts#L6)constRETRY\_CSS\_SELECTORS **RETRY\_CSS\_SELECTORS: string\[] = ... CSS selectors for elements that should trigger a retry, as the crawler is likely getting blocked. ### [**](#ROTATE_PROXY_ERRORS)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/blocked.ts#L15)constROTATE\_PROXY\_ERRORS **ROTATE\_PROXY\_ERRORS: string\[] = ... Content of proxy errors that should trigger a retry, as the proxy is likely getting blocked / is malfunctioning. ### [**](#URL_NO_COMMAS_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/general.ts#L8)constURL\_NO\_COMMAS\_REGEX **URL\_NO\_COMMAS\_REGEX: RegExp = ... Default regular expression to match URLs in a string that may be plain text, JSON, CSV or other. It supports common URL characters and does not support URLs containing commas or spaces. The URLs also may contain Unicode letters (not symbols). ### [**](#URL_WITH_COMMAS_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/general.ts#L15)constURL\_WITH\_COMMAS\_REGEX **URL\_WITH\_COMMAS\_REGEX: RegExp = ... Regular expression that, in addition to the default regular expression `URL_NO_COMMAS_REGEX`, supports matching commas in URL path and query. Note, however, that this may prevent parsing URLs from comma delimited lists, or the URLs may become malformed. --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/utils ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") ### Features[​](#features "Direct link to Features") * export cheerio types in all crawler packages ([#3204](https://github.com/apify/crawlee/issues/3204)) ([f05790b](https://github.com/apify/crawlee/commit/f05790b8c4e77056fd3cdbdd6d6abe3186ddf104)) ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/utils # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/utils ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/utils # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * validation of iterables when adding requests to the queue ([#3091](https://github.com/apify/crawlee/issues/3091)) ([529a1dd](https://github.com/apify/crawlee/commit/529a1dd57278efef4fb2013e79a09fd1bc8594a5)), closes [#3063](https://github.com/apify/crawlee/issues/3063) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/utils ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * Do not log 'malformed sitemap content' on network errors in `Sitemap.tryCommonNames` ([#3015](https://github.com/apify/crawlee/issues/3015)) ([64a090f](https://github.com/apify/crawlee/commit/64a090ffbba5c69730ec0616e415a1eadf4bc7b3)), closes [#2884](https://github.com/apify/crawlee/issues/2884) ### Features[​](#features-1 "Direct link to Features") * Accept (Async)Iterables in `addRequests` methods ([#3013](https://github.com/apify/crawlee/issues/3013)) ([a4ab748](https://github.com/apify/crawlee/commit/a4ab74852c3c60bdbc96035f54b16d125220f699)), closes [#2980](https://github.com/apify/crawlee/issues/2980) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * Persist rendering type detection results in `AdaptivePlaywrightCrawler` ([#2987](https://github.com/apify/crawlee/issues/2987)) ([76431ba](https://github.com/apify/crawlee/commit/76431badf8a55892303d9b53fe23e029fad9cb18)), closes [#2899](https://github.com/apify/crawlee/issues/2899) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/utils ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/utils ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** 
Version bump only for package @crawlee/utils ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * **social:** extract emails from each text node separately ([#2952](https://github.com/apify/crawlee/issues/2952)) ([799afc1](https://github.com/apify/crawlee/commit/799afc1dbb6843efa9d585823674ea75b9b352ea)) ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/utils ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/utils ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-2 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Features[​](#features-3 "Direct link to Features") * improved cross platform metric collection ([#2834](https://github.com/apify/crawlee/issues/2834)) ([e41b2f7](https://github.com/apify/crawlee/commit/e41b2f744513dd80aa05336eedfa1c08c54d3832)), closes [#2771](https://github.com/apify/crawlee/issues/2771) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/utils ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * **social:** support new URL formats for Facebook, YouTube and X ([#2758](https://github.com/apify/crawlee/issues/2758)) ([4c95847](https://github.com/apify/crawlee/commit/4c95847d5cedd6514620ccab31d5b242ba76de80)), closes [#525](https://github.com/apify/crawlee/issues/525) # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * `.trim()` urls from pretty-printed sitemap.xml files ([#2709](https://github.com/apify/crawlee/issues/2709)) ([802a6fe](https://github.com/apify/crawlee/commit/802a6fea7b2125e2b36d740fc2d5d131de5d53ed)), closes [#2698](https://github.com/apify/crawlee/issues/2698) ### Features[​](#features-4 "Direct link to Features") * allow using other HTTP clients ([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/utils ## 
[3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * `SitemapRequestList.teardown()` doesn't break `persistState` calls ([#2673](https://github.com/apify/crawlee/issues/2673)) ([fb2c5cd](https://github.com/apify/crawlee/commit/fb2c5cdaa47e2d3a91ade726cfba3091917a0137)), closes [/github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap\_request\_list.ts#L446](https://github.com//github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap_request_list.ts/issues/L446) [#2672](https://github.com/apify/crawlee/issues/2672) ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * improve `FACEBOOK_REGEX` to match older style page URLs ([#2650](https://github.com/apify/crawlee/issues/2650)) ([a005e69](https://github.com/apify/crawlee/commit/a005e699682cbf4bb2e48ff92cf2bbf3e0d2be26)), closes [#2216](https://github.com/apify/crawlee/issues/2216) ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * use namespace imports for cheerio to be compatible with v1 ([#2641](https://github.com/apify/crawlee/issues/2641)) ([f48296f](https://github.com/apify/crawlee/commit/f48296f6cba7b81fe102d4b874505c27f93d9fc1)) ### Features[​](#features-5 "Direct link to Features") * resilient sitemap loading ([#2619](https://github.com/apify/crawlee/issues/2619)) ([1dd7660](https://github.com/apify/crawlee/commit/1dd76601e03de4541964116b3a77376e233ea22b)) ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * use `getHTML` in the shadow root expansion ([#2587](https://github.com/apify/crawlee/issues/2587)) ([a244d62](https://github.com/apify/crawlee/commit/a244d62cca03d628677eca8a5adcf41e33c51dee)), closes [#2583](https://github.com/apify/crawlee/issues/2583) # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-6 "Direct link to Features") * Sitemap-based request list implementation ([#2498](https://github.com/apify/crawlee/issues/2498)) ([7bf8f0b](https://github.com/apify/crawlee/commit/7bf8f0bcd4cc81e02c7cc60e82dfe7a0cdd80938)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/utils ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/utils ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * respect implicit router when no `requestHandler` is provided in `AdaptiveCrawler` ([#2518](https://github.com/apify/crawlee/issues/2518)) ([31083aa](https://github.com/apify/crawlee/commit/31083aa27ddd51827f73c7ac4290379ec7a81283)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) 
(2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * Autodetect sitemap filetype from content ([#2497](https://github.com/apify/crawlee/issues/2497)) ([62a9f40](https://github.com/apify/crawlee/commit/62a9f4036dba92d07547af489ac8b6c7974faa6f)), closes [#2461](https://github.com/apify/crawlee/issues/2461) ### Features[​](#features-7 "Direct link to Features") * Loading sitemaps from string ([#2496](https://github.com/apify/crawlee/issues/2496)) ([38ed0d6](https://github.com/apify/crawlee/commit/38ed0d6ad90a868df9c02632334fec8db9ef29a0)), closes [#2460](https://github.com/apify/crawlee/issues/2460) ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") ### Bug Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * adjust `URL_NO_COMMAS_REGEX` regexp to allow single character hostnames ([#2492](https://github.com/apify/crawlee/issues/2492)) ([ec802e8](https://github.com/apify/crawlee/commit/ec802e85f54022616e5bdcc1a6fd1bd43e1b3ace)), closes [#2487](https://github.com/apify/crawlee/issues/2487) # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * malformed sitemap url when sitemap index child contains querystring ([#2430](https://github.com/apify/crawlee/issues/2430)) ([e4cd41c](https://github.com/apify/crawlee/commit/e4cd41c49999af270fbe2476a61d92c8e3502463)) * return true when robots.isAllowed returns undefined ([#2439](https://github.com/apify/crawlee/issues/2439)) ([6f541f8](https://github.com/apify/crawlee/commit/6f541f8c4ea9b1e94eb506383019397676fd79fe)), closes [#2437](https://github.com/apify/crawlee/issues/2437) * sitemap `content-type` check breaks on `content-type` parameters ([#2442](https://github.com/apify/crawlee/issues/2442)) ([db7d372](https://github.com/apify/crawlee/commit/db7d37256a49820e3e584165fff42377042ec258)) ### Features[​](#features-8 "Direct link to Features") * implement ErrorSnapshotter for error context capture ([#2332](https://github.com/apify/crawlee/issues/2332)) ([e861dfd](https://github.com/apify/crawlee/commit/e861dfdb451ae32fb1e0c7749c6b59744654b303)), closes [#2280](https://github.com/apify/crawlee/issues/2280) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") ### Features[​](#features-9 "Direct link to Features") * **sitemap:** Support CDATA in sitemaps ([#2424](https://github.com/apify/crawlee/issues/2424)) ([635f046](https://github.com/apify/crawlee/commit/635f046b7933e0ad1b0ee627a22a9adaf21847d3)) ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/utils # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Bug Fixes[​](#bug-fixes-15 "Direct link to Bug Fixes") * sitemaps support `application/xml` ([#2408](https://github.com/apify/crawlee/issues/2408)) ([cbcf47a](https://github.com/apify/crawlee/commit/cbcf47a7b991a8b88a6c2a46f3684444d776fcdd)) ### Features[​](#features-10 "Direct link to Features") * expand #shadow-root elements automatically in `parseWithCheerio` helper ([#2396](https://github.com/apify/crawlee/issues/2396)) ([a05b3a9](https://github.com/apify/crawlee/commit/a05b3a93a9b57926b353df0e79d846b5024c42ac)) ## 
[3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") ### Bug Fixes[​](#bug-fixes-16 "Direct link to Bug Fixes") * correctly report gzip decompression errors ([#2368](https://github.com/apify/crawlee/issues/2368)) ([84a2f17](https://github.com/apify/crawlee/commit/84a2f1733033bf247b2cede3f1728e75bf2c8ff9)) ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/utils # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-11 "Direct link to Features") * add Sitemap.tryCommonNames to check well known sitemap locations ([#2311](https://github.com/apify/crawlee/issues/2311)) ([85589f1](https://github.com/apify/crawlee/commit/85589f167196ac49c0cc10664ab3e9e5595208ed)), closes [#2307](https://github.com/apify/crawlee/issues/2307) * **core:** add `userAgent` parameter to `RobotsFile.isAllowed()` + `RobotsFile.from()` helper ([#2338](https://github.com/apify/crawlee/issues/2338)) ([343c159](https://github.com/apify/crawlee/commit/343c159f20546a2006db33da4674e6ffd77db572)) * Support plain-text sitemap files (sitemap.txt) ([#2315](https://github.com/apify/crawlee/issues/2315)) ([0bee7da](https://github.com/apify/crawlee/commit/0bee7daf9509fe61c8d83799e706f0bb030257ec)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") ### Bug Fixes[​](#bug-fixes-17 "Direct link to Bug Fixes") * pass on an invisible CF turnstile ([#2277](https://github.com/apify/crawlee/issues/2277)) ([d8734e7](https://github.com/apify/crawlee/commit/d8734e765238115d9cba6dda9c649ad8573890d8)), closes [#2256](https://github.com/apify/crawlee/issues/2256) ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/utils ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") ### Bug Fixes[​](#bug-fixes-18 "Direct link to Bug Fixes") * ES2022 build compatibility and move to NodeNext for module ([#2258](https://github.com/apify/crawlee/issues/2258)) ([7fe1e68](https://github.com/apify/crawlee/commit/7fe1e685904660c8446aafdf739fd1212684b48c)), closes [#2257](https://github.com/apify/crawlee/issues/2257) # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-19 "Direct link to Bug Fixes") * `retryOnBlocked` doesn't override the blocked HTTP codes ([#2243](https://github.com/apify/crawlee/issues/2243)) ([81672c3](https://github.com/apify/crawlee/commit/81672c3d1db1dcdcffb868de5740addff82cf112)) ### Features[​](#features-12 "Direct link to Features") * robots.txt and sitemap.xml utils ([#2214](https://github.com/apify/crawlee/issues/2214)) ([fdfec4f](https://github.com/apify/crawlee/commit/fdfec4f4d0a0f925b49015d2d63932c4a82555ba)), closes [#2187](https://github.com/apify/crawlee/issues/2187) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/utils ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package 
@crawlee/utils # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Features[​](#features-13 "Direct link to Features") * got-scraping v4 ([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") ### Bug Fixes[​](#bug-fixes-20 "Direct link to Bug Fixes") * refactor `extractUrls` to split the text line by line first ([#2122](https://github.com/apify/crawlee/issues/2122)) ([7265cd7](https://github.com/apify/crawlee/commit/7265cd7148bb4889d60434d671f153387fb5a4dd)) ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/utils ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") ### Features[​](#features-14 "Direct link to Features") * add incapsula iframe selector to the blocked list ([#2111](https://github.com/apify/crawlee/issues/2111)) ([2b17d8a](https://github.com/apify/crawlee/commit/2b17d8a797dec2824a0063792aa7bd3fce8dccae)), closes [apify/store-website-content-crawler#154](https://github.com/apify/store-website-content-crawler/issues/154) ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") **Note:** Version bump only for package @crawlee/utils ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/utils ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-21 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/utils ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/utils # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-15 "Direct link to Features") * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Features[​](#features-16 "Direct link to Features") * retryOnBlocked detects blocked webpage ([#1956](https://github.com/apify/crawlee/issues/1956)) ([766fa9b](https://github.com/apify/crawlee/commit/766fa9b88029e9243a7427075384c1abe85c70c8)) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package 
@crawlee/utils # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/utils ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/utils ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") **Note:** Version bump only for package @crawlee/utils ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-22 "Direct link to Bug Fixes") * **jsdom:** delay closing of the window and add some polyfills ([2e81618](https://github.com/apify/crawlee/commit/2e81618afb5f3890495e3e5fcfa037eb3319edc9)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-23 "Direct link to Bug Fixes") * add `proxyUrl` to `DownloadListOfUrlsOptions` ([779be1e](https://github.com/apify/crawlee/commit/779be1e4f29dff191d02e623eefb1bd5650c14ad)), closes [#1780](https://github.com/apify/crawlee/issues/1780) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/utils ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/utils # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-24 "Direct link to Bug Fixes") * **utils:** add missing dependency on `ow` ([bf0e03c](https://github.com/apify/crawlee/commit/bf0e03cc6ddc103c9337de5cd8dce9bc86c369a3)), closes [#1716](https://github.com/apify/crawlee/issues/1716) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/utils ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/utils # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/utils ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/utils --- # RobotsTxtFile Loads and queries information from a [robots.txt file](https://en.wikipedia.org/wiki/Robots.txt). **Example usage:** ``` // Load the robots.txt file const robots = await RobotsTxtFile.find('https://crawlee.dev/js/docs/introduction/first-crawler'); // Check if a URL should be crawled according to robots.txt const url = 'https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler'; if (robots.isAllowed(url)) { await crawler.addRequests([url]); } // Enqueue all links in the sitemap(s) await crawler.addRequests(await robots.parseUrlsFromSitemaps()); ``` ## Index[**](#Index) ### Methods * [**getSitemaps](#getSitemaps) * [**isAllowed](#isAllowed) * [**parseSitemaps](#parseSitemaps) * [**parseUrlsFromSitemaps](#parseUrlsFromSitemaps) * [**find](#find) * [**from](#from) ## Methods[**](#Methods) ### [**](#getSitemaps)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L102)getSitemaps * ****getSitemaps**(): string\[] - Get URLs of sitemaps referenced in the robots file. 
*** #### Returns string\[] ### [**](#isAllowed)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L95)isAllowed * ****isAllowed**(url, userAgent): boolean - Check if a URL should be crawled by robots. *** #### Parameters * ##### url: string the URL to check against the rules in robots.txt * ##### optionaluserAgent: string = '\*' relevant user agent, defaults to `*` #### Returns boolean ### [**](#parseSitemaps)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L109)parseSitemaps * ****parseSitemaps**(): Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> - Parse all the sitemaps referenced in the robots file. *** #### Returns Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> ### [**](#parseUrlsFromSitemaps)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L116)parseUrlsFromSitemaps * ****parseUrlsFromSitemaps**(): Promise\<string\[]> - Get all URLs from all the sitemaps referenced in the robots file. A shorthand for `(await robots.parseSitemaps()).urls`. *** #### Returns Promise\<string\[]> ### [**](#find)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L40)staticfind * ****find**(url, proxyUrl): Promise<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md)> - Determine the location of a robots.txt file for a URL and fetch it. *** #### Parameters * ##### url: string the URL to fetch robots.txt for * ##### optionalproxyUrl: string a proxy to be used for fetching the robots.txt file #### Returns Promise<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md)> ### [**](#from)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L54)staticfrom * ****from**(url, content, proxyUrl): [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) - Allows providing the URL and robots.txt content explicitly instead of loading it from the target site. *** #### Parameters * ##### url: string the URL of the robots.txt file * ##### content: string contents of robots.txt * ##### optionalproxyUrl: string a proxy to be used for fetching the robots.txt file #### Returns [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) --- # Sitemap Loads one or more sitemaps from given URLs, following references in sitemap index files, and exposes the contained URLs.
**Example usage:** ``` // Load a sitemap const sitemap = await Sitemap.load(['https://example.com/sitemap.xml', 'https://example.com/sitemap_2.xml.gz']); // Enqueue all the contained URLs (including those from sub-sitemaps from sitemap indexes) await crawler.addRequests(sitemap.urls); ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**urls](#urls) ### Methods * [**fromXmlString](#fromXmlString) * [**load](#load) * [**tryCommonNames](#tryCommonNames) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L372)constructor * ****new Sitemap**(urls): [Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md) - #### Parameters * ##### urls: string\[] #### Returns [Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md) ## Properties[**](#Properties) ### [**](#urls)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L372)readonlyurls **urls: string\[] ## Methods[**](#Methods) ### [**](#fromXmlString)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L417)staticfromXmlString * ****fromXmlString**(content, proxyUrl): Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> - Parse XML sitemap content from a string and return URLs of referenced pages. If the sitemap references other sitemaps, they will be loaded via HTTP. *** #### Parameters * ##### content: string XML sitemap content * ##### optionalproxyUrl: string URL of a proxy to be used for fetching sitemap contents #### Returns Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> ### [**](#load)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L400)staticload * ****load**(urls, proxyUrl, parseSitemapOptions): Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> - Fetch sitemap content from given URL or URLs and return URLs of referenced pages. *** #### Parameters * ##### urls: string | string\[] sitemap URL(s) * ##### optionalproxyUrl: string URL of a proxy to be used for fetching sitemap contents * ##### optionalparseSitemapOptions: [ParseSitemapOptions](https://crawlee.dev/js/api/utils/interface/ParseSitemapOptions.md) #### Returns Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> ### [**](#tryCommonNames)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L380)statictryCommonNames * ****tryCommonNames**(url, proxyUrl): Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> - Try to load sitemap from the most common locations - `/sitemap.xml` and `/sitemap.txt`. For loading based on `Sitemap` entries in `robots.txt`, the [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) class should be used. *** #### Parameters * ##### url: string The domain URL to fetch the sitemap for. * ##### optionalproxyUrl: string A proxy to be used for fetching the sitemap file. #### Returns Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> --- # chunk ### Callable * ****chunk**\(array, chunkSize): T\[]\[] *** * #### Parameters * ##### array: readonly T\[] * ##### chunkSize: number #### Returns T\[]\[] --- # createRequestDebugInfo ### Callable * ****createRequestDebugInfo**(request, response, additionalFields): Dictionary *** * Creates a standardized debug info from request and response. 
This info is usually added to dataset under the hidden `#debug` field. *** #### Parameters * ##### request: Request\ [Request](https://sdk.apify.com/docs/api/request) object. * ##### optionalresponse: IncomingMessage | Partial\ = {} Puppeteer [`Response`](https://pptr.dev/#?product=Puppeteer\&version=v1.11.0\&show=api-class-response) or NodeJS [`http.IncomingMessage`](https://nodejs.org/api/http.html#http_class_http_serverresponse). * ##### optionaladditionalFields: Dictionary = {} Object containing additional fields to be added. #### Returns Dictionary --- # downloadListOfUrls ### Callable * ****downloadListOfUrls**(options): Promise\ *** * Returns a promise that resolves to an array of urls parsed from the resource available at the provided url. Optionally, custom regular expression and encoding may be provided. *** #### Parameters * ##### options: [DownloadListOfUrlsOptions](https://crawlee.dev/js/api/utils/interface/DownloadListOfUrlsOptions.md) #### Returns Promise\ --- # extractUrls ### Callable * ****extractUrls**(options): string\[] *** * Collects all URLs in an arbitrary string to an array, optionally using a custom regular expression. *** #### Parameters * ##### options: [ExtractUrlsOptions](https://crawlee.dev/js/api/utils/interface/ExtractUrlsOptions.md) #### Returns string\[] --- # extractUrlsFromCheerio ### Callable * ****extractUrlsFromCheerio**($, selector, baseUrl): string\[] *** * Extracts URLs from a given Cheerio object. * **@throws** when a relative URL is encountered with no baseUrl set *** #### Parameters * ##### $: CheerioAPI the Cheerio object to extract URLs from * ##### selector: string = 'a' a CSS selector for matching link elements * ##### baseUrl: string = '' a URL for resolving relative links #### Returns string\[] An array of absolute URLs --- # getCgroupsVersion ### Callable * ****getCgroupsVersion**(forceReset): Promise\ *** * gets the cgroup version by checking for a file at /sys/fs/cgroup/memory *** #### Parameters * ##### optionalforceReset: boolean #### Returns Promise\ "V1" or "V2" for the version of cgroup or null if cgroup is not found. --- # getMemoryInfo ### Callable * ****getMemoryInfo**(): Promise<[MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md)> *** * Returns memory statistics of the process and the system, see [MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md). If the process runs inside of Docker, the `getMemoryInfo` gets container memory limits, otherwise it gets system memory limits. Beware that the function is quite inefficient because it spawns a new process. Therefore you shouldn't call it too often, like more than once per second. 
*** #### Returns Promise<[MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md)> --- # getObjectType ### Callable * ****getObjectType**(value): string *** * #### Parameters * ##### value: unknown #### Returns string --- # gotScraping ### Callable * ****gotScraping**(url, options): CancelableRequest\> * ****gotScraping**\(url, options): CancelableRequest\> * ****gotScraping**(url, options): CancelableRequest\>> * ****gotScraping**(url, options): CancelableRequest\ * ****gotScraping**(options): CancelableRequest\> * ****gotScraping**\(options): CancelableRequest\> * ****gotScraping**(options): CancelableRequest\>> * ****gotScraping**(options): CancelableRequest\ * ****gotScraping**(url, options): CancelableRequest\ * ****gotScraping**\(url, options): CancelableRequest\ * ****gotScraping**(url, options): CancelableRequest\> * ****gotScraping**(options): CancelableRequest\ * ****gotScraping**\(options): CancelableRequest\ * ****gotScraping**(options): CancelableRequest\> * ****gotScraping**(url, options): Request * ****gotScraping**(options): Request * ****gotScraping**(url, options): Request | CancelableRequest\ * ****gotScraping**(options): Request | CancelableRequest\ * ****gotScraping**(url, options, defaults): Request | CancelableRequest\ *** * #### Parameters * ##### url: string | URL * ##### optionaloptions: ExtendedOptionsOfTextResponseBody #### Returns CancelableRequest\> ## Index[**](#Index) ### Properties * [**defaults](#defaults) * [**delete](#delete) * [**extend](#extend) * [**get](#get) * [**head](#head) * [**paginate](#paginate) * [**patch](#patch) * [**post](#post) * [**put](#put) * [**stream](#stream) ## Properties[**](#Properties) ### [**](#defaults)[**](https://undefined/apify/crawlee/blob/master/node_modules/got-scraping/src/index.d.ts#L224)externaldefaults **defaults: InstanceDefaults ### [**](#delete)delete **delete: ExtendedGotRequestFunction ### [**](#extend)[**](https://undefined/apify/crawlee/blob/master/node_modules/got-scraping/src/index.d.ts#L225)externalextend **extend: (...instancesOrOptions) => GotScraping #### Type declaration * * **(...instancesOrOptions): GotScraping - #### Parameters * ##### externalrest...instancesOrOptions: (GotScraping | ExtendedExtendOptions)\[] #### Returns GotScraping ### [**](#get)get **get: ExtendedGotRequestFunction ### [**](#head)head **head: ExtendedGotRequestFunction ### [**](#paginate)[**](https://undefined/apify/crawlee/blob/master/node_modules/got-scraping/src/index.d.ts#L223)externalpaginate **paginate: ExtendedGotPaginate ### [**](#patch)patch **patch: ExtendedGotRequestFunction ### [**](#post)post **post: ExtendedGotRequestFunction ### [**](#put)put **put: ExtendedGotRequestFunction ### [**](#stream)[**](https://undefined/apify/crawlee/blob/master/node_modules/got-scraping/src/index.d.ts#L222)externalstream **stream: ExtendedGotStream --- # htmlToText ### Callable * ****htmlToText**(htmlOrCheerioElement): string *** * The function converts a HTML document to a plain text. The plain text generated by the function is similar to a text captured by pressing Ctrl+A and Ctrl+C on a page when loaded in a web browser. The function doesn't aspire to preserve the formatting or to be perfectly correct with respect to HTML specifications. However, it attempts to generate newlines and whitespaces in and around HTML elements to avoid merging distinct parts of text and thus enable extraction of data from the text (e.g. phone numbers). 
**Example usage** ``` const text = htmlToText('Some text'); console.log(text); ``` Note that the function uses [cheerio](https://www.npmjs.com/package/cheerio) to parse the HTML. Optionally, to avoid duplicate parsing of HTML and thus improve performance, you can pass an existing Cheerio object to the function instead of the HTML text. The HTML should be parsed with the `decodeEntities` option set to `true`. For example: ``` import * as cheerio from 'cheerio'; const html = 'Some text'; const text = htmlToText(cheerio.load(html, { decodeEntities: true })); ``` *** #### Parameters * ##### htmlOrCheerioElement: string | CheerioAPI HTML text or parsed HTML represented using a [cheerio](https://www.npmjs.com/package/cheerio) function. #### Returns string Plain text --- # isContainerized ### Callable * ****isContainerized**(): Promise\ *** * Detects if crawlee is running in a containerized environment. *** #### Returns Promise\ --- # isDocker ### Callable * ****isDocker**(forceReset): Promise\ *** * Returns a `Promise` that resolves to true if the code is running in a Docker container. *** #### Parameters * ##### optionalforceReset: boolean #### Returns Promise\ --- # isLambda ### Callable * ****isLambda**(): boolean *** * #### Returns boolean --- # parseOpenGraph ### Callable * ****parseOpenGraph**(raw, additionalProperties): Dictionary\ * ****parseOpenGraph**($, additionalProperties): Dictionary\ *** * Easily parse all OpenGraph properties from a page with just a `CheerioAPI` object. *** #### Parameters * ##### raw: string * ##### optionaladditionalProperties: [OpenGraphProperty](https://crawlee.dev/js/api/utils/interface/OpenGraphProperty.md)\[] Any potential additional `OpenGraphProperty` items you'd like to be scraped. Currently existing properties are kept up to date. #### Returns Dictionary\ Scraped OpenGraph properties as an object. --- # parseSitemap ### Callable * ****parseSitemap**\(initialSources, proxyUrl, options): AsyncIterable\ *** * #### Parameters * ##### initialSources: SitemapSource\[] * ##### optionalproxyUrl: string * ##### optionaloptions: T #### Returns AsyncIterable\ --- # sleep ### Callable * ****sleep**(millis): Promise\ *** * Returns a `Promise` that resolves after a specific period of time. This is useful to implement waiting in your code, e.g. to prevent overloading of target website or to avoid bot detection. **Example usage:** ``` import { sleep } from 'crawlee'; ... // Sleep 1.5 seconds await sleep(1500); ``` *** #### Parameters * ##### optionalmillis: number Period of time to sleep, in milliseconds. If not a positive number, the returned promise resolves immediately. #### Returns Promise\ --- # DownloadListOfUrlsOptions ## Index[**](#Index) ### Properties * [**encoding](#encoding) * [**proxyUrl](#proxyUrl) * [**url](#url) * [**urlRegExp](#urlRegExp) ## Properties[**](#Properties) ### [**](#encoding)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L16)optionalencoding **encoding? : BufferEncoding = BufferEncoding The encoding of the file. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L26)optionalproxyUrl **proxyUrl? : string Allows to use a proxy for the download request. 
### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L10)url **url: string URL to the file ### [**](#urlRegExp)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L23)optionalurlRegExp **urlRegExp? : RegExp = RegExp Custom regular expression to identify the URLs in the file to extract. The regular expression should be case-insensitive and have global flag set (i.e. `/something/gi`). --- # ExtractUrlsOptions ## Index[**](#Index) ### Properties * [**string](#string) * [**urlRegExp](#urlRegExp) ## Properties[**](#Properties) ### [**](#string)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L62)string **string: string The string to extract URLs from. ### [**](#urlRegExp)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L68)optionalurlRegExp **urlRegExp? : RegExp = RegExp Custom regular expression --- # MemoryInfo Describes memory usage of the process. ## Index[**](#Index) ### Properties * [**childProcessesBytes](#childProcessesBytes) * [**freeBytes](#freeBytes) * [**mainProcessBytes](#mainProcessBytes) * [**totalBytes](#totalBytes) * [**usedBytes](#usedBytes) ## Properties[**](#Properties) ### [**](#childProcessesBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L42)childProcessesBytes **childProcessesBytes: number Amount of memory used by child processes of the current Node.js process ### [**](#freeBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L33)freeBytes **freeBytes: number Amount of free memory in the system or container ### [**](#mainProcessBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L39)mainProcessBytes **mainProcessBytes: number Amount of memory used the current Node.js process ### [**](#totalBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L30)totalBytes **totalBytes: number Total memory available in the system or container ### [**](#usedBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L36)usedBytes **usedBytes: number Amount of memory used (= totalBytes - freeBytes) --- # OpenGraphProperty ## Index[**](#Index) ### Properties * [**children](#children) * [**name](#name) * [**outputName](#outputName) ## Properties[**](#Properties) ### [**](#children)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/open_graph_parser.ts#L8)children **children: [OpenGraphProperty](https://crawlee.dev/js/api/utils/interface/OpenGraphProperty.md)\[] ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/open_graph_parser.ts#L6)name **name: string ### [**](#outputName)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/open_graph_parser.ts#L7)outputName **outputName: string --- # ParseSitemapOptions ## Index[**](#Index) ### Properties * [**emitNestedSitemaps](#emitNestedSitemaps) * [**maxDepth](#maxDepth) * [**networkTimeouts](#networkTimeouts) * [**reportNetworkErrors](#reportNetworkErrors) * [**sitemapRetries](#sitemapRetries) ## Properties[**](#Properties) ### [**](#emitNestedSitemaps)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L176)optionalemitNestedSitemaps **emitNestedSitemaps? 
: boolean If set to `true`, elements referring to other sitemaps will be emitted as special objects with `originSitemapUrl` set to `null`. ### [**](#maxDepth)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L180)optionalmaxDepth **maxDepth? : number Maximum depth of nested sitemaps to follow. ### [**](#networkTimeouts)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L188)optionalnetworkTimeouts **networkTimeouts? : Delays Network timeouts for sitemap fetching. See [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/6-timeout.md) for more details. ### [**](#reportNetworkErrors)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L193)optionalreportNetworkErrors **reportNetworkErrors? : boolean = true If true, the parser will log a warning if it fails to fetch a sitemap due to a network error ### [**](#sitemapRetries)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L184)optionalsitemapRetries **sitemapRetries? : number Number of retries for fetching sitemaps. The counter resets for each nested sitemap. --- # social ## Index[**](#Index) ### Interfaces * [**SocialHandles](https://crawlee.dev/js/api/utils/namespace/social.md#SocialHandles) ### Variables * [**DISCORD\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#DISCORD_REGEX) * [**DISCORD\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#DISCORD_REGEX_GLOBAL) * [**EMAIL\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#EMAIL_REGEX) * [**EMAIL\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#EMAIL_REGEX_GLOBAL) * [**FACEBOOK\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#FACEBOOK_REGEX) * [**FACEBOOK\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#FACEBOOK_REGEX_GLOBAL) * [**INSTAGRAM\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#INSTAGRAM_REGEX) * [**INSTAGRAM\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#INSTAGRAM_REGEX_GLOBAL) * [**LINKEDIN\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#LINKEDIN_REGEX) * [**LINKEDIN\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#LINKEDIN_REGEX_GLOBAL) * [**PINTEREST\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#PINTEREST_REGEX) * [**PINTEREST\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#PINTEREST_REGEX_GLOBAL) * [**TIKTOK\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#TIKTOK_REGEX) * [**TIKTOK\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#TIKTOK_REGEX_GLOBAL) * [**TWITTER\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#TWITTER_REGEX) * [**TWITTER\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#TWITTER_REGEX_GLOBAL) * [**YOUTUBE\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#YOUTUBE_REGEX) * [**YOUTUBE\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#YOUTUBE_REGEX_GLOBAL) ### Functions * [**emailsFromText](https://crawlee.dev/js/api/utils/namespace/social.md#emailsFromText) * [**emailsFromUrls](https://crawlee.dev/js/api/utils/namespace/social.md#emailsFromUrls) * [**parseHandlesFromHtml](https://crawlee.dev/js/api/utils/namespace/social.md#parseHandlesFromHtml) * [**phonesFromText](https://crawlee.dev/js/api/utils/namespace/social.md#phonesFromText) * 
[**phonesFromUrls](https://crawlee.dev/js/api/utils/namespace/social.md#phonesFromUrls) ## Interfaces[**](#Interfaces) ### [**](#SocialHandles)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L202)SocialHandles **SocialHandles: Representation of social handles parsed from a HTML page. ### [**](#discords)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L213)discords **discords: string\[] ### [**](#emails)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L203)emails **emails: string\[] ### [**](#facebooks)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L209)facebooks **facebooks: string\[] ### [**](#instagrams)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L208)instagrams **instagrams: string\[] ### [**](#linkedIns)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L206)linkedIns **linkedIns: string\[] ### [**](#phones)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L204)phones **phones: string\[] ### [**](#phonesUncertain)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L205)phonesUncertain **phonesUncertain: string\[] ### [**](#pinterests)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L212)pinterests **pinterests: string\[] ### [**](#tiktoks)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L211)tiktoks **tiktoks: string\[] ### [**](#twitters)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L207)twitters **twitters: string\[] ### [**](#youtubes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L210)youtubes **youtubes: string\[] ## Variables[**](#Variables) ### [**](#DISCORD_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L608)constDISCORD\_REGEX **DISCORD\_REGEX: RegExp = ... Regular expression to exactly match a Discord invite or channel. It has the following form: `/^...$/i` and matches URLs such as: ``` https://discord.gg/discord-developers https://discord.com/invite/jyEM2PRvMU https://discordapp.com/channels/1234 https://discord.com/channels/1234/1234 discord.gg/discord-developers ``` Example usage: ``` import { social } from 'crawlee'; if (social.DISCORD_REGEX.test('https://discord.gg/discord-developers')) { console.log('Match!'); } ``` ### [**](#DISCORD_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L629)constDISCORD\_REGEX\_GLOBAL **DISCORD\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Discord channels or invites in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://discord.gg/discord-developers https://discord.com/invite/jyEM2PRvMU https://discordapp.com/channels/1234 https://discord.com/channels/1234/1234 discord.gg/discord-developers ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.DISCORD_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Discord channels found!`); ``` ### [**](#EMAIL_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L13)constEMAIL\_REGEX **EMAIL\_REGEX: RegExp = ... Regular expression to exactly match a single email address. 
It has the following form: `/^...$/i`. ### [**](#EMAIL_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L19)constEMAIL\_REGEX\_GLOBAL **EMAIL\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple email addresses in a text. It has the following form: `/.../ig`. ### [**](#FACEBOOK_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L419)constFACEBOOK\_REGEX **FACEBOOK\_REGEX: RegExp = ... Regular expression to exactly match a single Facebook profile URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.facebook.com/apifytech facebook.com/apifytech fb.com/apifytech https://www.facebook.com/profile.php?id=123456789 ``` The regular expression does NOT match URLs with additional subdirectories or query parameters, such as: ``` https://www.facebook.com/apifytech/photos ``` Example usage: ``` import { social } from 'crawlee'; if (social.FACEBOOK_REGEX.test('https://www.facebook.com/apifytech')) { console.log('Match!'); } ``` ### [**](#FACEBOOK_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L448)constFACEBOOK\_REGEX\_GLOBAL **FACEBOOK\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Facebook profile URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.facebook.com/apifytech facebook.com/apifytech fb.com/apifytech ``` If the profile URL contains subdirectories or query parameters, the regular expression extracts just the base part of the profile URL. For example, from text such as: ``` https://www.facebook.com/apifytech/photos ``` the expression extracts only the following base URL: ``` https://www.facebook.com/apifytech ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.FACEBOOK_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Facebook profiles found!`); ``` ### [**](#INSTAGRAM_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L303)constINSTAGRAM\_REGEX **INSTAGRAM\_REGEX: RegExp = ... Regular expression to exactly match a single Instagram profile URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.instagram.com/old_prague www.instagram.com/old_prague/ instagr.am/old_prague ``` The regular expression does NOT match URLs with additional subdirectories or query parameters, such as: ``` https://www.instagram.com/cristiano/followers ``` It also does NOT match the following URLs: ``` https://www.instagram.com/explore/ https://www.instagram.com/_n/ https://www.instagram.com/_u/ ``` Example usage: ``` import { social } from 'crawlee'; if (social.INSTAGRAM_REGEX.test('https://www.instagram.com/old_prague')) { console.log('Match!'); } ``` ### [**](#INSTAGRAM_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L339)constINSTAGRAM\_REGEX\_GLOBAL **INSTAGRAM\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Instagram profile URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.instagram.com/old_prague www.instagram.com/old_prague/ instagr.am/old_prague ``` If the profile URL contains subdirectories or query parameters, the regular expression extracts just the base part of the profile URL.
For example, from text such as: ``` https://www.instagram.com/cristiano/followers ``` the expression extracts just the following base URL: ``` https://www.instagram.com/cristiano ``` The regular expression does NOT match the following URLs: ``` https://www.instagram.com/explore/ https://www.instagram.com/_n/ https://www.instagram.com/_u/ ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.INSTAGRAM_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Instagram profiles found!`); ``` ### [**](#LINKEDIN_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L241)constLINKEDIN\_REGEX **LINKEDIN\_REGEX: RegExp = ... Regular expression to exactly match a single LinkedIn profile URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.linkedin.com/in/alan-turing en.linkedin.com/in/alan-turing linkedin.com/in/alan-turing https://www.linkedin.com/company/linkedin/ ``` The regular expression does NOT match URLs with additional subdirectories or query parameters, such as: ``` https://www.linkedin.com/in/linus-torvalds/latest-activity ``` Example usage: ``` import { social } from 'crawlee'; if (social.LINKEDIN_REGEX.test('https://www.linkedin.com/in/alan-turing')) { console.log('Match!'); } ``` ### [**](#LINKEDIN_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L271)constLINKEDIN\_REGEX\_GLOBAL **LINKEDIN\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple LinkedIn profile URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.linkedin.com/in/alan-turing en.linkedin.com/in/alan-turing linkedin.com/in/alan-turing https://www.linkedin.com/company/linkedin/ ``` If the profile URL contains subdirectories or query parameters, the regular expression extracts just the base part of the profile URL. For example, from text such as: ``` https://www.linkedin.com/in/linus-torvalds/latest-activity ``` the expression extracts just the following base URL: ``` https://www.linkedin.com/in/linus-torvalds ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.LINKEDIN_REGEX_GLOBAL); if (matches) console.log(`${matches.length} LinkedIn profiles found!`); ``` ### [**](#PINTEREST_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L563)constPINTEREST\_REGEX **PINTEREST\_REGEX: RegExp = ... Regular expression to exactly match a Pinterest pin, user or user's board. It has the following form: `/^...$/i` and matches URLs such as: ``` https://pinterest.com/pin/123456789 https://www.pinterest.cz/pin/123456789 https://www.pinterest.com/user https://uk.pinterest.com/user https://www.pinterest.co.uk/user pinterest.com/user_name.gold https://cz.pinterest.com/user/board ``` Example usage: ``` import { social } from 'crawlee'; if (social.PINTEREST_REGEX.test('https://pinterest.com/pin/123456789')) { console.log('Match!'); } ``` ### [**](#PINTEREST_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L586)constPINTEREST\_REGEX\_GLOBAL **PINTEREST\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Pinterest pins, users or boards in a text or HTML. 
It has the following form: `/.../ig` and matches URLs such as: ``` https://pinterest.com/pin/123456789 https://www.pinterest.cz/pin/123456789 https://www.pinterest.com/user https://uk.pinterest.com/user https://www.pinterest.co.uk/user pinterest.com/user_name.gold https://cz.pinterest.com/user/board ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.PINTEREST_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Pinterest pins found!`); ``` ### [**](#TIKTOK_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L517)constTIKTOK\_REGEX **TIKTOK\_REGEX: RegExp = ... Regular expression to exactly match a Tiktok video or user account. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.tiktok.com/trending?shareId=123456789 https://www.tiktok.com/embed/123456789 https://m.tiktok.com/v/123456789 https://www.tiktok.com/@user https://www.tiktok.com/@user-account.pro https://www.tiktok.com/@user/video/123456789 ``` Example usage: ``` import { social } from 'crawlee'; if (social.TIKTOK_REGEX.test('https://www.tiktok.com/trending?shareId=123456789')) { console.log('Match!'); } ``` ### [**](#TIKTOK_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L539)constTIKTOK\_REGEX\_GLOBAL **TIKTOK\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Tiktok videos or user accounts in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.tiktok.com/trending?shareId=123456789 https://www.tiktok.com/embed/123456789 https://m.tiktok.com/v/123456789 https://www.tiktok.com/@user https://www.tiktok.com/@user-account.pro https://www.tiktok.com/@user/video/123456789 ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.TIKTOK_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Tiktok profiles/videos found!`); ``` ### [**](#TWITTER_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L364)constTWITTER\_REGEX **TWITTER\_REGEX: RegExp = ... Regular expression to exactly match a single Twitter profile URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.twitter.com/apify twitter.com/apify ``` The regular expression does NOT match URLs with additional subdirectories or query parameters, such as: ``` https://www.twitter.com/realdonaldtrump/following ``` Example usage: ``` import { social } from 'crawlee'; if (social.TWITTER_REGEX.test('https://www.twitter.com/apify')) { console.log('Match!'); } ``` ### [**](#TWITTER_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L392)constTWITTER\_REGEX\_GLOBAL **TWITTER\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Twitter profile URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.twitter.com/apify twitter.com/apify ``` If the profile URL contains subdirectories or query parameters, the regular expression extracts just the base part of the profile URL. 
For example, from text such as: ``` https://www.twitter.com/realdonaldtrump/following ``` the expression extracts only the following base URL: ``` https://www.twitter.com/realdonaldtrump ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.TWITTER_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Twitter profiles found!`); ``` ### [**](#YOUTUBE_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L472)constYOUTUBE\_REGEX **YOUTUBE\_REGEX: RegExp = ... Regular expression to exactly match a single Youtube channel, user or video URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.youtube.com/watch?v=kM7YfhfkiEE https://youtu.be/kM7YfhfkiEE https://www.youtube.com/c/TrapNation https://www.youtube.com/channel/UCklie6BM0fhFvzWYqQVoCTA https://www.youtube.com/user/pewdiepie ``` Please note that this won't match URLs that redirect to /user or /channel. Example usage: ``` import { social } from 'crawlee'; if (social.YOUTUBE_REGEX.test('https://www.youtube.com/watch?v=kM7YfhfkiEE')) { console.log('Match!'); } ``` ### [**](#YOUTUBE_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L495)constYOUTUBE\_REGEX\_GLOBAL **YOUTUBE\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Youtube channel, user or video URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.youtube.com/watch?v=kM7YfhfkiEE https://youtu.be/kM7YfhfkiEE https://www.youtube.com/c/TrapNation https://www.youtube.com/channel/UCklie6BM0fhFvzWYqQVoCTA https://www.youtube.com/user/pewdiepie ``` Please note that this won't match URLs that redirect to /user or /channel. Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.YOUTUBE_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Youtube videos found!`); ``` ## Functions[**](#Functions) ### [**](#emailsFromText)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L30)emailsFromText * ****emailsFromText**(text): string\[] - The function extracts email addresses from plain text. Note that the function preserves the order of emails and keeps duplicates. *** #### Parameters * ##### text: string Text to search in. #### Returns string\[] Array of email addresses found. If no emails are found, the function returns an empty array. ### [**](#emailsFromUrls)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L43)emailsFromUrls * ****emailsFromUrls**(urls): string\[] - The function extracts email addresses from a list of URLs. Basically it looks for all `mailto:` URLs and returns valid email addresses from them. Note that the function preserves the order of emails and keeps duplicates. *** #### Parameters * ##### urls: string\[] Array of URLs. #### Returns string\[] Array of email addresses found. If no emails are found, the function returns an empty array. ### [**](#parseHandlesFromHtml)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L661)parseHandlesFromHtml * ****parseHandlesFromHtml**(html, data): [SocialHandles](https://crawlee.dev/js/api/utils/namespace/social.md#SocialHandles) - The function attempts to extract emails, phone numbers and social profile URLs from an HTML document, specifically LinkedIn, Twitter, Instagram and Facebook profile URLs.
The function removes duplicates from the resulting arrays and sorts the items alphabetically. Note that the `phones` field contains phone numbers extracted from the special phone links such as `[call us](tel:+1234556789)` (see [phonesFromUrls](https://crawlee.dev/js/api/utils/namespace/social.md#phonesFromUrls)) and potentially other sources with high certainty, while `phonesUncertain` contains phone numbers extracted from the plain text, which might be very inaccurate. **Example usage:** ``` import { launchPuppeteer, social } from 'crawlee'; const browser = await launchPuppeteer(); const page = await browser.newPage(); await page.goto('http://www.example.com'); const html = await page.content(); const result = social.parseHandlesFromHtml(html); console.log('Social handles:'); console.dir(result); ``` *** #### Parameters * ##### html: string HTML text * ##### optionaldata: null | Record\ = null Optional object which will receive the `text` and `$` properties that contain text content of the HTML and `cheerio` object, respectively. This is an optimization so that the caller doesn't need to parse the HTML document again, if needed. #### Returns [SocialHandles](https://crawlee.dev/js/api/utils/namespace/social.md#SocialHandles) An object with the social handles. ### [**](#phonesFromText)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L124)phonesFromText * ****phonesFromText**(text): string\[] - The function attempts to extract phone numbers from a text. Please note that the results might not be accurate, since phone numbers appear in a large variety of formats and conventions. If you encounter some problems, please [file an issue](https://github.com/apify/crawlee/issues). *** #### Parameters * ##### text: string Text to search the phone numbers in. #### Returns string\[] Array of phone numbers found. If no phone numbers are found, the function returns an empty array. ### [**](#phonesFromUrls)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L150)phonesFromUrls * ****phonesFromUrls**(urls): string\[] - Finds phone number links in an array of URLs and extracts the phone numbers from them. Note that the phone number links look like `tel://123456789`, `tel:/123456789` or `tel:123456789`. *** #### Parameters * ##### urls: string\[] Array of URLs. #### Returns string\[] Array of phone numbers found. If no phone numbers are found, the function returns an empty array. --- ## [📄️ Deploy on Apify](https://crawlee.dev/js/docs/deployment/apify-platform.md) [Apify platform - large-scale and high-performance web scraping](https://crawlee.dev/js/docs/deployment/apify-platform.md) --- # Apify Platform Apify is a [platform](https://apify.com) built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to [compute instances (Actors)](#what-is-an-actor), convenient [request](https://crawlee.dev/js/docs/guides/request-storage.md) and [result](https://crawlee.dev/js/docs/guides/result-storage.md) storages, [proxies](https://crawlee.dev/js/docs/guides/proxy-management.md), [scheduling](https://docs.apify.com/scheduler), [webhooks](https://docs.apify.com/webhooks) and [more](https://docs.apify.com/), accessible through a [web interface](https://console.apify.com) or an [API](https://docs.apify.com/api).
While we think that the Apify platform is super cool, and it's definitely worth signing up for a [free account](https://console.apify.com/sign-up), **Crawlee is and will always be open source**, runnable locally or on any cloud infrastructure. note We do not test Crawlee in other cloud environments such as Lambda or on specific architectures such as Raspberry PI. We strive to make it work, but there are no guarantees. ## Logging into Apify platform from Crawlee[​](#logging-into-apify-platform-from-crawlee "Direct link to Logging into Apify platform from Crawlee") To access your [Apify account](https://console.apify.com/sign-up) from Crawlee, you must provide credentials - your [API token](https://console.apify.com/account?tab=integrations). You can do that either by utilizing [Apify CLI](https://github.com/apify/apify-cli) or with environment variables. Once you provide credentials to your scraper, you will be able to use all the Apify platform features, such as calling actors, saving to cloud storages, using Apify proxies, setting up webhooks and so on. ### Log in with CLI[​](#log-in-with-cli "Direct link to Log in with CLI") Apify CLI allows you to log in to your Apify account on your computer. If you then run your scraper using the CLI, your credentials will automatically be added. ``` npm install -g apify-cli apify login -t YOUR_API_TOKEN ``` ### Log in with environment variables[​](#log-in-with-environment-variables "Direct link to Log in with environment variables") Alternatively, you can always provide credentials to your scraper by setting the [`APIFY_TOKEN`](#apify_token) environment variable to your API token. > There's also the [`APIFY_PROXY_PASSWORD`](#apify_proxy_password) environment variable. Actor automatically infers that from your token, but it can be useful when you need to access proxies from a different account than your token represents. ### Log in with Configuration[​](#log-in-with-configuration "Direct link to Log in with Configuration") Another option is to use the [`Configuration`](https://docs.apify.com/sdk/js/reference/class/Configuration) instance and set your api token there. ``` import { Actor } from 'apify'; const sdk = new Actor({ token: 'your_api_token' }); ``` ## What is an actor[​](#what-is-an-actor "Direct link to What is an actor") When you deploy your script to the Apify platform, it becomes an [actor](https://apify.com/actors). An actor is a serverless microservice that accepts an input and produces an output. It can run for a few seconds, hours or even infinitely. An actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. Actors can be shared in the [Apify Store](https://apify.com/store) so that other people can use them. But don't worry, if you share your actor in the store and somebody uses it, it runs under their account, not yours. **Related links** * [Store of existing actors](https://apify.com/store) * [Documentation](https://docs.apify.com/actors) * [View actors in Apify Console](https://console.apify.com/actors) * [API reference](https://apify.com/docs/api/v2#/reference/actors) ## Running an actor locally[​](#running-an-actor-locally "Direct link to Running an actor locally") First let's create a boilerplate of the new actor. You could use Apify CLI and just run: ``` apify create my-hello-world ``` The CLI will prompt you to select a project boilerplate template - let's pick "Hello world". 
The tool will create a directory called `my-hello-world` with a Node.js project files. You can run the actor as follows: ``` cd my-hello-world apify run ``` ## Running Crawlee code as an actor[​](#running-crawlee-code-as-an-actor "Direct link to Running Crawlee code as an actor") For running Crawlee code as an actor on [Apify platform](https://apify.com/actors) you should either: * use a combination of [`Actor.init()`](https://docs.apify.com/sdk/js/reference/class/Actor#init) and [`Actor.exit()`](https://docs.apify.com/sdk/js/reference/class/Actor#exit) functions; * or wrap it into [`Actor.main()`](https://docs.apify.com/sdk/js/reference/class/Actor#main) function. NOTE * Adding [`Actor.init()`](https://docs.apify.com/sdk/js/reference/class/Actor#init) and [`Actor.exit()`](https://docs.apify.com/sdk/js/reference/class/Actor#exit) to your code are the only two important things needed to run it on Apify platform as an actor. `Actor.init()` is needed to initialize your actor (e.g. to set the correct storage implementation), while without `Actor.exit()` the process will simply never stop. * [`Actor.main()`](https://docs.apify.com/sdk/js/reference/class/Actor#main) is an alternative to `Actor.init()` and `Actor.exit()` as it calls both behind the scenes. Let's look at the `CheerioCrawler` example from the [Quick Start](https://crawlee.dev/js/docs/quick-start.md) guide: * Using Actor.main() * Using Actor.init() and Actor.exit() ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.main(async () => { const crawler = new CheerioCrawler({ async requestHandler({ request, $, enqueueLinks }) { const { url } = request; // Extract HTML title of the page. const title = $('title').text(); console.log(`Title of ${url}: ${title}`); // Add URLs that match the provided pattern. await enqueueLinks({ globs: ['https://www.iana.org/*'], }); // Save extracted data to dataset. await Actor.pushData({ url, title }); }, }); // Enqueue the initial request and run the crawler await crawler.run(['https://www.iana.org/']); }); ``` ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ async requestHandler({ request, $, enqueueLinks }) { const { url } = request; // Extract HTML title of the page. const title = $('title').text(); console.log(`Title of ${url}: ${title}`); // Add URLs that match the provided pattern. await enqueueLinks({ globs: ['https://www.iana.org/*'], }); // Save extracted data to dataset. await Actor.pushData({ url, title }); }, }); // Enqueue the initial request and run the crawler await crawler.run(['https://www.iana.org/']); await Actor.exit(); ``` Note that you could also run your actor (that is using Crawlee) locally with Apify CLI. You could start it via the following command in your project folder: ``` apify run ``` ## Deploying an actor to Apify platform[​](#deploying-an-actor-to-apify-platform "Direct link to Deploying an actor to Apify platform") Now (assuming you are already logged in to your Apify account) you can easily deploy your code to the Apify platform by running: ``` apify push ``` Your script will be uploaded to and built on the Apify platform so that it can be run there. For more information, view the [Apify Actor](https://docs.apify.com/cli) documentation. ## Usage on Apify platform[​](#usage-on-apify-platform "Direct link to Usage on Apify platform") You can also develop your actor in an online code editor directly on the platform (you'll need an Apify Account). 
Let's go to the [Actors](https://console.apify.com/actors) page in the app, click *Create new* and then go to the *Source* tab and start writing the code or paste one of the examples from the [Examples](https://crawlee.dev/js/docs/examples.md) section. ## Storages[​](#storages "Direct link to Storages") There are several things worth mentioning here. ### Helper functions for default Key-Value Store and Dataset[​](#helper-functions-for-default-key-value-store-and-dataset "Direct link to Helper functions for default Key-Value Store and Dataset") To simplify access to the *default* storages, instead of using the helper functions of respective storage classes, you could use: * [`Actor.setValue()`](https://docs.apify.com/sdk/js/reference/class/Actor#setValue), [`Actor.getValue()`](https://docs.apify.com/sdk/js/reference/class/Actor#getValue), [`Actor.getInput()`](https://docs.apify.com/sdk/js/reference/class/Actor#getInput) for `Key-Value Store` * [`Actor.pushData()`](https://docs.apify.com/sdk/js/reference/class/Actor#pushData) for `Dataset` ### Using platform storage in a local actor[​](#using-platform-storage-in-a-local-actor "Direct link to Using platform storage in a local actor") When you plan to use the platform storage while developing and running your actor locally, you should use [`Actor.openKeyValueStore()`](https://docs.apify.com/sdk/js/reference/class/Actor#openKeyValueStore), [`Actor.openDataset()`](https://docs.apify.com/sdk/js/reference/class/Actor#openDataset) and [`Actor.openRequestQueue()`](https://docs.apify.com/sdk/js/reference/class/Actor#openRequestQueue) to open the respective storage. Using each of these methods allows to pass the [`OpenStorageOptions`](https://docs.apify.com/sdk/js/reference/interface/OpenStorageOptions) as a second argument, which has only one optional property: [`forceCloud`](https://docs.apify.com/sdk/js/reference/interface/OpenStorageOptions#forceCloud). If set to `true` - cloud storage will be used instead of the folder on the local disk. note If you don't plan to force usage of the platform storages when running the actor locally, there is no need to use the [`Actor`](https://docs.apify.com/sdk/js/reference/class/Actor) class for it. The Crawlee variants [`KeyValueStore.open()`](https://crawlee.dev/js/api/core/class/KeyValueStore.md#open), [`Dataset.open()`](https://crawlee.dev/js/api/core/class/Dataset.md#open) and [`RequestQueue.open()`](https://crawlee.dev/js/api/core/class/RequestQueue.md#open) will work the same. ### Getting public url of an item in the platform storage[​](#getting-public-url-of-an-item-in-the-platform-storage "Direct link to Getting public url of an item in the platform storage") If you need to share a link to some file stored in a Key-Value Store on Apify Platform, you can use [`getPublicUrl()`](https://docs.apify.com/sdk/js/reference/class/KeyValueStore#getPublicUrl) method. It accepts only one parameter: `key` - the key of the item you want to share. ``` import { KeyValueStore } from 'apify'; const store = await KeyValueStore.open(); await store.setValue('your-file', { foo: 'bar' }); const url = store.getPublicUrl('your-file'); // https://api.apify.com/v2/key-value-stores//records/your-file ``` ### Exporting dataset data[​](#exporting-dataset-data "Direct link to Exporting dataset data") When the [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md) is stored on the [Apify platform](https://apify.com/actors), you can export its data to the following formats: HTML, JSON, CSV, Excel, XML and RSS. 
The datasets are displayed on the actor run details page and in the [Storage](https://console.apify.com/storage) section in the Apify Console. The actual data is exported using the [Get dataset items](https://apify.com/docs/api/v2#/reference/datasets/item-collection/get-items) Apify API endpoint. This way you can easily share the crawling results. **Related links** * [Apify platform storage documentation](https://docs.apify.com/storage) * [View storage in Apify Console](https://console.apify.com/storage) * [Key-value stores API reference](https://apify.com/docs/api/v2#/reference/key-value-stores) * [Datasets API reference](https://docs.apify.com/api/v2#/reference/datasets) * [Request queues API reference](https://docs.apify.com/api/v2#/reference/request-queues) ## Environment variables[​](#environment-variables "Direct link to Environment variables") The following are some additional environment variables specific to Apify platform. More Crawlee specific environment variables could be found in the [Environment Variables](https://crawlee.dev/js/docs/guides/configuration.md#environment-variables) guide. note It's important to notice that `CRAWLEE_` environment variables don't need to be replaced with equivalent `APIFY_` ones. Likewise, Crawlee understands `APIFY_` environment variables after calling `Actor.init()` or when using `Actor.main()`. ### `APIFY_TOKEN`[​](#apify_token "Direct link to apify_token") The API token for your Apify account. It is used to access the Apify API, e.g. to access cloud storage or to run an actor on the Apify platform. You can find your API token on the [Account Settings / Integrations](https://console.apify.com/account?tab=integrations) page. ### Combinations of `APIFY_TOKEN` and `CRAWLEE_STORAGE_DIR`[​](#combinations-of-apify_token-and-crawlee_storage_dir "Direct link to combinations-of-apify_token-and-crawlee_storage_dir") > `CRAWLEE_STORAGE_DIR` env variable description could be found in [Environment Variables](https://crawlee.dev/js/docs/guides/configuration.md#crawlee_storage_dir) guide. By combining the env vars in various ways, you can greatly influence the actor's behavior. | Env Vars | API | Storages | | --------------------------------------- | --- | ---------------- | | none OR `CRAWLEE_STORAGE_DIR` | no | local | | `APIFY_TOKEN` | yes | Apify platform | | `APIFY_TOKEN` AND `CRAWLEE_STORAGE_DIR` | yes | local + platform | When using both `APIFY_TOKEN` and `CRAWLEE_STORAGE_DIR`, you can use all the Apify platform features and your data will be stored locally by default. If you want to access platform storages, you can use the `{ forceCloud: true }` option in their respective functions. ``` import { Actor } from 'apify'; import { Dataset } from 'crawlee'; // or Dataset.open('my-local-data') const localDataset = await Actor.openDataset('my-local-data'); // but here we need the `Actor` class const remoteDataset = await Actor.openDataset('my-dataset', { forceCloud: true }); ``` ### `APIFY_PROXY_PASSWORD`[​](#apify_proxy_password "Direct link to apify_proxy_password") Optional password to [Apify Proxy](https://docs.apify.com/proxy) for IP address rotation. Assuming Apify Account was already created, you can find the password on the [Proxy page](https://console.apify.com/proxy) in the Apify Console. The password is automatically inferred using the `APIFY_TOKEN` env var, so in most cases, you don't need to touch it. 
You should use it when, for some reason, you need access to Apify Proxy, but not access to Apify API, or when you need access to proxy from a different account than your token represents. ## Proxy management[​](#proxy-management "Direct link to Proxy management") In addition to your own proxy servers and proxy servers acquired from third-party providers used together with Crawlee, you can also rely on [Apify Proxy](https://apify.com/proxy) for your scraping needs. ### Apify Proxy[​](#apify-proxy "Direct link to Apify Proxy") If you are already subscribed to Apify Proxy, you can start using it immediately in only a few lines of code (for local usage, you should first be [logged in](#logging-into-apify-platform-from-crawlee) to your Apify account). ``` import { Actor } from 'apify'; const proxyConfiguration = await Actor.createProxyConfiguration(); const proxyUrl = await proxyConfiguration.newUrl(); ``` Note that unlike using your own proxies in Crawlee, you shouldn't use the constructor to create a [`ProxyConfiguration`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) instance. For using Apify Proxy you should create an instance using the [`Actor.createProxyConfiguration()`](https://docs.apify.com/sdk/js/reference/class/Actor#createProxyConfiguration) function instead. ### Apify Proxy Configuration[​](#apify-proxy-configuration "Direct link to Apify Proxy Configuration") With Apify Proxy, you can select specific proxy groups to use, or countries to connect from. This allows you to get better proxy performance after some initial research. ``` import { Actor } from 'apify'; const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], countryCode: 'US', }); const proxyUrl = await proxyConfiguration.newUrl(); ``` Now your crawlers will use only Residential proxies from the US. Note that you must first get access to a proxy group before you are able to use it. You can check proxy groups available to you in the [proxy dashboard](https://console.apify.com/proxy). ### Apify Proxy vs. Own proxies[​](#apify-proxy-vs-own-proxies "Direct link to Apify Proxy vs. Own proxies") The `ProxyConfiguration` class covers both Apify Proxy and custom proxy URLs so that you can easily switch between proxy providers. However, some features of the class are available only to Apify Proxy users, mainly because Apify Proxy is what one would call a super-proxy. It's not a single proxy server, but an API endpoint that allows connection through millions of different IP addresses. So the class essentially has two modes: Apify Proxy or Own (third party) proxy. The difference is easy to remember. * If you're using your own proxies - you should create an instance with the ProxyConfiguration [`constructor`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#constructor) function based on the provided [`ProxyConfigurationOptions`](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md). * If you are planning to use Apify Proxy - you should create an instance using the [`Actor.createProxyConfiguration()`](https://docs.apify.com/sdk/js/reference/class/Actor#createProxyConfiguration) function. [`ProxyConfigurationOptions.proxyUrls`](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md#proxyUrls) and [`ProxyConfigurationOptions.newUrlFunction`](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md#newUrlFunction) enable use of your custom proxy URLs, whereas all the other options are there to configure Apify Proxy. Both approaches are shown in the sketch below.
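To make the difference concrete, here is a minimal side-by-side sketch of the two modes. It is only an illustration: the proxy URLs are placeholder values, and the proxy group is simply the one used in the example above.

```
import { Actor } from 'apify';
import { ProxyConfiguration } from 'crawlee';

// Own (third-party) proxies: create the instance with the constructor
// and list your proxy servers explicitly. The URLs below are placeholders.
const ownProxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://my-proxy-1.example.com:8000',
        'http://my-proxy-2.example.com:8000',
    ],
});

// Apify Proxy: let the Actor SDK build the configuration for you.
const apifyProxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

// Both instances expose the same interface, e.g. newUrl().
console.log(await ownProxyConfiguration.newUrl());
```

Either instance can then be passed to a crawler through its `proxyConfiguration` option, so switching between providers does not require changes to the crawling code itself.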
**Related links**

* [Apify Proxy docs](https://docs.apify.com/proxy)

---

# Browsers on AWS Lambda

Running browser-enabled Crawlee crawlers in AWS Lambda is a bit more complicated - but not by much. The main problem is that we have to upload not only our code and the dependencies, but also the **browser binaries**.

## Managing browser binaries[​](#managing-browser-binaries "Direct link to Managing browser binaries")

Fortunately, there are already some NPM packages that can help us manage the browser binaries installation:

* [@sparticuz/chromium](https://www.npmjs.com/package/@sparticuz/chromium) is an NPM package containing brotli-compressed Chromium binaries. When run in the Lambda environment, the package unpacks the binaries under the `/tmp/` path and returns the path to the executable. We just add this package to the project dependencies and zip the `node_modules` folder.

```
# Install the package
npm i -S @sparticuz/chromium

# Zip the dependencies
zip -r dependencies.zip ./node_modules
```

We will now upload the `dependencies.zip` as a Lambda Layer to AWS. Unfortunately, we cannot do this directly - there is a 50MB limit on direct uploads (and the compressed Chromium build is around that size itself). Instead, we'll upload it as an object into S3 storage and provide the link to that object during the layer creation.

## Updating the code[​](#updating-the-code "Direct link to Updating the code")

We also have to slightly update the Crawlee code:

* First, we pass a new `Configuration` instance to the crawler. This way, every crawler instance we create will have its own storage and won't interfere with other crawler instances running in your Lambda environment.

src/main.js

```
// For more information, see https://crawlee.dev/
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
}, new Configuration({
    persistStorage: false,
}));

await crawler.run(startUrls);
```

* Now, we actually have to supply the code with the Chromium path from the `@sparticuz/chromium` package. The AWS Lambda environment also lacks some hardware support (GPU acceleration etc.) - you can tell Chromium about this by passing `aws_chromium.args` to the `args` parameter.

src/main.js

```
// For more information, see https://crawlee.dev/
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';
import aws_chromium from '@sparticuz/chromium';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    launchContext: {
        launchOptions: {
            executablePath: await aws_chromium.executablePath(),
            args: aws_chromium.args,
            headless: true,
        },
    },
}, new Configuration({
    persistStorage: false,
}));
```

* Last but not least, we have to wrap the code in an exported `handler` function - this will become the Lambda handler that AWS executes.
src/main.js

```
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';
import aws_chromium from '@sparticuz/chromium';

const startUrls = ['https://crawlee.dev'];

export const handler = async (event, context) => {
    const crawler = new PlaywrightCrawler({
        requestHandler: router,
        launchContext: {
            launchOptions: {
                executablePath: await aws_chromium.executablePath(),
                args: aws_chromium.args,
                headless: true,
            },
        },
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);

    return {
        statusCode: 200,
        body: await crawler.getData(),
    };
};
```

## Deploying the code[​](#deploying-the-code "Direct link to Deploying the code")

Now we can simply pack the code into a zip archive (minus the `node_modules` folder - we have put that in the Lambda Layer, remember?). We upload the code archive to AWS as the Lambda body, set up the Lambda so it uses the dependencies Layer, and test our newly created Lambda.

Memory settings

Since we're using full-size browsers here, we have to update the Lambda configuration a bit. Most importantly, make sure to set the memory setting to **1024 MB or more** and update the **Lambda timeout**. The target timeout value depends on how long your crawler will be running. Try measuring the execution time when running your crawler locally and set the timeout accordingly.

---

# Cheerio on AWS Lambda

Locally, we can conveniently create a Crawlee project with `npx crawlee create`. In order to run this project on AWS Lambda, however, we need to make a few tweaks.

## Updating the code[​](#updating-the-code "Direct link to Updating the code")

Whenever we instantiate a new crawler, we have to pass a unique `Configuration` instance to it. By default, all Crawlee crawler instances share the same storage - this can be convenient, but it would also make our Lambda stateful, which would lead to hard-to-debug problems.

Also, when creating this `Configuration` instance, make sure to pass the `persistStorage: false` option. This tells Crawlee to use in-memory storage, as the Lambda filesystem is read-only.

src/main.js

```
// For more information, see https://crawlee.dev/
import { CheerioCrawler, Configuration, ProxyConfiguration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new CheerioCrawler({
    requestHandler: router,
}, new Configuration({
    persistStorage: false,
}));

await crawler.run(startUrls);
```

Now, we wrap all the logic in a `handler` function. This is the actual "Lambda" that AWS will be executing later on.

src/main.js

```
// For more information, see https://crawlee.dev/
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

export const handler = async (event, context) => {
    const crawler = new CheerioCrawler({
        requestHandler: router,
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);
};
```

**Important**

Make sure to always instantiate a **new crawler instance for every Lambda invocation**. AWS keeps the environment running for some time after the first Lambda execution (in order to reduce cold-start times), so any subsequent invocations would otherwise reuse the already-used crawler instance. **TL;DR: Keep your Lambda stateless.**

Finally, we also want to return the scraped data from the Lambda when the crawler run ends.
In the end, your `main.js` script should look something like this:

src/main.js

```
// For more information, see https://crawlee.dev/
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

export const handler = async (event, context) => {
    const crawler = new CheerioCrawler({
        requestHandler: router,
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);

    return {
        statusCode: 200,
        body: await crawler.getData(),
    };
};
```

## Deploying the project[​](#deploying-the-project "Direct link to Deploying the project")

Now it's time to deploy our script on AWS! Let's create a zip archive from our project (including the `node_modules` folder) by running `zip -r package.zip .` in the project folder.

Large `node_modules` folder?

AWS has a limit of 50MB for direct file uploads. Usually, our Crawlee projects won't be anywhere near this limit, but we can easily exceed it with large dependency trees. A better way to install your project dependencies is by using Lambda Layers. With Layers, we can also share files between multiple Lambdas - and keep the actual "code" part of the Lambdas as slim as possible.

**To create a Lambda Layer, we need to:**

* Pack the `node_modules` folder into a separate zip file (the archive should contain one folder named `node_modules`).
* Create a new Lambda Layer from this archive. We'll probably need to upload the file to AWS S3 storage first and create the Layer from that object.
* After creating it, we simply tell our new Lambda function to use this layer.

To deploy our actual code, we upload the `package.zip` archive as our code source. In Lambda Runtime Settings, we point the `handler` to the main function that runs the crawler. You can use slashes to describe the directory structure and `.` to denote a named export. Our handler function is called `handler` and is exported from the `src/main.js` file, so we'll use `src/main.handler` as the handler name.

Now we're all set! By clicking the **Test** button, we can send an example testing event to our new Lambda. The actual contents of the event don't really matter for now - if you want, you can further parameterize your crawler run by analyzing the `event` object that AWS passes as the first argument to the handler.

tip

In the Configuration tab of the AWS Lambda dashboard, you can configure the amount of memory the Lambda runs with, or the size of the ephemeral storage. The memory size can greatly affect the execution speed of your Lambda. See the [official documentation](https://docs.aws.amazon.com/lambda/latest/operatorguide/computing-power.html) for how performance and cost scale with more memory.

---

# Browsers in GCP Cloud Run

Running full-size browsers on GCP Cloud Functions is actually a bit different from doing so on AWS Lambda - [apparently](https://pptr.dev/troubleshooting#running-puppeteer-on-google-cloud-functions), the latest runtime versions lack dependencies required to run Chromium.

If we want to run browser-enabled Crawlee crawlers on GCP, we'll need to turn to **Cloud Run**. Cloud Run is GCP's platform for running Docker containers - other than that, (almost) everything is the same as with Cloud Functions / AWS Lambdas. GCP can spin up your containers on demand, so you're only billed for the time it takes your container to return an HTTP response to the requesting client.
In a way, it also provides a slightly better developer experience (than regular FaaS), as you can debug your Docker containers locally and be sure you're getting the same setup in the cloud.

## Preparing the project[​](#preparing-the-project "Direct link to Preparing the project")

As always, we first pass a new `Configuration` instance to the crawler constructor:

src/main.js

```
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
}, new Configuration({
    persistStorage: false,
}));

await crawler.run(startUrls);
```

All we now need to do is wrap our crawler with an Express HTTP server handler, so it can communicate with the client via HTTP. Because the Cloud Run platform sees only an opaque Docker container, we have to take care of this bit ourselves.

info

GCP passes you an environment variable called `PORT` - your HTTP server is expected to be listening on this port (GCP exposes it to the outside world).

The `main.js` script should look like this in the end:

src/main.js

```
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';
import express from 'express';

const app = express();

const startUrls = ['https://crawlee.dev'];

app.get('/', async (req, res) => {
    const crawler = new PlaywrightCrawler({
        requestHandler: router,
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);

    return res.send(await crawler.getData());
});

app.listen(parseInt(process.env.PORT) || 3000);
```

tip

Always make sure to keep all the logic in the request handler - as with other FaaS services, your request handlers have to be **stateless**.

## Deploying to GCP[​](#deploying-to-gcp "Direct link to Deploying to GCP")

Now, we're ready to deploy! If you have initialized your project using `npx crawlee create`, the initialization script has prepared a Dockerfile for you. All you have to do now is run `gcloud run deploy` in your project folder (the one with your Dockerfile in it). The gcloud CLI application will ask you a few questions, such as what region you want to deploy your application in, or whether you want to make your application public or private.

After answering those questions, you should be able to see your application in the GCP dashboard and run it using the link you find there.

tip

In case the first execution of your newly created Cloud Run service fails, try editing the Run configuration - mainly setting the available memory to 1GiB or more and updating the request timeout according to the size of the website you are scraping.

---

# Cheerio on GCP Cloud Functions

Running a CheerioCrawler-based project in GCP Cloud Functions is actually quite easy - you just have to make a few changes to the project code.

## Updating the project[​](#updating-the-project "Direct link to Updating the project")

Let's first create the Crawlee project locally with `npx crawlee create`. Set the `"main"` field in the `package.json` file to `"src/main.js"`.

package.json

```
{
    "name": "my-crawlee-project",
    "version": "1.0.0",
    "main": "src/main.js",
    ...
}
```

Now, let's update the `main.js` file, namely:

* Pass a separate `Configuration` instance (with the `persistStorage` option set to `false`) to the crawler constructor.
src/main.js

```
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new CheerioCrawler({
    requestHandler: router,
}, new Configuration({
    persistStorage: false,
}));

await crawler.run(startUrls);
```

* Wrap the crawler call in a separate handler function. This function:
  * Can be asynchronous.
  * Takes two positional arguments - `req` (containing details about the request made to your cloud function) and `res` (the response object you can modify).
  * Should call `res.send(data)` to return any data from the cloud function.
* Export this function from the `src/main.js` module as a named export.

src/main.js

```
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

export const handler = async (req, res) => {
    const crawler = new CheerioCrawler({
        requestHandler: router,
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);

    return res.send(await crawler.getData());
};
```

## Deploying to Google Cloud Platform[​](#deploying-to-google-cloud-platform "Direct link to Deploying to Google Cloud Platform")

In the Google Cloud dashboard, create a new function, allocate memory and CPUs to it, and set the region and the function timeout.

When deploying, pick **ZIP Upload**. You have to create a new GCP storage bucket to store the zip packages in.

Now, for the package - you should zip all the contents of your project folder **excluding the `node_modules` folder** (GCP doesn't have Layers like AWS Lambda does, but it takes care of the project setup for us based on the `package.json` file).

Also, make sure to set the **Entry point** to the name of the function you've exported from the `src/main.js` file. GCP takes the file from the `package.json`'s `main` field.

After the function deploys, you can test it by clicking the "Testing" tab. This tab contains a `curl` script that calls your new Cloud Function. To avoid having to install the `gcloud` CLI application locally, you can also run this script in the Cloud Shell by clicking the link above the code block.

---

## [📄️ Accept user input](https://crawlee.dev/js/docs/examples/accept-user-input.md)

[This example accepts and logs user input:](https://crawlee.dev/js/docs/examples/accept-user-input.md)

---

# Accept user input

This example accepts and logs user input:

```
import { KeyValueStore } from 'crawlee';

const input = await KeyValueStore.getInput();
console.log(input);
```

To provide the actor with input, create an `INPUT.json` file inside the "default" key-value store:

```
{PROJECT_FOLDER}/storage/key_value_stores/default/INPUT.json
```

Anything in this file will be available to the actor when it runs. To learn about other ways to provide an actor with input, refer to the [Apify Platform Documentation](https://apify.com/docs/actor#run).

---

# Add data to dataset

This example saves data to the default dataset. If the dataset doesn't exist, it will be created.
You can save data to custom datasets by using [`Dataset.open()`](https://crawlee.dev/js/api/core/class/Dataset.md#open) [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyBwdXNoRGF0YSwgcmVxdWVzdCwgYm9keSB9KSB7XFxuICAgICAgICAvLyBTYXZlIGRhdGEgdG8gZGVmYXVsdCBkYXRhc2V0XFxuICAgICAgICBhd2FpdCBwdXNoRGF0YSh7XFxuICAgICAgICAgICAgdXJsOiByZXF1ZXN0LnVybCxcXG4gICAgICAgICAgICBodG1sOiBib2R5LFxcbiAgICAgICAgfSk7XFxuICAgIH0sXFxufSk7XFxuXFxuYXdhaXQgY3Jhd2xlci5hZGRSZXF1ZXN0cyhbXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMScsXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMicsXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMycsXFxuXSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4oKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.y9kz_gyD0gZNJaNVFyYfICCT63Qx-6Kf2Lk6EddXLt4\&asrc=run_on_apify) ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ pushData, request, body }) { // Save data to default dataset await pushData({ url: request.url, html: body, }); }, }); await crawler.addRequests([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3', ]); // Run the crawler await crawler.run(); ``` Each item in this dataset will be saved to its own file in the following directory: ``` {PROJECT_FOLDER}/storage/datasets/default/ ``` --- # Basic crawler Copy for LLM This is the most bare-bones example of using Crawlee, which demonstrates some of its building blocks such as the [`BasicCrawler`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md). You probably don't need to go this deep though, and it would be better to start with one of the full-featured crawlers like [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) or [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). The script simply downloads several web pages with plain HTTP requests using the [`sendRequest`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#sendRequest) utility function (which uses the [`got-scraping`](https://github.com/apify/got-scraping) npm module internally) and stores their raw HTML and URL in the default dataset. In local configuration, the data will be stored as JSON files in `./storage/datasets/default`. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEJhc2ljQ3Jhd2xlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbi8vIENyZWF0ZSBhIEJhc2ljQ3Jhd2xlciAtIHRoZSBzaW1wbGVzdCBjcmF3bGVyIHRoYXQgZW5hYmxlc1xcbi8vIHVzZXJzIHRvIGltcGxlbWVudCB0aGUgY3Jhd2xpbmcgbG9naWMgdGhlbXNlbHZlcy5cXG5jb25zdCBjcmF3bGVyID0gbmV3IEJhc2ljQ3Jhd2xlcih7XFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gd2lsbCBiZSBjYWxsZWQgZm9yIGVhY2ggVVJMIHRvIGNyYXdsLlxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHB1c2hEYXRhLCByZXF1ZXN0LCBzZW5kUmVxdWVzdCwgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHsgdXJsIH0gPSByZXF1ZXN0O1xcbiAgICAgICAgbG9nLmluZm8oYFByb2Nlc3NpbmcgJHt1cmx9Li4uYCk7XFxuXFxuICAgICAgICAvLyBGZXRjaCB0aGUgcGFnZSBIVE1MIHZpYSB0aGUgY3Jhd2xlZSBzZW5kUmVxdWVzdCB1dGlsaXR5IG1ldGhvZFxcbiAgICAgICAgLy8gQnkgZGVmYXVsdCwgdGhlIG1ldGhvZCB3aWxsIHVzZSB0aGUgY3VycmVudCByZXF1ZXN0IHRoYXQgaXMgYmVpbmcgaGFuZGxlZCwgc28geW91IGRvbid0IGhhdmUgdG9cXG4gICAgICAgIC8vIHByb3ZpZGUgaXQgeW91cnNlbGYuIFlvdSBjYW4gYWxzbyBwcm92aWRlIGEgY3VzdG9tIHJlcXVlc3QgaWYgeW91IHdhbnQuXFxuICAgICAgICBjb25zdCB7IGJvZHkgfSA9IGF3YWl0IHNlbmRSZXF1ZXN0KCk7XFxuXFxuICAgICAgICAvLyBTdG9yZSB0aGUgSFRNTCBhbmQgVVJMIHRvIHRoZSBkZWZhdWx0IGRhdGFzZXQuXFxuICAgICAgICBhd2FpdCBwdXNoRGF0YSh7XFxuICAgICAgICAgICAgdXJsLFxcbiAgICAgICAgICAgIGh0bWw6IGJvZHksXFxuICAgICAgICB9KTtcXG4gICAgfSxcXG59KTtcXG5cXG4vLyBUaGUgaW5pdGlhbCBsaXN0IG9mIFVSTHMgdG8gY3Jhd2wuIEhlcmUgd2UgdXNlIGp1c3QgYSBmZXcgaGFyZC1jb2RlZCBVUkxzLlxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoW1xcbiAgICAnaHR0cHM6Ly93d3cuZ29vZ2xlLmNvbScsXFxuICAgICdodHRwczovL3d3dy5leGFtcGxlLmNvbScsXFxuICAgICdodHRwczovL3d3dy5iaW5nLmNvbScsXFxuICAgICdodHRwczovL3d3dy53aWtpcGVkaWEuY29tJyxcXG5dKTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXIgYW5kIHdhaXQgZm9yIGl0IHRvIGZpbmlzaC5cXG5hd2FpdCBjcmF3bGVyLnJ1bigpO1xcblxcbmNvbnNvbGUubG9nKCdDcmF3bGVyIGZpbmlzaGVkLicpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjEwMjQsInRpbWVvdXQiOjE4MH19.jFrwSiKGzhJE8bZfqP_Tf7TU-RdpZnGb1cJ78bke0rQ\&asrc=run_on_apify) ``` import { BasicCrawler } from 'crawlee'; // Create a BasicCrawler - the simplest crawler that enables // users to implement the crawling logic themselves. const crawler = new BasicCrawler({ // This function will be called for each URL to crawl. async requestHandler({ pushData, request, sendRequest, log }) { const { url } = request; log.info(`Processing ${url}...`); // Fetch the page HTML via the crawlee sendRequest utility method // By default, the method will use the current request that is being handled, so you don't have to // provide it yourself. You can also provide a custom request if you want. const { body } = await sendRequest(); // Store the HTML and URL to the default dataset. await pushData({ url, html: body, }); }, }); // The initial list of URLs to crawl. Here we use just a few hard-coded URLs. await crawler.addRequests([ 'https://www.google.com', 'https://www.example.com', 'https://www.bing.com', 'https://www.wikipedia.com', ]); // Run the crawler and wait for it to finish. await crawler.run(); console.log('Crawler finished.'); ``` --- # Capture a screenshot using Puppeteer Copy for LLM ## Using Puppeteer directly[​](#using-puppeteer-directly "Direct link to Using Puppeteer directly") tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. This example captures a screenshot of a web page using `Puppeteer`. It would look almost exactly the same with `Playwright`. 
* Page Screenshot * Crawler Utils Screenshot Using `page.screenshot()`: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEtleVZhbHVlU3RvcmUsIGxhdW5jaFB1cHBldGVlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbmNvbnN0IGtleVZhbHVlU3RvcmUgPSBhd2FpdCBLZXlWYWx1ZVN0b3JlLm9wZW4oKTtcXG5cXG5jb25zdCB1cmwgPSAnaHR0cHM6Ly9jcmF3bGVlLmRldic7XFxuLy8gU3RhcnQgYSBicm93c2VyXFxuY29uc3QgYnJvd3NlciA9IGF3YWl0IGxhdW5jaFB1cHBldGVlcigpO1xcblxcbi8vIE9wZW4gbmV3IHRhYiBpbiB0aGUgYnJvd3NlclxcbmNvbnN0IHBhZ2UgPSBhd2FpdCBicm93c2VyLm5ld1BhZ2UoKTtcXG5cXG4vLyBOYXZpZ2F0ZSB0byB0aGUgVVJMXFxuYXdhaXQgcGFnZS5nb3RvKHVybCk7XFxuXFxuLy8gQ2FwdHVyZSB0aGUgc2NyZWVuc2hvdFxcbmNvbnN0IHNjcmVlbnNob3QgPSBhd2FpdCBwYWdlLnNjcmVlbnNob3QoKTtcXG5cXG4vLyBTYXZlIHRoZSBzY3JlZW5zaG90IHRvIHRoZSBkZWZhdWx0IGtleS12YWx1ZSBzdG9yZVxcbmF3YWl0IGtleVZhbHVlU3RvcmUuc2V0VmFsdWUoJ215LWtleScsIHNjcmVlbnNob3QsIHsgY29udGVudFR5cGU6ICdpbWFnZS9wbmcnIH0pO1xcblxcbi8vIENsb3NlIFB1cHBldGVlclxcbmF3YWl0IGJyb3dzZXIuY2xvc2UoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.hnB2LA3UbM_7PMJC08VHU7l3FPloqtx7pSzIDU4nO0I\&asrc=run_on_apify) ``` import { KeyValueStore, launchPuppeteer } from 'crawlee'; const keyValueStore = await KeyValueStore.open(); const url = 'https://crawlee.dev'; // Start a browser const browser = await launchPuppeteer(); // Open new tab in the browser const page = await browser.newPage(); // Navigate to the URL await page.goto(url); // Capture the screenshot const screenshot = await page.screenshot(); // Save the screenshot to the default key-value store await keyValueStore.setValue('my-key', screenshot, { contentType: 'image/png' }); // Close Puppeteer await browser.close(); ``` Using `utils.puppeteer.saveSnapshot()`: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IGxhdW5jaFB1cHBldGVlciwgdXRpbHMgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCB1cmwgPSAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS8nO1xcbi8vIFN0YXJ0IGEgYnJvd3NlclxcbmNvbnN0IGJyb3dzZXIgPSBhd2FpdCBsYXVuY2hQdXBwZXRlZXIoKTtcXG5cXG4vLyBPcGVuIG5ldyB0YWIgaW4gdGhlIGJyb3dzZXJcXG5jb25zdCBwYWdlID0gYXdhaXQgYnJvd3Nlci5uZXdQYWdlKCk7XFxuXFxuLy8gTmF2aWdhdGUgdG8gdGhlIFVSTFxcbmF3YWl0IHBhZ2UuZ290byh1cmwpO1xcblxcbi8vIENhcHR1cmUgdGhlIHNjcmVlbnNob3RcXG5hd2FpdCB1dGlscy5wdXBwZXRlZXIuc2F2ZVNuYXBzaG90KHBhZ2UsIHsga2V5OiAnbXkta2V5Jywgc2F2ZUh0bWw6IGZhbHNlIH0pO1xcblxcbi8vIENsb3NlIFB1cHBldGVlclxcbmF3YWl0IGJyb3dzZXIuY2xvc2UoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.43Fi6LdMsWMLqVj33VMqbSZKfZFbgbNBSo11DKsWzto\&asrc=run_on_apify) ``` import { launchPuppeteer, utils } from 'crawlee'; const url = 'http://www.example.com/'; // Start a browser const browser = await launchPuppeteer(); // Open new tab in the browser const page = await browser.newPage(); // Navigate to the URL await page.goto(url); // Capture the screenshot await utils.puppeteer.saveSnapshot(page, { key: 'my-key', saveHtml: false }); // Close Puppeteer await browser.close(); ``` ## Using `PuppeteerCrawler`[​](#using-puppeteercrawler "Direct link to using-puppeteercrawler") This example captures a screenshot of multiple web pages when using `PuppeteerCrawler`: * Page Screenshot * Crawler Utils Screenshot Using `page.screenshot()`: [Run 
on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIEtleVZhbHVlU3RvcmUgfSBmcm9tICdjcmF3bGVlJztcXG5cXG4vLyBDcmVhdGUgYSBQdXBwZXRlZXJDcmF3bGVyXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQdXBwZXRlZXJDcmF3bGVyKHtcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlIH0pIHtcXG4gICAgICAgIC8vIENhcHR1cmUgdGhlIHNjcmVlbnNob3Qgd2l0aCBQdXBwZXRlZXJcXG4gICAgICAgIGNvbnN0IHNjcmVlbnNob3QgPSBhd2FpdCBwYWdlLnNjcmVlbnNob3QoKTtcXG4gICAgICAgIC8vIENvbnZlcnQgdGhlIFVSTCBpbnRvIGEgdmFsaWQga2V5XFxuICAgICAgICBjb25zdCBrZXkgPSByZXF1ZXN0LnVybC5yZXBsYWNlKC9bOi9dL2csICdfJyk7XFxuICAgICAgICAvLyBTYXZlIHRoZSBzY3JlZW5zaG90IHRvIHRoZSBkZWZhdWx0IGtleS12YWx1ZSBzdG9yZVxcbiAgICAgICAgYXdhaXQgS2V5VmFsdWVTdG9yZS5zZXRWYWx1ZShrZXksIHNjcmVlbnNob3QsIHsgY29udGVudFR5cGU6ICdpbWFnZS9wbmcnIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoW1xcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0xJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0yJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0zJyB9LFxcbl0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.dW6w_in8q5kLx6sM1tplVR0-n9GFpTMRCTjpsTyNhzQ\&asrc=run_on_apify) ``` import { PuppeteerCrawler, KeyValueStore } from 'crawlee'; // Create a PuppeteerCrawler const crawler = new PuppeteerCrawler({ async requestHandler({ request, page }) { // Capture the screenshot with Puppeteer const screenshot = await page.screenshot(); // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Save the screenshot to the default key-value store await KeyValueStore.setValue(key, screenshot, { contentType: 'image/png' }); }, }); await crawler.addRequests([ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]); // Run the crawler await crawler.run(); ``` Using the context-aware [`saveSnapshot()`](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#saveSnapshot) utility: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG4vLyBDcmVhdGUgYSBQdXBwZXRlZXJDcmF3bGVyXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQdXBwZXRlZXJDcmF3bGVyKHtcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBzYXZlU25hcHNob3QgfSkge1xcbiAgICAgICAgLy8gQ29udmVydCB0aGUgVVJMIGludG8gYSB2YWxpZCBrZXlcXG4gICAgICAgIGNvbnN0IGtleSA9IHJlcXVlc3QudXJsLnJlcGxhY2UoL1s6L10vZywgJ18nKTtcXG4gICAgICAgIC8vIENhcHR1cmUgdGhlIHNjcmVlbnNob3RcXG4gICAgICAgIGF3YWl0IHNhdmVTbmFwc2hvdCh7IGtleSwgc2F2ZUh0bWw6IGZhbHNlIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoW1xcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0xJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0yJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0zJyB9LFxcbl0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.0vtFUxFqfNHq5Y7EZ95YMfXOq2WqBpN0zprfavDk7mU\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; // Create a PuppeteerCrawler const crawler = new PuppeteerCrawler({ async requestHandler({ request, 
saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Capture the screenshot await saveSnapshot({ key, saveHtml: false }); }, }); await crawler.addRequests([ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]); // Run the crawler await crawler.run(); ``` To take full page screenshot using puppeteer we need to pass parameter `fullPage` as `true`in the `screenshot()`: `page.screenshot(fullPage: true)` In both examples using `page.screenshot()`, a `key` variable is created based on the URL of the web page. This variable is used as the key when saving each screenshot into a key-value store. --- # Cheerio crawler Copy for LLM This example demonstrates how to use [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) to crawl a list of URLs from an external file, load each URL using a plain HTTP request, parse the HTML using the [Cheerio library](https://www.npmjs.com/package/cheerio) and extract some data from it: the page title and all `h1` tags. [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBsb2csIExvZ0xldmVsIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ3Jhd2xlcnMgY29tZSB3aXRoIHZhcmlvdXMgdXRpbGl0aWVzLCBlLmcuIGZvciBsb2dnaW5nLlxcbi8vIEhlcmUgd2UgdXNlIGRlYnVnIGxldmVsIG9mIGxvZ2dpbmcgdG8gaW1wcm92ZSB0aGUgZGVidWdnaW5nIGV4cGVyaWVuY2UuXFxuLy8gVGhpcyBmdW5jdGlvbmFsaXR5IGlzIG9wdGlvbmFsIVxcbmxvZy5zZXRMZXZlbChMb2dMZXZlbC5ERUJVRyk7XFxuXFxuLy8gQ3JlYXRlIGFuIGluc3RhbmNlIG9mIHRoZSBDaGVlcmlvQ3Jhd2xlciBjbGFzcyAtIGEgY3Jhd2xlclxcbi8vIHRoYXQgYXV0b21hdGljYWxseSBsb2FkcyB0aGUgVVJMcyBhbmQgcGFyc2VzIHRoZWlyIEhUTUwgdXNpbmcgdGhlIGNoZWVyaW8gbGlicmFyeS5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgLy8gVGhlIGNyYXdsZXIgZG93bmxvYWRzIGFuZCBwcm9jZXNzZXMgdGhlIHdlYiBwYWdlcyBpbiBwYXJhbGxlbCwgd2l0aCBhIGNvbmN1cnJlbmN5XFxuICAgIC8vIGF1dG9tYXRpY2FsbHkgbWFuYWdlZCBiYXNlZCBvbiB0aGUgYXZhaWxhYmxlIHN5c3RlbSBtZW1vcnkgYW5kIENQVSAoc2VlIEF1dG9zY2FsZWRQb29sIGNsYXNzKS5cXG4gICAgLy8gSGVyZSB3ZSBkZWZpbmUgc29tZSBoYXJkIGxpbWl0cyBmb3IgdGhlIGNvbmN1cnJlbmN5LlxcbiAgICBtaW5Db25jdXJyZW5jeTogMTAsXFxuICAgIG1heENvbmN1cnJlbmN5OiA1MCxcXG5cXG4gICAgLy8gT24gZXJyb3IsIHJldHJ5IGVhY2ggcGFnZSBhdCBtb3N0IG9uY2UuXFxuICAgIG1heFJlcXVlc3RSZXRyaWVzOiAxLFxcblxcbiAgICAvLyBJbmNyZWFzZSB0aGUgdGltZW91dCBmb3IgcHJvY2Vzc2luZyBvZiBlYWNoIHBhZ2UuXFxuICAgIHJlcXVlc3RIYW5kbGVyVGltZW91dFNlY3M6IDMwLFxcblxcbiAgICAvLyBMaW1pdCB0byAxMCByZXF1ZXN0cyBwZXIgb25lIGNyYXdsXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLFxcblxcbiAgICAvLyBUaGlzIGZ1bmN0aW9uIHdpbGwgYmUgY2FsbGVkIGZvciBlYWNoIFVSTCB0byBjcmF3bC5cXG4gICAgLy8gSXQgYWNjZXB0cyBhIHNpbmdsZSBwYXJhbWV0ZXIsIHdoaWNoIGlzIGFuIG9iamVjdCB3aXRoIG9wdGlvbnMgYXM6XFxuICAgIC8vIGh0dHBzOi8vY3Jhd2xlZS5kZXYvanMvYXBpL2NoZWVyaW8tY3Jhd2xlci9pbnRlcmZhY2UvQ2hlZXJpb0NyYXdsZXJPcHRpb25zI3JlcXVlc3RIYW5kbGVyXFxuICAgIC8vIFdlIHVzZSBmb3IgZGVtb25zdHJhdGlvbiBvbmx5IDIgb2YgdGhlbTpcXG4gICAgLy8gLSByZXF1ZXN0OiBhbiBpbnN0YW5jZSBvZiB0aGUgUmVxdWVzdCBjbGFzcyB3aXRoIGluZm9ybWF0aW9uIHN1Y2ggYXMgdGhlIFVSTCB0aGF0IGlzIGJlaW5nIGNyYXdsZWQgYW5kIEhUVFAgbWV0aG9kXFxuICAgIC8vIC0gJDogdGhlIGNoZWVyaW8gb2JqZWN0IGNvbnRhaW5pbmcgcGFyc2VkIEhUTUxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyBwdXNoRGF0YSwgcmVxdWVzdCwgJCB9KSB7XFxuICAgICAgICBsb2cuZGVidWcoYFByb2Nlc3NpbmcgJHtyZXF1ZXN0LnVybH0uLi5gKTtcXG5cXG4gICAgICAgIC8vIEV4dHJhY3QgZGF0YSBmcm9tIHRoZSBwYWdlIHVzaW5nIGNoZWVyaW8uXFxuICAgICAgICBjb25zdCB0aXRsZSA9ICQoJ3RpdGxlJykudGV4dCgpO1xcbiAgICAgICAgY29uc3QgaD
F0ZXh0czogeyB0ZXh0OiBzdHJpbmcgfVtdID0gW107XFxuICAgICAgICAkKCdoMScpLmVhY2goKGluZGV4LCBlbCkgPT4ge1xcbiAgICAgICAgICAgIGgxdGV4dHMucHVzaCh7XFxuICAgICAgICAgICAgICAgIHRleHQ6ICQoZWwpLnRleHQoKSxcXG4gICAgICAgICAgICB9KTtcXG4gICAgICAgIH0pO1xcblxcbiAgICAgICAgLy8gU3RvcmUgdGhlIHJlc3VsdHMgdG8gdGhlIGRhdGFzZXQuIEluIGxvY2FsIGNvbmZpZ3VyYXRpb24sXFxuICAgICAgICAvLyB0aGUgZGF0YSB3aWxsIGJlIHN0b3JlZCBhcyBKU09OIGZpbGVzIGluIC4vc3RvcmFnZS9kYXRhc2V0cy9kZWZhdWx0XFxuICAgICAgICBhd2FpdCBwdXNoRGF0YSh7XFxuICAgICAgICAgICAgdXJsOiByZXF1ZXN0LnVybCxcXG4gICAgICAgICAgICB0aXRsZSxcXG4gICAgICAgICAgICBoMXRleHRzLFxcbiAgICAgICAgfSk7XFxuICAgIH0sXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gaXMgY2FsbGVkIGlmIHRoZSBwYWdlIHByb2Nlc3NpbmcgZmFpbGVkIG1vcmUgdGhhbiBtYXhSZXF1ZXN0UmV0cmllcyArIDEgdGltZXMuXFxuICAgIGZhaWxlZFJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCB9KSB7XFxuICAgICAgICBsb2cuZGVidWcoYFJlcXVlc3QgJHtyZXF1ZXN0LnVybH0gZmFpbGVkIHR3aWNlLmApO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciBhbmQgd2FpdCBmb3IgaXQgdG8gZmluaXNoLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cXG5sb2cuZGVidWcoJ0NyYXdsZXIgZmluaXNoZWQuJyk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.sZ3S96qg-5sNtVOu6wpMBxeYZ1xQXA496A-Ou_nSUpc\&asrc=run_on_apify) ``` import { CheerioCrawler, log, LogLevel } from 'crawlee'; // Crawlers come with various utilities, e.g. for logging. // Here we use debug level of logging to improve the debugging experience. // This functionality is optional! log.setLevel(LogLevel.DEBUG); // Create an instance of the CheerioCrawler class - a crawler // that automatically loads the URLs and parses their HTML using the cheerio library. const crawler = new CheerioCrawler({ // The crawler downloads and processes the web pages in parallel, with a concurrency // automatically managed based on the available system memory and CPU (see AutoscaledPool class). // Here we define some hard limits for the concurrency. minConcurrency: 10, maxConcurrency: 50, // On error, retry each page at most once. maxRequestRetries: 1, // Increase the timeout for processing of each page. requestHandlerTimeoutSecs: 30, // Limit to 10 requests per one crawl maxRequestsPerCrawl: 10, // This function will be called for each URL to crawl. // It accepts a single parameter, which is an object with options as: // https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions#requestHandler // We use for demonstration only 2 of them: // - request: an instance of the Request class with information such as the URL that is being crawled and HTTP method // - $: the cheerio object containing parsed HTML async requestHandler({ pushData, request, $ }) { log.debug(`Processing ${request.url}...`); // Extract data from the page using cheerio. const title = $('title').text(); const h1texts: { text: string }[] = []; $('h1').each((index, el) => { h1texts.push({ text: $(el).text(), }); }); // Store the results to the dataset. In local configuration, // the data will be stored as JSON files in ./storage/datasets/default await pushData({ url: request.url, title, h1texts, }); }, // This function is called if the page processing failed more than maxRequestRetries + 1 times. failedRequestHandler({ request }) { log.debug(`Request ${request.url} failed twice.`); }, }); // Run the crawler and wait for it to finish. 
await crawler.run(['https://crawlee.dev']); log.debug('Crawler finished.'); ``` --- # Crawl all links on a website Copy for LLM This example uses the `enqueueLinks()` method to add new links to the `RequestQueue` as the crawler navigates from page to page. This example can also be used to find all URLs on a domain by removing the `maxRequestsPerCrawl` option. tip If no options are given, by default the method will only add links that are under the same subdomain. This behavior can be controlled with the [`strategy`](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md#strategy) option. You can find more info about this option in the [`Crawl relative links`](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) examples. * Cheerio Crawler * Puppeteer Crawler * Playwright Crawler [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgZW5xdWV1ZUxpbmtzLCBsb2cgfSkge1xcbiAgICAgICAgbG9nLmluZm8ocmVxdWVzdC51cmwpO1xcbiAgICAgICAgLy8gQWRkIGFsbCBsaW5rcyBmcm9tIHBhZ2UgdG8gUmVxdWVzdFF1ZXVlXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3MoKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.LBIV5tC8xatPLd7liUmYWtCnUL8bFQBt6Eq8fnylMkg\&asrc=run_on_apify) ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIC8vIEFkZCBhbGwgbGlua3MgZnJvbSBwYWdlIHRvIFJlcXVlc3RRdWV1ZVxcbiAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKCk7XFxuICAgIH0sXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLCAvLyBMaW1pdGF0aW9uIGZvciBvbmx5IDEwIHJlcXVlc3RzIChkbyBub3QgdXNlIGlmIHlvdSB3YW50IHRvIGNyYXdsIGFsbCBsaW5rcylcXG59KTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXIgd2l0aCBpbml0aWFsIHJlcXVlc3RcXG5hd2FpdCBjcmF3bGVyLnJ1bihbJ2h0dHBzOi8vY3Jhd2xlZS5kZXYnXSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.G2vsd_Fgpa50zBrg6m9S-dTzY4pzWTkAxqe6CzZtX5k\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgZW5xdWV1ZUxpbmtzLCBsb2cgfSkge1xcbiAgICAgICAgbG9nLmluZm8ocmVxdWVzdC51cmwpO1xcbiAgICAgICAgLy8gQWRkIGFsbCBsaW5rcyBmcm9tIHBhZ2UgdG8gUmVxdWVzdFF1ZXVlXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3MoKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.NdlPlyegNit9Kua8PQcBs0l9SELlDds4jvMbM0_tnhc\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` --- # Crawl multiple URLs Copy for LLM This example crawls the specified list of URLs. 
* Cheerio Crawler * Puppeteer Crawler * Playwright Crawler [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCAkLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSAkKCd0aXRsZScpLnRleHQoKTtcXG4gICAgICAgIGxvZy5pbmZvKGBVUkw6ICR7cmVxdWVzdC51cmx9XFxcXG5USVRMRTogJHt0aXRsZX1gKTtcXG4gICAgfSxcXG59KTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXIgd2l0aCBpbml0aWFsIHJlcXVlc3RcXG5hd2FpdCBjcmF3bGVyLnJ1bihbJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0xJywgJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0yJywgJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0zJ10pO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjEwMjQsInRpbWVvdXQiOjE4MH19.EkXGuY4BB9beeDa547KhHku8moogGGz0it_b02peucA\&asrc=run_on_apify) ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ request, $, log }) { const title = $('title').text(); log.info(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler with initial request await crawler.run(['http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3']); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICAvLyBGdW5jdGlvbiBjYWxsZWQgZm9yIGVhY2ggVVJMXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgcGFnZSwgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHRpdGxlID0gYXdhaXQgcGFnZS50aXRsZSgpO1xcbiAgICAgICAgbG9nLmluZm8oYFVSTDogJHtyZXF1ZXN0LnVybH1cXFxcblRJVExFOiAke3RpdGxlfWApO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTEnLCAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTInLCAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTMnXSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.giI3tJSfWG6oPGR2aMc4P1hv9q3DjQouI10GxYdUr5c\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ // Function called for each URL async requestHandler({ request, page, log }) { const title = await page.title(); log.info(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler with initial request await crawler.run(['http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3']); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBsb2cuaW5mbyhgVVJMOiAke3JlcXVlc3QudXJsfVxcXFxuVElUTEU6ICR7dGl0bGV9YCk7XFxuICAgIH0sXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyIHdpdGggaW5pdGlhbCByZXF1ZXN0XFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMScsICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMicsICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMyddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.usJ_mWQQRhnzUWTSjqEaplezGdxO-uK49YEErKaMke0\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ // Function called for each URL async requestHandler({ request, page, log }) { const title = await page.title(); log.info(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler with initial request await crawler.run(['http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3']); ``` --- # Crawl a website with relative links Copy for LLM When crawling a website, you may encounter different types of links present that you may want to crawl. To facilitate the easy crawling of such links, we provide the `enqueueLinks()` method on the crawler context, which will automatically find links and add them to the crawler's [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md). We provide 3 different strategies for crawling relative links: * [All (or the string "all")](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#All) which will enqueue all links found, regardless of the domain they point to. * [SameHostname (or the string "same-hostname")](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#SameHostname) which will enqueue all links found for the same hostname. This is the default strategy. * [SameDomain (or the string "same-domain")](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#SameDomain) which will enqueue all links found that have the same domain name, including links from any possible subdomain. note For these examples, we are using the [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), however the same method is available for both the [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), and you use it the exact same way. * All Links * Same Hostname * Same Subdomain Example domains Any urls found will be matched by this strategy, even if they go off of the site you are currently crawling. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBFbnF1ZXVlU3RyYXRlZ3kgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgLy8gU2V0dGluZyB0aGUgc3RyYXRlZ3kgdG8gJ2FsbCcgd2lsbCBlbnF1ZXVlIGFsbCBsaW5rcyBmb3VuZFxcbiAgICAgICAgICAgIC8vIGhpZ2hsaWdodC1uZXh0LWxpbmVcXG4gICAgICAgICAgICBzdHJhdGVneTogRW5xdWV1ZVN0cmF0ZWd5LkFsbCxcXG4gICAgICAgICAgICAvLyBBbHRlcm5hdGl2ZWx5LCB5b3UgY2FuIHBhc3MgaW4gdGhlIHN0cmluZyAnYWxsJ1xcbiAgICAgICAgICAgIC8vIHN0cmF0ZWd5OiAnYWxsJyxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.zrKphqRNzrvQObV0GryliVYKQmeIFEkOtV_qBMeXvis\&asrc=run_on_apify) ``` import { CheerioCrawler, EnqueueStrategy } from 'crawlee'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); await enqueueLinks({ // Setting the strategy to 'all' will enqueue all links found strategy: EnqueueStrategy.All, // Alternatively, you can pass in the string 'all' // strategy: 'all', }); }, }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` Example domains For a url of `https://example.com`, `enqueueLinks()` will match relative urls and urls that point to the same hostname. > This is the default strategy when calling `enqueueLinks()`, so you don't have to specify it. For instance, hyperlinks like `https://example.com/some/path`, `/absolute/example` or `./relative/example` will all be matched by this strategy. But links to any subdomain like `https://subdomain.example.com/some/path` won't. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBFbnF1ZXVlU3RyYXRlZ3kgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgLy8gU2V0dGluZyB0aGUgc3RyYXRlZ3kgdG8gJ3NhbWUtaG9zdG5hbWUnIHdpbGwgZW5xdWV1ZSBhbGwgbGlua3MgZm91bmQgdGhhdCBhcmUgb24gdGhlXFxuICAgICAgICAgICAgLy8gc2FtZSBob3N0bmFtZSAoaW5jbHVkaW5nIHN1YmRvbWFpbikgYXMgcmVxdWVzdC5sb2FkZWRVcmwgb3IgcmVxdWVzdC51cmxcXG4gICAgICAgICAgICAvLyBoaWdobGlnaHQtbmV4dC1saW5lXFxuICAgICAgICAgICAgc3RyYXRlZ3k6IEVucXVldWVTdHJhdGVneS5TYW1lSG9zdG5hbWUsXFxuICAgICAgICAgICAgLy8gQWx0ZXJuYXRpdmVseSwgeW91IGNhbiBwYXNzIGluIHRoZSBzdHJpbmcgJ3NhbWUtaG9zdG5hbWUnXFxuICAgICAgICAgICAgLy8gc3RyYXRlZ3k6ICdzYW1lLWhvc3RuYW1lJyxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.iCcYmWUvfNLGhjIu0mJ9cQwXfpdl2TIbAnyCU5XVdrw\&asrc=run_on_apify) ``` import { CheerioCrawler, EnqueueStrategy } from 'crawlee'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); await enqueueLinks({ // Setting the strategy to 'same-hostname' will enqueue all links found that are on the // same hostname (including subdomain) as request.loadedUrl or request.url strategy: EnqueueStrategy.SameHostname, // Alternatively, you can pass in the string 'same-hostname' // strategy: 'same-hostname', }); }, }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` Example domains For a url of `https://subdomain.example.com`, `enqueueLinks()` will match relative urls or urls that point to the same domain name, regardless of their subdomain. For instance, hyperlinks like `https://subdomain.example.com/some/path`, `/absolute/example` or `./relative/example` will all be matched by this strategy, as well as links to other subdomains or to the naked domain, like `https://other-subdomain.example.com` or `https://example.com` will work too. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBFbnF1ZXVlU3RyYXRlZ3kgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgLy8gU2V0dGluZyB0aGUgc3RyYXRlZ3kgdG8gJ3NhbWUtZG9tYWluJyB3aWxsIGVucXVldWUgYWxsIGxpbmtzIGZvdW5kIHRoYXQgYXJlIG9uIHRoZVxcbiAgICAgICAgICAgIC8vIHNhbWUgaG9zdG5hbWUgYXMgcmVxdWVzdC5sb2FkZWRVcmwgb3IgcmVxdWVzdC51cmxcXG4gICAgICAgICAgICAvLyBoaWdobGlnaHQtbmV4dC1saW5lXFxuICAgICAgICAgICAgc3RyYXRlZ3k6IEVucXVldWVTdHJhdGVneS5TYW1lRG9tYWluLFxcbiAgICAgICAgICAgIC8vIEFsdGVybmF0aXZlbHksIHlvdSBjYW4gcGFzcyBpbiB0aGUgc3RyaW5nICdzYW1lLWRvbWFpbidcXG4gICAgICAgICAgICAvLyBzdHJhdGVneTogJ3NhbWUtZG9tYWluJyxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.eW4ZGM7CltwTaGI0ye7ioJvou8nYvf6dW6LLwLtFWWA\&asrc=run_on_apify) ``` import { CheerioCrawler, EnqueueStrategy } from 'crawlee'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); await enqueueLinks({ // Setting the strategy to 'same-domain' will enqueue all links found that are on the // same hostname as request.loadedUrl or request.url strategy: EnqueueStrategy.SameDomain, // Alternatively, you can pass in the string 'same-domain' // strategy: 'same-domain', }); }, }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` --- # Crawl a single URL Copy for LLM This example uses the [`got-scraping`](https://github.com/apify/got-scraping) npm package to grab the HTML of a web page. [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IGdvdFNjcmFwaW5nIH0gZnJvbSAnZ290LXNjcmFwaW5nJztcXG5cXG4vLyBHZXQgdGhlIEhUTUwgb2YgYSB3ZWIgcGFnZVxcbmNvbnN0IHsgYm9keSB9ID0gYXdhaXQgZ290U2NyYXBpbmcoeyB1cmw6ICdodHRwczovL3d3dy5leGFtcGxlLmNvbScgfSk7XFxuY29uc29sZS5sb2coYm9keSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.0S1i1yD10_82mLCH3VWFtCZTU4-BDrDU1UGY208IqgE\&asrc=run_on_apify) ``` import { gotScraping } from 'got-scraping'; // Get the HTML of a web page const { body } = await gotScraping({ url: 'https://www.example.com' }); console.log(body); ``` If you don't want to hard-code the URL into the script, refer to the [Accept User Input](https://crawlee.dev/js/docs/examples/accept-user-input.md) example. --- # Crawl a sitemap Copy for LLM We will crawl sitemap which tells search engines which pages and file are important in the website, it also provides valuable information about these files. 
This example builds a sitemap crawler which downloads and crawls the URLs from a sitemap, by using the [`Sitemap`](https://crawlee.dev/js/api/utils/class/Sitemap.md) utility class provided by the [`@crawlee/utils`](https://crawlee.dev/js/api/utils.md) module. * Cheerio Crawler * Puppeteer Crawler * Playwright Crawler [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBTaXRlbWFwIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBsb2cgfSkge1xcbiAgICAgICAgbG9nLmluZm8ocmVxdWVzdC51cmwpO1xcbiAgICB9LFxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiAxMCwgLy8gTGltaXRhdGlvbiBmb3Igb25seSAxMCByZXF1ZXN0cyAoZG8gbm90IHVzZSBpZiB5b3Ugd2FudCB0byBjcmF3bCBhIHNpdGVtYXApXFxufSk7XFxuXFxuY29uc3QgeyB1cmxzIH0gPSBhd2FpdCBTaXRlbWFwLmxvYWQoJ2h0dHBzOi8vY3Jhd2xlZS5kZXYvc2l0ZW1hcC54bWwnKTtcXG5cXG5hd2FpdCBjcmF3bGVyLmFkZFJlcXVlc3RzKHVybHMpO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.tV8iOCFCHW8ymY2fNGesiSri1fq3k4YmUem3HRJ4EzA\&asrc=run_on_apify) ``` import { CheerioCrawler, Sitemap } from 'crawlee'; const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ request, log }) { log.info(request.url); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap) }); const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml'); await crawler.addRequests(urls); // Run the crawler await crawler.run(); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIFNpdGVtYXAgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICAvLyBGdW5jdGlvbiBjYWxsZWQgZm9yIGVhY2ggVVJMXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYSBzaXRlbWFwKVxcbn0pO1xcblxcbmNvbnN0IHsgdXJscyB9ID0gYXdhaXQgU2l0ZW1hcC5sb2FkKCdodHRwczovL2NyYXdsZWUuZGV2L3NpdGVtYXAueG1sJyk7XFxuXFxuYXdhaXQgY3Jhd2xlci5hZGRSZXF1ZXN0cyh1cmxzKTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXJcXG5hd2FpdCBjcmF3bGVyLnJ1bigpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.xqNohmh8_of2Vvb4ZItu__LG2i404uvrtO2NqNMAkls\&asrc=run_on_apify) ``` import { PuppeteerCrawler, Sitemap } from 'crawlee'; const crawler = new PuppeteerCrawler({ // Function called for each URL async requestHandler({ request, log }) { log.info(request.url); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap) }); const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml'); await crawler.addRequests(urls); // Run the crawler await crawler.run(); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. 
```
import { PlaywrightCrawler, Sitemap } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Function called for each URL
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
    maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap)
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

await crawler.addRequests(urls);

// Run the crawler
await crawler.run();
```
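If you only need a subset of the sitemap, you can filter the loaded URLs before adding them to the queue. The snippet below is a small sketch of that idea, not part of the original example; the `/blog/` filter is a hypothetical criterion.

```
import { CheerioCrawler, Sitemap } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

// Keep only the URLs we care about (hypothetical filter).
const blogUrls = urls.filter((url) => url.includes('/blog/'));

await crawler.addRequests(blogUrls);
await crawler.run();
```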
---

# Crawl some links on a website

This [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) example uses the [`globs`](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md#globs) property in the [`enqueueLinks()`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md#enqueueLinks) method to only add links to the [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) if they match the specified pattern.

```
import { CheerioCrawler } from 'crawlee';

// Create a CheerioCrawler
const crawler = new CheerioCrawler({
    // Limits the crawler to only 10 requests (do not use if you want to crawl all links)
    maxRequestsPerCrawl: 10,
    // Function called for each URL
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        // Add some links from page to the crawler's RequestQueue
        await enqueueLinks({
            globs: ['http?(s)://crawlee.dev/*/*'],
        });
    },
});

// Define the starting URL
await crawler.addRequests(['https://crawlee.dev']);

// Run the crawler
await crawler.run();
```
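Besides `globs`, [`EnqueueLinksOptions`](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) also accepts `regexps` and `exclude`, so an include pattern can be combined with an exclusion list. The following is only a sketch of that idea, not part of the original example; the patterns used are hypothetical.

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        await enqueueLinks({
            // Match documentation pages by regular expression...
            regexps: [/https:\/\/crawlee\.dev\/js\/docs\/.*/],
            // ...but skip anything ending in /changelog (hypothetical exclusion).
            exclude: ['**/changelog'],
        });
    },
});

await crawler.run(['https://crawlee.dev']);
```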
---

# Using Puppeteer Stealth Plugin (puppeteer-extra) and playwright-extra

[`puppeteer-extra`](https://www.npmjs.com/package/puppeteer-extra) and [`playwright-extra`](https://www.npmjs.com/package/playwright-extra) are community-built libraries that bring a plugin system to [`puppeteer`](https://www.npmjs.com/package/puppeteer) and [`playwright`](https://www.npmjs.com/package/playwright) respectively, adding extra functionality such as improved stealth via the Puppeteer Stealth plugin ([`puppeteer-extra-plugin-stealth`](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth)).

Available plugins

You can see a list of available plugins on the [`puppeteer-extra` plugin list](https://www.npmjs.com/package/puppeteer-extra#plugins). For [`playwright`](https://www.npmjs.com/package/playwright), please see the [`playwright-extra` plugin list](https://www.npmjs.com/package/playwright-extra#plugins) instead.

In this example, we'll show you how to use the Puppeteer Stealth ([`puppeteer-extra-plugin-stealth`](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth)) plugin to help you avoid bot detection when crawling your target website.

* Puppeteer & puppeteer-extra
* Playwright & playwright-extra

Before you begin

Make sure you've installed the Puppeteer Extra (`puppeteer-extra`) and Puppeteer Stealth plugin (`puppeteer-extra-plugin-stealth`) packages via your preferred package manager:

```
npm install puppeteer-extra puppeteer-extra-plugin-stealth
```

tip

To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile.

src/crawler.ts

```
import { PuppeteerCrawler } from 'crawlee';
import puppeteerExtra from 'puppeteer-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

// First, we tell puppeteer-extra to use the plugin (or plugins) we want.
// Certain plugins might have options you can pass in - read up on their documentation!
puppeteerExtra.use(stealthPlugin());

// Create an instance of the PuppeteerCrawler class - a crawler
// that automatically loads the URLs in headless Chrome / Puppeteer.
const crawler = new PuppeteerCrawler({
    launchContext: {
        // !!! You need to specify this option to tell Crawlee to use puppeteer-extra as the launcher !!!
        launcher: puppeteerExtra,
        launchOptions: {
            // Other puppeteer options work as usual
            headless: true,
        },
    },

    // Stop crawling after several pages
    maxRequestsPerCrawl: 50,

    // This function will be called for each URL to crawl.
    // Here you can write the Puppeteer scripts you are familiar with,
    // with the exception that browsers and pages are automatically managed by Crawlee.
    // The function accepts a single parameter, which is an object with the following fields:
    // - request: an instance of the Request class with information such as URL and HTTP method
    // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
    async requestHandler({ pushData, request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);

        // A function to be evaluated by Puppeteer within the browser context.
        const data = await page.$$eval('.athing', ($posts) => {
            const scrapedData: { title: string; rank: string; href: string }[] = [];

            // We're getting the title, rank and URL of each post on Hacker News.
            $posts.forEach(($post) => {
                scrapedData.push({
                    title: $post.querySelector('.title a').innerText,
                    rank: $post.querySelector('.rank').innerText,
                    href: $post.querySelector('.title a').href,
                });
            });

            return scrapedData;
        });

        // Store the results to the default dataset.
        await pushData(data);

        // Find a link to the next page and enqueue it if it exists.
        const infos = await enqueueLinks({
            selector: '.morelink',
        });

        if (infos.processedRequests.length === 0) log.info(`${request.url} is the last page!`);
    },

    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});

await crawler.addRequests(['https://news.ycombinator.com/']);

// Run the crawler and wait for it to finish.
await crawler.run();

console.log('Crawler finished.');
```
Before you begin

Make sure you've installed the `playwright-extra` and `puppeteer-extra-plugin-stealth` packages via your preferred package manager:

```
npm install playwright-extra puppeteer-extra-plugin-stealth
```

tip

To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile.

src/crawler.ts

```
import { PlaywrightCrawler } from 'crawlee';

// For playwright-extra you will need to import the browser type itself that you want to use!
// By default, PlaywrightCrawler uses chromium, but you can also use firefox or webkit.
import { chromium } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

// First, we tell playwright-extra to use the plugin (or plugins) we want.
// Certain plugins might have options you can pass in - read up on their documentation!
chromium.use(stealthPlugin());

// Create an instance of the PlaywrightCrawler class - a crawler
// that automatically loads the URLs in headless Chrome / Playwright.
const crawler = new PlaywrightCrawler({
    launchContext: {
        // !!! You need to specify this option to tell Crawlee to use playwright-extra as the launcher !!!
        launcher: chromium,
        launchOptions: {
            // Other playwright options work as usual
            headless: true,
        },
    },

    // Stop crawling after several pages
    maxRequestsPerCrawl: 50,

    // This function will be called for each URL to crawl.
    // Here you can write the Playwright scripts you are familiar with,
    // with the exception that browsers and pages are automatically managed by Crawlee.
    // The function accepts a single parameter, which is an object with the following fields:
    // - request: an instance of the Request class with information such as URL and HTTP method
    // - page: Playwright's Page object (see https://playwright.dev/docs/api/class-page)
    async requestHandler({ pushData, request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);

        // A function to be evaluated by Playwright within the browser context.
        const data = await page.$$eval('.athing', ($posts) => {
            const scrapedData: { title: string; rank: string; href: string }[] = [];

            // We're getting the title, rank and URL of each post on Hacker News.
            $posts.forEach(($post) => {
                scrapedData.push({
                    title: $post.querySelector('.title a').innerText,
                    rank: $post.querySelector('.rank').innerText,
                    href: $post.querySelector('.title a').href,
                });
            });

            return scrapedData;
        });

        // Store the results to the default dataset.
        await pushData(data);

        // Find a link to the next page and enqueue it if it exists.
        const infos = await enqueueLinks({
            selector: '.morelink',
        });

        if (infos.processedRequests.length === 0) log.info(`${request.url} is the last page!`);
    },

    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});

await crawler.addRequests(['https://news.ycombinator.com/']);

// Run the crawler and wait for it to finish.
await crawler.run();

console.log('Crawler finished.');
```
---

# Export entire dataset to one file

This `Dataset` example uses the `exportToCSV` function to export the entire default dataset to a single CSV file in the default key-value store.

```
import { Dataset } from 'crawlee';

// Retrieve or generate two items to be pushed
const data = [
    {
        id: 123,
        name: 'foo',
    },
    {
        id: 456,
        name: 'bar',
    },
];

// Push the two items to the default dataset
await Dataset.pushData(data);

// Export the entirety of the dataset to a single file in
// the default key-value store under the key "OUTPUT"
await Dataset.exportToCSV('OUTPUT');
```
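If you prefer JSON output, the same dataset can be exported with `exportToJSON` instead. A minimal sketch, using the same data as above:

```
import { Dataset } from 'crawlee';

await Dataset.pushData([
    { id: 123, name: 'foo' },
    { id: 456, name: 'bar' },
]);

// Export the whole default dataset to the default key-value store
// as a single JSON file under the key "OUTPUT"
await Dataset.exportToJSON('OUTPUT');
```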
---

# Download a file

When web crawling, you sometimes need to download files such as images, PDFs, or other binary files. This example demonstrates how to download files using Crawlee and save them to the default key-value store.

The script simply downloads several files with plain HTTP requests using the custom [`FileDownload`](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) crawler class and stores their contents in the default key-value store. In local configuration, the data will be stored as files in `./storage/key_value_stores/default`.

```
import { FileDownload } from 'crawlee';

// Create a FileDownload - a custom crawler instance that will download files from URLs.
const crawler = new FileDownload({
    async requestHandler({ body, request, contentType, getKeyValueStore }) {
        const url = new URL(request.url);
        const kvs = await getKeyValueStore();

        await kvs.setValue(url.pathname.replace(/\//g, '_'), body, { contentType: contentType.type });
    },
});

// The initial list of URLs to crawl. Here we use just a few hard-coded URLs.
await crawler.addRequests([
    'https://pdfobject.com/pdf/sample.pdf',
    'https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4',
    'https://upload.wikimedia.org/wikipedia/commons/c/c8/Example.ogg',
]);

// Run the downloader and wait for it to finish.
await crawler.run();

console.log('Crawler finished.');
```
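If you would rather keep the downloads as ordinary files on disk instead of in the key-value store, you can write the response body with `node:fs/promises`. This is a small sketch, not part of the original example; the `downloads` directory name and the file naming scheme are assumptions.

```
import { mkdir, writeFile } from 'node:fs/promises';
import { FileDownload } from 'crawlee';

const crawler = new FileDownload({
    async requestHandler({ body, request, log }) {
        const url = new URL(request.url);
        // Derive a flat file name from the URL path (hypothetical naming scheme).
        const fileName = url.pathname.replace(/\//g, '_');

        await mkdir('downloads', { recursive: true });
        await writeFile(`downloads/${fileName}`, body);

        log.info(`Saved ${request.url} to downloads/${fileName}`);
    },
});

await crawler.addRequests(['https://pdfobject.com/pdf/sample.pdf']);
await crawler.run();
```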
---

# Download a file with Node.js streams

For larger files, it is more efficient to use Node.js streams to download and transfer the files. This example demonstrates how to download files using streams.

The script uses the [`FileDownload`](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) crawler class to download files with streams, log the progress, and store the data in the key-value store. In local configuration, the data will be stored as files in `./storage/key_value_stores/default`.

```
import { pipeline, Transform } from 'stream';

import { FileDownload, type Log } from 'crawlee';

// A sample Transform stream logging the download progress.
function createProgressTracker({ url, log, totalBytes }: { url: URL; log: Log; totalBytes: number }) {
    let downloadedBytes = 0;

    return new Transform({
        transform(chunk, _, callback) {
            if (downloadedBytes % 1e6 > (downloadedBytes + chunk.length) % 1e6) {
                log.info(
                    `Downloaded ${downloadedBytes / 1e6} MB (${Math.floor((downloadedBytes / totalBytes) * 100)}%) for ${url}.`,
                );
            }
            downloadedBytes += chunk.length;

            this.push(chunk);
            callback();
        },
    });
}

// Create a FileDownload - a custom crawler instance that will download files from URLs.
const crawler = new FileDownload({
    async streamHandler({ stream, request, log, getKeyValueStore }) {
        const url = new URL(request.url);

        log.info(`Downloading ${url} to ${url.pathname.replace(/\//g, '_')}...`);

        await new Promise<void>((resolve, reject) => {
            // With the 'response' event, we have received the headers of the response.
            stream.on('response', async (response) => {
                const kvs = await getKeyValueStore();
                await kvs.setValue(
                    url.pathname.replace(/\//g, '_'),
                    pipeline(
                        stream,
                        createProgressTracker({ url, log, totalBytes: Number(response.headers['content-length']) }),
                        (error) => {
                            if (error) reject(error);
                        },
                    ),
                    { contentType: response.headers['content-type'] },
                );

                log.info(`Downloaded ${url} to ${url.pathname.replace(/\//g, '_')}.`);

                resolve();
            });
        });
    },
});

// The initial list of URLs to crawl. Here we use just a few hard-coded URLs.
await crawler.addRequests([
    'https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4',
    'https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_640x360.m4v',
]);

// Run the downloader and wait for it to finish.
await crawler.run();
```

---

# Fill and Submit a Form using Puppeteer

This example demonstrates how to use [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) to automatically fill and submit a search form to look up repositories on [GitHub](https://github.com) using headless Chrome / Puppeteer. The crawler first fills in the search term, repository owner, start date and language of the repository, then submits the form and prints out the results.

Finally, the results are saved either on the Apify platform to the default [`dataset`](https://crawlee.dev/js/api/core/class/Dataset.md) or on the local machine as JSON files in `./storage/datasets/default`.

tip

To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile.
```
import { Dataset, launchPuppeteer } from 'crawlee';

// Launch the web browser.
const browser = await launchPuppeteer();

// Create and navigate new page
console.log('Open target page');
const page = await browser.newPage();
await page.goto('https://github.com/search/advanced');

// Fill form fields and select desired search options
console.log('Fill in search form');
await page.type('#adv_code_search input.js-advanced-search-input', 'apify-js');
await page.type('#search_from', 'apify');
await page.type('#search_date', '>2015');
await page.select('select#search_language', 'JavaScript');

// Submit the form and wait for full load of next page
console.log('Submit search form');
await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click('#adv_code_search button[type="submit"]'),
]);

// Obtain and print list of search results
const results = await page.$$eval('[data-testid="results-list"] div.search-title > a', (nodes) =>
    nodes.map((node) => ({
        url: node.href,
        name: node.innerText,
    })),
);

console.log('Results:', results);

// Store data in default dataset
await Dataset.pushData(results);

// Close browser
await browser.close();
```
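The same form interaction also works inside a [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) request handler, where Crawlee manages the browser and navigates to the start URL for you. This is only a sketch of that idea, reusing the selectors from the example above; it is not part of the original example.

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, pushData, log }) {
        log.info('Fill in search form');
        await page.type('#adv_code_search input.js-advanced-search-input', 'apify-js');

        log.info('Submit search form');
        await Promise.all([
            page.waitForNavigation({ waitUntil: 'networkidle2' }),
            page.click('#adv_code_search button[type="submit"]'),
        ]);

        // Collect the result links and store them in the default dataset.
        const results = await page.$$eval('[data-testid="results-list"] div.search-title > a', (nodes) =>
            nodes.map((node) => ({ url: node.href, name: node.innerText })),
        );
        await pushData(results);
    },
});

await crawler.run(['https://github.com/search/advanced']);
```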
---

# HTTP crawler

This example demonstrates how to use [`HttpCrawler`](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) to build an HTML crawler that crawls a list of URLs (for example, one read from an external file), loads each URL using a plain HTTP request, and saves the HTML.

```
import { HttpCrawler, log, LogLevel } from 'crawlee';

// Crawlers come with various utilities, e.g. for logging.
// Here we use debug level of logging to improve the debugging experience.
// This functionality is optional!
log.setLevel(LogLevel.DEBUG);

// Create an instance of the HttpCrawler class - a crawler
// that automatically loads the URLs and saves their HTML.
const crawler = new HttpCrawler({
    // The crawler downloads and processes the web pages in parallel, with a concurrency
    // automatically managed based on the available system memory and CPU (see AutoscaledPool class).
    // Here we define some hard limits for the concurrency.
    minConcurrency: 10,
    maxConcurrency: 50,

    // On error, retry each page at most once.
    maxRequestRetries: 1,

    // Increase the timeout for processing of each page.
    requestHandlerTimeoutSecs: 30,

    // Limit to 10 requests per one crawl
    maxRequestsPerCrawl: 10,

    // This function will be called for each URL to crawl.
    // It accepts a single parameter, which is an object with options as:
    // https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions#requestHandler
    // We use for demonstration only 2 of them:
    // - request: an instance of the Request class with information such as the URL that is being crawled and HTTP method
    // - body: the HTML code of the current page
    async requestHandler({ pushData, request, body }) {
        log.debug(`Processing ${request.url}...`);

        // Store the results to the dataset. In local configuration,
        // the data will be stored as JSON files in ./storage/datasets/default
        await pushData({
            url: request.url, // URL of the page
            body, // HTML code of the page
        });
    },

    // This function is called if the page processing failed more than maxRequestRetries + 1 times.
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed twice.`);
    },
});

// Run the crawler and wait for it to finish.
// It will crawl the given list of URLs, load each URL using a plain HTTP request, and save the HTML.
await crawler.run(['https://crawlee.dev']);

log.debug('Crawler finished.');
```
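If you do want to feed the crawler from an external file, one straightforward approach is to read the file with `node:fs/promises`, split it into lines and pass the result to `crawler.run()`. This is only a sketch, not part of the original example; the `urls.txt` file name (one URL per line) is an assumption.

```
import { readFile } from 'node:fs/promises';
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ pushData, request, body }) {
        await pushData({ url: request.url, body });
    },
});

// Read one URL per line from a hypothetical urls.txt file.
const fileContents = await readFile('urls.txt', 'utf-8');
const urls = fileContents
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

await crawler.run(urls);
```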
---

# JSDOM crawler

This example demonstrates how to use [`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) to interact with a website using the [jsdom](https://www.npmjs.com/package/jsdom) DOM implementation. Here the script will open a calculator app from the [React examples](https://reactjs.org/community/examples.html), click `1` `+` `1` `=` and extract the result.
```
import { JSDOMCrawler, log } from 'crawlee';

// Create an instance of the JSDOMCrawler class - crawler that automatically
// loads the URLs and parses their HTML using the jsdom library.
const crawler = new JSDOMCrawler({
    // Setting the `runScripts` option to `true` allows the crawler to execute client-side
    // JavaScript code on the page. This is required for some websites (such as the React application in this example), but may pose a security risk.
    runScripts: true,
    // This function will be called for each crawled URL.
    // Here we extract the window object from the options and use it to extract data from the page.
    requestHandler: async ({ window }) => {
        const { document } = window;
        // The `document` object is analogous to the `window.document` object you know from your favourite web browsers.
        // Thanks to this, you can use the regular browser-side APIs here.
        document.querySelectorAll('button')[12].click(); // 1
        document.querySelectorAll('button')[15].click(); // +
        document.querySelectorAll('button')[12].click(); // 1
        document.querySelectorAll('button')[18].click(); // =

        const result = document.querySelectorAll('.component-display')[0].childNodes[0] as Element;
        // The result is passed to the console. Unlike with Playwright or Puppeteer crawlers,
        // this console call goes to the Node.js console, not the browser console. All the code here runs right in Node.js!
        log.info(result.innerHTML); // 2
    },
});

// Run the crawler and wait for it to finish.
await crawler.run(['https://ahfarmer.github.io/calculator/']);

log.debug('Crawler finished.');
```

In the following example, we use [`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the [jsdom](https://www.npmjs.com/package/jsdom) DOM implementation and extract some data from it: the page title and all `h1` tags.
```
import { JSDOMCrawler, log, LogLevel } from 'crawlee';

// Crawlers come with various utilities, e.g. for logging.
// Here we use debug level of logging to improve the debugging experience.
// This functionality is optional!
log.setLevel(LogLevel.DEBUG);

// Create an instance of the JSDOMCrawler class - a crawler
// that automatically loads the URLs and parses their HTML using the jsdom library.
const crawler = new JSDOMCrawler({
    // The crawler downloads and processes the web pages in parallel, with a concurrency
    // automatically managed based on the available system memory and CPU (see AutoscaledPool class).
    // Here we define some hard limits for the concurrency.
    minConcurrency: 10,
    maxConcurrency: 50,

    // On error, retry each page at most once.
    maxRequestRetries: 1,

    // Increase the timeout for processing of each page.
    requestHandlerTimeoutSecs: 30,

    // Limit to 10 requests per one crawl
    maxRequestsPerCrawl: 10,

    // This function will be called for each URL to crawl.
    // It accepts a single parameter, which is an object with options as:
    // https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions#requestHandler
    // We use for demonstration only 2 of them:
    // - request: an instance of the Request class with information such as the URL that is being crawled and HTTP method
    // - window: the JSDOM window object
    async requestHandler({ pushData, request, window }) {
        log.debug(`Processing ${request.url}...`);

        // Extract data from the page
        const title = window.document.title;
        const h1texts: { text: string }[] = [];
        window.document.querySelectorAll('h1').forEach((element) => {
            h1texts.push({
                text: element.textContent!,
            });
        });

        // Store the results to the dataset. In local configuration,
        // the data will be stored as JSON files in ./storage/datasets/default
        await pushData({
            url: request.url,
            title,
            h1texts,
        });
    },

    // This function is called if the page processing failed more than maxRequestRetries + 1 times.
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed twice.`);
    },
});

// Run the crawler and wait for it to finish.
await crawler.run(['https://crawlee.dev']);

log.debug('Crawler finished.');
```
---

# Dataset Map and Reduce methods

This example shows an easy use-case of the [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md) [`map`](https://crawlee.dev/js/api/core/class/Dataset.md#map) and [`reduce`](https://crawlee.dev/js/api/core/class/Dataset.md#reduce) methods. Both methods can be used to simplify the dataset results workflow, and both can be called on the [dataset](https://crawlee.dev/js/api/core/class/Dataset.md) directly.

It is important to mention that both methods return a new result (`map` returns a new array and `reduce` can return any type) - neither method updates the dataset in any way.

Examples for both methods are demonstrated on a simple dataset containing the results scraped from a page: the `URL` and a hypothetical number of `h1` - `h3` header elements under the `headingCount` key.

This data structure is stored in the default dataset under `{PROJECT_FOLDER}/storage/datasets/default/`. If you want to simulate the functionality, you can use the [`dataset.pushData()`](https://crawlee.dev/js/api/core/class/Dataset.md#pushData) method to save the example `JSON array` to your dataset.

```
[
    {
        "url": "https://crawlee.dev/",
        "headingCount": 11
    },
    {
        "url": "https://crawlee.dev/storage",
        "headingCount": 8
    },
    {
        "url": "https://crawlee.dev/proxy",
        "headingCount": 4
    }
]
```

### Map

The dataset `map` method is very similar to the standard Array mapping methods. It produces a new array of values by mapping each value in the existing array through a transformation function and an options parameter.

Here the `map` method is used to check whether there are more than 5 header elements on each page:

```
import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open<{
    url: string;
    headingCount: number;
}>();

// Seeding the dataset with some data
await dataset.pushData([
    {
        url: 'https://crawlee.dev/',
        headingCount: 11,
    },
    {
        url: 'https://crawlee.dev/storage',
        headingCount: 8,
    },
    {
        url: 'https://crawlee.dev/proxy',
        headingCount: 4,
    },
]);
// Calling map function and filtering through mapped items...
const moreThan5headers = (await dataset.map((item) => item.headingCount)).filter((count) => count > 5);

// Saving the result of map to default key-value store...
await KeyValueStore.setValue('pages_with_more_than_5_headers', moreThan5headers);
```

The `moreThan5headers` variable is an array of `headingCount` values where the number of headers is greater than 5.

The `map` method's result value saved to the [`key-value store`](https://crawlee.dev/js/api/core/class/KeyValueStore.md) should be:

```
[11, 8]
```

### Reduce

The dataset `reduce` method does not produce a new array of values - it reduces a list of values down to a single value. The method iterates through the items in the dataset using the [`memo` argument](https://crawlee.dev/js/api/core/class/Dataset.md#reduce). After performing the necessary calculation, the `memo` is sent to the next iteration, while the item just processed is reduced (removed).

Using the `reduce` method to get the total number of headers scraped across all items in the dataset:

```
import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open<{
    url: string;
    headingCount: number;
}>();

// Seeding the dataset with some data
await dataset.pushData([
    {
        url: 'https://crawlee.dev/',
        headingCount: 11,
    },
    {
        url: 'https://crawlee.dev/storage',
        headingCount: 8,
    },
    {
        url: 'https://crawlee.dev/proxy',
        headingCount: 4,
    },
]);

// Calling reduce function and using memo to calculate number of headers
const pagesHeadingCount = await dataset.reduce((memo, value) => {
    return memo + value.headingCount;
}, 0);

// Saving the result of reduce to the default key-value store
await KeyValueStore.setValue('pages_heading_count', pagesHeadingCount);
```

The original dataset will be reduced to a single value, `pagesHeadingCount`, which contains the count of all headers for all scraped pages (all dataset items).
The `reduce` method's result value saved to the [`key-value store`](https://crawlee.dev/js/api/core/class/KeyValueStore.md) should be: ``` 23 ``` --- # Playwright crawler Copy for LLM This example demonstrates how to use [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) in combination with [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) to recursively scrape the [Hacker News website](https://news.ycombinator.com) using headless Chrome / Playwright. The crawler starts with a single URL, finds links to next pages, enqueues them and continues until no more desired links are available. The results are stored to the default dataset. In local configuration, the results are stored as JSON files in `./storage/datasets/default`. tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ3JlYXRlIGFuIGluc3RhbmNlIG9mIHRoZSBQbGF5d3JpZ2h0Q3Jhd2xlciBjbGFzcyAtIGEgY3Jhd2xlclxcbi8vIHRoYXQgYXV0b21hdGljYWxseSBsb2FkcyB0aGUgVVJMcyBpbiBoZWFkbGVzcyBDaHJvbWUgLyBQbGF5d3JpZ2h0LlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICBsYXVuY2hDb250ZXh0OiB7XFxuICAgICAgICAvLyBIZXJlIHlvdSBjYW4gc2V0IG9wdGlvbnMgdGhhdCBhcmUgcGFzc2VkIHRvIHRoZSBwbGF5d3JpZ2h0IC5sYXVuY2goKSBmdW5jdGlvbi5cXG4gICAgICAgIGxhdW5jaE9wdGlvbnM6IHtcXG4gICAgICAgICAgICBoZWFkbGVzczogdHJ1ZSxcXG4gICAgICAgIH0sXFxuICAgIH0sXFxuXFxuICAgIC8vIFN0b3AgY3Jhd2xpbmcgYWZ0ZXIgc2V2ZXJhbCBwYWdlc1xcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG5cXG4gICAgLy8gVGhpcyBmdW5jdGlvbiB3aWxsIGJlIGNhbGxlZCBmb3IgZWFjaCBVUkwgdG8gY3Jhd2wuXFxuICAgIC8vIEhlcmUgeW91IGNhbiB3cml0ZSB0aGUgUGxheXdyaWdodCBzY3JpcHRzIHlvdSBhcmUgZmFtaWxpYXIgd2l0aCxcXG4gICAgLy8gd2l0aCB0aGUgZXhjZXB0aW9uIHRoYXQgYnJvd3NlcnMgYW5kIHBhZ2VzIGFyZSBhdXRvbWF0aWNhbGx5IG1hbmFnZWQgYnkgQ3Jhd2xlZS5cXG4gICAgLy8gVGhlIGZ1bmN0aW9uIGFjY2VwdHMgYSBzaW5nbGUgcGFyYW1ldGVyLCB3aGljaCBpcyBhbiBvYmplY3Qgd2l0aCBhIGxvdCBvZiBwcm9wZXJ0aWVzLFxcbiAgICAvLyB0aGUgbW9zdCBpbXBvcnRhbnQgYmVpbmc6XFxuICAgIC8vIC0gcmVxdWVzdDogYW4gaW5zdGFuY2Ugb2YgdGhlIFJlcXVlc3QgY2xhc3Mgd2l0aCBpbmZvcm1hdGlvbiBzdWNoIGFzIFVSTCBhbmQgSFRUUCBtZXRob2RcXG4gICAgLy8gLSBwYWdlOiBQbGF5d3JpZ2h0J3MgUGFnZSBvYmplY3QgKHNlZSBodHRwczovL3BsYXl3cmlnaHQuZGV2L2RvY3MvYXBpL2NsYXNzLXBhZ2UpXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcHVzaERhdGEsIHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKGBQcm9jZXNzaW5nICR7cmVxdWVzdC51cmx9Li4uYCk7XFxuXFxuICAgICAgICAvLyBBIGZ1bmN0aW9uIHRvIGJlIGV2YWx1YXRlZCBieSBQbGF5d3JpZ2h0IHdpdGhpbiB0aGUgYnJvd3NlciBjb250ZXh0LlxcbiAgICAgICAgY29uc3QgZGF0YSA9IGF3YWl0IHBhZ2UuJCRldmFsKCcuYXRoaW5nJywgKCRwb3N0cykgPT4ge1xcbiAgICAgICAgICAgIGNvbnN0IHNjcmFwZWREYXRhOiB7IHRpdGxlOiBzdHJpbmc7IHJhbms6IHN0cmluZzsgaHJlZjogc3RyaW5nIH1bXSA9IFtdO1xcblxcbiAgICAgICAgICAgIC8vIFdlJ3JlIGdldHRpbmcgdGhlIHRpdGxlLCByYW5rIGFuZCBVUkwgb2YgZWFjaCBwb3N0IG9uIEhhY2tlciBOZXdzLlxcbiAgICAgICAgICAgICRwb3N0cy5mb3JFYWNoKCgkcG9zdCkgPT4ge1xcbiAgICAgICAgICAgICAgICBzY3JhcGVkRGF0YS5wdXNoKHtcXG4gICAgICAgICAgICAgICAgICAgIHRpdGxlOiAkcG9zdC5xdWVyeVNlbGVjdG9yKCcudGl0bGUgYScpLmlubmVyVGV4dCxcXG4gICAgICAgICAgICAgICAgICAgIHJhbms6ICRwb3N0LnF1ZXJ5U2VsZWN0b3IoJy5yYW5rJykuaW5uZXJUZXh0LFxcbiAgICAgICAgICAgICAgICAgICAgaHJlZjogJHBvc3QucXVlcnlTZWxlY3RvcignLnRpdGxlIGEnKS5ocmVmLFxcbiAgICAgICAgICAgICAgICB9KTtcXG4gICAgICAgICAgICB9KTtcXG5cXG4gICAgICAgICAgICByZXR1cm4gc2NyYXBlZERhdGE7XFxuICAg
ICAgICB9KTtcXG5cXG4gICAgICAgIC8vIFN0b3JlIHRoZSByZXN1bHRzIHRvIHRoZSBkZWZhdWx0IGRhdGFzZXQuXFxuICAgICAgICBhd2FpdCBwdXNoRGF0YShkYXRhKTtcXG5cXG4gICAgICAgIC8vIEZpbmQgYSBsaW5rIHRvIHRoZSBuZXh0IHBhZ2UgYW5kIGVucXVldWUgaXQgaWYgaXQgZXhpc3RzLlxcbiAgICAgICAgY29uc3QgaW5mb3MgPSBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgIHNlbGVjdG9yOiAnLm1vcmVsaW5rJyxcXG4gICAgICAgIH0pO1xcblxcbiAgICAgICAgaWYgKGluZm9zLnByb2Nlc3NlZFJlcXVlc3RzLmxlbmd0aCA9PT0gMCkgbG9nLmluZm8oYCR7cmVxdWVzdC51cmx9IGlzIHRoZSBsYXN0IHBhZ2UhYCk7XFxuICAgIH0sXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gaXMgY2FsbGVkIGlmIHRoZSBwYWdlIHByb2Nlc3NpbmcgZmFpbGVkIG1vcmUgdGhhbiBtYXhSZXF1ZXN0UmV0cmllcysxIHRpbWVzLlxcbiAgICBmYWlsZWRSZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGxvZyB9KSB7XFxuICAgICAgICBsb2cuaW5mbyhgUmVxdWVzdCAke3JlcXVlc3QudXJsfSBmYWlsZWQgdG9vIG1hbnkgdGltZXMuYCk7XFxuICAgIH0sXFxufSk7XFxuXFxuYXdhaXQgY3Jhd2xlci5hZGRSZXF1ZXN0cyhbJ2h0dHBzOi8vbmV3cy55Y29tYmluYXRvci5jb20vJ10pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciBhbmQgd2FpdCBmb3IgaXQgdG8gZmluaXNoLlxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXFxuY29uc29sZS5sb2coJ0NyYXdsZXIgZmluaXNoZWQuJyk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.TlfLVk0_w85cLtPnSSQQTafQ-FuVCpbtoSLrLFjMnS4\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; // Create an instance of the PlaywrightCrawler class - a crawler // that automatically loads the URLs in headless Chrome / Playwright. const crawler = new PlaywrightCrawler({ launchContext: { // Here you can set options that are passed to the playwright .launch() function. launchOptions: { headless: true, }, }, // Stop crawling after several pages maxRequestsPerCrawl: 50, // This function will be called for each URL to crawl. // Here you can write the Playwright scripts you are familiar with, // with the exception that browsers and pages are automatically managed by Crawlee. // The function accepts a single parameter, which is an object with a lot of properties, // the most important being: // - request: an instance of the Request class with information such as URL and HTTP method // - page: Playwright's Page object (see https://playwright.dev/docs/api/class-page) async requestHandler({ pushData, request, page, enqueueLinks, log }) { log.info(`Processing ${request.url}...`); // A function to be evaluated by Playwright within the browser context. const data = await page.$$eval('.athing', ($posts) => { const scrapedData: { title: string; rank: string; href: string }[] = []; // We're getting the title, rank and URL of each post on Hacker News. $posts.forEach(($post) => { scrapedData.push({ title: $post.querySelector('.title a').innerText, rank: $post.querySelector('.rank').innerText, href: $post.querySelector('.title a').href, }); }); return scrapedData; }); // Store the results to the default dataset. await pushData(data); // Find a link to the next page and enqueue it if it exists. const infos = await enqueueLinks({ selector: '.morelink', }); if (infos.processedRequests.length === 0) log.info(`${request.url} is the last page!`); }, // This function is called if the page processing failed more than maxRequestRetries+1 times. failedRequestHandler({ request, log }) { log.info(`Request ${request.url} failed too many times.`); }, }); await crawler.addRequests(['https://news.ycombinator.com/']); // Run the crawler and wait for it to finish. 
await crawler.run(); console.log('Crawler finished.'); ``` --- # Using Firefox browser with Playwright crawler Copy for LLM This example demonstrates how to use [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) with headless Firefox browser. tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-firefox` image for your Dockerfile. [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuaW1wb3J0IHsgZmlyZWZveCB9IGZyb20gJ3BsYXl3cmlnaHQnO1xcblxcbi8vIENyZWF0ZSBhbiBpbnN0YW5jZSBvZiB0aGUgUGxheXdyaWdodENyYXdsZXIgY2xhc3MuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIGxhdW5jaENvbnRleHQ6IHtcXG4gICAgICAgIC8vIFNldCB0aGUgRmlyZWZveCBicm93c2VyIHRvIGJlIHVzZWQgYnkgdGhlIGNyYXdsZXIuXFxuICAgICAgICAvLyBJZiBsYXVuY2hlciBvcHRpb24gaXMgbm90IHNwZWNpZmllZCBoZXJlLFxcbiAgICAgICAgLy8gZGVmYXVsdCBDaHJvbWl1bSBicm93c2VyIHdpbGwgYmUgdXNlZC5cXG4gICAgICAgIGxhdW5jaGVyOiBmaXJlZm94LFxcbiAgICB9LFxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGxvZyB9KSB7XFxuICAgICAgICBjb25zdCBwYWdlVGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuXFxuICAgICAgICBsb2cuaW5mbyhgVVJMOiAke3JlcXVlc3QubG9hZGVkVXJsfSB8IFBhZ2UgdGl0bGU6ICR7cGFnZVRpdGxlfWApO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoWydodHRwczovL2V4YW1wbGUuY29tJ10pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciBhbmQgd2FpdCBmb3IgaXQgdG8gZmluaXNoLlxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.vWp2zchK13fPXAQRaau5xNiHOhbCxKML5odaC7BwEDU\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; import { firefox } from 'playwright'; // Create an instance of the PlaywrightCrawler class. const crawler = new PlaywrightCrawler({ launchContext: { // Set the Firefox browser to be used by the crawler. // If launcher option is not specified here, // default Chromium browser will be used. launcher: firefox, }, async requestHandler({ request, page, log }) { const pageTitle = await page.title(); log.info(`URL: ${request.loadedUrl} | Page title: ${pageTitle}`); }, }); await crawler.addRequests(['https://example.com']); // Run the crawler and wait for it to finish. await crawler.run(); ``` To see a real-world example of how to use [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) in combination with [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) to recursively scrape the [Hacker News website](https://news.ycombinator.com) check out the [`Playwright crawler example`](https://crawlee.dev/js/docs/examples/playwright-crawler.md). --- # Puppeteer crawler Copy for LLM This example demonstrates how to use [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) in combination with [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) to recursively scrape the [Hacker News website](https://news.ycombinator.com) using headless Chrome / Puppeteer. The crawler starts with a single URL, finds links to next pages, enqueues them and continues until no more desired links are available. The results are stored to the default dataset. 
In local configuration, the results are stored as JSON files in `./storage/datasets/default` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG4vLyBDcmVhdGUgYW4gaW5zdGFuY2Ugb2YgdGhlIFB1cHBldGVlckNyYXdsZXIgY2xhc3MgLSBhIGNyYXdsZXJcXG4vLyB0aGF0IGF1dG9tYXRpY2FsbHkgbG9hZHMgdGhlIFVSTHMgaW4gaGVhZGxlc3MgQ2hyb21lIC8gUHVwcGV0ZWVyLlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUHVwcGV0ZWVyQ3Jhd2xlcih7XFxuICAgIC8vIEhlcmUgeW91IGNhbiBzZXQgb3B0aW9ucyB0aGF0IGFyZSBwYXNzZWQgdG8gdGhlIGxhdW5jaFB1cHBldGVlcigpIGZ1bmN0aW9uLlxcbiAgICBsYXVuY2hDb250ZXh0OiB7XFxuICAgICAgICBsYXVuY2hPcHRpb25zOiB7XFxuICAgICAgICAgICAgaGVhZGxlc3M6IHRydWUsXFxuICAgICAgICAgICAgLy8gT3RoZXIgUHVwcGV0ZWVyIG9wdGlvbnNcXG4gICAgICAgIH0sXFxuICAgIH0sXFxuXFxuICAgIC8vIFN0b3AgY3Jhd2xpbmcgYWZ0ZXIgc2V2ZXJhbCBwYWdlc1xcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG5cXG4gICAgLy8gVGhpcyBmdW5jdGlvbiB3aWxsIGJlIGNhbGxlZCBmb3IgZWFjaCBVUkwgdG8gY3Jhd2wuXFxuICAgIC8vIEhlcmUgeW91IGNhbiB3cml0ZSB0aGUgUHVwcGV0ZWVyIHNjcmlwdHMgeW91IGFyZSBmYW1pbGlhciB3aXRoLFxcbiAgICAvLyB3aXRoIHRoZSBleGNlcHRpb24gdGhhdCBicm93c2VycyBhbmQgcGFnZXMgYXJlIGF1dG9tYXRpY2FsbHkgbWFuYWdlZCBieSBDcmF3bGVlLlxcbiAgICAvLyBUaGUgZnVuY3Rpb24gYWNjZXB0cyBhIHNpbmdsZSBwYXJhbWV0ZXIsIHdoaWNoIGlzIGFuIG9iamVjdCB3aXRoIHRoZSBmb2xsb3dpbmcgZmllbGRzOlxcbiAgICAvLyAtIHJlcXVlc3Q6IGFuIGluc3RhbmNlIG9mIHRoZSBSZXF1ZXN0IGNsYXNzIHdpdGggaW5mb3JtYXRpb24gc3VjaCBhcyBVUkwgYW5kIEhUVFAgbWV0aG9kXFxuICAgIC8vIC0gcGFnZTogUHVwcGV0ZWVyJ3MgUGFnZSBvYmplY3QgKHNlZSBodHRwczovL3BwdHIuZGV2LyNzaG93PWFwaS1jbGFzcy1wYWdlKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHB1c2hEYXRhLCByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MsIGxvZyB9KSB7XFxuICAgICAgICBsb2cuaW5mbyhgUHJvY2Vzc2luZyAke3JlcXVlc3QudXJsfS4uLmApO1xcblxcbiAgICAgICAgLy8gQSBmdW5jdGlvbiB0byBiZSBldmFsdWF0ZWQgYnkgUHVwcGV0ZWVyIHdpdGhpbiB0aGUgYnJvd3NlciBjb250ZXh0LlxcbiAgICAgICAgY29uc3QgZGF0YSA9IGF3YWl0IHBhZ2UuJCRldmFsKCcuYXRoaW5nJywgKCRwb3N0cykgPT4ge1xcbiAgICAgICAgICAgIGNvbnN0IHNjcmFwZWREYXRhOiB7IHRpdGxlOiBzdHJpbmc7IHJhbms6IHN0cmluZzsgaHJlZjogc3RyaW5nIH1bXSA9IFtdO1xcblxcbiAgICAgICAgICAgIC8vIFdlJ3JlIGdldHRpbmcgdGhlIHRpdGxlLCByYW5rIGFuZCBVUkwgb2YgZWFjaCBwb3N0IG9uIEhhY2tlciBOZXdzLlxcbiAgICAgICAgICAgICRwb3N0cy5mb3JFYWNoKCgkcG9zdCkgPT4ge1xcbiAgICAgICAgICAgICAgICBzY3JhcGVkRGF0YS5wdXNoKHtcXG4gICAgICAgICAgICAgICAgICAgIHRpdGxlOiAkcG9zdC5xdWVyeVNlbGVjdG9yKCcudGl0bGUgYScpLmlubmVyVGV4dCxcXG4gICAgICAgICAgICAgICAgICAgIHJhbms6ICRwb3N0LnF1ZXJ5U2VsZWN0b3IoJy5yYW5rJykuaW5uZXJUZXh0LFxcbiAgICAgICAgICAgICAgICAgICAgaHJlZjogJHBvc3QucXVlcnlTZWxlY3RvcignLnRpdGxlIGEnKS5ocmVmLFxcbiAgICAgICAgICAgICAgICB9KTtcXG4gICAgICAgICAgICB9KTtcXG5cXG4gICAgICAgICAgICByZXR1cm4gc2NyYXBlZERhdGE7XFxuICAgICAgICB9KTtcXG5cXG4gICAgICAgIC8vIFN0b3JlIHRoZSByZXN1bHRzIHRvIHRoZSBkZWZhdWx0IGRhdGFzZXQuXFxuICAgICAgICBhd2FpdCBwdXNoRGF0YShkYXRhKTtcXG5cXG4gICAgICAgIC8vIEZpbmQgYSBsaW5rIHRvIHRoZSBuZXh0IHBhZ2UgYW5kIGVucXVldWUgaXQgaWYgaXQgZXhpc3RzLlxcbiAgICAgICAgY29uc3QgaW5mb3MgPSBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgIHNlbGVjdG9yOiAnLm1vcmVsaW5rJyxcXG4gICAgICAgIH0pO1xcblxcbiAgICAgICAgaWYgKGluZm9zLnByb2Nlc3NlZFJlcXVlc3RzLmxlbmd0aCA9PT0gMCkgbG9nLmluZm8oYCR7cmVxdWVzdC51cmx9IGlzIHRoZSBsYXN0IHBhZ2UhYCk7XFxuICAgIH0sXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gaXMgY2FsbGVkIGlmIHRoZSBwYWdlIHByb2Nlc3NpbmcgZmFpbGVkIG1vcmUgdGhhbiBtYXhSZXF1ZXN0UmV0cmllcysxIHRpbWVzLlxcbiAgICBmYWlsZWRSZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGxvZyB9KSB7XFxuICAgICAgICBsb2cu
ZXJyb3IoYFJlcXVlc3QgJHtyZXF1ZXN0LnVybH0gZmFpbGVkIHRvbyBtYW55IHRpbWVzLmApO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoWydodHRwczovL25ld3MueWNvbWJpbmF0b3IuY29tLyddKTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXIgYW5kIHdhaXQgZm9yIGl0IHRvIGZpbmlzaC5cXG5hd2FpdCBjcmF3bGVyLnJ1bigpO1xcblxcbmNvbnNvbGUubG9nKCdDcmF3bGVyIGZpbmlzaGVkLicpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.fcPpn-wBUel6Dcmcwc2b40Zm-wNiTcsgDbJ5_nWKYts\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; // Create an instance of the PuppeteerCrawler class - a crawler // that automatically loads the URLs in headless Chrome / Puppeteer. const crawler = new PuppeteerCrawler({ // Here you can set options that are passed to the launchPuppeteer() function. launchContext: { launchOptions: { headless: true, // Other Puppeteer options }, }, // Stop crawling after several pages maxRequestsPerCrawl: 50, // This function will be called for each URL to crawl. // Here you can write the Puppeteer scripts you are familiar with, // with the exception that browsers and pages are automatically managed by Crawlee. // The function accepts a single parameter, which is an object with the following fields: // - request: an instance of the Request class with information such as URL and HTTP method // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page) async requestHandler({ pushData, request, page, enqueueLinks, log }) { log.info(`Processing ${request.url}...`); // A function to be evaluated by Puppeteer within the browser context. const data = await page.$$eval('.athing', ($posts) => { const scrapedData: { title: string; rank: string; href: string }[] = []; // We're getting the title, rank and URL of each post on Hacker News. $posts.forEach(($post) => { scrapedData.push({ title: $post.querySelector('.title a').innerText, rank: $post.querySelector('.rank').innerText, href: $post.querySelector('.title a').href, }); }); return scrapedData; }); // Store the results to the default dataset. await pushData(data); // Find a link to the next page and enqueue it if it exists. const infos = await enqueueLinks({ selector: '.morelink', }); if (infos.processedRequests.length === 0) log.info(`${request.url} is the last page!`); }, // This function is called if the page processing failed more than maxRequestRetries+1 times. failedRequestHandler({ request, log }) { log.error(`Request ${request.url} failed too many times.`); }, }); await crawler.addRequests(['https://news.ycombinator.com/']); // Run the crawler and wait for it to finish. await crawler.run(); console.log('Crawler finished.'); ``` --- # Puppeteer recursive crawl Copy for LLM Run the following example to perform a recursive crawl of a website using [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md). tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHRpdGxlID0gYXdhaXQgcGFnZS50aXRsZSgpO1xcbiAgICAgICAgbG9nLmluZm8oYFRpdGxlIG9mICR7cmVxdWVzdC51cmx9OiAke3RpdGxlfWApO1xcblxcbiAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKHtcXG4gICAgICAgICAgICBnbG9iczogWydodHRwPyhzKTovL3d3dy5pYW5hLm9yZy8qKiddLFxcbiAgICAgICAgfSk7XFxuICAgIH0sXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoWydodHRwczovL3d3dy5pYW5hLm9yZy8nXSk7XFxuXFxuYXdhaXQgY3Jhd2xlci5ydW4oKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.WX9qygDqmffD0uvnNe4zaDatAVvSiCm1XcrSGPwvh6g\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.url}: ${title}`); await enqueueLinks({ globs: ['http?(s)://www.iana.org/**'], }); }, maxRequestsPerCrawl: 10, }); await crawler.addRequests(['https://www.iana.org/']); await crawler.run(); ``` --- # Skipping navigations for certain requests Copy for LLM While crawling a website, you may encounter certain resources you'd like to save, but don't need the full power of a crawler to do so (like images delivered through a CDN). By combining the [`Request#skipNavigation`](https://crawlee.dev/js/api/core/class/Request.md#skipNavigation) option with [`sendRequest`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#sendRequest), we can fetch the image from the CDN, and save it to our key-value store without needing to use the full crawler. info For this example, we are using the [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) to showcase this, but this is available on all the crawlers we provide. 
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyLCBLZXlWYWx1ZVN0b3JlIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ3JlYXRlIGEga2V5IHZhbHVlIHN0b3JlIGZvciBhbGwgaW1hZ2VzIHdlIGZpbmRcXG5jb25zdCBpbWFnZVN0b3JlID0gYXdhaXQgS2V5VmFsdWVTdG9yZS5vcGVuKCdpbWFnZXMnKTtcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFBsYXl3cmlnaHRDcmF3bGVyKHtcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBzZW5kUmVxdWVzdCB9KSB7XFxuICAgICAgICAvLyBUaGUgcmVxdWVzdCBzaG91bGQgaGF2ZSB0aGUgbmF2aWdhdGlvbiBza2lwcGVkXFxuICAgICAgICBpZiAocmVxdWVzdC5za2lwTmF2aWdhdGlvbikge1xcbiAgICAgICAgICAgIC8vIFJlcXVlc3QgdGhlIGltYWdlIGFuZCBnZXQgaXRzIGJ1ZmZlciBiYWNrXFxuICAgICAgICAgICAgY29uc3QgaW1hZ2VSZXNwb25zZSA9IGF3YWl0IHNlbmRSZXF1ZXN0KHsgcmVzcG9uc2VUeXBlOiAnYnVmZmVyJyB9KTtcXG5cXG4gICAgICAgICAgICAvLyBTYXZlIHRoZSBpbWFnZSBpbiB0aGUga2V5LXZhbHVlIHN0b3JlXFxuICAgICAgICAgICAgYXdhaXQgaW1hZ2VTdG9yZS5zZXRWYWx1ZShgJHtyZXF1ZXN0LnVzZXJEYXRhLmtleX0ucG5nYCwgaW1hZ2VSZXNwb25zZS5ib2R5KTtcXG5cXG4gICAgICAgICAgICAvLyBQcmV2ZW50IGV4ZWN1dGluZyB0aGUgcmVzdCBvZiB0aGUgY29kZSBhcyB3ZSBkbyBub3QgbmVlZCBpdFxcbiAgICAgICAgICAgIHJldHVybjtcXG4gICAgICAgIH1cXG5cXG4gICAgICAgIC8vIEdldCBhbGwgdGhlIGltYWdlIHNvdXJjZXMgaW4gdGhlIGN1cnJlbnQgcGFnZVxcbiAgICAgICAgY29uc3QgaW1hZ2VzID0gYXdhaXQgcGFnZS4kJGV2YWwoJ2ltZycsIChpbWdzKSA9PiBpbWdzLm1hcCgoaW1nKSA9PiBpbWcuc3JjKSk7XFxuXFxuICAgICAgICAvLyBBZGQgYWxsIHRoZSB1cmxzIGFzIHJlcXVlc3RzIGZvciB0aGUgY3Jhd2xlciwgZ2l2aW5nIGVhY2ggaW1hZ2UgYSBrZXlcXG4gICAgICAgIGF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoaW1hZ2VzLm1hcCgodXJsLCBpKSA9PiAoeyB1cmwsIHNraXBOYXZpZ2F0aW9uOiB0cnVlLCB1c2VyRGF0YTogeyBrZXk6IGkgfSB9KSkpO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoWydodHRwczovL2NyYXdsZWUuZGV2J10pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.cNsd2-DLQUjMSHwY8npJ3Im5Ffh-jGfcpADCVsdj91U\&asrc=run_on_apify) ``` import { PlaywrightCrawler, KeyValueStore } from 'crawlee'; // Create a key value store for all images we find const imageStore = await KeyValueStore.open('images'); const crawler = new PlaywrightCrawler({ async requestHandler({ request, page, sendRequest }) { // The request should have the navigation skipped if (request.skipNavigation) { // Request the image and get its buffer back const imageResponse = await sendRequest({ responseType: 'buffer' }); // Save the image in the key-value store await imageStore.setValue(`${request.userData.key}.png`, imageResponse.body); // Prevent executing the rest of the code as we do not need it return; } // Get all the image sources in the current page const images = await page.$$eval('img', (imgs) => imgs.map((img) => img.src)); // Add all the urls as requests for the crawler, giving each image a key await crawler.addRequests(images.map((url, i) => ({ url, skipNavigation: true, userData: { key: i } }))); }, }); await crawler.addRequests(['https://crawlee.dev']); // Run the crawler await crawler.run(); ``` --- ## [📄️ Request Locking](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) [Parallelize crawlers with ease using request locking](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) --- # Request Locking Copy for LLM Release announcement As of **May 2024** (`crawlee` version `3.10.0`), this experiment is now enabled by default! 
With that said, if you encounter issues you can: * set `requestLocking` to `false` in the `experiments` object of your crawler options * update all imports of `RequestQueue` to `RequestQueueV1` * open an issue on our [GitHub repository](https://github.com/apify/crawlee) The content below is kept for documentation purposes. If you're interested in the changes, you can read the [blog post about the new Request Queue storage system on the Apify blog](https://blog.apify.com/new-apify-request-queue/). *** caution This is an experimental feature. While we welcome testers, keep in mind that it is currently not recommended to use this in production. The API is subject to change, and we might introduce breaking changes in the future. Should you be using this, feel free to open issues on our [GitHub repository](https://github.com/apify/crawlee), and we'll take a look. Starting with `crawlee` version `3.5.5`, we have introduced a new crawler option that lets you enable using a new request locking API. With this API, you will be able to pass a `RequestQueue` to multiple crawlers to parallelize the crawling process. Keep in mind The request queue that supports request locking is currently exported via the `RequestQueueV2` class. Once the experiment is over, this class will replace the current `RequestQueue` class ## How to enable the experiment[​](#how-to-enable-the-experiment "Direct link to How to enable the experiment") ### In crawlers[​](#in-crawlers "Direct link to In crawlers") note This example shows how to enable the experiment in the [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), but you can apply this to any crawler type. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ experiments: { requestLocking: true, }, async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); }, }); await crawler.run(['https://crawlee.dev']); ``` ### Outside crawlers (to setup your own request queue that supports locking)[​](#outside-crawlers-to-setup-your-own-request-queue-that-supports-locking "Direct link to Outside crawlers (to setup your own request queue that supports locking)") Previously, you would import `RequestQueue` from `crawlee`. To switch to the queue that supports locking, you need to import `RequestQueueV2` instead. ``` import { RequestQueueV2 } from 'crawlee'; const queue = await RequestQueueV2.open('my-locking-queue'); await queue.addRequests([ { url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/js/docs' }, { url: 'https://crawlee.dev/js/api' }, ]); ``` ### Using the new request queue in crawlers[​](#using-the-new-request-queue-in-crawlers "Direct link to Using the new request queue in crawlers") If you make your own request queue that supports locking, you will also need to enable the experiment in your crawlers. danger If you do not enable the experiment, you will receive a runtime error and the crawler will not start. 
``` import { CheerioCrawler, RequestQueueV2 } from 'crawlee'; const queue = await RequestQueueV2.open('my-locking-queue'); const crawler = new CheerioCrawler({ experiments: { requestLocking: true, }, requestQueue: queue, async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); }, }); await crawler.run(); ``` ## Other changes[​](#other-changes "Direct link to Other changes") info This section is only useful if you're a tinkerer and want to see what's going on under the hood. In order to facilitate the new request locking API, as well as keep both the current request queue logic and the new, locking-based request queue logic, we have implemented a common starting point called `RequestProvider`. This class implements almost all functions by default, but expects you, the developer, to implement the following methods: `fetchNextRequest` and `ensureHeadIsNotEmpty`. These methods are responsible for loading and returning requests to process, and for telling crawlers whether there are more requests to process. You can use this base class to implement your own request providers if you need to fetch requests from a different source. tip We recommend you use TypeScript when implementing your own request provider, as it comes with suggestions for the abstract methods, as well as giving you the exact types you need to return. --- # System Information V2 Copy for LLM caution This is an experimental feature. While we welcome testers, keep in mind that it is currently not recommended to use this in production. The API is subject to change, and we might introduce breaking changes in the future. Should you be using this, feel free to open issues on our [GitHub repository](https://github.com/apify/crawlee), and we'll take a look. Starting with the newest `crawlee` beta, we have introduced a new crawler option that enables an improved metric collection system. This new system should collect CPU and memory metrics more accurately in containerized environments by checking for cgroup-enforced limits. ## How to enable the experiment[​](#how-to-enable-the-experiment "Direct link to How to enable the experiment") note This example shows how to enable the experiment in the [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), but you can apply this to any crawler type. ``` import { CheerioCrawler, Configuration } from 'crawlee'; Configuration.set('systemInfoV2', true); const crawler = new CheerioCrawler({ async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); }, }); await crawler.run(['https://crawlee.dev']); ``` ## Other changes[​](#other-changes "Direct link to Other changes") info This section is only useful if you're a tinkerer and want to see what's going on under the hood. The existing solution checked the bare-metal metrics for how much CPU and memory was being used and how much headroom was available. This is an intuitive solution, but it unfortunately doesn't account for cases where there is an external limit on the amount of resources a process can consume. This is often the case in containerized environments, where each container has a quota for its CPU and memory usage. This experiment attempts to address this issue by introducing a new `isContainerized()` utility function and changing the way resources are collected when a container is detected. 
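As a rough illustration (not Crawlee's actual implementation), a containerization probe of this kind can combine an explicit override with a few filesystem and environment checks - the same checks listed later under the `CRAWLEE_CONTAINERIZED` environment variable in the Configuration guide. The helper name and exact logic below are a hypothetical sketch:

```
import { access, readFile } from 'node:fs/promises';

// Hypothetical sketch of a containerization check; the real isContainerized()
// helper in Crawlee may differ in signature and behavior. The real helper also
// returns false when running on AWS Lambda (isLambda()), which is omitted here.
async function looksContainerized(): Promise<boolean> {
    // An explicit override takes precedence (mirrors the CRAWLEE_CONTAINERIZED env var).
    const override = process.env.CRAWLEE_CONTAINERIZED;
    if (override !== undefined) return ['1', 'true'].includes(override.toLowerCase());

    // Kubernetes sets this variable in every pod.
    if (process.env.KUBERNETES_SERVICE_HOST) return true;

    // Docker creates /.dockerenv inside containers.
    const hasDockerEnv = await access('/.dockerenv').then(() => true, () => false);
    if (hasDockerEnv) return true;

    // Fall back to looking for a docker entry in the process cgroup file.
    const cgroup = await readFile('/proc/self/cgroup', 'utf8').catch(() => '');
    return cgroup.includes('docker');
}

console.log(await looksContainerized());
```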
note This `isContainerized()` function is very similar to the existing `isDocker()` function; however, for now they both work side by side. If this experiment is successful, `isDocker()` may eventually be deprecated in favour of `isContainerized()`. ### Cgroup detection[​](#cgroup-detection "Direct link to Cgroup detection") On Linux, to detect if cgroup is available, we check if there is a directory at `/sys/fs/cgroup`. If the directory exists, a version of cgroup is installed. Next, we determine which version of cgroup is installed by checking for a directory at `/sys/fs/cgroup/memory/`. If it exists, cgroup V1 is installed. If it is missing, it is assumed cgroup V2 is installed. ### CPU metric collection[​](#cpu-metric-collection "Direct link to CPU metric collection") The existing solution worked by checking the fraction of idle CPU ticks to the total number of CPU ticks since the last profile. If 100,000 ticks elapse and 5,000 were idle, the CPU is at 95% utilization. In this experiment, the method of CPU load calculation depends on the result of `isContainerized()` or, if set, the `CRAWLEE_CONTAINERIZED` environment variable. If `isContainerized()` returns true, the new cgroup-aware metric collection will be used instead of the "bare metal" numbers. This works by inspecting the `/sys/fs/cgroup/cpuacct/cpuacct.usage`, `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `/sys/fs/cgroup/cpu/cpu.cfs_period_us` files for cgroup V1 and the `/sys/fs/cgroup/cpu.stat` and `/sys/fs/cgroup/cpu.max` files for cgroup V2. The actual CPU usage figure is calculated in the same manner as the "bare metal" figure, by comparing the total number of ticks elapsed to the number of idle ticks between profiles, but using the figures from the cgroup files. If no cgroup quota is enforced, the "bare metal" numbers will be used. ### Memory metric collection[​](#memory-metric-collection "Direct link to Memory metric collection") The existing solution was already cgroup-aware; however, an improvement has been made to memory metric collection when running on Windows. The existing solution used an external package, `apify/ps-tree`, to find the amount of memory Crawlee and any child processes were using. On Windows, this package used the deprecated "WMIC" command-line utility to determine memory usage. In this experiment, `apify/ps-tree` has been removed and replaced by the `packages/utils/src/internals/ps-tree.ts` file. This works in much the same manner; however, instead of using "WMIC", it uses PowerShell to collect the same data. --- ## [📄️ Request Storage](https://crawlee.dev/js/docs/guides/request-storage.md) [How to store the requests your crawler will go through](https://crawlee.dev/js/docs/guides/request-storage.md) --- # Avoid getting blocked Copy for LLM A scraper might get blocked for numerous reasons. Let's narrow it down to the two main ones. The first is a bad or blocked IP address. You can learn about this topic in the [proxy management guide](https://crawlee.dev/js/docs/guides/proxy-management.md). The second reason is [browser fingerprints](https://pixelprivacy.com/resources/browser-fingerprinting/) (or signatures), which we will explore more in this guide. Check the [Apify Academy anti-scraping course](https://docs.apify.com/academy/anti-scraping) to gain a deeper theoretical understanding of blocking and learn a few tips and tricks. A browser fingerprint is a collection of browser attributes and significant features that can show whether our browser is a bot or a real user. 
Moreover, most browsers have these unique features that allow the website to track the browser even across different IP addresses. This is the main reason why scrapers should change browser fingerprints while doing browser-based scraping. In turn, this should significantly reduce blocking. ## Using browser fingerprints[​](#using-browser-fingerprints "Direct link to Using browser fingerprints") Changing browser fingerprints can be a tedious job. Luckily, Crawlee provides this feature with zero configuration necessary - the usage of fingerprints is enabled by default and available in `PlaywrightCrawler` and `PuppeteerCrawler`. So whenever we build a scraper that is using one of these crawlers, the fingerprints are generated for the default browser and the operating system out of the box. ## Customizing browser fingerprints[​](#customizing-browser-fingerprints "Direct link to Customizing browser fingerprints") In certain cases we want to narrow down the fingerprints used - e.g. specify a certain operating system, locale or browser. This is also possible with Crawlee - the crawler can have the generation algorithm customized to reflect a particular browser version, and more. Let's take a look at the examples below: * PlaywrightCrawler * PuppeteerCrawler ``` import { PlaywrightCrawler } from 'crawlee'; import { BrowserName, DeviceCategory, OperatingSystemsName } from '@crawlee/browser-pool'; const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: true, // this is the default fingerprintOptions: { fingerprintGeneratorOptions: { browsers: [ { name: BrowserName.edge, minVersion: 96, }, ], devices: [DeviceCategory.desktop], operatingSystems: [OperatingSystemsName.windows], }, }, }, // ... }); ``` ``` import { PuppeteerCrawler } from 'crawlee'; import { BrowserName, DeviceCategory } from '@crawlee/browser-pool'; const crawler = new PuppeteerCrawler({ browserPoolOptions: { useFingerprints: true, // this is the default fingerprintOptions: { fingerprintGeneratorOptions: { browsers: [BrowserName.chrome, BrowserName.firefox], devices: [DeviceCategory.mobile], locales: ['en-US'], }, }, }, // ... }); ``` ## Disabling browser fingerprints[​](#disabling-browser-fingerprints "Direct link to Disabling browser fingerprints") On the contrary, sometimes we want to entirely disable the usage of browser fingerprints. This is easy to do with Crawlee too. All we have to do is set the `useFingerprints` option of the `browserPoolOptions` to `false`: * PlaywrightCrawler * PuppeteerCrawler ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: false, }, // ... }); ``` ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ browserPoolOptions: { useFingerprints: false, }, // ... }); ``` ## Camoufox[​](#camoufox "Direct link to Camoufox") For some protections, our integrated solutions are not enough - one example is the Cloudflare challenge. For such pages, you can try [Camoufox](https://camoufox.com/), a custom stealthy build of Firefox for web scraping. It might not get you through the challenge automatically, but with our `handleCloudflareChallenge` helper, it should be able to successfully mimic the required user action and get you through it. 
``` import { PlaywrightCrawler } from 'crawlee'; import { launchOptions } from 'camoufox-js'; import { firefox } from 'playwright'; const crawler = new PlaywrightCrawler({ postNavigationHooks: [ async ({ handleCloudflareChallenge }) => { await handleCloudflareChallenge(); }, ], browserPoolOptions: { // Disable the default fingerprint spoofing to avoid conflicts with Camoufox. useFingerprints: false, }, launchContext: { launcher: firefox, launchOptions: await launchOptions({ headless: true, }), }, // ... }); ``` **Related links** * [Fingerprint Suite Docs](https://github.com/apify/fingerprint-suite) * [Apify Academy anti-scraping course](https://docs.apify.com/academy/anti-scraping) * [Camoufox JS wrapper](https://github.com/apify/camoufox-js) --- # CheerioCrawler guide Copy for LLM ​[`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) is our simplest and fastest crawler. If you're familiar with [jQuery](https://jquery.com/), you'll understand `CheerioCrawler` in minutes. ## What is Cheerio[​](#what-is-cheerio "Direct link to What is Cheerio") [Cheerio](https://cheerio.js.org/) is essentially [jQuery](https://jquery.com/) for Node.js. It offers the same API, including the familiar `$` object. You can use it, as you would use jQuery for manipulating the DOM of an HTML page. In crawling, you'll mostly use it to select the needed elements and extract their values - the data you're interested in. But jQuery runs in a browser and attaches directly to the browser's DOM. Where does `cheerio` get its HTML? This is where the `Crawler` part of [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) comes in. ## How the crawler works[​](#how-the-crawler-works "Direct link to How the crawler works") ​[`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) crawls by making plain HTTP requests to the provided URLs using the specialized [got-scraping](https://github.com/apify/got-scraping) HTTP client. The URLs are fed to the crawler using [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md). The HTTP responses it gets back are usually HTML pages. The same pages you would get in your browser when you first load a URL. But it can handle any content types with the help of the [`additionalMimeTypes`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#additionalMimeTypes) option. info Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). Once the page's HTML is retrieved, the crawler will pass it to [Cheerio](https://github.com/cheeriojs/cheerio) for parsing. The result is the typical `$` function, which should be familiar to jQuery users. You can use the `$` function to do all sorts of lookups and manipulation of the page's HTML, but in scraping, you will mostly use it to find specific HTML elements and extract their data. Example use of Cheerio and its `$` function in comparison to browser JavaScript: ``` // Return the text content of the element. 
document.querySelector('title').textContent; // plain JS $('title').text(); // Cheerio // Return an array of all 'href' links on the page. Array.from(document.querySelectorAll('[href]')).map(el => el.href); // plain JS $('[href]') .map((i, el) => $(el).attr('href')) .get(); // Cheerio ``` note This is not to show that Cheerio is better than plain browser JavaScript. Some might actually prefer the more expressive way plain JS provides. Unfortunately, the browser JavaScript methods are not available in Node.js, so Cheerio is your best bet to do the parsing in Node.js. ## When to use `CheerioCrawler`[​](#when-to-use-cheeriocrawler "Direct link to when-to-use-cheeriocrawler") `CheerioCrawler` really shines when you need to cope with extremely high workloads. With just 4 GBs of memory and a single CPU core, you can scrape 500 or more pages a minute! *(assuming each page contains approximately 400KB of HTML)*. To scrape this fast with a full browser scraper, such as the [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), you'd need significantly more computing power. **Advantages:** * Extremely fast and cheap to run * Easy to set up * Familiar for jQuery users * Automatically avoids some anti-scraping bans **Disadvantages:** * Does not work for websites that require JavaScript rendering * May easily overload the target website with requests * Does not enable any manipulation of the website before scraping ## Web scraping with Cheerio: Examples[​](#web-scraping-with-cheerio-examples "Direct link to Web scraping with Cheerio: Examples") ### Get text content of an element[​](#get-text-content-of-an-element "Direct link to Get text content of an element") Finds the first `<h2>` element and returns its text content. ``` $('h2').text() ``` ### Find all links on a page[​](#find-all-links-on-a-page "Direct link to Find all links on a page") This snippet finds all `<a>` elements which have the `href` attribute and extracts the hrefs into an array. ``` $('a[href]') .map((i, el) => $(el).attr('href')) .get(); ``` ### Other examples[​](#other-examples "Direct link to Other examples") Visit the [Examples](https://crawlee.dev/js/docs/examples.md) section to browse examples of `CheerioCrawler` usage. Almost all examples show `CheerioCrawler` code in their code tabs. --- # Configuration Copy for LLM ​[`Configuration`](https://crawlee.dev/js/api/core/class/Configuration.md) is a class holding Crawlee configuration parameters. By default, you don't need to set or change any of them, but for certain use cases you might want to do so, e.g. in order to change the default storage directory, or enable verbose error logging, and so on. There are three ways of changing the configuration parameters: * adding `crawlee.json` file to your project * setting environment variables * using the `Configuration` class You could also combine all the above, but you should keep in mind, that the precedence for these 3 options is the following: ***`crawlee.json`*** < ***constructor options*** < ***environment variables***. `crawlee.json` is a baseline. The options provided in the `Configuration` constructor will override the options provided in the JSON. Environment variables will override both. ## `crawlee.json`[​](#crawleejson "Direct link to crawleejson") The first option you could use for configuring Crawlee is `crawlee.json` file. 
The only thing you need to do is specify the [`ConfigurationOptions`](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) in the file, place the file in the root of your project, and Crawlee will use provided options as global configuration. crawlee.json ``` { "persistStateIntervalMillis": 10000, "logLevel": "DEBUG" } ``` With `crawlee.json` you don't need to do anything else in the code: ``` import { CheerioCrawler, sleep } from 'crawlee'; // We are not importing nor passing // the Configuration to the crawler. // We are not assigning any env vars either. const crawler = new CheerioCrawler(); crawler.router.addDefaultHandler(async ({ request }) => { // for the first request we wait for 5 seconds, // and add the second request to the queue if (request.url === 'https://www.example.com/1') { await sleep(5_000); await crawler.addRequests(['https://www.example.com/2']) } // for the second request we wait for 10 seconds, // and abort the run if (request.url === 'https://www.example.com/2') { await sleep(10_000); process.exit(0); } }); await crawler.run(['https://www.example.com/1']); ``` If you run this example (assuming you placed the `crawlee.json` file with `persistStateIntervalMillis` and `logLevel` specified there in the root of your project), you will find the `SDK_CRAWLER_STATISTICS` file in default Key-Value store, which would show, that there's 1 finished request and crawler runtime was \~10 seconds. This confirms that the state was persisted after 10 seconds, as it was set in `crawlee.json`. Besides, you should see `DEBUG` logs in addition to `INFO` ones in your terminal, as `logLevel` was set to `DEBUG` in the `crawlee.json`, meaning Crawlee picked both provided options correctly. ## Environment Variables[​](#environment-variables "Direct link to Environment Variables") Another way of configuring Crawlee is setting environment variables. The following is a list of the environment variables used by Crawlee that are available to the user. ### Important env vars[​](#important-env-vars "Direct link to Important env vars") The following environment variables have large impact on the way Crawlee works and its behavior can be changed significantly by setting or unsetting them. #### `CRAWLEE_STORAGE_DIR`[​](#crawlee_storage_dir "Direct link to crawlee_storage_dir") Defines the path to a local directory where [`KeyValueStore`](https://crawlee.dev/js/api/core/class/KeyValueStore.md), [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md), and [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) store their data. By default, it is set to `./storage`. #### `CRAWLEE_DEFAULT_DATASET_ID`[​](#crawlee_default_dataset_id "Direct link to crawlee_default_dataset_id") The default dataset has ID `default`. Setting this environment variable overrides the default dataset ID with the provided value. #### `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`[​](#crawlee_default_key_value_store_id "Direct link to crawlee_default_key_value_store_id") The default key-value store has ID `default`. Setting this environment variable overrides the default key-value store ID with the provided value. #### `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID`[​](#crawlee_default_request_queue_id "Direct link to crawlee_default_request_queue_id") The default request queue has ID `default`. Setting this environment variable overrides the default request queue ID with the provided value. 
#### `CRAWLEE_PURGE_ON_START`[​](#crawlee_purge_on_start "Direct link to crawlee_purge_on_start") Storage directories are purged by default. If set to `false`, local storage directories will not be purged automatically at the start of the crawler run, nor before a storage is opened explicitly (e.g. via `Dataset.open()`). This is useful if we want to, e.g., add more items to the dataset with each run (and keep the previously saved/scraped items). #### `CRAWLEE_CONTAINERIZED`[​](#crawlee_containerized "Direct link to crawlee_containerized") This variable is only effective when the systemInfoV2 experiment is enabled. It changes how Crawlee measures its CPU and memory usage and limits. If unset, Crawlee will determine whether it is containerized by checking for common features of containerized environments via the `isContainerized` utility function: * A file at `/.dockerenv`. * A file at `/proc/self/cgroup` containing `docker`. * A value for the `KUBERNETES_SERVICE_HOST` environment variable. If `isLambda` returns true, `isContainerized` will return false regardless of these other checks. When this variable is set, it is used in place of `isContainerized`. ### Convenience env vars[​](#convenience-env-vars "Direct link to Convenience env vars") The next group includes env vars that can help achieve certain goals without having to change our code, such as temporarily switching the log level to DEBUG or enabling verbose logging for errors. #### `CRAWLEE_HEADLESS`[​](#crawlee_headless "Direct link to crawlee_headless") If set to `1`, web browsers launched by Crawlee will run in headless mode. We can still override this setting in the code, e.g. by passing the `headless: true` option to the [`launchPuppeteer()`](https://crawlee.dev/js/api/puppeteer-crawler/function/launchPuppeteer.md) function. By default, the browsers are launched in headful mode, i.e. with windows. #### `CRAWLEE_LOG_LEVEL`[​](#crawlee_log_level "Direct link to crawlee_log_level") Specifies the minimum log level, which can be one of the following values (in order of severity): `DEBUG`, `INFO`, `WARNING`, `ERROR` and `OFF`. By default, the log level is set to `INFO`, which means that `DEBUG` messages are not printed to the console. See the [`utils.log`](https://crawlee.dev/js/api/core/class/Log.md) namespace for logging utilities. #### `CRAWLEE_VERBOSE_LOG`[​](#crawlee_verbose_log "Direct link to crawlee_verbose_log") Enables verbose logging if set to `true`. If not explicitly set to `true`, errors thrown from inside the request handler are logged only as a warning with the error message, as long as we know the request will be retried. The same applies to some known errors (such as timeout errors). Disabled by default. #### `CRAWLEE_MEMORY_MBYTES`[​](#crawlee_memory_mbytes "Direct link to crawlee_memory_mbytes") Sets the amount of system memory in megabytes to be used by the [`AutoscaledPool`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md). It is used to limit the number of concurrently running tasks. By default, the max amount of memory to be used is set to one quarter of total system memory, i.e. on a system with 8192 MB of memory, the autoscaling feature will only use up to 2048 MB of memory. ## Configuration class[​](#configuration-class "Direct link to Configuration class") The last option to adjust Crawlee configuration is to use the [`Configuration`](https://crawlee.dev/js/api/core/class/Configuration.md) class in the code. 
### Global Configuration[​](#global-configuration "Direct link to Global Configuration") By default, there is a global singleton instance of the `Configuration` class; it is used by the crawlers and some other classes that depend on a configurable behavior. In most cases you don't need to adjust any options there, but if needed, you can get access to it via the [`Configuration.getGlobalConfig()`](https://crawlee.dev/js/api/core/class/Configuration.md#getGlobalConfig) function. Now you can easily [`get`](https://crawlee.dev/js/api/core/class/Configuration.md#get) and [`set`](https://crawlee.dev/js/api/core/class/Configuration.md#set) the [`ConfigurationOptions`](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md). ``` import { CheerioCrawler, Configuration, sleep } from 'crawlee'; // Get the global configuration const config = Configuration.getGlobalConfig(); // Set the 'persistStateIntervalMillis' option // of global configuration to 10 seconds config.set('persistStateIntervalMillis', 10_000); // Note that we are not passing the configuration to the crawler // as it's using the global configuration const crawler = new CheerioCrawler(); crawler.router.addDefaultHandler(async ({ request }) => { // For the first request we wait for 5 seconds, // and add the second request to the queue if (request.url === 'https://www.example.com/1') { await sleep(5_000); await crawler.addRequests(['https://www.example.com/2']) } // For the second request we wait for 10 seconds, // and abort the run if (request.url === 'https://www.example.com/2') { await sleep(10_000); process.exit(0); } }); await crawler.run(['https://www.example.com/1']); ``` This is pretty much the same example we used for showing `crawlee.json` usage, but now we're using the global configuration, which is the only difference. If you run this example, you will find the `SDK_CRAWLER_STATISTICS` file in the default Key-Value store as before, showing the same number of finished requests (one) and the same crawler runtime (\~10 seconds). This confirms that the provided parameters worked: the state was persisted after 10 seconds, as it was set in the global configuration. note If you run the same example with the two lines of code related to `Configuration` commented out, there will be no `SDK_CRAWLER_STATISTICS` file stored in the default Key-Value store: as we did not change the `persistStateIntervalMillis`, Crawlee used the default value of 60 seconds, and the crawler was forcefully aborted after \~15 seconds of run time before it persisted the state for the first time. ### Custom configuration[​](#custom-configuration "Direct link to Custom configuration") Alternatively, you can create a custom configuration. In this case you need to pass it to the class that is going to use it, e.g. to the crawler. 
Let's adjust the previous example: ``` import { CheerioCrawler, Configuration, sleep } from 'crawlee'; // Create new configuration const config = new Configuration({ // Set the 'persistStateIntervalMillis' option to 10 seconds persistStateIntervalMillis: 10_000, }); // Now we need to pass the configuration to the crawler const crawler = new CheerioCrawler({}, config); crawler.router.addDefaultHandler(async ({ request }) => { // for the first request we wait for 5 seconds, // and add the second request to the queue if (request.url === 'https://www.example.com/1') { await sleep(5_000); await crawler.addRequests(['https://www.example.com/2']) } // for the second request we wait for 10 seconds, // and abort the run if (request.url === 'https://www.example.com/2') { await sleep(10_000); process.exit(0); } }); await crawler.run(['https://www.example.com/1']); ``` If you run this example - it would work exactly the same as before, with the same `SDK_CRAWLER_STATISTICS` file in default Key-Value store after the run, showing the same number of finished requests and the same crawler run time. note If you would not pass the configuration to the crawler, there again will be no `SDK_CRAWLER_STATISTICS` file stored in the default Key-Value store, this time for a different reason though. Since we did not pass the configuration to the crawler, the crawler will use the global configuration, which is using the default `persistStateIntervalMillis`. So again, the run was aborted before the state was persisted for the first time. --- # Using a custom HTTP client (Experimental) Copy for LLM The [`BasicCrawler`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) class allows you to configure the HTTP client implementation using the `httpClient` constructor option. This might be useful for testing or if you need to swap out the default implementation based on `got-scraping` for something else, such as `curl-impersonate` or `axios`. The HTTP client implementation needs to conform to the [`BaseHttpClient`](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) interface. For a rough idea on how it might look, see a skeleton implementation that uses the standard `fetch` interface: ``` import type { BaseHttpClient, HttpRequest, HttpResponse, RedirectHandler, ResponseTypes, StreamingHttpResponse, } from '@crawlee/core'; import { Readable } from 'node:stream'; class CustomHttpClient implements BaseHttpClient { async sendRequest<TResponseType extends keyof ResponseTypes = 'text'>( request: HttpRequest<TResponseType>, ): Promise<HttpResponse<TResponseType>> { const requestHeaders = new Headers(); for (let [headerName, headerValues] of Object.entries(request.headers ?? {})) { if (headerValues === undefined) { continue; } if (!Array.isArray(headerValues)) { headerValues = [headerValues]; } for (const value of headerValues) { requestHeaders.append(headerName, value); } } const response = await fetch(request.url, { method: request.method, headers: requestHeaders, body: request.body as string, // TODO implement stream/generator handling signal: request.signal, // TODO implement the rest of request parameters (e.g., timeout, proxyUrl, cookieJar, ...) 
}); const headers: Record<string, string> = {}; response.headers.forEach((value, headerName) => { headers[headerName] = value; }); return { complete: true, request, url: response.url, statusCode: response.status, redirectUrls: [], // TODO you need to handle redirects manually to track them headers, trailers: {}, // TODO not supported by fetch ip: undefined, body: request.responseType === 'text' ? await response.text() : request.responseType === 'json' ? await response.json() : Buffer.from(await response.text()), }; } async stream(request: HttpRequest, onRedirect?: RedirectHandler): Promise<StreamingHttpResponse> { const fetchResponse = await fetch(request.url, { method: request.method, headers: new Headers(), body: request.body as string, // TODO implement stream/generator handling signal: request.signal, // TODO implement the rest of request parameters (e.g., timeout, proxyUrl, cookieJar, ...) }); const headers: Record<string, string> = {}; // TODO same as in sendRequest() async function* read() { const reader = fetchResponse.body?.getReader(); const stream = new ReadableStream({ start(controller) { if (!reader) { return null; } return pump(); function pump() { return reader!.read().then(({ done, value }) => { // When no more data needs to be consumed, close the stream if (done) { controller.close(); return; } // Enqueue the next data chunk into our target stream controller.enqueue(value); return pump(); }); } }, }); for await (const chunk of stream) { yield chunk; } } const response = { complete: false, request, url: fetchResponse.url, statusCode: fetchResponse.status, redirectUrls: [], // TODO you need to handle redirects manually to track them headers, trailers: {}, // TODO not supported by fetch ip: undefined, stream: Readable.from(read()), get downloadProgress() { return { percent: 0, transferred: 0 }; // TODO track this }, get uploadProgress() { return { percent: 0, transferred: 0 }; // TODO track this }, }; return response; } } ``` You may then instantiate it and pass to a crawler constructor: ``` const crawler = new HttpCrawler({ httpClient: new CustomHttpClient(), async requestHandler() { /* ... */ }, }); ``` Please note that the interface is experimental and it will likely change with Crawlee version 4. --- # Running in Docker Copy for LLM Running headless browsers in Docker requires a lot of setup to do it right. But there's no need to worry about that, because we already created base images that you can freely use. We use them every day on the [Apify Platform](https://crawlee.dev/js/docs/deployment/apify-platform.md). All images can be found in their [GitHub repo](https://github.com/apify/apify-actor-docker) and in our [DockerHub](https://hub.docker.com/orgs/apify). ## Overview[​](#overview "Direct link to Overview") Browsers are pretty big, so we try to provide a wide variety of images to suit the specific needs. Here's a full list of our Docker images. * [`apify/actor-node`](#actor-node) * [`apify/actor-node-puppeteer-chrome`](#actor-node-puppeteer-chrome) * [`apify/actor-node-playwright`](#actor-node-playwright) * [`apify/actor-node-playwright-chrome`](#actor-node-playwright-chrome) * [`apify/actor-node-playwright-firefox`](#actor-node-playwright-firefox) * [`apify/actor-node-playwright-webkit`](#actor-node-playwright-webkit) ## Versioning[​](#versioning "Direct link to Versioning") Each image is tagged with up to 2 version tags, depending on the type of the image. One for Node.js version and second for pre-installed web automation library version. 
If you use the image name without a version tag, you'll always get the latest available version. > We recommend always using at least the Node.js version tag in production Dockerfiles. It will ensure that a future update of Node.js will not break our automations. ### Node.js versioning[​](#nodejs-versioning "Direct link to Node.js versioning") Our images are built with multiple Node.js versions to ensure backwards compatibility. Currently, Node.js **versions 16 and 18 are supported** (legacy versions still exist, see DockerHub). To select the preferred version, use the appropriate number as the image tag. ``` # Use Node.js 20 FROM apify/actor-node:20 ``` ### Automation library versioning[​](#automation-library-versioning "Direct link to Automation library versioning") Images that include a pre-installed automation library, which means all images that include `puppeteer` or `playwright` in their name, are also tagged with the pre-installed version of the library. For example, `apify/actor-node-puppeteer-chrome:20-22.1.0` comes with Node.js 20 and Puppeteer v22.1.0. If you try to install a different version of Puppeteer into this image, you may run into compatibility issues, because the Chromium version bundled with `puppeteer` will not match the version of Chromium that's pre-installed. Similarly `apify/actor-node-playwright-firefox:14-1.21.1` runs on Node.js 14 and is pre-installed with the Firefox version that comes with v1.21.1. Installing `apify/actor-node-puppeteer-chrome` (without a tag) will install the latest available version of Node.js and `puppeteer`. ### Pre-release tags[​](#pre-release-tags "Direct link to Pre-release tags") We also build pre-release versions of the images to test the changes we make. Those are typically denoted by a `beta` suffix, but it can vary depending on our needs. If you need to try a pre-release version, you can do it like this: ``` # Without library version. FROM apify/actor-node:20-beta ``` ``` # With library version. FROM apify/actor-node-playwright-chrome:20-1.10.0-beta ``` ## Best practices[​](#best-practices "Direct link to Best practices") For production crawlers, we recommend pinning both the Node.js version **and** the automation library version in your Dockerfile tag. This ensures reproducible builds and prevents unexpected behavior when new versions are released. ### Recommended approach: Pin both versions[​](#recommended-approach-pin-both-versions "Direct link to Recommended approach: Pin both versions") Match the automation library version in your `package.json` with the version in your Docker image tag: ``` FROM apify/actor-node-playwright-chrome:22-1.52.0 ``` ``` { "dependencies": { "crawlee": "^3.0.0", "playwright": "1.52.0" } } ``` Why version matching matters If you pin the Docker image to `22-1.52.0` but install a different Playwright version via `package.json`, you may encounter browser compatibility issues. The browsers pre-installed in the image are specifically built for that Playwright version. ### Alternative approach: Using asterisk `*`[​](#alternative-approach-using-asterisk- "Direct link to alternative-approach-using-asterisk-") You can also use asterisk `*` as the automation library version in your `package.json`: ``` FROM apify/actor-node-playwright-chrome:22 ``` ``` { "dependencies": { "crawlee": "^3.0.0", "playwright": "*" } } ``` This makes sure the pre-installed version of Puppeteer or Playwright is not re-installed on build. 
However, this approach is less predictable because you'll get whatever version was latest when the Docker image was built. ## Finding available tags[​](#finding-available-tags "Direct link to Finding available tags") To see all available tags for each image, you can visit Docker Hub directly: * [apify/actor-node](https://hub.docker.com/r/apify/actor-node/tags) * [apify/actor-node-puppeteer-chrome](https://hub.docker.com/r/apify/actor-node-puppeteer-chrome/tags) * [apify/actor-node-playwright](https://hub.docker.com/r/apify/actor-node-playwright/tags) * [apify/actor-node-playwright-chrome](https://hub.docker.com/r/apify/actor-node-playwright-chrome/tags) * [apify/actor-node-playwright-firefox](https://hub.docker.com/r/apify/actor-node-playwright-firefox/tags) * [apify/actor-node-playwright-webkit](https://hub.docker.com/r/apify/actor-node-playwright-webkit/tags) You can also query available tags programmatically: ``` curl -s "https://registry.hub.docker.com/v2/repositories/apify/actor-node-playwright-chrome/tags?page_size=50" | jq '.results[].name' ``` ### Warning about image size[​](#warning-about-image-size "Direct link to Warning about image size") Browsers are huge. If you don't need them all in your image, it's better to use a smaller image with only the one browser you need. You should also be careful when installing new dependencies. Nothing prevents you from installing Playwright into the`actor-node-puppeteer-chrome` image, but the resulting image will be about 3 times larger and extremely slow to download and build. When you use only what you need, you'll be rewarded with reasonable build and start times. ## Apify Docker Images[​](#apify-docker-images "Direct link to Apify Docker Images") ### actor-node[​](#actor-node "Direct link to actor-node") This is the smallest image we have based on Alpine Linux. It does not include any browsers, and it's therefore best used with [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). It benefits from lightning fast builds and container startups. ​[`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md), [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) and other browser based features will **NOT** work with this image. ``` FROM apify/actor-node:20 ``` ### actor-node-puppeteer-chrome[​](#actor-node-puppeteer-chrome "Direct link to actor-node-puppeteer-chrome") This image includes Puppeteer (Chromium) and the Chrome browser. It can be used with [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) and [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md), but **NOT** with [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). The image supports XVFB by default, so you can run both `headless` and `headful` browsers with it. ``` FROM apify/actor-node-puppeteer-chrome:20 ``` ### actor-node-playwright[​](#actor-node-playwright "Direct link to actor-node-playwright") A very large and slow image that can run all Playwright browsers: Chromium, Chrome, Firefox, WebKit. Everything is installed. If you need to develop or test with multiple browsers, this is the image to choose, but in most cases, it's better to use the specialized images below. 
``` FROM apify/actor-node-playwright:20 ``` ### actor-node-playwright-chrome[​](#actor-node-playwright-chrome "Direct link to actor-node-playwright-chrome") Similar to [`actor-node-puppeteer-chrome`](#actor-node-puppeteer-chrome), but for Playwright. You can run [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), but **NOT** [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md). It uses the [`PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD`](https://playwright.dev/docs/api/environment-variables/) environment variable to block installation of more browsers into the image to keep it small. If you want more browsers, either use the [`actor-node-playwright`](#actor-node-playwright) image or override this env var. The image supports XVFB by default, so you can run both `headless` and `headful` browsers with it. ``` FROM apify/actor-node-playwright-chrome:20 ``` ### actor-node-playwright-firefox[​](#actor-node-playwright-firefox "Direct link to actor-node-playwright-firefox") Same idea as [`actor-node-playwright-chrome`](#actor-node-playwright-chrome), but with Firefox pre-installed. ``` FROM apify/actor-node-playwright-firefox:20 ``` ### actor-node-playwright-webkit[​](#actor-node-playwright-webkit "Direct link to actor-node-playwright-webkit") Same idea as [`actor-node-playwright-chrome`](#actor-node-playwright-chrome), but with WebKit pre-installed. ``` FROM apify/actor-node-playwright-webkit:20 ``` ## Example Dockerfile[​](#example-dockerfile "Direct link to Example Dockerfile") To use the above images, it's necessary to have a [`Dockerfile`](https://docs.docker.com/engine/reference/builder/). You can either use this example, or bootstrap your project with the [Crawlee CLI](https://crawlee.dev/js/docs/introduction/setting-up.md), which automatically adds the correct Dockerfile to your project folder. * Node+JavaScript * Node+TypeScript * Browser+JavaScript * Browser+TypeScript ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node:20 # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY . ./ # Run the image. CMD npm start --silent ``` ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node:20 AS builder # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install all dependencies. Don't audit to speed up the installation. 
RUN npm install --include=dev --audit=false # Next, copy the source files using the user set # in the base image. COPY . ./ # Install all dependencies and build the project. # Don't audit to speed up the installation. RUN npm run build # Create final image FROM apify/actor-node:20 # Copy only built JS files from builder image COPY --from=builder /usr/src/app/dist ./dist # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY . ./ # Run the image. CMD npm run start:prod --silent ``` This example is for Playwright. If you want to use Puppeteer, simply replace **playwright** with **puppeteer** in the `FROM` declaration. ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node-playwright-chrome:20 # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY --chown=myuser . ./ # Run the image. CMD npm start --silent ``` This example is for Playwright. If you want to use Puppeteer, simply replace **playwright** with **puppeteer** in both `FROM` declarations. ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node-playwright-chrome:20 AS builder # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install all dependencies. Don't audit to speed up the installation. RUN npm install --include=dev --audit=false # Next, copy the source files using the user set # in the base image. COPY --chown=myuser . ./ # Install all dependencies and build the project. # Don't audit to speed up the installation. RUN npm run build # Create final image FROM apify/actor-node-playwright-chrome:20 # Copy only built JS files from builder image COPY --from=builder --chown=myuser /home/myuser/dist ./dist # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. 
Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY --chown=myuser . ./ # Run the image. If you know you won't need headful browsers, # you can remove the XVFB start script for a micro perf gain. CMD ./start_xvfb_and_run_cmd.sh && npm run start:prod --silent ``` --- # Got Scraping Copy for LLM ## Intro[​](#intro "Direct link to Intro") When using `BasicCrawler`, we have to send the requests manually. In order to do this, we can use the context-aware `sendRequest()` function: ``` import { BasicCrawler } from 'crawlee'; const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest(); log.info('received body', res.body); }, }); ``` It uses [`got-scraping`](https://github.com/apify/got-scraping) under the hood. Got Scraping is a [Got](https://github.com/sindresorhus/got) extension developed to mimic browser requests, so there's a high chance we'll open the webpage without getting blocked. ## `sendRequest` API[​](#sendrequest-api "Direct link to sendrequest-api") ``` async sendRequest(overrideOptions?: GotOptionsInit) => { return gotScraping({ url: request.url, method: request.method, body: request.payload, headers: request.headers, proxyUrl: crawlingContext.proxyInfo?.url, sessionToken: session, responseType: 'text', ...overrideOptions, retry: { limit: 0, ...overrideOptions?.retry, }, cookieJar: { getCookieString: (url: string) => session!.getCookieString(url), setCookie: (rawCookie: string, url: string) => session!.setCookie(rawCookie, url), ...overrideOptions?.cookieJar, }, }); } ``` ### `url`[​](#url "Direct link to url") By default, it's the URL of current task. However you can override this with a `string` or a `URL` instance if necessary. *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#url).* ### `method`[​](#method "Direct link to method") By default, it's the HTTP method of current task. Possible values are `'GET', 'POST', 'HEAD', 'PUT', 'PATCH', 'DELETE'`. *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#method).* ### `body`[​](#body "Direct link to body") By default, it's the HTTP payload of current task. *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#body).* ### `headers`[​](#headers "Direct link to headers") By default, it's the HTTP headers of current task. It's an object with `string` values. *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#headers).* ### `proxyUrl`[​](#proxyurl "Direct link to proxyurl") It's a string representing the proxy server in the format of `protocol://username:password@hostname:port`. For example, an Apify proxy server looks like this: `http://auto:password@proxy.apify.com:8000`. 
`BasicCrawler` does not have the concept of a session or a proxy, so we need to pass the `proxyUrl` option manually: ``` import { BasicCrawler } from 'crawlee'; const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest({ proxyUrl: 'http://auto:password@proxy.apify.com:8000', }); log.info('received body', res.body); }, }); ``` We use proxies to hide our real IP address. *More details in [Got Scraping documentation](https://github.com/apify/got-scraping#proxyurl).* ### `sessionToken`[​](#sessiontoken "Direct link to sessiontoken") It's a non-primitive object used as a key when generating a browser fingerprint. Fingerprints with the same token don't change. This can be used to retain the `user-agent` header when using the same Apify Session. *More details in [Got Scraping documentation](https://github.com/apify/got-scraping#sessiontoken).* ### `responseType`[​](#responsetype "Direct link to responsetype") This option defines how the response should be parsed. By default, we fetch HTML websites - that is, plain text. Hence, we set `responseType` to `'text'`. However, JSON is possible as well: ``` import { BasicCrawler } from 'crawlee'; const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest({ responseType: 'json' }); log.info('received body', res.body); }, }); ``` *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#responsetype).* ### `cookieJar`[​](#cookiejar "Direct link to cookiejar") `Got` uses a `cookieJar` to manage cookies. It's an object with the interface of the [`tough-cookie` package](https://github.com/salesforce/tough-cookie). Example: ``` import { BasicCrawler } from 'crawlee'; import { CookieJar } from 'tough-cookie'; const cookieJar = new CookieJar(); const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest({ cookieJar }); log.info('received body', res.body); }, }); ``` *More details in* * *[Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#cookiejar)* * *[Tough Cookie documentation](https://github.com/salesforce/tough-cookie#cookiejarstore-options)* ### `retry.limit`[​](#retrylimit "Direct link to retrylimit") This option specifies the maximum number of `Got` retries. By default, `retry.limit` is set to `0`. This is because Crawlee has its own (complicated enough) retry management. We suggest NOT changing this value for stability reasons. ### `useHeaderGenerator`[​](#useheadergenerator "Direct link to useheadergenerator") It's a boolean that determines whether browser-like headers should be generated. By default, it's set to `true`, and we recommend keeping it that way for better results. ### `headerGeneratorOptions`[​](#headergeneratoroptions "Direct link to headergeneratoroptions") This option is an object that specifies how the browser fingerprint should be generated. 
Example: ``` import { BasicCrawler } from 'crawlee'; const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest({ headerGeneratorOptions: { devices: ['mobile', 'desktop'], locales: ['en-US'], operatingSystems: ['windows', 'macos', 'android', 'ios'], browsers: ['chrome', 'edge', 'firefox', 'safari'], }, }); log.info('received body', res.body); }, }); ``` *More details in [`HeaderGeneratorOptions` documentation](https://apify.github.io/fingerprint-suite/api/fingerprint-generator/interface/HeaderGeneratorOptions/).* **Related links** * [Got documentation](https://github.com/sindresorhus/got#documentation) * [Got Scraping documentation](https://github.com/apify/got-scraping) * [Header Generator documentation](https://apify.github.io/fingerprint-suite/docs/guides/fingerprint-generator/) --- # Impit HTTP Client Copy for LLM ## Introduction[​](#introduction "Direct link to Introduction") The `ImpitHttpClient` is an HTTP client implementation based on the [Impit](https://github.com/apify/impit) library. It enables browser impersonation for HTTP requests, helping you bypass bot detection systems without running an actual browser. Successor to got-scraping Impit is the successor to `got-scraping`, which is no longer actively maintained. We recommend using `ImpitHttpClient` for all new projects. Impit provides better anti-bot evasion through TLS fingerprinting and HTTP/3 support, while maintaining a smaller package size. **Impit will become the default HTTP client in the next major version of Crawlee.** ### Why use Impit?[​](#why-use-impit "Direct link to Why use Impit?") Websites increasingly use sophisticated bot detection that analyzes: * **HTTP fingerprints**: User-Agent strings, header ordering, HTTP/2 pseudo-header sequences * **TLS fingerprints**: Cipher suites, TLS extensions, and cryptographic details in the ClientHello message Standard HTTP clients like `fetch` or `axios` are easily detected because their fingerprints don't match real browsers. Unlike `got-scraping` which only handles HTTP-level fingerprinting, Impit also mimics TLS fingerprints, making requests appear to come from real browsers. ## Installation[​](#installation "Direct link to Installation") Install the `@crawlee/impit-client` package: ``` npm install @crawlee/impit-client ``` note The `impit` package includes native binaries and supports Windows, macOS (including ARM), and Linux out of the box. 
## Basic usage[​](#basic-usage "Direct link to Basic usage") Pass the `ImpitHttpClient` instance to the `httpClient` option of any Crawlee crawler: ``` import { BasicCrawler } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const crawler = new BasicCrawler({ httpClient: new ImpitHttpClient({ browser: Browser.Firefox, }), async requestHandler({ sendRequest, log }) { const response = await sendRequest(); log.info('Received response', { statusCode: response.statusCode }); }, }); await crawler.run(['https://example.com']); ``` ## Usage with different crawlers[​](#usage-with-different-crawlers "Direct link to Usage with different crawlers") ### CheerioCrawler[​](#cheeriocrawler "Direct link to CheerioCrawler") ``` import { CheerioCrawler } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const crawler = new CheerioCrawler({ httpClient: new ImpitHttpClient({ browser: Browser.Chrome, }), async requestHandler({ $, request, enqueueLinks, pushData }) { const title = $('title').text(); const h1 = $('h1').first().text(); await pushData({ url: request.url, title, h1, }); // Enqueue links found on the page await enqueueLinks(); }, }); await crawler.run(['https://example.com']); ``` ### HttpCrawler[​](#httpcrawler "Direct link to HttpCrawler") ``` import { HttpCrawler } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const crawler = new HttpCrawler({ httpClient: new ImpitHttpClient({ browser: Browser.Firefox, http3: true, }), async requestHandler({ body, request, log, pushData }) { log.info(`Processing ${request.url}`); // body is the raw HTML string await pushData({ url: request.url, bodyLength: body.length, }); }, }); await crawler.run(['https://example.com']); ``` ## Configuration options[​](#configuration-options "Direct link to Configuration options") The `ImpitHttpClient` constructor accepts the following options: | Option | Type | Default | Description | | ----------------- | ------------------------- | ----------- | ------------------------------------------------------------------------------ | | `browser` | `'chrome'` \| `'firefox'` | `undefined` | Browser to impersonate. Affects TLS fingerprint and default headers. | | `http3` | `boolean` | `false` | Enable HTTP/3 (QUIC) protocol support. | | `ignoreTlsErrors` | `boolean` | `false` | Ignore TLS certificate errors. Useful for testing or self-signed certificates. 
| ### Browser impersonation[​](#browser-impersonation "Direct link to Browser impersonation") Use the `Browser` enum to specify which browser to impersonate: ``` import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; // Impersonate Firefox const firefoxClient = new ImpitHttpClient({ browser: Browser.Firefox }); // Impersonate Chrome const chromeClient = new ImpitHttpClient({ browser: Browser.Chrome }); ``` ### Advanced configuration[​](#advanced-configuration "Direct link to Advanced configuration") ``` import { CheerioCrawler } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const crawler = new CheerioCrawler({ httpClient: new ImpitHttpClient({ // Impersonate Chrome browser browser: Browser.Chrome, // Enable HTTP/3 protocol http3: true, }), async requestHandler({ $ }) { console.log(`Title: ${$('title').text()}`); }, }); await crawler.run(['https://example.com']); ``` ## Proxy support[​](#proxy-support "Direct link to Proxy support") Proxies are configured per-request through Crawlee's proxy management system, not on the `ImpitHttpClient` itself. Use `ProxyConfiguration` as you normally would: ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy1.example.com:8080', 'http://proxy2.example.com:8080'], }); const crawler = new CheerioCrawler({ httpClient: new ImpitHttpClient({ browser: Browser.Chrome }), proxyConfiguration, async requestHandler({ $, request }) { console.log(`Scraped ${request.url}`); }, }); ``` ## How it works[​](#how-it-works "Direct link to How it works") Impit achieves browser impersonation at two levels: 1. **HTTP level**: Mimics browser-specific header ordering, HTTP/2 settings, and pseudo-header sequences that antibot services analyze. 2. **TLS level**: Uses a patched version of `rustls` to replicate the exact TLS ClientHello message that browsers send, including cipher suites and extensions. This dual-layer approach makes requests appear to come from a real browser, significantly reducing blocks from bot detection systems. ## Comparison with other solutions[​](#comparison-with-other-solutions "Direct link to Comparison with other solutions") | Feature | got-scraping | curl-impersonate | Impit | | ---------------------- | ------------ | ------------------ | ------ | | TLS fingerprinting | No | Yes | Yes | | HTTP/3 support | No | Yes | Yes | | Native Node.js package | Yes | No (child process) | Yes | | Windows/macOS ARM | Yes | No | Yes | | Package size | \~10 MB | \~20 MB | \~8 MB | **Related links** * [Impit GitHub repository](https://github.com/apify/impit) * [Custom HTTP Client guide](https://crawlee.dev/js/docs/guides/custom-http-client.md) * [Proxy Management guide](https://crawlee.dev/js/docs/guides/proxy-management.md) * [Avoiding blocking guide](https://crawlee.dev/js/docs/guides/avoid-blocking.md) --- # JavaScript rendering Copy for LLM JavaScript rendering is the process of executing JavaScript on a page to make changes in the page's structure or content. It's also called client-side rendering, the opposite of server-side rendering. Some modern websites render on the client, some on the server and many cutting edge websites render some things on the server and other things on the client. The Crawlee website does not use JavaScript rendering to display its content, so we have to look for an example elsewhere. 
[Apify Store](https://apify.com/store) is a library of scrapers and automations called **actors** that anyone can grab and use for free. It uses JavaScript rendering to display the list of actors, so let's use it to demonstrate how it works. src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ async requestHandler({ $, request }) { // Extract text content of an actor card const actorText = $('.ActorStoreItem').text(); console.log(`ACTOR: ${actorText}`); } }) await crawler.run(['https://apify.com/store']); ``` Run the code, and you'll see that the crawler won't print the content of the actor card. ``` ACTOR: ``` That's because Apify Store uses client-side JavaScript to render its content and `CheerioCrawler` can't execute it, so the text never appears in the page's HTML. You can confirm this using Chrome DevTools. If you go to <https://apify.com/store>, right-click anywhere in the page, select **View Page Source** and search for **ActorStoreItem** you won't find any results. Then, if you right-click again, select **Inspect** and search for the same **ActorStoreItem**, you will find many of them. How's this possible? Because **View Page Source** shows the original HTML, before any JavaScript executions. That's what `CheerioCrawler` gets. Whereas with **Inspect** you see the current HTML - after JavaScript execution. When you understand this, it's not a huge surprise that `CheerioCrawler` can't find the data. For that we need a headless browser. ## Headless browsers[​](#headless-browsers "Direct link to Headless browsers") To get the contents of `.ActorStoreItem`, you will have to use a headless browser. You can choose from two libraries to control your browser: [Puppeteer](https://github.com/puppeteer/puppeteer) or [Playwright](https://github.com/microsoft/playwright). The choice is simple. If you know one of them, choose the one you know. If you know both, or none, choose Playwright, because it's better in most cases. ## Waiting for elements to render[​](#waiting-for-elements-to-render "Direct link to Waiting for elements to render") No matter which library you pick, here's example code for both. Playwright is a little more pleasant to use, but both libraries will get the job done. The big difference between them is that Playwright will automatically wait for elements to appear, whereas in Puppeteer, you have to explicitly wait for them. * PlaywrightCrawler * PuppeteerCrawler src/main.mjs ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ page }) { // page.locator points to an element in the DOM // using a CSS selector, but it does not access it yet. const actorCard = page.locator('.ActorStoreItem').first(); // Upon calling one of the locator methods Playwright // waits for the element to render and then accesses it. const actorText = await actorCard.textContent(); console.log(`ACTOR: ${actorText}`); }, }); await crawler.run(['https://apify.com/store']); ``` src/main.mjs ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ page }) { // Puppeteer does not have the automatic waiting functionality // of Playwright, so we have to explicitly wait for the element. await page.waitForSelector('.ActorStoreItem'); // Puppeteer does not have helper methods like locator.textContent, // so we have to manually extract the value using in-page JavaScript. 
const actorText = await page.$eval('.ActorStoreItem', (el) => { return el.textContent; }); console.log(`ACTOR: ${actorText}`); }, }); await crawler.run(['https://apify.com/store']); ``` When you run the code, you'll see the *badly formatted* content of the first actor card printed to console: ``` ACTOR: Web Scraperapify/web-scraperCrawls arbitrary websites using [...] ``` ### We're not kidding[​](#were-not-kidding "Direct link to We're not kidding") If you don't believe us that the elements need to be waited for, run the following code which skips the waiting. * PlaywrightCrawler * PuppeteerCrawler src/main.mjs ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ page }) { // Here we don't wait for the selector and immediately // extract the text content from the page. const actorText = await page.$eval('.ActorStoreItem', (el) => { return el.textContent; }); console.log(`ACTOR: ${actorText}`); }, }); await crawler.run(['https://apify.com/store']); ``` src/main.mjs ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ page }) { // Here we don't wait for the selector and immediately // extract the text content from the page. const actorText = await page.$eval('.ActorStoreItem', (el) => { return el.textContent; }); console.log(`ACTOR: ${actorText}`); }, }); await crawler.run(['https://apify.com/store']); ``` In both cases, the request will be retried a few times and eventually fail with an error like this: ``` ERROR [...] Error: failed to find element matching selector ".ActorStoreItem" ``` That's because when you try to access the element in the browser, it's not been rendered in the DOM yet. tip This guide only touches the concept of JavaScript rendering and use of headless browsers. To learn more, continue with the [Puppeteer & Playwright course](https://developers.apify.com/academy/puppeteer-playwright) in the Apify Academy. **It's free and open-source** ❤️. --- # JSDOMCrawler guide Copy for LLM ​[`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) is very useful for scraping with the Window API. ## How the crawler works[​](#how-the-crawler-works "Direct link to How the crawler works") ​[`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) crawls by making plain HTTP requests to the provided URLs using the specialized [got-scraping](https://github.com/apify/got-scraping) HTTP client. The URLs are fed to the crawler using [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md). The HTTP responses it gets back are usually HTML pages. The same pages you would get in your browser when you first load a URL. But it can handle any content types with the help of the [`additionalMimeTypes`](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#additionalMimeTypes) option. info Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). Once the page's HTML is retrieved, the crawler will pass it to [JSDOM](https://www.npmjs.com/package/jsdom) for parsing. 
The result is a `window` property, which should be familiar to frontend developers. You can use the Window API to do all sorts of lookups and manipulation of the page's HTML, but in scraping, you will mostly use it to find specific HTML elements and extract their data. Example use of browser JavaScript: ``` // Return the page title document.title; // browsers window.document.title; // JSDOM ``` ## When to use `JSDOMCrawler`[​](#when-to-use-jsdomcrawler "Direct link to when-to-use-jsdomcrawler") `JSDOMCrawler` really shines when `CheerioCrawler` is just not enough. There is an entire set of [APIs](https://developer.mozilla.org/en-US/docs/Web/API/HTML_DOM_API) available! **Advantages:** * Easy to set up * Familiar for frontend developers * Content can be manipulated * Automatically avoids some anti-scraping bans **Disadvantages:** * Slower than `CheerioCrawler` * Does not work for websites that require JavaScript rendering * May easily overload the target website with requests ## Example use of Element API[​](#example-use-of-element-api "Direct link to Example use of Element API") ### Find all links on a page[​](#find-all-links-on-a-page "Direct link to Find all links on a page") This snippet finds all `<a>` elements which have the `href` attribute and extracts the hrefs into an array. ``` Array.from(document.querySelectorAll('a[href]')).map((a) => a.href); ``` ### Other examples[​](#other-examples "Direct link to Other examples") Visit the [Examples](https://crawlee.dev/js/docs/examples.md) section to browse examples of `JSDOMCrawler` usage. Almost all examples show `JSDOMCrawler` code in their code tabs. --- # motivation Copy for LLM --- # Parallel Scraping Guide Copy for LLM Experimental features ahead At the time of writing this guide (December 2023), request locking is still an experimental feature. You can read more about the experiment by visiting the [request locking experiment](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) page. In this guide, we will walk you through how you can turn your single scraper into a scraper that can be parallelized and run in multiple instances. This guide assumes you've read and walked through our [introduction guide](https://crawlee.dev/js/docs/introduction/setting-up.md) (or have a fully-fledged scraper already built), but if you haven't done so yet, take a break, go read through all that, and come back. We'll be waiting... *Oh, you're back already! Let's proceed in making that scraper parallel!* ## Things to consider before parallelizing[​](#things-to-consider-before-parallelizing "Direct link to Things to consider before parallelizing") Before you rush ahead and change your scraper to support parallelization, take a minute to consider the following factors: * Do you plan on scraping so many pages that you need to parallelize your scraper? <!-- --> * For example, if your scraper goes across a few pages, you probably don't need parallelization * But if you scrape a lot of pages, or you scrape pages that take a long time to load, you might want to consider parallelization * Can you parallelize your scraper while not overloading the target website? <!-- --> * For example, if you scrape a website that has a lot of traffic, you don't want to add to that traffic by running multiple scrapers in parallel as that might cause the website to go down for all its users * Do you have the resources available to run multiple scrapers in parallel? 
<!-- --> * When running locally, depending on your scraper type, do you have enough CPU and RAM available to sustain multiple scrapers running in parallel * When running in the cloud, will the extra speed from parallelization be worth the extra cost of running multiple scrapers in parallel? Let's assume you answered yes to all of those. Yes? Yes. Before we go ahead and get to the actual guide, we'd like to ask you to also take a read on [Apify's Ethical Web Scraping](https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/) blog post! Now that we've gone through all that, the guide is split into two parts: converting the initial scraper we built in the [introduction guide](https://crawlee.dev/js/docs/introduction/setting-up.md) to one that prepares requests to be usable in parallel scrapers, and then running scrapers in parallel. Want to see the final result? You can see it on the [Crawlee Parallel Scraping Example](https://github.com/apify/crawlee-parallel-scraping-example) repository! It's the same scraper we built in the [introduction guide](https://crawlee.dev/js/docs/introduction/setting-up.md), but in TypeScript and parallelized! ### But isn't Crawlee already concurrent? What's the difference between concurrency and parallelization?[​](#but-isnt-crawlee-already-concurrent-whats-the-difference-between-concurrency-and-parallelization "Direct link to But isn't Crawlee already concurrent? What's the difference between concurrency and parallelization?") > *Hold on! I've used Crawlee before, and it has a `maxConcurrency` option! What's this for then?!* You're correct, Crawlee already supports scraping in "parallel" (more accurately called concurrent). What that enables is one process having multiple tasks that run in the background at the same time. But, as your scraping operation scales up, you are likely to encounter bottlenecks. These can range from the runtime environment's inability to process more requests simultaneously, to resources like RAM and CPU being maxed out. You can only scale up resources so much before it stops providing a real benefit. This is what people refer to when saying vertical or horizontal scaling. Vertical scaling is when you increase the resources of a single process or machine, while horizontal scaling is when you increase the number of processes or machines. Horizontal scaling, on the other hand, is the kind of scaling (or what we're referring to as "parallelization") we are showcasing in this guide! ## Preparing your scraper for parallelization[​](#preparing-your-scraper-for-parallelization "Direct link to Preparing your scraper for parallelization") One of the best parts of Crawlee is that, for the most part, we do not need to change much to make this happen! Just create the queue that supports locking, enqueue links to it from the initial scraper, then build scrapers that run in parallel that use that queue! ### Creating the request queue with locking support[​](#creating-the-request-queue-with-locking-support "Direct link to Creating the request queue with locking support") The first step in our conversion process will be creating a common file (let's call it `requestQueue.mjs`) that will store the request queue that supports request locking. 
src/requestQueue.mjs ``` import { RequestQueueV2 } from 'crawlee'; // Create the request queue that also supports parallelization let queue; /** * @param {boolean} makeFresh Whether the queue should be cleared before returning it * @returns The queue */ export async function getOrInitQueue(makeFresh = false) { if (queue) { return queue; } queue = await RequestQueueV2.open('shop-urls'); if (makeFresh) { await queue.drop(); queue = await RequestQueueV2.open('shop-urls'); } return queue; } ``` The exported function, `getOrInitQueue`, might seem like it does a lot. In essence, it just ensures the request queue is initialized, and if requested, ensures it starts off with an empty state. ### Adapting our previous scraper to enqueue the product URLs to the new queue[​](#adapting-our-previous-scraper-to-enqueue-the-product-urls-to-the-new-queue "Direct link to Adapting our previous scraper to enqueue the product URLs to the new queue") In the `src/routes.mjs` file of the scraper we previously built, we have a handler for the `CATEGORY` label. Let's adapt that handler to enqueue the product URLs to the new queue we created. Firstly, let's import the `getOrInitQueue` function from the `requestQueue.mjs` file we created earlier. Add the following line at the start of the file: src/routes.mjs ``` import { getOrInitQueue } from './requestQueue.mjs'; ``` Then, replace the `CATEGORY` handler with the following: src/routes.mjs ``` router.addHandler('CATEGORY', async ({ page, enqueueLinks, request, log }) => { log.debug(`Enqueueing pagination for: ${request.url}`); // We are now on a category page. We can use this to paginate through and enqueue all products, // as well as any subsequent pages we find await page.waitForSelector('.product-item > a'); await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL', // <= note the different label, requestQueue: await getOrInitQueue(), // <= note the different request queue }); // Now we need to find the "Next" button and enqueue the next page of results (if it exists) const nextButton = await page.$('a.pagination__next'); if (nextButton) { await enqueueLinks({ selector: 'a.pagination__next', label: 'CATEGORY', // <= note the same label }); } }); ``` Now, let's rename our entry point file `src/main.mjs` to `src/initial-scraper.mjs` and run it. You should see the crawler not scrape any detail pages, but now the URLs are being enqueued to the queue that supports locking! Before we wrap up, let's also add the following line before `crawler.run()`: src/initial-scraper.mjs ``` import { getOrInitQueue } from './requestQueue.mjs'; // Pre-initialize the queue so that we have a blank slate that will get filled out by the crawler await getOrInitQueue(true); ``` We need this to ensure the queue always starts on an empty slate when we run the scraper. But you may not need this in your use case - remember to always experiment and see what works best! And that's it with preparing our initial scraper to save all URLs we want to scrape to the queue that supports locking! ### Creating the parallel scrapers[​](#creating-the-parallel-scrapers "Direct link to Creating the parallel scrapers") Up next, let's build another scraper that will schedule the URLs from the queue to be scraped in parallel! For this, we will be using child processes from Node.js, but you can use any other method you want to run multiple scrapers in parallel. You will need to adjust your code if you use other methods. 
The scraper will fork itself twice (but you can experiment with this), and each fork will re-use the queue we created earlier. The best part? We can re-use the previous router we built for the initial scraper! Yay for code reuse! src/parallel-scraper.mjs ``` import { fork } from 'node:child_process'; import { Configuration, Dataset, PlaywrightCrawler, log } from 'crawlee'; import { router } from './routes.mjs'; import { getOrInitQueue } from './requestQueue.mjs'; // For this example, we will spawn 2 separate processes that will scrape the store in parallel. if (!process.env.IN_WORKER_THREAD) { // This is the main process. We will use this to spawn the worker processes. log.info('Setting up worker processes.'); const currentFile = new URL(import.meta.url).pathname; // Store a promise per worker, so we wait for all to finish before exiting the main process const promises = []; // You can decide how many workers you want to spawn, but keep in mind you can only spawn so many before you overload your machine for (let i = 0; i < 2; i++) { const proc = fork(currentFile, { // Pipe the child process's stdout and stderr to the parent, so we can forward its logs below silent: true, env: { // Share the current process's env across to the newly created process ...process.env, // ...but also tell the process that it's a worker process IN_WORKER_THREAD: 'true', // ...as well as which worker it is WORKER_INDEX: String(i), }, }); proc.on('online', () => { log.info(`Process ${i} is online.`); // Log out what the crawlers are doing // Note: we want to use console.log instead of log.info because we already get formatted output from the crawlers proc.stdout.on('data', (data) => { // eslint-disable-next-line no-console console.log(data.toString()); }); proc.stderr.on('data', (data) => { // eslint-disable-next-line no-console console.error(data.toString()); }); }); proc.on('message', async (data) => { log.debug(`Process ${i} sent data.`, data); await Dataset.pushData(data); }); promises.push( new Promise((resolve) => { proc.once('exit', (code, signal) => { log.info(`Process ${i} exited with code ${code} and signal ${signal}`); resolve(); }); }), ); } await Promise.all(promises); log.info('Crawling complete!'); } else { // This is the worker process. We will use this to scrape the store. // Let's build a logger that will prefix the log messages with the worker index const workerLogger = log.child({ prefix: `[Worker ${process.env.WORKER_INDEX}]` }); // This is better set with CRAWLEE_LOG_LEVEL env var // or a configuration option. This is just for show 😈 workerLogger.setLevel(log.LEVELS.DEBUG); // Disable the automatic purge on start // This is needed when running locally, as otherwise multiple processes will try to clear the default storage (and that will cause clashes) Configuration.set('purgeOnStart', false); // Get the request queue const requestQueue = await getOrInitQueue(false); // Configure crawlee to store the worker-specific data in a separate directory (needs to be done AFTER the queue is initialized when running locally) const config = new Configuration({ storageClientOptions: { localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`, }, }); workerLogger.debug('Setting up crawler.'); const crawler = new PlaywrightCrawler( { log: workerLogger, // Instead of the long requestHandler with // if clauses we provide a router instance. requestHandler: router, // Enable the request locking experiment so that we can actually use the queue. 
experiments: { requestLocking: true, }, // Provide the request queue we've pre-filled in previous steps requestQueue, // Let's also limit the crawler's concurrency, we don't want to overload a single process 🐌 maxConcurrency: 5, }, config, ); await crawler.run(); } ``` We'll also need to make one small change in the `DETAIL` route handler. Instead of calling `context.pushData`, we want to call `process.send`. But why? Since we use child processes, and each worker process has its own storage space, calling `context.pushData` will not work the way we want it to. Instead, we need to send the data back to the parent process, which has the dataset where we want to store the data. This might not be needed depending on your use case! You'll need to experiment and see what works best for you. src/routes.mjs ``` // This replaces the request.label === DETAIL branch of the if clause. router.addHandler('DETAIL', async ({ request, page, log }) => { log.debug(`Extracting data: ${request.url}`); const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440'] const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser' const title = await page.locator('.product-meta h1').textContent(); const sku = await page.locator('span.product-meta__sku-number').textContent(); const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; const results = { url: request.url, manufacturer, title, sku, currentPrice: price, availableInStock: inStock, }; log.debug(`Saving data: ${request.url}`); // Send the data to the parent process // Depending on how you build your crawler, this line could instead be something like `context.pushData()`! Experiment, and see what you can build process.send(results); }); ``` There is a lot of code, so let's break it down: #### The `if` check for `process.env.IN_WORKER_THREAD`[​](#the-if-check-for-processenvin_worker_thread "Direct link to the-if-check-for-processenvin_worker_thread") This checks how the script is being executed. If the environment variable has *any* value, the process assumes it's meant to start scraping. If not, it's considered the **parent** process and will fork copies of itself to do the scraping. #### Why do we create a Promise per worker process?[​](#why-do-we-create-a-promise-per-worker-process "Direct link to Why do we create a Promise per worker process?") We use this to ensure the parent process stays alive until all the worker processes exit. Otherwise, the parent would exit right after spawning the workers, and the workers would lose the ability to communicate with it. You might not need this depending on your use case (maybe you just need to spawn workers and let them process). 
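If it helps to see this pattern in isolation, here is a minimal sketch of the fork-and-wait approach described above, stripped of all crawling logic (the `WORKER_COUNT` constant and the log messages are purely illustrative, and the file is assumed to run as an ES module):

```
import { fork } from 'node:child_process';

const WORKER_COUNT = 2;

if (!process.env.IN_WORKER_THREAD) {
    // Parent branch: spawn the workers and keep one promise per child,
    // so the parent only exits after every worker has finished.
    const currentFile = new URL(import.meta.url).pathname;
    const exitPromises = [];

    for (let i = 0; i < WORKER_COUNT; i++) {
        const proc = fork(currentFile, {
            env: { ...process.env, IN_WORKER_THREAD: 'true', WORKER_INDEX: String(i) },
        });

        exitPromises.push(new Promise((resolve) => proc.once('exit', resolve)));
    }

    await Promise.all(exitPromises);
    console.log('All workers finished.');
} else {
    // Worker branch: this is where the crawler from the example above would run.
    console.log(`Worker ${process.env.WORKER_INDEX} started.`);
}
```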
#### What's with all those `Configuration` calls?[​](#whats-with-all-those-configuration-calls "Direct link to What's with all those `Configuration` calls?") There are three things we need to do for the worker processes: * ensure the default storages do **not** get purged on start, as otherwise we'd lose the queue we prepared * get the queue that supports locking from the same location as the parent process * initialize a separate storage for each worker process so they do not collide with each other In order, that's what these lines do: src/parallel-scraper.mjs ``` // Disable the automatic purge on start (step 1) // This is needed when running locally, as otherwise multiple processes will try to clear the default storage (and that will cause clashes) Configuration.set('purgeOnStart', false); // Get the request queue from the parent process (step 2) const requestQueue = await getOrInitQueue(false); // Configure crawlee to store the worker-specific data in a separate directory (needs to be done AFTER the queue is initialized when running locally) (step 3) const config = new Configuration({ storageClientOptions: { localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`, }, }); ``` #### Enabling the request locking experiment, and telling the crawler to use the worker configuration[​](#enabling-the-request-locking-experiment-and-telling-the-crawler-to-use-the-worker-configuration "Direct link to Enabling the request locking experiment, and telling the crawler to use the worker configuration") You might have noticed several lines highlighted in the code above. Those show how you can enable the request locking experiment, as well as how you provide the request queue to the crawler. You can read more about the experiment by visiting the [request locking experiment](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) page. You might have also noticed we passed a second parameter to the constructor of the crawler, the `config` variable we created earlier. This is needed to ensure the crawler uses the worker-specific storage for its internal state, so the workers do not collide with each other. #### Why do we use `process.send` instead of `context.pushData`?[​](#why-do-we-use-processsend-instead-of-contextpushdata "Direct link to why-do-we-use-processsend-instead-of-contextpushdata") Since we use child processes, and each worker process has its own storage space, calling `context.pushData` will not work the way we want it to (each worker would just push to its own personal dataset that is considered the "default" one). Instead, we need to send the data back to the parent process, which has the dataset where we want to store the data, in a centralized place. Why don't we apply the same logic we did to the request queue to the dataset? This is a very valid question, but it has a simple answer: since each process tracks its own internal state of what the dataset looks like (when we are scraping locally), the worker processes would get out of sync very quickly and would either miss or overwrite data. This is why we send everything back to the parent process, which stores it in one centralized dataset. Depending on your crawler, this might not be an issue! Each use case has its own quirks, but this is something you should keep in mind when building your scraper. 
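To isolate just the messaging part, the following minimal sketch (same assumptions as above, with an illustrative, hard-coded result object) shows how a worker hands a record to the parent and how the parent writes it to the single default dataset:

```
import { fork } from 'node:child_process';
import { Dataset } from 'crawlee';

if (!process.env.IN_WORKER_THREAD) {
    // Parent branch: the parent owns the default Dataset,
    // so every result sent by a worker ends up in one centralized place.
    const proc = fork(new URL(import.meta.url).pathname, {
        env: { ...process.env, IN_WORKER_THREAD: 'true' },
    });

    proc.on('message', async (result) => {
        await Dataset.pushData(result);
    });

    await new Promise((resolve) => proc.once('exit', resolve));
} else {
    // Worker branch: in the real crawler, this call lives inside the DETAIL request handler.
    process.send({ url: 'https://example.com/product', title: 'Example product' });
}
```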
#### Why did we limit the maximum concurrency to `5`?[​](#why-did-we-limit-the-maximum-concurrency-to-5 "Direct link to why-did-we-limit-the-maximum-concurrency-to-5") This question has a two-fold answer: * we don't want to overload the target website with requests, so we limit the number of concurrent requests to a reasonable number per worker process * we don't want to overload the machine that is running the scraper This circles back to the initial paragraph about whether you should parallelize your scraper or not. ## Other questions[​](#other-questions "Direct link to Other questions") #### Couldn't the `initial-scraper` be merged into the `parallel-scraper`?[​](#couldnt-the-initial-scraper-be-merged-into-the-parallel-scraper "Direct link to couldnt-the-initial-scraper-be-merged-into-the-parallel-scraper") Technically, it could! Nothing stops you from first enqueuing all the URLs in the parent process, and then run the worker process logic after to scrape them. We separated them so it's easier to follow and understand what each part does, but you can merge them if you want to. #### Will I benefit from this if I run XYZ scraper / want to scrape XYZ website?[​](#will-i-benefit-from-this-if-i-run-xyz-scraper--want-to-scrape-xyz-website "Direct link to Will I benefit from this if I run XYZ scraper / want to scrape XYZ website?") We don't know! 🤷 What we do know is that first, you should build your scraper to work as a single scraper, then monitor its performance. Do you see it being too slow? Do you scrape many pages, or do the few pages you scrape take a long time to load? If so, then you might benefit from parallelization. When in doubt, follow the list of things to consider before parallelizing at the start of this guide. --- # Proxy Management Copy for LLM [IP address blocking](https://en.wikipedia.org/wiki/IP_address_blocking) is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy to use but powerful tools which can work around IP blocking. The most powerful weapon in our anti IP blocking arsenal is a [proxy server](https://en.wikipedia.org/wiki/Proxy_server). With Crawlee we can use our own proxy servers or proxy servers acquired from third-party providers. Check out the [avoid blocking guide](https://crawlee.dev/js/docs/guides/avoid-blocking.md) for more information about blocking. ## Quick start[​](#quick-start "Direct link to Quick start") If we already have proxy URLs of our own, we can start using them immediately in only a few lines of code. ``` import { ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: [ 'http://proxy-1.com', 'http://proxy-2.com', ] }); const proxyUrl = await proxyConfiguration.newUrl(); ``` Examples of how to use our proxy URLs with crawlers are shown below in [Crawler integration](#crawler-integration) section. ## Proxy Configuration[​](#proxy-configuration "Direct link to Proxy Configuration") All our proxy needs are managed by the [`ProxyConfiguration`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. We create an instance using the `ProxyConfiguration` [`constructor`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#constructor) function based on the provided options. See the [`ProxyConfigurationOptions`](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) for all the possible constructor options. 
### Static proxy list[​](#static-proxy-list "Direct link to Static proxy list") You can provide a static list of proxy URLs to the `proxyUrls` option. The `ProxyConfiguration` will then rotate through the provided proxies. ``` const proxyConfiguration = new ProxyConfiguration({ proxyUrls: [ 'http://proxy-1.com', 'http://proxy-2.com', null // null means no proxy is used ] }); ``` This is the simplest way to use a list of proxies. Crawlee will rotate through the list of proxies in a round-robin fashion. ### Custom proxy function[​](#custom-proxy-function "Direct link to Custom proxy function") The `ProxyConfiguration` class allows you to provide a custom function to pick a proxy URL. This is useful when you want to implement your own logic for selecting a proxy. ``` const proxyConfiguration = new ProxyConfiguration({ newUrlFunction: (sessionId, { request }) => { if (request?.url.includes('crawlee.dev')) { return null; // for crawlee.dev, we don't use a proxy } return 'http://proxy-1.com'; // for all other URLs, we use this proxy } }); ``` The `newUrlFunction` receives two parameters - `sessionId` and `options` - and returns a string containing the proxy URL. The `sessionId` parameter is always provided and allows us to differentiate between different sessions - e.g. when Crawlee recognizes your crawlers are being blocked, it will automatically create a new session with a different id. The `options` parameter is an object containing a [`Request`](https://crawlee.dev/js/api/core/class/Request.md), which is the request that will be made. Note that this object is not always available, for example when we are using the `newUrl` function directly. Your custom function should therefore not rely on the `request` object being present and provide a default behavior when it is not. ### Tiered proxies[​](#tiered-proxies "Direct link to Tiered proxies") You can also provide a list of proxy tiers to the `ProxyConfiguration` class. This is useful when you want to switch between different proxies automatically based on the blocking behavior of the website. warning Note that the `tieredProxyUrls` option requires `ProxyConfiguration` to be used from a crawler instance ([see below](#crawler-integration)). Using this configuration through the `newUrl` calls will not yield the expected results. ``` const proxyConfiguration = new ProxyConfiguration({ tieredProxyUrls: [ [null], // At first, we try to connect without a proxy ['http://okay-proxy.com'], ['http://slightly-better-proxy.com', 'http://slightly-better-proxy-2.com'], ['http://very-good-and-expensive-proxy.com'], ] }); ``` This configuration will start with no proxy, then switch to `http://okay-proxy.com` if Crawlee recognizes we're getting blocked by the target website. If that proxy is also blocked, we will switch to one of the `slightly-better-proxy` URLs. If those are blocked, we will switch to the `very-good-and-expensive-proxy.com` URL. Crawlee also periodically probes lower tier proxies to see if they are unblocked, and if they are, it will switch back to them. 
## Crawler integration[​](#crawler-integration "Direct link to Crawler integration") `ProxyConfiguration` integrates seamlessly into [`HttpCrawler`](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md), [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md), [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) and [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md). * HttpCrawler * CheerioCrawler * JSDOMCrawler * PlaywrightCrawler * PuppeteerCrawler ``` import { HttpCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new HttpCrawler({ proxyConfiguration, // ... }); ``` ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new CheerioCrawler({ proxyConfiguration, // ... }); ``` ``` import { JSDOMCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new JSDOMCrawler({ proxyConfiguration, // ... }); ``` ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new PlaywrightCrawler({ proxyConfiguration, // ... }); ``` ``` import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new PuppeteerCrawler({ proxyConfiguration, // ... }); ``` Our crawlers will now use the selected proxies for all connections. ## IP Rotation and session management[​](#ip-rotation-and-session-management "Direct link to IP Rotation and session management") ​[`proxyConfiguration.newUrl()`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newUrl) allows us to pass a `sessionId` parameter. It will then be used to create a `sessionId`-`proxyUrl` pair, and subsequent `newUrl()` calls with the same `sessionId` will always return the same `proxyUrl`. This is extremely useful in scraping, because we want to create the impression of a real user. See the [session management guide](https://crawlee.dev/js/docs/guides/session-management.md) and [`SessionPool`](https://crawlee.dev/js/api/core/class/SessionPool.md) class for more information on how keeping a real session helps us avoid blocking. When no `sessionId` is provided, our proxy URLs are rotated round-robin. * HttpCrawler * CheerioCrawler * JSDOMCrawler * PlaywrightCrawler * PuppeteerCrawler * Standalone ``` import { HttpCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new HttpCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new CheerioCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... 
}); ``` ``` import { JSDOMCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new JSDOMCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PlaywrightCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ``` import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PuppeteerCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ``` import { ProxyConfiguration, SessionPool } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const sessionPool = await SessionPool.open({ /* opts */ }); const session = await sessionPool.getSession(); const proxyUrl = await proxyConfiguration.newUrl(session.id); ``` ## Inspecting current proxy in Crawlers[​](#inspecting-current-proxy-in-crawlers "Direct link to Inspecting current proxy in Crawlers") `HttpCrawler`, `CheerioCrawler`, `JSDOMCrawler`, `PlaywrightCrawler` and `PuppeteerCrawler` grant access to information about the currently used proxy in their `requestHandler` using a [`proxyInfo`](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) object. With the `proxyInfo` object, we can easily access the proxy URL. * HttpCrawler * CheerioCrawler * JSDOMCrawler * PlaywrightCrawler * PuppeteerCrawler ``` import { HttpCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new HttpCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new CheerioCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` ``` import { JSDOMCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new JSDOMCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PlaywrightCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` ``` import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PuppeteerCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` --- # Request Storage Copy for LLM Crawlee has several request storage types that are useful for specific tasks. The requests are stored on local disk to a directory defined by the `CRAWLEE_STORAGE_DIR` environment variable. If this variable is not defined, by default Crawlee sets `CRAWLEE_STORAGE_DIR` to `./storage` in the current working directory. ## Request queue[​](#request-queue "Direct link to Request queue") The request queue is a storage of URLs to crawl. 
The queue is used for the deep crawling of websites, where we start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each Crawlee project run is associated with a **default request queue**. Typically, it is used to store URLs to crawl in the specific crawler run. Its usage is optional. In Crawlee, the request queue is represented by the [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. The request queue is managed by [`MemoryStorage`](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) class and its data is stored in memory, while also being off-loaded to the local directory specified by the `CRAWLEE_STORAGE_DIR` environment variable as follows: ``` {CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/entries.json ``` note `{QUEUE_ID}` is the name or ID of the request queue. The default queue has ID `default`, unless we override it by setting the `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` environment variable. note `entries.json` contains an array of requests. The following code demonstrates the usage of the request queue: * Usage with Crawler * Explicit usage with Crawler * Basic Operations ``` import { CheerioCrawler } from 'crawlee'; // The crawler will automatically process requests from the queue. // It's used the same way for Puppeteer/Playwright crawlers. const crawler = new CheerioCrawler({ // Note that we're not specifying the requestQueue here async requestHandler({ $, crawler, enqueueLinks }) { // Add new request to the queue await crawler.addRequests([{ url: 'https://example.com/new-page' }]); // Add links found on page to the queue await enqueueLinks(); }, }); // Add the initial requests. // Note that we are not opening the request queue explicitly before await crawler.addRequests([ { url: 'https://example.com/1' }, { url: 'https://example.com/2' }, { url: 'https://example.com/3' }, // ... ]); // Run the crawler await crawler.run(); ``` ``` import { RequestQueue, CheerioCrawler } from 'crawlee'; // Open the default request queue associated with the current run const requestQueue = await RequestQueue.open(); // Enqueue the initial requests await requestQueue.addRequests([ { url: 'https://example.com/1' }, { url: 'https://example.com/2' }, { url: 'https://example.com/3' }, // ... ]); // The crawler will automatically process requests from the queue. // It's used the same way for Puppeteer/Playwright crawlers const crawler = new CheerioCrawler({ requestQueue, async requestHandler({ $, request, enqueueLinks }) { // Add new request to the queue await requestQueue.addRequests([{ url: 'https://example.com/new-page' }]); // Add links found on page to the queue await enqueueLinks(); }, }); // Run the crawler await crawler.run(); ``` ``` import { RequestQueue } from 'crawlee'; // Open the default request queue associated with the crawler run const requestQueue = await RequestQueue.open(); // Enqueue the initial batch of requests (could be an array of just one) await requestQueue.addRequests([ { url: 'https://example.com/1' }, { url: 'https://example.com/2' }, { url: 'https://example.com/3' }, ]); // Open the named request queue const namedRequestQueue = await RequestQueue.open('named-queue'); // Remove the named request queue await namedRequestQueue.drop(); ``` To see more detailed example of how to use the request queue with a crawler, see the [Puppeteer Crawler](https://crawlee.dev/js/docs/examples/puppeteer-crawler.md) example. 
## Request list[​](#request-list "Direct link to Request list") The request list is not a storage per se - it represents the list of URLs to crawl that is stored in the crawler run's memory (or optionally in the default [Key-Value Store](https://crawlee.dev/js/docs/guides/result-storage.md#key-value-store) associated with the run, if specified). The list is used for crawling a large number of URLs when we know all the URLs that should be visited by the crawler up front and no new URLs will be added during the run. The URLs can be provided either in code or parsed from a text file hosted on the web. The request list is created exclusively for the crawler run and only if its usage is explicitly specified in the code. Its usage is optional. In Crawlee, the request list is represented by the [`RequestList`](https://crawlee.dev/js/api/core/class/RequestList.md) class. The following code demonstrates basic operations of the request list: ``` import { RequestList, PuppeteerCrawler } from 'crawlee'; // Prepare the sources array with URLs to visit const sources = [ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]; // Open the request list. // List name is used to persist the sources and the list state in the key-value store const requestList = await RequestList.open('my-list', sources); // The crawler will automatically process requests from the list // It's used the same way for Cheerio/Playwright crawlers. const crawler = new PuppeteerCrawler({ requestList, async requestHandler({ page, request }) { // Process the page (extract data, take page screenshot, etc). // No more requests can be added to the request list here }, }); ``` ## Which one to choose?[​](#which-one-to-choose "Direct link to Which one to choose?") When using the Request queue, we would normally have several start URLs (e.g. category pages of an e-commerce website) and then recursively add more (e.g. individual item pages) to the queue programmatically; the queue supports dynamically adding and removing requests. No more URLs can be added to the Request list after its initialization, as it is immutable; URLs cannot be removed from the list either. On the other hand, the Request queue is not optimized for adding or removing numerous URLs in a batch. This is technically possible, but requests are added one by one to the queue, and thus it would take significant time with a larger number of requests. The Request list, however, can contain even millions of URLs, and adding them to the list takes significantly less time compared to the queue. Note that the Request queue and Request list can be used together by the same crawler. In such cases, each request from the Request list is enqueued into the Request queue first (to the foremost position in the queue, even if the Request queue is not empty) and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there are numerous initial URLs, but more URLs would be added dynamically by the crawler. tip In Crawlee, there is not much need to combine the request queue together with the request list (although it's technically possible). Previously there was no way to add the initial requests to the queue in batches (to add an array of requests), i.e.
we could have only added the requests one by one to the queue with the help of [`addRequest()`](https://crawlee.dev/js/api/core/class/RequestQueue.md#addRequest) function. However, now we could use the [`addRequests()`](https://crawlee.dev/js/api/core/class/RequestQueue.md#addRequests) function, which adds requests in batches. Thus, instead of combining the request queue and the request list, we can use only the request queue for such use-cases now. See the examples below. * Request Queue * Request Queue + Request List ``` // This is the suggested way. // Note that we are not using the request list at all, // and not using the request queue explicitly here. import { PuppeteerCrawler } from 'crawlee'; // Prepare the sources array with URLs to visit (it can contain millions of URLs) const sources = [ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, // ... ]; // The crawler will automatically process requests from the queue. // It's used the same way for Cheerio/Playwright crawlers const crawler = new PuppeteerCrawler({ async requestHandler({ crawler, enqueueLinks }) { // Add new request to the queue await crawler.addRequests(['http://www.example.com/new-page']); // Add links found on page to the queue await enqueueLinks(); // The requests above would be added to the queue // and would be processed after the initial requests are processed. }, }); // Add the initial sources array to the request queue // and run the crawler await crawler.run(sources); ``` ``` // This is technically correct, but // we need to explicitly open/use both the request queue and the request list. // We suggest using the request queue and batch add the requests instead. import { RequestList, RequestQueue, PuppeteerCrawler } from 'crawlee'; // Prepare the sources array with URLs to visit (it can contain millions of URLs) const sources = [ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, // ... ]; // Open the request list with the initial sources array const requestList = await RequestList.open('my-list', sources); // Open the default request queue. It's not necessary to add any requests to the queue const requestQueue = await RequestQueue.open(); // The crawler will automatically process requests from the list and the queue. // It's used the same way for Cheerio/Playwright crawlers const crawler = new PuppeteerCrawler({ requestList, requestQueue, // Each request from the request list is enqueued to the request queue one by one. // At this point request with the same URL would exist in the list and the queue async requestHandler({ crawler, enqueueLinks }) { // Add new request to the queue await crawler.addRequests(['http://www.example.com/new-page']); // Add links found on page to the queue await enqueueLinks(); // The requests above would be added to the queue (but not to the list) // and would be processed after the request list is empty. // No more requests could be added to the list here }, }); // Run the crawler await crawler.run(); ``` ## Cleaning up the storages[​](#cleaning-up-the-storages "Direct link to Cleaning up the storages") Default storages are purged before the crawler starts if not specified otherwise. This happens as early as when we try to open some storage (e.g. via `RequestQueue.open()`) or when we try to work with a default storage via one of the helper methods (e.g. `crawler.addRequests()` that under the hood calls `RequestQueue.open()`). 
If we don't work with storages explicitly in our code, the purging will eventually happen when the `run` method of our crawler is executed. In case we need to purge the storages sooner, we can use the [`purgeDefaultStorages()`](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) helper explicitly: ``` import { purgeDefaultStorages } from 'crawlee'; await purgeDefaultStorages(); ``` Calling this function will clean up the default request storage directory (and also the request list stored in default key-value store). This is a shortcut for running (optional) `purge` method on the [`StorageClient`](https://crawlee.dev/js/api/core/interface/StorageClient.md) interface, in other words it will call the `purge` method of the underlying storage implementation we are currently using. You can make sure the storage is purged only once for a given execution context if you set `onlyPurgeOnce` to `true` in the `options` object. --- # Result Storage Copy for LLM Crawlee has several result storage types that are useful for specific tasks. The data is stored on a local disk to the directory defined by the `CRAWLEE_STORAGE_DIR` environment variable. If this variable is not defined, by default Crawlee sets `CRAWLEE_STORAGE_DIR` to `./storage` in the current working directory. Crawlee storage is managed by [`MemoryStorage`](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) class. During the crawler run all information is stored in memory, while also being off-loaded to the local files in respective storage type folders. ## Key-value store[​](#key-value-store "Direct link to Key-value store") The key-value store is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots of web pages, PDFs or to persist the state of crawlers. Each Crawlee project run is associated with a **default key-value store**. By convention, the project input and output are stored in the default key-value store under the `INPUT` and `OUTPUT` keys respectively. Typically, both input and output are JSON files, although they could be any other format. In Crawlee, the key-value store is represented by the [`KeyValueStore`](https://crawlee.dev/js/api/core/class/KeyValueStore.md) class. In order to simplify access to the default key-value store, Crawlee also provides [`KeyValueStore.getValue()`](https://crawlee.dev/js/api/core/class/KeyValueStore.md#getValue) and [`KeyValueStore.setValue()`](https://crawlee.dev/js/api/core/class/KeyValueStore.md#setValue) functions. The data is stored in the directory specified by the `CRAWLEE_STORAGE_DIR` environment variable as follows: ``` {CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT} ``` note `{STORE_ID}` is the name or the ID of the key-value store. The default key-value store has ID `default`, unless we override it by setting the `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` environment variable. The `{KEY}` is the key of the record and `{EXT}` corresponds to the MIME content type of the data value. 
The following code demonstrates basic operations of key-value stores: ``` import { KeyValueStore } from 'crawlee'; // Get the INPUT from the default key-value store const input = await KeyValueStore.getInput(); // Write the OUTPUT to the default key-value store await KeyValueStore.setValue('OUTPUT', { myResult: 123 }); // Open a named key-value store const store = await KeyValueStore.open('some-name'); // Write a record to the named key-value store. // JavaScript object is automatically converted to JSON, // strings and binary buffers are stored as they are await store.setValue('some-key', { foo: 'bar' }); // Read a record from the named key-value store. // Note that JSON is automatically parsed to a JavaScript object, // text data is returned as a string, and other data is returned as binary buffer const value = await store.getValue('some-key'); // Delete a record from the named key-value store await store.setValue('some-key', null); ``` To see a real-world example of how to get the input from the key-value store, see the [Screenshots](https://crawlee.dev/js/docs/examples/capture-screenshot.md) example. ## Dataset[​](#dataset "Direct link to Dataset") Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. Dataset can be imagined as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - we can only add new records to it, but we cannot modify or remove existing records. Each Crawlee project run is associated with a **default dataset**. Typically, it is used to store crawling results specific for the crawler run. Its usage is optional. In Crawlee, the dataset is represented by the [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md) class. In order to simplify writes to the default dataset, Crawlee also provides the [`Dataset.pushData()`](https://crawlee.dev/js/api/core/class/Dataset.md#pushData) function. The data is stored in the directory specified by the `CRAWLEE_STORAGE_DIR` environment variable as follows: ``` {CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` note `{DATASET_ID}` is the name or the ID of the dataset. The default dataset has ID `default`, unless we override it by setting the `CRAWLEE_DEFAULT_DATASET_ID` environment variable. Each dataset item is stored as a separate JSON file, where `{INDEX}` is a zero-based index of the item in the dataset. The following code demonstrates basic operations of the dataset: ``` import { Dataset } from 'crawlee'; // Write a single row to the default dataset await Dataset.pushData({ col1: 123, col2: 'val2' }); // Open a named dataset const dataset = await Dataset.open('some-name'); // Write a single row await dataset.pushData({ foo: 'bar' }); // Write multiple rows await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]); ``` To see how to use the dataset to store crawler results, see the [Cheerio Crawler](https://crawlee.dev/js/docs/examples/cheerio-crawler.md) example. ## Cleaning up the storages[​](#cleaning-up-the-storages "Direct link to Cleaning up the storages") Default storages are purged before the crawler starts if not specified otherwise. This happens as early as when we try to open some storage (e.g. via `Dataset.open()`) or when we try to work with a default storage via one of the helper methods (e.g. `Dataset.pushData()` that under the hood calls `Dataset.open()`). 
If we don't work with storages explicitly in our code, the purging will eventually happen when the `run` method of our crawler is executed. In case we need to purge the storages sooner, we can use the [`purgeDefaultStorages()`](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) helper explicitly: ``` import { purgeDefaultStorages } from 'crawlee'; await purgeDefaultStorages(); ``` Calling this function will clean up the default results storage directories except the `INPUT` key in default key-value store directory. This is a shortcut for running (optional) `purge` method on the [`StorageClient`](https://crawlee.dev/js/api/core/interface/StorageClient.md) interface, in other words it will call the `purge` method of the underlying storage implementation we are currently using. In addition, this method will make sure the storage is purged only once for a given execution context, so it is safe to call it multiple times. --- # Running in web server Copy for LLM Most of the time, Crawlee jobs are run as batch jobs. You have a list of URLs you want to scrape every week or you might want to scrape a whole website once per day. After the scrape, you send the data to your warehouse for analytics. Batch jobs are efficient because they can use [Crawlee's built-in autoscaling](https://crawlee.dev/js/docs/guides/scaling-crawlers.md) to fully utilize the resources you have available. But sometimes you have a use-case where you need to return scrape data as soon as possible. There might be a user waiting on the other end so every millisecond counts. This is where running Crawlee in a web server comes in. We will build a simple HTTP server that receives a page URL and returns the page title in the response. We will base this guide on the approach used in [Apify's Super Scraper API repository](https://github.com/apify/super-scraper) which maps incoming HTTP requests to Crawlee [Request](https://crawlee.dev/js/api/core/class/Request.md). ## Set up a web server[​](#set-up-a-web-server "Direct link to Set up a web server") There are many popular web server frameworks for Node.js, such as Express, Koa, Fastify, and Hapi but in this guide, we will use the built-in `http` Node.js module to keep things simple. This will be our core server setup: ``` import { createServer } from 'http'; import { log } from 'crawlee'; const server = createServer(async (req, res) => { log.info(`Request received: ${req.method} ${req.url}`); res.writeHead(200, { 'Content-Type': 'text/plain' }); // We will return the page title here later instead res.end('Hello World\n'); }); server.listen(3000, () => { log.info('Server is listening for user requests'); }); ``` ## Create the Crawler[​](#create-the-crawler "Direct link to Create the Crawler") We will create a standard [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) and use the [`keepAlive: true`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#keepAlive) option to keep the crawler running even if there are no requests currently in the [Request Queue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This way it will always be waiting for new requests to come in. 
``` import { CheerioCrawler, log } from 'crawlee'; const crawler = new CheerioCrawler({ keepAlive: true, requestHandler: async ({ request, $ }) => { const title = $('title').text(); // We will send the response here later log.info(`Page title: ${title} on ${request.url}`); }, }); ``` ## Glue it together[​](#glue-it-together "Direct link to Glue it together") Now we need to glue the server and the crawler together using the mapping of Crawlee Requests to HTTP responses discussed above. The whole program is actually quite simple. For a production-grade service, you will need to improve error handling, logging, and monitoring, but this is a good starting point. src/web-server.mjs ``` import { randomUUID } from 'node:crypto'; import { CheerioCrawler, log } from 'crawlee'; import { createServer } from 'http'; // We will bind an HTTP response that we want to send to the Request.uniqueKey const requestsToResponses = new Map(); const crawler = new CheerioCrawler({ keepAlive: true, requestHandler: async ({ request, $ }) => { const title = $('title').text(); log.info(`Page title: ${title} on ${request.url}, sending response`); // We will pick the response from the map and send it to the user // We know the response is there with this uniqueKey const httpResponse = requestsToResponses.get(request.uniqueKey); httpResponse.writeHead(200, { 'Content-Type': 'application/json' }); httpResponse.end(JSON.stringify({ title })); // We can delete the response from the map now to free up memory requestsToResponses.delete(request.uniqueKey); }, }); const server = createServer(async (req, res) => { // We parse the requested URL from the query parameters, e.g. localhost:3000/?url=https://example.com const urlObj = new URL(req.url, 'http://localhost:3000'); const requestedUrl = urlObj.searchParams.get('url'); log.info(`HTTP request received for ${requestedUrl}, adding to the queue`); if (!requestedUrl) { log.error('No URL provided as query parameter, returning 400'); res.writeHead(400, { 'Content-Type': 'application/json' }); res.end(JSON.stringify({ error: 'No URL provided as query parameter' })); return; } // We will add it first to the map and then enqueue it to the crawler that immediately processes it // uniqueKey must be random so that the same URL can be processed again const crawleeRequest = { url: requestedUrl, uniqueKey: randomUUID() }; requestsToResponses.set(crawleeRequest.uniqueKey, res); await crawler.addRequests([crawleeRequest]); }); // Now we start the server, the crawler and wait for incoming connections server.listen(3000, () => { log.info('Server is listening for user requests'); }); await crawler.run(); ``` --- # Scaling our crawlers As we build our crawler, we might want to control how many requests we make to the website at a time. Crawlee provides several options to fine-tune how many parallel requests should be made at any time, how many requests should be made per minute, and how scaling should work based on the available system resources. tip All of these options are available on all crawlers Crawlee provides, but for this guide we'll be using the [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). We can see all options that are available [`here`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md). ## `maxRequestsPerMinute`[​](#maxrequestsperminute "Direct link to maxrequestsperminute") This controls how many total requests can be made per minute.
It counts the number of requests done every second, to ensure there is not a burst of requests at the `maxConcurrency` limit followed by a long period of waiting. By default, it is set to `Infinity`, which means the crawler will keep going up to the `maxConcurrency`. We would set this if we wanted our crawler to work at full throughput, but also not keep hitting the website we're crawling with non-stop requests. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Let the crawler know it can run up to 100 requests concurrently at any time maxConcurrency: 100, // ...but also ensure the crawler never exceeds 250 requests per minute maxRequestsPerMinute: 250, }); ``` ## `minConcurrency` and `maxConcurrency`[​](#minconcurrency-and-maxconcurrency "Direct link to minconcurrency-and-maxconcurrency") These control how many parallel requests can be run at any time. By default, crawlers will start with one parallel request at a time and scale up over time to a maximum of 200 parallel requests. Don't set `minConcurrency` too high! Setting this option too high compared to the available system resources will make your crawler run extremely slow or might even crash. It's recommended to leave it at the provided default value and let the crawler scale up and down automatically based on available resources instead. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Start the crawler right away and ensure there will always be 5 concurrent requests run at any time minConcurrency: 5, // Ensure the crawler doesn't exceed 15 concurrent requests run at any time maxConcurrency: 15, }); ``` ## Advanced options[​](#advanced-options "Direct link to Advanced options") While the options above should be enough for most users, if we wanted to get super deep into the configuration of the autoscaling pool (the internal utility in Crawlee that allows crawlers to scale up and down), we can do so through the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#autoscaledPoolOptions) object available on crawler options. Complex options up ahead! This section is super advanced and, unless you test the changes extensively and know what you're doing, it's better to leave these options at their defaults, as they are most likely going to work fine without much fuss. With that warning aside, if we're feeling adventurous, this is how we would pass these options when using a crawler: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Pass in advanced options by providing them in the autoscaledPoolOptions autoscaledPoolOptions: { // ... }, }); ``` ### `desiredConcurrency`[​](#desiredconcurrency "Direct link to desiredconcurrency") This option specifies the number of requests that should be running in parallel at the start of the crawler, assuming that many are available. It defaults to the same value as `minConcurrency`. ### `desiredConcurrencyRatio`[​](#desiredconcurrencyratio "Direct link to desiredconcurrencyratio") The minimum ratio of concurrency to reach before more scaling up is allowed (a number between `0` and `1`). By default, it is set to `0.95`. We can think of this as the point where the autoscaling pool can attempt to scale up (or down), monitor if there are any changes, and correct them if necessary.
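As a quick illustration, this is a minimal sketch of how the two options above could be passed through `autoscaledPoolOptions` (the numbers are illustrative, not recommendations):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 50,
    autoscaledPoolOptions: {
        // Start with 10 parallel requests instead of scaling up from the minimum
        desiredConcurrency: 10,
        // Require 90% of the desired concurrency to be reached before scaling up further
        desiredConcurrencyRatio: 0.9,
    },
});
```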
### `scaleUpStepRatio` and `scaleDownStepRatio`[​](#scaleupstepratio-and-scaledownstepratio "Direct link to scaleupstepratio-and-scaledownstepratio") These values define the fractional amount of desired concurrency to be added or subtracted as the autoscaling pool scales up or down. Both of these values default to `0.05`. Every time the autoscaled pool attempts to scale up or down, this value will be added to or subtracted from the current concurrency and, based on the [`desiredConcurrencyRatio`](#desiredconcurrencyratio) and [`maxConcurrency`](#minconcurrency-and-maxconcurrency), determines how many requests can run concurrently. ### `maybeRunIntervalSecs`[​](#mayberunintervalsecs "Direct link to mayberunintervalsecs") Indicates how often the autoscaling pool should check if more requests can be started and, if so, start a new request if any are available. This value is represented in seconds, and defaults to `0.5`. info Changing this has no effect on requests that are fired immediately after the previous ones are finished. However, it will influence how fast new requests will be started after the autoscaled pool scales up. ### `loggingIntervalSecs`[​](#loggingintervalsecs "Direct link to loggingintervalsecs") This option lets us control how often the autoscaled pool should log its current state (the current concurrency ratio, desired ratios, whether the system is overloaded and so on). We can disable logging altogether by setting this to `null`. By default, it is set to `60` seconds. ### `autoscaleIntervalSecs`[​](#autoscaleintervalsecs "Direct link to autoscaleintervalsecs") This option lets us control how often the autoscaling pool should check if it can and should scale up or down. This value is represented in seconds, and defaults to `10`. tip It's recommended you keep this value between `5` and `20` seconds. Be careful with how low, or high, you set this option Setting this option to a value that's too low might have a severe impact on our crawling performance. Conversely, setting this to a value that's too high might mean we leave performance on the table that could have been used for crawling more requests instead. With that said, if you configure this alongside [`scaleUpStepRatio` and `scaleDownStepRatio`](#scaleupstepratio-and-scaledownstepratio), you could make your crawler scale up at a slower interval, but with more requests at a time when it does. ### `maxTasksPerMinute`[​](#maxtasksperminute "Direct link to maxtasksperminute") This controls how many total requests can be made per minute. It counts the number of requests done every second, to ensure there is not a burst of requests at the `maxConcurrency` limit followed by a long period of waiting. By default, it is set to `Infinity`, which means the crawler will keep going up to the `maxConcurrency`. We would set this if we wanted our crawler to work at full throughput, but also not keep hitting the website we're crawling with non-stop requests. info This option can also be set by specifying [`maxRequestsPerMinute`](#maxrequestsperminute) in your crawler options, as it is a shortcut for visibility and ease of access. --- # Session Management ​[`SessionPool`](https://crawlee.dev/js/api/core/class/SessionPool.md) is a class that allows us to handle the rotation of proxy IP addresses along with cookies and other custom settings in Crawlee.
The main benefit of using Session pool is that we can filter out blocked or non-working proxies, so our actor does not retry requests over known blocked/non-working proxies. Another benefit of using SessionPool is that we can store information tied tightly to an IP address, such as cookies, auth tokens, and particular headers. Having our cookies and other identifiers used only with a specific IP will reduce the chance of being blocked. The last but not least benefit is the even rotation of IP addresses - SessionPool picks the session randomly, which should prevent burning out a small pool of available IPs. Check out the [avoid blocking guide](https://crawlee.dev/js/docs/guides/avoid-blocking.md) for more information about blocking. Now let's take a look at the examples of how to use Session pool: * with [`BasicCrawler`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md); * with [`HttpCrawler`](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md); * with [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md); * with [`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md); * with [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md); * with [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md); * without a crawler (standalone usage to manage sessions manually). - BasicCrawler - HttpCrawler - CheerioCrawler - JSDOMCrawler - PlaywrightCrawler - PuppeteerCrawler - Standalone ``` import { BasicCrawler, ProxyConfiguration } from 'crawlee'; import { gotScraping } from 'got-scraping'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new BasicCrawler({ // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration. sessionPoolOptions: { maxPoolSize: 100 }, async requestHandler({ request, session }) { const { url } = request; const requestOptions = { url, // We use session id in order to have the same proxyUrl // for all the requests using the same session. proxyUrl: await proxyConfiguration.newUrl(session.id), throwHttpErrors: false, headers: { // If you want to use the cookieJar. // This way you get the Cookie headers string from session. Cookie: session.getCookieString(url), }, }; let response; try { response = await gotScraping(requestOptions); } catch (e) { if (e === 'SomeNetworkError') { // If a network error happens, such as timeout, socket hangup, etc. // There is usually a chance that it was just bad luck // and the proxy works. No need to throw it away. session.markBad(); } throw e; } // Automatically retires the session based on response HTTP status code. session.retireOnBlockedStatusCodes(response.statusCode); if (response.body.blocked) { // You are sure it is blocked. // This will throw away the session. session.retire(); } // Everything is ok, you can get the data. // No need to call session.markGood -> BasicCrawler calls it for you. // If you want to use the CookieJar in session you need. session.setCookiesFromResponse(response); }, }); ``` ``` import { HttpCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new HttpCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration. 
sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookie header to request automatically (default is true). persistCookiesPerSession: true, async requestHandler({ session, body }) { const title = body.match(/<title(?:.*?)>(.*?)<\/title>/)?.[1]; if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in BasicCrawler. } }, }); ``` ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new CheerioCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration. sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookie header to request automatically (default is true). persistCookiesPerSession: true, async requestHandler({ session, $ }) { const title = $('title').text(); if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in BasicCrawler. } }, }); ``` ``` import { JSDOMCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new JSDOMCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration. sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookie header to request automatically (default is true). persistCookiesPerSession: true, async requestHandler({ session, window }) { const title = window.document.title; if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in BasicCrawler. } }, }); ``` ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PlaywrightCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookies to page before navigation automatically (default is true). persistCookiesPerSession: true, async requestHandler({ page, session }) { const title = await page.title(); if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in PlaywrightCrawler. 
} }, }); ``` ``` import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PuppeteerCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookies to page before navigation automatically (default is true). persistCookiesPerSession: true, async requestHandler({ page, session }) { const title = await page.title(); if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in PuppeteerCrawler. } }, }); ``` ``` import { SessionPool } from 'crawlee'; // Override the default Session pool configuration. const sessionPoolOptions = { maxPoolSize: 100, }; // Open Session Pool. const sessionPool = await SessionPool.open(sessionPoolOptions); // Get session. const session = await sessionPool.getSession(); // Increase the errorScore. session.markBad(); // Throw away the session. session.retire(); // Lower the errorScore and mark the session good. session.markGood(); ``` These are the basics of configuring SessionPool. Please bear in mind that a Session pool needs time to find working IPs and build up the pool, so we will probably see a lot of errors until it stabilizes. --- # TypeScript Projects Crawlee is built with TypeScript, which means it provides type definitions directly in the package. This allows writing code with auto-completion for TypeScript and JavaScript code alike. Besides that, projects written in TypeScript can take advantage of compile-time type-checking and avoid many coding mistakes, while providing documentation for functions, parameters and return values. It also helps a lot with refactoring and ensures that as few bugs as possible sneak through. ## Setting up a TypeScript project[​](#setting-up-a-typescript-project "Direct link to Setting up a TypeScript project") To use TypeScript in our projects, we'll need the following prerequisites: 1. TypeScript compiler `tsc` installed somewhere: ``` npm install --save-dev typescript ``` TypeScript can be a development dependency in our project, as shown above. There's no need to pollute the production environment or the system's global repository with TypeScript. 2. A build script invoking `tsc` and a correctly specified `main` entry point defined in the `package.json` (pointing to the built code): ``` { "scripts": { "build": "tsc" }, "main": "dist/main.js" } ``` 3. Type declarations for Node.js, so we can take advantage of type-checking in all the features we'll use: ``` npm install --save-dev @types/node ``` 4. TypeScript configuration file allowing `tsc` to understand the project layout and the features used in the project: > We are extending [`@apify/tsconfig`](https://github.com/apify/apify-tsconfig); it contains [the set of rules](https://github.com/apify/apify-tsconfig/blob/main/tsconfig.json) we believe are worth following. > To be able to use the feature called [top-level await](https://blog.saeloun.com/2021/11/25/ecmascript-top-level-await.html), we will need to set the `module` and `target` compiler options to `ES2022` or above.
This will make the project compile to [ECMAScript Modules](https://nodejs.org/api/esm.html). tsconfig.json ``` { "extends": "@apify/tsconfig", "compilerOptions": { "module": "ES2022", "target": "ES2022", "outDir": "dist" }, "include": [ "./src/**/*" ] } ``` Place the content above inside a `tsconfig.json` in the root folder. Also, to enjoy using the types in `.js` source files, VSCode users that are using JavaScript should create a `jsconfig.json` with the same content and add `"checkJs": true` to `"compilerOptions"`. > If we want to use one of the browser crawlers, we will also need to add `"lib": ["DOM"]` to the compiler options. Ensure that you have installed `@apify/tsconfig` ``` npm install --save-dev @apify/tsconfig ``` ### Running the project with `ts-node`[​](#running-the-project-with-ts-node "Direct link to running-the-project-with-ts-node") During development, it's handy to run the project directly instead of compiling the TypeScript code to JavaScript every time. We can use `ts-node` for that; just install it as a dev dependency and add a new NPM script: ``` npm install --save-dev ts-node ``` > As mentioned above, our project will be compiled to use ES Modules. Because of this, we need to use the `ts-node-esm` binary. > We use the `-T` or `--transpileOnly` flag; this means the code will **not** be type-checked, which results in faster compilation. If you don't mind the added time and want to do the type checking, just remove this flag. package.json ``` { "scripts": { "start:dev": "ts-node-esm -T src/main.ts" } } ``` ### Running in production[​](#running-in-production "Direct link to Running in production") To run the project in production, we first need to compile it via the build script. After that, we will have the compiled JavaScript code in the `dist` directory, and we can use `node dist/main.js` to run it. package.json ``` { "scripts": { "start:prod": "node dist/main.js" } } ``` ## Docker build[​](#docker-build "Direct link to Docker build") For the `Dockerfile`, we recommend using a multi-stage build, so we don't install dev dependencies like TypeScript in the final image: Dockerfile ``` # using multistage build, as we need dev deps to build the TS source code FROM apify/actor-node:20 AS builder # copy all files, install all dependencies (including dev deps) and build the project COPY . ./ RUN npm install --include=dev \ && npm run build # create final image FROM apify/actor-node:20 # copy only necessary files COPY --from=builder /usr/src/app/package*.json ./ COPY --from=builder /usr/src/app/dist ./dist # install only prod deps RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional # run compiled code CMD npm run start:prod ``` ### Putting it all together[​](#putting-it-all-together "Direct link to Putting it all together") Let's wrap it all up. In addition to the scripts we described above, we also need to set `"type": "module"` in the `package.json` to be able to use the top-level await described above. For convenience, we will have 3 `start` scripts; the default one will be an alias for `start:dev`, which is our `ts-node` script that does not require compilation (nor type checking). The production script (`start:prod`) is then used in the `Dockerfile`, after an explicit `npm run build` call.
package.json ``` { "name": "my-crawlee-project", "type": "module", "main": "dist/main.js", "dependencies": { "crawlee": "3.0.0" }, "devDependencies": { "@apify/tsconfig": "^0.1.0", "@types/node": "^18.14.0", "ts-node": "^10.8.0", "typescript": "^4.7.4" }, "scripts": { "start": "npm run start:dev", "start:prod": "node dist/main.js", "start:dev": "ts-node-esm -T src/main.ts", "build": "tsc" } } ``` --- # Introduction Copy for LLM Crawlee covers your crawling and scraping end-to-end and helps you **build reliable scrapers. Fast.** Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it. ## What you will learn[​](#what-you-will-learn "Direct link to What you will learn") The goal of the introduction is to provide a step-by-step guide to the most important features of Crawlee. It will walk you through creating the simplest of crawlers that only prints text to console, all the way up to a full-featured scraper that collects links from a website and extracts data. ## 🛠 Features[​](#-features "Direct link to 🛠 Features") * Single interface for **HTTP and headless browser** crawling * Persistent **queue** for URLs to crawl (breadth & depth first) * Pluggable **storage** of both tabular data and files * Automatic **scaling** with available system resources * Integrated **proxy rotation** and session management * Lifecycles customizable with **hooks** * **CLI** to bootstrap your projects * Configurable **routing**, **error handling** and **retries** * **Dockerfiles** ready to deploy * Written in **TypeScript** with generics ### 👾 HTTP crawling[​](#-http-crawling "Direct link to 👾 HTTP crawling") * Zero config **HTTP2 support**, even for proxies * Automatic generation of **browser-like headers** * Replication of browser **TLS fingerprints** * Integrated fast **HTML parsers**. Cheerio and JSDOM * Yes, you can scrape **JSON APIs** as well ### 💻 Real browser crawling[​](#-real-browser-crawling "Direct link to 💻 Real browser crawling") * JavaScript **rendering** and **screenshots** * **Headless** and **headful** support * Zero-config generation of **human-like fingerprints** * Automatic **browser management** * Use **Playwright** and **Puppeteer** with the same interface * **Chrome**, **Firefox**, **Webkit** and many others ## Next steps[​](#next-steps "Direct link to Next steps") Next, you will install Crawlee and learn how to bootstrap projects with the Crawlee CLI. --- # Adding more URLs Copy for LLM Previously you've built a very simple crawler that downloads HTML of a single page, reads its title and prints it to the console. This is the original source code: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); } }) await crawler.run(['https://crawlee.dev']); ``` Now you'll use the example from the previous section and improve on it. You'll add more URLs to the queue and thanks to that the crawler will keep going, finding new links, enqueuing them into the `RequestQueue` and then scraping them. 
## How crawling works[​](#how-crawling-works "Direct link to How crawling works") The process is simple: 1. Find new links on the page. 2. Filter only those pointing to the same domain, in this case `crawlee.dev`. 3. Enqueue (add) them to the `RequestQueue`. 4. Visit the newly enqueued links. 5. Repeat the process. In the following paragraphs you will learn about the [`enqueueLinks`](https://crawlee.dev/js/api/core/function/enqueueLinks.md) function which simplifies crawling to a single function call. For comparison and learning purposes we will show an equivalent solution written without `enqueueLinks` in the second code tab. `enqueueLinks` context awareness The `enqueueLinks` function is context aware. It means that it will read the information about the currently crawled page from the context, and you don't need to explicitly provide any arguments. It will find the links using the Cheerio function `$` and automatically add the links to the running crawler's `RequestQueue`. ## Limit your crawls with `maxRequestsPerCrawl`[​](#limit-your-crawls-with-maxrequestspercrawl "Direct link to limit-your-crawls-with-maxrequestspercrawl") When you're just testing your code or when your crawler could potentially find millions of links, it's very useful to set a maximum limit of crawled pages. The option is called `maxRequestsPerCrawl`, is available in all crawlers, and you can set it like this: ``` const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 20, // ... }); ``` This means that no new requests will be started after the 20th request is finished. The actual number of processed requests might be a little higher thanks to parallelization, because the running requests won't be forcefully aborted. It's not even possible in most cases. ## Finding new links[​](#finding-new-links "Direct link to Finding new links") There are numerous approaches to finding links to follow when crawling the web. For our purposes, we will be looking for `<a>` elements that contain the `href` attribute because that's what you need in most cases. For example: ``` <a href="https://crawlee.dev/js/docs/introduction">This is a link to Crawlee introduction</a> ``` Since this is the most common case, it is also the `enqueueLinks` default. * with enqueueLinks * without enqueueLinks src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Let's limit our crawls to make our // tests shorter and safer. maxRequestsPerCrawl: 20, // enqueueLinks is an argument of the requestHandler async requestHandler({ $, request, enqueueLinks }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); // The enqueueLinks function is context aware, // so it does not require any parameters. await enqueueLinks(); }, }); await crawler.run(['https://crawlee.dev']); ``` src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; import { URL } from 'node:url'; const crawler = new CheerioCrawler({ // Let's limit our crawls to make our // tests shorter and safer. maxRequestsPerCrawl: 20, async requestHandler({ request, $ }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); // Without enqueueLinks, we first have to extract all // the URLs from the page with Cheerio. const links = $('a[href]') .map((_, el) => $(el).attr('href')) .get(); // Then we need to resolve relative URLs, // otherwise they would be unusable for crawling. 
const absoluteUrls = links.map((link) => new URL(link, request.loadedUrl).href); // Finally, we have to add the URLs to the queue await crawler.addRequests(absoluteUrls); }, }); await crawler.run(['https://crawlee.dev']); ``` If you need to override the default selection of elements in `enqueueLinks`, you can use the `selector` argument. ``` await enqueueLinks({ selector: 'div.has-link' }); ``` ## Filtering links to same domain[​](#filtering-links-to-same-domain "Direct link to Filtering links to same domain") Websites typically contain a lot of links that lead away from the original page. This is normal, but when crawling a website, we usually want to crawl that one site and not let our crawler wander away to Google, Facebook and Twitter. Therefore, we need to filter out the off-domain links and only keep the ones that lead to the same domain. * with enqueueLinks * without enqueueLinks src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 20, async requestHandler({ $, request, enqueueLinks }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); // The default behavior of enqueueLinks is to stay on the same hostname, // so it does not require any parameters. // This will ensure the subdomain stays the same. await enqueueLinks(); }, }); await crawler.run(['https://crawlee.dev']); ``` src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; import { URL } from 'node:url'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 20, async requestHandler({ request, $ }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); const links = $('a[href]') .map((_, el) => $(el).attr('href')) .get(); // Besides resolving the URLs, we now also need to // grab their hostname for filtering. const { hostname } = new URL(request.loadedUrl); const absoluteUrls = links.map((link) => new URL(link, request.loadedUrl)); // We use the hostname to filter links that point // to a different domain, even subdomain. const sameHostnameLinks = absoluteUrls .filter((url) => url.hostname === hostname) .map((url) => ({ url: url.href })); // Finally, we have to add the URLs to the queue await crawler.addRequests(sameHostnameLinks); }, }); await crawler.run(['https://crawlee.dev']); ``` The default behavior of `enqueueLinks` is to stay on the same hostname. This **does not include subdomains**. To include subdomains in your crawl, use the `strategy` argument. ``` await enqueueLinks({ strategy: 'same-domain' }); ``` When you run the code, you will see the crawler log the **title** of the first page, then the **enqueueing** message showing number of URLs, followed by the **title** of the first enqueued page and so on and so on. ## Skipping duplicate URLs[​](#skipping-duplicate-urls "Direct link to Skipping duplicate URLs") Skipping of duplicate URLs is critical, because visiting the same page multiple times would lead to duplicate results. This is automatically handled by the `RequestQueue` which deduplicates requests using their `uniqueKey`. This `uniqueKey` is automatically generated from the request's URL by lowercasing the URL, lexically ordering query parameters, removing fragments and a few other tweaks that ensure the queue only includes unique URLs. 
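To see the effect in isolation, here is a minimal sketch (the URLs and `uniqueKey` values below are made up purely for demonstration). Near-duplicate URLs collapse into a single request, while an explicit `uniqueKey` lets you enqueue the same URL more than once on purpose:
```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        console.log(`Crawling ${request.url} (uniqueKey: ${request.uniqueKey})`);
    },
});

// Both objects normalize to the same uniqueKey (the fragment is stripped),
// so the queue keeps only one of them.
await crawler.addRequests([
    { url: 'https://crawlee.dev/js/docs/introduction' },
    { url: 'https://crawlee.dev/js/docs/introduction#how-crawling-works' },
]);

// An explicit uniqueKey bypasses the URL-based deduplication, which is handy
// when the same URL should be processed more than once.
await crawler.addRequests([
    { url: 'https://crawlee.dev/js/docs/introduction', uniqueKey: 'intro-first-pass' },
    { url: 'https://crawlee.dev/js/docs/introduction', uniqueKey: 'intro-second-pass' },
]);

await crawler.run();
```
Deduplication is based on the `uniqueKey`, not on the raw URL string, which is why the fragment variant above is treated as a duplicate.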
## Advanced filtering arguments[​](#advanced-filtering-arguments "Direct link to Advanced filtering arguments") While the defaults for `enqueueLinks` can be often exactly what you need, it also gives you fine-grained control over which URLs should be enqueued. One way we already mentioned above. It is using the [`EnqueueStrategy`](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md). You can use the [`All`](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#All) strategy if you want to follow every single link, regardless of its domain, or you can enqueue links that target the same domain name with the [`SameDomain`](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#SameDomain) strategy. ``` await enqueueLinks({ strategy: 'all', // wander the internet }); ``` ### Filter URLs with patterns[​](#filter-urls-with-patterns "Direct link to Filter URLs with patterns") For even more control, you can use `globs`, `regexps` and `pseudoUrls` to filter the URLs. Each of those arguments is always an `Array`, but the contents can take on many forms. [See the reference](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) for more information about them as well as other options. Defaults override If you provide one of those options, the default `same-hostname` strategy will **not** be applied unless explicitly set in the options. ``` await enqueueLinks({ globs: ['http?(s)://apify.com/*/*'], }); ``` ### Transform requests[​](#transform-requests "Direct link to Transform requests") To have absolute control, we have the [`transformRequestFunction`](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md#transformRequestFunction). Just before a new [`Request`](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to skip it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple requests to the queue, and these requests share the same URL, but differ in methods or payloads. Another use case is to dynamically update or create the `userData`. ``` await enqueueLinks({ globs: ['http?(s)://apify.com/*/*'], transformRequestFunction(req) { // ignore all links ending with `.pdf` if (req.url.endsWith('.pdf')) return false; return req; }, }); ``` And that's it! `enqueueLinks()` is just one example of Crawlee's powerful helper functions. They're all designed to make your life easier, so you can focus on getting your data, while leaving the mundane crawling management to the tools. ## Next steps[​](#next-steps "Direct link to Next steps") Next, you will start your project of scraping a production website and learn some more Crawlee tricks in the process. --- # Crawling the Store Copy for LLM To crawl the whole [example Warehouse Store](https://warehouse-theme-metal.myshopify.com/collections) and find all the data, you first need to visit all the pages with products - going through all categories available and also all the product detail pages. ## Crawling the listing pages[​](#crawling-the-listing-pages "Direct link to Crawling the listing pages") In previous lessons, you used the `enqueueLinks()` function like this: ``` await enqueueLinks(); ``` While useful in that scenario, you need something different now. 
Instead of finding all the `<a href="..">` elements with links to the same hostname, you need to find only the specific ones that will take your crawler to the next page of results. Otherwise, the crawler will visit a lot of other pages that you're not interested in. Using the power of DevTools and yet another `enqueueLinks()` parameter, this becomes fairly easy. ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page, request, enqueueLinks }) => { console.log(`Processing: ${request.url}`); // Only run this logic on the main category listing, not on sub-pages. if (request.label !== 'CATEGORY') { // Wait for the category cards to render, // otherwise enqueueLinks wouldn't enqueue anything. await page.waitForSelector('.collection-block-item'); // Add links to the queue, but only from // elements matching the provided selector. await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY', }); } }, }); await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']); ``` The code should look pretty familiar to you. It's a very simple `requestHandler` where we log the currently processed URL to the console and enqueue more links. But there are also a few new, interesting additions. Let's break it down. ### The `selector` parameter of `enqueueLinks()`[​](#the-selector-parameter-of-enqueuelinks "Direct link to the-selector-parameter-of-enqueuelinks") When you previously used `enqueueLinks()`, you were not providing any `selector` parameter, and it was fine, because you wanted to use the default value, which is `a` - finds all `<a>` elements. But now, you need to be more specific. There are multiple `<a>` links on the `Categories` page, and you're only interested in those that will take your crawler to the available list of results. Using the DevTools, you'll find that you can select the links you need using the `.collection-block-item` selector, which selects all the elements that have the `class=collection-block-item` attribute. ### The `label` of `enqueueLinks()`[​](#the-label-of-enqueuelinks "Direct link to the-label-of-enqueuelinks") You will see `label` used often throughout Crawlee, as it's a convenient way of labelling a `Request` instance for quick identification later. You can access it with `request.label` and it's a `string`. You can name your requests any way you want. Here, we used the label `CATEGORY` to note that we're enqueueing pages that represent a category of products. The `enqueueLinks()` function will add this label to all requests before enqueueing them to the `RequestQueue`. Why this is useful will become obvious in a minute. ## Crawling the detail pages[​](#crawling-the-detail-pages "Direct link to Crawling the detail pages") In a similar fashion, you need to collect all the URLs to the product detail pages, because only from there you can scrape all the data you need. The following code only repeats the concepts you already know for another set of links. ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page, request, enqueueLinks }) => { console.log(`Processing: ${request.url}`); if (request.label === 'DETAIL') { // We're not doing anything with the details yet. } else if (request.label === 'CATEGORY') { // We are now on a category page. 
We can use this to paginate through and enqueue all products, // as well as any subsequent pages we find await page.waitForSelector('.product-item > a'); await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL', // <= note the different label }); // Now we need to find the "Next" button and enqueue the next page of results (if it exists) const nextButton = await page.$('a.pagination__next'); if (nextButton) { await enqueueLinks({ selector: 'a.pagination__next', label: 'CATEGORY', // <= note the same label }); } } else { // This means we're on the start page, with no label. // On this page, we just want to enqueue all the category pages. await page.waitForSelector('.collection-block-item'); await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY', }); } }, }); await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']); ``` The crawling code is now complete. When you run the code, you'll see the crawler visit all the listing URLs and all the detail URLs. ## Next steps[​](#next-steps "Direct link to Next steps") This concludes the Crawling lesson, because you have taught the crawler to visit all the pages it needs. Let's continue with scraping data. --- # Running your crawler in the Cloud Copy for LLM ## Apify Platform[​](#apify-platform "Direct link to Apify Platform") Crawlee is developed by [**Apify**](https://apify.com), the web scraping and automation platform. You could say it is the **home of Crawlee projects**. In this section you'll see how to deploy the crawler there with just a few simple steps. You can deploy a **Crawlee** project wherever you want, but using the [**Apify Platform**](https://console.apify.com) will give you the best experience. In case you want to deploy your Crawlee project to other platforms, check out the [**Deployment**](https://crawlee.dev/js/docs/deployment.md) section. With a few simple steps, you can convert your Crawlee project into a so-called **Actor**. Actors are serverless micro-apps that are easy to develop, run, share, and integrate. The infra, proxies, and storages are ready to go. [Learn more about Actors](https://apify.com/actors). Choosing between Crawlee CLI and Apify CLI for project setup We started this guide by using the Crawlee CLI to bootstrap the project - it offers the basic Crawlee templates, including a ready-made `Dockerfile`. If you know you will be deploying your project to the Apify Platform, you might want to start with the Apify CLI instead. It also offers several project templates, and those are all set up to be used on the Apify Platform right ahead. ## Dependencies[​](#dependencies "Direct link to Dependencies") The first step will be installing two new dependencies: * Apify SDK, a toolkit for working with the Apify Platform. This will allow us to wire the storages (e.g. `RequestQueue` and `Dataset`) to the Apify cloud products. This will be a dependency of our Node.js project. ``` npm install apify ``` * Apify CLI, a command-line tool that will help us with authentication and deployment. This will be a globally installed tool, you will install it only once and use it in all your Crawlee/Apify projects. ``` npm install -g apify-cli ``` ## Logging in to the Apify Platform[​](#logging-in-to-the-apify-platform "Direct link to Logging in to the Apify Platform") The next step will be [creating your Apify account](https://console.apify.com/sign-up). Don't worry, we have a **free tier**, so you can try things out before you buy in! 
Once you have that, it's time to log in with the just-installed [Apify CLI](https://docs.apify.com/cli/). You will need your personal access token, which you can find at <https://console.apify.com/account#/integrations>. ``` apify login ``` ## Adjusting the code[​](#adjusting-the-code "Direct link to Adjusting the code") Now that you have your account set up, you will need to adjust the code a tiny bit. We will use the [Apify SDK](https://docs.apify.com/sdk/js/), which will help us to wire the Crawlee storages (like the `RequestQueue`) to their Apify Platform counterparts - otherwise Crawlee would keep things only in memory. Open your `src/main.js` file (or `src/main.ts` if you used a TypeScript template), and add `Actor.init()` to the beginning of your main script and `Actor.exit()` to the end of it. Don't forget to `await` those calls, as both functions are async. Your code should look like this: src/main.js ``` import { Actor } from 'apify'; import { PlaywrightCrawler, log } from 'crawlee'; import { router } from './routes.mjs'; await Actor.init(); // This is better set with CRAWLEE_LOG_LEVEL env var // or a configuration option. This is just for show 😈 log.setLevel(log.LEVELS.DEBUG); log.debug('Setting up crawler.'); const crawler = new PlaywrightCrawler({ // Instead of the long requestHandler with // if clauses we provide a router instance. requestHandler: router, }); await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']); await Actor.exit(); ``` The `Actor.init()` call will configure Crawlee to use the Apify API instead of its default memory storage interface. It also sets up few other things, like listening to the platform events via websockets. The `Actor.exit()` call then handles graceful shutdown - it will close the open handles created by the `Actor.init()` call, as without that, the Node.js process would be stuck. Understanding `Actor.init()` behavior with environment variables The `Actor.init()` call works conditionally based on the environment variables, namely based on the `APIFY_IS_AT_HOME` env var, which is set to `true` on the Apify Platform. This means that your project will remain working the same locally, but will use the Apify API when deployed to the Apify Platform. ## Initializing the project[​](#initializing-the-project "Direct link to Initializing the project") You will also need to initialize the project for Apify, to do that, use the Apify CLI again: ``` apify init ``` This will create a folder called `.actor`, and an `actor.json` file inside it - this file contains the configuration relevant to the Apify Platform, namely the Actor name, version, build tag, and few other things. Check out the [relevant documentation](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) to see all the different things you can set there up. ## Ship it\![​](#ship-it "Direct link to Ship it!") And that's all, your project is now ready to be published on the Apify Platform. You can use the Apify CLI once more to do that: ``` apify push ``` This command will create an archive from your project, upload it to the Apify Platform and initiate a Docker build. Once finished, you will get a link to your new Actor on the platform. ## Learning more about web scraping[​](#learning-more-about-web-scraping "Direct link to Learning more about web scraping") Explore Apify Academy Resources If you want to learn more about web scraping and browser automation, check out the [Apify Academy](https://developers.apify.com/academy). 
It's full of courses and tutorials on the topic. From beginner to advanced. And the best thing: **It's free and open source** ❤️ If you want to do one more project, checkout our tutorial on building a [HackerNews scraper using Crawlee](https://blog.apify.com/crawlee-web-scraping-tutorial/). ## Thank you! 🎉[​](#thank-you- "Direct link to Thank you! 🎉") That's it! Thanks for reading the whole introduction and if there's anything wrong, please 🙏 let us know on [GitHub](https://github.com/apify/crawlee) or in our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! 👋 --- # First crawler Copy for LLM Now, you will build your first crawler. But before you do, let's briefly introduce the Crawlee classes involved in the process. ## How Crawlee works[​](#how-crawlee-works "Direct link to How Crawlee works") There are 3 main crawler classes available for use in Crawlee. * [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) * [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) * [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) We'll talk about their differences later. Now, let's talk about what they have in common. The general idea of each crawler is to go to a web page, open it, do some stuff there, save some results, continue to the next page, and repeat this process until the crawler's done its job. So the crawler always needs to find answers to two questions: *Where should I go?* and *What should I do there?* Answering those two questions is the only required setup. The crawlers have reasonable defaults for everything else. ### The Where - `Request` and `RequestQueue`[​](#the-where---request-and-requestqueue "Direct link to the-where---request-and-requestqueue") All crawlers use instances of the [`Request`](https://crawlee.dev/js/api/core/class/Request.md) class to determine where they need to go. Each request may hold a lot of information, but at the very least, it must hold a URL - a web page to open. But having only one URL would not make sense for crawling. Sometimes you have a pre-existing list of your own URLs that you wish to visit, perhaps a thousand. Other times you need to build this list dynamically as you crawl, adding more and more URLs to the list as you progress. Most of the time, you will use both options. The requests are stored in a [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md), a dynamic queue of `Request` instances. You can seed it with start URLs and also add more requests while the crawler is running. This allows the crawler to open one page, extract interesting URLs, such as links to other pages on the same domain, add them to the queue (called *enqueuing*) and repeat this process to build a queue of virtually unlimited number of URLs. ### The What - `requestHandler`[​](#the-what---requesthandler "Direct link to the-what---requesthandler") In the `requestHandler` you tell the crawler what to do at each and every page it visits. You can use it to handle extraction of data from the page, processing the data, saving it, calling APIs, doing calculations and so on. The `requestHandler` is a user-defined function, invoked automatically by the crawler for each `Request` from the `RequestQueue`. It always receives a single argument - a [`CrawlingContext`](https://crawlee.dev/js/api/core/interface/CrawlingContext.md). 
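As a quick sketch of that shape (using `CheerioCrawler` purely as an example, with an arbitrary log message), the crawler calls your handler with one context object that you typically destructure:
```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Invoked once for every Request in the queue. The single argument
    // is the CrawlingContext, destructured here into request and $.
    async requestHandler({ request, $ }) {
        console.log(`Visiting ${request.url}`);
    },
});
```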
Its properties change depending on the crawler class used, but it always includes the `request` property, which represents the currently crawled URL and related metadata. ## Building a crawler[​](#building-a-crawler "Direct link to Building a crawler") Let's put the theory into practice and start with something easy. Visit a page and get its HTML title. In this tutorial, you'll scrape the Crawlee website <https://crawlee.dev>, but the same code will work for any website. Top level await configuration We are using a JavaScript feature called [Top level await](https://blog.saeloun.com/2021/11/25/ecmascript-top-level-await.html) in our examples. To be able to use that, you might need some extra setup. Namely, it requires the use of [ECMAScript Modules](https://nodejs.org/api/esm.html) - this means you either need to add `"type": "module"` to your `package.json` file, or use `*.mjs` extension for your files. Additionally, if you are in a TypeScript project, you need to set the `module` and `target` compiler options to `ES2022` or above. ### Adding requests to the crawling queue[​](#adding-requests-to-the-crawling-queue "Direct link to Adding requests to the crawling queue") Earlier you learned that the crawler uses a queue of requests as its source of URLs to crawl. Let's create it and add the first request. src/main.js ``` import { RequestQueue } from 'crawlee'; // First you create the request queue instance. const requestQueue = await RequestQueue.open(); // And then you add one or more requests to it. await requestQueue.addRequest({ url: 'https://crawlee.dev' }); ``` The [`requestQueue.addRequest()`](https://crawlee.dev/js/api/core/class/RequestQueue.md#addRequest) function automatically converts the object with URL string to a [`Request`](https://crawlee.dev/js/api/core/class/Request.md) instance. So now you have a `requestQueue` that holds one request which points to `https://crawlee.dev`. Bulk add requests The code above is for illustration of the request queue concept. Soon you'll learn about the `crawler.addRequests()` method which allows you to skip this initialization code, and it also supports adding a large number of requests without blocking. ### Building a CheerioCrawler[​](#building-a-cheeriocrawler "Direct link to Building a CheerioCrawler") Crawlee comes with three main crawler classes: [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). You can read their short descriptions in the [Quick start](https://crawlee.dev/js/docs/quick-start.md) lesson. Unless you have a good reason to start with a different one, you should try building a `CheerioCrawler` first. It is an HTTP crawler with HTTP2 support, anti-blocking features and integrated HTML parser - [Cheerio](https://www.npmjs.com/package/cheerio). It's fast, simple, cheap to run and does not require complicated dependencies. The only downside is that it won't work out of the box for websites which require JavaScript rendering. But you might not need JavaScript rendering at all, because many modern websites use server-side rendering. Let's continue with the earlier `RequestQueue` example. 
src/main.js ``` // Add import of CheerioCrawler import { RequestQueue, CheerioCrawler } from 'crawlee'; const requestQueue = await RequestQueue.open(); await requestQueue.addRequest({ url: 'https://crawlee.dev' }); // Create the crawler and add the queue with our URL // and a request handler to process the page. const crawler = new CheerioCrawler({ requestQueue, // The `$` argument is the Cheerio object // which contains parsed HTML of the website. async requestHandler({ $, request }) { // Extract <title> text with Cheerio. // See Cheerio documentation for API docs. const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); } }) // Start the crawler and wait for it to finish await crawler.run(); ``` When you run the example, you will see the title of <https://crawlee.dev> printed to the log. What really happens is that CheerioCrawler first makes an HTTP request to `https://crawlee.dev`, then parses the received HTML with Cheerio and makes it available as the `$` argument of the `requestHandler`. ``` The title of "https://crawlee.dev" is: Crawlee · The scalable web crawling, scraping and automation library for JavaScript/Node.js | Crawlee. ``` ### Add requests faster[​](#add-requests-faster "Direct link to Add requests faster") Earlier we mentioned that you'll learn how to use the `crawler.addRequests()` method to skip the request queue initialization. It's simple. Every crawler has an implicit `RequestQueue` instance, and you can add requests to it with the `crawler.addRequests()` method. In fact, you can go even further and just use the first parameter of `crawler.run()`! src/main.js ``` // You don't need to import RequestQueue anymore import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); } }) // Start the crawler with the provided URLs await crawler.run(['https://crawlee.dev']); ``` When you run this code, you'll see exactly the same output as with the earlier, longer example. The `RequestQueue` is still there, it's just managed by the crawler automatically. info This method not only makes the code shorter, it will help with performance too! It will wait only for the initial batch of 1000 requests to be added to the queue before resolving, which means the processing will start almost instantly. After that, it will continue adding the rest of the requests in the background (again, in batches of 1000 items, once every second). ## Next steps[​](#next-steps "Direct link to Next steps") Next, you'll learn about crawling links. That means finding new URLs on the pages you crawl and adding them to the `RequestQueue` for the crawler to visit. --- # Getting some real-world data Copy for LLM > *Hey, guys, you know, it's cool that we can scrape the `<title>` elements of web pages, but that's not very useful. Can we finally scrape some real data and save it somewhere in a machine-readable format? Because that's why I started reading this tutorial in the first place!* We hear you, young padawan! First, learn how to crawl, you must. Only then, walk through data, you can! ## Making a production-grade crawler[​](#making-a-production-grade-crawler "Direct link to Making a production-grade crawler") Making a production-grade crawler is not difficult, but there are many pitfalls of scraping that can catch you off guard. 
So for the real world project you'll learn how to scrape an [example Warehouse Store](https://warehouse-theme-metal.myshopify.com/collections) instead of the Crawlee website. It contains a list of products of different categories, and each product has its own detail page. The website requires JavaScript rendering, which allows us to showcase more features of Crawlee. We've also added some helpful tips that prepare you for the real-world issues that you will surely encounter when scraping at scale. Not interested in theory? If you're not interested in crawling theory, feel free to [skip to the next chapter](https://crawlee.dev/js/docs/introduction/crawling.md) and get right back to coding. ## Drawing a plan[​](#drawing-a-plan "Direct link to Drawing a plan") Sometimes scraping is really straightforward, but most of the time, it really pays off to do a bit of research first and try to answer some of these questions: * How is the website structured? * Can I scrape it only with HTTP requests (read "with `CheerioCrawler`")? * Do I need a headless browser for something? * Are there any anti-scraping protections in place? * Do I need to parse the HTML or can I get the data otherwise, such as directly from the website's API? For the purposes of this tutorial, let's assume that the website cannot be scraped with `CheerioCrawler`. It actually can, but we would have to dive a bit deeper than this introductory guide allows. So for now we will make things easier for you, scrape it with `PlaywrightCrawler`, and you'll learn about headless browsers in the process. ## Choosing the data you need[​](#choosing-the-data-you-need "Direct link to Choosing the data you need") A good first step is to figure out what data you want to scrape and where to find it. For the time being, let's just agree that we want to scrape all products from all categories available on the [All collections page of the store](https://warehouse-theme-metal.myshopify.com/collections) and for each product we want to get its: * URL * Manufacturer * SKU * Title * Current price * Stock available You will notice that some information is available directly on the list page, but for details such as "SKU" we'll also need to open the product's detail page. ![data to scrape](/assets/images/scraping-practice-ed4e3a233c852ffa694b80371fed9d37.jpg "Overview of data to be scraped.") ### The start URL(s)[​](#the-start-urls "Direct link to The start URL(s)") This is where you start your crawl. It's convenient to start as close to the data as possible. For example, it wouldn't make much sense to start at `https://warehouse-theme-metal.myshopify.com/` and look for a `collections` link there, when we already know that everything we want to extract can be found at the `https://warehouse-theme-metal.myshopify.com/collections` page. ## Exploring the page[​](#exploring-the-page "Direct link to Exploring the page") Let's take a look at the `https://warehouse-theme-metal.myshopify.com/collections` page more carefully. There are some **categories** on the page, and each category has a list of **items**. On some category pages, at the bottom you will notice there are links to the next pages of results. This is usually called **the pagination**. ### Categories and sorting[​](#categories-and-sorting "Direct link to Categories and sorting") When you click the categories, you'll see that they load a page of products filtered by that category. 
By going through a few categories and observing the behavior, we can also observe that we can sort by different conditions (such as `Best selling`, or `Price, low to high`), but for this example, we will not be looking into those. Limited pagination Be careful, because on some websites, like [amazon.com](https://amazon.com), this is not true and the sum of products in categories is actually larger than what's available without filters. Learn more in our [tutorial on scraping websites with limited pagination](https://docs.apify.com/tutorials/scrape-paginated-sites). ### Pagination[​](#pagination "Direct link to Pagination") The pagination of the demo Warehouse Store is simple enough. When switching between pages, you will see that the URL changes to: ``` https://warehouse-theme-metal.myshopify.com/collections/headphones?page=2 ``` Try clicking on the link to page 4. You'll see that the pagination links update and show more pages. But can you trust that this will include all pages and won't stop at some point? Test your assumptions Similarly to the issue with filters explained above, the existence of pagination does not guarantee that you can simply paginate through all the results. Always test your assumptions about pagination. Otherwise, you might miss a chunk of results, and not even know about it. At the time of writing the `Headphones` collection results counter showed 75 results - products. Quick count of products on one page of results makes 24. 6 rows times 4 products. This means that there are 4 pages of results. If you're not convinced, you can visit a page somewhere in the middle, like `https://warehouse-theme-metal.myshopify.com/collections/headphones?page=2` and see how the pagination looks there. ## The crawling strategy[​](#the-crawling-strategy "Direct link to The crawling strategy") Now that you know where to start and how to find all the Actor details, let's look at the crawling process. 1. Visit the store page containing the list of categories (our start URL). 2. Enqueue all links to all categories. 3. Enqueue all product pages from the current page. 4. Enqueue links to next pages of results. 5. Open the next page in queue. <!-- --> * When it's a results list page, go to 2. * When it's a product page, scrape the data. 6. Repeat until all results pages and all products have been processed. `PlaywrightCrawler` will make sure to visit the pages for you, if you provide the correct requests, and you already know how to enqueue pages, so this should be fairly easy. Nevertheless, there are few more tricks that we'd like to showcase. ## Sanity check[​](#sanity-check "Direct link to Sanity check") Let's check that everything is set up correctly before writing the scraping logic itself. You might realize that something in your previous analysis doesn't quite add up, or the website might not behave exactly as you expected. The example below creates a new crawler that visits the start URL and prints the text content of all the categories on that page. When you run the code, you will see the *very badly formatted* content of the individual category card. * Playwright * Playwright with Cheerio src/main.mjs ``` // Instead of CheerioCrawler let's use Playwright // to be able to render JavaScript. import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page }) => { // Wait for the actor cards to render. 
        await page.waitForSelector('.collection-block-item');

        // Execute a function in the browser which targets
        // the category card elements and allows their manipulation.
        const categoryTexts = await page.$$eval('.collection-block-item', (els) => {
            // Extract text content from the category cards
            return els.map((el) => el.textContent);
        });

        categoryTexts.forEach((text, i) => {
            console.log(`CATEGORY_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
src/main.mjs
```
// Instead of CheerioCrawler let's use Playwright
// to be able to render JavaScript.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, parseWithCheerio }) => {
        // Wait for the category cards to render.
        await page.waitForSelector('.collection-block-item');

        // Extract the page's HTML from browser
        // and parse it with Cheerio.
        const $ = await parseWithCheerio();

        // Use familiar Cheerio syntax to
        // select all the category cards.
        $('.collection-block-item').each((i, el) => {
            const text = $(el).text();
            console.log(`CATEGORY_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
If you're wondering how we got that `.collection-block-item` selector, we'll explain it in the next section on DevTools. ## DevTools - the scraper's toolbox[​](#devtools---the-scrapers-toolbox "Direct link to DevTools - the scraper's toolbox") DevTool choice We'll use Chrome DevTools here, since Chrome is the most common browser, but feel free to use any other, they're all very similar. Let's open DevTools by going to <https://warehouse-theme-metal.myshopify.com/collections> in Chrome and then right-clicking anywhere in the page and selecting **Inspect**, or by pressing **F12** or whatever your system prefers. With DevTools, you can inspect or manipulate any aspect of the currently open web page. You can learn more about DevTools in their [official documentation](https://developer.chrome.com/docs/devtools/). ## Selecting elements[​](#selecting-elements "Direct link to Selecting elements") In the DevTools, choose the **Select an element** tool and try hovering over one of the category cards. ![select an element](/assets/images/select-an-element-63e42331a0df1985c597ffc8ead02a0f.png "Finding the select an element tool.") You'll see that you can select different elements inside the card. Instead, select the whole card, not just some of its contents, such as its title or description. ![selected element](/assets/images/selected-element-652798a29828d5b1a4d893c2de7a0e75.png "Selecting an element by hovering over it.") Selecting an element will highlight it in the DevTools HTML inspector. When you look carefully at the elements, you'll see that there are some **classes** attached to the different HTML elements. Those are called **CSS classes**, and we can make use of them in scraping. Conversely, by hovering over elements in the HTML inspector, you will see them highlight on the page. Inspect the page's structure around the collection card. You'll see that all the card's data is displayed in an `<a>` element with a `class` attribute that includes **collection-block-item**. It should now make sense how we got that `.collection-block-item` selector. It's just a way to find all elements that are annotated with the `collection-block-item` class. It's always a good idea to double-check that you're not getting any unwanted elements with this class.
To do that, go into the **Console** tab of DevTools and run:
```
document.querySelectorAll('.collection-block-item');
```
You will see that only the 31 collection cards will be returned, and nothing else. Learn more about CSS selectors and DevTools CSS selectors and DevTools are quite a big topic. If you want to learn more, visit the [Web scraping for beginners course](https://developers.apify.com/academy/web-scraping-for-beginners) in the Apify Academy. **It's free and open-source** ❤️. ## Next steps[​](#next-steps "Direct link to Next steps") Next, you will crawl the whole store, including all the listing pages and all the product detail pages. --- # Refactoring It may seem that the data is extracted and the crawler is done, but honestly, this is just the beginning. For the sake of brevity, we've completely omitted error handling, proxies, logging, architecture, tests, documentation and other things that reliable software should have. The good thing is, **error handling is mostly done by Crawlee itself**, so no worries on that front, unless you need some custom magic. Navigating automatic bot-protection avoidance You might be wondering about the **anti-blocking, bot-protection avoiding stealthy features** and why we haven't highlighted them yet. The reason is straightforward: these features are **automatically used** within the default configuration, providing a smooth start without manual adjustments. However, the default configuration, while powerful, may not cover every scenario. If you want to learn more, browse the [Avoid getting blocked](https://crawlee.dev/js/docs/guides/avoid-blocking.md), [Proxy management](https://crawlee.dev/js/docs/guides/proxy-management.md) and [Session management](https://crawlee.dev/js/docs/guides/session-management.md) guides. Anyway, to promote good coding practices, let's look at how you can use a [`Router`](https://crawlee.dev/js/api/core/class/Router.md) to better structure your crawler code. ## Routing[​](#routing "Direct link to Routing") In the following code we've made several changes: * Split the code into multiple files. * Replaced `console.log` with the Crawlee logger for nicer, colourful logs. * Added a `Router` to make our routing cleaner, without `if` clauses. In our `main.mjs` file, we place the general structure of the crawler: src/main.mjs
```
import { PlaywrightCrawler, log } from 'crawlee';
import { router } from './routes.mjs';

// This is better set with CRAWLEE_LOG_LEVEL env var
// or a configuration option. This is just for show 😈
log.setLevel(log.LEVELS.DEBUG);

log.debug('Setting up crawler.');
const crawler = new PlaywrightCrawler({
    // Instead of the long requestHandler with
    // if clauses we provide a router instance.
    requestHandler: router,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
Then in a separate `routes.mjs` file: src/routes.mjs
```
import { createPlaywrightRouter, Dataset } from 'crawlee';

// createPlaywrightRouter() is only a helper to get better
// intellisense and typings. You can use Router.create() too.
export const router = createPlaywrightRouter();

// This replaces the request.label === DETAIL branch of the if clause.
router.addHandler('DETAIL', async ({ request, page, log }) => { log.debug(`Extracting data: ${request.url}`); const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440'] const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser' const title = await page.locator('.product-meta h1').textContent(); const sku = await page .locator('span.product-meta__sku-number') .textContent(); const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; const results = { url: request.url, manufacturer, title, sku, currentPrice: price, availableInStock: inStock, }; log.debug(`Saving data: ${request.url}`); await Dataset.pushData(results); }); router.addHandler('CATEGORY', async ({ page, enqueueLinks, request, log }) => { log.debug(`Enqueueing pagination for: ${request.url}`); // We are now on a category page. We can use this to paginate through and enqueue all products, // as well as any subsequent pages we find await page.waitForSelector('.product-item > a'); await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL', // <= note the different label }); // Now we need to find the "Next" button and enqueue the next page of results (if it exists) const nextButton = await page.$('a.pagination__next'); if (nextButton) { await enqueueLinks({ selector: 'a.pagination__next', label: 'CATEGORY', // <= note the same label }); } }); // This is a fallback route which will handle the start URL // as well as the LIST labeled URLs. router.addDefaultHandler(async ({ request, page, enqueueLinks, log }) => { log.debug(`Enqueueing categories from page: ${request.url}`); // This means we're on the start page, with no label. // On this page, we just want to enqueue all the category pages. await page.waitForSelector('.collection-block-item'); await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY', }); }); ``` Let's explore the changes in more detail. We believe these modification will enhance the readability and manageability of the crawler. ## Splitting your code into multiple files[​](#splitting-your-code-into-multiple-files "Direct link to Splitting your code into multiple files") There's no reason not to split your code into multiple files and keep your logic separate. Less code in a single file means less code you need to think about at any time, and that's good. We would most likely go even further and split even the routes into separate files. ## Using Crawlee `log` instead of `console.log`[​](#using-crawlee-log-instead-of-consolelog "Direct link to using-crawlee-log-instead-of-consolelog") We won't go to great lengths here to talk about `log` object from Crawlee, because you can read all about it in the [documentation](https://crawlee.dev/js/api/core/class/Log.md), but there's just one thing that we need to stress: **log levels**. Crawlee `log` has multiple log levels, such as `log.debug`, `log.info` or `log.warning`. It not only makes your log more readable, but it also allows selective turning off of some levels by either calling the `log.setLevel()` function or by setting the `CRAWLEE_LOG_LEVEL` environment variable. 
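For instance (a minimal sketch with arbitrary messages), switching levels looks like this:
```
import { log } from 'crawlee';

// During development, show everything, including debug messages.
log.setLevel(log.LEVELS.DEBUG);
log.debug('Extracting data from the detail page...');
log.info('Crawler started.');

// In production, raise the level (or set CRAWLEE_LOG_LEVEL=WARNING)
// and any subsequent debug/info calls are skipped.
log.setLevel(log.LEVELS.WARNING);
log.warning('Only warnings and errors are printed now.');
```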
Thanks to this you can add a lot of debug logs to your crawler without polluting your log when they're not needed, but ready to help when you encounter issues. ## Using a router to structure your crawling[​](#using-a-router-to-structure-your-crawling "Direct link to Using a router to structure your crawling") Initially, using a simple `if/else` statement for selecting different logic based on the crawled pages might appear more readable. However, this approach can become cumbersome with more than two types of pages, especially when the logic for each page extends over dozens or even hundreds of lines of code. It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand line long `requestHandler()` where everything interacts with everything and variables can be used everywhere is not a beautiful thing to do and a pain to debug. That's why we prefer the separation of routes into their own files. ## Next steps[​](#next-steps "Direct link to Next steps") In the next and final step, you'll see how to deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a **Dockerfile** ready, and the next section will show you how to deploy it to the [Apify Platform](https://crawlee.dev/js/docs/deployment/apify-platform.md) with ease. --- # Saving data Copy for LLM A data extraction job would not be complete without saving the data for later use and processing. You've come to the final and most difficult part of this tutorial so make sure to pay attention very carefully! First, add a new import to the top of the file: ``` import { PlaywrightCrawler, Dataset } from 'crawlee'; ``` Then, replace the `console.log(results)` call with: ``` await Dataset.pushData(results); ``` and that's it. Unlike earlier, we are being serious now. That's it, you're done. 
The final code looks like this:
```
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);

        if (request.label === 'DETAIL') {
            const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
            const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

            const title = await page.locator('.product-meta h1').textContent();
            const sku = await page.locator('span.product-meta__sku-number').textContent();

            const priceElement = page
                .locator('span.price')
                .filter({
                    hasText: '$',
                })
                .first();

            const currentPriceString = await priceElement.textContent();
            const rawPrice = currentPriceString.split('$')[1];
            const price = Number(rawPrice.replaceAll(',', ''));

            const inStockElement = page
                .locator('span.product-form__inventory')
                .filter({
                    hasText: 'In stock',
                })
                .first();

            const inStock = (await inStockElement.count()) > 0;

            const results = {
                url: request.url,
                manufacturer,
                title,
                sku,
                currentPrice: price,
                availableInStock: inStock,
            };

            await Dataset.pushData(results);
        } else if (request.label === 'CATEGORY') {
            // We are now on a category page. We can use this to paginate through and enqueue all products,
            // as well as any subsequent pages we find

            await page.waitForSelector('.product-item > a');
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            // Now we need to find the "Next" button and enqueue the next page of results (if it exists)
            const nextButton = await page.$('a.pagination__next');
            if (nextButton) {
                await enqueueLinks({
                    selector: 'a.pagination__next',
                    label: 'CATEGORY', // <= note the same label
                });
            }
        } else {
            // This means we're on the start page, with no label.
            // On this page, we just want to enqueue all the category pages.

            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },

    // Let's limit our crawls to make our tests shorter and safer.
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
## What's `Dataset.pushData()`[​](#whats-datasetpushdata "Direct link to whats-datasetpushdata") [`Dataset.pushData()`](https://crawlee.dev/js/api/core/class/Dataset.md#pushData) is a function that saves data to the default [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md).
`Dataset` is a storage designed to hold data in a format similar to a table. Each time you call `Dataset.pushData()` a new row in the table is created, with the property names serving as column titles. In the default configuration, the rows are represented as JSON files saved on your disk, but other storage systems can be plugged into Crawlee as well. Automatic dataset initialization in Crawlee Each time you start Crawlee a default `Dataset` is automatically created, so there's no need to initialize it or create an instance first. You can create as many datasets as you want and even give them names. For more details see the [Result storage guide](https://crawlee.dev/js/docs/guides/result-storage.md#dataset) and the [`Dataset.open()`](https://crawlee.dev/js/api/core/class/Dataset.md#open) function. ## Finding saved data[​](#finding-saved-data "Direct link to Finding saved data") Unless you changed the configuration that Crawlee uses locally, which would suggest that you knew what you were doing, and you didn't need this tutorial anyway, you'll find your data in the `storage` directory that Crawlee creates in the working directory of the running script:
```
{PROJECT_FOLDER}/storage/datasets/default/
```
The above folder will hold all your saved data in numbered files, as they were pushed into the dataset. Each file represents one invocation of `Dataset.pushData()` or one table row. Single file data storage options If you would like to store your data in a single big file, instead of many small ones, see the [Result storage guide](https://crawlee.dev/js/docs/guides/result-storage.md#key-value-store) for Key-value stores. ## Next steps[​](#next-steps "Direct link to Next steps") Next, you'll see some improvements that you can add to your crawler code that will make it more readable and maintainable in the long run. --- # Scraping the Store In the [Real-world project chapter](https://crawlee.dev/js/docs/introduction/real-world-project.md#choosing-the-data-you-need), you've created a list of the information you wanted to collect about the products in the example Warehouse store. Let's review that and figure out ways to access the data.
* URL
* Manufacturer
* SKU
* Title
* Current price
* Stock available

![data to scrape](/assets/images/scraping-practice-ed4e3a233c852ffa694b80371fed9d37.jpg "Overview of data to be scraped.") ### Scraping the URL, Manufacturer and SKU[​](#scraping-the-url-manufacturer-and-sku "Direct link to Scraping the URL, Manufacturer and SKU") Some information is lying right there in front of us without even having to touch the product detail pages. The `URL` we already have - the `request.url`. And by looking at it carefully, we realize that we can also extract the manufacturer from the URL (as all product URLs start with `/products/<manufacturer>`). We can just split the `string` and be on our way then! `request.loadedUrl` vs `request.url` You can use `request.loadedUrl` as well. Remember the difference: `request.url` is what you enqueue, `request.loadedUrl` is what gets processed (after possible redirects).
```
// request.url = https://warehouse-theme-metal.myshopify.com/products/sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440
const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'
```
Storing information It's a matter of preference, whether to store this information separately in the resulting dataset, or not.
Whoever uses the dataset can easily parse the `manufacturer` from the `URL`, so should you duplicate the data unnecessarily? Our opinion is that unless the increased data consumption would be too large to bear, it's better to make the dataset as rich as possible. For example, someone might want to filter by `manufacturer`. Adapt and extract One thing you may notice is that the `manufacturer` might have a `-` in its name. If that's the case, your best bet is extracting it from the details page instead, but it's not mandatory. At the end of the day, you should always adjust and pick the best solution for your use case and the website you are crawling. Now it's time to add more data to the results. Let's open one of the product detail pages, for example the [`Sony XBR-950G`](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv) page and use our DevTools-Fu 🥋 to figure out how to get the title of the product. ### Title[​](#title "Direct link to Title") ![product title](/assets/images/title-8f63a08e5ecf82b5547f1fac8ffc77a7.jpg "Finding product title in DevTools.") By using the element selector tool, you can see that the title is there under an `<h1>` tag, as titles should be. The `<h1>` tag is enclosed in a `<div>` with class `product-meta`. We can leverage this to create a combined selector `.product-meta h1`. It selects any `<h1>` element that is a descendant of an element with the class `product-meta`. Verifying selectors with DevTools Remember that you can press CTRL+F (or CMD+F on Mac) in the **Elements** tab of DevTools to open the search bar where you can quickly search for elements using their selectors. Always verify your scraping process and assumptions using the DevTools. It's faster than changing the crawler code all the time. To get the title, you need to find it using `Playwright` and a `.product-meta h1` locator, which selects the `<h1>` element you're looking for, or throws if it finds more than one. That's good. It's usually better to crash the crawler than to silently return bad data. ``` const title = await page.locator('.product-meta h1').textContent(); ``` ### SKU[​](#sku "Direct link to SKU") Using the DevTools, you can find that the product SKU is inside a `<span>` tag with a class `product-meta__sku-number`. And since there's no other `<span>` with that class on the page, you can safely use it. ![product sku selector](/assets/images/sku-4427a5a820183e7c74fb4beeabcf9116.jpg "Finding product SKU in DevTools.") ``` const sku = await page.locator('span.product-meta__sku-number').textContent(); ``` ### Current price[​](#current-price "Direct link to Current price") DevTools can tell you that the `currentPrice` can be found in a `<span>` element tagged with the `price` class. But it also shows that it is nested as raw text alongside another `<span>` element with the `visually-hidden` class. You don't want that, so you need to filter it out, and the `hasText` helper can be used for that. ![product current price selector](/assets/images/current-price-16b0f4b92332837111d04f632234d2c3.jpg "Finding product current price in DevTools.") ``` const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); ``` It might look a little too complex at first glance, but let's walk through what you did.
First off, you find the right part of the `price` span (specifically the actual price) by filtering for the element that contains the `$` sign. That gives you a string similar to `Sale price$1,398.00`, which by itself is not that useful, so you extract the numeric part by splitting on the `$` sign. That leaves a string representing the price, which you convert to a number by stripping the commas with `replaceAll(',', '')` and then parsing the result with `Number()`. ### Stock available[​](#stock-available "Direct link to Stock available") You're finishing up with the `availableInStock`. There is a span with the `product-form__inventory` class, and it contains the text `In stock`. You can use the `hasText` helper again to filter out the right element. ``` const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; ``` For this, all that matters is whether the element exists or not, so you can use the `count()` method to check if there are any elements that match the selector. If there are, that means the product is in stock. And there you have it! All the needed data. For the sake of completeness, let's add all the properties together, and you're good to go. ``` const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440'] const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser' const title = await page.locator('.product-meta h1').textContent(); const sku = await page.locator('span.product-meta__sku-number').textContent(); const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; ``` ## Trying it out[​](#trying-it-out "Direct link to Trying it out") You have everything that is needed, so grab your newly created scraping logic, dump it into your original `requestHandler()` and see the magic happen!
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIHJlcXVlc3RIYW5kbGVyOiBhc3luYyAoeyBwYWdlLCByZXF1ZXN0LCBlbnF1ZXVlTGlua3MgfSkgPT4ge1xcbiAgICAgICAgY29uc29sZS5sb2coYFByb2Nlc3Npbmc6ICR7cmVxdWVzdC51cmx9YCk7XFxuICAgICAgICBpZiAocmVxdWVzdC5sYWJlbCA9PT0gJ0RFVEFJTCcpIHtcXG4gICAgICAgICAgICBjb25zdCB1cmxQYXJ0ID0gcmVxdWVzdC51cmwuc3BsaXQoJy8nKS5zbGljZSgtMSk7IC8vIFsnc2VubmhlaXNlci1ta2UtNDQwLXByb2Zlc3Npb25hbC1zdGVyZW8tc2hvdGd1bi1taWNyb3Bob25lLW1rZS00NDAnXVxcbiAgICAgICAgICAgIGNvbnN0IG1hbnVmYWN0dXJlciA9IHVybFBhcnRbMF0uc3BsaXQoJy0nKVswXTsgLy8gJ3Nlbm5oZWlzZXInXFxuXFxuICAgICAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLmxvY2F0b3IoJy5wcm9kdWN0LW1ldGEgaDEnKS50ZXh0Q29udGVudCgpO1xcbiAgICAgICAgICAgIGNvbnN0IHNrdSA9IGF3YWl0IHBhZ2UubG9jYXRvcignc3Bhbi5wcm9kdWN0LW1ldGFfX3NrdS1udW1iZXInKS50ZXh0Q29udGVudCgpO1xcblxcbiAgICAgICAgICAgIGNvbnN0IHByaWNlRWxlbWVudCA9IHBhZ2VcXG4gICAgICAgICAgICAgICAgLmxvY2F0b3IoJ3NwYW4ucHJpY2UnKVxcbiAgICAgICAgICAgICAgICAuZmlsdGVyKHtcXG4gICAgICAgICAgICAgICAgICAgIGhhc1RleHQ6ICckJyxcXG4gICAgICAgICAgICAgICAgfSlcXG4gICAgICAgICAgICAgICAgLmZpcnN0KCk7XFxuXFxuICAgICAgICAgICAgY29uc3QgY3VycmVudFByaWNlU3RyaW5nID0gYXdhaXQgcHJpY2VFbGVtZW50LnRleHRDb250ZW50KCk7XFxuICAgICAgICAgICAgY29uc3QgcmF3UHJpY2UgPSBjdXJyZW50UHJpY2VTdHJpbmcuc3BsaXQoJyQnKVsxXTtcXG4gICAgICAgICAgICBjb25zdCBwcmljZSA9IE51bWJlcihyYXdQcmljZS5yZXBsYWNlQWxsKCcsJywgJycpKTtcXG5cXG4gICAgICAgICAgICBjb25zdCBpblN0b2NrRWxlbWVudCA9IHBhZ2VcXG4gICAgICAgICAgICAgICAgLmxvY2F0b3IoJ3NwYW4ucHJvZHVjdC1mb3JtX19pbnZlbnRvcnknKVxcbiAgICAgICAgICAgICAgICAuZmlsdGVyKHtcXG4gICAgICAgICAgICAgICAgICAgIGhhc1RleHQ6ICdJbiBzdG9jaycsXFxuICAgICAgICAgICAgICAgIH0pXFxuICAgICAgICAgICAgICAgIC5maXJzdCgpO1xcblxcbiAgICAgICAgICAgIGNvbnN0IGluU3RvY2sgPSAoYXdhaXQgaW5TdG9ja0VsZW1lbnQuY291bnQoKSkgPiAwO1xcblxcbiAgICAgICAgICAgIGNvbnN0IHJlc3VsdHMgPSB7XFxuICAgICAgICAgICAgICAgIHVybDogcmVxdWVzdC51cmwsXFxuICAgICAgICAgICAgICAgIG1hbnVmYWN0dXJlcixcXG4gICAgICAgICAgICAgICAgdGl0bGUsXFxuICAgICAgICAgICAgICAgIHNrdSxcXG4gICAgICAgICAgICAgICAgY3VycmVudFByaWNlOiBwcmljZSxcXG4gICAgICAgICAgICAgICAgYXZhaWxhYmxlSW5TdG9jazogaW5TdG9jayxcXG4gICAgICAgICAgICB9O1xcblxcbiAgICAgICAgICAgIGNvbnNvbGUubG9nKHJlc3VsdHMpO1xcbiAgICAgICAgfSBlbHNlIGlmIChyZXF1ZXN0LmxhYmVsID09PSAnQ0FURUdPUlknKSB7XFxuICAgICAgICAgICAgLy8gV2UgYXJlIG5vdyBvbiBhIGNhdGVnb3J5IHBhZ2UuIFdlIGNhbiB1c2UgdGhpcyB0byBwYWdpbmF0ZSB0aHJvdWdoIGFuZCBlbnF1ZXVlIGFsbCBwcm9kdWN0cyxcXG4gICAgICAgICAgICAvLyBhcyB3ZWxsIGFzIGFueSBzdWJzZXF1ZW50IHBhZ2VzIHdlIGZpbmRcXG5cXG4gICAgICAgICAgICBhd2FpdCBwYWdlLndhaXRGb3JTZWxlY3RvcignLnByb2R1Y3QtaXRlbSA-IGEnKTtcXG4gICAgICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgICAgICBzZWxlY3RvcjogJy5wcm9kdWN0LWl0ZW0gPiBhJyxcXG4gICAgICAgICAgICAgICAgbGFiZWw6ICdERVRBSUwnLCAvLyA8PSBub3RlIHRoZSBkaWZmZXJlbnQgbGFiZWxcXG4gICAgICAgICAgICB9KTtcXG5cXG4gICAgICAgICAgICAvLyBOb3cgd2UgbmVlZCB0byBmaW5kIHRoZSBcXFwiTmV4dFxcXCIgYnV0dG9uIGFuZCBlbnF1ZXVlIHRoZSBuZXh0IHBhZ2Ugb2YgcmVzdWx0cyAoaWYgaXQgZXhpc3RzKVxcbiAgICAgICAgICAgIGNvbnN0IG5leHRCdXR0b24gPSBhd2FpdCBwYWdlLiQoJ2EucGFnaW5hdGlvbl9fbmV4dCcpO1xcbiAgICAgICAgICAgIGlmIChuZXh0QnV0dG9uKSB7XFxuICAgICAgICAgICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgICAgICAgICBzZWxlY3RvcjogJ2EucGFnaW5hdGlvbl9fbmV4dCcsXFxuICAgICAgICAgICAgICAgICAgICBsYWJlbDogJ0NBVEVHT1JZJywgLy8gPD0gbm90ZSB0aGUgc2FtZSBsYWJlbFxcbiAgICAgICAgICAgICAgICB9KTtcXG4gICAgICAgICAgICB9XFxuICAgICAgICB9IGVsc2Uge1xcbiAgICAgICAgICAgIC8vIFRoaXMgbWVhbnMgd2UncmUgb2
4gdGhlIHN0YXJ0IHBhZ2UsIHdpdGggbm8gbGFiZWwuXFxuICAgICAgICAgICAgLy8gT24gdGhpcyBwYWdlLCB3ZSBqdXN0IHdhbnQgdG8gZW5xdWV1ZSBhbGwgdGhlIGNhdGVnb3J5IHBhZ2VzLlxcblxcbiAgICAgICAgICAgIGF3YWl0IHBhZ2Uud2FpdEZvclNlbGVjdG9yKCcuY29sbGVjdGlvbi1ibG9jay1pdGVtJyk7XFxuICAgICAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKHtcXG4gICAgICAgICAgICAgICAgc2VsZWN0b3I6ICcuY29sbGVjdGlvbi1ibG9jay1pdGVtJyxcXG4gICAgICAgICAgICAgICAgbGFiZWw6ICdDQVRFR09SWScsXFxuICAgICAgICAgICAgfSk7XFxuICAgICAgICB9XFxuICAgIH0sXFxuXFxuICAgIC8vIExldCdzIGxpbWl0IG91ciBjcmF3bHMgdG8gbWFrZSBvdXIgdGVzdHMgc2hvcnRlciBhbmQgc2FmZXIuXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDUwLFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly93YXJlaG91c2UtdGhlbWUtbWV0YWwubXlzaG9waWZ5LmNvbS9jb2xsZWN0aW9ucyddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.kD0Kv02LqYWc0KoeyGVDl4T9x6QzNWTLJP_-bZxykus\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page, request, enqueueLinks }) => { console.log(`Processing: ${request.url}`); if (request.label === 'DETAIL') { const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440'] const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser' const title = await page.locator('.product-meta h1').textContent(); const sku = await page.locator('span.product-meta__sku-number').textContent(); const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; const results = { url: request.url, manufacturer, title, sku, currentPrice: price, availableInStock: inStock, }; console.log(results); } else if (request.label === 'CATEGORY') { // We are now on a category page. We can use this to paginate through and enqueue all products, // as well as any subsequent pages we find await page.waitForSelector('.product-item > a'); await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL', // <= note the different label }); // Now we need to find the "Next" button and enqueue the next page of results (if it exists) const nextButton = await page.$('a.pagination__next'); if (nextButton) { await enqueueLinks({ selector: 'a.pagination__next', label: 'CATEGORY', // <= note the same label }); } } else { // This means we're on the start page, with no label. // On this page, we just want to enqueue all the category pages. await page.waitForSelector('.collection-block-item'); await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY', }); } }, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']); ``` When you run the crawler, you will see the crawled URLs and their scraped data printed to the console. 
The output will look something like this: ``` { "url": "https://warehouse-theme-metal.myshopify.com/products/sony-str-za810es-7-2-channel-hi-res-wi-fi-network-av-receiver", "manufacturer": "sony", "title": "Sony STR-ZA810ES 7.2-Ch Hi-Res Wi-Fi Network A/V Receiver", "sku": "SON-692802-STR-DE", "currentPrice": 698, "availableInStock": true } ``` ## Next steps[​](#next-steps "Direct link to Next steps") Next, you'll see how to save the data you scraped to the disk for further processing. --- # Setting up Copy for LLM To run Crawlee on your own computer, you need to meet the following pre-requisites first: 1. Have **Node.js version 16.0** (Visit [Node.js website](https://nodejs.org/en/download/) to download or use [fnm](https://github.com/Schniz/fnm)) or higher installed. 2. Have **NPM** installed, or use other package manager of your choice. If not certain, confirm the prerequisites by running: ``` node -v ``` ``` npm -v ``` ## Creating a new project[​](#creating-a-new-project "Direct link to Creating a new project") The fastest and best way to create new projects with Crawlee is to use the [Crawlee CLI](https://www.npmjs.com/package/@crawlee/cli). You can use the `npx` utility to download and run the CLI - it is embedded in the `crawlee` package: ``` npx crawlee create my-crawler ``` A prompt will be shown, asking you to select a template. Crawlee is written in TypeScript so if you're familiar with it, choosing a TypeScript template will give you better code completion and static type checking, but feel free to use JavaScript as well. Functionally they're identical. Let's choose the first template called **Getting started example**. The command will create a new directory in your current working directory, called **my-crawler**, add a **package.json** to this folder and install all the necessary dependencies. It will also add example source code that you can immediately run. Let's try that! ``` cd my-crawler npm start ``` You will see log messages in the terminal as Crawlee boots up and starts scraping the Crawlee website. ``` INFO PlaywrightCrawler: Starting the crawl INFO PlaywrightCrawler: Title of https://crawlee.dev/ is 'Crawlee · Build reliable crawlers. Fast. | Crawlee' INFO PlaywrightCrawler: Title of https://crawlee.dev/js/docs/examples is 'Examples | Crawlee' INFO PlaywrightCrawler: Title of https://crawlee.dev/js/api/core is '@crawlee/core | API | Crawlee' INFO PlaywrightCrawler: Title of https://crawlee.dev/js/api/core/changelog is 'Changelog | API | Crawlee' INFO PlaywrightCrawler: Title of https://crawlee.dev/js/docs/quick-start is 'Quick Start | Crawlee' ``` You can always terminate the crawl with a keypress in the terminal: ``` CTRL+C ``` ### Running headful browsers[​](#running-headful-browsers "Direct link to Running headful browsers") Browsers controlled by Playwright run headless (without a visible window). You can switch to headful by uncommenting the `headless: false` option in the crawler's constructor. This is useful in the development phase when you want to see what's going on in the browser. ``` // Uncomment this option to see the browser window. headless: false ``` When you run the example again, after a second a Chromium browser window will open. In the window, you'll see quickly changing pages as the crawler does its job. note For the sake of this show off, we've slowed down the crawler, but rest assured, it's blazing fast in real world usage. 
![An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium](/img/chrome-scrape-light.gif)![An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium](/img/chrome-scrape-dark.gif) ## Next steps[​](#next-steps "Direct link to Next steps") Next, you will see how to create a very simple crawler and explain Crawlee components while building it. --- # Quick Start Copy for LLM With this short tutorial you can start scraping with Crawlee in a minute or two. To learn in-depth how Crawlee works, read the [Introduction](https://crawlee.dev/js/docs/introduction.md), which is a comprehensive step-by-step guide for creating your first scraper. ## Choose your crawler[​](#choose-your-crawler "Direct link to Choose your crawler") Crawlee comes with three main crawler classes: [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). All classes share the same interface for maximum flexibility when switching between them. ### CheerioCrawler[​](#cheeriocrawler "Direct link to CheerioCrawler") This is a plain HTTP crawler. It parses HTML using the [Cheerio](https://github.com/cheeriojs/cheerio) library and crawls the web using the specialized [got-scraping](https://github.com/apify/got-scraping) HTTP client which masks as a browser. It's very fast and efficient, but can't handle JavaScript rendering. ### PuppeteerCrawler[​](#puppeteercrawler "Direct link to PuppeteerCrawler") This crawler uses a headless browser to crawl, controlled by the [Puppeteer](https://github.com/puppeteer/puppeteer) library. It can control Chromium or Chrome. Puppeteer is the de-facto standard in headless browser automation. ### PlaywrightCrawler[​](#playwrightcrawler "Direct link to PlaywrightCrawler") [Playwright](https://github.com/microsoft/playwright) is a more powerful and full-featured successor to Puppeteer. It can control Chromium, Chrome, Firefox, Webkit and many other browsers. If you're not familiar with Puppeteer already, and you need a headless browser, go with Playwright. before you start Crawlee requires [Node.js 16 or later](https://nodejs.org/en/). ## Installation with Crawlee CLI[​](#installation-with-crawlee-cli "Direct link to Installation with Crawlee CLI") The fastest way to try Crawlee out is to use the **Crawlee CLI** and choose the **Getting started example**. The CLI will install all the necessary dependencies and add boilerplate code for you to play with. ``` npx crawlee create my-crawler ``` After the installation is complete you can start the crawler like this: ``` cd my-crawler && npm start ``` ## Manual installation[​](#manual-installation "Direct link to Manual installation") You can add Crawlee to any Node.js project by running: * CheerioCrawler * PlaywrightCrawler * PuppeteerCrawler ``` npm install crawlee ``` caution `playwright` is not bundled with Crawlee to reduce install size and allow greater flexibility. You need to explicitly install it with NPM. 👇 ``` npm install crawlee playwright ``` caution `puppeteer` is not bundled with Crawlee to reduce install size and allow greater flexibility. You need to explicitly install it with NPM. 
👇 ``` npm install crawlee puppeteer ``` ## Crawling[​](#crawling "Direct link to Crawling") Run the following example to perform a recursive crawl of the Crawlee website using the selected crawler. Don't forget about module imports To run the example, add a `"type": "module"` clause into your `package.json` or copy it into a file with an `.mjs` suffix. This enables `import` statements in Node.js. See [Node.js docs](https://nodejs.org/dist/latest-v16.x/docs/api/esm.html#enabling) for more information. * CheerioCrawler * PlaywrightCrawler * PuppeteerCrawler [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBEYXRhc2V0IH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ2hlZXJpb0NyYXdsZXIgY3Jhd2xzIHRoZSB3ZWIgdXNpbmcgSFRUUCByZXF1ZXN0c1xcbi8vIGFuZCBwYXJzZXMgSFRNTCB1c2luZyB0aGUgQ2hlZXJpbyBsaWJyYXJ5LlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgQ2hlZXJpb0NyYXdsZXIoe1xcbiAgICAvLyBVc2UgdGhlIHJlcXVlc3RIYW5kbGVyIHRvIHByb2Nlc3MgZWFjaCBvZiB0aGUgY3Jhd2xlZCBwYWdlcy5cXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCAkLCBlbnF1ZXVlTGlua3MsIGxvZyB9KSB7XFxuICAgICAgICBjb25zdCB0aXRsZSA9ICQoJ3RpdGxlJykudGV4dCgpO1xcbiAgICAgICAgbG9nLmluZm8oYFRpdGxlIG9mICR7cmVxdWVzdC5sb2FkZWRVcmx9IGlzICcke3RpdGxlfSdgKTtcXG5cXG4gICAgICAgIC8vIFNhdmUgcmVzdWx0cyBhcyBKU09OIHRvIC4vc3RvcmFnZS9kYXRhc2V0cy9kZWZhdWx0XFxuICAgICAgICBhd2FpdCBEYXRhc2V0LnB1c2hEYXRhKHsgdGl0bGUsIHVybDogcmVxdWVzdC5sb2FkZWRVcmwgfSk7XFxuXFxuICAgICAgICAvLyBFeHRyYWN0IGxpbmtzIGZyb20gdGhlIGN1cnJlbnQgcGFnZVxcbiAgICAgICAgLy8gYW5kIGFkZCB0aGVtIHRvIHRoZSBjcmF3bGluZyBxdWV1ZS5cXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcygpO1xcbiAgICB9LFxcblxcbiAgICAvLyBMZXQncyBsaW1pdCBvdXIgY3Jhd2xzIHRvIG1ha2Ugb3VyIHRlc3RzIHNob3J0ZXIgYW5kIHNhZmVyLlxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG59KTtcXG5cXG4vLyBBZGQgZmlyc3QgVVJMIHRvIHRoZSBxdWV1ZSBhbmQgc3RhcnQgdGhlIGNyYXdsLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.Ja0vzMfKZoDTDX1L9bEJsVFrKUcp0sJyWJ46kbitQOs\&asrc=run_on_apify) ``` import { CheerioCrawler, Dataset } from 'crawlee'; // CheerioCrawler crawls the web using HTTP requests // and parses HTML using the Cheerio library. const crawler = new CheerioCrawler({ // Use the requestHandler to process each of the crawled pages. async requestHandler({ request, $, enqueueLinks, log }) { const title = $('title').text(); log.info(`Title of ${request.loadedUrl} is '${title}'`); // Save results as JSON to ./storage/datasets/default await Dataset.pushData({ title, url: request.loadedUrl }); // Extract links from the current page // and add them to the crawling queue. await enqueueLinks(); }, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. 
await crawler.run(['https://crawlee.dev']); ``` [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyLCBEYXRhc2V0IH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gUGxheXdyaWdodENyYXdsZXIgY3Jhd2xzIHRoZSB3ZWIgdXNpbmcgYSBoZWFkbGVzc1xcbi8vIGJyb3dzZXIgY29udHJvbGxlZCBieSB0aGUgUGxheXdyaWdodCBsaWJyYXJ5LlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICAvLyBVc2UgdGhlIHJlcXVlc3RIYW5kbGVyIHRvIHByb2Nlc3MgZWFjaCBvZiB0aGUgY3Jhd2xlZCBwYWdlcy5cXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MsIGxvZyB9KSB7XFxuICAgICAgICBjb25zdCB0aXRsZSA9IGF3YWl0IHBhZ2UudGl0bGUoKTtcXG4gICAgICAgIGxvZy5pbmZvKGBUaXRsZSBvZiAke3JlcXVlc3QubG9hZGVkVXJsfSBpcyAnJHt0aXRsZX0nYCk7XFxuXFxuICAgICAgICAvLyBTYXZlIHJlc3VsdHMgYXMgSlNPTiB0byAuL3N0b3JhZ2UvZGF0YXNldHMvZGVmYXVsdFxcbiAgICAgICAgYXdhaXQgRGF0YXNldC5wdXNoRGF0YSh7IHRpdGxlLCB1cmw6IHJlcXVlc3QubG9hZGVkVXJsIH0pO1xcblxcbiAgICAgICAgLy8gRXh0cmFjdCBsaW5rcyBmcm9tIHRoZSBjdXJyZW50IHBhZ2VcXG4gICAgICAgIC8vIGFuZCBhZGQgdGhlbSB0byB0aGUgY3Jhd2xpbmcgcXVldWUuXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3MoKTtcXG4gICAgfSxcXG4gICAgLy8gVW5jb21tZW50IHRoaXMgb3B0aW9uIHRvIHNlZSB0aGUgYnJvd3NlciB3aW5kb3cuXFxuICAgIC8vIGhlYWRsZXNzOiBmYWxzZSxcXG5cXG4gICAgLy8gTGV0J3MgbGltaXQgb3VyIGNyYXdscyB0byBtYWtlIG91ciB0ZXN0cyBzaG9ydGVyIGFuZCBzYWZlci5cXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogNTAsXFxufSk7XFxuXFxuLy8gQWRkIGZpcnN0IFVSTCB0byB0aGUgcXVldWUgYW5kIHN0YXJ0IHRoZSBjcmF3bC5cXG5hd2FpdCBjcmF3bGVyLnJ1bihbJ2h0dHBzOi8vY3Jhd2xlZS5kZXYnXSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.t_TCm8kwdGMajR-HxGyGZQ-N9vOJbcHUo8cgMhCec0E\&asrc=run_on_apify) ``` import { PlaywrightCrawler, Dataset } from 'crawlee'; // PlaywrightCrawler crawls the web using a headless // browser controlled by the Playwright library. const crawler = new PlaywrightCrawler({ // Use the requestHandler to process each of the crawled pages. async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); // Save results as JSON to ./storage/datasets/default await Dataset.pushData({ title, url: request.loadedUrl }); // Extract links from the current page // and add them to the crawling queue. await enqueueLinks(); }, // Uncomment this option to see the browser window. // headless: false, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. 
await crawler.run(['https://crawlee.dev']); ``` [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIERhdGFzZXQgfSBmcm9tICdjcmF3bGVlJztcXG5cXG4vLyBQdXBwZXRlZXJDcmF3bGVyIGNyYXdscyB0aGUgd2ViIHVzaW5nIGEgaGVhZGxlc3NcXG4vLyBicm93c2VyIGNvbnRyb2xsZWQgYnkgdGhlIFB1cHBldGVlciBsaWJyYXJ5LlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUHVwcGV0ZWVyQ3Jhd2xlcih7XFxuICAgIC8vIFVzZSB0aGUgcmVxdWVzdEhhbmRsZXIgdG8gcHJvY2VzcyBlYWNoIG9mIHRoZSBjcmF3bGVkIHBhZ2VzLlxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHRpdGxlID0gYXdhaXQgcGFnZS50aXRsZSgpO1xcbiAgICAgICAgbG9nLmluZm8oYFRpdGxlIG9mICR7cmVxdWVzdC5sb2FkZWRVcmx9IGlzICcke3RpdGxlfSdgKTtcXG5cXG4gICAgICAgIC8vIFNhdmUgcmVzdWx0cyBhcyBKU09OIHRvIC4vc3RvcmFnZS9kYXRhc2V0cy9kZWZhdWx0XFxuICAgICAgICBhd2FpdCBEYXRhc2V0LnB1c2hEYXRhKHsgdGl0bGUsIHVybDogcmVxdWVzdC5sb2FkZWRVcmwgfSk7XFxuXFxuICAgICAgICAvLyBFeHRyYWN0IGxpbmtzIGZyb20gdGhlIGN1cnJlbnQgcGFnZVxcbiAgICAgICAgLy8gYW5kIGFkZCB0aGVtIHRvIHRoZSBjcmF3bGluZyBxdWV1ZS5cXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcygpO1xcbiAgICB9LFxcbiAgICAvLyBVbmNvbW1lbnQgdGhpcyBvcHRpb24gdG8gc2VlIHRoZSBicm93c2VyIHdpbmRvdy5cXG4gICAgLy8gaGVhZGxlc3M6IGZhbHNlLFxcblxcbiAgICAvLyBMZXQncyBsaW1pdCBvdXIgY3Jhd2xzIHRvIG1ha2Ugb3VyIHRlc3RzIHNob3J0ZXIgYW5kIHNhZmVyLlxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG59KTtcXG5cXG4vLyBBZGQgZmlyc3QgVVJMIHRvIHRoZSBxdWV1ZSBhbmQgc3RhcnQgdGhlIGNyYXdsLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.r3-Jgz2GRxUEVxzBr5czC9lcH0ty_8aKkcd9XHHZryg\&asrc=run_on_apify) ``` import { PuppeteerCrawler, Dataset } from 'crawlee'; // PuppeteerCrawler crawls the web using a headless // browser controlled by the Puppeteer library. const crawler = new PuppeteerCrawler({ // Use the requestHandler to process each of the crawled pages. async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); // Save results as JSON to ./storage/datasets/default await Dataset.pushData({ title, url: request.loadedUrl }); // Extract links from the current page // and add them to the crawling queue. await enqueueLinks(); }, // Uncomment this option to see the browser window. // headless: false, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. await crawler.run(['https://crawlee.dev']); ``` When you run the example, you will see Crawlee automating the data extraction process in your terminal. ``` INFO CheerioCrawler: Starting the crawl INFO CheerioCrawler: Title of https://crawlee.dev/ is 'Crawlee · Build reliable crawlers. Fast. | Crawlee' INFO CheerioCrawler: Title of https://crawlee.dev/js/docs/examples is 'Examples | Crawlee' INFO CheerioCrawler: Title of https://crawlee.dev/js/docs/quick-start is 'Quick Start | Crawlee' INFO CheerioCrawler: Title of https://crawlee.dev/js/docs/guides is 'Guides | Crawlee' ``` ### Running headful browsers[​](#running-headful-browsers "Direct link to Running headful browsers") Browsers controlled by Puppeteer and Playwright run headless (without a visible window). You can switch to headful by adding the `headless: false` option to the crawlers' constructor. 
This is useful in the development phase when you want to see what's going on in the browser. * PlaywrightCrawler * PuppeteerCrawler [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyLCBEYXRhc2V0IH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgcGFnZSwgZW5xdWV1ZUxpbmtzLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBsb2cuaW5mbyhgVGl0bGUgb2YgJHtyZXF1ZXN0LmxvYWRlZFVybH0gaXMgJyR7dGl0bGV9J2ApO1xcbiAgICAgICAgYXdhaXQgRGF0YXNldC5wdXNoRGF0YSh7IHRpdGxlLCB1cmw6IHJlcXVlc3QubG9hZGVkVXJsIH0pO1xcbiAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKCk7XFxuICAgIH0sXFxuICAgIC8vIFdoZW4geW91IHR1cm4gb2ZmIGhlYWRsZXNzIG1vZGUsIHRoZSBjcmF3bGVyXFxuICAgIC8vIHdpbGwgcnVuIHdpdGggYSB2aXNpYmxlIGJyb3dzZXIgd2luZG93LlxcbiAgICBoZWFkbGVzczogZmFsc2UsXFxuXFxuICAgIC8vIExldCdzIGxpbWl0IG91ciBjcmF3bHMgdG8gbWFrZSBvdXIgdGVzdHMgc2hvcnRlciBhbmQgc2FmZXIuXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDUwLFxcbn0pO1xcblxcbi8vIEFkZCBmaXJzdCBVUkwgdG8gdGhlIHF1ZXVlIGFuZCBzdGFydCB0aGUgY3Jhd2wuXFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwczovL2NyYXdsZWUuZGV2J10pO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.hy0W1IDTCxm-B-7JSs_YOrqWnYAemKGg8vJVLIaigIg\&asrc=run_on_apify) ``` import { PlaywrightCrawler, Dataset } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); await Dataset.pushData({ title, url: request.loadedUrl }); await enqueueLinks(); }, // When you turn off headless mode, the crawler // will run with a visible browser window. headless: false, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. 
await crawler.run(['https://crawlee.dev']); ``` [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIERhdGFzZXQgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHRpdGxlID0gYXdhaXQgcGFnZS50aXRsZSgpO1xcbiAgICAgICAgbG9nLmluZm8oYFRpdGxlIG9mICR7cmVxdWVzdC5sb2FkZWRVcmx9IGlzICcke3RpdGxlfSdgKTtcXG4gICAgICAgIGF3YWl0IERhdGFzZXQucHVzaERhdGEoeyB0aXRsZSwgdXJsOiByZXF1ZXN0LmxvYWRlZFVybCB9KTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcygpO1xcbiAgICB9LFxcbiAgICAvLyBXaGVuIHlvdSB0dXJuIG9mZiBoZWFkbGVzcyBtb2RlLCB0aGUgY3Jhd2xlclxcbiAgICAvLyB3aWxsIHJ1biB3aXRoIGEgdmlzaWJsZSBicm93c2VyIHdpbmRvdy5cXG4gICAgaGVhZGxlc3M6IGZhbHNlLFxcblxcbiAgICAvLyBMZXQncyBsaW1pdCBvdXIgY3Jhd2xzIHRvIG1ha2Ugb3VyIHRlc3RzIHNob3J0ZXIgYW5kIHNhZmVyLlxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG59KTtcXG5cXG4vLyBBZGQgZmlyc3QgVVJMIHRvIHRoZSBxdWV1ZSBhbmQgc3RhcnQgdGhlIGNyYXdsLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.SeMW82sV8hdxSVLInwu1lVZjrCxNzASe8GlszF0s-W8\&asrc=run_on_apify) ``` import { PuppeteerCrawler, Dataset } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); await Dataset.pushData({ title, url: request.loadedUrl }); await enqueueLinks(); }, // When you turn off headless mode, the crawler // will run with a visible browser window. headless: false, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. await crawler.run(['https://crawlee.dev']); ``` When you run the example code, you'll see an automated browser blaze through the Crawlee website. note For the sake of this show off, we've slowed down the crawler, but rest assured, it's blazing fast in real world usage. ![An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium](/img/chrome-scrape-light.gif)![An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium](/img/chrome-scrape-dark.gif) ## Results[​](#results "Direct link to Results") Crawlee stores data to the `./storage` directory in your current working directory. The results of your crawl will be available under `./storage/datasets/default/*.json` as JSON files. ./storage/datasets/default/000000001.json ``` { "url": "https://crawlee.dev/", "title": "Crawlee · The scalable web crawling, scraping and automation library for JavaScript/Node.js | Crawlee" } ``` tip You can override the storage directory by setting the `CRAWLEE_STORAGE_DIR` environment variable. ## Examples and further reading[​](#examples-and-further-reading "Direct link to Examples and further reading") You can find more examples showcasing various features of Crawlee in the [Examples](https://crawlee.dev/js/docs/examples.md) section of the documentation. To better understand Crawlee and its components you should read the [Introduction](https://crawlee.dev/js/docs/introduction.md) step-by-step guide. 
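If you'd rather inspect the results from code than open the JSON files on disk, you can also read the default dataset back programmatically. A minimal sketch (assuming you run it in the same project directory after a crawl, so it sees the same `./storage` folder):

```
import { Dataset } from 'crawlee';

// Open the default dataset and read back everything pushed during the crawl.
const dataset = await Dataset.open();
const { items } = await dataset.getData();

console.log(`Scraped ${items.length} pages, first item:`, items[0]);
```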
**Related links** * [Configuration](https://crawlee.dev/js/docs/guides/configuration.md) * [Request storage](https://crawlee.dev/js/docs/guides/request-storage.md) * [Result storage](https://crawlee.dev/js/docs/guides/result-storage.md) --- ## [📄️<!-- --> <!-- -->Upgrading to v1](https://crawlee.dev/js/docs/upgrading/upgrading-to-v1.md) [Summary](https://crawlee.dev/js/docs/upgrading/upgrading-to-v1.md) --- # Upgrading to v1 Copy for LLM ## Summary[​](#summary "Direct link to Summary") After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result - **Apify SDK v1**. There were two goals for this release. **Stability** and **adding support for more browsers** - Firefox and Webkit (Safari). The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in and by releasing SDK v1, **we commit to only make breaking changes once a year, with a new major release**. We added support for more browsers by replacing `PuppeteerPool` with [`browser-pool`](https://github.com/apify/browser-pool). A new library that we created specifically for this purpose. It builds on the ideas from `PuppeteerPool` and extends them to support [Playwright](https://github.com/microsoft/playwright). Playwright is a browser automation library similar to Puppeteer. It works with all well known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new `BrowserPool`. A large breaking change is that neither `puppeteer` nor `playwright` are bundled with the SDK v1. To make the choice of a library easier and installs faster, users will have to install the selected modules and versions themselves. This allows us to add support for even more libraries in the future. Thanks to the addition of Playwright we now have a `PlaywrightCrawler`. It is very similar to `PuppeteerCrawler` and you can pick the one you prefer. It also means we needed to make some interface changes. The `launchPuppeteerFunction` option of `PuppeteerCrawler` is gone and `launchPuppeteerOptions` were replaced by `launchContext`. We also moved things around in the `handlePageFunction` arguments. See the [migration guide](#migration-guide) for more detailed explanation and migration examples. What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well. ## Migration Guide[​](#migration-guide "Direct link to Migration Guide") There are a lot of breaking changes in the v1.0.0 release, but we're confident that updating your code will be a matter of minutes. Below, you'll find examples how to do it and also short tutorials how to use many of the new features. > Many of the new features are made with power users in mind, so don't worry if something looks complicated. You don't need to use it. ## Installation[​](#installation "Direct link to Installation") Previous versions of the SDK bundled the `puppeteer` package, so you did not have to install it. SDK v1 supports also `playwright` and we don't want to force users to install both. 
To install SDK v1 with Puppeteer (same as previous versions), run: ``` npm install apify puppeteer ``` To install SDK v1 with Playwright run: ``` npm install apify playwright ``` > While we tried to add the most important functionality in the initial release, you may find that there are still some utilities or options that are only supported by Puppeteer and not Playwright. ## Running on Apify Platform[​](#running-on-apify-platform "Direct link to Running on Apify Platform") If you want to make use of Playwright on the Apify Platform, you need to use a Docker image that supports Playwright. We've created them for you, so head over to the new [Docker image guide](https://sdk.apify.com/docs/guides/docker-images) and pick the one that best suits your needs. Note that your `package.json` **MUST** include `puppeteer` and/or `playwright` as dependencies. If you don't list them, the libraries will be uninstalled from your `node_modules` folder when you build your actors. ## Handler arguments are now Crawling Context[​](#handler-arguments-are-now-crawling-context "Direct link to Handler arguments are now Crawling Context") Previously, arguments of user provided handler functions were provided in separate objects. This made it difficult to track values across function invocations. ``` const handlePageFunction = async (args1) => { args1.hasOwnProperty('proxyInfo') // true } const handleFailedRequestFunction = async (args2) => { args2.hasOwnProperty('proxyInfo') // false } args1 === args2 // false ``` This happened because a new arguments object was created for each function. With SDK v1 we now have a single object called Crawling Context. ``` const handlePageFunction = async (crawlingContext1) => { crawlingContext1.hasOwnProperty('proxyInfo') // true } const handleFailedRequestFunction = async (crawlingContext2) => { crawlingContext2.hasOwnProperty('proxyInfo') // true } // All contexts are the same object. crawlingContext1 === crawlingContext2 // true ``` ### `Map` of crawling contexts and their IDs[​](#map-of-crawling-contexts-and-their-ids "Direct link to map-of-crawling-contexts-and-their-ids") Now that all the objects are the same, we can keep track of all running crawling contexts. We can do that by working with the new `id` property of `crawlingContext` This is useful when you need cross-context access. ``` let masterContextId; const handlePageFunction = async ({ id, page, request, crawler }) => { if (request.userData.masterPage) { masterContextId = id; // Prepare the master page. } else { const masterContext = crawler.crawlingContexts.get(masterContextId); const masterPage = masterContext.page; const masterRequest = masterContext.request; // Now we can manipulate the master data from another handlePageFunction. } } ``` ### `autoscaledPool` was moved under `crawlingContext.crawler`[​](#autoscaledpool-was-moved-under-crawlingcontextcrawler "Direct link to autoscaledpool-was-moved-under-crawlingcontextcrawler") To prevent bloat and to make access to certain key objects easier, we exposed a `crawler` property on the handle page arguments. ``` const handlePageFunction = async ({ request, page, crawler }) => { await crawler.requestQueue.addRequest({ url: 'https://example.com' }); await crawler.autoscaledPool.pause(); } ``` This also means that some shorthands like `puppeteerPool` or `autoscaledPool` were no longer necessary. 
``` const handlePageFunction = async (crawlingContext) => { crawlingContext.autoscaledPool // does NOT exist anymore crawlingContext.crawler.autoscaledPool // <= this is correct usage } ``` ## Replacement of `PuppeteerPool` with `BrowserPool`[​](#replacement-of-puppeteerpool-with-browserpool "Direct link to replacement-of-puppeteerpool-with-browserpool") `BrowserPool` was created to extend `PuppeteerPool` with the ability to manage other browser automation libraries. The API is similar, but not the same. ### Access to running `BrowserPool`[​](#access-to-running-browserpool "Direct link to access-to-running-browserpool") Only `PuppeteerCrawler` and `PlaywrightCrawler` use `BrowserPool`. You can access it on the `crawler` object. ``` const crawler = new Apify.PlaywrightCrawler({ handlePageFunction: async ({ page, crawler }) => { crawler.browserPool // <----- } }); crawler.browserPool // <----- ``` ### Pages now have IDs[​](#pages-now-have-ids "Direct link to Pages now have IDs") And they're equal to `crawlingContext.id` which gives you access to full `crawlingContext` in hooks. See [Lifecycle hooks](#configuration-and-lifecycle-hooks) below. ``` const pageId = browserPool.getPageId ``` ### Configuration and lifecycle hooks[​](#configuration-and-lifecycle-hooks "Direct link to Configuration and lifecycle hooks") The most important addition with `BrowserPool` are the [lifecycle hooks](https://github.com/apify/browser-pool#browserpool). You can access them via `browserPoolOptions` in both crawlers. A full list of `browserPoolOptions` can be found in [`browser-pool` readme](https://github.com/apify/browser-pool#new-browserpooloptions). ``` const crawler = new Apify.PuppeteerCrawler({ browserPoolOptions: { retireBrowserAfterPageCount: 10, preLaunchHooks: [ async (pageId, launchContext) => { const { request } = crawler.crawlingContexts.get(pageId); if (request.userData.useHeadful === true) { launchContext.launchOptions.headless = false; } } ] } }) ``` ### Introduction of `BrowserController`[​](#introduction-of-browsercontroller "Direct link to introduction-of-browsercontroller") [`BrowserController`](https://github.com/apify/browser-pool#browsercontroller) is a class of `browser-pool` that's responsible for browser management. Its purpose is to provide a single API for working with both Puppeteer and Playwright browsers. It works automatically in the background, but if you ever wanted to close a browser properly, you should use a `browserController` to do it. You can find it in the handle page arguments. ``` const handlePageFunction = async ({ page, browserController }) => { // Wrong usage. Could backfire because it bypasses BrowserPool. await page.browser().close(); // Correct usage. Allows graceful shutdown. await browserController.close(); const cookies = [/* some cookie objects */]; // Wrong usage. Will only work in Puppeteer and not Playwright. await page.setCookies(...cookies); // Correct usage. Will work in both. await browserController.setCookies(page, cookies); } ``` The `BrowserController` also includes important information about the browser, such as the context it was launched with. This was difficult to do before SDK v1. 
``` const handlePageFunction = async ({ browserController }) => { // Information about the proxy used by the browser browserController.launchContext.proxyInfo // Session used by the browser browserController.launchContext.session } ``` ### `BrowserPool` methods vs `PuppeteerPool`[​](#browserpool-methods-vs-puppeteerpool "Direct link to browserpool-methods-vs-puppeteerpool") Some functions were removed (in line with earlier deprecations), and some were changed a bit: ``` // OLD await puppeteerPool.recyclePage(page); // NEW await page.close(); ``` ``` // OLD await puppeteerPool.retire(page.browser()); // NEW browserPool.retireBrowserByPage(page); ``` ``` // OLD await puppeteerPool.serveLiveViewSnapshot(); // NEW // There's no LiveView in BrowserPool ``` ## Updated `PuppeteerCrawlerOptions`[​](#updated-puppeteercrawleroptions "Direct link to updated-puppeteercrawleroptions") To keep `PuppeteerCrawler` and `PlaywrightCrawler` consistent, we updated the options. ### Removal of `gotoFunction`[​](#removal-of-gotofunction "Direct link to removal-of-gotofunction") The concept of a configurable `gotoFunction` is not ideal. Especially since we use a modified `gotoExtended`. Users have to know this when they override `gotoFunction` if they want to extend default behavior. We decided to replace `gotoFunction` with `preNavigationHooks` and `postNavigationHooks`. The following example illustrates how `gotoFunction` makes things complicated. ``` const gotoFunction = async ({ request, page }) => { // pre-processing await makePageStealthy(page); // Have to remember how to do this: const response = await gotoExtended(page, request, {/* have to remember the defaults */}); // post-processing await page.evaluate(() => { window.foo = 'bar'; }); // Must not forget! return response; } const crawler = new Apify.PuppeteerCrawler({ gotoFunction, // ... }) ``` With `preNavigationHooks` and `postNavigationHooks` it's much easier. `preNavigationHooks` are called with two arguments: `crawlingContext` and `gotoOptions`. `postNavigationHooks` are called only with `crawlingContext`. ``` const preNavigationHooks = [ async ({ page }) => makePageStealthy(page) ]; const postNavigationHooks = [ async ({ page }) => page.evaluate(() => { window.foo = 'bar' }) ] const crawler = new Apify.PuppeteerCrawler({ preNavigationHooks, postNavigationHooks, // ... }) ``` ### `launchPuppeteerOptions` => `launchContext`[​](#launchpuppeteeroptions--launchcontext "Direct link to launchpuppeteeroptions--launchcontext") Those were always a point of confusion because they merged custom Apify options with `launchOptions` of Puppeteer. ``` const launchPuppeteerOptions = { useChrome: true, // Apify option headless: false, // Puppeteer option } ``` Use the new `launchContext` object, which explicitly defines `launchOptions`. `launchPuppeteerOptions` were removed. ``` const crawler = new Apify.PuppeteerCrawler({ launchContext: { useChrome: true, // Apify option launchOptions: { headless: false // Puppeteer option } } }) ``` > LaunchContext is also a type of [`browser-pool`](https://github.com/apify/browser-pool) and the structure is exactly the same there. SDK only adds extra options. ### Removal of `launchPuppeteerFunction`[​](#removal-of-launchpuppeteerfunction "Direct link to removal-of-launchpuppeteerfunction") `browser-pool` introduces the idea of [lifecycle hooks](https://github.com/apify/browser-pool#browserpool), which are functions that are executed when a certain event in the browser lifecycle happens. 
``` const launchPuppeteerFunction = async (launchPuppeteerOptions) => { if (someVariable === 'chrome') { launchPuppeteerOptions.useChrome = true; } return Apify.launchPuppeteer(launchPuppeteerOptions); } const crawler = new Apify.PuppeteerCrawler({ launchPuppeteerFunction, // ... }) ``` Now you can recreate the same functionality with a `preLaunchHook`: ``` const maybeLaunchChrome = (pageId, launchContext) => { if (someVariable === 'chrome') { launchContext.useChrome = true; } } const crawler = new Apify.PuppeteerCrawler({ browserPoolOptions: { preLaunchHooks: [maybeLaunchChrome] }, // ... }) ``` This is better in multiple ways. It is consistent across both Puppeteer and Playwright. It allows you to easily construct your browsers with pre-defined behavior: ``` const preLaunchHooks = [ maybeLaunchChrome, useHeadfulIfNeeded, injectNewFingerprint, ] ``` And thanks to the addition of [`crawler.crawlingContexts`](#handler-arguments-are-now-crawling-context) the functions also have access to the `crawlingContext` of the `request` that triggered the launch. ``` const preLaunchHooks = [ async function maybeLaunchChrome(pageId, launchContext) { const { request } = crawler.crawlingContexts.get(pageId); if (request.userData.useHeadful === true) { launchContext.launchOptions.headless = false; } } ] ``` ## Launch functions[​](#launch-functions "Direct link to Launch functions") In addition to `Apify.launchPuppeteer()` we now also have `Apify.launchPlaywright()`. ### Updated arguments[​](#updated-arguments "Direct link to Updated arguments") We [updated the launch options object](#launchpuppeteeroptions--launchcontext) because it was a frequent source of confusion. ``` // OLD await Apify.launchPuppeteer({ useChrome: true, headless: true, }) // NEW await Apify.launchPuppeteer({ useChrome: true, launchOptions: { headless: true, } }) ``` ### Custom modules[​](#custom-modules "Direct link to Custom modules") `Apify.launchPuppeteer` already supported the `puppeteerModule` option. With Playwright, we normalized the name to `launcher` because the `playwright` module itself does not launch browsers. ``` const puppeteer = require('puppeteer'); const playwright = require('playwright'); await Apify.launchPuppeteer(); // Is the same as: await Apify.launchPuppeteer({ launcher: puppeteer }) await Apify.launchPlaywright(); // Is the same as: await Apify.launchPlaywright({ launcher: playwright.chromium }) ``` --- # Upgrading to v2 Copy for LLM * **BREAKING**: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy. * **BREAKING**: Bump `cheerio` to `1.0.0-rc.10` from `rc.3`. There were breaking changes in `cheerio` between the versions so this bump might be breaking for you as well. * Remove `LiveViewServer` which was deprecated before release of SDK v1. --- # Upgrading to v3 Copy for LLM This page summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3. Crawlee vs Apify SDK v2 Up until version 3 of `apify`, the package contained both scraping related tools and Apify platform related helper methods. 
With v3 we are splitting the whole project into two main parts: * [Crawlee](https://github.com/apify/crawlee), the new web-scraping library, available as [`crawlee`](https://www.npmjs.com/package/crawlee) package on NPM * [Apify SDK](https://github.com/apify/apify-sdk-js), helpers for the Apify platform, available as [`apify`](https://www.npmjs.com/package/apify) package on NPM ## Crawlee monorepo[​](#crawlee-monorepo "Direct link to Crawlee monorepo") The [`crawlee`](https://www.npmjs.com/package/crawlee) package consists of several smaller packages, released separately under `@crawlee` namespace: * [`@crawlee/core`](https://crawlee.dev/js/api/core.md): the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes * [`@crawlee/cheerio`](https://crawlee.dev/js/api/cheerio-crawler.md): exports `CheerioCrawler` * [`@crawlee/playwright`](https://crawlee.dev/js/api/playwright-crawler.md): exports `PlaywrightCrawler` * [`@crawlee/puppeteer`](https://crawlee.dev/js/api/puppeteer-crawler.md): exports `PuppeteerCrawler` * [`@crawlee/jsdom`](https://crawlee.dev/js/api/jsdom-crawler.md): exports `JSDOMCrawler` * [`@crawlee/basic`](https://crawlee.dev/js/api/basic-crawler.md): exports `BasicCrawler` * [`@crawlee/http`](https://crawlee.dev/js/api/http-crawler.md): exports `HttpCrawler` (which is used for creating [`@crawlee/jsdom`](https://crawlee.dev/js/api/jsdom-crawler.md) and [`@crawlee/cheerio`](https://crawlee.dev/js/api/cheerio-crawler.md)) * [`@crawlee/browser`](https://crawlee.dev/js/api/browser-crawler.md): exports `BrowserCrawler` (which is used for creating [`@crawlee/playwright`](https://crawlee.dev/js/api/playwright-crawler.md) and [`@crawlee/puppeteer`](https://crawlee.dev/js/api/puppeteer-crawler.md)) * [`@crawlee/memory-storage`](https://crawlee.dev/js/api/memory-storage.md): [`@apify/storage-local`](https://npmjs.com/package/@apify/storage-local) alternative * [`@crawlee/browser-pool`](https://crawlee.dev/js/api/browser-pool.md): previously [`browser-pool`](https://npmjs.com/package/browser-pool) package * [`@crawlee/utils`](https://crawlee.dev/js/api/utils.md): utility methods * [`@crawlee/types`](https://crawlee.dev/js/api/types.md): holds TS interfaces mainly about the [`StorageClient`](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### Installing Crawlee[​](#installing-crawlee "Direct link to Installing Crawlee") Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`. If we don't care much about additional code being pulled in, we can just use the `crawlee` meta-package, which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. ``` npm install crawlee ``` Or if all we need is cheerio support, we can install only `@crawlee/cheerio`. ``` npm install @crawlee/cheerio ``` When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used. 
``` npm install crawlee playwright # or npm install @crawlee/playwright playwright ``` Alternatively we can also use the `crawlee` meta-package which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. > Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser related utilities can be also found in the crawler packages (e.g. `@crawlee/playwright`). ## Full TypeScript support[​](#full-typescript-support "Direct link to Full TypeScript support") Both Crawlee and Apify SDK are full TypeScript rewrite, so they include up-to-date types in the package. For your TypeScript crawlers we recommend using our predefined TypeScript configuration from `@apify/tsconfig` package. Don't forget to set the `module` and `target` to `ES2022` or above to be able to use top level await. > The `@apify/tsconfig` config has [`noImplicitAny`](https://www.typescriptlang.org/tsconfig#noImplicitAny) enabled, you might want to disable it during the initial development as it will cause build failures if you left some unused local variables in your code. tsconfig.json ``` { "extends": "@apify/tsconfig", "compilerOptions": { "module": "ES2022", "target": "ES2022", "outDir": "dist", "lib": ["DOM"] }, "include": [ "./src/**/*" ] } ``` ### Docker build[​](#docker-build "Direct link to Docker build") For `Dockerfile` we recommend using multi-stage build, so you don't install the dev dependencies like TypeScript in your final image: Dockerfile ``` # using multistage build, as we need dev deps to build the TS source code FROM apify/actor-node:20 AS builder # copy all files, install all dependencies (including dev deps) and build the project COPY . ./ RUN npm install --include=dev \ && npm run build # create final image FROM apify/actor-node:20 # copy only necessary files COPY --from=builder /usr/src/app/package*.json ./ COPY --from=builder /usr/src/app/README.md ./ COPY --from=builder /usr/src/app/dist ./dist COPY --from=builder /usr/src/app/apify.json ./apify.json COPY --from=builder /usr/src/app/INPUT_SCHEMA.json ./INPUT_SCHEMA.json # install only prod deps RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional \ && echo "Installed NPM packages:" \ && (npm list --only=prod --no-optional --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # run compiled code CMD npm run start:prod ``` ## Browser fingerprints[​](#browser-fingerprints "Direct link to Browser fingerprints") Previously we had a magical `stealth` option in the puppeteer crawler that enabled several tricks aiming to mimic the real users as much as possible. While this worked to a certain degree, we decided to replace it with generated browser fingerprints. In case we don't want to have dynamic fingerprints, we can disable this behaviour via `useFingerprints` in `browserPoolOptions`: ``` const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: false, }, }); ``` ## Session cookie method renames[​](#session-cookie-method-renames "Direct link to Session cookie method renames") Previously, if we wanted to get or add cookies for the session that would be used for the request, we had to call `session.getPuppeteerCookies()` or `session.setPuppeteerCookies()`. 
Since these methods could be used with any of our crawlers, not just `PuppeteerCrawler`, they have been renamed to `session.getCookies()` and `session.setCookies()` respectively. Otherwise, their usage is exactly the same!

## Memory storage

When we store some data or intermediate state (like the one `RequestQueue` holds), we now use `@crawlee/memory-storage` by default. It is an alternative to `@apify/storage-local` that stores the state in memory (as opposed to the SQLite database used by `@apify/storage-local`). While the state is held in memory, it is also dumped to the file system, so we can observe it, and any existing data stored in the KeyValueStore (e.g. the `INPUT.json` file) is respected.

When we want to run the crawler on the Apify platform, we need to use `Actor.init` or `Actor.main`, which will automatically switch the storage client to `ApifyClient` when on the Apify platform.

We can still use `@apify/storage-local`. To do so, first install it and then pass it to the `Actor.init` or `Actor.main` options:

> `@apify/storage-local` v2.1.0+ is required for Crawlee

```
import { Actor } from 'apify';
import { ApifyStorageLocal } from '@apify/storage-local';

const storage = new ApifyStorageLocal(/* options like `enableWalMode` belong here */);
await Actor.init({ storage });
```

## Purging of the default storage

Previously the state was preserved between local runs, and we had to use the `--purge` argument of the `apify-cli`. With Crawlee, this is now the default behaviour: the storage is purged automatically when `Actor.init/main` is called. We can opt out of it via `purge: false` in the `Actor.init` options.

## Renamed crawler options and interfaces

Some options were renamed to better reflect what they do. We still support all the old parameter names too, but not at the TS level.

* `handleRequestFunction` -> `requestHandler`
* `handlePageFunction` -> `requestHandler`
* `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs`
* `handlePageTimeoutSecs` -> `requestHandlerTimeoutSecs`
* `requestTimeoutSecs` -> `navigationTimeoutSecs`
* `handleFailedRequestFunction` -> `failedRequestHandler`

We also renamed the crawling context interfaces, so they follow the same convention and are more meaningful:

* `CheerioHandlePageInputs` -> `CheerioCrawlingContext`
* `PlaywrightHandlePageFunction` -> `PlaywrightCrawlingContext`
* `PuppeteerHandlePageFunction` -> `PuppeteerCrawlingContext`

## Context aware helpers

Some utilities previously available under the `Apify.utils` namespace are now moved to the crawling context and are *context aware*. This means they have some parameters automatically filled in from the context, like the current `Request` instance or current `Page` object, or the `RequestQueue` bound to the crawler.

### Enqueuing links

One common helper that received more attention is `enqueueLinks`. As mentioned above, it is context aware - we no longer need to pass in the `requestQueue` or `page` arguments (or the cheerio handle `$`).
In addition to that, it now offers 3 enqueuing strategies:

* `EnqueueStrategy.All` (`'all'`): Matches any URLs found
* `EnqueueStrategy.SameHostname` (`'same-hostname'`): Matches any URLs that have the same subdomain as the base URL (default)
* `EnqueueStrategy.SameDomain` (`'same-domain'`): Matches any URLs that have the same domain name. For example, `https://wow.an.example.com` and `https://example.com` will both be matched for a base URL of `https://example.com`.

This means we can even call `enqueueLinks()` without any parameters. By default, it will go through all the links found on the current page and filter only those targeting the same subdomain.

Moreover, we can specify patterns the URL should match via globs:

```
const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks({
            globs: ['https://crawlee.dev/*/*'],
            // we can also use `regexps` and `pseudoUrls` keys here
        });
    },
});
```

## Implicit `RequestQueue` instance

All crawlers now have the `RequestQueue` instance automatically available via the `crawler.getRequestQueue()` method. It will create the instance for you if it does not exist yet. This means we no longer need to create the `RequestQueue` instance manually, and we can just use the `crawler.addRequests()` method described below.

> We can still create the `RequestQueue` explicitly; the `crawler.getRequestQueue()` method will respect that and return the instance provided via crawler options.

## `crawler.addRequests()`

We can now add multiple requests in batches. The newly added `addRequests` method will handle everything for us. It enqueues the first 1000 requests and resolves, while continuing with the rest in the background, again in smaller batches of 1000 items, so we don't run into any API rate limits. This means the crawling will start almost immediately (within a few seconds at most), something previously possible only with a combination of `RequestQueue` and `RequestList`.

```
// will resolve right after the initial batch of 1000 requests is added
const result = await crawler.addRequests([/* many requests, can be even millions */]);

// if we want to wait for all the requests to be added, we can await the `waitForAllRequestsToBeAdded` promise
await result.waitForAllRequestsToBeAdded;
```

## Less verbose error logging

Previously, an error thrown from inside the request handler resulted in the full error object being logged. With Crawlee, we log only the error message as a warning, as long as we know the request will be retried. If you want to enable verbose logging like in v2, use the `CRAWLEE_VERBOSE_LOG` env var.

## `Request.label` shortcut

Labeling requests used to work via the `Request.userData` object. With Crawlee, we can also use the `Request.label` shortcut. It is implemented as a `get/set` pair, using the value from `Request.userData`. Support for this shortcut is also added to the `enqueueLinks` options interface.
```
async requestHandler({ request, enqueueLinks }) {
    if (request.label !== 'DETAIL') {
        await enqueueLinks({
            globs: ['...'],
            label: 'DETAIL',
        });
    }
}
```

## Removal of `requestAsBrowser`

In v1 we replaced the underlying implementation of `requestAsBrowser` with just a proxy over calling [`got-scraping`](https://github.com/apify/got-scraping) - our custom extension to `got` that tries to mimic real browsers as much as possible. With v3, we are removing `requestAsBrowser` and encouraging the use of [`got-scraping`](https://github.com/apify/got-scraping) directly. For easier migration, we also added a `context.sendRequest()` helper that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping):

```
const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        // we can use the options parameter to override gotScraping options
        const res = await sendRequest({ responseType: 'json' });
        log.info('received body', res.body);
    },
});
```

### How to use `sendRequest()`?

See [the Got Scraping guide](https://crawlee.dev/js/docs/guides/got-scraping.md).

### Removed options

The `useInsecureHttpParser` option has been removed. It's permanently set to `true` in order to better mimic browsers' behavior.

Got Scraping automatically performs protocol negotiation, hence we removed the `useHttp2` option. It's set to `true` - effectively all modern browsers are capable of HTTP/2 requests, and more and more of the web uses it too!

### Renamed options

In the `requestAsBrowser` approach, some of the options were named differently. Here's a list of renamed options:

#### `payload`

This option represents the body to send. It could be a `string` or a `Buffer`. However, there is no `payload` option anymore. You need to use `body` instead. Or, if you wish to send JSON, `json`. Here's an example:

```
// Before:
await Apify.utils.requestAsBrowser({ …, payload: 'Hello, world!' });
await Apify.utils.requestAsBrowser({ …, payload: Buffer.from('c0ffe', 'hex') });
await Apify.utils.requestAsBrowser({ …, json: { hello: 'world' } });

// After:
await gotScraping({ …, body: 'Hello, world!' });
await gotScraping({ …, body: Buffer.from('c0ffe', 'hex') });
await gotScraping({ …, json: { hello: 'world' } });
```

#### `ignoreSslErrors`

It has been renamed to `https.rejectUnauthorized`. By default, it's set to `false` for convenience. However, if you want to make sure the connection is secure, you can do the following:

```
// Before:
await Apify.utils.requestAsBrowser({ …, ignoreSslErrors: false });

// After:
await gotScraping({ …, https: { rejectUnauthorized: true } });
```

Please note: the meanings are opposite! So we needed to invert the values as well.

#### `header-generator` options

`useMobileVersion`, `languageCode` and `countryCode` no longer exist.
Instead, you need to use `headerGeneratorOptions` directly:

```
// Before:
await Apify.utils.requestAsBrowser({
    …,
    useMobileVersion: true,
    languageCode: 'en',
    countryCode: 'US',
});

// After:
await gotScraping({
    …,
    headerGeneratorOptions: {
        devices: ['mobile'], // or ['desktop']
        locales: ['en-US'],
    },
});
```

#### `timeoutSecs`

In order to set a timeout, use `timeout.request` (which is in **milliseconds** now).

```
// Before:
await Apify.utils.requestAsBrowser({
    …,
    timeoutSecs: 30,
});

// After:
await gotScraping({
    …,
    timeout: {
        request: 30 * 1000,
    },
});
```

#### `throwOnHttpErrors`

`throwOnHttpErrors` → `throwHttpErrors`. This option throws on unsuccessful HTTP status codes, for example `404`. By default, it's set to `false`.

#### `decodeBody`

`decodeBody` → `decompress`. This option decompresses the body. Defaults to `true` - please do not change this or websites will break (unless you know what you're doing!).

#### `abortFunction`

This function used to make the promise throw on specific responses if it returned `true`. However, it wasn't that useful. You probably want to cancel the request instead, which you can do in the following way:

```
const promise = gotScraping(…);

promise.on('request', request => {
    // Please note this is not a Got Request instance, but a ClientRequest one.
    // https://nodejs.org/api/http.html#class-httpclientrequest

    if (request.protocol !== 'https:') {
        // Insecure request, abort.
        promise.cancel();

        // If you set `isStream` to `true`, please use `stream.destroy()` instead.
    }
});

const response = await promise;
```

## Removal of browser pool plugin mixing

Previously, you were able to have a browser pool that would mix Puppeteer and Playwright plugins (or even your own custom plugins if you've built any). As of this version, that is no longer allowed, and creating such a browser pool will cause an error to be thrown (it's expected that all plugins that will be used are of the same type). Confused? As an example, this change disallows a pool that mixes Puppeteer with Playwright. You can still create pools that use multiple Playwright plugins, each with a different launcher if you want!

## Handling requests outside of browser

One small feature worth mentioning is the ability to handle requests with browser crawlers outside the browser. To do that, we can use a combination of `Request.skipNavigation` and `context.sendRequest()`. Take a look at how to achieve this by checking out the [Skipping navigation for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example!

## Logging

Crawlee exports the default `log` instance directly as a named export. We also have a scoped `log` instance provided in the crawling context - this one will log messages prefixed with the crawler name and should be preferred for logging inside the request handler.
```
const crawler = new CheerioCrawler({
    async requestHandler({ log, request }) {
        log.info(`Opened ${request.loadedUrl}`);
    },
});
```

## Auto-saved crawler state

Every crawler instance now has a `useState()` method that will return a state object we can use. It will be automatically saved when the `persistState` event occurs. The value is cached, so we can freely call this method multiple times and get the exact same reference. No need to worry about saving the value either, as it will happen automatically.

```
const crawler = new CheerioCrawler({
    async requestHandler({ crawler }) {
        const state = await crawler.useState({ foo: [] as number[] });
        // just change the value, no need to care about saving it
        state.foo.push(123);
    },
});
```

## Apify SDK

The Apify platform helpers can now be found in the Apify SDK (the `apify` NPM package). It exports the `Actor` class that offers the following static helpers:

* `ApifyClient` shortcuts: `addWebhook()`, `call()`, `callTask()`, `metamorph()`
* helpers for running on the Apify platform: `init()`, `exit()`, `fail()`, `main()`, `isAtHome()`, `createProxyConfiguration()`
* storage support: `getInput()`, `getValue()`, `openDataset()`, `openKeyValueStore()`, `openRequestQueue()`, `pushData()`, `setValue()`
* events support: `on()`, `off()`
* other utilities: `getEnv()`, `newClient()`, `reboot()`

`Actor.main` is now just syntax sugar around calling `Actor.init()` at the beginning and `Actor.exit()` at the end (plus wrapping the user function in a try/catch block). All those methods are async and should be awaited - with Node.js 16 we can use top level await for that. In other words, the following are equivalent:

```
import { Actor } from 'apify';

await Actor.init();
// your code
await Actor.exit('Crawling finished!');
```

```
import { Actor } from 'apify';

await Actor.main(async () => {
    // your code
}, { statusMessage: 'Crawling finished!' });
```

`Actor.init()` will conditionally set the storage implementation of Crawlee to the `ApifyClient` when running on the Apify platform, or keep the default (memory storage) implementation otherwise. It will also subscribe to the websocket events (or mimic them locally). `Actor.exit()` will handle the teardown and call `process.exit()` to ensure our process won't hang indefinitely.

### Events

Apify SDK (v2) exports `Apify.events`, which is an `EventEmitter` instance. With Crawlee, the events are managed by the [`EventManager`](https://crawlee.dev/js/api/core/class/EventManager.md) class instead. We can either access it via the `Actor.eventManager` getter, or use the `Actor.on` and `Actor.off` shortcuts instead.

```
-Apify.events.on(...);
+Actor.on(...);
```

> We can also get the [`EventManager`](https://crawlee.dev/js/api/core/class/EventManager.md) instance via `Configuration.getEventManager()`.

In addition to the existing events, we now have an `exit` event fired when calling `Actor.exit()` (which is called at the end of `Actor.main()`). This event allows you to gracefully shut down any resources when `Actor.exit` is called.
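To make the `exit` event concrete, here is a minimal sketch of such a shutdown hook. The interval timer is just a stand-in for whatever resource (database connection, file handle, etc.) you might need to release; everything else only uses the `Actor` API described above.

```
import { Actor } from 'apify';

await Actor.init();

// stand-in for any resource that needs explicit cleanup (DB connection, file handle, ...)
const heartbeat = setInterval(() => console.log('crawler still running...'), 5_000);

// fired when `Actor.exit()` is called, including at the end of `Actor.main()`
Actor.on('exit', () => {
    clearInterval(heartbeat);
    console.log('Resources released, exiting now.');
});

// ... your crawling code goes here ...

await Actor.exit('Crawling finished!');
```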
## Smaller/internal breaking changes

* `Apify.call()` is now just a shortcut for running `ApifyClient.actor(actorId).call(input, options)`, while also taking the token inside env vars into account
* `Apify.callTask()` is now just a shortcut for running `ApifyClient.task(taskId).call(input, options)`, while also taking the token inside env vars into account
* `Apify.metamorph()` is now just a shortcut for running `ApifyClient.run(runId).metamorph(targetActorId, input, options)`, while also taking the `ACTOR_RUN_ID` inside env vars into account
* `Apify.waitForRunToFinish()` has been removed, use `ApifyClient.waitForFinish()` instead
* `Actor.main/init` purges the storage by default
* removed the `purgeLocalStorage` helper, purging now lives on the storage class directly
    * the `StorageClient` interface now has an optional `purge` method
    * purging happens automatically via `Actor.init()` (you can opt out via `purge: false` in the options of the `init/main` methods)
* `QueueOperationInfo.request` is no longer available
* `Request.handledAt` is now a string date in ISO format
* `Request.inProgress` and `Request.reclaimed` are now `Set`s instead of POJOs
* `injectUnderscore` from puppeteer utils has been removed
* `APIFY_MEMORY_MBYTES` is no longer taken into account, use `CRAWLEE_AVAILABLE_MEMORY_RATIO` instead
* some `AutoscaledPool` options are no longer available:
    * `cpuSnapshotIntervalSecs` and `memorySnapshotIntervalSecs` have been replaced with the top level `systemInfoIntervalMillis` configuration
    * `maxUsedCpuRatio` has been moved to the top level configuration
* `ProxyConfiguration.newUrlFunction` can be async. `.newUrl()` and `.newProxyInfo()` now return promises (see the sketch below).
* `prepareRequestFunction` and `postResponseFunction` options are removed, use navigation hooks instead
* `gotoFunction` and `gotoTimeoutSecs` are removed
* removed the compatibility fix for old/broken request queues with null `Request` props
* `fingerprintsOptions` renamed to `fingerprintOptions` (`fingerprints` -> `fingerprint`)
* `fingerprintOptions` now accepts `useFingerprintCache` and `fingerprintCacheSize` (instead of `useFingerprintPerProxyCache` and `fingerprintPerProxyCacheSize`, which are no longer available). This is because the cached fingerprints are no longer connected to proxy URLs but to sessions.

---
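As an illustration of the `ProxyConfiguration.newUrlFunction` change above, here is a minimal sketch of an async setup. The proxy URL and session id below are made-up placeholders; the point is only that the function may return a promise and that `.newUrl()` / `.newProxyInfo()` must now be awaited.

```
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // the function may now be async, e.g. when fetching a fresh proxy from an external service
    newUrlFunction: async (sessionId) => {
        // placeholder URL - replace with however you obtain your proxies
        return `http://my-proxy.example.com:8000?session=${sessionId ?? 'default'}`;
    },
});

// both methods now return promises, so they need to be awaited
const proxyUrl = await proxyConfiguration.newUrl('session-1');
const proxyInfo = await proxyConfiguration.newProxyInfo('session-1');
console.log(proxyUrl, proxyInfo?.hostname);
```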