# Crawlee for JavaScript · Build reliable crawlers. Fast. - [Build reliable web scrapers. Fast.](https://crawlee.dev/index.md) ## blog - [Crawlee Blog - learn how to build better scrapers](https://crawlee.dev/blog.md) - [Archive](https://crawlee.dev/blog/archive.md) - [Authors](https://crawlee.dev/blog/authors.md) - [Current problems and mistakes of web scraping in Python and tricks to solve them!](https://crawlee.dev/blog/common-problems-in-web-scraping.md) - [Launching Crawlee Blog](https://crawlee.dev/blog/crawlee-blog-launch.md) - [Crawlee for Python v0.5](https://crawlee.dev/blog/crawlee-for-python-v05.md) - [Crawlee for Python v0.6](https://crawlee.dev/blog/crawlee-for-python-v06.md) - [Crawlee for Python v1](https://crawlee.dev/blog/crawlee-for-python-v1.md) - [How to build a price tracker with Crawlee and Apify](https://crawlee.dev/blog/crawlee-python-price-tracker.md) - [Reverse engineering GraphQL persistedQuery extension](https://crawlee.dev/blog/graphql-persisted-query.md) - [How to scrape Amazon products](https://crawlee.dev/blog/how-to-scrape-amazon.md) - [How to scrape infinite scrolling webpages with Python](https://crawlee.dev/blog/infinite-scroll-using-python.md) - [Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers](https://crawlee.dev/blog/launching-crawlee-python.md) - [How to create a LinkedIn job scraper in Python with Crawlee](https://crawlee.dev/blog/linkedin-job-scraper-python.md) - [Building a Netflix show recommender using Crawlee and React](https://crawlee.dev/blog/netflix-show-recommender.md) - [Crawlee Blog - learn how to build better scrapers](https://crawlee.dev/blog/page/2.md) - [Crawlee Blog - learn how to build better scrapers](https://crawlee.dev/blog/page/3.md) - [How Crawlee uses tiered proxies to avoid getting blocked](https://crawlee.dev/blog/proxy-management-in-crawlee.md) - [How to scrape Bluesky with Python](https://crawlee.dev/blog/scrape-bluesky-using-python.md) - [How to scrape Crunchbase using Python in 2024 (Easy Guide)](https://crawlee.dev/blog/scrape-crunchbase-python.md) - [How to scrape Google Maps data using Python](https://crawlee.dev/blog/scrape-google-maps.md) - [How to scrape Google search results with Python](https://crawlee.dev/blog/scrape-google-search.md) - [How to scrape TikTok using Python](https://crawlee.dev/blog/scrape-tiktok-python.md) - [Optimizing web scraping: Scraping auth data using JSDOM](https://crawlee.dev/blog/scrape-using-jsdom.md) - [How to scrape YouTube using Python [2025 guide]](https://crawlee.dev/blog/scrape-youtube-python.md) - [Web scraping of a dynamic website using Python with HTTP Client](https://crawlee.dev/blog/scraping-dynamic-websites-using-python.md) - [Scrapy vs. Crawlee](https://crawlee.dev/blog/scrapy-vs-crawlee.md) - [Inside implementing SuperScraper with Crawlee](https://crawlee.dev/blog/superscraper-with-crawlee.md) - [Tags](https://crawlee.dev/blog/tags.md) - [10 posts tagged with "community"](https://crawlee.dev/blog/tags/community.md) - [One post tagged with "proxy"](https://crawlee.dev/blog/tags/proxy.md) - [12 tips on how to think like a web scraping expert](https://crawlee.dev/blog/web-scraping-tips.md) ## js - [Build reliable web scrapers. 
Fast.](https://crawlee.dev/js.md) - [API](https://crawlee.dev/js/api.md) - [@crawlee/basic](https://crawlee.dev/js/api/basic-crawler.md) - [Changelog](https://crawlee.dev/js/api/basic-crawler/changelog.md) - [BasicCrawler ](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) - [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) - [BasicCrawlerOptions ](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) - [BasicCrawlingContext ](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) - [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) - [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) - [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) - [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) - [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) - [StatusMessageCallbackParams ](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) - [@crawlee/browser](https://crawlee.dev/js/api/browser-crawler.md) - [Changelog](https://crawlee.dev/js/api/browser-crawler/changelog.md) - [abstractBrowserCrawler ](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) - [BrowserCrawlerOptions ](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md) - [BrowserCrawlingContext ](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) - [BrowserLaunchContext ](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) - [@crawlee/browser-pool](https://crawlee.dev/js/api/browser-pool.md) - [Changelog](https://crawlee.dev/js/api/browser-pool/changelog.md) - [abstractBrowserController ](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) - [BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) - [abstractBrowserPlugin ](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md) - [BrowserPool ](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) - [LaunchContext ](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) - [PlaywrightBrowser](https://crawlee.dev/js/api/browser-pool/class/PlaywrightBrowser.md) - [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) - [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) - [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) - [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) - [constBROWSER_CONTROLLER_EVENTS](https://crawlee.dev/js/api/browser-pool/enum/BROWSER_CONTROLLER_EVENTS.md) - [constBROWSER_POOL_EVENTS](https://crawlee.dev/js/api/browser-pool/enum/BROWSER_POOL_EVENTS.md) - [BrowserName](https://crawlee.dev/js/api/browser-pool/enum/BrowserName.md) - [constDeviceCategory](https://crawlee.dev/js/api/browser-pool/enum/DeviceCategory.md) - [constOperatingSystemsName](https://crawlee.dev/js/api/browser-pool/enum/OperatingSystemsName.md) - [BrowserControllerEvents ](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md) - [BrowserPluginOptions ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md) - [BrowserPoolEvents 
](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md) - [BrowserPoolHooks ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md) - [BrowserPoolNewPageInNewBrowserOptions ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageInNewBrowserOptions.md) - [BrowserPoolNewPageOptions ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageOptions.md) - [BrowserPoolOptions ](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md) - [BrowserSpecification](https://crawlee.dev/js/api/browser-pool/interface/BrowserSpecification.md) - [CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md) - [CreateLaunchContextOptions ](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md) - [FingerprintGenerator](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGenerator.md) - [FingerprintGeneratorOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGeneratorOptions.md) - [FingerprintOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md) - [GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) - [LaunchContextOptions ](https://crawlee.dev/js/api/browser-pool/interface/LaunchContextOptions.md) - [@crawlee/cheerio](https://crawlee.dev/js/api/cheerio-crawler.md) - [Changelog](https://crawlee.dev/js/api/cheerio-crawler/changelog.md) - [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) - [createCheerioRouter](https://crawlee.dev/js/api/cheerio-crawler/function/createCheerioRouter.md) - [CheerioCrawlerOptions ](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md) - [CheerioCrawlingContext ](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md) - [@crawlee/core](https://crawlee.dev/js/api/core.md) - [Changelog](https://crawlee.dev/js/api/core/changelog.md) - [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) - [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) - [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) - [Dataset ](https://crawlee.dev/js/api/core/class/Dataset.md) - [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) - [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) - [abstractEventManager](https://crawlee.dev/js/api/core/class/EventManager.md) - [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) - [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) - [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) - [externalLog](https://crawlee.dev/js/api/core/class/Log.md) - [externalLogger](https://crawlee.dev/js/api/core/class/Logger.md) - [externalLoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) - [externalLoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) - [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) - [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) - [externalPseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) - [RecoverableState ](https://crawlee.dev/js/api/core/class/RecoverableState.md) - [Request ](https://crawlee.dev/js/api/core/class/Request.md) - [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) - 
[RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) - [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) - [abstractRequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) - [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) - [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) - [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) - [Router ](https://crawlee.dev/js/api/core/class/Router.md) - [Session](https://crawlee.dev/js/api/core/class/Session.md) - [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) - [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) - [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) - [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) - [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) - [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) - [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) - [constEventType](https://crawlee.dev/js/api/core/enum/EventType.md) - [externalLogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) - [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) - [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) - [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) - [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) - [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) - [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) - [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) - [useState](https://crawlee.dev/js/api/core/function/useState.md) - [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) - [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) - [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) - [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) - [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) - [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) - [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) - [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) - [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) - [CrawlingContext ](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) - [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) - [DatasetConsumer ](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) - [DatasetContent ](https://crawlee.dev/js/api/core/interface/DatasetContent.md) - [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) - [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) - [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) - [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) - [DatasetMapper ](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) - 
[DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) - [DatasetReducer ](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) - [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) - [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) - [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) - [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) - [HttpRequest ](https://crawlee.dev/js/api/core/interface/HttpRequest.md) - [HttpRequestOptions ](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) - [HttpResponse ](https://crawlee.dev/js/api/core/interface/HttpResponse.md) - [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) - [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) - [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) - [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) - [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) - [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) - [externalLoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) - [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) - [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) - [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) - [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) - [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) - [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) - [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) - [RecoverableStateOptions ](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) - [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) - [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) - [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) - [RequestOptions ](https://crawlee.dev/js/api/core/interface/RequestOptions.md) - [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) - [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) - [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) - [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) - [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) - [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) - [RestrictedCrawlingContext ](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) - [RouterHandler ](https://crawlee.dev/js/api/core/interface/RouterHandler.md) - [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) - [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) - [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) - [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) - 
[SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) - [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) - [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) - [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) - [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) - [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) - [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) - [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) - [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) - [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) - [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) - [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) - [@crawlee/http](https://crawlee.dev/js/api/http-crawler.md) - [Changelog](https://crawlee.dev/js/api/http-crawler/changelog.md) - [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) - [HttpCrawler ](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) - [ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) - [createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) - [createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) - [MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) - [FileDownloadCrawlingContext ](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) - [HttpCrawlerOptions ](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) - [HttpCrawlingContext ](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) - [@crawlee/jsdom](https://crawlee.dev/js/api/jsdom-crawler.md) - [Changelog](https://crawlee.dev/js/api/jsdom-crawler/changelog.md) - [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) - [createJSDOMRouter](https://crawlee.dev/js/api/jsdom-crawler/function/createJSDOMRouter.md) - [JSDOMCrawlerOptions ](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md) - [JSDOMCrawlingContext ](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md) - [@crawlee/linkedom](https://crawlee.dev/js/api/linkedom-crawler.md) - [Changelog](https://crawlee.dev/js/api/linkedom-crawler/changelog.md) - [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) - [createLinkeDOMRouter](https://crawlee.dev/js/api/linkedom-crawler/function/createLinkeDOMRouter.md) - [LinkeDOMCrawlerEnqueueLinksOptions](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerEnqueueLinksOptions.md) - [LinkeDOMCrawlerOptions ](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md) - [LinkeDOMCrawlingContext ](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md) - [@crawlee/memory-storage](https://crawlee.dev/js/api/memory-storage.md) - [Changelog](https://crawlee.dev/js/api/memory-storage/changelog.md) - [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) - [MemoryStorageOptions](https://crawlee.dev/js/api/memory-storage/interface/MemoryStorageOptions.md) - 
[@crawlee/playwright](https://crawlee.dev/js/api/playwright-crawler.md) - [Changelog](https://crawlee.dev/js/api/playwright-crawler/changelog.md) - [AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) - [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) - [RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md) - [createAdaptivePlaywrightRouter](https://crawlee.dev/js/api/playwright-crawler/function/createAdaptivePlaywrightRouter.md) - [createPlaywrightRouter](https://crawlee.dev/js/api/playwright-crawler/function/createPlaywrightRouter.md) - [launchPlaywright](https://crawlee.dev/js/api/playwright-crawler/function/launchPlaywright.md) - [AdaptivePlaywrightCrawlerContext ](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md) - [AdaptivePlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerOptions.md) - [PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) - [PlaywrightCrawlingContext ](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md) - [PlaywrightHook](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightHook.md) - [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) - [PlaywrightRequestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightRequestHandler.md) - [playwrightClickElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md) - [playwrightUtils](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md) - [@crawlee/puppeteer](https://crawlee.dev/js/api/puppeteer-crawler.md) - [Changelog](https://crawlee.dev/js/api/puppeteer-crawler/changelog.md) - [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) - [createPuppeteerRouter](https://crawlee.dev/js/api/puppeteer-crawler/function/createPuppeteerRouter.md) - [launchPuppeteer](https://crawlee.dev/js/api/puppeteer-crawler/function/launchPuppeteer.md) - [PuppeteerCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md) - [PuppeteerCrawlingContext ](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md) - [PuppeteerHook](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerHook.md) - [PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) - [PuppeteerRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerRequestHandler.md) - [puppeteerClickElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md) - [puppeteerRequestInterception](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md) - [puppeteerUtils](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md) - [@crawlee/types](https://crawlee.dev/js/api/types.md) - [Changelog](https://crawlee.dev/js/api/types/changelog.md) - [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) - [BrowserLikeResponse](https://crawlee.dev/js/api/types/interface/BrowserLikeResponse.md) - [Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md) - [DatasetClient 
](https://crawlee.dev/js/api/types/interface/DatasetClient.md) - [DatasetClientListOptions](https://crawlee.dev/js/api/types/interface/DatasetClientListOptions.md) - [DatasetClientUpdateOptions](https://crawlee.dev/js/api/types/interface/DatasetClientUpdateOptions.md) - [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) - [DatasetCollectionClientOptions](https://crawlee.dev/js/api/types/interface/DatasetCollectionClientOptions.md) - [DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md) - [DatasetInfo](https://crawlee.dev/js/api/types/interface/DatasetInfo.md) - [DatasetStats](https://crawlee.dev/js/api/types/interface/DatasetStats.md) - [DeleteRequestLockOptions](https://crawlee.dev/js/api/types/interface/DeleteRequestLockOptions.md) - [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) - [KeyValueStoreClientGetRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientGetRecordOptions.md) - [KeyValueStoreClientListData](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListData.md) - [KeyValueStoreClientListOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListOptions.md) - [KeyValueStoreClientUpdateOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientUpdateOptions.md) - [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) - [KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md) - [KeyValueStoreItemData](https://crawlee.dev/js/api/types/interface/KeyValueStoreItemData.md) - [KeyValueStoreRecord](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecord.md) - [KeyValueStoreRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecordOptions.md) - [KeyValueStoreStats](https://crawlee.dev/js/api/types/interface/KeyValueStoreStats.md) - [ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md) - [ListAndLockOptions](https://crawlee.dev/js/api/types/interface/ListAndLockOptions.md) - [ListOptions](https://crawlee.dev/js/api/types/interface/ListOptions.md) - [PaginatedList ](https://crawlee.dev/js/api/types/interface/PaginatedList.md) - [ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md) - [ProlongRequestLockOptions](https://crawlee.dev/js/api/types/interface/ProlongRequestLockOptions.md) - [ProlongRequestLockResult](https://crawlee.dev/js/api/types/interface/ProlongRequestLockResult.md) - [QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md) - [RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) - [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) - [RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) - [RequestQueueHeadItem](https://crawlee.dev/js/api/types/interface/RequestQueueHeadItem.md) - [RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md) - [RequestQueueOptions](https://crawlee.dev/js/api/types/interface/RequestQueueOptions.md) - [RequestQueueStats](https://crawlee.dev/js/api/types/interface/RequestQueueStats.md) - [RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md) - [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) - 
[UnprocessedRequest](https://crawlee.dev/js/api/types/interface/UnprocessedRequest.md) - [UpdateRequestSchema](https://crawlee.dev/js/api/types/interface/UpdateRequestSchema.md) - [@crawlee/utils](https://crawlee.dev/js/api/utils.md) - [Changelog](https://crawlee.dev/js/api/utils/changelog.md) - [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) - [Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md) - [chunk](https://crawlee.dev/js/api/utils/function/chunk.md) - [createRequestDebugInfo](https://crawlee.dev/js/api/utils/function/createRequestDebugInfo.md) - [downloadListOfUrls](https://crawlee.dev/js/api/utils/function/downloadListOfUrls.md) - [extractUrls](https://crawlee.dev/js/api/utils/function/extractUrls.md) - [extractUrlsFromCheerio](https://crawlee.dev/js/api/utils/function/extractUrlsFromCheerio.md) - [getCgroupsVersion](https://crawlee.dev/js/api/utils/function/getCgroupsVersion.md) - [getMemoryInfo](https://crawlee.dev/js/api/utils/function/getMemoryInfo.md) - [getObjectType](https://crawlee.dev/js/api/utils/function/getObjectType.md) - [gotScraping](https://crawlee.dev/js/api/utils/function/gotScraping.md) - [htmlToText](https://crawlee.dev/js/api/utils/function/htmlToText.md) - [isContainerized](https://crawlee.dev/js/api/utils/function/isContainerized.md) - [isDocker](https://crawlee.dev/js/api/utils/function/isDocker.md) - [isLambda](https://crawlee.dev/js/api/utils/function/isLambda.md) - [parseOpenGraph](https://crawlee.dev/js/api/utils/function/parseOpenGraph.md) - [parseSitemap](https://crawlee.dev/js/api/utils/function/parseSitemap.md) - [sleep](https://crawlee.dev/js/api/utils/function/sleep.md) - [DownloadListOfUrlsOptions](https://crawlee.dev/js/api/utils/interface/DownloadListOfUrlsOptions.md) - [ExtractUrlsOptions](https://crawlee.dev/js/api/utils/interface/ExtractUrlsOptions.md) - [MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md) - [OpenGraphProperty](https://crawlee.dev/js/api/utils/interface/OpenGraphProperty.md) - [ParseSitemapOptions](https://crawlee.dev/js/api/utils/interface/ParseSitemapOptions.md) - [social](https://crawlee.dev/js/api/utils/namespace/social.md) - [Deployment guides](https://crawlee.dev/js/docs/deployment.md) - [Apify Platform](https://crawlee.dev/js/docs/deployment/apify-platform.md) - [Browsers on AWS Lambda](https://crawlee.dev/js/docs/deployment/aws-browsers.md) - [Cheerio on AWS Lambda](https://crawlee.dev/js/docs/deployment/aws-cheerio.md) - [Browsers in GCP Cloud Run](https://crawlee.dev/js/docs/deployment/gcp-browsers.md) - [Cheerio on GCP Cloud Functions](https://crawlee.dev/js/docs/deployment/gcp-cheerio.md) - [Examples](https://crawlee.dev/js/docs/examples.md) - [Accept user input](https://crawlee.dev/js/docs/examples/accept-user-input.md) - [Add data to dataset](https://crawlee.dev/js/docs/examples/add-data-to-dataset.md) - [Basic crawler](https://crawlee.dev/js/docs/examples/basic-crawler.md) - [Capture a screenshot using Puppeteer](https://crawlee.dev/js/docs/examples/capture-screenshot.md) - [Cheerio crawler](https://crawlee.dev/js/docs/examples/cheerio-crawler.md) - [Crawl all links on a website](https://crawlee.dev/js/docs/examples/crawl-all-links.md) - [Crawl multiple URLs](https://crawlee.dev/js/docs/examples/crawl-multiple-urls.md) - [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) - [Crawl a single URL](https://crawlee.dev/js/docs/examples/crawl-single-url.md) - [Crawl a 
sitemap](https://crawlee.dev/js/docs/examples/crawl-sitemap.md) - [Crawl some links on a website](https://crawlee.dev/js/docs/examples/crawl-some-links.md) - [Using Puppeteer Stealth Plugin (puppeteer-extra) and playwright-extra](https://crawlee.dev/js/docs/examples/crawler-plugins.md) - [Export entire dataset to one file](https://crawlee.dev/js/docs/examples/export-entire-dataset.md) - [Download a file](https://crawlee.dev/js/docs/examples/file-download.md) - [Download a file with Node.js streams](https://crawlee.dev/js/docs/examples/file-download-stream.md) - [Fill and Submit a Form using Puppeteer](https://crawlee.dev/js/docs/examples/forms.md) - [HTTP crawler](https://crawlee.dev/js/docs/examples/http-crawler.md) - [JSDOM crawler](https://crawlee.dev/js/docs/examples/jsdom-crawler.md) - [Dataset Map and Reduce methods](https://crawlee.dev/js/docs/examples/map-and-reduce.md) - [Playwright crawler](https://crawlee.dev/js/docs/examples/playwright-crawler.md) - [Using Firefox browser with Playwright crawler](https://crawlee.dev/js/docs/examples/playwright-crawler-firefox.md) - [Puppeteer crawler](https://crawlee.dev/js/docs/examples/puppeteer-crawler.md) - [Puppeteer recursive crawl](https://crawlee.dev/js/docs/examples/puppeteer-recursive-crawl.md) - [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) - [Experiments](https://crawlee.dev/js/docs/experiments.md) - [Request Locking](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) - [System Infomation V2](https://crawlee.dev/js/docs/experiments/experiments-system-infomation-v2.md) - [Guides](https://crawlee.dev/js/docs/guides.md) - [Avoid getting blocked](https://crawlee.dev/js/docs/guides/avoid-blocking.md) - [CheerioCrawler guide](https://crawlee.dev/js/docs/guides/cheerio-crawler-guide.md) - [Configuration](https://crawlee.dev/js/docs/guides/configuration.md) - [Using a custom HTTP client (Experimental)](https://crawlee.dev/js/docs/guides/custom-http-client.md) - [Running in Docker](https://crawlee.dev/js/docs/guides/docker-images.md) - [Got Scraping](https://crawlee.dev/js/docs/guides/got-scraping.md) - [Impit HTTP Client](https://crawlee.dev/js/docs/guides/impit-http-client.md) - [JavaScript rendering](https://crawlee.dev/js/docs/guides/javascript-rendering.md) - [JSDOMCrawler guide](https://crawlee.dev/js/docs/guides/jsdom-crawler-guide.md) - [motivation](https://crawlee.dev/js/docs/guides/motivation.md) - [Parallel Scraping Guide](https://crawlee.dev/js/docs/guides/parallel-scraping.md) - [Proxy Management](https://crawlee.dev/js/docs/guides/proxy-management.md) - [Request Storage](https://crawlee.dev/js/docs/guides/request-storage.md) - [Result Storage](https://crawlee.dev/js/docs/guides/result-storage.md) - [Running in web server](https://crawlee.dev/js/docs/guides/running-in-web-server.md) - [Scaling our crawlers](https://crawlee.dev/js/docs/guides/scaling-crawlers.md) - [Session Management](https://crawlee.dev/js/docs/guides/session-management.md) - [TypeScript Projects](https://crawlee.dev/js/docs/guides/typescript-project.md) - [Introduction](https://crawlee.dev/js/docs/introduction.md) - [Adding more URLs](https://crawlee.dev/js/docs/introduction/adding-urls.md) - [Crawling the Store](https://crawlee.dev/js/docs/introduction/crawling.md) - [Running your crawler in the Cloud](https://crawlee.dev/js/docs/introduction/deployment.md) - [First crawler](https://crawlee.dev/js/docs/introduction/first-crawler.md) - [Getting some real-world 
data](https://crawlee.dev/js/docs/introduction/real-world-project.md) - [Refactoring](https://crawlee.dev/js/docs/introduction/refactoring.md) - [Saving data](https://crawlee.dev/js/docs/introduction/saving-data.md) - [Scraping the Store](https://crawlee.dev/js/docs/introduction/scraping.md) - [Setting up](https://crawlee.dev/js/docs/introduction/setting-up.md) - [Quick Start](https://crawlee.dev/js/docs/quick-start.md) - [Upgrading](https://crawlee.dev/js/docs/upgrading.md) - [Upgrading to v1](https://crawlee.dev/js/docs/upgrading/upgrading-to-v1.md) - [Upgrading to v2](https://crawlee.dev/js/docs/upgrading/upgrading-to-v2.md) - [Upgrading to v3](https://crawlee.dev/js/docs/upgrading/upgrading-to-v3.md) ## search - [Search the documentation](https://crawlee.dev/search.md) ## Optional - [Crawlee for Python llms.txt](https://crawlee.dev/python/llms.txt) - [Crawlee for Python llms-full.txt](https://crawlee.dev/python/llms-full.txt) --- # Full Documentation Content ## [Crawlee for Python v1](https://crawlee.dev/blog/crawlee-for-python-v1.md) September 15, 2025 · 15 min read [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) [Vlada Dusek](https://github.com/vdusek) Developer of Crawlee for Python We launched Crawlee for Python in beta mode in [July 2024](https://www.crawlee.dev/blog/launching-crawlee-python). Over the past year, we received many early adopters, tremendous interest in the library from the Python community, more than 6000 stars on GitHub, a dozen contributors, and many feature requests. After months of development, polishing, and community feedback, the library is leaving beta and entering a production/stable development status. **We are happy to announce Crawlee for Python v1.0.** From now on, Crawlee for Python will strictly follow [semantic versioning](https://www.semver.org/). You can now rely on it as a stable foundation for your crawling and scraping projects, knowing that breaking changes will only occur in major releases. 
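Since the project now follows semantic versioning, you can pin your dependency to the v1 major line and keep receiving new features and bug fixes without unexpected breaking changes. A minimal sketch of such a version constraint (assuming a plain installation from PyPI; any extras for specific crawler types would be added to the package name):

```
pip install "crawlee>=1.0,<2.0"
```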
## What's new in Crawlee for Python v1[​](#whats-new-in-crawlee-for-python-v1 "Direct link to What's new in Crawlee for Python v1") * [New storage client system](#new-storage-client-system) * [Adaptive Playwright crawler](#adaptive-playwright-crawler) * [Impit HTTP client](#impit-http-client) * [Sitemap request loader](#sitemap-request-loader) * [Robots exclusion standard](#robots-exclusion-standard) * [Fingerprinting](#fingerprinting) * [Open telemetry](#open-telemetry) ![Crawlee for Python v1.0](/assets/images/crawlee_v100-d491a6c5406c55e0bfcdc9b39b81b7ae.webp) [**Read More**](https://crawlee.dev/blog/crawlee-for-python-v1.md) --- ### 2025[​](#2025 "Direct link to 2025") * [January 3](https://crawlee.dev/blog/scrape-crunchbase-python.md) [ - ](https://crawlee.dev/blog/scrape-crunchbase-python.md) [How to scrape Crunchbase using Python in 2024 (Easy Guide)](https://crawlee.dev/blog/scrape-crunchbase-python.md) * [January 10](https://crawlee.dev/blog/crawlee-for-python-v05.md) [ - ](https://crawlee.dev/blog/crawlee-for-python-v05.md) [Crawlee for Python v0.5](https://crawlee.dev/blog/crawlee-for-python-v05.md) * [March 5](https://crawlee.dev/blog/superscraper-with-crawlee.md) [ - ](https://crawlee.dev/blog/superscraper-with-crawlee.md) [Inside implementing SuperScraper with Crawlee](https://crawlee.dev/blog/superscraper-with-crawlee.md) * [March 6](https://crawlee.dev/blog/crawlee-for-python-v06.md) [ - ](https://crawlee.dev/blog/crawlee-for-python-v06.md) [Crawlee for Python v0.6](https://crawlee.dev/blog/crawlee-for-python-v06.md) * [March 20](https://crawlee.dev/blog/scrape-bluesky-using-python.md) [ - ](https://crawlee.dev/blog/scrape-bluesky-using-python.md) [How to scrape Bluesky with Python](https://crawlee.dev/blog/scrape-bluesky-using-python.md) * [April 8](https://crawlee.dev/blog/crawlee-python-price-tracker.md) [ - ](https://crawlee.dev/blog/crawlee-python-price-tracker.md) [How to build a price tracker with Crawlee and Apify](https://crawlee.dev/blog/crawlee-python-price-tracker.md) * [April 25](https://crawlee.dev/blog/scrape-tiktok-python.md) [ - ](https://crawlee.dev/blog/scrape-tiktok-python.md) [How to scrape TikTok using Python](https://crawlee.dev/blog/scrape-tiktok-python.md) * [July 14](https://crawlee.dev/blog/scrape-youtube-python.md) [ - ](https://crawlee.dev/blog/scrape-youtube-python.md) [How to scrape YouTube using Python \[2025 guide\]](https://crawlee.dev/blog/scrape-youtube-python.md) * [September 15](https://crawlee.dev/blog/crawlee-for-python-v1.md) [ - ](https://crawlee.dev/blog/crawlee-for-python-v1.md) [Crawlee for Python v1](https://crawlee.dev/blog/crawlee-for-python-v1.md) --- # Authors * [![Percival Villalva](https://avatars.githubusercontent.com/u/70678259?v=4)](https://github.com/PerVillalva) ## [Percival Villalva](https://github.com/PerVillalva) 1 Community Member of Crawlee [](https://github.com/PerVillalva "GitHub") * [![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) ## [Saurav Jain](https://github.com/souravjain540) 8 Developer Community Manager [](https://x.com/sauain "X")[](https://github.com/souravjain540 "GitHub") * [![Arindam Majumder](https://avatars.githubusercontent.com/u/109217591?v=4)](https://github.com/Arindam200) ## [Arindam Majumder](https://github.com/Arindam200) 1 Community Member of Crawlee [](https://x.com/Arindam_1729 "X")[](https://github.com/Arindam200 "GitHub") * [![Ayush Thakur](https://avatars.githubusercontent.com/u/43995654?v=4)](https://github.com/ayush2390) ## 
[Ayush Thakur](https://github.com/ayush2390) 1 Community Member of Crawlee [](https://x.com/JSAyushThakur "X")[](https://github.com/ayush2390 "GitHub") * [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) ## [Max](https://github.com/Mantisus) 8 Community Member of Crawlee and web scraping expert [](https://github.com/Mantisus "GitHub") * [![Lukáš Průša](./img/lukasp.webp)](https://github.com/Patai5) ## [Lukáš Průša](https://github.com/Patai5) 1 Junior Web Automation Engineer [](https://github.com/Patai5 "GitHub") * [![Matěj Volf](https://avatars.githubusercontent.com/u/31281386?v=4)](https://github.com/mvolfik) ## [Matěj Volf](https://github.com/mvolfik) 1 Web Automation Engineer [](https://github.com/mvolfik "GitHub") * [![Satyam Tripathi](https://avatars.githubusercontent.com/u/69134468?v=4)](https://github.com/triposat) ## [Satyam Tripathi](https://github.com/triposat) 1 Community Member of Crawlee [](https://github.com/triposat "GitHub") * [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) ## [Vlada Dusek](https://github.com/vdusek) 3 Developer of Crawlee for Python [](https://github.com/vdusek "GitHub") * [![Radoslav Chudovský](https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512)](https://github.com/chudovskyr) ## [Radoslav Chudovský](https://github.com/chudovskyr) 1 Web Automation Engineer [](https://github.com/chudovskyr "GitHub") --- # Current problems and mistakes of web scraping in Python and tricks to solve them! August 20, 2024 · 17 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert ## Introduction[​](#introduction "Direct link to Introduction") Greetings! I'm [Max](https://apify.com/mantisus), a Python developer from Ukraine, a developer with expertise in web scraping, data analysis, and processing. My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as [Import.io](https://www.import.io/) and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when [`requests`](https://requests.readthedocs.io/en/latest/) and [`lxml`](https://lxml.de/)/[`beautifulsoup`](https://beautiful-soup-4.readthedocs.io/en/latest/) were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :) note One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). As a freelancer, I've built small solutions and large, complex data mining systems for products over the years. Today, I want to discuss the realities of [web scraping with Python in 2024](https://blog.apify.com/web-scraping-python/). We'll look at the mistakes I sometimes see and the problems you'll encounter and offer solutions to some of them. Let's get started. Just take `requests` and `beautifulsoup` and start making a lot of money... No, this is not that kind of article. ## 1. "I got a 200 response from the server, but it's an unreadable character set."[​](#1-i-got-a-200-response-from-the-server-but-its-an-unreadable-character-set "Direct link to 1. 
\"I got a 200 response from the server, but it's an unreadable character set.\"") Yes, it can be surprising. But I've seen this message from customers and developers six years ago, four years ago, and in 2024. I read a post on Reddit just a few months ago about this issue. Let's look at a simple code example. This will work for `requests`, [`httpx`](https://www.python-httpx.org/), and [`aiohttp`](https://docs.aiohttp.org/en/stable/client.html#aiohttp-client) with a clean installation and no extensions. ``` import httpx url = 'https://www.wayfair.com/' headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate, br, zstd", "Connection": "keep-alive", } response = httpx.get(url, headers=headers) print(response.content[:10]) ``` The print result will be similar to: ``` b'\x83\x0c\x00\x00\xc4\r\x8e4\x82\x8a' ``` It's not an error - it's a perfectly valid server response. It's encoded somehow. The answer lies in the `Accept-Encoding` header. In the example above, I just copied it from my browser, so it lists all the compression methods my browser supports: "gzip, deflate, br, zstd". The Wayfair backend supports compression with "br", which is [Brotli](https://github.com/google/brotli), and uses it as the most efficient method. This can happen if none of the libraries listed above have `Brotli` among their standard dependencies. However, they all support decompression from this format if you already have `Brotli` installed. Therefore, it's sufficient to install the appropriate library: ``` pip install Brotli ``` This will allow the same print call to return readable, decompressed HTML instead of the raw bytes shown above. --- # Launching Crawlee Blog 3 min read [![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager Hey, crawling masters! I’m Saurav, Developer Community Manager at Apify, and I’m thrilled to announce that we’re launching the Crawlee blog today 🎉 We launched Crawlee, the successor to our Apify SDK, in [August 2022](https://blog.apify.com/announcing-crawlee-the-web-scraping-and-browser-automation-library/) to make the best web scraping and automation library for Node.js developers who like to write code in JavaScript or TypeScript. Since then, our dev community has grown exponentially. I’m proud to tell you that we have **over 11,500 Stars on GitHub**, over **6,000 community members on our Discord**, and over **125,000 downloads monthly on npm**. We’re now the most popular web scraping and automation library for Node.js developers 👏 ## Changes in Crawlee since the launch[​](#changes-in-crawlee-since-the-launch "Direct link to Changes in Crawlee since the launch") Crawlee has progressively evolved with the introduction of key features to enhance web scraping and automation: * [v3.1](https://github.com/apify/crawlee/releases/tag/v3.1.0) added an [error tracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) for analyzing and summarizing failed requests. * The [v3.3](https://github.com/apify/crawlee/releases/tag/v3.3.0) update brought an `exclude` option to the `enqueueLinks` helper and integrated status messages. This improved usability on the Apify platform with automatic summary updates in the console UI.
* [v3.4](https://github.com/apify/crawlee/releases/tag/v3.4.0) introduced the [`linkedom` crawler](https://crawlee.dev/js/api/linkedom-crawler.md), offering a new parsing option. * The [v3.5](https://github.com/apify/crawlee/releases/tag/v3.5.0) update optimized link enqueuing for efficiency. * [v3.6](https://github.com/apify/crawlee/releases/tag/v3.6.0) launched experimental support for a [new request queue API](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md), enabling parallel execution and improved scalability for multiple scrapers working concurrently. All of this marked significant strides in making web scraping more efficient and robust. ## Future of Crawlee\![​](#future-of-crawlee "Direct link to Future of Crawlee!") The Crawlee team is actively developing an adaptive crawling feature to revolutionize how Crawlee interacts with and navigates through websites. We just launched [v3.8](https://github.com/apify/crawlee/releases/tag/v3.8.0) with experimental support for the new [adaptive crawler type](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md). ## Support us on GitHub.[​](#support-us-on-github "Direct link to Support us on GitHub.") Before I tell you about our upcoming plans for Crawlee Blog, I recommend you check out Crawlee if you haven’t already. We are open-source. You can see our [source code here](https://github.com/apify/crawlee/). If you like Crawlee, then please don’t forget to give us a star on GitHub. ![Crawlee\_presentation\_final](https://github.com/souravjain540/crawlee-first-blog/assets/53312820/051ec8a3-86a7-4109-8fb3-135e399cbe93) ## Crawlee Blog and upcoming plans\![​](#crawlee-blog-and-upcoming-plans "Direct link to Crawlee Blog and upcoming plans!") The first step to achieving this goal is to reach out to the broader developer community through our content. The Crawlee blog aims to be the best informational hub for Node.js developers interested in web scraping and automation. **What to expect:** * How-to tutorials on making web crawlers, scrapers, and automation applications using Crawlee. * Thought leadership content on web crawling. * Crawlee feature updates and changes. * Community content collaboration. We’ll be posting content monthly for our dev community, so stay tuned! If you have ideas on specific content topics and want to give us input, please [join our Discord community](https://apify.com/discord) and tag me with your ideas. Also, we encourage collaboration with the community, so if you have some interesting pieces of content related to Crawlee, let us know in Discord, and we’ll feature them on our blog. 😀 In the meantime, you might want to check out this article on [Crawlee data storage types](https://blog.apify.com/crawlee-data-storage-types/) on the Apify Blog. --- # Crawlee for Python v0.5 January 10, 2025 · 7 min read [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) [Vlada Dusek](https://github.com/vdusek) Developer of Crawlee for Python Crawlee for Python v0.5 is now available! This is our biggest release to date, bringing new ported functionality from [Crawlee for JavaScript](https://github.com/apify/crawlee), brand-new features that are exclusive to the Python library (for now), a new consolidated package structure, and a bunch of bug fixes and further improvements.
## Getting started[​](#getting-started "Direct link to Getting started") You can upgrade to the latest version straight from [PyPI](https://pypi.org/project/crawlee/): ``` pip install --upgrade crawlee ``` Check out the full changelog on our [website](https://www.crawlee.dev/python/docs/changelog#050-2025-01-02) to see all the details. If you are updating from an older version, make sure to follow our [Upgrading to v0.5](https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v05) guide for a smooth upgrade. ## New package structure[​](#new-package-structure "Direct link to New package structure") We have introduced a new consolidated package structure. The goal is to streamline the development experience, help you find the crawlers you are looking for faster, and improve the IDE's code suggestions while importing. ### Crawlers[​](#crawlers "Direct link to Crawlers") We have grouped all crawler classes (and their corresponding crawling context classes) into a single sub-package called `crawlers`. Here is a quick example of how the imports have changed: ``` - from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext + from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext ``` Look how you can see all the crawlers that we have, isn't that cool! ![Import from crawlers subpackage.](/assets/images/import_crawlers-32dc36ba69192c5d936cbc8c05a9b946.webp) ### Storage clients[​](#storage-clients "Direct link to Storage clients") Similarly, we have moved all storage client classes under `storage_clients` sub-package. For instance: ``` - from crawlee.memory_storage_client import MemoryStorageClient + from crawlee.storage_clients import MemoryStorageClient ``` This consolidation makes it clearer where each class belongs and ensures that your IDE can provide better autocompletion when you are looking for the right crawler or storage client. ## Continued parity with Crawlee JS[​](#continued-parity-with-crawlee-js "Direct link to Continued parity with Crawlee JS") We are constantly working toward feature parity with our JavaScript library, [Crawlee JS](https://github.com/apify/crawlee). With v0.5, we have brought over more functionality: ### HTML to text context helper[​](#html-to-text-context-helper "Direct link to HTML to text context helper") The `html_to_text` crawling context helper simplifies extracting text from an HTML page by automatically removing all tags and returning only the raw text content. It's available in the [`ParselCrawlingContext`](https://www.crawlee.dev/python/api/class/ParselCrawlingContext#html_to_text) and [`BeautifulSoupCrawlingContext`](https://www.crawlee.dev/python/api/class/BeautifulSoupCrawlingContext#html_to_text). ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext async def main() -> None: crawler = ParselCrawler() @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info('Crawling: %s', context.request.url) text = context.html_to_text() # Continue with the processing... await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` In this example, we use a [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler) to fetch a webpage, then invoke `context.html_to_text()` to extract clean text for further processing. 
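The extracted text can then be handled like any other scraped value, for example by storing it in the default dataset with the `push_data` context helper. The following is only a sketch of one way to continue the handler above; the stored field names are illustrative:

```
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # Strip all tags and keep only the readable text of the page.
        text = context.html_to_text()
        # Save the cleaned text together with its source URL to the default dataset.
        await context.push_data({'url': context.request.url, 'text': text})

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```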
### Use state[​](#use-state "Direct link to Use state") The [`use_state`](https://www.crawlee.dev/python/api/class/UseStateFunction) crawling context helper makes it simple to create and manage persistent state values within your crawler. It ensures that all state values are automatically persisted. It enables you to maintain data across different crawler runs, restarts, and failures. It acts as a convenient abstraction for interaction with [`KeyValueStore`](https://www.crawlee.dev/python/api/class/KeyValueStore). ``` import asyncio from crawlee import Request from crawlee.configuration import Configuration from crawlee.crawlers import ParselCrawler, ParselCrawlingContext async def main() -> None: # Create a crawler with purge_on_start disabled to retain state across runs. crawler = ParselCrawler( configuration=Configuration(purge_on_start=False), ) @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info(f'Crawling {context.request.url}') # Retrieve or initialize the state with a default value. state = await context.use_state('state', default_value={'runs': 0}) # Increment the run count. state['runs'] += 1 # Create a request with always_enqueue enabled to bypass deduplication and ensure it is processed. request = Request.from_url('https://crawlee.dev/', always_enqueue=True) # Run the crawler with the start request. await crawler.run([request]) # Fetch the persisted state from the key-value store. kvs = await crawler.get_key_value_store() state = await kvs.get_auto_saved_value('state') crawler.log.info(f'Final state after run: {state}') if __name__ == '__main__': asyncio.run(main()) ``` Please note that the `use_state` is an experimental feature. Its behavior and interface may evolve in future versions. ## Brand new features[​](#brand-new-features "Direct link to Brand new features") In addition to porting features from JS, we are introducing new, Python-first functionalities that will eventually make their way into Crawlee JS in the coming months. ### Crawler's stop method[​](#crawlers-stop-method "Direct link to Crawler's stop method") The [`BasicCrawler`](https://www.crawlee.dev/python/api/class/BasicCrawler), and by extension, all crawlers that inherit from it, now has a [`stop`](https://www.crawlee.dev/python/api/class/BasicCrawler#stop) method. This makes it easy to halt the crawling when a specific condition is met, for instance, if you have found the data you were looking for. ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext async def main() -> None: crawler = ParselCrawler() @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info('Crawling: %s', context.request.url) # Extract and enqueue links from the page. await context.enqueue_links() title = context.selector.css('title::text').get() # Condition when you want to stop the crawler, e.g. you # have found what you were looking for. 
if 'Crawlee for Python' in title: context.log.info('Condition met, stopping the crawler.') await crawler.stop() await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` ### Request loaders[​](#request-loaders "Direct link to Request loaders") There are new classes [`RequestLoader`](https://www.crawlee.dev/python/api/class/RequestLoader), [`RequestManager`](https://www.crawlee.dev/python/api/class/RequestManager) and [`RequestManagerTandem`](https://www.crawlee.dev/python/api/class/RequestManagerTandem) that manage how Crawlee accesses and stores requests. They allow you to use other component (service) as a source for requests and optionally you can combine it with a [`RequestQueue`](https://www.crawlee.dev/python/api/class/RequestQueue). They let you plug in any request source, and combine the external data sources with Crawlee's standard `RequestQueue`. You can learn more about these new features in the [Request loaders guide](https://www.crawlee.dev/python/docs/guides/request-loaders). ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.request_loaders import RequestList, RequestManagerTandem from crawlee.storages import RequestQueue async def main() -> None: rl = RequestList( [ 'https://crawlee.dev', 'https://apify.com', # Long list of URLs... ], ) rq = await RequestQueue.open() # Combine them into a single request source. tandem = RequestManagerTandem(rl, rq) crawler = ParselCrawler(request_manager=tandem) @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info(f'Crawling {context.request.url}') # ... await crawler.run() if __name__ == '__main__': asyncio.run(main()) ``` In this example we combine a [`RequestList`](https://www.crawlee.dev/python/api/class/RequestList) with a [`RequestQueue`](https://www.crawlee.dev/python/api/class/RequestQueue). However, instead of the `RequestList` you can use any other class that implements the [`RequestLoader`](https://www.crawlee.dev/python/api/class/RequestLoader) interface to suit your specific requirements. ### Service locator[​](#service-locator "Direct link to Service locator") The [`ServiceLocator`](https://www.crawlee.dev/python/api/class/ServiceLocator) is primarily an internal mechanism for managing the services that Crawlee depends on. Specifically, the [`Configuration`](https://www.crawlee.dev/python/api/class/ServiceLocator), [`StorageClient`](https://www.crawlee.dev/python/api/class/ServiceLocator), and [`EventManager`](https://www.crawlee.dev/python/api/class/ServiceLocator). By swapping out these components, you can adapt Crawlee to suit different runtime environments. You can use the service locator explicitly: ``` import asyncio from crawlee import service_locator from crawlee.configuration import Configuration from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.events import LocalEventManager from crawlee.storage_clients import MemoryStorageClient async def main() -> None: service_locator.set_configuration(Configuration()) service_locator.set_storage_client(MemoryStorageClient()) service_locator.set_event_manager(LocalEventManager()) crawler = ParselCrawler() # ... 
if __name__ == '__main__': asyncio.run(main()) ``` Or pass the services directly to the crawler instance, and they will be set under the hood: ``` import asyncio from crawlee.configuration import Configuration from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.events import LocalEventManager from crawlee.storage_clients import MemoryStorageClient async def main() -> None: crawler = ParselCrawler( configuration=Configuration(), storage_client=MemoryStorageClient(), event_manager=LocalEventManager(), ) # ... if __name__ == '__main__': asyncio.run(main()) ``` ## Conclusion[​](#conclusion "Direct link to Conclusion") We are excited to share that Crawlee v0.5 is here. If you have any questions or feedback, please open a [GitHub discussion](https://github.com/apify/crawlee-python/discussions). If you encounter any bugs, or have an idea for a new feature, please open a [GitHub issue](https://github.com/apify/crawlee-python/issues). --- # Crawlee for Python v0.6 March 6, 2025 · 4 min read [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) [Vlada Dusek](https://github.com/vdusek) Developer of Crawlee for Python Crawlee for Python v0.6 is here, and it's packed with new features and important bug fixes. If you're upgrading from a previous version, please take a moment to review the breaking changes detailed below to ensure a smooth transition. ![Crawlee for Python v0.6.0](/assets/images/crawlee_v060-5cdf895baf62d5ab5beea47ce6502dec.webp) ## Getting started[​](#getting-started "Direct link to Getting started") You can upgrade to the latest version straight from [PyPI](https://www.pypi.org/project/crawlee/): ``` pip install --upgrade crawlee ``` Check out the full changelog on our [website](https://www.crawlee.dev/python/docs/changelog#060-2025-03-03) to see all the details. If you are updating from an older version, make sure to follow our [Upgrading to v0.6](https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v06) guide. ## Adaptive Playwright crawler[​](#adaptive-playwright-crawler "Direct link to Adaptive Playwright crawler") The new [`AdaptivePlaywrightCrawler`](https://www.crawlee.dev/python/api/class/AdaptivePlaywrightCrawler) is a hybrid solution that combines the best of two worlds: full browser rendering with [Playwright](https://www.playwright.dev/) and lightweight HTTP-based crawling (using, for example, [`BeautifulSoupCrawler`](https://www.crawlee.dev/python/api/class/BeautifulSoupCrawler) or [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler)). It automatically switches between the two methods based on real-time analysis of the target page, helping you achieve lower crawl costs and improved performance when crawling a variety of websites. The example below demonstrates how the `AdaptivePlaywrightCrawler` can handle both static and dynamic content. 
``` import asyncio from datetime import timedelta from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext async def main() -> None: crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser( max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'browser_type': 'chromium'}, ) @crawler.router.default_handler async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None: # Do some processing using `parsed_content` context.log.info(context.parsed_content.title) # Locate element h2 within 5 seconds h2 = await context.query_selector_one('h2', timedelta(seconds=5)) # Do stuff with element found by the selector context.log.info(h2) # Find more links and enqueue them. await context.enqueue_links() # Save some data. await context.push_data({'Visited url': context.request.url}) await crawler.run(['https://www.crawlee.dev/']) if __name__ == '__main__': asyncio.run(main()) ``` Check out our [Adaptive Playwright crawler guide](https://www.crawlee.dev/python/docs/guides/adaptive-playwright-crawler) for more details on how to use this new crawler. ## Browserforge fingerprints[​](#browserforge-fingerprints "Direct link to Browserforge fingerprints") To help you avoid detection and blocking, Crawlee now integrates the [browserforge](https://www.github.com/daijro/browserforge) library - intelligent browser header & fingerprint generator. This feature simulates real browser behavior by automatically randomizing HTTP headers and fingerprints, making your crawling sessions significantly more resilient against anti-bot measures. With [browserforge](https://www.github.com/daijro/browserforge) fingerprints enabled by default, your crawler sends realistic HTTP headers and user-agent strings. HTTP-based crawlers, which use [`HttpxHttpClient`](https://www.crawlee.dev/python/api/class/HttpxHttpClient) by default benefit from these adjustments, while the [`CurlImpersonateHttpClient`](https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient) employs its own stealthy techniques. The [`PlaywrightCrawler`](https://www.crawlee.dev/python/docs/guides/playwright-crawler) adjusts HTTP headers and browser fingerprints accordingly. Together, these improvements make your crawlers much harder to detect. Below is an example of using `PlaywrightCrawler`, which now benefits from the [browserforge](https://www.github.com/daijro/browserforge) library: ``` import asyncio from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext async def main() -> None: # The browserforge fingerprints and headers are used by default. crawler = PlaywrightCrawler() @crawler.router.default_handler async def handler(context: PlaywrightCrawlingContext) -> None: url = context.request.url context.log.info(f'Crawling URL: {url}') # Decode and log the response body, which contains the headers we sent. headers = (await context.response.body()).decode() context.log.info(f'Response headers: {headers}') # Extract and log the User-Agent and UA data used in the browser context. ua = await context.page.evaluate('() => window.navigator.userAgent') ua_data = await context.page.evaluate('() => window.navigator.userAgentData') context.log.info(f'Navigator user-agent: {ua}') context.log.info(f'Navigator user-agent data: {ua_data}') # The endpoint httpbin.org/headers returns the request headers in the response body. 
await crawler.run(['https://www.httpbin.org/headers']) if __name__ == '__main__': asyncio.run(main()) ``` For further details on utilizing [browserforge](https://www.github.com/daijro/browserforge) to avoid blocking, please refer to our [Avoid getting blocked guide](https://www.crawlee.dev/python/docs/guides/avoid-blocking). ## CLI dependencies[​](#cli-dependencies "Direct link to CLI dependencies") In v0.6, we've reduced the size of the core package by moving CLI (template creation) dependencies to optional extras. This change reduces the package footprint, keeping the base installation lightweight. To use Crawlee's CLI for creating new projects, simply install the package with the CLI extras. For example, to create a new project from a template using `pipx`, run: ``` pipx run 'crawlee[cli]' create my-crawler ``` Or with `uvx`: ``` uvx 'crawlee[cli]' create my-crawler ``` This change ensures that while the core package remains lean, you can still opt in to CLI functionality when bootstrapping new projects. ## Conclusion[​](#conclusion "Direct link to Conclusion") We are excited to share that Crawlee v0.6 is here. If you have any questions or feedback, please open a [GitHub discussion](https://www.github.com/apify/crawlee-python/discussions). If you encounter any bugs, or have an idea for a new feature, please open a [GitHub issue](https://www.github.com/apify/crawlee-python/issues). --- # Crawlee for Python v1 September 15, 2025 · 15 min read [![Vlada Dusek](https://avatars.githubusercontent.com/u/25082181?v=4)](https://github.com/vdusek) [Vlada Dusek](https://github.com/vdusek) Developer of Crawlee for Python We launched Crawlee for Python in beta mode in [July 2024](https://www.crawlee.dev/blog/launching-crawlee-python). Over the past year, we received many early adopters, tremendous interest in the library from the Python community, more than 6000 stars on GitHub, a dozen contributors, and many feature requests. After months of development, polishing, and community feedback, the library is leaving beta and entering a production/stable development status. **We are happy to announce Crawlee for Python v1.0.** From now on, Crawlee for Python will strictly follow [semantic versioning](https://www.semver.org/). You can now rely on it as a stable foundation for your crawling and scraping projects, knowing that breaking changes will only occur in major releases. ## What's new in Crawlee for Python v1[​](#whats-new-in-crawlee-for-python-v1 "Direct link to What's new in Crawlee for Python v1") * [New storage client system](#new-storage-client-system) * [Adaptive Playwright crawler](#adaptive-playwright-crawler) * [Impit HTTP client](#impit-http-client) * [Sitemap request loader](#sitemap-request-loader) * [Robots exclusion standard](#robots-exclusion-standard) * [Fingerprinting](#fingerprinting) * [Open telemetry](#open-telemetry) ![Crawlee for Python v1.0](/assets/images/crawlee_v100-d491a6c5406c55e0bfcdc9b39b81b7ae.webp) ## Getting started[​](#getting-started "Direct link to Getting started") You can upgrade to the latest version straight from [PyPI](https://www.pypi.org/project/crawlee/): ``` pip install --upgrade crawlee ``` Check out the full changelog on our [website](https://www.crawlee.dev/python/docs/changelog#100-2025-09-15) to see all the details. If you are updating from an older version, make sure to follow our [Upgrading to v1](https://www.crawlee.dev/python/docs/upgrading/upgrading-to-v1) guide. 
## New storage client system[​](#new-storage-client-system "Direct link to New storage client system") One of the biggest architectural changes in Crawlee v1 is the introduction of a new storage client system. Until now, datasets, key–value stores, and request queues were handled in slightly different ways depending on where they were stored. With v1, this has been unified under a single, consistent interface. This means that whether you're storing data in memory, on the local file system, in a database, on the Apify platform, or even using a custom backend, the API remains the same. The result is less duplication, better extensibility, and a cleaner developer experience. It also opens the door for the community to build and share their own storage client implementations. For example, here's how to set up a crawler with a file-system–backed storage client, which persists data locally: ``` from crawlee.configuration import Configuration from crawlee.crawlers import ParselCrawler from crawlee.storage_clients import FileSystemStorageClient # Create a new instance of storage client. storage_client = FileSystemStorageClient() # Create a configuration with custom settings. configuration = Configuration( storage_dir='./my_storage', purge_on_start=False, ) # And pass them to the crawler. crawler = ParselCrawler( storage_client=storage_client, configuration=configuration, ) ``` And here's an example of using a memory-only storage client, useful for testing or short-lived crawls: ``` from crawlee.crawlers import ParselCrawler from crawlee.storage_clients import MemoryStorageClient # Create a new instance of storage client. storage_client = MemoryStorageClient() # And pass it to the crawler. crawler = ParselCrawler(storage_client=storage_client) ``` With this new design, switching between storage backends is as simple as swapping out a client, without changing your crawling logic. To dive deeper into configuration, advanced usage (e.g. using different storage clients for specific storage instances), and even how to write your own storage client, see the [Storages](https://www.crawlee.dev/python/docs/guides/storages) and [Storage clients](https://www.crawlee.dev/python/docs/guides/storage-clients) guides. ### New experimental SQL storage client[​](#new-experimental-sql-storage-client "Direct link to New experimental SQL storage client") Crawlee v1 introduces an experimental [`SqlStorageClient`](https://www.crawlee.dev/python/api/class/SqlStorageClient) that enables persistent storage using SQL databases. Currently, SQLite and PostgreSQL are supported. This storage backend supports concurrent access from multiple crawler processes, enabling distributed crawling scenarios. The SQL storage client uses [SQLAlchemy 2+](https://www.sqlalchemy.org/) under the hood, providing automatic schema creation, connection pooling, and database-specific optimizations. It maintains the same interface as other storage clients, making it easy to switch between different storage backends without changing your crawling logic. The client uses a context manager to ensure proper connection handling: ``` import asyncio from crawlee.crawlers import ParselCrawler from crawlee.storage_clients import SqlStorageClient async def main() -> None: # Create SQL storage client (defaults to SQLite). async with SqlStorageClient() as storage_client: # Pass it to the crawler. crawler = ParselCrawler(storage_client=storage_client) # ... 
define your handlers and crawling logic await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` For PostgreSQL, simply provide a connection string: ``` import asyncio from crawlee.crawlers import ParselCrawler from crawlee.storage_clients import SqlStorageClient async def main() -> None: async with SqlStorageClient( connection_string='postgresql+asyncpg://user:pass@localhost/crawlee_db' ) as storage_client: crawler = ParselCrawler(storage_client=storage_client) # ... define your handlers and crawling logic await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` Since this is an experimental feature, the implementation may evolve in future releases as we gather feedback from the community. ## Adaptive Playwright crawler[​](#adaptive-playwright-crawler "Direct link to Adaptive Playwright crawler") Some websites can be scraped quickly with plain HTTP requests, while others require the full power of a browser to render dynamic content. Traditionally, you had to decide upfront whether to use one of the lightweight HTTP-based crawlers ([`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler) or [`BeautifulSoupCrawler`](https://www.crawlee.dev/python/api/class/BeautifulSoupCrawler)) or a browser-based [`PlaywrightCrawler`](https://www.crawlee.dev/python/api/class/PlaywrightCrawler). Crawlee v1 introduces the [`AdaptivePlaywrightCrawler`](https://www.crawlee.dev/python/api/class/AdaptivePlaywrightCrawler), which automatically chooses the right approach for each page. The adaptive crawler uses a detection mechanism: it compares the results of plain HTTP requests with those of a browser-rendered version of the same page. If both match, it can continue with the faster HTTP approach; if differences appear, it falls back to browser-based crawling. Over time, it builds confidence about which rendering type is needed for different pages, occasionally re-checking with the browser to ensure its predictions stay correct. This makes your crawls faster and cheaper, while still allowing you to reliably handle complex, dynamic websites. In practice, you get the best of both worlds: speed on simple pages and robustness on modern, JavaScript-heavy sites. For advanced options, such as customizing the detection strategy, see the [Adaptive Playwright crawler guide](https://www.crawlee.dev/python/docs/guides/adaptive-playwright-crawler). Here's a simplified example using the static [Parsel](https://www.github.com/scrapy/parsel) parser for HTTP responses, and falling back to [Playwright](https://www.playwright.dev/) only when needed: ``` import asyncio from datetime import timedelta from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext async def main() -> None: crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser() @crawler.router.default_handler async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None: # Locate element h2 within 5 seconds h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000)) # Do stuff with element found by the selector context.log.info(h2) await crawler.run(['https://crawlee.dev/']) if __name__ == '__main__': asyncio.run(main()) ``` In this example, pages that don't need JavaScript rendering will be processed through the fast HTTP client, while others will be automatically handled with Playwright. You don't need to write two different crawlers or guess in advance which method to use - Crawlee adapts dynamically. 
For more details and configuration options, see the [Adaptive Playwright crawler](https://www.crawlee.dev/python/docs/guides/adaptive-playwright-crawler) guide. ## Impit HTTP client[​](#impit-http-client "Direct link to Impit HTTP client") Crawlee v1 introduces a brand-new default HTTP client: [`ImpitHttpClient`](https://www.crawlee.dev/python/api/class/ImpitHttpClient), powered by the [Impit](https://www.github.com/apify/impit) library. Written in Rust and exposed to Python through bindings, it delivers better performance, async-first design, HTTP/3 support, and browser impersonation. It can impersonate real browsers out of the box, which makes your crawlers harder to detect and block by common anti-bot systems. This means fewer false positives, more resilient crawls, and less need for complicated workarounds. Impit is also developed as an open-source project by Apify, so you can dive into the internals or contribute improvements yourself. By default, Crawlee now uses [`ImpitHttpClient`](https://www.crawlee.dev/python/api/class/ImpitHttpClient) under the hood. But you can also create your own instance, configure it to your needs (e.g. enable HTTP/3 or choose a specific browser profile), and pass it into your crawler. Here's an example of explicitly using [`ImpitHttpClient`](https://www.crawlee.dev/python/api/class/ImpitHttpClient) with a [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler): ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.http_clients import ImpitHttpClient async def main() -> None: http_client = ImpitHttpClient( # Optional additional keyword arguments for `impit.AsyncClient`. http3=True, browser='firefox', verify=True, ) crawler = ParselCrawler( http_client=http_client, # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=10, ) # Define the default request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: ParselCrawlingContext) -> None: context.log.info(f'Processing {context.request.url} ...') # Enqueue all links from the page. await context.enqueue_links() # Extract data from the page. data = { 'url': context.request.url, 'title': context.selector.css('title::text').get(), } # Push the extracted data to the default dataset. await context.push_data(data) # Run the crawler with the initial list of URLs. await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` With the [`ImpitHttpClient`](https://www.crawlee.dev/python/api/class/ImpitHttpClient), you get stealth without extra dependencies or plugins. Check out the [HTTP clients](https://www.crawlee.dev/python/docs/guides/http-clients) guide for more details and advanced configuration options. ## Sitemap request loader[​](#sitemap-request-loader "Direct link to Sitemap request loader") Many websites expose their structure through sitemaps. These files provide a clear list of all available URLs, and are often the most efficient way to discover content on a site. In previous Crawlee versions, you had to fetch and parse these XML files manually before feeding them into your crawler. With Crawlee v1, that's no longer necessary. The new [`SitemapRequestLoader`](https://www.crawlee.dev/python/api/class/SitemapRequestLoader) lets you load URLs directly from a sitemap into your request queue, with options for filtering and batching. 
This makes it much easier to start large-scale crawls where sitemaps already provide full coverage of the site. Here's an example that loads a sitemap, filters out only documentation pages, and processes them with a [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler): ``` import asyncio import re from crawlee.crawlers import ParselCrawler, ParselCrawlingContext from crawlee.http_clients import ImpitHttpClient from crawlee.request_loaders import SitemapRequestLoader async def main() -> None: # Create an HTTP client for fetching the sitemap. http_client = ImpitHttpClient() # Create a sitemap request loader with filtering rules. sitemap_loader = SitemapRequestLoader( sitemap_urls=['https://crawlee.dev/sitemap.xml'], http_client=http_client, include=[re.compile(r'.*docs.*')], # Only include URLs containing 'docs'. max_buffer_size=500, # Keep up to 500 URLs in memory before processing. ) # Convert the sitemap loader into a request manager linked # to the default request queue. request_manager = await sitemap_loader.to_tandem() # Create a crawler and pass the request manager to it. crawler = ParselCrawler( request_manager=request_manager, max_requests_per_crawl=10, # Limit the max requests per crawl. ) @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: context.log.info(f'Processing {context.request.url}') # New links will be enqueued directly to the queue. await context.enqueue_links() # Extract data using Parsel's XPath and CSS selectors. data = { 'url': context.request.url, 'title': context.selector.xpath('//title/text()').get(), } # Push extracted data to the dataset. await context.push_data(data) await crawler.run() if __name__ == '__main__': asyncio.run(main()) ``` By connecting the [`SitemapRequestLoader`](https://www.crawlee.dev/python/api/class/SitemapRequestLoader) directly with a crawler, you can skip the boilerplate of parsing XML and just focus on extracting data. For more details, see the [Request loaders](https://www.crawlee.dev/python/docs/guides/request-loaders) guide. ## Robots exclusion standard[​](#robots-exclusion-standard "Direct link to Robots exclusion standard") Respecting [`robots.txt`](https://en.wikipedia.org/wiki/Robots.txt) is an important part of responsible web crawling. This simple file lets website owners declare which parts of their site should not be crawled by automated agents. Crawlee v1 makes it trivial to follow these rules: just set the `respect_robots_txt_file` option on your crawler, and Crawlee will automatically check the file before issuing requests. This not only helps you build ethical crawlers, but can also save time and bandwidth by skipping disallowed or irrelevant pages. For example, login pages, search results, or admin sections are often excluded in [`robots.txt`](https://www.en.wikipedia.org/wiki/Robots.txt), and Crawlee will handle that for you automatically. Here's a minimal example showing how a [`ParselCrawler`](https://www.crawlee.dev/python/api/class/ParselCrawler) obeys the robots exclusion standard: ``` import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext async def main() -> None: # Create a new crawler instance with robots.txt compliance enabled. crawler = ParselCrawler( respect_robots_txt_file=True, ) # Define the default request handler. @crawler.router.default_handler async def request_handler(context: ParselCrawlingContext) -> None: context.log.info(f'Processing {context.request.url}') # Extract the data from website. 
data = { 'url': context.request.url, 'title': context.selector.xpath('//title/text()').get(), } # Push extracted data to the dataset. await context.push_data(data) # Run the crawler with the list of start URLs. # The crawler will check the robots.txt file before making requests. # In this example, "https://news.ycombinator.com/login" will be skipped # because it's disallowed in the site's robots.txt file. await crawler.run( ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login'] ) if __name__ == '__main__': asyncio.run(main()) ``` With this option enabled, you don't need to manually check which URLs are allowed. Crawlee will handle it, letting you focus on the crawling logic and data extraction. For a more information, see the [Respect robots.txt file](https://www.crawlee.dev/python/docs/examples/respect-robots-txt-file) documentation page. ## Fingerprinting[​](#fingerprinting "Direct link to Fingerprinting") Modern websites often rely on browser fingerprinting to distinguish real users from automated traffic. Instead of just checking the [User-Agent](https://www.developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent) header, they combine dozens of subtle signals - supported fonts, canvas rendering, WebGL features, media devices, screen resolution, and more. Together, these form a unique [device fingerprint](https://www.en.wikipedia.org/wiki/Device_fingerprint) that can easily expose headless browsers or automation frameworks. Without fingerprinting, Playwright sessions tend to look identical and are more likely to be flagged by anti-bot systems. Crawlee v1 integrates with the [`FingerprintGenerator`](https://www.crawlee.dev/python/api/class/FingerprintGenerator) to automatically inject realistic, randomized fingerprints into every [`PlaywrightCrawler`](https://www.crawlee.dev/python/api/class/PlaywrightCrawler) session. This modifies HTTP headers, browser APIs, and other low-level signals so that each crawler run looks like a real browser on a real device. Using fingerprinting in Crawlee is straightforward: create a fingerprint generator with your desired options and pass it to the crawler. ``` import asyncio from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext from crawlee.fingerprint_suite import ( DefaultFingerprintGenerator, HeaderGeneratorOptions, ScreenOptions, ) async def main() -> None: # Use default fingerprint generator with desired fingerprint options. # Generator will generate real looking browser fingerprint based on the options. # Unspecified fingerprint options will be automatically selected by the generator. fingerprint_generator = DefaultFingerprintGenerator( header_options=HeaderGeneratorOptions(browsers=['chrome']), screen_options=ScreenOptions(min_width=400), ) crawler = PlaywrightCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=10, # Headless mode, set to False to see the browser in action. headless=False, # Browser types supported by Playwright. browser_type='chromium', # Fingerprint generator to be used. By default no fingerprint generation is done. fingerprint_generator=fingerprint_generator, ) # Define the default request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: PlaywrightCrawlingContext) -> None: context.log.info(f'Processing {context.request.url} ...') # Find a link to the next page and enqueue it if it exists. 
await context.enqueue_links(selector='.morelink') # Run the crawler with the initial list of URLs. await crawler.run(['https://news.ycombinator.com/']) if __name__ == '__main__': asyncio.run(main()) ``` In this example, each Playwright instance starts with a unique, realistic fingerprint. From the website’s perspective, the crawler behaves like a real browser session, reducing the chance of detection or blocking. For more details and examples, see the [Avoid getting blocked](https://www.crawlee.dev/python/docs/guides/avoid-blocking) guide and the [Playwright crawler with fingerprint generator](https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-fingeprint-generator) documentation page. ## Open telemetry[​](#open-telemetry "Direct link to Open telemetry") Running crawlers in production means you often want more than just logs - you need visibility into what the crawler is doing, how it's performing, and where bottlenecks occur. Crawlee v1 adds basic [OpenTelemetry](https://www.opentelemetry.io/) instrumentation via [`CrawlerInstrumentor`](https://www.crawlee.dev/python/api/class/CrawlerInstrumentor), giving you a standardized way to collect traces and metrics from your crawlers. With [OpenTelemetry](https://www.opentelemetry.io/) enabled, Crawlee automatically records information such as: * Requests and responses (including timings, retries, and errors). * Resource usage events (memory, concurrency, system snapshots). * Lifecycle events from crawlers, routers, and handlers. These signals can be exported to any OpenTelemetry-compatible backend (e.g. [Jaeger](https://www.jaegertracing.io/), [Prometheus](https://www.prometheus.io/), or [Grafana](https://www.grafana.com/)), where you can monitor real-time dashboards or analyze traces to understand crawler performance. 
Here's a minimal example: ``` import asyncio from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter from opentelemetry.sdk.resources import Resource from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import SimpleSpanProcessor from opentelemetry.trace import set_tracer_provider from crawlee.crawlers import BasicCrawlingContext, ParselCrawler, ParselCrawlingContext from crawlee.otel import CrawlerInstrumentor from crawlee.storages import Dataset, KeyValueStore, RequestQueue def instrument_crawler() -> None: resource = Resource.create( { 'service.name': 'ExampleCrawler', 'service.version': '1.0.0', 'environment': 'development', } ) # Set up the OpenTelemetry tracer provider and exporter provider = TracerProvider(resource=resource) otlp_exporter = OTLPSpanExporter(endpoint='localhost:4317', insecure=True) provider.add_span_processor(SimpleSpanProcessor(otlp_exporter)) set_tracer_provider(provider) # Instrument the crawler with OpenTelemetry CrawlerInstrumentor( instrument_classes=[RequestQueue, KeyValueStore, Dataset] ).instrument() async def main() -> None: instrument_crawler() crawler = ParselCrawler(max_requests_per_crawl=100) kvs = await KeyValueStore.open() @crawler.pre_navigation_hook async def pre_nav_hook(_: BasicCrawlingContext) -> None: # Simulate some pre-navigation processing await asyncio.sleep(0.01) @crawler.router.default_handler async def handler(context: ParselCrawlingContext) -> None: await context.push_data({'url': context.request.url}) await kvs.set_value(key='url', value=context.request.url) await context.enqueue_links() await crawler.run(['https://crawlee.dev/']) if __name__ == '__main__': asyncio.run(main()) ``` Once configured, your traces and metrics can be exported using standard OpenTelemetry exporters (e.g. OTLP, console, or custom backends). This makes it much easier to integrate Crawlee into existing monitoring pipelines. For more details on available options and examples of exporting traces, see the [Trace and monitor crawlers](https://www.crawlee.dev/python/docs/guides/trace-and-monitor-crawlers) guide. ## A message from the Crawlee team[​](#a-message-from-the-crawlee-team "Direct link to A message from the Crawlee team") Last but not least, we want to thank our open-source community members who tried Crawlee for Python in its beta version and helped us improve it for the scraping and automation community. We would appreciate it if you could check out the latest version and [give us a star on GitHub](https://www.github.com/apify/crawlee-python/) if you like the new features. If you have any questions or feedback, please open a [GitHub discussion](https://www.github.com/apify/crawlee-python/discussions) or [join our Discord community](https://www.apify.com/discord/) to get support or talk to fellow Crawlee users. If you encounter any bugs or have an idea for a new feature, please open a [GitHub issue](https://www.github.com/apify/crawlee-python/issues). --- # How to build a price tracker with Crawlee and Apify April 8, 2025 · 11 min read [![Percival Villalva](https://avatars.githubusercontent.com/u/70678259?v=4)](https://github.com/PerVillalva) [Percival Villalva](https://github.com/PerVillalva) Community Member of Crawlee Build a price tracker with Crawlee for Python to scrape product details, export data in multiple formats, and send email alerts for price drops, then deploy and schedule it as an Apify Actor. 
![Crawlee for Python Price Tracker](/assets/images/crawlee-python-price-tracker-8ffc0121eee82024852513938dd525ab.webp) In this tutorial, we’ll build a price tracker using Crawlee for Python and Apify. By the end, you’ll have an Apify Actor that scrapes product details from a webpage, exports the data in various formats (CSV, Excel, JSON, and more), and sends an email alert when the product’s price falls below your specified threshold. ## 1. Project Setup[​](#1-project-setup "Direct link to 1. Project Setup") Our first step is to install the [Apify CLI](https://docs.apify.com/cli/docs). You can do this using either Homebrew or NPM with the following commands: ### Homebrew[​](#homebrew "Direct link to Homebrew") ``` brew install apify-cli ``` ### Via NPM[​](#via-npm "Direct link to Via NPM") ``` npm -g install apify-cli ``` Next, let’s run the following commands to use one of Apify’s pre-built templates. This will streamline the setup process and get us coding right away: ``` apify create price-tracking-actor ``` A dropdown list will appear. To follow along with this tutorial, select **`Python`** and the **`Crawlee + BeautifulSoup`** template. Once the template is installed, navigate to the newly created folder and open it in your preferred IDE. ![actor-templates](/assets/images/actor-templates-88fa253dabe612261cb2fe95430c4c04.webp) Navigate to **`src/main.py`** in your project, and you’ll find that a significant amount of boilerplate code has already been generated for you. If you’re new to Apify or Crawlee, don’t worry - it’s not as complex as it might seem. This pre-written code is designed to save you time and jumpstart your development process. ![crawlee-bs4-template](/assets/images/crawlee-bs4-template-528a9eee4ab1c859feb2ed42e3328045.webp) In fact, this template comes with fully functional code that scrapes the Apify homepage. To test it out, simply run the command **`apify run`**. Within a few seconds, you’ll see the **`storage/datasets`** directory populate with the scraped data in JSON format. ![json-data](/assets/images/json-data-9ec19a8958775e66dcd094d0d46faa90.webp) ## 2. Customizing the template[​](#2-customizing-the-template "Direct link to 2. Customizing the template") Now that our project is set up, let’s customize the template to scrape our target website: [Raspberry Pi 5 (8GB RAM) on Central Computer](https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html). First, in the `src/main.py` file, find the `crawler.run(start_urls)` call and replace `start_urls` with the URL of the target website, as shown below: ``` await crawler.run(['https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html']) ``` Normally, you could let users specify a URL through the Actor input, and the Actor would prioritize it. However, since we’re scraping a specific page, we’ll just use the hardcoded URL for simplicity. Keep in mind that dynamic input is still an option if you want to make the Actor more flexible later. ### Extracting the Product’s Name and Price[​](#extracting-the-products-name-and-price "Direct link to Extracting the Product’s Name and Price") Finally, let’s modify our template to extract key elements from the page, such as the product name and price. Starting with the **product name**, inspect the [target page](https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html) using DevTools to find suitable selectors for targeting the element. 
![product-name](/assets/images/product-name-dbaba09d2d06b4b8a6b9a340698739af.webp) Next, create a `product_name_element` variable to hold the element selected with the CSS selectors found on the page and update the `data` dictionary with the element’s text contents. Also, remove the line of code that previously made the Actor crawl the Apify website, as we now want it to scrape only a single page. Your `request_handler` function should look similar to the example below: ``` @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Select the product name and price elements. product_name_element = context.soup.find('div', class_='productname') # Extract the desired data. data = { 'url': context.request.url, 'product_name': product_name_element.text.strip() if product_name_element else None, } # Store the extracted data to the default dataset. await context.push_data(data) # Enqueue additional links found on the current page. # await context.enqueue_links() -> REMOVE THIS LINE ``` It’s a good practice to test our code after every significant change to ensure it works as expected. Run `apify run` again, but this time, add the `--purge` flag to prevent the newly scraped data from mixing with previous runs: ``` apify run --purge ``` Navigate to `storage/datasets`, and you should find a file with the scraped content: ``` { "url": "https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html", "product_name": "Raspberry Pi 5 8GB RAM Board" } ``` Now that you’ve got the hang of it, let’s do the same thing for the price: `79.99`. ![product\_price.png](/assets/images/product-price-fa3ab906b4a95258251defe78c19b6d3.webp) In the code below, you’ll notice a slight difference: instead of extracting the element’s text content, we’re retrieving the value of its `data-price-amount` attribute. This approach avoids capturing the dollar sign `($)` that would otherwise come with the text. If you prefer working with text content instead, that’s perfectly fine; you can simply use `.replace('$', '')` to remove the dollar sign. Also, keep in mind that the extracted price will be a `string` by default. To perform numerical comparisons, we need to convert it to a `float`. This conversion will allow us to accurately compare the price values later on. Here’s how the updated code looks so far: ``` # main.py # ...previous code @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Select the product name and price elements. product_name_element = context.soup.find('div', class_='productname') product_price_element = context.soup.find('span', id='product-price-395001') # Extract the desired data. data = { 'url': context.request.url, 'product_name': product_name_element.text.strip() if product_name_element else None, 'price': float(product_price_element['data-price-amount']) if product_price_element else None, } # Store the extracted data to the default dataset. await context.push_data(data) ``` Again, try running it with `apify run --purge` and check if you get output similar to the example below: ``` { "url": "https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html", "product_name": "Raspberry Pi 5 8GB RAM Board", "price": 79.99 } ``` That’s it for the extraction part! Below is the complete code we’ve written so far. 
> 💡 **TIP:** If you’d like to get some more practice, try scraping additional elements such as the **`model`**, **`Item #`**, or **`stock availability (In stock)`**. ``` # main.py from apify import Actor from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main() -> None: # Enter the context of the Actor. async with Actor: # Create a crawler. crawler = BeautifulSoupCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=50, ) # Define a request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Select the product name and price elements. product_name_element = context.soup.find('div', class_='productname') product_price_element = context.soup.find('span', id='product-price-395001') # Extract the desired data. data = { 'url': context.request.url, 'product_name': product_name_element.text.strip() if product_name_element else None, 'price': float(product_price_element['data-price-amount']) if product_price_element else None, } # Store the extracted data to the default dataset. await context.push_data(data) # Run the crawler with the starting requests. await crawler.run(['https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html']) ``` ## 3. Sending an Email Alert[​](#3-sending-an-email-alert "Direct link to 3. Sending an Email Alert") From this point forward, you’ll need an **Apify account**. You can create one for free [here](https://console.apify.com/sign-up). We need an Apify account because we’ll be making an API call to a pre-existing Actor from the **Apify Store**, the “Send Email Actor”, to handle notifications. Apify’s email system takes care of sending alerts, so we don’t have to worry about handling **2FA** in our automation. ``` # main.py # ...previous code # Define a price threshold price_threshold = 80 # Call the "Send Email" Actor when the price goes below the threshold if data['price'] < price_threshold: actor_run = await Actor.start( actor_id="apify/send-mail", run_input={ "to": "your_email@email.com", "subject": "Python Price Alert", "text": f"The price of '{data['product_name']}' has dropped below ${price_threshold} and is now ${data['price']}.\n\nCheck it out here: {data['url']}", }, ) Actor.log.info(f"Email sent with run ID: {actor_run.id}") ``` In the code above, we’re using the **Apify Python SDK**, which is already included in our project, to call the “Send Email” Actor with the required input. To make this API call work, you’ll need to log in to your Apify account from the terminal using your **`APIFY_API_TOKEN`**. To get your **`APIFY_API_TOKEN`**, sign up for an Apify account, then navigate to **Settings → API & Integrations**, and copy your **Personal API token**. ![apify-api-token](/assets/images/apify-api-token-eb76078df32c242a7f064ab71e63c7fa.webp) Next, enter the following command in the terminal inside your **Price Tracking Project**: ``` apify login ``` Select `Enter API Token Manually`, paste the token you copied from your account, and hit Enter. 
![apify-login](...) 
You’ll see a confirmation that you’re now logged into your Apify account. When you run the code, the API token will be automatically inferred from your account, allowing you to use the **Send Email Actor**. If you encounter any issues, double-check that your code matches the one below: ``` from apify import Actor from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main() -> None: # Enter the context of the Actor. async with Actor: # Create a crawler. crawler = BeautifulSoupCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=50, ) # Define a request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Select the product name and price elements. product_name_element = context.soup.find('div', class_='productname') product_price_element = context.soup.find('span', id='product-price-395001') # Extract the desired data. data = { 'url': context.request.url, 'product_name': product_name_element.text.strip() if product_name_element else None, 'price': float(product_price_element['data-price-amount']) if product_price_element else None, } price_threshold = 80 if data['price'] < price_threshold: actor_run = await Actor.start( actor_id="apify/send-mail", run_input={ "to": "your_email@gmail.com", "subject": "Python Price Alert", "text": f"The price of '{data['product_name']}' has dropped below ${price_threshold} and is now ${data['price']}.\n\nCheck it out here: {data['url']}", }, ) Actor.log.info(f"Email sent with run ID: {actor_run.id}") # Store the extracted data to the default dataset. await context.push_data(data) # Run the crawler with the starting requests. 
await crawler.run(['https://www.centralcomputer.com/raspberry-pi-5-8gb-ram-board.html']) ``` > 🔖 Replace the placeholder email address with your actual email, the one where you want to receive notifications. Make sure it matches the email you used to register your **Apify account**. Then, run the code using: ``` apify run --purge ``` If everything works correctly, you should receive an email like the one below in your inbox. ![price-alert](/assets/images/price-alet-530cccd85b681fd98e32a81e4f52e488.webp) ## 4. Deploying your Actor[​](#4-deploying-your-actor "Direct link to 4. Deploying your Actor") It’s time to deploy your Actor to the cloud, allowing it to take full advantage of the Apify Platform’s features. Fortunately, this process is incredibly simple. Since you’re already logged into your account, just run the following command: ``` apify push ``` In just a few seconds, you’ll find your newly created Actor in your Apify account by navigating to **Actors → Development → Price Tracking Actor**. ![price-tracking-actor](/assets/images/price-tracking-actor-c91e4f5243ea20363d2621424d89985f.webp) Note that the **Start URLs** input has been reset to **apify.com**, so be sure to replace it with our target website. Once updated, click the green ***Save & Start*** button at the bottom of the page to run your Actor. After the run completes, you’ll see a **preview of the results** in the ***Output*** tab. You can also export your data in multiple formats from the ***Storage*** tab. ![actor-run](/assets/images/actor-run-faa6f7deb56846b88c7d446e9eb05e1d.webp) **Export dataset:** ![actor-export-dataset](/assets/images/export-dataset-9d56cd86006ff21fbbd695a72cd5529c.webp) ## 5. Schedule your runs[​](#5-schedule-your-runs "Direct link to 5. Schedule your runs") Now, a **price monitoring script** wouldn’t be very effective unless it ran on a schedule, automatically checking the product’s price and notifying us when it drops below the threshold. Since our Actor is already deployed on **Apify**, scheduling it to run, say, every hour, is incredibly simple. On your Actor page, click the three dots in the top-right corner of the screen and select **“Schedule Actor.”** ![schedule-run](/assets/images/schedule-run-3c2c1975cb23d5f4bdbe8116172a2a47.webp) Next, choose how often you want your Actor to run, and that’s it! Your script will now run in the cloud, continuously monitoring the product’s price and sending you an email notification whenever it goes on sale. ![actor-schedule](/assets/images/actor-schedule-2fe3df75d91fa3270776f814ed6888dc.webp) ## That’s a wrap\![​](#thats-a-wrap "Direct link to That’s a wrap!") Congratulations on completing this tutorial! I hope you enjoyed getting your feet wet with Crawlee and feel confident enough to tweak the code to build your own price tracker. We’ve only scratched the surface of what Apify and Crawlee can do. As a next step, join our [Discord community](https://discord.com/invite/jyEM2PRvMU) to connect with other web scraping developers and stay up to date with the latest news about Crawlee and Apify! 
---

# Reverse engineering GraphQL persistedQuery extension

November 15, 2024 · 5 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

[![Matěj Volf](https://avatars.githubusercontent.com/u/31281386?v=4)](https://github.com/mvolfik) [Matěj Volf](https://github.com/mvolfik) Web Automation Engineer

GraphQL is a query language for getting deeply nested structured data from a website's backend, similar to MongoDB queries. The request is usually a POST to some general `/graphql` endpoint with a body like this:

![GraphQL Query](/assets/images/graphql-a3962ed441b2a078e43c8158ad64336a.webp)

When scraping data from websites using GraphQL, it’s common to inspect the network requests in developer tools to find the exact queries being used. However, on some websites, you might notice that the GraphQL query itself isn’t visible in the request. Instead, you only see a cryptic hash value. This can be confusing and makes it harder to understand how data is being requested from the server.

This is because some websites use a feature called ["persisted queries"](https://www.apollographql.com/docs/apollo-server/performance/apq/). It's a performance optimization that reduces the amount of data sent with each request by replacing the full query text with a precomputed hash. While this improves website speed and efficiency, it introduces challenges for scraping because the query text isn’t readily available.

![Persisted Query Reverse Engineering](/assets/images/graphql-persisted-query-6e36e61d76503e617fe4e7651bdf53a3.webp)

TLDR: the client computes the sha256 hash of the `query` text and only sends that hash. In addition, you can possibly fit all of this into the query string of a GET request, making it easily cacheable. Below is an example request from Zillow:

![Request from Zillow](/assets/images/zillow-ebd03223cb4ed6af11e972135e854851.webp)

As you can see, it’s just some metadata about the persistedQuery extension, the hash of the query, and variables to be embedded in the query. Here’s another request from expedia.com, sent as a POST, but with the same extension:

![Expedia Query](/assets/images/expedia-2e5f3670fa2a7fe4b27c9e5f93e5ec5a.webp)

This primarily optimizes website performance, but it creates several challenges for web scraping:

* GET requests are usually more prone to being blocked.
* Hidden query parameters: we don’t know the full query, so if the website responds with a “Persisted query not found” error (asking us to send the query in full, not just the hash), we can’t send it.
* Once the website changes even a little bit and the clients start asking for a new query - even though the old one might still work, the server will very soon forget its ID/hash, and your request with this hash will never work again, since you can’t “remind” the server of the full query text.

For various reasons, you might need to extract the entire GraphQL query text, but this can be tricky. While you could inspect the website’s JavaScript to find the query text, it’s often dynamically constructed from multiple fragments, making it hard to piece together. Instead, we’ll take a more direct approach: tricking the client application (e.g., the browser) into revealing the full query. When the client uses a hash that the server doesn't recognize, the server typically responds with an error message like `PersistedQueryNotFound`.
This prompts the client to resend the full query in a subsequent request. By intercepting and modifying the original request to include an invalid hash, we can trigger this behavior and capture the complete query text. This method avoids digging through JavaScript and relies on the natural error-handling flow of the client-server interaction.

For exactly this use case, a perfect tool exists: [mitmproxy](https://mitmproxy.org/), an open-source Python library that intercepts requests made by your own devices, websites, or apps and allows you to modify them with simple Python scripts.

Download `mitmproxy`, and prepare a Python script like this:

```
import json


def request(flow):
    try:
        dat = json.loads(flow.request.text)
        dat[0]["extensions"]["persistedQuery"]["sha256Hash"] = "0d9e"  # any bogus hex string here
        flow.request.text = json.dumps(dat)
    except:
        pass
```

This defines a hook that `mitmproxy` will run on every request: it tries to load the request's JSON body, modifies the hash to an arbitrary value, and writes the updated JSON as a new body of the request.

We also need to make sure we reroute our browser requests to `mitmproxy`. For this purpose we are going to use a browser extension called [FoxyProxy](https://chromewebstore.google.com/detail/foxyproxy/gcknhkkoolaabfmlnjonogaaifnjlfnp?hl=en). It is available in both Firefox and Chrome. Just add a route with these settings:

![mitmproxy settings](/assets/images/mitmprpxy-1e6b253c473a57f3451077aae16640b6.webp)

Now we can run `mitmproxy` with this script: `mitmweb -s script.py`

This will open a browser tab where you can watch all the intercepted requests in real time.

![Browser tab](/assets/images/browser-408715fa1be9f079c6672f7f3ae59644.webp)

If you open that particular request and look at the query in the request section, you will see that a garbage value has replaced the hash.

![Replaced hash](/assets/images/request-6f8330f873c988f6dd07d358130627bd.webp)

Now, if you visit Zillow, open that particular request again, and go to the response section, you will see that the client receives the PersistedQueryNotFound error.

![Persisted query error](/assets/images/error-2b5eed861143a45328231c6629406454.webp)

The front end of Zillow reacts by sending the whole query as a POST request.

![POST request](/assets/images/query-b793b6bbe82994b3d38a565204f82e11.webp)

We extract the query and hash directly from this POST request. To ensure that the Zillow server does not forget about this hash, we periodically run this POST request with the exact same query and hash. This ensures that the scraper continues to work even when the server's cache is cleaned or reset or the website changes.

## Conclusion[​](#conclusion "Direct link to Conclusion")

Persisted queries are a powerful optimization tool for GraphQL APIs, enhancing website performance by minimizing payload sizes and enabling GET request caching. However, they also pose significant challenges for web scraping, primarily due to the reliance on server-stored hashes and the potential for those hashes to become invalid.

Using `mitmproxy` to intercept and manipulate GraphQL requests offers an efficient way to reveal the full query text without delving into complex client-side JavaScript. By forcing the server to respond with a `PersistedQueryNotFound` error, we can capture the full query payload and utilize it for scraping purposes. Periodically running the extracted query ensures the scraper remains functional, even when server-side cache resets occur or the website evolves.
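As a hedged illustration of that keep-alive idea, and of the sha256 mechanics described above, here is a minimal Python sketch. The endpoint, operation name, query text, and variables are hypothetical placeholders; in practice you would substitute the values captured from the intercepted POST request, and the snippet assumes the `requests` package is installed:

```
import hashlib

import requests  # assumes the `requests` package is available

GRAPHQL_ENDPOINT = 'https://example.com/graphql'  # hypothetical endpoint
OPERATION_NAME = 'ExampleOperation'  # hypothetical, taken from the captured request
FULL_QUERY = 'query ExampleOperation($id: ID!) { item(id: $id) { name } }'  # captured query text
VARIABLES = {'id': '123'}  # captured variables

# The persistedQuery extension identifies a query by the sha256 hash of its text.
query_hash = hashlib.sha256(FULL_QUERY.encode('utf-8')).hexdigest()

payload = {
    'operationName': OPERATION_NAME,
    'query': FULL_QUERY,  # sending the full text "reminds" the server of this hash
    'variables': VARIABLES,
    'extensions': {'persistedQuery': {'version': 1, 'sha256Hash': query_hash}},
}

# Re-running this request periodically keeps the hash warm in the server's cache,
# so later hash-only requests keep resolving.
response = requests.post(GRAPHQL_ENDPOINT, json=payload)
print(response.status_code)
```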
--- # How to scrape Amazon products March 27, 2024 · 11 min read [![Lukáš Průša](/assets/images/lukasp-e0c7202aabdcc50c75cf45603be990a0.webp)](https://github.com/Patai5) [Lukáš Průša](https://github.com/Patai5) Junior Web Automation Engineer ## Introduction[​](#introduction "Direct link to Introduction") Amazon is one of the largest and most complex websites, which means scraping it is pretty challenging. Thankfully, the Crawlee library makes things a little easier, with utilities like JSON file outputs, automatic scaling, and request queue management. In this guide, we'll be extracting information from Amazon product pages using the power of [TypeScript](https://www.typescriptlang.org) in combination with the [Cheerio](https://cheerio.js.org) and [Crawlee](https://crawlee.dev) libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process. ![How to scrape Amazon using Typescript, Cheerio, and Crawlee](/assets/images/how-to-scrape-amazon-b6c5753f8b985c94a3d4cc372048f79d.webp) ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") You'll find the journey smoother if you have a decent grasp of the TypeScript language and a fundamental understanding of [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML) structure. A familiarity with Cheerio and Crawlee is advised but optional. This guide is built to introduce these tools and their use cases in an approachable manner. Crawlee is open-source with nearly 12,000 stars on GitHub. You can check out the [source code here](https://github.com/apify/crawlee). Feel free to play with Crawlee with the inbuilt templates that they offer. ## Writing the scraper[​](#writing-the-scraper "Direct link to Writing the scraper") To begin with, let's identify the product fields that we're interested in scraping: * Product Title * Price * List Price * Review Rating * Review Count * Image URLs * Product Overview Attributes ![Image highlighting the product fields to be scraped on Amazon](/assets/images/fields-to-scrape-e30b9a71e42a7b6baed85d7936fbb165.webp) For now, our focus will be solely on the scraping part. In a later section, we'll shift our attention to Crawlee, our crawling tool. Let's begin! ### Scraping the individual data points[​](#scraping-the-individual-data-points "Direct link to Scraping the individual data points") Our first step will be to utilize [browser DevTools](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Tools_and_setup/What_are_browser_developer_tools) to inspect the layout and discover the [CSS selectors](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors) for the data points we aim to scrape. (by default on [Chrome](https://developer.chrome.com/docs/devtools), press `Ctrl + Shift + C`) For example, let's take a look at how we find the selector for the product title: ![Amazon product title selector in DevTools](/assets/images/dev-tools-example-b243683d8baf93e34bbce102986d37b5.webp) The product title selector we've deduced is `span#productTitle`. This selector targets all `span` elements with the id of `productTitle`. Luckily, there's only one such element on the page - exactly what we're after. We can find the selectors for the remaining data points using the same principle combined with a sprinkle of trial and error. 
Next, let's write a function that takes a [Cheerio object](https://cheerio.js.org/docs/api/interfaces/CheerioAPI) of the product page as input and outputs our extracted data in a structured format. Initially, we'll focus on scraping simple data points. We'll leave the more complex ones, like image URLs and the product attributes overview, for later.

```
import { CheerioAPI } from 'cheerio';

type ProductDetails = {
    title: string;
    price: string;
    listPrice: string;
    reviewRating: string;
    reviewCount: string;
};

/**
 * CSS selectors for the product details. Feel free to figure out different variations of these selectors.
 */
const SELECTORS = {
    TITLE: 'span#productTitle',
    PRICE: 'span.priceToPay',
    LIST_PRICE: 'span.basisPrice .a-offscreen',
    REVIEW_RATING: '#acrPopover a > span',
    REVIEW_COUNT: '#acrCustomerReviewText',
} as const;

/**
 * Scrapes the product details from the given Cheerio object.
 */
export const extractProductDetails = ($: CheerioAPI): ProductDetails => {
    const title = $(SELECTORS.TITLE).text().trim();
    const price = $(SELECTORS.PRICE).first().text();
    const listPrice = $(SELECTORS.LIST_PRICE).first().text();
    const reviewRating = $(SELECTORS.REVIEW_RATING).first().text();
    const reviewCount = $(SELECTORS.REVIEW_COUNT).first().text();

    return { title, price, listPrice, reviewRating, reviewCount };
};
```

## Improving the scraper[​](#improving-the-scraper "Direct link to Improving the scraper")

At this point, our scraper extracts all fields as strings, which isn't ideal for numerical fields like prices and review counts - we'd rather have those as numbers. Simple casting from string to numbers will only work for some fields. In some cases, such as processing the price fields, we must clean the string and remove unnecessary characters before conversion.

To address this, we'll write a utility function that parses a number from a string. We'll also have another function to find the first element matching our selector and return it parsed as a number.

```
import { CheerioAPI } from 'cheerio';

/**
 * Parses a number from a string by removing all non-numeric characters.
 * - Keeps the decimal point.
 */
const parseNumberValue = (rawString: string): number => {
    return Number(rawString.replace(/[^\d.]+/g, ''));
};

/**
 * Parses a number value from the first element matching the given selector.
 */
export const parseNumberFromSelector = ($: CheerioAPI, selector: string): number => {
    const rawValue = $(selector).first().text();
    return parseNumberValue(rawValue);
};
```

With `parseNumberFromSelector` from the utilities above, we can now update and simplify the main scraping function, `extractProductDetails`.

```
import { CheerioAPI } from 'cheerio';
import { parseNumberFromSelector } from './utils.js';

type ProductDetails = {
    title: string;
    price: number;        //
    listPrice: number;    // updated to numbers
    reviewRating: number; //
    reviewCount: number;  //
};

...

/**
 * Scrapes the product details from the given Cheerio object.
 */
export const extractProductDetails = ($: CheerioAPI): ProductDetails => {
    const title = $(SELECTORS.TITLE).text().trim();

    const price = parseNumberFromSelector($, SELECTORS.PRICE);
    const listPrice = parseNumberFromSelector($, SELECTORS.LIST_PRICE);
    const reviewRating = parseNumberFromSelector($, SELECTORS.REVIEW_RATING);
    const reviewCount = parseNumberFromSelector($, SELECTORS.REVIEW_COUNT);

    return { title, price, listPrice, reviewRating, reviewCount };
};
```

### Scraping the advanced data points[​](#scraping-the-advanced-data-points "Direct link to Scraping the advanced data points")

As we progress in our scraping journey, it's time to focus on the more complex data fields, like image URLs and the product attributes overview. To extract data from these fields, we must utilize the `map` function to iterate over all matching elements and fetch data from each. Let's start with image URLs.

```
const SELECTORS = {
    ...
    IMAGES: '#altImages .item img',
} as const;

/**
 * Extracts the product image URLs from the given Cheerio object.
 * - We have to iterate over the image elements and extract the `src` attribute.
 */
const extractImageUrls = ($: CheerioAPI): string[] => {
    const imageUrls = $(SELECTORS.IMAGES)
        .map((_, imageEl) => $(imageEl).attr('src'))
        .get(); // `get()` - Retrieve all elements matched by the Cheerio object, as an array. Removes `undefined` values.

    return imageUrls;
};
```

Extracting images is relatively simple yet still deserves a separate function for clarity. We'll now parse the product attributes overview.

```
type ProductAttribute = {
    label: string;
    value: string;
};

const SELECTORS = {
    ...
    PRODUCT_ATTRIBUTE_ROWS: '#productOverview_feature_div tr',
    ATTRIBUTES_LABEL: 'td:nth-of-type(1) span',
    ATTRIBUTES_VALUE: 'td:nth-of-type(2) span',
} as const;

/**
 * Extracts the product attributes from the given Cheerio object.
 * - We have to iterate over the attribute rows and extract both label and value for each row.
 */
const extractProductAttributes = ($: CheerioAPI): ProductAttribute[] => {
    const attributeRowEls = $(SELECTORS.PRODUCT_ATTRIBUTE_ROWS).get();

    const attributeRows = attributeRowEls.map((rowEl) => {
        const label = $(rowEl).find(SELECTORS.ATTRIBUTES_LABEL).text();
        const value = $(rowEl).find(SELECTORS.ATTRIBUTES_VALUE).text();

        return { label, value };
    });

    return attributeRows;
};
```

We've now effectively crafted our scraping functions. Here's the complete `scraper.ts` file:

```
import { CheerioAPI } from 'cheerio';
import { parseNumberFromSelector } from './utils.js';

type ProductAttribute = {
    label: string;
    value: string;
};

type ProductDetails = {
    title: string;
    price: number;
    listPrice: number;
    reviewRating: number;
    reviewCount: number;
    imageUrls: string[];
    attributes: ProductAttribute[];
};

/**
 * CSS selectors for the product details. Feel free to figure out different variations of these selectors.
 */
const SELECTORS = {
    TITLE: 'span#productTitle',
    PRICE: 'span.priceToPay',
    LIST_PRICE: 'span.basisPrice .a-offscreen',
    REVIEW_RATING: '#acrPopover a > span',
    REVIEW_COUNT: '#acrCustomerReviewText',
    IMAGES: '#altImages .item img',
    PRODUCT_ATTRIBUTE_ROWS: '#productOverview_feature_div tr',
    ATTRIBUTES_LABEL: 'td:nth-of-type(1) span',
    ATTRIBUTES_VALUE: 'td:nth-of-type(2) span',
} as const;

/**
 * Extracts the product image URLs from the given Cheerio object.
 * - We have to iterate over the image elements and extract the `src` attribute.
 */
const extractImageUrls = ($: CheerioAPI): string[] => {
    const imageUrls = $(SELECTORS.IMAGES)
        .map((_, imageEl) => $(imageEl).attr('src'))
        .get(); // `get()` - Retrieve all elements matched by the Cheerio object, as an array. Removes `undefined` values.

    return imageUrls;
};

/**
 * Extracts the product attributes from the given Cheerio object.
 * - We have to iterate over the attribute rows and extract both label and value for each row.
 */
const extractProductAttributes = ($: CheerioAPI): ProductAttribute[] => {
    const attributeRowEls = $(SELECTORS.PRODUCT_ATTRIBUTE_ROWS).get();

    const attributeRows = attributeRowEls.map((rowEl) => {
        const label = $(rowEl).find(SELECTORS.ATTRIBUTES_LABEL).text();
        const value = $(rowEl).find(SELECTORS.ATTRIBUTES_VALUE).text();

        return { label, value };
    });

    return attributeRows;
};

/**
 * Scrapes the product details from the given Cheerio object.
 */
export const extractProductDetails = ($: CheerioAPI): ProductDetails => {
    const title = $(SELECTORS.TITLE).text().trim();

    const price = parseNumberFromSelector($, SELECTORS.PRICE);
    const listPrice = parseNumberFromSelector($, SELECTORS.LIST_PRICE);
    const reviewRating = parseNumberFromSelector($, SELECTORS.REVIEW_RATING);
    const reviewCount = parseNumberFromSelector($, SELECTORS.REVIEW_COUNT);

    const imageUrls = extractImageUrls($);
    const attributes = extractProductAttributes($);

    return { title, price, listPrice, reviewRating, reviewCount, imageUrls, attributes };
};
```

Next up is the task of making the scraping part functional. Let's implement the crawling part using Crawlee.

## Crawling the product pages[​](#crawling-the-product-pages "Direct link to Crawling the product pages")

We'll utilize the features that Crawlee offers to crawl the product pages. As we mentioned at the beginning, it considerably simplifies web scraping with JSON file outputs, automatic scaling, and request queue management. Our next stepping stone is to wrap our scraping logic within Crawlee, thereby implementing the crawling part of our process.

```
import { CheerioCrawler, CheerioCrawlingContext, log } from 'crawlee';
import { extractProductDetails } from './scraper.js';

/**
 * Performs the logic of the crawler. It is called for each URL to crawl.
 * - Passed to the crawler using the `requestHandler` option.
 */
const requestHandler = async (context: CheerioCrawlingContext) => {
    const { $, request } = context;
    const { url } = request;

    log.info(`Scraping product page`, { url });
    const extractedProduct = extractProductDetails($);

    log.info(`Scraped product details for "${extractedProduct.title}", saving...`, { url });
    await crawler.pushData(extractedProduct);
};

/**
 * The crawler instance. Crawlee provides a few different crawlers, but we'll use CheerioCrawler, as it's very fast and simple to use.
 * - Alternatively, we could use a full browser crawler like `PlaywrightCrawler` to imitate a real browser.
 */
const crawler = new CheerioCrawler({ requestHandler });

await crawler.run(['https://www.amazon.com/dp/B0BV7XQ9V9']);
```

The code now successfully extracts the product details from the given URLs. We've integrated our scraping function into Crawlee, and it's ready to scrape.
Here's an example of the extracted data:

```
{
    "title": "ASUS ROG Strix G16 (2023) Gaming Laptop, 16” 16:10 FHD 165Hz, GeForce RTX 4070, Intel Core i9-13980HX, 16GB DDR5, 1TB PCIe SSD, Wi-Fi 6E, Windows 11, G614JI-AS94, Eclipse Gray",
    "price": 1799.99,
    "listPrice": 1999.99,
    "reviewRating": 4.3,
    "reviewCount": 372,
    "imageUrls": [
        "https://m.media-amazon.com/images/I/41EWnXeuMzL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/51gAOHZbtUL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/51WLw+9ItgL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/41D-FN8qjLL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/41X+oNPvdkL._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/41X6TCWz69L._AC_US40_.jpg",
        "https://m.media-amazon.com/images/I/31rphsiD0lL.SS40_BG85,85,85_BR-120_PKdp-play-icon-overlay__.jpg"
    ],
    "attributes": [
        { "label": "Brand", "value": "ASUS" },
        { "label": "Model Name", "value": "ROG Strix G16" },
        { "label": "Screen Size", "value": "16 Inches" },
        { "label": "Color", "value": "Eclipse Gray" },
        { "label": "Hard Disk Size", "value": "1 TB" },
        { "label": "CPU Model", "value": "Intel Core i9" },
        { "label": "Ram Memory Installed Size", "value": "16 GB" },
        { "label": "Operating System", "value": "Windows 11 Home" },
        { "label": "Special Feature", "value": "Anti Glare Coating" },
        { "label": "Graphics Card Description", "value": "Dedicated" }
    ]
}
```

## How to avoid getting blocked when scraping Amazon[​](#how-to-avoid-getting-blocked-when-scraping-amazon "Direct link to How to avoid getting blocked when scraping Amazon")

With a giant website like Amazon, one is bound to face some issues with blocking. Let's discuss how to handle them.

Amazon frequently presents annoying CAPTCHAs or warning screens that may detect or block your scraper. We can counter this inconvenience by implementing a mechanism to detect and handle these blocks. As soon as we stumble upon one, we retry the request.

```
import { CheerioAPI } from 'cheerio';

const CAPTCHA_SELECTOR = '[action="/errors/validateCaptcha"]';

/**
 * Handles the captcha blocking. Throws an error if a captcha is displayed.
 * - Crawlee automatically retries any requests that throw an error.
 * - Status code blocking (e.g. Amazon's `503`) is handled automatically by Crawlee.
 */
export const handleCaptchaBlocking = ($: CheerioAPI) => {
    const isCaptchaDisplayed = $(CAPTCHA_SELECTOR).length > 0;
    if (isCaptchaDisplayed) throw new Error('Captcha is displayed! Retrying...');
};
```

Make a small tweak in the request handler to use `handleCaptchaBlocking`:

```
import { handleCaptchaBlocking } from './blocking-detection.js';

const requestHandler = async (context: CheerioCrawlingContext) => {
    const { request, $ } = context;
    const { url } = request;

    handleCaptchaBlocking($); // Alternatively, we can put this into the crawler's `postNavigationHooks`

    log.info(`Scraping product page`, { url });
    ...
};
```

While Crawlee's browser-like user-agent headers prevent blocking to a certain extent, this is only partially effective for a site as vast as Amazon.

### Using proxies[​](#using-proxies "Direct link to Using proxies")

The use of proxies marks another significant tactic in evading blocking. You'll be pleased to know that Crawlee excels in this domain, supporting both [custom proxies](https://crawlee.dev/js/docs/guides/proxy-management.md) and [Apify proxies](https://apify.com/proxy).
Here's an example of how to use Apify's [residential proxies](https://docs.apify.com/platform/proxy/residential-proxy), which are highly effective in preventing blocking:

```
import { ProxyConfiguration } from 'apify';

const proxyConfiguration = new ProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US', // Optionally, you can specify the proxy country code.
    // This is useful for sites like Amazon, which display different content based on the user's location.
});

const crawler = new CheerioCrawler({ requestHandler, proxyConfiguration });

...
```

### Using headless browsers to scrape Amazon[​](#using-headless-browsers-to-scrape-amazon "Direct link to Using headless browsers to scrape Amazon")

For more advanced scraping, you can use a headless browser like [Playwright](https://crawlee.dev/js/docs/examples/playwright-crawler.md) to scrape Amazon. This method is more effective in preventing blocking and can handle websites with complex JavaScript interactions.

To use Playwright with Crawlee, we can replace the `CheerioCrawler` with `PlaywrightCrawler`:

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({ requestHandler, proxyConfiguration });

...
```

And update our Cheerio-dependent code to work within Playwright:

```
import { PlaywrightCrawlingContext } from 'crawlee';

const requestHandler = async (context: PlaywrightCrawlingContext) => {
    const { request, parseWithCheerio } = context;
    const { url } = request;

    const $ = await parseWithCheerio(); // Get the Cheerio object for the page.
    ...
};
```

## Conclusion and next steps[​](#conclusion-and-next-steps "Direct link to Conclusion and next steps")

You've now journeyed through the basic and advanced terrains of web scraping Amazon product pages using the capabilities of TypeScript, Cheerio, and Crawlee. It can seem like a lot to digest but don't worry! With more practice, each step will become more familiar and intuitive - until you become a web scraping ninja.

So go ahead and start experimenting. If you want to learn more, check out our detailed tutorial on building a [HackerNews scraper using Crawlee](https://blog.apify.com/crawlee-web-scraping-tutorial/). For more extensive web scraping abilities, check out pre-built scrapers from Apify, like [Amazon Web Scraper](https://apify.com/junglee/amazon-crawler)!

---

# How to scrape infinite scrolling webpages with Python

August 27, 2024 · 7 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

Hello, Crawlee Devs, and welcome back to another tutorial on the Crawlee Blog. This tutorial will teach you how to scrape infinite-scrolling websites using Crawlee for Python.

For context, infinite-scrolling pages are a modern alternative to classic pagination. When users scroll to the bottom of the webpage instead of choosing the next page, the page automatically loads more data, and users can scroll more.

As a big sneakerhead, I'll take the Nike shoes infinite-scrolling [website](https://www.nike.com/) as an example, and we'll scrape thousands of sneakers from it.

![How to scrape infinite scrolling pages with Python](/assets/images/infinite-scroll-de1fd1c1791fdf8f6b5614a947ccc878.webp)

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.
## Prerequisites and bootstrapping the project[​](#prerequisites-and-bootstrapping-the-project "Direct link to Prerequisites and bootstrapping the project")

Let's start the tutorial by creating a new Crawlee for Python project with this command:

```
pipx run crawlee create nike-crawler
```

note

Before going ahead, if you like reading this blog, we would be really happy if you gave [Crawlee for Python a star on GitHub!](https://github.com/apify/crawlee-python/)

We will scrape using headless browsers. Select `PlaywrightCrawler` in the terminal when Crawlee for Python asks for it.

After installation, Crawlee for Python will create boilerplate code for you. Navigate into the project folder and then run this command to install all the dependencies:

```
poetry install
```

## How to scrape infinite scrolling webpages[​](#how-to-scrape-infinite-scrolling-webpages "Direct link to How to scrape infinite scrolling webpages")

1. Handling accept cookie dialog
2. Adding request of all shoes links
3. Extract data from product details
4. Accept Cookies context manager
5. Handling infinite scroll on the listing page
6. Exporting data to CSV format

### Handling accept cookie dialog[​](#handling-accept-cookie-dialog "Direct link to Handling accept cookie dialog")

After all the necessary installations, we'll start looking into the files and configuring them accordingly. When you look into the folder, you'll see many files, but for now, let's focus on `main.py` and `routes.py`.

In `main.py`, let's change the target location to the Nike website. Then, just to see how scraping will happen, we'll add `headless = False` to the `PlaywrightCrawler` parameters. Let's also increase the maximum requests per crawl option to 100 to see the power of parallel scraping in Crawlee for Python.

The final code will look like this:

```
from crawlee.playwright_crawler import PlaywrightCrawler

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    crawler = PlaywrightCrawler(
        headless=False,
        request_handler=router,
        max_requests_per_crawl=100,
    )

    await crawler.run(
        [
            'https://nike.com/',
        ]
    )
```

Now coming to `routes.py`, let's remove:

```
await context.enqueue_links()
```

as we don't want to scrape the whole website.

Now, run the crawler using the command:

```
poetry run python -m nike-crawler
```

The cookie dialog blocks us from crawling more than one page's worth of shoes, so let's get it out of our way. We can handle the cookie dialog by going to Chrome dev tools and looking at the `test_id` of the "accept cookies" button, which is `dialog-accept-button`.

Now, let's remove the `context.push_data` call that was left there from the project template and add the code to accept the dialog in `routes.py`. The updated code will look like this:

```
from crawlee.router import Router
from crawlee.playwright_crawler import PlaywrightCrawlingContext

router = Router[PlaywrightCrawlingContext]()


@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    """Default request handler."""
    # Wait for the popup to be visible to ensure it has loaded on the page.
    await context.page.get_by_test_id('dialog-accept-button').click()
```

### Adding request of all shoes links[​](#adding-request-of-all-shoes-links "Direct link to Adding request of all shoes links")

Now, if you hover over the top bar and see all the sections, i.e., man, woman, and kids, you'll notice the “All shoes” section. As we want to scrape all the sneakers, this section interests us.
Let's use `get_by_test_id` with the filter of `has_text='All shoes'` and add all the links with the text “All shoes” to the request handler. Let's add this code to the existing `routes.py` file:

```
    shoe_listing_links = (
        await context.page.get_by_test_id('link').filter(has_text='All shoes').all()
    )
    await context.add_requests(
        [
            Request.from_url(url, label='listing')
            for link in shoe_listing_links
            if (url := await link.get_attribute('href'))
        ]
    )


@router.handler('listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for shoe listings."""
```

### Extract data from product details[​](#extract-data-from-product-details "Direct link to Extract data from product details")

Now that we have all the links to the pages with the title “All Shoes,” the next step is to scrape all the products on each page and the information provided on them. We'll extract each shoe's URL, title, price, and description. Again, let's go to dev tools and extract each parameter's relevant `test_id`. After scraping each of the parameters, we'll use the `context.push_data` function to add it to the local storage.

Now let's add the following code to the `listing_handler` and update it in the `routes.py` file:

```
@router.handler('listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for shoe listings."""
    await context.enqueue_links(selector='a.product-card__link-overlay', label='detail')


@router.handler('detail')
async def detail_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for shoe details."""
    title = await context.page.get_by_test_id(
        'product_title',
    ).text_content()

    price = await context.page.get_by_test_id(
        'currentPrice-container',
    ).first.text_content()

    description = await context.page.get_by_test_id(
        'product-description',
    ).text_content()

    await context.push_data(
        {
            'url': context.request.loaded_url,
            'title': title,
            'price': price,
            'description': description,
        }
    )
```

### Accept Cookies context manager[​](#accept-cookies-context-manager "Direct link to Accept Cookies context manager")

Since we're dealing with multiple browser pages with multiple links and we want to do infinite scrolling, we may encounter an accept cookie dialog on each page. This will prevent loading more shoes via infinite scroll.

We'll need to check for cookies on every page, as each one may be opened with a fresh session (no stored cookies) and we'll get the accept cookie dialog even though we already accepted it in another browser window. However, if we don't get the dialog, we want the request handler to work as usual.

To solve this problem, we'll try to deal with the dialog in a parallel task that will run in the background. A context manager is a nice abstraction that will allow us to reuse this logic in all the router handlers. So, let's build a context manager (the `asyncio`, `contextlib`, and Playwright imports below are the ones this snippet needs):

```
import asyncio
from contextlib import asynccontextmanager, suppress

from playwright.async_api import Page
from playwright.async_api import TimeoutError as PlaywrightTimeoutError


@asynccontextmanager
async def accept_cookies(page: Page):
    # Try to click the cookie dialog's accept button in a background task.
    task = asyncio.create_task(page.get_by_test_id('dialog-accept-button').click())
    try:
        yield
    finally:
        if not task.done():
            task.cancel()

        with suppress(asyncio.CancelledError, PlaywrightTimeoutError):
            await task
```

This context manager will make sure we're accepting the cookie dialog if it exists before scrolling and scraping the page.
Let's implement it in the `routes.py` file, and the updated code is [here](https://github.com/janbuchar/crawlee-python-demo/blob/6ca6f7f1d1bbbf789a3b86f14bec492cf756251e/crawlee-python-webinar/routes.py).

### Handling infinite scroll on the listing page[​](#handling-infinite-scroll-on-the-listing-page "Direct link to Handling infinite scroll on the listing page")

Now for the last and most interesting part of the tutorial! How to handle the infinite scroll of each shoe listing page and make sure our crawler is scrolling and scraping the data constantly.

This tutorial is taken from the webinar held on August 5th where Jan Buchar, Senior Python Engineer at Apify, gave a live demo about this use case. Watch the tutorial here: [YouTube video player](https://www.youtube.com/embed/ip8Ii0eLfRY?si=7ZllUhMhuC7VC23B\&start=667)

To handle infinite scrolling in Crawlee for Python, we just need to make sure the page is loaded, which is done by waiting for the `network_idle` load state, and then use the `infinite_scroll` helper function which will keep scrolling to the bottom of the page as long as that makes additional items appear.

Let's add two lines of code to the `listing` handler:

```
@router.handler('listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for shoe listings."""
    async with accept_cookies(context.page):
        await context.page.wait_for_load_state('networkidle')
        await context.infinite_scroll()
        await context.enqueue_links(
            selector='a.product-card__link-overlay',
            label='detail',
        )
```

## Exporting data to CSV format[​](#exporting-data-to-csv-format "Direct link to Exporting data to CSV format")

As we want to store all the shoe data into a CSV file, we can just add a call to the `export_data` helper into the `main.py` file just after the crawler run:

```
await crawler.export_data('shoes.csv')
```

## Working crawler and its code[​](#working-crawler-and-its-code "Direct link to Working crawler and its code")

Now, we have a crawler ready that can scrape all the shoes from the Nike website while handling infinite scrolling and many other problems, like the cookies dialog.

You can find the complete working crawler code here on the [GitHub repository](https://github.com/janbuchar/crawlee-python-demo).

Learn more about Crawlee for Python from our latest step by step [tutorial](https://blog.apify.com/crawlee-for-python-tutorial/).

If you have any doubts regarding this tutorial or using Crawlee for Python, feel free to [join our discord community](https://apify.com/discord/) and ask fellow developers or the Crawlee team.
Today, [Crawlee, built in TypeScript,](https://github.com/apify/crawlee) has nearly **13,000 stars on GitHub**, with 90 open-source contributors worldwide building the best web scraping and automation library.

Since the launch, the feedback we’ve received most often [\[1\]](https://discord.com/channels/801163717915574323/999250964554981446/1138826582581059585)[\[2\]](https://discord.com/channels/801163717915574323/801163719198638092/1137702376267059290)[\[3\]](https://discord.com/channels/801163717915574323/1090592836044476426/1103977818221719584) has been to build Crawlee in Python so that the Python community can use all the features the JavaScript community does.

With all these requests in mind and to simplify the life of Python web scraping developers, **we’re launching [Crawlee for Python](https://github.com/apify/crawlee-python) today.** The new library is still in **beta**, and we are looking for **early adopters**.

![Crawlee for Python is looking for early adopters](/assets/images/early-adopters-0c5f38327dd8e5fad85dc127dcabc1f0.webp)

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.

## Why use Crawlee instead of a random HTTP library with an HTML parser?[​](#why-use-crawlee-instead-of-a-random-http-library-with-an-html-parser "Direct link to Why use Crawlee instead of a random HTTP library with an HTML parser?")

* Unified interface for HTTP & headless browser crawling.
  * HTTP - HTTPX with BeautifulSoup,
  * Headless browser - Playwright.
* Automatic parallel crawling based on available system resources.
* Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
* Automatic retries on errors or when you’re getting blocked.
* Integrated proxy rotation and session management.
* Configurable request routing - direct URLs to the appropriate handlers.
* Persistent queue for URLs to crawl.
* Pluggable storage of both tabular data and files.

## Understanding the why behind the features of Crawlee[​](#understanding-the-why-behind-the-features-of-crawlee "Direct link to Understanding the why behind the features of Crawlee")

### Out-of-the-box support for headless browser crawling (Playwright).[​](#out-of-the-box-support-for-headless-browser-crawling-playwright "Direct link to Out-of-the-box support for headless browser crawling (Playwright).")

While libraries like Scrapy require the additional installation of middleware, such as [`scrapy-playwright`](https://github.com/scrapy-plugins/scrapy-playwright), which still doesn’t work on Windows, Crawlee for Python supports a unified interface for HTTP & headless browsers.

Using a headless browser to download web pages and extract data, `PlaywrightCrawler` is ideal for crawling websites that require JavaScript execution. For websites that don’t require JavaScript, consider using the `BeautifulSoupCrawler`, which utilizes raw HTTP requests and will be much faster.
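Before the Playwright example below, here is a minimal sketch of what the HTTP-based approach with `BeautifulSoupCrawler` might look like. The target URL and the request limit are illustrative only, and depending on your Crawlee for Python version the crawler classes may live in `crawlee.crawlers` instead of the module used here:

```
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a crawler instance that uses plain HTTP requests and BeautifulSoup parsing.
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=50,  # illustrative limit
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract the page title straight from the parsed HTML - no browser involved.
        data = {
            'request_url': context.request.url,
            'page_title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```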
```
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = PlaywrightCrawler(
        # headless=False,
        # browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            'request_url': context.request.url,
            'page_url': context.page.url,
            'page_title': await context.page.title(),
            'page_content': (await context.page.content())[:10000],
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

The above example uses Crawlee’s built-in `PlaywrightCrawler` to crawl [https://crawlee.dev/](https://crawlee.dev/index.md) and extract the page title and content.

### Small learning curve[​](#small-learning-curve "Direct link to Small learning curve")

In other libraries like Scrapy, when you run a command to create a new project, you get many files. Then you need to learn about the architecture, including various components (spiders, middlewares, pipelines, etc.). [The learning curve is very steep](https://crawlee.dev/blog/scrapy-vs-crawlee.md#language-and-development-environments).

While building Crawlee, we made sure that the learning curve and the setup would be as fast as possible. With [ready-made templates](https://github.com/apify/crawlee-python/tree/master/templates) and only a single file to add your code to, it's very easy to start building a scraper. You might need to learn a little about request handlers and storage, but that’s all.

### Complete type hint coverage[​](#complete-type-hint-coverage "Direct link to Complete type hint coverage")

We know how much developers like their code to be high-quality, readable, and maintainable. That's why the whole code base of Crawlee is fully type-hinted. Thanks to that, you should have better autocompletion in your IDE, enhancing developer experience while developing your scrapers using Crawlee. Type hinting should also reduce the number of bugs thanks to static type checking.

![Crawlee\_Python\_Type\_Hint](/assets/images/crawlee-python-type-hint-90bb0ec4fb86916d8a6b2512a80f965b.webp)

### Based on Asyncio[​](#based-on-asyncio "Direct link to Based on Asyncio")

Crawlee is fully asynchronous and based on [Asyncio](https://docs.python.org/3/library/asyncio.html). For scraping frameworks, where many IO-bound operations occur, this is crucial to achieving high performance. Also, thanks to Asyncio, integration with other applications or the rest of your system should be easy.

How is this different from the Scrapy framework, which is also asynchronous? Scrapy relies on the "legacy" Twisted framework. Integrating Scrapy with modern Asyncio-based applications can be challenging, often requiring more effort and debugging [\[1\]](https://stackoverflow.com/questions/49201915/debugging-scrapy-project-in-visual-studio-code).

## Power of open source community and early adopters giveaway[​](#power-of-open-source-community-and-early-adopters-giveaway "Direct link to Power of open source community and early adopters giveaway")

Crawlee for Python is fully open-sourced and the codebase is available on the [GitHub repository of Crawlee for Python](https://github.com/apify/crawlee-python). We have already started receiving initial and very [valuable contributions from the Python community](https://github.com/apify/crawlee-python/pull/226).
> Early adopters also said:
>
> “Crawlee for Python development team did a great job in building the product, it makes things faster for a Python developer.”
>
> \~ [Maksym Bohomolov](https://apify.com/mantisus)

There’s still room for improvement. Feel free to open issues, make pull requests, and [star the repository](https://github.com/apify/crawlee-python/) to spread the word to other developers.

**We will award the first 10 pieces of feedback** that add value and are accepted by our team with an exclusive Crawlee for Python swag (the first Crawlee for Python swag ever). Check out the [GitHub issue here](https://github.com/apify/crawlee-python/issues/269/).

With such contributions, we’re excited and looking forward to building an amazing library for the Python community.

Check out a step-by-step guide on how to use Crawlee for Python in our [latest tutorial](https://blog.apify.com/crawlee-for-python-tutorial/).

[Join our Discord community](https://apify.com/discord) with nearly 8,000 web scraping developers, where our team would be happy to help you with any problems or discuss any use case for Crawlee for Python.

---

# How to create a LinkedIn job scraper in Python with Crawlee

October 14, 2024 · 7 min read

[![Arindam Majumder](https://avatars.githubusercontent.com/u/109217591?v=4)](https://github.com/Arindam200) [Arindam Majumder](https://github.com/Arindam200) Community Member of Crawlee

## Introduction[​](#introduction "Direct link to Introduction")

In this article, we will build a web application that scrapes LinkedIn for job postings using Crawlee and Streamlit. We will create a LinkedIn job scraper in Python using Crawlee for Python to extract the company name, job title, time of posting, and link to the job posting from dynamically received user input through the web application.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord).

By the end of this tutorial, you’ll have a fully functional web application that you can use to scrape job postings from LinkedIn.
*Linkedin Job Scraper*
Let's begin.

## Prerequisites[​](#prerequisites "Direct link to Prerequisites")

Let's start by creating a new Crawlee for Python project with this command:

```
pipx run crawlee create linkedin-scraper
```

Select `PlaywrightCrawler` in the terminal when Crawlee asks for it.

After installation, Crawlee for Python will create boilerplate code for you. You can change the directory (`cd`) to the project folder and run this command to install dependencies:

```
poetry install
```

We are going to begin editing the files provided to us by Crawlee so we can build our scraper.

note

Before going ahead, if you like reading this blog, we would be really happy if you gave [Crawlee for Python a star on GitHub](https://github.com/apify/crawlee-python/)!
## Building the LinkedIn job Scraper in Python with Crawlee[​](#building-the-linkedin-job-scraper-in-python-with-crawlee "Direct link to Building the LinkedIn job Scraper in Python with Crawlee")

In this section, we will be building the scraper using the Crawlee for Python package. To learn more about Crawlee, check out their [documentation](https://www.crawlee.dev/python/docs/quick-start).

### 1. Inspecting the LinkedIn job Search Page[​](#1-inspecting-the-linkedin-job-search-page "Direct link to 1. Inspecting the LinkedIn job Search Page")

Open LinkedIn in your web browser and sign out from the website (if you already have an account logged in). You should see an interface like this.

![LinkedIn Homepage](/assets/images/linkedin-homepage-8bec2b6a9ae97a18a7e49d4275c14cee.webp)

Navigate to the jobs section, search for a job and location of your choice, and copy the URL.

![LinkedIn Jobs Page](/assets/images/linkedin-jobs-44e352d2233de5adb7af9838b75b9895.webp)

You should have something like this:

`https://www.linkedin.com/jobs/search?keywords=Backend%20Developer&location=Canada&geoId=101174742&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0`

We're going to focus on the search parameters, which is the part that goes after '?'. The keyword and location parameters are the most important ones for us. The job title the user supplies will be input to the keyword parameter, while the location the user supplies will go into the location parameter. Lastly, the `geoId` parameter will be removed while we keep the other parameters constant.

We are going to be making changes to our `main.py` file. Copy and paste the code below in your `main.py` file.

```
from urllib.parse import urlencode, urljoin

from crawlee.playwright_crawler import PlaywrightCrawler

from .routes import router


async def main(title: str, location: str, data_name: str) -> None:
    base_url = "https://www.linkedin.com/jobs/search"

    # URL encode the parameters
    params = {
        "keywords": title,
        "location": location,
        "trk": "public_jobs_jobs-search-bar_search-submit",
        "position": "1",
        "pageNum": "0",
    }
    encoded_params = urlencode(params)

    # Encode parameters into a query string
    query_string = '?' + encoded_params

    # Combine base URL with the encoded query string
    encoded_url = urljoin(base_url, "") + query_string

    # Initialize the crawler
    crawler = PlaywrightCrawler(
        request_handler=router,
    )

    # Run the crawler with the initial list of URLs
    await crawler.run([encoded_url])

    # Save the data in a CSV file
    output_file = f"{data_name}.csv"
    await crawler.export_data(output_file)
```

Now that we have encoded the URL, the next step for us is to adjust the generated router to handle LinkedIn job postings.

### 2. Routing your crawler[​](#2-routing-your-crawler "Direct link to 2. Routing your crawler")

We will be making use of two handlers for our application:

* **Default handler** - the `default_handler` handles the start URL.
* **Job listing** - the `job_listing` handler extracts the individual job details.

The Playwright crawler is going to crawl through the job posting page and extract the links to all job postings on the page.

![Identifying elements](/assets/images/elements-a634b50a7ad31ae15db61e1a06f5125e.webp)

When you examine the job postings, you will discover that the job posting links are inside an ordered list with a class named `jobs-search__results-list`. We will then extract the links using the Playwright locator object and add them to the `job_listing` route for processing.
``` # routes.py import re from crawlee import Request from crawlee.playwright_crawler import PlaywrightCrawlingContext from crawlee.router import Router router = Router[PlaywrightCrawlingContext]() @router.default_handler async def default_handler(context: PlaywrightCrawlingContext) -> None: """Default request handler.""" # select all the links for the job posting on the page hrefs = await context.page.locator('ul.jobs-search__results-list a').evaluate_all("links => links.map(link => link.href)") # add all the links to the job listing route await context.add_requests( [Request.from_url(rec, label='job_listing') for rec in hrefs] ) ``` Now that we have the job listings, the next step is to scrape their details. We'll extract each job’s title, company's name, time of posting, and the link to the job post. Open your dev tools to extract each element using its CSS selector. ![Inspecting elements](/assets/images/inspect-90f77b162804bd1163b16bb23b315ed8.webp) After scraping each of the listings, we'll remove special characters from the text to make it clean and push the data to local storage using the `context.push_data` function. ``` @router.handler('job_listing') async def listing_handler(context: PlaywrightCrawlingContext) -> None: """Handler for job listings.""" await context.page.wait_for_load_state('load') job_title = await context.page.locator('div.top-card-layout__entity-info h1.top-card-layout__title').text_content() company_name = await context.page.locator('span.topcard__flavor a').text_content() time_of_posting = await context.page.locator('div.topcard__flavor-row span.posted-time-ago__text').text_content() await context.push_data( { # we are making use of regex to remove special characters from the extracted texts 'title': re.sub(r'[\s\n]+', '', job_title), 'Company name': re.sub(r'[\s\n]+', '', company_name), 'Time of posting': re.sub(r'[\s\n]+', '', time_of_posting), 'url': context.request.loaded_url, } ) ``` ### 3. Creating your application[​](#3-creating-your-application "Direct link to 3. Creating your application") For this project, we will be using Streamlit for the web application. Before we proceed, we are going to create a new file named `app.py` in the project directory. In addition, ensure you have [Streamlit](https://docs.streamlit.io/get-started/installation) installed in your global Python environment before proceeding with this section. ``` import streamlit as st import subprocess # Streamlit form for inputs st.title("LinkedIn Job Scraper") with st.form("scraper_form"): title = st.text_input("Job Title", value="backend developer") location = st.text_input("Job Location", value="newyork") data_name = st.text_input("Output File Name", value="backend_jobs") submit_button = st.form_submit_button("Run Scraper") if submit_button: # Run the scraping script with the form inputs command = f"""poetry run python -m linkedin-scraper --title "{title}" --location "{location}" --data_name "{data_name}" """ with st.spinner("Crawling in progress..."): # Execute the command and display the results result = subprocess.run(command, shell=True, capture_output=True, text=True) st.write("Script Output:") st.text(result.stdout) if result.returncode == 0: st.success(f"Data successfully saved in {data_name}.csv") else: st.error(f"Error: {result.stderr}") ``` The Streamlit web application takes in the user's input and uses the Python `subprocess` module to run the Crawlee scraping script. ### 4. Testing your app[​](#4-testing-your-app "Direct link to 4. Testing your app") Before we test the application, we need to make a little modification to the `__main__` file so that it accommodates the command line arguments.
``` import asyncio import argparse from .main import main def get_args(): # ArgumentParser object to capture command-line arguments parser = argparse.ArgumentParser(description="Crawl LinkedIn job listings") # Define the arguments parser.add_argument("--title", type=str, required=True, help="Job title") parser.add_argument("--location", type=str, required=True, help="Job location") parser.add_argument("--data_name", type=str, required=True, help="Name for the output CSV file") # Parse the arguments return parser.parse_args() if __name__ == '__main__': args = get_args() # Run the main function with the parsed command-line arguments asyncio.run(main(args.title, args.location, args.data_name)) ``` We will start the Streamlit application by running this command in the terminal: ``` streamlit run app.py ``` This is what the application should look like in the browser: ![Running scraper](/assets/images/running-555ab15f009be751f516aabd99e6c574.webp) You will get this interface showing you that the scraping has been completed: ![Filling input form](/assets/images/form-774ee8d03c87acfc38d3012d38a9c4ce.webp) To access the scraped data, go over to your project directory and open the CSV file. ![CSV file with all scraped LinkedIn jobs](/assets/images/excel-23850449d4d74099a1264cd93ca8565b.webp) You should have something like this as the output of your CSV file. ## Conclusion[​](#conclusion "Direct link to Conclusion") In this tutorial, we have learned how to build an application that can scrape job posting data from LinkedIn using Crawlee. Have fun building great scraping applications with Crawlee. You can find the complete working crawler code on the [GitHub repository](https://github.com/Arindam200/LinkedIn_Scraping). --- # Building a Netflix show recommender using Crawlee and React June 10, 2024 · 8 min read [![Ayush Thakur](https://avatars.githubusercontent.com/u/43995654?v=4)](https://github.com/ayush2390) [Ayush Thakur](https://github.com/ayush2390) Community Member of Crawlee In this blog, we'll guide you through the process of using Vite and Crawlee to build a website that recommends Netflix shows based on their categories and genres. To do that, we will first scrape the shows and categories from Netflix using Crawlee, and then visualize the scraped data in a React app built with Vite. By the end of this guide, you'll have a functional web show recommender that can provide Netflix show suggestions. note One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). ![How to scrape Netflix using Crawlee and React to build a show recommender](/assets/images/create-netflix-show-recommender-c429467c4a972badaa0b8ab414454250.webp) Let’s get started! ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") To use Crawlee, you need to have Node.js 16 or newer. tip If you like the posts on the Crawlee blog so far, please consider [giving Crawlee a star on GitHub](https://github.com/apify/crawlee), it helps us to reach and help more developers. You can install the latest version of Node.js from the [official website](https://nodejs.org/en/). This great [Node.js installation guide](https://blog.apify.com/how-to-install-nodejs/) gives you tips to avoid issues later on. ## Creating a React app[​](#creating-a-react-app "Direct link to Creating a React app") First, we will create a React app (for the front end) using Vite.
Run this command in the terminal to create it: ``` npx create-vite@latest ``` You can check out the [Vite Docs](https://vitejs.dev/guide/) for more details on how to create a React app. Once the React app is created, open it in VS Code. ![react](/assets/images/react-646682cf5586bf230bf98086a4323845.webp) This will be the structure of your React app. Run the `npm run dev` command in the terminal to run the app. ![viteandreact](/assets/images/viteandreact-57c4bb4028b4d6b7cc9a22b32b70d3f7.webp) This will be the output displayed. ## Adding Scraper code[​](#adding-scraper-code "Direct link to Adding Scraper code") As per our project requirements, we will scrape the genres and the titles of the shows available on Netflix. Let’s start building the scraper code. ### Installation[​](#installation "Direct link to Installation") Run this command to install `crawlee`: ``` npm install crawlee ``` Crawlee utilizes Cheerio for HTML parsing and scraping of static websites. While faster and [less resource-intensive](https://crawlee.dev/js/docs/guides/scaling-crawlers.md), it can only scrape websites that do not require JavaScript rendering, making it unsuitable for SPAs (single-page applications). In this tutorial, we can extract the data from the HTML structure, so we will go with Cheerio. For extracting data from SPAs or JavaScript-rendered websites, Crawlee also supports headless browser libraries like [Playwright](https://playwright.dev/) and [Puppeteer](https://pptr.dev/). After installing the libraries, it’s time to create the scraper code. Create a file in the `src` directory and name it `scraper.js`. The entire scraper code will be created in this file. ### Scraping genres and shows[​](#scraping-genres-and-shows "Direct link to Scraping genres and shows") To scrape the genres and shows, we will utilize the [browser DevTools](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Tools_and_setup/What_are_browser_developer_tools) to identify the tags and CSS selectors targeting the genre elements on the Netflix website. We can capture the HTML structure and call `$(element)` to query the element's subtree. ![genre](/assets/images/genre-cea03ab54c084a8df3139bf584920062.webp) Here, we can observe that the name of the genre is captured by a `span` tag with the `nm-collections-row-name` class. So we can use the `span.nm-collections-row-name` selector to capture this and similar elements. ![title](/assets/images/title-b56306b68714d95cc9e45168906a045f.webp) Similarly, we can observe that the title of the show is captured by the `span` tag having the `nm-collections-title-name` class. So we can use the `span.nm-collections-title-name` selector to capture this and similar elements. ``` // Use parseWithCheerio for efficient HTML parsing const $ = await parseWithCheerio(); // Extract genre and shows directly from the HTML structure const data = $('[data-uia="collections-row"]') .map((_, el) => { const genre = $(el) .find('[data-uia="collections-row-title"]') .text() .trim(); const items = $(el) .find('[data-uia="collections-title"]') .map((_, itemEl) => $(itemEl).text().trim()) .get(); return { genre, items }; }) .get(); const genres = data.map((d) => d.genre); const shows = data.map((d) => d.items); ``` In the code snippet given above, we are using `parseWithCheerio` to parse the HTML content of the current page and extract the `genres` and `shows` information from the HTML structure using Cheerio. This will give us the `genres` and `shows` arrays, containing the lists of genres and shows, respectively.
### Storing data[​](#storing-data "Direct link to Storing data") Now we have all the data that we want for our project and it’s time to store or save the scraped data. To store the data, Crawlee comes with a `pushData()` method. The [pushData()](https://crawlee.dev/js/docs/introduction/saving-data.md) method creates a storage folder in the project directory and stores the scraped data in JSON format. ``` await pushData({ genres: genres, shows: shows, }); ``` This will save the `genres` and `shows` arrays as values in the `genres` and `shows` keys. Here’s the full code that we will use in our project: ``` import { CheerioCrawler, log, Dataset } from "crawlee"; const crawler = new CheerioCrawler({ requestHandler: async ({ request, parseWithCheerio, pushData }) => { log.info(`Processing: ${request.url}`); // Use parseWithCheerio for efficient HTML parsing const $ = await parseWithCheerio(); // Extract genre and shows directly from the HTML structure const data = $('[data-uia="collections-row"]') .map((_, el) => { const genre = $(el) .find('[data-uia="collections-row-title"]') .text() .trim(); const items = $(el) .find('[data-uia="collections-title"]') .map((_, itemEl) => $(itemEl).text().trim()) .get(); return { genre, items }; }) .get(); // Prepare data for pushing const genres = data.map((d) => d.genre); const shows = data.map((d) => d.items); await pushData({ genres: genres, shows: shows, }); }, // Limit crawls for efficiency maxRequestsPerCrawl: 20, }); await crawler.run(["https://www.netflix.com/in/browse/genre/1191605"]); await Dataset.exportToJSON("results"); ``` Now, we will run Crawlee to scrape the website. Before running Crawlee, we need to tweak the `package.json` file. We will add the `start` script targeting the `scraper.js` file to run Crawlee. Add the following code in `'scripts'` object: ``` "start": "node src/scraper.js" ``` and save it. Now run this command to run Crawlee to scrape the data: ``` npm start ``` After running this command, you will see a `storage` folder with the `key_value_stores/default/results.json` file. The scraped data will be stored in JSON format in this file. Now we can use this JSON data and display it in the `App.jsx` component to create the project. In the `App.jsx` component, we will import `jsonData` from the `results.json` file: ``` import { useState } from "react"; import "./App.css"; import jsonData from "../storage/key_value_stores/default/results.json"; function HeaderAndSelector({ handleChange }) { return ( <>

      <h1>Netflix Web Show Recommender</h1>
      {/* Dropdown of scraped genres; the selected option's index is passed up through handleChange */}
      <select onChange={handleChange}>
        <option value="">Select a genre</option>
        {jsonData[0].genres.map((genre, index) => (
          <option key={index} value={index}>
            {genre}
          </option>
        ))}
      </select>
    </>
  );
}

function App() {
  const [count, setCount] = useState(null);

  const handleChange = (event) => {
    const value = event.target.value;
    if (value) setCount(parseInt(value));
  };

  // Validate count to ensure it is within the bounds of the jsonData.shows array
  const isValidCount = count !== null && count <= jsonData[0].shows.length;

  return (
    <div>
      <HeaderAndSelector handleChange={handleChange} />
      {isValidCount && (
        <>
          {/* First column: the first 20 shows of the selected genre */}
          <ul>
            {jsonData[0].shows[count].slice(0, 20).map((show, index) => (
              <li key={index}>{show}</li>
            ))}
          </ul>
          {/* Second column: the remaining shows of the selected genre */}
          <ul>
            {jsonData[0].shows[count].slice(20).map((show, index) => (
              <li key={index}>{show}</li>
            ))}
          </ul>
        </>
      )}
    </div>
); } export default App; ``` In this code snippet, the `genres` array is used to display the list of genres. Users can select their desired genre, and based on that, a list of web shows available on Netflix will be displayed using the `shows` array. Make sure to update the CSS in the `App.css` file from here: and download and save this image file in the main project folder: [Download Image](https://raw.githubusercontent.com/ayush2390/web-show-recommender/main/Netflix.png) Our project is ready! ## Result[​](#result "Direct link to Result") Now, to run your project on localhost, run this command: ``` npm run dev ``` This command will run your project on localhost. Here is a demo of the project: ![result](/assets/images/result-021f50e0c1a5870d2448701c8ca6042d.gif) Project link - In this project, we used Crawlee to scrape Netflix; similarly, Crawlee can be used to scrape single-page applications (SPAs) and JavaScript-rendered websites. The best part is all of this can be done while coding in JavaScript/TypeScript and using a single library. If you want to learn more about Crawlee, go through the [documentation](https://crawlee.dev/js/docs/quick-start.md) and this step-by-step [Crawlee web scraping tutorial](https://blog.apify.com/crawlee-web-scraping-tutorial/) from Apify. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- ## [How to scrape Google search results with Python](https://crawlee.dev/blog/scrape-google-search.md) December 2, 2024 · 7 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert Scraping `Google Search` delivers essential `SERP analysis`, SEO optimization, and data collection capabilities. Modern scraping tools make this process faster and more reliable. note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). In this guide, we'll create a Google Search scraper using [`Crawlee for Python`](https://github.com/apify/crawlee-python) that can handle result ranking and pagination. We'll create a scraper that: * Extracts titles, URLs, and descriptions from search results * Handles multiple search queries * Tracks ranking positions * Processes multiple result pages * Saves data in a structured format ![How to scrape Google search results with Python](/assets/images/google-search-a91bfdf17a4c2860798444b1be56f625.webp) **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) [**Read More**](https://crawlee.dev/blog/scrape-google-search.md) --- ## [Building a Netflix show recommender using Crawlee and React](https://crawlee.dev/blog/netflix-show-recommender.md) June 10, 2024 · 8 min read [![Ayush Thakur](https://avatars.githubusercontent.com/u/43995654?v=4)](https://github.com/ayush2390) [Ayush Thakur](https://github.com/ayush2390) Community Member of Crawlee In this blog, we'll guide you through the process of using Vite and Crawlee to build a website that recommends Netflix shows based on their categories and genres. To do that, we will first scrape the shows and categories from Netflix using Crawlee, and then visualize the scraped data in a React app built with Vite. By the end of this guide, you'll have a functional web show recommender that can provide Netflix show suggestions.
note One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). ![How to scrape Netflix using Crawlee and React to build a show recommender](/assets/images/create-netflix-show-recommender-c429467c4a972badaa0b8ab414454250.webp) **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) [**Read More**](https://crawlee.dev/blog/netflix-show-recommender.md) --- # How Crawlee uses tiered proxies to avoid getting blocked June 24, 2024 · 4 min read [![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager Hello Crawlee community, We are back with another blog, this time explaining how Crawlee rotates proxies and prevents crawlers from getting blocked. Proxies vary in quality, speed, reliability, and cost. There are a [few types of proxies](https://blog.apify.com/types-of-proxies/), such as datacenter and residential proxies. Datacenter proxies are cheaper but, on the other hand, more prone to getting blocked, and vice versa with residential proxies. It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it. note If you like reading this blog, we would be really happy if you gave [Crawlee a star on GitHub!](https://github.com/apify/crawlee/) ## What are tiered proxies?[​](#what-are-tiered-proxies "Direct link to What are tiered proxies?") Tiered proxies are a method of organizing and using different types of proxies based on their quality, speed, reliability, and cost. Tiered proxies allow you to rotate between a mix of proxy types to optimize your scraping activities. You categorize your proxies into different tiers based on their quality. For example: * **High-tier proxies**: Fast, reliable, and expensive. Best for critical tasks where you need high performance. * **Mid-tier proxies**: Moderate speed and reliability. A good balance between cost and performance. * **Low-tier proxies**: Slow and less reliable but cheap. Useful for less critical tasks or high-volume scraping. ## Features:[​](#features "Direct link to Features:") * **Tracking errors**: The system monitors errors (e.g. failed requests, retries) for each domain. * **Adjusting tiers**: Higher-tier proxies are used if a domain shows more errors. Conversely, if a domain performs well with a high-tier proxy, the system will occasionally test lower-tier proxies. If successful, it continues using the lower tier, optimizing costs. * **Forgetting old errors**: Old errors are given less weight over time, allowing the system to adjust tiers dynamically as proxies' performance changes. ## Working[​](#working "Direct link to Working") The `tieredProxyUrls` option in Crawlee's `ProxyConfigurationOptions` allows you to define a list of proxy URLs organized into tiers. Each tier represents a different level of quality, speed, and reliability. 
## Usage[​](#usage "Direct link to Usage") **Fallback Mechanism**: Crawlee starts with the first tier of proxies. If proxies in the current tier fail, it will switch to the next tier. ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ tieredProxyUrls: [ ['http://tier1-proxy1.example.com', 'http://tier1-proxy2.example.com'], ['http://tier2-proxy1.example.com', 'http://tier2-proxy2.example.com'], ['http://tier3-proxy1.example.com', 'http://tier3-proxy2.example.com'], ], }); const crawler = new CheerioCrawler({ proxyConfiguration, requestHandler: async ({ request, response }) => { // Handle the request }, }); await crawler.addRequests([ { url: 'https://example.com/critical' }, { url: 'https://example.com/important' }, { url: 'https://example.com/regular' }, ]); await crawler.run(); ``` ## How tiered proxies use Session Pool under the hood[​](#how-tiered-proxies-use-session-pool-under-the-hood "Direct link to How tiered proxies use Session Pool under the hood") A session pool is a way to manage multiple [sessions](https://crawlee.dev/js/api/core/class/Session.md) on a website so you can distribute your requests across them, reducing the chances of being detected and blocked. You can imagine each session like a different human user with its own IP address. When you use tiered proxies, each proxy tier works with the [session pool](https://crawlee.dev/js/api/core/class/SessionPool.md) to enhance request distribution and manage errors effectively. ![Diagram explaining how tiered proxies use Session Pool under the hood](/assets/images/session-pool-working-a2dee3e83a3444b1330081044b0a234a.webp) For each request, the crawler instance asks the `ProxyConfiguration` which proxy it should use. `ProxyConfiguration` also keeps track of the request domains, and if it sees more requests being retried or, say, more errors, it returns higher proxy tiers. In each request, we must pass the `sessionId` and the request URL to the proxy configuration to get the needed proxy URL from one of the tiers. Choosing which session to pass is where SessionPool comes in. The session pool automatically creates a pool of sessions, rotates them, and uses one of them for each request, mimicking human-like behavior and reducing the chances of getting blocked. ## Conclusion: using proxies efficiently[​](#conclusion-using-proxies-efficiently "Direct link to Conclusion: using proxies efficiently") This inbuilt feature is similar to what Scrapy's `scrapy-rotating-proxies` plugin offers to its users. The tiered proxy configuration dynamically adjusts proxy usage based on real-time performance data, optimizing cost and performance. The session pool ensures requests are distributed across multiple sessions, mimicking human behavior and reducing detection risk. We hope this gives you a better understanding of how Crawlee manages proxies and sessions to make your scraping tasks more effective. As always, we welcome your feedback. [Join our developer community on Discord](https://apify.com/discord) to ask any questions about Crawlee or tell us how you use it. **Tags:** * [proxy](https://crawlee.dev/blog/tags/proxy.md) --- # How to scrape Bluesky with Python March 20, 2025 · 15 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert [Bluesky](https://bsky.app/) is an emerging social network developed by former members of the [Twitter](https://x.com/) (now X) development team.
The platform has been showing significant growth recently, reaching 140.3 million visits according to [SimilarWeb](https://www.similarweb.com/website/bsky.app/#traffic). Like X, Bluesky generates a vast amount of data that can be used for analysis. In this article, we’ll explore how to collect this data using [Crawlee for Python](https://github.com/apify/crawlee-python). note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you’d like to contribute articles like these, please reach out to us on our [discord channel](https://apify.com/discord). ![Banner article](/assets/images/scrape-bluesky-using-python-723c9a74dadb375da06226b1a6a29e10.webp) Key steps we will cover: 1. Project setup 2. Development of the Bluesky crawler in Python 3. Create Apify Actor for Bluesky crawler 4. Conclusion and repository access ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") * Basic understanding of web scraping concepts * Python 3.9 or higher * [UV](https://docs.astral.sh/uv/) version 0.6.0 or higher * Crawlee for Python v0.6.5 or higher * Bluesky account for API access ### Project setup[​](#project-setup "Direct link to Project setup") In this project, we’ll use UV for package management and a specific Python version installed through UV. UV is a fast and modern package manager written in Rust. 1. If you don’t have UV installed yet, follow the [guide](https://docs.astral.sh/uv/getting-started/installation/) or use this command: ``` curl -LsSf https://astral.sh/uv/install.sh | sh ``` 2. Install standalone Python using UV: ``` uv python install 3.13 ``` 3. Create a new project and install Crawlee for Python: ``` uv init bluesky-crawlee --package cd bluesky-crawlee uv add crawlee ``` We’ve created a new isolated Python project with all the necessary dependencies for Crawlee. ## Development of the Bluesky crawler in Python[​](#development-of-the-bluesky-crawler-in-python "Direct link to Development of the Bluesky crawler in Python") note Before going ahead with the project, I'd like to ask you to star Crawlee for Python on [GitHub](https://github.com/apify/crawlee-python/); it helps us to spread the word to fellow scraper developers. ### 1. Identifying the data source[​](#1-identifying-the-data-source "Direct link to 1. Identifying the data source") When accessing the [search page](https://bsky.app/search?q=apify), you'll see data displayed, but be aware of a key limitation: the site only allows viewing the first page of results, preventing access to any additional pages. ![Search Limit](/assets/images/search_limit-c8ee1da0dc9b48fdb6fb125600519ee3.webp) Fortunately, Bluesky provides a well-documented [API](https://docs.bsky.app/docs/get-started) that is accessible to any registered user without additional permissions. This is what we’ll use for data collection. ### 2. Creating a session for API interaction[​](#2-creating-a-session-for-api-interaction "Direct link to 2. Creating a session for API interaction") note For secure API interaction, you need to create a dedicated app password instead of using your main account password. Go to Settings -> Privacy and Security -> [App Passwords](https://bsky.app/settings/app-passwords) and click *Add App Password*. Important: Save the generated password, as it won’t be visible after creation.
Next, create environment variables to store your credentials: * Your application password * Your user identifier (found in your profile and Bluesky URL, for example: [`mantisus.bsky.social`](https://bsky.app/profile/mantisus.bsky.social)) ``` export BLUESKY_APP_PASSWORD=your_app_password export BLUESKY_IDENTIFIER=your_identifier ``` Using the [createSession](https://docs.bsky.app/docs/api/com-atproto-server-create-session), [deleteSession](https://docs.bsky.app/docs/api/com-atproto-server-delete-session) endpoints and [`httpx`](https://www.python-httpx.org/), we can create a session for API interaction. Let us create a class with the necessary methods: ``` import asyncio import json import os import traceback import httpx from yarl import URL from crawlee import ConcurrencySettings, Request from crawlee.configuration import Configuration from crawlee.crawlers import HttpCrawler, HttpCrawlingContext from crawlee.http_clients import HttpxHttpClient from crawlee.storages import Dataset # Environment variables for authentication # BLUESKY_APP_PASSWORD: App-specific password generated from Bluesky settings # BLUESKY_IDENTIFIER: Your Bluesky handle (e.g., username.bsky.social) BLUESKY_APP_PASSWORD = os.getenv('BLUESKY_APP_PASSWORD') BLUESKY_IDENTIFIER = os.getenv('BLUESKY_IDENTIFIER') class BlueskyApiScraper: """A scraper class for extracting data from Bluesky social network using their official API. This scraper manages authentication, concurrent requests, and data collection for both posts and user profiles. It uses separate datasets for storing post and user information. """ def __init__(self) -> None: self._crawler: HttpCrawler | None = None self._users: Dataset | None = None self._posts: Dataset | None = None # Variables for storing session data self._service_endpoint: str | None = None self._user_did: str | None = None self._access_token: str | None = None self._refresh_token: str | None = None self._handle: str | None = None def create_session(self) -> None: """Create credentials for the session.""" url = 'https://bsky.social/xrpc/com.atproto.server.createSession' headers = { 'Content-Type': 'application/json', } data = {'identifier': BLUESKY_IDENTIFIER, 'password': BLUESKY_APP_PASSWORD} response = httpx.post(url, headers=headers, json=data) response.raise_for_status() data = response.json() self._service_endpoint = data['didDoc']['service'][0]['serviceEndpoint'] self._user_did = data['didDoc']['id'] self._access_token = data['accessJwt'] self._refresh_token = data['refreshJwt'] self._handle = data['handle'] def delete_session(self) -> None: """Delete the current session.""" url = f'{self._service_endpoint}/xrpc/com.atproto.server.deleteSession' headers = {'Content-Type': 'application/json', 'authorization': f'Bearer {self._refresh_token}'} response = httpx.post(url, headers=headers) response.raise_for_status() ``` The session expires after 2 hours, so if you plan for your crawler to run longer, you should also add a method for [refresh](https://docs.bsky.app/docs/api/com-atproto-server-refresh-session). ### 3. Configuring Crawlee for Python for data collection[​](#3-configuring-crawlee-for-python-for-data-collection "Direct link to 3. Configuring Crawlee for Python for data collection") Since we’ll be using the official API, we do not need to worry about being blocked by Bluesky. However, we should be careful with the number of requests to avoid overloading Bluesky's servers, so we will configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). 
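The refresh method itself isn't shown in this post, so here is a minimal sketch of what it could look like, mirroring the `delete_session` method above and assuming the `refreshSession` endpoint returns fresh `accessJwt` and `refreshJwt` tokens as described in the linked documentation (the `refresh_session` method name is illustrative):

```
    def refresh_session(self) -> None:
        """Refresh the expiring access token using the stored refresh token (illustrative sketch)."""
        url = f'{self._service_endpoint}/xrpc/com.atproto.server.refreshSession'
        # Like deleteSession, the refresh endpoint is authorized with the refresh token, not the access token
        headers = {'Content-Type': 'application/json', 'authorization': f'Bearer {self._refresh_token}'}
        response = httpx.post(url, headers=headers)
        response.raise_for_status()
        data = response.json()
        # Store the newly issued tokens; any HTTP client headers built from the old access token
        # would also need to be rebuilt after this call.
        self._access_token = data['accessJwt']
        self._refresh_token = data['refreshJwt']
```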
We’ll also configure [`HttpxHttpClient`](https://www.crawlee.dev/python/api/class/HttpxHttpClient) to use custom headers with the current session's `Authorization`. We’ll use 2 endpoints for data collection: [searchPosts](https://docs.bsky.app/docs/api/app-bsky-feed-search-posts) for posts and [getProfile](https://docs.bsky.app/docs/api/app-bsky-actor-get-profile). If you plan to scale the crawler, you can use [getProfiles](https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles) for user data, but in this case, you’ll need to implement deduplication logic. When each link is unique, Crawlee for Python handles this for you. When collecting data, I’d like to separately collect user and post data, so we’ll use different [`Dataset`](https://www.crawlee.dev/python/api/class/Dataset) instances for storage. ``` async def init_crawler(self) -> None: """Initialize the crawler.""" if not self._user_did: raise ValueError('Session not created.') # Initialize the datasets purge the data if it is not empty self._users = await Dataset.open(name='users', configuration=Configuration(purge_on_start=True)) self._posts = await Dataset.open(name='posts', configuration=Configuration(purge_on_start=True)) # Initialize the crawler self._crawler = HttpCrawler( max_requests_per_crawl=100, http_client=HttpxHttpClient( # Set headers for API requests headers={ 'Content-Type': 'application/json', 'Authorization': f'Bearer {self._access_token}', 'Connection': 'Keep-Alive', 'accept-encoding': 'gzip, deflate, br, zstd', } ), # Configuring concurrency of crawling requests concurrency_settings=ConcurrencySettings( min_concurrency=10, desired_concurrency=10, max_concurrency=30, max_tasks_per_minute=200, ), ) self._crawler.router.default_handler(self._search_handler) # Handler for search requests self._crawler.router.handler(label='user')(self._user_handler) # Handler for user requests ``` ### 4. Implementing handlers for data collection[​](#4-implementing-handlers-for-data-collection "Direct link to 4. Implementing handlers for data collection") Now we can implement the handler for searching posts. We’ll save the retrieved posts in `self._posts` and create requests for user data, placing them in the crawler's queue. We also need to handle pagination by forming the link to the next search page. 
``` async def _search_handler(self, context: HttpCrawlingContext) -> None: context.log.info(f'Processing search {context.request.url} ...') data = json.loads(context.http_response.read()) if 'posts' not in data: context.log.warning(f'No posts found in response: {context.request.url}') return user_requests = {} posts = [] profile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile') for post in data['posts']: # Add user request if not already added in current context if post['author']['did'] not in user_requests: user_requests[post['author']['did']] = Request.from_url( url=str(profile_url.with_query(actor=post['author']['did'])), user_data={'label': 'user'}, ) posts.append( { 'uri': post['uri'], 'cid': post['cid'], 'author_did': post['author']['did'], 'created': post['record']['createdAt'], 'indexed': post['indexedAt'], 'reply_count': post['replyCount'], 'repost_count': post['repostCount'], 'like_count': post['likeCount'], 'quote_count': post['quoteCount'], 'text': post['record']['text'], 'langs': '; '.join(post['record'].get('langs', [])), 'reply_parent': post['record'].get('reply', {}).get('parent', {}).get('uri'), 'reply_root': post['record'].get('reply', {}).get('root', {}).get('uri'), } ) await self._posts.push_data(posts) # Push a batch of posts to the dataset await context.add_requests(list(user_requests.values())) if cursor := data.get('cursor'): next_url = URL(context.request.url).update_query({'cursor': cursor}) # Use yarl for update the query string await context.add_requests([str(next_url)]) ``` When receiving user data, we'll store it in the corresponding Dataset `self._users` ``` async def _user_handler(self, context: HttpCrawlingContext) -> None: context.log.info(f'Processing user {context.request.url} ...') data = json.loads(context.http_response.read()) user_item = { 'did': data['did'], 'created': data['createdAt'], 'avatar': data.get('avatar'), 'description': data.get('description'), 'display_name': data.get('displayName'), 'handle': data['handle'], 'indexed': data.get('indexedAt'), 'posts_count': data['postsCount'], 'followers_count': data['followersCount'], 'follows_count': data['followsCount'], } await self._users.push_data(user_item) ``` ### 5. Saving data to files[​](#5-saving-data-to-files "Direct link to 5. Saving data to files") For saving results, we will use the [`write_to_json`](https://www.crawlee.dev/python/api/class/Dataset#write_to_json). ``` async def save_data(self) -> None: """Save the data.""" if not self._users or not self._posts: raise ValueError('Datasets not initialized.') with open('users.json', 'w') as f: await self._users.write_to_json(f, indent=4) with open('posts.json', 'w') as f: await self._posts.write_to_json(f, indent=4) ``` ### 6. Running the crawler[​](#6-running-the-crawler "Direct link to 6. Running the crawler") We have everything needed to complete the crawler. We just need a method to execute the crawling - let us call it `crawl` ``` async def crawl(self, queries: list[str]) -> None: """Crawl the given URL.""" if not self._crawler: raise ValueError('Crawler not initialized.') search_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.feed.searchPosts') await self._crawler.run([str(search_url.with_query(q=query)) for query in queries]) ``` Let's finalize the code: ``` async def run() -> None: """Main execution function that orchestrates the crawling process. Creates a scraper instance, manages the session, and handles the complete crawling lifecycle including proper cleanup on completion or error. 
""" scraper = BlueskyApiScraper() scraper.create_session() try: await scraper.init_crawler() await scraper.crawl(['python', 'apify', 'crawlee']) await scraper.save_data() except Exception: traceback.print_exc() finally: scraper.delete_session() def main() -> None: """Entry point for the crawler application.""" asyncio.run(run()) ``` If you check your `pyproject.toml`, you will see that UV created an entrypoint for running `bluesky-crawlee = "bluesky_crawlee:main"`, so we can run our crawler simply by executing: ``` uv run bluesky-crawlee ``` Let's look at sample results: Posts ![Posts Example](/assets/images/posts-9156686b24a69b73efbc3915f1c8d18e.webp) Users ![Users Example](/assets/images/users-d896c9f24165a0e970d2b26c54def9eb.webp) ## Create Apify Actor for Bluesky crawler[​](#create-apify-actor-for-bluesky-crawler "Direct link to Create Apify Actor for Bluesky crawler") We already have a fully functional implementation for local execution. Let us explore how to adapt it for running on the [Apify Platform](https://apify.com/) and transform in [Apify Actor](https://docs.apify.com/platform/actors). An Actor is a simple and efficient way to deploy your code in the cloud infrastructure on the Apify Platform. You can flexibly interact with the Actor, [schedule regular runs](https://docs.apify.com/platform/schedules) for monitoring data, or [integrate](https://docs.apify.com/platform/integrations) with other tools to build data processing flows. First, create an `.actor` directory with platform configuration files: ``` mkdir .actor && touch .actor/{actor.json,Dockerfile,input_schema.json} ``` Then add [Apify SDK for Python](https://docs.apify.com/sdk/python/) as a project dependency: ``` uv add apify ``` ### Configure Dockerfile[​](#configure-dockerfile "Direct link to Configure Dockerfile") We’ll use the official [Apify Docker image](https://docs.apify.com/academy/deploying-your-code/docker-file) along with recommended [UV practices for Docker](https://docs.astral.sh/uv/guides/integration/docker/): ``` FROM apify/actor-python:3.13 ENV PATH='/app/.venv/bin:$PATH' WORKDIR /app COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ COPY pyproject.toml uv.lock ./ RUN uv sync --frozen --no-install-project --no-editable -q --no-dev COPY . . RUN uv sync --frozen --no-editable -q --no-dev CMD ["bluesky-crawlee"] ``` Here, `bluesky-crawlee` refers to the entrypoint specified in `pyproject.toml`. ### Define project metadata in actor.json[​](#define-project-metadata-in-actorjson "Direct link to Define project metadata in actor.json") The `actor.json` file contains project metadata for Apify Platform. Follow the [documentation for proper configuration](https://docs.apify.com/platform/actors/development/actor-definition/actor-json): ``` { "actorSpecification": 1, "name": "Bluesky-Crawlee", "title": "Bluesky - Crawlee", "minMemoryMbytes": 128, "maxMemoryMbytes": 2048, "description": "Scrape data products from bluesky", "version": "0.1", "meta": { "templateId": "bluesky-crawlee" }, "input": "./input_schema.json", "dockerfile": "./Dockerfile" } ``` ### Define Actor input parameters[​](#define-actor-input-parameters "Direct link to Define Actor input parameters") Our crawler requires several external parameters. 
Let’s define them: * identifier: User's Bluesky identifier (encrypted for security) * appPassword: Bluesky app password (encrypted) * queries: List of search queries for crawling * maxRequestsPerCrawl: Optional limit for testing * mode: Choose between collecting posts or data about the users who post on specific topics Configure the input schema following the [specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1): ``` { "title": "Bluesky - Crawlee", "type": "object", "schemaVersion": 1, "properties": { "identifier": { "title": "Bluesky identifier", "description": "Bluesky identifier for API login", "type": "string", "editor": "textfield", "isSecret": true }, "appPassword": { "title": "Bluesky app password", "description": "Bluesky app password for API", "type": "string", "editor": "textfield", "isSecret": true }, "maxRequestsPerCrawl": { "title": "Max requests per crawl", "description": "Maximum number of requests for crawling", "type": "integer" }, "queries": { "title": "Queries", "type": "array", "description": "Search queries", "editor": "stringList", "prefill": [ "apify" ], "example": [ "apify", "crawlee" ] }, "mode": { "title": "Mode", "type": "string", "description": "Collect posts or users who post on a topic", "enum": [ "posts", "users" ], "default": "posts" } }, "required": [ "identifier", "appPassword", "queries", "mode" ] } ``` ### Update project code[​](#update-project-code "Direct link to Update project code") Remove environment variables and parameterize the code according to the Actor input parameters. Replace named datasets with the default dataset. Add Actor logging: ``` # __init__.py import logging from apify.log import ActorLogFormatter handler = logging.StreamHandler() handler.setFormatter(ActorLogFormatter()) apify_client_logger = logging.getLogger('apify_client') apify_client_logger.setLevel(logging.INFO) apify_client_logger.addHandler(handler) apify_logger = logging.getLogger('apify') apify_logger.setLevel(logging.DEBUG) apify_logger.addHandler(handler) ``` Update imports and entry point code: ``` import asyncio import json import traceback from dataclasses import dataclass import httpx from apify import Actor from yarl import URL from crawlee import ConcurrencySettings, Request from crawlee.crawlers import HttpCrawler, HttpCrawlingContext from crawlee.http_clients import HttpxHttpClient @dataclass class ActorInput: """Actor input schema.""" identifier: str app_password: str queries: list[str] mode: str max_requests_per_crawl: int | None = None async def run() -> None: """Main execution function that orchestrates the crawling process. Creates a scraper instance, manages the session, and handles the complete crawling lifecycle including proper cleanup on completion or error.
""" async with Actor: raw_input = await Actor.get_input() actor_input = ActorInput( identifier=raw_input.get('identifier', ''), app_password=raw_input.get('appPassword', ''), queries=raw_input.get('queries', []), mode=raw_input.get('mode', 'posts'), max_requests_per_crawl=raw_input.get('maxRequestsPerCrawl') ) scraper = BlueskyApiScraper(actor_input.mode, actor_input.max_requests_per_crawl) try: scraper.create_session(actor_input.identifier, actor_input.app_password) await scraper.init_crawler() await scraper.crawl(actor_input.queries) except httpx.HTTPError as e: Actor.log.error(f'HTTP error occurred: {e}') raise except Exception as e: Actor.log.error(f'Unexpected error: {e}') traceback.print_exc() finally: scraper.delete_session() def main() -> None: """Entry point for the scraper application.""" asyncio.run(run()) ``` Update methods with Actor input parameters: ``` class BlueskyApiScraper: """A scraper class for extracting data from Bluesky social network using their official API. This scraper manages authentication, concurrent requests, and data collection for both posts and user profiles. It stores the collected post and user information in the Actor's default dataset. """ def __init__(self, mode: str, max_request: int | None) -> None: self._crawler: HttpCrawler | None = None self.mode = mode self.max_request = max_request # Variables for storing session data self._service_endpoint: str | None = None self._user_did: str | None = None self._access_token: str | None = None self._refresh_token: str | None = None self._handle: str | None = None def create_session(self, identifier: str, password: str) -> None: """Create credentials for the session.""" url = 'https://bsky.social/xrpc/com.atproto.server.createSession' headers = { 'Content-Type': 'application/json', } data = {'identifier': identifier, 'password': password} response = httpx.post(url, headers=headers, json=data) response.raise_for_status() data = response.json() self._service_endpoint = data['didDoc']['service'][0]['serviceEndpoint'] self._user_did = data['didDoc']['id'] self._access_token = data['accessJwt'] self._refresh_token = data['refreshJwt'] self._handle = data['handle'] ``` Implement mode-aware data collection logic: ``` async def _search_handler(self, context: HttpCrawlingContext) -> None: """Handle search requests based on mode.""" context.log.info(f'Processing search {context.request.url} ...') data = json.loads(context.http_response.read()) if 'posts' not in data: context.log.warning(f'No posts found in response: {context.request.url}') return user_requests = {} posts = [] profile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile') for post in data['posts']: if self.mode == 'users' and post['author']['did'] not in user_requests: user_requests[post['author']['did']] = Request.from_url( url=str(profile_url.with_query(actor=post['author']['did'])), user_data={'label': 'user'}, ) elif self.mode == 'posts': posts.append( { 'uri': post['uri'], 'cid': post['cid'], 'author_did': post['author']['did'], 'created': post['record']['createdAt'], 'indexed': post['indexedAt'], 'reply_count': post['replyCount'], 'repost_count': post['repostCount'], 'like_count': post['likeCount'], 'quote_count': post['quoteCount'], 'text': post['record']['text'], 'langs': '; '.join(post['record'].get('langs', [])), 'reply_parent': post['record'].get('reply', {}).get('parent', {}).get('uri'), 'reply_root': post['record'].get('reply', {}).get('root', {}).get('uri'), } ) if self.mode == 'posts': await context.push_data(posts) else: await
context.add_requests(list(user_requests.values())) if cursor := data.get('cursor'): next_url = URL(context.request.url).update_query({'cursor': cursor}) await context.add_requests([str(next_url)]) ``` Update the user handler for the default dataset: ``` async def _user_handler(self, context: HttpCrawlingContext) -> None: """Handle user profile requests.""" context.log.info(f'Processing user {context.request.url} ...') data = json.loads(context.http_response.read()) user_item = { 'did': data['did'], 'created': data['createdAt'], 'avatar': data.get('avatar'), 'description': data.get('description'), 'display_name': data.get('displayName'), 'handle': data['handle'], 'indexed': data.get('indexedAt'), 'posts_count': data['postsCount'], 'followers_count': data['followersCount'], 'follows_count': data['followsCount'], } await context.push_data(user_item) ``` ### Deploy[​](#deploy "Direct link to Deploy") Use the official [Apify CLI](https://docs.apify.com/cli/) to upload your code: Authenticate using your API token from [Apify Console](https://console.apify.com/settings/integrations): ``` apify login ``` Choose "Enter API token manually" and paste your token. Push the project to the platform: ``` apify push ``` Now you can configure runs on the Apify Platform. Let’s perform a test run: Fill in the input parameters: ![Actor Input](/assets/images/input_actor-20bb99df05dea1b2e799d92d6e3750f5.webp) Check that logging works correctly: ![Actor Log](/assets/images/actor_log-c74fa12a02ea0ff9ec3f77cfcb02bc52.webp) View results in the dataset: ![Dataset Results](/assets/images/actor_results-dca44d296e6897737ef338a19b7b2177.webp) If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this [publishing guide](https://docs.apify.com/platform/actors/publishing) for [Apify Store](https://apify.com/store). ## Conclusion and repository access[​](#conclusion-and-repository-access "Direct link to Conclusion and repository access") We’ve created an efficient crawler for Bluesky using the official API. If you want to learn more about this topic for regular data extraction from Bluesky, I recommend exploring [custom feed generation](https://docs.bsky.app/docs/starter-templates/custom-feeds) - I think it opens up some interesting possibilities. And if you need to quickly create a crawler that can retrieve data for various queries, you now have everything you need. You can find the complete code in the [repository](https://github.com/Mantisus/bluesky-crawlee). If you enjoyed this blog, feel free to support Crawlee for Python by starring the [repository](https://github.com/apify/crawlee-python) or joining the maintainer team. Have questions or want to discuss implementation details? Join our [Discord](https://discord.com/invite/jyEM2PRvMU) - our community of 10,000+ developers is there to help. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # How to scrape Crunchbase using Python in 2024 (Easy Guide) January 3, 2025 · 13 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert Python developers know the drill: you need reliable company data, and Crunchbase has it. This guide shows you how to build an effective [Crunchbase](https://www.crunchbase.com/) scraper in Python that gets you the data you need. Crunchbase tracks details that matter: locations, business focus, founders, and investment histories.
Manual extraction from such a large dataset isn't practical -automation is essential for transforming this information into an analyzable format. By the end of this blog, we'll explore three different ways to extract data from Crunchbase using [`Crawlee for Python`](https://github.com/apify/crawlee-python). We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us better understand how important it is to properly [choose the right data source](https://www.crawlee.dev/blog/web-scraping-tips#1-choosing-a-data-source-for-the-project). note This guide comes from a developer in our growing community. Have you built interesting projects with Crawlee? Join us on [Discord](https://discord.com/invite/jyEM2PRvMU) to share your experiences and blog ideas - we value these contributions from developers like you. ![How to Scrape Crunchbase Using Python](/assets/images/scrape_crunchbase-28a71b5380492fe6618bbd9c90989543.webp) Key steps we'll cover: 1. Project setup 2. Choosing the data source 3. Implementing sitemap-based crawler 4. Analysis of search-based approach and its limitations 5. Implementing the official API crawler 6. Conclusion and repository access ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") * Python 3.9 or higher * Familiarity with web scraping concepts * Crawlee for Python `v0.5.0` * poetry `v2.0` or higher ### Project setup[​](#project-setup "Direct link to Project setup") Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (`Playwright` and `Beautifulsoup`), so we'll set up the project manually. 1. Install [`Poetry`](https://python-poetry.org/) ``` pipx install poetry ``` 2. Create and navigate to the project folder. ``` mkdir crunchbase-crawlee && cd crunchbase-crawlee ``` 3. Initialize the project using Poetry, leaving all fields empty. ``` poetry init ``` When prompted: * For "Compatible Python versions", enter: `>={your Python version},<4.0` (For example, if you're using Python 3.10, enter: `>=3.10,<4.0`) * Leave all other fields empty by pressing Enter * Confirm the generation by typing "yes" 4. Add and install Crawlee with necessary dependencies to your project using `Poetry.` ``` poetry add crawlee[parsel,curl-impersonate] ``` 5. Complete the project setup by creating the standard file structure for `Crawlee for Python` projects. ``` mkdir crunchbase-crawlee && touch crunchbase-crawlee/{__init__.py,__main__.py,main.py,routes.py} ``` After setting up the basic project structure, we can explore different methods of obtaining data from Crunchbase. ### Choosing the data source[​](#choosing-the-data-source "Direct link to Choosing the data source") While we can extract target data directly from the [company page](https://www.crunchbase.com/organization/apify), we need to choose the best way to navigate the site. A careful examination of Crunchbase's structure shows that we have three main options for obtaining data: 1. [`Sitemap`](https://www.crunchbase.com/www-sitemaps/sitemap-index.xml) - for complete site traversal. 2. [`Search`](https://www.crunchbase.com/discover/organization.companies) - for targeted data collection. 3. [Official API](https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-getting-started) - recommended method. Let's examine each of these approaches in detail. 
## Scraping Crunchbase using sitemap and Crawlee for Python[​](#scraping-crunchbase-using-sitemap-and-crawlee-for-python "Direct link to Scraping Crunchbase using sitemap and Crawlee for Python") `Sitemap` is a standard way of site navigation used by crawlers like [`Google`](https://google.com/), [`Ahrefs`](https://ahrefs.com/), and other search engines. All crawlers must follow the rules described in [`robots.txt`](https://www.crunchbase.com/robots.txt). Let's look at the structure of Crunchbase's Sitemap: ![Sitemap first lvl](/assets/images/sitemap_lvl_one-553a6b9df5c5d3c35a8987878456fe7b.webp) As you can see, links to organization pages are located inside second-level `Sitemap` files, which are compressed using `gzip`. The structure of one of these files looks like this: ![Sitemap second lvl](/assets/images/sitemap_lvl_two-8f3213f305713ebf8bf91b32febfa234.webp) The `lastmod` field is particularly important here. It allows tracking which companies have updated their information since the previous data collection. This is especially useful for regular data updates. ### 1. Configuring the crawler for scraping[​](#1-configuring-the-crawler-for-scraping "Direct link to 1. Configuring the crawler for scraping") To work with the site, we'll use [`CurlImpersonateHttpClient`](https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient), which impersonates a `Safari` browser. While this choice might seem unexpected for working with a sitemap, it's necessitated by Crunchbase's protection features. The reason is that Crunchbase uses [Cloudflare](https://www.cloudflare.com/) to protect against automated access. This is clearly visible when analyzing traffic on a company page: ![Cloudflare Link](/assets/images/cloudflare_link-bf8b6ba2c873ccb31463258e5964e39b.webp) An interesting feature is that `challenges.cloudflare` is executed after loading the document with data. This means we receive the data first, and only then JavaScript checks if we're a bot. If our HTTP client's fingerprint is sufficiently similar to a real browser, we'll successfully receive the data. Cloudflare [also analyzes traffic at the sitemap level](https://developers.cloudflare.com/waf/custom-rules/use-cases/allow-traffic-from-verified-bots/). If our crawler doesn't look legitimate, access will be blocked. That's why we impersonate a real browser. To prevent blocks due to overly aggressive crawling, we'll configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). When scaling this approach, you'll likely need proxies. Detailed information about proxy setup can be found in the [documentation](https://www.crawlee.dev/python/docs/guides/proxy-management). We'll save our scraping results in `JSON` format. 
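As a side note on the proxy point above: when you do need proxies, adding them is a small change to the crawler setup shown next. Here is a minimal sketch, assuming you already have your own proxy URLs (the example hostnames are placeholders, and the exact options are described in the proxy management documentation linked above):

```
# main.py (optional) - wiring proxies into the crawler configured below
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://my-proxy-1.example.com:8000',
        'http://my-proxy-2.example.com:8000',
    ],
)

# Then pass it to the crawler alongside the other options:
# crawler = ParselCrawler(..., proxy_configuration=proxy_configuration)
```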
Here's how the basic crawler configuration looks: ``` # main.py from crawlee import ConcurrencySettings, HttpHeaders from crawlee.crawlers import ParselCrawler from crawlee.http_clients import CurlImpersonateHttpClient from .routes import router async def main() -> None: """The crawler entry point.""" concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=50) http_client = CurlImpersonateHttpClient( impersonate='safari17_0', headers=HttpHeaders( { 'accept-language': 'en', 'accept-encoding': 'gzip, deflate, br, zstd', } ), ) crawler = ParselCrawler( request_handler=router, max_request_retries=1, concurrency_settings=concurrency_settings, http_client=http_client, max_requests_per_crawl=30, ) await crawler.run(['https://www.crunchbase.com/www-sitemaps/sitemap-index.xml']) await crawler.export_data_json('crunchbase_data.json') ``` ### 2. Implementing sitemap navigation[​](#2-implementing-sitemap-navigation "Direct link to 2. Implementing sitemap navigation") Sitemap navigation happens in two stages. In the first stage, we need to get a list of all files containing organization information: ``` # routes.py from crawlee.crawlers import ParselCrawlingContext from crawlee.router import Router from crawlee import Request router = Router[ParselCrawlingContext]() @router.default_handler async def default_handler(context: ParselCrawlingContext) -> None: """Default request handler.""" context.log.info(f'default_handler processing {context.request} ...') requests = [ Request.from_url(url, label='sitemap') for url in context.selector.xpath('//loc[contains(., "sitemap-organizations")]/text()').getall() ] # Since this is a tutorial, I don't want to upload more than one sitemap link await context.add_requests(requests, limit=1) ``` In the second stage, we process second-level sitemap files stored in `gzip` format. This requires a special approach as the data needs to be decompressed first: ``` # routes.py from gzip import decompress from parsel import Selector @router.handler('sitemap') async def sitemap_handler(context: ParselCrawlingContext) -> None: """Sitemap gzip request handler.""" context.log.info(f'sitemap_handler processing {context.request.url} ...') data = context.http_response.read() data = decompress(data) selector = Selector(data.decode()) requests = [Request.from_url(url, label='company') for url in selector.xpath('//loc/text()').getall()] await context.add_requests(requests) ``` ### 3. Extracting and saving data[​](#3-extracting-and-saving-data "Direct link to 3. Extracting and saving data") Each company page contains a large amount of information. For demonstration purposes, we'll focus on the main fields: `Company Name`, `Short Description`, `Website`, and `Location`. 
One of Crunchbase's advantages is that all data is stored in `JSON` format within the page: ![Company Data](/assets/images/data_json-7c79a7387510a995f29ba5ce157f0845.webp) This significantly simplifies data extraction - we only need to use one `Xpath` selector to get the `JSON`, and then apply [`jmespath`](https://jmespath.org/) to extract the needed fields: ``` # routes.py @router.handler('company') async def company_handler(context: ParselCrawlingContext) -> None: """Company request handler.""" context.log.info(f'company_handler processing {context.request.url} ...') json_selector = context.selector.xpath('//*[@id="ng-state"]/text()') await context.push_data( { 'Company Name': json_selector.jmespath('HttpState.*.data[].properties.identifier.value').get(), 'Short Description': json_selector.jmespath('HttpState.*.data[].properties.short_description').get(), 'Website': json_selector.jmespath('HttpState.*.data[].cards.company_about_fields2.website.value').get(), 'Location': '; '.join( json_selector.jmespath( 'HttpState.*.data[].cards.company_about_fields2.location_identifiers[].value' ).getall() ), } ) ``` The collected data is saved in `Crawlee for Python`'s internal storage using the `context.push_data` method. When the crawler finishes, we export all collected data to a JSON file: ``` # main.py await crawler.export_data_json('crunchbase_data.json') ``` ### 4. Running the project[​](#4-running-the-project "Direct link to 4. Running the project") With all components in place, we need to create an entry point for our crawler: ``` # __main__.py import asyncio from .main import main if __name__ == '__main__': asyncio.run(main()) ``` Execute the crawler using Poetry: ``` poetry run python -m crunchbase-crawlee ``` ### 5. Finally, characteristics of using the sitemap crawler[​](#5-finally-characteristics-of-using-the-sitemap-crawler "Direct link to 5. Finally, characteristics of using the sitemap crawler") The sitemap approach has its distinct advantages and limitations. It's ideal in the following cases: * When you need to collect data about all companies on the platform * When there are no specific company selection criteria * If you have sufficient time and computational resources However, there are significant limitations to consider: * Almost no ability to filter data during collection * Requires constant monitoring of Cloudflare blocks * Scaling the solution requires proxy servers, which increases project costs ## Using search for scraping Crunchbase[​](#using-search-for-scraping-crunchbase "Direct link to Using search for scraping Crunchbase") The limitations of the sitemap approach might point to search as the next solution. However, Crunchbase applies tighter security measures to its search functionality compared to its public pages. The key difference lies in how Cloudflare protection works. While we receive data before the `challenges.cloudflare` check when accessing a company page, the search API requires valid `cookies` that have passed this check. Let's verify this in practice. Open the following link in Incognito mode: ``` ``` When analyzing the traffic, we'll see the following pattern: ![Search Protect](/assets/images/search_protect-3b4a1a1934d54c12ac210217919b8b88.webp) The sequence of events here is: 1. First, the page is blocked with code `403` 2. Then the `challenges.cloudflare` check is performed 3. 
Only after successfully passing the check do we receive data with code `200` Automating this process would require a `headless` browser capable of bypassing [`Cloudflare Turnstile`](https://www.cloudflare.com/application-services/products/turnstile/). The current version of `Crawlee for Python` (v0.5.0) doesn't provide this functionality, although it's planned for future development. You can extend the capabilities of Crawlee for Python by integrating [`Camoufox`](https://camoufox.com/) following this [example.](https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox) ## Working with the official Crunchbase API[​](#working-with-the-official-crunchbase-api "Direct link to Working with the official Crunchbase API") Crunchbase provides a [free API](https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-using-api) with basic functionality. Paid subscription users get expanded data access. Complete documentation for available endpoints can be found in the [official API specification](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api). ### 1. Setting up API access[​](#1-setting-up-api-access "Direct link to 1. Setting up API access") To start working with the API, follow these steps: 1. [Create a Crunchbase account](https://www.crunchbase.com/register) 2. Go to the Integrations section 3. Create a Crunchbase Basic API key Although the documentation states that key activation may take up to an hour, it usually starts working immediately after creation. ### 2. Configuring the crawler for API work[​](#2-configuring-the-crawler-for-api-work "Direct link to 2. Configuring the crawler for API work") An important API feature is the limit - no more than 200 requests per minute, but in the free version, this number is significantly lower. Taking this into account, let's configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). Since we're working with the official API, we don't need to mask our HTTP client. We'll use the standard ['HttpxHttpClient'](https://www.crawlee.dev/python/api/class/HttpxHttpClient) with preset headers. First, let's save the API key in an environment variable: ``` export CRUNCHBASE_TOKEN={YOUR KEY} ``` Here's how the crawler configuration for working with the API looks: ``` # main.py import os from crawlee.crawlers import HttpCrawler from crawlee.http_clients import HttpxHttpClient from crawlee import ConcurrencySettings, HttpHeaders from .routes import router CRUNCHBASE_TOKEN = os.getenv('CRUNCHBASE_TOKEN', '') async def main() -> None: """The crawler entry point.""" concurrency_settings = ConcurrencySettings(max_tasks_per_minute=60) http_client = HttpxHttpClient( headers=HttpHeaders({'accept-encoding': 'gzip, deflate, br, zstd', 'X-cb-user-key': CRUNCHBASE_TOKEN}) ) crawler = HttpCrawler( request_handler=router, concurrency_settings=concurrency_settings, http_client=http_client, max_requests_per_crawl=30, ) await crawler.run( ['https://api.crunchbase.com/api/v4/autocompletes?query=apify&collection_ids=organizations&limit=25'] ) await crawler.export_data_json('crunchbase_data.json') ``` ### 3. Processing search results[​](#3-processing-search-results "Direct link to 3. Processing search results") For working with the API, we'll need two main endpoints: 1. [get\_autocompletes](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Autocomplete/get_autocompletes) - for searching 2. 
[get\_entities\_organizations\_\_entity\_id](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Entity/get_entities_organizations__entity_id_) - for getting data

First, let's implement search results processing:

```
import json

from crawlee import Request
from crawlee.crawlers import HttpCrawlingContext
from crawlee.router import Router

router = Router[HttpCrawlingContext]()


@router.default_handler
async def default_handler(context: HttpCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'default_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    requests = []
    for entity in data['entities']:
        permalink = entity['identifier']['permalink']
        requests.append(
            Request.from_url(
                url=f'https://api.crunchbase.com/api/v4/entities/organizations/{permalink}?field_ids=short_description%2Clocation_identifiers%2Cwebsite_url',
                label='company',
            )
        )

    await context.add_requests(requests)
```

### 4. Extracting company data[​](#4-extracting-company-data "Direct link to 4. Extracting company data")

After getting the list of companies, we extract detailed information about each one:

```
@router.handler('company')
async def company_handler(context: HttpCrawlingContext) -> None:
    """Company request handler."""
    context.log.info(f'company_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    await context.push_data(
        {
            'Company Name': data['properties']['identifier']['value'],
            'Short Description': data['properties']['short_description'],
            'Website': data['properties'].get('website_url'),
            'Location': '; '.join([item['value'] for item in data['properties'].get('location_identifiers', [])]),
        }
    )
```

### 5. Advanced location-based search[​](#5-advanced-location-based-search "Direct link to 5. Advanced location-based search")

If you need more flexible search capabilities, the API provides a special [`search`](https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Search/post_searches_organizations) endpoint.
Here's an example of searching for all companies in Prague: ``` payload = { 'field_ids': ['identifier', 'location_identifiers', 'short_description', 'website_url'], 'limit': 200, 'order': [{'field_id': 'rank_org', 'sort': 'asc'}], 'query': [ { 'field_id': 'location_identifiers', 'operator_id': 'includes', 'type': 'predicate', 'values': ['e0b951dc-f710-8754-ddde-5ef04dddd9f8'], }, {'field_id': 'facet_ids', 'operator_id': 'includes', 'type': 'predicate', 'values': ['company']}, ], } serialiazed_payload = json.dumps(payload) await crawler.run( [ Request.from_url( url='https://api.crunchbase.com/api/v4/searches/organizations', method='POST', payload=serialiazed_payload, use_extended_unique_key=True, headers=HttpHeaders({'Content-Type': 'application/json'}), label='search', ) ] ) ``` For processing search results and pagination, we use the following handler: ``` @router.handler('search') async def search_handler(context: HttpCrawlingContext) -> None: """Search results handler with pagination support.""" context.log.info(f'search_handler processing {context.request.url} ...') data = json.loads(context.http_response.read()) last_entity = None results = [] for entity in data['entities']: last_entity = entity['uuid'] results.append( { 'Company Name': entity['properties']['identifier']['value'], 'Short Description': entity['properties']['short_description'], 'Website': entity['properties'].get('website_url'), 'Location': '; '.join([item['value'] for item in entity['properties'].get('location_identifiers', [])]), } ) if results: await context.push_data(results) if last_entity: payload = json.loads(context.request.payload) payload['after_id'] = last_entity payload = json.dumps(payload) await context.add_requests( [ Request.from_url( url='https://api.crunchbase.com/api/v4/searches/organizations', method='POST', payload=payload, use_extended_unique_key=True, headers=HttpHeaders({'Content-Type': 'application/json'}), label='search', ) ] ) ``` ### 6. Finally, free API limitations[​](#6-finally-free-api-limitations "Direct link to 6. Finally, free API limitations") The free version of the API has significant limitations: * Limited set of available endpoints * Autocompletes function only works for company searches * Not all data fields are accessible * Limited search filtering capabilities Consider a paid subscription for production-level work. The API provides the most reliable way to access Crunchbase data, even with its rate constraints. ## What’s your best path forward?[​](#whats-your-best-path-forward "Direct link to What’s your best path forward?") We've explored three different approaches to obtaining data from Crunchbase: 1. **Sitemap** - for large-scale data collection 2. **Search** - difficult to automate due to Cloudflare protection 3. **Official API** - the most reliable solution for commercial projects Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version. The complete source code is available in my [repository](https://github.com/Mantisus/crunchbase-crawlee). Have questions or want to discuss implementation details? Join our [Discord](https://discord.com/invite/jyEM2PRvMU) - our community of developers is there to help. 
**Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # How to scrape Google Maps data using Python December 13, 2024 · 12 min read [![Satyam Tripathi](https://avatars.githubusercontent.com/u/69134468?v=4)](https://github.com/triposat) [Satyam Tripathi](https://github.com/triposat) Community Member of Crawlee Millions of people use Google Maps daily, leaving behind a goldmine of data just waiting to be analyzed. In this guide, I'll show you how to build a reliable scraper using Crawlee and Python to extract locations, ratings, and reviews from Google Maps, all while handling its dynamic content challenges. note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). ## What data will we extract from Google Maps?[​](#what-data-will-we-extract-from-google-maps "Direct link to What data will we extract from Google Maps?") We’ll collect information about hotels in a specific city. You can also customize your search to meet your requirements. For example, you might search for "hotels near me", "5-star hotels in Bombay", or other similar queries. ![Google Maps Data Screenshot](/assets/images/scrape-google-maps-with-crawlee-screenshot-data-to-scrape-00e7e4e3498679b8a7611eafd0a1bfbe.webp) We’ll extract important data, including the hotel name, rating, review count, price, a link to the hotel page on Google Maps, and all available amenities. Here’s an example of what the extracted data will look like: ``` { "name": "Vividus Hotels, Bangalore", "rating": "4.3", "reviews": "633", "price": "₹3,667", "amenities": [ "Pool available", "Free breakfast available", "Free Wi-Fi available", "Free parking available" ], "link": "https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..." } ``` ## Building a Google Maps scraper[​](#building-a-google-maps-scraper "Direct link to Building a Google Maps scraper") Let's build a Google Maps scraper step-by-step. note Crawlee requires Python 3.9 or later. ### 1. Setting up your environment[​](#1-setting-up-your-environment "Direct link to 1. Setting up your environment") First, let's set up everything you’ll need to run the scraper. Open your terminal and run these commands: ``` # Create and activate a virtual environment python -m venv google-maps-scraper # Windows: .\google-maps-scraper\Scripts\activate # Mac/Linux: source google-maps-scraper/bin/activate # We plan to use Playwright with Crawlee, so we need to install both: pip install crawlee "crawlee[playwright]" playwright install ``` *If you're new to **Crawlee**, check out its easy-to-follow documentation. It’s available for both [Node.js](https://www.crawlee.dev/js/docs/quick-start) and [Python](https://www.crawlee.dev/python/docs/quick-start).* note Before going ahead with the project, I'd like to ask you to star Crawlee for Python on [GitHub](https://github.com/apify/crawlee-python/), it helps us to spread the word to fellow scraper developers. ### 2. Connecting to Google Maps[​](#2-connecting-to-google-maps "Direct link to 2. Connecting to Google Maps") Let's see the steps to connect to Google Maps. **Step 1: Setting up the crawler** The first step is to configure the crawler. We're using [`PlaywrightCrawler`](https://www.crawlee.dev/python/api/class/PlaywrightCrawler) from Crawlee, which gives us powerful tools for automated browsing. 
We set `headless=False` to make the browser visible during scraping and allow 5 minutes for the pages to load. ``` from crawlee.playwright_crawler import PlaywrightCrawler from datetime import timedelta # Initialize crawler with browser visibility and timeout settings crawler = PlaywrightCrawler( headless=False, # Shows the browser window while scraping request_handler_timeout=timedelta( minutes=5 ), # Allows plenty of time for page loading ) ``` **Step 2: Handling each page** This function defines how each page is handled when the crawler visits it. It uses `context.page` to navigate to the target URL. ``` async def scrape_google_maps(context): """ Establishes connection to Google Maps and handles the initial page load """ page = context.page await page.goto(context.request.url) context.log.info(f"Processing: {context.request.url}") ``` **Step 3: Launching the crawler** Finally, the main function brings everything together. It creates a search URL, sets up the crawler, and starts the scraping process. ``` import asyncio async def main(): # Prepare the search URL search_query = "hotels in bengaluru" start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}" # Tell the crawler how to handle each page it visits crawler.router.default_handler(scrape_google_maps) # Start the scraping process await crawler.run([start_url]) if __name__ == "__main__": asyncio.run(main()) ``` Let’s combine the above code snippets and save them in a file named `gmap_scraper.py`: ``` from crawlee.playwright_crawler import PlaywrightCrawler from datetime import timedelta import asyncio async def scrape_google_maps(context): """ Establishes connection to Google Maps and handles the initial page load """ page = context.page await page.goto(context.request.url) context.log.info(f"Processing: {context.request.url}") async def main(): """ Configures and launches the crawler with custom settings """ # Initialize crawler with browser visibility and timeout settings crawler = PlaywrightCrawler( headless=False, # Shows the browser window while scraping request_handler_timeout=timedelta( minutes=5 ), # Allows plenty of time for page loading ) # Tell the crawler how to handle each page it visits crawler.router.default_handler(scrape_google_maps) # Prepare the search URL search_query = "hotels in bengaluru" start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}" # Start the scraping process await crawler.run([start_url]) if __name__ == "__main__": asyncio.run(main()) ``` Run the code using: ``` $ python3 gmap_scraper.py ``` When everything works correctly, you'll see the output like this: ![Connect to page](/assets/images/scrape-google-maps-with-crawlee-screenshot-connect-to-page-6d6391022d64446a161825935a307d8d.png) ### 3. Import dependencies and defining Scraper Class[​](#3-import-dependencies-and-defining-scraper-class "Direct link to 3. 
Import dependencies and defining Scraper Class") Let's start with the basic structure and necessary imports: ``` import asyncio from datetime import timedelta from typing import Dict, Optional, Set from crawlee.playwright_crawler import PlaywrightCrawler from playwright.async_api import Page, ElementHandle ``` The `GoogleMapsScraper` class serves as the main scraper engine: ``` class GoogleMapsScraper: def __init__(self, headless: bool = True, timeout_minutes: int = 5): self.crawler = PlaywrightCrawler( headless=headless, request_handler_timeout=timedelta(minutes=timeout_minutes), ) self.processed_names: Set[str] = set() async def setup_crawler(self) -> None: self.crawler.router.default_handler(self._scrape_listings) ``` This initialization code sets up two crucial components: 1. A `PlaywrightCrawler` instance configured to run either headlessly (without a visible browser window) or with a visible browser 2. A set to track processed business names, preventing duplicate entries The `setup_crawler` method configures the crawler to use our main scraping function as the default handler for all requests. ### 4. Understanding Google Maps internal code structure[​](#4-understanding-google-maps-internal-code-structure "Direct link to 4. Understanding Google Maps internal code structure") Before we dive into scraping, let's understand exactly what elements we need to target. When you search for hotels in Bengaluru, Google Maps organizes hotel information in a specific structure. Here's a detailed breakdown of how to locate each piece of information. **Hotel name:** ![Hotel name](/assets/images/scrape-google-maps-with-crawlee-screenshot-name-d1fcc59eb4e3eec109fcbf5be0237fbc.webp) **Hotel rating:** ![Hotel rating](/assets/images/scrape-google-maps-with-crawlee-screenshot-ratings-7748ca46b1e14126de728add8313d286.webp) **Hotel review count:** ![Hotel Review Count](/assets/images/scrape-google-maps-with-crawlee-screenshot-reviews-521c92ebf7eeefb615659e0cd9cce6eb.webp) **Hotel URL:** ![Hotel URL](/assets/images/scrape-google-maps-with-crawlee-screenshot-url-ef8f37822fe579765ece5c37c1f8fdeb.webp) **Hotel Price:** ![Hotel Price](/assets/images/scrape-google-maps-with-crawlee-screenshot-price-a2ab8516020bfcbfd6054d889f871743.webp) **Hotel amenities:** This returns multiple elements as each hotel has several amenities. We'll need to iterate through these. ![Hotel amenities](/assets/images/scrape-google-maps-with-crawlee-screenshot-amenities-8a138b2fc9d7c4fad6a81bec55ee5db7.webp) **Quick tips:** * Always verify these selectors before scraping, as Google might update them. * Use Chrome DevTools (F12) to inspect elements and confirm selectors. * Some elements might not be present for all hotels (like prices during the off-season). ### 5. Scraping Google Maps data using identified selectors[​](#5-scraping-google-maps-data-using-identified-selectors "Direct link to 5. Scraping Google Maps data using identified selectors") Let's build a scraper to extract detailed hotel information from Google Maps. First, create the core scraping function to handle data extraction. 
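To keep the selectors from the screenshots above in one place, here's a small reference map. The `LISTING_SELECTORS` name is purely illustrative - the scraper below uses these literals inline - and the values should be re-checked in DevTools before each run, since Google regenerates its class names:

```
# Reference only: CSS selectors for a single listing card, as used below.
LISTING_SELECTORS = {
    "listing_card": ".Nv2PK",   # container of one search result
    "name": ".qBF1Pd",          # hotel name
    "rating": ".MW4etd",        # star rating
    "reviews": ".UY7F9",        # review count (wrapped in parentheses)
    "price": ".wcldff",         # nightly price (may be absent)
    "link": "a.hfpxzc",         # link to the hotel's Google Maps page
    "amenities": ".dc6iWb",     # one element per amenity
}
```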
*gmap\_scraper.py:*

```
async def _extract_listing_data(self, listing: ElementHandle) -> Optional[Dict]:
    """Extract structured data from a single listing element."""
    try:
        name_el = await listing.query_selector(".qBF1Pd")
        if not name_el:
            return None

        name = await name_el.inner_text()
        if name in self.processed_names:
            return None

        elements = {
            "rating": await listing.query_selector(".MW4etd"),
            "reviews": await listing.query_selector(".UY7F9"),
            "price": await listing.query_selector(".wcldff"),
            "link": await listing.query_selector("a.hfpxzc"),
            "address": await listing.query_selector(".W4Efsd:nth-child(2)"),
            "category": await listing.query_selector(".W4Efsd:nth-child(1)"),
        }

        amenities = []
        amenities_els = await listing.query_selector_all(".dc6iWb")
        for amenity in amenities_els:
            amenity_text = await amenity.get_attribute("aria-label")
            if amenity_text:
                amenities.append(amenity_text)

        place_data = {
            "name": name,
            "rating": await elements["rating"].inner_text() if elements["rating"] else None,
            "reviews": (await elements["reviews"].inner_text()).strip("()") if elements["reviews"] else None,
            "price": await elements["price"].inner_text() if elements["price"] else None,
            "address": await elements["address"].inner_text() if elements["address"] else None,
            "category": await elements["category"].inner_text() if elements["category"] else None,
            "amenities": amenities if amenities else None,
            "link": await elements["link"].get_attribute("href") if elements["link"] else None,
        }

        self.processed_names.add(name)
        return place_data
    except Exception as e:
        # No crawling context is available inside this helper, so report the error directly
        print(f"Error extracting listing data: {e}")
        return None
```

In the code:

* `query_selector`: Returns the first DOM element matching a CSS selector, useful for single items like a name or rating
* `query_selector_all`: Returns all matching elements, ideal for multiple items like amenities
* `inner_text()`: Extracts the text content of an element
* Some hotels might not have all the information available - we handle this by storing `None` for the missing fields

When you run this script, you'll see output similar to this:

```
{
    "name": "GRAND KALINGA HOTEL",
    "rating": "4.2",
    "reviews": "1,171",
    "price": "\u20b91,760",
    "link": "https://www.google.com/maps/place/GRAND+KALINGA+HOTEL/data=!4m10!3m9!1s0x3bae160e0ce07789:0xb15bf736f4238e6a!5m2!4m1!1i2!8m2!3d12.9762259!4d77.5786043!16s%2Fg%2F11sp32pz28!19sChIJiXfgDA4WrjsRao4j9Db3W7E?authuser=0&hl=en&rclk=1",
    "amenities": [
        "Pool available",
        "Free breakfast available",
        "Free Wi-Fi available",
        "Free parking available"
    ]
}
```

### 6. Managing Infinite Scrolling[​](#6-managing-infinite-scrolling "Direct link to 6. Managing Infinite Scrolling")

Google Maps uses infinite scrolling to load more results as users scroll down. We handle this with a dedicated method that scrolls the results feed and detects when we've hit the bottom.
Copy-paste this new function in the `gmap_scraper.py` file:

```
async def _load_more_items(self, page: Page) -> bool:
    """Scroll down to load more items."""
    try:
        feed = await page.query_selector('div[role="feed"]')
        if not feed:
            return False

        prev_scroll = await feed.evaluate("(element) => element.scrollTop")
        await feed.evaluate("(element) => element.scrollTop += 800")
        await page.wait_for_timeout(2000)

        new_scroll = await feed.evaluate("(element) => element.scrollTop")
        if new_scroll <= prev_scroll:
            return False

        await page.wait_for_timeout(1000)
        return True
    except Exception as e:
        # No crawling context is available inside this helper, so report the error directly
        print(f"Error during scroll: {e}")
        return False
```

Run this code using:

```
$ python3 gmap_scraper.py
```

You should see an output like this:

![scrape-google-maps-with-crawlee-screenshot-handle-pagination](/assets/images/scrape-google-maps-with-crawlee-screenshot-handle-pagination-319232595ced535f175346ae0003e32f.webp)

### 7. Scrape Listings[​](#7-scrape-listings "Direct link to 7. Scrape Listings")

The main scraping function ties everything together. It scrapes listings from the page by repeatedly extracting data and scrolling.

```
async def _scrape_listings(self, context) -> None:
    """Main scraping function to process all listings"""
    try:
        page = context.page
        print(f"\nProcessing URL: {context.request.url}\n")

        await page.wait_for_selector(".Nv2PK", timeout=30000)
        await page.wait_for_timeout(2000)

        while True:
            listings = await page.query_selector_all(".Nv2PK")
            new_items = 0

            for listing in listings:
                place_data = await self._extract_listing_data(listing)
                if place_data:
                    await context.push_data(place_data)
                    new_items += 1
                    print(f"Processed: {place_data['name']}")

            if new_items == 0 and not await self._load_more_items(page):
                break

            if new_items > 0:
                await self._load_more_items(page)

        print(f"\nFinished processing! Total items: {len(self.processed_names)}")
    except Exception as e:
        print(f"Error in scraping: {str(e)}")
```

The scraper uses Crawlee's built-in storage system to manage scraped data. When you run the scraper, it creates a `storage` directory in your project with several key components:

* `datasets/`: Contains the scraped results in JSON format
* `key_value_stores/`: Stores crawler state and metadata
* `request_queues/`: Manages URLs to be processed

The `push_data()` method we use in our scraper sends the data to Crawlee's dataset storage as you can see below:

![Crawlee push\_data](/assets/images/How-to-scrape-Google-Maps-data-using-Python-and-Crawlee-metadata-a27257a5ffffad0fdcc598064445fe57.webp)

### 8. Running the Scraper[​](#8-running-the-scraper "Direct link to 8. 
Running the Scraper") Finally, we need functions to execute our scraper: ``` async def run(self, search_query: str) -> None: """Execute the scraper with a search query""" try: await self.setup_crawler() start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}" await self.crawler.run([start_url]) await self.crawler.export_data_json('gmap_data.json') except Exception as e: print(f"Error running scraper: {str(e)}") async def main(): """Entry point of the script""" scraper = GoogleMapsScraper(headless=True) search_query = "hotels in bengaluru" await scraper.run(search_query) if __name__ == "__main__": asyncio.run(main()) ``` This data is automatically stored and can later be exported to a JSON file using: ``` await self.crawler.export_data_json('gmap_data.json') ``` Here's what your exported JSON file will look like: ``` [ { "name": "Vividus Hotels, Bangalore", "rating": "4.3", "reviews": "633", "price": "₹3,667", "amenities": [ "Pool available", "Free breakfast available", "Free Wi-Fi available", "Free parking available" ], "link": "https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..." } ] ``` ### 9. Using proxies for Google Maps scraping[​](#9-using-proxies-for-google-maps-scraping "Direct link to 9. Using proxies for Google Maps scraping") When scraping Google Maps at scale, using proxies is very helpful. Here are a few key reasons why: 1. **Avoid IP blocks**: Google Maps can detect and block IP addresses that make an excessive number of requests in a short time. Using proxies helps you stay under the radar. 2. **Bypass rate limits**: Google implements strict limits on the number of requests per IP address. By rotating through multiple IPs, you can maintain a consistent scraping pace without hitting these limits. 3. **Access location-specific data**: Different regions may display different data on Google Maps. Proxies allow you to view listings as if you are browsing from any specific location. Here's a simple implementation using Crawlee's built-in proxy management. Update your previous code with this to use proxy settings. ``` from crawlee.playwright_crawler import PlaywrightCrawler from crawlee.proxy_configuration import ProxyConfiguration # Configure your proxy settings proxy_configuration = ProxyConfiguration( proxy_urls=[ "http://username:password@proxy.provider.com:12345", # Add more proxy URLs as needed ] ) # Initialize crawler with proxy support crawler = PlaywrightCrawler( headless=True, request_handler_timeout=timedelta(minutes=5), proxy_configuration=proxy_configuration, ) ``` Here, I use a proxy to scrape hotel data in New York City. ![Using a proxy](/assets/images/scrape-google-maps-with-crawlee-screenshot-proxies-5c4dece0247a87e7d338328c472cea74.webp) Here's an example of data scraped from New York City hotels using proxies: ``` { "name": "The Manhattan at Times Square Hotel", "rating": "3.1", "reviews": "8,591", "price": "$120", "amenities": [ "Free parking available", "Free Wi-Fi available", "Air-conditioned available", "Breakfast available" ], "link": "https://www.google.com/maps/place/..." } ``` ### 10. Project: Interactive hotel analysis dashboard[​](#10-project-interactive-hotel-analysis-dashboard "Direct link to 10. Project: Interactive hotel analysis dashboard") After scraping hotel data from Google Maps, you can build an interactive dashboard that helps analyze hotel trends. 
Here’s a preview of how the dashboard works: ![Final dashboard](/assets/images/scrape-google-maps-with-crawlee-screenshot-hotel-analysis-dashboard-c14806409a7c1db63943f58d855aa07e.webp) Find the complete info for this dashboard on GitHub: [Hotel Analysis Dashboard](https://github.com/triposat/Hotel-Analytics-Dashboard). ### 11. Now you’re ready to put everything into action\![​](#11-now-youre-ready-to-put-everything-into-action "Direct link to 11. Now you’re ready to put everything into action!") Take a look at the complete scripts in my GitHub Gist: * [Basic Scraper](https://gist.github.com/triposat/9a6fb03130f3c4332bab71b72a973940) * [Code with Proxy Integration](https://gist.github.com/triposat/6c554b13c787a55348b48b6bfc5459c0) * [Hotel Analysis Dashboard](https://gist.github.com/triposat/13ce4b05c36512e69b5602833e781a6c) To make it all work: 1. **Run the basic scraper or proxy-integrated scraper**: This will collect the hotel data and store it in a JSON file. 2. **Run the dashboard script**: Load your JSON data and view it interactively in the dashboard. ## Wrapping up and next steps[​](#wrapping-up-and-next-steps "Direct link to Wrapping up and next steps") You've successfully built a comprehensive Google Maps scraper that collects and processes hotel data, presenting it through an interactive dashboard. Now you’ve learned about: * Using Crawlee with Playwright to navigate and extract data from Google Maps * Using proxies to scale up scraping without getting blocked * Storing the extracted data in JSON format * Creating an interactive dashboard to analyze hotel data We’ve handpicked some great resources to help you further explore web scraping: * [Scrapy vs. Crawlee: Choosing the right tool](https://www.crawlee.dev/blog/scrapy-vs-crawlee) * [Mastering proxy management with Crawlee](https://www.crawlee.dev/blog/proxy-management-in-crawlee) * [Think like a web scraping expert: 12 pro tips](https://www.crawlee.dev/blog/web-scraping-tips) * [Building a LinkedIn job scraper](https://www.crawlee.dev/blog/linkedin-job-scraper-python) **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # How to scrape Google search results with Python December 2, 2024 · 7 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert Scraping `Google Search` delivers essential `SERP analysis`, SEO optimization, and data collection capabilities. Modern scraping tools make this process faster and more reliable. note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord). In this guide, we'll create a Google Search scraper using [`Crawlee for Python`](https://github.com/apify/crawlee-python) that can handle result ranking and pagination. 
We'll create a scraper that:

* Extracts titles, URLs, and descriptions from search results
* Handles multiple search queries
* Tracks ranking positions
* Processes multiple result pages
* Saves data in a structured format

![How to scrape Google search results with Python](/assets/images/google-search-a91bfdf17a4c2860798444b1be56f625.webp)

## Prerequisites[​](#prerequisites "Direct link to Prerequisites")

* Python 3.9 or higher
* Basic understanding of HTML and CSS selectors
* Familiarity with web scraping concepts
* Crawlee for Python v0.4.2 or higher

### Project setup[​](#project-setup "Direct link to Project setup")

1. Install Crawlee with the required dependencies:

```
pip install 'crawlee[beautifulsoup,curl-impersonate]'
```

2. Create a new project using the Crawlee CLI:

```
pipx run crawlee create crawlee-google-search
```

3. When prompted, select `Beautifulsoup` as your template type.

4. Navigate to the project directory and complete the installation:

```
cd crawlee-google-search
poetry install
```

## Development of the Google Search scraper in Python[​](#development-of-the-google-search-scraper-in-python "Direct link to Development of the Google Search scraper in Python")

### 1. Defining data for extraction[​](#1-defining-data-for-extraction "Direct link to 1. Defining data for extraction")

First, let's define our extraction scope. Google's search results now include maps, notable people, company details, videos, common questions, and many other elements. We'll focus on analyzing standard search results with rankings. Here's what we'll be extracting:

![Search Example](/assets/images/search_example-53f4fdf556178b9478a8d4f3e3816669.webp)

Let's verify whether we can extract the necessary data from the page's HTML code, or whether we need deeper analysis or `JS` rendering. Note that this check is sensitive to the HTML tags the data is nested in:

![Check Html](/assets/images/check_html-e243b1a0eff6d4404b9034863969bedc.webp)

Based on the data obtained from the page, all necessary information is present in the HTML code. Therefore, we can use [`beautifulsoup_crawler`](https://www.crawlee.dev/python/docs/examples/beautifulsoup-crawler). The fields we'll extract:

* Search result titles
* URLs
* Description text
* Ranking positions

### 2. Configure the crawler[​](#2-configure-the-crawler "Direct link to 2. Configure the crawler")

First, let's create the crawler configuration. We'll use [`CurlImpersonateHttpClient`](https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient) as our `http_client` with preset `headers` and `impersonate` relevant to the [`Chrome`](https://www.google.com/intl/en/chrome/) browser. We'll also configure [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings) to control scraping aggressiveness. This is crucial to avoid getting blocked by Google. If you need to extract data more intensively, consider setting up [`ProxyConfiguration`](https://www.crawlee.dev/python/api/class/ProxyConfiguration).
``` from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient from crawlee import ConcurrencySettings, HttpHeaders async def main() -> None: concurrency_settings = ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=200) http_client = CurlImpersonateHttpClient(impersonate="chrome124", headers=HttpHeaders({"referer": "https://www.google.com/", "accept-language": "en", "accept-encoding": "gzip, deflate, br, zstd", "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" })) crawler = BeautifulSoupCrawler( max_request_retries=1, concurrency_settings=concurrency_settings, http_client=http_client, max_requests_per_crawl=10, max_crawl_depth=5 ) await crawler.run(['https://www.google.com/search?q=Apify']) ``` ### 3. Implementing data extraction[​](#3-implementing-data-extraction "Direct link to 3. Implementing data extraction") First, let's analyze the HTML code of the elements we need to extract: ![Check Html](/assets/images/html_example-ccefa4ed63c38812ac5b8ca7b5122c8c.webp) There's an obvious distinction between *readable* ID attributes and *generated* class names and other attributes. When creating selectors for data extraction, you should ignore any generated attributes. Even if you've read that Google has been using a particular generated tag for N years, you shouldn't rely on it - this reflects your experience in writing robust code. Now that we understand the HTML structure, let's implement the extraction. As our crawler deals with only one type of page, we can use `router.default_handler` for processing it. Within the handler, we'll use `BeautifulSoup` to iterate through each search result, extracting data such as `title`, `url`, and `text_widget` while saving the results. ``` @crawler.router.default_handler async def default_handler(context: BeautifulSoupCrawlingContext) -> None: """Default request handler.""" context.log.info(f'Processing {context.request} ...') for item in context.soup.select("div#search div#rso div[data-hveid][lang]"): data = { 'title': item.select_one("h3").get_text(), "url": item.select_one("a").get("href"), "text_widget": item.select_one("div[style*='line']").get_text(), } await context.push_data(data) ``` ### 4. Handling pagination[​](#4-handling-pagination "Direct link to 4. Handling pagination") Since Google results depend on the IP geolocation of the search request, we can't rely on link text for pagination. We need to create a more sophisticated CSS selector that works regardless of geolocation and language settings. The `max_crawl_depth` parameter controls how many pages our crawler should scan. Once we have our robust selector, we simply need to get the next page link and add it to the crawler's queue. To write more efficient selectors, learn the basics of [CSS](https://www.w3schools.com/cssref/css_selectors.php) and [XPath](https://www.w3schools.com/xml/xpath_syntax.asp) syntax. ``` await context.enqueue_links(selector="div[role='navigation'] td[role='heading']:last-of-type > a") ``` ### 5. Exporting data to CSV format[​](#5-exporting-data-to-csv-format "Direct link to 5. Exporting data to CSV format") Since we want to save all search result data in a convenient tabular format like CSV, we can simply add the export\_data method call right after running the crawler: ``` await crawler.export_data_csv("google_search.csv") ``` ### 6. 
Finalizing the Google Search scraper[​](#6-finalizing-the-google-search-scraper "Direct link to 6. Finalizing the Google Search scraper") While our core crawler logic works, you might have noticed that our results currently lack ranking position information. To complete our scraper, we need to implement proper ranking position tracking by passing data between requests using `user_data` in [`Request`](https://www.crawlee.dev/python/api/class/Request). Let's modify the script to handle multiple queries and track ranking positions for search results analysis. We'll also set the crawling depth as a top-level variable. Let's move the `router.default_handler` to `routes.py` to match the project structure: ``` # crawlee-google-search.main from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient from crawlee import Request, ConcurrencySettings, HttpHeaders from .routes import router QUERIES = ["Apify", "Crawlee"] CRAWL_DEPTH = 2 async def main() -> None: """The crawler entry point.""" concurrency_settings = ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=200) http_client = CurlImpersonateHttpClient(impersonate="chrome124", headers=HttpHeaders({"referer": "https://www.google.com/", "accept-language": "en", "accept-encoding": "gzip, deflate, br, zstd", "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" })) crawler = BeautifulSoupCrawler( request_handler=router, max_request_retries=1, concurrency_settings=concurrency_settings, http_client=http_client, max_requests_per_crawl=100, max_crawl_depth=CRAWL_DEPTH ) requests_lists = [Request.from_url(f"https://www.google.com/search?q={query}", user_data = {"query": query}) for query in QUERIES] await crawler.run(requests_lists) await crawler.export_data_csv("google_ranked.csv") ``` Let's also modify the handler to add `query` and `order_no` fields and basic error handling: ``` # crawlee-google-search.routes from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext from crawlee.router import Router router = Router[BeautifulSoupCrawlingContext]() @router.default_handler async def default_handler(context: BeautifulSoupCrawlingContext) -> None: """Default request handler.""" context.log.info(f'Processing {context.request.url} ...') order = context.request.user_data.get("last_order", 1) query = context.request.user_data.get("query") for item in context.soup.select("div#search div#rso div[data-hveid][lang]"): try: data = { "query": query, "order_no": order, 'title': item.select_one("h3").get_text(), "url": item.select_one("a").get("href"), "text_widget": item.select_one("div[style*='line']").get_text(), } await context.push_data(data) order += 1 except AttributeError as e: context.log.warning(f'Attribute error for query "{query}": {str(e)}') except Exception as e: context.log.error(f'Unexpected error for query "{query}": {str(e)}') await context.enqueue_links(selector="div[role='navigation'] td[role='heading']:last-of-type > a", user_data={"last_order": order, "query": query}) ``` And we're done! Our Google Search crawler is ready. 
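Because every field ends up in a flat CSV, quick post-processing is easy. As an optional sketch (pandas is not part of the project dependencies, so install it separately if you want to try this), you could compute the average ranking position per query like this:

```
# Optional analysis sketch using pandas (an extra dependency, not used by the crawler).
import pandas as pd

df = pd.read_csv('google_ranked.csv')

# Average ranking position of each query across the crawled pages
print(df.groupby('query')['order_no'].mean())

# Top three results for each query
top = df.sort_values(['query', 'order_no']).groupby('query').head(3)
print(top[['query', 'order_no', 'title', 'url']])
```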
Let's look at the results in the `google_ranked.csv` file: ![Results CSV](/assets/images/results-03c51354b4347837a24ec6977a442ce8.webp) The code repository is available on [`GitHub`](https://github.com/Mantisus/crawlee-google-search) ## Scrape Google Search results with Apify[​](#scrape-google-search-results-with-apify "Direct link to Scrape Google Search results with Apify") If you're working on a large-scale project requiring millions of data points, like the project featured in this [article about Google ranking analysis](https://backlinko.com/search-engine-ranking) - you might need a ready-made solution. Consider using [`Google Search Results Scraper`](https://www.apify.com/apify/google-search-scraper) by the Apify team. It offers important features such as: * Proxy support * Scalability for large-scale data extraction * Geolocation control * Integration with external services like [`Zapier`](https://zapier.com/), [`Make`](https://www.make.com/), [`Airbyte`](https://airbyte.com/), [`LangChain`](https://www.langchain.com/) and others You can learn more in the Apify [blog](https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/) ## What will you scrape?[​](#what-will-you-scrape "Direct link to What will you scrape?") In this blog, we've explored step-by-step how to create a Google Search crawler that collects ranking data. How you analyze this dataset is up to you! As a reminder, you can find the full project code on [`GitHub`](https://github.com/Mantisus/crawlee-google-search). I'd like to think that in 5 years I'll need to write an article on "How to extract data from the best search engine for LLMs", but I suspect that in 5 years this article will still be relevant. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # How to scrape TikTok using Python April 25, 2025 · 12 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert [TikTok](https://www.tiktok.com/) users generate tons of data that are valuable for analysis. Which hashtags are trending now? What is an influencer's engagement rate? What topics are important for a content creator? You can find answers to these and many other questions by analyzing TikTok data. However, for analysis, you need to extract the data in a convenient format. In this blog, we'll explore how to scrape TikTok using [Crawlee for Python](https://github.com/apify/crawlee-python). note One of our community members wrote this blog as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on our [Discord channel](https://apify.com/discord). ![How to scrape TikTok using Python](/assets/images/main_image-94d608c24b2e8970cac1d9040b8290a5.webp) Key steps we'll cover: 1. [Project setup](https://www.crawlee.dev/blog/scrape-tiktok-python#1-project-setup) 2. [Analyzing TikTok and determining a scraping strategy](https://www.crawlee.dev/blog/scrape-tiktok-python#2-analyzing-tiktok-and-determining-a-scraping-strategy) 3. [Configuring Crawlee](https://www.crawlee.dev/blog/scrape-tiktok-python#3-configuring-crawlee) 4. [Extracting TikTok data](https://www.crawlee.dev/blog/scrape-tiktok-python#4-extracting-tiktok-data) 5. [Creating TikTok Actor on the Apify platform](https://www.crawlee.dev/blog/scrape-tiktok-python#5-creating-tiktok-actor-on-apify-platform) 6. 
[Deploying to Apify](https://www.crawlee.dev/blog/scrape-tiktok-python#6-deploying-to-apify) ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") * Python 3.9 or higher * Familiarity with web scraping concepts * Crawlee for Python `v0.6.0` or higher * [uv](https://docs.astral.sh/uv/) `v0.6` or higher * An Apify account ## 1. Project setup[​](#1-project-setup "Direct link to 1. Project setup") note Before going ahead with the project, I'd like to ask you to star Crawlee for Python on [GitHub](https://github.com/apify/crawlee-python/), it helps us to spread the word to fellow scraper developers. In this project, we'll use uv for package management and a specific Python version will be installed through uv. Uv is a fast and modern package manager written in Rust. If you don't have uv installed yet, just follow the [guide](https://docs.astral.sh/uv/getting-started/installation/) or use this command: ``` curl -LsSf https://astral.sh/uv/install.sh | sh ``` To create the project, run: ``` uvx crawlee['cli'] create tiktok-crawlee ``` In the `cli` menu that opens, select: 1. `Playwright` 2. `Httpx` 3. `uv` 4. Leave the default value - `https://crawlee.dev` 5. `y` Or, just run the command: ``` uvx crawlee['cli'] create tiktok-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev' ``` Creating the project may take a few minutes. After installation is complete, navigate to the project folder: ``` cd tiktok-crawlee ``` ## 2. Analyzing TikTok and determining a scraping strategy[​](#2-analyzing-tiktok-and-determining-a-scraping-strategy "Direct link to 2. Analyzing TikTok and determining a scraping strategy") TikTok uses quite a lot of JavaScript on its site, both for displaying content and for analyzing user behavior, including detecting and blocking crawlers. Therefore, for crawling TikTok, we'll use a headless browser with [Playwright](https://playwright.dev/python/). To load new elements on a user's page, TikTok uses infinite scrolling. You may already be familiar with this method from this [article](https://www.crawlee.dev/blog/infinite-scroll-using-python). Let's look at what happens under the hood when we scroll a TikTok page. I recommend studying network activity in [DevTools](https://developer.chrome.com/docs/devtools) to understand what requests are going to the server. ![Backend Network](/assets/images/load_elems-b739afc4d1d682c6fa2944275e1f8a9f.webp) Let's examine the HTML structure to understand if navigating to elements will be difficult. ![Selectors](/assets/images/selectors-80c3c3aa2697ef3c0f8b2422e7367d65.webp) Well, this looks quite simple. If using [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors), `[data-e2e="user-post-item"] a` is sufficient. Let's look at what a video page response looks like to see what data we can extract. ![Video Response](/assets/images/html_response-4344e00324cd04aa52a5a8b257d48eaf.webp) It seems that the HTML code contains JSON with all the data we're interested in. Great! ## 3. Configuring Crawlee[​](#3-configuring-crawlee "Direct link to 3. Configuring Crawlee") Now that we understand our scraping strategy, let's set up Crawlee for scraping TikTok. Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a `max_items` parameter that will limit the maximum number of elements for each search and pass it in `user_data` when forming a [Request](https://www.crawlee.dev/python/api/class/Request). 
We'll limit the intensity of scraping by setting `max_tasks_per_minute` in [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). This will help us reduce the likelihood of being blocked by TikTok. We'll set `browser_type` to `firefox`, as it performed better for TikTok in my tests. TikTok may request permissions to access device data, so we'll explicitly limit all [permissions](https://playwright.dev/python/docs/api/class-browser#browser-new-context-option-permissions) by passing the appropriate parameter to `browser_new_context_options`. Scrolling pages can take a long time, so we should increase the time limit for processing a single request using `request_handler_timeout`. ``` # main.py from datetime import timedelta from apify import Actor from crawlee import ConcurrencySettings, Request from crawlee.crawlers import PlaywrightCrawler from .routes import router async def main() -> None: """The crawler entry point.""" # When creating the template, we confirmed Apify integration. # However, this isn't important for us at this stage. async with Actor: max_items = 20 # Create a crawler with the necessary settings crawler = PlaywrightCrawler( # Limit scraping intensity by setting a limit on requests per minute concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50), # We'll configure the `router` in the next step request_handler=router, # You can use `False` during development. But for production, it's always `True` headless=True, max_requests_per_crawl=100, # Increase the timeout for the request handling pipeline request_handler_timeout=timedelta(seconds=120), browser_type='firefox', # Limit any permissions to device data browser_new_context_options={'permissions': []}, ) # Run the crawler to collect data from several user pages await crawler.run( [ Request.from_url('https://www.tiktok.com/@apifyoffice', user_data={'limit': max_items}), Request.from_url('https://www.tiktok.com/@authorbrandonsanderson', user_data={'limit': max_items}), ] ) ``` Someone might ask, "What about configurations to avoid fingerprint blocking?!!!" My answer is, "Crawlee for Python has already done that for you." Depending on your deployment environment, you may need to add a proxy. We'll come back to this in the last section. ## 4. Extracting TikTok data[​](#4-extracting-tiktok-data "Direct link to 4. Extracting TikTok data") After configuration, let's move on to navigation and data extraction. For infinite scrolling, we'll use the built-in helper function ['infinite\_scroll'](https://www.crawlee.dev/python/api/class/PlaywrightCrawlingContext#infinite_scroll). But instead of waiting for scrolling to complete, which in some cases can take a really long time, we'll use Python's `asyncio` capabilities to make it a background task. Also, with deeper investigation, you may encounter a TikTok page that doesn't load user videos, but only shows a button and an error message. ![Error Page](/assets/images/went_wrong-413878d9f5a4331add12544c0a25ccd7.webp) It's very important to handle this case. Also during testing, I discovered that you need to interact with scrolling, otherwise when using `infinite_scroll`, new elements don't load. I think this is a TikTok bug. Let's start with a simple function to extract video links. It will help avoid code duplication. 
``` # routes.py import asyncio import json from playwright.async_api import Page from crawlee import Request from crawlee.crawlers import PlaywrightCrawlingContext from crawlee.router import Router router = Router[PlaywrightCrawlingContext]() # Helper function that extracts all loaded video links async def extract_video_links(page: Page) -> list[Request]: """Extract all loaded video links from the page.""" links = [] for post in await page.query_selector_all('[data-e2e="user-post-item"] a'): post_link = await post.get_attribute('href') if post_link and '/video/' in post_link: links.append(Request.from_url(post_link, label='video')) return links ``` Now we can move on to the main handler that will process TikTok user pages. ``` # routes.py # Main handler used for TikTok user pages @router.default_handler async def default_handler(context: PlaywrightCrawlingContext) -> None: """Handle request without specific label.""" context.log.info(f'Processing {context.request.url} ...') # Get the limit for video elements from `user_data` limit = context.request.user_data.get('limit', 10) if not isinstance(limit, int): raise TypeError('Limit must be an integer') # Wait until the button or at least a video loads, if the connection is slow check_locator = context.page.locator('[data-e2e="user-post-item"], main button').first await check_locator.wait_for() # If the button loaded, click it to initiate video loading if button := await context.page.query_selector('main button'): await button.click() # Perform interaction with scrolling await context.page.press('body', 'PageDown') # Start `infinite_scroll` as a background task scroll_task: asyncio.Task[None] = asyncio.create_task(context.infinite_scroll()) # Wait until scrolling is completed or until the limit is reached while not scroll_task.done(): requests = await extract_video_links(context.page) # If we've already reached the limit, interrupt scrolling and exit the loop if len(requests) >= limit: scroll_task.cancel() break # Switch the asynchronous context to allow other tasks to execute await asyncio.sleep(0.2) else: requests = await extract_video_links(context.page) # Limit the number of requests to the limit value requests = requests[:limit] # If the page wasn't properly processed for some reason and didn't find any links, # then I want to raise an error for retry if not requests: raise RuntimeError('No video links found') await context.add_requests(requests) ``` The final stage is handling the video page. 
``` # routes.py @router.handler(label='video') async def video_handler(context: PlaywrightCrawlingContext) -> None: """Handle request with the label 'video'.""" context.log.info(f'Processing video {context.request.url} ...') # Extract the element containing JSON with data json_element = await context.page.query_selector('#__UNIVERSAL_DATA_FOR_REHYDRATION__') if json_element: # Extract JSON and convert it to a dictionary text_data = await json_element.text_content() json_data = json.loads(text_data) data = json_data['__DEFAULT_SCOPE__']['webapp.video-detail']['itemInfo']['itemStruct'] # Create result item result_item = { 'author': { 'nickname': data['author']['nickname'], 'id': data['author']['id'], 'handle': data['author']['uniqueId'], 'signature': data['author']['signature'], 'followers': data['authorStats']['followerCount'], 'following': data['authorStats']['followingCount'], 'hearts': data['authorStats']['heart'], 'videos': data['authorStats']['videoCount'], }, 'description': data['desc'], 'tags': [item['hashtagName'] for item in data['textExtra'] if item['hashtagName']], 'hearts': data['stats']['diggCount'], 'shares': data['stats']['shareCount'], 'comments': data['stats']['commentCount'], 'plays': data['stats']['playCount'], } # Save the result to the dataset await context.push_data(result_item) else: # If the data wasn't received, we raise an error for retry raise RuntimeError('No JSON data found') ``` The crawler is ready for local launch. To run it, execute the command: ``` uv run python -m tiktok_crawlee ``` You can view the saved results in the `dataset` folder, path `./storage/datasets/default/`. Example record: ``` { "author": { "nickname": "apifyoffice", "id": "7095709566285480965", "handle": "apifyoffice", "signature": "🤖 web scraping and AI 🤖\n\ncheck out our open positions at ✨apify.it/jobs✨", "followers": 118, "following": 3, "hearts": 1975, "videos": 33 }, "description": ""Fun" is the top word Apifiers used to describe our culture. Here's what else came to their minds 🎤 #workculture #teambuilding #interview #czech #ilovemyjob ", "tags": [ "workculture", "teambuilding", "interview", "czech", "ilovemyjob" ], "hearts": 7, "shares": 1, "comments": 1, "plays": 448 } ``` ## 5. Creating TikTok Actor on the [Apify platform](https://apify.com/)[​](#5-creating-tiktok-actor-on-the-apify-platform "Direct link to 5-creating-tiktok-actor-on-the-apify-platform") For deployment, we'll use the [Apify platform](https://apify.com/). It's a simple and effective environment for cloud deployment, allowing efficient interaction with your crawler. Call it via [API](https://docs.apify.com/api/v2/), [schedule tasks](https://docs.apify.com/platform/schedules), [integrate](https://docs.apify.com/platform/integrations) with various services, and much more. To deploy to the Apify platform, we need to adapt our project for the [Apify Actor](https://apify.com/actors) structure. Create an `.actor` folder with the necessary files. ``` mkdir .actor && touch .actor/{actor.json,input_schema.json} ``` Move the `Dockerfile` from the root folder to `.actor`. ``` mv Dockerfile .actor ``` Let's fill in the empty files: The `actor.json` file contains project metadata for the Apify platform. 
Follow the [documentation for proper configuration](https://docs.apify.com/platform/actors/development/actor-definition/actor-json): ``` { "actorSpecification": 1, "name": "TikTok-Crawlee", "title": "TikTok - Crawlee", "minMemoryMbytes": 2048, "description": "Scrape video elements from TikTok user pages", "version": "0.1", "meta": { "templateId": "tiktok-crawlee" }, "input": "./input_schema.json", "dockerfile": "./Dockerfile" } ``` Actor input parameters are defined using `input_schema.json`, which is specified [here](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1). Let's define input parameters for our crawler: * `maxItems` - this should be an externally configurable parameter. * `urls` - these are links to TikTok user pages, the starting points for our crawler's scraping * `proxySettings` - proxy settings, since without a proxy you'll be using the datacenter IP that Apify uses. ``` { "title": "TikTok Crawlee", "type": "object", "schemaVersion": 1, "properties": { "urls": { "title": "List URLs", "type": "array", "description": "Direct URLs to pages TikTok profiles.", "editor": "stringList", "prefill": ["https://www.tiktok.com/@apifyoffice"] }, "maxItems": { "type": "integer", "editor": "number", "title": "Limit search results", "description": "Limits the maximum number of results, applies to each search separately.", "default": 10 }, "proxySettings": { "title": "Proxy configuration", "type": "object", "description": "Select proxies to be used by your scraper.", "prefill": { "useApifyProxy": true }, "editor": "proxy" } }, "required": ["urls"] } ``` Let's update the code to accept input parameters. ``` # main.py from datetime import timedelta from apify import Actor from crawlee.crawlers import PlaywrightCrawler from crawlee import ConcurrencySettings from crawlee import Request from .routes import router async def main() -> None: """The crawler entry point.""" async with Actor: # Accept input parameters passed when starting the Actor actor_input = await Actor.get_input() max_items = actor_input.get('maxItems', 0) requests = [Request.from_url(url, user_data={'limit': max_items}) for url in actor_input.get('urls', [])] proxy = await Actor.create_proxy_configuration(actor_proxy_input=actor_input.get('proxySettings')) crawler = PlaywrightCrawler( concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50), proxy_configuration=proxy, request_handler=router, headless=True, request_handler_timeout=timedelta(seconds=120), browser_type='firefox', browser_new_context_options={'permissions': []} ) await crawler.run(requests) ``` That's it, the project is ready for deployment. ## 6. Deploying to Apify[​](#6-deploying-to-apify "Direct link to 6. Deploying to Apify") Use the official [Apify CLI](https://docs.apify.com/cli/) to upload your code: Authenticate using your API token from [Apify Console](https://console.apify.com/settings/integrations): ``` apify login ``` Choose "Enter API token manually" and paste your token. Push the project to the platform: ``` apify push ``` Now you can configure runs on the Apify platform. 
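Besides configuring runs in Apify Console, you can also start the deployed Actor and read its results programmatically. Below is a minimal sketch using the `apify-client` package; the Actor ID `<your-username>/tiktok-crawlee` and the token placeholder are assumptions, so adjust them to your account:

```python
from apify_client import ApifyClient

client = ApifyClient(token='<YOUR_APIFY_TOKEN>')

# Start the Actor and wait for the run to finish
run = client.actor('<your-username>/tiktok-crawlee').call(
    run_input={
        'urls': ['https://www.tiktok.com/@apifyoffice'],
        'maxItems': 5,
    },
)

# Read the scraped items from the run's default dataset
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item['author']['nickname'], item['plays'])
```

The same pattern works for schedules and integrations that call the Actor from your own code.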
Let's perform a test run: Fill in the input parameters: ![Actor Input](/assets/images/input_actor-33501c94f9a90c5e28c272016a7d5ec9.webp) Check that logging works correctly: ![Actor Log](/assets/images/actor_log-4301af07fb3f21631f98802876e6b3f5.webp) View results in the dataset: ![Dataset Results](/assets/images/actor_results-7ab9904db12130be0317320c43070b71.webp) If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this [publishing guide](https://docs.apify.com/platform/actors/publishing) for [Apify Store](https://apify.com/store). ## Conclusion[​](#conclusion "Direct link to Conclusion") We've created a good foundation for crawling TikTok using Crawlee for Python and Playwright. If you want to improve the project, I would recommend adding error handling and dealing with CAPTCHAs to reduce the likelihood of being blocked by TikTok. Even as it stands, though, this is a solid starting point for working with TikTok that lets you collect data right away. You can find the complete code in the [repository](https://github.com/Mantisus/tiktok-crawlee). If you enjoyed this blog, feel free to support Crawlee for Python by starring the [repository](https://github.com/apify/crawlee-python) or joining the maintainer team. Have questions or want to discuss implementation details? Join our [Discord](https://discord.com/invite/jyEM2PRvMU) - our community of 10,000+ developers is there to help. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # Optimizing web scraping: Scraping auth data using JSDOM September 30, 2024 · 8 min read [![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager As scraping developers, we sometimes need to extract authentication data like temporary keys to perform our tasks. However, it is not as simple as that. Usually, the data is in the HTML or in XHR network requests, but sometimes it is computed on the page. In that case, we can either reverse-engineer the computation, which takes a lot of time spent deobfuscating scripts, or run the JavaScript that computes it. Normally, we would use a browser for that, but it is expensive. Crawlee provides support for running a browser scraper and a Cheerio scraper in parallel, but that is very complex and expensive in terms of compute resource usage. JSDOM helps us run page JavaScript with fewer resources than a browser, though slightly more than Cheerio. This article discusses a new approach that we use in one of our Actors to obtain authentication data that the TikTok Ads Creative Center web application generates in the browser, without actually running a browser, by using JSDOM instead.
*Figure: JSDOM-based approach to scraping*
## Analyzing the website[​](#analyzing-the-website "Direct link to Analyzing the website") When you visit this URL: `https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pc/en` you will see a list of hashtags with their live ranking, the number of posts they have, a trend chart, creators, and analytics. You can also filter by industry, set the time period, and use a checkbox to show only trends that are new to the top 100. ![tiktok-trends](/assets/images/tiktok-trends-1b92bf04848ae6c440eb1e9fabb55a41.webp) Our goal here is to extract the top 100 hashtags from the list with the given filters. There are two possible approaches: using [`CheerioCrawler`](https://crawlee.dev/js/docs/guides/cheerio-crawler-guide.md), or browser-based scraping. Cheerio gives results faster but does not work with JavaScript-rendered websites. Cheerio is not the best option here because the [Creative Center](https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en) is a web application whose data source is an API, so we can only get the hashtags initially present in the HTML, not all 100 that we need.
The second approach is to use libraries like Puppeteer or Playwright for browser-based scraping and automate collecting all of the hashtags, but in our experience, that takes a lot of time for such a small task. Now comes the new approach that we developed to make this process much better than browser-based crawling and very close to CheerioCrawler-based crawling. ## JSDOM Approach[​](#jsdom-approach "Direct link to JSDOM Approach") note Before diving deep into this approach, I would like to give credit to [Alexey Udovydchenko](https://apify.com/alexey), Web Automation Engineer at Apify, for developing this approach. Kudos to him! In this approach, we are going to make API calls to `https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list` to get the required data. Before making calls to this API, we will need a few required headers (auth data), so we will first make a call to `https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en`. We will start by creating a function that builds the URL for the API call, makes the call, and gets the data. ``` export const createStartUrls = (input) => { const { days = '7', country = '', resultsLimit = 100, industry = '', isNewToTop100, } = input; const filterBy = isNewToTop100 ? 'new_on_board' : ''; return [ { url: `https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list?page=1&limit=50&period=${days}&country_code=${country}&filter_by=${filterBy}&sort_by=popular&industry_id=${industry}`, headers: { // required headers }, userData: { resultsLimit }, }, ]; }; ``` In the above function, we create the start URL for the API call, which includes the various parameters we talked about earlier. After creating the URL according to the parameters, it calls `creative_radar_api` and fetches all the results. But it won't work until we get the headers. So, let's create a function that first creates a session using `sessionPool` and `proxyConfiguration`. ``` export const createSessionFunction = async ( sessionPool, proxyConfiguration, ) => { const proxyUrl = await proxyConfiguration.newUrl(Math.random().toString()); const url = 'https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en'; // need url with data to generate token const response = await gotScraping({ url, proxyUrl }); const headers = await getApiUrlWithVerificationToken( response.body.toString(), url, ); if (!headers) { throw new Error(`Token generation blocked`); } log.info(`Generated API verification headers`, Object.values(headers)); return new Session({ userData: { headers, }, sessionPool, }); }; ``` In this function, the main goal is to call `https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en` and get the headers in return. To get the headers, we use the `getApiUrlWithVerificationToken` function. note Before going ahead, I want to mention that Crawlee natively supports JSDOM through the [JSDOM Crawler](https://crawlee.dev/js/api/jsdom-crawler.md). It provides a framework for parallel crawling of web pages using plain HTTP requests and the jsdom DOM implementation. Because it uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth.
Let’s see how we are going to create the `getApiUrlWithVerificationToken` function: ``` const getApiUrlWithVerificationToken = async (body, url) => { log.info(`Getting API session`); const virtualConsole = new VirtualConsole(); const { window } = new JSDOM(body, { url, contentType: 'text/html', runScripts: 'dangerously', resources: 'usable' || new CustomResourceLoader(), // ^ 'usable' faster than custom and works without canvas pretendToBeVisual: false, virtualConsole, }); virtualConsole.on('error', () => { // ignore errors caused by fake XMLHttpRequest }); const apiHeaderKeys = ['anonymous-user-id', 'timestamp', 'user-sign']; const apiValues = {}; let retries = 10; // api calls made outside of fetch, hack below is to get URL without actual call window.XMLHttpRequest.prototype.setRequestHeader = (name, value) => { if (apiHeaderKeys.includes(name)) { apiValues[name] = value; } if (Object.values(apiValues).length === apiHeaderKeys.length) { retries = 0; } }; window.XMLHttpRequest.prototype.open = (method, urlToOpen) => { if ( ['static', 'scontent'].find((x) => urlToOpen.startsWith(`https://${x}`), ) ) log.debug('urlToOpen', urlToOpen); }; do { await sleep(4000); retries--; } while (retries > 0); await window.close(); return apiValues; }; ``` In this function, we create a virtual console that uses `CustomResourceLoader` to run the background process and replace the browser with JSDOM. For this particular example, we need three mandatory headers to make the API call: `anonymous-user-id`, `timestamp`, and `user-sign`. Using `XMLHttpRequest.prototype.setRequestHeader`, we check whether the mentioned headers are being set; if so, we take their values and keep retrying until we have all of them. Then, the most important part: we use `XMLHttpRequest.prototype.open` to extract the auth data and make calls without actually using a browser or exposing the bot activity. At the end of `createSessionFunction`, it returns a session with the required headers. Now, coming to our main code, we will use CheerioCrawler and `preNavigationHooks` to inject the headers we got from the earlier function into the `requestHandler`. ``` const crawler = new CheerioCrawler({ sessionPoolOptions: { maxPoolSize: 1, createSessionFunction: async (sessionPool) => createSessionFunction(sessionPool, proxyConfiguration), }, preNavigationHooks: [ (crawlingContext) => { const { request, session } = crawlingContext; request.headers = { ...request.headers, ...session.userData?.headers, }; }, ], proxyConfiguration, }); ``` Finally, in the request handler, we make the call using the headers and determine how many calls are needed to fetch all the data, handling pagination. ``` async requestHandler(context) { const { log, request, json } = context; const { userData } = request; const { itemsCounter = 0, resultsLimit = 0 } = userData; if (!json.data) { throw new Error('BLOCKED'); } const { data } = json; const items = data.list; const counter = itemsCounter + items.length; const dataItems = items.slice( 0, resultsLimit && counter > resultsLimit ?
resultsLimit - itemsCounter : undefined, ); await context.pushData(dataItems); const { pagination: { page, total }, } = data; log.info( `Scraped ${dataItems.length} results out of ${total} from search page ${page}`, ); const isResultsLimitNotReached = counter < Math.min(total, resultsLimit); if (isResultsLimitNotReached && data.pagination.has_more) { const nextUrl = new URL(request.url); nextUrl.searchParams.set('page', page + 1); await crawler.addRequests([ { url: nextUrl.toString(), headers: request.headers, userData: { ...request.userData, itemsCounter: itemsCounter + dataItems.length, }, }, ]); } } ``` One important thing to note here is that this code is written so that we can make any number of API calls. In this particular example, we just made one request with a single session, but you can make more if you need to. When the first API call completes, it creates the second one. Again, you can make more calls if needed, but we stopped at two. To make things clearer, here is how the code flow looks: ![code flow](/assets/images/code-flow-9b59d77892326bdf8ae27f1e99489c9e.webp) ## Conclusion[​](#conclusion "Direct link to Conclusion") This approach gives us a third way to extract authentication data without actually using a browser, passing the data to CheerioCrawler. It significantly improves performance and reduces the RAM requirement by 50%. While browser-based scraping is about ten times slower than pure Cheerio, JSDOM does it just 3-4 times slower, which makes it 2-3 times faster than browser-based scraping. The project's codebase is already [uploaded here](https://github.com/souravjain540/tiktok-trends). The code is written as an Apify Actor; you can find more about it [here](https://docs.apify.com/academy/getting-started/creating-actors), but you can also run it without using the Apify SDK. If you have any doubts or questions about this approach, reach out to us on our [Discord server](https://apify.com/discord). --- # How to scrape YouTube using Python \[2025 guide] July 14, 2025 · 23 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert In this guide, we'll explore how to efficiently collect data from YouTube using [Crawlee for Python](https://github.com/apify/crawlee-python). The scraper will extract video metadata, video statistics, and transcripts - giving you structured YouTube data perfect for content analysis, ML training, or trend monitoring. note One of our community members wrote this guide as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on Apify's [Discord channel](https://apify.com/discord). ![How to scrape YouTube using Python](/assets/images/youtube_banner-fb73d10d52bbf13a89f3c0d66d2eff5b.webp) Key steps we'll cover: 1. [Project setup](https://www.crawlee.dev/blog/scrape-youtube-python#1-project-setup) 2. [Analyzing YouTube and determining a scraping strategy](https://www.crawlee.dev/blog/scrape-youtube-python#2-analyzing-youtube-and-determining-a-scraping-strategy) 3. [Configuring Crawlee](https://www.crawlee.dev/blog/scrape-youtube-python#3-configuring-crawlee) 4. [Extracting YouTube data](https://www.crawlee.dev/blog/scrape-youtube-python#4-extracting-youtube-data) 5. [Enhancing the scraper capabilities](https://www.crawlee.dev/blog/scrape-youtube-python#5-enhancing-the-scraper-capabilities) 6.
[Creating a YouTube Actor on the Apify platform](https://www.crawlee.dev/blog/scrape-youtube-python#6-creating-a-youtube-actor-on-the-apify-platform) 7. [Deploying to Apify](https://www.crawlee.dev/blog/scrape-youtube-python#7-deploying-to-apify) ## What you’ll need to get started[​](#what-youll-need-to-get-started "Direct link to What you’ll need to get started") * Python 3.10 or higher * Familiarity with web scraping concepts * Crawlee for Python `v0.6.0` or higher * [uv](https://docs.astral.sh/uv/) `v0.7` or higher ## 1. Project setup[​](#1-project-setup "Direct link to 1. Project setup") note Before starting the project, I'd like to ask you to star Crawlee for Python on [GitHub](https://github.com/apify/crawlee-python/). This will help us spread the word to fellow scraper developers. In this project, we'll use uv for package management and a specific Python version will be installed through uv. If you don't have uv installed yet, just follow the [guide](https://docs.astral.sh/uv/getting-started/installation/) or use this command: ``` curl -LsSf https://astral.sh/uv/install.sh | sh ``` To create the project, run: ``` uvx crawlee['cli'] create youtube-crawlee ``` In the `cli` menu that opens, select: 1. `Playwright` 2. `Httpx` 3. `uv` 4. Leave the default value - `https://crawlee.dev` 5. `y` Or, just run the command: ``` uvx crawlee['cli'] create youtube-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev' ``` Or, if you prefer to use `pipx`. ``` pipx run crawlee['cli'] create youtube-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev' ``` Creating the project may take a few minutes. After installation is complete, navigate to the project folder: ``` cd youtube-crawlee ``` ## 2. Analyzing YouTube and determining a scraping strategy[​](#2-analyzing-youtube-and-determining-a-scraping-strategy "Direct link to 2. Analyzing YouTube and determining a scraping strategy") If you're working on a small project to extract data from YouTube, you should use the [YouTube API](https://developers.google.com/youtube/v3/docs/search/list) to get your data. However, the API has very strict quotas, with no more than [10,000 units per day](https://developers.google.com/youtube/v3/determine_quota_cost). This allows you to get just 100 search pages, and you can't increase this limit. If your project requires more data than the API allows, you'll need to use crawling. Let's examine the site to develop an optimal crawling strategy. Let's study YouTube navigation using [Apify's YouTube channel](https://www.youtube.com/@Apify) as an example to better understand the features and data extraction points. YouTube uses infinite scrolling to load new elements on the page, similar to what we discussed in the corresponding [article](https://www.crawlee.dev/blog/infinite-scroll-using-python) from the [Apify](https://apify.com/) team. Let's look at how this works using [DevTools](https://developer.chrome.com/docs/devtools) and the [Network](https://developer.chrome.com/docs/devtools/network/) tab. ![Load Request](/assets/images/load_request-c583830dda107ae55fb6426d7b96e569.webp) If we look at the response structure, we can see that YouTube uses [JSON](https://www.json.org) to transmit data, but its structure is quite complex to navigate. 
![Load Response](/assets/images/load_response-7061bb91cadc904d54073c033f3f0a20.webp) Therefore, we'll use [Playwright](https://playwright.dev/python/docs/intro) for crawling, which will help us avoid parsing complex JSON responses. But if you want to practice crawling complex websites, try implementing a crawler based on an HTTP client, like in this [article](https://www.crawlee.dev/blog/scraping-dynamic-websites-using-python). Let's analyze the selectors for getting video links using the [Elements](https://developer.chrome.com/docs/devtools/elements/) tab: ![Selectors](/assets/images/selectors-745f5daab12810cc998990e4c066afdf.webp) It looks like we're interested in `a` tags with the attribute `id="video-title-link"`! Let's look at the video page to understand better how YouTube transmits data. As expected, we see data in JSON format. ![Video Response](/assets/images/video_json-44affd2ba348740caa8d1bc79ba9a8a9.webp) Now let's get the transcript link. Click on the subtitles button in the player to trigger the transcript request. ![Transcript Request](/assets/images/transcript_request-77c78163912afe398161b431c20cb733.webp) Let's verify that we can access the transcript via this link. Remove the `fmt=json3` parameter from the URL and open it in your browser. Removing the `fmt` parameter is necessary to get the data in a convenient XML format instead of the complex JSON3 format. ![Transcript Response](/assets/images/transcript_response-06133506fa3559a10cfc43912d1af67c.webp) If you live in a country where [GDPR](https://gdpr-info.eu/) applies, you'll need to handle the following pop-up before you can access the data: ![GDPR](/assets/images/GDPR-103ec4d5f927916f704ec1d4d597bd82.webp) After our analysis, we now understand: * **Navigation strategy**: How to navigate the channel page to retrieve all videos using infinite scroll. * **Video metadata extraction**: How to extract video statistics, title, description, publish date, and other metadata from video pages. * **Transcript access**: How to obtain the correct transcript link. * **Data formats**: Transcript data is available in XML format, which is easier to parse than JSON3 * **Regional considerations**: Special handling required for GDPR consent in European countries With this knowledge, we're ready to implement the YouTube scraper using Crawlee for Python. ## 3. Configuring Crawlee[​](#3-configuring-crawlee "Direct link to 3. Configuring Crawlee") Configuring Crawlee for YouTube is very similar to configuring it for [TikTok](https://www.crawlee.dev/blog/scrape-tiktok-python), but with some key differences. Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a `max_items` parameter that will limit the maximum number of elements for each search, and pass it in `user_data` when forming a [Request](https://www.crawlee.dev/python/api/class/Request). We'll limit the intensity of scraping by setting `max_tasks_per_minute` in [`ConcurrencySettings`](https://www.crawlee.dev/python/api/class/ConcurrencySettings). This will help us reduce the likelihood of being blocked by YouTube. Scrolling pages can take a long time, so we’ll increase the time limit for processing a single request using `request_handler_timeout`. 
Since we won't be saving images, videos, and similar media content during crawling, we can block requests to them using [`block_requests`](https://www.crawlee.dev/python/api/class/BlockRequestsFunction) and [`pre_navigation_hook`](https://www.crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook). Also, to handle the `GDPR` page only once, we'll use [`use_state`](https://www.crawlee.dev/python/api/class/UseStateFunction) to pass the appropriate cookies between sessions, ensuring all requests have the necessary cookies. ``` # main.py from datetime import timedelta from apify import Actor from crawlee import ConcurrencySettings, Request from crawlee.crawlers import PlaywrightCrawler from .hooks import pre_hook from .routes import router async def main() -> None: """The crawler entry point.""" async with Actor: # Create a crawler instance with the router crawler = PlaywrightCrawler( # Limit scraping intensity by setting a limit on requests per minute concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50), # We'll configure the `router` in the next step request_handler=router, # Increase the timeout for the request handling pipeline request_handler_timeout=timedelta(seconds=120), # Runs browser without visual interface headless=True, # Limit requests per crawl for testing purposes max_requests_per_crawl=100, ) # Set the maximum number of items to scrape per youtube channel max_items = 1 # Set the list of channels to scrape channels = ['Apify'] # Set hook for prepare context before navigation on each request crawler.pre_navigation_hook(pre_hook) await crawler.run( [ Request.from_url(f'https://www.youtube.com/@{channel}/videos', user_data={'limit': max_items}) for channel in channels ] ) ``` Let's prepare the `pre_hook` function to block requests and set cookies (the cookie collection process will be explained in the extraction section): ``` # hooks.py from crawlee.crawlers import PlaywrightPreNavCrawlingContext async def pre_hook(context: PlaywrightPreNavCrawlingContext) -> None: """Prepare context before navigation.""" crawler_state = await context.use_state() # Check if there are previously collected cookies in the crawler state and set them for the session if 'cookies' in crawler_state and context.session: cookies = crawler_state['cookies'] # Set cookies for the session context.session.cookies.set_cookies_from_playwright_format(cookies) # Block requests to resources that aren't needed for parsing # This is similar to the default value, but we don't block `css` as it is needed for Player loading await context.block_requests( url_patterns=['.webp', '.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff', '.pdf', '.zip'] ) ``` ## 4. Extracting YouTube data[​](#4-extracting-youtube-data "Direct link to 4. Extracting YouTube data") After configuration, let's move on to navigation and data extraction. For infinite scrolling, we'll use the built-in helper function ['infinite\_scroll'](https://www.crawlee.dev/python/api/class/PlaywrightCrawlingContext#infinite_scroll). But instead of waiting for scrolling to complete, which in some cases can take a really long time, we'll use Python's `asyncio` capabilities to make it a background task. The `GDPR` page requiring consent for cookie usage is on the domain `consent.youtube.com`, which might cause an error when forming a [Request](https://www.crawlee.dev/python/api/class/Request) for a video page. 
Therefore, we need to use a helper function for the `transform_request_function` parameter in [`extract_links`](https://www.crawlee.dev/python/api/class/ExtractLinksFunction). This function will check each extracted URL. If it contains 'consent.youtube', we'll replace it with '[www.youtube](http://www.youtube)'. This will allow us to get the correct URL for the video page. ``` # routes.py from __future__ import annotations import asyncio import xml.etree.ElementTree as ET from typing import TYPE_CHECKING from yarl import URL from crawlee import Request, RequestOptions, RequestTransformAction from crawlee.crawlers import PlaywrightCrawlingContext from crawlee.router import Router if TYPE_CHECKING: from playwright.async_api import Request as PlaywrightRequest from playwright.async_api import Route as PlaywrightRoute router = Router[PlaywrightCrawlingContext]() def request_domain_transform(request_param: RequestOptions) -> RequestOptions | RequestTransformAction: """Transform request before adding it to the queue.""" if 'consent.youtube' in request_param['url']: request_param['url'] = request_param['url'].replace('consent.youtube', 'www.youtube') return request_param return 'unchanged' ``` Let's implement a function that will intercept transcript requests for later modification and processing in the crawler: ``` # routes.py async def extract_transcript_url(context: PlaywrightCrawlingContext) -> str | None: """Extract the transcript URL from request intercepted by Playwright.""" # Create a Future to store the transcript URL transcript_future: asyncio.Future[str] = asyncio.Future() # Define a handler for the transcript request # This will be called when the page requests the transcript async def handle_transcript_request(route: PlaywrightRoute, request: PlaywrightRequest) -> None: # Set the result of the future with the transcript URL if not transcript_future.done(): transcript_future.set_result(request.url) await route.fulfill(status=200) # Set up a route to intercept requests to the transcript API await context.page.route('**/api/timedtext**', handle_transcript_request) # Click the subtitles button to trigger the transcript request await context.page.click('.ytp-subtitles-button') # Wait for the transcript URL to be captured # The future will resolve when handle_transcript_request is called return await transcript_future ``` Now, let's create the main handler that will navigate to the channel page, perform infinite scrolling, and extract links to videos. 
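The handler below runs `infinite_scroll` as a background `asyncio` task and keeps extracting links until either scrolling finishes or the limit is reached. Stripped of the Crawlee specifics, the pattern looks like this; a generic sketch, not code from the project:

```python
import asyncio


async def scroll_forever() -> None:
    """Stand-in for context.infinite_scroll()."""
    while True:
        await asyncio.sleep(0.5)


async def collect_with_limit(limit: int) -> int:
    found = 0
    scroll_task = asyncio.create_task(scroll_forever())
    while not scroll_task.done():
        found += 1  # stand-in for extracting newly loaded links
        if found >= limit:
            scroll_task.cancel()  # stop scrolling once we have enough
            break
        await asyncio.sleep(0.2)  # yield control so scrolling can progress
    return found


print(asyncio.run(collect_with_limit(5)))
```

In the real handler, the loop body calls `extract_links`, deduplicates the requests, and truncates them to the limit, as shown below.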
``` # routes.py @router.default_handler async def default_handler(context: PlaywrightCrawlingContext) -> None: """Handle requests that do not match any specific handler.""" context.log.info(f'Processing {context.request.url} ...') # Get the limit from user_data, default to 10 if not set limit = context.request.user_data.get('limit', 10) if not isinstance(limit, int): raise TypeError('Limit must be an integer') # Wait for the page to load await context.page.locator('h1').first.wait_for(state='attached') # Check if there's a GDPR popup on the page requiring consent for cookie usage cookies_button = context.page.locator('button[aria-label*="Accept"]').first if await cookies_button.is_visible(): await cookies_button.click() # Save cookies for later use with other sessions # You can learn more about `SOCS` cookies from - https://policies.google.com/technologies/cookies?hl=en-US cookies_state = [cookie for cookie in await context.page.context.cookies() if cookie['name'] == 'SOCS'] crawler_state = await context.use_state() crawler_state['cookies'] = cookies_state # Wait until at least one video loads await context.page.locator('a[href*="watch"]').first.wait_for() # Create a background task for infinite scrolling scroll_task: asyncio.Task[None] = asyncio.create_task(context.infinite_scroll()) # Scroll the page to the end until we reach the limit or finish scrolling while not scroll_task.done(): # Extract links to videos requests = await context.extract_links( selector='a[href*="watch"]', label='video', transform_request_function=request_domain_transform, strategy='same-domain', ) # Create a dictionary to avoid duplicates requests_map = {request.id: request for request in requests} # If the limit is reached, cancel the scrolling task if len(requests_map) >= limit: scroll_task.cancel() break # Switch the asynchronous context to allow other tasks to execute await asyncio.sleep(0.2) else: # If the scroll task is done, we can safely assume that we have reached the end of the page requests = await context.extract_links( selector='a[href*="watch"]', label='video', transform_request_function=request_domain_transform, strategy='same-domain', ) requests_map = {request.id: request for request in requests} requests = list(requests_map.values()) requests = requests[:limit] # Add the requests to the queue await context.enqueue_links(requests=requests) ``` Let's take a closer look at the parameters used in [`extract_links`](https://www.crawlee.dev/python/api/class/ExtractLinksFunction#Methods): * `selector` - selector for extracting links to videos. We expected that we could use `id="video-title-link"`, but YouTube uses different page formats with different selectors, so the selector `a[href*="watch"]` will be more universal. * `label` - pointer for the router that will be used to handle the video page. * `transform_request_function` - function to transform the request before adding it to the queue. We use it to replace the domain `consent.youtube` with `www.youtube`, which helps avoid errors when processing the video page. * `strategy` - strategy for extracting links. We use `same-domain` to extract links to any subdomain of `youtube.com`. Let's move on to the handler for video pages. In it, we'll extract video data and also look at how to get and process the video transcript link. 
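Notice that the video handler does not always push its results immediately: when a transcript is available, it forwards the partially assembled item through `user_data` and lets the transcript handler enrich and save it. As a standalone pattern, simplified from the handlers that follow (the example URLs are placeholders):

```python
from crawlee import Request

# In the first handler: forward the partial item instead of saving it right away
item = {'url': 'https://example.com/video', 'title': 'Some title'}
request = Request.from_url(
    'https://example.com/transcript',
    label='transcript',
    user_data={'video_data': item},
)

# In the handler registered for the 'transcript' label:
# video_data = context.request.user_data.get('video_data', {})
# video_data['transcript'] = '...'
# await context.push_data(video_data)
```

This keeps exactly one `push_data` call per video, whether or not a transcript was found.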
``` # routes.py @router.handler('video') async def video_handler(context: PlaywrightCrawlingContext) -> None: """Handle video requests.""" context.log.info(f'Processing video {context.request.url} ...') # extract video data from the page video_data = await context.page.evaluate('window.ytInitialPlayerResponse') main_data = { 'url': context.request.url, 'title': video_data['videoDetails']['title'], 'description': video_data['videoDetails']['shortDescription'], 'channel': video_data['videoDetails']['author'], 'channel_id': video_data['videoDetails']['channelId'], 'video_id': video_data['videoDetails']['videoId'], 'duration': video_data['videoDetails']['lengthSeconds'], 'keywords': video_data['videoDetails']['keywords'], 'view_count': video_data['videoDetails']['viewCount'], 'like_count': video_data['microformat']['playerMicroformatRenderer']['likeCount'], 'is_shorts': video_data['microformat']['playerMicroformatRenderer']['isShortsEligible'], 'publish_date': video_data['microformat']['playerMicroformatRenderer']['publishDate'], } # Try to extract the transcript URL try: transcript_url = await asyncio.wait_for(extract_transcript_url(context), timeout=20) except asyncio.TimeoutError: transcript_url = None if transcript_url: transcript_url = str(URL(transcript_url).without_query_params('fmt')) context.log.info(f'Found transcript URL: {transcript_url}') await context.add_requests( [Request.from_url(transcript_url, label='transcript', user_data={'video_data': main_data})] ) else: await context.push_data(main_data) ``` Note that if we want to extract the video transcript, we need to get the link to the transcript file and pass the video data to the next handler before it's saved to the [`Dataset`](https://www.crawlee.dev/python/api/class/Dataset). The final stage is processing the transcript. YouTube uses [XML](https://www.w3schools.com/xml/) to transmit transcript data, so we need to use a library to parse XML, such as [`xml.etree.ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html). ``` # routes.py @router.handler('transcript') async def transcript_handler(context: PlaywrightCrawlingContext) -> None: """Handle transcript requests.""" context.log.info(f'Processing transcript {context.request.url} ...') # Get the main video data extracted in `video_handler` video_data = context.request.user_data.get('video_data', {}) try: # Get XML data from the response root = ET.fromstring(await context.response.text()) # Extract text elements from XML transcript_data = [text_element.text.strip() for text_element in root.findall('.//text') if text_element.text] # Enrich video data by adding the transcript video_data['transcript'] = '\n'.join(transcript_data) # Save the data to Dataset await context.push_data(video_data) except ET.ParseError: context.log.warning('Incorect XML Response') # Save the video data without the transcript await context.push_data(video_data) ``` After collecting the data, we need to save the results to a file. 
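The snippet below does this with `export_data_json`. If a spreadsheet-friendly file is more convenient, the CSV counterpart should work the same way; a hedged one-liner, assuming the `export_data_csv` method is available in your Crawlee for Python version:

```python
# main.py (alternative to the JSON export shown below)
await crawler.export_data_csv('youtube.csv')
```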
Just add the following code to the end of the `main` function in `main.py`: ``` # main.py # Export the data from Dataset to JSON format await crawler.export_data_json('youtube.json') ``` To run the crawler, use the command: ``` uv run python -m youtube_crawlee ``` Example result record: ``` { "url": "https://www.youtube.com/watch?v=r-1J94tk5Fo", "title": "Facebook Marketplace API - Scrape Data Based on LOCATION, CATEGORY and SEARCH", "description": "See how you can export Facebook Marketplace listings to Excel, CSV or JSON with the Facebook Marketplace API 🛍️ Input one or more URLs to scrape price, description, images, delivery info, seller data, location, listing status, and much more 📊\n\nWith the Facebook Marketplace Downloader, you can:\n🛒 **Extract listings and seller details** from any public Marketplace category or search query.\n📷 **Scrape product details**, including images, prices, descriptions, locations, and timestamps.\n💰 **Get thousands of marketplace listings** quickly and efficiently.\n📦 **Export results** via API or in JSON, CSV, or Excel with all listing details.\n\n🛍️ Facebook Marketplace Search API 👉 https://apify.it/3E5NLz4\n📱 Explore other Facebook Scrapers 👉 https://apify.it/43Bae1f\n\n*Why scrape Facebook Marketplace data?* 🤔\n💰 Price & Demand Analysis – Track product pricing trends and demand fluctuations.\n📊 Competitor Insights – Monitor listings from competitors to adjust pricing and strategy.\n📍 Location-Based Market Trends – Identify popular products in specific regions.\n🔎 Product Availability Monitoring – Detect shortages or oversupply in certain categories.\n📈 Reselling Opportunities – Find underpriced items for profitable flips.\n🛍 Consumer Behavior Insights – Understand what products and features attract buyers.\n💡 Trend Spotting – Discover emerging products before they go mainstream.\n📝 Market Research – Gather data for academic, business, or personal research.\n\n*How to* scrape *facebook marketplace? 🧑‍🏫* \nStep 1. 
Find the Facebook Marketplace dataset tool on Apify Store\nStep 2: Click ‘Try for free’\nStep 3: Input a URL\nStep 4: Fine tune the input\nStep 5: Start the Actor and get your data!\n\n*Useful links 🧑‍💻*\n📚 Read more about Scraping Facebook data: https://apify.it/43wyth9\n🧑‍💻 Sign up for Apify: https://apify.it/42e8nNu\n🧩 Integrate the Actor with other tools: https://apify.it/43Ustiz\n📱 Browse other Social Media Scrapers on Apify Store: https://apify.it/4jhq7i8\n\n*Follow us 🤳*\nhttps://www.linkedin.com/company/apifytech\nhttps://twitter.com/apify\nhttps://www.tiktok.com/@apifytech\nhttps://discord.com/invite/jyEM2PRvMU\n\n*Timestamps ⌛️*\n00:00 Introduction\n01:27 Input\n02:17 Run\n02:26 Export\n02:41 Scheduling\n02:54 Integrations\n03:00 API\n03:13 Other Meta Scrapers\n03:26 Like and subscribe!\n\n#webscraping #instagram", "channel": "Apify", "channel_id": "UCTgwcoeGGKmZ3zzCXN2qo_A", "video_id": "r-1J94tk5Fo", "duration": "226", "keywords": [ "web scraping platform", "web automation", "scrapers", "Apify", "web crawling", "web scraping", "data extraction", "best web scraping tool", "API", "how to extract data from any website", "web scraping tutorial", "web scrape", "data collection tool", "RPA", "web integration", "how to turn website into API", "JSON", "python web scraping", "web scraping python", "web api integration", "how to turn website into api", "scraping", "apify", "data extraction tools", "how to web scrape", "web scraping javascript", "web scraping tool" ], "view_count": "765", "like_count": "8", "is_shorts": false, "publish_date": "2025-04-03T05:33:18-07:00", "transcript": "Hi, Theo here. In this video, I’ll \nshow you how to scrape structured\ndata from Facebook Marketplace by location, \ncategory, or specific search query. You’ll\nbe able to extract listing details like price, \ndescription, images, delivery info, seller data,\nlocation, and listing status — using a \ntool called Facebook Marketplace Scraper.\nHere’s what you can do with it. \nIf you're reselling, flipping,\nor deal hunting, scraping helps you track \nprices, spot trends, and catch underpriced\nor free items early. Looking for a rental \nor house? Compare listings across cities,\ncheck historical prices, and avoid wasting \ntime on overpriced options. Selling on\nMarketplace? Analyze top-performing listings, \noptimize keywords, and price competitively.\nFor businesses, scraping \nenables competitor tracking,\ndynamic pricing, real estate \nresearch, fraud detection,\nand brand protection — like spotting counterfeit \nor unauthorized listings before they do damage.\nThe best part is you don’t need to \njump through hoops to get this data:\nFacebook Marketplace Scraper makes things simple: \nno login, no cookies, no browser extension.\nIt runs in the cloud, and you can export \nresults in JSON, CSV, Excel — or use the API.\nLet’s see how it works.\nFirst, head to the link in the description, \nwhich’ll take you to Facebook Marketplace\nScraper’s README. Click on `try \nfor free`, which will send you to\nthe `Login page` and you can get started \nwith a free Apify account - don’t worry,\nthere’s no limit on the free plan and \nno credit card will ever be required.\nAfter logging in, you’ll land on the Actor’s \ninput page. While you can configure this through\neither the intuitive UI or JSON, we’ll \nstick with the UI option to keep it easy.\nFor scraping Facebook Marketplace, you’re gonna \nneed the URL from Facebook. You can use a URL of\na search term, location or an item category. 
For \nthis tutorial, we’re gonna go with an iPhone. So\nlet’s open up Facebook Marketplace, input a search \nterm and then copy the URL from the toolbar and\npaste it in the input. You can add more via the \nadd button, edit them in bulk or import the URLs\nas a text file. Next, you can limit how many \nposts you want to scrape. And that’s it.\nBefore running your Actor, it’s a great idea \nto save your configuration and create a task.\nThis will come in handy for scheduling or \nintegrating your Actor with other tools,\nor if you plan to work with \nmultiple configurations.\nNow that we have the `input`, let’s run \nthe Actor by hitting START. You can watch\nyour results appear in Overview or switch to \nthe Log tab to see more details about run.\nNow that your run is finished, we can get the \ndata via the Export button. You can choose your\npreffered format, and select which fields you want \nto include or exclude in your dataset. Then just\nhit Download and you have your dataset file. Let \nme show you what this looks like in JSON format.\nIf you want to automate your workflow \neven more, you can schedule your Facebook\nMarketplace Scraper to run at regular intervals. \nChoose your task and hit schedule. You can set\nthe frequency of how often you want to run \nthe Actor. You can even connect your Actor\nto other cloud services, such as Google \nDrive, Make, or any other Apify Actor.\nYou can also run this scraper locally via \nAPI. You can find the code in Node.js,\nPython, or curl in the API \ndrop down menu in the top-right\ncorner. To learn more about retrieving data \nprogramatically, check out our video on it.\nNeed more Facebook or Instagram data? \nCheck out our other scrapers in Apify\nStore. We have got dozens of meta \nscrapers, links are in the description.\nIf you prefer video tutorials, we have a playlist \ncovering different Instagram scraping use cases.\nAnd that’s all for today! Let us know what you \nthink about the Facebok Marketplace Scraper.\nRemember, if you come across any issues, make \nsure to report them to our team in Apify Console.\nIf you found this helpful, give us a thumbs \nup and subscribe. Don't forget to hit the\nbell to stay updated on new tutorials. Thanks for \nwatching! So long, and thanks for all the likes" } ``` ## 5. Enhancing the scraper capabilities[​](#5-enhancing-the-scraper-capabilities "Direct link to 5. Enhancing the scraper capabilities") As with any project working with a large site like YouTube, you may encounter various issues that need to be resolved. Currently, the Crawlee for Python documentation contains many guides and examples to help you with this. * Use [`Camoufox`](https://camoufox.com/), a project compatible with Playwright, which allows you to get a browser configuration that's more resistant to blocking, and you can easily [integrate it with Crawlee for Python](https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox). * Improve error handling and logging for unusual cases so you can easily debug and maintain the project; the guide on [error handling](https://www.crawlee.dev/python/docs/guides/error-handling) is a good place to start. * Add proxy support to avoid blocks from YouTube. You can use [Apify Proxy](https://apify.com/proxy) and [`ProxyConfiguration`](https://www.crawlee.dev/python/api/class/ProxyConfiguration); you can learn more in this guide in the [documentation](https://www.crawlee.dev/python/docs/guides/proxy-management#proxy-configuration). 
* Make your crawler a web service that crawls pages by user request, using [FastAPI](https://fastapi.tiangolo.com/) and following this [guide](https://www.crawlee.dev/python/docs/guides/running-in-web-server). ## 6. Creating YouTube Actor on the Apify platform[​](#6-creating-youtube-actor-on-the-apify-platform "Direct link to 6. Creating YouTube Actor on the Apify platform") For deployment, we'll use the [Apify platform](https://apify.com/). It's a simple and effective environment for cloud deployment, allowing efficient interaction with your crawler. Call it via [API](https://docs.apify.com/api/v2/), [schedule tasks](https://docs.apify.com/platform/schedules), [integrate](https://docs.apify.com/platform/integrations) with various services, and much more. To deploy to the Apify platform, we need to adapt our project for the [Apify Actor](https://apify.com/actors) structure. Create an `.actor` folder with the necessary files. ``` mkdir .actor && touch .actor/{actor.json,input_schema.json} ``` Move the `Dockerfile` from the root folder to `.actor`. ``` mv Dockerfile .actor ``` Let's fill in the empty files: The `actor.json` file contains project metadata for the Apify platform. Follow the [documentation for proper configuration](https://docs.apify.com/platform/actors/development/actor-definition/actor-json): ``` { "actorSpecification": 1, "name": "YouTube-Crawlee", "title": "YouTube - Crawlee", "minMemoryMbytes": 2048, "description": "Scrape video stats, metadata and transcripts from videos in YouTube channels", "version": "0.1", "meta": { "templateId": "youtube-crawlee" }, "input": "./input_schema.json", "dockerfile": "./Dockerfile" } ``` Actor input parameters are defined using `input_schema.json`, which is specified [here](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1). Let's define input parameters for our crawler: * `maxItems` - maximum number of videos per channel for scraping. * `channelNames` - these are the YouTube channel names to scrape. * `proxySettings` - proxy settings, since without a proxy, you'll be using the datacenter IP that Apify uses. ``` { "title": "YouTube Crawlee", "type": "object", "schemaVersion": 1, "properties": { "channelNames": { "title": "List Channel Names", "type": "array", "description": "Channel names for extraction video stats, metadata and transcripts.", "editor": "stringList", "prefill": ["Apify"] }, "maxItems": { "type": "integer", "editor": "number", "title": "Limit search results", "description": "Limits the maximum number of results, applies to each search separately.", "default": 10 }, "proxySettings": { "title": "Proxy configuration", "type": "object", "description": "Select proxies to be used by your scraper.", "prefill": { "useApifyProxy": true }, "editor": "proxy" } }, "required": ["channelNames"] } ``` Let's update the code to accept input parameters. 
``` # main.py async def main() -> None: """The crawler entry point.""" async with Actor: # Get the input parameters from the Actor actor_input = await Actor.get_input() max_items = actor_input.get('maxItems', 0) channels = actor_input.get('channelNames', []) proxy = await Actor.create_proxy_configuration(actor_proxy_input=actor_input.get('proxySettings')) crawler = PlaywrightCrawler( concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50), request_handler=router, request_handler_timeout=timedelta(seconds=120), headless=True, max_requests_per_crawl=100, proxy_configuration=proxy ) ``` And delete export to JSON from the `main` function, as the Apify platform will handle data storage in the [Dataset](https://docs.apify.com/platform/storage/dataset). That's it, the project is ready for deployment. ## 7. Deploying to Apify[​](#7-deploying-to-apify "Direct link to 7. Deploying to Apify") Use the official [Apify CLI](https://docs.apify.com/cli/) to upload your code: Authenticate using your API token from [Apify Console](https://console.apify.com/settings/integrations): ``` apify login ``` Choose "Enter API token manually" and paste your token. Push the project to the platform: ``` apify push ``` Now you can configure runs on the Apify platform. Let's perform a test run: Fill in the input parameters: ![Actor Input](/assets/images/input_actor-6bdab40eb022bcb34ad63da770f4dcea.webp) View results in the dataset: ![Dataset Results](/assets/images/actor_results-36a0c08c154c59a9fb3887222c5926f2.webp) If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this [publishing guide](https://docs.apify.com/platform/actors/publishing) for [Apify Store](https://apify.com/store). ## Conclusion[​](#conclusion "Direct link to Conclusion") We've created a good foundation for crawling YouTube using Crawlee for Python and Playwright. If you're just starting your journey in crawling, this will be an excellent project for learning and practice. You can use it as a basis for creating more complex crawlers that will collect data from YouTube. If this is your first project using Crawlee for Python, check out all the documentation links provided in this article; it will help you better understand how Crawlee for Python works and how you can use it for your projects. You can find the complete code in the [repository](https://github.com/Mantisus/youtube-crawlee) If you enjoyed this blog, feel free to support Crawlee for Python by starring the [repository](https://github.com/apify/crawlee-python) or joining the maintainer team. Do you have questions or want to discuss the details of the implementation? Join our [Discord](https://discord.com/invite/jyEM2PRvMU)—our community of 11,000+ developers is there to help. **Tags:** * [community](https://crawlee.dev/blog/tags/community.md) --- # Web scraping of a dynamic website using Python with HTTP Client September 12, 2024 · 15 min read [![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert Dynamic websites that use JavaScript for content rendering and backend interaction often create challenges for web scraping. The traditional approach to solving this problem is browser emulation, but it's not very efficient in terms of resource consumption. note One of our community members wrote this blog as a contribution to Crawlee Blog. 
If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [discord channel](https://apify.com/discord).

In this article, we'll explore an alternative method based on in-depth site analysis and the use of an HTTP client. We'll go through the entire process from analyzing a dynamic website to implementing an efficient web crawler using the [`Crawlee for Python`](https://www.crawlee.dev/python/) framework.

![How to scrape dynamic websites in Python](/assets/images/dynamic-websites-d9a83deff0729330b2d3de2d1481cd6a.webp)

## What you'll learn in this tutorial[​](#what-youll-learn-in-this-tutorial "Direct link to What you'll learn in this tutorial")

Our subject of study is the [Accommodation for Students](https://www.accommodationforstudents.com) website. Using this example, we'll examine the specifics of analyzing sites built with the Next.js framework and implement a crawler capable of efficiently extracting data without using browser emulation.

By the end of this article, you will have:

* A clear understanding of how to analyze sites with dynamic content rendered using JavaScript.
* An understanding of how to implement a crawler based on Crawlee for Python.
* Insight into some of the details of working with sites that use [`Next.js`](https://nextjs.org/).
* A link to a GitHub repository with the full crawler implementation code.

## Website analysis[​](#website-analysis "Direct link to Website analysis")

To track all requests, open your Dev Tools on the `Network` tab before entering the site, since some data may be transmitted only when the site is first opened. As the site is intended for students in the UK, let's go to London. We'll start the analysis from the [search page](https://www.accommodationforstudents.com/search-results?location=London&beds=0&occupancy=min&minPrice=0&maxPrice=500&latitude=51.509865&longitude=-0.118092&geo=false&page=1).

Interacting with elements on the page, you'll quickly notice a request of this type:

```
https://www.accommodationforstudents.com/search?limit=22&skip=0&random=false&mode=text&numberOfBedrooms=0&occupancy=min&countryCode=gb&location=London&sortBy=price&order=asc
```

![Request type](/assets/images/request-185e9cf4845c0b0f07c004d155563ea7.webp)

If we look at the format of the received response, we'll immediately notice that it comes in [`JSON`](https://www.json.org/json-en.html) format.

![JSON response](/assets/images/json-a85571ceba8b80c314af9a159db15511.webp)

Great, we're getting data in a structured format that's very convenient to work with. We can see the total number of results, and the links to the listings are in the `url` attribute of each `properties` element.

Let's also take a look at the server response headers.
![server response]
* `content-type: application/json; charset=utf-8` - It tells us that the server response comes in JSON format, which we've already confirmed visually.
* `content-encoding: gzip` - It tells us that the response was compressed using [`gzip`](https://www.gnu.org/software/gzip/), and therefore we should use appropriate decompression in our crawler.
* `server: cloudflare` - The site is hosted on [Cloudflare](https://www.cloudflare.com/) servers and uses their protection. We should consider this when creating our crawler.

Great, let's also look at the parameters used in the search API request and make hypotheses about what they're responsible for:

* `limit: 22` - The number of elements we get per request.
* `skip: 0` - The element from which we'll start getting data; important for pagination.
* `random: false` - We don't change the random sorting, as we benefit from strict sorting.
* `mode: text` - An unusual parameter. If you decide to conduct several experiments, you'll find that it can take the following values: `text`, `fallback`, `geo`. Interestingly, the `geo` value completely changes the output, returning about 5400 options. I assume it's used to search by coordinates, and if we don't pass any coordinates, we get all the available results.
* `numberOfBedrooms: 0` - Filter by the number of bedrooms.
* `occupancy: min` - Filter by occupancy.
* `countryCode: gb` - The country code; in our case, it's Great Britain.
* `location: London` - The search location.
* `sortBy: price` - The field by which sorting is performed.
* `order: asc` - The sorting order.

But there's another important point to pay attention to. Let's look at our link in the browser bar, which looks like this:

```
https://www.accommodationforstudents.com/search-results?location=London&beds=0&occupancy=min&minPrice=0&maxPrice=500&latitude=51.509865&longitude=-0.118092&geo=false&page=1
```

In it, we see the coordinate parameters `latitude` and `longitude`, which don't participate in any way when interacting with the backend, and the `geo` parameter with a `false` value. This also confirms our hypothesis regarding the `mode` parameter, and it is quite useful if you want to extract all data from the site.

Great. We can get the site's search data in a convenient JSON format, and we have flexible parameters that guarantee data extraction, whether we want everything available on the site or only listings for a specific city. Let's move on to analyzing the property page.
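Before we do, here's a minimal sketch of what a direct request to this search endpoint might look like. It assumes the `httpx` library, which is not part of the article's final crawler, and it reuses the parameter and key names (`properties`, `url`) we observed in DevTools; it's only an illustration of the API shape, since a plain HTTP client may well be blocked by Cloudflare.

```
# Illustrative only: query the search endpoint found in DevTools.
# Assumes the `httpx` package; key names (`properties`, `url`) are taken
# from the response observed above and may differ in practice.
import httpx

params = {
    'limit': 22,
    'skip': 0,  # offset used for pagination
    'random': 'false',
    'mode': 'text',
    'numberOfBedrooms': 0,
    'occupancy': 'min',
    'countryCode': 'gb',
    'location': 'London',
    'sortBy': 'price',
    'order': 'asc',
}

response = httpx.get('https://www.accommodationforstudents.com/search', params=params)
response.raise_for_status()

data = response.json()  # httpx transparently decodes the gzip-encoded body
for listing in data.get('properties', []):
    print(listing.get('url'))
```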
Since the listing opens in a new window after clicking on it, make sure you have the `Auto-open DevTools for popups` option enabled in Dev Tools.

Unfortunately, after analyzing all requests, we don't see any interesting interaction with the backend. All listing data is obtained in one request containing HTML code and JSON elements.

![Listing data contained in HTML code and JSON elements](/assets/images/listing-d32f6a4dabca952c5150d3e4705028fb.webp)

After carefully studying the page's source code, we can say that all the data we're interested in is in the JSON located in the `script` tag that has an `id` attribute with the value `__NEXT_DATA__`. We can easily extract this JSON using a regular expression or an HTML parser.

We already have everything necessary to build the crawler at this analysis stage. We know how to get data from the search, how pagination works, how to go from the search to the listing page, and where to extract the data we're interested in on the listing page. But there's one obvious inconvenience: we get search data as JSON, while listing data comes as an HTML page with the JSON embedded inside it. This isn't a problem, but it is an inconvenience and means higher traffic consumption, as such an HTML page weighs much more than just the JSON. Let's continue our analysis.

The data in `__NEXT_DATA__` signals that the site uses the Next.js framework. Each framework has its own established internal patterns, parameters, and features. Let's analyze the listing page again by refreshing it and looking at the `.js` files we receive.

![Javascript files](/assets/images/javascript-52ed58b7cb5fca440f94193ff7687de3.webp)

We're interested in the file containing `_buildManifest.js` in its name. The link to it changes regularly, so I'll provide an example:

```
https://www.accommodationforstudents.com/_next/static/B5yLvSqNOvFysuIu10hQ5/_buildManifest.js
```

This file contains all possible routes available on the site. After careful study, we can see a link format like `/property/[id]`, which is clearly related to the property page. After reading more about Next.js, we can get the final link: `https://www.accommodationforstudents.com/_next/data/[build_id]/property/[id].json`. This link has two variables:

1. `build_id` - The current build of the `Next.js` application; it can be obtained from `__NEXT_DATA__` on any application page. In the example link for `_buildManifest.js`, its value is `B5yLvSqNOvFysuIu10hQ5`.
2. `id` - The identifier of the property object whose data we're interested in.

Let's form a link and study the result in the browser.

![Study the result in browser](/assets/images/result-64a7188999fd127f0b6e26bf94a4a7e5.webp)

As you can see, we now get the listing results in JSON format. Search also goes through `Next.js`, so let's get a similar link for it, so that our future crawler interacts with only one API. It transforms from the link you see in the browser bar and will look like this:

```
https://www.accommodationforstudents.com/_next/data/[build_id]/search-results.json?location=[location]&page=[page]
```

I think you immediately noticed that I excluded some of the search parameters; I did this because we simply don't need them. Coordinates aren't used in basic interaction with the backend. I plan for the crawler to search by location, so I keep the location and pagination parameters.

Let's summarize our analysis:

1. For search pages, we'll use links of the format `https://www.accommodationforstudents.com/_next/data/[build_id]/search-results.json?location=[location]&page=[page]`
2. For listing pages, we'll use links of the format `https://www.accommodationforstudents.com/_next/data/[build_id]/property/[id].json`
3. We need to get the `build_id`; we'll use the main page of the site and a simple regular expression for this.
4. We need an HTTP client that allows bypassing Cloudflare, and we don't need any HTML parsers, as we'll get all target data from JSON.

## Crawler implementation[​](#crawler-implementation "Direct link to Crawler implementation")

I'm using Crawlee for Python version `0.3.5`. This is important, as the library is developing actively and will have more capabilities in higher versions, but this is an ideal moment to show how we can work with it for complex projects.

The library already has support for an HTTP client that allows bypassing Cloudflare - [`CurlImpersonateHttpClient`](https://github.com/apify/crawlee-python/blob/v0.3.6/src/crawlee/http_clients/curl_impersonate.py). Since we have to work with JSON responses, we could use [`parsel_crawler`](https://github.com/apify/crawlee-python/tree/v0.3.5/src/crawlee/parsel_crawler), added in version `0.3.0`, but I think this is excessive for such tasks; besides, I like the high speed of [`orjson`](https://github.com/ijl/orjson). Therefore, we'll need to implement our own crawler rather than using one of the ready-made ones. As a sample crawler, we'll use [`beautifulsoup_crawler`](https://github.com/apify/crawlee-python/tree/v0.3.5/src/crawlee/beautifulsoup_crawler).

Let's install the necessary dependencies.

```
pip install crawlee[curl-impersonate]==0.3.5
pip install "orjson>=3.10.7,<4.0.0"
```

I'm using [`orjson`](https://pypi.org/project/orjson/) instead of the standard [`json`](https://docs.python.org/3/library/json.html) module due to its high performance, which is especially noticeable in asynchronous applications.

Well, let's implement our `custom_crawler`. Let's define the `CustomContext` class with the necessary attributes.

```
# custom_context.py
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

from crawlee.basic_crawler import BasicCrawlingContext
from crawlee.http_crawler import HttpCrawlingResult

if TYPE_CHECKING:
    from collections.abc import Callable


@dataclass(frozen=True)
class CustomContext(HttpCrawlingResult, BasicCrawlingContext):
    """Crawling context used by CustomCrawler."""

    page_data: dict | None

    # Not `EnqueueLinksFunction`, because we are breaking the protocol:
    # we are not working with HTML and we are not using selectors.
    enqueue_links: Callable
```

Note that in my context, `enqueue_links` is just `Callable`, not [`EnqueueLinksFunction`](https://github.com/apify/crawlee-python/blob/v0.3.5/src/crawlee/_types.py#L162). This is because we won't be using selectors and extracting links from HTML, which violates the agreed protocol. Still, I want the syntax in my crawler to be as close to the standardized one as possible.

Let's move on to the crawler functionality in the `CustomCrawler` class.
``` # custom_crawler.py from __future__ import annotations import logging from re import search from typing import TYPE_CHECKING, Any, Unpack from crawlee import Request from crawlee.basic_crawler import ( BasicCrawler, BasicCrawlerOptions, BasicCrawlingContext, ContextPipeline, ) from crawlee.errors import SessionError from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient from crawlee.http_crawler import HttpCrawlingContext from orjson import loads from afs_crawlee.constants import BASE_TEMPLATE, HEADERS from .custom_context import CustomContext if TYPE_CHECKING: from collections.abc import AsyncGenerator, Iterable class CustomCrawler(BasicCrawler[CustomContext]): """A crawler that fetches the request URL using `curl_impersonate` and parses the result with `orjson` and `re`.""" def __init__( self, *, impersonate: str = 'chrome124', additional_http_error_status_codes: Iterable[int] = (), ignore_http_error_status_codes: Iterable[int] = (), **kwargs: Unpack[BasicCrawlerOptions[CustomContext]], ) -> None: self._build_id = None self._base_url = BASE_TEMPLATE kwargs['_context_pipeline'] = ( ContextPipeline() .compose(self._make_http_request) .compose(self._handle_blocked_request) .compose(self._parse_http_response) ) # Initialize curl_impersonate http client using TLS preset and necessary headers kwargs.setdefault( 'http_client', CurlImpersonateHttpClient( additional_http_error_status_codes=additional_http_error_status_codes, ignore_http_error_status_codes=ignore_http_error_status_codes, impersonate=impersonate, headers=HEADERS, ), ) kwargs.setdefault('_logger', logging.getLogger(__name__)) super().__init__(**kwargs) ``` In `__init__`, we define that we'll use `CurlImpersonateHttpClient` as the `http_client`. Another important element is `_context_pipeline`, which defines the sequence of methods through which our context passes. `_make_http_request` - is completely identical to `BeautifulSoupCrawler` `_handle_blocked_request` - since we get all data through the API, only the server response status will signal about blocking. 
``` async def _handle_blocked_request(self, crawling_context: CustomContext) -> AsyncGenerator[CustomContext, None]: if self._retry_on_blocked: status_code = crawling_context.http_response.status_code if crawling_context.session and crawling_context.session.is_blocked_status_code(status_code=status_code): raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}') yield crawling_context ``` `_parse_http_response` - a function that encapsulates the main logic of parsing responses ``` async def _parse_http_response(self, context: HttpCrawlingContext) -> AsyncGenerator[CustomContext, None]: page_data = None if context.http_response.headers['content-type'] == 'text/html; charset=utf-8': # Get Build ID for Next js from the start page of the site, form a link to next.js endpoints build_id = search(rb'"buildId":"(.{21})"', context.http_response.read()).group(1) self._build_id = build_id.decode('UTF-8') self._base_url = self._base_url.format(build_id=self._build_id) else: # Convert json to python dictionary page_data = context.http_response.read() page_data = page_data.decode('ISO-8859-1').encode('utf-8') page_data = loads(page_data) async def enqueue_links( *, path_template: str, items: list[str], user_data: dict[str, Any] | None = None, label: str | None = None ) -> None: requests = list[Request]() user_data = user_data if user_data else {} for item in items: link_user_data = user_data.copy() if label is not None: link_user_data.setdefault('label', label) if link_user_data.get('label') == 'SEARCH': link_user_data['location'] = item url = self._base_url + path_template.format(item=item, **user_data) requests.append(Request.from_url(url, user_data=link_user_data)) await context.add_requests(requests) yield CustomContext( request=context.request, session=context.session, proxy_info=context.proxy_info, enqueue_links=enqueue_links, add_requests=context.add_requests, send_request=context.send_request, push_data=context.push_data, log=context.log, http_response=context.http_response, page_data=page_data, ) ``` As you can see, if the server response comes in HTML, we get the `build_id` using a simple regular expression. This condition should be executed once for the first link and is necessary to interact further with the Next.js API. In all other cases, we simply convert JSON to a Python `dict` and save it in the context. In `enqueue_links`, I create logic for generating links based on string templates and input parameters. That's it: our custom Crawler Class for Crawlee for Python is ready, it's based on the `CurlImpersonateHttpClient` client, works with JSON responses instead of HTML, and implements the link generation logic we need. Let's finalize it by defining public classes for import. ``` # init.py from .custom_crawler import CustomCrawler from .types import CustomContext __all__ = ['CustomCrawler', 'CustomContext'] ``` Now that we have the crawler functionality, let's implement routing and data extraction from the site. We'll use the [`official documentation`](https://www.crawlee.dev/python/docs/introduction/refactoring) as a template. 
``` # router.py from crawlee.router import Router from .constants import LISTING_PATH, SEARCH_PATH, TARGET_LOCATIONS from .custom_crawler import CustomContext router = Router[CustomContext]() @router.default_handler async def default_handler(context: CustomContext) -> None: """Handle the start URL to get the Build ID and create search links.""" context.log.info(f'default_handler is processing {context.request.url}') await context.enqueue_links( path_template=SEARCH_PATH, items=TARGET_LOCATIONS, label='SEARCH', user_data={'page': 1} ) @router.handler('SEARCH') async def search_handler(context: CustomContext) -> None: """Handle the SEARCH URL generates links to listings and to the next search page.""" context.log.info(f'search_handler is processing {context.request.url}') max_pages = context.page_data['pageProps']['initialPageCount'] current_page = context.request.user_data['page'] if current_page < max_pages: await context.enqueue_links( path_template=SEARCH_PATH, items=[context.request.user_data['location']], label='SEARCH', user_data={'page': current_page + 1}, ) else: context.log.info(f'Last page for {context.request.user_data["location"]} location') listing_ids = [ listing['property']['id'] for group in context.page_data['pageProps']['initialListings']['groups'] for listing in group['results'] if listing.get('property') ] await context.enqueue_links(path_template=LISTING_PATH, items=listing_ids, label='LISTING') @router.handler('LISTING') async def listing_handler(context: CustomContext) -> None: """Handle the LISTING URL extracts data from the listings and saving it to a dataset.""" context.log.info(f'listing_handler is processing {context.request.url}') listing_data = context.page_data['pageProps']['viewModel']['propertyDetails'] if not listing_data['exists']: context.log.info(f'listing_handler, data is not available for url {context.request.url}') return property_data = { 'property_id': listing_data['id'], 'property_type': listing_data['propertyType'], 'location_latitude': listing_data['coordinates']['lat'], 'location_longitude': listing_data['coordinates']['lng'], 'address1': listing_data['address']['address1'], 'address2': listing_data['address']['address2'], 'city': listing_data['address']['city'], 'postcode': listing_data['address']['postcode'], 'bills_included': listing_data.get('terms', {}).get('billsIncluded'), 'description': listing_data.get('description'), 'bathrooms': listing_data.get('numberOfBathrooms'), 'number_rooms': len(listing_data['rooms']) if listing_data.get('rooms') else None, 'rent_ppw': listing_data.get('terms', {}).get('rentPpw', {}).get('value', None), } await context.push_data(property_data) ``` Let's define our `main` function, which will launch the crawler. ``` # main.py from .custom_crawler import CustomCrawler from .router import router async def main() -> None: """The main function that starts crawling.""" crawler = CustomCrawler(max_requests_per_crawl=50, request_handler=router) # Run the crawler with the initial list of URLs. await crawler.run(['https://www.accommodationforstudents.com/']) await crawler.export_data('results.json') ``` Let's look at the results. ![Final results file](/assets/images/final-results-f14b378f9aa0cbd5d1185301c49a222e.webp) As I prefer to manage my projects as packages and use `pyproject.toml` according to [PEP 518](https://peps.python.org/pep-0518/), the final structure of our project will look like this. 
![PEP 518 file structure]
## Conclusion[​](#conclusion "Direct link to Conclusion")

In this project, we went through the entire cycle of crawler development, from analyzing a rather interesting dynamic site to a full implementation of a crawler using `Crawlee for Python`. You can view the full project code on [GitHub](https://github.com/Mantisus/crawlee_python_example).

I would also like to hear your comments and thoughts on which web scraping topic you'd like to see covered in the next article. Feel free to comment here in the article or contact me in the [Crawlee developer community](https://apify.com/discord) on Discord.

If you are looking to learn how to start scraping using Crawlee for Python, check out our [latest tutorial here](https://blog.apify.com/crawlee-for-python-tutorial/).

You can find me on the following platforms: [GitHub](https://github.com/Mantisus), [LinkedIn](https://www.linkedin.com/in/max-bohomolov/), [Apify](https://apify.com/mantisus), [Upwork](https://www.upwork.com/freelancers/mantisus), [Contra](https://contra.com/mantisus).

Thank you for your attention. I hope you found this information useful.

**Tags:**

* [community](https://crawlee.dev/blog/tags/community.md)

---

# Scrapy vs. Crawlee

April 23, 2024 · 12 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

Hey, crawling masters! Welcome to another post on the Crawlee blog; this time, we are going to compare Scrapy, one of the oldest and most popular web scraping libraries in the world, with Crawlee, a relative newcomer. This article will answer your questions about when to use Scrapy and help you decide when it would be better to use Crawlee instead. This article will be the first in a series comparing various technical aspects of Crawlee with Scrapy.

## Introduction:[​](#introduction "Direct link to Introduction:")

[Scrapy](https://scrapy.org/) is an open-source Python-based web scraping framework that extracts data from websites. With Scrapy, you create spiders, which are autonomous scripts to download and process web content. The limitation of Scrapy is that it does not work very well with JavaScript rendered websites, as it was designed for static HTML pages. We will come back to this comparison later in the article.

Crawlee is also an open-source library that originated as the [Apify SDK](https://docs.apify.com/sdk/js/). Crawlee has the advantage of being the newest library in the market, so it already has many features that Scrapy lacks, like autoscaling, headless browsing, working with JavaScript rendered websites without any plugins, and many more features, which we are going to explain later on.

## Feature comparison[​](#feature-comparison "Direct link to Feature comparison")

We'll start comparing Scrapy and Crawlee by looking at language and development environments, and then at features that make the scraping process easier for developers, like autoscaling, headless browsing, queue management, and more.

### Language and development environments[​](#language-and-development-environments "Direct link to Language and development environments")

Scrapy is written in Python, making it easier for the data science community to integrate it with various tools. While Scrapy offers very detailed documentation, it can take a lot of work to get started with Scrapy.
One of the reasons why it is considered not so beginner-friendly[\[1\]](https://towardsdatascience.com/web-scraping-with-scrapy-theoretical-understanding-f8639a25d9cd)[\[2\]](https://www.accordbox.com/blog/scrapy-tutorial-1-scrapy-vs-beautiful-soup/#:~:text=Since%20Scrapy%20does%20no%20only,to%20become%20a%20Scrapy%20expert.)[\[3\]](https://www.udemy.com/tutorial/scrapy-tutorial-web-scraping-with-python/scrapy-vs-beautiful-soup-vs-selenium//1000) is its [complex architecture](https://docs.scrapy.org/en/latest/topics/architecture.html), which consists of various components like spiders, middleware, item pipelines, and settings. These can be challenging for beginners.

Crawlee is one of the few web scraping and automation libraries that supports JavaScript and TypeScript. Crawlee supports a CLI just like Scrapy, but it also provides [pre-built templates](https://github.com/apify/crawlee/tree/master/packages/templates/templates) in TypeScript and JavaScript with support for Playwright and Puppeteer. These templates help beginners quickly understand the file structure and how it works.

### Headless browsing and JS rendering[​](#headless-browsing-and-js-rendering "Direct link to Headless browsing and JS rendering")

Scrapy does not support headless browsers natively, but it supports them through its plugin system. Similarly, it does not support scraping JavaScript rendered websites out of the box, but its plugin system makes this possible. One of the best examples is its [Playwright plugin](https://github.com/scrapy-plugins/scrapy-playwright/tree/main).

Apify Store is a JavaScript rendered website, so we will scrape it in this example using the `scrapy-playwright` integration. For installation and the changes to `settings.py`, please follow the instructions in the `scrapy-playwright` [repository on GitHub](https://github.com/scrapy-plugins/scrapy-playwright/tree/main?tab=readme-ov-file#installation). Then, create a spider with this code to scrape the data:

spider.py

```
import scrapy


class ActorSpider(scrapy.Spider):
    name = 'actor_spider'
    start_urls = ['https://apify.com/store']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"playwright": True, "playwright_include_page": True},
                callback=self.parse_playwright
            )

    async def parse_playwright(self, response):
        page = response.meta['playwright_page']
        await page.wait_for_selector('.ActorStoreItem-title-wrapper')
        actor_card = await page.query_selector('.ActorStoreItem-title-wrapper')
        if actor_card:
            actor_text = await actor_card.text_content()
            yield {
                'actor': actor_text.strip() if actor_text else 'N/A'
            }
        await page.close()
```

One of the drawbacks of this plugin is its [lack of native support for Windows](https://github.com/scrapy-plugins/scrapy-playwright/tree/main?tab=readme-ov-file#lack-of-native-support-for-windows).

In Crawlee, you can scrape JavaScript rendered websites using the built-in headless [Puppeteer](https://github.com/puppeteer/puppeteer/) and [Playwright](https://github.com/microsoft/playwright) browsers. It is important to note that, by default, Crawlee scrapes in headless mode. If you don't want headless, then just set `headless: false`.
* Playwright * Puppeteer crawler.js ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ page }) { const actorCard = page.locator('.ActorStoreItem-title-wrapper').first(); const actorText = await actorCard.textContent(); await crawler.pushData({ 'actor': actorText }); }, }); await crawler.run(['https://apify.com/store']); ``` crawler.js ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ page }) { await page.waitForSelector('.ActorStoreItem-title-wrapper'); const actorText = await page.$eval('.ActorStoreItem-title-wrapper', (el) => { return el.textContent; }); await crawler.pushData({ 'actor': actorText }); }, }); await crawler.run(['https://apify.com/store']); ``` ### Autoscaling support[​](#autoscaling-support "Direct link to Autoscaling support") Autoscaling refers to the capability of a library to automatically adjusting the number of concurrent tasks (such as browser instances, HTTP requests, etc.) based on the current load and system resources. This feature is particularly useful when handling web scraping and crawling tasks that may require dynamically scaled resources to optimize performance, manage system load, and handle rate limitations efficiently. Scrapy does not have built-in autoscaling capabilities, but it can be done using external services like [Scrapyd](https://scrapyd.readthedocs.io/en/latest/) or deployed in a distributed manner with Scrapy Cluster. Crawlee has [built-in autoscaling](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) with `AutoscaledPool`. It increases the number of requests that are processed concurrently within one crawler. ### Queue management[​](#queue-management "Direct link to Queue management") Scrapy supports both breadth-first and depth-first crawling strategies using a disk-based queuing system. By default, it uses the LIFO queue for the pending requests, which means it is using depth-first order, but if you want to use breadth-first order, you can do it by changing these settings: settings.py ``` DEPTH_PRIORITY = 1 SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue" SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue" ``` Crawlee uses breadth-first by default and you can override it on a per-request basis by using the `forefront: true` argument in `addRequest` and its derivatives. If you use `forefront: true` for all requests, it becomes a depth-first process. ### CLI support[​](#cli-support "Direct link to CLI support") Scrapy has a [powerful command-line interface](https://docs.scrapy.org/en/latest/topics/commands.html#command-line-tool) that offers functionalities like starting a project, generating spiders, and controlling the crawling process. Scrapy CLI comes with Scrapy. Just run this command, and you are good to go: ``` pip install scrapy ``` Crawlee also [includes a CLI tool](https://crawlee.dev/js/docs/quick-start.md#installation-with-crawlee-cli) (`crawlee-cli`) that facilitates project setup, crawler creation and execution, streamlining the development process for users familiar with Node.js environments. The command for installation is: ``` npx crawlee create my-crawler ``` ### Proxy rotation and storage management[​](#proxy-rotation-and-storage-management "Direct link to Proxy rotation and storage management") Scrapy handles it via custom middleware. You have to install their [`scrapy-rotating-proxies`](https://pypi.org/project/scrapy-rotating-proxies/) package using pip. 
### CLI support[​](#cli-support "Direct link to CLI support")

Scrapy has a [powerful command-line interface](https://docs.scrapy.org/en/latest/topics/commands.html#command-line-tool) that offers functionalities like starting a project, generating spiders, and controlling the crawling process. The Scrapy CLI ships with Scrapy itself. Just run this command, and you are good to go:

```
pip install scrapy
```

Crawlee also [includes a CLI tool](https://crawlee.dev/js/docs/quick-start.md#installation-with-crawlee-cli) (`crawlee-cli`) that facilitates project setup, crawler creation, and execution, streamlining the development process for users familiar with Node.js environments. To scaffold a new project, run:

```
npx crawlee create my-crawler
```

### Proxy rotation and storage management[​](#proxy-rotation-and-storage-management "Direct link to Proxy rotation and storage management")

Scrapy handles proxy rotation via custom middleware. You have to install the [`scrapy-rotating-proxies`](https://pypi.org/project/scrapy-rotating-proxies/) package using pip:

```
pip install scrapy-rotating-proxies
```

Then, in the `settings.py` file, add the middleware to `DOWNLOADER_MIDDLEWARES` and specify the list of proxy servers in `ROTATING_PROXY_LIST`. For example:

settings.py

```
DOWNLOADER_MIDDLEWARES = {
    # Lower value means higher priority
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'scrapy_rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # Add more proxies as needed
]
```

Now create a spider with the scraping code for any site, and the `ROTATING_PROXY_LIST` in `settings.py` will determine which proxy is used for each request. The middleware initially treats every proxy as valid. When a request is made, it selects a proxy from the list of available proxies; the selection isn't purely sequential but is influenced by the recent history of proxy performance. The middleware has mechanisms to detect when a proxy might be banned or rendered ineffective. When such conditions are detected, the proxy is temporarily deactivated and put into a cooldown period. After the cooldown period expires, the proxy is reconsidered for use.

In Crawlee, you can [use your own proxy servers](https://crawlee.dev/js/docs/guides/proxy-management.md) or proxy servers acquired from third-party providers. If you already have your proxy URLs, you can start using them like this:

crawler.js

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com',
        'http://proxy2.example.com',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // ...
});
```

Crawlee also has [`SessionPool`](https://crawlee.dev/js/api/core/class/SessionPool.md), a built-in allocation system for proxies. It handles the rotation, creation, and persistence of user-like sessions, creating a pool of session instances that are rotated randomly.

### Data storage[​](#data-storage "Direct link to Data storage")

One of the most frequently required features when implementing scrapers is being able to store the scraped data as an "export file". Scrapy provides this functionality out of the box with [`Feed Exports`](https://docs.scrapy.org/en/latest/topics/feed-exports.html), which let it generate feeds with the scraped items using multiple serialization formats and storage backends. It supports CSV, JSON, JSON Lines, and XML. To use it, modify your `settings.py` file and enter:

settings.py

```
# To store in CSV format
FEEDS = {
    'data/crawl_data.csv': {'format': 'csv', 'overwrite': True}
}

# OR to store in JSON format
FEEDS = {
    'data/crawl_data.json': {'format': 'json', 'overwrite': True}
}
```

Crawlee's storage can be divided into two categories: request storage (Request Queue and Request List) and results storage (Datasets and Key-Value Stores). Both are stored locally, by default in the `./storage` directory. Also, remember that Crawlee, by default, clears its storages before starting a crawler run. This prevents old data from interfering with new crawling sessions.
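If you need to keep data from previous runs, this purge can be disabled. One way to do it is via the `purgeOnStart` configuration option; a minimal sketch (the request handler body is illustrative):

```
import { CheerioCrawler, Configuration } from 'crawlee';

// Disable the automatic purge of ./storage at startup.
// (Setting the CRAWLEE_PURGE_ON_START environment variable to 0 works too.)
const config = new Configuration({ purgeOnStart: false });

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
}, config);
```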
Let's see how Crawlee stores the result:

* You can use local storage with a dataset

crawler.js

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        const title = await page.title();
        const price = await page.textContent('.price');
        await crawler.pushData({
            url: request.url,
            title,
            price,
        });
    },
});

await crawler.run(['http://example.com']);
```

* Using a Key-Value Store

crawler.js

```
import { KeyValueStore } from 'crawlee';

// ... Code to crawl the data
await KeyValueStore.setValue('key', { foo: 'bar' });
```

### Anti-blocking and fingerprints[​](#anti-blocking-and-fingerprints "Direct link to Anti-blocking and fingerprints")

In Scrapy, handling anti-blocking strategies like [IP rotation](https://pypi.org/project/scrapy-rotated-proxy/) and [user-agent rotation](https://python.plainenglish.io/rotating-user-agent-with-scrapy-78ca141969fe) requires custom solutions via middleware and plugins.

Crawlee provides HTTP crawling and [browser fingerprints](https://crawlee.dev/js/docs/guides/avoid-blocking.md) with zero configuration necessary; fingerprints are enabled by default and available in `PlaywrightCrawler` and `PuppeteerCrawler`, but they also work with `CheerioCrawler` and the other HTTP crawlers.

### Error handling[​](#error-handling "Direct link to Error handling")

Both libraries support error-handling practices like automatic retries, logging, and custom error handling.

In Scrapy, you can handle errors using middleware and [signals](https://docs.scrapy.org/en/latest/topics/signals.html). There are also [exceptions](https://docs.scrapy.org/en/latest/topics/exceptions.html) like `IgnoreRequest`, which can be raised by the Scheduler or any downloader middleware to indicate that the request should be ignored. Similarly, a spider callback can raise `CloseSpider` to close the spider.

Scrapy has built-in support for retrying failed requests. You can configure the retry policy (e.g., the number of retries, retrying on particular HTTP codes) via settings such as `RETRY_TIMES`, as shown in the example:

settings.py

```
RETRY_ENABLED = True
RETRY_TIMES = 2  # Number of retry attempts
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524]  # HTTP error codes to retry
```

In Crawlee, you can also set up a custom error handler. For retries, `maxRequestRetries` controls how often Crawlee will retry a request before marking it as failed. To set it up, you just need to add the following option to your crawler:

crawler.js

```
const crawler = new CheerioCrawler({
    maxRequestRetries: 3, // Crawler will retry a failed request three times.
    // ...
});
```

There is also `noRetry`. If set to `true`, the request will not be automatically retried. Crawlee also provides a built-in [logging mechanism](https://crawlee.dev/js/api/core/class/Log.md) via `log`, allowing you to log warnings, errors, and other information effectively.

### Deployment using Docker[​](#deployment-using-docker "Direct link to Deployment using Docker")

Scrapy can be containerized using Docker, though it typically requires manual setup to create Dockerfiles and configure environments. Crawlee, on the other hand, includes [ready-to-use Docker configurations](https://crawlee.dev/js/docs/guides/docker-images.md), making deployment straightforward across various environments without additional configuration.

## Community[​](#community "Direct link to Community")

Both projects are open source. Scrapy benefits from a large and well-established community.
It has been around since 2008 and has attracted a lot of attention among developers, particularly those in the Python ecosystem. Crawlee started its journey as the Apify SDK in 2018. It now has more than [12K stars on GitHub](https://github.com/apify/crawlee), a community of more than 7,000 developers in its [Discord Community](https://apify.com/discord), and is used by the TypeScript and JavaScript community.

## So which is better - Scrapy or Crawlee?[​](#so-which-is-better---scrapy-or-crawlee "Direct link to So which is better - Scrapy or Crawlee?")

Both frameworks can handle a wide range of scraping tasks, and the best choice depends on specific technical needs like language preference, project requirements, and ease of use.

If you are comfortable with Python and want to work only with it, go with Scrapy. It has very detailed documentation, and it is one of the oldest and most stable libraries in the space.

But if you want to explore, or are already comfortable with, TypeScript or JavaScript, our recommendation is Crawlee. With valuable features like a single interface for HTTP requests and headless browsing (which makes it work well with JavaScript-rendered websites), autoscaling, and fingerprint support, it is an excellent choice for scraping websites that are complex, resource-intensive, JavaScript-heavy, or protected by blocking methods.

As promised, this is just the first of many articles comparing Scrapy and Crawlee. In upcoming articles, you will learn more about the technical details. Meanwhile, if you want to learn more about Crawlee, read our [introduction to Crawlee](https://crawlee.dev/js/docs/introduction.md) or Apify's [Crawlee web scraping tutorial](https://blog.apify.com/crawlee-web-scraping-tutorial/).

---

# Inside implementing SuperScraper with Crawlee

March 5, 2025 · 6 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

[![Radoslav Chudovský](https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512)](https://github.com/chudovskyr) [Radoslav Chudovský](https://github.com/chudovskyr) Web Automation Engineer

[SuperScraper](https://github.com/apify/super-scraper) is an open-source [Actor](https://docs.apify.com/platform/actors) that combines features from various web scraping services, including [ScrapingBee](https://www.scrapingbee.com/), [ScrapingAnt](https://scrapingant.com/), and [ScraperAPI](https://www.scraperapi.com/). A key capability is its standby mode, which runs the Actor as a persistent API server. This removes the usual start-up times - a common pain point in many systems - and lets users make direct API calls to interact with the system immediately.

This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.

![SuperScraper](/assets/images/superscraper-8d24da63227f97df70998e8900b3a901.webp)

### What is SuperScraper?[​](#what-is-superscraper "Direct link to What is SuperScraper?")

SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests.

### How to enable standby mode[​](#how-to-enable-standby-mode "Direct link to How to enable standby mode")

To activate standby mode, you must configure the Actor's settings so it listens for incoming requests.
![Activating Actor standby mode](/assets/images/actor-standby-9b094dde2615b70afb82685d56c8d74e.webp)

### Server setup[​](#server-setup "Direct link to Server setup")

The project uses the Node.js `http` module to create a server that listens on the desired port. After the server starts, a check ensures users are interacting with it correctly by sending requests instead of running it traditionally. This keeps SuperScraper operating as a persistent server.

### Handling multiple crawlers[​](#handling-multiple-crawlers "Direct link to Handling multiple crawlers")

SuperScraper processes user requests using multiple instances of Crawlee's [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). Since each `PlaywrightCrawler` instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting. For example, if the user sends one request for "normal" proxies and one request for residential US proxies, a separate crawler needs to be created for each proxy configuration. To solve this, we store the crawlers in a key-value map, where the key is a stringified proxy configuration.

```
const crawlers = new Map();
```

Here's the part of the code that gets executed when a new request from the user arrives: if a crawler for this proxy configuration exists in the map, it will be used; otherwise, a new crawler gets created. Then, we add the request to the crawler's queue so it can be processed.

```
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key)
    ? crawlers.get(key)!
    : await createAndStartCrawler(crawlerOptions);

await crawler.addRequests([request]);
```

The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the `MemoryStorage` client. This approach is used for two key reasons:

1. **Performance**: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.
2. **Isolation**: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.

```
export const createAndStartCrawler = async (crawlerOptions: CrawlerOptions = DEFAULT_CRAWLER_OPTIONS) => {
    const client = new MemoryStorage({ persistStorage: false });
    const queue = await RequestQueue.open(undefined, { storageClient: client });

    const proxyConfig = await Actor.createProxyConfiguration(crawlerOptions.proxyConfigurationOptions);

    const crawler = new PlaywrightCrawler({
        keepAlive: true,
        proxyConfiguration: proxyConfig,
        maxRequestRetries: 4,
        requestQueue: queue,
    });
};
```

At the end of the function, we start the crawler and log a message if it terminates for any reason. Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return the crawler.

```
crawler.run().then(
    () => log.warning(`Crawler ended`, crawlerOptions),
    () => {},
);

crawlers.set(JSON.stringify(crawlerOptions), crawler);
log.info('Crawler ready 🚀', crawlerOptions);

return crawler;
```

### Mapping standby HTTP requests to Crawlee requests[​](#mapping-standby-http-requests-to-crawlee-requests "Direct link to Mapping standby HTTP requests to Crawlee requests")

The server is created with a request listener function that takes two arguments: the user's request and a response object. The response object is used to send scraped data back to the user.
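A minimal sketch of that wiring (illustrative only, not the actual SuperScraper source; the port and the placeholder response body are assumptions):

```
import { createServer } from 'node:http';

const server = createServer(async (req, res) => {
    // In SuperScraper, this is where the query parameters are parsed into
    // crawler options and the request is handed to the right crawler.
    res.writeHead(202, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: 'accepted', url: req.url }));
});

// The port is illustrative; on the Apify platform, the standby port is provided by the runtime.
server.listen(3000);
```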
These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between the request and its corresponding response object; it is used as `request.uniqueKey`.

```
const responses = new Map();
```

**Saving response objects**

The following function stores a response object in the key-value map:

```
export function addResponse(responseId: string, response: ServerResponse) {
    responses.set(responseId, response);
}
```

**Updating crawler logic to store responses**

Here's the updated logic for fetching/creating the corresponding crawler for a given proxy configuration, with a call to store the response object:

```
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key)
    ? crawlers.get(key)!
    : await createAndStartCrawler(crawlerOptions);

addResponse(request.uniqueKey!, res);
await crawler.requestQueue!.addRequest(request);
```

**Sending scraped data back**

Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:

```
export const sendSuccResponseById = (responseId: string, result: unknown, contentType: string) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }
    res.writeHead(200, { 'Content-Type': contentType });
    res.end(result);
    responses.delete(responseId);
};
```

**Error handling**

There is similar logic to send a response back if an error occurs during scraping:

```
export const sendErrorResponseById = (responseId: string, result: string, statusCode: number = 500) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(result);
    responses.delete(responseId);
};
```

**Adding timeouts during migrations**

During migration, SuperScraper adds timeouts to pending responses to handle termination cleanly.

```
export const addTimeoutToAllResponses = (timeoutInSeconds: number = 60) => {
    const migrationErrorMessage = {
        errorMessage: 'Actor had to migrate to another server. Please, retry your request.',
    };

    // Iterate over the keys of the Map of pending responses.
    const responseKeys = [...responses.keys()];

    for (const key of responseKeys) {
        setTimeout(() => {
            sendErrorResponseById(key, JSON.stringify(migrationErrorMessage));
        }, timeoutInSeconds * 1000);
    }
};
```

### Managing migrations[​](#managing-migrations "Direct link to Managing migrations")

SuperScraper handles migrations by timing out active responses to prevent lingering requests during server transitions.

```
Actor.on('migrating', () => {
    addTimeoutToAllResponses(60);
});
```

Users receive clear feedback during server migrations, maintaining stable operation.

### Build your own[​](#build-your-own "Direct link to Build your own")

This guide showed how to build and manage a standby web scraper using Apify's platform and Crawlee. The implementation handles multiple proxy configurations through `PlaywrightCrawler` instances while managing request-response cycles efficiently to support diverse scraping needs. Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.
To get started, explore the project on [GitHub](https://github.com/apify/super-scraper) or learn more about [Crawlee](https://crawlee.dev/index.md) to build your own scalable web scraping tools.

---

## [How to scrape YouTube using Python \[2025 guide\]](https://crawlee.dev/blog/scrape-youtube-python.md)

July 14, 2025 · 23 min read

[![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert

In this guide, we'll explore how to efficiently collect data from YouTube using [Crawlee for Python](https://github.com/apify/crawlee-python). The scraper will extract video metadata, video statistics, and transcripts - giving you structured YouTube data perfect for content analysis, ML training, or trend monitoring.

note

One of our community members wrote this guide as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on Apify's [Discord channel](https://apify.com/discord).

![How to scrape YouTube using Python](/assets/images/youtube_banner-fb73d10d52bbf13a89f3c0d66d2eff5b.webp)

Key steps we'll cover:

1. [Project setup](https://www.crawlee.dev/blog/scrape-youtube-python#1-project-setup)
2. [Analyzing YouTube and determining a scraping strategy](https://www.crawlee.dev/blog/scrape-youtube-python#2-analyzing-youtube-and-determining-a-scraping-strategy)
3. [Configuring Crawlee](https://www.crawlee.dev/blog/scrape-youtube-python#3-configuring-crawlee)
4. [Extracting YouTube data](https://www.crawlee.dev/blog/scrape-youtube-python#4-extracting-youtube-data)
5. [Enhancing the scraper capabilities](https://www.crawlee.dev/blog/scrape-youtube-python#5-enhancing-the-scraper-capabilities)
6. [Creating a YouTube Actor on the Apify platform](https://www.crawlee.dev/blog/scrape-youtube-python#6-creating-a-youtube-actor-on-the-apify-platform)
7. [Deploying to Apify](https://www.crawlee.dev/blog/scrape-youtube-python#7-deploying-to-apify)

**Tags:**

* [community](https://crawlee.dev/blog/tags/community.md)

[**Read More**](https://crawlee.dev/blog/scrape-youtube-python.md)

---

## [How Crawlee uses tiered proxies to avoid getting blocked](https://crawlee.dev/blog/proxy-management-in-crawlee.md)

June 24, 2024 · 4 min read

[![Saurav Jain](https://avatars.githubusercontent.com/u/53312820?v=4)](https://github.com/souravjain540) [Saurav Jain](https://github.com/souravjain540) Developer Community Manager

Hello Crawlee community,

We are back with another blog, this time explaining how Crawlee rotates proxies and prevents crawlers from getting blocked.

Proxies vary in quality, speed, reliability, and cost. There are a [few types of proxies](https://blog.apify.com/types-of-proxies/), such as datacenter and residential proxies. Datacenter proxies are cheaper but more prone to getting blocked, while residential proxies are the opposite. It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use [datacenter proxies](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let's take a look at it.
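As a quick preview of what that looks like in code, here is a minimal sketch using the `tieredProxyUrls` option of `ProxyConfiguration` (the proxy URLs are placeholders):

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Tiers are ordered from cheapest to most reliable; Crawlee starts low and
// only climbs to a higher tier when it detects blocking.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        ['http://cheap-datacenter-proxy.example.com:8000'],
        ['http://expensive-residential-proxy.example.com:8000'],
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, log }) {
        log.info(`Scraped ${request.url}`);
    },
});
```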
**Tags:**

* [proxy](https://crawlee.dev/blog/tags/proxy.md)

[**Read More**](https://crawlee.dev/blog/proxy-management-in-crawlee.md)

---

# 12 tips on how to think like a web scraping expert

November 10, 2024 · 13 min read

[![Max](https://avatars.githubusercontent.com/u/34358312?v=4)](https://github.com/Mantisus) [Max](https://github.com/Mantisus) Community Member of Crawlee and web scraping expert

Typically, tutorials focus on the technical aspects, on what you can replicate: "Start here, follow this path, and you'll end up here." This is great for learning a particular technology, but it's sometimes difficult to understand why the author decided to do things a certain way or what guides their development process.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our [Discord channel](https://apify.com/discord).

In this blog, I'll discuss the general rules and principles that guide me when I work on web scraping projects and allow me to achieve great results. So, let's explore the mindset of a web scraping developer.

![How to think like a web scraping expert](/assets/images/scraping-tips-8c538d5ae19dc1737b083169ad2a203b.webp)

## 1. Choosing a data source for the project[​](#1-choosing-a-data-source-for-the-project "Direct link to 1. Choosing a data source for the project")

When you start working on a project, you likely have a target site from which you need to extract specific data. Check what possibilities this site or application provides for data extraction. Here are some possible options:

* `Official API` - the site may provide a free official API through which you can get all the necessary data. This is the best option for you. For example, you can consider this approach if you need to extract data from [`Yelp`](https://docs.developer.yelp.com/docs/fusion-intro)
* `Website` - in this case, we study the website, its structure, as well as the ways the frontend and backend interact
* `Mobile Application` - in some cases, there's no website or API at all, or the mobile application provides more data; in that case, don't forget about the [`man-in-the-middle`](https://blog.apify.com/using-a-man-in-the-middle-proxy-to-scrape-data-from-a-mobile-app-api-e954915f979d/) approach

If one data source fails, try accessing another available source. For example, for `Yelp`, all three options are available, and if the `Official API` doesn't suit you for some reason, you can try the other two.

## 2. Check [`robots.txt`](https://developers.google.com/search/docs/crawling-indexing/robots/intro) and [`sitemap`](https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap)[​](#2-check-robotstxt-and-sitemap "Direct link to 2-check-robotstxt-and-sitemap")

I think everyone knows about `robots.txt` and `sitemap` one way or another, but I regularly see people simply forgetting about them. If you're hearing about these for the first time, here's a quick explanation:

* `robots` is the established name for crawlers in SEO. Usually, this refers to crawlers of major search engines like Google and Bing, or services like Ahrefs and ChatGPT.
* `robots.txt` is a file describing the allowed behavior for robots. It includes permitted crawler user-agents, wait time between page scans, patterns of pages forbidden for scanning, and more. These rules are typically based on which pages should be indexed by search engines and which should not.
* `sitemap` describes the site structure to make it easier for robots to navigate. It also helps in scanning only the content that needs updating, without creating unnecessary load on the site.

Since you're not [`Google`](http://google.com/) or any other popular search engine, the robot rules in `robots.txt` will likely be against you. But combined with the `sitemap`, this is a good place to study the site structure, the expected interaction with robots, and non-browser user-agents. In some situations, it simplifies data extraction from the site. For example, using the [`sitemap`](https://www.crawlee.dev/sitemap.xml) for the [Crawlee website](http://www.crawlee.dev/), you can easily get direct links to posts both for the entire lifespan of the blog and for a specific period. One simple check, and you don't need to implement pagination logic.

## 3. Don't neglect site analysis[​](#3-dont-neglect-site-analysis "Direct link to 3. Don't neglect site analysis")

Thorough site analysis is an important prerequisite for creating an effective web scraper, especially if you're not planning to use browser automation. However, such analysis takes time, sometimes a lot of it. It's also worth noting that the time spent on analysis and searching for a more optimal crawling solution doesn't always pay off - you might spend hours only to discover that the most obvious approach was the best all along. Therefore, it's wise to set limits on your initial site analysis. If you don't see a better path within the allocated time, revert to simpler approaches. As you gain more experience, you'll more often be able to tell early on, based on the technologies used on the site, whether it's worth dedicating more time to analysis or not.

Also, in projects where you need to extract data from a site just once, thorough site analysis can sometimes eliminate the need to write scraper code altogether. Here's an example of such a site - `https://ricebyrice.com/nl/pages/find-store`.

![Ricebyrice](/assets/images/ricebyrice_base-433dcb67f3debf8855b0043fb87a63c3.webp)

By analyzing it, you'll easily discover that all the data can be obtained with a single request. You simply need to copy this data from your browser into a JSON file, and your task is complete.

![Ricebyrice Response](/assets/images/ricebyrice_response-77221911846c701f7abd865673867d60.webp)

## 4. Maximum interactivity[​](#4-maximum-interactivity "Direct link to 4. Maximum interactivity")

When analyzing a site, switch sorting options and pages, and interact with various elements of the site while watching the `Network` tab in your browser's [Dev Tools](https://developer.chrome.com/docs/devtools). This will allow you to better understand how the site interacts with the backend, what framework it's built on, and what behavior can be expected from it.

## 5. Data doesn't appear out of thin air[​](#5-data-doesnt-appear-out-of-thin-air "Direct link to 5. Data doesn't appear out of thin air")

This is obvious, but it's important to keep in mind while working on a project. If you see some data or request parameters, it means they were obtained somewhere earlier: possibly in another request, possibly they were already on the website page, possibly they were formed using JS from other parameters. But they are always somewhere.

If you don't understand where the data on the page comes from, or the data used in a request, follow these steps:

1. Sequentially check all requests the site made before this point.
2. Examine their responses, headers, and cookies.
3. Use your intuition: Could this parameter be a timestamp? Could it be another parameter in a modified form?
4. Does it resemble any standard hashes or encodings?

Practice makes perfect here. As you become familiar with different technologies and frameworks and their expected behaviors, you'll find it easier to understand how things work and how data is transferred. This accumulated knowledge will significantly improve your ability to trace and understand data flow in web applications.

## 6. Data is cached[​](#6-data-is-cached "Direct link to 6. Data is cached")

You may notice that when opening the same page several times, the requests transmitted to the server differ: possibly something was cached and is already stored on your computer. Therefore, it's recommended to analyze the site in incognito mode, as well as to switch browsers.

This situation is especially relevant for mobile applications, which may store some data in storage on the device. Therefore, when analyzing mobile applications, you may need to clear the cache and storage.

## 7. Learn more about the framework[​](#7-learn-more-about-the-framework "Direct link to 7. Learn more about the framework")

If during the analysis you discover that the site uses a framework you haven't encountered before, take some time to learn about it and its features. For example, if you notice a site is built with Next.js, understanding how it handles routing and data fetching could be crucial for your scraping strategy.

You can learn about these frameworks through official documentation or by using LLMs like [`ChatGPT`](https://openai.com/chatgpt/) or [`Claude`](https://claude.ai/). These AI assistants are excellent at explaining framework-specific concepts. Here's an example of how you might query an LLM about Next.js:

```
I am in the process of optimizing my website using Next.js. Are there any files passed to the browser that describe all internal routing and how links are formed?

Restrictions:
- Accompany your answers with code samples
- Use this message as the main message for all subsequent responses
- Reference only those elements that are available on the client side, without access to the project code base
```

You can create similar queries for backend frameworks as well. For instance, with GraphQL, you might ask about available fields and query structures. These insights can help you understand how to better interact with the site's API and what data is potentially available.

For effective work with LLMs, I recommend at least studying the basics of [`prompt engineering`](https://parlance-labs.com/education/prompt_eng/berryman.html).

## 8. Reverse engineering[​](#8-reverse-engineering "Direct link to 8. Reverse engineering")

Web scraping goes hand in hand with reverse engineering. You study the interactions of the frontend and backend, and you may need to study the code to better understand how certain parameters are formed.

But in some cases, reverse engineering may require more knowledge, effort, and time, or have a high degree of complexity. At this point, you need to decide whether you need to delve into it or whether it's better to change the data source or, for example, the technologies used. Most likely, this will be the moment when you decide to abandon HTTP web scraping and switch to a headless browser.

The main principle of most web scraping protections is not to make web scraping impossible, but to make it expensive.
Let's just look at what the response to a search on [`zoopla`](https://www.zoopla.co.uk/) looks like:

![Zoopla Search Response](/assets/images/zoopla_response-c6997e953965244f6293d44d2562f2dd.webp)

## 9. Testing requests to endpoints[​](#9-testing-requests-to-endpoints "Direct link to 9. Testing requests to endpoints")

After identifying the endpoints you need for extracting the target data, make sure you get a correct response when making a request. If you get a response from the server other than 200, or data different from what you expected, then you need to figure out why. Here are some possible reasons:

* You need to pass some parameters, for example cookies, or specific technical headers
* The site requires that, when accessing this endpoint, there is a corresponding `Referrer` header
* The site expects that the headers will follow a certain order. I've encountered this only a couple of times, but I have encountered it
* The site uses protection against web scraping, for example with `TLS fingerprint`

And there are many other possible reasons, each of which requires separate analysis.

## 10. Experiment with request parameters[​](#10-experiment-with-request-parameters "Direct link to 10. Experiment with request parameters")

Explore what results you get when changing request parameters, if any. Some parameters may be missing from the request but still supported on the server side, for example, `order`, `sort`, `per_page`, `limit`, and others. Try adding them and see if the behavior changes. This is especially relevant for sites using [`graphql`](https://graphql.org/).

Let's consider this [`example`](https://restoran.ua/en/posts?subsection=0). If you analyze the site, you'll see a request that can be reproduced with the following code; I've formatted it a bit to improve readability:

```
import requests

url = "https://restoran.ua/graphql"

data = {
    "operationName": "Posts_PostsForView",
    "variables": {"sort": {"sortBy": ["startAt_DESC"]}},
    "query": """query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
            where: $where
            sort: $sort
            pagination: $pagination
            search: $search
            token: $token
        ) {
            id
            title: ukTitle
            summary: ukSummary
            slug
            startAt
            endAt
            newsFeed
            events
            journal
            toProfessionals
            photoHeader {
                address: mobile
                __typename
            }
            coordinates(slice: $coordinates_slice) {
                lng
                lat
                __typename
            }
            __typename
        }
    }"""
}

response = requests.post(url, json=data)

print(response.json())
```

Now I'll update it to get results in 2 languages at once and, most importantly, along with the internal text of the publications:

```
import requests

url = "https://restoran.ua/graphql"

data = {
    "operationName": "Posts_PostsForView",
    "variables": {"sort": {"sortBy": ["startAt_DESC"]}},
    "query": """query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
            where: $where
            sort: $sort
            pagination: $pagination
            search: $search
            token: $token
        ) {
            id
            uk_title: ukTitle
            en_title: enTitle
            summary: ukSummary
            slug
            startAt
            endAt
            newsFeed
            events
            journal
            toProfessionals
            photoHeader {
                address: mobile
                __typename
            }
            mixedBlocks {
                index
                en_text: enText
                uk_text: ukText
                __typename
            }
            coordinates(slice: $coordinates_slice) {
                lng
                lat
                __typename
            }
            __typename
        }
    }"""
}

response = requests.post(url, json=data)

print(response.json())
```

As you can see, a small update of the request parameters allows me not to worry about visiting
the internal page of each publication. You have no idea how many times this trick has saved me.

If you see `graphql` in front of you and don't know where to start, then my advice about documentation and LLMs works here too.

## 11. Don't be afraid of new technologies[​](#11-dont-be-afraid-of-new-technologies "Direct link to 11. Don't be afraid of new technologies")

I know how easy it is to master a few tools and just keep using them because they work. I've fallen into this trap more than once myself. But modern sites use modern technologies that have a significant impact on web scraping, and in response, new tools for web scraping are emerging. Learning these may greatly simplify your next project, and may even solve some problems that were insurmountable for you. I wrote about some tools [`earlier`](https://www.crawlee.dev/blog/common-problems-in-web-scraping). I especially recommend paying attention to [`curl_cffi`](https://curl-cffi.readthedocs.io/en/latest/) and the frameworks [`botasaurus`](https://www.omkar.cloud/botasaurus/) and [`Crawlee for Python`](https://www.crawlee.dev/python/).

## 12. Help open-source libraries[​](#12-help-open-source-libraries "Direct link to 12. Help open-source libraries")

Personally, I only recently came to realize the importance of this. All the tools I use for my work are either open-source developments or based on open-source. Web scraping literally lives thanks to open-source, and this is especially noticeable if you're a `Python` developer who has realized that pure `Python` is in a pretty sad state when you need to deal with `TLS fingerprint` - and again, open-source saved us here. It seems to me that the least we could do is invest a little of our knowledge and skills in supporting open-source.

I chose to support [`Crawlee for Python`](https://www.crawlee.dev/python/) - and no, not because they allowed me to write in their blog, but because it shows excellent development dynamics and is aimed at making life easier for web crawler developers. It allows for faster crawler development by taking care of and hiding under the hood such critical aspects as session management, session rotation when blocked, managing concurrency of asynchronous tasks (if you write asynchronous code, you know what a pain this can be), and much more.

tip

If you like the blog so far, please consider [giving Crawlee a star on GitHub](https://github.com/apify/crawlee); it helps us to reach and help more developers.

And what choice will you make?

## Conclusion[​](#conclusion "Direct link to Conclusion")

I think some things in the article were obvious to you, and some things you already follow yourself, but I hope you learned something new too. If most of them were new, then try using these rules as a checklist in your next project.

I would be happy to discuss the article. Feel free to comment here, in the article, or contact me in the [Crawlee developer community](https://apify.com/discord) on Discord. You can also find me on the following platforms: [Github](https://github.com/Mantisus), [Linkedin](https://www.linkedin.com/in/max-bohomolov/), [Apify](https://apify.com/mantisus), [Upwork](https://www.upwork.com/freelancers/mantisus), [Contra](https://contra.com/mantisus).

Thank you for your attention :)

**Tags:**

* [community](https://crawlee.dev/blog/tags/community.md)
---

# Build reliable web scrapers. Fast.

Crawlee is a web scraping library for JavaScript and Python. It handles blocking, crawling, proxies, and browsers for you.

[Get started](https://crawlee.dev/js/docs/quick-start.md) [Star](https://github.com/apify/crawlee)

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        await pushData({ title, url: request.loadedUrl });

        await enqueueLinks();
    },

    // Uncomment this option to see the browser window.
    // headless: false,
});

await crawler.run(['https://crawlee.dev']);
```
Or start with a template from our CLI:

`$ npx crawlee create my-crawler`

Built with 🤍 by Apify. Forever free and open-source.

## What are the benefits?

### Unblock websites by default

Crawlee crawls stealthily with zero configuration, but you can customize its behavior to overcome any protection. Real-world fingerprints included. [Learn more](https://crawlee.dev/js/docs/guides/avoid-blocking.md)

```
{
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            browsers: ['chrome', 'firefox'],
            devices: ['mobile'],
            locales: ['en-US'],
        },
    },
},
```

### Work with your favorite tools

Crawlee integrates BeautifulSoup, Cheerio, Puppeteer, Playwright, and other popular open-source tools. No need to learn new syntax. [Learn more](https://crawlee.dev/js/docs/quick-start.md#choose-your-crawler)

### One API for headless and HTTP

Switch between HTTP and headless without big rewrites thanks to a shared API. Or even let the Adaptive crawler decide if JS rendering is needed. [Learn more](https://crawlee.dev/js/api/core.md)

```
const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ querySelector, enqueueLinks }) {
        // The crawler detects if JS rendering is needed
        // to extract this data. If not, it will use HTTP
        // for follow-up requests to save time and costs.
        const $prices = await querySelector('span.price');
        await enqueueLinks();
    },
});
```

## What else is in Crawlee?

### [Auto scaling](https://crawlee.dev/js/docs/guides/scaling-crawlers.md)

Crawlers automatically adjust concurrency based on available system resources. Avoid memory errors in small containers and run faster in large ones.

### [Smart proxy rotation](https://crawlee.dev/js/docs/guides/proxy-management.md)

Crawlee uses a pool of sessions represented by different proxies to maintain the proxy performance and keep IPs healthy.
Blocked proxies are removed from the pool automatically.

### [Queue and storage](https://crawlee.dev/js/docs/guides/request-storage.md)

Pause and resume crawlers thanks to a persistent queue of URLs and storage for structured data.

### [Handy scraping utils](https://crawlee.dev/js/api/utils.md)

Sitemaps, infinite scroll, contact extraction, large asset blocking, and many more utils included.

### [Routing & middleware](https://crawlee.dev/js/api/core/class/Router.md)

Keep your code clean and organized while managing complex crawls with a built-in router that streamlines the process.

## Deploy to cloud

Crawlee, by Apify, works anywhere, but Apify offers the best experience. Easily turn your project into an [Actor](https://apify.com/actors) - a serverless micro-app with built-in infra, proxies, and storage. [Deploy to Apify](https://crawlee.dev/js/docs/deployment/apify-platform.md)

1. Install Apify SDK and Apify CLI.
2. Add `Actor.init()` to the beginning and `Actor.exit()` to the end of your code.
3. Use the Apify CLI to push the code to the Apify platform.

## Crawlee helps you build scrapers faster

### Zero setup required

Copy a code example, install Crawlee, and go. No CLI required, no complex file structure, no boilerplate. [Get started](https://crawlee.dev/js/docs/quick-start.md)

### Reasonable defaults

Unblocking, proxy rotation, and other core features are already turned on. But they are also very configurable. [Learn more](https://crawlee.dev/js/docs/guides/configuration.md)

### Helpful community

Join our Discord community of over 10k developers and get fast answers to your web scraping questions. [Join Discord](https://discord.gg/jyEM2PRvMU)

## Get started now!

Crawlee won't fix broken selectors for you (yet), but it makes building and maintaining reliable crawlers faster and easier - so you can focus on what matters most.
[Get started](https://crawlee.dev/js/docs/quick-start.md)
---

# API

### Packages

* [@crawlee/core 3.15.0](https://crawlee.dev/js/api/core.md)
* [@crawlee/cheerio 3.15.0](https://crawlee.dev/js/api/cheerio-crawler.md)
* [@crawlee/playwright 3.15.0](https://crawlee.dev/js/api/playwright-crawler.md)
* [@crawlee/puppeteer 3.15.0](https://crawlee.dev/js/api/puppeteer-crawler.md)
* [@crawlee/jsdom 3.15.0](https://crawlee.dev/js/api/jsdom-crawler.md)
* [@crawlee/linkedom 3.15.0](https://crawlee.dev/js/api/linkedom-crawler.md)
* [@crawlee/basic 3.15.0](https://crawlee.dev/js/api/basic-crawler.md)
* [@crawlee/http 3.15.0](https://crawlee.dev/js/api/http-crawler.md)
* [@crawlee/browser 3.15.0](https://crawlee.dev/js/api/browser-crawler.md)
* [@crawlee/memory-storage 3.15.0](https://crawlee.dev/js/api/memory-storage.md)
* [@crawlee/browser-pool 3.15.0](https://crawlee.dev/js/api/browser-pool.md)
* [@crawlee/utils 3.15.0](https://crawlee.dev/js/api/utils.md)
* [@crawlee/types 3.15.0](https://crawlee.dev/js/api/types.md)

---

# @crawlee/basic

Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.

`BasicCrawler` is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If we want a crawler that already facilitates this functionality, we should consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md).

`BasicCrawler` invokes the user-provided [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object, which represents a single URL to crawl.
The [Request](https://crawlee.dev/js/api/core/class/Request.md) objects are fed from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [`requestList`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestList) or [`requestQueue`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestQueue) constructor options, respectively. If neither `requestList` nor `requestQueue` options are provided, the crawler will open the default request queue either when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called, or if `requests` parameter (representing the initial requests) of the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function is provided. If both [`requestList`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestList) and [`requestQueue`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestQueue) options are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes if there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#autoscaledPoolOptions) parameter of the `BasicCrawler` constructor. For user convenience, the [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) and [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) options of the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor are available directly in the `BasicCrawler` constructor. 
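For example, a minimal sketch of those concurrency shortcuts (the values and handler body are illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    // Shortcuts for the underlying AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```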
## Example usage[​](#example-usage "Direct link to Example usage") ``` import { BasicCrawler, Dataset } from 'crawlee'; // Create a crawler instance const crawler = new BasicCrawler({ async requestHandler({ request, sendRequest }) { // 'request' contains an instance of the Request class // Here we simply fetch the HTML of the page and store it to a dataset const { body } = await sendRequest({ url: request.url, method: request.method, body: request.payload, headers: request.headers, }); await Dataset.pushData({ url: request.url, html: body, }) }, }); // Enqueue the initial requests and run the crawler await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ## Index[**](#Index) ### Crawlers * [**BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/basic-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/basic-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/basic-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/basic-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/basic-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/basic-crawler.md#BaseHttpResponseData) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/basic-crawler.md#BLOCKED_STATUS_CODES) * [**checkStorageAccess](https://crawlee.dev/js/api/basic-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/basic-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/basic-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/basic-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/basic-crawler.md#Cookie) * [**CrawlingContext](https://crawlee.dev/js/api/basic-crawler.md#CrawlingContext) * [**CreateSession](https://crawlee.dev/js/api/basic-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/basic-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/basic-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/basic-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/basic-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/basic-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/basic-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/basic-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/basic-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/basic-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/basic-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/basic-crawler.md#ErrnoException) * [**ErrorSnapshotter](https://crawlee.dev/js/api/basic-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/basic-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/basic-crawler.md#ErrorTrackerOptions) * 
[**EventManager](https://crawlee.dev/js/api/basic-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/basic-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/basic-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/basic-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/basic-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/basic-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/basic-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/basic-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/basic-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/basic-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/basic-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/basic-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/basic-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/basic-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/basic-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/basic-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/basic-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/basic-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/basic-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/basic-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/basic-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/basic-crawler.md#log) * [**Log](https://crawlee.dev/js/api/basic-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/basic-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/basic-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/basic-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/basic-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/basic-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/basic-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/basic-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/basic-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/basic-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/basic-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/basic-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/basic-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/basic-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/basic-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/basic-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/basic-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/basic-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/basic-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/basic-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/basic-crawler.md#QueueOperationInfo) * 
[**RecordOptions](https://crawlee.dev/js/api/basic-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/basic-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/basic-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/basic-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/basic-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/basic-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/basic-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/basic-crawler.md#Request) * [**RequestHandlerResult](https://crawlee.dev/js/api/basic-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/basic-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/basic-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/basic-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/basic-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/basic-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/basic-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/basic-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/basic-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/basic-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/basic-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/basic-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/basic-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/basic-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/basic-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/basic-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/basic-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/basic-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/basic-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/basic-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/basic-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/basic-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/basic-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/basic-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/basic-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/basic-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/basic-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/basic-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/basic-crawler.md#SkippedRequestCallback) * 
[**SkippedRequestReason](https://crawlee.dev/js/api/basic-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/basic-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/basic-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/basic-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/basic-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/basic-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/basic-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/basic-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/basic-crawler.md#StatisticState) * [**StorageClient](https://crawlee.dev/js/api/basic-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/basic-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/basic-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/basic-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/basic-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/basic-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/basic-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/basic-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/basic-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/basic-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/basic-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/basic-crawler.md#withCheckedStorageAccess) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) * [**BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) * [**CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) * [**CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) * [**CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) * [**ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) * [**RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) * [**StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports 
[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### 
[**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### 
[**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports 
[KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### 
[**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### 
[**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### 
[**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### 
[**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### 
[**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler **ErrorHandler\: (inputs, error) => Awaitable\ #### Type parameters * **Context**: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) = LoadedContext<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) & [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md)> #### Type declaration * * **(inputs, error): 
Awaitable\<void> - #### Parameters * ##### inputs: LoadedContext\<Context> * ##### error: Error #### Returns Awaitable\<void> ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler **RequestHandler\: (inputs) => Awaitable\<void> #### Type parameters * **Context**: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) = LoadedContext<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) & [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md)> #### Type declaration * * **(inputs): Awaitable\<void> - #### Parameters * ##### inputs: LoadedContext\<Context> #### Returns Awaitable\<void> ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback **StatusMessageCallback\: (params) => Awaitable\<void> #### Type parameters * **Context**: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) = [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) * **Crawler**: [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\<any> = [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\<Context> #### Type declaration * * **(params): Awaitable\<void> - #### Parameters * ##### params: [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md)\<Context, Crawler> #### Returns Awaitable\<void> ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)constBASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS **BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS: 10 = 10 Additional number of seconds used in [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) and [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) to set a reasonable [`requestHandlerTimeoutSecs`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandlerTimeoutSecs) for [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) that would not impair functionality (i.e. not time out before the crawlers themselves).
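A minimal sketch of how these handler types are typically wired into a `BasicCrawler` constructor. The timeout value and log messages are illustrative, and the typed imports assume the `ErrorHandler` and `RequestHandler` aliases exported by `@crawlee/basic` (documented above) are also re-exported by the `crawlee` metapackage:

```
import { BasicCrawler, type ErrorHandler, type RequestHandler } from 'crawlee';

// A request handler typed with the exported RequestHandler alias.
const requestHandler: RequestHandler = async ({ request, log }) => {
    log.info(`Handling ${request.url}`);
};

// An error handler invoked before each retry of a failed request.
const errorHandler: ErrorHandler = async ({ request, log }, error) => {
    log.warning(`Retrying ${request.url} after error: ${error.message}`);
};

const crawler = new BasicCrawler({
    requestHandler,
    errorHandler,
    // Give the handler up to 30 seconds; higher-level crawlers such as
    // CheerioCrawler add BASIC_CRAWLER_TIMEOUT_BUFFER_SECS on top of this
    // when deriving their own internal timeouts.
    requestHandlerTimeoutSecs: 30,
});

await crawler.run(['https://www.example.com/']);
```

---

# Changelog

All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.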
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * await retries inside `_timeoutAndRetry` ([#3206](https://github.com/apify/crawlee/issues/3206)) ([9c1cf6d](https://github.com/apify/crawlee/commit/9c1cf6d68acd356af8b7dbd682141357d789e3fb)), closes [/github.com/apify/crawlee/pull/3188#discussion\_r2410256271](https://github.com//github.com/apify/crawlee/pull/3188/issues/discussion_r2410256271) * use shared enqueue links wrapper in `AdaptivePlaywrightCrawler` ([#3188](https://github.com/apify/crawlee/issues/3188)) ([9569d19](https://github.com/apify/crawlee/commit/9569d191933325d93f6c66754274b63fd272fc59)) ### Features[​](#features "Direct link to Features") * support custom `userAgent` with `respectRobotsTxtFile` ([#3226](https://github.com/apify/crawlee/issues/3226)) ([354252d](https://github.com/apify/crawlee/commit/354252dee44c5ea618a12e087acb24b9e0f555c7)), closes [#3222](https://github.com/apify/crawlee/issues/3222) ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") ### Features[​](#features-1 "Direct link to Features") * export cheerio types in all crawler packages ([#3204](https://github.com/apify/crawlee/issues/3204)) ([f05790b](https://github.com/apify/crawlee/commit/f05790b8c4e77056fd3cdbdd6d6abe3186ddf104)) ### Performance Improvements[​](#performance-improvements "Direct link to Performance Improvements") * don't await `crawler.setStatusMessage` ([#3207](https://github.com/apify/crawlee/issues/3207)) ([1a67ffb](https://github.com/apify/crawlee/commit/1a67ffbf22e0ecf034d30a2215c4bd0f0ecbf41e)) ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * use correct config for storage classes to avoid memory leaks ([#3144](https://github.com/apify/crawlee/issues/3144)) ([911a2eb](https://github.com/apify/crawlee/commit/911a2eb45cdb5e3fc0e6a96471af86b43bc828bf)) # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * don't fail `exportData` calls on empty datasets ([#3115](https://github.com/apify/crawlee/issues/3115)) ([298f170](https://github.com/apify/crawlee/commit/298f170ef032f76d5b252e2a08971bfd161a7ef5)), closes [#2734](https://github.com/apify/crawlee/issues/2734) * respect `maxCrawlDepth` with a custom enqueueLinks `transformRequestFunction` ([#3159](https://github.com/apify/crawlee/issues/3159)) ([e2ecb74](https://github.com/apify/crawlee/commit/e2ecb745da6105d8d083b30b8b68197e53b1cf84)) ### Features[​](#features-2 "Direct link to Features") * add `collectAllKeys` option for `BasicCrawler.exportData` ([#3129](https://github.com/apify/crawlee/issues/3129)) ([2ddfc9c](https://github.com/apify/crawlee/commit/2ddfc9c6108207d3289ee92fe3c5b646611cc508)), closes [#3007](https://github.com/apify/crawlee/issues/3007) * add `TandemRequestProvider` for combined `RequestList` and `RequestQueue` usage ([#2914](https://github.com/apify/crawlee/issues/2914)) ([4ca450f](https://github.com/apify/crawlee/commit/4ca450f08b9fb69ae3b2ba3fc66361f14631b15b)), closes [#2499](https://github.com/apify/crawlee/issues/2499) ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 
3141-2025-08-05") **Note:** Version bump only for package @crawlee/basic # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * validation of iterables when adding requests to the queue ([#3091](https://github.com/apify/crawlee/issues/3091)) ([529a1dd](https://github.com/apify/crawlee/commit/529a1dd57278efef4fb2013e79a09fd1bc8594a5)), closes [#3063](https://github.com/apify/crawlee/issues/3063) ### Features[​](#features-3 "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/basic ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Features[​](#features-4 "Direct link to Features") * Accept (Async)Iterables in `addRequests` methods ([#3013](https://github.com/apify/crawlee/issues/3013)) ([a4ab748](https://github.com/apify/crawlee/commit/a4ab74852c3c60bdbc96035f54b16d125220f699)), closes [#2980](https://github.com/apify/crawlee/issues/2980) * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/basic ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/basic ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/basic ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * Optimize request unlocking to get rid of unnecessary unlock calls ([#2963](https://github.com/apify/crawlee/issues/2963)) ([a433037](https://github.com/apify/crawlee/commit/a433037f307ed3490a1ef5df334f1f9a9044510d)) ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * respect `autoscaledPoolOptions.isTaskReadyFunction` option ([#2948](https://github.com/apify/crawlee/issues/2948)) 
([fe2d206](https://github.com/apify/crawlee/commit/fe2d206b46afabb18c83e8af11fa03f085f4cd4e)), closes [#2922](https://github.com/apify/crawlee/issues/2922) * **statistics:** track actual request.retryCount in Statistics ([#2940](https://github.com/apify/crawlee/issues/2940)) ([c9f7f54](https://github.com/apify/crawlee/commit/c9f7f5494ac4895a30b283a5defe382db0cdea26)) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-5 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-6 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * Simplified RequestQueueV2 implementation ([#2775](https://github.com/apify/crawlee/issues/2775)) ([d1a094a](https://github.com/apify/crawlee/commit/d1a094a47eaecbf367b222f9b8c14d7da5d3e03a)), closes [#2767](https://github.com/apify/crawlee/issues/2767) [#2700](https://github.com/apify/crawlee/issues/2700) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * destructure `CrawlerRunOptions` before passing them to `addRequests` ([#2803](https://github.com/apify/crawlee/issues/2803)) ([02a598c](https://github.com/apify/crawlee/commit/02a598c2a501957f04ca3a2362bcee289ef861c0)), closes [#2802](https://github.com/apify/crawlee/issues/2802) * graceful `BasicCrawler` tidy-up on `CriticalError` ([#2817](https://github.com/apify/crawlee/issues/2817)) ([53331e8](https://github.com/apify/crawlee/commit/53331e82ee66274316add7cadb4afec1ce2d4bcf)), closes [#2807](https://github.com/apify/crawlee/issues/2807) ### Features[​](#features-7 "Direct link to Features") * stopping the crawlers gracefully with `BasicCrawler.stop()` ([#2792](https://github.com/apify/crawlee/issues/2792)) ([af2966f](https://github.com/apify/crawlee/commit/af2966f65caeaf4273fd0a8ab583a7857e4330ab)), closes [#2777](https://github.com/apify/crawlee/issues/2777) ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * log status message timeouts to debug level ([55ee44a](https://github.com/apify/crawlee/commit/55ee44aaf5e73c2a9d96d973a4aae111ab2e0025)) # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Features[​](#features-8 "Direct link to Features") * allow using other HTTP clients 
([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * check `.isFinished()` before `RequestList` reads ([#2695](https://github.com/apify/crawlee/issues/2695)) ([6fa170f](https://github.com/apify/crawlee/commit/6fa170fbe16c326307b8a58c09c07f64afb64bb2)) * **core:** trigger `errorHandler` for session errors ([#2683](https://github.com/apify/crawlee/issues/2683)) ([7d72bcb](https://github.com/apify/crawlee/commit/7d72bcb36f32933c6251382e5efd28a284e9267d)), closes [#2678](https://github.com/apify/crawlee/issues/2678) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/basic ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/basic ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * **RequestQueueV2:** remove `inProgress` cache, rely solely on locked states ([#2601](https://github.com/apify/crawlee/issues/2601)) ([57fcb08](https://github.com/apify/crawlee/commit/57fcb0804a9f1268039d1e2b246c515ceca7e405)) ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/basic # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-9 "Direct link to Features") * Sitemap-based request list implementation ([#2498](https://github.com/apify/crawlee/issues/2498)) ([7bf8f0b](https://github.com/apify/crawlee/commit/7bf8f0bcd4cc81e02c7cc60e82dfe7a0cdd80938)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * add missing `useState` implementation into crawling context ([eec4a71](https://github.com/apify/crawlee/commit/eec4a71769f1236ca0876a4a32288241b1b63db1)) * make `crawler.log` publicly accessible ([#2526](https://github.com/apify/crawlee/issues/2526)) ([3e9e665](https://github.com/apify/crawlee/commit/3e9e6652c0b5e4d0c2707985abbad7d80336b9af)) * respect `crawler.log` when creating child logger for `Statistics` ([0a0d75d](https://github.com/apify/crawlee/commit/0a0d75d40b5f78b329589535bbe3e0e84be76a7e)), closes [#2412](https://github.com/apify/crawlee/issues/2412) ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 
3103-2024-06-07") ### Features[​](#features-10 "Direct link to Features") * log desired concurrency in the default status message ([9f0b796](https://github.com/apify/crawlee/commit/9f0b79684d9e27e6ba29634e7da2e9a095367eda)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/basic ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/basic # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-15 "Direct link to Bug Fixes") * `EnqueueStrategy.All` erroring with links using unsupported protocols ([#2389](https://github.com/apify/crawlee/issues/2389)) ([8db3908](https://github.com/apify/crawlee/commit/8db39080b7711ba3c27dff7fce1170ddb0ee3d05)) * do not drop statistics on migration/resurrection/resume ([#2462](https://github.com/apify/crawlee/issues/2462)) ([8ce7dd4](https://github.com/apify/crawlee/commit/8ce7dd4ae6a3718dac95e784a53bd5661c827edc)) ### Features[​](#features-11 "Direct link to Features") * implement ErrorSnapshotter for error context capture ([#2332](https://github.com/apify/crawlee/issues/2332)) ([e861dfd](https://github.com/apify/crawlee/commit/e861dfdb451ae32fb1e0c7749c6b59744654b303)), closes [#2280](https://github.com/apify/crawlee/issues/2280) * make `RequestQueue` v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) ([#2390](https://github.com/apify/crawlee/issues/2390)) ([41ae8ab](https://github.com/apify/crawlee/commit/41ae8abec1da811ae0750ac2d298e77c1e3b7b55)), closes [#2388](https://github.com/apify/crawlee/issues/2388) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") ### Bug Fixes[​](#bug-fixes-16 "Direct link to Bug Fixes") * don't call `notify` in `addRequests()` ([#2425](https://github.com/apify/crawlee/issues/2425)) ([c4d5446](https://github.com/apify/crawlee/commit/c4d54469120648a592b6898f849154fda60e3d59)), closes [#2421](https://github.com/apify/crawlee/issues/2421) ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/basic # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Bug Fixes[​](#bug-fixes-17 "Direct link to Bug Fixes") * notify autoscaled pool about newly added requests ([#2400](https://github.com/apify/crawlee/issues/2400)) ([a90177d](https://github.com/apify/crawlee/commit/a90177d5207794be1d6e401d746dd4c6e5961976)) ### Features[​](#features-12 "Direct link to Features") * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/basic ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/basic # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Bug Fixes[​](#bug-fixes-18 
"Direct link to Bug Fixes") * declare missing dependencies on `csv-stringify` and `fs-extra` ([#2326](https://github.com/apify/crawlee/issues/2326)) ([718959d](https://github.com/apify/crawlee/commit/718959dbbe1fa69f948d0b778d0f54d9c493ab25)), closes [/github.com/redabacha/crawlee/blob/2f05ed22b203f688095300400bb0e6d03a03283c/.eslintrc.json#L50](https://github.com//github.com/redabacha/crawlee/blob/2f05ed22b203f688095300400bb0e6d03a03283c/.eslintrc.json/issues/L50) ### Features[​](#features-13 "Direct link to Features") * accessing crawler state, key-value store and named datasets via crawling context ([#2283](https://github.com/apify/crawlee/issues/2283)) ([58dd5fc](https://github.com/apify/crawlee/commit/58dd5fcc25f31bb066402c46e48a9e5e91efd5c5)) * adaptive playwright crawler ([#2316](https://github.com/apify/crawlee/issues/2316)) ([8e4218a](https://github.com/apify/crawlee/commit/8e4218ada03cf485751def46f8c465b2d2a825c7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/basic ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/basic ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/basic # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Features[​](#features-14 "Direct link to Features") * allow configuring crawler statistics ([#2213](https://github.com/apify/crawlee/issues/2213)) ([9fd60e4](https://github.com/apify/crawlee/commit/9fd60e4036dce720c71f2d169a8eccbc4c813a96)), closes [#1789](https://github.com/apify/crawlee/issues/1789) * check enqueue link strategy post redirect ([#2238](https://github.com/apify/crawlee/issues/2238)) ([3c5f9d6](https://github.com/apify/crawlee/commit/3c5f9d6056158e042e12d75b2b1b21ef6c32e618)), closes [#2173](https://github.com/apify/crawlee/issues/2173) * log cause with `retryOnBlocked` ([#2252](https://github.com/apify/crawlee/issues/2252)) ([e19a773](https://github.com/apify/crawlee/commit/e19a773693cfc5e65c1e2321bfc8b73c9844ea8b)), closes [#2249](https://github.com/apify/crawlee/issues/2249) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/basic ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/basic # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Features[​](#features-15 "Direct link to Features") * **core:** add `crawler.exportData()` helper ([#2166](https://github.com/apify/crawlee/issues/2166)) ([c8c09a5](https://github.com/apify/crawlee/commit/c8c09a54a712689969ff1f6bddf70f12a2a22670)) * got-scraping v4 ([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/basic ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) 
(2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") ### Bug Fixes[​](#bug-fixes-19 "Direct link to Bug Fixes") * add warning when we detect use of RL and RQ, but RQ is not provided explicitly ([#2115](https://github.com/apify/crawlee/issues/2115)) ([6fb1c55](https://github.com/apify/crawlee/commit/6fb1c5568a0bf3b6fa38045161866a32b13310ca)), closes [#1773](https://github.com/apify/crawlee/issues/1773) * ensure the status message cannot stuck the crawler ([#2114](https://github.com/apify/crawlee/issues/2114)) ([9034f08](https://github.com/apify/crawlee/commit/9034f08106f53a70205695076e874f04f632c5bb)) * RQ request count is consistent after migration ([#2116](https://github.com/apify/crawlee/issues/2116)) ([9ab8c18](https://github.com/apify/crawlee/commit/9ab8c1874f52acc3f0337fdabd36321d0fb40b86)), closes [#1855](https://github.com/apify/crawlee/issues/1855) [#1855](https://github.com/apify/crawlee/issues/1855) ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/basic ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-20 "Direct link to Bug Fixes") * session pool leaks memory on multiple crawler runs ([#2083](https://github.com/apify/crawlee/issues/2083)) ([b96582a](https://github.com/apify/crawlee/commit/b96582a200e25ec11124da1f7f84a2b16b64d133)), closes [#2074](https://github.com/apify/crawlee/issues/2074) [#2031](https://github.com/apify/crawlee/issues/2031) ### Features[​](#features-16 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") ### Features[​](#features-17 "Direct link to Features") * remove side effect from the deprecated error context augmentation ([#2069](https://github.com/apify/crawlee/issues/2069)) ([f9fb5c4](https://github.com/apify/crawlee/commit/f9fb5c42ecb14f8d0845a15982d204bd2b5b228f)) ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-21 "Direct link to Bug Fixes") * **browser-pool:** improve error handling when browser is not found ([#2050](https://github.com/apify/crawlee/issues/2050)) ([282527f](https://github.com/apify/crawlee/commit/282527f31bb366a4e52463212f652dcf6679b6c3)), closes [#1459](https://github.com/apify/crawlee/issues/1459) * clean up `inProgress` cache when delaying requests via `sameDomainDelaySecs` ([#2045](https://github.com/apify/crawlee/issues/2045)) ([f63ccc0](https://github.com/apify/crawlee/commit/f63ccc018c9e9046531287c47d11283a8e71a6ad)) * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) * respect current config when creating implicit `RequestQueue` instance ([845141d](https://github.com/apify/crawlee/commit/845141d921c10dd5fb121a499bb1b24f5eb3ff04)), closes [#2043](https://github.com/apify/crawlee/issues/2043) ### Features[​](#features-18 "Direct link to 
Features") * **core:** add default dataset helpers to `BasicCrawler` ([#2057](https://github.com/apify/crawlee/issues/2057)) ([e2a7544](https://github.com/apify/crawlee/commit/e2a7544ddf775db023ca25553d21cb73484fcd8c)) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/basic ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") ### Features[​](#features-19 "Direct link to Features") * exceeding maxSessionRotations calls failedRequestHandler ([#2029](https://github.com/apify/crawlee/issues/2029)) ([b1cb108](https://github.com/apify/crawlee/commit/b1cb108882ab28d956adfc3d77ba9813507823f6)), closes [#2028](https://github.com/apify/crawlee/issues/2028) # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-20 "Direct link to Features") * add support for `sameDomainDelay` ([#2003](https://github.com/apify/crawlee/issues/2003)) ([e796883](https://github.com/apify/crawlee/commit/e79688324790e5d07fc11192769cf051617e96e4)), closes [#1993](https://github.com/apify/crawlee/issues/1993) * **basic-crawler:** allow configuring the automatic status message ([#2001](https://github.com/apify/crawlee/issues/2001)) ([3eb4e4c](https://github.com/apify/crawlee/commit/3eb4e4c558b4bc0673fbff75b1db19c46004a1da)) * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Bug Fixes[​](#bug-fixes-22 "Direct link to Bug Fixes") * **basic-crawler:** limit `internalTimeoutMillis` in addition to `requestHandlerTimeoutMillis` ([#1981](https://github.com/apify/crawlee/issues/1981)) ([8122622](https://github.com/apify/crawlee/commit/8122622c3054a0e0e0c1869ba462276cbead8090)), closes [#1766](https://github.com/apify/crawlee/issues/1766) ### Features[​](#features-21 "Direct link to Features") * **core:** add `RequestQueue.addRequestsBatched()` that is non-blocking ([#1996](https://github.com/apify/crawlee/issues/1996)) ([c85485d](https://github.com/apify/crawlee/commit/c85485d6ca2bb61cfebb24a2ad99e0b3ba5c069b)), closes [#1995](https://github.com/apify/crawlee/issues/1995) * retryOnBlocked detects blocked webpage ([#1956](https://github.com/apify/crawlee/issues/1956)) ([766fa9b](https://github.com/apify/crawlee/commit/766fa9b88029e9243a7427075384c1abe85c70c8)) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/basic # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/basic ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") ### Bug Fixes[​](#bug-fixes-23 "Direct link to Bug Fixes") * set status message every 5 seconds and log it via debug level ([#1918](https://github.com/apify/crawlee/issues/1918)) ([32aede6](https://github.com/apify/crawlee/commit/32aede6bbaa25b402e6e9cee9d3aa44722b1cfd0)) ### Features[​](#features-22 "Direct link to Features") * **core:** add 
`Request.maxRetries` to allow overriding the `maxRequestRetries` ([#1925](https://github.com/apify/crawlee/issues/1925)) ([c5592db](https://github.com/apify/crawlee/commit/c5592db0f8094de27c46ad993bea2c1ab1f61385)) ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Bug Fixes[​](#bug-fixes-24 "Direct link to Bug Fixes") * respect config object when creating `SessionPool` ([#1881](https://github.com/apify/crawlee/issues/1881)) ([db069df](https://github.com/apify/crawlee/commit/db069df80bc183c6b861c9ac82f1e278e57ea92b)) ### Features[​](#features-23 "Direct link to Features") * allow running single crawler instance multiple times ([#1844](https://github.com/apify/crawlee/issues/1844)) ([9e6eb1e](https://github.com/apify/crawlee/commit/9e6eb1e32f582a8837311aac12cc1d657432f3fa)), closes [#765](https://github.com/apify/crawlee/issues/765) * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-25 "Direct link to Bug Fixes") * start status message logger after the crawl actually starts ([5d1df7a](https://github.com/apify/crawlee/commit/5d1df7aae00d0d6ca29338723f92b77cff667354)) * status message - total requests ([#1842](https://github.com/apify/crawlee/issues/1842)) ([710f734](https://github.com/apify/crawlee/commit/710f7347623619057e99abf539f0ccf78de41bbc)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Features[​](#features-24 "Direct link to Features") * add basic support for `setStatusMessage` ([#1790](https://github.com/apify/crawlee/issues/1790)) ([c318980](https://github.com/apify/crawlee/commit/c318980ec11d211b1a5c9e6bdbe76198c5d895be)) * move the status message implementation to Crawlee, noop in storage ([#1808](https://github.com/apify/crawlee/issues/1808)) ([99c3fdc](https://github.com/apify/crawlee/commit/99c3fdc18030b7898e6b6d149d6d94fab7881f09)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/basic ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/basic # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-26 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") ### Bug Fixes[​](#bug-fixes-27 "Direct link to Bug Fixes") * session.markBad() on requestHandler error ([#1709](https://github.com/apify/crawlee/issues/1709)) ([e87eb1f](https://github.com/apify/crawlee/commit/e87eb1f2ccd9585f8d53cb03ec671cedf23a06b4)), closes [#1635](https://github.com/apify/crawlee/issues/1635) 
[/github.com/apify/crawlee/blob/5ff04faa85c3a6b6f02cd58a91b46b80610d8ae6/packages/browser-crawler/src/internals/browser-crawler.ts#L524](https://github.com//github.com/apify/crawlee/blob/5ff04faa85c3a6b6f02cd58a91b46b80610d8ae6/packages/browser-crawler/src/internals/browser-crawler.ts/issues/L524) ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") ### Bug Fixes[​](#bug-fixes-28 "Direct link to Bug Fixes") * remove memory leaks from migration event handling ([#1679](https://github.com/apify/crawlee/issues/1679)) ([49bba25](https://github.com/apify/crawlee/commit/49bba252ebc348b61eac3895155361f7d394db36)), closes [#1670](https://github.com/apify/crawlee/issues/1670) ### Features[​](#features-25 "Direct link to Features") * always show error origin if inside the userland ([#1677](https://github.com/apify/crawlee/issues/1677)) ([bbe9045](https://github.com/apify/crawlee/commit/bbe9045d550f95138d570522f6f469eae2d146d0)) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/basic ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/basic # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/basic ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/basic --- # BasicCrawler \ Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. `BasicCrawler` is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If we want a crawler that already facilitates this functionality, we should consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). `BasicCrawler` invokes the user-provided [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object, which represents a single URL to crawl. The [Request](https://crawlee.dev/js/api/core/class/Request.md) objects are fed from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [`requestList`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestList) or [`requestQueue`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestQueue) constructor options, respectively. If neither `requestList` nor `requestQueue` options are provided, the crawler will open the default request queue either when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called, or if `requests` parameter (representing the initial requests) of the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function is provided. 
If both [`requestList`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestList) and [`requestQueue`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestQueue) options are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes if there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#autoscaledPoolOptions) parameter of the `BasicCrawler` constructor. For user convenience, the [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) and [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) options of the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor are available directly in the `BasicCrawler` constructor. **Example usage:** ``` import { BasicCrawler, Dataset } from 'crawlee'; // Create a crawler instance const crawler = new BasicCrawler({ async requestHandler({ request, sendRequest }) { // 'request' contains an instance of the Request class // Here we simply fetch the HTML of the page and store it to a dataset const { body } = await sendRequest({ url: request.url, method: request.method, body: request.payload, headers: request.headers, }); await Dataset.pushData({ url: request.url, html: body, }) }, }); // Enqueue the initial requests and run the crawler await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ### Hierarchy * *BasicCrawler* * [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L615)constructor * ****new BasicCrawler**\(options, config): [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\ - All `BasicCrawler` parameters are passed via an options object. 
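As a rough sketch, the optional second argument lets us pass an explicit [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) instead of relying on the global one; the `persistStorage` option used below is only illustrative:

```
import { BasicCrawler, Configuration } from 'crawlee';

// Illustrative configuration: do not persist storage between runs.
const config = new Configuration({ persistStorage: false });

const crawler = new BasicCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
}, config);

await crawler.run(['https://example.com']);
```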
*** #### Parameters * ##### options: [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md)\ = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\ ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L617)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)hasFinishedBefore **hasFinishedBefore: boolean = false ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlylog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\> = ... Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). 
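For example, a minimal sketch of registering handlers on the default router instead of passing a `requestHandler` (the label and URLs are illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler();

// Handle requests that were enqueued with the label 'DETAIL'.
crawler.router.addHandler('DETAIL', async ({ request, log }) => {
    log.info(`Detail page: ${request.url}`);
});

// Fallback for requests without a matching label.
crawler.router.addDefaultHandler(async ({ request, log }) => {
    log.info(`Processing ${request.url}`);
});

await crawler.run(['https://example.com']);
```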
### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)running **running: boolean = false ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlystats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)addRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)exportData * ****exportData**\(path, format, options): Promise\ - Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)getData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). 
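A small sketch of reading the collected records back, assuming data was stored earlier via `pushData` (the URL is illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ request, pushData }) {
        // Store one record per crawled URL in the default dataset.
        await pushData({ url: request.url });
    },
});

await crawler.run(['https://example.com']);

// Read everything back from the default dataset.
const { items } = await crawler.getData();
console.log(`The dataset holds ${items.length} records`);
```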
*** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)getDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)getRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)pushData * ****pushData**(data, datasetIdOrName): Promise\ - Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)run * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)setStatusMessage * ****setStatusMessage**(message, options): Promise\ - This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)stop * ****stop**(message): void - Gracefully stops the current run of the crawler. 
All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)useState * ****useState**\(defaultValue): Promise\ - #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # createBasicRouter ### Callable * ****createBasicRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) of our [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md). Defaults to the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. ``` import { BasicCrawler, createBasicRouter } from 'crawlee'; const router = createBasicRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new BasicCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # BasicCrawlerOptions \ ### Hierarchy * *BasicCrawlerOptions* * [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**httpClient](#httpClient) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**onSkippedRequest](#onSkippedRequest) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. 
> *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalerrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)\ User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. The second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalfailedRequestHandler **failedRequestHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)\ A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. The second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md). ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalkeepAlive **keepAlive? : boolean Allows keeping the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty.
With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalmaxConcurrency **maxConcurrency? : number Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalmaxCrawlDepth **maxCrawlDepth? : number Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalmaxRequestRetries **maxRequestRetries? : number = 3 Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalmaxRequestsPerMinute **maxRequestsPerMinute? : number The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalmaxSessionRotations **maxSessionRotations? : number = 10 Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalminConcurrency **minConcurrency? 
: number Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalrequestHandler **requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)\> User-provided function that performs the logic of the crawler. It is called for each URL to crawl. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. 
> Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables, for instance, configuring the crawler to use `RequestManagerTandem`. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalrespectRobotsTxtFile **respectRobotsTxtFile? : boolean If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip URLs that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalretryOnBlocked **retryOnBlocked? : boolean If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Customizes the way statistics collection works, such as the logging interval or whether to output them to the Key-Value store.
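To illustrate how several of the options above fit together, here is a sketch of one possible configuration (the limits and URL are arbitrary):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    // Stop after 500 processed requests to guard against runaway crawls.
    maxRequestsPerCrawl: 500,
    // Retry failed requests up to 5 times before failedRequestHandler is called.
    maxRequestRetries: 5,
    // Never process more than 20 requests in parallel.
    maxConcurrency: 20,
    // Fetch robots.txt for each domain and skip disallowed URLs.
    respectRobotsTxtFile: true,
    // Wait 2 seconds between requests to the same domain.
    sameDomainDelaySecs: 2,
    async requestHandler({ request, log }) {
        log.info(`Crawling ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
```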
### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.

```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```

### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionaluseSessionPool **useSessionPool? : boolean The basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # BasicCrawlingContext \ ### Hierarchy * [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)<[BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md), UserData> * *BasicCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**pushData](#pushData) * [**sendRequest](#sendRequest) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from CrawlingContext.addRequests Add requests directly to the request queue.
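For example, a minimal sketch of enqueueing follow-up URLs from inside the request handler (the URLs and label are illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ request, addRequests, log }) {
        log.info(`Processing ${request.url}`);
        // Enqueue follow-up requests straight from the handler.
        await addRequests([
            'https://example.com/page-2',
            { url: 'https://example.com/detail', label: 'DETAIL' },
        ]);
    },
});

await crawler.run(['https://example.com/page-1']);
```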
*** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\> Inherited from CrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from CrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from CrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from CrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from CrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from CrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from CrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from CrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. 
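A small sketch of the intended usage, assuming a simple counter-like state object (the shape is illustrative):

```
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ request, useState }) {
        // The same object is shared across handler runs and persisted with the crawler state.
        const state = await useState({ processed: 0 });
        state.processed += 1;
        console.log(`${request.url} is request number ${state.processed}`);
    },
});

await crawler.run(['https://example.com']);
```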
*** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L96)enqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Overrides CrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ urls: [...], }); }, ``` *** #### Parameters * ##### optionaloptions: { baseUrl?: string; exclude?: readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[]; forefront?: boolean; globs?: readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[]; label?: string; limit?: number; onSkippedRequest?: [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback); pseudoUrls?: readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[]; regexps?: readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[]; requestQueue?: [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md); robotsTxtFile?: Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed>; selector?: string; skipNavigation?: boolean; strategy?: [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin; transformRequestFunction?: [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md); urls: readonly string\[]; userData?: Dictionary; waitForAllRequestsToBeAdded?: boolean } All `enqueueLinks()` parameters are passed via an options object. * ##### optionalbaseUrl: string A base URL that will be used to resolve relative URLs when using Cheerio. Ignored when using Puppeteer, since the relative URL resolution is done inside the browser automatically. * ##### optionalexclude: readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. * ##### optionalforefront: boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). 
By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. * ##### optionalglobs: readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use the `regexps` property directly. If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the function enqueues the links with the same subdomain. * ##### optionallabel: string Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this option. * ##### optionallimit: number Limits the number of actually enqueued URLs to this value. Useful for testing across the entire crawling scope. * ##### optionalonSkippedRequest: [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped because of the robots.txt file, because they don't match the enqueueLinks filters, or because the maxRequestsPerCrawl limit has been reached. * ##### optionalpseudoUrls: readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] *NOTE:* In future versions of the SDK, this option will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use the `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues the links with the same subdomain. * **@deprecated** prefer using `globs` or `regexps` instead * ##### optionalregexps: readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the function enqueues the links with the same subdomain. * ##### optionalrequestQueue: [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) A request queue to which the URLs will be enqueued.
* ##### optionalrobotsTxtFile: Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed> RobotsTxtFile instance for the current request that triggered the `enqueueLinks`. If provided, disallowed URLs will be ignored. * ##### optionalselector: string A CSS selector matching links to be enqueued. * ##### optionalskipNavigation: boolean = false If set to `true`, tells the crawler to skip navigation and process the request directly. * ##### optionalstrategy: [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin The strategy to use when enqueueing the URLs. Depending on the strategy you select, we will only check certain parts of the URLs found. Here is a diagram of each URL part and its name:

```
Protocol          Domain
┌────┐          ┌─────────┐
https://example.crawlee.dev/...
│       └─────────────────┤
│             Hostname    │
│                         │
└─────────────────────────┘
         Origin
```

* ##### optionaltransformRequestFunction: [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `keepUrlFragment: true` to the `request` object, URL fragments will not be removed when `uniqueKey` is computed. **Example:**

```
{
    transformRequestFunction: (request) => {
        request.userData.foo = 'bar';
        request.keepUrlFragment = true;
        return request;
    }
}
```

Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this function, so some request options returned by `transformRequestFunction` may be overwritten by those pattern-based options. * ##### urls: readonly string\[] An array of URLs to enqueue. * ##### optionaluserData: Dictionary Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. * ##### optionalwaitForAllRequestsToBeAdded: boolean By default, only the first batch (1000) of found requests will be added to the queue before resolving the call. You can use this option to wait until all of them have been added. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to a [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from CrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset.
* ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from CrawlingContext.sendRequest Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that.

```
async requestHandler({ sendRequest }) {
    const { body } = await sendRequest({
        // override headers only
        headers: { ... },
    });
},
```

*** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> --- # CrawlerAddRequestsOptions ### Hierarchy * [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) * *CrawlerAddRequestsOptions* * [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ## Index[**](#Index) ### Properties * [**batchSize](#batchSize) * [**forefront](#forefront) * [**waitBetweenBatchesMillis](#waitBetweenBatchesMillis) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#batchSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L975)optionalinheritedbatchSize **batchSize? : number = 1000 Inherited from AddRequestsBatchedOptions.batchSize ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from AddRequestsBatchedOptions.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#waitBetweenBatchesMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L980)optionalinheritedwaitBetweenBatchesMillis **waitBetweenBatchesMillis? : number = 1000 Inherited from AddRequestsBatchedOptions.waitBetweenBatchesMillis ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L970)optionalinheritedwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean = false Inherited from AddRequestsBatchedOptions.waitForAllRequestsToBeAdded Whether to wait for all the provided requests to be added, instead of waiting just for the initial batch of up to `batchSize`.
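For illustration, a short sketch of passing these options to [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests); the `requests` array and the concrete values are placeholders:

```
// Assuming `requests` is an array of URLs or request objects.
await crawler.addRequests(requests, {
    batchSize: 500,                    // add requests in batches of 500 instead of the default 1000
    waitBetweenBatchesMillis: 2000,    // pause 2 seconds between batches
    waitForAllRequestsToBeAdded: true, // resolve only after every request has been added
});
```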
--- # CrawlerAddRequestsResult ### Hierarchy * [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) * *CrawlerAddRequestsResult* ## Index[**](#Index) ### Properties * [**addedRequests](#addedRequests) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#addedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L984)inheritedaddedRequests **addedRequests: [ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[] Inherited from AddRequestsBatchedResult.addedRequests ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L1001)inheritedwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded: Promise<[ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[]> Inherited from AddRequestsBatchedResult.waitForAllRequestsToBeAdded A promise which will resolve with the rest of the requests that were added to the queue. Alternatively, we can set [`waitForAllRequestsToBeAdded`](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md#waitForAllRequestsToBeAdded) to `true` in the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) options. **Example:** ``` // Assuming `requests` is a list of requests. const result = await crawler.addRequests(requests); // If we want to wait for the rest of the requests to be added to the queue: await result.waitForAllRequestsToBeAdded; ``` --- # CrawlerExperiments A set of options that you can toggle to enable experimental features in Crawlee. NOTE: These options will not respect semantic versioning and may be removed or changed at any time. Use at your own risk. If you do use these and encounter issues, please report them to us. ## Index[**](#Index) ### Properties * [**requestLocking](#requestLocking) ## Properties[**](#Properties) ### [**](#requestLocking)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L418)optionalrequestLocking **requestLocking? : boolean * **@deprecated** This experiment is now enabled by default, and this flag will be removed in a future release. If you encounter issues due to this change, please: * report it to us: * set `requestLocking` to `false` in the `experiments` option of the crawler --- # CrawlerRunOptions ### Hierarchy * [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) * *CrawlerRunOptions* ## Index[**](#Index) ### Properties * [**batchSize](#batchSize) * [**forefront](#forefront) * [**purgeRequestQueue](#purgeRequestQueue) * [**waitBetweenBatchesMillis](#waitBetweenBatchesMillis) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#batchSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L975)optionalinheritedbatchSize **batchSize? : number = 1000 Inherited from CrawlerAddRequestsOptions.batchSize ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from CrawlerAddRequestsOptions.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. 
* while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#purgeRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2045)optionalpurgeRequestQueue **purgeRequestQueue? : boolean = true Whether to purge the RequestQueue before running the crawler again. Defaults to true, so it is possible to reprocess failed requests. When disabled, only new requests will be considered. Note that even a failed request is considered as handled. ### [**](#waitBetweenBatchesMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L980)optionalinheritedwaitBetweenBatchesMillis **waitBetweenBatchesMillis? : number = 1000 Inherited from CrawlerAddRequestsOptions.waitBetweenBatchesMillis ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L970)optionalinheritedwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean = false Inherited from CrawlerAddRequestsOptions.waitForAllRequestsToBeAdded Whether to wait for all the provided requests to be added, instead of waiting just for the initial batch of up to `batchSize`. --- # CreateContextOptions ## Index[**](#Index) ### Properties * [**proxyInfo](#proxyInfo) * [**request](#request) * [**session](#session) ## Properties[**](#Properties) ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2032)optionalproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2030)request **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2031)optionalsession **session? 
: [Session](https://crawlee.dev/js/api/core/class/Session.md) --- # StatusMessageCallbackParams \ ## Index[**](#Index) ### Properties * [**crawler](#crawler) * [**message](#message) * [**previousState](#previousState) * [**state](#state) ## Properties[**](#Properties) ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L123)crawler **crawler: Crawler ### [**](#message)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L125)message **message: string ### [**](#previousState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L124)previousState **previousState: [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#state)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L122)state **state: [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) --- # @crawlee/browser Provides a simple framework for parallel crawling of web pages using headless browsers with [Puppeteer](https://github.com/puppeteer/puppeteer) and [Playwright](https://github.com/microsoft/playwright). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `BrowserCrawler` uses headless (or even headful) browsers to download web pages and extract data, it is useful for crawling websites that require JavaScript to be executed. If the target website doesn't need JavaScript, you should consider using the [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented by the [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [`requestList`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestList) or [`requestQueue`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestQueue) constructor options, respectively. If neither the `requestList` nor the `requestQueue` option is provided, the crawler will open the default request queue either when the [`crawler.addRequests()`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md#addRequests) function is called, or if the `requests` parameter (representing the initial requests) of the [`crawler.run()`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md#run) function is provided. If both the [`requestList`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestList) and [`requestQueue`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestQueue) options are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `BrowserCrawler` opens a new browser page (i.e.
tab or window) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by the user as the [`requestHandler`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#autoscaledPoolOptions) parameter of the `BrowserCrawler` constructor. For user convenience, the [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) and [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) options of the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor are available directly in the `BrowserCrawler` constructor. > *NOTE:* the pool of browser instances is internally managed by the [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class. ## Index[**](#Index) ### Crawlers * [**BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/browser-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/browser-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/browser-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/browser-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/browser-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/browser-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/browser-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/browser-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/browser-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/browser-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/browser-crawler.md#BLOCKED_STATUS_CODES) * [**checkStorageAccess](https://crawlee.dev/js/api/browser-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/browser-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/browser-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/browser-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/browser-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/browser-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/browser-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/browser-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/browser-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/browser-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/browser-crawler.md#createBasicRouter) * 
[**CreateContextOptions](https://crawlee.dev/js/api/browser-crawler.md#CreateContextOptions) * [**CreateSession](https://crawlee.dev/js/api/browser-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/browser-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/browser-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/browser-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/browser-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/browser-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/browser-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/browser-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/browser-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/browser-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/browser-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/browser-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/browser-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/browser-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/browser-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/browser-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/browser-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/browser-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/browser-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/browser-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/browser-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/browser-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/browser-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/browser-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/browser-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/browser-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/browser-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/browser-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/browser-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/browser-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/browser-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/browser-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/browser-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/browser-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/browser-crawler.md#LoadedRequest) * 
[**LocalEventManager](https://crawlee.dev/js/api/browser-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/browser-crawler.md#log) * [**Log](https://crawlee.dev/js/api/browser-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/browser-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/browser-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/browser-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/browser-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/browser-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/browser-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/browser-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/browser-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/browser-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/browser-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/browser-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/browser-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/browser-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/browser-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/browser-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/browser-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/browser-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/browser-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/browser-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/browser-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/browser-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/browser-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/browser-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/browser-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/browser-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/browser-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/browser-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/browser-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/browser-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/browser-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/browser-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/browser-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/browser-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/browser-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/browser-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestProviderOptions) * 
[**RequestQueue](https://crawlee.dev/js/api/browser-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/browser-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/browser-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/browser-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/browser-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/browser-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/browser-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/browser-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/browser-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/browser-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/browser-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/browser-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/browser-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/browser-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/browser-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/browser-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/browser-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/browser-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/browser-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/browser-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/browser-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/browser-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/browser-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/browser-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/browser-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/browser-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/browser-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/browser-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/browser-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/browser-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/browser-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/browser-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/browser-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/browser-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/browser-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/browser-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/browser-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/browser-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/browser-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/browser-crawler.md#SystemStatusOptions) * 
[**TieredProxy](https://crawlee.dev/js/api/browser-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/browser-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/browser-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/browser-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/browser-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/browser-crawler.md#withCheckedStorageAccess) * [**BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md) * [**BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) * [**BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) * [**BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler) * [**BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook) * [**BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### 
[**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### 
[**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### 
[**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports 
[HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### 
[**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports 
[PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports 
[RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports 
[RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### 
[**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### 
[**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#BrowserErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L67)BrowserErrorHandler **BrowserErrorHandler<Context>: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<Context> #### Type parameters * **Context**: [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) = [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) ### [**](#BrowserHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L70)BrowserHook **BrowserHook<Context, GoToOptions>: (crawlingContext, gotoOptions) => Awaitable<void> #### Type parameters * **Context** = [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) * **GoToOptions**: Dictionary | undefined = Dictionary #### Type declaration * * **(crawlingContext, gotoOptions): Awaitable<void> - #### Parameters * ##### crawlingContext: Context * ##### gotoOptions: GoToOptions #### Returns Awaitable<void> ### [**](#BrowserRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L64)BrowserRequestHandler **BrowserRequestHandler<Context>: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<Context> #### Type parameters * **Context**: [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) = [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/browser ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/browser ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/browser # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/browser ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/browser # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Features[​](#features "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/browser ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Features[​](#features-1 "Direct link to Features") * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/browser ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/browser ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/browser ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/browser ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/browser ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct 
link to 3132-2025-04-08") ### Features[​](#features-2 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * don't double increment session usage count in `BrowserCrawler` ([#2908](https://github.com/apify/crawlee/issues/2908)) ([3107e55](https://github.com/apify/crawlee/commit/3107e5511142a3579adc2348fcb6a9dcadd5c0b9)), closes [#2851](https://github.com/apify/crawlee/issues/2851) * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-3 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/browser ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/browser ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/browser # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/browser ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * **puppeteer:** rename `ignoreHTTPSErrors` to `acceptInsecureCerts` to support v23 ([#2684](https://github.com/apify/crawlee/issues/2684)) ([f3927e6](https://github.com/apify/crawlee/commit/f3927e6c3487deef4a2a6b0face04d3742ecd5dd)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/browser ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/browser ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/browser ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/browser # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-4 "Direct link to Features") * add `iframe` expansion to `parseWithCheerio` in browsers ([#2542](https://github.com/apify/crawlee/issues/2542)) ([328d085](https://github.com/apify/crawlee/commit/328d08598807782b3712bd543e394fe9a000a85d)), closes 
[#2507](https://github.com/apify/crawlee/issues/2507) * add `ignoreIframes` opt-out from the Cheerio iframe expansion ([#2562](https://github.com/apify/crawlee/issues/2562)) ([474a8dc](https://github.com/apify/crawlee/commit/474a8dc06a567cde0651d385fdac9c350ddf4508)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * declare missing peer dependencies in `@crawlee/browser` package ([#2532](https://github.com/apify/crawlee/issues/2532)) ([3357c7f](https://github.com/apify/crawlee/commit/3357c7fc5ab071b12f72097c190dbee9990e3751)) * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/browser ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") **Note:** Version bump only for package @crawlee/browser ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/browser ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/browser # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/browser ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/browser ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") ### Features[​](#features-5 "Direct link to Features") * `browserPerProxy` browser launch option ([#2418](https://github.com/apify/crawlee/issues/2418)) ([df57b29](https://github.com/apify/crawlee/commit/df57b2965ac8c8b3adf807e3bad8a649814fa213)) # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Features[​](#features-6 "Direct link to Features") * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) * better `newUrlFunction` for ProxyConfiguration ([#2392](https://github.com/apify/crawlee/issues/2392)) ([330598b](https://github.com/apify/crawlee/commit/330598b348ad27bc7c73732294a14b655ccd3507)), closes [#2348](https://github.com/apify/crawlee/issues/2348) [#2065](https://github.com/apify/crawlee/issues/2065) * expand #shadow-root elements automatically in `parseWithCheerio` helper ([#2396](https://github.com/apify/crawlee/issues/2396)) ([a05b3a9](https://github.com/apify/crawlee/commit/a05b3a93a9b57926b353df0e79d846b5024c42ac)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/browser ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) 
(2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/browser # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-7 "Direct link to Features") * adaptive playwright crawler ([#2316](https://github.com/apify/crawlee/issues/2316)) ([8e4218a](https://github.com/apify/crawlee/commit/8e4218ada03cf485751def46f8c465b2d2a825c7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/browser ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/browser ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/browser # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * `retryOnBlocked` doesn't override the blocked HTTP codes ([#2243](https://github.com/apify/crawlee/issues/2243)) ([81672c3](https://github.com/apify/crawlee/commit/81672c3d1db1dcdcffb868de5740addff82cf112)) ### Features[​](#features-8 "Direct link to Features") * check enqueue link strategy post redirect ([#2238](https://github.com/apify/crawlee/issues/2238)) ([3c5f9d6](https://github.com/apify/crawlee/commit/3c5f9d6056158e042e12d75b2b1b21ef6c32e618)), closes [#2173](https://github.com/apify/crawlee/issues/2173) * log cause with `retryOnBlocked` ([#2252](https://github.com/apify/crawlee/issues/2252)) ([e19a773](https://github.com/apify/crawlee/commit/e19a773693cfc5e65c1e2321bfc8b73c9844ea8b)), closes [#2249](https://github.com/apify/crawlee/issues/2249) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/browser ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Features[​](#features-9 "Direct link to Features") * **puppeteer:** enable `new` headless mode ([#1910](https://github.com/apify/crawlee/issues/1910)) ([7fc999c](https://github.com/apify/crawlee/commit/7fc999cf4658ca69b97f16d434444081998470f4)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/browser ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/browser ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/browser ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/browser ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-10 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) 
([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/browser ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/browser ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * log original error message on session rotation ([#2022](https://github.com/apify/crawlee/issues/2022)) ([8a11ffb](https://github.com/apify/crawlee/commit/8a11ffbdaef6b2fe8603aac570c3038f84c2f203)) # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-11 "Direct link to Features") * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Features[​](#features-12 "Direct link to Features") * retryOnBlocked detects blocked webpage ([#1956](https://github.com/apify/crawlee/issues/1956)) ([766fa9b](https://github.com/apify/crawlee/commit/766fa9b88029e9243a7427075384c1abe85c70c8)) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/browser # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * respect `` when enqueuing ([#1936](https://github.com/apify/crawlee/issues/1936)) ([aeef572](https://github.com/apify/crawlee/commit/aeef57231c84671374ed0309b7b95fa9ce9a6e8b)) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/browser ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") **Note:** Version bump only for package @crawlee/browser ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") **Note:** Version bump only for package @crawlee/browser # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * ignore invalid URLs in `enqueueLinks` in browser crawlers ([#1803](https://github.com/apify/crawlee/issues/1803)) 
([5ac336c](https://github.com/apify/crawlee/commit/5ac336c5b83b212fd6281659b8ceee091e259ff1)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/browser ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/browser # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/browser ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/browser ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/browser ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/browser # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/browser ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Features[​](#features-13 "Direct link to Features") * enable tab-as-a-container for Firefox ([#1456](https://github.com/apify/crawlee/issues/1456)) ([ae5ba4f](https://github.com/apify/crawlee/commit/ae5ba4f15fd6d14f444486234753ce1781c74cc8)) --- # abstractBrowserCrawler \ Provides a simple framework for parallel crawling of web pages using headless browsers with [Puppeteer](https://github.com/puppeteer/puppeteer) and [Playwright](https://github.com/microsoft/playwright). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `BrowserCrawler` uses headless (or even headful) browsers to download web pages and extract data, it is useful for crawling of websites that require to execute JavaScript. If the target website doesn't need JavaScript, we should consider using the [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented by the [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [`requestList`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestList) or [`requestQueue`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestQueue) constructor options, respectively. 
If neither the `requestList` nor the `requestQueue` option is provided, the crawler will open the default request queue either when the [`crawler.addRequests()`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md#addRequests) function is called, or when the `requests` parameter (representing the initial requests) of the [`crawler.run()`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md#run) function is provided. If both the [`requestList`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestList) and [`requestQueue`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestQueue) options are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `BrowserCrawler` opens a new browser page (i.e. tab or window) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by the user as the [`requestHandler`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#autoscaledPoolOptions) parameter of the `BrowserCrawler` constructor. For user convenience, the [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) and [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) options of the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor are available directly in the `BrowserCrawler` constructor. > *NOTE:* the pool of browser instances is internally managed by the [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class.
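Below is a minimal usage sketch of the workflow described above. Because `BrowserCrawler` is abstract, the sketch uses the concrete [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) subclass; it assumes the `crawlee` and `playwright` packages are installed, and the start URL, the blocked resource pattern, and the navigation timeout are illustrative values rather than defaults.

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Shortcut for the underlying AutoscaledPool maxConcurrency option.
    maxConcurrency: 10,
    // Pre-navigation hooks match the BrowserHook type: (crawlingContext, gotoOptions) => Awaitable<void>.
    preNavigationHooks: [
        async ({ page }, gotoOptions) => {
            // Illustrative: skip image downloads to speed up crawling.
            await page.route('**/*.{png,jpg,jpeg}', (route) => route.abort());
            if (gotoOptions) gotoOptions.timeout = 30_000;
        },
    ],
    // Called for every successfully loaded page (the requestHandler option).
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        log.info(`Processing ${request.url}`);
        await pushData({ url: request.url, title: await page.title() });
        // Enqueue discovered links (same hostname by default) into the default RequestQueue.
        await enqueueLinks();
    },
    // Matches the BrowserErrorHandler type; invoked after all retries have failed.
    failedRequestHandler({ request, log }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
    },
});

// Passing the initial requests to run() opens the default request queue automatically.
await crawler.run(['https://crawlee.dev']);
```

Work that should happen on every navigation (blocking resources, adjusting `gotoOptions`) belongs in the hooks, while per-page extraction stays in the `requestHandler`.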
### Hierarchy * [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\ * *BrowserCrawler* * [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) * [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) ## Index[**](#Index) ### Properties * [**autoscaledPool](#autoscaledPool) * [**browserPool](#browserPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**launchContext](#launchContext) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from BasicCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#browserPool)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L329)browserPool **browserPool: [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)\, ReturnType<[InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray)\\[number]\[createController]>, ReturnType<[InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray)\\[number]\[createLaunchContext]>, Parameters\\[number]\[createController]>\[newPage]>\[0], [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\\[number]\[createController]>\[newPage]>>> A reference to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class that manages the crawler's browsers. ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L364)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from BasicCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from BasicCrawler.hasFinishedBefore ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L331)launchContext **launchContext: [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BasicCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L324)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BasicCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BasicCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\> = ... Inherited from BasicCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from BasicCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? 
: [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from BasicCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from BasicCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from BasicCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from BasicCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from BasicCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). 
*** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from BasicCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from BasicCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BasicCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from BasicCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from BasicCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. 
*** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from BasicCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from BasicCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # BrowserCrawlerOptions \ ### Hierarchy * Omit<[BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md), requestHandler | handleRequestFunction | failedRequestHandler | handleFailedRequestFunction | errorHandler> * *BrowserCrawlerOptions* * [PuppeteerCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md) * [PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**browserPoolOptions](#browserPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**headless](#headless) * [**httpClient](#httpClient) * [**ignoreIframes](#ignoreIframes) * [**ignoreShadowRoots](#ignoreShadowRoots) * [**keepAlive](#keepAlive) * [**launchContext](#launchContext) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? 
: [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from Omit.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#browserPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L194)optionalbrowserPoolOptions **browserPoolOptions? : Partial<[BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md)<[BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)<[CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md), undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<[BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)<\_\_BrowserControllerReturn, \_\_LaunchContextReturn, [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\>>> Custom options passed to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) constructor. We can tweak those to fine-tune browser management. ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L163)optionalerrorHandler **errorHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)\ User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from Omit.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L174)optionalfailedRequestHandler **failedRequestHandler? 
: [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)\ A function to handle requests that failed more than `option.maxRequestRetries` times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (the actual context will be enhanced with crawler-specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the failed request. The second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L260)optionalheadless **headless? : boolean | new | old Whether to run the browser in headless mode. Defaults to `true`. Can also be set via [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md). ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from Omit.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md). ### [**](#ignoreIframes)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L272)optionalignoreIframes **ignoreIframes? : boolean Whether to ignore `iframes` when processing the page content via the `parseWithCheerio` helper. By default, `iframes` are expanded automatically. Use this option to disable this behavior. ### [**](#ignoreShadowRoots)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L266)optionalignoreShadowRoots **ignoreShadowRoots? : boolean Whether to ignore custom elements (and their #shadow-roots) when processing the page content via the `parseWithCheerio` helper. By default, they are expanded automatically. Use this option to disable this behavior. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from Omit.keepAlive Allows keeping the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L90)optionallaunchContext **launchContext? : [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from Omit.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. 
Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from Omit.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from Omit.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from Omit.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from Omit.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from Omit.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from Omit.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. 
Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If unsure, it's better to keep the default value, and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L248)optionalnavigationTimeoutSecs **navigationTimeoutSecs? : number Timeout in which page navigation needs to finish, in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from Omit.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on the robots.txt file, 2. because they don't match the enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached. ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L254)optionalpersistCookiesPerSession **persistCookiesPerSession? : boolean Defines whether the cookies should be persisted for sessions. This can only be used when `useSessionPool` is set to `true`. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L243)optionalpostNavigationHooks **postNavigationHooks? : [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook)\\[] Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. **Example:** ``` postNavigationHooks: [ async (crawlingContext) => { const { page } = crawlingContext; if (hasCaptcha(page)) { await solveCaptcha(page); } }, ] ``` ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L224)optionalpreNavigationHooks **preNavigationHooks? : [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook)\\[] Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotoOptions`, which are passed to the `page.goto()` function the crawler calls to navigate. **Example:** ``` preNavigationHooks: [ async (crawlingContext, gotoOptions) => { const { page } = crawlingContext; await page.evaluate((attr) => { window.foo = attr; }, 'bar'); gotoOptions.timeout = 60_000; gotoOptions.waitUntil = 'domcontentloaded'; }, ] ``` Modifying `pageOptions` is supported only in Playwright incognito contexts. 
See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook) ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L201)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration. ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L119)optionalrequestHandler **requestHandler? : [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler)\> Function that is called to process each request. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as an argument, where: * [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) is an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) object with details about the URL to open, HTTP method etc; * [`page`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#page) is an instance of the Puppeteer [Page](https://pptr.dev/api/puppeteer.page) or Playwright [Page](https://playwright.dev/docs/api/class-page); * [`browserController`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#browserController) is an instance of the [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md); * [`response`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#response) is an instance of the Puppeteer [Response](https://pptr.dev/api/puppeteer.httpresponse) or Playwright [Response](https://playwright.dev/docs/api/class-response), which is the main resource response as returned by the respective `page.goto()` function. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to the [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from Omit.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? 
: [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from Omit.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from Omit.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from Omit.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from Omit.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip URLs that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from Omit.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from Omit.sameDomainDelaySecs Indicates how long (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? 
: [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from Omit.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from Omit.statisticsOptions Customize the way statistics collection works, such as the logging interval or whether to output the statistics to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from Omit.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters. ``` const crawler = new CheerioCrawler({ statusMessageCallback: async (ctx) => { return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG' }, statusMessageLoggingInterval: 1, // defaults to 10s async requestHandler({ $, enqueueLinks, request, log }) { // ... }, }); ``` ### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from Omit.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from Omit.useSessionPool The basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). 
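For orientation, here is a minimal, hedged sketch of how these options are usually supplied through one of the concrete subclasses. It assumes `PlaywrightCrawler` from the main `crawlee` package; the option names mirror the properties documented in this interface, while the start URL, hook and handler bodies are illustrative only.

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Inherited BasicCrawlerOptions
    maxRequestsPerCrawl: 100, // safety limit against runaway crawls
    maxConcurrency: 10, // upper bound for the AutoscaledPool
    // BrowserCrawlerOptions
    headless: true,
    navigationTimeoutSecs: 30,
    launchContext: {
        launchOptions: { args: ['--disable-gpu'] }, // passed to the browser launcher
    },
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            gotoOptions.waitUntil = 'domcontentloaded';
        },
    ],
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}: ${await page.title()}`);
        await enqueueLinks();
    },
    failedRequestHandler({ request, log }, error) {
        log.error(`Request ${request.url} failed too many times: ${error.message}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```

Options that are left out fall back to the defaults listed above, e.g. `maxRequestRetries = 3` and `requestHandlerTimeoutSecs = 60`.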
--- # BrowserCrawlingContext \ ### Hierarchy * [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)\ * *BrowserCrawlingContext* * [PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md) * [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md) ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**browserController](#browserController) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**page](#page) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**pushData](#pushData) * [**sendRequest](#sendRequest) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from CrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#browserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L59)browserController **browserController: ProvidedController ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: Crawler Inherited from CrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from CrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from CrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from CrawlingContext.log A preconfigured logger for the request handler. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L60)page **page: Page ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? 
: [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from CrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from CrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L61)optionalresponse **response? : Response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from CrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from CrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from CrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from CrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. 
*** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from CrawlingContext.sendRequest Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> --- # BrowserLaunchContext \ ### Hierarchy * [BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md)\ * *BrowserLaunchContext* * [PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) * [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launcher](#launcher) * [**launchOptions](#launchOptions) * [**proxyUrl](#proxyUrl) * [**useChrome](#useChrome) * [**useIncognitoPages](#useIncognitoPages) * [**userAgent](#userAgent) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L40)optionalbrowserPerProxy **browserPerProxy? : boolean Overrides BrowserPluginOptions.browserPerProxy If set to `true`, the crawler respects the proxy url generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L54)optionalexperimentalContainersexperimental **experimentalContainers? : boolean Overrides BrowserPluginOptions.experimentalContainers Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. ### [**](#launcher)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L82)optionallauncher **launcher? : Launcher The type of browser to be launched. By default, `chromium` is used. Other browsers like `webkit` or `firefox` can be used. * **@example** ``` // import the browser from the library first import { firefox } from 'playwright'; ``` For more details, check out the [example](https://crawlee.dev/js/docs/examples/playwright-crawler-firefox.md). ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L54)optionalinheritedlaunchOptions **launchOptions? : TOptions Inherited from BrowserPluginOptions.launchOptions Options that will be passed down to the automation library. E.g. 
`puppeteer.launch(launchOptions);`. This is a good place to set options that you want to apply as defaults. To dynamically override those options per-browser, see the `preLaunchHooks` of [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md). ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L22)optionalproxyUrl **proxyUrl? : string Overrides BrowserPluginOptions.proxyUrl URL to an HTTP proxy server. It must define the port number, and it may also contain proxy username and password. * **@example** ``` `http://bob:pass123@proxy.example.com:1234`. ``` ### [**](#useChrome)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L32)optionaluseChrome **useChrome? : boolean = false If `true` and the `executablePath` option of [`launchOptions`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md#launchOptions) is not set, the launcher will launch full Google Chrome browser available on the machine rather than the bundled Chromium. The path to Chrome executable is taken from the `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable if provided, or defaults to the typical Google Chrome executable location specific for the operating system. ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L47)optionaluseIncognitoPages **useIncognitoPages? : boolean = false Overrides BrowserPluginOptions.useIncognitoPages With this option selected, all pages will be opened in a new incognito browser context. This means they will not share cookies nor cache and their resources will not be throttled by one another. ### [**](#userAgent)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L68)optionaluserAgent **userAgent? : string The `User-Agent` HTTP header used by the browser. If not provided, the function sets `User-Agent` to a reasonable default to reduce the chance of detection of the crawler. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L61)optionaluserDataDir **userDataDir? : string Overrides BrowserPluginOptions.userDataDir Sets the [User Data Directory](https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md) path. The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state. If not specified, a temporary directory is used instead. --- # @crawlee/browser-pool Browser Pool is a small, but powerful and extensible library, that allows you to seamlessly control multiple headless browsers at the same time with only a little configuration, and a single function call. Currently, it supports [Puppeteer](https://github.com/puppeteer/puppeteer), [Playwright](https://github.com/microsoft/playwright), and it can be easily extended with plugins. We created Browser Pool because we regularly needed to execute tasks concurrently in many headless browsers and their pages, but we did not want to worry about launching browsers, closing browsers, restarting them after crashes and so on. We also wanted to easily and reliably manage the whole browser/page lifecycle. You can use Browser Pool for scraping the internet at scale, testing your website in multiple browsers at the same time or launching web automation robots. 
## Installation[​](#installation "Direct link to Installation") Use NPM or Yarn to install `@crawlee/browser-pool`. Note that `@crawlee/browser-pool` does not come preinstalled with browser automation libraries. This allows you to choose your own libraries and their versions, and it also makes `@crawlee/browser-pool` much smaller. Run this command to install `@crawlee/browser-pool` and the `playwright` browser automation library. ``` npm install @crawlee/browser-pool playwright ``` ## Usage[​](#usage "Direct link to Usage") This simple example shows how to open a page in a browser using Browser Pool. We use the provided `PlaywrightPlugin` to wrap a Playwright installation of your own. By calling `browserPool.newPage()` you launch a new Firefox browser and open a new page in that browser. ``` import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool'; import playwright from 'playwright'; const browserPool = new BrowserPool({ browserPlugins: [new PlaywrightPlugin(playwright.chromium)], }); // Launches Chromium with Playwright and returns a Playwright Page. const page1 = await browserPool.newPage(); // You can interact with the page as you're used to. await page1.goto('https://example.com'); // When you're done, close the page. await page1.close(); // Opens a second page in the same browser. const page2 = await browserPool.newPage(); // When everything's finished, tear down the pool. await browserPool.destroy(); ``` ## Launching multiple browsers[​](#launching-multiple-browsers "Direct link to Launching multiple browsers") The basic example shows how to launch a single browser, but the purpose of Browser Pool is to launch many browsers. This is done automatically in the background. You only need to provide the relevant plugins and call `browserPool.newPage()`. ``` import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool'; import playwright from 'playwright'; const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin(playwright.chromium), new PlaywrightPlugin(playwright.firefox), new PlaywrightPlugin(playwright.webkit), ], }); // Open 4 pages in 3 browsers. The browsers are launched // in a round-robin fashion based on the plugin order. const chromiumPage = await browserPool.newPage(); const firefoxPage = await browserPool.newPage(); const webkitPage = await browserPool.newPage(); const chromiumPage2 = await browserPool.newPage(); // Don't forget to close pages / destroy pool when you're done. ``` This round-robin way of opening pages may not be useful for you, if you need to consistently run tasks in multiple environments. For that, there's the `newPageWithEachPlugin` function. ``` import { BrowserPool, PlaywrightPlugin, PuppeteerPlugin } from '@crawlee/browser-pool'; import playwright from 'playwright'; import puppeteer from 'puppeteer'; const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin(playwright.chromium), new PuppeteerPlugin(puppeteer), ], }); const pages = await browserPool.newPageWithEachPlugin(); const promises = pages.map(async page => { // Run some task with each page // pages are in order of plugins: // [playwrightPage, puppeteerPage] await page.close(); }); await Promise.all(promises); // Continue with some more work. ``` ## Features[​](#features "Direct link to Features") Besides a simple interface for launching browsers, Browser Pool includes other helpful features that make browser management more convenient. 
### Simple configuration[​](#simple-configuration "Direct link to Simple configuration") You can easily set the maximum number of pages that can be open in a given browser and also the maximum number of pages to process before a browser [is retired](#graceful-browser-closing). ``` const browserPool = new BrowserPool({ maxOpenPagesPerBrowser: 20, retireBrowserAfterPageCount: 100, }); ``` You can configure the browser launch options either right in the plugins: ``` const playwrightPlugin = new PlaywrightPlugin(playwright.chromium, { launchOptions: { headless: true, } }) ``` Or dynamically in [pre-launch hooks](#lifecycle-management-with-hooks): ``` const browserPool = new BrowserPool({ preLaunchHooks: [(pageId, launchContext) => { if (pageId === 'headful') { launchContext.launchOptions.headless = false; } }] }); ``` ### Proxy management[​](#proxy-management "Direct link to Proxy management") When scraping at scale or testing websites from multiple geolocations, one often needs to use proxy servers. Setting up an authenticated proxy in Puppeteer can be cumbersome, so we created a helper that does all the heavy lifting for you. Simply provide a proxy URL with authentication credentials, and you're done. It works the same for Playwright too. ``` const puppeteerPlugin = new PuppeteerPlugin(puppeteer, { proxyUrl: 'http://:@proxy.com:8000' }); ``` > We plan to extend this by adding a proxy-per-page functionality, allowing you to rotate proxies per page, rather than per browser. ### Lifecycle management with hooks[​](#lifecycle-management-with-hooks "Direct link to Lifecycle management with hooks") Browser Pool allows you to manage the full browser / page lifecycle by attaching hooks to the most important events. Asynchronous hooks are supported, and their execution order is guaranteed. The first parameter of each hook is either a `pageId` for the hooks executed before a `page` is created or a `page` afterward. This is useful to keep track of which hook was triggered by which `newPage()` call. ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin(playwright.chromium), ], preLaunchHooks: [(pageId, launchContext) => { // You can use pre-launch hooks to make dynamic changes // to the launchContext, such as changing a proxyUrl // or updating the browser launchOptions pageId === 'my-page' // true }], postPageCreateHooks: [(page, browserController) => { // It makes sense to make global changes to pages // in post-page-create hooks. For example, you can // inject some JavaScript library, such as jQuery. browserPool.getPageId(page) === 'my-page' // true }] }); await browserPool.newPage({ id: 'my-page' }); ``` > See the API Documentation for all hooks and their arguments. ### Manipulating playwright context using `pageOptions` or `launchOptions`[​](#manipulating-playwright-context-using-pageoptions-or-launchoptions "Direct link to manipulating-playwright-context-using-pageoptions-or-launchoptions") Playwright allows customizing multiple browser attributes by browser context. You can customize some of them once the context is created, but some need to be customized within its creation. This part of the documentation should explain how you can effectively customize the browser context. First of all, let's take a look at what kind of context strategy you chose. You can choose between two strategies by `useIncognitoPages` `LaunchContext` option. Suppose you decide to keep `useIncognitoPages` default `false` and create a shared context across all pages launched by one browser. 
In this case, you should pass the `contextOptions` as a `launchOptions` since the context is created within the new browser launch. The `launchOptions` corresponds to these [playwright options](https://playwright.dev/docs/api/class-browsertype#browsertypelaunchpersistentcontextuserdatadir-options). As you can see, these options contain not only ordinary playwright launch options but also the context options. If you set `useIncognitoPages` to `true`, you will create a new context within each new page, which allows you to handle each page its cookies and application data. This approach allows you to pass the context options as `pageOptions` because a new context is created once you create a new page. In this case, the `pageOptions` corresponds to these [playwright options](https://playwright.dev/docs/api/class-browser#browsernewpageoptions). **Changing context options with `LaunchContext`:** This will only work if you keep the default value for `useIncognitoPages` (`false`). ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin( playwright.chromium, { launchOptions: { deviceScaleFactor: 2, }, }, ), ], }); ``` **Changing context options with `browserPool.newPage` options:** ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin( playwright.chromium, { useIncognitoPages: true, // You must turn on incognito pages. launchOptions: { // launch options headless: false, devtools: true, }, }, ), ], }); // Launches Chromium with Playwright and returns a Playwright Page. const page = await browserPool.newPage({ pageOptions: { // context options deviceScaleFactor: 2, colorScheme: 'light', locale: 'de-DE', }, }); ``` **Changing context options with `prePageCreateHooks` options:** ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin( playwright.chromium, { useIncognitoPages: true, launchOptions: { // launch options headless: false, devtools: true, }, }, ), ], prePageCreateHooks: [ (pageId, browserController, pageOptions) => { pageOptions.deviceScaleFactor = 2; pageOptions.colorScheme = 'dark'; pageOptions.locale = 'de-DE'; // You must modify the 'pageOptions' object, not assign to the variable. // pageOptions = {deviceScaleFactor: 2, ...etc} => This will not work! }, ], }); // Launches Chromium with Playwright and returns a Playwright Page. const page = await browserPool.newPage(); ``` ### Single API for common operations[​](#single-api-for-common-operations "Direct link to Single API for common operations") Puppeteer and Playwright handle some things differently. Browser Pool attempts to remove those differences for the most common use-cases. ``` // Playwright const cookies = await context.cookies(); await context.addCookies(cookies); // Puppeteer const cookies = await page.cookies(); await page.setCookie(...cookies); // BrowserPool uses the same API for all plugins const cookies = await browserController.getCookies(page); await browserController.setCookies(page, cookies); ``` ### Graceful browser closing[​](#graceful-browser-closing "Direct link to Graceful browser closing") With Browser Pool, browsers are not closed, but retired. A retired browser will no longer open new pages, but it will wait until the open pages are closed, allowing your running tasks to finish. If a browser gets stuck in limbo, it will be killed after a timeout to prevent hanging browser processes. ### Changing browser fingerprints a.k.a. 
browser signatures[​](#changing-browser-fingerprints-aka-browser-signatures "Direct link to Changing browser fingerprints a.k.a. browser signatures") > Fingerprints are enabled by default since v3. Changing browser fingerprints is beneficial for avoiding getting blocked and simulating real user browsers. With Browser Pool, you can do this otherwise complicated technique by enabling the `useFingerprints` option. The fingerprints are by default tied to the respective proxy urls to not use the same unique fingerprint from various IP addresses. You can disable this behavior in the [`fingerprintOptions`](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md). In the `fingerprintOptions`, You can also control which fingerprints are generated. You can control parameters as browser, operating system, and browser versions. The `browser-pool` module exports three constructors. One for `BrowserPool` itself and two for the included Puppeteer and Playwright plugins. **Example:** ``` import { BrowserPool, PuppeteerPlugin, PlaywrightPlugin } from '@crawlee/browser-pool'; import puppeteer from 'puppeteer'; import playwright from 'playwright'; const browserPool = new BrowserPool({ browserPlugins: [ new PuppeteerPlugin(puppeteer), new PlaywrightPlugin(playwright.chromium), ] }); ``` ## Index[**](#Index) ### Enumerations * [**BROWSER\_CONTROLLER\_EVENTS](https://crawlee.dev/js/api/browser-pool/enum/BROWSER_CONTROLLER_EVENTS.md) * [**BROWSER\_POOL\_EVENTS](https://crawlee.dev/js/api/browser-pool/enum/BROWSER_POOL_EVENTS.md) * [**BrowserName](https://crawlee.dev/js/api/browser-pool/enum/BrowserName.md) * [**DeviceCategory](https://crawlee.dev/js/api/browser-pool/enum/DeviceCategory.md) * [**OperatingSystemsName](https://crawlee.dev/js/api/browser-pool/enum/OperatingSystemsName.md) ### Classes * [**BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) * [**BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) * [**BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md) * [**BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) * [**LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) * [**PlaywrightBrowser](https://crawlee.dev/js/api/browser-pool/class/PlaywrightBrowser.md) * [**PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) * [**PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) * [**PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) * [**PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) ### Interfaces * [**BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md) * [**BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md) * [**BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md) * [**BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md) * [**BrowserPoolNewPageInNewBrowserOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageInNewBrowserOptions.md) * [**BrowserPoolNewPageOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageOptions.md) * [**BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md) * 
[**BrowserSpecification](https://crawlee.dev/js/api/browser-pool/interface/BrowserSpecification.md) * [**CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md) * [**CreateLaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md) * [**FingerprintGenerator](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGenerator.md) * [**FingerprintGeneratorOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGeneratorOptions.md) * [**FingerprintOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md) * [**GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) * [**LaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/LaunchContextOptions.md) ### Type Aliases * [**InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray) * [**PostLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PostLaunchHook) * [**PostPageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCloseHook) * [**PostPageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCreateHook) * [**PreLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PreLaunchHook) * [**PrePageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCloseHook) * [**PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook) * [**UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise) ### Variables * [**DEFAULT\_USER\_AGENT](https://crawlee.dev/js/api/browser-pool.md#DEFAULT_USER_AGENT) ## Type Aliases[**](<#Type Aliases>) ### [**](#InferBrowserPluginArray)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/utils.ts#L15)InferBrowserPluginArray **InferBrowserPluginArray\: Input extends readonly \[infer FirstValue, ...infer Rest] | \[infer FirstValue, ...infer Rest] ? FirstValue extends [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) ? [InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray)\ : FirstValue extends [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) ? [InferBrowserPluginArray](https://crawlee.dev/js/api/browser-pool.md#InferBrowserPluginArray)\ : never : Input extends \[] ? Result : Input extends readonly infer U\[] ? \[U] extends \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) | [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] ? U\[] : never : Result #### Type parameters * **Input**: readonly unknown\[] * **Result**: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\[] = \[] ### [**](#PostLaunchHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L136)PostLaunchHook **PostLaunchHook\: (pageId, browserController) => void | Promise\ Post-launch hooks are executed as soon as a browser is launched. The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) To guarantee order of execution before other hooks in the same browser, the [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) methods cannot be used until the post-launch hooks complete. 
If you attempt to call `await browserController.close()` from a post-launch hook, it will deadlock the process. This API is subject to change. *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) #### Type declaration * * **(pageId, browserController): void | Promise\ - #### Parameters * ##### pageId: string * ##### browserController: BC #### Returns void | Promise\ ### [**](#PostPageCloseHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L186)PostPageCloseHook **PostPageCloseHook\: (pageId, browserController) => void | Promise\ Post-page-close hooks allow you to do page related clean up. The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) #### Type declaration * * **(pageId, browserController): void | Promise\ - #### Parameters * ##### pageId: string * ##### browserController: BC #### Returns void | Promise\ ### [**](#PostPageCreateHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L164)PostPageCreateHook **PostPageCreateHook\: (page, browserController) => void | Promise\ Post-page-create hooks are called right after a new page is created and all internal actions of Browser Pool are completed. This is the place to make changes to a page that you would like to apply to all pages. Such as injecting a JavaScript library into all pages. The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) * **Page** = [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\> #### Type declaration * * **(page, browserController): void | Promise\ - #### Parameters * ##### page: Page * ##### browserController: BC #### Returns void | Promise\ ### [**](#PreLaunchHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L125)PreLaunchHook **PreLaunchHook\: (pageId, launchContext) => void | Promise\ Pre-launch hooks are executed just before a browser is launched and provide a good opportunity to dynamically change the launch options. The hooks are called with two arguments: `pageId`: `string` and `launchContext`: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) *** #### Type parameters * **LC**: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) #### Type declaration * * **(pageId, launchContext): void | Promise\ - #### Parameters * ##### pageId: string * ##### launchContext: LC #### Returns void | Promise\ ### [**](#PrePageCloseHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L176)PrePageCloseHook **PrePageCloseHook\: (page, browserController) => void | Promise\ Pre-page-close hooks give you the opportunity to make last second changes in a page that's about to be closed, such as saving a snapshot or updating state. 
The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) * **Page** = [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\> #### Type declaration * * **(page, browserController): void | Promise\ - #### Parameters * ##### page: Page * ##### browserController: BC #### Returns void | Promise\ ### [**](#PrePageCreateHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L150)PrePageCreateHook **PrePageCreateHook\: (pageId, browserController, pageOptions) => void | Promise\ Pre-page-create hooks are executed just before a new page is created. They are useful to make dynamic changes to the browser before opening a page. The hooks are called with three arguments: `pageId`: `string`, `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) and `pageOptions`: `object|undefined` - This only works if the underlying `BrowserController` supports new page options. So far, new page options are only supported by `PlaywrightController` in incognito contexts. If the page options are not supported by `BrowserController` the `pageOptions` argument is `undefined`. *** #### Type parameters * **BC**: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) * **PO** = Parameters\\[0] #### Type declaration * * **(pageId, browserController, pageOptions): void | Promise\ - #### Parameters * ##### pageId: string * ##### browserController: BC * ##### optionalpageOptions: PO #### Returns void | Promise\ ### [**](#UnwrapPromise)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/utils.ts#L5)UnwrapPromise **UnwrapPromise\: T extends PromiseLike\ R> ? [UnwrapPromise](https://crawlee.dev/js/api/browser-pool.md#UnwrapPromise)\ : T #### Type parameters * **T** ## Variables[**](#Variables) ### [**](#DEFAULT_USER_AGENT)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L20)constDEFAULT\_USER\_AGENT **DEFAULT\_USER\_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10\_15\_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10\_15\_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36' The default User Agent used by `PlaywrightCrawler`, `launchPlaywright`, 'PuppeteerCrawler' and 'launchPuppeteer' when Chromium/Chrome browser is launched: * in headless mode, * without using a fingerprint, * without specifying a user agent. Last updated on 2022-05-05. After you update it here, please update it also in jsdom-crawler.ts --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/browser-pool ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * correctly apply `launchOptions` with `useIncognitoPages` ([#3181](https://github.com/apify/crawlee/issues/3181)) ([84a4b70](https://github.com/apify/crawlee/commit/84a4b709ee59d9edbcdc9a19559fefa4e9139ba4)), closes [/github.com/apify/crawlee/issues/3173#issuecomment-3346728227](https://github.com//github.com/apify/crawlee/issues/3173/issues/issuecomment-3346728227) [#3173](https://github.com/apify/crawlee/issues/3173) [#3173](https://github.com/apify/crawlee/issues/3173) ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/browser-pool # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/browser-pool ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/browser-pool # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * don't retire browsers with long-running `pre|postLaunchHooks` prematurely ([#3062](https://github.com/apify/crawlee/issues/3062)) ([681660e](https://github.com/apify/crawlee/commit/681660e35a1ceaca5e96a7f61d5a7c66ec32bcde)) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * await `_createPageForBrowser` in browser pool ([#2950](https://github.com/apify/crawlee/issues/2950)) 
([27ba74b](https://github.com/apify/crawlee/commit/27ba74bacfcaa0467e7d97eb27d6a9c1d9cea9be)), closes [#2789](https://github.com/apify/crawlee/issues/2789) * Fix trailing slash removal in BrowserPool ([#2921](https://github.com/apify/crawlee/issues/2921)) ([c1fc439](https://github.com/apify/crawlee/commit/c1fc439e8e9cf74808912c66a1915f1bfd345b5f)), closes [#2878](https://github.com/apify/crawlee/issues/2878) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/browser-pool ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/browser-pool # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/browser-pool ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/browser-pool ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/browser-pool # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * update `fingerprintGeneratorOptions` types ([#2705](https://github.com/apify/crawlee/issues/2705)) ([fcb098d](https://github.com/apify/crawlee/commit/fcb098d6357b69e6d1790765076e4fe4146c8143)), closes [/github.com/apify/fingerprint-suite/blob/c61814e6ba8822543deb0ce6c03e0a0249933629/packages/fingerprint-generator/src/fingerprint-generator.ts#L73](https://github.com//github.com/apify/fingerprint-suite/blob/c61814e6ba8822543deb0ce6c03e0a0249933629/packages/fingerprint-generator/src/fingerprint-generator.ts/issues/L73) [/github.com/apify/fingerprint-suite/blob/c61814e6ba8822543deb0ce6c03e0a0249933629/packages/header-generator/src/header-generator.ts#L87](https://github.com//github.com/apify/fingerprint-suite/blob/c61814e6ba8822543deb0ce6c03e0a0249933629/packages/header-generator/src/header-generator.ts/issues/L87) [#2703](https://github.com/apify/crawlee/issues/2703) ### Features[​](#features "Direct link to Features") * allow using other HTTP clients ([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/browser-pool ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/browser-pool ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/browser-pool ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/browser-pool ## 
[3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/browser-pool # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/browser-pool ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/browser-pool ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/browser-pool ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * increase timeout for retiring inactive browsers ([#2523](https://github.com/apify/crawlee/issues/2523)) ([195f176](https://github.com/apify/crawlee/commit/195f1766a03293db19caa33f9fc3d4ab08081f71)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/browser-pool ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/browser-pool # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/browser-pool ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/browser-pool ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") ### Features[​](#features-1 "Direct link to Features") * `browserPerProxy` browser launch option ([#2418](https://github.com/apify/crawlee/issues/2418)) ([df57b29](https://github.com/apify/crawlee/commit/df57b2965ac8c8b3adf807e3bad8a649814fa213)) # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Features[​](#features-2 "Direct link to Features") * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) * better `newUrlFunction` for ProxyConfiguration ([#2392](https://github.com/apify/crawlee/issues/2392)) ([330598b](https://github.com/apify/crawlee/commit/330598b348ad27bc7c73732294a14b655ccd3507)), closes [#2348](https://github.com/apify/crawlee/issues/2348) [#2065](https://github.com/apify/crawlee/issues/2065) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * fix detection of older puppeteer versions ([890669b](https://github.com/apify/crawlee/commit/890669b0b3eef94d00ad69aa022e13b3109a660c)), closes [#2370](https://github.com/apify/crawlee/issues/2370) * **puppeteer:** improve detection of older versions ([98d4e86](https://github.com/apify/crawlee/commit/98d4e8664a54c1a134446a1b6ab9042d14ed8629)) ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 
"Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/browser-pool # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * **puppeteer:** add 'process' to the browser bound methods ([#2329](https://github.com/apify/crawlee/issues/2329)) ([2750ba6](https://github.com/apify/crawlee/commit/2750ba646ef3c1d51eacdd8e7d67be0e14fb2a97)) * **puppeteer:** support `puppeteer@v22` ([#2337](https://github.com/apify/crawlee/issues/2337)) ([3cc360a](https://github.com/apify/crawlee/commit/3cc360a1ea94147133f9785d65834f360f7b42a7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/browser-pool ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/browser-pool ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/browser-pool # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * **browser-pool:** respect user options before assigning fingerprints ([#2190](https://github.com/apify/crawlee/issues/2190)) ([f050776](https://github.com/apify/crawlee/commit/f050776a916a0530aca6727a447a49252e643417)), closes [#2164](https://github.com/apify/crawlee/issues/2164) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/browser-pool ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Features[​](#features-3 "Direct link to Features") * **puppeteer:** enable `new` headless mode ([#1910](https://github.com/apify/crawlee/issues/1910)) ([7fc999c](https://github.com/apify/crawlee/commit/7fc999cf4658ca69b97f16d434444081998470f4)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * **BrowserPool:** ignore `--no-sandbox` flag for webkit launcher ([#2148](https://github.com/apify/crawlee/issues/2148)) ([1eb2f08](https://github.com/apify/crawlee/commit/1eb2f08a3cdead5dd21ffde4162d403175a4594c)), closes [#1797](https://github.com/apify/crawlee/issues/1797) * provide more detailed error messages for browser launch errors ([#2157](https://github.com/apify/crawlee/issues/2157)) ([f188ebe](https://github.com/apify/crawlee/commit/f188ebe0b4ae7594225ef37d8160d175d4535ccd)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct 
link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * allow to use any version of puppeteer or playwright ([#2102](https://github.com/apify/crawlee/issues/2102)) ([0cafceb](https://github.com/apify/crawlee/commit/0cafceb2966d430dd1b2a1b619fe66da1c951f4c)), closes [#2101](https://github.com/apify/crawlee/issues/2101) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **browser-pool:** improve error handling when browser is not found ([#2050](https://github.com/apify/crawlee/issues/2050)) ([282527f](https://github.com/apify/crawlee/commit/282527f31bb366a4e52463212f652dcf6679b6c3)), closes [#1459](https://github.com/apify/crawlee/issues/1459) * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/browser-pool ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/browser-pool # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) **Note:** Version bump only for package @crawlee/browser-pool ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/browser-pool ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/browser-pool # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/browser-pool ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/browser-pool ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") **Note:** Version bump only for package @crawlee/browser-pool ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") **Note:** Version bump only for package @crawlee/browser-pool # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) **Note:** Version bump only for package @crawlee/browser-pool ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/browser-pool ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/browser-pool # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug 
Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * update playwright to 1.29.2 and make peer dep. less strict ([#1735](https://github.com/apify/crawlee/issues/1735)) ([c654fcd](https://github.com/apify/crawlee/commit/c654fcdea06fb203b7952ed97650190cc0e74394)), closes [#1723](https://github.com/apify/crawlee/issues/1723) ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/browser-pool ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/browser-pool ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/browser-pool # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/browser-pool ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Features[​](#features-4 "Direct link to Features") * enable tab-as-a-container for Firefox ([#1456](https://github.com/apify/crawlee/issues/1456)) ([ae5ba4f](https://github.com/apify/crawlee/commit/ae5ba4f15fd6d14f444486234753ce1781c74cc8)) --- # abstractBrowserController \ The `BrowserController` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerController` or `PlaywrightController` extend. Second, it defines the public interface of the specialized classes which provide only private methods. Therefore, we do not keep documentation for the specialized classes, because it's the same for all of them. ### Hierarchy * TypedEmitter<[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\> * *BrowserController* * [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) * [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) ## Index[**](#Index) ### Properties * [**activePages](#activePages) * [**browser](#browser) * [**browserPlugin](#browserPlugin) * [**id](#id) * [**isActive](#isActive) * [**lastPageOpenedAt](#lastPageOpenedAt) * [**launchContext](#launchContext) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**totalPages](#totalPages) * [**defaultMaxListeners](#defaultMaxListeners) ### Methods * [**addListener](#addListener) * [**close](#close) * [**emit](#emit) * [**eventNames](#eventNames) * [**getCookies](#getCookies) * [**getMaxListeners](#getMaxListeners) * [**kill](#kill) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setCookies](#setCookies) * [**setMaxListeners](#setMaxListeners) ## Properties[**](#Properties) ### [**](#activePages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L73)activePages **activePages: number = 0 ### [**](#browser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L52)browser **browser: LaunchResult = ... Browser representation of the underlying automation library. 
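For illustration, a minimal sketch of reaching this underlying library object from outside the pool, assuming a Playwright-backed setup; the controller is obtained with `BrowserPool.getBrowserControllerByPage()`, and `contexts()` is Playwright's own API rather than part of `BrowserController`:

```
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';
import playwright from 'playwright';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(playwright.chromium)],
});

const page = await browserPool.newPage();

// The controller wraps the launched browser; `browser` is the library-native
// object, so library-specific calls such as contexts() work on it directly.
const browserController = browserPool.getBrowserControllerByPage(page);
console.dir(browserController?.browser.contexts());

await browserPool.destroy();
```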
### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L47)browserPlugin **browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ The `BrowserPlugin` instance used to launch the browser. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L42)id **id: string = ... ### [**](#isActive)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L71)isActive **isActive: boolean = false ### [**](#lastPageOpenedAt)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L77)lastPageOpenedAt **lastPageOpenedAt: number = ... ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L57)launchContext **launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ = ... The configuration the browser was launched with. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L63)optionalproxyTier **proxyTier? : number The proxy tier tied to this browser controller. `undefined` if no tiered proxy is used. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L69)optionalproxyUrl **proxyUrl? : string The proxy URL used by the browser controller. This is set every time the browser controller uses proxy (even the tiered one). `undefined` if no proxy is used ### [**](#totalPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L75)totalPages **totalPages: number = 0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L10)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from TypedEmitter.defaultMaxListeners ## Methods[**](#Methods) ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L11)externalinheritedaddListener * ****addListener**\(event, listener): this - Inherited from TypedEmitter.addListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L131)close * ****close**(): Promise\ - Gracefully closes the browser and makes sure there will be no lingering browser processes. Emits 'browserClosed' event. 
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L19)externalinheritedemit * ****emit**\(event, ...args): boolean - Inherited from TypedEmitter.emit #### Parameters * ##### externalevent: U * ##### externalrest...args: Parameters<[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]> #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L20)externalinheritedeventNames * ****eventNames**\(): U\[] - Inherited from TypedEmitter.eventNames #### Returns U\[] ### [**](#getCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L181)getCookies * ****getCookies**(page): Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> - #### Parameters * ##### page: NewPageResult #### Returns Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L24)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from TypedEmitter.getMaxListeners #### Returns number ### [**](#kill)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L156)kill * ****kill**(): Promise\ - Immediately kills the browser process. Emits 'browserClosed' event. *** #### Returns Promise\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L21)externalinheritedlistenerCount * ****listenerCount**(type): number - Inherited from TypedEmitter.listenerCount #### Parameters * ##### externaltype: BROWSER\_CLOSED #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L22)externalinheritedlisteners * ****listeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] - Inherited from TypedEmitter.listeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L18)externalinheritedoff * ****off**\(event, listener): this - Inherited from TypedEmitter.off #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L17)externalinheritedon * ****on**\(event, listener): this - Inherited from TypedEmitter.on #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L16)externalinheritedonce * ****once**\(event, listener): this - Inherited from TypedEmitter.once #### Parameters * ##### externalevent: U * ##### externallistener: 
[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L12)externalinheritedprependListener * ****prependListener**\(event, listener): this - Inherited from TypedEmitter.prependListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L13)externalinheritedprependOnceListener * ****prependOnceListener**\(event, listener): this - Inherited from TypedEmitter.prependOnceListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L23)externalinheritedrawListeners * ****rawListeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] - Inherited from TypedEmitter.rawListeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L15)externalinheritedremoveAllListeners * ****removeAllListeners**(event): this - Inherited from TypedEmitter.removeAllListeners #### Parameters * ##### externaloptionalevent: BROWSER\_CLOSED #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L14)externalinheritedremoveListener * ****removeListener**\(event, listener): this - Inherited from TypedEmitter.removeListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#setCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L177)setCookies * ****setCookies**(page, cookies): Promise\ - #### Parameters * ##### page: NewPageResult * ##### cookies: [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L25)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from TypedEmitter.setMaxListeners #### Parameters * ##### externaln: number #### Returns this --- # BrowserLaunchError Errors of `CriticalError` type will shut down the whole crawler. 
Error handlers catching CriticalError should avoid logging it, as it will be logged by Node.js itself at the end ### Hierarchy * [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) * *BrowserLaunchError* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L295)publicconstructor * ****new BrowserLaunchError**(...args): [BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) - Overrides CriticalError.constructor #### Parameters * ##### rest...args: \[message?: string, options?: ErrorOptions] #### Returns [BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from CriticalError.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from CriticalError.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from CriticalError.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from CriticalError.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from CriticalError.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from CriticalError.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. 
The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from CriticalError.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from CriticalError.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # abstractBrowserPlugin \ The `BrowserPlugin` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerPlugin` or `PlaywrightPlugin` extend. Second, it allows the user to configure the automation libraries and feed them to [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) for use. ### Hierarchy * *BrowserPlugin* * [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) * [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launchOptions](#launchOptions) * [**library](#library) * [**name](#name) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ### Methods * [**createController](#createController) * [**createLaunchContext](#createLaunchContext) * [**launch](#launch) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L129)constructor * ****new BrowserPlugin**\(library, options): [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ - #### Parameters * ##### library: Library * ##### options: [BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md)\ = {} #### Returns [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L127)optionalbrowserPerProxy **browserPerProxy? 
: boolean ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L125)experimentalContainers **experimentalContainers: boolean ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L117)launchOptions **launchOptions: LibraryOptions ### [**](#library)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L115)library **library: Library ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L113)name **name: string = ... ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L119)optionalproxyUrl **proxyUrl? : string ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L123)useIncognitoPages **useIncognitoPages: boolean ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L121)optionaluserDataDir **userDataDir? : string ## Methods[**](#Methods) ### [**](#createController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L181)createController * ****createController**(): [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ - #### Returns [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ ### [**](#createLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L154)createLaunchContext * ****createLaunchContext**(options): [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ - Creates a `LaunchContext` with all the information needed to launch a browser. Aside from library specific launch options, it also includes internal properties used by `BrowserPool` for management of the pool and extra features. *** #### Parameters * ##### options: [CreateLaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md)\ = {} #### Returns [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ ### [**](#launch)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L188)launch * ****launch**(launchContext): Promise\ - Launches the browser using provided launch context. *** #### Parameters * ##### launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ = ... #### Returns Promise\ --- # BrowserPool \ The `BrowserPool` class is the most important class of the `browser-pool` module. It manages opening and closing of browsers and their pages and its constructor options allow easy configuration of the browsers' and pages' lifecycle. The most important and useful constructor options are the various lifecycle hooks. Those allow you to sequentially call a list of (asynchronous) functions at each stage of the browser / page lifecycle. 
**Example:** ``` import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool'; import playwright from 'playwright'; const browserPool = new BrowserPool({ browserPlugins: [new PlaywrightPlugin(playwright.chromium)], preLaunchHooks: [(pageId, launchContext) => { // do something before a browser gets launched launchContext.launchOptions.headless = false; }], postLaunchHooks: [(pageId, browserController) => { // manipulate the browser right after launch console.dir(browserController.browser.contexts()); }], prePageCreateHooks: [(pageId, browserController) => { if (pageId === 'my-page') { // make changes right before a specific page is created } }], postPageCreateHooks: [async (page, browserController) => { // update some or all new pages await page.evaluate(() => { // now all pages will have 'foo' window.foo = 'bar' }) }], prePageCloseHooks: [async (page, browserController) => { // collect information just before a page closes await page.screenshot(); }], postPageCloseHooks: [(pageId, browserController) => { // clean up or log after a job is done console.log('Page closed: ', pageId) }] }); ``` ### Hierarchy * TypedEmitter<[BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\> * *BrowserPool* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**activeBrowserControllers](#activeBrowserControllers) * [**browserPlugins](#browserPlugins) * [**closeInactiveBrowserAfterMillis](#closeInactiveBrowserAfterMillis) * [**fingerprintCache](#fingerprintCache) * [**fingerprintGenerator](#fingerprintGenerator) * [**fingerprintInjector](#fingerprintInjector) * [**fingerprintOptions](#fingerprintOptions) * [**maxOpenPagesPerBrowser](#maxOpenPagesPerBrowser) * [**operationTimeoutMillis](#operationTimeoutMillis) * [**pageCounter](#pageCounter) * [**pageIds](#pageIds) * [**pages](#pages) * [**pageToBrowserController](#pageToBrowserController) * [**postLaunchHooks](#postLaunchHooks) * [**postPageCloseHooks](#postPageCloseHooks) * [**postPageCreateHooks](#postPageCreateHooks) * [**preLaunchHooks](#preLaunchHooks) * [**prePageCloseHooks](#prePageCloseHooks) * [**prePageCreateHooks](#prePageCreateHooks) * [**retireBrowserAfterPageCount](#retireBrowserAfterPageCount) * [**retiredBrowserControllers](#retiredBrowserControllers) * [**startingBrowserControllers](#startingBrowserControllers) * [**useFingerprints](#useFingerprints) * [**defaultMaxListeners](#defaultMaxListeners) ### Methods * [**addListener](#addListener) * [**closeAllBrowsers](#closeAllBrowsers) * [**destroy](#destroy) * [**emit](#emit) * [**eventNames](#eventNames) * [**getBrowserControllerByPage](#getBrowserControllerByPage) * [**getMaxListeners](#getMaxListeners) * [**getPage](#getPage) * [**getPageId](#getPageId) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**newPage](#newPage) * [**newPageInNewBrowser](#newPageInNewBrowser) * [**newPageWithEachPlugin](#newPageWithEachPlugin) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**retireAllBrowsers](#retireAllBrowsers) * [**retireBrowserByPage](#retireBrowserByPage) * [**retireBrowserController](#retireBrowserController) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### 
[**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L338)constructor * ****new BrowserPool**\(options): [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)\ - Overrides TypedEmitter\>.constructor #### Parameters * ##### options: Options & [BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)\ #### Returns [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)\ ## Properties[**](#Properties) ### [**](#activeBrowserControllers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L322)activeBrowserControllers **activeBrowserControllers: Set\ = ... ### [**](#browserPlugins)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L305)browserPlugins **browserPlugins: BrowserPlugins ### [**](#closeInactiveBrowserAfterMillis)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L309)closeInactiveBrowserAfterMillis **closeInactiveBrowserAfterMillis: number ### [**](#fingerprintCache)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L327)optionalfingerprintCache **fingerprintCache? : QuickLRU\ ### [**](#fingerprintGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L326)optionalfingerprintGenerator **fingerprintGenerator? : FingerprintGenerator ### [**](#fingerprintInjector)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L325)optionalfingerprintInjector **fingerprintInjector? : FingerprintInjector ### [**](#fingerprintOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L311)fingerprintOptions **fingerprintOptions: [FingerprintOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md) ### [**](#maxOpenPagesPerBrowser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L306)maxOpenPagesPerBrowser **maxOpenPagesPerBrowser: number ### [**](#operationTimeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L308)operationTimeoutMillis **operationTimeoutMillis: number ### [**](#pageCounter)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L318)pageCounter **pageCounter: number = 0 ### [**](#pageIds)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L320)pageIds **pageIds: WeakMap\ = ... ### [**](#pages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L319)pages **pages: Map\ = ... ### [**](#pageToBrowserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L324)pageToBrowserController **pageToBrowserController: WeakMap\ = ... 
### [**](#postLaunchHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L313)postLaunchHooks **postLaunchHooks: [PostLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PostLaunchHook)\\[] ### [**](#postPageCloseHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L317)postPageCloseHooks **postPageCloseHooks: [PostPageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCloseHook)\\[] ### [**](#postPageCreateHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L315)postPageCreateHooks **postPageCreateHooks: [PostPageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCreateHook)\\[] ### [**](#preLaunchHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L312)preLaunchHooks **preLaunchHooks: [PreLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PreLaunchHook)\\[] ### [**](#prePageCloseHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L316)prePageCloseHooks **prePageCloseHooks: [PrePageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCloseHook)\\[] ### [**](#prePageCreateHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L314)prePageCreateHooks **prePageCreateHooks: [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook)\\[] ### [**](#retireBrowserAfterPageCount)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L307)retireBrowserAfterPageCount **retireBrowserAfterPageCount: number ### [**](#retiredBrowserControllers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L323)retiredBrowserControllers **retiredBrowserControllers: Set\ = ... ### [**](#startingBrowserControllers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L321)startingBrowserControllers **startingBrowserControllers: Set\ = ... ### [**](#useFingerprints)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L310)optionaluseFingerprints **useFingerprints? : boolean ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L10)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from TypedEmitter.defaultMaxListeners ## Methods[**](#Methods) ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L11)externalinheritedaddListener * ****addListener**\(event, listener): this - Inherited from TypedEmitter.addListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#closeAllBrowsers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L649)closeAllBrowsers * ****closeAllBrowsers**(): Promise\ - Closes all managed browsers without waiting for pages to close. *** #### Returns Promise\ ### [**](#destroy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L661)destroy * ****destroy**(): Promise\ - Closes all managed browsers and tears down the pool. 
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L19)externalinheritedemit * ****emit**\(event, ...args): boolean - Inherited from TypedEmitter.emit #### Parameters * ##### externalevent: U * ##### externalrest...args: Parameters<[BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]> #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L20)externalinheritedeventNames * ****eventNames**\(): U\[] - Inherited from TypedEmitter.eventNames #### Returns U\[] ### [**](#getBrowserControllerByPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L525)getBrowserControllerByPage * ****getBrowserControllerByPage**(page): undefined | BrowserControllerReturn - Retrieves a [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) for a given page. This is useful when you're working only with pages and need to access the browser manipulation functionality. You could access the browser directly from the page, but that would circumvent `BrowserPool` and most likely cause weird things to happen, so please always use `BrowserController` to control your browsers. The function returns `undefined` if the browser is closed. *** #### Parameters * ##### page: PageReturn Browser plugin page #### Returns undefined | BrowserControllerReturn ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L24)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from TypedEmitter.getMaxListeners #### Returns number ### [**](#getPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L535)getPage * ****getPage**(id): undefined | PageReturn - If you provided a custom ID to one of your pages or saved the randomly generated one, you can use this function to retrieve the page. If the page is no longer open, the function will return `undefined`. *** #### Parameters * ##### id: string #### Returns undefined | PageReturn ### [**](#getPageId)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L545)getPageId * ****getPageId**(page): undefined | string - Page IDs are used throughout `BrowserPool` as a method of linking events. You can use a page ID to track the full lifecycle of the page. It is created even before a browser is launched and stays with the page until it's closed. 
*** #### Parameters * ##### page: PageReturn #### Returns undefined | string ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L21)externalinheritedlistenerCount * ****listenerCount**(type): number - Inherited from TypedEmitter.listenerCount #### Parameters * ##### externaltype: keyof [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\ #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L22)externalinheritedlisteners * ****listeners**\(type): [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]\[] - Inherited from TypedEmitter.listeners #### Parameters * ##### externaltype: U #### Returns [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]\[] ### [**](#newPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L437)newPage * ****newPage**(options): Promise\ - Opens a new page in one of the running browsers or launches a new browser and opens a page there, if no browsers are active, or their page limits have been exceeded. *** #### Parameters * ##### options: [BrowserPoolNewPageOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageOptions.md)\ = {} #### Returns Promise\ ### [**](#newPageInNewBrowser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L465)newPageInNewBrowser * ****newPageInNewBrowser**(options): Promise\ - Unlike newPage, `newPageInNewBrowser` always launches a new browser to open the page in. Use the `launchOptions` option to configure the new browser. *** #### Parameters * ##### options: [BrowserPoolNewPageInNewBrowserOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageInNewBrowserOptions.md)\ = {} #### Returns Promise\ ### [**](#newPageWithEachPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L499)newPageWithEachPlugin * ****newPageWithEachPlugin**(optionsList): Promise\ - Opens new pages with all available plugins and returns an array of pages in the same order as the plugins were provided to `BrowserPool`. This is useful when you want to run a script in multiple environments at the same time, typically in testing or website analysis. 
**Example:** ``` const browserPool = new BrowserPool({ browserPlugins: [ new PlaywrightPlugin(playwright.chromium), new PlaywrightPlugin(playwright.firefox), new PlaywrightPlugin(playwright.webkit), ] }); const pages = await browserPool.newPageWithEachPlugin(); const [chromiumPage, firefoxPage, webkitPage] = pages; ``` *** #### Parameters * ##### optionsList: Omit<[BrowserPoolNewPageOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolNewPageOptions.md)\, browserPlugin>\[] = \[] #### Returns Promise\ ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L18)externalinheritedoff * ****off**\(event, listener): this - Inherited from TypedEmitter.off #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L17)externalinheritedon * ****on**\(event, listener): this - Inherited from TypedEmitter.on #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L16)externalinheritedonce * ****once**\(event, listener): this - Inherited from TypedEmitter.once #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L12)externalinheritedprependListener * ****prependListener**\(event, listener): this - Inherited from TypedEmitter.prependListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L13)externalinheritedprependOnceListener * ****prependOnceListener**\(event, listener): this - Inherited from TypedEmitter.prependOnceListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U] #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L23)externalinheritedrawListeners * ****rawListeners**\(type): [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]\[] - Inherited from TypedEmitter.rawListeners #### Parameters * ##### externaltype: U #### Returns [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\\[U]\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L15)externalinheritedremoveAllListeners * ****removeAllListeners**(event): this - Inherited from TypedEmitter.removeAllListeners #### Parameters * ##### externaloptionalevent: keyof BrowserPoolEvents\ #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L14)externalinheritedremoveListener * 
**removeListener**(event, listener): this - Inherited from TypedEmitter.removeListener

#### Parameters

* ##### external event: U
* ##### external listener: [BrowserPoolEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolEvents.md)\[U]

#### Returns this

### [**](#retireAllBrowsers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L639)retireAllBrowsers

* **retireAllBrowsers**(): void - Removes all active browsers from the pool. The browsers will be closed after all their pages are closed.

***

#### Returns void

### [**](#retireBrowserByPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L630)retireBrowserByPage

* **retireBrowserByPage**(page): void - Removes a browser from the pool. It will be closed after all its pages are closed.

***

#### Parameters

* ##### page: PageReturn

#### Returns void

### [**](#retireBrowserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L613)retireBrowserController

* **retireBrowserController**(browserController): void - Removes a browser controller from the pool. The underlying browser will be closed after all its pages are closed.

***

#### Parameters

* ##### browserController: BrowserControllerReturn

#### Returns void

### [**](#setMaxListeners)external inherited setMaxListeners

* **setMaxListeners**(n): this - Inherited from TypedEmitter.setMaxListeners

#### Parameters

* ##### external n: number

#### Returns this

---

# LaunchContext

## Index[**](#Index)

### Constructors

* [**constructor](#constructor)

### Properties

* [**browserPerProxy](#browserPerProxy)
* [**browserPlugin](#browserPlugin)
* [**experimentalContainers](#experimentalContainers)
* [**fingerprint](#fingerprint)
* [**id](#id)
* [**launchOptions](#launchOptions)
* [**proxyTier](#proxyTier)
* [**useIncognitoPages](#useIncognitoPages)
* [**userDataDir](#userDataDir)

### Accessors

* [**proxyUrl](#proxyUrl)

### Methods

* [**extend](#extend)

## Constructors[**](#Constructors)

### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L85)constructor

* **new LaunchContext**(options): [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)

#### Parameters

* ##### options: [LaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/LaunchContextOptions.md)

#### Returns [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)

## Properties[**](#Properties)

### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L74)optional browserPerProxy

**browserPerProxy?**: boolean

### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L71)browserPlugin

**browserPlugin**: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)

### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L75)experimentalContainers

**experimentalContainers**: boolean

### [**](#fingerprint)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L82)optional fingerprint

**fingerprint?**: BrowserFingerprintWithHeaders

### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L70)optional id

**id?**: string

### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L72)launchOptions

**launchOptions**: LibraryOptions

### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L77)optional proxyTier

**proxyTier?**: number

### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L73)useIncognitoPages

**useIncognitoPages**: boolean

### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L76)userDataDir

**userDataDir**: string

## Accessors[**](#Accessors)

### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L131)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L148)proxyUrl

* **get proxyUrl**(): undefined | string
* **set proxyUrl**(url): void

- Returns the proxy URL of the browser.

***

#### Returns undefined | string

- Sets a proxy URL for the browser. Use `undefined` to unset an existing proxy URL.

***

#### Parameters

* ##### url: undefined | string

#### Returns void

## Methods[**](#Methods)

### [**](#extend)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L117)extend

* **extend**(fields): void - Extend the launch context with any extra fields. This is useful to keep state information relevant to the browser being launched. It ensures that no internal fields are overridden and should be used instead of property assignment.

***

#### Parameters

* ##### fields: T

#### Returns void

---

# PlaywrightBrowser

Browser wrapper created to have consistent API with persistent and non-persistent contexts.
### Hierarchy * EventEmitter * *PlaywrightBrowser* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\[asyncDispose\]](#\[asyncDispose]) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**browserType](#browserType) * [**close](#close) * [**contexts](#contexts) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**isConnected](#isConnected) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**newBrowserCDPSession](#newBrowserCDPSession) * [**newContext](#newContext) * [**newPage](#newPage) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**startTracing](#startTracing) * [**stopTracing](#stopTracing) * [**version](#version) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L19)constructor * ****new PlaywrightBrowser**(options): [PlaywrightBrowser](https://crawlee.dev/js/api/browser-pool/class/PlaywrightBrowser.md) - Overrides EventEmitter.constructor #### Parameters * ##### options: BrowserOptions #### Returns [PlaywrightBrowser](https://crawlee.dev/js/api/browser-pool/class/PlaywrightBrowser.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from EventEmitter.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from EventEmitter.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from EventEmitter.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. 
Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from EventEmitter.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#\[asyncDispose])[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L33)\[asyncDispose] * ****\[asyncDispose]**(): Promise\ - #### Returns Promise\ ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from EventEmitter.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from EventEmitter.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#browserType)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L58)browserType * ****browserType**(): BrowserType<{}> - #### Returns BrowserType<{}> ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L37)close * ****close**(): Promise\ - #### Returns Promise\ ### [**](#contexts)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L41)contexts * ****contexts**(): BrowserContext\[] - #### Returns BrowserContext\[] ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from EventEmitter.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from EventEmitter.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from EventEmitter.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. 
* **@since** v1.0.0 *** #### Returns number ### [**](#isConnected)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L45)isConnected * ****isConnected**(): boolean - #### Returns boolean ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from EventEmitter.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. * **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from EventEmitter.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#newBrowserCDPSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L70)newBrowserCDPSession * ****newBrowserCDPSession**(): Promise\ - #### Returns Promise\ ### [**](#newContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L66)newContext * ****newContext**(): Promise\ - #### Returns Promise\ ### [**](#newPage)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L62)newPage * ****newPage**(...args): Promise\ - #### Parameters * ##### rest...args: \[] #### Returns Promise\ ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from EventEmitter.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from EventEmitter.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. 
``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from EventEmitter.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from EventEmitter.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from EventEmitter.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. 
* ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from EventEmitter.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from EventEmitter.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from EventEmitter.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. 
// Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from EventEmitter.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#startTracing)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L74)startTracing * ****startTracing**(): Promise\ - #### Returns Promise\ ### [**](#stopTracing)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L78)stopTracing * ****stopTracing**(): Promise\ - #### Returns Promise\ ### [**](#version)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-browser.ts#L49)version * ****version**(): string - #### Returns string ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from EventEmitter.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. 
``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. ### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from EventEmitter.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from EventEmitter.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from EventEmitter.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. 
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from EventEmitter.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from EventEmitter.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from EventEmitter.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # PlaywrightController The `BrowserController` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerController` or `PlaywrightController` extend. Second, it defines the public interface of the specialized classes which provide only private methods. Therefore, we do not keep documentation for the specialized classes, because it's the same for all of them. 
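To make the role of the controller concrete, here is a minimal, hedged sketch of obtaining the controller that manages a page and calling a few of the methods documented below (`setCookies`, `getCookies`, `close`). It assumes a `BrowserPool.getBrowserControllerByPage()` helper and the `@crawlee/browser-pool` exports used earlier in this reference; the cookie shape is illustrative only.

```
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
});

const page = await browserPool.newPage();

// Look up the controller that manages the browser this page was opened in.
// getBrowserControllerByPage() is assumed from the BrowserPool API.
const controller = browserPool.getBrowserControllerByPage(page);

// The controller exposes library-agnostic helpers regardless of whether
// the underlying browser is driven by Playwright or Puppeteer.
await controller.setCookies(page, [
    { name: 'session', value: 'abc123', domain: 'example.com', path: '/' },
]);
console.log(await controller.getCookies(page));

await page.close();
await controller.close(); // graceful shutdown; kill() is the forceful variant
```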
### Hierarchy * [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\\[0], Browser> * *PlaywrightController* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**activePages](#activePages) * [**browser](#browser) * [**browserPlugin](#browserPlugin) * [**id](#id) * [**isActive](#isActive) * [**lastPageOpenedAt](#lastPageOpenedAt) * [**launchContext](#launchContext) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**totalPages](#totalPages) * [**defaultMaxListeners](#defaultMaxListeners) ### Methods * [**addListener](#addListener) * [**close](#close) * [**emit](#emit) * [**eventNames](#eventNames) * [**getCookies](#getCookies) * [**getMaxListeners](#getMaxListeners) * [**kill](#kill) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setCookies](#setCookies) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L91)constructor * ****new PlaywrightController**(browserPlugin): [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) - Inherited from BrowserController< BrowserType, SafeParameters\\[0], Browser >.constructor #### Parameters * ##### browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> #### Returns [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) ## Properties[**](#Properties) ### 
[**](#activePages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L73)inheritedactivePages **activePages: number = 0 Inherited from BrowserController.activePages ### [**](#browser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L52)inheritedbrowser **browser: Browser = ... Inherited from BrowserController.browser Browser representation of the underlying automation library. ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L47)inheritedbrowserPlugin **browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> Inherited from BrowserController.browserPlugin The `BrowserPlugin` instance used to launch the browser. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L42)inheritedid **id: string = ... Inherited from BrowserController.id ### [**](#isActive)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L71)inheritedisActive **isActive: boolean = false Inherited from BrowserController.isActive ### [**](#lastPageOpenedAt)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L77)inheritedlastPageOpenedAt **lastPageOpenedAt: number = ... 
Inherited from BrowserController.lastPageOpenedAt ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L57)inheritedlaunchContext **launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> = ... Inherited from BrowserController.launchContext The configuration the browser was launched with. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L63)optionalinheritedproxyTier **proxyTier? : number Inherited from BrowserController.proxyTier The proxy tier tied to this browser controller. `undefined` if no tiered proxy is used. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L69)optionalinheritedproxyUrl **proxyUrl? : string Inherited from BrowserController.proxyUrl The proxy URL used by the browser controller. This is set every time the browser controller uses proxy (even the tiered one). 
`undefined` if no proxy is used ### [**](#totalPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L75)inheritedtotalPages **totalPages: number = 0 Inherited from BrowserController.totalPages ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L10)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from BrowserController.defaultMaxListeners ## Methods[**](#Methods) ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L11)externalinheritedaddListener * ****addListener**\(event, listener): this - Inherited from BrowserController.addListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L131)inheritedclose * ****close**(): Promise\ - Inherited from BrowserController.close Gracefully closes the browser and makes sure there will be no lingering browser processes. Emits 'browserClosed' event. 
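A small, hedged sketch of a graceful shutdown built on `close()`. It assumes the `getBrowserControllerByPage()` helper used above and relies only on the `'browserClosed'` event name mentioned in the description; the listener signature is not shown.

```
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
});
const page = await browserPool.newPage();
const controller = browserPool.getBrowserControllerByPage(page); // assumed helper

// close() resolves once the browser has shut down and, per the description
// above, emits the 'browserClosed' event.
controller.once('browserClosed', () => console.log('browser closed'));

await page.close();
await controller.close();
```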
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L19)externalinheritedemit * ****emit**\(event, ...args): boolean - Inherited from BrowserController.emit #### Parameters * ##### externalevent: U * ##### externalrest...args: Parameters<[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U]> #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L20)externalinheritedeventNames * ****eventNames**\(): U\[] - Inherited from BrowserController.eventNames #### Returns U\[] ### [**](#getCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L181)inheritedgetCookies * ****getCookies**(page): Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> - Inherited from BrowserController.getCookies #### Parameters * ##### page: Page #### Returns Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L24)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from BrowserController.getMaxListeners #### Returns number ### [**](#kill)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L156)inheritedkill * ****kill**(): Promise\ - Inherited from BrowserController.kill Immediately kills the browser process. Emits 'browserClosed' event. 
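As a hedged sketch of how `kill()` can complement `close()`, the snippet below prefers the graceful path and hard-kills the browser process only if shutdown stalls. The 30-second timeout and the `getBrowserControllerByPage()` helper are assumptions, not part of this reference.

```
import { setTimeout as sleep } from 'node:timers/promises';
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
});
const page = await browserPool.newPage();
const controller = browserPool.getBrowserControllerByPage(page); // assumed helper

await page.close();
try {
    // Give the graceful close() 30 seconds before giving up on it.
    await Promise.race([
        controller.close(),
        sleep(30_000).then(() => Promise.reject(new Error('close() timed out'))),
    ]);
} catch {
    await controller.kill(); // hard-kills the browser process
}
```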
*** #### Returns Promise\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L21)externalinheritedlistenerCount * ****listenerCount**(type): number - Inherited from BrowserController.listenerCount #### Parameters * ##### externaltype: BROWSER\_CLOSED #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L22)externalinheritedlisteners * ****listeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? 
: null | { height: number; width: number } }, Page>\[U]\[] - Inherited from BrowserController.listeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U]\[] ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L18)externalinheritedoff * ****off**\(event, listener): this - Inherited from BrowserController.off #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; 
serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L17)externalinheritedon * ****on**\(event, listener): this - Inherited from BrowserController.on #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L16)externalinheritedonce * ****once**\(event, listener): this - Inherited from BrowserController.once #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: 
boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L12)externalinheritedprependListener * ****prependListener**\(event, listener): this - Inherited from BrowserController.prependListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L13)externalinheritedprependOnceListener * ****prependOnceListener**\(event, listener): this - 
Inherited from BrowserController.prependOnceListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L23)externalinheritedrawListeners * ****rawListeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>\[U]\[] - Inherited from BrowserController.rawListeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: ...; value: ... 
}\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U]\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L15)externalinheritedremoveAllListeners * ****removeAllListeners**(event): this - Inherited from BrowserController.removeAllListeners #### Parameters * ##### externaloptionalevent: BROWSER\_CLOSED #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L14)externalinheritedremoveListener * ****removeListener**\(event, listener): this - Inherited from BrowserController.removeListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page>\[U] #### Returns this ### [**](#setCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L177)inheritedsetCookies * ****setCookies**(page, cookies): Promise\ - Inherited from BrowserController.setCookies #### Parameters * ##### page: Page * ##### cookies: [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L25)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from BrowserController.setMaxListeners #### Parameters * ##### externaln: number #### Returns this --- # PlaywrightPlugin The `BrowserPlugin` serves two purposes. 
First, it is the base class that specialized controllers like `PuppeteerPlugin` or `PlaywrightPlugin` extend. Second, it allows the user to configure the automation libraries and feed them to [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) for use. ### Hierarchy * [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\\[0], PlaywrightBrowser> * *PlaywrightPlugin* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**\_containerProxyServer](#_containerProxyServer) * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launchOptions](#launchOptions) * [**library](#library) * [**name](#name) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ### Methods * [**createController](#createController) * [**createLaunchContext](#createLaunchContext) * [**launch](#launch) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L129)constructor * ****new PlaywrightPlugin**(library, options): [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) - Inherited from BrowserPlugin.constructor #### Parameters * ##### library: BrowserType<{}> * ##### options: [BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md)\ = {} #### Returns [PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md) ## Properties[**](#Properties) ### [**](#_containerProxyServer)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/playwright/playwright-plugin.ts#L42)optional\_containerProxyServer **\_containerProxyServer? : { ipToProxy: Map\; port: number; close: any } #### Type declaration * ##### ipToProxy: Map\ * ##### port: number * ##### close: function * ****close**(closeConnections): Promise\ *** * #### Parameters * ##### closeConnections: boolean #### Returns Promise\ ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L127)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from BrowserPlugin.browserPerProxy ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L125)inheritedexperimentalContainers **experimentalContainers: boolean Inherited from BrowserPlugin.experimentalContainers ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L117)inheritedlaunchOptions **launchOptions: undefined | LaunchOptions Inherited from BrowserPlugin.launchOptions ### [**](#library)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L115)inheritedlibrary **library: BrowserType<{}> Inherited from BrowserPlugin.library ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L113)inheritedname **name: string = ... Inherited from BrowserPlugin.name ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L119)optionalinheritedproxyUrl **proxyUrl? 
: string Inherited from BrowserPlugin.proxyUrl ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L123)inheriteduseIncognitoPages **useIncognitoPages: boolean Inherited from BrowserPlugin.useIncognitoPages ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L121)optionalinheriteduserDataDir **userDataDir? : string Inherited from BrowserPlugin.userDataDir ## Methods[**](#Methods) ### [**](#createController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L181)inheritedcreateController * ****createController**(): [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? 
: null | { height: number; width: number } }, Page> - Inherited from BrowserPlugin.createController #### Returns [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> ### [**](#createLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L154)inheritedcreateLaunchContext * ****createLaunchContext**(options): [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> - Inherited from BrowserPlugin.createLaunchContext Creates a `LaunchContext` with all the information needed to launch a browser. Aside from library specific launch options, it also includes internal properties used by `BrowserPool` for management of the pool and extra features. *** #### Parameters * ##### options: [CreateLaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> = {} #### Returns [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; 
server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> ### [**](#launch)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L188)inheritedlaunch * ****launch**(launchContext): Promise\ - Inherited from BrowserPlugin.launch Launches the browser using provided launch context. *** #### Parameters * ##### launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads?: boolean; baseURL?: string; bypassCSP?: boolean; clientCertificates?: { cert?: Buffer\; certPath?: string; key?: Buffer\; keyPath?: string; origin: string; passphrase?: string; pfx?: Buffer\; pfxPath?: string }\[]; colorScheme?: null | light | dark | no-preference; contrast?: null | no-preference | more; deviceScaleFactor?: number; extraHTTPHeaders?: {}; forcedColors?: null | active | none; geolocation?: { accuracy?: number; latitude: number; longitude: number }; hasTouch?: boolean; httpCredentials?: { origin?: string; password: string; send?: unauthorized | always; username: string }; ignoreHTTPSErrors?: boolean; isMobile?: boolean; javaScriptEnabled?: boolean; locale?: string; logger?: Logger; offline?: boolean; permissions?: string\[]; proxy?: { bypass?: string; password?: string; server: string; username?: string }; recordHar?: { content?: omit | embed | attach; mode?: full | minimal; omitContent?: boolean; path: string; urlFilter?: string | RegExp }; recordVideo?: { dir: string; size?: { height: number; width: number } }; reducedMotion?: null | reduce | no-preference; screen?: { height: number; width: number }; serviceWorkers?: allow | block; storageState?: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors?: boolean; timezoneId?: string; userAgent?: string; videoSize?: { height: number; width: number }; videosPath?: string; viewport?: null | { height: number; width: number } }, Page> = ... #### Returns Promise\ --- # PuppeteerController The `BrowserController` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerController` or `PlaywrightController` extend. Second, it defines the public interface of the specialized classes which provide only private methods. Therefore, we do not keep documentation for the specialized classes, because it's the same for all of them. 
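Because every specialized controller implements the same public `BrowserController` interface, usage does not depend on the underlying automation library. Below is a minimal sketch, assuming Playwright is installed and using a placeholder URL: it launches a browser through a `PlaywrightPlugin`, looks up the controller that owns a page, and calls the shared cookie and shutdown methods. The identical calls work with a `PuppeteerController`.

```ts
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';
import { chromium } from 'playwright';

// The pool launches browsers through the plugin and keeps one controller per browser.
const pool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
});

const page = await pool.newPage();
await page.goto('https://example.com'); // placeholder URL

// Every page is owned by a controller exposing the shared interface documented here.
const controller = pool.getBrowserControllerByPage(page);
console.log(await controller?.getCookies(page));

// close() shuts the browser down gracefully; kill() is the immediate variant.
await controller?.close();
await pool.destroy();
```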
### Hierarchy * [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ * *PuppeteerController* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**activePages](#activePages) * [**browser](#browser) * [**browserPlugin](#browserPlugin) * [**id](#id) * [**isActive](#isActive) * [**lastPageOpenedAt](#lastPageOpenedAt) * [**launchContext](#launchContext) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**totalPages](#totalPages) * [**defaultMaxListeners](#defaultMaxListeners) ### Methods * [**addListener](#addListener) * [**close](#close) * [**emit](#emit) * [**eventNames](#eventNames) * [**getCookies](#getCookies) * [**getMaxListeners](#getMaxListeners) * [**kill](#kill) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setCookies](#setCookies) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L91)constructor * ****new PuppeteerController**(browserPlugin): [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) - Inherited from BrowserController< typeof Puppeteer, PuppeteerTypes.LaunchOptions, PuppeteerTypes.Browser, PuppeteerNewPageOptions >.constructor #### Parameters * ##### browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ #### Returns [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) ## Properties[**](#Properties) ### [**](#activePages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L73)inheritedactivePages **activePages: number = 0 Inherited from BrowserController.activePages ### [**](#browser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L52)inheritedbrowser **browser: Browser = ... Inherited from BrowserController.browser Browser representation of the underlying automation library. ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L47)inheritedbrowserPlugin **browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ Inherited from BrowserController.browserPlugin The `BrowserPlugin` instance used to launch the browser. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L42)inheritedid **id: string = ... Inherited from BrowserController.id ### [**](#isActive)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L71)inheritedisActive **isActive: boolean = false Inherited from BrowserController.isActive ### [**](#lastPageOpenedAt)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L77)inheritedlastPageOpenedAt **lastPageOpenedAt: number = ... 
Inherited from BrowserController.lastPageOpenedAt ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L57)inheritedlaunchContext **launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ = ... Inherited from BrowserController.launchContext The configuration the browser was launched with. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L63)optionalinheritedproxyTier **proxyTier? : number Inherited from BrowserController.proxyTier The proxy tier tied to this browser controller. `undefined` if no tiered proxy is used. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L69)optionalinheritedproxyUrl **proxyUrl? : string Inherited from BrowserController.proxyUrl The proxy URL used by the browser controller. This is set every time the browser controller uses a proxy (even a tiered one). `undefined` if no proxy is used. ### [**](#totalPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L75)inheritedtotalPages **totalPages: number = 0 Inherited from BrowserController.totalPages ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L10)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from BrowserController.defaultMaxListeners ## Methods[**](#Methods) ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L11)externalinheritedaddListener * ****addListener**\(event, listener): this - Inherited from BrowserController.addListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L131)inheritedclose * ****close**(): Promise\ - Inherited from BrowserController.close Gracefully closes the browser and makes sure there will be no lingering browser processes. Emits 'browserClosed' event.
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L19)externalinheritedemit * ****emit**\(event, ...args): boolean - Inherited from BrowserController.emit #### Parameters * ##### externalevent: U * ##### externalrest...args: Parameters<[BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]> #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L20)externalinheritedeventNames * ****eventNames**\(): U\[] - Inherited from BrowserController.eventNames #### Returns U\[] ### [**](#getCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L181)inheritedgetCookies * ****getCookies**(page): Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> - Inherited from BrowserController.getCookies #### Parameters * ##### page: Page #### Returns Promise<[Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[]> ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L24)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from BrowserController.getMaxListeners #### Returns number ### [**](#kill)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L156)inheritedkill * ****kill**(): Promise\ - Inherited from BrowserController.kill Immediately kills the browser process. Emits 'browserClosed' event. *** #### Returns Promise\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L21)externalinheritedlistenerCount * ****listenerCount**(type): number - Inherited from BrowserController.listenerCount #### Parameters * ##### externaltype: BROWSER\_CLOSED #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L22)externalinheritedlisteners * ****listeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] - Inherited from BrowserController.listeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L18)externalinheritedoff * ****off**\(event, listener): this - Inherited from BrowserController.off #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L17)externalinheritedon * ****on**\(event, listener): this - Inherited from BrowserController.on #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L16)externalinheritedonce * ****once**\(event, listener): this - Inherited from BrowserController.once #### Parameters * ##### 
externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L12)externalinheritedprependListener * ****prependListener**\(event, listener): this - Inherited from BrowserController.prependListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L13)externalinheritedprependOnceListener * ****prependOnceListener**\(event, listener): this - Inherited from BrowserController.prependOnceListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L23)externalinheritedrawListeners * ****rawListeners**\(type): [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] - Inherited from BrowserController.rawListeners #### Parameters * ##### externaltype: U #### Returns [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U]\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L15)externalinheritedremoveAllListeners * ****removeAllListeners**(event): this - Inherited from BrowserController.removeAllListeners #### Parameters * ##### externaloptionalevent: BROWSER\_CLOSED #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L14)externalinheritedremoveListener * ****removeListener**\(event, listener): this - Inherited from BrowserController.removeListener #### Parameters * ##### externalevent: U * ##### externallistener: [BrowserControllerEvents](https://crawlee.dev/js/api/browser-pool/interface/BrowserControllerEvents.md)\\[U] #### Returns this ### [**](#setCookies)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L177)inheritedsetCookies * ****setCookies**(page, cookies): Promise\ - Inherited from BrowserController.setCookies #### Parameters * ##### page: Page * ##### cookies: [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/tiny-typed-emitter/src/index.d.ts#L25)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from BrowserController.setMaxListeners #### Parameters * ##### externaln: number #### Returns this --- # PuppeteerPlugin The `BrowserPlugin` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerPlugin` or `PlaywrightPlugin` extend. Second, it allows the user to configure the automation libraries and feed them to [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) for use. 
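A minimal configuration sketch, assuming Puppeteer is installed (the proxy URL is only a placeholder): the plugin pairs the automation library with default `BrowserPluginOptions`, and `BrowserPool` then uses it to launch and manage browsers.

```ts
import { BrowserPool, PuppeteerPlugin } from '@crawlee/browser-pool';
import puppeteer from 'puppeteer';

// Defaults set here apply to every browser the pool launches with this plugin.
const puppeteerPlugin = new PuppeteerPlugin(puppeteer, {
    launchOptions: { headless: true }, // forwarded to puppeteer.launch()
    useIncognitoPages: true,           // isolate each page in its own context
    // proxyUrl: 'http://user:pass@proxy.example.com:8000', // placeholder proxy
});

const pool = new BrowserPool({ browserPlugins: [puppeteerPlugin] });
const page = await pool.newPage();
await page.goto('https://crawlee.dev');
await pool.destroy();
```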
### Hierarchy * [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ * *PuppeteerPlugin* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launchOptions](#launchOptions) * [**library](#library) * [**name](#name) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ### Methods * [**createController](#createController) * [**createLaunchContext](#createLaunchContext) * [**launch](#launch) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L129)constructor * ****new PuppeteerPlugin**(library, options): [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) - Inherited from BrowserPlugin.constructor #### Parameters * ##### library: PuppeteerNode * ##### options: [BrowserPluginOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPluginOptions.md)\ = {} #### Returns [PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L127)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from BrowserPlugin.browserPerProxy ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L125)inheritedexperimentalContainers **experimentalContainers: boolean Inherited from BrowserPlugin.experimentalContainers ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L117)inheritedlaunchOptions **launchOptions: LaunchOptions Inherited from BrowserPlugin.launchOptions ### [**](#library)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L115)inheritedlibrary **library: PuppeteerNode Inherited from BrowserPlugin.library ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L113)inheritedname **name: string = ... Inherited from BrowserPlugin.name ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L119)optionalinheritedproxyUrl **proxyUrl? : string Inherited from BrowserPlugin.proxyUrl ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L123)inheriteduseIncognitoPages **useIncognitoPages: boolean Inherited from BrowserPlugin.useIncognitoPages ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L121)optionalinheriteduserDataDir **userDataDir? 
: string Inherited from BrowserPlugin.userDataDir ## Methods[**](#Methods) ### [**](#createController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L181)inheritedcreateController * ****createController**(): [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ - Inherited from BrowserPlugin.createController #### Returns [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ ### [**](#createLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L154)inheritedcreateLaunchContext * ****createLaunchContext**(options): [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ - Inherited from BrowserPlugin.createLaunchContext Creates a `LaunchContext` with all the information needed to launch a browser. Aside from library specific launch options, it also includes internal properties used by `BrowserPool` for management of the pool and extra features. *** #### Parameters * ##### options: [CreateLaunchContextOptions](https://crawlee.dev/js/api/browser-pool/interface/CreateLaunchContextOptions.md)\ = {} #### Returns [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ ### [**](#launch)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L188)inheritedlaunch * ****launch**(launchContext): Promise\ - Inherited from BrowserPlugin.launch Launches the browser using provided launch context. *** #### Parameters * ##### launchContext: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\ = ... #### Returns Promise\ --- # constBROWSER\_CONTROLLER\_EVENTS ## Index[**](#Index) ### Enumeration Members * [**BROWSER\_CLOSED](#BROWSER_CLOSED) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#BROWSER_CLOSED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L11)BROWSER\_CLOSED **BROWSER\_CLOSED: browserClosed --- # constBROWSER\_POOL\_EVENTS ## Index[**](#Index) ### Enumeration Members * [**BROWSER\_CLOSED](#BROWSER_CLOSED) * [**BROWSER\_LAUNCHED](#BROWSER_LAUNCHED) * [**BROWSER\_RETIRED](#BROWSER_RETIRED) * [**PAGE\_CLOSED](#PAGE_CLOSED) * [**PAGE\_CREATED](#PAGE_CREATED) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#BROWSER_CLOSED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L4)BROWSER\_CLOSED **BROWSER\_CLOSED: browserClosed ### [**](#BROWSER_LAUNCHED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L2)BROWSER\_LAUNCHED **BROWSER\_LAUNCHED: browserLaunched ### [**](#BROWSER_RETIRED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L3)BROWSER\_RETIRED **BROWSER\_RETIRED: browserRetired ### [**](#PAGE_CLOSED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L7)PAGE\_CLOSED **PAGE\_CLOSED: pageClosed ### [**](#PAGE_CREATED)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/events.ts#L6)PAGE\_CREATED **PAGE\_CREATED: pageCreated --- # BrowserName ## Index[**](#Index) ### Enumeration Members * [**chrome](#chrome) * [**edge](#edge) * [**firefox](#firefox) * [**safari](#safari) ## Enumeration Members[**](<#Enumeration Members>) ### 
[**](#chrome)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L24)chrome **chrome: chrome ### [**](#edge)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L27)edge **edge: edge ### [**](#firefox)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L25)firefox **firefox: firefox ### [**](#safari)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L26)safari **safari: safari --- # constDeviceCategory ## Index[**](#Index) ### Enumeration Members * [**desktop](#desktop) * [**mobile](#mobile) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#desktop)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L72)desktop **desktop: desktop Describes desktop computers and laptops. These devices usually have larger, horizontal screens and load full-sized versions of websites. ### [**](#mobile)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L68)mobile **mobile: mobile Describes mobile devices (mobile phones, tablets...). These devices usually have smaller, vertical screens and load lighter versions of websites. > Note: Generating `android` and `ios` devices will not work without setting the device to `mobile` first. --- # constOperatingSystemsName ## Index[**](#Index) ### Enumeration Members * [**android](#android) * [**ios](#ios) * [**linux](#linux) * [**macos](#macos) * [**windows](#windows) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#android)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L56)android **android: android `android` is (mostly) a mobile operating system. You can use this option only together with the `mobile` device category. ### [**](#ios)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L60)ios **ios: ios `ios` is a mobile operating system. You can use this option only together with the `mobile` device category. 
### [**](#linux)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L50)linux **linux: linux ### [**](#macos)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L51)macos **macos: macos ### [**](#windows)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L52)windows **windows: windows --- # BrowserControllerEvents \ ## Index[**](#Index) ### Properties * [**browserClosed](#browserClosed) ## Properties[**](#Properties) ### [**](#browserClosed)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-controller.ts#L22)browserClosed **browserClosed: (controller) => void #### Type declaration * * **(controller): void - #### Parameters * ##### controller: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\ #### Returns void --- # BrowserPluginOptions \ ### Hierarchy * *BrowserPluginOptions* * [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launchOptions](#launchOptions) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L84)optionalbrowserPerProxy **browserPerProxy? : boolean If set to `true`, the crawler respects the proxy url generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L73)optionalexperimentalContainersexperimental **experimentalContainers? : boolean Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L54)optionallaunchOptions **launchOptions? : LibraryOptions Options that will be passed down to the automation library. E.g. `puppeteer.launch(launchOptions);`. This is a good place to set options that you want to apply as defaults. To dynamically override those options per-browser, see the `preLaunchHooks` of [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md). ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L60)optionalproxyUrl **proxyUrl? : string Automation libraries configure proxies differently. This helper allows you to set a proxy URL without worrying about specific implementations. It also allows you to use an authenticated proxy without extra code. ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L67)optionaluseIncognitoPages **useIncognitoPages? : boolean = false By default, pages share the same browser context. If set to `true`, each page uses its own context that is destroyed once the page is closed or crashes.
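For illustration only, here is a minimal sketch of how these plugin options might be combined when creating a `PlaywrightPlugin` (assuming Playwright is installed; the proxy URL and launch options are placeholders, not recommended values):

```
import { chromium } from 'playwright';
import { PlaywrightPlugin } from '@crawlee/browser-pool';

// A sketch only: launchOptions are passed straight to chromium.launch(),
// proxyUrl is applied to every browser launched by this plugin, and
// useIncognitoPages gives each page its own short-lived browser context.
const plugin = new PlaywrightPlugin(chromium, {
    launchOptions: { headless: true },
    proxyUrl: 'http://user:password@proxy.example.com:8000', // placeholder
    useIncognitoPages: true,
});
```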
### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L77)optionaluserDataDir **userDataDir? : string Path to a User Data Directory, which stores browser session data like cookies and local storage. --- # BrowserPoolEvents \ ## Index[**](#Index) ### Properties * [**browserLaunched](#browserLaunched) * [**browserRetired](#browserRetired) * [**pageClosed](#pageClosed) * [**pageCreated](#pageCreated) ## Properties[**](#Properties) ### [**](#browserLaunched)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L33)browserLaunched **browserLaunched: (browserController) => void | Promise\ #### Type declaration * * **(browserController): void | Promise\ - #### Parameters * ##### browserController: BC #### Returns void | Promise\ ### [**](#browserRetired)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L32)browserRetired **browserRetired: (browserController) => void | Promise\ #### Type declaration * * **(browserController): void | Promise\ - #### Parameters * ##### browserController: BC #### Returns void | Promise\ ### [**](#pageClosed)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L31)pageClosed **pageClosed: (page) => void | Promise\ #### Type declaration * * **(page): void | Promise\ - #### Parameters * ##### page: Page #### Returns void | Promise\ ### [**](#pageCreated)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L30)pageCreated **pageCreated: (page) => void | Promise\ #### Type declaration * * **(page): void | Promise\ - #### Parameters * ##### page: Page #### Returns void | Promise\ --- # BrowserPoolHooks \ ## Index[**](#Index) ### Properties * [**postLaunchHooks](#postLaunchHooks) * [**postPageCloseHooks](#postPageCloseHooks) * [**postPageCreateHooks](#postPageCreateHooks) * [**preLaunchHooks](#preLaunchHooks) * [**prePageCloseHooks](#prePageCloseHooks) * [**prePageCreateHooks](#prePageCreateHooks) ## Properties[**](#Properties) ### [**](#postLaunchHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L212)optionalpostLaunchHooks **postLaunchHooks? : [PostLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PostLaunchHook)\\[] Post-launch hooks are executed as soon as a browser is launched. The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) To guarantee order of execution before other hooks in the same browser, the [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) methods cannot be used until the post-launch hooks complete. If you attempt to call `await browserController.close()` from a post-launch hook, it will deadlock the process. This API is subject to change. ### [**](#postPageCloseHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L245)optionalpostPageCloseHooks **postPageCloseHooks? : [PostPageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCloseHook)\\[] Post-page-close hooks allow you to do page related clean up. 
The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) ### [**](#postPageCreateHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L231)optionalpostPageCreateHooks **postPageCreateHooks? : [PostPageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PostPageCreateHook)\\[] Post-page-create hooks are called right after a new page is created and all internal actions of Browser Pool are completed. This is the place to make changes to a page that you would like to apply to all pages. Such as injecting a JavaScript library into all pages. The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) ### [**](#preLaunchHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L202)optionalpreLaunchHooks **preLaunchHooks? : [PreLaunchHook](https://crawlee.dev/js/api/browser-pool.md#PreLaunchHook)\\[] Pre-launch hooks are executed just before a browser is launched and provide a good opportunity to dynamically change the launch options. The hooks are called with two arguments: `pageId`: `string` and `launchContext`: [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md) ### [**](#prePageCloseHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L239)optionalprePageCloseHooks **prePageCloseHooks? : [PrePageCloseHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCloseHook)\\[] Pre-page-close hooks give you the opportunity to make last second changes in a page that's about to be closed, such as saving a snapshot or updating state. The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) ### [**](#prePageCreateHooks)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L222)optionalprePageCreateHooks **prePageCreateHooks? : [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook)\\[0]>\[] Pre-page-create hooks are executed just before a new page is created. They are useful to make dynamic changes to the browser before opening a page. The hooks are called with three arguments: `pageId`: `string`, `browserController`: [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md) and `pageOptions`: `object|undefined` - This only works if the underlying `BrowserController` supports new page options. So far, new page options are only supported by `PlaywrightController` in incognito contexts. If the page options are not supported by `BrowserController` the `pageOptions` argument is `undefined`. --- # BrowserPoolNewPageInNewBrowserOptions \ ## Index[**](#Index) ### Properties * [**browserPlugin](#browserPlugin) * [**id](#id) * [**launchOptions](#launchOptions) * [**pageOptions](#pageOptions) ## Properties[**](#Properties) ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L912)optionalbrowserPlugin **browserPlugin? : BP Provide a plugin to launch the browser. If none is provided, one of the pool's available plugins will be used. 
If you configured `BrowserPool` to rotate multiple libraries, such as both Puppeteer and Playwright, you should always set the `browserPlugin` when using the `launchOptions` option. The plugin will not be added to the list of plugins used by the pool. You can either use one of those to launch a specific browser, or provide a completely new configuration. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L894)optionalid **id? : string Assign a custom ID to the page. If you don't, a random string ID will be generated. ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L916)optionallaunchOptions **launchOptions? : BP\[launchOptions] Options that will be used to launch the new browser. ### [**](#pageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L899)optionalpageOptions **pageOptions? : PageOptions Some libraries (Playwright) allow you to open new pages with specific options. Use this property to set those options. --- # BrowserPoolNewPageOptions \ ## Index[**](#Index) ### Properties * [**browserPlugin](#browserPlugin) * [**id](#id) * [**pageOptions](#pageOptions) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) ## Properties[**](#Properties) ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L878)optionalbrowserPlugin **browserPlugin? : BP Choose a plugin to open the page with. If none is provided, one of the pool's available plugins will be used. It must be one of the plugins the browser pool was created with. If you wish to start a browser with a different configuration, see the `newPageInNewBrowser` function. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L864)optionalid **id? : string Assign a custom ID to the page. If you don't, a random string ID will be generated. ### [**](#pageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L869)optionalpageOptions **pageOptions? : PageOptions Some libraries (Playwright) allow you to open new pages with specific options. Use this property to set those options. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L886)optionalproxyTier **proxyTier? : number Proxy tier. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L882)optionalproxyUrl **proxyUrl? : string Proxy URL. --- # BrowserPoolOptions \ ## Index[**](#Index) ### Properties * [**browserPlugins](#browserPlugins) * [**closeInactiveBrowserAfterSecs](#closeInactiveBrowserAfterSecs) * [**fingerprintOptions](#fingerprintOptions) * [**maxOpenPagesPerBrowser](#maxOpenPagesPerBrowser) * [**operationTimeoutSecs](#operationTimeoutSecs) * [**retireBrowserAfterPageCount](#retireBrowserAfterPageCount) * [**retireInactiveBrowserAfterSecs](#retireInactiveBrowserAfterSecs) * [**useFingerprints](#useFingerprints) ## Properties[**](#Properties) ### [**](#browserPlugins)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L67)browserPlugins **browserPlugins: readonly Plugin\[] Browser plugins are wrappers of browser automation libraries that allow `BrowserPool` to control browsers with those libraries. `browser-pool` comes with a `PuppeteerPlugin` and a `PlaywrightPlugin`.
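As a rough sketch (not an authoritative recipe, and assuming Playwright is installed), plugins and the sizing options below might be wired into a pool like this; the pre-launch hook, described in the hooks section above, simply adjusts the launch options:

```
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
    maxOpenPagesPerBrowser: 20,
    retireBrowserAfterPageCount: 100,
    // Pre-launch hooks receive (pageId, launchContext) and may adjust launch options.
    preLaunchHooks: [
        async (pageId, launchContext) => {
            launchContext.launchOptions = { ...launchContext.launchOptions, headless: true };
        },
    ],
});

const page = await browserPool.newPage();
// ... work with the page, then shut the pool down.
await browserPool.destroy();
```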
### [**](#closeInactiveBrowserAfterSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L101)optionalcloseInactiveBrowserAfterSecs **closeInactiveBrowserAfterSecs? : number = 300 Browsers normally close immediately after their last page is processed. However, there could be situations where this does not happen. Browser Pool makes sure all inactive browsers are closed regularly, to free resources. ### [**](#fingerprintOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L116)optionalfingerprintOptions **fingerprintOptions? : [FingerprintOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintOptions.md) ### [**](#maxOpenPagesPerBrowser)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L74)optionalmaxOpenPagesPerBrowser **maxOpenPagesPerBrowser? : number = 20 Sets the maximum number of pages that can be open in a browser at the same time. Once reached, a new browser will be launched to handle the excess. ### [**](#operationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L93)optionaloperationTimeoutSecs **operationTimeoutSecs? : number = 15 As we know from experience, async operations of the underlying libraries, such as launching a browser or opening a new page, can get stuck. To prevent `BrowserPool` from getting stuck, we add a timeout to those operations and you can configure it with this option. ### [**](#retireBrowserAfterPageCount)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L84)optionalretireBrowserAfterPageCount **retireBrowserAfterPageCount? : number = 100 Browsers tend to get bloated after processing a lot of pages. This option configures the maximum number of processed pages after which the browser will automatically retire and close. A new browser will launch in its place. The browser might be retired sooner if the connected [Session](https://crawlee.dev/js/api/core/class/Session.md) is retired. You can change session retirement behavior using [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). ### [**](#retireInactiveBrowserAfterSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L111)optionalretireInactiveBrowserAfterSecs **retireInactiveBrowserAfterSecs? : number = 10 Browsers are marked as retired after they have been inactive for a certain amount of time. This option sets the interval at which the browsers are checked and retired if they are inactive. Retired browsers are closed after all their pages are closed. ### [**](#useFingerprints)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L115)optionaluseFingerprints **useFingerprints? : boolean = true --- # BrowserSpecification ## Index[**](#Index) ### Properties * [**httpVersion](#httpVersion) * [**maxVersion](#maxVersion) * [**minVersion](#minVersion) * [**name](#name) ## Properties[**](#Properties) ### [**](#httpVersion)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L46)optionalhttpVersion **httpVersion? : 1 | 2 HTTP version to be used for header generation (the headers differ depending on the version). ### [**](#maxVersion)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L42)optionalmaxVersion **maxVersion? : number Maximum version of browser used. 
### [**](#minVersion)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L38)optionalminVersion **minVersion? : number Minimum version of browser used. ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L34)name **name: [BrowserName](https://crawlee.dev/js/api/browser-pool/enum/BrowserName.md) String representing the browser name. --- # CommonLibrary Each plugin expects an instance of the object with the `.launch()` property. For Puppeteer, it is the `puppeteer` module itself, whereas for Playwright it is one of the browser types, such as `playwright.chromium`. `BrowserPlugin` does not include the library, so you can choose any version or fork of the library. It also keeps the `browser-pool` installation small. ## Index[**](#Index) ### Properties * [**name](#name) * [**product](#product) ### Methods * [**launch](#launch) ## Properties[**](#Properties) ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L33)optionalname **name? : () => string #### Type declaration * * **(): string - #### Returns string ### [**](#product)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L31)optionalproduct **product? : string ## Methods[**](#Methods) ### [**](#launch)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/abstract-classes/browser-plugin.ts#L32)launch * ****launch**(opts): Promise\ - #### Parameters * ##### optionalopts: Dictionary #### Returns Promise\ --- # CreateLaunchContextOptions \ ### Hierarchy * Partial\, browserPlugin>> * *CreateLaunchContextOptions* ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**id](#id) * [**launchOptions](#launchOptions) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L43)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from Partial.browserPerProxy If set to `true`, the crawler respects the proxy url generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L54)optionalinheritedexperimentalContainersexperimental **experimentalContainers? : boolean Inherited from Partial.experimentalContainers Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L27)optionalinheritedid **id? : string Inherited from Partial.id To make identification of `LaunchContext` easier, `BrowserPool` assigns the `LaunchContext` an `id` that's equal to the `id` of the page that triggered the browser launch. This is useful, because many pages share a single launch context (single browser). ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L36)optionalinheritedlaunchOptions **launchOptions?
: LibraryOptions Inherited from Partial.launchOptions The actual options the browser was launched with, after changes. Those changes would be typically made in pre-launch hooks. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L60)optionalinheritedproxyTier **proxyTier? : number Inherited from Partial.proxyTier ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L59)optionalinheritedproxyUrl **proxyUrl? : string Inherited from Partial.proxyUrl ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L48)optionalinheriteduseIncognitoPages **useIncognitoPages? : boolean Inherited from Partial.useIncognitoPages By default pages share the same browser context. If set to `true` each page uses its own context that is destroyed once the page is closed or crashes. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L58)optionalinheriteduserDataDir **userDataDir? : string Inherited from Partial.userDataDir Path to a User Data Directory, which stores browser session data like cookies and local storage. --- # FingerprintGenerator ## Index[**](#Index) ### Properties * [**getFingerprint](#getFingerprint) ## Properties[**](#Properties) ### [**](#getFingerprint)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L7)getFingerprint **getFingerprint: (fingerprintGeneratorOptions) => [GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) #### Type declaration * * **(fingerprintGeneratorOptions): [GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) - #### Parameters * ##### optionalfingerprintGeneratorOptions: [FingerprintGeneratorOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGeneratorOptions.md) #### Returns [GetFingerprintReturn](https://crawlee.dev/js/api/browser-pool/interface/GetFingerprintReturn.md) --- # FingerprintGeneratorOptions ### Hierarchy * Partial\ * *FingerprintGeneratorOptions* ## Index[**](#Index) ### Properties * [**browserListQuery](#browserListQuery) * [**browsers](#browsers) * [**devices](#devices) * [**httpVersion](#httpVersion) * [**locales](#locales) * [**mockWebRTC](#mockWebRTC) * [**operatingSystems](#operatingSystems) * [**screen](#screen) * [**slim](#slim) * [**strict](#strict) ## Properties[**](#Properties) ### [**](#browserListQuery)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L66)externaloptionalinheritedbrowserListQuery **browserListQuery? : string Inherited from Partial.browserListQuery Browser generation query based on the real world data. For more info see the [query docs](https://github.com/browserslist/browserslist#full-list). If `browserListQuery` is passed the `browsers` array is ignored. ### [**](#browsers)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L60)externaloptionalinheritedbrowsers **browsers? : BrowsersType Inherited from Partial.browsers List of BrowserSpecifications to generate the headers for, or one of `chrome`, `edge`, `firefox` and `safari`. ### [**](#devices)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L74)externaloptionalinheriteddevices **devices? 
: (desktop | mobile)\[] Inherited from Partial.devices List of devices to generate the headers for. ### [**](#httpVersion)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L85)externaloptionalinheritedhttpVersion **httpVersion? : 1 | 2 Inherited from Partial.httpVersion Http version to be used to generate headers (the headers differ depending on the version). Can be either 1 or 2. Default value is 2. ### [**](#locales)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L80)externaloptionalinheritedlocales **locales? : string\[] Inherited from Partial.locales List of at most 10 languages to include in the [Accept-Language](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language) request header in the language format accepted by that header, for example `en`, `en-US` or `de`. ### [**](#mockWebRTC)[**](https://undefined/apify/crawlee/blob/master/node_modules/fingerprint-generator/fingerprint-generator.d.ts#L99)externaloptionalinheritedmockWebRTC **mockWebRTC? : boolean Inherited from Partial.mockWebRTC ### [**](#operatingSystems)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L70)externaloptionalinheritedoperatingSystems **operatingSystems? : (windows | macos | linux | android | ios)\[] Inherited from Partial.operatingSystems List of operating systems to generate the headers for. ### [**](#screen)[**](https://undefined/apify/crawlee/blob/master/node_modules/fingerprint-generator/fingerprint-generator.d.ts#L93)externaloptionalinheritedscreen **screen? : { maxHeight? : number; maxWidth? : number; minHeight? : number; minWidth? : number } Inherited from Partial.screen Defines the screen dimensions of the generated fingerprint. **Note:** Using this option can lead to a substantial performance drop (\~0.0007s/fingerprint -> \~0.03s/fingerprint) *** #### Type declaration * ##### externaloptionalmaxHeight?: number * ##### externaloptionalmaxWidth?: number * ##### externaloptionalminHeight?: number * ##### externaloptionalminWidth?: number ### [**](#slim)[**](https://undefined/apify/crawlee/blob/master/node_modules/fingerprint-generator/fingerprint-generator.d.ts#L106)externaloptionalinheritedslim **slim? : boolean Inherited from Partial.slim Enables the slim mode for the fingerprint injection. This disables some performance-heavy evasions, but might decrease benchmark scores. Try enabling this if you are experiencing performance issues with the fingerprint injection. ### [**](#strict)[**](https://undefined/apify/crawlee/blob/master/node_modules/header-generator/header-generator.d.ts#L91)externaloptionalinheritedstrict **strict? : boolean Inherited from Partial.strict If true, the generator will throw an error if it cannot generate headers based on the input. By default (strict: false), the generator will try to relax some requirements and generate headers based on the relaxed input. --- # FingerprintOptions Settings for the fingerprint generator and virtual session management system. > To set the specific fingerprint generation options (operating system, device type, screen dimensions), use the `fingerprintGeneratorOptions` property. 
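To make the relationship between these settings concrete, the following is a minimal sketch (values chosen purely for illustration, assuming Playwright is installed) of passing `fingerprintOptions` to `BrowserPool`; note the `mobile`/`android` pairing required by the device and operating-system enums described earlier:

```
import { chromium } from 'playwright';
import { BrowserPool, PlaywrightPlugin } from '@crawlee/browser-pool';

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(chromium)],
    useFingerprints: true, // the default
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            // 'android' and 'ios' only work together with the 'mobile' device category.
            devices: ['mobile'],
            operatingSystems: ['android'],
            browsers: ['chrome'],
            locales: ['en-US'],
        },
        useFingerprintCache: true,
        fingerprintCacheSize: 10000,
    },
});
```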
## Index[**](#Index) ### Properties * [**fingerprintCacheSize](#fingerprintCacheSize) * [**fingerprintGeneratorOptions](#fingerprintGeneratorOptions) * [**useFingerprintCache](#useFingerprintCache) ## Properties[**](#Properties) ### [**](#fingerprintCacheSize)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L58)optionalfingerprintCacheSize **fingerprintCacheSize? : number = 10000 The maximum number of fingerprints that can be stored in the cache. Only relevant if `useFingerprintCache` is set to `true`. ### [**](#fingerprintGeneratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L45)optionalfingerprintGeneratorOptions **fingerprintGeneratorOptions? : [FingerprintGeneratorOptions](https://crawlee.dev/js/api/browser-pool/interface/FingerprintGeneratorOptions.md) Customizes the fingerprint generation by setting e.g. the device type, operating system or screen size. ### [**](#useFingerprintCache)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/browser-pool.ts#L51)optionaluseFingerprintCache **useFingerprintCache? : boolean = true Enables the virtual session management system. This ties every Crawlee session with a specific browser fingerprint, so your scraping activity seems more natural to the target website. --- # GetFingerprintReturn ## Index[**](#Index) ### Properties * [**fingerprint](#fingerprint) ## Properties[**](#Properties) ### [**](#fingerprint)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/fingerprinting/types.ts#L11)fingerprint **fingerprint: BrowserFingerprintWithHeaders --- # LaunchContextOptions \ `LaunchContext` holds information about the launched browser. It's useful to retrieve the `launchOptions`, the proxy the browser was launched with or any other information user chose to add to the `LaunchContext` by calling its `extend` function. This is very useful to keep track of browser-scoped values, such as session IDs. ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**browserPlugin](#browserPlugin) * [**experimentalContainers](#experimentalContainers) * [**id](#id) * [**launchOptions](#launchOptions) * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) * [**useIncognitoPages](#useIncognitoPages) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L43)optionalbrowserPerProxy **browserPerProxy? : boolean If set to `true`, the crawler respects the proxy url generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#browserPlugin)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L31)browserPlugin **browserPlugin: [BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)\ The `BrowserPlugin` instance used to launch the browser. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L54)optionalexperimentalContainersexperimental **experimentalContainers? : boolean Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. 
### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L27)optionalid **id? : string To make identification of `LaunchContext` easier, `BrowserPool` assigns the `LaunchContext` an `id` that's equal to the `id` of the page that triggered the browser launch. This is useful, because many pages share a single launch context (single browser). ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L36)launchOptions **launchOptions: LibraryOptions The actual options the browser was launched with, after changes. Those changes would be typically made in pre-launch hooks. ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L60)optionalproxyTier **proxyTier? : number ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L59)optionalproxyUrl **proxyUrl? : string ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L48)optionaluseIncognitoPages **useIncognitoPages? : boolean By default, pages share the same browser context. If set to `true`, each page uses its own context that is destroyed once the page is closed or crashes. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-pool/src/launch-context.ts#L58)optionaluserDataDir **userDataDir? : string Path to a User Data Directory, which stores browser session data like cookies and local storage. --- # @crawlee/cheerio Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [cheerio](https://www.npmjs.com/package/cheerio) HTML parser. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `CheerioCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless browser. `CheerioCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [Cheerio](https://www.npmjs.com/package/cheerio) and then invokes the user-provided [CheerioCrawlerOptions.requestHandler](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestHandler) to extract page data using a [jQuery](https://jquery.com/)-like interface to the parsed HTML DOM. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [CheerioCrawlerOptions.requestList](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestList) or [CheerioCrawlerOptions.requestQueue](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestQueue) constructor options, respectively.
If both [CheerioCrawlerOptions.requestList](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestList) and [CheerioCrawlerOptions.requestQueue](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, `CheerioCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [CheerioCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [CheerioCrawlerOptions.requestHandler](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `CheerioCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `CheerioCrawler` constructor. ## Example usage[​](#example-usage "Direct link to Example usage") ``` const crawler = new CheerioCrawler({ requestList, async requestHandler({ request, response, body, contentType, $ }) { const data = []; // Do some data extraction from the page with Cheerio. $('.some-collection').each((index, el) => { data.push({ title: $(el).find('.some-title').text() }); }); // Save the data to dataset. 
await Dataset.pushData({ url: request.url, html: body, data, }) }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ## Index[**](#Index) ### Crawlers * [**CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/cheerio-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/cheerio-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/cheerio-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/cheerio-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/cheerio-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/cheerio-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/cheerio-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/cheerio-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/cheerio-crawler.md#BLOCKED_STATUS_CODES) * [**ByteCounterStream](https://crawlee.dev/js/api/cheerio-crawler.md#ByteCounterStream) * [**checkStorageAccess](https://crawlee.dev/js/api/cheerio-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/cheerio-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/cheerio-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/cheerio-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/cheerio-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/cheerio-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/cheerio-crawler.md#CreateContextOptions) * [**createFileRouter](https://crawlee.dev/js/api/cheerio-crawler.md#createFileRouter) * [**createHttpRouter](https://crawlee.dev/js/api/cheerio-crawler.md#createHttpRouter) * [**CreateSession](https://crawlee.dev/js/api/cheerio-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/cheerio-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/cheerio-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetMapper) * 
[**DatasetOptions](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/cheerio-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/cheerio-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/cheerio-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/cheerio-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/cheerio-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/cheerio-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/cheerio-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/cheerio-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/cheerio-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/cheerio-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/cheerio-crawler.md#EventTypeName) * [**FileDownload](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownload) * [**FileDownloadCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadCrawlingContext) * [**FileDownloadErrorHandler](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadErrorHandler) * [**FileDownloadHook](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadHook) * [**FileDownloadOptions](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadOptions) * [**FileDownloadRequestHandler](https://crawlee.dev/js/api/cheerio-crawler.md#FileDownloadRequestHandler) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/cheerio-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/cheerio-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/cheerio-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/cheerio-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/cheerio-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/cheerio-crawler.md#GotScrapingHttpClient) * [**HttpCrawler](https://crawlee.dev/js/api/cheerio-crawler.md#HttpCrawler) * [**HttpCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#HttpCrawlerOptions) * [**HttpCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#HttpCrawlingContext) * [**HttpErrorHandler](https://crawlee.dev/js/api/cheerio-crawler.md#HttpErrorHandler) * [**HttpHook](https://crawlee.dev/js/api/cheerio-crawler.md#HttpHook) * [**HttpRequest](https://crawlee.dev/js/api/cheerio-crawler.md#HttpRequest) * [**HttpRequestHandler](https://crawlee.dev/js/api/cheerio-crawler.md#HttpRequestHandler) * [**HttpRequestOptions](https://crawlee.dev/js/api/cheerio-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/cheerio-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/cheerio-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/cheerio-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/cheerio-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/cheerio-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/cheerio-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/cheerio-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/cheerio-crawler.md#KeyValueStoreOptions) * 
[**LoadedRequest](https://crawlee.dev/js/api/cheerio-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/cheerio-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/cheerio-crawler.md#log) * [**Log](https://crawlee.dev/js/api/cheerio-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/cheerio-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/cheerio-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/cheerio-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/cheerio-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/cheerio-crawler.md#MAX_POOL_SIZE) * [**MinimumSpeedStream](https://crawlee.dev/js/api/cheerio-crawler.md#MinimumSpeedStream) * [**NonRetryableError](https://crawlee.dev/js/api/cheerio-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/cheerio-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/cheerio-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/cheerio-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/cheerio-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/cheerio-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/cheerio-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/cheerio-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/cheerio-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/cheerio-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/cheerio-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/cheerio-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/cheerio-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/cheerio-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/cheerio-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/cheerio-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/cheerio-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/cheerio-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/cheerio-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/cheerio-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/cheerio-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/cheerio-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/cheerio-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/cheerio-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/cheerio-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestOptions) * 
[**RequestProvider](https://crawlee.dev/js/api/cheerio-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/cheerio-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/cheerio-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/cheerio-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/cheerio-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/cheerio-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/cheerio-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/cheerio-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/cheerio-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/cheerio-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/cheerio-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/cheerio-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/cheerio-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/cheerio-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/cheerio-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/cheerio-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/cheerio-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/cheerio-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/cheerio-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/cheerio-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/cheerio-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/cheerio-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/cheerio-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/cheerio-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/cheerio-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/cheerio-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/cheerio-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/cheerio-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/cheerio-crawler.md#StorageManagerOptions) * [**StreamHandlerContext](https://crawlee.dev/js/api/cheerio-crawler.md#StreamHandlerContext) * 
[**StreamingHttpResponse](https://crawlee.dev/js/api/cheerio-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/cheerio-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/cheerio-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/cheerio-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/cheerio-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/cheerio-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/cheerio-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/cheerio-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/cheerio-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/cheerio-crawler.md#withCheckedStorageAccess) * [**CheerioCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md) * [**CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md) * [**CheerioErrorHandler](https://crawlee.dev/js/api/cheerio-crawler.md#CheerioErrorHandler) * [**CheerioHook](https://crawlee.dev/js/api/cheerio-crawler.md#CheerioHook) * [**CheerioRequestHandler](https://crawlee.dev/js/api/cheerio-crawler.md#CheerioRequestHandler) * [**createCheerioRouter](https://crawlee.dev/js/api/cheerio-crawler/function/createCheerioRouter.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### 
[**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#ByteCounterStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L116)ByteCounterStream Re-exports [ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### 
[**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#createFileRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L304)createFileRouter Re-exports [createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) ### [**](#createHttpRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L1068)createHttpRouter Re-exports [createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### 
[**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#FileDownload)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L184)FileDownload Re-exports [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) ### [**](#FileDownloadCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L52)FileDownloadCrawlingContext Re-exports [FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) ### [**](#FileDownloadErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L20)FileDownloadErrorHandler Re-exports [FileDownloadErrorHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadErrorHandler) ### 
[**](#FileDownloadHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L47)FileDownloadHook Re-exports [FileDownloadHook](https://crawlee.dev/js/api/http-crawler.md#FileDownloadHook) ### [**](#FileDownloadOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L34)FileDownloadOptions Re-exports [FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions) ### [**](#FileDownloadRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L57)FileDownloadRequestHandler Re-exports [FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L330)HttpCrawler Re-exports [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ### [**](#HttpCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L80)HttpCrawlerOptions Re-exports [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) ### [**](#HttpCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L255)HttpCrawlingContext Re-exports [HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) ### [**](#HttpErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L75)HttpErrorHandler Re-exports [HttpErrorHandler](https://crawlee.dev/js/api/http-crawler.md#HttpErrorHandler) ### [**](#HttpHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L194)HttpHook Re-exports [HttpHook](https://crawlee.dev/js/api/http-crawler.md#HttpHook) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports 
[HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L258)HttpRequestHandler Re-exports [HttpRequestHandler](https://crawlee.dev/js/api/http-crawler.md#HttpRequestHandler) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports 
[LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#MinimumSpeedStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L71)MinimumSpeedStream Re-exports [MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### 
[**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### 
[**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### 
[**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports 
[SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### 
[**](#StreamHandlerContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L25)StreamHandlerContext Re-exports [StreamHandlerContext](https://crawlee.dev/js/api/http-crawler.md#StreamHandlerContext) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#CheerioErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L26)CheerioErrorHandler **CheerioErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#CheerioHook)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L36)CheerioHook **CheerioHook\: InternalHttpHook<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#CheerioRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L82)CheerioRequestHandler 
**CheerioRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. ## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/cheerio ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/cheerio ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/cheerio # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/cheerio ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/cheerio # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Features[​](#features "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct 
link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/cheerio ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-1 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-2 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * **cheerio:** don't decode HTML entities in `context.body` ([#2838](https://github.com/apify/crawlee/issues/2838)) ([32d6d0e](https://github.com/apify/crawlee/commit/32d6d0ee7e7eaad1a401f4884926f31e0f68cc55)), closes [#2401](https://github.com/apify/crawlee/issues/2401) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/cheerio ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/cheerio # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/cheerio ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/cheerio ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/cheerio ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/cheerio ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/cheerio ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/cheerio # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/cheerio ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/cheerio ## 
[3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/cheerio ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-3 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/cheerio ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/cheerio # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/cheerio ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/cheerio ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/cheerio # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/cheerio ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/cheerio ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/cheerio # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) **Note:** Version bump only for package @crawlee/cheerio ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/cheerio ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/cheerio ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/cheerio # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/cheerio ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/cheerio ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/cheerio # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/cheerio ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") 
**Note:** Version bump only for package @crawlee/cheerio ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/cheerio ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/cheerio ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-4 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/cheerio ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/cheerio ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/cheerio # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) **Note:** Version bump only for package @crawlee/cheerio ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/cheerio ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/cheerio # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * respect `` when enqueuing ([#1936](https://github.com/apify/crawlee/issues/1936)) ([aeef572](https://github.com/apify/crawlee/commit/aeef57231c84671374ed0309b7b95fa9ce9a6e8b)) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/cheerio ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-5 "Direct link to Features") * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Features[​](#features-6 "Direct link to Features") * add `parseWithCheerio` context helper to cheerio crawler 
([b336a73](https://github.com/apify/crawlee/commit/b336a739117a6e4180492ec9915ddce128376a2c)) * **jsdom:** add `parseWithCheerio` context helper ([c8f0796](https://github.com/apify/crawlee/commit/c8f0796aebc0dfa6e6d04740a0bb7d8ddd5b2d96)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * **CheerioCrawler:** pass `isXml` down to response parser ([#1807](https://github.com/apify/crawlee/issues/1807)) ([af7a5c4](https://github.com/apify/crawlee/commit/af7a5c4efa94a53e5bdfeca340a9d7223d7dfda4)), closes [#1794](https://github.com/apify/crawlee/issues/1794) * ignore invalid URLs in `enqueueLinks` in browser crawlers ([#1803](https://github.com/apify/crawlee/issues/1803)) ([5ac336c](https://github.com/apify/crawlee/commit/5ac336c5b83b212fd6281659b8ceee091e259ff1)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/cheerio ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/cheerio # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/cheerio ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/cheerio ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/cheerio ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/cheerio # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/cheerio ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/cheerio --- # CheerioCrawler Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [cheerio](https://www.npmjs.com/package/cheerio) HTML parser. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `CheerioCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient in terms of data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because those crawlers load pages using a full-featured headless browser.
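As a rough sketch of that idea (the start URL and the glob pattern below are placeholders, not recommendations), a recursive crawl over plain HTTP can be as short as:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        log.info(`Visited ${request.url}: ${$('title').text()}`);
        // Newly discovered links go to the request queue, which drives the recursive crawl.
        await enqueueLinks({ globs: ['https://crawlee.dev/js/**'] });
    },
});

// A static list of start URLs seeds the crawl.
await crawler.run(['https://crawlee.dev/js']);
```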
`CheerioCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [Cheerio](https://www.npmjs.com/package/cheerio), and then invokes the user-provided [CheerioCrawlerOptions.requestHandler](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestHandler) to extract page data using a [jQuery](https://jquery.com/)-like interface to the parsed HTML DOM.

The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [CheerioCrawlerOptions.requestList](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestList) or [CheerioCrawlerOptions.requestQueue](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestQueue) constructor options, respectively. If both [CheerioCrawlerOptions.requestList](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestList) and [CheerioCrawlerOptions.requestQueue](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl.

We can use the `preNavigationHooks` option to adjust `gotOptions` before each request is made (the header tweak below is only an illustration):

```
preNavigationHooks: [
    (crawlingContext, gotOptions) => {
        // Adjust the got-scraping request options before the request is sent,
        // e.g. add or override HTTP headers:
        gotOptions.headers = { ...gotOptions.headers, 'accept-language': 'en-US' };
    },
]
```

By default, `CheerioCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [CheerioCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [CheerioCrawlerOptions.requestHandler](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#requestHandler).

New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `CheerioCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `CheerioCrawler` constructor.
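For instance, a crawler that also accepts XML feeds and caps its concurrency might be configured like this minimal sketch (the MIME types, limits and handler body are illustrative, not defaults):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Also process XML documents besides the default HTML/XHTML content types.
    additionalMimeTypes: ['application/xml', 'text/xml'],
    // Concurrency bounds passed through to the underlying AutoscaledPool.
    minConcurrency: 2,
    maxConcurrency: 20,
    async requestHandler({ request, contentType, $ }) {
        // `contentType` reflects what the server reported in the Content-Type header.
        console.log(`${request.url} was served as ${contentType.type}`);
        // `$` holds the Cheerio-parsed document.
    },
});

await crawler.run(['https://crawlee.dev']);
```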
**Example usage:**

```
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, response, body, contentType, $ }) {
        const data = [];

        // Do some data extraction from the page with Cheerio.
        $('.some-collection').each((index, el) => {
            data.push({ title: $(el).find('.some-title').text() });
        });

        // Save the data to dataset.
        await Dataset.pushData({
            url: request.url,
            html: body,
            data,
        });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

### Hierarchy * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)> * *CheerioCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L169)constructor * ****new CheerioCrawler**(options, config): [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) - Overrides HttpCrawler.constructor All `CheerioCrawler` parameters are passed via an options object. *** #### Parameters * ##### optionaloptions: [CheerioCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md)\ * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) #### Returns [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from HttpCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from HttpCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from HttpCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from HttpCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\, request>> = ... Inherited from HttpCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from HttpCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? 
: [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from HttpCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from HttpCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from HttpCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from HttpCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from HttpCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). 
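For example, after a run finishes we can read the collected records back through the crawler, or dump them to a file. A minimal sketch (the start URL, the stored fields and the `./results.json` output path are only illustrative):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        // Store the title of each crawled page in the default dataset.
        await pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.run(['http://www.example.com']);

// Retrieve the stored records (a thin wrapper around Dataset.getData()).
const { items } = await crawler.getData();
console.log(`Collected ${items.length} records`);

// Or export everything to a file; the format is inferred from the extension.
await crawler.exportData('./results.json');
```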
*** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from HttpCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from HttpCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from HttpCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from HttpCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from HttpCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. 
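It can also be called manually from anywhere that holds a reference to the crawler, for example to surface custom progress information. A minimal sketch (assuming `crawler` is an already constructed crawler instance; the message text is arbitrary):

```
// Report a custom status message; the `level` option controls the log level
// used for the message (it defaults to 'DEBUG').
await crawler.setStatusMessage('Finished the category pages, moving on to details...', { level: 'INFO' });
```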
*** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from HttpCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)inheriteduse * ****use**(extension): void - Inherited from HttpCrawler.use **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from HttpCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # createCheerioRouter ### Callable * ****createCheerioRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). Defaults to the [CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. 
``` import { CheerioCrawler, createCheerioRouter } from 'crawlee'; const router = createCheerioRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # CheerioCrawlerOptions \ ### Hierarchy * [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> * *CheerioCrawlerOptions* ## Index[**](#Index) ### Properties * [**additionalHttpErrorStatusCodes](#additionalHttpErrorStatusCodes) * [**additionalMimeTypes](#additionalMimeTypes) * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**forceResponseEncoding](#forceResponseEncoding) * [**handlePageFunction](#handlePageFunction) * [**httpClient](#httpClient) * [**ignoreHttpErrorStatusCodes](#ignoreHttpErrorStatusCodes) * [**ignoreSslErrors](#ignoreSslErrors) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**suggestResponseEncoding](#suggestResponseEncoding) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#additionalHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L186)optionalinheritedadditionalHttpErrorStatusCodes **additionalHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.additionalHttpErrorStatusCodes An array of additional HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be treated as errors. By default, status codes >= 500 trigger errors. ### [**](#additionalMimeTypes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L142)optionalinheritedadditionalMimeTypes **additionalMimeTypes? 
: string\[] Inherited from HttpCrawlerOptions.additionalMimeTypes An array of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) you want the crawler to load and process. By default, only `text/html` and `application/xhtml+xml` MIME types are supported. ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from HttpCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalinheritederrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> Inherited from HttpCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from HttpCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalinheritedfailedRequestHandler **failedRequestHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\> Inherited from HttpCrawlerOptions.failedRequestHandler A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. 
The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#forceResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L166)optionalinheritedforceResponseEncoding **forceResponseEncoding? : string Inherited from HttpCrawlerOptions.forceResponseEncoding By default this crawler will extract correct encoding from the HTTP response headers. Use `forceResponseEncoding` to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use [HttpCrawlerOptions.suggestResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#suggestResponseEncoding) ``` // Will force windows-1250 encoding even if headers say otherwise forceResponseEncoding: 'windows-1250' ``` ### [**](#handlePageFunction)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L87)optionalinheritedhandlePageFunction **handlePageFunction? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.handlePageFunction An alias for [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler) Soon to be removed, use `requestHandler` instead. * **@deprecated** ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from HttpCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L180)optionalinheritedignoreHttpErrorStatusCodes **ignoreHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.ignoreHttpErrorStatusCodes An array of HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be excluded from error consideration. By default, status codes >= 500 trigger errors. ### [**](#ignoreSslErrors)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L97)optionalinheritedignoreSslErrors **ignoreSslErrors? : boolean Inherited from HttpCrawlerOptions.ignoreSslErrors If set to true, SSL certificate errors will be ignored. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? 
: boolean Inherited from HttpCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from HttpCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from HttpCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from HttpCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from HttpCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from HttpCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. 
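The crawl-limiting options above are commonly combined on a single crawler. A brief sketch of how that might look (the numbers are arbitrary examples, not recommended values):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 500,   // stop after roughly 500 pages have been opened
    maxConcurrency: 10,         // never run more than 10 requests in parallel
    maxRequestsPerMinute: 120,  // cap the overall throughput
    maxRequestRetries: 5,       // retry a failing request up to 5 times
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```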
### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from HttpCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from HttpCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L92)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from HttpCrawlerOptions.navigationTimeoutSecs Timeout in which the HTTP request to the resource needs to finish, given in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from HttpCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L174)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from HttpCrawlerOptions.persistCookiesPerSession Automatically saves cookies to Session. Works only if Session Pool is used. It parses cookie from response "set-cookie" header saves or updates cookies for session and once the session is used for next request. It passes the "Cookie" header to the request with the session cookies. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L136)optionalinheritedpostNavigationHooks **postNavigationHooks? : InternalHttpHook<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. 
Example:

```
postNavigationHooks: [
    async (crawlingContext) => {
        // ...
    },
]
```

### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L122)optionalinheritedpreNavigationHooks

**preNavigationHooks? : InternalHttpHook<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.preNavigationHooks

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate.

Example:

```
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // ...
    },
]
```

Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook)

### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L104)optionalinheritedproxyConfiguration

**proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawlerOptions.proxyConfiguration

If set, this crawler will be configured for all connections to use [Apify Proxy](https://console.apify.com/proxy) or your own Proxy URLs provided and rotated according to the configuration. For more information, see the [documentation](https://docs.apify.com/proxy).

### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalinheritedrequestHandler

**requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[CheerioCrawlingContext](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.requestHandler

User-provided function that performs the logic of the crawler. It is called for each URL to crawl.

The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl.

The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function.

### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs

**requestHandlerTimeoutSecs?
: number = 60 Inherited from HttpCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from HttpCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from HttpCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs to be added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from HttpCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. 
Currently supports:

* [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/)
* [**Google Search** Rate Limiting](https://www.google.com/sorry/)

### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs

**sameDomainDelaySecs? : number = 0 Inherited from HttpCrawlerOptions.sameDomainDelaySecs

Indicates how much time (in seconds) the crawler should wait before processing another request to the same domain.

### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions

**sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from HttpCrawlerOptions.sessionPoolOptions

The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use.

### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions

**statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from HttpCrawlerOptions.statisticsOptions

Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store.

### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback

**statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from HttpCrawlerOptions.statusMessageCallback

Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.

```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```

### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval

**statusMessageLoggingInterval? : number Inherited from HttpCrawlerOptions.statusMessageLoggingInterval

Defines the length of the interval for calling the `setStatusMessage` in seconds.

### [**](#suggestResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L155)optionalinheritedsuggestResponseEncoding

**suggestResponseEncoding? : string Inherited from HttpCrawlerOptions.suggestResponseEncoding

By default this crawler will extract the correct encoding from the HTTP response headers. Sadly, some websites use invalid headers; their responses are then decoded as UTF-8 by default. If those sites actually use a different encoding, the response will be corrupted.
You can use `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use [HttpCrawlerOptions.forceResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#forceResponseEncoding) ``` // Will fall back to windows-1250 encoding if none found suggestResponseEncoding: 'windows-1250' ``` ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from HttpCrawlerOptions.useSessionPool Basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will be than available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # CheerioCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\ * *CheerioCrawlingContext* ## Index[**](#Index) ### Properties * [**$](#$) * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#$)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L49)$ **$: CheerioAPI The [Cheerio](https://cheerio.js.org/) object with parsed HTML. Cheerio is available only for HTML and XML content types. ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The request body of the web page. The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for others MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type header: { type, encoding }`. 
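For example, if `additionalMimeTypes` lets JSON responses through, the handler can branch on the parsed content type. A rough sketch (the chosen MIME type and the logging are only illustrative):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Allow JSON responses in addition to the default HTML/XHTML content types.
    additionalMimeTypes: ['application/json'],
    async requestHandler({ request, contentType, body, json, $ }) {
        if (contentType.type === 'application/json') {
            // For JSON responses, `json` holds the parsed object and Cheerio is not available.
            console.log(request.url, json);
        } else {
            // For HTML/XML responses, `body` is a string and `$` is the parsed document.
            console.log(request.url, $('title').text(), body.length);
        }
    },
});
```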
*** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) Inherited from InternalHttpCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from InternalHttpCrawlingContext.id ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson **json: JSONData Inherited from InternalHttpCrawlingContext.json The parsed object from JSON string if the response contains the content type application/json. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from InternalHttpCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from InternalHttpCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from InternalHttpCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse **response: PlainResponse Inherited from InternalHttpCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from InternalHttpCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from InternalHttpCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. 
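For example, the state can hold a simple counter shared by all handler runs. A minimal sketch (the `pagesVisited` field is just an illustrative shape for the state object):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, useState, log }) {
        // Every handler run receives the same state object.
        const state = await useState({ pagesVisited: 0 });
        state.pagesVisited += 1;
        log.info(`${request.url} is page number ${state.pagesVisited}`);
    },
});
```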
*** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from InternalHttpCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L79)parseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.parseWithCheerio Returns Cheerio handle, this is here to unify the crawler API, so they all have this handy method. It has the same return type as the `$` context property, use it only if you are abstracting your workflow to support different context types in one handler. When provided with the `selector` argument, it will throw if it's not available. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from InternalHttpCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. 
* ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from InternalHttpCrawlingContext.sendRequest Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/cheerio-crawler/src/internals/cheerio-crawler.ts#L63)waitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.waitForSelector Wait for an element matching the selector to appear. Timeout is ignored. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # @crawlee/core Core set of classes required for Crawlee. The [`crawlee`](https://www.npmjs.com/package/crawlee) package consists of several smaller packages, released separately under `@crawlee` namespace: * [`@crawlee/core`](https://crawlee.dev/js/api/core.md): the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes * [`@crawlee/cheerio`](https://crawlee.dev/js/api/cheerio-crawler.md): exports `CheerioCrawler` * [`@crawlee/playwright`](https://crawlee.dev/js/api/playwright-crawler.md): exports `PlaywrightCrawler` * [`@crawlee/puppeteer`](https://crawlee.dev/js/api/puppeteer-crawler.md): exports `PuppeteerCrawler` * [`@crawlee/linkedom`](https://crawlee.dev/js/api/linkedom-crawler.md): exports `LinkeDOMCrawler` * [`@crawlee/jsdom`](https://crawlee.dev/js/api/jsdom-crawler.md): exports `JSDOMCrawler` * [`@crawlee/basic`](https://crawlee.dev/js/api/basic-crawler.md): exports `BasicCrawler` * [`@crawlee/http`](https://crawlee.dev/js/api/http-crawler.md): exports `HttpCrawler` (which is used for creating [`@crawlee/jsdom`](https://crawlee.dev/js/api/jsdom-crawler.md) and [`@crawlee/cheerio`](https://crawlee.dev/js/api/cheerio-crawler.md)) * [`@crawlee/browser`](https://crawlee.dev/js/api/browser-crawler.md): exports `BrowserCrawler` (which is used for creating [`@crawlee/playwright`](https://crawlee.dev/js/api/playwright-crawler.md) and [`@crawlee/puppeteer`](https://crawlee.dev/js/api/puppeteer-crawler.md)) * [`@crawlee/memory-storage`](https://crawlee.dev/js/api/memory-storage.md): [`@apify/storage-local`](https://npmjs.com/package/@apify/storage-local) alternative * [`@crawlee/browser-pool`](https://crawlee.dev/js/api/browser-pool.md): previously [`browser-pool`](https://npmjs.com/package/browser-pool) package * [`@crawlee/utils`](https://crawlee.dev/js/api/utils.md): utility methods * 
[`@crawlee/types`](https://crawlee.dev/js/api/types.md): holds TS interfaces mainly about the [`StorageClient`](https://crawlee.dev/js/api/core/interface/StorageClient.md) ## Installing Crawlee[​](#installing-crawlee "Direct link to Installing Crawlee") Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`. If we don't care much about additional code being pulled in, we can just use the `crawlee` meta-package, which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. ``` npm install crawlee ``` Or if all we need is cheerio support, we can install only `@crawlee/cheerio`. ``` npm install @crawlee/cheerio ``` When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used. ``` npm install crawlee playwright # or npm install @crawlee/playwright playwright ``` Alternatively we can also use the `crawlee` meta-package which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. > Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser related utilities can be also found in the crawler packages (e.g. `@crawlee/playwright`). ## Index[**](#Index) ### Crawlers * [**Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### Result Stores * [**Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) * [**KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### Scaling * [**AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) * [**ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) * [**Session](https://crawlee.dev/js/api/core/class/Session.md) * [**SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) * [**Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) * [**SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### Sources * [**PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) * [**Request](https://crawlee.dev/js/api/core/class/Request.md) * [**RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) * [**RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) * [**RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### Other * [**RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) * [**EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) * [**EventType](https://crawlee.dev/js/api/core/enum/EventType.md) * [**LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * [**RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) * [**Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) * [**CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) * [**ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) * [**ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) * [**EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) * 
[**GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) * [**LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) * [**Log](https://crawlee.dev/js/api/core/class/Log.md) * [**Logger](https://crawlee.dev/js/api/core/class/Logger.md) * [**LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) * [**LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) * [**NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) * [**RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) * [**RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) * [**RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) * [**RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) * [**RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) * [**Router](https://crawlee.dev/js/api/core/class/Router.md) * [**SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) * [**SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) * [**BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) * [**BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) * [**ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) * [**ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) * [**Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) * [**CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) * [**CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) * [**DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) * [**DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) * [**DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) * [**DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) * [**DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) * [**DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) * [**DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) * [**DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) * [**ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) * [**FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) * [**HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) * [**HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) * [**HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) * [**IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) * 
[**IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) * [**IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) * [**KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) * [**LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) * [**PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) * [**ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) * [**QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) * [**RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) * [**RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) * [**RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) * [**RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) * [**RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) * [**RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) * [**RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) * [**RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) * [**ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) * [**ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) * [**RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) * [**SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) * [**SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) * [**SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) * [**SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) * [**SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) * [**StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) * [**StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) * [**StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) * [**StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) * [**StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) * [**StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) * [**SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) * 
[**SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) * [**TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) * [**UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) * [**EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) * [**LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) * [**PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) * [**RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) * [**RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) * [**RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) * [**SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) * [**Source](https://crawlee.dev/js/api/core.md#Source) * [**UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) * [**log](https://crawlee.dev/js/api/core.md#log) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) * [**checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) * [**enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) * [**processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) * [**purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) * [**tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) * [**useState](https://crawlee.dev/js/api/core/function/useState.md) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ## Other[**](#__CATEGORY__) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Renames and re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName **EventTypeName: [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) | systemInfo | persistState | migrating | aborting | exit ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest **GetUserDataFromRequest\: T extends [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Y> ? 
Y : never #### Type parameters * **T** ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput **GlobInput: string | [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject **GlobObject: { glob: string } & Pick<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md), method | payload | label | userData | headers> ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest **LoadedRequest\: WithRequired\ #### Type parameters * **R**: [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput **PseudoUrlInput: string | [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject **PseudoUrlObject: { purl: string } & Pick<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md), method | payload | label | userData | headers> ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler **RedirectHandler: (redirectResponse, updatedRequest) => void Type of a function called when an HTTP redirect takes place. It is allowed to mutate the `updatedRequest` argument. *** #### Type declaration * * **(redirectResponse, updatedRequest): void - #### Parameters * ##### redirectResponse: [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) * ##### updatedRequest: { headers: SimpleHeaders; url?: string | URL } * ##### headers: SimpleHeaders * ##### optionalurl: string | URL #### Returns void ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput **RegExpInput: RegExp | [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject **RegExpObject: { regexp: RegExp } & Pick<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md), method | payload | label | userData | headers> ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction **RequestListSourcesFunction: () => Promise\ #### Type declaration * * **(): Promise\ - #### Returns Promise\ ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike **RequestsLike: AsyncIterable<[Source](https://crawlee.dev/js/api/core.md#Source) | string> | Iterable<[Source](https://crawlee.dev/js/api/core.md#Source) | string> | ([Source](https://crawlee.dev/js/api/core.md#Source) | string)\[] ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes **RouterRoutes\: { \[ label in string | symbol ]: (ctx) => Awaitable\ } #### Type parameters * **Context** * **UserData**: Dictionary ### 
[**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback **SkippedRequestCallback: (args) => Awaitable\ #### Type declaration * * **(args): Awaitable\ - #### Parameters * ##### args: { reason: [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason); url: string } * ##### reason: [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) * ##### url: string #### Returns Awaitable\ ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason **SkippedRequestReason: robotsTxt | limit | filters | redirect | depth ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source **Source: (Partial<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)> & { regex? : RegExp; requestsFromUrl? : string }) | [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject **UrlPatternObject: { glob? : string; regexp? : RegExp } & Pick<[RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md), method | payload | label | userData | headers> ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)constBLOCKED\_STATUS\_CODES **BLOCKED\_STATUS\_CODES: number\[] = ... ### [**](#log)externalconstlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)constMAX\_POOL\_SIZE **MAX\_POOL\_SIZE: 1000 = 1000 ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)constPERSIST\_STATE\_KEY **PERSIST\_STATE\_KEY: SDK\_SESSION\_POOL\_STATE = 'SDK\_SESSION\_POOL\_STATE'
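As promised in the Installing Crawlee section above, here is a minimal sketch of what using an installed package looks like. It assumes only that the `crawlee` meta-package has been installed as shown earlier; the start URL, the scraped fields, and the `maxRequestsPerCrawl` limit are illustrative choices, not requirements.

```
// Minimal sketch: crawl a site with CheerioCrawler and store results in the default Dataset.
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Invoked for every fetched page; `$` is the Cheerio handle over the parsed HTML.
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`${title} (${request.loadedUrl})`);
        // Push one record per page into the default Dataset (see Result Stores above).
        await Dataset.pushData({ url: request.loadedUrl, title });
        // Enqueue links discovered on the page into the default RequestQueue.
        await enqueueLinks();
    },
    // Illustrative safety limit so the example terminates quickly.
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://crawlee.dev']);
```

The same shape applies to `PlaywrightCrawler` or `PuppeteerCrawler` from the same meta-package, provided `playwright` or `puppeteer` is installed alongside it as described above.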
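The `SkippedRequestCallback` and `SkippedRequestReason` types above can be used to react to links that Crawlee decides not to enqueue (see the `onSkippedRequest` option introduced in 3.13.2 in the changelog below). The following is only a hedged sketch: the callback body and the exact wiring into a crawler are illustrative assumptions.

```
// Sketch of a skipped-request callback, typed with the exported SkippedRequestCallback.
import type { SkippedRequestCallback } from '@crawlee/core';

const reportSkippedRequest: SkippedRequestCallback = ({ url, reason }) => {
    // `reason` is a SkippedRequestReason: robotsTxt | limit | filters | redirect | depth.
    console.warn(`Request skipped (${reason}): ${url}`);
};

// Hypothetical wiring, assuming a crawler accepts it via the `onSkippedRequest` option:
// const crawler = new CheerioCrawler({ onSkippedRequest: reportSkippedRequest, /* ... */ });
```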
---

# Changelog

All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.

## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/core ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * enable `systemInfoV2` by default ([#3208](https://github.com/apify/crawlee/issues/3208)) ([617a343](https://github.com/apify/crawlee/commit/617a343d4f594635adfff3c41a3632a19144749a)) ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * use correct config for storage classes to avoid memory leaks ([#3144](https://github.com/apify/crawlee/issues/3144)) ([911a2eb](https://github.com/apify/crawlee/commit/911a2eb45cdb5e3fc0e6a96471af86b43bc828bf)) ### Performance Improvements[​](#performance-improvements "Direct link to Performance Improvements") * Improve glob performance by reusing minimatch objects ([#3168](https://github.com/apify/crawlee/issues/3168)) ([e5632e2](https://github.com/apify/crawlee/commit/e5632e2700198d75ca955ef3d2ffb609dbf0f050)) # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * `proxyUrls` list can contain `null` ([#3142](https://github.com/apify/crawlee/issues/3142)) ([dc39cc2](https://github.com/apify/crawlee/commit/dc39cc223a90d0c97c60c5a715bdf524cc32bbac)), closes [#3136](https://github.com/apify/crawlee/issues/3136) ### Features[​](#features "Direct link to Features") * add `collectAllKeys` option for `BasicCrawler.exportData` ([#3129](https://github.com/apify/crawlee/issues/3129)) ([2ddfc9c](https://github.com/apify/crawlee/commit/2ddfc9c6108207d3289ee92fe3c5b646611cc508)), closes [#3007](https://github.com/apify/crawlee/issues/3007) * add `TandemRequestProvider` for combined `RequestList` and `RequestQueue` usage ([#2914](https://github.com/apify/crawlee/issues/2914)) ([4ca450f](https://github.com/apify/crawlee/commit/4ca450f08b9fb69ae3b2ba3fc66361f14631b15b)), closes [#2499](https://github.com/apify/crawlee/issues/2499) ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/core # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * respect `exclude` option in `enqueueLinksByClickingElements` ([#3058](https://github.com/apify/crawlee/issues/3058)) ([013eb02](https://github.com/apify/crawlee/commit/013eb028b6ecf05f83f8790a4a6164b9c4873733)) * validation of iterables when adding requests to the queue ([#3091](https://github.com/apify/crawlee/issues/3091)) ([529a1dd](https://github.com/apify/crawlee/commit/529a1dd57278efef4fb2013e79a09fd1bc8594a5)), closes [#3063](https://github.com/apify/crawlee/issues/3063) ### Features[​](#features-1 "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 
31310-2025-07-09") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * improve enqueueLinks `limit` checking ([#3038](https://github.com/apify/crawlee/issues/3038)) ([2774124](https://github.com/apify/crawlee/commit/277412468dc00a385080c3570c24faac76e764ca)), closes [#3037](https://github.com/apify/crawlee/issues/3037) ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * Do not log 'malformed sitemap content' on network errors in `Sitemap.tryCommonNames` ([#3015](https://github.com/apify/crawlee/issues/3015)) ([64a090f](https://github.com/apify/crawlee/commit/64a090ffbba5c69730ec0616e415a1eadf4bc7b3)), closes [#2884](https://github.com/apify/crawlee/issues/2884) * Fix link filtering in enqueueLinks in AdaptivePlaywrightCrawler ([#3021](https://github.com/apify/crawlee/issues/3021)) ([8a3b6f8](https://github.com/apify/crawlee/commit/8a3b6f8847586eb3b0865fe93053468e1605399c)), closes [#2525](https://github.com/apify/crawlee/issues/2525) ### Features[​](#features-2 "Direct link to Features") * Accept (Async)Iterables in `addRequests` methods ([#3013](https://github.com/apify/crawlee/issues/3013)) ([a4ab748](https://github.com/apify/crawlee/commit/a4ab74852c3c60bdbc96035f54b16d125220f699)), closes [#2980](https://github.com/apify/crawlee/issues/2980) * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) * Persist rendering type detection results in `AdaptivePlaywrightCrawler` ([#2987](https://github.com/apify/crawlee/issues/2987)) ([76431ba](https://github.com/apify/crawlee/commit/76431badf8a55892303d9b53fe23e029fad9cb18)), closes [#2899](https://github.com/apify/crawlee/issues/2899) ### Features[​](#features-3 "Direct link to Features") * **dataset:** add collectAllKeys option for full CSV export ([#2945](https://github.com/apify/crawlee/issues/2945)) ([#3007](https://github.com/apify/crawlee/issues/3007)) ([3b629da](https://github.com/apify/crawlee/commit/3b629da9418c052419381087d3ab1871a5c8718b)) * support `KVS.listKeys()` `prefix` and `collection` parameters ([#3001](https://github.com/apify/crawlee/issues/3001)) ([5c4726d](https://github.com/apify/crawlee/commit/5c4726df96e358a9bbf44a0cd2760e4e269f0fae)), closes [#2974](https://github.com/apify/crawlee/issues/2974) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * use merged cookies correctly in `GotScrapingHttpClient` ([#3000](https://github.com/apify/crawlee/issues/3000)) ([a2985f2](https://github.com/apify/crawlee/commit/a2985f259f068fbe00aed931a812b8a8755282cb)), closes [#2991](https://github.com/apify/crawlee/issues/2991) ## 
[3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/core ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/core ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * **core:** respect `systemInfoV2` in snapshotter ([#2961](https://github.com/apify/crawlee/issues/2961)) ([4100eab](https://github.com/apify/crawlee/commit/4100eabf171d1dfc33ff312cbedf4e178d34ebdf)), closes [#2958](https://github.com/apify/crawlee/issues/2958) * **core:** use short timeouts for periodic `KVS.setRecord` calls ([#2962](https://github.com/apify/crawlee/issues/2962)) ([d31d90e](https://github.com/apify/crawlee/commit/d31d90e5288ea80b3ed6ec4a75a4b8f87686a2c4)) * Optimize request unlocking to get rid of unnecessary unlock calls ([#2963](https://github.com/apify/crawlee/issues/2963)) ([a433037](https://github.com/apify/crawlee/commit/a433037f307ed3490a1ef5df334f1f9a9044510d)) ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * Fix useState behavior in adaptive crawler ([#2941](https://github.com/apify/crawlee/issues/2941)) ([5282381](https://github.com/apify/crawlee/commit/52823818bd66995c1512b433e6d82755c487cb58)) * Persist SitemapRequestList state periodically ([#2923](https://github.com/apify/crawlee/issues/2923)) ([e6e7a9f](https://github.com/apify/crawlee/commit/e6e7a9feed5d8281c36a83fc5edc2f5cb6e783fd)), closes [#2897](https://github.com/apify/crawlee/issues/2897) * **statistics:** track actual request.retryCount in Statistics ([#2940](https://github.com/apify/crawlee/issues/2940)) ([c9f7f54](https://github.com/apify/crawlee/commit/c9f7f5494ac4895a30b283a5defe382db0cdea26)) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-4 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * don't double increment session usage count in `BrowserCrawler` ([#2908](https://github.com/apify/crawlee/issues/2908)) ([3107e55](https://github.com/apify/crawlee/commit/3107e5511142a3579adc2348fcb6a9dcadd5c0b9)), closes [#2851](https://github.com/apify/crawlee/issues/2851) * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-5 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) 
([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * Make log message in RequestQueue.isFinished more accurate ([#2848](https://github.com/apify/crawlee/issues/2848)) ([3d124ae](https://github.com/apify/crawlee/commit/3d124aee8f6fa096df0daafad4bb9d07b0ae4684)) * Simplified RequestQueueV2 implementation ([#2775](https://github.com/apify/crawlee/issues/2775)) ([d1a094a](https://github.com/apify/crawlee/commit/d1a094a47eaecbf367b222f9b8c14d7da5d3e03a)), closes [#2767](https://github.com/apify/crawlee/issues/2767) [#2700](https://github.com/apify/crawlee/issues/2700) ### Features[​](#features-6 "Direct link to Features") * improved cross platform metric collection ([#2834](https://github.com/apify/crawlee/issues/2834)) ([e41b2f7](https://github.com/apify/crawlee/commit/e41b2f744513dd80aa05336eedfa1c08c54d3832)), closes [#2771](https://github.com/apify/crawlee/issues/2771) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * **core:** type definition of Dataset.reduce ([#2774](https://github.com/apify/crawlee/issues/2774)) ([59bc6d1](https://github.com/apify/crawlee/commit/59bc6d12cbd9e81c06ee18d0a6390b7806e346ae)), closes [#2773](https://github.com/apify/crawlee/issues/2773) ### Features[​](#features-7 "Direct link to Features") * add support for parsing comma-separated list environment variables ([#2765](https://github.com/apify/crawlee/issues/2765)) ([4e50c47](https://github.com/apify/crawlee/commit/4e50c474f60df66585c6decf07532c790c8e63a7)) ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") ### Features[​](#features-8 "Direct link to Features") * `tieredProxyUrls` accept `null` for switching the proxy off ([#2743](https://github.com/apify/crawlee/issues/2743)) ([82f4ea9](https://github.com/apify/crawlee/commit/82f4ea99f632526649ad73e3246b9bdf63a6788a)), closes [#2740](https://github.com/apify/crawlee/issues/2740) # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * **core:** ensure correct column order in CSV export ([#2734](https://github.com/apify/crawlee/issues/2734)) ([b66784f](https://github.com/apify/crawlee/commit/b66784f89f011c2f972d73ec9cd47235a0411d1c)), closes [#2718](https://github.com/apify/crawlee/issues/2718) ### Features[​](#features-9 "Direct link to Features") * allow using other HTTP clients ([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * `forefront` request fetching in RQv2 ([#2689](https://github.com/apify/crawlee/issues/2689)) ([03951bd](https://github.com/apify/crawlee/commit/03951bdba8fb34f6bed00d1b68240ff7cd0bacbf)), closes [#2669](https://github.com/apify/crawlee/issues/2669) * **core:** accept `UInt8Array` in `KVS.setValue()` ([#2682](https://github.com/apify/crawlee/issues/2682)) 
([8ef0e60](https://github.com/apify/crawlee/commit/8ef0e60ca6fb2f4ec1b0d1aec6dcd53fcfb398b3)) * decode special characters in proxy `username` and `password` ([#2696](https://github.com/apify/crawlee/issues/2696)) ([0f0fcc5](https://github.com/apify/crawlee/commit/0f0fcc594685a29472b407a7c39d48b21f24375a)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") ### Bug Fixes[​](#bug-fixes-15 "Direct link to Bug Fixes") * `SitemapRequestList.teardown()` doesn't break `persistState` calls ([#2673](https://github.com/apify/crawlee/issues/2673)) ([fb2c5cd](https://github.com/apify/crawlee/commit/fb2c5cdaa47e2d3a91ade726cfba3091917a0137)), closes [/github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap\_request\_list.ts#L446](https://github.com//github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap_request_list.ts/issues/L446) [#2672](https://github.com/apify/crawlee/issues/2672) ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") ### Bug Fixes[​](#bug-fixes-16 "Direct link to Bug Fixes") * **RequestQueueV2:** reset recently handled cache too if the queue is pending for too long ([#2656](https://github.com/apify/crawlee/issues/2656)) ([51a69bc](https://github.com/apify/crawlee/commit/51a69bc1f2084c4d7ef3b7bdab3695b77af29540)) ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") ### Bug Fixes[​](#bug-fixes-17 "Direct link to Bug Fixes") * **RequestQueueV2:** remove `inProgress` cache, rely solely on locked states ([#2601](https://github.com/apify/crawlee/issues/2601)) ([57fcb08](https://github.com/apify/crawlee/commit/57fcb0804a9f1268039d1e2b246c515ceca7e405)) ### Features[​](#features-10 "Direct link to Features") * `globs` & `regexps` for `SitemapRequestList` ([#2631](https://github.com/apify/crawlee/issues/2631)) ([b5fd3a9](https://github.com/apify/crawlee/commit/b5fd3a9e3f6b189b86c0fb89a37b66c08ff3fe5d)) * resilient sitemap loading ([#2619](https://github.com/apify/crawlee/issues/2619)) ([1dd7660](https://github.com/apify/crawlee/commit/1dd76601e03de4541964116b3a77376e233ea22b)) ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/core # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-11 "Direct link to Features") * Sitemap-based request list implementation ([#2498](https://github.com/apify/crawlee/issues/2498)) ([7bf8f0b](https://github.com/apify/crawlee/commit/7bf8f0bcd4cc81e02c7cc60e82dfe7a0cdd80938)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-18 "Direct link to Bug Fixes") * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") ### Bug Fixes[​](#bug-fixes-19 "Direct link to Bug Fixes") * add 
`waitForAllRequestsToBeAdded` option to `enqueueLinks` helper ([925546b](https://github.com/apify/crawlee/commit/925546b31130076c2dec98a83a42d15c216589a0)), closes [#2318](https://github.com/apify/crawlee/issues/2318) * respect `crawler.log` when creating child logger for `Statistics` ([0a0d75d](https://github.com/apify/crawlee/commit/0a0d75d40b5f78b329589535bbe3e0e84be76a7e)), closes [#2412](https://github.com/apify/crawlee/issues/2412) ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Bug Fixes[​](#bug-fixes-20 "Direct link to Bug Fixes") * respect implicit router when no `requestHandler` is provided in `AdaptiveCrawler` ([#2518](https://github.com/apify/crawlee/issues/2518)) ([31083aa](https://github.com/apify/crawlee/commit/31083aa27ddd51827f73c7ac4290379ec7a81283)) * revert the scaling steps back to 5% ([5bf32f8](https://github.com/apify/crawlee/commit/5bf32f855ad84037e68dd9053930fa7be4267cac)) ### Features[​](#features-12 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/core ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") ### Bug Fixes[​](#bug-fixes-21 "Direct link to Bug Fixes") * investigate and temp fix for possible 0-concurrency bug in RQv2 ([#2494](https://github.com/apify/crawlee/issues/2494)) ([4ebe820](https://github.com/apify/crawlee/commit/4ebe820573b269c2d0a6eff20cfd7787debc63c0)) * provide URLs to the error snapshot ([#2482](https://github.com/apify/crawlee/issues/2482)) ([7f64145](https://github.com/apify/crawlee/commit/7f64145308dfdb3909d4fcf945759a7d6344e2f5)), closes [/github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key\_value\_store.ts#L25](https://github.com//github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts/issues/L25) # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-22 "Direct link to Bug Fixes") * `EnqueueStrategy.All` erroring with links using unsupported protocols ([#2389](https://github.com/apify/crawlee/issues/2389)) ([8db3908](https://github.com/apify/crawlee/commit/8db39080b7711ba3c27dff7fce1170ddb0ee3d05)) * **core:** conversion between tough cookies and browser pool cookies ([#2443](https://github.com/apify/crawlee/issues/2443)) ([74f73ab](https://github.com/apify/crawlee/commit/74f73ab77a94ecd285d587b7b3532443deda07b4)) * **core:** fire local `SystemInfo` events every second ([#2454](https://github.com/apify/crawlee/issues/2454)) ([1fa9a66](https://github.com/apify/crawlee/commit/1fa9a66388846505f84dcdea0393e7eaaebf84c3)) * **core:** use createSessionFunction when loading Session from persisted state ([#2444](https://github.com/apify/crawlee/issues/2444)) ([3c56b4c](https://github.com/apify/crawlee/commit/3c56b4ca1efe327138aeb32c39dfd9dd67b6aceb)) * double tier decrement in tiered proxy ([#2468](https://github.com/apify/crawlee/issues/2468)) ([3a8204b](https://github.com/apify/crawlee/commit/3a8204ba417936570ec5569dc4e4eceed79939c1)) ### Features[​](#features-13 "Direct link to 
Features") * implement ErrorSnapshotter for error context capture ([#2332](https://github.com/apify/crawlee/issues/2332)) ([e861dfd](https://github.com/apify/crawlee/commit/e861dfdb451ae32fb1e0c7749c6b59744654b303)), closes [#2280](https://github.com/apify/crawlee/issues/2280) * make `RequestQueue` v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) ([#2390](https://github.com/apify/crawlee/issues/2390)) ([41ae8ab](https://github.com/apify/crawlee/commit/41ae8abec1da811ae0750ac2d298e77c1e3b7b55)), closes [#2388](https://github.com/apify/crawlee/issues/2388) ### Performance Improvements[​](#performance-improvements-1 "Direct link to Performance Improvements") * improve scaling based on memory ([#2459](https://github.com/apify/crawlee/issues/2459)) ([2d5d443](https://github.com/apify/crawlee/commit/2d5d443da5fa701b21aec003d4d84797882bc175)) * optimize `RequestList` memory footprint ([#2466](https://github.com/apify/crawlee/issues/2466)) ([12210bd](https://github.com/apify/crawlee/commit/12210bd191b50c76ecca23ea18f3deda7b1517c6)) * optimize adding large amount of requests via `crawler.addRequests()` ([#2456](https://github.com/apify/crawlee/issues/2456)) ([6da86a8](https://github.com/apify/crawlee/commit/6da86a85d848cd1cf860a28e5f077b8b14cdb213)) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") ### Bug Fixes[​](#bug-fixes-23 "Direct link to Bug Fixes") * break up growing stack in `AutoscaledPool.notify` ([#2422](https://github.com/apify/crawlee/issues/2422)) ([6f2e6b0](https://github.com/apify/crawlee/commit/6f2e6b0ccb404ae66be372e87d762eed67c053bb)), closes [#2421](https://github.com/apify/crawlee/issues/2421) ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/core # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Bug Fixes[​](#bug-fixes-24 "Direct link to Bug Fixes") * include actual key in error message of KVS' `setValue` ([#2411](https://github.com/apify/crawlee/issues/2411)) ([9089bf1](https://github.com/apify/crawlee/commit/9089bf139b717fecc6e8220c65a4d389862bd073)) * notify autoscaled pool about newly added requests ([#2400](https://github.com/apify/crawlee/issues/2400)) ([a90177d](https://github.com/apify/crawlee/commit/a90177d5207794be1d6e401d746dd4c6e5961976)) ### Features[​](#features-14 "Direct link to Features") * `createAdaptivePlaywrightRouter` utility ([#2415](https://github.com/apify/crawlee/issues/2415)) ([cee4778](https://github.com/apify/crawlee/commit/cee477814e4901d025c5376205ad884c2fe08e0e)), closes [#2407](https://github.com/apify/crawlee/issues/2407) * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) * better `newUrlFunction` for ProxyConfiguration ([#2392](https://github.com/apify/crawlee/issues/2392)) ([330598b](https://github.com/apify/crawlee/commit/330598b348ad27bc7c73732294a14b655ccd3507)), closes [#2348](https://github.com/apify/crawlee/issues/2348) [#2065](https://github.com/apify/crawlee/issues/2065) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") ### Bug Fixes[​](#bug-fixes-25 "Direct link to Bug Fixes") * **core:** solve possible dead locks in 
`RequestQueueV2` ([#2376](https://github.com/apify/crawlee/issues/2376)) ([ffba095](https://github.com/apify/crawlee/commit/ffba095c8a74075901268cc49d970af4271d7abf)) * use 0 (number) instead of false as default for sessionRotationCount ([#2372](https://github.com/apify/crawlee/issues/2372)) ([667a3e7](https://github.com/apify/crawlee/commit/667a3e7a2be31abb94adbdb6119c4a8f3a751d69)) ### Features[​](#features-15 "Direct link to Features") * implement global storage access checking and use it to prevent unwanted side effects in adaptive crawler ([#2371](https://github.com/apify/crawlee/issues/2371)) ([fb3b7da](https://github.com/apify/crawlee/commit/fb3b7da402522ddff8c7394ac1253ba8aeac984c)), closes [#2364](https://github.com/apify/crawlee/issues/2364) ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") ### Bug Fixes[​](#bug-fixes-26 "Direct link to Bug Fixes") * fix crawling context type in `router.addHandler()` ([#2355](https://github.com/apify/crawlee/issues/2355)) ([d73c202](https://github.com/apify/crawlee/commit/d73c20240586aeeddaea99cd157771a01b61d917)) # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Bug Fixes[​](#bug-fixes-27 "Direct link to Bug Fixes") * `createRequests` works correctly with `exclude` (and nothing else) ([#2321](https://github.com/apify/crawlee/issues/2321)) ([048db09](https://github.com/apify/crawlee/commit/048db0964a57ac570320ad495425733128235491)) ### Features[​](#features-16 "Direct link to Features") * `KeyValueStore.recordExists()` ([#2339](https://github.com/apify/crawlee/issues/2339)) ([8507a65](https://github.com/apify/crawlee/commit/8507a65d1ad079f64c752a6ddb1d8fac9b494228)) * accessing crawler state, key-value store and named datasets via crawling context ([#2283](https://github.com/apify/crawlee/issues/2283)) ([58dd5fc](https://github.com/apify/crawlee/commit/58dd5fcc25f31bb066402c46e48a9e5e91efd5c5)) * adaptive playwright crawler ([#2316](https://github.com/apify/crawlee/issues/2316)) ([8e4218a](https://github.com/apify/crawlee/commit/8e4218ada03cf485751def46f8c465b2d2a825c7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") ### Bug Fixes[​](#bug-fixes-28 "Direct link to Bug Fixes") * **enqueueLinks:** filter out empty/nullish globs ([#2286](https://github.com/apify/crawlee/issues/2286)) ([84319b3](https://github.com/apify/crawlee/commit/84319b39efb5a921d0d5ec785db0147ec47f1243)) ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") ### Bug Fixes[​](#bug-fixes-29 "Direct link to Bug Fixes") * **RequestQueue:** always clear locks when a request is reclaimed ([#2263](https://github.com/apify/crawlee/issues/2263)) ([0fafe29](https://github.com/apify/crawlee/commit/0fafe290103655d450c61da78522491efde8a866)), closes [#2262](https://github.com/apify/crawlee/issues/2262) ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/core # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-30 "Direct link to Bug Fixes") * `retryOnBlocked` doesn't override the blocked HTTP codes ([#2243](https://github.com/apify/crawlee/issues/2243)) 
([81672c3](https://github.com/apify/crawlee/commit/81672c3d1db1dcdcffb868de5740addff82cf112)) * filter out empty globs ([#2205](https://github.com/apify/crawlee/issues/2205)) ([41322ab](https://github.com/apify/crawlee/commit/41322ab32d7db7baf61638d00fd7eaec9e5330f1)), closes [#2200](https://github.com/apify/crawlee/issues/2200) * make SessionPool queue up getSession calls to prevent overruns ([#2239](https://github.com/apify/crawlee/issues/2239)) ([0f5665c](https://github.com/apify/crawlee/commit/0f5665c473371bff5a5d3abee3c3a9d23f2aeb23)), closes [#1667](https://github.com/apify/crawlee/issues/1667) ### Features[​](#features-17 "Direct link to Features") * allow configuring crawler statistics ([#2213](https://github.com/apify/crawlee/issues/2213)) ([9fd60e4](https://github.com/apify/crawlee/commit/9fd60e4036dce720c71f2d169a8eccbc4c813a96)), closes [#1789](https://github.com/apify/crawlee/issues/1789) * check enqueue link strategy post redirect ([#2238](https://github.com/apify/crawlee/issues/2238)) ([3c5f9d6](https://github.com/apify/crawlee/commit/3c5f9d6056158e042e12d75b2b1b21ef6c32e618)), closes [#2173](https://github.com/apify/crawlee/issues/2173) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") ### Bug Fixes[​](#bug-fixes-31 "Direct link to Bug Fixes") * prevent race condition in KeyValueStore.getAutoSavedValue() ([#2193](https://github.com/apify/crawlee/issues/2193)) ([e340e2b](https://github.com/apify/crawlee/commit/e340e2b8764968d22a22bd67769676b9f2f1a2fb)) ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Bug Fixes[​](#bug-fixes-32 "Direct link to Bug Fixes") * **ts:** specify type explicitly for logger ([aec3550](https://github.com/apify/crawlee/commit/aec355022eb13f2624eeba20aeeb42dc0ad8365c)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-33 "Direct link to Bug Fixes") * add `skipNavigation` option to `enqueueLinks` ([#2153](https://github.com/apify/crawlee/issues/2153)) ([118515d](https://github.com/apify/crawlee/commit/118515d2ba534b99be2f23436f6abe41d66a8e07)) * **core:** respect some advanced options for `RequestList.open()` + improve docs ([#2158](https://github.com/apify/crawlee/issues/2158)) ([c5a1b07](https://github.com/apify/crawlee/commit/c5a1b07ad62957fbe2cf90938d1f27b1ca54534a)) * declare missing dependency on got-scraping in the core package ([cd2fd4d](https://github.com/apify/crawlee/commit/cd2fd4d584c3c23ea4f74c9b2f363a55200594c9)) * retry incorrect Content-Type when response has blocked status code ([#2176](https://github.com/apify/crawlee/issues/2176)) ([b54fb8b](https://github.com/apify/crawlee/commit/b54fb8bb7bc3575195ee676d21e5feb8f898ef47)), closes [#1994](https://github.com/apify/crawlee/issues/1994) ### Features[​](#features-18 "Direct link to Features") * **core:** add `crawler.exportData()` helper ([#2166](https://github.com/apify/crawlee/issues/2166)) ([c8c09a5](https://github.com/apify/crawlee/commit/c8c09a54a712689969ff1f6bddf70f12a2a22670)) * got-scraping v4 ([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/core ## 
[3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") ### Bug Fixes[​](#bug-fixes-34 "Direct link to Bug Fixes") * RQ request count is consistent after migration ([#2116](https://github.com/apify/crawlee/issues/2116)) ([9ab8c18](https://github.com/apify/crawlee/commit/9ab8c1874f52acc3f0337fdabd36321d0fb40b86)), closes [#1855](https://github.com/apify/crawlee/issues/1855) [#1855](https://github.com/apify/crawlee/issues/1855) ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") ### Bug Fixes[​](#bug-fixes-35 "Direct link to Bug Fixes") * **types:** re-export RequestQueueOptions as an alias to RequestProviderOptions ([#2109](https://github.com/apify/crawlee/issues/2109)) ([0900f76](https://github.com/apify/crawlee/commit/0900f76742475c19a777733462e38c5a3a9b86b7)) ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-36 "Direct link to Bug Fixes") * session pool leaks memory on multiple crawler runs ([#2083](https://github.com/apify/crawlee/issues/2083)) ([b96582a](https://github.com/apify/crawlee/commit/b96582a200e25ec11124da1f7f84a2b16b64d133)), closes [#2074](https://github.com/apify/crawlee/issues/2074) [#2031](https://github.com/apify/crawlee/issues/2031) * **types:** make return type of RequestProvider.open and RequestQueue(v2).open strict and accurate ([#2096](https://github.com/apify/crawlee/issues/2096)) ([dfaddb9](https://github.com/apify/crawlee/commit/dfaddb920d9772985e0b54e0ce029cc7d99b1efa)) ### Features[​](#features-19 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") ### Bug Fixes[​](#bug-fixes-37 "Direct link to Bug Fixes") * **core:** allow explicit calls to `purgeDefaultStorage` to wipe the storage on each call ([#2060](https://github.com/apify/crawlee/issues/2060)) ([4831f07](https://github.com/apify/crawlee/commit/4831f073e5639fdfb058588bc23c4b673be70929)) * various helpers opening KVS now respect Configuration ([#2071](https://github.com/apify/crawlee/issues/2071)) ([59dbb16](https://github.com/apify/crawlee/commit/59dbb164699774e5a6718e98d0a4e8f630f35323)) ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-38 "Direct link to Bug Fixes") * **browser-pool:** improve error handling when browser is not found ([#2050](https://github.com/apify/crawlee/issues/2050)) ([282527f](https://github.com/apify/crawlee/commit/282527f31bb366a4e52463212f652dcf6679b6c3)), closes [#1459](https://github.com/apify/crawlee/issues/1459) * crawler instances with different StorageClients do not affect each other ([#2056](https://github.com/apify/crawlee/issues/2056)) ([3f4c863](https://github.com/apify/crawlee/commit/3f4c86352bdbad1c6a8dd10a2c49a1889ca206fa)) * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes 
[#2040](https://github.com/apify/crawlee/issues/2040) ### Features[​](#features-20 "Direct link to Features") * **core:** add default dataset helpers to `BasicCrawler` ([#2057](https://github.com/apify/crawlee/issues/2057)) ([e2a7544](https://github.com/apify/crawlee/commit/e2a7544ddf775db023ca25553d21cb73484fcd8c)) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") ### Bug Fixes[​](#bug-fixes-39 "Direct link to Bug Fixes") * make the `Request` constructor options typesafe ([#2034](https://github.com/apify/crawlee/issues/2034)) ([75e7d65](https://github.com/apify/crawlee/commit/75e7d6554a1875e80e5c54f3877bb6e3daf6cdd7)) ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") ### Bug Fixes[​](#bug-fixes-40 "Direct link to Bug Fixes") * add `Request.maxRetries` to the `RequestOptions` interface ([#2024](https://github.com/apify/crawlee/issues/2024)) ([6433821](https://github.com/apify/crawlee/commit/6433821a59538b1f1cb4f29addd83a259ddda74f)) * log original error message on session rotation ([#2022](https://github.com/apify/crawlee/issues/2022)) ([8a11ffb](https://github.com/apify/crawlee/commit/8a11ffbdaef6b2fe8603aac570c3038f84c2f203)) # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Bug Fixes[​](#bug-fixes-41 "Direct link to Bug Fixes") * **core:** add requests from URL list (`requestsFromUrl`) to the queue in batches ([418fbf8](https://github.com/apify/crawlee/commit/418fbf89d8680f8c460e37cfbf3e521f45770eb2)), closes [#1995](https://github.com/apify/crawlee/issues/1995) * **core:** support relative links in `enqueueLinks` explicitly provided via `urls` option ([#2014](https://github.com/apify/crawlee/issues/2014)) ([cbd9d08](https://github.com/apify/crawlee/commit/cbd9d08065694b8c86e32c773875cecd41e5fcc9)), closes [#2005](https://github.com/apify/crawlee/issues/2005) ### Features[​](#features-21 "Direct link to Features") * **core:** use `RequestQueue.addBatchedRequests()` in `enqueueLinks` helper ([4d61ca9](https://github.com/apify/crawlee/commit/4d61ca934072f8bbb680c842d8b1c9a4452ee73a)), closes [#1995](https://github.com/apify/crawlee/issues/1995) * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Features[​](#features-22 "Direct link to Features") * **core:** add `RequestQueue.addRequestsBatched()` that is non-blocking ([#1996](https://github.com/apify/crawlee/issues/1996)) ([c85485d](https://github.com/apify/crawlee/commit/c85485d6ca2bb61cfebb24a2ad99e0b3ba5c069b)), closes [#1995](https://github.com/apify/crawlee/issues/1995) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") ### Bug Fixes[​](#bug-fixes-42 "Direct link to Bug Fixes") * **http-crawler:** replace `IncomingMessage` with `PlainResponse` for context's `response` ([#1973](https://github.com/apify/crawlee/issues/1973)) ([2a1cc7f](https://github.com/apify/crawlee/commit/2a1cc7f4f87f0b1c657759076a236a8f8d9b76ba)), closes [#1964](https://github.com/apify/crawlee/issues/1964) # 
[3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Features[​](#features-23 "Direct link to Features") * add LinkeDOMCrawler ([#1907](https://github.com/apify/crawlee/issues/1907)) ([1c69560](https://github.com/apify/crawlee/commit/1c69560fe7ef45097e6be1037b79a84eb9a06337)), closes [/github.com/apify/crawlee/pull/1890#issuecomment-1533271694](https://github.com//github.com/apify/crawlee/pull/1890/issues/issuecomment-1533271694) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") ### Features[​](#features-24 "Direct link to Features") * add support for `requestsFromUrl` to `RequestQueue` ([#1917](https://github.com/apify/crawlee/issues/1917)) ([7f2557c](https://github.com/apify/crawlee/commit/7f2557cdbbdee177db7c5970ae5a4881b7bc9b35)) * **core:** add `Request.maxRetries` to allow overriding the `maxRequestRetries` ([#1925](https://github.com/apify/crawlee/issues/1925)) ([c5592db](https://github.com/apify/crawlee/commit/c5592db0f8094de27c46ad993bea2c1ab1f61385)) ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Bug Fixes[​](#bug-fixes-43 "Direct link to Bug Fixes") * respect config object when creating `SessionPool` ([#1881](https://github.com/apify/crawlee/issues/1881)) ([db069df](https://github.com/apify/crawlee/commit/db069df80bc183c6b861c9ac82f1e278e57ea92b)) ### Features[​](#features-25 "Direct link to Features") * allow running single crawler instance multiple times ([#1844](https://github.com/apify/crawlee/issues/1844)) ([9e6eb1e](https://github.com/apify/crawlee/commit/9e6eb1e32f582a8837311aac12cc1d657432f3fa)), closes [#765](https://github.com/apify/crawlee/issues/765) * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) * support alternate storage clients when opening storages ([#1901](https://github.com/apify/crawlee/issues/1901)) ([661e550](https://github.com/apify/crawlee/commit/661e550dcf3609b75e2d7bc225c2f6914f45c93e)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-44 "Direct link to Bug Fixes") * **Storage:** queue up opening storages to prevent issues in concurrent calls ([#1865](https://github.com/apify/crawlee/issues/1865)) ([044c740](https://github.com/apify/crawlee/commit/044c740101dd0acd2248dee3702aec769ce0c892)) * try to detect stuck request queue and fix its state ([#1837](https://github.com/apify/crawlee/issues/1837)) ([95a9f94](https://github.com/apify/crawlee/commit/95a9f941836c020a3223fd309f11cff58bc50624)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-45 "Direct link to Bug Fixes") * ignore invalid URLs in `enqueueLinks` in browser crawlers ([#1803](https://github.com/apify/crawlee/issues/1803)) ([5ac336c](https://github.com/apify/crawlee/commit/5ac336c5b83b212fd6281659b8ceee091e259ff1)) ### Features[​](#features-26 "Direct link to Features") * **core:** add `exclude` option to `enqueueLinks` ([#1786](https://github.com/apify/crawlee/issues/1786)) ([2e833dc](https://github.com/apify/crawlee/commit/2e833dc4b0b82bb6741aa683f3fcba05244427df)), closes [#1785](https://github.com/apify/crawlee/issues/1785) ## 
[3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/core ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") ### Bug Fixes[​](#bug-fixes-46 "Direct link to Bug Fixes") * add `QueueOperationInfo` export to the core package ([5ec6c24](https://github.com/apify/crawlee/commit/5ec6c24ba31c11c0ff4db49a6461f112a70071b3)) # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-47 "Direct link to Bug Fixes") * clone `request.userData` when creating new request object ([#1728](https://github.com/apify/crawlee/issues/1728)) ([222ef59](https://github.com/apify/crawlee/commit/222ef59b646740ae46be011ea0bc3d11c51a553e)), closes [#1725](https://github.com/apify/crawlee/issues/1725) * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) * ensure CrawlingContext interface is inferred correctly in route handlers ([aa84633](https://github.com/apify/crawlee/commit/aa84633b1a2007c2e91bf012e944433b21243f2e)) * **utils:** add missing dependency on `ow` ([bf0e03c](https://github.com/apify/crawlee/commit/bf0e03cc6ddc103c9337de5cd8dce9bc86c369a3)), closes [#1716](https://github.com/apify/crawlee/issues/1716) ### Features[​](#features-27 "Direct link to Features") * **enqueueLinks:** add SameOrigin strategy and relax protocol matching for the other strategies ([#1748](https://github.com/apify/crawlee/issues/1748)) ([4ba982a](https://github.com/apify/crawlee/commit/4ba982a909a3c16004b24ef90c3da3ee4e075be0)) ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/core ## [3.1.2](https://github.com/apify/crawlee/compare/v3.1.1...v3.1.2) (2022-11-15)[​](#312-2022-11-15 "Direct link to 312-2022-11-15") ### Bug Fixes[​](#bug-fixes-48 "Direct link to Bug Fixes") * injectJQuery in context does not survive navs ([#1661](https://github.com/apify/crawlee/issues/1661)) ([493a7cf](https://github.com/apify/crawlee/commit/493a7cff569cb12cfd9aa5e0f4fcb9de686eb41f)) * make router error message more helpful for undefined routes ([#1678](https://github.com/apify/crawlee/issues/1678)) ([ab359d8](https://github.com/apify/crawlee/commit/ab359d84f2ebdac69441ae84dcade1bca7714390)) * **MemoryStorage:** correctly respect the desc option ([#1666](https://github.com/apify/crawlee/issues/1666)) ([b5f37f6](https://github.com/apify/crawlee/commit/b5f37f66a50b2d546eca24a699cf92cb683b7026)) * requestHandlerTimeout timing ([#1660](https://github.com/apify/crawlee/issues/1660)) ([493ea0c](https://github.com/apify/crawlee/commit/493ea0ce80e55ece5a8881a6aea6674918873b35)) * shallow clone browserPoolOptions before normalization ([#1665](https://github.com/apify/crawlee/issues/1665)) ([22467ca](https://github.com/apify/crawlee/commit/22467ca81ad9464d528495333f62a60f2ea0487c)) * support headfull mode in playwright js project template ([ea2e61b](https://github.com/apify/crawlee/commit/ea2e61bc3bfcc9a895a89ad6db415a398bd3b7db)) * support headfull mode in puppeteer js project template ([e6aceb8](https://github.com/apify/crawlee/commit/e6aceb81ed0762f25dde66ff94ccdf8c1a619f7d)) ### Features[​](#features-28 "Direct link to Features") * 
**jsdom-crawler:** add runScripts option ([#1668](https://github.com/apify/crawlee/issues/1668)) ([8ef90bc](https://github.com/apify/crawlee/commit/8ef90bc1c020ddee334dd9a9267f6b6298a27024)) ## [3.1.1](https://github.com/apify/crawlee/compare/v3.1.0...v3.1.1) (2022-11-07)[​](#311-2022-11-07 "Direct link to 311-2022-11-07") ### Bug Fixes[​](#bug-fixes-49 "Direct link to Bug Fixes") * `utils.playwright.blockRequests` warning message ([#1632](https://github.com/apify/crawlee/issues/1632)) ([76549eb](https://github.com/apify/crawlee/commit/76549eb250a39e961b7f567ad0610af136d1c79f)) * concurrency option override order ([#1649](https://github.com/apify/crawlee/issues/1649)) ([7bbad03](https://github.com/apify/crawlee/commit/7bbad0380cd6de3fdca79ba57e1fef1d22bd56f8)) * handle non-error objects thrown gracefully ([#1652](https://github.com/apify/crawlee/issues/1652)) ([c3a4e1a](https://github.com/apify/crawlee/commit/c3a4e1a9b7d0b80a8e889bdcb394fc0be3905c6f)) * mark session as bad on failed requests ([#1647](https://github.com/apify/crawlee/issues/1647)) ([445ae43](https://github.com/apify/crawlee/commit/445ae4321816bc418a83c02fb52e64df96bfb0a9)) * support reloading of sessions with lots of retries ([ebc89d2](https://github.com/apify/crawlee/commit/ebc89d2d69d5a2da6eb4e37de59ea39daf81f8f8)) * fix type errors when `playwright` is not installed ([#1637](https://github.com/apify/crawlee/issues/1637)) ([de9db0c](https://github.com/apify/crawlee/commit/de9db0c2b24019d2e1dd43206dd7f149ecdc679a)) * upgrade to ([#1623](https://github.com/apify/crawlee/issues/1623)) ([ce36d6b](https://github.com/apify/crawlee/commit/ce36d6bd60c7adb113759126b3cb15ca222e94d0)) ### Features[​](#features-29 "Direct link to Features") * add static `set` and `useStorageClient` shortcuts to `Configuration` ([2e66fa2](https://github.com/apify/crawlee/commit/2e66fa2fad84aee2dca08b386916b465a0c012a3)) * enable migration testing ([#1583](https://github.com/apify/crawlee/issues/1583)) ([ee3a68f](https://github.com/apify/crawlee/commit/ee3a68fff1fcdf941c9a1d3734107635e9a12049)) * **playwright:** disable animations when taking screenshots ([#1601](https://github.com/apify/crawlee/issues/1601)) ([4e63034](https://github.com/apify/crawlee/commit/4e63034c7b87de405edbd84f9b1803aa101f5c78)) # [3.1.0](https://github.com/apify/crawlee/compare/v3.0.4...v3.1.0) (2022-10-13) ### Bug Fixes[​](#bug-fixes-50 "Direct link to Bug Fixes") * add overload for `KeyValueStore.getValue` with defaultValue ([#1541](https://github.com/apify/crawlee/issues/1541)) ([e3cb509](https://github.com/apify/crawlee/commit/e3cb509cb433e72e058b08a323dc7564e858f547)) * add retry attempts to methods in CLI ([#1588](https://github.com/apify/crawlee/issues/1588)) ([9142e59](https://github.com/apify/crawlee/commit/9142e598de68cc86d82825823c87b82a52c7b305)) * allow `label` in `enqueueLinksByClickingElements` options ([#1525](https://github.com/apify/crawlee/issues/1525)) ([18b7c25](https://github.com/apify/crawlee/commit/18b7c25592eaaa4a9f97cacc6e7154528ce54bf6)) * **basic-crawler:** handle `request.noRetry` after `errorHandler` ([#1542](https://github.com/apify/crawlee/issues/1542)) ([2a2040e](https://github.com/apify/crawlee/commit/2a2040e13209aff5e64ee47194940182b686b3a7)) * build storage classes by using `this` instead of the class ([#1596](https://github.com/apify/crawlee/issues/1596)) ([2b14eb7](https://github.com/apify/crawlee/commit/2b14eb7240d10760518e047095766084a3d255e3)) * correct some typing exports ([#1527](https://github.com/apify/crawlee/issues/1527)) 
([4a136e5](https://github.com/apify/crawlee/commit/4a136e59e128f0a80ad4a1b98b87449647f23f43)) * do not hide stack trace of (retried) Type/Syntax/ReferenceErrors ([469b4b5](https://github.com/apify/crawlee/commit/469b4b58f1c19699d05da84f5f09a95d682421f0)) * **enqueueLinks:** ensure the enqueue strategy is respected alongside user patterns ([#1509](https://github.com/apify/crawlee/issues/1509)) ([2b0eeed](https://github.com/apify/crawlee/commit/2b0eeed3c5b0a69265f7d0567028e5707af4835b)) * **enqueueLinks:** prevent useless request creations when filtering by user patterns ([#1510](https://github.com/apify/crawlee/issues/1510)) ([cb8fe36](https://github.com/apify/crawlee/commit/cb8fe3664db1bd4cba9c2b2185e96bceddabb333)) * export `Cookie` from `crawlee` metapackage ([7b02ceb](https://github.com/apify/crawlee/commit/7b02cebc6920da9bd36d63802df0f7d6abec3887)) * handle redirect cookies ([#1521](https://github.com/apify/crawlee/issues/1521)) ([2f7fc7c](https://github.com/apify/crawlee/commit/2f7fc7cc1d27553d94a915667f0e6d2af599a80c)) * **http-crawler:** do not hang on POST without payload ([#1546](https://github.com/apify/crawlee/issues/1546)) ([8c87390](https://github.com/apify/crawlee/commit/8c87390e0db1924f463019cc55dfc265b12db2a9)) * remove undeclared dependency on core package from puppeteer utils ([827ae60](https://github.com/apify/crawlee/commit/827ae60d6c77e8c7271408493c3750a67ef8a9b4)) * support TypeScript 4.8 ([#1507](https://github.com/apify/crawlee/issues/1507)) ([4c3a504](https://github.com/apify/crawlee/commit/4c3a5045931a7f270bf8eda8a6417466b32fc99b)) * wait for persist state listeners to run when event manager closes ([#1481](https://github.com/apify/crawlee/issues/1481)) ([aa550ed](https://github.com/apify/crawlee/commit/aa550edf7e016497e8e0323e18b14bf32b416155)) ### Features[​](#features-30 "Direct link to Features") * add `Dataset.exportToValue` ([#1553](https://github.com/apify/crawlee/issues/1553)) ([acc6344](https://github.com/apify/crawlee/commit/acc6344f0e52854b4c4c833dbf7aede2547c111e)) * add `Dataset.getData()` shortcut ([522ed6e](https://github.com/apify/crawlee/commit/522ed6e209aea4aa8285ddbb336f027a36cfb6bc)) * add `utils.downloadListOfUrls` to crawlee metapackage ([7b33b0a](https://github.com/apify/crawlee/commit/7b33b0a582a75758cfca53e3ed92d6d3e392b601)) * add `utils.parseOpenGraph()` ([#1555](https://github.com/apify/crawlee/issues/1555)) ([059f85e](https://github.com/apify/crawlee/commit/059f85ebe577888d448b196f89d0f4ec1dff371e)) * add `utils.playwright.compileScript` ([#1559](https://github.com/apify/crawlee/issues/1559)) ([2e14162](https://github.com/apify/crawlee/commit/2e141625f27aa58e2195ab37ed2e31691b58f4c0)) * add `utils.playwright.infiniteScroll` ([#1543](https://github.com/apify/crawlee/issues/1543)) ([60c8289](https://github.com/apify/crawlee/commit/60c8289571f3b6bce908ef7d1636b59faebdbf87)), closes [#1528](https://github.com/apify/crawlee/issues/1528) * add `utils.playwright.saveSnapshot` ([#1544](https://github.com/apify/crawlee/issues/1544)) ([a4ceef0](https://github.com/apify/crawlee/commit/a4ceef044f0c5afdfd964dd1163a260463a60f52)) * add global `useState` helper ([#1551](https://github.com/apify/crawlee/issues/1551)) ([2b03177](https://github.com/apify/crawlee/commit/2b0317772a2bb0d29b73ff86719caf9db394d507)) * add static `Dataset.exportToValue` ([#1564](https://github.com/apify/crawlee/issues/1564)) ([a7c17d4](https://github.com/apify/crawlee/commit/a7c17d434559785d66c1220d22ea79961bda2eec)) * allow disabling storage persistence 
([#1539](https://github.com/apify/crawlee/issues/1539)) ([f65e3c6](https://github.com/apify/crawlee/commit/f65e3c6a7e1efc02fac5f32046bb27da5a1c8e78)) * bump puppeteer support to 17.x ([#1519](https://github.com/apify/crawlee/issues/1519)) ([b97a852](https://github.com/apify/crawlee/commit/b97a85282b64cfb6d48b0aa71f5cc79525a80295)) * **core:** add `forefront` option to `enqueueLinks` helper ([f8755b6](https://github.com/apify/crawlee/commit/f8755b633212138671a76a8d5e0af17c12d46e10)), closes [#1595](https://github.com/apify/crawlee/issues/1595) * don't close page before calling errorHandler ([#1548](https://github.com/apify/crawlee/issues/1548)) ([1c8cd82](https://github.com/apify/crawlee/commit/1c8cd82611e93e4991b49b8ba2f1842457875680)) * enqueue links by clicking for Playwright ([#1545](https://github.com/apify/crawlee/issues/1545)) ([3d25ade](https://github.com/apify/crawlee/commit/3d25adefa7570433a9fa636941684bc2701b8ddd)) * error tracker ([#1467](https://github.com/apify/crawlee/issues/1467)) ([6bfe1ce](https://github.com/apify/crawlee/commit/6bfe1ce0161f1e26f97e2b8e5c02ec9ca608fe30)) * make the CLI download directly from GitHub ([#1540](https://github.com/apify/crawlee/issues/1540)) ([3ff398a](https://github.com/apify/crawlee/commit/3ff398a2f114760d33c43b5bc0c2447e2e48a72e)) * **router:** add userdata generic to addHandler ([#1547](https://github.com/apify/crawlee/issues/1547)) ([19cdf13](https://github.com/apify/crawlee/commit/19cdf1380abdf9aa8f337a96a4666f8f650bad69)) * use JSON5 for `INPUT.json` to support comments ([#1538](https://github.com/apify/crawlee/issues/1538)) ([09133ff](https://github.com/apify/crawlee/commit/09133ffa744436b60fc452b4f97caf1a18ebfced)) ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Features[​](#features-31 "Direct link to Features") * bump puppeteer support to 15.1 ### Bug Fixes[​](#bug-fixes-51 "Direct link to Bug Fixes") * key value stores emitting an error when multiple write promises ran in parallel ([#1460](https://github.com/apify/crawlee/issues/1460)) ([f201cca](https://github.com/apify/crawlee/commit/f201cca4a99d1c8b3e87be0289d5b3b363048f09)) * fix dockerfiles in project templates ## [3.0.3](https://github.com/apify/crawlee/compare/v3.0.2...v3.0.3) (2022-08-11)[​](#303-2022-08-11 "Direct link to 303-2022-08-11") ### Fixes[​](#fixes "Direct link to Fixes") * add missing configuration to CheerioCrawler constructor ([#1432](https://github.com/apify/crawlee/pull/1432)) * sendRequest types ([#1445](https://github.com/apify/crawlee/pull/1445)) * respect `headless` option in browser crawlers ([#1455](https://github.com/apify/crawlee/pull/1455)) * make `CheerioCrawlerOptions` type more loose ([d871d8c](https://github.com/apify/crawlee/commit/d871d8caf22bc8d8ca1041e4975f3c95eae4b487)) * improve dockerfiles and project templates ([7c21a64](https://github.com/apify/crawlee/commit/7c21a646360d10453f17380f9882ac52d06fedb6)) ### Features[​](#features-32 "Direct link to Features") * add `utils.playwright.blockRequests()` ([#1447](https://github.com/apify/crawlee/pull/1447)) * http-crawler ([#1440](https://github.com/apify/crawlee/pull/1440)) * prefer `/INPUT.json` files for `KeyValueStore.getInput()` ([#1453](https://github.com/apify/crawlee/pull/1453)) * jsdom-crawler ([#1451](https://github.com/apify/crawlee/pull/1451)) * add `RetryRequestError` + add error to the context for BC ([#1443](https://github.com/apify/crawlee/pull/1443)) * add `keepAlive` to crawler options 
([#1452](https://github.com/apify/crawlee/pull/1452)) ## [3.0.2](https://github.com/apify/crawlee/compare/v3.0.1...v3.0.2) (2022-07-28)[​](#302-2022-07-28 "Direct link to 302-2022-07-28") ### Fixes[​](#fixes-1 "Direct link to Fixes") * regression in resolving the base url for enqueue link filtering ([1422](https://github.com/apify/crawlee/pull/1422)) * improve file saving on memory storage ([1421](https://github.com/apify/crawlee/pull/1421)) * add `UserData` type argument to `CheerioCrawlingContext` and related interfaces ([1424](https://github.com/apify/crawlee/pull/1424)) * always limit `desiredConcurrency` to the value of `maxConcurrency` ([bcb689d](https://github.com/apify/crawlee/commit/bcb689d4cb90835136295d879e710969ebaf29fa)) * wait for storage to finish before resolving `crawler.run()` ([9d62d56](https://github.com/apify/crawlee/commit/9d62d565c2ff8d058164c22333b07b7d2bf79ee0)) * using explicitly typed router with `CheerioCrawler` ([07b7e69](https://github.com/apify/crawlee/commit/07b7e69e1a7b7c89b8a5538279eb6de8be0effde)) * declare dependency on `ow` in `@crawlee/cheerio` package ([be59f99](https://github.com/apify/crawlee/commit/be59f992d2897ce5c02349bbcc62472d99bb2718)) * use `crawlee@^3.0.0` in the CLI templates ([6426f22](https://github.com/apify/crawlee/commit/6426f22ce53fcce91b1d8686577557bae09fc0e9)) * fix building projects with TS when puppeteer and playwright are not installed ([1404](https://github.com/apify/crawlee/pull/1404)) * enqueueLinks should respect full URL of the current request for relative link resolution ([1427](https://github.com/apify/crawlee/pull/1427)) * use `desiredConcurrency: 10` as the default for `CheerioCrawler` ([1428](https://github.com/apify/crawlee/pull/1428)) ### Features[​](#features-33 "Direct link to Features") * feat: allow configuring what status codes will cause session retirement ([1423](https://github.com/apify/crawlee/pull/1423)) * feat: add support for middlewares to the `Router` via `use` method ([1431](https://github.com/apify/crawlee/pull/1431)) ## [3.0.1](https://github.com/apify/crawlee/compare/v3.0.0...v3.0.1) (2022-07-26)[​](#301-2022-07-26 "Direct link to 301-2022-07-26") ### Fixes[​](#fixes-2 "Direct link to Fixes") * remove `JSONData` generic type arg from `CheerioCrawler` in ([#1402](https://github.com/apify/crawlee/pull/1402)) * rename default storage folder to just `storage` in ([#1403](https://github.com/apify/crawlee/pull/1403)) * remove trailing slash for proxyUrl in ([#1405](https://github.com/apify/crawlee/pull/1405)) * run browser crawlers in headless mode by default in ([#1409](https://github.com/apify/crawlee/pull/1409)) * rename interface `FailedRequestHandler` to `ErrorHandler` in ([#1410](https://github.com/apify/crawlee/pull/1410)) * ensure default route is not ignored in `CheerioCrawler` in ([#1411](https://github.com/apify/crawlee/pull/1411)) * add `headless` option to `BrowserCrawlerOptions` in ([#1412](https://github.com/apify/crawlee/pull/1412)) * processing custom cookies in ([#1414](https://github.com/apify/crawlee/pull/1414)) * enqueue link not finding relative links if the checked page is redirected in ([#1416](https://github.com/apify/crawlee/pull/1416)) * fix building projects with TS when puppeteer and playwright are not installed in ([#1404](https://github.com/apify/crawlee/pull/1404)) * calling `enqueueLinks` in browser crawler on page without any links in ([385ca27](https://github.com/apify/crawlee/commit/385ca27c4c50096f2e28bf0da369d6aaf849a73b)) * improve error message when no default route 
provided in ([04c3b6a](https://github.com/apify/crawlee/commit/04c3b6ac2fd151379d57e95bde085e2a098d1b76)) ### Features[​](#features-34 "Direct link to Features") * feat: add parseWithCheerio for puppeteer & playwright in ([#1418](https://github.com/apify/crawlee/pull/1418)) ## [3.0.0](https://github.com/apify/crawlee/compare/v2.3.2...v3.0.0) (2022-07-13)[​](#300-2022-07-13 "Direct link to 300-2022-07-13") This section summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3. ### Crawlee vs Apify SDK[​](#crawlee-vs-apify-sdk "Direct link to Crawlee vs Apify SDK") Up until version 3 of `apify`, the package contained both scraping related tools and Apify platform related helper methods. With v3 we are splitting the whole project into two main parts: * Crawlee, the new web-scraping library, available as `crawlee` package on NPM * Apify SDK, helpers for the Apify platform, available as `apify` package on NPM Moreover, the Crawlee library is published as several packages under `@crawlee` namespace: * `@crawlee/core`: the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes * `@crawlee/basic`: exports `BasicCrawler` * `@crawlee/cheerio`: exports `CheerioCrawler` * `@crawlee/browser`: exports `BrowserCrawler` (which is used for creating `@crawlee/playwright` and `@crawlee/puppeteer`) * `@crawlee/playwright`: exports `PlaywrightCrawler` * `@crawlee/puppeteer`: exports `PuppeteerCrawler` * `@crawlee/memory-storage`: `@apify/storage-local` alternative * `@crawlee/browser-pool`: previously `browser-pool` package * `@crawlee/utils`: utility methods * `@crawlee/types`: holds TS interfaces mainly about the `StorageClient` #### Installing Crawlee[​](#installing-crawlee "Direct link to Installing Crawlee") > As Crawlee is not yet released as `latest`, we need to install from the `next` distribution tag! Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`. ``` npm install crawlee@next ``` Or if all we need is cheerio support, we can install only @crawlee/cheerio ``` npm install @crawlee/cheerio@next ``` When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used. ``` npm install crawlee@next playwright # or npm install @crawlee/playwright@next playwright ``` Alternatively we can also use the `crawlee` meta-package which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. > Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser related utilities can be also found in the crawler packages (e.g. `@crawlee/playwright`). ### Full TypeScript support[​](#full-typescript-support "Direct link to Full TypeScript support") Both Crawlee and Apify SDK are full TypeScript rewrite, so they include up-to-date types in the package. 
For your TypeScript crawlers we recommend using our predefined TypeScript configuration from the `@apify/tsconfig` package. Don't forget to set the `module` and `target` to `ES2022` or above to be able to use top level await. > The `@apify/tsconfig` config has [`noImplicitAny`](https://www.typescriptlang.org/tsconfig#noImplicitAny) enabled, you might want to disable it during the initial development as it will cause build failures if you leave some unused local variables in your code. tsconfig.json ``` { "extends": "@apify/tsconfig", "compilerOptions": { "module": "ES2022", "target": "ES2022", "outDir": "dist", "lib": ["DOM"] }, "include": [ "./src/**/*" ] } ``` #### Docker build[​](#docker-build "Direct link to Docker build") For the `Dockerfile` we recommend using a multi-stage build, so you don't install the dev dependencies like TypeScript in your final image: Dockerfile ``` # using multistage build, as we need dev deps to build the TS source code FROM apify/actor-node:16 AS builder # copy all files, install all dependencies (including dev deps) and build the project COPY . ./ RUN npm install --include=dev \ && npm run build # create final image FROM apify/actor-node:16 # copy only necessary files COPY --from=builder /usr/src/app/package*.json ./ COPY --from=builder /usr/src/app/README.md ./ COPY --from=builder /usr/src/app/dist ./dist COPY --from=builder /usr/src/app/apify.json ./apify.json COPY --from=builder /usr/src/app/INPUT_SCHEMA.json ./INPUT_SCHEMA.json # install only prod deps RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional \ && echo "Installed NPM packages:" \ && (npm list --only=prod --no-optional --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # run compiled code CMD npm run start:prod ``` ### Browser fingerprints[​](#browser-fingerprints "Direct link to Browser fingerprints") Previously we had a magical `stealth` option in the puppeteer crawler that enabled several tricks aiming to mimic real users as much as possible. While this worked to a certain degree, we decided to replace it with generated browser fingerprints. In case we don't want to have dynamic fingerprints, we can disable this behaviour via `useFingerprints` in `browserPoolOptions`: ``` const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: false, }, }); ``` ### Session cookie method renames[​](#session-cookie-method-renames "Direct link to Session cookie method renames") Previously, if we wanted to get or add cookies for the session that would be used for the request, we had to call `session.getPuppeteerCookies()` or `session.setPuppeteerCookies()`. Since these methods could be used for any of our crawlers, not just `PuppeteerCrawler`, they have been renamed to `session.getCookies()` and `session.setCookies()` respectively. Otherwise, their usage is exactly the same! ### Memory storage[​](#memory-storage "Direct link to Memory storage") When we store some data or intermediate state (like the one `RequestQueue` holds), we now use `@crawlee/memory-storage` by default. It is an alternative to `@apify/storage-local` that stores the state in memory (as opposed to the SQLite database used by `@apify/storage-local`). While the state is stored in memory, it is also dumped to the file system, so we can observe it, and the existing data stored in the KeyValueStore (e.g. the `INPUT.json` file) is respected as well. 
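If we don't want the memory storage to write anything to disk at all (for example in tests), we can turn the persistence off. A minimal sketch, assuming the `persistStorage` option described in the `Configuration` reference further below (the equivalent env var is `CRAWLEE_PERSIST_STORAGE`):

```
import { CheerioCrawler, Configuration } from 'crawlee';

// keep all storages purely in memory - nothing gets dumped to the ./storage folder
const config = new Configuration({ persistStorage: false });

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
}, config);
```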
When we want to run the crawler on the Apify platform, we need to use `Actor.init` or `Actor.main`, which will automatically switch the storage client to `ApifyClient` when on the Apify platform. We can still use `@apify/storage-local`; to do so, install it first and pass it to the `Actor.init` or `Actor.main` options: > `@apify/storage-local` v2.1.0+ is required for Crawlee ``` import { Actor } from 'apify'; import { ApifyStorageLocal } from '@apify/storage-local'; const storage = new ApifyStorageLocal(/* options like `enableWalMode` belong here */); await Actor.init({ storage }); ``` ### Purging of the default storage[​](#purging-of-the-default-storage "Direct link to Purging of the default storage") Previously the state was preserved between local runs, and we had to use the `--purge` argument of the `apify-cli`. With Crawlee, this is now the default behaviour: we purge the storage automatically on every `Actor.init/main` call. We can opt out of it via `purge: false` in the `Actor.init` options. ### Renamed crawler options and interfaces[​](#renamed-crawler-options-and-interfaces "Direct link to Renamed crawler options and interfaces") Some options were renamed to better reflect what they do. We still support all the old parameter names too, but not at the TS level. * `handleRequestFunction` -> `requestHandler` * `handlePageFunction` -> `requestHandler` * `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs` * `handlePageTimeoutSecs` -> `requestHandlerTimeoutSecs` * `requestTimeoutSecs` -> `navigationTimeoutSecs` * `handleFailedRequestFunction` -> `failedRequestHandler` We also renamed the crawling context interfaces, so they follow the same convention and are more meaningful: * `CheerioHandlePageInputs` -> `CheerioCrawlingContext` * `PlaywrightHandlePageFunction` -> `PlaywrightCrawlingContext` * `PuppeteerHandlePageFunction` -> `PuppeteerCrawlingContext` ### Context aware helpers[​](#context-aware-helpers "Direct link to Context aware helpers") Some utilities previously available under the `Apify.utils` namespace are now moved to the crawling context and are *context aware*. This means they have some parameters automatically filled in from the context, like the current `Request` instance or current `Page` object, or the `RequestQueue` bound to the crawler. #### Enqueuing links[​](#enqueuing-links "Direct link to Enqueuing links") One common helper that received more attention is `enqueueLinks`. As mentioned above, it is context aware - we no longer need to pass in the `requestQueue` or `page` arguments (or the cheerio handle `$`). In addition to that, it now offers 3 enqueuing strategies: * `EnqueueStrategy.All` (`'all'`): Matches any URLs found * `EnqueueStrategy.SameHostname` (`'same-hostname'`): Matches any URLs that have the same subdomain as the base URL (default) * `EnqueueStrategy.SameDomain` (`'same-domain'`): Matches any URLs that have the same domain name. For example, `https://wow.an.example.com` and `https://example.com` will both be matched for a base URL of `https://example.com`. This means we can even call `enqueueLinks()` without any parameters. By default, it will go through all the links found on the current page and filter only those targeting the same subdomain. 
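If the default strategy is not what we need, we can pick one explicitly, either via the `EnqueueStrategy` enum or its string form. A minimal sketch:

```
import { EnqueueStrategy, PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        // equivalent to passing the string 'same-domain'
        await enqueueLinks({ strategy: EnqueueStrategy.SameDomain });
    },
});
```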
Moreover, we can specify patterns the URL should match via globs: ``` const crawler = new PlaywrightCrawler({ async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: ['https://apify.com/*/*'], // we can also use `regexps` and `pseudoUrls` keys here }); }, }); ``` ### Implicit `RequestQueue` instance[​](#implicit-requestqueue-instance "Direct link to implicit-requestqueue-instance") All crawlers now have the `RequestQueue` instance automatically available via the `crawler.getRequestQueue()` method. It will create the instance for you if it does not exist yet. This means we no longer need to create the `RequestQueue` instance manually, and we can just use the `crawler.addRequests()` method described below. > We can still create the `RequestQueue` explicitly; the `crawler.getRequestQueue()` method will respect that and return the instance provided via crawler options. ### `crawler.addRequests()`[​](#crawleraddrequests "Direct link to crawleraddrequests") We can now add multiple requests in batches. The newly added `addRequests` method will handle everything for us. It enqueues the first 1000 requests and resolves, while continuing with the rest in the background, again in smaller batches of 1000 items, so we don't fall into any API rate limits. This means the crawling will start almost immediately (within a few seconds at most), something previously possible only with a combination of `RequestQueue` and `RequestList`. ``` // will resolve right after the initial batch of 1000 requests is added const result = await crawler.addRequests([/* many requests, can be even millions */]); // if we want to wait for all the requests to be added, we can await the `waitForAllRequestsToBeAdded` promise await result.waitForAllRequestsToBeAdded; ``` ### Less verbose error logging[​](#less-verbose-error-logging "Direct link to Less verbose error logging") Previously, an error thrown from inside the request handler resulted in the full error object being logged. With Crawlee, we log only the error message as a warning as long as we know the request will be retried. If you want to enable verbose logging like in v2, use the `CRAWLEE_VERBOSE_LOG` env var. ### Removal of `requestAsBrowser`[​](#removal-of-requestasbrowser "Direct link to removal-of-requestasbrowser") In v1 we replaced the underlying implementation of `requestAsBrowser` to be just a proxy over calling [`got-scraping`](https://github.com/apify/got-scraping) - our custom extension to `got` that tries to mimic the real browsers as much as possible. With v3, we are removing `requestAsBrowser`, encouraging the use of [`got-scraping`](https://github.com/apify/got-scraping) directly. For easier migration, we also added the `context.sendRequest()` helper that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping): ``` const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { // we can use the options parameter to override gotScraping options const res = await sendRequest({ responseType: 'json' }); log.info('received body', res.body); }, }); ``` #### How to use `sendRequest()`?[​](#how-to-use-sendrequest "Direct link to how-to-use-sendrequest") See [the Got Scraping guide](https://crawlee.dev/js/docs/guides/got-scraping.md). #### Removed options[​](#removed-options "Direct link to Removed options") The `useInsecureHttpParser` option has been removed. It's permanently set to `true` in order to better mimic browsers' behavior. 
Got Scraping automatically performs protocol negotiation, hence we removed the `useHttp2` option. It's set to `true`, as 100% of browsers nowadays are capable of HTTP/2 requests, and more and more of the web is using it too. #### Renamed options[​](#renamed-options "Direct link to Renamed options") In the `requestAsBrowser` approach, some of the options were named differently. Here's a list of renamed options: ##### `payload`[​](#payload "Direct link to payload") This option represents the body to send. It could be a `string` or a `Buffer`. However, there is no `payload` option anymore. You need to use `body` instead. Or, if you wish to send JSON, `json`. Here's an example: ``` // Before: await Apify.utils.requestAsBrowser({ …, payload: 'Hello, world!' }); await Apify.utils.requestAsBrowser({ …, payload: Buffer.from('c0ffe', 'hex') }); await Apify.utils.requestAsBrowser({ …, json: { hello: 'world' } }); // After: await gotScraping({ …, body: 'Hello, world!' }); await gotScraping({ …, body: Buffer.from('c0ffe', 'hex') }); await gotScraping({ …, json: { hello: 'world' } }); ``` ##### `ignoreSslErrors`[​](#ignoresslerrors "Direct link to ignoresslerrors") It has been renamed to `https.rejectUnauthorized`. By default, it's set to `false` for convenience. However, if you want to make sure the connection is secure, you can do the following: ``` // Before: await Apify.utils.requestAsBrowser({ …, ignoreSslErrors: false }); // After: await gotScraping({ …, https: { rejectUnauthorized: true } }); ``` Please note: the meanings are opposite, so we needed to invert the values as well. ##### `header-generator` options[​](#header-generator-options "Direct link to header-generator-options") `useMobileVersion`, `languageCode` and `countryCode` no longer exist. Instead, you need to use `headerGeneratorOptions` directly: ``` // Before: await Apify.utils.requestAsBrowser({ …, useMobileVersion: true, languageCode: 'en', countryCode: 'US', }); // After: await gotScraping({ …, headerGeneratorOptions: { devices: ['mobile'], // or ['desktop'] locales: ['en-US'], }, }); ``` ##### `timeoutSecs`[​](#timeoutsecs "Direct link to timeoutsecs") In order to set a timeout, use `timeout.request` (which is in **milliseconds** now). ``` // Before: await Apify.utils.requestAsBrowser({ …, timeoutSecs: 30, }); // After: await gotScraping({ …, timeout: { request: 30 * 1000, }, }); ``` ##### `throwOnHttpErrors`[​](#throwonhttperrors "Direct link to throwonhttperrors") `throwOnHttpErrors` → `throwHttpErrors`. This option throws on unsuccessful HTTP status codes, for example `404`. By default, it's set to `false`. ##### `decodeBody`[​](#decodebody "Direct link to decodebody") `decodeBody` → `decompress`. This option decompresses the body. Defaults to `true` - please do not change this or websites will break (unless you know what you're doing!). ##### `abortFunction`[​](#abortfunction "Direct link to abortfunction") This function used to make the promise throw on specific responses if it returned `true`. However, it wasn't that useful. You probably want to cancel the request instead, which you can do in the following way: ``` const promise = gotScraping(…); promise.on('request', request => { // Please note this is not a Got Request instance, but a ClientRequest one. // https://nodejs.org/api/http.html#class-httpclientrequest if (request.protocol !== 'https:') { // Insecure request, abort. promise.cancel(); // If you set `isStream` to `true`, please use `stream.destroy()` instead. 
} }); const response = await promise; ``` ### Removal of browser pool plugin mixing[​](#removal-of-browser-pool-plugin-mixing "Direct link to Removal of browser pool plugin mixing") Previously, you were able to have a browser pool that would mix Puppeteer and Playwright plugins (or even your own custom plugins if you've built any). As of this version, that is no longer allowed, and creating such a browser pool will cause an error to be thrown (it's expected that all plugins that will be used are of the same type). ### Handling requests outside of browser[​](#handling-requests-outside-of-browser "Direct link to Handling requests outside of browser") One small feature worth mentioning is the ability to handle requests with browser crawlers outside the browser. To do that, we can use a combination of `Request.skipNavigation` and `context.sendRequest()`. Take a look at how to achieve this by checking out the [Skipping navigation for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example! ### Logging[​](#logging "Direct link to Logging") Crawlee exports the default `log` instance directly as a named export. We also have a scoped `log` instance provided in the crawling context - this one will log messages prefixed with the crawler name and should be preferred for logging inside the request handler. ``` const crawler = new CheerioCrawler({ async requestHandler({ log, request }) { log.info(`Opened ${request.loadedUrl}`); }, }); ``` ### Auto-saved crawler state[​](#auto-saved-crawler-state "Direct link to Auto-saved crawler state") Every crawler instance now has `useState()` method that will return a state object we can use. It will be automatically saved when `persistState` event occurs. The value is cached, so we can freely call this method multiple times and get the exact same reference. No need to worry about saving the value either, as it will happen automatically. ``` const crawler = new CheerioCrawler({ async requestHandler({ crawler }) { const state = await crawler.useState({ foo: [] as number[] }); // just change the value, no need to care about saving it state.foo.push(123); }, }); ``` ### Apify SDK[​](#apify-sdk "Direct link to Apify SDK") The Apify platform helpers can be now found in the Apify SDK (`apify` NPM package). It exports the `Actor` class that offers following static helpers: * `ApifyClient` shortcuts: `addWebhook()`, `call()`, `callTask()`, `metamorph()` * helpers for running on Apify platform: `init()`, `exit()`, `fail()`, `main()`, `isAtHome()`, `createProxyConfiguration()` * storage support: `getInput()`, `getValue()`, `openDataset()`, `openKeyValueStore()`, `openRequestQueue()`, `pushData()`, `setValue()` * events support: `on()`, `off()` * other utilities: `getEnv()`, `newClient()`, `reboot()` `Actor.main` is now just a syntax sugar around calling `Actor.init()` at the beginning and `Actor.exit()` at the end (plus wrapping the user function in try/catch block). All those methods are async and should be awaited - with node 16 we can use the top level await for that. In other words, following is equivalent: ``` import { Actor } from 'apify'; await Actor.init(); // your code await Actor.exit('Crawling finished!'); ``` ``` import { Actor } from 'apify'; await Actor.main(async () => { // your code }, { statusMessage: 'Crawling finished!' }); ``` `Actor.init()` will conditionally set the storage implementation of Crawlee to the `ApifyClient` when running on the Apify platform, or keep the default (memory storage) implementation otherwise. 
It will also subscribe to the websocket events (or mimic them locally). `Actor.exit()` will handle the tear down and calls `process.exit()` to ensure our process won't hang indefinitely for some reason. #### Events[​](#events "Direct link to Events") Apify SDK (v2) exports `Apify.events`, which is an `EventEmitter` instance. With Crawlee, the events are managed by [`EventManager`](https://crawlee.dev/js/api/core/class/EventManager.md) class instead. We can either access it via `Actor.eventManager` getter, or use `Actor.on` and `Actor.off` shortcuts instead. ``` -Apify.events.on(...); +Actor.on(...); ``` > We can also get the [`EventManager`](https://crawlee.dev/js/api/core/class/EventManager.md) instance via `Configuration.getEventManager()`. In addition to the existing events, we now have an `exit` event fired when calling `Actor.exit()` (which is called at the end of `Actor.main()`). This event allows you to gracefully shut down any resources when `Actor.exit` is called. ### Smaller/internal breaking changes[​](#smallerinternal-breaking-changes "Direct link to Smaller/internal breaking changes") * `Apify.call()` is now just a shortcut for running `ApifyClient.actor(actorId).call(input, options)`, while also taking the token inside env vars into account * `Apify.callTask()` is now just a shortcut for running `ApifyClient.task(taskId).call(input, options)`, while also taking the token inside env vars into account * `Apify.metamorph()` is now just a shortcut for running `ApifyClient.task(taskId).metamorph(input, options)`, while also taking the ACTOR\_RUN\_ID inside env vars into account * `Apify.waitForRunToFinish()` has been removed, use `ApifyClient.waitForFinish()` instead * `Actor.main/init` purges the storage by default * remove `purgeLocalStorage` helper, move purging to the storage class directly * `StorageClient` interface now has optional `purge` method * purging happens automatically via `Actor.init()` (you can opt out via `purge: false` in the options of `init/main` methods) * `QueueOperationInfo.request` is no longer available * `Request.handledAt` is now string date in ISO format * `Request.inProgress` and `Request.reclaimed` are now `Set`s instead of POJOs * `injectUnderscore` from puppeteer utils has been removed * `APIFY_MEMORY_MBYTES` is no longer taken into account, use `CRAWLEE_AVAILABLE_MEMORY_RATIO` instead * some `AutoscaledPool` options are no longer available: * `cpuSnapshotIntervalSecs` and `memorySnapshotIntervalSecs` has been replaced with top level `systemInfoIntervalMillis` configuration * `maxUsedCpuRatio` has been moved to the top level configuration * `ProxyConfiguration.newUrlFunction` can be async. `.newUrl()` and `.newProxyInfo()` now return promises. * `prepareRequestFunction` and `postResponseFunction` options are removed, use navigation hooks instead * `gotoFunction` and `gotoTimeoutSecs` are removed * removed compatibility fix for old/broken request queues with null `Request` props * `fingerprintsOptions` renamed to `fingerprintOptions` (`fingerprints` -> `fingerprint`). * `fingerprintOptions` now accept `useFingerprintCache` and `fingerprintCacheSize` (instead of `useFingerprintPerProxyCache` and `fingerprintPerProxyCacheSize`, which are now no longer available). This is because the cached fingerprints are no longer connected to proxy URLs but to sessions. 
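Putting the fingerprint-related renames together, here is a hedged sketch of what the new shape may look like (the cache values are only illustrative, and the exact nesting follows the `browserPoolOptions` shown earlier):

```
const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // the default
        fingerprintOptions: {
            // the cache is now keyed by session rather than by proxy URL
            useFingerprintCache: true,
            fingerprintCacheSize: 1_000,
        },
    },
});
```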
## [2.3.2](https://github.com/apify/crawlee/compare/v2.3.1...v2.3.2) (2022-05-05)[​](#232-2022-05-05 "Direct link to 232-2022-05-05") * fix: use default user agent for playwright with chrome instead of the default "headless UA" * fix: always hide webdriver of chrome browsers ## [2.3.1](https://github.com/apify/crawlee/compare/v2.3.0...v2.3.1) (2022-05-03)[​](#231-2022-05-03 "Direct link to 231-2022-05-03") * fix: `utils.apifyClient` early instantiation (#1330) * feat: `utils.playwright.injectJQuery()` (#1337) * feat: add `keyValueStore` option to `Statistics` class (#1345) * fix: ensure failed req count is correct when using `RequestList` (#1347) * fix: random puppeteer crawler (running in headful mode) failure (#1348) > This should help with the `We either navigate top level or have old version of the navigated frame` bug in puppeteer. * fix: allow returning falsy values in `RequestTransform`'s return type ## [2.3.0](https://github.com/apify/crawlee/compare/v2.2.2...v2.3.0) (2022-04-07)[​](#230-2022-04-07 "Direct link to 230-2022-04-07") * feat: accept more social media patterns (#1286) * feat: add multiple click support to `enqueueLinksByClickingElements` (#1295) * feat: instance-scoped "global" configuration (#1315) * feat: requestList accepts proxyConfiguration for requestsFromUrls (#1317) * feat: update `playwright` to v1.20.2 * feat: update `puppeteer` to v13.5.2 > We noticed that with this version of puppeteer actor run could crash with `We either navigate top level or have old version of the navigated frame` error (puppeteer issue [here](https://github.com/puppeteer/puppeteer/issues/7050)). It should not happen while running the browser in headless mode. In case you need to run the browser in headful mode (`headless: false`), we recommend pinning puppeteer version to `10.4.0` in actor `package.json` file. * feat: stealth deprecation (#1314) * feat: allow passing a stream to KeyValueStore.setRecord (#1325) * fix: use correct apify-client instance for snapshotting (#1308) * fix: automatically reset `RequestQueue` state after 5 minutes of inactivity, closes #997 * fix: improve guessing of chrome executable path on windows (#1294) * fix: prune CPU snapshots locally (#1313) * fix: improve browser launcher types (#1318) ### 0 concurrency mitigation[​](#0-concurrency-mitigation "Direct link to 0 concurrency mitigation") This release should resolve the 0 concurrency bug by automatically resetting the internal `RequestQueue` state after 5 minutes of inactivity. We now track last activity done on a `RequestQueue` instance: * added new request * started processing a request (added to `inProgress` cache) * marked request as handled * reclaimed request If we don't detect one of those actions in last 5 minutes, and we have some requests in the `inProgress` cache, we try to reset the state. We can override this limit via `CRAWLEE_INTERNAL_TIMEOUT` env var. This should finally resolve the 0 concurrency bug, as it was always about stuck requests in the `inProgress` cache. 
## [2.2.2](https://github.com/apify/crawlee/compare/v2.2.1...v2.2.2) (2022-02-14)[​](#222-2022-02-14 "Direct link to 222-2022-02-14") * fix: ensure `request.headers` is set * fix: lower `RequestQueue` API timeout to 30 seconds * improve logging for fetching next request and timeouts ## [2.2.1](https://github.com/apify/crawlee/compare/v2.2.0...v2.2.1) (2022-01-03)[​](#221-2022-01-03 "Direct link to 221-2022-01-03") * fix: ignore requests that are no longer in progress (#1258) * fix: do not use `tryCancel()` from inside sync callback (#1265) * fix: revert to puppeteer 10.x (#1276) * fix: wait when `body` is not available in `infiniteScroll()` from Puppeteer utils (#1238) * fix: expose logger classes on the `utils.log` instance (#1278) ## [2.2.0](https://github.com/apify/crawlee/compare/v2.1.0...v2.2.0) (2021-12-17)[​](#220-2021-12-17 "Direct link to 220-2021-12-17") ### Proxy per page[​](#proxy-per-page "Direct link to Proxy per page") Up until now, browser crawlers used the same session (and therefore the same proxy) for all requests from a single browser; now we get a new proxy for each session. This means that with incognito pages, each page will get a new proxy, aligning the behaviour with `CheerioCrawler`. This feature is not enabled by default. To use it, we need to enable the `useIncognitoPages` flag under `launchContext`: ``` new Apify.PlaywrightCrawler({ launchContext: { useIncognitoPages: true, }, // ... }) ``` > Note that currently there is a performance overhead for using `useIncognitoPages`. Use this flag at your own discretion. We are planning to enable this feature by default in SDK v3.0. ### Abortable timeouts[​](#abortable-timeouts "Direct link to Abortable timeouts") Previously, when a page function timed out, the task still kept running. This could lead to requests being processed multiple times. In v2.2 we now have abortable timeouts that will cancel the task as early as possible. ### Mitigation of zero concurrency issue[​](#mitigation-of-zero-concurrency-issue "Direct link to Mitigation of zero concurrency issue") Several new timeouts were added to the task function, which should help mitigate the zero concurrency bug. Namely, fetching of the next request information and reclaiming failed requests back to the queue are now executed with a timeout, with 3 additional retries before the task fails. The timeout is always at least 300s (5 minutes), or `requestHandlerTimeoutSecs` if that value is higher. 
### Full list of changes[​](#full-list-of-changes "Direct link to Full list of changes") * fix `RequestError: URI malformed` in cheerio crawler (#1205) * only provide Cookie header if cookies are present (#1218) * handle extra cases for `diffCookie` (#1217) * add timeout for task function (#1234) * implement proxy per page in browser crawlers (#1228) * add fingerprinting support (#1243) * implement abortable timeouts (#1245) * add timeouts with retries to `runTaskFunction()` (#1250) * automatically convert google spreadsheet URLs to CSV exports (#1255) ## [2.1.0](https://github.com/apify/crawlee/compare/v2.0.7...v2.1.0) (2021-10-07)[​](#210-2021-10-07 "Direct link to 210-2021-10-07") * automatically convert google docs share urls to csv download ones in request list (#1174) * use puppeteer emulating scrolls instead of `window.scrollBy` (#1170) * warn if apify proxy is used in proxyUrls (#1173) * fix `YOUTUBE_REGEX_STRING` being too greedy (#1171) * add `purgeLocalStorage` utility method (#1187) * catch errors inside request interceptors (#1188, #1190) * add support for cgroups v2 (#1177) * fix incorrect offset in `fixUrl` function (#1184) * support channel and user links in YouTube regex (#1178) * fix: allow passing `requestsFromUrl` to `RequestListOptions` in TS (#1191) * allow passing `forceCloud` down to the KV store (#1186), closes #752 * merge cookies from session with user provided ones (#1201), closes #1197 * use `ApifyClient` v2 (full rewrite to TS) ## [2.0.7](https://github.com/apify/crawlee/compare/v2.0.6...v2.0.7) (2021-09-08)[​](#207-2021-09-08 "Direct link to 207-2021-09-08") * Fix casting of int/bool environment variables (e.g. `APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE`), closes #956 * Fix incognito pages and user data dir (#1145) * Add `@ts-ignore` comments to imports of optional peer dependencies (#1152) * Use config instance in `sdk.openSessionPool()` (#1154) * Add a breaking callback to `infiniteScroll` (#1140) ## [2.0.6](https://github.com/apify/crawlee/compare/v2.0.5...v2.0.6) (2021-08-27)[​](#206-2021-08-27 "Direct link to 206-2021-08-27") * Fix deprecation messages logged from `ProxyConfiguration` and `CheerioCrawler`. * Update `got-scraping` to receive multiple improvements. ## [2.0.5](https://github.com/apify/crawlee/compare/v2.0.4...v2.0.5) (2021-08-24)[​](#205-2021-08-24 "Direct link to 205-2021-08-24") * Fix error handling in puppeteer crawler ## [2.0.4](https://github.com/apify/crawlee/compare/v2.0.3...v2.0.4) (2021-08-23)[​](#204-2021-08-23 "Direct link to 204-2021-08-23") * Use `sessionToken` with `got-scraping` ## [2.0.3](https://github.com/apify/crawlee/compare/v2.0.2...v2.0.3) (2021-08-20)[​](#203-2021-08-20 "Direct link to 203-2021-08-20") * **BREAKING IN EDGE CASES** - We removed `forceUrlEncoding` in `requestAsBrowser` because we found out that recent versions of the underlying HTTP client `got` already encode URLs and `forceUrlEncoding` could lead to weird behavior. We think of this as fixing a bug, so we're not bumping the major version. * Limit `handleRequestTimeoutMillis` to max valid value to prevent Node.js fallback to `1`. * Use `got-scraping@^3.0.1` * Disable SSL validation on MITM proxies * Limit `handleRequestTimeoutMillis` to max valid value ## [2.0.2](https://github.com/apify/crawlee/compare/v2.0.1...v2.0.2) (2021-08-12)[​](#202-2021-08-12 "Direct link to 202-2021-08-12") * Fix serialization issues in `CheerioCrawler` caused by parser conflicts in recent versions of `cheerio`. 
## [2.0.1](https://github.com/apify/crawlee/compare/v2.0.0...v2.0.1) (2021-08-06)[​](#201-2021-08-06 "Direct link to 201-2021-08-06") * Use `got-scraping` 2.0.1 until fully compatible. ## [2.0.0](https://github.com/apify/crawlee/compare/v1.3.4...v2.0.0) (2021-08-05)[​](#200-2021-08-05 "Direct link to 200-2021-08-05") * **BREAKING**: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy. * **BREAKING**: Bump `cheerio` to `1.0.0-rc.10` from `rc.3`. There were breaking changes in `cheerio` between the versions so this bump might be breaking for you as well. * Remove `LiveViewServer` which was deprecated before release of SDK v1. --- # AutoscaledPool Manages a pool of asynchronous resource-intensive tasks that are executed in parallel. The pool only starts new tasks if there is enough free CPU and memory available and the Javascript event loop is not blocked. The information about the CPU and memory usage is obtained by the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) class, which makes regular snapshots of system resources that may be either local or from the Apify cloud infrastructure in case the process is running on the Apify platform. Meaningful data gathered from these snapshots is provided to `AutoscaledPool` by the [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) class. Before running the pool, you need to implement the following three functions: [AutoscaledPoolOptions.runTaskFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction), [AutoscaledPoolOptions.isTaskReadyFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction) and [AutoscaledPoolOptions.isFinishedFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction). The auto-scaled pool is started by calling the [AutoscaledPool.run](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#run) function. The pool periodically queries the [AutoscaledPoolOptions.isTaskReadyFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction) function for more tasks, managing optimal concurrency, until the function resolves to `false`. The pool then queries the [AutoscaledPoolOptions.isFinishedFunction](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction). If it resolves to `true`, the run finishes after all running tasks complete. If it resolves to `false`, it assumes there will be more tasks available later and keeps periodically querying for tasks. If any of the tasks throws then the [AutoscaledPool.run](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#run) function rejects the promise with an error. The pool evaluates whether it should start a new task every time one of the tasks finishes and also in the interval set by the `options.maybeRunIntervalSecs` parameter. **Example usage:** ``` const pool = new AutoscaledPool({ maxConcurrency: 50, runTaskFunction: async () => { // Run some resource-intensive asynchronous operation here. }, isTaskReadyFunction: async () => { // Tell the pool whether more tasks are ready to be processed. // Return true or false }, isFinishedFunction: async () => { // Tell the pool whether it should finish // or wait for more tasks to become available. 
// Return true or false } }); await pool.run(); ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Accessors * [**currentConcurrency](#currentConcurrency) * [**desiredConcurrency](#desiredConcurrency) * [**maxConcurrency](#maxConcurrency) * [**minConcurrency](#minConcurrency) ### Methods * [**abort](#abort) * [**notify](#notify) * [**pause](#pause) * [**resume](#resume) * [**run](#run) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L214)constructor * ****new AutoscaledPool**(options, config): [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) - #### Parameters * ##### options: [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ## Accessors[**](#Accessors) ### [**](#currentConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L354)currentConcurrency * **get currentConcurrency(): number - Gets the number of parallel tasks currently running in the pool. *** #### Returns number ### [**](#desiredConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L338)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L346)desiredConcurrency * **get desiredConcurrency(): number * **set desiredConcurrency(value): void - Gets the desired concurrency for the pool, which is an estimated number of parallel tasks that the system can currently support. *** #### Returns number - Sets the desired concurrency for the pool, i.e. the number of tasks that should be running in parallel if there's large enough supply of tasks. *** #### Parameters * ##### value: number #### Returns void ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L322)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L329)maxConcurrency * **get maxConcurrency(): number * **set maxConcurrency(value): void - Gets the maximum number of tasks running in parallel. *** #### Returns number - Sets the maximum number of tasks running in parallel. *** #### Parameters * ##### value: number #### Returns void ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L304)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L314)minConcurrency * **get minConcurrency(): number * **set minConcurrency(value): void - Gets the minimum number of tasks running in parallel. *** #### Returns number - Sets the minimum number of tasks running in parallel. *WARNING:* If you set this value too high with respect to the available system memory and CPU, your code might run extremely slow or crash. If you're not sure, just keep the default value and the concurrency will scale up automatically. *** #### Parameters * ##### value: number #### Returns void ## Methods[**](#Methods) ### [**](#abort)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L402)abort * ****abort**(): Promise\ - Aborts the run of the auto-scaled pool and destroys it. 
The promise returned from the [AutoscaledPool.run](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#run) function will immediately resolve, no more new tasks will be spawned and all running tasks will be left in their current state. Due to the nature of the tasks, auto-scaled pool cannot reliably guarantee abortion of all the running tasks, therefore, no abortion is attempted and some of the tasks may finish, while others may not. Essentially, auto-scaled pool doesn't care about their state after the invocation of `.abort()`, but that does not mean that some parts of their asynchronous chains of commands will not execute. *** #### Returns Promise\ ### [**](#notify)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L461)notify * ****notify**(): Promise\ - Explicitly check the queue for new tasks. The AutoscaledPool checks the queue for new tasks periodically, every `maybeRunIntervalSecs` seconds. If you want to trigger the processing immediately, use this method. *** #### Returns Promise\ ### [**](#pause)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L421)pause * ****pause**(timeoutSecs): Promise\ - Prevents the auto-scaled pool from starting new tasks, but allows the running ones to finish (unlike abort, which terminates them). Used together with [AutoscaledPool.resume](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#resume) The function's promise will resolve once all running tasks have completed and the pool is effectively idle. If the `timeoutSecs` argument is provided, the promise will reject with a timeout error after the `timeoutSecs` seconds. The promise returned from the [AutoscaledPool.run](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#run) function will not resolve when `.pause()` is invoked (unlike abort, which resolves it). *** #### Parameters * ##### optionaltimeoutSecs: number #### Returns Promise\ ### [**](#resume)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L453)resume * ****resume**(): void - Resumes the operation of the autoscaled-pool by allowing more tasks to be run. Used together with [AutoscaledPool.pause](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) Tasks will automatically start running again in `options.maybeRunIntervalSecs`. *** #### Returns void ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L362)run * ****run**(): Promise\ - Runs the auto-scaled pool. Returns a promise that gets resolved or rejected once all the tasks are finished or one of them fails. *** #### Returns Promise\ --- # Configuration `Configuration` is a value object holding Crawlee configuration. By default, there is a global singleton instance of this class available via `Configuration.getGlobalConfig()`. Places that depend on a configurable behaviour depend on this class, as they have the global instance as the default value. 
*Using global configuration:*

```
import { BasicCrawler, Configuration } from 'crawlee';

// Get the global configuration
const config = Configuration.getGlobalConfig();
// Set the 'persistStateIntervalMillis' option
// of global configuration to 10 seconds
config.set('persistStateIntervalMillis', 10_000);

// No need to pass the configuration to the crawler,
// as it's using the global configuration by default
const crawler = new BasicCrawler();
```

*Using custom configuration:*

```
import { BasicCrawler, Configuration } from 'crawlee';

// Create a new configuration
const config = new Configuration({ persistStateIntervalMillis: 30_000 });
// Pass the configuration to the crawler
const crawler = new BasicCrawler({ ... }, config);
```

The configuration provided via environment variables always takes precedence. We can also define a `crawlee.json` file in the project root directory, which serves as a baseline, so the options provided in the constructor will override those. In other words, the precedence is:

```
crawlee.json < constructor options < environment variables
```

## Supported Configuration Options

| Key | Environment Variable | Default Value |
| :--------------------------- | :-------------------------------------- | :------------ |
| `memoryMbytes` | `CRAWLEE_MEMORY_MBYTES` | - |
| `logLevel` | `CRAWLEE_LOG_LEVEL` | - |
| `headless` | `CRAWLEE_HEADLESS` | `true` |
| `defaultDatasetId` | `CRAWLEE_DEFAULT_DATASET_ID` | `'default'` |
| `defaultKeyValueStoreId` | `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` | `'default'` |
| `defaultRequestQueueId` | `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` | `'default'` |
| `persistStateIntervalMillis` | `CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS` | `60_000` |
| `purgeOnStart` | `CRAWLEE_PURGE_ON_START` | `true` |
| `persistStorage` | `CRAWLEE_PERSIST_STORAGE` | `true` |

## Advanced Configuration Options

| Key | Environment Variable | Default Value |
| :---------------------- | :-------------------------------- | :------------ |
| `inputKey` | `CRAWLEE_INPUT_KEY` | `'INPUT'` |
| `xvfb` | `CRAWLEE_XVFB` | - |
| `chromeExecutablePath` | `CRAWLEE_CHROME_EXECUTABLE_PATH` | - |
| `defaultBrowserPath` | `CRAWLEE_DEFAULT_BROWSER_PATH` | - |
| `disableBrowserSandbox` | `CRAWLEE_DISABLE_BROWSER_SANDBOX` | - |
| `availableMemoryRatio` | `CRAWLEE_AVAILABLE_MEMORY_RATIO` | `0.25` |
| `systemInfoV2` | `CRAWLEE_SYSTEM_INFO_V2` | `false` |
| `containerized` | `CRAWLEE_CONTAINERIZED` | - |

## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**storageManagers](#storageManagers) ### Methods * [**get](#get) * [**getEventManager](#getEventManager) * [**set](#set) * [**useEventManager](#useEventManager) * [**useStorageClient](#useStorageClient) * [**getEventManager](#getEventManager) * [**getGlobalConfig](#getGlobalConfig) * [**getStorageClient](#getStorageClient) * [**resetGlobalState](#resetGlobalState) * [**set](#set) * [**useStorageClient](#useStorageClient) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L318)constructor * ****new Configuration**(options): [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) - Creates a new `Configuration` instance with the provided options. Env vars will have precedence over those.
*** #### Parameters * ##### options: [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) = {} #### Returns [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ## Properties[**](#Properties) ### [**](#storageManagers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L313)publicreadonlystorageManagers **storageManagers: Map\> = ... ## Methods[**](#Methods) ### [**](#get)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L340)get * ****get**\(key, defaultValue): U - Returns configured value. First checks the environment variables, then provided configuration, fallbacks to the `defaultValue` argument if provided, otherwise uses the default value as described in the above section. *** #### Parameters * ##### key: T * ##### optionaldefaultValue: U #### Returns U ### [**](#getEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L421)getEventManager * ****getEventManager**(): [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) - #### Returns [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#set)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L391)set * ****set**(key, value): void - Sets value for given option. Only affects this `Configuration` instance, the value will not be propagated down to the env var. To reset a value, we can omit the `value` argument or pass `undefined` there. *** #### Parameters * ##### key: keyof [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) * ##### optionalvalue: any #### Returns void ### [**](#useEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L465)useEventManager * ****useEventManager**(events): void - #### Parameters * ##### events: [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) #### Returns void ### [**](#useStorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L457)useStorageClient * ****useStorageClient**(client): void - #### Parameters * ##### client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) #### Returns void ### [**](#getEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L491)staticgetEventManager * ****getEventManager**(): [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) - Gets default [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) instance. *** #### Returns [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#getGlobalConfig)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L472)staticgetGlobalConfig * ****getGlobalConfig**(): [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) - Returns the global configuration instance. It will respect the environment variables. *** #### Returns [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#getStorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L484)staticgetStorageClient * ****getStorageClient**(): [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) - Gets default [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) instance. 
*** #### Returns [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#resetGlobalState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L499)staticresetGlobalState * ****resetGlobalState**(): void - Resets global configuration instance. The default instance holds configuration based on env vars, if we want to change them, we need to first reset the global state. Used mainly for testing purposes. *** #### Returns void ### [**](#set)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L399)staticset * ****set**(key, value): void - Sets value for given option. Only affects the global `Configuration` instance, the value will not be propagated down to the env var. To reset a value, we can omit the `value` argument or pass `undefined` there. *** #### Parameters * ##### key: keyof [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) * ##### optionalvalue: any #### Returns void ### [**](#useStorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L461)staticuseStorageClient * ****useStorageClient**(client): void - #### Parameters * ##### client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) #### Returns void --- # CriticalError Errors of `CriticalError` type will shut down the whole crawler. Error handlers catching CriticalError should avoid logging it, as it will be logged by Node.js itself at the end ### Hierarchy * [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) * *CriticalError* * [BrowserLaunchError](https://crawlee.dev/js/api/browser-pool/class/BrowserLaunchError.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1082)externalconstructor * ****new CriticalError**(message): [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) * ****new CriticalError**(message, options): [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) - Inherited from NonRetryableError.constructor #### Parameters * ##### externaloptionalmessage: string #### Returns [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from NonRetryableError.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from NonRetryableError.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from NonRetryableError.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? 
: string Inherited from NonRetryableError.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from NonRetryableError.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from NonRetryableError.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from NonRetryableError.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from NonRetryableError.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # Dataset \ The `Dataset` class represents a store for structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - you can only add new records to it but you cannot modify or remove existing records. Typically it is used to store crawling results. 
Do not instantiate this class directly, use the [Dataset.open](https://crawlee.dev/js/api/core/class/Dataset.md#open) function instead. `Dataset` stores its data either on local disk or in the Apify cloud, depending on whether the `APIFY_LOCAL_STORAGE_DIR` or `APIFY_TOKEN` environment variables are set. If the `APIFY_LOCAL_STORAGE_DIR` environment variable is set, the data is stored in the local directory in the following files: ``` {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` Note that `{DATASET_ID}` is the name or ID of the dataset. The default dataset has ID: `default`, unless you override it by setting the `APIFY_DEFAULT_DATASET_ID` environment variable. Each dataset item is stored as a separate JSON file, where `{INDEX}` is a zero-based index of the item in the dataset. If the `APIFY_TOKEN` environment variable is set but `APIFY_LOCAL_STORAGE_DIR` not, the data is stored in the [Apify Dataset](https://docs.apify.com/storage/dataset) cloud storage. Note that you can force usage of the cloud storage also by passing the `forceCloud` option to [Dataset.open](https://crawlee.dev/js/api/core/class/Dataset.md#open) function, even if the `APIFY_LOCAL_STORAGE_DIR` variable is set. **Example usage:** ``` // Write a single row to the default dataset await Dataset.pushData({ col1: 123, col2: 'val2' }); // Open a named dataset const dataset = await Dataset.open('some-name'); // Write a single row await dataset.pushData({ foo: 'bar' }); // Write multiple rows await dataset.pushData([ { foo: 'bar2', col2: 'val2' }, { col3: 123 }, ]); // Export the entirety of the dataset to one file in the key-value store await dataset.exportToCSV('MY-DATA'); ``` ## Index[**](#Index) ### Properties * [**client](#client) * [**config](#config) * [**id](#id) * [**log](#log) * [**name](#name) * [**storageObject](#storageObject) ### Methods * [**drop](#drop) * [**export](#export) * [**exportTo](#exportTo) * [**exportToCSV](#exportToCSV) * [**exportToJSON](#exportToJSON) * [**forEach](#forEach) * [**getData](#getData) * [**getInfo](#getInfo) * [**map](#map) * [**pushData](#pushData) * [**reduce](#reduce) * [**exportToCSV](#exportToCSV) * [**exportToJSON](#exportToJSON) * [**getData](#getData) * [**open](#open) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L235)client **client: [DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md)\ ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L244)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L233)id **id: string ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L237)log **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) = ... ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L234)optionalname **name? : string ### [**](#storageObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L236)optionalreadonlystorageObject **storageObject? 
: Record\ ## Methods[**](#Methods) ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L613)drop * ****drop**(): Promise\ - Removes the dataset either from the Apify cloud storage or from the local directory, depending on the mode of operation. *** #### Returns Promise\ ### [**](#export)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L323)export * ****export**(options): Promise\ - Returns all the data from the dataset. This will iterate through the whole dataset via the `listItems()` client method, which gives you only paginated results. *** #### Parameters * ##### options: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) = {} #### Returns Promise\ ### [**](#exportTo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L355)exportTo * ****exportTo**(key, options, contentType): Promise\ - Save the entirety of the dataset's contents into one file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) An optional options object where you can provide the dataset and target KVS name. * ##### optionalcontentType: string Only JSON and CSV are supported currently, defaults to JSON. #### Returns Promise\ ### [**](#exportToCSV)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L406)exportToCSV * ****exportToCSV**(key, options): Promise\ - Save entire default dataset's contents into one CSV file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: Omit<[DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md), fromDataset> An optional options object where you can provide the target KVS name. #### Returns Promise\ ### [**](#exportToJSON)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L396)exportToJSON * ****exportToJSON**(key, options): Promise\ - Save entire default dataset's contents into one JSON file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: Omit<[DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md), fromDataset> An optional options object where you can provide the target KVS name. #### Returns Promise\ ### [**](#forEach)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L484)forEach * ****forEach**(iteratee, options, index): Promise\ - Iterates over dataset items, yielding each in turn to an `iteratee` function. Each invocation of `iteratee` is called with two arguments: `(item, index)`. If the `iteratee` function returns a Promise then it is awaited before the next call. If it throws an error, the iteration is aborted and the `forEach` function throws the error. **Example usage** ``` const dataset = await Dataset.open('my-results'); await dataset.forEach(async (item, index) => { console.log(`Item at ${index}: ${JSON.stringify(item)}`); }); ``` *** #### Parameters * ##### iteratee: [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md)\ A function that is called for every item in the dataset. 
* ##### optionaloptions: [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) = {} All `forEach()` parameters. * ##### optionalindex: number = 0 Specifies the initial index number passed to the `iteratee` function. #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L303)getData * ****getData**(options): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Returns [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) object holding the items in the dataset based on the provided parameters. *** #### Parameters * ##### options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) = {} #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L458)getInfo * ****getInfo**(): Promise\ - Returns an object containing general information about the dataset. The function returns the same object as the Apify API Client's [getDataset](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-datasets-getDataset) function, which in turn calls the [Get dataset](https://apify.com/docs/api/v2#/reference/datasets/dataset/get-dataset) API endpoint. **Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-dataset", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), itemCount: 14, } ``` *** #### Returns Promise\ ### [**](#map)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L514)map * ****map**\(iteratee, options): Promise\ - Produces a new array of values by mapping each value in list through a transformation function `iteratee()`. Each invocation of `iteratee()` is called with two arguments: `(element, index)`. If `iteratee` returns a `Promise` then it's awaited before a next call. *** #### Parameters * ##### iteratee: [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md)\ * ##### optionaloptions: [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) = {} All `map()` parameters. #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L276)pushData * ****pushData**(data): Promise\ - Stores an object or an array of objects to the dataset. The function returns a promise that resolves when the operation finishes. It has no result, but throws on invalid args or other errors. **IMPORTANT**: Make sure to use the `await` keyword when calling `pushData()`, otherwise the crawler process might finish before the data is stored! The size of the data is limited by the receiving API and therefore `pushData()` will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size. The function internally chunks the array into separate items and pushes them sequentially. The chunking process is stable (keeps order of data), but it does not provide a transaction safety mechanism. 
Therefore, in the event of an uploading error (after several automatic retries), the function's Promise will reject and the dataset will be left in a state where some of the items have already been saved to the dataset while other items from the source array were not. To overcome this limitation, the developer may, for example, read the last item saved in the dataset and re-attempt the save of the data from this item onwards to prevent duplicates. *** #### Parameters * ##### data: Data | Data\[] Object or array of objects containing data to be stored in the default dataset. The objects must be serializable to JSON and the JSON representation of each object must be smaller than 9MB. #### Returns Promise\ ### [**](#reduce)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L544)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L565)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L584)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L586)reduce * ****reduce**(iteratee): Promise\ * ****reduce**(iteratee, memo, options): Promise\ * ****reduce**\(iteratee, memo, options): Promise\ - Reduces a list of values down to a single value. If no `memo` is provided, the first element of the dataset is used as the initial value, and each successive step of the reduction should be returned by `iteratee()`. The `iteratee()` is passed three arguments: the `memo`, `value` and `index` of the current element being folded into the reduction. The `iteratee` is first invoked on the second element of the list (`index = 1`), with the first element given as the memo parameter. After that, the rest of the elements in the dataset are passed to `iteratee`, with the result of the previous invocation as the memo. If `iteratee()` returns a `Promise`, it is awaited before the next call. If the dataset is empty, `reduce` will return `undefined`. *** #### Parameters * ##### iteratee: [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md)\ #### Returns Promise\ ### [**](#exportToCSV)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L429)staticexportToCSV * ****exportToCSV**(key, options): Promise\ - Save entire default dataset's contents into one CSV file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) An optional options object where you can provide the dataset and target KVS name. #### Returns Promise\ ### [**](#exportToJSON)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L416)staticexportToJSON * ****exportToJSON**(key, options): Promise\ - Save entire default dataset's contents into one JSON file within a key-value store. *** #### Parameters * ##### key: string The name of the value to save the data in. * ##### optionaloptions: [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) An optional options object where you can provide the dataset and target KVS name.
#### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L692)staticgetData * ****getData**\(options): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Returns [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) object holding the items in the dataset based on the provided parameters. *** #### Parameters * ##### options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) = {} #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L635)staticopen * ****open**\(datasetIdOrName, options): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Opens a dataset and returns a promise resolving to an instance of the [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) class. Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the cloud. For more details and code examples, see the [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) class. *** #### Parameters * ##### optionaldatasetIdOrName: null | string ID or name of the dataset to be opened. If `null` or `undefined`, the function returns the default dataset associated with the crawler run. * ##### optionaloptions: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) = {} Storage manager options. #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> --- # ErrorSnapshotter ErrorSnapshotter class is used to capture a screenshot of the page and a snapshot of the HTML when an error occurs during web crawling. This functionality is opt-in, and can be enabled via the crawler options: ``` const crawler = new BasicCrawler({ // ... 
statisticsOptions: { saveErrorSnapshots: true, }, }); ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**BASE\_MESSAGE](#BASE_MESSAGE) * [**MAX\_ERROR\_CHARACTERS](#MAX_ERROR_CHARACTERS) * [**MAX\_FILENAME\_LENGTH](#MAX_FILENAME_LENGTH) * [**MAX\_HASH\_LENGTH](#MAX_HASH_LENGTH) * [**SNAPSHOT\_PREFIX](#SNAPSHOT_PREFIX) ### Methods * [**captureSnapshot](#captureSnapshot) * [**contextCaptureSnapshot](#contextCaptureSnapshot) * [**generateFilename](#generateFilename) * [**saveHTMLSnapshot](#saveHTMLSnapshot) ## Constructors[**](#Constructors) ### [**](#constructor)constructor * ****new ErrorSnapshotter**(): [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) - #### Returns [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ## Properties[**](#Properties) ### [**](#BASE_MESSAGE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L46)staticreadonlyBASE\_MESSAGE **BASE\_MESSAGE: An error occurred = 'An error occurred' ### [**](#MAX_ERROR_CHARACTERS)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L43)staticreadonlyMAX\_ERROR\_CHARACTERS **MAX\_ERROR\_CHARACTERS: 30 = 30 ### [**](#MAX_FILENAME_LENGTH)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L45)staticreadonlyMAX\_FILENAME\_LENGTH **MAX\_FILENAME\_LENGTH: 250 = 250 ### [**](#MAX_HASH_LENGTH)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L44)staticreadonlyMAX\_HASH\_LENGTH **MAX\_HASH\_LENGTH: 30 = 30 ### [**](#SNAPSHOT_PREFIX)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L47)staticreadonlySNAPSHOT\_PREFIX **SNAPSHOT\_PREFIX: ERROR\_SNAPSHOT = 'ERROR\_SNAPSHOT' ## Methods[**](#Methods) ### [**](#captureSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L52)captureSnapshot * ****captureSnapshot**(error, context): Promise\ - Capture a snapshot of the error context. *** #### Parameters * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) * ##### context: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)\ #### Returns Promise\ ### [**](#contextCaptureSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L105)contextCaptureSnapshot * ****contextCaptureSnapshot**(context, fileName): Promise\ - Captures a snapshot of the current page using the context.saveSnapshot function. This function is applicable for browser contexts only. Returns an object containing the filenames of the screenshot and HTML file. *** #### Parameters * ##### context: BrowserCrawlingContext * ##### fileName: string #### Returns Promise\ ### [**](#generateFilename)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L135)generateFilename * ****generateFilename**(error): string - Generate a unique fileName for each error snapshot. 
*** #### Parameters * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) #### Returns string ### [**](#saveHTMLSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L123)saveHTMLSnapshot * ****saveHTMLSnapshot**(html, keyValueStore, fileName): Promise\ - Save the HTML snapshot of the page, and return the fileName with the extension. *** #### Parameters * ##### html: string * ##### keyValueStore: [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) * ##### fileName: string #### Returns Promise\ --- # ErrorTracker This class tracks errors and computes a summary of information like: * where the errors happened * what the error names are * what the error codes are * what is the general error message This is extremely useful when there are dynamic error messages, such as argument validation. Since the structure of the `tracker.result` object differs when using different options, it's typed as `Record`. The most deep object has a `count` property, which is a number. It's possible to get the total amount of errors via the `tracker.total` property. ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**errorSnapshotter](#errorSnapshotter) * [**result](#result) * [**total](#total) ### Methods * [**add](#add) * [**addAsync](#addAsync) * [**captureSnapshot](#captureSnapshot) * [**getMostPopularErrors](#getMostPopularErrors) * [**getUniqueErrorCount](#getUniqueErrorCount) * [**reset](#reset) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L295)constructor * ****new ErrorTracker**(options): [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) - #### Parameters * ##### options: Partial<[ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md)> = {} #### Returns [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ## Properties[**](#Properties) ### [**](#errorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L293)optionalerrorSnapshotter **errorSnapshotter? : [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#result)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L289)result **result: Record\ ### [**](#total)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L291)total **total: number ## Methods[**](#Methods) ### [**](#add)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L339)add * ****add**(error): void - #### Parameters * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) #### Returns void ### [**](#addAsync)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L353)addAsync * ****addAsync**(error, context): Promise\ - This method is async, because it captures a snapshot of the error context. We added this new method to avoid breaking changes. 
*** #### Parameters * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) * ##### optionalcontext: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)\ #### Returns Promise\ ### [**](#captureSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L408)captureSnapshot * ****captureSnapshot**(storage, error, context): Promise\ - #### Parameters * ##### storage: Record\ * ##### error: [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) * ##### context: [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md)\ #### Returns Promise\ ### [**](#getMostPopularErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L388)getMostPopularErrors * ****getMostPopularErrors**(count): \[number, string\[]]\[] - #### Parameters * ##### count: number #### Returns \[number, string\[]]\[] ### [**](#getUniqueErrorCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L368)getUniqueErrorCount * ****getUniqueErrorCount**(): number - #### Returns number ### [**](#reset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L419)reset * ****reset**(): void - #### Returns void --- # abstractEventManager ### Hierarchy * *EventManager* * [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**config](#config) ### Methods * [**close](#close) * [**emit](#emit) * [**init](#init) * [**isInitialized](#isInitialized) * [**off](#off) * [**on](#on) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L30)constructor * ****new EventManager**(config): [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) - #### Parameters * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L30)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ## Methods[**](#Methods) ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L55)close * ****close**(): Promise\ - Clears the internal `persistState` event interval. This is automatically called at the end of `crawler.run()`. *** #### Returns Promise\ ### [**](#emit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L82)emit * ****emit**(event, ...args): void - #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### rest...args: unknown\[] #### Returns void ### [**](#init)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L38)init * ****init**(): Promise\ - Initializes the event manager by creating the `persistState` event interval. This is automatically called at the beginning of `crawler.run()`. 
*** #### Returns Promise\ ### [**](#isInitialized)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L86)isInitialized * ****isInitialized**(): boolean - #### Returns boolean ### [**](#off)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L74)off * ****off**(event, listener): void - #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### optionallistener: (...args) => any #### Returns void ### [**](#on)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L70)on * ****on**(event, listener): void - #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### listener: (...args) => any #### Returns void --- # GotScrapingHttpClient A HTTP client implementation based on the `got-scraping` library. ### Implements * [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**sendRequest](#sendRequest) * [**stream](#stream) ## Constructors[**](#Constructors) ### [**](#constructor)constructor * ****new GotScrapingHttpClient**(): [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) - #### Returns [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ## Methods[**](#Methods) ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L21)sendRequest * ****sendRequest**\(request): Promise<[HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md)\> - Implementation of BaseHttpClient.sendRequest Perform an HTTP Request and return the complete response. *** #### Parameters * ##### request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ #### Returns Promise<[HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md)\> ### [**](#stream)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L45)stream * ****stream**(request, handleRedirect): Promise<[StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md)> - Implementation of BaseHttpClient.stream Perform an HTTP Request and return after the response headers are received. The body may be read from a stream contained in the response. *** #### Parameters * ##### request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ * ##### optionalhandleRedirect: [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) #### Returns Promise<[StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md)> --- # KeyValueStore The `KeyValueStore` class represents a key-value store, a simple data storage that is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots, crawler inputs and outputs, web pages, PDFs or to persist the state of crawlers. Do not instantiate this class directly, use the [KeyValueStore.open](https://crawlee.dev/js/api/core/class/KeyValueStore.md#open) function instead. Each crawler run is associated with a default key-value store, which is created exclusively for the run. 
By convention, the crawler input and output are stored into the default key-value store under the `INPUT` and `OUTPUT` keys, respectively. Typically, input and output are JSON files, although they can be in any other format. To access the default key-value store directly, you can use the [KeyValueStore.getValue](https://crawlee.dev/js/api/core/class/KeyValueStore.md#getValue) and [KeyValueStore.setValue](https://crawlee.dev/js/api/core/class/KeyValueStore.md#setValue) convenience functions. To access the input, you can also use the KeyValueStore.getInput convenience function. `KeyValueStore` stores its data on a local disk. If the `CRAWLEE_STORAGE_DIR` environment variable is set, the data is stored in the local directory in the following files:

```
{CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT}
```

Note that `{STORE_ID}` is the name or ID of the key-value store. The default key-value store has ID: `default`, unless you override it by setting the `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` environment variable. The `{KEY}` is the key of the record and `{EXT}` corresponds to the MIME content type of the data value. **Example usage:**

```
// Get crawler input from the default key-value store.
const input = await KeyValueStore.getInput();

// Get some value from the default key-value store.
const otherValue = await KeyValueStore.getValue('my-key');

// Write crawler output to the default key-value store.
await KeyValueStore.setValue('OUTPUT', { myResult: 123 });

// Open a named key-value store
const store = await KeyValueStore.open('some-name');

// Write a record. JavaScript object is automatically converted to JSON,
// strings and binary buffers are stored as they are
await store.setValue('some-key', { foo: 'bar' });

// Read a record. Note that JSON is automatically parsed to a JavaScript object,
// text data returned as a string and other data is returned as binary buffer
const value = await store.getValue('some-key');

// Drop (delete) the store
await store.drop();
```

## Index[**](#Index) ### Properties * [**config](#config) * [**id](#id) * [**name](#name) * [**storageObject](#storageObject) ### Methods * [**drop](#drop) * [**forEachKey](#forEachKey) * [**getAutoSavedValue](#getAutoSavedValue) * [**getPublicUrl](#getPublicUrl) * [**getValue](#getValue) * [**recordExists](#recordExists) * [**setValue](#setValue) * [**getAutoSavedValue](#getAutoSavedValue) * [**open](#open) * [**recordExists](#recordExists) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L123)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L109)readonlyid **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L110)optionalreadonlyname **name? : string ### [**](#storageObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L111)optionalreadonlystorageObject **storageObject? : Record\ ## Methods[**](#Methods) ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L416)drop * ****drop**(): Promise\ - Removes the key-value store either from the Apify cloud storage or from the local directory, depending on the mode of operation.
*** #### Returns Promise\ ### [**](#forEachKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L452)forEachKey * ****forEachKey**(iteratee, options): Promise\ - Iterates over key-value store keys, yielding each in turn to an `iteratee` function. Each invocation of `iteratee` is called with three arguments: `(key, index, info)`, where `key` is the record key, `index` is a zero-based index of the key in the current iteration (regardless of `options.exclusiveStartKey`) and `info` is an object that contains a single property `size` indicating size of the record in bytes. If the `iteratee` function returns a Promise then it is awaited before the next call. If it throws an error, the iteration is aborted and the `forEachKey` function throws the error. **Example usage** ``` const keyValueStore = await KeyValueStore.open(); await keyValueStore.forEachKey(async (key, index, info) => { console.log(`Key at ${index}: ${key} has size ${info.size}`); }); ``` *** #### Parameters * ##### iteratee: [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) A function that is called for every key in the key-value store. * ##### optionaloptions: [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) = {} All `forEachKey()` parameters. #### Returns Promise\ ### [**](#getAutoSavedValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L249)getAutoSavedValue * ****getAutoSavedValue**\(key, defaultValue): Promise\ - #### Parameters * ##### key: string * ##### defaultValue: T = ... #### Returns Promise\ ### [**](#getPublicUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L487)getPublicUrl * ****getPublicUrl**(key): string - Returns a file URL for the given key. *** #### Parameters * ##### key: string #### Returns string ### [**](#getValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L161)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L194)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L227)getValue * ****getValue**\(key): Promise\ * ****getValue**\(key, defaultValue): Promise\ - Gets a value from the key-value store. The function returns a `Promise` that resolves to the record value, whose JavaScript type depends on the MIME content type of the record. Records with the `application/json` content type are automatically parsed and returned as a JavaScript object. Similarly, records with `text/plain` content types are returned as a string. For all other content types, the value is returned as a raw [`Buffer`](https://nodejs.org/api/buffer.html) instance. If the record does not exist, the function resolves to `null`. To save or delete a value in the key-value store, use the [KeyValueStore.setValue](https://crawlee.dev/js/api/core/class/KeyValueStore.md#setValue) function. **Example usage:** ``` const store = await KeyValueStore.open(); const buffer = await store.getValue('screenshot1.png'); ``` *** #### Parameters * ##### key: string Unique key of the record. It can be at most 256 characters long and only consist of the following characters: `a`-`z`, `A`-`Z`, `0`-`9` and `!-_.'()` #### Returns Promise\ Returns a promise that resolves to an object, string or [`Buffer`](https://nodejs.org/api/buffer.html), depending on the MIME content type of the record. 
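The overload list above also includes `getValue(key, defaultValue)`. Below is a minimal sketch of that variant, assuming it resolves to the provided default when the record is missing; the `crawl-state` key and its shape are illustrative only:

```
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

// If no 'crawl-state' record exists yet, the provided default object
// is returned instead of `null`.
const state = await store.getValue('crawl-state', { processedUrls: 0 });

console.log(state.processedUrls);
```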
### [**](#recordExists)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L242)recordExists * ****recordExists**(key): Promise\ - Tests whether a record with the given key exists in the key-value store without retrieving its value. *** #### Parameters * ##### key: string The queried record key. #### Returns Promise\ `true` if the record exists, `false` if it does not. ### [**](#setValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L342)setValue * ****setValue**\(key, value, options): Promise\ - Saves or deletes a record in the key-value store. The function returns a promise that resolves once the record has been saved or deleted. **Example usage:** ``` const store = await KeyValueStore.open(); await store.setValue('OUTPUT', { foo: 'bar' }); ``` Beware that the key can be at most 256 characters long and only contain the following characters: `a-zA-Z0-9!-_.'()` By default, `value` is converted to JSON and stored with the `application/json; charset=utf-8` MIME content type. To store the value with another content type, pass it in the options as follows: ``` const store = await KeyValueStore.open('my-text-store'); await store.setValue('RESULTS', 'my text data', { contentType: 'text/plain' }); ``` If you set custom content type, `value` must be either a string or [`Buffer`](https://nodejs.org/api/buffer.html), otherwise an error will be thrown. If `value` is `null`, the record is deleted instead. Note that the `setValue()` function succeeds regardless whether the record existed or not. To retrieve a value from the key-value store, use the [KeyValueStore.getValue](https://crawlee.dev/js/api/core/class/KeyValueStore.md#getValue) function. **IMPORTANT:** Always make sure to use the `await` keyword when calling `setValue()`, otherwise the crawler process might finish before the value is stored! *** #### Parameters * ##### key: string Unique key of the record. It can be at most 256 characters long and only consist of the following characters: `a`-`z`, `A`-`Z`, `0`-`9` and `!-_.'()` * ##### value: null | T Record data, which can be one of the following values: * If `null`, the record in the key-value store is deleted. * If no `options.contentType` is specified, `value` can be any JavaScript object and it will be stringified to JSON. * If `options.contentType` is set, `value` is taken as is and it must be a `String` or [`Buffer`](https://nodejs.org/api/buffer.html). For any other value an error will be thrown. * ##### optionaloptions: [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) = {} Record options. #### Returns Promise\ ### [**](#getAutoSavedValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L630)staticgetAutoSavedValue * ****getAutoSavedValue**\(key, defaultValue): Promise\ - #### Parameters * ##### key: string * ##### defaultValue: T = ... #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L506)staticopen * ****open**(storeIdOrName, options): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - Opens a key-value store and returns a promise resolving to an instance of the [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) class. Key-value stores are used to store records or files, along with their MIME content type. The records are stored and retrieved using a unique key. 
The actual data is stored either on a local filesystem or in the Apify cloud. For more details and code examples, see the [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) class. *** #### Parameters * ##### optionalstoreIdOrName: null | string ID or name of the key-value store to be opened. If `null` or `undefined`, the function returns the default key-value store associated with the crawler run. * ##### optionaloptions: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) = {} Storage manager options. #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#recordExists)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L625)staticrecordExists * ****recordExists**(key): Promise\ - Tests whether a record with the given key exists in the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) associated with the current crawler run. *** #### Parameters * ##### key: string The queried record key. #### Returns Promise\ `true` if the record exists, `false` if it does not. --- # LocalEventManager ### Hierarchy * [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) * *LocalEventManager* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**config](#config) ### Methods * [**close](#close) * [**emit](#emit) * [**init](#init) * [**isInitialized](#isInitialized) * [**off](#off) * [**on](#on) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L30)constructor * ****new LocalEventManager**(config): [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) - Inherited from EventManager.constructor #### Parameters * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L30)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from EventManager.config ## Methods[**](#Methods) ### [**](#close)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L33)close * ****close**(): Promise\ - Overrides EventManager.close Clears the internal `persistState` event interval. This is automatically called at the end of `crawler.run()`. *** #### Returns Promise\ ### [**](#emit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L82)inheritedemit * ****emit**(event, ...args): void - Inherited from EventManager.emit #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### rest...args: unknown\[] #### Returns void ### [**](#init)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L18)init * ****init**(): Promise\ - Overrides EventManager.init Initializes the EventManager and sets up periodic `systemInfo` and `persistState` events. This is automatically called at the beginning of `crawler.run()`. 
*** #### Returns Promise\ ### [**](#isInitialized)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L86)inheritedisInitialized * ****isInitialized**(): boolean - Inherited from EventManager.isInitialized #### Returns boolean ### [**](#off)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L74)inheritedoff * ****off**(event, listener): void - Inherited from EventManager.off #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### optionallistener: (...args) => any #### Returns void ### [**](#on)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L70)inheritedon * ****on**(event, listener): void - Inherited from EventManager.on #### Parameters * ##### event: [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) * ##### listener: (...args) => any #### Returns void --- # externalLog The log instance enables level-aware logging of messages, and we advise using it instead of `console.log()` and its aliases in most development scenarios. A very useful use case for `log` is using `log.debug` liberally throughout the codebase to get useful logging messages only when the appropriate log level is set, while keeping the console tidy in production environments. The available logging levels are, in this order: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `OFF`, and can be referenced from the `log.LEVELS` constant, such as `log.LEVELS.ERROR`. To log messages to the system console, use the `log.level(message)` invocation, such as `log.debug('this is a debug message')`. To prevent writing of messages above a certain log level to the console, simply set the appropriate level. The default log level is `INFO`, which means that `DEBUG` messages will not be printed, unless enabled. **Example:**

```
import log from '@apify/log';
// importing from the Apify SDK or Crawlee is also supported:
// import { log } from 'apify';
// import { log } from 'crawlee';

log.info('Information message', { someData: 123 }); // prints message
log.debug('Debug message', { debugData: 'hello' }); // doesn't print anything

log.setLevel(log.LEVELS.DEBUG);
log.debug('Debug message'); // prints message

log.setLevel(log.LEVELS.ERROR);
log.debug('Debug message'); // doesn't print anything
log.info('Info message'); // doesn't print anything
log.error('Error message', { errorDetails: 'This is bad!' }); // prints message

try {
  throw new Error('Not good!');
} catch (e) {
  log.exception(e, 'Exception occurred', { errorDetails: 'This is really bad!' }); // prints message
}

log.setOptions({ prefix: 'My actor' });
log.info('I am running!'); // prints "My actor: I am running"

const childLog = log.child({ prefix: 'Crawler' });
childLog.info('I am crawling!'); // prints "My actor:Crawler: I am crawling"
```

Another very useful way of setting the log level is by setting the `APIFY_LOG_LEVEL` environment variable, such as `APIFY_LOG_LEVEL=DEBUG`. This way, no code changes are necessary to turn on your debug messages and start debugging right away. To add timestamps to your logs, you can override the default logger settings:

```
log.setOptions({
    logger: new log.LoggerText({ skipTime: false }),
});
```

You can customize your logging further by extending or replacing the default logger instances with your own implementations.
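The `getLevel()` method documented below is useful for checking whether a message will actually be printed before doing expensive work. A minimal sketch of that pattern (the `crawledPages` data is illustrative only):

```
import { log } from 'crawlee';

const crawledPages = [{ url: 'https://example.com', status: 200 }];

// LEVELS.DEBUG is a higher number than the default INFO level, so this
// guard passes only after log.setLevel(log.LEVELS.DEBUG) or when
// APIFY_LOG_LEVEL=DEBUG is set.
if (log.getLevel() >= log.LEVELS.DEBUG) {
    // Build the (potentially expensive) payload only when it will be printed.
    log.debug('Crawled pages snapshot', { snapshot: JSON.stringify(crawledPages) });
}
```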
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**LEVELS](#LEVELS) ### Methods * [**child](#child) * [**debug](#debug) * [**deprecated](#deprecated) * [**error](#error) * [**exception](#exception) * [**getLevel](#getLevel) * [**getOptions](#getOptions) * [**info](#info) * [**internal](#internal) * [**perf](#perf) * [**setLevel](#setLevel) * [**setOptions](#setOptions) * [**softFail](#softFail) * [**warning](#warning) * [**warningOnce](#warningOnce) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L136)externalconstructor * ****new Log**(options): [Log](https://crawlee.dev/js/api/core/class/Log.md) - #### Parameters * ##### externaloptionaloptions: Partial<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> #### Returns [Log](https://crawlee.dev/js/api/core/class/Log.md) ## Properties[**](#Properties) ### [**](#LEVELS)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L133)externalreadonlyLEVELS **LEVELS: typeof [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) Map of available log levels that's useful for easy setting of appropriate log levels. Each log level is represented internally by a number. Eg. `log.LEVELS.DEBUG === 5`. ## Methods[**](#Methods) ### [**](#child)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L168)externalchild * ****child**(options): [Log](https://crawlee.dev/js/api/core/class/Log.md) - Creates a new instance of logger that inherits settings from a parent logger. *** #### Parameters * ##### externaloptions: Partial<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> #### Returns [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#debug)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L195)externaldebug * ****debug**(message, data): void - Logs a `DEBUG` message. By default, it will not be written to the console. To see `DEBUG` messages in the console, set the log level to `DEBUG` either using the `log.setLevel(log.LEVELS.DEBUG)` method or using the environment variable `APIFY_LOG_LEVEL=DEBUG`. Data are stringified and appended to the message. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#deprecated)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L204)externaldeprecated * ****deprecated**(message): void - Logs given message only once as WARNING. It's used to warn user that some feature he is using has been deprecated. *** #### Parameters * ##### externalmessage: string #### Returns void ### [**](#error)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L173)externalerror * ****error**(message, data): void - Logs an `ERROR` message. Use this method to log error messages that are not directly connected to an exception. For logging exceptions, use the `log.exception` method. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#exception)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L178)externalexception * ****exception**(exception, message, data): void - Logs an `ERROR` level message with a nicely formatted exception. 
Note that the exception is the first parameter here and an additional message is only optional. *** #### Parameters * ##### externalexception: Error * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#getLevel)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L144)externalgetLevel * ****getLevel**(): number - Returns the currently selected logging level. This is useful for checking whether a message will actually be printed to the console before one actually performs a resource intensive operation to construct the message, such as querying a DB for some metadata that need to be added. If the log level is not high enough at the moment, it doesn't make sense to execute the query. *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L164)externalgetOptions * ****getOptions**(): Required<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> - Returns the logger configuration. *** #### Returns Required<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> ### [**](#info)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L188)externalinfo * ****info**(message, data): void - Logs an `INFO` message. `INFO` is the default log level so info messages will be always logged, unless the log level is changed. Data are stringified and appended to the message. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#internal)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L156)externalinternal * ****internal**(level, message, data, exception): void - #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: any #### Returns void ### [**](#perf)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L196)externalperf * ****perf**(message, data): void - #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#setLevel)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L155)externalsetLevel * ****setLevel**(level): void - Sets the log level to the given value, preventing messages from less important log levels from being printed to the console. Use in conjunction with the `log.LEVELS` constants such as ``` log.setLevel(log.LEVELS.DEBUG); ``` Default log level is INFO. *** #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) #### Returns void ### [**](#setOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L160)externalsetOptions * ****setOptions**(options): void - Configures logger. 
*** #### Parameters * ##### externaloptions: Partial<[LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md)> #### Returns void ### [**](#softFail)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L179)externalsoftFail * ****softFail**(message, data): void - #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#warning)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L183)externalwarning * ****warning**(message, data): void - Logs a `WARNING` level message. Data are stringified and appended to the message. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#warningOnce)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L200)externalwarningOnce * ****warningOnce**(message): void - Logs a `WARNING` level message only once. *** #### Parameters * ##### externalmessage: string #### Returns void --- # externalLogger This is an abstract class that should be extended by custom logger classes. this.\_log() method must be implemented by them. ### Hierarchy * EventEmitter * *Logger* * [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) * [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L33)externalconstructor * ****new Logger**(options): [Logger](https://crawlee.dev/js/api/core/class/Logger.md) - Overrides EventEmitter.constructor #### Parameters * ##### externaloptions: Record\ #### Returns [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from EventEmitter.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. 
* **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from EventEmitter.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from EventEmitter.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from EventEmitter.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. 
* **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L37)external\_log * ****\_log**(level, message, data, exception, opts): void - #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns void ### [**](#_outputWithConsole)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L36)external\_outputWithConsole * ****\_outputWithConsole**(level, line): void - #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from EventEmitter.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from EventEmitter.addListener Alias for `emitter.on(eventName, listener)`. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from EventEmitter.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from EventEmitter.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. 
``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from EventEmitter.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalgetOptions * ****getOptions**(): Record\ - #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from EventEmitter.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. * **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from EventEmitter.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externallog * ****log**(level, message, ...args): void - #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from EventEmitter.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from EventEmitter.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. 
``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from EventEmitter.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from EventEmitter.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from EventEmitter.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. 
* **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from EventEmitter.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from EventEmitter.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from EventEmitter.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. 
``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from EventEmitter.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalsetOptions * ****setOptions**(options): void - #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from EventEmitter.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. 
Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. ### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from EventEmitter.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from EventEmitter.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from EventEmitter.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. 
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from EventEmitter.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from EventEmitter.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from EventEmitter.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # externalLoggerJson This is an abstract class that should be extended by custom logger classes. this.\_log() method must be implemented by them. 
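As with the base `Logger`, a custom logger implements `_log()`. The sketch below is only illustrative: the `PlainLogger` name is hypothetical, and it assumes `Logger` and the `LogLevel` enum are re-exported from the `crawlee` package as listed in this reference.

```
import { Logger, LogLevel } from 'crawlee';

// Hypothetical logger that prefixes every line and delegates the actual
// console output to the helper inherited from Logger.
class PlainLogger extends Logger {
    _log(level, message, data, exception, opts) {
        const line = `[my-crawler] ${LogLevel[level]}: ${message}`;
        this._outputWithConsole(level, line);
        return line;
    }
}

// Hook it up for all subsequent log calls, e.g.:
// log.setOptions({ logger: new PlainLogger({}) });
```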
### Hierarchy * [Logger](https://crawlee.dev/js/api/core/class/Logger.md) * *LoggerJson* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L241)externalconstructor * ****new LoggerJson**(options): [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) - Overrides Logger.constructor #### Parameters * ##### externaloptionaloptions: {} #### Returns [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from Logger.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from Logger.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from Logger.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. 
The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from Logger.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L242)external\_log * ****\_log**(level, message, data, exception, opts): string - Overrides Logger.\_log #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns string ### [**](#_outputWithConsole)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L36)externalinherited\_outputWithConsole * ****\_outputWithConsole**(level, line): void - Inherited from Logger.\_outputWithConsole #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from Logger.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from Logger.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from Logger.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from Logger.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from Logger.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalinheritedgetOptions * ****getOptions**(): Record\ - Inherited from Logger.getOptions #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from Logger.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. 
* **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from Logger.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externalinheritedlog * ****log**(level, message, ...args): void - Inherited from Logger.log #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from Logger.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from Logger.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from Logger.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. 
``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from Logger.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from Logger.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from Logger.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). 
``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from Logger.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from Logger.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. 
This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from Logger.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalinheritedsetOptions * ****setOptions**(options): void - Inherited from Logger.setOptions #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from Logger.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. 
### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from Logger.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from Logger.getMaxListeners Returns the currently set maximum number of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from Logger.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `emitter.listenerCount()` instead.
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from Logger.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from Logger.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special `'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the `'error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from Logger.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # externalLoggerText This is an abstract class that should be extended by custom logger classes. The `this._log()` method must be implemented by them.
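For orientation, here is a minimal sketch of what such a subclass could look like. It relies only on the `_log()` signature documented further below; the import path and the idea of delegating formatting to `LoggerText` are assumptions, not an official recipe.

```js
// Illustrative sketch only. Assumes LoggerText is exported from @crawlee/core
// and that _log() returns the formatted log line, as the _log() entry below suggests.
import { LoggerText } from '@crawlee/core';

class PrefixedLogger extends LoggerText {
    _log(level, message, data, exception, opts) {
        // Delegate formatting to LoggerText, but tag every message with a prefix.
        return super._log(level, `[my-crawler] ${message}`, data, exception, opts);
    }
}
```

How such a logger is plugged into the logging subsystem depends on the Logger configuration; see the Logger class documentation for the exact wiring.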
### Hierarchy * [Logger](https://crawlee.dev/js/api/core/class/Logger.md) * *LoggerText* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L246)externalconstructor * ****new LoggerText**(options): [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) - Overrides Logger.constructor #### Parameters * ##### externaloptionaloptions: {} #### Returns [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from Logger.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from Logger.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from Logger.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all* `EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit.
The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from Logger.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L247)external\_log * ****\_log**(level, message, data, exception, opts): string - Overrides Logger.\_log #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns string ### [**](#_outputWithConsole)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L36)externalinherited\_outputWithConsole * ****\_outputWithConsole**(level, line): void - Inherited from Logger.\_outputWithConsole #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from Logger.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from Logger.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from Logger.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from Logger.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from Logger.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalinheritedgetOptions * ****getOptions**(): Record\ - Inherited from Logger.getOptions #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from Logger.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. 
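For illustration, a minimal sketch of both call forms with a plain Node.js `EventEmitter` (the optional `listener` argument is available on recent Node.js versions, per the description above):

```js
import { EventEmitter } from 'node:events';

const ee = new EventEmitter();
const handler = () => {};

ee.on('ping', handler);
ee.on('ping', handler);  // the same listener registered twice
ee.on('ping', () => {}); // a different listener

console.log(ee.listenerCount('ping'));          // Prints: 3
console.log(ee.listenerCount('ping', handler)); // Prints: 2
```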
* **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from Logger.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externalinheritedlog * ****log**(level, message, ...args): void - Inherited from Logger.log #### Parameters * ##### externallevel: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from Logger.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from Logger.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from Logger.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. 
``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from Logger.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from Logger.prependOnceListener Adds a **one-time** `listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from Logger.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`).
``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from Logger.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from Logger.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from `emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed.
This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from Logger.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalinheritedsetOptions * ****setOptions**(options): void - Inherited from Logger.setOptions #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from Logger.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. 
### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from Logger.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from Logger.getMaxListeners Returns the currently set maximum number of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from Logger.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `emitter.listenerCount()` instead.
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from Logger.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from Logger.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special `'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the `'error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from Logger.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # NonRetryableError Errors of `NonRetryableError` type will never be retried by the crawler.
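As a usage sketch (assuming a `CheerioCrawler` and that `NonRetryableError` is imported from the `crawlee` package; the "Not Found" check is a made-up example, not part of the API):

```js
import { CheerioCrawler, NonRetryableError } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Hypothetical check: a page that explicitly reports "Not Found"
        // will not get better on retry, so fail the request immediately.
        if ($('title').text().includes('Not Found')) {
            throw new NonRetryableError(`Page not found: ${request.url}`);
        }
        // ... normal extraction logic
    },
});
```

Throwing any other error lets the crawler retry the request according to its retry settings; throwing a `NonRetryableError` marks the request as failed right away.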
### Hierarchy * Error * *NonRetryableError* * [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1082)externalconstructor * ****new NonRetryableError**(message): [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) * ****new NonRetryableError**(message, options): [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) - Inherited from Error.constructor #### Parameters * ##### externaloptionalmessage: string #### Returns [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from Error.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from Error.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from Error.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from Error.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from Error.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from Error.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. 
For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from Error.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from Error.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # ProxyConfiguration Configures connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections. You can get information about the currently used proxy by inspecting the [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) property in your crawler's page function. There, you can inspect the proxy's URL and other attributes. If you want to use your own proxies, use the [ProxyConfigurationOptions.proxyUrls](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md#proxyUrls) option. Your list of proxy URLs will be rotated by the configuration if this option is provided. **Example usage:** ``` const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['...', '...'], }); const crawler = new CheerioCrawler({ // ... proxyConfiguration, requestHandler({ proxyInfo }) { const usedProxyUrl = proxyInfo.url; // Getting the proxy URL } }) ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**isManInTheMiddle](#isManInTheMiddle) ### Methods * [**newProxyInfo](#newProxyInfo) * [**newUrl](#newUrl) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L233)constructor * ****new ProxyConfiguration**(options): [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) - Creates a [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) instance based on the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections. ``` const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://user:pass@proxy-1.com', 'http://user:pass@proxy-2.com'], }); const crawler = new CheerioCrawler({ // ... 
proxyConfiguration, requestHandler({ proxyInfo }) { const usedProxyUrl = proxyInfo.url; // Getting the proxy URL } }) ``` *** #### Parameters * ##### options: [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) = {} #### Returns [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ## Properties[**](#Properties) ### [**](#isManInTheMiddle)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L204)isManInTheMiddle **isManInTheMiddle: boolean = false ## Methods[**](#Methods) ### [**](#newProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L274)newProxyInfo * ****newProxyInfo**(sessionId, options): Promise\ - This function creates a new [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) info object. It is used by CheerioCrawler and PuppeteerCrawler to generate proxy URLs and also to allow the user to inspect the currently used proxy via the requestHandler parameter `proxyInfo`. Use it if you want to work with a rich representation of a proxy URL. If you need the URL string only, use [ProxyConfiguration.newUrl](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newUrl). *** #### Parameters * ##### optionalsessionId: string | number Represents the identifier of user [Session](https://crawlee.dev/js/api/core/class/Session.md) that can be managed by the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) or you can use the Apify Proxy [Session](https://docs.apify.com/proxy#sessions) identifier. When the provided sessionId is a number, it's converted to a string. Property sessionId of [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) is always returned as a type string. All the HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier must not be longer than 50 characters and include only the following: `0-9`, `a-z`, `A-Z`, `"."`, `"_"` and `"~"`. * ##### optionaloptions: TieredProxyOptions #### Returns Promise\ Represents information about used proxy and its configuration. ### [**](#newUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L383)newUrl * ****newUrl**(sessionId, options): Promise\ - Returns a new proxy URL based on provided configuration options and the `sessionId` parameter. *** #### Parameters * ##### optionalsessionId: string | number Represents the identifier of user [Session](https://crawlee.dev/js/api/core/class/Session.md) that can be managed by the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) or you can use the Apify Proxy [Session](https://docs.apify.com/proxy#sessions) identifier. When the provided sessionId is a number, it's converted to a string. All the HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier must not be longer than 50 characters and include only the following: `0-9`, `a-z`, `A-Z`, `"."`, `"_"` and `"~"`. * ##### optionaloptions: TieredProxyOptions #### Returns Promise\ A string with a proxy URL, including authentication credentials and port number. For example, `http://bob:password123@proxy.example.com:8000` --- # externalPseudoUrl Represents a pseudo-URL (PURL) - a URL pattern used to find the matching URLs on a page or html document. 
A PURL is simply a URL with special directives enclosed in `[]` brackets. Currently, the only supported directive is `[RegExp]`, which defines a JavaScript-style regular expression to match against the URL. The `PseudoUrl` class can be constructed either using a pseudo-URL string or a regular expression (an instance of the `RegExp` object). With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use an appropriate `RegExp` object. Internally, the `PseudoUrl` class uses the `purlToRegExp` function, which parses the provided PURL and converts it to an instance of the `RegExp` object (in case it is not one already). For example, a PURL `http://www.example.com/pages/[(\w|-)*]` will match all of the following URLs: * `http://www.example.com/pages/` * `http://www.example.com/pages/my-awesome-page` * `http://www.example.com/pages/something` Be careful to correctly escape special characters in the pseudo-URL string. If either `[` or `]` is part of the normal query string, it must be encoded as `[\x5B]` or `[\x5D]`, respectively. For example, the following PURL: ``` http://www.example.com/search?do[\x5B]load[\x5D]=1 ``` will match the URL: ``` http://www.example.com/search?do[load]=1 ``` If the regular expression in the pseudo-URL contains a backslash character (`\`), you need to escape it with another backslash, as shown in the example below. **Example usage:** ``` // Using a pseudo-URL string const purl = new PseudoUrl('http://www.example.com/pages/[(\\w|-)+]'); // Using a regular expression const purl2 = new PseudoUrl(/http:\/\/www\.example\.com\/pages\/(\w|-)+/); if (purl.matches('http://www.example.com/pages/my-awesome-page')) console.log('Match!'); ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**regex](#regex) ### Methods * [**matches](#matches) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/pseudo_url/src/index.d.ts#L58)externalconstructor * ****new PseudoUrl**(purl): [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) - #### Parameters * ##### externalpurl: string | RegExp A pseudo-URL string or a regular expression object. Using a `RegExp` instance enables more granular control, such as making the matching case-sensitive. #### Returns [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ## Properties[**](#Properties) ### [**](#regex)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/pseudo_url/src/index.d.ts#L51)externalreadonlyregex **regex: RegExp ## Methods[**](#Methods) ### [**](#matches)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/pseudo_url/src/index.d.ts#L62)externalmatches * ****matches**(url): boolean - Determines whether a URL matches this pseudo-URL pattern. *** #### Parameters * ##### externalurl: string #### Returns boolean --- # RecoverableState \ A class for managing persistent recoverable state using a plain JavaScript object. This class facilitates state persistence to a `KeyValueStore`, allowing data to be saved and retrieved across migrations or restarts. It manages the loading, saving, and resetting of state data, with optional persistence capabilities. The state is represented by a plain JavaScript object that can be serialized to and deserialized from JSON. The class automatically hooks into the event system to persist state when needed.
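A rough usage sketch follows. The method calls mirror the API documented below, but the constructor option names (`defaultState`, `persistStateKey`, `persistenceEnabled`) and the import path are assumptions; check [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) for the actual shape.

```js
// Sketch only - the option names are assumptions, see RecoverableStateOptions.
import { RecoverableState } from '@crawlee/core';

const progress = new RecoverableState({
    defaultState: { processedUrls: 0 },
    persistStateKey: 'CRAWL_PROGRESS',
    persistenceEnabled: true,
});

await progress.initialize();            // load previously persisted state, if any
progress.currentValue.processedUrls++;  // mutate the plain state object directly
await progress.persistState();          // write the current state to the KeyValueStore
await progress.teardown();              // persist once more and stop listening for PERSIST_STATE
```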
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Accessors * [**currentValue](#currentValue) ### Methods * [**initialize](#initialize) * [**persistState](#persistState) * [**reset](#reset) * [**teardown](#teardown) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L93)constructor * ****new RecoverableState**\(options): [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md)\ - Initialize a new recoverable state object. *** #### Parameters * ##### options: [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md)\ Configuration options for the recoverable state #### Returns [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md)\ ## Accessors[**](#Accessors) ### [**](#currentValue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L157)currentValue * **get currentValue(): TStateModel - Get the current state. *** #### Returns TStateModel ## Methods[**](#Methods) ### [**](#initialize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L115)initialize * ****initialize**(): Promise\ - Initialize the recoverable state. This method must be called before using the recoverable state. It loads the saved state if persistence is enabled and registers the object to listen for PERSIST\_STATE events. *** #### Returns Promise\ The loaded state object ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L191)persistState * ****persistState**(eventData): Promise\ - Persist the current state to the KeyValueStore. This method is typically called in response to a PERSIST\_STATE event, but can also be called directly when needed. *** #### Parameters * ##### optionaleventData: { isMigrating: boolean } Optional data associated with a PERSIST\_STATE event * ##### isMigrating: boolean #### Returns Promise\ ### [**](#reset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L171)reset * ****reset**(): Promise\ - Reset the state to the default values and clear any persisted state. Resets the current state to the default state and, if persistence is enabled, clears the persisted state from the KeyValueStore. *** #### Returns Promise\ ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L144)teardown * ****teardown**(): Promise\ - Clean up resources used by the recoverable state. If persistence is enabled, this method deregisters the object from PERSIST\_STATE events and persists the current state one last time. *** #### Returns Promise\ --- # Request \ Represents a URL to be crawled, optionally including HTTP method, headers, payload and other metadata. The `Request` object also stores information about errors that occurred during processing of the request. Each `Request` instance has the `uniqueKey` property, which can be either specified manually in the constructor or generated automatically from the URL. Two requests with the same `uniqueKey` are considered as pointing to the same web resource. 
This behavior applies to all Crawlee classes, such as [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md), [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). > To access and examine the actual request sent over HTTP, with all auto-filled headers, you can access the `response.request` object from the request handler. Example use: ``` const request = new Request({ url: 'http://www.example.com', headers: { Accept: 'application/json' }, }); ... request.userData.foo = 'bar'; request.pushErrorMessage(new Error('Request failed!')); ... const foo = request.userData.foo; ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**errorMessages](#errorMessages) * [**handledAt](#handledAt) * [**headers](#headers) * [**id](#id) * [**loadedUrl](#loadedUrl) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**userData](#userData) ### Accessors * [**crawlDepth](#crawlDepth) * [**label](#label) * [**maxRetries](#maxRetries) * [**sessionRotationCount](#sessionRotationCount) * [**skipNavigation](#skipNavigation) * [**state](#state) ### Methods * [**pushErrorMessage](#pushErrorMessage) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L140)constructor * ****new Request**\(options): [Request](https://crawlee.dev/js/api/core/class/Request.md)\ - `Request` parameters including the URL, HTTP method and headers, and others. *** #### Parameters * ##### options: [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)\ #### Returns [Request](https://crawlee.dev/js/api/core/class/Request.md)\ ## Properties[**](#Properties) ### [**](#errorMessages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L120)errorMessages **errorMessages: string\[] An array of error messages from request processing. ### [**](#handledAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L135)optionalhandledAt **handledAt? : string ISO datetime string that indicates the time when the request has been processed. It is `null` if the request has not been crawled yet. ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L123)optionalheaders **headers? : Record\ Object with HTTP headers. Key is the header name, value is its value. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L86)optionalid **id? : string Request ID ### [**](#loadedUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L99)optionalloadedUrl **loadedUrl? : string The URL that was actually loaded after redirects, if present. HTTP redirects are guaranteed to be included. When using [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), meta tag and JavaScript redirects may or may not be included, depending on their nature. This generally means that redirects that happen immediately will most likely be included, but delayed redirects will not.
### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L108)method **method: AllowedHttpMethods HTTP method, e.g. `GET` or `POST`. ### [**](#noRetry)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L114)noRetry **noRetry: boolean The `true` value indicates that the request will not be automatically retried on error. ### [**](#payload)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L111)optionalpayload **payload? : string HTTP request payload, e.g. for POST requests. ### [**](#retryCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L117)retryCount **retryCount: number Indicates the number of times the crawling of the request has been retried on error. ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L105)uniqueKey **uniqueKey: string A unique key identifying the request. Two requests with the same `uniqueKey` are considered as pointing to the same URL. ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L89)url **url: string URL of the web page to crawl. ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L129)userData **userData: UserData = ... Custom user data assigned to the request. ## Accessors[**](#Accessors) ### [**](#crawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L280)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L288)crawlDepth * **get crawlDepth(): number * **set crawlDepth(value): void - Depth of the request in the current crawl tree. Note that this is dependent on the crawler setup and might produce unexpected results when used with multiple crawlers. *** #### Returns number - Depth of the request in the current crawl tree. Note that this is dependent on the crawler setup and might produce unexpected results when used with multiple crawlers. *** #### Parameters * ##### value: number #### Returns void ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L308)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L313)label * **get label(): undefined | string * **set label(value): void - shortcut for getting `request.userData.label` *** #### Returns undefined | string - shortcut for setting `request.userData.label` *** #### Parameters * ##### value: undefined | string #### Returns void ### [**](#maxRetries)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L318)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L323)maxRetries * **get maxRetries(): undefined | number * **set maxRetries(value): void - Maximum number of retries for this request. Allows to override the global `maxRequestRetries` option of `BasicCrawler`. *** #### Returns undefined | number - Maximum number of retries for this request. Allows to override the global `maxRequestRetries` option of `BasicCrawler`. 
*** #### Parameters * ##### value: undefined | number #### Returns void ### [**](#sessionRotationCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L294)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L299)sessionRotationCount * **get sessionRotationCount(): number * **set sessionRotationCount(value): void - Indicates the number of times the crawling of the request has rotated the session due to a session or a proxy error. *** #### Returns number - Indicates the number of times the crawling of the request has rotated the session due to a session or a proxy error. *** #### Parameters * ##### value: number #### Returns void ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L263)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L268)skipNavigation * **get skipNavigation(): boolean * **set skipNavigation(value): void - Tells the crawler processing this request to skip the navigation and process the request directly. *** #### Returns boolean - Tells the crawler processing this request to skip the navigation and process the request directly. *** #### Parameters * ##### value: boolean #### Returns void ### [**](#state)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L332)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L337)state * **get state(): [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) * **set state(value): void - Describes the request's current lifecycle state. *** #### Returns [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) - Describes the request's current lifecycle state. *** #### Parameters * ##### value: [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) #### Returns void ## Methods[**](#Methods) ### [**](#pushErrorMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L370)pushErrorMessage * ****pushErrorMessage**(errorOrMessage, options): void - Stores information about an error that occurred during processing of this request. You should always use Error instances when throwing errors in JavaScript. Nevertheless, to improve the debugging experience when using third party libraries that may not always throw an Error instance, the function performs a type inspection of the passed argument and attempts to extract as much information as possible, since just throwing a bad type error makes any debugging rather difficult. *** #### Parameters * ##### errorOrMessage: unknown Error object or error message to be stored in the request. * ##### optionaloptions: [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) = {} #### Returns void --- # RequestHandlerResult experimental A partial implementation of [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) that stores parameters of calls to context methods for later inspection. 
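To make the recording behaviour concrete, here is a minimal sketch based on the constructor and accessors documented below. It assumes `RequestHandlerResult` and `Configuration` are importable from `@crawlee/core`, and the state key string is arbitrary; since the class is experimental, treat this as an illustration rather than a stable recipe.

```
import { Configuration, RequestHandlerResult } from '@crawlee/core';

// The second argument is an arbitrary key-value store key chosen for this sketch.
const result = new RequestHandlerResult(Configuration.getGlobalConfig(), 'CRAWLEE_STATE');

// Calls made through the result object are recorded for later inspection.
await result.pushData({ title: 'Example page' });
await result.addRequests(['https://example.com/next-page']);

console.log(result.datasetItems); // [{ item: { title: 'Example page' } }]
console.log(result.enqueuedUrls); // [{ url: 'https://example.com/next-page' }]
console.log(result.calls); // raw arguments of the pushData and addRequests calls
```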
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**addRequests](#addRequests) * [**getKeyValueStore](#getKeyValueStore) * [**pushData](#pushData) * [**useState](#useState) ### Accessors * [**calls](#calls) * [**datasetItems](#datasetItems) * [**enqueuedUrlLists](#enqueuedUrlLists) * [**enqueuedUrls](#enqueuedUrls) * [**keyValueStoreChanges](#keyValueStoreChanges) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L182)constructor * ****new RequestHandlerResult**(config, crawleeStateKey): [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) - experimental #### Parameters * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) * ##### crawleeStateKey: string #### Returns [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L266)addRequests **addRequests: (requestsLike, options) => Promise\ = ... #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> #### Returns Promise\ ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L275)getKeyValueStore **getKeyValueStore: (idOrName) => Promise\> = ... #### Type declaration * * **(idOrName): Promise\> - #### Parameters * ##### optionalidOrName: string #### Returns Promise\> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L262)pushData **pushData: (data, datasetIdOrName) => Promise\ = ... #### Type declaration * * **(data, datasetIdOrName): Promise\ - This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L270)useState **useState: \(defaultValue) => Promise\ = ... #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Accessors[**](#Accessors) ### [**](#calls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L190)calls * **get calls(): ReadonlyObjectDeep<{ addRequests: \[requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex? : RegExp; requestsFromUrl? 
: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[], options?: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)>]\[]; pushData: \[data: ReadonlyDeep\, datasetIdOrName?: string]\[] }> - experimental A record of calls to [RestrictedCrawlingContext.pushData](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#pushData), [RestrictedCrawlingContext.addRequests](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#addRequests), [RestrictedCrawlingContext.enqueueLinks](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#enqueueLinks) made by a request handler. *** #### Returns ReadonlyObjectDeep<{ addRequests: \[requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[], options?: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)>]\[]; pushData: \[data: ReadonlyDeep\, datasetIdOrName?: string]\[] }> ### [**](#datasetItems)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L212)datasetItems * **get datasetItems(): readonly ReadonlyObjectDeep<{ datasetIdOrName? : string; item: Dictionary }>\[] - experimental Items added to datasets by a request handler. *** #### Returns readonly ReadonlyObjectDeep<{ datasetIdOrName?: string; item: Dictionary }>\[] ### [**](#enqueuedUrlLists)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L244)enqueuedUrlLists * **get enqueuedUrlLists(): readonly ReadonlyObjectDeep<{ label? : string; listUrl: string }>\[] - experimental URL lists enqueued to the request queue by a request handler via [RestrictedCrawlingContext.addRequests](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#addRequests) using the `requestsFromUrl` option. *** #### Returns readonly ReadonlyObjectDeep<{ label?: string; listUrl: string }>\[] ### [**](#enqueuedUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L221)enqueuedUrls * **get enqueuedUrls(): readonly ReadonlyObjectDeep<{ label? : string; url: string }>\[] - experimental URLs enqueued to the request queue by a request handler, either via [RestrictedCrawlingContext.addRequests](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#addRequests) or [RestrictedCrawlingContext.enqueueLinks](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md#enqueueLinks) *** #### Returns readonly ReadonlyObjectDeep<{ label?: string; url: string }>\[] ### [**](#keyValueStoreChanges)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L203)keyValueStoreChanges * **get keyValueStoreChanges(): ReadonlyObjectDeep\ : [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) }>>> - experimental A record of changes made to key-value stores by a request handler. *** #### Returns ReadonlyObjectDeep\: [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) }>>> --- # RequestList Represents a static list of URLs to crawl. The URLs can be provided either in code or parsed from a text file hosted on the web. 
`RequestList` is used by [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md), [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) as a source of URLs to crawl. Each URL is represented using an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) class. The list can only contain unique URLs. More precisely, it can only contain `Request` instances with distinct `uniqueKey` properties. By default, `uniqueKey` is generated from the URL, but it can also be overridden. To add a single URL to the list multiple times, corresponding [Request](https://crawlee.dev/js/api/core/class/Request.md) objects will need to have different `uniqueKey` properties. You can use the `keepDuplicateUrls` option to do this for you when initializing the `RequestList` from sources. `RequestList` doesn't have a public constructor, you need to create it with the asynchronous [RequestList.open](https://crawlee.dev/js/api/core/class/RequestList.md#open) function. After the request list is created, no more URLs can be added to it. Unlike [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), `RequestList` is static but it can contain even millions of URLs. > Note that `RequestList` can be used together with `RequestQueue` by the same crawler. In such cases, each request from `RequestList` is enqueued into `RequestQueue` first and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there is a large number of initial URLs, but more URLs would be added dynamically by the crawler. `RequestList` has an internal state where it stores information about which requests were already handled, which are in progress and which were reclaimed. The state may be automatically persisted to the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) by setting the `persistStateKey` option so that if the Node.js process is restarted, the crawling can continue where it left off. The automated persisting is launched upon receiving the `persistState` event that is periodically emitted by [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md). The internal state is closely tied to the provided sources (URLs). If the sources change on crawler restart, the state will become corrupted and `RequestList` will raise an exception. This typically happens when the sources is a list of URLs downloaded from the web. In such case, use the `persistRequestsKey` option in conjunction with `persistStateKey`, to make the `RequestList` store the initial sources to the default key-value store and load them after restart, which will prevent any issues that a live list of URLs might cause. **Basic usage:** ``` const requestList = await RequestList.open('my-request-list', [ 'http://www.example.com/page-1', { url: 'http://www.example.com/page-2', method: 'POST', userData: { foo: 'bar' }}, { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } }, ]); ``` **Advanced usage:** ``` const requestList = await RequestList.open(null, [ // Separate requests { url: 'http://www.example.com/page-1', method: 'GET', headers: { ... 
} }, { url: 'http://www.example.com/page-2', userData: { foo: 'bar' }}, // Bulk load of URLs from file `http://www.example.com/my-url-list.txt` // Note that all URLs must start with http:// or https:// { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } }, ], { // Persist the state to avoid re-crawling which can lead to data duplications. // Keep in mind that the sources have to be immutable or this will throw an error. persistStateKey: 'my-state', }); ``` ### Implements * [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ## Index[**](#Index) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**fetchNextRequest](#fetchNextRequest) * [**getState](#getState) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**length](#length) * [**markRequestHandled](#markRequestHandled) * [**persistState](#persistState) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L684)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Implementation of IRequestList.\[asyncIterator] Can be used to iterate over the `RequestList` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L657)fetchNextRequest * ****fetchNextRequest**(): Promise\> - Implementation of IRequestList.fetchNextRequest Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. First, the function gets a request previously reclaimed using the [RequestList.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestList.md#reclaimRequest) function, if there is any. Otherwise it gets the next request from sources. The function's `Promise` resolves to `null` if there are no more requests to process. *** #### Returns Promise\> ### [**](#getState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L626)getState * ****getState**(): [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) - Returns an object representing the internal state of the `RequestList` instance. Note that the object's fields can change in future releases. *** #### Returns [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L870)handledCount * ****handledCount**(): number - Implementation of IRequestList.handledCount Returns number of handled requests. *** #### Returns number ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L639)isEmpty * ****isEmpty**(): Promise\ - Implementation of IRequestList.isEmpty Resolves to `true` if the next call to [IRequestList.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestList.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the list is empty, there might be some pending requests currently being processed. 
*** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L648)isFinished * ****isFinished**(): Promise\ - Implementation of IRequestList.isFinished Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#length)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L861)length * ****length**(): number - Implementation of IRequestList.length Returns the total number of unique requests present in the `RequestList`. *** #### Returns number ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L704)markRequestHandled * ****markRequestHandled**(request): Promise\ - Implementation of IRequestList.markRequestHandled Marks request as handled after successful processing. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L504)persistState * ****persistState**(): Promise\ - Implementation of IRequestList.persistState Persists the current state of the `IRequestList` into the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). The state is persisted automatically in regular intervals, but calling this method manually is useful in cases where you want to have the most current state available after you pause or stop fetching its requests. For example after you pause or abort a crawl. Or just before a server migration. *** #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L718)reclaimRequest * ****reclaimRequest**(request): Promise\ - Implementation of IRequestList.reclaimRequest Reclaims request to the list if its processing failed. The request will become available in the next `this.fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L929)staticopen * ****open**(listNameOrOptions, sources, options): Promise<[RequestList](https://crawlee.dev/js/api/core/class/RequestList.md)> - Opens a request list and returns a promise resolving to an instance of the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that is already initialized. [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) represents a list of URLs to crawl, which is always stored in memory. To enable picking up where left off after a process restart, the request list sources are persisted to the key-value store at initialization of the list. Then, while crawling, a small state object is regularly persisted to keep track of the crawling status. For more details and code examples, see the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class. **Example usage:** ``` const sources = [ 'https://www.example.com', 'https://www.google.com', 'https://www.bing.com' ]; const requestList = await RequestList.open('my-name', sources); ``` *** #### Parameters * ##### listNameOrOptions: null | string | [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) Name of the request list to be opened, or the options object. 
Setting a name enables the `RequestList`'s state to be persisted in the key-value store. This is useful in case of a restart or migration. Since `RequestList` is only stored in memory, a restart or migration wipes it clean. Setting a name will enable the `RequestList`'s state to survive those situations and continue where it left off. The name will be used as a prefix in key-value store, producing keys such as `NAME-REQUEST_LIST_STATE` and `NAME-REQUEST_LIST_SOURCES`. If `null`, the list will not be persisted and will only be stored in memory. Process restart will then cause the list to be crawled again from the beginning. We suggest always using a name. * ##### optionalsources: RequestListSource\[] An array of sources of URLs for the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md). It can be either an array of strings, plain objects that define at least the `url` property, or an array of [Request](https://crawlee.dev/js/api/core/class/Request.md) instances. **IMPORTANT:** The `sources` array will be consumed (left empty) after [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) initializes. This is a measure to prevent memory leaks in situations when millions of sources are added. Additionally, the `requestsFromUrl` property may be used instead of `url`, which will instruct [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) to download the source URLs from a given remote location. The URLs will be parsed from the received response. In this case you can limit the URLs using `regex` parameter containing regular expression pattern for URLs to be included. For details, see the [RequestListOptions.sources](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#sources) * ##### optionaloptions: [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) = {} The [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) options. Note that the `listName` parameter supersedes the [RequestListOptions.persistStateKey](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#persistStateKey) and [RequestListOptions.persistRequestsKey](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#persistRequestsKey) options and the `sources` parameter supersedes the [RequestListOptions.sources](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#sources) option. #### Returns Promise<[RequestList](https://crawlee.dev/js/api/core/class/RequestList.md)> --- # RequestManagerTandem A request manager that combines a RequestList and a RequestQueue. It first reads requests from the RequestList and then, when needed, transfers them in batches to the RequestQueue. 
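The sketch below shows roughly how the tandem ties the two sources together, using the constructor and methods documented below; the import of `RequestManagerTandem` from the `crawlee` package is an assumption, and the comments describe the expected behavior based on the class description.

```
import { RequestList, RequestManagerTandem, RequestQueue } from 'crawlee';

// A static list of start URLs plus a dynamic queue for URLs discovered during the crawl.
const requestList = await RequestList.open('my-list', ['https://example.com/page-1']);
const requestQueue = await RequestQueue.open();

// The tandem serves requests from the list first and transfers them to the queue as needed.
const tandem = new RequestManagerTandem(requestList, requestQueue);

// Newly discovered requests are added through the tandem's queue side.
await tandem.addRequest({ url: 'https://example.com/discovered' });

const request = await tandem.fetchNextRequest();
if (request) {
    // ... process the request ...
    await tandem.markRequestHandled(request);
}
```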
### Implements * [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequestsBatched](#addRequestsBatched) * [**fetchNextRequest](#fetchNextRequest) * [**getPendingCount](#getPendingCount) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L27)constructor * ****new RequestManagerTandem**(requestList, requestQueue): [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) - #### Parameters * ##### requestList: [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) * ##### requestQueue: [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) #### Returns [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L122)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Implementation of IRequestManager.\[asyncIterator] Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L150)addRequest * ****addRequest**(requestLike, options): Promise\ - Implementation of IRequestManager.addRequest * **@inheritDoc** *** #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) #### Returns Promise\ ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L157)addRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - Implementation of IRequestManager.addRequestsBatched * **@inheritDoc** *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) * ##### optionaloptions: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L66)fetchNextRequest * ****fetchNextRequest**\(): Promise\> - Implementation of IRequestManager.fetchNextRequest Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. The function's `Promise` resolves to `null` if there are no more requests to process. 
*** #### Returns Promise\> ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L115)getPendingCount * ****getPendingCount**(): number - Implementation of IRequestManager.getPendingCount Get an offline approximation of the number of pending requests. *** #### Returns number ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L108)getTotalCount * ****getTotalCount**(): number - Implementation of IRequestManager.getTotalCount Get the total number of requests known to the request manager. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L100)handledCount * ****handledCount**(): Promise\ - Implementation of IRequestManager.handledCount Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L92)isEmpty * ****isEmpty**(): Promise\ - Implementation of IRequestManager.isEmpty Resolves to `true` if the next call to [IRequestManager.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestManager.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the provider is empty, there might be some pending requests currently being processed. *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L84)isFinished * ****isFinished**(): Promise\ - Implementation of IRequestManager.isFinished Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L133)markRequestHandled * ****markRequestHandled**(request): Promise\ - Implementation of IRequestManager.markRequestHandled Marks request as handled after successful processing. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L140)reclaimRequest * ****reclaimRequest**(request, options): Promise\ - Implementation of IRequestManager.reclaimRequest Reclaims request to the provider if its processing failed. The request will become available in the next `fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) #### Returns Promise\ --- # abstractRequestProvider Represents a provider of requests/URLs to crawl. 
### Hierarchy * *RequestProvider* * [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) * [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### Implements * [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) * [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**assumedHandledCount](#assumedHandledCount) * [**assumedTotalCount](#assumedTotalCount) * [**client](#client) * [**clientKey](#clientKey) * [**config](#config) * [**id](#id) * [**internalTimeoutMillis](#internalTimeoutMillis) * [**log](#log) * [**name](#name) * [**requestLockSecs](#requestLockSecs) * [**timeoutSecs](#timeoutSecs) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequests](#addRequests) * [**addRequestsBatched](#addRequestsBatched) * [**drop](#drop) * [**fetchNextRequest](#fetchNextRequest) * [**getInfo](#getInfo) * [**getPendingCount](#getPendingCount) * [**getRequest](#getRequest) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L135)constructor * ****new RequestProvider**(options, config): [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) - #### Parameters * ##### options: InternalRequestProviderOptions * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ## Properties[**](#Properties) ### [**](#assumedHandledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L117)assumedHandledCount **assumedHandledCount: number = 0 ### [**](#assumedTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L116)assumedTotalCount **assumedTotalCount: number = 0 ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L107)client **client: [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) ### [**](#clientKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L106)clientKey **clientKey: string = ... ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L137)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L103)id **id: string Implementation of IStorage.id ### [**](#internalTimeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L111)internalTimeoutMillis **internalTimeoutMillis: number = ... ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L110)log **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L104)optionalname **name? 
: string Implementation of IStorage.name ### [**](#requestLockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L112)requestLockSecs **requestLockSecs: number = ... ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L105)timeoutSecs **timeoutSecs: number = 30 ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L728)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Implementation of IRequestManager.\[asyncIterator] Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L191)addRequest * ****addRequest**(requestLike, options): Promise\ - Implementation of IRequestManager.addRequest Adds a request to the queue. If a request with the same `uniqueKey` property is already present in the queue, it will not be updated. You can find out whether this happened from the resulting [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) object. To add multiple requests to the queue by extracting links from a webpage, see the enqueueLinks helper function. *** #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) [Request](https://crawlee.dev/js/api/core/class/Request.md) object or vanilla object with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed Request. * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L275)addRequests * ****addRequests**(requestsLike, options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Adds requests to the queue in batches of 25. This method will wait till all the requests are added to the queue before resolving. You should prefer using `queue.addRequestsBatched()` or `crawler.addRequests()` if you don't want to block the processing, as those methods will only wait for the initial 1000 requests, start processing right after that happens, and continue adding more in the background. If a request passed in is already present due to its `uniqueKey` property being the same, it will not be updated. You can find out whether this happened by finding the request in the resulting [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. *** #### Parameters * ##### requestsLike: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) [Request](https://crawlee.dev/js/api/core/class/Request.md) objects or vanilla objects with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed requests if missing. * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. 
#### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L395)addRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - Implementation of IRequestManager.addRequestsBatched Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) = {} Options for the request queue #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L717)drop * ****drop**(): Promise\ - Removes the queue either from the Apify Cloud storage or from the local database, depending on the mode of operation. *** #### Returns Promise\ ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L563)abstractfetchNextRequest * ****fetchNextRequest**\(): Promise\> - Implementation of IRequestManager.fetchNextRequest Returns a next request in the queue to be processed, or `null` if there are no more pending requests. Once you successfully finish processing of the request, you need to call [RequestQueue.markRequestHandled](https://crawlee.dev/js/api/core/class/RequestQueue.md#markRequestHandled) to mark the request as handled in the queue. If there was some error in processing the request, call [RequestQueue.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#reclaimRequest) instead, so that the queue will give the request to some other consumer in another call to the `fetchNextRequest` function. Note that the `null` return value doesn't mean the queue processing finished, it means there are currently no pending requests. To check whether all requests in queue were finished, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished) instead. *** #### Returns Promise\> Returns the request object or `null` if there are no more pending requests. ### [**](#getInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L776)getInfo * ****getInfo**(): Promise\ - Returns an object containing general information about the request queue. The function returns the same object as the Apify API Client's [getQueue](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-requestQueues) function, which in turn calls the [Get request queue](https://apify.com/docs/api/v2#/reference/request-queues/queue/get-request-queue) API endpoint. 
**Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-queue", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), totalRequestCount: 25, handledRequestCount: 5, pendingRequestCount: 20, } ``` *** #### Returns Promise\ ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L173)getPendingCount * ****getPendingCount**(): number - Implementation of IRequestManager.getPendingCount Returns an offline approximation of the total number of pending requests in the queue. Survives restarts and Actor migrations. *** #### Returns number ### [**](#getRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L535)getRequest * ****getRequest**\(id): Promise\> - Gets the request from the queue specified by ID. *** #### Parameters * ##### id: string ID of the request. #### Returns Promise\> Returns the request object, or `null` if it was not found. ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L164)getTotalCount * ****getTotalCount**(): number - Implementation of IRequestManager.getTotalCount Returns an offline approximation of the total number of requests in the queue (i.e. pending + handled). Survives restarts and actor migrations. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L746)handledCount * ****handledCount**(): Promise\ - Implementation of IRequestManager.handledCount Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L663)isEmpty * ****isEmpty**(): Promise\ - Implementation of IRequestManager.isEmpty Resolves to `true` if the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) would return `null`, otherwise it resolves to `false`. Note that even if the queue is empty, there might be some pending requests currently being processed. If you need to ensure that there is no activity in the queue, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished). *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L674)abstractisFinished * ****isFinished**(): Promise\ - Implementation of IRequestManager.isFinished Resolves to `true` if all requests were already handled and there are no more left. Due to the nature of distributed storage used by the queue, the function may occasionally return a false negative, but it shall never return a false positive. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L571)markRequestHandled * ****markRequestHandled**(request): Promise\ - Implementation of IRequestManager.markRequestHandled Marks a request that was previously returned by the [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) function as handled after successful processing. Handled requests will never again be returned by the `fetchNextRequest` function. 
*** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L617)reclaimRequest * ****reclaimRequest**(request, options): Promise\ - Implementation of IRequestManager.reclaimRequest Reclaims a failed request back to the queue, so that it can be returned for processing later again by another call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). The request record in the queue is updated using the provided `request` parameter. For example, this lets you store the number of retries or error messages for the request. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L857)staticopen * ****open**(queueIdOrName, options): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Opens a request queue and returns a promise resolving to an instance of the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. For more details and code examples, see the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. *** #### Parameters * ##### optionalqueueIdOrName: null | string ID or name of the request queue to be opened. If `null` or `undefined`, the function returns the default request queue associated with the crawler run. * ##### optionaloptions: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) = {} Open Request Queue options. #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> --- # RequestQueue Represents a queue of URLs to crawl, which is used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each URL is represented using an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) class. The queue can only contain unique URLs. More precisely, it can only contain [Request](https://crawlee.dev/js/api/core/class/Request.md) instances with distinct `uniqueKey` properties. By default, `uniqueKey` is generated from the URL, but it can also be overridden. To add a single URL multiple times to the queue, corresponding [Request](https://crawlee.dev/js/api/core/class/Request.md) objects will need to have different `uniqueKey` properties. Do not instantiate this class directly, use the [RequestQueue.open](https://crawlee.dev/js/api/core/class/RequestQueue.md#open) function instead. 
`RequestQueue` is used by [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md), [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) as a source of URLs to crawl. Unlike [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md), `RequestQueue` supports dynamic adding and removing of requests. On the other hand, the queue is not optimized for operations that add or remove a large number of URLs in a batch. **Example usage:** ``` // Open the default request queue associated with the crawler run const queue = await RequestQueue.open(); // Open a named request queue const queueWithName = await RequestQueue.open('some-name'); // Enqueue few requests await queue.addRequest({ url: 'http://example.com/aaa' }); await queue.addRequest({ url: 'http://example.com/bbb' }); await queue.addRequest({ url: 'http://example.com/foo/bar' }, { forefront: true }); ``` ### Hierarchy * [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) * *RequestQueue* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**assumedHandledCount](#assumedHandledCount) * [**assumedTotalCount](#assumedTotalCount) * [**client](#client) * [**clientKey](#clientKey) * [**config](#config) * [**id](#id) * [**internalTimeoutMillis](#internalTimeoutMillis) * [**log](#log) * [**name](#name) * [**requestLockSecs](#requestLockSecs) * [**timeoutSecs](#timeoutSecs) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequests](#addRequests) * [**addRequestsBatched](#addRequestsBatched) * [**drop](#drop) * [**fetchNextRequest](#fetchNextRequest) * [**getInfo](#getInfo) * [**getPendingCount](#getPendingCount) * [**getRequest](#getRequest) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L70)constructor * ****new RequestQueue**(options, config): [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) - Overrides RequestProvider.constructor #### Parameters * ##### options: [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
#### Returns [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ## Properties[**](#Properties) ### [**](#assumedHandledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L117)inheritedassumedHandledCount **assumedHandledCount: number = 0 Inherited from RequestProvider.assumedHandledCount ### [**](#assumedTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L116)inheritedassumedTotalCount **assumedTotalCount: number = 0 Inherited from RequestProvider.assumedTotalCount ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L107)inheritedclient **client: [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) Inherited from RequestProvider.client ### [**](#clientKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L106)inheritedclientKey **clientKey: string = ... Inherited from RequestProvider.clientKey ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L137)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from RequestProvider.config ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L103)inheritedid **id: string Inherited from RequestProvider.id ### [**](#internalTimeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L111)inheritedinternalTimeoutMillis **internalTimeoutMillis: number = ... Inherited from RequestProvider.internalTimeoutMillis ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L110)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from RequestProvider.log ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L104)optionalinheritedname **name? : string Inherited from RequestProvider.name ### [**](#requestLockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L112)inheritedrequestLockSecs **requestLockSecs: number = ... Inherited from RequestProvider.requestLockSecs ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L105)inheritedtimeoutSecs **timeoutSecs: number = 30 Inherited from RequestProvider.timeoutSecs ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L728)inherited\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Inherited from RequestProvider.\[asyncIterator] Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L113)addRequest * ****addRequest**(requestLike, options): Promise\ - Overrides RequestProvider.addRequest Adds a request to the queue. 
If a request with the same `uniqueKey` property is already present in the queue, it will not be updated. You can find out whether this happened from the resulting [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) object. To add multiple requests to the queue by extracting links from a webpage, see the enqueueLinks helper function. *** #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) [Request](https://crawlee.dev/js/api/core/class/Request.md) object or vanilla object with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed Request. * ##### options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L127)addRequests * ****addRequests**(requestsLike, options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Overrides RequestProvider.addRequests Adds requests to the queue in batches of 25. This method will wait till all the requests are added to the queue before resolving. You should prefer using `queue.addRequestsBatched()` or `crawler.addRequests()` if you don't want to block the processing, as those methods will only wait for the initial 1000 requests, start processing right after that happens, and continue adding more in the background. If a request passed in is already present due to its `uniqueKey` property being the same, it will not be updated. You can find out whether this happened by finding the request in the resulting [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. *** #### Parameters * ##### requestsLike: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) [Request](https://crawlee.dev/js/api/core/class/Request.md) objects or vanilla objects with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed requests if missing. * ##### options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L395)inheritedaddRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - Inherited from RequestProvider.addRequestsBatched Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. 
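For example, a minimal sketch of non-blocking batched enqueueing (the URL list and option values are illustrative):

```
import { RequestQueue } from 'crawlee';

const urls = Array.from({ length: 5000 }, (_, i) => `https://example.com/page/${i}`);

const queue = await RequestQueue.open();

// Resolves after the initial batch; the remaining batches keep being added in the background.
const result = await queue.addRequestsBatched(
    urls.map((url) => ({ url })),
    { batchSize: 500, waitBetweenBatchesMillis: 1000 },
);

// Optionally block until every batch has been enqueued.
await result.waitForAllRequestsToBeAdded;
```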
*** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) = {} Options for the request queue #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L717)inheriteddrop * ****drop**(): Promise\ - Inherited from RequestProvider.drop Removes the queue either from the Apify Cloud storage or from the local database, depending on the mode of operation. *** #### Returns Promise\ ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L144)fetchNextRequest * ****fetchNextRequest**\(): Promise\> - Overrides RequestProvider.fetchNextRequest Returns a next request in the queue to be processed, or `null` if there are no more pending requests. Once you successfully finish processing of the request, you need to call [RequestQueue.markRequestHandled](https://crawlee.dev/js/api/core/class/RequestQueue.md#markRequestHandled) to mark the request as handled in the queue. If there was some error in processing the request, call [RequestQueue.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#reclaimRequest) instead, so that the queue will give the request to some other consumer in another call to the `fetchNextRequest` function. Note that the `null` return value doesn't mean the queue processing finished, it means there are currently no pending requests. To check whether all requests in queue were finished, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished) instead. *** #### Returns Promise\> Returns the request object or `null` if there are no more pending requests. ### [**](#getInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L776)inheritedgetInfo * ****getInfo**(): Promise\ - Inherited from RequestProvider.getInfo Returns an object containing general information about the request queue. The function returns the same object as the Apify API Client's [getQueue](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-requestQueues) function, which in turn calls the [Get request queue](https://apify.com/docs/api/v2#/reference/request-queues/queue/get-request-queue) API endpoint. **Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-queue", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), totalRequestCount: 25, handledRequestCount: 5, pendingRequestCount: 20, } ``` *** #### Returns Promise\ ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L173)inheritedgetPendingCount * ****getPendingCount**(): number - Inherited from RequestProvider.getPendingCount Returns an offline approximation of the total number of pending requests in the queue. Survives restarts and Actor migrations. *** #### Returns number ### [**](#getRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L535)inheritedgetRequest * ****getRequest**\(id): Promise\> - Inherited from RequestProvider.getRequest Gets the request from the queue specified by ID. 
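A minimal sketch, assuming the [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) returned by `addRequest` exposes the stored request's ID as `requestId`:

```
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

const { requestId } = await queue.addRequest({ url: 'https://example.com' });

// Look the stored request up again by its ID.
const request = await queue.getRequest(requestId);
console.log(request?.url); // 'https://example.com'
```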
*** #### Parameters * ##### id: string ID of the request. #### Returns Promise\> Returns the request object, or `null` if it was not found. ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L164)inheritedgetTotalCount * ****getTotalCount**(): number - Inherited from RequestProvider.getTotalCount Returns an offline approximation of the total number of requests in the queue (i.e. pending + handled). Survives restarts and actor migrations. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L746)inheritedhandledCount * ****handledCount**(): Promise\ - Inherited from RequestProvider.handledCount Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L663)inheritedisEmpty * ****isEmpty**(): Promise\ - Inherited from RequestProvider.isEmpty Resolves to `true` if the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) would return `null`, otherwise it resolves to `false`. Note that even if the queue is empty, there might be some pending requests currently being processed. If you need to ensure that there is no activity in the queue, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished). *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L204)isFinished * ****isFinished**(): Promise\ - Overrides RequestProvider.isFinished Resolves to `true` if all requests were already handled and there are no more left. Due to the nature of distributed storage used by the queue, the function may occasionally return a false negative, but it shall never return a false positive. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L196)markRequestHandled * ****markRequestHandled**(request): Promise\ - Overrides RequestProvider.markRequestHandled Marks a request that was previously returned by the [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) function as handled after successful processing. Handled requests will never again be returned by the `fetchNextRequest` function. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L286)reclaimRequest * ****reclaimRequest**(...args): Promise\ - Overrides RequestProvider.reclaimRequest Reclaims a failed request back to the queue, so that it can be returned for processing later again by another call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). The request record in the queue is updated using the provided `request` parameter. For example, this lets you store the number of retries or error messages for the request. 
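Taken together, `fetchNextRequest`, `markRequestHandled` and `reclaimRequest` support a manual consumption loop; a minimal sketch (crawlers normally drive this loop for you, and the processing step is hypothetical):

```
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

while (!(await queue.isFinished())) {
    const request = await queue.fetchNextRequest();
    if (!request) continue; // nothing pending right now, although the queue is not finished yet

    try {
        // ... process request.url here ...
        await queue.markRequestHandled(request);
    } catch {
        // Return the request to the queue so it can be retried later, possibly by another consumer.
        await queue.reclaimRequest(request);
    }
}
```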
*** #### Parameters * ##### rest...args: \[request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\, options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)] #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L556)staticopen * ****open**(...args): Promise<[RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md)> - Overrides RequestProvider.open Opens a request queue and returns a promise resolving to an instance of the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. For more details and code examples, see the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. *** #### Parameters * ##### rest...args: \[queueIdOrName?: null | string, options: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md)] ID or name of the request queue to be opened. If `null` or `undefined`, the function returns the default request queue associated with the crawler run. #### Returns Promise<[RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md)> --- # RequestQueueV1 Represents a queue of URLs to crawl, which is used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each URL is represented using an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) class. The queue can only contain unique URLs. More precisely, it can only contain [Request](https://crawlee.dev/js/api/core/class/Request.md) instances with distinct `uniqueKey` properties. By default, `uniqueKey` is generated from the URL, but it can also be overridden. To add a single URL multiple times to the queue, corresponding [Request](https://crawlee.dev/js/api/core/class/Request.md) objects will need to have different `uniqueKey` properties. Do not instantiate this class directly, use the [RequestQueue.open](https://crawlee.dev/js/api/core/class/RequestQueue.md#open) function instead. `RequestQueue` is used by [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md), [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) as a source of URLs to crawl. Unlike [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md), `RequestQueue` supports dynamic adding and removing of requests. On the other hand, the queue is not optimized for operations that add or remove a large number of URLs in a batch. `RequestQueue` stores its data either on local disk or in the Apify Cloud, depending on whether the `APIFY_LOCAL_STORAGE_DIR` or `APIFY_TOKEN` environment variable is set. If the `APIFY_LOCAL_STORAGE_DIR` environment variable is set, the queue data is stored in that directory in an SQLite database file. 
If the `APIFY_TOKEN` environment variable is set but `APIFY_LOCAL_STORAGE_DIR` is not, the data is stored in the [Apify Request Queue](https://docs.apify.com/storage/request-queue) cloud storage. Note that you can force usage of the cloud storage also by passing the `forceCloud` option to [RequestQueue.open](https://crawlee.dev/js/api/core/class/RequestQueue.md#open) function, even if the `APIFY_LOCAL_STORAGE_DIR` variable is set. **Example usage:** ``` // Open the default request queue associated with the crawler run const queue = await RequestQueue.open(); // Open a named request queue const queueWithName = await RequestQueue.open('some-name'); // Enqueue few requests await queue.addRequest({ url: 'http://example.com/aaa' }); await queue.addRequest({ url: 'http://example.com/bbb' }); await queue.addRequest({ url: 'http://example.com/foo/bar' }, { forefront: true }); ``` * **@deprecated** RequestQueue v1 is deprecated and will be removed in the future. Please use [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instead. ### Hierarchy * [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) * *RequestQueueV1* ## Index[**](#Index) ### Properties * [**assumedHandledCount](#assumedHandledCount) * [**assumedTotalCount](#assumedTotalCount) * [**client](#client) * [**clientKey](#clientKey) * [**config](#config) * [**id](#id) * [**internalTimeoutMillis](#internalTimeoutMillis) * [**log](#log) * [**name](#name) * [**requestLockSecs](#requestLockSecs) * [**timeoutSecs](#timeoutSecs) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequests](#addRequests) * [**addRequestsBatched](#addRequestsBatched) * [**drop](#drop) * [**fetchNextRequest](#fetchNextRequest) * [**getInfo](#getInfo) * [**getPendingCount](#getPendingCount) * [**getRequest](#getRequest) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Properties[**](#Properties) ### [**](#assumedHandledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L117)inheritedassumedHandledCount **assumedHandledCount: number = 0 Inherited from RequestProvider.assumedHandledCount ### [**](#assumedTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L116)inheritedassumedTotalCount **assumedTotalCount: number = 0 Inherited from RequestProvider.assumedTotalCount ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L107)inheritedclient **client: [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) Inherited from RequestProvider.client ### [**](#clientKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L106)inheritedclientKey **clientKey: string = ... Inherited from RequestProvider.clientKey ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L137)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from RequestProvider.config ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L103)inheritedid **id: string Inherited from RequestProvider.id ### [**](#internalTimeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L111)inheritedinternalTimeoutMillis **internalTimeoutMillis: number = ... Inherited from RequestProvider.internalTimeoutMillis ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L110)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from RequestProvider.log ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L104)optionalinheritedname **name? : string Inherited from RequestProvider.name ### [**](#requestLockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L112)inheritedrequestLockSecs **requestLockSecs: number = ... Inherited from RequestProvider.requestLockSecs ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L105)inheritedtimeoutSecs **timeoutSecs: number = 30 Inherited from RequestProvider.timeoutSecs ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L728)inherited\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Inherited from RequestProvider.\[asyncIterator] Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L191)inheritedaddRequest * ****addRequest**(requestLike, options): Promise\ - Inherited from RequestProvider.addRequest Adds a request to the queue. If a request with the same `uniqueKey` property is already present in the queue, it will not be updated. You can find out whether this happened from the resulting [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) object. To add multiple requests to the queue by extracting links from a webpage, see the enqueueLinks helper function. *** #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) [Request](https://crawlee.dev/js/api/core/class/Request.md) object or vanilla object with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed Request. * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L275)inheritedaddRequests * ****addRequests**(requestsLike, options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from RequestProvider.addRequests Adds requests to the queue in batches of 25. This method will wait till all the requests are added to the queue before resolving. 
You should prefer using `queue.addRequestsBatched()` or `crawler.addRequests()` if you don't want to block the processing, as those methods will only wait for the initial 1000 requests, start processing right after that happens, and continue adding more in the background. If a request passed in is already present due to its `uniqueKey` property being the same, it will not be updated. You can find out whether this happened by finding the request in the resulting [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. *** #### Parameters * ##### requestsLike: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) [Request](https://crawlee.dev/js/api/core/class/Request.md) objects or vanilla objects with request data. Note that the function sets the `uniqueKey` and `id` fields to the passed requests if missing. * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) = {} Request queue operation options. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L395)inheritedaddRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - Inherited from RequestProvider.addRequestsBatched Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) = {} Options for the request queue #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#drop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L717)inheriteddrop * ****drop**(): Promise\ - Inherited from RequestProvider.drop Removes the queue either from the Apify Cloud storage or from the local database, depending on the mode of operation. *** #### Returns Promise\ ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L128)fetchNextRequest * ****fetchNextRequest**\(): Promise\> - Overrides RequestProvider.fetchNextRequest Returns a next request in the queue to be processed, or `null` if there are no more pending requests. Once you successfully finish processing of the request, you need to call [RequestQueue.markRequestHandled](https://crawlee.dev/js/api/core/class/RequestQueue.md#markRequestHandled) to mark the request as handled in the queue. If there was some error in processing the request, call [RequestQueue.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#reclaimRequest) instead, so that the queue will give the request to some other consumer in another call to the `fetchNextRequest` function. 
Note that the `null` return value doesn't mean the queue processing finished, it means there are currently no pending requests. To check whether all requests in queue were finished, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished) instead. *** #### Returns Promise\> Returns the request object or `null` if there are no more pending requests. ### [**](#getInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L776)inheritedgetInfo * ****getInfo**(): Promise\ - Inherited from RequestProvider.getInfo Returns an object containing general information about the request queue. The function returns the same object as the Apify API Client's [getQueue](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-requestQueues) function, which in turn calls the [Get request queue](https://apify.com/docs/api/v2#/reference/request-queues/queue/get-request-queue) API endpoint. **Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-queue", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), totalRequestCount: 25, handledRequestCount: 5, pendingRequestCount: 20, } ``` *** #### Returns Promise\ ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L173)inheritedgetPendingCount * ****getPendingCount**(): number - Inherited from RequestProvider.getPendingCount Returns an offline approximation of the total number of pending requests in the queue. Survives restarts and Actor migrations. *** #### Returns number ### [**](#getRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L535)inheritedgetRequest * ****getRequest**\(id): Promise\> - Inherited from RequestProvider.getRequest Gets the request from the queue specified by ID. *** #### Parameters * ##### id: string ID of the request. #### Returns Promise\> Returns the request object, or `null` if it was not found. ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L164)inheritedgetTotalCount * ****getTotalCount**(): number - Inherited from RequestProvider.getTotalCount Returns an offline approximation of the total number of requests in the queue (i.e. pending + handled). Survives restarts and actor migrations. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L746)inheritedhandledCount * ****handledCount**(): Promise\ - Inherited from RequestProvider.handledCount Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L663)inheritedisEmpty * ****isEmpty**(): Promise\ - Inherited from RequestProvider.isEmpty Resolves to `true` if the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) would return `null`, otherwise it resolves to `false`. Note that even if the queue is empty, there might be some pending requests currently being processed. If you need to ensure that there is no activity in the queue, use [RequestQueue.isFinished](https://crawlee.dev/js/api/core/class/RequestQueue.md#isFinished). 
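The counting and status helpers above can be combined into a simple progress report; a minimal sketch:

```
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Offline approximations, cheap enough for a monitoring loop.
const total = queue.getTotalCount();
const pending = queue.getPendingCount();

// Asynchronous checks against the underlying storage.
const handled = await queue.handledCount();
const empty = await queue.isEmpty();       // no request would be returned right now
const finished = await queue.isFinished(); // all requests were also handled

console.log(`Queue progress: ${handled}/${total} handled, ${pending} pending`, { empty, finished });
```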
*** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L311)isFinished * ****isFinished**(): Promise\ - Overrides RequestProvider.isFinished Resolves to `true` if all requests were already handled and there are no more left. Due to the nature of distributed storage used by the queue, the function may occasionally return a false negative, but it shall never return a false positive. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L368)markRequestHandled * ****markRequestHandled**(request): Promise\ - Overrides RequestProvider.markRequestHandled Marks a request that was previously returned by the [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest) function as handled after successful processing. Handled requests will never again be returned by the `fetchNextRequest` function. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L338)reclaimRequest * ****reclaimRequest**(...args): Promise\ - Overrides RequestProvider.reclaimRequest Reclaims a failed request back to the queue, so that it can be returned for processing later again by another call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). The request record in the queue is updated using the provided `request` parameter. For example, this lets you store the number of retries or error messages for the request. *** #### Parameters * ##### rest...args: \[request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\, options: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)] #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L398)staticopen * ****open**(...args): Promise<[RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueueV1.md)> - Overrides RequestProvider.open Opens a request queue and returns a promise resolving to an instance of the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. For more details and code examples, see the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. *** #### Parameters * ##### rest...args: \[queueIdOrName?: null | string, options: [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md)] #### Returns Promise<[RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueueV1.md)> --- # RetryRequestError Errors of `RetryRequestError` type will always be retried by the crawler. *This error overrides the `maxRequestRetries` option, i.e. 
the request can be retried indefinitely until it succeeds.* ### Hierarchy * Error * *RetryRequestError* * [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L23)constructor * ****new RetryRequestError**(message): [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) - Overrides Error.constructor #### Parameters * ##### optionalmessage: string #### Returns [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from Error.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from Error.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from Error.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from Error.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from Error.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from Error.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. 
const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from Error.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from Error.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # Router \ Simple router that works based on request labels. This instance can then serve as a `requestHandler` of your crawler. ``` import { Router, CheerioCrawler, CheerioCrawlingContext } from 'crawlee'; const router = Router.create(); // We can also use factory methods for specific crawling contexts; the above is equivalent to: // import { createCheerioRouter } from 'crawlee'; // const router = createCheerioRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.run(); ``` Alternatively, we can use the default router instance from the crawler object: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler(); crawler.router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); crawler.router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); await crawler.run(); ``` For convenience, we can also define the routes right when creating the router: ``` import { CheerioCrawler, createCheerioRouter } from 'crawlee'; const crawler = new CheerioCrawler({ requestHandler: createCheerioRouter({ 'label-a': async (ctx) => { ... }, 'label-b': async (ctx) => { ... }, }), }); await crawler.run(); ``` Middlewares are also supported via the `router.use` method. There can be multiple middlewares for a single router; they are executed sequentially, in the same order in which they were registered. ``` crawler.router.use(async (ctx) => { ctx.log.info('...'); }); ``` ### Hierarchy * *Router* * [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ## Index[**](#Index) ### Methods * [**addDefaultHandler](#addDefaultHandler) * [**addHandler](#addHandler) * [**getHandler](#getHandler) * [**use](#use) * [**create](#create) ## Methods[**](#Methods) ### [**](#addDefaultHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L110)addDefaultHandler * ****addDefaultHandler**\(handler): void - Registers the default route handler. *** #### Parameters * ##### handler: (ctx) => Awaitable\ #### Returns void ### [**](#addHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L99)addHandler * ****addHandler**\(label, handler): void - Registers a new route handler for the given label.
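Handlers registered this way are matched against the label carried by each request, so labels assigned while enqueueing links determine which handler runs; a minimal sketch (the selector and label names are illustrative):

```
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Requests labelled 'DETAIL' are routed here.
router.addHandler('DETAIL', async ({ request, log }) => {
    log.info(`Detail page: ${request.url}`);
});

// Unlabelled requests fall back to the default handler.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com/']);
```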
*** #### Parameters * ##### label: string | symbol * ##### handler: (ctx) => Awaitable\ #### Returns void ### [**](#getHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L128)getHandler * ****getHandler**(label): (ctx) => Awaitable\ - Returns route handler for given label. If no label is provided, the default request handler will be returned. *** #### Parameters * ##### optionallabel: string | symbol #### Returns (ctx) => Awaitable\ * * **(ctx): Awaitable\ - #### Parameters * ##### ctx: Context #### Returns Awaitable\ ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L121)use * ****use**(middleware): void - Registers a middleware that will be fired before the matching route handler. Multiple middlewares can be registered, they will be fired in the same order. *** #### Parameters * ##### middleware: (ctx) => Awaitable\ #### Returns void ### [**](#create)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L177)staticcreate * ****create**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ - Creates new router instance. This instance can then serve as a `requestHandler` of your crawler. ``` import { Router, CheerioCrawler, CheerioCrawlingContext } from 'crawlee'; const router = Router.create(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # Session Sessions are used to store information such as cookies and can be used for generating fingerprints and proxy sessions. You can imagine each session as a specific user, with its own cookies, IP (via proxy) and potentially a unique browser fingerprint. Session internal state can be enriched with custom user data for example some authorization tokens and specific headers in general. ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**id](#id) * [**userData](#userData) ### Accessors * [**cookieJar](#cookieJar) * [**createdAt](#createdAt) * [**errorScore](#errorScore) * [**errorScoreDecrement](#errorScoreDecrement) * [**expiresAt](#expiresAt) * [**maxErrorScore](#maxErrorScore) * [**maxUsageCount](#maxUsageCount) * [**usageCount](#usageCount) ### Methods * [**getCookies](#getCookies) * [**getCookieString](#getCookieString) * [**getState](#getState) * [**isBlocked](#isBlocked) * [**isExpired](#isExpired) * [**isMaxUsageCountReached](#isMaxUsageCountReached) * [**isUsable](#isUsable) * [**markBad](#markBad) * [**markGood](#markGood) * [**retire](#retire) * [**retireOnBlockedStatusCodes](#retireOnBlockedStatusCodes) * [**setCookie](#setCookie) * [**setCookies](#setCookies) * [**setCookiesFromResponse](#setCookiesFromResponse) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L150)constructor * ****new Session**(options): [Session](https://crawlee.dev/js/api/core/class/Session.md) - Session configuration. 
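Rather than calling the constructor directly, sessions are usually obtained from a [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md); a minimal sketch of enriching one with custom user data (the token value is illustrative):

```
import { SessionPool } from 'crawlee';

const sessionPool = await SessionPool.open();
const session = await sessionPool.getSession();

// Arbitrary custom state, e.g. an authorization token obtained elsewhere.
session.userData.authToken = 'token-from-login';

// Report the outcome of using the session so the pool can score it.
session.markGood();
```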
*** #### Parameters * ##### options: [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) #### Returns [Session](https://crawlee.dev/js/api/core/class/Session.md) ## Properties[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L101)readonlyid **id: string ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L103)userData **userData: Dictionary ## Accessors[**](#Accessors) ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L143)cookieJar * **get cookieJar(): CookieJar - #### Returns CookieJar ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L135)createdAt * **get createdAt(): Date - #### Returns Date ### [**](#errorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L115)errorScore * **get errorScore(): number - #### Returns number ### [**](#errorScoreDecrement)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L127)errorScoreDecrement * **get errorScoreDecrement(): number - #### Returns number ### [**](#expiresAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L131)expiresAt * **get expiresAt(): Date - #### Returns Date ### [**](#maxErrorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L123)maxErrorScore * **get maxErrorScore(): number - #### Returns number ### [**](#maxUsageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L139)maxUsageCount * **get maxUsageCount(): number - #### Returns number ### [**](#usageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L119)usageCount * **get usageCount(): number - #### Returns number ## Methods[**](#Methods) ### [**](#getCookies)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L374)getCookies * ****getCookies**(url): [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] - Returns cookies in a format compatible with puppeteer/playwright and ready to be used with `page.setCookie`. *** #### Parameters * ##### url: string website url. Only cookies stored for this url will be returned #### Returns [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] ### [**](#getCookieString)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L385)getCookieString * ****getCookieString**(url): string - Returns cookies saved with the session in the typical key1=value1; key2=value2 format, ready to be used in a cookie header or elsewhere. *** #### Parameters * ##### url: string #### Returns string Represents `Cookie` header. ### [**](#getState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L256)getState * ****getState**(): [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) - Gets session state for persistence in KeyValueStore. *** #### Returns [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) Represents session internal state. ### [**](#isBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L209)isBlocked * ****isBlocked**(): boolean - Indicates whether the session is blocked. 
The session is blocked once it reaches the `maxErrorScore`. *** #### Returns boolean ### [**](#isExpired)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L218)isExpired * ****isExpired**(): boolean - Indicates whether the session is expired. Session expiration is determined by `maxAgeSecs`. Once the session is older than `createdAt + maxAgeSecs`, it is considered expired. *** #### Returns boolean ### [**](#isMaxUsageCountReached)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L226)isMaxUsageCountReached * ****isMaxUsageCountReached**(): boolean - Indicates whether the session has been used the maximum number of times. The maximum usage count can be changed via the `maxUsageCount` parameter. *** #### Returns boolean ### [**](#isUsable)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L234)isUsable * ****isUsable**(): boolean - Indicates whether the session can be used for further requests. A session is usable when it is not expired, not blocked, and its maximum usage count has not been reached. *** #### Returns boolean ### [**](#markBad)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L291)markBad * ****markBad**(): void - Increases the usage and error counts. Should be used when the session has been used unsuccessfully, for example because of a timeout. *** #### Returns void ### [**](#markGood)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L242)markGood * ****markGood**(): void - This method should be called after a successful session usage. It increases `usageCount` and potentially lowers the `errorScore` by the `errorScoreDecrement`. *** #### Returns void ### [**](#retire)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L278)retire * ****retire**(): void - Marks the session as blocked and emits an event on the `SessionPool`. This method should be used if the session usage was unsuccessful and you are sure that it is caused by the session configuration rather than an external factor, for example when the server returns a 403 status code. If the session does not work due to an external factor, such as a 5XX server error, you probably want to use the `markBad` method instead. *** #### Returns void ### [**](#retireOnBlockedStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L306)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L321)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L323)retireOnBlockedStatusCodes * ****retireOnBlockedStatusCodes**(statusCode): boolean * ****retireOnBlockedStatusCodes**(statusCode, additionalBlockedStatusCodes): boolean - With certain status codes (`401`, `403` or `429`) we can be certain that the target website is blocking us. This function conveniently retires the session when such a code is received. Optionally, the default status codes can be extended via the second parameter. *** #### Parameters * ##### statusCode: number HTTP status code. #### Returns boolean Whether the session was retired. ### [**](#setCookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L392)setCookie * ****setCookie**(rawCookie, url): void - Sets a cookie within this session for the specific URL.
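A minimal sketch of storing and reading cookies on a session (the URL and cookie values are illustrative):

```
import { SessionPool } from 'crawlee';

const sessionPool = await SessionPool.open();
const session = await sessionPool.getSession();

// Store a raw `Set-Cookie` string for a specific URL...
session.setCookie('sessionid=abc123; Path=/; HttpOnly', 'https://example.com');

// ...and read it back later as a `Cookie` header value for requests to that URL.
const cookieHeader = session.getCookieString('https://example.com');
console.log(cookieHeader); // e.g. 'sessionid=abc123'
```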
*** #### Parameters * ##### rawCookie: string * ##### url: string #### Returns void ### [**](#setCookies)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L365)setCookies * ****setCookies**(cookies, url): void - Saves an array with cookie objects to be used with the session. The objects should be in the format that [Puppeteer uses](https://pptr.dev/#?product=Puppeteer\&version=v2.0.0\&show=api-pagecookiesurls), but you can also use this function to set cookies manually: ``` [ { name: 'cookie1', value: 'my-cookie' }, { name: 'cookie2', value: 'your-cookie' } ] ``` *** #### Parameters * ##### cookies: [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md)\[] * ##### url: string #### Returns void ### [**](#setCookiesFromResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L341)setCookiesFromResponse * ****setCookiesFromResponse**(response): void - Saves cookies from an HTTP response to be used with the session. It expects an object with a `headers` property that's either an `Object` (typical Node.js responses) or a `Function` (Puppeteer Response). It then parses and saves the cookies from the `set-cookie` header, if available. *** #### Parameters * ##### response: [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) #### Returns void --- # SessionError Errors of `SessionError` type will trigger a session rotation. This error doesn't respect the `maxRequestRetries` option and has a separate limit of `maxSessionRotations`. ### Hierarchy * [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) * *SessionError* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**message](#message) * [**name](#name) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**isError](#isError) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L34)constructor * ****new SessionError**(message): [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) - Overrides RetryRequestError.constructor #### Parameters * ##### optionalmessage: string #### Returns [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L26)externaloptionalinheritedcause **cause? : unknown Inherited from RetryRequestError.cause ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from RetryRequestError.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from RetryRequestError.name ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? 
: string Inherited from RetryRequestError.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L68)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from RetryRequestError.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L52)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from RetryRequestError.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#isError)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.esnext.error.d.ts#L23)staticexternalinheritedisError * ****isError**(error): error is Error - Inherited from RetryRequestError.isError Indicates whether the argument provided is a built-in Error instance or not. *** #### Parameters * ##### externalerror: unknown #### Returns error is Error ### [**](#prepareStackTrace)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/globals.d.ts#L56)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from RetryRequestError.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # SessionPool Handles the rotation, creation and persistence of user-like sessions. Creates a pool of [Session](https://crawlee.dev/js/api/core/class/Session.md) instances, that are randomly rotated. When some session is marked as blocked, it is removed and new one is created instead (the pool never returns an unusable session). Learn more in the [Session management guide](https://crawlee.dev/js/docs/guides/session-management.md). 
You can create one by calling the [SessionPool.open](https://crawlee.dev/js/api/core/class/SessionPool.md#open) function. The session pool is already integrated into crawlers, and it can significantly improve your scraper's performance with just two lines of code. **Example usage:**

```
const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    // ...
})
```

You can configure the pool with many options. See the [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session pool is persisted in the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) by default. If you want to share one pool across all runs, you have to specify [SessionPoolOptions.persistStateKeyValueStoreId](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md#persistStateKeyValueStoreId). **Advanced usage:**

```
const sessionPool = await SessionPool.open({
    maxPoolSize: 25,
    sessionOptions: {
        maxAgeSecs: 10,
        maxUsageCount: 150, // for example when you know that the site blocks after 150 requests.
    },
    persistStateKeyValueStoreId: 'my-key-value-store-for-sessions',
    persistStateKey: 'my-session-pool',
});

// Get a random session from the pool
const session1 = await sessionPool.getSession();
const session2 = await sessionPool.getSession();
const session3 = await sessionPool.getSession();

// Now you can mark each session as either failed or successful

// Marks the session as bad after unsuccessful usage -> it increases the error count (soft retire)
session1.markBad()

// Marks the session as successful.
session2.markGood()

// Retires the session -> the session is removed from the pool
session3.retire()
```

**Default session allocation flow:**

1. Until the `SessionPool` reaches `maxPoolSize`, new sessions are created, provided to the user and added to the pool
2. Blocked/retired sessions stay in the pool but are never provided to the user
3. Once the pool is full (live plus blocked session count reaches `maxPoolSize`), a random session from the pool is provided.
4. 
If a blocked session would be picked, instead all blocked sessions are evicted from the pool and a new session is created and provided ### Hierarchy * EventEmitter * *SessionPool* ## Index[**](#Index) ### Properties * [**config](#config) * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Accessors * [**retiredSessionsCount](#retiredSessionsCount) * [**usableSessionsCount](#usableSessionsCount) ### Methods * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**addSession](#addSession) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getSession](#getSession) * [**getState](#getState) * [**initialize](#initialize) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**off](#off) * [**on](#on) * [**once](#once) * [**persistState](#persistState) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**resetStore](#resetStore) * [**setMaxListeners](#setMaxListeners) * [**teardown](#teardown) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**open](#open) * [**setMaxListeners](#setMaxListeners) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L160)readonlyconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... ### [**](#captureRejections)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L458)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from EventEmitter.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L451)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from EventEmitter.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L497)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from EventEmitter.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. 
The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L444)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from EventEmitter.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Accessors[**](#Accessors) ### [**](#retiredSessionsCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L224)retiredSessionsCount * **get retiredSessionsCount(): number - Gets count of retired sessions in the pool. *** #### Returns number ### [**](#usableSessionsCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L217)usableSessionsCount * **get usableSessionsCount(): number - Gets count of usable sessions in the pool. *** #### Returns number ## Methods[**](#Methods) ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**\(error, event, ...args): void - Inherited from EventEmitter.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L596)externalinheritedaddListener * ****addListener**\(eventName, listener): this - Inherited from EventEmitter.addListener Alias for `emitter.on(eventName, listener)`. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#addSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L264)addSession * ****addSession**(options): Promise\ - Adds a new session to the session pool. The pool automatically creates sessions up to the maximum size of the pool, but this allows you to add more sessions once the max pool size is reached. This also allows you to add session with overridden session options (e.g. 
with specific session id). *** #### Parameters * ##### optionaloptions: [Session](https://crawlee.dev/js/api/core/class/Session.md) | [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) = {} The configuration options for the session being added to the session pool. #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L858)externalinheritedemit * ****emit**\(eventName, ...args): boolean - Inherited from EventEmitter.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L921)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from EventEmitter.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L773)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from EventEmitter.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to EventEmitter.defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L291)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L296)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L305)getSession * ****getSession**(): Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> * ****getSession**(sessionId): Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> - Gets session. If there is space for new session, it creates and returns new session. 
If the session pool is full, it picks a session from the pool, If the picked session is usable it is returned, otherwise it creates and returns a new one. *** #### Returns Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> ### [**](#getState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L348)getState * ****getState**(): { retiredSessionsCount: number; sessions: [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md)\[]; usableSessionsCount: number } - Returns an object representing the internal state of the `SessionPool` instance. Note that the object's fields can change in future releases. *** #### Returns { retiredSessionsCount: number; sessions: [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md)\[]; usableSessionsCount: number } * ##### retiredSessionsCount: number * ##### sessions: [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md)\[] * ##### usableSessionsCount: number ### [**](#initialize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L232)initialize * ****initialize**(): Promise\ - Starts periodic state persistence and potentially loads SessionPool state from [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). It is called automatically by the [SessionPool.open](https://crawlee.dev/js/api/core/class/SessionPool.md#open) function. *** #### Returns Promise\ ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L867)externalinheritedlistenerCount * ****listenerCount**\(eventName, listener): number - Inherited from EventEmitter.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. * **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L786)externalinheritedlisteners * ****listeners**\(eventName): Function\[] - Inherited from EventEmitter.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#off)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L746)externalinheritedoff * ****off**\(eventName, listener): this - Inherited from EventEmitter.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L628)externalinheritedon * ****on**\(eventName, listener): this - Inherited from EventEmitter.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. 
Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L658)externalinheritedonce * ****once**\(eventName, listener): this - Inherited from EventEmitter.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L361)persistState * ****persistState**(options): Promise\ - Persists the current state of the `SessionPool` into the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). The state is persisted automatically in regular intervals. *** #### Parameters * ##### optionaloptions: [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Override the persistence options provided in the constructor #### Returns Promise\ ### [**](#prependListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L885)externalinheritedprependListener * ****prependListener**\(eventName, listener): this - Inherited from EventEmitter.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. 
* ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L901)externalinheritedprependOnceListener * ****prependOnceListener**\(eventName, listener): this - Inherited from EventEmitter.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L817)externalinheritedrawListeners * ****rawListeners**\(eventName): Function\[] - Inherited from EventEmitter.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L757)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from EventEmitter.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L741)externalinheritedremoveListener * ****removeListener**\(eventName, listener): this - Inherited from EventEmitter.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. 
If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#resetStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L336)resetStore * ****resetStore**(options): Promise\ - #### Parameters * ##### optionaloptions: [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Override the persistence options provided in the constructor #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L767)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from EventEmitter.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L388)teardown * ****teardown**(): Promise\ - Removes listener from `persistState` event. 
This function should be called after you are done with using the `SessionPool` instance. *** #### Returns Promise\ ### [**](#addAbortListener)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L436)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from EventEmitter.addAbortListener Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. ### [**](#getEventListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from EventEmitter.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from EventEmitter.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. 
``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from EventEmitter.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. *** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from EventEmitter.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. 
console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from EventEmitter.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! 
``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L512)staticopen * ****open**(options, config): Promise<[SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md)> - Opens a SessionPool and returns a promise resolving to an instance of the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that is already initialized. For more details and code examples, see the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class. *** #### Parameters * ##### optionaloptions: [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) #### Returns Promise<[SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md)> ### [**](#setMaxListeners)[**](https://undefined/apify/crawlee/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from EventEmitter.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # SitemapRequestList A list of URLs to crawl parsed from a sitemap. The loading of the sitemap is performed in the background so that crawling can start before the sitemap is fully loaded. ### Implements * [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ## Index[**](#Index) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**fetchNextRequest](#fetchNextRequest) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**isSitemapFullyLoaded](#isSitemapFullyLoaded) * [**length](#length) * [**markRequestHandled](#markRequestHandled) * [**persistState](#persistState) * [**reclaimRequest](#reclaimRequest) * [**teardown](#teardown) * [**open](#open) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L572)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> - Implementation of IRequestList.\[asyncIterator] Can be used to iterate over the `RequestList` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. 
*** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, void, unknown> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L551)fetchNextRequest * ****fetchNextRequest**(): Promise\> - Implementation of IRequestList.fetchNextRequest Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. First, the function gets a request previously reclaimed using the [RequestList.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestList.md#reclaimRequest) function, if there is any. Otherwise it gets the next request from sources. The function's `Promise` resolves to `null` if there are no more requests to process. *** #### Returns Promise\> ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L465)handledCount * ****handledCount**(): number - Implementation of IRequestList.handledCount Returns number of handled requests. *** #### Returns number ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L458)isEmpty * ****isEmpty**(): Promise\ - Implementation of IRequestList.isEmpty Resolves to `true` if the next call to [IRequestList.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestList.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the list is empty, there might be some pending requests currently being processed. *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L449)isFinished * ****isFinished**(): Promise\ - Implementation of IRequestList.isFinished Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#isSitemapFullyLoaded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L358)isSitemapFullyLoaded * ****isSitemapFullyLoaded**(): boolean - Indicates whether the background processing of sitemap contents has successfully finished. If this is `false`, the background processing is either still in progress or was aborted. *** #### Returns boolean ### [**](#length)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L442)length * ****length**(): number - Implementation of IRequestList.length Returns the total number of unique requests present in the list. *** #### Returns number ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L607)markRequestHandled * ****markRequestHandled**(request): Promise\ - Implementation of IRequestList.markRequestHandled Marks request as handled after successful processing. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L472)persistState * ****persistState**(): Promise\ - Implementation of IRequestList.persistState Persists the current state of the `IRequestList` into the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). 
The state is persisted automatically in regular intervals, but calling this method manually is useful in cases where you want to have the most current state available after you pause or stop fetching its requests. For example after you pause or abort a crawl. Or just before a server migration. *** #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L584)reclaimRequest * ****reclaimRequest**(request): Promise\ - Implementation of IRequestList.reclaimRequest Reclaims request to the list if its processing failed. The request will become available in the next `this.fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L595)teardown * ****teardown**(): Promise\ - Aborts the internal sitemap loading, stops the processing of the sitemap contents and drops all the pending URLs. Calling `fetchNextRequest()` after this method will always return `null`. *** #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L414)staticopen * ****open**(options): Promise<[SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md)> - Open a sitemap and start processing it. Resolves to a new instance of `SitemapRequestList`, which **might not be fully loaded yet** - i.e. the sitemap might still be loading in the background. Track the loading progress using the `isSitemapFullyLoaded` property. *** #### Parameters * ##### options: [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) #### Returns Promise<[SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md)> --- # Snapshotter Creates snapshots of system resources at given intervals and marks the resource as either overloaded or not during the last interval. Keeps a history of the snapshots. It tracks the following resources: Memory, EventLoop, API and CPU. The class is used by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. When running on the Apify platform, the CPU and memory statistics are provided by the platform, as collected from the running Docker container. When running locally, `Snapshotter` makes its own statistics by querying the OS. CPU becomes overloaded locally when its current use exceeds the `maxUsedCpuRatio` option or when Apify platform marks it as overloaded. Memory becomes overloaded if its current use exceeds the `maxUsedMemoryRatio` option. It's computed using the total memory available to the container when running on the Apify platform and a quarter of total system memory when running locally. Max total memory when running locally may be overridden by using the `CRAWLEE_MEMORY_MBYTES` environment variable. Event loop becomes overloaded if it slows down by more than the `maxBlockedMillis` option. Client becomes overloaded when rate limit errors (429 - Too Many Requests), typically received from the request queue, exceed the set limit within the set interval. 
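To make the sampling API above more concrete, here is a minimal sketch that starts a `Snapshotter`, lets it collect a few snapshots, and then reads the recent memory and event loop history. The import path and the exact shape of the snapshot objects are assumptions; only the `start`, `stop` and `get*Sample` methods documented below are relied on.

```
import { Snapshotter } from '@crawlee/core';

// A minimal sketch: create a snapshotter with default options and sample its history.
const snapshotter = new Snapshotter();

await snapshotter.start();

// Let it capture a few snapshots (the snapshot intervals are configurable via SnapshotterOptions).
await new Promise((resolve) => setTimeout(resolve, 5_000));

// Read only the snapshots collected in the last 3 seconds;
// calling the getters without an argument returns the full snapshot history.
const memory = snapshotter.getMemorySample(3_000);
const eventLoop = snapshotter.getEventLoopSample(3_000);

console.log(`Collected ${memory.length} memory and ${eventLoop.length} event loop snapshots.`);

await snapshotter.stop();
```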
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**client](#client) * [**clientInterval](#clientInterval) * [**clientSnapshotIntervalMillis](#clientSnapshotIntervalMillis) * [**clientSnapshots](#clientSnapshots) * [**config](#config) * [**cpuSnapshots](#cpuSnapshots) * [**eventLoopInterval](#eventLoopInterval) * [**eventLoopSnapshotIntervalMillis](#eventLoopSnapshotIntervalMillis) * [**eventLoopSnapshots](#eventLoopSnapshots) * [**events](#events) * [**lastLoggedCriticalMemoryOverloadAt](#lastLoggedCriticalMemoryOverloadAt) * [**log](#log) * [**maxBlockedMillis](#maxBlockedMillis) * [**maxClientErrors](#maxClientErrors) * [**maxMemoryBytes](#maxMemoryBytes) * [**maxUsedMemoryRatio](#maxUsedMemoryRatio) * [**memorySnapshots](#memorySnapshots) * [**snapshotHistoryMillis](#snapshotHistoryMillis) ### Methods * [**getClientSample](#getClientSample) * [**getCpuSample](#getCpuSample) * [**getEventLoopSample](#getEventLoopSample) * [**getMemorySample](#getMemorySample) * [**start](#start) * [**stop](#stop) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L144)constructor * ****new Snapshotter**(options): [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) - #### Parameters * ##### optionaloptions: [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) = {} All `Snapshotter` configuration options. #### Returns [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L120)client **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#clientInterval)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L137)clientInterval **clientInterval: BetterIntervalID = ... ### [**](#clientSnapshotIntervalMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L124)clientSnapshotIntervalMillis **clientSnapshotIntervalMillis: number ### [**](#clientSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L134)clientSnapshots **clientSnapshots: ClientSnapshot\[] = \[] ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L121)config **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#cpuSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L131)cpuSnapshots **cpuSnapshots: CpuSnapshot\[] = \[] ### [**](#eventLoopInterval)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L136)eventLoopInterval **eventLoopInterval: BetterIntervalID = ... 
### [**](#eventLoopSnapshotIntervalMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L123)eventLoopSnapshotIntervalMillis **eventLoopSnapshotIntervalMillis: number ### [**](#eventLoopSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L132)eventLoopSnapshots **eventLoopSnapshots: EventLoopSnapshot\[] = \[] ### [**](#events)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L122)events **events: [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#lastLoggedCriticalMemoryOverloadAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L139)lastLoggedCriticalMemoryOverloadAt **lastLoggedCriticalMemoryOverloadAt: null | Date = null ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L119)log **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#maxBlockedMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L126)maxBlockedMillis **maxBlockedMillis: number ### [**](#maxClientErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L128)maxClientErrors **maxClientErrors: number ### [**](#maxMemoryBytes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L129)maxMemoryBytes **maxMemoryBytes: number ### [**](#maxUsedMemoryRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L127)maxUsedMemoryRatio **maxUsedMemoryRatio: number ### [**](#memorySnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L133)memorySnapshots **memorySnapshots: MemorySnapshot\[] = \[] ### [**](#snapshotHistoryMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L125)snapshotHistoryMillis **snapshotHistoryMillis: number ## Methods[**](#Methods) ### [**](#getClientSample)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L268)getClientSample * ****getClientSample**(sampleDurationMillis): ClientSnapshot\[] - Returns a sample of latest Client snapshots, with the size of the sample defined by the sampleDurationMillis parameter. If omitted, it returns a full snapshot history. *** #### Parameters * ##### optionalsampleDurationMillis: number #### Returns ClientSnapshot\[] ### [**](#getCpuSample)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L260)getCpuSample * ****getCpuSample**(sampleDurationMillis): CpuSnapshot\[] - Returns a sample of latest CPU snapshots, with the size of the sample defined by the sampleDurationMillis parameter. If omitted, it returns a full snapshot history. *** #### Parameters * ##### optionalsampleDurationMillis: number #### Returns CpuSnapshot\[] ### [**](#getEventLoopSample)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L252)getEventLoopSample * ****getEventLoopSample**(sampleDurationMillis): EventLoopSnapshot\[] - Returns a sample of latest event loop snapshots, with the size of the sample defined by the sampleDurationMillis parameter. If omitted, it returns a full snapshot history. 
*** #### Parameters * ##### optionalsampleDurationMillis: number #### Returns EventLoopSnapshot\[] ### [**](#getMemorySample)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L244)getMemorySample * ****getMemorySample**(sampleDurationMillis): MemorySnapshot\[] - Returns a sample of latest memory snapshots, with the size of the sample defined by the sampleDurationMillis parameter. If omitted, it returns a full snapshot history. *** #### Parameters * ##### optionalsampleDurationMillis: number #### Returns MemorySnapshot\[] ### [**](#start)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L192)start * ****start**(): Promise\ - Starts capturing snapshots at configured intervals. *** #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L229)stop * ****stop**(): Promise\ - Stops all resource capturing. *** #### Returns Promise\ --- # Statistics The statistics class provides an interface to collecting and logging run statistics for requests. All statistic information is saved on key value store under the key `SDK_CRAWLER_STATISTICS_*`, persists between migrations and abort/resurrect ## Index[**](#Index) ### Properties * [**errorTracker](#errorTracker) * [**errorTrackerRetry](#errorTrackerRetry) * [**id](#id) * [**requestRetryHistogram](#requestRetryHistogram) * [**state](#state) ### Methods * [**calculate](#calculate) * [**persistState](#persistState) * [**registerStatusCode](#registerStatusCode) * [**reset](#reset) * [**resetStore](#resetStore) * [**startCapturing](#startCapturing) * [**stopCapturing](#stopCapturing) * [**toJSON](#toJSON) ## Properties[**](#Properties) ### [**](#errorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L65)errorTracker **errorTracker: [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) An error tracker for final retry errors. ### [**](#errorTrackerRetry)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L70)errorTrackerRetry **errorTrackerRetry: [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) An error tracker for retry errors prior to the final retry. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L75)readonlyid **id: number = ... Statistic instance id. ### [**](#requestRetryHistogram)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L85)readonlyrequestRetryHistogram **requestRetryHistogram: number\[] = \[] Contains the current retries histogram. 
Index 0 means 0 retries, index 2, 2 retries, and so on ### [**](#state)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L80)state **state: [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) Current statistic state used for doing calculations on [Statistics.calculate](https://crawlee.dev/js/api/core/class/Statistics.md#calculate) calls ## Methods[**](#Methods) ### [**](#calculate)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L253)calculate * ****calculate**(): { crawlerRuntimeMillis: number; requestAvgFailedDurationMillis: number; requestAvgFinishedDurationMillis: number; requestsFailedPerMinute: number; requestsFinishedPerMinute: number; requestsTotal: number; requestTotalDurationMillis: number } - Calculate the current statistics *** #### Returns { crawlerRuntimeMillis: number; requestAvgFailedDurationMillis: number; requestAvgFinishedDurationMillis: number; requestsFailedPerMinute: number; requestsFinishedPerMinute: number; requestsTotal: number; requestTotalDurationMillis: number } * ##### crawlerRuntimeMillis: number * ##### requestAvgFailedDurationMillis: number * ##### requestAvgFinishedDurationMillis: number * ##### requestsFailedPerMinute: number * ##### requestsFinishedPerMinute: number * ##### requestsTotal: number * ##### requestTotalDurationMillis: number ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L320)persistState * ****persistState**(options): Promise\ - Persist internal state to the key value store *** #### Parameters * ##### optionaloptions: [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Override the persistence options provided in the constructor #### Returns Promise\ ### [**](#registerStatusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L198)registerStatusCode * ****registerStatusCode**(code): void - Increments the status code counter. 
*** #### Parameters * ##### code: number #### Returns void ### [**](#reset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L150)reset * ****reset**(): void - Sets the current statistic instance to pristine values *** #### Returns void ### [**](#resetStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L183)resetStore * ****resetStore**(options): Promise\ - #### Parameters * ##### optionaloptions: [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Override the persistence options provided in the constructor #### Returns Promise\ ### [**](#startCapturing)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L279)startCapturing * ****startCapturing**(): Promise\ - Initializes the key-value store for persisting the statistics and displays the current state at predefined intervals *** #### Returns Promise\ ### [**](#stopCapturing)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L302)stopCapturing * ****stopCapturing**(): Promise\ - Stops logging, removes event listeners, and then persists the state *** #### Returns Promise\ ### [**](#toJSON)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L404)toJSON * ****toJSON**(): [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) - Makes this class serializable when called with `JSON.stringify(statsInstance)` directly or through `keyValueStore.setValue('KEY', statsInstance)` *** #### Returns [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) --- # SystemStatus Provides a simple interface for reading the system status from a [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) instance. It only exposes two functions [SystemStatus.getCurrentStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getCurrentStatus) and [SystemStatus.getHistoricalStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getHistoricalStatus). The system status is calculated using a weighted average of overloaded messages in the snapshots, with the weights being the time intervals between the snapshots. Each resource is calculated separately and the system is overloaded whenever at least one resource is overloaded. The class is used by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. [SystemStatus.getCurrentStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getCurrentStatus) returns an object that represents the current status of the system. The length of the current timeframe in seconds is configurable by the `currentHistorySecs` option and represents the max age of snapshots to be considered for the calculation. [SystemStatus.getHistoricalStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getHistoricalStatus) returns an object that represents the long-term status of the system. It considers the full snapshot history available in the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) instance.
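You rarely need to construct this class yourself, because [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) wires it up internally, but it can be used standalone. The following is a minimal sketch that pairs a `SystemStatus` with a [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md); the `snapshotter` and `currentHistorySecs` fields of [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) are assumed here, and the 5-second window and the waiting period are arbitrary illustration values.

```
import { Snapshotter, SystemStatus } from 'crawlee';

// Start collecting CPU, memory, event loop and client snapshots.
const snapshotter = new Snapshotter();
await snapshotter.start();

// Evaluate those snapshots; `snapshotter` and `currentHistorySecs`
// are assumed SystemStatusOptions fields in this sketch.
const systemStatus = new SystemStatus({ snapshotter, currentHistorySecs: 5 });

// Let a few snapshots accumulate before reading the status.
await new Promise((resolve) => setTimeout(resolve, 10_000));

const current = systemStatus.getCurrentStatus();
const historical = systemStatus.getHistoricalStatus();
console.log('Idle in the last 5 seconds:', current.isSystemIdle);
console.log('Idle over the full snapshot history:', historical.isSystemIdle);

await snapshotter.stop();
```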
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**getCurrentStatus](#getCurrentStatus) * [**getHistoricalStatus](#getHistoricalStatus) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L128)constructor * ****new SystemStatus**(options): [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) - #### Parameters * ##### options: [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) = {} #### Returns [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ## Methods[**](#Methods) ### [**](#getCurrentStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L176)getCurrentStatus * ****getCurrentStatus**(): [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) - Returns an [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) object with the following structure: ``` { isSystemIdle: Boolean, memInfo: Object, eventLoopInfo: Object, cpuInfo: Object } ``` Where the `isSystemIdle` property is set to `false` if the system has been overloaded in the last `options.currentHistorySecs` seconds, and `true` otherwise. *** #### Returns [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#getHistoricalStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L196)getHistoricalStatus * ****getHistoricalStatus**(): [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) - Returns an [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) object with the following structure: ``` { isSystemIdle: Boolean, memInfo: Object, eventLoopInfo: Object, cpuInfo: Object } ``` Where the `isSystemIdle` property is set to `false` if the system has been overloaded in the full history of the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) (which is configurable in the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md)) and `true` otherwise. *** #### Returns [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) --- # EnqueueStrategy The different enqueueing strategies available. Depending on the strategy you select, we will only check certain parts of the URLs found. Here is a diagram of each URL part and their name: ``` Protocol Domain ┌────┐ ┌─────────┐ https://example.crawlee.dev/... │ └─────────────────┤ │ Hostname │ │ │ └─────────────────────────┘ Origin ``` * The `Protocol` is usually `http` or `https` * The `Domain` represents the path without any possible subdomains to a website. For example, `crawlee.dev` is the domain of `https://example.crawlee.dev/` * The `Hostname` is the full path to a website, including any subdomains. For example, `example.crawlee.dev` is the hostname of `https://example.crawlee.dev/` * The `Origin` is the combination of the `Protocol` and `Hostname`. 
For example, `https://example.crawlee.dev` is the origin of `https://example.crawlee.dev/` ## Index[**](#Index) ### Enumeration Members * [**All](#All) * [**SameDomain](#SameDomain) * [**SameHostname](#SameHostname) * [**SameOrigin](#SameOrigin) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#All)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L220)All **All: all Matches any URLs found ### [**](#SameDomain)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L238)SameDomain **SameDomain: same-domain Matches any URLs that have the same domain as the base URL. For example, `https://wow.an.example.com` and `https://example.com` will both be matched for a base url of `https://example.com`. > This strategy will match both `http` and `https` protocols regardless of the base URL protocol. ### [**](#SameHostname)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L229)SameHostname **SameHostname: same-hostname Matches any URLs that have the same hostname. For example, `https://wow.example.com/hello` will be matched for a base url of `https://wow.example.com/`, but `https://example.com/hello` will not be matched. > This strategy will match both `http` and `https` protocols regardless of the base URL protocol. ### [**](#SameOrigin)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L247)SameOrigin **SameOrigin: same-origin Matches any URLs that have the same hostname and protocol. For example, `https://wow.example.com/hello` will be matched for a base url of `https://wow.example.com/`, but `http://wow.example.com/hello` will not be matched. > This strategy will ensure the protocol of the base URL is the same as the protocol of the URL to be enqueued. 
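To make the strategies above concrete, here is a minimal sketch of picking one in a crawler's `enqueueLinks` call; the crawler type and the start URL are placeholders, and the string literals (`all`, `same-domain`, `same-hostname`, `same-origin`) accepted by the `enqueueLinks` options can be used in place of the enum members.

```
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Follow links on the same hostname only, e.g. other pages on
        // https://example.crawlee.dev/*, but not https://crawlee.dev/*.
        await enqueueLinks({
            strategy: EnqueueStrategy.SameHostname,
        });
    },
});

await crawler.run(['https://example.crawlee.dev/']);
```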
--- # constEventType ## Index[**](#Index) ### Enumeration Members * [**ABORTING](#ABORTING) * [**EXIT](#EXIT) * [**MIGRATING](#MIGRATING) * [**PERSIST\_STATE](#PERSIST_STATE) * [**SYSTEM\_INFO](#SYSTEM_INFO) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#ABORTING)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L13)ABORTING **ABORTING: aborting ### [**](#EXIT)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L14)EXIT **EXIT: exit ### [**](#MIGRATING)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L12)MIGRATING **MIGRATING: migrating ### [**](#PERSIST_STATE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L10)PERSIST\_STATE **PERSIST\_STATE: persistState ### [**](#SYSTEM_INFO)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L11)SYSTEM\_INFO **SYSTEM\_INFO: systemInfo --- # externalLogLevel ## Index[**](#Index) ### Enumeration Members * [**DEBUG](#DEBUG) * [**ERROR](#ERROR) * [**INFO](#INFO) * [**OFF](#OFF) * [**PERF](#PERF) * [**SOFT\_FAIL](#SOFT_FAIL) * [**WARNING](#WARNING) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#DEBUG)externalDEBUG **DEBUG: 5 ### [**](#ERROR)externalERROR **ERROR: 1 ### [**](#INFO)externalINFO **INFO: 4 ### [**](#OFF)externalOFF **OFF: 0 ### [**](#PERF)externalPERF **PERF: 6 ### [**](#SOFT_FAIL)externalSOFT\_FAIL **SOFT\_FAIL: 2 ### [**](#WARNING)externalWARNING **WARNING: 3 --- # RequestState ## Index[**](#Index) ### Enumeration Members * [**AFTER\_NAV](#AFTER_NAV) * [**BEFORE\_NAV](#BEFORE_NAV) * [**DONE](#DONE) * [**ERROR](#ERROR) * [**ERROR\_HANDLER](#ERROR_HANDLER) * [**REQUEST\_HANDLER](#REQUEST_HANDLER) * [**SKIPPED](#SKIPPED) * [**UNPROCESSED](#UNPROCESSED) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#AFTER_NAV)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L45)AFTER\_NAV **AFTER\_NAV: 2 ### [**](#BEFORE_NAV)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L44)BEFORE\_NAV **BEFORE\_NAV: 1 ### [**](#DONE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L47)DONE **DONE: 4 ### [**](#ERROR)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L49)ERROR **ERROR: 6 ### [**](#ERROR_HANDLER)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L48)ERROR\_HANDLER **ERROR\_HANDLER: 5 ### [**](#REQUEST_HANDLER)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L46)REQUEST\_HANDLER **REQUEST\_HANDLER: 3 ### [**](#SKIPPED)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L50)SKIPPED **SKIPPED: 7 ### 
[**](#UNPROCESSED)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L43)UNPROCESSED **UNPROCESSED: 0 --- # checkStorageAccess Invoke a storage access checker function defined using [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) higher up in the call stack. ### Callable * ****checkStorageAccess**(): undefined | void *** * #### Returns undefined | void --- # enqueueLinks ### Callable * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> *** * This function enqueues the urls provided to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) provided. If you want to automatically find and enqueue links, you should use the context-aware `enqueueLinks` function provided on the crawler contexts. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. **Example usage** ``` await enqueueLinks({ urls: aListOfFoundUrls, requestQueue, selector: 'a.product-detail', globs: [ 'https://www.example.com/handbags/*', 'https://www.example.com/purses/*' ], }); ``` *** #### Parameters * ##### options: { baseUrl?: string; exclude?: readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[]; forefront?: boolean; globs?: readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[]; label?: string; limit?: number; onSkippedRequest?: [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback); pseudoUrls?: readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[]; regexps?: readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[]; robotsTxtFile?: Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed>; selector?: string; skipNavigation?: boolean; strategy?: [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin; transformRequestFunction?: [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md); urls: readonly string\[]; userData?: Dictionary; waitForAllRequestsToBeAdded?: boolean } & { requestQueue: { addRequestsBatched: (requests, options) => Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> } } All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. 
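As a follow-up to the example above, here is a hedged sketch that also uses the `exclude` and `transformRequestFunction` options from the parameter list; the URLs, glob patterns and the `DETAIL` label are placeholders.

```
import { RequestQueue, enqueueLinks } from 'crawlee';

const requestQueue = await RequestQueue.open();

const foundUrls = [
    'https://www.example.com/handbags/item-1',
    'https://www.example.com/sale/expired-offer',
];

await enqueueLinks({
    urls: foundUrls,
    requestQueue,
    globs: ['https://www.example.com/**'],
    // Drop anything under /sale/ even though it matches the glob above.
    exclude: ['https://www.example.com/sale/**'],
    // Adjust every request before it is enqueued.
    transformRequestFunction: (request) => {
        request.label = 'DETAIL';
        request.userData = { ...request.userData, enqueuedAt: Date.now() };
        return request;
    },
});
```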
--- # filterRequestsByPatterns ### Callable * ****filterRequestsByPatterns**(requests, patterns, onSkippedUrl): [Request](https://crawlee.dev/js/api/core/class/Request.md)\[] *** * #### Parameters * ##### requests: [Request](https://crawlee.dev/js/api/core/class/Request.md)\\[] * ##### optionalpatterns: [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject)\[] * ##### optionalonSkippedUrl: (url) => void #### Returns [Request](https://crawlee.dev/js/api/core/class/Request.md)\[] --- # processHttpRequestOptions ### Callable * ****processHttpRequestOptions**\(\_\_namedParameters): [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ *** * Converts [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) to a [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md). *** #### Parameters * ##### \_\_namedParameters: [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md)\ #### Returns [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ --- # purgeDefaultStorages ### Callable * ****purgeDefaultStorages**(options): Promise\ * ****purgeDefaultStorages**(config, client): Promise\ *** * Cleans up the local storage folder (defaults to `./storage`) created when running code locally. Purging will remove all the files in all storages except for INPUT.json in the default KV store. Purging of storages is happening automatically when we run our crawler (or when we open some storage explicitly, e.g. via `RequestList.open()`). We can disable that via `purgeOnStart` [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) option or by setting `CRAWLEE_PURGE_ON_START` environment variable to `0` or `false`. This is a shortcut for running (optional) `purge` method on the StorageClient interface, in other words it will call the `purge` method of the underlying storage implementation we are currently using. You can make sure the storage is purged only once for a given execution context if you set `onlyPurgeOnce` to `true` in the `options` object *** #### Parameters * ##### optionaloptions: PurgeDefaultStorageOptions #### Returns Promise\ --- # tryAbsoluteURL ### Callable * ****tryAbsoluteURL**(href, baseUrl): string | undefined *** * Helper function used to validate URLs used when extracting URLs from a page *** #### Parameters * ##### href: string * ##### baseUrl: string #### Returns string | undefined --- # useState ### Callable * ****useState**\(name, defaultValue, options): Promise\ *** * Easily create and manage state values. All state values are automatically persisted. Values can be modified by simply using the assignment operator. *** #### Parameters * ##### optionalname: string The name of the store to use. * ##### defaultValue: State = ... If the store does not yet have a value in it, the value will be initialized with the `defaultValue` you provide. * ##### optionaloptions: [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) An optional object parameter where a custom `keyValueStoreName` and `config` can be passed in. #### Returns Promise\ --- # withCheckedStorageAccess Define a storage access checker function that should be used by calls to [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) in the callbacks. 
### Callable * ****withCheckedStorageAccess**\(checkFunction, callback): Promise\ *** * #### Parameters * ##### checkFunction: () => void The check function that should be invoked by [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) calls * ##### callback: () => Awaitable\ The code that should be invoked with the `checkFunction` setting #### Returns Promise\ --- # AddRequestsBatchedOptions ### Hierarchy * [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) * *AddRequestsBatchedOptions* * [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ## Index[**](#Index) ### Properties * [**batchSize](#batchSize) * [**forefront](#forefront) * [**waitBetweenBatchesMillis](#waitBetweenBatchesMillis) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#batchSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L975)optionalbatchSize **batchSize? : number = 1000 ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from RequestQueueOperationOptions.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#waitBetweenBatchesMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L980)optionalwaitBetweenBatchesMillis **waitBetweenBatchesMillis? : number = 1000 ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L970)optionalwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean = false Whether to wait for all the provided requests to be added, instead of waiting just for the initial batch of up to `batchSize`. --- # AddRequestsBatchedResult ### Hierarchy * *AddRequestsBatchedResult* * [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ## Index[**](#Index) ### Properties * [**addedRequests](#addedRequests) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#addedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L984)addedRequests **addedRequests: [ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[] ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L1001)waitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded: Promise<[ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[]> A promise which will resolve with the rest of the requests that were added to the queue. 
Alternatively, we can set [`waitForAllRequestsToBeAdded`](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md#waitForAllRequestsToBeAdded) to `true` in the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) options. **Example:** ``` // Assuming `requests` is a list of requests. const result = await crawler.addRequests(requests); // If we want to wait for the rest of the requests to be added to the queue: await result.waitForAllRequestsToBeAdded; ``` --- # AutoscaledPoolOptions ## Index[**](#Index) ### Properties * [**autoscaleIntervalSecs](#autoscaleIntervalSecs) * [**desiredConcurrency](#desiredConcurrency) * [**desiredConcurrencyRatio](#desiredConcurrencyRatio) * [**isFinishedFunction](#isFinishedFunction) * [**isTaskReadyFunction](#isTaskReadyFunction) * [**log](#log) * [**loggingIntervalSecs](#loggingIntervalSecs) * [**maxConcurrency](#maxConcurrency) * [**maxTasksPerMinute](#maxTasksPerMinute) * [**maybeRunIntervalSecs](#maybeRunIntervalSecs) * [**minConcurrency](#minConcurrency) * [**runTaskFunction](#runTaskFunction) * [**scaleDownStepRatio](#scaleDownStepRatio) * [**scaleUpStepRatio](#scaleUpStepRatio) * [**snapshotterOptions](#snapshotterOptions) * [**systemStatusOptions](#systemStatusOptions) * [**taskTimeoutSecs](#taskTimeoutSecs) ## Properties[**](#Properties) ### [**](#autoscaleIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L102)optionalautoscaleIntervalSecs **autoscaleIntervalSecs? : number = 10 Defines in seconds how often the pool should attempt to adjust the desired concurrency based on the latest system status. Setting it lower than 1 might have a severe impact on performance. We suggest using a value from 5 to 20. ### [**](#desiredConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L60)optionaldesiredConcurrency **desiredConcurrency? : number The desired number of tasks that should be running parallel on the start of the pool, if there is a large enough supply of them. By default, it is `minConcurrency`. ### [**](#desiredConcurrencyRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L66)optionaldesiredConcurrencyRatio **desiredConcurrencyRatio? : number = 0.90 Minimum level of desired concurrency to reach before more scaling up is allowed. ### [**](#isFinishedFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L38)optionalisFinishedFunction **isFinishedFunction? : () => Promise\ A function that is called only when there are no tasks to be processed. If it resolves to `true` then the pool's run finishes. Being called only when there are no tasks being processed means that as long as `isTaskReadyFunction()` keeps resolving to `true`, `isFinishedFunction()` will never be called. To abort a run, use the [AutoscaledPool.abort](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort) method. *** #### Type declaration * * **(): Promise\ - #### Returns Promise\ ### [**](#isTaskReadyFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L29)optionalisTaskReadyFunction **isTaskReadyFunction? : () => Promise\ A function that indicates whether `runTaskFunction` should be called. 
This function is called every time there is free capacity for a new task and it should indicate whether it should start a new task or not by resolving to either `true` or `false`. Besides its obvious use, it is also useful for task throttling to save resources. *** #### Type declaration * * **(): Promise\ - #### Returns Promise\ ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L129)optionallog **log? : [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#loggingIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L94)optionalloggingIntervalSecs **loggingIntervalSecs? : null | number = null | number Specifies a period in which the instance logs its state, in seconds. Set to `null` to disable periodic logging. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L53)optionalmaxConcurrency **maxConcurrency? : number = 200 The maximum number of tasks running in parallel. ### [**](#maxTasksPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L127)optionalmaxTasksPerMinute **maxTasksPerMinute? : number The maximum number of tasks per minute the pool can run. By default, this is set to `Infinity`, but you can pass any positive, non-zero integer. ### [**](#maybeRunIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L87)optionalmaybeRunIntervalSecs **maybeRunIntervalSecs? : number = 0.5 Indicates how often the pool should call the `runTaskFunction()` to start a new task, in seconds. This has no effect on starting new tasks immediately after a task completes. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L47)optionalminConcurrency **minConcurrency? : number = 1 The minimum number of tasks running in parallel. *WARNING:* If you set this value too high with respect to the available system memory and CPU, your code might run extremely slow or crash. If you're not sure, just keep the default value and the concurrency will scale up automatically. ### [**](#runTaskFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L21)optionalrunTaskFunction **runTaskFunction? : () => Promise\ A function that performs an asynchronous resource-intensive task. The function must either be labeled `async` or return a promise. *** #### Type declaration * * **(): Promise\ - #### Returns Promise\ ### [**](#scaleDownStepRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L80)optionalscaleDownStepRatio **scaleDownStepRatio? : number = 0.05 Defines the amount of desired concurrency to be subtracted with each scaling down. The minimum scaling step is one. ### [**](#scaleUpStepRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L73)optionalscaleUpStepRatio **scaleUpStepRatio? : number = 0.05 Defines the fractional amount of desired concurrency to be added with each scaling up. The minimum scaling step is one. ### [**](#snapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L114)optionalsnapshotterOptions **snapshotterOptions? 
: [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) Options to be passed down to the [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) constructor. This is useful for fine-tuning the snapshot intervals and history. ### [**](#systemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L121)optionalsystemStatusOptions **systemStatusOptions? : [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) Options to be passed down to the [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) constructor. This is useful for fine-tuning the system status reports. If a custom snapshotter is set in the options, it will be used by the pool. ### [**](#taskTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L108)optionaltaskTimeoutSecs **taskTimeoutSecs? : number = 0 Timeout in which the `runTaskFunction` needs to finish, given in seconds. --- # BaseHttpClient Interface for user-defined HTTP clients to be used for plain HTTP crawling and for sending additional requests during a crawl. ### Implemented by * [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ## Index[**](#Index) ### Methods * [**sendRequest](#sendRequest) * [**stream](#stream) ## Methods[**](#Methods) ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L183)sendRequest * ****sendRequest**\(request): Promise<[HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md)\> - Perform an HTTP Request and return the complete response. *** #### Parameters * ##### request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ #### Returns Promise<[HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md)\> ### [**](#stream)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L190)stream * ****stream**(request, onRedirect): Promise<[StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md)> - Perform an HTTP Request and return after the response headers are received. The body may be read from a stream contained in the response. *** #### Parameters * ##### request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ * ##### optionalonRedirect: [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) #### Returns Promise<[StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md)> --- # BaseHttpResponseData HTTP response data, without a body, as returned by [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) methods. 
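Returning briefly to the [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) described above: the following is a minimal sketch of how the three user functions typically fit together when the pool is run directly. The URL list, the `running` counter and the use of the global `fetch` are purely illustrative.

```
import { AutoscaledPool } from 'crawlee';

const urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'];
let running = 0;

const pool = new AutoscaledPool({
    minConcurrency: 1,
    maxConcurrency: 10,
    // May a new task be started right now?
    isTaskReadyFunction: async () => urls.length > 0,
    // Called only when no task is being processed; `true` finishes the run.
    isFinishedFunction: async () => urls.length === 0 && running === 0,
    // The resource-intensive work itself.
    runTaskFunction: async () => {
        const url = urls.shift();
        if (!url) return;
        running += 1;
        try {
            const response = await fetch(url);
            console.log(`${url} -> ${response.status}`);
        } finally {
            running -= 1;
        }
    },
});

await pool.run();
```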
## Index[**](#Index) ### Properties * [**complete](#complete) * [**headers](#headers) * [**ip](#ip) * [**redirectUrls](#redirectUrls) * [**statusCode](#statusCode) * [**statusMessage](#statusMessage) * [**trailers](#trailers) * [**url](#url) ## Properties[**](#Properties) ### [**](#complete)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L141)complete **complete: boolean ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L138)headers **headers: SimpleHeaders ### [**](#ip)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L134)optionalip **ip? : string ### [**](#redirectUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L131)redirectUrls **redirectUrls: URL\[] ### [**](#statusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L135)statusCode **statusCode: number ### [**](#statusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L136)optionalstatusMessage **statusMessage? : string ### [**](#trailers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L139)trailers **trailers: SimpleHeaders ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L132)url **url: string --- # ClientInfo ## Index[**](#Index) ### Properties * [**actualRatio](#actualRatio) * [**isOverloaded](#isOverloaded) * [**limitRatio](#limitRatio) ## Properties[**](#Properties) ### [**](#actualRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L82)actualRatio **actualRatio: number ### [**](#isOverloaded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L80)isOverloaded **isOverloaded: boolean ### [**](#limitRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L81)limitRatio **limitRatio: number --- # ConfigurationOptions ## Index[**](#Index) ### Properties * [**availableMemoryRatio](#availableMemoryRatio) * [**chromeExecutablePath](#chromeExecutablePath) * [**containerized](#containerized) * [**defaultBrowserPath](#defaultBrowserPath) * [**defaultDatasetId](#defaultDatasetId) * [**defaultKeyValueStoreId](#defaultKeyValueStoreId) * [**defaultRequestQueueId](#defaultRequestQueueId) * [**disableBrowserSandbox](#disableBrowserSandbox) * [**eventManager](#eventManager) * [**headless](#headless) * [**inputKey](#inputKey) * [**logLevel](#logLevel) * [**maxUsedCpuRatio](#maxUsedCpuRatio) * [**memoryMbytes](#memoryMbytes) * [**persistStateIntervalMillis](#persistStateIntervalMillis) * [**persistStorage](#persistStorage) * [**purgeOnStart](#purgeOnStart) * [**storageClient](#storageClient) * [**storageClientOptions](#storageClientOptions) * [**systemInfoIntervalMillis](#systemInfoIntervalMillis) * [**systemInfoV2](#systemInfoV2) * [**xvfb](#xvfb) ## Properties[**](#Properties) ### [**](#availableMemoryRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L81)optionalavailableMemoryRatio **availableMemoryRatio? : number = 0.25 Sets the ratio, defining the amount of system memory that could be used by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md). 
When the memory usage is more than the provided ratio, the memory is considered overloaded. Alternative to `CRAWLEE_AVAILABLE_MEMORY_RATIO` environment variable. ### [**](#chromeExecutablePath)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L135)optionalchromeExecutablePath **chromeExecutablePath? : string Defines a path to Chrome executable. Alternative to `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable. ### [**](#containerized)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L178)optionalcontainerized **containerized? : boolean Used in place of `isContainerized()` when collecting system metrics. Alternative to `CRAWLEE_CONTAINERIZED` environment variable. ### [**](#defaultBrowserPath)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L142)optionaldefaultBrowserPath **defaultBrowserPath? : string Defines a path to default browser executable. Alternative to `CRAWLEE_DEFAULT_BROWSER_PATH` environment variable. ### [**](#defaultDatasetId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L41)optionaldefaultDatasetId **defaultDatasetId? : string = ‘default’ Default dataset id. Alternative to `CRAWLEE_DEFAULT_DATASET_ID` environment variable. ### [**](#defaultKeyValueStoreId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L57)optionaldefaultKeyValueStoreId **defaultKeyValueStoreId? : string = ‘default’ Default key-value store id. Alternative to `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` environment variable. ### [**](#defaultRequestQueueId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L65)optionaldefaultRequestQueueId **defaultRequestQueueId? : string = ‘default’ Default request queue id. Alternative to `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` environment variable. ### [**](#disableBrowserSandbox)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L149)optionaldisableBrowserSandbox **disableBrowserSandbox? : boolean Defines whether to disable browser sandbox by adding `--no-sandbox` flag to `launchOptions`. Alternative to `CRAWLEE_DISABLE_BROWSER_SANDBOX` environment variable. ### [**](#eventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L27)optionaleventManager **eventManager? : [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) = [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) Defines the Event Manager to be used. ### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L120)optionalheadless **headless? : boolean = true Defines whether web browsers launched by Crawlee will run in the headless mode. Alternative to `CRAWLEE_HEADLESS` environment variable. ### [**](#inputKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L112)optionalinputKey **inputKey? : string = ‘INPUT’ Defines the default input key, i.e. the key that is used to get the crawler input value from the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) associated with the current crawler run. Alternative to `CRAWLEE_INPUT_KEY` environment variable. ### [**](#logLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L157)optionallogLevel **logLevel? 
: [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) Sets the log level to the given value. Alternative to `CRAWLEE_LOG_LEVEL` environment variable. ### [**](#maxUsedCpuRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L72)optionalmaxUsedCpuRatio **maxUsedCpuRatio? : number = 0.95 Sets the ratio, defining the maximum CPU usage. When the CPU usage is higher than the provided ratio, the CPU is considered overloaded. ### [**](#memoryMbytes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L89)optionalmemoryMbytes **memoryMbytes? : number Sets the amount of system memory in megabytes to be used by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md). By default, the maximum memory is set to one quarter of total system memory. Alternative to `CRAWLEE_MEMORY_MBYTES` environment variable. ### [**](#persistStateIntervalMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L97)optionalpersistStateIntervalMillis **persistStateIntervalMillis? : number = 60\_000 Defines the interval of emitting the `persistState` event. Alternative to `CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS` environment variable. ### [**](#persistStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L164)optionalpersistStorage **persistStorage? : boolean Defines whether the storage client used should persist the data it stores. Alternative to `CRAWLEE_PERSIST_STORAGE` environment variable. ### [**](#purgeOnStart)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L49)optionalpurgeOnStart **purgeOnStart? : boolean = true Defines whether to purge the default storage folders before starting the crawler run. Alternative to `CRAWLEE_PURGE_ON_START` environment variable. ### [**](#storageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L21)optionalstorageClient **storageClient? : [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) = [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) Defines storage client to be used. ### [**](#storageClientOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L33)optionalstorageClientOptions **storageClientOptions? : Dictionary Could be used to adjust the storage client behavior e.g. [MemoryStorageOptions](https://crawlee.dev/js/api/memory-storage/interface/MemoryStorageOptions.md) could be used to adjust the [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) behavior. ### [**](#systemInfoIntervalMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L103)optionalsystemInfoIntervalMillis **systemInfoIntervalMillis? : number = 1\_000 Defines the interval of emitting the `systemInfo` event. ### [**](#systemInfoV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L171)optionalsystemInfoV2 **systemInfoV2?
: boolean Defines whether to use the systemInfoV2 metric collection experiment. Alternative to `CRAWLEE_SYSTEM_INFO_V2` environment variable. ### [**](#xvfb)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L128)optionalxvfb **xvfb? : boolean = false Defines whether to run X virtual framebuffer on the web browsers launched by Crawlee. Alternative to `CRAWLEE_XVFB` environment variable. --- # Cookie ## Index[**](#Index) ### Properties * [**domain](#domain) * [**expires](#expires) * [**httpOnly](#httpOnly) * [**name](#name) * [**path](#path) * [**priority](#priority) * [**sameParty](#sameParty) * [**sameSite](#sameSite) * [**secure](#secure) * [**sourcePort](#sourcePort) * [**sourceScheme](#sourceScheme) * [**url](#url) * [**value](#value) ## Properties[**](#Properties) ### [**](#domain)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L20)optionaldomain **domain? : string Cookie domain. ### [**](#expires)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L40)optionalexpires **expires? : number Cookie expiration date, session cookie if not set ### [**](#httpOnly)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L32)optionalhttpOnly **httpOnly? : boolean True if cookie is http-only. ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L7)name **name: string Cookie name. ### [**](#path)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L24)optionalpath **path? : string Cookie path. ### [**](#priority)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L44)optionalpriority **priority? : Low | Medium | High Cookie Priority. ### [**](#sameParty)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L48)optionalsameParty **sameParty? : boolean True if cookie is SameParty. ### [**](#sameSite)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L36)optionalsameSite **sameSite? : Strict | Lax | None Cookie SameSite type. ### [**](#secure)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L28)optionalsecure **secure? : boolean True if cookie is secure. ### [**](#sourcePort)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L58)optionalsourcePort **sourcePort? : number Cookie source port. Valid values are `-1` or `1-65535`, `-1` indicates an unspecified port. An unspecified port value allows protocol clients to emulate legacy cookie scope for the port. This is a temporary ability and it will be removed in the future. ### [**](#sourceScheme)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L52)optionalsourceScheme **sourceScheme? : Unset | NonSecure | Secure Cookie source scheme type. ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L16)optionalurl **url? : string The request-URI to associate with the setting of the cookie. This value can affect the default domain, path, source port, and source scheme values of the created cookie. ### [**](#value)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L11)value **value: string Cookie value. 
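As a quick illustration of the fields above, here is a hedged sketch of a cookie object; the values are placeholders, and `expires` is assumed to follow the DevTools-style seconds-since-epoch convention.

```
import type { Cookie } from '@crawlee/types';

const sessionCookie: Cookie = {
    name: 'sessionid',
    value: 'abc123',
    domain: '.example.com',
    path: '/',
    secure: true,
    httpOnly: true,
    sameSite: 'Lax',
    // Assumed to be seconds since the epoch; omit `expires` to get a session cookie.
    expires: Math.floor(Date.now() / 1000) + 3600,
};
```

Objects shaped like this are what browser-related helpers (for example the cookie methods on the [Session](https://crawlee.dev/js/api/core/class/Session.md) class) typically accept; check the specific API for the exact signature.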
--- # CrawlingContext \ ### Hierarchy * [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md)\ * *CrawlingContext* * [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) * [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**pushData](#pushData) * [**sendRequest](#sendRequest) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from RestrictedCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)crawler **crawler: Crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)getKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Overrides RestrictedCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from RestrictedCrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from RestrictedCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from RestrictedCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. 
### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from RestrictedCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from RestrictedCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from RestrictedCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)enqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Overrides RestrictedCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from RestrictedCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)sendRequest * ****sendRequest**\(overrideOptions): Promise\> - Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. 
This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> --- # CreateSession Factory user-function which creates customized [Session](https://crawlee.dev/js/api/core/class/Session.md) instances. ### Callable * ****CreateSession**(sessionPool, options): [Session](https://crawlee.dev/js/api/core/class/Session.md) | Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> *** * #### Parameters * ##### sessionPool: [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Pool requesting the new session. * ##### optionaloptions: { sessionOptions?: [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) } * ##### optionalsessionOptions: [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) #### Returns [Session](https://crawlee.dev/js/api/core/class/Session.md) | Promise<[Session](https://crawlee.dev/js/api/core/class/Session.md)> --- # DatasetConsumer \ User-function used in the `Dataset.forEach()` API. ### Callable * ****DatasetConsumer**(item, index): Awaitable\ *** * #### Parameters * ##### item: Data Current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry being processed. * ##### index: number Position of current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry. #### Returns Awaitable\ --- # DatasetContent \ ## Index[**](#Index) ### Properties * [**count](#count) * [**desc](#desc) * [**items](#items) * [**limit](#limit) * [**offset](#offset) * [**total](#total) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L746)count **count: number Count of dataset entries returned in this set. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L754)optionaldesc **desc? : boolean Should the results be in descending order. ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L752)items **items: Data\[] Dataset entries based on chosen format parameter. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L750)limit **limit: number Maximum number of dataset entries requested. ### [**](#offset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L748)offset **offset: number Position of the first returned entry in the dataset. ### [**](#total)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L744)total **total: number Total count of entries in the dataset. --- # DatasetDataOptions ## Index[**](#Index) ### Properties * [**clean](#clean) * [**desc](#desc) * [**fields](#fields) * [**limit](#limit) * [**offset](#offset) * [**skipEmpty](#skipEmpty) * [**skipHidden](#skipHidden) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#clean)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L128)optionalclean **clean? : boolean = false If `true` then the function returns only non-empty items and skips hidden fields (i.e. 
fields starting with `#` character). Note that the `clean` parameter is a shortcut for `skipHidden: true` and `skipEmpty: true` options. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L110)optionaldesc **desc? : boolean = false If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. ### [**](#fields)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L115)optionalfields **fields? : string\[] An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L103)optionallimit **limit? : number = 250000 Maximum number of array elements to return. ### [**](#offset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L97)optionaloffset **offset? : number = 0 Number of array elements that should be skipped at the start. ### [**](#skipEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L141)optionalskipEmpty **skipEmpty? : boolean = false If `true` then the function doesn't return empty items. Note that in this case the returned number of items might be lower than limit parameter and pagination must be done using the `limit` value. ### [**](#skipHidden)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L134)optionalskipHidden **skipHidden? : boolean = false If `true` then the function doesn't return hidden fields (fields starting with "#" character). ### [**](#unwind)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L121)optionalunwind **unwind? : string Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # DatasetExportOptions ### Hierarchy * Omit<[DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md), offset | limit> * *DatasetExportOptions* * [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ## Index[**](#Index) ### Properties * [**clean](#clean) * [**collectAllKeys](#collectAllKeys) * [**desc](#desc) * [**fields](#fields) * [**skipEmpty](#skipEmpty) * [**skipHidden](#skipHidden) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#clean)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L128)optionalinheritedclean **clean? : boolean = false Inherited from Omit.clean If `true` then the function returns only non-empty items and skips hidden fields (i.e. fields starting with `#` character). Note that the `clean` parameter is a shortcut for `skipHidden: true` and `skipEmpty: true` options. ### [**](#collectAllKeys)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L149)optionalcollectAllKeys **collectAllKeys? : boolean If true, includes all unique keys from all dataset items in the CSV export header. If omitted or false, only keys from the first item are used. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L110)optionalinheriteddesc **desc? : boolean = false Inherited from Omit.desc If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. 
### [**](#fields)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L115)optionalinheritedfields **fields? : string\[] Inherited from Omit.fields An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#skipEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L141)optionalinheritedskipEmpty **skipEmpty? : boolean = false Inherited from Omit.skipEmpty If `true` then the function doesn't return empty items. Note that in this case the returned number of items might be lower than limit parameter and pagination must be done using the `limit` value. ### [**](#skipHidden)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L134)optionalinheritedskipHidden **skipHidden? : boolean = false Inherited from Omit.skipHidden If `true` then the function doesn't return hidden fields (fields starting with "#" character). ### [**](#unwind)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L121)optionalinheritedunwind **unwind? : string Inherited from Omit.unwind Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # DatasetExportToOptions ### Hierarchy * [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) * *DatasetExportToOptions* ## Index[**](#Index) ### Properties * [**clean](#clean) * [**collectAllKeys](#collectAllKeys) * [**desc](#desc) * [**fields](#fields) * [**fromDataset](#fromDataset) * [**skipEmpty](#skipEmpty) * [**skipHidden](#skipHidden) * [**toKVS](#toKVS) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#clean)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L128)optionalinheritedclean **clean? : boolean = false Inherited from DatasetExportOptions.clean If `true` then the function returns only non-empty items and skips hidden fields (i.e. fields starting with `#` character). Note that the `clean` parameter is a shortcut for `skipHidden: true` and `skipEmpty: true` options. ### [**](#collectAllKeys)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L149)optionalinheritedcollectAllKeys **collectAllKeys? : boolean Inherited from DatasetExportOptions.collectAllKeys If true, includes all unique keys from all dataset items in the CSV export header. If omitted or false, only keys from the first item are used. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L110)optionalinheriteddesc **desc? : boolean = false Inherited from DatasetExportOptions.desc If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. ### [**](#fields)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L115)optionalinheritedfields **fields? : string\[] Inherited from DatasetExportOptions.fields An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#fromDataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L177)optionalfromDataset **fromDataset? : string ### [**](#skipEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L141)optionalinheritedskipEmpty **skipEmpty? 
: boolean = false Inherited from DatasetExportOptions.skipEmpty If `true` then the function doesn't return empty items. Note that in this case the returned number of items might be lower than limit parameter and pagination must be done using the `limit` value. ### [**](#skipHidden)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L134)optionalinheritedskipHidden **skipHidden? : boolean = false Inherited from DatasetExportOptions.skipHidden If `true` then the function doesn't return hidden fields (fields starting with "#" character). ### [**](#toKVS)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L178)optionaltoKVS **toKVS? : string ### [**](#unwind)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L121)optionalinheritedunwind **unwind? : string Inherited from DatasetExportOptions.unwind Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # DatasetIteratorOptions ### Hierarchy * Omit<[DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md), offset | limit | clean | skipHidden | skipEmpty> * *DatasetIteratorOptions* ## Index[**](#Index) ### Properties * [**desc](#desc) * [**fields](#fields) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L110)optionalinheriteddesc **desc? : boolean = false Inherited from Omit.desc If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. ### [**](#fields)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L115)optionalinheritedfields **fields? : string\[] Inherited from Omit.fields An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#unwind)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L121)optionalinheritedunwind **unwind? : string Inherited from Omit.unwind Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # DatasetMapper \ User-function used in the `Dataset.map()` API. ### Callable * ****DatasetMapper**(item, index): Awaitable\ *** * User-function used in the `Dataset.map()` API. *** #### Parameters * ##### item: Data Current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry being processed. * ##### index: number Position of current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry. #### Returns Awaitable\ --- # DatasetOptions ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**storageObject](#storageObject) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L738)client **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L736)id **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L737)optionalname **name? 
: string ### [**](#storageObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L739)optionalstorageObject **storageObject? : Record\ --- # DatasetReducer \ User-function used in the `Dataset.reduce()` API. ### Callable * ****DatasetReducer**(memo, item, index): Awaitable\ *** * #### Parameters * ##### memo: T Previous state of the reduction. * ##### item: Data Current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry being processed. * ##### index: number Position of current [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) entry. #### Returns Awaitable\ --- # EnqueueLinksOptions ### Hierarchy * [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) * *EnqueueLinksOptions* ## Index[**](#Index) ### Properties * [**baseUrl](#baseUrl) * [**exclude](#exclude) * [**forefront](#forefront) * [**globs](#globs) * [**label](#label) * [**limit](#limit) * [**onSkippedRequest](#onSkippedRequest) * [**pseudoUrls](#pseudoUrls) * [**regexps](#regexps) * [**requestQueue](#requestQueue) * [**robotsTxtFile](#robotsTxtFile) * [**selector](#selector) * [**skipNavigation](#skipNavigation) * [**strategy](#strategy) * [**transformRequestFunction](#transformRequestFunction) * [**urls](#urls) * [**userData](#userData) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#baseUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L68)optionalbaseUrl **baseUrl? : string A base URL that will be used to resolve relative URLs when using Cheerio. Ignored when using Puppeteer, since the relative URL resolution is done inside the browser automatically. ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L94)optionalexclude **exclude? : readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from RequestQueueOperationOptions.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L83)optionalglobs **globs? : readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. 
The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the function enqueues the links with the same subdomain. ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L56)optionallabel **label? : string Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this option. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L36)optionallimit **limit? : number Limit the amount of actually enqueued URLs to this number. Useful for testing across the entire crawling scope. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L192)optionalonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. or because the maxRequestsPerCrawl limit has been reached ### [**](#pseudoUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L126)optionalpseudoUrls **pseudoUrls? : readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] *NOTE:* In future versions of SDK the options will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues the links with the same subdomain. * **@deprecated** prefer using `globs` or `regexps` instead ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L106)optionalregexps **regexps? : readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the function enqueues the links with the same subdomain. 
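Inside a crawler, these pattern options are usually passed to `enqueueLinks()` from the request handler. A short sketch with `CheerioCrawler`; the URL patterns and the `DETAIL` label are illustrative:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Enqueue only product pages, never the login section, and label
        // the new requests so a router can dispatch them later.
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            exclude: ['https://example.com/login/**'],
            label: 'DETAIL',
        });
    },
});

await crawler.run(['https://example.com']);
```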
### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L42)optionalrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) A request queue to which the URLs will be enqueued. ### [**](#robotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L183)optionalrobotsTxtFile **robotsTxtFile? : Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed> RobotsTxtFile instance for the current request that triggered the `enqueueLinks`. If provided, disallowed URLs will be ignored. ### [**](#selector)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L45)optionalselector **selector? : string A CSS selector matching links to be enqueued. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L62)optionalskipNavigation **skipNavigation? : boolean = false If set to `true`, tells the crawler to skip navigation and process the request directly. ### [**](#strategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L171)optionalstrategy **strategy? : [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin = [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin The strategy to use when enqueueing the URLs. Depending on the strategy you select, we will only check certain parts of the URLs found. Here is a diagram of each URL part and their name:

```
Protocol          Domain
┌────┐          ┌─────────┐
https://example.crawlee.dev/...
│       └─────────────────┤
│           Hostname      │
│                         │
└─────────────────────────┘
             Origin
```

### [**](#transformRequestFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L151)optionaltransformRequestFunction **transformRequestFunction? : [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `keepUrlFragment: true` to the `request` object, URL fragments will not be removed when `uniqueKey` is computed. **Example:**

```
{
    transformRequestFunction: (request) => {
        request.userData.foo = 'bar';
        request.keepUrlFragment = true;
        return request;
    }
}
```

Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this function, so options returned by `transformRequestFunction` may be overwritten by those pattern-based options. ### [**](#urls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L39)optionalurls **urls? : readonly string\[] An array of URLs to enqueue.
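The `strategy` option narrows enqueued links by the URL parts pictured in the diagram above. A minimal sketch using the `EnqueueStrategy` enum (the start URL is illustrative):

```
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ enqueueLinks }) {
        // Keep only links whose hostname matches the page being processed;
        // links to other (sub)domains are dropped.
        await enqueueLinks({
            strategy: EnqueueStrategy.SameHostname,
        });
    },
});

await crawler.run(['https://crawlee.dev']);
```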
### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L48)optionaluserData **userData? : Dictionary Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L177)optionalwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean By default, only the first batch (1000) of found requests will be added to the queue before resolving the call. You can use this option to wait for adding all of them. --- # ErrnoException Node.js Error interface ### Hierarchy * Error * *ErrnoException* ## Index[**](#Index) ### Properties * [**cause](#cause) * [**code](#code) * [**errno](#errno) * [**message](#message) * [**name](#name) * [**path](#path) * [**stack](#stack) * [**syscall](#syscall) ## Properties[**](#Properties) ### [**](#cause)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L14)optionalcause **cause? : any Overrides Error.cause ### [**](#code)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L11)optionalcode **code? : string | number ### [**](#errno)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L10)optionalerrno **errno? : number ### [**](#message)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from Error.message ### [**](#name)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from Error.name ### [**](#path)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L12)optionalpath **path? : string ### [**](#stack)[**](https://undefined/apify/crawlee/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from Error.stack ### [**](#syscall)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L13)optionalsyscall **syscall? 
: string --- # ErrorTrackerOptions ## Index[**](#Index) ### Properties * [**saveErrorSnapshots](#saveErrorSnapshots) * [**showErrorCode](#showErrorCode) * [**showErrorMessage](#showErrorMessage) * [**showErrorName](#showErrorName) * [**showFullMessage](#showFullMessage) * [**showFullStack](#showFullStack) * [**showStackTrace](#showStackTrace) ## Properties[**](#Properties) ### [**](#saveErrorSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L24)saveErrorSnapshots **saveErrorSnapshots: boolean ### [**](#showErrorCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L18)showErrorCode **showErrorCode: boolean ### [**](#showErrorMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L22)showErrorMessage **showErrorMessage: boolean ### [**](#showErrorName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L19)showErrorName **showErrorName: boolean ### [**](#showFullMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L23)showFullMessage **showFullMessage: boolean ### [**](#showFullStack)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L21)showFullStack **showFullStack: boolean ### [**](#showStackTrace)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L20)showStackTrace **showStackTrace: boolean --- # FinalStatistics ## Index[**](#Index) ### Properties * [**crawlerRuntimeMillis](#crawlerRuntimeMillis) * [**requestAvgFailedDurationMillis](#requestAvgFailedDurationMillis) * [**requestAvgFinishedDurationMillis](#requestAvgFinishedDurationMillis) * [**requestsFailed](#requestsFailed) * [**requestsFailedPerMinute](#requestsFailedPerMinute) * [**requestsFinished](#requestsFinished) * [**requestsFinishedPerMinute](#requestsFinishedPerMinute) * [**requestsTotal](#requestsTotal) * [**requestTotalDurationMillis](#requestTotalDurationMillis) * [**retryHistogram](#retryHistogram) ## Properties[**](#Properties) ### [**](#crawlerRuntimeMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L95)crawlerRuntimeMillis **crawlerRuntimeMillis: number ### [**](#requestAvgFailedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L89)requestAvgFailedDurationMillis **requestAvgFailedDurationMillis: number ### [**](#requestAvgFinishedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L90)requestAvgFinishedDurationMillis **requestAvgFinishedDurationMillis: number ### [**](#requestsFailed)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L87)requestsFailed **requestsFailed: number ### [**](#requestsFailedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L92)requestsFailedPerMinute **requestsFailedPerMinute: number ### [**](#requestsFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L86)requestsFinished **requestsFinished: number ### [**](#requestsFinishedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L91)requestsFinishedPerMinute **requestsFinishedPerMinute: number ### 
[**](#requestsTotal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L94)requestsTotal **requestsTotal: number ### [**](#requestTotalDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L93)requestTotalDurationMillis **requestTotalDurationMillis: number ### [**](#retryHistogram)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L88)retryHistogram **retryHistogram: number\[] --- # HttpRequest \ HTTP Request as accepted by [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) methods. ### Hierarchy * *HttpRequest* * [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ## Index[**](#Index) ### Properties * [**body](#body) * [**cookieJar](#cookieJar) * [**encoding](#encoding) * [**followRedirect](#followRedirect) * [**headerGenerator](#headerGenerator) * [**headerGeneratorOptions](#headerGeneratorOptions) * [**headers](#headers) * [**insecureHTTPParser](#insecureHTTPParser) * [**maxRedirects](#maxRedirects) * [**method](#method) * [**proxyUrl](#proxyUrl) * [**responseType](#responseType) * [**sessionToken](#sessionToken) * [**signal](#signal) * [**throwHttpErrors](#throwHttpErrors) * [**timeout](#timeout) * [**url](#url) * [**useHeaderGenerator](#useHeaderGenerator) ## Properties[**](#Properties) ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L84)optionalbody **body? : string | Readable | Buffer\ | Generator\ | AsyncGenerator\ | FormDataLike ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L89)optionalcookieJar **cookieJar? : ToughCookieJar | PromiseCookieJar ### [**](#encoding)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L93)optionalencoding **encoding? : BufferEncoding ### [**](#followRedirect)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L90)optionalfollowRedirect **followRedirect? : boolean | (response) => boolean ### [**](#headerGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L101)optionalheaderGenerator **headerGenerator? : { getHeaders: (options) => Record\ } #### Type declaration * ##### getHeaders: (options) => Record\ * * **(options): Record\ - #### Parameters * ##### options: Record\ #### Returns Record\ ### [**](#headerGeneratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L99)optionalheaderGeneratorOptions **headerGeneratorOptions? : Record\ ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L83)optionalheaders **headers? : SimpleHeaders ### [**](#insecureHTTPParser)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L104)optionalinsecureHTTPParser **insecureHTTPParser? : boolean ### [**](#maxRedirects)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L91)optionalmaxRedirects **maxRedirects? : number ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L82)optionalmethod **method? 
: Method ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L98)optionalproxyUrl **proxyUrl? : string ### [**](#responseType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L94)optionalresponseType **responseType? : TResponseType ### [**](#sessionToken)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L105)optionalsessionToken **sessionToken? : object ### [**](#signal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L86)optionalsignal **signal? : AbortSignal ### [**](#throwHttpErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L95)optionalthrowHttpErrors **throwHttpErrors? : boolean ### [**](#timeout)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L87)optionaltimeout **timeout? : Partial\ ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L81)url **url: string | URL ### [**](#useHeaderGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L100)optionaluseHeaderGenerator **useHeaderGenerator? : boolean --- # HttpRequestOptions \ Additional options for HTTP requests that need to be handled separately before passing to [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md). ### Hierarchy * [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ * *HttpRequestOptions* ## Index[**](#Index) ### Properties * [**body](#body) * [**cookieJar](#cookieJar) * [**encoding](#encoding) * [**followRedirect](#followRedirect) * [**form](#form) * [**headerGenerator](#headerGenerator) * [**headerGeneratorOptions](#headerGeneratorOptions) * [**headers](#headers) * [**insecureHTTPParser](#insecureHTTPParser) * [**json](#json) * [**maxRedirects](#maxRedirects) * [**method](#method) * [**password](#password) * [**proxyUrl](#proxyUrl) * [**responseType](#responseType) * [**searchParams](#searchParams) * [**sessionToken](#sessionToken) * [**signal](#signal) * [**throwHttpErrors](#throwHttpErrors) * [**timeout](#timeout) * [**url](#url) * [**useHeaderGenerator](#useHeaderGenerator) * [**username](#username) ## Properties[**](#Properties) ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L84)optionalinheritedbody **body? : string | Readable | Buffer\ | Generator\ | AsyncGenerator\ | FormDataLike Inherited from HttpRequest.body ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L89)optionalinheritedcookieJar **cookieJar? : ToughCookieJar | PromiseCookieJar Inherited from HttpRequest.cookieJar ### [**](#encoding)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L93)optionalinheritedencoding **encoding? : BufferEncoding Inherited from HttpRequest.encoding ### [**](#followRedirect)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L90)optionalinheritedfollowRedirect **followRedirect? : boolean | (response) => boolean Inherited from HttpRequest.followRedirect ### [**](#form)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L117)optionalform **form? 
: Record\ A form to be sent in the HTTP request body (URL encoding will be used). ### [**](#headerGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L101)optionalinheritedheaderGenerator **headerGenerator? : { getHeaders: (options) => Record\ } Inherited from HttpRequest.headerGenerator #### Type declaration * ##### getHeaders: (options) => Record\ * * **(options): Record\ - #### Parameters * ##### options: Record\ #### Returns Record\ ### [**](#headerGeneratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L99)optionalinheritedheaderGeneratorOptions **headerGeneratorOptions? : Record\ Inherited from HttpRequest.headerGeneratorOptions ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L83)optionalinheritedheaders **headers? : SimpleHeaders Inherited from HttpRequest.headers ### [**](#insecureHTTPParser)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L104)optionalinheritedinsecureHTTPParser **insecureHTTPParser? : boolean Inherited from HttpRequest.insecureHTTPParser ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L119)optionaljson **json? : unknown Arbitrary object to be JSON-serialized and sent as the HTTP request body. ### [**](#maxRedirects)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L91)optionalinheritedmaxRedirects **maxRedirects? : number Inherited from HttpRequest.maxRedirects ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L82)optionalinheritedmethod **method? : Method Inherited from HttpRequest.method ### [**](#password)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L124)optionalpassword **password? : string Basic HTTP Auth password ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L98)optionalinheritedproxyUrl **proxyUrl? : string Inherited from HttpRequest.proxyUrl ### [**](#responseType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L94)optionalinheritedresponseType **responseType? : TResponseType Inherited from HttpRequest.responseType ### [**](#searchParams)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L114)optionalsearchParams **searchParams? : [SearchParams](https://crawlee.dev/js/api/utils.md#SearchParams) Search (query string) parameters to be appended to the request URL ### [**](#sessionToken)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L105)optionalinheritedsessionToken **sessionToken? : object Inherited from HttpRequest.sessionToken ### [**](#signal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L86)optionalinheritedsignal **signal? : AbortSignal Inherited from HttpRequest.signal ### [**](#throwHttpErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L95)optionalinheritedthrowHttpErrors **throwHttpErrors?
: boolean Inherited from HttpRequest.throwHttpErrors ### [**](#timeout)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L87)optionalinheritedtimeout **timeout? : Partial\ Inherited from HttpRequest.timeout ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L81)inheritedurl **url: string | URL Inherited from HttpRequest.url ### [**](#useHeaderGenerator)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L100)optionalinheriteduseHeaderGenerator **useHeaderGenerator? : boolean Inherited from HttpRequest.useHeaderGenerator ### [**](#username)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L122)optionalusername **username? : string Basic HTTP Auth username --- # HttpResponse \ HTTP response data as returned by the [BaseHttpClient.sendRequest](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md#sendRequest) method. ### Hierarchy * HttpResponseWithoutBody\ * *HttpResponse* ## Index[**](#Index) ### Properties * [**body](#body) * [**complete](#complete) * [**headers](#headers) * [**ip](#ip) * [**redirectUrls](#redirectUrls) * [**request](#request) * [**statusCode](#statusCode) * [**statusMessage](#statusMessage) * [**trailers](#trailers) * [**url](#url) ## Properties[**](#Properties) ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L156)body **body: [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md)\[TResponseType] ### [**](#complete)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L141)inheritedcomplete **complete: boolean Inherited from HttpResponseWithoutBody.complete ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L138)inheritedheaders **headers: SimpleHeaders Inherited from HttpResponseWithoutBody.headers ### [**](#ip)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L134)optionalinheritedip **ip? : string Inherited from HttpResponseWithoutBody.ip ### [**](#redirectUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L131)inheritedredirectUrls **redirectUrls: URL\[] Inherited from HttpResponseWithoutBody.redirectUrls ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L146)inheritedrequest **request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ Inherited from HttpResponseWithoutBody.request ### [**](#statusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L135)inheritedstatusCode **statusCode: number Inherited from HttpResponseWithoutBody.statusCode ### [**](#statusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L136)optionalinheritedstatusMessage **statusMessage? 
: string Inherited from HttpResponseWithoutBody.statusMessage ### [**](#trailers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L139)inheritedtrailers **trailers: SimpleHeaders Inherited from HttpResponseWithoutBody.trailers ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L132)inheritedurl **url: string Inherited from HttpResponseWithoutBody.url --- # IRequestList Represents a static list of URLs to crawl. ### Implemented by * [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) * [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ## Index[**](#Index) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**fetchNextRequest](#fetchNextRequest) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**length](#length) * [**markRequestHandled](#markRequestHandled) * [**persistState](#persistState) * [**reclaimRequest](#reclaimRequest) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L72)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, any, any> - Can be used to iterate over the `RequestList` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. *** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, any, any> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L66)fetchNextRequest * ****fetchNextRequest**(): Promise\> - Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. First, the function gets a request previously reclaimed using the [RequestList.reclaimRequest](https://crawlee.dev/js/api/core/class/RequestList.md#reclaimRequest) function, if there is any. Otherwise it gets the next request from sources. The function's `Promise` resolves to `null` if there are no more requests to process. *** #### Returns Promise\> ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L47)handledCount * ****handledCount**(): number - Returns number of handled requests. *** #### Returns number ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L42)isEmpty * ****isEmpty**(): Promise\ - Resolves to `true` if the next call to [IRequestList.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestList.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the list is empty, there might be some pending requests currently being processed. *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L35)isFinished * ****isFinished**(): Promise\ - Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#length)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L30)length * ****length**(): number - Returns the total number of unique requests present in the list. 
*** #### Returns number ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L83)markRequestHandled * ****markRequestHandled**(request): Promise\ - Marks request as handled after successful processing. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#persistState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L56)persistState * ****persistState**(): Promise\ - Persists the current state of the `IRequestList` into the default [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). The state is persisted automatically in regular intervals, but calling this method manually is useful in cases where you want to have the most current state available after you pause or stop fetching its requests. For example after you pause or abort a crawl. Or just before a server migration. *** #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L78)reclaimRequest * ****reclaimRequest**(request): Promise\ - Reclaims request to the list if its processing failed. The request will become available in the next `this.fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ --- # IRequestManager Represents a provider of requests/URLs to crawl. ### Implemented by * [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) * [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ## Index[**](#Index) ### Methods * [**\[asyncIterator\]](#\[asyncIterator]) * [**addRequest](#addRequest) * [**addRequestsBatched](#addRequestsBatched) * [**fetchNextRequest](#fetchNextRequest) * [**getPendingCount](#getPendingCount) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) ## Methods[**](#Methods) ### [**](#\[asyncIterator])[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L84)\[asyncIterator] * ****\[asyncIterator]**(): AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, any, any> - Can be used to iterate over the `RequestManager` instance in a `for await .. of` loop. Provides an alternative for the repeated use of `fetchNextRequest`. 
*** #### Returns AsyncGenerator<[Request](https://crawlee.dev/js/api/core/class/Request.md)\, any, any> ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L97)addRequest * ****addRequest**(requestLike, options): Promise\ - #### Parameters * ##### requestLike: [Source](https://crawlee.dev/js/api/core.md#Source) * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) #### Returns Promise\ ### [**](#addRequestsBatched)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L99)addRequestsBatched * ****addRequestsBatched**(requests, options): Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> - #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) * ##### optionaloptions: [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) #### Returns Promise<[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md)> ### [**](#fetchNextRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L78)fetchNextRequest * ****fetchNextRequest**\(): Promise\> - Gets the next [Request](https://crawlee.dev/js/api/core/class/Request.md) to process. The function's `Promise` resolves to `null` if there are no more requests to process. *** #### Returns Promise\> ### [**](#getPendingCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L70)getPendingCount * ****getPendingCount**(): number - Get an offline approximation of the number of pending requests. *** #### Returns number ### [**](#getTotalCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L65)getTotalCount * ****getTotalCount**(): number - Get the total number of requests known to the request manager. *** #### Returns number ### [**](#handledCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L60)handledCount * ****handledCount**(): Promise\ - Returns number of handled requests. *** #### Returns Promise\ ### [**](#isEmpty)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L55)isEmpty * ****isEmpty**(): Promise\ - Resolves to `true` if the next call to [IRequestManager.fetchNextRequest](https://crawlee.dev/js/api/core/interface/IRequestManager.md#fetchNextRequest) function would return `null`, otherwise it resolves to `false`. Note that even if the provider is empty, there might be some pending requests currently being processed. *** #### Returns Promise\ ### [**](#isFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L48)isFinished * ****isFinished**(): Promise\ - Returns `true` if all requests were already handled and there are no more left. *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L89)markRequestHandled * ****markRequestHandled**(request): Promise\ - Marks request as handled after successful processing. 
*** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L95)reclaimRequest * ****reclaimRequest**(request, options): Promise\ - Reclaims request to the provider if its processing failed. The request will become available in the next `fetchNextRequest()`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### optionaloptions: [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) #### Returns Promise\ --- # IStorage ### Implemented by * [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ## Index[**](#Index) ### Properties * [**id](#id) * [**name](#name) ## Properties[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L15)id **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L16)optionalname **name? : string --- # KeyConsumer User-function used in the [KeyValueStore.forEachKey](https://crawlee.dev/js/api/core/class/KeyValueStore.md#forEachKey) method. ### Callable * ****KeyConsumer**(key, index, info): Awaitable\ *** * #### Parameters * ##### key: string Current [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) key being processed. * ##### index: number Position of the current key in [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md). * ##### info: { size: number } Information about the current [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) entry. * ##### size: number Size of the value associated with the current key in bytes. #### Returns Awaitable\ --- # KeyValueStoreIteratorOptions ## Index[**](#Index) ### Properties * [**collection](#collection) * [**exclusiveStartKey](#exclusiveStartKey) * [**prefix](#prefix) ## Properties[**](#Properties) ### [**](#collection)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L770)optionalcollection **collection? : string Collection name to use for listing keys. ### [**](#exclusiveStartKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L762)optionalexclusiveStartKey **exclusiveStartKey? : string All keys up to this one (including) are skipped from the result. ### [**](#prefix)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L766)optionalprefix **prefix? : string If set, only keys that start with this prefix are returned. --- # KeyValueStoreOptions ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**storageObject](#storageObject) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L737)client **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L735)id **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L736)optionalname **name? 
: string ### [**](#storageObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L738)optionalstorageObject **storageObject? : Record\ --- # LoggerOptions ## Index[**](#Index) ### Properties * [**data](#data) * [**level](#level) * [**logger](#logger) * [**maxDepth](#maxDepth) * [**maxStringLength](#maxStringLength) * [**prefix](#prefix) * [**suffix](#suffix) ## Properties[**](#Properties) ### [**](#data)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L61)externaloptionaldata **data? : Record\ Additional data to be added to each log line. ### [**](#level)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L46)externaloptionallevel **level? : number Sets the log level to the given value, preventing messages from less important log levels from being printed to the console. Use in conjunction with the `log.LEVELS` constants. ### [**](#logger)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L59)externaloptionallogger **logger? : [Logger](https://crawlee.dev/js/api/core/class/Logger.md) Logger implementation to be used. The default is `log.LoggerText`, which logs messages as easily readable strings. Optionally you can use `log.LoggerJson`, which formats each log line as JSON. ### [**](#maxDepth)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L48)externaloptionalmaxDepth **maxDepth? : number Max depth of the data object that will be logged. Anything deeper than the limit will be stripped off. ### [**](#maxStringLength)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L50)externaloptionalmaxStringLength **maxStringLength? : number Max length of the string to be logged. Longer strings will be truncated. ### [**](#prefix)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L52)externaloptionalprefix **prefix? : null | string Prefix to be prepended to each logged line. ### [**](#suffix)[**](https://undefined/apify/crawlee/blob/master/node_modules/@apify/log/src/index.d.ts#L54)externaloptionalsuffix **suffix? : null | string Suffix that will be appended to each logged line. --- # PersistenceOptions Persistence-related options to control how and when crawler's data gets persisted. ## Index[**](#Index) ### Properties * [**enable](#enable) ## Properties[**](#Properties) ### [**](#enable)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L46)optionalenable **enable? : boolean = true Use this flag to disable or enable periodic persistence to the key-value store. --- # ProxyConfigurationFunction ### Callable * ****ProxyConfigurationFunction**(sessionId, options): null | string | Promise\ *** * #### Parameters * ##### sessionId: string | number * ##### optionaloptions: { request?: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ } * ##### optionalrequest: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns null | string | Promise\ --- # ProxyConfigurationOptions ## Index[**](#Index) ### Properties * [**newUrlFunction](#newUrlFunction) * [**proxyUrls](#proxyUrls) * [**tieredProxyUrls](#tieredProxyUrls) ## Properties[**](#Properties) ### [**](#newUrlFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L29)optionalnewUrlFunction **newUrlFunction?
: [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) Custom function that allows you to generate the new proxy URL dynamically. It gets the `sessionId` as a parameter and an optional parameter with the `Request` object when applicable. Can return either a stringified proxy URL or `null` if the proxy should not be used. Can be asynchronous. This function is used to generate the URL when [ProxyConfiguration.newUrl](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newUrl) or [ProxyConfiguration.newProxyInfo](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newProxyInfo) is called. ### [**](#proxyUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L21)optionalproxyUrls **proxyUrls? : UrlList An array of custom proxy URLs to be rotated. Custom proxies are not compatible with Apify Proxy and an attempt to use both configuration options will cause an error to be thrown on initialize. ### [**](#tieredProxyUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L42)optionaltieredProxyUrls **tieredProxyUrls? : UrlList\[] An array of custom proxy URLs to be rotated, stratified in tiers. This is a more advanced version of `proxyUrls` that allows you to define a hierarchy of proxy URLs. If everything goes well, all the requests will be sent through the first proxy URL in the list. Whenever the crawler encounters a problem with the current proxy on the given domain, it will switch to a higher tier for this domain. The crawler periodically probes lower-tier proxies to check whether it can switch back down. This feature is useful when you have a set of proxies with different performance characteristics (speed, price, anti-bot performance, etc.) and you want to use the best one for each domain. Use `null` as a proxy URL to disable the proxy for the given tier. --- # ProxyInfo The main purpose of the ProxyInfo object is to provide information about the current proxy connection used by the crawler for the request. Outside of crawlers, you can get this object by calling [ProxyConfiguration.newProxyInfo](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newProxyInfo). **Example usage:**

```
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['...', '...'] // List of Proxy URLs to rotate
});

// Getting proxyInfo object by calling class method directly
const proxyInfo = await proxyConfiguration.newProxyInfo();

// In crawler
const crawler = new CheerioCrawler({
    // ...
    proxyConfiguration,
    requestHandler({ proxyInfo }) {
        // Getting used proxy URL
        const proxyUrl = proxyInfo.url;

        // Getting ID of used Session
        const sessionIdentifier = proxyInfo.sessionId;
    }
})
```

## Index[**](#Index) ### Properties * [**hostname](#hostname) * [**password](#password) * [**port](#port) * [**proxyTier](#proxyTier) * [**sessionId](#sessionId) * [**url](#url) * [**username](#username) ## Properties[**](#Properties) ### [**](#hostname)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L104)hostname **hostname: string Hostname of your proxy. ### [**](#password)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L99)password **password: string User's password for the proxy. ### [**](#port)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L109)port **port: string | number Proxy port.
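A hedged sketch of how `tieredProxyUrls` (documented above) works together with the `proxyTier` field listed below; the proxy URLs are placeholders, not real endpoints:

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Tier 0 disables the proxy, higher tiers use increasingly "strong" proxies.
// The crawler escalates to a higher tier per domain when it runs into
// blocking and periodically probes lower tiers to downshift again.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        [null],
        ['http://datacenter-proxy.example.com:8000'],
        ['http://residential-proxy.example.com:8000'],
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, proxyInfo, log }) {
        log.info(`${request.url} fetched via proxy tier ${proxyInfo?.proxyTier}`);
    },
});
```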
### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L114)optionalproxyTier **proxyTier? : number Proxy tier for the current proxy, if applicable (only for `tieredProxyUrls`). ### [**](#sessionId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L84)optionalsessionId **sessionId? : string The identifier of used [Session](https://crawlee.dev/js/api/core/class/Session.md), if used. ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L89)url **url: string The URL of the proxy. ### [**](#username)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L94)optionalusername **username? : string Username for the proxy. --- # PushErrorMessageOptions ## Index[**](#Index) ### Properties * [**omitStack](#omitStack) ## Properties[**](#Properties) ### [**](#omitStack)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L564)optionalomitStack **omitStack? : boolean = false Only push the error message without stack trace when true. --- # QueueOperationInfo A helper class that is used to report results from various [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) functions as well as enqueueLinks. ## Index[**](#Index) ### Properties * [**requestId](#requestId) * [**wasAlreadyHandled](#wasAlreadyHandled) * [**wasAlreadyPresent](#wasAlreadyPresent) ## Properties[**](#Properties) ### [**](#requestId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L15)requestId **requestId: string The ID of the added request ### [**](#wasAlreadyHandled)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L12)wasAlreadyHandled **wasAlreadyHandled: boolean Indicates if request was already marked as handled. ### [**](#wasAlreadyPresent)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L9)wasAlreadyPresent **wasAlreadyPresent: boolean Indicates if request was already present in the queue. --- # RecordOptions ## Index[**](#Index) ### Properties * [**contentType](#contentType) * [**doNotRetryTimeouts](#doNotRetryTimeouts) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L745)optionalcontentType **contentType? : string Specifies a custom MIME content type of the record. ### [**](#doNotRetryTimeouts)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L755)optionaldoNotRetryTimeouts **doNotRetryTimeouts? : boolean If set to `true`, the `set-record` API call will not be retried if it times out. ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L750)optionaltimeoutSecs **timeoutSecs? : number Specifies a custom timeout for the `set-record` API call, in seconds. 
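`RecordOptions` is the third argument of `KeyValueStore.setValue()`. A minimal sketch; the key name and payload are illustrative:

```
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

// Store a raw HTML snapshot under an explicit content type, give the
// underlying `set-record` call a longer timeout, and don't retry timeouts.
await store.setValue('page-snapshot.html', '<html>...</html>', {
    contentType: 'text/html',
    timeoutSecs: 60,
    doNotRetryTimeouts: true,
});
```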
--- # RecoverableStateOptions \ Options for configuring the RecoverableState ### Hierarchy * [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) * *RecoverableStateOptions* ## Index[**](#Index) ### Properties * [**config](#config) * [**defaultState](#defaultState) * [**deserialize](#deserialize) * [**logger](#logger) * [**persistenceEnabled](#persistenceEnabled) * [**persistStateKey](#persistStateKey) * [**persistStateKvsId](#persistStateKvsId) * [**persistStateKvsName](#persistStateKvsName) * [**serialize](#serialize) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L49)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Configuration instance to use ### [**](#defaultState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L39)defaultState **defaultState: TStateModel The default state used if no persisted state is found. A deep copy is made each time the state is used. ### [**](#deserialize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L62)optionaldeserialize **deserialize? : (serializedState) => TStateModel Optional function to transform a JSON-serialized object back to the state model. If not provided, JSON.parse is used. It is advisable to perform validation in this function and to throw an exception if it fails. *** #### Type declaration * * **(serializedState): TStateModel - #### Parameters * ##### serializedState: string #### Returns TStateModel ### [**](#logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L44)optionallogger **logger? : [Log](https://crawlee.dev/js/api/core/class/Log.md) A logger instance for logging operations related to state persistence ### [**](#persistenceEnabled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L15)optionalinheritedpersistenceEnabled **persistenceEnabled? : boolean Inherited from RecoverableStatePersistenceOptions.persistenceEnabled Flag to enable or disable state persistence ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L10)inheritedpersistStateKey **persistStateKey: string Inherited from RecoverableStatePersistenceOptions.persistStateKey The key under which the state is stored in the KeyValueStore ### [**](#persistStateKvsId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L27)optionalinheritedpersistStateKvsId **persistStateKvsId? : string Inherited from RecoverableStatePersistenceOptions.persistStateKvsId The identifier of the KeyValueStore to use for persistence. If neither a name nor an id are supplied, the default store will be used. ### [**](#persistStateKvsName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L21)optionalinheritedpersistStateKvsName **persistStateKvsName? : string Inherited from RecoverableStatePersistenceOptions.persistStateKvsName The name of the KeyValueStore to use for persistence. If neither a name nor an id are supplied, the default store will be used. ### [**](#serialize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L55)optionalserialize **serialize? : (state) => string Optional function to transform the state to a JSON string before persistence. 
If not provided, JSON.stringify will be used. *** #### Type declaration * * **(state): string - #### Parameters * ##### state: TStateModel #### Returns string --- # RecoverableStatePersistenceOptions ### Hierarchy * *RecoverableStatePersistenceOptions* * [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ## Index[**](#Index) ### Properties * [**persistenceEnabled](#persistenceEnabled) * [**persistStateKey](#persistStateKey) * [**persistStateKvsId](#persistStateKvsId) * [**persistStateKvsName](#persistStateKvsName) ## Properties[**](#Properties) ### [**](#persistenceEnabled)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L15)optionalpersistenceEnabled **persistenceEnabled? : boolean Flag to enable or disable state persistence ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L10)persistStateKey **persistStateKey: string The key under which the state is stored in the KeyValueStore ### [**](#persistStateKvsId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L27)optionalpersistStateKvsId **persistStateKvsId? : string The identifier of the KeyValueStore to use for persistence. If neither a name nor an id are supplied, the default store will be used. ### [**](#persistStateKvsName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L21)optionalpersistStateKvsName **persistStateKvsName? : string The name of the KeyValueStore to use for persistence. If neither a name nor an id are supplied, the default store will be used. --- # RequestListOptions ## Index[**](#Index) ### Properties * [**keepDuplicateUrls](#keepDuplicateUrls) * [**persistRequestsKey](#persistRequestsKey) * [**persistStateKey](#persistStateKey) * [**proxyConfiguration](#proxyConfiguration) * [**sources](#sources) * [**sourcesFunction](#sourcesFunction) * [**state](#state) ## Properties[**](#Properties) ### [**](#keepDuplicateUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L233)optionalkeepDuplicateUrls **keepDuplicateUrls? : boolean = false By default, `RequestList` will deduplicate the provided URLs. Default deduplication is based on the `uniqueKey` property of passed source [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If the property is not present, it is generated by normalizing the URL. If present, it is kept intact. In any case, only one request per `uniqueKey` is added to the `RequestList` resulting in removal of duplicate URLs / unique keys. Setting `keepDuplicateUrls` to `true` will append an additional identifier to the `uniqueKey` of each request that does not already include a `uniqueKey`. Therefore, duplicate URLs will be kept in the list. It does not protect the user from having duplicates in user set `uniqueKey`s however. It is the user's responsibility to ensure uniqueness of their unique keys if they wish to keep more than just a single copy in the `RequestList`. ### [**](#persistRequestsKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L196)optionalpersistRequestsKey **persistRequestsKey? : string Identifies the key in the default key-value store under which the `RequestList` persists its Requests during the RequestList.initialize call. 
This is necessary if `persistStateKey` is set and the source URLs might potentially change, to ensure consistency of the source URLs and state object. However, it comes with some storage and performance overheads. If `persistRequestsKey` is not set, RequestList.initialize will always fetch the sources from their origin, check that they are consistent with the restored state (if any) and throw an error if they are not. ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L183)optionalpersistStateKey **persistStateKey? : string Identifies the key in the default key-value store under which `RequestList` periodically stores its state (i.e. which URLs were crawled and which not). If the crawler is restarted, `RequestList` will read the state and continue where it left off. If `persistStateKey` is not set, `RequestList` will always start from the beginning, and all the source URLs will be crawled again. ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L172)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Used to pass the proxy configuration for the `requestsFromUrl` objects. Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. ### [**](#sources)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L125)optionalsources **sources? : RequestListSource\[] An array of sources of URLs for the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md). It can be either an array of strings, plain objects that define at least the `url` property, or an array of [Request](https://crawlee.dev/js/api/core/class/Request.md) instances. **IMPORTANT:** The `sources` array will be consumed (left empty) after `RequestList` initializes. This is a measure to prevent memory leaks in situations when millions of sources are added. Additionally, the `requestsFromUrl` property may be used instead of `url`, which will instruct `RequestList` to download the source URLs from a given remote location. The URLs will be parsed from the received response. ``` [ // A single URL 'http://example.com/a/b', // Modify Request options { method: 'PUT', url: 'https://example.com/put', payload: JSON.stringify({ foo: 'bar' }) }, // Batch import of URLs from a file hosted on the web, // where the URLs should be requested using the HTTP POST request { method: 'POST', requestsFromUrl: 'http://example.com/urls.txt' }, // Batch import from remote file, using a specific regular expression to extract the URLs. { requestsFromUrl: 'http://example.com/urls.txt', regex: /https:\/\/example\.com\/.+/ }, // Get list of URLs from a Google Sheets document. Just add "/gviz/tq?tqx=out:csv" to the Google Sheet URL. // For details, see https://help.apify.com/en/articles/2906022-scraping-a-list-of-urls-from-a-google-sheets-document { requestsFromUrl: 'https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w/gviz/tq?tqx=out:csv' } ] ``` ### [**](#sourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L165)optionalsourcesFunction **sourcesFunction?
: [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) A function that will be called to get the sources for the `RequestList`, but only if `RequestList` was not able to fetch their persisted version (see [RequestListOptions.persistRequestsKey](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#persistRequestsKey)). It must return an `Array` of [Request](https://crawlee.dev/js/api/core/class/Request.md) or [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md). This is very useful in a scenario when getting the sources is a resource-intensive or time-consuming task, such as fetching URLs from multiple sitemaps or parsing URLs from large datasets. Using the `sourcesFunction` in combination with `persistStateKey` and `persistRequestsKey` will allow you to fetch and parse those URLs only once, saving valuable time when your crawler migrates or restarts. If both [RequestListOptions.sources](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#sources) and [RequestListOptions.sourcesFunction](https://crawlee.dev/js/api/core/interface/RequestListOptions.md#sourcesFunction) are provided, the sources returned by the function will be added after the `sources`. **Example:** ``` // Let's say we want to scrape URLs extracted from sitemaps. const sourcesFunction = async () => { // With super large sitemaps, this operation could take very long // and big websites typically have multiple sitemaps. const sitemaps = await downloadHugeSitemaps(); return parseUrlsFromSitemaps(sitemaps); }; // Sitemaps can change in real-time, so it's important to persist // the URLs we collected. Otherwise we might lose our scraping // state in case of a crawler migration / failure / time-out. const requestList = await RequestList.open(null, [], { // The sourcesFunction is called now and the Requests are persisted. // If something goes wrong and we need to start again, RequestList // will load the persisted Requests from storage and will NOT // call the sourcesFunction again, saving time and resources. sourcesFunction, persistStateKey: 'state-key', persistRequestsKey: 'requests-key', }); ``` ### [**](#state)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L216)optionalstate **state? : [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) The state object that the `RequestList` will be initialized from. It is in the form returned by `RequestList.getState()`, such as follows: ``` { nextIndex: 5, nextUniqueKey: 'unique-key-5', inProgress: { 'unique-key-1': true, 'unique-key-4': true, }, } ``` Note that the preferred (and simpler) way to persist the state of crawling of the `RequestList` is to use the `stateKeyPrefix` parameter instead. --- # RequestListState Represents the state of a [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md). It can be used to resume a [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) which has been previously processed.
You can obtain the state by calling [RequestList.getState](https://crawlee.dev/js/api/core/class/RequestList.md#getState) and receive an object with the following structure: ``` { nextIndex: 5, nextUniqueKey: 'unique-key-5', inProgress: { 'unique-key-1': true, 'unique-key-4': true }, } ``` ## Index[**](#Index) ### Properties * [**inProgress](#inProgress) * [**nextIndex](#nextIndex) * [**nextUniqueKey](#nextUniqueKey) ## Properties[**](#Properties) ### [**](#inProgress)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L996)inProgress **inProgress: string\[] Array of request keys representing those that are being processed at the moment. ### [**](#nextIndex)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L990)nextIndex **nextIndex: number Position of the next request to be processed. ### [**](#nextUniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L993)nextUniqueKey **nextUniqueKey: null | string Key of the next request to be processed. --- # RequestOptions \ Specifies required and optional fields for constructing a [Request](https://crawlee.dev/js/api/core/class/Request.md). ## Index[**](#Index) ### Properties * [**crawlDepth](#crawlDepth) * [**headers](#headers) * [**keepUrlFragment](#keepUrlFragment) * [**label](#label) * [**maxRetries](#maxRetries) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**skipNavigation](#skipNavigation) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**useExtendedUniqueKey](#useExtendedUniqueKey) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#crawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L532)optionalcrawlDepth **crawlDepth? : number = 0 Depth of the request in the current crawl tree. Note that this is dependent on the crawler setup and might produce unexpected results when used with multiple crawlers. ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L484)optionalheaders **headers? : Record\ HTTP headers in the following format: ``` { Accept: 'text/html', 'Content-Type': 'application/json' } ``` ### [**](#keepUrlFragment)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L504)optionalkeepUrlFragment **keepUrlFragment? : boolean = false If `false` then the hash part of a URL is removed when computing the `uniqueKey` property. For example, this causes the `http://www.example.com#foo` and `http://www.example.com#bar` URLs to have the same `uniqueKey` of `http://www.example.com` and thus the URLs are considered equal. Note that this option only has an effect if `uniqueKey` is not set. ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L495)optionallabel **label? : string Shortcut for setting `userData: { label: '...' }`. ### [**](#maxRetries)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L544)optionalmaxRetries **maxRetries? : number Maximum number of retries for this request. Allows overriding the global `maxRequestRetries` option of `BasicCrawler`. ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L470)optionalmethod **method?
: 'get' | 'options' | 'post' | 'put' | 'patch' | 'head' | 'delete' | 'trace' | 'connect' | AllowedHttpMethods = 'GET' ### [**](#noRetry)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L518)optionalnoRetry **noRetry? : boolean = false The `true` value indicates that the request will not be automatically retried on error. ### [**](#payload)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L473)optionalpayload **payload? : string HTTP request payload, e.g. for POST requests. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L525)optionalskipNavigation **skipNavigation? : boolean = false If set to `true` then the crawler processing this request evaluates the `requestHandler` immediately without prior browser navigation. ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L467)optionaluniqueKey **uniqueKey? : string A unique key identifying the request. Two requests with the same `uniqueKey` are considered as pointing to the same URL. If `uniqueKey` is not provided, then it is automatically generated by normalizing the URL. For example, the URL of `HTTP://www.EXAMPLE.com/something/` will produce the `uniqueKey` of `http://www.example.com/something`. The `keepUrlFragment` option determines whether the URL hash fragment is included in the `uniqueKey` or not. The `useExtendedUniqueKey` option determines whether method and payload are included in the `uniqueKey`, producing a `uniqueKey` in the following format: `METHOD(payloadHash):normalizedUrl`. This is useful when requests point to the same URL, but with different methods and payloads. For example: form submits. Pass an arbitrary non-empty text value to the `uniqueKey` property to override the default behavior and specify which URLs shall be considered equal. ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L448)url **url: string URL of the web page to crawl. It must be a non-empty string. ### [**](#useExtendedUniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L512)optionaluseExtendedUniqueKey **useExtendedUniqueKey? : boolean = false If `true` then the `uniqueKey` is computed not only from the URL, but also from the method and payload properties. This is useful when making requests to the same URL that are differentiated by method or payload, such as form submit navigations in browsers. ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L490)optionaluserData **userData? : UserData Custom user data assigned to the request. Use this to save any request-related data to the request's scope, keeping them accessible on retries, failures etc.
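**Usage sketch:** a minimal illustration of how the options above affect `uniqueKey` computation when constructing [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The URLs, label and payload are made up for the example.

```
import { Request } from 'crawlee';

// Same normalized URL => same uniqueKey, so the queue treats them as duplicates.
const a = new Request({ url: 'HTTP://www.EXAMPLE.com/something/' });
const b = new Request({ url: 'http://www.example.com/something' });
// a.uniqueKey === b.uniqueKey === 'http://www.example.com/something'

// Distinguish a form submission from a plain GET to the same URL
// and route it to a dedicated handler via `label`.
const formSubmit = new Request({
    url: 'http://www.example.com/something',
    method: 'POST',
    payload: JSON.stringify({ query: 'shoes' }),
    useExtendedUniqueKey: true, // uniqueKey becomes 'POST(<payloadHash>):http://www.example.com/something'
    label: 'FORM_RESULT',       // shortcut for userData: { label: 'FORM_RESULT' }
    userData: { note: 'kept across retries and failures' },
});
```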
--- # RequestProviderOptions ### Hierarchy * *RequestProviderOptions* * [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**proxyConfiguration](#proxyConfiguration) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L910)client **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L908)id **id: string ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L909)optionalname **name? : string ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L917)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Used to pass the proxy configuration for the `requestsFromUrl` objects. Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. --- # RequestQueueOperationOptions ### Hierarchy * *RequestQueueOperationOptions* * [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) * [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalforefront **forefront? : boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. --- # RequestQueueOptions * **@deprecated** Use [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) instead. ### Hierarchy * [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) * *RequestQueueOptions* ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**proxyConfiguration](#proxyConfiguration) ## Properties[**](#Properties) ### [**](#client)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L910)inheritedclient **client: [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) Inherited from RequestProviderOptions.client ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L908)inheritedid **id: string Inherited from RequestProviderOptions.id ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L909)optionalinheritedname **name? 
: string Inherited from RequestProviderOptions.name ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L917)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from RequestProviderOptions.proxyConfiguration Used to pass the proxy configuration for the `requestsFromUrl` objects. Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. --- # RequestTransform Takes an Apify [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) object and changes its attributes in a desired way. This user-function is used by `enqueueLinks` to modify requests before enqueuing them. ### Callable * ****RequestTransform**(original): undefined | null | false | [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)\ *** - *** #### Parameters * ##### original: [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)\ Request options to be modified. #### Returns undefined | null | false | [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md)\ The modified request options to enqueue. --- # ResponseLike ## Index[**](#Index) ### Properties * [**headers](#headers) * [**url](#url) ## Properties[**](#Properties) ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L9)optionalheaders **headers? : Record\ | () => Record\ ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L8)optionalurl **url? : string | () => string --- # ResponseTypes Maps permitted values of the `responseType` option on [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) to the types that they produce. ## Index[**](#Index) ### Properties * [**buffer](#buffer) * [**json](#json) * [**text](#text) ## Properties[**](#Properties) ### [**](#buffer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L42)buffer **buffer: Buffer\ ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L40)json **json: unknown ### [**](#text)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L41)text **text: string --- # RestrictedCrawlingContext \ ### Hierarchy * Record\ * *RestrictedCrawlingContext* * [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) * [AdaptivePlaywrightCrawlerContext](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md) ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**enqueueLinks](#enqueueLinks) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**session](#session) * [**useState](#useState) ### Methods * [**pushData](#pushData) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)addRequests **addRequests: (requestsLike, options) => Promise\ Add requests directly to the request queue.
*** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L80)enqueueLinks **enqueueLinks: (options) => Promise\ This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Type declaration * * **(options): Promise\ - #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise\ ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L101)getKeyValueStore **getKeyValueStore: (idOrName) => Promise\> Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise\> - #### Parameters * ##### optionalidOrName: string #### Returns Promise\> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)id **id: string ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)log **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)request **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)useState **useState: \(defaultValue) => Promise\ Returns the state - a piece of mutable persistent data shared across all the request handler runs. 
*** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)pushData * ****pushData**(data, datasetIdOrName): Promise\ - This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ --- # RouterHandler \ Simple router that works based on request labels. This instance can then serve as a `requestHandler` of your crawler. ``` import { Router, CheerioCrawler, CheerioCrawlingContext } from 'crawlee'; const router = Router.create(); // we can also use factory methods for specific crawling contexts, the above equals to: // import { createCheerioRouter } from 'crawlee'; // const router = createCheerioRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.run(); ``` Alternatively we can use the default router instance from the crawler object: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler(); crawler.router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); crawler.router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); await crawler.run(); ``` For convenience, we can also define the routes right when creating the router: ``` import { CheerioCrawler, createCheerioRouter } from 'crawlee'; const crawler = new CheerioCrawler({ requestHandler: createCheerioRouter({ 'label-a': async (ctx) => { ... }, 'label-b': async (ctx) => { ... }, }), }); await crawler.run(); ``` Middlewares are also supported via the `router.use` method. There can be multiple middlewares for a single router; they will be executed sequentially in the same order as they were registered. ``` crawler.router.use(async (ctx) => { ctx.log.info('...'); }); ``` ### Hierarchy * [Router](https://crawlee.dev/js/api/core/class/Router.md)\ * *RouterHandler* ### Callable * ****RouterHandler**(ctx): Awaitable\ *** * #### Parameters * ##### ctx: Context #### Returns Awaitable\ ## Index[**](#Index) ### Methods * [**addDefaultHandler](#addDefaultHandler) * [**addHandler](#addHandler) * [**getHandler](#getHandler) * [**use](#use) ## Methods[**](#Methods) ### [**](#addDefaultHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L110)inheritedaddDefaultHandler * ****addDefaultHandler**\(handler): void - Inherited from Router.addDefaultHandler Registers the default route handler. *** #### Parameters * ##### handler: (ctx) => Awaitable\ #### Returns void ### [**](#addHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L99)inheritedaddHandler * ****addHandler**\(label, handler): void - Inherited from Router.addHandler Registers a new route handler for the given label.
*** #### Parameters * ##### label: string | symbol * ##### handler: (ctx) => Awaitable\ #### Returns void ### [**](#getHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L128)inheritedgetHandler * ****getHandler**(label): (ctx) => Awaitable\ - Inherited from Router.getHandler Returns the route handler for the given label. If no label is provided, the default request handler will be returned. *** #### Parameters * ##### optionallabel: string | symbol #### Returns (ctx) => Awaitable\ * * **(ctx): Awaitable\ - #### Parameters * ##### ctx: Context #### Returns Awaitable\ ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L121)inheriteduse * ****use**(middleware): void - Inherited from Router.use Registers a middleware that will be fired before the matching route handler. Multiple middlewares can be registered; they will be fired in the same order. *** #### Parameters * ##### middleware: (ctx) => Awaitable\ #### Returns void --- # SessionOptions ## Index[**](#Index) ### Properties * [**cookieJar](#cookieJar) * [**createdAt](#createdAt) * [**errorScore](#errorScore) * [**errorScoreDecrement](#errorScoreDecrement) * [**expiresAt](#expiresAt) * [**id](#id) * [**log](#log) * [**maxAgeSecs](#maxAgeSecs) * [**maxErrorScore](#maxErrorScore) * [**maxUsageCount](#maxUsageCount) * [**sessionPool](#sessionPool) * [**usageCount](#usageCount) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L91)optionalcookieJar **cookieJar? : CookieJar ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L68)optionalcreatedAt **createdAt? : Date Date of creation. ### [**](#errorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L90)optionalerrorScore **errorScore? : number ### [**](#errorScoreDecrement)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L65)optionalerrorScoreDecrement **errorScoreDecrement? : number = 0.5 It is used for healing the session. For example: if your session is marked bad two times, but it is successful on the third attempt, its errorScore is decremented by this number. ### [**](#expiresAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L71)optionalexpiresAt **expiresAt? : Date Date of expiration. ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L39)optionalid **id? : string ID of the session used for generating fingerprints. It is also used as the proxy session name. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L89)optionallog **log? : [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#maxAgeSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L45)optionalmaxAgeSecs **maxAgeSecs? : number = 3000 Number of seconds after which the session is considered expired. ### [**](#maxErrorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L57)optionalmaxErrorScore **maxErrorScore? : number = 3 Maximum error score before the session is marked as blocked. If the `errorScore` reaches `maxErrorScore`, the session is marked as blocked and thrown away. It starts at 0. Calling the `markBad` function increases the `errorScore` by 1.
Calling the `markGood` function decreases the `errorScore` by `errorScoreDecrement`. ### [**](#maxUsageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L84)optionalmaxUsageCount **maxUsageCount? : number = 50 A session should be used only a limited number of times. This number indicates how many times the session is going to be used before it is thrown away. ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L87)optionalsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) SessionPool instance. The session will emit the `sessionRetired` event on this instance. ### [**](#usageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L77)optionalusageCount **usageCount? : number = 0 Indicates how many times the session has been used. ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L48)optionaluserData **userData? : Dictionary Object where custom user data can be stored. For example, custom headers. --- # SessionPoolOptions ## Index[**](#Index) ### Properties * [**blockedStatusCodes](#blockedStatusCodes) * [**createSessionFunction](#createSessionFunction) * [**maxPoolSize](#maxPoolSize) * [**persistenceOptions](#persistenceOptions) * [**persistStateKey](#persistStateKey) * [**persistStateKeyValueStoreId](#persistStateKeyValueStoreId) * [**sessionOptions](#sessionOptions) ## Properties[**](#Properties) ### [**](#blockedStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L61)optionalblockedStatusCodes **blockedStatusCodes? : number\[] = \[401, 403, 429] Specifies which response status codes are considered as blocked. The session used for such a request will be marked as retired. ### [**](#createSessionFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L54)optionalcreateSessionFunction **createSessionFunction? : [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) Custom function that should return a `Session` instance. Any error thrown from this function will terminate the process. The function receives the `SessionPool` instance as a parameter. ### [**](#maxPoolSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L35)optionalmaxPoolSize **maxPoolSize? : number = 1000 Maximum size of the pool. Indicates how many sessions are rotated. ### [**](#persistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L69)optionalpersistenceOptions **persistenceOptions? : [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Control how and when to persist the state of the session pool. ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L47)optionalpersistStateKey **persistStateKey? : string = SESSION\_POOL\_STATE The session pool persists its state under this key in the key-value store. ### [**](#persistStateKeyValueStoreId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L41)optionalpersistStateKeyValueStoreId **persistStateKeyValueStoreId? : string Name or ID of the `KeyValueStore` where the `SessionPool` state is stored.
### [**](#sessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L38)optionalsessionOptions **sessionOptions? : [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) The configuration options for [Session](https://crawlee.dev/js/api/core/class/Session.md) instances. --- # SessionState Persistable [Session](https://crawlee.dev/js/api/core/class/Session.md) state. ## Index[**](#Index) ### Properties * [**cookieJar](#cookieJar) * [**createdAt](#createdAt) * [**errorScore](#errorScore) * [**errorScoreDecrement](#errorScoreDecrement) * [**expiresAt](#expiresAt) * [**id](#id) * [**maxErrorScore](#maxErrorScore) * [**maxUsageCount](#maxUsageCount) * [**usageCount](#usageCount) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#cookieJar)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L26)cookieJar **cookieJar: SerializedCookieJar ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L34)createdAt **createdAt: string ### [**](#errorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L28)errorScore **errorScore: number ### [**](#errorScoreDecrement)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L30)errorScoreDecrement **errorScoreDecrement: number ### [**](#expiresAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L33)expiresAt **expiresAt: string ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L25)id **id: string ### [**](#maxErrorScore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L29)maxErrorScore **maxErrorScore: number ### [**](#maxUsageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L32)maxUsageCount **maxUsageCount: number ### [**](#usageCount)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L31)usageCount **usageCount: number ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L27)userData **userData: object --- # SitemapRequestListOptions ### Hierarchy * UrlConstraints * *SitemapRequestListOptions* ## Index[**](#Index) ### Properties * [**config](#config) * [**exclude](#exclude) * [**globs](#globs) * [**maxBufferSize](#maxBufferSize) * [**parseSitemapOptions](#parseSitemapOptions) * [**persistenceOptions](#persistenceOptions) * [**persistStateKey](#persistStateKey) * [**proxyUrl](#proxyUrl) * [**regexps](#regexps) * [**signal](#signal) * [**sitemapUrls](#sitemapUrls) * [**timeoutMillis](#timeoutMillis) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L105)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Crawlee configuration ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L46)optionalinheritedexclude **exclude? : readonly (RegExp | [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput))\[] Inherited from UrlConstraints.exclude An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be included. 
The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L35)optionalinheritedglobs **globs? : readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] Inherited from UrlConstraints.globs An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. The plain objects must include at least the `glob` property, which holds the glob pattern string. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the `SitemapRequestList` includes all the URLs from the sitemap. ### [**](#maxBufferSize)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L97)optionalmaxBufferSize **maxBufferSize? : number = 200 Maximum number of buffered URLs for the sitemap loading stream. If the buffer is full, the stream will pause until the buffer is drained. ### [**](#parseSitemapOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L101)optionalparseSitemapOptions **parseSitemapOptions? : Omit<[ParseSitemapOptions](https://crawlee.dev/js/api/utils/interface/ParseSitemapOptions.md), emitNestedSitemaps | maxDepth> Advanced options for the underlying `parseSitemap` call. ### [**](#persistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L76)optionalpersistenceOptions **persistenceOptions? : { enable? : boolean } Persistence-related options to control how and when crawler's data gets persisted. *** #### Type declaration * ##### optionalenable?: boolean Use this flag to disable or enable periodic persistence to key value store. ### [**](#persistStateKey)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L72)optionalpersistStateKey **persistStateKey? : string Key for persisting the state of the request list in the `KeyValueStore`. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L68)optionalproxyUrl **proxyUrl? : string Proxy URL to be used for sitemap loading. ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L57)optionalinheritedregexps **regexps? : readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] Inherited from UrlConstraints.regexps An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the `SitemapRequestList` includes all the URLs from the sitemap. ### [**](#signal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L86)optionalsignal **signal? : AbortSignal Abort signal to be used for sitemap loading. ### [**](#sitemapUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L64)sitemapUrls **sitemapUrls: string\[] List of sitemap URLs to parse. 
### [**](#timeoutMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L90)optionaltimeoutMillis **timeoutMillis? : number Timeout for sitemap loading in milliseconds. If both `signal` and `timeoutMillis` are provided, either of them can abort the loading. --- # SnapshotResult ## Index[**](#Index) ### Properties * [**htmlFileName](#htmlFileName) * [**screenshotFileName](#screenshotFileName) ## Properties[**](#Properties) ### [**](#htmlFileName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L18)optionalhtmlFileName **htmlFileName? : string ### [**](#screenshotFileName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L17)optionalscreenshotFileName **screenshotFileName? : string --- # SnapshotterOptions ## Index[**](#Index) ### Properties * [**clientSnapshotIntervalSecs](#clientSnapshotIntervalSecs) * [**eventLoopSnapshotIntervalSecs](#eventLoopSnapshotIntervalSecs) * [**maxBlockedMillis](#maxBlockedMillis) * [**maxClientErrors](#maxClientErrors) * [**maxUsedMemoryRatio](#maxUsedMemoryRatio) * [**snapshotHistorySecs](#snapshotHistorySecs) ## Properties[**](#Properties) ### [**](#clientSnapshotIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L31)optionalclientSnapshotIntervalSecs **clientSnapshotIntervalSecs? : number = 1 Defines the interval of checking the current state of the remote API client. ### [**](#eventLoopSnapshotIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L24)optionaleventLoopSnapshotIntervalSecs **eventLoopSnapshotIntervalSecs? : number = 0.5 Defines the interval of measuring the event loop response time. ### [**](#maxBlockedMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L38)optionalmaxBlockedMillis **maxBlockedMillis? : number = 50 Maximum allowed delay of the event loop in milliseconds. Exceeding this limit overloads the event loop. ### [**](#maxClientErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L52)optionalmaxClientErrors **maxClientErrors? : number = 1 Defines the maximum number of new rate limit errors within the given interval. ### [**](#maxUsedMemoryRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L45)optionalmaxUsedMemoryRatio **maxUsedMemoryRatio? : number = 0.9 Defines the maximum ratio of total memory that can be used. Exceeding this limit overloads the memory. ### [**](#snapshotHistorySecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L59)optionalsnapshotHistorySecs **snapshotHistorySecs? : number = 60 Sets the interval in seconds for which a history of resource snapshots will be kept. Increasing this to very high numbers will affect performance. 
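**Usage sketch:** `SnapshotterOptions` are usually not passed to a Snapshotter directly; a crawler forwards them through `autoscaledPoolOptions.snapshotterOptions`. The values below are illustrative only and simply make autoscaling react earlier to memory pressure.

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    autoscaledPoolOptions: {
        // The AutoscaledPool creates the Snapshotter internally;
        // these options only tune how system load is sampled.
        snapshotterOptions: {
            eventLoopSnapshotIntervalSecs: 0.5,
            maxBlockedMillis: 100,   // tolerate a slightly slower event loop
            maxUsedMemoryRatio: 0.8, // scale down earlier under memory pressure
            snapshotHistorySecs: 30,
        },
    },
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```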
--- # StatisticPersistedState Format of the persisted stats ### Hierarchy * Omit<[StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md), statsPersistedAt> * *StatisticPersistedState* ## Index[**](#Index) ### Properties * [**crawlerFinishedAt](#crawlerFinishedAt) * [**crawlerLastStartTimestamp](#crawlerLastStartTimestamp) * [**crawlerRuntimeMillis](#crawlerRuntimeMillis) * [**crawlerStartedAt](#crawlerStartedAt) * [**errors](#errors) * [**requestAvgFailedDurationMillis](#requestAvgFailedDurationMillis) * [**requestAvgFinishedDurationMillis](#requestAvgFinishedDurationMillis) * [**requestMaxDurationMillis](#requestMaxDurationMillis) * [**requestMinDurationMillis](#requestMinDurationMillis) * [**requestRetryHistogram](#requestRetryHistogram) * [**requestsFailed](#requestsFailed) * [**requestsFailedPerMinute](#requestsFailedPerMinute) * [**requestsFinished](#requestsFinished) * [**requestsFinishedPerMinute](#requestsFinishedPerMinute) * [**requestsRetries](#requestsRetries) * [**requestsTotal](#requestsTotal) * [**requestsWithStatusCode](#requestsWithStatusCode) * [**requestTotalDurationMillis](#requestTotalDurationMillis) * [**requestTotalFailedDurationMillis](#requestTotalFailedDurationMillis) * [**requestTotalFinishedDurationMillis](#requestTotalFinishedDurationMillis) * [**retryErrors](#retryErrors) * [**statsId](#statsId) * [**statsPersistedAt](#statsPersistedAt) ## Properties[**](#Properties) ### [**](#crawlerFinishedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L507)inheritedcrawlerFinishedAt **crawlerFinishedAt: null | string | Date Inherited from Omit.crawlerFinishedAt ### [**](#crawlerLastStartTimestamp)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L489)crawlerLastStartTimestamp **crawlerLastStartTimestamp: number ### [**](#crawlerRuntimeMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L508)inheritedcrawlerRuntimeMillis **crawlerRuntimeMillis: number Inherited from Omit.crawlerRuntimeMillis ### [**](#crawlerStartedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L506)inheritedcrawlerStartedAt **crawlerStartedAt: null | string | Date Inherited from Omit.crawlerStartedAt ### [**](#errors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L510)inheritederrors **errors: Record\ Inherited from Omit.errors ### [**](#requestAvgFailedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L485)requestAvgFailedDurationMillis **requestAvgFailedDurationMillis: number ### [**](#requestAvgFinishedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L486)requestAvgFinishedDurationMillis **requestAvgFinishedDurationMillis: number ### [**](#requestMaxDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L503)inheritedrequestMaxDurationMillis **requestMaxDurationMillis: number Inherited from Omit.requestMaxDurationMillis ### [**](#requestMinDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L502)inheritedrequestMinDurationMillis **requestMinDurationMillis: number Inherited from Omit.requestMinDurationMillis ### 
[**](#requestRetryHistogram)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L483)requestRetryHistogram **requestRetryHistogram: number\[] ### [**](#requestsFailed)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L498)inheritedrequestsFailed **requestsFailed: number Inherited from Omit.requestsFailed ### [**](#requestsFailedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L500)inheritedrequestsFailedPerMinute **requestsFailedPerMinute: number Inherited from Omit.requestsFailedPerMinute ### [**](#requestsFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L497)inheritedrequestsFinished **requestsFinished: number Inherited from Omit.requestsFinished ### [**](#requestsFinishedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L501)inheritedrequestsFinishedPerMinute **requestsFinishedPerMinute: number Inherited from Omit.requestsFinishedPerMinute ### [**](#requestsRetries)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L499)inheritedrequestsRetries **requestsRetries: number Inherited from Omit.requestsRetries ### [**](#requestsTotal)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L488)requestsTotal **requestsTotal: number ### [**](#requestsWithStatusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L512)inheritedrequestsWithStatusCode **requestsWithStatusCode: Record\ Inherited from Omit.requestsWithStatusCode ### [**](#requestTotalDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L487)requestTotalDurationMillis **requestTotalDurationMillis: number ### [**](#requestTotalFailedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L504)inheritedrequestTotalFailedDurationMillis **requestTotalFailedDurationMillis: number Inherited from Omit.requestTotalFailedDurationMillis ### [**](#requestTotalFinishedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L505)inheritedrequestTotalFinishedDurationMillis **requestTotalFinishedDurationMillis: number Inherited from Omit.requestTotalFinishedDurationMillis ### [**](#retryErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L511)inheritedretryErrors **retryErrors: Record\ Inherited from Omit.retryErrors ### [**](#statsId)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L484)statsId **statsId: number ### [**](#statsPersistedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L490)statsPersistedAt **statsPersistedAt: string --- # StatisticsOptions Configuration for the [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) instance used by the crawler ## Index[**](#Index) ### Properties * [**config](#config) * [**keyValueStore](#keyValueStore) * [**log](#log) * [**logIntervalSecs](#logIntervalSecs) * [**logMessage](#logMessage) * [**persistenceOptions](#persistenceOptions) * [**saveErrorSnapshots](#saveErrorSnapshots) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L465)optionalconfig **config? 
: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Configuration instance to use. ### [**](#keyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L459)optionalkeyValueStore **keyValueStore? : [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) Key value store instance to persist the statistics. If not provided, the default one will be used when capturing starts. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L453)optionallog **log? : [Log](https://crawlee.dev/js/api/core/class/Log.md) = [Log](https://crawlee.dev/js/api/core/class/Log.md) Parent logger instance; the statistics will create a child logger from it. ### [**](#logIntervalSecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L441)optionallogIntervalSecs **logIntervalSecs? : number = 60 Interval in seconds to log the current statistics. ### [**](#logMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L447)optionallogMessage **logMessage? : string = 'Statistics' Message to log with the current statistics. ### [**](#persistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L470)optionalpersistenceOptions **persistenceOptions? : [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) Control how and when to persist the statistics. ### [**](#saveErrorSnapshots)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L476)optionalsaveErrorSnapshots **saveErrorSnapshots? : boolean = false Save an HTML snapshot (and a screenshot, if possible) when an error occurs. 
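Crawlers built on top of `BasicCrawler` manage their own [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) instance, so these options are usually not constructed by hand. The sketch below assumes the crawler forwards a `statisticsOptions` constructor option to that instance (present in recent Crawlee versions; check yours before relying on it), and the concrete values are illustrative only:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Assumption: `statisticsOptions` is forwarded to the crawler's internal Statistics instance.
    statisticsOptions: {
        logIntervalSecs: 300, // log the statistics every 5 minutes instead of the default 60 s
        logMessage: 'Crawl statistics', // custom message to log with the statistics
        saveErrorSnapshots: true, // persist an HTML snapshot (and a screenshot if possible) on errors
    },
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```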
--- # StatisticState Contains the statistics state ## Index[**](#Index) ### Properties * [**crawlerFinishedAt](#crawlerFinishedAt) * [**crawlerRuntimeMillis](#crawlerRuntimeMillis) * [**crawlerStartedAt](#crawlerStartedAt) * [**errors](#errors) * [**requestMaxDurationMillis](#requestMaxDurationMillis) * [**requestMinDurationMillis](#requestMinDurationMillis) * [**requestsFailed](#requestsFailed) * [**requestsFailedPerMinute](#requestsFailedPerMinute) * [**requestsFinished](#requestsFinished) * [**requestsFinishedPerMinute](#requestsFinishedPerMinute) * [**requestsRetries](#requestsRetries) * [**requestsWithStatusCode](#requestsWithStatusCode) * [**requestTotalFailedDurationMillis](#requestTotalFailedDurationMillis) * [**requestTotalFinishedDurationMillis](#requestTotalFinishedDurationMillis) * [**retryErrors](#retryErrors) * [**statsPersistedAt](#statsPersistedAt) ## Properties[**](#Properties) ### [**](#crawlerFinishedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L507)crawlerFinishedAt **crawlerFinishedAt: null | string | Date ### [**](#crawlerRuntimeMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L508)crawlerRuntimeMillis **crawlerRuntimeMillis: number ### [**](#crawlerStartedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L506)crawlerStartedAt **crawlerStartedAt: null | string | Date ### [**](#errors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L510)errors **errors: Record\ ### [**](#requestMaxDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L503)requestMaxDurationMillis **requestMaxDurationMillis: number ### [**](#requestMinDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L502)requestMinDurationMillis **requestMinDurationMillis: number ### [**](#requestsFailed)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L498)requestsFailed **requestsFailed: number ### [**](#requestsFailedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L500)requestsFailedPerMinute **requestsFailedPerMinute: number ### [**](#requestsFinished)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L497)requestsFinished **requestsFinished: number ### [**](#requestsFinishedPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L501)requestsFinishedPerMinute **requestsFinishedPerMinute: number ### [**](#requestsRetries)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L499)requestsRetries **requestsRetries: number ### [**](#requestsWithStatusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L512)requestsWithStatusCode **requestsWithStatusCode: Record\ ### [**](#requestTotalFailedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L504)requestTotalFailedDurationMillis **requestTotalFailedDurationMillis: number ### [**](#requestTotalFinishedDurationMillis)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L505)requestTotalFinishedDurationMillis **requestTotalFinishedDurationMillis: number ### 
[**](#retryErrors)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L511)retryErrors **retryErrors: Record\ ### [**](#statsPersistedAt)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L509)statsPersistedAt **statsPersistedAt: null | string | Date --- # StorageClient Represents a storage capable of working with datasets, KV stores and request queues. ### Implemented by * [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) ## Index[**](#Index) ### Properties * [**stats](#stats) ### Methods * [**dataset](#dataset) * [**datasets](#datasets) * [**keyValueStore](#keyValueStore) * [**keyValueStores](#keyValueStores) * [**purge](#purge) * [**requestQueue](#requestQueue) * [**requestQueues](#requestQueues) * [**setStatusMessage](#setStatusMessage) * [**teardown](#teardown) ## Properties[**](#Properties) ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L333)optionalstats **stats? : { rateLimitErrors: number\[] } #### Type declaration * ##### rateLimitErrors: number\[] ## Methods[**](#Methods) ### [**](#dataset)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L325)dataset * ****dataset**(id): [DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md)\ - #### Parameters * ##### id: string #### Returns [DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md)\ ### [**](#datasets)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L324)datasets * ****datasets**(): [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) - #### Returns [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) ### [**](#keyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L327)keyValueStore * ****keyValueStore**(id): [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) - #### Parameters * ##### id: string #### Returns [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) ### [**](#keyValueStores)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L326)keyValueStores * ****keyValueStores**(): [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) - #### Returns [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) ### [**](#purge)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L330)optionalpurge * ****purge**(): Promise\ - #### Returns Promise\ ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L329)requestQueue * ****requestQueue**(id, options): [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) - #### Parameters * ##### id: string * ##### optionaloptions: [RequestQueueOptions](https://crawlee.dev/js/api/types/interface/RequestQueueOptions.md) #### Returns [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) ### [**](#requestQueues)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L328)requestQueues * ****requestQueues**(): [RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) - #### Returns 
[RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L332)optionalsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - #### Parameters * ##### message: string * ##### optionaloptions: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) #### Returns Promise\ ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L331)optionalteardown * ****teardown**(): Promise\ - #### Returns Promise\ --- # StorageManagerOptions ## Index[**](#Index) ### Properties * [**config](#config) * [**proxyConfiguration](#proxyConfiguration) * [**storageClient](#storageClient) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L160)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) SDK configuration instance, defaults to the static register. ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L172)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Used to pass the proxy configuration for the `requestsFromUrl` objects. Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. ### [**](#storageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L165)optionalstorageClient **storageClient? : [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) Optional storage client that should be used to open storages. --- # StreamingHttpResponse HTTP response data as returned by the [BaseHttpClient.stream](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md#stream) method. ### Hierarchy * HttpResponseWithoutBody * *StreamingHttpResponse* ## Index[**](#Index) ### Properties * [**complete](#complete) * [**downloadProgress](#downloadProgress) * [**headers](#headers) * [**ip](#ip) * [**redirectUrls](#redirectUrls) * [**request](#request) * [**statusCode](#statusCode) * [**statusMessage](#statusMessage) * [**stream](#stream) * [**trailers](#trailers) * [**uploadProgress](#uploadProgress) * [**url](#url) ## Properties[**](#Properties) ### [**](#complete)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L141)inheritedcomplete **complete: boolean Inherited from HttpResponseWithoutBody.complete ### [**](#downloadProgress)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L164)readonlydownloadProgress **downloadProgress: Progress ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L138)inheritedheaders **headers: SimpleHeaders Inherited from HttpResponseWithoutBody.headers ### [**](#ip)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L134)optionalinheritedip **ip? 
: string Inherited from HttpResponseWithoutBody.ip ### [**](#redirectUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L131)inheritedredirectUrls **redirectUrls: URL\[] Inherited from HttpResponseWithoutBody.redirectUrls ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L146)inheritedrequest **request: [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md)\ [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md)> Inherited from HttpResponseWithoutBody.request ### [**](#statusCode)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L135)inheritedstatusCode **statusCode: number Inherited from HttpResponseWithoutBody.statusCode ### [**](#statusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L136)optionalinheritedstatusMessage **statusMessage? : string Inherited from HttpResponseWithoutBody.statusMessage ### [**](#stream)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L163)stream **stream: Readable ### [**](#trailers)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L139)inheritedtrailers **trailers: SimpleHeaders Inherited from HttpResponseWithoutBody.trailers ### [**](#uploadProgress)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L165)readonlyuploadProgress **uploadProgress: Progress ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L132)inheritedurl **url: string Inherited from HttpResponseWithoutBody.url --- # SystemInfo Represents the current status of the system. ## Index[**](#Index) ### Properties * [**clientInfo](#clientInfo) * [**cpuInfo](#cpuInfo) * [**eventLoopInfo](#eventLoopInfo) * [**isSystemIdle](#isSystemIdle) * [**memCurrentBytes](#memCurrentBytes) * [**memInfo](#memInfo) ## Properties[**](#Properties) ### [**](#clientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L16)clientInfo **clientInfo: [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#cpuInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L15)cpuInfo **cpuInfo: [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#eventLoopInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L14)eventLoopInfo **eventLoopInfo: [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#isSystemIdle)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L12)isSystemIdle **isSystemIdle: boolean If false, system is being overloaded. ### [**](#memCurrentBytes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L17)optionalmemCurrentBytes **memCurrentBytes? 
: number ### [**](#memInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L13)memInfo **memInfo: [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) --- # SystemStatusOptions ## Index[**](#Index) ### Properties * [**currentHistorySecs](#currentHistorySecs) * [**maxClientOverloadedRatio](#maxClientOverloadedRatio) * [**maxCpuOverloadedRatio](#maxCpuOverloadedRatio) * [**maxEventLoopOverloadedRatio](#maxEventLoopOverloadedRatio) * [**maxMemoryOverloadedRatio](#maxMemoryOverloadedRatio) * [**snapshotter](#snapshotter) ## Properties[**](#Properties) ### [**](#currentHistorySecs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L40)optionalcurrentHistorySecs **currentHistorySecs? : number = 5 Defines max age of snapshots used in the [SystemStatus.getCurrentStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md#getCurrentStatus) measurement. ### [**](#maxClientOverloadedRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L68)optionalmaxClientOverloadedRatio **maxClientOverloadedRatio? : number = 0.3 Sets the maximum ratio of overloaded snapshots in a Client sample. If the sample exceeds this ratio, the system will be overloaded. ### [**](#maxCpuOverloadedRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L61)optionalmaxCpuOverloadedRatio **maxCpuOverloadedRatio? : number = 0.4 Sets the maximum ratio of overloaded snapshots in a CPU sample. If the sample exceeds this ratio, the system will be overloaded. ### [**](#maxEventLoopOverloadedRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L54)optionalmaxEventLoopOverloadedRatio **maxEventLoopOverloadedRatio? : number = 0.6 Sets the maximum ratio of overloaded snapshots in an event loop sample. If the sample exceeds this ratio, the system will be overloaded. ### [**](#maxMemoryOverloadedRatio)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L47)optionalmaxMemoryOverloadedRatio **maxMemoryOverloadedRatio? : number = 0.2 Sets the maximum ratio of overloaded snapshots in a memory sample. If the sample exceeds this ratio, the system will be overloaded. ### [**](#snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L73)optionalsnapshotter **snapshotter? : [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) The `Snapshotter` instance to be queried for `SystemStatus`. --- # TieredProxy ## Index[**](#Index) ### Properties * [**proxyTier](#proxyTier) * [**proxyUrl](#proxyUrl) ## Properties[**](#Properties) ### [**](#proxyTier)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L47)optionalproxyTier **proxyTier? : number ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L46)proxyUrl **proxyUrl: null | string --- # UseStateOptions ## Index[**](#Index) ### Properties * [**config](#config) * [**keyValueStoreName](#keyValueStoreName) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L70)optionalconfig **config? 
: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#keyValueStoreName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L75)optionalkeyValueStoreName **keyValueStoreName? : null | string The name of the key-value store you'd like the state to be stored in. If not provided, the default store will be used. --- # @crawlee/http Provides a framework for the parallel crawling of web pages using plain HTTP requests. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. It is very fast and efficient in terms of data bandwidth. However, if the target website requires JavaScript to display its content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because those crawlers load the pages in a full-featured headless browser. **This crawler downloads each URL using a plain HTTP request and doesn't do any HTML parsing.** The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) or [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) constructor options, respectively. If both [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) and [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`:

```
preNavigationHooks: [
    (crawlingContext, gotOptions) => {
        // ...
    },
]
```

By default, `HttpCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [HttpCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. 
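Putting the hooks and content-type options above together, here is a minimal sketch of an `HttpCrawler` that also accepts JSON responses and tweaks the underlying `gotOptions`; the target URL and the timeout value are illustrative only:

```
import { HttpCrawler } from '@crawlee/http';

const crawler = new HttpCrawler({
    // Accept JSON responses in addition to the default text/html and application/xhtml+xml.
    additionalMimeTypes: ['application/json'],
    // Adjust the got-scraping request options before each navigation.
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            gotOptions.timeout = { request: 30000 }; // illustrative 30 s request timeout
        },
    ],
    async requestHandler({ request, body, log }) {
        log.info(`Fetched ${request.url} (${body.length} bytes)`);
    },
});

await crawler.run(['https://example.com/api/items.json']);
```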
All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `HttpCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `HttpCrawler` constructor. ## Example usage[​](#example-usage "Direct link to Example usage") ``` import { HttpCrawler, Dataset } from '@crawlee/http'; const crawler = new HttpCrawler({ requestList, async requestHandler({ request, response, body, contentType }) { // Save the data to dataset. await Dataset.pushData({ url: request.url, html: body, }); }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ## Index[**](#Index) ### Crawlers * [**HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/http-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/http-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/http-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/http-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/http-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/http-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/http-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/http-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/http-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/http-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/http-crawler.md#BLOCKED_STATUS_CODES) * [**checkStorageAccess](https://crawlee.dev/js/api/http-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/http-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/http-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/http-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/http-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/http-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/http-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/http-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/http-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/http-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/http-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/http-crawler.md#CreateContextOptions) * [**CreateSession](https://crawlee.dev/js/api/http-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/http-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/http-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/http-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/http-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetDataOptions) * 
[**DatasetExportOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/http-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/http-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/http-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/http-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/http-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/http-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/http-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/http-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/http-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/http-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/http-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/http-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/http-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/http-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/http-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/http-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/http-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/http-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/http-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/http-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/http-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/http-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/http-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/http-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/http-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/http-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/http-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/http-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/http-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/http-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/http-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/http-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/http-crawler.md#log) * [**Log](https://crawlee.dev/js/api/http-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/http-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/http-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/http-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/http-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/http-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/http-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/http-crawler.md#NonRetryableError) * 
[**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/http-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/http-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/http-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/http-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/http-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/http-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/http-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/http-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/http-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/http-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/http-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/http-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/http-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/http-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/http-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/http-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/http-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/http-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/http-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/http-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/http-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/http-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/http-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/http-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/http-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/http-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/http-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/http-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/http-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/http-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/http-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/http-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/http-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/http-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/http-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/http-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/http-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/http-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/http-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/http-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/http-crawler.md#ResponseTypes) * 
[**RestrictedCrawlingContext](https://crawlee.dev/js/api/http-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/http-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/http-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/http-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/http-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/http-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/http-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/http-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/http-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/http-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/http-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/http-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/http-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/http-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/http-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/http-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/http-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/http-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/http-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/http-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/http-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/http-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/http-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/http-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/http-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/http-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/http-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/http-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/http-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/http-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/http-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/http-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/http-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/http-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/http-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/http-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/http-crawler.md#withCheckedStorageAccess) * [**FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) * [**FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) * [**HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) * [**HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) * 
[**FileDownloadErrorHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadErrorHandler) * [**FileDownloadHook](https://crawlee.dev/js/api/http-crawler.md#FileDownloadHook) * [**FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions) * [**FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler) * [**HttpErrorHandler](https://crawlee.dev/js/api/http-crawler.md#HttpErrorHandler) * [**HttpHook](https://crawlee.dev/js/api/http-crawler.md#HttpHook) * [**HttpRequestHandler](https://crawlee.dev/js/api/http-crawler.md#HttpRequestHandler) * [**StreamHandlerContext](https://crawlee.dev/js/api/http-crawler.md#StreamHandlerContext) * [**ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) * [**createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) * [**createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) * [**MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### 
[**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### 
[**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### 
[**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports 
[HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### 
[**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports 
[PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports 
[RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports 
[RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### 
[**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### 
[**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#FileDownloadErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L20)FileDownloadErrorHandler **FileDownloadErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#FileDownloadHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L47)FileDownloadHook **FileDownloadHook\: InternalHttpHook<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#FileDownloadOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L34)FileDownloadOptions **FileDownloadOptions\: (Omit<[HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\>, requestHandler> & { requestHandler? : never; streamHandler? : StreamHandler }) | (Omit<[HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\>, requestHandler> & { requestHandler: [FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler); streamHandler? 
: never }) #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#FileDownloadRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L57)FileDownloadRequestHandler **FileDownloadRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#HttpErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L75)HttpErrorHandler **HttpErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: JsonValue = any ### [**](#HttpHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L194)HttpHook **HttpHook\: InternalHttpHook<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: JsonValue = any ### [**](#HttpRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L258)HttpRequestHandler **HttpRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: JsonValue = any ### [**](#StreamHandlerContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L25)StreamHandlerContext **StreamHandlerContext: Omit<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md), body | parseWithCheerio | json | addRequests | contentType> & { stream: Request } --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/http ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/http ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/http # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/http ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/http # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * retry on blocked status codes in `HttpCrawler` ([#3060](https://github.com/apify/crawlee/issues/3060)) ([b5fcd79](https://github.com/apify/crawlee/commit/b5fcd79324ed61c6591fbdc9ffba67b35dc54fde)), closes [/github.com/apify/crawlee/blob/f68d2a95d67cc6230122dc1a5226c57ca23d0ae7/packages/browser-crawler/src/internals/browser-crawler.ts#L481-L486](https://github.com//github.com/apify/crawlee/blob/f68d2a95d67cc6230122dc1a5226c57ca23d0ae7/packages/browser-crawler/src/internals/browser-crawler.ts/issues/L481-L486) [#3029](https://github.com/apify/crawlee/issues/3029) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/http ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Features[​](#features "Direct link to Features") * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") **Note:** Version bump only for package @crawlee/http ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/http ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * enable full cookie support for `ImpitHttpClient` ([#2991](https://github.com/apify/crawlee/issues/2991)) ([120f0a7](https://github.com/apify/crawlee/commit/120f0a7968670eaab14d217e12c09b4dba216d7d)) ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") ### Features[​](#features-1 "Direct link to Features") * add `MinimumSpeedStream` and `ByteCounterStream` helpers ([#2970](https://github.com/apify/crawlee/issues/2970)) ([921c4ee](https://github.com/apify/crawlee/commit/921c4ee3401bd41b8a197b955474bc297152e58b)) ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) 
(2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/http ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Features[​](#features-2 "Direct link to Features") * pass `response` to `FileDownload.streamHandler` ([#2930](https://github.com/apify/crawlee/issues/2930)) ([008c4c7](https://github.com/apify/crawlee/commit/008c4c7d879195a492bbbdb9dcda23acad4d51e1)) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/http ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * treat `406` as other `4xx` status codes in `HttpCrawler` ([#2907](https://github.com/apify/crawlee/issues/2907)) ([b0e6f6d](https://github.com/apify/crawlee/commit/b0e6f6d3fc4455de467baf666e0f67f8738cc57f)), closes [#2892](https://github.com/apify/crawlee/issues/2892) ### Features[​](#features-3 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/http ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/http ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/http # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Features[​](#features-4 "Direct link to Features") * allow using other HTTP clients ([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * **http-crawler:** avoid crashing when gotOptions.cache is on ([#2686](https://github.com/apify/crawlee/issues/2686)) ([1106d3a](https://github.com/apify/crawlee/commit/1106d3aeccd9d1aca8b2630d720d3ea6a1c955f6)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/http ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/http ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/http ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/http # 
[3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/http ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * make `crawler.log` publicly accessible ([#2526](https://github.com/apify/crawlee/issues/2526)) ([3e9e665](https://github.com/apify/crawlee/commit/3e9e6652c0b5e4d0c2707985abbad7d80336b9af)) ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-5 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/http ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/http # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Features[​](#features-6 "Direct link to Features") * add `FileDownload` "crawler" ([#2435](https://github.com/apify/crawlee/issues/2435)) ([d73756b](https://github.com/apify/crawlee/commit/d73756bb225d9ed8f58cf0a3b2e0ce96f6188863)) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/http ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/http # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Features[​](#features-7 "Direct link to Features") * `tieredProxyUrls` for ProxyConfiguration ([#2348](https://github.com/apify/crawlee/issues/2348)) ([5408c7f](https://github.com/apify/crawlee/commit/5408c7f60a5bf4dbdba92f2d7440e0946b94ea6e)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/http ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/http # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) **Note:** Version bump only for package @crawlee/http ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/http ## 
[3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/http ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/http # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Features[​](#features-8 "Direct link to Features") * check enqueue link strategy post redirect ([#2238](https://github.com/apify/crawlee/issues/2238)) ([3c5f9d6](https://github.com/apify/crawlee/commit/3c5f9d6056158e042e12d75b2b1b21ef6c32e618)), closes [#2173](https://github.com/apify/crawlee/issues/2173) * log cause with `retryOnBlocked` ([#2252](https://github.com/apify/crawlee/issues/2252)) ([e19a773](https://github.com/apify/crawlee/commit/e19a773693cfc5e65c1e2321bfc8b73c9844ea8b)), closes [#2249](https://github.com/apify/crawlee/issues/2249) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/http ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/http # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * retry incorrect Content-Type when response has blocked status code ([#2176](https://github.com/apify/crawlee/issues/2176)) ([b54fb8b](https://github.com/apify/crawlee/commit/b54fb8bb7bc3575195ee676d21e5feb8f898ef47)), closes [#1994](https://github.com/apify/crawlee/issues/1994) ### Features[​](#features-9 "Direct link to Features") * got-scraping v4 ([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/http ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/http ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/http ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") **Note:** Version bump only for package @crawlee/http ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/http ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") ### Bug 
Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * support `DELETE` requests in `HttpCrawler` ([#2039](https://github.com/apify/crawlee/issues/2039)) ([7ea5c41](https://github.com/apify/crawlee/commit/7ea5c4185b169ec933dcd8df2e85824a7e452913)), closes [#1658](https://github.com/apify/crawlee/issues/1658) ### Features[​](#features-10 "Direct link to Features") * Add options for custom HTTP error status codes ([#2035](https://github.com/apify/crawlee/issues/2035)) ([b50ef1a](https://github.com/apify/crawlee/commit/b50ef1ad51d6d7c7a71e7f40efdb2b1ef0f09291)), closes [#1711](https://github.com/apify/crawlee/issues/1711) ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * log original error message on session rotation ([#2022](https://github.com/apify/crawlee/issues/2022)) ([8a11ffb](https://github.com/apify/crawlee/commit/8a11ffbdaef6b2fe8603aac570c3038f84c2f203)) # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-11 "Direct link to Features") * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Features[​](#features-12 "Direct link to Features") * retryOnBlocked detects blocked webpage ([#1956](https://github.com/apify/crawlee/issues/1956)) ([766fa9b](https://github.com/apify/crawlee/commit/766fa9b88029e9243a7427075384c1abe85c70c8)) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **http-crawler:** replace `IncomingMessage` with `PlainResponse` for context's `response` ([#1973](https://github.com/apify/crawlee/issues/1973)) ([2a1cc7f](https://github.com/apify/crawlee/commit/2a1cc7f4f87f0b1c657759076a236a8f8d9b76ba)), closes [#1964](https://github.com/apify/crawlee/issues/1964) # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/http ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/http ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-13 "Direct link to Features") * **HttpCrawler:** add `parseWithCheerio` helper to `HttpCrawler` ([#1906](https://github.com/apify/crawlee/issues/1906)) ([ff5f76f](https://github.com/apify/crawlee/commit/ff5f76f9336c47c555c28038cdc72dc650bb5065)) * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * **jsdom:** use no-op `enqueueLinks` in http crawlers when parsing fails 
([fd35270](https://github.com/apify/crawlee/commit/fd35270e7da67a77eb60108e19294f0fd2016706)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) **Note:** Version bump only for package @crawlee/http ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/http ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/http # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/http ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/http ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/http ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/http # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/http ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/http --- # FileDownload Provides a framework for downloading files in parallel using plain HTTP requests. The URLs to download are fed either from a static list of URLs or they can be added on the fly from another crawler. Since `FileDownload` uses raw HTTP requests to download the files, it is very fast and bandwidth-efficient. However, it doesn't parse the content - if you need to e.g. extract data from the downloaded files, you might need to use [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead. `FileDownload` downloads each URL using a plain HTTP request and then invokes the user-provided FileDownloadOptions.requestHandler where the user can specify what to do with the downloaded data. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the FileDownloadOptions.requestList or FileDownloadOptions.requestQueue constructor options, respectively. If both FileDownloadOptions.requestList and FileDownloadOptions.requestQueue are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing.
This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`:

```
preNavigationHooks: [
    (crawlingContext, gotOptions) => {
        // ...
    },
]
```

New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `FileDownload` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `FileDownload` constructor.

## Example usage

```
import { writeFileSync } from 'node:fs';
import { FileDownload } from '@crawlee/http';

const crawler = new FileDownload({
    requestHandler({ body, request }) {
        // Save each file under a name derived from its URL.
        writeFileSync(request.url.replace(/[^a-z0-9\.]/gi, '_'), body);
    },
});

await crawler.run([
    'http://www.example.com/document.pdf',
    'http://www.example.com/sound.mp3',
    'http://www.example.com/video.mkv',
]);
```

### Hierarchy * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)> * *FileDownload* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L187)constructor * ****new FileDownload**(options): [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) - Overrides HttpCrawler.constructor #### Parameters * ##### options: [FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions)\ = {} #### Returns [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from HttpCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function.
We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from HttpCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from HttpCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from HttpCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md)\, request>> = ... Inherited from HttpCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). 
See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from HttpCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from HttpCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from HttpCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from HttpCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from HttpCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. 
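For illustration only (this sketch is not part of the generated reference above): a crawler can record one dataset item per downloaded file and then export the default dataset with `exportData`. The output file name `downloads.csv` and the recorded fields are arbitrary example choices:

```
import { FileDownload } from '@crawlee/http';

const crawler = new FileDownload({
    async requestHandler({ request, body, pushData }) {
        // Record one dataset item per downloaded file.
        await pushData({ url: request.url, bytes: body.length });
    },
});

await crawler.run(['http://www.example.com/document.pdf']);

// The format is inferred from the file extension, so this writes CSV.
await crawler.exportData('./downloads.csv');
```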
*** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from HttpCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from HttpCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from HttpCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from HttpCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from HttpCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. 
* ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from HttpCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from HttpCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)inheriteduse * ****use**(extension): void - Inherited from HttpCrawler.use **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from HttpCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # HttpCrawler Provides a framework for the parallel crawling of web pages using plain HTTP requests. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. It is very fast and bandwidth-efficient. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless Chrome browser. This crawler downloads each URL using a plain HTTP request and doesn't do any HTML parsing. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) or [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) constructor options, respectively.
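As a minimal sketch (not taken from the reference text; the URL below is a placeholder), a queue can be opened explicitly and passed in through the `requestQueue` option:

```
import { HttpCrawler, RequestQueue } from '@crawlee/http';

// Open the default request queue and seed it with a starting URL.
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'http://www.example.com/page-1' });

const crawler = new HttpCrawler({
    requestQueue,
    async requestHandler({ request, body }) {
        // Process the raw response body here.
        console.log(`Fetched ${request.url} (${body.length} bytes)`);
    },
});

await crawler.run();
```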
If both [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) and [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, this crawler only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [HttpCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For details, see [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the constructor. **Example usage:** ``` import { HttpCrawler, Dataset } from '@crawlee/http'; const crawler = new HttpCrawler({ requestList, async requestHandler({ request, response, body, contentType }) { // Save the data to dataset. 
await Dataset.pushData({ url: request.url, html: body, }); }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ### Hierarchy * [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)\ * *HttpCrawler* * [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) * [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) * [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) * [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L373)constructor * ****new HttpCrawler**\(options, config): [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)\ - Overrides BasicCrawler.constructor All `HttpCrawlerOptions` parameters are passed via an options object. *** #### Parameters * ##### options: [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)\ = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)\ ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from BasicCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from BasicCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from BasicCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BasicCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BasicCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BasicCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\> = ... Inherited from BasicCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from BasicCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from BasicCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. 
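As a hedged sketch of the `router` property described above, handlers can be registered on the default router instead of passing a single `requestHandler`; the label and URLs are illustrative:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler();

// Requests without a label fall through to the default handler.
crawler.router.addDefaultHandler(async ({ request, enqueueLinks, log }) => {
    log.info(`Listing page: ${request.url}`);
    await enqueueLinks({ label: 'DETAIL' });
});

// Requests enqueued with the 'DETAIL' label end up here.
crawler.router.addHandler('DETAIL', async ({ request, pushData }) => {
    await pushData({ url: request.url });
});

await crawler.run(['https://example.com']);
```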
### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from BasicCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from BasicCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from BasicCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from BasicCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). 
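A hedged sketch of how `addRequests()`, `run()` and `exportData()` from this section fit together; the URLs and output file name are illustrative:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, pushData }) {
        await pushData({ url: request.url });
    },
});

// Resolves after the first batch; the rest is added in the background.
const { waitForAllRequestsToBeAdded } = await crawler.addRequests([
    'https://example.com/a',
    'https://example.com/b',
]);
await waitForAllRequestsToBeAdded;

await crawler.run();

// The 'csv' format is inferred from the file extension.
await crawler.exportData('results.csv');
```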
*** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from BasicCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from BasicCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BasicCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from BasicCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from BasicCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. 
*** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from BasicCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)use * ****use**(extension): void - **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from BasicCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # ByteCounterStream ### Callable * ****ByteCounterStream**(\_\_namedParameters): Transform *** * Creates a transform stream that logs the progress of the incoming data. This `Transform` calls the `logProgress` function every `loggingInterval` milliseconds with the number of bytes received so far. Can be used e.g. to log the progress of a download. *** #### Parameters * ##### \_\_namedParameters: { loggingInterval?: number; logTransferredBytes: (transferredBytes) => void } * ##### optionalloggingInterval: number = 5000 * ##### logTransferredBytes: (transferredBytes) => void #### Returns Transform Transform stream logging the progress of the incoming data. --- # createFileRouter ### Callable * ****createFileRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md). Defaults to the [FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. ``` import { FileDownload, createFileRouter } from 'crawlee'; const router = createFileRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new FileDownload({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # createHttpRouter ### Callable * ****createHttpRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md). 
Defaults to the [HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. ``` import { HttpCrawler, createHttpRouter } from 'crawlee'; const router = createHttpRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new HttpCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # MinimumSpeedStream ### Callable * ****MinimumSpeedStream**(\_\_namedParameters): Transform *** * Creates a transform stream that throws an error if the source data speed is below the specified minimum speed. This `Transform` checks the amount of data every `checkProgressInterval` milliseconds. If the stream has received less than `minSpeedKbps * historyLengthMs / 1000` bytes in the last `historyLengthMs` milliseconds, it will throw an error. Can be used e.g. to abort a download if the network speed is too slow. *** #### Parameters * ##### \_\_namedParameters: { checkProgressInterval?: number; historyLengthMs?: number; minSpeedKbps: number } * ##### optionalcheckProgressInterval: number = 5e3 * ##### optionalhistoryLengthMs: number = 10e3 * ##### minSpeedKbps: number #### Returns Transform Transform stream that monitors the speed of the incoming data. --- # FileDownloadCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\ * *FileDownloadCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The request body of the web page. 
The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for others MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type header: { type, encoding }`. *** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) Inherited from InternalHttpCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from InternalHttpCrawlingContext.id ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson **json: JSONData Inherited from InternalHttpCrawlingContext.json The parsed object from JSON string if the response contains the content type application/json. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from InternalHttpCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from InternalHttpCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from InternalHttpCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse **response: PlainResponse Inherited from InternalHttpCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? 
: [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from InternalHttpCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from InternalHttpCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from InternalHttpCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L252)inheritedparseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Inherited from InternalHttpCrawlingContext.parseWithCheerio Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it will throw if it's not available. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from InternalHttpCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. 
* ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from InternalHttpCrawlingContext.sendRequest Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L238)inheritedwaitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Inherited from InternalHttpCrawlingContext.waitForSelector Wait for an element matching the selector to appear. Timeout is ignored. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # HttpCrawlerOptions \ ### Hierarchy * [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md)\ * *HttpCrawlerOptions* * [CheerioCrawlerOptions](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md) * [JSDOMCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md) * [LinkeDOMCrawlerOptions](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md) ## Index[**](#Index) ### Properties * [**additionalHttpErrorStatusCodes](#additionalHttpErrorStatusCodes) * [**additionalMimeTypes](#additionalMimeTypes) * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**forceResponseEncoding](#forceResponseEncoding) * [**handlePageFunction](#handlePageFunction) * [**httpClient](#httpClient) * [**ignoreHttpErrorStatusCodes](#ignoreHttpErrorStatusCodes) * [**ignoreSslErrors](#ignoreSslErrors) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * 
[**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**suggestResponseEncoding](#suggestResponseEncoding) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#additionalHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L186)optionaladditionalHttpErrorStatusCodes **additionalHttpErrorStatusCodes? : number\[] An array of additional HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be treated as errors. By default, status codes >= 500 trigger errors. ### [**](#additionalMimeTypes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L142)optionaladditionalMimeTypes **additionalMimeTypes? : string\[] An array of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) you want the crawler to load and process. By default, only `text/html` and `application/xhtml+xml` MIME types are supported. ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from BasicCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalinheritederrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)\ Inherited from BasicCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? 
: [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from BasicCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalinheritedfailedRequestHandler **failedRequestHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)\ Inherited from BasicCrawlerOptions.failedRequestHandler A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#forceResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L166)optionalforceResponseEncoding **forceResponseEncoding? : string By default this crawler will extract correct encoding from the HTTP response headers. Use `forceResponseEncoding` to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use [HttpCrawlerOptions.suggestResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#suggestResponseEncoding) ``` // Will force windows-1250 encoding even if headers say otherwise forceResponseEncoding: 'windows-1250' ``` ### [**](#handlePageFunction)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L87)optionalhandlePageFunction **handlePageFunction? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)\> An alias for [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler) Soon to be removed, use `requestHandler` instead. * **@deprecated** ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from BasicCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L180)optionalignoreHttpErrorStatusCodes **ignoreHttpErrorStatusCodes? : number\[] An array of HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be excluded from error consideration. By default, status codes >= 500 trigger errors. ### [**](#ignoreSslErrors)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L97)optionalignoreSslErrors **ignoreSslErrors? : boolean If set to true, SSL certificate errors will be ignored. 
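A hedged configuration sketch combining the content-type and error-handling options above; the status codes and handler bodies are illustrative, not defaults:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    // also process JSON responses, not just text/html and application/xhtml+xml
    additionalMimeTypes: ['application/json'],
    // treat 403 as an error even though it is below 500
    additionalHttpErrorStatusCodes: [403],
    // and stop treating 503 as an error
    ignoreHttpErrorStatusCodes: [503],
    ignoreSslErrors: true,
    async requestHandler({ request, json }) {
        // json is populated for application/json responses
    },
    async failedRequestHandler({ request }, error) {
        console.error(`${request.url} failed too many times: ${error.message}`);
    },
});
```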
### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from BasicCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from BasicCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from BasicCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from BasicCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from BasicCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from BasicCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. 
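A hedged sketch combining the concurrency and limit options above; all numbers are illustrative and should be tuned for the target site:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    maxConcurrency: 20,        // never more than 20 parallel requests
    maxRequestsPerMinute: 120, // throttle overall throughput
    maxRequestsPerCrawl: 1000, // hard safety limit for the whole run
    maxRequestRetries: 5,      // retry a failing request up to 5 times
    maxCrawlDepth: 2,          // initial requests plus two levels of enqueued links
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks();
    },
});
```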
### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from BasicCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from BasicCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L92)optionalnavigationTimeoutSecs **navigationTimeoutSecs? : number Timeout in which the HTTP request to the resource needs to finish, given in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from BasicCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached. ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L174)optionalpersistCookiesPerSession **persistCookiesPerSession? : boolean Automatically saves cookies to the Session. Works only if the Session Pool is used. Cookies are parsed from the response `Set-Cookie` header and saved or updated on the session; the next time that session is used for a request, the stored cookies are sent in the `Cookie` request header. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L136)optionalpostNavigationHooks **postNavigationHooks? : InternalHttpHook\\[] Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. Example: ``` postNavigationHooks: [ async (crawlingContext) => { // ... }, ] ``` ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L122)optionalpreNavigationHooks **preNavigationHooks?
: InternalHttpHook\\[] Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate. Example: ``` preNavigationHooks: [ async (crawlingContext, gotOptions) => { // ... }, ] ``` Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook). ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L104)optionalproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) If set, this crawler will be configured for all connections to use [Apify Proxy](https://console.apify.com/proxy) or your own Proxy URLs provided and rotated according to the configuration. For more information, see the [documentation](https://docs.apify.com/proxy). ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalinheritedrequestHandler **requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)\> Inherited from BasicCrawlerOptions.requestHandler User-provided function that performs the logic of the crawler. It is called for each URL to crawl. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from BasicCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BasicCrawlerOptions.requestList Static list of URLs to be processed. 
If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from BasicCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BasicCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from BasicCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs to be added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from BasicCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from BasicCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another same domain request. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? 
: [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from BasicCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from BasicCrawlerOptions.statisticsOptions Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from BasicCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters. ``` const crawler = new CheerioCrawler({ statusMessageCallback: async (ctx) => { return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG' }, statusMessageLoggingInterval: 1, // defaults to 10s async requestHandler({ $, enqueueLinks, request, log }) { // ... }, }); ``` ### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from BasicCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling the `setStatusMessage` in seconds. ### [**](#suggestResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L155)optionalsuggestResponseEncoding **suggestResponseEncoding? : string By default this crawler will extract correct encoding from the HTTP response headers. Sadly, there are some websites which use invalid headers. Those are encoded using the UTF-8 encoding. If those sites actually use a different encoding, the response will be corrupted. You can use `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use [HttpCrawlerOptions.forceResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#forceResponseEncoding) ``` // Will fall back to windows-1250 encoding if none found suggestResponseEncoding: 'windows-1250' ``` ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? 
: boolean Inherited from BasicCrawlerOptions.useSessionPool Basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # HttpCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\>> * *HttpCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The request body of the web page. The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for other MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type header: { type, encoding }`. *** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\> Inherited from InternalHttpCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. 
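A hedged sketch of the `getKeyValueStore` context helper described above, used inside a request handler; the key name is illustrative:

```
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, getKeyValueStore }) {
        // Default key-value store for this crawler; pass a name or id to get another one.
        const store = await getKeyValueStore();
        await store.setValue('last-crawled-url', request.url);
    },
});
```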
### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody

**body: string | Buffer\

Inherited from InternalHttpCrawlingContext.body

The body of the web page. The type depends on the `Content-Type` header of the web page:

* String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types
* Buffer for other MIME content types

### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType

**contentType: { encoding: BufferEncoding; type: string }

Inherited from InternalHttpCrawlingContext.contentType

Parsed `Content-Type` header: `{ type, encoding }`.

***

#### Type declaration

* ##### encoding: BufferEncoding
* ##### type: string

### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler

**crawler: [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md)\>

Inherited from InternalHttpCrawlingContext.crawler

### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore

**getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)>

Inherited from InternalHttpCrawlingContext.getKeyValueStore

Get a key-value store with the given name or id, or the default one for the crawler.

***

#### Type declaration

* **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)>

#### Parameters

* ##### optionalidOrName: string

#### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)>

### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid

**id: string

Inherited from InternalHttpCrawlingContext.id

### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson

**json: JSONData

Inherited from InternalHttpCrawlingContext.json

The object parsed from the JSON string, if the response has the `application/json` content type.

### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog

**log: [Log](https://crawlee.dev/js/api/core/class/Log.md)

Inherited from InternalHttpCrawlingContext.log

A preconfigured logger for the request handler.

### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo

**proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md)

Inherited from InternalHttpCrawlingContext.proxyInfo

An object with information about the proxy currently used by the crawler, as configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class.

### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest

**request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\

Inherited from InternalHttpCrawlingContext.request

The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object.

### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse

**response: PlainResponse

Inherited from InternalHttpCrawlingContext.response

### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession

**session? : [Session](https://crawlee.dev/js/api/core/class/Session.md)

Inherited from InternalHttpCrawlingContext.session

### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState

**useState: \(defaultValue) => Promise\

Inherited from InternalHttpCrawlingContext.useState

Returns the state - a piece of mutable persistent data shared across all the request handler runs.

***

#### Type declaration

* **\(defaultValue): Promise\

#### Parameters

* ##### optionaldefaultValue: State

#### Returns Promise\
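
As a brief, illustrative sketch (the state shape is made up), `useState` can keep a shared counter across request handler runs:

```
async requestHandler({ request, useState, log }) {
    // The default value is only used when the state is created for the first time.
    const state = await useState({ pagesProcessed: 0 });
    state.pagesProcessed += 1;
    log.info(`${request.url} is page number ${state.pagesProcessed}`);
},
```

Mutating the returned object is enough; the shared state is persisted for you, so there is no explicit save call.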
## Methods[**](#Methods)

### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks

* ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)>

Inherited from InternalHttpCrawlingContext.enqueueLinks

This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and to override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage.

**Example usage**

```
async requestHandler({ enqueueLinks }) {
    await enqueueLinks({
        globs: [
            'https://www.example.com/handbags/*',
        ],
    });
},
```

***

#### Parameters

* ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue>

  All `enqueueLinks()` parameters are passed via an options object.

#### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)>

Promise that resolves to a [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object.

### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L252)inheritedparseWithCheerio

* ****parseWithCheerio**(selector, timeoutMs): Promise\

Inherited from InternalHttpCrawlingContext.parseWithCheerio

Returns a Cheerio handle for `page.content()`, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When the `selector` argument is provided, the method throws if the selector is not available.

**Example usage:**

```
async requestHandler({ parseWithCheerio }) {
    const $ = await parseWithCheerio();
    const title = $('title').text();
},
```

***

#### Parameters

* ##### optionalselector: string
* ##### optionaltimeoutMs: number

#### Returns Promise\

### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData

* ****pushData**(data, datasetIdOrName): Promise\

Inherited from InternalHttpCrawlingContext.pushData

This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or to the one currently used by the crawler. Shortcut for `crawler.pushData()`.

***

#### Parameters

* ##### optionaldata: ReadonlyDeep\

  Data to be pushed to the default dataset.

* ##### optionaldatasetIdOrName: string

#### Returns Promise\
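
A minimal, illustrative sketch (the field names are arbitrary) of pushing one record per page to the default dataset:

```
async requestHandler({ request, pushData, body }) {
    await pushData({
        url: request.url,
        htmlLength: body.length,
    });
},
```

Passing a second argument, e.g. `pushData(item, 'my-named-dataset')`, writes to a named dataset instead of the default one.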
### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest

* ****sendRequest**\(overrideOptions): Promise\>

Inherited from InternalHttpCrawlingContext.sendRequest

Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that.

```
async requestHandler({ sendRequest }) {
    const { body } = await sendRequest({
        // override headers only
        headers: { ... },
    });
},
```

***

#### Parameters

* ##### optionaloverrideOptions: Partial\

#### Returns Promise\>

### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L238)inheritedwaitForSelector

* ****waitForSelector**(selector, timeoutMs): Promise\

Inherited from InternalHttpCrawlingContext.waitForSelector

Wait for an element matching the selector to appear. Timeout is ignored.

**Example usage:**

```
async requestHandler({ waitForSelector, parseWithCheerio }) {
    await waitForSelector('article h1');
    const $ = await parseWithCheerio();
    const title = $('title').text();
},
```

***

#### Parameters

* ##### selector: string
* ##### optionaltimeoutMs: number

#### Returns Promise\

---

# @crawlee/jsdom

Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [jsdom](https://www.npmjs.com/package/jsdom) DOM implementation. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.

Since `JSDOMCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless browser.

`JSDOMCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [JSDOM](https://www.npmjs.com/package/jsdom) and then invokes the user-provided [JSDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestHandler) to extract page data using the `window` object.

The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [JSDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestList) or [JSDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestQueue) constructor options, respectively. If both [JSDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestList) and [JSDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times.

The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl.

We can use the `preNavigationHooks` to adjust `gotOptions`:

```
preNavigationHooks: [
    (crawlingContext, gotOptions) => {
        // ...
    },
]
```

By default, `JSDOMCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types.
If you want the crawler to process other content types, use the [JSDOMCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [JSDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `JSDOMCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `JSDOMCrawler` constructor. ## Example usage[​](#example-usage "Direct link to Example usage") ``` const crawler = new JSDOMCrawler({ async requestHandler({ request, window }) { await Dataset.pushData({ url: request.url, title: window.document.title, }); }, }); await crawler.run([ 'http://crawlee.dev', ]); ``` ## Index[**](#Index) ### Crawlers * [**JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/jsdom-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/jsdom-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/jsdom-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/jsdom-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/jsdom-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/jsdom-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/jsdom-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/jsdom-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/jsdom-crawler.md#BLOCKED_STATUS_CODES) * [**ByteCounterStream](https://crawlee.dev/js/api/jsdom-crawler.md#ByteCounterStream) * [**checkStorageAccess](https://crawlee.dev/js/api/jsdom-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/jsdom-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/jsdom-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/jsdom-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/jsdom-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/jsdom-crawler.md#createBasicRouter) * 
[**CreateContextOptions](https://crawlee.dev/js/api/jsdom-crawler.md#CreateContextOptions) * [**createFileRouter](https://crawlee.dev/js/api/jsdom-crawler.md#createFileRouter) * [**createHttpRouter](https://crawlee.dev/js/api/jsdom-crawler.md#createHttpRouter) * [**CreateSession](https://crawlee.dev/js/api/jsdom-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/jsdom-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/jsdom-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/jsdom-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/jsdom-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/jsdom-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/jsdom-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/jsdom-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/jsdom-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/jsdom-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/jsdom-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/jsdom-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/jsdom-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/jsdom-crawler.md#EventTypeName) * [**FileDownload](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownload) * [**FileDownloadCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadCrawlingContext) * [**FileDownloadErrorHandler](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadErrorHandler) * [**FileDownloadHook](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadHook) * [**FileDownloadOptions](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadOptions) * [**FileDownloadRequestHandler](https://crawlee.dev/js/api/jsdom-crawler.md#FileDownloadRequestHandler) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/jsdom-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/jsdom-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/jsdom-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/jsdom-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/jsdom-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/jsdom-crawler.md#GotScrapingHttpClient) * [**HttpCrawler](https://crawlee.dev/js/api/jsdom-crawler.md#HttpCrawler) * [**HttpCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#HttpCrawlerOptions) * [**HttpCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#HttpCrawlingContext) * 
[**HttpErrorHandler](https://crawlee.dev/js/api/jsdom-crawler.md#HttpErrorHandler) * [**HttpHook](https://crawlee.dev/js/api/jsdom-crawler.md#HttpHook) * [**HttpRequest](https://crawlee.dev/js/api/jsdom-crawler.md#HttpRequest) * [**HttpRequestHandler](https://crawlee.dev/js/api/jsdom-crawler.md#HttpRequestHandler) * [**HttpRequestOptions](https://crawlee.dev/js/api/jsdom-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/jsdom-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/jsdom-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/jsdom-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/jsdom-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/jsdom-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/jsdom-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/jsdom-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/jsdom-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/jsdom-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/jsdom-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/jsdom-crawler.md#log) * [**Log](https://crawlee.dev/js/api/jsdom-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/jsdom-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/jsdom-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/jsdom-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/jsdom-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/jsdom-crawler.md#MAX_POOL_SIZE) * [**MinimumSpeedStream](https://crawlee.dev/js/api/jsdom-crawler.md#MinimumSpeedStream) * [**NonRetryableError](https://crawlee.dev/js/api/jsdom-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/jsdom-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/jsdom-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/jsdom-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/jsdom-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/jsdom-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/jsdom-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/jsdom-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/jsdom-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/jsdom-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/jsdom-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/jsdom-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/jsdom-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/jsdom-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/jsdom-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RecoverableStatePersistenceOptions) * 
[**RedirectHandler](https://crawlee.dev/js/api/jsdom-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/jsdom-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/jsdom-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/jsdom-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/jsdom-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/jsdom-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/jsdom-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/jsdom-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/jsdom-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/jsdom-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/jsdom-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/jsdom-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/jsdom-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/jsdom-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/jsdom-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/jsdom-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/jsdom-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/jsdom-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/jsdom-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/jsdom-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/jsdom-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/jsdom-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/jsdom-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/jsdom-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/jsdom-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/jsdom-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/jsdom-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/jsdom-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/jsdom-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/jsdom-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SnapshotterOptions) * 
[**Source](https://crawlee.dev/js/api/jsdom-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/jsdom-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/jsdom-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/jsdom-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/jsdom-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/jsdom-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/jsdom-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/jsdom-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/jsdom-crawler.md#StorageManagerOptions) * [**StreamHandlerContext](https://crawlee.dev/js/api/jsdom-crawler.md#StreamHandlerContext) * [**StreamingHttpResponse](https://crawlee.dev/js/api/jsdom-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/jsdom-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/jsdom-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/jsdom-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/jsdom-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/jsdom-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/jsdom-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/jsdom-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/jsdom-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/jsdom-crawler.md#withCheckedStorageAccess) * [**JSDOMCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md) * [**JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md) * [**JSDOMErrorHandler](https://crawlee.dev/js/api/jsdom-crawler.md#JSDOMErrorHandler) * [**JSDOMHook](https://crawlee.dev/js/api/jsdom-crawler.md#JSDOMHook) * [**JSDOMRequestHandler](https://crawlee.dev/js/api/jsdom-crawler.md#JSDOMRequestHandler) * [**createJSDOMRouter](https://crawlee.dev/js/api/jsdom-crawler/function/createJSDOMRouter.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### 
[**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#ByteCounterStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L116)ByteCounterStream Re-exports [ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### 
[**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#createFileRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L304)createFileRouter Re-exports [createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) ### [**](#createHttpRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L1068)createHttpRouter Re-exports [createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports 
[DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### 
[**](#FileDownload)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L184)FileDownload Re-exports [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) ### [**](#FileDownloadCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L52)FileDownloadCrawlingContext Re-exports [FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) ### [**](#FileDownloadErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L20)FileDownloadErrorHandler Re-exports [FileDownloadErrorHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadErrorHandler) ### [**](#FileDownloadHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L47)FileDownloadHook Re-exports [FileDownloadHook](https://crawlee.dev/js/api/http-crawler.md#FileDownloadHook) ### [**](#FileDownloadOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L34)FileDownloadOptions Re-exports [FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions) ### [**](#FileDownloadRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L57)FileDownloadRequestHandler Re-exports [FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L330)HttpCrawler Re-exports [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ### [**](#HttpCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L80)HttpCrawlerOptions Re-exports [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) ### 
[**](#HttpCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L255)HttpCrawlingContext Re-exports [HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) ### [**](#HttpErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L75)HttpErrorHandler Re-exports [HttpErrorHandler](https://crawlee.dev/js/api/http-crawler.md#HttpErrorHandler) ### [**](#HttpHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L194)HttpHook Re-exports [HttpHook](https://crawlee.dev/js/api/http-crawler.md#HttpHook) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L258)HttpRequestHandler Re-exports [HttpRequestHandler](https://crawlee.dev/js/api/http-crawler.md#HttpRequestHandler) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### 
[**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#MinimumSpeedStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L71)MinimumSpeedStream Re-exports [MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### 
[**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### 
[**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### 
[**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions 
Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams 
Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamHandlerContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L25)StreamHandlerContext Re-exports [StreamHandlerContext](https://crawlee.dev/js/api/http-crawler.md#StreamHandlerContext) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#JSDOMErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L34)JSDOMErrorHandler **JSDOMErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * 
**JSONData**: Dictionary = any ### [**](#JSDOMHook)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L53)JSDOMHook **JSDOMHook\: InternalHttpHook<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#JSDOMRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L95)JSDOMRequestHandler **JSDOMRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. ## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/jsdom ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/jsdom ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/jsdom # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/jsdom ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/jsdom # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Features[​](#features "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package 
@crawlee/jsdom ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/jsdom ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-1 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-2 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/jsdom ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/jsdom ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/jsdom # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/jsdom ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/jsdom ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/jsdom ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/jsdom ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/jsdom ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/jsdom # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/jsdom ## 
[3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * declare missing peer dependencies in `@crawlee/browser` package ([#2532](https://github.com/apify/crawlee/issues/2532)) ([3357c7f](https://github.com/apify/crawlee/commit/3357c7fc5ab071b12f72097c190dbee9990e3751)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/jsdom ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-3 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/jsdom ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/jsdom # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/jsdom ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/jsdom ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/jsdom # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/jsdom ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/jsdom ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/jsdom # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) **Note:** Version bump only for package @crawlee/jsdom ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/jsdom ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/jsdom ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/jsdom # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/jsdom ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/jsdom ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) 
(2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/jsdom # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/jsdom ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-4 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/jsdom ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/jsdom # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) **Note:** Version bump only for package @crawlee/jsdom ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/jsdom ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") ### Features[​](#features-5 "Direct link to Features") * **jsdom,linkedom:** Expose document to crawler router context ([#1950](https://github.com/apify/crawlee/issues/1950)) ([4536dc2](https://github.com/apify/crawlee/commit/4536dc2900ee6d0acb562583ed8fca183df28e39)) # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/jsdom ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/jsdom ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-6 "Direct link to Features") * **HttpCrawler:** add `parseWithCheerio` helper to `HttpCrawler` 
([#1906](https://github.com/apify/crawlee/issues/1906)) ([ff5f76f](https://github.com/apify/crawlee/commit/ff5f76f9336c47c555c28038cdc72dc650bb5065)) * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * **jsdom:** add timeout to the window\.load wait when `runScripts` are enabled ([806de31](https://github.com/apify/crawlee/commit/806de31222e138ef2d8e2706536a7288423be3d4)) * **jsdom:** delay closing of the window and add some polyfills ([2e81618](https://github.com/apify/crawlee/commit/2e81618afb5f3890495e3e5fcfa037eb3319edc9)) * **jsdom:** use no-op `enqueueLinks` in http crawlers when parsing fails ([fd35270](https://github.com/apify/crawlee/commit/fd35270e7da67a77eb60108e19294f0fd2016706)) ### Features[​](#features-7 "Direct link to Features") * **jsdom:** add `parseWithCheerio` context helper ([c8f0796](https://github.com/apify/crawlee/commit/c8f0796aebc0dfa6e6d04740a0bb7d8ddd5b2d96)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * ignore invalid URLs in `enqueueLinks` in browser crawlers ([#1803](https://github.com/apify/crawlee/issues/1803)) ([5ac336c](https://github.com/apify/crawlee/commit/5ac336c5b83b212fd6281659b8ceee091e259ff1)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/jsdom ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/jsdom # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/jsdom ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") ### Features[​](#features-8 "Direct link to Features") * hideInternalConsole in JSDOMCrawler ([#1707](https://github.com/apify/crawlee/issues/1707)) ([8975f90](https://github.com/apify/crawlee/commit/8975f9088cf4dd38629c21e21061616fc1e7b003)) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/jsdom ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/jsdom # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/jsdom ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/jsdom --- # JSDOMCrawler Provides a framework for the parallel crawling of web pages using plain HTTP requests. 
The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. It is very fast and efficient in terms of data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless Chrome browser. This crawler downloads each URL using a plain HTTP request and doesn't do any HTML parsing. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) or [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) constructor options, respectively. If both [HttpCrawlerOptions.requestList](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestList) and [HttpCrawlerOptions.requestQueue](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts processing them. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, this crawler only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [HttpCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For details, see [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the constructor. **Example usage:** ``` import { HttpCrawler, Dataset } from '@crawlee/http'; const crawler = new HttpCrawler({ requestList, async requestHandler({ request, response, body, contentType }) { // Save the data to dataset. 
await Dataset.pushData({ url: request.url, html: body, }); }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ### Hierarchy * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)> * *JSDOMCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**\_runRequestHandler](#_runRequestHandler) * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**getVirtualConsole](#getVirtualConsole) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L191)constructor * ****new JSDOMCrawler**(options, config): [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) - Overrides HttpCrawler.constructor #### Parameters * ##### options: [JSDOMCrawlerOptions](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md)\ = {} * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) #### Returns [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from HttpCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from HttpCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from HttpCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from HttpCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\, request>> = ... Inherited from HttpCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from HttpCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? 
: [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from HttpCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from HttpCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#_runRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L318)\_runRequestHandler * ****\_runRequestHandler**(context): Promise\ - Overrides HttpCrawler.\_runRequestHandler Wrapper around `requestHandler` that opens and closes pages, etc. *** #### Parameters * ##### context: [JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\ #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from HttpCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via the `batchSize` option and the sleep time between batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from HttpCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports it to the specified format. Supported formats are currently 'json' and 'csv'; the format is inferred from the `path` automatically. 
*** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from HttpCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from HttpCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from HttpCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#getVirtualConsole)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L214)getVirtualConsole * ****getVirtualConsole**(): VirtualConsole - Returns the currently used `VirtualConsole` instance. Can be used to listen for the JSDOM's internal console messages. If the `hideInternalConsole` option is set to `true`, the messages aren't logged to the console by default, but the virtual console can still be listened to. **Example usage:** ``` const console = crawler.getVirtualConsole(); console.on('error', (e) => { log.error(e); }); ``` *** #### Returns VirtualConsole ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from HttpCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from HttpCrawler.run Runs the crawler. 
Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from HttpCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from HttpCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)inheriteduse * ****use**(extension): void - Inherited from HttpCrawler.use **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from HttpCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # createJSDOMRouter ### Callable * ****createJSDOMRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md). Defaults to the [JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. 
``` import { JSDOMCrawler, createJSDOMRouter } from 'crawlee'; const router = createJSDOMRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new JSDOMCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # JSDOMCrawlerOptions \ ### Hierarchy * [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> * *JSDOMCrawlerOptions* ## Index[**](#Index) ### Properties * [**additionalHttpErrorStatusCodes](#additionalHttpErrorStatusCodes) * [**additionalMimeTypes](#additionalMimeTypes) * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**forceResponseEncoding](#forceResponseEncoding) * [**handlePageFunction](#handlePageFunction) * [**hideInternalConsole](#hideInternalConsole) * [**httpClient](#httpClient) * [**ignoreHttpErrorStatusCodes](#ignoreHttpErrorStatusCodes) * [**ignoreSslErrors](#ignoreSslErrors) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**runScripts](#runScripts) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**suggestResponseEncoding](#suggestResponseEncoding) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#additionalHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L186)optionalinheritedadditionalHttpErrorStatusCodes **additionalHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.additionalHttpErrorStatusCodes An array of additional HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be treated as errors. By default, status codes >= 500 trigger errors. ### [**](#additionalMimeTypes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L142)optionalinheritedadditionalMimeTypes **additionalMimeTypes? 
: string\[] Inherited from HttpCrawlerOptions.additionalMimeTypes An array of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) you want the crawler to load and process. By default, only `text/html` and `application/xhtml+xml` MIME types are supported. ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from HttpCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalinheritederrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> Inherited from HttpCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from HttpCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalinheritedfailedRequestHandler **failedRequestHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\> Inherited from HttpCrawlerOptions.failedRequestHandler A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. 
The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. The second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#forceResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L166)optionalinheritedforceResponseEncoding **forceResponseEncoding? : string Inherited from HttpCrawlerOptions.forceResponseEncoding By default, this crawler extracts the correct encoding from the HTTP response headers. Use `forceResponseEncoding` to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use [HttpCrawlerOptions.suggestResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#suggestResponseEncoding). ``` // Will force windows-1250 encoding even if headers say otherwise forceResponseEncoding: 'windows-1250' ``` ### [**](#handlePageFunction)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L87)optionalinheritedhandlePageFunction **handlePageFunction? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.handlePageFunction An alias for [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler). Soon to be removed; use `requestHandler` instead. * **@deprecated** ### [**](#hideInternalConsole)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L50)optionalhideInternalConsole **hideInternalConsole? : boolean Suppresses the logs from JSDOM's internal console. ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from HttpCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md). ### [**](#ignoreHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L180)optionalinheritedignoreHttpErrorStatusCodes **ignoreHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.ignoreHttpErrorStatusCodes An array of HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be excluded from error consideration. By default, status codes >= 500 trigger errors. ### [**](#ignoreSslErrors)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L97)optionalinheritedignoreSslErrors **ignoreSslErrors? : boolean Inherited from HttpCrawlerOptions.ignoreSslErrors If set to `true`, SSL certificate errors will be ignored. 
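The status-code and console options documented above can be combined when constructing the crawler. Below is a minimal sketch with purely illustrative values (the start URL and the specific status codes are arbitrary choices, not defaults): it treats 403 responses as retryable errors, lets 404 responses reach the request handler, and silences JSDOM's internal console.

```
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Treat 403 as an error so the request gets retried (illustrative choice).
    additionalHttpErrorStatusCodes: [403],
    // Do not treat 404 as an error; the requestHandler can inspect it instead.
    ignoreHttpErrorStatusCodes: [404],
    // Suppress JSDOM's internal console output; it can still be observed
    // via crawler.getVirtualConsole() if needed.
    hideInternalConsole: true,
    async requestHandler({ request, window, log }) {
        log.info(`${request.url}: ${window.document.title}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```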
### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from HttpCrawlerOptions.keepAlive Allows keeping the crawler alive even when the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) becomes empty. By default, `crawler.run()` will resolve once the queue is empty. With `keepAlive: true`, it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from HttpCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from HttpCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for the initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from HttpCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from HttpCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from HttpCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should process. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. 
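The concurrency and crawl-limit options above can all be passed directly to the constructor. A minimal sketch, assuming purely illustrative numbers and start URL (none of these values are defaults or recommendations):

```
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Concurrency bounds, forwarded to the underlying AutoscaledPool.
    minConcurrency: 2,
    maxConcurrency: 20,
    // Throttle the crawler to at most 120 requests per minute.
    maxRequestsPerMinute: 120,
    // Safety cap on the total number of pages opened in this run.
    maxRequestsPerCrawl: 1000,
    // Process the start URLs and the links found on them, but go no deeper.
    maxCrawlDepth: 1,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```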
### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from HttpCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from HttpCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If not sure, it's better to keep the default value and let the concurrency scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L92)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from HttpCrawlerOptions.navigationTimeoutSecs Timeout in which the HTTP request to the resource needs to finish, given in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from HttpCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests that are skipped 1. based on the robots.txt file, 2. because they don't match the enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, or 4. because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L174)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from HttpCrawlerOptions.persistCookiesPerSession Automatically saves cookies to the Session. Works only if the Session Pool is used. It parses cookies from the response "Set-Cookie" header and saves or updates the cookies for the session. When the session is then used for the next request, it passes the "Cookie" header with the session cookies to that request. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L136)optionalinheritedpostNavigationHooks **postNavigationHooks? : InternalHttpHook<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter.
Example: ``` postNavigationHooks: [ async (crawlingContext) => { // ... }, ] ``` ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L122)optionalinheritedpreNavigationHooks **preNavigationHooks? : InternalHttpHook<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.preNavigationHooks Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate. Example: ``` preNavigationHooks: [ async (crawlingContext, gotOptions) => { // ... }, ] ``` Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook) ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L104)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawlerOptions.proxyConfiguration If set, this crawler will be configured for all connections to use [Apify Proxy](https://console.apify.com/proxy) or your own proxy URLs, provided and rotated according to the configuration. For more information, see the [documentation](https://docs.apify.com/proxy). ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalinheritedrequestHandler **requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[JSDOMCrawlingContext](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.requestHandler User-provided function that performs the logic of the crawler. It is called for each URL to crawl. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs?
: number = 60 Inherited from HttpCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from HttpCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from HttpCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from HttpCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection.
Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#runScripts)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L46)optionalrunScripts **runScripts? : boolean Download and run scripts. ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from HttpCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from HttpCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from HttpCrawlerOptions.statisticsOptions Customize the way statistics collection works, such as the logging interval or whether to output the statistics to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from HttpCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters. ``` const crawler = new CheerioCrawler({ statusMessageCallback: async (ctx) => { return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG' }, statusMessageLoggingInterval: 1, // defaults to 10s async requestHandler({ $, enqueueLinks, request, log }) { // ... }, }); ``` ### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from HttpCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#suggestResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L155)optionalinheritedsuggestResponseEncoding **suggestResponseEncoding? : string Inherited from HttpCrawlerOptions.suggestResponseEncoding By default, this crawler will extract the correct encoding from the HTTP response headers.
Sadly, some websites use invalid encoding headers. Their responses are then decoded as UTF-8 by default, so if such a site actually uses a different encoding, the response will be corrupted. You can use `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use [HttpCrawlerOptions.forceResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#forceResponseEncoding) ``` // Will fall back to windows-1250 encoding if none found suggestResponseEncoding: 'windows-1250' ``` ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from HttpCrawlerOptions.useSessionPool The crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # JSDOMCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\ * *JSDOMCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**document](#document) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) * [**window](#window) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The body of the web page. The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for other MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type` header: `{ type, encoding }`.
*** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [JSDOMCrawler](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) Inherited from InternalHttpCrawlingContext.crawler ### [**](#document)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L63)document **document: Document ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with a given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from InternalHttpCrawlingContext.id ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson **json: JSONData Inherited from InternalHttpCrawlingContext.json The object parsed from the JSON string, if the response has the `application/json` content type. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from InternalHttpCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from InternalHttpCrawlingContext.proxyInfo An object with information about the proxy currently used by the crawler, as configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from InternalHttpCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse **response: PlainResponse Inherited from InternalHttpCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session?
: [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from InternalHttpCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from InternalHttpCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ### [**](#window)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L62)window **window: DOMWindow ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from InternalHttpCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to a [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L92)parseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.parseWithCheerio Returns a Cheerio handle, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it will first look for the selector with a 5s timeout. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }, ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from InternalHttpCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`.
*** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from InternalHttpCrawlingContext.sendRequest Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/jsdom-crawler/src/internals/jsdom-crawler.ts#L78)waitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.waitForSelector Wait for an element matching the selector to appear. Timeout defaults to 5s. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }, ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # @crawlee/linkedom Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [linkedom](https://www.npmjs.com/package/linkedom) DOM implementation. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `LinkeDOMCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because they load the pages using a full-featured headless Chrome browser. `LinkeDOMCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [LinkeDOM](https://www.npmjs.com/package/linkedom) and then invokes the user-provided [LinkeDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestHandler) to extract page data using the `window` object. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [LinkeDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestList) or [LinkeDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestQueue) constructor options, respectively.
If both [LinkeDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestList) and [LinkeDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, `LinkeDOMCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [LinkeDOMCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [LinkeDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `LinkeDOMCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `LinkeDOMCrawler` constructor. 
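As a brief, hedged illustration of the paragraph above, the following sketch passes the `minConcurrency` and `maxConcurrency` shortcuts alongside other `AutoscaledPool` options to the constructor; the values are arbitrary, and `desiredConcurrency` is just one example of an `AutoscaledPoolOptions` field.

```
import { LinkeDOMCrawler } from '@crawlee/linkedom';

const crawler = new LinkeDOMCrawler({
    // Shortcuts for the corresponding AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Any other AutoscaledPool option can be passed here.
    autoscaledPoolOptions: {
        desiredConcurrency: 10,
    },
    async requestHandler({ request, window }) {
        // Extract page data from the `window` object here.
    },
});
```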
## Example usage[​](#example-usage "Direct link to Example usage") ``` const crawler = new LinkeDOMCrawler({ async requestHandler({ request, window }) { await Dataset.pushData({ url: request.url, title: window.document.title, }); }, }); await crawler.run([ 'http://crawlee.dev', ]); ``` ## Index[**](#Index) ### Crawlers * [**LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/linkedom-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/linkedom-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/linkedom-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/linkedom-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/linkedom-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/linkedom-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/linkedom-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/linkedom-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/linkedom-crawler.md#BLOCKED_STATUS_CODES) * [**ByteCounterStream](https://crawlee.dev/js/api/linkedom-crawler.md#ByteCounterStream) * [**checkStorageAccess](https://crawlee.dev/js/api/linkedom-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/linkedom-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/linkedom-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/linkedom-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/linkedom-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/linkedom-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/linkedom-crawler.md#CreateContextOptions) * [**createFileRouter](https://crawlee.dev/js/api/linkedom-crawler.md#createFileRouter) * [**createHttpRouter](https://crawlee.dev/js/api/linkedom-crawler.md#createHttpRouter) * [**CreateSession](https://crawlee.dev/js/api/linkedom-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/linkedom-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/linkedom-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetExportToOptions) * 
[**DatasetIteratorOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/linkedom-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/linkedom-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/linkedom-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/linkedom-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/linkedom-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/linkedom-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/linkedom-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/linkedom-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/linkedom-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/linkedom-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/linkedom-crawler.md#EventTypeName) * [**FileDownload](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownload) * [**FileDownloadCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadCrawlingContext) * [**FileDownloadErrorHandler](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadErrorHandler) * [**FileDownloadHook](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadHook) * [**FileDownloadOptions](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadOptions) * [**FileDownloadRequestHandler](https://crawlee.dev/js/api/linkedom-crawler.md#FileDownloadRequestHandler) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/linkedom-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/linkedom-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/linkedom-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/linkedom-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/linkedom-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/linkedom-crawler.md#GotScrapingHttpClient) * [**HttpCrawler](https://crawlee.dev/js/api/linkedom-crawler.md#HttpCrawler) * [**HttpCrawlerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#HttpCrawlerOptions) * [**HttpCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#HttpCrawlingContext) * [**HttpErrorHandler](https://crawlee.dev/js/api/linkedom-crawler.md#HttpErrorHandler) * [**HttpHook](https://crawlee.dev/js/api/linkedom-crawler.md#HttpHook) * [**HttpRequest](https://crawlee.dev/js/api/linkedom-crawler.md#HttpRequest) * [**HttpRequestHandler](https://crawlee.dev/js/api/linkedom-crawler.md#HttpRequestHandler) * [**HttpRequestOptions](https://crawlee.dev/js/api/linkedom-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/linkedom-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/linkedom-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/linkedom-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/linkedom-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/linkedom-crawler.md#KeyConsumer) * 
[**KeyValueStore](https://crawlee.dev/js/api/linkedom-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/linkedom-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/linkedom-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/linkedom-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/linkedom-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/linkedom-crawler.md#log) * [**Log](https://crawlee.dev/js/api/linkedom-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/linkedom-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/linkedom-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/linkedom-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/linkedom-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/linkedom-crawler.md#MAX_POOL_SIZE) * [**MinimumSpeedStream](https://crawlee.dev/js/api/linkedom-crawler.md#MinimumSpeedStream) * [**NonRetryableError](https://crawlee.dev/js/api/linkedom-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/linkedom-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/linkedom-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/linkedom-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/linkedom-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/linkedom-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/linkedom-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/linkedom-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/linkedom-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/linkedom-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/linkedom-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/linkedom-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/linkedom-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/linkedom-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/linkedom-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/linkedom-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/linkedom-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/linkedom-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/linkedom-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/linkedom-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/linkedom-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/linkedom-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestListOptions) * 
[**RequestListSourcesFunction](https://crawlee.dev/js/api/linkedom-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/linkedom-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/linkedom-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/linkedom-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/linkedom-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/linkedom-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/linkedom-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/linkedom-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/linkedom-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/linkedom-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/linkedom-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/linkedom-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/linkedom-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/linkedom-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/linkedom-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/linkedom-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/linkedom-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/linkedom-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/linkedom-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/linkedom-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/linkedom-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/linkedom-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/linkedom-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/linkedom-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/linkedom-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/linkedom-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/linkedom-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/linkedom-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/linkedom-crawler.md#StatusMessageCallback) * 
[**StatusMessageCallbackParams](https://crawlee.dev/js/api/linkedom-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/linkedom-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/linkedom-crawler.md#StorageManagerOptions) * [**StreamHandlerContext](https://crawlee.dev/js/api/linkedom-crawler.md#StreamHandlerContext) * [**StreamingHttpResponse](https://crawlee.dev/js/api/linkedom-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/linkedom-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/linkedom-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/linkedom-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/linkedom-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/linkedom-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/linkedom-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/linkedom-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/linkedom-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/linkedom-crawler.md#withCheckedStorageAccess) * [**LinkeDOMCrawlerEnqueueLinksOptions](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerEnqueueLinksOptions.md) * [**LinkeDOMCrawlerOptions](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md) * [**LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md) * [**LinkeDOMErrorHandler](https://crawlee.dev/js/api/linkedom-crawler.md#LinkeDOMErrorHandler) * [**LinkeDOMHook](https://crawlee.dev/js/api/linkedom-crawler.md#LinkeDOMHook) * [**LinkeDOMRequestHandler](https://crawlee.dev/js/api/linkedom-crawler.md#LinkeDOMRequestHandler) * [**createLinkeDOMRouter](https://crawlee.dev/js/api/linkedom-crawler/function/createLinkeDOMRouter.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### 
[**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#ByteCounterStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L116)ByteCounterStream Re-exports [ByteCounterStream](https://crawlee.dev/js/api/http-crawler/function/ByteCounterStream.md) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### 
[**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#createFileRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L304)createFileRouter Re-exports [createFileRouter](https://crawlee.dev/js/api/http-crawler/function/createFileRouter.md) ### [**](#createHttpRouter)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L1068)createHttpRouter Re-exports [createHttpRouter](https://crawlee.dev/js/api/http-crawler/function/createHttpRouter.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports 
[DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#FileDownload)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L184)FileDownload Re-exports [FileDownload](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) ### 
[**](#FileDownloadCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L52)FileDownloadCrawlingContext Re-exports [FileDownloadCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/FileDownloadCrawlingContext.md) ### [**](#FileDownloadErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L20)FileDownloadErrorHandler Re-exports [FileDownloadErrorHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadErrorHandler) ### [**](#FileDownloadHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L47)FileDownloadHook Re-exports [FileDownloadHook](https://crawlee.dev/js/api/http-crawler.md#FileDownloadHook) ### [**](#FileDownloadOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L34)FileDownloadOptions Re-exports [FileDownloadOptions](https://crawlee.dev/js/api/http-crawler.md#FileDownloadOptions) ### [**](#FileDownloadRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L57)FileDownloadRequestHandler Re-exports [FileDownloadRequestHandler](https://crawlee.dev/js/api/http-crawler.md#FileDownloadRequestHandler) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L330)HttpCrawler Re-exports [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) ### [**](#HttpCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L80)HttpCrawlerOptions Re-exports [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md) ### [**](#HttpCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L255)HttpCrawlingContext Re-exports [HttpCrawlingContext](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlingContext.md) ### 
[**](#HttpErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L75)HttpErrorHandler Re-exports [HttpErrorHandler](https://crawlee.dev/js/api/http-crawler.md#HttpErrorHandler) ### [**](#HttpHook)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L194)HttpHook Re-exports [HttpHook](https://crawlee.dev/js/api/http-crawler.md#HttpHook) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L258)HttpRequestHandler Re-exports [HttpRequestHandler](https://crawlee.dev/js/api/http-crawler.md#HttpRequestHandler) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### 
[**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#MinimumSpeedStream)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L71)MinimumSpeedStream Re-exports [MinimumSpeedStream](https://crawlee.dev/js/api/http-crawler/function/MinimumSpeedStream.md) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### 
[**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### 
[**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports 
[RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### 
[**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports 
[StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamHandlerContext)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/file-download.ts#L25)StreamHandlerContext Re-exports [StreamHandlerContext](https://crawlee.dev/js/api/http-crawler.md#StreamHandlerContext) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#LinkeDOMErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L31)LinkeDOMErrorHandler **LinkeDOMErrorHandler\: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#LinkeDOMHook)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L43)LinkeDOMHook **LinkeDOMHook\: 
InternalHttpHook<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any ### [**](#LinkeDOMRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L90)LinkeDOMRequestHandler **LinkeDOMRequestHandler\: [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> #### Type parameters * **UserData**: Dictionary = any * **JSONData**: Dictionary = any --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. ## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/linkedom ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/linkedom ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/linkedom # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/linkedom ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/linkedom # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Features[​](#features "Direct link to Features") * add `maxCrawlDepth` crawler option ([#3045](https://github.com/apify/crawlee/issues/3045)) ([0090df9](https://github.com/apify/crawlee/commit/0090df93a12df9918d016cf2f1378f1f7d40557d)), closes [#2633](https://github.com/apify/crawlee/issues/2633) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * Do not enqueue more links than what the crawler is capable of processing ([#2990](https://github.com/apify/crawlee/issues/2990)) ([ea094c8](https://github.com/apify/crawlee/commit/ea094c819232e0b30bc550270836d10506eb9454)), closes [#2728](https://github.com/apify/crawlee/issues/2728) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 
"Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/linkedom ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") ### Features[​](#features-1 "Direct link to Features") * add `onSkippedRequest` option ([#2916](https://github.com/apify/crawlee/issues/2916)) ([764f992](https://github.com/apify/crawlee/commit/764f99203627b6a44d2ee90d623b8b0e6ecbffb5)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-2 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/linkedom ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/linkedom ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/linkedom # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/linkedom ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/linkedom ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/linkedom ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/linkedom ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/linkedom ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/linkedom # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/linkedom ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 
3105-2024-06-12") **Note:** Version bump only for package @crawlee/linkedom ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/linkedom ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-3 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/linkedom ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/linkedom # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/linkedom ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/linkedom ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/linkedom # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/linkedom ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/linkedom ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/linkedom # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) **Note:** Version bump only for package @crawlee/linkedom ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/linkedom ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/linkedom ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/linkedom # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/linkedom ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/linkedom ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/linkedom # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Features[​](#features-4 "Direct link to Features") * got-scraping v4 
([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-5 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/linkedom ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/linkedom # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) **Note:** Version bump only for package @crawlee/linkedom ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/linkedom ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") ### Features[​](#features-6 "Direct link to Features") * **jsdom,linkedom:** Expose document to crawler router context ([#1950](https://github.com/apify/crawlee/issues/1950)) ([4536dc2](https://github.com/apify/crawlee/commit/4536dc2900ee6d0acb562583ed8fca183df28e39)) # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Features[​](#features-7 "Direct link to Features") * add LinkeDOMCrawler ([#1907](https://github.com/apify/crawlee/issues/1907)) ([1c69560](https://github.com/apify/crawlee/commit/1c69560fe7ef45097e6be1037b79a84eb9a06337)), closes [apify/crawlee#1890 (comment)](https://github.com/apify/crawlee/pull/1890#issuecomment-1533271694) --- # LinkeDOMCrawler Provides a framework for the parallel crawling of web pages using plain HTTP requests and the [linkedom](https://www.npmjs.com/package/linkedom) DOM implementation. 
The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `LinkeDOMCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient in terms of data bandwidth. However, if the target website requires JavaScript to display the content, you might need to use [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) or [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) instead, because those crawlers load the pages using a full-featured headless browser. **Limitation**: This crawler does not support proxies and cookies yet (each request starts with an empty cookie store), and the user agent is always set to `Chrome`. `LinkeDOMCrawler` downloads each URL using a plain HTTP request, parses the HTML content using [LinkeDOM](https://www.npmjs.com/package/linkedom) and then invokes the user-provided [LinkeDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestHandler) to extract page data using the `window` object. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [LinkeDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestList) or [LinkeDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestQueue) constructor options, respectively. If both [LinkeDOMCrawlerOptions.requestList](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestList) and [LinkeDOMCrawlerOptions.requestQueue](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. We can use the `preNavigationHooks` to adjust `gotOptions`: ``` preNavigationHooks: [ (crawlingContext, gotOptions) => { // ... }, ] ``` By default, `LinkeDOMCrawler` only processes web pages with the `text/html` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the [LinkeDOMCrawlerOptions.additionalMimeTypes](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#additionalMimeTypes) constructor option. Beware that the parsing behavior differs for HTML, XML, JSON and other types of content. For more details, see [LinkeDOMCrawlerOptions.requestHandler](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlerOptions.md#requestHandler). New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. 
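Expanding on the `preNavigationHooks` snippet above, a minimal sketch of adjusting `gotOptions` per request could look as follows. The hook signature comes from the documentation above; the specific header and its value are illustrative placeholders, not recommended settings.

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            // gotOptions are the options handed to the underlying HTTP client
            // for this single request, so they can be tweaked per request here.
            gotOptions.headers = {
                ...gotOptions.headers,
                'accept-language': 'en-US,en;q=0.9', // placeholder value
            };
        },
    ],
    async requestHandler({ request, window, log }) {
        log.info(`Fetched ${request.url}: ${window.document.title}`);
    },
});
```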
All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the `autoscaledPoolOptions` parameter of the `LinkeDOMCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) options are available directly in the `LinkeDOMCrawler` constructor. **Example usage:** ``` const crawler = new LinkeDOMCrawler({ async requestHandler({ request, window }) { await Dataset.pushData({ url: request.url, title: window.document.title, }); }, }); await crawler.run([ 'http://crawlee.dev', ]); ``` ### Hierarchy * [HttpCrawler](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)> * *LinkeDOMCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**\_runRequestHandler](#_runRequestHandler) * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**use](#use) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L373)constructor * ****new LinkeDOMCrawler**(options, config): [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) - Inherited from HttpCrawler.constructor All `HttpCrawlerOptions` parameters are passed via an options object. *** #### Parameters * ##### options: [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from HttpCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). 
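As a rough sketch of the concurrency options described above, the constructor shortcuts and `autoscaledPoolOptions` can be combined as shown below; the numbers and the nested pool option are illustrative assumptions, not recommendations.

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    // Convenience shortcuts for the underlying AutoscaledPool.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Other AutoscaledPool options can be passed through autoscaledPoolOptions.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9, // assumed option for illustration
    },
    async requestHandler({ request, window, pushData }) {
        await pushData({ url: request.url, title: window.document.title });
    },
});

// Once run() has started, crawler.autoscaledPool can be used to pause or abort the run.
await crawler.run(['https://crawlee.dev']);
```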
### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L375)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from HttpCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from HttpCrawler.hasFinishedBefore ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from HttpCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L337)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\, request>> = ... Inherited from HttpCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). 
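To make the `router` property above more tangible, here is a minimal, hedged sketch of registering handlers on the default router instead of passing a single `requestHandler`; the 'DETAIL' label is a placeholder chosen for this example.

```
import { LinkeDOMCrawler } from 'crawlee';

// No requestHandler is given, so the crawler falls back to its default router.
const crawler = new LinkeDOMCrawler();

// Placeholder label used only for this sketch.
crawler.router.addHandler('DETAIL', async ({ request, window, pushData }) => {
    await pushData({ url: request.url, title: window.document.title });
});

// Requests without a matching label end up in the default handler.
crawler.router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ label: 'DETAIL' });
});

await crawler.run(['https://crawlee.dev']);
```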
### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from HttpCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from HttpCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from HttpCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#_runRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L201)\_runRequestHandler * ****\_runRequestHandler**(context): Promise\ - Overrides HttpCrawler.\_runRequestHandler Wrapper around requestHandler that opens and closes pages etc. *** #### Parameters * ##### context: [LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\ #### Returns Promise\ ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from HttpCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from HttpCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. 
*** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from HttpCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from HttpCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from HttpCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from HttpCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from HttpCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. 
* ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from HttpCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from HttpCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#use)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L470)inheriteduse * ****use**(extension): void - Inherited from HttpCrawler.use **EXPERIMENTAL** Function for attaching CrawlerExtensions such as the Unblockers. *** #### Parameters * ##### extension: CrawlerExtension Crawler extension that overrides the crawler configuration. #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from HttpCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # createLinkeDOMRouter ### Callable * ****createLinkeDOMRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md). Defaults to the [LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. 
```
import { LinkeDOMCrawler, createLinkeDOMRouter } from 'crawlee';

const router = createLinkeDOMRouter();

router.addHandler('label-a', async (ctx) => {
    ctx.log.info('...');
});

router.addDefaultHandler(async (ctx) => {
    ctx.log.info('...');
});

const crawler = new LinkeDOMCrawler({
    requestHandler: router,
});

await crawler.run();
```
*** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # LinkeDOMCrawlerEnqueueLinksOptions ### Hierarchy * Omit<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), urls | requestQueue> * *LinkeDOMCrawlerEnqueueLinksOptions* ## Index[**](#Index) ### Properties * [**baseUrl](#baseUrl) * [**exclude](#exclude) * [**forefront](#forefront) * [**globs](#globs) * [**label](#label) * [**limit](#limit) * [**onSkippedRequest](#onSkippedRequest) * [**pseudoUrls](#pseudoUrls) * [**regexps](#regexps) * [**robotsTxtFile](#robotsTxtFile) * [**selector](#selector) * [**skipNavigation](#skipNavigation) * [**strategy](#strategy) * [**transformRequestFunction](#transformRequestFunction) * [**userData](#userData) * [**waitForAllRequestsToBeAdded](#waitForAllRequestsToBeAdded) ## Properties[**](#Properties) ### [**](#baseUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L68)optionalinheritedbaseUrl **baseUrl? : string Inherited from Omit.baseUrl A base URL that will be used to resolve relative URLs when using Cheerio. Ignored when using Puppeteer, since the relative URL resolution is done inside the browser automatically. ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L94)optionalinheritedexclude **exclude? : readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] Inherited from Omit.exclude An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L948)optionalinheritedforefront **forefront? : boolean = false Inherited from Omit.forefront If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. ### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L83)optionalinheritedglobs **globs? : readonly [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] Inherited from Omit.globs An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. 
The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the function enqueues the links with the same subdomain. ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L56)optionalinheritedlabel **label? : string Inherited from Omit.label Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this option. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L36)optionalinheritedlimit **limit? : number Inherited from Omit.limit Limit the amount of actually enqueued URLs to this number. Useful for testing across the entire crawling scope. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L192)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from Omit.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. or because the maxRequestsPerCrawl limit has been reached ### [**](#pseudoUrls)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L126)optionalinheritedpseudoUrls **pseudoUrls? : readonly [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] Inherited from Omit.pseudoUrls *NOTE:* In future versions of SDK the options will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues the links with the same subdomain. * **@deprecated** prefer using `globs` or `regexps` instead ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L106)optionalinheritedregexps **regexps? : readonly [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] Inherited from Omit.regexps An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. 
All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the function enqueues the links with the same subdomain. ### [**](#robotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L183)optionalinheritedrobotsTxtFile **robotsTxtFile? : Pick<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md), isAllowed> Inherited from Omit.robotsTxtFile RobotsTxtFile instance for the current request that triggered the `enqueueLinks`. If provided, disallowed URLs will be ignored. ### [**](#selector)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L45)optionalinheritedselector **selector? : string Inherited from Omit.selector A CSS selector matching links to be enqueued. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L62)optionalinheritedskipNavigation **skipNavigation? : boolean = false Inherited from Omit.skipNavigation If set to `true`, tells the crawler to skip navigation and process the request directly. ### [**](#strategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L171)optionalinheritedstrategy **strategy? : [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin = [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) | all | same-domain | same-hostname | same-origin Inherited from Omit.strategy The strategy to use when enqueueing the URLs. Depending on the strategy you select, we will only check certain parts of the URLs found. Here is a diagram of each URL part and their name:
```
Protocol          Domain
┌────┐          ┌─────────┐
https://example.crawlee.dev/...
│       └─────────────────┤
│           Hostname      │
│                         │
└─────────────────────────┘
           Origin
```
### [**](#transformRequestFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L151)optionalinheritedtransformRequestFunction **transformRequestFunction? : [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Inherited from Omit.transformRequestFunction Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `keepUrlFragment: true` to the `request` object, URL fragments will not be removed when `uniqueKey` is computed. **Example:**
```
{
    transformRequestFunction: (request) => {
        request.userData.foo = 'bar';
        request.keepUrlFragment = true;
        return request;
    }
}
```
Note that the request options specified in `globs`, `regexps`, or `pseudoUrls` objects have priority over this function, so some request options returned by `transformRequestFunction` may be overwritten by those pattern-based options. 
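These options are typically combined inside a request handler. A minimal sketch of doing so with [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) follows; the selector, patterns, and label are illustrative only:

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks({
            selector: 'a.product',                          // which links to collect
            globs: ['https://www.example.com/products/**'], // only enqueue matching URLs
            exclude: [/\/reviews\//],                       // never enqueue review pages
            label: 'PRODUCT',                               // routes to a matching handler
            transformRequestFunction: (req) => {
                req.userData.referrer = request.url;        // attach custom userData
                return req;
            },
        });
    },
});
```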
### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L48)optionalinheriteduserData **userData? : Dictionary Inherited from Omit.userData Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. ### [**](#waitForAllRequestsToBeAdded)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L177)optionalinheritedwaitForAllRequestsToBeAdded **waitForAllRequestsToBeAdded? : boolean Inherited from Omit.waitForAllRequestsToBeAdded By default, only the first batch (1000) of found requests will be added to the queue before resolving the call. You can use this option to wait for adding all of them. --- # LinkeDOMCrawlerOptions \ ### Hierarchy * [HttpCrawlerOptions](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> * *LinkeDOMCrawlerOptions* ## Index[**](#Index) ### Properties * [**additionalHttpErrorStatusCodes](#additionalHttpErrorStatusCodes) * [**additionalMimeTypes](#additionalMimeTypes) * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**forceResponseEncoding](#forceResponseEncoding) * [**handlePageFunction](#handlePageFunction) * [**httpClient](#httpClient) * [**ignoreHttpErrorStatusCodes](#ignoreHttpErrorStatusCodes) * [**ignoreSslErrors](#ignoreSslErrors) * [**keepAlive](#keepAlive) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**suggestResponseEncoding](#suggestResponseEncoding) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#additionalHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L186)optionalinheritedadditionalHttpErrorStatusCodes **additionalHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.additionalHttpErrorStatusCodes An array of additional HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be treated as errors. By default, status codes >= 500 trigger errors. 
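For illustration, a hedged sketch of adding an extra error status code (the codes are examples only; `ignoreHttpErrorStatusCodes`, described below, is the complementary option):

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    // Also treat 403 responses as errors, so they get retried.
    additionalHttpErrorStatusCodes: [403],
    // Do not treat 503 responses as errors, even though codes >= 500 normally are.
    ignoreHttpErrorStatusCodes: [503],
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.url}`);
    },
});
```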
### [**](#additionalMimeTypes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L142)optionalinheritedadditionalMimeTypes **additionalMimeTypes? : string\[] Inherited from HttpCrawlerOptions.additionalMimeTypes An array of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) you want the crawler to load and process. By default, only `text/html` and `application/xhtml+xml` MIME types are supported. ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from HttpCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L222)optionalinheritederrorHandler **errorHandler? : [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> Inherited from HttpCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from HttpCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L232)optionalinheritedfailedRequestHandler **failedRequestHandler? 
: [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler)<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\> Inherited from HttpCrawlerOptions.failedRequestHandler A function to handle requests that failed more than [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as the first argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#forceResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L166)optionalinheritedforceResponseEncoding **forceResponseEncoding? : string Inherited from HttpCrawlerOptions.forceResponseEncoding By default this crawler will extract correct encoding from the HTTP response headers. Use `forceResponseEncoding` to force a certain encoding, disregarding the response headers. To only provide a default for missing encodings, use [HttpCrawlerOptions.suggestResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#suggestResponseEncoding) ``` // Will force windows-1250 encoding even if headers say otherwise forceResponseEncoding: 'windows-1250' ``` ### [**](#handlePageFunction)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L87)optionalinheritedhandlePageFunction **handlePageFunction? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.handlePageFunction An alias for [HttpCrawlerOptions.requestHandler](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#requestHandler) Soon to be removed, use `requestHandler` instead. * **@deprecated** ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from HttpCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreHttpErrorStatusCodes)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L180)optionalinheritedignoreHttpErrorStatusCodes **ignoreHttpErrorStatusCodes? : number\[] Inherited from HttpCrawlerOptions.ignoreHttpErrorStatusCodes An array of HTTP response [Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to be excluded from error consideration. By default, status codes >= 500 trigger errors. ### [**](#ignoreSslErrors)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L97)optionalinheritedignoreSslErrors **ignoreSslErrors? 
: boolean Inherited from HttpCrawlerOptions.ignoreSslErrors If set to true, SSL certificate errors will be ignored. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from HttpCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from HttpCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from HttpCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from HttpCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from HttpCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from HttpCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. 
Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from HttpCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from HttpCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L92)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from HttpCrawlerOptions.navigationTimeoutSecs Timeout in which the HTTP request to the resource needs to finish, given in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from HttpCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L174)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from HttpCrawlerOptions.persistCookiesPerSession Automatically saves cookies to the Session. Works only if the Session Pool is used. The crawler parses cookies from the response "set-cookie" header and saves or updates them on the session. When the session is used for the next request, the stored cookies are sent in its "Cookie" header. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L136)optionalinheritedpostNavigationHooks **postNavigationHooks? 
: InternalHttpHook<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. Example:
```
postNavigationHooks: [
    async (crawlingContext) => {
        // ...
    },
]
```
### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L122)optionalinheritedpreNavigationHooks **preNavigationHooks? : InternalHttpHook<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\>\[] Inherited from HttpCrawlerOptions.preNavigationHooks Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate. Example:
```
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // ...
    },
]
```
Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook). ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L104)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from HttpCrawlerOptions.proxyConfiguration If set, this crawler will be configured for all connections to use [Apify Proxy](https://console.apify.com/proxy) or your own Proxy URLs provided and rotated according to the configuration. For more information, see the [documentation](https://docs.apify.com/proxy). ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L151)optionalinheritedrequestHandler **requestHandler? : [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[LinkeDOMCrawlingContext](https://crawlee.dev/js/api/linkedom-crawler/interface/LinkeDOMCrawlingContext.md)\, request>> Inherited from HttpCrawlerOptions.requestHandler User-provided function that performs the logic of the crawler. It is called for each URL to crawl. The function receives the [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) as an argument, where the [`request`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#request) represents the URL to crawl. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. 
The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from HttpCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from HttpCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from HttpCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This makes it possible to configure the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from HttpCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from HttpCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. 
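Putting several of the options above together, a hedged configuration sketch (all values are illustrative, not recommended defaults):

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    respectRobotsTxtFile: true,     // skip URLs disallowed by robots.txt
    maxRequestsPerCrawl: 500,       // safety cap against runaway crawls
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 30,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Crawling ${request.url}`);
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```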
### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from HttpCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from HttpCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from HttpCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from HttpCrawlerOptions.statisticsOptions Customize the way statistics collection works, such as the logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from HttpCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.
```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```
### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from HttpCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#suggestResponseEncoding)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L155)optionalinheritedsuggestResponseEncoding **suggestResponseEncoding? 
: string Inherited from HttpCrawlerOptions.suggestResponseEncoding By default this crawler will extract correct encoding from the HTTP response headers. Sadly, some websites use invalid headers; their responses are then decoded as UTF-8 by default, so if such a site actually uses a different encoding, the response will be corrupted. You can use `suggestResponseEncoding` to fall back to a certain encoding, if you know that your target website uses it. To force a certain encoding, disregarding the response headers, use [HttpCrawlerOptions.forceResponseEncoding](https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions.md#forceResponseEncoding).
```
// Will fall back to windows-1250 encoding if none found
suggestResponseEncoding: 'windows-1250'
```
### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from HttpCrawlerOptions.useSessionPool Basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # LinkeDOMCrawlingContext \ ### Hierarchy * InternalHttpCrawlingContext\ * *LinkeDOMCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**body](#body) * [**contentType](#contentType) * [**crawler](#crawler) * [**document](#document) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**json](#json) * [**log](#log) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) * [**window](#window) ### Methods * [**enqueueLinks](#enqueueLinks) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from InternalHttpCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#body)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L213)inheritedbody **body: string | Buffer\ Inherited from InternalHttpCrawlingContext.body The request body of the web page. 
The type depends on the `Content-Type` header of the web page: * String for `text/html`, `application/xhtml+xml`, `application/xml` MIME content types * Buffer for others MIME content types ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L223)inheritedcontentType **contentType: { encoding: BufferEncoding; type: string } Inherited from InternalHttpCrawlingContext.contentType Parsed `Content-Type header: { type, encoding }`. *** #### Type declaration * ##### encoding: BufferEncoding * ##### type: string ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [LinkeDOMCrawler](https://crawlee.dev/js/api/linkedom-crawler/class/LinkeDOMCrawler.md) Inherited from InternalHttpCrawlingContext.crawler ### [**](#document)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L58)document **document: Document ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from InternalHttpCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. *** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from InternalHttpCrawlingContext.id ### [**](#json)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L218)inheritedjson **json: JSONData Inherited from InternalHttpCrawlingContext.json The parsed object from JSON string if the response contains the content type application/json. ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from InternalHttpCrawlingContext.log A preconfigured logger for the request handler. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from InternalHttpCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from InternalHttpCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. 
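A hedged sketch of a request handler that uses several of the context properties above (`document`, `contentType`, `request`); the selector and output fields are illustrative:

```
import { LinkeDOMCrawler } from 'crawlee';

const crawler = new LinkeDOMCrawler({
    async requestHandler({ request, document, contentType, pushData, log }) {
        // `document` is the DOM parsed by LinkeDOM from the response body.
        const heading = document.querySelector('h1')?.textContent?.trim() ?? null;
        log.info(`Got ${contentType.type} from ${request.url}`);
        await pushData({ url: request.url, heading });
    },
});
```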
### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/http-crawler/src/internals/http-crawler.ts#L224)inheritedresponse **response: PlainResponse Inherited from InternalHttpCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from InternalHttpCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from InternalHttpCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ### [**](#window)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L52)window **window: Window ## Methods[**](#Methods) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from InternalHttpCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage**
```
async requestHandler({ enqueueLinks }) {
    await enqueueLinks({
        globs: [
            'https://www.example.com/handbags/*',
        ],
    });
},
```
*** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L87)parseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.parseWithCheerio Returns a Cheerio handle, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it will first look for the selector with a 5s timeout. 
**Example usage:**
```
async requestHandler({ parseWithCheerio }) {
    const $ = await parseWithCheerio();
    const title = $('title').text();
},
```
*** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from InternalHttpCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from InternalHttpCrawlingContext.sendRequest Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that.
```
async requestHandler({ sendRequest }) {
    const { body } = await sendRequest({
        // override headers only
        headers: { ... },
    });
},
```
*** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/linkedom-crawler/src/internals/linkedom-crawler.ts#L73)waitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Overrides InternalHttpCrawlingContext.waitForSelector Waits for an element matching the selector to appear. The timeout defaults to 5s. **Example usage:**
```
async requestHandler({ waitForSelector, parseWithCheerio }) {
    await waitForSelector('article h1');
    const $ = await parseWithCheerio();
    const title = $('title').text();
},
```
*** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # @crawlee/memory-storage ## Index[**](#Index) ### Classes * [**MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) ### Interfaces * [**MemoryStorageOptions](https://crawlee.dev/js/api/memory-storage/interface/MemoryStorageOptions.md) --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/memory-storage ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/memory-storage ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/memory-storage # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/memory-storage ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/memory-storage # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Features[​](#features "Direct link to Features") * support `KVS.listKeys()` `prefix` and `collection` parameters ([#3001](https://github.com/apify/crawlee/issues/3001)) ([5c4726d](https://github.com/apify/crawlee/commit/5c4726df96e358a9bbf44a0cd2760e4e269f0fae)), closes [#2974](https://github.com/apify/crawlee/issues/2974) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/memory-storage ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/memory-storage # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package 
@crawlee/memory-storage ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/memory-storage ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/memory-storage # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/memory-storage ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * `prolong-` and `deleteRequestLock` `forefront` option ([#2690](https://github.com/apify/crawlee/issues/2690)) ([cba8da3](https://github.com/apify/crawlee/commit/cba8da31312bcc4228662c79c4472e35278627c1)), closes [#2681](https://github.com/apify/crawlee/issues/2681) [#2689](https://github.com/apify/crawlee/issues/2689) [#2669](https://github.com/apify/crawlee/issues/2669) * respect `forefront` option in `MemoryStorage`'s `RequestQueue` ([#2681](https://github.com/apify/crawlee/issues/2681)) ([b0527f9](https://github.com/apify/crawlee/commit/b0527f948b73e3b74ac77e58f9184b34c1adab3a)), closes [#2669](https://github.com/apify/crawlee/issues/2669) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/memory-storage ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/memory-storage ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * **RequestQueueV2:** remove `inProgress` cache, rely solely on locked states ([#2601](https://github.com/apify/crawlee/issues/2601)) ([57fcb08](https://github.com/apify/crawlee/commit/57fcb0804a9f1268039d1e2b246c515ceca7e405)) * Use the correct mutex in memory storage RequestQueueClient ([#2623](https://github.com/apify/crawlee/issues/2623)) ([2fa8a29](https://github.com/apify/crawlee/commit/2fa8a29b815689f041f3d06cc0563e77e02e05f4)) ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/memory-storage # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/memory-storage ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/memory-storage ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/memory-storage ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") **Note:** Version bump only for package @crawlee/memory-storage ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 
3102-2024-06-03") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * improve fix for double extension in KVS with HTML files ([#2505](https://github.com/apify/crawlee/issues/2505)) ([157927d](https://github.com/apify/crawlee/commit/157927d67f42342c20fdf01ef81bdafd7095f0b8)), closes [#2419](https://github.com/apify/crawlee/issues/2419) ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/memory-storage # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * Fixed double extension for screenshots ([#2419](https://github.com/apify/crawlee/issues/2419)) ([e8b39c4](https://github.com/apify/crawlee/commit/e8b39c41764726280c995e52fa7d79a9240d993e)), closes [#1980](https://github.com/apify/crawlee/issues/1980) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/memory-storage ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/memory-storage # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/memory-storage ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/memory-storage ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/memory-storage # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-1 "Direct link to Features") * `KeyValueStore.recordExists()` ([#2339](https://github.com/apify/crawlee/issues/2339)) ([8507a65](https://github.com/apify/crawlee/commit/8507a65d1ad079f64c752a6ddb1d8fac9b494228)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/memory-storage ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/memory-storage ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/memory-storage # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * **MemoryStorage:** lock request JSON file when reading to support multiple process crawling ([#2215](https://github.com/apify/crawlee/issues/2215)) ([eb84ce9](https://github.com/apify/crawlee/commit/eb84ce9ce5540b72d5799b1f66c80938d57bc1cc)) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/memory-storage ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package 
@crawlee/memory-storage # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * **MemoryStorage:** ignore invalid files for request queues ([#2132](https://github.com/apify/crawlee/issues/2132)) ([fa58581](https://github.com/apify/crawlee/commit/fa58581b530ef3ad89bdd71403df2d2e4f06c59f)), closes [#1985](https://github.com/apify/crawlee/issues/1985) ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Features[​](#features-2 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/memory-storage ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/memory-storage # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * cleanup worker stuff from memory storage to fix `vitest` ([#2004](https://github.com/apify/crawlee/issues/2004)) ([d2e098c](https://github.com/apify/crawlee/commit/d2e098cab62c700a5c58fcf43a5bcf9f492d71ec)), closes [#1999](https://github.com/apify/crawlee/issues/1999) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/memory-storage ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/memory-storage # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/memory-storage ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 
333-2023-05-31") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * **MemoryStorage:** handle EXDEV errors when purging storages ([#1932](https://github.com/apify/crawlee/issues/1932)) ([e656050](https://github.com/apify/crawlee/commit/e6560507243f5e2d0b126160616573f13e5998e1)) ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * **MemoryStorage:** cache requests in `RequestQueue` ([#1899](https://github.com/apify/crawlee/issues/1899)) ([063dcd1](https://github.com/apify/crawlee/commit/063dcd1c9e6652cd316cc0e8c4f4e4bbb70c246e)) ### Features[​](#features-3 "Direct link to Features") * RQv2 memory storage support ([#1874](https://github.com/apify/crawlee/issues/1874)) ([049486b](https://github.com/apify/crawlee/commit/049486b772cc2accd2d2d226d8c8726e5ab933a9)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **MemoryStorage:** handling of readable streams for key-value stores when setting records ([#1852](https://github.com/apify/crawlee/issues/1852)) ([a5ee37d](https://github.com/apify/crawlee/commit/a5ee37d7e245f004785fc03220e37aeafdfa0e81)), closes [#1843](https://github.com/apify/crawlee/issues/1843) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * **MemoryStorage:** request queues race conditions causing crashes ([#1806](https://github.com/apify/crawlee/issues/1806)) ([083a9db](https://github.com/apify/crawlee/commit/083a9db9ebcddd3fa886631234c790d4c5bcdf86)), closes [#1792](https://github.com/apify/crawlee/issues/1792) * **MemoryStorage:** RequestQueue should respect `forefront` ([#1816](https://github.com/apify/crawlee/issues/1816)) ([b68e86a](https://github.com/apify/crawlee/commit/b68e86a97954bcbe30fde802fed5f263016fffe2)), closes [#1787](https://github.com/apify/crawlee/issues/1787) * **MemoryStorage:** RequestQueue#handledRequestCount should update ([#1817](https://github.com/apify/crawlee/issues/1817)) ([a775e4a](https://github.com/apify/crawlee/commit/a775e4afea20d0b31492f44b90f61b6a903491b6)), closes [#1764](https://github.com/apify/crawlee/issues/1764) ### Features[​](#features-4 "Direct link to Features") * add basic support for `setStatusMessage` ([#1790](https://github.com/apify/crawlee/issues/1790)) ([c318980](https://github.com/apify/crawlee/commit/c318980ec11d211b1a5c9e6bdbe76198c5d895be)) * move the status message implementation to Crawlee, noop in storage ([#1808](https://github.com/apify/crawlee/issues/1808)) ([99c3fdc](https://github.com/apify/crawlee/commit/99c3fdc18030b7898e6b6d149d6d94fab7881f09)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * **MemoryStorage:** request queues saved in the wrong place ([#1779](https://github.com/apify/crawlee/issues/1779)) ([19409db](https://github.com/apify/crawlee/commit/19409dbd614560a73c97ef6e00997e482573d2ff)) ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/memory-storage # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug 
Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * Correctly compute `pendingRequestCount` in request queue ([#1765](https://github.com/apify/crawlee/issues/1765)) ([946535f](https://github.com/apify/crawlee/commit/946535f2338086e13c71ff70129e7a1f6bfd275d)), closes [/github.com/apify/crawlee/blob/master/packages/memory-storage/src/resource-clients/request-queue.ts#L291-L298](https://github.com//github.com/apify/crawlee/blob/master/packages/memory-storage/src/resource-clients/request-queue.ts/issues/L291-L298) * **KeyValueStore:** big buffers should not crash ([#1734](https://github.com/apify/crawlee/issues/1734)) ([2f682f7](https://github.com/apify/crawlee/commit/2f682f7ddd189cad11a3f5e7655ac6243444ff74)), closes [#1732](https://github.com/apify/crawlee/issues/1732) [#1710](https://github.com/apify/crawlee/issues/1710) * **memory-storage:** dont fail when storage already purged ([#1737](https://github.com/apify/crawlee/issues/1737)) ([8694027](https://github.com/apify/crawlee/commit/86940273dbac2d13294140962f816f66582684ff)), closes [#1736](https://github.com/apify/crawlee/issues/1736) * **utils:** add missing dependency on `ow` ([bf0e03c](https://github.com/apify/crawlee/commit/bf0e03cc6ddc103c9337de5cd8dce9bc86c369a3)), closes [#1716](https://github.com/apify/crawlee/issues/1716) ### Features[​](#features-5 "Direct link to Features") * **MemoryStorage:** read from fs if persistStorage is enabled, ram only otherwise ([#1761](https://github.com/apify/crawlee/issues/1761)) ([e903980](https://github.com/apify/crawlee/commit/e9039809a0c0af0bc086be1f1400d18aa45ae490)) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/memory-storage ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/memory-storage # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/memory-storage ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * key value stores emitting an error when multiple write promises ran in parallel ([#1460](https://github.com/apify/crawlee/issues/1460)) ([f201cca](https://github.com/apify/crawlee/commit/f201cca4a99d1c8b3e87be0289d5b3b363048f09)) --- # MemoryStorage Represents a storage capable of working with datasets, KV stores and request queues. 
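Before the detailed reference below, here is a minimal sketch of driving the client directly. The `getOrCreate()` and `setRecord()` calls follow the generic collection/resource client interfaces from `@crawlee/types`, so treat the exact call shapes and the store name as illustrative assumptions rather than excerpts from this page; when running locally, Crawlee constructs a `MemoryStorage` client like this for you automatically.

```
import { MemoryStorage } from '@crawlee/memory-storage';

// Keep everything in RAM only; with persistStorage: false nothing is
// mirrored to the local storage directory on disk.
const storageClient = new MemoryStorage({ persistStorage: false });

// Open (or create) a named key-value store and write a single record to it.
// getOrCreate() and setRecord() are assumed from the storage client
// interfaces in @crawlee/types.
const { id } = await storageClient.keyValueStores().getOrCreate('example-store');
await storageClient.keyValueStore(id).setRecord({
    key: 'OUTPUT',
    value: { finishedAt: new Date().toISOString() },
});

// Call teardown() at the end of the process so any pending writes settle.
await storageClient.teardown();
```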
### Implements * [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**datasetClientsHandled](#datasetClientsHandled) * [**datasetsDirectory](#datasetsDirectory) * [**keyValueStoresDirectory](#keyValueStoresDirectory) * [**keyValueStoresHandled](#keyValueStoresHandled) * [**localDataDirectory](#localDataDirectory) * [**persistStorage](#persistStorage) * [**requestQueuesDirectory](#requestQueuesDirectory) * [**requestQueuesHandled](#requestQueuesHandled) * [**writeMetadata](#writeMetadata) ### Methods * [**dataset](#dataset) * [**datasets](#datasets) * [**keyValueStore](#keyValueStore) * [**keyValueStores](#keyValueStores) * [**purge](#purge) * [**requestQueue](#requestQueue) * [**requestQueues](#requestQueues) * [**setStatusMessage](#setStatusMessage) * [**teardown](#teardown) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L52)constructor * ****new MemoryStorage**(options): [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) - #### Parameters * ##### options: [MemoryStorageOptions](https://crawlee.dev/js/api/memory-storage/interface/MemoryStorageOptions.md) = {} #### Returns [MemoryStorage](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) ## Properties[**](#Properties) ### [**](#datasetClientsHandled)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L49)readonlydatasetClientsHandled **datasetClientsHandled: DatasetClient\\[] = \[] ### [**](#datasetsDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L42)readonlydatasetsDirectory **datasetsDirectory: string ### [**](#keyValueStoresDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L43)readonlykeyValueStoresDirectory **keyValueStoresDirectory: string ### [**](#keyValueStoresHandled)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L48)readonlykeyValueStoresHandled **keyValueStoresHandled: KeyValueStoreClient\[] = \[] ### [**](#localDataDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L41)readonlylocalDataDirectory **localDataDirectory: string ### [**](#persistStorage)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L46)readonlypersistStorage **persistStorage: boolean ### [**](#requestQueuesDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L44)readonlyrequestQueuesDirectory **requestQueuesDirectory: string ### [**](#requestQueuesHandled)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L50)readonlyrequestQueuesHandled **requestQueuesHandled: RequestQueueClient\[] = \[] ### [**](#writeMetadata)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L45)readonlywriteMetadata **writeMetadata: boolean ## Methods[**](#Methods) ### [**](#dataset)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L93)dataset * ****dataset**\(id): [DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md)\ - Implementation of storage.StorageClient.dataset #### Parameters * ##### id: string #### Returns 
[DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md) ### [**](#datasets)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L86)datasets * ****datasets**(): [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) - Implementation of storage.StorageClient.datasets #### Returns [DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) ### [**](#keyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L106)keyValueStore * ****keyValueStore**(id): [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) - Implementation of storage.StorageClient.keyValueStore #### Parameters * ##### id: string #### Returns [KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) ### [**](#keyValueStores)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L99)keyValueStores * ****keyValueStores**(): [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) - Implementation of storage.StorageClient.keyValueStores #### Returns [KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) ### [**](#purge)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L149)purge * ****purge**(): Promise\<void> - Implementation of storage.StorageClient.purge Cleans up the default storage directories before the run starts: * local directory containing the default dataset; * all records from the default key-value store in the local directory, except for the "INPUT" key; * local directory containing the default request queue.
*** #### Returns Promise\<void> ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L119)requestQueue * ****requestQueue**(id, options): [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) - Implementation of storage.StorageClient.requestQueue #### Parameters * ##### id: string * ##### options: [RequestQueueOptions](https://crawlee.dev/js/api/types/interface/RequestQueueOptions.md) = {} #### Returns [RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) ### [**](#requestQueues)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L112)requestQueues * ****requestQueues**(): [RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) - Implementation of storage.StorageClient.requestQueues #### Returns [RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L134)setStatusMessage * ****setStatusMessage**(message, options): Promise\<void> - Implementation of storage.StorageClient.setStatusMessage #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\<void> ### [**](#teardown)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L198)teardown * ****teardown**(): Promise\<void> - Implementation of storage.StorageClient.teardown This method should be called at the end of the process to ensure all data is saved. *** #### Returns Promise\<void> --- # MemoryStorageOptions ## Index[**](#Index) ### Properties * [**localDataDirectory](#localDataDirectory) * [**persistStorage](#persistStorage) * [**writeMetadata](#writeMetadata) ## Properties[**](#Properties) ### [**](#localDataDirectory)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L23)optionallocalDataDirectory **localDataDirectory? : string = process.env.CRAWLEE\_STORAGE\_DIR ?? './storage' Path to the directory where the data will also be saved. ### [**](#persistStorage)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L37)optionalpersistStorage **persistStorage? : boolean = true Whether the memory storage should also write its stored content to the disk. You can also disable this by setting the `CRAWLEE_PERSIST_STORAGE` environment variable to `false`. ### [**](#writeMetadata)[**](https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/memory-storage.ts#L29)optionalwriteMetadata **writeMetadata? : boolean = process.env.DEBUG?.includes('\*') ?? process.env.DEBUG?.includes('crawlee:memory-storage') ?? false Whether to also write optional metadata files when storing to disk. --- # @crawlee/playwright Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox and WebKit browsers with [Playwright](https://github.com/microsoft/playwright). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `Playwright` uses a headless browser to download web pages and extract data, it is useful for crawling websites that require JavaScript to be executed.
If the target website doesn't need JavaScript, consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [PlaywrightCrawlerOptions.requestList](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestList) or [PlaywrightCrawlerOptions.requestQueue](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestQueue) constructor options, respectively. If both [PlaywrightCrawlerOptions.requestList](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestList) and [PlaywrightCrawlerOptions.requestQueue](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `PlaywrightCrawler` opens a new Chrome page (i.e. tab) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by user as the [PlaywrightCrawlerOptions.requestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [PlaywrightCrawlerOptions.autoscaledPoolOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#autoscaledPoolOptions) parameter of the `PlaywrightCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) are available directly in the `PlaywrightCrawler` constructor. Note that the pool of Playwright instances is internally managed by the [BrowserPool](https://github.com/apify/browser-pool) class. 
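For instance, the concurrency shortcuts can sit next to the pass-through `autoscaledPoolOptions` like this; a minimal sketch where the concrete values and the `desiredConcurrencyRatio` tweak are illustrative assumptions, not recommended settings:

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Convenience shortcuts for the two most common AutoscaledPool options.
    minConcurrency: 2,
    maxConcurrency: 16,
    // Any other AutoscaledPool option can be passed through unchanged.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9,
    },
    async requestHandler({ page, enqueueLinks }) {
        console.log(await page.title());
        // Enqueue same-site links into the request queue for recursive crawling.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```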
## Example usage[​](#example-usage "Direct link to Example usage")

```
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // This function is called to extract data from a single web page
        // 'page' is an instance of Playwright.Page with page.goto(request.url) already called
        // 'request' is an instance of Request class with information about the page to load
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
    async failedRequestHandler({ request }) {
        // This function is called when the crawling of a request failed too many times
        await Dataset.pushData({
            url: request.url,
            succeeded: false,
            errors: request.errorMessages,
        });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

## Index[**](#Index) ### Crawlers * [**PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/playwright-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/playwright-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/playwright-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/playwright-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/playwright-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/playwright-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/playwright-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/playwright-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/playwright-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/playwright-crawler.md#BLOCKED_STATUS_CODES) * [**BrowserCrawler](https://crawlee.dev/js/api/playwright-crawler.md#BrowserCrawler) * [**BrowserCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler.md#BrowserCrawlerOptions) * [**BrowserCrawlingContext](https://crawlee.dev/js/api/playwright-crawler.md#BrowserCrawlingContext) * [**BrowserErrorHandler](https://crawlee.dev/js/api/playwright-crawler.md#BrowserErrorHandler) * [**BrowserHook](https://crawlee.dev/js/api/playwright-crawler.md#BrowserHook) * [**BrowserLaunchContext](https://crawlee.dev/js/api/playwright-crawler.md#BrowserLaunchContext) * [**BrowserRequestHandler](https://crawlee.dev/js/api/playwright-crawler.md#BrowserRequestHandler) * [**checkStorageAccess](https://crawlee.dev/js/api/playwright-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/playwright-crawler.md#ClientInfo) * [**Configuration](https://crawlee.dev/js/api/playwright-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/playwright-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/playwright-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/playwright-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/playwright-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/playwright-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/playwright-crawler.md#CrawlerRunOptions) *
[**CrawlingContext](https://crawlee.dev/js/api/playwright-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/playwright-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/playwright-crawler.md#CreateContextOptions) * [**CreateSession](https://crawlee.dev/js/api/playwright-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/playwright-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/playwright-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/playwright-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/playwright-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/playwright-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/playwright-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/playwright-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/playwright-crawler.md#enqueueLinks) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/playwright-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/playwright-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/playwright-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/playwright-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/playwright-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/playwright-crawler.md#ErrorTracker) * [**ErrorTrackerOptions](https://crawlee.dev/js/api/playwright-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/playwright-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/playwright-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/playwright-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/playwright-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/playwright-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/playwright-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/playwright-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/playwright-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/playwright-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/playwright-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/playwright-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/playwright-crawler.md#HttpResponse) * [**IRequestList](https://crawlee.dev/js/api/playwright-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/playwright-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/playwright-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/playwright-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/playwright-crawler.md#KeyValueStore) * 
[**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/playwright-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/playwright-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/playwright-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/playwright-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/playwright-crawler.md#log) * [**Log](https://crawlee.dev/js/api/playwright-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/playwright-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/playwright-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/playwright-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/playwright-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/playwright-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/playwright-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/playwright-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/playwright-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/playwright-crawler.md#PersistenceOptions) * [**PlaywrightDirectNavigationOptions](https://crawlee.dev/js/api/playwright-crawler.md#PlaywrightDirectNavigationOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/playwright-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/playwright-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/playwright-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/playwright-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/playwright-crawler.md#ProxyInfo) * [**PseudoUrl](https://crawlee.dev/js/api/playwright-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/playwright-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/playwright-crawler.md#PseudoUrlObject) * [**purgeDefaultStorages](https://crawlee.dev/js/api/playwright-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/playwright-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/playwright-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/playwright-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/playwright-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/playwright-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/playwright-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/playwright-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/playwright-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/playwright-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/playwright-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/playwright-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/playwright-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/playwright-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestListOptions) * 
[**RequestListSourcesFunction](https://crawlee.dev/js/api/playwright-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/playwright-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/playwright-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/playwright-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/playwright-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/playwright-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/playwright-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/playwright-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/playwright-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/playwright-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/playwright-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/playwright-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/playwright-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/playwright-crawler.md#RouterHandler) * [**RouterRoutes](https://crawlee.dev/js/api/playwright-crawler.md#RouterRoutes) * [**Session](https://crawlee.dev/js/api/playwright-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/playwright-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/playwright-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/playwright-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/playwright-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/playwright-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/playwright-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/playwright-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/playwright-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/playwright-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/playwright-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/playwright-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/playwright-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/playwright-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/playwright-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/playwright-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/playwright-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/playwright-crawler.md#StatisticState) * 
[**StatusMessageCallback](https://crawlee.dev/js/api/playwright-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/playwright-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/playwright-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/playwright-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/playwright-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/playwright-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/playwright-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/playwright-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/playwright-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/playwright-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/playwright-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/playwright-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/playwright-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/playwright-crawler.md#withCheckedStorageAccess) * [**playwrightClickElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md) * [**playwrightUtils](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md) * [**AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) * [**RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md) * [**AdaptivePlaywrightCrawlerContext](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md) * [**AdaptivePlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerOptions.md) * [**PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) * [**PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md) * [**PlaywrightHook](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightHook.md) * [**PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) * [**PlaywrightRequestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightRequestHandler.md) * [**PlaywrightGotoOptions](https://crawlee.dev/js/api/playwright-crawler.md#PlaywrightGotoOptions) * [**RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) * [**createAdaptivePlaywrightRouter](https://crawlee.dev/js/api/playwright-crawler/function/createAdaptivePlaywrightRouter.md) * [**createPlaywrightRouter](https://crawlee.dev/js/api/playwright-crawler/function/createPlaywrightRouter.md) * [**launchPlaywright](https://crawlee.dev/js/api/playwright-crawler/function/launchPlaywright.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports 
[AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports [BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#BrowserCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L314)BrowserCrawler Re-exports [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) ### [**](#BrowserCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L75)BrowserCrawlerOptions Re-exports [BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md) ### [**](#BrowserCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L52)BrowserCrawlingContext Re-exports [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) ### [**](#BrowserErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L67)BrowserErrorHandler Re-exports [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler) ### 
[**](#BrowserHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L70)BrowserHook Re-exports [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook) ### [**](#BrowserLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L14)BrowserLaunchContext Re-exports [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) ### [**](#BrowserRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L64)BrowserRequestHandler Re-exports [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### [**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### 
[**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### 
[**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### [**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports 
[HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### 
[**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports [LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#PlaywrightDirectNavigationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/index.ts#L9)PlaywrightDirectNavigationOptions Renames and re-exports [DirectNavigationOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#DirectNavigationOptions) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### 
[**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports [purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### 
[**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports [RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### 
[**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### [**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports 
[SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports 
[SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports [SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#PlaywrightGotoOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L26)PlaywrightGotoOptions **PlaywrightGotoOptions: Dictionary & Parameters\\[1] ### [**](#RenderingType)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L7)RenderingType **RenderingType: clientOnly | static --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * use shared enqueue links wrapper in `AdaptivePlaywrightCrawler` ([#3188](https://github.com/apify/crawlee/issues/3188)) ([9569d19](https://github.com/apify/crawlee/commit/9569d191933325d93f6c66754274b63fd272fc59)) ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/playwright ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/playwright # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/playwright ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/playwright # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * respect `exclude` option in `enqueueLinksByClickingElements` ([#3058](https://github.com/apify/crawlee/issues/3058)) ([013eb02](https://github.com/apify/crawlee/commit/013eb028b6ecf05f83f8790a4a6164b9c4873733)) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * call `onSkippedRequest` for `AdaptivePlaywrightCrawler.enqueueLinks` ([#3043](https://github.com/apify/crawlee/issues/3043)) ([fc23d34](https://github.com/apify/crawlee/commit/fc23d34ba7fa0daded253a0a958fe9b7bb32e5ca)), closes [#3026](https://github.com/apify/crawlee/issues/3026) [#3039](https://github.com/apify/crawlee/issues/3039) ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * Fix link filtering in enqueueLinks in AdaptivePlaywrightCrawler ([#3021](https://github.com/apify/crawlee/issues/3021)) ([8a3b6f8](https://github.com/apify/crawlee/commit/8a3b6f8847586eb3b0865fe93053468e1605399c)), closes [#2525](https://github.com/apify/crawlee/issues/2525) ### Features[​](#features "Direct link to Features") * Report links skipped because of various filter conditions ([#3026](https://github.com/apify/crawlee/issues/3026)) ([5a867bc](https://github.com/apify/crawlee/commit/5a867bc28135803b55c765ec12e6fd04017ce53d)), closes [#3016](https://github.com/apify/crawlee/issues/3016) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * Persist rendering type detection results in `AdaptivePlaywrightCrawler` ([#2987](https://github.com/apify/crawlee/issues/2987)) ([76431ba](https://github.com/apify/crawlee/commit/76431badf8a55892303d9b53fe23e029fad9cb18)), closes [#2899](https://github.com/apify/crawlee/issues/2899) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/playwright ## 
[3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * ensure `PlaywrightGotoOptions` won't result in `unknown` when playwright is not installed ([#2995](https://github.com/apify/crawlee/issues/2995)) ([93eba38](https://github.com/apify/crawlee/commit/93eba38b9cd88e543717f885b2c5644f63979bc9)), closes [#2994](https://github.com/apify/crawlee/issues/2994) * extract only `body` from `iframe` elements ([#2986](https://github.com/apify/crawlee/issues/2986)) ([c36166e](https://github.com/apify/crawlee/commit/c36166e24887ca6de12f0c60ef010256fa830c31)), closes [#2979](https://github.com/apify/crawlee/issues/2979) ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") ### Features[​](#features-1 "Direct link to Features") * Allow the AdaptivePlaywrightCrawler result comparator to signal an inconclusive result ([#2975](https://github.com/apify/crawlee/issues/2975)) ([7ba8906](https://github.com/apify/crawlee/commit/7ba8906158e2dbc474de1b1e89937562abe76877)) ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/playwright ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * Fix useState behavior in adaptive crawler ([#2941](https://github.com/apify/crawlee/issues/2941)) ([5282381](https://github.com/apify/crawlee/commit/52823818bd66995c1512b433e6d82755c487cb58)) ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/playwright ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/playwright # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Features[​](#features-2 "Direct link to Features") * **playwright:** add `handleCloudflareChallenge` helper ([#2865](https://github.com/apify/crawlee/issues/2865)) ([9a1725f](https://github.com/apify/crawlee/commit/9a1725f7b87fb70194fc31858500cb35639fb964)) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/playwright ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/playwright # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * ignore errors from iframe content extraction ([#2714](https://github.com/apify/crawlee/issues/2714)) ([627e5c2](https://github.com/apify/crawlee/commit/627e5c2fbadce63c7e631217cd0e735597c0ce08)), closes [#2708](https://github.com/apify/crawlee/issues/2708) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * **core:** accept `UInt8Array` in 
`KVS.setValue()` ([#2682](https://github.com/apify/crawlee/issues/2682)) ([8ef0e60](https://github.com/apify/crawlee/commit/8ef0e60ca6fb2f4ec1b0d1aec6dcd53fcfb398b3)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/playwright ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/playwright ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/playwright ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/playwright # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-3 "Direct link to Features") * add `iframe` expansion to `parseWithCheerio` in browsers ([#2542](https://github.com/apify/crawlee/issues/2542)) ([328d085](https://github.com/apify/crawlee/commit/328d08598807782b3712bd543e394fe9a000a85d)), closes [#2507](https://github.com/apify/crawlee/issues/2507) * add `ignoreIframes` opt-out from the Cheerio iframe expansion ([#2562](https://github.com/apify/crawlee/issues/2562)) ([474a8dc](https://github.com/apify/crawlee/commit/474a8dc06a567cde0651d385fdac9c350ddf4508)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * allow creating new adaptive crawler instance without any parameters ([9b7f595](https://github.com/apify/crawlee/commit/9b7f595a2d70cab5c50e188581b21b0ef7e51780)) * fix detection of HTTP site when using the `useState` in adaptive crawler ([#2530](https://github.com/apify/crawlee/issues/2530)) ([7e195c1](https://github.com/apify/crawlee/commit/7e195c17cf1d9beae7f6f068fe505f1334a3a5b3)) * mark `context.request.loadedUrl` and `id` as required inside the request handler ([#2531](https://github.com/apify/crawlee/issues/2531)) ([2b54660](https://github.com/apify/crawlee/commit/2b546600691d84852a2f9ef42f273cecf818d66d)) ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **playwright:** allow passing new context options in `launchOptions` on type level ([0519d40](https://github.com/apify/crawlee/commit/0519d4099d257bbc40ed091c131a674ea5f8d731)), closes [#1849](https://github.com/apify/crawlee/issues/1849) ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * **adaptive-crawler:** log only once for the committed request handler execution ([#2524](https://github.com/apify/crawlee/issues/2524)) ([533bd3f](https://github.com/apify/crawlee/commit/533bd3f04671d54273f0861664d316269d08fbfb)) * respect implicit router when no `requestHandler` is provided in `AdaptiveCrawler` ([#2518](https://github.com/apify/crawlee/issues/2518)) ([31083aa](https://github.com/apify/crawlee/commit/31083aa27ddd51827f73c7ac4290379ec7a81283)) ### Features[​](#features-4 "Direct link to 
Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) ([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/playwright ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/playwright # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * do not drop statistics on migration/resurrection/resume ([#2462](https://github.com/apify/crawlee/issues/2462)) ([8ce7dd4](https://github.com/apify/crawlee/commit/8ce7dd4ae6a3718dac95e784a53bd5661c827edc)) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/playwright ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/playwright # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Features[​](#features-5 "Direct link to Features") * `createAdaptivePlaywrightRouter` utility ([#2415](https://github.com/apify/crawlee/issues/2415)) ([cee4778](https://github.com/apify/crawlee/commit/cee477814e4901d025c5376205ad884c2fe08e0e)), closes [#2407](https://github.com/apify/crawlee/issues/2407) * expand #shadow-root elements automatically in `parseWithCheerio` helper ([#2396](https://github.com/apify/crawlee/issues/2396)) ([a05b3a9](https://github.com/apify/crawlee/commit/a05b3a93a9b57926b353df0e79d846b5024c42ac)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") ### Features[​](#features-6 "Direct link to Features") * implement global storage access checking and use it to prevent unwanted side effects in adaptive crawler ([#2371](https://github.com/apify/crawlee/issues/2371)) ([fb3b7da](https://github.com/apify/crawlee/commit/fb3b7da402522ddff8c7394ac1253ba8aeac984c)), closes [#2364](https://github.com/apify/crawlee/issues/2364) ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/playwright # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-7 "Direct link to Features") * adaptive playwright crawler ([#2316](https://github.com/apify/crawlee/issues/2316)) ([8e4218a](https://github.com/apify/crawlee/commit/8e4218ada03cf485751def46f8c465b2d2a825c7)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/playwright ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/playwright ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump 
only for package @crawlee/playwright # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/playwright ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/playwright ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Features[​](#features-8 "Direct link to Features") * **puppeteer:** enable `new` headless mode ([#1910](https://github.com/apify/crawlee/issues/1910)) ([7fc999c](https://github.com/apify/crawlee/commit/7fc999cf4658ca69b97f16d434444081998470f4)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * add `skipNavigation` option to `enqueueLinks` ([#2153](https://github.com/apify/crawlee/issues/2153)) ([118515d](https://github.com/apify/crawlee/commit/118515d2ba534b99be2f23436f6abe41d66a8e07)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/playwright ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/playwright ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/playwright ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * allow to use any version of puppeteer or playwright ([#2102](https://github.com/apify/crawlee/issues/2102)) ([0cafceb](https://github.com/apify/crawlee/commit/0cafceb2966d430dd1b2a1b619fe66da1c951f4c)), closes [#2101](https://github.com/apify/crawlee/issues/2101) ### Features[​](#features-9 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") ### Bug Fixes[​](#bug-fixes-15 "Direct link to Bug Fixes") * various helpers opening KVS now respect Configuration ([#2071](https://github.com/apify/crawlee/issues/2071)) ([59dbb16](https://github.com/apify/crawlee/commit/59dbb164699774e5a6718e98d0a4e8f630f35323)) ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-16 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/playwright ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 
"Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/playwright # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-10 "Direct link to Features") * add `closeCookieModals` context helper for Playwright and Puppeteer ([#1927](https://github.com/apify/crawlee/issues/1927)) ([98d93bb](https://github.com/apify/crawlee/commit/98d93bb6713ec219baa83db2ad2cd1d7621a3339)) * **core:** use `RequestQueue.addBatchedRequests()` in `enqueueLinks` helper ([4d61ca9](https://github.com/apify/crawlee/commit/4d61ca934072f8bbb680c842d8b1c9a4452ee73a)), closes [#1995](https://github.com/apify/crawlee/issues/1995) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/playwright ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/playwright # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Features[​](#features-11 "Direct link to Features") * infiniteScroll has maxScrollHeight limit ([#1945](https://github.com/apify/crawlee/issues/1945)) ([44997bb](https://github.com/apify/crawlee/commit/44997bba5bbf33ddb7dbac2f3e26d4bee60d4f47)) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/playwright ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-12 "Direct link to Features") * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-17 "Direct link to Bug Fixes") * infiniteScroll() not working in Firefox ([#1826](https://github.com/apify/crawlee/issues/1826)) ([4286c5d](https://github.com/apify/crawlee/commit/4286c5d29b94aec3f4d3835bbf36b7fafcaec8f0)), closes [#1821](https://github.com/apify/crawlee/issues/1821) * **jsdom:** delay closing of the window and add some polyfills ([2e81618](https://github.com/apify/crawlee/commit/2e81618afb5f3890495e3e5fcfa037eb3319edc9)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) **Note:** Version bump only for package @crawlee/playwright ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/playwright ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/playwright # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-18 "Direct link to Bug Fixes") * allow `userData` option in `enqueueLinksByClickingElements` ([#1749](https://github.com/apify/crawlee/issues/1749)) ([736f85d](https://github.com/apify/crawlee/commit/736f85d4a3b99a06d0f99f91e33e71976a9458a3)), closes [#1617](https://github.com/apify/crawlee/issues/1617) * declare missing 
dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) * update playwright to 1.29.2 and make peer dep. less strict ([#1735](https://github.com/apify/crawlee/issues/1735)) ([c654fcd](https://github.com/apify/crawlee/commit/c654fcdea06fb203b7952ed97650190cc0e74394)), closes [#1723](https://github.com/apify/crawlee/issues/1723) ### Features[​](#features-13 "Direct link to Features") * add `forefront` option to all `enqueueLinks` variants ([#1760](https://github.com/apify/crawlee/issues/1760)) ([a01459d](https://github.com/apify/crawlee/commit/a01459dffb51162e676354f0aa4811a1d36affa9)), closes [#1483](https://github.com/apify/crawlee/issues/1483) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/playwright ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 "Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/playwright ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/playwright ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/playwright # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/playwright ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") ### Features[​](#features-14 "Direct link to Features") * enable tab-as-a-container for Firefox ([#1456](https://github.com/apify/crawlee/issues/1456)) ([ae5ba4f](https://github.com/apify/crawlee/commit/ae5ba4f15fd6d14f444486234753ce1781c74cc8)) --- # AdaptivePlaywrightCrawler experimental An extension of [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) that uses a more limited request handler interface so that it is able to switch to HTTP-only crawling when it detects it may be possible. 
**Example usage:**

```
const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ querySelector, pushData, enqueueLinks, request, log }) {
        // This function is called to extract data from a single web page
        const $prices = await querySelector('span.price');
        await pushData({
            url: request.url,
            price: $prices.filter(':contains("$")').first().text(),
        });
        await enqueueLinks({ selector: '.pagination a' });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

### Hierarchy * [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) * *AdaptivePlaywrightCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**browserPool](#browserPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**launchContext](#launchContext) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L282)constructor * ****new AdaptivePlaywrightCrawler**(options, config): [AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) - Overrides PlaywrightCrawler.constructor experimental #### Parameters * ##### options: [AdaptivePlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerOptions.md) = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPoolexperimental **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from PlaywrightCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort).
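For context, a minimal illustrative sketch (not part of the reference) of driving a running crawl through `crawler.autoscaledPool`. The option values, the timeout trigger and the `resume()` call are assumptions made only for this example; `pause()` and `abort()` are the members referenced in the note above.

```
import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ request, pushData }) {
        await pushData({ url: request.url });
    },
});

// Start the run without awaiting it, so the pool can be reached while it is active.
const finished = crawler.run(['http://www.example.com/page-1']);

setTimeout(async () => {
    // autoscaledPool is only defined once crawler.run() has been called.
    const pool = crawler.autoscaledPool;
    if (!pool) return;
    await pool.pause();    // let in-flight requests finish, start no new ones
    // ... wait for some external condition ...
    pool.resume();         // assumption for this sketch: resumes a paused pool
    // await pool.abort(); // alternatively, end the run early
}, 10_000);

await finished;
```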
### [**](#browserPool)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L329)inheritedbrowserPoolexperimental **browserPool: [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)<{ browserPlugins: \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] }, \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)], [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? 
: null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> Inherited from PlaywrightCrawler.browserPool A reference to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class that manages the crawler's browsers. ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L284)readonlyinheritedconfigexperimental **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... 
Inherited from PlaywrightCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBeforeexperimental **hasFinishedBefore: boolean = false Inherited from PlaywrightCrawler.hasFinishedBefore ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L331)inheritedlaunchContextexperimental **launchContext: [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ Inherited from PlaywrightCrawler.launchContext ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlogexperimental **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from PlaywrightCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L324)optionalinheritedproxyConfigurationexperimental **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from PlaywrightCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestListexperimental **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from PlaywrightCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueueexperimental **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from PlaywrightCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L279)readonlyrouterexperimental **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<[AdaptivePlaywrightCrawlerContext](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md)\> = ... Overrides PlaywrightCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). 
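As a hedged illustration of the default router described above (the label, selector, and handler bodies are made up for this sketch), handlers can be registered on `crawler.router` instead of passing a `requestHandler`:

```
import { AdaptivePlaywrightCrawler } from 'crawlee';

// No requestHandler is given, so the crawler falls back to its default Router.
const crawler = new AdaptivePlaywrightCrawler({ renderingTypeDetectionRatio: 0.1 });

// Requests enqueued with { label: 'DETAIL' } are dispatched to this handler.
crawler.router.addHandler('DETAIL', async ({ request, pushData }) => {
    await pushData({ url: request.url });
});

// Fallback handler for requests without a matching label.
crawler.router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.detail', label: 'DETAIL' });
});

await crawler.run(['http://www.example.com/']);
```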
### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunningexperimental **running: boolean = false Inherited from PlaywrightCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPoolexperimental **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from PlaywrightCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L272)readonlystatsexperimental **stats: AdaptivePlaywrightCrawlerStatistics Overrides PlaywrightCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from PlaywrightCrawler.addRequests experimental Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from PlaywrightCrawler.exportData experimental Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. 
*** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from PlaywrightCrawler.getData experimental Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from PlaywrightCrawler.getDataset experimental Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from PlaywrightCrawler.getRequestQueue experimental #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from PlaywrightCrawler.pushData experimental Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from PlaywrightCrawler.run experimental Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. 
* ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from PlaywrightCrawler.setStatusMessage experimental This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from PlaywrightCrawler.stop experimental Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from PlaywrightCrawler.useState experimental #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # PlaywrightCrawler Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox and WebKit browsers with [Playwright](https://github.com/microsoft/playwright). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites. Since `Playwright` uses a headless browser to download web pages and extract data, it is useful for crawling websites that require JavaScript execution. If the target website doesn't need JavaScript, consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [PlaywrightCrawlerOptions.requestList](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestList) or [PlaywrightCrawlerOptions.requestQueue](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestQueue) constructor options, respectively. If both [PlaywrightCrawlerOptions.requestList](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestList) and [PlaywrightCrawlerOptions.requestQueue](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times.
The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `PlaywrightCrawler` opens a new browser page (i.e. tab) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by the user as the [PlaywrightCrawlerOptions.requestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [PlaywrightCrawlerOptions.autoscaledPoolOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md#autoscaledPoolOptions) parameter of the `PlaywrightCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) are available directly in the `PlaywrightCrawler` constructor. Note that the pool of Playwright instances is internally managed by the [BrowserPool](https://github.com/apify/browser-pool) class. **Example usage:**

```
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // This function is called to extract data from a single web page.
        // 'page' is an instance of Playwright.Page with page.goto(request.url) already called.
        // 'request' is an instance of the Request class with information about the page to load.
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
    async failedRequestHandler({ request }) {
        // This function is called when the crawling of a request failed too many times.
        await Dataset.pushData({
            url: request.url,
            succeeded: false,
            errors: request.errorMessages,
        });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

### Hierarchy * [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md)<{ browserPlugins: \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] }, LaunchOptions, [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)> * *PlaywrightCrawler* * [AdaptivePlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/AdaptivePlaywrightCrawler.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**browserPool](#browserPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**launchContext](#launchContext) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L204)constructor * ****new
PlaywrightCrawler**(options, config): [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) - Overrides BrowserCrawler< { browserPlugins: \[PlaywrightPlugin] }, LaunchOptions, PlaywrightCrawlingContext >.constructor All `PlaywrightCrawler` parameters are passed via an options object. *** #### Parameters * ##### options: [PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? : [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from BrowserCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#browserPool)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L329)inheritedbrowserPool **browserPool: [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)<{ browserPlugins: \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] }, \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)], [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: string; expires: number; httpOnly: boolean; name: string; path: string; sameSite: Strict | Lax | None; secure: boolean; value: string }\[]; origins: { localStorage: { name: string; value: string }\[]; origin: string }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page> Inherited from BrowserCrawler.browserPool A reference to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class that manages the crawler's browsers. ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L206)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from BrowserCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from BrowserCrawler.hasFinishedBefore ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L331)inheritedlaunchContext **launchContext: [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ Inherited from BrowserCrawler.launchContext ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BrowserCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L324)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from BrowserCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BrowserCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BrowserCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. 
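As a hedged illustration of this option, the sketch below opens a named [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) explicitly and hands it to the crawler instead of relying on the implicit default queue; the queue name and URL are placeholders.

```
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Open (or create) a named queue and seed it before the crawl starts.
const requestQueue = await RequestQueue.open('my-named-queue'); // placeholder name
await requestQueue.addRequest({ url: 'https://example.com' });

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Links discovered here are enqueued into the same queue,
        // which is what enables recursive crawling.
        await enqueueLinks();
    },
});

await crawler.run();
```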
### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\, request>> = ... Inherited from BrowserCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from BrowserCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from BrowserCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from BrowserCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from BrowserCrawler.addRequests Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. 
*** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from BrowserCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from BrowserCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from BrowserCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). *** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from BrowserCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BrowserCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). 
*** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from BrowserCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from BrowserCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from BrowserCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from BrowserCrawler.useState #### Parameters * ##### defaultValue: State = ... #### Returns Promise\ --- # RenderingTypePredictor experimental Stores rendering type information for previously crawled URLs and predicts the rendering type for URLs that have yet to be crawled and recommends when rendering type detection should be performed. 
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**initialize](#initialize) * [**predict](#predict) * [**storeResult](#storeResult) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L50)constructor * ****new RenderingTypePredictor**(\_\_namedParameters): [RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md) - experimental #### Parameters * ##### \_\_namedParameters: RenderingTypePredictorOptions #### Returns [RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md) ## Methods[**](#Methods) ### [**](#initialize)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L65)initialize * ****initialize**(): Promise\ - experimental Initialize the predictor by restoring persisted state. *** #### Returns Promise\ ### [**](#predict)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L72)publicpredict * ****predict**(\_\_namedParameters): { detectionProbabilityRecommendation: number; renderingType: [RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) } - experimental Predict the rendering type for a given URL and request label. *** #### Parameters * ##### \_\_namedParameters: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ #### Returns { detectionProbabilityRecommendation: number; renderingType: [RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) } * ##### detectionProbabilityRecommendation: number * ##### renderingType: [RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) ### [**](#storeResult)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L99)publicstoreResult * ****storeResult**(\_\_namedParameters, renderingType): void - experimental Store the rendering type for a given URL and request label. This updates the underlying prediction model, which may be costly. *** #### Parameters * ##### \_\_namedParameters: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### renderingType: [RenderingType](https://crawlee.dev/js/api/playwright-crawler.md#RenderingType) #### Returns void --- # createAdaptivePlaywrightRouter ### Callable * ****createAdaptivePlaywrightRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # createPlaywrightRouter ### Callable * ****createPlaywrightRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). Defaults to the [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. 
```
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();
router.addHandler('label-a', async (ctx) => {
    ctx.log.info('...');
});
router.addDefaultHandler(async (ctx) => {
    ctx.log.info('...');
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});
await crawler.run();
```

*** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # launchPlaywright ### Callable * ****launchPlaywright**(launchContext, config): Promise\ *** * Launches headless browsers using Playwright, pre-configured to work within the Apify platform. The function has the same return value as `browserType.launch()`. See [Playwright documentation](https://playwright.dev/docs/api/class-browsertype) for more details. The `launchPlaywright()` function alters the following Playwright options: * Passes the setting from the `CRAWLEE_HEADLESS` environment variable to the `headless` option, unless it was already defined by the caller or the `CRAWLEE_XVFB` environment variable is set to `1`. Note that the Apify Actor cloud platform automatically sets `CRAWLEE_HEADLESS=1` for all running actors. * Takes the `proxyUrl` option, validates it and adds it to `launchOptions` in a proper format. The proxy URL must define a port number and have one of the following schemes: `http://`, `https://`, `socks4://` or `socks5://`. If the proxy is HTTP (i.e. has the `http://` scheme) and contains a username or password, the `launchPlaywright` function sets up an anonymous HTTP proxy to make the proxy work with headless Chrome. For more information, read the [blog post about the proxy-chain library](https://blog.apify.com/how-to-make-headless-chrome-and-puppeteer-use-a-proxy-server-with-authentication-249a21a79212). To use this function, you need to have the [Playwright](https://www.npmjs.com/package/playwright) NPM package installed in your project. When running on the Apify Platform, you can achieve that simply by using the `apify/actor-node-playwright-*` base Docker image for your actor - see [Apify Actor documentation](https://docs.apify.com/actor/build#base-images) for details. *** #### Parameters * ##### optionallaunchContext: [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) Optional settings passed to `browserType.launch()`. In addition to [Playwright's options](https://playwright.dev/docs/api/class-browsertype?_highlight=launch#browsertypelaunchoptions) the object may contain our own [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) that enables additional features. * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns Promise\ Promise that resolves to Playwright's `Browser` instance.
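A minimal sketch of calling `launchPlaywright()` directly, outside of any crawler. The `proxyUrl` credentials are placeholders; `launchOptions` is forwarded to Playwright's `browserType.launch()` as described above.

```
import { launchPlaywright } from 'crawlee';

const browser = await launchPlaywright({
    launchOptions: { headless: true },
    // Placeholder authenticated proxy; launchPlaywright rewrites it through
    // an anonymized local proxy so the headless browser can use it.
    proxyUrl: 'http://user:password@proxy.example.com:8000',
});

const page = await browser.newPage();
await page.goto('https://crawlee.dev');
console.log(await page.title());
await browser.close();
```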
--- # AdaptivePlaywrightCrawlerContext \ ### Hierarchy * [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md)\ * *AdaptivePlaywrightCrawlerContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**enqueueLinks](#enqueueLinks) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**page](#page) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**querySelector](#querySelector) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from RestrictedCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L80)inheritedenqueueLinks **enqueueLinks: (options) => Promise\ Inherited from RestrictedCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Type declaration * * **(options): Promise\ - #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise\ ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L101)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise\> Inherited from RestrictedCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. 
*** #### Type declaration * * **(idOrName): Promise\> - #### Parameters * ##### optionalidOrName: string #### Returns Promise\> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from RestrictedCrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from RestrictedCrawlingContext.log A preconfigured logger for the request handler. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L109)page **page: Page Playwright Page object. If accessed in HTTP-only rendering, this will throw an error and make the AdaptivePlaywrightCrawlerContext retry the request in a browser. ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from RestrictedCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from RestrictedCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L104)response **response: [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) The HTTP response, either from the HTTP client or from the initial request from playwright's navigation. ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from RestrictedCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from RestrictedCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L144)parseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it will first look for the selector with a 5s timeout. 
**Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from RestrictedCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#querySelector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L115)querySelector * ****querySelector**(selector, timeoutMs): Promise\> - Wait for an element matching the selector to appear and return a Cheerio object of matched elements. Timeout defaults to 5s. *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L130)waitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Wait for an element matching the selector to appear. Timeout defaults to 5s. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # AdaptivePlaywrightCrawlerOptions ### Hierarchy * Omit<[PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md), requestHandler | handlePageFunction | preNavigationHooks | postNavigationHooks> * *AdaptivePlaywrightCrawlerOptions* ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**browserPoolOptions](#browserPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**headless](#headless) * [**httpClient](#httpClient) * [**ignoreIframes](#ignoreIframes) * [**ignoreShadowRoots](#ignoreShadowRoots) * [**keepAlive](#keepAlive) * [**launchContext](#launchContext) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**preventDirectStorageAccess](#preventDirectStorageAccess) * [**proxyConfiguration](#proxyConfiguration) * [**renderingTypeDetectionRatio](#renderingTypeDetectionRatio) * [**renderingTypePredictor](#renderingTypePredictor) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * 
[**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**resultChecker](#resultChecker) * [**resultComparator](#resultComparator) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from Omit.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#browserPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L194)optionalinheritedbrowserPoolOptions **browserPoolOptions? : Partial<[BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md)<[BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)<[CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md), undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<[BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)<[BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : ... | ...; certPath? : ... | ...; key? : ... | ...; keyPath? : ... | ...; origin: string; passphrase? : ... | ...; pfx? : ... | ...; pfxPath? : ... | ... }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: ...; width: ... } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? 
: string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }\[]; origins: { localStorage: ...; origin: ... }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : ... | ...; certPath? : ... | ...; key? : ... | ...; keyPath? : ... | ...; origin: string; passphrase? : ... | ...; pfx? : ... | ...; pfxPath? : ... | ... }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: ...; width: ... } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }\[]; origins: { localStorage: ...; origin: ... }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, Page>> Inherited from Omit.browserPoolOptions Custom options passed to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) constructor. We can tweak those to fine-tune browser management. ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L163)optionalinheritederrorHandler **errorHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\> Inherited from Omit.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the request to be retried. 
Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from Omit.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L174)optionalinheritedfailedRequestHandler **failedRequestHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\> Inherited from Omit.failedRequestHandler A function to handle requests that failed more than `option.maxRequestRetries` times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L260)optionalinheritedheadless **headless? : boolean | new | old Inherited from Omit.headless Whether to run browser in headless mode. Defaults to `true`. Can be also set via [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md). ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from Omit.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreIframes)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L272)optionalinheritedignoreIframes **ignoreIframes? : boolean Inherited from Omit.ignoreIframes Whether to ignore `iframes` when processing the page content via `parseWithCheerio` helper. By default, `iframes` are expanded automatically. Use this option to disable this behavior. ### [**](#ignoreShadowRoots)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L266)optionalinheritedignoreShadowRoots **ignoreShadowRoots? : boolean Inherited from Omit.ignoreShadowRoots Whether to ignore custom elements (and their #shadow-roots) when processing the page content via `parseWithCheerio` helper. By default, they are expanded automatically. Use this option to disable this behavior. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? 
: boolean Inherited from Omit.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L33)optionalinheritedlaunchContext **launchContext? : [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) Inherited from Omit.launchContext The same options as used by [launchPlaywright](https://crawlee.dev/js/api/playwright-crawler/function/launchPlaywright.md). ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from Omit.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from Omit.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from Omit.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from Omit.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from Omit.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. 
By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from Omit.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from Omit.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. If not sure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L248)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from Omit.navigationTimeoutSecs Timeout in which page navigation needs to finish, in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from Omit.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L254)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from Omit.persistCookiesPerSession Defines whether the cookies should be persisted for sessions. This can only be used when `useSessionPool` is set to `true`. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L183)optionalpostNavigationHooks **postNavigationHooks? : AdaptiveHook\[] Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts a subset of the crawling context. If you attempt to access the `page` property during HTTP-only crawling, an exception will be thrown. 
If it's not caught, the request will be transparently retried in a browser. ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L176)optionalpreNavigationHooks **preNavigationHooks? : AdaptiveHook\[] Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies. The function accepts a subset of the crawling context. If you attempt to access the `page` property during HTTP-only crawling, an exception will be thrown. If it's not caught, the request will be transparently retried in a browser. ### [**](#preventDirectStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L221)optionalpreventDirectStorageAccess **preventDirectStorageAccess? : boolean Prevent direct access to storage in request handlers (only allow using context helpers). Defaults to `true` ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L201)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from Omit.proxyConfiguration If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration. ### [**](#renderingTypeDetectionRatio)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L189)optionalrenderingTypeDetectionRatio **renderingTypeDetectionRatio? : number Specifies the frequency of rendering type detection checks - 0.1 means roughly 10% of requests. Defaults to 0.1 (so 10%). ### [**](#renderingTypePredictor)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L215)optionalrenderingTypePredictor **renderingTypePredictor? : Pick<[RenderingTypePredictor](https://crawlee.dev/js/api/playwright-crawler/class/RenderingTypePredictor.md), predict | storeResult | initialize> A custom rendering type predictor ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L169)optionalrequestHandler **requestHandler? : (crawlingContext) => Awaitable\ Function that is called to process each request. The function receives the AdaptivePlaywrightCrawlingContext as an argument, and it must refrain from calling code with side effects, other than the methods of the crawling context. Any other side effects may be invoked repeatedly by the crawler, which can lead to inconsistent results. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to `option.maxRequestRetries` times. 
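As a hedged sketch (assuming the `crawlee` package exports `AdaptivePlaywrightCrawler` and that the adaptive context exposes the `parseWithCheerio`, `pushData`, and `enqueueLinks` helpers documented for this crawler), a side-effect-free handler might look like this:

```
import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ request, parseWithCheerio, pushData, enqueueLinks }) {
        // parseWithCheerio works both in HTTP-only and browser-rendered runs.
        const $ = await parseWithCheerio();
        await pushData({ url: request.url, title: $('title').text() });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```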
*** #### Type declaration * * **(crawlingContext): Awaitable\ - #### Parameters * ##### crawlingContext: { request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[AdaptivePlaywrightCrawlerContext](https://crawlee.dev/js/api/playwright-crawler/interface/AdaptivePlaywrightCrawlerContext.md)\, request> #### Returns Awaitable\ ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from Omit.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from Omit.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from Omit.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from Omit.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from Omit.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. 
This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#resultChecker)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L196)optionalresultChecker **resultChecker? : (result) => boolean An optional callback that is called on dataset items found by the request handler in plain HTTP mode. If it returns false, the request is retried in a browser. If no callback is specified, every dataset item is considered valid. *** #### Type declaration * * **(result): boolean - #### Parameters * ##### result: [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) #### Returns boolean ### [**](#resultComparator)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts#L207)optionalresultComparator **resultComparator? : (resultA, resultB) => boolean | equal | different | inconclusive An optional callback used in rendering type detection. On each detection, the result of the plain HTTP run is compared to that of the browser one. If a callback is provided, the contract is as follows: If the callback returns true or 'equal', the results are considered equal and the target site is considered static. If it returns false or 'different', the target site is considered client-rendered. If it returns 'inconclusive', the detection result won't be used. If no result comparator is specified, but there is a `resultChecker`, any site where the `resultChecker` returns true is considered static. If neither `resultComparator` nor `resultChecker` is specified, a deep comparison of returned dataset items is used as a default. *** #### Type declaration * * **(resultA, resultB): boolean | equal | different | inconclusive - #### Parameters * ##### resultA: [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) * ##### resultB: [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) #### Returns boolean | equal | different | inconclusive ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from Omit.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from Omit.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from Omit.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions?
: [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from Omit.statisticsOptions Customize the way statistics are collected, such as the logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from Omit.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.

```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```

### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from Omit.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from Omit.useSessionPool The crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler).
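For illustration, a minimal sketch of wiring the session pool options above together (the pool size and the target URL are placeholder values):

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: { maxPoolSize: 50 }, // placeholder pool size
    async requestHandler({ session, request, pushData }) {
        // The session instance is available on the crawling context.
        await pushData({ url: request.url, sessionId: session?.id });
    },
});

await crawler.run(['https://example.com']); // placeholder URL
```

The same inherited options apply to the adaptive crawler and the other crawler classes as well.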
--- # PlaywrightCrawlerOptions ### Hierarchy * [BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md), { browserPlugins: \[[PlaywrightPlugin](https://crawlee.dev/js/api/browser-pool/class/PlaywrightPlugin.md)] }> * *PlaywrightCrawlerOptions* ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**browserPoolOptions](#browserPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**headless](#headless) * [**httpClient](#httpClient) * [**ignoreIframes](#ignoreIframes) * [**ignoreShadowRoots](#ignoreShadowRoots) * [**keepAlive](#keepAlive) * [**launchContext](#launchContext) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from BrowserCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#browserPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L194)optionalinheritedbrowserPoolOptions **browserPoolOptions? 
: Partial<[BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md)<[BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)<[CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md), undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<[BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)<[BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : ... | ...; certPath? : ... | ...; key? : ... | ...; keyPath? : ... | ...; origin: string; passphrase? : ... | ...; pfx? : ... | ...; pfxPath? : ... | ... }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: ...; width: ... } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }\[]; origins: { localStorage: ...; origin: ... }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | LaunchOptions, Browser, undefined | { acceptDownloads? : boolean; baseURL? : string; bypassCSP? : boolean; clientCertificates? : { cert? : ... | ...; certPath? : ... | ...; key? : ... | ...; keyPath? : ... | ...; origin: string; passphrase? : ... | ...; pfx? : ... | ...; pfxPath? : ... | ... }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; extraHTTPHeaders? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; hasTouch? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode? : full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: ...; width: ... } }; reducedMotion? : null | reduce | no-preference; screen? 
: { height: number; width: number }; serviceWorkers? : allow | block; storageState? : string | { cookies: { domain: ...; expires: ...; httpOnly: ...; name: ...; path: ...; sameSite: ...; secure: ...; value: ... }\[]; origins: { localStorage: ...; origin: ... }\[] }; strictSelectors? : boolean; timezoneId? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } }, Page>, Page>> Inherited from BrowserCrawlerOptions.browserPoolOptions Custom options passed to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) constructor. We can tweak those to fine-tune browser management. ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L163)optionalinheritederrorHandler **errorHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\> Inherited from BrowserCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from BrowserCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L174)optionalinheritedfailedRequestHandler **failedRequestHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\> Inherited from BrowserCrawlerOptions.failedRequestHandler A function to handle requests that failed more than `option.maxRequestRetries` times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the failed request. Second argument is the `Error` instance that represents the last error thrown during processing of the request. 
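As a hedged illustration of how these two handlers relate (the retry bookkeeping and the log message are illustrative, not part of the API):

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Runs before each retry; here we just record the last error on the request.
    errorHandler: async ({ request }, error) => {
        request.userData.lastError = error.message;
    },
    // Runs once all retries (maxRequestRetries) are exhausted.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} failed too many times: ${error.message}`);
    },
    async requestHandler({ page, pushData }) {
        await pushData({ title: await page.title() });
    },
});
```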
### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L260)optionalinheritedheadless **headless? : boolean | new | old Inherited from BrowserCrawlerOptions.headless Whether to run browser in headless mode. Defaults to `true`. Can be also set via [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md). ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from BrowserCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreIframes)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L272)optionalinheritedignoreIframes **ignoreIframes? : boolean Inherited from BrowserCrawlerOptions.ignoreIframes Whether to ignore `iframes` when processing the page content via `parseWithCheerio` helper. By default, `iframes` are expanded automatically. Use this option to disable this behavior. ### [**](#ignoreShadowRoots)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L266)optionalinheritedignoreShadowRoots **ignoreShadowRoots? : boolean Inherited from BrowserCrawlerOptions.ignoreShadowRoots Whether to ignore custom elements (and their #shadow-roots) when processing the page content via `parseWithCheerio` helper. By default, they are expanded automatically. Use this option to disable this behavior. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from BrowserCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L33)optionallaunchContext **launchContext? : [PlaywrightLaunchContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightLaunchContext.md) Overrides BrowserCrawlerOptions.launchContext The same options as used by [launchPlaywright](https://crawlee.dev/js/api/playwright-crawler/function/launchPlaywright.md). ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from BrowserCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. ### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? 
: number Inherited from BrowserCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from BrowserCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from BrowserCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from BrowserCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from BrowserCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from BrowserCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slow or crash. 
If unsure, it's better to keep the default value and the concurrency will scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L248)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from BrowserCrawlerOptions.navigationTimeoutSecs Timeout in which page navigation needs to finish, in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from BrowserCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on robots.txt file, 2. because they don't match enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached. ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L254)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from BrowserCrawlerOptions.persistCookiesPerSession Defines whether the cookies should be persisted for sessions. This can only be used when `useSessionPool` is set to `true`. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L124)optionalpostNavigationHooks **postNavigationHooks? : [PlaywrightHook](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightHook.md)\[] Overrides BrowserCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. Example:

```
postNavigationHooks: [
    async (crawlingContext) => {
        const { page } = crawlingContext;
        if (hasCaptcha(page)) {
            await solveCaptcha(page);
        }
    },
]
```

### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L107)optionalpreNavigationHooks **preNavigationHooks? : [PlaywrightHook](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightHook.md)\[] Overrides BrowserCrawlerOptions.preNavigationHooks Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotoOptions`, which are passed to the `page.goto()` function the crawler calls to navigate. Example:

```
preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        const { page } = crawlingContext;
        await page.evaluate((attr) => { window.foo = attr; }, 'bar');
    },
]
```

Modifying `pageOptions` is supported only in Playwright incognito. See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook). ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L201)optionalinheritedproxyConfiguration **proxyConfiguration?
: [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from BrowserCrawlerOptions.proxyConfiguration If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration. ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-crawler.ts#L59)optionalrequestHandler **requestHandler? : [PlaywrightRequestHandler](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightRequestHandler.md) Overrides BrowserCrawlerOptions.requestHandler Function that is called to process each request. The function receives the [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md) as an argument, where: * `request` is an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) object with details about the URL to open, HTTP method etc. * `page` is an instance of the `Playwright` [`Page`](https://playwright.dev/docs/api/class-page) * `browserController` is an instance of the [`BrowserController`](https://github.com/apify/browser-pool#browsercontroller), * `response` is an instance of the `Playwright` [`Response`](https://playwright.dev/docs/api/class-response), which is the main resource response as returned by `page.goto(request.url)`. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to `option.maxRequestRetries` times. If all the retries fail, the crawler calls the function provided to the `failedRequestHandler` parameter. To make this work, you should **always** let your function throw exceptions rather than catch them. The exceptions are logged to the request using the [Request.pushErrorMessage](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? : number = 60 Inherited from BrowserCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BrowserCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? 
: [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from BrowserCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BrowserCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) could be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before the `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from BrowserCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from BrowserCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from BrowserCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another request to the same domain. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from BrowserCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions?
: [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from BrowserCrawlerOptions.statisticsOptions Customize the way statistics are collected, such as the logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from BrowserCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters.

```
const crawler = new CheerioCrawler({
    statusMessageCallback: async (ctx) => {
        return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG'
    },
    statusMessageLoggingInterval: 1, // defaults to 10s
    async requestHandler({ $, enqueueLinks, request, log }) {
        // ...
    },
});
```

### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from BrowserCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling `setStatusMessage`, in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from BrowserCrawlerOptions.useSessionPool The crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). The session instance will then be available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler).
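To tie several of the options above together, a hedged configuration sketch (the proxy URL, limits, browser flag and target URL are placeholder values):

```
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy.example.com:8000'], // placeholder proxy URL
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    headless: true,
    respectRobotsTxtFile: true,
    maxRequestsPerCrawl: 100, // placeholder limit
    maxConcurrency: 10, // placeholder limit
    launchContext: {
        launchOptions: { args: ['--disable-gpu'] }, // illustrative Chromium flag
    },
    async requestHandler({ request, page, enqueueLinks, pushData }) {
        await pushData({ url: request.url, title: await page.title() });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```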
--- # PlaywrightCrawlingContext \ ### Hierarchy * [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md)<[PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), Page, Response, [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md), UserData> * PlaywrightContextUtils * *PlaywrightCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**browserController](#browserController) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**page](#page) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**blockRequests](#blockRequests) * [**closeCookieModals](#closeCookieModals) * [**compileScript](#compileScript) * [**enqueueLinks](#enqueueLinks) * [**enqueueLinksByClickingElements](#enqueueLinksByClickingElements) * [**handleCloudflareChallenge](#handleCloudflareChallenge) * [**infiniteScroll](#infiniteScroll) * [**injectFile](#injectFile) * [**injectJQuery](#injectJQuery) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**saveSnapshot](#saveSnapshot) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from BrowserCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#browserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L59)inheritedbrowserController **browserController: [PlaywrightController](https://crawlee.dev/js/api/browser-pool/class/PlaywrightController.md) Inherited from BrowserCrawlingContext.browserController ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [PlaywrightCrawler](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) Inherited from BrowserCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from BrowserCrawlingContext.getKeyValueStore Get a key-value store with given name or id, or the default one for the crawler. 
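A hedged sketch of using this helper inside a request handler, in the same fragment style as the other examples here (the key name is arbitrary):

```
async requestHandler({ getKeyValueStore, request }) {
    // Without an argument, the crawler's default key-value store is opened.
    const store = await getKeyValueStore();
    await store.setValue('last-visited-url', request.url); // arbitrary key name
},
```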
*** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from BrowserCrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BrowserCrawlingContext.log A preconfigured logger for the request handler. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L60)inheritedpage **page: Page Inherited from BrowserCrawlingContext.page ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from BrowserCrawlingContext.proxyInfo An object with information about currently used proxy by the crawler and configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from BrowserCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L61)optionalinheritedresponse **response? : Response Inherited from BrowserCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from BrowserCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from BrowserCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#blockRequests)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L878)inheritedblockRequests * ****blockRequests**(options): Promise\ - Inherited from PlaywrightContextUtils.blockRequests Forces the Playwright browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources. 
By default, the function will block all URLs including the following patterns:

```
[".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"]
```

If you want to extend this list further, use the `extraUrlPatterns` option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the `urlPatterns` option, which will override the defaults and block only URLs with your custom patterns. This function does not use Playwright's request interception and therefore does not interfere with browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception. The function will never block main document loads and their respective redirects. **Example usage**

```
preNavigationHooks: [
    async ({ blockRequests }) => {
        // Block all requests to URLs that include `adsbygoogle.js` and also all defaults.
        await blockRequests({
            extraUrlPatterns: ['adsbygoogle.js'],
        });
    },
],
```

*** #### Parameters * ##### optionaloptions: [BlockRequestsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#BlockRequestsOptions) #### Returns Promise\ ### [**](#closeCookieModals)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L997)inheritedcloseCookieModals * ****closeCookieModals**(): Promise\ - Inherited from PlaywrightContextUtils.closeCookieModals Tries to close cookie consent modals on the page. Based on the I Don't Care About Cookies browser extension. *** #### Returns Promise\ ### [**](#compileScript)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L992)inheritedcompileScript * ****compileScript**(scriptString, ctx): [CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) - Inherited from PlaywrightContextUtils.compileScript Compiles a Playwright script into an async function that may be executed at any time by providing it with the following object:

```
{
    page: Page,
    request: Request,
}
```

Where `page` is a Playwright [`Page`](https://playwright.dev/docs/api/class-page) and `request` is a [Request](https://crawlee.dev/js/api/core/class/Request.md). The function is compiled by using the `scriptString` parameter as the function's body, so any limitations to function bodies apply. The return value of the compiled function is the return value of the function body, i.e. of the `scriptString` parameter. As a security measure, no globals such as `process` or `require` are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore you should only use this function to execute sanitized or safe code. Custom context may also be provided using the `context` parameter. To improve security, make sure to pass only the objects that are really necessary to the context, preferably making secured copies beforehand.
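A hedged sketch of calling the compiled function from a request handler (the script body is illustrative):

```
async requestHandler({ compileScript, page, request }) {
    // The string becomes the body of an async function receiving { page, request }.
    const getTitle = compileScript('return page.title();');
    const title = await getTitle({ page, request });
    // ... use `title`
},
```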
*** #### Parameters * ##### scriptString: string * ##### optionalctx: Dictionary #### Returns [CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from BrowserCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L962)inheritedenqueueLinksByClickingElements * ****enqueueLinksByClickingElements**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from PlaywrightContextUtils.enqueueLinksByClickingElements The function finds elements matching a specific CSS selector in a Playwright page, clicks all those elements using a mouse move and a left mouse button click and intercepts all the navigation requests that are subsequently produced by the page. The intercepted requests, including their methods, headers and payloads are then enqueued to a provided [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This is useful to crawl JavaScript heavy pages where links are not available in `href` elements, but rather navigations are triggered in click handlers. If you're looking to find URLs in `href` attributes of the page, see enqueueLinks. Optionally, the function allows you to filter the target links' URLs using an array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) objects and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. **IMPORTANT**: To be able to do this, this function uses various mutations on the page, such as changing the Z-index of elements being clicked and their visibility. Therefore, it is recommended to only use this function as the last operation in the page. **USING HEADFUL BROWSER**: When using a headful browser, this function will only be able to click elements in the focused tab, effectively limiting concurrency to 1. 
In headless mode, full concurrency can be achieved. **PERFORMANCE**: Clicking elements with a mouse and intercepting requests is not a low-level operation that takes nanoseconds. It's not very CPU intensive, but it takes time. We strongly recommend limiting the scope of the clicking as much as possible by using a specific selector that targets only the elements that you assume or know will produce a navigation. You can certainly click everything by using the `*` selector, but be prepared to wait minutes to get results on a large and complex page. **Example usage** ``` async requestHandler({ enqueueLinksByClickingElements }) { await enqueueLinksByClickingElements({ selector: 'a.product-detail', globs: [ 'https://www.example.com/handbags/**', 'https://www.example.com/purses/**' ], }); }, ``` *** #### Parameters * ##### options: Omit<[EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md#EnqueueLinksByClickingElementsOptions), requestQueue | page> #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#handleCloudflareChallenge)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L1019)inheritedhandleCloudflareChallenge * ****handleCloudflareChallenge**(options): Promise\ - Inherited from PlaywrightContextUtils.handleCloudflareChallenge This helper tries to solve the Cloudflare challenge automatically by clicking on the checkbox. It will try to detect the Cloudflare page, click on the checkbox, and wait for 10 seconds (configurable via the `sleepSecs` option) for the page to load. Use this in the `postNavigationHooks`; failures result in a `SessionError`, which is automatically retried, so only successful requests get into the `requestHandler`. Works best with Camoufox. **Example usage** ``` postNavigationHooks: [ async ({ handleCloudflareChallenge }) => { await handleCloudflareChallenge(); }, ], ``` *** #### Parameters * ##### optionaloptions: HandleCloudflareChallengeOptions #### Returns Promise\ ### [**](#infiniteScroll)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L913)inheritedinfiniteScroll * ****infiniteScroll**(options): Promise\ - Inherited from PlaywrightContextUtils.infiniteScroll Scrolls to the bottom of a page, or until it times out. Loads dynamic content when it hits the bottom of a page, and then continues scrolling. *** #### Parameters * ##### optionaloptions: [InfiniteScrollOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InfiniteScrollOptions) #### Returns Promise\ ### [**](#injectFile)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L813)inheritedinjectFile * ****injectFile**(filePath, options): Promise\ - Inherited from PlaywrightContextUtils.injectFile Injects a JavaScript file into the current `page`. Unlike Playwright's `addScriptTag` function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies. File contents are cached for up to 10 files to limit file system access.
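**Example (sketch):** A minimal illustration of injecting a local helper script from the request handler; `./my-helpers.js` is a hypothetical file path, not part of the API:

```
async requestHandler({ injectFile, page }) {
    // Hypothetical helper file, injected regardless of the page's CORS policy.
    await injectFile('./my-helpers.js', { surviveNavigations: true });
    // The injected code is now available to page.evaluate() calls.
},
```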
*** #### Parameters * ##### filePath: string * ##### optionaloptions: [InjectFileOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InjectFileOptions) #### Returns Promise\ ### [**](#injectJQuery)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L840)inheritedinjectJQuery * ****injectJQuery**(): Promise\ - Inherited from PlaywrightContextUtils.injectJQuery Injects the [jQuery](https://jquery.com/) library into current `page`. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors. Beware that the injected jQuery object will be set to the `window.$` variable and thus it might cause conflicts with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect functionality of page's scripts. The injected jQuery will survive page navigations and reloads. **Example usage:** ``` async requestHandler({ page, injectJQuery }) { await injectJQuery(); const title = await page.evaluate(() => { return $('head title').text(); }); }); ``` Note that `injectJQuery()` does not affect the Playwright [`page.$()`](https://playwright.dev/docs/api/class-page#page-query-selector) function in any way. *** #### Returns Promise\ ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L907)inheritedparseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Inherited from PlaywrightContextUtils.parseWithCheerio Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it waits for it to be available first. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BrowserCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#saveSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L919)inheritedsaveSnapshot * ****saveSnapshot**(options): Promise\ - Inherited from PlaywrightContextUtils.saveSnapshot Saves a full screenshot and HTML of the current page into a Key-Value store. 
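**Example (sketch):** For illustration, saving a snapshot under a custom key from the request handler; the `PRODUCT-PAGE` key is arbitrary:

```
async requestHandler({ saveSnapshot }) {
    // Stores 'PRODUCT-PAGE.jpg' and 'PRODUCT-PAGE.html' in the default Key-Value store.
    await saveSnapshot({ key: 'PRODUCT-PAGE', saveHtml: true, saveScreenshot: true });
},
```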
*** #### Parameters * ##### optionaloptions: [SaveSnapshotOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#SaveSnapshotOptions) #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from BrowserCrawlingContext.sendRequest Fires HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for more detailed explanation of how to do that. ``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L893)inheritedwaitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Inherited from PlaywrightContextUtils.waitForSelector Wait for an element matching the selector to appear. Timeout defaults to 5s. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }); ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # PlaywrightHook ### Hierarchy * [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook)<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md), [PlaywrightGotoOptions](https://crawlee.dev/js/api/playwright-crawler.md#PlaywrightGotoOptions)> * *PlaywrightHook* ### Callable * ****PlaywrightHook**(crawlingContext, gotoOptions): Awaitable\ *** * #### Parameters * ##### crawlingContext: [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\ * ##### gotoOptions: Dictionary & { referer?: string; timeout?: number; waitUntil?: domcontentloaded | load | networkidle | commit } #### Returns Awaitable\ --- # PlaywrightLaunchContext Apify extends the launch options of Playwright. You can use any of the Playwright compatible [`LaunchOptions`](https://playwright.dev/docs/api/class-browsertype#browsertypelaunchoptions) options by providing the `launchOptions` property. 
**Example:** ``` // launch a headless Chrome (not Chromium) const launchContext = { // Apify helpers useChrome: true, proxyUrl: 'http://user:password@some.proxy.com', // Native Playwright options launchOptions: { headless: true, args: ['--some-flag'], } } ``` ### Hierarchy * [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ * *PlaywrightLaunchContext* ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launcher](#launcher) * [**launchOptions](#launchOptions) * [**proxyUrl](#proxyUrl) * [**useChrome](#useChrome) * [**useIncognitoPages](#useIncognitoPages) * [**userAgent](#userAgent) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L40)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from BrowserLaunchContext.browserPerProxy If set to `true`, the crawler respects the proxy URL generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L66)optionalexperimentalContainersexperimental **experimentalContainers? : boolean Overrides BrowserLaunchContext.experimentalContainers Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. Unstable on Chromium. ### [**](#launcher)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L79)optionallauncher **launcher? : BrowserType<{}> Overrides BrowserLaunchContext.launcher By default, this option uses `require("playwright").chromium`. If you want to use a different browser, you can pass it via this property, e.g. `require("playwright").firefox`. ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L33)optionallaunchOptions **launchOptions? : LaunchOptions & { acceptDownloads? : boolean; args? : string\[]; baseURL? : string; bypassCSP? : boolean; channel? : string; chromiumSandbox? : boolean; clientCertificates? : { cert? : Buffer\; certPath? : string; key? : Buffer\; keyPath? : string; origin: string; passphrase? : string; pfx? : Buffer\; pfxPath? : string }\[]; colorScheme? : null | light | dark | no-preference; contrast? : null | no-preference | more; deviceScaleFactor? : number; devtools? : boolean; downloadsPath? : string; env? : {}; executablePath? : string; extraHTTPHeaders? : {}; firefoxUserPrefs? : {}; forcedColors? : null | active | none; geolocation? : { accuracy? : number; latitude: number; longitude: number }; handleSIGHUP? : boolean; handleSIGINT? : boolean; handleSIGTERM? : boolean; hasTouch? : boolean; headless? : boolean; httpCredentials? : { origin? : string; password: string; send? : unauthorized | always; username: string }; ignoreDefaultArgs? : boolean | string\[]; ignoreHTTPSErrors? : boolean; isMobile? : boolean; javaScriptEnabled? : boolean; locale? : string; logger? : Logger; offline? : boolean; permissions? : string\[]; proxy? : { bypass? : string; password? : string; server: string; username? : string }; recordHar? : { content? : omit | embed | attach; mode?
: full | minimal; omitContent? : boolean; path: string; urlFilter? : string | RegExp }; recordVideo? : { dir: string; size? : { height: number; width: number } }; reducedMotion? : null | reduce | no-preference; screen? : { height: number; width: number }; serviceWorkers? : allow | block; slowMo? : number; strictSelectors? : boolean; timeout? : number; timezoneId? : string; tracesDir? : string; userAgent? : string; videoSize? : { height: number; width: number }; videosPath? : string; viewport? : null | { height: number; width: number } } Overrides BrowserLaunchContext.launchOptions `browserType.launch` [options](https://playwright.dev/docs/api/class-browsertype#browser-type-launch) or `browserType.launchContextOptions` [options](https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context) ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L41)optionalproxyUrl **proxyUrl? : string Overrides BrowserLaunchContext.proxyUrl URL to a HTTP proxy server. It must define the port number, and it may also contain proxy username and password. Example: `http://bob:pass123@proxy.example.com:1234`. ### [**](#useChrome)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L52)optionaluseChrome **useChrome? : boolean = false Overrides BrowserLaunchContext.useChrome If `true` and `executablePath` is not set, Playwright will launch full Google Chrome browser available on the machine rather than the bundled Chromium. The path to Chrome executable is taken from the `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable if provided, or defaults to the typical Google Chrome executable location specific for the operating system. By default, this option is `false`. ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L59)optionaluseIncognitoPages **useIncognitoPages? : boolean = false Overrides BrowserLaunchContext.useIncognitoPages With this option selected, all pages will be opened in a new incognito browser context. This means they will not share cookies nor cache and their resources will not be throttled by one another. ### [**](#userAgent)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L68)optionalinheriteduserAgent **userAgent? : string Inherited from BrowserLaunchContext.userAgent The `User-Agent` HTTP header used by the browser. If not provided, the function sets `User-Agent` to a reasonable default to reduce the chance of detection of the crawler. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/playwright-launcher.ts#L73)optionaluserDataDir **userDataDir? : string Overrides BrowserLaunchContext.userDataDir Sets the [User Data Directory](https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md) path. The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state. If not specified, a temporary directory is used instead. 
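**Example (sketch):** The launch context is typically supplied through the `launchContext` option of the `PlaywrightCrawler` constructor. A minimal sketch, assuming Playwright's Firefox browser is installed:

```
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use Firefox instead of the default Chromium.
        launcher: firefox,
        useIncognitoPages: true,
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ request, page }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```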
--- # PlaywrightRequestHandler ### Hierarchy * [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler)\> * *PlaywrightRequestHandler* ### Callable * ****PlaywrightRequestHandler**(inputs): Awaitable\ *** * #### Parameters * ##### inputs: { request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>> } & Omit<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\, request>, request> #### Returns Awaitable\ --- # playwrightClickElements ## Index[**](#Index) ### References * [**enqueueLinksByClickingElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md#enqueueLinksByClickingElements) ### Interfaces * [**EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md#EnqueueLinksByClickingElementsOptions) ## References[**](#References) ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L225)enqueueLinksByClickingElements Re-exports [enqueueLinksByClickingElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#enqueueLinksByClickingElements) ## Interfaces[**](#Interfaces) ### [**](#EnqueueLinksByClickingElementsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L30)EnqueueLinksByClickingElementsOptions **EnqueueLinksByClickingElementsOptions: ### [**](#clickOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L56)optionalclickOptions **clickOptions? : { button? : left | right | middle; clickCount? : number; delay? : number; force? : boolean; modifiers? : (Alt | Control | ControlOrMeta | Meta | Shift)\[]; noWaitAfter? : boolean; position? : { x: number; y: number }; strict? : boolean; timeout? : number; trial? : boolean } Click options for use in Playwright click handler. *** #### Type declaration * ##### externaloptionalbutton?: left | right | middle Defaults to `left`. * ##### externaloptionalclickCount?: number defaults to 1. See \[UIEvent.detail]. * ##### externaloptionaldelay?: number Time to wait between `mousedown` and `mouseup` in milliseconds. Defaults to 0. * ##### externaloptionalforce?: boolean Whether to bypass the [actionability](https://playwright.dev/docs/actionability) checks. Defaults to `false`. * ##### externaloptionalmodifiers?: (Alt | Control | ControlOrMeta | Meta | Shift)\[] Modifier keys to press. Ensures that only these modifiers are pressed during the operation, and then restores current modifiers back. If not specified, currently pressed modifiers are used. "ControlOrMeta" resolves to "Control" on Windows and Linux and to "Meta" on macOS. * ##### externaloptionalnoWaitAfter?: boolean Actions that initiate navigations are waiting for these navigations to happen and for pages to start loading. You can opt out of waiting via setting this flag. You would only need this option in the exceptional cases such as navigating to inaccessible pages. Defaults to `false`. 
* **@deprecated** This option will default to `true` in the future. * ##### externaloptionalposition?: { x: number; y: number } A point to use relative to the top-left corner of element padding box. If not specified, uses some visible point of the element. * ##### externalx: number * ##### externaly: number * ##### externaloptionalstrict?: boolean When true, the call requires selector to resolve to a single element. If given selector resolves to more than one element, the call throws an exception. * ##### externaloptionaltimeout?: number Maximum time in milliseconds. Defaults to `0` - no timeout. The default value can be changed via `actionTimeout` option in the config, or by using the [browserContext.setDefaultTimeout(timeout)](https://playwright.dev/docs/api/class-browsercontext#browser-context-set-default-timeout) or [page.setDefaultTimeout(timeout)](https://playwright.dev/docs/api/class-page#page-set-default-timeout) methods. * ##### externaloptionaltrial?: boolean When set, this method only performs the [actionability](https://playwright.dev/docs/actionability) checks and skips the action. Defaults to `false`. Useful to wait until the element is ready for the action without performing it. Note that keyboard `modifiers` will be pressed regardless of `trial` to allow testing elements which are only visible when those keys are pressed. ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L83)optionalexclude **exclude? : readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L175)optionalforefront **forefront? : boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. ### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L72)optionalglobs **globs? : [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. 
### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L51)optionallabel **label? : string Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. ### [**](#maxWaitForPageIdleSecs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L165)optionalmaxWaitForPageIdleSecs **maxWaitForPageIdleSecs? : number = 5 This is the maximum period for which the function will keep tracking events, even if more events keep coming. Its purpose is to prevent a deadlock in the page by periodic events, often unrelated to the clicking itself. See `waitForPageIdleSecs` above for an explanation. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L34)page **page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. ### [**](#pseudoUrls)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L117)optionalpseudoUrls **pseudoUrls? : [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] *NOTE:* In future versions of SDK the options will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. * **@deprecated** prefer using `globs` or `regexps` instead ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L96)optionalregexps **regexps? : [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. The plain objects must include at least the `regexp` property, which holds the regular expression. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L39)requestQueue **requestQueue: [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) A request queue to which the URLs will be enqueued. 
### [**](#selector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L45)selector **selector: string A CSS selector matching elements to be clicked on. Unlike in enqueueLinks, there is no default value. This is to prevent suboptimal use of this function by using it too broadly. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L181)optionalskipNavigation **skipNavigation? : boolean = false If set to `true`, tells the crawler to skip navigation and process the request directly. ### [**](#transformRequestFunction)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L140)optionaltransformRequestFunction **transformRequestFunction? : [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `useExtendedUniqueKey: true` to the `request` object, `uniqueKey` will be computed from a combination of `url`, `method` and `payload` which enables crawling of websites that navigate using form submits (POST requests). **Example:** ``` { transformRequestFunction: (request) => { request.userData.foo = 'bar'; request.useExtendedUniqueKey = true; return request; } } ``` ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L48)optionaluserData **userData? : Dictionary Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. ### [**](#waitForPageIdleSecs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L157)optionalwaitForPageIdleSecs **waitForPageIdleSecs? : number = 1 Clicking in the page triggers various asynchronous operations that lead to new URLs being shown by the browser. It could be a simple JavaScript redirect or opening of a new tab in the browser. These events often happen only some time after the actual click. Requests typically take milliseconds while new tabs open in hundreds of milliseconds. To be able to capture all those events, the `enqueueLinksByClickingElements()` function repeatedly waits for the `waitForPageIdleSecs`. By repeatedly we mean that whenever a relevant event is triggered, the timer is restarted. As long as new events keep coming, the function will not return, unless the below `maxWaitForPageIdleSecs` timeout is reached. You may want to reduce this for example when you're sure that your clicks do not open new tabs, or increase when you're not getting all the expected URLs. --- # playwrightUtils A namespace that contains various utilities for [Playwright](https://github.com/microsoft/playwright) - the headless Chrome Node API. 
**Example usage:** ``` import { launchPlaywright, playwrightUtils } from 'crawlee'; // Navigate to https://example.com in Playwright with a POST request const browser = await launchPlaywright(); const page = await browser.newPage(); await playwrightUtils.gotoExtended(page, { url: 'https://example.com', method: 'POST', }); ``` ## Index[**](#Index) ### Interfaces * [**BlockRequestsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#BlockRequestsOptions) * [**CompiledScriptParams](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptParams) * [**DirectNavigationOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#DirectNavigationOptions) * [**InfiniteScrollOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InfiniteScrollOptions) * [**InjectFileOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InjectFileOptions) * [**SaveSnapshotOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#SaveSnapshotOptions) ### Type Aliases * [**CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) ### Functions * [**blockRequests](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#blockRequests) * [**closeCookieModals](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#closeCookieModals) * [**compileScript](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#compileScript) * [**enqueueLinksByClickingElements](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#enqueueLinksByClickingElements) * [**gotoExtended](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#gotoExtended) * [**infiniteScroll](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#infiniteScroll) * [**injectFile](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#injectFile) * [**injectJQuery](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#injectJQuery) * [**parseWithCheerio](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#parseWithCheerio) * [**registerUtilsToContext](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#registerUtilsToContext) * [**saveSnapshot](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#saveSnapshot) ## Interfaces[**](#Interfaces) ### [**](#BlockRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L64)BlockRequestsOptions **BlockRequestsOptions: ### [**](#extraUrlPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L76)optionalextraUrlPatterns **extraUrlPatterns? : string\[] If you just want to append to the default blocked patterns, use this property. ### [**](#urlPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L71)optionalurlPatterns **urlPatterns? : string\[] The patterns of URLs to block from being loaded by the browser. Only `*` can be used as a wildcard. It is also automatically added to the beginning and end of the pattern. This limitation is enforced by the DevTools protocol. `.png` is the same as `*.png*`.
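For instance, to drop the defaults and block only specific patterns, `urlPatterns` can be passed instead of `extraUrlPatterns` (a minimal sketch; the patterns shown are illustrative):

```
// Replaces the default list entirely; only URLs containing these substrings are blocked.
await playwrightUtils.blockRequests(page, {
    urlPatterns: ['.mp4', '.webm'],
});
```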
### [**](#CompiledScriptParams)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L315)CompiledScriptParams **CompiledScriptParams: ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L316)page **page: Page ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L317)request **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ ### [**](#DirectNavigationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L154)DirectNavigationOptions **DirectNavigationOptions: ### [**](#referer)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L174)optionalreferer **referer? : string Referer header value. If provided it will take preference over the referer header value set by page.setExtraHTTPHeaders(headers). ### [**](#timeout)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L161)optionaltimeout **timeout? : number Maximum operation time in milliseconds, defaults to 30 seconds, pass `0` to disable timeout. The default value can be changed by using the browserContext.setDefaultNavigationTimeout(timeout), browserContext.setDefaultTimeout(timeout), page.setDefaultNavigationTimeout(timeout) or page.setDefaultTimeout(timeout) methods. ### [**](#waitUntil)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L169)optionalwaitUntil **waitUntil? : domcontentloaded | load | networkidle When to consider operation succeeded, defaults to `load`. Events can be either: * `'domcontentloaded'` - consider operation to be finished when the `DOMContentLoaded` event is fired. * `'load'` - consider operation to be finished when the `load` event is fired. * `'networkidle'` - consider operation to be finished when there are no network connections for at least `500` ms. ### [**](#InfiniteScrollOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L364)InfiniteScrollOptions **InfiniteScrollOptions: ### [**](#buttonSelector)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L392)optionalbuttonSelector **buttonSelector? : string Optionally checks and clicks a button if it appears while scrolling. This is required on some websites for the scroll to work. ### [**](#maxScrollHeight)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L375)optionalmaxScrollHeight **maxScrollHeight? : number = 0 How many pixels to scroll down. If 0, will scroll until bottom of page. ### [**](#scrollDownAndUp)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L387)optionalscrollDownAndUp **scrollDownAndUp? : boolean = false If true, it will scroll up a bit after each scroll down. This is required on some websites for the scroll to work. ### [**](#stopScrollCallback)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L397)optionalstopScrollCallback **stopScrollCallback? 
: () => unknown This function is called after every scroll and stops the scrolling process if it returns `true`. The function can be `async`. *** #### Type declaration * * **(): unknown - #### Returns unknown ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L369)optionaltimeoutSecs **timeoutSecs? : number = 0 How many seconds to scroll for. If 0, will scroll until bottom of page. ### [**](#waitForSecs)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L381)optionalwaitForSecs **waitForSecs? : number = 4 How many seconds to wait for no new content to load before exiting. ### [**](#InjectFileOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L55)InjectFileOptions **InjectFileOptions: ### [**](#surviveNavigations)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L61)optionalsurviveNavigations **surviveNavigations? : boolean Enables the injected script to survive page navigations and reloads without the need to be re-injected manually. This does not mean, however, that internal state will be preserved. Just that it will be automatically re-injected on each navigation before any other scripts get the chance to execute. ### [**](#SaveSnapshotOptions)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L506)SaveSnapshotOptions **SaveSnapshotOptions: ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L541)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Configuration of the crawler that will be used to save the snapshot. ### [**](#key)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L511)optionalkey **key? : string = 'SNAPSHOT' Key under which the screenshot and HTML will be saved. `.jpg` will be appended for the screenshot and `.html` for the HTML. ### [**](#keyValueStoreName)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L535)optionalkeyValueStoreName **keyValueStoreName? : null | string = null Name or ID of the Key-Value store where the snapshot is saved. By default, it is saved to the default Key-Value store. ### [**](#saveHtml)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L529)optionalsaveHtml **saveHtml? : boolean = true If true, it will save the full HTML of the current page as a record with `key` appended by `.html`. ### [**](#saveScreenshot)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L523)optionalsaveScreenshot **saveScreenshot? : boolean = true If true, it will save a full screenshot of the current page as a record with `key` appended by `.jpg`. ### [**](#screenshotQuality)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L517)optionalscreenshotQuality **screenshotQuality? : number = 50 The quality of the image, between 0 and 100. Higher quality images have bigger size and require more storage.
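Putting these options together, a sketch of a standalone call; the key and store name here are illustrative, not defaults:

```
await playwrightUtils.saveSnapshot(page, {
    key: 'CHECKOUT-PAGE',
    screenshotQuality: 80,
    // Hypothetical named store; omit to use the default Key-Value store.
    keyValueStoreName: 'debug-snapshots',
});
```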
## Type Aliases[**](<#Type Aliases>) ### [**](#CompiledScriptFunction)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L320)CompiledScriptFunction **CompiledScriptFunction: (params) => Promise\ #### Type declaration * * **(params): Promise\ - #### Parameters * ##### params: [CompiledScriptParams](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptParams) #### Returns Promise\ ## Functions[**](#Functions) ### [**](#blockRequests)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L291)blockRequests * ****blockRequests**(page, options): Promise\ - > This is a **Chromium-only feature.** > > Using this option with Firefox and WebKit browsers doesn't have any effect. To set up request blocking for these browsers, use `page.route()` instead. Forces the Playwright browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources. By default, the function will block all URLs including the following patterns: ``` [".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"] ``` If you want to extend this list further, use the `extraUrlPatterns` option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the `urlPatterns` option, which will override the defaults and block only URLs with your custom patterns. This function does not use Playwright's request interception and therefore does not interfere with browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception. The function will never block main document loads and their respective redirects. **Example usage** ``` import { launchPlaywright, playwrightUtils } from 'crawlee'; const browser = await launchPlaywright(); const page = await browser.newPage(); // Block all requests to URLs that include `adsbygoogle.js` and also all defaults. await playwrightUtils.blockRequests(page, { extraUrlPatterns: ['adsbygoogle.js'], }); await page.goto('https://cnn.com'); ``` *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. 
* ##### optionaloptions: [BlockRequestsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#BlockRequestsOptions) = {} #### Returns Promise\ ### [**](#closeCookieModals)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L659)closeCookieModals * ****closeCookieModals**(page): Promise\ - #### Parameters * ##### page: Page #### Returns Promise\ ### [**](#compileScript)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L348)compileScript * ****compileScript**(scriptString, context): [CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) - Compiles a Playwright script into an async function that may be executed at any time by providing it with the following object: ``` { page: Page, request: Request, } ``` Where `page` is a Playwright [`Page`](https://playwright.dev/docs/api/class-page) and `request` is a [Request](https://crawlee.dev/js/api/core/class/Request.md). The function is compiled by using the `scriptString` parameter as the function's body, so any limitations to function bodies apply. Return value of the compiled function is the return value of the function body = the `scriptString` parameter. As a security measure, no globals such as `process` or `require` are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore you should only use this function to execute sanitized or safe code. Custom context may also be provided using the `context` parameter. To improve security, make sure to only pass the really necessary objects to the context. Preferably making secured copies beforehand. *** #### Parameters * ##### scriptString: string * ##### context: Dictionary = ... #### Returns [CompiledScriptFunction](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#CompiledScriptFunction) ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/enqueue-links/click-elements.ts#L225)enqueueLinksByClickingElements * ****enqueueLinksByClickingElements**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - The function finds elements matching a specific CSS selector in a Playwright page, clicks all those elements using a mouse move and a left mouse button click and intercepts all the navigation requests that are subsequently produced by the page. The intercepted requests, including their methods, headers and payloads are then enqueued to a provided [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This is useful to crawl JavaScript heavy pages where links are not available in `href` elements, but rather navigations are triggered in click handlers. If you're looking to find URLs in `href` attributes of the page, see enqueueLinks. Optionally, the function allows you to filter the target links' URLs using an array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) objects and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. 
**IMPORTANT**: To be able to do this, this function uses various mutations on the page, such as changing the Z-index of elements being clicked and their visibility. Therefore, it is recommended to only use this function as the last operation in the page. **USING HEADFUL BROWSER**: When using a headful browser, this function will only be able to click elements in the focused tab, effectively limiting concurrency to 1. In headless mode, full concurrency can be achieved. **PERFORMANCE**: Clicking elements with a mouse and intercepting requests is not a low-level operation that takes nanoseconds. It's not very CPU intensive, but it takes time. We strongly recommend limiting the scope of the clicking as much as possible by using a specific selector that targets only the elements that you assume or know will produce a navigation. You can certainly click everything by using the `*` selector, but be prepared to wait minutes to get results on a large and complex page. **Example usage** ``` await playwrightUtils.enqueueLinksByClickingElements({ page, requestQueue, selector: 'a.product-detail', pseudoUrls: [ 'https://www.example.com/handbags/[.*]', 'https://www.example.com/purses/[.*]' ], }); ``` *** #### Parameters * ##### options: [EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightClickElements.md#EnqueueLinksByClickingElementsOptions) #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#gotoExtended)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L189)gotoExtended * ****gotoExtended**(page, request, gotoOptions): Promise\ - Extended version of Playwright's `page.goto()` allowing you to perform requests with an HTTP method other than GET, with custom headers and a POST payload. The URL, method, headers and payload are taken from the `request` parameter, which must be an instance of the Request class. *NOTE:* In recent versions of Playwright, using requests other than GET, overriding headers, or adding payloads disables the browser cache, which degrades performance. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### optionalgotoOptions: [DirectNavigationOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#DirectNavigationOptions) = {} Custom options for `page.goto()`. #### Returns Promise\ ### [**](#infiniteScroll)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L406)infiniteScroll * ****infiniteScroll**(page, options): Promise\ - Scrolls to the bottom of a page, or until it times out. Loads dynamic content when it hits the bottom of a page, and then continues scrolling. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object.
* ##### optionaloptions: [InfiniteScrollOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InfiniteScrollOptions) = {} #### Returns Promise\ ### [**](#injectFile)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L95)injectFile * ****injectFile**(page, filePath, options): Promise\ - Injects a JavaScript file into a Playwright page. Unlike Playwright's `addScriptTag` function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies. File contents are cached for up to 10 files to limit file system access. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. * ##### filePath: string File path * ##### optionaloptions: [InjectFileOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#InjectFileOptions) = {} #### Returns Promise\ ### [**](#injectJQuery)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L149)injectJQuery * ****injectJQuery**(page, options): Promise\ - Injects the [jQuery](https://jquery.com/) library into a Playwright page. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors. Beware that the injected jQuery object will be set to the `window.$` variable and thus it might cause conflicts with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect functionality of page's scripts. The injected jQuery will survive page navigations and reloads by default. **Example usage:** ``` await playwrightUtils.injectJQuery(page); const title = await page.evaluate(() => { return $('head title').text(); }); ``` Note that `injectJQuery()` does not affect the Playwright [`page.$()`](https://playwright.dev/docs/api/class-page#page-query-selector) function in any way. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. * ##### optionaloptions: { surviveNavigations?: boolean } * ##### optionalsurviveNavigations: boolean Opt-out option to disable the JQuery reinjection after navigation. #### Returns Promise\ ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L610)parseWithCheerio * ****parseWithCheerio**(page, ignoreShadowRoots, ignoreIframes): Promise<[CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot)> - Returns Cheerio handle for `page.content()`, allowing to work with the data same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). **Example usage:** ``` const $ = await playwrightUtils.parseWithCheerio(page); const title = $('title').text(); ``` *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. 
* ##### ignoreShadowRoots: boolean = false * ##### ignoreIframes: boolean = false #### Returns Promise<[CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot)> ### [**](#registerUtilsToContext)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L1022)registerUtilsToContext * ****registerUtilsToContext**(context, crawlerOptions): void - #### Parameters * ##### context: [PlaywrightCrawlingContext](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlingContext.md)\ * ##### crawlerOptions: [PlaywrightCrawlerOptions](https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions.md) #### Returns void ### [**](#saveSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L549)saveSnapshot * ****saveSnapshot**(page, options): Promise\ - Saves a full screenshot and HTML of the current page into a Key-Value store. *** #### Parameters * ##### page: Page Playwright [`Page`](https://playwright.dev/docs/api/class-page) object. * ##### optionaloptions: [SaveSnapshotOptions](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils.md#SaveSnapshotOptions) = {} #### Returns Promise\ --- # @crawlee/puppeteer Provides a simple framework for parallel crawling of web pages using headless Chrome with [Puppeteer](https://github.com/puppeteer/puppeteer). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `PuppeteerCrawler` uses headless Chrome to download web pages and extract data, it is useful for crawling of websites that require to execute JavaScript. If the target website doesn't need JavaScript, consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [PuppeteerCrawlerOptions.requestList](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestList) or [PuppeteerCrawlerOptions.requestQueue](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestQueue) constructor options, respectively. If both [PuppeteerCrawlerOptions.requestList](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestList) and [PuppeteerCrawlerOptions.requestQueue](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `PuppeteerCrawler` opens a new Chrome page (i.e. 
`PuppeteerCrawler` opens a new Chrome page (i.e. tab) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by the user as the [PuppeteerCrawlerOptions.requestHandler](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [PuppeteerCrawlerOptions.autoscaledPoolOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#autoscaledPoolOptions) parameter of the `PuppeteerCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) are available directly in the `PuppeteerCrawler` constructor. Note that the pool of Puppeteer instances is internally managed by the [BrowserPool](https://github.com/apify/browser-pool) class. ## Example usage[​](#example-usage "Direct link to Example usage")

```
import { Dataset, PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // This function is called to extract data from a single web page
        // 'page' is an instance of Puppeteer.Page with page.goto(request.url) already called
        // 'request' is an instance of the Request class with information about the page to load
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
    async failedRequestHandler({ request }) {
        // This function is called when the crawling of a request failed too many times
        await Dataset.pushData({
            url: request.url,
            succeeded: false,
            errors: request.errorMessages,
        });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);
```

## Index[**](#Index) ### Crawlers * [**PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) ### Other * [**AddRequestsBatchedOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#AddRequestsBatchedOptions) * [**AddRequestsBatchedResult](https://crawlee.dev/js/api/puppeteer-crawler.md#AddRequestsBatchedResult) * [**AutoscaledPool](https://crawlee.dev/js/api/puppeteer-crawler.md#AutoscaledPool) * [**AutoscaledPoolOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#AutoscaledPoolOptions) * [**BaseHttpClient](https://crawlee.dev/js/api/puppeteer-crawler.md#BaseHttpClient) * [**BaseHttpResponseData](https://crawlee.dev/js/api/puppeteer-crawler.md#BaseHttpResponseData) * [**BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/puppeteer-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) * [**BasicCrawler](https://crawlee.dev/js/api/puppeteer-crawler.md#BasicCrawler) * [**BasicCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#BasicCrawlerOptions) * [**BasicCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler.md#BasicCrawlingContext) * [**BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/puppeteer-crawler.md#BLOCKED_STATUS_CODES) * [**BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#BlockRequestsOptions) * [**BrowserCrawler](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserCrawler) * [**BrowserCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserCrawlerOptions) *
[**BrowserCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserCrawlingContext) * [**BrowserErrorHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserErrorHandler) * [**BrowserHook](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserHook) * [**BrowserLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserLaunchContext) * [**BrowserRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#BrowserRequestHandler) * [**checkStorageAccess](https://crawlee.dev/js/api/puppeteer-crawler.md#checkStorageAccess) * [**ClientInfo](https://crawlee.dev/js/api/puppeteer-crawler.md#ClientInfo) * [**CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler.md#CompiledScriptFunction) * [**CompiledScriptParams](https://crawlee.dev/js/api/puppeteer-crawler.md#CompiledScriptParams) * [**Configuration](https://crawlee.dev/js/api/puppeteer-crawler.md#Configuration) * [**ConfigurationOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#ConfigurationOptions) * [**Cookie](https://crawlee.dev/js/api/puppeteer-crawler.md#Cookie) * [**CrawlerAddRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlerAddRequestsOptions) * [**CrawlerAddRequestsResult](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlerAddRequestsResult) * [**CrawlerExperiments](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlerExperiments) * [**CrawlerRunOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlerRunOptions) * [**CrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler.md#CrawlingContext) * [**createBasicRouter](https://crawlee.dev/js/api/puppeteer-crawler.md#createBasicRouter) * [**CreateContextOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#CreateContextOptions) * [**CreateSession](https://crawlee.dev/js/api/puppeteer-crawler.md#CreateSession) * [**CriticalError](https://crawlee.dev/js/api/puppeteer-crawler.md#CriticalError) * [**Dataset](https://crawlee.dev/js/api/puppeteer-crawler.md#Dataset) * [**DatasetConsumer](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetConsumer) * [**DatasetContent](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetContent) * [**DatasetDataOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetDataOptions) * [**DatasetExportOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetExportOptions) * [**DatasetExportToOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetExportToOptions) * [**DatasetIteratorOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetIteratorOptions) * [**DatasetMapper](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetMapper) * [**DatasetOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetOptions) * [**DatasetReducer](https://crawlee.dev/js/api/puppeteer-crawler.md#DatasetReducer) * [**enqueueLinks](https://crawlee.dev/js/api/puppeteer-crawler.md#enqueueLinks) * [**EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#EnqueueLinksByClickingElementsOptions) * [**EnqueueLinksOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#EnqueueLinksOptions) * [**EnqueueStrategy](https://crawlee.dev/js/api/puppeteer-crawler.md#EnqueueStrategy) * [**ErrnoException](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrnoException) * [**ErrorHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrorHandler) * [**ErrorSnapshotter](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrorSnapshotter) * [**ErrorTracker](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrorTracker) * 
[**ErrorTrackerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#ErrorTrackerOptions) * [**EventManager](https://crawlee.dev/js/api/puppeteer-crawler.md#EventManager) * [**EventType](https://crawlee.dev/js/api/puppeteer-crawler.md#EventType) * [**EventTypeName](https://crawlee.dev/js/api/puppeteer-crawler.md#EventTypeName) * [**filterRequestsByPatterns](https://crawlee.dev/js/api/puppeteer-crawler.md#filterRequestsByPatterns) * [**FinalStatistics](https://crawlee.dev/js/api/puppeteer-crawler.md#FinalStatistics) * [**GetUserDataFromRequest](https://crawlee.dev/js/api/puppeteer-crawler.md#GetUserDataFromRequest) * [**GlobInput](https://crawlee.dev/js/api/puppeteer-crawler.md#GlobInput) * [**GlobObject](https://crawlee.dev/js/api/puppeteer-crawler.md#GlobObject) * [**GotScrapingHttpClient](https://crawlee.dev/js/api/puppeteer-crawler.md#GotScrapingHttpClient) * [**HttpRequest](https://crawlee.dev/js/api/puppeteer-crawler.md#HttpRequest) * [**HttpRequestOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#HttpRequestOptions) * [**HttpResponse](https://crawlee.dev/js/api/puppeteer-crawler.md#HttpResponse) * [**InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#InfiniteScrollOptions) * [**InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#InjectFileOptions) * [**InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#InterceptHandler) * [**IRequestList](https://crawlee.dev/js/api/puppeteer-crawler.md#IRequestList) * [**IRequestManager](https://crawlee.dev/js/api/puppeteer-crawler.md#IRequestManager) * [**IStorage](https://crawlee.dev/js/api/puppeteer-crawler.md#IStorage) * [**KeyConsumer](https://crawlee.dev/js/api/puppeteer-crawler.md#KeyConsumer) * [**KeyValueStore](https://crawlee.dev/js/api/puppeteer-crawler.md#KeyValueStore) * [**KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#KeyValueStoreIteratorOptions) * [**KeyValueStoreOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#KeyValueStoreOptions) * [**LoadedRequest](https://crawlee.dev/js/api/puppeteer-crawler.md#LoadedRequest) * [**LocalEventManager](https://crawlee.dev/js/api/puppeteer-crawler.md#LocalEventManager) * [**log](https://crawlee.dev/js/api/puppeteer-crawler.md#log) * [**Log](https://crawlee.dev/js/api/puppeteer-crawler.md#Log) * [**Logger](https://crawlee.dev/js/api/puppeteer-crawler.md#Logger) * [**LoggerJson](https://crawlee.dev/js/api/puppeteer-crawler.md#LoggerJson) * [**LoggerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#LoggerOptions) * [**LoggerText](https://crawlee.dev/js/api/puppeteer-crawler.md#LoggerText) * [**LogLevel](https://crawlee.dev/js/api/puppeteer-crawler.md#LogLevel) * [**MAX\_POOL\_SIZE](https://crawlee.dev/js/api/puppeteer-crawler.md#MAX_POOL_SIZE) * [**NonRetryableError](https://crawlee.dev/js/api/puppeteer-crawler.md#NonRetryableError) * [**PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/puppeteer-crawler.md#PERSIST_STATE_KEY) * [**PersistenceOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PersistenceOptions) * [**processHttpRequestOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#processHttpRequestOptions) * [**ProxyConfiguration](https://crawlee.dev/js/api/puppeteer-crawler.md#ProxyConfiguration) * [**ProxyConfigurationFunction](https://crawlee.dev/js/api/puppeteer-crawler.md#ProxyConfigurationFunction) * [**ProxyConfigurationOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#ProxyConfigurationOptions) * [**ProxyInfo](https://crawlee.dev/js/api/puppeteer-crawler.md#ProxyInfo) * 
[**PseudoUrl](https://crawlee.dev/js/api/puppeteer-crawler.md#PseudoUrl) * [**PseudoUrlInput](https://crawlee.dev/js/api/puppeteer-crawler.md#PseudoUrlInput) * [**PseudoUrlObject](https://crawlee.dev/js/api/puppeteer-crawler.md#PseudoUrlObject) * [**PuppeteerDirectNavigationOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PuppeteerDirectNavigationOptions) * [**purgeDefaultStorages](https://crawlee.dev/js/api/puppeteer-crawler.md#purgeDefaultStorages) * [**PushErrorMessageOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PushErrorMessageOptions) * [**QueueOperationInfo](https://crawlee.dev/js/api/puppeteer-crawler.md#QueueOperationInfo) * [**RecordOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RecordOptions) * [**RecoverableState](https://crawlee.dev/js/api/puppeteer-crawler.md#RecoverableState) * [**RecoverableStateOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RecoverableStateOptions) * [**RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RecoverableStatePersistenceOptions) * [**RedirectHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#RedirectHandler) * [**RegExpInput](https://crawlee.dev/js/api/puppeteer-crawler.md#RegExpInput) * [**RegExpObject](https://crawlee.dev/js/api/puppeteer-crawler.md#RegExpObject) * [**Request](https://crawlee.dev/js/api/puppeteer-crawler.md#Request) * [**RequestHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestHandler) * [**RequestHandlerResult](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestHandlerResult) * [**RequestList](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestList) * [**RequestListOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestListOptions) * [**RequestListSourcesFunction](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestListSourcesFunction) * [**RequestListState](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestListState) * [**RequestManagerTandem](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestManagerTandem) * [**RequestOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestOptions) * [**RequestProvider](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestProvider) * [**RequestProviderOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestProviderOptions) * [**RequestQueue](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueue) * [**RequestQueueOperationOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueueOperationOptions) * [**RequestQueueOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueueOptions) * [**RequestQueueV1](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueueV1) * [**RequestQueueV2](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestQueueV2) * [**RequestsLike](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestsLike) * [**RequestState](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestState) * [**RequestTransform](https://crawlee.dev/js/api/puppeteer-crawler.md#RequestTransform) * [**ResponseLike](https://crawlee.dev/js/api/puppeteer-crawler.md#ResponseLike) * [**ResponseTypes](https://crawlee.dev/js/api/puppeteer-crawler.md#ResponseTypes) * [**RestrictedCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler.md#RestrictedCrawlingContext) * [**RetryRequestError](https://crawlee.dev/js/api/puppeteer-crawler.md#RetryRequestError) * [**Router](https://crawlee.dev/js/api/puppeteer-crawler.md#Router) * [**RouterHandler](https://crawlee.dev/js/api/puppeteer-crawler.md#RouterHandler) * 
[**RouterRoutes](https://crawlee.dev/js/api/puppeteer-crawler.md#RouterRoutes) * [**SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SaveSnapshotOptions) * [**Session](https://crawlee.dev/js/api/puppeteer-crawler.md#Session) * [**SessionError](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionError) * [**SessionOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionOptions) * [**SessionPool](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionPool) * [**SessionPoolOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionPoolOptions) * [**SessionState](https://crawlee.dev/js/api/puppeteer-crawler.md#SessionState) * [**SitemapRequestList](https://crawlee.dev/js/api/puppeteer-crawler.md#SitemapRequestList) * [**SitemapRequestListOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SitemapRequestListOptions) * [**SkippedRequestCallback](https://crawlee.dev/js/api/puppeteer-crawler.md#SkippedRequestCallback) * [**SkippedRequestReason](https://crawlee.dev/js/api/puppeteer-crawler.md#SkippedRequestReason) * [**SnapshotResult](https://crawlee.dev/js/api/puppeteer-crawler.md#SnapshotResult) * [**Snapshotter](https://crawlee.dev/js/api/puppeteer-crawler.md#Snapshotter) * [**SnapshotterOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SnapshotterOptions) * [**Source](https://crawlee.dev/js/api/puppeteer-crawler.md#Source) * [**StatisticPersistedState](https://crawlee.dev/js/api/puppeteer-crawler.md#StatisticPersistedState) * [**Statistics](https://crawlee.dev/js/api/puppeteer-crawler.md#Statistics) * [**StatisticsOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#StatisticsOptions) * [**StatisticState](https://crawlee.dev/js/api/puppeteer-crawler.md#StatisticState) * [**StatusMessageCallback](https://crawlee.dev/js/api/puppeteer-crawler.md#StatusMessageCallback) * [**StatusMessageCallbackParams](https://crawlee.dev/js/api/puppeteer-crawler.md#StatusMessageCallbackParams) * [**StorageClient](https://crawlee.dev/js/api/puppeteer-crawler.md#StorageClient) * [**StorageManagerOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#StorageManagerOptions) * [**StreamingHttpResponse](https://crawlee.dev/js/api/puppeteer-crawler.md#StreamingHttpResponse) * [**SystemInfo](https://crawlee.dev/js/api/puppeteer-crawler.md#SystemInfo) * [**SystemStatus](https://crawlee.dev/js/api/puppeteer-crawler.md#SystemStatus) * [**SystemStatusOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#SystemStatusOptions) * [**TieredProxy](https://crawlee.dev/js/api/puppeteer-crawler.md#TieredProxy) * [**tryAbsoluteURL](https://crawlee.dev/js/api/puppeteer-crawler.md#tryAbsoluteURL) * [**UrlPatternObject](https://crawlee.dev/js/api/puppeteer-crawler.md#UrlPatternObject) * [**useState](https://crawlee.dev/js/api/puppeteer-crawler.md#useState) * [**UseStateOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#UseStateOptions) * [**withCheckedStorageAccess](https://crawlee.dev/js/api/puppeteer-crawler.md#withCheckedStorageAccess) * [**puppeteerClickElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md) * [**puppeteerRequestInterception](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md) * [**puppeteerUtils](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md) * [**PuppeteerCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md) * 
[**PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md) * [**PuppeteerHook](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerHook.md) * [**PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) * [**PuppeteerRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerRequestHandler.md) * [**PuppeteerGoToOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PuppeteerGoToOptions) * [**createPuppeteerRouter](https://crawlee.dev/js/api/puppeteer-crawler/function/createPuppeteerRouter.md) * [**launchPuppeteer](https://crawlee.dev/js/api/puppeteer-crawler/function/launchPuppeteer.md) ## Other[**](#__CATEGORY__) ### [**](#AddRequestsBatchedOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L965)AddRequestsBatchedOptions Re-exports [AddRequestsBatchedOptions](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedOptions.md) ### [**](#AddRequestsBatchedResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L983)AddRequestsBatchedResult Re-exports [AddRequestsBatchedResult](https://crawlee.dev/js/api/core/interface/AddRequestsBatchedResult.md) ### [**](#AutoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L180)AutoscaledPool Re-exports [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) ### [**](#AutoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L16)AutoscaledPoolOptions Re-exports [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) ### [**](#BaseHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L179)BaseHttpClient Re-exports [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) ### [**](#BaseHttpResponseData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L130)BaseHttpResponseData Re-exports [BaseHttpResponseData](https://crawlee.dev/js/api/core/interface/BaseHttpResponseData.md) ### [**](#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/constants.ts#L6)BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS Re-exports [BASIC\_CRAWLER\_TIMEOUT\_BUFFER\_SECS](https://crawlee.dev/js/api/basic-crawler.md#BASIC_CRAWLER_TIMEOUT_BUFFER_SECS) ### [**](#BasicCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L485)BasicCrawler Re-exports [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) ### [**](#BasicCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L133)BasicCrawlerOptions Re-exports [BasicCrawlerOptions](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md) ### [**](#BasicCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L71)BasicCrawlingContext Re-exports [BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md) ### [**](#BLOCKED_STATUS_CODES)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L1)BLOCKED\_STATUS\_CODES Re-exports 
[BLOCKED\_STATUS\_CODES](https://crawlee.dev/js/api/core.md#BLOCKED_STATUS_CODES) ### [**](#BlockRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L10)BlockRequestsOptions Re-exports [BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#BlockRequestsOptions) ### [**](#BrowserCrawler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L314)BrowserCrawler Re-exports [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md) ### [**](#BrowserCrawlerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L75)BrowserCrawlerOptions Re-exports [BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md) ### [**](#BrowserCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L52)BrowserCrawlingContext Re-exports [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) ### [**](#BrowserErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L67)BrowserErrorHandler Re-exports [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler) ### [**](#BrowserHook)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L70)BrowserHook Re-exports [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook) ### [**](#BrowserLaunchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L14)BrowserLaunchContext Re-exports [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md) ### [**](#BrowserRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L64)BrowserRequestHandler Re-exports [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler) ### [**](#checkStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L10)checkStorageAccess Re-exports [checkStorageAccess](https://crawlee.dev/js/api/core/function/checkStorageAccess.md) ### [**](#ClientInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L79)ClientInfo Re-exports [ClientInfo](https://crawlee.dev/js/api/core/interface/ClientInfo.md) ### [**](#CompiledScriptFunction)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L11)CompiledScriptFunction Re-exports [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) ### [**](#CompiledScriptParams)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L12)CompiledScriptParams Re-exports [CompiledScriptParams](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptParams) ### [**](#Configuration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L247)Configuration Re-exports [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) ### 
[**](#ConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/configuration.ts#L16)ConfigurationOptions Re-exports [ConfigurationOptions](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#CrawlerAddRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2035)CrawlerAddRequestsOptions Re-exports [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) ### [**](#CrawlerAddRequestsResult)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2037)CrawlerAddRequestsResult Re-exports [CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md) ### [**](#CrawlerExperiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L411)CrawlerExperiments Re-exports [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) ### [**](#CrawlerRunOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2039)CrawlerRunOptions Re-exports [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) ### [**](#CrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L111)CrawlingContext Re-exports [CrawlingContext](https://crawlee.dev/js/api/core/interface/CrawlingContext.md) ### [**](#createBasicRouter)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2081)createBasicRouter Re-exports [createBasicRouter](https://crawlee.dev/js/api/basic-crawler/function/createBasicRouter.md) ### [**](#CreateContextOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L2029)CreateContextOptions Re-exports [CreateContextOptions](https://crawlee.dev/js/api/basic-crawler/interface/CreateContextOptions.md) ### [**](#CreateSession)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L22)CreateSession Re-exports [CreateSession](https://crawlee.dev/js/api/core/interface/CreateSession.md) ### [**](#CriticalError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L10)CriticalError Re-exports [CriticalError](https://crawlee.dev/js/api/core/class/CriticalError.md) ### [**](#Dataset)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L232)Dataset Re-exports [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) ### [**](#DatasetConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L703)DatasetConsumer Re-exports [DatasetConsumer](https://crawlee.dev/js/api/core/interface/DatasetConsumer.md) ### [**](#DatasetContent)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L742)DatasetContent Re-exports [DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md) ### [**](#DatasetDataOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L92)DatasetDataOptions Re-exports 
[DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md) ### [**](#DatasetExportOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L144)DatasetExportOptions Re-exports [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) ### [**](#DatasetExportToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L176)DatasetExportToOptions Re-exports [DatasetExportToOptions](https://crawlee.dev/js/api/core/interface/DatasetExportToOptions.md) ### [**](#DatasetIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L152)DatasetIteratorOptions Re-exports [DatasetIteratorOptions](https://crawlee.dev/js/api/core/interface/DatasetIteratorOptions.md) ### [**](#DatasetMapper)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L714)DatasetMapper Re-exports [DatasetMapper](https://crawlee.dev/js/api/core/interface/DatasetMapper.md) ### [**](#DatasetOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L735)DatasetOptions Re-exports [DatasetOptions](https://crawlee.dev/js/api/core/interface/DatasetOptions.md) ### [**](#DatasetReducer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts#L726)DatasetReducer Re-exports [DatasetReducer](https://crawlee.dev/js/api/core/interface/DatasetReducer.md) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L274)enqueueLinks Re-exports [enqueueLinks](https://crawlee.dev/js/api/core/function/enqueueLinks.md) ### [**](#EnqueueLinksByClickingElementsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L20)EnqueueLinksByClickingElementsOptions Re-exports [EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#EnqueueLinksByClickingElementsOptions) ### [**](#EnqueueLinksOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L34)EnqueueLinksOptions Re-exports [EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) ### [**](#EnqueueStrategy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/enqueue_links.ts#L216)EnqueueStrategy Re-exports [EnqueueStrategy](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md) ### [**](#ErrnoException)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L9)ErrnoException Re-exports [ErrnoException](https://crawlee.dev/js/api/core/interface/ErrnoException.md) ### [**](#ErrorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L114)ErrorHandler Re-exports [ErrorHandler](https://crawlee.dev/js/api/basic-crawler.md#ErrorHandler) ### [**](#ErrorSnapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L42)ErrorSnapshotter Re-exports [ErrorSnapshotter](https://crawlee.dev/js/api/core/class/ErrorSnapshotter.md) ### [**](#ErrorTracker)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L286)ErrorTracker Re-exports [ErrorTracker](https://crawlee.dev/js/api/core/class/ErrorTracker.md) ### 
[**](#ErrorTrackerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_tracker.ts#L17)ErrorTrackerOptions Re-exports [ErrorTrackerOptions](https://crawlee.dev/js/api/core/interface/ErrorTrackerOptions.md) ### [**](#EventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L24)EventManager Re-exports [EventManager](https://crawlee.dev/js/api/core/class/EventManager.md) ### [**](#EventType)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L9)EventType Re-exports [EventType](https://crawlee.dev/js/api/core/enum/EventType.md) ### [**](#EventTypeName)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/event_manager.ts#L17)EventTypeName Re-exports [EventTypeName](https://crawlee.dev/js/api/core.md#EventTypeName) ### [**](#filterRequestsByPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L217)filterRequestsByPatterns Re-exports [filterRequestsByPatterns](https://crawlee.dev/js/api/core/function/filterRequestsByPatterns.md) ### [**](#FinalStatistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L85)FinalStatistics Re-exports [FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md) ### [**](#GetUserDataFromRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L15)GetUserDataFromRequest Re-exports [GetUserDataFromRequest](https://crawlee.dev/js/api/core.md#GetUserDataFromRequest) ### [**](#GlobInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L41)GlobInput Re-exports [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) ### [**](#GlobObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L36)GlobObject Re-exports [GlobObject](https://crawlee.dev/js/api/core.md#GlobObject) ### [**](#GotScrapingHttpClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/got-scraping-http-client.ts#L17)GotScrapingHttpClient Re-exports [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#HttpRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L78)HttpRequest Re-exports [HttpRequest](https://crawlee.dev/js/api/core/interface/HttpRequest.md) ### [**](#HttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L111)HttpRequestOptions Re-exports [HttpRequestOptions](https://crawlee.dev/js/api/core/interface/HttpRequestOptions.md) ### [**](#HttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L152)HttpResponse Re-exports [HttpResponse](https://crawlee.dev/js/api/core/interface/HttpResponse.md) ### [**](#InfiniteScrollOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L14)InfiniteScrollOptions Re-exports [InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InfiniteScrollOptions) ### [**](#InjectFileOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L15)InjectFileOptions Re-exports [InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InjectFileOptions) ### 
[**](#InterceptHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L6)InterceptHandler Re-exports [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) ### [**](#IRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L26)IRequestList Re-exports [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) ### [**](#IRequestManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L44)IRequestManager Re-exports [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) ### [**](#IStorage)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L14)IStorage Re-exports [IStorage](https://crawlee.dev/js/api/core/interface/IStorage.md) ### [**](#KeyConsumer)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L724)KeyConsumer Re-exports [KeyConsumer](https://crawlee.dev/js/api/core/interface/KeyConsumer.md) ### [**](#KeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L108)KeyValueStore Re-exports [KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md) ### [**](#KeyValueStoreIteratorOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L758)KeyValueStoreIteratorOptions Re-exports [KeyValueStoreIteratorOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreIteratorOptions.md) ### [**](#KeyValueStoreOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L734)KeyValueStoreOptions Re-exports [KeyValueStoreOptions](https://crawlee.dev/js/api/core/interface/KeyValueStoreOptions.md) ### [**](#LoadedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L21)LoadedRequest Re-exports [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest) ### [**](#LocalEventManager)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/events/local_event_manager.ts#L11)LocalEventManager Re-exports [LocalEventManager](https://crawlee.dev/js/api/core/class/LocalEventManager.md) ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)log Re-exports [log](https://crawlee.dev/js/api/core.md#log) ### [**](#Log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Log Re-exports [Log](https://crawlee.dev/js/api/core/class/Log.md) ### [**](#Logger)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)Logger Re-exports [Logger](https://crawlee.dev/js/api/core/class/Logger.md) ### [**](#LoggerJson)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerJson Re-exports [LoggerJson](https://crawlee.dev/js/api/core/class/LoggerJson.md) ### [**](#LoggerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerOptions Re-exports [LoggerOptions](https://crawlee.dev/js/api/core/interface/LoggerOptions.md) ### [**](#LoggerText)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LoggerText Re-exports [LoggerText](https://crawlee.dev/js/api/core/class/LoggerText.md) ### [**](#LogLevel)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/log.ts#L3)LogLevel Re-exports 
[LogLevel](https://crawlee.dev/js/api/core/enum/LogLevel.md) ### [**](#MAX_POOL_SIZE)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L3)MAX\_POOL\_SIZE Re-exports [MAX\_POOL\_SIZE](https://crawlee.dev/js/api/core.md#MAX_POOL_SIZE) ### [**](#NonRetryableError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L4)NonRetryableError Re-exports [NonRetryableError](https://crawlee.dev/js/api/core/class/NonRetryableError.md) ### [**](#PERSIST_STATE_KEY)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/consts.ts#L2)PERSIST\_STATE\_KEY Re-exports [PERSIST\_STATE\_KEY](https://crawlee.dev/js/api/core.md#PERSIST_STATE_KEY) ### [**](#PersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L41)PersistenceOptions Re-exports [PersistenceOptions](https://crawlee.dev/js/api/core/interface/PersistenceOptions.md) ### [**](#processHttpRequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L196)processHttpRequestOptions Re-exports [processHttpRequestOptions](https://crawlee.dev/js/api/core/function/processHttpRequestOptions.md) ### [**](#ProxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L203)ProxyConfiguration Re-exports [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) ### [**](#ProxyConfigurationFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L9)ProxyConfigurationFunction Re-exports [ProxyConfigurationFunction](https://crawlee.dev/js/api/core/interface/ProxyConfigurationFunction.md) ### [**](#ProxyConfigurationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L15)ProxyConfigurationOptions Re-exports [ProxyConfigurationOptions](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) ### [**](#ProxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L80)ProxyInfo Re-exports [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) ### [**](#PseudoUrl)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L18)PseudoUrl Re-exports [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) ### [**](#PseudoUrlInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L34)PseudoUrlInput Re-exports [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput) ### [**](#PseudoUrlObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L29)PseudoUrlObject Re-exports [PseudoUrlObject](https://crawlee.dev/js/api/core.md#PseudoUrlObject) ### [**](#PuppeteerDirectNavigationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L13)PuppeteerDirectNavigationOptions Renames and re-exports [DirectNavigationOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#DirectNavigationOptions) ### [**](#purgeDefaultStorages)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L33)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L45)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L46)purgeDefaultStorages Re-exports 
[purgeDefaultStorages](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) ### [**](#PushErrorMessageOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L559)PushErrorMessageOptions Re-exports [PushErrorMessageOptions](https://crawlee.dev/js/api/core/interface/PushErrorMessageOptions.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#RecordOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L741)RecordOptions Re-exports [RecordOptions](https://crawlee.dev/js/api/core/interface/RecordOptions.md) ### [**](#RecoverableState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L75)RecoverableState Re-exports [RecoverableState](https://crawlee.dev/js/api/core/class/RecoverableState.md) ### [**](#RecoverableStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L33)RecoverableStateOptions Re-exports [RecoverableStateOptions](https://crawlee.dev/js/api/core/interface/RecoverableStateOptions.md) ### [**](#RecoverableStatePersistenceOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/recoverable_state.ts#L6)RecoverableStatePersistenceOptions Re-exports [RecoverableStatePersistenceOptions](https://crawlee.dev/js/api/core/interface/RecoverableStatePersistenceOptions.md) ### [**](#RedirectHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L171)RedirectHandler Re-exports [RedirectHandler](https://crawlee.dev/js/api/core.md#RedirectHandler) ### [**](#RegExpInput)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L48)RegExpInput Re-exports [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput) ### [**](#RegExpObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L43)RegExpObject Re-exports [RegExpObject](https://crawlee.dev/js/api/core.md#RegExpObject) ### [**](#Request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L84)Request Re-exports [Request](https://crawlee.dev/js/api/core/class/Request.md) ### [**](#RequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L110)RequestHandler Re-exports [RequestHandler](https://crawlee.dev/js/api/basic-crawler.md#RequestHandler) ### [**](#RequestHandlerResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L174)RequestHandlerResult Re-exports [RequestHandlerResult](https://crawlee.dev/js/api/core/class/RequestHandlerResult.md) ### [**](#RequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L300)RequestList Re-exports [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) ### [**](#RequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L91)RequestListOptions Re-exports [RequestListOptions](https://crawlee.dev/js/api/core/interface/RequestListOptions.md) ### [**](#RequestListSourcesFunction)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L1000)RequestListSourcesFunction Re-exports 
[RequestListSourcesFunction](https://crawlee.dev/js/api/core.md#RequestListSourcesFunction) ### [**](#RequestListState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts#L988)RequestListState Re-exports [RequestListState](https://crawlee.dev/js/api/core/interface/RequestListState.md) ### [**](#RequestManagerTandem)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_manager_tandem.ts#L22)RequestManagerTandem Re-exports [RequestManagerTandem](https://crawlee.dev/js/api/core/class/RequestManagerTandem.md) ### [**](#RequestOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L446)RequestOptions Re-exports [RequestOptions](https://crawlee.dev/js/api/core/interface/RequestOptions.md) ### [**](#RequestProvider)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L102)RequestProvider Re-exports [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) ### [**](#RequestProviderOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L907)RequestProviderOptions Re-exports [RequestProviderOptions](https://crawlee.dev/js/api/core/interface/RequestProviderOptions.md) ### [**](#RequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L7)RequestQueue Re-exports [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) ### [**](#RequestQueueOperationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L934)RequestQueueOperationOptions Re-exports [RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md) ### [**](#RequestQueueOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L923)RequestQueueOptions Re-exports [RequestQueueOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOptions.md) ### [**](#RequestQueueV1)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L6)RequestQueueV1 Re-exports [RequestQueueV1](https://crawlee.dev/js/api/core/class/RequestQueueV1.md) ### [**](#RequestQueueV2)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/index.ts#L8)RequestQueueV2 Re-exports [RequestQueueV2](https://crawlee.dev/js/api/core.md#RequestQueueV2) ### [**](#RequestsLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_provider.ts#L39)RequestsLike Re-exports [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) ### [**](#RequestState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L42)RequestState Re-exports [RequestState](https://crawlee.dev/js/api/core/enum/RequestState.md) ### [**](#RequestTransform)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L287)RequestTransform Re-exports [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) ### [**](#ResponseLike)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/cookie_utils.ts#L7)ResponseLike Re-exports [ResponseLike](https://crawlee.dev/js/api/core/interface/ResponseLike.md) ### [**](#ResponseTypes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L39)ResponseTypes Re-exports [ResponseTypes](https://crawlee.dev/js/api/core/interface/ResponseTypes.md) ### 
[**](#RestrictedCrawlingContext)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L30)RestrictedCrawlingContext Re-exports [RestrictedCrawlingContext](https://crawlee.dev/js/api/core/interface/RestrictedCrawlingContext.md) ### [**](#RetryRequestError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L22)RetryRequestError Re-exports [RetryRequestError](https://crawlee.dev/js/api/core/class/RetryRequestError.md) ### [**](#Router)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L86)Router Re-exports [Router](https://crawlee.dev/js/api/core/class/Router.md) ### [**](#RouterHandler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L10)RouterHandler Re-exports [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md) ### [**](#RouterRoutes)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/router.ts#L17)RouterRoutes Re-exports [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes) ### [**](#SaveSnapshotOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/index.ts#L16)SaveSnapshotOptions Re-exports [SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#SaveSnapshotOptions) ### [**](#Session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L100)Session Re-exports [Session](https://crawlee.dev/js/api/core/class/Session.md) ### [**](#SessionError)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/errors.ts#L33)SessionError Re-exports [SessionError](https://crawlee.dev/js/api/core/class/SessionError.md) ### [**](#SessionOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L37)SessionOptions Re-exports [SessionOptions](https://crawlee.dev/js/api/core/interface/SessionOptions.md) ### [**](#SessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L137)SessionPool Re-exports [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) ### [**](#SessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session_pool.ts#L30)SessionPoolOptions Re-exports [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) ### [**](#SessionState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/session_pool/session.ts#L24)SessionState Re-exports [SessionState](https://crawlee.dev/js/api/core/interface/SessionState.md) ### [**](#SitemapRequestList)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L128)SitemapRequestList Re-exports [SitemapRequestList](https://crawlee.dev/js/api/core/class/SitemapRequestList.md) ### [**](#SitemapRequestListOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L60)SitemapRequestListOptions Re-exports [SitemapRequestListOptions](https://crawlee.dev/js/api/core/interface/SitemapRequestListOptions.md) ### [**](#SkippedRequestCallback)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L52)SkippedRequestCallback Re-exports [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) ### [**](#SkippedRequestReason)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L50)SkippedRequestReason 
Re-exports [SkippedRequestReason](https://crawlee.dev/js/api/core.md#SkippedRequestReason) ### [**](#SnapshotResult)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/error_snapshotter.ts#L16)SnapshotResult Re-exports [SnapshotResult](https://crawlee.dev/js/api/core/interface/SnapshotResult.md) ### [**](#Snapshotter)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L118)Snapshotter Re-exports [Snapshotter](https://crawlee.dev/js/api/core/class/Snapshotter.md) ### [**](#SnapshotterOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/snapshotter.ts#L19)SnapshotterOptions Re-exports [SnapshotterOptions](https://crawlee.dev/js/api/core/interface/SnapshotterOptions.md) ### [**](#Source)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/request.ts#L575)Source Re-exports [Source](https://crawlee.dev/js/api/core.md#Source) ### [**](#StatisticPersistedState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L482)StatisticPersistedState Re-exports [StatisticPersistedState](https://crawlee.dev/js/api/core/interface/StatisticPersistedState.md) ### [**](#Statistics)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L59)Statistics Re-exports [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) ### [**](#StatisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L436)StatisticsOptions Re-exports [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) ### [**](#StatisticState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/statistics.ts#L496)StatisticState Re-exports [StatisticState](https://crawlee.dev/js/api/core/interface/StatisticState.md) ### [**](#StatusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L128)StatusMessageCallback Re-exports [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback) ### [**](#StatusMessageCallbackParams)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L118)StatusMessageCallbackParams Re-exports [StatusMessageCallbackParams](https://crawlee.dev/js/api/basic-crawler/interface/StatusMessageCallbackParams.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/index.ts#L19)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### [**](#StorageManagerOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/storage_manager.ts#L156)StorageManagerOptions Re-exports [StorageManagerOptions](https://crawlee.dev/js/api/core/interface/StorageManagerOptions.md) ### [**](#StreamingHttpResponse)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/http_clients/base-http-client.ts#L162)StreamingHttpResponse Re-exports [StreamingHttpResponse](https://crawlee.dev/js/api/core/interface/StreamingHttpResponse.md) ### [**](#SystemInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L10)SystemInfo Re-exports [SystemInfo](https://crawlee.dev/js/api/core/interface/SystemInfo.md) ### [**](#SystemStatus)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L120)SystemStatus Re-exports 
[SystemStatus](https://crawlee.dev/js/api/core/class/SystemStatus.md) ### [**](#SystemStatusOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/system_status.ts#L35)SystemStatusOptions Re-exports [SystemStatusOptions](https://crawlee.dev/js/api/core/interface/SystemStatusOptions.md) ### [**](#TieredProxy)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/proxy_configuration.ts#L45)TieredProxy Re-exports [TieredProxy](https://crawlee.dev/js/api/core/interface/TieredProxy.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L12)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ### [**](#UrlPatternObject)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/enqueue_links/shared.ts#L24)UrlPatternObject Re-exports [UrlPatternObject](https://crawlee.dev/js/api/core.md#UrlPatternObject) ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L87)useState Re-exports [useState](https://crawlee.dev/js/api/core/function/useState.md) ### [**](#UseStateOptions)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/utils.ts#L69)UseStateOptions Re-exports [UseStateOptions](https://crawlee.dev/js/api/core/interface/UseStateOptions.md) ### [**](#withCheckedStorageAccess)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/storages/access_checking.ts#L18)withCheckedStorageAccess Re-exports [withCheckedStorageAccess](https://crawlee.dev/js/api/core/function/withCheckedStorageAccess.md) ### [**](#PuppeteerGoToOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L26)PuppeteerGoToOptions **PuppeteerGoToOptions: Parameters\\[1] --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/puppeteer ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/puppeteer ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/puppeteer # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/puppeteer ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/puppeteer # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * respect `exclude` option in `enqueueLinksByClickingElements` ([#3058](https://github.com/apify/crawlee/issues/3058)) ([013eb02](https://github.com/apify/crawlee/commit/013eb028b6ecf05f83f8790a4a6164b9c4873733)) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * extract only `body` from `iframe` elements ([#2986](https://github.com/apify/crawlee/issues/2986)) ([c36166e](https://github.com/apify/crawlee/commit/c36166e24887ca6de12f0c60ef010256fa830c31)), closes [#2979](https://github.com/apify/crawlee/issues/2979) ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/puppeteer ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/puppeteer # 
[3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) **Note:** Version bump only for package @crawlee/puppeteer ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/puppeteer ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/puppeteer # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * ignore errors from iframe content extraction ([#2714](https://github.com/apify/crawlee/issues/2714)) ([627e5c2](https://github.com/apify/crawlee/commit/627e5c2fbadce63c7e631217cd0e735597c0ce08)), closes [#2708](https://github.com/apify/crawlee/issues/2708) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * **core:** accept `UInt8Array` in `KVS.setValue()` ([#2682](https://github.com/apify/crawlee/issues/2682)) ([8ef0e60](https://github.com/apify/crawlee/commit/8ef0e60ca6fb2f4ec1b0d1aec6dcd53fcfb398b3)) ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/puppeteer ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/puppeteer ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/puppeteer ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/puppeteer # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features "Direct link to Features") * add `iframe` expansion to `parseWithCheerio` in browsers ([#2542](https://github.com/apify/crawlee/issues/2542)) ([328d085](https://github.com/apify/crawlee/commit/328d08598807782b3712bd543e394fe9a000a85d)), closes [#2507](https://github.com/apify/crawlee/issues/2507) * add `ignoreIframes` opt-out from the Cheerio iframe expansion ([#2562](https://github.com/apify/crawlee/issues/2562)) ([474a8dc](https://github.com/apify/crawlee/commit/474a8dc06a567cde0651d385fdac9c350ddf4508)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/puppeteer ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/puppeteer ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Features[​](#features-1 "Direct link to Features") * add `waitForSelector` context helper + `parseWithCheerio` in adaptive crawler ([#2522](https://github.com/apify/crawlee/issues/2522)) 
([6f88e73](https://github.com/apify/crawlee/commit/6f88e738d43ab4774dc4ef3f78775a5d88728e0d)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/puppeteer ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/puppeteer # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/puppeteer ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/puppeteer ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/puppeteer # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * **puppeteer:** allow passing `networkidle` to `waitUntil` in `gotoExtended` ([#2399](https://github.com/apify/crawlee/issues/2399)) ([5d0030d](https://github.com/apify/crawlee/commit/5d0030d24858585715b0fac5568440f2b2346706)), closes [#2398](https://github.com/apify/crawlee/issues/2398) ### Features[​](#features-2 "Direct link to Features") * expand #shadow-root elements automatically in `parseWithCheerio` helper ([#2396](https://github.com/apify/crawlee/issues/2396)) ([a05b3a9](https://github.com/apify/crawlee/commit/a05b3a93a9b57926b353df0e79d846b5024c42ac)) ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/puppeteer ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/puppeteer # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * **puppeteer:** replace `page.waitForTimeout()` with `sleep()` ([52d7219](https://github.com/apify/crawlee/commit/52d7219acdc19b34a727e5d26f7f9288d27ca57f)), closes [#2335](https://github.com/apify/crawlee/issues/2335) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/puppeteer ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/puppeteer ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/puppeteer # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/puppeteer ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/puppeteer ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") ### Features[​](#features-3 "Direct 
link to Features") * **puppeteer:** enable `new` headless mode ([#1910](https://github.com/apify/crawlee/issues/1910)) ([7fc999c](https://github.com/apify/crawlee/commit/7fc999cf4658ca69b97f16d434444081998470f4)) # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * add `skipNavigation` option to `enqueueLinks` ([#2153](https://github.com/apify/crawlee/issues/2153)) ([118515d](https://github.com/apify/crawlee/commit/118515d2ba534b99be2f23436f6abe41d66a8e07)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/puppeteer ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/puppeteer ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/puppeteer ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * allow to use any version of puppeteer or playwright ([#2102](https://github.com/apify/crawlee/issues/2102)) ([0cafceb](https://github.com/apify/crawlee/commit/0cafceb2966d430dd1b2a1b619fe66da1c951f4c)), closes [#2101](https://github.com/apify/crawlee/issues/2101) ### Features[​](#features-4 "Direct link to Features") * Request Queue v2 ([#1975](https://github.com/apify/crawlee/issues/1975)) ([70a77ee](https://github.com/apify/crawlee/commit/70a77ee15f984e9ae67cd584fc58ace7e55346db)), closes [#1365](https://github.com/apify/crawlee/issues/1365) ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * various helpers opening KVS now respect Configuration ([#2071](https://github.com/apify/crawlee/issues/2071)) ([59dbb16](https://github.com/apify/crawlee/commit/59dbb164699774e5a6718e98d0a4e8f630f35323)) ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/puppeteer ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/puppeteer # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-5 "Direct link to Features") * add `closeCookieModals` context helper for Playwright and Puppeteer ([#1927](https://github.com/apify/crawlee/issues/1927)) ([98d93bb](https://github.com/apify/crawlee/commit/98d93bb6713ec219baa83db2ad2cd1d7621a3339)) * **core:** use `RequestQueue.addBatchedRequests()` in `enqueueLinks` helper 
([4d61ca9](https://github.com/apify/crawlee/commit/4d61ca934072f8bbb680c842d8b1c9a4452ee73a)), closes [#1995](https://github.com/apify/crawlee/issues/1995) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/puppeteer ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/puppeteer # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) ### Features[​](#features-6 "Direct link to Features") * infiniteScroll has maxScrollHeight limit ([#1945](https://github.com/apify/crawlee/issues/1945)) ([44997bb](https://github.com/apify/crawlee/commit/44997bba5bbf33ddb7dbac2f3e26d4bee60d4f47)) ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/puppeteer ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-7 "Direct link to Features") * **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * **jsdom:** delay closing of the window and add some polyfills ([2e81618](https://github.com/apify/crawlee/commit/2e81618afb5f3890495e3e5fcfa037eb3319edc9)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) **Note:** Version bump only for package @crawlee/puppeteer ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/puppeteer ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/puppeteer # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * allow `userData` option in `enqueueLinksByClickingElements` ([#1749](https://github.com/apify/crawlee/issues/1749)) ([736f85d](https://github.com/apify/crawlee/commit/736f85d4a3b99a06d0f99f91e33e71976a9458a3)), closes [#1617](https://github.com/apify/crawlee/issues/1617) * declare missing dependency on `tslib` ([27e96c8](https://github.com/apify/crawlee/commit/27e96c80c26e7fc31809a4b518d699573cb8c662)), closes [#1747](https://github.com/apify/crawlee/issues/1747) ### Features[​](#features-8 "Direct link to Features") * add `forefront` option to all `enqueueLinks` variants ([#1760](https://github.com/apify/crawlee/issues/1760)) ([a01459d](https://github.com/apify/crawlee/commit/a01459dffb51162e676354f0aa4811a1d36affa9)), closes [#1483](https://github.com/apify/crawlee/issues/1483) ## [3.1.4](https://github.com/apify/crawlee/compare/v3.1.3...v3.1.4) (2022-12-14)[​](#314-2022-12-14 "Direct link to 314-2022-12-14") **Note:** Version bump only for package @crawlee/puppeteer ## [3.1.3](https://github.com/apify/crawlee/compare/v3.1.2...v3.1.3) (2022-12-07)[​](#313-2022-12-07 
"Direct link to 313-2022-12-07") **Note:** Version bump only for package @crawlee/puppeteer ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/puppeteer ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/puppeteer # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/puppeteer ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/puppeteer --- # PuppeteerCrawler Provides a simple framework for parallel crawling of web pages using headless Chrome with [Puppeteer](https://github.com/puppeteer/puppeteer). The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since `PuppeteerCrawler` uses headless Chrome to download web pages and extract data, it is useful for crawling of websites that require to execute JavaScript. If the target website doesn't need JavaScript, consider using [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), which downloads the pages using raw HTTP requests and is about 10x faster. The source URLs are represented using [Request](https://crawlee.dev/js/api/core/class/Request.md) objects that are fed from [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) or [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) instances provided by the [PuppeteerCrawlerOptions.requestList](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestList) or [PuppeteerCrawlerOptions.requestQueue](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestQueue) constructor options, respectively. If both [PuppeteerCrawlerOptions.requestList](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestList) and [PuppeteerCrawlerOptions.requestQueue](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) before it starts their processing. This ensures that a single URL is not crawled multiple times. The crawler finishes when there are no more [Request](https://crawlee.dev/js/api/core/class/Request.md) objects to crawl. `PuppeteerCrawler` opens a new Chrome page (i.e. tab) for each [Request](https://crawlee.dev/js/api/core/class/Request.md) object to crawl and then calls the function provided by user as the [PuppeteerCrawlerOptions.requestHandler](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#requestHandler) option. New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class. All [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) configuration options can be passed to the [PuppeteerCrawlerOptions.autoscaledPoolOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md#autoscaledPoolOptions) parameter of the `PuppeteerCrawler` constructor. 
For user convenience, the `minConcurrency` and `maxConcurrency` [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) are available directly in the `PuppeteerCrawler` constructor. Note that the pool of Puppeteer instances is internally managed by the [BrowserPool](https://github.com/apify/browser-pool) class. **Example usage:** ``` const crawler = new PuppeteerCrawler({ async requestHandler({ page, request }) { // This function is called to extract data from a single web page // 'page' is an instance of Puppeteer.Page with page.goto(request.url) already called // 'request' is an instance of Request class with information about the page to load await Dataset.pushData({ title: await page.title(), url: request.url, succeeded: true, }) }, async failedRequestHandler({ request }) { // This function is called when the crawling of a request failed too many times await Dataset.pushData({ url: request.url, succeeded: false, errors: request.errorMessages, }) }, }); await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', ]); ``` ### Hierarchy * [BrowserCrawler](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler.md)<{ browserPlugins: \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)] }, LaunchOptions, [PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)> * *PuppeteerCrawler* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**autoscaledPool](#autoscaledPool) * [**browserPool](#browserPool) * [**config](#config) * [**hasFinishedBefore](#hasFinishedBefore) * [**launchContext](#launchContext) * [**log](#log) * [**proxyConfiguration](#proxyConfiguration) * [**requestList](#requestList) * [**requestQueue](#requestQueue) * [**router](#router) * [**running](#running) * [**sessionPool](#sessionPool) * [**stats](#stats) ### Methods * [**addRequests](#addRequests) * [**exportData](#exportData) * [**getData](#getData) * [**getDataset](#getDataset) * [**getRequestQueue](#getRequestQueue) * [**pushData](#pushData) * [**run](#run) * [**setStatusMessage](#setStatusMessage) * [**stop](#stop) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L148)constructor * ****new PuppeteerCrawler**(options, config): [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) - Overrides BrowserCrawler< { browserPlugins: \[PuppeteerPlugin] }, LaunchOptions, PuppeteerCrawlingContext >.constructor All `PuppeteerCrawler` parameters are passed via an options object. *** #### Parameters * ##### options: [PuppeteerCrawlerOptions](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions.md) = {} * ##### config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) ## Properties[**](#Properties) ### [**](#autoscaledPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L524)optionalinheritedautoscaledPool **autoscaledPool? 
: [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) Inherited from BrowserCrawler.autoscaledPool A reference to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) class that manages the concurrency of the crawler. > *NOTE:* This property is only initialized after calling the [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) function. We can use it to change the concurrency settings on the fly, to pause the crawler by calling [`autoscaledPool.pause()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#pause) or to abort it by calling [`autoscaledPool.abort()`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md#abort). ### [**](#browserPool)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L329)inheritedbrowserPool **browserPool: [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md)<{ browserPlugins: \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)] }, \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)], [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, undefined | PuppeteerNewPageOptions, Page> Inherited from BrowserCrawler.browserPool A reference to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) class that manages the crawler's browsers. ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L150)readonlyinheritedconfig **config: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... Inherited from BrowserCrawler.config ### [**](#hasFinishedBefore)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L533)inheritedhasFinishedBefore **hasFinishedBefore: boolean = false Inherited from BrowserCrawler.hasFinishedBefore ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L331)inheritedlaunchContext **launchContext: [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)\ Inherited from BrowserCrawler.launchContext ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L535)readonlyinheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BrowserCrawler.log ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L324)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from BrowserCrawler.proxyConfiguration A reference to the underlying [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class that manages the crawler's proxies. Only available if used by the crawler. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L497)optionalinheritedrequestList **requestList? 
: [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BrowserCrawler.requestList A reference to the underlying [RequestList](https://crawlee.dev/js/api/core/class/RequestList.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L504)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BrowserCrawler.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. A reference to the underlying [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) class that manages the crawler's [requests](https://crawlee.dev/js/api/core/class/Request.md). Only available if used by the crawler. ### [**](#router)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L530)readonlyinheritedrouter **router: [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\, request>> = ... Inherited from BrowserCrawler.router Default [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that will be used if we don't specify any [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). See [`router.addHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addHandler) and [`router.addDefaultHandler()`](https://crawlee.dev/js/api/core/class/Router.md#addDefaultHandler). ### [**](#running)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L532)inheritedrunning **running: boolean = false Inherited from BrowserCrawler.running ### [**](#sessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L515)optionalinheritedsessionPool **sessionPool? : [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) Inherited from BrowserCrawler.sessionPool A reference to the underlying [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) class that manages the crawler's [sessions](https://crawlee.dev/js/api/core/class/Session.md). Only available if used by the crawler. ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L491)readonlyinheritedstats **stats: [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) Inherited from BrowserCrawler.stats A reference to the underlying [Statistics](https://crawlee.dev/js/api/core/class/Statistics.md) class that collects and logs run statistics for requests. ## Methods[**](#Methods) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1147)inheritedaddRequests * ****addRequests**(requests, options): Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> - Inherited from BrowserCrawler.addRequests Adds requests to the queue in batches. 
By default, it will resolve after the initial batch is added, and continue adding the rest in background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. This is an alias for calling `addRequestsBatched()` on the implicit `RequestQueue` for this crawler instance. *** #### Parameters * ##### requests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add * ##### options: [CrawlerAddRequestsOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsOptions.md) = {} Options for the request queue #### Returns Promise<[CrawlerAddRequestsResult](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerAddRequestsResult.md)> ### [**](#exportData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1253)inheritedexportData * ****exportData**\(path, format, options): Promise\ - Inherited from BrowserCrawler.exportData Retrieves all the data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically. *** #### Parameters * ##### path: string * ##### optionalformat: json | csv * ##### optionaloptions: [DatasetExportOptions](https://crawlee.dev/js/api/core/interface/DatasetExportOptions.md) #### Returns Promise\ ### [**](#getData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1244)inheritedgetData * ****getData**(...args): Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> - Inherited from BrowserCrawler.getData Retrieves data from the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.getData](https://crawlee.dev/js/api/core/class/Dataset.md#getData). *** #### Parameters * ##### rest...args: \[options: [DatasetDataOptions](https://crawlee.dev/js/api/core/interface/DatasetDataOptions.md)] #### Returns Promise<[DatasetContent](https://crawlee.dev/js/api/core/interface/DatasetContent.md)\> ### [**](#getDataset)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1237)inheritedgetDataset * ****getDataset**(idOrName): Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> - Inherited from BrowserCrawler.getDataset Retrieves the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md). 
*** #### Parameters * ##### optionalidOrName: string #### Returns Promise<[Dataset](https://crawlee.dev/js/api/core/class/Dataset.md)\> ### [**](#getRequestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1080)inheritedgetRequestQueue * ****getRequestQueue**(): Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> - Inherited from BrowserCrawler.getRequestQueue #### Returns Promise<[RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md)> ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1229)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BrowserCrawler.pushData Pushes data to the specified [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md), or the default crawler [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) by calling [Dataset.pushData](https://crawlee.dev/js/api/core/class/Dataset.md#pushData). *** #### Parameters * ##### data: Dictionary | Dictionary\[] * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#run)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L948)inheritedrun * ****run**(requests, options): Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> - Inherited from BrowserCrawler.run Runs the crawler. Returns a promise that resolves once all the requests are processed and `autoscaledPool.isFinished` returns `true`. We can use the `requests` parameter to enqueue the initial requests — it is a shortcut for running [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) before [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run). *** #### Parameters * ##### optionalrequests: [RequestsLike](https://crawlee.dev/js/api/core.md#RequestsLike) The requests to add. * ##### optionaloptions: [CrawlerRunOptions](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerRunOptions.md) Options for the request queue. #### Returns Promise<[FinalStatistics](https://crawlee.dev/js/api/core/interface/FinalStatistics.md)> ### [**](#setStatusMessage)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L871)inheritedsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - Inherited from BrowserCrawler.setStatusMessage This method is periodically called by the crawler, every `statusMessageLoggingInterval` seconds. *** #### Parameters * ##### message: string * ##### options: [SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) = {} #### Returns Promise\ ### [**](#stop)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1068)inheritedstop * ****stop**(message): void - Inherited from BrowserCrawler.stop Gracefully stops the current run of the crawler. All the tasks active at the time of calling this method will be allowed to finish. *** #### Parameters * ##### message: string = 'The crawler has been gracefully stopped.' #### Returns void ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L1102)inheriteduseState * ****useState**\(defaultValue): Promise\ - Inherited from BrowserCrawler.useState #### Parameters * ##### defaultValue: State = ... 
#### Returns Promise\ --- # createPuppeteerRouter ### Callable * ****createPuppeteerRouter**\(routes): [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ *** * Creates a new [Router](https://crawlee.dev/js/api/core/class/Router.md) instance that works based on request labels. This instance can then serve as a `requestHandler` of your [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md). Defaults to the [PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md). > Serves as a shortcut for using `Router.create()`. ``` import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee'; const router = createPuppeteerRouter(); router.addHandler('label-a', async (ctx) => { ctx.log.info('...'); }); router.addDefaultHandler(async (ctx) => { ctx.log.info('...'); }); const crawler = new PuppeteerCrawler({ requestHandler: router, }); await crawler.run(); ``` *** #### Parameters * ##### optionalroutes: [RouterRoutes](https://crawlee.dev/js/api/core.md#RouterRoutes)\ #### Returns [RouterHandler](https://crawlee.dev/js/api/core/interface/RouterHandler.md)\ --- # launchPuppeteer ### Callable * ****launchPuppeteer**(launchContext, config): Promise\ *** * Launches headless Chrome using Puppeteer, pre-configured to work within the Apify platform. The function has the same arguments and return value as `puppeteer.launch()`. See the [Puppeteer documentation](https://pptr.dev/api/puppeteer.launchoptions) for more details. The `launchPuppeteer()` function alters the following Puppeteer options: * Passes the setting from the `CRAWLEE_HEADLESS` environment variable to the `headless` option, unless it was already defined by the caller or the `CRAWLEE_XVFB` environment variable is set to `1`. Note that the Apify Actor cloud platform automatically sets `CRAWLEE_HEADLESS=1` for all running actors. * Takes the `proxyUrl` option, validates it and adds it to `args` as `--proxy-server=XXX`. The proxy URL must define a port number and have one of the following schemes: `http://`, `https://`, `socks4://` or `socks5://`. If the proxy is HTTP (i.e. has the `http://` scheme) and contains a username or password, the `launchPuppeteer` function sets up an anonymous HTTP proxy to make the proxy work with headless Chrome. For more information, read the [blog post about the proxy-chain library](https://blog.apify.com/how-to-make-headless-chrome-and-puppeteer-use-a-proxy-server-with-authentication-249a21a79212). To use this function, you need to have the [puppeteer](https://www.npmjs.com/package/puppeteer) NPM package installed in your project. When running on the Apify cloud, you can achieve that simply by using the `apify/actor-node-chrome` base Docker image for your actor - see the [Apify Actor documentation](https://docs.apify.com/actor/build#base-images) for details. *** #### Parameters * ##### optionallaunchContext: [PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) All `PuppeteerLauncher` parameters are passed via a launchContext object. If you want to pass custom `puppeteer.launch(options)` options, you can use the `PuppeteerLaunchContext.launchOptions` property. * ##### optionalconfig: [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = ... #### Returns Promise\ Promise that resolves to Puppeteer's `Browser` instance.
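For orientation, here is a minimal usage sketch of `launchPuppeteer()`. It assumes the `puppeteer` package is installed; the target URL and the commented-out proxy URL are placeholders, not values from this reference.

```
import { launchPuppeteer } from 'crawlee';

// Launch headless Chrome with Crawlee's pre-configured defaults.
const browser = await launchPuppeteer({
    launchOptions: { headless: true },
    // proxyUrl: 'http://user:password@proxy.example.com:8000', // placeholder proxy
});

// The returned value is a regular Puppeteer Browser instance.
const page = await browser.newPage();
await page.goto('https://www.example.com');
console.log(await page.title());
await browser.close();
```

For full crawls, prefer `PuppeteerCrawler`, which manages the browser pool, request queue and autoscaling for you and uses this launcher internally.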
--- # PuppeteerCrawlerOptions ### Hierarchy * [BrowserCrawlerOptions](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md)<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md), { browserPlugins: \[[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)] }> * *PuppeteerCrawlerOptions* ## Index[**](#Index) ### Properties * [**autoscaledPoolOptions](#autoscaledPoolOptions) * [**browserPoolOptions](#browserPoolOptions) * [**errorHandler](#errorHandler) * [**experiments](#experiments) * [**failedRequestHandler](#failedRequestHandler) * [**headless](#headless) * [**httpClient](#httpClient) * [**ignoreIframes](#ignoreIframes) * [**ignoreShadowRoots](#ignoreShadowRoots) * [**keepAlive](#keepAlive) * [**launchContext](#launchContext) * [**maxConcurrency](#maxConcurrency) * [**maxCrawlDepth](#maxCrawlDepth) * [**maxRequestRetries](#maxRequestRetries) * [**maxRequestsPerCrawl](#maxRequestsPerCrawl) * [**maxRequestsPerMinute](#maxRequestsPerMinute) * [**maxSessionRotations](#maxSessionRotations) * [**minConcurrency](#minConcurrency) * [**navigationTimeoutSecs](#navigationTimeoutSecs) * [**onSkippedRequest](#onSkippedRequest) * [**persistCookiesPerSession](#persistCookiesPerSession) * [**postNavigationHooks](#postNavigationHooks) * [**preNavigationHooks](#preNavigationHooks) * [**proxyConfiguration](#proxyConfiguration) * [**requestHandler](#requestHandler) * [**requestHandlerTimeoutSecs](#requestHandlerTimeoutSecs) * [**requestList](#requestList) * [**requestManager](#requestManager) * [**requestQueue](#requestQueue) * [**respectRobotsTxtFile](#respectRobotsTxtFile) * [**retryOnBlocked](#retryOnBlocked) * [**sameDomainDelaySecs](#sameDomainDelaySecs) * [**sessionPoolOptions](#sessionPoolOptions) * [**statisticsOptions](#statisticsOptions) * [**statusMessageCallback](#statusMessageCallback) * [**statusMessageLoggingInterval](#statusMessageLoggingInterval) * [**useSessionPool](#useSessionPool) ## Properties[**](#Properties) ### [**](#autoscaledPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L294)optionalinheritedautoscaledPoolOptions **autoscaledPoolOptions? : [AutoscaledPoolOptions](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md) Inherited from BrowserCrawlerOptions.autoscaledPoolOptions Custom options passed to the underlying [AutoscaledPool](https://crawlee.dev/js/api/core/class/AutoscaledPool.md) constructor. > *NOTE:* The [`runTaskFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#runTaskFunction) option is provided by the crawler and cannot be overridden. However, we can provide custom implementations of [`isFinishedFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isFinishedFunction) and [`isTaskReadyFunction`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#isTaskReadyFunction). ### [**](#browserPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L194)optionalinheritedbrowserPoolOptions **browserPoolOptions? 
: Partial<[BrowserPoolOptions](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolOptions.md)<[BrowserPlugin](https://crawlee.dev/js/api/browser-pool/class/BrowserPlugin.md)<[CommonLibrary](https://crawlee.dev/js/api/browser-pool/interface/CommonLibrary.md), undefined | Dictionary, CommonBrowser, unknown, CommonPage>>> & Partial<[BrowserPoolHooks](https://crawlee.dev/js/api/browser-pool/interface/BrowserPoolHooks.md)<[BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md)\, [LaunchContext](https://crawlee.dev/js/api/browser-pool/class/LaunchContext.md)\, Page>> Inherited from BrowserCrawlerOptions.browserPoolOptions Custom options passed to the underlying [BrowserPool](https://crawlee.dev/js/api/browser-pool/class/BrowserPool.md) constructor. We can tweak those to fine-tune browser management. ### [**](#errorHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L163)optionalinheritederrorHandler **errorHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\> Inherited from BrowserCrawlerOptions.errorHandler User-provided function that allows modifying the request object before it gets retried by the crawler. It's executed before each retry for the requests that failed less than [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the request to be retried. Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#experiments)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L390)optionalinheritedexperiments **experiments? : [CrawlerExperiments](https://crawlee.dev/js/api/basic-crawler/interface/CrawlerExperiments.md) Inherited from BrowserCrawlerOptions.experiments Enables experimental features of Crawlee, which can alter the behavior of the crawler. WARNING: these options are not guaranteed to be stable and may change or be removed at any time. ### [**](#failedRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L174)optionalinheritedfailedRequestHandler **failedRequestHandler? : [BrowserErrorHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserErrorHandler)<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\> Inherited from BrowserCrawlerOptions.failedRequestHandler A function to handle requests that failed more than `option.maxRequestRetries` times. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as the first argument, where the [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) corresponds to the failed request. 
Second argument is the `Error` instance that represents the last error thrown during processing of the request. ### [**](#headless)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L260)optionalinheritedheadless **headless? : boolean | new | old Inherited from BrowserCrawlerOptions.headless Whether to run browser in headless mode. Defaults to `true`. Can be also set via [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md). ### [**](#httpClient)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L402)optionalinheritedhttpClient **httpClient? : [BaseHttpClient](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) Inherited from BrowserCrawlerOptions.httpClient HTTP client implementation for the `sendRequest` context helper and for plain HTTP crawling. Defaults to a new instance of [GotScrapingHttpClient](https://crawlee.dev/js/api/core/class/GotScrapingHttpClient.md) ### [**](#ignoreIframes)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L272)optionalinheritedignoreIframes **ignoreIframes? : boolean Inherited from BrowserCrawlerOptions.ignoreIframes Whether to ignore `iframes` when processing the page content via `parseWithCheerio` helper. By default, `iframes` are expanded automatically. Use this option to disable this behavior. ### [**](#ignoreShadowRoots)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L266)optionalinheritedignoreShadowRoots **ignoreShadowRoots? : boolean Inherited from BrowserCrawlerOptions.ignoreShadowRoots Whether to ignore custom elements (and their #shadow-roots) when processing the page content via `parseWithCheerio` helper. By default, they are expanded automatically. Use this option to disable this behavior. ### [**](#keepAlive)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L322)optionalinheritedkeepAlive **keepAlive? : boolean Inherited from BrowserCrawlerOptions.keepAlive Allows to keep the crawler alive even if the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) gets empty. By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running, waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately. ### [**](#launchContext)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L33)optionallaunchContext **launchContext? : [PuppeteerLaunchContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerLaunchContext.md) Overrides BrowserCrawlerOptions.launchContext Options used by [launchPuppeteer](https://crawlee.dev/js/api/puppeteer-crawler/function/launchPuppeteer.md) to start new Puppeteer instances. ### [**](#maxConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L308)optionalinheritedmaxConcurrency **maxConcurrency? : number Inherited from BrowserCrawlerOptions.maxConcurrency Sets the maximum concurrency (parallelism) for the crawl. Shortcut for the AutoscaledPool [`maxConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxConcurrency) option. 
### [**](#maxCrawlDepth)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L285)optionalinheritedmaxCrawlDepth **maxCrawlDepth? : number Inherited from BrowserCrawlerOptions.maxCrawlDepth Maximum depth of the crawl. If not set, the crawl will continue until all requests are processed. Setting this to `0` will only process the initial requests, skipping all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests`. Passing `1` will process the initial requests and all links enqueued by `crawlingContext.enqueueLinks` and `crawlingContext.addRequests` in the handler for initial requests. ### [**](#maxRequestRetries)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L256)optionalinheritedmaxRequestRetries **maxRequestRetries? : number = 3 Inherited from BrowserCrawlerOptions.maxRequestRetries Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (`requestHandler`, `preNavigationHooks`, `postNavigationHooks`). This limit does not apply to retries triggered by session rotation (see [`maxSessionRotations`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxSessionRotations)). ### [**](#maxRequestsPerCrawl)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L278)optionalinheritedmaxRequestsPerCrawl **maxRequestsPerCrawl? : number Inherited from BrowserCrawlerOptions.maxRequestsPerCrawl Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. This value should always be set in order to prevent infinite loops in misconfigured crawlers. > *NOTE:* In cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. ### [**](#maxRequestsPerMinute)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L315)optionalinheritedmaxRequestsPerMinute **maxRequestsPerMinute? : number Inherited from BrowserCrawlerOptions.maxRequestsPerMinute The maximum number of requests per minute the crawler should run. By default, this is set to `Infinity`, but we can pass any positive, non-zero integer. Shortcut for the AutoscaledPool [`maxTasksPerMinute`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#maxTasksPerMinute) option. ### [**](#maxSessionRotations)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L271)optionalinheritedmaxSessionRotations **maxSessionRotations? : number = 10 Inherited from BrowserCrawlerOptions.maxSessionRotations Maximum number of session rotations per request. The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website. The session rotations are not counted towards the [`maxRequestRetries`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestRetries) limit. ### [**](#minConcurrency)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L302)optionalinheritedminConcurrency **minConcurrency? : number Inherited from BrowserCrawlerOptions.minConcurrency Sets the minimum concurrency (parallelism) for the crawl. 
Shortcut for the AutoscaledPool [`minConcurrency`](https://crawlee.dev/js/api/core/interface/AutoscaledPoolOptions.md#minConcurrency) option. > *WARNING:* If we set this value too high with respect to the available system memory and CPU, our crawler will run extremely slowly or crash. If unsure, it's better to keep the default value and let the concurrency scale up automatically. ### [**](#navigationTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L248)optionalinheritednavigationTimeoutSecs **navigationTimeoutSecs? : number Inherited from BrowserCrawlerOptions.navigationTimeoutSecs Timeout in which page navigation needs to finish, in seconds. ### [**](#onSkippedRequest)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L381)optionalinheritedonSkippedRequest **onSkippedRequest? : [SkippedRequestCallback](https://crawlee.dev/js/api/core.md#SkippedRequestCallback) Inherited from BrowserCrawlerOptions.onSkippedRequest When a request is skipped for some reason, you can use this callback to act on it. This is currently fired for requests skipped 1. based on the robots.txt file, 2. because they don't match the enqueueLinks filters, 3. because they are redirected to a URL that doesn't match the enqueueLinks strategy, 4. or because the [`maxRequestsPerCrawl`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#maxRequestsPerCrawl) limit has been reached. ### [**](#persistCookiesPerSession)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L254)optionalinheritedpersistCookiesPerSession **persistCookiesPerSession? : boolean Inherited from BrowserCrawlerOptions.persistCookiesPerSession Defines whether the cookies should be persisted for sessions. This can only be used when `useSessionPool` is set to `true`. ### [**](#postNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L69)optionalpostNavigationHooks **postNavigationHooks? : [PuppeteerHook](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerHook.md)\[] Overrides BrowserCrawlerOptions.postNavigationHooks Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter. Example: ``` postNavigationHooks: [ async (crawlingContext) => { const { page } = crawlingContext; if (hasCaptcha(page)) { await solveCaptcha(page); } }, ] ``` ### [**](#preNavigationHooks)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts#L52)optionalpreNavigationHooks **preNavigationHooks? : [PuppeteerHook](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerHook.md)\[] Overrides BrowserCrawlerOptions.preNavigationHooks Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotoOptions`, which are passed to the `page.goto()` function the crawler calls to navigate. Example: ``` preNavigationHooks: [ async (crawlingContext, gotoOptions) => { const { page } = crawlingContext; await page.evaluate((attr) => { window.foo = attr; }, 'bar'); }, ] ``` Modifying `pageOptions` is supported only in Playwright incognito mode.
See [PrePageCreateHook](https://crawlee.dev/js/api/browser-pool.md#PrePageCreateHook) ### [**](#proxyConfiguration)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L201)optionalinheritedproxyConfiguration **proxyConfiguration? : [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) Inherited from BrowserCrawlerOptions.proxyConfiguration If set, the crawler will be configured for all connections to use the Proxy URLs provided and rotated according to the configuration. ### [**](#requestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L119)optionalinheritedrequestHandler **requestHandler? : [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler)<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\, request>> Inherited from BrowserCrawlerOptions.requestHandler Function that is called to process each request. The function receives the [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md) (actual context will be enhanced with the crawler specific properties) as an argument, where: * [`request`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#request) is an instance of the [Request](https://crawlee.dev/js/api/core/class/Request.md) object with details about the URL to open, HTTP method etc; * [`page`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#page) is an instance of the Puppeteer [Page](https://pptr.dev/api/puppeteer.page) or Playwright [Page](https://playwright.dev/docs/api/class-page); * [`browserController`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#browserController) is an instance of the [BrowserController](https://crawlee.dev/js/api/browser-pool/class/BrowserController.md); * [`response`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md#response) is an instance of the Puppeteer [Response](https://pptr.dev/api/puppeteer.httpresponse) or Playwright [Response](https://playwright.dev/docs/api/class-response), which is the main resource response as returned by the respective `page.goto()` function. The function must return a promise, which is then awaited by the crawler. If the function throws an exception, the crawler will try to re-crawl the request later, up to the [`maxRequestRetries`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#maxRequestRetries) times. If all the retries fail, the crawler calls the function provided to the [`failedRequestHandler`](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlerOptions.md#failedRequestHandler) parameter. To make this work, we should **always** let our function throw exceptions rather than catch them. The exceptions are logged to the request using the [`Request.pushErrorMessage()`](https://crawlee.dev/js/api/core/class/Request.md#pushErrorMessage) function. ### [**](#requestHandlerTimeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L203)optionalinheritedrequestHandlerTimeoutSecs **requestHandlerTimeoutSecs? 
: number = 60 Inherited from BrowserCrawlerOptions.requestHandlerTimeoutSecs Timeout in which the function passed as [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler) needs to finish, in seconds. ### [**](#requestList)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L181)optionalinheritedrequestList **requestList? : [IRequestList](https://crawlee.dev/js/api/core/interface/IRequestList.md) Inherited from BrowserCrawlerOptions.requestList Static list of URLs to be processed. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#requestManager)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L197)optionalinheritedrequestManager **requestManager? : [IRequestManager](https://crawlee.dev/js/api/core/interface/IRequestManager.md) Inherited from BrowserCrawlerOptions.requestManager Allows explicitly configuring a request manager. Mutually exclusive with the `requestQueue` and `requestList` options. This enables explicitly configuring the crawler to use `RequestManagerTandem`, for instance. If using this, the type of `BasicCrawler.requestQueue` may not be fully compatible with the `RequestProvider` class. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L189)optionalinheritedrequestQueue **requestQueue? : [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) Inherited from BrowserCrawlerOptions.requestQueue Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. If not provided, the crawler will open the default request queue when the [`crawler.addRequests()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#addRequests) function is called. > Alternatively, the `requests` parameter of [`crawler.run()`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md#run) can be used to enqueue the initial requests - it is a shortcut for running `crawler.addRequests()` before `crawler.run()`. ### [**](#respectRobotsTxtFile)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L371)optionalinheritedrespectRobotsTxtFile **respectRobotsTxtFile? : boolean Inherited from BrowserCrawlerOptions.respectRobotsTxtFile If set to `true`, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs from being added via `enqueueLinks`. ### [**](#retryOnBlocked)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L365)optionalinheritedretryOnBlocked **retryOnBlocked? : boolean Inherited from BrowserCrawlerOptions.retryOnBlocked If set to `true`, the crawler will automatically try to bypass any detected bot protection. 
Currently supports: * [**Cloudflare** Bot Management](https://www.cloudflare.com/products/bot-management/) * [**Google Search** Rate Limiting](https://www.google.com/sorry/) ### [**](#sameDomainDelaySecs)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L262)optionalinheritedsameDomainDelaySecs **sameDomainDelaySecs? : number = 0 Inherited from BrowserCrawlerOptions.sameDomainDelaySecs Indicates how much time (in seconds) to wait before crawling another same domain request. ### [**](#sessionPoolOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L333)optionalinheritedsessionPoolOptions **sessionPoolOptions? : [SessionPoolOptions](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md) Inherited from BrowserCrawlerOptions.sessionPoolOptions The configuration options for [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) to use. ### [**](#statisticsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L396)optionalinheritedstatisticsOptions **statisticsOptions? : [StatisticsOptions](https://crawlee.dev/js/api/core/interface/StatisticsOptions.md) Inherited from BrowserCrawlerOptions.statisticsOptions Customize the way statistics collecting works, such as logging interval or whether to output them to the Key-Value store. ### [**](#statusMessageCallback)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L356)optionalinheritedstatusMessageCallback **statusMessageCallback? : [StatusMessageCallback](https://crawlee.dev/js/api/basic-crawler.md#StatusMessageCallback)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\, [BasicCrawler](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md)<[BasicCrawlingContext](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md)\>> Inherited from BrowserCrawlerOptions.statusMessageCallback Allows overriding the default status message. The callback needs to call `crawler.setStatusMessage()` explicitly. The default status message is provided in the parameters. ``` const crawler = new CheerioCrawler({ statusMessageCallback: async (ctx) => { return ctx.crawler.setStatusMessage(`this is status message from ${new Date().toISOString()}`, { level: 'INFO' }); // log level defaults to 'DEBUG' }, statusMessageLoggingInterval: 1, // defaults to 10s async requestHandler({ $, enqueueLinks, request, log }) { // ... }, }); ``` ### [**](#statusMessageLoggingInterval)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L338)optionalinheritedstatusMessageLoggingInterval **statusMessageLoggingInterval? : number Inherited from BrowserCrawlerOptions.statusMessageLoggingInterval Defines the length of the interval for calling the `setStatusMessage` in seconds. ### [**](#useSessionPool)[**](https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L328)optionalinheriteduseSessionPool **useSessionPool? : boolean Inherited from BrowserCrawlerOptions.useSessionPool Basic crawler will initialize the [SessionPool](https://crawlee.dev/js/api/core/class/SessionPool.md) with the corresponding [`sessionPoolOptions`](https://crawlee.dev/js/api/core/interface/SessionPoolOptions.md). 
The session instance will be then available in the [`requestHandler`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions.md#requestHandler). --- # PuppeteerCrawlingContext \ ### Hierarchy * [BrowserCrawlingContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserCrawlingContext.md)<[PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md), Page, HTTPResponse, [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md), UserData> * PuppeteerContextUtils * *PuppeteerCrawlingContext* ## Index[**](#Index) ### Properties * [**addRequests](#addRequests) * [**browserController](#browserController) * [**crawler](#crawler) * [**getKeyValueStore](#getKeyValueStore) * [**id](#id) * [**log](#log) * [**page](#page) * [**proxyInfo](#proxyInfo) * [**request](#request) * [**response](#response) * [**session](#session) * [**useState](#useState) ### Methods * [**addInterceptRequestHandler](#addInterceptRequestHandler) * [**blockRequests](#blockRequests) * [**blockResources](#blockResources) * [**cacheResponses](#cacheResponses) * [**closeCookieModals](#closeCookieModals) * [**compileScript](#compileScript) * [**enqueueLinks](#enqueueLinks) * [**enqueueLinksByClickingElements](#enqueueLinksByClickingElements) * [**infiniteScroll](#infiniteScroll) * [**injectFile](#injectFile) * [**injectJQuery](#injectJQuery) * [**parseWithCheerio](#parseWithCheerio) * [**pushData](#pushData) * [**removeInterceptRequestHandler](#removeInterceptRequestHandler) * [**saveSnapshot](#saveSnapshot) * [**sendRequest](#sendRequest) * [**waitForSelector](#waitForSelector) ## Properties[**](#Properties) ### [**](#addRequests)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L88)inheritedaddRequests **addRequests: (requestsLike, options) => Promise\ Inherited from BrowserCrawlingContext.addRequests Add requests directly to the request queue. *** #### Type declaration * * **(requestsLike, options): Promise\ - #### Parameters * ##### requestsLike: readonly (string | ReadonlyObjectDeep\> & { regex?: RegExp; requestsFromUrl?: string }> | ReadonlyObjectDeep<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>)\[] * ##### optionaloptions: ReadonlyObjectDeep<[RequestQueueOperationOptions](https://crawlee.dev/js/api/core/interface/RequestQueueOperationOptions.md)> Options for the request queue #### Returns Promise\ ### [**](#browserController)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L59)inheritedbrowserController **browserController: [PuppeteerController](https://crawlee.dev/js/api/browser-pool/class/PuppeteerController.md) Inherited from BrowserCrawlingContext.browserController ### [**](#crawler)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L113)inheritedcrawler **crawler: [PuppeteerCrawler](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) Inherited from BrowserCrawlingContext.crawler ### [**](#getKeyValueStore)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L147)inheritedgetKeyValueStore **getKeyValueStore: (idOrName) => Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> Inherited from BrowserCrawlingContext.getKeyValueStore Get a key-value store with the given name or ID, or the default one for the crawler. 
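For illustration, here is a minimal sketch of how this helper might be used inside a `requestHandler`; the store name `'page-titles'` and the key `'LAST_VISITED'` are placeholders, not part of the API:

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, getKeyValueStore }) {
        // Open (or create) a named key-value store; omit the argument to get the crawler's default store.
        const store = await getKeyValueStore('page-titles');
        // Persist a small record under an illustrative key.
        await store.setValue('LAST_VISITED', { url: request.url, title: await page.title() });
    },
});
```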
*** #### Type declaration * * **(idOrName): Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> - #### Parameters * ##### optionalidOrName: string #### Returns Promise<[KeyValueStore](https://crawlee.dev/js/api/core/class/KeyValueStore.md)> ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L33)inheritedid **id: string Inherited from BrowserCrawlingContext.id ### [**](#log)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L108)inheritedlog **log: [Log](https://crawlee.dev/js/api/core/class/Log.md) Inherited from BrowserCrawlingContext.log A preconfigured logger for the request handler. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L60)inheritedpage **page: Page Inherited from BrowserCrawlingContext.page ### [**](#proxyInfo)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L40)optionalinheritedproxyInfo **proxyInfo? : [ProxyInfo](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) Inherited from BrowserCrawlingContext.proxyInfo An object with information about the proxy currently used by the crawler, as configured by the [ProxyConfiguration](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L45)inheritedrequest **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ Inherited from BrowserCrawlingContext.request The original [Request](https://crawlee.dev/js/api/core/class/Request.md) object. ### [**](#response)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L61)optionalinheritedresponse **response? : HTTPResponse Inherited from BrowserCrawlingContext.response ### [**](#session)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L34)optionalinheritedsession **session? : [Session](https://crawlee.dev/js/api/core/class/Session.md) Inherited from BrowserCrawlingContext.session ### [**](#useState)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L96)inheriteduseState **useState: \(defaultValue) => Promise\ Inherited from BrowserCrawlingContext.useState Returns the state - a piece of mutable persistent data shared across all the request handler runs. *** #### Type declaration * * **\(defaultValue): Promise\ - #### Parameters * ##### optionaldefaultValue: State #### Returns Promise\ ## Methods[**](#Methods) ### [**](#addInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1038)inheritedaddInterceptRequestHandler * ****addInterceptRequestHandler**(handler): Promise\ - Inherited from PuppeteerContextUtils.addInterceptRequestHandler Adds a request interception handler, similar to `page.on('request', handler);`, but with support for multiple parallel handlers. All the handlers are executed sequentially in the order in which they were added. Each of the handlers must call one of `request.continue()`, `request.abort()` or `request.respond()`. In addition, any of the handlers may modify the request object (method, postData, headers) by passing its overrides to `request.continue()`. If multiple handlers modify the same property, the last one wins. 
Headers are merged separately, so you can override only the value of a specific header. If one of the handlers calls `request.abort()` or `request.respond()`, the request is not propagated further to any of the remaining handlers. **Example usage:** ``` preNavigationHooks: [ async ({ addInterceptRequestHandler }) => { // Replace images with placeholder. await addInterceptRequestHandler((request) => { if (request.resourceType() === 'image') { return request.respond({ statusCode: 200, contentType: 'image/jpeg', body: placeholderImageBuffer, }); } return request.continue(); }); // Abort all the scripts. await addInterceptRequestHandler((request) => { if (request.resourceType() === 'script') return request.abort(); return request.continue(); }); // Change requests to post. await addInterceptRequestHandler((request) => { return request.continue({ method: 'POST', }); }); }, ], ``` *** #### Parameters * ##### handler: [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) Request interception handler. #### Returns Promise\ ### [**](#blockRequests)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L933)inheritedblockRequests * ****blockRequests**(options): Promise\ - Inherited from PuppeteerContextUtils.blockRequests Forces the Puppeteer browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources. By default, the function will block all URLs including the following patterns: ``` [".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"] ``` If you want to extend this list further, use the `extraUrlPatterns` option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the `urlPatterns` option, which will override the defaults and block only URLs with your custom patterns. This function does not use Puppeteer's request interception and therefore does not interfere with browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception. The function will never block main document loads and their respective redirects. **Example usage** ``` preNavigationHooks: [ async ({ blockRequests }) => { // Block all requests to URLs that include `adsbygoogle.js` and also all defaults. await blockRequests({ extraUrlPatterns: ['adsbygoogle.js'], }); }, ], ``` *** #### Parameters * ##### optionaloptions: [BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#BlockRequestsOptions) #### Returns Promise\ ### [**](#blockResources)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L940)inheritedblockResources * ****blockResources**(resourceTypes): Promise\ - Inherited from PuppeteerContextUtils.blockResources `blockResources()` has a high impact on performance in recent versions of Puppeteer. Until this is resolved, please use `utils.puppeteer.blockRequests()`. 
* **@deprecated** *** #### Parameters * ##### optionalresourceTypes: string\[] #### Returns Promise\ ### [**](#cacheResponses)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L956)inheritedcacheResponses * ****cacheResponses**(cache, responseUrlRules): Promise\ - Inherited from PuppeteerContextUtils.cacheResponses *NOTE:* In recent versions of Puppeteer, using this function entirely disables the browser cache, which results in sub-optimal performance. Until this is resolved, we suggest relying on the in-browser cache unless absolutely necessary. Enables caching of intercepted responses into a provided object. Automatically enables request interception in Puppeteer. *IMPORTANT*: Caching responses stores them in memory, so overly loose rules could cause memory leaks for longer-running crawlers. This issue should be resolved or at least mitigated in future iterations of this feature. * **@deprecated** *** #### Parameters * ##### cache: Dictionary\> Object in which responses are stored * ##### responseUrlRules: (string | RegExp)\[] List of rules that are used to check if the response should be cached. String rules are compared as page.url().includes(rule) while RegExp rules are evaluated as rule.test(page.url()). #### Returns Promise\ ### [**](#closeCookieModals)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1061)inheritedcloseCookieModals * ****closeCookieModals**(): Promise\ - Inherited from PuppeteerContextUtils.closeCookieModals Tries to close cookie consent modals on the page. Based on the I Don't Care About Cookies browser extension. *** #### Returns Promise\ ### [**](#compileScript)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L987)inheritedcompileScript * ****compileScript**(scriptString, ctx): [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) - Inherited from PuppeteerContextUtils.compileScript Compiles a Puppeteer script into an async function that may be executed at any time by providing it with the following object: ``` { page: Page, request: Request, } ``` Where `page` is a Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) and `request` is a [Request](https://crawlee.dev/js/api/core/class/Request.md). The function is compiled by using the `scriptString` parameter as the function's body, so any limitations to function bodies apply. The return value of the compiled function is the return value of the function body, i.e. of the `scriptString` parameter. As a security measure, no globals such as `process` or `require` are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore you should only use this function to execute sanitized or safe code. Custom context may also be provided using the `context` parameter. To improve security, make sure to pass only the objects that are really necessary to the context, preferably making secured copies beforehand. 
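As a hedged sketch of the behavior described above: the compiled function is invoked with the `page` and `request` from the crawling context, and its return value is whatever the script body returns. The script string below is illustrative only:

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, compileScript, log }) {
        // The string becomes the function body; `page` and `request` are available inside it.
        const getTitle = compileScript('return page.title();');
        const title = await getTitle({ page, request });
        log.info(`Title of ${request.url}: ${title}`);
    },
});
```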
*** #### Parameters * ##### scriptString: string * ##### optionalctx: Dictionary #### Returns [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) ### [**](#enqueueLinks)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L140)inheritedenqueueLinks * ****enqueueLinks**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from BrowserCrawlingContext.enqueueLinks This function automatically finds and enqueues links from the current page, adding them to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md) currently used by the crawler. Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) example for more details regarding its usage. **Example usage** ``` async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: [ 'https://www.example.com/handbags/*', ], }); }, ``` *** #### Parameters * ##### optionaloptions: ReadonlyObjectDeep\> & Pick<[EnqueueLinksOptions](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md), requestQueue> All `enqueueLinks()` parameters are passed via an options object. #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L893)inheritedenqueueLinksByClickingElements * ****enqueueLinksByClickingElements**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - Inherited from PuppeteerContextUtils.enqueueLinksByClickingElements The function finds elements matching a specific CSS selector in a Puppeteer page, clicks all those elements using a mouse move and a left mouse button click and intercepts all the navigation requests that are subsequently produced by the page. The intercepted requests, including their methods, headers and payloads are then enqueued to a provided [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This is useful to crawl JavaScript heavy pages where links are not available in `href` elements, but rather navigations are triggered in click handlers. If you're looking to find URLs in `href` attributes of the page, see enqueueLinks. Optionally, the function allows you to filter the target links' URLs using an array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) objects and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. **IMPORTANT**: To be able to do this, this function uses various mutations on the page, such as changing the Z-index of elements being clicked and their visibility. Therefore, it is recommended to only use this function as the last operation in the page. **USING HEADFUL BROWSER**: When using a headful browser, this function will only be able to click elements in the focused tab, effectively limiting concurrency to 1. 
In headless mode, full concurrency can be achieved. **PERFORMANCE**: Clicking elements with a mouse and intercepting requests is not a low-level operation that takes nanoseconds. It's not very CPU intensive, but it takes time. We strongly recommend limiting the scope of the clicking as much as possible by using a specific selector that targets only the elements that you assume or know will produce a navigation. You can certainly click everything by using the `*` selector, but be prepared to wait minutes to get results on a large and complex page. **Example usage** ``` async requestHandler({ enqueueLinksByClickingElements }) { await enqueueLinksByClickingElements({ selector: 'a.product-detail', globs: [ 'https://www.example.com/handbags/**', 'https://www.example.com/purses/**', ], }); }, ``` *** #### Parameters * ##### options: Omit<[EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#EnqueueLinksByClickingElementsOptions), requestQueue | page> #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#infiniteScroll)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1051)inheritedinfiniteScroll * ****infiniteScroll**(options): Promise\ - Inherited from PuppeteerContextUtils.infiniteScroll Scrolls to the bottom of a page, or until it times out. Loads dynamic content when it hits the bottom of a page, and then continues scrolling. *** #### Parameters * ##### optionaloptions: [InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InfiniteScrollOptions) #### Returns Promise\ ### [**](#injectFile)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L794)inheritedinjectFile * ****injectFile**(filePath, options): Promise\ - Inherited from PuppeteerContextUtils.injectFile Injects a JavaScript file into the current `page`. Unlike Puppeteer's `addScriptTag` function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies. File contents are cached for up to 10 files to limit file system access. *** #### Parameters * ##### filePath: string * ##### optionaloptions: [InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InjectFileOptions) #### Returns Promise\ ### [**](#injectJQuery)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L821)inheritedinjectJQuery * ****injectJQuery**(): Promise\ - Inherited from PuppeteerContextUtils.injectJQuery Injects the [jQuery](https://jquery.com/) library into the current `page`. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors. Beware that the injected jQuery object will be set to the `window.$` variable and thus it might cause conflicts with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect the functionality of the page's scripts. The injected jQuery will survive page navigations and reloads. 
**Example usage:** ``` async requestHandler({ page, injectJQuery }) { await injectJQuery(); const title = await page.evaluate(() => { return $('head title').text(); }); }, ``` Note that `injectJQuery()` does not affect Puppeteer's [`page.$()`](https://pptr.dev/api/puppeteer.page._/) function in any way. *** #### Returns Promise\ ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L850)inheritedparseWithCheerio * ****parseWithCheerio**(selector, timeoutMs): Promise\ - Inherited from PuppeteerContextUtils.parseWithCheerio Returns a Cheerio handle for `page.content()`, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). When provided with the `selector` argument, it waits for it to be available first. **Example usage:** ``` async requestHandler({ parseWithCheerio }) { const $ = await parseWithCheerio(); const title = $('title').text(); }, ``` *** #### Parameters * ##### optionalselector: string * ##### optionaltimeoutMs: number #### Returns Promise\ ### [**](#pushData)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L54)inheritedpushData * ****pushData**(data, datasetIdOrName): Promise\ - Inherited from BrowserCrawlingContext.pushData This function allows you to push data to a [Dataset](https://crawlee.dev/js/api/core/class/Dataset.md) specified by name, or the one currently used by the crawler. Shortcut for `crawler.pushData()`. *** #### Parameters * ##### optionaldata: ReadonlyDeep\ Data to be pushed to the default dataset. * ##### optionaldatasetIdOrName: string #### Returns Promise\ ### [**](#removeInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1045)inheritedremoveInterceptRequestHandler * ****removeInterceptRequestHandler**(handler): Promise\ - Inherited from PuppeteerContextUtils.removeInterceptRequestHandler Removes a request interception handler for the given page. *** #### Parameters * ##### handler: [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) Request interception handler. #### Returns Promise\ ### [**](#saveSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1056)inheritedsaveSnapshot * ****saveSnapshot**(options): Promise\ - Inherited from PuppeteerContextUtils.saveSnapshot Saves a full screenshot and HTML of the current page into a Key-Value store. *** #### Parameters * ##### optionaloptions: [SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#SaveSnapshotOptions) #### Returns Promise\ ### [**](#sendRequest)[**](https://github.com/apify/crawlee/blob/master/packages/core/src/crawlers/crawler_commons.ts#L166)inheritedsendRequest * ****sendRequest**\(overrideOptions): Promise\> - Inherited from BrowserCrawlingContext.sendRequest Fires an HTTP request via [`got-scraping`](https://crawlee.dev/js/docs/guides/got-scraping.md), allowing you to override the request options on the fly. This is handy when you work with a browser crawler but want to execute some requests outside it (e.g. API requests). Check the [Skipping navigations for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example for a more detailed explanation of how to do that. 
``` async requestHandler({ sendRequest }) { const { body } = await sendRequest({ // override headers only headers: { ... }, }); }, ``` *** #### Parameters * ##### optionaloverrideOptions: Partial\ #### Returns Promise\> ### [**](#waitForSelector)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L836)inheritedwaitForSelector * ****waitForSelector**(selector, timeoutMs): Promise\ - Inherited from PuppeteerContextUtils.waitForSelector Wait for an element matching the selector to appear. The timeout defaults to 5s. **Example usage:** ``` async requestHandler({ waitForSelector, parseWithCheerio }) { await waitForSelector('article h1'); const $ = await parseWithCheerio(); const title = $('title').text(); }, ``` *** #### Parameters * ##### selector: string * ##### optionaltimeoutMs: number #### Returns Promise\ --- # PuppeteerHook ### Hierarchy * [BrowserHook](https://crawlee.dev/js/api/browser-crawler.md#BrowserHook)<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md), [PuppeteerGoToOptions](https://crawlee.dev/js/api/puppeteer-crawler.md#PuppeteerGoToOptions)> * *PuppeteerHook* ### Callable * ****PuppeteerHook**(crawlingContext, gotoOptions): Awaitable\ *** * #### Parameters * ##### crawlingContext: [PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\ * ##### gotoOptions: undefined | GoToOptions #### Returns Awaitable\ --- # PuppeteerLaunchContext Apify extends the launch options of Puppeteer. You can use any of the Puppeteer compatible [`LaunchOptions`](https://pptr.dev/api/puppeteer.launchoptions) options by providing the `launchOptions` property. **Example:** ``` // launch a headless Chrome (not Chromium) const launchContext = { // Apify helpers useChrome: true, proxyUrl: 'http://user:password@some.proxy.com', // Native Puppeteer options launchOptions: { headless: true, args: ['--some-flag'], } } ``` ### Hierarchy * [BrowserLaunchContext](https://crawlee.dev/js/api/browser-crawler/interface/BrowserLaunchContext.md)<[PuppeteerPlugin](https://crawlee.dev/js/api/browser-pool/class/PuppeteerPlugin.md)\[launchOptions], unknown> * *PuppeteerLaunchContext* ## Index[**](#Index) ### Properties * [**browserPerProxy](#browserPerProxy) * [**experimentalContainers](#experimentalContainers) * [**launcher](#launcher) * [**launchOptions](#launchOptions) * [**proxyUrl](#proxyUrl) * [**useChrome](#useChrome) * [**useIncognitoPages](#useIncognitoPages) * [**userAgent](#userAgent) * [**userDataDir](#userDataDir) ## Properties[**](#Properties) ### [**](#browserPerProxy)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L40)optionalinheritedbrowserPerProxy **browserPerProxy? : boolean Inherited from BrowserLaunchContext.browserPerProxy If set to `true`, the crawler respects the proxy URL generated for the given request. This aligns the browser-based crawlers with the `HttpCrawler`. Might cause performance issues, as Crawlee might launch too many browser instances. ### [**](#experimentalContainers)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L54)optionalinheritedexperimentalContainersexperimental **experimentalContainers? : boolean Inherited from BrowserLaunchContext.experimentalContainers Like `useIncognitoPages`, but for persistent contexts, so cache is used for faster loading. Works best with Firefox. 
Unstable on Chromium. ### [**](#launcher)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L60)optionallauncher **launcher? : unknown Overrides BrowserLaunchContext.launcher An already-required module (`Object`). This enables usage of various Puppeteer wrappers such as `puppeteer-extra`. Take caution, because it can cause all kinds of unexpected errors and weird behavior. Crawlee is not tested with any other library besides `puppeteer` itself. ### [**](#launchOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L32)optionallaunchOptions **launchOptions? : LaunchOptions Overrides BrowserLaunchContext.launchOptions `puppeteer.launch` [options](https://pptr.dev/api/puppeteer.launchoptions) ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L40)optionalproxyUrl **proxyUrl? : string Overrides BrowserLaunchContext.proxyUrl URL to an HTTP proxy server. It must define the port number, and it may also contain proxy username and password. Example: `http://bob:pass123@proxy.example.com:1234`. ### [**](#useChrome)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L51)optionaluseChrome **useChrome? : boolean = false Overrides BrowserLaunchContext.useChrome If `true` and `executablePath` is not set, Puppeteer will launch the full Google Chrome browser available on the machine rather than the bundled Chromium. The path to the Chrome executable is taken from the `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable if provided, or defaults to the typical Google Chrome executable location specific to the operating system. By default, this option is `false`. ### [**](#useIncognitoPages)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts#L67)optionaluseIncognitoPages **useIncognitoPages? : boolean = false Overrides BrowserLaunchContext.useIncognitoPages With this option selected, all pages will be opened in a new incognito browser context. This means they will not share cookies or cache, and their resources will not be throttled by one another. ### [**](#userAgent)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L68)optionalinheriteduserAgent **userAgent? : string Inherited from BrowserLaunchContext.userAgent The `User-Agent` HTTP header used by the browser. If not provided, the function sets `User-Agent` to a reasonable default to reduce the chance of detection of the crawler. ### [**](#userDataDir)[**](https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-launcher.ts#L61)optionalinheriteduserDataDir **userDataDir? : string Inherited from BrowserLaunchContext.userDataDir Sets the [User Data Directory](https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md) path. The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state. If not specified, a temporary directory is used instead. 
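To show how these properties fit together, here is a minimal sketch of passing a launch context to `PuppeteerCrawler`; the flags and start URL are placeholders:

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        useChrome: true,             // use the locally installed Chrome instead of the bundled Chromium
        useIncognitoPages: true,     // isolate cookies and cache between pages
        launchOptions: {
            headless: true,
            args: ['--disable-gpu'], // placeholder flag
        },
    },
    async requestHandler({ request, page }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```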
--- # PuppeteerRequestHandler ### Hierarchy * [BrowserRequestHandler](https://crawlee.dev/js/api/browser-crawler.md#BrowserRequestHandler)\> * *PuppeteerRequestHandler* ### Callable * ****PuppeteerRequestHandler**(inputs): Awaitable\ *** * #### Parameters * ##### inputs: { request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\>> } & Omit<{ request: [LoadedRequest](https://crawlee.dev/js/api/core.md#LoadedRequest)<[Request](https://crawlee.dev/js/api/core/class/Request.md)\> } & Omit<[PuppeteerCrawlingContext](https://crawlee.dev/js/api/puppeteer-crawler/interface/PuppeteerCrawlingContext.md)\, request>, request> #### Returns Awaitable\ --- # puppeteerClickElements ## Index[**](#Index) ### References * [**enqueueLinksByClickingElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#enqueueLinksByClickingElements) ### Interfaces * [**EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#EnqueueLinksByClickingElementsOptions) ### Functions * [**isTargetRelevant](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#isTargetRelevant) ## References[**](#References) ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L225)enqueueLinksByClickingElements Re-exports [enqueueLinksByClickingElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#enqueueLinksByClickingElements) ## Interfaces[**](#Interfaces) ### [**](#EnqueueLinksByClickingElementsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L30)EnqueueLinksByClickingElementsOptions **EnqueueLinksByClickingElementsOptions: ### [**](#clickOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L56)optionalclickOptions **clickOptions? : ClickOptions Click options for use in Puppeteer's click handler. ### [**](#exclude)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L83)optionalexclude **exclude? : readonly ([GlobInput](https://crawlee.dev/js/api/core.md#GlobInput) | [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput))\[] An array of glob pattern strings, regexp patterns or plain objects containing patterns matching URLs that will **never** be enqueued. The plain objects must include either the `glob` property or the `regexp` property. Glob matching is always case-insensitive. If you need case-sensitive matching, provide a regexp. ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L175)optionalforefront **forefront? : boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://crawlee.dev/js/api/core/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. 
### [**](#globs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L72)optionalglobs **globs? : [GlobInput](https://crawlee.dev/js/api/core.md#GlobInput)\[] An array of glob pattern strings or plain objects containing glob pattern strings matching the URLs to be enqueued. The plain objects must include at least the `glob` property, which holds the glob pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. The matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `globs` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. ### [**](#label)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L51)optionallabel **label? : string Sets [Request.label](https://crawlee.dev/js/api/core/class/Request.md#label) for newly enqueued requests. ### [**](#maxWaitForPageIdleSecs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L165)optionalmaxWaitForPageIdleSecs **maxWaitForPageIdleSecs? : number = 5 This is the maximum period for which the function will keep tracking events, even if more events keep coming. Its purpose is to prevent a deadlock in the page by periodic events, often unrelated to the clicking itself. See `waitForPageIdleSecs` above for an explanation. ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L34)page **page: Page Puppeteer [`Page`](https://pptr.dev/#?product=Puppeteer\&show=api-class-page) object. ### [**](#pseudoUrls)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L117)optionalpseudoUrls **pseudoUrls? : [PseudoUrlInput](https://crawlee.dev/js/api/core.md#PseudoUrlInput)\[] *NOTE:* In future versions of SDK the options will be removed. Please use `globs` or `regexps` instead. An array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings or plain objects containing [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) strings matching the URLs to be enqueued. The plain objects must include at least the `purl` property, which holds the pseudo-URL pattern string. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. With a pseudo-URL string, the matching is always case-insensitive. If you need case-sensitive matching, use `regexps` property directly. If `pseudoUrls` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. * **@deprecated** prefer using `globs` or `regexps` instead ### [**](#regexps)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L96)optionalregexps **regexps? : [RegExpInput](https://crawlee.dev/js/api/core.md#RegExpInput)\[] An array of regular expressions or plain objects containing regular expressions matching the URLs to be enqueued. 
The plain objects must include at least the `regexp` property, which holds the regular expression. All remaining keys will be used as request options for the corresponding enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. If `regexps` is an empty array or `undefined`, then the function enqueues all the intercepted navigation requests produced by the page after clicking on elements matching the provided CSS selector. ### [**](#requestQueue)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L39)requestQueue **requestQueue: [RequestProvider](https://crawlee.dev/js/api/core/class/RequestProvider.md) A request queue to which the URLs will be enqueued. ### [**](#selector)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L45)selector **selector: string A CSS selector matching elements to be clicked on. Unlike in enqueueLinks, there is no default value. This is to prevent suboptimal use of this function by using it too broadly. ### [**](#skipNavigation)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L181)optionalskipNavigation **skipNavigation? : boolean = false If set to `true`, tells the crawler to skip navigation and process the request directly. ### [**](#transformRequestFunction)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L140)optionaltransformRequestFunction **transformRequestFunction? : [RequestTransform](https://crawlee.dev/js/api/core/interface/RequestTransform.md) Just before a new [Request](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to remove it or modify its contents such as `userData`, `payload` or, most importantly `uniqueKey`. This is useful when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads, or to dynamically update or create `userData`. For example: by adding `useExtendedUniqueKey: true` to the `request` object, `uniqueKey` will be computed from a combination of `url`, `method` and `payload` which enables crawling of websites that navigate using form submits (POST requests). **Example:** ``` { transformRequestFunction: (request) => { request.userData.foo = 'bar'; request.useExtendedUniqueKey = true; return request; } } ``` ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L48)optionaluserData **userData? : Dictionary Sets [Request.userData](https://crawlee.dev/js/api/core/class/Request.md#userData) for newly enqueued requests. ### [**](#waitForPageIdleSecs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L157)optionalwaitForPageIdleSecs **waitForPageIdleSecs? : number = 1 Clicking in the page triggers various asynchronous operations that lead to new URLs being shown by the browser. It could be a simple JavaScript redirect or opening of a new tab in the browser. These events often happen only some time after the actual click. Requests typically take milliseconds while new tabs open in hundreds of milliseconds. 
To be able to capture all those events, the `enqueueLinksByClickingElements()` function repeatedly waits for the `waitForPageIdleSecs`. By repeatedly we mean that whenever a relevant event is triggered, the timer is restarted. As long as new events keep coming, the function will not return, unless the below `maxWaitForPageIdleSecs` timeout is reached. You may want to reduce this, for example, when you're sure that your clicks do not open new tabs, or increase it when you're not getting all the expected URLs. ## Functions[**](#Functions) ### [**](#isTargetRelevant)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L417)isTargetRelevant * ****isTargetRelevant**(page, target): boolean - We're only interested in pages created by the page we're currently clicking in. There will generally be a lot of other targets being created in the browser. *** #### Parameters * ##### page: Page * ##### target: Target #### Returns boolean --- # puppeteerRequestInterception ## Index[**](#Index) ### Type Aliases * [**InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) ### Functions * [**addInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#addInterceptRequestHandler) * [**removeInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#removeInterceptRequestHandler) ## Type Aliases[**](<#Type Aliases>) ### [**](#InterceptHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_request_interception.ts#L9)InterceptHandler **InterceptHandler: (request) => unknown #### Type declaration * * **(request): unknown - #### Parameters * ##### request: PuppeteerRequest #### Returns unknown ## Functions[**](#Functions) ### [**](#addInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_request_interception.ts#L160)addInterceptRequestHandler * ****addInterceptRequestHandler**(page, handler): Promise\ - Adds a request interception handler, similar to `page.on('request', handler);`, but with support for multiple parallel handlers. All the handlers are executed sequentially in the order in which they were added. Each of the handlers must call one of `request.continue()`, `request.abort()` or `request.respond()`. In addition, any of the handlers may modify the request object (method, postData, headers) by passing its overrides to `request.continue()`. If multiple handlers modify the same property, the last one wins. Headers are merged separately, so you can override only the value of a specific header. If one of the handlers calls `request.abort()` or `request.respond()`, the request is not propagated further to any of the remaining handlers. **Example usage:** ``` // Replace images with placeholder. await addInterceptRequestHandler(page, (request) => { if (request.resourceType() === 'image') { return request.respond({ statusCode: 200, contentType: 'image/jpeg', body: placeholderImageBuffer, }); } return request.continue(); }); // Abort all the scripts. await addInterceptRequestHandler(page, (request) => { if (request.resourceType() === 'script') return request.abort(); return request.continue(); }); // Change requests to post. 
await addInterceptRequestHandler(page, (request) => { return request.continue({ method: 'POST', }); }); await page.goto('http://example.com'); ``` *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/#?product=Puppeteer\&show=api-class-page) object. * ##### handler: [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) Request interception handler. #### Returns Promise\ ### [**](#removeInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_request_interception.ts#L203)removeInterceptRequestHandler * ****removeInterceptRequestHandler**(page, handler): Promise\ - Removes request interception handler for given page. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/#?product=Puppeteer\&show=api-class-page) object. * ##### handler: [InterceptHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#InterceptHandler) Request interception handler. #### Returns Promise\ --- # puppeteerUtils A namespace that contains various utilities for [Puppeteer](https://github.com/puppeteer/puppeteer) - the headless Chrome Node API. **Example usage:** ``` import { launchPuppeteer, utils } from 'crawlee'; // Open https://www.example.com in Puppeteer const browser = await launchPuppeteer(); const page = await browser.newPage(); await page.goto('https://www.example.com'); // Inject jQuery into a page await utils.puppeteer.injectJQuery(page); ``` ## Index[**](#Index) ### References * [**addInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#addInterceptRequestHandler) * [**removeInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#removeInterceptRequestHandler) ### Interfaces * [**BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#BlockRequestsOptions) * [**CompiledScriptParams](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptParams) * [**DirectNavigationOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#DirectNavigationOptions) * [**InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InfiniteScrollOptions) * [**InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InjectFileOptions) * [**SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#SaveSnapshotOptions) ### Type Aliases * [**CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) ### Functions * [**blockRequests](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#blockRequests) * [**blockResources](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#blockResources) * [**cacheResponses](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#cacheResponses) * [**closeCookieModals](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#closeCookieModals) * [**compileScript](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#compileScript) * [**enqueueLinksByClickingElements](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#enqueueLinksByClickingElements) * 
[**gotoExtended](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#gotoExtended) * [**infiniteScroll](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#infiniteScroll) * [**injectFile](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#injectFile) * [**injectJQuery](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#injectJQuery) * [**parseWithCheerio](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#parseWithCheerio) * [**saveSnapshot](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#saveSnapshot) ## References[**](#References) ### [**](#addInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1118)addInterceptRequestHandler Re-exports [addInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#addInterceptRequestHandler) ### [**](#removeInterceptRequestHandler)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L1118)removeInterceptRequestHandler Re-exports [removeInterceptRequestHandler](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerRequestInterception.md#removeInterceptRequestHandler) ## Interfaces[**](#Interfaces) ### [**](#BlockRequestsOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L84)BlockRequestsOptions **BlockRequestsOptions: ### [**](#extraUrlPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L96)optionalextraUrlPatterns **extraUrlPatterns? : string\[] If you just want to append to the default blocked patterns, use this property. ### [**](#urlPatterns)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L91)optionalurlPatterns **urlPatterns? : string\[] The patterns of URLs to block from being loaded by the browser. Only `*` can be used as a wildcard. It is also automatically added to the beginning and end of the pattern. This limitation is enforced by the DevTools protocol. `.png` is the same as `*.png*`. ### [**](#CompiledScriptParams)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L99)CompiledScriptParams **CompiledScriptParams: ### [**](#page)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L100)page **page: Page ### [**](#request)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L101)request **request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ ### [**](#DirectNavigationOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L50)DirectNavigationOptions **DirectNavigationOptions: ### [**](#referer)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L72)optionalreferer **referer? : string Referer header value. If provided it will take preference over the referer header value set by page.setExtraHTTPHeaders(headers). 
### [**](#timeout)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L57)optionaltimeout **timeout? : number Maximum operation time in milliseconds, defaults to 30 seconds, pass `0` to disable timeout. The default value can be changed by using the browserContext.setDefaultNavigationTimeout(timeout), browserContext.setDefaultTimeout(timeout), page.setDefaultNavigationTimeout(timeout) or page.setDefaultTimeout(timeout) methods. ### [**](#waitUntil)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L67)optionalwaitUntil **waitUntil? : domcontentloaded | load | networkidle | networkidle0 | networkidle2 When to consider operation succeeded, defaults to `load`. Events can be either: * `domcontentloaded` - consider operation to be finished when the `DOMContentLoaded` event is fired. * `load` - consider operation to be finished when the `load` event is fired. * `networkidle0` - consider operation to be finished when there are no network connections for at least `500` ms. * `networkidle2` - consider operation to be finished when there are no more than 2 network connections for at least `500` ms. * `networkidle` - alias for `networkidle0` ### [**](#InfiniteScrollOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L526)InfiniteScrollOptions **InfiniteScrollOptions: ### [**](#buttonSelector)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L554)optionalbuttonSelector **buttonSelector? : string Optionally checks and clicks a button if it appears while scrolling. This is required on some websites for the scroll to work. ### [**](#maxScrollHeight)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L537)optionalmaxScrollHeight **maxScrollHeight? : number = 0 How many pixels to scroll down. If 0, will scroll until bottom of page. ### [**](#scrollDownAndUp)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L549)optionalscrollDownAndUp **scrollDownAndUp? : boolean = false If true, it will scroll up a bit after each scroll down. This is required on some websites for the scroll to work. ### [**](#stopScrollCallback)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L559)optionalstopScrollCallback **stopScrollCallback? : () => unknown This function is called after every scroll and stops the scrolling process if it returns `true`. The function can be `async`. *** #### Type declaration * * **(): unknown - #### Returns unknown ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L531)optionaltimeoutSecs **timeoutSecs? : number = 0 How many seconds to scroll for. If 0, will scroll until bottom of page. ### [**](#waitForSecs)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L543)optionalwaitForSecs **waitForSecs? : number = 4 How many seconds to wait for no new content to load before exit. 
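The options above are easiest to understand in combination. Below is a minimal, illustrative sketch of passing them to `utils.puppeteer.infiniteScroll()` (the function itself is documented further below); the `button.load-more` and `.item` selectors and the numeric limits are assumptions made up for the example, not defaults.

```
import { launchPuppeteer, utils } from 'crawlee';

const browser = await launchPuppeteer();
const page = await browser.newPage();
await page.goto('https://www.example.com');

// Scroll for at most 30 seconds, click a "Load more" button whenever it appears,
// and stop early once enough items have been rendered on the page.
await utils.puppeteer.infiniteScroll(page, {
    timeoutSecs: 30,
    waitForSecs: 4,
    buttonSelector: 'button.load-more', // assumed selector, adjust for your page
    stopScrollCallback: async () => {
        const itemCount = await page.$$eval('.item', (els) => els.length);
        return itemCount >= 100; // returning true stops the scrolling
    },
});
```

Using `stopScrollCallback` together with `timeoutSecs` bounds the scrolling both by the amount of content already loaded and by time.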
### [**](#InjectFileOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L75)InjectFileOptions **InjectFileOptions: ### [**](#surviveNavigations)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L81)optionalsurviveNavigations **surviveNavigations? : boolean Enables the injected script to survive page navigations and reloads without the need to be re-injected manually. This does not mean, however, that internal state will be preserved; it only means that the script will be automatically re-injected on each navigation before any other scripts get the chance to execute. ### [**](#SaveSnapshotOptions)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L690)SaveSnapshotOptions **SaveSnapshotOptions: ### [**](#config)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L725)optionalconfig **config? : [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) = [Configuration](https://crawlee.dev/js/api/core/class/Configuration.md) Configuration of the crawler that will be used to save the snapshot. ### [**](#key)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L695)optionalkey **key? : string = 'SNAPSHOT' Key under which the screenshot and HTML will be saved. `.jpg` will be appended for the screenshot and `.html` for the HTML. ### [**](#keyValueStoreName)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L719)optionalkeyValueStoreName **keyValueStoreName? : null | string = null Name or ID of the Key-Value store where the snapshot is saved. By default, it is saved to the default Key-Value store. ### [**](#saveHtml)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L713)optionalsaveHtml **saveHtml? : boolean = true If true, it will save the full HTML of the current page as a record with `key` appended by `.html`. ### [**](#saveScreenshot)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L707)optionalsaveScreenshot **saveScreenshot? : boolean = true If true, it will save a full screenshot of the current page as a record with `key` appended by `.jpg`. ### [**](#screenshotQuality)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L701)optionalscreenshotQuality **screenshotQuality? : number = 50 The quality of the image, between 0-100. Higher quality images have a bigger size and require more storage. 
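For a quick illustration of how these options are used, here is a minimal sketch of a `utils.puppeteer.saveSnapshot()` call (the function is documented further below); the key and the key-value store name are made-up values for the example.

```
import { launchPuppeteer, utils } from 'crawlee';

const browser = await launchPuppeteer();
const page = await browser.newPage();
await page.goto('https://www.example.com');

// Save both the HTML and a screenshot of the page under the key 'example-page',
// producing the records example-page.html and example-page.jpg.
await utils.puppeteer.saveSnapshot(page, {
    key: 'example-page',
    saveHtml: true,
    saveScreenshot: true,
    screenshotQuality: 60, // 0-100, higher means bigger files
    keyValueStoreName: 'debug-snapshots', // assumed store name; omit to use the default store
});
```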
## Type Aliases[**](<#Type Aliases>) ### [**](#CompiledScriptFunction)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L104)CompiledScriptFunction **CompiledScriptFunction: (params) => Promise\ #### Type declaration * * **(params): Promise\ - #### Parameters * ##### params: [CompiledScriptParams](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptParams) #### Returns Promise\ ## Functions[**](#Functions) ### [**](#blockRequests)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L282)blockRequests * ****blockRequests**(page, options): Promise\ - Forces the Puppeteer browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources. By default, the function will block all URLs including the following patterns: ``` [".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"] ``` If you want to extend this list further, use the `extraUrlPatterns` option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the `urlPatterns` option, which will override the defaults and block only URLs with your custom patterns. This function does not use Puppeteer's request interception and therefore does not interfere with the browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception. The function will never block main document loads and their respective redirects. **Example usage** ``` import { launchPuppeteer, utils } from 'crawlee'; const browser = await launchPuppeteer(); const page = await browser.newPage(); // Block all requests to URLs that include `adsbygoogle.js` and also all defaults. await utils.puppeteer.blockRequests(page, { extraUrlPatterns: ['adsbygoogle.js'], }); await page.goto('https://cnn.com'); ``` *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### optionaloptions: [BlockRequestsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#BlockRequestsOptions) = {} #### Returns Promise\ ### [**](#blockResources)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L334)blockResources * ****blockResources**(page, resourceTypes): Promise\ - #### Parameters * ##### page: Page * ##### resourceTypes: string\[] = ... #### Returns Promise\ ### [**](#cacheResponses)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L362)cacheResponses * ****cacheResponses**(page, cache, responseUrlRules): Promise\ - *NOTE:* In recent versions of Puppeteer, using this function entirely disables the browser cache, which results in sub-optimal performance. Until this is resolved, we suggest relying on the in-browser cache unless absolutely necessary. Enables caching of intercepted responses into a provided object. Automatically enables request interception in Puppeteer. *IMPORTANT*: Caching responses stores them to memory, so overly loose rules could cause memory leaks in longer-running crawlers. 
This issue should be resolved or at least mitigated in future iterations of this feature. * **@deprecated** *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### cache: Dictionary\> Object in which responses are stored. * ##### responseUrlRules: (string | RegExp)\[] List of rules that are used to check if the response should be cached. String rules are compared as `page.url().includes(rule)`, while RegExp rules are evaluated as `rule.test(page.url())`. #### Returns Promise\ ### [**](#closeCookieModals)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L781)closeCookieModals * ****closeCookieModals**(page): Promise\ - #### Parameters * ##### page: Page #### Returns Promise\ ### [**](#compileScript)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L440)compileScript * ****compileScript**(scriptString, context): [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) - Compiles a Puppeteer script into an async function that may be executed at any time by providing it with the following object: ``` { page: Page, request: Request, } ``` Where `page` is a Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) and `request` is a [Request](https://crawlee.dev/js/api/core/class/Request.md). The function is compiled by using the `scriptString` parameter as the function's body, so any limitations to function bodies apply. The return value of the compiled function is the return value of the function body, i.e. the `scriptString` parameter. As a security measure, no globals such as `process` or `require` are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore, you should only use this function to execute sanitized or safe code. Custom context may also be provided using the `context` parameter. To improve security, make sure to only pass the objects that are really necessary to the context, preferably making secured copies beforehand. *** #### Parameters * ##### scriptString: string * ##### context: Dictionary = ... #### Returns [CompiledScriptFunction](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#CompiledScriptFunction) ### [**](#enqueueLinksByClickingElements)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L225)enqueueLinksByClickingElements * ****enqueueLinksByClickingElements**(options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - The function finds elements matching a specific CSS selector in a Puppeteer page, clicks all those elements using a mouse move and a left mouse button click, and intercepts all the navigation requests that are subsequently produced by the page. The intercepted requests, including their methods, headers and payloads, are then enqueued to a provided [RequestQueue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This is useful to crawl JavaScript-heavy pages where links are not available in `href` elements, but rather navigations are triggered in click handlers. If you're looking to find URLs in `href` attributes of the page, see enqueueLinks. 
Optionally, the function allows you to filter the target links' URLs using an array of [PseudoUrl](https://crawlee.dev/js/api/core/class/PseudoUrl.md) objects and override settings of the enqueued [Request](https://crawlee.dev/js/api/core/class/Request.md) objects. **IMPORTANT**: To be able to do this, this function uses various mutations on the page, such as changing the Z-index of elements being clicked and their visibility. Therefore, it is recommended to only use this function as the last operation in the page. **USING HEADFUL BROWSER**: When using a headful browser, this function will only be able to click elements in the focused tab, effectively limiting concurrency to 1. In headless mode, full concurrency can be achieved. **PERFORMANCE**: Clicking elements with a mouse and intercepting requests is not a low-level operation that takes nanoseconds. It's not very CPU intensive, but it takes time. We strongly recommend limiting the scope of the clicking as much as possible by using a specific selector that targets only the elements that you assume or know will produce a navigation. You can certainly click everything by using the `*` selector, but be prepared to wait minutes to get results on a large and complex page. **Example usage** ``` await utils.puppeteer.enqueueLinksByClickingElements({ page, requestQueue, selector: 'a.product-detail', pseudoUrls: [ 'https://www.example.com/handbags/[.*]', 'https://www.example.com/purses/[.*]' ], }); ``` *** #### Parameters * ##### options: [EnqueueLinksByClickingElementsOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerClickElements.md#EnqueueLinksByClickingElementsOptions) #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> Promise that resolves to a [BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) object. ### [**](#gotoExtended)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L468)gotoExtended * ****gotoExtended**(page, request, gotoOptions): Promise\ - Extended version of Puppeteer's `page.goto()`, allowing you to perform requests with an HTTP method other than GET, with custom headers and a POST payload. The URL, method, headers and payload are taken from the `request` parameter, which must be an instance of the Request class. *NOTE:* In recent versions of Puppeteer, using requests other than GET, overriding headers or adding payloads disables the browser cache, which degrades performance. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### request: [Request](https://crawlee.dev/js/api/core/class/Request.md)\ * ##### optionalgotoOptions: [DirectNavigationOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#DirectNavigationOptions) = {} Custom options for `page.goto()`. #### Returns Promise\ ### [**](#infiniteScroll)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L568)infiniteScroll * ****infiniteScroll**(page, options): Promise\ - Scrolls to the bottom of a page, or until it times out. Loads dynamic content when it hits the bottom of a page, and then continues scrolling. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. 
* ##### optionaloptions: [InfiniteScrollOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InfiniteScrollOptions) = {} #### Returns Promise\ ### [**](#injectFile)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L122)injectFile * ****injectFile**(page, filePath, options): Promise\ - Injects a JavaScript file into a Puppeteer page. Unlike Puppeteer's `addScriptTag` function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies. File contents are cached for up to 10 files to limit file system access. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### filePath: string File path. * ##### optionaloptions: [InjectFileOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#InjectFileOptions) = {} #### Returns Promise\ ### [**](#injectJQuery)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L175)injectJQuery * ****injectJQuery**(page, options): Promise\ - Injects the [jQuery](https://jquery.com/) library into a Puppeteer page. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors. Beware that the injected jQuery object will be set to the `window.$` variable and thus it might cause conflicts with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect the functionality of the page's scripts. The injected jQuery will survive page navigations and reloads by default. **Example usage:** ``` await utils.puppeteer.injectJQuery(page); const title = await page.evaluate(() => { return $('head title').text(); }); ``` Note that `injectJQuery()` does not affect Puppeteer's [`page.$()`](https://pptr.dev/api/puppeteer.page._/) function in any way. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### optionaloptions: { surviveNavigations?: boolean } * ##### optionalsurviveNavigations: boolean Opt-out option to disable the jQuery re-injection after navigation. #### Returns Promise\ ### [**](#parseWithCheerio)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L192)parseWithCheerio * ****parseWithCheerio**(page, ignoreShadowRoots, ignoreIframes): Promise<[CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot)> - Returns a Cheerio handle for `page.content()`, allowing you to work with the data the same way as with [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). **Example usage:** ``` const $ = await utils.puppeteer.parseWithCheerio(page); const title = $('title').text(); ``` *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. * ##### ignoreShadowRoots: boolean = false * ##### ignoreIframes: boolean = false #### Returns Promise<[CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot)> ### [**](#saveSnapshot)[**](https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts#L733)saveSnapshot * ****saveSnapshot**(page, options): Promise\ - Saves a full screenshot and HTML of the current page into a Key-Value store. *** #### Parameters * ##### page: Page Puppeteer [`Page`](https://pptr.dev/api/puppeteer.page) object. 
* ##### optionaloptions: [SaveSnapshotOptions](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#SaveSnapshotOptions) = {} #### Returns Promise\ --- # @crawlee/types ## Index[**](#Index) ### References * [**Cookie](https://crawlee.dev/js/api/types.md#Cookie) * [**QueueOperationInfo](https://crawlee.dev/js/api/types.md#QueueOperationInfo) * [**StorageClient](https://crawlee.dev/js/api/types.md#StorageClient) ### Interfaces * [**BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md) * [**BrowserLikeResponse](https://crawlee.dev/js/api/types/interface/BrowserLikeResponse.md) * [**Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md) * [**DatasetClient](https://crawlee.dev/js/api/types/interface/DatasetClient.md) * [**DatasetClientListOptions](https://crawlee.dev/js/api/types/interface/DatasetClientListOptions.md) * [**DatasetClientUpdateOptions](https://crawlee.dev/js/api/types/interface/DatasetClientUpdateOptions.md) * [**DatasetCollectionClient](https://crawlee.dev/js/api/types/interface/DatasetCollectionClient.md) * [**DatasetCollectionClientOptions](https://crawlee.dev/js/api/types/interface/DatasetCollectionClientOptions.md) * [**DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md) * [**DatasetInfo](https://crawlee.dev/js/api/types/interface/DatasetInfo.md) * [**DatasetStats](https://crawlee.dev/js/api/types/interface/DatasetStats.md) * [**DeleteRequestLockOptions](https://crawlee.dev/js/api/types/interface/DeleteRequestLockOptions.md) * [**KeyValueStoreClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreClient.md) * [**KeyValueStoreClientGetRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientGetRecordOptions.md) * [**KeyValueStoreClientListData](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListData.md) * [**KeyValueStoreClientListOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListOptions.md) * [**KeyValueStoreClientUpdateOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientUpdateOptions.md) * [**KeyValueStoreCollectionClient](https://crawlee.dev/js/api/types/interface/KeyValueStoreCollectionClient.md) * [**KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md) * [**KeyValueStoreItemData](https://crawlee.dev/js/api/types/interface/KeyValueStoreItemData.md) * [**KeyValueStoreRecord](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecord.md) * [**KeyValueStoreRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecordOptions.md) * [**KeyValueStoreStats](https://crawlee.dev/js/api/types/interface/KeyValueStoreStats.md) * [**ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md) * [**ListAndLockOptions](https://crawlee.dev/js/api/types/interface/ListAndLockOptions.md) * [**ListOptions](https://crawlee.dev/js/api/types/interface/ListOptions.md) * [**PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md) * [**ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md) * [**ProlongRequestLockOptions](https://crawlee.dev/js/api/types/interface/ProlongRequestLockOptions.md) * [**ProlongRequestLockResult](https://crawlee.dev/js/api/types/interface/ProlongRequestLockResult.md) * [**QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md) * [**RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) * 
[**RequestQueueClient](https://crawlee.dev/js/api/types/interface/RequestQueueClient.md) * [**RequestQueueCollectionClient](https://crawlee.dev/js/api/types/interface/RequestQueueCollectionClient.md) * [**RequestQueueHeadItem](https://crawlee.dev/js/api/types/interface/RequestQueueHeadItem.md) * [**RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md) * [**RequestQueueOptions](https://crawlee.dev/js/api/types/interface/RequestQueueOptions.md) * [**RequestQueueStats](https://crawlee.dev/js/api/types/interface/RequestQueueStats.md) * [**RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md) * [**SetStatusMessageOptions](https://crawlee.dev/js/api/types/interface/SetStatusMessageOptions.md) * [**UnprocessedRequest](https://crawlee.dev/js/api/types/interface/UnprocessedRequest.md) * [**UpdateRequestSchema](https://crawlee.dev/js/api/types/interface/UpdateRequestSchema.md) ### Type Aliases * [**AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) ## References[**](#References) ### [**](#Cookie)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L3)Cookie Re-exports [Cookie](https://crawlee.dev/js/api/core/interface/Cookie.md) ### [**](#QueueOperationInfo)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L7)QueueOperationInfo Re-exports [QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md) ### [**](#StorageClient)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L323)StorageClient Re-exports [StorageClient](https://crawlee.dev/js/api/core/interface/StorageClient.md) ## Type Aliases[**](<#Type Aliases>) ### [**](#AllowedHttpMethods)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/utility-types.ts#L10)AllowedHttpMethods **AllowedHttpMethods: GET | HEAD | POST | PUT | DELETE | TRACE | OPTIONS | CONNECT | PATCH --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/types ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") **Note:** Version bump only for package @crawlee/types ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/types # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/types ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/types # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) **Note:** Version bump only for package @crawlee/types ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/types ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") **Note:** Version bump only for package @crawlee/types ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Features[​](#features "Direct link to Features") * support `KVS.listKeys()` `prefix` and `collection` parameters ([#3001](https://github.com/apify/crawlee/issues/3001)) ([5c4726d](https://github.com/apify/crawlee/commit/5c4726df96e358a9bbf44a0cd2760e4e269f0fae)), closes [#2974](https://github.com/apify/crawlee/issues/2974) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/types ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/types ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** Version bump only for package @crawlee/types ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * **core:** use short timeouts for periodic `KVS.setRecord` calls ([#2962](https://github.com/apify/crawlee/issues/2962)) ([d31d90e](https://github.com/apify/crawlee/commit/d31d90e5288ea80b3ed6ec4a75a4b8f87686a2c4)) ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/types ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/types ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") **Note:** Version bump only for package @crawlee/types # 
[3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * Simplified RequestQueueV2 implementation ([#2775](https://github.com/apify/crawlee/issues/2775)) ([d1a094a](https://github.com/apify/crawlee/commit/d1a094a47eaecbf367b222f9b8c14d7da5d3e03a)), closes [#2767](https://github.com/apify/crawlee/issues/2767) [#2700](https://github.com/apify/crawlee/issues/2700) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/types ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") **Note:** Version bump only for package @crawlee/types # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) **Note:** Version bump only for package @crawlee/types ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/types ## [3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") **Note:** Version bump only for package @crawlee/types ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") **Note:** Version bump only for package @crawlee/types ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") **Note:** Version bump only for package @crawlee/types ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") **Note:** Version bump only for package @crawlee/types # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) **Note:** Version bump only for package @crawlee/types ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/types ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/types ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") **Note:** Version bump only for package @crawlee/types ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) (2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") **Note:** Version bump only for package @crawlee/types ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") **Note:** Version bump only for package @crawlee/types # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) **Note:** Version bump only for package @crawlee/types ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") **Note:** Version bump only for package @crawlee/types ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for 
package @crawlee/types # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) **Note:** Version bump only for package @crawlee/types ## [3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") **Note:** Version bump only for package @crawlee/types ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/types # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-1 "Direct link to Features") * `KeyValueStore.recordExists()` ([#2339](https://github.com/apify/crawlee/issues/2339)) ([8507a65](https://github.com/apify/crawlee/commit/8507a65d1ad079f64c752a6ddb1d8fac9b494228)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") **Note:** Version bump only for package @crawlee/types ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/types ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") **Note:** Version bump only for package @crawlee/types # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) **Note:** Version bump only for package @crawlee/types ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/types ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package @crawlee/types # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) **Note:** Version bump only for package @crawlee/types ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") **Note:** Version bump only for package @crawlee/types ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/types ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") **Note:** Version bump only for package @crawlee/types ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") **Note:** Version bump only for package @crawlee/types ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/types ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") **Note:** Version bump only for package @crawlee/types ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/types ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump 
only for package @crawlee/types # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-2 "Direct link to Features") * **basic-crawler:** allow configuring the automatic status message ([#2001](https://github.com/apify/crawlee/issues/2001)) ([3eb4e4c](https://github.com/apify/crawlee/commit/3eb4e4c558b4bc0673fbff75b1db19c46004a1da)) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") **Note:** Version bump only for package @crawlee/types ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package @crawlee/types # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/types ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/types ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") ### Features[​](#features-3 "Direct link to Features") * RQv2 memory storage support ([#1874](https://github.com/apify/crawlee/issues/1874)) ([049486b](https://github.com/apify/crawlee/commit/049486b772cc2accd2d2d226d8c8726e5ab933a9)) ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") **Note:** Version bump only for package @crawlee/types # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * **MemoryStorage:** RequestQueue#handledRequestCount should update ([#1817](https://github.com/apify/crawlee/issues/1817)) ([a775e4a](https://github.com/apify/crawlee/commit/a775e4afea20d0b31492f44b90f61b6a903491b6)), closes [#1764](https://github.com/apify/crawlee/issues/1764) ### Features[​](#features-4 "Direct link to Features") * add basic support for `setStatusMessage` ([#1790](https://github.com/apify/crawlee/issues/1790)) ([c318980](https://github.com/apify/crawlee/commit/c318980ec11d211b1a5c9e6bdbe76198c5d895be)) * move the status message implementation to Crawlee, noop in storage ([#1808](https://github.com/apify/crawlee/issues/1808)) ([99c3fdc](https://github.com/apify/crawlee/commit/99c3fdc18030b7898e6b6d149d6d94fab7881f09)) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/types ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/types # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Features[​](#features-5 "Direct link to Features") * **MemoryStorage:** read from fs if persistStorage is enabled, ram only otherwise ([#1761](https://github.com/apify/crawlee/issues/1761)) ([e903980](https://github.com/apify/crawlee/commit/e9039809a0c0af0bc086be1f1400d18aa45ae490)) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/types ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/types # 3.1.0 
(2022-10-13) **Note:** Version bump only for package @crawlee/types ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/types --- # BatchAddRequestsResult ## Index[**](#Index) ### Properties * [**processedRequests](#processedRequests) * [**unprocessedRequests](#unprocessedRequests) ## Properties[**](#Properties) ### [**](#processedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L291)processedRequests **processedRequests: [ProcessedRequest](https://crawlee.dev/js/api/types/interface/ProcessedRequest.md)\[] ### [**](#unprocessedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L292)unprocessedRequests **unprocessedRequests: [UnprocessedRequest](https://crawlee.dev/js/api/types/interface/UnprocessedRequest.md)\[] --- # BrowserLikeResponse ## Index[**](#Index) ### Methods * [**headers](#headers) * [**url](#url) ## Methods[**](#Methods) ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L63)headers * ****headers**(): Dictionary\ - #### Returns Dictionary\ ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/browser.ts#L62)url * ****url**(): string - #### Returns string --- # Dataset ### Hierarchy * [DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md) * *Dataset* ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**createdAt](#createdAt) * [**id](#id) * [**itemCount](#itemCount) * [**modifiedAt](#modifiedAt) * [**name](#name) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L27)inheritedaccessedAt **accessedAt: Date Inherited from DatasetCollectionData.accessedAt ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L25)inheritedcreatedAt **createdAt: Date Inherited from DatasetCollectionData.createdAt ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L23)inheritedid **id: string Inherited from DatasetCollectionData.id ### [**](#itemCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L46)itemCount **itemCount: number ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L26)inheritedmodifiedAt **modifiedAt: Date Inherited from DatasetCollectionData.modifiedAt ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L24)optionalinheritedname **name? 
: string Inherited from DatasetCollectionData.name --- # DatasetClient \ ## Index[**](#Index) ### Methods * [**delete](#delete) * [**downloadItems](#downloadItems) * [**get](#get) * [**listItems](#listItems) * [**pushItems](#pushItems) * [**update](#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L87)delete * ****delete**(): Promise\ - #### Returns Promise\ ### [**](#downloadItems)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L88)downloadItems * ****downloadItems**(...args): Promise\> - #### Parameters * ##### rest...args: unknown\[] #### Returns Promise\> ### [**](#get)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L85)get * ****get**(): Promise\ - #### Returns Promise\ ### [**](#listItems)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L89)listItems * ****listItems**(options): Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)\> - #### Parameters * ##### optionaloptions: [DatasetClientListOptions](https://crawlee.dev/js/api/types/interface/DatasetClientListOptions.md) #### Returns Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)\> ### [**](#pushItems)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L90)pushItems * ****pushItems**(items): Promise\ - #### Parameters * ##### items: string | Data | string\[] | Data\[] #### Returns Promise\ ### [**](#update)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L86)update * ****update**(newFields): Promise\> - #### Parameters * ##### newFields: [DatasetClientUpdateOptions](https://crawlee.dev/js/api/types/interface/DatasetClientUpdateOptions.md) #### Returns Promise\> --- # DatasetClientListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L62)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L63)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L64)optionaloffset **offset? : number --- # DatasetClientUpdateOptions ## Index[**](#Index) ### Properties * [**name](#name) ## Properties[**](#Properties) ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L58)optionalname **name? : string --- # DatasetCollectionClient Dataset collection client. 
## Index[**](#Index) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L54)getOrCreate * ****getOrCreate**(name): Promise<[DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md)> - #### Parameters * ##### optionalname: string #### Returns Promise<[DatasetCollectionData](https://crawlee.dev/js/api/types/interface/DatasetCollectionData.md)> ### [**](#list)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L53)list * ****list**(): Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md)>> - #### Returns Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md)>> --- # DatasetCollectionClientOptions ## Index[**](#Index) ### Properties * [**storageDir](#storageDir) ## Properties[**](#Properties) ### [**](#storageDir)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L19)storageDir **storageDir: string --- # DatasetCollectionData ### Hierarchy * *DatasetCollectionData* * [Dataset](https://crawlee.dev/js/api/types/interface/Dataset.md) ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**createdAt](#createdAt) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L27)accessedAt **accessedAt: Date ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L25)createdAt **createdAt: Date ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L23)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L26)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L24)optionalname **name? : string --- # DatasetInfo ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**createdAt](#createdAt) * [**id](#id) * [**itemCount](#itemCount) * [**modifiedAt](#modifiedAt) * [**name](#name) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L72)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L74)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L75)optionalactRunId **actRunId? : string ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L70)createdAt **createdAt: Date ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L68)id **id: string ### [**](#itemCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L73)itemCount **itemCount: number ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L71)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L69)optionalname **name? 
: string --- # DatasetStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L80)optionaldeleteCount **deleteCount? : number ### [**](#readCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L78)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L81)optionalstorageBytes **storageBytes? : number ### [**](#writeCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L79)optionalwriteCount **writeCount? : number --- # DeleteRequestLockOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L250)optionalforefront **forefront? : boolean --- # KeyValueStoreClient Key-value Store client. ## Index[**](#Index) ### Methods * [**delete](#delete) * [**deleteRecord](#deleteRecord) * [**get](#get) * [**getRecord](#getRecord) * [**listKeys](#listKeys) * [**recordExists](#recordExists) * [**setRecord](#setRecord) * [**update](#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L168)delete * ****delete**(): Promise\ - #### Returns Promise\ ### [**](#deleteRecord)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L173)deleteRecord * ****deleteRecord**(key): Promise\ - #### Parameters * ##### key: string #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L166)get * ****get**(): Promise\ - #### Returns Promise\ ### [**](#getRecord)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L171)getRecord * ****getRecord**(key, options): Promise\ - #### Parameters * ##### key: string * ##### optionaloptions: [KeyValueStoreClientGetRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientGetRecordOptions.md) #### Returns Promise\ ### [**](#listKeys)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L169)listKeys * ****listKeys**(options): Promise<[KeyValueStoreClientListData](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListData.md)> - #### Parameters * ##### optionaloptions: [KeyValueStoreClientListOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListOptions.md) #### Returns Promise<[KeyValueStoreClientListData](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientListData.md)> ### [**](#recordExists)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L170)recordExists * ****recordExists**(key): Promise\ - #### Parameters * ##### key: string #### Returns Promise\ ### [**](#setRecord)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L172)setRecord * ****setRecord**(record, options): Promise\ - #### Parameters * ##### record: [KeyValueStoreRecord](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecord.md) * ##### optionaloptions: [KeyValueStoreRecordOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreRecordOptions.md) #### Returns Promise\ ### 
[**](#update)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L167)update * ****update**(newFields): Promise\> - #### Parameters * ##### newFields: [KeyValueStoreClientUpdateOptions](https://crawlee.dev/js/api/types/interface/KeyValueStoreClientUpdateOptions.md) #### Returns Promise\> --- # KeyValueStoreClientGetRecordOptions ## Index[**](#Index) ### Properties * [**buffer](#buffer) * [**stream](#stream) ## Properties[**](#Properties) ### [**](#buffer)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L158)optionalbuffer **buffer? : boolean ### [**](#stream)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L159)optionalstream **stream? : boolean --- # KeyValueStoreClientListData ## Index[**](#Index) ### Properties * [**count](#count) * [**exclusiveStartKey](#exclusiveStartKey) * [**isTruncated](#isTruncated) * [**items](#items) * [**limit](#limit) * [**nextExclusiveStartKey](#nextExclusiveStartKey) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L149)count **count: number ### [**](#exclusiveStartKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L151)optionalexclusiveStartKey **exclusiveStartKey? : string ### [**](#isTruncated)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L152)isTruncated **isTruncated: boolean ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L154)items **items: [KeyValueStoreItemData](https://crawlee.dev/js/api/types/interface/KeyValueStoreItemData.md)\[] ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L150)limit **limit: number ### [**](#nextExclusiveStartKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L153)optionalnextExclusiveStartKey **nextExclusiveStartKey? : string --- # KeyValueStoreClientListOptions ## Index[**](#Index) ### Properties * [**collection](#collection) * [**exclusiveStartKey](#exclusiveStartKey) * [**limit](#limit) * [**prefix](#prefix) ## Properties[**](#Properties) ### [**](#collection)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L139)optionalcollection **collection? : string ### [**](#exclusiveStartKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L138)optionalexclusiveStartKey **exclusiveStartKey? : string ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L137)optionallimit **limit? : number ### [**](#prefix)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L140)optionalprefix **prefix? : string --- # KeyValueStoreClientUpdateOptions ## Index[**](#Index) ### Properties * [**name](#name) ## Properties[**](#Properties) ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L133)optionalname **name? : string --- # KeyValueStoreCollectionClient Key-value store collection client. 
## Index[**](#Index) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L118)getOrCreate * ****getOrCreate**(name): Promise<[KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md)> - #### Parameters * ##### optionalname: string #### Returns Promise<[KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md)> ### [**](#list)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L117)list * ****list**(): Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md)>> - #### Returns Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[KeyValueStoreInfo](https://crawlee.dev/js/api/types/interface/KeyValueStoreInfo.md)>> --- # KeyValueStoreInfo ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**createdAt](#createdAt) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**stats](#stats) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L107)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L108)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L109)optionalactRunId **actRunId? : string ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L105)createdAt **createdAt: Date ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L102)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L106)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L103)optionalname **name? : string ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L110)optionalstats **stats? : [KeyValueStoreStats](https://crawlee.dev/js/api/types/interface/KeyValueStoreStats.md) ### [**](#userId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L104)optionaluserId **userId? : string --- # KeyValueStoreItemData ## Index[**](#Index) ### Properties * [**key](#key) * [**size](#size) ## Properties[**](#Properties) ### [**](#key)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L144)key **key: string ### [**](#size)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L145)size **size: number --- # KeyValueStoreRecord ## Index[**](#Index) ### Properties * [**contentType](#contentType) * [**key](#key) * [**value](#value) ## Properties[**](#Properties) ### [**](#contentType)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L124)optionalcontentType **contentType? 
: string ### [**](#key)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L122)key **key: string ### [**](#value)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L123)value **value: any --- # KeyValueStoreRecordOptions ## Index[**](#Index) ### Properties * [**doNotRetryTimeouts](#doNotRetryTimeouts) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#doNotRetryTimeouts)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L129)optionaldoNotRetryTimeouts **doNotRetryTimeouts? : boolean ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L128)optionaltimeoutSecs **timeoutSecs? : number --- # KeyValueStoreStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**listCount](#listCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L96)optionaldeleteCount **deleteCount? : number ### [**](#listCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L97)optionallistCount **listCount? : number ### [**](#readCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L94)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L98)optionalstorageBytes **storageBytes? : number ### [**](#writeCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L95)optionalwriteCount **writeCount? : number --- # ListAndLockHeadResult ### Hierarchy * [QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md) * *ListAndLockHeadResult* ## Index[**](#Index) ### Properties * [**hadMultipleClients](#hadMultipleClients) * [**items](#items) * [**limit](#limit) * [**lockSecs](#lockSecs) * [**queueHasLockedRequests](#queueHasLockedRequests) * [**queueModifiedAt](#queueModifiedAt) ## Properties[**](#Properties) ### [**](#hadMultipleClients)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L220)optionalinheritedhadMultipleClients **hadMultipleClients? : boolean Inherited from QueueHead.hadMultipleClients ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L221)inheriteditems **items: [RequestQueueHeadItem](https://crawlee.dev/js/api/types/interface/RequestQueueHeadItem.md)\[] Inherited from QueueHead.items ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L218)inheritedlimit **limit: number Inherited from QueueHead.limit ### [**](#lockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L236)lockSecs **lockSecs: number ### [**](#queueHasLockedRequests)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L237)optionalqueueHasLockedRequests **queueHasLockedRequests? 
: boolean ### [**](#queueModifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L219)inheritedqueueModifiedAt **queueModifiedAt: Date Inherited from QueueHead.queueModifiedAt --- # ListAndLockOptions ### Hierarchy * [ListOptions](https://crawlee.dev/js/api/types/interface/ListOptions.md) * *ListAndLockOptions* ## Index[**](#Index) ### Properties * [**limit](#limit) * [**lockSecs](#lockSecs) ## Properties[**](#Properties) ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L228)optionalinheritedlimit **limit? : number = 100 Inherited from ListOptions.limit ### [**](#lockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L232)lockSecs **lockSecs: number --- # ListOptions ### Hierarchy * *ListOptions* * [ListAndLockOptions](https://crawlee.dev/js/api/types/interface/ListAndLockOptions.md) ## Index[**](#Index) ### Properties * [**limit](#limit) ## Properties[**](#Properties) ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L228)optionallimit **limit? : number = 100 --- # PaginatedList \ ## Index[**](#Index) ### Properties * [**count](#count) * [**desc](#desc) * [**items](#items) * [**limit](#limit) * [**offset](#offset) * [**total](#total) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L34)count **count: number Count of dataset entries returned in this set. ### [**](#desc)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L40)optionaldesc **desc? : boolean Should the results be in descending order. ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L42)items **items: Data\[] Dataset entries based on chosen format parameter. ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L38)limit **limit: number Maximum number of dataset entries requested. ### [**](#offset)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L36)offset **offset: number Position of the first returned entry in the dataset. ### [**](#total)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L32)total **total: number Total count of entries in the dataset. --- # ProcessedRequest ## Index[**](#Index) ### Properties * [**requestId](#requestId) * [**uniqueKey](#uniqueKey) * [**wasAlreadyHandled](#wasAlreadyHandled) * [**wasAlreadyPresent](#wasAlreadyPresent) ## Properties[**](#Properties) ### [**](#requestId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L279)requestId **requestId: string ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L278)uniqueKey **uniqueKey: string ### [**](#wasAlreadyHandled)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L281)wasAlreadyHandled **wasAlreadyHandled: boolean ### [**](#wasAlreadyPresent)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L280)wasAlreadyPresent **wasAlreadyPresent: boolean --- # ProlongRequestLockOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) * [**lockSecs](#lockSecs) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L242)optionalforefront **forefront? 
: boolean ### [**](#lockSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L241)lockSecs **lockSecs: number --- # ProlongRequestLockResult ## Index[**](#Index) ### Properties * [**lockExpiresAt](#lockExpiresAt) ## Properties[**](#Properties) ### [**](#lockExpiresAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L246)lockExpiresAt **lockExpiresAt: Date --- # QueueHead ### Hierarchy * *QueueHead* * [ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md) ## Index[**](#Index) ### Properties * [**hadMultipleClients](#hadMultipleClients) * [**items](#items) * [**limit](#limit) * [**queueModifiedAt](#queueModifiedAt) ## Properties[**](#Properties) ### [**](#hadMultipleClients)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L220)optionalhadMultipleClients **hadMultipleClients? : boolean ### [**](#items)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L221)items **items: [RequestQueueHeadItem](https://crawlee.dev/js/api/types/interface/RequestQueueHeadItem.md)\[] ### [**](#limit)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L218)limit **limit: number ### [**](#queueModifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L219)queueModifiedAt **queueModifiedAt: Date --- # RequestOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L254)optionalforefront **forefront? : boolean --- # RequestQueueClient ## Index[**](#Index) ### Methods * [**addRequest](#addRequest) * [**batchAddRequests](#batchAddRequests) * [**delete](#delete) * [**deleteRequest](#deleteRequest) * [**deleteRequestLock](#deleteRequestLock) * [**get](#get) * [**getRequest](#getRequest) * [**listAndLockHead](#listAndLockHead) * [**listHead](#listHead) * [**prolongRequestLock](#prolongRequestLock) * [**update](#update) * [**updateRequest](#updateRequest) ## Methods[**](#Methods) ### [**](#addRequest)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L300)addRequest * ****addRequest**(request, options): Promise<[QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md)> - #### Parameters * ##### request: [RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md) * ##### optionaloptions: [RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) #### Returns Promise<[QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md)> ### [**](#batchAddRequests)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L301)batchAddRequests * ****batchAddRequests**(requests, options): Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> - #### Parameters * ##### requests: [RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md)\[] * ##### optionaloptions: [RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) #### Returns Promise<[BatchAddRequestsResult](https://crawlee.dev/js/api/types/interface/BatchAddRequestsResult.md)> ### [**](#delete)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L298)delete * ****delete**(): Promise\ - #### Returns Promise\ ### 
[**](#deleteRequest)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L304)deleteRequest * ****deleteRequest**(id): Promise\ - #### Parameters * ##### id: string #### Returns Promise\ ### [**](#deleteRequestLock)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L307)deleteRequestLock * ****deleteRequestLock**(id, options): Promise\ - #### Parameters * ##### id: string * ##### optionaloptions: [DeleteRequestLockOptions](https://crawlee.dev/js/api/types/interface/DeleteRequestLockOptions.md) #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L296)get * ****get**(): Promise\ - #### Returns Promise\ ### [**](#getRequest)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L302)getRequest * ****getRequest**(id): Promise\ - #### Parameters * ##### id: string #### Returns Promise\ ### [**](#listAndLockHead)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L305)listAndLockHead * ****listAndLockHead**(options): Promise<[ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md)> - #### Parameters * ##### options: [ListAndLockOptions](https://crawlee.dev/js/api/types/interface/ListAndLockOptions.md) #### Returns Promise<[ListAndLockHeadResult](https://crawlee.dev/js/api/types/interface/ListAndLockHeadResult.md)> ### [**](#listHead)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L299)listHead * ****listHead**(options): Promise<[QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md)> - #### Parameters * ##### optionaloptions: [ListOptions](https://crawlee.dev/js/api/types/interface/ListOptions.md) #### Returns Promise<[QueueHead](https://crawlee.dev/js/api/types/interface/QueueHead.md)> ### [**](#prolongRequestLock)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L306)prolongRequestLock * ****prolongRequestLock**(id, options): Promise<[ProlongRequestLockResult](https://crawlee.dev/js/api/types/interface/ProlongRequestLockResult.md)> - #### Parameters * ##### id: string * ##### options: [ProlongRequestLockOptions](https://crawlee.dev/js/api/types/interface/ProlongRequestLockOptions.md) #### Returns Promise<[ProlongRequestLockResult](https://crawlee.dev/js/api/types/interface/ProlongRequestLockResult.md)> ### [**](#update)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L297)update * ****update**(newFields): Promise\> - #### Parameters * ##### newFields: { name?: string } * ##### optionalname: string #### Returns Promise\> ### [**](#updateRequest)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L303)updateRequest * ****updateRequest**(request, options): Promise<[QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md)> - #### Parameters * ##### request: [UpdateRequestSchema](https://crawlee.dev/js/api/types/interface/UpdateRequestSchema.md) * ##### optionaloptions: [RequestOptions](https://crawlee.dev/js/api/types/interface/RequestOptions.md) #### Returns Promise<[QueueOperationInfo](https://crawlee.dev/js/api/core/interface/QueueOperationInfo.md)> --- # RequestQueueCollectionClient Request queue collection client. 
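As with the other storage collection clients, this interface is implemented by concrete storage clients such as `MemoryStorage`. Below is a minimal sketch, assuming the `@crawlee/memory-storage` implementation, a hypothetical queue named `crawl-queue`, and an example URL; it also exercises the request-level `RequestQueueClient` methods documented above:

```
import { MemoryStorage } from '@crawlee/memory-storage';

const storageClient = new MemoryStorage();

// Create (or reuse) a named queue via the collection client.
const { id } = await storageClient.requestQueues().getOrCreate('crawl-queue');

// Work with individual requests via the RequestQueueClient.
const queue = storageClient.requestQueue(id);
await queue.addRequest({ url: 'https://example.com', uniqueKey: 'https://example.com' });

// Fetch the queue head; listAndLockHead() additionally locks the returned requests.
const head = await queue.listHead({ limit: 10 });
console.log(head.items.map((item) => item.url));
```

In everyday crawler code the queue is usually managed through `RequestQueue.open()` and the crawler's `addRequests()` helper rather than through this client directly.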
## Index[**](#Index) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L206)getOrCreate * ****getOrCreate**(name): Promise<[RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md)> - #### Parameters * ##### name: string #### Returns Promise<[RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md)> ### [**](#list)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L205)list * ****list**(): Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md)>> - #### Returns Promise<[PaginatedList](https://crawlee.dev/js/api/types/interface/PaginatedList.md)<[RequestQueueInfo](https://crawlee.dev/js/api/types/interface/RequestQueueInfo.md)>> --- # RequestQueueHeadItem ## Index[**](#Index) ### Properties * [**id](#id) * [**method](#method) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) ## Properties[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L210)id **id: string ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L214)method **method: [AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) ### [**](#retryCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L211)retryCount **retryCount: number ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L212)uniqueKey **uniqueKey: string ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L213)url **url: string --- # RequestQueueInfo ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**createdAt](#createdAt) * [**expireAt](#expireAt) * [**hadMultipleClients](#hadMultipleClients) * [**handledRequestCount](#handledRequestCount) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**pendingRequestCount](#pendingRequestCount) * [**stats](#stats) * [**totalRequestCount](#totalRequestCount) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L190)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L195)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L196)optionalactRunId **actRunId? : string ### [**](#createdAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L188)createdAt **createdAt: Date ### [**](#expireAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L191)optionalexpireAt **expireAt? : string ### [**](#hadMultipleClients)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L197)optionalhadMultipleClients **hadMultipleClients? 
: boolean ### [**](#handledRequestCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L193)handledRequestCount **handledRequestCount: number ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L185)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L189)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L186)optionalname **name? : string ### [**](#pendingRequestCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L194)pendingRequestCount **pendingRequestCount: number ### [**](#stats)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L198)optionalstats **stats? : [RequestQueueStats](https://crawlee.dev/js/api/types/interface/RequestQueueStats.md) ### [**](#totalRequestCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L192)totalRequestCount **totalRequestCount: number ### [**](#userId)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L187)optionaluserId **userId? : string --- # RequestQueueOptions ## Index[**](#Index) ### Properties * [**clientKey](#clientKey) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#clientKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L311)optionalclientKey **clientKey? : string ### [**](#timeoutSecs)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L312)optionaltimeoutSecs **timeoutSecs? : number --- # RequestQueueStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**headItemReadCount](#headItemReadCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L179)optionaldeleteCount **deleteCount? : number ### [**](#headItemReadCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L180)optionalheadItemReadCount **headItemReadCount? : number ### [**](#readCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L177)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L181)optionalstorageBytes **storageBytes? : number ### [**](#writeCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L178)optionalwriteCount **writeCount? : number --- # RequestSchema ### Hierarchy * *RequestSchema* * [UpdateRequestSchema](https://crawlee.dev/js/api/types/interface/UpdateRequestSchema.md) ## Index[**](#Index) ### Properties * [**errorMessages](#errorMessages) * [**handledAt](#handledAt) * [**headers](#headers) * [**id](#id) * [**loadedUrl](#loadedUrl) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#errorMessages)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L266)optionalerrorMessages **errorMessages? : string\[] ### [**](#handledAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L269)optionalhandledAt **handledAt? 
: string ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L267)optionalheaders **headers? : Dictionary\ ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L259)optionalid **id? : string ### [**](#loadedUrl)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L270)optionalloadedUrl **loadedUrl? : string ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L262)optionalmethod **method? : [AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) ### [**](#noRetry)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L264)optionalnoRetry **noRetry? : boolean ### [**](#payload)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L263)optionalpayload **payload? : string ### [**](#retryCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L265)optionalretryCount **retryCount? : number ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L261)uniqueKey **uniqueKey: string ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L260)url **url: string ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L268)optionaluserData **userData? : Dictionary --- # SetStatusMessageOptions ## Index[**](#Index) ### Properties * [**isStatusMessageTerminal](#isStatusMessageTerminal) * [**level](#level) ## Properties[**](#Properties) ### [**](#isStatusMessageTerminal)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L316)optionalisStatusMessageTerminal **isStatusMessageTerminal? : boolean ### [**](#level)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L317)optionallevel **level? : DEBUG | INFO | WARNING | ERROR --- # UnprocessedRequest ## Index[**](#Index) ### Properties * [**method](#method) * [**uniqueKey](#uniqueKey) * [**url](#url) ## Properties[**](#Properties) ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L287)optionalmethod **method? : [AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L285)uniqueKey **uniqueKey: string ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L286)url **url: string --- # UpdateRequestSchema ### Hierarchy * [RequestSchema](https://crawlee.dev/js/api/types/interface/RequestSchema.md) * *UpdateRequestSchema* ## Index[**](#Index) ### Properties * [**errorMessages](#errorMessages) * [**handledAt](#handledAt) * [**headers](#headers) * [**id](#id) * [**loadedUrl](#loadedUrl) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#errorMessages)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L266)optionalinheritederrorMessages **errorMessages? : string\[] Inherited from RequestSchema.errorMessages ### [**](#handledAt)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L269)optionalinheritedhandledAt **handledAt? 
: string Inherited from RequestSchema.handledAt ### [**](#headers)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L267)optionalinheritedheaders **headers? : Dictionary\ Inherited from RequestSchema.headers ### [**](#id)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L274)id **id: string Overrides RequestSchema.id ### [**](#loadedUrl)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L270)optionalinheritedloadedUrl **loadedUrl? : string Inherited from RequestSchema.loadedUrl ### [**](#method)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L262)optionalinheritedmethod **method? : [AllowedHttpMethods](https://crawlee.dev/js/api/types.md#AllowedHttpMethods) Inherited from RequestSchema.method ### [**](#noRetry)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L264)optionalinheritednoRetry **noRetry? : boolean Inherited from RequestSchema.noRetry ### [**](#payload)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L263)optionalinheritedpayload **payload? : string Inherited from RequestSchema.payload ### [**](#retryCount)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L265)optionalinheritedretryCount **retryCount? : number Inherited from RequestSchema.retryCount ### [**](#uniqueKey)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L261)inheriteduniqueKey **uniqueKey: string Inherited from RequestSchema.uniqueKey ### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L260)inheritedurl **url: string Inherited from RequestSchema.url ### [**](#userData)[**](https://github.com/apify/crawlee/blob/master/packages/types/src/storages.ts#L268)optionalinheriteduserData **userData? 
: Dictionary Inherited from RequestSchema.userData --- # @crawlee/utils ## Index[**](#Index) ### References * [**RobotsFile](https://crawlee.dev/js/api/utils.md#RobotsFile) * [**tryAbsoluteURL](https://crawlee.dev/js/api/utils.md#tryAbsoluteURL) ### Namespaces * [**social](https://crawlee.dev/js/api/utils/namespace/social.md) ### Classes * [**RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) * [**Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md) ### Interfaces * [**DownloadListOfUrlsOptions](https://crawlee.dev/js/api/utils/interface/DownloadListOfUrlsOptions.md) * [**ExtractUrlsOptions](https://crawlee.dev/js/api/utils/interface/ExtractUrlsOptions.md) * [**MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md) * [**OpenGraphProperty](https://crawlee.dev/js/api/utils/interface/OpenGraphProperty.md) * [**ParseSitemapOptions](https://crawlee.dev/js/api/utils/interface/ParseSitemapOptions.md) ### Type Aliases * [**CheerioRoot](https://crawlee.dev/js/api/utils.md#CheerioRoot) * [**SearchParams](https://crawlee.dev/js/api/utils.md#SearchParams) * [**SitemapUrl](https://crawlee.dev/js/api/utils.md#SitemapUrl) ### Variables * [**CLOUDFLARE\_RETRY\_CSS\_SELECTORS](https://crawlee.dev/js/api/utils.md#CLOUDFLARE_RETRY_CSS_SELECTORS) * [**RETRY\_CSS\_SELECTORS](https://crawlee.dev/js/api/utils.md#RETRY_CSS_SELECTORS) * [**ROTATE\_PROXY\_ERRORS](https://crawlee.dev/js/api/utils.md#ROTATE_PROXY_ERRORS) * [**URL\_NO\_COMMAS\_REGEX](https://crawlee.dev/js/api/utils.md#URL_NO_COMMAS_REGEX) * [**URL\_WITH\_COMMAS\_REGEX](https://crawlee.dev/js/api/utils.md#URL_WITH_COMMAS_REGEX) ### Functions * [**chunk](https://crawlee.dev/js/api/utils/function/chunk.md) * [**createRequestDebugInfo](https://crawlee.dev/js/api/utils/function/createRequestDebugInfo.md) * [**downloadListOfUrls](https://crawlee.dev/js/api/utils/function/downloadListOfUrls.md) * [**extractUrls](https://crawlee.dev/js/api/utils/function/extractUrls.md) * [**extractUrlsFromCheerio](https://crawlee.dev/js/api/utils/function/extractUrlsFromCheerio.md) * [**getCgroupsVersion](https://crawlee.dev/js/api/utils/function/getCgroupsVersion.md) * [**getMemoryInfo](https://crawlee.dev/js/api/utils/function/getMemoryInfo.md) * [**getObjectType](https://crawlee.dev/js/api/utils/function/getObjectType.md) * [**gotScraping](https://crawlee.dev/js/api/utils/function/gotScraping.md) * [**htmlToText](https://crawlee.dev/js/api/utils/function/htmlToText.md) * [**isContainerized](https://crawlee.dev/js/api/utils/function/isContainerized.md) * [**isDocker](https://crawlee.dev/js/api/utils/function/isDocker.md) * [**isLambda](https://crawlee.dev/js/api/utils/function/isLambda.md) * [**parseOpenGraph](https://crawlee.dev/js/api/utils/function/parseOpenGraph.md) * [**parseSitemap](https://crawlee.dev/js/api/utils/function/parseSitemap.md) * [**sleep](https://crawlee.dev/js/api/utils/function/sleep.md) ## References[**](#References) ### [**](#RobotsFile)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L122)RobotsFile Renames and re-exports [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) ### [**](#tryAbsoluteURL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L96)tryAbsoluteURL Re-exports [tryAbsoluteURL](https://crawlee.dev/js/api/core/function/tryAbsoluteURL.md) ## Type Aliases[**](<#Type Aliases>) ### 
[**](#CheerioRoot)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/cheerio.ts#L7)CheerioRoot **CheerioRoot: ReturnType\ ### [**](#SearchParams)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/url.ts#L1)SearchParams **SearchParams: string | URLSearchParams | Record\ ### [**](#SitemapUrl)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L21)SitemapUrl **SitemapUrl: SitemapUrlData & { originSitemapUrl: string } ## Variables[**](#Variables) ### [**](#CLOUDFLARE_RETRY_CSS_SELECTORS)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/blocked.ts#L1)constCLOUDFLARE\_RETRY\_CSS\_SELECTORS **CLOUDFLARE\_RETRY\_CSS\_SELECTORS: string\[] = ... ### [**](#RETRY_CSS_SELECTORS)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/blocked.ts#L6)constRETRY\_CSS\_SELECTORS **RETRY\_CSS\_SELECTORS: string\[] = ... CSS selectors for elements that should trigger a retry, as the crawler is likely getting blocked. ### [**](#ROTATE_PROXY_ERRORS)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/blocked.ts#L15)constROTATE\_PROXY\_ERRORS **ROTATE\_PROXY\_ERRORS: string\[] = ... Content of proxy errors that should trigger a retry, as the proxy is likely getting blocked / is malfunctioning. ### [**](#URL_NO_COMMAS_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/general.ts#L8)constURL\_NO\_COMMAS\_REGEX **URL\_NO\_COMMAS\_REGEX: RegExp = ... Default regular expression to match URLs in a string that may be plain text, JSON, CSV or other. It supports common URL characters and does not support URLs containing commas or spaces. The URLs also may contain Unicode letters (not symbols). ### [**](#URL_WITH_COMMAS_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/general.ts#L15)constURL\_WITH\_COMMAS\_REGEX **URL\_WITH\_COMMAS\_REGEX: RegExp = ... Regular expression that, in addition to the default regular expression `URL_NO_COMMAS_REGEX`, supports matching commas in URL path and query. Note, however, that this may prevent parsing URLs from comma delimited lists, or the URLs may become malformed. --- # Changelog All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines. 
## [3.15.3](https://github.com/apify/crawlee/compare/v3.15.2...v3.15.3) (2025-11-10)[​](#3153-2025-11-10 "Direct link to 3153-2025-11-10") **Note:** Version bump only for package @crawlee/utils ## [3.15.2](https://github.com/apify/crawlee/compare/v3.15.1...v3.15.2) (2025-10-23)[​](#3152-2025-10-23 "Direct link to 3152-2025-10-23") ### Features[​](#features "Direct link to Features") * export cheerio types in all crawler packages ([#3204](https://github.com/apify/crawlee/issues/3204)) ([f05790b](https://github.com/apify/crawlee/commit/f05790b8c4e77056fd3cdbdd6d6abe3186ddf104)) ## [3.15.1](https://github.com/apify/crawlee/compare/v3.15.0...v3.15.1) (2025-09-26)[​](#3151-2025-09-26 "Direct link to 3151-2025-09-26") **Note:** Version bump only for package @crawlee/utils # [3.15.0](https://github.com/apify/crawlee/compare/v3.14.1...v3.15.0) (2025-09-17) **Note:** Version bump only for package @crawlee/utils ## [3.14.1](https://github.com/apify/crawlee/compare/v3.14.0...v3.14.1) (2025-08-05)[​](#3141-2025-08-05 "Direct link to 3141-2025-08-05") **Note:** Version bump only for package @crawlee/utils # [3.14.0](https://github.com/apify/crawlee/compare/v3.13.10...v3.14.0) (2025-07-25) ### Bug Fixes[​](#bug-fixes "Direct link to Bug Fixes") * validation of iterables when adding requests to the queue ([#3091](https://github.com/apify/crawlee/issues/3091)) ([529a1dd](https://github.com/apify/crawlee/commit/529a1dd57278efef4fb2013e79a09fd1bc8594a5)), closes [#3063](https://github.com/apify/crawlee/issues/3063) ## [3.13.10](https://github.com/apify/crawlee/compare/v3.13.9...v3.13.10) (2025-07-09)[​](#31310-2025-07-09 "Direct link to 31310-2025-07-09") **Note:** Version bump only for package @crawlee/utils ## [3.13.9](https://github.com/apify/crawlee/compare/v3.13.8...v3.13.9) (2025-06-27)[​](#3139-2025-06-27 "Direct link to 3139-2025-06-27") ### Bug Fixes[​](#bug-fixes-1 "Direct link to Bug Fixes") * Do not log 'malformed sitemap content' on network errors in `Sitemap.tryCommonNames` ([#3015](https://github.com/apify/crawlee/issues/3015)) ([64a090f](https://github.com/apify/crawlee/commit/64a090ffbba5c69730ec0616e415a1eadf4bc7b3)), closes [#2884](https://github.com/apify/crawlee/issues/2884) ### Features[​](#features-1 "Direct link to Features") * Accept (Async)Iterables in `addRequests` methods ([#3013](https://github.com/apify/crawlee/issues/3013)) ([a4ab748](https://github.com/apify/crawlee/commit/a4ab74852c3c60bdbc96035f54b16d125220f699)), closes [#2980](https://github.com/apify/crawlee/issues/2980) ## [3.13.8](https://github.com/apify/crawlee/compare/v3.13.7...v3.13.8) (2025-06-16)[​](#3138-2025-06-16 "Direct link to 3138-2025-06-16") ### Bug Fixes[​](#bug-fixes-2 "Direct link to Bug Fixes") * Persist rendering type detection results in `AdaptivePlaywrightCrawler` ([#2987](https://github.com/apify/crawlee/issues/2987)) ([76431ba](https://github.com/apify/crawlee/commit/76431badf8a55892303d9b53fe23e029fad9cb18)), closes [#2899](https://github.com/apify/crawlee/issues/2899) ## [3.13.7](https://github.com/apify/crawlee/compare/v3.13.6...v3.13.7) (2025-06-06)[​](#3137-2025-06-06 "Direct link to 3137-2025-06-06") **Note:** Version bump only for package @crawlee/utils ## [3.13.6](https://github.com/apify/crawlee/compare/v3.13.5...v3.13.6) (2025-06-05)[​](#3136-2025-06-05 "Direct link to 3136-2025-06-05") **Note:** Version bump only for package @crawlee/utils ## [3.13.5](https://github.com/apify/crawlee/compare/v3.13.4...v3.13.5) (2025-05-20)[​](#3135-2025-05-20 "Direct link to 3135-2025-05-20") **Note:** 
Version bump only for package @crawlee/utils ## [3.13.4](https://github.com/apify/crawlee/compare/v3.13.3...v3.13.4) (2025-05-14)[​](#3134-2025-05-14 "Direct link to 3134-2025-05-14") ### Bug Fixes[​](#bug-fixes-3 "Direct link to Bug Fixes") * **social:** extract emails from each text node separately ([#2952](https://github.com/apify/crawlee/issues/2952)) ([799afc1](https://github.com/apify/crawlee/commit/799afc1dbb6843efa9d585823674ea75b9b352ea)) ## [3.13.3](https://github.com/apify/crawlee/compare/v3.13.2...v3.13.3) (2025-05-05)[​](#3133-2025-05-05 "Direct link to 3133-2025-05-05") **Note:** Version bump only for package @crawlee/utils ## [3.13.2](https://github.com/apify/crawlee/compare/v3.13.1...v3.13.2) (2025-04-08)[​](#3132-2025-04-08 "Direct link to 3132-2025-04-08") **Note:** Version bump only for package @crawlee/utils ## [3.13.1](https://github.com/apify/crawlee/compare/v3.13.0...v3.13.1) (2025-04-07)[​](#3131-2025-04-07 "Direct link to 3131-2025-04-07") ### Bug Fixes[​](#bug-fixes-4 "Direct link to Bug Fixes") * rename `RobotsFile` to `RobotsTxtFile` ([#2913](https://github.com/apify/crawlee/issues/2913)) ([3160f71](https://github.com/apify/crawlee/commit/3160f717e865326476d78089d778cbc7d35aa58d)), closes [#2910](https://github.com/apify/crawlee/issues/2910) ### Features[​](#features-2 "Direct link to Features") * add `respectRobotsTxtFile` crawler option ([#2910](https://github.com/apify/crawlee/issues/2910)) ([0eabed1](https://github.com/apify/crawlee/commit/0eabed1f13070d902c2c67b340621830a7f64464)) # [3.13.0](https://github.com/apify/crawlee/compare/v3.12.2...v3.13.0) (2025-03-04) ### Features[​](#features-3 "Direct link to Features") * improved cross platform metric collection ([#2834](https://github.com/apify/crawlee/issues/2834)) ([e41b2f7](https://github.com/apify/crawlee/commit/e41b2f744513dd80aa05336eedfa1c08c54d3832)), closes [#2771](https://github.com/apify/crawlee/issues/2771) ## [3.12.2](https://github.com/apify/crawlee/compare/v3.12.1...v3.12.2) (2025-01-27)[​](#3122-2025-01-27 "Direct link to 3122-2025-01-27") **Note:** Version bump only for package @crawlee/utils ## [3.12.1](https://github.com/apify/crawlee/compare/v3.12.0...v3.12.1) (2024-12-04)[​](#3121-2024-12-04 "Direct link to 3121-2024-12-04") ### Bug Fixes[​](#bug-fixes-5 "Direct link to Bug Fixes") * **social:** support new URL formats for Facebook, YouTube and X ([#2758](https://github.com/apify/crawlee/issues/2758)) ([4c95847](https://github.com/apify/crawlee/commit/4c95847d5cedd6514620ccab31d5b242ba76de80)), closes [#525](https://github.com/apify/crawlee/issues/525) # [3.12.0](https://github.com/apify/crawlee/compare/v3.11.5...v3.12.0) (2024-11-04) ### Bug Fixes[​](#bug-fixes-6 "Direct link to Bug Fixes") * `.trim()` urls from pretty-printed sitemap.xml files ([#2709](https://github.com/apify/crawlee/issues/2709)) ([802a6fe](https://github.com/apify/crawlee/commit/802a6fea7b2125e2b36d740fc2d5d131de5d53ed)), closes [#2698](https://github.com/apify/crawlee/issues/2698) ### Features[​](#features-4 "Direct link to Features") * allow using other HTTP clients ([#2661](https://github.com/apify/crawlee/issues/2661)) ([568c655](https://github.com/apify/crawlee/commit/568c6556d79ce91654c8a715d1d1729d7d6ed8ef)), closes [#2659](https://github.com/apify/crawlee/issues/2659) ## [3.11.5](https://github.com/apify/crawlee/compare/v3.11.4...v3.11.5) (2024-10-04)[​](#3115-2024-10-04 "Direct link to 3115-2024-10-04") **Note:** Version bump only for package @crawlee/utils ## 
[3.11.4](https://github.com/apify/crawlee/compare/v3.11.3...v3.11.4) (2024-09-23)[​](#3114-2024-09-23 "Direct link to 3114-2024-09-23") ### Bug Fixes[​](#bug-fixes-7 "Direct link to Bug Fixes") * `SitemapRequestList.teardown()` doesn't break `persistState` calls ([#2673](https://github.com/apify/crawlee/issues/2673)) ([fb2c5cd](https://github.com/apify/crawlee/commit/fb2c5cdaa47e2d3a91ade726cfba3091917a0137)), closes [/github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap\_request\_list.ts#L446](https://github.com//github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap_request_list.ts/issues/L446) [#2672](https://github.com/apify/crawlee/issues/2672) ## [3.11.3](https://github.com/apify/crawlee/compare/v3.11.2...v3.11.3) (2024-09-03)[​](#3113-2024-09-03 "Direct link to 3113-2024-09-03") ### Bug Fixes[​](#bug-fixes-8 "Direct link to Bug Fixes") * improve `FACEBOOK_REGEX` to match older style page URLs ([#2650](https://github.com/apify/crawlee/issues/2650)) ([a005e69](https://github.com/apify/crawlee/commit/a005e699682cbf4bb2e48ff92cf2bbf3e0d2be26)), closes [#2216](https://github.com/apify/crawlee/issues/2216) ## [3.11.2](https://github.com/apify/crawlee/compare/v3.11.1...v3.11.2) (2024-08-28)[​](#3112-2024-08-28 "Direct link to 3112-2024-08-28") ### Bug Fixes[​](#bug-fixes-9 "Direct link to Bug Fixes") * use namespace imports for cheerio to be compatible with v1 ([#2641](https://github.com/apify/crawlee/issues/2641)) ([f48296f](https://github.com/apify/crawlee/commit/f48296f6cba7b81fe102d4b874505c27f93d9fc1)) ### Features[​](#features-5 "Direct link to Features") * resilient sitemap loading ([#2619](https://github.com/apify/crawlee/issues/2619)) ([1dd7660](https://github.com/apify/crawlee/commit/1dd76601e03de4541964116b3a77376e233ea22b)) ## [3.11.1](https://github.com/apify/crawlee/compare/v3.11.0...v3.11.1) (2024-07-24)[​](#3111-2024-07-24 "Direct link to 3111-2024-07-24") ### Bug Fixes[​](#bug-fixes-10 "Direct link to Bug Fixes") * use `getHTML` in the shadow root expansion ([#2587](https://github.com/apify/crawlee/issues/2587)) ([a244d62](https://github.com/apify/crawlee/commit/a244d62cca03d628677eca8a5adcf41e33c51dee)), closes [#2583](https://github.com/apify/crawlee/issues/2583) # [3.11.0](https://github.com/apify/crawlee/compare/v3.10.5...v3.11.0) (2024-07-09) ### Features[​](#features-6 "Direct link to Features") * Sitemap-based request list implementation ([#2498](https://github.com/apify/crawlee/issues/2498)) ([7bf8f0b](https://github.com/apify/crawlee/commit/7bf8f0bcd4cc81e02c7cc60e82dfe7a0cdd80938)) ## [3.10.5](https://github.com/apify/crawlee/compare/v3.10.4...v3.10.5) (2024-06-12)[​](#3105-2024-06-12 "Direct link to 3105-2024-06-12") **Note:** Version bump only for package @crawlee/utils ## [3.10.4](https://github.com/apify/crawlee/compare/v3.10.3...v3.10.4) (2024-06-11)[​](#3104-2024-06-11 "Direct link to 3104-2024-06-11") **Note:** Version bump only for package @crawlee/utils ## [3.10.3](https://github.com/apify/crawlee/compare/v3.10.2...v3.10.3) (2024-06-07)[​](#3103-2024-06-07 "Direct link to 3103-2024-06-07") ### Bug Fixes[​](#bug-fixes-11 "Direct link to Bug Fixes") * respect implicit router when no `requestHandler` is provided in `AdaptiveCrawler` ([#2518](https://github.com/apify/crawlee/issues/2518)) ([31083aa](https://github.com/apify/crawlee/commit/31083aa27ddd51827f73c7ac4290379ec7a81283)) ## [3.10.2](https://github.com/apify/crawlee/compare/v3.10.1...v3.10.2) 
(2024-06-03)[​](#3102-2024-06-03 "Direct link to 3102-2024-06-03") ### Bug Fixes[​](#bug-fixes-12 "Direct link to Bug Fixes") * Autodetect sitemap filetype from content ([#2497](https://github.com/apify/crawlee/issues/2497)) ([62a9f40](https://github.com/apify/crawlee/commit/62a9f4036dba92d07547af489ac8b6c7974faa6f)), closes [#2461](https://github.com/apify/crawlee/issues/2461) ### Features[​](#features-7 "Direct link to Features") * Loading sitemaps from string ([#2496](https://github.com/apify/crawlee/issues/2496)) ([38ed0d6](https://github.com/apify/crawlee/commit/38ed0d6ad90a868df9c02632334fec8db9ef29a0)), closes [#2460](https://github.com/apify/crawlee/issues/2460) ## [3.10.1](https://github.com/apify/crawlee/compare/v3.10.0...v3.10.1) (2024-05-23)[​](#3101-2024-05-23 "Direct link to 3101-2024-05-23") ### Bug Fixes[​](#bug-fixes-13 "Direct link to Bug Fixes") * adjust `URL_NO_COMMAS_REGEX` regexp to allow single character hostnames ([#2492](https://github.com/apify/crawlee/issues/2492)) ([ec802e8](https://github.com/apify/crawlee/commit/ec802e85f54022616e5bdcc1a6fd1bd43e1b3ace)), closes [#2487](https://github.com/apify/crawlee/issues/2487) # [3.10.0](https://github.com/apify/crawlee/compare/v3.9.2...v3.10.0) (2024-05-16) ### Bug Fixes[​](#bug-fixes-14 "Direct link to Bug Fixes") * malformed sitemap url when sitemap index child contains querystring ([#2430](https://github.com/apify/crawlee/issues/2430)) ([e4cd41c](https://github.com/apify/crawlee/commit/e4cd41c49999af270fbe2476a61d92c8e3502463)) * return true when robots.isAllowed returns undefined ([#2439](https://github.com/apify/crawlee/issues/2439)) ([6f541f8](https://github.com/apify/crawlee/commit/6f541f8c4ea9b1e94eb506383019397676fd79fe)), closes [#2437](https://github.com/apify/crawlee/issues/2437) * sitemap `content-type` check breaks on `content-type` parameters ([#2442](https://github.com/apify/crawlee/issues/2442)) ([db7d372](https://github.com/apify/crawlee/commit/db7d37256a49820e3e584165fff42377042ec258)) ### Features[​](#features-8 "Direct link to Features") * implement ErrorSnapshotter for error context capture ([#2332](https://github.com/apify/crawlee/issues/2332)) ([e861dfd](https://github.com/apify/crawlee/commit/e861dfdb451ae32fb1e0c7749c6b59744654b303)), closes [#2280](https://github.com/apify/crawlee/issues/2280) ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)[​](#392-2024-04-17 "Direct link to 392-2024-04-17") ### Features[​](#features-9 "Direct link to Features") * **sitemap:** Support CDATA in sitemaps ([#2424](https://github.com/apify/crawlee/issues/2424)) ([635f046](https://github.com/apify/crawlee/commit/635f046b7933e0ad1b0ee627a22a9adaf21847d3)) ## [3.9.1](https://github.com/apify/crawlee/compare/v3.9.0...v3.9.1) (2024-04-11)[​](#391-2024-04-11 "Direct link to 391-2024-04-11") **Note:** Version bump only for package @crawlee/utils # [3.9.0](https://github.com/apify/crawlee/compare/v3.8.2...v3.9.0) (2024-04-10) ### Bug Fixes[​](#bug-fixes-15 "Direct link to Bug Fixes") * sitemaps support `application/xml` ([#2408](https://github.com/apify/crawlee/issues/2408)) ([cbcf47a](https://github.com/apify/crawlee/commit/cbcf47a7b991a8b88a6c2a46f3684444d776fcdd)) ### Features[​](#features-10 "Direct link to Features") * expand #shadow-root elements automatically in `parseWithCheerio` helper ([#2396](https://github.com/apify/crawlee/issues/2396)) ([a05b3a9](https://github.com/apify/crawlee/commit/a05b3a93a9b57926b353df0e79d846b5024c42ac)) ## 
[3.8.2](https://github.com/apify/crawlee/compare/v3.8.1...v3.8.2) (2024-03-21)[​](#382-2024-03-21 "Direct link to 382-2024-03-21") ### Bug Fixes[​](#bug-fixes-16 "Direct link to Bug Fixes") * correctly report gzip decompression errors ([#2368](https://github.com/apify/crawlee/issues/2368)) ([84a2f17](https://github.com/apify/crawlee/commit/84a2f1733033bf247b2cede3f1728e75bf2c8ff9)) ## [3.8.1](https://github.com/apify/crawlee/compare/v3.8.0...v3.8.1) (2024-02-22)[​](#381-2024-02-22 "Direct link to 381-2024-02-22") **Note:** Version bump only for package @crawlee/utils # [3.8.0](https://github.com/apify/crawlee/compare/v3.7.3...v3.8.0) (2024-02-21) ### Features[​](#features-11 "Direct link to Features") * add Sitemap.tryCommonNames to check well known sitemap locations ([#2311](https://github.com/apify/crawlee/issues/2311)) ([85589f1](https://github.com/apify/crawlee/commit/85589f167196ac49c0cc10664ab3e9e5595208ed)), closes [#2307](https://github.com/apify/crawlee/issues/2307) * **core:** add `userAgent` parameter to `RobotsFile.isAllowed()` + `RobotsFile.from()` helper ([#2338](https://github.com/apify/crawlee/issues/2338)) ([343c159](https://github.com/apify/crawlee/commit/343c159f20546a2006db33da4674e6ffd77db572)) * Support plain-text sitemap files (sitemap.txt) ([#2315](https://github.com/apify/crawlee/issues/2315)) ([0bee7da](https://github.com/apify/crawlee/commit/0bee7daf9509fe61c8d83799e706f0bb030257ec)) ## [3.7.3](https://github.com/apify/crawlee/compare/v3.7.2...v3.7.3) (2024-01-30)[​](#373-2024-01-30 "Direct link to 373-2024-01-30") ### Bug Fixes[​](#bug-fixes-17 "Direct link to Bug Fixes") * pass on an invisible CF turnstile ([#2277](https://github.com/apify/crawlee/issues/2277)) ([d8734e7](https://github.com/apify/crawlee/commit/d8734e765238115d9cba6dda9c649ad8573890d8)), closes [#2256](https://github.com/apify/crawlee/issues/2256) ## [3.7.2](https://github.com/apify/crawlee/compare/v3.7.1...v3.7.2) (2024-01-09)[​](#372-2024-01-09 "Direct link to 372-2024-01-09") **Note:** Version bump only for package @crawlee/utils ## [3.7.1](https://github.com/apify/crawlee/compare/v3.7.0...v3.7.1) (2024-01-02)[​](#371-2024-01-02 "Direct link to 371-2024-01-02") ### Bug Fixes[​](#bug-fixes-18 "Direct link to Bug Fixes") * ES2022 build compatibility and move to NodeNext for module ([#2258](https://github.com/apify/crawlee/issues/2258)) ([7fe1e68](https://github.com/apify/crawlee/commit/7fe1e685904660c8446aafdf739fd1212684b48c)), closes [#2257](https://github.com/apify/crawlee/issues/2257) # [3.7.0](https://github.com/apify/crawlee/compare/v3.6.2...v3.7.0) (2023-12-21) ### Bug Fixes[​](#bug-fixes-19 "Direct link to Bug Fixes") * `retryOnBlocked` doesn't override the blocked HTTP codes ([#2243](https://github.com/apify/crawlee/issues/2243)) ([81672c3](https://github.com/apify/crawlee/commit/81672c3d1db1dcdcffb868de5740addff82cf112)) ### Features[​](#features-12 "Direct link to Features") * robots.txt and sitemap.xml utils ([#2214](https://github.com/apify/crawlee/issues/2214)) ([fdfec4f](https://github.com/apify/crawlee/commit/fdfec4f4d0a0f925b49015d2d63932c4a82555ba)), closes [#2187](https://github.com/apify/crawlee/issues/2187) ## [3.6.2](https://github.com/apify/crawlee/compare/v3.6.1...v3.6.2) (2023-11-26)[​](#362-2023-11-26 "Direct link to 362-2023-11-26") **Note:** Version bump only for package @crawlee/utils ## [3.6.1](https://github.com/apify/crawlee/compare/v3.6.0...v3.6.1) (2023-11-15)[​](#361-2023-11-15 "Direct link to 361-2023-11-15") **Note:** Version bump only for package 
@crawlee/utils # [3.6.0](https://github.com/apify/crawlee/compare/v3.5.8...v3.6.0) (2023-11-15) ### Features[​](#features-13 "Direct link to Features") * got-scraping v4 ([#2110](https://github.com/apify/crawlee/issues/2110)) ([2f05ed2](https://github.com/apify/crawlee/commit/2f05ed22b203f688095300400bb0e6d03a03283c)) ## [3.5.8](https://github.com/apify/crawlee/compare/v3.5.7...v3.5.8) (2023-10-17)[​](#358-2023-10-17 "Direct link to 358-2023-10-17") ### Bug Fixes[​](#bug-fixes-20 "Direct link to Bug Fixes") * refactor `extractUrls` to split the text line by line first ([#2122](https://github.com/apify/crawlee/issues/2122)) ([7265cd7](https://github.com/apify/crawlee/commit/7265cd7148bb4889d60434d671f153387fb5a4dd)) ## [3.5.7](https://github.com/apify/crawlee/compare/v3.5.6...v3.5.7) (2023-10-05)[​](#357-2023-10-05 "Direct link to 357-2023-10-05") **Note:** Version bump only for package @crawlee/utils ## [3.5.6](https://github.com/apify/crawlee/compare/v3.5.5...v3.5.6) (2023-10-04)[​](#356-2023-10-04 "Direct link to 356-2023-10-04") ### Features[​](#features-14 "Direct link to Features") * add incapsula iframe selector to the blocked list ([#2111](https://github.com/apify/crawlee/issues/2111)) ([2b17d8a](https://github.com/apify/crawlee/commit/2b17d8a797dec2824a0063792aa7bd3fce8dccae)), closes [apify/store-website-content-crawler#154](https://github.com/apify/store-website-content-crawler/issues/154) ## [3.5.5](https://github.com/apify/crawlee/compare/v3.5.4...v3.5.5) (2023-10-02)[​](#355-2023-10-02 "Direct link to 355-2023-10-02") **Note:** Version bump only for package @crawlee/utils ## [3.5.4](https://github.com/apify/crawlee/compare/v3.5.3...v3.5.4) (2023-09-11)[​](#354-2023-09-11 "Direct link to 354-2023-09-11") **Note:** Version bump only for package @crawlee/utils ## [3.5.3](https://github.com/apify/crawlee/compare/v3.5.2...v3.5.3) (2023-08-31)[​](#353-2023-08-31 "Direct link to 353-2023-08-31") ### Bug Fixes[​](#bug-fixes-21 "Direct link to Bug Fixes") * pin all internal dependencies ([#2041](https://github.com/apify/crawlee/issues/2041)) ([d6f2b17](https://github.com/apify/crawlee/commit/d6f2b172d4a6776137c7893ca798d5b4a9408e79)), closes [#2040](https://github.com/apify/crawlee/issues/2040) ## [3.5.2](https://github.com/apify/crawlee/compare/v3.5.1...v3.5.2) (2023-08-21)[​](#352-2023-08-21 "Direct link to 352-2023-08-21") **Note:** Version bump only for package @crawlee/utils ## [3.5.1](https://github.com/apify/crawlee/compare/v3.5.0...v3.5.1) (2023-08-16)[​](#351-2023-08-16 "Direct link to 351-2023-08-16") **Note:** Version bump only for package @crawlee/utils # [3.5.0](https://github.com/apify/crawlee/compare/v3.4.2...v3.5.0) (2023-07-31) ### Features[​](#features-15 "Direct link to Features") * retire session on proxy error ([#2002](https://github.com/apify/crawlee/issues/2002)) ([8c0928b](https://github.com/apify/crawlee/commit/8c0928b24ceabefc454f8114ac30a27023709010)), closes [#1912](https://github.com/apify/crawlee/issues/1912) ## [3.4.2](https://github.com/apify/crawlee/compare/v3.4.1...v3.4.2) (2023-07-19)[​](#342-2023-07-19 "Direct link to 342-2023-07-19") ### Features[​](#features-16 "Direct link to Features") * retryOnBlocked detects blocked webpage ([#1956](https://github.com/apify/crawlee/issues/1956)) ([766fa9b](https://github.com/apify/crawlee/commit/766fa9b88029e9243a7427075384c1abe85c70c8)) ## [3.4.1](https://github.com/apify/crawlee/compare/v3.4.0...v3.4.1) (2023-07-13)[​](#341-2023-07-13 "Direct link to 341-2023-07-13") **Note:** Version bump only for package 
@crawlee/utils # [3.4.0](https://github.com/apify/crawlee/compare/v3.3.3...v3.4.0) (2023-06-12) **Note:** Version bump only for package @crawlee/utils ## [3.3.3](https://github.com/apify/crawlee/compare/v3.3.2...v3.3.3) (2023-05-31)[​](#333-2023-05-31 "Direct link to 333-2023-05-31") **Note:** Version bump only for package @crawlee/utils ## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)[​](#332-2023-05-11 "Direct link to 332-2023-05-11") **Note:** Version bump only for package @crawlee/utils ## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)[​](#331-2023-04-11 "Direct link to 331-2023-04-11") ### Bug Fixes[​](#bug-fixes-22 "Direct link to Bug Fixes") * **jsdom:** delay closing of the window and add some polyfills ([2e81618](https://github.com/apify/crawlee/commit/2e81618afb5f3890495e3e5fcfa037eb3319edc9)) # [3.3.0](https://github.com/apify/crawlee/compare/v3.2.2...v3.3.0) (2023-03-09) ### Bug Fixes[​](#bug-fixes-23 "Direct link to Bug Fixes") * add `proxyUrl` to `DownloadListOfUrlsOptions` ([779be1e](https://github.com/apify/crawlee/commit/779be1e4f29dff191d02e623eefb1bd5650c14ad)), closes [#1780](https://github.com/apify/crawlee/issues/1780) ## [3.2.2](https://github.com/apify/crawlee/compare/v3.2.1...v3.2.2) (2023-02-08)[​](#322-2023-02-08 "Direct link to 322-2023-02-08") **Note:** Version bump only for package @crawlee/utils ## [3.2.1](https://github.com/apify/crawlee/compare/v3.2.0...v3.2.1) (2023-02-07)[​](#321-2023-02-07 "Direct link to 321-2023-02-07") **Note:** Version bump only for package @crawlee/utils # [3.2.0](https://github.com/apify/crawlee/compare/v3.1.4...v3.2.0) (2023-02-07) ### Bug Fixes[​](#bug-fixes-24 "Direct link to Bug Fixes") * **utils:** add missing dependency on `ow` ([bf0e03c](https://github.com/apify/crawlee/commit/bf0e03cc6ddc103c9337de5cd8dce9bc86c369a3)), closes [#1716](https://github.com/apify/crawlee/issues/1716) ## 3.1.2 (2022-11-15)[​](#312-2022-11-15 "Direct link to 3.1.2 (2022-11-15)") **Note:** Version bump only for package @crawlee/utils ## 3.1.1 (2022-11-07)[​](#311-2022-11-07 "Direct link to 3.1.1 (2022-11-07)") **Note:** Version bump only for package @crawlee/utils # 3.1.0 (2022-10-13) **Note:** Version bump only for package @crawlee/utils ## [3.0.4](https://github.com/apify/crawlee/compare/v3.0.3...v3.0.4) (2022-08-22)[​](#304-2022-08-22 "Direct link to 304-2022-08-22") **Note:** Version bump only for package @crawlee/utils --- # RobotsTxtFile Loads and queries information from a [robots.txt file](https://en.wikipedia.org/wiki/Robots.txt). **Example usage:** ``` // Load the robots.txt file const robots = await RobotsTxtFile.find('https://crawlee.dev/js/docs/introduction/first-crawler'); // Check if a URL should be crawled according to robots.txt const url = 'https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler'; if (robots.isAllowed(url)) { await crawler.addRequests([url]); } // Enqueue all links in the sitemap(s) await crawler.addRequests(await robots.parseUrlsFromSitemaps()); ``` ## Index[**](#Index) ### Methods * [**getSitemaps](#getSitemaps) * [**isAllowed](#isAllowed) * [**parseSitemaps](#parseSitemaps) * [**parseUrlsFromSitemaps](#parseUrlsFromSitemaps) * [**find](#find) * [**from](#from) ## Methods[**](#Methods) ### [**](#getSitemaps)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L102)getSitemaps * ****getSitemaps**(): string\[] - Get URLs of sitemaps referenced in the robots file. 
*** #### Returns string\[] ### [**](#isAllowed)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L95)isAllowed * ****isAllowed**(url, userAgent): boolean - Check if a URL should be crawled by robots. *** #### Parameters * ##### url: string the URL to check against the rules in robots.txt * ##### optionaluserAgent: string = '\*' relevant user agent, defaults to `*` #### Returns boolean ### [**](#parseSitemaps)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L109)parseSitemaps * ****parseSitemaps**(): Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> - Parse all the sitemaps referenced in the robots file. *** #### Returns Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> ### [**](#parseUrlsFromSitemaps)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L116)parseUrlsFromSitemaps * ****parseUrlsFromSitemaps**(): Promise\<string\[]> - Get all URLs from all the sitemaps referenced in the robots file. A shorthand for `(await robots.parseSitemaps()).urls`. *** #### Returns Promise\<string\[]> ### [**](#find)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L40)staticfind * ****find**(url, proxyUrl): Promise<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md)> - Determine the location of a robots.txt file for a URL and fetch it. *** #### Parameters * ##### url: string the URL to fetch robots.txt for * ##### optionalproxyUrl: string a proxy to be used for fetching the robots.txt file #### Returns Promise<[RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md)> ### [**](#from)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts#L54)staticfrom * ****from**(url, content, proxyUrl): [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) - Allows providing the URL and robots.txt content explicitly instead of loading it from the target site. *** #### Parameters * ##### url: string the URL of the robots.txt file * ##### content: string contents of robots.txt * ##### optionalproxyUrl: string a proxy to be used for fetching the robots.txt file #### Returns [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) --- # Sitemap Loads one or more sitemaps from given URLs, following references in sitemap index files, and exposes the contained URLs.
**Example usage:** ``` // Load a sitemap const sitemap = await Sitemap.load(['https://example.com/sitemap.xml', 'https://example.com/sitemap_2.xml.gz']); // Enqueue all the contained URLs (including those from sub-sitemaps from sitemap indexes) await crawler.addRequests(sitemap.urls); ``` ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**urls](#urls) ### Methods * [**fromXmlString](#fromXmlString) * [**load](#load) * [**tryCommonNames](#tryCommonNames) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L372)constructor * ****new Sitemap**(urls): [Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md) - #### Parameters * ##### urls: string\[] #### Returns [Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md) ## Properties[**](#Properties) ### [**](#urls)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L372)readonlyurls **urls: string\[] ## Methods[**](#Methods) ### [**](#fromXmlString)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L417)staticfromXmlString * ****fromXmlString**(content, proxyUrl): Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> - Parse XML sitemap content from a string and return URLs of referenced pages. If the sitemap references other sitemaps, they will be loaded via HTTP. *** #### Parameters * ##### content: string XML sitemap content * ##### optionalproxyUrl: string URL of a proxy to be used for fetching sitemap contents #### Returns Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> ### [**](#load)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L400)staticload * ****load**(urls, proxyUrl, parseSitemapOptions): Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> - Fetch sitemap content from given URL or URLs and return URLs of referenced pages. *** #### Parameters * ##### urls: string | string\[] sitemap URL(s) * ##### optionalproxyUrl: string URL of a proxy to be used for fetching sitemap contents * ##### optionalparseSitemapOptions: [ParseSitemapOptions](https://crawlee.dev/js/api/utils/interface/ParseSitemapOptions.md) #### Returns Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> ### [**](#tryCommonNames)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L380)statictryCommonNames * ****tryCommonNames**(url, proxyUrl): Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> - Try to load sitemap from the most common locations - `/sitemap.xml` and `/sitemap.txt`. For loading based on `Sitemap` entries in `robots.txt`, the [RobotsTxtFile](https://crawlee.dev/js/api/utils/class/RobotsTxtFile.md) class should be used. *** #### Parameters * ##### url: string The domain URL to fetch the sitemap for. * ##### optionalproxyUrl: string A proxy to be used for fetching the sitemap file. #### Returns Promise<[Sitemap](https://crawlee.dev/js/api/utils/class/Sitemap.md)> --- # chunk ### Callable * ****chunk**\(array, chunkSize): T\[]\[] *** * #### Parameters * ##### array: readonly T\[] * ##### chunkSize: number #### Returns T\[]\[] --- # createRequestDebugInfo ### Callable * ****createRequestDebugInfo**(request, response, additionalFields): Dictionary *** * Creates a standardized debug info from request and response. 
This info is usually added to dataset under the hidden `#debug` field. *** #### Parameters * ##### request: Request\ [Request](https://sdk.apify.com/docs/api/request) object. * ##### optionalresponse: IncomingMessage | Partial\ = {} Puppeteer [`Response`](https://pptr.dev/#?product=Puppeteer\&version=v1.11.0\&show=api-class-response) or NodeJS [`http.IncomingMessage`](https://nodejs.org/api/http.html#http_class_http_serverresponse). * ##### optionaladditionalFields: Dictionary = {} Object containing additional fields to be added. #### Returns Dictionary --- # downloadListOfUrls ### Callable * ****downloadListOfUrls**(options): Promise\ *** * Returns a promise that resolves to an array of urls parsed from the resource available at the provided url. Optionally, custom regular expression and encoding may be provided. *** #### Parameters * ##### options: [DownloadListOfUrlsOptions](https://crawlee.dev/js/api/utils/interface/DownloadListOfUrlsOptions.md) #### Returns Promise\ --- # extractUrls ### Callable * ****extractUrls**(options): string\[] *** * Collects all URLs in an arbitrary string to an array, optionally using a custom regular expression. *** #### Parameters * ##### options: [ExtractUrlsOptions](https://crawlee.dev/js/api/utils/interface/ExtractUrlsOptions.md) #### Returns string\[] --- # extractUrlsFromCheerio ### Callable * ****extractUrlsFromCheerio**($, selector, baseUrl): string\[] *** * Extracts URLs from a given Cheerio object. * **@throws** when a relative URL is encountered with no baseUrl set *** #### Parameters * ##### $: CheerioAPI the Cheerio object to extract URLs from * ##### selector: string = 'a' a CSS selector for matching link elements * ##### baseUrl: string = '' a URL for resolving relative links #### Returns string\[] An array of absolute URLs --- # getCgroupsVersion ### Callable * ****getCgroupsVersion**(forceReset): Promise\ *** * gets the cgroup version by checking for a file at /sys/fs/cgroup/memory *** #### Parameters * ##### optionalforceReset: boolean #### Returns Promise\ "V1" or "V2" for the version of cgroup or null if cgroup is not found. --- # getMemoryInfo ### Callable * ****getMemoryInfo**(): Promise<[MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md)> *** * Returns memory statistics of the process and the system, see [MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md). If the process runs inside of Docker, the `getMemoryInfo` gets container memory limits, otherwise it gets system memory limits. Beware that the function is quite inefficient because it spawns a new process. Therefore you shouldn't call it too often, like more than once per second. 
*** #### Returns Promise<[MemoryInfo](https://crawlee.dev/js/api/utils/interface/MemoryInfo.md)> --- # getObjectType ### Callable * ****getObjectType**(value): string *** * #### Parameters * ##### value: unknown #### Returns string --- # gotScraping ### Callable * ****gotScraping**(url, options): CancelableRequest\> * ****gotScraping**\(url, options): CancelableRequest\> * ****gotScraping**(url, options): CancelableRequest\>> * ****gotScraping**(url, options): CancelableRequest\ * ****gotScraping**(options): CancelableRequest\> * ****gotScraping**\(options): CancelableRequest\> * ****gotScraping**(options): CancelableRequest\>> * ****gotScraping**(options): CancelableRequest\ * ****gotScraping**(url, options): CancelableRequest\ * ****gotScraping**\(url, options): CancelableRequest\ * ****gotScraping**(url, options): CancelableRequest\> * ****gotScraping**(options): CancelableRequest\ * ****gotScraping**\(options): CancelableRequest\ * ****gotScraping**(options): CancelableRequest\> * ****gotScraping**(url, options): Request * ****gotScraping**(options): Request * ****gotScraping**(url, options): Request | CancelableRequest\ * ****gotScraping**(options): Request | CancelableRequest\ * ****gotScraping**(url, options, defaults): Request | CancelableRequest\ *** * #### Parameters * ##### url: string | URL * ##### optionaloptions: ExtendedOptionsOfTextResponseBody #### Returns CancelableRequest\> ## Index[**](#Index) ### Properties * [**defaults](#defaults) * [**delete](#delete) * [**extend](#extend) * [**get](#get) * [**head](#head) * [**paginate](#paginate) * [**patch](#patch) * [**post](#post) * [**put](#put) * [**stream](#stream) ## Properties[**](#Properties) ### [**](#defaults)[**](https://undefined/apify/crawlee/blob/master/node_modules/got-scraping/src/index.d.ts#L224)externaldefaults **defaults: InstanceDefaults ### [**](#delete)delete **delete: ExtendedGotRequestFunction ### [**](#extend)[**](https://undefined/apify/crawlee/blob/master/node_modules/got-scraping/src/index.d.ts#L225)externalextend **extend: (...instancesOrOptions) => GotScraping #### Type declaration * * **(...instancesOrOptions): GotScraping - #### Parameters * ##### externalrest...instancesOrOptions: (GotScraping | ExtendedExtendOptions)\[] #### Returns GotScraping ### [**](#get)get **get: ExtendedGotRequestFunction ### [**](#head)head **head: ExtendedGotRequestFunction ### [**](#paginate)[**](https://undefined/apify/crawlee/blob/master/node_modules/got-scraping/src/index.d.ts#L223)externalpaginate **paginate: ExtendedGotPaginate ### [**](#patch)patch **patch: ExtendedGotRequestFunction ### [**](#post)post **post: ExtendedGotRequestFunction ### [**](#put)put **put: ExtendedGotRequestFunction ### [**](#stream)[**](https://undefined/apify/crawlee/blob/master/node_modules/got-scraping/src/index.d.ts#L222)externalstream **stream: ExtendedGotStream --- # htmlToText ### Callable * ****htmlToText**(htmlOrCheerioElement): string *** * The function converts a HTML document to a plain text. The plain text generated by the function is similar to a text captured by pressing Ctrl+A and Ctrl+C on a page when loaded in a web browser. The function doesn't aspire to preserve the formatting or to be perfectly correct with respect to HTML specifications. However, it attempts to generate newlines and whitespaces in and around HTML elements to avoid merging distinct parts of text and thus enable extraction of data from the text (e.g. phone numbers). 
**Example usage** ``` const text = htmlToText('Some text'); console.log(text); ``` Note that the function uses [cheerio](https://www.npmjs.com/package/cheerio) to parse the HTML. Optionally, to avoid duplicate parsing of HTML and thus improve performance, you can pass an existing Cheerio object to the function instead of the HTML text. The HTML should be parsed with the `decodeEntities` option set to `true`. For example: ``` import * as cheerio from 'cheerio'; const html = 'Some text'; const text = htmlToText(cheerio.load(html, { decodeEntities: true })); ``` *** #### Parameters * ##### htmlOrCheerioElement: string | CheerioAPI HTML text or parsed HTML represented using a [cheerio](https://www.npmjs.com/package/cheerio) function. #### Returns string Plain text --- # isContainerized ### Callable * ****isContainerized**(): Promise\ *** * Detects if crawlee is running in a containerized environment. *** #### Returns Promise\ --- # isDocker ### Callable * ****isDocker**(forceReset): Promise\ *** * Returns a `Promise` that resolves to true if the code is running in a Docker container. *** #### Parameters * ##### optionalforceReset: boolean #### Returns Promise\ --- # isLambda ### Callable * ****isLambda**(): boolean *** * #### Returns boolean --- # parseOpenGraph ### Callable * ****parseOpenGraph**(raw, additionalProperties): Dictionary\ * ****parseOpenGraph**($, additionalProperties): Dictionary\ *** * Easily parse all OpenGraph properties from a page with just a `CheerioAPI` object. *** #### Parameters * ##### raw: string * ##### optionaladditionalProperties: [OpenGraphProperty](https://crawlee.dev/js/api/utils/interface/OpenGraphProperty.md)\[] Any potential additional `OpenGraphProperty` items you'd like to be scraped. Currently existing properties are kept up to date. #### Returns Dictionary\ Scraped OpenGraph properties as an object. --- # parseSitemap ### Callable * ****parseSitemap**\(initialSources, proxyUrl, options): AsyncIterable\ *** * #### Parameters * ##### initialSources: SitemapSource\[] * ##### optionalproxyUrl: string * ##### optionaloptions: T #### Returns AsyncIterable\ --- # sleep ### Callable * ****sleep**(millis): Promise\ *** * Returns a `Promise` that resolves after a specific period of time. This is useful to implement waiting in your code, e.g. to prevent overloading of target website or to avoid bot detection. **Example usage:** ``` import { sleep } from 'crawlee'; ... // Sleep 1.5 seconds await sleep(1500); ``` *** #### Parameters * ##### optionalmillis: number Period of time to sleep, in milliseconds. If not a positive number, the returned promise resolves immediately. #### Returns Promise\ --- # DownloadListOfUrlsOptions ## Index[**](#Index) ### Properties * [**encoding](#encoding) * [**proxyUrl](#proxyUrl) * [**url](#url) * [**urlRegExp](#urlRegExp) ## Properties[**](#Properties) ### [**](#encoding)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L16)optionalencoding **encoding? : BufferEncoding = BufferEncoding The encoding of the file. ### [**](#proxyUrl)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L26)optionalproxyUrl **proxyUrl? : string Allows to use a proxy for the download request. 
### [**](#url)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L10)url **url: string URL to the file ### [**](#urlRegExp)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L23)optionalurlRegExp **urlRegExp? : RegExp = RegExp Custom regular expression to identify the URLs in the file to extract. The regular expression should be case-insensitive and have global flag set (i.e. `/something/gi`). --- # ExtractUrlsOptions ## Index[**](#Index) ### Properties * [**string](#string) * [**urlRegExp](#urlRegExp) ## Properties[**](#Properties) ### [**](#string)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L62)string **string: string The string to extract URLs from. ### [**](#urlRegExp)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/extract-urls.ts#L68)optionalurlRegExp **urlRegExp? : RegExp = RegExp Custom regular expression --- # MemoryInfo Describes memory usage of the process. ## Index[**](#Index) ### Properties * [**childProcessesBytes](#childProcessesBytes) * [**freeBytes](#freeBytes) * [**mainProcessBytes](#mainProcessBytes) * [**totalBytes](#totalBytes) * [**usedBytes](#usedBytes) ## Properties[**](#Properties) ### [**](#childProcessesBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L42)childProcessesBytes **childProcessesBytes: number Amount of memory used by child processes of the current Node.js process ### [**](#freeBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L33)freeBytes **freeBytes: number Amount of free memory in the system or container ### [**](#mainProcessBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L39)mainProcessBytes **mainProcessBytes: number Amount of memory used the current Node.js process ### [**](#totalBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L30)totalBytes **totalBytes: number Total memory available in the system or container ### [**](#usedBytes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L36)usedBytes **usedBytes: number Amount of memory used (= totalBytes - freeBytes) --- # OpenGraphProperty ## Index[**](#Index) ### Properties * [**children](#children) * [**name](#name) * [**outputName](#outputName) ## Properties[**](#Properties) ### [**](#children)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/open_graph_parser.ts#L8)children **children: [OpenGraphProperty](https://crawlee.dev/js/api/utils/interface/OpenGraphProperty.md)\[] ### [**](#name)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/open_graph_parser.ts#L6)name **name: string ### [**](#outputName)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/open_graph_parser.ts#L7)outputName **outputName: string --- # ParseSitemapOptions ## Index[**](#Index) ### Properties * [**emitNestedSitemaps](#emitNestedSitemaps) * [**maxDepth](#maxDepth) * [**networkTimeouts](#networkTimeouts) * [**reportNetworkErrors](#reportNetworkErrors) * [**sitemapRetries](#sitemapRetries) ## Properties[**](#Properties) ### [**](#emitNestedSitemaps)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L176)optionalemitNestedSitemaps **emitNestedSitemaps? 
: boolean If set to `true`, elements referring to other sitemaps will be emitted as special objects with `originSitemapUrl` set to `null`. ### [**](#maxDepth)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L180)optionalmaxDepth **maxDepth? : number Maximum depth of nested sitemaps to follow. ### [**](#networkTimeouts)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L188)optionalnetworkTimeouts **networkTimeouts? : Delays Network timeouts for sitemap fetching. See [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/6-timeout.md) for more details. ### [**](#reportNetworkErrors)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L193)optionalreportNetworkErrors **reportNetworkErrors? : boolean = true If true, the parser will log a warning if it fails to fetch a sitemap due to a network error ### [**](#sitemapRetries)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts#L184)optionalsitemapRetries **sitemapRetries? : number Number of retries for fetching sitemaps. The counter resets for each nested sitemap. --- # social ## Index[**](#Index) ### Interfaces * [**SocialHandles](https://crawlee.dev/js/api/utils/namespace/social.md#SocialHandles) ### Variables * [**DISCORD\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#DISCORD_REGEX) * [**DISCORD\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#DISCORD_REGEX_GLOBAL) * [**EMAIL\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#EMAIL_REGEX) * [**EMAIL\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#EMAIL_REGEX_GLOBAL) * [**FACEBOOK\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#FACEBOOK_REGEX) * [**FACEBOOK\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#FACEBOOK_REGEX_GLOBAL) * [**INSTAGRAM\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#INSTAGRAM_REGEX) * [**INSTAGRAM\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#INSTAGRAM_REGEX_GLOBAL) * [**LINKEDIN\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#LINKEDIN_REGEX) * [**LINKEDIN\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#LINKEDIN_REGEX_GLOBAL) * [**PINTEREST\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#PINTEREST_REGEX) * [**PINTEREST\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#PINTEREST_REGEX_GLOBAL) * [**TIKTOK\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#TIKTOK_REGEX) * [**TIKTOK\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#TIKTOK_REGEX_GLOBAL) * [**TWITTER\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#TWITTER_REGEX) * [**TWITTER\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#TWITTER_REGEX_GLOBAL) * [**YOUTUBE\_REGEX](https://crawlee.dev/js/api/utils/namespace/social.md#YOUTUBE_REGEX) * [**YOUTUBE\_REGEX\_GLOBAL](https://crawlee.dev/js/api/utils/namespace/social.md#YOUTUBE_REGEX_GLOBAL) ### Functions * [**emailsFromText](https://crawlee.dev/js/api/utils/namespace/social.md#emailsFromText) * [**emailsFromUrls](https://crawlee.dev/js/api/utils/namespace/social.md#emailsFromUrls) * [**parseHandlesFromHtml](https://crawlee.dev/js/api/utils/namespace/social.md#parseHandlesFromHtml) * [**phonesFromText](https://crawlee.dev/js/api/utils/namespace/social.md#phonesFromText) * 
[**phonesFromUrls](https://crawlee.dev/js/api/utils/namespace/social.md#phonesFromUrls) ## Interfaces[**](#Interfaces) ### [**](#SocialHandles)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L202)SocialHandles **SocialHandles: Representation of social handles parsed from a HTML page. ### [**](#discords)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L213)discords **discords: string\[] ### [**](#emails)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L203)emails **emails: string\[] ### [**](#facebooks)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L209)facebooks **facebooks: string\[] ### [**](#instagrams)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L208)instagrams **instagrams: string\[] ### [**](#linkedIns)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L206)linkedIns **linkedIns: string\[] ### [**](#phones)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L204)phones **phones: string\[] ### [**](#phonesUncertain)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L205)phonesUncertain **phonesUncertain: string\[] ### [**](#pinterests)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L212)pinterests **pinterests: string\[] ### [**](#tiktoks)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L211)tiktoks **tiktoks: string\[] ### [**](#twitters)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L207)twitters **twitters: string\[] ### [**](#youtubes)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L210)youtubes **youtubes: string\[] ## Variables[**](#Variables) ### [**](#DISCORD_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L608)constDISCORD\_REGEX **DISCORD\_REGEX: RegExp = ... Regular expression to exactly match a Discord invite or channel. It has the following form: `/^...$/i` and matches URLs such as: ``` https://discord.gg/discord-developers https://discord.com/invite/jyEM2PRvMU https://discordapp.com/channels/1234 https://discord.com/channels/1234/1234 discord.gg/discord-developers ``` Example usage: ``` import { social } from 'crawlee'; if (social.DISCORD_REGEX.test('https://discord.gg/discord-developers')) { console.log('Match!'); } ``` ### [**](#DISCORD_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L629)constDISCORD\_REGEX\_GLOBAL **DISCORD\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Discord channels or invites in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://discord.gg/discord-developers https://discord.com/invite/jyEM2PRvMU https://discordapp.com/channels/1234 https://discord.com/channels/1234/1234 discord.gg/discord-developers ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.DISCORD_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Discord channels found!`); ``` ### [**](#EMAIL_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L13)constEMAIL\_REGEX **EMAIL\_REGEX: RegExp = ... Regular expression to exactly match a single email address. 
It has the following form: `/^...$/i`. ### [**](#EMAIL_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L19)constEMAIL\_REGEX\_GLOBAL **EMAIL\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple email addresses in a text. It has the following form: `/.../ig`. ### [**](#FACEBOOK_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L419)constFACEBOOK\_REGEX **FACEBOOK\_REGEX: RegExp = ... Regular expression to exactly match a single Facebook profile URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.facebook.com/apifytech facebook.com/apifytech fb.com/apifytech https://www.facebook.com/profile.php?id=123456789 ``` The regular expression does NOT match URLs with additional subdirectories or query parameters, such as: ``` https://www.facebook.com/apifytech/photos ``` Example usage: ``` import { social } from 'crawlee'; if (social.FACEBOOK_REGEX.test('https://www.facebook.com/apifytech')) { console.log('Match!'); } ``` ### [**](#FACEBOOK_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L448)constFACEBOOK\_REGEX\_GLOBAL **FACEBOOK\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Facebook profile URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.facebook.com/apifytech facebook.com/apifytech fb.com/apifytech ``` If the profile URL contains subdirectories or query parameters, the regular expression extracts just the base part of the profile URL. For example, from text such as: ``` https://www.facebook.com/apifytech/photos ``` the expression extracts only the following base URL: ``` https://www.facebook.com/apifytech ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.FACEBOOK_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Facebook profiles found!`); ``` ### [**](#INSTAGRAM_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L303)constINSTAGRAM\_REGEX **INSTAGRAM\_REGEX: RegExp = ... Regular expression to exactly match a single Instagram profile URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.instagram.com/old_prague www.instagram.com/old_prague/ instagr.am/old_prague ``` The regular expression does NOT match URLs with additional subdirectories or query parameters, such as: ``` https://www.instagram.com/cristiano/followers ``` It also does NOT match the following URLs: ``` https://www.instagram.com/explore/ https://www.instagram.com/_n/ https://www.instagram.com/_u/ ``` Example usage: ``` import { social } from 'crawlee'; if (social.INSTAGRAM_REGEX.test('https://www.instagram.com/old_prague')) { console.log('Match!'); } ``` ### [**](#INSTAGRAM_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L339)constINSTAGRAM\_REGEX\_GLOBAL **INSTAGRAM\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Instagram profile URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.instagram.com/old_prague www.instagram.com/old_prague/ instagr.am/old_prague ``` If the profile URL contains subdirectories or query parameters, the regular expression extracts just the base part of the profile URL.
For example, from text such as: ``` https://www.instagram.com/cristiano/followers ``` the expression extracts just the following base URL: ``` https://www.instagram.com/cristiano ``` The regular expression does NOT match the following URLs: ``` https://www.instagram.com/explore/ https://www.instagram.com/_n/ https://www.instagram.com/_u/ ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.INSTAGRAM_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Instagram profiles found!`); ``` ### [**](#LINKEDIN_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L241)constLINKEDIN\_REGEX **LINKEDIN\_REGEX: RegExp = ... Regular expression to exactly match a single LinkedIn profile URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.linkedin.com/in/alan-turing en.linkedin.com/in/alan-turing linkedin.com/in/alan-turing https://www.linkedin.com/company/linkedin/ ``` The regular expression does NOT match URLs with additional subdirectories or query parameters, such as: ``` https://www.linkedin.com/in/linus-torvalds/latest-activity ``` Example usage: ``` import { social } from 'crawlee'; if (social.LINKEDIN_REGEX.test('https://www.linkedin.com/in/alan-turing')) { console.log('Match!'); } ``` ### [**](#LINKEDIN_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L271)constLINKEDIN\_REGEX\_GLOBAL **LINKEDIN\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple LinkedIn profile URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.linkedin.com/in/alan-turing en.linkedin.com/in/alan-turing linkedin.com/in/alan-turing https://www.linkedin.com/company/linkedin/ ``` If the profile URL contains subdirectories or query parameters, the regular expression extracts just the base part of the profile URL. For example, from text such as: ``` https://www.linkedin.com/in/linus-torvalds/latest-activity ``` the expression extracts just the following base URL: ``` https://www.linkedin.com/in/linus-torvalds ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.LINKEDIN_REGEX_GLOBAL); if (matches) console.log(`${matches.length} LinkedIn profiles found!`); ``` ### [**](#PINTEREST_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L563)constPINTEREST\_REGEX **PINTEREST\_REGEX: RegExp = ... Regular expression to exactly match a Pinterest pin, user or user's board. It has the following form: `/^...$/i` and matches URLs such as: ``` https://pinterest.com/pin/123456789 https://www.pinterest.cz/pin/123456789 https://www.pinterest.com/user https://uk.pinterest.com/user https://www.pinterest.co.uk/user pinterest.com/user_name.gold https://cz.pinterest.com/user/board ``` Example usage: ``` import { social } from 'crawlee'; if (social.PINTEREST_REGEX.test('https://pinterest.com/pin/123456789')) { console.log('Match!'); } ``` ### [**](#PINTEREST_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L586)constPINTEREST\_REGEX\_GLOBAL **PINTEREST\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Pinterest pins, users or boards in a text or HTML. 
It has the following form: `/.../ig` and matches URLs such as: ``` https://pinterest.com/pin/123456789 https://www.pinterest.cz/pin/123456789 https://www.pinterest.com/user https://uk.pinterest.com/user https://www.pinterest.co.uk/user pinterest.com/user_name.gold https://cz.pinterest.com/user/board ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.PINTEREST_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Pinterest pins found!`); ``` ### [**](#TIKTOK_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L517)constTIKTOK\_REGEX **TIKTOK\_REGEX: RegExp = ... Regular expression to exactly match a Tiktok video or user account. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.tiktok.com/trending?shareId=123456789 https://www.tiktok.com/embed/123456789 https://m.tiktok.com/v/123456789 https://www.tiktok.com/@user https://www.tiktok.com/@user-account.pro https://www.tiktok.com/@user/video/123456789 ``` Example usage: ``` import { social } from 'crawlee'; if (social.TIKTOK_REGEX.test('https://www.tiktok.com/trending?shareId=123456789')) { console.log('Match!'); } ``` ### [**](#TIKTOK_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L539)constTIKTOK\_REGEX\_GLOBAL **TIKTOK\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Tiktok videos or user accounts in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.tiktok.com/trending?shareId=123456789 https://www.tiktok.com/embed/123456789 https://m.tiktok.com/v/123456789 https://www.tiktok.com/@user https://www.tiktok.com/@user-account.pro https://www.tiktok.com/@user/video/123456789 ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.TIKTOK_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Tiktok profiles/videos found!`); ``` ### [**](#TWITTER_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L364)constTWITTER\_REGEX **TWITTER\_REGEX: RegExp = ... Regular expression to exactly match a single Twitter profile URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.twitter.com/apify twitter.com/apify ``` The regular expression does NOT match URLs with additional subdirectories or query parameters, such as: ``` https://www.twitter.com/realdonaldtrump/following ``` Example usage: ``` import { social } from 'crawlee'; if (social.TWITTER_REGEX.test('https://www.twitter.com/apify')) { console.log('Match!'); } ``` ### [**](#TWITTER_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L392)constTWITTER\_REGEX\_GLOBAL **TWITTER\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Twitter profile URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.twitter.com/apify twitter.com/apify ``` If the profile URL contains subdirectories or query parameters, the regular expression extracts just the base part of the profile URL. 
For example, from text such as: ``` https://www.twitter.com/realdonaldtrump/following ``` the expression extracts only the following base URL: ``` https://www.twitter.com/realdonaldtrump ``` Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.TWITTER_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Twitter profiles found!`); ``` ### [**](#YOUTUBE_REGEX)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L472)constYOUTUBE\_REGEX **YOUTUBE\_REGEX: RegExp = ... Regular expression to exactly match a single Youtube channel, user or video URL. It has the following form: `/^...$/i` and matches URLs such as: ``` https://www.youtube.com/watch?v=kM7YfhfkiEE https://youtu.be/kM7YfhfkiEE https://www.youtube.com/c/TrapNation https://www.youtube.com/channel/UCklie6BM0fhFvzWYqQVoCTA https://www.youtube.com/user/pewdiepie ``` Please note that this won't match URLs that redirect to /user or /channel. Example usage: ``` import { social } from 'crawlee'; if (social.YOUTUBE_REGEX.test('https://www.youtube.com/watch?v=kM7YfhfkiEE')) { console.log('Match!'); } ``` ### [**](#YOUTUBE_REGEX_GLOBAL)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L495)constYOUTUBE\_REGEX\_GLOBAL **YOUTUBE\_REGEX\_GLOBAL: RegExp = ... Regular expression to find multiple Youtube channel, user or video URLs in a text or HTML. It has the following form: `/.../ig` and matches URLs such as: ``` https://www.youtube.com/watch?v=kM7YfhfkiEE https://youtu.be/kM7YfhfkiEE https://www.youtube.com/c/TrapNation https://www.youtube.com/channel/UCklie6BM0fhFvzWYqQVoCTA https://www.youtube.com/user/pewdiepie ``` Please note that this won't match URLs that redirect to /user or /channel. Example usage: ``` import { social } from 'crawlee'; const matches = text.match(social.YOUTUBE_REGEX_GLOBAL); if (matches) console.log(`${matches.length} Youtube videos found!`); ``` ## Functions[**](#Functions) ### [**](#emailsFromText)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L30)emailsFromText * ****emailsFromText**(text): string\[] - The function extracts email addresses from plain text. Note that the function preserves the order of emails and keeps duplicates. *** #### Parameters * ##### text: string Text to search in. #### Returns string\[] Array of email addresses found. If no emails are found, the function returns an empty array. ### [**](#emailsFromUrls)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L43)emailsFromUrls * ****emailsFromUrls**(urls): string\[] - The function extracts email addresses from a list of URLs. Basically it looks for all `mailto:` URLs and returns valid email addresses from them. Note that the function preserves the order of emails and keeps duplicates. *** #### Parameters * ##### urls: string\[] Array of URLs. #### Returns string\[] Array of email addresses found. If no emails are found, the function returns an empty array. ### [**](#parseHandlesFromHtml)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L661)parseHandlesFromHtml * ****parseHandlesFromHtml**(html, data): [SocialHandles](https://crawlee.dev/js/api/utils/namespace/social.md#SocialHandles) - The function attempts to extract emails, phone numbers and social profile URLs from an HTML document, specifically LinkedIn, Twitter, Instagram and Facebook profile URLs.
The function removes duplicates from the resulting arrays and sorts the items alphabetically. Note that the `phones` field contains phone numbers extracted from the special phone links such as `[call us](tel:+1234556789)` (see [phonesFromUrls](https://crawlee.dev/js/api/utils/namespace/social.md#phonesFromUrls)) and potentially other sources with high certainty, while `phonesUncertain` contains phone numbers extracted from the plain text, which might be very inaccurate. **Example usage:** ``` import { launchPuppeteer, social } from 'crawlee'; const browser = await launchPuppeteer(); const page = await browser.newPage(); await page.goto('http://www.example.com'); const html = await page.content(); const result = social.parseHandlesFromHtml(html); console.log('Social handles:'); console.dir(result); ``` *** #### Parameters * ##### html: string HTML text * ##### optionaldata: null | Record\ = null Optional object which will receive the `text` and `$` properties that contain text content of the HTML and `cheerio` object, respectively. This is an optimization so that the caller doesn't need to parse the HTML document again, if needed. #### Returns [SocialHandles](https://crawlee.dev/js/api/utils/namespace/social.md#SocialHandles) An object with the social handles. ### [**](#phonesFromText)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L124)phonesFromText * ****phonesFromText**(text): string\[] - The function attempts to extract phone numbers from a text. Please note that the results might not be accurate, since phone numbers appear in a large variety of formats and conventions. If you encounter some problems, please [file an issue](https://github.com/apify/crawlee/issues). *** #### Parameters * ##### text: string Text to search the phone numbers in. #### Returns string\[] Array of phone numbers found. If no phone numbers are found, the function returns an empty array. ### [**](#phonesFromUrls)[**](https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/social.ts#L150)phonesFromUrls * ****phonesFromUrls**(urls): string\[] - Finds phone number links in an array of URLs and extracts the phone numbers from them. Note that the phone number links look like `tel://123456789`, `tel:/123456789` or `tel:123456789`. *** #### Parameters * ##### urls: string\[] Array of URLs. #### Returns string\[] Array of phone numbers found. If no phone numbers are found, the function returns an empty array. --- ## [📄️ Deploy on Apify](https://crawlee.dev/js/docs/deployment/apify-platform.md) [Apify platform - large-scale and high-performance web scraping](https://crawlee.dev/js/docs/deployment/apify-platform.md) --- # Apify Platform Apify is a [platform](https://apify.com) built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to [compute instances (Actors)](#what-is-an-actor), convenient [request](https://crawlee.dev/js/docs/guides/request-storage.md) and [result](https://crawlee.dev/js/docs/guides/result-storage.md) storages, [proxies](https://crawlee.dev/js/docs/guides/proxy-management.md), [scheduling](https://docs.apify.com/scheduler), [webhooks](https://docs.apify.com/webhooks) and [more](https://docs.apify.com/), accessible through a [web interface](https://console.apify.com) or an [API](https://docs.apify.com/api).
While we think that the Apify platform is super cool, and it's definitely worth signing up for a [free account](https://console.apify.com/sign-up), **Crawlee is and will always be open source**, runnable locally or on any cloud infrastructure. note We do not test Crawlee in other cloud environments such as Lambda or on specific architectures such as Raspberry PI. We strive to make it work, but there are no guarantees. ## Logging into Apify platform from Crawlee[​](#logging-into-apify-platform-from-crawlee "Direct link to Logging into Apify platform from Crawlee") To access your [Apify account](https://console.apify.com/sign-up) from Crawlee, you must provide credentials - your [API token](https://console.apify.com/account?tab=integrations). You can do that either by utilizing [Apify CLI](https://github.com/apify/apify-cli) or with environment variables. Once you provide credentials to your scraper, you will be able to use all the Apify platform features, such as calling actors, saving to cloud storages, using Apify proxies, setting up webhooks and so on. ### Log in with CLI[​](#log-in-with-cli "Direct link to Log in with CLI") Apify CLI allows you to log in to your Apify account on your computer. If you then run your scraper using the CLI, your credentials will automatically be added. ``` npm install -g apify-cli apify login -t YOUR_API_TOKEN ``` ### Log in with environment variables[​](#log-in-with-environment-variables "Direct link to Log in with environment variables") Alternatively, you can always provide credentials to your scraper by setting the [`APIFY_TOKEN`](#apify_token) environment variable to your API token. > There's also the [`APIFY_PROXY_PASSWORD`](#apify_proxy_password) environment variable. Actor automatically infers that from your token, but it can be useful when you need to access proxies from a different account than your token represents. ### Log in with Configuration[​](#log-in-with-configuration "Direct link to Log in with Configuration") Another option is to use the [`Configuration`](https://docs.apify.com/sdk/js/reference/class/Configuration) instance and set your api token there. ``` import { Actor } from 'apify'; const sdk = new Actor({ token: 'your_api_token' }); ``` ## What is an actor[​](#what-is-an-actor "Direct link to What is an actor") When you deploy your script to the Apify platform, it becomes an [actor](https://apify.com/actors). An actor is a serverless microservice that accepts an input and produces an output. It can run for a few seconds, hours or even infinitely. An actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. Actors can be shared in the [Apify Store](https://apify.com/store) so that other people can use them. But don't worry, if you share your actor in the store and somebody uses it, it runs under their account, not yours. **Related links** * [Store of existing actors](https://apify.com/store) * [Documentation](https://docs.apify.com/actors) * [View actors in Apify Console](https://console.apify.com/actors) * [API reference](https://apify.com/docs/api/v2#/reference/actors) ## Running an actor locally[​](#running-an-actor-locally "Direct link to Running an actor locally") First let's create a boilerplate of the new actor. You could use Apify CLI and just run: ``` apify create my-hello-world ``` The CLI will prompt you to select a project boilerplate template - let's pick "Hello world". 
The tool will create a directory called `my-hello-world` with a Node.js project files. You can run the actor as follows: ``` cd my-hello-world apify run ``` ## Running Crawlee code as an actor[​](#running-crawlee-code-as-an-actor "Direct link to Running Crawlee code as an actor") For running Crawlee code as an actor on [Apify platform](https://apify.com/actors) you should either: * use a combination of [`Actor.init()`](https://docs.apify.com/sdk/js/reference/class/Actor#init) and [`Actor.exit()`](https://docs.apify.com/sdk/js/reference/class/Actor#exit) functions; * or wrap it into [`Actor.main()`](https://docs.apify.com/sdk/js/reference/class/Actor#main) function. NOTE * Adding [`Actor.init()`](https://docs.apify.com/sdk/js/reference/class/Actor#init) and [`Actor.exit()`](https://docs.apify.com/sdk/js/reference/class/Actor#exit) to your code are the only two important things needed to run it on Apify platform as an actor. `Actor.init()` is needed to initialize your actor (e.g. to set the correct storage implementation), while without `Actor.exit()` the process will simply never stop. * [`Actor.main()`](https://docs.apify.com/sdk/js/reference/class/Actor#main) is an alternative to `Actor.init()` and `Actor.exit()` as it calls both behind the scenes. Let's look at the `CheerioCrawler` example from the [Quick Start](https://crawlee.dev/js/docs/quick-start.md) guide: * Using Actor.main() * Using Actor.init() and Actor.exit() ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.main(async () => { const crawler = new CheerioCrawler({ async requestHandler({ request, $, enqueueLinks }) { const { url } = request; // Extract HTML title of the page. const title = $('title').text(); console.log(`Title of ${url}: ${title}`); // Add URLs that match the provided pattern. await enqueueLinks({ globs: ['https://www.iana.org/*'], }); // Save extracted data to dataset. await Actor.pushData({ url, title }); }, }); // Enqueue the initial request and run the crawler await crawler.run(['https://www.iana.org/']); }); ``` ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ async requestHandler({ request, $, enqueueLinks }) { const { url } = request; // Extract HTML title of the page. const title = $('title').text(); console.log(`Title of ${url}: ${title}`); // Add URLs that match the provided pattern. await enqueueLinks({ globs: ['https://www.iana.org/*'], }); // Save extracted data to dataset. await Actor.pushData({ url, title }); }, }); // Enqueue the initial request and run the crawler await crawler.run(['https://www.iana.org/']); await Actor.exit(); ``` Note that you could also run your actor (that is using Crawlee) locally with Apify CLI. You could start it via the following command in your project folder: ``` apify run ``` ## Deploying an actor to Apify platform[​](#deploying-an-actor-to-apify-platform "Direct link to Deploying an actor to Apify platform") Now (assuming you are already logged in to your Apify account) you can easily deploy your code to the Apify platform by running: ``` apify push ``` Your script will be uploaded to and built on the Apify platform so that it can be run there. For more information, view the [Apify Actor](https://docs.apify.com/cli) documentation. ## Usage on Apify platform[​](#usage-on-apify-platform "Direct link to Usage on Apify platform") You can also develop your actor in an online code editor directly on the platform (you'll need an Apify Account). 
Let's go to the [Actors](https://console.apify.com/actors) page in the app, click *Create new* and then go to the *Source* tab and start writing the code or paste one of the examples from the [Examples](https://crawlee.dev/js/docs/examples.md) section. ## Storages[​](#storages "Direct link to Storages") There are several things worth mentioning here. ### Helper functions for default Key-Value Store and Dataset[​](#helper-functions-for-default-key-value-store-and-dataset "Direct link to Helper functions for default Key-Value Store and Dataset") To simplify access to the *default* storages, instead of using the helper functions of respective storage classes, you could use: * [`Actor.setValue()`](https://docs.apify.com/sdk/js/reference/class/Actor#setValue), [`Actor.getValue()`](https://docs.apify.com/sdk/js/reference/class/Actor#getValue), [`Actor.getInput()`](https://docs.apify.com/sdk/js/reference/class/Actor#getInput) for `Key-Value Store` * [`Actor.pushData()`](https://docs.apify.com/sdk/js/reference/class/Actor#pushData) for `Dataset` ### Using platform storage in a local actor[​](#using-platform-storage-in-a-local-actor "Direct link to Using platform storage in a local actor") When you plan to use the platform storage while developing and running your actor locally, you should use [`Actor.openKeyValueStore()`](https://docs.apify.com/sdk/js/reference/class/Actor#openKeyValueStore), [`Actor.openDataset()`](https://docs.apify.com/sdk/js/reference/class/Actor#openDataset) and [`Actor.openRequestQueue()`](https://docs.apify.com/sdk/js/reference/class/Actor#openRequestQueue) to open the respective storage. Using each of these methods allows to pass the [`OpenStorageOptions`](https://docs.apify.com/sdk/js/reference/interface/OpenStorageOptions) as a second argument, which has only one optional property: [`forceCloud`](https://docs.apify.com/sdk/js/reference/interface/OpenStorageOptions#forceCloud). If set to `true` - cloud storage will be used instead of the folder on the local disk. note If you don't plan to force usage of the platform storages when running the actor locally, there is no need to use the [`Actor`](https://docs.apify.com/sdk/js/reference/class/Actor) class for it. The Crawlee variants [`KeyValueStore.open()`](https://crawlee.dev/js/api/core/class/KeyValueStore.md#open), [`Dataset.open()`](https://crawlee.dev/js/api/core/class/Dataset.md#open) and [`RequestQueue.open()`](https://crawlee.dev/js/api/core/class/RequestQueue.md#open) will work the same. ### Getting public url of an item in the platform storage[​](#getting-public-url-of-an-item-in-the-platform-storage "Direct link to Getting public url of an item in the platform storage") If you need to share a link to some file stored in a Key-Value Store on Apify Platform, you can use [`getPublicUrl()`](https://docs.apify.com/sdk/js/reference/class/KeyValueStore#getPublicUrl) method. It accepts only one parameter: `key` - the key of the item you want to share. ``` import { KeyValueStore } from 'apify'; const store = await KeyValueStore.open(); await store.setValue('your-file', { foo: 'bar' }); const url = store.getPublicUrl('your-file'); // https://api.apify.com/v2/key-value-stores//records/your-file ``` ### Exporting dataset data[​](#exporting-dataset-data "Direct link to Exporting dataset data") When the [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md) is stored on the [Apify platform](https://apify.com/actors), you can export its data to the following formats: HTML, JSON, CSV, Excel, XML and RSS. 
The datasets are displayed on the actor run details page and in the [Storage](https://console.apify.com/storage) section in the Apify Console. The actual data is exported using the [Get dataset items](https://apify.com/docs/api/v2#/reference/datasets/item-collection/get-items) Apify API endpoint. This way you can easily share the crawling results. **Related links** * [Apify platform storage documentation](https://docs.apify.com/storage) * [View storage in Apify Console](https://console.apify.com/storage) * [Key-value stores API reference](https://apify.com/docs/api/v2#/reference/key-value-stores) * [Datasets API reference](https://docs.apify.com/api/v2#/reference/datasets) * [Request queues API reference](https://docs.apify.com/api/v2#/reference/request-queues) ## Environment variables[​](#environment-variables "Direct link to Environment variables") The following are some additional environment variables specific to Apify platform. More Crawlee specific environment variables could be found in the [Environment Variables](https://crawlee.dev/js/docs/guides/configuration.md#environment-variables) guide. note It's important to notice that `CRAWLEE_` environment variables don't need to be replaced with equivalent `APIFY_` ones. Likewise, Crawlee understands `APIFY_` environment variables after calling `Actor.init()` or when using `Actor.main()`. ### `APIFY_TOKEN`[​](#apify_token "Direct link to apify_token") The API token for your Apify account. It is used to access the Apify API, e.g. to access cloud storage or to run an actor on the Apify platform. You can find your API token on the [Account Settings / Integrations](https://console.apify.com/account?tab=integrations) page. ### Combinations of `APIFY_TOKEN` and `CRAWLEE_STORAGE_DIR`[​](#combinations-of-apify_token-and-crawlee_storage_dir "Direct link to combinations-of-apify_token-and-crawlee_storage_dir") > `CRAWLEE_STORAGE_DIR` env variable description could be found in [Environment Variables](https://crawlee.dev/js/docs/guides/configuration.md#crawlee_storage_dir) guide. By combining the env vars in various ways, you can greatly influence the actor's behavior. | Env Vars | API | Storages | | --------------------------------------- | --- | ---------------- | | none OR `CRAWLEE_STORAGE_DIR` | no | local | | `APIFY_TOKEN` | yes | Apify platform | | `APIFY_TOKEN` AND `CRAWLEE_STORAGE_DIR` | yes | local + platform | When using both `APIFY_TOKEN` and `CRAWLEE_STORAGE_DIR`, you can use all the Apify platform features and your data will be stored locally by default. If you want to access platform storages, you can use the `{ forceCloud: true }` option in their respective functions. ``` import { Actor } from 'apify'; import { Dataset } from 'crawlee'; // or Dataset.open('my-local-data') const localDataset = await Actor.openDataset('my-local-data'); // but here we need the `Actor` class const remoteDataset = await Actor.openDataset('my-dataset', { forceCloud: true }); ``` ### `APIFY_PROXY_PASSWORD`[​](#apify_proxy_password "Direct link to apify_proxy_password") Optional password to [Apify Proxy](https://docs.apify.com/proxy) for IP address rotation. Assuming Apify Account was already created, you can find the password on the [Proxy page](https://console.apify.com/proxy) in the Apify Console. The password is automatically inferred using the `APIFY_TOKEN` env var, so in most cases, you don't need to touch it. 
You should use it when, for some reason, you need access to Apify Proxy, but not access to Apify API, or when you need access to proxy from a different account than your token represents. ## Proxy management[​](#proxy-management "Direct link to Proxy management") In addition to your own proxy servers and proxy servers acquired from third-party providers used together with Crawlee, you can also rely on [Apify Proxy](https://apify.com/proxy) for your scraping needs. ### Apify Proxy[​](#apify-proxy "Direct link to Apify Proxy") If you are already subscribed to Apify Proxy, you can start using it immediately in only a few lines of code (for local usage, you should first be [logged in](#logging-into-apify-platform-from-crawlee) to your Apify account). ``` import { Actor } from 'apify'; const proxyConfiguration = await Actor.createProxyConfiguration(); const proxyUrl = await proxyConfiguration.newUrl(); ``` Note that unlike using your own proxies in Crawlee, you shouldn't use the constructor to create a [`ProxyConfiguration`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) instance. For using Apify Proxy you should create an instance using the [`Actor.createProxyConfiguration()`](https://docs.apify.com/sdk/js/reference/class/Actor#createProxyConfiguration) function instead. ### Apify Proxy Configuration[​](#apify-proxy-configuration "Direct link to Apify Proxy Configuration") With Apify Proxy, you can select specific proxy groups to use, or countries to connect from. This allows you to get better proxy performance after some initial research. ``` import { Actor } from 'apify'; const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], countryCode: 'US', }); const proxyUrl = await proxyConfiguration.newUrl(); ``` Now your crawlers will use only Residential proxies from the US. Note that you must first get access to a proxy group before you are able to use it. You can check proxy groups available to you in the [proxy dashboard](https://console.apify.com/proxy). ### Apify Proxy vs. Own proxies[​](#apify-proxy-vs-own-proxies "Direct link to Apify Proxy vs. Own proxies") The `ProxyConfiguration` class covers both Apify Proxy and custom proxy URLs so that you can easily switch between proxy providers. However, some features of the class are available only to Apify Proxy users, mainly because Apify Proxy is what one would call a super-proxy. It's not a single proxy server, but an API endpoint that allows connection through millions of different IP addresses. So the class essentially has two modes: Apify Proxy or Own (third party) proxy. The difference is easy to remember. * If you're using your own proxies - you should create an instance with the ProxyConfiguration [`constructor`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#constructor) function based on the provided [`ProxyConfigurationOptions`](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md). * If you are planning to use Apify Proxy - you should create an instance using the [`Actor.createProxyConfiguration()`](https://docs.apify.com/sdk/js/reference/class/Actor#createProxyConfiguration) function. [`ProxyConfigurationOptions.proxyUrls`](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md#proxyUrls) and [`ProxyConfigurationOptions.newUrlFunction`](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md#newUrlFunction) enable use of your custom proxy URLs, whereas all the other options are there to configure Apify Proxy. Both approaches are shown in the sketch below.
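To make the difference concrete, here is a minimal side-by-side sketch of the two modes. It is only an illustration: the proxy URLs are placeholder values, and the proxy group is simply the one used in the example above.

```
import { Actor } from 'apify';
import { ProxyConfiguration } from 'crawlee';

// Own (third-party) proxies: create the instance with the constructor
// and list your proxy servers explicitly. The URLs below are placeholders.
const ownProxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://my-proxy-1.example.com:8000',
        'http://my-proxy-2.example.com:8000',
    ],
});

// Apify Proxy: let the Actor SDK build the configuration for you.
const apifyProxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

// Both instances expose the same interface, e.g. newUrl().
console.log(await ownProxyConfiguration.newUrl());
```

Either instance can then be passed to a crawler through its `proxyConfiguration` option, so switching between providers does not require changes to the crawling code itself.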
**Related links**

* [Apify Proxy docs](https://docs.apify.com/proxy)

---

# Browsers on AWS Lambda

Running browser-enabled Crawlee crawlers in AWS Lambda is a bit more complicated - but not by much. The main problem is that we have to upload not only our code and the dependencies, but also the **browser binaries**.

## Managing browser binaries[​](#managing-browser-binaries "Direct link to Managing browser binaries")

Fortunately, there are already some NPM packages that can help us manage the browser binaries installation:

* [@sparticuz/chromium](https://www.npmjs.com/package/@sparticuz/chromium) is an NPM package containing brotli-compressed Chromium binaries. When run in the Lambda environment, the package unpacks the binaries under the `/tmp/` path and returns the path to the executable. We just add this package to the project dependencies and zip the `node_modules` folder.

```
# Install the package
npm i -S @sparticuz/chromium

# Zip the dependencies
zip -r dependencies.zip ./node_modules
```

We will now upload the `dependencies.zip` as a Lambda Layer to AWS. Unfortunately, we cannot do this directly - there is a 50MB limit on direct uploads (and the compressed Chromium build is around that size itself). Instead, we'll upload it as an object into S3 storage and provide the link to that object during the layer creation.

## Updating the code[​](#updating-the-code "Direct link to Updating the code")

We also have to slightly update the Crawlee code:

* First, we pass a new `Configuration` instance to the crawler. This way, every crawler instance we create will have its own storage and won't interfere with other crawler instances running in your Lambda environment.

src/main.js

```
// For more information, see https://crawlee.dev/
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
}, new Configuration({
    persistStorage: false,
}));

await crawler.run(startUrls);
```

* Now, we actually have to supply the code with the Chromium path from the `@sparticuz/chromium` package. The AWS Lambda environment also lacks some hardware support (GPU acceleration etc.) - you can tell Chromium about this by passing `aws_chromium.args` to the `args` parameter.

src/main.js

```
// For more information, see https://crawlee.dev/
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';
import aws_chromium from '@sparticuz/chromium';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    launchContext: {
        launchOptions: {
            executablePath: await aws_chromium.executablePath(),
            args: aws_chromium.args,
            headless: true,
        },
    },
}, new Configuration({
    persistStorage: false,
}));
```

* Last but not least, we have to wrap the code in an exported `handler` function - this will become the Lambda handler that AWS executes.
src/main.js

```
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';
import aws_chromium from '@sparticuz/chromium';

const startUrls = ['https://crawlee.dev'];

export const handler = async (event, context) => {
    const crawler = new PlaywrightCrawler({
        requestHandler: router,
        launchContext: {
            launchOptions: {
                executablePath: await aws_chromium.executablePath(),
                args: aws_chromium.args,
                headless: true,
            },
        },
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);

    return {
        statusCode: 200,
        body: await crawler.getData(),
    };
};
```

## Deploying the code[​](#deploying-the-code "Direct link to Deploying the code")

Now we can simply pack the code into a zip archive (minus the `node_modules` folder - we have put that in the Lambda Layer, remember?). We upload the code archive to AWS as the Lambda body, set up the Lambda so it uses the dependencies Layer, and test our newly created Lambda.

Memory settings

Since we're using full-size browsers here, we have to update the Lambda configuration a bit. Most importantly, make sure to set the memory setting to **1024 MB or more** and update the **Lambda timeout**. The target timeout value depends on how long your crawler will be running. Try measuring the execution time when running your crawler locally and set the timeout accordingly.

---

# Cheerio on AWS Lambda

Locally, we can conveniently create a Crawlee project with `npx crawlee create`. In order to run this project on AWS Lambda, however, we need to make a few tweaks.

## Updating the code[​](#updating-the-code "Direct link to Updating the code")

Whenever we instantiate a new crawler, we have to pass a unique `Configuration` instance to it. By default, all Crawlee crawler instances share the same storage - this can be convenient, but it would also make our Lambda stateful, which would lead to hard-to-debug problems.

Also, when creating this `Configuration` instance, make sure to pass the `persistStorage: false` option. This tells Crawlee to use in-memory storage, as the Lambda filesystem is read-only.

src/main.js

```
// For more information, see https://crawlee.dev/
import { CheerioCrawler, Configuration, ProxyConfiguration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new CheerioCrawler({
    requestHandler: router,
}, new Configuration({
    persistStorage: false,
}));

await crawler.run(startUrls);
```

Now, we wrap all the logic in a `handler` function. This is the actual "Lambda" that AWS will be executing later on.

src/main.js

```
// For more information, see https://crawlee.dev/
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

export const handler = async (event, context) => {
    const crawler = new CheerioCrawler({
        requestHandler: router,
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);
};
```

**Important**

Make sure to always instantiate a **new crawler instance for every Lambda invocation**. AWS keeps the environment running for some time after the first Lambda execution (in order to reduce cold-start times), so any subsequent invocations would otherwise reuse the already-used crawler instance. **TL;DR: Keep your Lambda stateless.**

Finally, we also want to return the scraped data from the Lambda when the crawler run ends.
In the end, your `main.js` script should look something like this:

src/main.js

```
// For more information, see https://crawlee.dev/
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

export const handler = async (event, context) => {
    const crawler = new CheerioCrawler({
        requestHandler: router,
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);

    return {
        statusCode: 200,
        body: await crawler.getData(),
    };
};
```

## Deploying the project[​](#deploying-the-project "Direct link to Deploying the project")

Now it's time to deploy our script on AWS! Let's create a zip archive from our project (including the `node_modules` folder) by running `zip -r package.zip .` in the project folder.

Large `node_modules` folder?

AWS has a limit of 50MB for direct file uploads. Usually, our Crawlee projects won't be anywhere near this limit, but we can easily exceed it with large dependency trees. A better way to install your project dependencies is by using Lambda Layers. With Layers, we can also share files between multiple Lambdas - and keep the actual "code" part of the Lambdas as slim as possible.

**To create a Lambda Layer, we need to:**

* Pack the `node_modules` folder into a separate zip file (the archive should contain one folder named `node_modules`).
* Create a new Lambda Layer from this archive. We'll probably need to upload the file to AWS S3 storage first and create the Layer from that object.
* After creating it, we simply tell our new Lambda function to use this layer.

To deploy our actual code, we upload the `package.zip` archive as our code source. In Lambda Runtime Settings, we point the `handler` to the main function that runs the crawler. You can use slashes to describe the directory structure and `.` to denote a named export. Our handler function is called `handler` and is exported from the `src/main.js` file, so we'll use `src/main.handler` as the handler name.

Now we're all set! By clicking the **Test** button, we can send an example testing event to our new Lambda. The actual contents of the event don't really matter for now - if you want, you can further parameterize your crawler run by analyzing the `event` object that AWS passes as the first argument to the handler.

tip

In the Configuration tab of the AWS Lambda dashboard, you can configure the amount of memory the Lambda runs with, or the size of the ephemeral storage. The memory size can greatly affect the execution speed of your Lambda. See the [official documentation](https://docs.aws.amazon.com/lambda/latest/operatorguide/computing-power.html) for how performance and cost scale with more memory.

---

# Browsers in GCP Cloud Run

Running full-size browsers on GCP Cloud Functions is actually a bit different from doing so on AWS Lambda - [apparently](https://pptr.dev/troubleshooting#running-puppeteer-on-google-cloud-functions), the latest runtime versions lack dependencies required to run Chromium.

If we want to run browser-enabled Crawlee crawlers on GCP, we'll need to turn to **Cloud Run**. Cloud Run is GCP's platform for running Docker containers - other than that, (almost) everything is the same as with Cloud Functions / AWS Lambdas. GCP can spin up your containers on demand, so you're only billed for the time it takes your container to return an HTTP response to the requesting client.
In a way, it also provides a slightly better developer experience (than regular FaaS), as you can debug your Docker containers locally and be sure you're getting the same setup in the cloud.

## Preparing the project[​](#preparing-the-project "Direct link to Preparing the project")

As always, we first pass a new `Configuration` instance to the crawler constructor:

src/main.js

```
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
}, new Configuration({
    persistStorage: false,
}));

await crawler.run(startUrls);
```

All we now need to do is wrap our crawler with an Express HTTP server handler, so it can communicate with the client via HTTP. Because the Cloud Run platform sees only an opaque Docker container, we have to take care of this bit ourselves.

info

GCP passes you an environment variable called `PORT` - your HTTP server is expected to be listening on this port (GCP exposes it to the outside world).

The `main.js` script should look like this in the end:

src/main.js

```
import { Configuration, PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';
import express from 'express';

const app = express();

const startUrls = ['https://crawlee.dev'];

app.get('/', async (req, res) => {
    const crawler = new PlaywrightCrawler({
        requestHandler: router,
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);

    return res.send(await crawler.getData());
});

app.listen(parseInt(process.env.PORT) || 3000);
```

tip

Always make sure to keep all the logic in the request handler - as with other FaaS services, your request handlers have to be **stateless**.

## Deploying to GCP[​](#deploying-to-gcp "Direct link to Deploying to GCP")

Now, we're ready to deploy! If you have initialized your project using `npx crawlee create`, the initialization script has prepared a Dockerfile for you. All you have to do now is run `gcloud run deploy` in your project folder (the one with your Dockerfile in it). The gcloud CLI application will ask you a few questions, such as what region you want to deploy your application in, or whether you want to make your application public or private.

After answering those questions, you should be able to see your application in the GCP dashboard and run it using the link you find there.

tip

In case the first execution of your newly created Cloud Run service fails, try editing the Run configuration - mainly setting the available memory to 1GiB or more and updating the request timeout according to the size of the website you are scraping.

---

# Cheerio on GCP Cloud Functions

Running a CheerioCrawler-based project in GCP Cloud Functions is actually quite easy - you just have to make a few changes to the project code.

## Updating the project[​](#updating-the-project "Direct link to Updating the project")

Let's first create the Crawlee project locally with `npx crawlee create`. Set the `"main"` field in the `package.json` file to `"src/main.js"`.

package.json

```
{
    "name": "my-crawlee-project",
    "version": "1.0.0",
    "main": "src/main.js",
    ...
}
```

Now, let's update the `main.js` file, namely:

* Pass a separate `Configuration` instance (with the `persistStorage` option set to `false`) to the crawler constructor.
src/main.js

```
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new CheerioCrawler({
    requestHandler: router,
}, new Configuration({
    persistStorage: false,
}));

await crawler.run(startUrls);
```

* Wrap the crawler call in a separate handler function. This function:
  * Can be asynchronous.
  * Takes two positional arguments - `req` (containing details about the request made to your cloud function) and `res` (the response object you can modify).
  * Should call `res.send(data)` to return any data from the cloud function.
* Export this function from the `src/main.js` module as a named export.

src/main.js

```
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

export const handler = async (req, res) => {
    const crawler = new CheerioCrawler({
        requestHandler: router,
    }, new Configuration({
        persistStorage: false,
    }));

    await crawler.run(startUrls);

    return res.send(await crawler.getData());
};
```

## Deploying to Google Cloud Platform[​](#deploying-to-google-cloud-platform "Direct link to Deploying to Google Cloud Platform")

In the Google Cloud dashboard, create a new function, allocate memory and CPUs to it, and set the region and the function timeout.

When deploying, pick **ZIP Upload**. You have to create a new GCP storage bucket to store the zip packages in.

Now, for the package - you should zip all the contents of your project folder **excluding the `node_modules` folder** (GCP doesn't have Layers like AWS Lambda does, but it takes care of the project setup for us based on the `package.json` file).

Also, make sure to set the **Entry point** to the name of the function you've exported from the `src/main.js` file. GCP takes the file from the `package.json`'s `main` field.

After the function deploys, you can test it by clicking the "Testing" tab. This tab contains a `curl` script that calls your new Cloud Function. To avoid having to install the `gcloud` CLI application locally, you can also run this script in the Cloud Shell by clicking the link above the code block.

---

## [📄️ Accept user input](https://crawlee.dev/js/docs/examples/accept-user-input.md)

[This example accepts and logs user input:](https://crawlee.dev/js/docs/examples/accept-user-input.md)

---

# Accept user input

This example accepts and logs user input:

```
import { KeyValueStore } from 'crawlee';

const input = await KeyValueStore.getInput();
console.log(input);
```

To provide the actor with input, create an `INPUT.json` file inside the "default" key-value store:

```
{PROJECT_FOLDER}/storage/key_value_stores/default/INPUT.json
```

Anything in this file will be available to the actor when it runs. To learn about other ways to provide an actor with input, refer to the [Apify Platform Documentation](https://apify.com/docs/actor#run).

---

# Add data to dataset

This example saves data to the default dataset. If the dataset doesn't exist, it will be created.
You can save data to custom datasets by using [`Dataset.open()`](https://crawlee.dev/js/api/core/class/Dataset.md#open) [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyBwdXNoRGF0YSwgcmVxdWVzdCwgYm9keSB9KSB7XFxuICAgICAgICAvLyBTYXZlIGRhdGEgdG8gZGVmYXVsdCBkYXRhc2V0XFxuICAgICAgICBhd2FpdCBwdXNoRGF0YSh7XFxuICAgICAgICAgICAgdXJsOiByZXF1ZXN0LnVybCxcXG4gICAgICAgICAgICBodG1sOiBib2R5LFxcbiAgICAgICAgfSk7XFxuICAgIH0sXFxufSk7XFxuXFxuYXdhaXQgY3Jhd2xlci5hZGRSZXF1ZXN0cyhbXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMScsXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMicsXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMycsXFxuXSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4oKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.y9kz_gyD0gZNJaNVFyYfICCT63Qx-6Kf2Lk6EddXLt4\&asrc=run_on_apify) ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ pushData, request, body }) { // Save data to default dataset await pushData({ url: request.url, html: body, }); }, }); await crawler.addRequests([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3', ]); // Run the crawler await crawler.run(); ``` Each item in this dataset will be saved to its own file in the following directory: ``` {PROJECT_FOLDER}/storage/datasets/default/ ``` --- # Basic crawler Copy for LLM This is the most bare-bones example of using Crawlee, which demonstrates some of its building blocks such as the [`BasicCrawler`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md). You probably don't need to go this deep though, and it would be better to start with one of the full-featured crawlers like [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) or [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). The script simply downloads several web pages with plain HTTP requests using the [`sendRequest`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#sendRequest) utility function (which uses the [`got-scraping`](https://github.com/apify/got-scraping) npm module internally) and stores their raw HTML and URL in the default dataset. In local configuration, the data will be stored as JSON files in `./storage/datasets/default`. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEJhc2ljQ3Jhd2xlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbi8vIENyZWF0ZSBhIEJhc2ljQ3Jhd2xlciAtIHRoZSBzaW1wbGVzdCBjcmF3bGVyIHRoYXQgZW5hYmxlc1xcbi8vIHVzZXJzIHRvIGltcGxlbWVudCB0aGUgY3Jhd2xpbmcgbG9naWMgdGhlbXNlbHZlcy5cXG5jb25zdCBjcmF3bGVyID0gbmV3IEJhc2ljQ3Jhd2xlcih7XFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gd2lsbCBiZSBjYWxsZWQgZm9yIGVhY2ggVVJMIHRvIGNyYXdsLlxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHB1c2hEYXRhLCByZXF1ZXN0LCBzZW5kUmVxdWVzdCwgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHsgdXJsIH0gPSByZXF1ZXN0O1xcbiAgICAgICAgbG9nLmluZm8oYFByb2Nlc3NpbmcgJHt1cmx9Li4uYCk7XFxuXFxuICAgICAgICAvLyBGZXRjaCB0aGUgcGFnZSBIVE1MIHZpYSB0aGUgY3Jhd2xlZSBzZW5kUmVxdWVzdCB1dGlsaXR5IG1ldGhvZFxcbiAgICAgICAgLy8gQnkgZGVmYXVsdCwgdGhlIG1ldGhvZCB3aWxsIHVzZSB0aGUgY3VycmVudCByZXF1ZXN0IHRoYXQgaXMgYmVpbmcgaGFuZGxlZCwgc28geW91IGRvbid0IGhhdmUgdG9cXG4gICAgICAgIC8vIHByb3ZpZGUgaXQgeW91cnNlbGYuIFlvdSBjYW4gYWxzbyBwcm92aWRlIGEgY3VzdG9tIHJlcXVlc3QgaWYgeW91IHdhbnQuXFxuICAgICAgICBjb25zdCB7IGJvZHkgfSA9IGF3YWl0IHNlbmRSZXF1ZXN0KCk7XFxuXFxuICAgICAgICAvLyBTdG9yZSB0aGUgSFRNTCBhbmQgVVJMIHRvIHRoZSBkZWZhdWx0IGRhdGFzZXQuXFxuICAgICAgICBhd2FpdCBwdXNoRGF0YSh7XFxuICAgICAgICAgICAgdXJsLFxcbiAgICAgICAgICAgIGh0bWw6IGJvZHksXFxuICAgICAgICB9KTtcXG4gICAgfSxcXG59KTtcXG5cXG4vLyBUaGUgaW5pdGlhbCBsaXN0IG9mIFVSTHMgdG8gY3Jhd2wuIEhlcmUgd2UgdXNlIGp1c3QgYSBmZXcgaGFyZC1jb2RlZCBVUkxzLlxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoW1xcbiAgICAnaHR0cHM6Ly93d3cuZ29vZ2xlLmNvbScsXFxuICAgICdodHRwczovL3d3dy5leGFtcGxlLmNvbScsXFxuICAgICdodHRwczovL3d3dy5iaW5nLmNvbScsXFxuICAgICdodHRwczovL3d3dy53aWtpcGVkaWEuY29tJyxcXG5dKTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXIgYW5kIHdhaXQgZm9yIGl0IHRvIGZpbmlzaC5cXG5hd2FpdCBjcmF3bGVyLnJ1bigpO1xcblxcbmNvbnNvbGUubG9nKCdDcmF3bGVyIGZpbmlzaGVkLicpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjEwMjQsInRpbWVvdXQiOjE4MH19.jFrwSiKGzhJE8bZfqP_Tf7TU-RdpZnGb1cJ78bke0rQ\&asrc=run_on_apify) ``` import { BasicCrawler } from 'crawlee'; // Create a BasicCrawler - the simplest crawler that enables // users to implement the crawling logic themselves. const crawler = new BasicCrawler({ // This function will be called for each URL to crawl. async requestHandler({ pushData, request, sendRequest, log }) { const { url } = request; log.info(`Processing ${url}...`); // Fetch the page HTML via the crawlee sendRequest utility method // By default, the method will use the current request that is being handled, so you don't have to // provide it yourself. You can also provide a custom request if you want. const { body } = await sendRequest(); // Store the HTML and URL to the default dataset. await pushData({ url, html: body, }); }, }); // The initial list of URLs to crawl. Here we use just a few hard-coded URLs. await crawler.addRequests([ 'https://www.google.com', 'https://www.example.com', 'https://www.bing.com', 'https://www.wikipedia.com', ]); // Run the crawler and wait for it to finish. await crawler.run(); console.log('Crawler finished.'); ``` --- # Capture a screenshot using Puppeteer Copy for LLM ## Using Puppeteer directly[​](#using-puppeteer-directly "Direct link to Using Puppeteer directly") tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. This example captures a screenshot of a web page using `Puppeteer`. It would look almost exactly the same with `Playwright`. 
* Page Screenshot * Crawler Utils Screenshot Using `page.screenshot()`: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEtleVZhbHVlU3RvcmUsIGxhdW5jaFB1cHBldGVlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbmNvbnN0IGtleVZhbHVlU3RvcmUgPSBhd2FpdCBLZXlWYWx1ZVN0b3JlLm9wZW4oKTtcXG5cXG5jb25zdCB1cmwgPSAnaHR0cHM6Ly9jcmF3bGVlLmRldic7XFxuLy8gU3RhcnQgYSBicm93c2VyXFxuY29uc3QgYnJvd3NlciA9IGF3YWl0IGxhdW5jaFB1cHBldGVlcigpO1xcblxcbi8vIE9wZW4gbmV3IHRhYiBpbiB0aGUgYnJvd3NlclxcbmNvbnN0IHBhZ2UgPSBhd2FpdCBicm93c2VyLm5ld1BhZ2UoKTtcXG5cXG4vLyBOYXZpZ2F0ZSB0byB0aGUgVVJMXFxuYXdhaXQgcGFnZS5nb3RvKHVybCk7XFxuXFxuLy8gQ2FwdHVyZSB0aGUgc2NyZWVuc2hvdFxcbmNvbnN0IHNjcmVlbnNob3QgPSBhd2FpdCBwYWdlLnNjcmVlbnNob3QoKTtcXG5cXG4vLyBTYXZlIHRoZSBzY3JlZW5zaG90IHRvIHRoZSBkZWZhdWx0IGtleS12YWx1ZSBzdG9yZVxcbmF3YWl0IGtleVZhbHVlU3RvcmUuc2V0VmFsdWUoJ215LWtleScsIHNjcmVlbnNob3QsIHsgY29udGVudFR5cGU6ICdpbWFnZS9wbmcnIH0pO1xcblxcbi8vIENsb3NlIFB1cHBldGVlclxcbmF3YWl0IGJyb3dzZXIuY2xvc2UoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.hnB2LA3UbM_7PMJC08VHU7l3FPloqtx7pSzIDU4nO0I\&asrc=run_on_apify) ``` import { KeyValueStore, launchPuppeteer } from 'crawlee'; const keyValueStore = await KeyValueStore.open(); const url = 'https://crawlee.dev'; // Start a browser const browser = await launchPuppeteer(); // Open new tab in the browser const page = await browser.newPage(); // Navigate to the URL await page.goto(url); // Capture the screenshot const screenshot = await page.screenshot(); // Save the screenshot to the default key-value store await keyValueStore.setValue('my-key', screenshot, { contentType: 'image/png' }); // Close Puppeteer await browser.close(); ``` Using `utils.puppeteer.saveSnapshot()`: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IGxhdW5jaFB1cHBldGVlciwgdXRpbHMgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCB1cmwgPSAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS8nO1xcbi8vIFN0YXJ0IGEgYnJvd3NlclxcbmNvbnN0IGJyb3dzZXIgPSBhd2FpdCBsYXVuY2hQdXBwZXRlZXIoKTtcXG5cXG4vLyBPcGVuIG5ldyB0YWIgaW4gdGhlIGJyb3dzZXJcXG5jb25zdCBwYWdlID0gYXdhaXQgYnJvd3Nlci5uZXdQYWdlKCk7XFxuXFxuLy8gTmF2aWdhdGUgdG8gdGhlIFVSTFxcbmF3YWl0IHBhZ2UuZ290byh1cmwpO1xcblxcbi8vIENhcHR1cmUgdGhlIHNjcmVlbnNob3RcXG5hd2FpdCB1dGlscy5wdXBwZXRlZXIuc2F2ZVNuYXBzaG90KHBhZ2UsIHsga2V5OiAnbXkta2V5Jywgc2F2ZUh0bWw6IGZhbHNlIH0pO1xcblxcbi8vIENsb3NlIFB1cHBldGVlclxcbmF3YWl0IGJyb3dzZXIuY2xvc2UoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.43Fi6LdMsWMLqVj33VMqbSZKfZFbgbNBSo11DKsWzto\&asrc=run_on_apify) ``` import { launchPuppeteer, utils } from 'crawlee'; const url = 'http://www.example.com/'; // Start a browser const browser = await launchPuppeteer(); // Open new tab in the browser const page = await browser.newPage(); // Navigate to the URL await page.goto(url); // Capture the screenshot await utils.puppeteer.saveSnapshot(page, { key: 'my-key', saveHtml: false }); // Close Puppeteer await browser.close(); ``` ## Using `PuppeteerCrawler`[​](#using-puppeteercrawler "Direct link to using-puppeteercrawler") This example captures a screenshot of multiple web pages when using `PuppeteerCrawler`: * Page Screenshot * Crawler Utils Screenshot Using `page.screenshot()`: [Run 
on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIEtleVZhbHVlU3RvcmUgfSBmcm9tICdjcmF3bGVlJztcXG5cXG4vLyBDcmVhdGUgYSBQdXBwZXRlZXJDcmF3bGVyXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQdXBwZXRlZXJDcmF3bGVyKHtcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlIH0pIHtcXG4gICAgICAgIC8vIENhcHR1cmUgdGhlIHNjcmVlbnNob3Qgd2l0aCBQdXBwZXRlZXJcXG4gICAgICAgIGNvbnN0IHNjcmVlbnNob3QgPSBhd2FpdCBwYWdlLnNjcmVlbnNob3QoKTtcXG4gICAgICAgIC8vIENvbnZlcnQgdGhlIFVSTCBpbnRvIGEgdmFsaWQga2V5XFxuICAgICAgICBjb25zdCBrZXkgPSByZXF1ZXN0LnVybC5yZXBsYWNlKC9bOi9dL2csICdfJyk7XFxuICAgICAgICAvLyBTYXZlIHRoZSBzY3JlZW5zaG90IHRvIHRoZSBkZWZhdWx0IGtleS12YWx1ZSBzdG9yZVxcbiAgICAgICAgYXdhaXQgS2V5VmFsdWVTdG9yZS5zZXRWYWx1ZShrZXksIHNjcmVlbnNob3QsIHsgY29udGVudFR5cGU6ICdpbWFnZS9wbmcnIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoW1xcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0xJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0yJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0zJyB9LFxcbl0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.dW6w_in8q5kLx6sM1tplVR0-n9GFpTMRCTjpsTyNhzQ\&asrc=run_on_apify) ``` import { PuppeteerCrawler, KeyValueStore } from 'crawlee'; // Create a PuppeteerCrawler const crawler = new PuppeteerCrawler({ async requestHandler({ request, page }) { // Capture the screenshot with Puppeteer const screenshot = await page.screenshot(); // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Save the screenshot to the default key-value store await KeyValueStore.setValue(key, screenshot, { contentType: 'image/png' }); }, }); await crawler.addRequests([ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]); // Run the crawler await crawler.run(); ``` Using the context-aware [`saveSnapshot()`](https://crawlee.dev/js/api/puppeteer-crawler/namespace/puppeteerUtils.md#saveSnapshot) utility: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG4vLyBDcmVhdGUgYSBQdXBwZXRlZXJDcmF3bGVyXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQdXBwZXRlZXJDcmF3bGVyKHtcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBzYXZlU25hcHNob3QgfSkge1xcbiAgICAgICAgLy8gQ29udmVydCB0aGUgVVJMIGludG8gYSB2YWxpZCBrZXlcXG4gICAgICAgIGNvbnN0IGtleSA9IHJlcXVlc3QudXJsLnJlcGxhY2UoL1s6L10vZywgJ18nKTtcXG4gICAgICAgIC8vIENhcHR1cmUgdGhlIHNjcmVlbnNob3RcXG4gICAgICAgIGF3YWl0IHNhdmVTbmFwc2hvdCh7IGtleSwgc2F2ZUh0bWw6IGZhbHNlIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoW1xcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0xJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0yJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0zJyB9LFxcbl0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.0vtFUxFqfNHq5Y7EZ95YMfXOq2WqBpN0zprfavDk7mU\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; // Create a PuppeteerCrawler const crawler = new PuppeteerCrawler({ async requestHandler({ request, 
saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Capture the screenshot await saveSnapshot({ key, saveHtml: false }); }, }); await crawler.addRequests([ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]); // Run the crawler await crawler.run(); ``` To take full page screenshot using puppeteer we need to pass parameter `fullPage` as `true`in the `screenshot()`: `page.screenshot(fullPage: true)` In both examples using `page.screenshot()`, a `key` variable is created based on the URL of the web page. This variable is used as the key when saving each screenshot into a key-value store. --- # Cheerio crawler Copy for LLM This example demonstrates how to use [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) to crawl a list of URLs from an external file, load each URL using a plain HTTP request, parse the HTML using the [Cheerio library](https://www.npmjs.com/package/cheerio) and extract some data from it: the page title and all `h1` tags. [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBsb2csIExvZ0xldmVsIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ3Jhd2xlcnMgY29tZSB3aXRoIHZhcmlvdXMgdXRpbGl0aWVzLCBlLmcuIGZvciBsb2dnaW5nLlxcbi8vIEhlcmUgd2UgdXNlIGRlYnVnIGxldmVsIG9mIGxvZ2dpbmcgdG8gaW1wcm92ZSB0aGUgZGVidWdnaW5nIGV4cGVyaWVuY2UuXFxuLy8gVGhpcyBmdW5jdGlvbmFsaXR5IGlzIG9wdGlvbmFsIVxcbmxvZy5zZXRMZXZlbChMb2dMZXZlbC5ERUJVRyk7XFxuXFxuLy8gQ3JlYXRlIGFuIGluc3RhbmNlIG9mIHRoZSBDaGVlcmlvQ3Jhd2xlciBjbGFzcyAtIGEgY3Jhd2xlclxcbi8vIHRoYXQgYXV0b21hdGljYWxseSBsb2FkcyB0aGUgVVJMcyBhbmQgcGFyc2VzIHRoZWlyIEhUTUwgdXNpbmcgdGhlIGNoZWVyaW8gbGlicmFyeS5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgLy8gVGhlIGNyYXdsZXIgZG93bmxvYWRzIGFuZCBwcm9jZXNzZXMgdGhlIHdlYiBwYWdlcyBpbiBwYXJhbGxlbCwgd2l0aCBhIGNvbmN1cnJlbmN5XFxuICAgIC8vIGF1dG9tYXRpY2FsbHkgbWFuYWdlZCBiYXNlZCBvbiB0aGUgYXZhaWxhYmxlIHN5c3RlbSBtZW1vcnkgYW5kIENQVSAoc2VlIEF1dG9zY2FsZWRQb29sIGNsYXNzKS5cXG4gICAgLy8gSGVyZSB3ZSBkZWZpbmUgc29tZSBoYXJkIGxpbWl0cyBmb3IgdGhlIGNvbmN1cnJlbmN5LlxcbiAgICBtaW5Db25jdXJyZW5jeTogMTAsXFxuICAgIG1heENvbmN1cnJlbmN5OiA1MCxcXG5cXG4gICAgLy8gT24gZXJyb3IsIHJldHJ5IGVhY2ggcGFnZSBhdCBtb3N0IG9uY2UuXFxuICAgIG1heFJlcXVlc3RSZXRyaWVzOiAxLFxcblxcbiAgICAvLyBJbmNyZWFzZSB0aGUgdGltZW91dCBmb3IgcHJvY2Vzc2luZyBvZiBlYWNoIHBhZ2UuXFxuICAgIHJlcXVlc3RIYW5kbGVyVGltZW91dFNlY3M6IDMwLFxcblxcbiAgICAvLyBMaW1pdCB0byAxMCByZXF1ZXN0cyBwZXIgb25lIGNyYXdsXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLFxcblxcbiAgICAvLyBUaGlzIGZ1bmN0aW9uIHdpbGwgYmUgY2FsbGVkIGZvciBlYWNoIFVSTCB0byBjcmF3bC5cXG4gICAgLy8gSXQgYWNjZXB0cyBhIHNpbmdsZSBwYXJhbWV0ZXIsIHdoaWNoIGlzIGFuIG9iamVjdCB3aXRoIG9wdGlvbnMgYXM6XFxuICAgIC8vIGh0dHBzOi8vY3Jhd2xlZS5kZXYvanMvYXBpL2NoZWVyaW8tY3Jhd2xlci9pbnRlcmZhY2UvQ2hlZXJpb0NyYXdsZXJPcHRpb25zI3JlcXVlc3RIYW5kbGVyXFxuICAgIC8vIFdlIHVzZSBmb3IgZGVtb25zdHJhdGlvbiBvbmx5IDIgb2YgdGhlbTpcXG4gICAgLy8gLSByZXF1ZXN0OiBhbiBpbnN0YW5jZSBvZiB0aGUgUmVxdWVzdCBjbGFzcyB3aXRoIGluZm9ybWF0aW9uIHN1Y2ggYXMgdGhlIFVSTCB0aGF0IGlzIGJlaW5nIGNyYXdsZWQgYW5kIEhUVFAgbWV0aG9kXFxuICAgIC8vIC0gJDogdGhlIGNoZWVyaW8gb2JqZWN0IGNvbnRhaW5pbmcgcGFyc2VkIEhUTUxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyBwdXNoRGF0YSwgcmVxdWVzdCwgJCB9KSB7XFxuICAgICAgICBsb2cuZGVidWcoYFByb2Nlc3NpbmcgJHtyZXF1ZXN0LnVybH0uLi5gKTtcXG5cXG4gICAgICAgIC8vIEV4dHJhY3QgZGF0YSBmcm9tIHRoZSBwYWdlIHVzaW5nIGNoZWVyaW8uXFxuICAgICAgICBjb25zdCB0aXRsZSA9ICQoJ3RpdGxlJykudGV4dCgpO1xcbiAgICAgICAgY29uc3QgaD
F0ZXh0czogeyB0ZXh0OiBzdHJpbmcgfVtdID0gW107XFxuICAgICAgICAkKCdoMScpLmVhY2goKGluZGV4LCBlbCkgPT4ge1xcbiAgICAgICAgICAgIGgxdGV4dHMucHVzaCh7XFxuICAgICAgICAgICAgICAgIHRleHQ6ICQoZWwpLnRleHQoKSxcXG4gICAgICAgICAgICB9KTtcXG4gICAgICAgIH0pO1xcblxcbiAgICAgICAgLy8gU3RvcmUgdGhlIHJlc3VsdHMgdG8gdGhlIGRhdGFzZXQuIEluIGxvY2FsIGNvbmZpZ3VyYXRpb24sXFxuICAgICAgICAvLyB0aGUgZGF0YSB3aWxsIGJlIHN0b3JlZCBhcyBKU09OIGZpbGVzIGluIC4vc3RvcmFnZS9kYXRhc2V0cy9kZWZhdWx0XFxuICAgICAgICBhd2FpdCBwdXNoRGF0YSh7XFxuICAgICAgICAgICAgdXJsOiByZXF1ZXN0LnVybCxcXG4gICAgICAgICAgICB0aXRsZSxcXG4gICAgICAgICAgICBoMXRleHRzLFxcbiAgICAgICAgfSk7XFxuICAgIH0sXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gaXMgY2FsbGVkIGlmIHRoZSBwYWdlIHByb2Nlc3NpbmcgZmFpbGVkIG1vcmUgdGhhbiBtYXhSZXF1ZXN0UmV0cmllcyArIDEgdGltZXMuXFxuICAgIGZhaWxlZFJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCB9KSB7XFxuICAgICAgICBsb2cuZGVidWcoYFJlcXVlc3QgJHtyZXF1ZXN0LnVybH0gZmFpbGVkIHR3aWNlLmApO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciBhbmQgd2FpdCBmb3IgaXQgdG8gZmluaXNoLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cXG5sb2cuZGVidWcoJ0NyYXdsZXIgZmluaXNoZWQuJyk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.sZ3S96qg-5sNtVOu6wpMBxeYZ1xQXA496A-Ou_nSUpc\&asrc=run_on_apify) ``` import { CheerioCrawler, log, LogLevel } from 'crawlee'; // Crawlers come with various utilities, e.g. for logging. // Here we use debug level of logging to improve the debugging experience. // This functionality is optional! log.setLevel(LogLevel.DEBUG); // Create an instance of the CheerioCrawler class - a crawler // that automatically loads the URLs and parses their HTML using the cheerio library. const crawler = new CheerioCrawler({ // The crawler downloads and processes the web pages in parallel, with a concurrency // automatically managed based on the available system memory and CPU (see AutoscaledPool class). // Here we define some hard limits for the concurrency. minConcurrency: 10, maxConcurrency: 50, // On error, retry each page at most once. maxRequestRetries: 1, // Increase the timeout for processing of each page. requestHandlerTimeoutSecs: 30, // Limit to 10 requests per one crawl maxRequestsPerCrawl: 10, // This function will be called for each URL to crawl. // It accepts a single parameter, which is an object with options as: // https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions#requestHandler // We use for demonstration only 2 of them: // - request: an instance of the Request class with information such as the URL that is being crawled and HTTP method // - $: the cheerio object containing parsed HTML async requestHandler({ pushData, request, $ }) { log.debug(`Processing ${request.url}...`); // Extract data from the page using cheerio. const title = $('title').text(); const h1texts: { text: string }[] = []; $('h1').each((index, el) => { h1texts.push({ text: $(el).text(), }); }); // Store the results to the dataset. In local configuration, // the data will be stored as JSON files in ./storage/datasets/default await pushData({ url: request.url, title, h1texts, }); }, // This function is called if the page processing failed more than maxRequestRetries + 1 times. failedRequestHandler({ request }) { log.debug(`Request ${request.url} failed twice.`); }, }); // Run the crawler and wait for it to finish. 
await crawler.run(['https://crawlee.dev']); log.debug('Crawler finished.'); ``` --- # Crawl all links on a website Copy for LLM This example uses the `enqueueLinks()` method to add new links to the `RequestQueue` as the crawler navigates from page to page. This example can also be used to find all URLs on a domain by removing the `maxRequestsPerCrawl` option. tip If no options are given, by default the method will only add links that are under the same subdomain. This behavior can be controlled with the [`strategy`](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md#strategy) option. You can find more info about this option in the [`Crawl relative links`](https://crawlee.dev/js/docs/examples/crawl-relative-links.md) examples. * Cheerio Crawler * Puppeteer Crawler * Playwright Crawler [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgZW5xdWV1ZUxpbmtzLCBsb2cgfSkge1xcbiAgICAgICAgbG9nLmluZm8ocmVxdWVzdC51cmwpO1xcbiAgICAgICAgLy8gQWRkIGFsbCBsaW5rcyBmcm9tIHBhZ2UgdG8gUmVxdWVzdFF1ZXVlXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3MoKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.LBIV5tC8xatPLd7liUmYWtCnUL8bFQBt6Eq8fnylMkg\&asrc=run_on_apify) ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIC8vIEFkZCBhbGwgbGlua3MgZnJvbSBwYWdlIHRvIFJlcXVlc3RRdWV1ZVxcbiAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKCk7XFxuICAgIH0sXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLCAvLyBMaW1pdGF0aW9uIGZvciBvbmx5IDEwIHJlcXVlc3RzIChkbyBub3QgdXNlIGlmIHlvdSB3YW50IHRvIGNyYXdsIGFsbCBsaW5rcylcXG59KTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXIgd2l0aCBpbml0aWFsIHJlcXVlc3RcXG5hd2FpdCBjcmF3bGVyLnJ1bihbJ2h0dHBzOi8vY3Jhd2xlZS5kZXYnXSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.G2vsd_Fgpa50zBrg6m9S-dTzY4pzWTkAxqe6CzZtX5k\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgZW5xdWV1ZUxpbmtzLCBsb2cgfSkge1xcbiAgICAgICAgbG9nLmluZm8ocmVxdWVzdC51cmwpO1xcbiAgICAgICAgLy8gQWRkIGFsbCBsaW5rcyBmcm9tIHBhZ2UgdG8gUmVxdWVzdFF1ZXVlXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3MoKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.NdlPlyegNit9Kua8PQcBs0l9SELlDds4jvMbM0_tnhc\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` --- # Crawl multiple URLs Copy for LLM This example crawls the specified list of URLs. 
* Cheerio Crawler * Puppeteer Crawler * Playwright Crawler [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCAkLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSAkKCd0aXRsZScpLnRleHQoKTtcXG4gICAgICAgIGxvZy5pbmZvKGBVUkw6ICR7cmVxdWVzdC51cmx9XFxcXG5USVRMRTogJHt0aXRsZX1gKTtcXG4gICAgfSxcXG59KTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXIgd2l0aCBpbml0aWFsIHJlcXVlc3RcXG5hd2FpdCBjcmF3bGVyLnJ1bihbJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0xJywgJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0yJywgJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0zJ10pO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjEwMjQsInRpbWVvdXQiOjE4MH19.EkXGuY4BB9beeDa547KhHku8moogGGz0it_b02peucA\&asrc=run_on_apify) ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ request, $, log }) { const title = $('title').text(); log.info(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler with initial request await crawler.run(['http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3']); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICAvLyBGdW5jdGlvbiBjYWxsZWQgZm9yIGVhY2ggVVJMXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgcGFnZSwgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHRpdGxlID0gYXdhaXQgcGFnZS50aXRsZSgpO1xcbiAgICAgICAgbG9nLmluZm8oYFVSTDogJHtyZXF1ZXN0LnVybH1cXFxcblRJVExFOiAke3RpdGxlfWApO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTEnLCAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTInLCAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTMnXSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.giI3tJSfWG6oPGR2aMc4P1hv9q3DjQouI10GxYdUr5c\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ // Function called for each URL async requestHandler({ request, page, log }) { const title = await page.title(); log.info(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler with initial request await crawler.run(['http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3']); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBsb2cuaW5mbyhgVVJMOiAke3JlcXVlc3QudXJsfVxcXFxuVElUTEU6ICR7dGl0bGV9YCk7XFxuICAgIH0sXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyIHdpdGggaW5pdGlhbCByZXF1ZXN0XFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMScsICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMicsICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMyddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.usJ_mWQQRhnzUWTSjqEaplezGdxO-uK49YEErKaMke0\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ // Function called for each URL async requestHandler({ request, page, log }) { const title = await page.title(); log.info(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler with initial request await crawler.run(['http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3']); ``` --- # Crawl a website with relative links Copy for LLM When crawling a website, you may encounter different types of links present that you may want to crawl. To facilitate the easy crawling of such links, we provide the `enqueueLinks()` method on the crawler context, which will automatically find links and add them to the crawler's [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md). We provide 3 different strategies for crawling relative links: * [All (or the string "all")](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#All) which will enqueue all links found, regardless of the domain they point to. * [SameHostname (or the string "same-hostname")](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#SameHostname) which will enqueue all links found for the same hostname. This is the default strategy. * [SameDomain (or the string "same-domain")](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#SameDomain) which will enqueue all links found that have the same domain name, including links from any possible subdomain. note For these examples, we are using the [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), however the same method is available for both the [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), and you use it the exact same way. * All Links * Same Hostname * Same Subdomain Example domains Any urls found will be matched by this strategy, even if they go off of the site you are currently crawling. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBFbnF1ZXVlU3RyYXRlZ3kgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgLy8gU2V0dGluZyB0aGUgc3RyYXRlZ3kgdG8gJ2FsbCcgd2lsbCBlbnF1ZXVlIGFsbCBsaW5rcyBmb3VuZFxcbiAgICAgICAgICAgIC8vIGhpZ2hsaWdodC1uZXh0LWxpbmVcXG4gICAgICAgICAgICBzdHJhdGVneTogRW5xdWV1ZVN0cmF0ZWd5LkFsbCxcXG4gICAgICAgICAgICAvLyBBbHRlcm5hdGl2ZWx5LCB5b3UgY2FuIHBhc3MgaW4gdGhlIHN0cmluZyAnYWxsJ1xcbiAgICAgICAgICAgIC8vIHN0cmF0ZWd5OiAnYWxsJyxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.zrKphqRNzrvQObV0GryliVYKQmeIFEkOtV_qBMeXvis\&asrc=run_on_apify) ``` import { CheerioCrawler, EnqueueStrategy } from 'crawlee'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); await enqueueLinks({ // Setting the strategy to 'all' will enqueue all links found strategy: EnqueueStrategy.All, // Alternatively, you can pass in the string 'all' // strategy: 'all', }); }, }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` Example domains For a url of `https://example.com`, `enqueueLinks()` will match relative urls and urls that point to the same hostname. > This is the default strategy when calling `enqueueLinks()`, so you don't have to specify it. For instance, hyperlinks like `https://example.com/some/path`, `/absolute/example` or `./relative/example` will all be matched by this strategy. But links to any subdomain like `https://subdomain.example.com/some/path` won't. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBFbnF1ZXVlU3RyYXRlZ3kgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgLy8gU2V0dGluZyB0aGUgc3RyYXRlZ3kgdG8gJ3NhbWUtaG9zdG5hbWUnIHdpbGwgZW5xdWV1ZSBhbGwgbGlua3MgZm91bmQgdGhhdCBhcmUgb24gdGhlXFxuICAgICAgICAgICAgLy8gc2FtZSBob3N0bmFtZSAoaW5jbHVkaW5nIHN1YmRvbWFpbikgYXMgcmVxdWVzdC5sb2FkZWRVcmwgb3IgcmVxdWVzdC51cmxcXG4gICAgICAgICAgICAvLyBoaWdobGlnaHQtbmV4dC1saW5lXFxuICAgICAgICAgICAgc3RyYXRlZ3k6IEVucXVldWVTdHJhdGVneS5TYW1lSG9zdG5hbWUsXFxuICAgICAgICAgICAgLy8gQWx0ZXJuYXRpdmVseSwgeW91IGNhbiBwYXNzIGluIHRoZSBzdHJpbmcgJ3NhbWUtaG9zdG5hbWUnXFxuICAgICAgICAgICAgLy8gc3RyYXRlZ3k6ICdzYW1lLWhvc3RuYW1lJyxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.iCcYmWUvfNLGhjIu0mJ9cQwXfpdl2TIbAnyCU5XVdrw\&asrc=run_on_apify) ``` import { CheerioCrawler, EnqueueStrategy } from 'crawlee'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); await enqueueLinks({ // Setting the strategy to 'same-hostname' will enqueue all links found that are on the // same hostname (including subdomain) as request.loadedUrl or request.url strategy: EnqueueStrategy.SameHostname, // Alternatively, you can pass in the string 'same-hostname' // strategy: 'same-hostname', }); }, }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` Example domains For a url of `https://subdomain.example.com`, `enqueueLinks()` will match relative urls or urls that point to the same domain name, regardless of their subdomain. For instance, hyperlinks like `https://subdomain.example.com/some/path`, `/absolute/example` or `./relative/example` will all be matched by this strategy, as well as links to other subdomains or to the naked domain, like `https://other-subdomain.example.com` or `https://example.com` will work too. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBFbnF1ZXVlU3RyYXRlZ3kgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgLy8gU2V0dGluZyB0aGUgc3RyYXRlZ3kgdG8gJ3NhbWUtZG9tYWluJyB3aWxsIGVucXVldWUgYWxsIGxpbmtzIGZvdW5kIHRoYXQgYXJlIG9uIHRoZVxcbiAgICAgICAgICAgIC8vIHNhbWUgaG9zdG5hbWUgYXMgcmVxdWVzdC5sb2FkZWRVcmwgb3IgcmVxdWVzdC51cmxcXG4gICAgICAgICAgICAvLyBoaWdobGlnaHQtbmV4dC1saW5lXFxuICAgICAgICAgICAgc3RyYXRlZ3k6IEVucXVldWVTdHJhdGVneS5TYW1lRG9tYWluLFxcbiAgICAgICAgICAgIC8vIEFsdGVybmF0aXZlbHksIHlvdSBjYW4gcGFzcyBpbiB0aGUgc3RyaW5nICdzYW1lLWRvbWFpbidcXG4gICAgICAgICAgICAvLyBzdHJhdGVneTogJ3NhbWUtZG9tYWluJyxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciB3aXRoIGluaXRpYWwgcmVxdWVzdFxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.eW4ZGM7CltwTaGI0ye7ioJvou8nYvf6dW6LLwLtFWWA\&asrc=run_on_apify) ``` import { CheerioCrawler, EnqueueStrategy } from 'crawlee'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks, log }) { log.info(request.url); await enqueueLinks({ // Setting the strategy to 'same-domain' will enqueue all links found that are on the // same hostname as request.loadedUrl or request.url strategy: EnqueueStrategy.SameDomain, // Alternatively, you can pass in the string 'same-domain' // strategy: 'same-domain', }); }, }); // Run the crawler with initial request await crawler.run(['https://crawlee.dev']); ``` --- # Crawl a single URL Copy for LLM This example uses the [`got-scraping`](https://github.com/apify/got-scraping) npm package to grab the HTML of a web page. [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IGdvdFNjcmFwaW5nIH0gZnJvbSAnZ290LXNjcmFwaW5nJztcXG5cXG4vLyBHZXQgdGhlIEhUTUwgb2YgYSB3ZWIgcGFnZVxcbmNvbnN0IHsgYm9keSB9ID0gYXdhaXQgZ290U2NyYXBpbmcoeyB1cmw6ICdodHRwczovL3d3dy5leGFtcGxlLmNvbScgfSk7XFxuY29uc29sZS5sb2coYm9keSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.0S1i1yD10_82mLCH3VWFtCZTU4-BDrDU1UGY208IqgE\&asrc=run_on_apify) ``` import { gotScraping } from 'got-scraping'; // Get the HTML of a web page const { body } = await gotScraping({ url: 'https://www.example.com' }); console.log(body); ``` If you don't want to hard-code the URL into the script, refer to the [Accept User Input](https://crawlee.dev/js/docs/examples/accept-user-input.md) example. --- # Crawl a sitemap Copy for LLM We will crawl sitemap which tells search engines which pages and file are important in the website, it also provides valuable information about these files. 
This example builds a sitemap crawler which downloads and crawls the URLs from a sitemap, by using the [`Sitemap`](https://crawlee.dev/js/api/utils/class/Sitemap.md) utility class provided by the [`@crawlee/utils`](https://crawlee.dev/js/api/utils.md) module. * Cheerio Crawler * Puppeteer Crawler * Playwright Crawler [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBTaXRlbWFwIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBsb2cgfSkge1xcbiAgICAgICAgbG9nLmluZm8ocmVxdWVzdC51cmwpO1xcbiAgICB9LFxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiAxMCwgLy8gTGltaXRhdGlvbiBmb3Igb25seSAxMCByZXF1ZXN0cyAoZG8gbm90IHVzZSBpZiB5b3Ugd2FudCB0byBjcmF3bCBhIHNpdGVtYXApXFxufSk7XFxuXFxuY29uc3QgeyB1cmxzIH0gPSBhd2FpdCBTaXRlbWFwLmxvYWQoJ2h0dHBzOi8vY3Jhd2xlZS5kZXYvc2l0ZW1hcC54bWwnKTtcXG5cXG5hd2FpdCBjcmF3bGVyLmFkZFJlcXVlc3RzKHVybHMpO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.tV8iOCFCHW8ymY2fNGesiSri1fq3k4YmUem3HRJ4EzA\&asrc=run_on_apify) ``` import { CheerioCrawler, Sitemap } from 'crawlee'; const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ request, log }) { log.info(request.url); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap) }); const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml'); await crawler.addRequests(urls); // Run the crawler await crawler.run(); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIFNpdGVtYXAgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICAvLyBGdW5jdGlvbiBjYWxsZWQgZm9yIGVhY2ggVVJMXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKHJlcXVlc3QudXJsKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYSBzaXRlbWFwKVxcbn0pO1xcblxcbmNvbnN0IHsgdXJscyB9ID0gYXdhaXQgU2l0ZW1hcC5sb2FkKCdodHRwczovL2NyYXdsZWUuZGV2L3NpdGVtYXAueG1sJyk7XFxuXFxuYXdhaXQgY3Jhd2xlci5hZGRSZXF1ZXN0cyh1cmxzKTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXJcXG5hd2FpdCBjcmF3bGVyLnJ1bigpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.xqNohmh8_of2Vvb4ZItu__LG2i404uvrtO2NqNMAkls\&asrc=run_on_apify) ``` import { PuppeteerCrawler, Sitemap } from 'crawlee'; const crawler = new PuppeteerCrawler({ // Function called for each URL async requestHandler({ request, log }) { log.info(request.url); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap) }); const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml'); await crawler.addRequests(urls); // Run the crawler await crawler.run(); ``` tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. 
```
import { PlaywrightCrawler, Sitemap } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Function called for each URL
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
    maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap)
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

await crawler.addRequests(urls);

// Run the crawler
await crawler.run();
```
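If you only need a subset of the sitemap, you can filter the loaded URLs before adding them to the queue. The snippet below is a small sketch of that idea, not part of the original example; the `/blog/` filter is a hypothetical criterion.

```
import { CheerioCrawler, Sitemap } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

// Keep only the URLs we care about (hypothetical filter).
const blogUrls = urls.filter((url) => url.includes('/blog/'));

await crawler.addRequests(blogUrls);
await crawler.run();
```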
---

# Crawl some links on a website

This [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) example uses the [`globs`](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md#globs) property in the [`enqueueLinks()`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlingContext.md#enqueueLinks) method to only add links to the [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) if they match the specified pattern.

```
import { CheerioCrawler } from 'crawlee';

// Create a CheerioCrawler
const crawler = new CheerioCrawler({
    // Limits the crawler to only 10 requests (do not use if you want to crawl all links)
    maxRequestsPerCrawl: 10,
    // Function called for each URL
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        // Add some links from page to the crawler's RequestQueue
        await enqueueLinks({
            globs: ['http?(s)://crawlee.dev/*/*'],
        });
    },
});

// Define the starting URL
await crawler.addRequests(['https://crawlee.dev']);

// Run the crawler
await crawler.run();
```
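Besides `globs`, [`EnqueueLinksOptions`](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) also accepts `regexps` and `exclude`, so an include pattern can be combined with an exclusion list. The following is only a sketch of that idea, not part of the original example; the patterns used are hypothetical.

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        await enqueueLinks({
            // Match documentation pages by regular expression...
            regexps: [/https:\/\/crawlee\.dev\/js\/docs\/.*/],
            // ...but skip anything ending in /changelog (hypothetical exclusion).
            exclude: ['**/changelog'],
        });
    },
});

await crawler.run(['https://crawlee.dev']);
```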
---

# Using Puppeteer Stealth Plugin (puppeteer-extra) and playwright-extra

[`puppeteer-extra`](https://www.npmjs.com/package/puppeteer-extra) and [`playwright-extra`](https://www.npmjs.com/package/playwright-extra) are community-built libraries that bring a plugin system to [`puppeteer`](https://www.npmjs.com/package/puppeteer) and [`playwright`](https://www.npmjs.com/package/playwright) respectively, adding extra functionality such as improved stealth via the Puppeteer Stealth plugin ([`puppeteer-extra-plugin-stealth`](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth)).

Available plugins

You can see a list of available plugins on the [`puppeteer-extra` plugin list](https://www.npmjs.com/package/puppeteer-extra#plugins). For [`playwright`](https://www.npmjs.com/package/playwright), please see the [`playwright-extra` plugin list](https://www.npmjs.com/package/playwright-extra#plugins) instead.

In this example, we'll show you how to use the Puppeteer Stealth ([`puppeteer-extra-plugin-stealth`](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth)) plugin to help you avoid bot detection when crawling your target website.

* Puppeteer & puppeteer-extra
* Playwright & playwright-extra

Before you begin

Make sure you've installed the Puppeteer Extra (`puppeteer-extra`) and Puppeteer Stealth plugin (`puppeteer-extra-plugin-stealth`) packages via your preferred package manager:

```
npm install puppeteer-extra puppeteer-extra-plugin-stealth
```

tip

To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile.

src/crawler.ts

```
import { PuppeteerCrawler } from 'crawlee';
import puppeteerExtra from 'puppeteer-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

// First, we tell puppeteer-extra to use the plugin (or plugins) we want.
// Certain plugins might have options you can pass in - read up on their documentation!
puppeteerExtra.use(stealthPlugin());

// Create an instance of the PuppeteerCrawler class - a crawler
// that automatically loads the URLs in headless Chrome / Puppeteer.
const crawler = new PuppeteerCrawler({
    launchContext: {
        // !!! You need to specify this option to tell Crawlee to use puppeteer-extra as the launcher !!!
        launcher: puppeteerExtra,
        launchOptions: {
            // Other puppeteer options work as usual
            headless: true,
        },
    },

    // Stop crawling after several pages
    maxRequestsPerCrawl: 50,

    // This function will be called for each URL to crawl.
    // Here you can write the Puppeteer scripts you are familiar with,
    // with the exception that browsers and pages are automatically managed by Crawlee.
    // The function accepts a single parameter, which is an object with the following fields:
    // - request: an instance of the Request class with information such as URL and HTTP method
    // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
    async requestHandler({ pushData, request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);

        // A function to be evaluated by Puppeteer within the browser context.
        const data = await page.$$eval('.athing', ($posts) => {
            const scrapedData: { title: string; rank: string; href: string }[] = [];

            // We're getting the title, rank and URL of each post on Hacker News.
            $posts.forEach(($post) => {
                scrapedData.push({
                    title: $post.querySelector('.title a').innerText,
                    rank: $post.querySelector('.rank').innerText,
                    href: $post.querySelector('.title a').href,
                });
            });

            return scrapedData;
        });

        // Store the results to the default dataset.
        await pushData(data);

        // Find a link to the next page and enqueue it if it exists.
        const infos = await enqueueLinks({
            selector: '.morelink',
        });

        if (infos.processedRequests.length === 0) log.info(`${request.url} is the last page!`);
    },

    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});

await crawler.addRequests(['https://news.ycombinator.com/']);

// Run the crawler and wait for it to finish.
await crawler.run();

console.log('Crawler finished.');
```
Before you begin

Make sure you've installed the `playwright-extra` and `puppeteer-extra-plugin-stealth` packages via your preferred package manager:

```
npm install playwright-extra puppeteer-extra-plugin-stealth
```

tip

To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile.

src/crawler.ts

```
import { PlaywrightCrawler } from 'crawlee';

// For playwright-extra you will need to import the browser type itself that you want to use!
// By default, PlaywrightCrawler uses chromium, but you can also use firefox or webkit.
import { chromium } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

// First, we tell playwright-extra to use the plugin (or plugins) we want.
// Certain plugins might have options you can pass in - read up on their documentation!
chromium.use(stealthPlugin());

// Create an instance of the PlaywrightCrawler class - a crawler
// that automatically loads the URLs in headless Chrome / Playwright.
const crawler = new PlaywrightCrawler({
    launchContext: {
        // !!! You need to specify this option to tell Crawlee to use playwright-extra as the launcher !!!
        launcher: chromium,
        launchOptions: {
            // Other playwright options work as usual
            headless: true,
        },
    },

    // Stop crawling after several pages
    maxRequestsPerCrawl: 50,

    // This function will be called for each URL to crawl.
    // Here you can write the Playwright scripts you are familiar with,
    // with the exception that browsers and pages are automatically managed by Crawlee.
    // The function accepts a single parameter, which is an object with the following fields:
    // - request: an instance of the Request class with information such as URL and HTTP method
    // - page: Playwright's Page object (see https://playwright.dev/docs/api/class-page)
    async requestHandler({ pushData, request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);

        // A function to be evaluated by Playwright within the browser context.
        const data = await page.$$eval('.athing', ($posts) => {
            const scrapedData: { title: string; rank: string; href: string }[] = [];

            // We're getting the title, rank and URL of each post on Hacker News.
            $posts.forEach(($post) => {
                scrapedData.push({
                    title: $post.querySelector('.title a').innerText,
                    rank: $post.querySelector('.rank').innerText,
                    href: $post.querySelector('.title a').href,
                });
            });

            return scrapedData;
        });

        // Store the results to the default dataset.
        await pushData(data);

        // Find a link to the next page and enqueue it if it exists.
        const infos = await enqueueLinks({
            selector: '.morelink',
        });

        if (infos.processedRequests.length === 0) log.info(`${request.url} is the last page!`);
    },

    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});

await crawler.addRequests(['https://news.ycombinator.com/']);

// Run the crawler and wait for it to finish.
await crawler.run();

console.log('Crawler finished.');
```
---

# Export entire dataset to one file

This `Dataset` example uses the `exportToCSV` function to export the entire default dataset to a single CSV file in the default key-value store.

```
import { Dataset } from 'crawlee';

// Retrieve or generate two items to be pushed
const data = [
    {
        id: 123,
        name: 'foo',
    },
    {
        id: 456,
        name: 'bar',
    },
];

// Push the two items to the default dataset
await Dataset.pushData(data);

// Export the entirety of the dataset to a single file in
// the default key-value store under the key "OUTPUT"
await Dataset.exportToCSV('OUTPUT');
```
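If you prefer JSON output, the same dataset can be exported with `exportToJSON` instead. A minimal sketch, using the same data as above:

```
import { Dataset } from 'crawlee';

await Dataset.pushData([
    { id: 123, name: 'foo' },
    { id: 456, name: 'bar' },
]);

// Export the whole default dataset to the default key-value store
// as a single JSON file under the key "OUTPUT"
await Dataset.exportToJSON('OUTPUT');
```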
---

# Download a file

When web crawling, you sometimes need to download files such as images, PDFs, or other binary files. This example demonstrates how to download files using Crawlee and save them to the default key-value store.

The script simply downloads several files with plain HTTP requests using the custom [`FileDownload`](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) crawler class and stores their contents in the default key-value store. In local configuration, the data will be stored as files in `./storage/key_value_stores/default`.

```
import { FileDownload } from 'crawlee';

// Create a FileDownload - a custom crawler instance that will download files from URLs.
const crawler = new FileDownload({
    async requestHandler({ body, request, contentType, getKeyValueStore }) {
        const url = new URL(request.url);
        const kvs = await getKeyValueStore();

        await kvs.setValue(url.pathname.replace(/\//g, '_'), body, { contentType: contentType.type });
    },
});

// The initial list of URLs to crawl. Here we use just a few hard-coded URLs.
await crawler.addRequests([
    'https://pdfobject.com/pdf/sample.pdf',
    'https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4',
    'https://upload.wikimedia.org/wikipedia/commons/c/c8/Example.ogg',
]);

// Run the downloader and wait for it to finish.
await crawler.run();

console.log('Crawler finished.');
```
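If you would rather keep the downloads as ordinary files on disk instead of in the key-value store, you can write the response body with `node:fs/promises`. This is a small sketch, not part of the original example; the `downloads` directory name and the file naming scheme are assumptions.

```
import { mkdir, writeFile } from 'node:fs/promises';
import { FileDownload } from 'crawlee';

const crawler = new FileDownload({
    async requestHandler({ body, request, log }) {
        const url = new URL(request.url);
        // Derive a flat file name from the URL path (hypothetical naming scheme).
        const fileName = url.pathname.replace(/\//g, '_');

        await mkdir('downloads', { recursive: true });
        await writeFile(`downloads/${fileName}`, body);

        log.info(`Saved ${request.url} to downloads/${fileName}`);
    },
});

await crawler.addRequests(['https://pdfobject.com/pdf/sample.pdf']);
await crawler.run();
```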
---

# Download a file with Node.js streams

For larger files, it is more efficient to use Node.js streams to download and transfer the files. This example demonstrates how to download files using streams.

The script uses the [`FileDownload`](https://crawlee.dev/js/api/http-crawler/class/FileDownload.md) crawler class to download files with streams, log the progress, and store the data in the key-value store. In local configuration, the data will be stored as files in `./storage/key_value_stores/default`.

```
import { pipeline, Transform } from 'stream';

import { FileDownload, type Log } from 'crawlee';

// A sample Transform stream logging the download progress.
function createProgressTracker({ url, log, totalBytes }: { url: URL; log: Log; totalBytes: number }) {
    let downloadedBytes = 0;

    return new Transform({
        transform(chunk, _, callback) {
            if (downloadedBytes % 1e6 > (downloadedBytes + chunk.length) % 1e6) {
                log.info(
                    `Downloaded ${downloadedBytes / 1e6} MB (${Math.floor((downloadedBytes / totalBytes) * 100)}%) for ${url}.`,
                );
            }
            downloadedBytes += chunk.length;

            this.push(chunk);
            callback();
        },
    });
}

// Create a FileDownload - a custom crawler instance that will download files from URLs.
const crawler = new FileDownload({
    async streamHandler({ stream, request, log, getKeyValueStore }) {
        const url = new URL(request.url);

        log.info(`Downloading ${url} to ${url.pathname.replace(/\//g, '_')}...`);

        await new Promise<void>((resolve, reject) => {
            // With the 'response' event, we have received the headers of the response.
            stream.on('response', async (response) => {
                const kvs = await getKeyValueStore();
                await kvs.setValue(
                    url.pathname.replace(/\//g, '_'),
                    pipeline(
                        stream,
                        createProgressTracker({ url, log, totalBytes: Number(response.headers['content-length']) }),
                        (error) => {
                            if (error) reject(error);
                        },
                    ),
                    { contentType: response.headers['content-type'] },
                );

                log.info(`Downloaded ${url} to ${url.pathname.replace(/\//g, '_')}.`);

                resolve();
            });
        });
    },
});

// The initial list of URLs to crawl. Here we use just a few hard-coded URLs.
await crawler.addRequests([
    'https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4',
    'https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_640x360.m4v',
]);

// Run the downloader and wait for it to finish.
await crawler.run();
```

---

# Fill and Submit a Form using Puppeteer

This example demonstrates how to use [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) to automatically fill and submit a search form to look up repositories on [GitHub](https://github.com) using headless Chrome / Puppeteer. The crawler first fills in the search term, repository owner, start date and language of the repository, then submits the form and prints out the results.

Finally, the results are saved either on the Apify platform to the default [`dataset`](https://crawlee.dev/js/api/core/class/Dataset.md) or on the local machine as JSON files in `./storage/datasets/default`.

tip

To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile.
```
import { Dataset, launchPuppeteer } from 'crawlee';

// Launch the web browser.
const browser = await launchPuppeteer();

// Create and navigate new page
console.log('Open target page');
const page = await browser.newPage();
await page.goto('https://github.com/search/advanced');

// Fill form fields and select desired search options
console.log('Fill in search form');
await page.type('#adv_code_search input.js-advanced-search-input', 'apify-js');
await page.type('#search_from', 'apify');
await page.type('#search_date', '>2015');
await page.select('select#search_language', 'JavaScript');

// Submit the form and wait for full load of next page
console.log('Submit search form');
await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click('#adv_code_search button[type="submit"]'),
]);

// Obtain and print list of search results
const results = await page.$$eval('[data-testid="results-list"] div.search-title > a', (nodes) =>
    nodes.map((node) => ({
        url: node.href,
        name: node.innerText,
    })),
);

console.log('Results:', results);

// Store data in default dataset
await Dataset.pushData(results);

// Close browser
await browser.close();
```
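The same form interaction also works inside a [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) request handler, where Crawlee manages the browser and navigates to the start URL for you. This is only a sketch of that idea, reusing the selectors from the example above; it is not part of the original example.

```
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, pushData, log }) {
        log.info('Fill in search form');
        await page.type('#adv_code_search input.js-advanced-search-input', 'apify-js');

        log.info('Submit search form');
        await Promise.all([
            page.waitForNavigation({ waitUntil: 'networkidle2' }),
            page.click('#adv_code_search button[type="submit"]'),
        ]);

        // Collect the result links and store them in the default dataset.
        const results = await page.$$eval('[data-testid="results-list"] div.search-title > a', (nodes) =>
            nodes.map((node) => ({ url: node.href, name: node.innerText })),
        );
        await pushData(results);
    },
});

await crawler.run(['https://github.com/search/advanced']);
```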
---

# HTTP crawler

This example demonstrates how to use [`HttpCrawler`](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md) to build an HTML crawler that crawls a list of URLs (for example, one read from an external file), loads each URL using a plain HTTP request, and saves the HTML.

```
import { HttpCrawler, log, LogLevel } from 'crawlee';

// Crawlers come with various utilities, e.g. for logging.
// Here we use debug level of logging to improve the debugging experience.
// This functionality is optional!
log.setLevel(LogLevel.DEBUG);

// Create an instance of the HttpCrawler class - a crawler
// that automatically loads the URLs and saves their HTML.
const crawler = new HttpCrawler({
    // The crawler downloads and processes the web pages in parallel, with a concurrency
    // automatically managed based on the available system memory and CPU (see AutoscaledPool class).
    // Here we define some hard limits for the concurrency.
    minConcurrency: 10,
    maxConcurrency: 50,

    // On error, retry each page at most once.
    maxRequestRetries: 1,

    // Increase the timeout for processing of each page.
    requestHandlerTimeoutSecs: 30,

    // Limit to 10 requests per one crawl
    maxRequestsPerCrawl: 10,

    // This function will be called for each URL to crawl.
    // It accepts a single parameter, which is an object with options as:
    // https://crawlee.dev/js/api/http-crawler/interface/HttpCrawlerOptions#requestHandler
    // We use for demonstration only 2 of them:
    // - request: an instance of the Request class with information such as the URL that is being crawled and HTTP method
    // - body: the HTML code of the current page
    async requestHandler({ pushData, request, body }) {
        log.debug(`Processing ${request.url}...`);

        // Store the results to the dataset. In local configuration,
        // the data will be stored as JSON files in ./storage/datasets/default
        await pushData({
            url: request.url, // URL of the page
            body, // HTML code of the page
        });
    },

    // This function is called if the page processing failed more than maxRequestRetries + 1 times.
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed twice.`);
    },
});

// Run the crawler and wait for it to finish.
// It will crawl the given list of URLs, load each URL using a plain HTTP request, and save the HTML.
await crawler.run(['https://crawlee.dev']);

log.debug('Crawler finished.');
```
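If you do want to feed the crawler from an external file, one straightforward approach is to read the file with `node:fs/promises`, split it into lines and pass the result to `crawler.run()`. This is only a sketch, not part of the original example; the `urls.txt` file name (one URL per line) is an assumption.

```
import { readFile } from 'node:fs/promises';
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ pushData, request, body }) {
        await pushData({ url: request.url, body });
    },
});

// Read one URL per line from a hypothetical urls.txt file.
const fileContents = await readFile('urls.txt', 'utf-8');
const urls = fileContents
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

await crawler.run(urls);
```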
---

# JSDOM crawler

This example demonstrates how to use [`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) to interact with a website using the [jsdom](https://www.npmjs.com/package/jsdom) DOM implementation. Here the script will open a calculator app from the [React examples](https://reactjs.org/community/examples.html), click `1` `+` `1` `=` and extract the result.
```
import { JSDOMCrawler, log } from 'crawlee';

// Create an instance of the JSDOMCrawler class - crawler that automatically
// loads the URLs and parses their HTML using the jsdom library.
const crawler = new JSDOMCrawler({
    // Setting the `runScripts` option to `true` allows the crawler to execute client-side
    // JavaScript code on the page. This is required for some websites (such as the React application in this example), but may pose a security risk.
    runScripts: true,
    // This function will be called for each crawled URL.
    // Here we extract the window object from the options and use it to extract data from the page.
    requestHandler: async ({ window }) => {
        const { document } = window;
        // The `document` object is analogous to the `window.document` object you know from your favourite web browsers.
        // Thanks to this, you can use the regular browser-side APIs here.
        document.querySelectorAll('button')[12].click(); // 1
        document.querySelectorAll('button')[15].click(); // +
        document.querySelectorAll('button')[12].click(); // 1
        document.querySelectorAll('button')[18].click(); // =

        const result = document.querySelectorAll('.component-display')[0].childNodes[0] as Element;
        // The result is passed to the console. Unlike with Playwright or Puppeteer crawlers,
        // this console call goes to the Node.js console, not the browser console. All the code here runs right in Node.js!
        log.info(result.innerHTML); // 2
    },
});

// Run the crawler and wait for it to finish.
await crawler.run(['https://ahfarmer.github.io/calculator/']);

log.debug('Crawler finished.');
```

In the following example, we use [`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the [jsdom](https://www.npmjs.com/package/jsdom) DOM implementation and extract some data from it: the page title and all `h1` tags.
```
import { JSDOMCrawler, log, LogLevel } from 'crawlee';

// Crawlers come with various utilities, e.g. for logging.
// Here we use debug level of logging to improve the debugging experience.
// This functionality is optional!
log.setLevel(LogLevel.DEBUG);

// Create an instance of the JSDOMCrawler class - a crawler
// that automatically loads the URLs and parses their HTML using the jsdom library.
const crawler = new JSDOMCrawler({
    // The crawler downloads and processes the web pages in parallel, with a concurrency
    // automatically managed based on the available system memory and CPU (see AutoscaledPool class).
    // Here we define some hard limits for the concurrency.
    minConcurrency: 10,
    maxConcurrency: 50,

    // On error, retry each page at most once.
    maxRequestRetries: 1,

    // Increase the timeout for processing of each page.
    requestHandlerTimeoutSecs: 30,

    // Limit to 10 requests per one crawl
    maxRequestsPerCrawl: 10,

    // This function will be called for each URL to crawl.
    // It accepts a single parameter, which is an object with options as:
    // https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions#requestHandler
    // We use for demonstration only 2 of them:
    // - request: an instance of the Request class with information such as the URL that is being crawled and HTTP method
    // - window: the JSDOM window object
    async requestHandler({ pushData, request, window }) {
        log.debug(`Processing ${request.url}...`);

        // Extract data from the page
        const title = window.document.title;
        const h1texts: { text: string }[] = [];
        window.document.querySelectorAll('h1').forEach((element) => {
            h1texts.push({
                text: element.textContent!,
            });
        });

        // Store the results to the dataset. In local configuration,
        // the data will be stored as JSON files in ./storage/datasets/default
        await pushData({
            url: request.url,
            title,
            h1texts,
        });
    },

    // This function is called if the page processing failed more than maxRequestRetries + 1 times.
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed twice.`);
    },
});

// Run the crawler and wait for it to finish.
await crawler.run(['https://crawlee.dev']);

log.debug('Crawler finished.');
```
---

# Dataset Map and Reduce methods

This example shows an easy use-case of the [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md) [`map`](https://crawlee.dev/js/api/core/class/Dataset.md#map) and [`reduce`](https://crawlee.dev/js/api/core/class/Dataset.md#reduce) methods. Both methods can be used to simplify the dataset results workflow, and both can be called on the [dataset](https://crawlee.dev/js/api/core/class/Dataset.md) directly.

It is important to mention that both methods return a new result (`map` returns a new array and `reduce` can return any type) - neither method updates the dataset in any way.

Examples for both methods are demonstrated on a simple dataset containing the results scraped from a page: the `URL` and a hypothetical number of `h1` - `h3` header elements under the `headingCount` key.

This data structure is stored in the default dataset under `{PROJECT_FOLDER}/storage/datasets/default/`. If you want to simulate the functionality, you can use the [`dataset.pushData()`](https://crawlee.dev/js/api/core/class/Dataset.md#pushData) method to save the example `JSON array` to your dataset.

```
[
    {
        "url": "https://crawlee.dev/",
        "headingCount": 11
    },
    {
        "url": "https://crawlee.dev/storage",
        "headingCount": 8
    },
    {
        "url": "https://crawlee.dev/proxy",
        "headingCount": 4
    }
]
```

### Map

The dataset `map` method is very similar to the standard Array mapping methods. It produces a new array of values by mapping each value in the existing array through a transformation function and an options parameter.

Here the `map` method is used to check whether there are more than 5 header elements on each page:

```
import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open<{
    url: string;
    headingCount: number;
}>();

// Seeding the dataset with some data
await dataset.pushData([
    {
        url: 'https://crawlee.dev/',
        headingCount: 11,
    },
    {
        url: 'https://crawlee.dev/storage',
        headingCount: 8,
    },
    {
        url: 'https://crawlee.dev/proxy',
        headingCount: 4,
    },
]);
// Calling map function and filtering through mapped items...
const moreThan5headers = (await dataset.map((item) => item.headingCount)).filter((count) => count > 5);

// Saving the result of map to default key-value store...
await KeyValueStore.setValue('pages_with_more_than_5_headers', moreThan5headers);
```

The `moreThan5headers` variable is an array of `headingCount` values where the number of headers is greater than 5.

The `map` method's result value saved to the [`key-value store`](https://crawlee.dev/js/api/core/class/KeyValueStore.md) should be:

```
[11, 8]
```

### Reduce

The dataset `reduce` method does not produce a new array of values - it reduces a list of values down to a single value. The method iterates through the items in the dataset using the [`memo` argument](https://crawlee.dev/js/api/core/class/Dataset.md#reduce). After performing the necessary calculation, the `memo` is sent to the next iteration, while the item just processed is reduced (removed).

Using the `reduce` method to get the total number of headers scraped across all items in the dataset:

```
import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open<{
    url: string;
    headingCount: number;
}>();

// Seeding the dataset with some data
await dataset.pushData([
    {
        url: 'https://crawlee.dev/',
        headingCount: 11,
    },
    {
        url: 'https://crawlee.dev/storage',
        headingCount: 8,
    },
    {
        url: 'https://crawlee.dev/proxy',
        headingCount: 4,
    },
]);

// Calling reduce function and using memo to calculate number of headers
const pagesHeadingCount = await dataset.reduce((memo, value) => {
    return memo + value.headingCount;
}, 0);

// Saving the result of reduce to the default key-value store
await KeyValueStore.setValue('pages_heading_count', pagesHeadingCount);
```

The original dataset will be reduced to a single value, `pagesHeadingCount`, which contains the count of all headers for all scraped pages (all dataset items).
The `reduce` method's result value saved to the [`key-value store`](https://crawlee.dev/js/api/core/class/KeyValueStore.md) should be: ``` 23 ``` --- # Playwright crawler Copy for LLM This example demonstrates how to use [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) in combination with [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) to recursively scrape the [Hacker News website](https://news.ycombinator.com) using headless Chrome / Playwright. The crawler starts with a single URL, finds links to next pages, enqueues them and continues until no more desired links are available. The results are stored to the default dataset. In local configuration, the results are stored as JSON files in `./storage/datasets/default`. tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ3JlYXRlIGFuIGluc3RhbmNlIG9mIHRoZSBQbGF5d3JpZ2h0Q3Jhd2xlciBjbGFzcyAtIGEgY3Jhd2xlclxcbi8vIHRoYXQgYXV0b21hdGljYWxseSBsb2FkcyB0aGUgVVJMcyBpbiBoZWFkbGVzcyBDaHJvbWUgLyBQbGF5d3JpZ2h0LlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICBsYXVuY2hDb250ZXh0OiB7XFxuICAgICAgICAvLyBIZXJlIHlvdSBjYW4gc2V0IG9wdGlvbnMgdGhhdCBhcmUgcGFzc2VkIHRvIHRoZSBwbGF5d3JpZ2h0IC5sYXVuY2goKSBmdW5jdGlvbi5cXG4gICAgICAgIGxhdW5jaE9wdGlvbnM6IHtcXG4gICAgICAgICAgICBoZWFkbGVzczogdHJ1ZSxcXG4gICAgICAgIH0sXFxuICAgIH0sXFxuXFxuICAgIC8vIFN0b3AgY3Jhd2xpbmcgYWZ0ZXIgc2V2ZXJhbCBwYWdlc1xcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG5cXG4gICAgLy8gVGhpcyBmdW5jdGlvbiB3aWxsIGJlIGNhbGxlZCBmb3IgZWFjaCBVUkwgdG8gY3Jhd2wuXFxuICAgIC8vIEhlcmUgeW91IGNhbiB3cml0ZSB0aGUgUGxheXdyaWdodCBzY3JpcHRzIHlvdSBhcmUgZmFtaWxpYXIgd2l0aCxcXG4gICAgLy8gd2l0aCB0aGUgZXhjZXB0aW9uIHRoYXQgYnJvd3NlcnMgYW5kIHBhZ2VzIGFyZSBhdXRvbWF0aWNhbGx5IG1hbmFnZWQgYnkgQ3Jhd2xlZS5cXG4gICAgLy8gVGhlIGZ1bmN0aW9uIGFjY2VwdHMgYSBzaW5nbGUgcGFyYW1ldGVyLCB3aGljaCBpcyBhbiBvYmplY3Qgd2l0aCBhIGxvdCBvZiBwcm9wZXJ0aWVzLFxcbiAgICAvLyB0aGUgbW9zdCBpbXBvcnRhbnQgYmVpbmc6XFxuICAgIC8vIC0gcmVxdWVzdDogYW4gaW5zdGFuY2Ugb2YgdGhlIFJlcXVlc3QgY2xhc3Mgd2l0aCBpbmZvcm1hdGlvbiBzdWNoIGFzIFVSTCBhbmQgSFRUUCBtZXRob2RcXG4gICAgLy8gLSBwYWdlOiBQbGF5d3JpZ2h0J3MgUGFnZSBvYmplY3QgKHNlZSBodHRwczovL3BsYXl3cmlnaHQuZGV2L2RvY3MvYXBpL2NsYXNzLXBhZ2UpXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcHVzaERhdGEsIHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGxvZy5pbmZvKGBQcm9jZXNzaW5nICR7cmVxdWVzdC51cmx9Li4uYCk7XFxuXFxuICAgICAgICAvLyBBIGZ1bmN0aW9uIHRvIGJlIGV2YWx1YXRlZCBieSBQbGF5d3JpZ2h0IHdpdGhpbiB0aGUgYnJvd3NlciBjb250ZXh0LlxcbiAgICAgICAgY29uc3QgZGF0YSA9IGF3YWl0IHBhZ2UuJCRldmFsKCcuYXRoaW5nJywgKCRwb3N0cykgPT4ge1xcbiAgICAgICAgICAgIGNvbnN0IHNjcmFwZWREYXRhOiB7IHRpdGxlOiBzdHJpbmc7IHJhbms6IHN0cmluZzsgaHJlZjogc3RyaW5nIH1bXSA9IFtdO1xcblxcbiAgICAgICAgICAgIC8vIFdlJ3JlIGdldHRpbmcgdGhlIHRpdGxlLCByYW5rIGFuZCBVUkwgb2YgZWFjaCBwb3N0IG9uIEhhY2tlciBOZXdzLlxcbiAgICAgICAgICAgICRwb3N0cy5mb3JFYWNoKCgkcG9zdCkgPT4ge1xcbiAgICAgICAgICAgICAgICBzY3JhcGVkRGF0YS5wdXNoKHtcXG4gICAgICAgICAgICAgICAgICAgIHRpdGxlOiAkcG9zdC5xdWVyeVNlbGVjdG9yKCcudGl0bGUgYScpLmlubmVyVGV4dCxcXG4gICAgICAgICAgICAgICAgICAgIHJhbms6ICRwb3N0LnF1ZXJ5U2VsZWN0b3IoJy5yYW5rJykuaW5uZXJUZXh0LFxcbiAgICAgICAgICAgICAgICAgICAgaHJlZjogJHBvc3QucXVlcnlTZWxlY3RvcignLnRpdGxlIGEnKS5ocmVmLFxcbiAgICAgICAgICAgICAgICB9KTtcXG4gICAgICAgICAgICB9KTtcXG5cXG4gICAgICAgICAgICByZXR1cm4gc2NyYXBlZERhdGE7XFxuICAg
ICAgICB9KTtcXG5cXG4gICAgICAgIC8vIFN0b3JlIHRoZSByZXN1bHRzIHRvIHRoZSBkZWZhdWx0IGRhdGFzZXQuXFxuICAgICAgICBhd2FpdCBwdXNoRGF0YShkYXRhKTtcXG5cXG4gICAgICAgIC8vIEZpbmQgYSBsaW5rIHRvIHRoZSBuZXh0IHBhZ2UgYW5kIGVucXVldWUgaXQgaWYgaXQgZXhpc3RzLlxcbiAgICAgICAgY29uc3QgaW5mb3MgPSBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgIHNlbGVjdG9yOiAnLm1vcmVsaW5rJyxcXG4gICAgICAgIH0pO1xcblxcbiAgICAgICAgaWYgKGluZm9zLnByb2Nlc3NlZFJlcXVlc3RzLmxlbmd0aCA9PT0gMCkgbG9nLmluZm8oYCR7cmVxdWVzdC51cmx9IGlzIHRoZSBsYXN0IHBhZ2UhYCk7XFxuICAgIH0sXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gaXMgY2FsbGVkIGlmIHRoZSBwYWdlIHByb2Nlc3NpbmcgZmFpbGVkIG1vcmUgdGhhbiBtYXhSZXF1ZXN0UmV0cmllcysxIHRpbWVzLlxcbiAgICBmYWlsZWRSZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGxvZyB9KSB7XFxuICAgICAgICBsb2cuaW5mbyhgUmVxdWVzdCAke3JlcXVlc3QudXJsfSBmYWlsZWQgdG9vIG1hbnkgdGltZXMuYCk7XFxuICAgIH0sXFxufSk7XFxuXFxuYXdhaXQgY3Jhd2xlci5hZGRSZXF1ZXN0cyhbJ2h0dHBzOi8vbmV3cy55Y29tYmluYXRvci5jb20vJ10pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciBhbmQgd2FpdCBmb3IgaXQgdG8gZmluaXNoLlxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXFxuY29uc29sZS5sb2coJ0NyYXdsZXIgZmluaXNoZWQuJyk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.TlfLVk0_w85cLtPnSSQQTafQ-FuVCpbtoSLrLFjMnS4\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; // Create an instance of the PlaywrightCrawler class - a crawler // that automatically loads the URLs in headless Chrome / Playwright. const crawler = new PlaywrightCrawler({ launchContext: { // Here you can set options that are passed to the playwright .launch() function. launchOptions: { headless: true, }, }, // Stop crawling after several pages maxRequestsPerCrawl: 50, // This function will be called for each URL to crawl. // Here you can write the Playwright scripts you are familiar with, // with the exception that browsers and pages are automatically managed by Crawlee. // The function accepts a single parameter, which is an object with a lot of properties, // the most important being: // - request: an instance of the Request class with information such as URL and HTTP method // - page: Playwright's Page object (see https://playwright.dev/docs/api/class-page) async requestHandler({ pushData, request, page, enqueueLinks, log }) { log.info(`Processing ${request.url}...`); // A function to be evaluated by Playwright within the browser context. const data = await page.$$eval('.athing', ($posts) => { const scrapedData: { title: string; rank: string; href: string }[] = []; // We're getting the title, rank and URL of each post on Hacker News. $posts.forEach(($post) => { scrapedData.push({ title: $post.querySelector('.title a').innerText, rank: $post.querySelector('.rank').innerText, href: $post.querySelector('.title a').href, }); }); return scrapedData; }); // Store the results to the default dataset. await pushData(data); // Find a link to the next page and enqueue it if it exists. const infos = await enqueueLinks({ selector: '.morelink', }); if (infos.processedRequests.length === 0) log.info(`${request.url} is the last page!`); }, // This function is called if the page processing failed more than maxRequestRetries+1 times. failedRequestHandler({ request, log }) { log.info(`Request ${request.url} failed too many times.`); }, }); await crawler.addRequests(['https://news.ycombinator.com/']); // Run the crawler and wait for it to finish. 
await crawler.run(); console.log('Crawler finished.'); ``` --- # Using Firefox browser with Playwright crawler Copy for LLM This example demonstrates how to use [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) with headless Firefox browser. tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-firefox` image for your Dockerfile. [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuaW1wb3J0IHsgZmlyZWZveCB9IGZyb20gJ3BsYXl3cmlnaHQnO1xcblxcbi8vIENyZWF0ZSBhbiBpbnN0YW5jZSBvZiB0aGUgUGxheXdyaWdodENyYXdsZXIgY2xhc3MuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIGxhdW5jaENvbnRleHQ6IHtcXG4gICAgICAgIC8vIFNldCB0aGUgRmlyZWZveCBicm93c2VyIHRvIGJlIHVzZWQgYnkgdGhlIGNyYXdsZXIuXFxuICAgICAgICAvLyBJZiBsYXVuY2hlciBvcHRpb24gaXMgbm90IHNwZWNpZmllZCBoZXJlLFxcbiAgICAgICAgLy8gZGVmYXVsdCBDaHJvbWl1bSBicm93c2VyIHdpbGwgYmUgdXNlZC5cXG4gICAgICAgIGxhdW5jaGVyOiBmaXJlZm94LFxcbiAgICB9LFxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGxvZyB9KSB7XFxuICAgICAgICBjb25zdCBwYWdlVGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuXFxuICAgICAgICBsb2cuaW5mbyhgVVJMOiAke3JlcXVlc3QubG9hZGVkVXJsfSB8IFBhZ2UgdGl0bGU6ICR7cGFnZVRpdGxlfWApO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoWydodHRwczovL2V4YW1wbGUuY29tJ10pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciBhbmQgd2FpdCBmb3IgaXQgdG8gZmluaXNoLlxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.vWp2zchK13fPXAQRaau5xNiHOhbCxKML5odaC7BwEDU\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; import { firefox } from 'playwright'; // Create an instance of the PlaywrightCrawler class. const crawler = new PlaywrightCrawler({ launchContext: { // Set the Firefox browser to be used by the crawler. // If launcher option is not specified here, // default Chromium browser will be used. launcher: firefox, }, async requestHandler({ request, page, log }) { const pageTitle = await page.title(); log.info(`URL: ${request.loadedUrl} | Page title: ${pageTitle}`); }, }); await crawler.addRequests(['https://example.com']); // Run the crawler and wait for it to finish. await crawler.run(); ``` To see a real-world example of how to use [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) in combination with [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) to recursively scrape the [Hacker News website](https://news.ycombinator.com) check out the [`Playwright crawler example`](https://crawlee.dev/js/docs/examples/playwright-crawler.md). --- # Puppeteer crawler Copy for LLM This example demonstrates how to use [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) in combination with [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) to recursively scrape the [Hacker News website](https://news.ycombinator.com) using headless Chrome / Puppeteer. The crawler starts with a single URL, finds links to next pages, enqueues them and continues until no more desired links are available. The results are stored to the default dataset. 
In local configuration, the results are stored as JSON files in `./storage/datasets/default` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG4vLyBDcmVhdGUgYW4gaW5zdGFuY2Ugb2YgdGhlIFB1cHBldGVlckNyYXdsZXIgY2xhc3MgLSBhIGNyYXdsZXJcXG4vLyB0aGF0IGF1dG9tYXRpY2FsbHkgbG9hZHMgdGhlIFVSTHMgaW4gaGVhZGxlc3MgQ2hyb21lIC8gUHVwcGV0ZWVyLlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUHVwcGV0ZWVyQ3Jhd2xlcih7XFxuICAgIC8vIEhlcmUgeW91IGNhbiBzZXQgb3B0aW9ucyB0aGF0IGFyZSBwYXNzZWQgdG8gdGhlIGxhdW5jaFB1cHBldGVlcigpIGZ1bmN0aW9uLlxcbiAgICBsYXVuY2hDb250ZXh0OiB7XFxuICAgICAgICBsYXVuY2hPcHRpb25zOiB7XFxuICAgICAgICAgICAgaGVhZGxlc3M6IHRydWUsXFxuICAgICAgICAgICAgLy8gT3RoZXIgUHVwcGV0ZWVyIG9wdGlvbnNcXG4gICAgICAgIH0sXFxuICAgIH0sXFxuXFxuICAgIC8vIFN0b3AgY3Jhd2xpbmcgYWZ0ZXIgc2V2ZXJhbCBwYWdlc1xcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG5cXG4gICAgLy8gVGhpcyBmdW5jdGlvbiB3aWxsIGJlIGNhbGxlZCBmb3IgZWFjaCBVUkwgdG8gY3Jhd2wuXFxuICAgIC8vIEhlcmUgeW91IGNhbiB3cml0ZSB0aGUgUHVwcGV0ZWVyIHNjcmlwdHMgeW91IGFyZSBmYW1pbGlhciB3aXRoLFxcbiAgICAvLyB3aXRoIHRoZSBleGNlcHRpb24gdGhhdCBicm93c2VycyBhbmQgcGFnZXMgYXJlIGF1dG9tYXRpY2FsbHkgbWFuYWdlZCBieSBDcmF3bGVlLlxcbiAgICAvLyBUaGUgZnVuY3Rpb24gYWNjZXB0cyBhIHNpbmdsZSBwYXJhbWV0ZXIsIHdoaWNoIGlzIGFuIG9iamVjdCB3aXRoIHRoZSBmb2xsb3dpbmcgZmllbGRzOlxcbiAgICAvLyAtIHJlcXVlc3Q6IGFuIGluc3RhbmNlIG9mIHRoZSBSZXF1ZXN0IGNsYXNzIHdpdGggaW5mb3JtYXRpb24gc3VjaCBhcyBVUkwgYW5kIEhUVFAgbWV0aG9kXFxuICAgIC8vIC0gcGFnZTogUHVwcGV0ZWVyJ3MgUGFnZSBvYmplY3QgKHNlZSBodHRwczovL3BwdHIuZGV2LyNzaG93PWFwaS1jbGFzcy1wYWdlKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHB1c2hEYXRhLCByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MsIGxvZyB9KSB7XFxuICAgICAgICBsb2cuaW5mbyhgUHJvY2Vzc2luZyAke3JlcXVlc3QudXJsfS4uLmApO1xcblxcbiAgICAgICAgLy8gQSBmdW5jdGlvbiB0byBiZSBldmFsdWF0ZWQgYnkgUHVwcGV0ZWVyIHdpdGhpbiB0aGUgYnJvd3NlciBjb250ZXh0LlxcbiAgICAgICAgY29uc3QgZGF0YSA9IGF3YWl0IHBhZ2UuJCRldmFsKCcuYXRoaW5nJywgKCRwb3N0cykgPT4ge1xcbiAgICAgICAgICAgIGNvbnN0IHNjcmFwZWREYXRhOiB7IHRpdGxlOiBzdHJpbmc7IHJhbms6IHN0cmluZzsgaHJlZjogc3RyaW5nIH1bXSA9IFtdO1xcblxcbiAgICAgICAgICAgIC8vIFdlJ3JlIGdldHRpbmcgdGhlIHRpdGxlLCByYW5rIGFuZCBVUkwgb2YgZWFjaCBwb3N0IG9uIEhhY2tlciBOZXdzLlxcbiAgICAgICAgICAgICRwb3N0cy5mb3JFYWNoKCgkcG9zdCkgPT4ge1xcbiAgICAgICAgICAgICAgICBzY3JhcGVkRGF0YS5wdXNoKHtcXG4gICAgICAgICAgICAgICAgICAgIHRpdGxlOiAkcG9zdC5xdWVyeVNlbGVjdG9yKCcudGl0bGUgYScpLmlubmVyVGV4dCxcXG4gICAgICAgICAgICAgICAgICAgIHJhbms6ICRwb3N0LnF1ZXJ5U2VsZWN0b3IoJy5yYW5rJykuaW5uZXJUZXh0LFxcbiAgICAgICAgICAgICAgICAgICAgaHJlZjogJHBvc3QucXVlcnlTZWxlY3RvcignLnRpdGxlIGEnKS5ocmVmLFxcbiAgICAgICAgICAgICAgICB9KTtcXG4gICAgICAgICAgICB9KTtcXG5cXG4gICAgICAgICAgICByZXR1cm4gc2NyYXBlZERhdGE7XFxuICAgICAgICB9KTtcXG5cXG4gICAgICAgIC8vIFN0b3JlIHRoZSByZXN1bHRzIHRvIHRoZSBkZWZhdWx0IGRhdGFzZXQuXFxuICAgICAgICBhd2FpdCBwdXNoRGF0YShkYXRhKTtcXG5cXG4gICAgICAgIC8vIEZpbmQgYSBsaW5rIHRvIHRoZSBuZXh0IHBhZ2UgYW5kIGVucXVldWUgaXQgaWYgaXQgZXhpc3RzLlxcbiAgICAgICAgY29uc3QgaW5mb3MgPSBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgIHNlbGVjdG9yOiAnLm1vcmVsaW5rJyxcXG4gICAgICAgIH0pO1xcblxcbiAgICAgICAgaWYgKGluZm9zLnByb2Nlc3NlZFJlcXVlc3RzLmxlbmd0aCA9PT0gMCkgbG9nLmluZm8oYCR7cmVxdWVzdC51cmx9IGlzIHRoZSBsYXN0IHBhZ2UhYCk7XFxuICAgIH0sXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gaXMgY2FsbGVkIGlmIHRoZSBwYWdlIHByb2Nlc3NpbmcgZmFpbGVkIG1vcmUgdGhhbiBtYXhSZXF1ZXN0UmV0cmllcysxIHRpbWVzLlxcbiAgICBmYWlsZWRSZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGxvZyB9KSB7XFxuICAgICAgICBsb2cu
ZXJyb3IoYFJlcXVlc3QgJHtyZXF1ZXN0LnVybH0gZmFpbGVkIHRvbyBtYW55IHRpbWVzLmApO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoWydodHRwczovL25ld3MueWNvbWJpbmF0b3IuY29tLyddKTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXIgYW5kIHdhaXQgZm9yIGl0IHRvIGZpbmlzaC5cXG5hd2FpdCBjcmF3bGVyLnJ1bigpO1xcblxcbmNvbnNvbGUubG9nKCdDcmF3bGVyIGZpbmlzaGVkLicpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.fcPpn-wBUel6Dcmcwc2b40Zm-wNiTcsgDbJ5_nWKYts\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; // Create an instance of the PuppeteerCrawler class - a crawler // that automatically loads the URLs in headless Chrome / Puppeteer. const crawler = new PuppeteerCrawler({ // Here you can set options that are passed to the launchPuppeteer() function. launchContext: { launchOptions: { headless: true, // Other Puppeteer options }, }, // Stop crawling after several pages maxRequestsPerCrawl: 50, // This function will be called for each URL to crawl. // Here you can write the Puppeteer scripts you are familiar with, // with the exception that browsers and pages are automatically managed by Crawlee. // The function accepts a single parameter, which is an object with the following fields: // - request: an instance of the Request class with information such as URL and HTTP method // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page) async requestHandler({ pushData, request, page, enqueueLinks, log }) { log.info(`Processing ${request.url}...`); // A function to be evaluated by Puppeteer within the browser context. const data = await page.$$eval('.athing', ($posts) => { const scrapedData: { title: string; rank: string; href: string }[] = []; // We're getting the title, rank and URL of each post on Hacker News. $posts.forEach(($post) => { scrapedData.push({ title: $post.querySelector('.title a').innerText, rank: $post.querySelector('.rank').innerText, href: $post.querySelector('.title a').href, }); }); return scrapedData; }); // Store the results to the default dataset. await pushData(data); // Find a link to the next page and enqueue it if it exists. const infos = await enqueueLinks({ selector: '.morelink', }); if (infos.processedRequests.length === 0) log.info(`${request.url} is the last page!`); }, // This function is called if the page processing failed more than maxRequestRetries+1 times. failedRequestHandler({ request, log }) { log.error(`Request ${request.url} failed too many times.`); }, }); await crawler.addRequests(['https://news.ycombinator.com/']); // Run the crawler and wait for it to finish. await crawler.run(); console.log('Crawler finished.'); ``` --- # Puppeteer recursive crawl Copy for LLM Run the following example to perform a recursive crawl of a website using [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md). tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHRpdGxlID0gYXdhaXQgcGFnZS50aXRsZSgpO1xcbiAgICAgICAgbG9nLmluZm8oYFRpdGxlIG9mICR7cmVxdWVzdC51cmx9OiAke3RpdGxlfWApO1xcblxcbiAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKHtcXG4gICAgICAgICAgICBnbG9iczogWydodHRwPyhzKTovL3d3dy5pYW5hLm9yZy8qKiddLFxcbiAgICAgICAgfSk7XFxuICAgIH0sXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoWydodHRwczovL3d3dy5pYW5hLm9yZy8nXSk7XFxuXFxuYXdhaXQgY3Jhd2xlci5ydW4oKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.WX9qygDqmffD0uvnNe4zaDatAVvSiCm1XcrSGPwvh6g\&asrc=run_on_apify) ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.url}: ${title}`); await enqueueLinks({ globs: ['http?(s)://www.iana.org/**'], }); }, maxRequestsPerCrawl: 10, }); await crawler.addRequests(['https://www.iana.org/']); await crawler.run(); ``` --- # Skipping navigations for certain requests Copy for LLM While crawling a website, you may encounter certain resources you'd like to save, but don't need the full power of a crawler to do so (like images delivered through a CDN). By combining the [`Request#skipNavigation`](https://crawlee.dev/js/api/core/class/Request.md#skipNavigation) option with [`sendRequest`](https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlingContext.md#sendRequest), we can fetch the image from the CDN, and save it to our key-value store without needing to use the full crawler. info For this example, we are using the [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) to showcase this, but this is available on all the crawlers we provide. 
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyLCBLZXlWYWx1ZVN0b3JlIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ3JlYXRlIGEga2V5IHZhbHVlIHN0b3JlIGZvciBhbGwgaW1hZ2VzIHdlIGZpbmRcXG5jb25zdCBpbWFnZVN0b3JlID0gYXdhaXQgS2V5VmFsdWVTdG9yZS5vcGVuKCdpbWFnZXMnKTtcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFBsYXl3cmlnaHRDcmF3bGVyKHtcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBzZW5kUmVxdWVzdCB9KSB7XFxuICAgICAgICAvLyBUaGUgcmVxdWVzdCBzaG91bGQgaGF2ZSB0aGUgbmF2aWdhdGlvbiBza2lwcGVkXFxuICAgICAgICBpZiAocmVxdWVzdC5za2lwTmF2aWdhdGlvbikge1xcbiAgICAgICAgICAgIC8vIFJlcXVlc3QgdGhlIGltYWdlIGFuZCBnZXQgaXRzIGJ1ZmZlciBiYWNrXFxuICAgICAgICAgICAgY29uc3QgaW1hZ2VSZXNwb25zZSA9IGF3YWl0IHNlbmRSZXF1ZXN0KHsgcmVzcG9uc2VUeXBlOiAnYnVmZmVyJyB9KTtcXG5cXG4gICAgICAgICAgICAvLyBTYXZlIHRoZSBpbWFnZSBpbiB0aGUga2V5LXZhbHVlIHN0b3JlXFxuICAgICAgICAgICAgYXdhaXQgaW1hZ2VTdG9yZS5zZXRWYWx1ZShgJHtyZXF1ZXN0LnVzZXJEYXRhLmtleX0ucG5nYCwgaW1hZ2VSZXNwb25zZS5ib2R5KTtcXG5cXG4gICAgICAgICAgICAvLyBQcmV2ZW50IGV4ZWN1dGluZyB0aGUgcmVzdCBvZiB0aGUgY29kZSBhcyB3ZSBkbyBub3QgbmVlZCBpdFxcbiAgICAgICAgICAgIHJldHVybjtcXG4gICAgICAgIH1cXG5cXG4gICAgICAgIC8vIEdldCBhbGwgdGhlIGltYWdlIHNvdXJjZXMgaW4gdGhlIGN1cnJlbnQgcGFnZVxcbiAgICAgICAgY29uc3QgaW1hZ2VzID0gYXdhaXQgcGFnZS4kJGV2YWwoJ2ltZycsIChpbWdzKSA9PiBpbWdzLm1hcCgoaW1nKSA9PiBpbWcuc3JjKSk7XFxuXFxuICAgICAgICAvLyBBZGQgYWxsIHRoZSB1cmxzIGFzIHJlcXVlc3RzIGZvciB0aGUgY3Jhd2xlciwgZ2l2aW5nIGVhY2ggaW1hZ2UgYSBrZXlcXG4gICAgICAgIGF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoaW1hZ2VzLm1hcCgodXJsLCBpKSA9PiAoeyB1cmwsIHNraXBOYXZpZ2F0aW9uOiB0cnVlLCB1c2VyRGF0YTogeyBrZXk6IGkgfSB9KSkpO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIuYWRkUmVxdWVzdHMoWydodHRwczovL2NyYXdsZWUuZGV2J10pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.cNsd2-DLQUjMSHwY8npJ3Im5Ffh-jGfcpADCVsdj91U\&asrc=run_on_apify) ``` import { PlaywrightCrawler, KeyValueStore } from 'crawlee'; // Create a key value store for all images we find const imageStore = await KeyValueStore.open('images'); const crawler = new PlaywrightCrawler({ async requestHandler({ request, page, sendRequest }) { // The request should have the navigation skipped if (request.skipNavigation) { // Request the image and get its buffer back const imageResponse = await sendRequest({ responseType: 'buffer' }); // Save the image in the key-value store await imageStore.setValue(`${request.userData.key}.png`, imageResponse.body); // Prevent executing the rest of the code as we do not need it return; } // Get all the image sources in the current page const images = await page.$$eval('img', (imgs) => imgs.map((img) => img.src)); // Add all the urls as requests for the crawler, giving each image a key await crawler.addRequests(images.map((url, i) => ({ url, skipNavigation: true, userData: { key: i } }))); }, }); await crawler.addRequests(['https://crawlee.dev']); // Run the crawler await crawler.run(); ``` --- ## [📄️ Request Locking](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) [Parallelize crawlers with ease using request locking](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) --- # Request Locking Copy for LLM Release announcement As of **May 2024** (`crawlee` version `3.10.0`), this experiment is now enabled by default! 
With that said, if you encounter issues you can: * set `requestLocking` to `false` in the `experiments` object of your crawler options * update all imports of `RequestQueue` to `RequestQueueV1` * open an issue on our [GitHub repository](https://github.com/apify/crawlee) The content below is kept for documentation purposes. If you're interested in the changes, you can read the [blog post about the new Request Queue storage system on the Apify blog](https://blog.apify.com/new-apify-request-queue/). *** caution This is an experimental feature. While we welcome testers, keep in mind that it is currently not recommended to use this in production. The API is subject to change, and we might introduce breaking changes in the future. Should you be using this, feel free to open issues on our [GitHub repository](https://github.com/apify/crawlee), and we'll take a look. Starting with `crawlee` version `3.5.5`, we have introduced a new crawler option that lets you enable using a new request locking API. With this API, you will be able to pass a `RequestQueue` to multiple crawlers to parallelize the crawling process. Keep in mind The request queue that supports request locking is currently exported via the `RequestQueueV2` class. Once the experiment is over, this class will replace the current `RequestQueue` class ## How to enable the experiment[​](#how-to-enable-the-experiment "Direct link to How to enable the experiment") ### In crawlers[​](#in-crawlers "Direct link to In crawlers") note This example shows how to enable the experiment in the [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), but you can apply this to any crawler type. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ experiments: { requestLocking: true, }, async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); }, }); await crawler.run(['https://crawlee.dev']); ``` ### Outside crawlers (to setup your own request queue that supports locking)[​](#outside-crawlers-to-setup-your-own-request-queue-that-supports-locking "Direct link to Outside crawlers (to setup your own request queue that supports locking)") Previously, you would import `RequestQueue` from `crawlee`. To switch to the queue that supports locking, you need to import `RequestQueueV2` instead. ``` import { RequestQueueV2 } from 'crawlee'; const queue = await RequestQueueV2.open('my-locking-queue'); await queue.addRequests([ { url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/js/docs' }, { url: 'https://crawlee.dev/js/api' }, ]); ``` ### Using the new request queue in crawlers[​](#using-the-new-request-queue-in-crawlers "Direct link to Using the new request queue in crawlers") If you make your own request queue that supports locking, you will also need to enable the experiment in your crawlers. danger If you do not enable the experiment, you will receive a runtime error and the crawler will not start. 
``` import { CheerioCrawler, RequestQueueV2 } from 'crawlee'; const queue = await RequestQueueV2.open('my-locking-queue'); const crawler = new CheerioCrawler({ experiments: { requestLocking: true, }, requestQueue: queue, async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); }, }); await crawler.run(); ``` ## Other changes[​](#other-changes "Direct link to Other changes") info This section is only useful if you're a tinkerer and want to see what's going on under the hood. In order to facilitate the new request locking API, as well as keep both the current request queue logic and the new, locking-based request queue logic, we have implemented a common starting point called `RequestProvider`. This class implements almost all functions by default, but expects you, the developer, to implement the following methods: `fetchNextRequest` and `ensureHeadIsNotEmpty`. These methods are responsible for loading and returning requests to process, and for telling crawlers whether there are more requests to process. You can use this base class to implement your own request providers if you need to fetch requests from a different source. tip We recommend you use TypeScript when implementing your own request provider, as it comes with suggestions for the abstract methods, as well as giving you the exact types you need to return. --- # System Information V2 Copy for LLM caution This is an experimental feature. While we welcome testers, keep in mind that it is currently not recommended to use this in production. The API is subject to change, and we might introduce breaking changes in the future. Should you be using this, feel free to open issues on our [GitHub repository](https://github.com/apify/crawlee), and we'll take a look. Starting with the newest `crawlee` beta, we have introduced a new crawler option that enables an improved metric collection system. This new system should collect CPU and memory metrics more accurately in containerized environments by checking for cgroup-enforced limits. ## How to enable the experiment[​](#how-to-enable-the-experiment "Direct link to How to enable the experiment") note This example shows how to enable the experiment in the [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), but you can apply this to any crawler type. ``` import { CheerioCrawler, Configuration } from 'crawlee'; Configuration.set('systemInfoV2', true); const crawler = new CheerioCrawler({ async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); }, }); await crawler.run(['https://crawlee.dev']); ``` ## Other changes[​](#other-changes "Direct link to Other changes") info This section is only useful if you're a tinkerer and want to see what's going on under the hood. The existing solution checked the bare-metal metrics for how much CPU and memory was being used and how much headroom was available. This is an intuitive solution, but it unfortunately doesn't account for cases where there is an external limit on the amount of resources a process can consume. This is often the case in containerized environments, where each container has a quota for its CPU and memory usage. This experiment attempts to address this issue by introducing a new `isContainerized()` utility function and changing the way resources are collected when a container is detected. 
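As a rough illustration (not Crawlee's actual implementation), a containerization probe of this kind can combine an explicit override with a few filesystem and environment checks - the same checks listed later under the `CRAWLEE_CONTAINERIZED` environment variable in the Configuration guide. The helper name and exact logic below are a hypothetical sketch:

```
import { access, readFile } from 'node:fs/promises';

// Hypothetical sketch of a containerization check; the real isContainerized()
// helper in Crawlee may differ in signature and behavior. The real helper also
// returns false when running on AWS Lambda (isLambda()), which is omitted here.
async function looksContainerized(): Promise<boolean> {
    // An explicit override takes precedence (mirrors the CRAWLEE_CONTAINERIZED env var).
    const override = process.env.CRAWLEE_CONTAINERIZED;
    if (override !== undefined) return ['1', 'true'].includes(override.toLowerCase());

    // Kubernetes sets this variable in every pod.
    if (process.env.KUBERNETES_SERVICE_HOST) return true;

    // Docker creates /.dockerenv inside containers.
    const hasDockerEnv = await access('/.dockerenv').then(() => true, () => false);
    if (hasDockerEnv) return true;

    // Fall back to looking for a docker entry in the process cgroup file.
    const cgroup = await readFile('/proc/self/cgroup', 'utf8').catch(() => '');
    return cgroup.includes('docker');
}

console.log(await looksContainerized());
```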
note This `isContainerized()` function is very similar to the existing `isDocker()` function; however, for now they both work side by side. If this experiment is successful, `isDocker()` may eventually be deprecated in favour of `isContainerized()`. ### Cgroup detection[​](#cgroup-detection "Direct link to Cgroup detection") On Linux, to detect if cgroup is available, we check if there is a directory at `/sys/fs/cgroup`. If the directory exists, a version of cgroup is installed. Next, we determine which version of cgroup is installed by checking for a directory at `/sys/fs/cgroup/memory/`. If it exists, cgroup V1 is installed. If it is missing, it is assumed cgroup V2 is installed. ### CPU metric collection[​](#cpu-metric-collection "Direct link to CPU metric collection") The existing solution worked by checking the fraction of idle CPU ticks to the total number of CPU ticks since the last profile. If 100,000 ticks elapse and 5,000 were idle, the CPU is at 95% utilization. In this experiment, the method of CPU load calculation depends on the result of `isContainerized()` or, if set, the `CRAWLEE_CONTAINERIZED` environment variable. If `isContainerized()` returns true, the new cgroup-aware metric collection will be used instead of the "bare metal" numbers. This works by inspecting the `/sys/fs/cgroup/cpuacct/cpuacct.usage`, `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `/sys/fs/cgroup/cpu/cpu.cfs_period_us` files for cgroup V1 and the `/sys/fs/cgroup/cpu.stat` and `/sys/fs/cgroup/cpu.max` files for cgroup V2. The actual CPU usage figure is calculated in the same manner as the "bare metal" figure, by comparing the total number of ticks elapsed to the number of idle ticks between profiles, but using the figures from the cgroup files. If no cgroup quota is enforced, the "bare metal" numbers will be used. ### Memory metric collection[​](#memory-metric-collection "Direct link to Memory metric collection") The existing solution was already cgroup-aware; however, an improvement has been made to memory metric collection when running on Windows. The existing solution used an external package, `apify/ps-tree`, to find the amount of memory Crawlee and any child processes were using. On Windows, this package used the deprecated "WMIC" command-line utility to determine memory usage. In this experiment, `apify/ps-tree` has been removed and replaced by the `packages/utils/src/internals/ps-tree.ts` file. This works in much the same manner; however, instead of using "WMIC", it uses PowerShell to collect the same data. --- ## [📄️ Request Storage](https://crawlee.dev/js/docs/guides/request-storage.md) [How to store the requests your crawler will go through](https://crawlee.dev/js/docs/guides/request-storage.md) --- # Avoid getting blocked Copy for LLM A scraper might get blocked for numerous reasons. Let's narrow it down to the two main ones. The first is a bad or blocked IP address. You can learn about this topic in the [proxy management guide](https://crawlee.dev/js/docs/guides/proxy-management.md). The second reason is [browser fingerprints](https://pixelprivacy.com/resources/browser-fingerprinting/) (or signatures), which we will explore more in this guide. Check the [Apify Academy anti-scraping course](https://docs.apify.com/academy/anti-scraping) to gain a deeper theoretical understanding of blocking and learn a few tips and tricks. A browser fingerprint is a collection of browser attributes and significant features that can show whether our browser is a bot or a real user. 
Moreover, most browsers have these unique features that allow the website to track the browser even across different IP addresses. This is the main reason why scrapers should change browser fingerprints while doing browser-based scraping. In turn, this should significantly reduce blocking. ## Using browser fingerprints[​](#using-browser-fingerprints "Direct link to Using browser fingerprints") Changing browser fingerprints can be a tedious job. Luckily, Crawlee provides this feature with zero configuration necessary - the usage of fingerprints is enabled by default and available in `PlaywrightCrawler` and `PuppeteerCrawler`. So whenever we build a scraper that is using one of these crawlers, the fingerprints are generated for the default browser and the operating system out of the box. ## Customizing browser fingerprints[​](#customizing-browser-fingerprints "Direct link to Customizing browser fingerprints") In certain cases we want to narrow down the fingerprints used - e.g. specify a certain operating system, locale or browser. This is also possible with Crawlee - the crawler can have the generation algorithm customized to reflect a particular browser version, and more. Let's take a look at the examples below: * PlaywrightCrawler * PuppeteerCrawler ``` import { PlaywrightCrawler } from 'crawlee'; import { BrowserName, DeviceCategory, OperatingSystemsName } from '@crawlee/browser-pool'; const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: true, // this is the default fingerprintOptions: { fingerprintGeneratorOptions: { browsers: [ { name: BrowserName.edge, minVersion: 96, }, ], devices: [DeviceCategory.desktop], operatingSystems: [OperatingSystemsName.windows], }, }, }, // ... }); ``` ``` import { PuppeteerCrawler } from 'crawlee'; import { BrowserName, DeviceCategory } from '@crawlee/browser-pool'; const crawler = new PuppeteerCrawler({ browserPoolOptions: { useFingerprints: true, // this is the default fingerprintOptions: { fingerprintGeneratorOptions: { browsers: [BrowserName.chrome, BrowserName.firefox], devices: [DeviceCategory.mobile], locales: ['en-US'], }, }, }, // ... }); ``` ## Disabling browser fingerprints[​](#disabling-browser-fingerprints "Direct link to Disabling browser fingerprints") On the contrary, sometimes we want to entirely disable the usage of browser fingerprints. This is easy to do with Crawlee too. All we have to do is set the `useFingerprints` option of the `browserPoolOptions` to `false`: * PlaywrightCrawler * PuppeteerCrawler ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: false, }, // ... }); ``` ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ browserPoolOptions: { useFingerprints: false, }, // ... }); ``` ## Camoufox[​](#camoufox "Direct link to Camoufox") For some protections, our integrated solutions are not enough - one example is the Cloudflare challenge. For such pages, you can try [Camoufox](https://camoufox.com/), a custom stealthy build of Firefox for web scraping. It might not get you through the challenge automatically, but with our `handleCloudflareChallenge` helper, it should be able to successfully mimic the required user action and get you through it. 
``` import { PlaywrightCrawler } from 'crawlee'; import { launchOptions } from 'camoufox-js'; import { firefox } from 'playwright'; const crawler = new PlaywrightCrawler({ postNavigationHooks: [ async ({ handleCloudflareChallenge }) => { await handleCloudflareChallenge(); }, ], browserPoolOptions: { // Disable the default fingerprint spoofing to avoid conflicts with Camoufox. useFingerprints: false, }, launchContext: { launcher: firefox, launchOptions: await launchOptions({ headless: true, }), }, // ... }); ``` **Related links** * [Fingerprint Suite Docs](https://github.com/apify/fingerprint-suite) * [Apify Academy anti-scraping course](https://docs.apify.com/academy/anti-scraping) * [Camoufox JS wrapper](https://github.com/apify/camoufox-js) --- # CheerioCrawler guide Copy for LLM ​[`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) is our simplest and fastest crawler. If you're familiar with [jQuery](https://jquery.com/), you'll understand `CheerioCrawler` in minutes. ## What is Cheerio[​](#what-is-cheerio "Direct link to What is Cheerio") [Cheerio](https://cheerio.js.org/) is essentially [jQuery](https://jquery.com/) for Node.js. It offers the same API, including the familiar `$` object. You can use it, as you would use jQuery for manipulating the DOM of an HTML page. In crawling, you'll mostly use it to select the needed elements and extract their values - the data you're interested in. But jQuery runs in a browser and attaches directly to the browser's DOM. Where does `cheerio` get its HTML? This is where the `Crawler` part of [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) comes in. ## How the crawler works[​](#how-the-crawler-works "Direct link to How the crawler works") ​[`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) crawls by making plain HTTP requests to the provided URLs using the specialized [got-scraping](https://github.com/apify/got-scraping) HTTP client. The URLs are fed to the crawler using [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md). The HTTP responses it gets back are usually HTML pages. The same pages you would get in your browser when you first load a URL. But it can handle any content types with the help of the [`additionalMimeTypes`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#additionalMimeTypes) option. info Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). Once the page's HTML is retrieved, the crawler will pass it to [Cheerio](https://github.com/cheeriojs/cheerio) for parsing. The result is the typical `$` function, which should be familiar to jQuery users. You can use the `$` function to do all sorts of lookups and manipulation of the page's HTML, but in scraping, you will mostly use it to find specific HTML elements and extract their data. Example use of Cheerio and its `$` function in comparison to browser JavaScript: ``` // Return the text content of the element. 
document.querySelector('title').textContent; // plain JS $('title').text(); // Cheerio // Return an array of all 'href' links on the page. Array.from(document.querySelectorAll('[href]')).map(el => el.href); // plain JS $('[href]') .map((i, el) => $(el).attr('href')) .get(); // Cheerio ``` note This is not to show that Cheerio is better than plain browser JavaScript. Some might actually prefer the more expressive way plain JS provides. Unfortunately, the browser JavaScript methods are not available in Node.js, so Cheerio is your best bet to do the parsing in Node.js. ## When to use `CheerioCrawler`[​](#when-to-use-cheeriocrawler "Direct link to when-to-use-cheeriocrawler") `CheerioCrawler` really shines when you need to cope with extremely high workloads. With just 4 GBs of memory and a single CPU core, you can scrape 500 or more pages a minute! *(assuming each page contains approximately 400KB of HTML)*. To scrape this fast with a full browser scraper, such as the [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), you'd need significantly more computing power. **Advantages:** * Extremely fast and cheap to run * Easy to set up * Familiar for jQuery users * Automatically avoids some anti-scraping bans **Disadvantages:** * Does not work for websites that require JavaScript rendering * May easily overload the target website with requests * Does not enable any manipulation of the website before scraping ## Web scraping with Cheerio: Examples[​](#web-scraping-with-cheerio-examples "Direct link to Web scraping with Cheerio: Examples") ### Get text content of an element[​](#get-text-content-of-an-element "Direct link to Get text content of an element") Finds the first `<h2>` element and returns its text content. ``` $('h2').text() ``` ### Find all links on a page[​](#find-all-links-on-a-page "Direct link to Find all links on a page") This snippet finds all `<a>` elements which have the `href` attribute and extracts the hrefs into an array. ``` $('a[href]') .map((i, el) => $(el).attr('href')) .get(); ``` ### Other examples[​](#other-examples "Direct link to Other examples") Visit the [Examples](https://crawlee.dev/js/docs/examples.md) section to browse examples of `CheerioCrawler` usage. Almost all examples show `CheerioCrawler` code in their code tabs. --- # Configuration Copy for LLM ​[`Configuration`](https://crawlee.dev/js/api/core/class/Configuration.md) is a class holding Crawlee configuration parameters. By default, you don't need to set or change any of them, but for certain use cases you might want to do so, e.g. in order to change the default storage directory, or enable verbose error logging, and so on. There are three ways of changing the configuration parameters: * adding `crawlee.json` file to your project * setting environment variables * using the `Configuration` class You could also combine all the above, but you should keep in mind, that the precedence for these 3 options is the following: ***`crawlee.json`*** < ***constructor options*** < ***environment variables***. `crawlee.json` is a baseline. The options provided in the `Configuration` constructor will override the options provided in the JSON. Environment variables will override both. ## `crawlee.json`[​](#crawleejson "Direct link to crawleejson") The first option you could use for configuring Crawlee is `crawlee.json` file. 
The only thing you need to do is specify the [`ConfigurationOptions`](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md) in the file, place the file in the root of your project, and Crawlee will use provided options as global configuration. crawlee.json ``` { "persistStateIntervalMillis": 10000, "logLevel": "DEBUG" } ``` With `crawlee.json` you don't need to do anything else in the code: ``` import { CheerioCrawler, sleep } from 'crawlee'; // We are not importing nor passing // the Configuration to the crawler. // We are not assigning any env vars either. const crawler = new CheerioCrawler(); crawler.router.addDefaultHandler(async ({ request }) => { // for the first request we wait for 5 seconds, // and add the second request to the queue if (request.url === 'https://www.example.com/1') { await sleep(5_000); await crawler.addRequests(['https://www.example.com/2']) } // for the second request we wait for 10 seconds, // and abort the run if (request.url === 'https://www.example.com/2') { await sleep(10_000); process.exit(0); } }); await crawler.run(['https://www.example.com/1']); ``` If you run this example (assuming you placed the `crawlee.json` file with `persistStateIntervalMillis` and `logLevel` specified there in the root of your project), you will find the `SDK_CRAWLER_STATISTICS` file in default Key-Value store, which would show, that there's 1 finished request and crawler runtime was \~10 seconds. This confirms that the state was persisted after 10 seconds, as it was set in `crawlee.json`. Besides, you should see `DEBUG` logs in addition to `INFO` ones in your terminal, as `logLevel` was set to `DEBUG` in the `crawlee.json`, meaning Crawlee picked both provided options correctly. ## Environment Variables[​](#environment-variables "Direct link to Environment Variables") Another way of configuring Crawlee is setting environment variables. The following is a list of the environment variables used by Crawlee that are available to the user. ### Important env vars[​](#important-env-vars "Direct link to Important env vars") The following environment variables have large impact on the way Crawlee works and its behavior can be changed significantly by setting or unsetting them. #### `CRAWLEE_STORAGE_DIR`[​](#crawlee_storage_dir "Direct link to crawlee_storage_dir") Defines the path to a local directory where [`KeyValueStore`](https://crawlee.dev/js/api/core/class/KeyValueStore.md), [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md), and [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) store their data. By default, it is set to `./storage`. #### `CRAWLEE_DEFAULT_DATASET_ID`[​](#crawlee_default_dataset_id "Direct link to crawlee_default_dataset_id") The default dataset has ID `default`. Setting this environment variable overrides the default dataset ID with the provided value. #### `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`[​](#crawlee_default_key_value_store_id "Direct link to crawlee_default_key_value_store_id") The default key-value store has ID `default`. Setting this environment variable overrides the default key-value store ID with the provided value. #### `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID`[​](#crawlee_default_request_queue_id "Direct link to crawlee_default_request_queue_id") The default request queue has ID `default`. Setting this environment variable overrides the default request queue ID with the provided value. 
#### `CRAWLEE_PURGE_ON_START`[​](#crawlee_purge_on_start "Direct link to crawlee_purge_on_start") Storage directories are purged by default. If set to `false`, local storage directories will not be purged automatically at the start of the crawler run, nor before a storage is opened explicitly (e.g. via `Dataset.open()`). This is useful if we want to, e.g., add more items to the dataset with each run (and keep the previously saved/scraped items). #### `CRAWLEE_CONTAINERIZED`[​](#crawlee_containerized "Direct link to crawlee_containerized") This variable is only effective when the systemInfoV2 experiment is enabled. It changes how Crawlee measures its CPU and memory usage and limits. If unset, Crawlee will determine whether it is containerized by checking for common features of containerized environments via the `isContainerized` utility function: * A file at `/.dockerenv`. * A file at `/proc/self/cgroup` containing `docker`. * A value for the `KUBERNETES_SERVICE_HOST` environment variable. If `isLambda` returns true, `isContainerized` will return false regardless of these other checks. When this variable is set, it is used in place of `isContainerized`. ### Convenience env vars[​](#convenience-env-vars "Direct link to Convenience env vars") The next group includes env vars that can help achieve certain goals without having to change our code, such as temporarily switching the log level to DEBUG or enabling verbose logging for errors. #### `CRAWLEE_HEADLESS`[​](#crawlee_headless "Direct link to crawlee_headless") If set to `1`, web browsers launched by Crawlee will run in headless mode. We can still override this setting in the code, e.g. by passing the `headless: true` option to the [`launchPuppeteer()`](https://crawlee.dev/js/api/puppeteer-crawler/function/launchPuppeteer.md) function. By default, the browsers are launched in headful mode, i.e. with windows. #### `CRAWLEE_LOG_LEVEL`[​](#crawlee_log_level "Direct link to crawlee_log_level") Specifies the minimum log level, which can be one of the following values (in order of severity): `DEBUG`, `INFO`, `WARNING`, `ERROR` and `OFF`. By default, the log level is set to `INFO`, which means that `DEBUG` messages are not printed to the console. See the [`utils.log`](https://crawlee.dev/js/api/core/class/Log.md) namespace for logging utilities. #### `CRAWLEE_VERBOSE_LOG`[​](#crawlee_verbose_log "Direct link to crawlee_verbose_log") Enables verbose logging if set to `true`. If not explicitly set to `true`, errors thrown from inside the request handler are logged only as a warning with the error message, as long as we know the request will be retried. The same applies to some known errors (such as timeout errors). Disabled by default. #### `CRAWLEE_MEMORY_MBYTES`[​](#crawlee_memory_mbytes "Direct link to crawlee_memory_mbytes") Sets the amount of system memory in megabytes to be used by the [`AutoscaledPool`](https://crawlee.dev/js/api/core/class/AutoscaledPool.md). It is used to limit the number of concurrently running tasks. By default, the max amount of memory to be used is set to one quarter of total system memory, i.e. on a system with 8192 MB of memory, the autoscaling feature will only use up to 2048 MB of memory. ## Configuration class[​](#configuration-class "Direct link to Configuration class") The last option to adjust Crawlee configuration is to use the [`Configuration`](https://crawlee.dev/js/api/core/class/Configuration.md) class in the code. 
### Global Configuration[​](#global-configuration "Direct link to Global Configuration") By default, there is a global singleton instance of the `Configuration` class; it is used by the crawlers and some other classes that depend on a configurable behavior. In most cases you don't need to adjust any options there, but if needed, you can get access to it via the [`Configuration.getGlobalConfig()`](https://crawlee.dev/js/api/core/class/Configuration.md#getGlobalConfig) function. Now you can easily [`get`](https://crawlee.dev/js/api/core/class/Configuration.md#get) and [`set`](https://crawlee.dev/js/api/core/class/Configuration.md#set) the [`ConfigurationOptions`](https://crawlee.dev/js/api/core/interface/ConfigurationOptions.md). ``` import { CheerioCrawler, Configuration, sleep } from 'crawlee'; // Get the global configuration const config = Configuration.getGlobalConfig(); // Set the 'persistStateIntervalMillis' option // of global configuration to 10 seconds config.set('persistStateIntervalMillis', 10_000); // Note that we are not passing the configuration to the crawler // as it's using the global configuration const crawler = new CheerioCrawler(); crawler.router.addDefaultHandler(async ({ request }) => { // For the first request we wait for 5 seconds, // and add the second request to the queue if (request.url === 'https://www.example.com/1') { await sleep(5_000); await crawler.addRequests(['https://www.example.com/2']) } // For the second request we wait for 10 seconds, // and abort the run if (request.url === 'https://www.example.com/2') { await sleep(10_000); process.exit(0); } }); await crawler.run(['https://www.example.com/1']); ``` This is pretty much the same example we used for showing `crawlee.json` usage, but now we're using the global configuration, which is the only difference. If you run this example, you will find the `SDK_CRAWLER_STATISTICS` file in the default Key-Value store as before, showing the same number of finished requests (one) and the same crawler runtime (\~10 seconds). This confirms that the provided parameters worked: the state was persisted after 10 seconds, as it was set in the global configuration. note If you run the same example with the two lines of code related to `Configuration` commented out, there will be no `SDK_CRAWLER_STATISTICS` file stored in the default Key-Value store: as we did not change the `persistStateIntervalMillis`, Crawlee used the default value of 60 seconds, and the crawler was forcefully aborted after \~15 seconds of run time before it persisted the state for the first time. ### Custom configuration[​](#custom-configuration "Direct link to Custom configuration") Alternatively, you can create a custom configuration. In this case you need to pass it to the class that is going to use it, e.g. to the crawler. 
Let's adjust the previous example: ``` import { CheerioCrawler, Configuration, sleep } from 'crawlee'; // Create new configuration const config = new Configuration({ // Set the 'persistStateIntervalMillis' option to 10 seconds persistStateIntervalMillis: 10_000, }); // Now we need to pass the configuration to the crawler const crawler = new CheerioCrawler({}, config); crawler.router.addDefaultHandler(async ({ request }) => { // for the first request we wait for 5 seconds, // and add the second request to the queue if (request.url === 'https://www.example.com/1') { await sleep(5_000); await crawler.addRequests(['https://www.example.com/2']) } // for the second request we wait for 10 seconds, // and abort the run if (request.url === 'https://www.example.com/2') { await sleep(10_000); process.exit(0); } }); await crawler.run(['https://www.example.com/1']); ``` If you run this example - it would work exactly the same as before, with the same `SDK_CRAWLER_STATISTICS` file in default Key-Value store after the run, showing the same number of finished requests and the same crawler run time. note If you would not pass the configuration to the crawler, there again will be no `SDK_CRAWLER_STATISTICS` file stored in the default Key-Value store, this time for a different reason though. Since we did not pass the configuration to the crawler, the crawler will use the global configuration, which is using the default `persistStateIntervalMillis`. So again, the run was aborted before the state was persisted for the first time. --- # Using a custom HTTP client (Experimental) Copy for LLM The [`BasicCrawler`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md) class allows you to configure the HTTP client implementation using the `httpClient` constructor option. This might be useful for testing or if you need to swap out the default implementation based on `got-scraping` for something else, such as `curl-impersonate` or `axios`. The HTTP client implementation needs to conform to the [`BaseHttpClient`](https://crawlee.dev/js/api/core/interface/BaseHttpClient.md) interface. For a rough idea on how it might look, see a skeleton implementation that uses the standard `fetch` interface: ``` import type { BaseHttpClient, HttpRequest, HttpResponse, RedirectHandler, ResponseTypes, StreamingHttpResponse, } from '@crawlee/core'; import { Readable } from 'node:stream'; class CustomHttpClient implements BaseHttpClient { async sendRequest<TResponseType extends keyof ResponseTypes = 'text'>( request: HttpRequest<TResponseType>, ): Promise<HttpResponse<TResponseType>> { const requestHeaders = new Headers(); for (let [headerName, headerValues] of Object.entries(request.headers ?? {})) { if (headerValues === undefined) { continue; } if (!Array.isArray(headerValues)) { headerValues = [headerValues]; } for (const value of headerValues) { requestHeaders.append(headerName, value); } } const response = await fetch(request.url, { method: request.method, headers: requestHeaders, body: request.body as string, // TODO implement stream/generator handling signal: request.signal, // TODO implement the rest of request parameters (e.g., timeout, proxyUrl, cookieJar, ...) 
}); const headers: Record<string, string> = {}; response.headers.forEach((value, headerName) => { headers[headerName] = value; }); return { complete: true, request, url: response.url, statusCode: response.status, redirectUrls: [], // TODO you need to handle redirects manually to track them headers, trailers: {}, // TODO not supported by fetch ip: undefined, body: request.responseType === 'text' ? await response.text() : request.responseType === 'json' ? await response.json() : Buffer.from(await response.text()), }; } async stream(request: HttpRequest, onRedirect?: RedirectHandler): Promise<StreamingHttpResponse> { const fetchResponse = await fetch(request.url, { method: request.method, headers: new Headers(), body: request.body as string, // TODO implement stream/generator handling signal: request.signal, // TODO implement the rest of request parameters (e.g., timeout, proxyUrl, cookieJar, ...) }); const headers: Record<string, string> = {}; // TODO same as in sendRequest() async function* read() { const reader = fetchResponse.body?.getReader(); const stream = new ReadableStream({ start(controller) { if (!reader) { return null; } return pump(); function pump() { return reader!.read().then(({ done, value }) => { // When no more data needs to be consumed, close the stream if (done) { controller.close(); return; } // Enqueue the next data chunk into our target stream controller.enqueue(value); return pump(); }); } }, }); for await (const chunk of stream) { yield chunk; } } const response = { complete: false, request, url: fetchResponse.url, statusCode: fetchResponse.status, redirectUrls: [], // TODO you need to handle redirects manually to track them headers, trailers: {}, // TODO not supported by fetch ip: undefined, stream: Readable.from(read()), get downloadProgress() { return { percent: 0, transferred: 0 }; // TODO track this }, get uploadProgress() { return { percent: 0, transferred: 0 }; // TODO track this }, }; return response; } } ``` You may then instantiate it and pass to a crawler constructor: ``` const crawler = new HttpCrawler({ httpClient: new CustomHttpClient(), async requestHandler() { /* ... */ }, }); ``` Please note that the interface is experimental and it will likely change with Crawlee version 4. --- # Running in Docker Copy for LLM Running headless browsers in Docker requires a lot of setup to do it right. But there's no need to worry about that, because we already created base images that you can freely use. We use them every day on the [Apify Platform](https://crawlee.dev/js/docs/deployment/apify-platform.md). All images can be found in their [GitHub repo](https://github.com/apify/apify-actor-docker) and in our [DockerHub](https://hub.docker.com/orgs/apify). ## Overview[​](#overview "Direct link to Overview") Browsers are pretty big, so we try to provide a wide variety of images to suit the specific needs. Here's a full list of our Docker images. * [`apify/actor-node`](#actor-node) * [`apify/actor-node-puppeteer-chrome`](#actor-node-puppeteer-chrome) * [`apify/actor-node-playwright`](#actor-node-playwright) * [`apify/actor-node-playwright-chrome`](#actor-node-playwright-chrome) * [`apify/actor-node-playwright-firefox`](#actor-node-playwright-firefox) * [`apify/actor-node-playwright-webkit`](#actor-node-playwright-webkit) ## Versioning[​](#versioning "Direct link to Versioning") Each image is tagged with up to 2 version tags, depending on the type of the image. One for Node.js version and second for pre-installed web automation library version. 
If you use the image name without a version tag, you'll always get the latest available version. > We recommend always using at least the Node.js version tag in production Dockerfiles. It will ensure that a future update of Node.js will not break our automations. ### Node.js versioning[​](#nodejs-versioning "Direct link to Node.js versioning") Our images are built with multiple Node.js versions to ensure backwards compatibility. Currently, Node.js **versions 16 and 18 are supported** (legacy versions still exist, see DockerHub). To select the preferred version, use the appropriate number as the image tag. ``` # Use Node.js 20 FROM apify/actor-node:20 ``` ### Automation library versioning[​](#automation-library-versioning "Direct link to Automation library versioning") Images that include a pre-installed automation library, which means all images that include `puppeteer` or `playwright` in their name, are also tagged with the pre-installed version of the library. For example, `apify/actor-node-puppeteer-chrome:20-22.1.0` comes with Node.js 20 and Puppeteer v22.1.0. If you try to install a different version of Puppeteer into this image, you may run into compatibility issues, because the Chromium version bundled with `puppeteer` will not match the version of Chromium that's pre-installed. Similarly `apify/actor-node-playwright-firefox:14-1.21.1` runs on Node.js 14 and is pre-installed with the Firefox version that comes with v1.21.1. Installing `apify/actor-node-puppeteer-chrome` (without a tag) will install the latest available version of Node.js and `puppeteer`. ### Pre-release tags[​](#pre-release-tags "Direct link to Pre-release tags") We also build pre-release versions of the images to test the changes we make. Those are typically denoted by a `beta` suffix, but it can vary depending on our needs. If you need to try a pre-release version, you can do it like this: ``` # Without library version. FROM apify/actor-node:20-beta ``` ``` # With library version. FROM apify/actor-node-playwright-chrome:20-1.10.0-beta ``` ## Best practices[​](#best-practices "Direct link to Best practices") For production crawlers, we recommend pinning both the Node.js version **and** the automation library version in your Dockerfile tag. This ensures reproducible builds and prevents unexpected behavior when new versions are released. ### Recommended approach: Pin both versions[​](#recommended-approach-pin-both-versions "Direct link to Recommended approach: Pin both versions") Match the automation library version in your `package.json` with the version in your Docker image tag: ``` FROM apify/actor-node-playwright-chrome:22-1.52.0 ``` ``` { "dependencies": { "crawlee": "^3.0.0", "playwright": "1.52.0" } } ``` Why version matching matters If you pin the Docker image to `22-1.52.0` but install a different Playwright version via `package.json`, you may encounter browser compatibility issues. The browsers pre-installed in the image are specifically built for that Playwright version. ### Alternative approach: Using asterisk `*`[​](#alternative-approach-using-asterisk- "Direct link to alternative-approach-using-asterisk-") You can also use asterisk `*` as the automation library version in your `package.json`: ``` FROM apify/actor-node-playwright-chrome:22 ``` ``` { "dependencies": { "crawlee": "^3.0.0", "playwright": "*" } } ``` This makes sure the pre-installed version of Puppeteer or Playwright is not re-installed on build. 
However, this approach is less predictable because you'll get whatever version was latest when the Docker image was built. ## Finding available tags[​](#finding-available-tags "Direct link to Finding available tags") To see all available tags for each image, you can visit Docker Hub directly: * [apify/actor-node](https://hub.docker.com/r/apify/actor-node/tags) * [apify/actor-node-puppeteer-chrome](https://hub.docker.com/r/apify/actor-node-puppeteer-chrome/tags) * [apify/actor-node-playwright](https://hub.docker.com/r/apify/actor-node-playwright/tags) * [apify/actor-node-playwright-chrome](https://hub.docker.com/r/apify/actor-node-playwright-chrome/tags) * [apify/actor-node-playwright-firefox](https://hub.docker.com/r/apify/actor-node-playwright-firefox/tags) * [apify/actor-node-playwright-webkit](https://hub.docker.com/r/apify/actor-node-playwright-webkit/tags) You can also query available tags programmatically: ``` curl -s "https://registry.hub.docker.com/v2/repositories/apify/actor-node-playwright-chrome/tags?page_size=50" | jq '.results[].name' ``` ### Warning about image size[​](#warning-about-image-size "Direct link to Warning about image size") Browsers are huge. If you don't need them all in your image, it's better to use a smaller image with only the one browser you need. You should also be careful when installing new dependencies. Nothing prevents you from installing Playwright into the`actor-node-puppeteer-chrome` image, but the resulting image will be about 3 times larger and extremely slow to download and build. When you use only what you need, you'll be rewarded with reasonable build and start times. ## Apify Docker Images[​](#apify-docker-images "Direct link to Apify Docker Images") ### actor-node[​](#actor-node "Direct link to actor-node") This is the smallest image we have based on Alpine Linux. It does not include any browsers, and it's therefore best used with [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). It benefits from lightning fast builds and container startups. ​[`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md), [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) and other browser based features will **NOT** work with this image. ``` FROM apify/actor-node:20 ``` ### actor-node-puppeteer-chrome[​](#actor-node-puppeteer-chrome "Direct link to actor-node-puppeteer-chrome") This image includes Puppeteer (Chromium) and the Chrome browser. It can be used with [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) and [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md), but **NOT** with [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). The image supports XVFB by default, so you can run both `headless` and `headful` browsers with it. ``` FROM apify/actor-node-puppeteer-chrome:20 ``` ### actor-node-playwright[​](#actor-node-playwright "Direct link to actor-node-playwright") A very large and slow image that can run all Playwright browsers: Chromium, Chrome, Firefox, WebKit. Everything is installed. If you need to develop or test with multiple browsers, this is the image to choose, but in most cases, it's better to use the specialized images below. 
``` FROM apify/actor-node-playwright:20 ``` ### actor-node-playwright-chrome[​](#actor-node-playwright-chrome "Direct link to actor-node-playwright-chrome") Similar to [`actor-node-puppeteer-chrome`](#actor-node-puppeteer-chrome), but for Playwright. You can run [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md), but **NOT** [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md). It uses the [`PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD`](https://playwright.dev/docs/api/environment-variables/) environment variable to block installation of more browsers into the image to keep it small. If you want more browsers, either use the [`actor-node-playwright`](#actor-node-playwright) image or override this env var. The image supports XVFB by default, so you can run both `headless` and `headful` browsers with it. ``` FROM apify/actor-node-playwright-chrome:20 ``` ### actor-node-playwright-firefox[​](#actor-node-playwright-firefox "Direct link to actor-node-playwright-firefox") Same idea as [`actor-node-playwright-chrome`](#actor-node-playwright-chrome), but with Firefox pre-installed. ``` FROM apify/actor-node-playwright-firefox:20 ``` ### actor-node-playwright-webkit[​](#actor-node-playwright-webkit "Direct link to actor-node-playwright-webkit") Same idea as [`actor-node-playwright-chrome`](#actor-node-playwright-chrome), but with WebKit pre-installed. ``` FROM apify/actor-node-playwright-webkit:20 ``` ## Example Dockerfile[​](#example-dockerfile "Direct link to Example Dockerfile") To use the above images, it's necessary to have a [`Dockerfile`](https://docs.docker.com/engine/reference/builder/). You can either use this example, or bootstrap your project with the [Crawlee CLI](https://crawlee.dev/js/docs/introduction/setting-up.md), which automatically adds the correct Dockerfile to your project folder. * Node+JavaScript * Node+TypeScript * Browser+JavaScript * Browser+TypeScript ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node:20 # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY . ./ # Run the image. CMD npm start --silent ``` ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node:20 AS builder # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install all dependencies. Don't audit to speed up the installation. 
RUN npm install --include=dev --audit=false # Next, copy the source files using the user set # in the base image. COPY . ./ # Install all dependencies and build the project. # Don't audit to speed up the installation. RUN npm run build # Create final image FROM apify/actor-node:20 # Copy only built JS files from builder image COPY --from=builder /usr/src/app/dist ./dist # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY . ./ # Run the image. CMD npm run start:prod --silent ``` This example is for Playwright. If you want to use Puppeteer, simply replace **playwright** with **puppeteer** in the `FROM` declaration. ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node-playwright-chrome:20 # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY --chown=myuser . ./ # Run the image. CMD npm start --silent ``` This example is for Playwright. If you want to use Puppeteer, simply replace **playwright** with **puppeteer** in both `FROM` declarations. ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node-playwright-chrome:20 AS builder # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install all dependencies. Don't audit to speed up the installation. RUN npm install --include=dev --audit=false # Next, copy the source files using the user set # in the base image. COPY --chown=myuser . ./ # Install all dependencies and build the project. # Don't audit to speed up the installation. RUN npm run build # Create final image FROM apify/actor-node-playwright-chrome:20 # Copy only built JS files from builder image COPY --from=builder --chown=myuser /home/myuser/dist ./dist # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. 
Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY --chown=myuser . ./ # Run the image. If you know you won't need headful browsers, # you can remove the XVFB start script for a micro perf gain. CMD ./start_xvfb_and_run_cmd.sh && npm run start:prod --silent ``` --- # Got Scraping Copy for LLM ## Intro[​](#intro "Direct link to Intro") When using `BasicCrawler`, we have to send the requests manually. In order to do this, we can use the context-aware `sendRequest()` function: ``` import { BasicCrawler } from 'crawlee'; const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest(); log.info('received body', res.body); }, }); ``` It uses [`got-scraping`](https://github.com/apify/got-scraping) under the hood. Got Scraping is a [Got](https://github.com/sindresorhus/got) extension developed to mimic browser requests, so there's a high chance we'll open the webpage without getting blocked. ## `sendRequest` API[​](#sendrequest-api "Direct link to sendrequest-api") ``` async sendRequest(overrideOptions?: GotOptionsInit) => { return gotScraping({ url: request.url, method: request.method, body: request.payload, headers: request.headers, proxyUrl: crawlingContext.proxyInfo?.url, sessionToken: session, responseType: 'text', ...overrideOptions, retry: { limit: 0, ...overrideOptions?.retry, }, cookieJar: { getCookieString: (url: string) => session!.getCookieString(url), setCookie: (rawCookie: string, url: string) => session!.setCookie(rawCookie, url), ...overrideOptions?.cookieJar, }, }); } ``` ### `url`[​](#url "Direct link to url") By default, it's the URL of current task. However you can override this with a `string` or a `URL` instance if necessary. *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#url).* ### `method`[​](#method "Direct link to method") By default, it's the HTTP method of current task. Possible values are `'GET', 'POST', 'HEAD', 'PUT', 'PATCH', 'DELETE'`. *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#method).* ### `body`[​](#body "Direct link to body") By default, it's the HTTP payload of current task. *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#body).* ### `headers`[​](#headers "Direct link to headers") By default, it's the HTTP headers of current task. It's an object with `string` values. *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#headers).* ### `proxyUrl`[​](#proxyurl "Direct link to proxyurl") It's a string representing the proxy server in the format of `protocol://username:password@hostname:port`. For example, an Apify proxy server looks like this: `http://auto:password@proxy.apify.com:8000`. 
`BasicCrawler` does not have the concept of a session or a proxy, so we need to pass the `proxyUrl` option manually: ``` import { BasicCrawler } from 'crawlee'; const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest({ proxyUrl: 'http://auto:password@proxy.apify.com:8000', }); log.info('received body', res.body); }, }); ``` We use proxies to hide our real IP address. *More details in [Got Scraping documentation](https://github.com/apify/got-scraping#proxyurl).* ### `sessionToken`[​](#sessiontoken "Direct link to sessiontoken") It's a non-primitive object used as a key when generating a browser fingerprint. Fingerprints with the same token don't change. This can be used to retain the `user-agent` header when using the same Apify Session. *More details in [Got Scraping documentation](https://github.com/apify/got-scraping#sessiontoken).* ### `responseType`[​](#responsetype "Direct link to responsetype") This option defines how the response should be parsed. By default, we fetch HTML websites - that is, plain text. Hence, we set `responseType` to `'text'`. However, JSON is possible as well: ``` import { BasicCrawler } from 'crawlee'; const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest({ responseType: 'json' }); log.info('received body', res.body); }, }); ``` *More details in [Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#responsetype).* ### `cookieJar`[​](#cookiejar "Direct link to cookiejar") `Got` uses a `cookieJar` to manage cookies. It's an object with the interface of the [`tough-cookie` package](https://github.com/salesforce/tough-cookie). Example: ``` import { BasicCrawler } from 'crawlee'; import { CookieJar } from 'tough-cookie'; const cookieJar = new CookieJar(); const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest({ cookieJar }); log.info('received body', res.body); }, }); ``` *More details in* * *[Got documentation](https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#cookiejar)* * *[Tough Cookie documentation](https://github.com/salesforce/tough-cookie#cookiejarstore-options)* ### `retry.limit`[​](#retrylimit "Direct link to retrylimit") This option specifies the maximum number of `Got` retries. By default, `retry.limit` is set to `0`. This is because Crawlee has its own (complicated enough) retry management. We suggest NOT changing this value for stability reasons. ### `useHeaderGenerator`[​](#useheadergenerator "Direct link to useheadergenerator") It's a boolean that determines whether browser-like headers should be generated. By default, it's set to `true`, and we recommend keeping it that way for better results. ### `headerGeneratorOptions`[​](#headergeneratoroptions "Direct link to headergeneratoroptions") This option is an object that specifies how the browser fingerprint should be generated. 
Example: ``` import { BasicCrawler } from 'crawlee'; const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { const res = await sendRequest({ headerGeneratorOptions: { devices: ['mobile', 'desktop'], locales: ['en-US'], operatingSystems: ['windows', 'macos', 'android', 'ios'], browsers: ['chrome', 'edge', 'firefox', 'safari'], }, }); log.info('received body', res.body); }, }); ``` *More details in [`HeaderGeneratorOptions` documentation](https://apify.github.io/fingerprint-suite/api/fingerprint-generator/interface/HeaderGeneratorOptions/).* **Related links** * [Got documentation](https://github.com/sindresorhus/got#documentation) * [Got Scraping documentation](https://github.com/apify/got-scraping) * [Header Generator documentation](https://apify.github.io/fingerprint-suite/docs/guides/fingerprint-generator/) --- # Impit HTTP Client Copy for LLM ## Introduction[​](#introduction "Direct link to Introduction") The `ImpitHttpClient` is an HTTP client implementation based on the [Impit](https://github.com/apify/impit) library. It enables browser impersonation for HTTP requests, helping you bypass bot detection systems without running an actual browser. Successor to got-scraping Impit is the successor to `got-scraping`, which is no longer actively maintained. We recommend using `ImpitHttpClient` for all new projects. Impit provides better anti-bot evasion through TLS fingerprinting and HTTP/3 support, while maintaining a smaller package size. **Impit will become the default HTTP client in the next major version of Crawlee.** ### Why use Impit?[​](#why-use-impit "Direct link to Why use Impit?") Websites increasingly use sophisticated bot detection that analyzes: * **HTTP fingerprints**: User-Agent strings, header ordering, HTTP/2 pseudo-header sequences * **TLS fingerprints**: Cipher suites, TLS extensions, and cryptographic details in the ClientHello message Standard HTTP clients like `fetch` or `axios` are easily detected because their fingerprints don't match real browsers. Unlike `got-scraping` which only handles HTTP-level fingerprinting, Impit also mimics TLS fingerprints, making requests appear to come from real browsers. ## Installation[​](#installation "Direct link to Installation") Install the `@crawlee/impit-client` package: ``` npm install @crawlee/impit-client ``` note The `impit` package includes native binaries and supports Windows, macOS (including ARM), and Linux out of the box. 
## Basic usage[​](#basic-usage "Direct link to Basic usage") Pass the `ImpitHttpClient` instance to the `httpClient` option of any Crawlee crawler: ``` import { BasicCrawler } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const crawler = new BasicCrawler({ httpClient: new ImpitHttpClient({ browser: Browser.Firefox, }), async requestHandler({ sendRequest, log }) { const response = await sendRequest(); log.info('Received response', { statusCode: response.statusCode }); }, }); await crawler.run(['https://example.com']); ``` ## Usage with different crawlers[​](#usage-with-different-crawlers "Direct link to Usage with different crawlers") ### CheerioCrawler[​](#cheeriocrawler "Direct link to CheerioCrawler") ``` import { CheerioCrawler } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const crawler = new CheerioCrawler({ httpClient: new ImpitHttpClient({ browser: Browser.Chrome, }), async requestHandler({ $, request, enqueueLinks, pushData }) { const title = $('title').text(); const h1 = $('h1').first().text(); await pushData({ url: request.url, title, h1, }); // Enqueue links found on the page await enqueueLinks(); }, }); await crawler.run(['https://example.com']); ``` ### HttpCrawler[​](#httpcrawler "Direct link to HttpCrawler") ``` import { HttpCrawler } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const crawler = new HttpCrawler({ httpClient: new ImpitHttpClient({ browser: Browser.Firefox, http3: true, }), async requestHandler({ body, request, log, pushData }) { log.info(`Processing ${request.url}`); // body is the raw HTML string await pushData({ url: request.url, bodyLength: body.length, }); }, }); await crawler.run(['https://example.com']); ``` ## Configuration options[​](#configuration-options "Direct link to Configuration options") The `ImpitHttpClient` constructor accepts the following options: | Option | Type | Default | Description | | ----------------- | ------------------------- | ----------- | ------------------------------------------------------------------------------ | | `browser` | `'chrome'` \| `'firefox'` | `undefined` | Browser to impersonate. Affects TLS fingerprint and default headers. | | `http3` | `boolean` | `false` | Enable HTTP/3 (QUIC) protocol support. | | `ignoreTlsErrors` | `boolean` | `false` | Ignore TLS certificate errors. Useful for testing or self-signed certificates. 
| ### Browser impersonation[​](#browser-impersonation "Direct link to Browser impersonation") Use the `Browser` enum to specify which browser to impersonate: ``` import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; // Impersonate Firefox const firefoxClient = new ImpitHttpClient({ browser: Browser.Firefox }); // Impersonate Chrome const chromeClient = new ImpitHttpClient({ browser: Browser.Chrome }); ``` ### Advanced configuration[​](#advanced-configuration "Direct link to Advanced configuration") ``` import { CheerioCrawler } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const crawler = new CheerioCrawler({ httpClient: new ImpitHttpClient({ // Impersonate Chrome browser browser: Browser.Chrome, // Enable HTTP/3 protocol http3: true, }), async requestHandler({ $ }) { console.log(`Title: ${$('title').text()}`); }, }); await crawler.run(['https://example.com']); ``` ## Proxy support[​](#proxy-support "Direct link to Proxy support") Proxies are configured per-request through Crawlee's proxy management system, not on the `ImpitHttpClient` itself. Use `ProxyConfiguration` as you normally would: ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; import { ImpitHttpClient, Browser } from '@crawlee/impit-client'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy1.example.com:8080', 'http://proxy2.example.com:8080'], }); const crawler = new CheerioCrawler({ httpClient: new ImpitHttpClient({ browser: Browser.Chrome }), proxyConfiguration, async requestHandler({ $, request }) { console.log(`Scraped ${request.url}`); }, }); ``` ## How it works[​](#how-it-works "Direct link to How it works") Impit achieves browser impersonation at two levels: 1. **HTTP level**: Mimics browser-specific header ordering, HTTP/2 settings, and pseudo-header sequences that antibot services analyze. 2. **TLS level**: Uses a patched version of `rustls` to replicate the exact TLS ClientHello message that browsers send, including cipher suites and extensions. This dual-layer approach makes requests appear to come from a real browser, significantly reducing blocks from bot detection systems. ## Comparison with other solutions[​](#comparison-with-other-solutions "Direct link to Comparison with other solutions") | Feature | got-scraping | curl-impersonate | Impit | | ---------------------- | ------------ | ------------------ | ------ | | TLS fingerprinting | No | Yes | Yes | | HTTP/3 support | No | Yes | Yes | | Native Node.js package | Yes | No (child process) | Yes | | Windows/macOS ARM | Yes | No | Yes | | Package size | \~10 MB | \~20 MB | \~8 MB | **Related links** * [Impit GitHub repository](https://github.com/apify/impit) * [Custom HTTP Client guide](https://crawlee.dev/js/docs/guides/custom-http-client.md) * [Proxy Management guide](https://crawlee.dev/js/docs/guides/proxy-management.md) * [Avoiding blocking guide](https://crawlee.dev/js/docs/guides/avoid-blocking.md) --- # JavaScript rendering Copy for LLM JavaScript rendering is the process of executing JavaScript on a page to make changes in the page's structure or content. It's also called client-side rendering, the opposite of server-side rendering. Some modern websites render on the client, some on the server and many cutting edge websites render some things on the server and other things on the client. The Crawlee website does not use JavaScript rendering to display its content, so we have to look for an example elsewhere. 
[Apify Store](https://apify.com/store) is a library of scrapers and automations called **actors** that anyone can grab and use for free. It uses JavaScript rendering to display the list of actors, so let's use it to demonstrate how it works. src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ async requestHandler({ $, request }) { // Extract text content of an actor card const actorText = $('.ActorStoreItem').text(); console.log(`ACTOR: ${actorText}`); } }) await crawler.run(['https://apify.com/store']); ``` Run the code, and you'll see that the crawler won't print the content of the actor card. ``` ACTOR: ``` That's because Apify Store uses client-side JavaScript to render its content and `CheerioCrawler` can't execute it, so the text never appears in the page's HTML. You can confirm this using Chrome DevTools. If you go to <https://apify.com/store>, right-click anywhere in the page, select **View Page Source** and search for **ActorStoreItem** you won't find any results. Then, if you right-click again, select **Inspect** and search for the same **ActorStoreItem**, you will find many of them. How's this possible? Because **View Page Source** shows the original HTML, before any JavaScript executions. That's what `CheerioCrawler` gets. Whereas with **Inspect** you see the current HTML - after JavaScript execution. When you understand this, it's not a huge surprise that `CheerioCrawler` can't find the data. For that we need a headless browser. ## Headless browsers[​](#headless-browsers "Direct link to Headless browsers") To get the contents of `.ActorStoreItem`, you will have to use a headless browser. You can choose from two libraries to control your browser: [Puppeteer](https://github.com/puppeteer/puppeteer) or [Playwright](https://github.com/microsoft/playwright). The choice is simple. If you know one of them, choose the one you know. If you know both, or none, choose Playwright, because it's better in most cases. ## Waiting for elements to render[​](#waiting-for-elements-to-render "Direct link to Waiting for elements to render") No matter which library you pick, here's example code for both. Playwright is a little more pleasant to use, but both libraries will get the job done. The big difference between them is that Playwright will automatically wait for elements to appear, whereas in Puppeteer, you have to explicitly wait for them. * PlaywrightCrawler * PuppeteerCrawler src/main.mjs ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ page }) { // page.locator points to an element in the DOM // using a CSS selector, but it does not access it yet. const actorCard = page.locator('.ActorStoreItem').first(); // Upon calling one of the locator methods Playwright // waits for the element to render and then accesses it. const actorText = await actorCard.textContent(); console.log(`ACTOR: ${actorText}`); }, }); await crawler.run(['https://apify.com/store']); ``` src/main.mjs ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ page }) { // Puppeteer does not have the automatic waiting functionality // of Playwright, so we have to explicitly wait for the element. await page.waitForSelector('.ActorStoreItem'); // Puppeteer does not have helper methods like locator.textContent, // so we have to manually extract the value using in-page JavaScript. 
const actorText = await page.$eval('.ActorStoreItem', (el) => { return el.textContent; }); console.log(`ACTOR: ${actorText}`); }, }); await crawler.run(['https://apify.com/store']); ``` When you run the code, you'll see the *badly formatted* content of the first actor card printed to console: ``` ACTOR: Web Scraperapify/web-scraperCrawls arbitrary websites using [...] ``` ### We're not kidding[​](#were-not-kidding "Direct link to We're not kidding") If you don't believe us that the elements need to be waited for, run the following code which skips the waiting. * PlaywrightCrawler * PuppeteerCrawler src/main.mjs ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ page }) { // Here we don't wait for the selector and immediately // extract the text content from the page. const actorText = await page.$eval('.ActorStoreItem', (el) => { return el.textContent; }); console.log(`ACTOR: ${actorText}`); }, }); await crawler.run(['https://apify.com/store']); ``` src/main.mjs ``` import { PuppeteerCrawler } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ page }) { // Here we don't wait for the selector and immediately // extract the text content from the page. const actorText = await page.$eval('.ActorStoreItem', (el) => { return el.textContent; }); console.log(`ACTOR: ${actorText}`); }, }); await crawler.run(['https://apify.com/store']); ``` In both cases, the request will be retried a few times and eventually fail with an error like this: ``` ERROR [...] Error: failed to find element matching selector ".ActorStoreItem" ``` That's because when you try to access the element in the browser, it's not been rendered in the DOM yet. tip This guide only touches the concept of JavaScript rendering and use of headless browsers. To learn more, continue with the [Puppeteer & Playwright course](https://developers.apify.com/academy/puppeteer-playwright) in the Apify Academy. **It's free and open-source** ❤️. --- # JSDOMCrawler guide Copy for LLM ​[`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) is very useful for scraping with the Window API. ## How the crawler works[​](#how-the-crawler-works "Direct link to How the crawler works") ​[`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md) crawls by making plain HTTP requests to the provided URLs using the specialized [got-scraping](https://github.com/apify/got-scraping) HTTP client. The URLs are fed to the crawler using [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md). The HTTP responses it gets back are usually HTML pages. The same pages you would get in your browser when you first load a URL. But it can handle any content types with the help of the [`additionalMimeTypes`](https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions.md#additionalMimeTypes) option. info Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). Once the page's HTML is retrieved, the crawler will pass it to [JSDOM](https://www.npmjs.com/package/jsdom) for parsing. 
The result is a `window` property, which should be familiar to frontend developers. You can use the Window API to do all sorts of lookups and manipulation of the page's HTML, but in scraping, you will mostly use it to find specific HTML elements and extract their data. Example use of browser JavaScript: ``` // Return the page title document.title; // browsers window.document.title; // JSDOM ``` ## When to use `JSDOMCrawler`[​](#when-to-use-jsdomcrawler "Direct link to when-to-use-jsdomcrawler") `JSDOMCrawler` really shines when `CheerioCrawler` is just not enough. There is an entire set of [APIs](https://developer.mozilla.org/en-US/docs/Web/API/HTML_DOM_API) available! **Advantages:** * Easy to set up * Familiar for frontend developers * Content can be manipulated * Automatically avoids some anti-scraping bans **Disadvantages:** * Slower than `CheerioCrawler` * Does not work for websites that require JavaScript rendering * May easily overload the target website with requests ## Example use of Element API[​](#example-use-of-element-api "Direct link to Example use of Element API") ### Find all links on a page[​](#find-all-links-on-a-page "Direct link to Find all links on a page") This snippet finds all `<a>` elements which have the `href` attribute and extracts the hrefs into an array. ``` Array.from(document.querySelectorAll('a[href]')).map((a) => a.href); ``` ### Other examples[​](#other-examples "Direct link to Other examples") Visit the [Examples](https://crawlee.dev/js/docs/examples.md) section to browse examples of `JSDOMCrawler` usage. Almost all examples show `JSDOMCrawler` code in their code tabs. --- # motivation Copy for LLM --- # Parallel Scraping Guide Copy for LLM Experimental features ahead At the time of writing this guide (December 2023), request locking is still an experimental feature. You can read more about the experiment by visiting the [request locking experiment](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) page. In this guide, we will walk you through how you can turn your single scraper into a scraper that can be parallelized and run in multiple instances. This guide assumes you've read and walked through our [introduction guide](https://crawlee.dev/js/docs/introduction/setting-up.md) (or have a fully-fledged scraper already built), but if you haven't done so yet, take a break, go read through all that, and come back. We'll be waiting... *Oh, you're back already! Let's proceed in making that scraper parallel!* ## Things to consider before parallelizing[​](#things-to-consider-before-parallelizing "Direct link to Things to consider before parallelizing") Before you rush ahead and change your scraper to support parallelization, take a minute to consider the following factors: * Do you plan on scraping so many pages that you need to parallelize your scraper? <!-- --> * For example, if your scraper goes across a few pages, you probably don't need parallelization * But if you scrape a lot of pages, or you scrape pages that take a long time to load, you might want to consider parallelization * Can you parallelize your scraper while not overloading the target website? <!-- --> * For example, if you scrape a website that has a lot of traffic, you don't want to add to that traffic by running multiple scrapers in parallel as that might cause the website to go down for all its users * Do you have the resources available to run multiple scrapers in parallel? 
<!-- --> * When running locally, depending on your scraper type, do you have enough CPU and RAM available to sustain multiple scrapers running in parallel * When running in the cloud, will the extra speed from parallelization be worth the extra cost of running multiple scrapers in parallel? Let's assume you answered yes to all of those. Yes? Yes. Before we go ahead and get to the actual guide, we'd like to ask you to also take a read on [Apify's Ethical Web Scraping](https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/) blog post! Now that we've gone through all that, the guide is split into two parts: converting the initial scraper we built in the [introduction guide](https://crawlee.dev/js/docs/introduction/setting-up.md) to one that prepares requests to be usable in parallel scrapers, and then running scrapers in parallel. Want to see the final result? You can see it on the [Crawlee Parallel Scraping Example](https://github.com/apify/crawlee-parallel-scraping-example) repository! It's the same scraper we built in the [introduction guide](https://crawlee.dev/js/docs/introduction/setting-up.md), but in TypeScript and parallelized! ### But isn't Crawlee already concurrent? What's the difference between concurrency and parallelization?[​](#but-isnt-crawlee-already-concurrent-whats-the-difference-between-concurrency-and-parallelization "Direct link to But isn't Crawlee already concurrent? What's the difference between concurrency and parallelization?") > *Hold on! I've used Crawlee before, and it has a `maxConcurrency` option! What's this for then?!* You're correct, Crawlee already supports scraping in "parallel" (more accurately called concurrent). What that enables is one process having multiple tasks that run in the background at the same time. But, as your scraping operation scales up, you are likely to encounter bottlenecks. These can range from the runtime environment's inability to process more requests simultaneously, to resources like RAM and CPU being maxed out. You can only scale up resources so much before it stops providing a real benefit. This is what people refer to when saying vertical or horizontal scaling. Vertical scaling is when you increase the resources of a single process or machine, while horizontal scaling is when you increase the number of processes or machines. Horizontal scaling, on the other hand, is the kind of scaling (or what we're referring to as "parallelization") we are showcasing in this guide! ## Preparing your scraper for parallelization[​](#preparing-your-scraper-for-parallelization "Direct link to Preparing your scraper for parallelization") One of the best parts of Crawlee is that, for the most part, we do not need to change much to make this happen! Just create the queue that supports locking, enqueue links to it from the initial scraper, then build scrapers that run in parallel that use that queue! ### Creating the request queue with locking support[​](#creating-the-request-queue-with-locking-support "Direct link to Creating the request queue with locking support") The first step in our conversion process will be creating a common file (let's call it `requestQueue.mjs`) that will store the request queue that supports request locking. 
src/requestQueue.mjs ``` import { RequestQueueV2 } from 'crawlee'; // Create the request queue that also supports parallelization let queue; /** * @param {boolean} makeFresh Whether the queue should be cleared before returning it * @returns The queue */ export async function getOrInitQueue(makeFresh = false) { if (queue) { return queue; } queue = await RequestQueueV2.open('shop-urls'); if (makeFresh) { await queue.drop(); queue = await RequestQueueV2.open('shop-urls'); } return queue; } ``` The exported function, `getOrInitQueue`, might seem like it does a lot. In essence, it just ensures the request queue is initialized, and if requested, ensures it starts off with an empty state. ### Adapting our previous scraper to enqueue the product URLs to the new queue[​](#adapting-our-previous-scraper-to-enqueue-the-product-urls-to-the-new-queue "Direct link to Adapting our previous scraper to enqueue the product URLs to the new queue") In the `src/routes.mjs` file of the scraper we previously built, we have a handler for the `CATEGORY` label. Let's adapt that handler to enqueue the product URLs to the new queue we created. Firstly, let's import the `getOrInitQueue` function from the `requestQueue.mjs` file we created earlier. Add the following line at the start of the file: src/routes.mjs ``` import { getOrInitQueue } from './requestQueue.mjs'; ``` Then, replace the `CATEGORY` handler with the following: src/routes.mjs ``` router.addHandler('CATEGORY', async ({ page, enqueueLinks, request, log }) => { log.debug(`Enqueueing pagination for: ${request.url}`); // We are now on a category page. We can use this to paginate through and enqueue all products, // as well as any subsequent pages we find await page.waitForSelector('.product-item > a'); await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL', // <= note the different label, requestQueue: await getOrInitQueue(), // <= note the different request queue }); // Now we need to find the "Next" button and enqueue the next page of results (if it exists) const nextButton = await page.$('a.pagination__next'); if (nextButton) { await enqueueLinks({ selector: 'a.pagination__next', label: 'CATEGORY', // <= note the same label }); } }); ``` Now, let's rename our entry point file `src/main.mjs` to `src/initial-scraper.mjs` and run it. You should see the crawler not scrape any detail pages, but now the URLs are being enqueued to the queue that supports locking! Before we wrap up, let's also add the following line before `crawler.run()`: src/initial-scraper.mjs ``` import { getOrInitQueue } from './requestQueue.mjs'; // Pre-initialize the queue so that we have a blank slate that will get filled out by the crawler await getOrInitQueue(true); ``` We need this to ensure the queue always starts on an empty slate when we run the scraper. But you may not need this in your use case - remember to always experiment and see what works best! And that's it with preparing our initial scraper to save all URLs we want to scrape to the queue that supports locking! ### Creating the parallel scrapers[​](#creating-the-parallel-scrapers "Direct link to Creating the parallel scrapers") Up next, let's build another scraper that will schedule the URLs from the queue to be scraped in parallel! For this, we will be using child processes from Node.js, but you can use any other method you want to run multiple scrapers in parallel. You will need to adjust your code if you use other methods. 
The scraper will fork itself twice (but you can experiment with this), and each fork will re-use the queue we created earlier. The best part? We can re-use the previous router we built for the initial scraper! Yay for code reuse! src/parallel-scraper.mjs ``` import { fork } from 'node:child_process'; import { Configuration, Dataset, PlaywrightCrawler, log } from 'crawlee'; import { router } from './routes.mjs'; import { getOrInitQueue } from './requestQueue.mjs'; // For this example, we will spawn 2 separate processes that will scrape the store in parallel. if (!process.env.IN_WORKER_THREAD) { // This is the main process. We will use this to spawn the worker processes. log.info('Setting up worker processes.'); const currentFile = new URL(import.meta.url).pathname; // Store a promise per worker, so we wait for all to finish before exiting the main process const promises = []; // You can decide how many workers you want to spawn, but keep in mind you can only spawn so many before you overload your machine for (let i = 0; i < 2; i++) { const proc = fork(currentFile, { // Pipe the child process's stdout and stderr to the parent, so we can forward its logs below silent: true, env: { // Share the current process's env across to the newly created process ...process.env, // ...but also tell the process that it's a worker process IN_WORKER_THREAD: 'true', // ...as well as which worker it is WORKER_INDEX: String(i), }, }); proc.on('online', () => { log.info(`Process ${i} is online.`); // Log out what the crawlers are doing // Note: we want to use console.log instead of log.info because we already get formatted output from the crawlers proc.stdout.on('data', (data) => { // eslint-disable-next-line no-console console.log(data.toString()); }); proc.stderr.on('data', (data) => { // eslint-disable-next-line no-console console.error(data.toString()); }); }); proc.on('message', async (data) => { log.debug(`Process ${i} sent data.`, data); await Dataset.pushData(data); }); promises.push( new Promise((resolve) => { proc.once('exit', (code, signal) => { log.info(`Process ${i} exited with code ${code} and signal ${signal}`); resolve(); }); }), ); } await Promise.all(promises); log.info('Crawling complete!'); } else { // This is the worker process. We will use this to scrape the store. // Let's build a logger that will prefix the log messages with the worker index const workerLogger = log.child({ prefix: `[Worker ${process.env.WORKER_INDEX}]` }); // This is better set with CRAWLEE_LOG_LEVEL env var // or a configuration option. This is just for show 😈 workerLogger.setLevel(log.LEVELS.DEBUG); // Disable the automatic purge on start // This is needed when running locally, as otherwise multiple processes will try to clear the default storage (and that will cause clashes) Configuration.set('purgeOnStart', false); // Get the request queue const requestQueue = await getOrInitQueue(false); // Configure crawlee to store the worker-specific data in a separate directory (needs to be done AFTER the queue is initialized when running locally) const config = new Configuration({ storageClientOptions: { localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`, }, }); workerLogger.debug('Setting up crawler.'); const crawler = new PlaywrightCrawler( { log: workerLogger, // Instead of the long requestHandler with // if clauses we provide a router instance. requestHandler: router, // Enable the request locking experiment so that we can actually use the queue. 
experiments: { requestLocking: true, }, // Provide the request queue we've pre-filled in previous steps requestQueue, // Let's also limit the crawler's concurrency, we don't want to overload a single process 🐌 maxConcurrency: 5, }, config, ); await crawler.run(); } ``` We'll also need to make one small change in the `DETAIL` route handler. Instead of calling `context.pushData`, we want to call `process.send`. But why? Since we use child processes, and each worker process has its own storage space, calling `context.pushData` will not work the way we want it to. Instead, we need to send the data back to the parent process, which has the dataset where we want to store the data. This might not be needed depending on your use case! You'll need to experiment and see what works best for you. src/routes.mjs ``` // This replaces the request.label === DETAIL branch of the if clause. router.addHandler('DETAIL', async ({ request, page, log }) => { log.debug(`Extracting data: ${request.url}`); const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440'] const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser' const title = await page.locator('.product-meta h1').textContent(); const sku = await page.locator('span.product-meta__sku-number').textContent(); const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; const results = { url: request.url, manufacturer, title, sku, currentPrice: price, availableInStock: inStock, }; log.debug(`Saving data: ${request.url}`); // Send the data to the parent process // Depending on how you build your crawler, this line could instead be something like `context.pushData()`! Experiment, and see what you can build process.send(results); }); ``` There is a lot of code, so let's break it down: #### The `if` check for `process.env.IN_WORKER_THREAD`[​](#the-if-check-for-processenvin_worker_thread "Direct link to the-if-check-for-processenvin_worker_thread") This checks how the script is being executed. If the environment variable has *any* value, the process assumes it's meant to start scraping. If not, it's considered the **parent** process and will fork copies of itself to do the scraping. #### Why do we create a Promise per worker process?[​](#why-do-we-create-a-promise-per-worker-process "Direct link to Why do we create a Promise per worker process?") We use this to ensure the parent process stays alive until all the worker processes exit. Otherwise, the parent would exit right after spawning the workers, and the workers would lose the ability to communicate with it. You might not need this depending on your use case (maybe you just need to spawn workers and let them process). 
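If it helps to see this pattern in isolation, here is a minimal sketch of the fork-and-wait approach described above, stripped of all crawling logic (the `WORKER_COUNT` constant and the log messages are purely illustrative, and the file is assumed to run as an ES module):

```
import { fork } from 'node:child_process';

const WORKER_COUNT = 2;

if (!process.env.IN_WORKER_THREAD) {
    // Parent branch: spawn the workers and keep one promise per child,
    // so the parent only exits after every worker has finished.
    const currentFile = new URL(import.meta.url).pathname;
    const exitPromises = [];

    for (let i = 0; i < WORKER_COUNT; i++) {
        const proc = fork(currentFile, {
            env: { ...process.env, IN_WORKER_THREAD: 'true', WORKER_INDEX: String(i) },
        });

        exitPromises.push(new Promise((resolve) => proc.once('exit', resolve)));
    }

    await Promise.all(exitPromises);
    console.log('All workers finished.');
} else {
    // Worker branch: this is where the crawler from the example above would run.
    console.log(`Worker ${process.env.WORKER_INDEX} started.`);
}
```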
#### What's with all those `Configuration` calls?[​](#whats-with-all-those-configuration-calls "Direct link to What's with all those `Configuration` calls?") There are three things we need to do for the worker processes: * ensure the default storages do **not** get purged on start, as otherwise we'd lose the queue we prepared * get the queue that supports locking from the same location as the parent process * initialize a separate storage for each worker process so they do not collide with each other In order, that's what these lines do: src/parallel-scraper.mjs ``` // Disable the automatic purge on start (step 1) // This is needed when running locally, as otherwise multiple processes will try to clear the default storage (and that will cause clashes) Configuration.set('purgeOnStart', false); // Get the request queue from the parent process (step 2) const requestQueue = await getOrInitQueue(false); // Configure crawlee to store the worker-specific data in a separate directory (needs to be done AFTER the queue is initialized when running locally) (step 3) const config = new Configuration({ storageClientOptions: { localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`, }, }); ``` #### Enabling the request locking experiment, and telling the crawler to use the worker configuration[​](#enabling-the-request-locking-experiment-and-telling-the-crawler-to-use-the-worker-configuration "Direct link to Enabling the request locking experiment, and telling the crawler to use the worker configuration") You might have noticed several lines highlighted in the code above. Those show how you can enable the request locking experiment, as well as how you provide the request queue to the crawler. You can read more about the experiment by visiting the [request locking experiment](https://crawlee.dev/js/docs/experiments/experiments-request-locking.md) page. You might have also noticed we passed a second parameter to the constructor of the crawler, the `config` variable we created earlier. This is needed to ensure the crawler uses the worker-specific storage for its internal state, so the workers do not collide with each other. #### Why do we use `process.send` instead of `context.pushData`?[​](#why-do-we-use-processsend-instead-of-contextpushdata "Direct link to why-do-we-use-processsend-instead-of-contextpushdata") Since we use child processes, and each worker process has its own storage space, calling `context.pushData` will not work the way we want it to (each worker would just push to its own personal dataset that is considered the "default" one). Instead, we need to send the data back to the parent process, which has the dataset where we want to store the data, in a centralized place. Why don't we apply the same logic we did to the request queue to the dataset? This is a very valid question, but it has a simple answer: since each process tracks its own internal state of what the dataset looks like (when we are scraping locally), the worker processes would get out of sync very quickly and would either miss or overwrite data. This is why we send everything back to the parent process, which stores it in one centralized dataset. Depending on your crawler, this might not be an issue! Each use case has its own quirks, but this is something you should keep in mind when building your scraper. 
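To isolate just the messaging part, the following minimal sketch (same assumptions as above, with an illustrative, hard-coded result object) shows how a worker hands a record to the parent and how the parent writes it to the single default dataset:

```
import { fork } from 'node:child_process';
import { Dataset } from 'crawlee';

if (!process.env.IN_WORKER_THREAD) {
    // Parent branch: the parent owns the default Dataset,
    // so every result sent by a worker ends up in one centralized place.
    const proc = fork(new URL(import.meta.url).pathname, {
        env: { ...process.env, IN_WORKER_THREAD: 'true' },
    });

    proc.on('message', async (result) => {
        await Dataset.pushData(result);
    });

    await new Promise((resolve) => proc.once('exit', resolve));
} else {
    // Worker branch: in the real crawler, this call lives inside the DETAIL request handler.
    process.send({ url: 'https://example.com/product', title: 'Example product' });
}
```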
#### Why did we limit the maximum concurrency to `5`?[​](#why-did-we-limit-the-maximum-concurrency-to-5 "Direct link to why-did-we-limit-the-maximum-concurrency-to-5") This question has a two-fold answer: * we don't want to overload the target website with requests, so we limit the number of concurrent requests to a reasonable number per worker process * we don't want to overload the machine that is running the scraper This circles back to the initial paragraph about whether you should parallelize your scraper or not. ## Other questions[​](#other-questions "Direct link to Other questions") #### Couldn't the `initial-scraper` be merged into the `parallel-scraper`?[​](#couldnt-the-initial-scraper-be-merged-into-the-parallel-scraper "Direct link to couldnt-the-initial-scraper-be-merged-into-the-parallel-scraper") Technically, it could! Nothing stops you from first enqueuing all the URLs in the parent process, and then run the worker process logic after to scrape them. We separated them so it's easier to follow and understand what each part does, but you can merge them if you want to. #### Will I benefit from this if I run XYZ scraper / want to scrape XYZ website?[​](#will-i-benefit-from-this-if-i-run-xyz-scraper--want-to-scrape-xyz-website "Direct link to Will I benefit from this if I run XYZ scraper / want to scrape XYZ website?") We don't know! 🤷 What we do know is that first, you should build your scraper to work as a single scraper, then monitor its performance. Do you see it being too slow? Do you scrape many pages, or do the few pages you scrape take a long time to load? If so, then you might benefit from parallelization. When in doubt, follow the list of things to consider before parallelizing at the start of this guide. --- # Proxy Management Copy for LLM [IP address blocking](https://en.wikipedia.org/wiki/IP_address_blocking) is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy to use but powerful tools which can work around IP blocking. The most powerful weapon in our anti IP blocking arsenal is a [proxy server](https://en.wikipedia.org/wiki/Proxy_server). With Crawlee we can use our own proxy servers or proxy servers acquired from third-party providers. Check out the [avoid blocking guide](https://crawlee.dev/js/docs/guides/avoid-blocking.md) for more information about blocking. ## Quick start[​](#quick-start "Direct link to Quick start") If we already have proxy URLs of our own, we can start using them immediately in only a few lines of code. ``` import { ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: [ 'http://proxy-1.com', 'http://proxy-2.com', ] }); const proxyUrl = await proxyConfiguration.newUrl(); ``` Examples of how to use our proxy URLs with crawlers are shown below in [Crawler integration](#crawler-integration) section. ## Proxy Configuration[​](#proxy-configuration "Direct link to Proxy Configuration") All our proxy needs are managed by the [`ProxyConfiguration`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md) class. We create an instance using the `ProxyConfiguration` [`constructor`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#constructor) function based on the provided options. See the [`ProxyConfigurationOptions`](https://crawlee.dev/js/api/core/interface/ProxyConfigurationOptions.md) for all the possible constructor options. 
### Static proxy list[​](#static-proxy-list "Direct link to Static proxy list") You can provide a static list of proxy URLs to the `proxyUrls` option. The `ProxyConfiguration` will then rotate through the provided proxies. ``` const proxyConfiguration = new ProxyConfiguration({ proxyUrls: [ 'http://proxy-1.com', 'http://proxy-2.com', null // null means no proxy is used ] }); ``` This is the simplest way to use a list of proxies. Crawlee will rotate through the list of proxies in a round-robin fashion. ### Custom proxy function[​](#custom-proxy-function "Direct link to Custom proxy function") The `ProxyConfiguration` class allows you to provide a custom function to pick a proxy URL. This is useful when you want to implement your own logic for selecting a proxy. ``` const proxyConfiguration = new ProxyConfiguration({ newUrlFunction: (sessionId, { request }) => { if (request?.url.includes('crawlee.dev')) { return null; // for crawlee.dev, we don't use a proxy } return 'http://proxy-1.com'; // for all other URLs, we use this proxy } }); ``` The `newUrlFunction` receives two parameters - `sessionId` and `options` - and returns a string containing the proxy URL. The `sessionId` parameter is always provided and allows us to differentiate between different sessions - e.g. when Crawlee recognizes your crawlers are being blocked, it will automatically create a new session with a different id. The `options` parameter is an object containing a [`Request`](https://crawlee.dev/js/api/core/class/Request.md), which is the request that will be made. Note that this object is not always available, for example when we are using the `newUrl` function directly. Your custom function should therefore not rely on the `request` object being present and provide a default behavior when it is not. ### Tiered proxies[​](#tiered-proxies "Direct link to Tiered proxies") You can also provide a list of proxy tiers to the `ProxyConfiguration` class. This is useful when you want to switch between different proxies automatically based on the blocking behavior of the website. warning Note that the `tieredProxyUrls` option requires `ProxyConfiguration` to be used from a crawler instance ([see below](#crawler-integration)). Using this configuration through the `newUrl` calls will not yield the expected results. ``` const proxyConfiguration = new ProxyConfiguration({ tieredProxyUrls: [ [null], // At first, we try to connect without a proxy ['http://okay-proxy.com'], ['http://slightly-better-proxy.com', 'http://slightly-better-proxy-2.com'], ['http://very-good-and-expensive-proxy.com'], ] }); ``` This configuration will start with no proxy, then switch to `http://okay-proxy.com` if Crawlee recognizes we're getting blocked by the target website. If that proxy is also blocked, we will switch to one of the `slightly-better-proxy` URLs. If those are blocked, we will switch to the `very-good-and-expensive-proxy.com` URL. Crawlee also periodically probes lower tier proxies to see if they are unblocked, and if they are, it will switch back to them. 
## Crawler integration[​](#crawler-integration "Direct link to Crawler integration") `ProxyConfiguration` integrates seamlessly into [`HttpCrawler`](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md), [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md), [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) and [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md). * HttpCrawler * CheerioCrawler * JSDOMCrawler * PlaywrightCrawler * PuppeteerCrawler ``` import { HttpCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new HttpCrawler({ proxyConfiguration, // ... }); ``` ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new CheerioCrawler({ proxyConfiguration, // ... }); ``` ``` import { JSDOMCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new JSDOMCrawler({ proxyConfiguration, // ... }); ``` ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new PlaywrightCrawler({ proxyConfiguration, // ... }); ``` ``` import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const crawler = new PuppeteerCrawler({ proxyConfiguration, // ... }); ``` Our crawlers will now use the selected proxies for all connections. ## IP Rotation and session management[​](#ip-rotation-and-session-management "Direct link to IP Rotation and session management") ​[`proxyConfiguration.newUrl()`](https://crawlee.dev/js/api/core/class/ProxyConfiguration.md#newUrl) allows us to pass a `sessionId` parameter. It will then be used to create a `sessionId`-`proxyUrl` pair, and subsequent `newUrl()` calls with the same `sessionId` will always return the same `proxyUrl`. This is extremely useful in scraping, because we want to create the impression of a real user. See the [session management guide](https://crawlee.dev/js/docs/guides/session-management.md) and [`SessionPool`](https://crawlee.dev/js/api/core/class/SessionPool.md) class for more information on how keeping a real session helps us avoid blocking. When no `sessionId` is provided, our proxy URLs are rotated round-robin. * HttpCrawler * CheerioCrawler * JSDOMCrawler * PlaywrightCrawler * PuppeteerCrawler * Standalone ``` import { HttpCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new HttpCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new CheerioCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... 
}); ``` ``` import { JSDOMCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new JSDOMCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PlaywrightCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ``` import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PuppeteerCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ``` import { ProxyConfiguration, SessionPool } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const sessionPool = await SessionPool.open({ /* opts */ }); const session = await sessionPool.getSession(); const proxyUrl = await proxyConfiguration.newUrl(session.id); ``` ## Inspecting current proxy in Crawlers[​](#inspecting-current-proxy-in-crawlers "Direct link to Inspecting current proxy in Crawlers") `HttpCrawler`, `CheerioCrawler`, `JSDOMCrawler`, `PlaywrightCrawler` and `PuppeteerCrawler` grant access to information about the currently used proxy in their `requestHandler` using a [`proxyInfo`](https://crawlee.dev/js/api/core/interface/ProxyInfo.md) object. With the `proxyInfo` object, we can easily access the proxy URL. * HttpCrawler * CheerioCrawler * JSDOMCrawler * PlaywrightCrawler * PuppeteerCrawler ``` import { HttpCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new HttpCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new CheerioCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` ``` import { JSDOMCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new JSDOMCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PlaywrightCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` ``` import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PuppeteerCrawler({ proxyConfiguration, async requestHandler({ proxyInfo }) { console.log(proxyInfo); }, // ... }); ``` --- # Request Storage Copy for LLM Crawlee has several request storage types that are useful for specific tasks. The requests are stored on local disk to a directory defined by the `CRAWLEE_STORAGE_DIR` environment variable. If this variable is not defined, by default Crawlee sets `CRAWLEE_STORAGE_DIR` to `./storage` in the current working directory. ## Request queue[​](#request-queue "Direct link to Request queue") The request queue is a storage of URLs to crawl. 
The queue is used for the deep crawling of websites, where we start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each Crawlee project run is associated with a **default request queue**. Typically, it is used to store URLs to crawl in the specific crawler run. Its usage is optional. In Crawlee, the request queue is represented by the [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md) class. The request queue is managed by [`MemoryStorage`](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) class and its data is stored in memory, while also being off-loaded to the local directory specified by the `CRAWLEE_STORAGE_DIR` environment variable as follows: ``` {CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/entries.json ``` note `{QUEUE_ID}` is the name or ID of the request queue. The default queue has ID `default`, unless we override it by setting the `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` environment variable. note `entries.json` contains an array of requests. The following code demonstrates the usage of the request queue: * Usage with Crawler * Explicit usage with Crawler * Basic Operations ``` import { CheerioCrawler } from 'crawlee'; // The crawler will automatically process requests from the queue. // It's used the same way for Puppeteer/Playwright crawlers. const crawler = new CheerioCrawler({ // Note that we're not specifying the requestQueue here async requestHandler({ $, crawler, enqueueLinks }) { // Add new request to the queue await crawler.addRequests([{ url: 'https://example.com/new-page' }]); // Add links found on page to the queue await enqueueLinks(); }, }); // Add the initial requests. // Note that we are not opening the request queue explicitly before await crawler.addRequests([ { url: 'https://example.com/1' }, { url: 'https://example.com/2' }, { url: 'https://example.com/3' }, // ... ]); // Run the crawler await crawler.run(); ``` ``` import { RequestQueue, CheerioCrawler } from 'crawlee'; // Open the default request queue associated with the current run const requestQueue = await RequestQueue.open(); // Enqueue the initial requests await requestQueue.addRequests([ { url: 'https://example.com/1' }, { url: 'https://example.com/2' }, { url: 'https://example.com/3' }, // ... ]); // The crawler will automatically process requests from the queue. // It's used the same way for Puppeteer/Playwright crawlers const crawler = new CheerioCrawler({ requestQueue, async requestHandler({ $, request, enqueueLinks }) { // Add new request to the queue await requestQueue.addRequests([{ url: 'https://example.com/new-page' }]); // Add links found on page to the queue await enqueueLinks(); }, }); // Run the crawler await crawler.run(); ``` ``` import { RequestQueue } from 'crawlee'; // Open the default request queue associated with the crawler run const requestQueue = await RequestQueue.open(); // Enqueue the initial batch of requests (could be an array of just one) await requestQueue.addRequests([ { url: 'https://example.com/1' }, { url: 'https://example.com/2' }, { url: 'https://example.com/3' }, ]); // Open the named request queue const namedRequestQueue = await RequestQueue.open('named-queue'); // Remove the named request queue await namedRequestQueue.drop(); ``` To see more detailed example of how to use the request queue with a crawler, see the [Puppeteer Crawler](https://crawlee.dev/js/docs/examples/puppeteer-crawler.md) example. 
## Request list[​](#request-list "Direct link to Request list") The request list is not a storage per se - it represents the list of URLs to crawl that is stored in the crawler run's memory (or optionally in the default [Key-Value Store](https://crawlee.dev/js/docs/guides/result-storage.md#key-value-store) associated with the run, if specified). The list is used for crawling a large number of URLs when we know all the URLs that should be visited by the crawler up front and no new URLs will be added during the run. The URLs can be provided either in code or parsed from a text file hosted on the web. The request list is created exclusively for the crawler run and only if its usage is explicitly specified in the code. Its usage is optional. In Crawlee, the request list is represented by the [`RequestList`](https://crawlee.dev/js/api/core/class/RequestList.md) class. The following code demonstrates basic operations of the request list: ``` import { RequestList, PuppeteerCrawler } from 'crawlee'; // Prepare the sources array with URLs to visit const sources = [ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]; // Open the request list. // List name is used to persist the sources and the list state in the key-value store const requestList = await RequestList.open('my-list', sources); // The crawler will automatically process requests from the list // It's used the same way for Cheerio/Playwright crawlers. const crawler = new PuppeteerCrawler({ requestList, async requestHandler({ page, request }) { // Process the page (extract data, take page screenshot, etc). // No more requests can be added to the request list here }, }); ``` ## Which one to choose?[​](#which-one-to-choose "Direct link to Which one to choose?") When using the Request queue, we would normally have several start URLs (e.g. category pages of an e-commerce website) and then recursively add more (e.g. individual item pages) to the queue programmatically; the queue supports dynamically adding and removing requests. No more URLs can be added to the Request list after its initialization, as it is immutable; URLs cannot be removed from the list either. On the other hand, the Request queue is not optimized for adding or removing numerous URLs in a batch. This is technically possible, but requests are added one by one to the queue, and thus it would take significant time with a larger number of requests. The Request list, however, can contain even millions of URLs, and adding them to the list takes significantly less time compared to the queue. Note that the Request queue and Request list can be used together by the same crawler. In such cases, each request from the Request list is enqueued into the Request queue first (to the foremost position in the queue, even if the Request queue is not empty) and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there are numerous initial URLs, but more URLs would be added dynamically by the crawler. tip In Crawlee, there is not much need to combine the request queue together with the request list (although it's technically possible). Previously there was no way to add the initial requests to the queue in batches (to add an array of requests), i.e.
we could have only added the requests one by one to the queue with the help of [`addRequest()`](https://crawlee.dev/js/api/core/class/RequestQueue.md#addRequest) function. However, now we could use the [`addRequests()`](https://crawlee.dev/js/api/core/class/RequestQueue.md#addRequests) function, which adds requests in batches. Thus, instead of combining the request queue and the request list, we can use only the request queue for such use-cases now. See the examples below. * Request Queue * Request Queue + Request List ``` // This is the suggested way. // Note that we are not using the request list at all, // and not using the request queue explicitly here. import { PuppeteerCrawler } from 'crawlee'; // Prepare the sources array with URLs to visit (it can contain millions of URLs) const sources = [ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, // ... ]; // The crawler will automatically process requests from the queue. // It's used the same way for Cheerio/Playwright crawlers const crawler = new PuppeteerCrawler({ async requestHandler({ crawler, enqueueLinks }) { // Add new request to the queue await crawler.addRequests(['http://www.example.com/new-page']); // Add links found on page to the queue await enqueueLinks(); // The requests above would be added to the queue // and would be processed after the initial requests are processed. }, }); // Add the initial sources array to the request queue // and run the crawler await crawler.run(sources); ``` ``` // This is technically correct, but // we need to explicitly open/use both the request queue and the request list. // We suggest using the request queue and batch add the requests instead. import { RequestList, RequestQueue, PuppeteerCrawler } from 'crawlee'; // Prepare the sources array with URLs to visit (it can contain millions of URLs) const sources = [ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, // ... ]; // Open the request list with the initial sources array const requestList = await RequestList.open('my-list', sources); // Open the default request queue. It's not necessary to add any requests to the queue const requestQueue = await RequestQueue.open(); // The crawler will automatically process requests from the list and the queue. // It's used the same way for Cheerio/Playwright crawlers const crawler = new PuppeteerCrawler({ requestList, requestQueue, // Each request from the request list is enqueued to the request queue one by one. // At this point request with the same URL would exist in the list and the queue async requestHandler({ crawler, enqueueLinks }) { // Add new request to the queue await crawler.addRequests(['http://www.example.com/new-page']); // Add links found on page to the queue await enqueueLinks(); // The requests above would be added to the queue (but not to the list) // and would be processed after the request list is empty. // No more requests could be added to the list here }, }); // Run the crawler await crawler.run(); ``` ## Cleaning up the storages[​](#cleaning-up-the-storages "Direct link to Cleaning up the storages") Default storages are purged before the crawler starts if not specified otherwise. This happens as early as when we try to open some storage (e.g. via `RequestQueue.open()`) or when we try to work with a default storage via one of the helper methods (e.g. `crawler.addRequests()` that under the hood calls `RequestQueue.open()`). 
If we don't work with storages explicitly in our code, the purging will eventually happen when the `run` method of our crawler is executed. In case we need to purge the storages sooner, we can use the [`purgeDefaultStorages()`](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) helper explicitly: ``` import { purgeDefaultStorages } from 'crawlee'; await purgeDefaultStorages(); ``` Calling this function will clean up the default request storage directory (and also the request list stored in default key-value store). This is a shortcut for running (optional) `purge` method on the [`StorageClient`](https://crawlee.dev/js/api/core/interface/StorageClient.md) interface, in other words it will call the `purge` method of the underlying storage implementation we are currently using. You can make sure the storage is purged only once for a given execution context if you set `onlyPurgeOnce` to `true` in the `options` object. --- # Result Storage Copy for LLM Crawlee has several result storage types that are useful for specific tasks. The data is stored on a local disk to the directory defined by the `CRAWLEE_STORAGE_DIR` environment variable. If this variable is not defined, by default Crawlee sets `CRAWLEE_STORAGE_DIR` to `./storage` in the current working directory. Crawlee storage is managed by [`MemoryStorage`](https://crawlee.dev/js/api/memory-storage/class/MemoryStorage.md) class. During the crawler run all information is stored in memory, while also being off-loaded to the local files in respective storage type folders. ## Key-value store[​](#key-value-store "Direct link to Key-value store") The key-value store is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots of web pages, PDFs or to persist the state of crawlers. Each Crawlee project run is associated with a **default key-value store**. By convention, the project input and output are stored in the default key-value store under the `INPUT` and `OUTPUT` keys respectively. Typically, both input and output are JSON files, although they could be any other format. In Crawlee, the key-value store is represented by the [`KeyValueStore`](https://crawlee.dev/js/api/core/class/KeyValueStore.md) class. In order to simplify access to the default key-value store, Crawlee also provides [`KeyValueStore.getValue()`](https://crawlee.dev/js/api/core/class/KeyValueStore.md#getValue) and [`KeyValueStore.setValue()`](https://crawlee.dev/js/api/core/class/KeyValueStore.md#setValue) functions. The data is stored in the directory specified by the `CRAWLEE_STORAGE_DIR` environment variable as follows: ``` {CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT} ``` note `{STORE_ID}` is the name or the ID of the key-value store. The default key-value store has ID `default`, unless we override it by setting the `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` environment variable. The `{KEY}` is the key of the record and `{EXT}` corresponds to the MIME content type of the data value. 
The following code demonstrates basic operations of key-value stores: ``` import { KeyValueStore } from 'crawlee'; // Get the INPUT from the default key-value store const input = await KeyValueStore.getInput(); // Write the OUTPUT to the default key-value store await KeyValueStore.setValue('OUTPUT', { myResult: 123 }); // Open a named key-value store const store = await KeyValueStore.open('some-name'); // Write a record to the named key-value store. // JavaScript object is automatically converted to JSON, // strings and binary buffers are stored as they are await store.setValue('some-key', { foo: 'bar' }); // Read a record from the named key-value store. // Note that JSON is automatically parsed to a JavaScript object, // text data is returned as a string, and other data is returned as binary buffer const value = await store.getValue('some-key'); // Delete a record from the named key-value store await store.setValue('some-key', null); ``` To see a real-world example of how to get the input from the key-value store, see the [Screenshots](https://crawlee.dev/js/docs/examples/capture-screenshot.md) example. ## Dataset[​](#dataset "Direct link to Dataset") Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. Dataset can be imagined as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - we can only add new records to it, but we cannot modify or remove existing records. Each Crawlee project run is associated with a **default dataset**. Typically, it is used to store crawling results specific for the crawler run. Its usage is optional. In Crawlee, the dataset is represented by the [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md) class. In order to simplify writes to the default dataset, Crawlee also provides the [`Dataset.pushData()`](https://crawlee.dev/js/api/core/class/Dataset.md#pushData) function. The data is stored in the directory specified by the `CRAWLEE_STORAGE_DIR` environment variable as follows: ``` {CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` note `{DATASET_ID}` is the name or the ID of the dataset. The default dataset has ID `default`, unless we override it by setting the `CRAWLEE_DEFAULT_DATASET_ID` environment variable. Each dataset item is stored as a separate JSON file, where `{INDEX}` is a zero-based index of the item in the dataset. The following code demonstrates basic operations of the dataset: ``` import { Dataset } from 'crawlee'; // Write a single row to the default dataset await Dataset.pushData({ col1: 123, col2: 'val2' }); // Open a named dataset const dataset = await Dataset.open('some-name'); // Write a single row await dataset.pushData({ foo: 'bar' }); // Write multiple rows await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]); ``` To see how to use the dataset to store crawler results, see the [Cheerio Crawler](https://crawlee.dev/js/docs/examples/cheerio-crawler.md) example. ## Cleaning up the storages[​](#cleaning-up-the-storages "Direct link to Cleaning up the storages") Default storages are purged before the crawler starts if not specified otherwise. This happens as early as when we try to open some storage (e.g. via `Dataset.open()`) or when we try to work with a default storage via one of the helper methods (e.g. `Dataset.pushData()` that under the hood calls `Dataset.open()`). 
If we don't work with storages explicitly in our code, the purging will eventually happen when the `run` method of our crawler is executed. In case we need to purge the storages sooner, we can use the [`purgeDefaultStorages()`](https://crawlee.dev/js/api/core/function/purgeDefaultStorages.md) helper explicitly: ``` import { purgeDefaultStorages } from 'crawlee'; await purgeDefaultStorages(); ``` Calling this function will clean up the default results storage directories except the `INPUT` key in default key-value store directory. This is a shortcut for running (optional) `purge` method on the [`StorageClient`](https://crawlee.dev/js/api/core/interface/StorageClient.md) interface, in other words it will call the `purge` method of the underlying storage implementation we are currently using. In addition, this method will make sure the storage is purged only once for a given execution context, so it is safe to call it multiple times. --- # Running in web server Copy for LLM Most of the time, Crawlee jobs are run as batch jobs. You have a list of URLs you want to scrape every week or you might want to scrape a whole website once per day. After the scrape, you send the data to your warehouse for analytics. Batch jobs are efficient because they can use [Crawlee's built-in autoscaling](https://crawlee.dev/js/docs/guides/scaling-crawlers.md) to fully utilize the resources you have available. But sometimes you have a use-case where you need to return scrape data as soon as possible. There might be a user waiting on the other end so every millisecond counts. This is where running Crawlee in a web server comes in. We will build a simple HTTP server that receives a page URL and returns the page title in the response. We will base this guide on the approach used in [Apify's Super Scraper API repository](https://github.com/apify/super-scraper) which maps incoming HTTP requests to Crawlee [Request](https://crawlee.dev/js/api/core/class/Request.md). ## Set up a web server[​](#set-up-a-web-server "Direct link to Set up a web server") There are many popular web server frameworks for Node.js, such as Express, Koa, Fastify, and Hapi but in this guide, we will use the built-in `http` Node.js module to keep things simple. This will be our core server setup: ``` import { createServer } from 'http'; import { log } from 'crawlee'; const server = createServer(async (req, res) => { log.info(`Request received: ${req.method} ${req.url}`); res.writeHead(200, { 'Content-Type': 'text/plain' }); // We will return the page title here later instead res.end('Hello World\n'); }); server.listen(3000, () => { log.info('Server is listening for user requests'); }); ``` ## Create the Crawler[​](#create-the-crawler "Direct link to Create the Crawler") We will create a standard [CheerioCrawler](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) and use the [`keepAlive: true`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#keepAlive) option to keep the crawler running even if there are no requests currently in the [Request Queue](https://crawlee.dev/js/api/core/class/RequestQueue.md). This way it will always be waiting for new requests to come in. 
``` import { CheerioCrawler, log } from 'crawlee'; const crawler = new CheerioCrawler({ keepAlive: true, requestHandler: async ({ request, $ }) => { const title = $('title').text(); // We will send the response here later log.info(`Page title: ${title} on ${request.url}`); }, }); ``` ## Glue it together[​](#glue-it-together "Direct link to Glue it together") Now we need to glue the server and the crawler together using the mapping of Crawlee Requests to HTTP responses discussed above. The whole program is actually quite simple. For a production-grade service, you will need to improve error handling, logging, and monitoring, but this is a good starting point. src/web-server.mjs ``` import { randomUUID } from 'node:crypto'; import { CheerioCrawler, log } from 'crawlee'; import { createServer } from 'http'; // We will bind an HTTP response that we want to send to the Request.uniqueKey const requestsToResponses = new Map(); const crawler = new CheerioCrawler({ keepAlive: true, requestHandler: async ({ request, $ }) => { const title = $('title').text(); log.info(`Page title: ${title} on ${request.url}, sending response`); // We will pick the response from the map and send it to the user // We know the response is there with this uniqueKey const httpResponse = requestsToResponses.get(request.uniqueKey); httpResponse.writeHead(200, { 'Content-Type': 'application/json' }); httpResponse.end(JSON.stringify({ title })); // We can delete the response from the map now to free up memory requestsToResponses.delete(request.uniqueKey); }, }); const server = createServer(async (req, res) => { // We parse the requested URL from the query parameters, e.g. localhost:3000/?url=https://example.com const urlObj = new URL(req.url, 'http://localhost:3000'); const requestedUrl = urlObj.searchParams.get('url'); log.info(`HTTP request received for ${requestedUrl}, adding to the queue`); if (!requestedUrl) { log.error('No URL provided as query parameter, returning 400'); res.writeHead(400, { 'Content-Type': 'application/json' }); res.end(JSON.stringify({ error: 'No URL provided as query parameter' })); return; } // We will add it first to the map and then enqueue it to the crawler that immediately processes it // uniqueKey must be random so that the same URL can be processed again const crawleeRequest = { url: requestedUrl, uniqueKey: randomUUID() }; requestsToResponses.set(crawleeRequest.uniqueKey, res); await crawler.addRequests([crawleeRequest]); }); // Now we start the server, the crawler and wait for incoming connections server.listen(3000, () => { log.info('Server is listening for user requests'); }); await crawler.run(); ``` --- # Scaling our crawlers As we build our crawler, we might want to control how many requests we make to the website at a time. Crawlee provides several options to fine-tune how many parallel requests should be made at any time, how many requests should be made per minute, and how scaling should work based on the available system resources. tip All of these options are available on all crawlers Crawlee provides, but for this guide we'll be using the [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md). We can see all options that are available [`here`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md). ## `maxRequestsPerMinute`[​](#maxrequestsperminute "Direct link to maxrequestsperminute") This controls how many total requests can be made per minute.
It counts the number of requests done every second, to ensure there is not a burst of requests at the `maxConcurrency` limit followed by a long period of waiting. By default, it is set to `Infinity`, which means the crawler will keep going up to the `maxConcurrency`. We would set this if we wanted our crawler to work at full throughput, but also not keep hitting the website we're crawling with non-stop requests. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Let the crawler know it can run up to 100 requests concurrently at any time maxConcurrency: 100, // ...but also ensure the crawler never exceeds 250 requests per minute maxRequestsPerMinute: 250, }); ``` ## `minConcurrency` and `maxConcurrency`[​](#minconcurrency-and-maxconcurrency "Direct link to minconcurrency-and-maxconcurrency") These control how many parallel requests can be run at any time. By default, crawlers will start with one parallel request at a time and scale up over time to a maximum of 200 parallel requests. Don't set `minConcurrency` too high! Setting this option too high compared to the available system resources will make your crawler run extremely slow or might even crash. It's recommended to leave it at the provided default value and let the crawler scale up and down automatically based on available resources instead. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Start the crawler right away and ensure there will always be 5 concurrent requests run at any time minConcurrency: 5, // Ensure the crawler doesn't exceed 15 concurrent requests run at any time maxConcurrency: 15, }); ``` ## Advanced options[​](#advanced-options "Direct link to Advanced options") While the options above should be enough for most users, if we wanted to get super deep into the configuration of the autoscaling pool (the internal utility in Crawlee that allows crawlers to scale up and down), we can do so through the [`autoscaledPoolOptions`](https://crawlee.dev/js/api/cheerio-crawler/interface/CheerioCrawlerOptions.md#autoscaledPoolOptions) object available on crawler options. Complex options up ahead! This section is super advanced and, unless you test the changes extensively and know what you're doing, it's better to leave these options at their defaults, as they are most likely going to work fine without much fuss. With that warning aside, if we're feeling adventurous, this is how we would pass these options when using a crawler: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Pass in advanced options by providing them in the autoscaledPoolOptions autoscaledPoolOptions: { // ... }, }); ``` ### `desiredConcurrency`[​](#desiredconcurrency "Direct link to desiredconcurrency") This option specifies the number of requests that should be running in parallel at the start of the crawler, assuming that many are available. It defaults to the same value as `minConcurrency`. ### `desiredConcurrencyRatio`[​](#desiredconcurrencyratio "Direct link to desiredconcurrencyratio") The minimum ratio of concurrency to reach before more scaling up is allowed (a number between `0` and `1`). By default, it is set to `0.95`. We can think of this as the point where the autoscaling pool can attempt to scale up (or down), monitor if there are any changes, and correct them if necessary.
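As a quick illustration, this is a minimal sketch of how the two options above could be passed through `autoscaledPoolOptions` (the numbers are illustrative, not recommendations):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 50,
    autoscaledPoolOptions: {
        // Start with 10 parallel requests instead of scaling up from the minimum
        desiredConcurrency: 10,
        // Require 90% of the desired concurrency to be reached before scaling up further
        desiredConcurrencyRatio: 0.9,
    },
});
```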
### `scaleUpStepRatio` and `scaleDownStepRatio`[​](#scaleupstepratio-and-scaledownstepratio "Direct link to scaleupstepratio-and-scaledownstepratio") These values define the fractional amount of desired concurrency to be added or subtracted as the autoscaling pool scales up or down. Both of these values default to `0.05`. Every time the autoscaled pool attempts to scale up or down, this value will be added to or subtracted from the current concurrency and, based on the [`desiredConcurrencyRatio`](#desiredconcurrencyratio) and [`maxConcurrency`](#minconcurrency-and-maxconcurrency), determines how many requests can run concurrently. ### `maybeRunIntervalSecs`[​](#mayberunintervalsecs "Direct link to mayberunintervalsecs") Indicates how often the autoscaling pool should check if more requests can be started and, if so, start a new request if any are available. This value is represented in seconds, and defaults to `0.5`. info Changing this has no effect on requests that are fired immediately after the previous ones are finished. However, it will influence how fast new requests will be started after the autoscaled pool scales up. ### `loggingIntervalSecs`[​](#loggingintervalsecs "Direct link to loggingintervalsecs") This option lets us control how often the autoscaled pool should log its current state (the current concurrency ratio, desired ratios, whether the system is overloaded and so on). We can disable logging altogether by setting this to `null`. By default, it is set to `60` seconds. ### `autoscaleIntervalSecs`[​](#autoscaleintervalsecs "Direct link to autoscaleintervalsecs") This option lets us control how often the autoscaling pool should check if it can and should scale up or down. This value is represented in seconds, and defaults to `10`. tip It's recommended you keep this value between `5` and `20` seconds. Be careful with how low, or high, you set this option Setting this option to a value that's too low might have a severe impact on our crawling performance. Conversely, setting this to a value that's too high might mean we leave performance on the table that could have been used for crawling more requests instead. With that said, if you configure this alongside [`scaleUpStepRatio` and `scaleDownStepRatio`](#scaleupstepratio-and-scaledownstepratio), you could make your crawler scale up at a slower interval, but with more requests at a time when it does. ### `maxTasksPerMinute`[​](#maxtasksperminute "Direct link to maxtasksperminute") This controls how many total requests can be made per minute. It counts the number of requests done every second, to ensure there is not a burst of requests at the `maxConcurrency` limit followed by a long period of waiting. By default, it is set to `Infinity`, which means the crawler will keep going up to the `maxConcurrency`. We would set this if we wanted our crawler to work at full throughput, but also not keep hitting the website we're crawling with non-stop requests. info This option can also be set by specifying [`maxRequestsPerMinute`](#maxrequestsperminute) in your crawler options, as it is a shortcut for visibility and ease of access. --- # Session Management ​[`SessionPool`](https://crawlee.dev/js/api/core/class/SessionPool.md) is a class that allows us to handle the rotation of proxy IP addresses along with cookies and other custom settings in Crawlee.
The main benefit of using Session pool is that we can filter out blocked or non-working proxies, so our actor does not retry requests over known blocked/non-working proxies. Another benefit of using SessionPool is that we can store information tied tightly to an IP address, such as cookies, auth tokens, and particular headers. Having our cookies and other identifiers used only with a specific IP will reduce the chance of being blocked. The last but not least benefit is the even rotation of IP addresses - SessionPool picks the session randomly, which should prevent burning out a small pool of available IPs. Check out the [avoid blocking guide](https://crawlee.dev/js/docs/guides/avoid-blocking.md) for more information about blocking. Now let's take a look at the examples of how to use Session pool: * with [`BasicCrawler`](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler.md); * with [`HttpCrawler`](https://crawlee.dev/js/api/http-crawler/class/HttpCrawler.md); * with [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md); * with [`JSDOMCrawler`](https://crawlee.dev/js/api/jsdom-crawler/class/JSDOMCrawler.md); * with [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md); * with [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md); * without a crawler (standalone usage to manage sessions manually). - BasicCrawler - HttpCrawler - CheerioCrawler - JSDOMCrawler - PlaywrightCrawler - PuppeteerCrawler - Standalone ``` import { BasicCrawler, ProxyConfiguration } from 'crawlee'; import { gotScraping } from 'got-scraping'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new BasicCrawler({ // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration. sessionPoolOptions: { maxPoolSize: 100 }, async requestHandler({ request, session }) { const { url } = request; const requestOptions = { url, // We use session id in order to have the same proxyUrl // for all the requests using the same session. proxyUrl: await proxyConfiguration.newUrl(session.id), throwHttpErrors: false, headers: { // If you want to use the cookieJar. // This way you get the Cookie headers string from session. Cookie: session.getCookieString(url), }, }; let response; try { response = await gotScraping(requestOptions); } catch (e) { if (e === 'SomeNetworkError') { // If a network error happens, such as timeout, socket hangup, etc. // There is usually a chance that it was just bad luck // and the proxy works. No need to throw it away. session.markBad(); } throw e; } // Automatically retires the session based on response HTTP status code. session.retireOnBlockedStatusCodes(response.statusCode); if (response.body.blocked) { // You are sure it is blocked. // This will throw away the session. session.retire(); } // Everything is ok, you can get the data. // No need to call session.markGood -> BasicCrawler calls it for you. // If you want to use the CookieJar in session you need. session.setCookiesFromResponse(response); }, }); ``` ``` import { HttpCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new HttpCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration. 
sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookie header to request automatically (default is true). persistCookiesPerSession: true, async requestHandler({ session, body }) { const title = body.match(/<title(?:.*?)>(.*?)<\/title>/)?.[1]; if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in BasicCrawler. } }, }); ``` ``` import { CheerioCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new CheerioCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration. sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookie header to request automatically (default is true). persistCookiesPerSession: true, async requestHandler({ session, $ }) { const title = $('title').text(); if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in BasicCrawler. } }, }); ``` ``` import { JSDOMCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new JSDOMCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration. sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookie header to request automatically (default is true). persistCookiesPerSession: true, async requestHandler({ session, window }) { const title = window.document.title; if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in BasicCrawler. } }, }); ``` ``` import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PlaywrightCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookies to page before navigation automatically (default is true). persistCookiesPerSession: true, async requestHandler({ page, session }) { const title = await page.title(); if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in PlaywrightCrawler. 
} }, }); ``` ``` import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee'; const proxyConfiguration = new ProxyConfiguration({ /* opts */ }); const crawler = new PuppeteerCrawler({ // To use the proxy IP session rotation logic, you must turn the proxy usage on. proxyConfiguration, // Activates the Session pool (default is true). useSessionPool: true, // Overrides default Session pool configuration sessionPoolOptions: { maxPoolSize: 100 }, // Set to true if you want the crawler to save cookies per session, // and set the cookies to page before navigation automatically (default is true). persistCookiesPerSession: true, async requestHandler({ page, session }) { const title = await page.title(); if (title === 'Blocked') { session.retire(); } else if (title === 'Not sure if blocked, might also be a connection error') { session.markBad(); } else { // session.markGood() - this step is done automatically in PuppeteerCrawler. } }, }); ``` ``` import { SessionPool } from 'crawlee'; // Override the default Session pool configuration. const sessionPoolOptions = { maxPoolSize: 100, }; // Open Session Pool. const sessionPool = await SessionPool.open(sessionPoolOptions); // Get session. const session = await sessionPool.getSession(); // Increase the errorScore. session.markBad(); // Throw away the session. session.retire(); // Lower the errorScore and mark the session good. session.markGood(); ``` These are the basics of configuring SessionPool. Please bear in mind that a Session pool needs time to find working IPs and build up the pool, so we will probably see a lot of errors until it stabilizes. --- # TypeScript Projects Crawlee is built with TypeScript, which means it provides type definitions directly in the package. This allows writing code with auto-completion for TypeScript and JavaScript code alike. Besides that, projects written in TypeScript can take advantage of compile-time type-checking and avoid many coding mistakes, while providing documentation for functions, parameters and return values. It also helps a lot with refactoring and ensures that as few bugs as possible sneak through. ## Setting up a TypeScript project[​](#setting-up-a-typescript-project "Direct link to Setting up a TypeScript project") To use TypeScript in our projects, we'll need the following prerequisites: 1. TypeScript compiler `tsc` installed somewhere: ``` npm install --save-dev typescript ``` TypeScript can be a development dependency in our project, as shown above. There's no need to pollute the production environment or the system's global repository with TypeScript. 2. A build script invoking `tsc` and a correctly specified `main` entry point defined in the `package.json` (pointing to the built code): ``` { "scripts": { "build": "tsc" }, "main": "dist/main.js" } ``` 3. Type declarations for Node.js, so we can take advantage of type-checking in all the features we'll use: ``` npm install --save-dev @types/node ``` 4. TypeScript configuration file allowing `tsc` to understand the project layout and the features used in the project: > We are extending [`@apify/tsconfig`](https://github.com/apify/apify-tsconfig); it contains [the set of rules](https://github.com/apify/apify-tsconfig/blob/main/tsconfig.json) we believe are worth following. > To be able to use the feature called [top-level await](https://blog.saeloun.com/2021/11/25/ecmascript-top-level-await.html), we will need to set the `module` and `target` compiler options to `ES2022` or above.
This will make the project compile to [ECMAScript Modules](https://nodejs.org/api/esm.html). tsconfig.json ``` { "extends": "@apify/tsconfig", "compilerOptions": { "module": "ES2022", "target": "ES2022", "outDir": "dist" }, "include": [ "./src/**/*" ] } ``` Place the content above inside a `tsconfig.json` in the root folder. Also, to enjoy using the types in `.js` source files, VSCode users that are using JavaScript should create a `jsconfig.json` with the same content and add `"checkJs": true` to `"compilerOptions"`. > If we want to use one of the browser crawlers, we will also need to add `"lib": ["DOM"]` to the compiler options. Ensure that you have installed `@apify/tsconfig` ``` npm install --save-dev @apify/tsconfig ``` ### Running the project with `ts-node`[​](#running-the-project-with-ts-node "Direct link to running-the-project-with-ts-node") During development, it's handy to run the project directly instead of compiling the TypeScript code to JavaScript every time. We can use `ts-node` for that; just install it as a dev dependency and add a new NPM script: ``` npm install --save-dev ts-node ``` > As mentioned above, our project will be compiled to use ES Modules. Because of this, we need to use the `ts-node-esm` binary. > We use the `-T` or `--transpileOnly` flag; this means the code will **not** be type-checked, which results in faster compilation. If you don't mind the added time and want to do the type checking, just remove this flag. package.json ``` { "scripts": { "start:dev": "ts-node-esm -T src/main.ts" } } ``` ### Running in production[​](#running-in-production "Direct link to Running in production") To run the project in production, we first need to compile it via the build script. After that, we will have the compiled JavaScript code in the `dist` directory, and we can use `node dist/main.js` to run it. package.json ``` { "scripts": { "start:prod": "node dist/main.js" } } ``` ## Docker build[​](#docker-build "Direct link to Docker build") For the `Dockerfile`, we recommend using a multi-stage build, so we don't install dev dependencies like TypeScript in the final image: Dockerfile ``` # using multistage build, as we need dev deps to build the TS source code FROM apify/actor-node:20 AS builder # copy all files, install all dependencies (including dev deps) and build the project COPY . ./ RUN npm install --include=dev \ && npm run build # create final image FROM apify/actor-node:20 # copy only necessary files COPY --from=builder /usr/src/app/package*.json ./ COPY --from=builder /usr/src/app/dist ./dist # install only prod deps RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional # run compiled code CMD npm run start:prod ``` ### Putting it all together[​](#putting-it-all-together "Direct link to Putting it all together") Let's wrap it all up. In addition to the scripts we described above, we also need to set `"type": "module"` in the `package.json` to be able to use the top-level await described above. For convenience, we will have 3 `start` scripts; the default one will be an alias for `start:dev`, which is our `ts-node` script that does not require compilation (nor type checking). The production script (`start:prod`) is then used in the `Dockerfile`, after an explicit `npm run build` call.
package.json ``` { "name": "my-crawlee-project", "type": "module", "main": "dist/main.js", "dependencies": { "crawlee": "3.0.0" }, "devDependencies": { "@apify/tsconfig": "^0.1.0", "@types/node": "^18.14.0", "ts-node": "^10.8.0", "typescript": "^4.7.4" }, "scripts": { "start": "npm run start:dev", "start:prod": "node dist/main.js", "start:dev": "ts-node-esm -T src/main.ts", "build": "tsc" } } ``` --- # Introduction Copy for LLM Crawlee covers your crawling and scraping end-to-end and helps you **build reliable scrapers. Fast.** Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it. ## What you will learn[​](#what-you-will-learn "Direct link to What you will learn") The goal of the introduction is to provide a step-by-step guide to the most important features of Crawlee. It will walk you through creating the simplest of crawlers that only prints text to console, all the way up to a full-featured scraper that collects links from a website and extracts data. ## 🛠 Features[​](#-features "Direct link to 🛠 Features") * Single interface for **HTTP and headless browser** crawling * Persistent **queue** for URLs to crawl (breadth & depth first) * Pluggable **storage** of both tabular data and files * Automatic **scaling** with available system resources * Integrated **proxy rotation** and session management * Lifecycles customizable with **hooks** * **CLI** to bootstrap your projects * Configurable **routing**, **error handling** and **retries** * **Dockerfiles** ready to deploy * Written in **TypeScript** with generics ### 👾 HTTP crawling[​](#-http-crawling "Direct link to 👾 HTTP crawling") * Zero config **HTTP2 support**, even for proxies * Automatic generation of **browser-like headers** * Replication of browser **TLS fingerprints** * Integrated fast **HTML parsers**. Cheerio and JSDOM * Yes, you can scrape **JSON APIs** as well ### 💻 Real browser crawling[​](#-real-browser-crawling "Direct link to 💻 Real browser crawling") * JavaScript **rendering** and **screenshots** * **Headless** and **headful** support * Zero-config generation of **human-like fingerprints** * Automatic **browser management** * Use **Playwright** and **Puppeteer** with the same interface * **Chrome**, **Firefox**, **Webkit** and many others ## Next steps[​](#next-steps "Direct link to Next steps") Next, you will install Crawlee and learn how to bootstrap projects with the Crawlee CLI. --- # Adding more URLs Copy for LLM Previously you've built a very simple crawler that downloads HTML of a single page, reads its title and prints it to the console. This is the original source code: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); } }) await crawler.run(['https://crawlee.dev']); ``` Now you'll use the example from the previous section and improve on it. You'll add more URLs to the queue and thanks to that the crawler will keep going, finding new links, enqueuing them into the `RequestQueue` and then scraping them. 
## How crawling works[​](#how-crawling-works "Direct link to How crawling works") The process is simple: 1. Find new links on the page. 2. Filter only those pointing to the same domain, in this case `crawlee.dev`. 3. Enqueue (add) them to the `RequestQueue`. 4. Visit the newly enqueued links. 5. Repeat the process. In the following paragraphs you will learn about the [`enqueueLinks`](https://crawlee.dev/js/api/core/function/enqueueLinks.md) function which simplifies crawling to a single function call. For comparison and learning purposes we will show an equivalent solution written without `enqueueLinks` in the second code tab. `enqueueLinks` context awareness The `enqueueLinks` function is context aware. It means that it will read the information about the currently crawled page from the context, and you don't need to explicitly provide any arguments. It will find the links using the Cheerio function `$` and automatically add the links to the running crawler's `RequestQueue`. ## Limit your crawls with `maxRequestsPerCrawl`[​](#limit-your-crawls-with-maxrequestspercrawl "Direct link to limit-your-crawls-with-maxrequestspercrawl") When you're just testing your code or when your crawler could potentially find millions of links, it's very useful to set a maximum limit of crawled pages. The option is called `maxRequestsPerCrawl`, is available in all crawlers, and you can set it like this: ``` const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 20, // ... }); ``` This means that no new requests will be started after the 20th request is finished. The actual number of processed requests might be a little higher thanks to parallelization, because the running requests won't be forcefully aborted. It's not even possible in most cases. ## Finding new links[​](#finding-new-links "Direct link to Finding new links") There are numerous approaches to finding links to follow when crawling the web. For our purposes, we will be looking for `<a>` elements that contain the `href` attribute because that's what you need in most cases. For example: ``` <a href="https://crawlee.dev/js/docs/introduction">This is a link to Crawlee introduction</a> ``` Since this is the most common case, it is also the `enqueueLinks` default. * with enqueueLinks * without enqueueLinks src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // Let's limit our crawls to make our // tests shorter and safer. maxRequestsPerCrawl: 20, // enqueueLinks is an argument of the requestHandler async requestHandler({ $, request, enqueueLinks }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); // The enqueueLinks function is context aware, // so it does not require any parameters. await enqueueLinks(); }, }); await crawler.run(['https://crawlee.dev']); ``` src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; import { URL } from 'node:url'; const crawler = new CheerioCrawler({ // Let's limit our crawls to make our // tests shorter and safer. maxRequestsPerCrawl: 20, async requestHandler({ request, $ }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); // Without enqueueLinks, we first have to extract all // the URLs from the page with Cheerio. const links = $('a[href]') .map((_, el) => $(el).attr('href')) .get(); // Then we need to resolve relative URLs, // otherwise they would be unusable for crawling. 
const absoluteUrls = links.map((link) => new URL(link, request.loadedUrl).href); // Finally, we have to add the URLs to the queue await crawler.addRequests(absoluteUrls); }, }); await crawler.run(['https://crawlee.dev']); ``` If you need to override the default selection of elements in `enqueueLinks`, you can use the `selector` argument. ``` await enqueueLinks({ selector: 'div.has-link' }); ``` ## Filtering links to same domain[​](#filtering-links-to-same-domain "Direct link to Filtering links to same domain") Websites typically contain a lot of links that lead away from the original page. This is normal, but when crawling a website, we usually want to crawl that one site and not let our crawler wander away to Google, Facebook and Twitter. Therefore, we need to filter out the off-domain links and only keep the ones that lead to the same domain. * with enqueueLinks * without enqueueLinks src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 20, async requestHandler({ $, request, enqueueLinks }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); // The default behavior of enqueueLinks is to stay on the same hostname, // so it does not require any parameters. // This will ensure the subdomain stays the same. await enqueueLinks(); }, }); await crawler.run(['https://crawlee.dev']); ``` src/main.mjs ``` import { CheerioCrawler } from 'crawlee'; import { URL } from 'node:url'; const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 20, async requestHandler({ request, $ }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); const links = $('a[href]') .map((_, el) => $(el).attr('href')) .get(); // Besides resolving the URLs, we now also need to // grab their hostname for filtering. const { hostname } = new URL(request.loadedUrl); const absoluteUrls = links.map((link) => new URL(link, request.loadedUrl)); // We use the hostname to filter links that point // to a different domain, even subdomain. const sameHostnameLinks = absoluteUrls .filter((url) => url.hostname === hostname) .map((url) => ({ url: url.href })); // Finally, we have to add the URLs to the queue await crawler.addRequests(sameHostnameLinks); }, }); await crawler.run(['https://crawlee.dev']); ``` The default behavior of `enqueueLinks` is to stay on the same hostname. This **does not include subdomains**. To include subdomains in your crawl, use the `strategy` argument. ``` await enqueueLinks({ strategy: 'same-domain' }); ``` When you run the code, you will see the crawler log the **title** of the first page, then the **enqueueing** message showing number of URLs, followed by the **title** of the first enqueued page and so on and so on. ## Skipping duplicate URLs[​](#skipping-duplicate-urls "Direct link to Skipping duplicate URLs") Skipping of duplicate URLs is critical, because visiting the same page multiple times would lead to duplicate results. This is automatically handled by the `RequestQueue` which deduplicates requests using their `uniqueKey`. This `uniqueKey` is automatically generated from the request's URL by lowercasing the URL, lexically ordering query parameters, removing fragments and a few other tweaks that ensure the queue only includes unique URLs. 
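To see the effect in isolation, here is a minimal sketch (the URLs and `uniqueKey` values below are made up purely for demonstration). Near-duplicate URLs collapse into a single request, while an explicit `uniqueKey` lets you enqueue the same URL more than once on purpose:
```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        console.log(`Crawling ${request.url} (uniqueKey: ${request.uniqueKey})`);
    },
});

// Both objects normalize to the same uniqueKey (the fragment is stripped),
// so the queue keeps only one of them.
await crawler.addRequests([
    { url: 'https://crawlee.dev/js/docs/introduction' },
    { url: 'https://crawlee.dev/js/docs/introduction#how-crawling-works' },
]);

// An explicit uniqueKey bypasses the URL-based deduplication, which is handy
// when the same URL should be processed more than once.
await crawler.addRequests([
    { url: 'https://crawlee.dev/js/docs/introduction', uniqueKey: 'intro-first-pass' },
    { url: 'https://crawlee.dev/js/docs/introduction', uniqueKey: 'intro-second-pass' },
]);

await crawler.run();
```
Deduplication is based on the `uniqueKey`, not on the raw URL string, which is why the fragment variant above is treated as a duplicate.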
## Advanced filtering arguments[​](#advanced-filtering-arguments "Direct link to Advanced filtering arguments") While the defaults for `enqueueLinks` can be often exactly what you need, it also gives you fine-grained control over which URLs should be enqueued. One way we already mentioned above. It is using the [`EnqueueStrategy`](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md). You can use the [`All`](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#All) strategy if you want to follow every single link, regardless of its domain, or you can enqueue links that target the same domain name with the [`SameDomain`](https://crawlee.dev/js/api/core/enum/EnqueueStrategy.md#SameDomain) strategy. ``` await enqueueLinks({ strategy: 'all', // wander the internet }); ``` ### Filter URLs with patterns[​](#filter-urls-with-patterns "Direct link to Filter URLs with patterns") For even more control, you can use `globs`, `regexps` and `pseudoUrls` to filter the URLs. Each of those arguments is always an `Array`, but the contents can take on many forms. [See the reference](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md) for more information about them as well as other options. Defaults override If you provide one of those options, the default `same-hostname` strategy will **not** be applied unless explicitly set in the options. ``` await enqueueLinks({ globs: ['http?(s)://apify.com/*/*'], }); ``` ### Transform requests[​](#transform-requests "Direct link to Transform requests") To have absolute control, we have the [`transformRequestFunction`](https://crawlee.dev/js/api/core/interface/EnqueueLinksOptions.md#transformRequestFunction). Just before a new [`Request`](https://crawlee.dev/js/api/core/class/Request.md) is constructed and enqueued to the [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md), this function can be used to skip it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple requests to the queue, and these requests share the same URL, but differ in methods or payloads. Another use case is to dynamically update or create the `userData`. ``` await enqueueLinks({ globs: ['http?(s)://apify.com/*/*'], transformRequestFunction(req) { // ignore all links ending with `.pdf` if (req.url.endsWith('.pdf')) return false; return req; }, }); ``` And that's it! `enqueueLinks()` is just one example of Crawlee's powerful helper functions. They're all designed to make your life easier, so you can focus on getting your data, while leaving the mundane crawling management to the tools. ## Next steps[​](#next-steps "Direct link to Next steps") Next, you will start your project of scraping a production website and learn some more Crawlee tricks in the process. --- # Crawling the Store Copy for LLM To crawl the whole [example Warehouse Store](https://warehouse-theme-metal.myshopify.com/collections) and find all the data, you first need to visit all the pages with products - going through all categories available and also all the product detail pages. ## Crawling the listing pages[​](#crawling-the-listing-pages "Direct link to Crawling the listing pages") In previous lessons, you used the `enqueueLinks()` function like this: ``` await enqueueLinks(); ``` While useful in that scenario, you need something different now. 
Instead of finding all the `<a href="..">` elements with links to the same hostname, you need to find only the specific ones that will take your crawler to the next page of results. Otherwise, the crawler will visit a lot of other pages that you're not interested in. Using the power of DevTools and yet another `enqueueLinks()` parameter, this becomes fairly easy. ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page, request, enqueueLinks }) => { console.log(`Processing: ${request.url}`); // Only run this logic on the main category listing, not on sub-pages. if (request.label !== 'CATEGORY') { // Wait for the category cards to render, // otherwise enqueueLinks wouldn't enqueue anything. await page.waitForSelector('.collection-block-item'); // Add links to the queue, but only from // elements matching the provided selector. await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY', }); } }, }); await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']); ``` The code should look pretty familiar to you. It's a very simple `requestHandler` where we log the currently processed URL to the console and enqueue more links. But there are also a few new, interesting additions. Let's break it down. ### The `selector` parameter of `enqueueLinks()`[​](#the-selector-parameter-of-enqueuelinks "Direct link to the-selector-parameter-of-enqueuelinks") When you previously used `enqueueLinks()`, you were not providing any `selector` parameter, and it was fine, because you wanted to use the default value, which is `a` - finds all `<a>` elements. But now, you need to be more specific. There are multiple `<a>` links on the `Categories` page, and you're only interested in those that will take your crawler to the available list of results. Using the DevTools, you'll find that you can select the links you need using the `.collection-block-item` selector, which selects all the elements that have the `class=collection-block-item` attribute. ### The `label` of `enqueueLinks()`[​](#the-label-of-enqueuelinks "Direct link to the-label-of-enqueuelinks") You will see `label` used often throughout Crawlee, as it's a convenient way of labelling a `Request` instance for quick identification later. You can access it with `request.label` and it's a `string`. You can name your requests any way you want. Here, we used the label `CATEGORY` to note that we're enqueueing pages that represent a category of products. The `enqueueLinks()` function will add this label to all requests before enqueueing them to the `RequestQueue`. Why this is useful will become obvious in a minute. ## Crawling the detail pages[​](#crawling-the-detail-pages "Direct link to Crawling the detail pages") In a similar fashion, you need to collect all the URLs to the product detail pages, because only from there you can scrape all the data you need. The following code only repeats the concepts you already know for another set of links. ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page, request, enqueueLinks }) => { console.log(`Processing: ${request.url}`); if (request.label === 'DETAIL') { // We're not doing anything with the details yet. } else if (request.label === 'CATEGORY') { // We are now on a category page. 
We can use this to paginate through and enqueue all products, // as well as any subsequent pages we find await page.waitForSelector('.product-item > a'); await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL', // <= note the different label }); // Now we need to find the "Next" button and enqueue the next page of results (if it exists) const nextButton = await page.$('a.pagination__next'); if (nextButton) { await enqueueLinks({ selector: 'a.pagination__next', label: 'CATEGORY', // <= note the same label }); } } else { // This means we're on the start page, with no label. // On this page, we just want to enqueue all the category pages. await page.waitForSelector('.collection-block-item'); await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY', }); } }, }); await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']); ``` The crawling code is now complete. When you run the code, you'll see the crawler visit all the listing URLs and all the detail URLs. ## Next steps[​](#next-steps "Direct link to Next steps") This concludes the Crawling lesson, because you have taught the crawler to visit all the pages it needs. Let's continue with scraping data. --- # Running your crawler in the Cloud Copy for LLM ## Apify Platform[​](#apify-platform "Direct link to Apify Platform") Crawlee is developed by [**Apify**](https://apify.com), the web scraping and automation platform. You could say it is the **home of Crawlee projects**. In this section you'll see how to deploy the crawler there with just a few simple steps. You can deploy a **Crawlee** project wherever you want, but using the [**Apify Platform**](https://console.apify.com) will give you the best experience. In case you want to deploy your Crawlee project to other platforms, check out the [**Deployment**](https://crawlee.dev/js/docs/deployment.md) section. With a few simple steps, you can convert your Crawlee project into a so-called **Actor**. Actors are serverless micro-apps that are easy to develop, run, share, and integrate. The infra, proxies, and storages are ready to go. [Learn more about Actors](https://apify.com/actors). Choosing between Crawlee CLI and Apify CLI for project setup We started this guide by using the Crawlee CLI to bootstrap the project - it offers the basic Crawlee templates, including a ready-made `Dockerfile`. If you know you will be deploying your project to the Apify Platform, you might want to start with the Apify CLI instead. It also offers several project templates, and those are all set up to be used on the Apify Platform right ahead. ## Dependencies[​](#dependencies "Direct link to Dependencies") The first step will be installing two new dependencies: * Apify SDK, a toolkit for working with the Apify Platform. This will allow us to wire the storages (e.g. `RequestQueue` and `Dataset`) to the Apify cloud products. This will be a dependency of our Node.js project. ``` npm install apify ``` * Apify CLI, a command-line tool that will help us with authentication and deployment. This will be a globally installed tool, you will install it only once and use it in all your Crawlee/Apify projects. ``` npm install -g apify-cli ``` ## Logging in to the Apify Platform[​](#logging-in-to-the-apify-platform "Direct link to Logging in to the Apify Platform") The next step will be [creating your Apify account](https://console.apify.com/sign-up). Don't worry, we have a **free tier**, so you can try things out before you buy in! 
Once you have that, it's time to log in with the just-installed [Apify CLI](https://docs.apify.com/cli/). You will need your personal access token, which you can find at <https://console.apify.com/account#/integrations>. ``` apify login ``` ## Adjusting the code[​](#adjusting-the-code "Direct link to Adjusting the code") Now that you have your account set up, you will need to adjust the code a tiny bit. We will use the [Apify SDK](https://docs.apify.com/sdk/js/), which will help us to wire the Crawlee storages (like the `RequestQueue`) to their Apify Platform counterparts - otherwise Crawlee would keep things only in memory. Open your `src/main.js` file (or `src/main.ts` if you used a TypeScript template), and add `Actor.init()` to the beginning of your main script and `Actor.exit()` to the end of it. Don't forget to `await` those calls, as both functions are async. Your code should look like this: src/main.js ``` import { Actor } from 'apify'; import { PlaywrightCrawler, log } from 'crawlee'; import { router } from './routes.mjs'; await Actor.init(); // This is better set with CRAWLEE_LOG_LEVEL env var // or a configuration option. This is just for show 😈 log.setLevel(log.LEVELS.DEBUG); log.debug('Setting up crawler.'); const crawler = new PlaywrightCrawler({ // Instead of the long requestHandler with // if clauses we provide a router instance. requestHandler: router, }); await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']); await Actor.exit(); ``` The `Actor.init()` call will configure Crawlee to use the Apify API instead of its default memory storage interface. It also sets up few other things, like listening to the platform events via websockets. The `Actor.exit()` call then handles graceful shutdown - it will close the open handles created by the `Actor.init()` call, as without that, the Node.js process would be stuck. Understanding `Actor.init()` behavior with environment variables The `Actor.init()` call works conditionally based on the environment variables, namely based on the `APIFY_IS_AT_HOME` env var, which is set to `true` on the Apify Platform. This means that your project will remain working the same locally, but will use the Apify API when deployed to the Apify Platform. ## Initializing the project[​](#initializing-the-project "Direct link to Initializing the project") You will also need to initialize the project for Apify, to do that, use the Apify CLI again: ``` apify init ``` This will create a folder called `.actor`, and an `actor.json` file inside it - this file contains the configuration relevant to the Apify Platform, namely the Actor name, version, build tag, and few other things. Check out the [relevant documentation](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) to see all the different things you can set there up. ## Ship it\![​](#ship-it "Direct link to Ship it!") And that's all, your project is now ready to be published on the Apify Platform. You can use the Apify CLI once more to do that: ``` apify push ``` This command will create an archive from your project, upload it to the Apify Platform and initiate a Docker build. Once finished, you will get a link to your new Actor on the platform. ## Learning more about web scraping[​](#learning-more-about-web-scraping "Direct link to Learning more about web scraping") Explore Apify Academy Resources If you want to learn more about web scraping and browser automation, check out the [Apify Academy](https://developers.apify.com/academy). 
It's full of courses and tutorials on the topic. From beginner to advanced. And the best thing: **It's free and open source** ❤️ If you want to do one more project, checkout our tutorial on building a [HackerNews scraper using Crawlee](https://blog.apify.com/crawlee-web-scraping-tutorial/). ## Thank you! 🎉[​](#thank-you- "Direct link to Thank you! 🎉") That's it! Thanks for reading the whole introduction and if there's anything wrong, please 🙏 let us know on [GitHub](https://github.com/apify/crawlee) or in our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! 👋 --- # First crawler Copy for LLM Now, you will build your first crawler. But before you do, let's briefly introduce the Crawlee classes involved in the process. ## How Crawlee works[​](#how-crawlee-works "Direct link to How Crawlee works") There are 3 main crawler classes available for use in Crawlee. * [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md) * [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) * [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md) We'll talk about their differences later. Now, let's talk about what they have in common. The general idea of each crawler is to go to a web page, open it, do some stuff there, save some results, continue to the next page, and repeat this process until the crawler's done its job. So the crawler always needs to find answers to two questions: *Where should I go?* and *What should I do there?* Answering those two questions is the only required setup. The crawlers have reasonable defaults for everything else. ### The Where - `Request` and `RequestQueue`[​](#the-where---request-and-requestqueue "Direct link to the-where---request-and-requestqueue") All crawlers use instances of the [`Request`](https://crawlee.dev/js/api/core/class/Request.md) class to determine where they need to go. Each request may hold a lot of information, but at the very least, it must hold a URL - a web page to open. But having only one URL would not make sense for crawling. Sometimes you have a pre-existing list of your own URLs that you wish to visit, perhaps a thousand. Other times you need to build this list dynamically as you crawl, adding more and more URLs to the list as you progress. Most of the time, you will use both options. The requests are stored in a [`RequestQueue`](https://crawlee.dev/js/api/core/class/RequestQueue.md), a dynamic queue of `Request` instances. You can seed it with start URLs and also add more requests while the crawler is running. This allows the crawler to open one page, extract interesting URLs, such as links to other pages on the same domain, add them to the queue (called *enqueuing*) and repeat this process to build a queue of virtually unlimited number of URLs. ### The What - `requestHandler`[​](#the-what---requesthandler "Direct link to the-what---requesthandler") In the `requestHandler` you tell the crawler what to do at each and every page it visits. You can use it to handle extraction of data from the page, processing the data, saving it, calling APIs, doing calculations and so on. The `requestHandler` is a user-defined function, invoked automatically by the crawler for each `Request` from the `RequestQueue`. It always receives a single argument - a [`CrawlingContext`](https://crawlee.dev/js/api/core/interface/CrawlingContext.md). 
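As a quick sketch of that shape (using `CheerioCrawler` purely as an example, with an arbitrary log message), the crawler calls your handler with one context object that you typically destructure:
```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Invoked once for every Request in the queue. The single argument
    // is the CrawlingContext, destructured here into request and $.
    async requestHandler({ request, $ }) {
        console.log(`Visiting ${request.url}`);
    },
});
```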
Its properties change depending on the crawler class used, but it always includes the `request` property, which represents the currently crawled URL and related metadata. ## Building a crawler[​](#building-a-crawler "Direct link to Building a crawler") Let's put the theory into practice and start with something easy. Visit a page and get its HTML title. In this tutorial, you'll scrape the Crawlee website <https://crawlee.dev>, but the same code will work for any website. Top level await configuration We are using a JavaScript feature called [Top level await](https://blog.saeloun.com/2021/11/25/ecmascript-top-level-await.html) in our examples. To be able to use that, you might need some extra setup. Namely, it requires the use of [ECMAScript Modules](https://nodejs.org/api/esm.html) - this means you either need to add `"type": "module"` to your `package.json` file, or use `*.mjs` extension for your files. Additionally, if you are in a TypeScript project, you need to set the `module` and `target` compiler options to `ES2022` or above. ### Adding requests to the crawling queue[​](#adding-requests-to-the-crawling-queue "Direct link to Adding requests to the crawling queue") Earlier you learned that the crawler uses a queue of requests as its source of URLs to crawl. Let's create it and add the first request. src/main.js ``` import { RequestQueue } from 'crawlee'; // First you create the request queue instance. const requestQueue = await RequestQueue.open(); // And then you add one or more requests to it. await requestQueue.addRequest({ url: 'https://crawlee.dev' }); ``` The [`requestQueue.addRequest()`](https://crawlee.dev/js/api/core/class/RequestQueue.md#addRequest) function automatically converts the object with URL string to a [`Request`](https://crawlee.dev/js/api/core/class/Request.md) instance. So now you have a `requestQueue` that holds one request which points to `https://crawlee.dev`. Bulk add requests The code above is for illustration of the request queue concept. Soon you'll learn about the `crawler.addRequests()` method which allows you to skip this initialization code, and it also supports adding a large number of requests without blocking. ### Building a CheerioCrawler[​](#building-a-cheeriocrawler "Direct link to Building a CheerioCrawler") Crawlee comes with three main crawler classes: [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). You can read their short descriptions in the [Quick start](https://crawlee.dev/js/docs/quick-start.md) lesson. Unless you have a good reason to start with a different one, you should try building a `CheerioCrawler` first. It is an HTTP crawler with HTTP2 support, anti-blocking features and integrated HTML parser - [Cheerio](https://www.npmjs.com/package/cheerio). It's fast, simple, cheap to run and does not require complicated dependencies. The only downside is that it won't work out of the box for websites which require JavaScript rendering. But you might not need JavaScript rendering at all, because many modern websites use server-side rendering. Let's continue with the earlier `RequestQueue` example. 
src/main.js ``` // Add import of CheerioCrawler import { RequestQueue, CheerioCrawler } from 'crawlee'; const requestQueue = await RequestQueue.open(); await requestQueue.addRequest({ url: 'https://crawlee.dev' }); // Create the crawler and add the queue with our URL // and a request handler to process the page. const crawler = new CheerioCrawler({ requestQueue, // The `$` argument is the Cheerio object // which contains parsed HTML of the website. async requestHandler({ $, request }) { // Extract <title> text with Cheerio. // See Cheerio documentation for API docs. const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); } }) // Start the crawler and wait for it to finish await crawler.run(); ``` When you run the example, you will see the title of <https://crawlee.dev> printed to the log. What really happens is that CheerioCrawler first makes an HTTP request to `https://crawlee.dev`, then parses the received HTML with Cheerio and makes it available as the `$` argument of the `requestHandler`. ``` The title of "https://crawlee.dev" is: Crawlee · The scalable web crawling, scraping and automation library for JavaScript/Node.js | Crawlee. ``` ### Add requests faster[​](#add-requests-faster "Direct link to Add requests faster") Earlier we mentioned that you'll learn how to use the `crawler.addRequests()` method to skip the request queue initialization. It's simple. Every crawler has an implicit `RequestQueue` instance, and you can add requests to it with the `crawler.addRequests()` method. In fact, you can go even further and just use the first parameter of `crawler.run()`! src/main.js ``` // You don't need to import RequestQueue anymore import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ async requestHandler({ $, request }) { const title = $('title').text(); console.log(`The title of "${request.url}" is: ${title}.`); } }) // Start the crawler with the provided URLs await crawler.run(['https://crawlee.dev']); ``` When you run this code, you'll see exactly the same output as with the earlier, longer example. The `RequestQueue` is still there, it's just managed by the crawler automatically. info This method not only makes the code shorter, it will help with performance too! It will wait only for the initial batch of 1000 requests to be added to the queue before resolving, which means the processing will start almost instantly. After that, it will continue adding the rest of the requests in the background (again, in batches of 1000 items, once every second). ## Next steps[​](#next-steps "Direct link to Next steps") Next, you'll learn about crawling links. That means finding new URLs on the pages you crawl and adding them to the `RequestQueue` for the crawler to visit. --- # Getting some real-world data Copy for LLM > *Hey, guys, you know, it's cool that we can scrape the `<title>` elements of web pages, but that's not very useful. Can we finally scrape some real data and save it somewhere in a machine-readable format? Because that's why I started reading this tutorial in the first place!* We hear you, young padawan! First, learn how to crawl, you must. Only then, walk through data, you can! ## Making a production-grade crawler[​](#making-a-production-grade-crawler "Direct link to Making a production-grade crawler") Making a production-grade crawler is not difficult, but there are many pitfalls of scraping that can catch you off guard. 
So for the real world project you'll learn how to scrape an [example Warehouse Store](https://warehouse-theme-metal.myshopify.com/collections) instead of the Crawlee website. It contains a list of products of different categories, and each product has its own detail page. The website requires JavaScript rendering, which allows us to showcase more features of Crawlee. We've also added some helpful tips that prepare you for the real-world issues that you will surely encounter when scraping at scale. Not interested in theory? If you're not interested in crawling theory, feel free to [skip to the next chapter](https://crawlee.dev/js/docs/introduction/crawling.md) and get right back to coding. ## Drawing a plan[​](#drawing-a-plan "Direct link to Drawing a plan") Sometimes scraping is really straightforward, but most of the time, it really pays off to do a bit of research first and try to answer some of these questions: * How is the website structured? * Can I scrape it only with HTTP requests (read "with `CheerioCrawler`")? * Do I need a headless browser for something? * Are there any anti-scraping protections in place? * Do I need to parse the HTML or can I get the data otherwise, such as directly from the website's API? For the purposes of this tutorial, let's assume that the website cannot be scraped with `CheerioCrawler`. It actually can, but we would have to dive a bit deeper than this introductory guide allows. So for now we will make things easier for you, scrape it with `PlaywrightCrawler`, and you'll learn about headless browsers in the process. ## Choosing the data you need[​](#choosing-the-data-you-need "Direct link to Choosing the data you need") A good first step is to figure out what data you want to scrape and where to find it. For the time being, let's just agree that we want to scrape all products from all categories available on the [All collections page of the store](https://warehouse-theme-metal.myshopify.com/collections) and for each product we want to get its: * URL * Manufacturer * SKU * Title * Current price * Stock available You will notice that some information is available directly on the list page, but for details such as "SKU" we'll also need to open the product's detail page. ![data to scrape](/assets/images/scraping-practice-ed4e3a233c852ffa694b80371fed9d37.jpg "Overview of data to be scraped.") ### The start URL(s)[​](#the-start-urls "Direct link to The start URL(s)") This is where you start your crawl. It's convenient to start as close to the data as possible. For example, it wouldn't make much sense to start at `https://warehouse-theme-metal.myshopify.com/` and look for a `collections` link there, when we already know that everything we want to extract can be found at the `https://warehouse-theme-metal.myshopify.com/collections` page. ## Exploring the page[​](#exploring-the-page "Direct link to Exploring the page") Let's take a look at the `https://warehouse-theme-metal.myshopify.com/collections` page more carefully. There are some **categories** on the page, and each category has a list of **items**. On some category pages, at the bottom you will notice there are links to the next pages of results. This is usually called **the pagination**. ### Categories and sorting[​](#categories-and-sorting "Direct link to Categories and sorting") When you click the categories, you'll see that they load a page of products filtered by that category. 
By going through a few categories and observing the behavior, we can also observe that we can sort by different conditions (such as `Best selling`, or `Price, low to high`), but for this example, we will not be looking into those. Limited pagination Be careful, because on some websites, like [amazon.com](https://amazon.com), this is not true and the sum of products in categories is actually larger than what's available without filters. Learn more in our [tutorial on scraping websites with limited pagination](https://docs.apify.com/tutorials/scrape-paginated-sites). ### Pagination[​](#pagination "Direct link to Pagination") The pagination of the demo Warehouse Store is simple enough. When switching between pages, you will see that the URL changes to: ``` https://warehouse-theme-metal.myshopify.com/collections/headphones?page=2 ``` Try clicking on the link to page 4. You'll see that the pagination links update and show more pages. But can you trust that this will include all pages and won't stop at some point? Test your assumptions Similarly to the issue with filters explained above, the existence of pagination does not guarantee that you can simply paginate through all the results. Always test your assumptions about pagination. Otherwise, you might miss a chunk of results, and not even know about it. At the time of writing the `Headphones` collection results counter showed 75 results - products. Quick count of products on one page of results makes 24. 6 rows times 4 products. This means that there are 4 pages of results. If you're not convinced, you can visit a page somewhere in the middle, like `https://warehouse-theme-metal.myshopify.com/collections/headphones?page=2` and see how the pagination looks there. ## The crawling strategy[​](#the-crawling-strategy "Direct link to The crawling strategy") Now that you know where to start and how to find all the Actor details, let's look at the crawling process. 1. Visit the store page containing the list of categories (our start URL). 2. Enqueue all links to all categories. 3. Enqueue all product pages from the current page. 4. Enqueue links to next pages of results. 5. Open the next page in queue. <!-- --> * When it's a results list page, go to 2. * When it's a product page, scrape the data. 6. Repeat until all results pages and all products have been processed. `PlaywrightCrawler` will make sure to visit the pages for you, if you provide the correct requests, and you already know how to enqueue pages, so this should be fairly easy. Nevertheless, there are few more tricks that we'd like to showcase. ## Sanity check[​](#sanity-check "Direct link to Sanity check") Let's check that everything is set up correctly before writing the scraping logic itself. You might realize that something in your previous analysis doesn't quite add up, or the website might not behave exactly as you expected. The example below creates a new crawler that visits the start URL and prints the text content of all the categories on that page. When you run the code, you will see the *very badly formatted* content of the individual category card. * Playwright * Playwright with Cheerio src/main.mjs ``` // Instead of CheerioCrawler let's use Playwright // to be able to render JavaScript. import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page }) => { // Wait for the actor cards to render. 
        await page.waitForSelector('.collection-block-item');

        // Execute a function in the browser which targets
        // the category card elements and allows their manipulation.
        const categoryTexts = await page.$$eval('.collection-block-item', (els) => {
            // Extract text content from the category cards
            return els.map((el) => el.textContent);
        });

        categoryTexts.forEach((text, i) => {
            console.log(`CATEGORY_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
src/main.mjs
```
// Instead of CheerioCrawler let's use Playwright
// to be able to render JavaScript.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, parseWithCheerio }) => {
        // Wait for the category cards to render.
        await page.waitForSelector('.collection-block-item');

        // Extract the page's HTML from browser
        // and parse it with Cheerio.
        const $ = await parseWithCheerio();

        // Use familiar Cheerio syntax to
        // select all the category cards.
        $('.collection-block-item').each((i, el) => {
            const text = $(el).text();
            console.log(`CATEGORY_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
If you're wondering how we got that `.collection-block-item` selector, we'll explain it in the next section on DevTools. ## DevTools - the scraper's toolbox[​](#devtools---the-scrapers-toolbox "Direct link to DevTools - the scraper's toolbox") DevTool choice We'll use Chrome DevTools here, since Chrome is the most common browser, but feel free to use any other, they're all very similar. Let's open DevTools by going to <https://warehouse-theme-metal.myshopify.com/collections> in Chrome and then right-clicking anywhere in the page and selecting **Inspect**, or by pressing **F12** or whatever your system prefers. With DevTools, you can inspect or manipulate any aspect of the currently open web page. You can learn more about DevTools in their [official documentation](https://developer.chrome.com/docs/devtools/). ## Selecting elements[​](#selecting-elements "Direct link to Selecting elements") In the DevTools, choose the **Select an element** tool and try hovering over one of the category cards. ![select an element](/assets/images/select-an-element-63e42331a0df1985c597ffc8ead02a0f.png "Finding the select an element tool.") You'll see that you can select different elements inside the card. Instead, select the whole card, not just some of its contents, such as its title or description. ![selected element](/assets/images/selected-element-652798a29828d5b1a4d893c2de7a0e75.png "Selecting an element by hovering over it.") Selecting an element will highlight it in the DevTools HTML inspector. When you look carefully at the elements, you'll see that there are some **classes** attached to the different HTML elements. Those are called **CSS classes**, and we can make use of them in scraping. Conversely, by hovering over elements in the HTML inspector, you will see them highlight on the page. Inspect the page's structure around the collection card. You'll see that all the card's data is displayed in an `<a>` element with a `class` attribute that includes **collection-block-item**. It should now make sense how we got that `.collection-block-item` selector. It's just a way to find all elements that are annotated with the `collection-block-item` class. It's always a good idea to double-check that you're not getting any unwanted elements with this class.
To do that, go into the **Console** tab of DevTools and run:
```
document.querySelectorAll('.collection-block-item');
```
You will see that only the 31 collection cards will be returned, and nothing else. Learn more about CSS selectors and DevTools CSS selectors and DevTools are quite a big topic. If you want to learn more, visit the [Web scraping for beginners course](https://developers.apify.com/academy/web-scraping-for-beginners) in the Apify Academy. **It's free and open-source** ❤️. ## Next steps[​](#next-steps "Direct link to Next steps") Next, you will crawl the whole store, including all the listing pages and all the product detail pages. --- # Refactoring It may seem that the data is extracted and the crawler is done, but honestly, this is just the beginning. For the sake of brevity, we've completely omitted error handling, proxies, logging, architecture, tests, documentation and other things that reliable software should have. The good thing is, **error handling is mostly done by Crawlee itself**, so no worries on that front, unless you need some custom magic. Navigating automatic bot-protection avoidance You might be wondering about the **anti-blocking, bot-protection avoiding stealthy features** and why we haven't highlighted them yet. The reason is straightforward: these features are **automatically used** within the default configuration, providing a smooth start without manual adjustments. However, the default configuration, while powerful, may not cover every scenario. If you want to learn more, browse the [Avoid getting blocked](https://crawlee.dev/js/docs/guides/avoid-blocking.md), [Proxy management](https://crawlee.dev/js/docs/guides/proxy-management.md) and [Session management](https://crawlee.dev/js/docs/guides/session-management.md) guides. Anyway, to promote good coding practices, let's look at how you can use a [`Router`](https://crawlee.dev/js/api/core/class/Router.md) to better structure your crawler code. ## Routing[​](#routing "Direct link to Routing") In the following code we've made several changes: * Split the code into multiple files. * Replaced `console.log` with the Crawlee logger for nicer, colourful logs. * Added a `Router` to make our routing cleaner, without `if` clauses. In our `main.mjs` file, we place the general structure of the crawler: src/main.mjs
```
import { PlaywrightCrawler, log } from 'crawlee';
import { router } from './routes.mjs';

// This is better set with CRAWLEE_LOG_LEVEL env var
// or a configuration option. This is just for show 😈
log.setLevel(log.LEVELS.DEBUG);

log.debug('Setting up crawler.');
const crawler = new PlaywrightCrawler({
    // Instead of the long requestHandler with
    // if clauses we provide a router instance.
    requestHandler: router,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
Then in a separate `routes.mjs` file: src/routes.mjs
```
import { createPlaywrightRouter, Dataset } from 'crawlee';

// createPlaywrightRouter() is only a helper to get better
// intellisense and typings. You can use Router.create() too.
export const router = createPlaywrightRouter();

// This replaces the request.label === DETAIL branch of the if clause.
router.addHandler('DETAIL', async ({ request, page, log }) => { log.debug(`Extracting data: ${request.url}`); const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440'] const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser' const title = await page.locator('.product-meta h1').textContent(); const sku = await page .locator('span.product-meta__sku-number') .textContent(); const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; const results = { url: request.url, manufacturer, title, sku, currentPrice: price, availableInStock: inStock, }; log.debug(`Saving data: ${request.url}`); await Dataset.pushData(results); }); router.addHandler('CATEGORY', async ({ page, enqueueLinks, request, log }) => { log.debug(`Enqueueing pagination for: ${request.url}`); // We are now on a category page. We can use this to paginate through and enqueue all products, // as well as any subsequent pages we find await page.waitForSelector('.product-item > a'); await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL', // <= note the different label }); // Now we need to find the "Next" button and enqueue the next page of results (if it exists) const nextButton = await page.$('a.pagination__next'); if (nextButton) { await enqueueLinks({ selector: 'a.pagination__next', label: 'CATEGORY', // <= note the same label }); } }); // This is a fallback route which will handle the start URL // as well as the LIST labeled URLs. router.addDefaultHandler(async ({ request, page, enqueueLinks, log }) => { log.debug(`Enqueueing categories from page: ${request.url}`); // This means we're on the start page, with no label. // On this page, we just want to enqueue all the category pages. await page.waitForSelector('.collection-block-item'); await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY', }); }); ``` Let's explore the changes in more detail. We believe these modification will enhance the readability and manageability of the crawler. ## Splitting your code into multiple files[​](#splitting-your-code-into-multiple-files "Direct link to Splitting your code into multiple files") There's no reason not to split your code into multiple files and keep your logic separate. Less code in a single file means less code you need to think about at any time, and that's good. We would most likely go even further and split even the routes into separate files. ## Using Crawlee `log` instead of `console.log`[​](#using-crawlee-log-instead-of-consolelog "Direct link to using-crawlee-log-instead-of-consolelog") We won't go to great lengths here to talk about `log` object from Crawlee, because you can read all about it in the [documentation](https://crawlee.dev/js/api/core/class/Log.md), but there's just one thing that we need to stress: **log levels**. Crawlee `log` has multiple log levels, such as `log.debug`, `log.info` or `log.warning`. It not only makes your log more readable, but it also allows selective turning off of some levels by either calling the `log.setLevel()` function or by setting the `CRAWLEE_LOG_LEVEL` environment variable. 
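For instance (a minimal sketch with arbitrary messages), switching levels looks like this:
```
import { log } from 'crawlee';

// During development, show everything, including debug messages.
log.setLevel(log.LEVELS.DEBUG);
log.debug('Extracting data from the detail page...');
log.info('Crawler started.');

// In production, raise the level (or set CRAWLEE_LOG_LEVEL=WARNING)
// and any subsequent debug/info calls are skipped.
log.setLevel(log.LEVELS.WARNING);
log.warning('Only warnings and errors are printed now.');
```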
Thanks to this you can add a lot of debug logs to your crawler without polluting your log when they're not needed, but ready to help when you encounter issues. ## Using a router to structure your crawling[​](#using-a-router-to-structure-your-crawling "Direct link to Using a router to structure your crawling") Initially, using a simple `if/else` statement for selecting different logic based on the crawled pages might appear more readable. However, this approach can become cumbersome with more than two types of pages, especially when the logic for each page extends over dozens or even hundreds of lines of code. It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand line long `requestHandler()` where everything interacts with everything and variables can be used everywhere is not a beautiful thing to do and a pain to debug. That's why we prefer the separation of routes into their own files. ## Next steps[​](#next-steps "Direct link to Next steps") In the next and final step, you'll see how to deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a **Dockerfile** ready, and the next section will show you how to deploy it to the [Apify Platform](https://crawlee.dev/js/docs/deployment/apify-platform.md) with ease. --- # Saving data Copy for LLM A data extraction job would not be complete without saving the data for later use and processing. You've come to the final and most difficult part of this tutorial so make sure to pay attention very carefully! First, add a new import to the top of the file: ``` import { PlaywrightCrawler, Dataset } from 'crawlee'; ``` Then, replace the `console.log(results)` call with: ``` await Dataset.pushData(results); ``` and that's it. Unlike earlier, we are being serious now. That's it, you're done. 
The final code looks like this:
```
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);

        if (request.label === 'DETAIL') {
            const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
            const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

            const title = await page.locator('.product-meta h1').textContent();
            const sku = await page.locator('span.product-meta__sku-number').textContent();

            const priceElement = page
                .locator('span.price')
                .filter({
                    hasText: '$',
                })
                .first();

            const currentPriceString = await priceElement.textContent();
            const rawPrice = currentPriceString.split('$')[1];
            const price = Number(rawPrice.replaceAll(',', ''));

            const inStockElement = page
                .locator('span.product-form__inventory')
                .filter({
                    hasText: 'In stock',
                })
                .first();

            const inStock = (await inStockElement.count()) > 0;

            const results = {
                url: request.url,
                manufacturer,
                title,
                sku,
                currentPrice: price,
                availableInStock: inStock,
            };

            await Dataset.pushData(results);
        } else if (request.label === 'CATEGORY') {
            // We are now on a category page. We can use this to paginate through and enqueue all products,
            // as well as any subsequent pages we find

            await page.waitForSelector('.product-item > a');
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            // Now we need to find the "Next" button and enqueue the next page of results (if it exists)
            const nextButton = await page.$('a.pagination__next');
            if (nextButton) {
                await enqueueLinks({
                    selector: 'a.pagination__next',
                    label: 'CATEGORY', // <= note the same label
                });
            }
        } else {
            // This means we're on the start page, with no label.
            // On this page, we just want to enqueue all the category pages.

            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },

    // Let's limit our crawls to make our tests shorter and safer.
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```
## What's `Dataset.pushData()`[​](#whats-datasetpushdata "Direct link to whats-datasetpushdata") [`Dataset.pushData()`](https://crawlee.dev/js/api/core/class/Dataset.md#pushData) is a function that saves data to the default [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset.md).
`Dataset` is a storage designed to hold data in a format similar to a table. Each time you call `Dataset.pushData()` a new row in the table is created, with the property names serving as column titles. In the default configuration, the rows are represented as JSON files saved on your disk, but other storage systems can be plugged into Crawlee as well. Automatic dataset initialization in Crawlee Each time you start Crawlee a default `Dataset` is automatically created, so there's no need to initialize it or create an instance first. You can create as many datasets as you want and even give them names. For more details see the [Result storage guide](https://crawlee.dev/js/docs/guides/result-storage.md#dataset) and the [`Dataset.open()`](https://crawlee.dev/js/api/core/class/Dataset.md#open) function. ## Finding saved data[​](#finding-saved-data "Direct link to Finding saved data") Unless you changed the configuration that Crawlee uses locally, which would suggest that you knew what you were doing, and you didn't need this tutorial anyway, you'll find your data in the `storage` directory that Crawlee creates in the working directory of the running script:
```
{PROJECT_FOLDER}/storage/datasets/default/
```
The above folder will hold all your saved data in numbered files, as they were pushed into the dataset. Each file represents one invocation of `Dataset.pushData()` or one table row. Single file data storage options If you would like to store your data in a single big file, instead of many small ones, see the [Result storage guide](https://crawlee.dev/js/docs/guides/result-storage.md#key-value-store) for Key-value stores. ## Next steps[​](#next-steps "Direct link to Next steps") Next, you'll see some improvements that you can add to your crawler code that will make it more readable and maintainable in the long run. --- # Scraping the Store In the [Real-world project chapter](https://crawlee.dev/js/docs/introduction/real-world-project.md#choosing-the-data-you-need), you've created a list of the information you wanted to collect about the products in the example Warehouse store. Let's review that and figure out ways to access the data.
* URL
* Manufacturer
* SKU
* Title
* Current price
* Stock available

![data to scrape](/assets/images/scraping-practice-ed4e3a233c852ffa694b80371fed9d37.jpg "Overview of data to be scraped.") ### Scraping the URL, Manufacturer and SKU[​](#scraping-the-url-manufacturer-and-sku "Direct link to Scraping the URL, Manufacturer and SKU") Some information is lying right there in front of us without even having to touch the product detail pages. The `URL` we already have - the `request.url`. And by looking at it carefully, we realize that we can also extract the manufacturer from the URL (as all product URLs start with `/products/<manufacturer>`). We can just split the `string` and be on our way then! `request.loadedUrl` vs `request.url` You can use `request.loadedUrl` as well. Remember the difference: `request.url` is what you enqueue, `request.loadedUrl` is what gets processed (after possible redirects).
```
// request.url = https://warehouse-theme-metal.myshopify.com/products/sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440
const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'
```
Storing information It's a matter of preference, whether to store this information separately in the resulting dataset, or not.
Whoever uses the dataset can easily parse the `manufacturer` from the `URL`, so should you duplicate the data unnecessarily? Our opinion is that unless the increased data consumption would be too large to bear, it's better to make the dataset as rich as possible. For example, someone might want to filter by `manufacturer`. Adapt and extract One thing you may notice is that the `manufacturer` might have a `-` in its name. If that's the case, your best bet is extracting it from the details page instead, but it's not mandatory. At the end of the day, you should always adjust and pick the best solution for your use case and the website you are crawling. Now it's time to add more data to the results. Let's open one of the product detail pages, for example the [`Sony XBR-950G`](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv) page and use our DevTools-Fu 🥋 to figure out how to get the title of the product. ### Title[​](#title "Direct link to Title") ![product title](/assets/images/title-8f63a08e5ecf82b5547f1fac8ffc77a7.jpg "Finding product title in DevTools.") By using the element selector tool, you can see that the title is there under an `<h1>` tag, as titles should be. The `<h1>` tag is enclosed in a `<div>` with class `product-meta`. We can leverage this to create a combined selector `.product-meta h1`. It selects any `<h1>` element that is a descendant of an element with the class `product-meta`. Verifying selectors with DevTools Remember that you can press CTRL+F (or CMD+F on Mac) in the **Elements** tab of DevTools to open the search bar where you can quickly search for elements using their selectors. Always verify your scraping process and assumptions using the DevTools. It's faster than changing the crawler code all the time. To get the title, you need to find it using `Playwright` and a `.product-meta h1` locator, which selects the `<h1>` element you're looking for, or throws if it finds more than one. That's good. It's usually better to crash the crawler than to silently return bad data. ``` const title = await page.locator('.product-meta h1').textContent(); ``` ### SKU[​](#sku "Direct link to SKU") Using the DevTools, you can find that the product SKU is inside a `<span>` tag with a class `product-meta__sku-number`. And since there's no other `<span>` with that class on the page, you can safely use it. ![product sku selector](/assets/images/sku-4427a5a820183e7c74fb4beeabcf9116.jpg "Finding product SKU in DevTools.") ``` const sku = await page.locator('span.product-meta__sku-number').textContent(); ``` ### Current price[​](#current-price "Direct link to Current price") DevTools can tell you that the `currentPrice` can be found in a `<span>` element tagged with the `price` class. But it also shows that it is nested as raw text alongside another `<span>` element with the `visually-hidden` class. You don't want that, so you need to filter it out, and the `hasText` helper can be used for that. ![product current price selector](/assets/images/current-price-16b0f4b92332837111d04f632234d2c3.jpg "Finding product current price in DevTools.") ``` const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); ``` It might look a little too complex at first glance, but let's walk through what you did.
First off, you find the right part of the `price` span (specifically the actual price) by filtering for the element that contains the `$` sign. That gives you a string similar to `Sale price$1,398.00`, which by itself is not that useful, so you extract the numeric part by splitting on the `$` sign. That leaves a string representing the price, which you convert to a number by stripping the commas with `replaceAll(',', '')` and then parsing the result with `Number()`. ### Stock available[​](#stock-available "Direct link to Stock available") You're finishing up with the `availableInStock`. There is a span with the `product-form__inventory` class, and it contains the text `In stock`. You can use the `hasText` helper again to filter out the right element. ``` const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; ``` For this, all that matters is whether the element exists or not, so you can use the `count()` method to check if there are any elements that match the selector. If there are, that means the product is in stock. And there you have it! All the needed data. For the sake of completeness, let's add all the properties together, and you're good to go. ``` const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440'] const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser' const title = await page.locator('.product-meta h1').textContent(); const sku = await page.locator('span.product-meta__sku-number').textContent(); const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; ``` ## Trying it out[​](#trying-it-out "Direct link to Trying it out") You have everything that is needed, so grab your newly created scraping logic, dump it into your original `requestHandler()` and see the magic happen!
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIHJlcXVlc3RIYW5kbGVyOiBhc3luYyAoeyBwYWdlLCByZXF1ZXN0LCBlbnF1ZXVlTGlua3MgfSkgPT4ge1xcbiAgICAgICAgY29uc29sZS5sb2coYFByb2Nlc3Npbmc6ICR7cmVxdWVzdC51cmx9YCk7XFxuICAgICAgICBpZiAocmVxdWVzdC5sYWJlbCA9PT0gJ0RFVEFJTCcpIHtcXG4gICAgICAgICAgICBjb25zdCB1cmxQYXJ0ID0gcmVxdWVzdC51cmwuc3BsaXQoJy8nKS5zbGljZSgtMSk7IC8vIFsnc2VubmhlaXNlci1ta2UtNDQwLXByb2Zlc3Npb25hbC1zdGVyZW8tc2hvdGd1bi1taWNyb3Bob25lLW1rZS00NDAnXVxcbiAgICAgICAgICAgIGNvbnN0IG1hbnVmYWN0dXJlciA9IHVybFBhcnRbMF0uc3BsaXQoJy0nKVswXTsgLy8gJ3Nlbm5oZWlzZXInXFxuXFxuICAgICAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLmxvY2F0b3IoJy5wcm9kdWN0LW1ldGEgaDEnKS50ZXh0Q29udGVudCgpO1xcbiAgICAgICAgICAgIGNvbnN0IHNrdSA9IGF3YWl0IHBhZ2UubG9jYXRvcignc3Bhbi5wcm9kdWN0LW1ldGFfX3NrdS1udW1iZXInKS50ZXh0Q29udGVudCgpO1xcblxcbiAgICAgICAgICAgIGNvbnN0IHByaWNlRWxlbWVudCA9IHBhZ2VcXG4gICAgICAgICAgICAgICAgLmxvY2F0b3IoJ3NwYW4ucHJpY2UnKVxcbiAgICAgICAgICAgICAgICAuZmlsdGVyKHtcXG4gICAgICAgICAgICAgICAgICAgIGhhc1RleHQ6ICckJyxcXG4gICAgICAgICAgICAgICAgfSlcXG4gICAgICAgICAgICAgICAgLmZpcnN0KCk7XFxuXFxuICAgICAgICAgICAgY29uc3QgY3VycmVudFByaWNlU3RyaW5nID0gYXdhaXQgcHJpY2VFbGVtZW50LnRleHRDb250ZW50KCk7XFxuICAgICAgICAgICAgY29uc3QgcmF3UHJpY2UgPSBjdXJyZW50UHJpY2VTdHJpbmcuc3BsaXQoJyQnKVsxXTtcXG4gICAgICAgICAgICBjb25zdCBwcmljZSA9IE51bWJlcihyYXdQcmljZS5yZXBsYWNlQWxsKCcsJywgJycpKTtcXG5cXG4gICAgICAgICAgICBjb25zdCBpblN0b2NrRWxlbWVudCA9IHBhZ2VcXG4gICAgICAgICAgICAgICAgLmxvY2F0b3IoJ3NwYW4ucHJvZHVjdC1mb3JtX19pbnZlbnRvcnknKVxcbiAgICAgICAgICAgICAgICAuZmlsdGVyKHtcXG4gICAgICAgICAgICAgICAgICAgIGhhc1RleHQ6ICdJbiBzdG9jaycsXFxuICAgICAgICAgICAgICAgIH0pXFxuICAgICAgICAgICAgICAgIC5maXJzdCgpO1xcblxcbiAgICAgICAgICAgIGNvbnN0IGluU3RvY2sgPSAoYXdhaXQgaW5TdG9ja0VsZW1lbnQuY291bnQoKSkgPiAwO1xcblxcbiAgICAgICAgICAgIGNvbnN0IHJlc3VsdHMgPSB7XFxuICAgICAgICAgICAgICAgIHVybDogcmVxdWVzdC51cmwsXFxuICAgICAgICAgICAgICAgIG1hbnVmYWN0dXJlcixcXG4gICAgICAgICAgICAgICAgdGl0bGUsXFxuICAgICAgICAgICAgICAgIHNrdSxcXG4gICAgICAgICAgICAgICAgY3VycmVudFByaWNlOiBwcmljZSxcXG4gICAgICAgICAgICAgICAgYXZhaWxhYmxlSW5TdG9jazogaW5TdG9jayxcXG4gICAgICAgICAgICB9O1xcblxcbiAgICAgICAgICAgIGNvbnNvbGUubG9nKHJlc3VsdHMpO1xcbiAgICAgICAgfSBlbHNlIGlmIChyZXF1ZXN0LmxhYmVsID09PSAnQ0FURUdPUlknKSB7XFxuICAgICAgICAgICAgLy8gV2UgYXJlIG5vdyBvbiBhIGNhdGVnb3J5IHBhZ2UuIFdlIGNhbiB1c2UgdGhpcyB0byBwYWdpbmF0ZSB0aHJvdWdoIGFuZCBlbnF1ZXVlIGFsbCBwcm9kdWN0cyxcXG4gICAgICAgICAgICAvLyBhcyB3ZWxsIGFzIGFueSBzdWJzZXF1ZW50IHBhZ2VzIHdlIGZpbmRcXG5cXG4gICAgICAgICAgICBhd2FpdCBwYWdlLndhaXRGb3JTZWxlY3RvcignLnByb2R1Y3QtaXRlbSA-IGEnKTtcXG4gICAgICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgICAgICBzZWxlY3RvcjogJy5wcm9kdWN0LWl0ZW0gPiBhJyxcXG4gICAgICAgICAgICAgICAgbGFiZWw6ICdERVRBSUwnLCAvLyA8PSBub3RlIHRoZSBkaWZmZXJlbnQgbGFiZWxcXG4gICAgICAgICAgICB9KTtcXG5cXG4gICAgICAgICAgICAvLyBOb3cgd2UgbmVlZCB0byBmaW5kIHRoZSBcXFwiTmV4dFxcXCIgYnV0dG9uIGFuZCBlbnF1ZXVlIHRoZSBuZXh0IHBhZ2Ugb2YgcmVzdWx0cyAoaWYgaXQgZXhpc3RzKVxcbiAgICAgICAgICAgIGNvbnN0IG5leHRCdXR0b24gPSBhd2FpdCBwYWdlLiQoJ2EucGFnaW5hdGlvbl9fbmV4dCcpO1xcbiAgICAgICAgICAgIGlmIChuZXh0QnV0dG9uKSB7XFxuICAgICAgICAgICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgICAgICAgICBzZWxlY3RvcjogJ2EucGFnaW5hdGlvbl9fbmV4dCcsXFxuICAgICAgICAgICAgICAgICAgICBsYWJlbDogJ0NBVEVHT1JZJywgLy8gPD0gbm90ZSB0aGUgc2FtZSBsYWJlbFxcbiAgICAgICAgICAgICAgICB9KTtcXG4gICAgICAgICAgICB9XFxuICAgICAgICB9IGVsc2Uge1xcbiAgICAgICAgICAgIC8vIFRoaXMgbWVhbnMgd2UncmUgb2
4gdGhlIHN0YXJ0IHBhZ2UsIHdpdGggbm8gbGFiZWwuXFxuICAgICAgICAgICAgLy8gT24gdGhpcyBwYWdlLCB3ZSBqdXN0IHdhbnQgdG8gZW5xdWV1ZSBhbGwgdGhlIGNhdGVnb3J5IHBhZ2VzLlxcblxcbiAgICAgICAgICAgIGF3YWl0IHBhZ2Uud2FpdEZvclNlbGVjdG9yKCcuY29sbGVjdGlvbi1ibG9jay1pdGVtJyk7XFxuICAgICAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKHtcXG4gICAgICAgICAgICAgICAgc2VsZWN0b3I6ICcuY29sbGVjdGlvbi1ibG9jay1pdGVtJyxcXG4gICAgICAgICAgICAgICAgbGFiZWw6ICdDQVRFR09SWScsXFxuICAgICAgICAgICAgfSk7XFxuICAgICAgICB9XFxuICAgIH0sXFxuXFxuICAgIC8vIExldCdzIGxpbWl0IG91ciBjcmF3bHMgdG8gbWFrZSBvdXIgdGVzdHMgc2hvcnRlciBhbmQgc2FmZXIuXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDUwLFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly93YXJlaG91c2UtdGhlbWUtbWV0YWwubXlzaG9waWZ5LmNvbS9jb2xsZWN0aW9ucyddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.kD0Kv02LqYWc0KoeyGVDl4T9x6QzNWTLJP_-bZxykus\&asrc=run_on_apify) ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page, request, enqueueLinks }) => { console.log(`Processing: ${request.url}`); if (request.label === 'DETAIL') { const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440'] const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser' const title = await page.locator('.product-meta h1').textContent(); const sku = await page.locator('span.product-meta__sku-number').textContent(); const priceElement = page .locator('span.price') .filter({ hasText: '$', }) .first(); const currentPriceString = await priceElement.textContent(); const rawPrice = currentPriceString.split('$')[1]; const price = Number(rawPrice.replaceAll(',', '')); const inStockElement = page .locator('span.product-form__inventory') .filter({ hasText: 'In stock', }) .first(); const inStock = (await inStockElement.count()) > 0; const results = { url: request.url, manufacturer, title, sku, currentPrice: price, availableInStock: inStock, }; console.log(results); } else if (request.label === 'CATEGORY') { // We are now on a category page. We can use this to paginate through and enqueue all products, // as well as any subsequent pages we find await page.waitForSelector('.product-item > a'); await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL', // <= note the different label }); // Now we need to find the "Next" button and enqueue the next page of results (if it exists) const nextButton = await page.$('a.pagination__next'); if (nextButton) { await enqueueLinks({ selector: 'a.pagination__next', label: 'CATEGORY', // <= note the same label }); } } else { // This means we're on the start page, with no label. // On this page, we just want to enqueue all the category pages. await page.waitForSelector('.collection-block-item'); await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY', }); } }, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']); ``` When you run the crawler, you will see the crawled URLs and their scraped data printed to the console. 
The output will look something like this: ``` { "url": "https://warehouse-theme-metal.myshopify.com/products/sony-str-za810es-7-2-channel-hi-res-wi-fi-network-av-receiver", "manufacturer": "sony", "title": "Sony STR-ZA810ES 7.2-Ch Hi-Res Wi-Fi Network A/V Receiver", "sku": "SON-692802-STR-DE", "currentPrice": 698, "availableInStock": true } ``` ## Next steps[​](#next-steps "Direct link to Next steps") Next, you'll see how to save the data you scraped to the disk for further processing. --- # Setting up Copy for LLM To run Crawlee on your own computer, you need to meet the following pre-requisites first: 1. Have **Node.js version 16.0** (Visit [Node.js website](https://nodejs.org/en/download/) to download or use [fnm](https://github.com/Schniz/fnm)) or higher installed. 2. Have **NPM** installed, or use other package manager of your choice. If not certain, confirm the prerequisites by running: ``` node -v ``` ``` npm -v ``` ## Creating a new project[​](#creating-a-new-project "Direct link to Creating a new project") The fastest and best way to create new projects with Crawlee is to use the [Crawlee CLI](https://www.npmjs.com/package/@crawlee/cli). You can use the `npx` utility to download and run the CLI - it is embedded in the `crawlee` package: ``` npx crawlee create my-crawler ``` A prompt will be shown, asking you to select a template. Crawlee is written in TypeScript so if you're familiar with it, choosing a TypeScript template will give you better code completion and static type checking, but feel free to use JavaScript as well. Functionally they're identical. Let's choose the first template called **Getting started example**. The command will create a new directory in your current working directory, called **my-crawler**, add a **package.json** to this folder and install all the necessary dependencies. It will also add example source code that you can immediately run. Let's try that! ``` cd my-crawler npm start ``` You will see log messages in the terminal as Crawlee boots up and starts scraping the Crawlee website. ``` INFO PlaywrightCrawler: Starting the crawl INFO PlaywrightCrawler: Title of https://crawlee.dev/ is 'Crawlee · Build reliable crawlers. Fast. | Crawlee' INFO PlaywrightCrawler: Title of https://crawlee.dev/js/docs/examples is 'Examples | Crawlee' INFO PlaywrightCrawler: Title of https://crawlee.dev/js/api/core is '@crawlee/core | API | Crawlee' INFO PlaywrightCrawler: Title of https://crawlee.dev/js/api/core/changelog is 'Changelog | API | Crawlee' INFO PlaywrightCrawler: Title of https://crawlee.dev/js/docs/quick-start is 'Quick Start | Crawlee' ``` You can always terminate the crawl with a keypress in the terminal: ``` CTRL+C ``` ### Running headful browsers[​](#running-headful-browsers "Direct link to Running headful browsers") Browsers controlled by Playwright run headless (without a visible window). You can switch to headful by uncommenting the `headless: false` option in the crawler's constructor. This is useful in the development phase when you want to see what's going on in the browser. ``` // Uncomment this option to see the browser window. headless: false ``` When you run the example again, after a second a Chromium browser window will open. In the window, you'll see quickly changing pages as the crawler does its job. note For the sake of this show off, we've slowed down the crawler, but rest assured, it's blazing fast in real world usage. 
![An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium](/img/chrome-scrape-light.gif)![An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium](/img/chrome-scrape-dark.gif) ## Next steps[​](#next-steps "Direct link to Next steps") Next, you will see how to create a very simple crawler and explain Crawlee components while building it. --- # Quick Start Copy for LLM With this short tutorial you can start scraping with Crawlee in a minute or two. To learn in-depth how Crawlee works, read the [Introduction](https://crawlee.dev/js/docs/introduction.md), which is a comprehensive step-by-step guide for creating your first scraper. ## Choose your crawler[​](#choose-your-crawler "Direct link to Choose your crawler") Crawlee comes with three main crawler classes: [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler/class/CheerioCrawler.md), [`PuppeteerCrawler`](https://crawlee.dev/js/api/puppeteer-crawler/class/PuppeteerCrawler.md) and [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler.md). All classes share the same interface for maximum flexibility when switching between them. ### CheerioCrawler[​](#cheeriocrawler "Direct link to CheerioCrawler") This is a plain HTTP crawler. It parses HTML using the [Cheerio](https://github.com/cheeriojs/cheerio) library and crawls the web using the specialized [got-scraping](https://github.com/apify/got-scraping) HTTP client which masks as a browser. It's very fast and efficient, but can't handle JavaScript rendering. ### PuppeteerCrawler[​](#puppeteercrawler "Direct link to PuppeteerCrawler") This crawler uses a headless browser to crawl, controlled by the [Puppeteer](https://github.com/puppeteer/puppeteer) library. It can control Chromium or Chrome. Puppeteer is the de-facto standard in headless browser automation. ### PlaywrightCrawler[​](#playwrightcrawler "Direct link to PlaywrightCrawler") [Playwright](https://github.com/microsoft/playwright) is a more powerful and full-featured successor to Puppeteer. It can control Chromium, Chrome, Firefox, Webkit and many other browsers. If you're not familiar with Puppeteer already, and you need a headless browser, go with Playwright. before you start Crawlee requires [Node.js 16 or later](https://nodejs.org/en/). ## Installation with Crawlee CLI[​](#installation-with-crawlee-cli "Direct link to Installation with Crawlee CLI") The fastest way to try Crawlee out is to use the **Crawlee CLI** and choose the **Getting started example**. The CLI will install all the necessary dependencies and add boilerplate code for you to play with. ``` npx crawlee create my-crawler ``` After the installation is complete you can start the crawler like this: ``` cd my-crawler && npm start ``` ## Manual installation[​](#manual-installation "Direct link to Manual installation") You can add Crawlee to any Node.js project by running: * CheerioCrawler * PlaywrightCrawler * PuppeteerCrawler ``` npm install crawlee ``` caution `playwright` is not bundled with Crawlee to reduce install size and allow greater flexibility. You need to explicitly install it with NPM. 👇 ``` npm install crawlee playwright ``` caution `puppeteer` is not bundled with Crawlee to reduce install size and allow greater flexibility. You need to explicitly install it with NPM. 
👇 ``` npm install crawlee puppeteer ``` ## Crawling[​](#crawling "Direct link to Crawling") Run the following example to perform a recursive crawl of the Crawlee website using the selected crawler. Don't forget about module imports To run the example, add a `"type": "module"` clause into your `package.json` or copy it into a file with an `.mjs` suffix. This enables `import` statements in Node.js. See [Node.js docs](https://nodejs.org/dist/latest-v16.x/docs/api/esm.html#enabling) for more information. * CheerioCrawler * PlaywrightCrawler * PuppeteerCrawler [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBEYXRhc2V0IH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ2hlZXJpb0NyYXdsZXIgY3Jhd2xzIHRoZSB3ZWIgdXNpbmcgSFRUUCByZXF1ZXN0c1xcbi8vIGFuZCBwYXJzZXMgSFRNTCB1c2luZyB0aGUgQ2hlZXJpbyBsaWJyYXJ5LlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgQ2hlZXJpb0NyYXdsZXIoe1xcbiAgICAvLyBVc2UgdGhlIHJlcXVlc3RIYW5kbGVyIHRvIHByb2Nlc3MgZWFjaCBvZiB0aGUgY3Jhd2xlZCBwYWdlcy5cXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCAkLCBlbnF1ZXVlTGlua3MsIGxvZyB9KSB7XFxuICAgICAgICBjb25zdCB0aXRsZSA9ICQoJ3RpdGxlJykudGV4dCgpO1xcbiAgICAgICAgbG9nLmluZm8oYFRpdGxlIG9mICR7cmVxdWVzdC5sb2FkZWRVcmx9IGlzICcke3RpdGxlfSdgKTtcXG5cXG4gICAgICAgIC8vIFNhdmUgcmVzdWx0cyBhcyBKU09OIHRvIC4vc3RvcmFnZS9kYXRhc2V0cy9kZWZhdWx0XFxuICAgICAgICBhd2FpdCBEYXRhc2V0LnB1c2hEYXRhKHsgdGl0bGUsIHVybDogcmVxdWVzdC5sb2FkZWRVcmwgfSk7XFxuXFxuICAgICAgICAvLyBFeHRyYWN0IGxpbmtzIGZyb20gdGhlIGN1cnJlbnQgcGFnZVxcbiAgICAgICAgLy8gYW5kIGFkZCB0aGVtIHRvIHRoZSBjcmF3bGluZyBxdWV1ZS5cXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcygpO1xcbiAgICB9LFxcblxcbiAgICAvLyBMZXQncyBsaW1pdCBvdXIgY3Jhd2xzIHRvIG1ha2Ugb3VyIHRlc3RzIHNob3J0ZXIgYW5kIHNhZmVyLlxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG59KTtcXG5cXG4vLyBBZGQgZmlyc3QgVVJMIHRvIHRoZSBxdWV1ZSBhbmQgc3RhcnQgdGhlIGNyYXdsLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.Ja0vzMfKZoDTDX1L9bEJsVFrKUcp0sJyWJ46kbitQOs\&asrc=run_on_apify) ``` import { CheerioCrawler, Dataset } from 'crawlee'; // CheerioCrawler crawls the web using HTTP requests // and parses HTML using the Cheerio library. const crawler = new CheerioCrawler({ // Use the requestHandler to process each of the crawled pages. async requestHandler({ request, $, enqueueLinks, log }) { const title = $('title').text(); log.info(`Title of ${request.loadedUrl} is '${title}'`); // Save results as JSON to ./storage/datasets/default await Dataset.pushData({ title, url: request.loadedUrl }); // Extract links from the current page // and add them to the crawling queue. await enqueueLinks(); }, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. 
await crawler.run(['https://crawlee.dev']); ``` [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyLCBEYXRhc2V0IH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gUGxheXdyaWdodENyYXdsZXIgY3Jhd2xzIHRoZSB3ZWIgdXNpbmcgYSBoZWFkbGVzc1xcbi8vIGJyb3dzZXIgY29udHJvbGxlZCBieSB0aGUgUGxheXdyaWdodCBsaWJyYXJ5LlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICAvLyBVc2UgdGhlIHJlcXVlc3RIYW5kbGVyIHRvIHByb2Nlc3MgZWFjaCBvZiB0aGUgY3Jhd2xlZCBwYWdlcy5cXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MsIGxvZyB9KSB7XFxuICAgICAgICBjb25zdCB0aXRsZSA9IGF3YWl0IHBhZ2UudGl0bGUoKTtcXG4gICAgICAgIGxvZy5pbmZvKGBUaXRsZSBvZiAke3JlcXVlc3QubG9hZGVkVXJsfSBpcyAnJHt0aXRsZX0nYCk7XFxuXFxuICAgICAgICAvLyBTYXZlIHJlc3VsdHMgYXMgSlNPTiB0byAuL3N0b3JhZ2UvZGF0YXNldHMvZGVmYXVsdFxcbiAgICAgICAgYXdhaXQgRGF0YXNldC5wdXNoRGF0YSh7IHRpdGxlLCB1cmw6IHJlcXVlc3QubG9hZGVkVXJsIH0pO1xcblxcbiAgICAgICAgLy8gRXh0cmFjdCBsaW5rcyBmcm9tIHRoZSBjdXJyZW50IHBhZ2VcXG4gICAgICAgIC8vIGFuZCBhZGQgdGhlbSB0byB0aGUgY3Jhd2xpbmcgcXVldWUuXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3MoKTtcXG4gICAgfSxcXG4gICAgLy8gVW5jb21tZW50IHRoaXMgb3B0aW9uIHRvIHNlZSB0aGUgYnJvd3NlciB3aW5kb3cuXFxuICAgIC8vIGhlYWRsZXNzOiBmYWxzZSxcXG5cXG4gICAgLy8gTGV0J3MgbGltaXQgb3VyIGNyYXdscyB0byBtYWtlIG91ciB0ZXN0cyBzaG9ydGVyIGFuZCBzYWZlci5cXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogNTAsXFxufSk7XFxuXFxuLy8gQWRkIGZpcnN0IFVSTCB0byB0aGUgcXVldWUgYW5kIHN0YXJ0IHRoZSBjcmF3bC5cXG5hd2FpdCBjcmF3bGVyLnJ1bihbJ2h0dHBzOi8vY3Jhd2xlZS5kZXYnXSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.t_TCm8kwdGMajR-HxGyGZQ-N9vOJbcHUo8cgMhCec0E\&asrc=run_on_apify) ``` import { PlaywrightCrawler, Dataset } from 'crawlee'; // PlaywrightCrawler crawls the web using a headless // browser controlled by the Playwright library. const crawler = new PlaywrightCrawler({ // Use the requestHandler to process each of the crawled pages. async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); // Save results as JSON to ./storage/datasets/default await Dataset.pushData({ title, url: request.loadedUrl }); // Extract links from the current page // and add them to the crawling queue. await enqueueLinks(); }, // Uncomment this option to see the browser window. // headless: false, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. 
await crawler.run(['https://crawlee.dev']); ``` [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIERhdGFzZXQgfSBmcm9tICdjcmF3bGVlJztcXG5cXG4vLyBQdXBwZXRlZXJDcmF3bGVyIGNyYXdscyB0aGUgd2ViIHVzaW5nIGEgaGVhZGxlc3NcXG4vLyBicm93c2VyIGNvbnRyb2xsZWQgYnkgdGhlIFB1cHBldGVlciBsaWJyYXJ5LlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUHVwcGV0ZWVyQ3Jhd2xlcih7XFxuICAgIC8vIFVzZSB0aGUgcmVxdWVzdEhhbmRsZXIgdG8gcHJvY2VzcyBlYWNoIG9mIHRoZSBjcmF3bGVkIHBhZ2VzLlxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHRpdGxlID0gYXdhaXQgcGFnZS50aXRsZSgpO1xcbiAgICAgICAgbG9nLmluZm8oYFRpdGxlIG9mICR7cmVxdWVzdC5sb2FkZWRVcmx9IGlzICcke3RpdGxlfSdgKTtcXG5cXG4gICAgICAgIC8vIFNhdmUgcmVzdWx0cyBhcyBKU09OIHRvIC4vc3RvcmFnZS9kYXRhc2V0cy9kZWZhdWx0XFxuICAgICAgICBhd2FpdCBEYXRhc2V0LnB1c2hEYXRhKHsgdGl0bGUsIHVybDogcmVxdWVzdC5sb2FkZWRVcmwgfSk7XFxuXFxuICAgICAgICAvLyBFeHRyYWN0IGxpbmtzIGZyb20gdGhlIGN1cnJlbnQgcGFnZVxcbiAgICAgICAgLy8gYW5kIGFkZCB0aGVtIHRvIHRoZSBjcmF3bGluZyBxdWV1ZS5cXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcygpO1xcbiAgICB9LFxcbiAgICAvLyBVbmNvbW1lbnQgdGhpcyBvcHRpb24gdG8gc2VlIHRoZSBicm93c2VyIHdpbmRvdy5cXG4gICAgLy8gaGVhZGxlc3M6IGZhbHNlLFxcblxcbiAgICAvLyBMZXQncyBsaW1pdCBvdXIgY3Jhd2xzIHRvIG1ha2Ugb3VyIHRlc3RzIHNob3J0ZXIgYW5kIHNhZmVyLlxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG59KTtcXG5cXG4vLyBBZGQgZmlyc3QgVVJMIHRvIHRoZSBxdWV1ZSBhbmQgc3RhcnQgdGhlIGNyYXdsLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.r3-Jgz2GRxUEVxzBr5czC9lcH0ty_8aKkcd9XHHZryg\&asrc=run_on_apify) ``` import { PuppeteerCrawler, Dataset } from 'crawlee'; // PuppeteerCrawler crawls the web using a headless // browser controlled by the Puppeteer library. const crawler = new PuppeteerCrawler({ // Use the requestHandler to process each of the crawled pages. async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); // Save results as JSON to ./storage/datasets/default await Dataset.pushData({ title, url: request.loadedUrl }); // Extract links from the current page // and add them to the crawling queue. await enqueueLinks(); }, // Uncomment this option to see the browser window. // headless: false, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. await crawler.run(['https://crawlee.dev']); ``` When you run the example, you will see Crawlee automating the data extraction process in your terminal. ``` INFO CheerioCrawler: Starting the crawl INFO CheerioCrawler: Title of https://crawlee.dev/ is 'Crawlee · Build reliable crawlers. Fast. | Crawlee' INFO CheerioCrawler: Title of https://crawlee.dev/js/docs/examples is 'Examples | Crawlee' INFO CheerioCrawler: Title of https://crawlee.dev/js/docs/quick-start is 'Quick Start | Crawlee' INFO CheerioCrawler: Title of https://crawlee.dev/js/docs/guides is 'Guides | Crawlee' ``` ### Running headful browsers[​](#running-headful-browsers "Direct link to Running headful browsers") Browsers controlled by Puppeteer and Playwright run headless (without a visible window). You can switch to headful by adding the `headless: false` option to the crawlers' constructor. 
This is useful in the development phase when you want to see what's going on in the browser. * PlaywrightCrawler * PuppeteerCrawler [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyLCBEYXRhc2V0IH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgcGFnZSwgZW5xdWV1ZUxpbmtzLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBsb2cuaW5mbyhgVGl0bGUgb2YgJHtyZXF1ZXN0LmxvYWRlZFVybH0gaXMgJyR7dGl0bGV9J2ApO1xcbiAgICAgICAgYXdhaXQgRGF0YXNldC5wdXNoRGF0YSh7IHRpdGxlLCB1cmw6IHJlcXVlc3QubG9hZGVkVXJsIH0pO1xcbiAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKCk7XFxuICAgIH0sXFxuICAgIC8vIFdoZW4geW91IHR1cm4gb2ZmIGhlYWRsZXNzIG1vZGUsIHRoZSBjcmF3bGVyXFxuICAgIC8vIHdpbGwgcnVuIHdpdGggYSB2aXNpYmxlIGJyb3dzZXIgd2luZG93LlxcbiAgICBoZWFkbGVzczogZmFsc2UsXFxuXFxuICAgIC8vIExldCdzIGxpbWl0IG91ciBjcmF3bHMgdG8gbWFrZSBvdXIgdGVzdHMgc2hvcnRlciBhbmQgc2FmZXIuXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDUwLFxcbn0pO1xcblxcbi8vIEFkZCBmaXJzdCBVUkwgdG8gdGhlIHF1ZXVlIGFuZCBzdGFydCB0aGUgY3Jhd2wuXFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwczovL2NyYXdsZWUuZGV2J10pO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.hy0W1IDTCxm-B-7JSs_YOrqWnYAemKGg8vJVLIaigIg\&asrc=run_on_apify) ``` import { PlaywrightCrawler, Dataset } from 'crawlee'; const crawler = new PlaywrightCrawler({ async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); await Dataset.pushData({ title, url: request.loadedUrl }); await enqueueLinks(); }, // When you turn off headless mode, the crawler // will run with a visible browser window. headless: false, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. 
await crawler.run(['https://crawlee.dev']); ``` [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIERhdGFzZXQgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcywgbG9nIH0pIHtcXG4gICAgICAgIGNvbnN0IHRpdGxlID0gYXdhaXQgcGFnZS50aXRsZSgpO1xcbiAgICAgICAgbG9nLmluZm8oYFRpdGxlIG9mICR7cmVxdWVzdC5sb2FkZWRVcmx9IGlzICcke3RpdGxlfSdgKTtcXG4gICAgICAgIGF3YWl0IERhdGFzZXQucHVzaERhdGEoeyB0aXRsZSwgdXJsOiByZXF1ZXN0LmxvYWRlZFVybCB9KTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcygpO1xcbiAgICB9LFxcbiAgICAvLyBXaGVuIHlvdSB0dXJuIG9mZiBoZWFkbGVzcyBtb2RlLCB0aGUgY3Jhd2xlclxcbiAgICAvLyB3aWxsIHJ1biB3aXRoIGEgdmlzaWJsZSBicm93c2VyIHdpbmRvdy5cXG4gICAgaGVhZGxlc3M6IGZhbHNlLFxcblxcbiAgICAvLyBMZXQncyBsaW1pdCBvdXIgY3Jhd2xzIHRvIG1ha2Ugb3VyIHRlc3RzIHNob3J0ZXIgYW5kIHNhZmVyLlxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiA1MCxcXG59KTtcXG5cXG4vLyBBZGQgZmlyc3QgVVJMIHRvIHRoZSBxdWV1ZSBhbmQgc3RhcnQgdGhlIGNyYXdsLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9jcmF3bGVlLmRldiddKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.SeMW82sV8hdxSVLInwu1lVZjrCxNzASe8GlszF0s-W8\&asrc=run_on_apify) ``` import { PuppeteerCrawler, Dataset } from 'crawlee'; const crawler = new PuppeteerCrawler({ async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); await Dataset.pushData({ title, url: request.loadedUrl }); await enqueueLinks(); }, // When you turn off headless mode, the crawler // will run with a visible browser window. headless: false, // Let's limit our crawls to make our tests shorter and safer. maxRequestsPerCrawl: 50, }); // Add first URL to the queue and start the crawl. await crawler.run(['https://crawlee.dev']); ``` When you run the example code, you'll see an automated browser blaze through the Crawlee website. note For the sake of this show off, we've slowed down the crawler, but rest assured, it's blazing fast in real world usage. ![An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium](/img/chrome-scrape-light.gif)![An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium](/img/chrome-scrape-dark.gif) ## Results[​](#results "Direct link to Results") Crawlee stores data to the `./storage` directory in your current working directory. The results of your crawl will be available under `./storage/datasets/default/*.json` as JSON files. ./storage/datasets/default/000000001.json ``` { "url": "https://crawlee.dev/", "title": "Crawlee · The scalable web crawling, scraping and automation library for JavaScript/Node.js | Crawlee" } ``` tip You can override the storage directory by setting the `CRAWLEE_STORAGE_DIR` environment variable. ## Examples and further reading[​](#examples-and-further-reading "Direct link to Examples and further reading") You can find more examples showcasing various features of Crawlee in the [Examples](https://crawlee.dev/js/docs/examples.md) section of the documentation. To better understand Crawlee and its components you should read the [Introduction](https://crawlee.dev/js/docs/introduction.md) step-by-step guide. 
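If you'd rather inspect the results from code than open the JSON files on disk, you can also read the default dataset back programmatically. A minimal sketch (assuming you run it in the same project directory after a crawl, so it sees the same `./storage` folder):

```
import { Dataset } from 'crawlee';

// Open the default dataset and read back everything pushed during the crawl.
const dataset = await Dataset.open();
const { items } = await dataset.getData();

console.log(`Scraped ${items.length} pages, first item:`, items[0]);
```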
**Related links** * [Configuration](https://crawlee.dev/js/docs/guides/configuration.md) * [Request storage](https://crawlee.dev/js/docs/guides/request-storage.md) * [Result storage](https://crawlee.dev/js/docs/guides/result-storage.md) --- ## [📄️<!-- --> <!-- -->Upgrading to v1](https://crawlee.dev/js/docs/upgrading/upgrading-to-v1.md) [Summary](https://crawlee.dev/js/docs/upgrading/upgrading-to-v1.md) --- # Upgrading to v1 Copy for LLM ## Summary[​](#summary "Direct link to Summary") After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result - **Apify SDK v1**. There were two goals for this release. **Stability** and **adding support for more browsers** - Firefox and Webkit (Safari). The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in and by releasing SDK v1, **we commit to only make breaking changes once a year, with a new major release**. We added support for more browsers by replacing `PuppeteerPool` with [`browser-pool`](https://github.com/apify/browser-pool). A new library that we created specifically for this purpose. It builds on the ideas from `PuppeteerPool` and extends them to support [Playwright](https://github.com/microsoft/playwright). Playwright is a browser automation library similar to Puppeteer. It works with all well known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new `BrowserPool`. A large breaking change is that neither `puppeteer` nor `playwright` are bundled with the SDK v1. To make the choice of a library easier and installs faster, users will have to install the selected modules and versions themselves. This allows us to add support for even more libraries in the future. Thanks to the addition of Playwright we now have a `PlaywrightCrawler`. It is very similar to `PuppeteerCrawler` and you can pick the one you prefer. It also means we needed to make some interface changes. The `launchPuppeteerFunction` option of `PuppeteerCrawler` is gone and `launchPuppeteerOptions` were replaced by `launchContext`. We also moved things around in the `handlePageFunction` arguments. See the [migration guide](#migration-guide) for more detailed explanation and migration examples. What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well. ## Migration Guide[​](#migration-guide "Direct link to Migration Guide") There are a lot of breaking changes in the v1.0.0 release, but we're confident that updating your code will be a matter of minutes. Below, you'll find examples how to do it and also short tutorials how to use many of the new features. > Many of the new features are made with power users in mind, so don't worry if something looks complicated. You don't need to use it. ## Installation[​](#installation "Direct link to Installation") Previous versions of the SDK bundled the `puppeteer` package, so you did not have to install it. SDK v1 supports also `playwright` and we don't want to force users to install both. 
To install SDK v1 with Puppeteer (same as previous versions), run: ``` npm install apify puppeteer ``` To install SDK v1 with Playwright run: ``` npm install apify playwright ``` > While we tried to add the most important functionality in the initial release, you may find that there are still some utilities or options that are only supported by Puppeteer and not Playwright. ## Running on Apify Platform[​](#running-on-apify-platform "Direct link to Running on Apify Platform") If you want to make use of Playwright on the Apify Platform, you need to use a Docker image that supports Playwright. We've created them for you, so head over to the new [Docker image guide](https://sdk.apify.com/docs/guides/docker-images) and pick the one that best suits your needs. Note that your `package.json` **MUST** include `puppeteer` and/or `playwright` as dependencies. If you don't list them, the libraries will be uninstalled from your `node_modules` folder when you build your actors. ## Handler arguments are now Crawling Context[​](#handler-arguments-are-now-crawling-context "Direct link to Handler arguments are now Crawling Context") Previously, arguments of user provided handler functions were provided in separate objects. This made it difficult to track values across function invocations. ``` const handlePageFunction = async (args1) => { args1.hasOwnProperty('proxyInfo') // true } const handleFailedRequestFunction = async (args2) => { args2.hasOwnProperty('proxyInfo') // false } args1 === args2 // false ``` This happened because a new arguments object was created for each function. With SDK v1 we now have a single object called Crawling Context. ``` const handlePageFunction = async (crawlingContext1) => { crawlingContext1.hasOwnProperty('proxyInfo') // true } const handleFailedRequestFunction = async (crawlingContext2) => { crawlingContext2.hasOwnProperty('proxyInfo') // true } // All contexts are the same object. crawlingContext1 === crawlingContext2 // true ``` ### `Map` of crawling contexts and their IDs[​](#map-of-crawling-contexts-and-their-ids "Direct link to map-of-crawling-contexts-and-their-ids") Now that all the objects are the same, we can keep track of all running crawling contexts. We can do that by working with the new `id` property of `crawlingContext` This is useful when you need cross-context access. ``` let masterContextId; const handlePageFunction = async ({ id, page, request, crawler }) => { if (request.userData.masterPage) { masterContextId = id; // Prepare the master page. } else { const masterContext = crawler.crawlingContexts.get(masterContextId); const masterPage = masterContext.page; const masterRequest = masterContext.request; // Now we can manipulate the master data from another handlePageFunction. } } ``` ### `autoscaledPool` was moved under `crawlingContext.crawler`[​](#autoscaledpool-was-moved-under-crawlingcontextcrawler "Direct link to autoscaledpool-was-moved-under-crawlingcontextcrawler") To prevent bloat and to make access to certain key objects easier, we exposed a `crawler` property on the handle page arguments. ``` const handlePageFunction = async ({ request, page, crawler }) => { await crawler.requestQueue.addRequest({ url: 'https://example.com' }); await crawler.autoscaledPool.pause(); } ``` This also means that some shorthands like `puppeteerPool` or `autoscaledPool` were no longer necessary. 
``` const handlePageFunction = async (crawlingContext) => { crawlingContext.autoscaledPool // does NOT exist anymore crawlingContext.crawler.autoscaledPool // <= this is correct usage } ``` ## Replacement of `PuppeteerPool` with `BrowserPool`[​](#replacement-of-puppeteerpool-with-browserpool "Direct link to replacement-of-puppeteerpool-with-browserpool") `BrowserPool` was created to extend `PuppeteerPool` with the ability to manage other browser automation libraries. The API is similar, but not the same. ### Access to running `BrowserPool`[​](#access-to-running-browserpool "Direct link to access-to-running-browserpool") Only `PuppeteerCrawler` and `PlaywrightCrawler` use `BrowserPool`. You can access it on the `crawler` object. ``` const crawler = new Apify.PlaywrightCrawler({ handlePageFunction: async ({ page, crawler }) => { crawler.browserPool // <----- } }); crawler.browserPool // <----- ``` ### Pages now have IDs[​](#pages-now-have-ids "Direct link to Pages now have IDs") And they're equal to `crawlingContext.id` which gives you access to full `crawlingContext` in hooks. See [Lifecycle hooks](#configuration-and-lifecycle-hooks) below. ``` const pageId = browserPool.getPageId ``` ### Configuration and lifecycle hooks[​](#configuration-and-lifecycle-hooks "Direct link to Configuration and lifecycle hooks") The most important addition with `BrowserPool` are the [lifecycle hooks](https://github.com/apify/browser-pool#browserpool). You can access them via `browserPoolOptions` in both crawlers. A full list of `browserPoolOptions` can be found in [`browser-pool` readme](https://github.com/apify/browser-pool#new-browserpooloptions). ``` const crawler = new Apify.PuppeteerCrawler({ browserPoolOptions: { retireBrowserAfterPageCount: 10, preLaunchHooks: [ async (pageId, launchContext) => { const { request } = crawler.crawlingContexts.get(pageId); if (request.userData.useHeadful === true) { launchContext.launchOptions.headless = false; } } ] } }) ``` ### Introduction of `BrowserController`[​](#introduction-of-browsercontroller "Direct link to introduction-of-browsercontroller") [`BrowserController`](https://github.com/apify/browser-pool#browsercontroller) is a class of `browser-pool` that's responsible for browser management. Its purpose is to provide a single API for working with both Puppeteer and Playwright browsers. It works automatically in the background, but if you ever wanted to close a browser properly, you should use a `browserController` to do it. You can find it in the handle page arguments. ``` const handlePageFunction = async ({ page, browserController }) => { // Wrong usage. Could backfire because it bypasses BrowserPool. await page.browser().close(); // Correct usage. Allows graceful shutdown. await browserController.close(); const cookies = [/* some cookie objects */]; // Wrong usage. Will only work in Puppeteer and not Playwright. await page.setCookies(...cookies); // Correct usage. Will work in both. await browserController.setCookies(page, cookies); } ``` The `BrowserController` also includes important information about the browser, such as the context it was launched with. This was difficult to do before SDK v1. 
``` const handlePageFunction = async ({ browserController }) => { // Information about the proxy used by the browser browserController.launchContext.proxyInfo // Session used by the browser browserController.launchContext.session } ``` ### `BrowserPool` methods vs `PuppeteerPool`[​](#browserpool-methods-vs-puppeteerpool "Direct link to browserpool-methods-vs-puppeteerpool") Some functions were removed (in line with earlier deprecations), and some were changed a bit: ``` // OLD await puppeteerPool.recyclePage(page); // NEW await page.close(); ``` ``` // OLD await puppeteerPool.retire(page.browser()); // NEW browserPool.retireBrowserByPage(page); ``` ``` // OLD await puppeteerPool.serveLiveViewSnapshot(); // NEW // There's no LiveView in BrowserPool ``` ## Updated `PuppeteerCrawlerOptions`[​](#updated-puppeteercrawleroptions "Direct link to updated-puppeteercrawleroptions") To keep `PuppeteerCrawler` and `PlaywrightCrawler` consistent, we updated the options. ### Removal of `gotoFunction`[​](#removal-of-gotofunction "Direct link to removal-of-gotofunction") The concept of a configurable `gotoFunction` is not ideal. Especially since we use a modified `gotoExtended`. Users have to know this when they override `gotoFunction` if they want to extend default behavior. We decided to replace `gotoFunction` with `preNavigationHooks` and `postNavigationHooks`. The following example illustrates how `gotoFunction` makes things complicated. ``` const gotoFunction = async ({ request, page }) => { // pre-processing await makePageStealthy(page); // Have to remember how to do this: const response = await gotoExtended(page, request, {/* have to remember the defaults */}); // post-processing await page.evaluate(() => { window.foo = 'bar'; }); // Must not forget! return response; } const crawler = new Apify.PuppeteerCrawler({ gotoFunction, // ... }) ``` With `preNavigationHooks` and `postNavigationHooks` it's much easier. `preNavigationHooks` are called with two arguments: `crawlingContext` and `gotoOptions`. `postNavigationHooks` are called only with `crawlingContext`. ``` const preNavigationHooks = [ async ({ page }) => makePageStealthy(page) ]; const postNavigationHooks = [ async ({ page }) => page.evaluate(() => { window.foo = 'bar' }) ] const crawler = new Apify.PuppeteerCrawler({ preNavigationHooks, postNavigationHooks, // ... }) ``` ### `launchPuppeteerOptions` => `launchContext`[​](#launchpuppeteeroptions--launchcontext "Direct link to launchpuppeteeroptions--launchcontext") Those were always a point of confusion because they merged custom Apify options with `launchOptions` of Puppeteer. ``` const launchPuppeteerOptions = { useChrome: true, // Apify option headless: false, // Puppeteer option } ``` Use the new `launchContext` object, which explicitly defines `launchOptions`. `launchPuppeteerOptions` were removed. ``` const crawler = new Apify.PuppeteerCrawler({ launchContext: { useChrome: true, // Apify option launchOptions: { headless: false // Puppeteer option } } }) ``` > LaunchContext is also a type of [`browser-pool`](https://github.com/apify/browser-pool) and the structure is exactly the same there. SDK only adds extra options. ### Removal of `launchPuppeteerFunction`[​](#removal-of-launchpuppeteerfunction "Direct link to removal-of-launchpuppeteerfunction") `browser-pool` introduces the idea of [lifecycle hooks](https://github.com/apify/browser-pool#browserpool), which are functions that are executed when a certain event in the browser lifecycle happens. 
``` const launchPuppeteerFunction = async (launchPuppeteerOptions) => { if (someVariable === 'chrome') { launchPuppeteerOptions.useChrome = true; } return Apify.launchPuppeteer(launchPuppeteerOptions); } const crawler = new Apify.PuppeteerCrawler({ launchPuppeteerFunction, // ... }) ``` Now you can recreate the same functionality with a `preLaunchHook`: ``` const maybeLaunchChrome = (pageId, launchContext) => { if (someVariable === 'chrome') { launchContext.useChrome = true; } } const crawler = new Apify.PuppeteerCrawler({ browserPoolOptions: { preLaunchHooks: [maybeLaunchChrome] }, // ... }) ``` This is better in multiple ways. It is consistent across both Puppeteer and Playwright. It allows you to easily construct your browsers with pre-defined behavior: ``` const preLaunchHooks = [ maybeLaunchChrome, useHeadfulIfNeeded, injectNewFingerprint, ] ``` And thanks to the addition of [`crawler.crawlingContexts`](#handler-arguments-are-now-crawling-context) the functions also have access to the `crawlingContext` of the `request` that triggered the launch. ``` const preLaunchHooks = [ async function maybeLaunchChrome(pageId, launchContext) { const { request } = crawler.crawlingContexts.get(pageId); if (request.userData.useHeadful === true) { launchContext.launchOptions.headless = false; } } ] ``` ## Launch functions[​](#launch-functions "Direct link to Launch functions") In addition to `Apify.launchPuppeteer()` we now also have `Apify.launchPlaywright()`. ### Updated arguments[​](#updated-arguments "Direct link to Updated arguments") We [updated the launch options object](#launchpuppeteeroptions--launchcontext) because it was a frequent source of confusion. ``` // OLD await Apify.launchPuppeteer({ useChrome: true, headless: true, }) // NEW await Apify.launchPuppeteer({ useChrome: true, launchOptions: { headless: true, } }) ``` ### Custom modules[​](#custom-modules "Direct link to Custom modules") `Apify.launchPuppeteer` already supported the `puppeteerModule` option. With Playwright, we normalized the name to `launcher` because the `playwright` module itself does not launch browsers. ``` const puppeteer = require('puppeteer'); const playwright = require('playwright'); await Apify.launchPuppeteer(); // Is the same as: await Apify.launchPuppeteer({ launcher: puppeteer }) await Apify.launchPlaywright(); // Is the same as: await Apify.launchPlaywright({ launcher: playwright.chromium }) ``` --- # Upgrading to v2 Copy for LLM * **BREAKING**: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy. * **BREAKING**: Bump `cheerio` to `1.0.0-rc.10` from `rc.3`. There were breaking changes in `cheerio` between the versions so this bump might be breaking for you as well. * Remove `LiveViewServer` which was deprecated before release of SDK v1. --- # Upgrading to v3 Copy for LLM This page summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3. Crawlee vs Apify SDK v2 Up until version 3 of `apify`, the package contained both scraping related tools and Apify platform related helper methods. 
With v3 we are splitting the whole project into two main parts: * [Crawlee](https://github.com/apify/crawlee), the new web-scraping library, available as [`crawlee`](https://www.npmjs.com/package/crawlee) package on NPM * [Apify SDK](https://github.com/apify/apify-sdk-js), helpers for the Apify platform, available as [`apify`](https://www.npmjs.com/package/apify) package on NPM ## Crawlee monorepo[​](#crawlee-monorepo "Direct link to Crawlee monorepo") The [`crawlee`](https://www.npmjs.com/package/crawlee) package consists of several smaller packages, released separately under `@crawlee` namespace: * [`@crawlee/core`](https://crawlee.dev/js/api/core.md): the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes * [`@crawlee/cheerio`](https://crawlee.dev/js/api/cheerio-crawler.md): exports `CheerioCrawler` * [`@crawlee/playwright`](https://crawlee.dev/js/api/playwright-crawler.md): exports `PlaywrightCrawler` * [`@crawlee/puppeteer`](https://crawlee.dev/js/api/puppeteer-crawler.md): exports `PuppeteerCrawler` * [`@crawlee/jsdom`](https://crawlee.dev/js/api/jsdom-crawler.md): exports `JSDOMCrawler` * [`@crawlee/basic`](https://crawlee.dev/js/api/basic-crawler.md): exports `BasicCrawler` * [`@crawlee/http`](https://crawlee.dev/js/api/http-crawler.md): exports `HttpCrawler` (which is used for creating [`@crawlee/jsdom`](https://crawlee.dev/js/api/jsdom-crawler.md) and [`@crawlee/cheerio`](https://crawlee.dev/js/api/cheerio-crawler.md)) * [`@crawlee/browser`](https://crawlee.dev/js/api/browser-crawler.md): exports `BrowserCrawler` (which is used for creating [`@crawlee/playwright`](https://crawlee.dev/js/api/playwright-crawler.md) and [`@crawlee/puppeteer`](https://crawlee.dev/js/api/puppeteer-crawler.md)) * [`@crawlee/memory-storage`](https://crawlee.dev/js/api/memory-storage.md): [`@apify/storage-local`](https://npmjs.com/package/@apify/storage-local) alternative * [`@crawlee/browser-pool`](https://crawlee.dev/js/api/browser-pool.md): previously [`browser-pool`](https://npmjs.com/package/browser-pool) package * [`@crawlee/utils`](https://crawlee.dev/js/api/utils.md): utility methods * [`@crawlee/types`](https://crawlee.dev/js/api/types.md): holds TS interfaces mainly about the [`StorageClient`](https://crawlee.dev/js/api/core/interface/StorageClient.md) ### Installing Crawlee[​](#installing-crawlee "Direct link to Installing Crawlee") Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`. If we don't care much about additional code being pulled in, we can just use the `crawlee` meta-package, which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. ``` npm install crawlee ``` Or if all we need is cheerio support, we can install only `@crawlee/cheerio`. ``` npm install @crawlee/cheerio ``` When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used. 
``` npm install crawlee playwright # or npm install @crawlee/playwright playwright ``` Alternatively we can also use the `crawlee` meta-package which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. > Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser related utilities can be also found in the crawler packages (e.g. `@crawlee/playwright`). ## Full TypeScript support[​](#full-typescript-support "Direct link to Full TypeScript support") Both Crawlee and Apify SDK are full TypeScript rewrite, so they include up-to-date types in the package. For your TypeScript crawlers we recommend using our predefined TypeScript configuration from `@apify/tsconfig` package. Don't forget to set the `module` and `target` to `ES2022` or above to be able to use top level await. > The `@apify/tsconfig` config has [`noImplicitAny`](https://www.typescriptlang.org/tsconfig#noImplicitAny) enabled, you might want to disable it during the initial development as it will cause build failures if you left some unused local variables in your code. tsconfig.json ``` { "extends": "@apify/tsconfig", "compilerOptions": { "module": "ES2022", "target": "ES2022", "outDir": "dist", "lib": ["DOM"] }, "include": [ "./src/**/*" ] } ``` ### Docker build[​](#docker-build "Direct link to Docker build") For `Dockerfile` we recommend using multi-stage build, so you don't install the dev dependencies like TypeScript in your final image: Dockerfile ``` # using multistage build, as we need dev deps to build the TS source code FROM apify/actor-node:20 AS builder # copy all files, install all dependencies (including dev deps) and build the project COPY . ./ RUN npm install --include=dev \ && npm run build # create final image FROM apify/actor-node:20 # copy only necessary files COPY --from=builder /usr/src/app/package*.json ./ COPY --from=builder /usr/src/app/README.md ./ COPY --from=builder /usr/src/app/dist ./dist COPY --from=builder /usr/src/app/apify.json ./apify.json COPY --from=builder /usr/src/app/INPUT_SCHEMA.json ./INPUT_SCHEMA.json # install only prod deps RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional \ && echo "Installed NPM packages:" \ && (npm list --only=prod --no-optional --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # run compiled code CMD npm run start:prod ``` ## Browser fingerprints[​](#browser-fingerprints "Direct link to Browser fingerprints") Previously we had a magical `stealth` option in the puppeteer crawler that enabled several tricks aiming to mimic the real users as much as possible. While this worked to a certain degree, we decided to replace it with generated browser fingerprints. In case we don't want to have dynamic fingerprints, we can disable this behaviour via `useFingerprints` in `browserPoolOptions`: ``` const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: false, }, }); ``` ## Session cookie method renames[​](#session-cookie-method-renames "Direct link to Session cookie method renames") Previously, if we wanted to get or add cookies for the session that would be used for the request, we had to call `session.getPuppeteerCookies()` or `session.setPuppeteerCookies()`. 
Since these methods could be used with any of our crawlers, not just `PuppeteerCrawler`, they have been renamed to `session.getCookies()` and `session.setCookies()` respectively. Otherwise, their usage is exactly the same!

## Memory storage

When we store some data or intermediate state (like the one `RequestQueue` holds), we now use `@crawlee/memory-storage` by default. It is an alternative to `@apify/storage-local` that stores the state in memory (as opposed to the SQLite database used by `@apify/storage-local`). While the state is held in memory, it is also dumped to the file system, so we can observe it, and any existing data stored in the KeyValueStore (e.g. the `INPUT.json` file) is respected.

When we want to run the crawler on the Apify platform, we need to use `Actor.init` or `Actor.main`, which will automatically switch the storage client to `ApifyClient` when on the Apify platform.

We can still use `@apify/storage-local`. To do so, first install it and then pass it to the `Actor.init` or `Actor.main` options:

> `@apify/storage-local` v2.1.0+ is required for Crawlee

```
import { Actor } from 'apify';
import { ApifyStorageLocal } from '@apify/storage-local';

const storage = new ApifyStorageLocal(/* options like `enableWalMode` belong here */);
await Actor.init({ storage });
```

## Purging of the default storage

Previously the state was preserved between local runs, and we had to use the `--purge` argument of the `apify-cli`. With Crawlee, this is now the default behaviour: the storage is purged automatically when `Actor.init/main` is called. We can opt out of it via `purge: false` in the `Actor.init` options.

## Renamed crawler options and interfaces

Some options were renamed to better reflect what they do. We still support all the old parameter names too, but not at the TS level.

* `handleRequestFunction` -> `requestHandler`
* `handlePageFunction` -> `requestHandler`
* `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs`
* `handlePageTimeoutSecs` -> `requestHandlerTimeoutSecs`
* `requestTimeoutSecs` -> `navigationTimeoutSecs`
* `handleFailedRequestFunction` -> `failedRequestHandler`

We also renamed the crawling context interfaces, so they follow the same convention and are more meaningful:

* `CheerioHandlePageInputs` -> `CheerioCrawlingContext`
* `PlaywrightHandlePageFunction` -> `PlaywrightCrawlingContext`
* `PuppeteerHandlePageFunction` -> `PuppeteerCrawlingContext`

## Context aware helpers

Some utilities previously available under the `Apify.utils` namespace are now moved to the crawling context and are *context aware*. This means they have some parameters automatically filled in from the context, like the current `Request` instance or current `Page` object, or the `RequestQueue` bound to the crawler.

### Enqueuing links

One common helper that received more attention is `enqueueLinks`. As mentioned above, it is context aware - we no longer need to pass in the `requestQueue` or `page` arguments (or the cheerio handle `$`).
In addition to that, it now offers 3 enqueuing strategies:

* `EnqueueStrategy.All` (`'all'`): Matches any URLs found
* `EnqueueStrategy.SameHostname` (`'same-hostname'`): Matches any URLs that have the same subdomain as the base URL (default)
* `EnqueueStrategy.SameDomain` (`'same-domain'`): Matches any URLs that have the same domain name. For example, `https://wow.an.example.com` and `https://example.com` will both be matched for a base URL of `https://example.com`.

This means we can even call `enqueueLinks()` without any parameters. By default, it will go through all the links found on the current page and filter only those targeting the same subdomain.

Moreover, we can specify patterns the URL should match via globs:

```
const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks({
            globs: ['https://crawlee.dev/*/*'],
            // we can also use `regexps` and `pseudoUrls` keys here
        });
    },
});
```

## Implicit `RequestQueue` instance

All crawlers now have the `RequestQueue` instance automatically available via the `crawler.getRequestQueue()` method. It will create the instance for you if it does not exist yet. This means we no longer need to create the `RequestQueue` instance manually, and we can just use the `crawler.addRequests()` method described below.

> We can still create the `RequestQueue` explicitly; the `crawler.getRequestQueue()` method will respect that and return the instance provided via crawler options.

## `crawler.addRequests()`

We can now add multiple requests in batches. The newly added `addRequests` method will handle everything for us. It enqueues the first 1000 requests and resolves, while continuing with the rest in the background, again in smaller batches of 1000 items, so we don't run into any API rate limits. This means the crawling will start almost immediately (within a few seconds at most), something previously possible only with a combination of `RequestQueue` and `RequestList`.

```
// will resolve right after the initial batch of 1000 requests is added
const result = await crawler.addRequests([/* many requests, can be even millions */]);

// if we want to wait for all the requests to be added, we can await the `waitForAllRequestsToBeAdded` promise
await result.waitForAllRequestsToBeAdded;
```

## Less verbose error logging

Previously, an error thrown from inside the request handler resulted in the full error object being logged. With Crawlee, we log only the error message as a warning, as long as we know the request will be retried. If you want to enable verbose logging like in v2, use the `CRAWLEE_VERBOSE_LOG` env var.

## `Request.label` shortcut

Labeling requests used to work via the `Request.userData` object. With Crawlee, we can also use the `Request.label` shortcut. It is implemented as a `get/set` pair, using the value from `Request.userData`. Support for this shortcut is also added to the `enqueueLinks` options interface.
```
async requestHandler({ request, enqueueLinks }) {
    if (request.label !== 'DETAIL') {
        await enqueueLinks({
            globs: ['...'],
            label: 'DETAIL',
        });
    }
}
```

## Removal of `requestAsBrowser`

In v1 we replaced the underlying implementation of `requestAsBrowser` with just a proxy over calling [`got-scraping`](https://github.com/apify/got-scraping) - our custom extension to `got` that tries to mimic real browsers as much as possible. With v3, we are removing `requestAsBrowser` and encouraging the use of [`got-scraping`](https://github.com/apify/got-scraping) directly. For easier migration, we also added a `context.sendRequest()` helper that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping):

```
const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        // we can use the options parameter to override gotScraping options
        const res = await sendRequest({ responseType: 'json' });
        log.info('received body', res.body);
    },
});
```

### How to use `sendRequest()`?

See [the Got Scraping guide](https://crawlee.dev/js/docs/guides/got-scraping.md).

### Removed options

The `useInsecureHttpParser` option has been removed. It's permanently set to `true` in order to better mimic browsers' behavior.

Got Scraping automatically performs protocol negotiation, hence we removed the `useHttp2` option. It's set to `true` - effectively all modern browsers are capable of HTTP/2 requests, and more and more of the web uses it too!

### Renamed options

In the `requestAsBrowser` approach, some of the options were named differently. Here's a list of renamed options:

#### `payload`

This option represents the body to send. It could be a `string` or a `Buffer`. However, there is no `payload` option anymore. You need to use `body` instead. Or, if you wish to send JSON, `json`. Here's an example:

```
// Before:
await Apify.utils.requestAsBrowser({ …, payload: 'Hello, world!' });
await Apify.utils.requestAsBrowser({ …, payload: Buffer.from('c0ffe', 'hex') });
await Apify.utils.requestAsBrowser({ …, json: { hello: 'world' } });

// After:
await gotScraping({ …, body: 'Hello, world!' });
await gotScraping({ …, body: Buffer.from('c0ffe', 'hex') });
await gotScraping({ …, json: { hello: 'world' } });
```

#### `ignoreSslErrors`

It has been renamed to `https.rejectUnauthorized`. By default, it's set to `false` for convenience. However, if you want to make sure the connection is secure, you can do the following:

```
// Before:
await Apify.utils.requestAsBrowser({ …, ignoreSslErrors: false });

// After:
await gotScraping({ …, https: { rejectUnauthorized: true } });
```

Please note: the meanings are opposite! So we needed to invert the values as well.

#### `header-generator` options

`useMobileVersion`, `languageCode` and `countryCode` no longer exist.
Instead, you need to use `headerGeneratorOptions` directly:

```
// Before:
await Apify.utils.requestAsBrowser({
    …,
    useMobileVersion: true,
    languageCode: 'en',
    countryCode: 'US',
});

// After:
await gotScraping({
    …,
    headerGeneratorOptions: {
        devices: ['mobile'], // or ['desktop']
        locales: ['en-US'],
    },
});
```

#### `timeoutSecs`

In order to set a timeout, use `timeout.request` (which is in **milliseconds** now).

```
// Before:
await Apify.utils.requestAsBrowser({
    …,
    timeoutSecs: 30,
});

// After:
await gotScraping({
    …,
    timeout: {
        request: 30 * 1000,
    },
});
```

#### `throwOnHttpErrors`

`throwOnHttpErrors` → `throwHttpErrors`. This option throws on unsuccessful HTTP status codes, for example `404`. By default, it's set to `false`.

#### `decodeBody`

`decodeBody` → `decompress`. This option decompresses the body. Defaults to `true` - please do not change this or websites will break (unless you know what you're doing!).

#### `abortFunction`

This function used to make the promise throw on specific responses if it returned `true`. However, it wasn't that useful. You probably want to cancel the request instead, which you can do in the following way:

```
const promise = gotScraping(…);

promise.on('request', request => {
    // Please note this is not a Got Request instance, but a ClientRequest one.
    // https://nodejs.org/api/http.html#class-httpclientrequest

    if (request.protocol !== 'https:') {
        // Insecure request, abort.
        promise.cancel();

        // If you set `isStream` to `true`, please use `stream.destroy()` instead.
    }
});

const response = await promise;
```

## Removal of browser pool plugin mixing

Previously, you were able to have a browser pool that would mix Puppeteer and Playwright plugins (or even your own custom plugins if you've built any). As of this version, that is no longer allowed, and creating such a browser pool will cause an error to be thrown (it's expected that all plugins that will be used are of the same type). Confused? As an example, this change disallows a pool that mixes Puppeteer with Playwright. You can still create pools that use multiple Playwright plugins, each with a different launcher if you want!

## Handling requests outside of browser

One small feature worth mentioning is the ability to handle requests with browser crawlers outside the browser. To do that, we can use a combination of `Request.skipNavigation` and `context.sendRequest()`. Take a look at how to achieve this by checking out the [Skipping navigation for certain requests](https://crawlee.dev/js/docs/examples/skip-navigation.md) example!

## Logging

Crawlee exports the default `log` instance directly as a named export. We also have a scoped `log` instance provided in the crawling context - this one will log messages prefixed with the crawler name and should be preferred for logging inside the request handler.
```
const crawler = new CheerioCrawler({
    async requestHandler({ log, request }) {
        log.info(`Opened ${request.loadedUrl}`);
    },
});
```

## Auto-saved crawler state

Every crawler instance now has a `useState()` method that will return a state object we can use. It will be automatically saved when the `persistState` event occurs. The value is cached, so we can freely call this method multiple times and get the exact same reference. No need to worry about saving the value either, as it will happen automatically.

```
const crawler = new CheerioCrawler({
    async requestHandler({ crawler }) {
        const state = await crawler.useState({ foo: [] as number[] });
        // just change the value, no need to care about saving it
        state.foo.push(123);
    },
});
```

## Apify SDK

The Apify platform helpers can now be found in the Apify SDK (the `apify` NPM package). It exports the `Actor` class that offers the following static helpers:

* `ApifyClient` shortcuts: `addWebhook()`, `call()`, `callTask()`, `metamorph()`
* helpers for running on the Apify platform: `init()`, `exit()`, `fail()`, `main()`, `isAtHome()`, `createProxyConfiguration()`
* storage support: `getInput()`, `getValue()`, `openDataset()`, `openKeyValueStore()`, `openRequestQueue()`, `pushData()`, `setValue()`
* events support: `on()`, `off()`
* other utilities: `getEnv()`, `newClient()`, `reboot()`

`Actor.main` is now just syntax sugar around calling `Actor.init()` at the beginning and `Actor.exit()` at the end (plus wrapping the user function in a try/catch block). All those methods are async and should be awaited - with Node.js 16 we can use top level await for that. In other words, the following are equivalent:

```
import { Actor } from 'apify';

await Actor.init();
// your code
await Actor.exit('Crawling finished!');
```

```
import { Actor } from 'apify';

await Actor.main(async () => {
    // your code
}, { statusMessage: 'Crawling finished!' });
```

`Actor.init()` will conditionally set the storage implementation of Crawlee to the `ApifyClient` when running on the Apify platform, or keep the default (memory storage) implementation otherwise. It will also subscribe to the websocket events (or mimic them locally). `Actor.exit()` will handle the teardown and call `process.exit()` to ensure our process won't hang indefinitely.

### Events

Apify SDK (v2) exports `Apify.events`, which is an `EventEmitter` instance. With Crawlee, the events are managed by the [`EventManager`](https://crawlee.dev/js/api/core/class/EventManager.md) class instead. We can either access it via the `Actor.eventManager` getter, or use the `Actor.on` and `Actor.off` shortcuts instead.

```
-Apify.events.on(...);
+Actor.on(...);
```

> We can also get the [`EventManager`](https://crawlee.dev/js/api/core/class/EventManager.md) instance via `Configuration.getEventManager()`.

In addition to the existing events, we now have an `exit` event fired when calling `Actor.exit()` (which is called at the end of `Actor.main()`). This event allows you to gracefully shut down any resources when `Actor.exit` is called.
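To make the `exit` event concrete, here is a minimal sketch of such a shutdown hook. The interval timer is just a stand-in for whatever resource (database connection, file handle, etc.) you might need to release; everything else only uses the `Actor` API described above.

```
import { Actor } from 'apify';

await Actor.init();

// stand-in for any resource that needs explicit cleanup (DB connection, file handle, ...)
const heartbeat = setInterval(() => console.log('crawler still running...'), 5_000);

// fired when `Actor.exit()` is called, including at the end of `Actor.main()`
Actor.on('exit', () => {
    clearInterval(heartbeat);
    console.log('Resources released, exiting now.');
});

// ... your crawling code goes here ...

await Actor.exit('Crawling finished!');
```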
## Smaller/internal breaking changes

* `Apify.call()` is now just a shortcut for running `ApifyClient.actor(actorId).call(input, options)`, while also taking the token inside env vars into account
* `Apify.callTask()` is now just a shortcut for running `ApifyClient.task(taskId).call(input, options)`, while also taking the token inside env vars into account
* `Apify.metamorph()` is now just a shortcut for running `ApifyClient.run(runId).metamorph(targetActorId, input, options)`, while also taking the `ACTOR_RUN_ID` inside env vars into account
* `Apify.waitForRunToFinish()` has been removed, use `ApifyClient.waitForFinish()` instead
* `Actor.main/init` purges the storage by default
* removed the `purgeLocalStorage` helper, purging now lives on the storage class directly
    * the `StorageClient` interface now has an optional `purge` method
    * purging happens automatically via `Actor.init()` (you can opt out via `purge: false` in the options of the `init/main` methods)
* `QueueOperationInfo.request` is no longer available
* `Request.handledAt` is now a string date in ISO format
* `Request.inProgress` and `Request.reclaimed` are now `Set`s instead of POJOs
* `injectUnderscore` from puppeteer utils has been removed
* `APIFY_MEMORY_MBYTES` is no longer taken into account, use `CRAWLEE_AVAILABLE_MEMORY_RATIO` instead
* some `AutoscaledPool` options are no longer available:
    * `cpuSnapshotIntervalSecs` and `memorySnapshotIntervalSecs` have been replaced with the top level `systemInfoIntervalMillis` configuration
    * `maxUsedCpuRatio` has been moved to the top level configuration
* `ProxyConfiguration.newUrlFunction` can be async. `.newUrl()` and `.newProxyInfo()` now return promises (see the sketch below).
* `prepareRequestFunction` and `postResponseFunction` options are removed, use navigation hooks instead
* `gotoFunction` and `gotoTimeoutSecs` are removed
* removed the compatibility fix for old/broken request queues with null `Request` props
* `fingerprintsOptions` renamed to `fingerprintOptions` (`fingerprints` -> `fingerprint`)
* `fingerprintOptions` now accepts `useFingerprintCache` and `fingerprintCacheSize` (instead of `useFingerprintPerProxyCache` and `fingerprintPerProxyCacheSize`, which are no longer available). This is because the cached fingerprints are no longer connected to proxy URLs but to sessions.

---
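As an illustration of the `ProxyConfiguration.newUrlFunction` change above, here is a minimal sketch of an async setup. The proxy URL and session id below are made-up placeholders; the point is only that the function may return a promise and that `.newUrl()` / `.newProxyInfo()` must now be awaited.

```
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // the function may now be async, e.g. when fetching a fresh proxy from an external service
    newUrlFunction: async (sessionId) => {
        // placeholder URL - replace with however you obtain your proxies
        return `http://my-proxy.example.com:8000?session=${sessionId ?? 'default'}`;
    },
});

// both methods now return promises, so they need to be awaited
const proxyUrl = await proxyConfiguration.newUrl('session-1');
const proxyInfo = await proxyConfiguration.newProxyInfo('session-1');
console.log(proxyUrl, proxyInfo?.hostname);
```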