@crawlee/browser
Provides a simple framework for the parallel crawling of web pages using headless browsers with Puppeteer and Playwright. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.
Since `BrowserCrawler` uses headless (or even headful) browsers to download web pages and extract data, it is useful for crawling websites that require JavaScript execution. If the target website doesn't need JavaScript, consider using `CheerioCrawler`, which downloads the pages using raw HTTP requests and is about 10x faster.
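To make the workflow concrete, here is a minimal sketch using `PlaywrightCrawler`, one of the concrete subclasses of `BrowserCrawler` (the URL is a placeholder):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Called once for every page the crawler visits.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`${title} (${request.url})`);
        // Enqueue links discovered on the page for recursive crawling.
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```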
The source URLs are represented by Request objects that are fed from the RequestList or RequestQueue instances provided by the `requestList` or `requestQueue` constructor options, respectively. If neither the `requestList` nor the `requestQueue` option is provided, the crawler opens the default request queue either when the `crawler.addRequests()` function is called or when the `requests` parameter (representing the initial requests) of the `crawler.run()` function is provided.
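Continuing the sketch above, both of the following open the default request queue under the hood (the URLs are placeholders):

```ts
// Adding requests explicitly opens the default request queue...
await crawler.addRequests(['https://example.com/a', 'https://example.com/b']);
await crawler.run();

// ...and so does passing the initial requests directly to run():
// await crawler.run(['https://example.com/a', 'https://example.com/b']);
```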
If both the `requestList` and `requestQueue` options are used, the instance first processes the URLs from the RequestList and automatically enqueues all of them to the RequestQueue before it begins processing them. This ensures that a single URL is not crawled multiple times.
The crawler finishes when there are no more Request objects to crawl.
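A sketch of combining both sources (the list name and URLs are placeholders):

```ts
import { PlaywrightCrawler, RequestList, RequestQueue } from 'crawlee';

const requestList = await RequestList.open('start-urls', ['https://example.com']);
const requestQueue = await RequestQueue.open();

const crawler = new PlaywrightCrawler({
    requestList,
    requestQueue,
    async requestHandler({ enqueueLinks }) {
        // The list's URLs are enqueued to the queue first; newly
        // discovered links are then added to the same queue.
        await enqueueLinks();
    },
});

await crawler.run();
```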
`BrowserCrawler` opens a new browser page (i.e. a tab or window) for each Request object to crawl and then calls the function provided by the user as the `requestHandler` option.
New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the AutoscaledPool class. All AutoscaledPool configuration options can be passed to the `autoscaledPoolOptions` parameter of the `BrowserCrawler` constructor. For user convenience, the `minConcurrency` and `maxConcurrency` options of the underlying AutoscaledPool constructor are available directly in the `BrowserCrawler` constructor.
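A sketch of tuning the concurrency (the numbers are illustrative, not recommendations):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Convenience shorthands forwarded to the underlying AutoscaledPool.
    minConcurrency: 2,
    maxConcurrency: 10,
    // Any other AutoscaledPool option can be passed here.
    autoscaledPoolOptions: {
        desiredConcurrency: 5,
    },
    async requestHandler({ page, log }) {
        log.info(await page.title());
    },
});
```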
NOTE: the pool of browser instances is internally managed by the BrowserPool class.
Index
References
- AutoscaledPool
- AutoscaledPoolOptions
- BASIC_CRAWLER_TIMEOUT_BUFFER_SECS
- BasicCrawler
- BasicCrawlerOptions
- BasicCrawlingContext
- ClientInfo
- Configuration
- ConfigurationOptions
- Cookie
- CrawlerAddRequestsOptions
- CrawlerAddRequestsResult
- CrawlingContext
- CreateContextOptions
- CreateSession
- CriticalError
- Dataset
- DatasetConsumer
- DatasetContent
- DatasetDataOptions
- DatasetIteratorOptions
- DatasetMapper
- DatasetOptions
- DatasetReducer
- EnqueueLinksOptions
- EnqueueStrategy
- ErrorHandler
- EventManager
- EventType
- EventTypeName
- ExportOptions
- FinalStatistics
- GlobInput
- GlobObject
- IStorage
- KeyConsumer
- KeyValueStore
- KeyValueStoreIteratorOptions
- KeyValueStoreOptions
- LocalEventManager
- Log
- LogLevel
- Logger
- LoggerJson
- LoggerOptions
- LoggerText
- NonRetryableError
- ProxyConfiguration
- ProxyConfigurationFunction
- ProxyConfigurationOptions
- ProxyInfo
- PseudoUrl
- PseudoUrlInput
- PseudoUrlObject
- PushErrorMessageOptions
- RecordOptions
- RegExpInput
- RegExpObject
- Request
- RequestHandler
- RequestList
- RequestListOptions
- RequestListSourcesFunction
- RequestListState
- RequestOptions
- RequestQueue
- RequestQueueOperationOptions
- RequestQueueOptions
- RequestState
- RequestTransform
- RetryRequestError
- Router
- RouterHandler
- Session
- SessionOptions
- SessionPool
- SessionPoolOptions
- SessionState
- Snapshotter
- SnapshotterOptions
- Source
- StatisticPersistedState
- StatisticState
- Statistics
- StorageClient
- StorageManagerOptions
- SystemInfo
- SystemStatus
- SystemStatusOptions
- UrlPatternObject
- UseStateOptions
- createBasicRouter
- enqueueLinks
- filterRequestsByPatterns
- log
- purgeDefaultStorages
- useState
Type Aliases
BrowserErrorHandler

Type parameters
- Context extends BrowserCrawlingContext = BrowserCrawlingContext

BrowserHook

Type parameters
- Context = BrowserCrawlingContext
- GoToOptions extends Dictionary | undefined = Dictionary

Type declaration
- (crawlingContext: Context, gotoOptions: GoToOptions) => Awaitable<void>

BrowserRequestHandler

Type parameters
- Context extends BrowserCrawlingContext = BrowserCrawlingContext
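A sketch of where these types typically surface in user code, assuming a `PlaywrightCrawler` (the timeout value is illustrative):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // A BrowserHook: runs before navigation and may adjust the goto options.
    preNavigationHooks: [
        async (crawlingContext, gotoOptions) => {
            if (gotoOptions) gotoOptions.timeout = 60_000;
        },
    ],
    // A BrowserRequestHandler: processes each loaded page.
    async requestHandler({ page, log }) {
        log.info(await page.title());
    },
    // A BrowserErrorHandler: called when a request fails too many times.
    async failedRequestHandler({ request }, error) {
        console.error(`${request.url} failed: ${error.message}`);
    },
});
```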