@crawlee/browser
Provides a simple framework for parallel crawling of web pages using headless browsers with Puppeteer and Playwright. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.
Since BrowserCrawler uses headless (or even headful) browsers to download web pages and extract data, it is useful for crawling websites that require JavaScript to be executed. If the target website doesn't need JavaScript, we should consider using CheerioCrawler, which downloads the pages using raw HTTP requests and is about 10x faster.
The source URLs are represented by Request objects that are fed from the RequestList or RequestQueue instances provided by the requestList or requestQueue constructor options, respectively. If neither the requestList nor the requestQueue option is provided, the crawler will open the default request queue either when the crawler.addRequests() function is called, or when the requests parameter (representing the initial requests) of the crawler.run() function is provided.
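For example, a minimal sketch of both ways of feeding the default queue, assuming the concrete PuppeteerCrawler subclass from the crawlee metapackage (the URL below is a placeholder):

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page }) {
        // Process the loaded page here.
    },
});

// Calling addRequests() opens the default request queue implicitly...
await crawler.addRequests(['https://example.com/start']);
await crawler.run();

// ...or pass the initial requests directly to run():
// await crawler.run(['https://example.com/start']);
```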
If both the requestList and requestQueue options are used, the instance first processes URLs from the RequestList and automatically enqueues all of them to the RequestQueue before it starts processing them. This ensures that a single URL is not crawled multiple times.
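A sketch of combining both sources, again assuming the PuppeteerCrawler subclass (the list name and URL are placeholders):

```ts
import { PuppeteerCrawler, RequestList, RequestQueue } from 'crawlee';

// URLs from the list are enqueued to the queue before processing starts,
// which deduplicates them against anything already in the queue.
const requestList = await RequestList.open('start-urls', ['https://example.com']);
const requestQueue = await RequestQueue.open();

const crawler = new PuppeteerCrawler({
    requestList,
    requestQueue,
    async requestHandler({ page }) {
        // Process the loaded page here.
    },
});

await crawler.run();
```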
The crawler finishes when there are no more Request objects to crawl.
BrowserCrawler opens a new browser page (i.e. tab or window) for each Request object to crawl and then calls the function provided by the user as the requestHandler option.
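As an illustration, a minimal crawler built on the PuppeteerCrawler subclass (PlaywrightCrawler works analogously); the start URL, the limit, and the extracted fields are illustrative:

```ts
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Called for each Request, with a fresh browser page already navigated to its URL.
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        await Dataset.pushData({ url: request.url, title });
        // Enqueue links discovered on the page for recursive crawling.
        await enqueueLinks();
    },
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://crawlee.dev']);
```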
New pages are only opened when there is enough free CPU and memory available, using the functionality provided by the AutoscaledPool class. All AutoscaledPool configuration options can be passed to the autoscaledPoolOptions parameter of the BrowserCrawler constructor. For user convenience, the minConcurrency and maxConcurrency options of the underlying AutoscaledPool constructor are available directly in the BrowserCrawler constructor.
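For instance, a sketch using PuppeteerCrawler (the concrete numbers are arbitrary):

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Shortcuts for the underlying AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Any other AutoscaledPool option can be passed through here.
    autoscaledPoolOptions: {
        desiredConcurrency: 10,
    },
    async requestHandler({ page }) {
        // Process the loaded page here.
    },
});
```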
NOTE: the pool of browser instances is internally managed by the BrowserPool class.
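The pool can be tuned via the browserPoolOptions constructor option; a sketch with arbitrary values:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Forwarded to the underlying BrowserPool instance.
    browserPoolOptions: {
        maxOpenPagesPerBrowser: 20,
        retireBrowserAfterPageCount: 100,
    },
    async requestHandler({ page }) {
        // Process the loaded page here.
    },
});
```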
Index
Crawlers
- BrowserCrawler
Other
- AddRequestsBatchedOptions
- AddRequestsBatchedResult
- AutoscaledPool
- AutoscaledPoolOptions
- BASIC_CRAWLER_TIMEOUT_BUFFER_SECS
- BLOCKED_STATUS_CODES
- BaseHttpClient
- BaseHttpResponseData
- BasicCrawler
- BasicCrawlerOptions
- BasicCrawlingContext
- ClientInfo
- Configuration
- ConfigurationOptions
- Cookie
- CrawlerAddRequestsOptions
- CrawlerAddRequestsResult
- CrawlerExperiments
- CrawlerRunOptions
- CrawlingContext
- CreateContextOptions
- CreateSession
- CriticalError
- Dataset
- DatasetConsumer
- DatasetContent
- DatasetDataOptions
- DatasetExportOptions
- DatasetExportToOptions
- DatasetIteratorOptions
- DatasetMapper
- DatasetOptions
- DatasetReducer
- EnqueueLinksOptions
- EnqueueStrategy
- ErrnoException
- ErrorHandler
- ErrorSnapshotter
- ErrorTracker
- ErrorTrackerOptions
- EventManager
- EventType
- EventTypeName
- FinalStatistics
- GetUserDataFromRequest
- GlobInput
- GlobObject
- GotScrapingHttpClient
- HttpRequest
- HttpRequestOptions
- HttpResponse
- IRequestList
- IStorage
- KeyConsumer
- KeyValueStore
- KeyValueStoreIteratorOptions
- KeyValueStoreOptions
- LoadedRequest
- LocalEventManager
- Log
- LogLevel
- Logger
- LoggerJson
- LoggerOptions
- LoggerText
- MAX_POOL_SIZE
- NonRetryableError
- PERSIST_STATE_KEY
- PersistenceOptions
- ProxyConfiguration
- ProxyConfigurationFunction
- ProxyConfigurationOptions
- ProxyInfo
- PseudoUrl
- PseudoUrlInput
- PseudoUrlObject
- PushErrorMessageOptions
- QueueOperationInfo
- RecordOptions
- RedirectHandler
- RegExpInput
- RegExpObject
- Request
- RequestHandler
- RequestHandlerResult
- RequestList
- RequestListOptions
- RequestListSourcesFunction
- RequestListState
- RequestOptions
- RequestProvider
- RequestProviderOptions
- RequestQueue
- RequestQueueOperationOptions
- RequestQueueOptions
- RequestQueueV1
- RequestQueueV2
- RequestState
- RequestTransform
- ResponseLike
- ResponseTypes
- RestrictedCrawlingContext
- RetryRequestError
- Router
- RouterHandler
- RouterRoutes
- Session
- SessionError
- SessionOptions
- SessionPool
- SessionPoolOptions
- SessionState
- SitemapRequestList
- SitemapRequestListOptions
- SnapshotResult
- Snapshotter
- SnapshotterOptions
- Source
- StatisticPersistedState
- StatisticState
- Statistics
- StatisticsOptions
- StatusMessageCallback
- StatusMessageCallbackParams
- StorageClient
- StorageManagerOptions
- StreamingHttpResponse
- SystemInfo
- SystemStatus
- SystemStatusOptions
- TieredProxy
- UrlPatternObject
- UseStateOptions
- checkStorageAccess
- createBasicRouter
- enqueueLinks
- filterRequestsByPatterns
- log
- processHttpRequestOptions
- purgeDefaultStorages
- tryAbsoluteURL
- useState
- withCheckedStorageAccess
- BrowserCrawlerOptions
- BrowserCrawlingContext
- BrowserLaunchContext
- BrowserErrorHandler
- BrowserHook
- BrowserRequestHandler
BrowserErrorHandler
Type parameters
- Context extends BrowserCrawlingContext = BrowserCrawlingContext
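Handlers of this shape are what the errorHandler and failedRequestHandler constructor options expect; a sketch using PuppeteerCrawler:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Invoked when the requestHandler fails, before the request is retried.
    errorHandler: async ({ request, log }, error) => {
        log.warning(`Request ${request.url} failed: ${error.message}`);
    },
    async requestHandler({ page }) {
        // Process the loaded page here.
    },
});
```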
BrowserHook
Type parameters
- Context = BrowserCrawlingContext
- GoToOptions extends Dictionary | undefined = Dictionary
Type declaration
(crawlingContext: Context, gotoOptions: GoToOptions) => Awaitable<void>
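Hooks of this shape are accepted by the preNavigationHooks and postNavigationHooks constructor options; a sketch using PuppeteerCrawler, where gotoOptions are Puppeteer's page.goto() options and the values shown are arbitrary:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        // Runs before page.goto(); may mutate gotoOptions for the navigation.
        async (crawlingContext, gotoOptions) => {
            gotoOptions.timeout = 60_000;
            gotoOptions.waitUntil = 'networkidle2';
        },
    ],
    async requestHandler({ page }) {
        // Process the loaded page here.
    },
});
```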
BrowserRequestHandler
Type parameters
- Context extends BrowserCrawlingContext = BrowserCrawlingContext
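A handler conforming to this type can also be defined separately from the crawler; a sketch using the Puppeteer-specific context type re-exported by the crawlee metapackage:

```ts
import { PuppeteerCrawler } from 'crawlee';
import type { PuppeteerCrawlingContext } from 'crawlee';

// A request handler defined outside the constructor, typed explicitly.
const requestHandler = async ({ request, page, log }: PuppeteerCrawlingContext) => {
    log.info(`Title of ${request.url}: ${await page.title()}`);
};

const crawler = new PuppeteerCrawler({ requestHandler });
```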