@crawlee/core
Core set of classes required for Crawlee.
The crawlee package consists of several smaller packages, released separately under the @crawlee namespace:
- @crawlee/core: the base for all the crawler implementations; also contains classes such as `Request`, `RequestQueue`, `RequestList` and `Dataset`
- @crawlee/cheerio: exports `CheerioCrawler`
- @crawlee/playwright: exports `PlaywrightCrawler`
- @crawlee/puppeteer: exports `PuppeteerCrawler`
- @crawlee/linkedom: exports `LinkeDOMCrawler`
- @crawlee/jsdom: exports `JSDOMCrawler`
- @crawlee/basic: exports `BasicCrawler`
- @crawlee/http: exports `HttpCrawler` (used as the base for `@crawlee/jsdom` and `@crawlee/cheerio`)
- @crawlee/browser: exports `BrowserCrawler` (used as the base for `@crawlee/playwright` and `@crawlee/puppeteer`)
- @crawlee/memory-storage: an alternative to `@apify/storage-local`
- @crawlee/browser-pool: previously the `browser-pool` package
- @crawlee/utils: utility methods
- @crawlee/types: holds TS interfaces, mainly describing the `StorageClient`
Installing Crawlee
Most Crawlee packages extend and re-export each other, so it's enough to install just the one you plan to use, e.g. @crawlee/playwright if you plan to use Playwright - it already contains everything from the @crawlee/browser package, which in turn includes everything from @crawlee/basic, which includes everything from @crawlee/core.
If we don't care much about additional code being pulled in, we can just use the crawlee meta-package, which contains (re-exports) most of the @crawlee/* packages, and therefore contains all the crawler classes.
npm install crawlee
Or if all we need is cheerio support, we can install only @crawlee/cheerio.
npm install @crawlee/cheerio
When using playwright or puppeteer, we still need to install those dependencies explicitly - this lets users stay in control of which version is used.
npm install crawlee playwright
# or npm install @crawlee/playwright playwright
Sometimes you might want to use some utility methods from @crawlee/utils, so you might want to install that as well. This package contains utilities that were previously available under `Apify.utils`. Browser-related utilities can also be found in the crawler packages (e.g. @crawlee/playwright).
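For instance, the `sleep` helper (formerly `Apify.utils.sleep`) is exported from `@crawlee/utils`. As a rough sketch of what it does - assuming the real implementation is essentially a promisified `setTimeout` - it could be written as:

```typescript
// Minimal sketch of the `sleep` utility from @crawlee/utils (illustration only;
// install the package and import the real helper rather than re-implementing it).
const sleep = (millis: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, millis));

// Usage: pause inside a request handler, e.g. to throttle between requests.
// await sleep(1_000);
```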
Index
Crawlers
Result Stores
Scaling
Sources
Other
- RequestQueueV2
- EnqueueStrategy
- EventType
- LogLevel
- RequestState
- Configuration
- CriticalError
- ErrorSnapshotter
- ErrorTracker
- EventManager
- GotScrapingHttpClient
- LocalEventManager
- Log
- Logger
- LoggerJson
- LoggerText
- NonRetryableError
- RecoverableState
- RequestHandlerResult
- RequestManagerTandem
- RequestProvider
- RetryRequestError
- Router
- SessionError
- SitemapRequestList
- AddRequestsBatchedOptions
- AddRequestsBatchedResult
- AutoscaledPoolOptions
- BaseHttpClient
- BaseHttpResponseData
- ClientInfo
- ConfigurationOptions
- Cookie
- CrawlingContext
- CreateSession
- DatasetConsumer
- DatasetContent
- DatasetDataOptions
- DatasetExportOptions
- DatasetExportToOptions
- DatasetIteratorOptions
- DatasetMapper
- DatasetOptions
- DatasetReducer
- EnqueueLinksOptions
- ErrnoException
- ErrorTrackerOptions
- FinalStatistics
- HttpRequest
- HttpRequestOptions
- HttpResponse
- IRequestList
- IRequestManager
- IStorage
- KeyConsumer
- KeyValueStoreIteratorOptions
- KeyValueStoreOptions
- LoggerOptions
- PersistenceOptions
- ProxyConfigurationFunction
- ProxyConfigurationOptions
- ProxyInfo
- PushErrorMessageOptions
- QueueOperationInfo
- RecordOptions
- RecoverableStateOptions
- RecoverableStatePersistenceOptions
- RequestListOptions
- RequestListState
- RequestOptions
- RequestProviderOptions
- RequestQueueOperationOptions
- RequestQueueOptions
- RequestTransform
- ResponseLike
- ResponseTypes
- RestrictedCrawlingContext
- RouterHandler
- SessionOptions
- SessionPoolOptions
- SessionState
- SitemapRequestListOptions
- SnapshotResult
- SnapshotterOptions
- StatisticPersistedState
- StatisticsOptions
- StatisticState
- StorageClient
- StorageManagerOptions
- StreamingHttpResponse
- SystemInfo
- SystemStatusOptions
- TieredProxy
- UseStateOptions
- EventTypeName
- GetUserDataFromRequest
- GlobInput
- GlobObject
- LoadedRequest
- PseudoUrlInput
- PseudoUrlObject
- RedirectHandler
- RegExpInput
- RegExpObject
- RequestListSourcesFunction
- RequestsLike
- RouterRoutes
- SkippedRequestCallback
- SkippedRequestReason
- Source
- UrlPatternObject
- BLOCKED_STATUS_CODES
- log
- MAX_POOL_SIZE
- PERSIST_STATE_KEY
- checkStorageAccess
- enqueueLinks
- filterRequestsByPatterns
- processHttpRequestOptions
- purgeDefaultStorages
- tryAbsoluteURL
- useState
- withCheckedStorageAccess
Other
RequestQueueV2
EventTypeName
GetUserDataFromRequest
Type parameters
- T
GlobInput
GlobObject
LoadedRequest
Type parameters
- R: Request
PseudoUrlInput
PseudoUrlObject
RedirectHandler
Type declaration
- Parameters
  - redirectResponse: BaseHttpResponseData
  - updatedRequest: { headers: SimpleHeaders; url?: string | URL }
    - headers: SimpleHeaders
    - optional url: string | URL
- Returns void
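Based on the type declaration above, a redirect handler might look like the following. The types here are simplified stand-ins for the real Crawlee types, defined inline for illustration:

```typescript
// Simplified stand-ins for the real Crawlee types (illustration only).
type SimpleHeaders = Record<string, string | string[] | undefined>;

interface BaseHttpResponseData {
    url: string;
    statusCode: number;
    headers: SimpleHeaders;
}

type RedirectHandler = (
    redirectResponse: BaseHttpResponseData,
    updatedRequest: { headers: SimpleHeaders; url?: string | URL },
) => void;

// Example handler: drop the Authorization header when a redirect
// leaves the host that originally received the credentials.
const stripAuthOnCrossOriginRedirect: RedirectHandler = (redirectResponse, updatedRequest) => {
    const originalHost = new URL(redirectResponse.url).host;
    const targetHost = updatedRequest.url
        ? new URL(updatedRequest.url.toString()).host
        : originalHost;
    if (targetHost !== originalHost) {
        delete updatedRequest.headers.authorization;
    }
};
```

The handler mutates `updatedRequest` in place and returns nothing, matching the `void` return type in the declaration.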
 
RegExpInput
RegExpObject
RequestListSourcesFunction
Type declaration
- Returns Promise<RequestListSource[]>
 
RequestsLike
RouterRoutes
Type parameters
- Context
- UserData: Dictionary
SkippedRequestCallback
Type declaration
- Parameters
  - args: { reason: SkippedRequestReason; url: string }
    - reason: SkippedRequestReason
    - url: string
- Returns Awaitable<void>
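A callback of this shape might be used to record which URLs a crawler skipped and why. The types below are simplified stand-ins defined inline for illustration; in particular, the concrete `SkippedRequestReason` values are assumptions, not the library's actual union:

```typescript
// Simplified stand-ins for the real Crawlee types (illustration only;
// the concrete SkippedRequestReason values here are assumed).
type SkippedRequestReason = 'robotsTxt' | 'limit' | 'filters' | 'redirect';
type Awaitable<T> = T | Promise<T>;

type SkippedRequestCallback = (args: { url: string; reason: SkippedRequestReason }) => Awaitable<void>;

// Example callback: collect skipped URLs grouped by the reason they were skipped.
const skipped = new Map<SkippedRequestReason, string[]>();

const onSkippedRequest: SkippedRequestCallback = ({ url, reason }) => {
    const bucket = skipped.get(reason) ?? [];
    bucket.push(url);
    skipped.set(reason, bucket);
};
```

Because the return type is `Awaitable<void>`, the callback may be synchronous (as above) or return a promise, e.g. to write the record to storage.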
 
RedirectHandler is the type of a function called when an HTTP redirect takes place. It is allowed to mutate the `updatedRequest` argument.