@crawlee/core
Core set of classes required for Crawlee.
The crawlee package consists of several smaller packages, released separately under the @crawlee namespace:
- @crawlee/core: the base for all the crawler implementations; also contains things like the Request, RequestQueue, RequestList, or Dataset classes
- @crawlee/cheerio: exports CheerioCrawler
- @crawlee/playwright: exports PlaywrightCrawler
- @crawlee/puppeteer: exports PuppeteerCrawler
- @crawlee/linkedom: exports LinkeDOMCrawler
- @crawlee/jsdom: exports JSDOMCrawler
- @crawlee/basic: exports BasicCrawler
- @crawlee/http: exports HttpCrawler (which is used for creating @crawlee/jsdom and @crawlee/cheerio)
- @crawlee/browser: exports BrowserCrawler (which is used for creating @crawlee/playwright and @crawlee/puppeteer)
- @crawlee/memory-storage: @apify/storage-local alternative
- @crawlee/browser-pool: previously the browser-pool package
- @crawlee/utils: utility methods
- @crawlee/types: holds TS interfaces, mainly about the StorageClient
Installing Crawlee
Most of the Crawlee packages extend and re-export each other, so it's enough to install just the one you plan on using, e.g. @crawlee/playwright if you plan on using playwright. It already contains everything from the @crawlee/browser package, which includes everything from @crawlee/basic, which includes everything from @crawlee/core.
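The re-export chain described above can be modelled as a small lookup, which shows why installing a single crawler package is enough. This is only an illustrative sketch: the `reexports` table and the `resolveIncluded` helper are hypothetical names, not part of Crawlee, and the table lists only the relationships stated in this section.

```typescript
// Hypothetical model of the re-export chain described in this section.
// Each key is a package; the value lists the packages it directly re-exports.
const reexports: Record<string, string[]> = {
    '@crawlee/playwright': ['@crawlee/browser'],
    '@crawlee/browser': ['@crawlee/basic'],
    '@crawlee/basic': ['@crawlee/core'],
    '@crawlee/core': [],
};

// Walk the chain and collect every package whose exports become available
// after installing `pkg` (including `pkg` itself).
function resolveIncluded(pkg: string): string[] {
    const included: string[] = [];
    const stack: string[] = [pkg];
    while (stack.length > 0) {
        const current = stack.pop() as string;
        if (included.indexOf(current) !== -1) continue; // already visited
        included.push(current);
        for (const dep of reexports[current] || []) {
            stack.push(dep);
        }
    }
    return included;
}

console.log(resolveIncluded('@crawlee/playwright'));
```

Under these assumptions, resolving `@crawlee/playwright` yields the playwright, browser, basic, and core packages, which matches the chain described in the paragraph above.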
If we don't care much about additional code being pulled in, we can just use the crawlee meta-package, which contains (re-exports) most of the @crawlee/* packages, and therefore contains all the crawler classes.
npm install crawlee
Or if all we need is cheerio support, we can install only @crawlee/cheerio.
npm install @crawlee/cheerio
When using playwright or puppeteer, we still need to install those dependencies explicitly. This lets users control which version is used.
npm install crawlee playwright
# or npm install @crawlee/playwright playwright
Sometimes you might want to use some utility methods from @crawlee/utils, so you might want to install that as well. This package contains some utilities that were previously available under Apify.utils. Browser-related utilities can also be found in the crawler packages (e.g. @crawlee/playwright).
Index
Crawlers
Result Stores
Scaling
Sources
Other
- RequestQueueV2
- EnqueueStrategy
- EventType
- LogLevel
- RequestState
- Configuration
- CriticalError
- ErrorSnapshotter
- ErrorTracker
- EventManager
- GotScrapingHttpClient
- LocalEventManager
- Log
- Logger
- LoggerJson
- LoggerText
- NonRetryableError
- RequestHandlerResult
- RequestProvider
- RetryRequestError
- Router
- SessionError
- SitemapRequestList
- AddRequestsBatchedOptions
- AddRequestsBatchedResult
- AutoscaledPoolOptions
- BaseHttpClient
- BaseHttpResponseData
- ClientInfo
- ConfigurationOptions
- Cookie
- CrawlingContext
- CreateSession
- DatasetConsumer
- DatasetContent
- DatasetDataOptions
- DatasetExportOptions
- DatasetExportToOptions
- DatasetIteratorOptions
- DatasetMapper
- DatasetOptions
- DatasetReducer
- EnqueueLinksOptions
- ErrnoException
- ErrorTrackerOptions
- FinalStatistics
- HttpRequest
- HttpRequestOptions
- HttpResponse
- IRequestList
- IStorage
- KeyConsumer
- KeyValueStoreIteratorOptions
- KeyValueStoreOptions
- LoggerOptions
- PersistenceOptions
- ProxyConfigurationFunction
- ProxyConfigurationOptions
- ProxyInfo
- PushErrorMessageOptions
- QueueOperationInfo
- RecordOptions
- RequestListOptions
- RequestListState
- RequestOptions
- RequestProviderOptions
- RequestQueueOperationOptions
- RequestQueueOptions
- RequestTransform
- ResponseLike
- ResponseTypes
- RestrictedCrawlingContext
- RouterHandler
- SessionOptions
- SessionPoolOptions
- SessionState
- SitemapRequestListOptions
- SnapshotResult
- SnapshotterOptions
- StatisticPersistedState
- StatisticState
- StatisticsOptions
- StorageClient
- StorageManagerOptions
- StreamingHttpResponse
- SystemInfo
- SystemStatusOptions
- TieredProxy
- UseStateOptions
- EventTypeName
- GetUserDataFromRequest
- GlobInput
- GlobObject
- LoadedRequest
- PseudoUrlInput
- PseudoUrlObject
- RedirectHandler
- RegExpInput
- RegExpObject
- RequestListSourcesFunction
- RouterRoutes
- Source
- UrlPatternObject
- BLOCKED_STATUS_CODES
- MAX_POOL_SIZE
- PERSIST_STATE_KEY
- log
- checkStorageAccess
- enqueueLinks
- filterRequestsByPatterns
- processHttpRequestOptions
- purgeDefaultStorages
- tryAbsoluteURL
- useState
- withCheckedStorageAccess
Other
RequestQueueV2
EventTypeName
GetUserDataFromRequest
Type parameters
- T
GlobInput
GlobObject
LoadedRequest
Type parameters
- R: Request
PseudoUrlInput
PseudoUrlObject
RedirectHandler
Type declaration
Parameters
redirectResponse: BaseHttpResponseData
updatedRequest: { headers: SimpleHeaders; url?: string | URL }
headers: SimpleHeaders
optional url: string | URL
Returns void
RegExpInput
RegExpObject
RequestListSourcesFunction
Type declaration
Returns Promise<RequestListSource[]>
RouterRoutes
Type parameters
- Context
- UserData: Dictionary
RedirectHandler: type of a function called when an HTTP redirect takes place. It is allowed to mutate the updatedRequest argument.
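As a rough illustration of the RedirectHandler type declaration listed earlier (two parameters, `redirectResponse` and `updatedRequest`, returning void), the sketch below mutates `updatedRequest` in place. It does not import Crawlee: `SimpleHeaders`, `RedirectResponseLike`, and `UpdatedRequest` are simplified local stand-ins, the `url` and `statusCode` fields on the response are assumptions, and `stripAuthOnCrossHostRedirect` is a hypothetical handler name. It also assumes header names are lowercase.

```typescript
// Local, simplified stand-ins for the types named in the declaration.
// These are assumptions so the sketch compiles without installing anything.
type SimpleHeaders = Record<string, string | string[] | undefined>;

interface RedirectResponseLike {
    url: string; // assumed: URL of the response that triggered the redirect
    statusCode: number; // assumed field
    headers: SimpleHeaders;
}

interface UpdatedRequest {
    headers: SimpleHeaders;
    url?: string | URL;
}

type RedirectHandlerSketch = (
    redirectResponse: RedirectResponseLike,
    updatedRequest: UpdatedRequest,
) => void;

// Hypothetical handler: drop the authorization header when the redirect
// target is on a different host than the response that issued the redirect.
const stripAuthOnCrossHostRedirect: RedirectHandlerSketch = (redirectResponse, updatedRequest) => {
    if (updatedRequest.url === undefined) return;
    const from = new URL(redirectResponse.url);
    // Resolve relative targets against the redirecting URL.
    const to = new URL(updatedRequest.url.toString(), from);
    if (from.host !== to.host) {
        delete updatedRequest.headers['authorization'];
    }
};

const exampleRequest: UpdatedRequest = {
    headers: { authorization: 'Bearer example-token', accept: 'text/html' },
    url: 'https://other.example.org/landing',
};

stripAuthOnCrossHostRedirect(
    { url: 'https://example.com/start', statusCode: 302, headers: {} },
    exampleRequest,
);

console.log(exampleRequest.headers);
```

The handler returns nothing. Its only effect is the mutation of `updatedRequest`, which is the behaviour the description permits.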