Changelog
All notable changes to this project will be documented in this file. See Conventional Commits for commit guidelines.
3.12.0 (2024-11-04)
Bug Fixes
Features
3.11.5 (2024-10-04)
Bug Fixes
- `forefront` request fetching in RQv2 (#2689) (03951bd), closes #2669
- core: accept `UInt8Array` in `KVS.setValue()` (#2682) (8ef0e60)
- decode special characters in proxy `username` and `password` (#2696) (0f0fcc5)
3.11.4 (2024-09-23)
Bug Fixes
- `SitemapRequestList.teardown()` doesn't break `persistState` calls (#2673) (fb2c5cd), closes /github.com/apify/crawlee/blob/f3eb99d9fa9a7aa0ec1dcb9773e666a9ac14fb76/packages/core/src/storages/sitemap_request_list.ts#L446 #2672
3.11.3 (2024-09-03)
Bug Fixes
- RequestQueueV2: reset recently handled cache too if the queue is pending for too long (#2656) (51a69bc)
3.11.2 (2024-08-28)
Bug Fixes
Features
- `globs` & `regexps` for `SitemapRequestList` (#2631) (b5fd3a9)
- resilient sitemap loading (#2619) (1dd7660)
3.11.1 (2024-07-24)
Note: Version bump only for package @crawlee/core
3.11.0 (2024-07-09)
Features
3.10.5 (2024-06-12)
Bug Fixes
3.10.4 (2024-06-11)
Bug Fixes
- add `waitForAllRequestsToBeAdded` option to `enqueueLinks` helper (925546b), closes #2318
- respect `crawler.log` when creating child logger for `Statistics` (0a0d75d), closes #2412
3.10.3 (2024-06-07)
Bug Fixes
- respect implicit router when no `requestHandler` is provided in `AdaptiveCrawler` (#2518) (31083aa)
- revert the scaling steps back to 5% (5bf32f8)
Features
3.10.2 (2024-06-03)
Note: Version bump only for package @crawlee/core
3.10.1 (2024-05-23)
Bug Fixes
- investigate and temp fix for possible 0-concurrency bug in RQv2 (#2494) (4ebe820)
- provide URLs to the error snapshot (#2482) (7f64145), closes /github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L25
3.10.0 (2024-05-16)
Bug Fixes
- `EnqueueStrategy.All` erroring with links using unsupported protocols (#2389) (8db3908)
- core: conversion between tough cookies and browser pool cookies (#2443) (74f73ab)
- core: fire local `SystemInfo` events every second (#2454) (1fa9a66)
- core: use `createSessionFunction` when loading `Session` from persisted state (#2444) (3c56b4c)
- double tier decrement in tiered proxy (#2468) (3a8204b)
Features
- implement `ErrorSnapshotter` for error context capture (#2332) (e861dfd), closes #2280
- make `RequestQueue` v2 the default queue, see more on Apify blog (#2390) (41ae8ab), closes #2388
Performance Improvements
- improve scaling based on memory (#2459) (2d5d443)
- optimize `RequestList` memory footprint (#2466) (12210bd)
- optimize adding large amount of requests via `crawler.addRequests()` (#2456) (6da86a8)
3.9.2 (2024-04-17)
Bug Fixes
3.9.1 (2024-04-11)
Note: Version bump only for package @crawlee/core
3.9.0 (2024-04-10)
Bug Fixes
- include actual key in error message of KVS' `setValue` (#2411) (9089bf1)
- notify autoscaled pool about newly added requests (#2400) (a90177d)
Features
- `createAdaptivePlaywrightRouter` utility (#2415) (cee4778), closes #2407
- `tieredProxyUrls` for `ProxyConfiguration` (#2348) (5408c7f)
- better `newUrlFunction` for `ProxyConfiguration` (#2392) (330598b), closes #2348 #2065
3.8.2 (2024-03-21)
Bug Fixes
- core: solve possible dead locks in `RequestQueueV2` (#2376) (ffba095)
- use 0 (number) instead of false as default for `sessionRotationCount` (#2372) (667a3e7)
Features
- implement global storage access checking and use it to prevent unwanted side effects in adaptive crawler (#2371) (fb3b7da), closes #2364
3.8.1 (2024-02-22)
Bug Fixes
3.8.0 (2024-02-21)
Bug Fixes
Features
- `KeyValueStore.recordExists()` (#2339) (8507a65)
- accessing crawler state, key-value store and named datasets via crawling context (#2283) (58dd5fc)
- adaptive playwright crawler (#2316) (8e4218a)
3.7.3 (2024-01-30)
Bug Fixes
3.7.2 (2024-01-09)
Bug Fixes
3.7.1 (2024-01-02)
Note: Version bump only for package @crawlee/core
3.7.0 (2023-12-21)
Bug Fixes
- `retryOnBlocked` doesn't override the blocked HTTP codes (#2243) (81672c3)
- filter out empty globs (#2205) (41322ab), closes #2200
- make SessionPool queue up getSession calls to prevent overruns (#2239) (0f5665c), closes #1667
Features
- allow configuring crawler statistics (#2213) (9fd60e4), closes #1789
- check enqueue link strategy post redirect (#2238) (3c5f9d6), closes #2173
3.6.2 (2023-11-26)
Bug Fixes
3.6.1 (2023-11-15)
Bug Fixes
- ts: specify type explicitly for logger (aec3550)
3.6.0 (2023-11-15)
Bug Fixes
- add `skipNavigation` option to `enqueueLinks` (#2153) (118515d)
- core: respect some advanced options for `RequestList.open()` + improve docs (#2158) (c5a1b07)
- declare missing dependency on got-scraping in the core package (cd2fd4d)
- retry incorrect Content-Type when response has blocked status code (#2176) (b54fb8b), closes #1994
Features
3.5.8 (2023-10-17)
Note: Version bump only for package @crawlee/core
3.5.7 (2023-10-05)
Bug Fixes
3.5.6 (2023-10-04)
Bug Fixes
3.5.5 (2023-10-02)
Bug Fixes
- session pool leaks memory on multiple crawler runs (#2083) (b96582a), closes #2074 #2031
- types: make return type of RequestProvider.open and RequestQueue(v2).open strict and accurate (#2096) (dfaddb9)
Features
3.5.4 (2023-09-11)
Bug Fixes
- core: allow explicit calls to `purgeDefaultStorage` to wipe the storage on each call (#2060) (4831f07)
- various helpers opening KVS now respect Configuration (#2071) (59dbb16)
3.5.3 (2023-08-31)
Bug Fixes
- browser-pool: improve error handling when browser is not found (#2050) (282527f), closes #1459
- crawler instances with different StorageClients do not affect each other (#2056) (3f4c863)
- pin all internal dependencies (#2041) (d6f2b17), closes #2040
Features
3.5.2 (2023-08-21)
Bug Fixes
3.5.1 (2023-08-16)
Bug Fixes
- add `Request.maxRetries` to the `RequestOptions` interface (#2024) (6433821)
- log original error message on session rotation (#2022) (8a11ffb)
3.5.0 (2023-07-31)
Bug Fixes
- core: add requests from URL list (`requestsFromUrl`) to the queue in batches (418fbf8), closes #1995
- core: support relative links in `enqueueLinks` explicitly provided via `urls` option (#2014) (cbd9d08), closes #2005
Features
- core: use `RequestQueue.addBatchedRequests()` in `enqueueLinks` helper (4d61ca9), closes #1995
- retire session on proxy error (#2002) (8c0928b), closes #1912
3.4.2 (2023-07-19)
Features
3.4.1 (2023-07-13)
Bug Fixes
- http-crawler: replace `IncomingMessage` with `PlainResponse` for context's `response` (#1973) (2a1cc7f), closes #1964
3.4.0 (2023-06-12)
Features
- add LinkeDOMCrawler (#1907) (1c69560), closes /github.com/apify/crawlee/pull/1890#issuecomment-1533271694
3.3.3 (2023-05-31)
Features
- add support for `requestsFromUrl` to `RequestQueue` (#1917) (7f2557c)
- core: add `Request.maxRetries` to allow overriding the `maxRequestRetries` (#1925) (c5592db)
3.3.2 (2023-05-11)
Bug Fixes
Features
- allow running single crawler instance multiple times (#1844) (9e6eb1e), closes #765
- router: allow inline router definition (#1877) (2d241c9)
- support alternate storage clients when opening storages (#1901) (661e550)
3.3.1 (2023-04-11)
Bug Fixes
- Storage: queue up opening storages to prevent issues in concurrent calls (#1865) (044c740)
- try to detect stuck request queue and fix its state (#1837) (95a9f94)
3.3.0 (2023-03-09)
Bug Fixes
Features
3.2.2 (2023-02-08)
Note: Version bump only for package @crawlee/core
3.2.1 (2023-02-07)
Bug Fixes
- add `QueueOperationInfo` export to the core package (5ec6c24)
3.2.0 (2023-02-07)
Bug Fixes
- clone `request.userData` when creating new request object (#1728) (222ef59), closes #1725
- declare missing dependency on `tslib` (27e96c8), closes #1747
- ensure `CrawlingContext` interface is inferred correctly in route handlers (aa84633)
- utils: add missing dependency on `ow` (bf0e03c), closes #1716
Features
- enqueueLinks: add SameOrigin strategy and relax protocol matching for the other strategies (#1748) (4ba982a)
3.1.3 (2022-12-07)
Note: Version bump only for package @crawlee/core
3.1.2 (2022-11-15)
Bug Fixes
- injectJQuery in context does not survive navs (#1661) (493a7cf)
- make router error message more helpful for undefined routes (#1678) (ab359d8)
- MemoryStorage: correctly respect the desc option (#1666) (b5f37f6)
- requestHandlerTimeout timing (#1660) (493ea0c)
- shallow clone browserPoolOptions before normalization (#1665) (22467ca)
- support headful mode in playwright js project template (ea2e61b)
- support headful mode in puppeteer js project template (e6aceb8)
Features
3.1.1 (2022-11-07)
Bug Fixes
- `utils.playwright.blockRequests` warning message (#1632) (76549eb)
- concurrency option override order (#1649) (7bbad03)
- handle non-error objects thrown gracefully (#1652) (c3a4e1a)
- mark session as bad on failed requests (#1647) (445ae43)
- support reloading of sessions with lots of retries (ebc89d2)
- fix type errors when `playwright` is not installed (#1637) (de9db0c)
- upgrade to puppeteer@19.x (#1623) (ce36d6b)
Features
- add static `set` and `useStorageClient` shortcuts to `Configuration` (2e66fa2)
- enable migration testing (#1583) (ee3a68f)
- playwright: disable animations when taking screenshots (#1601) (4e63034)
3.1.0 (2022-10-13)
Bug Fixes
- add overload for `KeyValueStore.getValue` with `defaultValue` (#1541) (e3cb509)
- add retry attempts to methods in CLI (#1588) (9142e59)
- allow `label` in `enqueueLinksByClickingElements` options (#1525) (18b7c25)
- basic-crawler: handle `request.noRetry` after `errorHandler` (#1542) (2a2040e)
- build storage classes by using `this` instead of the class (#1596) (2b14eb7)
- correct some typing exports (#1527) (4a136e5)
- do not hide stack trace of (retried) Type/Syntax/ReferenceErrors (469b4b5)
- enqueueLinks: ensure the enqueue strategy is respected alongside user patterns (#1509) (2b0eeed)
- enqueueLinks: prevent useless request creations when filtering by user patterns (#1510) (cb8fe36)
- export `Cookie` from `crawlee` metapackage (7b02ceb)
- handle redirect cookies (#1521) (2f7fc7c)
- http-crawler: do not hang on POST without payload (#1546) (8c87390)
- remove undeclared dependency on core package from puppeteer utils (827ae60)
- support TypeScript 4.8 (#1507) (4c3a504)
- wait for persist state listeners to run when event manager closes (#1481) (aa550ed)
Features
- add `Dataset.exportToValue` (#1553) (acc6344)
- add `Dataset.getData()` shortcut (522ed6e)
- add `utils.downloadListOfUrls` to crawlee metapackage (7b33b0a)
- add `utils.parseOpenGraph()` (#1555) (059f85e)
- add `utils.playwright.compileScript` (#1559) (2e14162)
- add `utils.playwright.infiniteScroll` (#1543) (60c8289), closes #1528
- add `utils.playwright.saveSnapshot` (#1544) (a4ceef0)
- add global `useState` helper (#1551) (2b03177)
- add static `Dataset.exportToValue` (#1564) (a7c17d4)
- allow disabling storage persistence (#1539) (f65e3c6)
- bump puppeteer support to 17.x (#1519) (b97a852)
- core: add `forefront` option to `enqueueLinks` helper (f8755b6), closes #1595
- don't close page before calling errorHandler (#1548) (1c8cd82)
- enqueue links by clicking for Playwright (#1545) (3d25ade)
- error tracker (#1467) (6bfe1ce)
- make the CLI download directly from GitHub (#1540) (3ff398a)
- router: add userdata generic to addHandler (#1547) (19cdf13)
- use JSON5 for `INPUT.json` to support comments (#1538) (09133ff)
3.0.4 (2022-08-22)
Features
- bump puppeteer support to 15.1
Bug Fixes
- key value stores emitting an error when multiple write promises ran in parallel (#1460) (f201cca)
- fix dockerfiles in project templates
3.0.3 (2022-08-11)
Fixes
- add missing configuration to CheerioCrawler constructor (#1432)
- sendRequest types (#1445)
- respect `headless` option in browser crawlers (#1455)
- make `CheerioCrawlerOptions` type more loose (d871d8c)
- improve dockerfiles and project templates (7c21a64)
Features
- add `utils.playwright.blockRequests()` (#1447)
- http-crawler (#1440)
- prefer `/INPUT.json` files for `KeyValueStore.getInput()` (#1453)
- jsdom-crawler (#1451)
- add `RetryRequestError` + add error to the context for BC (#1443)
- add `keepAlive` to crawler options (#1452)
3.0.2 (2022-07-28)
Fixes
- regression in resolving the base url for enqueue link filtering (1422)
- improve file saving on memory storage (1421)
- add `UserData` type argument to `CheerioCrawlingContext` and related interfaces (1424)
- always limit `desiredConcurrency` to the value of `maxConcurrency` (bcb689d)
- wait for storage to finish before resolving `crawler.run()` (9d62d56)
- using explicitly typed router with `CheerioCrawler` (07b7e69)
- declare dependency on `ow` in `@crawlee/cheerio` package (be59f99)
- use `crawlee@^3.0.0` in the CLI templates (6426f22)
- fix building projects with TS when puppeteer and playwright are not installed (1404)
- enqueueLinks should respect full URL of the current request for relative link resolution (1427)
- use `desiredConcurrency: 10` as the default for `CheerioCrawler` (1428)
Features
- feat: allow configuring what status codes will cause session retirement (1423)
- feat: add support for middlewares to the `Router` via `use` method (1431)
3.0.1 (2022-07-26)
Fixes
- remove `JSONData` generic type arg from `CheerioCrawler` in (#1402)
- rename default storage folder to just `storage` in (#1403)
- remove trailing slash for proxyUrl in (#1405)
- run browser crawlers in headless mode by default in (#1409)
- rename interface `FailedRequestHandler` to `ErrorHandler` in (#1410)
- ensure default route is not ignored in `CheerioCrawler` in (#1411)
- add `headless` option to `BrowserCrawlerOptions` in (#1412)
- processing custom cookies in (#1414)
- enqueue link not finding relative links if the checked page is redirected in (#1416)
- fix building projects with TS when puppeteer and playwright are not installed in (#1404)
- calling `enqueueLinks` in browser crawler on page without any links in (385ca27)
- improve error message when no default route provided in (04c3b6a)
Features
- feat: add parseWithCheerio for puppeteer & playwright in (#1418)
3.0.0 (2022-07-13)
This section summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3.
Crawlee vs Apify SDK
Up until version 3 of `apify`, the package contained both scraping related tools and Apify platform related helper methods. With v3 we are splitting the whole project into two main parts:
- Crawlee, the new web-scraping library, available as `crawlee` package on NPM
- Apify SDK, helpers for the Apify platform, available as `apify` package on NPM
Moreover, the Crawlee library is published as several packages under the `@crawlee` namespace:
- `@crawlee/core`: the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes
- `@crawlee/basic`: exports `BasicCrawler`
- `@crawlee/cheerio`: exports `CheerioCrawler`
- `@crawlee/browser`: exports `BrowserCrawler` (which is used for creating `@crawlee/playwright` and `@crawlee/puppeteer`)
- `@crawlee/playwright`: exports `PlaywrightCrawler`
- `@crawlee/puppeteer`: exports `PuppeteerCrawler`
- `@crawlee/memory-storage`: `@apify/storage-local` alternative
- `@crawlee/browser-pool`: previously `browser-pool` package
- `@crawlee/utils`: utility methods
- `@crawlee/types`: holds TS interfaces mainly about the `StorageClient`
Installing Crawlee
As Crawlee is not yet released as `latest`, we need to install from the `next` distribution tag!
Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`.
```bash
npm install crawlee@next
```
Or if all we need is cheerio support, we can install only `@crawlee/cheerio`:

```bash
npm install @crawlee/cheerio@next
```
When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used.

```bash
npm install crawlee@next playwright
# or npm install @crawlee/playwright@next playwright
```
Alternatively we can also use the `crawlee` meta-package, which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes.
Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser-related utilities can also be found in the crawler packages (e.g. `@crawlee/playwright`).
Full TypeScript support
Both Crawlee and Apify SDK are full TypeScript rewrites, so they include up-to-date types in the package. For your TypeScript crawlers we recommend using our predefined TypeScript configuration from the `@apify/tsconfig` package. Don't forget to set the `module` and `target` to `ES2022` or above to be able to use top-level await.
The `@apify/tsconfig` config has `noImplicitAny` enabled; you might want to disable it during the initial development, as it will cause build failures if you leave some unused local variables in your code.
```json
{
    "extends": "@apify/tsconfig",
    "compilerOptions": {
        "module": "ES2022",
        "target": "ES2022",
        "outDir": "dist",
        "lib": ["DOM"]
    },
    "include": [
        "./src/**/*"
    ]
}
```
Docker build
For the `Dockerfile` we recommend using a multi-stage build, so you don't install dev dependencies like TypeScript in your final image:
```dockerfile
# using multistage build, as we need dev deps to build the TS source code
FROM apify/actor-node:16 AS builder

# copy all files, install all dependencies (including dev deps) and build the project
COPY . ./
RUN npm install --include=dev \
    && npm run build

# create final image
FROM apify/actor-node:16

# copy only necessary files from the builder stage
COPY --from=builder /usr/src/app/package*.json ./
COPY --from=builder /usr/src/app/README.md ./
COPY --from=builder /usr/src/app/dist ./dist
COPY --from=builder /usr/src/app/apify.json ./apify.json
COPY --from=builder /usr/src/app/INPUT_SCHEMA.json ./INPUT_SCHEMA.json

# install only prod deps
RUN npm --quiet set progress=false \
    && npm install --only=prod --no-optional \
    && echo "Installed NPM packages:" \
    && (npm list --only=prod --no-optional --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

# run compiled code
CMD npm run start:prod
```
Browser fingerprints
Previously we had a magical `stealth` option in the puppeteer crawler that enabled several tricks aiming to mimic the real users as much as possible. While this worked to a certain degree, we decided to replace it with generated browser fingerprints.

In case we don't want to have dynamic fingerprints, we can disable this behaviour via `useFingerprints` in `browserPoolOptions`:
```ts
const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: false,
    },
});
```
Session cookie method renames
Previously, if we wanted to get or add cookies for the session that would be used for the request, we had to call `session.getPuppeteerCookies()` or `session.setPuppeteerCookies()`. Since these methods could be used with any of our crawlers, not just `PuppeteerCrawler`, they have been renamed to `session.getCookies()` and `session.setCookies()` respectively. Otherwise, their usage is exactly the same!
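The rename is purely mechanical. A minimal sketch, using a hypothetical stub in place of the real Crawlee `Session` class, to illustrate the renamed surface:

```javascript
// Illustrative stub only -- the real Session class lives in @crawlee/core.
// It demonstrates the renamed pair: getCookies()/setCookies() replace
// getPuppeteerCookies()/setPuppeteerCookies(), with the same call shape.
class StubSession {
    constructor() {
        this.cookieJar = new Map(); // url -> cookie array
    }

    // new, crawler-agnostic name (formerly setPuppeteerCookies)
    setCookies(cookies, url) {
        this.cookieJar.set(url, cookies);
    }

    // new, crawler-agnostic name (formerly getPuppeteerCookies)
    getCookies(url) {
        return this.cookieJar.get(url) ?? [];
    }
}

const session = new StubSession();
session.setCookies([{ name: 'token', value: 'abc' }], 'https://example.com');
console.log(session.getCookies('https://example.com'));
// -> [ { name: 'token', value: 'abc' } ]
```

Existing calls migrate by renaming only; arguments and return values stay the same.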
Memory storage
When we store some data or intermediate state (like the one `RequestQueue` holds), we now use `@crawlee/memory-storage` by default. It is an alternative to `@apify/storage-local` that stores the state in memory (as opposed to the SQLite database used by `@apify/storage-local`). While the state is stored in memory, it is also dumped to the file system, so we can observe it, and it respects the existing data stored in the KeyValueStore (e.g. the `INPUT.json` file).
When we want to run the crawler on the Apify platform, we need to use `Actor.init` or `Actor.main`, which will automatically switch the storage client to `ApifyClient` when on the Apify platform.
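Conceptually, the switch boils down to an environment check. The sketch below is an assumption for illustration (the env var check and string return values are not the SDK's actual internals):

```javascript
// Hypothetical sketch of how Actor.init() might pick a storage client.
// The real decision logic lives inside the apify package.
function pickStorageClient(env) {
    // on the Apify platform, talk to the remote storage API
    if (env.APIFY_IS_AT_HOME) return 'ApifyClient';
    // locally, default to in-memory storage with file system dumps
    return 'MemoryStorage';
}

console.log(pickStorageClient({ APIFY_IS_AT_HOME: '1' })); // -> 'ApifyClient'
console.log(pickStorageClient({})); // -> 'MemoryStorage'
```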
We can still use `@apify/storage-local`; to do so, first install it and then pass it to the `Actor.init` or `Actor.main` options:
`@apify/storage-local` v2.1.0+ is required for Crawlee.
```ts
import { Actor } from 'apify';
import { ApifyStorageLocal } from '@apify/storage-local';

const storage = new ApifyStorageLocal(/* options like `enableWalMode` belong here */);
await Actor.init({ storage });
```
Purging of the default storage
Previously the state was preserved between local runs, and we had to use the `--purge` argument of the `apify-cli`. With Crawlee, this is now the default behaviour; we purge the storage automatically on the `Actor.init/main` call. We can opt out of it via `purge: false` in the `Actor.init` options.
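The opt-out amounts to a single option with a `true` default; a hedged sketch of the described semantics (not the SDK's code):

```javascript
// Hypothetical helper mirroring the behaviour described above: purging is
// the default, and only an explicit `purge: false` disables it.
function shouldPurgeDefaultStorage(initOptions = {}) {
    return initOptions.purge ?? true;
}

console.log(shouldPurgeDefaultStorage());                 // -> true
console.log(shouldPurgeDefaultStorage({ purge: false })); // -> false
```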
Renamed crawler options and interfaces
Some options were renamed to better reflect what they do. We still support all the old parameter names too, but not at the TS level.
- `handleRequestFunction` -> `requestHandler`
- `handlePageFunction` -> `requestHandler`
- `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs`
- `handlePageTimeoutSecs` -> `requestHandlerTimeoutSecs`
- `requestTimeoutSecs` -> `navigationTimeoutSecs`
- `handleFailedRequestFunction` -> `failedRequestHandler`
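Since the old names still work at runtime, the renames above amount to a mechanical lookup. A hypothetical migration helper (not part of Crawlee) makes the mapping concrete:

```javascript
// Mapping of legacy (Apify SDK v2) option names to their Crawlee v3
// equivalents, exactly as listed above.
const RENAMED_OPTIONS = {
    handleRequestFunction: 'requestHandler',
    handlePageFunction: 'requestHandler',
    handleRequestTimeoutSecs: 'requestHandlerTimeoutSecs',
    handlePageTimeoutSecs: 'requestHandlerTimeoutSecs',
    requestTimeoutSecs: 'navigationTimeoutSecs',
    handleFailedRequestFunction: 'failedRequestHandler',
};

// rewrite a crawler options object from the old names to the new ones
function migrateOptions(options) {
    const migrated = {};
    for (const [key, value] of Object.entries(options)) {
        migrated[RENAMED_OPTIONS[key] ?? key] = value;
    }
    return migrated;
}

console.log(migrateOptions({ handlePageFunction: 'fn', maxConcurrency: 5 }));
// -> { requestHandler: 'fn', maxConcurrency: 5 }
```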
We also renamed the crawling context interfaces, so they follow the same convention and are more meaningful:
- `CheerioHandlePageInputs` -> `CheerioCrawlingContext`
- `PlaywrightHandlePageFunction` -> `PlaywrightCrawlingContext`
- `PuppeteerHandlePageFunction` -> `PuppeteerCrawlingContext`