Changelog
All notable changes to this project will be documented in this file.
1.0.5 - not yet released
Features
- Add chrome BrowserType for PlaywrightCrawler to use the Chrome browser (#1487) (b06937b) by @Mantisus, closes #1071
1.0.4 (2025-10-24)
Bug Fixes
- Respect enqueue_strategy in enqueue_links (#1505) (6ee04bc) by @Mantisus, closes #1504
- Exclude incorrect links before checking robots.txt (#1502) (3273da5) by @Mantisus, closes #1499
- Resolve compatibility issue between SqlStorageClient and AdaptivePlaywrightCrawler (#1496) (ce172c4) by @Mantisus, closes #1495
- Fix BasicCrawler statistics persistence (#1490) (1eb1c19) by @Pijukatel, closes #1501
- Save context state in result for AdaptivePlaywrightCrawler after isolated processing in SubCrawler (#1488) (62b7c70) by @Mantisus, closes #1483
1.0.3 (2025-10-17)
Bug Fixes
- Add support for Pydantic v2.12 (#1471) (35c1108) by @Mantisus, closes #1464
- Fix database version warning message (#1485) (18a545e) by @Mantisus
- Fix reclaim_request in SqlRequestQueueClient to correctly update the request state (#1486) (1502469) by @Mantisus, closes #1484
- Fix KeyValueStore.auto_saved_value failing in some scenarios (#1438) (b35dee7) by @Pijukatel, closes #1354
1.0.2 (2025-10-08)
Bug Fixes
- Use Self type in the open() method of storage clients (#1462) (4ec6f6c) by @janbuchar
- Add storages name validation (#1457) (84de11a) by @Mantisus, closes #1434
- Pin pydantic version to <2.12.0 to avoid compatibility issues (#1467) (f11b86f) by @vdusek
1.0.1 (2025-10-06)
Bug Fixes
- Fix memory leak in PlaywrightCrawler on browser context creation (#1446) (bb181e5) by @Pijukatel, closes #1443
- Update templates to handle optional httpx client (#1440) (c087efd) by @Pijukatel
1.0.0 (2025-09-29)
- Check out the Release blog post for more details.
- Check out the Upgrading guide to ensure a smooth update.
Features
- Add utility for load and parse Sitemap and SitemapRequestLoader (#1169) (66599f8) by @Mantisus, closes #1161
- Add periodic status logging and status_message_callback parameter for customization (#1265) (b992fb2) by @Mantisus, closes #96
- Add crawlee-cli option to skip project installation (#1294) (4d5aef0) by @Pijukatel, closes #1122
- Improve Crawlee CLI help text (#1297) (afbe10f) by @Pijukatel, closes #1295
- Add basic OpenTelemetry instrumentation (#1255) (a92d8b3) by @Pijukatel, closes #1254
- Add ImpitHttpClient, an HTTP client using the impit library (#1151) (0d0d268) by @Mantisus
- Prevent overloading system memory when running locally (#1270) (30de3bd) by @janbuchar, closes #1232
- Expose PlaywrightPersistentBrowser class (#1314) (b5fa955) by @Mantisus
- Add impit option for Crawlee CLI (#1312) (508d7ce) by @Mantisus
- Persist RequestList state (#1274) (cc68014) by @janbuchar, closes #99
- Persist DefaultRenderingTypePredictor state (#1340) (fad4c25) by @Mantisus, closes #1272
- Persist the SitemapRequestLoader state (#1347) (27ef9ad) by @Mantisus, closes #1269
- Add support for NDU storages (#1401) (5dbd212) by @vdusek, closes #1175
- Add RQ id, name, alias args to add_requests and enqueue_links methods (#1413) (1cae2bc) by @Mantisus, closes #1402
- Add SqlStorageClient based on sqlalchemy v2+ (#1339) (07c75a0) by @Mantisus, closes #307
Bug Fixes
- Fix memory estimation not working on MacOS (#1330) (ab020eb) by @Pijukatel, closes #1329
- Fix retry count to not count the original request (#1328) (74fa1d9) by @Pijukatel, closes #1326
- [breaking] Remove unused "stats" field from RequestQueueMetadata (#1331) (0a63bef) by @vdusek
- Ignore unknown parameters passed in cookies (#1336) (50d3ef7) by @Mantisus, closes #1333
- Fix timeout for stream method in ImpitHttpClient (#1352) (54b693b) by @Mantisus
- Include reason in the session rotation warning logs (#1363) (d6d7a45) by @vdusek, closes #1318
- Improve crawler statistics logging (#1364) (1eb6da5) by @vdusek, closes #1317
- Do not add a request that is already in progress to MemoryRequestQueueClient (#1384) (3af326c) by @Mantisus, closes #1383
- Save RequestQueueState for FileSystemRequestQueueClient in default KVS (#1411) (6ee60a0) by @Mantisus, closes #1410
- Set default desired concurrency for non-browser crawlers to 10 (#1419) (1cc9401) by @vdusek
Refactor
- [breaking] Introduce new storage client system (#1194) (de1c03f) by @vdusek, closes #92, #147, #783, #1247
- [breaking] Split BrowserType literal into two different literals based on context (#1070) (72b5698) by @Pijukatel
- [breaking] Change method HttpResponse.read from sync to async (#1296) (83fa8a4) by @Mantisus
- [breaking] Replace HttpxHttpClient with ImpitHttpClient as default HTTP client (#1307) (c803a97) by @Mantisus, closes #1079
- [breaking] Change Dataset unwind parameter to accept list of strings (#1357) (862a203) by @vdusek
- [breaking] Remove Request.id field (#1366) (32f3580) by @Pijukatel, closes #1358
- [breaking] Refactor storage creation and caching, configuration and services (#1386) (04649bd) by @Pijukatel, closes #1379
0.6.12 (2025-07-30)
Features
Bug Fixes
- Use perf_counter_ns for request duration tracking (#1260) (9e92f6b) by @Pijukatel, closes #1256
- Fix memory estimation not working on MacOS (#1330) (8558954) by @Pijukatel, closes #1329
- Fix retry count to not count the original request (#1328) (1aff3aa) by @Pijukatel, closes #1326
- Ignore unknown parameters passed in cookies (#1336) (0f2610c) by @Mantisus, closes #1333
0.6.11 (2025-06-23)
Features
Bug Fixes
- Fix ClientSnapshot overload calculation (#1228) (a4fc1b6) by @Pijukatel, closes #1207
- Use PSS instead of RSS to estimate children process memory usage on Linux (#1210) (436032f) by @Pijukatel, closes #1206
- Do not raise an error when checking 'same-domain' if there is no hostname in the URL (#1251) (a6c3aab) by @Mantisus
0.6.10 (2025-06-02)
Bug Fixes
- Allow config change on PlaywrightCrawler (#1186) (f17bf31) by @mylank, closes #1185
- Add payload to SendRequestFunction to support POST request (#1202) (e7449f2) by @Mantisus
- Fix match check for specified enqueue strategy for requests with redirect (#1199) (d84c30c) by @Mantisus, closes #1198
- Set WindowsSelectorEventLoopPolicy only for curl-impersonate template without playwright (#1209) (f3b839f) by @Mantisus, closes #1204
- Add support for non-GET requests for PlaywrightCrawler (#1208) (dbb9f44) by @Mantisus, closes #1201
- Respect EnqueueLinksKwargs for extract_links function (#1213) (c9907d6) by @Mantisus, closes #1212
0.6.9 (2025-05-02)
Features
- Add an internal HttpClient to be used in send_request for PlaywrightCrawler using APIRequestContext bound to the browser context (#1134) (e794f49) by @Mantisus, closes #928
- Make timeout error log cleaner (#1170) (78ea9d2) by @Pijukatel, closes #1158
- Add on_skipped_request decorator, to process links skipped according to robots.txt rules (#1166) (bd16f14) by @Mantisus, closes #1160
Bug Fixes
- Fix handling of errors without args in _get_error_message for ErrorTracker (#1181) (21944d9) by @Mantisus, closes #1179
- Temporarily add certifi<=2025.1.31 dependency (#1183) (25ff961) by @Pijukatel
0.6.8 (2025-04-25)
Features
- Handle unprocessed requests in add_requests_batched (#1159) (7851175) by @Pijukatel, closes #456
- Add respect_robots_txt_file option (#1162) (c23f365) by @Mantisus
Bug Fixes
- Update UnprocessedRequest to match actual data (#1155) (a15a1f3) by @Pijukatel, closes #1150
- Fix the order in which cookies are saved to the SessionCookies and the handler is executed for PlaywrightCrawler (#1163) (82ff69a) by @Mantisus
- Call failed_request_handler for SessionError when session rotation count exceeds maximum (#1147) (b3637b6) by @Mantisus
0.6.7 (2025-04-17)
Features
- Add ErrorSnapshotter to ErrorTracker (#1125) (9666092) by @Pijukatel, closes #151
Bug Fixes
- Improve validation errors in Crawlee CLI (#1140) (f2d33df) by @vdusek, closes #1138
- Disable logger propagation to prevent duplicate logs (#1156) (0b3648d) by @vdusek
0.6.6 (2025-04-03)
Features
- Add statistics_log_format parameter to BasicCrawler (#1061) (635ae4a) by @Mantisus, closes #700
- Add Session binding capability via session_id in Request (#1086) (cda7b31) by @Mantisus, closes #1076
- Add requests argument to EnqueueLinksFunction (#1024) (fc8444c) by @Pijukatel
Bug Fixes
- Add port for same-origin strategy check (#1096) (9e24598) by @Mantisus
- Fix handling of loading empty metadata file for queue (#1042) (b00876e) by @Mantisus, closes #1029
- Update favicon (#1114) (eba900f) by @baldasseva
- website: Use correct image source (#1115) (ee7806f) by @baldasseva
0.6.5 (2025-03-13)
Bug Fixes
- Update to browserforge workaround (#1075) (2378cf8) by @Pijukatel
0.6.4 (2025-03-12)
Bug Fixes
- Add a thread check before setting add_signal_handler (#1068) (6983bda) by @Mantisus
- Temporary workaround for browserforge import time code execution (#1073) (17d914f) by @Pijukatel
0.6.3 (2025-03-07)
Features
- Add project template with uv package manager (#1057) (9ec06e5) by @Mantisus, closes #1053
- Use fingerprint generator in PlaywrightCrawler by default (#1060) (09cec53) by @Pijukatel, closes #1054
Bug Fixes
- Update project templates for Poetry v2.x compatibility (#1049) (96dc2f9) by @Mantisus, closes #954
- Remove tmp folder for PlaywrightCrawler in non-headless mode (#1046) (3a7f444) by @Mantisus
0.6.2 (2025-03-05)
Features
- Extend ErrorTracker with error grouping (#1014) (561de5c) by @Pijukatel
0.6.1 (2025-03-03)
Bug Fixes
- Add browserforge to mandatory dependencies (#1044) (ddfbde8) by @Pijukatel
0.6.0 (2025-03-03)
- Check out the Release blog post for more details.
- Check out the Upgrading guide to ensure a smooth update.
Features
- Integrate browserforge fingerprints (#829) (2b156b4) by @Pijukatel, closes #549
- Add AdaptivePlaywrightCrawler (#872) (5ba70b6) by @Pijukatel
- Implement _snapshot_client for Snapshotter (#957) (ba4d384) by @Mantisus, closes #60
- Add adaptive context helpers (#964) (e248f17) by @Pijukatel, closes #249
- [breaking] Enable additional status codes arguments to PlaywrightCrawler (#959) (87cf446) by @Pijukatel, closes #953
- Replace HeaderGenerator implementation by browserforge implementation (#960) (c2f8c93) by @Pijukatel, closes #937
Bug Fixes
- Fix playwright template and dockerfile (#972) (c33b34d) by @janbuchar, closes #969
- Fix installing dependencies via pip in project template (#977) (1e3b8eb) by @janbuchar, closes #975
- Fix default migration storage (#1018) (6a0c4d9) by @Pijukatel, closes #991
- Fix logger name for http based loggers (#1023) (bfb3944) by @Pijukatel, closes #1021
- Remove allow_redirects override in CurlImpersonateHttpClient (#1017) (01d855a) by @2tunnels, closes #1016
- Remove follow_redirects override in HttpxHttpClient (#1015) (88afda3) by @2tunnels, closes #1013
- Fix flaky test_common_headers_and_user_agent (#1030) (58aa70e) by @Pijukatel, closes #1027
Refactor
- [breaking] Remove unused config properties (#978) (4b7fe29) by @vdusek
- [breaking] Remove Base prefix from abstract class names (#980) (8ccb5d4) by @vdusek
- [breaking] Change default incognito context to persistent context for Playwright (#985) (f01520d) by @Mantisus, closes #721, #963
- [breaking] Change Session cookies from dict to SessionCookies with CookieJar (#984) (6523b3a) by @Mantisus, closes #710, #933
- [breaking] Replace enum with literal for EnqueueStrategy (#1019) (d2481ef) by @vdusek
- [breaking] Update status code handling (#1028) (6b59471) by @Mantisus, closes #830, #998
- [breaking] Move cli dependencies to optional dependencies (#1011) (4382959) by @Mantisus, closes #703, #1010
0.5.4 (2025-02-05)
Features
- Add use_incognito_pages support for browser_launch_options in PlaywrightCrawler (#941) (eae3a33) by @Mantisus
Bug Fixes
- Fix session management with retire (#947) (caee03f) by @Mantisus
- Fix templates - poetry-plugin-export version and camoufox template name (#952) (7addea6) by @Pijukatel, closes #951
- Fix conversion of relative links to absolute in enqueue_links for responses with redirect (#956) (694102e) by @Mantisus, closes #955
- Fix CurlImpersonateHttpClient cookies handler (#946) (ed415c4) by @Mantisus
0.5.3 (2025-01-31)
Features
- Add keep_alive flag to crawler.__init__ (#921) (7a82d0c) by @Pijukatel, closes #891
- Add block_requests helper for PlaywrightCrawler (#919) (1030459) by @Mantisus, closes #848
- Return request handlers from decorator methods to allow further decoration (#934) (9ec0aae) by @mylank
- Add transform_request_function for enqueue_links (#923) (6b15957) by @Mantisus, closes #894
- Add time_remaining_secs property to MIGRATING event data (#940) (b44501b) by @fnesveda
- Add LogisticalRegressionPredictor - rendering type predictor for adaptive crawling (#930) (8440499) by @Pijukatel
Bug Fixes
- Fix crawler not retrying user handler if there was timeout in the handler (#909) (f4090ef) by @Pijukatel, closes #907
- Optimize memory consumption for HttpxHttpClient, fix proxy handling (#905) (d7ad480) by @Mantisus, closes #895
- Fix BrowserPool and PlaywrightBrowserPlugin closure (#932) (997543d) by @Mantisus
0.5.2 (2025-01-17)
Bug Fixes
- Avoid use_state race conditions. Remove key argument to use_state (#868) (000b976) by @Pijukatel, closes #856
- Restore proxy functionality for PlaywrightCrawler broken in v0.5 (#889) (908c944) by @Mantisus, closes #887
- Fix the usage of Configuration (#899) (0f1cf6f) by @vdusek, closes #670
0.5.1 (2025-01-07)
Bug Fixes
- Make result of RequestList.is_empty independent of fetch_next_request calls (#876) (d50249e) by @janbuchar
0.5.0 (2025-01-02)
- Check out the Release blog post for more details.
- Check out the Upgrading guide to ensure a smooth update.
Features
- Add possibility to use None as no proxy in tiered proxies (#760) (0fbd017) by @Pijukatel, closes #687
- Add use_state context method (#682) (868b41e) by @Mantisus, closes #191
- Add pre-navigation hooks router to AbstractHttpCrawler (#791) (0f23205) by @Pijukatel, closes #635
- Add example of how to integrate Camoufox into PlaywrightCrawler (#789) (246cfc4) by @Pijukatel, closes #684
- Expose event types, improve on/emit signature, allow parameterless listeners (#800) (c102c4c) by @janbuchar, closes #561
- Add stop method to BasicCrawler (#807) (6d01af4) by @Pijukatel, closes #651
- Add html_to_text helper function (#792) (2b9d970) by @Pijukatel, closes #659
- [breaking] Implement RequestManagerTandem, remove add_request from RequestList, accept any iterable in RequestList constructor (#777) (4172652) by @janbuchar
Bug Fixes
- Fix circular import in KeyValueStore (#805) (8bdf49d) by @Mantisus, closes #804
- [breaking] Refactor service usage to rely on service_locator (#691) (1d31c6c) by @vdusek, closes #369, #539, #699
- Pass verify in httpx client (#802) (074d083) by @Mantisus, closes #798
- Fix page_options for PlaywrightBrowserPlugin (#796) (bd3bdd4) by @Mantisus, closes #755
- Fix event migrating handler in RequestQueue (#825) (fd6663f) by @Mantisus, closes #815
- Respect user configuration for work with status codes (#812) (8daf4bd) by @Mantisus, closes #708, #756
- abort-on-error for successive runs (#834) (0cea673) by @Mantisus
- Relax ServiceLocator restrictions (#837) (aa3667f) by @janbuchar, closes #806
- Fix typo in exports (#841) (8fa6ac9) by @janbuchar
Refactor
- [breaking] Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance (#746) (9d3c269) by @Pijukatel, closes #350
- [breaking] Remove json_ and order_no from Request (#788) (5381d13) by @Mantisus, closes #94
- [breaking] Rename PwPreNavContext to PwPreNavCrawlingContext (#827) (84b61a3) by @vdusek
- [breaking] Rename PlaywrightCrawler kwargs: browser_options, page_options (#831) (ffc6048) by @Pijukatel
- [breaking] Update the crawlers & storage clients structure (#828) (0ba04d1) by @vdusek, closes #764
0.4.5 (2024-12-06)
Features
Bug Fixes
- Add upper bound of HTTPX version (#775) (b59e34d) by @vdusek
- Fix incorrect use of desired concurrency ratio (#780) (d1f8bfb) by @Pijukatel, closes #759
- Remove pydantic constraint <2.10.0 and update timedelta validator, serializer type hints (#757) (c0050c0) by @Pijukatel
0.4.4 (2024-11-29)
Features
- Expose browser_options and page_options to PlaywrightCrawler (#730) (dbe85b9) by @vdusek, closes #719
- Add abort_on_error property (#731) (6dae03a) by @Mantisus, closes #704
Bug Fixes
0.4.3 (2024-11-21)
Bug Fixes
- Pydantic 2.10.0 issues (#716) (8d8b3fc) by @Pijukatel
0.4.2 (2024-11-20)
Bug Fixes
- Respect custom HTTP headers in PlaywrightCrawler (#685) (a84125f) by @Mantisus
- Fix serialization payload in Request. Fix Docs for Post Request (#683) (e8b4d2d) by @Mantisus, closes #668
- Accept string payload in the Request constructor (#697) (19f5add) by @vdusek
- Fix snapshots handling (#692) (4016c0d) by @Pijukatel
0.4.1 (2024-11-11)
Features
- Add max_crawl_depth option to BasicCrawler (#637) (77deaa9) by @Prathamesh010, closes #460
- Add BeautifulSoupParser type alias (#674) (b2cf88f) by @Pijukatel
Bug Fixes
- Fix total_size usage in memory size monitoring (#661) (c2a3239) by @janbuchar
- Add HttpHeaders to module exports (#664) (f0c5ca7) by @vdusek, closes #663
- Fix unhandled ValueError in request handler result processing (#666) (0a99d7f) by @janbuchar
- Fix BaseDatasetClient.iter_items type hints (#680) (a968b1b) by @Pijukatel
0.4.0 (2024-11-01)
- Check out the Upgrading guide to ensure a smooth update.
Features
- [breaking] Add headers in unique key computation (#609) (6c4746f) by @Prathamesh010, closes #548
- Add pre_navigation_hooks to PlaywrightCrawler (#631) (5dd5b60) by @Prathamesh010, closes #427
- Add always_enqueue option to bypass URL deduplication (#621) (4e59fa4) by @Rutam21, closes #547
- Split and add extra configuration to export_data method (#580) (6751635) by @deshansh, closes #526
Bug Fixes
- Use strip in headers normalization (#614) (a15b21e) by @vdusek
- [breaking] Merge payload and data fields of Request (#542) (d06fcef) by @vdusek, closes #560
- Default ProxyInfo port if httpx.URL port is None (#619) (8107a6f) by @steffansafey, closes #618
Chore
0.3.9 (2024-10-23)
Features
- Key-value store context helpers (#584) (fc15622) by @janbuchar
- Added get_public_url method to KeyValueStore (#572) (3a4ba8f) by @akshay11298, closes #514
Bug Fixes
- Workaround for JSON value typing problems (#581) (403496a) by @janbuchar, closes #563
0.3.8 (2024-10-02)
Features
- Mask Playwright's "headless" headers (#545) (d1445e4) by @vdusek, closes #401
- Add new model for HttpHeaders (#544) (854f2c1) by @vdusek
Bug Fixes
- Call error_handler for SessionError (#557) (e75ac4b) by @vdusek, closes #546
- Extend from StrEnum in RequestState to fix serialization (#556) (6bf35ba) by @vdusek, closes #551
- Add equality check to UserData model (#562) (899a25c) by @janbuchar
0.3.7 (2024-09-25)
Bug Fixes
- Improve Request.user_data serialization (#540) (de29c0e) by @janbuchar, closes #524
- Adopt new version of curl-cffi (#543) (f6fcf48) by @vdusek
0.3.6 (2024-09-19)
Features
- Add HTTP/2 support for HTTPX client (#513) (0eb0a33) by @vdusek, closes #512
- Expose extended unique key when creating a new Request (#515) (1807f41) by @vdusek
- Add header generator and integrate it into HTTPX client (#530) (b63f9f9) by @vdusek, closes #402
Bug Fixes
0.3.5 (2024-09-10)
Features
- Memory usage limit configuration via environment variables (#502) (c62e554) by @janbuchar
Bug Fixes
- Http clients detect 4xx as errors by default (#498) (1895dca) by @vdusek, closes #496
- Correctly handle log level configuration (#508) (7ea8fe6) by @janbuchar
0.3.4 (2024-09-05)
Bug Fixes
0.3.3 (2024-09-05)
Bug Fixes
- Deduplicate requests by unique key before submitting them to the queue (#499) (6a3e0e7) by @janbuchar
0.3.2 (2024-09-02)
Bug Fixes
- Double incrementation of item_count (#443) (cd9adf1) by @cadlagtrader, closes #442
- Field alias in BatchRequestsOperationResponse (#485) (126a862) by @janbuchar
- JSON handling with Parsel (#490) (ebf5755) by @janbuchar, closes #488
0.3.1 (2024-08-30)
Features
0.3.0 (2024-08-27)
- Check out the Upgrading guide to ensure a smooth update.
Features
- Implement ParselCrawler that adds support for Parsel (#348) (a3832e5) by @asymness, closes #335
- Add support for filling a web form (#453) (5a125b4) by @vdusek, closes #305
Bug Fixes
- Remove indentation from statistics logging and print the data in tables (#322) (359b515) by @TymeeK, closes #306
- Remove redundant log, fix format (#408) (8d27e39) by @janbuchar
- Dequeue items from RequestQueue in the correct order (#411) (96fc33e) by @janbuchar
- Relative URLS supports & If not a URL, pass #417 (#431) (ccd8145) by @black7375, closes #417
- Typo in ProlongRequestLockResponse (#458) (30ccc3a) by @janbuchar
- Add missing __all__ to top-level __init__.py file (#463) (353a1ce) by @janbuchar
Refactor
- [breaking] RequestQueue and service management rehaul (#429) (b155a9f) by @janbuchar, closes #83, #174, #203, #423
- [breaking] Declare private and public interface (#456) (d6738df) by @vdusek
0.2.1 (2024-08-05)
Bug Fixes
0.2.0 (2024-08-05)
Features
- Add new curl impersonate HTTP client (#387) (9c06260) by @vdusek, closes #292
- playwright: infinite_scroll helper (#393) (34f74bd) by @janbuchar
0.1.2 (2024-07-30)
Features
Bug Fixes
- Minor log fix (#341) (0688bf1) by @souravjain540
- Also use error_handler for context pipeline errors (#331) (7a66445) by @janbuchar, closes #296
- Strip whitespace from href in enqueue_links (#346) (8a3174a) by @janbuchar, closes #337
- Warn instead of crashing when an empty dataset is being exported (#342) (22b95d1) by @janbuchar, closes #334
- Avoid GitHub rate limiting in project bootstrapping test (#364) (992f07f) by @janbuchar
- Pass crawler configuration to storages (#375) (b2d3a52) by @janbuchar
- Purge request queue on repeated crawler runs (#377) (7ad3d69) by @janbuchar, closes #152
0.1.1 (2024-07-19)
Features
- Expose crawler log (#316) (ae475fa) by @vdusek, closes #303
- Integrate proxies into PlaywrightCrawler (#325) (2e072b6) by @vdusek
- Blocking detection for playwright crawler (#328) (49ff6e2) by @vdusek, closes #239
Bug Fixes
- Pylance reportPrivateImportUsage errors (#313) (09d7203) by @vdusek, closes #283
- Set httpx logging to warning (#314) (1585def) by @vdusek, closes #302
- Byte size serialization in MemoryInfo (#245) (a030174) by @janbuchar
- Project bootstrapping in existing folder (#318) (c630818) by @janbuchar, closes #301
0.1.0 (2024-07-08)
Features
- Project templates (#237) (c23c12c) by @janbuchar, closes #215
Bug Fixes
- CLI UX improvements (#271) (123d515) by @janbuchar, closes #267
- Error handling in CLI and templates documentation (#273) (61083c3) by @vdusek, closes #268
0.0.7 (2024-06-27)
Bug Fixes
- Do not wait for consistency in request queue (#235) (03ff138) by @vdusek
- Selector handling in BeautifulSoupCrawler enqueue_links (#231) (896501e) by @janbuchar, closes #230
- Handle blocked request (#234) (f8ef79f) by @Mantisus
- Improve AutoscaledPool state management (#241) (fdea3d1) by @janbuchar, closes #236
0.0.6 (2024-06-25)
Features
- Maintain a global configuration instance (#207) (e003aa6) by @janbuchar
- Add max requests per crawl to BasicCrawler (#198) (b5b3053) by @vdusek
- Add support for decompressing br response content (#226) (a3547b9) by @Mantisus
- BasicCrawler.export_data helper (#222) (237ec78) by @janbuchar, closes #211
- Automatic logging setup (#229) (a67b72f) by @janbuchar, closes #214
Bug Fixes
- Handling of relative URLs in add_requests (#213) (8aa8c57) by @janbuchar, closes #202, #204
- Graceful exit in BasicCrawler.run (#224) (337286e) by @janbuchar, closes #212
0.0.5 (2024-06-21)
Features
- Browser rotation and better browser abstraction (#177) (a42ae6f) by @vdusek, closes #131
- Add emit persist state event to event manager (#181) (97f6c68) by @vdusek
- Batched request addition in RequestQueue (#186) (f48c806) by @vdusek
- Add storage helpers to crawler & context (#192) (f8f4066) by @vdusek, closes #98, #100, #172
- Handle all supported configuration options (#199) (23c901c) by @janbuchar, closes #84
- Add Playwright's enqueue links helper (#196) (849d73c) by @vdusek
Bug Fixes
- Tmp path in tests is working (#164) (382b6f4) by @vdusek, closes #159
- Add explicit err msgs for missing pckg extras during import (#165) (200ebfa) by @vdusek, closes #155
- Make timedelta_ms accept string-encoded numbers (#190) (d8426ff) by @janbuchar
- deps: Update dependency psutil to v6 (#193) (eb91f51) by @renovate[bot]
- Improve compatibility between ProxyConfiguration and its SDK counterpart (#201) (1a76124) by @janbuchar
- Correct return type of storage get_info methods (#200) (332673c) by @janbuchar
- Type error in statistics persist state (#206) (96ceef6) by @vdusek, closes #194
0.0.4 (2024-05-30)
Features
- Capture statistics about the crawler run (#142) (eeebe9b) by @janbuchar, closes #97
- Proxy configuration (#156) (5c3753a) by @janbuchar, closes #136
- Add first version of browser pool and playwright crawler (#161) (2d2a050) by @vdusek
0.0.3 (2024-05-13)
Features
- AutoscaledPool implementation (#55) (621ada2) by @janbuchar, closes #19
- Add Snapshotter (#20) (492ee38) by @vdusek
- Implement BasicCrawler (#56) (6da971f) by @janbuchar, closes #30
- BeautifulSoupCrawler (#107) (4974dfa) by @janbuchar, closes #31
- add_requests and enqueue_links context helpers (#120) (dc850a5) by @janbuchar, closes #5
- Use SessionPool in BasicCrawler (#128) (9fc4648) by @janbuchar, closes #110
- Add base storage client and resource subclients (#138) (44d6597) by @vdusek
Bug Fixes
- deps: Update dependency docutils to ^0.21.0 (#101) (534b613) by @renovate[bot]
- deps: Update dependency eval-type-backport to ^0.2.0 (#124) (c9e69a8) by @renovate[bot]
- Fire local SystemInfo events every second (#144) (f1359fa) by @vdusek
- Storage manager & purging the defaults (#150) (851042f) by @vdusek