Upgrading to v0.x
This page summarizes the breaking changes between Crawlee for Python zero-based versions.
Upgrading to v0.5
This section summarizes the breaking changes between v0.4.x and v0.5.0.
Crawlers & CrawlingContexts
- All crawler and crawling context classes have been consolidated into a single sub-package called crawlers.
- The affected classes include: AbstractHttpCrawler, AbstractHttpParser, BasicCrawler, BasicCrawlerOptions, BasicCrawlingContext, BeautifulSoupCrawler, BeautifulSoupCrawlingContext, BeautifulSoupParserType, ContextPipeline, HttpCrawler, HttpCrawlerOptions, HttpCrawlingContext, HttpCrawlingResult, ParsedHttpCrawlingContext, ParselCrawler, ParselCrawlingContext, PlaywrightCrawler, PlaywrightCrawlingContext, PlaywrightPreNavCrawlingContext.
Example update:
- from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+ from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
Storage clients
- All storage client classes have been moved into a single sub-package called storage_clients.
- The affected classes include: MemoryStorageClient, BaseStorageClient.
Example update:
- from crawlee.memory_storage_client import MemoryStorageClient
+ from crawlee.storage_clients import MemoryStorageClient
CurlImpersonateHttpClient
- The CurlImpersonateHttpClient changed its import location.
Example update:
- from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
+ from crawlee.http_clients import CurlImpersonateHttpClient
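The module moves shown in the examples above can be applied mechanically. The following is a minimal sketch, assuming single-line `from ... import ...` statements; the helper names (`MOVED_MODULES`, `rewrite_import_line`, `rewrite_source`) are hypothetical and not part of Crawlee, and the mapping contains only the three module paths that appear in the examples on this page.

```python
import re

# Old module path -> new module path, copied from the examples on this page.
# Extend this mapping yourself if your code imports from other moved modules.
MOVED_MODULES = {
    'crawlee.beautifulsoup_crawler': 'crawlee.crawlers',
    'crawlee.memory_storage_client': 'crawlee.storage_clients',
    'crawlee.http_clients.curl_impersonate': 'crawlee.http_clients',
}

# Matches "from <module> import <names>" on a single line.
_FROM_IMPORT = re.compile(r'^(\s*from\s+)([\w.]+)(\s+import\s.*)$')


def rewrite_import_line(line: str) -> str:
    """Return the line with a moved module path replaced, or unchanged."""
    match = _FROM_IMPORT.match(line)
    if match is None:
        return line
    prefix, module, rest = match.groups()
    return prefix + MOVED_MODULES.get(module, module) + rest


def rewrite_source(source: str) -> str:
    """Apply rewrite_import_line to every line of a source file."""
    return '\n'.join(rewrite_import_line(line) for line in source.split('\n'))


old = 'from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext'
print(rewrite_import_line(old))
```

Only exact module matches are rewritten, so unrelated imports stay untouched. Review the result by hand, since the sketch does not cover plain `import crawlee.xyz` statements or aliased modules.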
BeautifulSoupParser
- Renamed BeautifulSoupParser to BeautifulSoupParserType. It was probably used only in type hints. Please replace previous usages of BeautifulSoupParser with BeautifulSoupParserType.
- BeautifulSoupParser is now a new class that is used in the refactored BeautifulSoupCrawler class.
Service locator
- The crawlee.service_container was completely refactored and renamed to crawlee.service_locator.
- You can use it to set the configuration, event manager, or storage client globally. Alternatively, you can pass them to your crawler instance directly, and it will use the service locator under the hood.
Statistics
- The crawlee.statistics.Statistics class does not accept an event manager as an input argument anymore. It uses the default, global one.
- If you want to set your custom event manager, do it either via the service locator or pass it to the crawler.
Request
- The properties json_ and order_no were removed. They existed only for internal use by the memory storage client, so you should not need them.
Request storages and loaders
- The request_provider parameter of BasicCrawler.__init__ has been renamed to request_manager.
- The BasicCrawler.get_request_provider method has been renamed to BasicCrawler.get_request_manager and it does not accept the id and name arguments anymore.
  - If using a specific request queue is desired, pass it as the request_manager on BasicCrawler creation.
- The RequestProvider interface has been renamed to RequestManager and moved to the crawlee.request_loaders package.
- RequestList has been moved to the crawlee.request_loaders package.
- RequestList does not support .drop(), .reclaim_request(), .add_request() and .add_requests_batched() anymore.
  - It implements the new RequestLoader interface instead of RequestManager.
  - RequestManagerTandem with a RequestQueue should be used to enable passing a RequestList (or any other RequestLoader implementation) as a request_manager; await list.to_tandem() can be used as a shortcut.
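To illustrate why a read-only loader needs a read-write partner before it can act as a request manager, here is a toy model of the tandem idea. The classes `ToyLoader`, `ToyQueue`, and `ToyTandem` are hypothetical stand-ins written for this sketch. They are not Crawlee classes, they are synchronous where Crawlee's API is async, and the assumption that the tandem moves requests from the loader into the queue before serving them is a simplification, not a description of Crawlee's implementation.

```python
from collections import deque


class ToyLoader:
    """Read-only source: it can hand out requests but cannot accept new ones."""

    def __init__(self, urls):
        self._pending = deque(urls)

    def fetch_next(self):
        return self._pending.popleft() if self._pending else None


class ToyQueue:
    """Read-write store: it also supports adding and reclaiming requests."""

    def __init__(self):
        self._pending = deque()

    def add(self, url):
        self._pending.append(url)

    def reclaim(self, url):
        # A reclaimed request goes back to the end of the queue for a retry.
        self._pending.append(url)

    def fetch_next(self):
        return self._pending.popleft() if self._pending else None


class ToyTandem:
    """Combines a read-only loader with a read-write queue."""

    def __init__(self, loader, queue):
        self._loader = loader
        self._queue = queue

    def add(self, url):
        self._queue.add(url)

    def reclaim(self, url):
        self._queue.reclaim(url)

    def fetch_next(self):
        # Move one request from the loader into the queue, if any remain,
        # then serve whatever is at the front of the queue.
        url = self._loader.fetch_next()
        if url is not None:
            self._queue.add(url)
        return self._queue.fetch_next()


tandem = ToyTandem(ToyLoader(['https://example.com/a', 'https://example.com/b']), ToyQueue())
first = tandem.fetch_next()
tandem.add('https://example.com/c')  # New links can be enqueued via the queue side.
print(first, tandem.fetch_next())
```

The point of the sketch is the division of labour: writes (adding and reclaiming) always go to the queue, which is why the loader alone cannot serve as a request manager.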
PlaywrightCrawler
- The PlaywrightPreNavigationContext was renamed to PlaywrightPreNavCrawlingContext.
- The input arguments in PlaywrightCrawler.__init__ have been renamed:
  - browser_options is now browser_launch_options,
  - page_options is now browser_new_context_options.
- These argument renaming changes have also been applied to BrowserPool, PlaywrightBrowserPlugin, and PlaywrightBrowserController.
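If your code builds the crawler's keyword arguments in a dictionary (for example, from a config file), the two renames above can be handled in one place. This is a minimal sketch; `migrate_playwright_kwargs` and `RENAMED_ARGUMENTS` are hypothetical names, not part of Crawlee, and the option values are only illustrative.

```python
# Old argument name -> new argument name, as listed above.
RENAMED_ARGUMENTS = {
    'browser_options': 'browser_launch_options',
    'page_options': 'browser_new_context_options',
}


def migrate_playwright_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with the v0.4 argument names replaced."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_ARGUMENTS.get(name, name)
        if new_name in migrated:
            # Both the old and the new spelling were supplied; refuse to guess.
            raise ValueError(f'Conflicting values for {new_name!r}')
        migrated[new_name] = value
    return migrated


old_kwargs = {
    'browser_options': {'headless': True},
    'page_options': {'locale': 'en-US'},
    'max_requests_per_crawl': 10,
}
print(migrate_playwright_kwargs(old_kwargs))
```

The migrated dictionary can then be unpacked into the constructor call, e.g. `PlaywrightCrawler(**migrated)`. Arguments that were not renamed pass through unchanged.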
Upgrading to v0.4
This section summarizes the breaking changes between v0.3.x and v0.4.0.
Request model
- The Request.query_params field has been removed. Please add query parameters directly to the URL, which was possible before as well, and is now the only supported approach.
- The Request.payload and Request.data fields have been consolidated. Now, only Request.payload remains, and it should be used for all payload data in requests.
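Since query parameters now have to be part of the URL, code that previously passed them separately needs to encode them itself. Below is a minimal sketch using only the standard library; `with_query_params` is a hypothetical helper name, and the URL and parameters are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_params(url: str, params: dict[str, str]) -> str:
    """Return url with params merged into its query string."""
    parts = urlsplit(url)
    # Keep parameters already present in the URL; new values win on conflict.
    # Note: dict() keeps only the last value of a repeated parameter name.
    merged = dict(parse_qsl(parts.query, keep_blank_values=True))
    merged.update(params)
    return urlunsplit(parts._replace(query=urlencode(merged)))


url = with_query_params('https://example.com/search?lang=en', {'q': 'crawlee python', 'page': '2'})
print(url)  # https://example.com/search?lang=en&q=crawlee+python&page=2
```

The resulting string is what you would then hand to the request, for example via `Request.from_url(url)`.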
Extended unique key computation
- The computation of extended_unique_key now includes HTTP headers. While this change impacts the behavior, the interface remains the same.
Upgrading to v0.3
This section summarizes the breaking changes between v0.2.x and v0.3.0.
Public and private interface declaration
In previous versions, the majority of the package was fully public, including many elements intended for internal use only. With the release of v0.3, we have clearly defined the public and private interface of the package. As a result, some imports have been updated (see below). If you are importing something now designated as private, we recommend reconsidering its use or discussing your use case with us in the discussions/issues.
Here is a list of the updated public imports:
- from crawlee.enqueue_strategy import EnqueueStrategy
+ from crawlee import EnqueueStrategy
- from crawlee.models import Request
+ from crawlee import Request
- from crawlee.basic_crawler import Router
+ from crawlee.router import Router
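To find the affected imports in an existing codebase, you can scan source files for the old module and name pairs listed above. The following is a rough sketch built on the standard `ast` module; `find_outdated_imports` and `OUTDATED_IMPORTS` are hypothetical names, and the mapping covers only the three imports listed on this page.

```python
import ast

# (old module, imported name) -> new module, copied from the list above.
OUTDATED_IMPORTS = {
    ('crawlee.enqueue_strategy', 'EnqueueStrategy'): 'crawlee',
    ('crawlee.models', 'Request'): 'crawlee',
    ('crawlee.basic_crawler', 'Router'): 'crawlee.router',
}


def find_outdated_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, suggested import) for each outdated import."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Only absolute "from X import Y" statements are relevant here.
        if not isinstance(node, ast.ImportFrom) or node.level != 0:
            continue
        for alias in node.names:
            new_module = OUTDATED_IMPORTS.get((node.module, alias.name))
            if new_module is not None:
                findings.append((node.lineno, f'from {new_module} import {alias.name}'))
    return sorted(findings)


example = (
    'from crawlee.models import Request\n'
    'from crawlee.basic_crawler import BasicCrawler, Router\n'
)
for lineno, suggestion in find_outdated_imports(example):
    print(f'line {lineno}: use "{suggestion}"')
```

Matching on the pair of module and name, rather than on the module alone, matters here: in the example, only Router is flagged on the second line, while BasicCrawler is left alone because this list does not mention it.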
Request queue
There were internal changes that should not affect the intended usage:
- The unused BaseRequestQueueClient.list_requests() method was removed.
- RequestQueue internals were updated to match the "Request Queue V2" implementation in Crawlee for JS.
Service container
A new module, crawlee.service_container, was added to allow management of "global instances". It currently contains Configuration, EventManager, and BaseStorageClient. The module also replaces the StorageClientManager static class. Its interface is likely to change in the future. If your use case requires working with it, please get in touch; we'll be glad to hear any feedback.