
Crawlee for Python v0.5

6 min read
Vlada Dusek
Developer of Crawlee for Python

Crawlee for Python v0.5 is now available! This is our biggest release to date, bringing newly ported functionality from Crawlee for JavaScript, brand-new features that are (for now) exclusive to the Python library, a new consolidated package structure, and a bunch of bug fixes and other improvements.

Getting started

You can upgrade to the latest version straight from PyPI:

pip install --upgrade crawlee

Check out the full changelog on our website to see all the details. If you are updating from an older version, make sure to follow our Upgrading to v0.5 guide for a smooth upgrade.

New package structure

We have introduced a new, consolidated package structure. The goal is to streamline the development experience, help you find the crawlers you are looking for faster, and improve your IDE's code suggestions when importing.

Crawlers

We have grouped all crawler classes (and their corresponding crawling context classes) into a single sub-package called crawlers. Here is a quick example of how the imports have changed:

- from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+ from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

Now you can see all the crawlers we offer in one place. Isn't that cool?

Import from crawlers subpackage.
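
For illustration, here is a small selection of the crawler classes that now live in the single crawlers sub-package; your IDE's autocompletion will show the complete list. Note that some of these require the corresponding optional extras to be installed.

# A few of the crawler classes now grouped under crawlee.crawlers.
# Some of them require optional extras (see the installation docs).
from crawlee.crawlers import (
    BasicCrawler,
    BeautifulSoupCrawler,
    HttpCrawler,
    ParselCrawler,
    PlaywrightCrawler,
)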

Storage clients

Similarly, we have moved all storage client classes under the storage_clients sub-package. For instance:

- from crawlee.memory_storage_client import MemoryStorageClient
+ from crawlee.storage_clients import MemoryStorageClient

This consolidation makes it clearer where each class belongs and ensures that your IDE can provide better autocompletion when you are looking for the right crawler or storage client.

Continued parity with Crawlee JS

We are constantly working toward feature parity with our JavaScript library, Crawlee JS. With v0.5, we have brought over more functionality:

HTML to text context helper

The html_to_text crawling context helper simplifies extracting text from an HTML page by automatically removing all tags and returning only the raw text content. It's available in the ParselCrawlingContext and BeautifulSoupCrawlingContext.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        context.log.info('Crawling: %s', context.request.url)
        text = context.html_to_text()
        # Continue with the processing...

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

In this example, we use a ParselCrawler to fetch a webpage, then invoke context.html_to_text() to extract clean text for further processing.

Use state

The use_state crawling context helper makes it simple to create and manage persistent state values within your crawler. All state values are persisted automatically, so you can maintain data across different crawler runs, restarts, and failures. Under the hood, it is a convenient abstraction over the KeyValueStore.

import asyncio

from crawlee import Request
from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Create a crawler with purge_on_start disabled to retain state across runs.
    crawler = ParselCrawler(
        configuration=Configuration(purge_on_start=False),
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Crawling {context.request.url}')

        # Retrieve or initialize the state with a default value.
        state = await context.use_state('state', default_value={'runs': 0})

        # Increment the run count.
        state['runs'] += 1

    # Create a request with always_enqueue enabled to bypass deduplication and ensure it is processed.
    request = Request.from_url('https://crawlee.dev/', always_enqueue=True)

    # Run the crawler with the start request.
    await crawler.run([request])

    # Fetch the persisted state from the key-value store.
    kvs = await crawler.get_key_value_store()
    state = await kvs.get_auto_saved_value('state')
    crawler.log.info(f'Final state after run: {state}')


if __name__ == '__main__':
    asyncio.run(main())

Please note that use_state is an experimental feature; its behavior and interface may evolve in future versions.

Brand new features

In addition to porting features from JS, we are introducing new, Python-first functionality that will eventually make its way into Crawlee JS in the coming months.

Crawler's stop method

The BasicCrawler, and by extension all crawlers that inherit from it, now has a stop method. This makes it easy to halt crawling when a specific condition is met, for instance, when you have found the data you were looking for.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        context.log.info('Crawling: %s', context.request.url)

        # Extract and enqueue links from the page.
        await context.enqueue_links()

        title = context.selector.css('title::text').get()

        # Condition when you want to stop the crawler, e.g. you
        # have found what you were looking for.
        if title and 'Crawlee for Python' in title:
            context.log.info('Condition met, stopping the crawler.')
            await crawler.stop()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

Request loaders

The new RequestLoader, RequestManager, and RequestManagerTandem classes manage how Crawlee accesses and stores requests. They let you plug in any other component (service) as a source of requests and, optionally, combine that external source with Crawlee's standard RequestQueue.

You can learn more about these new features in the Request loaders guide.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.request_loaders import RequestList, RequestManagerTandem
from crawlee.storages import RequestQueue


async def main() -> None:
    rl = RequestList(
        [
            'https://crawlee.dev',
            'https://apify.com',
            # Long list of URLs...
        ],
    )

    rq = await RequestQueue.open()

    # Combine them into a single request source.
    tandem = RequestManagerTandem(rl, rq)

    crawler = ParselCrawler(request_manager=tandem)

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Crawling {context.request.url}')
        # ...

    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())

In this example, we combine a RequestList with a RequestQueue. However, instead of the RequestList, you can use any other class that implements the RequestLoader interface to suit your specific requirements, as sketched below.
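
Purely as a hypothetical sketch (the authoritative set of abstract methods and their signatures is in the RequestLoader reference documentation; the method names below are our assumptions based on v0.5 and may differ), a minimal custom loader serving requests from an in-memory list might look roughly like this:

from __future__ import annotations

from crawlee import Request
from crawlee.request_loaders import RequestLoader


class InMemoryUrlLoader(RequestLoader):
    # Hypothetical example: serves requests from a plain list of URLs.
    # Method names follow our reading of the v0.5 RequestLoader interface;
    # verify them against the reference documentation before relying on this.

    def __init__(self, urls: list[str]) -> None:
        self._pending = [Request.from_url(url) for url in urls]
        self._handled = 0

    async def get_total_count(self) -> int:
        # Pending plus already handled requests.
        return len(self._pending) + self._handled

    async def is_empty(self) -> bool:
        return not self._pending

    async def is_finished(self) -> bool:
        return not self._pending

    async def fetch_next_request(self) -> Request | None:
        return self._pending.pop(0) if self._pending else None

    async def mark_request_as_handled(self, request: Request) -> None:
        self._handled += 1

    async def get_handled_count(self) -> int:
        return self._handled

Such a loader could then be combined with a RequestQueue via RequestManagerTandem, just like the RequestList above.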

Service locator

The ServiceLocator is primarily an internal mechanism for managing the services that Crawlee depends on, specifically the Configuration, StorageClient, and EventManager. By swapping out these components, you can adapt Crawlee to different runtime environments.

You can use the service locator explicitly:

import asyncio

from crawlee import service_locator
from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.events import LocalEventManager
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    service_locator.set_configuration(Configuration())
    service_locator.set_storage_client(MemoryStorageClient())
    service_locator.set_event_manager(LocalEventManager())

    crawler = ParselCrawler()

    # ...


if __name__ == '__main__':
    asyncio.run(main())

Or pass the services directly to the crawler instance, and they will be set under the hood:

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.events import LocalEventManager
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    crawler = ParselCrawler(
        configuration=Configuration(),
        storage_client=MemoryStorageClient(),
        event_manager=LocalEventManager(),
    )

    # ...


if __name__ == '__main__':
    asyncio.run(main())

Conclusion

We are excited to share that Crawlee v0.5 is here. If you have any questions or feedback, please open a GitHub discussion. If you encounter any bugs, or have an idea for a new feature, please open a GitHub issue.