
Service locator

The ServiceLocator is a central registry for global services. It manages and provides access to these services throughout the framework, ensuring they are configured consistently across all components.

The service locator manages three core services: Configuration, EventManager, and StorageClient. All services are initialized lazily with defaults when first accessed.
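Lazy initialization with defaults can be pictured with a small, self-contained sketch. The classes below (`ToyConfiguration`, `ToyServiceLocator`) are hypothetical stand-ins written for illustration only, not Crawlee's actual implementation. They show the general pattern: a default is built the first time a service is requested, and the same instance is returned afterwards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfiguration:
    """Hypothetical stand-in for a configuration service."""

    log_level: str = 'INFO'


class ToyServiceLocator:
    """Illustrative registry that creates a default service on first access."""

    def __init__(self) -> None:
        self._configuration: Optional[ToyConfiguration] = None
        # Counts how many times a default had to be constructed.
        self.created_defaults = 0

    def get_configuration(self) -> ToyConfiguration:
        # Nothing is constructed until a caller actually asks for the service.
        if self._configuration is None:
            self._configuration = ToyConfiguration()
            self.created_defaults += 1
        return self._configuration


toy_locator = ToyServiceLocator()
first = toy_locator.get_configuration()
second = toy_locator.get_configuration()
```

Both calls return the same object, and the default is constructed only once.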

Services

The sections below describe each of the core services managed by the service locator.

Configuration

Configuration is a class that provides access to application-wide settings and parameters. It allows you to configure various aspects of Crawlee, such as timeouts, logging level, persistence intervals, and other settings. The configuration can be set directly in the code or via environment variables.
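The two ways of supplying settings, in code or through environment variables, can be sketched without any Crawlee imports. Everything in this example is an assumption made for illustration: `ToySettings`, `load_settings`, and the `TOY_*` variable names are hypothetical and do not correspond to Crawlee's real field names or environment variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ToySettings:
    """Hypothetical settings object with built-in defaults."""

    log_level: str = 'INFO'
    request_timeout_secs: int = 30


def load_settings(env: Mapping[str, str] = os.environ) -> ToySettings:
    """Build settings from environment-style variables, falling back to defaults."""
    defaults = ToySettings()
    return ToySettings(
        log_level=env.get('TOY_LOG_LEVEL', defaults.log_level),
        request_timeout_secs=int(
            env.get('TOY_REQUEST_TIMEOUT_SECS', defaults.request_timeout_secs)
        ),
    )


# Option 1: set values directly in code.
from_code = ToySettings(log_level='DEBUG')

# Option 2: read values from the environment (a plain dict stands in for it here).
from_env = load_settings({'TOY_LOG_LEVEL': 'WARNING'})
```

Values that are not provided either way keep their defaults.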

StorageClient

StorageClient is the backend implementation for storages in Crawlee. It provides a unified interface for Dataset, KeyValueStore, and RequestQueue, regardless of the underlying storage implementation.

Refer to the Storage clients guide for more information about storage clients and how to use them.

EventManager

EventManager is responsible for coordinating internal events in Crawlee. It allows you to register event listeners and emit events throughout the framework. Examples of such events include aborting, migrating, and system info, as well as browser-specific events such as page created and page closed. Listeners let you execute custom logic when these events occur.
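The register-and-emit pattern described above can be illustrated with a minimal, synchronous sketch. `ToyEvent` and `ToyEventManager` are hypothetical names used only for this example. Crawlee's own event manager has its own API and event types, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Callable, DefaultDict, List


class ToyEvent(str, Enum):
    """Hypothetical event names, loosely modelled on the examples in the text."""

    ABORTING = 'aborting'
    SYSTEM_INFO = 'system_info'


Listener = Callable[[Any], None]


class ToyEventManager:
    """Illustrative event manager: listeners are stored per event type."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[ToyEvent, List[Listener]] = defaultdict(list)

    def on(self, event: ToyEvent, listener: Listener) -> None:
        # Register a callback to run whenever the given event is emitted.
        self._listeners[event].append(listener)

    def emit(self, event: ToyEvent, data: Any) -> None:
        # Iterate over a copy so listeners may register further listeners safely.
        for listener in list(self._listeners[event]):
            listener(data)


received: List[Any] = []
manager = ToyEventManager()
manager.on(ToyEvent.SYSTEM_INFO, received.append)

manager.emit(ToyEvent.SYSTEM_INFO, {'cpu_ratio': 0.5})
manager.emit(ToyEvent.ABORTING, None)  # No listener registered, so nothing runs.
```

Only the listener registered for the matching event type is invoked.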

Service registration

There are several ways to register services in Crawlee, depending on your use case and preferences.

Via service locator

Services can be registered globally through the ServiceLocator before they are first accessed. The framework uses a singleton service_locator instance, so services registered on it are available to all components.

import asyncio

from crawlee import service_locator
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient()

    # Register storage client via service locator.
    service_locator.set_storage_client(storage_client)


if __name__ == '__main__':
    asyncio.run(main())

Via crawler constructors

Alternatively, services can be passed to crawler constructors. Under the hood, they are registered globally in the ServiceLocator, which makes them available to all components and keeps the configuration consistent.

import asyncio

from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient()

    # Register storage client via crawler.
    crawler = ParselCrawler(
        storage_client=storage_client,
    )


if __name__ == '__main__':
    asyncio.run(main())

Via storage constructors

Services can also be provided when opening a specific storage instance. In that case, they are used only for that instance and do not affect the global configuration.

import asyncio

from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import Dataset


async def main() -> None:
    storage_client = MemoryStorageClient()

    # Pass the storage client to the dataset (or other storage) when opening it.
    dataset = await Dataset.open(
        storage_client=storage_client,
    )


if __name__ == '__main__':
    asyncio.run(main())

Conflict prevention

Once a service has been retrieved from the service locator, attempting to set a different instance will raise a ServiceConflictError to prevent accidental configuration conflicts.

import asyncio

from crawlee import service_locator
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient


async def main() -> None:
    # Register the storage client via service locator.
    memory_storage_client = MemoryStorageClient()
    service_locator.set_storage_client(memory_storage_client)

    # Retrieve the storage client.
    current_storage_client = service_locator.get_storage_client()

    # Try to set a different storage client, which will raise ServiceConflictError
    # because the storage client was already retrieved.
    file_system_storage_client = FileSystemStorageClient()
    service_locator.set_storage_client(file_system_storage_client)


if __name__ == '__main__':
    asyncio.run(main())
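To make the conflict rule more concrete, here is a minimal sketch of how such a guard could work. It is not Crawlee's implementation: `ToyConflictError` and `GuardedLocator` are hypothetical, and allowing the same instance to be set again is an assumption of this sketch rather than documented behavior.

```python
from typing import Optional


class ToyConflictError(RuntimeError):
    """Hypothetical error raised when a retrieved service would be replaced."""


class GuardedLocator:
    """Illustrative locator that locks a service once it has been retrieved."""

    def __init__(self) -> None:
        self._storage_client: Optional[object] = None
        self._retrieved = False

    def set_storage_client(self, client: object) -> None:
        # Replacing the service is refused once another component may hold it.
        if self._retrieved and client is not self._storage_client:
            raise ToyConflictError('Storage client was already retrieved.')
        self._storage_client = client

    def get_storage_client(self) -> object:
        if self._storage_client is None:
            # Fall back to a default when nothing was registered.
            self._storage_client = object()
        self._retrieved = True
        return self._storage_client


guarded_locator = GuardedLocator()
memory_client = object()  # Stand-in for a real storage client.

guarded_locator.set_storage_client(memory_client)
current = guarded_locator.get_storage_client()

try:
    guarded_locator.set_storage_client(object())
    conflict_raised = False
except ToyConflictError:
    conflict_raised = True
```

The practical takeaway is the same as in the example above: register custom services before anything retrieves them.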

Conclusion

The ServiceLocator is a tool for managing global services in Crawlee. It provides a consistent way to configure and access services throughout the framework, ensuring that all components have access to the same configuration and services.

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!