Service locator
The ServiceLocator is a central registry for global services. It manages these services and provides access to them throughout the framework, ensuring consistent configuration across all components.
The service locator manages three core services: Configuration, EventManager, and StorageClient. All services are initialized lazily with defaults when first accessed.
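To make "initialized lazily with defaults" concrete, here is a minimal sketch of the idea in plain Python. It is not Crawlee's implementation: ToyServiceLocator, ToyConfiguration, and the defaults_created counter are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfiguration:
    """Hypothetical stand-in for a configuration service."""

    log_level: str = 'INFO'


class ToyServiceLocator:
    """Illustrative registry that creates a default service on first access."""

    def __init__(self) -> None:
        self._configuration: Optional[ToyConfiguration] = None
        self.defaults_created = 0  # counts how often a default had to be built

    def get_configuration(self) -> ToyConfiguration:
        # Build the default only when nothing was registered beforehand.
        if self._configuration is None:
            self._configuration = ToyConfiguration()
            self.defaults_created += 1
        return self._configuration


locator = ToyServiceLocator()
first = locator.get_configuration()   # the default is created here
second = locator.get_configuration()  # the cached instance is reused
```

Nothing is constructed until the first call to get_configuration(), and every later call returns the same object, which is what keeps all consumers on one shared instance.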
Services
There are three core services that are managed by the service locator:
Configuration
Configuration is a class that provides access to application-wide settings and parameters. It lets you configure various aspects of Crawlee, such as timeouts, logging level, and persistence intervals. The configuration can be set directly in code or via environment variables.
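The two ways of setting a value, in code or through the environment, can be sketched with the standard library alone. This is only an illustration of the pattern, not how Configuration is implemented: ToySettings and the TOY_LOG_LEVEL variable are hypothetical and do not correspond to real Crawlee names.

```python
import os
from dataclasses import dataclass, field


def _env_log_level() -> str:
    # TOY_LOG_LEVEL is a hypothetical variable name used only in this sketch.
    return os.environ.get('TOY_LOG_LEVEL', 'INFO')


@dataclass
class ToySettings:
    # The default is resolved from the environment at construction time.
    log_level: str = field(default_factory=_env_log_level)


os.environ['TOY_LOG_LEVEL'] = 'DEBUG'
from_env = ToySettings()                      # value read from the environment
from_code = ToySettings(log_level='WARNING')  # explicit value set in code
```

In this sketch an explicit argument takes precedence over the environment variable, and the environment variable takes precedence over the built-in default.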
StorageClient
StorageClient is the backend implementation for storages in Crawlee. It provides a unified interface for Dataset, KeyValueStore, and RequestQueue, regardless of the underlying storage implementation.
Refer to the Storage clients guide for more information about storage clients and how to use them.
EventManager
EventManager is responsible for coordinating internal events in Crawlee. It allows you to register event listeners and emit events throughout the framework. Examples of such events include aborting, migrating, and system info, as well as browser-specific events such as page created and page closed. Listeners let you execute custom logic whenever one of these events occurs.
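The register-then-emit pattern described above can be sketched as follows. This is a simplified, synchronous illustration and not Crawlee's EventManager: ToyEventManager, its on and emit methods, and the event names and payload are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Listener = Callable[[Any], None]


class ToyEventManager:
    """Hypothetical in-process event hub."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def on(self, event: str, listener: Listener) -> None:
        # Register a listener for the given event name.
        self._listeners[event].append(listener)

    def emit(self, event: str, payload: Any) -> None:
        # Call every listener registered for this event, in registration order.
        for listener in list(self._listeners.get(event, [])):
            listener(payload)


received: List[Any] = []
events = ToyEventManager()
events.on('system_info', received.append)
events.emit('system_info', {'cpu': 0.42})
events.emit('aborting', None)  # no listener registered, so nothing happens
```

Only the listener registered for system_info runs; emitting an event without listeners is a no-op.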
Service registration
There are several ways to register services in Crawlee, depending on your use case and preferences.
Via service locator
Services can be registered globally through the ServiceLocator before they are first accessed. A singleton service_locator instance is used throughout the framework, which makes the registered services available to all components.
Storage client:

import asyncio

from crawlee import service_locator
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient()

    # Register storage client via service locator.
    service_locator.set_storage_client(storage_client)


if __name__ == '__main__':
    asyncio.run(main())

Configuration:

import asyncio
from datetime import timedelta

from crawlee import service_locator
from crawlee.configuration import Configuration


async def main() -> None:
    configuration = Configuration(
        log_level='DEBUG',
        headless=False,
        persist_state_interval=timedelta(seconds=30),
    )

    # Register configuration via service locator.
    service_locator.set_configuration(configuration)


if __name__ == '__main__':
    asyncio.run(main())

Event manager:

import asyncio
from datetime import timedelta

from crawlee import service_locator
from crawlee.events import LocalEventManager


async def main() -> None:
    event_manager = LocalEventManager(
        system_info_interval=timedelta(seconds=5),
    )

    # Register event manager via service locator.
    service_locator.set_event_manager(event_manager)


if __name__ == '__main__':
    asyncio.run(main())
Via crawler constructors
Alternatively, services can be passed to the crawler constructors. They are registered globally in the ServiceLocator under the hood, which makes them available to all components and keeps the configuration consistent.
Storage client:

import asyncio

from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient()

    # Register storage client via crawler.
    crawler = ParselCrawler(
        storage_client=storage_client,
    )


if __name__ == '__main__':
    asyncio.run(main())

Configuration:

import asyncio
from datetime import timedelta

from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler


async def main() -> None:
    configuration = Configuration(
        log_level='DEBUG',
        headless=False,
        persist_state_interval=timedelta(seconds=30),
    )

    # Register configuration via crawler.
    crawler = ParselCrawler(
        configuration=configuration,
    )


if __name__ == '__main__':
    asyncio.run(main())

Event manager:

import asyncio
from datetime import timedelta

from crawlee.crawlers import ParselCrawler
from crawlee.events import LocalEventManager


async def main() -> None:
    event_manager = LocalEventManager(
        system_info_interval=timedelta(seconds=5),
    )

    # Register event manager via crawler.
    crawler = ParselCrawler(
        event_manager=event_manager,
    )


if __name__ == '__main__':
    asyncio.run(main())
Via storage constructors
Services can also be provided when opening a specific storage instance. In that case they are used only for that instance and do not affect the global configuration.
Storage client:

import asyncio

from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import Dataset


async def main() -> None:
    storage_client = MemoryStorageClient()

    # Pass the storage client to the dataset (or other storage) when opening it.
    dataset = await Dataset.open(
        storage_client=storage_client,
    )


if __name__ == '__main__':
    asyncio.run(main())

Configuration:

import asyncio
from datetime import timedelta

from crawlee.configuration import Configuration
from crawlee.storages import Dataset


async def main() -> None:
    configuration = Configuration(
        log_level='DEBUG',
        headless=False,
        persist_state_interval=timedelta(seconds=30),
    )

    # Pass the configuration to the dataset (or other storage) when opening it.
    dataset = await Dataset.open(
        configuration=configuration,
    )


if __name__ == '__main__':
    asyncio.run(main())
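The difference between a per-instance service and the global one can be pictured with a small stand-alone sketch. It only illustrates the "explicit argument, otherwise global fallback" pattern and makes no claim about how Crawlee's storages resolve their services; ToyStorageClient, ToyDataset, and GLOBAL_CLIENT are hypothetical names.

```python
from typing import Optional


class ToyStorageClient:
    """Hypothetical storage backend identified by a name."""

    def __init__(self, name: str) -> None:
        self.name = name


# Stands in for the globally registered client.
GLOBAL_CLIENT = ToyStorageClient('global')


class ToyDataset:
    def __init__(self, client: ToyStorageClient) -> None:
        self.client = client

    @classmethod
    def open(cls, storage_client: Optional[ToyStorageClient] = None) -> 'ToyDataset':
        # An explicit client applies to this instance only; the global one is untouched.
        return cls(storage_client if storage_client is not None else GLOBAL_CLIENT)


default_ds = ToyDataset.open()
custom_ds = ToyDataset.open(storage_client=ToyStorageClient('custom'))
```

The dataset opened without arguments shares the global client, while the second one keeps its own client and leaves the global registration as it was.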
Conflict prevention
Once a service has been retrieved from the service locator, attempting to set a different instance will raise a ServiceConflictError to prevent accidental configuration conflicts.
import asyncio

from crawlee import service_locator
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient


async def main() -> None:
    # Register the storage client via service locator.
    memory_storage_client = MemoryStorageClient()
    service_locator.set_storage_client(memory_storage_client)

    # Retrieve the storage client.
    current_storage_client = service_locator.get_storage_client()

    # Try to set a different storage client, which will raise ServiceConflictError
    # if storage client was already retrieved.
    file_system_storage_client = FileSystemStorageClient()
    service_locator.set_storage_client(file_system_storage_client)


if __name__ == '__main__':
    asyncio.run(main())
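The rule "no replacement after retrieval" can be modeled in a few lines of plain Python. This is a sketch of the idea under stated assumptions, not the framework's code: ToyLocator and ToyConflictError are hypothetical, and the real ServiceConflictError may be raised under somewhat different conditions.

```python
from typing import Optional


class ToyConflictError(Exception):
    """Hypothetical error used only in this sketch."""


class ToyLocator:
    def __init__(self) -> None:
        self._client: Optional[object] = None
        self._retrieved = False

    def set_storage_client(self, client: object) -> None:
        # Replacing a client that has already been retrieved is rejected.
        if self._retrieved and client is not self._client:
            raise ToyConflictError('storage client was already retrieved')
        self._client = client

    def get_storage_client(self) -> object:
        if self._client is None:
            self._client = object()  # lazily created default
        self._retrieved = True
        return self._client


locator = ToyLocator()
first_client = object()
locator.set_storage_client(first_client)
locator.get_storage_client()

try:
    locator.set_storage_client(object())
    conflict_raised = False
except ToyConflictError:
    conflict_raised = True
```

Setting a client before anyone has read it is allowed any number of times; the guard only triggers once some component may already be holding a reference to the current instance.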
Conclusion
The ServiceLocator is a tool for managing global services in Crawlee. It provides a consistent way to configure and access services throughout the framework, ensuring that all components have access to the same configuration and services.
If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!