Architecture overview
Crawlee is a modern and modular web scraping framework. It is designed for both HTTP-only and browser-based scraping. In this guide, we will provide a high-level overview of its architecture and the main components that make up the system.
Crawler
The main user-facing component of Crawlee is the crawler, which orchestrates the crawling process. It manages storages, executes user-defined request handlers, handles retries, manages concurrency, and coordinates all other components. All crawlers inherit from the BasicCrawler class, which provides the basic functionality. There are two main groups of specialized crawlers: HTTP crawlers and browser crawlers.
You will learn more about the request handlers in the request router section.
HTTP crawlers
HTTP crawlers use HTTP clients to fetch pages and parse them with HTML parsing libraries. They are fast and efficient for sites that do not require JavaScript rendering. HTTP clients are Crawlee components that wrap around HTTP libraries like httpx, curl-impersonate or impit and handle HTTP communication for requests and responses. You can learn more about them in the HTTP clients guide.
HTTP crawlers inherit from AbstractHttpCrawler, and there are three crawlers that belong to this category:
- BeautifulSoupCrawler - Utilizes the BeautifulSoup HTML parser.
- ParselCrawler - Utilizes Parsel for parsing HTML.
- HttpCrawler - Does not parse HTTP responses at all and is used when no content parsing is required.
You can learn more about HTTP crawlers in the HTTP crawlers guide.
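As a minimal illustration, the following sketch wires a ParselCrawler to a single request handler that extracts the page title and follows links. The import paths assume a recent Crawlee for Python release, and the start URL, request limit, and CSS selector are only examples:

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # An HTTP-only crawler: pages are fetched with an HTTP client and parsed with Parsel.
    crawler = ParselCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # The crawling context exposes the request, the parsed page, and helper methods.
        context.log.info(f'Processing {context.request.url}')
        await context.push_data({
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        })
        await context.enqueue_links()  # Discover and enqueue links found on the page.

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```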
Browser crawlers
Browser crawlers use a real browser to render pages, enabling scraping of sites that require JavaScript. They manage browser instances, pages, and context lifecycles. Currently, the only browser crawler is PlaywrightCrawler, which utilizes the Playwright library. Playwright provides a high-level API for controlling and navigating browsers. You can learn more about PlaywrightCrawler, its features, and how it internally manages browser instances in the Playwright crawler guide.
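From the user's perspective, a browser crawler looks almost identical to an HTTP crawler; the main difference is that the crawling context exposes a Playwright page. A rough sketch, assuming recent import paths and an illustrative start URL:

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # The crawler manages browser and page lifecycles; handlers only see the context.
    crawler = PlaywrightCrawler(headless=True, max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        # context.page is a Playwright Page, so the full Playwright API is available.
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```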
Adaptive crawler
The AdaptivePlaywrightCrawler sits between HTTP and browser crawlers. It can automatically decide whether to use HTTP or browser crawling for each request, based on heuristics or user configuration. This allows for optimal performance and compatibility. It also provides a uniform interface for both crawling types (modes). You can learn more about adaptive crawling in the Adaptive Playwright crawler guide.
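A rough sketch of the adaptive crawler is shown below; it assumes the with_beautifulsoup_static_parser() factory available in recent releases, and the handler deliberately sticks to helpers that work in both modes:

```python
import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
)


async def main() -> None:
    # Pairs a Playwright browser with a static BeautifulSoup parser and decides
    # per request which of the two is actually needed.
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # The same handler runs whether the page came from plain HTTP or a browser.
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```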
Crawling contexts
Crawling contexts are objects that encapsulate the state and data for each request being processed by the crawler. They provide access to the request, the response, the session, and helper methods for handling the request. Crawling contexts are used to pass data between different parts of the crawler and to manage the lifecycle of each request. They are passed to user-defined request handlers, which use them to access request and response data and to call helper methods that interact with storages or extract and enqueue new requests.
They follow a similar inheritance structure to the crawlers, with BasicCrawlingContext as the base class. The specific crawling contexts are:
- HttpCrawlingContext - For HTTP crawlers.
- ParsedHttpCrawlingContext - For HTTP crawlers with parsed responses.
- ParselCrawlingContext - For HTTP crawlers that use Parsel for parsing.
- BeautifulSoupCrawlingContext - For HTTP crawlers that use BeautifulSoup for parsing.
- PlaywrightPreNavCrawlingContext - For Playwright crawlers before the page is navigated.
- PlaywrightCrawlingContext - For Playwright crawlers.
- AdaptivePlaywrightPreNavCrawlingContext - For Adaptive Playwright crawlers before the page is navigated.
- AdaptivePlaywrightCrawlingContext - For Adaptive Playwright crawlers.
Storages
Storages are the components that manage data in Crawlee. They provide a way to store and retrieve data during the crawling process. Crawlee's storage system consists of two main layers:
- Storages: High-level interfaces for interacting with different storage types
- Storage clients: Backend implementations that handle the actual data persistence and management (you will learn more about them in the next section)
Crawlee provides three built-in storage types for managing data:
- Dataset - Append-only, tabular storage for structured data. It is ideal for storing scraping results.
- KeyValueStore - Storage for arbitrary data like JSON documents, images, or configs. It supports get and set operations with key-value pairs; updates are only possible by replacement.
- RequestQueue - A managed queue of pending and completed requests, with automatic deduplication and dynamic addition of new items. It is used to track URLs for crawling.
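The sketch below shows the typical operations on each storage type; the keys, values, and URLs are placeholders, and the import paths assume a recent Crawlee for Python release:

```python
import asyncio

from crawlee.storages import Dataset, KeyValueStore, RequestQueue


async def main() -> None:
    # Open (or create) the default instance of each storage type.
    dataset = await Dataset.open()
    kvs = await KeyValueStore.open()
    queue = await RequestQueue.open()

    # Dataset: append-only records, typically one per scraped item.
    await dataset.push_data({'url': 'https://example.com', 'title': 'Example'})

    # Key-value store: arbitrary values addressed by a key; updates replace the value.
    await kvs.set_value('crawl-config', {'max_depth': 2})
    config = await kvs.get_value('crawl-config')
    print(config)

    # Request queue: URLs to crawl, deduplicated automatically.
    await queue.add_request('https://example.com')


if __name__ == '__main__':
    asyncio.run(main())
```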
See the Storages guide for more details.
Storage clients
Storage clients are the backend implementations for storages that handle interactions with different storage systems. They provide a unified interface for Dataset, KeyValueStore, and RequestQueue, regardless of the underlying storage implementation.
Crawlee provides several built-in storage client implementations:
- MemoryStorageClient - Stores data in memory with no persistence (ideal for testing and fast operations).
- FileSystemStorageClient - Provides persistent file system storage with caching (the default client).
- ApifyStorageClient - Manages storage on the Apify platform (cloud-based). It is implemented in the Apify SDK; you can find more information about it in the Apify SDK documentation.
Storage clients can be registered globally with the ServiceLocator (you will learn more about the ServiceLocator in the next section), passed directly to crawlers, or specified when opening individual storage instances. You can also create custom storage clients by implementing the StorageClient interface.
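For example, a test setup might swap the default file system client for the in-memory one. This is a sketch assuming the no-argument MemoryStorageClient constructor and the crawler's storage_client parameter available in recent releases:

```python
import asyncio

from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    # Keep all storage in memory, e.g. for tests; nothing is persisted to disk.
    storage_client = MemoryStorageClient()

    # Pass the client directly to a crawler...
    crawler = ParselCrawler(storage_client=storage_client)

    # ...or register it globally instead:
    # from crawlee import service_locator
    # service_locator.set_storage_client(storage_client)


if __name__ == '__main__':
    asyncio.run(main())
```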
See the Storage clients guide for more details.
Request router
The request Router is a central component that manages the flow of requests and responses in Crawlee. It is responsible for routing requests to the appropriate request handlers, managing the crawling context, and coordinating the execution of user-defined logic.
Request handlers
Request handlers are user-defined functions that process requests and responses in Crawlee. They are the core of the crawling logic and are responsible for handling data extraction, processing, and storage. Each request handler receives a crawling context as an argument, which provides access to request data, response data, and other information related to the request. Request handlers can be registered with the Router.
The request routing in Crawlee supports:
- Default handlers - Fallback handlers for requests without specific labels.
- Label-based routing - Handlers for specific request types based on labels.
- Error handlers - Handle errors during request processing.
- Failed request handlers - Handle requests that exceed retry limits.
- Pre-navigation hooks - Execute logic before navigating to URLs.
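A small sketch of label-based routing is shown below; the CATEGORY label and the CSS selector are arbitrary examples, and the import paths assume a recent Crawlee for Python release:

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


@router.default_handler
async def default_handler(context: ParselCrawlingContext) -> None:
    # Fallback for requests without a label: enqueue category links with a label.
    await context.enqueue_links(selector='a.category', label='CATEGORY')


@router.handler('CATEGORY')
async def category_handler(context: ParselCrawlingContext) -> None:
    # Only requests labeled CATEGORY are routed here.
    await context.push_data({'url': context.request.url})


async def main() -> None:
    # The router is passed to the crawler as its request handler.
    crawler = ParselCrawler(request_handler=router)
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```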
See the Request router guide for detailed information and examples.
Service locator
The ServiceLocator is a central registry for global services in Crawlee. It manages and provides access to core services throughout the framework, ensuring consistent configuration across all components. The service locator coordinates these three services:
- Configuration - Application-wide settings and parameters that control various aspects of Crawlee behavior.
- StorageClient - Backend implementation for data storage across datasets, key-value stores, and request queues.
- EventManager - Event coordination system for internal framework events and custom user hooks.
Services can be registered globally through the service_locator singleton instance, passed to crawler constructors, or provided when opening individual storage instances. The service locator includes conflict-prevention mechanisms that keep configuration consistent and guard against accidental service conflicts at runtime.
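A minimal sketch of global registration might look as follows; the purge_on_start setting is just an illustrative configuration value:

```python
from crawlee import service_locator
from crawlee.configuration import Configuration
from crawlee.events import LocalEventManager
from crawlee.storage_clients import MemoryStorageClient

# Register services globally before any crawler or storage is created;
# all components then resolve them through the service locator.
service_locator.set_configuration(Configuration(purge_on_start=False))
service_locator.set_storage_client(MemoryStorageClient())
service_locator.set_event_manager(LocalEventManager())
```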
See the Service locator guide for detailed information about service registration and configuration options.
Request loaders
Request loaders provide a subset of RequestQueue functionality, focusing specifically on reading and accessing streams of requests from various sources. They define how requests are fetched and processed, enabling use cases such as reading URLs from files, external APIs, or sitemaps, and combining multiple sources together. Unlike request queues, they do not handle storage or persistence; they only provide request reading capabilities.
- RequestLoader - Base interface for read-only access to a stream of requests, with capabilities like fetching the next request, marking it as handled, and checking status.
- RequestList - Lightweight in-memory implementation of RequestLoader for managing static lists of URLs.
- SitemapRequestLoader - Specialized loader for reading URLs from XML sitemaps, with filtering capabilities.
Request managers
RequestManager extends RequestLoader with write capabilities for adding and reclaiming requests, providing full request management functionality. RequestQueue is the primary concrete implementation of RequestManager.
RequestManagerTandem combines a read-only RequestLoader with a writable RequestManager, transferring requests from the loader to the manager. This hybrid setup is useful when you want to start with a predefined set of URLs (for example, from a file or sitemap) while keeping the ability to add new requests dynamically during crawling. The tandem first processes all requests from the loader, then handles any additional requests added to the manager, as shown in the sketch below.
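The following sketch combines a static RequestList with a RequestQueue through RequestManagerTandem. The URLs are placeholders, and the constructor argument order as well as passing the tandem via the crawler's request_manager parameter are assumed to match recent releases:

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.request_loaders import RequestList, RequestManagerTandem
from crawlee.storages import RequestQueue


async def main() -> None:
    # A static, read-only source of start URLs...
    request_list = RequestList(['https://example.com/page-1', 'https://example.com/page-2'])

    # ...combined with a writable request queue so handlers can add more URLs.
    request_queue = await RequestQueue.open()
    tandem = RequestManagerTandem(request_list, request_queue)

    crawler = ParselCrawler(request_manager=tandem)

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})
        await context.enqueue_links()  # Newly found links go to the writable queue.

    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
```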
See the Request loaders guide for detailed information.
Event manager
The EventManager is responsible for coordinating internal events throughout Crawlee and enabling custom hooks. It provides a system for registering event listeners, emitting events, and managing their execution lifecycle.
Crawlee provides several implementations of the event manager:
- EventManager - The base class for event management in Crawlee.
- LocalEventManager - Extends the base event manager for local environments by automatically emitting SYSTEM_INFO events at regular intervals. This provides real-time system metrics, including CPU usage and memory consumption, which are essential for internal components like the Snapshotter and AutoscaledPool.
- ApifyEventManager - Manages events on the Apify platform (cloud-based). It is implemented in the Apify SDK.
You can learn more about Snapshotter and AutoscaledPool and their configuration in the Scaling crawlers guide.
Crawlee defines several built-in event types:
- PERSIST_STATE - Emitted periodically to trigger state persistence.
- SYSTEM_INFO - Contains CPU and memory usage information.
- MIGRATING - Signals that the crawler is migrating to a different environment.
- ABORTING - Indicates that the crawler is aborting execution.
- EXIT - Emitted when the crawler is exiting.
- CRAWLER_STATUS - Provides status updates from crawlers.
Additional specialized events for browser and session management are also available.
The event manager operates as an async context manager, automatically starting periodic tasks when entered and ensuring all listeners complete before exiting. Event listeners can be either synchronous or asynchronous functions and are executed safely without blocking the main event loop.
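As a small sketch (with an arbitrary sleep standing in for real crawler work), a listener for SYSTEM_INFO events can be registered like this, assuming the Event enum, EventSystemInfoData, and LocalEventManager exported by crawlee.events in recent releases:

```python
import asyncio

from crawlee.events import Event, EventSystemInfoData, LocalEventManager


async def main() -> None:
    # The event manager is an async context manager; entering it starts the
    # periodic emission of SYSTEM_INFO events.
    async with LocalEventManager() as event_manager:

        def on_system_info(event_data: EventSystemInfoData) -> None:
            # Listeners can be sync or async; this one just prints the payload.
            print(f'SYSTEM_INFO event received: {event_data}')

        event_manager.on(event=Event.SYSTEM_INFO, listener=on_system_info)
        await asyncio.sleep(2)  # Stand-in for real crawler work.


if __name__ == '__main__':
    asyncio.run(main())
```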
Session management
The core component of session management in Crawlee is SessionPool. It manages a collection of sessions that simulate individual users with unique attributes like cookies, IP addresses (via proxies), and browser fingerprints. Sessions help avoid blocking by rotating user identities and maintaining realistic browsing patterns.
You can learn more about fingerprints and how to avoid getting blocked in the Avoid blocking guide.
Session
A session is represented as a Session object, which contains components like cookies, error tracking, usage limits, and expiration handling. Sessions can be marked as good (Session.mark_good), bad (Session.mark_bad), or retired (Session.retire) based on their performance, and they automatically become unusable when they exceed error thresholds or usage limits.
Session pool
The session pool provides automated session lifecycle management:
- Automatic rotation - Retrieves random sessions from the pool and creates new ones as needed.
- Pool maintenance - Removes retired sessions and maintains the pool at maximum capacity.
- State persistence - Persists session state to enable recovery across restarts.
- Configurable limits - Supports custom pool sizes, session settings, and creation functions.
The pool operates as an async context manager, automatically initializing with sessions and cleaning up on exit. It ensures proper session management by rotating sessions based on usage count, expiration time, and custom rules while maintaining optimal pool size.
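A sketch of enabling the session pool on a crawler follows; the pool size and start URL are illustrative, and the explicit mark_good() feedback is optional:

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.sessions import SessionPool


async def main() -> None:
    # Enable session rotation with a custom pool size (values are illustrative).
    crawler = ParselCrawler(
        use_session_pool=True,
        session_pool=SessionPool(max_pool_size=20),
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # Give feedback to the pool so poorly performing sessions get retired.
        if context.session:
            context.session.mark_good()
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```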
See the Session management guide for more information.
Statistics
The Statistics class provides runtime monitoring for crawler operations, tracking performance metrics like request counts, processing times, retry attempts, and error patterns. It operates as an async context manager, automatically persisting data across crawler restarts and migrations using KeyValueStore.
The system includes error tracking through the ErrorTracker class, which groups similar errors by type and message patterns using wildcard matching. It can capture HTML snapshots and screenshots for debugging and separately track retry-specific errors.
Statistics are logged at configurable intervals in both table and inline formats, with the final summary data returned from the crawler.run method available through FinalStatistics.
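A minimal sketch of reading the final statistics; the start URL and request limit are placeholders:

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    # run() returns a FinalStatistics summary for the whole crawl.
    stats = await crawler.run(['https://crawlee.dev'])
    print(stats)  # Request counts, retries, durations, and similar metrics.


if __name__ == '__main__':
    asyncio.run(main())
```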
Conclusion
In this guide, we provided a high-level overview of the core components of the Crawlee library and its architecture. We covered the main components like crawlers, crawling contexts, storages, request routers, service locator, request loaders, event manager, session management, and statistics. Check out other guides, the API reference, and Examples for more details on how to use these components in your own projects.
If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!