Version: Next

PydanticAiSelectorExtractor

Extractor that learns reusable CSS selectors and then reuses them without further LLM calls.

On each call it first tries the cached selector maps and extracts with no LLM call when one fits. On a miss it asks the model for a new map, validates it against the live page, and caches it. A bucket keeps several maps, so A/B-tested markup variants can coexist.
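The lookup order described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the extractor's implementation: `apply_map`, `extract_with_cache`, and `fake_generate` are hypothetical names, pages are modeled as dicts so the sketch runs without an HTML parser or a model, and evicting the oldest variant when the bucket is full is an assumed policy.

```python
from typing import Callable, Optional

# Hypothetical stand-ins: a "selector map" maps field names to selectors,
# and a "page" maps selectors to the text they would match.
SelectorMap = dict[str, str]
Page = dict[str, str]


def apply_map(page: Page, selectors: SelectorMap) -> Optional[dict[str, str]]:
    """Return the extracted fields, or None when any selector fails to match."""
    if all(selector in page for selector in selectors.values()):
        return {field: page[selector] for field, selector in selectors.items()}
    return None


def extract_with_cache(
    page: Page,
    bucket: list[SelectorMap],
    generate_map: Callable[[Page], SelectorMap],
    max_variants: int = 5,
) -> dict[str, str]:
    # 1. Try every cached variant first; a hit needs no model call.
    for selectors in bucket:
        result = apply_map(page, selectors)
        if result is not None:
            return result
    # 2. Miss: request a new map (the expensive step) and validate it on this page.
    selectors = generate_map(page)
    result = apply_map(page, selectors)
    if result is None:
        raise ValueError('generated selectors do not match the page')
    # 3. Cache the validated map; dropping the oldest variant is an assumption.
    bucket.append(selectors)
    while len(bucket) > max_variants:
        bucket.pop(0)
    return result


model_calls: list[str] = []  # records each simulated model call


def fake_generate(page: Page) -> SelectorMap:
    """Pretend model: picks whichever selector exists on the page."""
    model_calls.append('generate')
    return {'name': 'h1.title' if 'h1.title' in page else 'h2.name'}


bucket: list[SelectorMap] = []
variant_a = {'h1.title': 'Widget'}  # one markup variant
variant_b = {'h2.name': 'Gadget'}   # an A/B-tested alternative

first = extract_with_cache(variant_a, bucket, fake_generate)   # miss, generates
second = extract_with_cache(variant_b, bucket, fake_generate)  # miss, generates
third = extract_with_cache(variant_a, bucket, fake_generate)   # hit, no model call
```

Both variants end up in the same bucket, so the third call is served from the cache even though the second page used different markup.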

The cache is a RecoverableState persisted to a KeyValueStore. When the extractor is used as an async context manager, the cache is loaded on entry and saved on exit. When the extractor is used standalone, the cache initializes lazily.
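The two lifecycles can be illustrated with a small stand-in class. This is a sketch of the general pattern only: `PersistedCacheSketch` and `remember` are hypothetical names, a plain dict plays the role of the key-value store, and any periodic persistence the real cache may perform is left out.

```python
import asyncio
from typing import Optional


class PersistedCacheSketch:
    """Hypothetical stand-in for a cache that loads once and saves on exit."""

    def __init__(self, store: dict[str, dict], key: str = 'AI-SELECTORS') -> None:
        self._store = store  # plays the role of the key-value store
        self._key = key
        self._data: Optional[dict] = None  # None means "not loaded yet"
        self.loads = 0

    async def _ensure_loaded(self) -> dict:
        # Lazy path: the first use loads the record if nothing did so earlier.
        if self._data is None:
            self._data = dict(self._store.get(self._key, {}))
            self.loads += 1
        return self._data

    async def __aenter__(self) -> 'PersistedCacheSketch':
        await self._ensure_loaded()  # eager load on entry
        return self

    async def __aexit__(self, exc_type, exc_value, exc_traceback) -> None:
        if self._data is not None:
            self._store[self._key] = dict(self._data)  # final save on exit

    async def remember(self, bucket: str, selectors: dict) -> None:
        data = await self._ensure_loaded()
        data[bucket] = selectors


async def demo() -> tuple[dict, int]:
    store: dict[str, dict] = {}
    # Context-manager use: loaded on entry, saved on exit.
    async with PersistedCacheSketch(store) as cache:
        await cache.remember('product', {'name': 'h1'})
    # Standalone use: nothing is loaded until the first call needs it.
    standalone = PersistedCacheSketch(store)
    await standalone.remember('article', {'title': 'h2'})
    return store, standalone.loads


store, lazy_loads = asyncio.run(demo())
```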

When a fallback extractor is configured, unsupported schemas and generation failures degrade to it. Infrastructure errors (credentials, HTTP failures, exceeded usage limits) propagate to the caller.
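The split between degrading and propagating can be sketched as follows. The exception classes and `extract_with_fallback` are hypothetical names chosen for illustration; the errors the library actually raises are not listed on this page.

```python
import asyncio


class GenerationFailed(Exception):
    """Hypothetical: the model could not produce working selectors."""


class UnsupportedSchema(Exception):
    """Hypothetical: the schema shape cannot be expressed as selectors."""


class InfrastructureError(Exception):
    """Hypothetical: credentials, HTTP, or usage-limit problems."""


async def extract_with_fallback(primary, fallback, content):
    try:
        return await primary(content)
    except (GenerationFailed, UnsupportedSchema):
        # Recoverable for the caller: hand the page to the fallback, if any.
        if fallback is None:
            raise
        return await fallback(content)
    # InfrastructureError is deliberately not caught, so it propagates.


async def failing_primary(content):
    raise GenerationFailed('no selectors matched')


async def broken_primary(content):
    raise InfrastructureError('invalid API key')


async def direct_fallback(content):
    return {'source': 'fallback', 'content': content}


async def demo():
    degraded = await extract_with_fallback(failing_primary, direct_fallback, '<html/>')
    try:
        await extract_with_fallback(broken_primary, direct_fallback, '<html/>')
    except InfrastructureError:
        propagated = True
    else:
        propagated = False
    return degraded, propagated


degraded, propagated = asyncio.run(demo())
```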

See the PydanticAiHtmlExtractor protocol for the common extractor interface, and PydanticAiDirectExtractor for a per-page variant with no selector cache.

Usage

import asyncio

from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import PydanticAiDirectExtractor, PydanticAiSelectorExtractor


class Product(BaseModel):
    name: str
    price: str | None


async def main() -> None:
    model = OpenAIChatModel('gpt-5.4-nano', provider=OpenAIProvider(api_key='...'))
    # Degrade to per-page extraction when selector generation fails.
    extractor = PydanticAiSelectorExtractor(model=model, fallback=PydanticAiDirectExtractor(model=model))
    # cache_tag keeps product pages in their own selector bucket.
    product = await extractor.extract('<html>...</html>', Product, cache_tag='product')
    print(product)


asyncio.run(main())

Hierarchy

BasePydanticAiHtmlExtractor
- PydanticAiSelectorExtractor

Methods

__aenter__

async __aenter__(): PydanticAiSelectorExtractor

Initialize the selector cache eagerly and enter the fallback chain.
Returns PydanticAiSelectorExtractor

__aexit__

async __aexit__(exc_type, exc_value, exc_traceback): None

Persist the selector cache one final time, detach from events, and exit the fallback chain.
Parameters
- exc_type: type[BaseException] | None
- exc_value: BaseException | None
- exc_traceback: TracebackType | None
Returns None

__init__

__init__(model, *, kvs_cache_key, distiller, instructions, retries, max_variants, fallback, usage_limits, persistence): None

Overrides BasePydanticAiHtmlExtractor.__init__
Initialize a new instance.
Parameters
- model: str | Model
  A provider-prefixed name (e.g. 'openai:gpt-5.4-nano') or a pydantic-ai Model.
- kvs_cache_key: str | None = None (optional, keyword-only)
  Name of the KeyValueStore record holding the selector cache. Defaults to 'AI-SELECTORS'.
- distiller: PydanticAiHtmlDistiller | None = None (optional, keyword-only)
  The HTML distiller shaping the LLM input. Defaults to PydanticAiSkeletonDistiller.
- instructions: str = _SELECTOR_INSTRUCTIONS (optional, keyword-only)
  Base selector-generation instructions. The distiller's prompt notes are appended automatically.
- retries: int = 3 (optional, keyword-only)
  How many times the model may fix failing selectors within one generation.
- max_variants: int = 5 (optional, keyword-only)
  Cap on cached selector maps per bucket.
- fallback: PydanticAiHtmlExtractor | None = None (optional, keyword-only)
  Extractor to degrade to when generation fails or the schema shape is unsupported.
- usage_limits: UsageLimits | None = None (optional, keyword-only)
  Optional pydantic-ai UsageLimits applied to every generation run.
- persistence: bool = True (optional, keyword-only)
  Whether the selector cache is persisted. Disable for ephemeral runs or tests.
Returns None
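The retries parameter describes a propose, validate, and fix loop. The sketch below shows one way such a loop could look; `generate_with_retries`, `fake_propose`, and `fake_validate` are hypothetical names, and counting the first proposal separately from the retries (so at most 1 + retries attempts) is an assumption rather than documented behavior.

```python
from typing import Callable, Optional


def generate_with_retries(
    propose: Callable[[Optional[list[str]]], dict[str, str]],
    validate: Callable[[dict[str, str]], list[str]],
    retries: int = 3,
) -> dict[str, str]:
    """Ask for selectors, feeding validation errors back until they pass."""
    feedback: Optional[list[str]] = None
    for _ in range(1 + retries):  # assumption: one initial try plus `retries` fixes
        selectors = propose(feedback)
        feedback = validate(selectors)
        if not feedback:
            return selectors
    raise RuntimeError(f'selectors still failing after {retries} retries: {feedback}')


attempts: list[Optional[list[str]]] = []  # feedback seen by each proposal


def fake_propose(feedback: Optional[list[str]]) -> dict[str, str]:
    """Pretend model: guesses wrong first, then corrects itself from feedback."""
    attempts.append(feedback)
    return {'price': '.price'} if feedback else {'price': '.cost'}


def fake_validate(selectors: dict[str, str]) -> list[str]:
    """Pretend page check: only '.price' matches anything."""
    return [] if selectors['price'] == '.price' else ["'.cost' matched nothing"]


fixed = generate_with_retries(fake_propose, fake_validate, retries=3)
```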

extract

async extract(content, schema, *, scope, cache_tag, additional_instructions): TSchema

Overrides BasePydanticAiHtmlExtractor.extract
Extract schema from content using cached or freshly generated selectors.
Parameters
- content: str | Selector
  Raw HTML or a parsed Parsel Selector.
- schema: type[TSchema]
  The Pydantic model describing the desired output.
- scope: str | None = None (optional, keyword-only)
  Optional CSS selector restricting extraction to the first matching subtree.
- cache_tag: str | None = None (optional, keyword-only)
  Optional tag identifying the page kind. Selectors are cached per (schema, scope, cache_tag). Reusing one tag for every page, including the default None, puts unlike pages in the same bucket, which quickly overflows the cache.
- additional_instructions: str | None = None (optional, keyword-only)
  Extra instructions appended for this call only.
Returns TSchema
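The effect of cache_tag on bucketing can be sketched with a toy key function. How the library derives its key is not shown on this page, so `bucket_key` and `group` are hypothetical, and the schema is identified by its qualified name purely for illustration.

```python
from collections import defaultdict
from typing import Optional


class Product:
    """Placeholder schema; the real one would be a Pydantic model."""


def bucket_key(schema: type, scope: Optional[str], cache_tag: Optional[str]) -> tuple:
    # Assumed key shape, mirroring the documented (schema, scope, cache_tag) triple.
    return (schema.__qualname__, scope, cache_tag)


# Three page kinds, each with two markup variants: six selector maps in total.
pages = [
    (kind, f'{kind}-variant-{n}')
    for kind in ('product', 'category', 'review')
    for n in range(2)
]


def group(use_tags: bool) -> dict[tuple, list[str]]:
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for kind, selector_map in pages:
        tag = kind if use_tags else None
        buckets[bucket_key(Product, None, tag)].append(selector_map)
    return dict(buckets)


untagged = group(use_tags=False)  # everything lands in one bucket
tagged = group(use_tags=True)     # one bucket per page kind
```

Without tags, the single bucket would need six slots, one more than the default max_variants of 5. With tags, each bucket holds two maps.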

set_ai_usage

set_ai_usage(value): None

Overrides BasePydanticAiHtmlExtractor.set_ai_usage
Adopt value and re-share it with the fallback chain.
Parameters
- value: PydanticAiUsageStats
Returns None
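One reading of "adopt and re-share" is that the whole fallback chain ends up writing to a single stats object. The sketch below illustrates that reading only; `UsageStatsSketch`, `ExtractorSketch`, and `record_run` are hypothetical and do not reflect the fields of PydanticAiUsageStats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageStatsSketch:
    """Hypothetical token counter shared between extractors."""
    input_tokens: int = 0
    output_tokens: int = 0


class ExtractorSketch:
    def __init__(self, fallback: Optional['ExtractorSketch'] = None) -> None:
        self.fallback = fallback
        self.ai_usage = UsageStatsSketch()

    def set_ai_usage(self, value: UsageStatsSketch) -> None:
        # Adopt the object, then pass the same object down the chain.
        self.ai_usage = value
        if self.fallback is not None:
            self.fallback.set_ai_usage(value)

    def record_run(self, input_tokens: int, output_tokens: int) -> None:
        self.ai_usage.input_tokens += input_tokens
        self.ai_usage.output_tokens += output_tokens


direct = ExtractorSketch()
selector = ExtractorSketch(fallback=direct)

shared = UsageStatsSketch()
selector.set_ai_usage(shared)

selector.record_run(100, 20)  # a selector-generation run
direct.record_run(400, 50)    # a fallback run on another page
```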

Properties

active

active: bool

Whether the extractor is in its async context-manager scope.

ai_usage

ai_usage: PydanticAiUsageStats

Accumulated token usage of this extractor's runs.
