PydanticAiSelectorExtractor
Hierarchy
- BasePydanticAiHtmlExtractor
- PydanticAiSelectorExtractor
Methods
__aenter__
Initialize the selector cache eagerly and enter the fallback chain.
Returns PydanticAiSelectorExtractor
__aexit__
Persist the selector cache one final time, detach from events, and exit the fallback chain.
Parameters
exc_type: type[BaseException] | None
exc_value: BaseException | None
exc_traceback: TracebackType | None
Returns None
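The enter and exit behavior described above can be pictured with a small stand-in class. This is only a sketch of the pattern, not the library's implementation: `CachedExtractorSketch`, `_ensure_loaded`, and `remember` are hypothetical names, and a plain dict takes the place of the key-value store. The `'AI-SELECTORS'` key mirrors the documented default.

```python
import asyncio


class CachedExtractorSketch:
    """Hypothetical stand-in showing eager load on enter and a final save on exit."""

    def __init__(self, store):
        self._store = store   # plain dict standing in for a key-value store
        self._cache = None    # nothing loaded yet
        self.active = False

    async def _ensure_loaded(self):
        # Load the cache once, whichever code path gets here first.
        if self._cache is None:
            self._cache = dict(self._store.get("AI-SELECTORS", {}))

    async def __aenter__(self):
        await self._ensure_loaded()  # eager initialization
        self.active = True
        return self

    async def __aexit__(self, exc_type, exc_value, exc_traceback):
        self._store["AI-SELECTORS"] = dict(self._cache)  # final persist
        self.active = False

    async def remember(self, key, selector_map):
        await self._ensure_loaded()  # lazy initialization for standalone use
        self._cache[key] = selector_map


async def main():
    store = {}
    async with CachedExtractorSketch(store) as extractor:
        await extractor.remember("Product", {"title": "h1"})
    return store, extractor.active


store, still_active = asyncio.run(main())
```

After the `async with` block exits, `store` holds the saved cache and the extractor is no longer active.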
__init__
Initialize a new instance.
Parameters
model: str | Model
A provider-prefixed name (e.g. 'openai:gpt-5.4-nano') or a pydantic-ai Model.
optional keyword-only kvs_cache_key: str | None = None
Name of the KeyValueStore record holding the selector cache. Defaults to 'AI-SELECTORS'.
optional keyword-only distiller: PydanticAiHtmlDistiller | None = None
The HTML distiller shaping the LLM input. Defaults to PydanticAiSkeletonDistiller.
optional keyword-only instructions: str = _SELECTOR_INSTRUCTIONS
Base selector-generation instructions. The distiller's prompt notes are appended automatically.
optional keyword-only retries: int = 3
How many times the model may fix failing selectors within one generation.
optional keyword-only max_variants: int = 5
Cap on cached selector maps per bucket.
optional keyword-only fallback: PydanticAiHtmlExtractor | None = None
Extractor to degrade to when generation fails or the schema shape is unsupported.
optional keyword-only usage_limits: UsageLimits | None = None
Optional pydantic-ai UsageLimits applied to every generation run.
optional keyword-only persistence: bool = True
Whether the selector cache is persisted. Disable for ephemeral runs or tests.
Returns None
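To illustrate what `retries` controls, here is a minimal sketch of a generate-and-fix loop. It is not the library's code. The helper names (`generate_with_retries`, `fake_propose`, `fake_validate`) are hypothetical, and the sketch assumes one initial attempt followed by up to `retries` corrections, which the reference above does not spell out.

```python
from __future__ import annotations

from typing import Callable

SelectorMap = dict[str, str]


def generate_with_retries(
    propose: Callable[[list[str]], SelectorMap],
    validate: Callable[[SelectorMap], list[str]],
    retries: int = 3,
) -> SelectorMap | None:
    """Request a selector map, feeding validation errors back to the proposer.

    Assumption: one initial attempt plus up to `retries` corrections.
    """
    errors: list[str] = []
    for _ in range(retries + 1):
        candidate = propose(errors)
        errors = validate(candidate)
        if not errors:
            return candidate
    return None  # every attempt still had failing selectors


# Hypothetical stand-ins for the model and the live-page check.
attempts: list[list[str]] = []


def fake_propose(errors: list[str]) -> SelectorMap:
    attempts.append(errors)
    # The first proposal is wrong; once it sees errors, the "model" corrects it.
    return {"title": "h1.title"} if errors else {"title": "h2"}


def fake_validate(candidate: SelectorMap) -> list[str]:
    return [] if candidate["title"] == "h1.title" else ["title: no match"]


result = generate_with_retries(fake_propose, fake_validate)
```

Here the second proposal passes validation, so the loop stops after one correction.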
extract
Extract schema from content using cached or freshly generated selectors.
Parameters
content: str | Selector
Raw HTML or a parsed Parsel Selector.
schema: type[TSchema]
The Pydantic model describing the desired output.
optional keyword-only scope: str | None = None
Optional CSS selector restricting extraction to the first matching subtree.
optional keyword-only cache_tag: str | None = None
Optional tag identifying the page kind. Selectors are cached per (schema, scope, cache_tag). A shared tag (the None default) buckets unlike pages together, overflowing the cache fast.
optional keyword-only additional_instructions: str | None = None
Extra instructions appended for this call only.
Returns TSchema
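The bucketing rule for `cache_tag` is easier to see with a toy cache. The sketch below keys buckets by `(schema, scope, cache_tag)` and caps each bucket at five maps, mirroring the documented `max_variants` default. The `remember` helper is hypothetical, and dropping the oldest map on overflow is an assumption, since the eviction policy is not described here.

```python
from collections import defaultdict, deque

MAX_VARIANTS = 5  # mirrors the documented default of `max_variants`

# (schema name, scope, cache_tag) -> bounded collection of selector maps.
# Assumption: the oldest map is dropped when a bucket overflows.
cache = defaultdict(lambda: deque(maxlen=MAX_VARIANTS))


def remember(schema_name, scope, cache_tag, selector_map):
    """Hypothetical helper: file a selector map under its bucket key."""
    cache[(schema_name, scope, cache_tag)].append(selector_map)


# Untagged: ten unlike page kinds all land in the single (Product, None, None) bucket.
for i in range(10):
    remember("Product", None, None, {"title": f"#layout-{i} h1"})

# Tagged: each page kind gets its own bucket, so nothing is pushed out.
for i in range(10):
    remember("Product", None, f"kind-{i}", {"title": f"#layout-{i} h1"})

untagged = cache[("Product", None, None)]
tagged_first = cache[("Product", None, "kind-0")]
```

With the shared `None` tag, only the five most recent maps survive in this sketch, while each tagged bucket keeps its single map.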
set_ai_usage
Adopt value and re-share it with the fallback chain.
Parameters
value: PydanticAiUsageStats
Returns None
Properties
active
Whether the extractor is in its async context-manager scope.
ai_usage
Accumulated token usage of this extractor's runs.
Extractor that learns reusable CSS selectors and reuses them for free.
On each call it first tries the cached selector maps and extracts with no LLM call when one fits. On a miss it asks the model for a new map, validates it against the live page, and caches it. A bucket keeps several maps, so A/B-tested markup variants can coexist.
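The cache-first flow described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the extractor's implementation: `xml.etree.ElementTree` paths stand in for CSS selectors, `fake_generate` stands in for the LLM call, and `apply` and `extract` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

llm_calls = 0  # counts how often the stand-in "model" is consulted


def apply(selector_map, root):
    """Return the extracted fields, or None if any selector misses."""
    out = {}
    for field, path in selector_map.items():
        node = root.find(path)
        if node is None or node.text is None:
            return None
        out[field] = node.text.strip()
    return out


def fake_generate(root):
    """Stand-in for the model: picks a path that fits the given page."""
    global llm_calls
    llm_calls += 1
    if root.find(".//h1") is not None:
        return {"title": ".//h1"}
    return {"title": ".//h2"}


def extract(html, variants):
    root = ET.fromstring(html)
    # 1. Try every cached selector map; a hit costs no model call.
    for selector_map in variants:
        result = apply(selector_map, root)
        if result is not None:
            return result
    # 2. Miss: generate a new map and validate it against the live page.
    selector_map = fake_generate(root)
    result = apply(selector_map, root)
    if result is None:
        raise ValueError("generated selectors do not match the page")
    # 3. Cache the validated map alongside the existing variants.
    variants.append(selector_map)
    return result


variants = []
first = extract("<div><h1>Widget A</h1></div>", variants)   # miss, generates
second = extract("<div><h1>Widget B</h1></div>", variants)  # hit, reuses the map
third = extract("<div><h2>Widget C</h2></div>", variants)   # new markup variant
```

Three pages are extracted with two calls to the stand-in model, and the bucket ends up holding one map per markup variant.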
The cache is a RecoverableState persisted to a KeyValueStore. As an async context manager it loads at startup and saves at shutdown. Used standalone, it initializes lazily.
With a fallback extractor, unsupported schemas and generation failures degrade to it. Infrastructure errors such as credentials, HTTP, and usage limits propagate.
See the PydanticAiHtmlExtractor protocol for the common extractor interface, and PydanticAiDirectExtractor for a per-page variant with no selector cache.
Usage
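As a rough picture of the fallback rule (generation failures degrade to the fallback, infrastructure errors propagate), consider the sketch below. The exception classes and the `extract_with_fallback` helper are hypothetical and are not the library's types. Re-raising when no fallback is configured is also an assumption.

```python
class GenerationError(Exception):
    """Hypothetical: the model could not produce working selectors."""


class UsageLimitError(Exception):
    """Hypothetical: an infrastructure-level failure such as an exhausted budget."""


def extract_with_fallback(primary, fallback, content):
    try:
        return primary(content)
    except GenerationError:
        # Only generation failures degrade; anything else is left to propagate.
        if fallback is None:
            raise  # assumption: with no fallback configured, the error surfaces
        return fallback(content)


def failing_selectors(content):
    raise GenerationError("no selector map validated")


def over_budget(content):
    raise UsageLimitError("token budget exhausted")


def direct(content):
    return {"title": "from fallback"}


degraded = extract_with_fallback(failing_selectors, direct, "<html/>")

try:
    extract_with_fallback(over_budget, direct, "<html/>")
    propagated = False
except UsageLimitError:
    propagated = True
```

The first call returns the fallback's result, while the usage-limit error bypasses the fallback entirely.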