Version: Next

PydanticAiSelectorExtractor

Extractor that learns reusable CSS selectors and reuses them without further LLM calls.

On each call it first tries the cached selector maps and extracts with no LLM call when one fits. On a miss it asks the model for a new map, validates it against the live page, and caches it. A bucket keeps several maps, so A/B-tested markup variants can coexist.

The cache is a RecoverableState persisted to a KeyValueStore. As an async context manager it loads at startup and saves at shutdown. Used standalone, it initializes lazily.
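As a rough illustration of the lifecycle above, the sketch below loads state on entering an async context, saves it on exit, and falls back to lazy loading on first use otherwise. Everything in it is hypothetical: `STORE` is an in-memory dict standing in for a key-value store, and `CachedState` is not the library's RecoverableState.

```python
import asyncio

# Hypothetical in-memory stand-in for a key-value store record.
STORE: dict[str, dict] = {'AI-SELECTORS': {'product': [{'name': 'h1.title'}]}}


class CachedState:
    """Toy state holder: eager load in a context manager, lazy load otherwise."""

    def __init__(self, key: str) -> None:
        self._key = key
        self._cache: dict | None = None  # None means "not loaded yet"

    async def _ensure_loaded(self) -> dict:
        if self._cache is None:
            # Copy the outer dict so the store only changes on an explicit save.
            self._cache = dict(STORE.get(self._key, {}))
        return self._cache

    async def __aenter__(self) -> 'CachedState':
        await self._ensure_loaded()  # load at startup
        return self

    async def __aexit__(self, exc_type, exc_value, exc_traceback) -> None:
        if self._cache is not None:
            STORE[self._key] = self._cache  # save at shutdown

    async def remember(self, tag: str, selector_map: dict) -> None:
        cache = await self._ensure_loaded()  # lazy path for standalone use
        cache.setdefault(tag, []).append(selector_map)


async def main() -> tuple[bool, bool, bool]:
    standalone = CachedState('AI-SELECTORS')
    unloaded_before_use = standalone._cache is None
    await standalone.remember('article', {'title': 'h1'})
    loaded_after_use = standalone._cache is not None

    async with CachedState('AI-SELECTORS') as state:
        await state.remember('listing', {'item': 'li.card'})
    saved_on_exit = 'listing' in STORE['AI-SELECTORS']
    return unloaded_before_use, loaded_after_use, saved_on_exit


result = asyncio.run(main())
```

The point of the pattern is that callers who skip the context manager still get a working cache, while callers who use it get deterministic load and save points.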

With a fallback extractor, unsupported schemas and generation failures degrade to it. Infrastructure errors such as credentials, HTTP, and usage limits propagate.
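One way to picture this split between degradable and propagating errors is the sketch below. The exception classes and the `extract_with_fallback` helper are hypothetical names chosen for illustration; the real exception types are not listed on this page. Re-raising when no fallback is configured is also an assumption.

```python
class UnsupportedSchemaError(Exception):
    """Hypothetical: the schema shape cannot be expressed as selectors."""


class GenerationError(Exception):
    """Hypothetical: the model failed to produce working selectors."""


class InfrastructureError(Exception):
    """Hypothetical: credentials, HTTP, or usage-limit problems."""


def extract_with_fallback(primary, fallback, page):
    """Run primary; degrade to fallback only for extraction-level failures."""
    try:
        return primary(page)
    except (UnsupportedSchemaError, GenerationError):
        if fallback is None:
            raise  # assumption: with no fallback, the failure surfaces
        return fallback(page)
    # InfrastructureError is deliberately not caught, so it propagates.


def selector_extract_unsupported(page):
    raise UnsupportedSchemaError('schema shape not supported')


def selector_extract_no_credentials(page):
    raise InfrastructureError('missing API key')


def direct_extract(page):
    return {'name': 'from fallback'}


degraded = extract_with_fallback(
    selector_extract_unsupported, direct_extract, '<html>...</html>'
)

try:
    extract_with_fallback(
        selector_extract_no_credentials, direct_extract, '<html>...</html>'
    )
    propagated = False
except InfrastructureError:
    propagated = True
```

Keeping infrastructure errors out of the fallback path avoids masking a misconfiguration by silently retrying it through a second extractor that would hit the same problem.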

See the PydanticAiHtmlExtractor protocol for the common extractor interface, and PydanticAiDirectExtractor for a per-page variant with no selector cache.

Usage

from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import PydanticAiDirectExtractor, PydanticAiSelectorExtractor


class Product(BaseModel):
    name: str
    price: str | None


model = OpenAIChatModel('gpt-5.4-nano', provider=OpenAIProvider(api_key='...'))
extractor = PydanticAiSelectorExtractor(model=model, fallback=PydanticAiDirectExtractor(model=model))
product = await extractor.extract('<html>...</html>', Product, cache_tag='product')

Methods

__aenter__

__aexit__

  • async __aexit__(exc_type, exc_value, exc_traceback): None
  • Persist the selector cache one final time, detach from events, and exit the fallback chain.


    Parameters

    • exc_type: type[BaseException] | None
    • exc_value: BaseException | None
    • exc_traceback: TracebackType | None

    Returns None

__init__

  • __init__(model, *, kvs_cache_key, distiller, instructions, retries, max_variants, fallback, usage_limits, persistence): None
  • Initialize a new instance.


    Parameters

    • model: str | Model

      A provider-prefixed name (e.g. 'openai:gpt-5.4-nano') or a pydantic-ai Model.

    • optional keyword-only kvs_cache_key: str | None = None

      Name of the KeyValueStore record holding the selector cache. Defaults to 'AI-SELECTORS'.

    • optional keyword-only distiller: PydanticAiHtmlDistiller | None = None

      The HTML distiller shaping the LLM input. Defaults to PydanticAiSkeletonDistiller.

    • optional keyword-only instructions: str = _SELECTOR_INSTRUCTIONS

      Base selector-generation instructions. The distiller's prompt notes are appended automatically.

    • optional keyword-only retries: int = 3

      How many times the model may fix failing selectors within one generation.

    • optional keyword-only max_variants: int = 5

      Cap on cached selector maps per bucket.

    • optional keyword-only fallback: PydanticAiHtmlExtractor | None = None

      Extractor to degrade to when generation fails or the schema shape is unsupported.

    • optional keyword-only usage_limits: UsageLimits | None = None

      Optional pydantic-ai UsageLimits applied to every generation run.

    • optional keyword-only persistence: bool = True

      Whether the selector cache is persisted. Disable for ephemeral runs or tests.

    Returns None
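    To make the max_variants cap concrete, here is a small sketch of a bounded bucket. The eviction policy is not stated on this page; the sketch assumes oldest-first eviction purely for illustration, using a standard-library deque rather than anything from the extractor itself.

```python
from collections import deque

MAX_VARIANTS = 5  # mirrors the documented default of max_variants

# Assumption: oldest-first eviction. deque(maxlen=...) silently drops items
# from the left once the cap is reached.
bucket: deque[dict[str, str]] = deque(maxlen=MAX_VARIANTS)

# Seven distinct markup variants arrive over time.
for i in range(7):
    bucket.append({'name': f'h1.variant-{i}'})

kept = [selector_map['name'] for selector_map in bucket]
```

    Under that assumption, only the five most recent maps remain, which is why a bucket shared by many unrelated page kinds would keep discarding selectors that are still useful.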

extract

  • async extract(content, schema, *, scope, cache_tag, additional_instructions): TSchema
  • Extract schema from content using cached or freshly generated selectors.


    Parameters

    • content: str | Selector

      Raw HTML or a parsed Parsel Selector.

    • schema: type[TSchema]

      The Pydantic model describing the desired output.

    • optional keyword-only scope: str | None = None

      Optional CSS selector restricting extraction to the first matching subtree.

    • optional keyword-only cache_tag: str | None = None

      Optional tag identifying the page kind. Selectors are cached per (schema, scope, cache_tag). Leaving the tag at its None default puts unlike pages that share a schema and scope into the same bucket, which fills that bucket quickly.

    • optional keyword-only additional_instructions: str | None = None

      Extra instructions appended for this call only.

    Returns TSchema

set_ai_usage

  • set_ai_usage(value): None

Properties

active

active: bool

Whether the extractor is in its async context-manager scope.

ai_usage

Accumulated token usage of this extractor's runs.