Version: Next

PydanticAiHtmlExtractor

Interface for HTML extractors.

An extractor turns an HTML page into a validated Pydantic model using an LLM. The input format (cleaned HTML, skeleton, Markdown, ...) is decided by the PydanticAiHtmlDistiller that an implementation composes. The model and base instructions are set at construction, and each extract call runs one extraction. The built-in extractors are PydanticAiDirectExtractor and PydanticAiSelectorExtractor.

Index

Methods

extract

  • async extract(content, schema, *, scope, cache_tag, additional_instructions): TSchema
  • Extract a structured instance of schema from content.


    Parameters

    • content: str | Selector

      Raw HTML or a parsed Parsel Selector. A Selector is the fast path: the crawler passes its live parsed tree directly, which skips a re-parse. Treat it as read-only, since the user handler shares the same tree.

    • schema: type[TSchema]

      The Pydantic model describing the desired output.

    • scope: str | None = None (optional, keyword-only)

      Optional CSS selector. Extraction is restricted to the first matching subtree. A scope that matches nothing raises an error.

    • cache_tag: str | None = None (optional, keyword-only)

      Optional tag for caching implementations. Cached selectors are bucketed per tag, so one schema can serve several kinds of pages without their cache entries competing. The crawler usually passes request.label. Implementations without caching ignore it.

    • additional_instructions: str | None = None (optional, keyword-only)

      Extra instructions for this call only. They are appended to the base instructions rather than replacing them. Use them for page specifics (e.g. 'the price is the discounted one, not the list price').

    Returns TSchema
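
The cache_tag and additional_instructions semantics described above can be illustrated with a small standalone sketch. It does not use the library itself: compose_instructions, SelectorCache, and the Product stand-in are hypothetical names invented here, and the blank-line separator between base and extra instructions is an assumption, not documented behaviour.

```python
from __future__ import annotations


def compose_instructions(base: str, additional: str | None = None) -> str:
    """Append per-call instructions to the base instructions (never replace them)."""
    if not additional:
        return base
    # Assumption: a blank line separates the two parts in this sketch.
    return f"{base}\n\n{additional}"


class SelectorCache:
    """Hypothetical cache that keeps one bucket per (schema, cache_tag) pair."""

    def __init__(self) -> None:
        self._buckets: dict[tuple[type, str | None], dict[str, str]] = {}

    def get(self, schema: type, cache_tag: str | None) -> dict[str, str] | None:
        # A miss returns None; a real implementation would then learn selectors.
        return self._buckets.get((schema, cache_tag))

    def put(self, schema: type, cache_tag: str | None, selectors: dict[str, str]) -> None:
        self._buckets[(schema, cache_tag)] = selectors


class Product:
    """Stand-in for a Pydantic model; only its identity is used as a cache key."""


cache = SelectorCache()
# The same schema serves two page kinds; each tag gets its own selectors.
cache.put(Product, "detail", {"price": "span.price"})
cache.put(Product, "listing", {"price": "li.item .amount"})

prompt = compose_instructions(
    "Extract the product.",
    "The price is the discounted one, not the list price.",
)
```

Keying the bucket on both the schema and the tag is what keeps a detail page and a listing page from overwriting each other's selectors, which is presumably why the crawler passes request.label as the tag.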

set_ai_usage

  • set_ai_usage(value): None
  • Replace the usage accumulator with value.

    Lets an external owner share one accumulator across a delegation chain. PydanticAiSelectorExtractor uses this to fold its fallback's usage into one accumulator. Extractors with per-instance counters may make it a no-op.


    Parameters

    • value

      The usage accumulator that replaces the current one.

    Returns None
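
As a rough illustration of sharing one accumulator across a delegation chain, the sketch below wires a wrapping extractor to its fallback. Usage, ToyDirectExtractor, ToyFallbackChain, and record_call are hypothetical stand-ins written for this example; only the set_ai_usage and ai_usage names come from the interface described here.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical token accumulator used only in this sketch."""

    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


class ToyDirectExtractor:
    def __init__(self) -> None:
        self._ai_usage = Usage()

    @property
    def ai_usage(self) -> Usage:
        return self._ai_usage

    def set_ai_usage(self, value: Usage) -> None:
        # Replace the accumulator so an external owner can share it.
        self._ai_usage = value

    def record_call(self, input_tokens: int, output_tokens: int) -> None:
        # Stands in for the bookkeeping done after one LLM call.
        self._ai_usage.add(input_tokens, output_tokens)


class ToyFallbackChain(ToyDirectExtractor):
    """Wraps a fallback extractor and folds its usage into its own accumulator."""

    def __init__(self, fallback: ToyDirectExtractor) -> None:
        super().__init__()
        self._fallback = fallback
        fallback.set_ai_usage(self._ai_usage)

    def set_ai_usage(self, value: Usage) -> None:
        # Propagate down the chain so every link writes to the same object.
        super().set_ai_usage(value)
        self._fallback.set_ai_usage(value)


fallback = ToyDirectExtractor()
chain = ToyFallbackChain(fallback)
fallback.record_call(100, 20)  # usage incurred by the fallback
chain.record_call(10, 5)       # usage incurred by the wrapper itself
```

Because both objects hold the same Usage instance, reading chain.ai_usage reports the combined total without any explicit merging step.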

Properties

ai_usage

Accumulated token usage across extraction calls.
