Version: Next

BasePydanticAiHtmlExtractor

Base class for the built-in HTML extractors.

An HTML extractor turns a page into a validated Pydantic model with the help of an LLM. This abstract base implements the parts the built-in extractors share: resolving the model, composing the task instructions with the distiller's prompt notes, and accumulating token usage.

The public interface is the PydanticAiHtmlExtractor protocol. The concrete extractors are PydanticAiDirectExtractor and PydanticAiSelectorExtractor.

Methods

__init__

  • __init__(model, *, distiller, instructions, usage_limits): None
  • Initialize a new instance.


    Parameters

    • model: str | Model

      A provider-prefixed name (e.g. 'openai:gpt-5.4-nano') or a pydantic-ai Model. Credentials are read from the provider's environment variable (e.g. OPENAI_API_KEY) or passed explicitly through a Model instance.

    • keyword-only distiller: PydanticAiHtmlDistiller

      The HTML distiller shaping the LLM input.

    • keyword-only instructions: str

      Base task instructions. The distiller's prompt notes are appended automatically.

    • keyword-only usage_limits: UsageLimits | None

      Optional pydantic-ai UsageLimits applied to every single run.

    Returns None
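The way the base instructions and the distiller's prompt notes combine can be pictured with a minimal sketch. The helper name `compose_instructions`, the blank-line separator, and the representation of prompt notes as a list of strings are all assumptions made for illustration; the source only states that the notes are appended automatically.

```python
from typing import Iterable


def compose_instructions(base: str, prompt_notes: Iterable[str]) -> str:
    """Join base task instructions with prompt notes (hypothetical helper)."""
    parts = [base.strip()]
    # Skip empty notes so the final prompt has no blank sections.
    parts.extend(note.strip() for note in prompt_notes if note.strip())
    # Assumption: sections are separated by a blank line.
    return "\n\n".join(parts)


prompt = compose_instructions(
    "Extract the product shown on the page.",
    ["Links are rendered as [text](url).", ""],
)
print(prompt)
```

Under these assumptions, the caller only supplies the task-specific text, and anything the distiller needs the model to know about its output format travels with the distiller itself.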

extract

  • async extract(content, schema, *, scope, cache_tag, additional_instructions): TSchema
  • Extract a structured instance of schema from content.


    Parameters

    • content: str | Selector
    • schema: type[TSchema]
    • optional keyword-only scope: str | None = None
    • optional keyword-only cache_tag: str | None = None
    • optional keyword-only additional_instructions: str | None = None

    Returns TSchema
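The call shape of `extract` can be illustrated with a stand-in that mirrors the documented signature. `FakeExtractor` and `Product` are hypothetical: no LLM is called, the result is canned, and a dataclass stands in for the Pydantic model a real extractor would validate. The optional keyword arguments are accepted but not interpreted here, since the source does not describe their semantics.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional, Type, TypeVar

TSchema = TypeVar("TSchema")


@dataclass
class Product:
    """Stand-in for a Pydantic model used as the target schema."""
    title: str
    price: float


class FakeExtractor:
    """Hypothetical stand-in with the documented extract signature."""

    def __init__(self, canned: Dict[str, Any]) -> None:
        self._canned = canned

    async def extract(
        self,
        content: str,
        schema: Type[TSchema],
        *,
        scope: Optional[str] = None,
        cache_tag: Optional[str] = None,
        additional_instructions: Optional[str] = None,
    ) -> TSchema:
        # A real extractor would distill `content` and query the model;
        # this sketch just builds the schema from canned field values.
        return schema(**self._canned)


extractor = FakeExtractor({"title": "Example", "price": 9.99})
result = asyncio.run(
    extractor.extract(
        "<html><h1>Example</h1></html>",
        Product,
        additional_instructions="Prices are in EUR.",
    )
)
print(result)
```

The first two arguments are positional, and everything after them must be passed by keyword, which matches the `*` in the signature above.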

set_ai_usage

  • set_ai_usage(value): None
  • Replace the usage accumulator with value.

    Lets an external owner share one accumulator across a delegation chain.


    Parameters

    • value

      The usage accumulator that replaces the current one.

    Returns None
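A minimal sketch of why replacing the accumulator is useful: when two extractors hold the same object, usage from both ends up in one place. The `Usage` dataclass, `SketchExtractor`, and `record_run` are hypothetical stand-ins; the real accumulator type and the way runs are recorded are not specified in this page.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for the usage accumulator type."""
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


class SketchExtractor:
    """Hypothetical extractor exposing ai_usage and set_ai_usage."""

    def __init__(self) -> None:
        self._ai_usage = Usage()

    @property
    def ai_usage(self) -> Usage:
        return self._ai_usage

    def set_ai_usage(self, value: Usage) -> None:
        # Replace (not merge) the accumulator, as the description states.
        self._ai_usage = value

    def record_run(self, input_tokens: int, output_tokens: int) -> None:
        # Assumption: each run adds its token counts to the accumulator.
        self._ai_usage.add(input_tokens, output_tokens)


shared = Usage()
outer, inner = SketchExtractor(), SketchExtractor()
outer.set_ai_usage(shared)
inner.set_ai_usage(shared)

outer.record_run(100, 20)
inner.record_run(50, 5)
print(shared)
```

Because both extractors reference the same object, the owner can read the combined total from `shared` without summing per-extractor values.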

Properties

ai_usage

Accumulated token usage of this extractor's runs.
