Version: Next

PydanticAiHtmlExtractor

Interface for HTML extractors.

An extractor turns an HTML page into a validated Pydantic model using an LLM. The input format the model sees (cleaned HTML, skeleton, Markdown, ...) is decided by the PydanticAiHtmlDistiller that the implementation composes. The model and base instructions are fixed at construction, and each extract call runs exactly one extraction. The built-in extractors are PydanticAiDirectExtractor and PydanticAiSelectorExtractor.
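
The shape of the interface can be sketched with a typing.Protocol. This is only an illustration built from the members listed on this page: the names HtmlExtractorLike and NullExtractor are hypothetical, the real class may well be an abstract base class rather than a Protocol, and Any stands in for the Selector and PydanticAiUsageStats types.

```python
from typing import Any, Optional, Protocol, TypeVar, runtime_checkable

TSchema = TypeVar("TSchema")


@runtime_checkable
class HtmlExtractorLike(Protocol):
    """Hypothetical structural view of the members documented on this page."""

    @property
    def ai_usage(self) -> Any: ...

    def set_ai_usage(self, value: Any) -> None: ...

    async def extract(
        self,
        content: Any,
        schema: type[TSchema],
        *,
        scope: Optional[str] = None,
        cache_tag: Optional[str] = None,
        additional_instructions: Optional[str] = None,
    ) -> TSchema: ...


class NullExtractor:
    """Hypothetical implementation that satisfies the shape without an LLM."""

    def __init__(self) -> None:
        self._usage: Any = None

    @property
    def ai_usage(self) -> Any:
        return self._usage

    def set_ai_usage(self, value: Any) -> None:
        self._usage = value

    async def extract(self, content, schema, *, scope=None, cache_tag=None,
                      additional_instructions=None):
        # Builds an empty instance; a real extractor would query the model here.
        return schema()


def conforms(candidate: object) -> bool:
    # runtime_checkable only verifies that the members exist, not their signatures.
    return isinstance(candidate, HtmlExtractorLike)
```

A structural check like this is handy in tests that swap in a fake extractor, but it says nothing about whether the fake honors the documented semantics.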

Index

Methods
- extract
- set_ai_usage

Properties
- ai_usage

Methods

extract

async extract(content, schema, *, scope, cache_tag, additional_instructions): TSchema

Extract a structured instance of schema from content.
Parameters
- content: str | Selector
  Raw HTML or a parsed Parsel Selector. A Selector is the fast path. The crawler passes its live parsed tree directly and skips a re-parse. Treat it as read-only, since the user handler shares it.
- schema: type[TSchema]
  The Pydantic model describing the desired output.
- scope: str | None = None (optional, keyword-only)
  Optional CSS selector. Extraction is restricted to the first matching subtree. A scope that matches nothing raises an error.
- cache_tag: str | None = None (optional, keyword-only)
  Optional tag for caching implementations. Selectors are bucketed per tag, so one schema can serve several page kinds without competing. The crawler usually passes request.label. Implementations without caching ignore it.
- additional_instructions: str | None = None (optional, keyword-only)
  Extra instructions for this call only. They are appended to the base instructions rather than replacing them. Use them for page specifics (e.g. 'the price is the discounted one, not the list price').
Returns TSchema
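
A call following the parameter list above might look like the sketch below. It does not import the real library: StubExtractor is a hypothetical stand-in with the documented keyword-only signature and no LLM behind it, Product is a dataclass standing in for a Pydantic model, and the returned values are hard-coded. The part it is meant to show is how per-call instructions are appended to the base instructions.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, TypeVar

TSchema = TypeVar("TSchema")


@dataclass
class Product:
    # Stand-in for the schema; a real one would subclass pydantic.BaseModel.
    name: str
    price: float


class StubExtractor:
    """Hypothetical extractor: same call shape as documented, no LLM behind it."""

    def __init__(self, instructions: str) -> None:
        self.instructions = instructions  # base instructions, fixed at construction
        self.last_prompt = ""

    async def extract(
        self,
        content: str,
        schema: type[TSchema],
        *,
        scope: Optional[str] = None,
        cache_tag: Optional[str] = None,
        additional_instructions: Optional[str] = None,
    ) -> TSchema:
        # Per-call instructions are appended to the base ones, never substituted.
        parts = [self.instructions]
        if additional_instructions:
            parts.append(additional_instructions)
        self.last_prompt = "\n".join(parts)
        # A real extractor would distill `content`, honor `scope`, and query the model.
        return schema(name="Widget", price=9.99)


async def main() -> tuple[Product, str]:
    extractor = StubExtractor("Extract product data from the page.")
    product = await extractor.extract(
        "<div class='product'><h1>Widget</h1><span>9.99</span></div>",
        Product,
        scope="div.product",
        cache_tag="PRODUCT",
        additional_instructions="The price is the discounted one, not the list price.",
    )
    return product, extractor.last_prompt


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because scope, cache_tag, and additional_instructions are keyword-only, passing them positionally raises a TypeError, which keeps call sites readable as the parameter list grows.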

set_ai_usage

set_ai_usage(value): None

Replace the usage accumulator with value.

Lets an external owner share one accumulator across a delegation chain. PydanticAiSelectorExtractor uses this to fold its fallback's usage into one accumulator. Extractors with per-instance counters may make it a no-op.
Parameters
- value: PydanticAiUsageStats
  The accumulator to adopt.
Returns None
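
The delegation pattern described above can be sketched as follows. Everything here is a stand-in: UsageStats and its two fields are hypothetical substitutes for PydanticAiUsageStats, and record is an invented helper that simulates the token counts of one model call. The sketch only illustrates how forwarding set_ai_usage to a fallback makes both extractors write into one accumulator.

```python
from dataclasses import dataclass


@dataclass
class UsageStats:
    # Hypothetical stand-in for PydanticAiUsageStats; real field names may differ.
    input_tokens: int = 0
    output_tokens: int = 0


class FallbackExtractor:
    """Hypothetical extractor that owns a usage accumulator."""

    def __init__(self) -> None:
        self._usage = UsageStats()

    @property
    def ai_usage(self) -> UsageStats:
        return self._usage

    def set_ai_usage(self, value: UsageStats) -> None:
        # Adopt the caller's accumulator instead of the private one.
        self._usage = value

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Invented helper: simulates the usage reported by one model call.
        self._usage.input_tokens += input_tokens
        self._usage.output_tokens += output_tokens


class DelegatingExtractor(FallbackExtractor):
    """Hypothetical extractor that delegates to a fallback and shares usage with it."""

    def __init__(self, fallback: FallbackExtractor) -> None:
        super().__init__()
        self._fallback = fallback
        self._fallback.set_ai_usage(self._usage)

    def set_ai_usage(self, value: UsageStats) -> None:
        # Forward the new accumulator down the chain so totals stay in one place.
        super().set_ai_usage(value)
        self._fallback.set_ai_usage(value)


def demo() -> UsageStats:
    fallback = FallbackExtractor()
    primary = DelegatingExtractor(fallback)
    shared = UsageStats()
    primary.set_ai_usage(shared)
    primary.record(100, 20)   # usage from the primary's own call
    fallback.record(300, 50)  # usage from the fallback's call
    return shared
```

With this arrangement an external owner reads one set of totals from the shared object, regardless of which extractor in the chain actually made the model call.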

Properties

ai_usage

ai_usage: PydanticAiUsageStats

Accumulated token usage across extraction calls.
