PydanticAiHtmlExtractor
Methods
extract
Extract a structured instance of schema from content.
Parameters
content: str | Selector
Raw HTML or a parsed Parsel Selector. A Selector is the fast path: the crawler passes its live parsed tree directly and skips a re-parse. Treat it as read-only, since the user handler shares it.
schema: type[TSchema]
The Pydantic model describing the desired output.
scope: str | None = None (optional, keyword-only)
Optional CSS selector. Extraction is restricted to the first matching subtree. A scope that matches nothing raises an error.
cache_tag: str | None = None (optional, keyword-only)
Optional tag for caching implementations. Selectors are bucketed per tag, so one schema can serve several page kinds without competing. The crawler usually passes request.label. Implementations without caching ignore it.
additional_instructions: str | None = None (optional, keyword-only)
Extra instructions for this call only. They are appended to the base instructions rather than replacing them. Use them for page specifics (e.g. 'the price is the discounted one, not the list price').
Returns TSchema
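The cache_tag and additional_instructions behaviour described above can be sketched without an LLM. This is a minimal illustration under stated assumptions, not the library's implementation: SelectorCache, build_instructions, and the Product stand-in are hypothetical names, and a real caching extractor may store its selectors differently.

```python
from typing import Dict, Optional, Tuple


class SelectorCache:
    """Hypothetical cache that buckets selectors per (schema, cache_tag)."""

    def __init__(self) -> None:
        self._buckets: Dict[Tuple[type, Optional[str]], Dict[str, str]] = {}

    def get(self, schema: type, cache_tag: Optional[str]) -> Optional[Dict[str, str]]:
        # A miss returns None; the caller would then ask the LLM for selectors.
        return self._buckets.get((schema, cache_tag))

    def put(self, schema: type, cache_tag: Optional[str], selectors: Dict[str, str]) -> None:
        self._buckets[(schema, cache_tag)] = selectors


def build_instructions(base: str, additional: Optional[str]) -> str:
    """Append per-call instructions to the base ones; the base is never dropped."""
    if additional is None:
        return base
    return base + "\n\n" + additional


class Product:
    """Stand-in for a Pydantic model; only its identity matters here."""


cache = SelectorCache()
# The same schema is used on two page kinds, each with its own selectors.
cache.put(Product, "listing", {"price": ".card .price"})
cache.put(Product, "detail", {"price": "#price"})
```

Because the tag is part of the bucket key, selectors learned on listing pages never overwrite those learned on detail pages, even though both target the same schema.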
set_ai_usage
Replace the usage accumulator with value.
Lets an external owner share one accumulator across a delegation chain. PydanticAiSelectorExtractor uses this to fold its fallback's usage into one accumulator. Extractors with per-instance counters may make it a no-op.
Parameters
value: PydanticAiUsageStats
The accumulator to adopt.
Returns None
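The delegation pattern described for set_ai_usage can be sketched as follows. Every name here is a hypothetical stand-in: UsageStats and its field names are assumptions rather than the real PydanticAiUsageStats, and the token counts are invented to make the sharing visible.

```python
from dataclasses import dataclass


@dataclass
class UsageStats:
    """Hypothetical accumulator; the real field names may differ."""
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


class FallbackExtractor:
    """Stand-in for an extractor that does the expensive work."""

    def __init__(self) -> None:
        self._usage = UsageStats()

    @property
    def ai_usage(self) -> UsageStats:
        return self._usage

    def set_ai_usage(self, value: UsageStats) -> None:
        # Adopt the caller's accumulator instead of keeping a private one.
        self._usage = value

    def extract(self, content: str) -> str:
        # Pretend LLM call: charge tokens proportional to the input size.
        self._usage.add(len(content), 5)
        return content.upper()


class DelegatingExtractor:
    """Stand-in for an extractor that falls back to another one."""

    def __init__(self, fallback: FallbackExtractor) -> None:
        self._usage = UsageStats()
        self._fallback = fallback
        # Share one accumulator so fallback usage is folded into ours.
        fallback.set_ai_usage(self._usage)

    @property
    def ai_usage(self) -> UsageStats:
        return self._usage

    def extract(self, content: str) -> str:
        self._usage.add(3, 1)  # this extractor's own (invented) cost
        return self._fallback.extract(content)
```

After one call, the outer extractor's ai_usage reflects both its own cost and the fallback's, because both objects write to the same accumulator.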
Properties
ai_usage
Accumulated token usage across extraction calls.
Interface for HTML extractors.
An extractor turns an HTML page into a validated Pydantic model using an LLM. The input format (cleaned HTML, skeleton, Markdown, ...) is decided by the PydanticAiHtmlDistiller that an implementation composes. The model and base instructions are set at construction. Each extract call runs one extraction. The built-in extractors are PydanticAiDirectExtractor and PydanticAiSelectorExtractor.
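For readers writing their own implementation, the shape of the interface documented on this page can be approximated with a structural type. This is a sketch under assumptions, not the library's definition: HtmlExtractorLike and CannedExtractor are hypothetical names, content is narrowed to str to avoid a Parsel dependency, a dataclass stands in for a Pydantic model, and the usage type is left as Any.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol, Type, TypeVar, runtime_checkable

TSchema = TypeVar("TSchema")


@runtime_checkable
class HtmlExtractorLike(Protocol):
    """Hypothetical structural mirror of the documented members."""

    @property
    def ai_usage(self) -> Any: ...

    def set_ai_usage(self, value: Any) -> None: ...

    def extract(
        self,
        content: str,
        schema: Type[TSchema],
        *,
        scope: Optional[str] = None,
        cache_tag: Optional[str] = None,
        additional_instructions: Optional[str] = None,
    ) -> TSchema: ...


class CannedExtractor:
    """Toy implementation with no LLM: builds the schema from fixed fields."""

    def __init__(self, **fields: Any) -> None:
        self._fields = fields
        self._usage = {"requests": 0}

    @property
    def ai_usage(self) -> Any:
        return self._usage

    def set_ai_usage(self, value: Any) -> None:
        self._usage = value

    def extract(self, content, schema, *, scope=None, cache_tag=None,
                additional_instructions=None):
        # One extract call runs one extraction and records it.
        self._usage["requests"] += 1
        return schema(**self._fields)


@dataclass
class Title:
    """Stand-in for a Pydantic model."""
    text: str


extractor = CannedExtractor(text="Hello")
result = extractor.extract("<h1>Hello</h1>", Title, cache_tag="detail")
```

A toy like CannedExtractor can be handy in tests, where a handler needs something extractor-shaped but no model call should happen.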