PydanticAiSkeletonDistiller
Hierarchy
- PydanticAiCleanHtmlDistiller
- PydanticAiSkeletonDistiller
Index
Methods
__init__
Initialize a new instance.
Parameters
cleaner: Cleaner | None = None (optional, keyword-only)
A custom lxml_html_clean.Cleaner.
max_text_len: int = 50 (optional, keyword-only)
Cap on a text node, in characters.
max_json_len: int | None = 1_000 (optional, keyword-only)
Cap on JSON payload length, or None to keep in full.
keep_siblings: int = 3 (optional, keyword-only)
How many leading siblings to keep when a repeated run is collapsed.
max_classes: int = 5 (optional, keyword-only)
How many class tokens to keep per element.
max_attr_len: int = 100 (optional, keyword-only)
Cap on attribute value length, in characters.
keep_head: bool = True (optional, keyword-only)
Whether to keep a reduced <head>.
max_size: int | None = 60_000 (optional, keyword-only)
Hard cap on the skeleton, in characters. If the skeleton exceeds the cap, it is first re-distilled with tighter settings. If the result is still too big, the tail is dropped and replaced with the truncation marker.
pretty: bool = False (optional, keyword-only)
Whether to pretty-print the serialized HTML.
prompt_notes: str | None = _SKELETON_PROMPT_NOTES (optional, keyword-only)
Override for the default prompt notes. Pass None to send no notes to the LLM.
Returns None
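The fallback order described for max_size (normal pass, then a tighter re-distillation, then truncation) can be sketched as below. This is an illustration only, not the library's implementation: cap_skeleton, toy_distill, the tight_text_len value, and the marker text are all hypothetical names chosen for the example, and counting the marker toward the cap is an assumption.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical marker text; the real truncation marker is not shown in this page.
TRUNCATION_MARKER = "<!-- truncated -->"


def toy_distill(html: str, max_text_len: int) -> str:
    """Stand-in distiller: trims every line to `max_text_len` characters."""
    return "\n".join(line[:max_text_len] for line in html.splitlines())


def cap_skeleton(
    html: str,
    distill: Callable[[str, int], str],
    max_size: int | None,
    max_text_len: int = 50,
    tight_text_len: int = 10,  # assumed "tighter" setting for the second pass
) -> str:
    """Apply a hard size cap, preferring re-distillation over cutting."""
    skeleton = distill(html, max_text_len)
    if max_size is None or len(skeleton) <= max_size:
        return skeleton

    # Step 1: try again with tighter settings before cutting anything.
    skeleton = distill(html, tight_text_len)
    if len(skeleton) <= max_size:
        return skeleton

    # Step 2 (last resort): drop the tail and append the marker.
    # Assumption: the marker itself counts toward the cap.
    keep = max(max_size - len(TRUNCATION_MARKER), 0)
    return skeleton[:keep] + TRUNCATION_MARKER


page = "\n".join(["<p>" + "x" * 60 + "</p>"] * 3)

fits_first_pass = cap_skeleton(page, toy_distill, max_size=200)
fits_after_tightening = cap_skeleton(page, toy_distill, max_size=100)
needs_truncation = cap_skeleton(page, toy_distill, max_size=20)
```

In this sketch only the third call reaches the truncation step; the second call stays under the cap once the tighter pass has run, which mirrors the idea that cutting the output is the last resort.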
distill
Convert raw HTML to the cleaned, structure-preserving representation.
Parameters
html: str
The raw HTML markup.
Returns str
get_prompt_notes
Return the configured prompt notes, or None when not set.
Returns str | None
Distiller that produces a DOM skeleton used to ask an LLM for CSS selectors.
The skeleton is built from the page by removing nodes, attributes, and class tokens, or by truncating text. It never renames or re-parents elements. So any selector the LLM builds from the skeleton also matches the original page.
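A small sketch of why removal-only reductions keep selectors valid. It does not use the distiller itself: toy_skeleton is a hypothetical stand-in built on the standard library, the omission comment text is invented, and ElementTree path expressions stand in for CSS selectors. The point is only that a query written against the reduced tree still matches the original tree, because no element was renamed or moved.

```python
import copy
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <ul id="products">
    <li class="item"><span class="price">19.99 EUR, free shipping on orders over 50 EUR</span></li>
    <li class="item"><span class="price">24.50 EUR</span></li>
    <li class="item"><span class="price">7.00 EUR</span></li>
    <li class="item"><span class="price">12.00 EUR</span></li>
  </ul>
</body></html>
"""


def toy_skeleton(root: ET.Element, max_text_len: int = 10, keep_siblings: int = 2) -> ET.Element:
    """Hypothetical reduction: truncate text and drop trailing repeated siblings."""
    skeleton = copy.deepcopy(root)
    # Snapshot the nodes first so the tree is not mutated while it is being walked.
    for parent in list(skeleton.iter()):
        children = list(parent)
        # Treat children as a repeated run only when they all share one tag.
        if len(children) > keep_siblings and len({c.tag for c in children}) == 1:
            for extra in children[keep_siblings:]:
                parent.remove(extra)
            omitted = len(children) - keep_siblings
            parent.append(ET.Comment(f" {omitted} similar items omitted "))
        if parent.text and len(parent.text) > max_text_len:
            parent.text = parent.text[:max_text_len]
    return skeleton


original = ET.fromstring(PAGE)
skeleton = toy_skeleton(original)

# A query written while looking only at the skeleton...
selector = ".//ul[@id='products']/li[@class='item']/span[@class='price']"
skeleton_hits = skeleton.findall(selector)

# ...matches the original page as well, and finds the items the skeleton dropped.
original_hits = original.findall(selector)
```

Here the skeleton yields two matches with shortened text, while the same query on the original page yields all four items with their full text.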
This is the default distiller for PydanticAiSelectorExtractor. See PydanticAiCleanHtmlDistiller for the direct-extraction variant that keeps the full page text.
On top of the base cleaning:
- Text nodes are truncated to max_text_len, so the model sees samples rather than full content.
- JSON payloads are truncated to max_json_len, so only their key structure reaches the model.
- Repeated sibling runs are collapsed to the first keep_siblings items plus a comment marker. Siblings with a distinct identity attribute (name, property, itemprop, ...) are kept, since a run of <meta> tags is not a repeating template.
- If the skeleton exceeds max_size, it is re-distilled with tighter settings. Cutting the output is the last resort.
Usage