PydanticAiCleanHtmlDistiller
Hierarchy
- BasePydanticAiHtmlDistiller
- PydanticAiCleanHtmlDistiller
Index
Methods
__init__
Initialize a new instance.
Parameters
optional keyword-only cleaner: Cleaner | None = None
A custom lxml_html_clean.Cleaner.
optional keyword-only max_classes: int = 5
How many class tokens to keep per element.
optional keyword-only max_attr_len: int = 300
Cap on attribute value length, in characters.
optional keyword-only max_json_len: int | None = None
Cap on JSON payload length, or None to keep in full.
optional keyword-only keep_head: bool = True
Whether to keep a reduced <head> containing <title>, semantic <meta> and JSON scripts.
optional keyword-only max_size: int | None = 400_000
Hard cap on the distilled document, in characters. When breached, the tail is dropped and replaced with the truncation marker.
optional keyword-only pretty: bool = False
Whether to pretty-print the serialized HTML.
optional keyword-only prompt_notes: str | None = _CLEAN_HTML_PROMPT_NOTES
Override for the default prompt notes. Pass None to send no notes to the LLM.
Returns None
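The size-related parameters above (max_classes, max_attr_len, max_size) can be pictured with a small standalone sketch. This is not the library's implementation: the helper names and the marker text are hypothetical, and whether the real marker counts toward max_size is not stated in this reference.

```python
from typing import Optional

# Hypothetical marker text; the actual truncation marker is not shown here.
TRUNCATION_MARKER = "<!-- truncated -->"


def cap_classes(class_value: str, max_classes: int = 5) -> str:
    """Keep at most `max_classes` whitespace-separated class tokens."""
    return " ".join(class_value.split()[:max_classes])


def cap_attr(value: str, max_attr_len: int = 300) -> str:
    """Cut an attribute value down to `max_attr_len` characters."""
    return value[:max_attr_len]


def cap_size(document: str, max_size: Optional[int] = 400_000) -> str:
    """Drop the tail of an oversized document and append the marker.

    Assumption: the marker is appended after the cut, so the result
    can be slightly longer than `max_size`.
    """
    if max_size is None or len(document) <= max_size:
        return document
    return document[:max_size] + TRUNCATION_MARKER
```

Passing None for max_size in this sketch disables the cap entirely, mirroring the int | None typing of the documented parameter.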
distill
Convert raw HTML to the cleaned, structure-preserving representation.
Parameters
html: str
The raw HTML markup.
Returns str
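To make "cleaned, structure-preserving" more concrete, here is a rough standard-library sketch of the idea: tags, nesting and text are kept, while noisy attributes and script or style bodies are dropped. SketchDistiller, distill_sketch and the attribute whitelist are hypothetical names chosen for illustration. The real distiller is built on lxml_html_clean and, unlike this simplified sketch, keeps JSON scripts.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative choices for this sketch only, not the library's actual rules.
DROPPED_TAGS = {"script", "style"}
KEPT_ATTRS = {"class", "itemprop", "datetime", "href"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SketchDistiller(HTMLParser):
    """Hypothetical stand-in that keeps tags, nesting, text and a few attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Keep only the whitelisted attributes, re-escaping their values.
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in KEPT_ATTRS
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text survives; whitespace-only runs are discarded.
        if not self.skip_depth and data.strip():
            self.parts.append(escape(data.strip(), quote=False))


def distill_sketch(html: str) -> str:
    parser = SketchDistiller()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


raw = (
    '<div class="card" style="color:red" onclick="x()">'
    "<script>var a = 1;</script>"
    '<time datetime="2024-01-01">Jan 1</time></div>'
)
print(distill_sketch(raw))
```

The printed result keeps the div and time elements with their class and datetime attributes, which is the kind of signal the class description says lets the model tell fields apart.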
get_prompt_notes
Return the configured prompt notes, or None when not set.
Returns str | None
Distiller that produces cleaned, structure-preserving HTML for direct LLM extraction.
The full page text survives, so the data to extract lives inside the produced document. Tags, nesting, and semantic attributes (class, itemprop, datetime) are kept so the model can tell fields apart.
JSON scripts are kept in full by default. For sites where a JSON-LD or framework blob is itself the data, this is the cheapest path. Such blobs can reach hundreds of kilobytes, so set max_json_len for them.
This is the default distiller for PydanticAiDirectExtractor. See PydanticAiSkeletonDistiller for the selector-generation variant.
Usage