Version: Next

PydanticAiCleanHtmlDistiller

Distiller that produces cleaned, structure-preserving HTML for direct LLM extraction.

The full page text survives, so the data to extract lives inside the produced document. Tags, nesting, and semantic attributes (class, itemprop, datetime) are kept so the model can tell fields apart.

JSON scripts are kept in full by default. For sites where a JSON-LD or framework blob is itself the data, this is the cheapest path. Such blobs can reach hundreds of kilobytes, so set max_json_len for them.
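One way to pick a sensible max_json_len is to measure how large the JSON scripts on a representative page actually are. The sketch below uses only the standard library and is independent of the distiller; the JsonScriptSizer class, the json_script_sizes helper, and the set of script types treated as JSON are assumptions made for illustration, not part of the library.

```python
from __future__ import annotations

from html.parser import HTMLParser


class JsonScriptSizer(HTMLParser):
    """Collect the character length of each JSON-bearing <script> element."""

    # Script types treated as JSON in this sketch (an assumption, not the library's list).
    JSON_TYPES = {"application/ld+json", "application/json"}

    def __init__(self) -> None:
        super().__init__()
        self.sizes: list[int] = []
        self._in_json = False
        self._current = 0

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None, so fall back to an empty string.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type in self.JSON_TYPES:
            self._in_json = True
            self._current = 0

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate the lengths.
        if self._in_json:
            self._current += len(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json:
            self.sizes.append(self._current)
            self._in_json = False


def json_script_sizes(html: str) -> list[int]:
    """Return the length in characters of every JSON script found in the markup."""
    parser = JsonScriptSizer()
    parser.feed(html)
    parser.close()
    return parser.sizes
```

If the largest value is far above what the prompt budget allows, that is a signal to pass an explicit max_json_len rather than rely on the default of None.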

This is the default distiller for PydanticAiDirectExtractor. See PydanticAiSkeletonDistiller for the selector-generation variant.

Usage

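from crawlee.crawlers import PydanticAiCleanHtmlDistiller

# Cap each JSON payload at 5,000 instead of keeping it in full.
distiller = PydanticAiCleanHtmlDistiller(max_json_len=5_000)
# distill() returns the cleaned, structure-preserving HTML as a string.
distilled_html = distiller.distill('<html>...</html>')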

Hierarchy

BasePydanticAiHtmlDistiller
- PydanticAiCleanHtmlDistiller
  - PydanticAiSkeletonDistiller

Index

Methods

__init__

__init__(*, cleaner, max_classes, max_attr_len, max_json_len, keep_head, max_size, pretty, prompt_notes): None

Overrides BasePydanticAiHtmlDistiller.__init__
Initialize a new instance.
Parameters
- cleaner: Cleaner | None = None (optional, keyword-only)
  A custom lxml_html_clean.Cleaner.
- max_classes: int = 5 (optional, keyword-only)
  How many class tokens to keep per element.
- max_attr_len: int = 300 (optional, keyword-only)
  Cap on attribute value length, in characters.
- max_json_len: int | None = None (optional, keyword-only)
  Cap on JSON payload length, or None to keep each payload in full.
- keep_head: bool = True (optional, keyword-only)
  Whether to keep a reduced <head> containing <title>, semantic <meta> and JSON scripts.
- max_size: int | None = 400_000 (optional, keyword-only)
  Hard cap on the distilled document, in characters. When breached, the tail is dropped and replaced with the truncation marker.
- pretty: bool = False (optional, keyword-only)
  Whether to pretty-print the serialized HTML.
- prompt_notes: str | None = _CLEAN_HTML_PROMPT_NOTES (optional, keyword-only)
  Override for the default prompt notes. Pass None to send no notes to the LLM.
Returns None
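
The max_size behavior described above (drop the tail, leave a marker) can be pictured with a small standalone sketch. This is not the library's implementation: the cap_document helper and the marker text are hypothetical, and whether the real marker counts toward the cap is an assumption made here.

```python
from __future__ import annotations

# Hypothetical marker text; the library's actual truncation marker is not shown in this reference.
TRUNCATION_MARKER = "<!-- truncated -->"


def cap_document(document: str, max_size: int | None) -> str:
    """Drop the tail of an oversized document and append a truncation marker."""
    if max_size is None or len(document) <= max_size:
        return document
    # Assumption: the marker counts toward the cap, so the result stays within
    # max_size whenever max_size is at least as long as the marker itself.
    keep = max(max_size - len(TRUNCATION_MARKER), 0)
    return document[:keep] + TRUNCATION_MARKER
```

Because only the tail is dropped, content near the end of a very long page is the first to be lost, which is worth keeping in mind when the target data sits in a footer or a late-loading section.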

distill

distill(html): str

Overrides BasePydanticAiHtmlDistiller.distill
Convert raw HTML to the cleaned, structure-preserving representation.
Parameters
- html: str
  The raw HTML markup.
Returns str

get_prompt_notes

get_prompt_notes(): str | None

Inherited from BasePydanticAiHtmlDistiller.get_prompt_notes
Return the configured prompt notes, or None when not set.
Returns str | None
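
Since get_prompt_notes can return None, any caller that assembles a prompt has to handle the missing-notes case. The sketch below shows one plausible way to do that; build_prompt is a hypothetical helper, and how PydanticAiDirectExtractor actually combines notes with the distilled HTML is not described in this reference.

```python
from __future__ import annotations


def build_prompt(distilled_html: str, notes: str | None) -> str:
    """Prepend the notes to the distilled HTML when notes are configured."""
    if notes is None:
        # No notes configured: send the distilled document on its own.
        return distilled_html
    return f"{notes}\n\n{distilled_html}"
```

With a real distiller instance, the second argument would be the value returned by get_prompt_notes().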
