Version: Next

PydanticAiCleanHtmlDistiller

Distiller that produces cleaned, structure-preserving HTML for direct LLM extraction.

The full page text survives, so the data to extract lives inside the produced document. Tags, nesting, and semantic attributes (class, itemprop, datetime) are kept so the model can tell fields apart.

JSON scripts are kept in full by default. For sites where a JSON-LD or framework blob is itself the data, this is the cheapest path. Such blobs can reach hundreds of kilobytes, so set max_json_len for them.
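To pick a sensible max_json_len, it can help to measure how large the JSON scripts on a target page actually are. The following is a minimal sketch using only the standard library. JsonScriptSizer and json_payload_lengths are hypothetical helpers written for illustration, not part of Crawlee, and the set of script types treated as JSON is an assumption.

```python
from html.parser import HTMLParser


class JsonScriptSizer(HTMLParser):
    """Collect the text of JSON <script> elements (hypothetical helper, not a Crawlee API)."""

    # Assumption: these script types are the ones that carry JSON data.
    JSON_TYPES = {'application/ld+json', 'application/json'}

    def __init__(self) -> None:
        super().__init__()
        self._in_json_script = False
        self.payloads: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Start a new payload when a JSON-typed <script> opens.
        if tag == 'script' and dict(attrs).get('type') in self.JSON_TYPES:
            self._in_json_script = True
            self.payloads.append('')

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_script:
            self.payloads[-1] += data

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_json_script = False


def json_payload_lengths(html: str) -> list[int]:
    """Return the character length of every JSON script payload in the page."""
    sizer = JsonScriptSizer()
    sizer.feed(html)
    sizer.close()
    return [len(payload) for payload in sizer.payloads]


sample = (
    '<html><head>'
    '<script type="application/ld+json">{"a": 1}</script>'
    '<script>var x = 1;</script>'
    '</head></html>'
)
lengths = json_payload_lengths(sample)
```

If the reported lengths run to hundreds of kilobytes and the blob is not the data being extracted, a small max_json_len is likely the better choice.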

This is the default distiller for PydanticAiDirectExtractor. See PydanticAiSkeletonDistiller for the selector-generation variant.

Usage

from crawlee.crawlers import PydanticAiCleanHtmlDistiller

distiller = PydanticAiCleanHtmlDistiller(max_json_len=5_000)
distilled_html = distiller.distill('<html>...</html>')

Methods

__init__

  • __init__(*, cleaner, max_classes, max_attr_len, max_json_len, keep_head, max_size, pretty, prompt_notes): None
  • Initialize a new instance.


    Parameters

    • optional keyword-only cleaner: Cleaner | None = None

      A custom lxml_html_clean.Cleaner.

    • optional keyword-only max_classes: int = 5

      How many class tokens to keep per element.

    • optional keyword-only max_attr_len: int = 300

      Cap on attribute value length, in characters.

    • optional keyword-only max_json_len: int | None = None

      Cap on JSON payload length, or None to keep payloads in full.

    • optional keyword-only keep_head: bool = True

      Whether to keep a reduced <head> containing <title>, semantic <meta> and JSON scripts.

    • optional keyword-only max_size: int | None = 400_000

      Hard cap on the size of the distilled document, in characters. When the cap is exceeded, the tail is dropped and replaced with the truncation marker.

    • optional keyword-only pretty: bool = False

      Whether to pretty-print the serialized HTML.

    • optional keyword-only prompt_notes: str | None = _CLEAN_HTML_PROMPT_NOTES

      Override for the default prompt notes. Pass None to send no notes to the LLM.

    Returns None
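The size-related parameters above can be pictured with a small standalone sketch. It is an illustration only, not the distiller's implementation: cap_classes, cap_attr and cap_document are hypothetical helpers, the marker text is invented, and the choices to keep the first class tokens and to append the marker after the cut are assumptions.

```python
from __future__ import annotations

# Hypothetical marker text; the real truncation marker is not documented here.
TRUNCATION_MARKER = '<!-- truncated -->'


def cap_classes(class_value: str, max_classes: int = 5) -> str:
    """Keep at most `max_classes` class tokens (assumption: the first ones are kept)."""
    return ' '.join(class_value.split()[:max_classes])


def cap_attr(value: str, max_attr_len: int = 300) -> str:
    """Cut an attribute value to at most `max_attr_len` characters."""
    return value[:max_attr_len]


def cap_document(html: str, max_size: int | None = 400_000) -> str:
    """Drop the tail past `max_size` characters and append the marker; None disables the cap."""
    if max_size is None or len(html) <= max_size:
        return html
    return html[:max_size] + TRUNCATION_MARKER


capped_classes = cap_classes('card card--wide product featured sale new promo')
capped_attr = cap_attr('x' * 500)
capped_doc = cap_document('abcdef', max_size=4)
```

In this sketch the marker is added on top of max_size, so the result can be slightly longer than the cap. Whether the real distiller counts the marker toward max_size is not stated above.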

distill

  • distill(html): str

get_prompt_notes

  • get_prompt_notes(): str | None