Version: Next

PydanticAiSkeletonDistiller

Distiller that produces a DOM skeleton used to ask an LLM for CSS selectors.

The skeleton is built from the page only by removing nodes, attributes, and class tokens, or by truncating text. It never renames or re-parents elements, so any selector the LLM builds from the skeleton also matches the original page.
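
The property above can be illustrated with a minimal sketch that does not use the library at all. The `select` helper is a hypothetical stand-in for a CSS selector of the form `tag.class`, and both HTML snippets are invented for illustration; the skeleton is written by hand to mimic trimmed class tokens, truncated text, and removed nodes.

```python
import xml.etree.ElementTree as ET


def select(root, tag, class_token):
    """Hypothetical stand-in for the CSS selector `tag.class_token`.

    Like CSS, it matches when class_token is one of the element's class tokens.
    """
    return [
        el for el in root.iter(tag)
        if class_token in (el.get("class") or "").split()
    ]


# Invented "original page" with full text and all class tokens.
original = ET.fromstring(
    "<body>"
    "<div class='card card--wide promo'>"
    "<h2 class='title big'>First product, long title</h2><p>Long body text</p>"
    "</div>"
    "<div class='card card--wide'>"
    "<h2 class='title'>Second product</h2><p>More text</p>"
    "</div>"
    "</body>"
)

# Hand-written skeleton of the same page: class tokens trimmed, text truncated,
# nodes removed. Nothing was renamed or moved to a different parent.
skeleton = ET.fromstring(
    "<body><div class='card'><h2 class='title'>First pro</h2></div></body>"
)

# A selector that works on the skeleton still finds elements in the original.
in_skeleton = select(skeleton, "h2", "title")
in_original = select(original, "h2", "title")
```

Because the skeleton only ever loses information, the selector `h2.title` derived from it also matches the original; here it matches more elements there, since repeated items were dropped from the skeleton.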

This is the default distiller for PydanticAiSelectorExtractor. See PydanticAiCleanHtmlDistiller for the direct-extraction variant that keeps the full page text.

On top of the base cleaning:

  • text nodes are truncated to max_text_len, so the model sees samples rather than full content.
  • JSON payloads are capped at max_json_len, so only their key structure reaches the model.
  • runs of repeated siblings are collapsed to the first keep_siblings items plus a comment marker. Siblings with a distinct identity attribute (name, property, itemprop, ...) are kept, since a run of <meta> tags is not a repeating template.
  • if the result still exceeds max_size, it is re-distilled with tighter settings. Cutting the output is the last resort.
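
The text truncation and sibling collapsing described above can be sketched with the standard library alone. This is not the distiller's implementation: the helper names, the rule that siblings count as "repeated" when they share a tag and class, the wording of the comment marker, and the `IDENTITY_ATTRS` tuple (a subset guessed from the list above) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed subset of the identity attributes mentioned above.
IDENTITY_ATTRS = ("name", "property", "itemprop")


def truncate_text(root, max_text_len):
    """Cut every text node (element text and tail) to max_text_len characters."""
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        if el.text and len(el.text) > max_text_len:
            el.text = el.text[:max_text_len]
        if el.tail and len(el.tail) > max_text_len:
            el.tail = el.tail[:max_text_len]


def signature(el):
    """Assumption: siblings with the same tag and class form one repeated run."""
    return (el.tag, el.get("class"))


def has_identity(el):
    """True if the element carries a distinguishing identity attribute."""
    return any(el.get(attr) for attr in IDENTITY_ATTRS)


def collapse_siblings(parent, keep_siblings):
    """Keep the first keep_siblings items of each run, then a comment marker."""
    children = list(parent)
    i = 0
    while i < len(children):
        j = i + 1
        while (
            j < len(children)
            and signature(children[j]) == signature(children[i])
            and not has_identity(children[i])
            and not has_identity(children[j])
        ):
            j += 1
        dropped = children[i:j][keep_siblings:]
        if dropped:
            position = list(parent).index(dropped[0])
            for el in dropped:
                parent.remove(el)
            marker = ET.Comment(f" {len(dropped)} more similar siblings omitted ")
            parent.insert(position, marker)
        i = j
    # Recurse into the surviving element children (comments are skipped).
    for child in list(parent):
        if isinstance(child.tag, str):
            collapse_siblings(child, keep_siblings)
```

Calling `truncate_text(root, 10)` and then `collapse_siblings(root, 3)` on a list of six look-alike `<li>` items leaves three shortened items and one comment, while a run of `<meta name="...">` tags is left untouched.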

Usage

from crawlee.crawlers import PydanticAiSkeletonDistiller

distiller = PydanticAiSkeletonDistiller(max_text_len=80)
skeleton = distiller.distill('<html>...</html>')

Methods

__init__

  • __init__(*, cleaner, max_text_len, max_json_len, keep_siblings, max_classes, max_attr_len, keep_head, max_size, pretty, prompt_notes): None
  • Initialize a new instance.


    Parameters

    • cleaner: Cleaner | None = None (optional, keyword-only)

      A custom lxml_html_clean.Cleaner.

    • max_text_len: int = 50 (optional, keyword-only)

      Cap on the length of a text node, in characters.

    • max_json_len: int | None = 1_000 (optional, keyword-only)

      Cap on JSON payload length, or None to keep in full.

    • keep_siblings: int = 3 (optional, keyword-only)

      How many leading siblings to keep when a repeated run is collapsed.

    • max_classes: int = 5 (optional, keyword-only)

      How many class tokens to keep per element.

    • max_attr_len: int = 100 (optional, keyword-only)

      Cap on attribute value length, in characters.

    • keep_head: bool = True (optional, keyword-only)

      Whether to keep a reduced <head>.

    • max_size: int | None = 60_000 (optional, keyword-only)

      Hard cap on the skeleton, in characters. A tightening re-distillation runs first. If the result is still too big, the tail is dropped and replaced with the truncation marker.

    • pretty: bool = False (optional, keyword-only)

      Whether to pretty-print the serialized HTML.

    • prompt_notes: str | None = _SKELETON_PROMPT_NOTES (optional, keyword-only)

      Override for the default prompt notes. Pass None to send no notes to the LLM.

    Returns None
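
    The max_size behaviour described above (tighten first, cut only as a last resort) can be sketched as a small loop. Everything here is hypothetical: the `fit_to_size` name, the caller-supplied `distill` callable, the tightening schedule, and the marker text are assumptions, not the library's actual settings or internals.

```python
def fit_to_size(html, distill, max_size, marker="<!-- truncated -->"):
    """Re-distill with tighter settings until the result fits, else cut the tail.

    `distill(html, max_text_len, keep_siblings) -> str` is a caller-supplied
    function. The schedule below is an invented example of "tighter settings".
    Assumes max_size is at least len(marker).
    """
    schedule = [(50, 3), (20, 2), (10, 1)]  # hypothetical (max_text_len, keep_siblings)
    result = ""
    for max_text_len, keep_siblings in schedule:
        result = distill(html, max_text_len, keep_siblings)
        if len(result) <= max_size:
            return result  # fits without cutting anything
    # Last resort: drop the tail and end with the marker, staying within max_size.
    return result[: max_size - len(marker)] + marker
```

    Trying the gentler settings first means a hard cut, which can leave the skeleton ending mid-element, only happens when no tightening step is enough.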

distill

  • distill(html): str

get_prompt_notes

  • get_prompt_notes(): str | None