Version: Next

PydanticAiDirectExtractor

Extractor that asks the LLM to read the page and return the data directly.

The page is distilled to compact HTML and sent to the model in a single call. The user schema is the agent's output type, so pydantic-ai validates the result and, if validation fails, feeds the errors back to the model so it can correct its output. This is the simplest extractor and works on any page, at the cost of one LLM call per page (more when output retries are triggered).

See the PydanticAiHtmlExtractor protocol for the common extractor interface, and PydanticAiSelectorExtractor for a variant that learns reusable CSS selectors.

Usage

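import asyncio

from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import PydanticAiDirectExtractor


class Product(BaseModel):
    """Schema that the model's output must satisfy."""

    name: str
    price: str | None


async def main() -> None:
    # The API key is elided here; supply your own.
    model = OpenAIChatModel('gpt-5.4-nano', provider=OpenAIProvider(api_key='...'))
    extractor = PydanticAiDirectExtractor(model=model)
    # Distills the HTML, calls the model, and returns a validated Product.
    product = await extractor.extract('<html>...</html>', Product)
    print(product)


asyncio.run(main())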

Methods

__init__

  • __init__(model, *, distiller, instructions, retries, usage_limits): None
  • Initialize a new instance.


    Parameters

    • model: str | Model

      A provider-prefixed name (e.g. 'openai:gpt-5.4-nano') or a pydantic-ai Model.

    • optional, keyword-only distiller: PydanticAiHtmlDistiller | None = None

      The HTML distiller shaping the LLM input. Defaults to PydanticAiCleanHtmlDistiller.

    • optional, keyword-only instructions: str = _DIRECT_INSTRUCTIONS

      Base task instructions. The distiller's prompt notes are appended automatically.

    • optional, keyword-only retries: int = 1

      How many times the model may fix output that fails schema validation within one run (pydantic-ai output retries).

    • optional, keyword-only usage_limits: UsageLimits | None = None

      Optional pydantic-ai UsageLimits applied to each run.

    Returns None
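
As an illustration of the keyword-only parameters above, the sketch below builds an extractor from a provider-prefixed model name with a larger retry budget and a request cap. `build_extractor` is a hypothetical helper name, and the specific limits are assumptions chosen for the example rather than recommended values. It also assumes that provider credentials are available in the environment and that `UsageLimits` is importable from `pydantic_ai.usage`.

```python
def build_extractor(model_name: str = 'openai:gpt-5.4-nano'):
    """Build a direct extractor with an explicit retry and usage budget."""
    # Imported lazily so that importing this module does not require the
    # optional AI dependencies to be installed.
    from pydantic_ai.usage import UsageLimits

    from crawlee.crawlers import PydanticAiDirectExtractor

    return PydanticAiDirectExtractor(
        model_name,
        # The model may correct schema-invalid output up to twice per run.
        retries=2,
        # Assumption: one initial request plus two retries fits in three requests.
        usage_limits=UsageLimits(request_limit=3),
    )
```

Because `distiller` and `instructions` are left out, the documented defaults apply.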

extract

  • async extract(content, schema, *, scope, cache_tag, additional_instructions): TSchema
  • Distill content, send it to the model, and return a validated schema.


    Parameters

    • content: str | Selector

      Raw HTML or a parsed Parsel Selector.

    • schema: type[TSchema]

      The Pydantic model describing the desired output.

    • optional, keyword-only scope: str | None = None

      Optional CSS selector restricting extraction to the first matching subtree.

    • optional, keyword-only cache_tag: str | None = None

      Ignored in direct extraction.

    • optional, keyword-only additional_instructions: str | None = None

      Extra instructions appended for this call only.

    Returns TSchema
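
The keyword-only arguments of `extract` can be combined as in the following sketch. `extract_from_region` is a hypothetical helper, and the instruction text is only an example. The helper accepts any object exposing the `extract` signature documented above, so it can be exercised with a stand-in as well as with a real extractor.

```python
from typing import Any


async def extract_from_region(extractor: Any, html: str, schema: type, region_css: str) -> Any:
    """Extract `schema` from the first subtree of `html` matching `region_css`."""
    return await extractor.extract(
        html,
        schema,
        # Restrict extraction to the first element matching the CSS selector.
        scope=region_css,
        # Appended to the instructions for this call only.
        additional_instructions='Copy prices exactly as displayed, including the currency symbol.',
    )
```

With the `extractor` and `Product` from the Usage section, a call would look like `await extract_from_region(extractor, html, Product, 'div.product')`, where the selector is a placeholder for whatever wraps the product on the target page. `cache_tag` is omitted because direct extraction ignores it.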

set_ai_usage

  • set_ai_usage(value): None

Properties

ai_usage

Accumulated token usage of this extractor's runs.
