Version: Next

Pydantic AI crawler

A PydanticAiCrawler extracts structured data from a page with an LLM. It fetches each page over plain HTTP and parses it with Parsel, then exposes an extract helper: pass a Pydantic model and get a validated instance back. Instead of writing CSS selectors for every field, you describe the data with a schema and the model fills it in.

The model layer is Pydantic AI, so any provider it supports (OpenAI, Anthropic, Gemini, Ollama, and more) works through the model argument. The context is a PydanticAiCrawlingContext, which extends ParselCrawlingContext, so the manual selector and enqueue_links stay available alongside extract.

Experimental

PydanticAiCrawler is experimental. Its public API may change in future releases.

When to use PydanticAiCrawler

Use PydanticAiCrawler when:

  • Selectors are unknown or brittle. The model reads the content, so it tolerates markup that varies or changes.
  • One schema spans many layouts. A single Pydantic model fits differently structured pages, with no per-page selectors.
  • Rapid prototyping. You describe the data with a schema instead of writing selectors.

For pages with a stable, known structure, a plain ParselCrawler or BeautifulSoupCrawler is cheaper, since it runs no model calls.

PydanticAiCrawler fetches pages over plain HTTP and doesn't render JavaScript. For pages that need a browser, or for complex multi-step interactions, use StagehandCrawler. See the Stagehand crawler guide.

Installation

PydanticAiCrawler requires the pydantic-ai optional dependency group:

pip install 'crawlee[pydantic-ai]'

or with uv:

uv add 'crawlee[pydantic-ai]'

The pydantic-ai extra installs the OpenAI integration by default. To use another provider, add the matching pydantic-ai-slim extra. For example, for Anthropic:

pip install 'crawlee[pydantic-ai]' 'pydantic-ai-slim[anthropic]'

Basic usage

Provide an LLM through the model argument and call context.extract with a Pydantic model inside the handler. With only model set, the crawler wraps it in a PydanticAiDirectExtractor, the default extractor, which sends each distilled page to the model in one call. The following example extracts an article and pushes it to the dataset.

import asyncio

from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import PydanticAiCrawler, PydanticAiCrawlingContext


class Article(BaseModel):
    """Model representing the extracted data for an article."""

    title: str
    short_text: str


async def main() -> None:
    # A `Model` instance sets the API key explicitly. A provider-prefixed string such as
    # 'openai:gpt-5.4-nano' reads the key from the provider's env var like OPENAI_API_KEY.
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )

    # With only `model`, the crawler uses a PydanticAiDirectExtractor by default.
    crawler = PydanticAiCrawler(model=model, max_requests_per_crawl=5)

    @crawler.router.default_handler
    async def handler(context: PydanticAiCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Pass a Pydantic model and get a validated instance back.
        article = await context.extract(Article)

        await context.push_data(article.model_dump())

        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

The model argument accepts a provider-prefixed name or a Pydantic AI Model instance. With both model and extractor left unset, the crawler defaults to 'openai:gpt-5.4-nano', which reads OPENAI_API_KEY from the environment.

# A provider-prefixed name reads credentials from the provider's environment variable (e.g. OPENAI_API_KEY).
crawler = PydanticAiCrawler(model='openai:gpt-5.4-nano')

# A Model instance takes credentials explicitly.
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIChatModel('gpt-5.4-nano', provider=OpenAIProvider(api_key='...'))
crawler = PydanticAiCrawler(model=model)

Extractors

An extractor turns a page into your schema. Extractors implement different strategies for working with the LLM, and each one uses a PydanticAiHtmlDistiller to shape the model's input. Crawlee ships two.

PydanticAiDirectExtractor

PydanticAiDirectExtractor sends the distilled page to the model in one call. The schema is the model's output type. Pydantic AI validates the result. On a mismatch, it sends the error back to the model to fix, bounded by retries.

It reads each page on its own, so extraction is accurate per page. It accepts schemas of any shape: nested models, lists, dictionaries, unions, and deep nesting. The cost is one model call per page, which scales poorly on a large site.
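
The validate-and-retry loop described above can be sketched roughly as follows. This is a stdlib-only illustration, not Pydantic AI's implementation: extract_with_retries, validate_article, and fake_model are hypothetical names, and a hand-written check stands in for Pydantic validation.

```python
import json
from collections.abc import Callable


def validate_article(raw: str) -> dict[str, str]:
    """Stand-in for schema validation: require string `title` and `short_text`."""
    data = json.loads(raw)
    for field in ('title', 'short_text'):
        if not isinstance(data.get(field), str):
            raise ValueError(f'field {field!r} must be a string')
    return data


def extract_with_retries(
    ask_model: Callable[[str], str], document: str, retries: int = 1
) -> dict[str, str]:
    """Ask once, then feed each validation error back, at most `retries` times."""
    prompt = document
    for attempt in range(retries + 1):
        raw = ask_model(prompt)
        try:
            return validate_article(raw)
        except ValueError as error:
            if attempt == retries:
                raise RuntimeError('no valid result within retries') from error
            # Send the validation error back so the model can fix its output.
            prompt = f'{document}\n\nYour previous answer was invalid: {error}'
    raise AssertionError('unreachable')


def fake_model(prompt: str) -> str:
    """Hypothetical model: incomplete on the first try, correct once it sees the error."""
    if 'invalid' in prompt:
        return json.dumps({'title': 'Crawlee', 'short_text': 'A crawling library.'})
    return json.dumps({'title': 'Crawlee'})


result = extract_with_retries(fake_model, '<h1>Crawlee</h1>', retries=1)
```

With retries=0 the same call raises instead of returning, which mirrors how a bounded retry count turns persistent validation failures into an error.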

To focus the model on the data you want, use additional_instructions:

import asyncio

from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import PydanticAiCrawler, PydanticAiCrawlingContext


class Post(BaseModel):
    """Model representing a single post."""

    title: str
    url: str


class Posts(BaseModel):
    """Model representing the extracted list of posts."""

    posts: list[Post]


async def main() -> None:
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )
    crawler = PydanticAiCrawler(model=model, max_requests_per_crawl=5)

    @crawler.router.default_handler
    async def handler(context: PydanticAiCrawlingContext) -> None:
        # The instruction narrows what the model returns from the page.
        posts = await context.extract(
            Posts,
            additional_instructions='Extract only the top five posts on the page.',
        )

        await context.push_data(posts.model_dump())

    await crawler.run(['https://news.ycombinator.com'])


if __name__ == '__main__':
    asyncio.run(main())

PydanticAiSelectorExtractor

PydanticAiSelectorExtractor asks the model for reusable CSS selectors on the first page of a route, caches them, and reuses them with no model call on later pages of the same layout, so it scales to large sites. When a page matches none of the cached selectors (a different markup variant), it generates and caches a new set, so one bucket can hold several variants. If selector generation fails, or the schema shape is unsupported, it degrades to the fallback extractor when one is set, and raises otherwise. Selectors are bucketed by cache_tag, which defaults to the request label, so each route keeps its own set. The cache is persisted to a KeyValueStore, so a later run reuses selectors learned earlier.

import asyncio

from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee import Glob
from crawlee.crawlers import (
    PydanticAiCrawler,
    PydanticAiCrawlingContext,
    PydanticAiDirectExtractor,
    PydanticAiSelectorExtractor,
)


class Article(BaseModel):
    """Model representing the extracted data for an article."""

    title: str
    main_text: str


async def main() -> None:
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )
    crawler = PydanticAiCrawler(
        extractor=PydanticAiSelectorExtractor(
            model=model,
            # Pages the cached selectors cannot handle fall back to direct extraction.
            fallback=PydanticAiDirectExtractor(model=model),
        ),
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def handler(context: PydanticAiCrawlingContext) -> None:
        # Enqueue blog article pages; the article handler extracts the data.
        await context.enqueue_links(
            include=[Glob('https://crawlee.dev/blog/*')],
            label='article',
        )

    @crawler.router.handler('article')
    async def article_handler(context: PydanticAiCrawlingContext) -> None:
        # The first page generates selectors; later pages reuse them with no LLM call.
        article = await context.extract(Article)

        await context.push_data(article.model_dump())

    await crawler.run(['https://crawlee.dev/blog'])


if __name__ == '__main__':
    asyncio.run(main())

It supports schemas built from scalar fields, lists of scalars, lists of items, and a single nested item, one level deep. For shapes it can't serve (such as a dict field), set a fallback or use PydanticAiDirectExtractor.

The fallback is any extractor implementing PydanticAiHtmlExtractor. It can be a PydanticAiDirectExtractor with its own settings, or another PydanticAiSelectorExtractor with a different model, distiller, or instructions. The argument takes a configured extractor rather than a boolean flag, since a fallback often needs fine-grained control. Without a fallback, a failed selector generation raises an UnexpectedModelBehavior, and a complex, unsupported schema raises a ValueError.

Both extractors share two more knobs. retries caps how many times the model may fix output that fails validation (default 1 for PydanticAiDirectExtractor, 3 for PydanticAiSelectorExtractor). instructions replaces the base task instructions entirely.

Distillers

A distiller reduces raw HTML to a compact representation the model reads cheaply. Each extractor uses one, set through its distiller argument (the crawler has no distiller argument).

Any distiller pairs with any extractor, which lets you balance extraction quality against token cost. A distiller that keeps more of the page helps the model read values or generate selectors, but costs more tokens. One that keeps only structure and attributes is cheaper, but drops the page text. Choose based on your task and the strategy your extractor implements.

PydanticAiCleanHtmlDistiller

PydanticAiCleanHtmlDistiller produces cleaned, structure-preserving HTML and keeps the full page text. Scripts and styling are removed. Tags, nesting, and data-bearing attributes (href, class, datetime, ...) stay. It is the default for PydanticAiDirectExtractor, which reads the values straight from the document, so it fits pages where the data lives in the text.
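
To make the idea of cleaning concrete, here is a minimal sketch that removes script and style elements and keeps only a few data-bearing attributes. It is not how PydanticAiCleanHtmlDistiller is implemented: the clean function and the two allowlists are hypothetical, and xml.etree.ElementTree is used only because it ships with Python, so the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET

DROP_TAGS = {'script', 'style'}            # assumed noise elements
KEEP_ATTRS = {'href', 'class', 'datetime'}  # assumed data-bearing attributes


def clean(element: ET.Element) -> None:
    """Remove script/style subtrees and non-allowlisted attributes, in place."""
    for child in list(element):
        if child.tag in DROP_TAGS:
            element.remove(child)
        else:
            clean(child)
    for name in list(element.attrib):
        if name not in KEEP_ATTRS:
            del element.attrib[name]


page = (
    '<body><script>track()</script>'
    '<a href="/blog" class="link" onclick="go()">Blog</a>'
    '<style>a {color: red}</style></body>'
)
root = ET.fromstring(page)
clean(root)
cleaned = ET.tostring(root, encoding='unicode')
```

The link keeps its text, href, and class, while the script, the style block, and the onclick attribute are gone.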

PydanticAiSkeletonDistiller

PydanticAiSkeletonDistiller builds on PydanticAiCleanHtmlDistiller, then truncates text nodes to short samples and collapses runs of repeated siblings. The output shows the page structure rather than its content, which is what the model needs to write CSS selectors, and its smaller size costs fewer tokens. It is the default for PydanticAiSelectorExtractor.
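
The two reductions named above, truncating text and collapsing repeated siblings, can be illustrated with a short sketch. It is an approximation for intuition only, not the distiller's algorithm: skeletonize and its max_text and max_repeat parameters are hypothetical, siblings count as repeated when consecutive elements share a tag, and the input must be well-formed because the standard-library XML parser is used.

```python
import xml.etree.ElementTree as ET


def skeletonize(element: ET.Element, max_text: int = 12, max_repeat: int = 2) -> None:
    """Truncate long text and collapse runs of same-tag siblings, in place."""
    if element.text and len(element.text) > max_text:
        element.text = element.text[:max_text] + '...'
    run_tag, run_length = None, 0
    for child in list(element):
        run_length = run_length + 1 if child.tag == run_tag else 1
        run_tag = child.tag
        if run_length > max_repeat:
            # Drop siblings beyond the first `max_repeat` of a repeated run.
            element.remove(child)
        else:
            skeletonize(child, max_text, max_repeat)


page = (
    '<ul>'
    '<li>First item with a long description</li>'
    '<li>Second item</li>'
    '<li>Third item</li>'
    '<li>Fourth item</li>'
    '</ul>'
)
root = ET.fromstring(page)
skeletonize(root)
skeleton = ET.tostring(root, encoding='unicode')
```

Only two list items survive, and the long first item is cut to a short sample, so the structure a selector needs stays visible at a fraction of the size.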

Custom distiller

Subclass BasePydanticAiHtmlDistiller and implement distill to send a different representation. Set prompt_notes so the model knows the input format. The extractor appends the notes to its instructions.

The following example converts the cleaned page to Markdown with html-to-markdown, an extra dependency:

pip install html-to-markdown

import asyncio

from html_to_markdown import convert
from lxml_html_clean import Cleaner
from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import (
    BasePydanticAiHtmlDistiller,
    PydanticAiCrawler,
    PydanticAiCrawlingContext,
    PydanticAiDirectExtractor,
    get_basic_http_cleaner,
)

# Notes appended to the model instructions so it knows the input format.
MARKDOWN_PROMPT_NOTES = 'The document is Markdown converted from the HTML page.'


class MarkdownDistiller(BasePydanticAiHtmlDistiller):
    """Distiller that cleans the page HTML and converts it to Markdown."""

    def __init__(self, cleaner: Cleaner | None = None) -> None:
        super().__init__(prompt_notes=MARKDOWN_PROMPT_NOTES)

        # Strip scripts, styles, and other noise before the conversion.
        self._cleaner = cleaner or get_basic_http_cleaner()

    def distill(self, html: str) -> str:
        return convert(self._cleaner.clean_html(html)).content or ''


class Article(BaseModel):
    """Model representing the extracted data for an article."""

    title: str
    short_text: str


async def main() -> None:
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        # Set the provider with the API key explicitly.
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )
    crawler = PydanticAiCrawler(
        # Use the custom distiller to convert the page to Markdown before extraction.
        extractor=PydanticAiDirectExtractor(model=model, distiller=MarkdownDistiller()),
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def handler(context: PydanticAiCrawlingContext) -> None:
        # Pass a Pydantic model and get a validated instance back.
        article = await context.extract(Article)
        await context.push_data(article.model_dump())

        # Enqueue links as usual; distillation and extraction don't affect
        # the rest of the crawling logic.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Extract options

context.extract takes options alongside the schema:

  • scope - a CSS selector that restricts extraction to the first matching subtree (e.g. main or article.post). It saves tokens and keeps the model away from unrelated parts of the page.
  • cache_tag - the bucket for cached selectors. It defaults to the request label.
  • additional_instructions - extra instructions for this call, appended to the base instructions. With PydanticAiSelectorExtractor they steer the one-time selector generation, not each extraction, so use them to point the model at the right region.

Usage and cost

Token usage accumulates on context.ai_usage, and on crawler.ai_usage for the whole crawl. The accumulator is a PydanticAiUsageStats with requests, input_tokens, output_tokens, and total_tokens.

To cap spend, pass usage_limits (a pydantic-ai UsageLimits) to an extractor. It applies to every model run, and extract raises UsageLimitExceeded when a page needs more. The example below caps each extraction, logs and skips pages that exceed it, and stops the whole crawl once a token budget is spent.

import asyncio

from pydantic import BaseModel
from pydantic_ai.exceptions import UsageLimitExceeded
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic_ai.usage import UsageLimits

from crawlee.crawlers import (
    PydanticAiCrawler,
    PydanticAiCrawlingContext,
    PydanticAiDirectExtractor,
)

# Stop the whole crawl once this many tokens have been spent.
TOKEN_BUDGET = 50_000


class Article(BaseModel):
    """Model representing the extracted data for an article."""

    title: str
    short_text: str


async def main() -> None:
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )
    crawler = PydanticAiCrawler(
        # Cap each extraction so an oversized page cannot run up token usage.
        extractor=PydanticAiDirectExtractor(
            model=model,
            usage_limits=UsageLimits(total_tokens_limit=10_000),
        ),
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def handler(context: PydanticAiCrawlingContext) -> None:
        # Stop the crawl once the cumulative token budget is exhausted.
        # `crawler.ai_usage` holds the usage for the whole crawl.
        if crawler.ai_usage.total_tokens > TOKEN_BUDGET:
            context.log.info('Token budget exhausted, stopping the crawler.')
            crawler.stop()
            return

        try:
            article = await context.extract(Article)
        except UsageLimitExceeded:
            # The page needs more tokens than the per-extraction limit allows.
            context.log.warning(f'Content at {context.request.url} is too large.')
            return

        await context.push_data(article.model_dump())

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Debugging

Debugging LLM extraction is less direct than debugging selectors, because the model decides the output. A few things help.

The model only sees the distilled document, so when a field is wrong or empty, print what the distiller produces and confirm the data is there. Use the same distiller your extractor runs. To see the exact prompts, responses, and retries exchanged with the model, wrap the call in Pydantic AI's capture_run_messages. The example below does both inside a handler:

import asyncio

from pydantic import BaseModel
from pydantic_ai import capture_run_messages
from pydantic_ai.exceptions import UnexpectedModelBehavior
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee import ConcurrencySettings
from crawlee.crawlers import (
    PydanticAiCleanHtmlDistiller,
    PydanticAiCrawler,
    PydanticAiCrawlingContext,
    PydanticAiDirectExtractor,
)


class Article(BaseModel):
    """Model representing the extracted data for an article."""

    title: str
    short_text: str


async def main() -> None:
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )
    # Build the distiller once so the extractor and the handler below share
    # the same instance.
    distiller = PydanticAiCleanHtmlDistiller()
    crawler = PydanticAiCrawler(
        max_requests_per_crawl=10,
        # Create a direct extractor with your distiller.
        extractor=PydanticAiDirectExtractor(
            model,
            distiller=distiller,
        ),
        # Set concurrency to 1, which ensures only one request is processed at a time.
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=1, max_concurrency=1
        ),
        # Set abort_on_error to True to stop the crawl if an error occurs during
        # extraction.
        abort_on_error=True,
    )

    @crawler.router.default_handler
    async def handler(context: PydanticAiCrawlingContext) -> None:
        # Inspect the distilled document the model actually reads, using the same
        # distiller the extractor runs. On real pages this can be tens of KB.
        distilled = distiller.distill(context.selector.get())
        context.log.info(distilled)

        # Capture the prompts, responses, and retries exchanged with the model.
        with capture_run_messages() as messages:
            try:
                article = await context.extract(Article)
            except UnexpectedModelBehavior:
                context.log.exception(f'Extraction failed for {context.request.url}.')
                raise
            finally:
                # Log each exchanged message on its own line for readability.
                for message in messages:
                    context.log.info(f'{message}')

        await context.push_data(article.model_dump())

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

The errors point at the cause:

  • UnexpectedModelBehavior - the model couldn't produce a valid result within retries. For example, its output kept failing validation, a content filter fired, or the token limit cut off a tool call. See Pydantic AI model errors.
  • ModelHTTPError - the provider returned a 4xx or 5xx response: a missing or wrong API key, a rate limit, or a provider outage.
  • ValueError - the schema shape is unsupported by PydanticAiSelectorExtractor, or a scope matched nothing.
  • UsageLimitExceeded - a run hit the configured usage_limits.

UnexpectedModelBehavior, ModelHTTPError, and UsageLimitExceeded come from the model run and share the base AgentRunError, so catch it to handle any model failure in one place. ValueError is raised by the extractor itself, around the schema or scope, not by the model.
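
The split between model failures and extractor errors can be shown with a short triage sketch. The four exception classes below are local stand-ins that only mirror the hierarchy described above, so the example runs without any dependency. In real code they would come from Pydantic AI rather than being redefined, and classify is a hypothetical helper.

```python
class AgentRunError(RuntimeError):
    """Local stand-in for the shared base class named above."""


class UnexpectedModelBehavior(AgentRunError):
    """Local stand-in: the model produced no valid result within retries."""


class ModelHTTPError(AgentRunError):
    """Local stand-in: the provider answered with a 4xx or 5xx response."""


class UsageLimitExceeded(AgentRunError):
    """Local stand-in: a run hit the configured usage limits."""


def classify(error: Exception) -> str:
    """Sort an extraction error into the two groups the guide distinguishes."""
    try:
        raise error
    except AgentRunError:
        # One clause covers every failure that comes from the model run.
        return 'model failure'
    except ValueError:
        # Raised by the extractor itself: unsupported schema or unmatched scope.
        return 'schema or scope problem'


labels = [
    classify(UnexpectedModelBehavior('invalid output')),
    classify(ModelHTTPError('status 429')),
    classify(UsageLimitExceeded('token limit')),
    classify(ValueError('scope matched nothing')),
]
```

A handler can use the same two except clauses around context.extract to decide whether to skip a page or stop the crawl.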

An extraction error is an ordinary request error, so Crawlee retries the request up to max_request_retries (3 by default), re-running the extraction and repeating its token cost. For errors that won't pass on retry, such as a wrong API key or an unsupported schema, the retries only waste effort, so catch them in the handler to skip the page, or set abort_on_error=True to stop the crawl.
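
A rough upper bound makes the cost of repeated extraction visible. The helper below is plain arithmetic under stated assumptions: worst_case_tokens is a hypothetical name, every page is assumed to fail on each attempt, each attempt is assumed to spend the same number of tokens, and attempts_per_page=4 assumes one initial attempt plus three retries.

```python
def worst_case_tokens(
    pages: int, tokens_per_extraction: int, attempts_per_page: int
) -> int:
    """Upper bound on tokens when every page uses all of its attempts."""
    return pages * tokens_per_extraction * attempts_per_page


# 100 pages at about 2,000 tokens each, with 4 attempts per page (assumed).
with_retries = worst_case_tokens(100, 2_000, attempts_per_page=4)
# The same crawl when failing pages are skipped after the first attempt.
without_retries = worst_case_tokens(100, 2_000, attempts_per_page=1)
wasted = with_retries - without_retries
```

Under these assumptions, errors that cannot succeed on retry would spend 600,000 tokens beyond the first attempts, which is why skipping or aborting early pays off.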

Conclusion

This guide introduced PydanticAiCrawler and its extract helper, the PydanticAiDirectExtractor and PydanticAiSelectorExtractor strategies, the built-in and custom distillers, the extract options, how failures and cost are handled, and how to debug extraction. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!