AdaptivePlaywrightCrawler

An AdaptivePlaywrightCrawler is a combination of PlaywrightCrawler and some implementation of an HTTP-based crawler, such as ParselCrawler or BeautifulSoupCrawler. It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling whenever it detects that this may bring a performance benefit. The detection is done by a RenderingTypePredictor, whose default implementation is DefaultRenderingTypePredictor. It predicts which crawling method should be used and learns from already crawled pages.
When to use AdaptivePlaywrightCrawler

Use AdaptivePlaywrightCrawler in scenarios where some target pages have to be crawled with PlaywrightCrawler, but a faster HTTP-based crawler is sufficient for others. This way, you can achieve lower costs when crawling multiple different websites.

Another use case is performing selector-based data extraction without prior knowledge of whether the selector exists in the static page or is dynamically added by code executed in a browsing client.
Request handler and adaptive context helpers

The request handler for AdaptivePlaywrightCrawler works on a special context type - AdaptivePlaywrightCrawlingContext. This context is sometimes created by the HTTP-based sub crawler and sometimes by the Playwright-based sub crawler. Due to its dynamic nature, you can't always access the page object. To overcome this limitation, there are the following helper methods on this context that can be called regardless of how the context was created.
wait_for_selector accepts a css selector as the first argument and a timeout as the second argument. The function will try to locate this selector and return once it is found (within the timeout). In practice this means that if the HTTP-based sub crawler was used, the function will find the selector only if it is part of the static content. If not, the adaptive crawler will fall back to the Playwright sub crawler and will try to locate the selector within the timeout using Playwright.
query_selector_one accepts a css selector as the first argument and a timeout as the second argument. This function acts similarly to wait_for_selector, but it also returns the matched element, if one is found. The return value type is determined by the HTTP-based sub crawler used. For example, it will be Selector for ParselCrawler and Tag for BeautifulSoupCrawler.
query_selector_all works the same as query_selector_one, but returns all matched elements.
parse_with_static_parser will re-parse the whole page. The return value type is again determined by the HTTP-based sub crawler used. It has optional arguments: selector and timeout. If those optional arguments are used, the function first calls wait_for_selector and then does the parsing. This can be used in a scenario where some specific element can signal that the page is already complete.
See the following example of how to create a request handler and use the context helpers:
from datetime import timedelta

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()


@crawler.router.default_handler
async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    # Locate element h2 within 5 seconds.
    h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000))
    # Do stuff with the element found by the selector.
    context.log.info(h2)
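The remaining helpers can be used in the same way. Below is a minimal sketch of doing so; the marker selector and the choice of elements are illustrative assumptions, not part of the library:

from datetime import timedelta

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()


@crawler.router.default_handler
async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    # Block until a (hypothetical) marker element signals the page is complete;
    # this falls back to the Playwright sub crawler if the element is not static.
    await context.wait_for_selector('#content-loaded', timedelta(seconds=10))

    # Collect all matching elements; with the BeautifulSoup static parser each
    # returned element is a `Tag`.
    for heading in await context.query_selector_all('h2', timedelta(seconds=5)):
        context.log.info(heading)

    # Re-parse the whole page once the marker element is present.
    parsed_page = await context.parse_with_static_parser(
        selector='#content-loaded', timeout=timedelta(seconds=10)
    )
    context.log.info(parsed_page)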
Crawler configuration

To use AdaptivePlaywrightCrawler, it is recommended to use one of the prepared factory methods that create the crawler with a specific HTTP-based sub crawler variant: AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser or AdaptivePlaywrightCrawler.with_parsel_static_parser.

AdaptivePlaywrightCrawler is internally composed of two sub crawlers, and you can configure both of them in detail. For the configuration options of the sub crawlers, please refer to their pages: PlaywrightCrawler, ParselCrawler, BeautifulSoupCrawler.
In the following examples you can see how to create and configure AdaptivePlaywrightCrawler with each of the two HTTP-based sub crawlers.

With BeautifulSoupCrawler:
from crawlee.crawlers import AdaptivePlaywrightCrawler

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    # Arguments relevant only for PlaywrightCrawler
    playwright_crawler_specific_kwargs={'headless': False, 'browser_type': 'chromium'},
    # Common arguments relevant to all crawlers
    max_crawl_depth=5,
)
With ParselCrawler:

from crawlee.crawlers import AdaptivePlaywrightCrawler

crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
    # Arguments relevant only for PlaywrightCrawler
    playwright_crawler_specific_kwargs={'headless': False, 'browser_type': 'chromium'},
    # Common arguments relevant to all crawlers
    max_crawl_depth=5,
)
Prediction related arguments

To control which pages are crawled by which method, you can use the following arguments:

rendering_type_predictor - An implementation of RenderingTypePredictor, a class that gives recommendations about which sub crawler should be used for a specific URL. The predictor will also recommend running both sub crawlers for some pages from time to time, to check that its recommendation was correct. The predictor should be able to learn from previous results and gradually give more reliable recommendations.

result_checker - A function that checks the result created by crawling a page. By default, it always returns True.

result_comparator - A function that compares two results (the HTTP-based sub crawler result and the Playwright-based sub crawler result) and returns True if they are considered the same. By default, this function compares the push_data calls made by each sub crawler through the context helper. It is used together with the rendering_type_predictor to evaluate whether the HTTP-based sub crawler produces the same results as the Playwright-based sub crawler.
See the following example of how to pass prediction-related arguments:
from crawlee import Request
from crawlee._types import RequestHandlerRunResult
from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    RenderingType,
    RenderingTypePrediction,
    RenderingTypePredictor,
)


class CustomRenderingTypePredictor(RenderingTypePredictor):
    def __init__(self) -> None:
        self._learning_data = list[tuple[Request, RenderingType]]()

    def predict(self, request: Request) -> RenderingTypePrediction:
        # Some custom logic that produces some `RenderingTypePrediction`
        # based on the `request` input.
        rendering_type: RenderingType = (
            'static' if 'abc' in request.url else 'client only'
        )

        return RenderingTypePrediction(
            # Recommends `static` rendering type -> HTTP-based sub crawler will be used.
            rendering_type=rendering_type,
            # Recommends that both sub crawlers should run with 20% chance. When both sub
            # crawlers are running, the predictor can compare results and learn.
            # High number means that predictor is not very confident about the
            # `rendering_type`, low number means that predictor is very confident.
            detection_probability_recommendation=0.2,
        )

    def store_result(self, request: Request, rendering_type: RenderingType) -> None:
        # This function allows the predictor to store new learning data and retrain
        # itself if needed. `request` is the input for the prediction and
        # `rendering_type` is the correct prediction.
        self._learning_data.append((request, rendering_type))
        # retrain


def result_checker(result: RequestHandlerRunResult) -> bool:
    # Some function that inspects the produced `result` and returns `True` if the
    # result is correct.
    return bool(result)  # Check something on the result.


def result_comparator(
    result_1: RequestHandlerRunResult, result_2: RequestHandlerRunResult
) -> bool:
    # Some function that inspects two results and returns `True` if they are
    # considered equivalent. It is used when comparing results produced by the
    # HTTP-based sub crawler and the Playwright-based sub crawler.
    return (
        result_1.push_data_calls == result_2.push_data_calls
    )  # For example, compare `push_data` calls.


crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
    rendering_type_predictor=CustomRenderingTypePredictor(),
    result_checker=result_checker,
    result_comparator=result_comparator,
)
Page configuration with pre-navigation hooks

In some use cases, you may need to configure the page before it navigates to the target URL. For instance, you might set navigation timeouts or manipulate other page-level settings. For such cases you can use the pre_navigation_hook method of the AdaptivePlaywrightCrawler. This method is called before the page navigates to the target URL and allows you to configure the page instance. Due to the dynamic nature of AdaptivePlaywrightCrawler, a hook may be executed by either the HTTP-based sub crawler or the Playwright-based sub crawler. Using the page object in a hook that is executed by the HTTP-based sub crawler will raise an exception. To overcome this, you can use the optional argument playwright_only=True when registering the hook.
See the following example of how to register pre-navigation hooks:
from playwright.async_api import Route

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightPreNavCrawlingContext,
)

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()


@crawler.pre_navigation_hook
async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    """Hook executed both in the static sub crawler and the playwright sub crawler.

    Trying to access `context.page` in this hook would raise `AdaptiveContextError`
    for pages crawled without playwright.
    """
    context.log.info(f'pre navigation hook for: {context.request.url}')


@crawler.pre_navigation_hook(playwright_only=True)
async def hook_playwright(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    """Hook executed only in the playwright sub crawler."""

    async def some_routing_function(route: Route) -> None:
        await route.continue_()

    await context.page.route('*/**', some_routing_function)
    context.log.info(f'Playwright only pre navigation hook for: {context.request.url}')
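As a further illustration of the playwright_only=True variant, here is a sketch of a hook that applies the page-level settings mentioned above: it sets a navigation timeout and aborts image requests to speed up browser-based crawls. The route pattern and the resource-type check are illustrative choices, not requirements of the library:

from playwright.async_api import Route

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightPreNavCrawlingContext,
)

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()


@crawler.pre_navigation_hook(playwright_only=True)
async def configure_page(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    """Runs only when the Playwright sub crawler handles the request."""
    # Page-level setting: give navigations up to 30 seconds (value in milliseconds).
    context.page.set_default_navigation_timeout(30_000)

    async def handle_route(route: Route) -> None:
        # Abort image requests, let everything else through.
        if route.request.resource_type == 'image':
            await route.abort()
        else:
            await route.continue_()

    await context.page.route('**/*', handle_route)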