BasicCrawler

crawlee.basic_crawler.basic_crawler.BasicCrawler

Provides a simple framework for parallel crawling of web pages.

The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, which enables recursive crawling of websites.

BasicCrawler is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. For a crawler that already provides this functionality, consider using one of its subclasses.
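The idea of parallel workers fed from a dynamic queue can be illustrated with a small, self-contained sketch. It does not use crawlee and is not how BasicCrawler is implemented. The `FAKE_SITE` mapping, the `crawl` function, and the `concurrency` parameter are hypothetical names chosen for illustration, and the "download" is simulated.

```python
import asyncio

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}


async def crawl(start_urls: list[str], concurrency: int = 2) -> list[str]:
    """Visit every reachable URL once, using a dynamic queue and parallel workers."""
    queue: asyncio.Queue[str] = asyncio.Queue()
    seen: set[str] = set(start_urls)
    visited: list[str] = []
    for url in start_urls:
        queue.put_nowait(url)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                await asyncio.sleep(0)  # stand-in for the page download
                visited.append(url)
                # "Data extraction": enqueue links that have not been seen yet.
                for link in FAKE_SITE.get(url, []):
                    if link not in seen:
                        seen.add(link)
                        queue.put_nowait(link)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return visited


visited = asyncio.run(crawl(["https://example.com/"]))
print(sorted(visited))
```

Starting from a single URL, the queue grows as pages are processed, which is what makes the crawl recursive. A static list corresponds to the case where no handler ever enqueues new URLs.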

Index

Errors

error_handler

  • error_handler(handler): ErrorHandler[TCrawlingContext]
  • Decorator for configuring an error handler (called after a request handler error and before retrying).


    Parameters

    • handler: ErrorHandler[TCrawlingContext]

    Returns ErrorHandler[TCrawlingContext]
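
Because the method takes a handler and returns a handler, it can be applied with decorator syntax. The sketch below shows that registration pattern on a toy class. `MiniCrawler`, the `Handler` type, and the `(context, error)` handler signature are assumptions made for illustration and are not taken from crawlee.

```python
from typing import Awaitable, Callable, Optional

# Hypothetical handler type: receives a context dict and the raised error.
Handler = Callable[[dict, Exception], Awaitable[None]]


class MiniCrawler:
    """Toy stand-in (not crawlee's class) showing decorator-style registration."""

    def __init__(self) -> None:
        self._error_handler: Optional[Handler] = None

    def error_handler(self, handler: Handler) -> Handler:
        # Store the handler and return it unchanged, so the decorated
        # function is still usable under its original name.
        self._error_handler = handler
        return handler


crawler = MiniCrawler()


@crawler.error_handler
async def log_error(context: dict, error: Exception) -> None:
    print(f"{context['url']} failed: {error}")
```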

Constructors

__init__

  • __init__(*, request_provider, request_handler, http_client, concurrency_settings, max_request_retries, max_requests_per_crawl, max_session_rotations, configuration, request_handler_timeout, session_pool, use_session_pool, retry_on_blocked, proxy_configuration, statistics, configure_logging, _context_pipeline, _additional_context_managers, _logger): None
  • Initialize the BasicCrawler.


    Parameters

    • request_provider: RequestProvider | None = None (keyword-only)
    • request_handler: Callable[[TCrawlingContext], Awaitable[None]] | None = None (keyword-only)
    • http_client: BaseHttpClient | None = None (keyword-only)
    • concurrency_settings: ConcurrencySettings | None = None (keyword-only)
    • max_request_retries: int = 3 (keyword-only)
    • max_requests_per_crawl: int | None = None (keyword-only)
    • max_session_rotations: int = 10 (keyword-only)
    • configuration: Configuration | None = None (keyword-only)
    • request_handler_timeout: timedelta = timedelta(minutes=1) (keyword-only)
    • session_pool: SessionPool | None = None (keyword-only)
    • use_session_pool: bool = True (keyword-only)
    • retry_on_blocked: bool = True (keyword-only)
    • proxy_configuration: ProxyConfiguration | None = None (keyword-only)
    • statistics: Statistics | None = None (keyword-only)
    • configure_logging: bool = True (keyword-only)
    • _context_pipeline: ContextPipeline[TCrawlingContext] | None = None (keyword-only)
    • _additional_context_managers: Sequence[AsyncContextManager] | None = None (keyword-only)
    • _logger: logging.Logger | None = None (keyword-only)

    Returns None
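
The `request_handler_timeout` parameter is a `timedelta`. As a rough illustration of what a per-handler time limit means, the sketch below bounds an async handler with `asyncio.wait_for`. This is an assumption about the general technique, not a description of how BasicCrawler enforces the timeout. The helper `run_with_timeout` and both handlers are hypothetical.

```python
import asyncio
from datetime import timedelta


async def run_with_timeout(handler, timeout: timedelta) -> str:
    """Await `handler()` but give up once `timeout` has elapsed."""
    try:
        await asyncio.wait_for(handler(), timeout=timeout.total_seconds())
    except asyncio.TimeoutError:
        return "timed out"
    return "finished"


async def fast_handler() -> None:
    await asyncio.sleep(0)


async def slow_handler() -> None:
    await asyncio.sleep(1)


async def main() -> list[str]:
    limit = timedelta(milliseconds=50)
    return [
        await run_with_timeout(fast_handler, limit),
        await run_with_timeout(slow_handler, limit),
    ]


results = asyncio.run(main())
```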

Methods

add_requests

  • async add_requests(requests, *, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add requests to the underlying request provider in batches.


    Parameters

    • requests: Sequence[str | BaseRequestData | Request]
    • batch_size: int = 1000 (keyword-only)
    • wait_time_between_batches: timedelta = timedelta(0) (keyword-only)
    • wait_for_all_requests_to_be_added: bool = False (keyword-only)
    • wait_for_all_requests_to_be_added_timeout: timedelta | None = None (keyword-only)

    Returns None
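
To make the roles of `batch_size` and `wait_time_between_batches` concrete, here is a minimal sketch of batched insertion into a plain list. It is not crawlee's implementation. The function `add_in_batches` and the `sink` argument are hypothetical, and the real method writes to a request provider rather than a list.

```python
import asyncio
from datetime import timedelta
from typing import Sequence


async def add_in_batches(
    requests: Sequence[str],
    sink: list[list[str]],
    *,
    batch_size: int = 1000,
    wait_time_between_batches: timedelta = timedelta(0),
) -> None:
    """Append `requests` to `sink` one batch at a time, pausing between batches."""
    for start in range(0, len(requests), batch_size):
        if start > 0:
            # Pause only between batches, not before the first one.
            await asyncio.sleep(wait_time_between_batches.total_seconds())
        sink.append(list(requests[start:start + batch_size]))


stored: list[list[str]] = []
urls = [f"https://example.com/page/{i}" for i in range(5)]
asyncio.run(add_in_batches(urls, stored, batch_size=2))
```

With five URLs and a batch size of two, the requests are split into three batches, and the original order is preserved.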

export_data

  • async export_data(path, content_type, dataset_id, dataset_name): None
  • Export data from a dataset.

    This helper method simplifies the process of exporting data from a dataset. It opens the specified dataset and then exports the data based on the provided parameters.


    Parameters

    • path: str | Path
    • content_type: Literal['json', 'csv'] | None = None
    • dataset_id: str | None = None
    • dataset_name: str | None = None

    Returns None
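
The sketch below shows one way an export with an optional `content_type` could behave, using only the standard library. It is an illustration, not crawlee's code. The function `export_items` is hypothetical, and inferring the format from the file extension when `content_type` is `None` is an assumption.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Literal, Optional, Union


def export_items(
    items: list[dict],
    path: Union[str, Path],
    content_type: Optional[Literal["json", "csv"]] = None,
) -> None:
    """Write `items` to `path` as JSON or CSV."""
    path = Path(path)
    if content_type is None:
        # Assumption: fall back to the file extension, defaulting to JSON.
        content_type = "csv" if path.suffix == ".csv" else "json"

    if content_type == "json":
        path.write_text(json.dumps(items, indent=2), encoding="utf-8")
    else:
        with path.open("w", newline="", encoding="utf-8") as f:
            # Column names are taken from the first item.
            writer = csv.DictWriter(f, fieldnames=list(items[0]) if items else [])
            writer.writeheader()
            writer.writerows(items)


items = [{"url": "https://example.com/", "title": "Example"}]
out_dir = Path(tempfile.mkdtemp())
export_items(items, out_dir / "data.json")
export_items(items, out_dir / "data.csv")
```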

failed_request_handler

  • failed_request_handler(handler): FailedRequestHandler[TCrawlingContext]
  • Decorator for configuring a failed request handler (called after max retries are reached).


    Parameters

    • handler: FailedRequestHandler[TCrawlingContext]

    Returns FailedRequestHandler[TCrawlingContext]
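
The relationship between the error handler (called before each retry) and the failed request handler (called once retries are exhausted) can be sketched as a simple loop. This is a simplified model, not crawlee's implementation. The function `process_with_retries` and the handler signatures are hypothetical, and treating `max_request_retries` as the number of retries after the first attempt is an assumption.

```python
import asyncio

events: list[str] = []


async def process_with_retries(
    url, request_handler, error_handler, failed_request_handler, max_request_retries=3
):
    """Call `request_handler`, retrying on errors; return the number of attempts made."""
    attempts = 0
    while True:
        attempts += 1
        try:
            await request_handler(url)
            return attempts
        except Exception as error:
            retries_used = attempts - 1
            if retries_used >= max_request_retries:
                # Retries are exhausted: hand over to the failed request handler.
                await failed_request_handler(url, error)
                return attempts
            # A retry is still allowed: the error handler runs first.
            await error_handler(url, error)


async def always_fails(url: str) -> None:
    raise RuntimeError("boom")


async def on_error(url: str, error: Exception) -> None:
    events.append("error")


async def on_failed(url: str, error: Exception) -> None:
    events.append("failed")


attempts = asyncio.run(
    process_with_retries(
        "https://example.com/", always_fails, on_error, on_failed, max_request_retries=2
    )
)
```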

get_data

  • async get_data(dataset_id, dataset_name, kwargs): DatasetItemsListPage
  • Retrieve data from a dataset.

    This helper method simplifies the process of retrieving data from a dataset. It opens the specified dataset and then retrieves the data based on the provided parameters.


    Parameters

    • dataset_id: str | None = None
    • dataset_name: str | None = None
    • kwargs: Unpack[GetDataKwargs]

    Returns DatasetItemsListPage

get_dataset

  • async get_dataset(*, id, name): Dataset
  • Return the dataset with the given ID or name. If none is provided, return the default dataset.


    Parameters

    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)

    Returns Dataset

get_key_value_store

  • async get_key_value_store(*, id, name): KeyValueStore
  • Return the key-value store with the given ID or name. If none is provided, return the default key-value store.


    Parameters

    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)

    Returns KeyValueStore

get_request_provider

  • async get_request_provider(*, id, name): RequestProvider
  • Return the configured request provider. If none is configured, open and return the default request queue.


    Parameters

    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)

    Returns RequestProvider

router

  • router(router): None
  • Set the router used to handle each individual crawling request.


    Parameters

    • router: Router[TCrawlingContext]

    Returns None

run

  • async run(requests): FinalStatistics
  • Run the crawler until all requests are processed.


    Parameters

    • requests: Sequence[str | BaseRequestData | Request] | None = None

    Returns FinalStatistics

Properties

router

router: Router[TCrawlingContext]

The router used to handle each individual crawling request.
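
As a rough picture of what "handling each individual request" through a router can look like, the toy class below dispatches a request to a handler chosen by a label, falling back to a default handler. `MiniRouter`, its method names, and the label-based dispatch rule are assumptions for illustration and should not be read as the definition of crawlee's `Router`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

Handler = Callable[[dict], Awaitable[None]]


class MiniRouter:
    """Toy router (hypothetical, not crawlee's Router): picks a handler per request."""

    def __init__(self) -> None:
        self._default: Optional[Handler] = None
        self._by_label: Dict[str, Handler] = {}

    def default_handler(self, handler: Handler) -> Handler:
        self._default = handler
        return handler

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        def register(handler: Handler) -> Handler:
            self._by_label[label] = handler
            return handler

        return register

    async def __call__(self, context: dict) -> None:
        # Use the handler registered for the request's label, else the default.
        handler = self._by_label.get(context.get("label"), self._default)
        if handler is None:
            raise RuntimeError("no handler registered for this request")
        await handler(context)


router = MiniRouter()
handled: list[tuple[str, str]] = []


@router.default_handler
async def handle_default(context: dict) -> None:
    handled.append(("default", context["url"]))


@router.handler("detail")
async def handle_detail(context: dict) -> None:
    handled.append(("detail", context["url"]))


async def main() -> None:
    await router({"url": "https://example.com/"})
    await router({"url": "https://example.com/item/1", "label": "detail"})


asyncio.run(main())
```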

statistics

statistics: Statistics[StatisticsState]

Statistics about the current (or last) crawler run.