BeautifulSoupCrawler

A crawler that fetches the request URL using httpx and parses the result with BeautifulSoup.

Constructors

__init__

  • __init__(*, parser, additional_http_error_status_codes, ignore_http_error_status_codes, kwargs): None
  • Initialize the BeautifulSoupCrawler.


    Parameters

    • keyword-only parser: Literal['html.parser', 'lxml', 'xml', 'html5lib'] = 'lxml'

      The type of parser that should be used by BeautifulSoup.

    • keyword-only additional_http_error_status_codes: Iterable[int] = ()

      HTTP status codes that should be considered errors (and trigger a retry).

    • keyword-only ignore_http_error_status_codes: Iterable[int] = ()

      HTTP status codes that are normally considered errors but should be treated as successful responses instead.

    • kwargs: Unpack[BasicCrawlerOptions[BeautifulSoupCrawlingContext]]

      Arguments to be forwarded to the underlying BasicCrawler.

    Returns None
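
    The sketch below shows a typical way to construct and run the crawler; the import path and the example URL are assumptions and may differ between crawlee versions.

        import asyncio

        from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


        async def main() -> None:
            # Treat 403 as a retryable error and 404 as a successful response.
            crawler = BeautifulSoupCrawler(
                parser='lxml',
                additional_http_error_status_codes=[403],
                ignore_http_error_status_codes=[404],
            )

            @crawler.router.default_handler
            async def handler(context: BeautifulSoupCrawlingContext) -> None:
                # context.soup is the parsed BeautifulSoup document.
                title = context.soup.title.string if context.soup.title else None
                await context.push_data({'url': context.request.url, 'title': title})

            await crawler.run(['https://crawlee.dev'])


        if __name__ == '__main__':
            asyncio.run(main())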

Methods

add_requests

  • async add_requests(requests, *, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add requests to the underlying request provider in batches.


    Parameters

    • requests: Sequence[str | Request]

      A list of requests to add to the queue.

    • keyword-only batch_size: int = 1000

      The number of requests to add in one batch.

    • keyword-only wait_time_between_batches: timedelta = timedelta(0)

      Time to wait between adding batches.

    • keyword-only wait_for_all_requests_to_be_added: bool = False

      If True, wait for all requests to be added before returning.

    • keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None

      Timeout for waiting for all requests to be added.

    Returns None
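
    For example, assuming an already constructed crawler instance (the URLs are placeholders):

        from datetime import timedelta

        # Enqueue seed URLs in smaller batches, pausing briefly between batches.
        await crawler.add_requests(
            ['https://crawlee.dev', 'https://crawlee.dev/docs'],
            batch_size=500,
            wait_time_between_batches=timedelta(seconds=1),
        )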

error_handler

  • error_handler(handler): ErrorHandler[TCrawlingContext]
  • Decorator for configuring an error handler (called after a request handler error and before retrying).


    Parameters

    • handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]

    Returns ErrorHandler[TCrawlingContext]
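
    A minimal sketch of registering an error handler on an existing crawler instance (the handler name is arbitrary):

        @crawler.error_handler
        async def retry_hook(context, error: Exception) -> None:
            # Called after a request handler error, before the request is retried.
            context.log.warning(f'Retrying {context.request.url} after: {error!r}')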

export_data

  • async export_data(path, content_type, dataset_id, dataset_name): None
  • Export data from a dataset.

    This helper method simplifies the process of exporting data from a dataset. It opens the specified dataset and then exports the data based on the provided parameters.


    Parameters

    • path: str | Path

      The destination path.

    • content_type: Literal['json', 'csv'] | None = None

      The output format.

    • dataset_id: str | None = None

      The ID of the dataset.

    • dataset_name: str | None = None

      The name of the dataset.

    Returns None
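
    For example, exporting the default dataset in both supported formats (the file names are placeholders):

        await crawler.export_data('results.json', content_type='json')
        await crawler.export_data('results.csv', content_type='csv')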

failed_request_handler

  • failed_request_handler(handler): FailedRequestHandler[TCrawlingContext]
  • Decorator for configuring a failed request handler (called after max retries are reached).


    Parameters

    • handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]

    Returns FailedRequestHandler[TCrawlingContext]
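
    A minimal sketch of registering a failed request handler on an existing crawler instance:

        @crawler.failed_request_handler
        async def on_failed(context, error: Exception) -> None:
            # Called once all retries for a request have been exhausted.
            context.log.error(f'Giving up on {context.request.url}: {error!r}')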

get_data

  • async get_data(dataset_id, dataset_name, kwargs): DatasetItemsListPage
  • Retrieve data from a dataset.

    This helper method simplifies the process of retrieving data from a dataset. It opens the specified dataset and then retrieves the data based on the provided parameters.


    Parameters

    • dataset_id: str | None = None

      The ID of the dataset.

    • dataset_name: str | None = None

      The name of the dataset.

    • kwargs: Unpack[GetDataKwargs]

      Keyword arguments to be passed to the dataset's get_data method.

    Returns DatasetItemsListPage

    The retrieved data.
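
    For example, reading items from the default dataset after a run (the items attribute holds the page's list of records):

        page = await crawler.get_data()
        for item in page.items:
            print(item)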

get_dataset

  • async get_dataset(*, id, name): Dataset
  • Return the dataset with the given ID or name. If none is provided, return the default dataset.


    Parameters

    • keyword-only id: str | None = None
    • keyword-only name: str | None = None

    Returns Dataset
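
    For example, opening a named dataset and pushing an item to it directly (the dataset name is a placeholder):

        dataset = await crawler.get_dataset(name='my-results')
        await dataset.push_data({'url': 'https://crawlee.dev'})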

get_key_value_store

  • async get_key_value_store(*, id, name): KeyValueStore
  • Return the key-value store with the given ID or name. If none is provided, return the default key-value store.


    Parameters

    • keyword-only id: str | None = None
    • keyword-only name: str | None = None

    Returns KeyValueStore
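
    For example, persisting and reading back a value in the default key-value store (the key and value are placeholders):

        kvs = await crawler.get_key_value_store()
        await kvs.set_value('run-info', {'finished': True})
        run_info = await kvs.get_value('run-info')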

get_request_provider

  • async get_request_provider(*, id, name): RequestProvider
  • Return the configured request provider. If none is configured, open and return the default request queue.


    Parameters

    • keyword-only id: str | None = None
    • keyword-only name: str | None = None

    Returns RequestProvider
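
    A sketch of adding a request to the provider directly; the Request import path is an assumption and may vary between crawlee versions:

        from crawlee.models import Request  # in some versions exported from the crawlee package root

        provider = await crawler.get_request_provider()
        await provider.add_request(Request.from_url('https://crawlee.dev/docs'))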

router

  • router(router): None
  • Set the router used to handle each individual crawling request.


    Parameters

    • router: Router[TCrawlingContext]

    Returns None
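
    A sketch of building a Router with labelled handlers and attaching it to the crawler; the import paths are assumptions and may differ between crawlee versions:

        from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
        from crawlee.router import Router

        router = Router[BeautifulSoupCrawlingContext]()

        @router.default_handler
        async def list_handler(context: BeautifulSoupCrawlingContext) -> None:
            # Enqueue detail pages under a label handled below.
            await context.enqueue_links(label='DETAIL')

        @router.handler('DETAIL')
        async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
            await context.push_data({'url': context.request.url})

        crawler = BeautifulSoupCrawler()
        crawler.router = router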

run

  • async run(requests, *, purge_request_queue): FinalStatistics
  • Run the crawler until all requests are processed.


    Parameters

    • requests: Sequence[str | Request] | None = None

      The requests to be enqueued before the crawler starts.

    • keyword-only purge_request_queue: bool = True

      If this is True and the crawler is not being run for the first time, the default request queue will be purged.

    Returns FinalStatistics
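
    For example, seeding the crawler and keeping the request queue from a previous run (the statistics field names shown are typical but not guaranteed for every version):

        stats = await crawler.run(
            ['https://crawlee.dev'],
            purge_request_queue=False,
        )
        print(stats.requests_finished, stats.requests_failed)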

Properties

log

log: logging.Logger

The logger used by the crawler.

router

router: Router[TCrawlingContext]

The router used to handle each individual crawling request.

statistics

statistics: Statistics[StatisticsState]

Statistics about the current (or last) crawler run.
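
Both properties can be used directly; for example (the exact fields of the statistics state may vary by version):

    crawler.log.info('Configured %s', type(crawler).__name__)
    print(crawler.statistics.state)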