BeautifulSoupCrawlingContext

The crawling context used by the BeautifulSoupCrawler.

It provides access to key objects as well as utility functions for handling crawling tasks.

Hierarchy

ParsedHttpCrawlingContext
- BeautifulSoupCrawlingContext

Methods

__hash__

__hash__(): int

Inherited from BasicCrawlingContext.__hash__
Return the hash of the context. Each context instance is considered unique.
Returns int
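
To illustrate what "each context is considered unique" can mean in practice, here is a minimal sketch of identity-based hashing. `DemoContext` is a hypothetical stand-in, not the library's class, and the use of `id(self)` is an assumption about how such uniqueness could be achieved.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a crawling context; not the library's class.
# eq=False keeps the default identity-based equality, so equality and
# hashing stay consistent with each other.
@dataclass(eq=False)
class DemoContext:
    url: str

    def __hash__(self) -> int:
        # Assumption: uniqueness is modelled by hashing on object identity.
        return id(self)


a = DemoContext("https://example.com")
b = DemoContext("https://example.com")

# Two contexts with identical field values still count as distinct entries.
unique_contexts = {a, b}
```

With this approach, two contexts for the same URL can both live in one set or serve as separate dictionary keys.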

from_basic_crawling_context

from_basic_crawling_context(context, http_response): Self

Inherited from HttpCrawlingContext.from_basic_crawling_context
Initialize a new instance from an existing BasicCrawlingContext.
Parameters
- context: BasicCrawlingContext
- http_response: HttpResponse
Returns Self
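
The `from_*` constructors above follow a common pattern: build a richer context from a simpler one by carrying over its fields and adding the new ones. The sketch below shows that pattern with hypothetical dataclasses (`DemoBasicContext`, `DemoHttpContext`); the field names and the copying strategy are assumptions for illustration, not the library's implementation.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Hypothetical base context; the real BasicCrawlingContext has other fields.
@dataclass(frozen=True)
class DemoBasicContext:
    request_url: str
    session_id: Optional[str]


# Hypothetical derived context that adds the HTTP response.
@dataclass(frozen=True)
class DemoHttpContext(DemoBasicContext):
    http_response: bytes

    @classmethod
    def from_basic_context(
        cls, context: DemoBasicContext, http_response: bytes
    ) -> "DemoHttpContext":
        # Copy every field of the base context, then add the new field.
        base_values = {f.name: getattr(context, f.name) for f in fields(context)}
        return cls(http_response=http_response, **base_values)


basic = DemoBasicContext(request_url="https://example.com", session_id=None)
http_context = DemoHttpContext.from_basic_context(basic, b"<html></html>")
```

Because the derived class is built through `cls`, a subclass calling the same classmethod gets an instance of its own type, which matches the `Self` return type in the signatures above.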

from_http_crawling_context

from_http_crawling_context(context, parsed_content, enqueue_links, extract_links): Self

Inherited from ParsedHttpCrawlingContext.from_http_crawling_context
Initialize a new instance from an existing HttpCrawlingContext.
Parameters
- context: HttpCrawlingContext
- parsed_content: TParseResult
- enqueue_links: EnqueueLinksFunction
- extract_links: ExtractLinksFunction
Returns Self

from_parsed_http_crawling_context

from_parsed_http_crawling_context(context): Self

Initialize a new instance from an existing ParsedHttpCrawlingContext.
Parameters
- context: ParsedHttpCrawlingContext[BeautifulSoup]
Returns Self

get_snapshot

async get_snapshot(): PageSnapshot

Inherited from HttpCrawlingContext.get_snapshot
Overrides BasicCrawlingContext.get_snapshot
Get a snapshot of the crawled page.
Returns PageSnapshot

html_to_text

html_to_text(): str

Convert the parsed HTML content to newline-separated plain text without tags.
Returns str
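
As a rough illustration of what "newline-separated plain text without tags" looks like, the sketch below does a similar conversion using only the standard library. `demo_html_to_text` and `_TextCollector` are hypothetical names, and skipping `script` and `style` content is an assumption; the library method works on the parsed BeautifulSoup object and may treat whitespace and block elements differently.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text chunks, ignoring script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def demo_html_to_text(html: str) -> str:
    """Return the text chunks of an HTML string joined by newlines."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.chunks)


sample = "<h1>Title</h1><p>Hello <b>world</b></p><script>var x = 1;</script>"
text = demo_html_to_text(sample)
```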

Properties

add_requests

add_requests: AddRequestsFunction

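Helper function for adding requests to be crawled.

enqueue_links

enqueue_links: EnqueueLinksFunction

extract_links

extract_links: ExtractLinksFunction

get_key_value_store

Helper function for getting a key-value store.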

http_response

http_response: HttpResponse

The HTTP response received from the server.

log

log: logging.Logger

Logger instance.

parsed_content

parsed_content: TParseResult

proxy_info

proxy_info: ProxyInfo | None

Proxy information for the current page being processed.

push_data

push_data: PushDataFunction

Helper function for pushing data from the crawling context.

request

request: Request

Request object for the current page being processed.

send_request

send_request: SendRequestFunction

Helper function for sending a request from the crawling context.

session

session: Session | None

Session object for the current page being processed.

soup

soup: BeautifulSoup

Convenience alias for parsed_content.
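
A convenience alias of this kind is usually a read-only property that returns another attribute. The sketch below assumes the alias points at `parsed_content`; `DemoParsedContext` is a hypothetical class, and a plain string stands in for the parsed document.

```python
from dataclasses import dataclass


# Hypothetical context; a string stands in for the parsed BeautifulSoup object.
@dataclass(frozen=True)
class DemoParsedContext:
    parsed_content: str

    @property
    def soup(self) -> str:
        # Assumption: the alias simply returns the parsed content.
        return self.parsed_content


ctx = DemoParsedContext(parsed_content="<html></html>")
```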

use_state

use_state: UseStateFunction

Helper function for using state from the crawling context.
