
RobotsTxtFile

Index

Methods

__init__

  • __init__(url, robots, http_client, proxy_info): None
  • Parameters

    • url: str
    • robots: Protego
    • optional http_client: HttpClient | None = None
    • optional proxy_info: ProxyInfo | None = None

    Returns None
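
The constructor is normally not called directly; the classmethods below are the usual entry points. A minimal sketch of direct construction, assuming the protego package and an import path for RobotsTxtFile that may differ between crawlee versions:

```python
# Sketch only: the RobotsTxtFile import path is an assumption.
from crawlee._utils.robots import RobotsTxtFile
from protego import Protego

robots = Protego.parse('User-agent: *\nDisallow: /private/')
robots_file = RobotsTxtFile(url='https://example.com/robots.txt', robots=robots)
```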

find

  • async find(url, http_client, proxy_info): Self
  • Determine the location of a robots.txt file for a URL and fetch it.


    Parameters

    • url: str

      The URL whose domain will be used to find the corresponding robots.txt file.

    • http_client: HttpClient

      The HttpClient instance used to perform the network request for fetching the robots.txt file.

    • optional proxy_info: ProxyInfo | None = None

      Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.

    Returns Self
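
A minimal usage sketch for find, assuming crawlee's HttpxHttpClient and the import path shown (both may vary between versions):

```python
import asyncio

from crawlee._utils.robots import RobotsTxtFile  # assumed import path
from crawlee.http_clients import HttpxHttpClient  # assumed HttpClient implementation


async def main() -> None:
    http_client = HttpxHttpClient()
    # Resolves and fetches https://example.com/robots.txt based on the page URL.
    robots_file = await RobotsTxtFile.find('https://example.com/some/page', http_client)
    print(robots_file.is_allowed('https://example.com/some/page'))


asyncio.run(main())
```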

from_content

  • async from_content(url, content): Self
  • Create a RobotsTxtFile instance from the given content.


    Parameters

    • url: str

      The URL associated with the robots.txt file.

    • content: str

      The raw string content of the robots.txt file to be parsed.

    Returns Self
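
A minimal sketch of parsing raw robots.txt content without any network request; the import path is an assumption:

```python
import asyncio

from crawlee._utils.robots import RobotsTxtFile  # assumed import path

ROBOTS_TXT = (
    'User-agent: *\n'
    'Disallow: /admin/\n'
    'Crawl-delay: 5\n'
    'Sitemap: https://example.com/sitemap.xml\n'
)


async def main() -> None:
    robots_file = await RobotsTxtFile.from_content('https://example.com/robots.txt', ROBOTS_TXT)
    print(robots_file.is_allowed('https://example.com/admin/settings'))  # False


asyncio.run(main())
```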

get_crawl_delay

  • get_crawl_delay(user_agent): int | None
  • Get the crawl delay for the given user agent.


    Parameters

    • optional user_agent: str = '*'


      The user-agent string to check the crawl delay for. Defaults to '*' which matches any user-agent.

    Returns int | None
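
A short sketch, continuing from the from_content example above (so robots_file is already built); robots.txt crawl delays are conventionally interpreted as seconds:

```python
# robots_file comes from the from_content example above.
delay = robots_file.get_crawl_delay('my-bot')   # rules for a specific agent
default_delay = robots_file.get_crawl_delay()   # the '*' rules
if default_delay is not None:
    print(f'Waiting {default_delay} seconds between requests')
```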

get_sitemaps

  • get_sitemaps(*, enqueue_strategy): list[str]
  • Get the list of sitemap URLs from the robots.txt file, filtered by enqueue strategy.


    Parameters

    • keyword-only enqueue_strategy: EnqueueStrategy

      Strategy used to filter sitemap entries relative to the robots.txt URL's host. Pass 'same-hostname' to match the sitemap protocol's same-host expectation, or 'all' to disable host filtering. Regardless of the strategy, entries with non-http(s) schemes are always filtered out.

    Returns list[str]
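
A short sketch of the host filtering, continuing from the from_content example above; the string values follow the EnqueueStrategy options mentioned in the parameter description:

```python
# robots_file comes from the from_content example above.
same_host = robots_file.get_sitemaps(enqueue_strategy='same-hostname')  # only example.com sitemaps
everything = robots_file.get_sitemaps(enqueue_strategy='all')           # no host filtering
print(same_host, everything)
```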

is_allowed

  • is_allowed(url, user_agent): bool
  • Check if the given URL is allowed for the given user agent.


    Parameters

    • url: str

      The URL to check against the robots.txt rules.

    • optional user_agent: str = '*'

      The user-agent string to check permissions for. Defaults to '*' which matches any user-agent.

    Returns bool
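
A short sketch, continuing from the from_content example above:

```python
# robots_file comes from the from_content example above.
print(robots_file.is_allowed('https://example.com/products'))                       # True
print(robots_file.is_allowed('https://example.com/admin/users', user_agent='bot'))  # False, '*' rules apply
```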

load

  • async load(url, http_client, proxy_info): Self
  • Load the robots.txt file for a given URL.


    Parameters

    • url: str

      The direct URL of the robots.txt file to be loaded.

    • http_client: HttpClient

      The HttpClient instance used to perform the network request for fetching the robots.txt file.

    • optional proxy_info: ProxyInfo | None = None

      Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.

    Returns Self
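
A minimal usage sketch for load, again assuming crawlee's HttpxHttpClient and the import path shown:

```python
import asyncio

from crawlee._utils.robots import RobotsTxtFile  # assumed import path
from crawlee.http_clients import HttpxHttpClient  # assumed HttpClient implementation


async def main() -> None:
    http_client = HttpxHttpClient()
    # Unlike find, load expects the robots.txt URL itself.
    robots_file = await RobotsTxtFile.load('https://example.com/robots.txt', http_client)
    print(robots_file.get_crawl_delay())


asyncio.run(main())
```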

parse_sitemaps

  • async parse_sitemaps(*, enqueue_strategy): Sitemap
  • Parse the sitemaps from the robots.txt file and return a Sitemap instance.


    Parameters

    • keyword-only enqueue_strategy: EnqueueStrategy

      Forwarded to get_sitemaps; see that method for details.

    Returns Sitemap
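
A short sketch inside an async context, continuing from the from_content example above; only the call itself is shown, since the Sitemap API is documented separately:

```python
# Inside an async function; robots_file comes from the from_content example above.
sitemap = await robots_file.parse_sitemaps(enqueue_strategy='same-hostname')
```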

parse_urls_from_sitemaps

  • async parse_urls_from_sitemaps(*, enqueue_strategy): list[str]
  • Parse the sitemaps referenced in the robots.txt file and return a list of URLs.


    Parameters

    • keyword-only enqueue_strategy: EnqueueStrategy

      Forwarded to get_sitemaps; see that method for details.

    Returns list[str]
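
A short sketch inside an async context, continuing from the from_content example above; fetching the sitemap contents requires network access:

```python
# Inside an async function; robots_file comes from the from_content example above.
urls = await robots_file.parse_urls_from_sitemaps(enqueue_strategy='all')
print(f'Discovered {len(urls)} URLs from sitemaps')
```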