RobotsTxtFile

Methods

__init__

  • __init__(url, robots): None
  • Parameters

    • url: str
    • robots: Protego

    Returns None
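
    Direct construction is rarely needed, since the find, from_content, and load factories below cover the usual cases. A minimal sketch with a Protego parser, assuming the class is importable from crawlee._utils.robots (the import path is not shown on this page):

    ```python
    from protego import Protego

    from crawlee._utils.robots import RobotsTxtFile  # assumed import path

    # Protego.parse() builds the parser instance that RobotsTxtFile wraps.
    parser = Protego.parse('User-agent: *\nDisallow: /admin/\n')
    robots = RobotsTxtFile('https://example.com/robots.txt', parser)
    ```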

find

  • async find(url, http_client, proxy_info): Self
  • Determine the location of a robots.txt file for a URL and fetch it.


    Parameters

    • url: str

      The URL whose domain will be used to find the corresponding robots.txt file.

    • http_client: HttpClient

      The HttpClient instance used to perform the network request for fetching the robots.txt file.

    • optional proxy_info: ProxyInfo | None = None

      Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.

    Returns Self
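
    A hedged sketch of find in use, assuming the import path crawlee._utils.robots and crawlee's HttpxHttpClient; the example.com URLs are hypothetical:

    ```python
    import asyncio

    from crawlee._utils.robots import RobotsTxtFile  # assumed import path
    from crawlee.http_clients import HttpxHttpClient


    async def main() -> None:
        # find() derives https://example.com/robots.txt from the page URL,
        # then fetches and parses it.
        robots = await RobotsTxtFile.find('https://example.com/some/page', HttpxHttpClient())
        print(robots.is_allowed('https://example.com/some/page'))


    asyncio.run(main())
    ```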

from_content

  • async from_content(url, content): Self
  • Create a RobotsTxtFile instance from the given content.


    Parameters

    • url: str

      The URL associated with the robots.txt file.

    • content: str

      The raw string content of the robots.txt file to be parsed.

    Returns Self
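
    A sketch of parsing raw content without any network request (import path assumed as above); the later examples on this page reuse the robots instance and rules defined here:

    ```python
    import asyncio

    from crawlee._utils.robots import RobotsTxtFile  # assumed import path

    ROBOTS_TXT = (
        'User-agent: *\n'
        'Disallow: /private/\n'
        'Crawl-delay: 5\n'
        'Sitemap: https://example.com/sitemap.xml\n'
    )


    async def main() -> None:
        robots = await RobotsTxtFile.from_content('https://example.com/robots.txt', ROBOTS_TXT)
        print(robots.is_allowed('https://example.com/private/page'))  # False


    asyncio.run(main())
    ```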

get_crawl_delay

  • get_crawl_delay(user_agent): int | None
  • Get the crawl delay for the given user agent.


    Parameters

    • optional user_agent: str = '*'

      The user-agent string to check the crawl delay for. Defaults to '*' which matches any user-agent.

    Returns int | None
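
    Continuing inside main() from the from_content sketch above, where the '*' group declares Crawl-delay: 5:

    ```python
    print(robots.get_crawl_delay())         # 5, from the '*' group
    print(robots.get_crawl_delay('MyBot'))  # also 5: 'MyBot' has no group of its own, so '*' applies
    ```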

get_sitemaps

  • get_sitemaps(): list[str]
  • Get the list of sitemap URLs from the robots.txt file.


    Returns list[str]
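
    Continuing the same sketch, which declared a single Sitemap directive:

    ```python
    for sitemap_url in robots.get_sitemaps():
        print(sitemap_url)  # https://example.com/sitemap.xml
    ```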

is_allowed

  • is_allowed(url, user_agent): bool
  • Check if the given URL is allowed for the given user agent.


    Parameters

    • url: str

      The URL to check against the robots.txt rules.

    • optional user_agent: str = '*'

      The user-agent string to check permissions for. Defaults to '*' which matches any user-agent.

    Returns bool
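
    Continuing the same sketch, whose only rule is Disallow: /private/ for all user agents:

    ```python
    print(robots.is_allowed('https://example.com/private/data'))  # False: matches Disallow: /private/
    print(robots.is_allowed('https://example.com/index.html'))    # True: no rule matches
    ```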

load

  • async load(url, http_client, proxy_info): Self
  • Load the robots.txt file for a given URL.


    Parameters

    • url: str

      The direct URL of the robots.txt file to be loaded.

    • http_client: HttpClient

      The HttpClient instance used to perform the network request for fetching the robots.txt file.

    • optional proxy_info: ProxyInfo | None = None

      Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.

    Returns Self
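
    A sketch mirroring the find example, but with the direct robots.txt URL (imports assumed as above):

    ```python
    import asyncio

    from crawlee._utils.robots import RobotsTxtFile  # assumed import path
    from crawlee.http_clients import HttpxHttpClient


    async def main() -> None:
        # Unlike find(), load() expects the robots.txt URL itself, not a page URL.
        robots = await RobotsTxtFile.load('https://example.com/robots.txt', HttpxHttpClient())
        print(robots.get_sitemaps())


    asyncio.run(main())
    ```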