RobotsTxtFile
Index
Methods
__init__
Parameters
url: str
robots: Protego
Returns None
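A minimal construction sketch. The import path and the use of Protego.parse are assumptions based on the signature above; the module housing RobotsTxtFile is internal and may differ between versions.

```python
from protego import Protego

from crawlee._utils.robots import RobotsTxtFile  # assumed internal path

content = "User-agent: *\nDisallow: /private/\n"

# Parse the raw rules with Protego, then wrap them in a RobotsTxtFile.
robots = Protego.parse(content)
robots_file = RobotsTxtFile("https://example.com/robots.txt", robots)
```

In practice the from_content and load class methods below are the more direct way to obtain an instance.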
find
Determine the location of a robots.txt file for a URL and fetch it.
Parameters
url: str
The URL whose domain will be used to find the corresponding robots.txt file.
http_client: HttpClient
The HttpClient instance used to perform the network request for fetching the robots.txt file.
optional proxy_info: ProxyInfo | None = None
Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.
Returns Self
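A hedged usage sketch for find, assuming it is an awaitable classmethod and that crawlee's HttpxHttpClient satisfies the HttpClient interface; the RobotsTxtFile import path is an assumption.

```python
import asyncio

from crawlee.http_clients import HttpxHttpClient
from crawlee._utils.robots import RobotsTxtFile  # assumed internal path


async def main() -> None:
    http_client = HttpxHttpClient()

    # find() derives the robots.txt location from the page URL's domain
    # (here https://example.com/robots.txt) and fetches it.
    robots_file = await RobotsTxtFile.find('https://example.com/some/page', http_client)

    print(robots_file.is_allowed('https://example.com/some/page'))


asyncio.run(main())
```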
from_content
Create a RobotsTxtFile instance from the given content.
Parameters
url: str
The URL associated with the robots.txt file.
content: str
The raw string content of the robots.txt file to be parsed.
Returns Self
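A small sketch of building the file from an in-memory string, assuming from_content is a synchronous classmethod (add await if your version is asynchronous) and using the same assumed import path as above.

```python
from crawlee._utils.robots import RobotsTxtFile  # assumed internal path

content = """
User-agent: *
Disallow: /admin/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""

# Parse the rules without any network request.
robots_file = RobotsTxtFile.from_content('https://example.com/robots.txt', content)
```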
get_crawl_delay
Get the crawl delay for the given user agent.
Parameters
optional user_agent: str = '*'
The user-agent string to check the crawl delay for. Defaults to '*' which matches any user-agent.
Returns int | None
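A short sketch of the crawl-delay lookup, reusing from_content (assumed synchronous, as above).

```python
from crawlee._utils.robots import RobotsTxtFile  # assumed internal path

robots_file = RobotsTxtFile.from_content(
    'https://example.com/robots.txt',
    'User-agent: *\nCrawl-delay: 2\n',
)

# Returns the Crawl-delay value for the matching user agent, or None if none is declared.
print(robots_file.get_crawl_delay())             # delay from the '*' group
print(robots_file.get_crawl_delay('MyCrawler'))  # no 'MyCrawler' group here, so the '*' rules apply
```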
get_sitemaps
Get the list of sitemap URLs declared in the robots.txt file.
Returns list[str]
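Listing declared sitemaps, again via from_content (assumed synchronous) and the assumed import path.

```python
from crawlee._utils.robots import RobotsTxtFile  # assumed internal path

robots_file = RobotsTxtFile.from_content(
    'https://example.com/robots.txt',
    'User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n',
)

# Sitemap: URLs declared in the file; an empty list if none are present.
for sitemap_url in robots_file.get_sitemaps():
    print(sitemap_url)
```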
is_allowed
Check if the given URL is allowed for the given user agent.
Parameters
url: str
The URL to check against the robots.txt rules.
optional user_agent: str = '*'
The user-agent string to check permissions for. Defaults to '*' which matches any user-agent.
Returns bool
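A permission check sketch under the same assumptions (synchronous from_content, assumed import path).

```python
from crawlee._utils.robots import RobotsTxtFile  # assumed internal path

robots_file = RobotsTxtFile.from_content(
    'https://example.com/robots.txt',
    'User-agent: *\nDisallow: /admin/\n',
)

# A disallowed path returns False; anything not matched by a rule returns True.
print(robots_file.is_allowed('https://example.com/admin/settings'))  # False
print(robots_file.is_allowed('https://example.com/blog/post'))       # True
```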
load
Load the robots.txt file for a given URL.
Parameters
url: str
The direct URL of the robots.txt file to be loaded.
http_client: HttpClient
The HttpClient instance used to perform the network request for fetching the robots.txt file.
optional proxy_info: ProxyInfo | None = None
Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.
Returns Self
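A hedged sketch for load, assuming it is an awaitable classmethod that takes the direct robots.txt URL; imports and async behaviour are assumptions, as above.

```python
import asyncio

from crawlee.http_clients import HttpxHttpClient
from crawlee._utils.robots import RobotsTxtFile  # assumed internal path


async def main() -> None:
    http_client = HttpxHttpClient()

    # Unlike find(), load() expects the robots.txt URL itself, not a page URL.
    robots_file = await RobotsTxtFile.load('https://example.com/robots.txt', http_client)
    print(robots_file.get_sitemaps())


asyncio.run(main())
```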