RobotsTxtFile

Methods

__init__

  • __init__(url, robots): None
  • Parameters

    • url: str
    • robots: Protego

    Returns None
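
    Direct construction is rarely needed, since the find, from_content, and load factories below cover the usual cases. A minimal sketch with a Protego parser, assuming the class is importable from crawlee._utils.robots (the import path is not shown on this page):

    ```python
    from protego import Protego

    from crawlee._utils.robots import RobotsTxtFile  # assumed import path

    # Protego.parse() builds the parser instance that RobotsTxtFile wraps.
    parser = Protego.parse('User-agent: *\nDisallow: /admin/\n')
    robots = RobotsTxtFile('https://example.com/robots.txt', parser)
    ```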

find

  • async find(url, http_client, proxy_info): Self
  • Determine the location of a robots.txt file for a URL and fetch it.


    Parameters

    • url: str

      The URL whose domain will be used to find the corresponding robots.txt file.

    • http_client: HttpClient

      The HttpClient instance used to perform the network request for fetching the robots.txt file.

    • optional proxy_info: ProxyInfo | None = None

      Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.

    Returns Self
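
    A hedged sketch of find in use, assuming the import path crawlee._utils.robots and crawlee's HttpxHttpClient; the example.com URLs are hypothetical:

    ```python
    import asyncio

    from crawlee._utils.robots import RobotsTxtFile  # assumed import path
    from crawlee.http_clients import HttpxHttpClient


    async def main() -> None:
        # find() derives https://example.com/robots.txt from the page URL,
        # then fetches and parses it.
        robots = await RobotsTxtFile.find('https://example.com/some/page', HttpxHttpClient())
        print(robots.is_allowed('https://example.com/some/page'))


    asyncio.run(main())
    ```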

from_content

  • async from_content(url, content): Self
  • Create a RobotsTxtFile instance from the given content.


    Parameters

    • url: str

      The URL associated with the robots.txt file.

    • content: str

      The raw string content of the robots.txt file to be parsed.

    Returns Self
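
    A sketch of parsing raw content without any network request (import path assumed as above); the later examples on this page reuse the robots instance and rules defined here:

    ```python
    import asyncio

    from crawlee._utils.robots import RobotsTxtFile  # assumed import path

    ROBOTS_TXT = (
        'User-agent: *\n'
        'Disallow: /private/\n'
        'Crawl-delay: 5\n'
        'Sitemap: https://example.com/sitemap.xml\n'
    )


    async def main() -> None:
        robots = await RobotsTxtFile.from_content('https://example.com/robots.txt', ROBOTS_TXT)
        print(robots.is_allowed('https://example.com/private/page'))  # False


    asyncio.run(main())
    ```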

get_crawl_delay

  • get_crawl_delay(user_agent): int | None
  • Get the crawl delay for the given user agent.


    Parameters

    • optional user_agent: str = '*'

      The user-agent string to check the crawl delay for. Defaults to '*' which matches any user-agent.

    Returns int | None
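
    Continuing inside main() from the from_content sketch above, where the '*' group declares Crawl-delay: 5:

    ```python
    print(robots.get_crawl_delay())         # 5, from the '*' group
    print(robots.get_crawl_delay('MyBot'))  # also 5: 'MyBot' has no group of its own, so '*' applies
    ```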

get_sitemaps

  • get_sitemaps(): list[str]
  • Get the list of sitemap URLs from the robots.txt file.


    Returns list[str]
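
    Continuing the same sketch, which declared a single Sitemap directive:

    ```python
    for sitemap_url in robots.get_sitemaps():
        print(sitemap_url)  # https://example.com/sitemap.xml
    ```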

is_allowed

  • is_allowed(url, user_agent): bool
  • Check if the given URL is allowed for the given user agent.


    Parameters

    • url: str

      The URL to check against the robots.txt rules.

    • optional user_agent: str = '*'

      The user-agent string to check permissions for. Defaults to '*' which matches any user-agent.

    Returns bool
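
    Continuing the same sketch, whose only rule is Disallow: /private/ for all user agents:

    ```python
    print(robots.is_allowed('https://example.com/private/data'))  # False: matches Disallow: /private/
    print(robots.is_allowed('https://example.com/index.html'))    # True: no rule matches
    ```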

load

  • async load(url, http_client, proxy_info): Self
  • Load the robots.txt file for a given URL.


    Parameters

    • url: str

      The direct URL of the robots.txt file to be loaded.

    • http_client: HttpClient

      The HttpClient instance used to perform the network request for fetching the robots.txt file.

    • optional proxy_info: ProxyInfo | None = None

      Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.

    Returns Self
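
    A sketch mirroring the find example, but with the direct robots.txt URL (imports assumed as above):

    ```python
    import asyncio

    from crawlee._utils.robots import RobotsTxtFile  # assumed import path
    from crawlee.http_clients import HttpxHttpClient


    async def main() -> None:
        # Unlike find(), load() expects the robots.txt URL itself, not a page URL.
        robots = await RobotsTxtFile.load('https://example.com/robots.txt', HttpxHttpClient())
        print(robots.get_sitemaps())


    asyncio.run(main())
    ```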