
RobotsTxtFile

Index

Methods

__init__

  • __init__(url, robots, http_client, proxy_info): None
  • Parameters

    • url: str
    • robots: Protego
    • optional http_client: HttpClient | None = None
    • optional proxy_info: ProxyInfo | None = None

    Returns None
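
The constructor is normally not called directly; the classmethods below are the usual entry points. A minimal sketch of direct construction, assuming the protego package and an import path for RobotsTxtFile that may differ between crawlee versions:

```python
# Sketch only: the RobotsTxtFile import path is an assumption.
from crawlee._utils.robots import RobotsTxtFile
from protego import Protego

robots = Protego.parse('User-agent: *\nDisallow: /private/')
robots_file = RobotsTxtFile(url='https://example.com/robots.txt', robots=robots)
```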

find

  • async find(url, http_client, proxy_info): Self
  • Determine the location of a robots.txt file for a URL and fetch it.


    Parameters

    • url: str

      The URL whose domain will be used to find the corresponding robots.txt file.

    • http_client: HttpClient

      The HttpClient instance used to perform the network request for fetching the robots.txt file.

    • optional proxy_info: ProxyInfo | None = None

      Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.

    Returns Self
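
A minimal usage sketch for find, assuming crawlee's HttpxHttpClient and the import path shown (both may vary between versions):

```python
import asyncio

from crawlee._utils.robots import RobotsTxtFile  # assumed import path
from crawlee.http_clients import HttpxHttpClient  # assumed HttpClient implementation


async def main() -> None:
    http_client = HttpxHttpClient()
    # Resolves and fetches https://example.com/robots.txt based on the page URL.
    robots_file = await RobotsTxtFile.find('https://example.com/some/page', http_client)
    print(robots_file.is_allowed('https://example.com/some/page'))


asyncio.run(main())
```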

from_content

  • async from_content(url, content): Self
  • Create a RobotsTxtFile instance from the given content.


    Parameters

    • url: str

      The URL associated with the robots.txt file.

    • content: str

      The raw string content of the robots.txt file to be parsed.

    Returns Self
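
A minimal sketch of parsing raw robots.txt content without any network request; the import path is an assumption:

```python
import asyncio

from crawlee._utils.robots import RobotsTxtFile  # assumed import path

ROBOTS_TXT = (
    'User-agent: *\n'
    'Disallow: /admin/\n'
    'Crawl-delay: 5\n'
    'Sitemap: https://example.com/sitemap.xml\n'
)


async def main() -> None:
    robots_file = await RobotsTxtFile.from_content('https://example.com/robots.txt', ROBOTS_TXT)
    print(robots_file.is_allowed('https://example.com/admin/settings'))  # False


asyncio.run(main())
```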

get_crawl_delay

  • get_crawl_delay(user_agent): int | None
  • Get the crawl delay for the given user agent.


    Parameters

    • optional user_agent: str = '*'


      The user-agent string to check the crawl delay for. Defaults to '*' which matches any user-agent.

    Returns int | None
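
A short sketch, continuing from the from_content example above (so robots_file is already built); robots.txt crawl delays are conventionally interpreted as seconds:

```python
# robots_file comes from the from_content example above.
delay = robots_file.get_crawl_delay('my-bot')   # rules for a specific agent
default_delay = robots_file.get_crawl_delay()   # the '*' rules
if default_delay is not None:
    print(f'Waiting {default_delay} seconds between requests')
```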

get_sitemaps

  • get_sitemaps(*, enqueue_strategy): list[str]
  • Get the list of sitemap URLs from the robots.txt file, filtered by enqueue strategy.


    Parameters

    • keyword-only enqueue_strategy: EnqueueStrategy

      Strategy used to filter sitemap entries relative to the robots.txt URL's host. Pass 'same-hostname' to match the sitemap protocol's same-host expectation, or 'all' to disable host filtering. Regardless of the strategy, entries with non-http(s) schemes are always filtered out.

    Returns list[str]
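
A short sketch of the host filtering, continuing from the from_content example above; the string values follow the EnqueueStrategy options mentioned in the parameter description:

```python
# robots_file comes from the from_content example above.
same_host = robots_file.get_sitemaps(enqueue_strategy='same-hostname')  # only example.com sitemaps
everything = robots_file.get_sitemaps(enqueue_strategy='all')           # no host filtering
print(same_host, everything)
```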

is_allowed

  • is_allowed(url, user_agent): bool
  • Check if the given URL is allowed for the given user agent.


    Parameters

    • url: str

      The URL to check against the robots.txt rules.

    • optional user_agent: str = '*'

      The user-agent string to check permissions for. Defaults to '*' which matches any user-agent.

    Returns bool
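
A short sketch, continuing from the from_content example above:

```python
# robots_file comes from the from_content example above.
print(robots_file.is_allowed('https://example.com/products'))                       # True
print(robots_file.is_allowed('https://example.com/admin/users', user_agent='bot'))  # False, '*' rules apply
```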

load

  • async load(url, http_client, proxy_info): Self
  • Load the robots.txt file for a given URL.


    Parameters

    • url: str

      The direct URL of the robots.txt file to be loaded.

    • http_client: HttpClient

      The HttpClient instance used to perform the network request for fetching the robots.txt file.

    • optional proxy_info: ProxyInfo | None = None

      Optional ProxyInfo to be used when fetching the robots.txt file. If None, no proxy is used.

    Returns Self
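
A minimal usage sketch for load, again assuming crawlee's HttpxHttpClient and the import path shown:

```python
import asyncio

from crawlee._utils.robots import RobotsTxtFile  # assumed import path
from crawlee.http_clients import HttpxHttpClient  # assumed HttpClient implementation


async def main() -> None:
    http_client = HttpxHttpClient()
    # Unlike find, load expects the robots.txt URL itself.
    robots_file = await RobotsTxtFile.load('https://example.com/robots.txt', http_client)
    print(robots_file.get_crawl_delay())


asyncio.run(main())
```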

parse_sitemaps

  • async parse_sitemaps(*, enqueue_strategy): Sitemap
  • Parse the sitemaps from the robots.txt file and return a Sitemap instance.


    Parameters

    • keyword-only enqueue_strategy: EnqueueStrategy

      Forwarded to get_sitemaps; see that method for details.

    Returns Sitemap
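
A short sketch inside an async context, continuing from the from_content example above; only the call itself is shown, since the Sitemap API is documented separately:

```python
# Inside an async function; robots_file comes from the from_content example above.
sitemap = await robots_file.parse_sitemaps(enqueue_strategy='same-hostname')
```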

parse_urls_from_sitemaps

  • async parse_urls_from_sitemaps(*, enqueue_strategy): list[str]
  • Parse the sitemaps referenced in the robots.txt file and return a list of URLs.


    Parameters

    • keyword-only enqueue_strategy: EnqueueStrategy

      Forwarded to get_sitemaps; see that method for details.

    Returns list[str]
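
A short sketch inside an async context, continuing from the from_content example above; fetching the sitemap contents requires network access:

```python
# Inside an async function; robots_file comes from the from_content example above.
urls = await robots_file.parse_urls_from_sitemaps(enqueue_strategy='all')
print(f'Discovered {len(urls)} URLs from sitemaps')
```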