Version: 3.8

RobotsFile

Loads and queries information from a robots.txt file.

Example usage:

// RobotsFile is exported by the crawlee metapackage (also available from '@crawlee/utils')
import { RobotsFile } from 'crawlee';

// Locate and load the robots.txt file for the target site
const robots = await RobotsFile.find('https://crawlee.dev/docs/introduction/first-crawler');

// Check if a URL should be crawled according to robots.txt
// (assumes an existing `crawler` instance, e.g. a CheerioCrawler)
const url = 'https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler';
if (robots.isAllowed(url)) {
    await crawler.addRequests([url]);
}

// Enqueue all links from the sitemap(s) referenced in robots.txt
await crawler.addRequests(await robots.parseUrlsFromSitemaps());

Index

Methods

getSitemaps

  • getSitemaps(): string[]
  • Get URLs of sitemaps referenced in the robots file.


    Returns string[]
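
    Example (a minimal sketch; assumes a robots instance obtained via RobotsFile.find):

    const robots = await RobotsFile.find('https://crawlee.dev');
    const sitemapUrls = robots.getSitemaps(); // e.g. ['https://crawlee.dev/sitemap.xml']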

isAllowed

  • isAllowed(url: string, userAgent?: string): boolean
  • Check whether the given URL is allowed to be crawled according to the rules in the robots.txt file.


    Parameters

    • url: string

      the URL to check against the rules in robots.txt

    • optional userAgent: string = '*'

      the user agent to evaluate the rules for; defaults to *

    Returns boolean
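
    Example (a sketch; the user agent string is illustrative):

    if (robots.isAllowed('https://crawlee.dev/docs/introduction', 'Googlebot')) {
        // the URL may be crawled under the rules for Googlebot
    }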

parseSitemaps

  • parseSitemaps(): Promise<Sitemap>
  • Parse all the sitemaps referenced in the robots file.


    Returns Promise<Sitemap>
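
    Example (a sketch, assuming robots was created with RobotsFile.find):

    const sitemap = await robots.parseSitemaps();
    console.log(sitemap.urls); // all URLs collected from the referenced sitemaps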

parseUrlsFromSitemaps

  • parseUrlsFromSitemaps(): Promise<string[]>
  • Get all URLs from all the sitemaps referenced in the robots file. A shorthand for (await robots.parseSitemaps()).urls.


    Returns Promise<string[]>

static find

  • find(url: string, proxyUrl?: string): Promise<RobotsFile>
  • Determine the location of a robots.txt file for a URL and fetch it.


    Parameters

    • url: string

      the URL to fetch robots.txt for

    • optional proxyUrl: string

      a proxy to be used for fetching the robots.txt file

    Returns Promise<RobotsFile>
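
    Example fetching the robots.txt file through a proxy (the proxy URL below is a hypothetical placeholder):

    const robots = await RobotsFile.find(
        'https://crawlee.dev',
        'http://my-proxy.example.com:8000', // hypothetical proxy URL
    );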

static from

  • from(url: string, content: string, proxyUrl?: string): RobotsFile
  • Allows providing the URL and robots.txt content explicitly instead of loading it from the target site.


    Parameters

    • url: string

      the URL for robots.txt file

    • content: string

      contents of robots.txt

    • optional proxyUrl: string

      a proxy to be used for fetching the robots.txt file

    Returns RobotsFile
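
    Example for when the robots.txt contents were already obtained by other means (a minimal sketch):

    const content = [
        'User-agent: *',
        'Disallow: /admin',
        'Sitemap: https://crawlee.dev/sitemap.xml',
    ].join('\n');
    const robots = RobotsFile.from('https://crawlee.dev/robots.txt', content);
    robots.isAllowed('https://crawlee.dev/admin'); // false under the rules above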