Respect robots.txt file
This example demonstrates how to configure your crawler to respect the rules that websites define for crawlers in their robots.txt file.
To configure Crawlee to follow the robots.txt file, set the parameter respect_robots_txt_file=True in BasicCrawlerOptions. With this setting, Crawlee will skip any URLs forbidden by the website's robots.txt file.
As an example, let's look at the website https://news.ycombinator.com/ and its corresponding robots.txt file. Since the file has a rule Disallow: /login, the URL https://news.ycombinator.com/login will be automatically skipped.
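For reference, the kind of rule involved looks roughly like this (a simplified sketch showing only the rule this example relies on, not the site's actual full file; the User-agent line is an assumption):

User-agent: *
Disallow: /login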
The code below demonstrates this behavior using the BeautifulSoupCrawler:
import asyncio

from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)


async def main() -> None:
    # Initialize the crawler with robots.txt compliance enabled
    crawler = BeautifulSoupCrawler(respect_robots_txt_file=True)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Start the crawler with the specified URLs
    # The crawler will check the robots.txt file before making requests
    # In this example, 'https://news.ycombinator.com/login' will be skipped
    # because it's disallowed in the site's robots.txt file
    await crawler.run(
        ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login']
    )


if __name__ == '__main__':
    asyncio.run(main())
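Because respect_robots_txt_file is part of BasicCrawlerOptions, the same configuration should work with other crawler classes as well. Below is a minimal sketch of the same example with ParselCrawler, assuming the option is forwarded to BasicCrawler in the same way:

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Assumption: respect_robots_txt_file is accepted here because it is part of
    # BasicCrawlerOptions, which ParselCrawler forwards to the underlying BasicCrawler
    crawler = ParselCrawler(respect_robots_txt_file=True)

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # As before, the login URL is expected to be skipped due to robots.txt
    await crawler.run(
        ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login']
    )


if __name__ == '__main__':
    asyncio.run(main())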
Handle with on_skipped_request
If you want to process URLs that were skipped according to the robots.txt rules, for example for further analysis, use the on_skipped_request handler from BasicCrawler.

Let's update the code by adding the on_skipped_request handler:
import asyncio

from crawlee import SkippedReason
from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)


async def main() -> None:
    # Initialize the crawler with robots.txt compliance enabled
    crawler = BeautifulSoupCrawler(respect_robots_txt_file=True)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # This handler is called when a request is skipped
    @crawler.on_skipped_request
    async def skipped_request_handler(url: str, reason: SkippedReason) -> None:
        # Check if the request was skipped due to robots.txt rules
        if reason == 'robots_txt':
            crawler.log.info(f'Skipped {url} due to robots.txt rules.')

    # Start the crawler with the specified URLs
    # The login URL will be skipped and handled by the skipped_request_handler
    await crawler.run(
        ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login']
    )


if __name__ == '__main__':
    asyncio.run(main())
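If "further analysis" means keeping the skipped URLs around rather than only logging them, the handler can persist them to a dataset. Below is a minimal sketch assuming you collect them in a named Dataset; the dataset name 'skipped-urls' and the record fields are illustrative, not part of the example above:

import asyncio

from crawlee import SkippedReason
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    crawler = BeautifulSoupCrawler(respect_robots_txt_file=True)

    # Assumption: a separate named dataset is used as a simple place to collect
    # skipped URLs for later analysis; the name 'skipped-urls' is arbitrary
    skipped_dataset = await Dataset.open(name='skipped-urls')

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    @crawler.on_skipped_request
    async def skipped_request_handler(url: str, reason: SkippedReason) -> None:
        if reason == 'robots_txt':
            # Record the skipped URL instead of only logging it
            await skipped_dataset.push_data({'url': url, 'reason': reason})

    await crawler.run(
        ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login']
    )


if __name__ == '__main__':
    asyncio.run(main())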