Respect robots.txt file
This example demonstrates how to configure your crawler to respect the rules that websites define for crawlers in their robots.txt file.
To configure Crawlee to follow the robots.txt file, set the parameter respect_robots_txt_file=True in BasicCrawlerOptions. With this setting, Crawlee will skip any URLs forbidden by the website's robots.txt file.
As an example, let's look at the website https://news.ycombinator.com/ and its corresponding robots.txt file. Since the file has a rule Disallow: /login, the URL https://news.ycombinator.com/login will be automatically skipped.
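For reference, the kind of rule involved looks roughly like this (a simplified sketch showing only the rule this example relies on, not the site's actual full file; the User-agent line is an assumption):

User-agent: *
Disallow: /login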
The code below demonstrates this behavior using the BeautifulSoupCrawler:
import asyncio

from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)


async def main() -> None:
    # Initialize the crawler with robots.txt compliance enabled
    crawler = BeautifulSoupCrawler(respect_robots_txt_file=True)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Start the crawler with the specified URLs
    # The crawler will check the robots.txt file before making requests
    # In this example, 'https://news.ycombinator.com/login' will be skipped
    # because it's disallowed in the site's robots.txt file
    await crawler.run(
        ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login']
    )


if __name__ == '__main__':
    asyncio.run(main())
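Because respect_robots_txt_file is part of BasicCrawlerOptions, the same configuration should work with other crawler classes as well. Below is a minimal sketch of the same example with ParselCrawler, assuming the option is forwarded to BasicCrawler in the same way:

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Assumption: respect_robots_txt_file is accepted here because it is part of
    # BasicCrawlerOptions, which ParselCrawler forwards to the underlying BasicCrawler
    crawler = ParselCrawler(respect_robots_txt_file=True)

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # As before, the login URL is expected to be skipped due to robots.txt
    await crawler.run(
        ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login']
    )


if __name__ == '__main__':
    asyncio.run(main())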
Handle with on_skipped_request
If you want to process URLs that were skipped according to the robots.txt rules, for example for further analysis, use the on_skipped_request handler from BasicCrawler.

Let's update the code by adding the on_skipped_request handler:
import asyncio

from crawlee import SkippedReason
from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)


async def main() -> None:
    # Initialize the crawler with robots.txt compliance enabled
    crawler = BeautifulSoupCrawler(respect_robots_txt_file=True)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # This handler is called when a request is skipped
    @crawler.on_skipped_request
    async def skipped_request_handler(url: str, reason: SkippedReason) -> None:
        # Check if the request was skipped due to robots.txt rules
        if reason == 'robots_txt':
            crawler.log.info(f'Skipped {url} due to robots.txt rules.')

    # Start the crawler with the specified URLs
    # The login URL will be skipped and handled by the skipped_request_handler
    await crawler.run(
        ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login']
    )


if __name__ == '__main__':
    asyncio.run(main())
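If "further analysis" means keeping the skipped URLs around rather than only logging them, the handler can persist them to a dataset. Below is a minimal sketch assuming you collect them in a named Dataset; the dataset name 'skipped-urls' and the record fields are illustrative, not part of the example above:

import asyncio

from crawlee import SkippedReason
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    crawler = BeautifulSoupCrawler(respect_robots_txt_file=True)

    # Assumption: a separate named dataset is used as a simple place to collect
    # skipped URLs for later analysis; the name 'skipped-urls' is arbitrary
    skipped_dataset = await Dataset.open(name='skipped-urls')

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    @crawler.on_skipped_request
    async def skipped_request_handler(url: str, reason: SkippedReason) -> None:
        if reason == 'robots_txt':
            # Record the skipped URL instead of only logging it
            await skipped_dataset.push_data({'url': url, 'reason': reason})

    await crawler.run(
        ['https://news.ycombinator.com/', 'https://news.ycombinator.com/login']
    )


if __name__ == '__main__':
    asyncio.run(main())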