Crawl all links on a website
This example uses the enqueueLinks()
method to add new links to the RequestQueue
as the crawler navigates from page to page. This example can also be used to find all URLs on a domain by removing the maxRequestsPerCrawl
option.
tip
If no options are given, by default the method will only add links that are under the same subdomain. This behavior can be controlled with the strategy
option. You can find more info about this option in the Crawl relative links
examples.
- Cheerio Crawler
- Puppeteer Crawler
- Playwright Crawler
Run on
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, enqueueLinks, log }) {
log.info(request.url);
// Add all links from page to RequestQueue
await enqueueLinks();
},
maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links)
});
// Run the crawler with initial request
await crawler.run(['https://crawlee.dev']);
tip
To run this example on the Apify Platform, select the apify/actor-node-puppeteer-chrome
image for your Dockerfile.
Run on
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ request, enqueueLinks, log }) {
log.info(request.url);
// Add all links from page to RequestQueue
await enqueueLinks();
},
maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links)
});
// Run the crawler with initial request
await crawler.run(['https://crawlee.dev']);
tip
To run this example on the Apify Platform, select the apify/actor-node-playwright-chrome
image for your Dockerfile.
Run on
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, enqueueLinks, log }) {
log.info(request.url);
// Add all links from page to RequestQueue
await enqueueLinks();
},
maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links)
});
// Run the crawler with initial request
await crawler.run(['https://crawlee.dev']);