Crawl a website with relative links
When crawling a website, you may encounter several types of links that you want to follow.
To make crawling such links easier, we provide the enqueueLinks() method on the crawler context, which automatically finds links and adds them to the crawler's RequestQueue.
We provide 3 different strategies for crawling relative links:
- All (or the string "all"), which will enqueue all links found, regardless of the domain they point to.
- SameHostname (or the string "same-hostname"), which will enqueue all links found for the same hostname. This is the default strategy.
- SameDomain (or the string "same-domain"), which will enqueue all links found that have the same domain name, including links from any possible subdomain.
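To make the difference between the three strategies concrete, here is a minimal sketch of the matching rule each one describes. It is an illustration only, not Crawlee's implementation: matchesStrategy and naiveDomain are hypothetical helpers, and the "domain" is approximated by the last two hostname labels, which is an assumption that breaks for suffixes such as co.uk.

```javascript
// ASSUMPTION: a naive stand-in for a registrable-domain lookup.
// It keeps the last two labels, so it is wrong for suffixes like "co.uk".
function naiveDomain(hostname) {
    return hostname.split('.').slice(-2).join('.');
}

// Hypothetical helper (not part of Crawlee): would `href`, found on `baseUrl`,
// be kept under the given strategy string?
function matchesStrategy(href, baseUrl, strategy) {
    const base = new URL(baseUrl);
    const link = new URL(href, base); // resolves relative hrefs against the page URL
    switch (strategy) {
        case 'all':
            return true;
        case 'same-hostname':
            return link.hostname === base.hostname;
        case 'same-domain':
            return naiveDomain(link.hostname) === naiveDomain(base.hostname);
        default:
            throw new Error(`Unknown strategy: ${strategy}`);
    }
}

const page = 'https://subdomain.example.com/docs';
const links = [
    '/absolute/example',
    'https://other-subdomain.example.com',
    'https://example.org/',
];

// Print which of the sample links each strategy would keep.
for (const strategy of ['all', 'same-hostname', 'same-domain']) {
    const kept = links.filter((href) => matchesStrategy(href, page, strategy));
    console.log(strategy, kept);
}
```

In this sketch, "all" keeps every link, "same-hostname" keeps only the relative link, and "same-domain" additionally keeps the link to the other subdomain.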
For these examples, we are using the CheerioCrawler; however, the same method is available for both the PuppeteerCrawler and PlaywrightCrawler, and you use it the exact same way.
- All Links
- Same Hostname
- Same Subdomain
Any URLs found will be matched by this strategy, even if they lead off the site you are currently crawling.
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links)
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        await enqueueLinks({
            // Setting the strategy to 'all' will enqueue all links found
            strategy: EnqueueStrategy.All,
            // Alternatively, you can pass in the string 'all'
            // strategy: 'all',
        });
    },
});

// Run the crawler with initial request
await crawler.run(['https://crawlee.dev']);
For a URL of https://example.com, enqueueLinks() will match relative URLs and URLs that point to the same hostname.
This is the default strategy when calling enqueueLinks(), so you don't have to specify it.
For instance, hyperlinks like https://example.com/some/path, /absolute/example or ./relative/example will all be matched by this strategy, but links to any subdomain, like https://subdomain.example.com/some/path, won't.
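Relative links match the same-hostname strategy because they resolve against the URL of the page they were found on. The sketch below shows that resolution with the standard URL constructor. The page path /docs/guide/ and the resolveLinks helper are assumptions made for illustration and are not part of Crawlee.

```javascript
// ASSUMPTION: the links were found on this page; the path is made up for the example.
const pageUrl = 'https://example.com/docs/guide/';

const hrefs = [
    'https://example.com/some/path',
    '/absolute/example',
    './relative/example',
    'https://subdomain.example.com/some/path',
];

// Hypothetical helper: resolve each href against the page URL and
// report whether the result stays on the same hostname.
function resolveLinks(hrefList, base) {
    const baseHostname = new URL(base).hostname;
    return hrefList.map((href) => {
        const resolved = new URL(href, base);
        return {
            href,
            resolved: resolved.href,
            sameHostname: resolved.hostname === baseHostname,
        };
    });
}

for (const row of resolveLinks(hrefs, pageUrl)) {
    console.log(row.sameHostname ? 'match' : 'skip ', row.href, '->', row.resolved);
}
```

Both /absolute/example and ./relative/example resolve to URLs on example.com, so they stay on the same hostname, while the subdomain.example.com link does not.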
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links)
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        await enqueueLinks({
            // Setting the strategy to 'same-hostname' will enqueue all links found that are on
            // exactly the same hostname (the subdomain must match too) as request.loadedUrl or request.url
            strategy: EnqueueStrategy.SameHostname,
            // Alternatively, you can pass in the string 'same-hostname'
            // strategy: 'same-hostname',
        });
    },
});

// Run the crawler with initial request
await crawler.run(['https://crawlee.dev']);
For a URL of https://subdomain.example.com, enqueueLinks() will match relative URLs and URLs that point to the same domain name, regardless of their subdomain.
For instance, hyperlinks like https://subdomain.example.com/some/path, /absolute/example or ./relative/example will all be matched by this strategy, as will links to other subdomains or to the naked domain, like https://other-subdomain.example.com or https://example.com.
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links)
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        await enqueueLinks({
            // Setting the strategy to 'same-domain' will enqueue all links found that are on the
            // same domain name (including any subdomain) as request.loadedUrl or request.url
            strategy: EnqueueStrategy.SameDomain,
            // Alternatively, you can pass in the string 'same-domain'
            // strategy: 'same-domain',
        });
    },
});

// Run the crawler with initial request
await crawler.run(['https://crawlee.dev']);