JSDOMCrawler is very useful for scraping with the Window API.
How the crawler works
JSDOMCrawler crawls by making plain HTTP requests to the provided URLs using the specialized got-scraping HTTP client. The URLs are fed to the crawler using RequestQueue. The HTTP responses it gets back are usually HTML pages, the same pages you would get in your browser when you first load a URL. But it can handle any content types with the help of the
Once the page's HTML is retrieved, the crawler passes it to JSDOM for parsing. The result is a window property, which should be familiar to frontend developers. You can use the Window API to do all sorts of lookups and manipulation of the page's HTML, but in scraping, you will mostly use it to find specific HTML elements and extract their data.
// Return the page title
document.title; // browsers
window.document.title; // JSDOM
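To show where the window property appears in practice, here is a minimal sketch of a request handler. The helper names readTitle and main are hypothetical, and the sketch assumes the crawlee package is installed and that the handler context exposes request and window.

```javascript
// Hypothetical helper: reads the title from the JSDOM window of one crawled page.
function readTitle({ request, window }) {
    // JSDOM does not define browser globals, so go through the window object.
    return { url: request.url, title: window.document.title };
}

// Sketch of wiring the helper into a crawler (assumes `crawlee` is installed).
async function main(startUrls) {
    const { JSDOMCrawler } = await import('crawlee');
    const results = [];
    const crawler = new JSDOMCrawler({
        // Called once per successfully downloaded and parsed page.
        requestHandler: async (context) => {
            results.push(readTitle(context));
        },
    });
    // The start URLs are added to the crawler's request queue.
    await crawler.run(startUrls);
    return results;
}
```

Calling main(['https://example.com']) would start the crawl. The URL is only a placeholder.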
When to use
JSDOMCrawler really shines when CheerioCrawler is just not enough. There is an entire set of APIs available!
Advantages
- Easy to set up
- Familiar for frontend developers
- Content can be manipulated
- Automatically avoids some anti-scraping bans
Disadvantages
- Slower than
- May easily overload the target website with requests
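The risk of overloading a website can be reduced by capping the request rate. The following is a minimal sketch of such limits, assuming the maxConcurrency and maxRequestsPerMinute crawler options. The name politeOptions and the values are illustrative, not recommendations.

```javascript
// Illustrative limits; tune them for the target website.
const politeOptions = {
    maxConcurrency: 5, // at most 5 requests in flight at the same time
    maxRequestsPerMinute: 60, // overall cap on the request rate
};
```

These options would be passed to the constructor, for example new JSDOMCrawler({ ...politeOptions, requestHandler }).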
Example use of Element API
Find all links on a page
This snippet finds all <a> elements that have the href attribute and extracts the hrefs into an array.
Array.from(document.querySelectorAll('a[href]')).map((a) => a.href);
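Inside a crawler there is no global document, so the same lookup goes through the window object. A minimal sketch, where collectLinks is a hypothetical helper that also removes duplicate links:

```javascript
// Hypothetical helper: collects unique link targets from a JSDOM window.
function collectLinks(window) {
    // Select only anchors that actually carry an href attribute.
    const anchors = window.document.querySelectorAll('a[href]');
    // Array.from with a mapping function turns the NodeList into an array of hrefs.
    const hrefs = Array.from(anchors, (a) => a.href);
    // A Set drops duplicates while keeping first-seen order.
    return [...new Set(hrefs)];
}
```

In a request handler this could be called as collectLinks(window), with window taken from the handler's context.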
Visit the Examples section to browse examples of JSDOMCrawler usage. Almost all examples show JSDOMCrawler code in their code tabs.