Skip to main content
Version: Next

Getting some real-world data

Hey, guys, you know, it's cool that we can scrape the <title> elements of web pages, but that's not very useful. Can we finally scrape some real data and save it somewhere in a machine-readable format? Because that's why I started reading this tutorial in the first place!

We hear you, young padawan! First, learn how to crawl, you must. Only then, walk through data, you can!

Making a production-grade crawler

Making a production-grade crawler is not difficult, but there are many pitfalls of scraping that can catch you off guard. So for the real world project you'll learn how to scrape Apify Store instead of the Crawlee website. It is a library of scrapers and automations called actors that anyone can grab and use for free.

The website requires JavaScript rendering, which allows us to showcase more features of Crawlee. We've also added some helpful tips that prepare you for the real-world issues that you will surely encounter when scraping at scale.

tip

If you're not interested in crawling theory, feel free to skip to the next lesson and get right back to coding.

Drawing a plan

Sometimes scraping is really straightforward, but most of the time, it really pays off to do a bit of research first and try to answer some of these questions:

  • How is the website structured?
  • Can I scrape it only with HTTP requests (read "with CheerioCrawler")?
  • Do I need a headless browser for something?
  • Are there any anti-scraping protections in place?
  • Do I need to parse the HTML or can I get the data otherwise, such as directly from the website's API?

For the purposes of this tutorial, let's assume that the website cannot be scraped with CheerioCrawler. It actually can, but we would have to dive a bit deeper than this introductory guide allows. So for now we will make things easier for you, scrape it with PlaywrightCrawler, and you'll learn about headless browsers in the process.

Choosing the data you need

A good first step is to figure out what data you want to scrape and where to find it. For the time being, let's just agree that we want to scrape all actors from Apify Store and for each actor we want to get its:

  • URL
  • Owner
  • Unique identifier (such as apify/web-scraper)
  • Title
  • Description
  • Last modification date
  • Number of runs

You will notice that some information is available directly on the list page, but for details such as "Last modification date" or "Number of runs" we'll also need to open the actor detail pages.

data to scrape

The start URL(s)

This is where you start your crawl. It's convenient to start as close to the data as possible. For example, it wouldn't make much sense to start at apify.com and look for a store link there, when we already know that everything we want to extract can be found at the apify.com/store page.

Exploring the page

Let's take a look at the apify.com/store page more carefully. There are some categories on the left, sorting on the right, and at the very bottom, there are links to the next pages of results. This is usually called the pagination.

Categories and sorting

When you click the categories, you'll see that they filter the results. If you remove the category, you're back to the original number of results. By going through a few categories and observing the behavior, we can quite safely assume that the default setting - with no category selected - shows us all the actors available in the store and that's the setting we'll use to scrape. The same applies to sorting. We don't need that now.

caution

Be careful, because on some websites, like amazon.com, this is not true and the sum of products in categories is actually larger than what's available without filters. Learn more in our tutorial on scraping websites with limited pagination.

Pagination

The pagination of Apify Store is simple enough. The only setback is that there's no button that would allow you to go directly to the last page, so you can't confirm how many results there are. But you can trick it a bit. When you click on the page 2 button, you'll see that the URL changes to:

https://apify.com/store?page=2

Try clicking on the link to page 6. You'll see that the pagination links update and show more pages. But can you trust that this will include all pages and won't stop at some point?

caution

Similarly to the issue with filters explained above, the existence of pagination does not guarantee that you can simply paginate through all the results. Always test your assumptions about pagination. Otherwise, you might miss a chunk of results, and not even know about it.

At the time of writing the Store results counter showed 1047 results - actors. Quick count of actors on one page of results makes 24. 8 rows times 3 actors. This means that there should be 44 pages of results. 1047 divided by 24. Try going to page number 44 (or the number your own calculation produced).

https://apify.com/store?page=44

It's empty. 🤯 Wrong calculation? Not really. This is an example of another common issue in web scraping. The result count presented by websites very rarely matches the actual number of available results. In our case, it's simply because certain actors were hidden for some reason, but the count does not reflect it.

At the time of writing, the last results were actually on page 42. But that's ok. What's important is that on this page 42, the pagination links at the bottom are still the same as on page one, two or six. This makes it fairly certain that you can keep following those links until you scrape all the results. Good 👍

If you're not convinced, you can visit a page somewhere in the middle, like https://apify.com/store?page=20 and see how the pagination looks there.

The crawling strategy

Now that you know where to start and how to find all the actor details, let's look at the crawling process.

  1. Visit the store page containing the list of actors (our start URL).
  2. Enqueue all links to actor details.
  3. Enqueue links to next pages of results.
  4. Open the next page in queue.
    • When it's a results list page, go to 2.
    • When it's an actor detail page, scrape the data.
  5. Repeat until all results pages and all actor details have been processed.

PlaywrightCrawler will make sure to visit the pages for you, if you provide the correct requests, and you already know how to enqueue pages, so this should be fairly easy. Nevertheless, there are few more tricks that we'd like to showcase.

Sanity check

Let's check that everything is set up correctly before writing the scraping logic itself. You might realize that something in your previous analysis doesn't quite add up, or the website might not behave exactly as you expected.

The example below creates a new crawler that visits the start URL and prints the text content of all the actor cards on that page. When you run the code, you will see the very badly formatted content of the individual actor cards.

src/main.mjs
// Instead of CheerioCrawler let's use Playwright
// to be able to render JavaScript.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
requestHandler: async ({ page }) => {
// Wait for the actor cards to render.
await page.waitForSelector('.ActorStoreItem');
// Execute a function in the browser which targets
// the actor card elements and allows their manipulation.
const actorTexts = await page.$$eval('.ActorStoreItem', (els) => {
// Extract text content from the actor cards
return els.map((el) => el.textContent);
});
actorTexts.forEach((text, i) => {
console.log(`ACTOR_${i + 1}: ${text}\n`);
});
},
});

await crawler.run(['https://apify.com/store']);

If you're wondering how to get that .ActorStoreItem selector. We'll explain it in the next chapter on DevTools.

DevTools - the scrapers toolbox

info

We'll use Chrome DevTools here, since it's the most common browser, but feel free to use any other, they're all very similar.

Let's open DevTools by going to https://apify.com/store in Chrome and then right-clicking anywhere in the page and selecting Inspect, or by pressing F12 or whatever your system prefers. With DevTools, you can inspect or manipulate any aspect of the currently open web page. You can learn more about DevTools in their official documentaion.

Selecting elements

In the DevTools, choose the Select an element tool and try hovering over one of the actor cards.

select an element

You'll see that you can select different elements inside the card. Instead, select the whole card, not just some of its contents, such as its title or description.

selected element

Selecting an element will highlight it in the DevTools HTML inspector. When carefully look at the elements, you'll see that there are some classes attached to the different HTML elements. Those are called CSS classes, and we can make a use of them in scraping.

Conversely, by hovering over elements in the HTML inspector, you will see them highlight on the page. Inspect the page's structure around the actor cards. You'll see that all the card's data is displayed in an <a> element with two classes, one of which is ActorStoreItem. It should now make sense how we got that .ActorStoreItem selector. It's just a way to find all elements that are annotated with the ActorStoreItem.

It's always a good idea to double-check that you're not getting any unwanted elements with this class. To do that, go into the Console tab of DevTools and run:

document.querySelectorAll('.ActorStoreItem');

You will see that only the 24 actor cards will be returned, and nothing else.

tip

CSS selectors and DevTools are quite a big topic. If you want to learn more, visit the Web scraping for beginners course in the Apify Academy. It's free and open-source ❤️.

Next lesson

In the next lesson we will crawl the whole store, including all the listing pages and all the actor detail pages.