Scraping the Store
In the Real-world project chapter, we created a list of the information we wanted to collect about the actors in the Apify Store. Let's review that and figure out ways to access the data.
- URL
- Owner
- Unique identifier (such as apify/web-scraper)
- Title
- Description
- Last modification date
- Number of runs
Scraping the URL, Owner and Unique identifier
Some information is lying right there in front of us without even having to touch the actor detail pages. The URL we already have - the request.url. And by looking at it carefully, we realize that it already includes the owner and the unique identifier too. We can just split the string and be on our way!
You can use request.loadedUrl as well. Remember the difference: request.url is what you enqueue, request.loadedUrl is what gets processed (after possible redirects).
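Inside a requestHandler, a minimal sketch of the difference (the fallback is just a defensive habit for cases where the page hasn't been navigated yet, not something Crawlee requires here):

// request.url       -> the URL you enqueued
// request.loadedUrl -> the URL the browser actually ended up on (after redirects)
const finalUrl = request.loadedUrl ?? request.url; // defensive fallback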
// request.url = https://apify.com/apify/web-scraper
const urlParts = request.url.split('/').slice(-2); // ['apify', 'web-scraper']
const uniqueIdentifier = urlParts.join('/'); // 'apify/web-scraper'
const owner = urlParts[0]; // 'apify'
It's a matter of preference whether or not to store this information separately in the resulting dataset. Whoever uses the dataset can easily parse the owner from the URL, so should you duplicate the data unnecessarily? Our opinion is that unless the increased data consumption would be too large to bear, it's better to make the dataset as rich as possible. For example, someone might want to filter by owner.
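To illustrate why the extra field pays off: once owner is a top-level property, filtering the exported items needs no URL parsing at all. A sketch with hypothetical dataset items, shaped like the results object we build later in this lesson:

// Hypothetical exported dataset items, trimmed to the fields we need here.
const items = [
    { owner: 'apify', title: 'Web Scraper' },
    { owner: 'another-user', title: 'Another Actor' },
];
const apifyActors = items.filter((item) => item.owner === 'apify');
console.log(apifyActors.length); // 1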
Now it's time to add more data to the results. Let's open one of the actor detail pages, for example the apify/web-scraper page, and use our DevTools-Fu 🥋 to figure out how to get the title of the actor.
Title
By using the element selector tool, you can see that the title is there under an <h1> tag, as titles should be.
In some cases there could actually be more <h1> tags on the same page. Yes, it's wrong, but it's quite common in the wild. Be careful when scraping and don't assume that websites are programmed correctly. They aren't.
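If you do hit a page like that, you can tell Playwright explicitly which heading you mean. A sketch; the header h1 variant is illustrative and not taken from the Store's actual markup:

// Take the first <h1> on the page instead of failing on multiple matches:
const title = await page.locator('h1').first().textContent();
// Or scope the locator to a container you trust (illustrative selector):
// const title = await page.locator('header h1').textContent();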
Remember that you can press CTRL+F (or CMD+F on Mac) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. Always verify your scraping process and assumptions using the DevTools. It's faster than changing the crawler code all the time.
To get the title, we just need to find it using Playwright and an h1 locator, which selects the first <h1> element it finds and throws if it finds more. That's good. It's usually better to crash the crawler than to silently return bad data.
const title = await page.locator('h1').textContent();
Description
Getting the actor's description is a little more involved, but still pretty straightforward. You can't simply search for a <p> tag, because there are a lot of them on the page. We need to narrow our search down a little. Using the DevTools, we find that the actor description is nested inside a <span> tag with the class actor-description. And since there's no other <span> with that class on the page, we can safely use it.
const description = await page.locator('span.actor-description').textContent();
Last modification date
DevTools tell us that the modifiedDate can be found in a <time> element with a datetime attribute. You can use the attribute CSS selector to get it.
const modifiedTimestamp = await page.locator('time[datetime]').getAttribute('datetime');
const modifiedDate = new Date(Number(modifiedTimestamp));
It might look a little too complex at first glance, but let's walk through it. We find the right <time> element, and then we read its datetime attribute, because that's where a Unix timestamp is stored as a string.
But we would much rather see a human-readable date in our results, not a Unix timestamp, so we need to convert it. Unfortunately, the new Date() constructor will not parse a numeric timestamp that comes as a string, so we cast the string to a number using the Number() function before actually calling new Date(). Phew!
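To see the conversion in isolation, here's a tiny sketch with a made-up timestamp:

// '1658160078000' is a made-up millisecond timestamp, used only for illustration.
const modifiedDate = new Date(Number('1658160078000'));
console.log(modifiedDate.toISOString()); // 2022-07-18T16:01:18.000Z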
Run count
We're finishing up with the runCount. There's no specific element like <time> this time, so we need to create a more complex selector and then transform the result.
const runsRow = page.locator('ul.ActorHeader-stats > li').filter({ hasText: 'Runs' });
const runCountString = await runsRow.locator('span').last().textContent();
const runCount = Number(runCountString.replaceAll(',', ''));
The code looks complicated, but it only reads that we're looking for a <ul class="ActorHeader-stats ..."> element, and within that element, the <li> element that includes the text Runs. We're only interested in the number of runs, so we extract it from the last (second) <span> element inside that <li>. But it's still a string, so we remove the commas by replacing them with empty strings, and finally convert the result to a number by wrapping it in a Number() call.
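If you want to be extra defensive about the formatting, you could strip everything that isn't a digit instead. A sketch; the replaceAll() above is enough for the current markup:

// Hypothetical inputs: both '1,234,567' and '1 234 567' become 1234567.
const runCount = Number('1,234,567'.replace(/[^\d]/g, ''));
console.log(runCount); // 1234567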
And there we have it! All the data we needed. For the sake of completeness, let's add all the properties together, and we're good to go.
const urlParts = request.url.split('/').slice(-2);
const uniqueIdentifier = urlParts.join('/');
const owner = urlParts[0];
const title = await page.locator('h1').textContent();
const description = await page.locator('span.actor-description').textContent();
const modifiedTimestamp = await page.locator('time[datetime]').getAttribute('datetime');
const modifiedDate = new Date(Number(modifiedTimestamp));
const runsRow = page.locator('ul.ActorHeader-stats > li').filter({ hasText: 'Runs' });
const runCountString = await runsRow.locator('span').last().textContent();
const runCount = Number(runCountString.replaceAll(',', ''));
Trying it out
We have everything we need, so let's grab our newly created scraping logic, dump it into our original requestHandler(), and see the magic happen!
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        if (request.label === 'DETAIL') {
            const urlParts = request.url.split('/').slice(-2);
            const modifiedTimestamp = await page.locator('time[datetime]').getAttribute('datetime');
            const runsRow = page.locator('ul.ActorHeader-stats > li').filter({ hasText: 'Runs' });
            const runCountString = await runsRow.locator('span').last().textContent();

            const results = {
                url: request.url,
                uniqueIdentifier: urlParts.join('/'),
                owner: urlParts[0],
                title: await page.locator('h1').textContent(),
                description: await page.locator('span.actor-description').textContent(),
                modifiedDate: new Date(Number(modifiedTimestamp)),
                runCount: Number(runCountString.replaceAll(',', '')),
            };

            console.log(results);
        } else {
            await page.waitForSelector('.ActorStorePagination-pages a');
            await enqueueLinks({
                selector: '.ActorStorePagination-pages > a',
                label: 'LIST',
            });
            await page.waitForSelector('.ActorStoreItem');
            await enqueueLinks({
                selector: '.ActorStoreItem',
                label: 'DETAIL', // <= note the different label
            });
        }
    },
});

await crawler.run(['https://apify.com/store']);
When you run the actor, you should see the crawled URLs and their scraped data printed to the console.
Next lesson
In the next lesson, we will show you how to save the data you scraped to the disk for further processing.