Scraping the Store
In the Real-world project chapter, we created a list of the information we wanted to collect about the Actors in the Apify Store. Let's review that and figure out ways to access the data.
- URL
- Owner
- Unique identifier (such as
apify/web-scraper
) - Title
- Description
- Last modification date
- Number of runs
Scraping the URL, Owner and Unique identifier
Some information is lying right there in front of us without even having to touch the Actor detail pages. The URL
we already have - the request.url
. And by looking at it carefully, we realize that it already includes the owner
and the unique identifier
too. We can just split the string
and be on our way then!
You can use request.loadedUrl
as well. Remember the difference: request.url
is what you enqueue, request.loadedUrl
is what gets processed (after possible redirects).
// request.url = https://apify.com/apify/web-scraper
const urlParts = request.url.split('/').slice(-2); // ['apify', 'web-scraper']
const uniqueIdentifier = urlParts.join('/'); // 'apify/web-scraper'
const owner = urlParts[0]; // 'apify'
It's a matter of preference, whether to store this information separately in the resulting dataset, or not. Whoever uses the dataset can easily parse the owner
from the URL
, so should you duplicate the data unnecessarily? Our opinion is that unless the increased data consumption would be too large to bear, it's better to make the dataset as rich as possible. For example, someone might want to filter by owner
.
Now it's time to add more data to the results. Let's open one of the Actor detail pages, for example the apify/web-scraper
page and use our DevTools-Fu 🥋 to figure out how to get the title of the Actor.
Title
By using the element selector tool, you can see that the title is there under an <h1>
tag, as titles should be. Web pages usually have only a single <h1>
tag, but this one has two. Because of that, we have to extend the selector with more information. The <h1>
tag is enclosed in a <div>
with class ActorHeader-identificator
. We can leverage this to create a combined selector .ActorHeader-identificator h1
. It selects any <h1>
element that is a child of a different element with the class ActorHeader-identificator
.
Remember that you can press CTRL+F (or CMD+F on Mac) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. Always verify your scraping process and assumptions using the DevTools. It's faster than changing the crawler code all the time.
To get the title, we need to find it using Playwright
and a .ActorHeader-identificator h1
locator, which selects the <h1>
element we're looking for, or throws, if it finds more than one. That's good. It's usually better to crash the crawler than silently return bad data.
const title = await page.locator('.ActorHeader-identificator h1').textContent();
Description
Using the DevTools, we find that the Actor description is inside a <p>
tag with a class ActorHeader-description
. And since there's no other <p>
with that class on the page, we can safely use it.
const description = await page.locator('p.ActorHeader-description').textContent();
Last modification date
DevTools tell us that the modifiedDate
can be found in a <time>
element with a datetime
property. You can use the attribute CSS selector to get it.
const modifiedTimestamp = await page.locator('time[datetime]').getAttribute('datetime');
const modifiedDate = new Date(Number(modifiedTimestamp));
It might look a little too complex at first glance, but let's walk through it. We find the right <time>
element, and then we read its datetime
attribute, because that's where a unix timestamp is stored as a string
.
But we would much rather see a readable date in our results, not a unix timestamp, so we need to convert it. Unfortunately the new Date()
constructor will not accept a string
, so we cast the string
to a number
using the Number()
function before actually calling new Date()
. Phew!
Run count
We're finishing up with the runCount
. There's no specific element like <time>
, so we need to create a complex selector and then do a transformation on the result.
const runsRow = page.locator('ul.ActorHeader-userMedallion > li').filter({ hasText: 'Runs' });
const runCountString = await runsRow.textContent();
const runCount = runCountString.replace('Runs ', '');
The code looks complicated, but it only reads that we're looking for an <ul class="ActorHeader-userMedallion ...">
element and within that element we're looking for the <li>
element that includes the text Runs. We're only interested in the number of runs. So we extract the count from the last (second) <li>
element. But it contains the "Runs " string as well, so we remove the "Runs " by replacing the word with empty string.
And there we have it! All the data we needed. For the sake of completeness, let's add all the properties together, and we're good to go.
const urlParts = request.url.split('/').slice(-2);
const uniqueIdentifier = urlParts.join('/');
const owner = urlParts[0];
const title = await page.locator('.ActorHeader-identificator h1').textContent();
const description = await page.locator('p.ActorHeader-description').textContent();
const modifiedTimestamp = await page.locator('time[datetime]').getAttribute('datetime');
const modifiedDate = new Date(Number(modifiedTimestamp));
const runsRow = page.locator('ul.ActorHeader-userMedallion > li').filter({ hasText: 'Runs' });
const runCountString = await runsRow.textContent();
const runCount = runCountString.replace('Runs ', '');
Trying it out
We have everything we need so grab our newly created scraping logic, dump it into our original requestHandler()
and see the magic happen!
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request, enqueueLinks }) => {
console.log(`Processing: ${request.url}`)
if (request.label === 'DETAIL') {
const urlParts = request.url.split('/').slice(-2);
const modifiedTimestamp = await page.locator('time[datetime]').getAttribute('datetime');
const runsRow = page.locator('ul.ActorHeader-userMedallion > li').filter({ hasText: 'Runs' });
const runCountString = await runsRow.textContent();
const results = {
url: request.url,
uniqueIdentifier: urlParts.join('/'),
owner: urlParts[0],
title: await page.locator('.ActorHeader-identificator h1').textContent(),
description: await page.locator('p.ActorHeader-description').textContent(),
modifiedDate: new Date(Number(modifiedTimestamp)),
runCount: runCountString.replace('Runs ', ''),
}
console.log(results);
} else {
await page.waitForSelector('.ActorStorePagination-buttons a');
await enqueueLinks({
selector: '.ActorStorePagination-buttons a',
label: 'LIST',
})
await page.waitForSelector('div[data-test="actorCard"] a');
await enqueueLinks({
selector: 'div[data-test="actorCard"] a',
label: 'DETAIL', // <= note the different label
})
}
}
});
await crawler.run(['https://apify.com/store']);
When you run the crawler, you will see the crawled URLs and their scraped data printed to the console. The output will look something like this:
{
"url": "https://apify.com/clockworks/tiktok-scraper",
"uniqueIdentifier": "clockworks/tiktok-scraper",
"owner": "clockworks",
"title": "TikTok Scraper",
"description": "Powerful TikTok scraper to extract data from TikTok videos, hashtags, and users. Use it to scrape TikTok profiles, hashtags, posts, URLs, numbers of shares, followers, hearts, names, video, and music-related data. Download TikTok data as a HTML, JSON, CSV, Excel, or XML doc.",
"modifiedDate": "2023-04-26T12:13:19.387Z",
"runCount": "1.6M"
}
Next lesson
In the next lesson, we will show you how to save the data you scraped to the disk for further processing.