Scraping
In the Real-world project chapter, you created a list of the information you want to collect about the products in the example Warehouse store. Let's review it and figure out how to access each piece of data.
- URL
- Manufacturer
- SKU
- Title
- Current price
- Stock available
Scraping the URL and manufacturer
Some information is right there in front of us, without even having to touch the product detail pages. We already have the URL: it is context.request.url. Looking at it carefully, we can also extract the manufacturer from the URL, because all product URLs start with /products/<manufacturer>. We can just split the string and be on our way!
You can use request.loaded_url as well. Remember the difference: request.url is what you enqueue, request.loaded_url is what gets processed (after possible redirects).
By splitting request.url, we can extract the manufacturer name directly from the URL. First split the URL on / and take the last part, which is the product identifier. Then split that identifier on - and take the first element, which is the manufacturer name.
# context.request.url: https://warehouse-theme-metal.myshopify.com/products/sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440
# Split the URL and get the last part.
url_part = context.request.url.split('/').pop()
# url_part: sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440
# Split the last part by '-' and get the first element.
manufacturer = url_part.split('-')[0]
# manufacturer: 'sennheiser'
Whether to store this information separately in the resulting dataset is a matter of preference. Whoever uses the dataset can easily parse the manufacturer from the URL, so should you duplicate the data unnecessarily? Our opinion is that unless the extra storage would be too large to bear, it's better to make the dataset as rich as possible. For example, someone might want to filter by manufacturer.
One thing you may notice is that the manufacturer might have a - in its name, in which case splitting on - returns only the first part of it. If that's the case, your best bet is to extract the manufacturer from the details page instead, but it's not mandatory. At the end of the day, you should always adjust and pick the best solution for your use case and the website you are crawling.
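The splitting logic can be made a little more defensive. The following is a minimal sketch, not part of Crawlee: the helper name manufacturer_from_url and the KNOWN_MULTIWORD tuple (with its placeholder entry) are hypothetical, and the code assumes product URLs follow the /products/<manufacturer>-... pattern described above. It ignores query strings and trailing slashes, and lets you list hyphenated manufacturer names explicitly.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical list of manufacturer slugs that contain a hyphen.
# 'acme-audio' is a placeholder, not a real entry from the store.
KNOWN_MULTIWORD = ('acme-audio',)


def manufacturer_from_url(url: str) -> Optional[str]:
    """Return the manufacturer slug from a /products/<slug> URL, or None."""
    # Use only the path so that query strings and fragments are ignored.
    segments = [s for s in urlsplit(url).path.split('/') if s]
    if len(segments) < 2 or segments[-2] != 'products':
        return None
    slug = segments[-1]
    # Prefer an explicitly listed hyphenated name when the slug starts with one.
    for name in KNOWN_MULTIWORD:
        if slug == name or slug.startswith(name + '-'):
            return name
    # Otherwise fall back to the first hyphen-separated token.
    return slug.split('-')[0]
```

In the request handler this could be called as manufacturer_from_url(context.request.url). A None result signals that the URL did not match the expected pattern.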
Now it's time to add more data to the results. Let's open one of the product detail pages, for example the Sony XBR-950G page, and use our DevTools-Fu to figure out how to get the title of the product.
Scraping title
To scrape the product title from a webpage, you need to identify its location in the HTML structure. By using the element selector tool in your browser's DevTools, you can see that the title is within an <h1> tag, which is a common practice for important headers. This <h1> tag is enclosed in a <div> with the class product-meta. We can leverage this structure to create a combined selector .product-meta h1. This selector targets any <h1> element that is a descendant of an element with the class product-meta.
Remember that you can press CTRL+F (or CMD+F on Mac) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. Always verify your scraping process and assumptions using the DevTools. It's faster than changing the crawler code all the time.
To get the title, you need to locate it using Playwright with the .product-meta h1 selector. This selector specifically targets the <h1> element you need. If multiple elements match, it will throw an error, which is beneficial as it prevents returning incorrect data silently. Ensuring the accuracy of your selectors is crucial for reliable data extraction.
title = await context.page.locator('.product-meta h1').text_content()
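The value returned by text_content() is the raw text of the element, so it may carry surrounding whitespace or line breaks, and it may be None. As a minimal sketch (the helper name clean_text is hypothetical, not part of Playwright or Crawlee), the text could be normalized like this before it is stored:

```python
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Collapse runs of whitespace and strip the ends; map empty text to None."""
    if value is None:
        return None
    # str.split() without arguments splits on any whitespace run.
    cleaned = ' '.join(value.split())
    return cleaned or None
```

In the handler it would wrap the existing call: clean_text(await context.page.locator('.product-meta h1').text_content()).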
Scraping SKU
Using the DevTools, you can find that the product SKU is inside a <span> tag with the class product-meta__sku-number. Since there is no other <span> with that class on the page, you can safely use this selector to extract the SKU.
# Find the SKU element using the selector and get its text content.
sku = await context.page.locator('span.product-meta__sku-number').text_content()
Scraping current price
Using DevTools, you can find that the current price is within a <span> element with the price class. However, it sits alongside another <span> element with the visually-hidden class. To avoid extracting the wrong text, you can filter the elements to get the correct one using the has_text argument.
# Locate the price element and filter out the visually hidden elements.
price_element = context.page.locator('span.price', has_text='$').first
# Extract the text content of the price element.
current_price_string = await price_element.text_content() or ''
# current_price_string: 'Sale price$1,398.00'
# Split the string by the '$' sign to get the numeric part.
raw_price = current_price_string.split('$')[1]
# raw_price: '1,398.00'
# Convert the raw price string to a float after removing commas.
price = float(raw_price.replace(',', ''))
# price: 1398.0
It might look a little complex at first glance, but let's walk through what you did. First, you locate the correct part of the price span by filtering for elements containing the $ sign. This ensures that you get the actual price element. Once you have the right element, you extract its text content, which gives you a string similar to Sale price$1,398.00. To get the numeric value, you split this string by the $ sign. Next, you remove any commas from the resulting numeric string and convert it to a float, allowing you to work with the price as a number. This process ensures that you accurately extract and convert the current price from the product page.
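Note that split('$')[1] raises an IndexError when the text contains no $ sign, for example when text_content() returned None and the fallback empty string was used. A more forgiving variant is sketched below. The function name parse_price and the regular expression are assumptions made for illustration, and the sketch only handles dollar amounts formatted like the example above.

```python
import re
from typing import Optional

# Matches a dollar amount such as '$1,398.00' or '$698'.
PRICE_PATTERN = re.compile(r'\$\s*(\d[\d,]*(?:\.\d+)?)')


def parse_price(text: str) -> Optional[float]:
    """Return the first dollar amount found in the text as a float, or None."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    # Remove thousands separators before converting to a float.
    return float(match.group(1).replace(',', ''))
```

Returning None instead of raising lets the handler decide whether a missing price should be logged, skipped, or treated as an error.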
Scraping stock availability
The final step is to scrape the stock availability information. There is a <span> with the class product-form__inventory, which contains the text In stock if the product is available. You can use the has_text argument to select only the matching element.
# Locate the element that contains the text 'In stock' and filter out other elements.
in_stock_element = context.page.locator(
    selector='span.product-form__inventory',
    has_text='In stock',
).first
# Check if the element exists by counting the matching elements.
in_stock = await in_stock_element.count() > 0
For this, all that matters is whether the element exists or not. You can use the count() method to check if any elements match the selector. If there are, it means the product is in stock.
Trying it out
You have everything that is needed, so grab your newly created scraping logic, dump it into your original request handler and see the magic happen!
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Let's limit our crawls to make our tests shorter and safer.
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # We are on a product detail page, so we scrape the product data.
        if context.request.label == 'DETAIL':
            # Split the URL and get the last part to extract the manufacturer.
            url_part = context.request.url.split('/').pop()
            manufacturer = url_part.split('-')[0]

            # Extract the title using the combined selector.
            title = await context.page.locator('.product-meta h1').text_content()

            # Extract the SKU using its selector.
            sku = await context.page.locator('span.product-meta__sku-number').text_content()

            # Locate the price element that contains the '$' sign and filter out
            # the visually hidden elements.
            price_element = context.page.locator('span.price', has_text='$').first
            current_price_string = await price_element.text_content() or ''
            raw_price = current_price_string.split('$')[1]
            price = float(raw_price.replace(',', ''))

            # Locate the element that contains the text 'In stock'
            # and filter out other elements.
            in_stock_element = context.page.locator(
                selector='span.product-form__inventory',
                has_text='In stock',
            ).first
            in_stock = await in_stock_element.count() > 0

            # Put it all together in a dictionary.
            data = {
                'url': context.request.url,
                'manufacturer': manufacturer,
                'title': title,
                'sku': sku,
                'price': price,
                'in_stock': in_stock,
            }

            # Print the extracted data.
            context.log.info(data)

        # We are now on a category page. We can use this to paginate through and
        # enqueue all products, as well as any subsequent pages we find.
        elif context.request.label == 'CATEGORY':
            # Wait for the product items to render.
            await context.page.wait_for_selector('.product-item > a')

            # Enqueue links found within elements matching the provided selector.
            # These links will be added to the crawling queue with the label DETAIL.
            await context.enqueue_links(
                selector='.product-item > a',
                label='DETAIL',
            )

            # Find the "Next" button to paginate through the category pages.
            next_button = await context.page.query_selector('a.pagination__next')

            # If a "Next" button is found, enqueue the next page of results.
            if next_button:
                await context.enqueue_links(
                    selector='a.pagination__next',
                    label='CATEGORY',
                )

        # This indicates we're on the start page with no specific label.
        # On the start page, we want to enqueue all the category pages.
        else:
            # Wait for the collection cards to render.
            await context.page.wait_for_selector('.collection-block-item')

            # Enqueue links found within elements matching the provided selector.
            # These links will be added to the crawling queue with the label CATEGORY.
            await context.enqueue_links(
                selector='.collection-block-item',
                label='CATEGORY',
            )

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])


if __name__ == '__main__':
    asyncio.run(main())
When you run the crawler, you will see the crawled URLs and their scraped data printed to the console. The output will look something like this:
{
    "url": "https://warehouse-theme-metal.myshopify.com/products/sony-str-za810es-7-2-channel-hi-res-wi-fi-network-av-receiver",
    "manufacturer": "sony",
    "title": "Sony STR-ZA810ES 7.2-Ch Hi-Res Wi-Fi Network A/V Receiver",
    "sku": "SON-692802-STR-DE",
    "price": 698,
    "in_stock": true
}
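If you prefer the shape of each record to be explicit rather than implied by a dictionary literal, one option is a small dataclass. This is only a sketch under that assumption: the Product class is hypothetical and not part of the tutorial or of Crawlee, and its fields simply mirror the keys in the output above.

```python
from dataclasses import asdict, dataclass


@dataclass
class Product:
    """One scraped product; field names mirror the keys of the output above."""

    url: str
    manufacturer: str
    title: str
    sku: str
    price: float
    in_stock: bool


# Sample values copied from the example output above.
product = Product(
    url='https://warehouse-theme-metal.myshopify.com/products/sony-str-za810es-7-2-channel-hi-res-wi-fi-network-av-receiver',
    manufacturer='sony',
    title='Sony STR-ZA810ES 7.2-Ch Hi-Res Wi-Fi Network A/V Receiver',
    sku='SON-692802-STR-DE',
    price=698.0,
    in_stock=True,
)

# asdict() turns the dataclass back into a plain dictionary.
record = asdict(product)
```

A missing or misspelled field then fails when the Product is constructed, instead of silently producing an incomplete record.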
Next steps
Next, you'll see how to save the data you scraped to the disk for further processing.