Refactoring
It may seem that the data is extracted and the crawler is done, but honestly, this is just the beginning. For the sake of brevity, we've completely omitted error handling, proxies, logging, architecture, tests, documentation, and other things that reliable software should have. The good news is that error handling is mostly done by Crawlee itself, so no worries on that front, unless you need some custom magic.
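If you do need that custom magic, Crawlee's crawlers expose hooks for it. As a minimal sketch (assuming the failed_request_handler hook, which fires once a request has exhausted all of its retries), you could log or persist the failures yourself:

from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(max_requests_per_crawl=50)


# Called after a request has failed and used up all of its retries.
# The context here is a basic crawling context, not the full Playwright one.
@crawler.failed_request_handler
async def failed_handler(context, error: Exception) -> None:
    context.log.error(f'Request {context.request.url} failed: {error}')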
You might be wondering about the anti-blocking and stealth features for avoiding bot protection, and why we haven't highlighted them yet. The reason is straightforward: they are enabled in the default configuration, giving you a smooth start without manual adjustments.
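If you ever need to tweak those defaults, for example to watch the browser while developing or to route traffic through your own proxies, the crawler constructor accepts the relevant options. A minimal sketch, assuming PlaywrightCrawler's headless flag and Crawlee's ProxyConfiguration (the proxy URL below is a placeholder):

from crawlee.crawlers import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

crawler = PlaywrightCrawler(
    # Show the browser window while developing instead of running headless.
    headless=False,
    # Route all traffic through your own proxies (placeholder URL).
    proxy_configuration=ProxyConfiguration(
        proxy_urls=['http://user:password@proxy.example.com:8000'],
    ),
)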
To promote good coding practices, let's look at how you can use a Router class to better structure your crawler code.
Request routing
In the following code, we've made several changes:
- Split the code into multiple files.
- Added a custom instance of Router to make our routing cleaner, without if clauses.
- Moved route definitions to a separate routes.py file.
- Simplified the main.py file to focus on the general structure of the crawler.
Routes file
First, let's define our routes in a separate file:
from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()


@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # This is a fallback route which will handle the start URL.
    context.log.info(f'default_handler is processing {context.request.url}')
    await context.page.wait_for_selector('.collection-block-item')
    await context.enqueue_links(
        selector='.collection-block-item',
        label='CATEGORY',
    )


@router.handler('CATEGORY')
async def category_handler(context: PlaywrightCrawlingContext) -> None:
    # This replaces the context.request.label == CATEGORY branch of the if clause.
    context.log.info(f'category_handler is processing {context.request.url}')
    await context.page.wait_for_selector('.product-item > a')
    await context.enqueue_links(
        selector='.product-item > a',
        label='DETAIL',
    )

    next_button = await context.page.query_selector('a.pagination__next')
    if next_button:
        await context.enqueue_links(
            selector='a.pagination__next',
            label='CATEGORY',
        )


@router.handler('DETAIL')
async def detail_handler(context: PlaywrightCrawlingContext) -> None:
    # This replaces the context.request.label == DETAIL branch of the if clause.
    context.log.info(f'detail_handler is processing {context.request.url}')

    url_part = context.request.url.split('/').pop()
    manufacturer = url_part.split('-')[0]

    title = await context.page.locator('.product-meta h1').text_content()
    sku = await context.page.locator('span.product-meta__sku-number').text_content()

    price_element = context.page.locator('span.price', has_text='$').first
    current_price_string = await price_element.text_content() or ''
    raw_price = current_price_string.split('$')[1]
    price = float(raw_price.replace(',', ''))

    in_stock_element = context.page.locator(
        selector='span.product-form__inventory',
        has_text='In stock',
    ).first
    in_stock = await in_stock_element.count() > 0

    data = {
        'manufacturer': manufacturer,
        'title': title,
        'sku': sku,
        'price': price,
        'in_stock': in_stock,
    }

    await context.push_data(data)
Main file
Next, our main file becomes much simpler and cleaner:
import asyncio

from crawlee.crawlers import PlaywrightCrawler

from .routes import router


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Let's limit our crawls to make our tests shorter and safer.
        max_requests_per_crawl=50,
        # Provide our router instance to the crawler.
        request_handler=router,
    )

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])


if __name__ == '__main__':
    asyncio.run(main())
By structuring your code this way, you achieve better separation of concerns, making the code easier to read, manage, and extend. The Router class keeps your routing logic clean and modular, replacing if clauses with function decorators.
Summary
Refactoring your crawler code with these practices enhances readability, maintainability, and scalability.
Splitting your code into multiple files
There's no reason not to split your code into multiple files and keep your logic separate. Less code in a single file means less complexity to handle at any time, which improves overall readability and maintainability. Consider further splitting the routes into separate files for even better organization.
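One way to take this further (a sketch with illustrative module names, not part of the tutorial code) is to keep the shared Router instance in its own module and register handlers from sibling modules, one page type per file:

# routes/router.py (illustrative): the shared Router instance
from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()


# routes/detail.py (illustrative): one handler per file, registered on the shared router
from crawlee.crawlers import PlaywrightCrawlingContext

from .router import router


@router.handler('DETAIL')
async def detail_handler(context: PlaywrightCrawlingContext) -> None:
    ...  # the same extraction logic as before


# routes/__init__.py (illustrative): importing the handler modules registers their routes
from .router import router
from . import category, default, detail  # noqa: F401

Because the package re-exports router from its __init__.py, the main file can keep importing it with from .routes import router, exactly as before.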
Using a router to structure your crawling
Initially, using a simple if/else statement to select different logic based on the crawled pages might appear more readable. However, this approach becomes cumbersome with more than two types of pages, especially when the logic for each page runs to dozens or even hundreds of lines of code.
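For comparison, this is roughly the shape of the single-handler approach (condensed here, with the branch bodies elided):

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # One function handles every page type; each branch keeps growing as the crawler does.
    if context.request.label == 'DETAIL':
        ...  # dozens of lines of product extraction
    elif context.request.label == 'CATEGORY':
        ...  # dozens of lines of pagination and link enqueueing
    else:
        ...  # start URL handling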
It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand-line-long request_handler(), where everything interacts with everything else and variables can be used anywhere, is not a pleasant experience and is a pain to debug. That's why we prefer to separate routes into their own files.