Cloud Run

Google Cloud Run is a container-based serverless platform that allows you to run web crawlers with headless browsers. This service is recommended when your Crawlee applications need browser rendering capabilities, require more granular control, or have complex dependencies that aren't supported by Cloud Functions.

GCP Cloud Run lets you deploy your application as a Docker container, giving you full control over your environment and the freedom to use any web server framework of your choice, unlike Cloud Functions, which are limited to Flask.

Preparing the project

We'll prepare our project using Litestar and the Uvicorn web server: an HTTP handler will wrap the crawler so it can communicate with clients. Because the Cloud Run platform sees only an opaque Docker container, we have to take care of this part ourselves.

info

GCP passes your container an environment variable called PORT - your HTTP server is expected to listen on this port (it is the one GCP exposes to the outside world).

import json
import os

import uvicorn
from litestar import Litestar, get

from crawlee import service_locator
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

# Disable writing storage data to the file system
configuration = service_locator.get_configuration()
configuration.persist_storage = False
configuration.write_metadata = False


@get('/')
async def main() -> str:
    """The crawler entry point that will be called when the HTTP endpoint is accessed."""
    crawler = PlaywrightCrawler(
        headless=True,
        max_requests_per_crawl=10,
        browser_type='firefox',
    )

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        """Default request handler that processes each page during crawling."""
        context.log.info(f'Processing {context.request.url} ...')
        title = await context.page.query_selector('title')
        await context.push_data(
            {
                'url': context.request.loaded_url,
                'title': await title.inner_text() if title else None,
            }
        )

        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])

    data = await crawler.get_data()

    # Return the results as JSON to the client
    return json.dumps(data.items)


# Initialize the Litestar app with our route handler
app = Litestar(route_handlers=[main])

# Start the Uvicorn server using the `PORT` environment variable provided by GCP
# This is crucial - Cloud Run expects your app to listen on this specific port
uvicorn.run(app, host='0.0.0.0', port=int(os.environ.get('PORT', '8080')))  # noqa: S104 # Use all interfaces in a container, safely
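
For a quick local check before deploying, you can run the script and hit the endpoint; the file name main.py and the port are illustrative, so adjust them to your project:

# Start the server locally, simulating the PORT variable GCP would inject
PORT=8080 python main.py

# In another terminal, trigger the crawler and print the JSON results
curl http://localhost:8080/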
tip

Always keep all the logic in the request handler - as with other FaaS services, your request handlers have to be stateless.

Deploying to Google Cloud Platform

Now, we’re ready to deploy! If you have initialized your project using uvx crawlee create, the initialization script has prepared a Dockerfile for you.
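
If you're writing the Dockerfile yourself, a minimal sketch might look like the following. This is an assumption, not the generated file: the base image (Playwright's official Python image, which ships with the browsers preinstalled), its version tag, and the file names are all examples you should adapt to your project.

# Illustrative Dockerfile sketch - the generated one may differ
# Playwright's Python image includes Firefox, which the crawler above uses
FROM mcr.microsoft.com/playwright/python:v1.49.0-noble

WORKDIR /app

# Install dependencies first to take advantage of Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code
COPY . .

# Cloud Run injects PORT; the app reads it at startup
CMD ["python", "main.py"]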

All you have to do now is run gcloud run deploy in your project folder (the one with your Dockerfile in it). The gcloud CLI application will ask you a few questions, such as what region you want to deploy your application in, or whether you want to make your application public or private.
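
For example (the service name and region below are placeholders - pick your own):

# Build from the source in the current folder and deploy in one step
gcloud run deploy crawlee-crawler --source . --region europe-west1

# The CLI will then ask whether the service should allow
# unauthenticated (public) invocations or require authentication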

After answering those questions, you should be able to see your application in the GCP dashboard and run it using the link you find there.

tip

If the first execution of your newly created Cloud Run service fails, try editing the service configuration - mainly increasing the available memory to 1 GiB or more and raising the request timeout to match the size of the website you are scraping.
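
You can also change these settings from the CLI; for instance (the service name and values are illustrative):

# Give the service more memory and a longer request timeout
gcloud run services update crawlee-crawler --memory 1Gi --timeout 600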