Deploy to GCP Cloud Functions
Updating the project
For the project foundation, use BeautifulSoupCrawler as described in this example.
Add functions-framework to your requirements.txt dependencies file. If you're using a project manager like poetry or uv, export your dependencies to requirements.txt.
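If you manage dependencies with poetry or uv, commands along these lines can export them. This is a sketch only; exact flags depend on your tool version, and recent poetry releases need the export plugin installed:

# With poetry (via poetry-plugin-export)
poetry export --without-hashes -o requirements.txt

# With uv
uv export --format requirements-txt -o requirements.txt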
Update the project code to make it compatible with Cloud Functions and return data in JSON format. Also add an entry point that Cloud Functions will use to run the project.
import asyncio
import json
from datetime import timedelta

import functions_framework
from flask import Request, Response

from crawlee import service_locator
from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)

# Disable writing storage data to the file system
configuration = service_locator.get_configuration()
configuration.persist_storage = False
configuration.write_metadata = False


async def main() -> str:
    crawler = BeautifulSoupCrawler(
        max_request_retries=1,
        request_handler_timeout=timedelta(seconds=30),
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
            'h1s': [h1.text for h1 in context.soup.find_all('h1')],
            'h2s': [h2.text for h2 in context.soup.find_all('h2')],
            'h3s': [h3.text for h3 in context.soup.find_all('h3')],
        }

        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])

    # Extract data saved in `Dataset`
    data = await crawler.get_data()

    # Serialize to json string and return
    return json.dumps(data.items)


@functions_framework.http
def crawlee_run(request: Request) -> Response:
    # You can pass data to your crawler using `request`
    function_id = request.headers['Function-Execution-Id']
    response_str = asyncio.run(main())

    # Return a response with the crawling results
    return Response(response=response_str, status=200)
You can test your project locally. Start the server by running:
functions-framework --target=crawlee_run
Then make a GET request to http://127.0.0.1:8080/, for example in your browser.
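You can also call the local endpoint from the command line and inspect the JSON output. The response below is only an illustration of the handler's output shape, not verbatim data:

curl http://127.0.0.1:8080/
# Illustrative response (actual values depend on the crawled pages):
# [{"url": "https://crawlee.dev/", "title": "...", "h1s": ["..."], "h2s": ["..."], "h3s": ["..."]}]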
Deploying to Google Cloud Platform
In the Google Cloud dashboard, create a new function, allocate memory and CPUs to it, and set the region and function timeout.
When deploying, select "Use an inline editor to create a function". This allows you to configure the project using only the Google Cloud Console dashboard.
Using the inline editor, update the function files according to your project. Make sure to update the requirements.txt file to match your project's dependencies.
Also, make sure to set the Function entry point to the name of the function decorated with @functions_framework.http, which in our case is crawlee_run.
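If you prefer the command line over the inline editor, a gcloud deployment roughly equivalent to the dashboard setup looks like the sketch below. The runtime, region, memory, and timeout values are placeholder assumptions to adjust for your project, and you would add --allow-unauthenticated only if the function should be publicly callable:

gcloud functions deploy crawlee-run \
    --gen2 \
    --runtime=python312 \
    --region=europe-west1 \
    --source=. \
    --entry-point=crawlee_run \
    --trigger-http \
    --memory=1Gi \
    --timeout=300s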
After the Function deploys, you can test it by clicking the "Test" button. This button opens a popup with a curl script that calls your new Cloud Function. To avoid having to install the gcloud CLI application locally, you can also run this script in the Cloud Shell by clicking the link above the code block.
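The generated script is usually a curl call against your function's URL with an identity token, along these lines (the URL and flags here are placeholders for illustration, not the exact output of the Test popup):

curl -X GET "https://REGION-PROJECT_ID.cloudfunctions.net/crawlee_run" \
    -H "Authorization: bearer $(gcloud auth print-identity-token)"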