
Crawlee for Python v0.5

· 6 min read
Vlada Dusek
Developer of Crawlee for Python

Crawlee for Python v0.5 is now available! This is our biggest release to date, bringing functionality newly ported from Crawlee for JavaScript, brand-new features that are exclusive to the Python library (for now), a new consolidated package structure, and a bunch of bug fixes and further improvements.

How to scrape Crunchbase using Python in 2024 (Easy Guide)

· 11 min read
Max
Community Member of Crawlee and web scraping expert

Python developers know the drill: you need reliable company data, and Crunchbase has it. This guide shows you how to build an effective Crunchbase scraper in Python that gets you the data you need.

Crunchbase tracks details that matter: locations, business focus, founders, and investment histories. Manual extraction from such a large dataset isn't practical; automation is essential for transforming this information into an analyzable format.

In this blog, we'll explore three different ways to extract data from Crunchbase using Crawlee for Python. We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us understand how important it is to choose the right data source.

note

This guide comes from a developer in our growing community. Have you built interesting projects with Crawlee? Join us on Discord to share your experiences and blog ideas. We value these contributions from developers like you.

How to Scrape Crunchbase Using Python

Key steps we'll cover:

  1. Project setup
  2. Choosing the data source
  3. Implementing a sitemap-based crawler (a minimal sketch follows this list)
  4. Analysis of the search-based approach and its limitations
  5. Implementing the official API crawler
  6. Conclusion and repository access
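For a taste of step 3, here is a minimal sketch of the sitemap-based idea: fetch the sitemap, collect the page URLs, and hand them to a Crawlee crawler. The sitemap URL and the h1 selector below are hypothetical placeholders rather than Crunchbase's actual values; the article develops the real implementation.

import asyncio
from xml.etree import ElementTree

import httpx
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

# Hypothetical sitemap location; the real Crunchbase sitemap index may differ.
SITEMAP_URL = 'https://www.crunchbase.com/sitemap.xml'

async def fetch_sitemap_urls() -> list[str]:
    # Download the sitemap and pull out every <loc> entry.
    async with httpx.AsyncClient() as client:
        response = await client.get(SITEMAP_URL)
    root = ElementTree.fromstring(response.content)
    namespace = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    return [loc.text for loc in root.findall('.//sm:loc', namespace) if loc.text]

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # 'h1' stands in for the real company-name selector.
        name = context.soup.find('h1')
        await context.push_data({
            'url': context.request.url,
            'name': name.get_text(strip=True) if name else None,
        })

    await crawler.run(await fetch_sitemap_urls())

asyncio.run(main())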

How to scrape Google Maps data using Python

· 11 min read
Satyam Tripathi
Community Member of Crawlee

Millions of people use Google Maps daily, leaving behind a goldmine of data just waiting to be analyzed. In this guide, I'll show you how to build a reliable scraper using Crawlee and Python to extract locations, ratings, and reviews from Google Maps, all while handling its dynamic content challenges.

note

One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

What data will we extract from Google Maps?

We’ll collect information about hotels in a specific city. You can also customize your search to meet your requirements. For example, you might search for "hotels near me", "5-star hotels in Bombay", or other similar queries.

Google Maps Data Screenshot

We’ll extract important data, including the hotel name, rating, review count, price, a link to the hotel page on Google Maps, and all available amenities. Here’s an example of what the extracted data will look like:

{
    "name": "Vividus Hotels, Bangalore",
    "rating": "4.3",
    "reviews": "633",
    "price": "₹3,667",
    "amenities": [
        "Pool available",
        "Free breakfast available",
        "Free Wi-Fi available",
        "Free parking available"
    ],
    "link": "https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..."
}

How to scrape Google search results with Python

· 7 min read
Max
Community Member of Crawlee and web scraping expert

Scraping Google Search delivers essential SERP analysis, SEO optimization, and data collection capabilities. Modern scraping tools make this process faster and more reliable.

note

One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

In this guide, we'll create a Google Search scraper using Crawlee for Python that can handle result ranking and pagination.

We'll create a scraper that:

  • Extracts titles, URLs, and descriptions from search results
  • Handles multiple search queries
  • Tracks ranking positions
  • Processes multiple result pages
  • Saves data in a structured format
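As a rough preview of what this can look like in Crawlee for Python, here is a minimal sketch. The .g and h3 selectors and the query URL format are illustrative assumptions, since Google's markup changes often; the full article covers the details, including pagination.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # '.g' is a placeholder selector for one organic result block.
        for position, result in enumerate(context.soup.select('.g'), start=1):
            title = result.select_one('h3')
            link = result.select_one('a')
            await context.push_data({
                'query_url': context.request.url,
                'position': position,  # ranking position on this page
                'title': title.get_text(strip=True) if title else None,
                'url': link.get('href') if link else None,
            })

    # One request per search query; pagination would enqueue '&start=10', etc.
    await crawler.run(['https://www.google.com/search?q=crawlee+python'])

asyncio.run(main())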

How to scrape Google search results with Python

Reverse engineering GraphQL persistedQuery extension

· 5 min read
Saurav Jain
Developer Community Manager
Matěj Volf
Web Automation Engineer

GraphQL is a query language for getting deeply nested structured data from a website's backend, similar to MongoDB queries.

The request is usually a POST to some general /graphql endpoint with a body like this:

GraphQL Query
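For instance, a plain GraphQL request body (a hypothetical query, purely for illustration) might look like this:

{
    "operationName": "CompanyProfile",
    "variables": { "id": "example-company" },
    "query": "query CompanyProfile($id: ID!) { company(id: $id) { name founded investors { name } } }"
}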

When scraping data from websites using GraphQL, it’s common to inspect the network requests in developer tools to find the exact queries being used. However, on some websites, you might notice that the GraphQL query itself isn’t visible in the request. Instead, you only see a cryptic hash value. This can be confusing and makes it harder to understand how data is being requested from the server.

This is because some websites use a feature called "persisted queries". It's a performance optimization that reduces the amount of data sent with each request by replacing the full query text with a precomputed hash. While this improves website speed and efficiency, it introduces challenges for scraping because the query text isn't readily available.
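With the persistedQuery extension (as standardized by Apollo), the same request carries only a SHA-256 hash of the query text instead of the query itself; the operation name and variables below are illustrative:

{
    "operationName": "CompanyProfile",
    "variables": { "id": "example-company" },
    "extensions": {
        "persistedQuery": {
            "version": 1,
            "sha256Hash": "<sha256 hash of the full query text>"
        }
    }
}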

Persisted Query Reverse Engineering

12 tips on how to think like a web scraping expert

· 12 min read
Max
Community Member of Crawlee and web scraping expert

Typically, tutorials focus on the technical aspects, on what you can replicate: "Start here, follow this path, and you'll end up here." This is great for learning a particular technology, but it's sometimes difficult to understand why the author decided to do things a certain way or what guides their development process.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

In this blog, I'll discuss the general rules and principles that guide me when I work on web scraping projects and allow me to achieve great results.

So, let's explore the mindset of a web scraping developer.

How to think like a web scraping expert

How to create a LinkedIn job scraper in Python with Crawlee

· 7 min read
Arindam Majumder
Community Member of Crawlee

Introduction

In this article, we will build a web application that scrapes LinkedIn for job postings using Crawlee and Streamlit.

We will create a LinkedIn job scraper in Python using Crawlee for Python to extract the company name, job title, time of posting, and a link to the job posting, based on user input received dynamically through the web application.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

By the end of this tutorial, you’ll have a fully functional web application that you can use to scrape job postings from LinkedIn.

Linkedin Job Scraper

Let's begin.

Optimizing web scraping: Scraping auth data using JSDOM

· 8 min read
Saurav Jain
Developer Community Manager

As scraping developers, we sometimes need to extract authentication data, such as temporary keys, to perform our tasks. However, it is not always that simple. Usually, this data sits in the HTML or in XHR network requests, but sometimes it is computed on the fly. In that case, we can either reverse-engineer the computation, which takes a lot of time spent deobfuscating scripts, or run the JavaScript that computes it. Normally we would use a browser for that, but browsers are expensive. Crawlee supports running a browser scraper and a Cheerio scraper in parallel, but that is complex and expensive in terms of compute resource usage. JSDOM helps us run page JavaScript with far fewer resources than a browser and only slightly more than Cheerio.

This article discusses a new approach we use in one of our Actors to obtain authentication data from the TikTok Ads Creative Center. This data is normally generated by the browser web application, but we get it without actually running a browser, using JSDOM instead.

JSDOM-based approach for scraping

Web scraping of a dynamic website using Python with HTTP Client

· 13 min read
Max
Community Member of Crawlee and web scraping expert

Dynamic websites that use JavaScript for content rendering and backend interaction often create challenges for web scraping. The traditional approach to solving this problem is browser emulation, but it's not very efficient in terms of resource consumption.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

In this article, we'll explore an alternative method based on in-depth site analysis and the use of an HTTP client. We'll go through the entire process from analyzing a dynamic website to implementing an efficient web crawler using the Crawlee for Python framework.
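To sketch the idea (not the article's exact code): once the analysis reveals the JSON endpoint the page calls, a lightweight HTTP crawler can hit it directly. The endpoint URL and response fields below are hypothetical, and the exact response API can vary between Crawlee versions.

import asyncio
import json

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        # Parse the raw JSON payload returned by the backend endpoint.
        # Note: in some Crawlee versions, read() is async and needs await.
        data = json.loads(context.http_response.read())
        for item in data.get('items', []):  # 'items' is a hypothetical field
            await context.push_data({'title': item.get('title')})

    # Hypothetical endpoint discovered in the browser's network tab.
    await crawler.run(['https://example.com/api/catalog?page=1'])

asyncio.run(main())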

How to scrape dynamic websites in Python

How to scrape infinite scrolling webpages with Python

· 7 min read
Saurav Jain
Developer Community Manager

Hello, Crawlee Devs, and welcome back to another tutorial on the Crawlee Blog. This tutorial will teach you how to scrape infinite-scrolling websites using Crawlee for Python.

For context, infinite-scrolling pages are a modern alternative to classic pagination: when users scroll to the bottom of the page, it automatically loads more data instead of asking them to choose the next page, and they can keep scrolling.

As a big sneakerhead, I'll take the Nike shoes infinite-scrolling website as an example, and we'll scrape thousands of sneakers from it.
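As a preview, a minimal sketch of the approach with Crawlee's Playwright support might look like this; the listing URL and the .product-card selector are placeholders rather than Nike's real markup.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(headless=True)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        # Keep scrolling until no more content is lazily loaded.
        await context.infinite_scroll()
        # '.product-card' is a placeholder selector for one sneaker tile.
        for card in await context.page.locator('.product-card').all():
            await context.push_data({'text': await card.inner_text()})

    await crawler.run(['https://www.nike.com/w/shoes'])

asyncio.run(main())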

How to scrape infinite scrolling pages with Python

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.