Crawlee Blog - learn how to build better scrapers | Crawlee for JavaScript

Reverse engineering GraphQL persistedQuery extension

November 15, 2024 · 5 min read

Developer Community Manager

Web Automation Engineer

GraphQL is a query language for getting deeply nested structured data from a website's backend, similar to MongoDB queries.

The request is usually a POST to some general /graphql endpoint with a body like this:

GraphQL Query
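As a minimal sketch of what such a request body can look like, the snippet below serializes a GraphQL operation into JSON. The query, its type and field names, and the `build_graphql_body` helper are assumptions made for illustration; they do not come from any particular website.

```python
import json


def build_graphql_body(query, variables=None, operation_name=None):
    """Serialize a GraphQL operation into the JSON body of a POST request."""
    body = {"query": query, "variables": variables or {}}
    if operation_name is not None:
        body["operationName"] = operation_name
    return json.dumps(body)


# Hypothetical query: the type and field names are made up for illustration.
QUERY = """
query GetProduct($id: ID!) {
  product(id: $id) {
    name
    price
  }
}
"""

# The resulting string would be sent as the POST body to the /graphql endpoint.
body = build_graphql_body(QUERY, {"id": "42"}, "GetProduct")
```

Note that the full query text travels with every request, which is exactly what you can read in the developer tools.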

When scraping data from websites using GraphQL, it’s common to inspect the network requests in developer tools to find the exact queries being used. However, on some websites, you might notice that the GraphQL query itself isn’t visible in the request. Instead, you only see a cryptic hash value. This can be confusing and makes it harder to understand how data is being requested from the server.

This is because some websites use a feature called "persisted queries." It's a performance optimization that reduces the amount of data sent with each request by replacing the full query text with a precomputed hash. While this improves website speed and efficiency, it introduces challenges for scraping because the query text isn’t readily available.

Persisted Query Reverse Engineering
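To make the idea concrete, here is a hedged sketch of how a persisted-query request body can be built. It assumes the widely used Apollo-style convention, where the request carries an `extensions.persistedQuery` object holding a SHA-256 hex digest of the query text. A given website may compute or register its hashes differently, and the helper names below are hypothetical.

```python
import hashlib
import json


def persisted_query_extension(query: str) -> dict:
    """Build the extensions object, assuming an Apollo-style persisted query."""
    # Assumption: the hash is the SHA-256 hex digest of the exact query string.
    sha = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return {"persistedQuery": {"version": 1, "sha256Hash": sha}}


def build_persisted_body(query: str, variables: dict, operation_name: str) -> str:
    """Serialize a request body that carries the hash instead of the query text."""
    return json.dumps({
        "operationName": operation_name,
        "variables": variables,
        "extensions": persisted_query_extension(query),
    })


# Hypothetical query text, used only to demonstrate the hashing step.
QUERY = "query GetProduct($id: ID!) { product(id: $id) { name price } }"
body = build_persisted_body(QUERY, {"id": "42"}, "GetProduct")
```

Under this assumption, the hash depends on every character of the query, including whitespace, so a recovered query only matches the observed hash if it is byte-for-byte identical to what the website's client uses.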

12 tips on how to think like a web scraping expert

November 10, 2024 · 13 min read

Max

Community Member of Crawlee and web scraping expert

Typically, tutorials focus on the technical aspects, on what you can replicate: "Start here, follow this path, and you'll end up here." This is great for learning a particular technology, but it's sometimes difficult to understand why the author decided to do things a certain way or what guides their development process.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

In this blog, I'll discuss the general rules and principles that guide me when I work on web scraping projects and allow me to achieve great results.

So, let's explore the mindset of a web scraping developer.

How to think like a web scraping expert

How to create a LinkedIn job scraper in Python with Crawlee

October 14, 2024 · 7 min read

Arindam Majumder

Community Member of Crawlee

Introduction

In this article, we will build a web application that scrapes LinkedIn for job postings using Crawlee and Streamlit.

We will create a LinkedIn job scraper in Python using Crawlee for Python to extract the company name, job title, time of posting, and link to the job posting, based on user input received dynamically through the web application.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

By the end of this tutorial, you’ll have a fully functional web application that you can use to scrape job postings from LinkedIn.

LinkedIn Job Scraper

Let's begin.

Optimizing web scraping: Scraping auth data using JSDOM

September 30, 2024 · 8 min read

Saurav Jain

Developer Community Manager

As scraping developers, we sometimes need to extract authentication data, such as temporary keys, to perform our tasks. However, it is not always that simple. Usually, the data is in the HTML or in XHR network requests, but sometimes it is computed. In that case, we can either reverse-engineer the computation, which means spending a lot of time deobfuscating scripts, or run the JavaScript that computes it. Normally, we would use a browser for that, but that is expensive. Crawlee supports running a browser scraper and a Cheerio scraper in parallel, but that is very complex and expensive in terms of compute resources. JSDOM lets us run page JavaScript with fewer resources than a browser, though slightly more than Cheerio.

This article discusses a new approach that we use in one of our Actors to obtain authentication data from TikTok ads creative center. The data is generated by the browser web application, but we obtain it without actually running a browser, using JSDOM instead.

JSDOM based approach from scraping

Web scraping of a dynamic website using Python with HTTP Client

September 12, 2024 · 15 min read

Max

Community Member of Crawlee and web scraping expert

Dynamic websites that use JavaScript for content rendering and backend interaction often create challenges for web scraping. The traditional approach to solving this problem is browser emulation, but it's not very efficient in terms of resource consumption.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

In this article, we'll explore an alternative method based on in-depth site analysis and the use of an HTTP client. We'll go through the entire process from analyzing a dynamic website to implementing an efficient web crawler using the Crawlee for Python framework.

How to scrape dynamic websites in Python

How to scrape infinite scrolling webpages with Python

August 27, 2024 · 7 min read

Saurav Jain

Developer Community Manager

Hello, Crawlee Devs, and welcome back to another tutorial on the Crawlee Blog. This tutorial will teach you how to scrape infinite-scrolling websites using Crawlee for Python.

For context, infinite-scrolling pages are a modern alternative to classic pagination. Instead of choosing the next page, users scroll to the bottom of the webpage, where the page automatically loads more data so they can keep scrolling.

As a big sneakerhead, I'll take the Nike shoes infinite-scrolling website as an example, and we'll scrape thousands of sneakers from it.

How to scrape infinite scrolling pages with Python

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.

Current problems and mistakes of web scraping in Python and tricks to solve them!

August 20, 2024 · 17 min read

Max

Community Member of Crawlee and web scraping expert

Introduction

Greetings! I'm Max, a Python developer from Ukraine with expertise in web scraping, data analysis, and processing.

My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as Import.io and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when requests and lxml/beautifulsoup were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

As a freelancer, I've built small solutions and large, complex data mining systems for products over the years.

Today, I want to discuss the realities of web scraping with Python in 2024. We'll look at the mistakes I sometimes see and the problems you'll encounter, and I'll offer solutions to some of them.

Let's get started.

Just take requests and beautifulsoup and start making a lot of money...

No, this is not that kind of article.

Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers

July 5, 2024 · 6 min read

Saurav Jain

Developer Community Manager

Testimonial from early adopters

“Crawlee for Python development team did a great job in building the product, it makes things faster for a Python developer.”

~ Maksym Bohomolov

We launched Crawlee in August 2022 and got an amazing response from the JavaScript community. With many early adopters in its initial days, we got valuable feedback, which gave Crawlee a strong base for its success.

Today, Crawlee, built in TypeScript, has nearly 13,000 stars on GitHub, with 90 open-source contributors worldwide building the best web scraping and automation library.

Since the launch, the feedback we’ve received most often [1][2][3] has been to build Crawlee in Python so that the Python community can use all the features the JavaScript community does.

With all these requests in mind and to simplify the life of Python web scraping developers, we’re launching Crawlee for Python today.

The new library is still in beta, and we are looking for early adopters.

Crawlee for Python is looking for early adopters

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.

How Crawlee uses tiered proxies to avoid getting blocked

June 24, 2024 · 4 min read

Saurav Jain

Developer Community Manager

Hello Crawlee community,

We are back with another blog, this time explaining how Crawlee rotates proxies and prevents crawlers from getting blocked.

Proxies vary in quality, speed, reliability, and cost. There are a few types of proxies, such as datacenter and residential proxies. Datacenter proxies are cheaper but more prone to getting blocked, while residential proxies are more expensive but less likely to be blocked.

It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use datacenter proxies for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can both manage costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it.
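The general idea of tiering can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Crawlee's actual implementation or API: the `TieredProxyPicker` class, its escalation rule, and the proxy URLs are all hypothetical. It starts on the cheapest tier, moves up one tier when a request is blocked, and steps back down after a run of successes.

```python
class TieredProxyPicker:
    """Hypothetical sketch: tiers are ordered from cheapest to most expensive."""

    def __init__(self, tiers, step_down_after=3):
        if not tiers or any(not tier for tier in tiers):
            raise ValueError("every tier needs at least one proxy URL")
        self.tiers = tiers
        self.level = 0  # index of the tier currently in use
        self.step_down_after = step_down_after
        self._successes = 0  # consecutive successes on the current tier
        self._cursor = 0  # round-robin position within a tier

    def next_proxy(self):
        """Return the next proxy URL from the current tier, round-robin."""
        tier = self.tiers[self.level]
        proxy = tier[self._cursor % len(tier)]
        self._cursor += 1
        return proxy

    def report(self, blocked):
        """Record a request outcome and adjust the tier accordingly."""
        if blocked:
            # Escalate to a more expensive tier, capped at the last one.
            self._successes = 0
            self.level = min(self.level + 1, len(self.tiers) - 1)
        else:
            self._successes += 1
            if self._successes >= self.step_down_after and self.level > 0:
                # Things look stable, so try a cheaper tier again.
                self.level -= 1
                self._successes = 0


# Made-up proxy URLs: datacenter proxies first, residential proxies second.
picker = TieredProxyPicker(
    [["http://dc-1.example:8000", "http://dc-2.example:8000"],
     ["http://res-1.example:8000"]],
    step_down_after=2,
)
```

A crawler would call `next_proxy()` before each request and `report(blocked=...)` after it, so the expensive tier is only used while the cheap one is failing.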

Building a Netflix show recommender using Crawlee and React

June 10, 2024 · 8 min read

Ayush Thakur

Community Member of Crawlee

In this blog, we'll guide you through the process of using Vite and Crawlee to build a website that recommends Netflix shows based on their categories and genres. To do that, we will first scrape the shows and categories from Netflix using Crawlee, and then visualize the scraped data in a React app built with Vite. By the end of this guide, you'll have a functional web show recommender that can provide Netflix show suggestions.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

How to scrape Netflix using Crawlee and React to build a show recommender
