Skip to main content

Current problems and mistakes of web scraping in Python and tricks to solve them!

· 15 min read
Max
Community Member of Crawlee and web scraping expert

Introduction

Greetings! I'm Max, a Python developer from Ukraine, a developer with expertise in web scraping, data analysis, and processing.

My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as Import.io and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when requests and lxml/beautifulsoup were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

As a freelancer, I've built small solutions and large, complex data mining systems for products over the years.

Today, I want to discuss the realities of web scraping with Python in 2024. We'll look at the mistakes I sometimes see and the problems you'll encounter and offer solutions to some of them.

Let's get started.

Just take requests and beautifulsoup and start making a lot of money...

No, this is not that kind of article.

Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers

· 5 min read
Saurav Jain
Developer Community Manager

Testimonial from early adopters

“Crawlee for Python development team did a great job in building the product, it makes things faster for a Python developer.”

~ Maksym Bohomolov

We launched Crawlee in August 2022 and got an amazing response from the JavaScript community. With many early adopters in its initial days, we got valuable feedback, which gave Crawlee a strong base for its success.

Today, Crawlee built-in TypeScript has nearly 13,000 stars on GitHub, with 90 open-source contributors worldwide building the best web scraping and automation library.

Since the launch, the feedback we’ve received most often [1][2][3] has been to build Crawlee in Python so that the Python community can use all the features the JavaScript community does.

With all these requests in mind and to simplify the life of Python web scraping developers, we’re launching Crawlee for Python today.

The new library is still in beta, and we are looking for early adopters.

Crawlee for Python is looking for early adopters

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.

How Crawlee uses tiered proxies to avoid getting blocked

· 4 min read
Saurav Jain
Developer Community Manager

Hello Crawlee community,

We are back with another blog, this time explaining how Crawlee rotates proxies and prevents crawlers from getting blocked.

Proxies vary in quality, speed, reliability, and cost. There are a few types of proxies, such as datacenter and residential proxies. Datacenter proxies are cheaper but, on the other hand, more prone to getting blocked, and vice versa with residential proxies.

It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use datacenter proxies for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can manage both costs and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it.

Building a Netflix show recommender using Crawlee and React

· 7 min read
Ayush Thakur
Community Member of Crawlee

In this blog, we'll guide you through the process of using Vite and Crawlee to build a website that recommends Netflix shows based on their categories and genres. To do that, we will first scrape the shows and categories from Netflix using Crawlee, and then visualize the scraped data in a React app built with Vite. By the end of this guide, you'll have a functional web show recommender that can provide Netflix show suggestions.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

How to scrape Netflix using Crawlee and React to build a show recommender

Scrapy vs. Crawlee

· 11 min read
Saurav Jain
Developer Community Manager

Hey, crawling masters!

Welcome to another post on the Crawlee blog; this time, we are going to compare Scrapy, one of the oldest and most popular web scraping libraries in the world, with Crawlee, a relative newcomer. This article will answer your questions about when to use Scrapy and help you decide when it would be better to use Crawlee instead. This article will be the first in a series comparing the various technical aspects of Crawlee with Scrapy.

Introduction:

Scrapy is an open-source Python-based web scraping framework that extracts data from websites. With Scrapy, you create spiders, which are autonomous scripts to download and process web content. The limitation of Scrapy is that it does not work very well with JavaScript rendered websites, as it was designed for static HTML pages. We will do a comparison later in the article about this.

Crawlee is also an open-source library that originated as Apify SDK. Crawlee has the advantage of being the latest library in the market, so it already has many features that Scrapy lacks, like autoscaling, headless browsing, working with JavaScript rendered websites without any plugins, and many more, which we are going to explain later on.

How to scrape Amazon products

· 12 min read
Lukáš Průša
Junior Web Automation Engineer

Introduction

Amazon is one of the largest and most complex websites, which means scraping it is pretty challenging. Thankfully, the Crawlee library makes things a little easier, with utilities like JSON file outputs, automatic scaling, and request queue management.

In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

How to scrape Amazon using Typescript, Cheerio, and Crawlee

Launching Crawlee Blog

· 3 min read
Saurav Jain
Developer Community Manager

Hey, crawling masters!

I’m Saurav, Developer Community Manager at Apify, and I’m thrilled to announce that we’re launching the Crawlee blog today 🎉

We launched Crawlee, the successor to our Apify SDK, in August 2022 to make the best web scraping and automation library for Node.js developers who like to write code in JavaScript or TypeScript.

Since then, our dev community has grown exponentially. I’m proud to tell you that we have over 11,500 Stars on GitHub, over 6,000 community members on our Discord, and over 125,000 downloads monthly on npm. We’re now the most popular web scraping and automation library for Node.js developers 👏