Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers

· 5 min read
Saurav Jain
Developer Community Manager

Testimonial from early adopters

“Crawlee for Python development team did a great job in building the product, it makes things faster for a Python developer.”

~ Maksym Bohomolov

We launched Crawlee in August 2022 and got an amazing response from the JavaScript community. With many early adopters in its initial days, we got valuable feedback, which gave Crawlee a strong base for its success.

Today, Crawlee, built in TypeScript, has nearly 13,000 stars on GitHub, with 90 open-source contributors worldwide building the best web scraping and automation library.

Since the launch, the feedback we’ve received most often [1][2][3] has been to build Crawlee in Python so that the Python community can use all the features the JavaScript community does.

With all these requests in mind and to simplify the life of Python web scraping developers, we’re launching Crawlee for Python today.

The new library is still in beta, and we are looking for early adopters.

Crawlee for Python is looking for early adopters

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.

How Crawlee uses tiered proxies to avoid getting blocked

· 4 min read
Saurav Jain
Developer Community Manager @ Crawlee

Hello Crawlee community,

We are back with another blog, this time explaining how Crawlee rotates proxies and prevents crawlers from getting blocked.

Proxies vary in quality, speed, reliability, and cost. There are a few types of proxies, such as datacenter and residential proxies. Datacenter proxies are cheaper but more prone to getting blocked, whereas residential proxies are more expensive but less likely to be blocked.

It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use datacenter proxies for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can both keep costs under control and avoid getting blocked. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it.

note

If you like reading this blog, we would be really happy if you gave Crawlee a star on GitHub!

What are tiered proxies?

Tiered proxies are a method of organizing and using different types of proxies based on their quality, speed, reliability, and cost. Tiered proxies allow you to rotate between a mix of proxy types to optimize your scraping activities.

You categorize your proxies into different tiers based on their quality. For example:

  • High-tier proxies: Fast, reliable, and expensive. Best for critical tasks where you need high performance.
  • Mid-tier proxies: Moderate speed and reliability. A good balance between cost and performance.
  • Low-tier proxies: Slow and less reliable but cheap. Useful for less critical tasks or high-volume scraping.

Features:

  • Tracking errors: The system monitors errors (e.g. failed requests, retries) for each domain.
  • Adjusting tiers: Higher-tier proxies are used if a domain shows more errors. Conversely, if a domain performs well with a high-tier proxy, the system will occasionally test lower-tier proxies. If successful, it continues using the lower tier, optimizing costs.
  • Forgetting old errors: Old errors are given less weight over time, allowing the system to adjust tiers dynamically as proxies' performance changes.
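To make the three features above more concrete, here is a minimal sketch of how per-domain error tracking, tier adjustment, and error decay could fit together. It is not Crawlee's internal implementation: the class name DomainTierTracker and the decay, escalateAt, and probeAfter parameters are all hypothetical, and the thresholds are arbitrary values picked for illustration.

```typescript
// Illustrative sketch only: NOT Crawlee's internal implementation.
// DomainTierTracker and all of its parameters are hypothetical names.

interface DomainState {
    tier: number;          // index into the tier list (0 = first tier)
    errorScore: number;    // decayed count of recent failures
    successStreak: number; // consecutive successes on the current tier
}

class DomainTierTracker {
    private readonly states = new Map<string, DomainState>();

    constructor(
        private readonly tierCount: number,
        private readonly decay = 0.9,    // fraction of the old score kept per request
        private readonly escalateAt = 2, // error score that triggers a higher tier
        private readonly probeAfter = 3, // successes before trying a lower tier
    ) {}

    /** Record one request outcome and return the tier to use for the next request. */
    report(domain: string, failed: boolean): number {
        const state = this.states.get(domain) ?? { tier: 0, errorScore: 0, successStreak: 0 };

        // Forgetting old errors: earlier failures fade by `decay` on every request.
        state.errorScore = state.errorScore * this.decay + (failed ? 1 : 0);
        state.successStreak = failed ? 0 : state.successStreak + 1;

        if (state.errorScore >= this.escalateAt && state.tier < this.tierCount - 1) {
            // Too many recent errors: move up one tier and start with a clean slate.
            state.tier += 1;
            state.errorScore = 0;
            state.successStreak = 0;
        } else if (state.successStreak >= this.probeAfter && state.tier > 0) {
            // The domain behaves well: probe the next lower tier.
            state.tier -= 1;
            state.successStreak = 0;
        }

        this.states.set(domain, state);
        return state.tier;
    }
}
```

In this sketch, each domain has its own state, so a site that blocks aggressively can sit on a higher tier while an easy site stays on the cheapest one. The returned number would be used as an index into a list of tiers.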

Working

The tieredProxyUrls option in Crawlee's ProxyConfigurationOptions allows you to define a list of proxy URLs organized into tiers. Each tier represents a different level of quality, speed, and reliability.

Usage

Fallback Mechanism: Crawlee starts with the first tier of proxies. If proxies in the current tier fail, it will switch to the next tier.

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Crawlee starts with the first tier and switches to the next one when proxies fail.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        ['http://tier1-proxy1.example.com', 'http://tier1-proxy2.example.com'],
        ['http://tier2-proxy1.example.com', 'http://tier2-proxy2.example.com'],
        ['http://tier3-proxy1.example.com', 'http://tier3-proxy2.example.com'],
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, response }) => {
        // Handle the request
    },
});

// Queue the start URLs, then run the crawler until the queue is empty.
await crawler.addRequests([
    { url: 'https://example.com/critical' },
    { url: 'https://example.com/important' },
    { url: 'https://example.com/regular' },
]);

await crawler.run();

How tiered proxies use Session Pool under the hood

A session pool is a way to manage multiple sessions on a website so you can distribute your requests across them, reducing the chances of being detected and blocked. You can imagine each session like a different human user with its own IP address.

When you use tiered proxies, each proxy tier works with the session pool to enhance request distribution and manage errors effectively.

Diagram explaining how tiered proxies use Session Pool under the hood

For each request, the crawler instance asks the ProxyConfiguration which proxy it should use. ProxyConfiguration also keeps track of each request's domain, and if it sees more retries or errors for that domain, it returns proxies from higher tiers.

For each request, we must pass the sessionId and the request URL to the proxy configuration to get the needed proxy URL from one of the tiers.

Choosing which session to pass is where SessionPool comes in. The session pool automatically creates a pool of sessions, rotates them, and picks one for each request, which helps the crawler mimic human-like behavior and avoid getting blocked.
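The flow described above can be sketched roughly as follows. This is a simplified illustration, not Crawlee's SessionPool or ProxyConfiguration: SimpleSessionPool and pickProxyUrl are hypothetical names, and the sketch assumes that a session should keep the same proxy URL within a tier so that it looks like one user with one IP address.

```typescript
// Illustrative sketch only: NOT Crawlee's SessionPool or ProxyConfiguration.
// SimpleSessionPool and pickProxyUrl are hypothetical names.

interface Session {
    id: string;
    errorCount: number;
}

class SimpleSessionPool {
    private sessions: Session[] = [];
    private created = 0;
    private cursor = 0;

    constructor(
        private readonly maxSize: number,
        private readonly maxErrors = 3,
    ) {
        if (maxSize < 1) throw new Error('maxSize must be at least 1');
    }

    /** Return a session, creating new ones until the pool is full, then rotating. */
    getSession(): Session {
        // Retire sessions that have failed too often (they are likely blocked).
        this.sessions = this.sessions.filter((s) => s.errorCount < this.maxErrors);

        if (this.sessions.length < this.maxSize) {
            const session: Session = { id: `session_${this.created}`, errorCount: 0 };
            this.created += 1;
            this.sessions.push(session);
            return session;
        }

        // Pool is full: hand out existing sessions round-robin.
        const session = this.sessions[this.cursor % this.sessions.length];
        this.cursor += 1;
        return session;
    }
}

/** Pick a proxy URL from one tier, keeping the same URL for the same session ID. */
function pickProxyUrl(tieredUrls: string[][], tier: number, sessionId: string): string {
    const urls = tieredUrls[tier];
    if (!urls || urls.length === 0) {
        throw new Error(`No proxies configured for tier ${tier}`);
    }
    // Simple deterministic string hash so a session sticks to one proxy.
    let hash = 0;
    for (let i = 0; i < sessionId.length; i += 1) {
        hash = (hash * 31 + sessionId.charCodeAt(i)) >>> 0;
    }
    return urls[hash % urls.length];
}
```

For each request, a crawler built on this sketch would call getSession(), decide on a tier for the request's domain, and then call pickProxyUrl with that tier and the session ID. Incrementing errorCount on a failed request eventually retires the session, and a fresh one takes its place.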

Conclusion: using proxies efficiently

This built-in feature is similar to what Scrapy's scrapy-rotating-proxies plugin offers to its users. The tiered proxy configuration dynamically adjusts proxy usage based on real-time performance data, optimizing cost and performance. The session pool ensures requests are distributed across multiple sessions, mimicking human behavior and reducing detection risk.

We hope this gives you a better understanding of how Crawlee manages proxies and sessions to make your scraping tasks more effective.

As always, we welcome your feedback. Join our developer community on Discord to ask any questions about Crawlee or tell us how you use it.

Building a Netflix show recommender using Crawlee and React

· 7 min read
Ayush Thakur
Community Member of Crawlee

In this blog, we'll guide you through the process of using Vite and Crawlee to build a website that recommends Netflix shows based on their categories and genres. To do that, we will first scrape the shows and categories from Netflix using Crawlee, and then visualize the scraped data in a React app built with Vite. By the end of this guide, you'll have a functional web show recommender that can provide Netflix show suggestions.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

How to scrape Netflix using Crawlee and React to build a show recommender

Scrapy vs. Crawlee

· 11 min read
Saurav Jain
Developer Community Manager

Hey, crawling masters!

Welcome to another post on the Crawlee blog; this time, we are going to compare Scrapy, one of the oldest and most popular web scraping libraries in the world, with Crawlee, a relative newcomer. This article will answer your questions about when to use Scrapy and help you decide when it would be better to use Crawlee instead. This article will be the first in a series comparing the various technical aspects of Crawlee with Scrapy.

Introduction:

Scrapy is an open-source Python-based web scraping framework that extracts data from websites. With Scrapy, you create spiders, which are autonomous scripts to download and process web content. The limitation of Scrapy is that it does not work very well with JavaScript-rendered websites, as it was designed for static HTML pages. We will compare the two on this point later in the article.

Crawlee is also an open-source library that originated as the Apify SDK. Crawlee has the advantage of being a newer library, so it already has many features that Scrapy lacks, like autoscaling, headless browsing, working with JavaScript-rendered websites without any plugins, and many more, which we are going to explain later on.

How to scrape Amazon products

· 12 min read
Lukáš Průša
Junior Web Automation Engineer

Introduction

Amazon is one of the largest and most complex websites, which means scraping it is pretty challenging. Thankfully, the Crawlee library makes things a little easier, with utilities like JSON file outputs, automatic scaling, and request queue management.

In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

How to scrape Amazon using TypeScript, Cheerio, and Crawlee

Launching Crawlee Blog

· 3 min read
Saurav Jain
Developer Community Manager

Hey, crawling masters!

I’m Saurav, Developer Community Manager at Apify, and I’m thrilled to announce that we’re launching the Crawlee blog today 🎉

We launched Crawlee, the successor to our Apify SDK, in August 2022 to make the best web scraping and automation library for Node.js developers who like to write code in JavaScript or TypeScript.

Since then, our dev community has grown exponentially. I’m proud to tell you that we have over 11,500 Stars on GitHub, over 6,000 community members on our Discord, and over 125,000 downloads monthly on npm. We’re now the most popular web scraping and automation library for Node.js developers 👏