
Optimizing web scraping: Scraping auth data using JSDOM

· 8 min read
Saurav Jain
Developer Community Manager

As scraping developers, we sometimes need to extract authentication data, such as temporary keys, to perform our tasks. However, it is not always that simple. Usually, the data is in the HTML or in XHR network requests, but sometimes it is computed. In that case, we can either reverse-engineer the computation, which means spending a lot of time deobfuscating scripts, or run the JavaScript that computes it. Normally, we would use a browser for that, but browsers are expensive. Crawlee supports running a browser scraper and a Cheerio scraper in parallel, but that is complex and costly in terms of compute resources. JSDOM lets us run page JavaScript with fewer resources than a browser, at a cost only slightly higher than Cheerio.

This article discusses a new approach that we use in one of our Actors to obtain authentication data from the TikTok ads creative center. The data is generated by a browser web application, but instead of running a browser, we use JSDOM.

JSDOM based approach from scraping

Web scraping of a dynamic website using Python with HTTP Client

· 13 min read
Max
Community Member of Crawlee and web scraping expert

Dynamic websites that use JavaScript for content rendering and backend interaction often create challenges for web scraping. The traditional approach to solving this problem is browser emulation, but it's not very efficient in terms of resource consumption.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

In this article, we'll explore an alternative method based on in-depth site analysis and the use of an HTTP client. We'll go through the entire process from analyzing a dynamic website to implementing an efficient web crawler using the Crawlee for Python framework.

How to scrape dynamic websites in Python

How to scrape infinite scrolling webpages with Python

· 7 min read
Saurav Jain
Developer Community Manager

Hello, Crawlee Devs, and welcome back to another tutorial on the Crawlee Blog. This tutorial will teach you how to scrape infinite-scrolling websites using Crawlee for Python.

For context, infinite-scrolling pages are a modern alternative to classic pagination. Instead of clicking through to the next page, users scroll to the bottom of the webpage, and the page automatically loads more data so they can keep scrolling.

As a big sneakerhead, I'll take the Nike shoes infinite-scrolling website as an example, and we'll scrape thousands of sneakers from it.

How to scrape infinite scrolling pages with Python

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.

Current problems and mistakes of web scraping in Python and tricks to solve them!

· 15 min read
Max
Community Member of Crawlee and web scraping expert

Introduction

Greetings! I'm Max, a Python developer from Ukraine with expertise in web scraping, data analysis, and processing.

My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as Import.io and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when requests and lxml/beautifulsoup were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

As a freelancer, I've built small solutions and large, complex data mining systems for products over the years.

Today, I want to discuss the realities of web scraping with Python in 2024. We'll look at the mistakes I sometimes see and the problems you'll encounter, and I'll offer solutions to some of them.

Let's get started.

Just take requests and beautifulsoup and start making a lot of money...

No, this is not that kind of article.

Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers

· 5 min read
Saurav Jain
Developer Community Manager

Testimonial from early adopters

“Crawlee for Python development team did a great job in building the product, it makes things faster for a Python developer.”

~ Maksym Bohomolov

We launched Crawlee in August 2022 and got an amazing response from the JavaScript community. With many early adopters in its initial days, we got valuable feedback, which gave Crawlee a strong base for its success.

Today, Crawlee, built in TypeScript, has nearly 13,000 stars on GitHub, with 90 open-source contributors worldwide building the best web scraping and automation library.

Since the launch, the feedback we’ve received most often [1][2][3] has been to build Crawlee in Python so that the Python community can use all the features the JavaScript community does.

With all these requests in mind and to simplify the life of Python web scraping developers, we’re launching Crawlee for Python today.

The new library is still in beta, and we are looking for early adopters.

Crawlee for Python is looking for early adopters

Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.

How Crawlee uses tiered proxies to avoid getting blocked

· 4 min read
Saurav Jain
Developer Community Manager @ Crawlee

Hello Crawlee community,

We are back with another blog, this time explaining how Crawlee rotates proxies and prevents crawlers from getting blocked.

Proxies vary in quality, speed, reliability, and cost. There are a few types of proxies, such as datacenter and residential proxies. Datacenter proxies are cheaper but more prone to getting blocked, while residential proxies are more expensive but less likely to get blocked.

It is hard for developers to decide which proxy to use while scraping data. We might get blocked if we use datacenter proxies for low-cost scraping, but residential proxies are sometimes too expensive for bigger projects. Developers need a system that can keep costs under control and avoid getting blocked at the same time. To manage this, we recently introduced tiered proxies in Crawlee. Let’s take a look at it.

note

If you like reading this blog, we would be really happy if you gave Crawlee a star on GitHub!

What are tiered proxies?

Tiered proxies are a method of organizing and using different types of proxies based on their quality, speed, reliability, and cost. Tiered proxies allow you to rotate between a mix of proxy types to optimize your scraping activities.

You categorize your proxies into different tiers based on their quality. For example:

  • High-tier proxies: Fast, reliable, and expensive. Best for critical tasks where you need high performance.
  • Mid-tier proxies: Moderate speed and reliability. A good balance between cost and performance.
  • Low-tier proxies: Slow and less reliable but cheap. Useful for less critical tasks or high-volume scraping.

Features:

  • Tracking errors: The system monitors errors (e.g. failed requests, retries) for each domain.
  • Adjusting tiers: Higher-tier proxies are used if a domain shows more errors. Conversely, if a domain performs well with a high-tier proxy, the system will occasionally test lower-tier proxies. If successful, it continues using the lower tier, optimizing costs.
  • Forgetting old errors: Old errors are given less weight over time, allowing the system to adjust tiers dynamically as proxies' performance changes.
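To make the three features above more concrete, here is a minimal sketch of how error tracking, tier adjustment, and error decay could fit together for a single domain. It is not Crawlee's internal implementation: the `DomainTierTracker` class, its method names, the decay factor of 0.5, and the escalation threshold of 1.5 are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch: NOT Crawlee's internal code. All names and numbers are assumptions.
class DomainTierTracker {
    private score = 0; // decayed error score for this domain
    private tier = 0;  // index of the tier currently in use (0 = first tier)

    constructor(
        private readonly tierCount: number,
        private readonly decay: number = 0.5,      // assumed: weight kept by old errors
        private readonly escalateAt: number = 1.5, // assumed: score that triggers a tier change
    ) {}

    // Tracking errors: every finished request updates the score.
    recordResult(failed: boolean): void {
        // Forgetting old errors: the previous score shrinks before the new result is added.
        this.score = this.score * this.decay + (failed ? 1 : 0);
        // Adjusting tiers: too many recent errors move the domain one tier further.
        if (this.score >= this.escalateAt && this.tier < this.tierCount - 1) {
            this.tier += 1;
            this.score = 0;
        }
    }

    // Call this after occasionally testing the previous tier; step back if the test worked.
    recordLowerTierProbe(succeeded: boolean): void {
        if (succeeded && this.tier > 0) {
            this.tier -= 1;
        }
    }

    currentTier(): number {
        return this.tier;
    }
}

const tracker = new DomainTierTracker(3);
tracker.recordResult(true); // first failure: score is 1, tier stays 0
tracker.recordResult(true); // second failure in a row: score reaches 1.5, tier moves to 1
console.log(tracker.currentTier()); // prints 1
```

With these assumed numbers, two failures in a row move the domain to the next tier, while a failure followed by a success does not, because the older error has already lost half of its weight.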

Working

The tieredProxyUrls option in Crawlee's ProxyConfigurationOptions allows you to define a list of proxy URLs organized into tiers. Each tier represents a different level of quality, speed, and reliability.

Usage

Fallback Mechanism: Crawlee starts with the first tier of proxies. If proxies in the current tier fail, it will switch to the next tier.

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Each inner array is one tier; Crawlee starts with the first one.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        ['http://tier1-proxy1.example.com', 'http://tier1-proxy2.example.com'],
        ['http://tier2-proxy1.example.com', 'http://tier2-proxy2.example.com'],
        ['http://tier3-proxy1.example.com', 'http://tier3-proxy2.example.com'],
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, response }) => {
        // Handle the request
    },
});

// Queue the start URLs; the proxy tier is chosen per request.
await crawler.addRequests([
    { url: 'https://example.com/critical' },
    { url: 'https://example.com/important' },
    { url: 'https://example.com/regular' },
]);

await crawler.run();

How tiered proxies use Session Pool under the hood

A session pool is a way to manage multiple sessions on a website so you can distribute your requests across them, reducing the chances of being detected and blocked. You can imagine each session like a different human user with its own IP address.

When you use tiered proxies, each proxy tier works with the session pool to enhance request distribution and manage errors effectively.

Diagram explaining how tiered proxies use Session Pool under the hood

For each request, the crawler instance asks the ProxyConfiguration which proxy it should use. ProxyConfiguration also keeps track of the request domains, and if it sees more retries or more errors for a domain, it returns proxies from higher tiers.

In each request, we must pass sessionId and the request URL to the proxy configuration to get the needed proxy URL from one of the tiers.

Choosing which session to pass is where SessionPool comes in. The session pool automatically creates a pool of sessions, rotates them, and picks one for each request, which mimics human-like behavior and helps avoid getting blocked.
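The flow described above can be sketched in a few lines. This is a simplified illustration and not how Crawlee implements it: `ToySessionPool`, `pickProxyUrl`, `hashString`, and the `tierByDomain` lookup are hypothetical names, and the round-robin rotation and hash-based proxy choice are assumptions made to keep the example small.

```typescript
// Hypothetical sketch: these names are illustrative and are not part of Crawlee's API.
const tiers: string[][] = [
    ['http://tier1-proxy1.example.com', 'http://tier1-proxy2.example.com'],
    ['http://tier2-proxy1.example.com', 'http://tier2-proxy2.example.com'],
];

// Tier currently assigned to each domain; it would be raised as errors accumulate.
const tierByDomain: Record<string, number> = {};

// A toy session pool that hands out session IDs in round-robin order (assumed strategy).
class ToySessionPool {
    private next = 0;
    private readonly ids: string[] = [];

    constructor(size: number) {
        for (let i = 0; i < size; i++) {
            this.ids.push(`session_${i}`);
        }
    }

    getSessionId(): string {
        const id = this.ids[this.next];
        this.next = (this.next + 1) % this.ids.length;
        return id;
    }
}

// Small deterministic string hash so that a session keeps the same proxy within a tier.
function hashString(value: string): number {
    let hash = 0;
    for (let i = 0; i < value.length; i++) {
        hash = (hash * 31 + value.charCodeAt(i)) >>> 0;
    }
    return hash;
}

// The session ID and the request URL together decide which proxy URL is returned.
function pickProxyUrl(sessionId: string, requestUrl: string): string {
    const domain = new URL(requestUrl).hostname;
    const tier = Math.min(tierByDomain[domain] ?? 0, tiers.length - 1);
    const proxies = tiers[tier];
    return proxies[hashString(sessionId) % proxies.length];
}

const pool = new ToySessionPool(3);
const sessionId = pool.getSessionId();
console.log(pickProxyUrl(sessionId, 'https://example.com/regular'));
```

Hashing the session ID keeps one session on one proxy inside a tier, which matches the idea of each session looking like a single user with its own IP address. The domain lookup decides which tier that proxy comes from.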

Conclusion: using proxies efficiently

This inbuilt feature is similar to what Scrapy's scrapy-rotating-proxies plugin offers to its users. The tiered proxy configuration dynamically adjusts proxy usage based on real-time performance data, optimizing cost and performance. The session pool ensures requests are distributed across multiple sessions, mimicking human behavior and reducing detection risk.

We hope this gives you a better understanding of how Crawlee manages proxies and sessions to make your scraping tasks more effective.

As always, we welcome your feedback. Join our developer community on Discord to ask any questions about Crawlee or tell us how you use it.

Building a Netflix show recommender using Crawlee and React

· 7 min read
Ayush Thakur
Community Member of Crawlee

In this blog, we'll guide you through the process of using Vite and Crawlee to build a website that recommends Netflix shows based on their categories and genres. To do that, we will first scrape the shows and categories from Netflix using Crawlee, and then visualize the scraped data in a React app built with Vite. By the end of this guide, you'll have a functional web show recommender that can provide Netflix show suggestions.

note

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.

How to scrape Netflix using Crawlee and React to build a show recommender

Scrapy vs. Crawlee

· 11 min read
Saurav Jain
Developer Community Manager

Hey, crawling masters!

Welcome to another post on the Crawlee blog; this time, we are going to compare Scrapy, one of the oldest and most popular web scraping libraries in the world, with Crawlee, a relative newcomer. This article will answer your questions about when to use Scrapy and help you decide when it would be better to use Crawlee instead. This article will be the first in a series comparing the various technical aspects of Crawlee with Scrapy.

Introduction:

Scrapy is an open-source Python-based web scraping framework that extracts data from websites. With Scrapy, you create spiders, which are autonomous scripts to download and process web content. The limitation of Scrapy is that it does not work very well with JavaScript-rendered websites, as it was designed for static HTML pages. We will compare the two libraries on this point later in the article.

Crawlee is also an open-source library that originated as Apify SDK. Crawlee has the advantage of being the newer library, so it already has many features that Scrapy lacks, like autoscaling, headless browsing, working with JavaScript-rendered websites without any plugins, and many more, which we are going to explain later on.

How to scrape Amazon products

· 12 min read
Lukáš Průša
Junior Web Automation Engineer

Introduction

Amazon is one of the largest and most complex websites, which means scraping it is pretty challenging. Thankfully, the Crawlee library makes things a little easier, with utilities like JSON file outputs, automatic scaling, and request queue management.

In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

How to scrape Amazon using Typescript, Cheerio, and Crawlee

Launching Crawlee Blog

· 3 min read
Saurav Jain
Developer Community Manager

Hey, crawling masters!

I’m Saurav, Developer Community Manager at Apify, and I’m thrilled to announce that we’re launching the Crawlee blog today 🎉

We launched Crawlee, the successor to our Apify SDK, in August 2022 to make the best web scraping and automation library for Node.js developers who like to write code in JavaScript or TypeScript.

Since then, our dev community has grown exponentially. I’m proud to tell you that we have over 11,500 Stars on GitHub, over 6,000 community members on our Discord, and over 125,000 downloads monthly on npm. We’re now the most popular web scraping and automation library for Node.js developers 👏