HTTP clients
HTTP clients are used by HTTP-based crawlers (e.g., CheerioCrawler) to communicate with web servers. Instead of driving a browser, they rely on external HTTP libraries such as impit or got-scraping. Once the page content is retrieved, an HTML parsing library such as Cheerio, jsdom, or linkedom is typically used to extract data. These crawlers are faster than browser-based crawlers but generally cannot execute client-side JavaScript.
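For illustration, here is a minimal CheerioCrawler that fetches a page over plain HTTP and extracts its title with Cheerio (the target URL is just an example):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The page is fetched by an HTTP client and parsed with Cheerio,
    // which is exposed to the request handler as `$`.
    async requestHandler({ request, $, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.url}: ${title}`);
    },
});

await crawler.run(['https://crawlee.dev']);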
Switching between HTTP clients
Crawlee currently provides two main HTTP clients: GotScrapingHttpClient, which uses the got-scraping library, and ImpitHttpClient, which uses the impit library. You can switch between them by setting the httpClient parameter when initializing a crawler class. The default HTTP client is GotScrapingHttpClient. For more details on anti-blocking features, see our avoid getting blocked guide.
Below are examples of how to configure the HTTP client for the CheerioCrawler:
CheerioCrawler with got-scraping:

import { CheerioCrawler, GotScrapingHttpClient } from 'crawlee';

const crawler = new CheerioCrawler({
    httpClient: new GotScrapingHttpClient(),
    async requestHandler() {
        /* ... */
    },
});
CheerioCrawler with impit:

import { CheerioCrawler } from 'crawlee';
import { ImpitHttpClient } from '@crawlee/impit-client';

const crawler = new CheerioCrawler({
    httpClient: new ImpitHttpClient({
        // Set-up options for the impit library
        ignoreTlsErrors: true,
        browser: 'firefox',
    }),
    async requestHandler() {
        /* ... */
    },
});
Installation requirements
Since GotScrapingHttpClient is the default HTTP client, it's included with the base Crawlee installation and requires no additional packages.
For ImpitHttpClient, you need to install the separate @crawlee/impit-client package:
npm i @crawlee/impit-client
Creating custom HTTP clients
Crawlee provides the BaseHttpClient interface, which defines the contract that all HTTP clients must implement. This allows you to create custom HTTP clients tailored to your specific requirements.
HTTP clients are responsible for several key operations:
- sending HTTP requests and receiving responses,
- managing cookies and sessions,
- handling headers and authentication,
- managing proxy configurations,
- connection pooling with timeout management.
To create a custom HTTP client, you need to implement the BaseHttpClient interface. Your implementation must be async-compatible and include proper cleanup and resource management to work seamlessly with Crawlee's concurrent processing model.
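As a rough illustration, the sketch below wraps the built-in GotScrapingHttpClient in a custom client that logs every outgoing request before delegating to it. It assumes that BaseHttpClient exposes sendRequest and stream methods and that the hypothetical LoggingHttpClient can forward their arguments unchanged; check the BaseHttpClient API reference for the exact signatures before writing your own client.

import { CheerioCrawler, GotScrapingHttpClient, type BaseHttpClient } from 'crawlee';

// Sketch only: a custom client that adds logging and delegates all real work
// to the built-in GotScrapingHttpClient. Verify the exact method signatures
// against the BaseHttpClient API reference.
class LoggingHttpClient implements BaseHttpClient {
    private inner = new GotScrapingHttpClient();

    // Typing the members via BaseHttpClient['...'] keeps their signatures
    // in sync with the interface.
    sendRequest: BaseHttpClient['sendRequest'] = async (request) => {
        console.log(`Sending request to ${request.url}`);
        return this.inner.sendRequest(request);
    };

    stream: BaseHttpClient['stream'] = async (request, onRedirect) => {
        console.log(`Streaming response from ${request.url}`);
        return this.inner.stream(request, onRedirect);
    };
}

const crawler = new CheerioCrawler({
    httpClient: new LoggingHttpClient(),
    async requestHandler() {
        /* ... */
    },
});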
Conclusion
This guide introduced you to the HTTP clients available in Crawlee and demonstrated how to switch between them, including their installation requirements and usage examples. You also learned about the responsibilities of HTTP clients and how to create your own custom HTTP client by implementing the BaseHttpClient interface.
If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!