Version: 3.11

Got Scraping

Intro

When using BasicCrawler, we have to send the requests manually. To do this, we can use the context-aware sendRequest() function:

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest();
        log.info('received body', res.body);
    },
});

It uses got-scraping under the hood. Got Scraping is a Got extension developed to mimic browser requests, so there's a high chance we'll open the webpage without getting blocked.

`sendRequest` API

async sendRequest(overrideOptions?: GotOptionsInit) => {
    return gotScraping({
        url: request.url,
        method: request.method,
        body: request.payload,
        headers: request.headers,
        proxyUrl: crawlingContext.proxyInfo?.url,
        sessionToken: session,
        responseType: 'text',
        ...overrideOptions,
        retry: {
            limit: 0,
            ...overrideOptions?.retry,
        },
        cookieJar: {
            getCookieString: (url: string) => session!.getCookieString(url),
            setCookie: (rawCookie: string, url: string) => session!.setCookie(rawCookie, url),
            ...overrideOptions?.cookieJar,
        },
    });
}
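The spread order in the listing above decides which values win: top-level overrides replace the defaults, while `retry` is rebuilt so that `limit: 0` survives unless it is overridden explicitly. The following is a minimal sketch of that merge only. `SketchOptions` and `mergeOptions` are hypothetical names, and the types are simplified stand-ins for Got's option types, not the real ones.

```typescript
// Simplified stand-ins for Got's option types (assumption: only the fields used here).
interface RetryOptions {
    limit?: number;
    methods?: string[];
}

interface SketchOptions {
    url?: string;
    responseType?: 'text' | 'json' | 'buffer';
    retry?: RetryOptions;
}

// Mirrors the spread order of the listing: defaults first, then overrides,
// then `retry` rebuilt so `limit: 0` stays unless explicitly overridden.
function mergeOptions(defaults: SketchOptions, overrides?: SketchOptions): SketchOptions {
    return {
        ...defaults,
        ...overrides,
        retry: {
            limit: 0,
            ...overrides?.retry,
        },
    };
}

const merged = mergeOptions(
    { url: 'https://example.com', responseType: 'text' },
    { responseType: 'json', retry: { methods: ['GET'] } },
);

// `responseType` comes from the overrides, `url` from the defaults,
// and `retry` contains both `limit: 0` and the overridden `methods`.
console.log(merged);
```

Under this reading, an override such as `headers` replaces the request's headers as a whole rather than merging with them, because only `retry` and `cookieJar` are spread a second time in the listing.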

`url`

By default, it's the URL of the current task. However, you can override this with a string or a URL instance if necessary.

More details in Got documentation.

`method`

By default, it's the HTTP method of the current task. Possible values are 'GET', 'POST', 'HEAD', 'PUT', 'PATCH', 'DELETE'.

More details in Got documentation.

`body`

By default, it's the HTTP payload of the current task.

More details in Got documentation.

`headers`

By default, these are the HTTP headers of the current task. The value is an object with string values.

More details in Got documentation.
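The `url`, `method`, `body`, and `headers` options can be overridden together in a single call. The sketch below only builds such an override object for a JSON POST. The helper `buildJsonPost`, the `RequestOverrides` type, and the example endpoint are hypothetical and exist only for illustration.

```typescript
// Hypothetical shape covering only the four options discussed above.
interface RequestOverrides {
    url: string;
    method: 'GET' | 'POST' | 'HEAD' | 'PUT' | 'PATCH' | 'DELETE';
    body: string;
    headers: Record<string, string>;
}

// Builds the overrides for sending `payload` as a JSON-encoded POST body.
function buildJsonPost(url: string, payload: unknown): RequestOverrides {
    return {
        url,
        method: 'POST',
        body: JSON.stringify(payload),
        headers: { 'content-type': 'application/json' },
    };
}

const overrides = buildJsonPost('https://example.com/api/search', { query: 'crawlee' });

// Inside a requestHandler, this object would be passed as `await sendRequest(overrides)`.
console.log(overrides);
```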

`proxyUrl`

It's a string representing the proxy server in the format of protocol://username:password@hostname:port.

For example, an Apify proxy server looks like this: http://auto:password@proxy.apify.com:8000.
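If the credentials come from configuration, the proxy URL can be assembled from its parts. This is a minimal sketch, assuming the `protocol://username:password@hostname:port` format described above. `ProxyParts` and `buildProxyUrl` are hypothetical helpers, not part of Crawlee or Got Scraping.

```typescript
// Hypothetical description of the individual parts of a proxy URL.
interface ProxyParts {
    protocol: 'http' | 'https';
    username: string;
    password: string;
    hostname: string;
    port: number;
}

function buildProxyUrl({ protocol, username, password, hostname, port }: ProxyParts): string {
    // Percent-encode the credentials so characters such as '@' or ':' cannot break the URL.
    const user = encodeURIComponent(username);
    const pass = encodeURIComponent(password);
    return `${protocol}://${user}:${pass}@${hostname}:${port}`;
}

const proxyUrl = buildProxyUrl({
    protocol: 'http',
    username: 'auto',
    password: 'password',
    hostname: 'proxy.apify.com',
    port: 8000,
});

console.log(proxyUrl); // http://auto:password@proxy.apify.com:8000
```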

BasicCrawler does not have the concept of a session or proxy; therefore, we need to pass the proxyUrl option manually:

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest({
            proxyUrl: 'http://auto:password@proxy.apify.com:8000',
        });
        log.info('received body', res.body);
    },
});

We use proxies to hide our real IP address.

More details in Got Scraping documentation.

`sessionToken`

It's a non-primitive object used as a key when generating a browser fingerprint. Fingerprints generated with the same token don't change. This can be used to retain the user-agent header when using the same Apify Session.
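One way to picture why the token must be an object is a cache keyed by object identity. The sketch below illustrates that idea with a `WeakMap`. It is not Got Scraping's actual code, and `Fingerprint`, `fingerprintFor`, and the generated user-agent strings are hypothetical.

```typescript
// Hypothetical stand-in for a generated browser fingerprint.
interface Fingerprint {
    userAgent: string;
}

// WeakMap keys must be objects, which is why a primitive token would not work here.
const fingerprints = new WeakMap<object, Fingerprint>();
let generated = 0;

// Returns the cached fingerprint for a token, generating one on first use.
function fingerprintFor(token: object): Fingerprint {
    let fingerprint = fingerprints.get(token);
    if (!fingerprint) {
        generated += 1;
        fingerprint = { userAgent: `sketch-agent/${generated}` };
        fingerprints.set(token, fingerprint);
    }
    return fingerprint;
}

const sessionA = { id: 'a' };
const sessionB = { id: 'b' };

// The same token always maps to the same fingerprint object.
console.log(fingerprintFor(sessionA) === fingerprintFor(sessionA));
// A different token gets its own, separately generated fingerprint.
console.log(fingerprintFor(sessionA) === fingerprintFor(sessionB));
```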

More details in Got Scraping documentation.

`responseType`

This option defines how the response should be parsed.

By default, we fetch HTML pages, which are plain text, so responseType is set to 'text'. However, JSON is possible as well:

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest({ responseType: 'json' });
        log.info('received body', res.body);
    },
});
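The practical difference between the two values is the type of `res.body`: a raw string for 'text' and an already parsed value for 'json'. The sketch below imitates that distinction on a hard-coded payload. `parseBody` is a hypothetical helper and does not reproduce how Got parses responses internally.

```typescript
type SketchResponseType = 'text' | 'json';

// Returns the raw string for 'text' and the parsed value for 'json'.
function parseBody(raw: string, responseType: SketchResponseType): unknown {
    return responseType === 'json' ? JSON.parse(raw) : raw;
}

const raw = '{"title":"Example","price":42}';

const asText = parseBody(raw, 'text');
const asJson = parseBody(raw, 'json') as { title: string; price: number };

console.log(typeof asText); // string
console.log(asJson.price); // 42
```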

More details in Got documentation.

`cookieJar`

Got uses a cookieJar to manage cookies. It's an object that implements the interface of the tough-cookie package.

Example:

import { BasicCrawler } from 'crawlee';
import { CookieJar } from 'tough-cookie';

const cookieJar = new CookieJar();

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest({ cookieJar });
        log.info('received body', res.body);
    },
});
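The `sendRequest` listing earlier relies on only two jar methods, `getCookieString` and `setCookie`. To make their roles concrete, here is a deliberately simplified in-memory jar with the same two-method shape. `InMemoryCookieJar` is a hypothetical class: it ignores paths, expiry, and domain matching, so it is an illustration rather than a replacement for tough-cookie.

```typescript
class InMemoryCookieJar {
    // host -> (cookie name -> cookie value)
    private readonly store = new Map<string, Map<string, string>>();

    // Extracts the host from an absolute URL such as https://example.com/path.
    private static hostOf(url: string): string {
        const match = /^[a-z][a-z0-9+.-]*:\/\/([^/:?#]+)/i.exec(url);
        if (!match) throw new Error(`Cannot extract host from: ${url}`);
        return match[1].toLowerCase();
    }

    setCookie(rawCookie: string, url: string): void {
        // Keep only the leading "name=value" pair; attributes such as Path are ignored.
        const pair = rawCookie.split(';')[0];
        const eq = pair.indexOf('=');
        if (eq < 1) return;
        const host = InMemoryCookieJar.hostOf(url);
        const cookies = this.store.get(host) ?? new Map<string, string>();
        cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
        this.store.set(host, cookies);
    }

    getCookieString(url: string): string {
        const cookies = this.store.get(InMemoryCookieJar.hostOf(url));
        if (!cookies) return '';
        // Serialize in the "name=value; name=value" form used by the Cookie header.
        return Array.from(cookies, ([name, value]) => `${name}=${value}`).join('; ');
    }
}

const jar = new InMemoryCookieJar();
jar.setCookie('sid=abc123; Path=/; HttpOnly', 'https://example.com/login');
jar.setCookie('theme=dark', 'https://example.com/settings');

console.log(jar.getCookieString('https://example.com/products')); // sid=abc123; theme=dark
```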

More details in

`retry.limit`

This option specifies the maximum number of Got retries.

By default, retry.limit is set to 0. This is because Crawlee has its own retry management, which is already fairly complex.

We suggest NOT changing this value for stability reasons.
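One plausible reason behind this advice is that nested retries multiply. The sketch below is simple arithmetic under the assumption that every crawler-level attempt triggers a fresh Got request, which may then retry on its own. `worstCaseAttempts` is a hypothetical helper, and the crawler retry count of 3 is an assumed value chosen for illustration.

```typescript
// Worst-case number of HTTP requests for one failing URL when both layers retry.
// Each layer makes one initial attempt plus its configured number of retries.
function worstCaseAttempts(crawlerRetries: number, gotRetryLimit: number): number {
    return (crawlerRetries + 1) * (gotRetryLimit + 1);
}

console.log(worstCaseAttempts(3, 0)); // 4
console.log(worstCaseAttempts(3, 2)); // 12
```

With retry.limit left at 0, the number of requests per URL is governed by the crawler alone.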

`useHeaderGenerator`

It's a boolean for whether to generate browser headers. By default, it's set to true, and we recommend keeping this for better results.

`headerGeneratorOptions`

This option is an object that specifies how to generate the browser fingerprint. Example:

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest({
            headerGeneratorOptions: {
                devices: ['mobile', 'desktop'],
                locales: ['en-US'],
                operatingSystems: ['windows', 'macos', 'android', 'ios'],
                browsers: ['chrome', 'edge', 'firefox', 'safari'],
            },
        });
        log.info('received body', res.body);
    },
});

More details in HeaderGeneratorOptions documentation.

Related links

Got Scraping
