Got Scraping

Intro

When using BasicCrawler, we have to send requests manually. To do this, we can use the context-aware sendRequest() function:

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest();
        log.info('received body', res.body);
    },
});

It uses got-scraping under the hood. Got Scraping is a Got extension developed to mimic browser requests, so there's a high chance we'll open the webpage without getting blocked.

sendRequest API

async sendRequest(overrideOptions?: GotOptionsInit) => {
    return gotScraping({
        url: request.url,
        method: request.method,
        body: request.payload,
        headers: request.headers,
        proxyUrl: crawlingContext.proxyInfo?.url,
        sessionToken: session,
        responseType: 'text',
        ...overrideOptions,
        retry: {
            limit: 0,
            ...overrideOptions?.retry,
        },
        cookieJar: {
            getCookieString: (url: string) => session!.getCookieString(url),
            setCookie: (rawCookie: string, url: string) => session!.setCookie(rawCookie, url),
            ...overrideOptions?.cookieJar,
        },
    });
}
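
The order of the spreads in the snippet above decides which values win. Below is a minimal sketch of that merge using plain objects. The helper name buildOptions and the sample values are hypothetical; the code only mirrors the shape shown above and is not Crawlee's actual source.

```javascript
// Hypothetical helper that mirrors the spread order of the snippet above.
// Illustration only, not Crawlee's implementation.
function buildOptions(defaults, overrideOptions = {}) {
    return {
        ...defaults,
        // Top-level overrides replace the defaults (url, method, responseType, ...).
        ...overrideOptions,
        // `retry` is rebuilt after the spread, so the limit falls back to 0
        // unless the override sets `retry.limit` explicitly.
        retry: {
            limit: 0,
            ...overrideOptions.retry,
        },
    };
}

// Sample defaults standing in for the values taken from the current request.
const defaults = {
    url: 'https://example.com',
    method: 'GET',
    responseType: 'text',
};

const merged = buildOptions(defaults, {
    responseType: 'json',
    retry: { methods: ['GET'] },
});

console.log(merged.responseType); // 'json' (override wins)
console.log(merged.url);          // 'https://example.com' (default kept)
console.log(merged.retry.limit);  // 0 (override did not set a limit)
```

In this sketch, passing only { retry: { methods: [...] } } keeps the limit at 0, while passing { retry: { limit: 2 } } would raise it.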

url

By default, it's the URL of the current task. However, you can override it with a string or a URL instance if necessary.

More details in the Got documentation.

method

By default, it's the HTTP method of the current task. Possible values include 'GET', 'POST', 'HEAD', 'PUT', 'PATCH', and 'DELETE'.

More details in the Got documentation.

body

By default, it's the HTTP payload of the current task.

More details in the Got documentation.

headers

By default, these are the HTTP headers of the current task, given as an object with string values.

More details in the Got documentation.

proxyUrl

It's a string representing the proxy server in the format protocol://username:password@hostname:port.

For example, an Apify proxy server looks like this: http://auto:password@proxy.apify.com:8000.
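
To make the format concrete, here is a small sketch that splits the example URL into its parts with the standard URL class and then assembles one from parts. The helper buildProxyUrl is hypothetical, and percent-encoding the credentials assumes that the proxy client decodes them again.

```javascript
// Placeholder credentials taken from the example above.
const proxyUrl = new URL('http://auto:password@proxy.apify.com:8000');

console.log(proxyUrl.protocol); // 'http:'
console.log(proxyUrl.username); // 'auto'
console.log(proxyUrl.password); // 'password'
console.log(proxyUrl.hostname); // 'proxy.apify.com'
console.log(proxyUrl.port);     // '8000'

// Hypothetical helper: builds a proxy URL string from its parts.
// Credentials are percent-encoded so characters such as '@' or ':'
// cannot be mistaken for URL delimiters.
function buildProxyUrl({ protocol = 'http', username, password, hostname, port }) {
    const user = encodeURIComponent(username);
    const pass = encodeURIComponent(password);
    return `${protocol}://${user}:${pass}@${hostname}:${port}`;
}

const built = buildProxyUrl({
    username: 'auto',
    password: 'p@ss:word',
    hostname: 'proxy.apify.com',
    port: 8000,
});

console.log(built); // 'http://auto:p%40ss%3Aword@proxy.apify.com:8000'
```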

BasicCrawler does not have the concept of a session or proxy, so we need to pass the proxyUrl option manually:

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest({
            proxyUrl: 'http://auto:password@proxy.apify.com:8000',
        });
        log.info('received body', res.body);
    },
});

We use proxies to hide our real IP address.

More details in the Got Scraping documentation.

sessionToken

It's a non-primitive object used as a key when generating a browser fingerprint. Fingerprints generated with the same token don't change. This can be used to retain the user-agent header when using the same Apify Session.

More details in the Got Scraping documentation.
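
The idea of a token that pins a fingerprint can be sketched with a cache keyed by object identity. The code below is an illustration only: headersFor, generateHeaders, and the user-agent strings are hypothetical stand-ins, and using a WeakMap is an assumption about the mechanism rather than a description of got-scraping's source.

```javascript
// Cache keyed by object identity. A WeakMap takes objects as keys,
// which is one reason a token like this would need to be non-primitive.
const headersByToken = new WeakMap();

// Hypothetical stand-in for a header generator: it returns a different
// user-agent on each call so that the caching effect is visible.
const userAgents = ['UA-A', 'UA-B', 'UA-C'];
let counter = 0;
function generateHeaders() {
    const userAgent = userAgents[counter % userAgents.length];
    counter += 1;
    return { 'user-agent': userAgent };
}

// Generates headers once per token and reuses them afterwards.
function headersFor(sessionToken) {
    if (!headersByToken.has(sessionToken)) {
        headersByToken.set(sessionToken, generateHeaders());
    }
    return headersByToken.get(sessionToken);
}

const sessionA = { id: 'a' };
const sessionB = { id: 'b' };

// Same token: the same headers object comes back every time.
console.log(headersFor(sessionA) === headersFor(sessionA)); // true
// Different token: a separate set of headers is generated.
console.log(headersFor(sessionA)['user-agent'] === headersFor(sessionB)['user-agent']); // false
```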

responseType

This option defines how the response should be parsed.

By default, we fetch HTML pages, which are plain text, so responseType is set to 'text'. JSON is supported as well:

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest({ responseType: 'json' });
        log.info('received body', res.body);
    },
});

More details in the Got documentation.

cookieJar

Got uses a cookieJar to manage cookies. It's an object that follows the interface of the CookieJar from the tough-cookie package.

Example:

import { BasicCrawler } from 'crawlee';
import { CookieJar } from 'tough-cookie';

const cookieJar = new CookieJar();

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest({ cookieJar });
        log.info('received body', res.body);
    },
});
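
The sendRequest snippet earlier relies on only two methods of the jar, getCookieString(url) and setCookie(rawCookie, url). As a rough sketch of that shape, here is a hypothetical in-memory jar named createSimpleCookieJar. It is an assumption-laden illustration: it ignores paths, expiry, and domain matching, all of which tough-cookie handles properly, so it is not a replacement for a real jar.

```javascript
// Hypothetical minimal jar exposing the two methods used in the
// sendRequest snippet. Cookies are stored per origin, by name.
function createSimpleCookieJar() {
    const store = new Map(); // origin -> Map(name -> value)

    return {
        setCookie(rawCookie, url) {
            const { origin } = new URL(url);
            // Keep only the "name=value" pair before the first ';'
            // and drop attributes such as Path or HttpOnly.
            const [pair] = rawCookie.split(';');
            const index = pair.indexOf('=');
            if (index === -1) return;
            const name = pair.slice(0, index).trim();
            const value = pair.slice(index + 1).trim();
            if (!store.has(origin)) store.set(origin, new Map());
            store.get(origin).set(name, value);
        },
        getCookieString(url) {
            const cookies = store.get(new URL(url).origin) ?? new Map();
            // Serialize in the "a=1; b=2" form used by the Cookie header.
            return [...cookies].map(([name, value]) => `${name}=${value}`).join('; ');
        },
    };
}

const simpleJar = createSimpleCookieJar();
simpleJar.setCookie('sid=abc; Path=/; HttpOnly', 'https://example.com/login');
simpleJar.setCookie('theme=dark', 'https://example.com/');

console.log(simpleJar.getCookieString('https://example.com/page')); // 'sid=abc; theme=dark'
console.log(simpleJar.getCookieString('https://other.example/'));   // ''
```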

More details in

retry.limit

This option specifies the maximum number of Got retries.

By default, retry.limit is set to 0. This is because Crawlee has its own (complicated enough) retry management.

We suggest NOT changing this value for stability reasons.

useHeaderGenerator

It's a boolean that controls whether browser headers are generated. By default, it's set to true, and we recommend keeping it that way for better results.

headerGeneratorOptions

This option is an object that specifies how to generate the browser fingerprint. Example:

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const res = await sendRequest({
            headerGeneratorOptions: {
                devices: ['mobile', 'desktop'],
                locales: ['en-US'],
                operatingSystems: ['windows', 'macos', 'android', 'ios'],
                browsers: ['chrome', 'edge', 'firefox', 'safari'],
            },
        });
        log.info('received body', res.body);
    },
});

More details in the HeaderGeneratorOptions documentation.

Related links