Version: Next

RequestListOptions

Properties

optional keepDuplicateUrls

keepDuplicateUrls?: boolean = false

By default, RequestList will deduplicate the provided URLs. Default deduplication is based on the uniqueKey property of passed source Request objects.

If uniqueKey is not present, it is generated by normalizing the URL; if present, it is kept intact. Either way, only one request per uniqueKey is added to the RequestList, so duplicate URLs and duplicate unique keys are removed.

Setting keepDuplicateUrls to true appends an additional identifier to the uniqueKey of each request that does not already include a uniqueKey, so duplicate URLs are kept in the list. It does not, however, protect against duplicates among user-set uniqueKeys. If you set uniqueKeys yourself and want to keep more than a single copy of a request in the RequestList, it is your responsibility to make those keys unique.
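The deduplication rules above can be sketched as a small standalone model. This is not the library's implementation: `SourceRequest`, `normalizeUrl` and `buildList` are hypothetical names, the normalization is a deliberately simplified stand-in, and using the source index as the "additional identifier" is an assumption made only for illustration.

```typescript
// Hypothetical, simplified model of the deduplication rules described above.
interface SourceRequest {
    url: string;
    uniqueKey?: string;
}

// Simplified stand-in for URL normalization (assumption): drop the fragment,
// lower-case the URL and strip trailing slashes.
function normalizeUrl(url: string): string {
    return url.replace(/#.*$/, '').trim().toLowerCase().replace(/\/+$/, '');
}

function buildList(sources: SourceRequest[], keepDuplicateUrls = false): SourceRequest[] {
    const seen = new Set<string>();
    const result: SourceRequest[] = [];
    sources.forEach((source, index) => {
        let uniqueKey = source.uniqueKey;
        if (uniqueKey === undefined) {
            // No user-set key: generate one from the normalized URL.
            uniqueKey = normalizeUrl(source.url);
            // With keepDuplicateUrls, append an extra identifier (here: the index).
            if (keepDuplicateUrls) uniqueKey = `${uniqueKey}-${index}`;
        }
        // Only one request per uniqueKey is kept; user-set keys are never altered.
        if (seen.has(uniqueKey)) return;
        seen.add(uniqueKey);
        result.push({ url: source.url, uniqueKey });
    });
    return result;
}

const sources: SourceRequest[] = [
    { url: 'https://example.com/a' },
    { url: 'https://example.com/a/' },
    { url: 'https://example.com/a', uniqueKey: 'custom' },
    { url: 'https://example.com/b', uniqueKey: 'custom' },
];

const deduplicated = buildList(sources);
const withDuplicates = buildList(sources, true);
```

In this model, `deduplicated` keeps one generated-key entry plus one `custom` entry, while `withDuplicates` keeps both generated-key entries. The second `custom` entry is dropped in both cases, because user-set keys are the user's responsibility.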

optional persistRequestsKey

persistRequestsKey?: string

Identifies the key in the default key-value store under which the RequestList persists its Requests during the RequestList.initialize call. This is necessary if persistStateKey is set and the source URLs might change, to keep the source URLs consistent with the state object. However, it comes with some storage and performance overhead.

If persistRequestsKey is not set, RequestList.initialize will always fetch the sources from their origin, check that they are consistent with the restored state (if any) and throw an error if they are not.

optional persistStateKey

persistStateKey?: string

Identifies the key in the default key-value store under which RequestList periodically stores its state (i.e. which URLs were crawled and which not). If the crawler is restarted, RequestList will read the state and continue where it left off.

If persistStateKey is not set, RequestList will always start from the beginning, and all the source URLs will be crawled again.
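To illustrate the difference between running with and without persistStateKey, here is a minimal in-memory model. Everything in it is an assumption made for illustration: `keyValueStore` stands in for the default key-value store, `ResumableList` is a hypothetical class, and the state is saved after every fetch rather than periodically. The real RequestList also tracks in-progress requests, which this sketch ignores.

```typescript
// In-memory stand-in for the default key-value store (hypothetical).
const keyValueStore = new Map<string, string>();

// Hypothetical, heavily simplified model of a list that can resume its position.
class ResumableList {
    private readonly urls: string[];
    private readonly persistStateKey: string | undefined;
    private nextIndex = 0;

    constructor(urls: string[], persistStateKey?: string) {
        this.urls = urls;
        this.persistStateKey = persistStateKey;
        const saved = persistStateKey === undefined ? undefined : keyValueStore.get(persistStateKey);
        if (saved !== undefined) {
            // Restore the position saved by a previous run.
            this.nextIndex = (JSON.parse(saved) as { nextIndex: number }).nextIndex;
        }
    }

    fetchNext(): string | null {
        const url = this.urls[this.nextIndex];
        if (url === undefined) return null; // list exhausted
        this.nextIndex += 1;
        if (this.persistStateKey !== undefined) {
            // Persist the state so that a restarted run can continue from here.
            keyValueStore.set(this.persistStateKey, JSON.stringify({ nextIndex: this.nextIndex }));
        }
        return url;
    }
}

const crawlUrls = ['https://example.com/1', 'https://example.com/2', 'https://example.com/3'];

// First run processes one URL, then the crawler "restarts".
const firstRun = new ResumableList(crawlUrls, 'state-key');
firstRun.fetchNext();

// With the same persistStateKey, the second run continues where the first left off.
const secondRun = new ResumableList(crawlUrls, 'state-key');
const resumedUrl = secondRun.fetchNext();

// Without a persistStateKey, the list always starts from the beginning.
const withoutKey = new ResumableList(crawlUrls);
const restartedUrl = withoutKey.fetchNext();
```

Here `resumedUrl` is the second URL and `restartedUrl` is the first one again, matching the two behaviours described above.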

optional proxyConfiguration

proxyConfiguration?: ProxyConfiguration

The proxy configuration used when fetching the URL lists referenced by requestsFromUrl sources. It takes advantage of the internal address rotation and authentication process. If undefined, the requestsFromUrl requests are made without a proxy.

optional sources

sources?: RequestListSource[]

An array of sources of URLs for the RequestList. Each item can be a URL string, a plain object that defines at least the url property, or a Request instance.

IMPORTANT: The sources array will be consumed (left empty) after RequestList initializes. This is a measure to prevent memory leaks in situations when millions of sources are added.

Additionally, the requestsFromUrl property may be used instead of url, which will instruct RequestList to download the source URLs from a given remote location. The URLs will be parsed from the received response.

[
    // A single URL
    'http://example.com/a/b',

    // Modify Request options (the payload is passed as a string)
    { method: 'PUT', url: 'https://example.com/put', payload: JSON.stringify({ foo: 'bar' }) },

    // Batch import of URLs from a file hosted on the web,
    // where the URLs should be requested using the HTTP POST request
    { method: 'POST', requestsFromUrl: 'http://example.com/urls.txt' },

    // Batch import from remote file, using a specific regular expression to extract the URLs.
    { requestsFromUrl: 'http://example.com/urls.txt', regex: /https:\/\/example\.com\/.+/ },

    // Get list of URLs from a Google Sheets document. Just add "/gviz/tq?tqx=out:csv" to the Google Sheet URL.
    // For details, see https://help.apify.com/en/articles/2906022-scraping-a-list-of-urls-from-a-google-sheets-document
    { requestsFromUrl: 'https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w/gviz/tq?tqx=out:csv' },
]
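To show what a regex such as the one in the requestsFromUrl example selects, the following standalone sketch applies it to a plain-text body. `extractUrls` is a hypothetical helper written for this illustration, and the way RequestList parses the downloaded response internally may differ.

```typescript
// Hypothetical helper: return every substring of `body` matched by `regex`.
function extractUrls(body: string, regex: RegExp): string[] {
    // Make sure the global flag is set so that all matches are returned.
    const flags = regex.flags.includes('g') ? regex.flags : `${regex.flags}g`;
    const globalRegex = new RegExp(regex.source, flags);
    return Array.from(body.match(globalRegex) ?? []);
}

// Example response body with one URL per line (made-up content).
const responseBody = [
    'https://example.com/one',
    'https://other.example.org/skip',
    'https://example.com/two',
].join('\n');

// `.` does not match newlines, so each match ends at the end of its line.
const extractedUrls = extractUrls(responseBody, /https:\/\/example\.com\/.+/);
```

Only the two `https://example.com/...` lines are selected; the URL on the other host does not match the pattern and is skipped.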

optional sourcesFunction

sourcesFunction?: RequestListSourcesFunction

A function that will be called to get the sources for the RequestList, but only if RequestList was not able to fetch their persisted version (see RequestListOptions.persistRequestsKey). It must return an Array of Request or RequestOptions.

This is very useful when getting the sources is a resource-intensive or time-consuming task, such as fetching URLs from multiple sitemaps or parsing URLs from large datasets. Using the sourcesFunction in combination with persistStateKey and persistRequestsKey will allow you to fetch and parse those URLs only once, saving valuable time when your crawler migrates or restarts.

If both RequestListOptions.sources and RequestListOptions.sourcesFunction are provided, the sources returned by the function will be added after the items from RequestListOptions.sources.

Example:

// Let's say we want to scrape URLs extracted from sitemaps.
// downloadHugeSitemaps and parseUrlsFromSitemaps are placeholder helpers.

const sourcesFunction = async () => {
    // With super large sitemaps, this operation could take very long
    // and big websites typically have multiple sitemaps.
    const sitemaps = await downloadHugeSitemaps();
    return parseUrlsFromSitemaps(sitemaps);
};

// Sitemaps can change in real-time, so it's important to persist
// the URLs we collected. Otherwise we might lose our scraping
// state in case of a crawler migration / failure / time-out.
const requestList = await RequestList.open(null, [], {
    // The sourcesFunction is called now and the Requests are persisted.
    // If something goes wrong and we need to start again, RequestList
    // will load the persisted Requests from storage and will NOT
    // call the sourcesFunction again, saving time and resources.
    sourcesFunction,
    persistStateKey: 'state-key',
    persistRequestsKey: 'requests-key',
});

optional state

The state object that the RequestList will be initialized from. It has the same form as the object returned by RequestList.getState(), for example:

{
    nextIndex: 5,
    nextUniqueKey: 'unique-key-5',
    inProgress: {
        'unique-key-1': true,
        'unique-key-4': true,
    },
}

Note that the preferred (and simpler) way to persist the crawling state of the RequestList is to use the persistStateKey option instead.
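If you do pass a state object by hand, it can help to check its shape first. The sketch below derives a type from the example state object and adds a runtime guard. `ListStateShape` and `isListStateShape` are hypothetical names, and the field types are inferred only from that example, so the library's own state type may be broader.

```typescript
// Shape inferred from the example state object (assumption, not the library's type).
interface ListStateShape {
    nextIndex: number;
    nextUniqueKey: string;
    inProgress: Record<string, boolean>;
}

// Hypothetical runtime guard that checks an unknown value against that shape.
function isListStateShape(value: unknown): value is ListStateShape {
    if (typeof value !== 'object' || value === null) return false;
    const { nextIndex, nextUniqueKey, inProgress } = value as Record<string, unknown>;
    if (typeof nextIndex !== 'number' || typeof nextUniqueKey !== 'string') return false;
    if (typeof inProgress !== 'object' || inProgress === null) return false;
    // Every in-progress entry must map a unique key to a boolean flag.
    const flags = inProgress as Record<string, unknown>;
    return Object.keys(flags).every((key) => typeof flags[key] === 'boolean');
}

const exampleState: unknown = {
    nextIndex: 5,
    nextUniqueKey: 'unique-key-5',
    inProgress: {
        'unique-key-1': true,
        'unique-key-4': true,
    },
};

const exampleIsValid = isListStateShape(exampleState);
const brokenIsValid = isListStateShape({ nextIndex: '5', inProgress: {} });
```

The example state passes the guard, while the second object is rejected because `nextIndex` is a string and `nextUniqueKey` is missing.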