RequestListOptions
Properties
optional keepDuplicateUrls
optional persistRequestsKey
Identifies the key in the default key-value store under which the RequestList persists its Requests during the RequestList.initialize call. This is necessary if persistStateKey is set and the source URLs might change, to ensure consistency of the source URLs and the state object. However, it comes with some storage and performance overhead.
If persistRequestsKey is not set, RequestList.initialize will always fetch the sources from their origin, check that they are consistent with the restored state (if any), and throw an error if they are not.
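The consistency check mentioned above can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name assertConsistent is hypothetical, and the specific rule used here (the restored state must still point at the same request, and every in-progress key must still exist) is a guess at what such a check could look like, not the library's documented behavior.

```javascript
// Hypothetical check: does a restored state still match freshly fetched sources?
// `requests` is an array of objects with a `uniqueKey` property.
// `state` has the shape { nextIndex, nextUniqueKey, inProgress }.
function assertConsistent(requests, state) {
    if (!state) return; // nothing was restored, so nothing to compare

    // The request at the stored position must have the stored unique key.
    const next = requests[state.nextIndex];
    const nextKey = next ? next.uniqueKey : null;
    if (nextKey !== state.nextUniqueKey) {
        throw new Error('Sources changed: nextUniqueKey does not match the request at nextIndex.');
    }

    // Every request that was in progress must still be present in the sources.
    const knownKeys = new Set(requests.map((request) => request.uniqueKey));
    for (const key of Object.keys(state.inProgress)) {
        if (!knownKeys.has(key)) {
            throw new Error(`Sources changed: in-progress request "${key}" is missing.`);
        }
    }
}
```

Persisting the requests themselves avoids this failure mode, because the restored state is then always paired with the exact list it was created from.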
optional persistStateKey
Identifies the key in the default key-value store under which RequestList periodically stores its state (i.e. which URLs were crawled and which were not). If the crawler is restarted, RequestList will read the state and continue where it left off.
If persistStateKey is not set, RequestList will always start from the beginning, and all the source URLs will be crawled again.
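The resume-after-restart behavior can be sketched with an in-memory stand-in for the key-value store. Everything here is hypothetical and simplified for illustration: the TinyList class and the memoryStore map are not part of the library, only the position in the list is tracked, and the state is written on every fetch rather than periodically.

```javascript
// In-memory stand-in for the default key-value store (assumption for this sketch).
const memoryStore = new Map();

// Hypothetical minimal list that only remembers the index of the next URL.
class TinyList {
    constructor(urls, persistStateKey) {
        this.urls = urls;
        this.persistStateKey = persistStateKey;
        // Restore the saved position if a key is set and a state exists.
        const saved = persistStateKey ? memoryStore.get(persistStateKey) : undefined;
        this.nextIndex = saved ? saved.nextIndex : 0;
    }

    fetchNext() {
        if (this.nextIndex >= this.urls.length) return null;
        const url = this.urls[this.nextIndex];
        this.nextIndex += 1;
        // Save the new position so a later instance can continue from it.
        if (this.persistStateKey) {
            memoryStore.set(this.persistStateKey, { nextIndex: this.nextIndex });
        }
        return url;
    }
}
```

Creating a second TinyList with the same key after the first one has handed out a URL continues with the following URL, while a list created without a key starts from the first URL again.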
optional proxyConfiguration
Used to pass the proxy configuration for the requestsFromUrl objects. Takes advantage of the internal address rotation and authentication process.
If undefined, the requestsFromUrl requests will be made without a proxy.
optional sources
An array of sources of URLs for the RequestList. It can be an array of strings, of plain objects that define at least the url property, or of Request instances.
IMPORTANT: The sources array will be consumed (left empty) after RequestList initializes. This is a measure to prevent memory leaks in situations when millions of sources are added.
Additionally, the requestsFromUrl property may be used instead of url, which will instruct RequestList to download the source URLs from a given remote location. The URLs will be parsed from the received response.
[
    // A single URL
    'http://example.com/a/b',
    // Modify Request options
    { method: 'PUT', url: 'https://example.com/put', payload: { foo: 'bar' } },
    // Batch import of URLs from a file hosted on the web,
    // where the URLs should be requested using the HTTP POST request
    { method: 'POST', requestsFromUrl: 'http://example.com/urls.txt' },
    // Batch import from remote file, using a specific regular expression to extract the URLs.
    { requestsFromUrl: 'http://example.com/urls.txt', regex: /https:\/\/example\.com\/.+/ },
    // Get list of URLs from a Google Sheets document. Just add "/gviz/tq?tqx=out:csv" to the Google Sheet URL.
    // For details, see https://help.apify.com/en/articles/2906022-scraping-a-list-of-urls-from-a-google-sheets-document
    { requestsFromUrl: 'https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w/gviz/tq?tqx=out:csv' },
]
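To make the regex option more concrete, here is a rough sketch of how URLs could be pulled out of a downloaded text body with a regular expression. The extractUrls helper and the sample text are hypothetical and only illustrate the idea; how the library performs this step internally is not described in this section.

```javascript
// Hypothetical helper: returns every substring of `text` that matches `regex`.
function extractUrls(text, regex) {
    // Make sure the expression is global so that all matches are returned.
    const flags = regex.flags.includes('g') ? regex.flags : `${regex.flags}g`;
    const globalRegex = new RegExp(regex.source, flags);
    return text.match(globalRegex) || [];
}

// Sample response body, as it might come from a remote text file.
const sampleBody = 'see https://example.com/a and https://example.com/b/c\nhttp://other.org/x';

// Slashes and dots must be escaped inside a regular expression literal.
const sampleUrls = extractUrls(sampleBody, /https:\/\/example\.com\/[^\s]+/);
// sampleUrls is ['https://example.com/a', 'https://example.com/b/c']
```

Note that a greedy pattern such as .+ matches up to the end of the line, so it suits files with one URL per line. A pattern like [^\s]+ is a safer choice when several URLs can share a line.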
optional sourcesFunction
A function that will be called to get the sources for the RequestList, but only if RequestList was not able to fetch their persisted version (see RequestListOptions.persistRequestsKey). It must return an array of Request or RequestOptions objects.
This is very useful when getting the sources is a resource-intensive or time-consuming task, such as fetching URLs from multiple sitemaps or parsing URLs from large datasets. Using the sourcesFunction in combination with persistStateKey and persistRequestsKey will allow you to fetch and parse those URLs only once, saving valuable time when your crawler migrates or restarts.
If both RequestListOptions.sources and RequestListOptions.sourcesFunction are provided, the sources returned by the function will be added after the sources.
Example:
// Let's say we want to scrape URLs extracted from sitemaps.
const sourcesFunction = async () => {
    // With super large sitemaps, this operation could take very long
    // and big websites typically have multiple sitemaps.
    const sitemaps = await downloadHugeSitemaps();
    return parseUrlsFromSitemaps(sitemaps);
};
// Sitemaps can change in real-time, so it's important to persist
// the URLs we collected. Otherwise we might lose our scraping
// state in case of a crawler migration / failure / time-out.
const requestList = await RequestList.open(null, [], {
    // The sourcesFunction is called now and the Requests are persisted.
    // If something goes wrong and we need to start again, RequestList
    // will load the persisted Requests from storage and will NOT
    // call the sourcesFunction again, saving time and resources.
    sourcesFunction,
    persistStateKey: 'state-key',
    persistRequestsKey: 'requests-key',
});
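The interplay between sources, sourcesFunction and persistRequestsKey described above can be sketched as follows. This is a simplified model, not the library's code: loadSources and requestStore are hypothetical names, and an in-memory Map stands in for the key-value store.

```javascript
// In-memory stand-in for the key-value store (assumption for this sketch).
const requestStore = new Map();

// Hypothetical loader that mirrors the documented order of operations.
async function loadSources({ sources = [], sourcesFunction, persistRequestsKey }) {
    // 1. Prefer the persisted version, if there is one.
    if (persistRequestsKey && requestStore.has(persistRequestsKey)) {
        return requestStore.get(persistRequestsKey);
    }
    // 2. Otherwise call sourcesFunction and append its results after `sources`.
    const fromFunction = sourcesFunction ? await sourcesFunction() : [];
    const allSources = [...sources, ...fromFunction];
    // 3. Persist the combined list so the expensive function is not called again.
    if (persistRequestsKey) requestStore.set(persistRequestsKey, allSources);
    return allSources;
}
```

Calling loadSources twice with the same persistRequestsKey runs the sourcesFunction only on the first call. The second call returns the stored list, which is the saving the example above relies on.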
optional state
The state object that the RequestList will be initialized from. It has the form returned by RequestList.getState(), for example:
{
    nextIndex: 5,
    nextUniqueKey: 'unique-key-5',
    inProgress: {
        'unique-key-1': true,
        'unique-key-4': true,
    },
}
Note that the preferred (and simpler) way to persist the crawling state of the RequestList is to use the persistStateKey parameter instead.
By default, RequestList will deduplicate the provided URLs. Default deduplication is based on the uniqueKey property of passed source Request objects. If the property is not present, it is generated by normalizing the URL. If present, it is kept intact. In any case, only one request per uniqueKey is added to the RequestList, resulting in removal of duplicate URLs / unique keys.
Setting keepDuplicateUrls to true will append an additional identifier to the uniqueKey of each request that does not already include a uniqueKey. Therefore, duplicate URLs will be kept in the list. It does not, however, protect the user from having duplicates in user-set uniqueKeys. It is the user's responsibility to ensure uniqueness of their unique keys if they wish to keep more than just a single copy in the RequestList.
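The deduplication rules above can be sketched in plain JavaScript. This is a simplified illustration rather than the library's implementation: the helper names normalizeUrl and buildRequestList are hypothetical, the normalization (dropping the fragment and a trailing slash) is an assumption standing in for the real one, and the appended identifier is simply the position of the source in the array.

```javascript
// Hypothetical, simplified URL normalization used only for this sketch.
function normalizeUrl(url) {
    const parsed = new URL(url);
    parsed.hash = ''; // ignore the fragment
    return parsed.toString().replace(/\/$/, ''); // ignore a trailing slash
}

// Builds a list of { url, uniqueKey } objects, one per uniqueKey.
function buildRequestList(sources, { keepDuplicateUrls = false } = {}) {
    const seenKeys = new Set();
    const requests = [];
    sources.forEach((source, index) => {
        const request = typeof source === 'string' ? { url: source } : { ...source };
        if (request.uniqueKey === undefined) {
            // No user-set key: derive one from the normalized URL.
            request.uniqueKey = normalizeUrl(request.url);
            // With keepDuplicateUrls, make the generated key unique per source.
            if (keepDuplicateUrls) request.uniqueKey += `-${index}`;
        }
        // User-set keys are kept intact, so duplicates among them are still dropped.
        if (seenKeys.has(request.uniqueKey)) return;
        seenKeys.add(request.uniqueKey);
        requests.push(request);
    });
    return requests;
}
```

In this sketch, two sources with the same URL collapse into one request by default and are both kept when keepDuplicateUrls is true, while two sources that share a user-set uniqueKey collapse into one in either mode.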