Upgrading to v1
Summary
After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result: Apify SDK v1. There were two goals for this release: stability and support for more browsers, namely Firefox and WebKit (Safari).
The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in, and by releasing SDK v1 we commit to making breaking changes only once a year, with a new major release.
We added support for more browsers by replacing PuppeteerPool with browser-pool, a new library that we created specifically for this purpose. It builds on the ideas from PuppeteerPool and extends them to support Playwright. Playwright is a browser automation library similar to Puppeteer. It works with all well-known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new BrowserPool.
A large breaking change is that neither puppeteer nor playwright is bundled with SDK v1. To make the choice of a library easier and installs faster, users will have to install the selected modules and versions themselves. This allows us to add support for even more libraries in the future.
Thanks to the addition of Playwright we now have a PlaywrightCrawler. It is very similar to PuppeteerCrawler and you can pick the one you prefer. It also means we needed to make some interface changes. The launchPuppeteerFunction option of PuppeteerCrawler is gone and launchPuppeteerOptions was replaced by launchContext. We also moved things around in the handlePageFunction arguments. See the migration guide for a more detailed explanation and migration examples.
What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well.
Migration Guide
There are a lot of breaking changes in the v1.0.0 release, but we're confident that updating your code will be a matter of minutes. Below, you'll find examples of how to do it and short tutorials on how to use many of the new features.
Many of the new features are made with power users in mind, so don't worry if something looks complicated. You don't need to use it.
Installation
Previous versions of the SDK bundled the puppeteer package, so you did not have to install it. SDK v1 also supports playwright and we don't want to force users to install both.
To install SDK v1 with Puppeteer (same as previous versions), run:
npm install apify puppeteer
To install SDK v1 with Playwright run:
npm install apify playwright
While we tried to add the most important functionality in the initial release, you may find that there are still some utilities or options that are only supported by Puppeteer and not Playwright.
Running on Apify Platform
If you want to make use of Playwright on the Apify Platform, you need to use a Docker image that supports Playwright. We've created them for you, so head over to the new Docker image guide and pick the one that best suits your needs.
Note that your package.json MUST include puppeteer and/or playwright as dependencies. If you don't list them, the libraries will be uninstalled from your node_modules folder when you build your actors.
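One way to catch a missing dependency before a build is a small check over the parsed package.json. The following is only a sketch: findBrowserLibraries is a hypothetical helper, not part of the SDK, and the version ranges in the sample manifest are placeholders.

```javascript
// Hypothetical helper (not part of the SDK): lists which browser
// automation libraries a parsed package.json declares as dependencies.
function findBrowserLibraries(pkg) {
    const dependencies = pkg.dependencies || {};
    return ['puppeteer', 'playwright'].filter((name) => name in dependencies);
}

// Sample manifest. The version ranges are placeholders, not recommendations.
const pkg = {
    name: 'my-actor',
    dependencies: {
        apify: '^1.0.0',
        playwright: '*',
    },
};

const found = findBrowserLibraries(pkg);
if (found.length === 0) {
    console.warn('Neither puppeteer nor playwright is listed in dependencies.');
}
```

In a real project you would pass in the result of require('./package.json') instead of the sample object.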
Handler arguments are now Crawling Context
Previously, the arguments of user-provided handler functions were passed in separate objects. This made it difficult to track values across function invocations.
const handlePageFunction = async (args1) => {
    args1.hasOwnProperty('proxyInfo') // true
}
const handleFailedRequestFunction = async (args2) => {
    args2.hasOwnProperty('proxyInfo') // false
}
args1 === args2 // false
This happened because a new arguments object was created for each function. With SDK v1 we now have a single object called Crawling Context.
const handlePageFunction = async (crawlingContext1) => {
    crawlingContext1.hasOwnProperty('proxyInfo') // true
}
const handleFailedRequestFunction = async (crawlingContext2) => {
    crawlingContext2.hasOwnProperty('proxyInfo') // true
}
// All contexts are the same object.
crawlingContext1 === crawlingContext2 // true
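To see why a shared object helps, here is a minimal dry run that needs no browser. It assumes the crawler hands the same crawling context to both handlers, as described above. simulateRequest is a stand-in for the crawler and startedAt is a made-up property; neither is part of the SDK.

```javascript
// Handlers as they would be passed to a crawler. Because both receive the
// same crawling context object, a value stored in one is visible in the other.
const handlePageFunction = async (crawlingContext) => {
    // `startedAt` is a made-up property used only for this illustration.
    crawlingContext.startedAt = Date.now();
    throw new Error('Simulated failure');
};

const handleFailedRequestFunction = async (crawlingContext) => {
    // Reads the value that handlePageFunction stored earlier.
    return Date.now() - crawlingContext.startedAt;
};

// Stand-in for the crawler: it calls both handlers with one shared object.
async function simulateRequest() {
    const crawlingContext = { id: 'ctx-1', request: { url: 'https://example.com' } };
    try {
        await handlePageFunction(crawlingContext);
    } catch (err) {
        crawlingContext.error = err;
        return handleFailedRequestFunction(crawlingContext);
    }
    return null;
}
```

Calling simulateRequest() resolves to the number of milliseconds between the two handler calls, which only works because both handlers saw the same object.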
Map of crawling contexts and their IDs
Now that all the objects are the same, we can keep track of all running crawling contexts. We can do that by working with the new id property of crawlingContext. This is useful when you need cross-context access.
let masterContextId;
const handlePageFunction = async ({ id, page, request, crawler }) => {
    if (request.userData.masterPage) {
        masterContextId = id;
        // Prepare the master page.
    } else {
        const masterContext = crawler.crawlingContexts.get(masterContextId);
        const masterPage = masterContext.page;
        const masterRequest = masterContext.request;
        // Now we can manipulate the master data from another handlePageFunction.
    }
}
autoscaledPool was moved under crawlingContext.crawler
To prevent bloat and to make access to certain key objects easier, we exposed a crawler property on the handle page arguments.
const handlePageFunction = async ({ request, page, crawler }) => {
    await crawler.requestQueue.addRequest({ url: 'https://example.com' });
    await crawler.autoscaledPool.pause();
}
This also means that some shorthands like puppeteerPool or autoscaledPool are no longer necessary.
const handlePageFunction = async (crawlingContext) => {
    crawlingContext.autoscaledPool // does NOT exist anymore
    crawlingContext.crawler.autoscaledPool // <= this is correct usage
}
Replacement of PuppeteerPool with BrowserPool
BrowserPool was created to extend PuppeteerPool with the ability to manage other browser automation libraries. The API is similar, but not the same.
Access to running BrowserPool
Only PuppeteerCrawler and PlaywrightCrawler use BrowserPool. You can access it on the crawler object.
const crawler = new Apify.PlaywrightCrawler({
    handlePageFunction: async ({ page, crawler }) => {
        crawler.browserPool // <-----
    }
});
crawler.browserPool // <-----
Pages now have IDs
And they're equal to crawlingContext.id, which gives you access to the full crawlingContext in hooks. See Configuration and lifecycle hooks below.
const pageId = browserPool.getPageId(page);
Configuration and lifecycle hooks
The most important addition with BrowserPool is the lifecycle hooks. You can access them via browserPoolOptions in both crawlers. A full list of browserPoolOptions can be found in the browser-pool readme.
const crawler = new Apify.PuppeteerCrawler({
    browserPoolOptions: {
        retireBrowserAfterPageCount: 10,
        preLaunchHooks: [
            async (pageId, launchContext) => {
                const { request } = crawler.crawlingContexts.get(pageId);
                if (request.userData.useHeadful === true) {
                    launchContext.launchOptions.headless = false;
                }
            }
        ]
    }
})
Introduction of BrowserController
BrowserController is a class of browser-pool that's responsible for browser management. Its purpose is to provide a single API for working with both Puppeteer and Playwright browsers. It works automatically in the background, but if you ever want to close a browser properly, you should use a browserController to do it. You can find it in the handle page arguments.
const handlePageFunction = async ({ page, browserController }) => {
    // Wrong usage. Could backfire because it bypasses BrowserPool.
    await page.browser().close();
    // Correct usage. Allows graceful shutdown.
    await browserController.close();
    const cookies = [/* some cookie objects */];
    // Wrong usage. Will only work in Puppeteer and not Playwright.
    await page.setCookie(...cookies);
    // Correct usage. Will work in both.
    await browserController.setCookies(page, cookies);
}
The BrowserController also includes important information about the browser, such as the context it was launched with. This was difficult to do before SDK v1.
const handlePageFunction = async ({ browserController }) => {
    // Information about the proxy used by the browser
    browserController.launchContext.proxyInfo
    // Session used by the browser
    browserController.launchContext.session
}
BrowserPool methods vs PuppeteerPool
Some functions were removed (in line with earlier deprecations), and some were changed a bit:
// OLD
await puppeteerPool.recyclePage(page);
// NEW
await page.close();
// OLD
await puppeteerPool.retire(page.browser());
// NEW
browserPool.retireBrowserByPage(page);
// OLD
await puppeteerPool.serveLiveViewSnapshot();
// NEW
// There's no LiveView in BrowserPool
Updated PuppeteerCrawlerOptions
To keep PuppeteerCrawler and PlaywrightCrawler consistent, we updated the options.
Removal of gotoFunction
The concept of a configurable gotoFunction is not ideal, especially since we use a modified gotoExtended. Users have to know this when they override gotoFunction if they want to extend the default behavior. We decided to replace gotoFunction with preNavigationHooks and postNavigationHooks.
The following example illustrates how gotoFunction makes things complicated.
const gotoFunction = async ({ request, page }) => {
    // pre-processing
    await makePageStealthy(page);
    // Have to remember how to do this:
    const response = await gotoExtended(page, request, {/* have to remember the defaults */});
    // post-processing
    await page.evaluate(() => {
        window.foo = 'bar';
    });
    // Must not forget!
    return response;
}
const crawler = new Apify.PuppeteerCrawler({
    gotoFunction,
    // ...
})
With preNavigationHooks and postNavigationHooks it's much easier. preNavigationHooks are called with two arguments: crawlingContext and gotoOptions. postNavigationHooks are called only with crawlingContext.
const preNavigationHooks = [
    async ({ page }) => makePageStealthy(page)
];
const postNavigationHooks = [
    async ({ page }) => page.evaluate(() => {
        window.foo = 'bar'
    })
]
const crawler = new Apify.PuppeteerCrawler({
    preNavigationHooks,
    postNavigationHooks,
    // ...
})
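The second argument, gotoOptions, is not used in the example above. As a sketch, a pre-navigation hook could adjust navigation settings per request. This assumes the crawler passes the (possibly mutated) gotoOptions object on to the navigation call. The slowPage flag in userData is hypothetical.

```javascript
// Pre-navigation hook that adjusts navigation settings per request.
// `slowPage` in userData is a hypothetical flag used for illustration.
const extendTimeoutForSlowPages = async ({ request }, gotoOptions) => {
    if (request.userData.slowPage) {
        gotoOptions.timeout = 120 * 1000; // milliseconds
        gotoOptions.waitUntil = 'networkidle2'; // Puppeteer navigation event
    }
};

const preNavigationHooks = [extendTimeoutForSlowPages];
```

Because the hook only touches plain objects, it can be exercised in a unit test by calling it with a fake request and an empty options object.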
launchPuppeteerOptions => launchContext
Those were always a point of confusion because they merged custom Apify options with the launchOptions of Puppeteer.
const launchPuppeteerOptions = {
    useChrome: true, // Apify option
    headless: false, // Puppeteer option
}
Use the new launchContext object, which explicitly defines launchOptions. launchPuppeteerOptions was removed.
const crawler = new Apify.PuppeteerCrawler({
    launchContext: {
        useChrome: true, // Apify option
        launchOptions: {
            headless: false // Puppeteer option
        }
    }
})
LaunchContext is also a type of browser-pool and the structure is exactly the same there. The SDK only adds extra options.
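If you have many legacy option objects, the split can be automated. The helper below is a hypothetical sketch, not part of the SDK. The list of Apify-specific keys is an assumption that you should verify against the SDK documentation for your version. Every other key is treated as a Puppeteer launch option.

```javascript
// Hypothetical migration helper, not part of the SDK. It splits a legacy
// launchPuppeteerOptions object into the v1 launchContext shape.
// ASSUMPTION: the keys below are the Apify-specific options; verify the
// list against the SDK documentation for your version.
const APIFY_OPTION_KEYS = ['useChrome', 'stealth', 'stealthOptions', 'proxyUrl', 'userAgent'];

function toLaunchContext(launchPuppeteerOptions = {}) {
    const launchContext = { launchOptions: {} };
    for (const [key, value] of Object.entries(launchPuppeteerOptions)) {
        if (key === 'puppeteerModule') {
            launchContext.launcher = value; // renamed to `launcher` in v1
        } else if (APIFY_OPTION_KEYS.includes(key)) {
            launchContext[key] = value; // Apify option stays at the top level
        } else {
            launchContext.launchOptions[key] = value; // Puppeteer option
        }
    }
    return launchContext;
}

const launchContext = toLaunchContext({ useChrome: true, headless: false });
// launchContext is { launchOptions: { headless: false }, useChrome: true }
```

The result can be passed as the launchContext option of PuppeteerCrawler.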
Removal of launchPuppeteerFunction
browser-pool introduces the idea of lifecycle hooks, which are functions that are executed when a certain event in the browser lifecycle happens.
const launchPuppeteerFunction = async (launchPuppeteerOptions) => {
    if (someVariable === 'chrome') {
        launchPuppeteerOptions.useChrome = true;
    }
    return Apify.launchPuppeteer(launchPuppeteerOptions);
}
const crawler = new Apify.PuppeteerCrawler({
    launchPuppeteerFunction,
    // ...
})
Now you can recreate the same functionality with a preLaunchHook:
const maybeLaunchChrome = (pageId, launchContext) => {
    if (someVariable === 'chrome') {
        launchContext.useChrome = true;
    }
}
const crawler = new Apify.PuppeteerCrawler({
    browserPoolOptions: {
        preLaunchHooks: [maybeLaunchChrome]
    },
    // ...
})
This is better in multiple ways. It is consistent across both Puppeteer and Playwright. It allows you to easily construct your browsers with pre-defined behavior:
const preLaunchHooks = [
    maybeLaunchChrome,
    useHeadfulIfNeeded,
    injectNewFingerprint,
]
And thanks to the addition of crawler.crawlingContexts the functions also have access to the crawlingContext of the request that triggered the launch.
const preLaunchHooks = [
    // Named after what it does: switches to headful mode when the request asks for it.
    async function useHeadfulIfNeeded(pageId, launchContext) {
        const { request } = crawler.crawlingContexts.get(pageId);
        if (request.userData.useHeadful === true) {
            launchContext.launchOptions.headless = false;
        }
    }
]
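Since hooks only mutate the launchContext they receive, a list of them can be dry-run without launching a browser. The sketch below uses a stand-in runner, runPreLaunchHooks, which is not part of browser-pool. It assumes the hooks run one after another in array order. The fake crawlingContexts map and the createUseHeadfulIfNeeded factory are likewise illustrative.

```javascript
// Two small hooks. `someVariable` and the `useHeadful` flag are
// illustrative; the hooks only mutate the launchContext they receive.
const someVariable = 'chrome';

const maybeLaunchChrome = async (pageId, launchContext) => {
    if (someVariable === 'chrome') launchContext.useChrome = true;
};

// Factory so the hook can look up the crawling context by page ID.
const createUseHeadfulIfNeeded = (crawlingContexts) => async (pageId, launchContext) => {
    const { request } = crawlingContexts.get(pageId);
    if (request.userData.useHeadful === true) {
        launchContext.launchOptions.headless = false;
    }
};

// Stand-in for BrowserPool: runs hooks one after another (assumed order).
async function runPreLaunchHooks(hooks, pageId, launchContext) {
    for (const hook of hooks) {
        await hook(pageId, launchContext);
    }
    return launchContext;
}

// Fake crawlingContexts map for a dry run without a browser.
const crawlingContexts = new Map([
    ['page-1', { request: { userData: { useHeadful: true } } }],
]);
const preLaunchHooks = [maybeLaunchChrome, createUseHeadfulIfNeeded(crawlingContexts)];
```

Calling runPreLaunchHooks(preLaunchHooks, 'page-1', { launchOptions: { headless: true } }) resolves to a launch context with useChrome set and headless switched off.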
Launch functions
In addition to Apify.launchPuppeteer() we now also have Apify.launchPlaywright().
Updated arguments
We updated the launch options object because it was a frequent source of confusion.
// OLD
await Apify.launchPuppeteer({
    useChrome: true,
    headless: true,
})
// NEW
await Apify.launchPuppeteer({
    useChrome: true,
    launchOptions: {
        headless: true,
    }
})
Custom modules
Apify.launchPuppeteer already supported the puppeteerModule option. With Playwright, we normalized the name to launcher because the playwright module itself does not launch browsers.
const puppeteer = require('puppeteer');
const playwright = require('playwright');
await Apify.launchPuppeteer();
// Is the same as:
await Apify.launchPuppeteer({
    launcher: puppeteer
})
await Apify.launchPlaywright();
// Is the same as:
await Apify.launchPlaywright({
    launcher: playwright.chromium
})
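Because the launcher is a browser type rather than the whole playwright module, switching to Firefox or WebKit means passing a different browser type. The helper below is a hypothetical sketch, not part of the SDK. It relies only on the chromium, firefox and webkit browser types that the playwright module exports.

```javascript
// Hypothetical helper: picks a Playwright browser type by name.
// `chromium`, `firefox` and `webkit` are the browser types exported by
// the playwright module.
function pickLauncher(playwrightModule, browserName = 'chromium') {
    const supported = ['chromium', 'firefox', 'webkit'];
    if (!supported.includes(browserName)) {
        throw new Error(`Unsupported browser: ${browserName}`);
    }
    return playwrightModule[browserName];
}

// Usage (requires `apify` and `playwright` to be installed):
// const playwright = require('playwright');
// const browser = await Apify.launchPlaywright({
//     launcher: pickLauncher(playwright, 'firefox'),
// });
```

Validating the name up front gives a clear error instead of passing undefined as the launcher.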