Version: 3.0

playwrightUtils

A namespace that contains various utilities for Playwright, the Node.js browser automation library.

Example usage:

import { launchPlaywright, playwrightUtils } from 'crawlee';

// Navigate to https://example.com in Playwright with a POST request
const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.gotoExtended(page, {
    url: 'https://example.com',
    method: 'POST',
});

Index

Interfaces

BlockRequestsOptions

BlockRequestsOptions:

optional extraUrlPatterns

extraUrlPatterns?: string[]

If you just want to append to the default blocked patterns, use this property.

optional urlPatterns

urlPatterns?: string[]

The patterns of URLs to block from being loaded by the browser. Only * can be used as a wildcard, and it is automatically added to the beginning and end of each pattern, so .png is the same as *.png*. This limitation is enforced by the DevTools protocol.
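
The wildcard rule above can be illustrated with a small model. This is only a sketch of the matching semantics as described here, not the browser's or Crawlee's actual implementation; `toWildcardPattern` and `matchesBlockedPattern` are hypothetical helper names introduced for illustration.

```typescript
// Hypothetical helpers that model the documented wildcard semantics.
// They are NOT part of Crawlee or Playwright.

/** Wraps a pattern in `*` on both sides, mirroring the automatic wildcards. */
function toWildcardPattern(pattern: string): string {
    // Drop wildcards the caller already supplied so the result is stable.
    const trimmed = pattern.replace(/^\*+|\*+$/g, '');
    return `*${trimmed}*`;
}

/** Returns true when `url` matches `pattern`, treating `*` as "any characters". */
function matchesBlockedPattern(url: string, pattern: string): boolean {
    const source = toWildcardPattern(pattern)
        .split('*')
        // Escape regular-expression metacharacters in the literal parts.
        .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
        .join('.*');
    return new RegExp(`^${source}$`).test(url);
}

console.log(toWildcardPattern('.png')); // *.png*
console.log(matchesBlockedPattern('https://example.com/logo.png?v=2', '.png')); // true
console.log(matchesBlockedPattern('https://example.com/index.html', '.png')); // false
```

Under this model, `.png` and `*.png*` select exactly the same URLs, which is why the explicit wildcards are redundant.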

DirectNavigationOptions

DirectNavigationOptions:

optional referer

referer?: string

Referer header value. If provided, it takes precedence over the referer header value set by page.setExtraHTTPHeaders(headers).

optional timeout

timeout?: number

Maximum operation time in milliseconds. Defaults to 30 seconds; pass 0 to disable the timeout. The default value can be changed by using the browserContext.setDefaultNavigationTimeout(timeout), browserContext.setDefaultTimeout(timeout), page.setDefaultNavigationTimeout(timeout) or page.setDefaultTimeout(timeout) methods.

optional waitUntil

waitUntil?: 'domcontentloaded' | 'load' | 'networkidle'

When to consider the operation succeeded. Defaults to load. The event can be one of:

  • 'domcontentloaded' - consider operation to be finished when the DOMContentLoaded event is fired.
  • 'load' - consider operation to be finished when the load event is fired.
  • 'networkidle' - consider operation to be finished when there are no network connections for at least 500 ms.
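
To make the defaults above concrete, the following sketch mirrors the option shape in a local type and fills in the documented defaults. The `NavigationOptionsLike` type and the `withDocumentedDefaults` helper are illustrative assumptions, not part of Crawlee or Playwright; in real code the defaults are applied for you.

```typescript
// Local mirror of the documented option shape; an assumption for illustration.
type WaitUntil = 'domcontentloaded' | 'load' | 'networkidle';

interface NavigationOptionsLike {
    referer?: string;
    timeout?: number; // milliseconds; 0 disables the timeout
    waitUntil?: WaitUntil;
}

// Hypothetical helper: fills in the defaults stated in the text (30 s, 'load').
function withDocumentedDefaults(options: NavigationOptionsLike = {}): NavigationOptionsLike {
    return {
        ...options,
        // `??` keeps an explicit 0, which means "no timeout".
        timeout: options.timeout ?? 30000,
        waitUntil: options.waitUntil ?? 'load',
    };
}

console.log(withDocumentedDefaults());
console.log(withDocumentedDefaults({ timeout: 0, waitUntil: 'domcontentloaded' }));
```

The nullish coalescing operator matters here: using `||` instead would silently turn a timeout of 0 back into 30 seconds.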

InjectFileOptions

InjectFileOptions:

optional surviveNavigations

surviveNavigations?: boolean

Enables the injected script to survive page navigations and reloads without needing to be re-injected manually. This does not mean, however, that internal state will be preserved, only that the script will be automatically re-injected on each navigation before any other scripts get the chance to execute.

PlaywrightContextUtils

PlaywrightContextUtils:

blockRequests

  • blockRequests(options?: BlockRequestsOptions): Promise<void>
  • Parameters

    • optional options: BlockRequestsOptions

    Returns Promise<void>

injectFile

  • injectFile(filePath: string, options?: InjectFileOptions): Promise<unknown>
  • Parameters

    • filePath: string
    • optional options: InjectFileOptions

    Returns Promise<unknown>

injectJQuery

  • injectJQuery(): Promise<unknown>
  • Returns Promise<unknown>

parseWithCheerio

  • parseWithCheerio(): Promise<CheerioAPI>
  • Returns Promise<CheerioAPI>

Functions

blockRequests

  • blockRequests(page: Page, options?: BlockRequestsOptions): Promise<void>
  • Forces the Playwright browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources.

    By default, the function will block all URLs including the following patterns:

    [".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"]

    If you want to extend this list further, use the extraUrlPatterns option, which will keep blocking the default patterns, as well as add your custom ones. If you would like to block only specific patterns, use the urlPatterns option, which will override the defaults and block only URLs with your custom patterns.

    This function does not use Playwright's request interception and therefore does not interfere with browser cache. It's also faster than blocking requests using interception, because the blocking happens directly in the browser without the round-trip to Node.js, but it does not provide the extra benefits of request interception.

    The function will never block main document loads and their respective redirects.

    Example usage

    import { launchPlaywright, playwrightUtils } from 'crawlee';

    const browser = await launchPlaywright();
    const page = await browser.newPage();

    // Block all requests to URLs that include `adsbygoogle.js` and also all defaults.
    await playwrightUtils.blockRequests(page, {
        extraUrlPatterns: ['adsbygoogle.js'],
    });

    await page.goto('https://cnn.com');
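
The relationship between urlPatterns and extraUrlPatterns described above can be sketched as a pure function. `resolveBlockedPatterns` and `BlockRequestsOptionsLike` are hypothetical names, and concatenating the two lists when both options are given is an assumption of this sketch rather than documented behavior.

```typescript
// Default patterns, as listed in the text above.
const DEFAULT_BLOCK_PATTERNS: string[] = [
    '.css', '.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff', '.pdf', '.zip',
];

// Local mirror of the documented options; an assumption for illustration.
interface BlockRequestsOptionsLike {
    urlPatterns?: string[];
    extraUrlPatterns?: string[];
}

// Hypothetical helper modelling how the two options combine:
// - urlPatterns replaces the defaults,
// - extraUrlPatterns is appended to whichever base list is in effect.
function resolveBlockedPatterns(options: BlockRequestsOptionsLike = {}): string[] {
    const { urlPatterns = DEFAULT_BLOCK_PATTERNS, extraUrlPatterns = [] } = options;
    return [...urlPatterns, ...extraUrlPatterns];
}

// Defaults plus one custom pattern.
console.log(resolveBlockedPatterns({ extraUrlPatterns: ['adsbygoogle.js'] }));
// Only the custom pattern; the defaults are overridden.
console.log(resolveBlockedPatterns({ urlPatterns: ['.mp4'] }));
```

In short, reach for extraUrlPatterns when the defaults are fine and you only need more, and for urlPatterns when the defaults block something the page needs, such as stylesheets.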

    Parameters

    • page: Page

      Playwright Page object.

    • optional options: BlockRequestsOptions = {}

    Returns Promise<void>

gotoExtended

  • gotoExtended(page: Page, request: Request<Dictionary<any>>, gotoOptions?: DirectNavigationOptions): Promise<Response | null>
  • Extended version of Playwright's page.goto() that can perform requests with HTTP methods other than GET, with custom headers and a POST payload. The URL, method, headers and payload are taken from the request parameter, which must be an instance of the Request class.

    NOTE: In recent versions of Playwright, using requests other than GET, overriding headers, or adding payloads disables the browser cache, which degrades performance.
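
The note above implies that a plain GET request without custom headers or a payload is the cheap case. A minimal sketch of that distinction follows; `RequestLike` and `needsRequestOverride` are hypothetical names used only for illustration, and the exact condition checked internally is an assumption.

```typescript
// Local stand-in for the fields the text says are read from the request.
interface RequestLike {
    url: string;
    method?: string;
    headers?: Record<string, string>;
    payload?: string;
}

// Hypothetical predicate: true when the request deviates from a plain GET
// and would therefore hit the slower, cache-disabling path described above.
function needsRequestOverride(request: RequestLike): boolean {
    const method = (request.method ?? 'GET').toUpperCase();
    const hasHeaders = Object.keys(request.headers ?? {}).length > 0;
    const hasPayload = request.payload !== undefined && request.payload !== '';
    return method !== 'GET' || hasHeaders || hasPayload;
}

console.log(needsRequestOverride({ url: 'https://example.com' })); // false
console.log(needsRequestOverride({ url: 'https://example.com', method: 'POST' })); // true
```

A check like this can help decide when a regular page.goto() is enough and the extended variant is not worth its performance cost.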


    Parameters

    • page: Page

      Playwright Page object.

    • request: Request<Dictionary<any>>
    • optional gotoOptions: DirectNavigationOptions = {}

      Custom options for page.goto().

    Returns Promise<Response | null>

injectFile

  • injectFile(page: Page, filePath: string, options?: InjectFileOptions): Promise<unknown>
  • Injects a JavaScript file into a Playwright page. Unlike Playwright's addScriptTag function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies.

    File contents are cached for up to 10 files to limit file system access.
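
The idea of caching a bounded number of file contents can be sketched as follows. `BoundedFileCache` is a hypothetical class, not Crawlee's implementation, and the least-recently-used eviction policy is an assumption, since the text only states the limit of 10 files.

```typescript
// Hypothetical bounded cache keyed by file path.
// Assumption: the least recently used entry is evicted when the limit is exceeded.
class BoundedFileCache {
    private readonly entries = new Map<string, string>();
    private readonly maxFiles: number;

    constructor(maxFiles = 10) {
        this.maxFiles = maxFiles;
    }

    get(filePath: string): string | undefined {
        const contents = this.entries.get(filePath);
        if (contents !== undefined) {
            // Re-insert to mark the entry as most recently used.
            this.entries.delete(filePath);
            this.entries.set(filePath, contents);
        }
        return contents;
    }

    set(filePath: string, contents: string): void {
        this.entries.delete(filePath);
        this.entries.set(filePath, contents);
        if (this.entries.size > this.maxFiles) {
            // A Map iterates in insertion order, so the first key is the oldest.
            const oldest = this.entries.keys().next().value as string;
            this.entries.delete(oldest);
        }
    }

    get size(): number {
        return this.entries.size;
    }
}

const cache = new BoundedFileCache(2);
cache.set('a.js', 'console.log("a")');
cache.set('b.js', 'console.log("b")');
cache.get('a.js'); // touch a.js so that b.js becomes the oldest entry
cache.set('c.js', 'console.log("c")');
console.log(cache.get('b.js')); // undefined
```

With a cache like this, repeatedly injecting the same few scripts reads each file from disk only once, while the memory footprint stays bounded.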


    Parameters

    • page: Page

      Playwright Page object.

    • filePath: string

      File path

    • optional options: InjectFileOptions = {}

    Returns Promise<unknown>

injectJQuery

  • injectJQuery(page: Page): Promise<unknown>
  • Injects the jQuery library into a Playwright page. jQuery is often useful for various web scraping and crawling tasks. For example, it can help extract text from HTML elements using CSS selectors.

    Beware that the injected jQuery object will be set to the window.$ variable, so it might conflict with other libraries included by the page that use the same variable name (e.g. another version of jQuery). This can affect the functionality of the page's scripts.

    The injected jQuery will survive page navigations and reloads.

    Example usage:

    await playwrightUtils.injectJQuery(page);
    const title = await page.evaluate(() => {
        return $('head title').text();
    });

    Note that injectJQuery() does not affect the Playwright page.$() function in any way.


    Parameters

    • page: Page

      Playwright Page object.

    Returns Promise<unknown>

parseWithCheerio

  • parseWithCheerio(page: Page): Promise<CheerioRoot>
  • Returns a Cheerio handle for page.content(), allowing you to work with the data the same way as with CheerioCrawler.

    Example usage:

    const $ = await playwrightUtils.parseWithCheerio(page);
    const title = $('title').text();

    Parameters

    • page: Page

      Playwright Page object.

    Returns Promise<CheerioRoot>

registerUtilsToContext

  • registerUtilsToContext(context: PlaywrightCrawlingContext<Dictionary<any>>): void
  • Parameters

    • context: PlaywrightCrawlingContext<Dictionary<any>>

    Returns void