
System Information V2

caution

This is an experimental feature. While we welcome testers, keep in mind that it is currently not recommended to use this in production.

The API is subject to change, and we might introduce breaking changes in the future.

If you do use it, feel free to open issues on our GitHub repository, and we'll take a look.

Starting with the newest Crawlee beta, we have introduced a new crawler option that enables an improved metric collection system. The new system collects CPU and memory metrics more accurately in containerized environments by checking for cgroup-enforced limits.

How to enable the experiment

note

This example shows how to enable the experiment in the CheerioCrawler, but you can apply this to any crawler type.

import { CheerioCrawler, Configuration } from 'crawlee';

Configuration.set('systemInfoV2', true);

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);
    },
});

await crawler.run(['https://crawlee.dev']);

Other changes

info

This section is only useful if you're a tinkerer and want to see what's going on under the hood.

The existing solution checked the bare-metal metrics for how much CPU and memory was being used and how much headroom was available. This is an intuitive approach, but unfortunately it doesn't account for an external limit on the amount of resources a process can consume. This is often the case in containerized environments, where each container has a quota for its CPU and memory usage.

This experiment attempts to address this issue by introducing a new isContainerized() utility function and changing the way resources are collected when a container is detected.

note

The isContainerized() function is very similar to the existing isDocker() function; for now, the two work side by side. If this experiment is successful, isDocker() may eventually be deprecated in favor of isContainerized().
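
To see the two side by side, here is a minimal sketch, assuming both functions are exported from @crawlee/utils (the export location of isContainerized() is our assumption):

import { isDocker, isContainerized } from '@crawlee/utils';

// isDocker() only recognizes Docker; isContainerized() also looks at
// cgroup limits, so it should cover other container runtimes too.
console.log('isDocker:', await isDocker());
console.log('isContainerized:', await isContainerized());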

Cgroup detection

On Linux, to detect whether cgroup is available, we check if there is a directory at /sys/fs/cgroup. If the directory exists, a version of cgroup is installed. Next, we determine which version is installed by checking for a directory at /sys/fs/cgroup/memory/. If it exists, cgroup V1 is installed; if it is missing, cgroup V2 is assumed.
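
A minimal sketch of that detection logic, as our own illustration rather than the actual Crawlee source:

import { access } from 'node:fs/promises';

// Resolves to true when the given path exists.
async function exists(path: string): Promise<boolean> {
    try {
        await access(path);
        return true;
    } catch {
        return false;
    }
}

// Returns the detected cgroup version, or undefined when cgroup is unavailable.
async function detectCgroupVersion(): Promise<'v1' | 'v2' | undefined> {
    // No /sys/fs/cgroup directory means cgroup is not installed at all.
    if (!(await exists('/sys/fs/cgroup'))) return undefined;
    // The memory controller directory only exists in the V1 hierarchy.
    if (await exists('/sys/fs/cgroup/memory/')) return 'v1';
    // Otherwise, assume the unified V2 hierarchy.
    return 'v2';
}

console.log('cgroup version:', await detectCgroupVersion());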

CPU metric collection

The existing solution worked by checking the fraction of idle CPU ticks against the total number of CPU ticks elapsed since the last profile. If 100,000 ticks elapse and 5,000 of them are idle, the CPU is at 95% utilization.
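
To make the arithmetic concrete, here is a simplified sketch of that calculation using Node's os.cpus() tick counters (not Crawlee's actual code):

import os from 'node:os';

// Sums idle and total ticks across all cores.
function sampleTicks() {
    return os.cpus().reduce((acc, { times }) => {
        acc.idle += times.idle;
        acc.total += times.user + times.nice + times.sys + times.idle + times.irq;
        return acc;
    }, { idle: 0, total: 0 });
}

const before = sampleTicks();
await new Promise((resolve) => setTimeout(resolve, 1000));
const after = sampleTicks();

// E.g. 100,000 elapsed ticks with 5,000 idle => 1 - 5000/100000 = 95% utilization.
const used = 1 - (after.idle - before.idle) / (after.total - before.total);
console.log(`CPU utilization: ${(used * 100).toFixed(1)}%`);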

In this experiment, the method of CPU load calculation depends on the result of isContainerized(), or on the CRAWLEE_CONTAINERIZED environment variable if it is set. If isContainerized() returns true, the new cgroup-aware metric collection is used instead of the "bare metal" numbers. It works by inspecting the /sys/fs/cgroup/cpuacct/cpuacct.usage, /sys/fs/cgroup/cpu/cpu.cfs_quota_us and /sys/fs/cgroup/cpu/cpu.cfs_period_us files for cgroup V1, and the /sys/fs/cgroup/cpu.stat and /sys/fs/cgroup/cpu.max files for cgroup V2. The actual CPU usage figure is calculated in the same manner as the "bare metal" figure, by comparing the total number of ticks elapsed with the number of idle ticks between profiles, but using the figures from the cgroup files. If no cgroup quota is enforced, the "bare metal" numbers are used.
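
For the cgroup V2 case, cpu.max holds the quota and period, and cpu.stat holds cumulative usage in microseconds; the load is the usage consumed between two profiles divided by the CPU time the quota allows over that interval. The following is a rough sketch of that idea, not the code Crawlee ships:

import { readFile } from 'node:fs/promises';

// Cumulative CPU time (in microseconds) consumed by the cgroup so far.
async function readUsageUsec(): Promise<number> {
    const stat = await readFile('/sys/fs/cgroup/cpu.stat', 'utf8');
    return Number(stat.match(/^usage_usec (\d+)$/m)![1]);
}

// cpu.max contains "<quota> <period>", or "max <period>" when no quota is enforced.
async function readQuota(): Promise<{ quota: number; period: number } | undefined> {
    const [quota, period] = (await readFile('/sys/fs/cgroup/cpu.max', 'utf8')).trim().split(' ');
    return quota === 'max' ? undefined : { quota: Number(quota), period: Number(period) };
}

const limit = await readQuota();
if (limit === undefined) {
    console.log('No cgroup quota enforced, falling back to "bare metal" numbers.');
} else {
    const before = await readUsageUsec();
    await new Promise((resolve) => setTimeout(resolve, 1000));
    const after = await readUsageUsec();
    // Allowed CPU time per second of wall time, given the quota/period ratio.
    const allowedUsecPerSecond = (limit.quota / limit.period) * 1_000_000;
    console.log(`CPU load: ${(((after - before) / allowedUsecPerSecond) * 100).toFixed(1)}%`);
}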

Memory metric collection

The existing solution was already cgroup-aware; however, an improvement has been made to memory metric collection when running on Windows. The existing solution used an external package, @apify/ps-tree, to find the amount of memory Crawlee and any child processes were using. On Windows, this package used the deprecated WMIC command-line utility to determine memory usage.

In this experiment, @apify/ps-tree has been removed and replaced by the packages/utils/src/internals/ps-tree.ts file. This works in much the same manner; however, instead of WMIC, it uses PowerShell to collect the same data.
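
As a rough sketch of that approach (the exact command and parsing in ps-tree.ts may differ), the same process table can be collected via the standard Get-CimInstance cmdlet:

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Lists every process with its parent PID and working set size, as CSV.
const command = 'Get-CimInstance Win32_Process'
    + ' | Select-Object ProcessId,ParentProcessId,WorkingSetSize'
    + ' | ConvertTo-Csv -NoTypeInformation';

const { stdout } = await execFileAsync('powershell.exe', ['-NoProfile', '-Command', command]);

// Skip the CSV header and parse the "pid","ppid","workingSet" records.
const processes = stdout.trim().split(/\r?\n/).slice(1).map((line) => {
    const [pid, ppid, memory] = line.replaceAll('"', '').split(',');
    return { pid: Number(pid), ppid: Number(ppid), memoryBytes: Number(memory) };
});

console.log(processes);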