System Information V2
This is an experimental feature. While we welcome testers, keep in mind that it is currently not recommended to use this in production.
The API is subject to change, and we might introduce breaking changes in the future.
If you do use it, feel free to open issues on our GitHub repository and we'll take a look.
Starting with the newest crawlee beta, we have introduced a new crawler option that enables an improved metric collection system. This new system should collect CPU and memory metrics more accurately in containerized environments by checking for cgroup-enforced limits.
How to enable the experiment
This example shows how to enable the experiment in the CheerioCrawler, but you can apply this to any crawler type.
```typescript
import { CheerioCrawler, Configuration } from 'crawlee';

Configuration.set('systemInfoV2', true);

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);
    },
});

await crawler.run(['https://crawlee.dev']);
```
Other changes
This section is only useful if you're a tinkerer and want to see what's going on under the hood.
The existing solution checked the bare-metal metrics for how much CPU and memory was being used and how much headroom was available. This is an intuitive approach, but it unfortunately doesn't account for an external limit on the amount of resources a process can consume. This is often the case in containerized environments, where each container has a quota for its CPU and memory usage.
This experiment attempts to address this issue by introducing a new isContainerized() utility function and changing the way resources are collected when a container is detected. The isContainerized() function is very similar to the existing isDocker() function; for now, the two work side by side. If this experiment is successful, isDocker() may eventually be deprecated in favour of isContainerized().
Cgroup detection
On Linux, to detect if cgroup is available, we check whether the directory /sys/fs/cgroup exists. If it does, a version of cgroup is installed. Next, we determine which version is installed by checking for a directory at /sys/fs/cgroup/memory/. If it exists, cgroup V1 is installed; if it is missing, cgroup V2 is assumed.
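The detection steps above can be sketched as follows. This is a minimal illustration, not Crawlee's actual implementation: the function name is ours, and the injectable `exists` check is only there to make the logic easy to exercise.

```typescript
import { existsSync } from 'node:fs';

type CgroupVersion = 'v1' | 'v2' | null;

// Mirror the checks described above: /sys/fs/cgroup must exist, and the
// presence of the /sys/fs/cgroup/memory/ controller directory is what
// distinguishes cgroup V1 from the unified V2 hierarchy.
function detectCgroupVersion(
    exists: (path: string) => boolean = existsSync,
): CgroupVersion {
    if (!exists('/sys/fs/cgroup')) return null; // no cgroup support at all
    if (exists('/sys/fs/cgroup/memory/')) return 'v1'; // per-controller directories
    return 'v2'; // unified hierarchy
}
```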
CPU metric collection
The existing solution worked by checking the fraction of idle CPU ticks against the total number of CPU ticks since the last profile. If 100,000 ticks elapse and 5,000 of them were idle, the CPU is at 95% utilisation.
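As a concrete illustration of that calculation (the function name here is ours, not Crawlee's):

```typescript
// Fraction of busy CPU time between two profiles, given tick counters
// accumulated since the previous profile.
function cpuUtilisation(totalTicks: number, idleTicks: number): number {
    if (totalTicks === 0) return 0; // avoid division by zero on the first profile
    return 1 - idleTicks / totalTicks;
}

// 100,000 total ticks with 5,000 idle gives roughly 0.95, i.e. 95% utilisation.
```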
In this experiment, the method of CPU load calculation depends on the result of isContainerized() or, if set, the CRAWLEE_CONTAINERIZED environment variable. If isContainerized() returns true, the new cgroup-aware metric collection is used instead of the "bare metal" numbers.
This works by inspecting the /sys/fs/cgroup/cpuacct/cpuacct.usage, /sys/fs/cgroup/cpu/cpu.cfs_quota_us, and /sys/fs/cgroup/cpu/cpu.cfs_period_us files for cgroup V1, and the /sys/fs/cgroup/cpu.stat and /sys/fs/cgroup/cpu.max files for cgroup V2.
The actual CPU usage figure is calculated in the same manner as the "bare metal" figure, by comparing the total number of ticks elapsed to the number of idle ticks between profiles, but using the figures from the cgroup files. If no cgroup quota is enforced, the "bare metal" numbers are used.
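For cgroup V2, the quota check amounts to parsing cpu.max, whose documented format is either `max <period>` (no quota enforced) or `<quota> <period>` in microseconds. A hedged sketch of that parsing (the helper name is ours, not from Crawlee's source):

```typescript
// Parse the contents of /sys/fs/cgroup/cpu.max.
// Returns the number of CPUs the cgroup may use, or null if no quota is set.
function parseCpuMaxLimit(cpuMax: string): number | null {
    const [quota, period] = cpuMax.trim().split(/\s+/);
    if (quota === 'max') return null; // "max 100000" → no quota enforced
    return Number(quota) / Number(period); // e.g. "200000 100000" → 2 CPUs
}
```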
Memory metric collection
The existing solution was already cgroup-aware; however, an improvement has been made to memory metric collection when running on Windows.
The existing solution used an external package, apify/ps-tree, to find the amount of memory crawlee and any child processes were using. On Windows, this package used the deprecated WMIC command-line utility to determine memory usage.
In this experiment, apify/ps-tree has been removed and replaced by the packages/utils/src/internals/ps-tree.ts file. This works in much the same manner; however, instead of WMIC, it uses PowerShell to collect the same data.
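The general shape of that approach can be sketched as follows. Everything here is our assumption for illustration, not the actual contents of ps-tree.ts: we imagine a PowerShell query producing CSV rows of `ProcessId,ParentProcessId,WorkingSetSize`, which are then parsed and summed over the process tree.

```typescript
interface ProcessInfo { pid: number; ppid: number; workingSet: number }

// Parse CSV rows like "1234,567,1048576" (pid, parent pid, memory in bytes),
// as a PowerShell process query might emit them.
function parseProcessCsv(output: string): ProcessInfo[] {
    return output.trim().split('\n')
        .map((line) => line.split(',').map(Number))
        .filter(([pid]) => !Number.isNaN(pid))
        .map(([pid, ppid, workingSet]) => ({ pid, ppid, workingSet }));
}

// Sum the memory of a root process and all of its descendants.
function treeMemory(procs: ProcessInfo[], rootPid: number): number {
    const own = procs.find((p) => p.pid === rootPid)?.workingSet ?? 0;
    return procs
        .filter((p) => p.ppid === rootPid)
        .reduce((sum, child) => sum + treeMemory(procs, child.pid), own);
}
```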