Cgroup v2 is now the default on every modern Linux distribution and the substrate under every container, every systemd service, and every memory-pressure decision the kernel makes. Auditing what your services actually get, and what they actually use, is no longer an optional skill. This guide explains the unified cgroup hierarchy, the per-resource controllers that matter, and the commands that turn cgroup data into actionable monitoring.
Confirm cgroup v2 is active
stat -fc %T /sys/fs/cgroup # cgroup2fs = v2; tmpfs = v1
mount | grep cgroup
ls /sys/fs/cgroup/ # v2: flat, with cgroup.* files at root
If you see cgroup2fs, you are on the unified hierarchy. Hybrid mode (some controllers v1, some v2) is possible but increasingly rare; stick to pure v2 for new installs.
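For scripted audits, the same check can be wrapped in a small function. This is only a sketch: the name classify_cgroup_fs is made up for illustration, and treating tmpfs as "v1 or hybrid" is a simplification of what a real host may report.

```shell
# Hypothetical helper: map the filesystem type reported by
# `stat -fc %T /sys/fs/cgroup` to a cgroup mode label.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Usage on a live host:
#   classify_cgroup_fs "$(stat -fc %T /sys/fs/cgroup)"
```

Keeping the classification separate from the stat call makes the function easy to exercise on a machine that has no cgroup filesystem mounted.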
The systemd view
systemd organises every running process under a slice, scope, or service. Read the tree:
systemd-cgls
systemd-cgls /system.slice/postgresql.service
systemd-cgtop # top-style live view
systemctl status postgresql # includes cgroup path
systemd-cgls shows every process grouped by its cgroup. systemd-cgtop shows live CPU%, memory, and IO per slice, which is invaluable when "the box is busy" and you need to know which workload is to blame.
Per-service resource limits
Apply limits via systemd unit files (or drop-ins):
# /etc/systemd/system/myapp.service.d/limits.conf
[Service]
CPUQuota=50%
CPUWeight=200
MemoryHigh=2G
MemoryMax=3G
MemorySwapMax=0
IOWeight=200
TasksMax=4096
LimitNOFILE=65535
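MemoryHigh applies soft pressure: above this threshold the kernel throttles the cgroup and reclaims its memory aggressively. MemoryMax is the hard cap; if usage cannot be reclaimed below it, the OOM killer is invoked inside the cgroup. Setting both is the recommended pattern for production services.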
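To confirm that both limits actually landed in the kernel, one option is to read memory.high and memory.max from the service's cgroup directory. The sketch below assumes the standard cgroup v2 convention that an unset limit reads as the literal string max; the function name check_mem_limits is hypothetical.

```shell
# Hypothetical helper: report whether a cgroup directory has both a soft
# (memory.high) and a hard (memory.max) limit, with high below max.
check_mem_limits() {
    dir=$1
    high=$(cat "$dir/memory.high" 2>/dev/null) || return 2
    max=$(cat "$dir/memory.max" 2>/dev/null) || return 2
    if [ "$high" = "max" ] || [ "$max" = "max" ]; then
        # "max" means the limit is unset
        echo "incomplete: memory.high=$high memory.max=$max"
    elif [ "$high" -lt "$max" ]; then
        echo "ok: memory.high=$high memory.max=$max"
    else
        echo "suspect: memory.high=$high is not below memory.max=$max"
    fi
}

# Usage on a live host (path is an example):
#   check_mem_limits /sys/fs/cgroup/system.slice/myapp.service
```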
Reading cgroup files directly
Each cgroup is a directory under /sys/fs/cgroup/. The interface files describe configuration and current usage:
cat /sys/fs/cgroup/system.slice/postgresql.service/memory.current
cat /sys/fs/cgroup/system.slice/postgresql.service/memory.peak
cat /sys/fs/cgroup/system.slice/postgresql.service/memory.events
cat /sys/fs/cgroup/system.slice/postgresql.service/cpu.stat
cat /sys/fs/cgroup/system.slice/postgresql.service/io.stat
memory.events shows OOM-kill counts and "high"-pressure events; cpu.stat shows usage and throttling. These are the same data systemd-cgtop reads.
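As an illustration of turning cpu.stat into a single number, the sketch below computes the share of scheduler periods in which the cgroup was throttled. It assumes the nr_periods and nr_throttled fields are present, which is only the case when the cpu controller is enabled for that cgroup; the function name throttle_pct is made up.

```shell
# Hypothetical helper: percentage of CPU periods in which the cgroup was
# throttled, computed from the nr_periods and nr_throttled fields of cpu.stat.
throttle_pct() {
    awk '$1 == "nr_periods"   { p = $2 }
         $1 == "nr_throttled" { t = $2 }
         END {
             if (p > 0) printf "%.1f\n", 100 * t / p
             else print "n/a"   # no quota configured or fields missing
         }' "$1"
}

# Usage on a live host (path is an example):
#   throttle_pct /sys/fs/cgroup/system.slice/postgresql.service/cpu.stat
```

A persistently high value suggests that CPUQuota is set too low for the workload.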
PSI: Pressure Stall Information
The killer feature of cgroup v2: per-cgroup pressure metrics that quantify how often tasks are stalled waiting for a resource:
cat /proc/pressure/cpu # whole-system CPU pressure
cat /proc/pressure/memory
cat /proc/pressure/io
cat /sys/fs/cgroup/system.slice/postgresql.service/cpu.pressure
cat /sys/fs/cgroup/system.slice/postgresql.service/memory.pressure
Each file shows some (at least one task stalled) and full (all non-idle tasks stalled at once) percentages averaged over 10 s, 60 s, and 300 s windows (avg10, avg60, avg300), plus a cumulative total in microseconds. Alert on some avg10 > 30% or full avg10 > 10% as an early warning of saturation; this is far more actionable than generic CPU% or "memory used" metrics.
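The alert rule above can be sketched as a small function that reads one pressure file and compares the avg10 values against thresholds. The function name psi_check and its argument order are assumptions for illustration; the defaults of 30 and 10 are simply the thresholds suggested in this guide.

```shell
# Hypothetical helper: print ALERT or ok for a PSI file.
#   $1 = pressure file, $2 = "some" threshold (default 30),
#   $3 = "full" threshold (default 10). Values are avg10 percentages.
psi_check() {
    awk -v s_thr="${2:-30}" -v f_thr="${3:-10}" '
        # Each line looks like: some avg10=0.00 avg60=0.00 avg300=0.00 total=0
        { split($2, kv, "="); v[$1] = kv[2] + 0 }
        END {
            status = (v["some"] > s_thr || v["full"] > f_thr) ? "ALERT" : "ok"
            printf "%s some=%.2f full=%.2f\n", status, v["some"], v["full"]
        }' "$1"
}

# Usage on a live host:
#   psi_check /proc/pressure/memory
#   psi_check /sys/fs/cgroup/system.slice/postgresql.service/memory.pressure 20 5
```

Doing the comparison in awk keeps the fractional part of the percentage, which plain shell integer tests would drop.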
Container runtime cgroups
With the systemd cgroup driver, Docker and Podman create per-container cgroups under system.slice/docker-<id>.scope (Docker) or user.slice/.../libpod-... (Podman); other cgroup drivers use different paths. Inspect:
docker stats # CPU/memory/IO per running container
sudo systemd-cgtop /system.slice/docker.service
sudo cat /sys/fs/cgroup/system.slice/docker-$(docker inspect --format '{{.Id}}' my-app).scope/memory.current
For Kubernetes pods on a cgroup v2 node, look under kubepods.slice; the path includes pod UID and container ID.
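When the exact path is hard to guess, one approach that works for any runtime is to ask the kernel which cgroup a process belongs to. On cgroup v2, /proc/<pid>/cgroup contains a line of the form 0::/path. The sketch below turns that line into a directory under /sys/fs/cgroup; the function name cgroup_path_from_proc is made up, and the result is only meaningful when read from the host rather than from inside a cgroup namespace.

```shell
# Hypothetical helper: given a /proc/<pid>/cgroup file, print the matching
# cgroup v2 directory. Only the unified-hierarchy line (prefix "0::") is used.
cgroup_path_from_proc() {
    sed -n 's|^0::|/sys/fs/cgroup|p' "$1"
}

# Usage on a live host (container name is an example):
#   pid=$(docker inspect --format '{{.State.Pid}}' my-app)
#   cat "$(cgroup_path_from_proc "/proc/$pid/cgroup")/memory.current"
```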
Audit script: find resource-unconstrained services
#!/bin/bash
# Audit sketch: unbounded services, memory pressure, and OOM kills.
echo "== Services with neither MemoryMax nor CPUQuota =="
for svc in $(systemctl list-units --type=service --no-legend --plain | awk '{print $1}'); do
    mm=$(systemctl show -p MemoryMax --value "$svc")
    cq=$(systemctl show -p CPUQuotaPerSecUSec --value "$svc")
    tm=$(systemctl show -p TasksMax --value "$svc")
    # systemd reports "infinity" when a limit is unset
    if [ "$mm" = "infinity" ] && [ "$cq" = "infinity" ]; then
        echo "  unbounded: $svc (TasksMax=$tm)"
    fi
done | head -30
echo
echo "== High-pressure cgroups (memory full avg10 > 10%) =="
find /sys/fs/cgroup -name memory.pressure 2>/dev/null | while IFS= read -r f; do
    # second field of the "full" line is avg10=<percent>; compare as a float
    full=$(awk '/^full/ {split($2, a, "="); if (a[2] > 10) print a[2]}' "$f" 2>/dev/null)
    if [ -n "$full" ]; then echo "  $f: full=$full"; fi
done
echo
echo "== OOM kills per cgroup (since cgroup creation) =="
find /sys/fs/cgroup -name memory.events 2>/dev/null | while IFS= read -r f; do
    killed=$(awk '$1 == "oom_kill" {print $2}' "$f" 2>/dev/null)
    if [ "${killed:-0}" -gt 0 ]; then echo "  $f: oom_kill=$killed"; fi
done
IO controller
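The IO controller is per-device. It supports proportional weights (io.weight, 1 to 10000, default 100) and absolute per-device limits (io.max):
cat /sys/fs/cgroup/system.slice/backup.service/io.max # current limits per device (io.max does not exist in the root cgroup)
echo '8:0 wbps=10485760' > /sys/fs/cgroup/system.slice/backup.service/io.max # run as root
# limits backup.service to 10 MiB/s (10485760 bytes/s) write on device major:minor 8:0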
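For a busy database server, give the database a high IO weight and backup jobs a low one; bandwidth is then allocated proportionally when both compete for the same device. Weights only take effect when a weight-aware mechanism is active on that device, for example the BFQ scheduler.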
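Since io.max takes raw bytes per second keyed by major:minor, a small helper can reduce arithmetic mistakes when writing limits by hand. This is a sketch only: the function name io_write_limit_line is hypothetical, and it handles just the wbps key.

```shell
# Hypothetical helper: build an io.max line that caps write bandwidth.
#   $1 = device as major:minor, $2 = limit in MiB/s
io_write_limit_line() {
    echo "$1 wbps=$(( $2 * 1024 * 1024 ))"
}

# Usage on a live host, as root (device and unit are examples):
#   lsblk -dno MAJ:MIN /dev/sda        # find the major:minor pair
#   io_write_limit_line 8:0 10 > /sys/fs/cgroup/system.slice/backup.service/io.max
```

systemd manages these files for its own units, so a value written by hand may be overwritten on the next reload; for a persistent limit the unit-file setting IOWriteBandwidthMax= is usually the better place.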
Common pitfalls
- Setting MemoryMax=infinity by accident; cgroup v2 treats this as no limit, undoing your hardening.
- Tuning MemoryHigh without setting MemoryMax; under sustained pressure the cgroup just slows down without ever reaching a kill point.
- Mixing v1 and v2 mounts (hybrid mode): some tools see one view, others the other; migrate fully to v2.
- Believing memory.current is RSS; it includes page cache attributable to the cgroup.
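The last pitfall can be checked directly: memory.stat in the same cgroup directory splits usage into anon (anonymous memory) and file (page cache), among other fields. The sketch below prints both in MiB; the function name mem_breakdown is made up for illustration.

```shell
# Hypothetical helper: show how much of a cgroup's memory is anonymous
# memory versus page cache, using the "anon" and "file" lines of memory.stat.
mem_breakdown() {
    awk '$1 == "anon" { a = $2 }
         $1 == "file" { f = $2 }
         END { printf "anon=%dMiB file=%dMiB\n", a / 1048576, f / 1048576 }' "$1"
}

# Usage on a live host (path is an example):
#   mem_breakdown /sys/fs/cgroup/system.slice/postgresql.service/memory.stat
```

A large file share means memory.current is dominated by cache that the kernel can reclaim under pressure, which reads very differently from the same figure made up of anonymous memory.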
Cgroup v2 audits are a quarterly task that pays off the next time a runaway service starves the database. Use systemd unit limits, watch PSI rather than averages, and treat any unbounded service on a multi-tenant host as a follow-up ticket.