Linux Disk I/O Monitoring: IOPS, Throughput, and Latency Analysis

Disk I/O is the single most common cause of "the database is slow" tickets that turn out not to be the database at all. Three numbers describe every storage device — IOPS, throughput, and latency — and the relationship between them is non-linear in ways that break intuition. This guide walks through the Linux tools that measure each, the per-device counters in /proc, and the patterns that distinguish a saturated SSD from a misbehaving controller.

The three metrics that matter

IOPS — operations per second. Limited by device queue depth and seek behaviour. Spinning disks: 80–200 IOPS. SATA SSD: 10k–100k. NVMe: 100k–1M+.
Throughput — MB/s. Limited by interface bandwidth (SATA: 600 MB/s, PCIe Gen4 x4: 8 GB/s).
Latency — time per operation. The metric users feel. Healthy: under 1 ms for SSD; under 10 ms for HDD; database servers want consistent low p99.

A workload doing 1k IOPS at 4 KB moves only 4 MB/s: almost no throughput, but far more random-access load than a single spinning disk can serve.
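The arithmetic behind that example is just IOPS multiplied by request size. A minimal sketch, assuming decimal units (1 MB = 1000 KB); the function name throughput_mbs is made up for illustration:

```shell
# throughput_mbs IOPS BLOCK_KB
# Prints the throughput in MB/s implied by an operation rate and a
# request size (decimal units: 1 MB = 1000 KB).
throughput_mbs() {
  awk -v iops="$1" -v kb="$2" 'BEGIN { printf "%.1f\n", iops * kb / 1000 }'
}

throughput_mbs 1000 4       # 4 KB random I/O at 1k IOPS      -> 4.0
throughput_mbs 150 1024     # 1 MB sequential I/O at 150 IOPS -> 153.6
```

The second call shows the other side of the relationship: a device that manages only 150 operations per second can still move over 150 MB/s when each request is large.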

iostat: the workhorse

iostat -xmz 5 6                # extended stats in MB/s, skip idle devices; 6 reports, 5 sec apart
iostat -xmz 5 6 sda nvme0n1    # specific devices
iostat -xt 1                   # add timestamps
iostat -xmd 5                  # device only, no CPU section

Key columns:

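r/s, w/s: read and write operations per second (IOPS).
rMB/s, wMB/s: read and write throughput.
rareq-sz, wareq-sz: average request size in KB.
aqu-sz: average number of requests queued or in flight. Sustained values above 1 on a single-queue device mean requests are waiting; NVMe devices handle much deeper queues without trouble.
r_await, w_await: average time per operation in ms, including time spent queued. This is the latency users feel.
%util: percent of time the device had at least one outstanding request. Misleading on multi-queue NVMe; trust latency instead.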
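The rate, latency, and queue columns are tied together by Little's law: average queue depth is roughly the operation rate multiplied by the average time per operation. A rough sketch of that relationship, with a hypothetical helper name and made-up sample values:

```shell
# queue_depth R_PER_S R_AWAIT_MS W_PER_S W_AWAIT_MS
# Estimates the average number of requests in flight from the read and
# write rates and their average wait times (Little's law). The await
# values are in ms, hence the division by 1000.
queue_depth() {
  awk -v r="$1" -v ra="$2" -v w="$3" -v wa="$4" \
    'BEGIN { printf "%.2f\n", (r * ra + w * wa) / 1000 }'
}

queue_depth 2000 0.5 500 2.0   # 2000 reads/s at 0.5 ms, 500 writes/s at 2 ms -> 2.00
```

If the value computed from your own r/s, w/s, and await columns is far from the reported aqu-sz, the sample window was probably too short to be representative.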

iotop: which process is to blame

sudo iotop -oPa                # only active processes, accumulated, P=processes only
sudo iotop -obtqqq --iter=10   # batch mode, 10 iterations, no headers
sudo pidstat -d 5              # per-process I/O

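iotop gets its per-task I/O accounting from the kernel's taskstats interface, which is why it needs root; pidstat -d reads the per-process counters in /proc/<pid>/io. The -oPa flags reduce noise to "processes that actually did I/O, with running totals." On a noisy server, run it in batch mode, redirect the output to a file, and review it later.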
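The per-process counters in /proc/<pid>/io can also be read directly when neither tool is installed. A minimal sketch, assuming the usual "name: value" layout of that file; proc_io_bytes is a hypothetical helper name:

```shell
# proc_io_bytes FILE
# Prints the cumulative bytes a process has read from and written to
# storage, taken from a file in /proc/<pid>/io format.
proc_io_bytes() {
  awk '$1 == "read_bytes:"  { r = $2 }
       $1 == "write_bytes:" { w = $2 }
       END { printf "read_bytes=%s write_bytes=%s\n", r, w }' "$1"
}

# Example: proc_io_bytes /proc/$$/io    (the current shell's own counters)
```

These are running totals since the process started, so a rate needs two readings and a subtraction. Reading another user's file requires root.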

Reading /proc/diskstats directly

For scripting and exporters:

awk '$3 !~ /^(loop|ram)/' /proc/diskstats
column -t /proc/diskstats | head

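Each line starts with the major number, minor number, and device name, followed by at least 11 statistics fields: reads completed, reads merged, sectors read, time spent reading (ms), writes completed, writes merged, sectors written, time spent writing (ms), I/Os currently in progress, time spent doing I/O (ms), and weighted time spent doing I/O (ms). Kernels from 4.18 append discard counters, and kernels from 5.5 append flush counters. Sectors are counted in 512-byte units regardless of the device's sector size. All fields except I/Os in progress are cumulative counters: sample twice, subtract, and divide by the interval to get rates. node_exporter and most monitoring agents do exactly this.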
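The sample-twice approach can be sketched in a few lines of awk. This is an illustration rather than what any particular agent does: the function name diskstats_rates is hypothetical, it takes two saved snapshots plus the interval in seconds, and it reports decimal MB/s.

```shell
# diskstats_rates BEFORE AFTER SECONDS
# Compares two snapshots of /proc/diskstats and prints per-device rates.
# Columns used: 3 = name, 4 = reads completed, 6 = sectors read,
# 8 = writes completed, 10 = sectors written (sectors are 512 bytes).
diskstats_rates() {
  awk -v dt="$3" '
    NR == FNR { r[$3] = $4; rs[$3] = $6; w[$3] = $8; ws[$3] = $10; next }
    ($3 in r) && $3 !~ /^(loop|ram)/ {
      printf "%s r/s=%.1f w/s=%.1f rMB/s=%.2f wMB/s=%.2f\n", $3,
             ($4 - r[$3]) / dt, ($8 - w[$3]) / dt,
             ($6 - rs[$3]) * 512 / 1e6 / dt, ($10 - ws[$3]) * 512 / 1e6 / dt
    }' "$1" "$2"
}

# Example:
#   cat /proc/diskstats > /tmp/before; sleep 5
#   cat /proc/diskstats > /tmp/after
#   diskstats_rates /tmp/before /tmp/after 5
```

Working from saved snapshots keeps the calculation separate from the sampling, which makes it easy to check against known numbers.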

Latency distribution with bcc/bpftrace

iostat reports averages; latency outliers cause pain. Use eBPF tools:

sudo apt install bpfcc-tools          # Debian/Ubuntu; tools get a -bpfcc suffix
sudo biolatency-bpfcc 5 6             # latency histogram every 5 sec, 6 times
sudo biotop-bpfcc 5                   # top processes by I/O
sudo biosnoop-bpfcc                   # per-I/O trace

biolatency prints a power-of-two histogram: the number of operations that completed in 0-1, 2-3, 4-7, 8-15 units, and so on. The unit is microseconds by default; pass -m for milliseconds. A bimodal distribution (most operations around 0.1 ms, plus a long tail near 100 ms) usually means a misbehaving controller or filesystem flush stalls.
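To see how an average hides that kind of tail, it helps to compute the mean next to the median and p99 for the same samples. A small sketch, assuming one latency value in ms per line on stdin; latency_summary is a hypothetical helper and uses the nearest-rank method for percentiles:

```shell
# latency_summary
# Reads one latency value (ms) per line on stdin and prints the mean,
# the median (p50), and the 99th percentile (nearest-rank method).
latency_summary() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      p50 = int((NR * 50 + 99) / 100)   # ceil(0.50 * NR)
      p99 = int((NR * 99 + 99) / 100)   # ceil(0.99 * NR)
      printf "mean=%.2f p50=%.2f p99=%.2f\n", sum / NR, v[p50], v[p99]
    }'
}

# 98 fast operations and 2 stalls: the mean looks fine, the p99 does not.
{ for i in $(seq 98); do echo 0.1; done; echo 100; echo 100; } | latency_summary
# -> mean=2.10 p50=0.10 p99=100.00
```

One way to feed it real data is to take the last column of a biosnoop trace, assuming that column is the latency in ms as in current bcc releases, and skip the header line first.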

Synthetic benchmarking with fio

Before you put a database on a new disk, characterise it:

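# Random 4 KB reads from the raw device; --readonly guards against accidental writes
sudo fio --name=randread --filename=/dev/nvme0n1 --rw=randread --readonly \
         --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --numjobs=4 \
         --runtime=60 --time_based --group_reporting

# Sequential 1 MB writes to a scratch file, bypassing the page cache
sudo fio --name=seqwrite --filename=test.fio --rw=write --bs=1M \
         --ioengine=libaio --direct=1 --iodepth=4 --numjobs=1 --size=1G \
         --runtime=30 --time_based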

Run different patterns: random 4 KB read (typical of OLTP databases), random 4 KB write (write-heavy databases), sequential 1 MB write (backups). Compare the results to the vendor spec; a large gap usually points to a tuning problem (for example RAID write-back cache disabled or an expired BBU).
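The three patterns differ only in access mode and block size, so they are easy to generate from one template. The sketch below only prints the fio command lines so they can be reviewed before anything touches a disk. The helper name fio_cmd, the scratch path, and the queue depth of 32 are assumptions for illustration.

```shell
# fio_cmd PATTERN TARGET
# Prints (does not run) a fio command line for one of three patterns.
fio_cmd() {
  local rw bs
  case "$1" in
    randread)  rw=randread;  bs=4k ;;   # OLTP-style reads
    randwrite) rw=randwrite; bs=4k ;;   # write-heavy database
    seqwrite)  rw=write;     bs=1M ;;   # backup-style streaming write
    *) echo "unknown pattern: $1" >&2; return 1 ;;
  esac
  echo "fio --name=$1 --filename=$2 --rw=$rw --bs=$bs --size=1G" \
       "--ioengine=libaio --direct=1 --iodepth=32" \
       "--runtime=60 --time_based --group_reporting"
}

for p in randread randwrite seqwrite; do
  fio_cmd "$p" /mnt/scratch/fio.test    # hypothetical scratch file, not a raw device
done
```

Pointing the write patterns at a scratch file rather than a block device is deliberate: a write test against a raw device destroys whatever is on it.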

Distinguishing read vs write saturation

Different cures for different bottlenecks:

Read saturated — your working set exceeds the page cache. Add RAM, or move the dataset to faster storage.
Write saturated — fsync latency spikes. Check write-back cache, use a separate WAL/journal device, batch commits.
Both saturated — the device is genuinely undersized; rightsize the storage tier.
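As a rough first pass, the three cases can be told apart from the r_await and w_await columns alone. This sketch treats "await above a threshold" as a proxy for saturation, which is a simplifying assumption; the helper name classify_saturation and the 10 ms threshold are illustrative.

```shell
# classify_saturation R_AWAIT_MS W_AWAIT_MS THRESHOLD_MS
# Prints which side of the device looks saturated, judged only by
# whether the average wait time exceeds the threshold.
classify_saturation() {
  awk -v r="$1" -v w="$2" -v t="$3" 'BEGIN {
    if (r > t && w > t)  print "both"
    else if (r > t)      print "read"
    else if (w > t)      print "write"
    else                 print "ok"
  }'
}

classify_saturation 25 1 10     # slow reads, fast writes -> read
classify_saturation 0.4 40 10   # fast reads, slow writes -> write
```

A sensible threshold depends on the device class: 10 ms is generous for an SSD and tight for a spinning disk.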

The 20-line monitoring script

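#!/bin/bash
# Warn when a device's average read or write latency exceeds the threshold.
THRESH_AWAIT=10   # ms
# Two 5-second reports; the first covers the time since boot, so only the second is used.
LC_ALL=C iostat -xmd 5 2 | awk -v t="$THRESH_AWAIT" '
  # Header line: map column names to positions (the layout differs between sysstat versions).
  $1 ~ /^Device/ { report++; for (i = 1; i <= NF; i++) col[$i] = i; next }
  report == 2 && NF > 10 && $1 !~ /^(loop|ram|sr)/ {
    rAwait = $(col["r_await"]); wAwait = $(col["w_await"])
    if (rAwait+0 > t || wAwait+0 > t)
      printf "WARN %-10s r=%s w=%s rMB=%s wMB=%s r_await=%s w_await=%s util=%s\n",
             $1, $(col["r/s"]), $(col["w/s"]), $(col["rMB/s"]), $(col["wMB/s"]),
             rAwait, wAwait, $(col["%util"])
  }'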

Filesystem-level effects

The same disk can show very different latency depending on filesystem:

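Mount with noatime on read-heavy filesystems to eliminate access-time metadata writes on reads (the default relatime already reduces them; noatime removes them entirely).
Use data=writeback on ext4 only when you can tolerate recently written files containing stale data after a crash; metadata is still journaled, but data blocks are no longer written out before it.
Avoid the sync mount option in production; it makes every write synchronous, as if each one were followed by an fsync.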
Tune the I/O scheduler: none for NVMe (let hardware queue), mq-deadline for SATA SSD, bfq for desktops.
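The kernel exposes the scheduler choice in /sys/block/<device>/queue/scheduler, listing every available scheduler with the active one in square brackets. A minimal sketch for pulling out the active entry; active_scheduler is a hypothetical helper name:

```shell
# active_scheduler
# Reads a scheduler line on stdin (the format of
# /sys/block/<device>/queue/scheduler) and prints the bracketed entry.
active_scheduler() {
  sed -n 's/.*\[\(.*\)\].*/\1/p'
}

echo "[mq-deadline] kyber bfq none" | active_scheduler   # -> mq-deadline

# Example: active_scheduler < /sys/block/sda/queue/scheduler
```

Writing a scheduler name to the same file as root switches it immediately, but the change does not survive a reboot; making it permanent is usually done with a udev rule.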

Common pitfalls

Trusting %util on multi-queue NVMe: the counter only records whether at least one request was outstanding, so a device that serves many requests in parallel can report 100% util while it still has headroom.
Benchmarking with the OS page cache enabled and concluding the disk is fast — use --direct=1 in fio.
Reading only the first iostat report: it shows averages since boot, not current activity. Discard the first iteration, or pass -y to skip it.
Forgetting that LVM and dm-crypt add their own block devices; iostat shows latency at every layer.

Disk I/O monitoring is the most quantitative observability you have on a Linux host — and the most often misread. Keep iostat in muscle memory, baseline with fio, alert on latency rather than utilisation, and use bpftrace when an average hides a long tail.

Dargslan Editorial Team (Dargslan)
