Disk I/O is the single most common cause of "the database is slow" tickets that turn out not to be the database at all. Three numbers describe every storage device (IOPS, throughput, and latency), and the relationship between them is non-linear in ways that break intuition. This guide walks through the Linux tools that measure each, the per-device counters in /proc, and the patterns that distinguish a saturated SSD from a misbehaving controller.
The three metrics that matter
- IOPS: operations per second. Limited by device queue depth and seek behaviour. Spinning disks: 80-200 IOPS. SATA SSD: 10k-100k. NVMe: 100k-1M+.
- Throughput: MB/s. Limited by interface bandwidth (SATA: 600 MB/s, PCIe Gen4 x4: 8 GB/s).
- Latency: time per operation. The metric users feel. Healthy: under 1 ms for SSD; under 10 ms for HDD; database servers want consistent low p99.
A workload doing 1k IOPS at 4 KB averages about 4 MB/s: almost no throughput, but several times the random-access load a spinning disk can sustain.
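The arithmetic linking the three metrics can be sketched in a few lines of shell. This is an illustration rather than a measurement tool: io_math is a hypothetical helper name, 1 KB is taken as 1000 bytes to match the round numbers above, and the latency figure comes from Little's law (outstanding requests = IOPS x average latency), which only holds for a steady workload.

```shell
#!/bin/bash
# io_math IOPS BLOCK_BYTES QUEUE_DEPTH  (hypothetical helper, for illustration)
# Prints throughput (IOPS x request size) and the average latency implied by
# Little's law: outstanding requests = IOPS x latency.
io_math() {
    awk -v iops="$1" -v bs="$2" -v qd="$3" 'BEGIN {
        printf "throughput=%.1f MB/s avg_latency=%.3f ms\n",
            iops * bs / 1000000, qd / iops * 1000
    }'
}

io_math 1000 4000 1      # the 1k IOPS x 4 KB example: 4.0 MB/s, 1 ms per op
io_math 100000 4000 32   # 100k IOPS at queue depth 32: 400 MB/s, 0.32 ms per op
```

The second call shows why a deep queue can deliver high IOPS while each individual request still waits: at a fixed IOPS ceiling, latency grows in proportion to the number of requests in flight.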
iostat: the workhorse
iostat -xmz 5 6 # extended stats, MB/s, skip idle devices; 6 reports, 5 s apart
iostat -xmz 5 6 sda nvme0n1 # specific devices
iostat -xt 1 # add timestamps
iostat -xmd 5 # device only, no CPU section
Key columns:
- r/s, w/s: read and write IOPS.
- rMB/s, wMB/s: throughput.
- rareq-sz, wareq-sz: average request size.
- aqu-sz: average queue depth. > 1 means requests are queueing.
- r_await, w_await: average wait time per op (ms). The latency users feel.
- %util: percent of time the device had at least one outstanding request. Misleading on multi-queue NVMe; trust latency instead.
iotop: which process is to blame
sudo iotop -oPa # only active processes, accumulated, P=processes only
sudo iotop -obtqqq --iter=10 # batch mode, 10 iterations, no headers
sudo pidstat -d 5 # per-process I/O
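iotop gets per-process I/O counters from the kernel's task accounting; the same figures appear in /proc/<pid>/io, which pidstat -d reads. The -oPa flags reduce noise to "processes that actually did I/O, with running totals." On a noisy server, run it in batch mode, redirect the output to a file, and review it later.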
Reading /proc/diskstats directly
For scripting and exporters:
awk '$3 !~ /^(loop|ram)/' /proc/diskstats # skip loop and ram devices
column -t /proc/diskstats | head
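Each line starts with the major number, minor number, and device name, followed by at least 11 statistics (newer kernels append discard and flush counters): reads completed, reads merged, sectors read, time spent reading (ms), writes completed, writes merged, sectors written, time spent writing (ms), I/Os in progress, time spent doing I/O (ms), and weighted I/O time (ms). All but "I/Os in progress" are cumulative counters, so sample twice and divide the difference by the interval to compute rates. node_exporter and most monitoring agents do exactly this.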
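The sample-twice-and-divide approach can be sketched as a small shell function. This is a minimal illustration, not how any particular exporter is implemented: diskstats_rates and the /tmp paths are hypothetical names, and it assumes the standard layout in which field 3 is the device name and sectors are 512 bytes.

```shell
#!/bin/bash
# diskstats_rates BEFORE_FILE AFTER_FILE INTERVAL_SECONDS  (hypothetical helper)
# Computes per-device rates from two saved copies of /proc/diskstats.
diskstats_rates() {
    awk -v dt="$3" '
        # First file: remember the counters for each device name ($3).
        NR == FNR {
            r[$3] = $4; rs[$3] = $6; rt[$3] = $7
            w[$3] = $8; ws[$3] = $10; wt[$3] = $11
            next
        }
        # Second file: turn counter deltas into rates and average waits.
        ($3 in r) && $3 !~ /^(loop|ram)/ {
            dr = $4 - r[$3]; dw = $8 - w[$3]
            printf "%s r/s=%.1f w/s=%.1f rMB/s=%.2f wMB/s=%.2f r_await=%.2f w_await=%.2f\n",
                $3, dr / dt, dw / dt,
                ($6 - rs[$3]) * 512 / 1048576 / dt,
                ($10 - ws[$3]) * 512 / 1048576 / dt,
                (dr ? ($7 - rt[$3]) / dr : 0),
                (dw ? ($11 - wt[$3]) / dw : 0)
        }' "$1" "$2"
}

# Example use on a live host:
#   cat /proc/diskstats > /tmp/ds.before; sleep 5; cat /proc/diskstats > /tmp/ds.after
#   diskstats_rates /tmp/ds.before /tmp/ds.after 5
```

The await figures are the time-spent deltas divided by the completed-operation deltas, which is how an average wait per operation falls out of the raw counters.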
Latency distribution with bcc/bpftrace
iostat reports averages; latency outliers cause pain. Use eBPF tools:
sudo apt install bpfcc-tools # Debian/Ubuntu; tools are installed with a -bpfcc suffix
sudo biolatency-bpfcc 5 6 # latency histogram every 5 s, 6 times
sudo biotop-bpfcc 5 # top processes by I/O
sudo biosnoop-bpfcc # per-I/O trace
biolatency output is a power-of-two histogram showing how many operations completed in each latency bucket (0-1, 2-3, 4-7, and so on, in microseconds by default; pass -m for milliseconds). A bimodal distribution (most operations around 0.1 ms, with a long tail near 100 ms) usually means a misbehaving controller or filesystem flush stalls.
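To put a number on the tail, the histogram can be summarised with a short awk filter. This is a sketch that assumes rows in the "low -> high : count |bar|" layout that bcc's biolatency prints; tail_fraction is a hypothetical helper name, and the threshold is in whatever unit the histogram uses.

```shell
#!/bin/bash
# tail_fraction THRESHOLD  (hypothetical helper)
# Reads a biolatency-style histogram on stdin and reports the share of I/Os
# in buckets whose lower bound is at or above THRESHOLD.
# Assumes rows shaped like "  128 -> 255  : 42  |****  |".
tail_fraction() {
    awk -v min="$1" '
        $2 == "->" && $4 == ":" {
            total += $5
            if ($1 + 0 >= min) slow += $5
        }
        END {
            if (total > 0)
                printf "%.1f%% of %d I/Os took >= %d\n", 100 * slow / total, total, min
        }'
}

# Example use (the binary is biolatency-bpfcc on Debian/Ubuntu):
#   sudo biolatency-bpfcc 5 1 | tail_fraction 65536   # share slower than ~65 ms
```

A healthy device should report a fraction close to zero here; a persistent one or two percent at tens of milliseconds is the long tail that an iostat average hides.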
Synthetic benchmarking with fio
Before you put a database on a new disk, characterise it:
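# Random 4 KB reads against the raw device (read-only, non-destructive).
# libaio makes --iodepth effective; --direct=1 bypasses the page cache.
sudo fio --name=randread --filename=/dev/nvme0n1 --rw=randread \
    --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting
# Sequential 1 MB writes to a 1 GB test file in the current directory.
sudo fio --name=seqwrite --filename=test.fio --rw=write --bs=1M \
    --ioengine=libaio --direct=1 --iodepth=4 --numjobs=1 --size=1G \
    --runtime=30 --time_based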
Run different patterns: random 4 KB read (matches database OLTP), random 4 KB write (write-heavy DB), sequential 1 MB write (backup). Compare to the vendor spec; large gaps mean a tuning problem (e.g. RAID write-back disabled, BBU expired).
Distinguishing read vs write saturation
Different cures for different bottlenecks:
- Read saturated: your working set exceeds the page cache. Add RAM, or move the dataset to faster storage.
- Write saturated: fsync latency spikes. Check write-back cache, use a separate WAL/journal device, batch commits.
- Both saturated: the device is genuinely undersized; rightsize the storage tier.
The 20-line monitoring script
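#!/bin/bash
# Warn when average read or write latency exceeds THRESH_AWAIT (ms).
THRESH_AWAIT=10
# -y omits the since-boot report, so the single report covers a real 5 s interval.
iostat -xmdy 5 1 | awk -v t="$THRESH_AWAIT" '
    # Header row: map column names to positions, because the order differs
    # between sysstat versions.
    $1 ~ /^Device/ { for (i = 1; i <= NF; i++) col[$i] = i; next }
    ("r_await" in col) && NF > 10 && $1 !~ /^(loop|ram|sr)/ {
        rAwait = $(col["r_await"]); wAwait = $(col["w_await"])
        if (rAwait + 0 > t || wAwait + 0 > t)
            printf "WARN %-10s r=%s w=%s rMB=%s wMB=%s r_await=%s w_await=%s util=%s\n",
                $1, $(col["r/s"]), $(col["w/s"]), $(col["rMB/s"]), $(col["wMB/s"]),
                rAwait, wAwait, $(col["%util"])
    }'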
Filesystem-level effects
The same disk can show very different latency depending on filesystem:
- Mount with noatime on read-heavy filesystems to eliminate metadata writes per read.
- Use data=writeback on ext4 only when you can lose a few seconds of data on crash.
- Avoid the sync mount option in production; it serialises every write through fsync.
- Tune the I/O scheduler: none for NVMe (let hardware queue), mq-deadline for SATA SSD, bfq for desktops.
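Before tuning the scheduler it helps to see what each device is using now. The kernel exposes this in /sys/block/<dev>/queue/scheduler, with the active entry in square brackets. The sketch below lists it per device; active_scheduler is a hypothetical helper name.

```shell
#!/bin/bash
# active_scheduler  (hypothetical helper)
# Extracts the bracketed, active entry from a sysfs scheduler line such as
# "[mq-deadline] kyber bfq none" read on stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# List the active scheduler for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_scheduler < "$f")"
done

# To change it until the next reboot (device name is an example):
#   echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
```

A change made through sysfs does not survive a reboot, so a setting that proves useful still needs to be made persistent through whatever mechanism the distribution provides.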
Common pitfalls
- Trusting %util on multi-queue NVMe; it reaches 100% as soon as one request is always in flight, even though the device can service many more in parallel.
- Benchmarking with the OS page cache enabled and concluding the disk is fast; use --direct=1 in fio.
- Reading only the first iostat report; it shows averages since boot, not current rates. Always discard the first iteration, or pass -y to omit it.
- Forgetting that LVM and dm-crypt add their own block devices; iostat shows latency at every layer.
Disk I/O monitoring is the most quantitative observability you have on a Linux host, and the most often misread. Keep iostat in muscle memory, baseline with fio, alert on latency rather than utilisation, and use bpftrace when an average hides a long tail.