RAID (Redundant Array of Independent Disks) provides data redundancy and performance improvements for Linux servers. However, RAID arrays require active monitoring: degraded arrays, failed disks, and stalled rebuilds must be caught early, before a second failure causes data loss. In this comprehensive guide, you will learn how to monitor RAID health using mdadm, detect problems early, and automate array health checks.
Understanding Linux Software RAID with mdadm
Linux software RAID is managed through mdadm (short for "multiple device admin"), which creates and manages RAID arrays using standard block devices. Unlike hardware RAID controllers, software RAID gives you full visibility into array status through /proc/mdstat and mdadm commands.
The most common RAID levels on Linux servers are RAID 1 (mirroring), RAID 5 (striping with parity), RAID 6 (double parity), and RAID 10 (mirrored stripes). Each level has different failure tolerance and performance characteristics that affect your monitoring strategy.
Checking RAID Status with /proc/mdstat
The quickest way to check RAID health is reading /proc/mdstat:
cat /proc/mdstat
# Output shows all arrays, their state, and member disks
# [UU] means all disks up, [U_] means one disk down
The bracket notation is crucial: each U represents an active disk, and an underscore (_) represents a missing or failed disk. A healthy RAID 1 shows [UU], while a degraded one shows [U_] or [_U].
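The bracket check is easy to script. Below is a minimal sketch (the check_mdstat function name is my own, not a standard tool) that reads mdstat-formatted text on stdin and prints the name of any array whose status brackets contain an underscore:

```shell
# check_mdstat: read /proc/mdstat-style text on stdin and report arrays
# with a missing or failed disk (function name is illustrative).
check_mdstat() {
  awk '
    /^md[0-9]/ { dev = $1 }                               # remember current array
    /\[[U_]+\]/ { if ($0 ~ /_/) print dev " DEGRADED" }   # underscore = bad disk
  '
}

# Usage on a live system:
#   check_mdstat < /proc/mdstat
```

An empty result means every array's status brackets are all-U, i.e. healthy.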
Detailed Array Information with mdadm --detail
For comprehensive array information, use mdadm --detail:
sudo mdadm --detail /dev/md0
# Shows: RAID level, array size, used devices, state
# Active Devices, Working Devices, Failed Devices, Spare Devices
Key fields to monitor include the State (clean, degraded, rebuilding), Active/Failed/Spare device counts, and the UUID for array identification across reboots.
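For scripting, the State field can be pulled straight out of the --detail output. A small sketch (array_state is an illustrative name):

```shell
# array_state: extract the "State :" value from `mdadm --detail` output
# read on stdin (function name is illustrative).
array_state() {
  awk -F' : ' '/^ *State :/ { sub(/^[ \t]+/, "", $2); print $2 }'
}

# Usage: state=$(sudo mdadm --detail /dev/md0 | array_state)
# Alert if it reports anything other than "clean" or "active".
```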
Monitoring Rebuild Progress
When you add a replacement disk to a degraded array, md starts rebuilding it automatically. Monitor progress through /proc/mdstat:
watch cat /proc/mdstat
# Shows: recovery = XX.X% (YY/ZZ) finish=Xmin speed=XK/sec
Rebuild time depends on array size and I/O load. The kernel throttles resync between two tunables (values in KiB/s): speed_limit_min is the rate md tries to sustain even under normal I/O, and speed_limit_max is the ceiling. To limit the impact on a busy production server, lower the maximum; to finish faster on a quiet system, raise the minimum:
echo 50000 > /proc/sys/dev/raid/speed_limit_max   # cap rebuild at ~50 MB/s
echo 100000 > /proc/sys/dev/raid/speed_limit_min  # or: push an idle box harder
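These echo commands do not survive a reboot. To make the limits persistent, put them in a sysctl drop-in (the filename below is an arbitrary choice; values are examples in KiB/s):

```shell
# /etc/sysctl.d/90-raid-rebuild.conf  (filename is arbitrary)
# Rebuild throttling, in KiB/s; adjust to your hardware and workload.
dev.raid.speed_limit_min = 1000
dev.raid.speed_limit_max = 50000
```

Apply without rebooting via `sudo sysctl --system`.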
Automating RAID Monitoring
Set up automatic monitoring with mdadm daemon mode and email alerts:
# /etc/mdadm/mdadm.conf
MAILADDR admin@example.com
# Start monitoring daemon
sudo mdadm --monitor --scan --daemonise
# Verify mail delivery with a one-off test alert per array
sudo mdadm --monitor --scan --oneshot --test
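Beyond email, mdadm.conf's PROGRAM directive runs a script of your choosing on every event; mdadm invokes it with the event name, the array, and (for some events) the component device as arguments. A minimal handler sketch (the path and the format_event helper are my own inventions):

```shell
# Handler referenced from mdadm.conf as, e.g.:
#   PROGRAM /usr/local/sbin/raid-event.sh      (path is an assumption)
# mdadm calls the script as: raid-event.sh <event> <md-device> [component]

format_event() {
  # Build a one-line syslog message; the component device ($3) is optional.
  printf 'mdadm event: %s on %s%s' "$1" "$2" "${3:+ (device $3)}"
}

# In the real script you would forward this to syslog:
#   format_event "$@" | logger -t raid-event
```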
For more sophisticated monitoring, our dargslan-raid-monitor tool provides comprehensive RAID health checks with JSON output for integration with monitoring stacks:
pip install dargslan-raid-monitor
dargslan-raid report # Full health report
dargslan-raid audit # Issues only
dargslan-raid json # JSON for automation
Handling Failed Disks
When a disk fails, the procedure is: mark it as failed, remove it from the array, physically replace it, partition the new disk to match, and add it back:
sudo mdadm /dev/md0 --fail /dev/sdb1
sudo mdadm /dev/md0 --remove /dev/sdb1
# Replace the physical disk, then copy the partition layout from the
# surviving disk so /dev/sdb1 exists again (sgdisk works for GPT too)
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb
sudo mdadm /dev/md0 --add /dev/sdb1
Always verify the rebuild completes successfully and check SMART data on the replacement disk.
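The SMART check is also scriptable. This sketch (smart_ok is an illustrative name) reads `smartctl -H` output on stdin and succeeds only when the drive's overall self-assessment passed:

```shell
# smart_ok: read `smartctl -H` output on stdin; exit 0 only if the drive's
# overall self-assessment is PASSED (function name is illustrative).
smart_ok() {
  grep -q 'self-assessment test result: PASSED'
}

# Usage: sudo smartctl -H /dev/sdb | smart_ok || echo "replace /dev/sdb again"
```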
Best Practices for RAID Monitoring
- Check /proc/mdstat in your daily monitoring routine
- Configure email alerts for degraded arrays
- Monitor disk SMART data for early failure prediction
- Keep spare disks ready for quick replacement
- Test rebuild procedures regularly in non-production environments
- Document your RAID layout and disk serial numbers
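The last point, recording disk serial numbers, is easy to script. The disk_serial helper below (an illustrative name) pulls the serial out of `smartctl -i` output:

```shell
# disk_serial: read `smartctl -i` output on stdin and print the serial number.
disk_serial() {
  awk -F: '/^Serial Number/ { sub(/^[ \t]+/, "", $2); print $2 }'
}

# Record layout and serials for your runbook:
#   sudo mdadm --detail --scan
#   for d in /dev/sd?; do
#     echo "$d: $(sudo smartctl -i "$d" | disk_serial)"
#   done
```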
Download our free RAID Array Monitoring Cheat Sheet for a quick reference of all essential mdadm commands. For deeper Linux storage administration knowledge, check out our Linux & DevOps eBooks.