System uptime monitoring is fundamental to maintaining service reliability and meeting SLA commitments. Beyond the simple uptime command, Linux provides rich data about boot history, crash events, OOM kills, and system stability that helps you identify reliability issues before they impact users. This guide covers comprehensive uptime and availability monitoring techniques.
Basic Uptime Information
The uptime command provides a quick overview:
uptime
# 14:23:15 up 45 days, 3:42, 2 users, load average: 0.15, 0.10, 0.08
# Machine-readable from /proc
cat /proc/uptime
# 3891720.45 7654321.89 (uptime in seconds, then idle time in seconds summed across all CPUs)
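The raw seconds value from /proc/uptime is easier to read once it is converted to days, hours, and minutes. The following is a minimal sketch of that conversion; format_uptime is a hypothetical helper name, not part of any standard tool.

```shell
# format_uptime SECONDS: print whole days, hours, and minutes.
# Hypothetical helper; accepts the first field of /proc/uptime.
format_uptime() {
    secs=${1%.*}                      # drop the fractional part: 3891720.45 -> 3891720
    days=$((secs / 86400))            # 86400 seconds per day
    hours=$((secs % 86400 / 3600))    # leftover hours within the last day
    mins=$((secs % 3600 / 60))        # leftover minutes within the last hour
    printf '%dd %dh %dm\n' "$days" "$hours" "$mins"
}
```

On a live system it could be called as format_uptime "$(cut -d' ' -f1 /proc/uptime)".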
For SLA tracking, you need more than current uptime: you need historical reboot data and crash analysis.
Reboot History with last
The last command shows reboot and shutdown history:
# Reboot history
last reboot
# Shutdown history
last -x shutdown
# With full timestamps
last reboot -F
Frequent reboots may indicate kernel panics, hardware issues, or out-of-memory conditions on systems configured to panic and restart on OOM (the vm.panic_on_oom sysctl).
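To judge whether reboots are frequent, it helps to count them per month. The sketch below assumes the usual column layout of last -F reboot (month name in field 6, year in field 9), which may differ between util-linux versions; reboots_per_month is a hypothetical helper name.

```shell
# reboots_per_month: read `last -F reboot` output on stdin and print
# "YYYY-MM count" lines in chronological order.
# Assumed layout: reboot system boot <kernel> <weekday> <month> <day> <time> <year> ...
reboots_per_month() {
    awk '
        BEGIN {
            # Map month names to numbers so the output sorts chronologically.
            n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
            for (i = 1; i <= n; i++) month[names[i]] = i
        }
        # Skip the trailing "wtmp begins ..." line and anything unexpected.
        $1 == "reboot" && ($6 in month) {
            count[sprintf("%s-%02d", $9, month[$6])]++
        }
        END { for (key in count) print key, count[key] }
    ' | sort
}
```

It could be fed with last -F reboot | reboots_per_month. Note that it counts every boot record, so planned maintenance reboots have to be subtracted by hand.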
Boot Analysis with systemd-analyze
Systemd provides detailed boot analysis:
# Boot time summary
systemd-analyze
# List all boots with journalctl
journalctl --list-boots
# Check for emergency-level kernel messages such as panics (current boot; add -b -1 for the previous boot)
journalctl -k -p 0 --no-pager
# Find OOM kill events
journalctl -g "Out of memory" --no-pager
Calculating Availability Percentage
SLA availability is typically expressed as a percentage over a 30-day period:
- 99.9% (three nines) = max 43.2 minutes downtime/month
- 99.95% = max 21.6 minutes downtime/month
- 99.99% (four nines) = max 4.3 minutes downtime/month
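These budgets follow from simple arithmetic: a 30-day window has 30 x 24 x 60 = 43,200 minutes, and the allowed downtime is the fraction of that window not covered by the SLA. A minimal sketch of both directions of the calculation, with hypothetical function names:

```shell
# Length of the SLA window in minutes: 30 days * 24 hours * 60 minutes.
WINDOW_MINUTES=43200

# allowed_downtime SLA_PERCENT: downtime budget in minutes for the window.
allowed_downtime() {
    awk -v sla="$1" -v window="$WINDOW_MINUTES" \
        'BEGIN { printf "%.2f\n", (100 - sla) / 100 * window }'
}

# availability DOWNTIME_MINUTES: achieved availability as a percentage.
availability() {
    awk -v down="$1" -v window="$WINDOW_MINUTES" \
        'BEGIN { printf "%.3f\n", (window - down) / window * 100 }'
}
```

For example, allowed_downtime 99.9 prints 43.20, and availability 90 (a 90-minute outage in the month) prints 99.792, which misses a 99.9% target.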
Automate availability tracking with our tool:
pip install dargslan-uptime-report
dargslan-uptime report # Full availability report
dargslan-uptime reboots # Reboot history
dargslan-uptime crashes # Crash event analysis
dargslan-uptime load # Current load average
Detecting OOM Kills
Out-of-Memory kills are a common cause of service disruption:
# Find OOM events
dmesg | grep -i "out of memory"
journalctl -g "oom-kill" --no-pager
# Check which process was killed
dmesg | grep -i "killed process"
If OOM kills are frequent, either add RAM, adjust the OOM score of critical processes (via /proc/<pid>/oom_score_adj or systemd's OOMScoreAdjust= setting), or fix the memory leaks in your application.
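To see which processes are killed most often, the kernel log lines can be summarized by process name. This is a rough sketch that assumes the common message form "Killed process <pid> (<name>)"; the exact wording varies between kernel versions, and oom_victims is a hypothetical helper name.

```shell
# oom_victims: read kernel log lines on stdin and print "count name"
# for each killed process, most frequent first.
oom_victims() {
    # Extract the process name from "... Killed process 1234 (name) ..." lines.
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
        awk '{ seen[$0]++ } END { for (name in seen) print seen[name], name }' |
        sort -rn
}
```

It could be run as dmesg | oom_victims, or against journalctl -k --no-pager to cover what the journal has retained. A process that appears repeatedly is a candidate for a leak investigation or a memory limit.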
Proactive Stability Monitoring
- Monitor load average vs CPU count (load > 2x CPUs indicates overload)
- Track reboot frequency: more than 2 unplanned reboots/month needs investigation
- Set up alerting for kernel panics and OOM kills
- Monitor swap usage trends (increasing swap = approaching OOM)
- Use watchdog timers for automatic crash recovery
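The load-versus-CPU rule from the list above can be turned into a small check. This is a sketch only: the 2x threshold comes from the list, while the intermediate "busy" tier and the load_status name are assumptions added for illustration.

```shell
# load_status LOAD CPUS: classify a load average against the CPU count.
#   load > 2 * cpus  -> overloaded (the threshold named in the list above)
#   load > cpus      -> busy       (assumed intermediate tier)
#   otherwise        -> ok
load_status() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        load += 0; cpus += 0          # force numeric comparison
        if (load > 2 * cpus)  print "overloaded"
        else if (load > cpus) print "busy"
        else                  print "ok"
    }'
}
```

On a live system it could be called as load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)", which uses the 1-minute load average. The 5- or 15-minute fields are less sensitive to short spikes.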
Uptime Monitoring Best Practices
- Track uptime history, not just current uptime
- Calculate and report monthly availability percentages
- Investigate all unplanned reboots within 24 hours
- Configure OOM score adjustments for critical services
- Set up external uptime monitoring (the server cannot report its own downtime)
- Document all planned maintenance windows
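Because the server cannot report its own downtime, an external monitor's record of successful checks is the natural input for downtime figures. The sketch below assumes a hypothetical log of epoch timestamps, one per successful check, in ascending order. Any gap longer than the check interval is counted as downtime, so outages shorter than one interval go unnoticed. downtime_seconds is an illustrative name, not an existing tool.

```shell
# downtime_seconds INTERVAL: read ascending epoch timestamps of successful
# external checks on stdin (one per line) and print the estimated downtime.
# A gap of G seconds between two checks contributes G - INTERVAL seconds
# whenever G exceeds INTERVAL.
downtime_seconds() {
    awk -v interval="$1" '
        NR > 1 && $1 - prev > interval { down += $1 - prev - interval }
        { prev = $1 }
        END { print down + 0 }        # print 0 when no gap was found
    '
}
```

With checks every 60 seconds, a log containing 0, 60, 120, 480, 540 yields 300 seconds of estimated downtime. Dividing the result by 60 gives the minutes to compare against the monthly budget.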
Download our free System Uptime & Availability Cheat Sheet for essential monitoring commands. For deeper Linux administration knowledge, explore our Linux & DevOps eBooks.