Linux System Uptime Monitoring: Availability Tra…

System uptime monitoring is fundamental to maintaining service reliability and meeting SLA commitments. Beyond the simple uptime command, Linux provides rich data about boot history, crash events, OOM kills, and system stability that helps you identify reliability issues before they impact users. This guide covers comprehensive uptime and availability monitoring techniques.

Basic Uptime Information

The uptime command provides a quick overview:

uptime
# 14:23:15 up 45 days, 3:42, 2 users, load average: 0.15, 0.10, 0.08

# Machine-readable from /proc
cat /proc/uptime
# 3891720.45 7654321.89 (uptime_seconds idle_seconds; idle time is summed across all CPUs)
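The raw seconds value in /proc/uptime is easier to use in reports once it is broken into days, hours, and minutes. The following is a minimal sketch; the function name format_uptime is hypothetical and not part of any standard tool.

```shell
# format_uptime SECONDS: print a duration as "<days>d <hours>h <minutes>m".
# Fractional seconds (as found in /proc/uptime) are truncated.
format_uptime() {
    awk -v s="$1" 'BEGIN {
        s = int(s)
        printf "%dd %dh %dm\n", s / 86400, (s % 86400) / 3600, (s % 3600) / 60
    }'
}

# The first field of /proc/uptime is the number of seconds since boot.
if [ -r /proc/uptime ]; then
    format_uptime "$(cut -d' ' -f1 /proc/uptime)"
fi
```

With the sample value shown above, format_uptime 3891720.45 prints 45d 1h 2m.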

For SLA tracking, current uptime alone is not enough: you also need historical reboot data and crash analysis.

Reboot History with last

The last command shows reboot and shutdown history:

# Reboot history
last reboot

# Shutdown history
last -x shutdown

# With full timestamps (options placed before the name for portability)
last -F reboot

Frequent reboots may indicate kernel panics, hardware issues, or out-of-memory conditions severe enough to trigger an automatic restart.
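To turn reboot history into a number you can alert on, count the boot records that last prints. This is a rough sketch: count_reboots is a hypothetical helper, and it counts every boot, planned or not, so maintenance reboots still have to be subtracted by hand.

```shell
# count_reboots: read `last reboot` output on stdin and print the number of
# boot records (lines that start with the pseudo-user "reboot").
count_reboots() {
    # grep -c exits non-zero when nothing matches; the count is still printed.
    grep -c '^reboot ' || true
}

# Possible usage on a live system, left as a comment so the sketch has no
# side effects. It assumes the installed last supports -s/--since, as the
# util-linux version does:
#   last -s -30days reboot | count_reboots
```

The trailing "wtmp begins" line and blank lines in the last output are ignored because they do not start with "reboot".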

Boot Analysis with systemd-analyze

Systemd provides detailed boot analysis:

# Boot time summary
systemd-analyze

# List all boots with journalctl
journalctl --list-boots

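# Check the previous boot for emergency-level kernel messages such as panics
# (-k alone covers only the current boot; older boots require a persistent journal)
journalctl -k -b -1 -p 0 --no-pager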

# Find OOM kill events
journalctl -g "Out of memory" --no-pager

Calculating Availability Percentage

SLA availability is typically expressed as a percentage over a 30-day period:

99.9% (three nines) = max 43.2 minutes downtime/month
99.95% = max 21.6 minutes downtime/month
99.99% (four nines) = max 4.32 minutes downtime/month
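These budgets follow from the 43,200 minutes in a 30-day period: allowed downtime is (100 - target) / 100 of that total. The sketch below does the arithmetic in both directions; the function names are hypothetical and the 30-day default mirrors the period used above.

```shell
# allowed_downtime_minutes TARGET_PCT [DAYS]
# Print the downtime budget in minutes for an availability target.
allowed_downtime_minutes() {
    awk -v target="$1" -v days="${2:-30}" 'BEGIN {
        printf "%.2f\n", (100 - target) / 100 * days * 1440
    }'
}

# availability_pct DOWNTIME_MINUTES [DAYS]
# Print the availability percentage achieved for a measured downtime.
availability_pct() {
    awk -v down="$1" -v days="${2:-30}" 'BEGIN {
        printf "%.3f\n", (1 - down / (days * 1440)) * 100
    }'
}

allowed_downtime_minutes 99.9   # prints 43.20
availability_pct 12             # prints 99.972
```

For example, 12 minutes of downtime in a month meets a 99.95% target (budget 21.6 minutes) but misses 99.99% (budget 4.32 minutes).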

Automate availability tracking with our tool:

pip install dargslan-uptime-report
dargslan-uptime report    # Full availability report
dargslan-uptime reboots   # Reboot history
dargslan-uptime crashes   # Crash event analysis
dargslan-uptime load      # Current load average

Detecting OOM Kills

Out-of-Memory kills are a common cause of service disruption:

# Find OOM events
dmesg | grep -i "out of memory"
journalctl -g "oom-kill" --no-pager

# Check which process was killed
dmesg | grep -i "killed process"

If OOM kills are frequent, add RAM, fix memory leaks in the application, or lower the OOM score of critical processes (for example via /proc/<pid>/oom_score_adj or systemd's OOMScoreAdjust= setting) so the kernel picks other victims first.
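Knowing which processes are killed most often helps decide between these options. The sketch below tallies victims from kernel log lines; oom_victims is a hypothetical helper, and it assumes the "Killed process <pid> (<name>)" wording used by recent kernels, which may differ on older ones.

```shell
# oom_victims: read kernel log lines on stdin and print "<count> <name>" for
# each OOM-killed process name, most frequently killed first.
oom_victims() {
    # Keep only the process name from lines reporting a kill.
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
        awk '{ seen[$0]++ } END { for (name in seen) print seen[name], name }' |
        sort -rn
}

# Possible usage on a live system (left as comments to avoid side effects):
#   dmesg | oom_victims
#   journalctl -k --no-pager | oom_victims
```

A process that dominates this list is the first candidate for a memory-leak investigation or a memory limit.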

Proactive Stability Monitoring

Monitor load average against CPU count (sustained load above 2x the number of CPUs indicates overload)
Track reboot frequency: more than 2 unplanned reboots per month needs investigation
Set up alerting for kernel panics and OOM kills
Monitor swap usage trends (steadily growing swap usage often precedes OOM kills)
Use watchdog timers for automatic crash recovery
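The load-versus-CPU rule from the list above is easy to script. This is a minimal sketch: load_status is a hypothetical helper, the 2x threshold comes from the rule above, and using the 5-minute average rather than the 1-minute value is an assumption made here to damp short spikes.

```shell
# load_status LOAD CPUS: print "overload" if LOAD exceeds twice CPUS, else "ok".
load_status() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (load > 2 * cpus) print "overload"; else print "ok"
    }'
}

# The second field of /proc/loadavg is the 5-minute load average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    load_status "$(cut -d' ' -f2 /proc/loadavg)" "$(nproc)"
fi
```

Run from cron or a monitoring agent, the output can feed an alert whenever it reads "overload" for several consecutive checks.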

Uptime Monitoring Best Practices

Track uptime history, not just current uptime
Calculate and report monthly availability percentages
Investigate all unplanned reboots within 24 hours
Configure OOM score adjustments for critical services
Set up external uptime monitoring (the server cannot report its own downtime)
Document all planned maintenance windows


Linux System Uptime Monitoring: Availability Tracking and Crash Analysis


Dargslan Editorial Team (Dargslan)
