Complete Linux System Health Check Guide 2024

Master Linux system health monitoring with essential commands, automated scripts, and optimization techniques to prevent downtime and boost performance.

Complete Linux System Health Check Guide: Monitor, Diagnose, and Optimize Your Server Performance in 2024

Table of Contents

1. [Introduction](#introduction)
2. [Why System Health Monitoring Matters](#why-system-health-monitoring-matters)
3. [Essential Linux System Health Check Commands](#essential-linux-system-health-check-commands)
4. [CPU Performance Monitoring](#cpu-performance-monitoring)
5. [Memory Usage Analysis](#memory-usage-analysis)
6. [Disk Space and I/O Monitoring](#disk-space-and-io-monitoring)
7. [Network Performance Checks](#network-performance-checks)
8. [System Load and Process Management](#system-load-and-process-management)
9. [Log File Analysis](#log-file-analysis)
10. [Automated Health Check Scripts](#automated-health-check-scripts)
11. [Advanced Monitoring Tools](#advanced-monitoring-tools)
12. [Performance Optimization Tips](#performance-optimization-tips)
13. [Troubleshooting Common Issues](#troubleshooting-common-issues)
14. [Best Practices](#best-practices)
15. [Conclusion](#conclusion)

Introduction

Linux system health monitoring is a critical aspect of server administration that ensures optimal performance, prevents downtime, and maintains system reliability. Whether you're managing a single server or an entire infrastructure, regular health checks help identify potential issues before they become critical problems.

This comprehensive guide covers everything you need to know about Linux system health monitoring, from basic commands to advanced automation techniques. You'll learn how to monitor CPU usage, memory consumption, disk space, network performance, and system logs effectively.

Why System Health Monitoring Matters

Preventing System Failures

Regular health checks help identify warning signs before they escalate into system failures. By monitoring key metrics like CPU usage, memory consumption, and disk space, administrators can take proactive measures to prevent outages.

Optimizing Performance

System health monitoring reveals performance bottlenecks and resource constraints that may be slowing down your applications. This information is crucial for making informed decisions about hardware upgrades and system optimization.

Security Monitoring

Health checks can reveal unusual system behavior that might indicate security breaches, malware infections, or unauthorized access attempts. Regular monitoring helps maintain system security and integrity.

Cost Management

By understanding resource utilization patterns, organizations can make better decisions about infrastructure scaling, potentially saving significant costs on unnecessary hardware or cloud resources.

Essential Linux System Health Check Commands

Basic System Information Commands

#### uname - System Information

```bash
# Display system information
uname -a

# Show kernel version
uname -r

# Display architecture
uname -m
```

#### uptime - System Uptime and Load

```bash
# Show system uptime and load averages
uptime

# Example output:
# 14:30:45 up 10 days, 5:42, 3 users, load average: 0.15, 0.18, 0.12
```

#### whoami and who - User Information

```bash
# Show current user
whoami

# Show logged-in users
who

# Show detailed user information
w
```

System Resource Overview Commands

#### htop - Interactive Process Viewer

```bash
# Install htop if not available
sudo apt-get install htop   # Ubuntu/Debian
sudo yum install htop       # CentOS/RHEL

# Run htop
htop
```

#### top - Process Monitor

```bash
# Display running processes
top

# Sort by CPU usage
top -o %CPU

# Sort by memory usage
top -o %MEM
```

CPU Performance Monitoring

Understanding CPU Metrics

CPU monitoring involves tracking several key metrics:

- CPU Usage Percentage: Shows how much of the CPU capacity is being used
- Load Average: Indicates system load over 1-, 5-, and 15-minute intervals
- Context Switches: Number of times the CPU switches between processes
- Interrupts: Hardware and software interrupt frequency
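To make the "CPU usage percentage" metric concrete, here is a minimal sketch of how it can be derived from two readings of the `cpu` summary line in `/proc/stat`. The function name `cpu_busy_percent` is hypothetical, and the sketch assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal) and treats iowait as idle time.

```bash
#!/usr/bin/env bash
# Sketch: CPU busy percentage between two "cpu" lines taken from /proc/stat.
# Assumed field order after the label: user nice system idle iowait irq softirq steal
cpu_busy_percent() {
    local first="$1" second="$2"
    awk -v a="$first" -v b="$second" 'BEGIN {
        split(a, x, " "); split(b, y, " ")
        # Idle time is idle + iowait (array indexes 5 and 6; index 1 is the "cpu" label)
        idle = (y[5] + y[6]) - (x[5] + x[6])
        total = 0
        for (i = 2; i <= 9; i++) total += y[i] - x[i]
        if (total <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * (total - idle) / total
    }'
}

# Example: sample the live counters one second apart (Linux only)
if [ -r /proc/stat ]; then
    s1=$(grep '^cpu ' /proc/stat)
    sleep 1
    s2=$(grep '^cpu ' /proc/stat)
    echo "CPU busy: $(cpu_busy_percent "$s1" "$s2")%"
fi
```

Because the counters are cumulative since boot, only the difference between two samples says anything about current usage; a single reading gives the average since boot.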

CPU Monitoring Commands

#### vmstat - Virtual Memory Statistics

```bash
# Display CPU, memory, and I/O statistics (1-second interval, 5 samples)
vmstat 1 5

# Example output explanation:
# procs:  r (runnable), b (blocked in uninterruptible sleep)
# memory: swpd, free, buff, cache
# swap:   si (swap in), so (swap out)
# io:     bi (blocks in), bo (blocks out)
# system: in (interrupts), cs (context switches)
# cpu:    us (user), sy (system), id (idle), wa (I/O wait), st (stolen)
```

#### iostat - I/O Statistics

```bash
# Install sysstat package if needed
sudo apt-get install sysstat

# Display CPU statistics
iostat -c 1 5

# Show extended device statistics
iostat -x 1 5
```

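#### sar - System Activity Reporter

```bash
# Monitor CPU usage every second for 10 iterations
sar -u 1 10

# Monitor CPU usage per core (all cores)
sar -P ALL 1 5

# Read the CPU report for a given day (replace XX with the day of the month;
# RHEL-family systems keep these files under /var/log/sa/)
sar -u -f /var/log/sysstat/saXX
```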

CPU Performance Analysis

#### Identifying CPU Bottlenecks

```bash
# Check for high CPU processes
ps aux --sort=-%cpu | head -10

# Monitor the load average in real time
watch -n 1 'cat /proc/loadavg'

# Check the current clock speed of each core
grep MHz /proc/cpuinfo
```

#### CPU Temperature Monitoring

```bash
# Install lm-sensors
sudo apt-get install lm-sensors

# Detect sensors
sudo sensors-detect

# Display temperature readings
sensors
```

Memory Usage Analysis

Understanding Memory Metrics

Linux memory management involves several types of memory:

- Physical RAM: Actual memory modules installed
- Virtual Memory: Combination of RAM and swap space
- Buffer/Cache: Memory used for file system caching
- Swap: Disk space used as virtual memory extension
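Because buffer/cache memory can be reclaimed on demand, "free" RAM alone tends to understate what is really available. The sketch below estimates the share of RAM in use from the `MemTotal` and `MemAvailable` fields of `/proc/meminfo`; the function name `mem_used_percent` is hypothetical, and the sketch assumes a kernel recent enough to report `MemAvailable`.

```bash
#!/usr/bin/env bash
# Sketch: percentage of RAM in use, based on MemAvailable rather than MemFree.
# Reads /proc/meminfo by default; a different file can be passed for testing.
mem_used_percent() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total > 0) printf "%.1f\n", 100 * (total - avail) / total
            else exit 1
        }' "${1:-/proc/meminfo}"
}

# Example: report current usage (Linux only)
if [ -r /proc/meminfo ]; then
    echo "Memory in use: $(mem_used_percent)%"
fi
```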

Memory Monitoring Commands

#### free - Memory Usage Display

```bash
# Display memory usage in human-readable format
free -h

# Show memory usage with buffers and cache in separate columns
free -h -w

# Monitor memory usage continuously
watch -n 1 free -h
```

#### /proc/meminfo - Detailed Memory Information

```bash
# Display detailed memory statistics
cat /proc/meminfo

# Monitor specific memory metrics
grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo
```

#### ps - Process Memory Usage

```bash
# Show processes sorted by memory usage
ps aux --sort=-%mem | head -10

# Display memory usage for a specific process (replace <PID> with the process ID)
ps -p <PID> -o pid,vsz,rss,comm

# Show memory maps for a process
pmap <PID>
```

Memory Performance Analysis

#### Identifying Memory Leaks

```bash
# Log the percentage of used memory once a minute (stop with Ctrl+C)
while true; do
    echo "$(date): $(free | grep Mem | awk '{print $3/$2 * 100.0}')" >> memory_usage.log
    sleep 60
done

# Check for memory-hungry processes
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -20
```
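System-wide figures show that memory is shrinking but not which process is responsible. As a rough sketch, the resident set size (RSS) of one suspect process can be sampled from the `VmRSS` field of its `/proc/<PID>/status` file; a value that climbs steadily under constant load is a hint of a leak, not proof. The function names `rss_kb` and `watch_rss` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: read the resident set size (in kB) from a /proc/<PID>/status file.
# Fails with a non-zero status if the file has no VmRSS line (e.g. kernel threads).
rss_kb() {
    awk '/^VmRSS:/ { print $2; found = 1 } END { exit !found }' "$1"
}

# Print "HH:MM:SS <rss in kB>" for a process COUNT times, INTERVAL seconds apart.
watch_rss() {
    local pid="$1" count="$2" interval="$3" i
    for ((i = 0; i < count; i++)); do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$(rss_kb "/proc/$pid/status")"
        sleep "$interval"
    done
}

# Example: sample the current shell twice, one second apart (Linux only)
if [ -r "/proc/$$/status" ]; then
    watch_rss "$$" 2 1
fi
```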

#### Swap Usage Monitoring

```bash
# Check swap usage
swapon --show

# Monitor swap activity (columns 7 and 8 of vmstat are si and so)
vmstat 1 5 | awk '{print $7, $8}'

# Find processes using swap, sorted by swap size in kB
for file in /proc/*/status; do
    awk '/VmSwap|Name/ {printf "%s %s ", $2, $3} END {print ""}' "$file" 2>/dev/null
done | sort -k 2 -n
```

Disk Space and I/O Monitoring

Disk Space Monitoring

#### df - Filesystem Disk Space Usage

```bash
# Display filesystem usage in human-readable format
df -h

# Show inode usage
df -i

# Display filesystem type
df -T
```
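Reading `df` output by eye does not scale to scheduled checks. The sketch below filters `df -P` output (the POSIX format, which keeps each filesystem on one line) down to the mount points at or above a chosen usage percentage. The function name `filesystems_over` and the threshold of 90 are illustrative assumptions, and mount points containing spaces are not handled.

```bash
#!/usr/bin/env bash
# Sketch: print "mountpoint usage%" for each filesystem at or above a threshold.
# Expects `df -P` output on stdin; the first line (header) is skipped.
filesystems_over() {
    local threshold="$1"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)              # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Example: list filesystems that are at least 90% full
df -P | filesystems_over 90
```

The same filter works for inode exhaustion by feeding it `df -Pi` instead.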

#### du - Directory Space Usage

```bash
# Show directory sizes in current location
du -h --max-depth=1

# Find largest directories
du -h / 2>/dev/null | sort -rh | head -20

# Check specific directory size
du -sh /var/log
```

Disk I/O Monitoring

#### iotop - I/O Usage by Process

```bash
# Install iotop
sudo apt-get install iotop

# Monitor I/O usage
sudo iotop

# Show only processes with I/O activity
sudo iotop -o
```

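#### Disk Performance Analysis

```bash
# Monitor disk I/O statistics
iostat -x 1 5

# Rough sequential write test (writes a 1 GiB file in the current directory)
dd if=/dev/zero of=testfile bs=1G count=1 oflag=dsync

# Rough sequential read test (may be served from the page cache, not the disk)
dd if=testfile of=/dev/null bs=1G count=1

# Remove the test file
rm testfile
```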

Filesystem Health Checks

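#### File System Integrity

```bash
# Check filesystem for errors (run only on an unmounted filesystem)
sudo fsck /dev/sda1

# Check filesystem without making changes (answers "no" to every prompt)
sudo fsck -n /dev/sda1

# Check all filesystems listed in /etc/fstab, skipping the root filesystem
sudo fsck -A -R
```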

#### SMART Monitoring

```bash
# Install smartmontools
sudo apt-get install smartmontools

# Check drive health
sudo smartctl -a /dev/sda

# Run short self-test
sudo smartctl -t short /dev/sda

# Check test results
sudo smartctl -l selftest /dev/sda
```

Network Performance Checks

Network Interface Monitoring

#### ifconfig and ip Commands

```bash
# Display network interfaces (traditional)
ifconfig

# Modern alternative using ip command
ip addr show

# Show network statistics
ip -s link

# Display routing table
ip route show
```

#### Network Traffic Monitoring

```bash
# Install network monitoring tools
sudo apt-get install net-tools iftop nethogs

# Monitor network traffic by interface
sudo iftop -i eth0

# Monitor network usage by process
sudo nethogs eth0
```

Network Connectivity Tests

#### Basic Connectivity Tests

```bash
# Test connectivity to remote host
ping -c 4 google.com

# Test specific port connectivity
telnet google.com 80

# Advanced port testing
nmap -p 80,443 google.com
```

#### Network Performance Testing

```bash
# Install network testing tools
sudo apt-get install speedtest-cli iperf3

# Test internet speed
speedtest-cli

# Network throughput testing (requires an iperf3 server at server_ip)
iperf3 -c server_ip
```

Network Security Monitoring

#### Active Connections

```bash
# Show listening TCP and UDP sockets
netstat -tuln

# Include the owning process (run with sudo to see every process)
sudo netstat -tulnp

# Modern alternative
ss -tuln
```
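A raw socket listing is hard to scan on a busy server. As a small sketch, the output of `ss -tan` (all TCP sockets, numeric addresses) can be reduced to a count per connection state, which makes a pile-up of `TIME-WAIT` or `SYN-RECV` entries stand out. The function name `count_tcp_states` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: summarise TCP sockets by state, most frequent first.
# Expects `ss -tan` output on stdin; the first column of each row is the state.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}

# Example: summarise the sockets on this host, if ss is installed
if command -v ss >/dev/null 2>&1; then
    ss -tan | count_tcp_states
fi
```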

#### Firewall Status

```bash
# Check UFW status (Ubuntu)
sudo ufw status

# Check iptables rules
sudo iptables -L

# Display firewall logs
sudo tail -f /var/log/ufw.log
```

System Load and Process Management

Understanding System Load

System load measures how much work is queued for the system. On Linux, the load average counts processes that are running, waiting for a CPU, or blocked in uninterruptible sleep (usually disk I/O), averaged over 1-, 5-, and 15-minute periods.

#### Load Average Interpretation

```bash
# Display current load
cat /proc/loadavg

# Rule of thumb for load interpretation:
# Load < Number of CPU cores = Good
# Load = Number of CPU cores = Fully utilized
# Load > Number of CPU cores = Overloaded
```
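The rule of thumb above can be sketched as a small comparison between a load figure and the core count. The function name `classify_load` and its three labels are hypothetical; treat the result as a first hint only, since a high load caused by disk waits looks the same as one caused by CPU demand.

```bash
#!/usr/bin/env bash
# Sketch: classify a load average against the number of CPU cores.
classify_load() {
    local load="$1" cores="$2"
    awk -v l="$load" -v c="$cores" 'BEGIN {
        l += 0; c += 0                  # force numeric comparison
        if (l < c)       print "ok"
        else if (l == c) print "fully utilized"
        else             print "overloaded"
    }'
}

# Example: classify the 1-minute load average of this host (Linux only)
if [ -r /proc/loadavg ]; then
    read -r load1 _ < /proc/loadavg
    cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
    echo "1-minute load $load1 on $cores cores: $(classify_load "$load1" "$cores")"
fi
```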

Process Management

#### Process Monitoring

```bash
# List all processes
ps aux

# Show process tree
pstree

# Monitor processes in real-time
top
htop
```

#### Process Control

```bash
# Kill process by PID (replace <PID> with the process ID)
kill <PID>

# Kill process by name
pkill process_name

# Kill all processes by user
pkill -u username
```

Service Management

#### Systemd Service Management

```bash
# List all services
systemctl list-units --type=service

# Check service status
systemctl status service_name

# Start/stop/restart services
sudo systemctl start service_name
sudo systemctl stop service_name
sudo systemctl restart service_name

# Enable/disable services
sudo systemctl enable service_name
sudo systemctl disable service_name
```

Log File Analysis

System Log Locations

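Linux systems store logs in various locations. The paths below are typical of Debian and Ubuntu; RHEL-family systems use /var/log/messages and /var/log/secure in place of the first two:

- /var/log/syslog - General system messages
- /var/log/auth.log - Authentication logs
- /var/log/kern.log - Kernel messages
- /var/log/dmesg - Boot messages
- /var/log/cron.log - Cron job logs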

Log Analysis Commands

#### journalctl - Systemd Journal

```bash
# View all journal entries
journalctl

# Show recent entries
journalctl -n 50

# Follow log in real-time
journalctl -f

# Filter by service
journalctl -u ssh.service

# Filter by time range
journalctl --since "2024-01-01" --until "2024-01-02"
```

#### Traditional Log Analysis

```bash
# View recent system messages
tail -f /var/log/syslog

# Search for errors
grep -i error /var/log/syslog

# Count error occurrences
grep -c "error" /var/log/syslog

# Analyze authentication failures
grep "Failed password" /var/log/auth.log
```
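A raw list of failed logins is more useful once it is grouped by source address. The sketch below counts `Failed password` entries per IP, assuming the usual OpenSSH message layout ("... Failed password for USER from ADDRESS port N ..."); the function name `failed_logins_by_ip` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: count failed SSH password attempts per source address.
# Expects auth log lines on stdin; prints "count address", highest count first.
failed_logins_by_ip() {
    grep 'Failed password' |
        grep -oE 'from [0-9a-fA-F.:]+' |
        awk '{ print $2 }' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Example: show the ten most active sources, if the log is readable
if [ -r /var/log/auth.log ]; then
    failed_logins_by_ip < /var/log/auth.log | head -10
fi
```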

Log Rotation and Management

#### Logrotate Configuration

```bash
# Check logrotate configuration
cat /etc/logrotate.conf

# View specific log rotation rules
ls /etc/logrotate.d/

# Test logrotate configuration (debug mode: shows what would happen, changes nothing)
sudo logrotate -d /etc/logrotate.conf

# Force log rotation
sudo logrotate -f /etc/logrotate.conf
```

Automated Health Check Scripts

Basic Health Check Script

Create a comprehensive health check script:

```bash
#!/bin/bash
# System Health Check Script

echo "=== System Health Check Report ==="
echo "Date: $(date)"
echo "Hostname: $(hostname)"
echo "Uptime: $(uptime)"
echo ""

# CPU usage: 100 minus the idle column (15th field) of the second vmstat sample
echo "=== CPU Usage ==="
vmstat 1 2 | tail -1 | awk '{printf "CPU Usage: %d%%\n", 100 - $15}'
echo ""

# Memory usage
echo "=== Memory Usage ==="
free -h | grep -E "Mem|Swap"
echo ""

# Disk usage
echo "=== Disk Usage ==="
df -h | grep -E "^/dev"
echo ""

# System load
echo "=== System Load ==="
cat /proc/loadavg
echo ""

# Top processes by CPU (header line plus five processes)
echo "=== Top 5 CPU Processes ==="
ps aux --sort=-%cpu | head -6
echo ""

# Top processes by memory
echo "=== Top 5 Memory Processes ==="
ps aux --sort=-%mem | head -6
echo ""

# Network connections (non-listening TCP/UDP sockets, header line skipped)
echo "=== Network Connections ==="
echo "Total connections: $(ss -tun | tail -n +2 | wc -l)"
echo ""

# Recent errors in syslog
echo "=== Recent System Errors ==="
grep -i error /var/log/syslog | tail -5
echo ""

echo "=== Health Check Complete ==="
```

Advanced Monitoring Script

```bash
#!/bin/bash
# Advanced System Health Monitoring Script

# Configuration
LOG_FILE="/var/log/system_health.log"
EMAIL="admin@example.com"
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90

# Functions
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Requires a working `mail` command on the host
send_alert() {
    echo "$1" | mail -s "System Alert: $(hostname)" "$EMAIL"
    log_message "ALERT: $1"
}

check_cpu() {
    # Busy percentage = 100 minus the idle column of the second vmstat sample
    CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
    if [ "$CPU_USAGE" -gt "$CPU_THRESHOLD" ]; then
        send_alert "High CPU usage detected: ${CPU_USAGE}%"
    fi
    log_message "CPU Usage: ${CPU_USAGE}%"
}

check_memory() {
    MEMORY_USAGE=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
    if [ "$MEMORY_USAGE" -gt "$MEMORY_THRESHOLD" ]; then
        send_alert "High memory usage detected: ${MEMORY_USAGE}%"
    fi
    log_message "Memory Usage: ${MEMORY_USAGE}%"
}

check_disk() {
    # -P keeps each filesystem on a single line so the columns stay aligned
    df -P | grep -E "^/dev" | while read -r line; do
        USAGE=$(echo "$line" | awk '{print $5}' | cut -d'%' -f1)
        PARTITION=$(echo "$line" | awk '{print $6}')
        if [ "$USAGE" -gt "$DISK_THRESHOLD" ]; then
            send_alert "High disk usage on $PARTITION: ${USAGE}%"
        fi
        log_message "Disk Usage $PARTITION: ${USAGE}%"
    done
}

check_services() {
    CRITICAL_SERVICES=("ssh" "nginx" "mysql")
    for service in "${CRITICAL_SERVICES[@]}"; do
        if ! systemctl is-active --quiet "$service"; then
            send_alert "Critical service $service is not running"
        else
            log_message "Service $service: Running"
        fi
    done
}

# Main execution
log_message "Starting system health check"
check_cpu
check_memory
check_disk
check_services
log_message "System health check completed"
```

Scheduling Automated Checks

Add the script to crontab for regular execution:

```bash
# Edit crontab
crontab -e

# Add entries for automated health checks

# Run every 5 minutes
*/5 * * * * /path/to/basic_health_check.sh

# Run advanced check every hour
0 * * * * /path/to/advanced_health_check.sh

# Daily comprehensive report at 06:00
0 6 * * * /path/to/daily_report.sh
```

Advanced Monitoring Tools

Nagios - Network Monitoring

#### Installation and Configuration

```bash
# Install Nagios (Ubuntu/Debian)
# Package names depend on the release; newer releases ship nagios4 instead of nagios3
sudo apt-get update
sudo apt-get install nagios3 nagios-plugins

# Configure Nagios
sudo nano /etc/nagios3/nagios.cfg

# Add host configuration
sudo nano /etc/nagios3/conf.d/localhost_nagios2.cfg
```

Zabbix - Enterprise Monitoring

#### Zabbix Agent Installation

```bash
# Install Zabbix repository
wget https://repo.zabbix.com/zabbix/5.4/ubuntu/pool/main/z/zabbix-release/zabbix-release_5.4-1+ubuntu20.04_all.deb
sudo dpkg -i zabbix-release_5.4-1+ubuntu20.04_all.deb
sudo apt update

# Install Zabbix agent
sudo apt install zabbix-agent

# Configure agent
sudo nano /etc/zabbix/zabbix_agentd.conf

# Start and enable service
sudo systemctl start zabbix-agent
sudo systemctl enable zabbix-agent
```

Prometheus and Grafana

#### Node Exporter Installation

```bash
# Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
sudo cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service (the prometheus user and group must already exist)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start service
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
```

Performance Optimization Tips

CPU Optimization

#### Process Priority Management

```bash
# Start a command with lower priority (nice values: -20 highest to 19 lowest)
nice -n 10 command

# Modify running process priority
renice 10 -p PID

# Run process with high priority (negative nice values require root)
sudo nice -n -10 important_process
```

#### CPU Governor Settings

```bash
# Check available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set performance governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

Memory Optimization

#### Swap Management

```bash
# Check swap usage
swapon --show

# Adjust swappiness (0-100, default 60); takes effect immediately, lost on reboot
echo 10 | sudo tee /proc/sys/vm/swappiness

# Make permanent
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
```

#### Memory Cleaning

```bash
# Flush dirty pages to disk first so more of the cache can be dropped
sync

# Clear page cache
echo 1 | sudo tee /proc/sys/vm/drop_caches

# Clear dentries and inodes
echo 2 | sudo tee /proc/sys/vm/drop_caches

# Clear all caches
echo 3 | sudo tee /proc/sys/vm/drop_caches
```

Disk Optimization

#### Filesystem Tuning

```bash
# Check filesystem parameters (ext2/ext3/ext4 only)
sudo tune2fs -l /dev/sda1

# Reduce reserved blocks to 1%
sudo tune2fs -m 1 /dev/sda1

# Enable filesystem features (back up first; run on an unmounted filesystem)
sudo tune2fs -O extent /dev/sda1
```

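#### I/O Scheduler Optimization

```bash
# Check current I/O scheduler (the active one is shown in brackets)
cat /sys/block/sda/queue/scheduler

# Set I/O scheduler (on multi-queue kernels the name is mq-deadline)
echo deadline | sudo tee /sys/block/sda/queue/scheduler

# Make permanent in GRUB (older kernels only; recent kernels ignore elevator=)
sudo nano /etc/default/grub
# Add: elevator=deadline to GRUB_CMDLINE_LINUX
sudo update-grub
```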

Troubleshooting Common Issues

High CPU Usage

#### Identifying CPU-Intensive Processes

```bash
# Find processes consuming high CPU
ps aux --sort=-%cpu | head -10

# Monitor CPU usage in real-time
top -o %CPU

# Check for zombie processes (state column starts with Z)
ps aux | awk '$8 ~ /^Z/'
```

#### CPU Usage Solutions

1. Kill unnecessary processes
2. Optimize application code
3. Upgrade hardware
4. Implement load balancing

Memory Issues

#### Out of Memory (OOM) Troubleshooting

```bash
# Check OOM killer logs
dmesg | grep -i "killed process"

# Monitor memory usage patterns
sar -r 1 60

# Identify memory leaks
valgrind --leak-check=full ./your_program
```

#### Memory Issue Solutions

1. Increase swap space
2. Add more RAM
3. Fix memory leaks
4. Optimize applications

Disk Space Problems

#### Finding Large Files

```bash
# Find files larger than 100 MB (first 20 matches, not sorted by size)
find / -type f -size +100M 2>/dev/null | head -20

# Clean package cache
sudo apt-get clean     # Ubuntu/Debian
sudo yum clean all     # CentOS/RHEL

# Remove journal entries older than 7 days
sudo journalctl --vacuum-time=7d
```

Network Issues

#### Network Troubleshooting

```bash
# Check network interfaces
ip link show

# Test DNS resolution
nslookup google.com

# Trace network route
traceroute google.com

# Check network statistics
netstat -i
```

Best Practices

Regular Monitoring Schedule

1. Real-time Monitoring: Critical systems should have continuous monitoring
2. Hourly Checks: CPU, memory, and disk usage
3. Daily Reports: Comprehensive system health reports
4. Weekly Analysis: Performance trends and capacity planning
5. Monthly Reviews: Hardware and software optimization

Documentation and Alerting

#### Documentation Standards

- Maintain system inventory
- Document baseline performance metrics
- Keep troubleshooting procedures updated
- Record configuration changes

#### Alert Configuration

```bash
# Example alert thresholds:
# CPU usage > 80% for 5 minutes
# Memory usage > 85%
# Disk usage > 90%
# Load average > number of CPU cores
# Service downtime > 1 minute
```
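A threshold such as "CPU usage > 80% for 5 minutes" implies that a single spike should not fire an alert. One way to sketch that is to collect a window of samples and alert only when every sample in the window exceeds the limit. The function name `sustained_breach` is hypothetical, and the sample values in the example are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: succeed (exit 0) only if every sample on stdin exceeds the threshold.
# One numeric sample per line; blank lines are ignored; empty input never alerts.
sustained_breach() {
    local threshold="$1"
    awk -v limit="$threshold" '
        NF == 0 { next }
        { n++; if ($1 + 0 <= limit + 0) below = 1 }
        END { exit ((n > 0 && !below) ? 0 : 1) }'
}

# Example: five one-minute CPU samples (illustrative values) checked against 80%
if printf '%s\n' 91 88 95 84 90 | sustained_breach 80; then
    echo "ALERT: CPU above 80% for the whole window"
fi
```

In practice the samples would come from a log that a scheduled check appends to, with `tail -n 5` selecting the most recent window.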

Security Considerations

#### Log Monitoring for Security

```bash
# Monitor failed login attempts
grep "Failed password" /var/log/auth.log | tail -20

# Check for unusual network connections
netstat -an | grep ESTABLISHED

# Monitor file system changes
sudo apt-get install aide
sudo aide --init
sudo aide --check
```

Backup and Recovery

#### System State Backup

```bash
# Backup system configuration
sudo tar -czf system_config_$(date +%Y%m%d).tar.gz /etc

# Database backups
mysqldump -u root -p --all-databases > backup_$(date +%Y%m%d).sql

# Create system snapshots (if using LVM)
sudo lvcreate -L1G -s -n snapshot /dev/vg0/root
```

Conclusion

Effective Linux system health monitoring is essential for maintaining reliable, secure, and high-performing systems. This comprehensive guide has covered the fundamental commands, advanced tools, and best practices needed to implement robust system monitoring.

Key takeaways include:

1. Regular Monitoring: Implement automated health checks to catch issues early
2. Comprehensive Coverage: Monitor CPU, memory, disk, network, and logs
3. Proactive Approach: Set up alerts and thresholds to prevent problems
4. Documentation: Keep detailed records of system performance and changes
5. Continuous Improvement: Regularly review and optimize monitoring procedures

Remember that system monitoring is an ongoing process that requires regular attention and adjustment. As your systems grow and change, your monitoring strategy should evolve accordingly.

By implementing the techniques and tools discussed in this guide, you'll be well-equipped to maintain healthy Linux systems and quickly resolve any issues that arise. Regular system health checks not only prevent downtime but also help optimize performance and ensure the long-term stability of your infrastructure.

Start with basic monitoring commands and gradually implement more advanced tools as your needs grow. The investment in proper system monitoring will pay dividends in reduced downtime, improved performance, and peace of mind knowing your systems are running optimally.

---

This guide serves as a comprehensive reference for Linux system health monitoring. Regular updates and practice with these tools will help you become proficient in maintaining robust Linux systems.

Tags

  • DevOps
  • Linux
  • Performance Optimization
  • System Monitoring
  • server-administration
