# Complete Linux System Health Check Guide: Monitor, Diagnose, and Optimize Your Server Performance in 2024
## Table of Contents

1. [Introduction](#introduction)
2. [Why System Health Monitoring Matters](#why-system-health-monitoring-matters)
3. [Essential Linux System Health Check Commands](#essential-linux-system-health-check-commands)
4. [CPU Performance Monitoring](#cpu-performance-monitoring)
5. [Memory Usage Analysis](#memory-usage-analysis)
6. [Disk Space and I/O Monitoring](#disk-space-and-io-monitoring)
7. [Network Performance Checks](#network-performance-checks)
8. [System Load and Process Management](#system-load-and-process-management)
9. [Log File Analysis](#log-file-analysis)
10. [Automated Health Check Scripts](#automated-health-check-scripts)
11. [Advanced Monitoring Tools](#advanced-monitoring-tools)
12. [Performance Optimization Tips](#performance-optimization-tips)
13. [Troubleshooting Common Issues](#troubleshooting-common-issues)
14. [Best Practices](#best-practices)
15. [Conclusion](#conclusion)

## Introduction
Linux system health monitoring is a critical aspect of server administration that ensures optimal performance, prevents downtime, and maintains system reliability. Whether you're managing a single server or an entire infrastructure, regular health checks help identify potential issues before they become critical problems.
This comprehensive guide covers everything you need to know about Linux system health monitoring, from basic commands to advanced automation techniques. You'll learn how to monitor CPU usage, memory consumption, disk space, network performance, and system logs effectively.
## Why System Health Monitoring Matters

### Preventing System Failures

Regular health checks help identify warning signs before they escalate into system failures. By monitoring key metrics like CPU usage, memory consumption, and disk space, administrators can take proactive measures to prevent outages.

### Optimizing Performance

System health monitoring reveals performance bottlenecks and resource constraints that may be slowing down your applications. This information is crucial for making informed decisions about hardware upgrades and system optimization.

### Security Monitoring

Health checks can reveal unusual system behavior that might indicate security breaches, malware infections, or unauthorized access attempts. Regular monitoring helps maintain system security and integrity.

### Cost Management

By understanding resource utilization patterns, organizations can make better decisions about infrastructure scaling, potentially saving significant costs on unnecessary hardware or cloud resources.

## Essential Linux System Health Check Commands
### Basic System Information Commands

#### uname - System Information

```bash
# Display all system information
uname -a

# Show kernel release
uname -r

# Display machine architecture
uname -m
```

#### uptime - System Uptime and Load
```bash
# Show system uptime and load averages
uptime

# Example output:
# 14:30:45 up 10 days, 5:42, 3 users, load average: 0.15, 0.18, 0.12
```

#### whoami and who - User Information
```bash
# Show current user
whoami

# Show logged-in users
who

# Show logged-in users and what they are running
w
```

### System Resource Overview Commands
#### htop - Interactive Process Viewer
```bash
# Install htop if not available
sudo apt-get install htop   # Ubuntu/Debian
sudo yum install htop       # CentOS/RHEL (provided by the EPEL repository)

# Run htop
htop
```

#### top - Process Monitor
```bash
# Display running processes
top

# Sort by CPU usage
top -o %CPU

# Sort by memory usage
top -o %MEM
```

## CPU Performance Monitoring
### Understanding CPU Metrics

CPU monitoring involves tracking several key metrics:

- CPU Usage Percentage: Shows how much of the CPU capacity is being used
- Load Average: Indicates system load over 1, 5, and 15-minute intervals
- Context Switches: Number of times the CPU switches between processes
- Interrupts: Hardware and software interrupt frequency
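The CPU usage percentage in the list above can be derived by hand from the kernel's counters. The following is a minimal sketch, assuming a Linux `/proc/stat` whose aggregate `cpu` line lists user, nice, system, idle, iowait, irq, softirq, and steal in that order. The function name `cpu_busy_pct` is illustrative and not part of any tool mentioned in this guide.

```bash
#!/bin/bash
# Sketch: CPU busy percentage between two samples of the aggregate "cpu"
# line from /proc/stat. cpu_busy_pct is a hypothetical helper name.

# Usage: cpu_busy_pct PREV_LINE CURR_LINE
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal (fields 2-9); guest time is already part of user/nice
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            idle = $5 + $6   # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle } else { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Live example: sample one second apart (only where /proc/stat is readable)
if [ -r /proc/stat ]; then
    prev=$(grep '^cpu ' /proc/stat)
    sleep 1
    curr=$(grep '^cpu ' /proc/stat)
    echo "CPU busy: $(cpu_busy_pct "$prev" "$curr")%"
fi
```

Counting iowait as idle is a design choice in this sketch; treating it as busy time instead would report higher usage on I/O-bound systems.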
### CPU Monitoring Commands

#### vmstat - Virtual Memory Statistics

```bash
# Display CPU, memory, and I/O statistics (5 samples, 1 second apart)
vmstat 1 5

# Output columns:
# procs:  r (runnable), b (blocked)
# memory: swpd, free, buff, cache
# swap:   si (swap in), so (swap out)
# io:     bi (blocks in), bo (blocks out)
# system: in (interrupts), cs (context switches)
# cpu:    us (user), sy (system), id (idle), wa (I/O wait), st (stolen)
```

#### iostat - I/O Statistics
```bash
# Install the sysstat package if needed
sudo apt-get install sysstat

# Display CPU statistics only
iostat -c 1 5

# Show extended device statistics
iostat -x 1 5
```

#### sar - System Activity Reporter
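```bash
# Monitor CPU usage every second for 10 iterations
sar -u 1 10

# Monitor CPU usage per core
sar -P ALL 1 5

# Report CPU usage from a saved daily file (XX is the day of the month)
# Debian/Ubuntu path shown; RHEL-family systems use /var/log/sa/saXX
sar -u -f /var/log/sysstat/saXX
```

### CPU Performance Analysis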
#### Identifying CPU Bottlenecks
```bash
# Check for high CPU processes
ps aux --sort=-%cpu | head -10

# Watch the load averages in real time
watch -n 1 'cat /proc/loadavg'

# Check current CPU clock speeds
grep MHz /proc/cpuinfo
```

#### CPU Temperature Monitoring
```bash
# Install lm-sensors
sudo apt-get install lm-sensors

# Detect sensors
sudo sensors-detect

# Display temperature readings
sensors
```

## Memory Usage Analysis
### Understanding Memory Metrics

Linux memory management involves several types of memory:

- Physical RAM: Actual memory modules installed
- Virtual Memory: Combination of RAM and swap space
- Buffer/Cache: Memory used for file system caching, which the kernel reclaims when applications need it
- Swap: Disk space used as virtual memory extension
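Because buffer and cache memory is reclaimable, free memory alone understates what applications can still use. The sketch below assumes a kernel that exposes `MemAvailable` in `/proc/meminfo` (3.14 or later) and computes the share of RAM that is genuinely unavailable. The helper name `mem_used_pct` is hypothetical.

```bash
#!/bin/bash
# Sketch: percentage of RAM in use, based on MemTotal and MemAvailable.
# mem_used_pct is a hypothetical helper; it reads meminfo-formatted text on stdin.

mem_used_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) { print "unknown"; exit 1 }
            printf "%.1f\n", 100 * (total - avail) / total
        }'
}

# Live example (only where /proc/meminfo is readable)
if [ -r /proc/meminfo ]; then
    echo "Memory in use: $(mem_used_pct < /proc/meminfo)%"
fi
```

Reading from stdin keeps the function easy to try against saved `/proc/meminfo` snapshots from other hosts.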
### Memory Monitoring Commands

#### free - Memory Usage Display

```bash
# Display memory usage in human-readable format
free -h

# Show buffers and cache in separate columns
free -h -w

# Monitor memory usage continuously
watch -n 1 free -h
```

#### /proc/meminfo - Detailed Memory Information
```bash
# Display detailed memory statistics
cat /proc/meminfo

# Show specific memory metrics
grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo
```

#### ps - Process Memory Usage
```bash
# Show processes sorted by memory usage
ps aux --sort=-%mem | head -10

# Display memory usage for a specific process (replace PID with the process ID)
ps -o pid,%mem,rss,vsz,cmd -p PID

# Show memory maps for a process
pmap PID
```

### Memory Performance Analysis
#### Identifying Memory Leaks
```bash
# Log the percentage of memory in use once a minute
while true; do
    echo "$(date): $(free | awk '/^Mem:/ {printf "%.1f", $3/$2 * 100}')" >> memory_usage.log
    sleep 60
done

# Check for memory-hungry processes
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -20
```

#### Swap Usage Monitoring
```bash
# Check swap usage
swapon --show

# Monitor swap activity (si and so columns)
vmstat 1 5 | awk 'NR > 1 {print $7, $8}'

# List processes by swap usage in kB, smallest first
for file in /proc/[0-9]*/status; do
    awk '/^Name:/ {name = $2} /^VmSwap:/ {print name, $2}' "$file" 2>/dev/null
done | sort -k 2 -n
```

## Disk Space and I/O Monitoring
### Disk Space Monitoring

#### df - Filesystem Disk Space Usage

```bash
# Display filesystem usage in human-readable format
df -h

# Show inode usage
df -i

# Display filesystem type
df -T
```

#### du - Directory Space Usage
```bash
# Show directory sizes in current location
du -h --max-depth=1

# Find the largest directories (-x stays on the root filesystem)
du -xh / 2>/dev/null | sort -rh | head -20

# Check specific directory size
du -sh /var/log
```

### Disk I/O Monitoring
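Disk I/O tools report rates that the kernel exposes as cumulative counters in `/proc/diskstats`. As a rough illustration of where those numbers come from, the sketch below subtracts two samples of one device line. It assumes the standard field layout (field 6 is sectors read, field 10 is sectors written, both counted in 512-byte units); `disk_bytes_delta` and the device name `sda` are illustrative assumptions.

```bash
#!/bin/bash
# Sketch: bytes read and written between two /proc/diskstats samples.
# disk_bytes_delta is a hypothetical helper name.

# Usage: disk_bytes_delta PREV_LINE CURR_LINE
disk_bytes_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { r0 = $6; w0 = $10 }
        NR == 2 { printf "read=%d written=%d\n", ($6 - r0) * 512, ($10 - w0) * 512 }'
}

# Live example over one second; "sda" is an assumed device name
dev=sda
if grep -q " $dev " /proc/diskstats 2>/dev/null; then
    prev=$(grep " $dev " /proc/diskstats)
    sleep 1
    curr=$(grep " $dev " /proc/diskstats)
    disk_bytes_delta "$prev" "$curr"
fi
```

For routine use, the tools covered in this section report the same counters already converted to per-second rates.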
#### iotop - I/O Usage by Process
```bash
# Install iotop
sudo apt-get install iotop

# Monitor I/O usage
sudo iotop

# Show only processes with I/O activity
sudo iotop -o
```

#### Disk Performance Analysis
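```bash
# Monitor disk I/O statistics
iostat -x 1 5

# Rough sequential write test (creates a 1 GiB file in the current directory)
dd if=/dev/zero of=testfile bs=1G count=1 oflag=dsync

# Rough sequential read test (the result may be served from the page cache)
dd if=testfile of=/dev/null bs=1G count=1

# Remove the test file
rm testfile
```

### Filesystem Health Checks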
#### File System Integrity
```bash
# Check a filesystem for errors (the filesystem must be unmounted)
sudo fsck /dev/sda1

# Report problems without making any changes
sudo fsck -n /dev/sda1

# Check all filesystems listed in /etc/fstab except the root filesystem
sudo fsck -A -R
```

#### SMART Monitoring
```bash
# Install smartmontools
sudo apt-get install smartmontools

# Check drive health
sudo smartctl -a /dev/sda

# Run short self-test
sudo smartctl -t short /dev/sda

# Check test results
sudo smartctl -l selftest /dev/sda
```

## Network Performance Checks
### Network Interface Monitoring

#### ifconfig and ip Commands

```bash
# Display network interfaces (legacy tool from the net-tools package)
ifconfig

# Modern alternative using the ip command
ip addr show

# Show per-interface statistics
ip -s link

# Display routing table
ip route show
```

#### Network Traffic Monitoring
```bash
# Install network monitoring tools
sudo apt-get install net-tools iftop nethogs

# Monitor network traffic by interface (replace eth0 with a name from `ip link`)
sudo iftop -i eth0

# Monitor network usage by process
sudo nethogs eth0
```

### Network Connectivity Tests
#### Basic Connectivity Tests
```bash
# Test connectivity to remote host
ping -c 4 google.com

# Test specific port connectivity
telnet google.com 80

# Advanced port testing
nmap -p 80,443 google.com
```

#### Network Performance Testing
```bash
# Install network testing tools
sudo apt-get install speedtest-cli iperf3

# Test internet speed
speedtest-cli

# Network throughput testing (requires an iperf3 server running on server_ip)
iperf3 -c server_ip
```

### Network Security Monitoring
#### Active Connections
```bash
# Show listening TCP and UDP sockets
netstat -tuln

# Include the owning process (sudo is needed to see other users' processes)
sudo netstat -tulnp

# Modern alternative
ss -tuln
```

#### Firewall Status
```bash
# Check UFW status (Ubuntu)
sudo ufw status

# Check iptables rules
sudo iptables -L

# Display firewall logs
sudo tail -f /var/log/ufw.log
```

## System Load and Process Management
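### Understanding System Load

System load represents the amount of work queued on the system. On Linux, the load average counts tasks that are running or waiting for a CPU as well as tasks in uninterruptible sleep (usually waiting on disk I/O), so a high load can point to an I/O bottleneck as well as CPU pressure. Load averages show the average system load over 1, 5, and 15-minute periods.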
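Load figures only become meaningful relative to the number of CPU cores. The sketch below divides a load value by a core count and labels the result; the helper name `classify_load` and the two labels are illustrative choices, not standard terminology.

```bash
#!/bin/bash
# Sketch: express load average per core and classify it.
# classify_load is a hypothetical helper name.

# Usage: classify_load LOAD CORES
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) { print "unknown"; exit 1 }
        ratio = load / cores
        state = (ratio <= 1.0) ? "within-capacity" : "over-capacity"
        printf "%.2f %s\n", ratio, state
    }'
}

# Live example using the 1-minute load average and the online core count
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    classify_load "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
fi
```

Using the 5-minute or 15-minute figure (fields 2 and 3 of `/proc/loadavg`) instead gives a steadier signal for alerting.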
#### Load Average Interpretation
```bash
# Display current load
cat /proc/loadavg

# Rule of thumb for load interpretation:
# Load < Number of CPU cores = Good
# Load = Number of CPU cores = Fully utilized
# Load > Number of CPU cores = Overloaded
```

### Process Management
#### Process Monitoring
```bash
# List all processes
ps aux

# Show process tree
pstree

# Monitor processes in real-time
top
htop
```

#### Process Control
```bash
# Kill process by PID (replace PID with the process ID)
kill PID

# Kill process by name
pkill process_name

# Kill all processes by user
pkill -u username
```

### Service Management
#### Systemd Service Management
```bash
# List all services
systemctl list-units --type=service

# Check service status
systemctl status service_name

# Start/stop/restart services
systemctl start service_name
systemctl stop service_name
systemctl restart service_name

# Enable/disable services
systemctl enable service_name
systemctl disable service_name
```

## Log File Analysis
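### System Log Locations

Linux systems store logs in various locations. The paths below are typical for Debian and Ubuntu; RHEL-family systems use /var/log/messages and /var/log/secure in place of syslog and auth.log: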
- /var/log/syslog - General system messages
- /var/log/auth.log - Authentication logs
- /var/log/kern.log - Kernel messages
- /var/log/dmesg - Boot messages
- /var/log/cron.log - Cron job logs
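Since these paths differ between distributions, scripts that read them benefit from probing for the file first. The following is a minimal sketch; `first_readable` is a hypothetical helper, and the candidate paths are the Debian/Ubuntu names from the list above followed by the RHEL-family equivalents.

```bash
#!/bin/bash
# Sketch: pick the first log file that exists and is readable on this host.
# first_readable is a hypothetical helper name.

# Usage: first_readable PATH...
# Prints the first readable path; returns 1 if none of them is readable.
first_readable() {
    for candidate in "$@"; do
        if [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

syslog_file=$(first_readable /var/log/syslog /var/log/messages) || syslog_file=""
auth_file=$(first_readable /var/log/auth.log /var/log/secure) || auth_file=""

echo "system log: ${syslog_file:-not found (try journalctl)}"
echo "auth log:   ${auth_file:-not found (try journalctl)}"
```

On hosts that log only to the systemd journal, neither file may exist, in which case `journalctl` is the place to look.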
### Log Analysis Commands

#### journalctl - Systemd Journal

```bash
# View all journal entries
journalctl

# Show recent entries
journalctl -n 50

# Follow log in real-time
journalctl -f

# Filter by service
journalctl -u ssh.service

# Filter by time range
journalctl --since "2024-01-01" --until "2024-01-02"
```

#### Traditional Log Analysis
```bash
# Follow system messages as they arrive
tail -f /var/log/syslog

# Search for errors
grep -i error /var/log/syslog

# Count lines containing "error" (case-insensitive)
grep -ci error /var/log/syslog

# Analyze authentication failures
grep "Failed password" /var/log/auth.log
```

### Log Rotation and Management
#### Logrotate Configuration
```bash
# Check logrotate configuration
cat /etc/logrotate.conf

# View specific log rotation rules
ls /etc/logrotate.d/

# Test logrotate configuration (debug mode, no changes are made)
sudo logrotate -d /etc/logrotate.conf

# Force log rotation
sudo logrotate -f /etc/logrotate.conf
```

## Automated Health Check Scripts
### Basic Health Check Script

Create a comprehensive health check script:

```bash
#!/bin/bash
# System Health Check Script

echo "=== System Health Check Report ==="
echo "Date: $(date)"
echo "Hostname: $(hostname)"
echo "Uptime: $(uptime)"
echo ""

# CPU Usage (100 minus the idle percentage reported by top)
echo "=== CPU Usage ==="
top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)%* *id.*/\1/' | awk '{print "CPU Usage: " (100 - $1) "%"}'
echo ""

# Memory Usage
echo "=== Memory Usage ==="
free -h | grep -E "Mem|Swap"
echo ""

# Disk Usage
echo "=== Disk Usage ==="
df -h | grep -E "^/dev"
echo ""

# System Load
echo "=== System Load ==="
cat /proc/loadavg
echo ""

# Top Processes by CPU (header line plus 5 processes)
echo "=== Top 5 CPU Processes ==="
ps aux --sort=-%cpu | head -6
echo ""

# Top Processes by Memory
echo "=== Top 5 Memory Processes ==="
ps aux --sort=-%mem | head -6
echo ""

# Listening sockets (-H omits the header line so it is not counted)
echo "=== Network Connections ==="
echo "Listening sockets: $(ss -Htuln | wc -l)"
echo ""

# Recent Errors in Syslog
echo "=== Recent System Errors ==="
grep -i error /var/log/syslog | tail -5
echo ""

echo "=== Health Check Complete ==="
```
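The report above repeats the same three steps for every section: print a header, run a command, print a blank line. One way to structure that is a small helper that also skips tools that are not installed. This is a sketch only; `section` is a hypothetical function name, and it handles a single command with arguments, not a pipeline.

```bash
#!/bin/bash
# Sketch: helper that prints a titled report section and tolerates missing tools.
# section is a hypothetical function name.

# Usage: section TITLE COMMAND [ARGS...]
section() {
    title=$1
    shift
    echo "=== $title ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command exited with status $?)"
    else
        echo "(skipped: $1 not installed)"
    fi
    echo ""
}

section "Uptime" uptime
section "Disk Usage" df -h
section "Temperatures" sensors
```

With this shape, a missing optional tool such as `sensors` produces a one-line note in the report without stopping the remaining checks.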
### Advanced Monitoring Script

```bash
#!/bin/bash
# Advanced System Health Monitoring Script

# Configuration
LOG_FILE="/var/log/system_health.log"
EMAIL="admin@example.com"
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90

# Functions
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Requires a working mail command on the host
send_alert() {
    echo "$1" | mail -s "System Alert: $(hostname)" "$EMAIL"
    log_message "ALERT: $1"
}

check_cpu() {
    # Busy CPU is 100 minus the idle percentage reported by top
    CPU_IDLE=$(top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)%* *id.*/\1/')
    CPU_USAGE=$(awk -v idle="$CPU_IDLE" 'BEGIN {printf "%.0f", 100 - idle}')
    if [ "$CPU_USAGE" -gt "$CPU_THRESHOLD" ]; then
        send_alert "High CPU usage detected: ${CPU_USAGE}%"
    fi
    log_message "CPU Usage: ${CPU_USAGE}%"
}

check_memory() {
    MEMORY_USAGE=$(free | awk '/^Mem:/ {printf "%.0f", $3/$2 * 100}')
    if [ "$MEMORY_USAGE" -gt "$MEMORY_THRESHOLD" ]; then
        send_alert "High memory usage detected: ${MEMORY_USAGE}%"
    fi
    log_message "Memory Usage: ${MEMORY_USAGE}%"
}

check_disk() {
    # -P keeps each filesystem on one line so the columns parse reliably
    df -P | grep -E "^/dev" | while read -r _ _ _ _ pct mount; do
        USAGE=${pct%\%}
        if [ "$USAGE" -gt "$DISK_THRESHOLD" ]; then
            send_alert "High disk usage on $mount: ${USAGE}%"
        fi
        log_message "Disk Usage $mount: ${USAGE}%"
    done
}

check_services() {
    # Adjust this list to the services the host is expected to run
    CRITICAL_SERVICES=("ssh" "nginx" "mysql")
    for service in "${CRITICAL_SERVICES[@]}"; do
        if ! systemctl is-active --quiet "$service"; then
            send_alert "Critical service $service is not running"
        else
            log_message "Service $service: Running"
        fi
    done
}

# Main execution
log_message "Starting system health check"
check_cpu
check_memory
check_disk
check_services
log_message "System health check completed"
```

### Scheduling Automated Checks
Add the script to crontab for regular execution:

```bash
# Edit crontab
crontab -e

# Add entries for automated health checks
# Run every 5 minutes
*/5 * * * * /path/to/basic_health_check.sh

# Run advanced check every hour
0 * * * * /path/to/advanced_health_check.sh

# Daily comprehensive report at 06:00
0 6 * * * /path/to/daily_report.sh
```

## Advanced Monitoring Tools
### Nagios - Network Monitoring

#### Installation and Configuration

```bash
# Install Nagios (Ubuntu/Debian)
# Package names vary by release; newer releases ship nagios4 instead of nagios3
sudo apt-get update
sudo apt-get install nagios3 nagios-plugins

# Configure Nagios
sudo nano /etc/nagios3/nagios.cfg

# Add host configuration
sudo nano /etc/nagios3/conf.d/localhost_nagios2.cfg
```

### Zabbix - Enterprise Monitoring
#### Zabbix Agent Installation
```bash
# Install Zabbix repository
wget https://repo.zabbix.com/zabbix/5.4/ubuntu/pool/main/z/zabbix-release/zabbix-release_5.4-1+ubuntu20.04_all.deb
sudo dpkg -i zabbix-release_5.4-1+ubuntu20.04_all.deb
sudo apt update

# Install Zabbix agent
sudo apt install zabbix-agent

# Configure agent
sudo nano /etc/zabbix/zabbix_agentd.conf

# Start and enable service
sudo systemctl start zabbix-agent
sudo systemctl enable zabbix-agent
```

### Prometheus and Grafana
#### Node Exporter Installation
```bash
# Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
sudo cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service (the prometheus user and group must already exist)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start service
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
```

## Performance Optimization Tips
### CPU Optimization

#### Process Priority Management

```bash
# Start a command with lower priority (nice values: -20 highest to 19 lowest)
nice -n 10 command

# Modify running process priority
renice 10 -p PID

# Run process with high priority (negative nice values require root)
sudo nice -n -10 important_process
```

#### CPU Governor Settings
```bash
# Check available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set performance governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

### Memory Optimization
#### Swap Management
```bash
# Check swap usage
swapon --show

# Adjust swappiness until the next reboot
# (0-100, default 60; kernels 5.8 and later accept values up to 200)
echo 10 | sudo tee /proc/sys/vm/swappiness

# Make permanent
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
```

#### Memory Cleaning
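```bash
# Dropping caches is mainly useful for testing; the kernel refills them as needed
# Write dirty pages to disk first so more of the cache can be released
sync

# Clear page cache
echo 1 | sudo tee /proc/sys/vm/drop_caches

# Clear dentries and inodes
echo 2 | sudo tee /proc/sys/vm/drop_caches

# Clear all caches
echo 3 | sudo tee /proc/sys/vm/drop_caches
```

### Disk Optimization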
#### Filesystem Tuning
```bash
# Check filesystem parameters (ext2/ext3/ext4 filesystems)
sudo tune2fs -l /dev/sda1

# Adjust reserved blocks percentage
sudo tune2fs -m 1 /dev/sda1

# Enable filesystem features
sudo tune2fs -O extent /dev/sda1
```

#### I/O Scheduler Optimization
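```bash
# Check current I/O scheduler (the active one is shown in brackets)
cat /sys/block/sda/queue/scheduler

# Set I/O scheduler; use a name from the list printed above
# (kernels with the multi-queue block layer call it mq-deadline)
echo deadline | sudo tee /sys/block/sda/queue/scheduler

# Make permanent in GRUB (the elevator= parameter applies to older kernels)
sudo nano /etc/default/grub
# Add: elevator=deadline to GRUB_CMDLINE_LINUX
sudo update-grub
```

## Troubleshooting Common Issues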
### High CPU Usage

#### Identifying CPU-Intensive Processes

```bash
# Find processes consuming high CPU
ps aux --sort=-%cpu | head -10

# Monitor CPU usage in real-time
top -o %CPU

# Check for zombie processes (state Z)
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'
```

#### CPU Usage Solutions

1. Kill unnecessary processes
2. Optimize application code
3. Upgrade hardware
4. Implement load balancing
### Memory Issues

#### Out of Memory (OOM) Troubleshooting

```bash
# Check OOM killer logs
dmesg | grep -i "killed process"

# Monitor memory usage patterns
sar -r 1 60

# Identify memory leaks
valgrind --leak-check=full ./your_program
```

#### Memory Issue Solutions

1. Increase swap space
2. Add more RAM
3. Fix memory leaks
4. Optimize applications
### Disk Space Problems

#### Finding Large Files

```bash
# List the 20 largest files over 100 MB on the root filesystem
sudo find / -xdev -type f -size +100M -exec du -h {} + 2>/dev/null | sort -rh | head -20

# Clean package cache
sudo apt-get clean      # Ubuntu/Debian
sudo yum clean all      # CentOS/RHEL

# Remove journal entries older than 7 days
sudo journalctl --vacuum-time=7d
```

### Network Issues
#### Network Troubleshooting
```bash
# Check network interfaces
ip link show

# Test DNS resolution
nslookup google.com

# Trace network route
traceroute google.com

# Check network statistics
netstat -i
```

## Best Practices
### Regular Monitoring Schedule

1. Real-time Monitoring: Critical systems should have continuous monitoring
2. Hourly Checks: CPU, memory, and disk usage
3. Daily Reports: Comprehensive system health reports
4. Weekly Analysis: Performance trends and capacity planning
5. Monthly Reviews: Hardware and software optimization

### Documentation and Alerting

#### Documentation Standards

- Maintain system inventory
- Document baseline performance metrics
- Keep troubleshooting procedures updated
- Record configuration changes
#### Alert Configuration
```bash
# Example alert thresholds:
# CPU usage > 80% for 5 minutes
# Memory usage > 85%
# Disk usage > 90%
# Load average > number of CPU cores
# Service downtime > 1 minute
```

### Security Considerations
#### Log Monitoring for Security
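```bash
# Monitor failed login attempts
grep "Failed password" /var/log/auth.log | tail -20

# Check for unusual network connections
netstat -an | grep ESTABLISHED

# Monitor file system changes
sudo apt-get install aide
sudo aide --init
# After --init, install the generated database as the reference database
# (its path varies by distribution) before running --check
sudo aide --check
```

### Backup and Recovery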
#### System State Backup
```bash
# Backup system configuration
sudo tar -czf system_config_$(date +%Y%m%d).tar.gz /etc

# Database backups
mysqldump -u root -p --all-databases > backup_$(date +%Y%m%d).sql

# Create system snapshots (if using LVM)
sudo lvcreate -L1G -s -n snapshot /dev/vg0/root
```

## Conclusion
Effective Linux system health monitoring is essential for maintaining reliable, secure, and high-performing systems. This comprehensive guide has covered the fundamental commands, advanced tools, and best practices needed to implement robust system monitoring.
Key takeaways include:
1. Regular Monitoring: Implement automated health checks to catch issues early
2. Comprehensive Coverage: Monitor CPU, memory, disk, network, and logs
3. Proactive Approach: Set up alerts and thresholds to prevent problems
4. Documentation: Keep detailed records of system performance and changes
5. Continuous Improvement: Regularly review and optimize monitoring procedures
Remember that system monitoring is an ongoing process that requires regular attention and adjustment. As your systems grow and change, your monitoring strategy should evolve accordingly.
By implementing the techniques and tools discussed in this guide, you'll be well-equipped to maintain healthy Linux systems and quickly resolve any issues that arise. Regular system health checks not only prevent downtime but also help optimize performance and ensure the long-term stability of your infrastructure.
Start with basic monitoring commands and gradually implement more advanced tools as your needs grow. The investment in proper system monitoring will pay dividends in reduced downtime, improved performance, and peace of mind knowing your systems are running optimally.
---
This guide serves as a comprehensive reference for Linux system health monitoring. Regular updates and practice with these tools will help you become proficient in maintaining robust Linux systems.