System Health Check Scripts: Complete Guide and Implementation
Table of Contents
1. [Introduction](#introduction) 2. [Prerequisites](#prerequisites) 3. [Core Components](#core-components) 4. [Script Architecture](#script-architecture) 5. [Implementation Examples](#implementation-examples) 6. [Configuration Management](#configuration-management) 7. [Monitoring Categories](#monitoring-categories) 8. [Alert Systems](#alert-systems) 9. [Best Practices](#best-practices) 10. [Troubleshooting](#troubleshooting)Introduction
System health check scripts are automated tools designed to monitor, assess, and report on the operational status of computer systems. These scripts provide real-time insights into system performance, resource utilization, service availability, and potential issues that may affect system reliability.
The primary objectives of system health check scripts include: - Proactive monitoring of system resources - Early detection of performance bottlenecks - Automated alerting for critical issues - Historical data collection for trend analysis - Compliance verification for system standards - Automated remediation of common problems
Prerequisites
Before implementing system health check scripts, ensure the following requirements are met:
System Requirements
| Component | Minimum Requirement | Recommended | |-----------|-------------------|-------------| | Operating System | Linux/Unix-based | CentOS 7+, Ubuntu 18.04+ | | Shell | Bash 4.0+ | Bash 5.0+ | | Memory | 512 MB | 1 GB+ | | Disk Space | 100 MB | 500 MB+ | | Network | Basic connectivity | High-speed connection |
Required Tools and Utilities
| Tool | Purpose | Installation Command |
|------|---------|---------------------|
| awk | Text processing | Usually pre-installed |
| sed | Stream editing | Usually pre-installed |
| grep | Pattern matching | Usually pre-installed |
| curl | HTTP requests | yum install curl or apt-get install curl |
| jq | JSON processing | yum install jq or apt-get install jq |
| mail | Email notifications | yum install mailx or apt-get install mailutils |
Permissions and Access
- Root or sudo access for system-level monitoring - Read access to log files and system directories - Network access for external service checks - Write permissions for log and report generation
Core Components
1. Data Collection Module
The data collection module gathers information from various system sources:
`bash
#!/bin/bash
Data Collection Functions
collect_system_info() { local output_file="$1" { echo "SYSTEM_INFO_START" echo "Hostname: $(hostname)" echo "Uptime: $(uptime)" echo "Date: $(date)" echo "Kernel: $(uname -r)" echo "Architecture: $(uname -m)" echo "SYSTEM_INFO_END" } >> "$output_file" }
collect_cpu_info() { local output_file="$1" { echo "CPU_INFO_START" echo "CPU Usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)" echo "Load Average: $(uptime | awk -F'load average:' '{print $2}')" echo "CPU Cores: $(nproc)" echo "CPU_INFO_END" } >> "$output_file" }
collect_memory_info() {
local output_file="$1"
{
echo "MEMORY_INFO_START"
free -m | while read line; do
echo "Memory: $line"
done
echo "MEMORY_INFO_END"
} >> "$output_file"
}
`
2. Analysis Engine
The analysis engine processes collected data and determines system health status:
`bash
#!/bin/bash
Analysis Functions
analyze_cpu_usage() { local cpu_usage="$1" local threshold_warning=70 local threshold_critical=90 if (( $(echo "$cpu_usage > $threshold_critical" | bc -l) )); then echo "CRITICAL" elif (( $(echo "$cpu_usage > $threshold_warning" | bc -l) )); then echo "WARNING" else echo "OK" fi }
analyze_memory_usage() { local total_mem="$1" local used_mem="$2" local usage_percent=$(echo "scale=2; ($used_mem / $total_mem) * 100" | bc) local threshold_warning=80 local threshold_critical=95 if (( $(echo "$usage_percent > $threshold_critical" | bc -l) )); then echo "CRITICAL" elif (( $(echo "$usage_percent > $threshold_warning" | bc -l) )); then echo "WARNING" else echo "OK" fi }
analyze_disk_usage() {
local filesystem="$1"
local usage_percent=$(df -h "$filesystem" | awk 'NR==2 {print $5}' | sed 's/%//')
local threshold_warning=80
local threshold_critical=90
if [ "$usage_percent" -gt "$threshold_critical" ]; then
echo "CRITICAL"
elif [ "$usage_percent" -gt "$threshold_warning" ]; then
echo "WARNING"
else
echo "OK"
fi
}
`
3. Reporting Module
The reporting module formats and presents health check results:
`bash
#!/bin/bash
Reporting Functions
generate_html_report() { local report_file="$1" local timestamp="$2" cat > "$report_file" << EOF
System Health Report
Generated on: $timestamp
add_table_to_report() { local report_file="$1" local table_title="$2" shift 2 local -a table_data=("$@") cat >> "$report_file" << EOF
$table_title
| Component | Status | Value | Threshold |
|---|---|---|---|
| $component | $status | $value | $threshold |
`Script Architecture
Master Health Check Script
`bash
#!/bin/bash
master_health_check.sh
Main system health check orchestrator
Configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" CONFIG_FILE="$SCRIPT_DIR/health_check.conf" LOG_DIR="/var/log/health_check" REPORT_DIR="/var/reports/health_check" TIMESTAMP=$(date +"%Y%m%d_%H%M%S")Source configuration
if [ -f "$CONFIG_FILE" ]; then source "$CONFIG_FILE" else echo "Configuration file not found: $CONFIG_FILE" exit 1 fiCreate necessary directories
mkdir -p "$LOG_DIR" "$REPORT_DIR"Logging function
log_message() { local level="$1" local message="$2" echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $message" | tee -a "$LOG_DIR/health_check_$TIMESTAMP.log" }Main execution function
main() { log_message "INFO" "Starting system health check" # Initialize report local report_file="$REPORT_DIR/health_report_$TIMESTAMP.html" generate_html_report "$report_file" "$(date)" # Perform health checks check_system_resources "$report_file" check_services "$report_file" check_network_connectivity "$report_file" check_disk_health "$report_file" check_security_status "$report_file" # Finalize report finalize_html_report "$report_file" # Send notifications if needed process_alerts "$report_file" log_message "INFO" "Health check completed successfully" }Execute main function
main "$@"`System Resource Monitoring
`bash
#!/bin/bash
system_resources.sh
Detailed system resource monitoring
check_system_resources() {
local report_file="$1"
local -a resource_data=()
log_message "INFO" "Checking system resources"
# CPU Check
local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
local cpu_status=$(analyze_cpu_usage "$cpu_usage")
resource_data+=("CPU Usage|$cpu_status|${cpu_usage}%|Warning: 70%, Critical: 90%")
# Memory Check
local mem_info=$(free -m | grep "^Mem:")
local total_mem=$(echo "$mem_info" | awk '{print $2}')
local used_mem=$(echo "$mem_info" | awk '{print $3}')
local mem_usage_percent=$(echo "scale=1; ($used_mem / $total_mem) * 100" | bc)
local mem_status=$(analyze_memory_usage "$total_mem" "$used_mem")
resource_data+=("Memory Usage|$mem_status|${mem_usage_percent}%|Warning: 80%, Critical: 95%")
# Load Average Check
local load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
local cpu_cores=$(nproc)
local load_per_core=$(echo "scale=2; $load_avg / $cpu_cores" | bc)
local load_status="OK"
if (( $(echo "$load_per_core > 1.5" | bc -l) )); then
load_status="CRITICAL"
elif (( $(echo "$load_per_core > 1.0" | bc -l) )); then
load_status="WARNING"
fi
resource_data+=("Load Average|$load_status|$load_avg (${load_per_core}/core)|Warning: 1.0/core, Critical: 1.5/core")
# Disk Usage Check
while IFS= read -r line; do
local filesystem=$(echo "$line" | awk '{print $1}')
local usage_percent=$(echo "$line" | awk '{print $5}' | sed 's/%//')
local mount_point=$(echo "$line" | awk '{print $6}')
# Skip special filesystems
if [[ "$filesystem" =~ ^/dev/(sd|hd|nvme|xvd) ]]; then
local disk_status=$(analyze_disk_usage "$mount_point")
resource_data+=("Disk $mount_point|$disk_status|${usage_percent}%|Warning: 80%, Critical: 90%")
fi
done < <(df -h | grep -E '^/dev/')
# Add resource data to report
add_table_to_report "$report_file" "System Resources" "${resource_data[@]}"
}
`
Service Monitoring
`bash
#!/bin/bash
service_monitoring.sh
Service availability and health monitoring
check_services() {
local report_file="$1"
local -a service_data=()
log_message "INFO" "Checking services"
# Define critical services to monitor
local -a critical_services=("sshd" "httpd" "nginx" "mysqld" "postgresql" "docker")
for service in "${critical_services[@]}"; do
if systemctl list-unit-files | grep -q "^$service.service"; then
local service_status="UNKNOWN"
local service_info=""
if systemctl is-active --quiet "$service"; then
service_status="OK"
local uptime=$(systemctl show "$service" --property=ActiveEnterTimestamp | cut -d= -f2)
service_info="Active since $uptime"
else
service_status="CRITICAL"
service_info="Service not running"
fi
service_data+=("$service|$service_status|$service_info|Service should be active")
fi
done
# Check custom application services
if [ -n "$CUSTOM_SERVICES" ]; then
IFS=',' read -ra CUSTOM_SERVICE_ARRAY <<< "$CUSTOM_SERVICES"
for service in "${CUSTOM_SERVICE_ARRAY[@]}"; do
local service_status="UNKNOWN"
local service_info=""
if systemctl is-active --quiet "$service"; then
service_status="OK"
service_info="Running"
else
service_status="CRITICAL"
service_info="Not running"
fi
service_data+=("$service (custom)|$service_status|$service_info|Custom service should be active")
done
fi
# Port connectivity checks
if [ -n "$MONITOR_PORTS" ]; then
IFS=',' read -ra PORT_ARRAY <<< "$MONITOR_PORTS"
for port_info in "${PORT_ARRAY[@]}"; do
IFS=':' read -r host port <<< "$port_info"
local port_status="UNKNOWN"
local port_info_msg=""
if timeout 5 bash -c "/dev/null; then
port_status="OK"
port_info_msg="Port accessible"
else
port_status="CRITICAL"
port_info_msg="Port not accessible"
fi
service_data+=("Port $host:$port|$port_status|$port_info_msg|Port should be accessible")
done
fi
add_table_to_report "$report_file" "Services and Connectivity" "${service_data[@]}"
}
`
Configuration Management
Configuration File Structure
`bash
health_check.conf
System Health Check Configuration
Email Configuration
EMAIL_ENABLED=true EMAIL_RECIPIENTS="admin@example.com,monitoring@example.com" EMAIL_SMTP_SERVER="smtp.example.com" EMAIL_SMTP_PORT=587 EMAIL_USERNAME="healthcheck@example.com" EMAIL_PASSWORD="secure_password"Alert Thresholds
CPU_WARNING_THRESHOLD=70 CPU_CRITICAL_THRESHOLD=90 MEMORY_WARNING_THRESHOLD=80 MEMORY_CRITICAL_THRESHOLD=95 DISK_WARNING_THRESHOLD=80 DISK_CRITICAL_THRESHOLD=90 LOAD_WARNING_THRESHOLD=1.0 LOAD_CRITICAL_THRESHOLD=1.5Services to Monitor
CRITICAL_SERVICES="sshd,httpd,mysqld" CUSTOM_SERVICES="myapp,background-worker"Network Monitoring
MONITOR_PORTS="localhost:80,localhost:443,database.local:3306" EXTERNAL_HOSTS="google.com,github.com"Disk Monitoring
MONITOR_FILESYSTEMS="/,/home,/var,/tmp" EXCLUDE_FILESYSTEMS="/dev,/proc,/sys"Log Monitoring
LOG_FILES="/var/log/messages,/var/log/secure,/var/log/httpd/error_log" LOG_ERROR_PATTERNS="ERROR,CRITICAL,FATAL,Out of memory"Report Configuration
REPORT_FORMAT="html,json,plain" REPORT_RETENTION_DAYS=30 REPORT_COMPRESSION=trueNotification Configuration
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/YOUR/DISCORD/WEBHOOK"Advanced Options
ENABLE_PERFORMANCE_METRICS=true ENABLE_SECURITY_CHECKS=true ENABLE_LOG_ANALYSIS=true PARALLEL_EXECUTION=true MAX_PARALLEL_JOBS=4`Dynamic Configuration Loading
`bash
#!/bin/bash
config_manager.sh
Configuration management functions
load_configuration() { local config_file="$1" if [ ! -f "$config_file" ]; then log_message "ERROR" "Configuration file not found: $config_file" return 1 fi # Validate configuration file syntax if ! bash -n "$config_file"; then log_message "ERROR" "Configuration file has syntax errors: $config_file" return 1 fi # Source configuration source "$config_file" # Set default values for missing configurations set_default_values # Validate configuration values validate_configuration log_message "INFO" "Configuration loaded successfully from $config_file" }
set_default_values() { # Set defaults if not specified CPU_WARNING_THRESHOLD=${CPU_WARNING_THRESHOLD:-70} CPU_CRITICAL_THRESHOLD=${CPU_CRITICAL_THRESHOLD:-90} MEMORY_WARNING_THRESHOLD=${MEMORY_WARNING_THRESHOLD:-80} MEMORY_CRITICAL_THRESHOLD=${MEMORY_CRITICAL_THRESHOLD:-95} DISK_WARNING_THRESHOLD=${DISK_WARNING_THRESHOLD:-80} DISK_CRITICAL_THRESHOLD=${DISK_CRITICAL_THRESHOLD:-90} EMAIL_ENABLED=${EMAIL_ENABLED:-false} REPORT_FORMAT=${REPORT_FORMAT:-"html"} REPORT_RETENTION_DAYS=${REPORT_RETENTION_DAYS:-7} PARALLEL_EXECUTION=${PARALLEL_EXECUTION:-false} MAX_PARALLEL_JOBS=${MAX_PARALLEL_JOBS:-2} }
validate_configuration() {
local validation_errors=0
# Validate threshold values
if [ "$CPU_WARNING_THRESHOLD" -ge "$CPU_CRITICAL_THRESHOLD" ]; then
log_message "ERROR" "CPU warning threshold must be less than critical threshold"
((validation_errors++))
fi
if [ "$MEMORY_WARNING_THRESHOLD" -ge "$MEMORY_CRITICAL_THRESHOLD" ]; then
log_message "ERROR" "Memory warning threshold must be less than critical threshold"
((validation_errors++))
fi
# Validate email configuration
if [ "$EMAIL_ENABLED" = "true" ]; then
if [ -z "$EMAIL_RECIPIENTS" ]; then
log_message "ERROR" "Email recipients not specified"
((validation_errors++))
fi
if [ -z "$EMAIL_SMTP_SERVER" ]; then
log_message "ERROR" "SMTP server not specified"
((validation_errors++))
fi
fi
# Validate numeric values
if ! [[ "$REPORT_RETENTION_DAYS" =~ ^[0-9]+$ ]]; then
log_message "ERROR" "Report retention days must be a number"
((validation_errors++))
fi
if [ "$validation_errors" -gt 0 ]; then
log_message "ERROR" "Configuration validation failed with $validation_errors errors"
return 1
fi
log_message "INFO" "Configuration validation successful"
return 0
}
`
Monitoring Categories
Network Connectivity Monitoring
`bash
#!/bin/bash
network_monitoring.sh
Network connectivity and performance monitoring
check_network_connectivity() {
local report_file="$1"
local -a network_data=()
log_message "INFO" "Checking network connectivity"
# Internal network interface checks
while IFS= read -r interface; do
local interface_name=$(echo "$interface" | awk '{print $1}')
local interface_status="OK"
local interface_info=""
if ip link show "$interface_name" | grep -q "state UP"; then
local ip_address=$(ip addr show "$interface_name" | grep "inet " | awk '{print $2}' | head -1)
interface_info="UP - IP: $ip_address"
else
interface_status="WARNING"
interface_info="Interface DOWN"
fi
network_data+=("Interface $interface_name|$interface_status|$interface_info|Interface should be UP")
done < <(ip link show | grep -E "^[0-9]+:" | grep -v "lo:" | awk -F': ' '{print $2}')
# External connectivity checks
if [ -n "$EXTERNAL_HOSTS" ]; then
IFS=',' read -ra HOST_ARRAY <<< "$EXTERNAL_HOSTS"
for host in "${HOST_ARRAY[@]}"; do
local ping_status="UNKNOWN"
local ping_info=""
if ping -c 3 -W 5 "$host" >/dev/null 2>&1; then
local ping_time=$(ping -c 1 -W 5 "$host" 2>/dev/null | grep "time=" | awk -F'time=' '{print $2}' | awk '{print $1}')
ping_status="OK"
ping_info="Reachable - RTT: ${ping_time}ms"
else
ping_status="CRITICAL"
ping_info="Unreachable"
fi
network_data+=("External Host $host|$ping_status|$ping_info|Host should be reachable")
done
fi
# DNS resolution checks
local dns_servers=("8.8.8.8" "1.1.1.1")
for dns_server in "${dns_servers[@]}"; do
local dns_status="UNKNOWN"
local dns_info=""
if nslookup google.com "$dns_server" >/dev/null 2>&1; then
dns_status="OK"
dns_info="DNS resolution working"
else
dns_status="WARNING"
dns_info="DNS resolution failed"
fi
network_data+=("DNS Server $dns_server|$dns_status|$dns_info|DNS should resolve queries")
done
# Network statistics
local rx_errors=$(cat /proc/net/dev | grep -E "(eth|ens|enp)" | awk '{sum+=$4} END {print sum+0}')
local tx_errors=$(cat /proc/net/dev | grep -E "(eth|ens|enp)" | awk '{sum+=$12} END {print sum+0}')
local error_status="OK"
if [ "$rx_errors" -gt 100 ] || [ "$tx_errors" -gt 100 ]; then
error_status="WARNING"
fi
network_data+=("Network Errors|$error_status|RX: $rx_errors, TX: $tx_errors|Errors should be minimal")
add_table_to_report "$report_file" "Network Connectivity" "${network_data[@]}"
}
`
Security Status Monitoring
`bash
#!/bin/bash
security_monitoring.sh
Security-related system checks
check_security_status() {
local report_file="$1"
local -a security_data=()
log_message "INFO" "Checking security status"
# Failed login attempts
local failed_logins=$(grep "Failed password" /var/log/auth.log 2>/dev/null | grep "$(date +%b\ %d)" | wc -l)
local login_status="OK"
if [ "$failed_logins" -gt 50 ]; then
login_status="CRITICAL"
elif [ "$failed_logins" -gt 10 ]; then
login_status="WARNING"
fi
security_data+=("Failed Logins Today|$login_status|$failed_logins attempts|Warning: >10, Critical: >50")
# Root login attempts
local root_logins=$(grep "root" /var/log/auth.log 2>/dev/null | grep "$(date +%b\ %d)" | grep -c "session opened")
local root_status="OK"
if [ "$root_logins" -gt 5 ]; then
root_status="WARNING"
fi
security_data+=("Root Logins Today|$root_status|$root_logins logins|Warning: >5")
# Firewall status
local firewall_status="UNKNOWN"
local firewall_info=""
if command -v ufw >/dev/null 2>&1; then
if ufw status | grep -q "Status: active"; then
firewall_status="OK"
firewall_info="UFW active"
else
firewall_status="WARNING"
firewall_info="UFW inactive"
fi
elif command -v firewall-cmd >/dev/null 2>&1; then
if firewall-cmd --state >/dev/null 2>&1; then
firewall_status="OK"
firewall_info="FirewallD active"
else
firewall_status="WARNING"
firewall_info="FirewallD inactive"
fi
elif command -v iptables >/dev/null 2>&1; then
local iptables_rules=$(iptables -L | wc -l)
if [ "$iptables_rules" -gt 10 ]; then
firewall_status="OK"
firewall_info="iptables configured"
else
firewall_status="WARNING"
firewall_info="iptables minimal rules"
fi
fi
security_data+=("Firewall|$firewall_status|$firewall_info|Firewall should be active")
# SELinux/AppArmor status
local mac_status="UNKNOWN"
local mac_info=""
if command -v getenforce >/dev/null 2>&1; then
local selinux_status=$(getenforce)
if [ "$selinux_status" = "Enforcing" ]; then
mac_status="OK"
mac_info="SELinux enforcing"
else
mac_status="WARNING"
mac_info="SELinux $selinux_status"
fi
elif command -v aa-status >/dev/null 2>&1; then
if aa-status --enabled >/dev/null 2>&1; then
mac_status="OK"
mac_info="AppArmor enabled"
else
mac_status="WARNING"
mac_info="AppArmor disabled"
fi
fi
security_data+=("MAC System|$mac_status|$mac_info|MAC system should be enabled")
# Check for world-writable files in critical directories
local writable_files=$(find /etc /bin /sbin /usr/bin /usr/sbin -type f -perm -002 2>/dev/null | wc -l)
local writable_status="OK"
if [ "$writable_files" -gt 0 ]; then
writable_status="WARNING"
fi
security_data+=("World-writable Files|$writable_status|$writable_files files|Should be 0")
# SSH configuration security
local ssh_status="OK"
local ssh_issues=""
if [ -f "/etc/ssh/sshd_config" ]; then
if grep -q "^PermitRootLogin yes" /etc/ssh/sshd_config; then
ssh_status="WARNING"
ssh_issues="Root login permitted"
fi
if grep -q "^PasswordAuthentication yes" /etc/ssh/sshd_config; then
if [ "$ssh_status" = "WARNING" ]; then
ssh_issues="$ssh_issues, Password auth enabled"
else
ssh_status="WARNING"
ssh_issues="Password auth enabled"
fi
fi
if [ "$ssh_status" = "OK" ]; then
ssh_issues="Configuration secure"
fi
else
ssh_status="WARNING"
ssh_issues="SSH config not found"
fi
security_data+=("SSH Configuration|$ssh_status|$ssh_issues|Should be secure")
add_table_to_report "$report_file" "Security Status" "${security_data[@]}"
}
`
Alert Systems
Email Notification System
`bash
#!/bin/bash
email_notifications.sh
Email notification system for health check alerts
send_email_notification() { local subject="$1" local body="$2" local priority="$3" local attachment="$4" if [ "$EMAIL_ENABLED" != "true" ]; then log_message "INFO" "Email notifications disabled" return 0 fi local temp_email_file=$(mktemp) local email_subject="[Health Check] $subject - $(hostname)" # Create email content cat > "$temp_email_file" << EOF From: $EMAIL_USERNAME To: $EMAIL_RECIPIENTS Subject: $email_subject Content-Type: text/html; charset=UTF-8
System Health Alert
Host: $(hostname)
Time: $(date)
Priority: ${priority}
`Webhook Notifications
`bash
#!/bin/bash
webhook_notifications.sh
Webhook notification system for external integrations
send_slack_notification() { local message="$1" local priority="$2" local webhook_url="$SLACK_WEBHOOK_URL" if [ -z "$webhook_url" ]; then log_message "INFO" "Slack webhook URL not configured" return 0 fi local color="good" case "$priority" in "CRITICAL") color="danger" ;; "WARNING") color="warning" ;; "OK") color="good" ;; esac local payload=$(cat << EOF { "username": "Health Check Bot", "icon_emoji": ":robot_face:", "attachments": [ { "color": "$color", "title": "System Health Alert - $(hostname)", "text": "$message", "fields": [ { "title": "Host", "value": "$(hostname)", "short": true }, { "title": "Time", "value": "$(date)", "short": true }, { "title": "Priority", "value": "$priority", "short": true } ], "footer": "System Health Monitor", "ts": $(date +%s) } ] } EOF ) if curl -X POST -H 'Content-type: application/json' --data "$payload" "$webhook_url" >/dev/null 2>&1; then log_message "INFO" "Slack notification sent successfully" return 0 else log_message "ERROR" "Failed to send Slack notification" return 1 fi }
send_discord_notification() {
local message="$1"
local priority="$2"
local webhook_url="$DISCORD_WEBHOOK_URL"
if [ -z "$webhook_url" ]; then
log_message "INFO" "Discord webhook URL not configured"
return 0
fi
local color=65280 # Green
case "$priority" in
"CRITICAL") color=16711680 ;; # Red
"WARNING") color=16776960 ;; # Yellow
"OK") color=65280 ;; # Green
esac
local payload=$(cat << EOF
{
"username": "Health Check Bot",
"avatar_url": "https://cdn.discordapp.com/embed/avatars/0.png",
"embeds": [
{
"title": "System Health Alert",
"description": "$message",
"color": $color,
"fields": [
{
"name": "Host",
"value": "$(hostname)",
"inline": true
},
{
"name": "Priority",
"value": "$priority",
"inline": true
},
{
"name": "Time",
"value": "$(date)",
"inline": false
}
],
"footer": {
"text": "System Health Monitor"
},
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
}
]
}
EOF
)
if curl -H "Content-Type: application/json" -X POST -d "$payload" "$webhook_url" >/dev/null 2>&1; then
log_message "INFO" "Discord notification sent successfully"
return 0
else
log_message "ERROR" "Failed to send Discord notification"
return 1
fi
}
`
Best Practices
Performance Optimization
| Practice | Description | Implementation | |----------|-------------|----------------| | Parallel Execution | Run independent checks simultaneously | Use background processes with job control | | Caching | Cache frequently accessed data | Store results in temporary files with timestamps | | Efficient Parsing | Use optimized text processing | Prefer awk/sed over multiple grep/cut operations | | Resource Limits | Set limits on script resource usage | Use ulimit and timeout commands | | Incremental Checks | Only check what has changed | Compare timestamps and checksums |
Error Handling and Logging
`bash
#!/bin/bash
error_handling.sh
Comprehensive error handling and logging
Global error handling
set -eE # Exit on error and inherit ERR trap set -u # Exit on undefined variable set -o pipefail # Exit on pipe failureError trap function
error_trap() { local exit_code=$? local line_number=$1 local bash_lineno=$2 local last_command="$3" local func_stack=("${FUNCNAME[@]}") log_message "ERROR" "Command failed with exit code $exit_code" log_message "ERROR" "Line $line_number: $last_command" log_message "ERROR" "Function stack: ${func_stack[*]}" # Send critical alert send_email_notification "Script Error" "Health check script encountered an error on line $line_number" "CRITICAL" # Cleanup temporary files cleanup_temp_files exit "$exit_code" }Set error trap
trap 'error_trap $LINENO $BASH_LINENO "$BASH_COMMAND"' ERRCleanup function
cleanup_temp_files() { if [ -n "${TEMP_FILES:-}" ]; then for temp_file in $TEMP_FILES; do if [ -f "$temp_file" ]; then rm -f "$temp_file" log_message "DEBUG" "Cleaned up temporary file: $temp_file" fi done fi }Exit trap for cleanup
trap cleanup_temp_files EXITSafe command execution
safe_execute() { local command="$1" local description="$2" local max_retries="${3:-1}" local retry_delay="${4:-1}" local retry_count=0 while [ "$retry_count" -lt "$max_retries" ]; do if eval "$command"; then log_message "DEBUG" "Successfully executed: $description" return 0 else local exit_code=$? retry_count=$((retry_count + 1)) if [ "$retry_count" -lt "$max_retries" ]; then log_message "WARNING" "Command failed (attempt $retry_count/$max_retries): $description" sleep "$retry_delay" else log_message "ERROR" "Command failed after $max_retries attempts: $description" return "$exit_code" fi fi done }`Security Considerations
| Security Aspect | Recommendation | Implementation | |------------------|----------------|----------------| | File Permissions | Restrict access to scripts and configs | chmod 750 for scripts, 640 for configs | | Credential Management | Use secure credential storage | Environment variables or encrypted files | | Log Sanitization | Remove sensitive data from logs | Filter passwords and keys before logging | | Input Validation | Validate all external inputs | Use parameter expansion and regex validation | | Privilege Separation | Run with minimal required privileges | Use sudo only when necessary |
Troubleshooting
Common Issues and Solutions
| Issue | Symptoms | Solution | |-------|----------|----------| | Permission Denied | Script fails to access system files | Run with appropriate privileges or adjust file permissions | | Command Not Found | Missing utilities cause script failures | Install required packages or add availability checks | | Network Timeouts | External connectivity checks fail | Increase timeout values or check network configuration | | Memory Issues | Script consumes excessive memory | Optimize data processing and use streaming where possible | | Lock File Conflicts | Multiple instances running simultaneously | Implement proper locking mechanism |
Debug Mode Implementation
`bash
#!/bin/bash
debug_mode.sh
Debug and troubleshooting utilities
Enable debug mode
enable_debug_mode() { export DEBUG_MODE=true export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }' set -x # Enable command tracing log_message "INFO" "Debug mode enabled" }Debug logging function
debug_log() { local message="$1" if [ "$DEBUG_MODE" = "true" ]; then echo "[DEBUG] $(date '+%Y-%m-%d %H:%M:%S') $message" >&2 fi }Performance timing
time_function() { local func_name="$1" shift local start_time=$(date +%s.%N) debug_log "Starting function: $func_name" "$@" local exit_code=$? local end_time=$(date +%s.%N) local duration=$(echo "$end_time - $start_time" | bc) debug_log "Function $func_name completed in ${duration}s with exit code $exit_code" return "$exit_code" }System state snapshot
capture_system_state() { local output_dir="$1" mkdir -p "$output_dir" # Capture various system states for debugging ps aux > "$output_dir/processes.txt" free -m > "$output_dir/memory.txt" df -h > "$output_dir/disk.txt" netstat -tuln > "$output_dir/network.txt" dmesg | tail -100 > "$output_dir/dmesg.txt" log_message "INFO" "System state captured to $output_dir" }`This comprehensive guide provides a complete framework for implementing system health check scripts. The modular design allows for easy customization and extension based on specific requirements. Regular monitoring and maintenance of these scripts ensure reliable system oversight and proactive issue resolution.