System Health Check Scripts: Complete Implementation Guide

Learn to build automated system health monitoring scripts with real-time insights, alerting, and performance tracking for reliable infrastructure.

System Health Check Scripts: Complete Guide and Implementation

Table of Contents

1. [Introduction](#introduction) 2. [Prerequisites](#prerequisites) 3. [Core Components](#core-components) 4. [Script Architecture](#script-architecture) 5. [Implementation Examples](#implementation-examples) 6. [Configuration Management](#configuration-management) 7. [Monitoring Categories](#monitoring-categories) 8. [Alert Systems](#alert-systems) 9. [Best Practices](#best-practices) 10. [Troubleshooting](#troubleshooting)

Introduction

System health check scripts are automated tools designed to monitor, assess, and report on the operational status of computer systems. These scripts provide real-time insights into system performance, resource utilization, service availability, and potential issues that may affect system reliability.

The primary objectives of system health check scripts include: - Proactive monitoring of system resources - Early detection of performance bottlenecks - Automated alerting for critical issues - Historical data collection for trend analysis - Compliance verification for system standards - Automated remediation of common problems

Prerequisites

Before implementing system health check scripts, ensure the following requirements are met:

System Requirements

| Component | Minimum Requirement | Recommended | |-----------|-------------------|-------------| | Operating System | Linux/Unix-based | CentOS 7+, Ubuntu 18.04+ | | Shell | Bash 4.0+ | Bash 5.0+ | | Memory | 512 MB | 1 GB+ | | Disk Space | 100 MB | 500 MB+ | | Network | Basic connectivity | High-speed connection |

Required Tools and Utilities

| Tool | Purpose | Installation Command | |------|---------|---------------------| | awk | Text processing | Usually pre-installed | | sed | Stream editing | Usually pre-installed | | grep | Pattern matching | Usually pre-installed | | curl | HTTP requests | yum install curl or apt-get install curl | | jq | JSON processing | yum install jq or apt-get install jq | | mail | Email notifications | yum install mailx or apt-get install mailutils |

Permissions and Access

- Root or sudo access for system-level monitoring - Read access to log files and system directories - Network access for external service checks - Write permissions for log and report generation

Core Components

1. Data Collection Module

The data collection module gathers information from various system sources:

`bash #!/bin/bash

Data Collection Functions

collect_system_info() { local output_file="$1" { echo "SYSTEM_INFO_START" echo "Hostname: $(hostname)" echo "Uptime: $(uptime)" echo "Date: $(date)" echo "Kernel: $(uname -r)" echo "Architecture: $(uname -m)" echo "SYSTEM_INFO_END" } >> "$output_file" }

collect_cpu_info() { local output_file="$1" { echo "CPU_INFO_START" echo "CPU Usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)" echo "Load Average: $(uptime | awk -F'load average:' '{print $2}')" echo "CPU Cores: $(nproc)" echo "CPU_INFO_END" } >> "$output_file" }

collect_memory_info() { local output_file="$1" { echo "MEMORY_INFO_START" free -m | while read line; do echo "Memory: $line" done echo "MEMORY_INFO_END" } >> "$output_file" } `

2. Analysis Engine

The analysis engine processes collected data and determines system health status:

`bash #!/bin/bash

Analysis Functions

analyze_cpu_usage() { local cpu_usage="$1" local threshold_warning=70 local threshold_critical=90 if (( $(echo "$cpu_usage > $threshold_critical" | bc -l) )); then echo "CRITICAL" elif (( $(echo "$cpu_usage > $threshold_warning" | bc -l) )); then echo "WARNING" else echo "OK" fi }

analyze_memory_usage() { local total_mem="$1" local used_mem="$2" local usage_percent=$(echo "scale=2; ($used_mem / $total_mem) * 100" | bc) local threshold_warning=80 local threshold_critical=95 if (( $(echo "$usage_percent > $threshold_critical" | bc -l) )); then echo "CRITICAL" elif (( $(echo "$usage_percent > $threshold_warning" | bc -l) )); then echo "WARNING" else echo "OK" fi }

analyze_disk_usage() { local filesystem="$1" local usage_percent=$(df -h "$filesystem" | awk 'NR==2 {print $5}' | sed 's/%//') local threshold_warning=80 local threshold_critical=90 if [ "$usage_percent" -gt "$threshold_critical" ]; then echo "CRITICAL" elif [ "$usage_percent" -gt "$threshold_warning" ]; then echo "WARNING" else echo "OK" fi } `

3. Reporting Module

The reporting module formats and presents health check results:

`bash #!/bin/bash

Reporting Functions

generate_html_report() { local report_file="$1" local timestamp="$2" cat > "$report_file" << EOF

System Health Report

Generated on: $timestamp

EOF }

add_table_to_report() { local report_file="$1" local table_title="$2" shift 2 local -a table_data=("$@") cat >> "$report_file" << EOF

$table_title

EOF for row in "${table_data[@]}"; do IFS='|' read -r component status value threshold <<< "$row" cat >> "$report_file" << EOF EOF done echo "
Component Status Value Threshold
$component $status $value $threshold
" >> "$report_file" } `

Script Architecture

Master Health Check Script

`bash #!/bin/bash

master_health_check.sh

Main system health check orchestrator

Configuration

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" CONFIG_FILE="$SCRIPT_DIR/health_check.conf" LOG_DIR="/var/log/health_check" REPORT_DIR="/var/reports/health_check" TIMESTAMP=$(date +"%Y%m%d_%H%M%S")

Source configuration

if [ -f "$CONFIG_FILE" ]; then source "$CONFIG_FILE" else echo "Configuration file not found: $CONFIG_FILE" exit 1 fi

Create necessary directories

mkdir -p "$LOG_DIR" "$REPORT_DIR"

Logging function

log_message() { local level="$1" local message="$2" echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $message" | tee -a "$LOG_DIR/health_check_$TIMESTAMP.log" }

Main execution function

main() { log_message "INFO" "Starting system health check" # Initialize report local report_file="$REPORT_DIR/health_report_$TIMESTAMP.html" generate_html_report "$report_file" "$(date)" # Perform health checks check_system_resources "$report_file" check_services "$report_file" check_network_connectivity "$report_file" check_disk_health "$report_file" check_security_status "$report_file" # Finalize report finalize_html_report "$report_file" # Send notifications if needed process_alerts "$report_file" log_message "INFO" "Health check completed successfully" }

Execute main function

main "$@" `

System Resource Monitoring

`bash #!/bin/bash

system_resources.sh

Detailed system resource monitoring

check_system_resources() { local report_file="$1" local -a resource_data=() log_message "INFO" "Checking system resources" # CPU Check local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1) local cpu_status=$(analyze_cpu_usage "$cpu_usage") resource_data+=("CPU Usage|$cpu_status|${cpu_usage}%|Warning: 70%, Critical: 90%") # Memory Check local mem_info=$(free -m | grep "^Mem:") local total_mem=$(echo "$mem_info" | awk '{print $2}') local used_mem=$(echo "$mem_info" | awk '{print $3}') local mem_usage_percent=$(echo "scale=1; ($used_mem / $total_mem) * 100" | bc) local mem_status=$(analyze_memory_usage "$total_mem" "$used_mem") resource_data+=("Memory Usage|$mem_status|${mem_usage_percent}%|Warning: 80%, Critical: 95%") # Load Average Check local load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//') local cpu_cores=$(nproc) local load_per_core=$(echo "scale=2; $load_avg / $cpu_cores" | bc) local load_status="OK" if (( $(echo "$load_per_core > 1.5" | bc -l) )); then load_status="CRITICAL" elif (( $(echo "$load_per_core > 1.0" | bc -l) )); then load_status="WARNING" fi resource_data+=("Load Average|$load_status|$load_avg (${load_per_core}/core)|Warning: 1.0/core, Critical: 1.5/core") # Disk Usage Check while IFS= read -r line; do local filesystem=$(echo "$line" | awk '{print $1}') local usage_percent=$(echo "$line" | awk '{print $5}' | sed 's/%//') local mount_point=$(echo "$line" | awk '{print $6}') # Skip special filesystems if [[ "$filesystem" =~ ^/dev/(sd|hd|nvme|xvd) ]]; then local disk_status=$(analyze_disk_usage "$mount_point") resource_data+=("Disk $mount_point|$disk_status|${usage_percent}%|Warning: 80%, Critical: 90%") fi done < <(df -h | grep -E '^/dev/') # Add resource data to report add_table_to_report "$report_file" "System Resources" "${resource_data[@]}" } `

Service Monitoring

`bash #!/bin/bash

service_monitoring.sh

Service availability and health monitoring

check_services() { local report_file="$1" local -a service_data=() log_message "INFO" "Checking services" # Define critical services to monitor local -a critical_services=("sshd" "httpd" "nginx" "mysqld" "postgresql" "docker") for service in "${critical_services[@]}"; do if systemctl list-unit-files | grep -q "^$service.service"; then local service_status="UNKNOWN" local service_info="" if systemctl is-active --quiet "$service"; then service_status="OK" local uptime=$(systemctl show "$service" --property=ActiveEnterTimestamp | cut -d= -f2) service_info="Active since $uptime" else service_status="CRITICAL" service_info="Service not running" fi service_data+=("$service|$service_status|$service_info|Service should be active") fi done # Check custom application services if [ -n "$CUSTOM_SERVICES" ]; then IFS=',' read -ra CUSTOM_SERVICE_ARRAY <<< "$CUSTOM_SERVICES" for service in "${CUSTOM_SERVICE_ARRAY[@]}"; do local service_status="UNKNOWN" local service_info="" if systemctl is-active --quiet "$service"; then service_status="OK" service_info="Running" else service_status="CRITICAL" service_info="Not running" fi service_data+=("$service (custom)|$service_status|$service_info|Custom service should be active") done fi # Port connectivity checks if [ -n "$MONITOR_PORTS" ]; then IFS=',' read -ra PORT_ARRAY <<< "$MONITOR_PORTS" for port_info in "${PORT_ARRAY[@]}"; do IFS=':' read -r host port <<< "$port_info" local port_status="UNKNOWN" local port_info_msg="" if timeout 5 bash -c "/dev/null; then port_status="OK" port_info_msg="Port accessible" else port_status="CRITICAL" port_info_msg="Port not accessible" fi service_data+=("Port $host:$port|$port_status|$port_info_msg|Port should be accessible") done fi add_table_to_report "$report_file" "Services and Connectivity" "${service_data[@]}" } `

Configuration Management

Configuration File Structure

`bash

health_check.conf

System Health Check Configuration

Email Configuration

EMAIL_ENABLED=true EMAIL_RECIPIENTS="admin@example.com,monitoring@example.com" EMAIL_SMTP_SERVER="smtp.example.com" EMAIL_SMTP_PORT=587 EMAIL_USERNAME="healthcheck@example.com" EMAIL_PASSWORD="secure_password"

Alert Thresholds

CPU_WARNING_THRESHOLD=70 CPU_CRITICAL_THRESHOLD=90 MEMORY_WARNING_THRESHOLD=80 MEMORY_CRITICAL_THRESHOLD=95 DISK_WARNING_THRESHOLD=80 DISK_CRITICAL_THRESHOLD=90 LOAD_WARNING_THRESHOLD=1.0 LOAD_CRITICAL_THRESHOLD=1.5

Services to Monitor

CRITICAL_SERVICES="sshd,httpd,mysqld" CUSTOM_SERVICES="myapp,background-worker"

Network Monitoring

MONITOR_PORTS="localhost:80,localhost:443,database.local:3306" EXTERNAL_HOSTS="google.com,github.com"

Disk Monitoring

MONITOR_FILESYSTEMS="/,/home,/var,/tmp" EXCLUDE_FILESYSTEMS="/dev,/proc,/sys"

Log Monitoring

LOG_FILES="/var/log/messages,/var/log/secure,/var/log/httpd/error_log" LOG_ERROR_PATTERNS="ERROR,CRITICAL,FATAL,Out of memory"

Report Configuration

REPORT_FORMAT="html,json,plain" REPORT_RETENTION_DAYS=30 REPORT_COMPRESSION=true

Notification Configuration

SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/YOUR/DISCORD/WEBHOOK"

Advanced Options

ENABLE_PERFORMANCE_METRICS=true ENABLE_SECURITY_CHECKS=true ENABLE_LOG_ANALYSIS=true PARALLEL_EXECUTION=true MAX_PARALLEL_JOBS=4 `

Dynamic Configuration Loading

`bash #!/bin/bash

config_manager.sh

Configuration management functions

load_configuration() { local config_file="$1" if [ ! -f "$config_file" ]; then log_message "ERROR" "Configuration file not found: $config_file" return 1 fi # Validate configuration file syntax if ! bash -n "$config_file"; then log_message "ERROR" "Configuration file has syntax errors: $config_file" return 1 fi # Source configuration source "$config_file" # Set default values for missing configurations set_default_values # Validate configuration values validate_configuration log_message "INFO" "Configuration loaded successfully from $config_file" }

set_default_values() { # Set defaults if not specified CPU_WARNING_THRESHOLD=${CPU_WARNING_THRESHOLD:-70} CPU_CRITICAL_THRESHOLD=${CPU_CRITICAL_THRESHOLD:-90} MEMORY_WARNING_THRESHOLD=${MEMORY_WARNING_THRESHOLD:-80} MEMORY_CRITICAL_THRESHOLD=${MEMORY_CRITICAL_THRESHOLD:-95} DISK_WARNING_THRESHOLD=${DISK_WARNING_THRESHOLD:-80} DISK_CRITICAL_THRESHOLD=${DISK_CRITICAL_THRESHOLD:-90} EMAIL_ENABLED=${EMAIL_ENABLED:-false} REPORT_FORMAT=${REPORT_FORMAT:-"html"} REPORT_RETENTION_DAYS=${REPORT_RETENTION_DAYS:-7} PARALLEL_EXECUTION=${PARALLEL_EXECUTION:-false} MAX_PARALLEL_JOBS=${MAX_PARALLEL_JOBS:-2} }

validate_configuration() { local validation_errors=0 # Validate threshold values if [ "$CPU_WARNING_THRESHOLD" -ge "$CPU_CRITICAL_THRESHOLD" ]; then log_message "ERROR" "CPU warning threshold must be less than critical threshold" ((validation_errors++)) fi if [ "$MEMORY_WARNING_THRESHOLD" -ge "$MEMORY_CRITICAL_THRESHOLD" ]; then log_message "ERROR" "Memory warning threshold must be less than critical threshold" ((validation_errors++)) fi # Validate email configuration if [ "$EMAIL_ENABLED" = "true" ]; then if [ -z "$EMAIL_RECIPIENTS" ]; then log_message "ERROR" "Email recipients not specified" ((validation_errors++)) fi if [ -z "$EMAIL_SMTP_SERVER" ]; then log_message "ERROR" "SMTP server not specified" ((validation_errors++)) fi fi # Validate numeric values if ! [[ "$REPORT_RETENTION_DAYS" =~ ^[0-9]+$ ]]; then log_message "ERROR" "Report retention days must be a number" ((validation_errors++)) fi if [ "$validation_errors" -gt 0 ]; then log_message "ERROR" "Configuration validation failed with $validation_errors errors" return 1 fi log_message "INFO" "Configuration validation successful" return 0 } `

Monitoring Categories

Network Connectivity Monitoring

`bash #!/bin/bash

network_monitoring.sh

Network connectivity and performance monitoring

check_network_connectivity() { local report_file="$1" local -a network_data=() log_message "INFO" "Checking network connectivity" # Internal network interface checks while IFS= read -r interface; do local interface_name=$(echo "$interface" | awk '{print $1}') local interface_status="OK" local interface_info="" if ip link show "$interface_name" | grep -q "state UP"; then local ip_address=$(ip addr show "$interface_name" | grep "inet " | awk '{print $2}' | head -1) interface_info="UP - IP: $ip_address" else interface_status="WARNING" interface_info="Interface DOWN" fi network_data+=("Interface $interface_name|$interface_status|$interface_info|Interface should be UP") done < <(ip link show | grep -E "^[0-9]+:" | grep -v "lo:" | awk -F': ' '{print $2}') # External connectivity checks if [ -n "$EXTERNAL_HOSTS" ]; then IFS=',' read -ra HOST_ARRAY <<< "$EXTERNAL_HOSTS" for host in "${HOST_ARRAY[@]}"; do local ping_status="UNKNOWN" local ping_info="" if ping -c 3 -W 5 "$host" >/dev/null 2>&1; then local ping_time=$(ping -c 1 -W 5 "$host" 2>/dev/null | grep "time=" | awk -F'time=' '{print $2}' | awk '{print $1}') ping_status="OK" ping_info="Reachable - RTT: ${ping_time}ms" else ping_status="CRITICAL" ping_info="Unreachable" fi network_data+=("External Host $host|$ping_status|$ping_info|Host should be reachable") done fi # DNS resolution checks local dns_servers=("8.8.8.8" "1.1.1.1") for dns_server in "${dns_servers[@]}"; do local dns_status="UNKNOWN" local dns_info="" if nslookup google.com "$dns_server" >/dev/null 2>&1; then dns_status="OK" dns_info="DNS resolution working" else dns_status="WARNING" dns_info="DNS resolution failed" fi network_data+=("DNS Server $dns_server|$dns_status|$dns_info|DNS should resolve queries") done # Network statistics local rx_errors=$(cat /proc/net/dev | grep -E "(eth|ens|enp)" | awk '{sum+=$4} END {print sum+0}') local tx_errors=$(cat /proc/net/dev | grep -E "(eth|ens|enp)" | awk '{sum+=$12} END {print sum+0}') local error_status="OK" if [ "$rx_errors" -gt 100 ] || [ "$tx_errors" -gt 100 ]; then error_status="WARNING" fi network_data+=("Network Errors|$error_status|RX: $rx_errors, TX: $tx_errors|Errors should be minimal") add_table_to_report "$report_file" "Network Connectivity" "${network_data[@]}" } `

Security Status Monitoring

`bash #!/bin/bash

security_monitoring.sh

Security-related system checks

check_security_status() { local report_file="$1" local -a security_data=() log_message "INFO" "Checking security status" # Failed login attempts local failed_logins=$(grep "Failed password" /var/log/auth.log 2>/dev/null | grep "$(date +%b\ %d)" | wc -l) local login_status="OK" if [ "$failed_logins" -gt 50 ]; then login_status="CRITICAL" elif [ "$failed_logins" -gt 10 ]; then login_status="WARNING" fi security_data+=("Failed Logins Today|$login_status|$failed_logins attempts|Warning: >10, Critical: >50") # Root login attempts local root_logins=$(grep "root" /var/log/auth.log 2>/dev/null | grep "$(date +%b\ %d)" | grep -c "session opened") local root_status="OK" if [ "$root_logins" -gt 5 ]; then root_status="WARNING" fi security_data+=("Root Logins Today|$root_status|$root_logins logins|Warning: >5") # Firewall status local firewall_status="UNKNOWN" local firewall_info="" if command -v ufw >/dev/null 2>&1; then if ufw status | grep -q "Status: active"; then firewall_status="OK" firewall_info="UFW active" else firewall_status="WARNING" firewall_info="UFW inactive" fi elif command -v firewall-cmd >/dev/null 2>&1; then if firewall-cmd --state >/dev/null 2>&1; then firewall_status="OK" firewall_info="FirewallD active" else firewall_status="WARNING" firewall_info="FirewallD inactive" fi elif command -v iptables >/dev/null 2>&1; then local iptables_rules=$(iptables -L | wc -l) if [ "$iptables_rules" -gt 10 ]; then firewall_status="OK" firewall_info="iptables configured" else firewall_status="WARNING" firewall_info="iptables minimal rules" fi fi security_data+=("Firewall|$firewall_status|$firewall_info|Firewall should be active") # SELinux/AppArmor status local mac_status="UNKNOWN" local mac_info="" if command -v getenforce >/dev/null 2>&1; then local selinux_status=$(getenforce) if [ "$selinux_status" = "Enforcing" ]; then mac_status="OK" mac_info="SELinux enforcing" else mac_status="WARNING" mac_info="SELinux $selinux_status" fi elif command -v aa-status >/dev/null 2>&1; then if aa-status --enabled >/dev/null 2>&1; then mac_status="OK" mac_info="AppArmor enabled" else mac_status="WARNING" mac_info="AppArmor disabled" fi fi security_data+=("MAC System|$mac_status|$mac_info|MAC system should be enabled") # Check for world-writable files in critical directories local writable_files=$(find /etc /bin /sbin /usr/bin /usr/sbin -type f -perm -002 2>/dev/null | wc -l) local writable_status="OK" if [ "$writable_files" -gt 0 ]; then writable_status="WARNING" fi security_data+=("World-writable Files|$writable_status|$writable_files files|Should be 0") # SSH configuration security local ssh_status="OK" local ssh_issues="" if [ -f "/etc/ssh/sshd_config" ]; then if grep -q "^PermitRootLogin yes" /etc/ssh/sshd_config; then ssh_status="WARNING" ssh_issues="Root login permitted" fi if grep -q "^PasswordAuthentication yes" /etc/ssh/sshd_config; then if [ "$ssh_status" = "WARNING" ]; then ssh_issues="$ssh_issues, Password auth enabled" else ssh_status="WARNING" ssh_issues="Password auth enabled" fi fi if [ "$ssh_status" = "OK" ]; then ssh_issues="Configuration secure" fi else ssh_status="WARNING" ssh_issues="SSH config not found" fi security_data+=("SSH Configuration|$ssh_status|$ssh_issues|Should be secure") add_table_to_report "$report_file" "Security Status" "${security_data[@]}" } `

Alert Systems

Email Notification System

`bash #!/bin/bash

email_notifications.sh

Email notification system for health check alerts

send_email_notification() { local subject="$1" local body="$2" local priority="$3" local attachment="$4" if [ "$EMAIL_ENABLED" != "true" ]; then log_message "INFO" "Email notifications disabled" return 0 fi local temp_email_file=$(mktemp) local email_subject="[Health Check] $subject - $(hostname)" # Create email content cat > "$temp_email_file" << EOF From: $EMAIL_USERNAME To: $EMAIL_RECIPIENTS Subject: $email_subject Content-Type: text/html; charset=UTF-8

System Health Alert

Host: $(hostname)

Time: $(date)

Priority: ${priority}

$body
EOF # Send email using different methods if command -v sendmail >/dev/null 2>&1; then sendmail "$EMAIL_RECIPIENTS" < "$temp_email_file" local send_result=$? elif command -v mail >/dev/null 2>&1; then mail -s "$email_subject" "$EMAIL_RECIPIENTS" < "$temp_email_file" local send_result=$? elif command -v mutt >/dev/null 2>&1; then if [ -n "$attachment" ] && [ -f "$attachment" ]; then mutt -s "$email_subject" -a "$attachment" -- "$EMAIL_RECIPIENTS" < "$temp_email_file" else mutt -s "$email_subject" "$EMAIL_RECIPIENTS" < "$temp_email_file" fi local send_result=$? else log_message "ERROR" "No email client available (sendmail, mail, or mutt)" rm -f "$temp_email_file" return 1 fi if [ "$send_result" -eq 0 ]; then log_message "INFO" "Email notification sent successfully" else log_message "ERROR" "Failed to send email notification" fi rm -f "$temp_email_file" return "$send_result" } `

Webhook Notifications

`bash #!/bin/bash

webhook_notifications.sh

Webhook notification system for external integrations

send_slack_notification() { local message="$1" local priority="$2" local webhook_url="$SLACK_WEBHOOK_URL" if [ -z "$webhook_url" ]; then log_message "INFO" "Slack webhook URL not configured" return 0 fi local color="good" case "$priority" in "CRITICAL") color="danger" ;; "WARNING") color="warning" ;; "OK") color="good" ;; esac local payload=$(cat << EOF { "username": "Health Check Bot", "icon_emoji": ":robot_face:", "attachments": [ { "color": "$color", "title": "System Health Alert - $(hostname)", "text": "$message", "fields": [ { "title": "Host", "value": "$(hostname)", "short": true }, { "title": "Time", "value": "$(date)", "short": true }, { "title": "Priority", "value": "$priority", "short": true } ], "footer": "System Health Monitor", "ts": $(date +%s) } ] } EOF ) if curl -X POST -H 'Content-type: application/json' --data "$payload" "$webhook_url" >/dev/null 2>&1; then log_message "INFO" "Slack notification sent successfully" return 0 else log_message "ERROR" "Failed to send Slack notification" return 1 fi }

send_discord_notification() { local message="$1" local priority="$2" local webhook_url="$DISCORD_WEBHOOK_URL" if [ -z "$webhook_url" ]; then log_message "INFO" "Discord webhook URL not configured" return 0 fi local color=65280 # Green case "$priority" in "CRITICAL") color=16711680 ;; # Red "WARNING") color=16776960 ;; # Yellow "OK") color=65280 ;; # Green esac local payload=$(cat << EOF { "username": "Health Check Bot", "avatar_url": "https://cdn.discordapp.com/embed/avatars/0.png", "embeds": [ { "title": "System Health Alert", "description": "$message", "color": $color, "fields": [ { "name": "Host", "value": "$(hostname)", "inline": true }, { "name": "Priority", "value": "$priority", "inline": true }, { "name": "Time", "value": "$(date)", "inline": false } ], "footer": { "text": "System Health Monitor" }, "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%S.000Z)" } ] } EOF ) if curl -H "Content-Type: application/json" -X POST -d "$payload" "$webhook_url" >/dev/null 2>&1; then log_message "INFO" "Discord notification sent successfully" return 0 else log_message "ERROR" "Failed to send Discord notification" return 1 fi } `

Best Practices

Performance Optimization

| Practice | Description | Implementation | |----------|-------------|----------------| | Parallel Execution | Run independent checks simultaneously | Use background processes with job control | | Caching | Cache frequently accessed data | Store results in temporary files with timestamps | | Efficient Parsing | Use optimized text processing | Prefer awk/sed over multiple grep/cut operations | | Resource Limits | Set limits on script resource usage | Use ulimit and timeout commands | | Incremental Checks | Only check what has changed | Compare timestamps and checksums |

Error Handling and Logging

`bash #!/bin/bash

error_handling.sh

Comprehensive error handling and logging

Global error handling

set -eE # Exit on error and inherit ERR trap set -u # Exit on undefined variable set -o pipefail # Exit on pipe failure

Error trap function

error_trap() { local exit_code=$? local line_number=$1 local bash_lineno=$2 local last_command="$3" local func_stack=("${FUNCNAME[@]}") log_message "ERROR" "Command failed with exit code $exit_code" log_message "ERROR" "Line $line_number: $last_command" log_message "ERROR" "Function stack: ${func_stack[*]}" # Send critical alert send_email_notification "Script Error" "Health check script encountered an error on line $line_number" "CRITICAL" # Cleanup temporary files cleanup_temp_files exit "$exit_code" }

Set error trap

trap 'error_trap $LINENO $BASH_LINENO "$BASH_COMMAND"' ERR

Cleanup function

cleanup_temp_files() { if [ -n "${TEMP_FILES:-}" ]; then for temp_file in $TEMP_FILES; do if [ -f "$temp_file" ]; then rm -f "$temp_file" log_message "DEBUG" "Cleaned up temporary file: $temp_file" fi done fi }

Exit trap for cleanup

trap cleanup_temp_files EXIT

Safe command execution

safe_execute() { local command="$1" local description="$2" local max_retries="${3:-1}" local retry_delay="${4:-1}" local retry_count=0 while [ "$retry_count" -lt "$max_retries" ]; do if eval "$command"; then log_message "DEBUG" "Successfully executed: $description" return 0 else local exit_code=$? retry_count=$((retry_count + 1)) if [ "$retry_count" -lt "$max_retries" ]; then log_message "WARNING" "Command failed (attempt $retry_count/$max_retries): $description" sleep "$retry_delay" else log_message "ERROR" "Command failed after $max_retries attempts: $description" return "$exit_code" fi fi done } `

Security Considerations

| Security Aspect | Recommendation | Implementation | |------------------|----------------|----------------| | File Permissions | Restrict access to scripts and configs | chmod 750 for scripts, 640 for configs | | Credential Management | Use secure credential storage | Environment variables or encrypted files | | Log Sanitization | Remove sensitive data from logs | Filter passwords and keys before logging | | Input Validation | Validate all external inputs | Use parameter expansion and regex validation | | Privilege Separation | Run with minimal required privileges | Use sudo only when necessary |

Troubleshooting

Common Issues and Solutions

| Issue | Symptoms | Solution | |-------|----------|----------| | Permission Denied | Script fails to access system files | Run with appropriate privileges or adjust file permissions | | Command Not Found | Missing utilities cause script failures | Install required packages or add availability checks | | Network Timeouts | External connectivity checks fail | Increase timeout values or check network configuration | | Memory Issues | Script consumes excessive memory | Optimize data processing and use streaming where possible | | Lock File Conflicts | Multiple instances running simultaneously | Implement proper locking mechanism |

Debug Mode Implementation

`bash #!/bin/bash

debug_mode.sh

Debug and troubleshooting utilities

Enable debug mode

enable_debug_mode() { export DEBUG_MODE=true export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }' set -x # Enable command tracing log_message "INFO" "Debug mode enabled" }

Debug logging function

debug_log() { local message="$1" if [ "$DEBUG_MODE" = "true" ]; then echo "[DEBUG] $(date '+%Y-%m-%d %H:%M:%S') $message" >&2 fi }

Performance timing

time_function() { local func_name="$1" shift local start_time=$(date +%s.%N) debug_log "Starting function: $func_name" "$@" local exit_code=$? local end_time=$(date +%s.%N) local duration=$(echo "$end_time - $start_time" | bc) debug_log "Function $func_name completed in ${duration}s with exit code $exit_code" return "$exit_code" }

System state snapshot

capture_system_state() { local output_dir="$1" mkdir -p "$output_dir" # Capture various system states for debugging ps aux > "$output_dir/processes.txt" free -m > "$output_dir/memory.txt" df -h > "$output_dir/disk.txt" netstat -tuln > "$output_dir/network.txt" dmesg | tail -100 > "$output_dir/dmesg.txt" log_message "INFO" "System state captured to $output_dir" } `

This comprehensive guide provides a complete framework for implementing system health check scripts. The modular design allows for easy customization and extension based on specific requirements. Regular monitoring and maintenance of these scripts ensure reliable system oversight and proactive issue resolution.

Tags

  • Automation
  • bash scripting
  • infrastructure
  • monitoring
  • system-administration

Related Articles

Popular Technical Articles & Tutorials

Explore our comprehensive collection of technical articles, programming tutorials, and IT guides written by industry experts:

Browse all 8+ technical articles | Read our IT blog

System Health Check Scripts: Complete Implementation Guide