# Integrating Email Alerts with System Monitoring
## Table of Contents

1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Email Configuration Methods](#email-configuration-methods)
4. [Monitoring Tools Integration](#monitoring-tools-integration)
5. [Alert Configuration](#alert-configuration)
6. [Script-Based Monitoring](#script-based-monitoring)
7. [Advanced Configuration](#advanced-configuration)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
## Introduction
Email alerts in system monitoring provide immediate notification when critical system events occur, performance thresholds are exceeded, or services become unavailable. This integration ensures administrators can respond quickly to issues before they impact users or cause system downtime.
System monitoring with email alerts typically involves:

- Monitoring system resources (CPU, memory, disk space, network)
- Service availability monitoring
- Log file analysis for errors
- Performance threshold monitoring
- Security event detection
## Prerequisites
Before implementing email alerts with system monitoring, ensure you have:
| Requirement | Description | Example |
|-------------|-------------|---------|
| SMTP Server | Mail server for sending emails | Gmail SMTP, company mail server |
| Monitoring Tools | System monitoring software | Nagios, Zabbix, custom scripts |
| System Access | Administrative privileges | root or sudo access |
| Email Credentials | Valid email account for sending | monitoring@company.com |
| Network Access | Connectivity to SMTP server | Port 587 or 465 open |
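The network requirement is the one most often missed on locked-down servers. A quick reachability check against your chosen SMTP host (Gmail is used here only as an example) confirms the port is open before any mail configuration starts:

```bash
# Check that the submission port is reachable (substitute your own SMTP host)
nc -zv smtp.gmail.com 587

# Alternative if netcat is not installed (bash built-in /dev/tcp)
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/smtp.gmail.com/587' && echo "Port 587 reachable"
```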
## Email Configuration Methods

### Method 1: Using Postfix (Local Mail Server)
Install and configure Postfix as a local mail relay:
```bash
# Install Postfix
sudo apt-get update
sudo apt-get install postfix mailutils

# Configure Postfix for Gmail relay
sudo nano /etc/postfix/main.cf
```

Add the following configuration to /etc/postfix/main.cf:
```
relayhost = [smtp.gmail.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt
smtp_use_tls = yes
```
Create SASL password file:
```bash
# Create password file
sudo nano /etc/postfix/sasl_passwd

# Add credentials (replace with actual values):
# [smtp.gmail.com]:587 username@gmail.com:app_password

# Secure and hash the file
sudo chmod 400 /etc/postfix/sasl_passwd
sudo postmap /etc/postfix/sasl_passwd

# Restart Postfix
sudo systemctl restart postfix
```

### Method 2: Using SSMTP (Lightweight Alternative)
```bash
# Install SSMTP
sudo apt-get install ssmtp

# Configure SSMTP
sudo nano /etc/ssmtp/ssmtp.conf
```

SSMTP configuration:
```
root=monitoring@yourdomain.com
mailhub=smtp.gmail.com:587
rewriteDomain=yourdomain.com
AuthUser=your-email@gmail.com
AuthPass=your-app-password
FromLineOverride=YES
UseSTARTTLS=YES
```
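Whichever relay method you choose, confirm that outbound mail actually leaves the host before wiring it into monitoring. A minimal check, assuming mailutils is installed and substituting a mailbox you can read:

```bash
# Send a test message through the local relay
echo "Relay test from $(hostname)" | mail -s "Mail relay test" you@yourdomain.com

# Watch the mail log for the delivery attempt (path may vary by distribution)
sudo tail -n 20 /var/log/mail.log
```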
### Method 3: Using Python SMTP Library
Create a Python script for sending emails:
```python
#!/usr/bin/env python3
import smtplib
import sys
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from datetime import datetime
def send_alert_email(subject, message, recipient):
    # Email configuration
    smtp_server = "smtp.gmail.com"
    smtp_port = 587
    sender_email = "monitoring@yourdomain.com"
    sender_password = "your-app-password"

    # Create message
    msg = MIMEMultipart()
    msg['From'] = sender_email
    msg['To'] = recipient
    msg['Subject'] = f"[ALERT] {subject}"

    # Email body
    body = f"""
System Alert Notification

Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Alert: {subject}

Details:
{message}

Please investigate immediately.

--
System Monitoring
"""
    msg.attach(MIMEText(body, 'plain'))

    try:
        # Connect to server and send email
        server = smtplib.SMTP(smtp_server, smtp_port)
        server.starttls()
        server.login(sender_email, sender_password)
        text = msg.as_string()
        server.sendmail(sender_email, recipient, text)
        server.quit()
        print("Alert email sent successfully")
        return True
    except Exception as e:
        print(f"Failed to send email: {e}")
        return False
if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python3 email_alert.py <subject> <message> <recipient>")
        sys.exit(1)
    send_alert_email(sys.argv[1], sys.argv[2], sys.argv[3])
```
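Saved as, for example, /usr/local/bin/email_alert.py (an assumed path; adjust to your layout), the script can then be called from cron jobs or other scripts. A quick manual test:

```bash
# Hypothetical install location; adjust to wherever you keep the script
sudo install -m 755 email_alert.py /usr/local/bin/email_alert.py

# Send a one-off test alert
python3 /usr/local/bin/email_alert.py "Test alert" "Manual test from $(hostname)" admin@yourdomain.com
```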
## Monitoring Tools Integration

### Nagios Integration
Nagios is a comprehensive monitoring solution that supports email notifications natively.
#### Command Configuration
Define notification commands in /etc/nagios3/conf.d/commands.cfg:
```
define command{
    command_name    notify-host-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}

define command{
    command_name    notify-service-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}
```
#### Contact Configuration
Define contacts in /etc/nagios3/conf.d/contacts.cfg:
```
define contact{
contact_name admin
use generic-contact
alias System Administrator
email admin@yourdomain.com
host_notification_period 24x7
service_notification_period 24x7
host_notification_options d,u,r,f,s
service_notification_options w,u,c,r,f,s
host_notification_commands notify-host-by-email
service_notification_commands notify-service-by-email
}
define contactgroup{
contactgroup_name admins
alias System Administrators
members admin
}
```
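After editing the command and contact definitions, validate the configuration before reloading Nagios. On a Debian-style nagios3 installation (paths and service name may differ on your system) the check looks roughly like this:

```bash
# Validate the Nagios configuration files
sudo nagios3 -v /etc/nagios3/nagios.cfg

# Reload Nagios so the new notification commands and contacts take effect
sudo systemctl restart nagios3
```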
### Zabbix Integration
Zabbix provides robust email notification capabilities through media types and actions.
#### Media Type Configuration
Configure the email media type in the Zabbix web interface or via the API:
| Parameter | Value | Description |
|-----------|-------|-------------|
| Name | Email | Media type name |
| Type | Email | Communication method |
| SMTP server | smtp.gmail.com | Mail server address |
| SMTP server port | 587 | SMTP port |
| SMTP helo | yourdomain.com | HELO message |
| SMTP email | monitoring@yourdomain.com | Sender email |
| Connection security | STARTTLS | Security method |
| Authentication | Username and password | Auth method |
| Username | your-email@gmail.com | SMTP username |
| Password | your-app-password | SMTP password |
#### Action Configuration
Create actions to trigger email notifications:
```sql
-- Example Zabbix action configuration
INSERT INTO actions (actionid, name, eventsource, evaltype, status, esc_period, def_shortdata, def_longdata)
VALUES (1, 'Email Notifications', 0, 0, 0, 3600,
'Problem: {EVENT.NAME}',
'Problem started at {EVENT.TIME} on {EVENT.DATE}\nProblem name: {EVENT.NAME}\nHost: {HOST.NAME}\nSeverity: {EVENT.SEVERITY}\n\nOriginal problem ID: {EVENT.ID}\n{TRIGGER.URL}');
```
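In practice, media types and actions are normally created through the Zabbix frontend or its JSON-RPC API rather than by inserting rows directly. As a sketch (assuming a frontend reachable at zabbix.yourdomain.com; older releases authenticate with a `user` parameter, newer ones with `username`), logging in to the API looks like this:

```bash
# Obtain an API token; the returned "result" value is passed as "auth" in later calls
curl -s -X POST "https://zabbix.yourdomain.com/api_jsonrpc.php" \
  -H "Content-Type: application/json-rpc" \
  -d '{
        "jsonrpc": "2.0",
        "method": "user.login",
        "params": {"user": "Admin", "password": "zabbix"},
        "id": 1
      }'
```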
### Custom Script Integration
Create comprehensive monitoring scripts with email integration:
```bash
#!/bin/bash
# System monitoring script with email alerts
# File: /usr/local/bin/system_monitor.sh

# Configuration
EMAIL_RECIPIENT="admin@yourdomain.com"
HOSTNAME=$(hostname)
ALERT_SCRIPT="/usr/local/bin/send_alert.py"

# Thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=5.0

# Function to send alert
send_alert() {
    local subject="$1"
    local message="$2"
    python3 "$ALERT_SCRIPT" "$subject" "$message" "$EMAIL_RECIPIENT"
}

# Check CPU usage
check_cpu() {
    cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}')
    cpu_usage_int=${cpu_usage%.*}
    if [ "$cpu_usage_int" -gt "$CPU_THRESHOLD" ]; then
        send_alert "High CPU Usage on $HOSTNAME" "CPU usage is ${cpu_usage}%, threshold is ${CPU_THRESHOLD}%"
    fi
}

# Check memory usage
check_memory() {
    memory_usage=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
    if [ "$memory_usage" -gt "$MEMORY_THRESHOLD" ]; then
        send_alert "High Memory Usage on $HOSTNAME" "Memory usage is ${memory_usage}%, threshold is ${MEMORY_THRESHOLD}%"
    fi
}

# Check disk usage
check_disk() {
    while read output; do
        usage=$(echo $output | awk '{print $5}' | cut -d'%' -f1)
        partition=$(echo $output | awk '{print $6}')
        if [ $usage -ge $DISK_THRESHOLD ]; then
            send_alert "High Disk Usage on $HOSTNAME" "Disk usage on $partition is ${usage}%, threshold is ${DISK_THRESHOLD}%"
        fi
    done <<< "$(df -h | grep -vE '^Filesystem|tmpfs|cdrom')"
}

# Check system load
check_load() {
    load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
    if (( $(echo "$load_avg > $LOAD_THRESHOLD" | bc -l) )); then
        send_alert "High System Load on $HOSTNAME" "System load average is $load_avg, threshold is $LOAD_THRESHOLD"
    fi
}

# Check service status
check_services() {
    services=("ssh" "apache2" "mysql" "nginx")
    for service in "${services[@]}"; do
        if ! systemctl is-active --quiet "$service"; then
            send_alert "Service Down on $HOSTNAME" "Service $service is not running"
        fi
    done
}

# Main execution
main() {
    echo "$(date): Starting system monitoring check"
    check_cpu
    check_memory
    check_disk
    check_load
    check_services
    echo "$(date): System monitoring check completed"
}

main "$@"
```
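Before scheduling the script, exercise it once by hand. Note that it calls /usr/local/bin/send_alert.py, so point ALERT_SCRIPT at whichever alert script you actually installed:

```bash
# Install the monitor and run it once manually
sudo install -m 755 system_monitor.sh /usr/local/bin/system_monitor.sh
sudo /usr/local/bin/system_monitor.sh

# To confirm the email path end to end, temporarily lower a threshold in the
# script (e.g. CPU_THRESHOLD=1), run it again, and then revert the change.
```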
## Alert Configuration

### Alert Severity Levels
Define different alert levels with appropriate email formatting:
| Severity | Description | Email Subject Prefix | Response Time |
|----------|-------------|----------------------|---------------|
| Critical | System down, data loss risk | [CRITICAL] | Immediate |
| High | Service unavailable | [HIGH] | 15 minutes |
| Medium | Performance degraded | [MEDIUM] | 1 hour |
| Low | Minor issues | [LOW] | 4 hours |
| Info | Informational | [INFO] | No action required |
### Advanced Alert Script
```python
#!/usr/bin/env python3
# File: /usr/local/bin/advanced_alert.py

import smtplib
import json
import sys
import os
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from datetime import datetime
import logging


class AlertManager:
    def __init__(self, config_file="/etc/monitoring/alert_config.json"):
        self.config = self.load_config(config_file)
        self.setup_logging()

    def load_config(self, config_file):
        """Load configuration from JSON file"""
        try:
            with open(config_file, 'r') as f:
                return json.load(f)
        except Exception:
            # Default configuration
            return {
                "smtp": {
                    "server": "smtp.gmail.com",
                    "port": 587,
                    "username": "monitoring@yourdomain.com",
                    "password": "your-app-password",
                    "use_tls": True
                },
                "recipients": {
                    "critical": ["admin@yourdomain.com", "oncall@yourdomain.com"],
                    "high": ["admin@yourdomain.com"],
                    "medium": ["admin@yourdomain.com"],
                    "low": ["admin@yourdomain.com"],
                    "info": ["logs@yourdomain.com"]
                },
                "rate_limiting": {
                    "enabled": True,
                    "max_emails_per_hour": 10
                }
            }

    def setup_logging(self):
        """Setup logging configuration"""
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler('/var/log/monitoring_alerts.log'),
                logging.StreamHandler()
            ]
        )
        self.logger = logging.getLogger(__name__)

    def check_rate_limit(self):
        """Check if rate limiting allows sending email"""
        if not self.config.get("rate_limiting", {}).get("enabled", False):
            return True
        # Implementation would check recent email count
        # For brevity, always return True
        return True

    def format_email_body(self, severity, subject, message, additional_info=None):
        """Format email body based on severity"""
        timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        hostname = os.uname().nodename
        body = f"""
System Alert Notification

Severity: {severity.upper()}
Time: {timestamp}
Host: {hostname}
Alert: {subject}

Description:
{message}
"""
        if additional_info:
            body += f"\nAdditional Information:\n{additional_info}"
        body += f"""
Alert Details:
- Severity Level: {severity}
- Generated by: System Monitoring
- Host: {hostname}
- Timestamp: {timestamp}

Please investigate and take appropriate action.

--
Automated System Monitoring
"""
        return body

    def send_alert(self, severity, subject, message, additional_info=None):
        """Send alert email based on severity"""
        if not self.check_rate_limit():
            self.logger.warning("Rate limit exceeded, skipping email")
            return False

        # Get recipients for severity level
        recipients = self.config["recipients"].get(
            severity, self.config["recipients"].get("medium", []))
        if not recipients:
            self.logger.error(f"No recipients configured for severity: {severity}")
            return False

        # Format email
        email_subject = f"[{severity.upper()}] {subject}"
        email_body = self.format_email_body(severity, subject, message, additional_info)

        # Send email
        return self._send_email(email_subject, email_body, recipients)

    def _send_email(self, subject, body, recipients):
        """Send email using SMTP"""
        try:
            smtp_config = self.config["smtp"]

            # Create message
            msg = MIMEMultipart()
            msg['From'] = smtp_config["username"]
            msg['To'] = ", ".join(recipients)
            msg['Subject'] = subject
            msg.attach(MIMEText(body, 'plain'))

            # Connect and send
            server = smtplib.SMTP(smtp_config["server"], smtp_config["port"])
            if smtp_config.get("use_tls", True):
                server.starttls()
            server.login(smtp_config["username"], smtp_config["password"])
            server.sendmail(smtp_config["username"], recipients, msg.as_string())
            server.quit()

            self.logger.info(f"Alert sent successfully to {recipients}")
            return True
        except Exception as e:
            self.logger.error(f"Failed to send alert: {e}")
            return False
def main():
    if len(sys.argv) < 4:
        print("Usage: python3 advanced_alert.py <severity> <subject> <message> [additional_info]")
        sys.exit(1)

    severity = sys.argv[1]
    subject = sys.argv[2]
    message = sys.argv[3]
    additional_info = sys.argv[4] if len(sys.argv) > 4 else None

    alert_manager = AlertManager()
    alert_manager.send_alert(severity, subject, message, additional_info)


if __name__ == "__main__":
    main()
```
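Assuming the script is installed as /usr/local/bin/advanced_alert.py (the path the shell scripts later in this guide expect), a manual invocation mirrors how those scripts call it:

```bash
# Arguments: severity, subject, message, optional additional info
python3 /usr/local/bin/advanced_alert.py high "Disk space warning" \
  "Root filesystem is at 92% capacity" "Threshold is 90%"
```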
### Configuration File Example
Create /etc/monitoring/alert_config.json:
```json
{
"smtp": {
"server": "smtp.gmail.com",
"port": 587,
"username": "monitoring@yourdomain.com",
"password": "your-app-password",
"use_tls": true
},
"recipients": {
"critical": [
"admin@yourdomain.com",
"oncall@yourdomain.com",
"manager@yourdomain.com"
],
"high": [
"admin@yourdomain.com",
"oncall@yourdomain.com"
],
"medium": [
"admin@yourdomain.com"
],
"low": [
"admin@yourdomain.com"
],
"info": [
"logs@yourdomain.com"
]
},
"rate_limiting": {
"enabled": true,
"max_emails_per_hour": 20,
"cooldown_period": 300
},
"email_templates": {
"critical": {
"subject_prefix": "[CRITICAL ALERT]",
"priority": "high"
},
"high": {
"subject_prefix": "[HIGH ALERT]",
"priority": "high"
},
"medium": {
"subject_prefix": "[MEDIUM ALERT]",
"priority": "normal"
},
"low": {
"subject_prefix": "[LOW ALERT]",
"priority": "low"
},
"info": {
"subject_prefix": "[INFO]",
"priority": "low"
}
}
}
```
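Because a malformed configuration file silently drops AlertManager back to its built-in defaults, validate the JSON and tighten its permissions after every edit; the file holds SMTP credentials:

```bash
# Syntax check - prints the parsed JSON or an error pointing at the bad line
python3 -m json.tool /etc/monitoring/alert_config.json

# Restrict access, since the file contains credentials
sudo chown root:root /etc/monitoring/alert_config.json
sudo chmod 600 /etc/monitoring/alert_config.json
```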
## Script-Based Monitoring

### Comprehensive System Monitor
```bash
#!/bin/bash
# File: /usr/local/bin/comprehensive_monitor.sh

# Configuration
SCRIPT_DIR="/usr/local/bin"
CONFIG_DIR="/etc/monitoring"
LOG_DIR="/var/log/monitoring"
ALERT_SCRIPT="$SCRIPT_DIR/advanced_alert.py"

# Create directories if they don't exist
mkdir -p "$LOG_DIR"

# Logging function
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_DIR/system_monitor.log"
}

# Network connectivity check
check_network() {
    local hosts=("8.8.8.8" "google.com" "github.com")
    local failed_hosts=()

    for host in "${hosts[@]}"; do
        if ! ping -c 1 -W 5 "$host" > /dev/null 2>&1; then
            failed_hosts+=("$host")
        fi
    done

    if [ ${#failed_hosts[@]} -gt 0 ]; then
        python3 "$ALERT_SCRIPT" "high" "Network Connectivity Issues" \
            "Failed to reach: ${failed_hosts[*]}" \
            "Network connectivity check failed for multiple hosts"
    fi
}

# SSL certificate expiry check
check_ssl_certificates() {
    local domains=("yourdomain.com" "api.yourdomain.com")

    for domain in "${domains[@]}"; do
        expiry_date=$(echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \
            openssl x509 -noout -dates | grep notAfter | cut -d= -f2)

        if [ -n "$expiry_date" ]; then
            expiry_timestamp=$(date -d "$expiry_date" +%s)
            current_timestamp=$(date +%s)
            days_until_expiry=$(( (expiry_timestamp - current_timestamp) / 86400 ))

            if [ "$days_until_expiry" -lt 30 ]; then
                severity="high"
                [ "$days_until_expiry" -lt 7 ] && severity="critical"
                python3 "$ALERT_SCRIPT" "$severity" "SSL Certificate Expiring" \
                    "SSL certificate for $domain expires in $days_until_expiry days" \
                    "Certificate expiry date: $expiry_date"
            fi
        fi
    done
}

# Database connectivity check
check_database() {
    local databases=("mysql" "postgresql")

    for db in "${databases[@]}"; do
        case "$db" in
            "mysql")
                if command -v mysql > /dev/null; then
                    if ! mysql -u monitoring -p"$MYSQL_PASSWORD" -e "SELECT 1;" > /dev/null 2>&1; then
                        python3 "$ALERT_SCRIPT" "critical" "MySQL Database Connection Failed" \
                            "Unable to connect to MySQL database" \
                            "Database service may be down or credentials invalid"
                    fi
                fi
                ;;
            "postgresql")
                if command -v psql > /dev/null; then
                    if ! PGPASSWORD="$POSTGRES_PASSWORD" psql -U monitoring -d postgres -c "SELECT 1;" > /dev/null 2>&1; then
                        python3 "$ALERT_SCRIPT" "critical" "PostgreSQL Database Connection Failed" \
                            "Unable to connect to PostgreSQL database" \
                            "Database service may be down or credentials invalid"
                    fi
                fi
                ;;
        esac
    done
}

# Log file monitoring
monitor_log_files() {
    local log_files=("/var/log/auth.log" "/var/log/syslog" "/var/log/apache2/error.log")
    local error_patterns=("Failed password" "authentication failure" "Internal Server Error")

    for log_file in "${log_files[@]}"; do
        if [ -f "$log_file" ]; then
            for pattern in "${error_patterns[@]}"; do
                # Check for errors in the last 5 minutes
                recent_errors=$(grep "$pattern" "$log_file" | \
                    awk -v cutoff="$(date -d '5 minutes ago' '+%b %d %H:%M')" \
                    '$0 > cutoff' | wc -l)

                if [ "$recent_errors" -gt 10 ]; then
                    python3 "$ALERT_SCRIPT" "medium" "High Error Rate in Logs" \
                        "Found $recent_errors occurrences of '$pattern' in $log_file in the last 5 minutes" \
                        "Recent error pattern detected"
                fi
            done
        fi
    done
}

# Process monitoring
monitor_processes() {
    local critical_processes=("sshd" "systemd" "init")
    local important_processes=("apache2" "nginx" "mysql" "postgresql")

    for process in "${critical_processes[@]}"; do
        if ! pgrep "$process" > /dev/null; then
            python3 "$ALERT_SCRIPT" "critical" "Critical Process Not Running" \
                "Critical process $process is not running" \
                "System stability may be compromised"
        fi
    done

    for process in "${important_processes[@]}"; do
        if ! pgrep "$process" > /dev/null; then
            python3 "$ALERT_SCRIPT" "high" "Important Process Not Running" \
                "Important process $process is not running" \
                "Service may be unavailable"
        fi
    done
}

# Main execution
main() {
    log_message "Starting comprehensive system monitoring"

    # Load environment variables for database passwords
    if [ -f "$CONFIG_DIR/monitoring.env" ]; then
        source "$CONFIG_DIR/monitoring.env"
    fi

    # Run all checks
    check_network
    check_ssl_certificates
    check_database
    monitor_log_files
    monitor_processes

    log_message "Comprehensive system monitoring completed"
}

# Execute main function
main "$@"
```
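Run the script once interactively before scheduling it. The database checks read MYSQL_PASSWORD and POSTGRES_PASSWORD from /etc/monitoring/monitoring.env (sourced in main above), so create that file first if you use them; the values below are placeholders:

```bash
# Optional: credentials consumed by check_database via monitoring.env
sudo tee /etc/monitoring/monitoring.env > /dev/null <<'EOF'
MYSQL_PASSWORD=changeme
POSTGRES_PASSWORD=changeme
EOF
sudo chmod 600 /etc/monitoring/monitoring.env

# One manual run, then inspect the log it writes
sudo /usr/local/bin/comprehensive_monitor.sh
tail -n 5 /var/log/monitoring/system_monitor.log
```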
### Cron Job Setup

Set up automated monitoring with cron:
```bash
# Edit crontab
crontab -e

# Add monitoring jobs:

# Run comprehensive monitoring every 5 minutes
*/5 * * * * /usr/local/bin/comprehensive_monitor.sh

# Run basic system monitoring every minute
* * * * * /usr/local/bin/system_monitor.sh

# Run daily system health report
0 8 * * * /usr/local/bin/daily_health_report.sh

# Run weekly system summary
0 9 * * 1 /usr/local/bin/weekly_summary.sh
```
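To confirm the jobs were registered and are firing, list the crontab and follow the monitoring log for a few cycles:

```bash
# List the scheduled monitoring jobs
crontab -l | grep -E 'monitor|health|summary'

# Watch the log written by the monitoring scripts
tail -f /var/log/monitoring/system_monitor.log
```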
## Advanced Configuration

### Email Template System
Create customizable email templates:
```python
#!/usr/bin/env python3
# File: /usr/local/bin/template_manager.py

import json
import os
from string import Template
from datetime import datetime


class EmailTemplateManager:
    def __init__(self, template_dir="/etc/monitoring/templates"):
        self.template_dir = template_dir
        self.templates = self.load_templates()

    def load_templates(self):
        """Load email templates from files"""
        templates = {}
        if not os.path.exists(self.template_dir):
            os.makedirs(self.template_dir)
            self.create_default_templates()

        for filename in os.listdir(self.template_dir):
            if filename.endswith('.template'):
                template_name = filename[:-9]  # Remove .template extension
                with open(os.path.join(self.template_dir, filename), 'r') as f:
                    templates[template_name] = f.read()
        return templates

    def create_default_templates(self):
        """Create default email templates"""
        templates = {
            'critical_alert': '''Subject: [CRITICAL] $alert_type on $hostname

CRITICAL SYSTEM ALERT

Time: $timestamp
Host: $hostname
Alert Type: $alert_type
Severity: CRITICAL

Issue Description:
$description

Current Status:
$current_status

Immediate Action Required:
$recommended_action

System Details:
- Hostname: $hostname
- IP Address: $ip_address
- Operating System: $os_info
- Uptime: $uptime

This is a critical alert requiring immediate attention.

--
Automated Monitoring System''',

            'service_down': '''Subject: [HIGH] Service $service_name is DOWN on $hostname

SERVICE UNAVAILABLE ALERT

Time: $timestamp
Host: $hostname
Service: $service_name
Status: DOWN
Duration: $downtime_duration

Service Details:
- Service Name: $service_name
- Expected Status: Running
- Current Status: Stopped/Failed
- Last Known Good: $last_good_time

Impact Assessment:
$impact_description

Recommended Actions:
1. Check service logs: journalctl -u $service_name
2. Attempt service restart: systemctl restart $service_name
3. Verify service configuration
4. Check system resources

--
Service Monitoring System''',

            'performance_degraded': '''Subject: [MEDIUM] Performance Alert on $hostname

PERFORMANCE DEGRADATION DETECTED

Time: $timestamp
Host: $hostname
Metric: $metric_name
Current Value: $current_value
Threshold: $threshold_value

Performance Metrics:
$performance_details

Trend Analysis:
$trend_information

Suggested Actions:
$suggested_actions

--
Performance Monitoring System'''
        }

        for name, content in templates.items():
            with open(os.path.join(self.template_dir, f"{name}.template"), 'w') as f:
                f.write(content)
    def render_template(self, template_name, variables):
        """Render template with provided variables"""
        if template_name not in self.templates:
            raise ValueError(f"Template '{template_name}' not found")

        template = Template(self.templates[template_name])

        # Add common variables
        common_vars = {
            'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            'hostname': os.uname().nodename,
            'ip_address': self.get_ip_address(),
            'os_info': self.get_os_info(),
            'uptime': self.get_uptime()
        }

        # Merge with provided variables (caller values override common ones)
        all_vars = {**common_vars, **variables}
        return template.safe_substitute(all_vars)
    def get_ip_address(self):
        """Get system IP address"""
        try:
            import socket
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.connect(("8.8.8.8", 80))
            ip = s.getsockname()[0]
            s.close()
            return ip
        except:
            return "Unknown"

    def get_os_info(self):
        """Get OS information"""
        try:
            with open('/etc/os-release', 'r') as f:
                for line in f:
                    if line.startswith('PRETTY_NAME='):
                        return line.split('=')[1].strip().strip('"')
        except:
            pass
        return os.uname().sysname

    def get_uptime(self):
        """Get system uptime"""
        try:
            with open('/proc/uptime', 'r') as f:
                uptime_seconds = float(f.readline().split()[0])
            days = int(uptime_seconds // 86400)
            hours = int((uptime_seconds % 86400) // 3600)
            minutes = int((uptime_seconds % 3600) // 60)
            return f"{days}d {hours}h {minutes}m"
        except:
            return "Unknown"
```
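A quick way to confirm the default templates render, assuming the module lives in /usr/local/bin as above (run from that directory so the import resolves, and with privileges to create /etc/monitoring/templates on first use); safe_substitute leaves any placeholder you do not supply untouched:

```bash
cd /usr/local/bin && sudo python3 -c "
from template_manager import EmailTemplateManager

mgr = EmailTemplateManager()
print(mgr.render_template('service_down', {
    'service_name': 'nginx',
    'downtime_duration': '5 minutes',
    'last_good_time': 'unknown',
    'impact_description': 'Web frontend unavailable',
}))
"
```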
### Dashboard Integration
Create a simple web dashboard for monitoring status:
```python
#!/usr/bin/env python3
# File: /usr/local/bin/monitoring_dashboard.py

from flask import Flask, render_template, jsonify
import json
import os
import subprocess
from datetime import datetime, timedelta

app = Flask(__name__)


class MonitoringDashboard:
    def __init__(self):
        self.status_file = "/var/log/monitoring/status.json"
        self.alert_log = "/var/log/monitoring_alerts.log"

    def get_system_status(self):
        """Get current system status"""
        try:
            with open(self.status_file, 'r') as f:
                return json.load(f)
        except:
            return self.collect_system_status()

    def collect_system_status(self):
        """Collect current system status"""
        status = {
            'timestamp': datetime.now().isoformat(),
            'hostname': os.uname().nodename,
            'cpu_usage': self.get_cpu_usage(),
            'memory_usage': self.get_memory_usage(),
            'disk_usage': self.get_disk_usage(),
            'load_average': self.get_load_average(),
            'services': self.get_service_status(),
            'alerts': self.get_recent_alerts()
        }

        # Save status
        os.makedirs(os.path.dirname(self.status_file), exist_ok=True)
        with open(self.status_file, 'w') as f:
            json.dump(status, f, indent=2)
        return status

    def get_cpu_usage(self):
        """Get CPU usage percentage"""
        try:
            result = subprocess.run(['top', '-bn1'], capture_output=True, text=True)
            for line in result.stdout.split('\n'):
                if 'Cpu(s)' in line:
                    return float(line.split()[1].rstrip('%us,'))
        except:
            return 0.0

    def get_memory_usage(self):
        """Get memory usage percentage"""
        try:
            result = subprocess.run(['free'], capture_output=True, text=True)
            lines = result.stdout.split('\n')
            mem_line = lines[1].split()
            total = int(mem_line[1])
            used = int(mem_line[2])
            return round((used / total) * 100, 2)
        except:
            return 0.0

    def get_disk_usage(self):
        """Get disk usage for all mounted filesystems"""
        try:
            result = subprocess.run(['df', '-h'], capture_output=True, text=True)
            disk_info = []
            for line in result.stdout.split('\n')[1:]:
                if line.strip() and not line.startswith('tmpfs'):
                    parts = line.split()
                    if len(parts) >= 6:
                        disk_info.append({
                            'filesystem': parts[0],
                            'size': parts[1],
                            'used': parts[2],
                            'available': parts[3],
                            'usage_percent': int(parts[4].rstrip('%')),
                            'mount_point': parts[5]
                        })
            return disk_info
        except:
            return []

    def get_load_average(self):
        """Get system load average"""
        try:
            with open('/proc/loadavg', 'r') as f:
                loads = f.read().split()[:3]
            return [float(load) for load in loads]
        except:
            return [0.0, 0.0, 0.0]

    def get_service_status(self):
        """Get status of important services"""
        services = ['ssh', 'apache2', 'nginx', 'mysql', 'postgresql']
        status = {}
        for service in services:
            try:
                result = subprocess.run(['systemctl', 'is-active', service],
                                        capture_output=True, text=True)
                status[service] = result.stdout.strip()
            except:
                status[service] = 'unknown'
        return status

    def get_recent_alerts(self):
        """Get recent alerts from log file"""
        alerts = []
        try:
            if os.path.exists(self.alert_log):
                with open(self.alert_log, 'r') as f:
                    lines = f.readlines()[-50:]  # Last 50 lines
                for line in lines:
                    if 'Alert sent successfully' in line:
                        alerts.append({
                            'timestamp': line.split(' - ')[0],
                            'message': line.strip()
                        })
        except:
            pass
        return alerts

dashboard = MonitoringDashboard()
@app.route('/')
def index():
    """Main dashboard page"""
    status = dashboard.get_system_status()
    return render_template('dashboard.html', status=status)


@app.route('/api/status')
def api_status():
    """API endpoint for system status"""
    return jsonify(dashboard.collect_system_status())


@app.route('/api/refresh')
def api_refresh():
    """Force refresh system status"""
    return jsonify(dashboard.collect_system_status())
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000, debug=False)
```
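The dashboard needs Flask installed and a dashboard.html template for the index page (not shown here); the JSON endpoints work without it, so a quick smoke test looks like this:

```bash
# Install Flask and start the dashboard
pip3 install flask
python3 /usr/local/bin/monitoring_dashboard.py &

# Query the status API from another shell
curl -s http://localhost:5000/api/status | python3 -m json.tool
```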
## Troubleshooting

### Common Issues and Solutions
| Issue | Symptoms | Solution |
|-------|----------|----------|
| Emails not sending | No alert emails received | Check SMTP configuration, credentials, network connectivity |
| Authentication failures | SMTP auth errors | Use app passwords for Gmail, verify credentials |
| Rate limiting | Some emails missing | Implement proper rate limiting, check provider limits |
| False positives | Too many alerts | Adjust thresholds, implement alert correlation |
| Template errors | Malformed emails | Validate template syntax, check variable substitution |
### Debugging Commands
Test email functionality:
```bash
# Test basic mail command
echo "Test message" | mail -s "Test Subject" user@domain.com

# Test SMTP connectivity
telnet smtp.gmail.com 587

# Check mail queue
mailq

# View mail logs
tail -f /var/log/mail.log

# Test Python SMTP
python3 -c "
import smtplib
server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
print('SMTP connection successful')
server.quit()
"
```
### Log Analysis

Create a script to monitor email alert logs:
```bash
#!/bin/bash
# File: /usr/local/bin/monitor_alert_logs.sh

ALERT_LOG="/var/log/monitoring_alerts.log"
MAIL_LOG="/var/log/mail.log"

echo "=== Recent Alert Attempts ==="
tail -20 "$ALERT_LOG"

echo -e "\n=== Mail Server Activity ==="
tail -20 "$MAIL_LOG" | grep -E "(sent|delivered|failed|error)"

echo -e "\n=== Alert Statistics (Last 24 hours) ==="
grep "$(date -d '1 day ago' '+%Y-%m-%d')" "$ALERT_LOG" | \
    grep -c "Alert sent successfully"
echo -e "\n=== Failed Alerts (Last 24 hours) ==="
grep "$(date -d '1 day ago' '+%Y-%m-%d')" "$ALERT_LOG" | \
grep "Failed to send"
```
## Best Practices

### Security Considerations

1. Credential Management: Store SMTP credentials securely and use app-specific passwords
2. Access Control: Restrict access to configuration files and scripts (see the example below)
3. Encryption: Use TLS/SSL for SMTP connections
4. Rate Limiting: Implement proper rate limiting to prevent spam
5. Log Security: Protect log files from unauthorized access
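As a concrete starting point for items 1 and 2, keep everything under /etc/monitoring owned by root with tight modes; adjust the paths to match your own layout:

```bash
# Configuration (SMTP and database credentials): readable by root only
sudo chown -R root:root /etc/monitoring
sudo chmod 700 /etc/monitoring
sudo chmod 600 /etc/monitoring/alert_config.json /etc/monitoring/monitoring.env

# Monitoring scripts: executable, not world-writable
sudo chmod 755 /usr/local/bin/system_monitor.sh /usr/local/bin/comprehensive_monitor.sh
```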
### Performance Optimization

1. Efficient Monitoring: Avoid excessive system calls in monitoring scripts
2. Batch Processing: Group related alerts to reduce email volume
3. Caching: Cache system status to reduce redundant checks
4. Asynchronous Processing: Use background processes for email sending
5. Resource Limits: Set appropriate limits on monitoring frequency
### Maintenance Procedures

1. Regular Testing: Test email functionality regularly
2. Log Rotation: Implement proper log rotation for monitoring logs (see the logrotate example below)
3. Configuration Backup: Back up monitoring configurations
4. Alert Review: Regularly review and tune alert thresholds
5. Documentation: Maintain up-to-date documentation for procedures
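For item 2, a small logrotate drop-in keeps the alert and monitoring logs from growing unbounded; this sketch assumes the log paths used earlier in this guide:

```bash
# Rotate the monitoring logs weekly, keeping four compressed generations
sudo tee /etc/logrotate.d/monitoring > /dev/null <<'EOF'
/var/log/monitoring_alerts.log /var/log/monitoring/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
EOF
```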
### Alert Fatigue Prevention

1. Intelligent Grouping: Group related alerts together
2. Escalation Policies: Implement proper escalation procedures
3. Acknowledgment System: Allow operators to acknowledge alerts
4. Severity Tuning: Regularly adjust alert severity levels
5. Noise Reduction: Filter out non-actionable alerts
This guide provides a framework for integrating email alerts with system monitoring, from basic SMTP setup through tool integration and custom scripts to troubleshooting and the practices that keep an alerting pipeline reliable.