Monitoring is essential for maintaining healthy infrastructure. Prometheus and Grafana together form one of the most widely used open-source monitoring stacks, in use at organizations from startups to enterprises. This guide walks you through setting up the core of a production-ready monitoring system on Linux.
Architecture Overview
The monitoring stack consists of:
- Prometheus: Time-series database that scrapes metrics from targets
- Node Exporter: Exposes Linux system metrics (CPU, memory, disk, network)
- Grafana: Visualization platform for creating dashboards
- Alertmanager: Handles alerts from Prometheus and routes notifications
Installing Prometheus
# Create user
sudo useradd --no-create-home --shell /bin/false prometheus
# Download and install
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.50.0/prometheus-2.50.0.linux-amd64.tar.gz
tar xzf prometheus-2.50.0.linux-amd64.tar.gz
sudo cp prometheus-2.50.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
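The commands above install the binaries but do not start Prometheus. One way to run it as a service is a systemd unit similar to the one this guide uses for Node Exporter. The sketch below prints such a unit to stdout; `render_prometheus_unit` is a hypothetical helper name, and the 15d default retention is an assumed example value rather than a recommendation.

```shell
# Sketch: print a systemd unit for Prometheus to stdout.
# render_prometheus_unit is a hypothetical helper; the retention default
# (15d) is an assumed example value.
render_prometheus_unit() {
  retention="${1:-15d}"
  # In an unquoted heredoc, "\\" emits a single backslash, which systemd
  # treats as a line continuation inside ExecStart.
  cat <<EOF
[Unit]
Description=Prometheus
After=network-online.target
Wants=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \\
  --config.file=/etc/prometheus/prometheus.yml \\
  --storage.tsdb.path=/var/lib/prometheus \\
  --storage.tsdb.retention.time=${retention}
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (needs root; run once /etc/prometheus/prometheus.yml exists):
#   render_prometheus_unit 30d | sudo tee /etc/systemd/system/prometheus.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now prometheus
```

Passing the retention period as an argument keeps the storage path and retention setting visible in one place, which helps when sizing the disk behind /var/lib/prometheus.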
Prometheus Configuration
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - "alert_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
Installing Node Exporter
# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
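Once the service is running, it is worth confirming that metrics are actually being exposed before pointing Prometheus at it. The sketch below counts the distinct `node_` metric names in a metrics payload read from stdin; `count_node_metrics` is a hypothetical helper name, and port 9100 is Node Exporter's default listen port.

```shell
# Sketch: count distinct node_* metric names in a Prometheus text-format
# payload read from stdin. count_node_metrics is a hypothetical helper.
count_node_metrics() {
  # Drop "# HELP" / "# TYPE" comment lines, strip labels and sample values,
  # keep names starting with "node_", then count the unique names.
  grep -v '^#' | sed 's/[{ ].*//' | grep '^node_' | sort -u | wc -l | tr -d ' '
}

# Usage on the monitored host:
#   curl -s http://localhost:9100/metrics | count_node_metrics
```

A result of 0 usually means the exporter is not reachable on that port or the response is not a metrics payload.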
Installing Grafana
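# Add Grafana repository (Debian/Ubuntu)
# apt-key is deprecated, so the signing key goes into a dedicated keyring
sudo apt install -y apt-transport-https software-properties-common wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana
sudo systemctl enable --now grafana-server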
Creating Dashboards
After connecting Prometheus (http://localhost:9090 in this setup) as a data source in Grafana, which listens on port 3000 by default:
- Import the Node Exporter Full dashboard (ID: 1860) for comprehensive system monitoring
- Create custom panels with PromQL queries
- Set up variables for multi-server dashboards
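The Prometheus data source mentioned above can be added through the Grafana UI, or provisioned from a file so the setup is reproducible. The following is a minimal sketch of the file-based approach; `render_grafana_datasource` is a hypothetical helper name, and the provisioning directory shown in the usage comment is the default for package installs and may differ on your system.

```shell
# Sketch: print a Grafana data source provisioning file to stdout.
# render_grafana_datasource is a hypothetical helper; the default URL
# assumes Prometheus runs on the same host as Grafana.
render_grafana_datasource() {
  prom_url="${1:-http://localhost:9090}"
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${prom_url}
    isDefault: true
EOF
}

# Usage (needs root; Grafana reads provisioning files at startup):
#   render_grafana_datasource | sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml
#   sudo systemctl restart grafana-server
```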
Essential PromQL Queries
# CPU Usage Percentage
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage Percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk Usage Percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# Network Receive Traffic (bytes/sec); replace eth0 with your interface name (e.g. ens3)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
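These queries can be tried outside Grafana as well, which is handy when debugging a panel. The sketch below sends an instant query to the Prometheus HTTP API at `/api/v1/query`; `prom_query` and the `PROM_URL` variable are hypothetical names chosen for this example, and the default URL assumes Prometheus is listening locally on port 9090.

```shell
# Sketch: run an instant PromQL query against the Prometheus HTTP API.
# prom_query and PROM_URL are hypothetical names for this example.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query() {
  # -G sends the encoded data as a GET query string; --data-urlencode
  # escapes the braces, quotes, and brackets that PromQL uses.
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100'
```

The response is JSON; a `"status":"success"` field with an empty result list typically means the expression is valid but no series matched, for example because of a wrong `device` or `mountpoint` label.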
Alert Rules
# /etc/prometheus/alert_rules.yml
groups:
  - name: server_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
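YAML indentation mistakes and PromQL typos in rule files are easy to make, so it helps to validate both files before Prometheus loads them. The sketch below wraps the `promtool` binary installed earlier; `validate_prometheus_files` is a hypothetical helper name, and the default paths are the ones used in this guide.

```shell
# Sketch: validate the rule file and the main config with promtool.
# validate_prometheus_files is a hypothetical helper name.
validate_prometheus_files() {
  config="${1:-/etc/prometheus/prometheus.yml}"
  rules="${2:-/etc/prometheus/alert_rules.yml}"
  # "check rules" parses the rule groups and their PromQL expressions.
  promtool check rules "$rules" || return 1
  # "check config" validates prometheus.yml and the rule files it references.
  promtool check config "$config" || return 1
}

# Usage:
#   validate_prometheus_files
#   # on success, restart or reload Prometheus however it is run on your host
```

Checking the rule file first gives a more specific error message when the problem is in an alert expression rather than in the main configuration.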
Production Best Practices
- Use persistent storage for Prometheus data with appropriate retention settings
- Set up Alertmanager with email, Slack, or PagerDuty notifications
- Secure Grafana with HTTPS and strong authentication
- Implement recording rules for frequently used queries
- Monitor the monitoring system itself
- Use federation for multi-datacenter setups
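As an illustration of the recording-rules item above, the CPU and memory expressions from this guide could be precomputed so dashboards and alerts read a stored series instead of re-evaluating the full query. This is a sketch only: `render_recording_rules`, the group name, the rule names, and the `recording_rules.yml` file name are all assumptions made for this example.

```shell
# Sketch: print a recording-rules file to stdout.
# render_recording_rules, the group name, and the record names are
# hypothetical; the expressions are the CPU and memory queries from this guide.
render_recording_rules() {
  # Quoted heredoc delimiter: nothing inside is expanded by the shell.
  cat <<'EOF'
groups:
  - name: node_recording_rules
    rules:
      - record: instance:node_cpu_utilisation:percent
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_utilisation:percent
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
EOF
}

# Usage (needs root):
#   render_recording_rules | sudo tee /etc/prometheus/recording_rules.yml
#   then list "recording_rules.yml" under rule_files in prometheus.yml
```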
A well-configured monitoring stack gives you visibility into your infrastructure and helps you detect and resolve issues before they impact users. Start with basic metrics and gradually expand your monitoring coverage.