Building a Complete Monitoring Stack: Prometheus, Grafana, and Beyond

Why Monitoring Matters

You can't fix what you can't see. Modern infrastructure generates enormous amounts of data—understanding that data is the difference between proactive operations and firefighting. A well-designed monitoring stack provides visibility, enables fast troubleshooting, and prevents outages.

This guide walks you through building a production-ready monitoring solution.

The Modern Monitoring Stack

The industry-standard open-source stack includes:

  • Prometheus: Metrics collection and storage
  • Grafana: Visualization and dashboards
  • Alertmanager: Alert routing and notification
  • Node Exporter: Host-level metrics
  • Various exporters: Application-specific metrics
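
The rest of this guide builds these components up as services in a single docker-compose.yml, with prometheus.yml, alerts.yml, and alertmanager.yml kept alongside it, so the snippets below can be assembled into one project.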

Installing Prometheus

Docker Compose Setup

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus_data:
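
Once the prometheus.yml described below exists next to this file, docker compose up -d starts Prometheus and the UI is available at http://localhost:9090. The retention flag keeps 30 days of metrics; adjust it to fit your storage budget.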

Basic Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
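
After restarting Prometheus to pick up this config, the Status → Targets page (http://localhost:9090/targets) should list both jobs, and the built-in query up should return 1 for every healthy target. The node job assumes a node-exporter service (configured later in this guide) is reachable on the same Docker network.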

Setting Up Grafana

Docker Configuration

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secure_password
      - GF_USERS_ALLOW_SIGN_UP=false
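
Rather than adding the Prometheus data source by hand, you can provision it from a file. A minimal sketch, assuming you also mount ./provisioning into the container at /etc/grafana/provisioning:

# provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true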

Creating Dashboards

Start with community dashboards:

  • Node Exporter Full: Dashboard ID 1860
  • Docker and Host: Dashboard ID 179
  • Kubernetes Cluster: Dashboard ID 315

Import via Grafana UI: Dashboards → Import → Enter ID
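
If you prefer to keep dashboards in version control, Grafana can also load dashboard JSON from disk through a provisioning file; a sketch, assuming the JSON files are mounted at /var/lib/grafana/dashboards:

# provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'default'
    type: file
    options:
      path: /var/lib/grafana/dashboards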

Essential Exporters

Node Exporter (System Metrics)

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      # needed so the /:/rootfs:ro mount above is used for filesystem metrics
      - '--path.rootfs=/rootfs'

PostgreSQL Exporter

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:pass@db:5432/dbname?sslmode=disable"
    ports:
      - "9187:9187"

Nginx Exporter

  nginx-exporter:
    image: nginx/nginx-prometheus-exporter
    command:
      - -nginx.scrape-uri=http://nginx:8080/stub_status
    ports:
      - "9113:9113"

Alerting Configuration

Prometheus Alert Rules

# alerts.yml
groups:
  - name: system
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: 'team-email'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
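
Alertmanager itself runs as another Compose service; a minimal sketch alongside the Prometheus service defined earlier:

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml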

Key Metrics to Monitor

System Health

  • CPU usage, load average
  • Memory usage and swap
  • Disk I/O and space
  • Network traffic and errors
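
With node_exporter in place on Linux hosts, these translate into PromQL expressions such as:

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk read throughput (bytes/s)
rate(node_disk_read_bytes_total[5m])

# Network receive errors (per second)
rate(node_network_receive_errs_total[5m])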

Application Health

  • Request rate and latency
  • Error rates
  • Active connections
  • Queue depths
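
The exact queries depend on what your application exports; assuming an illustrative counter http_requests_total (with a status label) and a histogram http_request_duration_seconds, the usual request-rate, error-rate, and latency queries look like:

# Request rate (per second)
sum(rate(http_requests_total[5m]))

# Error ratio (5xx as a share of all requests)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 95th percentile latency (requires a histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))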

Business Metrics

  • Transactions per second
  • User signups/logins
  • Revenue-impacting events

Best Practices

  • Use labels effectively: Environment, service, instance (see the sketch after this list)
  • Set retention appropriately: Balance history vs. storage
  • Alert on symptoms, not causes: User impact matters most
  • Create runbooks: Document response procedures
  • Review alerts regularly: Eliminate noise
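
For the labeling point, one common approach is to attach environment and service labels at scrape time; the values here are illustrative:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'
          service: 'web'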

Conclusion

A comprehensive monitoring stack is essential for modern operations. Start with the basics, expand coverage gradually, and continuously refine your alerting to minimize noise while catching real issues.

Our monitoring and observability eBooks provide deep dives into building production-ready monitoring solutions.
