System Monitoring with Prometheus + Grafana

Architecture Overview

1 Summary

This guide covers installing and configuring a monitoring stack made up of Prometheus for collecting metrics, Grafana for visualizing them in dashboards, Alertmanager for alert notifications, and Node Exporter for collecting operating-system metrics. All four services are managed with Docker Compose.

2 Prerequisites

Operating system

Ubuntu 20.04+
Debian 11+
Or any other system that supports Docker

Permissions

root or sudo access
Permission to run Docker

Software

Docker Engine
Docker Compose v2+

Required open ports

3000 - Grafana
9090 - Prometheus
9093 - Alertmanager
9100 - Node Exporter
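
Before starting the stack, it can help to confirm that nothing else is already listening on these ports. The sketch below is one possible check from the Docker host, assuming the services will be published on 127.0.0.1. The names REQUIRED_PORTS and port_in_use are illustrative and not part of any tool used in this guide.

```python
import socket

# Ports from the list above; the service names are only labels for the report.
REQUIRED_PORTS = {
    3000: "Grafana",
    9090: "Prometheus",
    9093: "Alertmanager",
    9100: "Node Exporter",
}


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port, service in REQUIRED_PORTS.items():
        state = "in use" if port_in_use(port) else "free"
        print(f"{port} ({service}): {state}")
```

A port reported as "in use" before the stack starts usually means another process needs to be stopped or the host-side port in docker-compose.yml needs to change.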

3 Docker Compose File

docker-compose.yml (4 services)

version: '3.8'

services:
  # Prometheus - Metrics Collection
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'

  # Grafana - Visualization
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

  # Node Exporter - System Metrics
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro

  # Alertmanager - Alert Management
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:

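Notes

• Named volumes keep data even after a container is removed
• restart: unless-stopped is the recommended restart policy: containers restart automatically unless they were stopped explicitly
• Default Grafana credentials: admin / admin. Change the password before exposing Grafana beyond localhost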

4 Prometheus Configuration

prometheus/prometheus.yml

Config

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'docker-monitor'

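# Alertmanager endpoint that receives firing alerts
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Alert rules configuration
rule_files:
  - '/etc/prometheus/alert.rules.yml'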

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter - System Metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'docker-host'

  # Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']

prometheus/alert.rules.yml

Alert Rules

groups:
  - name: system_alerts
    interval: 30s
    rules:
      # High CPU Usage Alert
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Usage Detected"
          description: "CPU usage on {{ $labels.instance }} is above 80% (current: {{ $value }}%)"

      # High Memory Usage Alert
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Memory Usage Detected"
          description: "Memory usage on {{ $labels.instance }} is above 80% (current: {{ $value }}%)"

      # Disk Space Alert
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low Disk Space"
          description: "Disk space on {{ $labels.instance }} (mount: {{ $labels.mountpoint }}) is above 85% (current: {{ $value }}%)"

      # System Down Alert
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance Down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes"
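
The threshold expressions in these rules are plain percentage arithmetic. The sketch below reproduces that arithmetic outside Prometheus so the thresholds are easier to reason about. The function names and sample values are hypothetical, and the sketch ignores the for: clause, which additionally requires the condition to hold for the whole duration before the alert fires.

```python
from statistics import mean


def cpu_usage_percent(idle_rates):
    """100 minus the average idle share across cores, as in HighCPUUsage.

    idle_rates holds, per CPU core, the fraction of each second spent idle.
    """
    return 100 - mean(idle_rates) * 100


def used_percent(available_bytes, total_bytes):
    """(1 - available / total) * 100, as in HighMemoryUsage and DiskSpaceLow."""
    return (1 - available_bytes / total_bytes) * 100


if __name__ == "__main__":
    # Hypothetical sample values, not measurements from a real host.
    cpu = cpu_usage_percent([0.25, 0.125])          # two cores, mostly busy
    mem = used_percent(2 * 1024**3, 8 * 1024**3)    # 2 GiB available of 8 GiB
    disk = used_percent(10e9, 100e9)                # 10 GB free of 100 GB
    print(f"cpu={cpu:.2f}% above_threshold={cpu > 80}")
    print(f"mem={mem:.2f}% above_threshold={mem > 80}")
    print(f"disk={disk:.2f}% above_threshold={disk > 85}")
```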

5 Run the Containers

Start the containers

docker compose up -d

Check container status

docker compose ps

The output should list all 4 services with status Up

View logs for each service

# Follow logs from all services
docker compose logs -f

# Follow logs from a single service
docker compose logs -f prometheus
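
A container in state Up is not necessarily serving requests yet. As a rough complement to docker compose ps, the sketch below polls one HTTP endpoint per service from the Docker host. It assumes the default ports from this guide; HEALTH_URLS and is_healthy are illustrative names. Node Exporter has no dedicated health endpoint here, so its /metrics page is used instead.

```python
import urllib.request

# One endpoint per service, using the host ports published in docker-compose.yml.
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "node-exporter": "http://localhost:9100/metrics",
}


def is_healthy(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts all derive from OSError.
        return False


if __name__ == "__main__":
    for name, url in HEALTH_URLS.items():
        print(f"{name}: {'ok' if is_healthy(url) else 'NOT READY'} ({url})")
```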

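6 Configure the Data Source in Grafana

1. Open the Grafana dashboard

Open a browser at http://localhost:3000

Log in with: admin / admin

2. Add a data source

• Click Settings → Data sources (Connections → Data sources in newer Grafana releases)
• Click the Add data source button
• Select Prometheus

3. Configure it

• URL: http://prometheus:9090 (the Compose service name resolves inside the Compose network)
• If Grafana runs directly on the host instead of in the Compose network: http://localhost:9090
• Click Save & Test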
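
The same data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below is a minimal version of that call, assuming the default admin / admin credentials and the URLs used in this guide; build_datasource_request is a hypothetical helper name, and the data source name "Prometheus" is an arbitrary choice.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password,
                             prometheus_url="http://prometheus:9090"):
    """Build (but do not send) the request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "access": "proxy",      # Grafana's backend queries Prometheus, not the browser
        "url": prometheus_url,
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_datasource_request("http://localhost:3000", "admin", "admin")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(resp.status, resp.read().decode())
    except OSError as exc:
        # Covers an unreachable Grafana as well as HTTP errors such as 401 or 409.
        print(f"request failed: {exc}")
```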

7 Import Sample Dashboards

Recommended dashboards

Node Exporter Full (ID: 1860)

A comprehensive dashboard for inspecting system metrics

Prometheus Overview (ID: 10991)

An overview of Prometheus and its targets

Import steps

1. Click + in the top-left corner → Import (Dashboards → New → Import in newer Grafana releases)
2. Enter a dashboard ID (for example 1860) or upload a JSON file
3. Select Prometheus as the data source
4. Click Import

8 Configure Alertmanager

alertmanager/alertmanager.yml

Email Config

global:
  resolve_timeout: 5m
  # SMTP configuration for email alerts (these keys must sit under global)
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'your-email@gmail.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-app-password'

# Route configuration
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default-receiver'

  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical-receiver'
    - matchers:
        - severity="warning"
      receiver: 'warning-receiver'

# Receivers
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'admin@example.com'
        from: 'monitoring@example.com'

  - name: 'critical-receiver'
    email_configs:
      - to: 'admin@example.com,support@example.com'
        from: 'alerts@example.com'
        headers:
          subject: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'

  - name: 'warning-receiver'
    email_configs:
      - to: 'admin@example.com'
        from: 'alerts@example.com'
        headers:
          subject: '⚠️ WARNING: {{ .GroupLabels.alertname }}'

# Inhibition rules
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
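
The route block in the configuration above sends an alert to the first child route whose label conditions match and falls back to the top-level receiver otherwise. The sketch below mimics only that selection logic for the two severity routes. It is a simplified model, not Alertmanager code: it ignores grouping, timing, continue, and regex matchers, and pick_receiver is a hypothetical name.

```python
DEFAULT_RECEIVER = "default-receiver"

# (required labels, receiver) pairs in the same order as the routes list.
CHILD_ROUTES = [
    ({"severity": "critical"}, "critical-receiver"),
    ({"severity": "warning"}, "warning-receiver"),
]


def pick_receiver(labels):
    """Return the receiver of the first child route matching the alert labels."""
    for required, receiver in CHILD_ROUTES:
        if all(labels.get(key) == value for key, value in required.items()):
            return receiver
    # No child route matched, so the top-level route handles the alert.
    return DEFAULT_RECEIVER


if __name__ == "__main__":
    for alert in (
        {"alertname": "InstanceDown", "severity": "critical"},
        {"alertname": "HighCPUUsage", "severity": "warning"},
        {"alertname": "Custom", "severity": "info"},
    ):
        print(alert["alertname"], "->", pick_receiver(alert))
```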

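Connecting Alertmanager to Prometheus

Prometheus sends alerts to Alertmanager only when prometheus.yml contains an alerting block whose alertmanagers entry targets alertmanager:9093. The rule_files setting only loads the alert rules, and the alertmanager job under scrape_configs only collects Alertmanager's own metrics; neither one delivers alerts. Make sure the alerting block is present before testing notifications.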

9 Testing & Verification

Prometheus - localhost:9090 - check targets
Grafana - localhost:3000 - view dashboards
Alertmanager - localhost:9093 - check alerts

Check targets in Prometheus

1. Open http://localhost:9090/targets
2. Confirm that every target has the state UP
3. Wait at least 1-2 minutes for metrics to be collected
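
The target check above can also be scripted against the Prometheus HTTP API endpoint /api/v1/targets, which returns the same information as the Targets page. The following is a minimal sketch, assuming Prometheus is reachable on localhost:9090; unhealthy_targets and fetch_targets are illustrative names.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, instance, health) for every active target that is not up."""
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t["health"])
        for t in payload["data"]["activeTargets"]
        if t["health"] != "up"
    ]


def fetch_targets(base_url="http://localhost:9090", timeout=5.0):
    """Download and decode the JSON document from /api/v1/targets."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        problems = unhealthy_targets(fetch_targets())
    except OSError as exc:
        print(f"Prometheus not reachable: {exc}")
    else:
        if problems:
            for job, instance, health in problems:
                print(f"{job} {instance}: {health}")
        else:
            print("all active targets are up")
```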

Test alerting

To verify that Alertmanager works end to end:

1. Open http://localhost:9090/alerts
2. Check whether any alerts are active (pending or firing)
3. Alternatively, generate load on the host to trigger an alert (for example CPU or memory)
4. Check that the email notification arrives
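
Generating real load is not the only way to exercise the notification path. An alert can be posted straight to Alertmanager's v2 API (POST /api/v2/alerts), which tests routing and email delivery without involving Prometheus. The sketch below assumes Alertmanager on localhost:9093; the alert name ManualTest and the helper names are made up for illustration.

```python
import json
import urllib.request


def build_test_alert(alertname="ManualTest", severity="warning", instance="docker-host"):
    """Return the JSON body for /api/v2/alerts: a list with one alert object."""
    return [{
        "labels": {"alertname": alertname, "severity": severity, "instance": instance},
        "annotations": {"summary": "Manually posted test alert"},
    }]


def post_alerts(alerts, base_url="http://localhost:9093", timeout=5.0):
    """Send the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    try:
        print("status:", post_alerts(build_test_alert(severity="critical")))
    except OSError as exc:
        print(f"Alertmanager not reachable: {exc}")
```

Because no end time is set, Alertmanager treats the alert as resolved once it stops being re-sent and the resolve_timeout has elapsed.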

10 Tips

Docker Volumes

Use Docker volumes to persist the data of Prometheus, Grafana, and Alertmanager, and keep their config files in bind-mounted host directories, so nothing is lost when a container is removed or restarted

HTTPS / Reverse Proxy

Use NGINX or Caddy as a reverse proxy to enable HTTPS and to protect Grafana and its dashboards with additional authentication

Blackbox Exporter

Add blackbox_exporter to probe HTTP endpoints, ICMP ping, DNS, and TCP ports from the outside

Alert Routes

Split alert routes by severity (critical, warning, info) and assign a different receiver to each group

11 Scaling

Multiple Prometheus nodes (Federation)

For large systems, Prometheus federation lets a central server pull selected series from several Prometheus instances

# On the central Prometheus
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
        - 'prometheus-instance-1:9090'
        - 'prometheus-instance-2:9090'
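
With this job, the central server requests /federate on each target and passes every selector as a repeated match[] query parameter. The sketch below builds that URL by hand, which can be handy for testing a selector in a browser or with an HTTP client before adding it to the config. federate_url is a hypothetical helper, and the host name is the placeholder from the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(target, matchers):
    """Build the /federate URL for one target with repeated match[] parameters."""
    # doseq=True emits one match[]=... pair per list element.
    query = urlencode({"match[]": matchers}, doseq=True)
    return f"http://{target}/federate?{query}"


if __name__ == "__main__":
    selectors = ['{job="prometheus"}', '{__name__=~"job:.*"}']
    url = federate_url("prometheus-instance-1:9090", selectors)
    print(url)
    # Decoding the query string again recovers the original selectors.
    print(parse_qs(urlsplit(url).query)["match[]"])
```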

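Remote Write to a Central Prometheus

Send metrics from multiple instances to a central Prometheus server through remote_write. The central server accepts these writes only when it is started with --web.enable-remote-write-receiver

# On each downstream Prometheus instance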
remote_write:
  - url: 'http://central-prometheus:9090/api/v1/write'
    basic_auth:
      username: 'username'
      password: 'password'

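12 Cleanup

Warning

The commands that remove volumes delete all stored data. Back up anything important before running them

Stop and remove the containers (data is kept)

docker compose down

Stop and remove the containers and volumes (deletes everything)

docker compose down --volumes

Remove specific volumes (the wiki_ prefix is the Compose project name, which defaults to the directory name; confirm the actual names with docker volume ls)

docker volume rm wiki_prometheus-data wiki_grafana-data wiki_alertmanager-data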
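
Compose names each volume as the project name, an underscore, and the key from the volumes section of docker-compose.yml. The sketch below only illustrates that naming rule so the right names can be passed to docker volume rm. The project name must be supplied by the reader, and compose_volume_names is a hypothetical helper.

```python
# Volume keys declared at the bottom of docker-compose.yml in this guide.
VOLUME_KEYS = ["prometheus-data", "grafana-data", "alertmanager-data"]


def compose_volume_names(project):
    """Return the full volume names Compose creates for the given project name."""
    return [f"{project}_{key}" for key in VOLUME_KEYS]


if __name__ == "__main__":
    # "wiki" matches the example command above; substitute the real project name.
    print(" ".join(compose_volume_names("wiki")))
```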