AIOps: Artificial Intelligence for IT Operations

1. Introduction: What Is AIOps?

AIOps (Artificial Intelligence for IT Operations) applies artificial intelligence (AI) and machine learning (ML) to automate the management of IT systems, helping teams detect problems, analyze their causes, and resolve them faster.

The AIOps market is growing quickly: it is forecast to expand from USD 16.42B in 2025 to USD 36.60B by 2030, a compound annual growth rate of 17.39%.

Today, 55% of organizations worldwide have started using AIOps, especially in industries with complex IT systems such as banking, telecommunications, and e-commerce.

Automatic detection

Detects anomalies (Anomaly Detection) without thresholds having to be set in advance.

Correlation analysis

Links related events together to find the root cause (Root Cause Analysis).

Automatic remediation

Auto-remediation fixes common, recurring problems automatically.

2. Benefits and Advantages of AIOps

Headline numbers

50%

Less downtime

80%

Lower OpEx

95%

Less alert noise

130%

ROI

Less alert fatigue

Instead of thousands of alerts per day, only the 10-20 that really matter remain. The AI automatically filters alerts and merges related ones.
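As a rough illustration of the grouping idea (not the algorithm of any particular product), the sketch below collapses alerts that share a service and alert name and arrive within five minutes of the previous one. The `Alert` class, the `group_alerts` function, and the 300-second window are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    name: str
    timestamp: float  # seconds since epoch


def group_alerts(alerts, window_seconds=300):
    """Group alerts with the same (service, name) whose gap to the
    previous alert in the group is at most window_seconds."""
    groups = []
    latest_group = {}  # (service, name) -> index of the most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.name)
        idx = latest_group.get(key)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alert)  # same incident, absorb the alert
        else:
            groups.append([alert])  # start a new incident
            latest_group[key] = len(groups) - 1
    return groups


alerts = [
    Alert("api", "HighCPU", 0),
    Alert("api", "HighCPU", 60),
    Alert("db", "DiskFull", 90),
    Alert("api", "HighCPU", 1000),
]
groups = group_alerts(alerts)
print(len(alerts), "alerts ->", len(groups), "groups")
```

Each group can then be reported as a single notification, which is where the reduction in noise comes from.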

Shorter mean time to resolution (MTTR)

AI speeds up root-cause analysis and recommends fixes based on how similar problems were resolved in the past.

Predicting problems in advance (Predictive)

Detects patterns that signal a problem before it occurs, for example forecasting that a disk will fill up in 3 days.
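One simple way to produce a forecast like "the disk will be full in 3 days" is to fit a straight line to recent usage samples and extrapolate it to the disk's capacity. The sketch below does this with ordinary least squares in plain Python. The function name `seconds_until_full` and the sample data are hypothetical, and real disk usage is rarely perfectly linear.

```python
def seconds_until_full(samples, capacity):
    """samples: list of (timestamp_seconds, used) pairs, in time order.

    Returns the seconds from the last sample until the fitted line reaches
    capacity, or None when usage is flat, shrinking, or there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope and intercept of used = slope * t + intercept
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


DAY = 86400
# Hypothetical daily samples: usage grows by 1 GB per day, 100 GB disk
samples = [(day * DAY, 90.0 + day) for day in range(8)]
remaining = seconds_until_full(samples, capacity=100.0)
print(f"Disk full in about {remaining / DAY:.1f} days")
```

This is the same idea that PromQL's `predict_linear` function applies to a range of samples.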

3. Open Source Tools for AIOps

This article uses only open source tools, which lets you keep control of your own data (Data Sovereignty) and avoids license fees.

Prometheus - Time-Series Database

Stores metrics collected from your systems and supports PromQL for querying them. It recorded 165.7M downloads in 2023.

Metrics Collection, PromQL, Alerting

Grafana - Visualization & ML

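Dashboards for visualizing data, with a Machine Learning plugin for anomaly detection and forecasting. Grafana's ML features are delivered through Grafana Cloud, so confirm they are available in your edition before relying on them in a self-hosted install.

Dashboards, ML Plugins, Alerting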

ELK Stack (Elasticsearch, Logstash, Kibana)

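Collects and analyzes logs from every system, with built-in ML for detecting anomalies in logs. In Elasticsearch, ML anomaly detection is a subscription feature, so check which license tier you need.

Log Analytics, Full-text Search, ML Anomaly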

Keep - Alert Correlation Platform

Aggregates alerts from every source and uses AI for deduplication, correlation, and root cause analysis.

Alert Management, AI Correlation, Noise Reduction

Seldon Core - ML Model Serving

Deploys ML models (PyTorch, TensorFlow, Scikit-learn) as microservices on Kubernetes.

Model Serving, Kubernetes, A/B Testing

4. AIOps System Architecture

The AIOps architecture is divided into 3 main layers: Data Collection, Data Processing, and ML/Analytics.

Figure: AIOps pipeline architecture diagram
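To make the three layers concrete, here is a toy sketch in which each layer is a function and the pipeline is their composition. Everything in it is hypothetical: the record layout, the hard-coded sample data, and the fixed 90% rule standing in for a trained model.

```python
def collect():
    """Data Collection layer: stand-in for metrics pulled from monitoring tools."""
    return [
        {"source": "prometheus", "service": "api", "cpu": 0.42},
        {"source": "prometheus", "service": "db", "cpu": 0.97},
        {"source": "prometheus", "service": "cache"},  # malformed: no value
    ]


def process(records):
    """Data Processing layer: drop malformed rows and normalise units to percent."""
    return [
        {**record, "cpu_pct": round(record["cpu"] * 100, 1)}
        for record in records
        if "cpu" in record
    ]


def analyze(records, limit_pct=90.0):
    """ML/Analytics layer: placeholder rule where a trained model would sit."""
    return [record["service"] for record in records if record["cpu_pct"] > limit_pct]


def run_pipeline():
    return analyze(process(collect()))


print("Services needing attention:", run_pipeline())
```

In the stack described in this article, Prometheus and the ELK Stack would fill the collection role, while Keep and the ML models would sit in the analytics layer.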

5. Installation and Usage

Prerequisites

Ubuntu 22.04 LTS or later
At least 8GB of RAM (16GB recommended)
At least 100GB of storage
Docker & Docker Compose
Kubernetes (optional)
kubectl, helm

Phase 1: Install Prometheus and Grafana

Start by installing the base stack for collecting and visualizing metrics.

docker-compose.yml

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
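    environment:
      # Example password only; change it before exposing Grafana anywhere
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      # Grafana ML app plugin; it may need a Grafana Cloud connection to function
      - GF_INSTALL_PLUGINS=grafana-ml-app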
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

rule_files: []

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

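  # Kubernetes service discovery: this job only works when Prometheus runs
  # inside a cluster (or api_server and credentials are configured).
  # Remove it for a Docker Compose-only setup.
  - job_name: 'kubernetes-pods'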
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Phase 2: Set Up Anomaly Detection in Grafana

Use the Grafana Machine Learning plugin to detect anomalies, or express simple statistical rules directly in PromQL as in the queries below.

PromQL Queries for Anomaly Detection

# Detect abnormal CPU usage
# z-score style rule: flag when the last hour's average is more than
# 2 standard deviations above the 7-day average.
# A range over an expression needs subquery syntax: [range:resolution]

avg_over_time(
  rate(container_cpu_usage_seconds_total[5m])[1h:1m]
) >
avg_over_time(
  rate(container_cpu_usage_seconds_total[5m])[7d:5m]
) + 2 * stddev_over_time(
  rate(container_cpu_usage_seconds_total[5m])[7d:5m]
)

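# Detect memory leaks
# Memory usage is rising and is on track to pass 90% of the limit within the next hour

deriv(
  container_memory_usage_bytes[1h]
) > 0
and
  predict_linear(
    container_memory_usage_bytes[1h],
    3600
  ) > container_spec_memory_limit_bytes * 0.9
and
  # skip containers without a memory limit (reported as 0)
  container_spec_memory_limit_bytes > 0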

# Detect abnormally high response time:
# the current p95 latency is more than double its 7-day average
histogram_quantile(0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
) >
avg_over_time(
  histogram_quantile(0.95,
    sum by (le) (
      rate(http_request_duration_seconds_bucket[5m])
    )
  )[7d:5m]
) * 2
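The first query above is a z-score style rule: the recent average is compared against the long-term average plus a multiple of the long-term standard deviation. For readers who want to see the arithmetic outside PromQL, here is a minimal Python sketch of the same test. The function name `is_anomalous` and the sample values are made up for illustration; `pstdev` is used because PromQL's `stddev_over_time` computes the population standard deviation.

```python
from statistics import mean, pstdev


def is_anomalous(recent, baseline, k=2.0):
    """True when mean(recent) exceeds mean(baseline) + k * stddev(baseline)."""
    if not recent or len(baseline) < 2:
        return False  # not enough data to judge
    threshold = mean(baseline) + k * pstdev(baseline)
    return mean(recent) > threshold


# Hypothetical CPU usage samples (fraction of one core)
baseline = [0.20, 0.22, 0.18, 0.21, 0.19, 0.20]
print(is_anomalous([0.50, 0.60], baseline))  # well above the baseline
print(is_anomalous([0.21, 0.20], baseline))  # within normal variation
```

Raising `k` makes the rule less sensitive, which trades fewer false alarms for slower detection.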

Phase 3: Install Keep for Alert Correlation

Keep gathers alerts from multiple sources and uses AI to reduce noise.

Install Keep with Docker

# Clone the Keep repository
git clone https://github.com/keephq/keep.git
cd keep

# Run with Docker Compose
# (Compose file names can differ between Keep releases; use the one present in the repository)
docker-compose -f docker-compose-dev.yml up -d

# Open http://localhost:8080

keep-workflow.yml - Alert Correlation

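# Illustrative workflow: the step types and fields below sketch the intended
# logic; check Keep's workflow documentation for the exact schema.
workflow:
  id: aiops-alert-correlation
  name: AIOps Alert Correlation
  description: Automatically merge and filter alerts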
  
  triggers:
    - type: alert
      filters:
        - key: severity
          value: critical|warning

  steps:
    - name: check_duplicate
      type: dedupe
      with:
        time_window: 300  # 5 minutes
        group_by:
          - service
          - alert_name

    - name: correlate_events
      type: correlate
      with:
        algorithm: similarity
        threshold: 0.8
        lookback: 900  # 15 minutes

    - name: enrich_with_context
      type: enrich
      with:
        add_silence_info: true
        add_runbook_link: true
        add_service_dependency: true

    - name: calculate_priority
      type: ml_score
      with:
        model: priority_classifier
        features:
          - severity
          - frequency
          - business_impact

  actions:
    - name: notify_slack
      condition: steps.calculate_priority.score > 0.7
      type: slack
      with:
        channel: '#ops-alerts'
        message: |
          🚨 Alert: {{ alert.name }}
          Service: {{ alert.service }}
          Priority: {{ steps.calculate_priority.score }}
          Runbook: {{ steps.enrich_with_context.runbook }}

    - name: auto_remediate
      condition: steps.calculate_priority.score > 0.9
      type: webhook
      with:
        url: http://auto-remediation:8080/trigger
        payload:
          # Quote template expressions; an unquoted {{ ... }} is parsed as a YAML mapping
          alert_id: "{{ alert.id }}"
          service: "{{ alert.service }}"

6. Code Examples

Python: Isolation Forest Anomaly Detection

An ML model that detects anomalies using Isolation Forest.

anomaly_detector.py

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from prometheus_api_client import PrometheusConnect
import joblib
from datetime import datetime, timedelta

class AIOpsAnomalyDetector:
    """
    AIOps anomaly detector that uses Isolation Forest
    to find anomalies in metrics.
    """
    
    def __init__(self, prometheus_url: str, contamination: float = 0.05):
        """
        Args:
            prometheus_url: URL of the Prometheus server
            contamination: expected fraction of outliers (default: 5%)
        """
        self.prom = PrometheusConnect(url=prometheus_url, disable_ssl=True)
        self.model = IsolationForest(
            n_estimators=100,
            contamination=contamination,
            random_state=42,
            n_jobs=-1
        )
        self.feature_columns = None
        
    def fetch_metrics(self,
                      metric_name: str,
                      hours_back: int = 24,
                      step: str = "1m") -> pd.DataFrame:
        """
        Fetch metrics from Prometheus.
        
        Args:
            metric_name: metric name or PromQL expression
                (e.g. 'node_cpu_seconds_total')
            hours_back: how many hours of history to fetch
            step: query resolution; Prometheus rejects range queries that
                return more than 11,000 points per series, so use a coarser
                step (e.g. "5m") for windows longer than about 7 days
        """
        end_time = datetime.now()
        start_time = end_time - timedelta(hours=hours_back)
        
        metric_data = self.prom.custom_query_range(
            query=metric_name,
            start_time=start_time,
            end_time=end_time,
            step=step
        )
        
        # Convert the result into a DataFrame, one row per sample
        records = []
        for item in metric_data:
            labels = item['metric']
            for value in item['values']:
                records.append({
                    'timestamp': pd.to_datetime(value[0], unit='s'),
                    'value': float(value[1]),
                    **labels
                })
        
        return pd.DataFrame(records)
    
    def prepare_features(self, df: pd.DataFrame) -> np.ndarray:
        """
        Build the feature matrix for the ML model.
        
        Expects a single time series: lag and rolling features are
        meaningless if rows from several series are interleaved.
        """
        # Order samples by time so lags and rolling windows are correct
        df = df.sort_values('timestamp').copy()
        
        # Lag features
        df['value_lag_1'] = df['value'].shift(1)
        df['value_lag_5'] = df['value'].shift(5)
        df['value_lag_15'] = df['value'].shift(15)
        
        # Rolling statistics
        df['rolling_mean_5'] = df['value'].rolling(window=5).mean()
        df['rolling_std_5'] = df['value'].rolling(window=5).std()
        df['rolling_mean_15'] = df['value'].rolling(window=15).mean()
        
        # Rate of change (a zero previous sample yields +/-inf)
        df['rate_of_change'] = df['value'].pct_change()
        
        self.feature_columns = [
            'value', 'value_lag_1', 'value_lag_5', 'value_lag_15',
            'rolling_mean_5', 'rolling_std_5', 'rolling_mean_15',
            'rate_of_change'
        ]
        
        # Drop rows with NaN or infinite features, which the model cannot handle
        df = df.replace([np.inf, -np.inf], np.nan)
        df = df.dropna(subset=self.feature_columns)
        
        return df[self.feature_columns].values
    
    def train(self, X: np.ndarray):
        """
        Train Isolation Forest Model
        """
        print(f"Training Isolation Forest with {len(X)} samples...")
        self.model.fit(X)
        print("Training complete!")
        
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict whether each sample is an anomaly.
        
        Returns:
            -1 = Anomaly, 1 = Normal
        """
        return self.model.predict(X)
    
    def get_anomaly_score(self, X: np.ndarray) -> np.ndarray:
        """
        Compute anomaly scores (the lower the score, the more abnormal).
        """
        return self.model.score_samples(X)
    
    def save_model(self, filepath: str):
        """Save the model and its feature column list."""
        joblib.dump({
            'model': self.model,
            'feature_columns': self.feature_columns
        }, filepath)
        print(f"Model saved to {filepath}")
    
    def load_model(self, filepath: str):
        """Load the model and its feature column list."""
        data = joblib.load(filepath)
        self.model = data['model']
        self.feature_columns = data['feature_columns']
        print(f"Model loaded from {filepath}")


# Example usage
if __name__ == "__main__":
    # Create the detector
    detector = AIOpsAnomalyDetector(
        prometheus_url="http://localhost:9090",
        contamination=0.05  # expect roughly 5% outliers
    )
    
    # Fetch CPU usage as one aggregated series (busy fraction across all cores);
    # the raw per-core, per-mode series would be interleaved and distort the lag features
    df = detector.fetch_metrics(
        metric_name='1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
        hours_back=168  # 7 days
    )
    
    # Prepare features
    X = detector.prepare_features(df)
    
    # Train the model
    detector.train(X)
    
    # Save the model
    detector.save_model("anomaly_model.pkl")
    
    # Score the most recent data
    predictions = detector.predict(X[-100:])  # latest 100 samples
    scores = detector.get_anomaly_score(X[-100:])
    
    # Find anomalies
    anomaly_indices = np.where(predictions == -1)[0]
    print(f"Found {len(anomaly_indices)} anomalies in last 100 samples")
    
    # Alert when more than 5 anomalies are found
    if len(anomaly_indices) > 5:
        print("⚠️ ALERT: High number of anomalies detected!")
        print("Sending notification to Slack...")

Kubernetes: Deploy Seldon Core Model

Deploy the ML model on Kubernetes with Seldon Core.

anomaly-detector.yaml

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: aiops-anomaly-detector
  namespace: aiops
spec:
  predictors:
  - name: default
    replicas: 2
    graph:
      name: anomaly-model
      implementation: SKLEARN_SERVER
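      # The SKLearn server loads a file named model.joblib from this location;
      # it must contain the bare estimator, not a wrapper dict
      modelUri: s3://ml-models/anomaly-detector
      envSecretRefName: s3-credentials
      parameters:
        - name: method
          type: STRING
          # IsolationForest has no predict_proba; predict returns -1 or 1
          value: predict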
    componentSpecs:
    - spec:
        containers:
        - name: anomaly-model
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
---
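# Service for calling the model
# Seldon Core also creates its own Service for the deployment. The selector
# below assumes the pods carry the seldon-deployment-id label; confirm with
# `kubectl get pods -n aiops --show-labels`.
apiVersion: v1
kind: Service
metadata:
  name: anomaly-detector
  namespace: aiops
spec:
  selector:
    seldon-deployment-id: aiops-anomaly-detector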
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP

Apache Airflow: AIOps Pipeline DAG

Build an automated pipeline for AIOps.

aiops_pipeline.py

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
import requests
import pandas as pd

default_args = {
    'owner': 'aiops',
    'depends_on_past': False,
    'start_date': datetime(2025, 1, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'aiops_anomaly_detection_pipeline',
    default_args=default_args,
    schedule_interval='*/5 * * * *',  # every 5 minutes (use `schedule=` on Airflow 2.4+)
    catchup=False,
    tags=['aiops', 'monitoring', 'ml']
) as dag:

    def fetch_prometheus_metrics(**context):
        """Fetch the current metric values from Prometheus."""
        prometheus_url = "http://prometheus:9090"
        query = 'rate(container_cpu_usage_seconds_total[5m])'
        
        response = requests.get(
            f"{prometheus_url}/api/v1/query",
            params={'query': query},
            timeout=30
        )
        response.raise_for_status()
        
        data = response.json()['data']['result']
        metrics_df = pd.DataFrame([
            {
                'metric': item['metric'],
                'value': float(item['value'][1]),
                'timestamp': item['value'][0]
            }
            for item in data
        ])
        
        # XCom values must be JSON-serializable, so pass a list of row dicts
        # rather than the DataFrame itself
        records = metrics_df.to_dict(orient='records')
        context['ti'].xcom_push(key='metrics', value=records)
        return len(records)

    def detect_anomalies(**context):
        """Call the Seldon model to detect anomalies."""
        metrics = context['ti'].xcom_pull(key='metrics', task_ids='fetch_metrics') or []
        
        # Call the Seldon model
        seldon_url = "http://anomaly-detector:8000/api/v1.0/predictions"
        
        # One row per series. The row layout must match the features the
        # model was trained on; a single-feature model is assumed here.
        payload = {
            "data": {
                "ndarray": [[m['value']] for m in metrics]
            }
        }
        
        response = requests.post(seldon_url, json=payload, timeout=30)
        response.raise_for_status()
        predictions = response.json()['data']['ndarray']
        
        # Collect anomalies (prediction -1); a prediction may arrive as a
        # bare value or as a one-element row
        anomalies = [
            metrics[i]['metric']
            for i, pred in enumerate(predictions)
            if (pred[0] if isinstance(pred, list) else pred) == -1
        ]
        
        context['ti'].xcom_push(key='anomalies', value=anomalies)
        return anomalies

    def check_and_alert(**context):
        """Check the results and send an alert."""
        anomalies = context['ti'].xcom_pull(key='anomalies', task_ids='detect_anomalies') or []
        
        if anomalies:
            # Forward to Keep for correlation
            # (endpoint path as given in this article; confirm it against Keep's API docs)
            keep_url = "http://keep:8080/api/alerts"
            alert_payload = {
                "source": "airflow-aiops",
                "severity": "warning",
                "message": f"Detected {len(anomalies)} anomalies",
                "details": anomalies
            }
            requests.post(keep_url, json=alert_payload, timeout=30)
        
        return len(anomalies)

    # Task definitions (named *_task so they do not shadow the callables above)
    fetch_task = PythonOperator(
        task_id='fetch_metrics',
        python_callable=fetch_prometheus_metrics,
    )

    detect_task = PythonOperator(
        task_id='detect_anomalies',
        python_callable=detect_anomalies,
    )

    alert_task = PythonOperator(
        task_id='check_and_alert',
        python_callable=check_and_alert,
    )

    # Pipeline
    fetch_task >> detect_task >> alert_task

7. Use Cases: Real-World Applications

1. Infrastructure Monitoring

Detect server problems before a crash

• Forecast how many days remain until a disk fills up
• Detect memory leaks automatically
• Alert when CPU usage is abnormally high

2. Application Performance

Detect application problems before users complain

• Abnormally high response time
• Rising error rate
• Slow database queries

3. Security Monitoring

Detect suspicious behavior

• Abnormal numbers of failed login attempts
• Traffic spikes from unfamiliar IPs
• Unusual API access patterns

4. Cloud Cost Optimization

Reduce cloud spending with AI

• Identify underutilized resources
• Recommend right-sizing
• Forecast costs in advance

8. Troubleshooting Common Problems

Problem: Prometheus is not collecting metrics

Cause: the target is unreachable, or scrape_interval is not set appropriately

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check that service discovery is working
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels'
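The same check can be scripted. The sketch below reads the `/api/v1/targets` response and lists every active target whose `health` is not `up`, together with its last scrape error. The function names are hypothetical, and the base URL assumes the Docker Compose setup from Phase 1.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Download the target list from the Prometheus HTTP API."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(targets_response):
    """Return (job, scrape_url, last_error) for each active target that is not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("scrapeUrl", "?"),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]
```

Calling `unhealthy_targets(fetch_targets())` against a running Prometheus returns an empty list when every target is being scraped successfully.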
                                

Problem: Too many false positives

Approach: tune the contamination parameter and add more training data

# Lower contamination from 0.05 to 0.02
detector = AIOpsAnomalyDetector(
    prometheus_url="http://localhost:9090",
    contamination=0.02  # down from 5% to 2%
)

# Extend the training data from 7 days to 30 days
# (at 30 days a 1m step exceeds Prometheus's limit of 11,000 points per series
# per range query, so fetch with a coarser step or in smaller chunks)
df = detector.fetch_metrics(
    metric_name='rate(node_cpu_seconds_total[5m])',
    hours_back=720  # 30 days
)
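Why lowering contamination reduces false positives: the parameter effectively picks a cut-off on the anomaly scores so that roughly that fraction of the training samples falls below it. The plain-Python sketch below illustrates the idea only; it is not scikit-learn's implementation, and the function names and scores are made up.

```python
def threshold_for_contamination(scores, contamination):
    """Pick the score at or below which about `contamination` of the scores fall.

    Lower scores mean more anomalous, matching score_samples().
    """
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    ordered = sorted(scores)
    n_flagged = max(1, round(len(ordered) * contamination))
    return ordered[n_flagged - 1]


def count_flagged(scores, contamination):
    """Number of samples that would be reported as anomalies."""
    threshold = threshold_for_contamination(scores, contamination)
    return sum(1 for score in scores if score <= threshold)


# 100 hypothetical, distinct anomaly scores between -0.400 and -0.499
scores = [-(0.400 + i / 1000) for i in range(100)]
for contamination in (0.05, 0.02):
    print(contamination, "->", count_flagged(scores, contamination), "flagged")
```

Because the cut-off is a fixed fraction, the model flags about that share of the training data whether or not it is truly abnormal, so the value should reflect how often real incidents occur.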
                                

Problem: Elasticsearch uses too much disk

Approach: configure Index Lifecycle Management

# Create an ILM policy
PUT _ilm/policy/aiops_logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {"number_of_shards": 1},
          "forcemerge": {"max_num_segments": 1}
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
                                

Summary

AIOps is the future of IT operations management, using AI and machine learning to:

Detect problems automatically - no thresholds to set in advance
Cut alert noise by 95% - from thousands of alerts down to 10-20 important ones
Cut MTTR by 50% - find root causes and fix problems faster
Use open source for free - Prometheus, Grafana, ELK, Keep
