DevOps Breaks, Breaks Again, and Keeps Breaking EP.9: Monitoring & Observability

Hello everyone! This is EP.9, and we're almost at the end of the series!

When we talk about monitoring, what comes to mind? People who exercise already monitor themselves every day. When you train for HYROX, you watch your heart rate the whole time. If your HR sits in zone 5 constantly, it may mean overtraining or under-recovery. That data helps you make better decisions than guessing would.

Monitoring an IT system works the same way. Without data, we don't know whether the system is healthy, where it is about to break, or whether it broke a long time ago without anyone noticing.

Monitoring vs Observability: how do they differ?

Monitoring means watching metrics we defined in advance, such as CPU usage, error rate, and response time. If a metric crosses its threshold, an alert goes out.
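To make the "predefined metrics plus thresholds" idea concrete, here is a minimal sketch in Python. The metric names and threshold values are made up for illustration; they are not recommendations and do not come from any particular tool.

```python
# Hypothetical thresholds; the numbers are illustrative only.
THRESHOLDS = {
    "cpu_usage_percent": 90.0,
    "error_rate_percent": 1.0,
    "response_time_ms": 500.0,
}


def check_thresholds(metrics, thresholds=THRESHOLDS):
    """Return a sorted list of metric names whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )


if __name__ == "__main__":
    current = {
        "cpu_usage_percent": 72.0,
        "error_rate_percent": 2.5,
        "response_time_ms": 120.0,
    }
    for name in check_thresholds(current):
        print(f"ALERT: {name} is above its threshold")
```

Notice that this check can only answer questions we thought of ahead of time, which is exactly the limitation of plain monitoring.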

Observability is the ability to "understand" how a system is behaving by looking at its outputs, so that we can also answer questions we did not anticipate in advance.

The simple difference: monitoring tells you "there is a problem", while observability tells you "where the problem is and what caused it".

Three Pillars of Observability

1. Metrics

Measurable numbers recorded over time, such as CPU 72%, request rate 500 req/s, or error rate 0.1%.

Metrics are well suited to spotting trends and to setting alerts when a number looks abnormal.
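A number like "500 req/s" is usually derived from a counter that only ever goes up, sampled at regular intervals. The sketch below shows that arithmetic on made-up samples. It is a simplification: it ignores counter resets, which real systems such as Prometheus handle for you.

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last sample.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


# Made-up samples of a request counter, taken 15 seconds apart.
request_samples = [(0, 1000), (15, 8500), (30, 16000)]
print(per_second_rate(request_samples))  # 500.0
```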

2. Logs

Messages the system writes to record what happened, when, and where. They are especially useful when debugging, because they show what was going on just before the system broke.

2026-06-01 10:23:41 INFO  User login: user_id=1234
2026-06-01 10:23:42 INFO  Payment initiated: amount=500
2026-06-01 10:23:43 ERROR Database timeout: connection refused
2026-06-01 10:23:43 ERROR Payment failed: retry attempt 1
2026-06-01 10:23:53 ERROR Payment failed: max retries exceeded
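Logs in a consistent format like the one above are easy to process with code. As a small sketch, assuming every line follows the "date time LEVEL message" layout shown here, we can parse the lines and keep only the errors:

```python
import re

# Matches lines such as: 2026-06-01 10:23:43 ERROR Database timeout: connection refused
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into timestamp, level, and message; None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def errors_only(lines):
    """Return the parsed entries whose level is ERROR, in their original order."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "ERROR"]


SAMPLE = """\
2026-06-01 10:23:41 INFO  User login: user_id=1234
2026-06-01 10:23:42 INFO  Payment initiated: amount=500
2026-06-01 10:23:43 ERROR Database timeout: connection refused
2026-06-01 10:23:43 ERROR Payment failed: retry attempt 1
2026-06-01 10:23:53 ERROR Payment failed: max retries exceeded
"""

if __name__ == "__main__":
    for entry in errors_only(SAMPLE.splitlines()):
        print(entry["timestamp"], entry["message"])
```

Reading the errors in order makes the story clear: the database timed out first, and the payment failures followed from it.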

3. Traces

A trace follows a single request from start to finish across the services it touches. Traces are a good fit for microservices, where a request may pass through 5-10 services before a response comes back.

User request: GET /checkout
└── [2ms] API Gateway
    └── [5ms] Order Service
        ├── [10ms] Inventory Service
        ├── [8ms] Payment Service
        │   └── [50ms] Bank API (slow!)
        └── [3ms] Notification Service
Total: 78ms (slow because of the Bank API)
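The same trace can be treated as data. The sketch below assumes each bracketed number is the time spent inside that service itself and that the calls run one after another, which is how the numbers add up to the 78ms total. The `Span` class is a made-up structure for this example, not the data model of any tracing library.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: int  # time spent in this service itself, as labeled in the trace


# The spans from the GET /checkout trace above.
SPANS = [
    Span("API Gateway", 2),
    Span("Order Service", 5),
    Span("Inventory Service", 10),
    Span("Payment Service", 8),
    Span("Bank API", 50),
    Span("Notification Service", 3),
]


def total_ms(spans):
    """Total request time, assuming the spans run sequentially."""
    return sum(span.duration_ms for span in spans)


def slowest(spans):
    """The span that contributes the most time."""
    return max(spans, key=lambda span: span.duration_ms)


if __name__ == "__main__":
    print(total_ms(SPANS))         # 78
    print(slowest(SPANS).service)  # Bank API
```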

A popular stack: Prometheus and Grafana

Prometheus

Prometheus is a time-series database for metrics. It "pulls" data from each service's /metrics endpoint at a configured interval.

# prometheus.yml
global:
  scrape_interval: 15s    # scrape metrics every 15 seconds

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app:8000']    # pull metrics from this endpoint

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']    # server (host) metrics

Our app then exposes a /metrics endpoint for Prometheus to pull data from.

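# Python app using the prometheus_client library (Flask is assumed for routing)
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)
users = [{"id": 1234}]  # placeholder data for this example

# Count the number of requests
REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['method', 'endpoint'])

# Measure response time
REQUEST_LATENCY = Histogram('app_request_latency_seconds', 'Request latency')

@app.route('/users')
@REQUEST_LATENCY.time()
def get_users():
    REQUEST_COUNT.labels(method='GET', endpoint='/users').inc()
    return jsonify(users)

if __name__ == '__main__':
    start_http_server(8000)  # serve /metrics on port 8000, the target in prometheus.yml
    app.run(port=5000)       # the app itself listens on a separate port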
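What Prometheus receives when it scrapes /metrics is plain text, one sample per line. The sketch below renders a counter in a simplified version of that text format using only the standard library, just to show the shape of the data. The real prometheus_client output contains more than this, and `render_counter` is a made-up helper, not part of the library.

```python
def render_counter(name, help_text, samples):
    """Render a counter in a simplified version of the Prometheus text format.

    samples: dict mapping a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {float(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Pretend /users has been requested three times.
    samples = {(("method", "GET"), ("endpoint", "/users")): 3}
    print(render_counter("app_requests_total", "Total requests", samples), end="")
```

Each distinct combination of label values becomes its own time series, which is why labels with many possible values (such as user IDs) are usually avoided.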

Grafana

Grafana is a visualization tool that pulls data from Prometheus (and other data sources) and displays it as dashboards.

With Grafana we can build a dashboard that shows request rate, error rate, latency, CPU, and memory on a single page that the team can open at any time.

# docker-compose.yml for the monitoring stack
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secret
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  grafana_data:

Alerting: automatic notifications

Good monitoring needs good alerting too. Nobody can sit and stare at a Grafana screen 24 hours a day.

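# alerting rules in Prometheus
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Share of 500 responses among all requests; assumes app_requests_total has a status label
        expr: sum(rate(app_requests_total{status="500"}[5m])) / sum(rate(app_requests_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Error rate above 5%"

      - alert: HighLatency
        # histogram_quantile works on per-bucket rates grouped by the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(app_request_latency_seconds_bucket[5m]))) > 1
        for: 5m
        annotations:
          summary: "p95 response time above 1 second"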

When an alert fires, Prometheus sends it to Alertmanager, which forwards it to Slack, PagerDuty, email, or whatever else the team uses.
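The `for:` field in the rules is worth a closer look: an alert does not fire the moment its expression becomes true. It stays pending until the condition has held for the whole duration, which filters out short spikes. Here is a rough sketch of that behavior, with made-up evaluation times; it models the idea only and is not Prometheus's actual implementation.

```python
def alert_states(condition_by_time, for_seconds):
    """Return a list of (timestamp, state) pairs, one per rule evaluation.

    condition_by_time: list of (timestamp_seconds, bool), oldest first.
    State is "inactive", "pending", or "firing".
    """
    states = []
    active_since = None  # when the condition most recently became true
    for timestamp, condition in condition_by_time:
        if not condition:
            active_since = None
            states.append((timestamp, "inactive"))
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append((timestamp, "firing" if held >= for_seconds else "pending"))
    return states


if __name__ == "__main__":
    # One evaluation per minute; the condition holds for three evaluations, then clears.
    evaluations = [(0, True), (60, True), (120, True), (180, False)]
    for timestamp, state in alert_states(evaluations, for_seconds=120):
        print(timestamp, state)
```

With `for_seconds=120`, the first two evaluations are pending, the third fires, and the fourth goes back to inactive.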

Log Management

Logs from hundreds of containers are scattered across the cluster. We need to collect them in one place so we can search and analyze them.

Loki + Grafana

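Loki is a log aggregation system from Grafana Labs, designed to work well alongside Grafana. Its advantage is that it indexes only the labels attached to each log stream rather than the full log content the way Elasticsearch does, which makes it much cheaper to run.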
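To see why indexing only labels is cheaper, here is a toy model of the idea. Logs are grouped into streams by their labels, a query first narrows down to matching streams, and only then scans the text of those lines. The class and its methods are invented for this sketch and say nothing about how Loki is actually built.

```python
from collections import defaultdict


class LabelIndexedLogs:
    """Toy log store: only labels are indexed, log text is scanned at query time."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self._streams = defaultdict(list)

    def push(self, labels, line):
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Pick streams whose labels include the selector, then filter lines by substring."""
        wanted = set(selector.items())
        return [
            line
            for stream_labels, lines in self._streams.items()
            if wanted <= stream_labels
            for line in lines
            if contains in line
        ]


if __name__ == "__main__":
    store = LabelIndexedLogs()
    store.push({"app": "payment", "env": "prod"}, "ERROR Database timeout: connection refused")
    store.push({"app": "payment", "env": "prod"}, "INFO Payment initiated: amount=500")
    store.push({"app": "auth", "env": "prod"}, "ERROR Invalid token")
    print(store.query({"app": "payment"}, contains="ERROR"))
```

The trade-off is that a text search has to scan every line in the selected streams, so good labels matter a lot.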

ELK Stack

Elasticsearch + Logstash + Kibana is a classic stack that has been in use for a long time. It is very powerful, but fairly resource-intensive and more complex to set up.

Tool	What it does
Elasticsearch	Stores and searches logs
Logstash	Collects and transforms logs
Kibana	Visualizes and searches logs through a UI

SLI, SLO, SLA: the numbers you need to know

SLI (Service Level Indicator)

The number actually measured, such as uptime 99.95%, error rate 0.02%, or latency p99 = 200ms.

SLO (Service Level Objective)

The target the team sets for itself, for example "we will keep uptime at 99.9% per month". If the measured value drops below that, the team reviews what happened.

SLA (Service Level Agreement)

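An agreement with the customer that specifies compensation if uptime falls below a stated level. It is a legal commitment between the company and the customer.

The simple difference: the SLI is the actual number, the SLO is the internal target, and the SLA is the contract with the customer.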
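It helps to turn an uptime percentage into minutes. As a quick sketch, assuming a 30-day month, a 99.9% uptime SLO leaves room for about 43.2 minutes of downtime:

```python
def allowed_downtime_minutes(slo_percent, days=30):
    """Minutes of downtime that still satisfy an uptime SLO over the given period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def slo_met(sli_percent, slo_percent):
    """True when the measured value (SLI) is at or above the target (SLO)."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.9):.1f} minutes")  # 43.2 minutes
    print(slo_met(sli_percent=99.95, slo_percent=99.9))     # True
```

Each extra "nine" cuts the allowance by a factor of ten, so 99.99% leaves only about 4.3 minutes per 30 days.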

EP.9 Summary

Observability has 3 pillars: Metrics, Logs, Traces
Prometheus collects metrics by pulling from the /metrics endpoint
Grafana visualizes data from Prometheus as dashboards
Alerting sends notifications automatically when a metric looks abnormal
SLI/SLO/SLA are the numbers used to measure how reliable a system is

💡 A good DevOps Engineer doesn't just deploy fast. They also need to know whether the system is healthy after every deploy, which makes monitoring a skill you can't do without.

Series: DevOps Breaks, Breaks Again, and Keeps Breaking

EP.8: Infrastructure as Code, writing your servers as code

EP.10: DevSecOps, security is not somebody else's job