Observability Stack

Three Pillars of Observability

Observability is the ability to infer a system’s internal state from its external outputs. Unlike traditional “monitoring,” which reacts to known issues, observability emphasizes proactively understanding system behavior.

graph TB
    subgraph "Three Pillars of Observability"
        Metrics[Metrics<br/>Quantify system state]
        Logs[Logs<br/>Record discrete events]
        Traces[Traces<br/>Track request paths]
    end
    Metrics -->|Ask: How fast now?| Q1[Latency, throughput, error rate]
    Logs -->|Ask: What happened?| Q2[Error messages, state changes]
    Traces -->|Ask: Where is the bottleneck?| Q3[Call chains, bottleneck location]
    Metrics -.->|Correlate| Logs
    Logs -.->|Correlate| Traces
    Traces -.->|Correlate| Metrics

Relationship Between the Three

Dimension	Metrics	Logs	Traces
Data volume	Small (time series)	Large (discrete events)	Medium (sampled)
Alerting	Best suited	Suitable (critical logs)	Not suitable
Troubleshooting	Identify direction	View details	Locate bottleneck
Cost	Low	High	Medium
Granularity	System-level	Event-level	Request-level

Best practice is three-way correlation: metrics trigger alerts → traces locate bottlenecks → logs reveal root causes.

Prometheus: Metrics Collection and Alerting

Prometheus is the cornerstone of cloud-native observability, using a pull model for metrics collection:

graph TB
    subgraph "Prometheus Architecture"
        Prom[Prometheus Server]
        TSDB[TSDB Time-Series Database]
        SD[Service Discovery]
        Prom --> TSDB
        SD --> Prom
    end
    App1[App Exporter] -->|/metrics| Prom
    App2[Node Exporter] -->|/metrics| Prom
    App3[Kube State Metrics] -->|/metrics| Prom
    Prom --> AlertMgr[Alertmanager]
    AlertMgr --> Email[Email/Feishu/PagerDuty]
    Prom --> Grafana[Grafana]

Metric Types

Type	Description	Example
Counter	Monotonically increasing; resets only on process restart	Total requests, total errors
Gauge	Can increase or decrease	Current connections, memory usage
Histogram	Distribution statistics	Request latency distribution
Summary	Client-side quantiles (cannot be aggregated across instances)	P50/P95/P99 latency
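The semantic differences in the table can be illustrated with a minimal, stdlib-only Go sketch. The `counter`, `gauge`, and `histogram` types below are hypothetical stand-ins written for this illustration, not the Prometheus client library. Summary is omitted because its quantile estimation is considerably more involved.

```go
package main

import "fmt"

// counter only ever goes up; negative deltas are ignored.
type counter struct{ v float64 }

func (c *counter) Add(d float64) {
	if d < 0 {
		return // a counter never decreases
	}
	c.v += d
}

// gauge may move in either direction.
type gauge struct{ v float64 }

func (g *gauge) Set(v float64) { g.v = v }
func (g *gauge) Add(d float64) { g.v += d }

// histogram keeps cumulative bucket counts plus a running sum and count,
// mirroring the _bucket, _sum, and _count series a Prometheus histogram exposes.
type histogram struct {
	bounds []float64 // upper bounds ("le"), ascending
	counts []uint64  // counts[i] = number of observations <= bounds[i]
	sum    float64
	count  uint64 // plays the role of the implicit +Inf bucket
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

func (h *histogram) Observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++ // cumulative: every bucket whose bound covers v
		}
	}
	h.sum += v
	h.count++
}

func main() {
	c := &counter{}
	c.Add(3)
	c.Add(-1) // ignored

	g := &gauge{}
	g.Set(10)
	g.Add(-4)

	h := newHistogram(0.1, 0.5, 1)
	for _, v := range []float64{0.05, 0.3, 0.7, 2} {
		h.Observe(v)
	}
	fmt.Println(c.v, g.v, h.counts, h.count)
}
```

Because the buckets are cumulative, a quantile can later be estimated on the server side from the bucket counts alone, which is why histograms from many instances can be summed while summaries cannot.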

Application Instrumentation Example

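// Go application Prometheus instrumentation
import (
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "path"},
    )
)

func init() {
    // Register the collectors so they appear on the /metrics endpoint.
    prometheus.MustRegister(httpRequestsTotal, httpRequestDuration)
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        // Note: r.URL.Path is unbounded; prefer a route template as the label value.
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path))
        next.ServeHTTP(rec, r)
        timer.ObserveDuration()
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
    })
}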
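The `path` label in the middleware above takes its value from `r.URL.Path`, so every distinct URL (for example, one per user ID) creates a new time series. One common mitigation is to normalize the path before using it as a label. The sketch below is a minimal illustration using only the standard library; `normalizePath` and the `:id` placeholder are hypothetical names, and it assumes identifiers are purely numeric. A router that exposes the matched route template is usually a better source for this label.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePath replaces path segments that look like identifiers
// (all digits in this sketch) with a placeholder, so the set of
// label values stays bounded.
func normalizePath(p string) string {
	segs := strings.Split(p, "/")
	for i, s := range segs {
		if s != "" && isDigits(s) {
			segs[i] = ":id"
		}
	}
	return strings.Join(segs, "/")
}

// isDigits reports whether s consists only of ASCII digits.
func isDigits(s string) bool {
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(normalizePath("/users/42/orders/7"))
	fmt.Println(normalizePath("/healthz"))
}
```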

Common PromQL Queries

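# Request rate (QPS), summed across all label combinations
sum(rate(http_requests_total[5m]))

# P99 latency, aggregated across series by bucket boundary
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))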

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

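# CPU usage by Pod, as a fraction of its CPU limit
# (container!="" excludes the pod-level cgroup series, which would double-count)
sum(rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[5m])) by (pod)
/
sum(container_spec_cpu_quota{container!="", container!="POD"} / container_spec_cpu_period{container!="", container!="POD"}) by (pod)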

Alert Rules

groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate exceeds 5%"
          runbook: "https://wiki/runbook/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API P95 latency exceeds 2s"

Grafana: Dashboard Design

Grafana is the visualization layer of observability, unifying data from Prometheus, Loki, Jaeger, and other sources.

Dashboard Design Principles

Macro to Micro: Top overview (SLI metrics), middle service-level, bottom instance-level
Use Variable Templates: Reuse dashboards via $namespace, $pod variables
Semantic Color Coding: Green=normal, Yellow=warning, Red=abnormal
Highlight Key Metrics: SLIs on the first row, displayed in large font

Common Dashboard Panel Types

Panel Type	Use Case	Example
Stat	Single value	Current QPS, error rate
Time Series	Time trends	Latency change curve
Heatmap	Distribution density	Request latency heat map
Table	Tabular data	Slow query list
Logs	Log viewing	Error log context

Grafana dashboard variable example (JSON):
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)"
      },
      {
        "name": "pod",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)"
      }
    ]
  }
}

Log Aggregation

ELK vs Loki

Feature	ELK (Elasticsearch)	Loki
Indexing	Full-text index	Label-only index
Storage cost	High	Low
Query capability	Strong (full-text search)	Medium (label filtering + regex)
Deployment complexity	High	Low
Prometheus integration	Requires adapter	Native (shared label model; LogQL is modeled on PromQL)
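The indexing row explains most of the cost difference in the table: Loki indexes only the label set of each log stream and scans line contents at query time, whereas a full-text index tokenizes every line up front. The Go sketch below imitates the label-only approach. The `store`, `stream`, and `query` names are hypothetical, and the linear walk over streams stands in for a real label index.

```go
package main

import (
	"fmt"
	"strings"
)

// stream is a sequence of log lines that share one label set.
type stream struct {
	labels map[string]string
	lines  []string
}

type store struct{ streams []stream }

// query first narrows the search by labels (cheap, label data only),
// then brute-force scans the lines of the selected streams.
func (s *store) query(selector map[string]string, contains string) []string {
	var out []string
	for _, st := range s.streams {
		if !matches(st.labels, selector) {
			continue // line contents of this stream are never touched
		}
		for _, ln := range st.lines {
			if strings.Contains(ln, contains) {
				out = append(out, ln)
			}
		}
	}
	return out
}

// matches reports whether labels contains every pair in selector.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	s := &store{streams: []stream{
		{labels: map[string]string{"app": "api", "namespace": "production"},
			lines: []string{"GET /orders 200", "error: db timeout"}},
		{labels: map[string]string{"app": "web", "namespace": "production"},
			lines: []string{"error: template missing"}},
	}}
	fmt.Println(s.query(map[string]string{"app": "api"}, "error"))
}
```

The trade-off follows directly: ingestion and storage are cheap because nothing inside a line is indexed, but a query with a broad label selector has to scan a lot of raw text.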

Loki + Promtail Architecture

graph LR
    App[App Container] -->|stdout/stderr| Docker[Docker Logs]
    Docker -->|Read| Promtail[Promtail<br/>Log Collection]
    Promtail -->|Push| Loki[Loki<br/>Log Storage]
    Loki -->|Query| Grafana2[Grafana<br/>Log Display]

Structured Logging Practices

// Use structured logging (e.g., slog/zap)
slog.Info("request completed",
    "method", r.Method,
    "path", r.URL.Path,
    "status", statusCode,
    "duration_ms", duration.Milliseconds(),
    "trace_id", traceID,  // Correlate with tracing
)

Log query examples (LogQL):

# Query error logs for specific service
{app="api", namespace="production"} |= "error" | json | status >= 500

# Count error logs per minute
sum(count_over_time({app="api"} |= "error" [1m])) by (level)
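The first query is a pipeline: after the stream selector picks the streams, each stage filters or transforms the remaining lines in order. The Go sketch below mimics the three stages `|= "error"`, `| json`, and `status >= 500` on lines that the selector has already chosen. It is an approximation for illustration only; the `filter` function is hypothetical, and it simply drops lines that are not valid JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// filter applies the three pipeline stages to already-selected lines.
func filter(lines []string) []string {
	var out []string
	for _, ln := range lines {
		// Stage 1: |= "error" keeps lines containing the substring.
		if !strings.Contains(ln, "error") {
			continue
		}
		// Stage 2: | json extracts fields from the line.
		var fields map[string]any
		if err := json.Unmarshal([]byte(ln), &fields); err != nil {
			continue // simplification: unparsable lines are dropped
		}
		// Stage 3: status >= 500 filters on an extracted field.
		// encoding/json decodes JSON numbers as float64.
		if status, ok := fields["status"].(float64); ok && status >= 500 {
			out = append(out, ln)
		}
	}
	return out
}

func main() {
	lines := []string{
		`{"level":"error","status":500,"msg":"db timeout"}`,
		`{"level":"error","status":404,"msg":"not found"}`,
		`{"level":"info","status":503,"msg":"upstream unavailable"}`,
		`plain error text`,
	}
	for _, ln := range filter(lines) {
		fmt.Println(ln)
	}
}
```

Putting the cheap substring filter before the JSON parser matters in practice, because parsing is only paid for on lines that survive the first stage.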

Distributed Tracing

Distributed tracing follows a request’s complete call chain across multiple services, which makes it a core tool for microservice troubleshooting.

OpenTelemetry Standard

OpenTelemetry is the unified standard for observability, formed by the merger of OpenTracing and OpenCensus:

graph TB
    subgraph "OpenTelemetry Architecture"
        SDK[OTel SDK<br/>Auto/Manual Instrumentation]
        SDK --> Exporter[OTel Exporter<br/>OTLP Protocol]
        Exporter --> Collector[OTel Collector<br/>Data Processing Pipeline]
    end
    Collector --> Jaeger[Jaeger<br/>Trace Storage & Display]
    Collector --> Tempo[Tempo<br/>Grafana Trace Backend]
    Collector --> Prom[Prometheus<br/>Metrics Backend]
    Collector --> Loki2[Loki<br/>Logs Backend]

Core Concepts

Trace: A request’s complete call chain
Span: An operation unit within a Trace
Context Propagation: Passing the Trace ID and parent Span ID across services (typically via an HTTP header such as the W3C traceparent header)

graph LR
    subgraph "Request Trace"
        GW[API Gateway<br/>Span: 50ms]
        GW --> User[User Service<br/>Span: 20ms]
        GW --> Order[Order Service<br/>Span: 30ms]
        User --> DB1[(User DB<br/>Span: 5ms)]
        Order --> Cache[Redis<br/>Span: 2ms]
        Order --> DB2[(Order DB<br/>Span: 15ms)]
    end
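To find the bottleneck in a trace like the one above, a useful first step is to compute each span's self time, meaning its duration minus the time spent in its direct children. The sketch below does this for the example trace. It is a hypothetical illustration that assumes children run one after another; with concurrent children the subtraction would overcount and the child intervals would need to be merged first.

```go
package main

import "fmt"

type span struct {
	name     string
	ms       int // total duration in milliseconds
	children []*span
}

// selfTimes records, for every span in the tree, its duration minus
// the durations of its direct children.
func selfTimes(s *span, out map[string]int) {
	self := s.ms
	for _, c := range s.children {
		self -= c.ms
		selfTimes(c, out)
	}
	out[s.name] = self
}

// exampleTrace builds the request trace shown in the diagram.
func exampleTrace() *span {
	return &span{name: "API Gateway", ms: 50, children: []*span{
		{name: "User Service", ms: 20, children: []*span{
			{name: "User DB", ms: 5},
		}},
		{name: "Order Service", ms: 30, children: []*span{
			{name: "Redis", ms: 2},
			{name: "Order DB", ms: 15},
		}},
	}}
}

func main() {
	out := map[string]int{}
	selfTimes(exampleTrace(), out)
	for name, ms := range out {
		fmt.Printf("%-14s self=%dms\n", name, ms)
	}
}
```

In this example the gateway itself contributes no time of its own, so optimization effort is better spent on the services and databases beneath it.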

Go Application Integration Example

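import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    // The semconv version below is an assumption; use the one matching your SDK release.
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

// initTracer installs a global TracerProvider that batches spans and
// sends them to an OTLP/gRPC collector. Callers should defer
// tp.Shutdown(ctx) so buffered spans are flushed on exit.
func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(), // plaintext; suitable only inside a trusted network
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("api-service"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}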

Trace-Log-Metric Correlation

Best practice is to carry the Trace ID in logs as a field and in metrics as exemplars (not as a label, which would create one time series per trace), enabling three-way correlation:

Logs include trace_id: Jump from logs to trace details
Metrics carry trace_id as exemplars: Jump from metric charts to slow request traces
One-click navigation in Grafana from metrics → logs → traces

Observability is not a pile of tools but the deliberate correlation of the three pillars. Metrics tell you “there’s a problem,” traces tell you “where it’s stuck,” and logs tell you “why.” Only when all three work together can you achieve rapid identification and efficient troubleshooting.