Observability Stack
Three Pillars of Observability
Observability refers to the ability to infer internal state from a system’s external outputs. Unlike traditional “monitoring,” observability emphasizes proactively understanding system behavior rather than reactively responding to known issues.
graph TB
subgraph "Three Pillars of Observability"
Metrics[Metrics<br/>Quantify system state]
Logs[Logs<br/>Record discrete events]
Traces[Traces<br/>Track request paths]
end
Metrics -->|Ask: How fast now?| Q1[Latency, throughput, error rate]
Logs -->|Ask: What happened?| Q2[Error messages, state changes]
Traces -->|Ask: Where is the bottleneck?| Q3[Call chains, bottleneck location]
Metrics -.->|Correlate| Logs
Logs -.->|Correlate| Traces
Traces -.->|Correlate| Metrics
Relationship Between the Three
| Dimension | Metrics | Logs | Traces |
|---|---|---|---|
| Data volume | Small (time series) | Large (discrete events) | Medium (sampled) |
| Alerting | Best suited | Suitable (critical logs) | Not suitable |
| Troubleshooting | Identify direction | View details | Locate bottleneck |
| Cost | Low | High | Medium |
| Granularity | System-level | Event-level | Request-level |
Best practice is three-way correlation: metrics trigger alerts → traces locate bottlenecks → logs reveal root causes.
Prometheus: Metrics Collection and Alerting
Prometheus is the cornerstone of cloud-native observability, using a pull model for metrics collection:
graph TB
subgraph "Prometheus Architecture"
Prom[Prometheus Server]
TSDB[TSDB Time-Series Database]
SD[Service Discovery]
Prom --> TSDB
SD --> Prom
end
App1[App Exporter] -->|/metrics| Prom
App2[Node Exporter] -->|/metrics| Prom
App3[Kube State Metrics] -->|/metrics| Prom
Prom --> AlertMgr[Alertmanager]
AlertMgr --> Email[Email/Feishu/PagerDuty]
Prom --> Grafana[Grafana]
Metric Types
| Type | Description | Example |
|---|---|---|
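| Counter | Monotonically increasing value; resets to zero only when the process restarts | Total requests, total errors |
| Gauge | Value that can increase or decrease | Current connections, memory usage |
| Histogram | Counts observations in configurable buckets; quantiles are computed at query time and can be aggregated across instances | Request latency distribution |
| Summary | Quantiles computed in the client; cannot be aggregated across instances | P50/P95/P99 latency |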
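Because a Counter only ever increases, a drop between two scrapes can only mean the process restarted. The sketch below shows, under simplifying assumptions, how a per-second rate can be derived from raw counter samples while compensating for such resets. The names Sample and PerSecondRate are hypothetical, and the real PromQL rate() function additionally extrapolates to the edges of the query window, which is omitted here.

```go
package main

// Sample is one scraped counter value; T is a Unix timestamp in seconds.
type Sample struct {
	T float64
	V float64
}

// PerSecondRate returns the average per-second increase over the samples,
// which must be ordered by time. A drop in value is treated as a counter
// reset: the process restarted and began counting from zero again.
func PerSecondRate(samples []Sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].V - samples[i-1].V
		if delta < 0 {
			// After a reset the new value itself is the increase since zero.
			delta = samples[i].V
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].T - samples[0].T
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}
```

For the samples 100, 160, 20, 40 taken at 0s, 20s, 40s, and 50s, the increases are 60, 20 (after the reset), and 20, so the function reports 100 requests over 50 seconds, or 2 per second.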
Application Instrumentation Example
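// Go application Prometheus instrumentation
import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "path", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal, httpRequestDuration)
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (rec *statusRecorder) WriteHeader(code int) {
	rec.status = code
	rec.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Default to 200: net/http sends it implicitly if WriteHeader is never called.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		// Prefer a route template over the raw path if paths embed IDs (label cardinality).
		timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path))
		next.ServeHTTP(rec, r)
		timer.ObserveDuration()
		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
	})
}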
Common PromQL Queries
# Request rate (QPS)
rate(http_requests_total[5m])
# P99 latency
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# CPU usage by Pod
sum(rate(container_cpu_usage_seconds_total{container!="POD"}[5m])) by (pod)
/
sum(container_spec_cpu_quota{container!="POD"} / container_spec_cpu_period{container!="POD"}) by (pod)
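The P99 query above works on cumulative buckets, so the result is an estimate: the quantile is interpolated linearly inside the bucket that contains the target rank. The following is a minimal sketch of that idea, not Prometheus's implementation. The names Bucket and EstimateQuantile are hypothetical, and the lower bound of the first bucket is assumed to be 0.

```go
package main

import "math"

// Bucket is one cumulative histogram bucket: Count observations were
// less than or equal to UpperBound (the "le" label).
type Bucket struct {
	UpperBound float64 // math.Inf(1) for the catch-all bucket
	Count      float64 // cumulative count (or rate) of observations
}

// EstimateQuantile approximates the q-quantile (0 <= q <= 1) from cumulative
// buckets sorted by UpperBound, interpolating linearly inside the bucket
// that contains the target rank.
func EstimateQuantile(q float64, buckets []Bucket) float64 {
	if len(buckets) == 0 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.Count < rank {
			continue
		}
		if math.IsInf(b.UpperBound, 1) {
			// No upper edge to interpolate towards: fall back to the
			// highest finite bound, if there is one.
			if i == 0 {
				return math.NaN()
			}
			return buckets[i-1].UpperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].UpperBound, buckets[i-1].Count
		}
		inBucket := b.Count - below
		if inBucket == 0 {
			return b.UpperBound
		}
		return lower + (b.UpperBound-lower)*(rank-below)/inBucket
	}
	return math.NaN()
}
```

With cumulative counts of 50, 90, and 100 for the bounds 0.1s, 0.5s, and 1s, the 0.95 quantile has rank 95 and falls halfway through the last bucket, giving roughly 0.75s. Since the estimate can never be finer than the bucket layout, bucket boundaries are best placed around the latency thresholds that matter for alerting.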
Alert Rules
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate exceeds 5%"
          runbook: "https://wiki/runbook/high-error-rate"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API P95 latency exceeds 2s"
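The for field in these rules means an alert does not fire the moment its expression becomes true. It first enters a pending state and fires only if the condition holds for the whole duration. A minimal sketch of that state machine follows. Rule, AlertState, and Evaluate are hypothetical names, and timestamps are plain seconds to keep the example dependency-free.

```go
package main

// AlertState mirrors the three states an alerting rule moves through.
type AlertState int

const (
	Inactive AlertState = iota // condition is false
	Pending                    // condition is true, but not yet for long enough
	Firing                     // condition has held for the full hold duration
)

// Rule is a threshold alert with a hold duration, comparable to
// an expression "> 0.05" combined with "for: 5m".
type Rule struct {
	Threshold  float64
	ForSeconds float64

	state       AlertState
	activeSince float64
}

// Evaluate records one evaluation (timestamp in seconds and the current
// value of the expression) and returns the resulting state.
func (r *Rule) Evaluate(now, value float64) AlertState {
	if value <= r.Threshold {
		// Any evaluation below the threshold resets the timer.
		r.state = Inactive
		return r.state
	}
	if r.state == Inactive {
		r.state = Pending
		r.activeSince = now
	}
	if now-r.activeSince >= r.ForSeconds {
		r.state = Firing
	}
	return r.state
}
```

A short spike that recovers before the hold duration elapses never reaches Firing, which is why a longer for value trades detection speed for fewer false alarms.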
Grafana: Dashboard Design
Grafana is the visualization layer of the observability stack, unifying data from Prometheus, Loki, Jaeger, and other sources in a single interface.
Dashboard Design Principles
- Macro to Micro: Top overview (SLI metrics), middle service-level, bottom instance-level
- Use Variable Templates: Reuse dashboards via $namespace and $pod variables
- Semantic Color Coding: Green=normal, Yellow=warning, Red=abnormal
- Highlight Key Metrics: SLIs on the first row, displayed in large font
Common Dashboard Panel Types
| Panel Type | Use Case | Example |
|---|---|---|
| Stat | Single value | Current QPS, error rate |
| Time Series | Time trends | Latency change curve |
| Heatmap | Distribution density | Request latency heat map |
| Table | Tabular data | Slow query list |
| Log Viewer | Log viewing | Error log context |
// Grafana dashboard variable example
{
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"query": "label_values(kube_pod_info, namespace)"
},
{
"name": "pod",
"type": "query",
"query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)"
}
]
}
}
Log Aggregation
ELK vs Loki
| Feature | ELK (Elasticsearch) | Loki |
|---|---|---|
| Indexing | Full-text index | Label-only index |
| Storage cost | High | Low |
| Query capability | Strong (full-text search) | Medium (label filtering + regex) |
| Deployment complexity | High | Low |
| Prometheus integration | Requires adapter | Native (shared label model; LogQL is modeled on PromQL) |
Loki + Promtail Architecture
graph LR
App[App Container] -->|stdout/stderr| Docker[Docker Logs]
Docker -->|Read| Promtail[Promtail<br/>Log Collection]
Promtail -->|Push| Loki[Loki<br/>Log Storage]
Loki -->|Query| Grafana2[Grafana<br/>Log Display]
Structured Logging Practices
// Use structured logging (e.g., slog/zap)
slog.Info("request completed",
"method", r.Method,
"path", r.URL.Path,
"status", statusCode,
"duration_ms", duration.Milliseconds(),
"trace_id", traceID, // Correlate with tracing
)
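To make the correlation concrete, the sketch below emits the same fields as a JSON log line using the standard library's log/slog package (Go 1.21 or later) and then reads the trace_id back out, which is roughly what a log viewer does when it links a log entry to its trace. RequestLogLine and TraceIDFromLogLine are hypothetical helper names, and the duration is passed as plain milliseconds for simplicity.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"log/slog"
)

// RequestLogLine renders one JSON log line for a completed request.
func RequestLogLine(method, path string, status int, durationMs int64, traceID string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	logger.Info("request completed",
		"method", method,
		"path", path,
		"status", status,
		"duration_ms", durationMs,
		"trace_id", traceID,
	)
	return buf.String()
}

// TraceIDFromLogLine extracts the trace_id field from a JSON log line.
func TraceIDFromLogLine(line string) (string, error) {
	var fields map[string]any
	if err := json.Unmarshal([]byte(line), &fields); err != nil {
		return "", err
	}
	id, ok := fields["trace_id"].(string)
	if !ok || id == "" {
		return "", errors.New("log line has no trace_id")
	}
	return id, nil
}
```

Because every field is a named key rather than part of a free-form message, downstream tools can filter on status or duration_ms and follow trace_id without fragile regular expressions.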
Log query examples (LogQL):
# Query error logs for specific service
{app="api", namespace="production"} |= "error" | json | status >= 500
# Count error logs per minute
sum(count_over_time({app="api"} |= "error" [1m])) by (level)
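The first query above reflects Loki's label-only index: the selector in braces narrows the search to matching streams, and the line filter then scans the content of only those streams. A simplified in-memory sketch of this two-stage evaluation is shown below. Stream and Select are hypothetical names, and real Loki works on compressed chunks in object storage rather than slices of strings.

```go
package main

import "strings"

// Stream is a sequence of log lines that share one set of labels.
type Stream struct {
	Labels map[string]string
	Lines  []string
}

// Select returns the lines that contain the given substring, taken only
// from streams whose labels match every pair in selector.
func Select(streams []Stream, selector map[string]string, contains string) []string {
	var out []string
	for _, s := range streams {
		// Stage 1: cheap label match, comparable to an index lookup.
		if !matches(s.Labels, selector) {
			continue
		}
		// Stage 2: brute-force scan of the selected stream's content.
		for _, line := range s.Lines {
			if strings.Contains(line, contains) {
				out = append(out, line)
			}
		}
	}
	return out
}

// matches reports whether labels contains every key/value pair in selector.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}
```

The practical consequence is that selective labels keep queries fast, while a broad selector forces the second stage to scan far more data.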
Distributed Tracing
Distributed tracing tracks a request’s complete call chain across multiple services, a core tool for microservice troubleshooting.
OpenTelemetry Standard
OpenTelemetry is a vendor-neutral observability standard formed by merging the OpenTracing and OpenCensus projects:
graph TB
subgraph "OpenTelemetry Architecture"
SDK[OTel SDK<br/>Auto/Manual Instrumentation]
SDK --> Exporter[OTel Exporter<br/>OTLP Protocol]
Exporter --> Collector[OTel Collector<br/>Data Processing Pipeline]
end
Collector --> Jaeger[Jaeger<br/>Trace Storage & Display]
Collector --> Tempo[Tempo<br/>Grafana Trace Backend]
Collector --> Prom[Prometheus<br/>Metrics Backend]
Collector --> Loki2[Loki<br/>Logs Backend]
Core Concepts
- Trace: A request’s complete call chain
- Span: An operation unit within a Trace
- Context Propagation: Passing the Trace ID and parent Span ID across services (typically via HTTP headers such as the W3C traceparent header)
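Context propagation over HTTP is commonly done with the W3C Trace Context traceparent header, which has the form version-traceid-spanid-flags. The sketch below formats and parses version 00 of that header using only the standard library. SpanContext and ParseTraceparent are hypothetical names, and in practice the OpenTelemetry SDK's propagators handle this for you.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// SpanContext is the part of a span that travels between services.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool   // lowest bit of the trace-flags field
}

// Traceparent renders the context as a version 00 traceparent header value.
func (sc SpanContext) Traceparent() string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)
}

// ParseTraceparent validates and decodes a version 00 traceparent value.
func ParseTraceparent(header string) (SpanContext, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return SpanContext{}, errors.New("traceparent: unsupported format or version")
	}
	if !isLowerHex(parts[1], 32) || !isLowerHex(parts[2], 16) || !isLowerHex(parts[3], 2) {
		return SpanContext{}, errors.New("traceparent: malformed field")
	}
	// All-zero trace and span IDs are reserved as invalid.
	if strings.Trim(parts[1], "0") == "" || strings.Trim(parts[2], "0") == "" {
		return SpanContext{}, errors.New("traceparent: all-zero ID")
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return SpanContext{}, err
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2], Sampled: flags&1 == 1}, nil
}

// isLowerHex reports whether s is exactly n lowercase hexadecimal characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}
```

A downstream service parses the incoming header, keeps the trace ID, and sends its own span ID onward, which is how spans from different processes end up in the same trace.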
graph LR
subgraph "Request Trace"
GW[API Gateway<br/>Span: 50ms]
GW --> User[User Service<br/>Span: 20ms]
GW --> Order[Order Service<br/>Span: 30ms]
User --> DB1[(User DB<br/>Span: 5ms)]
Order --> Cache[Redis<br/>Span: 2ms]
Order --> DB2[(Order DB<br/>Span: 15ms)]
end
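Reading a trace like the one above usually means following the slowest child at each level until a leaf is reached. The sketch below encodes the example trace and walks it that way. Span, ExampleTrace, and SlowestPath are hypothetical names, and the walk is a simple heuristic: it ignores whether child spans ran sequentially or in parallel.

```go
package main

// Span is one timed operation in a trace, with the operations it called.
type Span struct {
	Name       string
	DurationMs int
	Children   []*Span
}

// ExampleTrace builds the request trace from the diagram above.
func ExampleTrace() *Span {
	return &Span{Name: "API Gateway", DurationMs: 50, Children: []*Span{
		{Name: "User Service", DurationMs: 20, Children: []*Span{
			{Name: "User DB", DurationMs: 5},
		}},
		{Name: "Order Service", DurationMs: 30, Children: []*Span{
			{Name: "Redis", DurationMs: 2},
			{Name: "Order DB", DurationMs: 15},
		}},
	}}
}

// SlowestPath starts at the root and repeatedly descends into the child
// span with the longest duration, returning the names along the way.
func SlowestPath(root *Span) []string {
	var path []string
	for s := root; s != nil; {
		path = append(path, s.Name)
		var next *Span
		for _, c := range s.Children {
			if next == nil || c.DurationMs > next.DurationMs {
				next = c
			}
		}
		s = next
	}
	return path
}
```

For the example trace the walk visits API Gateway, then Order Service (30ms beats 20ms), then Order DB (15ms beats 2ms), pointing at the order database as the first place to look.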
Go Application Integration Example
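import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0" // use the version that matches your SDK
)

// initTracer installs a global TracerProvider that batches spans to an
// OTLP/gRPC collector. Call Shutdown on the returned provider at exit to flush spans.
func initTracer() (*trace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(context.Background(),
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(), // plaintext gRPC; use TLS outside trusted networks
	)
	if err != nil {
		return nil, err
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("api-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}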
Trace-Log-Metric Correlation
Best practice is to inject Trace IDs into both logs and metrics, enabling three-way correlation:
- Logs include trace_id: Jump from logs to trace details
- Metrics include trace_id exemplars: Jump from metric charts to slow request traces
- One-click navigation in Grafana from metrics → logs → traces
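An exemplar is a sample trace ID stored next to a metric value, so a slow latency bucket can point at one concrete request. The sketch below illustrates the idea with a standard-library-only histogram that remembers the latest trace ID per bucket. ExemplarHistogram and its methods are hypothetical and are not the Prometheus client API, which has its own exemplar support.

```go
package main

import "sort"

// ExemplarHistogram counts observations per bucket and keeps the most
// recent trace ID seen in each bucket as an exemplar.
type ExemplarHistogram struct {
	bounds    []float64 // sorted upper bounds; a final catch-all bucket is implied
	counts    []uint64  // per-bucket (non-cumulative) counts
	exemplars []string  // latest trace ID per bucket, "" if none
}

// NewExemplarHistogram creates a histogram with the given upper bounds.
func NewExemplarHistogram(bounds ...float64) *ExemplarHistogram {
	sorted := append([]float64(nil), bounds...)
	sort.Float64s(sorted)
	return &ExemplarHistogram{
		bounds:    sorted,
		counts:    make([]uint64, len(sorted)+1),
		exemplars: make([]string, len(sorted)+1),
	}
}

// Observe records a duration and, if traceID is set, stores it as the
// exemplar for the bucket the duration falls into.
func (h *ExemplarHistogram) Observe(seconds float64, traceID string) {
	// Index of the first bound >= seconds, or len(bounds) for the catch-all.
	i := sort.SearchFloat64s(h.bounds, seconds)
	h.counts[i]++
	if traceID != "" {
		h.exemplars[i] = traceID
	}
}

// SlowestExemplar returns the trace ID from the highest bucket that has one.
func (h *ExemplarHistogram) SlowestExemplar() string {
	for i := len(h.exemplars) - 1; i >= 0; i-- {
		if h.exemplars[i] != "" {
			return h.exemplars[i]
		}
	}
	return ""
}
```

A dashboard that shows a latency spike can then offer the stored trace ID as a direct link into the tracing backend, instead of leaving the operator to search for a matching request by time range.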
Observability is not a piling up of tools but an organic correlation of the three pillars. Metrics tell you “there’s a problem,” traces tell you “where it’s stuck,” and logs tell you “why.” Only when all three work together can you achieve rapid identification and efficient troubleshooting.