SRE Practices & Reliability Engineering
SRE Overview
Site Reliability Engineering (SRE) is a discipline that originated at Google and applies software engineering approaches to operations problems. Its core philosophy is that operations problems are essentially software problems and should be solved systematically through engineering practices.
SRE is more than a job title; it is a system of practices. It defines methods for measuring reliability, strategies for handling failures, and mechanisms for managing risk.
SLI/SLO/SLA Framework
This is the measurement cornerstone of SRE. Each layer builds on the one before it:
graph TB
SLI["SLI<br/>Service Level Indicator<br/>Quantifiable measurement<br/>e.g., Latency, Availability"] --> SLO["SLO<br/>Service Level Objective<br/>Target value for SLI<br/>e.g., 99.9% Availability"]
SLO --> SLA["SLA<br/>Service Level Agreement<br/>Business consequences of not meeting SLO<br/>e.g., Compensation terms"]
style SLI fill:#e3f2fd
style SLO fill:#e8f5e9
style SLA fill:#fff3e0
SLI Selection
SLIs should reflect real user experience, not internal metrics:
| Service Type | Recommended SLI | Anti-Pattern |
|---|---|---|
| API Service | Request success rate, latency | CPU usage |
| Web Frontend | Page load time, FCP | Server load |
| Storage | Data durability, I/O latency | Disk usage |
| Message Queue | Message delivery latency, message loss rate | Queue length |
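As a rough illustration of the API Service row, the sketch below derives a request success rate and a latency percentile from a handful of request records. The `Request` type, the sample data, and the choice of the nearest-rank percentile method are assumptions made for the example; a real system would usually compute these from its metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # latency observed for this request


def success_rate(requests: list[Request]) -> float:
    """Fraction of requests that returned a 2xx status."""
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, e.g. pct=99 for P99."""
    if not requests:
        raise ValueError("no requests to measure")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical sample: three successful requests and one server error.
sample = [
    Request(200, 120.0),
    Request(200, 80.0),
    Request(500, 950.0),
    Request(201, 300.0),
]
print(success_rate(sample))            # 0.75
print(latency_percentile(sample, 50))  # 120.0
```

Both numbers describe what users actually experienced, which is why they make better SLIs than CPU usage on the servers that produced them.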
SLO Formulation
A higher SLO is not always better. A 100% SLO leaves no room for failure, so every change becomes an unacceptable risk and the system stagnates. A reasonable SLO balances user experience against development velocity:
# SLO definition example (illustrative schema, in the style of SLO Generator)
service: api-gateway
slos:
  - name: availability
    description: "API request success rate"
    sli:
      type: good_bad_ratio
      good: sum(rate(http_requests_total{status=~"2.."}[5m]))
      bad: sum(rate(http_requests_total{status!~"2.."}[5m]))
    target: 99.9%
    window: 30d
  - name: latency
    description: "API P99 latency"
    sli:
      type: threshold
      metric: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      threshold: 500ms
    target: 99%
    window: 30d
Error Budget
The error budget is the complement of the SLO: error budget = 1 - SLO. It is the amount of failure the service is allowed within the SLO window.
graph LR
subgraph "Error Budget Consumption"
SLO["SLO: 99.9%<br/>1M requests/month"]
Budget["Error Budget: 0.1%<br/>≈ 1000 failures/month"]
Budget --> B1["Deploy new version<br/>Consumes 200"]
Budget --> B2["Infrastructure failure<br/>Consumes 300"]
Budget --> B3["Remaining 500<br/>Available for innovation"]
end
Practical Implications of Error Budgets:
- Budget is ample: The team can release new features and refactor architecture boldly
- Budget is exhausted: The team stops non-urgent changes and focuses on reliability improvements
- Budget is consistently left over: The SLO is too loose, and the target should be raised to drive improvement
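The arithmetic behind the diagram and the policy above can be sketched in a few lines. The function names and the two policy strings are hypothetical, and the sketch assumes the budget is counted in failed requests rather than in minutes of downtime.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates in the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("window too small to hold any error budget")
    return (budget - failed_requests) / budget


def release_policy(remaining: float) -> str:
    """Map the remaining budget to the policy described in the list above."""
    if remaining <= 0:
        return "freeze non-urgent changes"
    return "releases allowed"


# Figures from the diagram: 99.9% SLO, 1M requests, 200 + 300 failures spent.
budget = error_budget(0.999, 1_000_000)
left = budget_remaining(0.999, 1_000_000, failed_requests=200 + 300)
print(budget, left, release_policy(left))  # 1000 0.5 releases allowed
```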
On-Call Practices
On-Call is SRE’s front-line work, ensuring someone responds to and handles system failures.
On-Call Principles
- Reasonable alert volume: No more than 2 page-level alerts per week; beyond that, on-call engineers suffer alert fatigue
- Clear escalation path: L1 → L2 → L3, with a response time limit at each level
- Post-incident review: Every On-Call incident must have a post-mortem analysis
- Rotation mechanism: Avoid the same person being On-Call long-term
graph TB
Alert[Alert Triggered] --> L1[L1 On-Call<br/>5 min response]
L1 -->|Cannot resolve| L2[L2 On-Call<br/>15 min response]
L2 -->|Need support| L3[L3 Expert Team<br/>30 min response]
L3 -->|Cross-team| War[War Room<br/>All-hands response]
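One way to read the escalation path is as a timer: if a level does not respond within its limit, the page moves up. The sketch below encodes that reading. Treating the limits as cumulative, and escalating automatically on silence rather than on an explicit "cannot resolve" handoff, are assumptions of this example and not something the diagram states.

```python
# Per-level response limits from the diagram, in minutes.
ESCALATION_CHAIN = [("L1", 5), ("L2", 15), ("L3", 30)]


def level_to_page(minutes_unacknowledged: float) -> str:
    """Return which level should hold the page after this much silence."""
    if minutes_unacknowledged < 0:
        raise ValueError("elapsed time cannot be negative")
    deadline = 0
    for level, limit in ESCALATION_CHAIN:
        deadline += limit  # each level's clock starts when it is paged
        if minutes_unacknowledged < deadline:
            return level
    return "War Room"


for elapsed in (0, 5, 20, 50):
    print(elapsed, level_to_page(elapsed))
```

Under these assumptions L2 is paged at minute 5, L3 at minute 20, and the war room opens at minute 50.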
Alert Severity Levels
| Level | Response Time | Notification Method | Example |
|---|---|---|---|
| P1 - Critical | 5 minutes | Phone + SMS | Production service unavailable |
| P2 - High | 15 minutes | SMS + IM | Partial feature degradation |
| P3 - Medium | 1 hour | IM notification | Latency increase |
| P4 - Low | 24 hours | | Disk usage > 70% |
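The response times in the table translate directly into a deadline check. This is a minimal sketch: the `RESPONSE_TIME` mapping mirrors the table, while the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Response times per severity level, taken from the table above.
RESPONSE_TIME = {
    "P1": timedelta(minutes=5),
    "P2": timedelta(minutes=15),
    "P3": timedelta(hours=1),
    "P4": timedelta(hours=24),
}


def response_deadline(severity: str, triggered_at: datetime) -> datetime:
    """Latest time by which the alert should be acknowledged."""
    return triggered_at + RESPONSE_TIME[severity]


def responded_in_time(severity: str, triggered_at: datetime,
                      acknowledged_at: datetime) -> bool:
    """True if the acknowledgement met the response time for this severity."""
    return acknowledged_at <= response_deadline(severity, triggered_at)


triggered = datetime(2026, 5, 1, 14, 30)
print(response_deadline("P1", triggered))  # 2026-05-01 14:35:00
print(responded_in_time("P2", triggered, triggered + timedelta(minutes=20)))  # False
```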
On-Call Handoff Checklist
- Incidents this week and their handling status
- Pending Action Items
- SLOs / Error budgets approaching limits
- Planned changes and potential impact
Chaos Engineering
Chaos engineering is an experimental method that proactively injects failures to verify system resilience. The core idea is to discover system weaknesses before failures affect users.
Chaos Engineering Principles
- Establish steady-state hypothesis: Define “normal” behavior metrics for the system
- Simulate real events: Inject server crashes, network latency, dependency unavailability, etc.
- Observe system behavior: Compare metric changes before and after injection
- Learn and improve: Harden the system based on experimental results
graph LR
subgraph "Chaos Experiment Flow"
Hypo[Establish Hypothesis<br/>Service A crashes<br/>Traffic auto-fails over to B]
Inject[Inject Failure<br/>Terminate Service A process]
Observe[Observe Results<br/>Failover takes 30s<br/>SLO unaffected]
Conclusion[Conclusion<br/>DR effective but failover slow<br/>Optimize health check interval]
end
Hypo --> Inject --> Observe --> Conclusion
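The experiment flow can be reduced to a comparison between a steady-state definition and what was observed during injection. The sketch below assumes the steady state is expressed as two thresholds (error rate and P99 latency); the types, thresholds, and result strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    max_error_rate: float      # highest acceptable fraction of failed requests
    max_p99_latency_ms: float  # highest acceptable P99 latency


@dataclass(frozen=True)
class Observation:
    error_rate: float
    p99_latency_ms: float


def within_steady_state(steady: SteadyState, obs: Observation) -> bool:
    """True if the observation stays inside the steady-state thresholds."""
    return (obs.error_rate <= steady.max_error_rate
            and obs.p99_latency_ms <= steady.max_p99_latency_ms)


def evaluate(steady: SteadyState, baseline: Observation, during: Observation) -> str:
    """Compare metrics before and during injection against the hypothesis."""
    if not within_steady_state(steady, baseline):
        return "abort: system was not in steady state before injection"
    if within_steady_state(steady, during):
        return "hypothesis held"
    return "hypothesis violated: weakness found"


steady = SteadyState(max_error_rate=0.001, max_p99_latency_ms=500.0)
baseline = Observation(error_rate=0.0002, p99_latency_ms=180.0)
during = Observation(error_rate=0.0005, p99_latency_ms=420.0)
print(evaluate(steady, baseline, during))  # hypothesis held
```

Checking the baseline first matters: an experiment started on an already degraded system cannot tell you whether the injected failure caused the violation.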
Chaos Engineering Tool Comparison
| Tool | Platform | Failure Types |
|---|---|---|
| Chaos Monkey | Netflix/Spinnaker | Random instance termination |
| Chaos Mesh | K8s | Pod failures, network failures, I/O failures |
| Litmus | K8s | Comprehensive chaos experiments |
| Gremlin | All platforms | Commercial solution, visual experiments |
Chaos Mesh Experiment Example
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: api
  delay:
    latency: "500ms"
    correlation: "50"
  duration: "5m"
Incident Management
Incident Lifecycle
stateDiagram-v2
[*] --> Detection: Monitoring alert/User report
Detection --> Response: Confirm incident
Response --> Mitigation: Stop the bleeding
Mitigation --> Resolution: Fix root cause
Resolution --> Review: Post-mortem analysis
Review --> Improvement: Action Items
Improvement --> [*]: Close
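The lifecycle is a strictly linear state machine, which makes it easy to enforce in tooling. This sketch models the stages from the diagram as an enum; the `Stage` and `advance` names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DETECTION = "Detection"
    RESPONSE = "Response"
    MITIGATION = "Mitigation"
    RESOLUTION = "Resolution"
    REVIEW = "Review"
    IMPROVEMENT = "Improvement"


# Allowed transitions, in the order shown in the state diagram.
NEXT_STAGE = {
    Stage.DETECTION: Stage.RESPONSE,
    Stage.RESPONSE: Stage.MITIGATION,
    Stage.MITIGATION: Stage.RESOLUTION,
    Stage.RESOLUTION: Stage.REVIEW,
    Stage.REVIEW: Stage.IMPROVEMENT,
}


def advance(stage: Stage) -> Optional[Stage]:
    """Return the next stage, or None once the incident can be closed."""
    return NEXT_STAGE.get(stage)


stage: Optional[Stage] = Stage.DETECTION
path = []
while stage is not None:
    path.append(stage.value)
    stage = advance(stage)
print(" -> ".join(path))
```

Because mitigation precedes resolution in the table of transitions, an incident cannot be marked as root-caused before the bleeding has been stopped.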
Incident Response Principles
- Stop the bleeding first, find root cause later: Rapid service restoration takes priority over finding root cause
- Single incident commander: One person coordinates, others focus on technical work
- Frequent updates: Sync status to stakeholders every 15-30 minutes
- Preserve the scene: Don’t rush to restart, preserve logs and forensic information
Blameless Post-Mortem
The core principle of post-mortems: focus on the process, not the person. The goal is to find systemic issues, not to assign individual blame.
Post-mortem template:
## Incident Post-Mortem
### Basic Information
- Incident time: 2026-05-01 14:30 - 15:45
- Impact scope: API service 30% request failures
- Impact duration: 75 minutes
- SLO impact: Consumed 15% of monthly error budget
### Timeline
- 14:30 - Alert triggered: API error rate rising
- 14:35 - On-Call confirmed incident, started investigation
- 14:45 - Identified: Database connection pool exhausted
- 15:00 - Mitigated: Scaled database connection pool
- 15:15 - Service restored to normal
- 15:45 - Confirmed no recurrence
### Root Cause Analysis
Database connection pool configuration was not adjusted as traffic grew, causing connection wait timeouts under high concurrency
### Action Items
1. [P0] Connection pool config auto-scales with HPA (Owner: Zhang San, Due: 5/7)
2. [P1] Add connection pool usage alerting (Owner: Li Si, Due: 5/5)
3. [P2] Integrate Chaos Mesh to simulate connection pool exhaustion (Owner: Wang Wu, Due: 5/14)
Capacity Planning
Capacity planning ensures the system has enough resources to handle traffic as the business grows.
Planning Methodology
- Current baseline: Collect resource usage trend data
- Growth rate prediction: Project future needs based on historical data
- Safety margin: Reserve 30% buffer for unexpected traffic spikes
- Milestone reviews: Quarterly review of forecast vs actual deviation
graph TB
subgraph "Capacity Planning Flow"
Current[Current Resource Usage<br/>CPU: 60%<br/>Memory: 55%] --> Growth[Growth Prediction<br/>Quarterly growth 20%]
Growth --> Peak[Peak Estimation<br/>Quarterly peak 1.5x]
Peak --> Buffer[Safety Margin 30%]
Buffer --> Plan[Next Quarter Needs<br/>CPU: 60% × 1.2 × 1.5 × 1.3 ≈ 140%<br/>Need to scale to 2x]
end
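The projection in the diagram is a product of four factors, which the sketch below reproduces. The function names and the 10-instance fleet are hypothetical, and the sketch assumes that demand scales linearly with instance count.

```python
import math


def projected_utilization(current: float, quarterly_growth: float,
                          peak_factor: float, safety_margin: float) -> float:
    """Projected demand as a fraction of today's capacity."""
    return current * (1 + quarterly_growth) * peak_factor * (1 + safety_margin)


def instances_needed(current_instances: int, projected: float) -> int:
    """Smallest fleet size that keeps projected demand at or below 100%."""
    return math.ceil(current_instances * projected)


# Figures from the diagram: 60% CPU, 20% growth, 1.5x peak, 30% margin.
projected = projected_utilization(0.60, 0.20, 1.5, 0.30)
print(f"{projected:.0%}")               # 140%
print(instances_needed(10, projected))  # 15
```

The raw requirement is about 1.4x current capacity. The diagram's 2x is a more conservative figure than the formula alone produces; how far to round up is a planning decision that the calculation does not settle.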
Key Metrics
| Dimension | Metric | Purpose |
|---|---|---|
| Compute | CPU/Memory usage trends | Determine node scaling |
| Storage | Disk growth rate | Predict scaling timeline |
| Network | Bandwidth usage trends | Plan bandwidth upgrades |
| Business | DAU/Request volume growth | Drive all resource planning |
Automated Capacity Management
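# K8s VPA in recommendation-only mode: suggests resource requests without applying them
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"  # Recommend only, don't auto-modify (quoted so YAML does not read it as a boolean)
  resourcePolicy:
    containerPolicies:
      - containerName: "*"  # Apply to every container in the pod
        mode: Auto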
SRE is not about preventing system failures; that is impossible. SRE’s goal is to give systems the ability to recover quickly and to continuously learn and improve from failures. From SLI/SLO measurement to On-Call response, from chaos engineering experiments to blameless post-mortems, SRE makes reliability an engineering-driven, repeatable practice.