Backend Performance Tuning

Performance Metrics

Backend performance tuning starts with defining measurement standards:

Metric               | Meaning                            | Focus
Throughput (QPS/TPS) | Requests/transactions per second   | System capacity ceiling
Latency              | Time from request to response      | User experience
P50/P90/P99          | 50th/90th/99th percentile latency  | Tail latency
Concurrency          | Simultaneously processed requests  | Resource utilization
Error rate           | Percentage of failed requests      | System stability

Why look at P99 instead of averages? Suppose 98 out of 100 requests take 10ms and 2 take 10s. The average is about 210ms, which may look acceptable, but P99 is 10s, meaning at least 1% of users have a terrible experience. Averages mask tail problems.
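To make the gap between the average and the tail concrete, here is a minimal sketch that computes both for a small synthetic sample. It assumes the nearest-rank percentile definition (load testing tools may interpolate differently), and the sample of 98 fast requests plus 2 slow ones is hypothetical, chosen so that the 99th-ranked value falls in the tail.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// mean returns the arithmetic mean of the samples (milliseconds).
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// percentile returns the p-th percentile (0 < p <= 100) using the
// nearest-rank definition: the value at rank ceil(p * n / 100) in sorted order.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is not reordered
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// sampleLatencies builds the hypothetical sample: 98 requests at 10ms, 2 at 10s.
func sampleLatencies() []float64 {
	latencies := make([]float64, 0, 100)
	for i := 0; i < 98; i++ {
		latencies = append(latencies, 10)
	}
	return append(latencies, 10000, 10000)
}

func main() {
	l := sampleLatencies()
	// Prints: mean=209.8ms p50=10ms p99=10000ms
	fmt.Printf("mean=%.1fms p50=%.0fms p99=%.0fms\n", mean(l), percentile(l, 50), percentile(l, 99))
}
```

The median matches what most users see, the mean lands in a range nobody actually experiences, and only the high percentile exposes the slow requests.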

graph LR
    A[Performance Goals] --> B[Reduce P99 latency]
    A --> C[Increase throughput]
    A --> D[Maintain error rate < 0.1%]

Load Testing Tools

wrk

# Basic load test: 12 threads, 400 connections, 30 seconds
wrk -t12 -c400 -d30s http://localhost:8080/api/users

# With latency distribution
wrk -t12 -c400 -d30s --latency http://localhost:8080/api/users

# Output example (abbreviated)
#   Latency Distribution
#      50%    3.21ms
#      90%    5.43ms
#      99%   12.87ms
#   Requests/sec: 124532.11
#   Transfer/sec:     45.67MB

wrk2 (Constant Throughput Load Testing)

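# Constant 1000 QPS (-R sets the target request rate); observe how latency changes with load
# Note: when built from source, the wrk2 binary is typically named wrk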
wrk2 -t4 -c100 -d30s -R1000 http://localhost:8080/api/users

k6

// k6 script: progressive load
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 VU in 2 minutes
    { duration: '5m', target: 100 },   // Steady for 5 minutes
    { duration: '2m', target: 500 },   // Ramp up to 500 VU
    { duration: '5m', target: 500 },   // Steady for 5 minutes
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],   // P99 < 500ms
    http_req_failed: ['rate<0.01'],     // Error rate < 1%
  },
};

export default function () {
  const res = http.get('http://localhost:8080/api/users');
  check(res, { 'status is 200': (r) => r.status === 200 });
}

Database Optimization

Slow Query Management

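-- MySQL: enable the slow query log
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0.1;  -- Log queries over 100ms (the global value applies to new connections only)
SET GLOBAL log_queries_not_using_indexes = ON;

-- Analyze a slow query (EXPLAIN ANALYZE requires MySQL 8.0.18+ and actually executes the statement)
EXPLAIN ANALYZE SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name;  -- Group by the key so users who share a name are not merged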

Common optimization paths:

Problem                    | Optimization                     | Expected Effect
Full table scan            | Add an appropriate index         | Rows examined can drop by 99%+ for selective predicates
Excessive bookmark lookups | Covering index                   | Avoids the extra row lookup, often 50%+ of random I/O
Temporary table sorting    | Index that matches the ORDER BY  | Eliminates filesort
Excessive JOINs            | Denormalization/redundant fields | Reduces JOIN depth
Deep pagination            | Cursor-based pagination          | Avoids scanning the first N rows

-- ❌ OFFSET deep pagination (scans 100010 rows, discards first 100000)
SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 100000;

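-- ✅ Cursor-based pagination: seek past the last id returned by the previous page
--    (100000 here) and read only 10 rows through the primary key index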
SELECT * FROM orders WHERE id > 100000 ORDER BY id LIMIT 10;
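The cursor in the query above is the last id the client has already seen, not a row count. The following sketch illustrates that with an in-memory stand-in for the table; the `Order` type and `pageAfter` helper are hypothetical, and a binary search over a sorted slice plays the role of the primary key index seek.

```go
package main

import (
	"fmt"
	"sort"
)

// Order is a hypothetical row; only the primary key matters here.
type Order struct{ ID int64 }

// pageAfter returns up to limit rows with ID > afterID in ID order, plus the
// cursor to pass in for the next page. rows must already be sorted by ID.
func pageAfter(rows []Order, afterID int64, limit int) (page []Order, nextCursor int64) {
	// The binary search stands in for the index seek behind "WHERE id > ?".
	start := sort.Search(len(rows), func(i int) bool { return rows[i].ID > afterID })
	end := start + limit
	if end > len(rows) {
		end = len(rows)
	}
	page = rows[start:end]
	nextCursor = afterID
	if len(page) > 0 {
		nextCursor = page[len(page)-1].ID // the last id seen becomes the next cursor
	}
	return page, nextCursor
}

func main() {
	// IDs with gaps (for example after deletes), so a cursor differs from an offset.
	rows := []Order{{1}, {2}, {5}, {8}, {9}, {12}, {20}}
	cursor := int64(0)
	for {
		page, next := pageAfter(rows, cursor, 3)
		if len(page) == 0 {
			break
		}
		fmt.Println(page, "next cursor:", next)
		cursor = next
	}
}
```

Each page costs one seek plus `limit` rows, no matter how deep the client has paged. The trade-off is that clients can only move forward from a known cursor and cannot jump to an arbitrary page number.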

Connection Pool Tuning

# HikariCP example configuration (validate the values against your own load tests)
maximumPoolSize: 20          # Rule of thumb: (core count * 2) + effective disk count
minimumIdle: 5               # Minimum idle connections kept in the pool
connectionTimeout: 3000      # Max wait to acquire a connection from the pool: 3s
idleTimeout: 600000          # Max time a connection may sit idle: 10 minutes
maxLifetime: 1800000         # Max connection lifetime: 30 minutes; keep it shorter than any database or network connection limit
leakDetectionThreshold: 60000  # Log a possible leak when a connection is held longer than 60s
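As a small worked example of the sizing rule in the comment on `maximumPoolSize`, the sketch below evaluates it for a hypothetical host. The core and disk counts are assumptions chosen for illustration, and the rule is usually read as describing the database server rather than the application host. Treat the result as a starting point for load testing, not a final value.

```go
package main

import "fmt"

// poolSize applies the rule of thumb: (core count * 2) + effective disk count.
func poolSize(cores, effectiveDisks int) int {
	return cores*2 + effectiveDisks
}

func main() {
	// Hypothetical database host: 8 cores and 4 effective disks.
	fmt.Println(poolSize(8, 4)) // 20
}
```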

Connection and Thread Pool Tuning

Thread Pool Configuration

flowchart TD
    A[Request arrives] --> B{Core threads full?}
    B -->|No| C[Core thread processes]
    B -->|Yes| D{Queue full?}
    D -->|No| E[Enqueue and wait]
    D -->|Yes| F{Max threads reached?}
    F -->|No| G[Create non-core thread]
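    F -->|Yes| H[Rejection policy]

// Thread pool configuration (I/O-intensive example)
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ThreadPoolExecutor executor = new ThreadPoolExecutor(
    16,                              // Core thread count
    64,                              // Max thread count
    60, TimeUnit.SECONDS,            // Idle keepalive time for non-core threads
    new LinkedBlockingQueue<>(1000), // Bounded task queue; non-core threads start only once it is full
    new ThreadPoolExecutor.CallerRunsPolicy()  // Submitting thread runs the task when the queue is full and all 64 threads are busy
);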

// CPU-intensive: core threads ≈ CPU core count + 1
// I/O-intensive: core threads ≈ CPU core count * (1 + wait time/compute time)
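The two sizing rules above can be sketched as small helper functions. The core count and the wait/compute times in `main` are hypothetical. In practice the wait-to-compute ratio should come from profiling or tracing, and the result is an initial estimate to confirm under load.

```go
package main

import (
	"fmt"
	"math"
)

// cpuBoundThreads applies the rule of thumb for CPU-intensive work: cores + 1.
func cpuBoundThreads(cores int) int {
	return cores + 1
}

// ioBoundThreads applies cores * (1 + wait time / compute time), rounded up.
func ioBoundThreads(cores int, waitMs, computeMs float64) int {
	return int(math.Ceil(float64(cores) * (1 + waitMs/computeMs)))
}

func main() {
	// Hypothetical: 8 cores; each task waits 30ms on I/O and computes for 10ms.
	fmt.Println(cpuBoundThreads(8))        // 9
	fmt.Println(ioBoundThreads(8, 30, 10)) // 32
}
```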

Rejection Policy Selection

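Policy              | Behavior                                               | Use Case
AbortPolicy         | Throws RejectedExecutionException                      | Default; the caller must detect and handle overload
CallerRunsPolicy    | Submitting thread runs the task itself                 | Slow producers down instead of discarding work
DiscardOldestPolicy | Drops the oldest queued task, then retries the new one | Newer tasks matter more than stale ones
DiscardPolicy       | Silently drops the new task                            | Losing tasks is acceptable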
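A rejection policy only runs at the end of the dispatch order described in this section: core threads first, then the queue, then non-core threads, then rejection. The sketch below models that order as a pure decision function. The `PoolState` type and the action names are hypothetical, and a real executor makes these checks atomically, so this illustrates the ordering rather than the implementation.

```go
package main

import "fmt"

// Action is the outcome of submitting one task to the pool.
type Action string

const (
	StartCoreThread    Action = "start core thread"
	Enqueue            Action = "enqueue"
	StartNonCoreThread Action = "start non-core thread"
	Reject             Action = "apply rejection policy"
)

// PoolState is a hypothetical snapshot of the executor when a task arrives.
type PoolState struct {
	Threads  int // threads currently alive
	CoreSize int // core thread count
	MaxSize  int // max thread count
	Queued   int // tasks waiting in the queue
	QueueCap int // queue capacity
}

// decide follows the dispatch order: core threads, queue, non-core threads, rejection.
func decide(s PoolState) Action {
	switch {
	case s.Threads < s.CoreSize:
		return StartCoreThread
	case s.Queued < s.QueueCap:
		return Enqueue
	case s.Threads < s.MaxSize:
		return StartNonCoreThread
	default:
		return Reject
	}
}

func main() {
	// Same sizes as the executor example: 16 core threads, 64 max, queue of 1000.
	s := PoolState{CoreSize: 16, MaxSize: 64, QueueCap: 1000}

	s.Threads, s.Queued = 16, 999
	fmt.Println(decide(s)) // enqueue

	s.Threads, s.Queued = 16, 1000
	fmt.Println(decide(s)) // start non-core thread

	s.Threads, s.Queued = 64, 1000
	fmt.Println(decide(s)) // apply rejection policy
}
```

The second case is the one that surprises people: with a large queue, the pool stays at its core size until 1000 tasks are already waiting, so latency can climb well before any extra thread is started.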

HTTP Connection Pool

// Go HTTP client tuning
// Imports needed: net, net/http, time
transport := &http.Transport{
    MaxIdleConns:        100,              // Global max idle connections
    MaxIdleConnsPerHost: 20,               // Max idle connections per host
    MaxConnsPerHost:     50,               // Max connections per host
    IdleConnTimeout:     90 * time.Second, // Idle connection timeout
    DialContext: (&net.Dialer{
        Timeout:   5 * time.Second,        // Connection timeout
        KeepAlive: 30 * time.Second,       // TCP keepalive
    }).DialContext,
    TLSHandshakeTimeout:   5 * time.Second,
    ResponseHeaderTimeout: 10 * time.Second,
}
client := &http.Client{Transport: transport, Timeout: 30 * time.Second}

Full-Chain Load Testing and Capacity Planning

Full-Chain Load Testing Architecture

flowchart TD
    A[Load test traffic entry] --> B[Traffic tagging<br/>Mark load test requests]
    B --> C[Gateway<br/>Identify load test traffic]
    C --> D[Service A<br/>Isolate load test data]
    C --> E[Service B<br/>Isolate load test data]
    D --> F[Shadow Database<br/>Load test dedicated]
    E --> F
    D --> G[Message Queue<br/>Load test topic]
    E --> G
    G --> H[Service C<br/>Consume load test messages]

    I[Monitoring Center] --> J[Real-time Dashboard<br/>QPS/Latency/Error Rate]
    I --> K[Alerting<br/>Auto circuit break on anomaly]

Four-Step Capacity Planning

1. Define Goals

Business goal: Peak 10000 QPS during promotion, P99 < 500ms, error rate < 0.1%

2. Baseline Testing

Single instance baseline: 2000 QPS per instance, P99 = 200ms

3. Calculate Capacity

Required instances = ceil(Target QPS / Single instance QPS × Safety factor)
                   = ceil(10000 / 2000 × 1.5)
                   = ceil(7.5)
                   = 8 instances
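The calculation above can be written as a small helper. This is a sketch; the function name is hypothetical, and the result is rounded up because a fractional instance cannot be deployed.

```go
package main

import (
	"fmt"
	"math"
)

// requiredInstances returns ceil(targetQPS / perInstanceQPS * safetyFactor).
func requiredInstances(targetQPS, perInstanceQPS, safetyFactor float64) int {
	return int(math.Ceil(targetQPS / perInstanceQPS * safetyFactor))
}

func main() {
	// Values from the example: 10000 QPS target, 2000 QPS per instance, 1.5x safety factor.
	fmt.Println(requiredInstances(10000, 2000, 1.5)) // 8 (7.5 rounded up)
}
```

The formula assumes throughput scales linearly with instance count, which holds only until a shared dependency such as the database saturates.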

4. Validate and Tune

Step load test:
  4 instances → 4000 QPS → P99 = 350ms ✅
  6 instances → 6000 QPS → P99 = 420ms ✅
  8 instances → 8000 QPS → P99 = 480ms ⚠️  Approaching target
  10 instances → 10000 QPS → P99 = 460ms ✅ Goal met

Bottleneck identified: requests queue for database connections → Increase the connection pool size → 8 instances are then sufficient

Performance Tuning Process

flowchart TD
    A[Establish performance baseline] --> B[Execute load test]
    B --> C{Goal met?}
    C -->|Yes| D[Done]
    C -->|No| E[Identify bottleneck]
    E --> F{Where is the bottleneck?}
    F -->|CPU| G[Optimize algorithms/concurrency]
    F -->|I/O| H[Caching/async/batching]
    F -->|Database| I[Index/SQL/sharding]
    F -->|Network| J[Connection pool/compression/protocol]
    G --> B
    H --> B
    I --> B
    J --> B

Common Tuning Methods by Priority

Priority | Method                  | Effect  | Cost
1        | Add caching             | 10-100x | Low
2        | Optimize indexes/SQL    | 5-50x   | Low
3        | Asynchronous processing | 2-10x   | Medium
4        | Batch processing        | 3-10x   | Medium
5        | Connection pool tuning  | 2-5x    | Low
6        | Algorithm optimization  | Varies  | High
7        | Horizontal scaling      | Linear  | High (cost)

The core principle of performance tuning: Measure first, then optimize. Let data speak; don’t optimize based on intuition. Change only one variable at a time, confirm the effect, then proceed to the next step.
