Backend Performance Tuning

Performance Metrics

Backend performance tuning starts with defining measurement standards:

Metric	Meaning	Focus
Throughput (QPS/TPS)	Requests/transactions per second	System capacity ceiling
Latency	Time from request to response	User experience
P50/P90/P99	50th/90th/99th percentile latency	Latency distribution and tail latency
Concurrency	Simultaneously processed requests	Resource utilization
Error rate	Percentage of failed requests	System stability

Why look at P99 instead of averages? Suppose 98 out of 100 requests take 10ms and 2 take 10s. The average is about 210ms, which seems acceptable, but P99 is 10s, meaning at least 1% of users have a terrible experience. Averages mask tail problems.
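
The gap between the average and the tail can be checked with a few lines of Python. This is a minimal sketch that assumes the nearest-rank definition of a percentile and a hypothetical sample of 1,000 requests (980 fast, 20 slow). Monitoring tools may interpolate differently, so exact values can vary by tool.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 980 requests at 10 ms and 20 requests at 10 s (10000 ms).
latencies_ms = [10] * 980 + [10_000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p90, p99 = (percentile(latencies_ms, p) for p in (50, 90, 99))

print(f"mean={mean_ms:.1f}ms p50={p50}ms p90={p90}ms p99={p99}ms")
```

The mean lands near 210 ms while P50 and P90 stay at 10 ms and P99 jumps to 10 s, so only the high percentile reveals the slow requests.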

graph LR
    A[Performance Goals] --> B[Reduce P99 latency]
    A --> C[Increase throughput]
    A --> D[Maintain error rate < 0.1%]

Load Testing Tools

wrk

# Basic load test: 12 threads, 400 connections, 30 seconds
wrk -t12 -c400 -d30s http://localhost:8080/api/users

# With latency distribution
wrk -t12 -c400 -d30s --latency http://localhost:8080/api/users

# Output example (abridged; values are illustrative)
#   Latency Distribution
#      50%    3.21ms
#      90%    5.43ms
#      99%   12.87ms
#   Requests/sec: 124532.11
#   Transfer/sec:     45.67MB

wrk2 (Constant Throughput Load Testing)

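# Constant 1000 requests/s; repeat with higher -R values to see how latency changes with load
# (wrk2 builds a binary named wrk by default; this assumes it is installed as wrk2)
wrk2 -t4 -c100 -d30s -R1000 --latency http://localhost:8080/api/users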

k6

// k6 script: progressive load
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 VU in 2 minutes
    { duration: '5m', target: 100 },   // Steady for 5 minutes
    { duration: '2m', target: 500 },   // Ramp up to 500 VU
    { duration: '5m', target: 500 },   // Steady for 5 minutes
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],   // P99 < 500ms
    http_req_failed: ['rate<0.01'],     // Error rate < 1%
  },
};

export default function () {
  const res = http.get('http://localhost:8080/api/users');
  check(res, { 'status is 200': (r) => r.status === 200 });
}

Database Optimization

Slow Query Management

-- MySQL: enable the slow query log
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0.1;  -- Log queries over 100ms (takes effect for new connections)
SET GLOBAL log_queries_not_using_indexes = ON;

-- Analyze slow queries (EXPLAIN ANALYZE requires MySQL 8.0.18 or later)
EXPLAIN ANALYZE SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name;  -- Group by id so users who share a name are not merged

Common optimization paths:

Problem	Optimization	Expected Effect
Full table scan	Add appropriate index	Far fewer rows examined (often 99%+ on selective filters)
Excessive bookmark lookups	Covering index	Avoids the extra primary-key lookups and their random I/O
Temporary table sorting	Optimize ORDER BY index	Eliminate filesort
Excessive JOINs	Denormalization/redundant fields	Reduce JOIN depth
Deep pagination	Cursor-based pagination	Avoid scanning first N rows

-- ❌ OFFSET deep pagination (reads 100010 rows, discards the first 100000)
SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 100000;

-- ✅ Cursor-based pagination (seeks to the cursor through the index, then reads 10 rows)
-- 100000 stands for the last id returned by the previous page
SELECT * FROM orders WHERE id > 100000 ORDER BY id LIMIT 10;
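
The cursor-based query above only works if the application feeds the last id of each page back in as the next cursor. The following is a minimal sketch of that loop. It uses an in-memory SQLite database as a stand-in for MySQL, and the `amount` column and `fetch_page` helper are hypothetical names chosen for illustration.

```python
import sqlite3


def fetch_page(conn, last_seen_id, page_size):
    """Return the next page of orders after last_seen_id (cursor-based pagination)."""
    return conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
# Ids with gaps (every third id) to show the cursor does not rely on dense ids.
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(i, i * 10) for i in range(1, 100, 3)],
)

seen_ids = []
cursor_id = 0  # start before the first row
while True:
    page = fetch_page(conn, cursor_id, page_size=10)
    if not page:
        break
    seen_ids.extend(row[0] for row in page)
    cursor_id = page[-1][0]  # last id of this page becomes the next cursor

print(len(seen_ids), seen_ids[:3], seen_ids[-1])
```

Each page is located by an index seek on `id`, so the cost of fetching a page does not grow with how deep into the result set it is.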

Connection Pool Tuning

# HikariCP example configuration
maximumPoolSize: 20          # Starting point: (database server core count * 2) + effective disk count
minimumIdle: 5               # Minimum idle connections
connectionTimeout: 3000      # Connection acquisition timeout 3s
idleTimeout: 600000          # Idle connections above minimumIdle are retired after 10 minutes
maxLifetime: 1800000         # Max connection lifetime 30 minutes
leakDetectionThreshold: 60000  # Log a possible leak when a connection is held longer than 60s
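
The sizing formula in the comment above is simple arithmetic. It is sketched here as a helper so the inputs are explicit. The function name and the hardware figures are hypothetical, and the result is only a starting point to validate under load, not a guaranteed optimum.

```python
def suggested_pool_size(core_count, effective_disk_count):
    """Starting-point pool size: (core count * 2) + effective disk count."""
    if core_count < 1 or effective_disk_count < 0:
        raise ValueError("core_count must be >= 1 and effective_disk_count >= 0")
    return core_count * 2 + effective_disk_count


# Hypothetical hardware: 8 cores and 4 effective disks.
print(suggested_pool_size(8, 4))  # 20
```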

Connection and Thread Pool Tuning

Thread Pool Configuration

flowchart TD
    A[Request arrives] --> B{Core threads full?}
    B -->|No| C[Core thread processes]
    B -->|Yes| D{Queue full?}
    D -->|No| E[Enqueue and wait]
    D -->|Yes| F{Max threads reached?}
    F -->|No| G[Create non-core thread]
    F -->|Yes| H[Rejection policy]
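
The decision flow in the flowchart above can be written as a small function. This is a simplified model for reasoning about pool settings, not the actual `ThreadPoolExecutor` implementation. The function name and the returned labels are illustrative.

```python
def dispatch(active_threads, queued_tasks, core_size, max_size, queue_capacity):
    """Return the flowchart branch that a newly submitted task takes."""
    if active_threads < core_size:
        return "run on core thread"
    if queued_tasks < queue_capacity:
        return "enqueue and wait"
    if active_threads < max_size:
        return "create non-core thread"
    return "apply rejection policy"


# Example pool: 16 core threads, 64 max threads, queue capacity 1000.
for active, queued in [(8, 0), (16, 500), (16, 1000), (64, 1000)]:
    print(active, queued, "->", dispatch(active, queued, 16, 64, 1000))
```

One consequence worth noting: with a large queue, threads beyond the core size are created only after the queue fills, so a queue capacity of 1000 delays scale-out until 1000 tasks are already waiting.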

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Thread pool configuration (I/O-intensive example)
ThreadPoolExecutor executor = new ThreadPoolExecutor(
    16,                              // Core thread count
    64,                              // Max thread count
    60, TimeUnit.SECONDS,            // Non-core thread idle keepalive time
    new LinkedBlockingQueue<>(1000), // Bounded task queue; non-core threads start only once it is full
    new ThreadPoolExecutor.CallerRunsPolicy()  // Submitting thread runs the task when the queue is full and all 64 threads are busy
);

// CPU-intensive: core threads ≈ CPU core count + 1
// I/O-intensive: core threads ≈ CPU core count * (1 + wait time/compute time)
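
The two sizing rules above are easier to apply with concrete numbers. The following is a minimal sketch in which the function names and the workload figures (30 ms of waiting per 10 ms of compute) are assumptions for illustration. Real wait and compute times should come from profiling.

```python
import os


def cpu_bound_threads(cores):
    """CPU-intensive rule of thumb: core count + 1."""
    return cores + 1


def io_bound_threads(cores, wait_ms, compute_ms):
    """I/O-intensive rule of thumb: cores * (1 + wait time / compute time)."""
    if compute_ms <= 0:
        raise ValueError("compute_ms must be positive")
    return round(cores * (1 + wait_ms / compute_ms))


cores = os.cpu_count() or 1  # cpu_count() can return None
print("this machine:", cpu_bound_threads(cores), io_bound_threads(cores, 30, 10))

# Hypothetical workload on 4 cores: 30 ms waiting on I/O per 10 ms of CPU work.
print(io_bound_threads(4, wait_ms=30, compute_ms=10))  # 16
```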

Rejection Policy Selection

Policy	Behavior	Use Case
AbortPolicy	Throw RejectedExecutionException	Default; caller must detect and handle overload
CallerRunsPolicy	Submitting thread runs the task	Slow producers down rather than discard work
DiscardOldestPolicy	Drop the oldest queued task, then retry the new one	Newer tasks matter more than stale ones
DiscardPolicy	Silently drop the new task	Losing tasks is acceptable

HTTP Connection Pool

// Go HTTP client tuning
import (
    "net"
    "net/http"
    "time"
)

// newHTTPClient is an illustrative helper name; build the client once and reuse it.
func newHTTPClient() *http.Client {
    transport := &http.Transport{
        MaxIdleConns:        100,              // Global max idle connections
        MaxIdleConnsPerHost: 20,               // Max idle connections per host
        MaxConnsPerHost:     50,               // Max connections per host
        IdleConnTimeout:     90 * time.Second, // Idle connection timeout
        DialContext: (&net.Dialer{
            Timeout:   5 * time.Second,        // Connection timeout
            KeepAlive: 30 * time.Second,       // TCP keepalive
        }).DialContext,
        TLSHandshakeTimeout:   5 * time.Second,
        ResponseHeaderTimeout: 10 * time.Second,
    }
    // Timeout caps the whole request, including reading the response body
    return &http.Client{Transport: transport, Timeout: 30 * time.Second}
}

Full-Chain Load Testing and Capacity Planning

Full-Chain Load Testing Architecture

flowchart TD
    A[Load test traffic entry] --> B[Traffic tagging<br/>Mark load test requests]
    B --> C[Gateway<br/>Identify load test traffic]
    C --> D[Service A<br/>Isolate load test data]
    C --> E[Service B<br/>Isolate load test data]
    D --> F[Shadow Database<br/>Load test dedicated]
    E --> F
    D --> G[Message Queue<br/>Load test topic]
    E --> G
    G --> H[Service C<br/>Consume load test messages]

    I[Monitoring Center] --> J[Real-time Dashboard<br/>QPS/Latency/Error Rate]
    I --> K[Alerting<br/>Auto circuit break on anomaly]
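
The tagging and isolation steps in the diagram above come down to two rules: every hop must recognize the load test tag, and every hop must pass it on. The following is a minimal sketch of both. The header name `X-Load-Test`, the database names, and the topic names are all hypothetical and would be replaced by whatever convention the gateway actually uses.

```python
LOAD_TEST_HEADER = "X-Load-Test"  # hypothetical header set at the load test traffic entry


def is_load_test(headers):
    """True when the request carries the load test tag (header name is case-insensitive)."""
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get(LOAD_TEST_HEADER.lower(), "").lower() == "true"


def route(headers):
    """Pick the database and topic for a request; the names are illustrative only."""
    if is_load_test(headers):
        return {"database": "orders_shadow", "topic": "orders.loadtest"}
    return {"database": "orders", "topic": "orders"}


def outbound_headers(inbound_headers):
    """Propagate the tag to downstream calls so every hop can isolate the traffic."""
    if is_load_test(inbound_headers):
        return {LOAD_TEST_HEADER: "true"}
    return {}


print(route({"x-load-test": "true"}))
print(route({"User-Agent": "curl"}))
```

If any service drops the tag when calling downstream, load test writes reach production storage, which is why propagation is modeled as its own function here.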

Four-Step Capacity Planning

1. Define Goals

Business goal: Peak 10000 QPS during promotion, P99 < 500ms, error rate < 0.1%

2. Baseline Testing

Single instance baseline: 2000 QPS per instance, P99 = 200ms

3. Calculate Capacity

Required instances = ceil(Target QPS / Single instance QPS × Safety factor)
                   = ceil(10000 / 2000 × 1.5)
                   = ceil(7.5)
                   = 8 instances
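
The same calculation as a small helper, which makes the rounding step explicit: a fractional result such as 7.5 must be rounded up, because a partial instance cannot be deployed. The function name is illustrative.

```python
import math


def required_instances(target_qps, per_instance_qps, safety_factor):
    """Instances needed for target_qps, rounded up to a whole instance."""
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    return math.ceil(target_qps / per_instance_qps * safety_factor)


# Numbers from the example: 10000 QPS target, 2000 QPS per instance, 1.5x safety factor.
print(required_instances(10_000, 2_000, 1.5))  # 8
```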

4. Validate and Tune

Step load test:
  4 instances → 4000 QPS → P99 = 350ms ✅
  6 instances → 6000 QPS → P99 = 420ms ✅
  8 instances → 8000 QPS → P99 = 480ms ⚠️  Approaching target
  10 instances → 10000 QPS → P99 = 460ms ✅ Goal met

Bottleneck identified: requests were queuing for database connections. After the connection pool size was increased, 8 instances were sufficient to meet the goal.

Performance Tuning Process

flowchart TD
    A[Establish performance baseline] --> B[Execute load test]
    B --> C{Goal met?}
    C -->|Yes| D[Done]
    C -->|No| E[Identify bottleneck]
    E --> F{Where is the bottleneck?}
    F -->|CPU| G[Optimize algorithms/concurrency]
    F -->|I/O| H[Caching/async/batching]
    F -->|Database| I[Index/SQL/sharding]
    F -->|Network| J[Connection pool/compression/protocol]
    G --> B
    H --> B
    I --> B
    J --> B

Common Tuning Methods by Priority

Priority	Method	Typical Effect	Cost
1	Add caching	10-100x	Low
2	Optimize indexes/SQL	5-50x	Low
3	Asynchronous processing	2-10x	Medium
4	Batch processing	3-10x	Medium
5	Connection pool tuning	2-5x	Low
6	Algorithm optimization	Varies	High
7	Horizontal scaling	Linear	High (cost)

The core principle of performance tuning: Measure first, then optimize. Let data speak; don’t optimize based on intuition. Change only one variable at a time, confirm the effect, then proceed to the next step.