K8s Operations & Troubleshooting

Common Failure Patterns

Kubernetes failures tend to fall into a small number of recurring patterns. Recognizing the pattern quickly narrows down where to look and shortens troubleshooting.

Pod-Level Failures

Status	Cause	Troubleshooting Direction
ImagePullBackOff	Image pull failure	Check image name, credentials, network
CrashLoopBackOff	Container crashes after starting	Check container logs, resource limits
OOMKilled	Killed due to insufficient memory	Increase limits or optimize memory usage
Pending	Cannot be scheduled	Check resources, affinity, PVC
Completed	Main process exited normally	Check if it should be long-running
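
To spot these states across a whole cluster, one option is to filter the output of `kubectl get pods -A --no-headers`. The sketch below assumes the default column layout (NAMESPACE, NAME, READY, STATUS, ...); the function name `filter_unhealthy` is illustrative and not part of kubectl.

```shell
# Print "namespace/pod status" for every Pod whose STATUS column is not
# Running or Completed. Reads `kubectl get pods -A --no-headers` output on stdin.
# Assumption: default column order NAMESPACE NAME READY STATUS RESTARTS AGE.
filter_unhealthy() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | filter_unhealthy
```

Because the function only reads stdin, it can also be run against saved command output when reviewing an incident after the fact.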

Node-Level Failures

Status	Cause	Troubleshooting Direction
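NotReady	kubelet stopped, unhealthy, or unable to reach the API server	Check kubelet logs, certificates, and the container runtime
DiskPressure	Available disk space or inodes below the eviction threshold	Clean up unused images and logs
MemoryPressure	Available memory below the eviction threshold	Identify memory-heavy Pods; evict or reschedule low-priority ones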
PIDPressure	Too many processes	Check for process leaks
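
The node conditions in the table can be checked for every node at once. The following is a minimal sketch: the helper name `flag_unhealthy_nodes` is hypothetical, and it expects one line per node in the form `name Type=Status ...`, which the jsonpath expression in the usage comment is intended to produce.

```shell
# Flag nodes that are not Ready or that report any *Pressure condition.
# Input lines on stdin look like:
#   node-1 MemoryPressure=False DiskPressure=False PIDPressure=False Ready=True
flag_unhealthy_nodes() {
  awk '{
    problems = ""
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      # Ready must be exactly True; False and Unknown both count as NotReady.
      if (kv[1] == "Ready" && kv[2] != "True") problems = problems " NotReady"
      # Pressure conditions are a problem only when explicitly True.
      if (kv[1] ~ /Pressure$/ && kv[2] == "True") problems = problems " " kv[1]
    }
    if (problems != "") print $1 ":" problems
  }'
}

# Usage (requires cluster access):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{range .status.conditions[*]}{" "}{.type}={.status}{end}{"\n"}{end}' | flag_unhealthy_nodes
```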

Troubleshooting Decision Tree

When a failure occurs, work through the following decision tree step by step:

graph TD
    Start[Failure Occurred] --> Q1{Pod Status?}
    Q1 -->|Pending| A1[Check resource requests vs node capacity]
    Q1 -->|CrashLoopBackOff| A2[Check container logs and events]
    Q1 -->|ImagePullBackOff| A3[Check image and pull credentials]
    Q1 -->|Running but abnormal| Q2{Service reachable?}
    A1 --> Fix1[Adjust resources/Add nodes/Modify affinity]
    A2 --> Fix2[Fix app bugs/Adjust probes/Increase resources]
    A3 --> Fix3[Fix image name/Create imagePullSecrets]
    Q2 -->|Unreachable| A4[Check Service and Endpoints]
    Q2 -->|Reachable but errors| A5[Check app logs and configuration]
    A4 --> Fix4[Fix selector/Check NetworkPolicy]
    A5 --> Fix5[Check ConfigMap/Secret/Environment variables]
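
The first two levels of the tree can be sketched as a small shell function, for example to print a hint in a runbook script. The function name `next_check` and its second argument are assumptions made for illustration; the returned strings are taken from the tree itself.

```shell
# Map a Pod status to the next check suggested by the decision tree.
# $1: Pod status (Pending, CrashLoopBackOff, ImagePullBackOff, Running)
# $2: "reachable" or "unreachable"; only consulted when $1 is Running
next_check() {
  case "$1" in
    Pending)          echo "Check resource requests vs node capacity" ;;
    CrashLoopBackOff) echo "Check container logs and events" ;;
    ImagePullBackOff) echo "Check image and pull credentials" ;;
    Running)
      if [ "${2:-}" = "unreachable" ]; then
        echo "Check Service and Endpoints"
      else
        echo "Check app logs and configuration"
      fi ;;
    *)                echo "Status not covered by the tree; start with kubectl describe pod" ;;
  esac
}
```

For example, `next_check Running unreachable` prints the Service and Endpoints branch.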

Troubleshooting Toolchain

Essential Command Reference

# === Pod Troubleshooting ===
kubectl get pods -A -o wide                    # Global Pod status
kubectl describe pod <name>                    # Events and status details
kubectl logs <pod> -c <container> --previous   # Logs from last crash
kubectl logs <pod> --all-containers            # All container logs
kubectl exec -it <pod> -- /bin/sh              # Enter container

# === Network Troubleshooting ===
kubectl get endpoints <service>                # Service backend Pods
kubectl get networkpolicies -A                 # Network policies
kubectl run tmp --image=busybox --rm -it --restart=Never -- wget -qO- http://svc:80  # Temporary test (replace svc with the Service name)

# === Node Troubleshooting ===
kubectl describe node <name>                   # Node conditions and resources
kubectl top nodes                              # Resource usage
kubectl get events --field-selector involvedObject.kind=Node  # Node events

# === Cluster Troubleshooting ===
kubectl get componentstatuses                  # Component health status (deprecated since v1.19)
kubectl get --raw='/readyz?verbose'            # API server readiness checks
kubectl get apiservices                        # API service status

Temporary Diagnostic Pods

Create a throwaway Pod with diagnostic tools preinstalled:

# Network diagnostics
kubectl run nettool --image=nicolaka/netshoot --rm -it -- bash

# Inside diagnostic Pod (short Service names resolve only within the Pod's own namespace)
nslookup api-service.production.svc.cluster.local
curl -v http://api-service:8080/health
traceroute db-service

Log Aggregation Queries

# Use stern to view multiple Pod logs in parallel
# (the positional argument is a Pod name regex; label selectors need -l)
stern -l app=api -n production --since 1h

# Use kubectl logs with label selector
kubectl logs -l app=api -n production --since=1h --tail=100
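
When several Pods share a label, it often helps to see which of them produces the errors. The sketch below assumes logs are fetched with `kubectl logs --prefix`, which tags each line with `[pod/<pod>/<container>]`, and that the application writes the literal string `ERROR`; both the helper name `count_errors_by_source` and the `ERROR` marker are assumptions to adapt.

```shell
# Count lines containing "ERROR" per log source, highest count first.
# Expects `kubectl logs --prefix` style input on stdin, where the first
# whitespace-separated field is a "[pod/<pod>/<container>]" tag.
count_errors_by_source() {
  awk '/ERROR/ { count[$1]++ } END { for (src in count) print count[src], src }' | sort -rn
}

# Usage (requires cluster access):
#   kubectl logs -l app=api -n production --since=1h --tail=-1 --prefix | count_errors_by_source
```

The usage line passes `--tail=-1` because `kubectl logs` returns only the last 10 lines per Pod by default when a selector is given.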

Cluster Operations

Certificate Management

Client and server certificates issued by kubeadm are valid for 1 year by default (the CA certificates for 10 years) and must be renewed before they expire:

# Check certificate expiration
kubeadm certs check-expiration

# Renew all certificates
kubeadm certs renew all

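# Restart control plane components for new certificates to take effect.
# They run as static Pods, so move the manifests out, wait for the kubelet
# to stop the Pods, then move them back. (`docker restart` only works on
# nodes that still use Docker; dockershim was removed in v1.24.)
mkdir -p /tmp/k8s-manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/
sleep 20
mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/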

etcd Backup and Recovery

etcd is the core of cluster state and must be backed up regularly:

# Backup etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# View snapshot status
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-20260501.db --write-table

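# Restore (run on every etcd member, then point the etcd static Pod at the
# restored data directory and restart etcd; multi-member clusters also need
# the --name and --initial-cluster* flags for each member).
# etcd v3.5+ deprecates this subcommand in favor of `etcdutl snapshot restore`.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20260501.db \
  --data-dir=/var/lib/etcd/restore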
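
Regular backups also need a retention policy so the backup volume does not fill up. The following is a minimal sketch, assuming snapshots follow the `etcd-YYYYMMDD.db` naming used above (so that name order equals date order); the function name `prune_etcd_backups` is hypothetical.

```shell
# Keep the newest $2 snapshots in directory $1 and delete the rest.
# Assumption: files are named etcd-YYYYMMDD.db, so reverse lexicographic
# order lists the newest snapshot first.
prune_etcd_backups() {
  local dir="$1" keep="$2"
  # List snapshots newest-first, skip the first $keep, delete the remainder.
  find "$dir" -maxdepth 1 -type f -name 'etcd-*.db' | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do rm -f -- "$f"; done
}

# Usage: keep the seven most recent daily snapshots
#   prune_etcd_backups /backup 7
```

It is worth verifying the newest snapshot with `etcdctl snapshot status` before pruning older ones, so a corrupt backup never becomes the only copy.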

Node Maintenance

# Mark node as unschedulable
kubectl cordon node-1

# Evict Pods from node (before maintenance)
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Restore after maintenance
kubectl uncordon node-1

Cluster Upgrade Process

# 1. Upgrade kubeadm
apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm=1.29.4-*
apt-mark hold kubeadm

# 2. Verify upgrade plan
kubeadm upgrade plan

# 3. Upgrade control plane
kubeadm upgrade apply v1.29.4

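# 4. Upgrade worker nodes one by one
#    On the worker: upgrade kubeadm as in step 1, then update the node configuration
kubeadm upgrade node
#    From a machine with cluster access: evict workloads
kubectl drain node-1 --ignore-daemonsets
#    On the worker: upgrade kubelet and kubectl, then restart kubelet
apt-mark unhold kubelet kubectl
apt-get install -y kubelet=1.29.4-* kubectl=1.29.4-*
apt-mark hold kubelet kubectl
systemctl daemon-reload
systemctl restart kubelet
#    From a machine with cluster access: make the node schedulable again
kubectl uncordon node-1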
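
kubeadm does not support skipping minor versions during an upgrade, so it can be useful to check the version gap before starting. The sketch below is an illustrative pre-flight check; `minor_skew` and `upgrade_path_ok` are hypothetical helper names, and the code assumes both versions share the same major version.

```shell
# Print the minor-version distance between two Kubernetes versions,
# for example v1.28.9 to v1.29.4. A leading "v" is optional.
# Assumption: both versions have the same major version.
minor_skew() {
  local from="${1#v}" to="${2#v}"
  local from_minor to_minor
  from_minor=$(echo "$from" | cut -d. -f2)
  to_minor=$(echo "$to" | cut -d. -f2)
  echo $((to_minor - from_minor))
}

# Succeed only when the target is the same minor version or exactly one ahead.
upgrade_path_ok() {
  local skew
  skew=$(minor_skew "$1" "$2")
  [ "$skew" -ge 0 ] && [ "$skew" -le 1 ]
}

# Usage:
#   upgrade_path_ok v1.28.9 v1.29.4 && echo "direct upgrade is supported"
```

A cluster that is two or more minor versions behind would need to be upgraded one minor version at a time.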

Resource Quotas

ResourceQuota — Namespace-Level Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "10"
    persistentvolumeclaims: "20"

LimitRange — Default Resource Limits

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      default:           # Default limits
        cpu: 500m
        memory: 256Mi
      defaultRequest:    # Default requests
        cpu: 100m
        memory: 128Mi
      max:               # Maximum allowed
        cpu: "2"
        memory: 2Gi
      min:               # Minimum allowed
        cpu: 50m
        memory: 64Mi
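
CPU quantities in these manifests mix whole cores (`"2"`) and millicores (`500m`), which makes it easy to misjudge whether min, defaultRequest, default, and max are in ascending order. The helper below is a small sketch for normalizing CPU values before comparing them; `cpu_to_millicores` is a hypothetical name, and it handles only plain and `m`-suffixed CPU quantities.

```shell
# Convert a Kubernetes CPU quantity ("500m", "2", "0.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;                                              # already millicores
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;   # cores to millicores
  esac
}

# Usage: check that the LimitRange above is ordered min <= defaultRequest <= default <= max
#   [ "$(cpu_to_millicores 50m)" -le "$(cpu_to_millicores 100m)" ] && echo "min <= defaultRequest"
```

Current consumption against a quota can be inspected with `kubectl describe resourcequota team-quota -n team-a`.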

Cluster Monitoring

Monitoring System Architecture

graph TB
    subgraph "Data Collection"
        NodeExp[Node Exporter]
        KubeState[Kube State Metrics]
        CAdvisor[cAdvisor]
        AppExp[App Exporter]
    end
    subgraph "Storage & Query"
        Prom[Prometheus]
        Thanos[Thanos / VictoriaMetrics]
    end
    subgraph "Visualization & Alerting"
        Grafana[Grafana Dashboard]
        AlertMgr[Alertmanager]
        PagerDuty[PagerDuty / Feishu]
    end
    NodeExp --> Prom
    KubeState --> Prom
    CAdvisor --> Prom
    AppExp --> Prom
    Prom --> Thanos
    Prom --> Grafana
    Prom --> AlertMgr
    AlertMgr --> PagerDuty

Key Monitoring Metrics

Category	Metric	Suggested Alert Threshold
Pod	`kube_pod_container_status_restarts_total`	> 3 times/10min
Pod	`kube_pod_status_phase{phase="Pending"}`	> 5min
Node	`node_cpu_seconds_total`	Utilization > 85%
Node	`node_memory_MemAvailable_bytes`	Available < 10%
Node	`node_filesystem_avail_bytes`	Available < 15%
K8s	`kube_node_status_condition`	NotReady > 3min
etcd	`etcd_disk_wal_fsync_duration_seconds`	P99 > 10ms

Essential Alert Rules

groups:
  - name: k8s-critical
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"

      - alert: PVAlmostFull
        expr: |
          (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is {{ $value | humanizePercentage }} full"

Kubernetes operations and troubleshooting is a skill built on method rather than guesswork. Recognizing failure patterns, following a decision tree, and knowing the toolchain, backed by comprehensive monitoring and alerting, makes it possible to locate faults quickly and restore service.