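Kubernetes Storage and Scheduling

Storage Architecture

Kubernetes abstracts the underlying storage behind a standard set of API objects, so applications do not need to know which storage backend is in use. The core objects are PersistentVolume (PV), PersistentVolumeClaim (PVC), and StorageClass.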

graph TB
    Pod[Pod] -->|references| PVC[PVC<br/>storage request]
    PVC -->|binds to| PV[PV<br/>storage resource]
    PV -->|backend| NFS[NFS]
    PV -->|backend| EBS[AWS EBS]
    PV -->|backend| Ceph[Ceph RBD]
    PVC -.->|dynamic provisioning| SC[StorageClass]
    SC -->|creates automatically| PV

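PersistentVolume (PV)

A PV is a cluster-scoped storage resource, either created manually by an administrator or provisioned dynamically through a StorageClass: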

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany        # RWX: read-write by many nodes
    # - ReadWriteOnce     # RWO: read-write by a single node
    # - ReadOnlyMany      # ROX: read-only by many nodes
  persistentVolumeReclaimPolicy: Retain  # Retain or Delete (Recycle is deprecated)
  storageClassName: nfs
  nfs:
    server: 10.0.0.100
    path: /data/share

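PersistentVolumeClaim (PVC)

A PVC is a user's request for storage, much as a Pod's resource requests are a request for CPU and memory. A PVC is namespaced, and it binds to a PV that satisfies its requested size, access modes, and storage class: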

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: fast-ssd
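
A Pod consumes the claim by name rather than referring to the PV directly. The following is a minimal sketch of that reference; the Pod name, image, and mount path are placeholders chosen for illustration, and only the claim name `app-data` comes from the manifest above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                     # hypothetical Pod name
spec:
  containers:
    - name: app
      image: nginx:1.27         # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data      # hypothetical mount path inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data     # the PVC defined above; must be in the Pod's namespace
```

Because the Pod only names the claim, the backing volume can change (NFS, EBS, Ceph) without any change to the Pod spec.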

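PV/PVC Binding Flow

The sequence below shows dynamic provisioning with the Immediate binding mode. With WaitForFirstConsumer, the PV is created and bound only after the Pod that uses the PVC has been scheduled.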

sequenceDiagram
    participant User as User
    participant API as API Server
    participant SC as StorageClass
    participant PV as PV
    participant PVC as PVC
    participant Pod as Pod

    User->>API: Create PVC
    API->>SC: Match StorageClass
    SC->>PV: Dynamically provision PV
    PV->>PVC: Bind (Bound)
    User->>API: Create Pod
    Pod->>PVC: Mount volume
    PVC->>PV: Use storage

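StorageClass: Dynamic Provisioning

A StorageClass removes the need for administrators to create PVs by hand: when a PVC references the class, its provisioner creates a matching volume on demand.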

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer  # delay binding until a Pod using the PVC is scheduled
allowVolumeExpansion: true

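volumeBindingMode has two values:

Immediate: the PVC is bound, and the volume provisioned, as soon as the PVC is created, without regard to where the Pod will run. With zonal storage, the volume can end up in an availability zone where the Pod cannot be scheduled.
WaitForFirstConsumer: binding and provisioning are delayed until a Pod that uses the PVC is being scheduled, so the volume is created in a zone that satisfies the Pod's scheduling constraints.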

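Scheduler

The Kubernetes scheduler (kube-scheduler) assigns each Pod to a suitable node. Its decision has two phases, filtering and scoring, after which the Pod is bound to the chosen node: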

graph LR
    Pod["Pending Pod"] --> Filter["Filter: exclude nodes that fail the requirements"]
    Filter --> Score["Score: rank the remaining nodes"]
    Score --> Bind["Bind: pick the highest-scoring node"]
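
To make the filter phase concrete, here is a minimal Pod sketch showing two inputs the scheduler checks before any scoring happens. The Pod name, the `disktype: ssd` label, and the image are assumptions made for illustration; they do not appear elsewhere in this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo        # hypothetical name
spec:
  nodeSelector:
    disktype: ssd              # hypothetical label; nodes without it are filtered out
  containers:
    - name: app
      image: nginx:1.27        # placeholder image
      resources:
        requests:
          cpu: 500m            # nodes with less unreserved CPU are filtered out
          memory: 256Mi        # the same check applies to memory
```

Only the nodes that pass both checks are scored. If no node passes, the Pod stays Pending.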

Node Affinity

Node affinity restricts or biases which nodes a Pod can be scheduled onto, based on node labels:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:  # hard constraint
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"]
      preferredDuringSchedulingIgnoredDuringExecution:  # soft constraint
        - weight: 80
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values: ["high-mem"]

Pod Affinity and Anti-Affinity

Pod affinity and anti-affinity control how Pods are placed relative to other Pods:

spec:
  affinity:
    podAntiAffinity:  # anti-affinity: spread replicas apart
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname  # at most one web Pod per node (assuming this Pod is labeled app: web)
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web
            topologyKey: topology.kubernetes.io/zone  # prefer spreading across availability zones

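Taints and Tolerations

A taint on a node repels Pods. A toleration on a Pod allows it, but does not require it, to be scheduled onto nodes with a matching taint: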

# Add a taint to the node
kubectl taint nodes node-1 dedicated=gpu:NoSchedule

# Pod spec that tolerates the taint
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

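Taint effect	Description
NoSchedule	New Pods without a matching toleration are not scheduled onto the node
PreferNoSchedule	The scheduler tries to avoid the node (soft constraint)
NoExecute	New Pods are not scheduled, and running Pods without a matching toleration are evicted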
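
For the NoExecute effect, a toleration can also bound how long a Pod stays on the node once the taint appears. The sketch below reuses the `dedicated` key from the example above; the 600-second value is an illustrative assumption, not a recommendation.

```yaml
spec:
  tolerations:
    - key: "dedicated"
      operator: "Exists"        # matches any value of the key, so no value field is given
      effect: "NoExecute"
      tolerationSeconds: 600    # illustrative: evicted 10 minutes after the taint is added
```

Without `tolerationSeconds`, a Pod that tolerates a NoExecute taint stays on the node indefinitely.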

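Stateful Applications: StatefulSet

A StatefulSet gives stateful applications stable network identities, per-Pod persistent storage, and ordered deployment and scaling: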

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-headless  # the governing headless Service
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
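      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD   # the official image refuses to initialize without it
              valueFrom:
                secretKeyRef:
                  name: postgres-secret # hypothetical Secret, not defined in this article
                  key: password
            - name: PGDATA              # subdirectory, so lost+found in the volume root does not break initdb
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:  # one PVC per Pod, created from this template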
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
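
The `serviceName` field refers to a headless Service that must exist for the per-Pod DNS names to resolve. A minimal sketch of that Service follows; the port name is an assumption, while the Service name, selector, and port number are taken from the StatefulSet above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless   # must equal spec.serviceName in the StatefulSet
spec:
  clusterIP: None           # headless: DNS returns Pod IPs instead of a virtual IP
  selector:
    app: postgres           # same label as the StatefulSet's Pod template
  ports:
    - name: postgres        # hypothetical port name
      port: 5432
```

With this in place, each replica is reachable under a name of the form `postgres-0.postgres-headless.<namespace>.svc.cluster.local`.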

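StatefulSet vs. Deployment:

Feature	StatefulSet	Deployment
Pod name	Stable ordinal: `statefulset-name-0/1/2`	Random suffix
Network identity	Stable DNS name: `pod-name.headless-svc`	Not stable across restarts
Storage	One PVC per Pod	Shared PVC or none
Scaling order	Ordered (0→1→2, reverse on scale-down)	Parallel
Update strategy	RollingUpdate/OnDelete	RollingUpdate/Recreate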

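Autoscaling

HPA: Horizontal Pod Autoscaler

The HPA adjusts the replica count of a workload based on observed metrics. When several metrics are configured, it uses the largest replica count that any of them calls for: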

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods                 # needs a custom metrics adapter (e.g. Prometheus Adapter) serving this metric
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # scale-down stabilization window
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
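
For each metric, the HPA controller derives a target replica count from the ratio of the observed value to the target value. The formula below is the one given in the Kubernetes documentation; the numbers in the worked line are illustrative and do not come from a real cluster.

```latex
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
% Illustrative case: 4 replicas averaging 90% CPU against the 70% target above
\[
\left\lceil 4 \times \frac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6
\]
```

The result is then clamped to the range between `minReplicas` and `maxReplicas` and rate-limited by the `behavior` policies.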

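VPA: Vertical Pod Autoscaler

The VPA adjusts the CPU and memory requests and limits of a workload's containers. It is an add-on installed separately from core Kubernetes. It should not manage CPU or memory for a workload that an HPA already scales on those same metrics, because the two controllers would work against each other: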

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"  # Off / Initial / Recreate / Auto (quote the value: a bare Off can parse as a boolean)
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi

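KEDA: Event-Driven Autoscaling

KEDA extends the HPA with additional metric sources, such as message queue length or cron schedules, and can scale a workload down to zero: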

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0   # scale-to-zero is supported
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: tasks
        host: amqp://rabbitmq.default.svc.cluster.local
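        mode: QueueLength    # scale on the number of messages in the queue
        value: "5"           # target of about 5 messages per replica (older KEDA releases used queueLength)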

Managing Helm Charts

Helm is the package manager for Kubernetes: it templates a set of Kubernetes manifests and versions them together as a chart:

my-chart/
├── Chart.yaml          # chart metadata
├── values.yaml         # default configuration values
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── _helpers.tpl    # template helper definitions
│   └── NOTES.txt       # post-install notes
└── charts/             # chart dependencies

# Chart.yaml
apiVersion: v2
name: my-app
description: Kubernetes storage management Helm chart
type: application
version: 1.0.0
appVersion: "2.0.0"
dependencies:
  - name: postgresql
    version: "14.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
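
The `condition` field ties the dependency to a key in the values file, so the subchart is only included when that key is true. A minimal `values.yaml` sketch follows; `replicaCount` is a hypothetical key added for illustration, and only `postgresql.enabled` is implied by the Chart.yaml above.

```yaml
# values.yaml (sketch)
replicaCount: 2            # hypothetical value that templates/deployment.yaml might read

postgresql:
  enabled: true            # matches `condition: postgresql.enabled` in Chart.yaml
```

Dependencies declared in Chart.yaml are fetched into `charts/` with `helm dependency update ./my-chart` before the chart is installed.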

# Install
helm install my-app ./my-chart -n production

# Upgrade with production overrides
helm upgrade my-app ./my-chart -n production -f values-prod.yaml

# Roll back to revision 1
helm rollback my-app 1 -n production

# Show the release history
helm history my-app -n production

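Storage and scheduling are central to running stateful applications and using cluster resources efficiently. From the PV/PVC storage abstraction to the stable identities of StatefulSets, and from scheduling constraints to autoscaling, these mechanisms together support production-grade containerized workloads.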