
Kubernetes Storage and Scheduling

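Storage

Kubernetes storage abstracts the underlying storage behind standard interfaces, so applications do not need to know how a volume is actually implemented. The core objects are PersistentVolume (PV), PersistentVolumeClaim (PVC), and StorageClass.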

graph TB
    Pod[Pod] -->|references| PVC[PVC<br/>storage request]
    PVC -->|binds to| PV[PV<br/>storage resource]
    PV -->|backend| NFS[NFS]
    PV -->|backend| EBS[AWS EBS]
    PV -->|backend| Ceph[Ceph RBD]
    PVC -.->|dynamic provisioning| SC[StorageClass]
    SC -->|creates automatically| PV

PersistentVolume (PV)

A PV is a cluster-scoped storage resource, either created manually by an administrator or provisioned dynamically:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany        # RWX: read-write by many nodes
    # - ReadWriteOnce     # RWO: read-write by a single node
    # - ReadOnlyMany      # ROX: read-only by many nodes
  persistentVolumeReclaimPolicy: Retain  # Retain or Delete (Recycle is deprecated)
  storageClassName: nfs
  nfs:
    server: 10.0.0.100
    path: /data/share

PersistentVolumeClaim (PVC)

A PVC is a user's request for storage, much as a Pod's resource requests are a request for compute:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: fast-ssd
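
A Pod consumes the claim by naming it in `volumes` and mounting it in a container. The manifest below is a minimal sketch: the Pod name, container name, image, and mount path are placeholders chosen for illustration, and only `claimName: app-data` comes from the PVC above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                      # hypothetical Pod name
spec:
  containers:
    - name: app                  # hypothetical container name
      image: nginx:1.27          # placeholder image
      volumeMounts:
        - name: data             # must match the volume name below
          mountPath: /data       # assumed mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data      # the PVC defined above, in the same namespace
```

The Pod and the PVC must live in the same namespace; the PV behind the claim is cluster-scoped and is never referenced by the Pod directly.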

PV/PVC binding flow

sequenceDiagram
    participant User as User
    participant API as API Server
    participant SC as StorageClass
    participant PV as PV
    participant PVC as PVC
    participant Pod as Pod

    User->>API: Create PVC
    API->>SC: Match StorageClass
    SC->>PV: Dynamically create PV
    PV->>PVC: Bind (Bound)
    User->>API: Create Pod
    Pod->>PVC: Mount volume
    PVC->>PV: Use storage
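
The diagram shows the dynamic path. For contrast, here is a sketch of static binding against the hand-made `nfs-pv` from earlier: a claim binds to an existing PV when the storage class name and access mode match and the PV is at least as large as the request. The claim name `shared-data` is hypothetical, and the sketch assumes no StorageClass object named `nfs` with its own provisioner exists in the cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany            # matches the access mode offered by nfs-pv
  resources:
    requests:
      storage: 50Gi            # must not exceed the PV's 50Gi capacity
  storageClassName: nfs        # same class name as nfs-pv
  # volumeName: nfs-pv         # optional: pin the claim to one specific PV
```

Because a PV binds to exactly one claim, the whole 50Gi volume is reserved for this PVC even if the request is smaller.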

StorageClass: dynamic provisioning

A StorageClass removes the need for administrators to create PVs by hand:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer  # delay binding until a Pod using the PVC is scheduled
allowVolumeExpansion: true

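volumeBindingMode has two modes:

  • Immediate: the volume is provisioned and bound as soon as the PVC is created, before any Pod has been scheduled. The volume can end up in an availability zone where the Pod cannot run.
  • WaitForFirstConsumer: provisioning and binding wait until a Pod that uses the PVC is created, so the volume is placed according to the Pod's scheduling constraints and both end up in the same availability zone.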

Scheduler

The Kubernetes scheduler assigns each Pod to a suitable node. The decision has two phases, filtering and scoring:

graph LR
    Pod[Pending Pod] --> Filter[Filter: exclude nodes that fail the requirements]
    Filter --> Score[Score: rank the candidate nodes]
    Score --> Bind[Bind: pick the highest-scoring node]
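
To make the filter phase concrete, the sketch below shows two Pod fields that feed it: `nodeSelector` and resource `requests`. Nodes without the label, or without enough unreserved CPU and memory to cover the requests, are dropped before scoring starts. The Pod name, the `disktype: ssd` label, the image, and the request sizes are all assumptions made for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo              # hypothetical Pod name
spec:
  nodeSelector:
    disktype: ssd                # assumed node label; unlabeled nodes are filtered out
  containers:
    - name: app
      image: nginx:1.27          # placeholder image
      resources:
        requests:
          cpu: 500m              # nodes with less than 0.5 CPU unreserved are filtered out
          memory: 256Mi          # likewise for memory
```

If no node survives filtering, the Pod stays Pending; `kubectl describe pod filter-demo` then lists the reason each node was rejected.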

Node affinity

Node affinity controls which nodes a Pod may be scheduled onto:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:  # hard constraint
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"]
      preferredDuringSchedulingIgnoredDuringExecution:  # soft constraint
        - weight: 80
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values: ["high-mem"]

Pod affinity and anti-affinity

These rules control how Pods are placed relative to one another:

spec:
  affinity:
    podAntiAffinity:  # anti-affinity: spread Pods apart
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname  # at most one web Pod per node
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web
            topologyKey: topology.kubernetes.io/zone  # prefer spreading across availability zones

Taints and tolerations

A taint makes a node repel Pods; a toleration lets a Pod be scheduled onto a node despite a matching taint:

# Add a taint to the node (shell)
kubectl taint nodes node-1 dedicated=gpu:NoSchedule
# Pod spec fragment (YAML) that tolerates the taint
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
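Taint effects:

| Taint effect | Description |
| --- | --- |
| NoSchedule | New Pods without a matching toleration are not scheduled onto the node |
| PreferNoSchedule | The scheduler tries to avoid the node (soft constraint) |
| NoExecute | New Pods are not scheduled, and running Pods without a matching toleration are evicted |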
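
NoExecute is the only effect that touches Pods already running, and a toleration can bound how long such a Pod stays. The fragment below is a sketch that assumes the node carries the taint `dedicated=gpu:NoExecute` (a variant of the NoSchedule example above); the 300-second grace period is an arbitrary illustrative value.

```yaml
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoExecute"
      tolerationSeconds: 300   # stay at most 300s after the taint appears, then get evicted
```

Without `tolerationSeconds`, a Pod with a matching NoExecute toleration stays on the node indefinitely.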

Stateful applications: StatefulSet

A StatefulSet gives stateful applications stable network identities, persistent per-Pod storage, and ordered deployment:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-headless  # the governing headless Service
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:  # one dedicated PVC per Pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
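
The StatefulSet refers to a Service named `postgres-headless` that the manifest does not define. A minimal sketch of that Service is shown here; the port name is an assumption, while the Service name, selector, and port number are taken from the StatefulSet above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless      # must equal spec.serviceName in the StatefulSet
spec:
  clusterIP: None              # "None" makes the Service headless: no virtual IP, per-Pod DNS records
  selector:
    app: postgres              # matches the Pod template labels
  ports:
    - name: postgres           # assumed port name
      port: 5432
```

With both objects in place, each replica gets a stable DNS name of the form `postgres-0.postgres-headless.<namespace>.svc.cluster.local` (assuming the default cluster domain), and its claim is named `data-postgres-0`, `data-postgres-1`, and so on.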

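StatefulSet vs Deployment:

| Feature | StatefulSet | Deployment |
| --- | --- | --- |
| Pod names | Stable: statefulset-name-0/1/2 | Random suffix |
| Network identity | Stable DNS: pod-name.headless-svc | Not stable |
| Storage | Dedicated PVC per Pod | Shared volume, or none (stateless) |
| Scaling order | Ordered by default (0→1→2) | Parallel |
| Update strategy | RollingUpdate/OnDelete | RollingUpdate/Recreate |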

Autoscaling

HPA: Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
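    - type: Pods  # custom per-Pod metric; needs a custom metrics API adapter (for example, Prometheus Adapter)
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # scale-down stabilization window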
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
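
How the targets above turn into a replica count follows the formula given in the Kubernetes HPA documentation. The worked numbers are an illustrative assumption (4 running replicas at 90% average CPU utilization against the 70% target), not measurements from a real cluster.

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]

% Assumed example: 4 replicas, 90% average CPU utilization, 70% target
\[
\left\lceil 4 \times \frac{90}{70} \right\rceil
  = \lceil 5.14\ldots \rceil
  = 6
\]
```

When several metrics are listed, the controller evaluates each one separately and uses the largest result, then clamps it to the range from `minReplicas` to `maxReplicas`. The `behavior` policies only limit how fast the replica count moves toward that value.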

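VPA: Vertical Pod Autoscaler

VPA automatically adjusts the CPU and memory requests of a Pod's containers, scaling limits proportionally. Avoid letting VPA and an HPA act on the same CPU or memory metric for one workload, because the two controllers would work against each other: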

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, or Auto
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi

KEDA: event-driven autoscaling

KEDA extends the metric sources available to HPA, with triggers such as message queue length and cron schedules:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0   # can scale down to zero
  maxReplicaCount: 30
  triggers:
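    - type: rabbitmq
      metadata:
        queueName: tasks
        host: amqp://rabbitmq.default.svc.cluster.local
        mode: QueueLength   # scale on the number of messages in the queue
        value: "5"          # target of 5 messages per replica (replaces the deprecated queueLength field)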
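
The cron trigger mentioned above can be sketched as a second entry in the same `triggers` list. The time zone, schedule, and replica count are illustrative assumptions; the field names are those of the KEDA cron scaler.

```yaml
# Fragment of a ScaledObject spec: an additional trigger alongside rabbitmq
triggers:
  - type: cron
    metadata:
      timezone: Asia/Shanghai      # assumed IANA time zone name
      start: "0 9 * * 1-5"         # scale up at 09:00, Monday to Friday
      end: "0 18 * * 1-5"          # release the schedule at 18:00
      desiredReplicas: "10"        # replicas to hold during the window
```

Keeping both triggers in one ScaledObject matters because a workload should be the target of only a single ScaledObject; with several triggers, the one asking for the most replicas wins.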

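Managing Helm charts

Helm is the package manager for Kubernetes: it turns a set of Kubernetes manifests into templates and puts them under version control as a chart: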

my-chart/
├── Chart.yaml          # chart metadata
├── values.yaml         # default configuration values
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── _helpers.tpl    # template helper functions
│   └── NOTES.txt       # notes printed after installation
└── charts/             # dependency charts
# Chart.yaml
apiVersion: v2
name: my-app
description: Kubernetes storage management Helm chart
type: application
version: 1.0.0
appVersion: "2.0.0"
dependencies:
  - name: postgresql
    version: "14.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
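
The `condition: postgresql.enabled` line means the dependency is toggled from the parent chart's values. A minimal `values.yaml` fragment, offered as a sketch of that switch and nothing more, could look like this:

```yaml
# values.yaml (fragment)
postgresql:
  enabled: true    # set to false to skip the bundled PostgreSQL subchart
```

Keys nested under `postgresql:` are also passed to the subchart as its own values. Running `helm dependency update ./my-chart` downloads the declared dependency into `charts/` before installation.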
# Install
helm install my-app ./my-chart -n production

# Upgrade
helm upgrade my-app ./my-chart -n production -f values-prod.yaml

# Roll back to revision 1
helm rollback my-app 1 -n production

# Show the release history
helm history my-app -n production

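Storage and scheduling are central to running stateful applications on Kubernetes and to using cluster resources efficiently. The PV/PVC storage abstraction, the stable identities of StatefulSet, scheduling constraints, and autoscaling together support production-grade containerized workloads.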
