K8s Storage & Scheduling
Storage System
The K8s storage system hides the underlying storage behind a small set of standard API objects, so applications can request storage without knowing which backend provides it. The core objects are PersistentVolume (PV), PersistentVolumeClaim (PVC), and StorageClass.
graph TB
Pod[Pod] -->|Reference| PVC[PVC<br/>Storage Request]
PVC -->|Bind| PV[PV<br/>Storage Entity]
PV -->|Backend| NFS[NFS]
PV -->|Backend| EBS[AWS EBS]
PV -->|Backend| Ceph[Ceph RBD]
PVC -.->|Dynamic Provisioning| SC[StorageClass]
SC -->|Auto Create| PV
PersistentVolume (PV)
PV is a cluster-level storage resource, created by administrators or dynamically provisioned:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany      # RWX: read-write from many nodes
    # - ReadWriteOnce    # RWO: read-write from a single node
    # - ReadOnlyMany     # ROX: read-only from many nodes
  persistentVolumeReclaimPolicy: Retain  # Retain/Delete/Recycle (Recycle is deprecated)
  storageClassName: nfs
  nfs:
    server: 10.0.0.100
    path: /data/share
PersistentVolumeClaim (PVC)
PVC is a user’s storage request, similar to a Pod’s compute resource request:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: fast-ssd
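For comparison, a claim meant to bind to the statically created nfs-pv shown earlier could look like the sketch below. The claim name shared-data is hypothetical. Binding requires that the class names match, that the requested access mode is one the PV offers, and that the request fits within the PV's capacity.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data        # hypothetical name, not from the original
spec:
  storageClassName: nfs    # must equal the PV's storageClassName
  accessModes:
    - ReadWriteMany        # must be one of the PV's access modes
  resources:
    requests:
      storage: 20Gi        # must not exceed the PV's 50Gi capacity
  volumeName: nfs-pv       # optional: pin the claim to this specific PV
```

A PV binds to exactly one claim, so the remaining 30Gi of nfs-pv cannot be handed to a second PVC.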
PV/PVC Binding Flow
sequenceDiagram
participant User as User
participant API as API Server
participant SC as StorageClass
participant PV as PV
participant PVC as PVC
participant Pod as Pod
User->>API: Create PVC
API->>SC: Match StorageClass
SC->>PV: Dynamically create PV
PV->>PVC: Bind (Bound)
User->>API: Create Pod
Pod->>PVC: Mount volume
PVC->>PV: Use storage
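The last two steps of the flow, where a Pod mounts the claim, can be sketched as follows. The Pod name, image, and mount path are placeholders chosen for illustration; only the claim name app-data comes from the PVC defined earlier.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                     # hypothetical Pod name
spec:
  containers:
    - name: app
      image: nginx:1.27         # placeholder image
      volumeMounts:
        - name: data            # refers to the volume declared below
          mountPath: /data      # hypothetical path inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data     # the PVC must live in the Pod's namespace
```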
StorageClass — Dynamic Provisioning
StorageClass eliminates the burden of administrators manually creating PVs:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer  # Delay binding until Pod is scheduled
allowVolumeExpansion: true
volumeBindingMode has two options:
- Immediate: Provision and bind the PV as soon as the PVC is created. The volume's availability zone is chosen without knowledge of the Pod, so the Pod may later be restricted to that zone or left unschedulable if no node there fits
- WaitForFirstConsumer: Delay provisioning and binding until a Pod using the PVC is scheduled, so the PV is created in the availability zone of the node chosen for the Pod
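Because the fast-ssd class sets allowVolumeExpansion: true, a bound claim can usually be grown by raising its storage request and re-applying the manifest. The sketch below reuses the app-data claim; the 40Gi figure is illustrative.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi   # raised from 20Gi; shrinking a claim is not supported
  storageClassName: fast-ssd
```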
Scheduler
The K8s scheduler assigns each pending Pod to a node. It decides in two phases, filtering and then scoring, and binds the Pod to the highest-scoring node:
graph LR
Pod[Pending Pod] --> Filter[Filter: Exclude nodes that don't meet requirements]
Filter --> Score[Score: Rate candidate nodes]
Score --> Bind[Bind: Select highest-scoring node]
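One concrete input to the filter phase is the Pod's resource requests: nodes whose remaining allocatable CPU or memory cannot fit the requests are excluded before scoring. A minimal sketch, with a hypothetical Pod name, placeholder image, and illustrative sizes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-worker            # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.27       # placeholder image
      resources:
        requests:
          cpu: "500m"         # nodes with less than 0.5 free CPU are filtered out
          memory: 512Mi       # likewise for nodes with less than 512Mi free
        limits:
          memory: 1Gi         # enforced at runtime, not used for node filtering
```

The scheduler works from requests rather than limits, so a Pod without requests can land on a node that is already heavily loaded.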
Node Affinity
Controls which nodes a Pod is scheduled to:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:  # Hard constraint
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"]
      preferredDuringSchedulingIgnoredDuringExecution:  # Soft constraint
        - weight: 80
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values: ["high-mem"]
Pod Affinity/Anti-Affinity
Controls the distribution relationship between Pods:
spec:
  affinity:
    podAntiAffinity:  # Anti-affinity: Spread deployment
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname  # At most one web Pod per node
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web
            topologyKey: topology.kubernetes.io/zone  # Prefer spreading across availability zones
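In a workload manifest, the affinity fragment sits under the Pod template's spec, not the workload's own spec. A sketch of a Deployment carrying the hard anti-affinity rule, assuming a hypothetical web Deployment and a placeholder image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                     # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web                # the label the anti-affinity selector matches
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:1.27     # placeholder image
```

Because the rule is a hard constraint, any replicas beyond the number of eligible nodes stay Pending; the preferred form avoids that at the cost of weaker spreading.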
Taints and Tolerations
A taint on a node repels Pods, while a toleration on a Pod allows (but does not require) it to be scheduled onto nodes with a matching taint:
# Add taint to node
kubectl taint nodes node-1 dedicated=gpu:NoSchedule

# Pod tolerates the taint
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
| Taint Effect | Description |
|---|---|
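| NoSchedule | Do not schedule new Pods that lack a matching toleration; running Pods are unaffected |
| PreferNoSchedule | Try to avoid scheduling such Pods here (soft constraint) |
| NoExecute | Do not schedule new Pods, and evict running Pods that lack a matching toleration |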
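For the NoExecute effect, a toleration can also set tolerationSeconds, which bounds how long an already running Pod may stay on the node after the taint appears. The sketch below assumes the node is tainted with dedicated=gpu:NoExecute, a variation on the earlier command; the 300-second value is illustrative.

```yaml
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoExecute"
      tolerationSeconds: 300   # evicted 300s after the taint is added; omit to stay indefinitely
```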
Stateful Applications: StatefulSet
StatefulSet provides stable network identity, persistent storage, and ordered deployment for stateful applications:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-headless  # Associated Headless Service
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:  # Independent PVC per Pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
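The serviceName field refers to a headless Service that has to exist for the per-Pod DNS records to be created. The source does not show it; a minimal sketch could look like this, with the port name being an assumption:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless   # must match spec.serviceName in the StatefulSet
spec:
  clusterIP: None           # "None" is what makes the Service headless
  selector:
    app: postgres           # matches the Pod template labels
  ports:
    - name: postgres        # hypothetical port name
      port: 5432
```

With this in place, each replica is reachable at a stable name such as postgres-0.postgres-headless within the namespace, and its claim is named after the template, StatefulSet, and ordinal (data-postgres-0, data-postgres-1, and so on).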
StatefulSet vs Deployment:
| Feature | StatefulSet | Deployment |
|---|---|---|
| Pod name | Fixed: statefulset-name-0/1/2 | Random suffix |
| Network identity | Fixed DNS: pod-name.headless-svc | Unstable |
| Storage | Independent PVC per Pod | Shared or stateless |
| Scaling order | Ordered (0→1→2) | Parallel |
| Update strategy | RollingUpdate/OnDelete | RollingUpdate/Recreate |
Horizontal Scaling
HPA — Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Scale-down stabilization period
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
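For each metric, the HPA controller derives a replica count from the ratio between the observed value and the target. In its basic form, ignoring the tolerance band and the handling of unready Pods, the documented rule is:

$$
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
$$

As an illustrative case (the numbers are assumptions, not measurements): with 4 replicas averaging 90% CPU utilization against the 70% target, the result is $\lceil 4 \times 90 / 70 \rceil = \lceil 5.14 \rceil = 6$. When several metrics are configured, the largest per-metric result is used, and the final value is clamped to the range set by minReplicas and maxReplicas (2 to 10 here) and rate-limited by the behavior policies.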
VPA — Vertical Pod Autoscaler
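VPA automatically adjusts the CPU and memory requests of a workload's containers, scaling limits to keep their ratio to requests. Avoid combining an updating VPA with an HPA that scales the same workload on CPU or memory utilization, since the two controllers would work against each other:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"  # Off/Initial/Recreate/Auto
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi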
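A cautious way to start is recommendation-only mode, where VPA computes suggested requests but never restarts Pods. A sketch, assuming a hypothetical object name:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa-recommend   # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"       # quoted so YAML does not parse it as a boolean
```

The recommendations are written to the object's status and can be read with kubectl describe vpa before switching to an updating mode.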
KEDA — Event-Driven Autoscaling
KEDA extends HPA with event-driven metric sources such as message queue length and cron schedules, and can scale a workload down to zero replicas:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0  # Support scaling to zero
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: tasks
        host: amqp://rabbitmq.default.svc.cluster.local
        mode: QueueLength  # Replaces the deprecated queueLength field in KEDA 2.x
        value: "5"         # Target about 5 queued messages per replica
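The cron-based scaling mentioned above uses a different trigger type on the same ScaledObject shape. A sketch, assuming a hypothetical object name and an illustrative weekday schedule:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-business-hours   # hypothetical name
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: cron
      metadata:
        timezone: Etc/UTC
        start: "0 8 * * 1-5"     # scale up at 08:00, Monday to Friday
        end: "0 18 * * 1-5"      # scale back at 18:00
        desiredReplicas: "10"    # replica count held between start and end
```

Outside the window the workload falls back to minReplicaCount unless another trigger on the same object asks for more.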
Helm Chart Management
Helm is the package manager for K8s. A chart bundles a set of templated K8s manifests into a versioned package with the following layout:
my-chart/
├── Chart.yaml # Chart metadata
├── values.yaml # Default configuration values
├── templates/
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── ingress.yaml
│ ├── _helpers.tpl # Template helper functions
│ └── NOTES.txt # Post-install notes
└── charts/ # Dependency Charts
# Chart.yaml
apiVersion: v2
name: my-app
description: Kubernetes storage management Helm chart
type: application
version: 1.0.0
appVersion: "2.0.0"
dependencies:
  - name: postgresql
    version: "14.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
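The condition field points at a key in the chart's values, so the dependency is only rendered when that key is true. A sketch of a matching values.yaml follows; every key except postgresql.enabled is hypothetical and depends on what the chart's templates actually read.

```yaml
# values.yaml (illustrative)
replicaCount: 2                    # hypothetical key
image:
  repository: my-registry/my-app   # hypothetical registry path
  tag: "2.0.0"
postgresql:
  enabled: true                    # read by `condition: postgresql.enabled`
```

Setting postgresql.enabled to false in an override file, for example when production uses an external database, skips the subchart entirely.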
# Fetch the dependencies declared in Chart.yaml into charts/
helm dependency update ./my-chart
# Install
helm install my-app ./my-chart -n production
# Upgrade
helm upgrade my-app ./my-chart -n production -f values-prod.yaml
# Rollback to revision 1
helm rollback my-app 1 -n production
# View release history
helm history my-app -n production
K8s storage and scheduling are critical for running stateful applications and optimizing resource utilization. From PV/PVC storage abstraction to StatefulSet stable identity, from scheduling constraints to auto-scaling, these mechanisms together support production-grade containerized workloads.