====== KubernetesNodeDiskPressure ======
===== Meaning =====
This alert is triggered when a Kubernetes node reports the **DiskPressure** condition for more than 2 minutes.
DiskPressure indicates that the node is running low on available disk space or inodes (on the node filesystem or the image filesystem), and the kubelet may evict pods to reclaim space.
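To see which nodes currently report the condition, a jsonpath query along these lines can be used (plain kubectl, no extra tooling assumed):
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'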
===== Impact =====
Disk pressure on a node can cause:
* Pod evictions or restarts
* Application failures due to insufficient storage
* Node instability
* Scheduling failures for new pods
This alert is **critical**, as sustained disk pressure can affect cluster stability and production workloads.
===== Diagnosis =====
Check node status:
kubectl get nodes
kubectl describe node <node-name>
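On the node itself, the kubelet log usually shows which eviction signal crossed its threshold; assuming the kubelet runs as a systemd unit:
journalctl -u kubelet --since "1 hour ago" | grep -i -E "evict|disk"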
Check disk usage:
df -h
du -sh /var/lib/kubelet/*
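Container images, writable container layers, and pod logs are the most common consumers. The paths and tools below assume a containerd-based node and may differ in your environment:
du -sh /var/lib/containerd /var/log/pods
crictl imagefsinfo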
Check pods consuming disk space:
kubectl get pvc --all-namespaces
kubectl describe pod <pod-name> -n <namespace>
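Per-pod ephemeral storage usage can be read from the kubelet stats summary API (node name is a placeholder; jq is assumed for readability):
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.pods[] | {namespace: .podRef.namespace, pod: .podRef.name, usedBytes: ."ephemeral-storage".usedBytes}'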
Check recent events:
kubectl get events --sort-by=.lastTimestamp
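Eviction-related events can be isolated with a field selector (event reasons can vary slightly between Kubernetes versions):
kubectl get events --all-namespaces --field-selector reason=Evicted
kubectl get events --all-namespaces --field-selector reason=EvictionThresholdMet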
===== Possible Causes =====
* Full disks due to logs, images, or temporary files
* Large persistent volumes filling up
* Containers writing excessive data
* Old or unused container images that are never cleaned up
* Disk size too small for workload requirements
===== Mitigation =====
- Clean up unused images and temporary files (example commands after this list)
- Rotate and compress logs
- Move non-critical data to other storage
- Increase node disk capacity if possible
- Evict non-critical pods or scale workloads to other nodes
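Example cleanup commands for the first two steps; they are runtime-specific, so pick the variant that matches the node (the crictl flag requires a version that supports --prune):
crictl rmi --prune              # remove unused images on CRI runtimes such as containerd
docker image prune -a           # remove unused images on Docker-based nodes
journalctl --vacuum-size=500M   # shrink the systemd journal if it dominates disk usage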
Drain the node if immediate relief is needed:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
After mitigation, return the node to service:
kubectl uncordon <node-name>
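Confirm that the node no longer reports the condition before closing the alert:
kubectl describe node <node-name> | grep DiskPressure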
===== Escalation =====
* Escalate if DiskPressure persists beyond 10 minutes
* Page the on-call engineer if production workloads are impacted
* Treat multiple affected nodes as a cluster-level incident
===== Related Alerts =====
* HighDiskUsage
* HighDiskIOWait
* KubernetesNodeNotReady
* PodCrashLoopBackOff
===== Related Dashboards =====
* Grafana → Kubernetes / Node Disk
* Grafana → Node Exporter Disk Overview