====== KubernetesNodeDiskPressure ======

===== Meaning =====

This alert is triggered when a Kubernetes node reports the **DiskPressure** condition for more than 2 minutes. DiskPressure indicates that the node is running low on available disk space, and the kubelet may evict pods to reclaim space.

===== Impact =====

Disk pressure on a node can cause:

  * Pod evictions or restarts
  * Application failures due to insufficient storage
  * Node instability
  * Scheduling failures for new pods

This alert is **critical**: sustained disk pressure can affect cluster stability and production workloads.

===== Diagnosis =====

Check node status:

<code bash>
kubectl get nodes
kubectl describe node <node-name>
</code>

Check disk usage on the node:

<code bash>
df -h
du -sh /var/lib/kubelet/*
</code>

Check pods and volumes consuming disk space:

<code bash>
kubectl get pvc --all-namespaces
kubectl describe pod <pod-name> -n <namespace>
</code>

Check recent events:

<code bash>
kubectl get events --sort-by=.lastTimestamp
</code>

===== Possible Causes =====

  * Disks filled by logs, images, or temporary files
  * Large persistent volumes filling up
  * Containers writing excessive data
  * Old or unused container images not cleaned up
  * Disk size too small for workload requirements

===== Mitigation =====

  - Clean up unused images and temporary files (see the cleanup sketch at the end of this runbook)
  - Rotate and compress logs
  - Move non-critical data to other storage
  - Increase node disk capacity if possible
  - Evict non-critical pods or scale workloads to other nodes

Drain the node if immediate relief is needed:

<code bash>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
</code>

After mitigation:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====

  * Escalate if DiskPressure persists beyond 10 minutes
  * Page the on-call engineer if production workloads are impacted
  * Treat multiple affected nodes as a cluster-level incident

===== Related Alerts =====

  * HighDiskUsage
  * HighDiskIOWait
  * KubernetesNodeNotReady
  * PodCrashLoopBackOff

===== Related Dashboards =====

  * Grafana → Kubernetes / Node Disk
  * Grafana → Node Exporter Disk Overview
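
===== Appendix: Example Cleanup Commands =====

The first mitigation step (cleaning up unused images and temporary files) can often be done on the node without draining it. The following is a minimal sketch, assuming the node runs containerd with ''crictl'' installed and uses journald for logging; adjust paths, the runtime CLI, and size limits for your environment.

<code bash>
# Identify the largest consumers under common Kubernetes-related paths
sudo du -xh --max-depth=2 /var/lib/containerd /var/lib/kubelet /var/log 2>/dev/null | sort -h | tail -20

# Remove container images not referenced by any container
sudo crictl rmi --prune

# Cap journald disk usage (500M is an example value, not a recommendation)
sudo journalctl --vacuum-size=500M
</code>

Re-check ''df -h'' afterwards; if usage remains above the kubelet eviction threshold, proceed with draining the node as described in the Mitigation section.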