runbooks:coustom_alerts:KubernetesNodeDiskPressure

====== KubernetesNodeDiskPressure ======

===== Meaning =====
This alert is triggered when a Kubernetes node reports the **DiskPressure** condition for more than 2 minutes.
DiskPressure indicates that the node is running low on available disk space (or inodes on the node filesystem), and the kubelet may evict pods to reclaim space.
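
To see at a glance which nodes are reporting the condition, the node conditions can be queried directly; a minimal sketch using kubectl's JSONPath output:

<code bash>
# Print each node name together with its DiskPressure condition status (True/False/Unknown)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
</code>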

===== Impact =====
Disk pressure on a node can cause:
  * Pod evictions or restarts
  * Application failures due to insufficient storage
  * Node instability
  * Scheduling failures for new pods

This alert is **critical**, as sustained disk pressure can affect cluster stability and production workloads.

===== Diagnosis =====
Check node status:

<code bash>
kubectl get nodes
kubectl describe node <NODE_NAME>
</code>
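
The full ''kubectl describe node'' output is long; the condition table can be pulled out on its own, for example:

<code bash>
# Show the Conditions block; DiskPressure should show status True while the alert is firing
# (the -A 8 context window is arbitrary, widen it if needed)
kubectl describe node <NODE_NAME> | grep -A 8 'Conditions:'
</code>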

Check disk usage:

<code bash>
df -h
du -sh /var/lib/kubelet/*
</code>
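
Image and writable container layers often dominate node disk usage. If the node runs containerd or CRI-O and has ''crictl'' installed (an assumption about the runtime), the image filesystem can be inspected directly:

<code bash>
# Disk usage of the image filesystem as reported by the container runtime
crictl imagefsinfo

# Largest consumers under the runtime state directory (path assumes containerd)
du -sh /var/lib/containerd/* 2>/dev/null | sort -h | tail -n 10
</code>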

Check pods consuming disk space:

<code bash>
kubectl get pvc --all-namespaces
kubectl describe pod <POD_NAME> -n <NAMESPACE>
</code>
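
On the node itself, the per-pod directories under the kubelet state path show which pods hold the most local data (emptyDir volumes, logs, ephemeral storage); a hedged sketch:

<code bash>
# Ten largest pod directories on this node; directory names are pod UIDs
du -sh /var/lib/kubelet/pods/* 2>/dev/null | sort -h | tail -n 10

# Map a pod UID back to its namespace and name
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.uid}{"\t"}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | grep <POD_UID>
</code>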

Check recent events:

<code bash>
kubectl get events --sort-by=.lastTimestamp
</code>
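
Disk-related events typically carry reasons such as ''NodeHasDiskPressure'', ''EvictionThresholdMet'', or ''Evicted''; a simple filter over the event stream:

<code bash>
# Recent events across all namespaces that mention disk pressure or evictions
kubectl get events --all-namespaces --sort-by=.lastTimestamp | grep -Ei 'diskpressure|evict'
</code>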

===== Possible Causes =====
  * Full disks due to logs, images, or temporary files (see the quick checks after this list)
  * Large persistent volumes filling up
  * Containers writing excessive data
  * Old or unused container images that have not been cleaned up
  * Disk size too small for workload requirements
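
A quick way to check whether logs are the main consumer (assuming shell access to the node and a systemd journal):

<code bash>
# Largest directories under /var/log
du -sh /var/log/* 2>/dev/null | sort -h | tail -n 10

# Disk space currently held by the systemd journal
journalctl --disk-usage
</code>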

===== Mitigation =====
  - Clean up unused images and temporary files (see the cleanup sketch after this list)
  - Rotate and compress logs
  - Move non-critical data to other storage
  - Increase node disk capacity if possible
  - Evict non-critical pods or scale workloads to other nodes
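
A hedged cleanup sketch for the first two steps, assuming a node with a reasonably recent ''crictl'' and a systemd journal; the kubelet's own image garbage collection normally handles unused images, so manual pruning is an emergency measure:

<code bash>
# Remove container images not referenced by any running container
crictl rmi --prune

# Shrink the systemd journal to a fixed size (the value is only an example)
journalctl --vacuum-size=500M
</code>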

Drain the node if immediate relief is needed:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
</code>
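
Draining evicts everything except DaemonSet-managed and static pods; the pods still scheduled on the node can be listed afterwards to confirm:

<code bash>
# Pods remaining on the node after the drain (typically only DaemonSet pods)
kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME>
</code>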

After mitigation, uncordon the node so it can receive workloads again:

<code bash>
kubectl uncordon <NODE_NAME>
</code>
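
Before closing the incident, confirm that the condition has actually cleared on the node:

<code bash>
# Should print "False" once the node is no longer under disk pressure
kubectl get node <NODE_NAME> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
</code>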

===== Escalation =====
  * Escalate if DiskPressure persists beyond 10 minutes
  * Page the on-call engineer if production workloads are impacted
  * Treat multiple affected nodes as a cluster-level incident

===== Related Alerts =====
  * HighDiskUsage
  * HighDiskIOWait
  * KubernetesNodeNotReady
  * PodCrashLoopBackOff

===== Related Dashboards =====
  * Grafana → Kubernetes / Node Disk
  * Grafana → Node Exporter Disk Overview