runbooks:coustom_alerts:kubernetesnodediskpressure
| runbooks:coustom_alerts:kubernetesnodediskpressure [2025/12/13 16:36] – created admin | runbooks:coustom_alerts:kubernetesnodediskpressure [2025/12/14 06:54] (current) – admin | ||
|---|---|---|---|
====== KubernetesNodeDiskPressure ======

===== Meaning =====
This alert fires when a Kubernetes node reports the **DiskPressure** condition for more than 2 minutes.
DiskPressure means the kubelet has found that free disk space or inodes on the node have dropped below its eviction thresholds, so Kubernetes may evict pods to reclaim space.

===== Impact =====
Disk pressure on a node can cause:
  * Pod evictions or restarts
  * Application failures due to insufficient storage
  * Node instability
  * Scheduling failures for new pods

This alert is **critical**,

===== Diagnosis =====
Check node status:

<code bash>
kubectl get nodes
kubectl describe node <
</code>

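To narrow the node status check to the nodes that currently report DiskPressure, a small filter can help. This is a minimal sketch, not part of the original runbook: the function name nodes_with_disk_pressure is hypothetical, and it assumes the two-column table produced by the kubectl custom-columns call shown in the usage comment.

```shell
# Hypothetical helper: print the names of nodes whose DiskPressure status is "True".
# Expects a header row followed by "<name> <status>" rows on stdin.
nodes_with_disk_pressure() {
  awk 'NR > 1 && $2 == "True" { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
# kubectl get nodes -o custom-columns='NAME:.metadata.name,DISK_PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status' | nodes_with_disk_pressure
```

Reading the table from stdin keeps the filter usable on saved output from an earlier incident as well as on a live cluster.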
Check disk usage:

<code bash>
df -h
du -sh /
</code>

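When a node has many mounts, it can be quicker to list only the filesystems that are nearly full. The sketch below is an illustration under stated assumptions: the function name disk_over_threshold is hypothetical, the 85 percent default is an arbitrary example rather than a kubelet setting, and the input is expected in the POSIX format printed by df -P.

```shell
# Hypothetical helper: list mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; the threshold (percent) is the first argument.
disk_over_threshold() {
  local threshold="${1:-85}"   # assumed example default, not a kubelet value
  awk -v t="$threshold" '
    NR > 1 {                   # skip the header row
      use = $5                 # "Capacity" column, e.g. "91%"
      sub(/%/, "", use)        # strip the percent sign
      if (use + 0 >= t) print $6, use "%"
    }'
}

# Usage on the affected node:
# df -P | disk_over_threshold 85
```

Mount points that contain spaces are not handled by this sketch, because it splits columns on whitespace.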
Check pods consuming disk space:

<code bash>
kubectl get pvc --all-namespaces
kubectl describe pod <
</code>

Check recent events:

<code bash>
kubectl get events --sort-by=.lastTimestamp
</code>

===== Possible Causes =====
  * Disks filled by logs, container images, or temporary files
  * Large persistent volumes filling up
  * Containers writing excessive data
  * Old or unused Docker images that were never cleaned up
  * Disk size too small for workload requirements

===== Mitigation =====
  - Clean up unused images and temporary files
  - Rotate and compress logs
  - Move non-critical data to other storage
  - Increase node disk capacity if possible
  - Evict non-critical pods or scale workloads to other nodes

Drain the node if immediate relief is needed:

<code bash>
kubectl drain <
</code>

After mitigation, make the node schedulable again:

<code bash>
kubectl uncordon <
</code>

===== Escalation =====
  * Escalate if DiskPressure persists beyond 10 minutes
  * Page the on-call engineer if production workloads are impacted
  * Treat multiple affected nodes as a cluster-level incident

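The escalation rules above can be sketched as a small decision helper, for example to drive a chat-ops reminder. Everything here is illustrative: the function name escalation_action and the returned labels are hypothetical, and the order in which the rules are checked (cluster-level first, then production impact, then duration) is an assumption, since the list does not rank them.

```shell
# Hypothetical helper encoding the escalation rules listed above.
# Args: $1 minutes DiskPressure has persisted
#       $2 number of affected nodes
#       $3 "yes" if production workloads are impacted, otherwise "no"
escalation_action() {
  local minutes="$1" nodes="$2" prod_impact="$3"
  if [ "$nodes" -gt 1 ]; then
    echo "cluster-incident"       # multiple nodes affected (assumed highest priority)
  elif [ "$prod_impact" = "yes" ]; then
    echo "page-on-call"           # production workloads impacted
  elif [ "$minutes" -gt 10 ]; then
    echo "escalate"               # persisted beyond 10 minutes
  else
    echo "continue-mitigation"    # none of the escalation rules apply yet
  fi
}

# Usage: escalation_action 12 1 no
```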
===== Related Alerts =====
  * HighDiskUsage
  * HighDiskIOWait
  * KubernetesNodeNotReady
  * PodCrashLoopBackOff

===== Related Dashboards =====
  * Grafana → Kubernetes / Node Disk
  * Grafana → Node Exporter Disk Overview

runbooks/coustom_alerts/kubernetesnodediskpressure.txt · Last modified: by admin
