====== KubernetesNodeDiskPressure ======

===== Meaning =====

This alert is triggered when a Kubernetes node reports the **DiskPressure** condition for more than 2 minutes. DiskPressure indicates that the node is running low on available disk space, and the kubelet may evict pods to reclaim space.

===== Impact =====

Disk pressure on a node can cause:

  * Pod evictions or restarts
  * Application failures due to insufficient storage
  * Node instability
  * Scheduling failures for new pods

This alert is **critical**: sustained disk pressure can affect cluster stability and production workloads.

===== Diagnosis =====

Check node status:

<code bash>
kubectl get nodes
kubectl describe node <node-name>
</code>

Check disk usage on the node:

<code bash>
df -h
du -sh /var/lib/kubelet/*
</code>

Check pods and volumes consuming disk space:

<code bash>
kubectl get pvc --all-namespaces
kubectl describe pod <pod-name> -n <namespace>
</code>

Check recent events:

<code bash>
kubectl get events --sort-by=.lastTimestamp
</code>

===== Possible Causes =====

  * Disks filled by logs, images, or temporary files
  * Large persistent volumes filling up
  * Containers writing excessive data
  * Old or unused container images not cleaned up
  * Disk size too small for workload requirements

===== Mitigation =====

  - Clean up unused images and temporary files (see the cleanup sketch at the end of this runbook)
  - Rotate and compress logs
  - Move non-critical data to other storage
  - Increase node disk capacity if possible
  - Evict non-critical pods or scale workloads to other nodes

Drain the node if immediate relief is needed:

<code bash>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
</code>

After mitigation:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====

  * Escalate if DiskPressure persists beyond 10 minutes
  * Page the on-call engineer if production workloads are impacted
  * Treat multiple affected nodes as a cluster-level incident

===== Related Alerts =====

  * HighDiskUsage
  * HighDiskIOWait
  * KubernetesNodeNotReady
  * PodCrashLoopBackOff

===== Related Dashboards =====

  * Grafana → Kubernetes / Node Disk
  * Grafana → Node Exporter Disk Overview
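
===== Appendix: Example Cleanup Commands =====

The first mitigation step (cleaning up unused images and temporary files) can often be done on the node without draining it. The following is a minimal sketch, assuming the node runs containerd with ''crictl'' installed and uses journald for logging; adjust paths, the runtime CLI, and size limits for your environment.

<code bash>
# Identify the largest consumers under common Kubernetes-related paths
sudo du -xh --max-depth=2 /var/lib/containerd /var/lib/kubelet /var/log 2>/dev/null | sort -h | tail -20

# Remove container images not referenced by any container
sudo crictl rmi --prune

# Cap journald disk usage (500M is an example value, not a recommendation)
sudo journalctl --vacuum-size=500M
</code>

Re-check ''df -h'' afterwards; if usage remains above the kubelet eviction threshold, proceed with draining the node as described in the Mitigation section.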