====== HighDiskUsage ======
===== Meaning =====
This alert is triggered when disk usage on a node exceeds 90% for more than 5 minutes.
Disk usage is calculated using filesystem size and free space metrics reported by node-exporter.
The alert is scoped to a specific mount point.
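The percentage behind the alert can be reproduced by hand from a size value and a free-space value. The following is a minimal sketch, not the alert rule itself: the function name ''usage_pct'' is hypothetical, and it assumes both values are whole numbers of bytes for the same mount point.

```shell
# usage_pct SIZE_BYTES FREE_BYTES
# Prints the used share of the filesystem as a whole-number percentage
# (integer division, so the result is rounded down).
# Hypothetical helper: the name and argument order are assumptions.
usage_pct() {
  size=${1:-0}
  free=${2:-0}
  if [ "$size" -le 0 ]; then
    echo "usage_pct: size must be a positive number of bytes" >&2
    return 1
  fi
  echo $(( (size - free) * 100 / size ))
}
```

For example, ''usage_pct 1000 80'' prints ''92'', which is above the 90% threshold described above.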
===== Impact =====
High disk usage can cause serious stability and availability issues.
Possible impacts include:
* Applications failing to write data or logs
* Pods crashing or entering error states
* Node instability and kubelet failures
* Potential data loss if the disk becomes full
This alert is a **warning**, but may escalate to a critical issue if disk usage continues to grow.
===== Diagnosis =====
Identify the affected node and mount point from the alert labels, then list the nodes to confirm the node's status:
kubectl get nodes
If SSH access is available, check disk usage on the node:
df -h
Check disk usage per directory to find large consumers:
du -xh / | sort -h | tail -20
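Because the alert is scoped to one mount point, it can be quicker to run the same search against that mount point only. This is a sketch under assumptions: ''top_dirs'' is a hypothetical helper name, and sizes are reported in kilobytes so that a plain numeric sort works.

```shell
# top_dirs [DIR] [COUNT]
# Lists the COUNT largest directories under DIR (default: / and 20),
# smallest first, staying on DIR's filesystem (-x). Sizes are in KiB.
# Hypothetical helper: the name and defaults are assumptions.
top_dirs() {
  dir=${1:-/}
  count=${2:-20}
  # Permission errors are discarded so they do not clutter the listing.
  du -xk "$dir" 2>/dev/null | sort -n | tail -n "$count"
}
```

On a real node this usually needs root, for example ''top_dirs /var 10'' when the alert names a mount point under ''/var''.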
Review the pods scheduled on the node and its allocated resources (Kubernetes):
kubectl describe node <node-name>
Check recent events related to disk pressure:
kubectl get events --field-selector involvedObject.kind=Node
Check if the node is under DiskPressure (the condition is not shown by ''kubectl get nodes''):
kubectl describe node <node-name> | grep DiskPressure
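For scripted checks, the DiskPressure condition can be read directly from the node status. The sketch below uses hypothetical function names and assumes ''kubectl'' is already configured for the affected cluster.

```shell
# disk_pressure_status NODE
# Prints the status of the node's DiskPressure condition
# ("True", "False" or "Unknown").
# Hypothetical helper names; kubectl access to the cluster is assumed.
disk_pressure_status() {
  kubectl get node "${1:-}" \
    -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
}

# has_disk_pressure NODE
# Succeeds (exit code 0) only when the condition is "True".
has_disk_pressure() {
  [ "$(disk_pressure_status "${1:-}")" = "True" ]
}
```

A typical use is ''has_disk_pressure <node-name> && echo "node is under DiskPressure"''.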
===== Possible Causes =====
* Log files growing uncontrollably
* Application writing excessive data to disk
* Container images and layers not cleaned up
* Old files or backups consuming disk space
* Insufficient disk capacity on the node
===== Mitigation =====
- Remove or rotate large log files
- Clean up unused container images and volumes
- Delete temporary or obsolete files
- Resize the disk if supported by the platform
- Move data to external storage if applicable
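The first two steps can be sketched as shell helpers. These are assumptions rather than a prescribed procedure: the function names are hypothetical, the node is assumed to log to systemd-journald, and the container runtime is assumed to be CRI-compatible with ''crictl'' installed. Both commands delete data, so review what they will remove first.

```shell
# Hypothetical helper names; run on the affected node, typically as root.

# trim_journal [SIZE]
# Shrinks the archived systemd journal to at most SIZE (default 500M).
trim_journal() {
  journalctl --vacuum-size="${1:-500M}"
}

# prune_images
# Removes container images that are not used by any container.
prune_images() {
  crictl rmi --prune
}
```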
If the node is under DiskPressure, consider draining it:
kubectl drain <node-name> --ignore-daemonsets
After resolving the issue, make the node schedulable again:
kubectl uncordon <node-name>
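The drain and uncordon steps both need the node name, so wrapping them in small functions guards against running them without one. This is a minimal sketch with hypothetical function names, assuming ''kubectl'' is configured for the affected cluster.

```shell
# drain_node NODE
# Cordons the node and evicts its pods, leaving DaemonSet pods in place.
# Hypothetical helper names; kubectl access to the cluster is assumed.
drain_node() {
  node=${1:-}
  [ -n "$node" ] || { echo "usage: drain_node NODE" >&2; return 2; }
  kubectl drain "$node" --ignore-daemonsets
}

# uncordon_node NODE
# Marks the node schedulable again once disk space has been recovered.
uncordon_node() {
  node=${1:-}
  [ -n "$node" ] || { echo "usage: uncordon_node NODE" >&2; return 2; }
  kubectl uncordon "$node"
}
```

Note that ''kubectl drain'' may refuse to evict pods that use emptyDir volumes unless ''--delete-emptydir-data'' is added, which discards the data in those volumes.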
===== Escalation =====
* If disk usage continues to increase or exceeds 95%, escalate immediately
* If production workloads are impacted, page the on-call engineer
* If the cause of the disk growth is unclear, escalate to the application owner or infrastructure team
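The two thresholds in this runbook (alert above 90%, immediate escalation above 95%) can be expressed as a small decision helper. This is an illustrative sketch only: ''escalation_level'' is a hypothetical name, the input is assumed to be a whole-number percentage, and it does not cover the workload-impact or unknown-cause rules above.

```shell
# escalation_level PERCENT
# Prints "escalate" above 95, "warning" above 90, otherwise "ok".
# Hypothetical helper: the name and output labels are assumptions.
escalation_level() {
  pct=${1:-}
  case $pct in
    ''|*[!0-9]*)
      echo "usage: escalation_level PERCENT" >&2
      return 2
      ;;
  esac
  if [ "$pct" -gt 95 ]; then
    echo escalate
  elif [ "$pct" -gt 90 ]; then
    echo warning
  else
    echo ok
  fi
}
```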
===== Related Alerts =====
* HighMemoryUsage
* NodeDown
* NodeNotReady
===== Related Dashboards =====
* Grafana → Node Overview
* Grafana → Disk Usage Dashboard