====== HighDiskUsage ======

===== Meaning =====

This alert fires when disk usage on a node stays above 90% for more than 5 minutes.

Disk usage is calculated from the filesystem size and free space metrics reported by node-exporter. The alert is scoped to a specific mount point.

===== Impact =====

High disk usage can cause serious stability and availability issues. Possible impacts include:

  * Applications failing to write data or logs
  * Pods crashing or entering error states
  * Node instability and kubelet failures
  * Potential data loss if the disk becomes full

This alert is a **warning**, but it may escalate to a critical issue if disk usage continues to grow.

===== Diagnosis =====

Identify the affected node and mount point from the alert, then list the nodes:

  kubectl get nodes

If SSH access is available, check disk usage on the node:

  df -h

Check disk usage per directory to find the largest consumers:

  du -xh / | sort -h | tail -20

Check pod-level disk usage (Kubernetes):

  kubectl describe node <node-name>

Check recent events related to disk pressure:

  kubectl get events --field-selector involvedObject.kind=Node

Check whether the node reports the DiskPressure condition:

  kubectl describe node <node-name> | grep DiskPressure

===== Possible Causes =====

  * Log files growing uncontrollably
  * Application writing excessive data to disk
  * Container images and layers not cleaned up
  * Old files or backups consuming disk space
  * Insufficient disk capacity on the node

===== Mitigation =====

  - Remove or rotate large log files
  - Clean up unused container images and volumes
  - Delete temporary or obsolete files
  - Resize the disk if the platform supports it
  - Move data to external storage if applicable

If the node is under DiskPressure, consider draining it:

  kubectl drain <node-name> --ignore-daemonsets

After resolving the issue, make the node schedulable again:

  kubectl uncordon <node-name>

===== Escalation =====

  * If disk usage continues to increase or exceeds 95%, escalate immediately
  * If production workloads are impacted, page the on-call engineer
  * If the cause of the disk growth is unclear, escalate to the application owner or the infrastructure team

===== Related Alerts =====

  * HighMemoryUsage
  * NodeDown
  * NodeNotReady

===== Related Dashboards =====

  * Grafana → Node Overview
  * Grafana → Disk Usage Dashboard
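
===== Appendix: Local Threshold Check =====

The threshold described under Meaning can be approximated directly on a node when the dashboards are unavailable. The sketch below is not the alert rule itself: it assumes usage is computed as ''(1 - available / size) * 100'', which is a common way to combine the size and free space figures, and it reads them from ''df -P'' rather than from node-exporter. The function name ''check_usage'' and the ''THRESHOLD'' variable are hypothetical. Mount points containing spaces are not handled.

```shell
#!/bin/sh
# Threshold in percent; 90 mirrors the value in the alert description.
THRESHOLD="${THRESHOLD:-90}"

# check_usage LIMIT
# Reads `df -P` style output on stdin and prints "<mount> <used%>" for
# every filesystem whose usage is above LIMIT.
check_usage() {
    awk -v limit="$1" 'NR > 1 {
        size = $2      # total size in 1024-byte blocks
        avail = $4     # space available to unprivileged users
        if (size > 0) {
            used_pct = (1 - avail / size) * 100
            if (used_pct > limit) printf "%s %.1f\n", $6, used_pct
        }
    }'
}

# Report local filesystems above the threshold (no output means none).
df -P | check_usage "$THRESHOLD"
```

Running it with a different limit, for example ''THRESHOLD=95 sh check_disk.sh'' (file name assumed), matches the escalation threshold instead of the warning threshold.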