====== HostOutOfDiskSpace ======
===== Meaning =====
This alert fires when any filesystem on a host node (excluding tmpfs, fuse, cifs, and nfs) has **less than 10% free space** for more than 2 minutes.
It indicates that the host is running low on disk space, which may cause system or application failures.
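The threshold described above can be approximated by hand on a suspect host. The sketch below is not the alert's actual rule, which is not shown in this runbook. ''low_free_filesystems'' is a hypothetical helper name, and it assumes POSIX-format ''df -P'' output with mount points that contain no spaces.

```shell
# Hypothetical helper: list mount points whose free space is below a threshold.
# Reads `df -P` style output on stdin so it can be checked offline.
# Assumes columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
low_free_filesystems() {
  threshold="${1:-10}"   # percent free, default 10
  awk -v t="$threshold" 'NR > 1 && $2 > 0 {
    free = $4 / $2 * 100
    if (free < t) printf "%s %.1f%%\n", $6, free
  }'
}

# Example on a live host (the -x exclusions are GNU df options and only
# approximate the filesystem types the alert ignores):
# df -P -x tmpfs -x cifs -x nfs | low_free_filesystems 10
```

An empty result means no filesystem in the input is below the threshold. Run ''df -i'' separately, because a filesystem can run out of inodes while still reporting free blocks.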
===== Impact =====
Low disk space can cause:
* Pod evictions under node disk pressure, and failed log or data writes
* Application failures
* Node instability or crashes
* Kubernetes scheduling failures for pods with persistent volume requirements
* Increased latency or I/O errors
This alert is **critical**, as disk space exhaustion can immediately impact production workloads.
===== Diagnosis =====
Check disk usage:
df -h
df -i
lsblk
Check disk usage per directory:
du -sh /var/lib/kubelet/*
du -sh /home/*
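When it is unclear which directory is responsible, ranking subdirectories by size narrows the search faster than reading raw ''du'' output. A minimal sketch follows; ''largest_dirs'' is a hypothetical helper name, and the default path is taken from the commands above.

```shell
# Hypothetical helper: print the N largest entries one level below a path,
# largest first. The first line is the total for the path itself.
largest_dirs() {
  dir="${1:-/var/lib/kubelet}"
  count="${2:-10}"
  # -x: stay on one filesystem; -k: sizes in KiB; -d 1: one level deep
  du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example:
# largest_dirs /var/lib/kubelet 10
```

The ''-x'' flag keeps ''du'' from crossing into other mounted filesystems, so the totals reflect only the filesystem being investigated.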
Check persistent volume claims and inspect pods suspected of consuming disk:
kubectl get pvc --all-namespaces
kubectl describe pod <pod-name> -n <namespace>
Check node events:
kubectl get events --sort-by=.lastTimestamp
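On a busy cluster the full event list is noisy, so it can help to narrow it to the affected node and to read the node's ''DiskPressure'' condition directly. The sketch below assumes ''kubectl'' is configured for the cluster. The function names are hypothetical, and ''worker-1'' is a placeholder node name.

```shell
# Hypothetical helper: show events whose involved object is the given node.
node_events() {
  node="$1"
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=${node}" \
    --sort-by=.lastTimestamp
}

# Hypothetical helper: print the status of the node's DiskPressure condition
# as reported by the kubelet.
node_disk_pressure() {
  node="$1"
  kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
}

# Example:
# node_events worker-1
# node_disk_pressure worker-1
```

A ''DiskPressure'' status of ''True'' means the kubelet has already crossed its eviction thresholds and may be evicting pods.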
===== Possible Causes =====
* Large log files or temporary files
* Full persistent volumes
* Backup jobs or batch jobs filling disks
* Container images not cleaned up
* Disk size too small for workload
===== Mitigation =====
- Clean up unused files, logs, or images
- Rotate and compress logs
- Move non-critical data to other storage
- Evict or reschedule non-critical pods
- Increase disk capacity if possible
Drain the node if needed, then uncordon it once space has been recovered:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>
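The log and image cleanup steps above might look like the following on a node that uses systemd-journald and a CRI runtime with ''crictl'' installed. Both are assumptions about the host, and the 500M journal cap is an illustrative value rather than a recommendation from this runbook. ''run'' and ''reclaim_host_space'' are hypothetical helper names. The sketch defaults to a dry run that only prints the commands.

```shell
# Run a command, or only print it when DRY_RUN=1 (the default here).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'would run: %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: reclaim space from journald logs and unused images.
reclaim_host_space() {
  # Shrink archived journal files until they total at most 500M (assumed cap)
  run journalctl --vacuum-size=500M
  # Remove container images not referenced by any container
  run crictl rmi --prune
}

# Example:
# reclaim_host_space               # dry run, prints the commands
# DRY_RUN=0 reclaim_host_space     # executes them (needs root on the node)
```

Reviewing the dry-run output first is worthwhile, because removed images must be pulled again the next time a pod needs them.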
===== Escalation =====
* Escalate if disk space remains below 10% for extended periods
* Page on-call engineer if production services are impacted
* Treat multiple nodes with low disk space as a cluster-level incident
===== Related Alerts =====
* HighDiskUsage
* HighDiskIOWait
* KubernetesNodeDiskPressure
* HostUnusualDiskReadRate
===== Related Dashboards =====
* Grafana → Node Disk Usage
* Grafana → Node Exporter Disk Metrics