runbooks:coustom_alerts:hostoutofdiskspace
| runbooks:coustom_alerts:hostoutofdiskspace [2025/12/13 16:38] – created admin | runbooks:coustom_alerts:hostoutofdiskspace [2025/12/14 07:03] (current) – admin | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| runbooks: | runbooks: | ||
| + | |||
| + | ====== HostOutOfDiskSpace ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert fires when any filesystem on a host node (excluding tmpfs, fuse, cifs, and nfs) has **less than 10% free space** for more than 2 minutes. | ||
| + | It indicates that the host is running low on disk space, which may cause system or application failures. | ||
| + | |||
| + | ===== Impact ===== | ||
| + | Low disk space can cause: | ||
| + | * Pod evictions due to inability to write logs or data | ||
| + | * Application failures | ||
| + | * Node instability or crashes | ||
| + | * Kubernetes scheduling failures for pods with persistent volume requirements | ||
| + | * Increased latency or I/O errors | ||
| + | |||
| + | This alert is **critical**, | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check disk usage: | ||
| + | |||
| + | <code bash> | ||
| + | df -h | ||
| + | df -i | ||
| + | lsblk | ||
| + | </ | ||
| + | |||
| + | Check disk usage per directory: | ||
| + | |||
| + | <code bash> | ||
| + | du -sh / | ||
| + | du -sh /home/* | ||
| + | </ | ||
| + | |||
| + | Check pods consuming disk: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pvc --all-namespaces | ||
| + | kubectl describe pod < | ||
| + | </ | ||
| + | |||
| + | Check node events: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events --sort-by=.lastTimestamp | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Large log files or temporary files | ||
| + | * Full persistent volumes | ||
| + | * Backup jobs or batch jobs filling disks | ||
| + | * Container images not cleaned up | ||
| + | * Disk size too small for workload | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Clean up unused files, logs, or images | ||
| + | - Rotate and compress logs | ||
| + | - Move non-critical data to other storage | ||
| + | - Evict or reschedule non-critical pods | ||
| + | - Increase disk capacity if possible | ||
| + | |||
| + | Drain the node if needed, and uncordon it once disk space has been recovered: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl drain < | ||
| + | kubectl uncordon < | ||
| + | </ | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * Escalate if disk space remains below 10% for extended periods | ||
| + | * Page on-call engineer if production services are impacted | ||
| + | * Treat multiple nodes with low disk space as a cluster-level incident | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * HighDiskUsage | ||
| + | * HighDiskIOWait | ||
| + | * KubernetesNodeDiskPressure | ||
| + | * HostUnusualDiskReadRate | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Node Disk Usage | ||
| + | * Grafana → Node Exporter Disk Metrics | ||
| + | |||
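
The condition described under Meaning can be approximated directly on a host when triaging. The following is a minimal sketch, not the alert's actual rule: it assumes GNU `df` (for the `-T` column) and `awk`, and the `MIN_FREE_PCT` variable and `filter_low_space` function are illustrative names that do not come from the runbook. It lists every filesystem with less free space than the threshold and skips the excluded filesystem types.

```bash
#!/usr/bin/env bash
# Sketch: list filesystems below a free-space threshold, skipping the
# filesystem types that the alert excludes (tmpfs, fuse, cifs, nfs).
# Assumption: GNU df with -T support; mount points contain no spaces.

# 10% matches the threshold stated in the runbook; the variable name is hypothetical.
MIN_FREE_PCT="${MIN_FREE_PCT:-10}"

# Reads `df -P -T` style output on stdin and prints "<mountpoint> <free%>"
# for each filesystem whose free space is below MIN_FREE_PCT.
filter_low_space() {
  awk -v min="$MIN_FREE_PCT" '
    NR == 1 { next }                        # skip the header row
    $2 ~ /^(tmpfs|fuse|cifs|nfs)/ { next }  # excluded filesystem types
    $3 > 0 {
      free = 100 * $5 / $3                  # available blocks / total blocks
      if (free < min) printf "%s %.1f\n", $7, free
    }
  '
}

# Columns: Filesystem Type 1024-blocks Used Available Capacity Mounted-on
df -P -T | filter_low_space
```

An empty result means no local filesystem is currently under the threshold. To look for filesystems that are getting close to it, the threshold can be raised for a single run, for example by setting `MIN_FREE_PCT=20` in the environment before invoking the script.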
runbooks/coustom_alerts/hostoutofdiskspace.txt · Last modified: by admin
