runbooks:coustom_alerts:hostoutofdiskspace

Differences

This shows you the differences between two versions of the page.

runbooks:coustom_alerts:hostoutofdiskspace [2025/12/13 16:38] – created admin
runbooks:coustom_alerts:hostoutofdiskspace [2025/12/14 07:03] (current) admin
Line 1: Line 1:
 runbooks:coustom_alerts:HostOutOfDiskSpace runbooks:coustom_alerts:HostOutOfDiskSpace
 +
 +====== HostOutOfDiskSpace ======
 +
 +===== Meaning =====
 +This alert fires when any filesystem on a host (excluding tmpfs, fuse, cifs, and nfs) has had **less than 10% free space** for more than 2 minutes.
 +The host is running low on disk space, which may cause system or application failures.
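
The condition described above can be approximated on the host itself. The following is a minimal sketch, not the alert's actual rule: the function name `low_free_filesystems` is hypothetical, the type filter is an assumption based on the exclusion list above, and it assumes mountpoints without spaces.

```shell
#!/usr/bin/env bash
# low_free_filesystems THRESHOLD   (hypothetical helper, not part of any tool)
# Reads `df -P -T` style output on stdin and prints "<mountpoint> <free%>"
# for each filesystem whose free space is below THRESHOLD percent.
low_free_filesystems() {
  local threshold="${1:-10}"
  awk -v t="$threshold" '
    NR == 1 { next }                               # skip the header row
    $2 ~ /^(tmpfs|cifs|nfs[0-9]*)$/ { next }       # excluded filesystem types
    $2 ~ /^fuse/ { next }                          # fuse, fuse.sshfs, fuseblk, ...
    $3 > 0 {
      free = $5 * 100 / $3                         # Available / total 1024-blocks
      if (free < t) printf "%s %.1f\n", $7, free
    }
  '
}

# Usage on a live host:
#   df -P -T | low_free_filesystems 10
```

An empty result means no filesystem is currently below the threshold; the alert itself additionally requires the condition to hold for 2 minutes.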
 +
 +===== Impact =====
 +Low disk space can cause:
 +  * Pod evictions once the kubelet reports disk pressure, and failed writes of logs or data
 +  * Application failures
 +  * Node instability or crashes
 +  * Kubernetes scheduling failures for pods with persistent volume requirements
 +  * Increased latency or I/O errors
 +
 +This alert is **critical**, as disk space exhaustion can immediately impact production workloads.
 +
 +===== Diagnosis =====
 +Check disk usage:
 +
 +<code bash>
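 +df -h    # free space per filesystem
 +df -i    # free inodes per filesystem (exhausted inodes also block writes)
 +lsblk    # block devices, sizes, and mountpoints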
 +</code>
 +
 +Check disk usage per directory:
 +
 +<code bash>
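 +# Largest directories first; run as root to avoid permission errors
 +du -sh /var/lib/kubelet/* 2>/dev/null | sort -rh | head -n 20
 +du -sh /home/* 2>/dev/null | sort -rh | head -n 20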
 +</code>
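
When the directory totals above do not make the culprit obvious, listing individual files can help. This is a rough sketch assuming GNU `find` (for `-printf`); the helper name `largest_files` is hypothetical.

```shell
#!/usr/bin/env bash
# largest_files DIR [COUNT]   (hypothetical helper)
# Prints the COUNT largest regular files under DIR as "<bytes> <path>",
# without crossing into other mounted filesystems (-xdev).
largest_files() {
  local dir="$1" count="${2:-10}"
  find "$dir" -xdev -type f -printf '%s %p\n' 2>/dev/null \
    | sort -rn \
    | head -n "$count"
}

# Usage on a live host (as root):
#   largest_files /var 20
```

The `-xdev` flag keeps the search on the filesystem that triggered the alert, so files on other mounts do not distort the result.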
 +
 +Check persistent volume claims and pods that may be consuming disk:
 +
 +<code bash>
 +kubectl get pvc --all-namespaces
 +kubectl describe pod <POD_NAME> -n <NAMESPACE>
 +</code>
 +
 +Check node events:
 +
 +<code bash>
 +kubectl get events --all-namespaces --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
 +kubectl describe node <NODE_NAME>
 +</code>
 +
 +===== Possible Causes =====
 +  * Large log files or temporary files
 +  * Full persistent volumes
 +  * Backup jobs or batch jobs filling disks
 +  * Container images not cleaned up
 +  * Disk size too small for workload
 +
 +===== Mitigation =====
 +  - Clean up unused files, logs, or images
 +  - Rotate and compress logs
 +  - Move non-critical data to other storage
 +  - Evict or reschedule non-critical pods
 +  - Increase disk capacity if possible
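
As one illustration of the log cleanup steps above, old log files can be compressed in place. This is a sketch under assumptions: the helper name `compress_old_logs`, the `*.log` pattern, and the 7-day default are all hypothetical choices, and logs still being written by a running process are better handled by a rotation tool such as logrotate.

```shell
#!/usr/bin/env bash
# compress_old_logs DIR [DAYS]   (hypothetical helper)
# Gzips *.log files under DIR whose last modification is more than
# DAYS full days old; newer files are left untouched.
compress_old_logs() {
  local dir="$1" days="${2:-7}"
  find "$dir" -xdev -type f -name '*.log' -mtime +"$days" -exec gzip -- {} +
}

# Usage on a live host (path is an example only):
#   compress_old_logs /var/log/myapp 7
```

Run `df -h` again afterwards to confirm that space was actually recovered.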
 +
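 +Drain the node if needed, and uncordon it once disk space has been recovered:
 +
 +<code bash>
 +# Evict pods from the node before cleanup or resizing
 +kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
 +# Return the node to service once disk space has been recovered
 +kubectl uncordon <NODE_NAME>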
 +</code>
 +
 +===== Escalation =====
 +  * Escalate if free disk space remains below 10% for an extended period despite mitigation
 +  * Page on-call engineer if production services are impacted
 +  * Treat multiple nodes with low disk space as a cluster-level incident
 +
 +===== Related Alerts =====
 +  * HighDiskUsage
 +  * HighDiskIOWait
 +  * KubernetesNodeDiskPressure
 +  * HostUnusualDiskReadRate
 +
 +===== Related Dashboards =====
 +  * Grafana → Node Disk Usage
 +  * Grafana → Node Exporter Disk Metrics
 +