--- runbooks:coustom_alerts:hostoutofdiskspace [2025/12/13 16:38] – created admin
+++ runbooks:coustom_alerts:hostoutofdiskspace [2025/12/14 07:03] (current) – admin
@@ Line 1: / Line 1: @@
 runbooks:coustom_alerts:HostOutOfDiskSpace
+====== HostOutOfDiskSpace ======
+===== Meaning =====
+This alert fires when any filesystem on a host node (excluding tmpfs, fuse, cifs, and nfs) has **less than 10% free space** for more than 2 minutes.
+It indicates that the host is running low on disk space, which may cause system or application failures.
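The threshold described above can be approximated directly on a host. The following is a minimal sketch, not the alert's actual rule: it assumes GNU `df` (for the `-T` flag) and mount points without spaces, and the function name `low_space_mounts` is hypothetical. It reads `df -PT` output on stdin and prints each mount point whose free space is below the given percentage, skipping the same filesystem types the alert excludes.

```bash
#!/usr/bin/env bash
# Sketch: list mount points with less than a given percentage of free space.
# Reads `df -PT` output on stdin so it can be exercised with canned data.
low_space_mounts() {
  local threshold="${1:-10}"   # percent free; 10 matches the alert description
  awk -v t="$threshold" '
    NR == 1 { next }                          # skip the df header row
    $2 ~ /^(fuse.*|tmpfs|cifs|nfs)/ { next }  # excluded filesystem types
    $3 > 0 {
      free = $5 * 100 / $3                    # Available / total 1K-blocks
      if (free < t) printf "%s %.1f\n", $7, free
    }
  '
}

# Usage on a live host:
#   df -PT | low_space_mounts 10
```

Each output line is a mount point followed by its free percentage. The figure is based on the `Available` column, so it may differ slightly from the monitoring system's value.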
+===== Impact =====
+Low disk space can cause:
+  * Pod evictions once the kubelet detects disk pressure, and failed writes of logs or data
+  * Application failures
+  * Node instability or crashes
+  * Kubernetes scheduling failures for pods with persistent volume requirements
+  * Increased latency or I/O errors
+This alert is **critical**, as disk space exhaustion can immediately impact production workloads.
+===== Diagnosis =====
+Check disk usage:
+<code bash>
+df -h
+df -i
+lsblk
+</code>
+Check disk usage per directory:
+<code bash>
+du -sh /var/lib/kubelet/*
+du -sh /home/*
+</code>
+Check persistent volume claims and the details of suspect pods:
+<code bash>
+kubectl get pvc --all-namespaces
+kubectl describe pod <POD_NAME> -n <NAMESPACE>
+</code>
+Check node events:
+<code bash>
+kubectl get events --all-namespaces --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
+kubectl describe node <NODE_NAME>
+</code>
+===== Possible Causes =====
+  * Large log files or temporary files
+  * Full persistent volumes
+  * Backup jobs or batch jobs filling disks
+  * Container images not cleaned up
+  * Disk size too small for workload
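To narrow down which of the causes above applies, it can help to rank subdirectories by size. The sketch below is one way to do that. The function name `largest_dirs` is hypothetical, and it assumes a `du` that supports `-d` (GNU and BSD both do).

```bash
#!/usr/bin/env bash
# Sketch: show the largest directories at or directly under a path, largest first.
largest_dirs() {
  local dir="$1" count="${2:-10}"
  # -x: stay on one filesystem; -k: sizes in KiB so a numeric sort works;
  # -d 1: report only the directory itself and its immediate subdirectories.
  du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage on a live host (root is usually needed to read these paths):
#   largest_dirs /var/lib/kubelet 10
#   largest_dirs /var/log 10
```

The first line of output is the total for the directory itself; the lines after it are the subdirectories worth inspecting first.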
+===== Mitigation =====
+  - Clean up unused files, logs, or images
+  - Rotate and compress logs
+  - Move non-critical data to other storage
+  - Evict or reschedule non-critical pods
+  - Increase disk capacity if possible
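The cleanup steps above might look like the following on a typical node. This is a hedged sketch rather than a prescribed procedure: it assumes a systemd journal, a CRI runtime managed with `crictl`, and a logrotate configuration at `/etc/logrotate.conf`. The 500M journal size and the helper names `run` and `cleanup_node_disk` are illustrative assumptions. Commands are only printed unless `APPLY=1` is set, so the plan can be reviewed first.

```bash
#!/usr/bin/env bash
# Sketch: common node cleanup steps, printed as a dry run unless APPLY=1.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "DRY-RUN: $*"
  fi
}

cleanup_node_disk() {
  # Shrink the systemd journal; 500M is an assumed target, not from the runbook.
  run journalctl --vacuum-size=500M
  # Remove container images not referenced by any container (CRI runtimes).
  run crictl rmi --prune
  # Force an immediate rotation using the host's logrotate config (assumed path).
  run logrotate --force /etc/logrotate.conf
}

# Usage:
#   cleanup_node_disk            # print the plan
#   APPLY=1 cleanup_node_disk    # execute it (requires root)
```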
+Drain the node if needed, then uncordon it once disk space has been recovered:
+<code bash>
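+# Evict pods and mark the node unschedulable before cleaning up
+kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
+# After disk space has been recovered, allow scheduling again
+kubectl uncordon <NODE_NAME>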
+</code>
+===== Escalation =====
+  * Escalate if disk space remains below 10% for extended periods
+  * Page on-call engineer if production services are impacted
+  * Treat multiple nodes with low disk space as a cluster-level incident
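To judge whether more than one node is affected, the kubelet's `DiskPressure` condition can serve as a rough proxy. It is driven by the kubelet's own eviction thresholds, so it will not match this alert's 10% rule exactly. The sketch below assumes `kubectl` access to the cluster; the function name `nodes_under_disk_pressure` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: read "<node>\t<DiskPressure status>" lines on stdin and print
# the nodes whose DiskPressure condition is True.
nodes_under_disk_pressure() {
  awk -F '\t' '$2 == "True" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
#     | nodes_under_disk_pressure
```

If this prints more than one node, the situation likely qualifies as the cluster-level incident described above.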
+===== Related Alerts =====
+  * HighDiskUsage
+  * HighDiskIOWait
+  * KubernetesNodeDiskPressure
+  * HostUnusualDiskReadRate
+===== Related Dashboards =====
+  * Grafana → Node Disk Usage
+  * Grafana → Node Exporter Disk Metrics