runbooks:coustom_alerts:HighDiskUsage

====== HighDiskUsage ======

===== Meaning =====
This alert is triggered when disk usage on a node exceeds 90% for more than 5 minutes.
Disk usage is calculated from the filesystem size and free space metrics reported by node-exporter.

Each firing alert is scoped to a specific node and mount point.

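To see the same numbers the alert is based on, current usage per node and mount point can be queried from Prometheus. This is a sketch: the Prometheus URL and the fstype filter are assumptions, and the exact alert rule in your configuration may differ.

<code bash>
# Query current filesystem usage (%) per node and mount point from the
# Prometheus HTTP API. PROMETHEUS_URL and the fstype filter are assumptions.
PROMETHEUS_URL="http://prometheus.example.internal:9090"
curl -sG "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})' \
  | jq -r '.data.result[] | [.metric.instance, .metric.mountpoint, .value[1]] | @tsv'
</code>
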
===== Impact =====
High disk usage can cause serious stability and availability issues.

Possible impacts include:
  * Applications failing to write data or logs
  * Pods crashing or entering error states
  * Node instability and kubelet failures
  * Potential data loss if the disk becomes full

This alert is a **warning**, but it may escalate to a critical issue if disk usage continues to grow.

===== Diagnosis =====
The affected node and mount point are reported in the alert labels. Confirm the node is present in the cluster:

<code bash>
kubectl get nodes
</code>

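If the alert only carries an ''instance'' label (typically the node-exporter address), the wide output can help map that address to a node name:

<code bash>
# The INTERNAL-IP column maps the alert's instance address to a node name.
kubectl get nodes -o wide
</code>
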
If SSH access is available, check disk usage on the node:

<code bash>
df -h
</code>

Check disk usage per directory to find large consumers:

<code bash>
du -xh / | sort -h | tail -20
</code>

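On Kubernetes nodes the usual large consumers are system logs and container storage. The paths below are assumptions for a systemd-based node running containerd; adjust them to your runtime:

<code bash>
# Journal size plus a summary of common log/container/kubelet directories
# (paths are assumptions; adjust to your distribution and runtime).
journalctl --disk-usage
sudo du -xsh /var/log /var/lib/containerd /var/lib/kubelet 2>/dev/null
</code>
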
Check the node's resource allocation, conditions, and recent events at the Kubernetes level:

<code bash>
kubectl describe node <NODE_NAME>
</code>

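For actual per-pod ephemeral-storage consumption, the kubelet stats summary endpoint can be queried through the API server. A sketch, assuming ''jq'' is installed and the standard summary schema:

<code bash>
# Per-pod ephemeral-storage usage reported by the kubelet stats summary API.
kubectl get --raw "/api/v1/nodes/<NODE_NAME>/proxy/stats/summary" \
  | jq '.pods[] | {namespace: .podRef.namespace, pod: .podRef.name, usedBytes: .["ephemeral-storage"].usedBytes}'
</code>
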
Check recent events related to disk pressure:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

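To narrow this down to the affected node and see the most recent events last (a variation on the command above):

<code bash>
# Events for one node, sorted so the newest entries appear at the bottom.
kubectl get events \
  --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME> \
  --sort-by=.lastTimestamp
</code>
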
Check whether the node is reporting the DiskPressure condition:

<code bash>
kubectl get node <NODE_NAME> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
</code>

===== Possible Causes =====
  * Log files growing uncontrollably
  * Applications writing excessive data to disk
  * Container images and layers not being cleaned up
  * Old files or backups consuming disk space
  * Insufficient disk capacity on the node

===== Mitigation =====
  - Remove or rotate large log files (see the cleanup sketch after this list)
  - Clean up unused container images and volumes
  - Delete temporary or obsolete files
  - Resize the disk if the platform supports it
  - Move data to external storage if applicable

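A minimal cleanup sketch for the first two items, assuming a systemd node with containerd and ''crictl'' available; the size limit is an example, not a policy:

<code bash>
# Shrink the systemd journal to a fixed size (example limit).
sudo journalctl --vacuum-size=500M

# Remove container images not referenced by any container (containerd/crictl assumed).
sudo crictl rmi --prune
</code>
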
If the node is under DiskPressure, consider draining it:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets
</code>

Once disk usage has been reduced, return the node to service:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If disk usage continues to increase or exceeds 95%, escalate immediately
  * If production workloads are impacted, page the on-call engineer
  * If the cause of the disk growth is unclear, escalate to the application owner or the infrastructure team

===== Related Alerts =====
  * HighMemoryUsage
  * NodeDown
  * NodeNotReady

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → Disk Usage Dashboard