| + | ====== HighDiskIOWait ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert is triggered when the CPU spends an unusually high amount of time waiting for disk I/O operations to complete. | ||
| + | High I/O wait typically indicates disk performance bottlenecks. | ||
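
The exact trigger expression depends on the monitoring stack. As a rough sketch, assuming Prometheus with node_exporter (where I/O wait is exposed as the ''iowait'' mode of ''node_cpu_seconds_total'') and a placeholder server URL, the current per-node value can be queried directly:

<code bash>
# Hypothetical Prometheus endpoint; adjust to your environment.
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100'
</code>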
| + | |||
| + | ===== Impact ===== | ||
| + | Sustained high disk I/O wait can significantly degrade system and application performance. | ||
| + | |||
| + | Possible impacts include: | ||
| + | * Increased application latency | ||
| + | * Slow database queries and file operations | ||
| + | * Pod startup delays | ||
| + | * Reduced overall node throughput | ||
| + | |||
This alert is a **warning**, not a critical alert, but the underlying cause should be investigated before performance degrades further.
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check I/O wait and overall CPU usage: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl top nodes | ||
| + | </ | ||
| + | |||
| + | If SSH access is available, inspect disk I/O metrics directly: | ||
| + | |||
| + | <code bash> | ||
| + | iostat -xz 1 | ||
| + | vmstat 1 | ||
| + | </ | ||
| + | |||
| + | Identify processes causing high disk I/O: | ||
| + | |||
| + | <code bash> | ||
| + | iotop | ||
| + | </ | ||
| + | |||
| + | Check disk usage and pressure conditions: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl describe node < | ||
| + | </ | ||
| + | |||
| + | Verify if disk-related alerts are firing: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events --field-selector involvedObject.kind=Node | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Disk saturation due to heavy read/write operations | ||
| + | * Slow or degraded storage (network-attached or cloud disks) | ||
| + | * Log flooding or excessive file writes | ||
| + | * Database or batch jobs performing intensive I/O | ||
| + | * Disk nearing full capacity | ||
| + | |||
===== Mitigation =====
  - Identify and throttle or stop I/O-heavy workloads (see the examples below)
  - Move high I/O workloads to faster storage
  - Enable or tune log rotation
  - Scale out workloads to reduce per-node I/O pressure
  - Increase disk performance (IOPS / throughput) if supported

If the node is severely impacted, drain it temporarily:

<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>

After mitigation:

<code bash>
kubectl uncordon <node-name>
</code>
| + | |||
| + | ===== Escalation ===== | ||
| + | * If high I/O wait persists beyond 10 minutes, escalate to the platform team | ||
| + | * If multiple nodes are affected, treat as a storage-level incident | ||
| + | * If production services are impacted, page the on-call engineer | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * HighDiskUsage | ||
| + | * NodeNotReady | ||
| + | * HighCPUUsage | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Node Exporter / Disk I/O | ||
| + | * Grafana → Storage Performance Overview | ||
| + | |||
