====== HighDiskIOWait ======

===== Meaning =====
This alert is triggered when the CPU spends an unusually high amount of time waiting for disk I/O operations to complete.
High I/O wait typically indicates disk performance bottlenecks.
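
A quick way to confirm the current I/O wait share on the node itself is the %iowait column reported by sysstat (a minimal check, assuming the sysstat package is installed on the node):

<code bash>
# Report CPU utilisation once per second for three samples;
# a persistently high %iowait confirms the condition this alert describes
sar -u 1 3
</code>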

===== Impact =====
Sustained high disk I/O wait can significantly degrade system and application performance.

Possible impacts include:
  * Increased application latency
  * Slow database queries and file operations
  * Pod startup delays
  * Reduced overall node throughput

This alert is a **warning**, but may escalate if the condition persists.

===== Diagnosis =====
Check overall CPU and memory usage across the nodes (kubectl top does not report I/O wait directly, but it helps identify which node is under load):

<code bash>
kubectl top nodes
</code>
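
On recent kubectl versions the output can be sorted to surface the busiest nodes first (flag availability varies by version):

<code bash>
kubectl top nodes --sort-by=cpu
</code>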

If SSH access is available, inspect disk I/O metrics directly:

<code bash>
iostat -xz 1
vmstat 1
</code>
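
If SSH is not available, a node debug pod is one way to get a shell on the node (a sketch, assuming the cluster allows node debug pods and the image is reachable):

<code bash>
# Start an interactive debug pod on the node; the host filesystem is mounted at /host
kubectl debug node/<NODE_NAME> -it --image=busybox
# Inside the debug shell, enter the host filesystem if needed:
# chroot /host
</code>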

Identify processes causing high disk I/O:

<code bash>
iotop
</code>
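
If iotop is not installed, pidstat from the sysstat package gives a similar per-process view (availability depends on the distribution):

<code bash>
# Per-process disk read/write rates, one-second interval, five samples
pidstat -d 1 5
</code>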

Check disk usage and pressure conditions:

<code bash>
kubectl describe node <NODE_NAME>
</code>
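
To check the DiskPressure condition on its own, a jsonpath query keeps the output short (note that DiskPressure reflects kubelet disk-capacity thresholds rather than I/O wait itself):

<code bash>
# Prints "True" if the kubelet is reporting disk pressure on the node
kubectl get node <NODE_NAME> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
</code>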

Check whether disk-related events are being reported for nodes:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>
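
The same query can be narrowed to the affected node:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME>
</code>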

===== Possible Causes =====
  * Disk saturation due to heavy read/write operations
  * Slow or degraded storage (network-attached or cloud disks)
  * Log flooding or excessive file writes (see the sketch after this list)
  * Database or batch jobs performing intensive I/O
  * Disk nearing full capacity
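
For the log-flooding and disk-capacity causes, a quick scan on the node often points at the culprit (a sketch; adjust the path if application logs live elsewhere):

<code bash>
# Largest consumers under /var/log, biggest first
du -xh /var/log 2>/dev/null | sort -rh | head -20
# Overall filesystem usage
df -h
</code>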

===== Mitigation =====
  - Identify and throttle or stop I/O-heavy workloads
  - Move high I/O workloads to faster storage
  - Enable or tune log rotation (see the sketch after this list)
  - Scale out workloads to reduce per-node I/O pressure
  - Increase disk performance (IOPS / throughput) if supported
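
For the log-rotation step, a dedicated logrotate rule for the offending application is usually enough (illustrative only; the path, size limit, and retention below are assumptions):

<code bash>
# Hypothetical logrotate rule; adjust the log path and limits to the actual application
cat <<'EOF' | sudo tee /etc/logrotate.d/example-app
/var/log/example-app/*.log {
    daily
    rotate 7
    size 100M
    compress
    missingok
    notifempty
}
EOF
</code>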

If the node is severely impacted, drain it temporarily:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets
</code>
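
Draining can fail if pods on the node use emptyDir volumes; the additional flag below allows their eviction at the cost of losing that local data, so use it with care:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
</code>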

After mitigation, uncordon the node so it can schedule workloads again:

<code bash>
kubectl uncordon <NODE_NAME>
</code>
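
A quick check that the node is schedulable and healthy again:

<code bash>
# STATUS should show Ready without SchedulingDisabled
kubectl get node <NODE_NAME>
kubectl top node <NODE_NAME>
</code>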

===== Escalation =====
  * If high I/O wait persists beyond 10 minutes, escalate to the platform team
  * If multiple nodes are affected, treat it as a storage-level incident
  * If production services are impacted, page the on-call engineer

===== Related Alerts =====
  * HighDiskUsage
  * NodeNotReady
  * HighCPUUsage

===== Related Dashboards =====
  * Grafana → Node Exporter / Disk I/O
  * Grafana → Storage Performance Overview