| + | ====== HighDiskIOWait ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert is triggered when the CPU spends an unusually high amount of time waiting for disk I/O operations to complete. | ||
| + | High I/O wait typically indicates disk performance bottlenecks. | ||
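
The exact trigger expression depends on the monitoring stack. As a rough sketch, assuming Prometheus with node_exporter (where I/O wait is exposed as the ''iowait'' mode of ''node_cpu_seconds_total'') and a placeholder server URL, the current per-node value can be queried directly:

<code bash>
# Hypothetical Prometheus endpoint; adjust to your environment.
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100'
</code>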
| + | |||
| + | ===== Impact ===== | ||
| + | Sustained high disk I/O wait can significantly degrade system and application performance. | ||
| + | |||
| + | Possible impacts include: | ||
| + | * Increased application latency | ||
| + | * Slow database queries and file operations | ||
| + | * Pod startup delays | ||
| + | * Reduced overall node throughput | ||
| + | |||
This alert is a **warning**, not a critical alert, but the underlying cause should be investigated before performance degrades further.
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check I/O wait and overall CPU usage: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl top nodes | ||
| + | </ | ||
| + | |||
| + | If SSH access is available, inspect disk I/O metrics directly: | ||
| + | |||
| + | <code bash> | ||
| + | iostat -xz 1 | ||
| + | vmstat 1 | ||
| + | </ | ||
| + | |||
| + | Identify processes causing high disk I/O: | ||
| + | |||
| + | <code bash> | ||
| + | iotop | ||
| + | </ | ||
| + | |||
| + | Check disk usage and pressure conditions: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl describe node < | ||
| + | </ | ||
| + | |||
| + | Verify if disk-related alerts are firing: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events --field-selector involvedObject.kind=Node | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Disk saturation due to heavy read/write operations | ||
| + | * Slow or degraded storage (network-attached or cloud disks) | ||
| + | * Log flooding or excessive file writes | ||
| + | * Database or batch jobs performing intensive I/O | ||
| + | * Disk nearing full capacity | ||
| + | |||
===== Mitigation =====
  - Identify and throttle or stop I/O-heavy workloads (see the examples below)
  - Move high I/O workloads to faster storage
  - Enable or tune log rotation
  - Scale out workloads to reduce per-node I/O pressure
  - Increase disk performance (IOPS / throughput) if supported

If the node is severely impacted, drain it temporarily:

<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>

After mitigation:

<code bash>
kubectl uncordon <node-name>
</code>
| + | |||
| + | ===== Escalation ===== | ||
| + | * If high I/O wait persists beyond 10 minutes, escalate to the platform team | ||
| + | * If multiple nodes are affected, treat as a storage-level incident | ||
| + | * If production services are impacted, page the on-call engineer | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * HighDiskUsage | ||
| + | * NodeNotReady | ||
| + | * HighCPUUsage | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Node Exporter / Disk I/O | ||
| + | * Grafana → Storage Performance Overview | ||
| + | |||
