runbooks:coustom_alerts:hostunusualdiskreadrate
Differences
This shows you the differences between two versions of the page.
| runbooks:coustom_alerts:hostunusualdiskreadrate [2025/12/13 16:38] – created admin | runbooks:coustom_alerts:hostunusualdiskreadrate [2025/12/14 07:01] (current) – admin | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| runbooks: | runbooks: | ||
| + | |||
| + | ====== HostUnusualDiskReadRate ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert is triggered when a host node experiences **high disk read activity**, with IO wait greater than 80% over a 5-minute window. | ||
| + | It indicates that the disk may be a bottleneck or under heavy load. | ||
| + | |||
| + | ===== Impact ===== | ||
| + | High disk read rates can lead to: | ||
| + | * Application slowdowns or latency | ||
| + | * Increased pod response times | ||
| + | * Potential cascading failures if services rely on disk-intensive operations | ||
| + | * Node-level resource contention | ||
| + | |||
| + | This alert is **warning**, | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check disk IO statistics: | ||
| + | |||
| + | <code bash> | ||
| + | iostat -x 1 5 | ||
| + | iotop -o | ||
| + | </code> | ||
| + | |||
| + | Check system-wide IO wait: | ||
| + | |||
| + | <code bash> | ||
| + | top | ||
| + | vmstat 1 5 | ||
| + | </code> | ||
| + | |||
| + | Check disk usage and filesystem health: | ||
| + | |||
| + | <code bash> | ||
| + | df -h | ||
| + | lsblk | ||
| + | smartctl -a /dev/sdX | ||
| + | </code> | ||
| + | |||
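| + | List pods scheduled on the node with their CPU and memory usage (''kubectl top'' does not report disk IO, so use this to spot busy pods and then confirm with ''iotop''): | ||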
| + | |||
| + | <code bash> | ||
| + | kubectl top pod --all-namespaces --field-selector spec.nodeName={{ $labels.instance }} | ||
| + | </code> | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Disk-intensive workloads or batch jobs | ||
| + | * Logging or database writes causing high IO | ||
| + | * Slow or failing disks | ||
| + | * Misconfigured storage (e.g., small volumes) | ||
| + | * Backup jobs or heavy monitoring metrics writes | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Identify and reduce disk-intensive workloads | ||
| + | - Move high IO workloads to other nodes or storage | ||
| + | - Monitor disk health and replace failing disks | ||
| + | - Tune filesystem or storage configuration if needed | ||
| + | - Scale out storage for critical workloads | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * Escalate if high IO persists for extended periods | ||
| + | * Page on-call engineer if production services are impacted | ||
| + | * Investigate related alerts (DiskPressure, | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * HighDiskUsage | ||
| + | * HighDiskIOWait | ||
| + | * KubernetesNodeDiskPressure | ||
| + | * HostUnusualDiskWriteRate | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Node Disk IO | ||
| + | * Grafana → Node Exporter Disk Metrics | ||
| + | |||
runbooks/coustom_alerts/hostunusualdiskreadrate.txt · Last modified: by admin
