--- runbooks:coustom_alerts:hostunusualdiskreadrate [2025/12/13 16:38] – created admin
+++ runbooks:coustom_alerts:hostunusualdiskreadrate [2025/12/14 07:01] (current) – admin
@@ Line 1: / Line 1: @@
 runbooks:coustom_alerts:HostUnusualDiskReadRate
+====== HostUnusualDiskReadRate ======
+===== Meaning =====
+This alert is triggered when a host node experiences **high disk read activity**, with IO wait greater than 80% over a 5-minute window.
+It indicates that the disk may be a bottleneck or under heavy load.
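+To check the IO wait figure by hand, a minimal sketch follows. It assumes a Linux host with ''/proc/stat''; the helper name ''iowait_pct'' is hypothetical, and the actual alert expression is not shown in this runbook, so the result only approximates what the alert evaluates.

```bash
# Hypothetical helper: IO wait percentage between two "cpu ..." lines of /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
iowait_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }      # remember the first sample
    NR == 2 {
      total = 0
      for (i = 2; i <= 9; i++) total += $i - a[i]       # jiffies elapsed across all states
      iow = $6 - a[6]                                   # jiffies spent waiting on IO
      if (total <= 0) { print 0; exit }
      printf "%d\n", 100 * iow / total
    }'
}

# Example use on the host (5-second sample):
#   first=$(head -n 1 /proc/stat); sleep 5; second=$(head -n 1 /proc/stat)
#   echo "iowait: $(iowait_pct "$first" "$second")%"
```

+A value that stays near or above 80 across repeated samples matches the condition described above.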
+===== Impact =====
+High disk read rates can lead to:
+  * Application slowdowns or increased latency
+  * Increased pod response times
+  * Potential cascading failures if services rely on disk-intensive operations
+  * Node-level resource contention
+This alert has **warning** severity, because prolonged high IO can degrade performance or trigger other alerts.
+===== Diagnosis =====
+Check disk IO statistics:
+<code bash>
+iostat -x 1 5
+iotop -o
+</code>
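+If ''iostat'' is not installed, per-device read throughput can be derived from two snapshots of ''/proc/diskstats''. This is a sketch under stated assumptions: the function name ''disk_read_rates'' and the snapshot file paths are hypothetical, and it relies on the kernel reporting sectors read (sixth field) in 512-byte units.

```bash
# Hypothetical helper: per-device read throughput in MiB/s.
# $1: first /proc/diskstats snapshot, $2: second snapshot, $3: seconds between them
disk_read_rates() {
  awk -v secs="$3" '
    NR == FNR { before[$3] = $6; next }                 # first file: sectors read per device
    ($3 in before) {
      mib = ($6 - before[$3]) * 512 / 1048576 / secs    # 512-byte sectors to MiB per second
      printf "%s %.2f\n", $3, mib
    }' "$1" "$2"
}

# Example use on the host (5-second interval, busiest devices first):
#   cat /proc/diskstats > /tmp/ds1; sleep 5; cat /proc/diskstats > /tmp/ds2
#   disk_read_rates /tmp/ds1 /tmp/ds2 5 | sort -k2 -nr | head
```

+Devices at the top of the sorted list are the first candidates to compare against the ''iotop'' process list.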
+Check system-wide IO wait:
+<code bash>
+top
+vmstat 1 5
+</code>
+Check disk usage and filesystem health:
+<code bash>
+df -h
+lsblk
+smartctl -a /dev/sdX
+</code>
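+List the pods scheduled on the affected node, then match them against the ''iotop'' output (''kubectl top'' reports CPU and memory only, not disk IO):
+<code bash>
+# <node-name> is the Kubernetes node name; the alert's instance label may carry a :port suffix
+kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>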
+</code>
+===== Possible Causes =====
+  * Disk-intensive workloads or batch jobs
+  * Logging or database writes causing high IO
+  * Slow or failing disks
+  * Misconfigured storage (e.g., small volumes)
+  * Backup jobs or heavy writes of monitoring metrics
+===== Mitigation =====
+  - Identify and reduce disk-intensive workloads
+  - Move high IO workloads to other nodes or storage
+  - Monitor disk health and replace failing disks
+  - Tune filesystem or storage configuration if needed
+  - Scale out storage for critical workloads
+===== Escalation =====
+  * Escalate if high IO persists for extended periods
+  * Page on-call engineer if production services are impacted
+  * Investigate related alerts (KubernetesNodeDiskPressure, HighDiskUsage)
+===== Related Alerts =====
+  * HighDiskUsage
+  * HighDiskIOWait
+  * KubernetesNodeDiskPressure
+  * HostUnusualDiskWriteRate
+===== Related Dashboards =====
+  * Grafana → Node Disk IO
+  * Grafana → Node Exporter Disk Metrics