runbooks:coustom_alerts:hostunusualdiskreadrate

Differences

This shows you the differences between two versions of the page.

runbooks:coustom_alerts:hostunusualdiskreadrate [2025/12/13 16:38] – created admin
runbooks:coustom_alerts:hostunusualdiskreadrate [2025/12/14 07:01] (current) admin
Line 1: Line 1:
 runbooks:coustom_alerts:HostUnusualDiskReadRate runbooks:coustom_alerts:HostUnusualDiskReadRate
 +
 +====== HostUnusualDiskReadRate ======
 +
 +===== Meaning =====
 +This alert is triggered when a host node experiences **high disk read activity**, with IO wait greater than 80% over a 5-minute window.
 +It indicates that the disk may be a bottleneck or under heavy load.
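
To make the 80% threshold concrete, the following is a minimal sketch that assumes "IO wait" here means the share of wall-clock time a device spent servicing IO, as exposed by the "time spent doing I/Os" counter in /proc/diskstats. The helper names `io_ms` and `busy_pct` and the device name `sda` are illustrative assumptions, not part of the alert definition.

```shell
#!/bin/sh
# Hypothetical helper: read the cumulative "time spent doing I/Os" counter
# (milliseconds, column 13 of /proc/diskstats) for one device.
io_ms() {
  awk -v dev="$1" '$3 == dev { print $13 }' /proc/diskstats
}

# Hypothetical helper: percentage of the interval the device was busy,
# given two counter readings (ms) and the seconds elapsed between them.
busy_pct() {
  start_ms=$1; end_ms=$2; interval_s=$3
  echo $(( (end_ms - start_ms) * 100 / (interval_s * 1000) ))
}

# Usage (assumed device "sda", 5-second window):
#   a=$(io_ms sda); sleep 5; b=$(io_ms sda); busy_pct "$a" "$b" 5
```

A result above 80 sustained across a 5-minute window would correspond to the condition described above, under the stated assumption.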
 +
 +===== Impact =====
 +High disk read rates can lead to:
 +  * Application slowdowns or latency
 +  * Increased pod response times
 +  * Potential cascading failures if services rely on disk-intensive operations
 +  * Node-level resource contention
 +
 +This alert has **warning** severity, because prolonged high IO can degrade performance or trigger other alerts.
 +
 +===== Diagnosis =====
 +Check disk IO statistics:
 +
 +<code bash>
 +iostat -x 1 5
 +iotop -o
 +</code>
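
If `iostat` or `iotop` is not installed on the host, read throughput can be approximated directly from /proc/diskstats. This is a rough sketch, not a replacement for the tools above; the helper names `sectors_read` and `read_mib_per_s` and the device name `sda` are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: cumulative sectors read for a device
# (column 6 of /proc/diskstats; these sectors are always 512 bytes).
sectors_read() {
  awk -v dev="$1" '$3 == dev { print $6 }' /proc/diskstats
}

# Hypothetical helper: average read throughput in MiB/s between two
# sector-counter readings taken "seconds" apart.
read_mib_per_s() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.1f\n", (b - a) * 512 / s / 1048576 }'
}

# Usage (assumed device "sda", 5-second window):
#   a=$(sectors_read sda); sleep 5; b=$(sectors_read sda)
#   read_mib_per_s "$a" "$b" 5
```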
 +
 +Check system-wide IO wait:
 +
 +<code bash>
 +top
 +vmstat 1 5
 +</code>
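
The `wa` column of `vmstat` is the IO wait percentage. As a convenience, the sketch below averages it across the samples, skipping the first data row because `vmstat` reports since-boot averages there. The function name `avg_iowait` is hypothetical, and the column is located by its header name rather than a fixed position, since layouts can differ between versions.

```shell
#!/bin/sh
# Hypothetical helper: average the "wa" (IO wait) column of vmstat output
# read from standard input.
avg_iowait() {
  awk '
    # Until the "wa" header has been found, scan each row for it.
    col == 0 {
      for (i = 1; i <= NF; i++) if ($i == "wa") col = i
      next
    }
    # The first data row holds averages since boot, so skip it.
    !skipped { skipped = 1; next }
    { sum += $col; n++ }
    END { if (n) printf "%.1f\n", sum / n }
  '
}

# Usage:
#   vmstat 1 5 | avg_iowait
```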
 +
 +Check disk usage and filesystem health:
 +
 +<code bash>
 +df -h
 +lsblk
 +smartctl -a /dev/sdX
 +</code>
 +
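 +List the pods scheduled on the node (kubectl top reports only CPU and memory, so use iotop above to attribute disk IO to processes):
 +
 +<code bash>
 +# Replace <node-name> with the Kubernetes node name of the alerting instance.
 +kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
 +</code>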
 +
 +===== Possible Causes =====
 +  * Disk-intensive workloads or batch jobs
 +  * Logging or database writes causing high IO
 +  * Slow or failing disks
 +  * Misconfigured storage (e.g., small volumes)
 +  * Backup jobs or heavy monitoring metrics writes
 +
 +===== Mitigation =====
 +  - Identify and reduce disk-intensive workloads
 +  - Move high IO workloads to other nodes or storage
 +  - Monitor disk health and replace failing disks
 +  - Tune filesystem or storage configuration if needed
 +  - Scale out storage for critical workloads
 +
 +===== Escalation =====
 +  * Escalate if high IO persists for extended periods
 +  * Page on-call engineer if production services are impacted
 +  * Investigate related alerts (KubernetesNodeDiskPressure, HighDiskUsage)
 +
 +===== Related Alerts =====
 +  * HighDiskUsage
 +  * HighDiskIOWait
 +  * KubernetesNodeDiskPressure
 +  * HostUnusualDiskWriteRate
 +
 +===== Related Dashboards =====
 +  * Grafana → Node Disk IO
 +  * Grafana → Node Exporter Disk Metrics
 +
runbooks/coustom_alerts/hostunusualdiskreadrate.txt · Last modified: by admin