runbooks:coustom_alerts:HighDiskUsage

====== HighDiskUsage ======

===== Meaning =====
This alert is triggered when disk usage on a node exceeds 90% for more than 5 minutes.
Disk usage is calculated from the filesystem size and free space metrics reported by node-exporter.

Each firing alert is scoped to a specific node and mount point.

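To see the same numbers the alert is based on, current usage per node and mount point can be queried from Prometheus. This is a sketch: the Prometheus URL and the fstype filter are assumptions, and the exact alert rule in your configuration may differ.

<code bash>
# Query current filesystem usage (%) per node and mount point from the
# Prometheus HTTP API. PROMETHEUS_URL and the fstype filter are assumptions.
PROMETHEUS_URL="http://prometheus.example.internal:9090"
curl -sG "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})' \
  | jq -r '.data.result[] | [.metric.instance, .metric.mountpoint, .value[1]] | @tsv'
</code>
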
===== Impact =====
High disk usage can cause serious stability and availability issues.

Possible impacts include:
  * Applications failing to write data or logs
  * Pods crashing or entering error states
  * Node instability and kubelet failures
  * Potential data loss if the disk becomes full

This alert is a **warning**, but it may escalate to a critical issue if disk usage continues to grow.

===== Diagnosis =====
The affected node and mount point are reported in the alert labels. Confirm the node is present in the cluster:

<code bash>
kubectl get nodes
</code>

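If the alert only carries an ''instance'' label (typically the node-exporter address), the wide output can help map that address to a node name:

<code bash>
# The INTERNAL-IP column maps the alert's instance address to a node name.
kubectl get nodes -o wide
</code>
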
If SSH access is available, check disk usage on the node:

<code bash>
df -h
</code>

Check disk usage per directory to find large consumers:

<code bash>
du -xh / | sort -h | tail -20
</code>

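On Kubernetes nodes the usual large consumers are system logs and container storage. The paths below are assumptions for a systemd-based node running containerd; adjust them to your runtime:

<code bash>
# Journal size plus a summary of common log/container/kubelet directories
# (paths are assumptions; adjust to your distribution and runtime).
journalctl --disk-usage
sudo du -xsh /var/log /var/lib/containerd /var/lib/kubelet 2>/dev/null
</code>
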
Check the node's resource allocation, conditions, and recent events at the Kubernetes level:

<code bash>
kubectl describe node <NODE_NAME>
</code>

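For actual per-pod ephemeral-storage consumption, the kubelet stats summary endpoint can be queried through the API server. A sketch, assuming ''jq'' is installed and the standard summary schema:

<code bash>
# Per-pod ephemeral-storage usage reported by the kubelet stats summary API.
kubectl get --raw "/api/v1/nodes/<NODE_NAME>/proxy/stats/summary" \
  | jq '.pods[] | {namespace: .podRef.namespace, pod: .podRef.name, usedBytes: .["ephemeral-storage"].usedBytes}'
</code>
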
Check recent events related to disk pressure:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

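To narrow this down to the affected node and see the most recent events last (a variation on the command above):

<code bash>
# Events for one node, sorted so the newest entries appear at the bottom.
kubectl get events \
  --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME> \
  --sort-by=.lastTimestamp
</code>
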
Check whether the node is reporting the DiskPressure condition:

<code bash>
kubectl get node <NODE_NAME> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
</code>

===== Possible Causes =====
  * Log files growing uncontrollably
  * Applications writing excessive data to disk
  * Container images and layers not being cleaned up
  * Old files or backups consuming disk space
  * Insufficient disk capacity on the node

===== Mitigation =====
  - Remove or rotate large log files (see the cleanup sketch after this list)
  - Clean up unused container images and volumes
  - Delete temporary or obsolete files
  - Resize the disk if the platform supports it
  - Move data to external storage if applicable

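A minimal cleanup sketch for the first two items, assuming a systemd node with containerd and ''crictl'' available; the size limit is an example, not a policy:

<code bash>
# Shrink the systemd journal to a fixed size (example limit).
sudo journalctl --vacuum-size=500M

# Remove container images not referenced by any container (containerd/crictl assumed).
sudo crictl rmi --prune
</code>
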
If the node is under DiskPressure, consider draining it:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets
</code>

Once disk usage has been reduced, return the node to service:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If disk usage continues to increase or exceeds 95%, escalate immediately
  * If production workloads are impacted, page the on-call engineer
  * If the cause of the disk growth is unclear, escalate to the application owner or the infrastructure team

===== Related Alerts =====
  * HighMemoryUsage
  * NodeDown
  * NodeNotReady

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → Disk Usage Dashboard