====== HighDiskUsage ======

===== Meaning =====
This alert is triggered when disk usage on a node exceeds 90% for more than 5 minutes.
Disk usage is calculated from the filesystem size and free space metrics reported by node-exporter.

The alert is scoped to a specific mount point.

===== Impact =====
High disk usage can cause serious stability and availability issues.

Possible impacts include:
  * Applications failing to write data or logs
  * Pods crashing or entering error states
  * Node instability and kubelet failures
  * Potential data loss if the disk becomes completely full

This alert is a **warning**, so it does not require an immediate page, but it should be addressed before the disk fills up completely.

===== Diagnosis =====
Identify affected nodes and mount points:

<code bash>
kubectl get nodes
</code>

| + | |||
| + | If SSH access is available, check disk usage on the node: | ||
| + | |||
| + | <code bash> | ||
| + | df -h | ||
| + | </ | ||
| + | |||
| + | Check disk usage per directory to find large consumers: | ||
| + | |||
| + | <code bash> | ||
| + | du -xh / | sort -h | tail -20 | ||
| + | </ | ||
| + | |||
| + | Check for pod-level disk usage (Kubernetes): | ||
| + | |||
| + | <code bash> | ||
| + | kubectl describe node < | ||
| + | </ | ||
| + | |||
| + | Check recent events related to disk pressure: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events --field-selector involvedObject.kind=Node | ||
| + | </ | ||
| + | |||
| + | Check if the node is under DiskPressure: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get nodes | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Log files growing uncontrollably | ||
| + | * Application writing excessive data to disk | ||
| + | * Container images and layers not cleaned up | ||
| + | * Old files or backups consuming disk space | ||
| + | * Insufficient disk capacity on the node | ||
| + | |||
===== Mitigation =====
  - Remove or rotate large log files
  - Clean up unused container images and volumes (see the example commands after this list)
  - Delete temporary or obsolete files
  - Resize the disk if supported by the platform
  - Move data to external storage if applicable

If the node is under DiskPressure, consider cordoning and draining it so workloads are rescheduled while you free up space:

<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>

After resolving the issue:

<code bash>
kubectl uncordon <node-name>
</code>

| + | |||
| + | ===== Escalation ===== | ||
| + | * If disk usage continues to increase or exceeds 95%, escalate immediately | ||
| + | * If production workloads are impacted, page the on-call engineer | ||
| + | * If disk growth cause is unclear, escalate to the application owner or infrastructure team | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * HighMemoryUsage | ||
| + | * NodeDown | ||
| + | * NodeNotReady | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Node Overview | ||
| + | * Grafana → Disk Usage Dashboard | ||
| + | |||