
====== HighDiskUsage ======

===== Meaning =====
This alert is triggered when disk usage on a node exceeds 90% for more than 5 minutes.
Disk usage is calculated using filesystem size and free space metrics reported by node-exporter.

The alert fires per mount point, so the first step is always to confirm which filesystem on the node is affected.
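
The exact alert expression is not reproduced in this runbook. As a rough sketch, assuming usage is computed as ''(size - free) / size * 100'' from node-exporter metrics such as ''node_filesystem_size_bytes'' and ''node_filesystem_avail_bytes'', the arithmetic looks like this. The helper name ''usage_pct'' is hypothetical.

<code bash>
# Hypothetical helper: compute disk usage in percent.
# $1 = filesystem size in bytes, $2 = free (available) bytes
usage_pct() {
  awk -v size="$1" -v free="$2" 'BEGIN {
    if (size <= 0) { print "size must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", (size - free) / size * 100
  }'
}

# Example: 100 GiB filesystem with 8 GiB free, which is above the 90% threshold
usage_pct 107374182400 8589934592
</code>

A value above 90 sustained for more than 5 minutes would match the alert condition described above.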

===== Impact =====
High disk usage can cause serious stability and availability issues.

Possible impacts include:
  * Applications failing to write data or logs
  * Pods crashing or entering error states
  * Node instability and kubelet failures
  * Potential data loss if the disk becomes full

This alert is a **warning**, but may escalate to a critical issue if disk usage continues to grow.

===== Diagnosis =====
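Identify the affected node and mount point. The alert labels normally carry both (for node-exporter metrics, usually ''instance'' and ''mountpoint''). Map the instance to a Kubernetes node:

<code bash>
kubectl get nodes -o wide
</code>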

If SSH access is available, check disk usage on the node:

<code bash>
df -h
</code>
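
To narrow the ''df'' output down to the filesystems that matter, a small filter can help. This is a minimal sketch: the function name ''over_threshold'' is hypothetical, it relies on the POSIX ''df -P'' column layout, and it does not handle mount points that contain spaces.

<code bash>
# Hypothetical helper: print mount points at or above a usage threshold.
# stdin = output of `df -P`, $1 = threshold in percent (default 90)
over_threshold() {
  awk -v t="${1:-90}" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)              # strip the trailing percent sign
    if (pct + 0 >= t) print $6, $5  # mount point, usage
  }'
}
</code>

On the node, run it as ''df -P | over_threshold 90''.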

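Check disk usage per directory to find the largest consumers. The ''-x'' flag keeps ''du'' on a single filesystem; replace ''/'' with the mount point named in the alert:

<code bash>
# Run as root so unreadable directories are counted; errors are discarded
sudo du -xh / 2>/dev/null | sort -h | tail -20
</code>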

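List the pods scheduled on the node and the node's allocated ephemeral storage (Kubernetes). This shows requests and limits, not actual on-disk usage:

<code bash>
kubectl describe node <NODE_NAME>
</code>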

Check recent events related to disk pressure on the affected node:

<code bash>
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME>
</code>

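Check if the node is under DiskPressure. The default ''kubectl get nodes'' output does not show node conditions, so query the condition directly (prints ''True'' or ''False''):

<code bash>
kubectl get node <NODE_NAME> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
</code>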
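
When several nodes may be affected, the same condition can be checked across the cluster. The following is a sketch only; the function names ''node_disk_pressure'' and ''pressured_nodes'' are hypothetical, and it assumes ''kubectl'' is configured for the affected cluster.

<code bash>
# Hypothetical helper: print "<node name><TAB><DiskPressure status>" for every node.
node_disk_pressure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
}

# Hypothetical helper: keep only the nodes whose DiskPressure status is True.
pressured_nodes() {
  awk -F'\t' '$2 == "True" { print $1 }'
}
</code>

Run ''node_disk_pressure | pressured_nodes'' to list only the nodes that currently report DiskPressure.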

===== Possible Causes =====
  * Log files growing uncontrollably
  * Application writing excessive data to disk
  * Container images and layers not cleaned up
  * Old files or backups consuming disk space
  * Insufficient disk capacity on the node
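
To see quickly which of these causes applies, it can help to size the usual suspects directly. The sketch below is an assumption-heavy starting point: ''dir_sizes'' is a hypothetical helper, and the paths in the usage note are common defaults that may differ on your nodes.

<code bash>
# Hypothetical helper: print the total size of each existing directory argument.
# Missing directories are skipped silently; -x keeps du on one filesystem.
dir_sizes() {
  for dir in "$@"; do
    [ -d "$dir" ] || continue
    du -sxh "$dir" 2>/dev/null
  done
}
</code>

Run it as root on the node, for example ''dir_sizes /var/log /var/lib/containerd /var/lib/docker /var/lib/kubelet''. Large log directories point to log growth, while large container runtime directories point to images and layers that have not been cleaned up.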

===== Mitigation =====
  - Remove or rotate large log files
  - Clean up unused container images and volumes
  - Delete temporary or obsolete files
  - Resize the disk if supported by the platform
  - Move data to external storage if applicable
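
Before deleting anything, it is safer to list the candidates first. The sketch below only prints rotated log files older than a given age and removes nothing. The helper name ''old_logs'' and the file name patterns (''*.gz'' and numeric suffixes such as ''.1'') are assumptions about how logs are rotated on the node.

<code bash>
# Hypothetical helper: list rotated log files older than N days (no deletion).
# $1 = directory to search, $2 = minimum age in days (default 7)
old_logs() {
  find "$1" -xdev -type f \( -name '*.gz' -o -name '*.[0-9]' \) -mtime +"${2:-7}" -print
}
</code>

For example, ''old_logs /var/log 7'' prints the files to review. Delete them only after confirming they are no longer needed.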

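If the node is under DiskPressure and cannot be cleaned up quickly, consider draining it so workloads move to healthy nodes. Pods that use emptyDir volumes block the drain unless ''--delete-emptydir-data'' is added, which deletes that data:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets
</code>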

After resolving the issue, make the node schedulable again:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If disk usage continues to increase or exceeds 95%, escalate immediately
  * If production workloads are impacted, page the on-call engineer
  * If the cause of the disk growth is unclear, escalate to the application owner or infrastructure team

===== Related Alerts =====
  * HighMemoryUsage
  * NodeDown
  * NodeNotReady

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → Disk Usage Dashboard