====== HighMemoryUsage ======
===== Meaning =====
This alert is triggered when memory usage on a node exceeds 90% for more than 5 minutes.
Memory usage is calculated as the fraction of total memory that is not available, using the total and available memory reported by node-exporter.
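The query below is a minimal sketch of that calculation, assuming the standard node-exporter metric names (''node_memory_MemTotal_bytes'', ''node_memory_MemAvailable_bytes'') and a Prometheus API reachable at ''prometheus:9090''; both the endpoint and the exact rule in your environment may differ.
<code bash>
# Reproduce the alert's memory calculation against Prometheus.
# The endpoint is an assumption; adjust for your environment.
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90'
</code>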
===== Impact =====
High memory usage can significantly affect node and application stability.
Possible impacts include:
* Pod evictions due to memory pressure
* Application crashes (OOMKilled)
* Increased latency and degraded performance
* Node becoming unresponsive under sustained pressure
This alert is a **warning**, but may escalate to a critical issue if not addressed.
===== Diagnosis =====
Check memory usage across nodes:
<code bash>
kubectl top nodes
</code>
Identify top memory-consuming pods:
<code bash>
kubectl top pods -A --sort-by=memory
</code>
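To narrow this down to the node that fired the alert, a field selector helps; ''<node-name>'' is a placeholder for the affected node.
<code bash>
# List only the pods scheduled on the affected node.
kubectl get pods -A --field-selector spec.nodeName=<node-name> -o wide
</code>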
Check node conditions for memory pressure (look for the ''MemoryPressure'' condition set to ''True''):
<code bash>
kubectl describe node <node-name>
</code>
Look for recent memory-related events across all namespaces:
<code bash>
kubectl get events -A --field-selector involvedObject.kind=Node
</code>
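Kernel-level OOM kills are worth checking explicitly; the filter below assumes the kubelet's standard ''SystemOOM'' event reason.
<code bash>
# Kernel OOM kills reported by the kubelet.
kubectl get events -A --field-selector reason=SystemOOM
</code>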
If SSH access is available, inspect memory usage directly:
<code bash>
free -h
top
vmstat 1
</code>
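To see which processes are holding the memory, sorting by resident set size is a quick first pass:
<code bash>
# Top 15 processes by resident memory (RSS).
ps aux --sort=-%mem | head -n 15
</code>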
Check for pods being OOM-killed:
<code bash>
kubectl get pods -A | grep OOMKilled
</code>
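The STATUS column only shows ''OOMKilled'' while that is a pod's most recent state; inspecting each container's last termination reason is more reliable. A sketch:
<code bash>
# List containers whose last termination reason was OOMKilled.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep OOMKilled
</code>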
===== Possible Causes =====
* Memory leak in an application
* Pods without memory limits (see the check after this list)
* Sudden increase in workload
* Insufficient node memory capacity
* Cache growth not properly controlled
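For the missing-limits cause above, the one-liner below finds pods with at least one container lacking a memory limit; it assumes ''jq'' is installed on the workstation.
<code bash>
# Pods containing a container without a memory limit (requires jq).
kubectl get pods -A -o json | jq -r \
  '.items[] | select(any(.spec.containers[]; .resources.limits.memory == null))
   | "\(.metadata.namespace)/\(.metadata.name)"'
</code>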
===== Mitigation =====
- Identify and restart leaking or misbehaving pods if safe
- Set or adjust memory requests and limits for workloads (see the sketch after this list)
- Scale the application or add more nodes if required
- Evict non-critical workloads if needed
- Investigate and fix memory leaks in application code
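For the requests/limits step above, ''kubectl set resources'' applies a quick change without editing manifests; the deployment name, namespace, and sizes below are placeholders.
<code bash>
# Example only: <deployment>, <namespace>, and the sizes are placeholders.
kubectl set resources deployment <deployment> -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi
</code>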
If the node is under sustained pressure, drain it temporarily:
<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>
After recovery:
<code bash>
kubectl uncordon <node-name>
</code>
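Confirm the node is schedulable again:
<code bash>
# SchedulingDisabled should no longer appear in the STATUS column.
kubectl get nodes
</code>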
===== Escalation =====
* If memory usage remains above the threshold for more than 15 minutes, notify the platform team
* If pods are repeatedly OOM-killed, escalate to the application owner
* If production services are impacted, page the on-call engineer
===== Related Alerts =====
* HighCPUUsage
* NodeDown
* NodeRebootedRecently
===== Related Dashboards =====
* Grafana → Node Overview
* Grafana → Memory Usage Dashboard