====== HighMemoryUsage ======

===== Meaning =====

This alert fires when memory usage on a node stays above 90% for more than 5 minutes. Memory usage is calculated from the total and available memory reported by node-exporter. A sketch of a matching Prometheus alerting rule is included under Examples at the end of this page.

===== Impact =====

High memory usage can significantly affect node and application stability. Possible impacts include:

  * Pod evictions due to memory pressure
  * Application crashes (OOMKilled)
  * Increased latency and degraded performance
  * The node becoming unresponsive under sustained pressure

This alert is a **warning**, but it may escalate to a critical issue if not addressed.

===== Diagnosis =====

Check memory usage across nodes:

<code bash>
kubectl top nodes
</code>

Identify the top memory-consuming pods:

<code bash>
kubectl top pods -A --sort-by=memory
</code>

Check node conditions for memory pressure:

<code bash>
kubectl describe node <node-name>
</code>

Look for recent memory-related events:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

If SSH access is available, inspect memory usage directly:

<code bash>
free -h
top
vmstat 1
</code>

Check for pods that have been OOM-killed:

<code bash>
kubectl get pods -A | grep OOMKilled
</code>

===== Possible Causes =====

  * Memory leak in an application
  * Pods without memory limits (see the LimitRange sketch under Examples below)
  * Sudden increase in workload
  * Insufficient node memory capacity
  * Uncontrolled cache growth

===== Mitigation =====

  - Identify and restart leaking or misbehaving pods if it is safe to do so
  - Set or adjust memory requests and limits for workloads (see the example manifests under Examples below)
  - Scale the application or add more nodes if required
  - Evict non-critical workloads if needed
  - Investigate and fix memory leaks in application code

If the node is under sustained pressure, drain it temporarily:

<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>

After recovery:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====

  * If memory usage remains above the threshold for more than 15 minutes, notify the platform team
  * If pods are repeatedly OOM-killed, escalate to the application owner
  * If production services are impacted, page the on-call engineer

===== Related Alerts =====

  * HighCPUUsage
  * NodeDown
  * NodeRebootedRecently

===== Related Dashboards =====

  * Grafana → Node Overview
  * Grafana → Memory Usage Dashboard
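===== Examples =====

The exact alerting rule behind this page depends on the cluster's Prometheus configuration; the rule name, group name, and labels below are assumptions, not values taken from this runbook. A minimal sketch that matches the description in the Meaning section, assuming the standard node-exporter metrics ''node_memory_MemAvailable_bytes'' and ''node_memory_MemTotal_bytes'':

<code yaml>
groups:
  - name: node-memory            # hypothetical group name
    rules:
      - alert: HighMemoryUsage
        # Memory in use = 1 - available / total, computed per node-exporter instance
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        # Usage must stay above the threshold for 5 minutes before the alert fires
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
</code>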
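For the mitigation step of setting memory requests and limits, a minimal sketch of a container ''resources'' stanza; the pod name, image, and sizes are placeholders for illustration:

<code yaml>
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: my-app:latest     # placeholder image
      resources:
        requests:
          memory: "256Mi"      # the scheduler reserves this much memory for the pod
        limits:
          memory: "512Mi"      # the container is OOM-killed if it exceeds this
</code>

With a limit in place, a leaking container is killed and restarted by the kubelet instead of pushing the whole node into memory pressure.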
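For the "pods without memory limits" cause, a namespace-level LimitRange can apply default requests and limits to containers that do not declare their own. A sketch, assuming a hypothetical namespace ''my-namespace'' and placeholder sizes:

<code yaml>
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
  namespace: my-namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: "256Mi"      # applied to containers that declare no request
      default:
        memory: "512Mi"      # applied to containers that declare no limit
</code>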