runbooks:coustom_alerts:HighMemoryUsage

====== HighMemoryUsage ======

===== Meaning =====
This alert is triggered when memory usage on a node exceeds 90% for more than 5 minutes.
Memory usage is calculated from the total and available memory reported by node-exporter.
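
If the alert is evaluated by Prometheus (an assumption here, based on the node-exporter reference), the expression behind it typically looks like the sketch below. The Prometheus URL is a placeholder; adjust it and the threshold to the actual rule in your environment.

<code bash>
# Minimal sketch: evaluate the usual node-exporter based expression
# directly against the Prometheus HTTP API (URL is a placeholder).
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90'
</code>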

===== Impact =====
High memory usage can significantly affect node and application stability.

Possible impacts include:
  * Pod evictions due to memory pressure
  * Application crashes (OOMKilled)
  * Increased latency and degraded performance
  * Node becoming unresponsive under sustained pressure

This alert is a **warning**, but may escalate to a critical issue if not addressed.

===== Diagnosis =====
Check memory usage across nodes:

<code bash>
kubectl top nodes
</code>

Identify top memory-consuming pods:

<code bash>
kubectl top pods -A --sort-by=memory
</code>
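
Once a heavy pod or namespace stands out, container-level figures can narrow it down further; the namespace below is a placeholder:

<code bash>
# Per-container memory usage within a suspect namespace
kubectl top pods -n <NAMESPACE> --containers
</code>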

Check node conditions for memory pressure:

<code bash>
kubectl describe node <NODE_NAME>
</code>
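
To check only the memory-related condition without reading the full describe output, a jsonpath query such as this sketch works (''<NODE_NAME>'' is a placeholder as above):

<code bash>
# Show just the MemoryPressure condition (True means the node is under pressure)
kubectl get node <NODE_NAME> \
  -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}'
</code>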

Look for recent memory-related events:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>
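
Events age out quickly, so sorting by timestamp and filtering for memory-related reasons can help; this is one possible variation:

<code bash>
# Cluster-wide events in time order, filtered to memory-related reasons
kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'oom|memorypressure|evict'
</code>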

If SSH access is available, inspect memory usage directly:

<code bash>
free -h
top
vmstat 1
</code>
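
If the node-level numbers are high but it is unclear which processes are responsible, listing processes sorted by resident memory is a quick next step:

<code bash>
# Top 15 processes by resident memory, highest first
ps aux --sort=-%mem | head -n 15
</code>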

Check for pods being OOM-killed:

<code bash>
kubectl get pods -A | grep OOMKilled
</code>
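
The grep above only matches pods whose STATUS column currently shows OOMKilled; since a container's last termination reason is recorded in its status, a jsonpath query like this sketch also catches pods that were OOM-killed and have since restarted:

<code bash>
# Namespace, pod and last container termination reason; keep OOM-killed entries
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
  grep -i oomkilled
</code>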

===== Possible Causes =====
  * Memory leak in an application
  * Pods without memory limits (see the sketch after this list)
  * Sudden increase in workload
  * Insufficient node memory capacity
  * Uncontrolled cache growth
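
Pods without memory limits are often the first thing to check; assuming ''jq'' is available, a sketch like this lists containers that declare no memory limit:

<code bash>
# List namespace/pod/container for containers with no memory limit set
kubectl get pods -A -o json | jq -r '
  .items[]
  | .metadata.namespace as $ns
  | .metadata.name as $pod
  | .spec.containers[]
  | select(.resources.limits.memory == null)
  | "\($ns)/\($pod)/\(.name)"'
</code>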

===== Mitigation =====
  - Identify and restart leaking or misbehaving pods if it is safe to do so
  - Set or adjust memory requests and limits for workloads (see the sketch after this list)
  - Scale the application or add more nodes if required
  - Evict non-critical workloads if needed
  - Investigate and fix memory leaks in application code
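
For step 2, requests and limits are best adjusted declaratively in the manifest, but for a quick remediation ''kubectl set resources'' can be used; the deployment name and values below are placeholders:

<code bash>
# Placeholder values: size requests/limits from the workload's observed usage
kubectl set resources deployment <DEPLOYMENT_NAME> \
  --requests=memory=256Mi --limits=memory=512Mi
</code>

Note that changing resources triggers a rolling restart of the affected pods.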

If the node is under sustained pressure, drain it temporarily:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets
</code>
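
To see what the drain will evict, or to confirm the node has emptied afterwards, list the pods scheduled on it (placeholder node name):

<code bash>
# Pods currently scheduled on the node
kubectl get pods -A --field-selector spec.nodeName=<NODE_NAME> -o wide
</code>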

After recovery:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If memory usage remains above the threshold for more than 15 minutes, notify the platform team
  * If pods are repeatedly OOM-killed, escalate to the application owner
  * If production services are impacted, page the on-call engineer

===== Related Alerts =====
  * HighCPUUsage
  * NodeDown
  * NodeRebootedRecently

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → Memory Usage Dashboard