runbooks:coustom_alerts:KubernetesNodeMemoryPressure

====== KubernetesNodeMemoryPressure ======

===== Meaning =====
This alert fires when a Kubernetes node has been reporting the **MemoryPressure** condition for more than 2 minutes.
MemoryPressure means the node's available memory has dropped below the kubelet's eviction threshold, so the kubelet may start evicting pods to reclaim memory.

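To see at a glance which nodes are currently reporting the condition, one quick check uses kubectl's JSONPath output (no extra tooling assumed):

<code bash>
# Print each node name together with its MemoryPressure condition status (True/False/Unknown)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'
</code>
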
===== Impact =====
Memory pressure on a node can lead to:
  * Pod evictions and restarts
  * OOMKilled containers
  * Degraded application performance
  * Scheduling failures for new pods

This alert is **critical** because sustained memory pressure directly affects workload stability.

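Evicted pods are usually the most visible symptom. Because evicted Pod objects outlive the corresponding events (which typically expire after about an hour), a quick cluster-wide check helps confirm whether evictions have already happened:

<code bash>
# Failed pods include those evicted by the kubelet; the STATUS column shows "Evicted"
kubectl get pods --all-namespaces --field-selector status.phase=Failed | grep -i evicted
</code>
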
===== Diagnosis =====
Check node status and conditions:

<code bash>
kubectl get nodes
kubectl describe node <NODE_NAME>
</code>

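To inspect just the MemoryPressure condition, including the kubelet's reason and message (typically **KubeletHasInsufficientMemory** when the condition is True), a JSONPath filter avoids scanning the full describe output:

<code bash>
# Show only the MemoryPressure condition object for the affected node
kubectl get node <NODE_NAME> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")]}'
</code>
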
Check node memory usage:

<code bash>
kubectl top node <NODE_NAME>
free -m   # run this on the node itself (e.g. via SSH)
</code>

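If SSH access to the node is inconvenient, a node debug pod is one way to get a shell there. This is a sketch only, assuming a reasonably recent kubectl and that the cluster can pull the busybox image:

<code bash>
# Start an interactive debug pod on the node; the node's root filesystem is mounted at /host
kubectl debug node/<NODE_NAME> -it --image=busybox
# Inside the debug container:
chroot /host
free -m
</code>
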
List the pods consuming the most memory:

<code bash>
kubectl top pod --all-namespaces --sort-by=memory
</code>

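To narrow this down to the affected node, list the pods scheduled there and cross-reference them with the output above:

<code bash>
# All pods currently scheduled on the affected node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME> -o wide
</code>
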
Check recent events for evictions and memory warnings:

<code bash>
kubectl get events --all-namespaces --sort-by=.lastTimestamp
</code>

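Events can also be scoped to the node itself; reasons such as **EvictionThresholdMet** or **NodeHasInsufficientMemory** point directly at memory pressure:

<code bash>
# Only events whose subject is the affected node
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME> --sort-by=.lastTimestamp
</code>
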
===== Possible Causes =====
  * Memory leaks in applications
  * Missing or insufficient memory requests/limits (see the sketch after this list)
  * Sudden traffic spikes
  * Misconfigured workloads or batch jobs
  * Too many pods scheduled on the node

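A minimal sketch for spotting containers on the node with no memory limit at all, which are the usual suspects for unbounded growth. It assumes **jq** is installed; NODE_NAME is a placeholder:

<code bash>
# Namespace/pod: container for every container on the node without a memory limit
kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME> -o json \
  | jq -r '.items[] | . as $p | .spec.containers[]
           | select(.resources.limits.memory == null)
           | "\($p.metadata.namespace)/\($p.metadata.name): \(.name)"'
</code>
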
===== Mitigation =====
  - Identify and restart or scale memory-heavy pods
  - Set proper resource **requests and limits** (see the example after this list)
  - Scale out workloads or add more nodes
  - Increase node memory capacity if required

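As an example of the first two mitigation steps, a deployment can be given explicit memory requests/limits (which triggers a rolling restart) or scaled out. The deployment name, namespace and values below are placeholders, not recommendations:

<code bash>
# Set explicit memory requests/limits on a memory-heavy deployment (rolls its pods)
kubectl set resources deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> \
  --requests=memory=256Mi --limits=memory=512Mi

# Or spread the load over more replicas (useful once more capacity is available)
kubectl scale deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --replicas=5
</code>
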
If immediate relief is needed, drain the node so its pods are rescheduled elsewhere:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
</code>

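If evicting every pod at once is too disruptive, cordoning the node is a lighter alternative: it stops new pods from being scheduled there while existing workloads keep running:

<code bash>
kubectl cordon <NODE_NAME>
</code>
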
After mitigation, once the node has stabilized, make it schedulable again:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

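To confirm the node is schedulable again and the condition has cleared (the second command should print **False**):

<code bash>
kubectl get node <NODE_NAME>
kubectl get node <NODE_NAME> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
</code>
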
===== Escalation =====
  * Escalate if memory pressure persists for longer than 10 minutes
  * Page the on-call engineer if pod evictions impact production
  * If multiple nodes show memory pressure, treat it as a cluster capacity issue (see the capacity check below)

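For the multi-node case, a rough cluster-wide capacity check is to compare actual memory usage with what is already requested on each node:

<code bash>
# Current usage per node (requires metrics-server)
kubectl top nodes

# Memory requests/limits already allocated on each node
kubectl describe nodes | grep -A 6 "Allocated resources"
</code>
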
===== Related Alerts =====
  * HighMemoryUsage
  * PodCrashLoopBackOff
  * KubernetesNodeNotReady
  * HighCPUUsage

===== Related Dashboards =====
  * Grafana → Kubernetes / Node Memory
  * Grafana → Node Exporter Full
