--- runbooks:coustom_alerts:hostoutofmemory [2025/12/13 16:38] – created admin
+++ runbooks:coustom_alerts:hostoutofmemory [2025/12/14 07:00] (current) – admin
@@ Line 1: / Line 1: @@
 runbooks:coustom_alerts:HostOutOfMemory
+====== HostOutOfMemory ======
+===== Meaning =====
+This alert fires when **less than 10% of a host node's memory is available** for more than 2 minutes.
+It indicates that the node is at risk of running out of memory, which may lead to OOMKilled processes and system instability.
+===== Impact =====
+Low memory on a host node can cause:
+  * Application pods being OOMKilled
+  * System processes failing
+  * Node instability or crashes
+  * Degraded application performance
+  * Kubernetes scheduling failures due to resource constraints
+This alert has severity **warning**, but it can escalate quickly if memory continues to deplete.
+===== Diagnosis =====
+Check node memory usage:
+<code bash>
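+# From a machine with cluster access; the placeholder must resolve to the Kubernetes node name
+kubectl top node {{ $labels.instance }}
+# On the node itself (for example over SSH); the "available" column is the relevant figure
+free -m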
+</code>
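+To compare a node directly against the 10% threshold, the available-memory percentage can be computed from ''/proc/meminfo''. The following is a minimal sketch, not the alert's actual expression: it assumes the alert is derived from the kernel's ''MemAvailable'' and ''MemTotal'' figures (which node_exporter exposes as ''node_memory_MemAvailable_bytes'' and ''node_memory_MemTotal_bytes''), and ''mem_available_pct'' is a hypothetical helper name.

```shell
# Sketch: percentage of memory still available, read from a meminfo-formatted file.
# mem_available_pct is a hypothetical helper, not part of any existing tool.
mem_available_pct() {
  # Optional argument: path to a meminfo-formatted file (defaults to /proc/meminfo).
  awk '
    $1 == "MemTotal:"     { total = $2 }   # value in kB
    $1 == "MemAvailable:" { avail = $2 }   # value in kB
    END {
      if (total > 0) printf "%.1f\n", avail * 100 / total
      else exit 1                          # MemTotal missing: signal failure
    }
  ' "${1:-/proc/meminfo}"
}
```

+Run on the node, ''mem_available_pct'' prints a single number; a value below 10 corresponds to the condition described under Meaning. The ''MemAvailable'' field is only present on reasonably recent Linux kernels, so on older hosts the function prints 0.0.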
+Check top memory-consuming processes:
+<code bash>
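+# Run on the node itself; top and htop are interactive (in top, press Shift+M to sort by memory)
+top
+htop
+# Non-interactive snapshot of the process list, sorted by memory share
+ps aux --sort=-%mem | head -n 20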
+</code>
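+A per-process listing can hide services that spread their memory over many worker processes. As a rough sketch, resident memory can be summed per command name; ''rss_by_command'' is a hypothetical helper that expects lines of the form "RSS in KiB, then command name" on standard input.

```shell
# Sketch: sum resident memory (RSS) per command name, largest first.
# rss_by_command is a hypothetical helper; input lines look like "RSS_KiB COMMAND".
rss_by_command() {
  # Optional argument: how many commands to print (defaults to 10).
  awk '
    $1 ~ /^[0-9]+$/ { rss[$2] += $1 }   # non-numeric first field (a header) is skipped
    END { for (cmd in rss) printf "%d %s\n", rss[cmd] / 1024, cmd }   # whole MiB, truncated
  ' | sort -rn | head -n "${1:-10}"
}
```

+On the node this could be fed with ''ps -eo rss=,comm= | rss_by_command 10''. Only the first word of each command name is used as the grouping key, and shared pages are counted once per process, so the totals are an upper bound rather than an exact figure.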
+Check pod resource usage on the node:
+<code bash>
+kubectl top pod --all-namespaces --field-selector spec.nodeName={{ $labels.instance }}
+</code>
+===== Possible Causes =====
+  * Memory leaks in applications
+  * Memory-intensive batch jobs
+  * Too many pods scheduled on the node
+  * Misconfigured pod resource requests/limits
+  * System processes consuming excessive memory
+===== Mitigation =====
+  - Identify and restart memory-heavy pods or processes
+  - Scale workloads to other nodes
+  - Adjust resource requests/limits for pods
+  - Free up system memory (e.g., clear caches, restart unnecessary processes)
+  - Add more memory to the node if possible
+===== Escalation =====
+  * Escalate if available memory remains below 10% for an extended period
+  * Page on-call engineer if production services are affected
+  * Monitor related nodes for similar memory pressure
+===== Related Alerts =====
+  * HighMemoryUsage
+  * KubernetesNodeMemoryPressure
+  * PodOOMKilled
+  * HostCPUHigh
+===== Related Dashboards =====
+  * Grafana → Node Memory Usage
+  * Grafana → Node Resource Overview