runbooks:coustom_alerts:hostoutofmemory

runbooks:coustom_alerts:hostoutofmemory [2025/12/13 16:38] – created adminrunbooks:coustom_alerts:hostoutofmemory [2025/12/14 07:00] (current) admin
Line 1: Line 1:
 runbooks:coustom_alerts:HostOutOfMemory runbooks:coustom_alerts:HostOutOfMemory
 +
 +====== HostOutOfMemory ======
 +
 +===== Meaning =====
 +This alert fires when a host node has had **less than 10% of its memory available** for more than 2 minutes.
 +The node is then at risk of running out of memory, which can lead to OOM-killed processes and system instability.
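
To check the condition directly on the host, the sketch below computes the available-memory percentage from a meminfo-style file. It assumes the alert compares ''MemAvailable'' with ''MemTotal'', which this page does not state; ''mem_available_pct'' is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the percentage of memory still available.
# Assumes the alert is based on MemAvailable / MemTotal (not confirmed here).
mem_available_pct() {
  local meminfo="${1:-/proc/meminfo}"   # default to the live kernel view
  awk '
    /^MemTotal:/     { total = $2 }     # value in kB
    /^MemAvailable:/ { avail = $2 }     # value in kB
    END {
      if (total > 0) printf "%.1f\n", avail * 100 / total
      else exit 1                       # no MemTotal line found
    }
  ' "$meminfo"
}
```

Running ''mem_available_pct'' with no argument on the affected host reads ''/proc/meminfo''; a result below 10 matches the threshold described above.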
 +
 +===== Impact =====
 +Low memory on a host node can cause:
 +  * Application pods being OOMKilled
 +  * System processes failing
 +  * Node instability or crashes
 +  * Degraded application performance
 +  * Kubernetes scheduling failures due to resource constraints
 +
 +This alert has severity **warning**, but it can escalate quickly if memory continues to deplete.
 +
 +===== Diagnosis =====
 +Check node memory usage:
 +
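 +<code bash>
 +# Node-level usage from the metrics API. The instance label may hold a
 +# host:port value rather than the Kubernetes node name; adjust if so.
 +kubectl top node {{ $labels.instance }}
 +# Run on the affected host: sizes in MiB, see the "available" column.
 +free -m
 +</code>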
 +
 +Check top memory-consuming processes:
 +
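 +<code bash>
 +# Interactive views; in top, press Shift+M to sort by memory.
 +top
 +htop
 +# Non-interactive: header plus the processes with the highest memory share.
 +ps aux --sort=-%mem | head -n 20
 +</code>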
 +
 +Check pod resource usage on the node:
 +
 +<code bash>
 +kubectl top pod --all-namespaces --field-selector spec.nodeName={{ $labels.instance }}
 +</code>
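
To narrow that output to the heaviest pods, a small filter along these lines may help. It is a sketch that assumes ''kubectl top pod --all-namespaces --no-headers'' prints four columns (namespace, name, CPU, memory) with memory in ''Mi''; ''pods_over_mem'' and its default threshold are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl top pod --all-namespaces --no-headers`
# output on stdin and print pods using at least MIN_MI mebibytes, largest first.
# Assumes the memory column (4th field) looks like "512Mi".
pods_over_mem() {
  local min_mi="${1:-500}"
  awk -v min="$min_mi" '
    $4 ~ /^[0-9]+Mi$/ {
      mem = $4
      sub(/Mi$/, "", mem)               # strip the unit for numeric comparison
      if (mem + 0 >= min) print mem, $1 "/" $2
    }
  ' | sort -rn | awk '{ print $2, $1 "Mi" }'
}
```

For example, the ''kubectl top pod'' command above can be piped into it with ''--no-headers'' added: ''... --no-headers | pods_over_mem 500''.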
 +
 +===== Possible Causes =====
 +  * Memory leaks in applications
 +  * Memory-intensive batch jobs
 +  * Too many pods scheduled on the node
 +  * Misconfigured pod resource requests/limits
 +  * System processes consuming excessive memory
 +
 +===== Mitigation =====
 +  - Identify and restart memory-heavy pods or processes
 +  - Scale workloads to other nodes
 +  - Adjust resource requests/limits for pods
 +  - Free up system memory (e.g., clear caches, restart unnecessary processes)
 +  - Add more memory to the node if possible
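
The first two steps can be sketched as a cordon followed by a rolling restart, so that replacement pods are scheduled on other nodes. This is one possible approach, not a prescribed procedure; ''relieve_node'' and the ''DRY_RUN'' switch are hypothetical, and the workload is assumed to be a Deployment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: stop new pods landing on a node, then restart one
# Deployment so its pods are recreated elsewhere.
# With DRY_RUN=1 (the default here) the kubectl commands are only printed.
relieve_node() {
  local node="$1" namespace="$2" deployment="$3"
  local run=()
  if [ "${DRY_RUN:-1}" = "1" ]; then
    run=(echo)                          # print instead of execute
  fi
  "${run[@]}" kubectl cordon "$node"
  "${run[@]}" kubectl rollout restart "deployment/$deployment" -n "$namespace"
}
```

Review the printed commands first, then rerun with ''DRY_RUN=0'' to apply them. Remember to run ''kubectl uncordon'' on the node once memory has recovered.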
 +
 +===== Escalation =====
 +  * Escalate if available memory remains below 10% for an extended period
 +  * Page on-call engineer if production services are affected
 +  * Monitor related nodes for similar memory pressure
 +
 +===== Related Alerts =====
 +  * HighMemoryUsage
 +  * KubernetesNodeMemoryPressure
 +  * PodOOMKilled
 +  * HostCPUHigh
 +
 +===== Related Dashboards =====
 +  * Grafana → Node Memory Usage
 +  * Grafana → Node Resource Overview
 +
runbooks/coustom_alerts/hostoutofmemory.1765643891.txt.gz · Last modified: by admin