runbooks:coustom_alerts:hostoutofmemory
| runbooks:coustom_alerts:hostoutofmemory [2025/12/13 16:38] – created admin | runbooks:coustom_alerts:hostoutofmemory [2025/12/14 07:00] (current) – admin | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| runbooks: | runbooks: | ||
| + | |||
| + | ====== HostOutOfMemory ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert is triggered when a host node has **less than 10% of its memory available** for more than 2 minutes. | ||
| + | It indicates that the node is at risk of running out of memory, which may lead to OOMKilled processes and system instability. | ||
| + | |||
| + | ===== Impact ===== | ||
| + | Low memory on a host node can cause: | ||
| + | * Application pods being OOMKilled | ||
| + | * System processes failing | ||
| + | * Node instability or crashes | ||
| + | * Degraded application performance | ||
| + | * Kubernetes scheduling failures due to resource constraints | ||
| + | |||
| + | This alert is marked **warning**. | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check node memory usage: | ||
| + | |||
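| + | <code bash> | ||
| + | # From a machine with cluster access (the placeholder is filled in from the alert labels) | ||
| + | kubectl top node {{ $labels.instance }} | ||
| + | # On the affected node itself | ||
| + | free -m | ||
| + | </code> | ||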
| + | |||
| + | Check top memory-consuming processes: | ||
| + | |||
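| + | <code bash> | ||
| + | # Interactive views (htop may need to be installed) | ||
| + | top | ||
| + | htop | ||
| + | # One-shot list of the 20 lines with the highest memory share | ||
| + | ps aux --sort=-%mem | head -n 20 | ||
| + | </code> | ||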
| + | |||
| + | Check pod resource usage on the node: | ||
| + | |||
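| + | <code bash> | ||
| + | kubectl top pod --all-namespaces --field-selector spec.nodeName={{ $labels.instance }} | ||
| + | # If the metrics API does not accept the spec.nodeName selector, list the node's pods instead | ||
| + | kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName={{ $labels.instance }} | ||
| + | </code> | ||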
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Memory leaks in applications | ||
| + | * Memory-intensive batch jobs | ||
| + | * Too many pods scheduled on the node | ||
| + | * Misconfigured pod resource requests/limits | ||
| + | * System processes consuming excessive memory | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Identify and restart memory-heavy pods or processes | ||
| + | - Scale workloads to other nodes | ||
| + | - Adjust resource requests/limits | ||
| + | - Free up system memory (e.g., clear caches, stop unnecessary processes) | ||
| + | - Add more memory to the node if possible | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * Escalate if available memory remains below 10% for an extended period | ||
| + | * Page on-call engineer if production services are affected | ||
| + | * Monitor related nodes for similar memory pressure | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * HighMemoryUsage | ||
| + | * KubernetesNodeMemoryPressure | ||
| + | * PodOOMKilled | ||
| + | * HostCPUHigh | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Node Memory Usage | ||
| + | * Grafana → Node Resource Overview | ||
| + | |||
runbooks/coustom_alerts/hostoutofmemory.txt · Last modified: by admin
