====== HighMemoryUsage ======

===== Meaning =====
This alert is triggered when memory usage on a node exceeds 90% for more than 5 minutes.
Memory usage is calculated based on total memory and available memory reported by node-exporter.
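
The exact alert rule lives in Prometheus; the following is only a sketch of the kind of expression behind it, assuming standard node-exporter metric names and a Prometheus API reachable at localhost:9090:

<code bash>
# Approximate alert expression: (1 - MemAvailable / MemTotal) * 100 > 90, sustained for 5m
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100'
</code>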
| + | |||
| + | ===== Impact ===== | ||
| + | High memory usage can significantly affect node and application stability. | ||
| + | |||
| + | Possible impacts include: | ||
| + | * Pod evictions due to memory pressure | ||
| + | * Application crashes (OOMKilled) | ||
| + | * Increased latency and degraded performance | ||
| + | * Node becoming unresponsive under sustained pressure | ||
| + | |||
| + | This alert is a **warning**, | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check memory usage across nodes: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl top nodes | ||
| + | </ | ||
| + | |||
| + | Identify top memory-consuming pods: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl top pods -A --sort-by=memory | ||
| + | </ | ||
| + | |||
| + | Check node conditions for memory pressure: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl describe node < | ||
| + | </ | ||
| + | |||
| + | Look for recent memory-related events: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events --field-selector involvedObject.kind=Node | ||
| + | </ | ||
| + | |||
| + | If SSH access is available, inspect memory usage directly: | ||
| + | |||
| + | <code bash> | ||
| + | free -h | ||
| + | top | ||
| + | vmstat 1 | ||
| + | </ | ||
| + | |||
| + | Check for pods being OOM-killed: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pods -A | grep OOMKilled | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Memory leak in an application | ||
| + | * Pods without memory limits | ||
| + | * Sudden increase in workload | ||
| + | * Insufficient node memory capacity | ||
| + | * Cache growth not properly controlled | ||
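
To find pods that set no memory limit on at least one container, a sketch that assumes jq is installed:

<code bash>
# List namespace/pod for every pod with a container missing resources.limits.memory
kubectl get pods -A -o json \
  | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits.memory == null)) | "\(.metadata.namespace)/\(.metadata.name)"'
</code>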
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Identify and restart leaking or misbehaving pods if safe | ||
| + | - Set or adjust memory requests and limits for workloads | ||
| + | - Scale the application or add more nodes if required | ||
| + | - Evict non-critical workloads if needed | ||
| + | - Investigate and fix memory leaks in application code | ||
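
For step 2, requests and limits can be adjusted in the workload manifest or, for a quick change, with kubectl; the deployment name, namespace, and values below are placeholders:

<code bash>
# Set memory requests/limits on a deployment (this triggers a rolling restart)
kubectl set resources deployment/<deployment-name> -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi
</code>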
| + | |||
| + | If the node is under sustained pressure, drain it temporarily: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl drain < | ||
| + | </ | ||
| + | |||
| + | After recovery: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl uncordon < | ||
| + | </ | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * If memory usage remains above threshold for more than 15 minutes, notify the platform team | ||
| + | * If pods are repeatedly OOM-killed, escalate to the application owner | ||
| + | * If production services are impacted, page the on-call engineer | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * HighCPUUsage | ||
| + | * NodeDown | ||
| + | * NodeRebootedRecently | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Node Overview | ||
| + | * Grafana → Memory Usage Dashboard | ||
| + | |||