runbooks:coustom_alerts:HighMemoryUsage

====== HighMemoryUsage ======

===== Meaning =====
This alert is triggered when memory usage on a node exceeds 90% for more than 5 minutes.
Memory usage is calculated from the total and available memory reported by node-exporter.
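
If the alert is evaluated by Prometheus (an assumption here, based on the node-exporter reference), the expression behind it typically looks like the sketch below. The Prometheus URL is a placeholder; adjust it and the threshold to the actual rule in your environment.

<code bash>
# Minimal sketch: evaluate the usual node-exporter based expression
# directly against the Prometheus HTTP API (URL is a placeholder).
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90'
</code>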

===== Impact =====
High memory usage can significantly affect node and application stability.

Possible impacts include:
  * Pod evictions due to memory pressure
  * Application crashes (OOMKilled)
  * Increased latency and degraded performance
  * Node becoming unresponsive under sustained pressure

This alert is a **warning**, but may escalate to a critical issue if not addressed.

===== Diagnosis =====
Check memory usage across nodes:

<code bash>
kubectl top nodes
</code>

Identify top memory-consuming pods:

<code bash>
kubectl top pods -A --sort-by=memory
</code>
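
Once a heavy pod or namespace stands out, container-level figures can narrow it down further; the namespace below is a placeholder:

<code bash>
# Per-container memory usage within a suspect namespace
kubectl top pods -n <NAMESPACE> --containers
</code>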

Check node conditions for memory pressure:

<code bash>
kubectl describe node <NODE_NAME>
</code>
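
To check only the memory-related condition without reading the full describe output, a jsonpath query such as this sketch works (''<NODE_NAME>'' is a placeholder as above):

<code bash>
# Show just the MemoryPressure condition (True means the node is under pressure)
kubectl get node <NODE_NAME> \
  -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}'
</code>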

Look for recent memory-related events:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>
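
Events age out quickly, so sorting by timestamp and filtering for memory-related reasons can help; this is one possible variation:

<code bash>
# Cluster-wide events in time order, filtered to memory-related reasons
kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'oom|memorypressure|evict'
</code>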

If SSH access is available, inspect memory usage directly:

<code bash>
free -h
top
vmstat 1
</code>
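
If the node-level numbers are high but it is unclear which processes are responsible, listing processes sorted by resident memory is a quick next step:

<code bash>
# Top 15 processes by resident memory, highest first
ps aux --sort=-%mem | head -n 15
</code>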

Check for pods being OOM-killed:

<code bash>
kubectl get pods -A | grep OOMKilled
</code>
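
The grep above only matches pods whose STATUS column currently shows OOMKilled; since a container's last termination reason is recorded in its status, a jsonpath query like this sketch also catches pods that were OOM-killed and have since restarted:

<code bash>
# Namespace, pod and last container termination reason; keep OOM-killed entries
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
  grep -i oomkilled
</code>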

===== Possible Causes =====
  * Memory leak in an application
  * Pods without memory limits (see the sketch after this list)
  * Sudden increase in workload
  * Insufficient node memory capacity
  * Uncontrolled cache growth
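
Pods without memory limits are often the first thing to check; assuming ''jq'' is available, a sketch like this lists containers that declare no memory limit:

<code bash>
# List namespace/pod/container for containers with no memory limit set
kubectl get pods -A -o json | jq -r '
  .items[]
  | .metadata.namespace as $ns
  | .metadata.name as $pod
  | .spec.containers[]
  | select(.resources.limits.memory == null)
  | "\($ns)/\($pod)/\(.name)"'
</code>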

===== Mitigation =====
  - Identify and restart leaking or misbehaving pods if it is safe to do so
  - Set or adjust memory requests and limits for workloads (see the sketch after this list)
  - Scale the application or add more nodes if required
  - Evict non-critical workloads if needed
  - Investigate and fix memory leaks in application code
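
For step 2, requests and limits are best adjusted declaratively in the manifest, but for a quick remediation ''kubectl set resources'' can be used; the deployment name and values below are placeholders:

<code bash>
# Placeholder values: size requests/limits from the workload's observed usage
kubectl set resources deployment <DEPLOYMENT_NAME> \
  --requests=memory=256Mi --limits=memory=512Mi
</code>

Note that changing resources triggers a rolling restart of the affected pods.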

If the node is under sustained pressure, drain it temporarily:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets
</code>
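
To see what the drain will evict, or to confirm the node has emptied afterwards, list the pods scheduled on it (placeholder node name):

<code bash>
# Pods currently scheduled on the node
kubectl get pods -A --field-selector spec.nodeName=<NODE_NAME> -o wide
</code>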

After recovery:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If memory usage remains above the threshold for more than 15 minutes, notify the platform team
  * If pods are repeatedly OOM-killed, escalate to the application owner
  * If production services are impacted, page the on-call engineer

===== Related Alerts =====
  * HighCPUUsage
  * NodeDown
  * NodeRebootedRecently

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → Memory Usage Dashboard