====== HighMemoryUsage ======

===== Meaning =====

This alert fires when memory usage on a node stays above 90% for more than 5 minutes. Memory usage is calculated from the total and available memory reported by node-exporter. A sketch of a matching Prometheus alerting rule is included under Examples at the end of this page.

===== Impact =====

High memory usage can significantly affect node and application stability. Possible impacts include:

  * Pod evictions due to memory pressure
  * Application crashes (OOMKilled)
  * Increased latency and degraded performance
  * The node becoming unresponsive under sustained pressure

This alert is a **warning**, but it may escalate to a critical issue if not addressed.

===== Diagnosis =====

Check memory usage across nodes:

<code bash>
kubectl top nodes
</code>

Identify the top memory-consuming pods:

<code bash>
kubectl top pods -A --sort-by=memory
</code>

Check node conditions for memory pressure:

<code bash>
kubectl describe node <node-name>
</code>

Look for recent memory-related events:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

If SSH access is available, inspect memory usage directly:

<code bash>
free -h
top
vmstat 1
</code>

Check for pods that have been OOM-killed:

<code bash>
kubectl get pods -A | grep OOMKilled
</code>

===== Possible Causes =====

  * Memory leak in an application
  * Pods without memory limits (see the LimitRange sketch under Examples below)
  * Sudden increase in workload
  * Insufficient node memory capacity
  * Uncontrolled cache growth

===== Mitigation =====

  - Identify and restart leaking or misbehaving pods if it is safe to do so
  - Set or adjust memory requests and limits for workloads (see the example manifests under Examples below)
  - Scale the application or add more nodes if required
  - Evict non-critical workloads if needed
  - Investigate and fix memory leaks in application code

If the node is under sustained pressure, drain it temporarily:

<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>

After recovery:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====

  * If memory usage remains above the threshold for more than 15 minutes, notify the platform team
  * If pods are repeatedly OOM-killed, escalate to the application owner
  * If production services are impacted, page the on-call engineer

===== Related Alerts =====

  * HighCPUUsage
  * NodeDown
  * NodeRebootedRecently

===== Related Dashboards =====

  * Grafana → Node Overview
  * Grafana → Memory Usage Dashboard
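===== Examples =====

The exact alerting rule behind this page depends on the cluster's Prometheus configuration; the rule name, group name, and labels below are assumptions, not values taken from this runbook. A minimal sketch that matches the description in the Meaning section, assuming the standard node-exporter metrics ''node_memory_MemAvailable_bytes'' and ''node_memory_MemTotal_bytes'':

<code yaml>
groups:
  - name: node-memory            # hypothetical group name
    rules:
      - alert: HighMemoryUsage
        # Memory in use = 1 - available / total, computed per node-exporter instance
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        # Usage must stay above the threshold for 5 minutes before the alert fires
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
</code>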
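For the mitigation step of setting memory requests and limits, a minimal sketch of a container ''resources'' stanza; the pod name, image, and sizes are placeholders for illustration:

<code yaml>
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: my-app:latest     # placeholder image
      resources:
        requests:
          memory: "256Mi"      # the scheduler reserves this much memory for the pod
        limits:
          memory: "512Mi"      # the container is OOM-killed if it exceeds this
</code>

With a limit in place, a leaking container is killed and restarted by the kubelet instead of pushing the whole node into memory pressure.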
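For the "pods without memory limits" cause, a namespace-level LimitRange can apply default requests and limits to containers that do not declare their own. A sketch, assuming a hypothetical namespace ''my-namespace'' and placeholder sizes:

<code yaml>
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
  namespace: my-namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: "256Mi"      # applied to containers that declare no request
      default:
        memory: "512Mi"      # applied to containers that declare no limit
</code>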