====== HighCPUUsage ======

===== Meaning =====

This alert is triggered when the average CPU usage on a node exceeds 85% for more than 5 minutes. CPU usage is calculated from node-exporter metrics, excluding idle CPU time; a sketch of a typical expression appears under Reference Queries at the end of this page.

===== Impact =====

Sustained high CPU usage can degrade node and application performance. Possible impacts include:

  * Increased application latency
  * Pod CPU throttling
  * Slow scheduling and eviction decisions
  * Potential node instability under prolonged load

This alert is a **warning** but may become critical if CPU usage remains high.

===== Diagnosis =====

Identify nodes with high CPU usage:

<code bash>
kubectl top nodes
</code>

Identify the top CPU-consuming pods:

<code bash>
kubectl top pods -A --sort-by=cpu
</code>

Describe the affected node and check for pressure conditions:

<code bash>
kubectl describe node <node-name>
</code>

Check recent events related to resource pressure:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

If SSH access is available, inspect CPU usage directly on the node:

<code bash>
top
htop
mpstat -P ALL
</code>

===== Possible Causes =====

  * Traffic spike or increased workload
  * Application bug, such as an infinite loop
  * Pods without CPU limits (see Reference Queries below for a way to find them)
  * Insufficient node CPU capacity
  * Background system processes consuming CPU

===== Mitigation =====

  - Identify and restart misbehaving pods if it is safe to do so
  - Scale the workload horizontally if supported (see the scaling sketch under Reference Queries)
  - Apply or adjust CPU requests and limits
  - Reschedule pods to other nodes if needed
  - Consider adding more nodes to the cluster

If necessary, temporarily drain the node:

<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>

Restore scheduling after mitigation:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====

  * If CPU usage remains above the threshold for more than 15 minutes, notify the platform team
  * If production workloads are impacted, page the on-call engineer
  * If multiple nodes are affected, treat it as a capacity issue and escalate immediately

===== Related Alerts =====

  * NodeDown
  * NodeRebootedRecently
  * NodeNotReady

===== Related Dashboards =====

  * Grafana → Node Overview
  * Grafana → CPU Usage Dashboard
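===== Reference Queries =====

The deployed alert definition lives in the Prometheus rule configuration; the expression below is only a sketch of the typical node-exporter pattern that the Meaning section describes, not a copy of the actual rule. It averages the non-idle CPU fraction per instance over 5-minute windows and compares it against the 85% threshold:

<code>
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
</code>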
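To confirm the "pod CPU throttling" impact, a sketch PromQL query against cAdvisor metrics can help. It computes the fraction of CFS scheduling periods in which each pod was throttled; the metric names assume the standard kubelet/cAdvisor setup and may differ in your cluster:

<code>
sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))
  / sum by (namespace, pod) (rate(container_cpu_cfs_periods_total[5m]))
</code>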
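To locate pods matching the "pods without CPU limits" cause, one possible approach is the sketch below. It assumes ''jq'' is installed on the machine running ''kubectl'':

<code bash>
# List pods where at least one container defines no CPU limit.
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]; .resources.limits.cpu == null))
  | "\(.metadata.namespace)/\(.metadata.name)"'
</code>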
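For the horizontal-scaling mitigation, a minimal sketch assuming a Deployment named ''my-app'' in namespace ''my-namespace'' (both hypothetical names, substitute your own workload):

<code bash>
# One-off scale-out to more replicas:
kubectl scale deployment my-app -n my-namespace --replicas=5

# Or let a HorizontalPodAutoscaler manage replicas based on CPU utilization:
kubectl autoscale deployment my-app -n my-namespace --cpu-percent=70 --min=2 --max=10
</code>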