====== HighCPUUsage ======
===== Meaning =====
This alert is triggered when the average CPU usage on a node exceeds 85% for more than 5 minutes.
CPU usage is derived from node-exporter metrics as the percentage of CPU time spent in non-idle modes.
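The authoritative rule lives in the Prometheus configuration; as a reference, alerts of this kind are typically expressed along these lines (metric and label names assume a standard node-exporter deployment):
<code>
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
</code>
This is usually combined with a ''for: 5m'' clause so the alert only fires after the condition has held for five minutes.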
===== Impact =====
Sustained high CPU usage can degrade node and application performance.
Possible impacts include:
* Increased application latency
* Pod CPU throttling
* Slow scheduling and eviction decisions
* Potential node instability under prolonged load
This alert is a **warning** but may become critical if CPU usage remains high.
===== Diagnosis =====
Identify nodes with high CPU usage:
<code bash>
kubectl top nodes
</code>
Identify top CPU-consuming pods:
<code bash>
kubectl top pods -A --sort-by=cpu
</code>
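To narrow the list to pods scheduled on the affected node, a field selector can be used (''<node-name>'' is a placeholder for the node from the alert):
<code bash>
kubectl get pods -A --field-selector spec.nodeName=<node-name> -o wide
</code>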
Describe the affected node to check for pressure conditions:
<code bash>
kubectl describe node <node-name>
</code>
Check recent events related to resource pressure:
<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>
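If the event list is long, sorting by timestamp puts the most recent node events last; this uses only stock kubectl flags:
<code bash>
kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
</code>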
If SSH access is available, inspect CPU usage directly:
<code bash>
top
htop
mpstat -P ALL
</code>
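For a quick non-interactive snapshot of the heaviest processes, something like the following works on most Linux distributions (GNU ''ps'' syntax):
<code bash>
ps aux --sort=-%cpu | head -n 10
</code>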
===== Possible Causes =====
* Traffic spike or increased workload
* Application infinite loop or bug
* Pods without CPU limits (see the query after this list)
* Insufficient node CPU capacity
* Background system processes consuming CPU
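For the "pods without CPU limits" case, a rough sketch of a query that lists pods with at least one container missing a CPU limit; it assumes ''jq'' is installed and is illustrative rather than authoritative:
<code bash>
# Print namespace/name for every pod where some container has no CPU limit
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(any(.spec.containers[]; .resources.limits.cpu == null))
      | .metadata.namespace + "/" + .metadata.name'
</code>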
===== Mitigation =====
- Identify and restart misbehaving pods if safe
- Scale the workload horizontally if supported (example below)
- Apply or adjust CPU limits and requests (example below)
- Reschedule pods to other nodes if needed
- Consider adding more nodes to the cluster
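A minimal sketch of the scaling and resource-adjustment steps, assuming a Deployment named ''my-app'' (a placeholder for the actual workload):
<code bash>
# Scale out to spread load across more replicas
kubectl scale deployment my-app --replicas=5

# Set or adjust CPU requests and limits; this rewrites the pod
# template, so the Deployment performs a rolling restart
kubectl set resources deployment my-app --requests=cpu=250m --limits=cpu=1
</code>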
If necessary, temporarily drain the node so its workloads are evicted and rescheduled elsewhere:
<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>
Restore scheduling after mitigation:
<code bash>
kubectl uncordon <node-name>
</code>
===== Escalation =====
* If CPU usage remains above threshold for more than 15 minutes, notify the platform team
* If production workloads are impacted, page the on-call engineer
* If multiple nodes are affected, treat as a capacity issue and escalate immediately
===== Related Alerts =====
* NodeDown
* NodeRebootedRecently
* NodeNotReady
===== Related Dashboards =====
* Grafana → Node Overview
* Grafana → CPU Usage Dashboard