====== HighCPUUsage ======

===== Meaning =====
This alert is triggered when the average CPU usage on a node exceeds 85% for more than 5 minutes.
CPU usage is calculated using node-exporter metrics by excluding idle CPU time.

===== Impact =====
Sustained high CPU usage can degrade node and application performance.

Possible impacts include:
  * Increased application latency
  * Pod CPU throttling
  * Slow scheduling and eviction decisions
  * Potential node instability under prolonged load

This alert is a **warning** but may become critical if CPU usage remains high.

===== Diagnosis =====
Identify nodes with high CPU usage:

<code bash>
kubectl top nodes
</code>

Identify top CPU-consuming pods:

<code bash>
kubectl top pods -A --sort-by=cpu
</code>

Describe the affected node and check for pressure conditions:

<code bash>
kubectl describe node <node-name>
</code>

Check recent events related to resource pressure:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

If SSH access is available, inspect CPU usage directly:

<code bash>
top
htop
mpstat -P ALL
</code>

===== Possible Causes =====
  * Traffic spike or increased workload
  * Application infinite loop or bug
  * Pods without CPU limits (see the sketch after this list)
  * Insufficient node CPU capacity
  * Background system processes consuming CPU

===== Mitigation =====
  - Identify and restart misbehaving pods if safe
  - Scale the workload horizontally if supported (see the examples after this list)
  - Apply or adjust CPU limits and requests (see the examples after this list)
  - Reschedule pods to other nodes if needed
  - Consider adding more nodes to the cluster

If necessary, temporarily drain the node:

<code bash>
kubectl drain <node-name>
</code>

Restore scheduling after mitigation:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====
  * If CPU usage remains above the threshold for more than 15 minutes, notify the platform team
  * If production workloads are impacted, page the on-call engineer
  * If multiple nodes are affected, treat this as a capacity issue and escalate immediately

===== Related Alerts =====
  * NodeDown
  * NodeRebootedRecently
  * NodeNotReady

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → CPU Usage Dashboard
