====== HighCPUUsage ======

===== Meaning =====
This alert is triggered when the average CPU usage on a node exceeds 85% for more than 5 minutes.
CPU usage is calculated using node-exporter metrics by excluding idle CPU time.

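The alert rule itself lives in the monitoring stack, not in this runbook. A typical expression matching this description, using the standard node-exporter metric, can be checked ad hoc against Prometheus (the URL below is an assumption; adjust it for your environment):

<code bash>
# Nodes currently above the 85% threshold, averaged over 5 minutes.
# Assumes the standard node_cpu_seconds_total metric from node-exporter.
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 85'
</code>
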
===== Impact =====
Sustained high CPU usage can degrade node and application performance.

Possible impacts include:
  * Increased application latency
  * Pod CPU throttling
  * Slow scheduling and eviction decisions
  * Potential node instability under prolonged load

This alert is a **warning**, but it may become critical if CPU usage remains high.

===== Diagnosis =====
Identify nodes with high CPU usage:

<code bash>
kubectl top nodes
</code>

Identify the top CPU-consuming pods:

<code bash>
kubectl top pods -A --sort-by=cpu
</code>

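To narrow the list to pods scheduled on the affected node, a field selector can be used (a sketch; <NODE_NAME> is a placeholder):

<code bash>
# List all pods running on the affected node.
kubectl get pods -A --field-selector spec.nodeName=<NODE_NAME> -o wide
</code>
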
Describe the affected node to check for pressure conditions:

<code bash>
kubectl describe node <NODE_NAME>
</code>

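The describe output is long; the CPU-relevant part is the Allocated resources section, which can be extracted directly:

<code bash>
# Show how much CPU and memory is already requested on the node.
kubectl describe node <NODE_NAME> | grep -A 10 'Allocated resources'
</code>
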
Check recent events related to resource pressure:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

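To scope events to the specific node, add its name to the field selector:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME>
</code>
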
If SSH access is available, inspect CPU usage directly:

<code bash>
top            # interactive overview of per-process CPU usage
htop           # friendlier interactive view (may need to be installed)
mpstat -P ALL  # per-core CPU statistics (part of the sysstat package)
</code>

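For a non-interactive snapshot that can be pasted into a ticket, the heaviest processes can also be listed with ps:

<code bash>
# Header plus the top 10 processes sorted by CPU usage.
ps aux --sort=-%cpu | head -n 11
</code>
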
===== Possible Causes =====
  * Traffic spike or increased workload
  * Application bug, such as an infinite loop
  * Pods without CPU limits (see the sketch after this list)
  * Insufficient node CPU capacity
  * Background system processes consuming CPU

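Pods without CPU limits can be surfaced with a jq filter, a sketch assuming jq is installed:

<code bash>
# Print namespace/name of every pod with at least one container
# that declares no CPU limit.
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]; .resources.limits.cpu == null))
  | "\(.metadata.namespace)/\(.metadata.name)"'
</code>
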
===== Mitigation =====
  - Identify and restart misbehaving pods if safe
  - Scale the workload horizontally if supported
  - Apply or adjust CPU limits and requests (see the sketches after this list)
  - Reschedule pods to other nodes if needed
  - Consider adding more nodes to the cluster

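Hypothetical command sketches for steps 1-3; the workload names and values are placeholders to adapt:

<code bash>
# 1. Restart a misbehaving pod (its controller recreates it).
kubectl delete pod <POD_NAME> -n <NAMESPACE>

# 2. Scale the workload horizontally.
kubectl scale deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --replicas=5

# 3. Apply or adjust CPU requests and limits.
kubectl set resources deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> \
  --requests=cpu=250m --limits=cpu=500m
</code>
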
If necessary, temporarily drain the node:

<code bash>
# Cordons the node and evicts all evictable pods; DaemonSet pods stay in place.
kubectl drain <NODE_NAME> --ignore-daemonsets
</code>

Restore scheduling after mitigation:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If CPU usage remains above the threshold for more than 15 minutes, notify the platform team
  * If production workloads are impacted, page the on-call engineer
  * If multiple nodes are affected, treat it as a capacity issue and escalate immediately

===== Related Alerts =====
  * NodeDown
  * NodeRebootedRecently
  * NodeNotReady

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → CPU Usage Dashboard