====== HighCPUUsage ======

===== Meaning =====
This alert is triggered when the average CPU usage on a node exceeds 85% for more than 5 minutes.
CPU usage is calculated using node-exporter metrics by excluding idle CPU time.

===== Impact =====
Sustained high CPU usage can degrade node and application performance.

Possible impacts include:
  * Increased application latency
  * Pod CPU throttling
  * Slow scheduling and eviction decisions
  * Potential node instability under prolonged load

This alert is a **warning** but may become critical if CPU usage remains high.

===== Diagnosis =====
Identify nodes with high CPU usage:

<code bash>
kubectl top nodes
</code>

Identify top CPU-consuming pods:

<code bash>
kubectl top pods -A --sort-by=cpu
</code>

Describe the affected node and check for pressure conditions:

<code bash>
kubectl describe node <node-name>
</code>

Check recent events related to resource pressure:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

If SSH access is available, inspect CPU usage directly:

<code bash>
top
htop
mpstat -P ALL
</code>

===== Possible Causes =====
  * Traffic spike or increased workload
  * Application infinite loop or bug
  * Pods without CPU limits (see the sketch after this list)
  * Insufficient node CPU capacity
  * Background system processes consuming CPU

===== Mitigation =====
  - Identify and restart misbehaving pods if safe
  - Scale the workload horizontally if supported (see the examples after this list)
  - Apply or adjust CPU limits and requests (see the examples after this list)
  - Reschedule pods to other nodes if needed
  - Consider adding more nodes to the cluster

If necessary, temporarily drain the node:

<code bash>
kubectl drain <node-name>
</code>

Restore scheduling after mitigation:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====
  * If CPU usage remains above the threshold for more than 15 minutes, notify the platform team
  * If production workloads are impacted, page the on-call engineer
  * If multiple nodes are affected, treat this as a capacity issue and escalate immediately

===== Related Alerts =====
  * NodeDown
  * NodeRebootedRecently
  * NodeNotReady

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → CPU Usage Dashboard
