====== HighCPUUsage ======
===== Meaning =====
This alert is triggered when the average CPU usage on a node exceeds 85% for more than 5 minutes.
CPU usage is derived from node-exporter metrics as the percentage of CPU time spent in non-idle modes.
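The authoritative rule lives in the Prometheus configuration; as a reference, alerts of this kind are typically expressed along these lines (metric and label names assume a standard node-exporter deployment):
<code>
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
</code>
This is usually combined with a ''for: 5m'' clause so the alert only fires after the condition has held for five minutes.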
===== Impact =====
Sustained high CPU usage can degrade node and application performance.
Possible impacts include:
* Increased application latency
* Pod CPU throttling
* Slow scheduling and eviction decisions
* Potential node instability under prolonged load
This alert is a **warning** but may become critical if CPU usage remains high.
===== Diagnosis =====
Identify nodes with high CPU usage:
<code bash>
kubectl top nodes
</code>
Identify top CPU-consuming pods:
<code bash>
kubectl top pods -A --sort-by=cpu
</code>
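To narrow the list to pods scheduled on the affected node, a field selector can be used (''<node-name>'' is a placeholder for the node from the alert):
<code bash>
kubectl get pods -A --field-selector spec.nodeName=<node-name> -o wide
</code>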
Describe the affected node to check for pressure conditions:
<code bash>
kubectl describe node <node-name>
</code>
Check recent events related to resource pressure:
<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>
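If the event list is long, sorting by timestamp puts the most recent node events last; this uses only stock kubectl flags:
<code bash>
kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
</code>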
If SSH access is available, inspect CPU usage directly:
<code bash>
top
htop
mpstat -P ALL
</code>
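For a quick non-interactive snapshot of the heaviest processes, something like the following works on most Linux distributions (GNU ''ps'' syntax):
<code bash>
ps aux --sort=-%cpu | head -n 10
</code>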
===== Possible Causes =====
* Traffic spike or increased workload
* Application infinite loop or bug
* Pods without CPU limits (see the query after this list)
* Insufficient node CPU capacity
* Background system processes consuming CPU
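For the "pods without CPU limits" case, a rough sketch of a query that lists pods with at least one container missing a CPU limit; it assumes ''jq'' is installed and is illustrative rather than authoritative:
<code bash>
# Print namespace/name for every pod where some container has no CPU limit
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(any(.spec.containers[]; .resources.limits.cpu == null))
      | .metadata.namespace + "/" + .metadata.name'
</code>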
===== Mitigation =====
- Identify and restart misbehaving pods if safe
- Scale the workload horizontally if supported (example below)
- Apply or adjust CPU limits and requests (example below)
- Reschedule pods to other nodes if needed
- Consider adding more nodes to the cluster
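A minimal sketch of the scaling and resource-adjustment steps, assuming a Deployment named ''my-app'' (a placeholder for the actual workload):
<code bash>
# Scale out to spread load across more replicas
kubectl scale deployment my-app --replicas=5

# Set or adjust CPU requests and limits; this rewrites the pod
# template, so the Deployment performs a rolling restart
kubectl set resources deployment my-app --requests=cpu=250m --limits=cpu=1
</code>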
If necessary, temporarily drain the node so its workloads are evicted and rescheduled elsewhere:
<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>
Restore scheduling after mitigation:
<code bash>
kubectl uncordon <node-name>
</code>
===== Escalation =====
* If CPU usage remains above threshold for more than 15 minutes, notify the platform team
* If production workloads are impacted, page the on-call engineer
* If multiple nodes are affected, treat as a capacity issue and escalate immediately
===== Related Alerts =====
* NodeDown
* NodeRebootedRecently
* NodeNotReady
===== Related Dashboards =====
* Grafana → Node Overview
* Grafana → CPU Usage Dashboard