====== KubeNodeNotReady ======

===== Meaning =====
This alert fires when a Kubernetes node has reported a `NotReady` status for more than 2 minutes.
A node in the `NotReady` state cannot reliably run or manage pods.

===== Impact =====
This alert indicates a **node-level availability issue**.

Possible impacts include:
  * Pods on the node may be evicted or rescheduled
  * Reduced cluster capacity
  * Increased load on the remaining nodes
  * Application performance degradation or partial outages

This alert is a **warning**; escalate it if the condition persists or if multiple nodes are affected (see Escalation below).

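Because pod eviction is the main blast radius, it helps to know what is running on the affected node. The sketch below filters captured sample output of `kubectl get pods -A -o wide` by the NODE column (all pod and node names are hypothetical); on a live cluster, `kubectl get pods -A -o wide --field-selector spec.nodeName={{ $labels.node }}` gives the same list directly.

<code bash>
# Captured sample of `kubectl get pods -A -o wide` (hypothetical pods/nodes).
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
pods='default      web-1      1/1   Running   0   3d   10.0.0.4   node-b
default      web-2      1/1   Running   0   3d   10.0.0.5   node-a
kube-system  coredns-1  1/1   Running   0   9d   10.0.0.6   node-b'

# Print namespace/name of every pod scheduled on the affected node
printf '%s\n' "$pods" | awk -v node="node-b" '$8 == node {print $1 "/" $2}'
</code>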
===== Diagnosis =====
Check node status:

<code bash>
kubectl get nodes
</code>
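
On a large cluster it is easier to filter this output than to scan the STATUS column by eye. A minimal sketch, run here against captured sample output (node names are hypothetical):

<code bash>
# Captured sample of `kubectl get nodes --no-headers` (hypothetical nodes).
# On a live cluster: nodes="$(kubectl get nodes --no-headers)"
nodes='node-a   Ready      worker   12d   v1.29.1
node-b   NotReady   worker   12d   v1.29.1
node-c   Ready      worker   12d   v1.29.1'

# Print every node whose STATUS contains NotReady
printf '%s\n' "$nodes" | awk '$2 ~ /NotReady/ {print $1}'
</code>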

Describe the affected node to inspect conditions and events:

<code bash>
kubectl describe node {{ $labels.node }}
</code>

Check recent node-related events:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

Verify kubelet health on the node (if SSH access is available):

<code bash>
systemctl status kubelet
journalctl -u kubelet --since "15 min ago"
</code>
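
In the kubelet journal, error-level klog lines (message field starting with `E`) are usually the fastest signal. A sketch against captured sample log lines (content is illustrative, not from a real incident):

<code bash>
# Captured sample of `journalctl -u kubelet` output (illustrative lines only).
log='Dec 14 06:40:01 node-b kubelet[812]: I1214 06:40:01.100000     812 kubelet.go:1500 "Syncing pods"
Dec 14 06:40:12 node-b kubelet[812]: E1214 06:40:12.200000     812 kubelet.go:2855 "Container runtime not ready"
Dec 14 06:40:30 node-b kubelet[812]: W1214 06:40:30.300000     812 reflector.go:324 "Watch ended"'

# Keep only error-level lines
printf '%s\n' "$log" | grep -E 'kubelet\[[0-9]+\]: E'
</code>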

Check node resource pressure:

<code bash>
kubectl describe node {{ $labels.node }} | grep -i pressure
</code>

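The `grep -i pressure` above pulls the condition rows out of the describe output. Against a captured sample conditions table (values are illustrative), a node under memory pressure is flagged like this:

<code bash>
# Captured sample condition rows from `kubectl describe node` (illustrative).
# Columns: Type Status Reason Message
conditions='MemoryPressure   True    KubeletHasInsufficientMemory   kubelet has insufficient memory available
DiskPressure     False   KubeletHasNoDiskPressure       kubelet has no disk pressure
PIDPressure      False   KubeletHasSufficientPID        kubelet has sufficient PID available'

# Any *Pressure condition with Status "True" needs attention
printf '%s\n' "$conditions" | grep -i pressure | awk '$2 == "True" {print $1}'
</code>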
===== Possible Causes =====
  * Kubelet process stopped or unhealthy
  * Network connectivity issues
  * Disk, memory, or PID pressure on the node
  * Node reboot or hardware failure
  * Cloud provider instance issue

===== Mitigation =====
  - Restart the kubelet service if it is not running
  - Resolve disk, memory, or PID pressure conditions
  - Restore network connectivity
  - Reboot the node if required and safe
  - If the node is unstable, drain it for investigation:

<code bash>
kubectl drain {{ $labels.node }} --ignore-daemonsets
</code>

After the node becomes healthy:

<code bash>
kubectl uncordon {{ $labels.node }}
</code>

===== Escalation =====
  * If the node remains NotReady for more than 10 minutes, escalate to the platform team
  * If multiple nodes are affected, treat it as a cluster-level incident
  * If production workloads are impacted, page the on-call engineer

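The 10-minute threshold can be checked against the `lastTransitionTime` of the node's `Ready` condition. A sketch with a hardcoded sample timestamp (assumes GNU `date`); on a live cluster the timestamp can be fetched with `kubectl get node {{ $labels.node }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'`:

<code bash>
# Sample lastTransitionTime of the Ready condition (hardcoded for illustration)
transition='2025-12-14T06:30:00Z'
now='2025-12-14T06:49:00Z'   # live: now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Minutes since the node last changed Ready state (GNU date)
elapsed=$(( ( $(date -ud "$now" +%s) - $(date -ud "$transition" +%s) ) / 60 ))
echo "NotReady for ${elapsed} minutes"
if [ "$elapsed" -gt 10 ]; then
  echo "escalate to the platform team"
fi
</code>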
===== Related Alerts =====
  * NodeDown
  * KubeletDown
  * HighDiskUsage
  * HighMemoryUsage

===== Related Dashboards =====
  * Grafana → Kubernetes / Nodes
  * Grafana → Node Health Overview

runbooks/coustom_alerts/kubenodenotready.txt · Last modified: by admin
