====== KubernetesNodeNotReady ======

===== Meaning =====
This alert is triggered when a Kubernetes node remains in the **NotReady** state for more than 10 minutes.
A NotReady node cannot reliably run or manage pods.

===== Impact =====
A node in the NotReady state can cause:
  * Pods being evicted or stuck in the Pending state
  * Reduced cluster capacity
  * Application downtime if replica counts are insufficient
  * Scheduling failures for new workloads

This alert is marked **critical** because prolonged node unavailability threatens cluster stability.
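
To gauge the blast radius, one quick check is counting pods stuck in Pending. A minimal sketch, with sample output standing in for a live ''kubectl get pods -A --no-headers'' (namespace and pod names are illustrative):

<code bash>
# Sample lines stand in for: kubectl get pods -A --no-headers
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
sample='default   web-1   0/1   Pending   0   5m
default   web-2   1/1   Running   0   9d
kube-system   dns-1   1/1   Running   0   30d'

# Print namespace/name of every Pending pod ($4 is STATUS).
printf '%s\n' "$sample" | awk '$4 == "Pending" {print $1 "/" $2}'
</code>

On a live cluster, pipe the real command's output into the same awk filter.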

===== Diagnosis =====
Check node status:

<code bash>
kubectl get nodes
</code>
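
To pick out only the unhealthy nodes from that listing, a small filter helps. A sketch, assuming the default ''kubectl get nodes'' column order (sample output shown; node names are illustrative):

<code bash>
# Sample lines stand in for: kubectl get nodes --no-headers
# Columns: NAME STATUS ROLES AGE VERSION
sample='node-a   Ready      control-plane   30d   v1.29.0
node-b   NotReady   worker          30d   v1.29.0'

# Show name and status of every node whose STATUS ($2) is not "Ready".
printf '%s\n' "$sample" | awk '$2 != "Ready" {print $1, $2}'
</code>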

Inspect node conditions and events:

<code bash>
kubectl describe node <node-name>
</code>

Check recent cluster-wide events:

<code bash>
kubectl get events --sort-by=.lastTimestamp
</code>

Verify kubelet status on the node (if SSH access is available):

<code bash>
systemctl status kubelet
journalctl -u kubelet -n 100
</code>
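
Kubelet logs can be noisy; klog prefixes error-level lines with ''E''. A sketch that keeps only those lines, with sample journal output standing in for the real ''journalctl'' command:

<code bash>
# Sample lines stand in for: journalctl -u kubelet -n 100
sample='Dec 14 06:40:01 node-b kubelet[812]: I1214 06:40:01.100 syncing pods
Dec 14 06:40:02 node-b kubelet[812]: E1214 06:40:02.200 "Failed to update node status"'

# Keep only klog error lines (E followed by the MMDD date).
printf '%s\n' "$sample" | grep -E 'kubelet\[[0-9]+\]: E[0-9]{4}'
</code>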

Check system resource pressure:

<code bash>
kubectl top node <node-name>
df -h
free -m
</code>
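
Near-full filesystems are a common trigger for disk pressure. A sketch that flags filesystems above 85% usage, assuming ''df -P''-style columns (sample lines shown):

<code bash>
# Sample lines stand in for: df -P
# Columns: FILESYSTEM BLOCKS USED AVAILABLE USE% MOUNTED-ON
sample='/dev/sda1   104857600   94371840   10485760   90%   /
tmpfs       16777216    1048576    15728640    7%   /tmp'

# Strip the % sign from the USE% column, then flag anything above 85%.
printf '%s\n' "$sample" | awk '{gsub(/%/, "", $5)} $5 + 0 > 85 {print $1, $5 "%"}'
</code>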

===== Possible Causes =====
  * Kubelet service stopped or unhealthy
  * Node lost network connectivity
  * Disk, memory, or CPU pressure
  * Kernel panic or other OS-level issues
  * Cloud provider instance failure or maintenance

===== Mitigation =====
  - Restart the kubelet service:
<code bash>
systemctl restart kubelet
</code>
  - Resolve resource pressure (disk cleanup, memory leaks)
  - Verify networking and DNS configuration
  - Reboot the node if necessary
  - If the node cannot recover, drain and replace it

Drain the node safely:

<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>

After recovery:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====
  * If the node remains NotReady after mitigation, escalate to the infrastructure team
  * If multiple nodes are affected, treat it as a cluster-level incident
  * Page the on-call engineer if production workloads are impacted
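
To judge single-node fix versus cluster-level incident, count the NotReady nodes. A sketch with sample output standing in for ''kubectl get nodes --no-headers'':

<code bash>
# Sample lines stand in for: kubectl get nodes --no-headers
sample='node-a   Ready      control-plane   30d   v1.29.0
node-b   NotReady   worker          30d   v1.29.0
node-c   NotReady   worker          30d   v1.29.0'

# Count nodes whose STATUS ($2) is NotReady; more than one suggests
# a cluster-level incident rather than an isolated node failure.
printf '%s\n' "$sample" | awk '$2 == "NotReady" {n++} END {print n + 0}'
</code>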

===== Related Alerts =====
  * KubeletDown
  * NodeDown
  * HighDiskIOWait
  * HighCPUUsage
  * HighMemoryUsage

===== Related Dashboards =====
  * Grafana → Kubernetes / Nodes
  * Grafana → Node Exporter Full
runbooks/coustom_alerts/kubernetesnodenotready.txt · Last modified: by admin
