====== KubernetesNodeNotReady ======

===== Meaning =====
This alert is triggered when a Kubernetes node remains in the **NotReady** state for more than 10 minutes.
A NotReady node cannot reliably run or manage pods.
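
If this alert is backed by Prometheus and kube-state-metrics (an assumption; adapt to the monitoring stack in use), the condition it fires on can be spot-checked by hand, with the 10-minute delay typically coming from the rule's for: clause. The Prometheus URL below is a placeholder:

<code bash>
# Hypothetical spot check: nodes whose Ready condition is not "true", per kube-state-metrics.
# PROM_URL is a placeholder; the metric name assumes kube-state-metrics is scraped.
PROM_URL="http://prometheus.example.internal:9090"
curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=kube_node_status_condition{condition="Ready",status="true"} == 0'
</code>
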
===== Impact =====
A node in NotReady state can cause:
  * Pods being evicted or stuck in Pending state
  * Reduced cluster capacity
  * Application downtime if replicas are insufficient
  * Scheduling failures for new workloads

This alert is marked **critical** because prolonged node unavailability threatens cluster stability.

===== Diagnosis =====
Check node status:

<code bash>
kubectl get nodes
</code>
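
On a large cluster it can help to print only nodes whose STATUS is not plain Ready (note that cordoned nodes also appear, as Ready,SchedulingDisabled):

<code bash>
# List nodes that are not reporting a plain "Ready" status.
kubectl get nodes --no-headers | awk '$2 != "Ready"'
</code>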

Inspect node conditions and events:

<code bash>
kubectl describe node <NODE_NAME>
</code>
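
To pull just the node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure) without the full describe output:

<code bash>
# Print each condition's type, status, and reason for the node.
kubectl get node <NODE_NAME> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
</code>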

Check recent cluster-wide events:

<code bash>
kubectl get events --sort-by=.lastTimestamp
</code>
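
To narrow the output to events about the affected node only:

<code bash>
# Field selectors limit the listing to events referencing the Node object itself.
kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME> \
  --sort-by=.lastTimestamp
</code>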

Verify kubelet status on the node (if SSH access is available):

<code bash>
systemctl status kubelet
journalctl -u kubelet -n 100
</code>
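
If the unit is running but the node still flaps, error-level kubelet logs are usually more telling than the last 100 lines:

<code bash>
# Only error-priority kubelet messages from the last hour.
journalctl -u kubelet -p err --since "1 hour ago"
</code>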

Check system resource pressure:

<code bash>
kubectl top node <NODE_NAME>
df -h
free -m
</code>
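
If disk or memory pressure is suspected, a few generic host-level checks (run on the node itself) help pinpoint the culprit:

<code bash>
df -i                            # inode exhaustion also makes a node unhealthy
ps aux --sort=-%mem | head -15   # largest memory consumers
du -xh /var --max-depth=2 2>/dev/null | sort -h | tail -15   # biggest directories under /var
</code>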

===== Possible Causes =====
  * Kubelet service stopped or unhealthy
  * Node lost network connectivity
  * Disk, memory, or CPU pressure
  * Kernel panic or OS-level issues
  * Cloud provider instance failure or maintenance

===== Mitigation =====
  - Restart kubelet service:
<code bash>
systemctl restart kubelet
</code>

  - Resolve resource pressure (free disk space, investigate memory leaks); see the cleanup sketch after this list
  - Verify networking and DNS configuration (see the connectivity checks after this list)
  - Reboot the node if necessary
  - If the node cannot recover, drain and replace it
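
A rough sketch of the cleanup and connectivity checks above, run on the node itself. The crictl commands assume a CRI runtime such as containerd, and <API_SERVER_ENDPOINT> is a placeholder for the control plane address from the node's kubeconfig:

<code bash>
# --- Disk cleanup ---
crictl rmi --prune            # remove unused container images (assumes crictl is installed)
journalctl --vacuum-time=3d   # trim old journal logs
df -h /var                    # confirm space was actually reclaimed

# --- Networking and DNS ---
curl -sk https://<API_SERVER_ENDPOINT>:6443/healthz; echo   # kubelet must reach the API server
cat /etc/resolv.conf                                        # resolver the node (and kubelet) uses
ip route show default                                       # default route still present?
</code>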

Drain node safely:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
</code>
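
To see which workloads are (or were) scheduled on the node, which is useful both before draining and when confirming that eviction completed:

<code bash>
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<NODE_NAME>
</code>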

Once the node reports Ready again, allow scheduling on it:

<code bash>
kubectl uncordon <NODE_NAME>
</code>
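
To confirm the node actually returns to service:

<code bash>
# Blocks until the node reports Ready (or the timeout expires).
kubectl wait --for=condition=Ready node/<NODE_NAME> --timeout=10m
kubectl get node <NODE_NAME>   # STATUS should read Ready
</code>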

===== Escalation =====
  * If the node remains NotReady after mitigation, escalate to the infrastructure team
  * If multiple nodes are affected, treat it as a cluster-level incident (see the quick check below)
  * Page the on-call engineer if production workloads are impacted
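
A quick way to judge whether the problem is isolated or cluster-wide is to summarise node statuses:

<code bash>
# Count nodes per STATUS value; more than one non-Ready entry suggests a cluster-level issue.
kubectl get nodes --no-headers | awk '{print $2}' | sort | uniq -c
</code>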

===== Related Alerts =====
  * KubeletDown
  * NodeDown
  * HighDiskIOWait
  * HighCPUUsage
  * HighMemoryUsage

===== Related Dashboards =====
  * Grafana → Kubernetes / Nodes
  * Grafana → Node Exporter Full