KubernetesNodeNotReady
Meaning
This alert is triggered when a Kubernetes node remains in the NotReady state for more than 10 minutes. A NotReady node cannot reliably run or manage pods.
Impact
A node in NotReady state can cause:
- Pods being evicted or stuck in Pending state
- Reduced cluster capacity
- Application downtime if replicas are insufficient
- Scheduling failures for new workloads
This alert is marked critical because prolonged node unavailability threatens cluster stability.
Diagnosis
Check node status:
kubectl get nodes
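To list only the affected nodes, a quick filter such as the following can be used (assumes the default kubectl get nodes output, where the STATUS column shows NotReady):
kubectl get nodes | grep -i notready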
Inspect node conditions and events:
kubectl describe node <NODE_NAME>
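To print just the node's status conditions (Ready, MemoryPressure, DiskPressure, PIDPressure) without the full describe output, a jsonpath query like this is a useful sketch:
kubectl get node <NODE_NAME> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'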
Check recent cluster-wide events:
kubectl get events --sort-by=.lastTimestamp
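To narrow the event stream to the affected node (assuming the node name matches the event's involvedObject):
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME> --sort-by=.lastTimestamp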
Verify kubelet status on the node (if SSH access is available):
systemctl status kubelet
journalctl -u kubelet -n 100
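If the kubelet itself looks healthy, the container runtime may be the problem. For example, on a containerd-based node (adjust the unit name to whichever runtime the node actually uses):
systemctl status containerd
journalctl -u containerd -n 100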
Check system resource pressure:
kubectl top node <NODE_NAME>
df -h
free -m
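It can also help to check which taints the node controller has applied; a NotReady node typically carries node.kubernetes.io/not-ready or node.kubernetes.io/unreachable (sketch, output formatting may vary):
kubectl get node <NODE_NAME> -o jsonpath='{range .spec.taints[*]}{.key}{"="}{.effect}{"\n"}{end}'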
Possible Causes
- Kubelet service stopped or unhealthy
- Node lost network connectivity
- Disk, memory, or CPU pressure
- Kernel panic or OS-level issues
- Cloud provider instance failure or maintenance
Mitigation
- Restart kubelet service:
systemctl restart kubelet
- Resolve resource pressure (free up disk space, address memory leaks)
- Verify networking and DNS configuration
- Reboot the node if necessary
- If node cannot recover, drain and replace it
Drain node safely:
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
After recovery:
kubectl uncordon <NODE_NAME>
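To confirm the node has fully recovered and is accepting workloads again (a simple verification; <NODE_NAME> is a placeholder):
kubectl get node <NODE_NAME>
kubectl get pods -A -o wide --field-selector spec.nodeName=<NODE_NAME>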
Escalation
- If node remains NotReady after mitigation, escalate to the infrastructure team
- If multiple nodes are affected, treat as a cluster-level incident
- Page on-call engineer if production workloads are impacted
Related Alerts
- KubeletDown
- NodeDown
- HighDiskIOWait
- HighCPUUsage
- HighMemoryUsage
Related Dashboards
- Grafana → Kubernetes / Nodes
- Grafana → Node Exporter Full
