KubernetesNodeNotReady

Meaning

This alert is triggered when a Kubernetes node remains in the NotReady state for more than 10 minutes. A NotReady node cannot reliably run or manage pods.

Impact

A node in NotReady state can cause:

This alert is marked critical because prolonged node unavailability threatens cluster stability.

Diagnosis

Check node status:

kubectl get nodes
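
On a large cluster it can help to reduce that listing to the affected nodes only. The following is a minimal sketch, not part of the original runbook: the function name `not_ready_nodes` is hypothetical, and it assumes the default column layout of `kubectl get nodes --no-headers` (node name first, status second).

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the names of nodes whose STATUS starts with NotReady
# (this also matches "NotReady,SchedulingDisabled").
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Keeping the filter separate from the kubectl call means it can be checked against saved output without touching the cluster.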

Inspect node conditions and events:

kubectl describe node <NODE_NAME>
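
If only the Ready condition is of interest, it can be read directly instead of scanning the full describe output. This is a sketch under the assumption that a JSONPath query is acceptable here; the function name `node_ready_status` is hypothetical.

```shell
# Hypothetical helper: prints the status of the node's Ready condition
# ("True", "False", or "Unknown").
node_ready_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Usage (requires cluster access):
#   node_ready_status <NODE_NAME>
```

A value of "Unknown" usually suggests the control plane has stopped hearing from the kubelet, while "False" suggests the kubelet is reporting itself unhealthy.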

Check recent cluster-wide events:

kubectl get events --sort-by=.lastTimestamp
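
The cluster-wide listing can be noisy. As a possible refinement, events can be narrowed to the affected node with a field selector. The function name `node_events` below is hypothetical, and the sketch assumes node events should be searched across all namespaces.

```shell
# Hypothetical helper: lists events whose involved object is the given node,
# sorted by last timestamp (oldest first).
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Usage (requires cluster access):
#   node_events <NODE_NAME>
```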

Verify kubelet status on the node (if SSH access is available):

systemctl status kubelet
journalctl -u kubelet -n 100

Check system resource pressure (run df and free on the node itself):

kubectl top node <NODE_NAME>
df -h
free -m
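
To spot full filesystems quickly, the df output can be filtered by usage percentage. This is a minimal sketch: the function name `disks_over_threshold` is hypothetical, the default of 85 percent is an illustrative value rather than a kubelet setting, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints every mount
# point whose usage is at or above the given percentage (default 85).
disks_over_threshold() {
  awk -v limit="${1:-85}" 'NR > 1 {
    used = $5
    sub(/%/, "", used)          # "91%" -> "91"
    if (used + 0 >= limit + 0) print $6, $5
  }'
}

# Usage (run on the node):
#   df -P | disks_over_threshold 90
```

The `-P` flag is used in the usage line because it keeps each filesystem on a single line, which makes column-based parsing more predictable.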

Possible Causes

Mitigation

  1. Restart kubelet service:
systemctl restart kubelet
  2. Resolve resource pressure (disk cleanup, memory leaks)
  3. Verify networking and DNS configuration
  4. Reboot the node if necessary
  5. If node cannot recover, drain and replace it

Drain node safely:

kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data

After recovery:

kubectl uncordon <NODE_NAME>
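
Before uncordoning, it may be worth confirming that the node actually reports Ready again. The sketch below combines both steps; the function name `uncordon_when_ready` and the 300s default timeout are assumptions for illustration, not values from this runbook.

```shell
# Hypothetical helper: blocks until the node reports Ready (or the timeout
# expires), and only then makes it schedulable again.
uncordon_when_ready() {
  node="$1"
  timeout="${2:-300s}"   # illustrative default
  if kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"; then
    kubectl uncordon "$node"
  else
    echo "node $node did not become Ready within $timeout" >&2
    return 1
  fi
}

# Usage (requires cluster access):
#   uncordon_when_ready <NODE_NAME> 600s
```

Gating the uncordon on `kubectl wait` avoids scheduling new pods onto a node that is still flapping.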

Escalation