KubernetesNodeNotReady

Meaning

This alert is triggered when a Kubernetes node remains in the NotReady state for more than 10 minutes. A NotReady node cannot reliably run or manage pods.

Impact

A node in NotReady state can cause:

Pods being evicted or stuck in Pending state
Reduced cluster capacity
Application downtime if replicas are insufficient
Scheduling failures for new workloads

This alert is marked critical because prolonged node unavailability threatens cluster stability.

Diagnosis

Check node status:

kubectl get nodes
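On a large cluster it can help to reduce that listing to just the affected nodes. The following is a minimal sketch, not a standard tool: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes --no-headers, where the second column is STATUS.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS starts with "NotReady"
# (this also matches "NotReady,SchedulingDisabled").
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Usage (assumes kubectl is configured for the affected cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

More than one name in the output suggests the problem is not isolated to a single node.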

Inspect node conditions and events:

kubectl describe node <NODE_NAME>

Check recent cluster-wide events:

kubectl get events -A --sort-by=.lastTimestamp

Verify kubelet status on the node (if SSH access is available):

systemctl status kubelet
journalctl -u kubelet -n 100
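The last 100 log lines are often dominated by informational messages. As a rough sketch, the error-level lines can be filtered out, assuming the kubelet uses its default klog text format, where each line starts with a severity letter followed by the month and day (for example E0102). The function name kubelet_errors is hypothetical, and the filter will not match if the kubelet is configured for JSON logging.

```shell
# Hypothetical helper: keep only klog error-level lines (prefix "Emmdd hh:mm:ss")
# from kubelet log messages supplied on stdin.
kubelet_errors() {
  grep -E '^E[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'
}

# Usage on the node; -o cat prints only the message text so the klog prefix
# is at the start of each line:
#   journalctl -u kubelet --since "30 min ago" -o cat --no-pager | kubelet_errors
```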

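Check system resource pressure. From your workstation (requires metrics-server, and may show no data while the node is NotReady):

kubectl top node <NODE_NAME>

On the node itself:

df -h
free -m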
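To spot full filesystems quickly, the disk check can be reduced to the mounts above a usage threshold. This is a sketch under stated assumptions: the function name disks_over is hypothetical, the 85 percent figure in the usage line is an arbitrary illustration rather than a kubelet default, and the parsing assumes POSIX df -P output with mount points that contain no spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above the given percentage.
disks_over() {
  threshold="$1"
  awk -v t="$threshold" '
    NR > 1 {                 # skip the header row
      use = $5               # Capacity column, e.g. "92%"
      sub(/%/, "", use)
      if (use + 0 >= t) print $6, $5
    }'
}

# Usage on the node:
#   df -P | disks_over 85
```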

Possible Causes

Kubelet service stopped or unhealthy
Node lost network connectivity
Disk, memory, or CPU pressure
Kernel panic or OS-level issues
Cloud provider instance failure or maintenance

Mitigation

Restart the kubelet service on the node:

systemctl restart kubelet

Resolve resource pressure (free disk space, address memory leaks)
Verify networking and DNS configuration
Reboot the node if necessary
If the node cannot recover, drain and replace it

Drain the node safely before replacing it:

kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data

After the node recovers, make it schedulable again:

kubectl uncordon <NODE_NAME>
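Uncordoning too early sends new pods to a node that is still unhealthy. One way to guard against that is to wait for the Ready condition first. The sketch below is illustrative only: the function name uncordon_when_ready and the default 300s timeout are assumptions, not part of this runbook's tooling.

```shell
# Hypothetical helper: uncordon a node only after it reports Ready.
# $1 = node name, $2 = optional timeout for kubectl wait (default 300s, an assumption).
uncordon_when_ready() {
  node="$1"
  timeout="${2:-300s}"
  if kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"; then
    kubectl uncordon "$node"
  else
    echo "node $node did not become Ready within $timeout; leaving it cordoned" >&2
    return 1
  fi
}

# Usage:
#   uncordon_when_ready <NODE_NAME> 600s
```

If the wait times out, the node stays cordoned and the function returns a non-zero status, which is a reasonable point to move on to escalation.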

Escalation

If the node remains NotReady after mitigation, escalate to the infrastructure team
If multiple nodes are affected, treat it as a cluster-level incident
Page the on-call engineer if production workloads are impacted

Related Alerts

KubeletDown
NodeDown
HighDiskIOWait
HighCPUUsage
HighMemoryUsage

Related Dashboards

Grafana → Kubernetes / Nodes
Grafana → Node Exporter Full
