====== KubeNodeNotReady ======
===== Meaning =====
This alert fires when a Kubernetes node has reported a `NotReady` status for more than 2 minutes.
A node in `NotReady` state cannot reliably run or manage pods.
===== Impact =====
This alert indicates a **node-level availability issue**.
Possible impacts include:
* Pods on the node may be evicted or rescheduled
* Reduced cluster capacity
* Increased load on remaining nodes
* Application performance degradation or partial outages
This alert has **warning** severity, but it should be treated as critical if the condition persists or affects multiple nodes.
===== Diagnosis =====
Check node status:
  kubectl get nodes
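On a large cluster it can help to narrow that listing to the affected nodes. The following is a minimal sketch, not part of the alert definition: the function names `filter_not_ready` and `list_not_ready_nodes` are hypothetical, and the filter assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
# Print the names of nodes whose STATUS column includes NotReady.
# Reads the output of `kubectl get nodes --no-headers` on stdin.
# STATUS can be a comma-separated list (for example
# NotReady,SchedulingDisabled), so each element is compared exactly
# instead of matching "NotReady" as a substring.
filter_not_ready() {
  awk '{
    n = split($2, parts, ",")
    for (i = 1; i <= n; i++) {
      if (parts[i] == "NotReady") { print $1; break }
    }
  }'
}

# Hypothetical convenience wrapper that queries the current cluster.
list_not_ready_nodes() {
  kubectl get nodes --no-headers | filter_not_ready
}
```

Running `list_not_ready_nodes` prints one node name per line, or nothing when every node is `Ready`.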
Describe the affected node to inspect conditions and events:
  kubectl describe node {{ $labels.node }}
Check recent node-related events:
  kubectl get events --field-selector involvedObject.kind=Node
Verify kubelet health on the node (if SSH access is available):
  systemctl status kubelet
  journalctl -u kubelet --since "15 min ago"
Check node resource pressure:
  kubectl describe node {{ $labels.node }} | grep -i pressure
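As an alternative to grepping the `describe` output, the node's conditions can be read in a structured form. This is a sketch under the assumption that a healthy node reports `Ready=True` and `False` for every other condition (such as `MemoryPressure`, `DiskPressure`, and `PIDPressure`). The function names `node_conditions` and `abnormal_conditions` are hypothetical.

```shell
# Print each condition of the given node as TYPE=STATUS, one per line.
#   $1 = node name
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Read TYPE=STATUS lines on stdin and print only the abnormal ones.
# Assumption: Ready should be True and all other conditions False.
abnormal_conditions() {
  awk -F= 'NF && (($1 == "Ready" && $2 != "True") ||
                  ($1 != "Ready" && $2 != "False"))'
}
```

For example, `node_conditions worker-1 | abnormal_conditions` (with `worker-1` standing in for the affected node) prints nothing for a healthy node and lists entries such as `DiskPressure=True` or `Ready=Unknown` otherwise.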
===== Possible Causes =====
* Kubelet process stopped or unhealthy
* Network connectivity issues
* Disk, memory, or PID pressure on the node
* Node reboot or hardware failure
* Cloud provider instance issue
===== Mitigation =====
- Restart the kubelet service if it is not running
- Resolve disk, memory, or PID pressure conditions
- Restore network connectivity
- Reboot the node if required and safe
- If the node is unstable, drain it for investigation:
  kubectl drain {{ $labels.node }} --ignore-daemonsets
After the node becomes healthy:
  kubectl uncordon {{ $labels.node }}
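The drain and uncordon steps above can be wrapped so that the drain can be previewed first and the uncordon only happens once the node reports `Ready`. This is a hedged sketch: the helper names `drain_node` and `uncordon_when_ready` are hypothetical, and the 300-second timeouts are illustrative values, not a recommendation from this runbook.

```shell
# Drain a node for investigation.
#   $1    = node name
#   $2... = optional extra kubectl flags, e.g. --dry-run=client to preview
drain_node() {
  node_name=$1
  shift
  kubectl drain "$node_name" --ignore-daemonsets --timeout=300s "$@"
}

# Wait until the node reports the Ready condition, then uncordon it.
# If the wait times out, the node stays cordoned.
#   $1 = node name
uncordon_when_ready() {
  kubectl wait --for=condition=Ready "node/$1" --timeout=300s &&
    kubectl uncordon "$1"
}
```

A cautious sequence would be `drain_node worker-1 --dry-run=client` to see which pods would be evicted, then `drain_node worker-1`, and finally `uncordon_when_ready worker-1` after the fix (again with `worker-1` as a stand-in node name). Pods that use `emptyDir` volumes may block the drain until `--delete-emptydir-data` is passed explicitly.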
===== Escalation =====
* If the node remains `NotReady` for more than 10 minutes, escalate to the platform team
* If multiple nodes are affected, treat as a cluster-level incident
* If production workloads are impacted, page the on-call engineer
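The three rules above are independent, so more than one can apply at once. As an illustration only, they could be encoded as follows. The function name `escalation_steps` and its argument convention are hypothetical, and the thresholds are copied from the list above.

```shell
# Print every escalation step that applies to the current situation.
#   $1 = minutes the node has been NotReady
#   $2 = number of affected nodes
#   $3 = "yes" if production workloads are impacted
escalation_steps() {
  [ "$1" -gt 10 ] && echo "escalate to platform team"
  [ "$2" -gt 1 ] && echo "treat as cluster-level incident"
  [ "$3" = "yes" ] && echo "page on-call engineer"
  return 0
}
```

For instance, `escalation_steps 12 1 no` prints only the platform-team step, while `escalation_steps 5 3 yes` prints the cluster-level and on-call steps.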
===== Related Alerts =====
* NodeDown
* KubeletDown
* HighDiskUsage
* HighMemoryUsage
===== Related Dashboards =====
* Grafana → Kubernetes / Nodes
* Grafana → Node Health Overview