====== KubeNodeNotReady ======

===== Meaning =====

This alert is triggered when a Kubernetes node reports a `NotReady` status for more than 2 minutes. A node in `NotReady` state cannot reliably run or manage pods.

===== Impact =====

This alert indicates a **node-level availability issue**.

Possible impacts include:

  * Pods on the node may be evicted or rescheduled
  * Reduced cluster capacity
  * Increased load on remaining nodes
  * Application performance degradation or partial outages

This alert is a **warning**, but may become critical if the condition persists or affects multiple nodes.

===== Diagnosis =====

Check node status:

  kubectl get nodes

Describe the affected node to inspect conditions and events:

  kubectl describe node {{ $labels.node }}

Check recent node-related events:

  kubectl get events --field-selector involvedObject.kind=Node

Verify kubelet health on the node (if SSH access is available):

  systemctl status kubelet
  journalctl -u kubelet --since "15 min ago"

Check node resource pressure:

  kubectl describe node {{ $labels.node }} | grep -i pressure

===== Possible Causes =====

  * Kubelet process stopped or unhealthy
  * Network connectivity issues
  * Disk, memory, or PID pressure on the node
  * Node reboot or hardware failure
  * Cloud provider instance issue

===== Mitigation =====

  - Restart the kubelet service if it is not running
  - Resolve disk, memory, or PID pressure conditions
  - Restore network connectivity
  - Reboot the node if required and safe
  - If the node is unstable, drain it for investigation:

  kubectl drain {{ $labels.node }} --ignore-daemonsets

After the node becomes healthy:

  kubectl uncordon {{ $labels.node }}

===== Escalation =====

  * If the node remains NotReady for more than 10 minutes, escalate to the platform team
  * If multiple nodes are affected, treat as a cluster-level incident
  * If production workloads are impacted, page the on-call engineer

===== Related Alerts =====

  * NodeDown
  * KubeletDown
  * HighDiskUsage
  * HighMemoryUsage

===== Related Dashboards =====

  * Grafana → Kubernetes / Nodes
  * Grafana → Node Health Overview
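
The escalation rules above depend on whether one node or several are affected. As a rough aid for that check, the sketch below filters `kubectl get nodes --no-headers` output down to the names of nodes whose STATUS column starts with `NotReady`. The function name `not_ready_nodes` is hypothetical and not part of kubectl, and the filter assumes the default column layout (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name of every node whose STATUS column begins with NotReady
# (this also matches combined values such as NotReady,SchedulingDisabled).
# Assumes the default column order: NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}
```

A possible use is `kubectl get nodes --no-headers | not_ready_nodes`, optionally piped to `wc -l` to count the affected nodes before deciding whether to treat the alert as a cluster-level incident.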