====== KubernetesNodeNotReady ======

===== Meaning =====
This alert is triggered when a Kubernetes node remains in the **NotReady** state for more than 10 minutes.
A NotReady node cannot reliably run or manage pods.

===== Impact =====
A node in the NotReady state can cause:
  * Pods being evicted or stuck in the Pending state
  * Reduced cluster capacity
  * Application downtime if replica counts are insufficient
  * Scheduling failures for new workloads

This alert is marked **critical** because prolonged node unavailability threatens cluster stability.
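
To gauge the blast radius, one quick check is counting pods stuck in Pending. A minimal sketch, with sample output standing in for a live ''kubectl get pods -A --no-headers'' (namespace and pod names are illustrative):

<code bash>
# Sample lines stand in for: kubectl get pods -A --no-headers
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
sample='default   web-1   0/1   Pending   0   5m
default   web-2   1/1   Running   0   9d
kube-system   dns-1   1/1   Running   0   30d'

# Print namespace/name of every Pending pod ($4 is STATUS).
printf '%s\n' "$sample" | awk '$4 == "Pending" {print $1 "/" $2}'
</code>

On a live cluster, pipe the real command's output into the same awk filter.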

===== Diagnosis =====
Check node status:

<code bash>
kubectl get nodes
</code>
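
To pick out only the unhealthy nodes from that listing, a small filter helps. A sketch, assuming the default ''kubectl get nodes'' column order (sample output shown; node names are illustrative):

<code bash>
# Sample lines stand in for: kubectl get nodes --no-headers
# Columns: NAME STATUS ROLES AGE VERSION
sample='node-a   Ready      control-plane   30d   v1.29.0
node-b   NotReady   worker          30d   v1.29.0'

# Show name and status of every node whose STATUS ($2) is not "Ready".
printf '%s\n' "$sample" | awk '$2 != "Ready" {print $1, $2}'
</code>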

Inspect node conditions and events:

<code bash>
kubectl describe node <node-name>
</code>

Check recent cluster-wide events:

<code bash>
kubectl get events --sort-by=.lastTimestamp
</code>

Verify kubelet status on the node (if SSH access is available):

<code bash>
systemctl status kubelet
journalctl -u kubelet -n 100
</code>
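
Kubelet logs can be noisy; klog prefixes error-level lines with ''E''. A sketch that keeps only those lines, with sample journal output standing in for the real ''journalctl'' command:

<code bash>
# Sample lines stand in for: journalctl -u kubelet -n 100
sample='Dec 14 06:40:01 node-b kubelet[812]: I1214 06:40:01.100 syncing pods
Dec 14 06:40:02 node-b kubelet[812]: E1214 06:40:02.200 "Failed to update node status"'

# Keep only klog error lines (E followed by the MMDD date).
printf '%s\n' "$sample" | grep -E 'kubelet\[[0-9]+\]: E[0-9]{4}'
</code>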

Check system resource pressure:

<code bash>
kubectl top node <node-name>
df -h
free -m
</code>
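
Near-full filesystems are a common trigger for disk pressure. A sketch that flags filesystems above 85% usage, assuming ''df -P''-style columns (sample lines shown):

<code bash>
# Sample lines stand in for: df -P
# Columns: FILESYSTEM BLOCKS USED AVAILABLE USE% MOUNTED-ON
sample='/dev/sda1   104857600   94371840   10485760   90%   /
tmpfs       16777216    1048576    15728640    7%   /tmp'

# Strip the % sign from the USE% column, then flag anything above 85%.
printf '%s\n' "$sample" | awk '{gsub(/%/, "", $5)} $5 + 0 > 85 {print $1, $5 "%"}'
</code>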

===== Possible Causes =====
  * Kubelet service stopped or unhealthy
  * Node lost network connectivity
  * Disk, memory, or CPU pressure
  * Kernel panic or other OS-level issues
  * Cloud provider instance failure or maintenance

===== Mitigation =====
  - Restart the kubelet service:
<code bash>
systemctl restart kubelet
</code>
  - Resolve resource pressure (disk cleanup, memory leaks)
  - Verify networking and DNS configuration
  - Reboot the node if necessary
  - If the node cannot recover, drain and replace it

Drain the node safely:

<code bash>
kubectl drain <node-name> --ignore-daemonsets
</code>

After recovery:

<code bash>
kubectl uncordon <node-name>
</code>

===== Escalation =====
  * If the node remains NotReady after mitigation, escalate to the infrastructure team
  * If multiple nodes are affected, treat it as a cluster-level incident
  * Page the on-call engineer if production workloads are impacted
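
To judge single-node fix versus cluster-level incident, count the NotReady nodes. A sketch with sample output standing in for ''kubectl get nodes --no-headers'':

<code bash>
# Sample lines stand in for: kubectl get nodes --no-headers
sample='node-a   Ready      control-plane   30d   v1.29.0
node-b   NotReady   worker          30d   v1.29.0
node-c   NotReady   worker          30d   v1.29.0'

# Count nodes whose STATUS ($2) is NotReady; more than one suggests
# a cluster-level incident rather than an isolated node failure.
printf '%s\n' "$sample" | awk '$2 == "NotReady" {n++} END {print n + 0}'
</code>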

===== Related Alerts =====
  * KubeletDown
  * NodeDown
  * HighDiskIOWait
  * HighCPUUsage
  * HighMemoryUsage

===== Related Dashboards =====
  * Grafana → Kubernetes / Nodes
  * Grafana → Node Exporter Full
runbooks/coustom_alerts/kubernetesnodenotready.txt · Last modified: by admin
