====== KubeNodeNotReady ======

===== Meaning =====
This alert fires when a Kubernetes node has reported a `NotReady` status for more than 2 minutes.
A node in the `NotReady` state cannot reliably run or manage pods.

===== Impact =====
This alert indicates a **node-level availability issue**.

Possible impacts include:
  * Pods on the node may be evicted or rescheduled
  * Reduced cluster capacity
  * Increased load on the remaining nodes
  * Application performance degradation or partial outages

This alert is a **warning**; escalate it if the condition persists or if multiple nodes are affected (see Escalation below).

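Because pod eviction is the main blast radius, it helps to know what is running on the affected node. The sketch below filters captured sample output of `kubectl get pods -A -o wide` by the NODE column (all pod and node names are hypothetical); on a live cluster, `kubectl get pods -A -o wide --field-selector spec.nodeName={{ $labels.node }}` gives the same list directly.

<code bash>
# Captured sample of `kubectl get pods -A -o wide` (hypothetical pods/nodes).
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
pods='default      web-1      1/1   Running   0   3d   10.0.0.4   node-b
default      web-2      1/1   Running   0   3d   10.0.0.5   node-a
kube-system  coredns-1  1/1   Running   0   9d   10.0.0.6   node-b'

# Print namespace/name of every pod scheduled on the affected node
printf '%s\n' "$pods" | awk -v node="node-b" '$8 == node {print $1 "/" $2}'
</code>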
===== Diagnosis =====
Check node status:

<code bash>
kubectl get nodes
</code>
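
On a large cluster it is easier to filter this output than to scan the STATUS column by eye. A minimal sketch, run here against captured sample output (node names are hypothetical):

<code bash>
# Captured sample of `kubectl get nodes --no-headers` (hypothetical nodes).
# On a live cluster: nodes="$(kubectl get nodes --no-headers)"
nodes='node-a   Ready      worker   12d   v1.29.1
node-b   NotReady   worker   12d   v1.29.1
node-c   Ready      worker   12d   v1.29.1'

# Print every node whose STATUS contains NotReady
printf '%s\n' "$nodes" | awk '$2 ~ /NotReady/ {print $1}'
</code>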

Describe the affected node to inspect conditions and events:

<code bash>
kubectl describe node {{ $labels.node }}
</code>

Check recent node-related events:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

Verify kubelet health on the node (if SSH access is available):

<code bash>
systemctl status kubelet
journalctl -u kubelet --since "15 min ago"
</code>
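
In the kubelet journal, error-level klog lines (message field starting with `E`) are usually the fastest signal. A sketch against captured sample log lines (content is illustrative, not from a real incident):

<code bash>
# Captured sample of `journalctl -u kubelet` output (illustrative lines only).
log='Dec 14 06:40:01 node-b kubelet[812]: I1214 06:40:01.100000     812 kubelet.go:1500 "Syncing pods"
Dec 14 06:40:12 node-b kubelet[812]: E1214 06:40:12.200000     812 kubelet.go:2855 "Container runtime not ready"
Dec 14 06:40:30 node-b kubelet[812]: W1214 06:40:30.300000     812 reflector.go:324 "Watch ended"'

# Keep only error-level lines
printf '%s\n' "$log" | grep -E 'kubelet\[[0-9]+\]: E'
</code>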

Check node resource pressure:

<code bash>
kubectl describe node {{ $labels.node }} | grep -i pressure
</code>

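The `grep -i pressure` above pulls the condition rows out of the describe output. Against a captured sample conditions table (values are illustrative), a node under memory pressure is flagged like this:

<code bash>
# Captured sample condition rows from `kubectl describe node` (illustrative).
# Columns: Type Status Reason Message
conditions='MemoryPressure   True    KubeletHasInsufficientMemory   kubelet has insufficient memory available
DiskPressure     False   KubeletHasNoDiskPressure       kubelet has no disk pressure
PIDPressure      False   KubeletHasSufficientPID        kubelet has sufficient PID available'

# Any *Pressure condition with Status "True" needs attention
printf '%s\n' "$conditions" | grep -i pressure | awk '$2 == "True" {print $1}'
</code>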
===== Possible Causes =====
  * Kubelet process stopped or unhealthy
  * Network connectivity issues
  * Disk, memory, or PID pressure on the node
  * Node reboot or hardware failure
  * Cloud provider instance issue

===== Mitigation =====
  - Restart the kubelet service if it is not running
  - Resolve disk, memory, or PID pressure conditions
  - Restore network connectivity
  - Reboot the node if required and safe
  - If the node is unstable, drain it for investigation:

<code bash>
kubectl drain {{ $labels.node }} --ignore-daemonsets
</code>

After the node becomes healthy:

<code bash>
kubectl uncordon {{ $labels.node }}
</code>

===== Escalation =====
  * If the node remains NotReady for more than 10 minutes, escalate to the platform team
  * If multiple nodes are affected, treat it as a cluster-level incident
  * If production workloads are impacted, page the on-call engineer

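The 10-minute threshold can be checked against the `lastTransitionTime` of the node's `Ready` condition. A sketch with a hardcoded sample timestamp (assumes GNU `date`); on a live cluster the timestamp can be fetched with `kubectl get node {{ $labels.node }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'`:

<code bash>
# Sample lastTransitionTime of the Ready condition (hardcoded for illustration)
transition='2025-12-14T06:30:00Z'
now='2025-12-14T06:49:00Z'   # live: now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Minutes since the node last changed Ready state (GNU date)
elapsed=$(( ( $(date -ud "$now" +%s) - $(date -ud "$transition" +%s) ) / 60 ))
echo "NotReady for ${elapsed} minutes"
if [ "$elapsed" -gt 10 ]; then
  echo "escalate to the platform team"
fi
</code>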
===== Related Alerts =====
  * NodeDown
  * KubeletDown
  * HighDiskUsage
  * HighMemoryUsage

===== Related Dashboards =====
  * Grafana → Kubernetes / Nodes
  * Grafana → Node Health Overview

runbooks/coustom_alerts/kubenodenotready.txt · Last modified: by admin
