runbooks:coustom_alerts:kubernetesnodenetworkunavailable
| runbooks:coustom_alerts:kubernetesnodenetworkunavailable [2025/12/13 16:36] – created admin | runbooks:coustom_alerts:kubernetesnodenetworkunavailable [2025/12/14 06:56] (current) – admin | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| runbooks: | runbooks: | ||
| + | |||
| + | ====== KubernetesNodeNetworkUnavailable ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert is triggered when a Kubernetes node reports the **NetworkUnavailable** condition for more than 2 minutes. | ||
| + | It indicates that the node’s networking is not properly configured or is unavailable, | ||
| + | |||
| + | ===== Impact ===== | ||
| + | NetworkUnavailable can cause: | ||
| + | * Pods on the node being unable to communicate with each other or with external services | ||
| + | * Application downtime or degraded performance | ||
| + | * Cluster components (kubelet, kube-proxy) failing to manage pods | ||
| + | * Scheduling and service disruptions | ||
| + | |||
| + | This alert is **critical**, | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check node status: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get nodes | ||
| + | kubectl describe node < | ||
| + | </code> | ||
| + | |||
| + | Check network plugin status (e.g., CNI pods): | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pods -n kube-system | ||
| + | kubectl describe pod < | ||
| + | </code> | ||
| + | |||
| + | Check kubelet logs for network errors: | ||
| + | |||
| + | <code bash> | ||
| + | journalctl -u kubelet -n 100 | ||
| + | </code> | ||
| + | |||
| + | Check recent events for network-related issues: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events --sort-by=.lastTimestamp | ||
| + | </code> | ||
| + | |||
| + | Verify node network interfaces and routes (if SSH access is available): | ||
| + | |||
| + | <code bash> | ||
| + | ip addr | ||
| + | ip route | ||
| + | </code> | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * CNI plugin misconfiguration or failure | ||
| + | * Node network interface down or misconfigured | ||
| + | * Firewall or security group blocking traffic | ||
| + | * Cloud provider network issues | ||
| + | * Kubelet unable to configure networking due to errors | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Restart CNI plugin pods: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl delete pod < | ||
| + | </code> | ||
| + | |||
| + | - Restart kubelet service: | ||
| + | |||
| + | <code bash> | ||
| + | systemctl restart kubelet | ||
| + | </code> | ||
| + | |||
| + | - Verify network configuration and routes on the node | ||
| + | - Check firewall/ | ||
| + | - If the problem lies with the cloud provider, contact provider support | ||
| + | - If the node cannot recover, cordon and drain it temporarily: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl drain < | ||
| + | kubectl uncordon < | ||
| + | </code> | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * If NetworkUnavailable persists beyond 10 minutes, escalate to the platform/ | ||
| + | * Page the on-call engineer if production workloads are impacted | ||
| + | * If multiple nodes are affected, treat it as a cluster-wide network incident | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * KubernetesNodeNotReady | ||
| + | * KubeletDown | ||
| + | * PodCrashLoopBackOff | ||
| + | * NodeDown | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Kubernetes / Node Network | ||
| + | * Grafana → CNI Plugin Metrics | ||
| + | |||
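
The Diagnosis steps above start from `kubectl get nodes`, which does not show the **NetworkUnavailable** condition directly. As a minimal sketch, the condition could be read per node with a jsonpath query and filtered for the value `True`. The function names `node_network_status` and `filter_network_unavailable` are hypothetical, chosen for this example, and the sketch assumes `kubectl` is already configured for the affected cluster. Nodes that do not report the condition at all produce an empty status and are not listed.

```bash
#!/usr/bin/env bash
# Sketch only: list nodes whose NetworkUnavailable condition is "True".
# Function names are hypothetical and not part of the runbook.

# Print "<node-name><TAB><status>" for every node. The status field is
# empty when a node does not report the NetworkUnavailable condition.
node_network_status() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="NetworkUnavailable")].status}{"\n"}{end}'
}

# Read "<node-name><TAB><status>" lines on stdin and print only the
# names of nodes whose status is exactly "True".
filter_network_unavailable() {
  awk -F'\t' '$2 == "True" { print $1 }'
}
```

A possible invocation is `node_network_status | filter_network_unavailable`. More than one node name in the output would match the Escalation guidance for multiple affected nodes.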
runbooks/coustom_alerts/kubernetesnodenetworkunavailable.txt · Last modified: by admin
