====== KubernetesNodeNetworkUnavailable ======
===== Meaning =====
This alert fires when a Kubernetes node has reported the **NetworkUnavailable** condition for more than 2 minutes.
The condition means the node’s network is not correctly configured or is unavailable, so pods on that node may be unable to communicate.
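To confirm which nodes currently report the condition, the node status can be queried directly. The following is a minimal sketch, assuming ''kubectl'' is configured for the affected cluster; the function name ''network_unavailable_nodes'' is illustrative and not part of any existing tooling.

```shell
# List nodes whose NetworkUnavailable condition is currently "True".
# Assumes kubectl is configured for the affected cluster.
network_unavailable_nodes() {
  # Print "<node-name><TAB><condition status>" per node, then keep the "True" rows.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="NetworkUnavailable")].status}{"\n"}{end}' |
    awk -F '\t' '$2 == "True" { print $1 }'
}
```

Calling ''network_unavailable_nodes'' prints one node name per line. Nodes that do not publish the condition at all produce an empty status and are skipped.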
===== Impact =====
NetworkUnavailable can cause:
* Pods on the node being unable to communicate with each other or external services
* Application downtime or degraded performance
* Cluster components (kubelet, kube-proxy) failing to manage pods
* Scheduling and service disruptions
This alert is **critical**, as networking issues directly affect node and application availability.
===== Diagnosis =====
Check node status:
kubectl get nodes
kubectl describe node <node-name>
Check network plugin status (e.g., CNI pods):
kubectl get pods -n kube-system -o wide
kubectl describe pod <cni-pod-name> -n kube-system
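On a busy cluster the full ''kube-system'' listing is long, so it can help to narrow it to the affected node. The sketch below assumes ''kubectl'' access; ''unhealthy_system_pods'' is a hypothetical helper name, and it only looks at the STATUS column, so a pod that is Running but not Ready will not appear.

```shell
# Show kube-system pods scheduled on one node that are not in the Running state.
# Usage: unhealthy_system_pods <node-name>
unhealthy_system_pods() {
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: unhealthy_system_pods <node-name>" >&2
    return 2
  fi
  # spec.nodeName is a supported field selector for pods; column 3 is STATUS.
  kubectl get pods -n kube-system -o wide --no-headers \
    --field-selector "spec.nodeName=$node" |
    awk '$3 != "Running" { print $1, $3 }'
}
```

Each output line is a pod name followed by its status, which can then be passed to ''kubectl describe pod''.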
Check kubelet logs for network errors (run on the node):
journalctl -u kubelet -n 100
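The last 100 kubelet log lines often contain unrelated noise. A small filter, sketched below, keeps only lines that look network-related. The search patterns are an assumption about what a failing kubelet typically logs, and ''filter_network_errors'' is an illustrative name; adjust both for your environment.

```shell
# Keep only kubelet log lines that mention CNI or network plugin problems.
# Reads log text on stdin, so it works with journalctl output or a saved file.
filter_network_errors() {
  grep -iE 'cni|network plugin|NetworkPluginNotReady|NetworkUnavailable'
}
# Example (on the node):
#   journalctl -u kubelet -n 500 --no-pager | filter_network_errors
```

Because the function reads standard input, the same filter can be applied to logs collected earlier from a node that is no longer reachable.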
Check recent events in all namespaces for network-related issues:
kubectl get events -A --sort-by=.lastTimestamp
Verify node network interfaces and routes (if SSH access is available):
ip addr
ip route
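If interfaces and routes look normal, a missing CNI configuration on the node is another thing worth ruling out. The sketch below assumes the conventional default directory ''/etc/cni/net.d''; your container runtime may be configured to use a different path, and ''cni_config_present'' is a hypothetical helper name.

```shell
# Return 0 if the directory contains at least one CNI config file, 1 otherwise.
# /etc/cni/net.d is the conventional default; pass another path if yours differs.
cni_config_present() {
  dir="${1:-/etc/cni/net.d}"
  for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
    # An unmatched glob stays literal, so test that the path really exists.
    if [ -e "$f" ]; then
      echo "found CNI config: $f"
      return 0
    fi
  done
  echo "no CNI config in $dir" >&2
  return 1
}
```

Run ''cni_config_present'' on the node itself. A non-zero exit status suggests the CNI plugin never wrote its configuration, which points at the CNI pods rather than the host network.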
===== Possible Causes =====
* CNI plugin misconfiguration or failure
* Node network interface down or misconfigured
* Firewall or security group blocking traffic
* Cloud provider network issues
* Kubelet unable to configure networking due to errors
===== Mitigation =====
- Restart the CNI plugin pod on the affected node (its controller, typically a DaemonSet, recreates it):
kubectl delete pod <cni-pod-name> -n kube-system
- Restart the kubelet service on the node:
systemctl restart kubelet
- Verify network configuration and routes on the node
- Check firewall/security group rules
- If the cloud provider is at fault, contact provider support
- If the node cannot recover, drain it temporarily (draining also cordons it), then uncordon it once networking is restored:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>
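Draining the wrong node during an incident makes things worse, so a thin wrapper that refuses to run without an explicit node name can be useful. This is a sketch only: ''drain_node'' is a hypothetical helper, and the 300-second timeout is an assumed value chosen to stop the command from waiting indefinitely on pods that cannot be evicted.

```shell
# Cordon and drain a node, refusing to run without an explicit node name.
# Usage: drain_node <node-name>
drain_node() {
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: drain_node <node-name>" >&2
    return 2
  fi
  # drain cordons the node itself; the timeout is an assumed value, tune as needed.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}
```

After networking is restored and the node reports Ready, run ''kubectl uncordon'' with the same node name to allow scheduling again.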
===== Escalation =====
* If NetworkUnavailable persists beyond 10 minutes, escalate to the platform/network team
* Page the on-call engineer if production workloads are impacted
* If multiple nodes are affected, treat it as a cluster-wide network incident
===== Related Alerts =====
* KubernetesNodeNotReady
* KubeletDown
* PodCrashLoopBackOff
* NodeDown
===== Related Dashboards =====
* Grafana → Kubernetes / Node Network
* Grafana → CNI Plugin Metrics