--- runbooks:coustom_alerts:kubernetesnodenetworkunavailable [2025/12/13 16:36] – created admin
+++ runbooks:coustom_alerts:kubernetesnodenetworkunavailable [2025/12/14 06:56] (current) – admin
@@ Line 1: / Line 1: @@
 runbooks:coustom_alerts:KubernetesNodeNetworkUnavailable
+====== KubernetesNodeNetworkUnavailable ======
+===== Meaning =====
+This alert fires when a Kubernetes node has reported the **NetworkUnavailable** condition for more than 2 minutes.
+The condition indicates that the node’s network is not correctly configured or is unavailable, which prevents pods on the node from communicating.
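+To confirm the condition directly, the node object can be queried with a JSONPath filter. The following is a minimal sketch; the function name ''node_network_status'' is a hypothetical helper, not part of kubectl.

```bash
# Hypothetical helper: print the status of the NetworkUnavailable condition for one node.
# Prints "True", "False", or "Unknown"; prints nothing if the node does not report the condition.
node_network_status() {
  node="$1"
  kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")].status}'
}

# Usage: node_network_status <NODE_NAME>
```

+A value of ''True'' means the alert condition is still active. An empty result suggests the node does not report this condition at all, which depends on the network plugin or cloud provider in use.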
+===== Impact =====
+NetworkUnavailable can cause:
+  * Pods on the node being unable to communicate with each other or external services
+  * Application downtime or degraded performance
+  * Cluster components (kubelet, kube-proxy) failing to manage pods
+  * Scheduling and service disruptions
+This alert is **critical**, as networking issues directly affect node and application availability.
+===== Diagnosis =====
+Check node status:
+<code bash>
+kubectl get nodes
+kubectl describe node <NODE_NAME>
+</code>
+Check network plugin status (e.g., CNI pods):
+<code bash>
+kubectl get pods -n kube-system
+kubectl describe pod <CNI_POD_NAME> -n kube-system
+</code>
+Check kubelet logs for network errors:
+<code bash>
+journalctl -u kubelet -n 100
+</code>
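+The last 100 lines may contain little that is network-related. As a sketch, the log can be narrowed to a time window and filtered; the 10-minute window, the grep pattern, and the function name ''kubelet_network_errors'' are illustrative assumptions.

```bash
# Hypothetical helper: show recent kubelet log lines that mention CNI or networking.
# Assumes a systemd host; run it on the affected node.
kubelet_network_errors() {
  journalctl -u kubelet --since "10 minutes ago" --no-pager \
    | grep -iE 'cni|network'
}

# Usage: kubelet_network_errors
```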
+Check recent events in all namespaces for network-related issues:
+<code bash>
+kubectl get events -A --sort-by=.lastTimestamp
+</code>
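+On a busy cluster the event list can be long. A possible refinement, shown here as a sketch, restricts the output to events whose involved object is the affected node; the function name ''node_events'' is hypothetical.

```bash
# Hypothetical helper: list events that refer to a specific Node object, oldest first.
node_events() {
  node="$1"
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=${node}" \
    --sort-by=.lastTimestamp
}

# Usage: node_events <NODE_NAME>
```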
+Verify node network interfaces and routes (if SSH access is available):
+<code bash>
+ip addr
+ip route
+</code>
+===== Possible Causes =====
+  * CNI plugin misconfiguration or failure
+  * Node network interface down or misconfigured
+  * Firewall or security group blocking traffic
+  * Cloud provider network issues
+  * Kubelet unable to configure networking due to errors
+===== Mitigation =====
+  - Restart the CNI plugin pods (their controller, typically a DaemonSet, recreates deleted pods):
+<code bash>
+kubectl delete pod <CNI_POD_NAME> -n kube-system
+</code>
+  - Restart the kubelet service on the affected node:
+<code bash>
+systemctl restart kubelet
+</code>
+  - Verify network configuration and routes on the node
+  - Check firewall/security group rules
+  - If the issue lies with the cloud provider, contact provider support
+  - If the node cannot recover, drain it temporarily (draining also cordons the node):
+<code bash>
+kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
+# Run only after the node's networking has recovered:
+kubectl uncordon <NODE_NAME>
+</code>
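+Uncordoning should happen only once the node's networking is healthy again. One way to enforce that, sketched below, is to gate the uncordon on ''kubectl wait''. The function name ''uncordon_when_network_ready'' and the 300-second default timeout are assumptions for illustration, not values from this runbook.

```bash
# Hypothetical helper: block until the node reports NetworkUnavailable=False, then uncordon it.
# If the wait fails or times out, the node is left cordoned and the function returns 1.
uncordon_when_network_ready() {
  node="$1"
  timeout="${2:-300s}"   # illustrative default
  if kubectl wait --for=condition=NetworkUnavailable=False "node/${node}" --timeout="$timeout"; then
    kubectl uncordon "$node"
  else
    echo "node ${node} did not report NetworkUnavailable=False; leaving it cordoned" >&2
    return 1
  fi
}

# Usage: uncordon_when_network_ready <NODE_NAME> 120s
```

+If the node never reports the NetworkUnavailable condition, the wait times out; in that case fall back to checking node status manually before uncordoning.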
+===== Escalation =====
+  * If NetworkUnavailable persists beyond 10 minutes, escalate to the platform/network team
+  * Page the on-call engineer if production workloads are impacted
+  * If multiple nodes are affected, treat it as a cluster-wide network incident
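+To judge whether more than one node is affected, the condition can be listed for every node at once. This is a minimal sketch; the function name ''nodes_network_unavailable'' is hypothetical.

```bash
# Hypothetical helper: print the names of all nodes whose NetworkUnavailable
# condition is currently "True", one name per line.
nodes_network_unavailable() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="NetworkUnavailable")].status}{"\n"}{end}' \
    | awk -F '\t' '$2 == "True" { print $1 }'
}

# Usage: nodes_network_unavailable
```

+More than one name in the output suggests the problem is not isolated to a single node.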
+===== Related Alerts =====
+  * KubernetesNodeNotReady
+  * KubeletDown
+  * PodCrashLoopBackOff
+  * NodeDown
+===== Related Dashboards =====
+  * Grafana → Kubernetes / Node Network
+  * Grafana → CNI Plugin Metrics