====== KubernetesNodeNetworkUnavailable ======

===== Meaning =====
This alert fires when a Kubernetes node has reported the **NetworkUnavailable** condition as true for more than 2 minutes.
It indicates that the node's network is unavailable or not correctly configured, which prevents pods on that node from communicating.
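
To confirm what the alert observed, the condition can be read directly from the node object. The following is a minimal sketch, assuming ''kubectl'' access to the cluster; ''node_network_status'' is a hypothetical helper name, not part of any existing tooling.

<code bash>
# Hypothetical helper: print the status of the NetworkUnavailable
# condition for one node ("True" means the network is unavailable).
node_network_status() {
  local node="$1"
  kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")].status}'
}
</code>

Calling ''node_network_status <NODE_NAME>'' prints ''True'' while the alert condition holds and ''False'' once it has cleared. Empty output would suggest that the node does not report this condition at all.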

===== Impact =====
NetworkUnavailable can cause:
  * Pods on the node being unable to communicate with each other or external services
  * Application downtime or degraded performance
  * Cluster components (kubelet, kube-proxy) failing to manage pods
  * Scheduling and service disruptions

This alert is **critical**, as networking issues directly affect node and application availability.

===== Diagnosis =====
Check node status:

<code bash>
kubectl get nodes
kubectl describe node <NODE_NAME>
</code>

Check network plugin status (e.g., CNI pods):

<code bash>
kubectl get pods -n kube-system
kubectl describe pod <CNI_POD_NAME> -n kube-system
</code>
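
When the CNI pod name is not known, it may help to list only the ''kube-system'' pods scheduled on the affected node and then pull that pod's logs. This is a sketch under the assumption that the CNI plugin runs as pods in ''kube-system''; both function names are hypothetical.

<code bash>
# Hypothetical helper: list kube-system pods running on one node,
# so the CNI pod for that node is easy to spot.
kube_system_pods_on_node() {
  local node="$1"
  kubectl get pods -n kube-system -o wide \
    --field-selector "spec.nodeName=${node}"
}

# Hypothetical helper: show the last 100 log lines from every
# container of the given CNI pod.
cni_pod_logs() {
  local pod="$1"
  kubectl logs "$pod" -n kube-system --all-containers --tail=100
}
</code>

For example, ''kube_system_pods_on_node <NODE_NAME>'' followed by ''cni_pod_logs <CNI_POD_NAME>''.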

Check kubelet logs for network errors:

<code bash>
journalctl -u kubelet -n 100
</code>
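
The last 100 lines can be noisy. A possible refinement is to filter a recent time window for network-related messages. The search patterns below are assumptions about what the kubelet might log, not an exhaustive list, and ''kubelet_network_errors'' is a hypothetical helper name.

<code bash>
# Hypothetical helper: show recent kubelet log lines that mention
# CNI or network plugin problems. The patterns are illustrative only.
kubelet_network_errors() {
  local since="${1:-15 minutes ago}"
  journalctl -u kubelet --since "$since" --no-pager \
    | grep -iE 'cni|network plugin|networkunavailable'
}
</code>

Run it on the node as ''kubelet_network_errors'' or with a wider window, for example ''kubelet_network_errors "1 hour ago"''.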

Check recent events for network-related issues:

<code bash>
kubectl get events --sort-by=.lastTimestamp
</code>
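
To narrow the event list to the affected node, a field selector on the involved object can be used. A minimal sketch, with ''node_events'' as a hypothetical helper name:

<code bash>
# Hypothetical helper: show only events whose involved object is
# the given node, oldest first.
node_events() {
  local node="$1"
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=${node}" \
    --sort-by=.lastTimestamp
}
</code>

Usage: ''node_events <NODE_NAME>''.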

Verify node network interfaces and routes (if SSH access is available):

<code bash>
ip addr
ip route
</code>
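
The full ''ip'' output can be long on nodes with many virtual interfaces. As a rough sketch, the two helpers below (hypothetical names) reduce it to the interfaces whose state is DOWN and to a check for a default route. They assume an iproute2 version that supports the brief ''-br'' output format.

<code bash>
# Hypothetical helper: print the names of interfaces reported as DOWN.
down_interfaces() {
  ip -br link | awk '$2 == "DOWN" { print $1 }'
}

# Hypothetical helper: succeed only if a default route exists.
has_default_route() {
  [ -n "$(ip route show default)" ]
}
</code>

For example: ''down_interfaces'' and ''has_default_route || echo "no default route"''.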

===== Possible Causes =====
  * CNI plugin misconfiguration or failure
  * Node network interface down or misconfigured
  * Firewall or security group blocking traffic
  * Cloud provider network issues
  * Kubelet unable to configure networking due to errors
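
For the first cause, a quick check on the node is whether any CNI configuration file exists at all. The sketch below assumes the commonly used default directory ''/etc/cni/net.d''; the actual path depends on the container runtime configuration, and ''check_cni_config'' is a hypothetical helper name.

<code bash>
# Hypothetical helper: report whether the CNI config directory
# exists and contains at least one file.
# /etc/cni/net.d is an assumed default; pass another path if needed.
check_cni_config() {
  local dir="${1:-/etc/cni/net.d}"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "CNI config present in $dir:"
    ls -l "$dir"
  else
    echo "No CNI config found in $dir"
    return 1
  fi
}
</code>

An empty or missing directory would point towards the CNI plugin never having been installed or initialized on that node.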

===== Mitigation =====
  - Restart the CNI plugin pod on the affected node (if it is managed by a DaemonSet, it is recreated automatically):

<code bash>
kubectl delete pod <CNI_POD_NAME> -n kube-system
</code>

  - Restart the kubelet service on the affected node (requires SSH access):

<code bash>
sudo systemctl restart kubelet
</code>

  - Verify the network configuration and routes on the node
  - Check firewall and security group rules
  - If the cause is a cloud provider issue, contact provider support
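  - If the node cannot recover, drain it (this also cordons it) so workloads move elsewhere, and uncordon it only once networking is restored:

<code bash>
# Cordon the node and evict its pods
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data

# Run only after the NetworkUnavailable condition has cleared
kubectl uncordon <NODE_NAME>
</code>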
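
Uncordoning should wait until the node is healthy again. One way to guard against uncordoning too early is to poll the condition first. This is a minimal sketch, not an established procedure: the function name, the 600 second timeout, and the 15 second interval are all assumptions, and it checks only the NetworkUnavailable condition, not overall node readiness.

<code bash>
# Hypothetical helper: poll until the node reports
# NetworkUnavailable=False, then uncordon it.
# Gives up and returns 1 after the timeout (seconds).
uncordon_when_network_ready() {
  local node="$1" timeout="${2:-600}" interval="${3:-15}"
  local waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    # Uncordon only on an explicit "False"; an API error or a
    # missing condition is treated as "not ready yet".
    if status="$(kubectl get node "$node" \
        -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")].status}')" \
        && [ "$status" = "False" ]; then
      kubectl uncordon "$node"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Node $node still reports NetworkUnavailable after ${timeout}s" >&2
  return 1
}
</code>

Usage: ''uncordon_when_network_ready <NODE_NAME>'', optionally followed by a timeout and a polling interval in seconds.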

===== Escalation =====
  * If NetworkUnavailable persists beyond 10 minutes, escalate to the platform/network team
  * Page the on-call engineer if production workloads are impacted
  * If multiple nodes are affected, treat it as a cluster-wide network incident
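
To judge whether more than one node is affected, the condition can be listed across the whole cluster. A sketch, assuming ''kubectl'' and ''awk'' are available; ''nodes_network_unavailable'' is a hypothetical helper name.

<code bash>
# Hypothetical helper: print the names of all nodes that currently
# report NetworkUnavailable=True, one per line.
nodes_network_unavailable() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="NetworkUnavailable")].status}{"\n"}{end}' \
    | awk -F'\t' '$2 == "True" { print $1 }'
}
</code>

''nodes_network_unavailable | wc -l'' gives the number of affected nodes.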

===== Related Alerts =====
  * KubernetesNodeNotReady
  * KubeletDown
  * PodCrashLoopBackOff
  * NodeDown

===== Related Dashboards =====
  * Grafana → Kubernetes / Node Network
  * Grafana → CNI Plugin Metrics