KubernetesNodeNetworkUnavailable

Meaning

This alert is triggered when a Kubernetes node reports the NetworkUnavailable condition for more than 2 minutes. It indicates that the node's network is not properly configured or is unavailable, preventing pods on that node from communicating.
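
The exact alert definition depends on your monitoring stack. A minimal sketch of such a rule, assuming kube-state-metrics exposes node conditions (the severity label and the 2-minute window are assumptions, adjust to your setup):

- alert: KubernetesNodeNetworkUnavailable
  expr: kube_node_status_condition{condition="NetworkUnavailable",status="true"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} reports NetworkUnavailable"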

Impact

NetworkUnavailable can cause:

  • Pods on the node being unable to communicate with each other or external services
  • Application downtime or degraded performance
  • Cluster components (kubelet, kube-proxy) failing to manage pod networking and service routing
  • Scheduling and service disruptions

This alert is critical, as networking issues directly affect node and application availability.

Diagnosis

Check node status:

kubectl get nodes
kubectl describe node <NODE_NAME>
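
To inspect the NetworkUnavailable condition directly, a jsonpath query such as the following can help (output format may vary by Kubernetes version):

kubectl get node <NODE_NAME> -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")]}'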

Check network plugin status (e.g., CNI pods):

kubectl get pods -n kube-system
kubectl describe pod <CNI_POD_NAME> -n kube-system
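
CNI pod names and labels depend on the plugin in use; the labels below are assumptions for Calico and Flannel, adjust to your plugin:

kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
kubectl get pods -n kube-system -l app=flannel -o wide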

Check kubelet logs for network errors:

journalctl -u kubelet -n 100
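
To narrow the output to network-related messages (a filter sketch; exact error strings vary by CNI plugin):

journalctl -u kubelet -n 500 --no-pager | grep -iE 'cni|network'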

Check recent events for network-related issues:

kubectl get events --sort-by=.lastTimestamp
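
To limit events to the affected node, a field selector can be used:

kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME> --sort-by=.lastTimestamp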

Verify node network interfaces and routes (if SSH access is available):

ip addr
ip route
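
Also confirm that the CNI-managed interfaces exist and a default route is present (interface names below are plugin-specific assumptions, e.g. cni0 or flannel.1 for Flannel):

ip link show | grep -iE 'cni|flannel|calico|vxlan'
ip route show default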

Possible Causes

  • CNI plugin misconfiguration or failure
  • Node network interface down or misconfigured
  • Firewall or security group blocking traffic
  • Cloud provider network issues
  • Kubelet unable to configure networking due to errors

Mitigation

  1. Restart CNI plugin pods (if the entire DaemonSet is unhealthy, see the rollout sketch after this list):
kubectl delete pod <CNI_POD_NAME> -n kube-system
  2. Restart the kubelet service on the node:
systemctl restart kubelet
  3. Verify network configuration and routes on the node
  4. Check firewall/security group rules
  5. If it is a cloud provider network issue, contact provider support
  6. If the node cannot recover, cordon and drain it temporarily, then uncordon it once networking is restored:
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <NODE_NAME>
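
If all pods of the CNI DaemonSet are unhealthy, restarting the whole DaemonSet can be faster than deleting pods one by one (the DaemonSet name below is a placeholder, use the one your plugin deploys):

kubectl rollout restart daemonset <CNI_DAEMONSET_NAME> -n kube-system
kubectl rollout status daemonset <CNI_DAEMONSET_NAME> -n kube-system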

Escalation

  • If NetworkUnavailable persists beyond 10 minutes, escalate to the platform/network team
  • Page on-call engineer if production workloads are impacted
  • If multiple nodes are affected, treat as cluster-wide network incident

Related Alerts

  • KubernetesNodeNotReady
  • KubeletDown
  • PodCrashLoopBackOff
  • NodeDown

Dashboards

  • Grafana → Kubernetes / Node Network
  • Grafana → CNI Plugin Metrics