runbooks:coustom_alerts:KubeAPIDown
====== KubeAPIDown ======
===== Meaning =====
This alert is triggered when Prometheus is unable to scrape the Kubernetes API server metrics.
It usually indicates that the API server is unreachable, unresponsive, or not running.
===== Impact =====
This alert represents a **critical control-plane failure**.
Possible impacts include:
* `kubectl` commands failing or timing out
* Inability to create, update, or delete Kubernetes resources
* Controllers and schedulers unable to reconcile cluster state
* Automation and CI/CD pipelines failing
* Monitoring data becoming stale or unavailable
If this alert is firing, the cluster is likely **partially or completely unusable**.
===== Diagnosis =====
Check if the Kubernetes API server is reachable:
kubectl get nodes
If `kubectl` is unresponsive, check API server health endpoints (if accessible):
curl -k https:///healthz
Check control-plane pod status (for self-managed clusters):
kubectl get pods -n kube-system | grep kube-apiserver
Describe the API server pod for recent failures:
kubectl describe pod kube-apiserver- -n kube-system
Check recent cluster-wide events:
kubectl get events -A --sort-by=.lastTimestamp
If running on managed Kubernetes, check cloud provider control-plane status dashboards.
===== Possible Causes =====
* API server process crashed or not running
* Control-plane node failure
* Network connectivity issues to the API endpoint
* Certificate expiration or authentication failure
* Resource exhaustion on control-plane nodes
* Cloud provider control-plane outage
===== Mitigation =====
- For managed Kubernetes, verify provider status and open a support ticket if needed
- For self-managed clusters:
* Restart the kube-apiserver service or pod
* Check etcd health and connectivity
* Resolve networking or DNS issues
- Verify certificates and rotate if expired
- Ensure control-plane nodes have sufficient CPU and memory
If the issue is transient, continue monitoring after recovery.
===== Escalation =====
* Immediately page the on-call engineer
* Notify the platform or infrastructure team
* If running on managed Kubernetes, escalate to the cloud provider support
* If unresolved after 15 minutes, treat as a major incident
===== Related Alerts =====
* KubeControllerManagerDown
* KubeSchedulerDown
* EtcdDown
===== Related Dashboards =====
* Grafana → Kubernetes / API Server
* Grafana → Control Plane Overview