KubeAPIDown
Meaning
This alert is triggered when Prometheus is unable to scrape the Kubernetes API server metrics. It usually indicates that the API server is unreachable, unresponsive, or not running.
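To see what Prometheus itself reports for the API server target, the scrape state can be read through the Prometheus HTTP API. The following is a minimal sketch, not the alert's actual rule: the function name `apiserver_up_query`, the default job label `apiserver`, and the Prometheus URL are assumptions that may differ in your setup. An empty result, or an `up` value of `0`, would be consistent with this alert firing.

```shell
#!/usr/bin/env bash
# Query Prometheus for the scrape state of the API server target.
# $1: Prometheus base URL (hypothetical, e.g. http://prometheus.example.internal:9090)
# $2: scrape job label (assumed default: "apiserver")
apiserver_up_query() {
  local prom="$1" job="${2:-apiserver}"
  # -G sends the url-encoded data as a GET query string; -m 10 caps the request at 10s.
  curl -s -G -m 10 "${prom}/api/v1/query" \
    --data-urlencode "query=up{job=\"${job}\"}"
}
```

Example call: `apiserver_up_query http://prometheus.example.internal:9090`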
Impact
This alert represents a critical control-plane failure.
Possible impacts include:
- `kubectl` commands failing or timing out
- Inability to create, update, or delete Kubernetes resources
- Controllers and schedulers unable to reconcile cluster state
- Automation and CI/CD pipelines failing
- Monitoring data becoming stale or unavailable
If this alert is firing, the cluster is likely partially or completely unusable.
Diagnosis
Check if the Kubernetes API server is reachable:
kubectl get nodes
If `kubectl` is unresponsive, check API server health endpoints (if accessible):
curl -k https://<API_SERVER_ENDPOINT>/healthz
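Since Kubernetes v1.16, `/healthz` is deprecated in favor of the more specific `/livez` and `/readyz` endpoints. The sketch below probes both and prints one line per endpoint. The function name `probe_apiserver` is hypothetical, and the endpoint argument is the same placeholder as above. An HTTP 401 or 403 suggests the API server is answering but rejects anonymous requests; `000` means no HTTP response was received at all.

```shell
#!/usr/bin/env bash
# Probe the API server liveness and readiness endpoints.
# $1: host:port of the API server (placeholder, e.g. api.example.internal:6443)
probe_apiserver() {
  local endpoint="$1" path code rc=0
  for path in livez readyz; do
    # -k skips TLS verification, -m 5 caps each request at 5 seconds,
    # -w prints only the HTTP status code ("000" if there was no response).
    code="$(curl -k -s -m 5 -o /dev/null -w '%{http_code}' "https://${endpoint}/${path}")" || true
    case "$code" in
      200)     echo "${path}: ok" ;;
      401|403) echo "${path}: responding, but request not authorized (HTTP ${code})" ;;
      *)       echo "${path}: FAILED (HTTP ${code:-000})"; rc=1 ;;
    esac
  done
  return "$rc"
}
```

Example call: `probe_apiserver <API_SERVER_ENDPOINT>`. The function returns non-zero if either endpoint fails.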
If the API server still responds, check control-plane pod status (for self-managed clusters):
kubectl get pods -n kube-system | grep kube-apiserver
Describe the API server pod for recent failures:
kubectl describe pod kube-apiserver-<node> -n kube-system
Check recent cluster-wide events:
kubectl get events -A --sort-by=.lastTimestamp
If running on managed Kubernetes, check cloud provider control-plane status dashboards.
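The `kubectl` commands above only work while the API server still answers. When it does not, a self-managed control-plane node can be inspected directly over SSH. The following is a rough sketch, assuming a CRI runtime with `crictl` installed and a kubelet running as the systemd unit `kubelet`; both are assumptions, and the function name `apiserver_node_diag` is hypothetical.

```shell
#!/usr/bin/env bash
# Run on a control-plane node: show recent kube-apiserver container logs,
# or fall back to kubelet logs if no such container exists.
apiserver_node_diag() {
  local id
  # -a includes exited containers, -q prints container IDs only.
  id="$(crictl ps -a --name kube-apiserver -q | head -n 1)"
  if [ -z "$id" ]; then
    echo "no kube-apiserver container found; showing kubelet logs"
    journalctl -u kubelet --since "15 min ago" --no-pager | tail -n 50
    return 1
  fi
  echo "kube-apiserver container: $id"
  # Last 50 log lines of that container (stdout and stderr).
  crictl logs --tail 50 "$id" 2>&1
}
```

Repeated restarts or errors about etcd, certificates, or port binding in this output usually narrow down which of the causes below applies.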
Possible Causes
- API server process crashed or not running
- Control-plane node failure
- Network connectivity issues to the API endpoint
- Certificate expiration or authentication failure
- Resource exhaustion on control-plane nodes
- Cloud provider control-plane outage
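Certificate expiry, one of the causes listed above, can be ruled in or out quickly with `openssl` on a control-plane node. This is a minimal sketch: the function name `check_cert` is hypothetical, and the path in the example call is the kubeadm default, which is an assumption for other installers.

```shell
#!/usr/bin/env bash
# Print the expiry date of a certificate and whether it has already expired.
# $1: path to a PEM-encoded certificate
check_cert() {
  local cert="$1"
  # Prints a line of the form "notAfter=<date>".
  openssl x509 -noout -enddate -in "$cert" || return 2
  # -checkend N exits non-zero if the certificate expires within N seconds;
  # with N=0 that means it has already expired.
  if openssl x509 -noout -checkend 0 -in "$cert" >/dev/null; then
    echo "valid"
  else
    echo "EXPIRED"
    return 1
  fi
}
```

Example call: `check_cert /etc/kubernetes/pki/apiserver.crt`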
Mitigation
- For managed Kubernetes, verify provider status and open a support ticket if needed
- For self-managed clusters:
  - Restart the kube-apiserver service or pod
  - Check etcd health and connectivity
  - Resolve networking or DNS issues
  - Verify certificates and rotate if expired
  - Ensure control-plane nodes have sufficient CPU and memory
If the issue is transient, continue monitoring after recovery.
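For the etcd health check on a self-managed cluster, `etcdctl endpoint health` is one option. The sketch below assumes kubeadm-style certificate paths and a local etcd member listening on `127.0.0.1:2379`, and that `etcdctl` is available where you run it (on kubeadm hosts it often exists only inside the etcd container). The function name and the `ETCD_PKI_DIR` and `ETCD_ENDPOINTS` variables are hypothetical overrides, not standard settings.

```shell
#!/usr/bin/env bash
# Check etcd endpoint health using client certificates.
# Optional overrides (hypothetical): ETCD_PKI_DIR, ETCD_ENDPOINTS.
etcd_health() {
  local pki="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}"
  # ETCDCTL_API=3 selects the v3 API on older etcdctl builds.
  ETCDCTL_API=3 etcdctl \
    --endpoints "${ETCD_ENDPOINTS:-https://127.0.0.1:2379}" \
    --cacert "${pki}/ca.crt" \
    --cert "${pki}/healthcheck-client.crt" \
    --key "${pki}/healthcheck-client.key" \
    endpoint health
}
```

If etcd reports unhealthy, restarting the API server alone is unlikely to help, since the API server depends on etcd for all cluster state.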
Escalation
- Immediately page the on-call engineer
- Notify the platform or infrastructure team
- If running on managed Kubernetes, escalate to cloud provider support
- If unresolved after 15 minutes, treat it as a major incident
Related Alerts
- KubeControllerManagerDown
- KubeSchedulerDown
- EtcdDown
Related Dashboards
- Grafana → Kubernetes / API Server
- Grafana → Control Plane Overview
