KubeAPIDown

Meaning

This alert fires when Prometheus cannot scrape metrics from the Kubernetes API server. It usually indicates that the API server is unreachable, unresponsive, or not running, but the fault can also lie on the scrape path between Prometheus and the API server.
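
To see what Prometheus itself reports, the `up` series for the API server targets can be queried through the Prometheus HTTP API. This is a minimal sketch: `PROM_URL`, the job label `apiserver`, and the function name `apiserver_up_query` are assumptions and need to match your own scrape configuration.

```shell
# Sketch: ask Prometheus which API server targets it currently sees as up.
# PROM_URL and the job label "apiserver" are assumptions; adjust to your setup.
PROM_URL="${PROM_URL:-http://localhost:9090}"

apiserver_up_query() {
  # -G sends the url-encoded query as GET parameters.
  curl -s -G "${PROM_URL}/api/v1/query" \
    --data-urlencode 'query=up{job="apiserver"}'
}
```

Calling `apiserver_up_query` prints the raw JSON response. An empty `result` array, or samples whose value is `"0"`, would be consistent with the condition this alert describes.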

Impact

This alert represents a critical control-plane failure.

Possible impacts include:

`kubectl` commands failing or timing out
Inability to create, update, or delete Kubernetes resources
Controllers and schedulers unable to reconcile cluster state
Automation and CI/CD pipelines failing
Monitoring data becoming stale or unavailable

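If this alert is firing, cluster management is likely partially or completely unavailable. Workloads that are already running usually keep running, but they cannot be changed, rescheduled, or scaled until the API server recovers.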

Diagnosis

Check if the Kubernetes API server is reachable:

kubectl get nodes

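If `kubectl` is unresponsive, query the API server health endpoints directly (if accessible). `/healthz` has been deprecated since Kubernetes v1.16; prefer `/livez` and `/readyz`. Note that `-k` skips TLS certificate verification:

curl -k "https://<API_SERVER_ENDPOINT>/readyz?verbose"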
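
The health check above can be wrapped in a small script that probes `/livez` and `/readyz` and translates the HTTP status into a verdict. This is only a sketch: the function names are hypothetical, the endpoint is passed as `host:port`, and the five-second timeout is an arbitrary choice.

```shell
# Sketch: probe API server health endpoints and translate the result.

# Map an HTTP status code (curl reports 000 when it gets no response)
# to a short verdict.
classify_health() {
  case "$1" in
    200) echo "ok" ;;
    401|403) echo "reachable-but-unauthorized" ;;
    000|"") echo "unreachable" ;;
    *) echo "unhealthy" ;;
  esac
}

# Probe one endpoint; prints "<path> <http_code> <verdict>".
probe() {
  endpoint="$1"
  path="$2"
  # -k skips TLS verification, as in the curl command above.
  code=$(curl -k -s -o /dev/null --max-time 5 -w '%{http_code}' \
    "https://${endpoint}${path}")
  echo "${path} ${code} $(classify_health "${code}")"
}

# Probe both liveness and readiness for the given host:port.
check_apiserver() {
  for path in /livez /readyz; do
    probe "$1" "$path"
  done
}
```

Run it as `check_apiserver "<API_SERVER_ENDPOINT>"`. A verdict of `unreachable` points toward network or process failure, while `unhealthy` suggests the process is up but failing its own checks.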

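Check control-plane pod status (for self-managed clusters). This and the following `kubectl` commands only work while at least one API server instance still answers requests: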

kubectl get pods -n kube-system | grep kube-apiserver

Describe the API server pod for recent failures:

kubectl describe pod kube-apiserver-<node> -n kube-system
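
When no API server instance answers, `kubectl` cannot help and the container has to be inspected on the control-plane node itself. The sketch below assumes a CRI-based runtime with `crictl` installed and a container named `kube-apiserver` (as on kubeadm-style clusters); the function names are hypothetical.

```shell
# Sketch: inspect the kube-apiserver container directly on a control-plane node.
# Assumes crictl is installed and the container is named kube-apiserver.

apiserver_container_id() {
  # -a includes exited containers; -q prints only container IDs.
  # Takes the first listed ID; review the full `crictl ps -a` output
  # if several containers match.
  crictl ps -a --name kube-apiserver -q | head -n 1
}

apiserver_logs() {
  id=$(apiserver_container_id)
  if [ -z "$id" ]; then
    echo "no kube-apiserver container found" >&2
    return 1
  fi
  # Show the last 50 log lines of that container.
  crictl logs --tail 50 "$id"
}
```

Run `apiserver_logs` as root on the affected node. If no container is found at all, the kubelet logs (for example `journalctl -u kubelet`) are usually the next place to look.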

Check recent cluster-wide events:

kubectl get events -A --sort-by=.lastTimestamp

If running on managed Kubernetes, check cloud provider control-plane status dashboards.

Possible Causes

API server process crashed or not running
Control-plane node failure
Network connectivity issues to the API endpoint
Certificate expiration or authentication failure
Resource exhaustion on control-plane nodes
Cloud provider control-plane outage
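
Certificate expiration, one of the causes listed above, can be checked from any machine that can reach the API endpoint. This is a minimal sketch using `openssl`; the function name `cert_expires_within` is hypothetical and the endpoint is passed as `host:port`.

```shell
# Sketch: report whether the serving certificate presented by the API server
# expires within a given number of days.

cert_expires_within() {
  endpoint="$1"
  days="$2"
  seconds=$((days * 86400))
  # s_client fetches the presented certificate; x509 -checkend exits
  # non-zero if the certificate expires within the given window.
  if openssl s_client -connect "$endpoint" </dev/null 2>/dev/null \
      | openssl x509 -noout -checkend "$seconds" >/dev/null 2>&1; then
    echo "valid for more than ${days} days"
  else
    echo "expires within ${days} days, is already expired, or could not be read"
    return 1
  fi
}
```

For example, `cert_expires_within "<API_SERVER_ENDPOINT>" 30` checks a 30-day window. On kubeadm-managed clusters, `kubeadm certs check-expiration` gives a broader view that also covers client and etcd certificates.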

Mitigation

For managed Kubernetes, verify provider status and open a support ticket if needed
For self-managed clusters:
- Restart the kube-apiserver service or pod
- Check etcd health and connectivity
- Resolve networking or DNS issues
Verify certificates and rotate if expired
Ensure control-plane nodes have sufficient CPU and memory
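
For the etcd health check in the self-managed steps above, `etcdctl endpoint health` can be run from a control-plane node. The sketch assumes kubeadm default certificate paths and a local etcd endpoint; both are assumptions, exposed as the overridable variables `ETCD_ENDPOINTS` and `ETCD_PKI`.

```shell
# Sketch: check etcd health from a control-plane node.
# Endpoint and certificate paths are kubeadm defaults (assumptions);
# override ETCD_ENDPOINTS and ETCD_PKI for other layouts.
ETCD_ENDPOINTS="${ETCD_ENDPOINTS:-https://127.0.0.1:2379}"
ETCD_PKI="${ETCD_PKI:-/etc/kubernetes/pki/etcd}"

etcd_health() {
  ETCDCTL_API=3 etcdctl \
    --endpoints="$ETCD_ENDPOINTS" \
    --cacert="$ETCD_PKI/ca.crt" \
    --cert="$ETCD_PKI/healthcheck-client.crt" \
    --key="$ETCD_PKI/healthcheck-client.key" \
    endpoint health
}
```

Run `etcd_health` as root so the key file is readable. If etcd is unhealthy, the API server will typically fail its readiness checks even when its own process is running, so fix etcd before restarting the API server.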

If the issue is transient, continue monitoring after recovery.

Escalation

Immediately page the on-call engineer
Notify the platform or infrastructure team
If running on managed Kubernetes, escalate to cloud provider support
If unresolved after 15 minutes, treat as a major incident

Related Alerts

KubeControllerManagerDown
KubeSchedulerDown
EtcdDown

Related Dashboards

Grafana → Kubernetes / API Server
Grafana → Control Plane Overview
