====== KubeAPIDown ======

===== Meaning =====

This alert is triggered when Prometheus is unable to scrape the Kubernetes API server metrics. It usually indicates that the API server is unreachable, unresponsive, or not running.

===== Impact =====

This alert represents a **critical control-plane failure**.

Possible impacts include:

  * `kubectl` commands failing or timing out
  * Inability to create, update, or delete Kubernetes resources
  * Controllers and schedulers unable to reconcile cluster state
  * Automation and CI/CD pipelines failing
  * Monitoring data becoming stale or unavailable

If this alert is firing, the cluster is likely **partially or completely unusable**.

===== Diagnosis =====

Check if the Kubernetes API server is reachable:

  kubectl get nodes

If `kubectl` is unresponsive, check API server health endpoints (if accessible):

  curl -k https:///healthz

Check control-plane pod status (for self-managed clusters):

  kubectl get pods -n kube-system | grep kube-apiserver

Describe the API server pod for recent failures:

  kubectl describe pod kube-apiserver- -n kube-system

Check recent cluster-wide events:

  kubectl get events -A --sort-by=.lastTimestamp

If running on managed Kubernetes, check cloud provider control-plane status dashboards.

===== Possible Causes =====

  * API server process crashed or not running
  * Control-plane node failure
  * Network connectivity issues to the API endpoint
  * Certificate expiration or authentication failure
  * Resource exhaustion on control-plane nodes
  * Cloud provider control-plane outage

===== Mitigation =====

  - For managed Kubernetes, verify provider status and open a support ticket if needed
  - For self-managed clusters:
    * Restart the kube-apiserver service or pod
    * Check etcd health and connectivity
    * Resolve networking or DNS issues
  - Verify certificates and rotate if expired
  - Ensure control-plane nodes have sufficient CPU and memory

If the issue is transient, continue monitoring after recovery.

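
The health-endpoint check from the Diagnosis section can be linked to the Possible Causes list with a small triage helper. The following is a minimal sketch, not part of any Kubernetes tooling: the function names `classify_healthz` and `probe_healthz` and the host argument are hypothetical, and the mapping from curl exit codes (6, 7, 28) and HTTP statuses to causes is an assumption about how you might want to triage, so adjust it to your environment.

```shell
#!/usr/bin/env bash
# Sketch only: classify the outcome of a /healthz probe.
# classify_healthz and probe_healthz are hypothetical helper names.

classify_healthz() {
  # $1 = curl exit code, $2 = HTTP status code reported by curl
  local curl_exit="$1" http_code="$2"
  case "$curl_exit" in
    0)  ;;  # request completed; inspect the HTTP status below
    6)  echo "dns: API endpoint hostname did not resolve"; return 0 ;;
    7)  echo "unreachable: could not connect to the API endpoint"; return 0 ;;
    28) echo "timeout: API server did not answer in time"; return 0 ;;
    *)  echo "unknown: curl exited with code $curl_exit"; return 0 ;;
  esac
  case "$http_code" in
    200)     echo "healthy: API server answered /healthz" ;;
    401|403) echo "auth: API server is up but rejected the request" ;;
    5??)     echo "unhealthy: API server is up but reports a failing check" ;;
    *)       echo "unknown: unexpected HTTP status $http_code" ;;
  esac
}

probe_healthz() {
  # $1 = API server host and port (assumption: supply the value for your cluster)
  local code exit_code
  # -k skips TLS verification, -m 5 caps the request at five seconds,
  # -w prints only the HTTP status code (000 when no response was received)
  code=$(curl -k -s -o /dev/null -m 5 -w '%{http_code}' "https://$1/healthz")
  exit_code=$?
  classify_healthz "$exit_code" "$code"
}
```

Calling `probe_healthz` with the API server address prints one line whose prefix (`dns`, `unreachable`, `timeout`, `auth`, `unhealthy`, `healthy`) points at the matching mitigation step above. Keeping the classification separate from the probe makes the mapping easy to review and adjust without touching the cluster.
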
===== Escalation =====

  * Immediately page the on-call engineer
  * Notify the platform or infrastructure team
  * If running on managed Kubernetes, escalate to the cloud provider support
  * If unresolved after 15 minutes, treat as a major incident

===== Related Alerts =====

  * KubeControllerManagerDown
  * KubeSchedulerDown
  * EtcdDown

===== Related Dashboards =====

  * Grafana → Kubernetes / API Server
  * Grafana → Control Plane Overview