====== KubeAPIDown ======
===== Meaning =====
This alert is triggered when Prometheus is unable to scrape the Kubernetes API server metrics.
It usually indicates that the API server is unreachable, unresponsive, or not running.
===== Impact =====
This alert represents a **critical control-plane failure**.
Possible impacts include:
* `kubectl` commands failing or timing out
* Inability to create, update, or delete Kubernetes resources
* Controllers and schedulers unable to reconcile cluster state
* Automation and CI/CD pipelines failing
* Monitoring data becoming stale or unavailable
If this alert is firing, the cluster is likely **partially or completely unusable**.
===== Diagnosis =====
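Check if the Kubernetes API server is reachable:
```bash
kubectl get nodes
```
If `kubectl` is unresponsive, check API server health endpoints (if accessible), replacing `<api-server-address>` with the API endpoint. Note that `-k` skips TLS certificate verification, so this check will not reveal an expired or untrusted certificate:
```bash
curl -k https://<api-server-address>/healthz
```
Check control-plane pod status (for self-managed clusters):
```bash
kubectl get pods -n kube-system | grep kube-apiserver
```
Describe the API server pod for recent failures, using the pod name from the previous command:
```bash
kubectl describe pod kube-apiserver-<node-name> -n kube-system
```
Check recent cluster-wide events:
```bash
kubectl get events -A --sort-by=.lastTimestamp
```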
If running on managed Kubernetes, check cloud provider control-plane status dashboards.
===== Possible Causes =====
* API server process crashed or not running
* Control-plane node failure
* Network connectivity issues to the API endpoint
* Certificate expiration or authentication failure
* Resource exhaustion on control-plane nodes
* Cloud provider control-plane outage
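The script below is an added example, not part of the original runbook: it probes `/healthz` with certificate verification enabled (unlike `curl -k`) and maps curl's exit code and the HTTP status onto the causes listed above. The URL and CA-file arguments, the function names and the label strings are assumptions of this example.

```shell
#!/bin/sh
# Added example (not part of the original runbook): probe /healthz with TLS
# verification enabled and map the result onto the "Possible Causes" list.

# classify_probe CURL_EXIT HTTP_CODE -> prints "<label>: <cause from this runbook>"
classify_probe() {
  case "$1" in
    0)  ;;  # transport and TLS succeeded; decide on the HTTP status below
    6)  echo "dns-failure: network connectivity issues to the API endpoint"; return 0 ;;
    7)  echo "connection-refused: API server process or control-plane node down"; return 0 ;;
    28) echo "timeout: network issue or resource exhaustion on control-plane nodes"; return 0 ;;
    35|58|60) echo "tls-failure: certificate expiration or authentication failure"; return 0 ;;
    *)  echo "curl-error-$1: unclassified transport failure"; return 0 ;;
  esac
  case "$2" in
    200)     echo "healthy: API server answers; check the Prometheus scrape path" ;;
    401|403) echo "auth-rejected: certificate expiration or authentication failure" ;;
    5??)     echo "unhealthy: API server up but failing checks; check etcd" ;;
    *)       echo "http-$2: unexpected response" ;;
  esac
}

# probe_apiserver URL CA_FILE -> runs the probe and prints the classification.
# URL and CA_FILE are caller-supplied (assumed: API endpoint and cluster CA bundle).
probe_apiserver() {
  probe_rc=0
  probe_code=$(curl --silent --output /dev/null --max-time 5 \
    --cacert "$2" --write-out '%{http_code}' "$1/healthz") || probe_rc=$?
  classify_probe "$probe_rc" "$probe_code"
}

# Run only when called with both arguments: sh probe.sh <api-url> <ca-file>
if [ "$#" -eq 2 ]; then
  probe_apiserver "$1" "$2"
fi
```

A `tls-failure` result here with a successful `curl -k` probe points at the certificate rather than the API server process.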
===== Mitigation =====
- For managed Kubernetes, verify provider status and open a support ticket if needed
- For self-managed clusters:
  * Restart the kube-apiserver service or pod
  * Check etcd health and connectivity
  * Resolve networking or DNS issues
- Verify certificates and rotate if expired
- Ensure control-plane nodes have sufficient CPU and memory
If the issue is transient, continue monitoring after recovery.
===== Escalation =====
* Immediately page the on-call engineer
* Notify the platform or infrastructure team
* If running on managed Kubernetes, escalate to the cloud provider support
* If unresolved after 15 minutes, treat as a major incident
===== Related Alerts =====
* KubeControllerManagerDown
* KubeSchedulerDown
* EtcdDown
===== Related Dashboards =====
* Grafana → Kubernetes / API Server
* Grafana → Control Plane Overview