User Tools

Site Tools


runbooks:coustom_alerts:kubeapidown

runbooks:coustom_alerts:KubeAPIDown

KubeAPIDown

Meaning

This alert is triggered when Prometheus is unable to scrape the Kubernetes API server metrics. It usually indicates that the API server is unreachable, unresponsive, or not running.

Impact

This alert represents a critical control-plane failure.

Possible impacts include:

  • `kubectl` commands failing or timing out
  • Inability to create, update, or delete Kubernetes resources
  • Controllers and schedulers unable to reconcile cluster state
  • Automation and CI/CD pipelines failing
  • Monitoring data becoming stale or unavailable

If this alert is firing, the cluster is likely partially or completely unusable.

Diagnosis

Check if the Kubernetes API server is reachable:

kubectl get nodes

If `kubectl` is unresponsive, check API server health endpoints (if accessible):

curl -k https://<API_SERVER_ENDPOINT>/healthz

Check control-plane pod status (for self-managed clusters):

kubectl get pods -n kube-system | grep kube-apiserver

Describe the API server pod for recent failures:

kubectl describe pod kube-apiserver-<node> -n kube-system

Check recent cluster-wide events:

kubectl get events -A --sort-by=.lastTimestamp

If running on managed Kubernetes, check cloud provider control-plane status dashboards.

Possible Causes

  • API server process crashed or not running
  • Control-plane node failure
  • Network connectivity issues to the API endpoint
  • Certificate expiration or authentication failure
  • Resource exhaustion on control-plane nodes
  • Cloud provider control-plane outage

Mitigation

  1. For managed Kubernetes, verify provider status and open a support ticket if needed
  2. For self-managed clusters:
    • Restart the kube-apiserver service or pod
    • Check etcd health and connectivity
    • Resolve networking or DNS issues
  3. Verify certificates and rotate if expired
  4. Ensure control-plane nodes have sufficient CPU and memory

If the issue is transient, continue monitoring after recovery.

Escalation

  • Immediately page the on-call engineer
  • Notify the platform or infrastructure team
  • If running on managed Kubernetes, escalate to the cloud provider support
  • If unresolved after 15 minutes, treat as a major incident
  • KubeControllerManagerDown
  • KubeSchedulerDown
  • EtcdDown
  • Grafana → Kubernetes / API Server
  • Grafana → Control Plane Overview
runbooks/coustom_alerts/kubeapidown.txt · Last modified: by admin