runbooks:coustom_alerts:KubeAPIDown

====== KubeAPIDown ======

===== Meaning =====
This alert is triggered when Prometheus is unable to scrape metrics from the Kubernetes API server.
It usually indicates that the API server is unreachable, unresponsive, or not running.
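
In many installations this alert is defined with an expression along the lines of `absent(up{job="apiserver"} == 1)`; the exact rule and job label depend on how Prometheus was set up. To see what Prometheus currently reports for the API server targets, a quick check (the `<PROMETHEUS_ENDPOINT>` placeholder and the `apiserver` job label are assumptions for your environment):

<code bash>
# Query Prometheus directly for the API server scrape targets.
# Adjust the endpoint and job label to match your monitoring stack.
curl -sG "http://<PROMETHEUS_ENDPOINT>/api/v1/query" \
  --data-urlencode 'query=up{job="apiserver"}'
</code>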

===== Impact =====
This alert represents a **critical control-plane failure**.

Possible impacts include:
  * `kubectl` commands failing or timing out
  * Inability to create, update, or delete Kubernetes resources
  * Controllers and schedulers unable to reconcile cluster state
  * Automation and CI/CD pipelines failing
  * Monitoring data becoming stale or unavailable

If this alert is firing, the cluster is likely **partially or completely unusable**.

===== Diagnosis =====
Check if the Kubernetes API server is reachable:

<code bash>
kubectl get nodes
</code>

If `kubectl` is unresponsive, check API server health endpoints (if accessible):

<code bash>
curl -k https://<API_SERVER_ENDPOINT>/healthz
</code>
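
Recent Kubernetes versions also expose more granular health endpoints. A quick check, assuming the endpoints are reachable from where you run it and that unauthenticated access to them is permitted (this varies by cluster):

<code bash>
# Per-check health output from the API server (available on recent versions).
curl -k "https://<API_SERVER_ENDPOINT>/livez?verbose"
curl -k "https://<API_SERVER_ENDPOINT>/readyz?verbose"
</code>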

Check control-plane pod status (for self-managed clusters):

<code bash>
kubectl get pods -n kube-system | grep kube-apiserver
</code>

Describe the API server pod for recent failures:

<code bash>
kubectl describe pod kube-apiserver-<node> -n kube-system
</code>
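
Note that if the API server is completely down, the `kubectl` commands above will fail as well. In that case, inspect a control-plane node directly. The sketch below assumes SSH access to the node, a CRI runtime with `crictl` installed, and kubeadm-default paths:

<code bash>
# Is the kube-apiserver container running on this node?
crictl ps -a | grep kube-apiserver

# Kubelet logs often show why a static pod failed to start.
journalctl -u kubelet --since "15 min ago" | grep -i apiserver

# Verify the static pod manifest is still in place (kubeadm default path).
ls -l /etc/kubernetes/manifests/kube-apiserver.yaml
</code>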

Check recent cluster-wide events:

<code bash>
kubectl get events -A --sort-by=.lastTimestamp
</code>

If running on managed Kubernetes, check cloud provider control-plane status dashboards.

===== Possible Causes =====
  * API server process crashed or not running
  * Control-plane node failure
  * Network connectivity issues to the API endpoint
  * Certificate expiration or authentication failure (see the check below)
  * Resource exhaustion on control-plane nodes
  * Cloud provider control-plane outage
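
A quick way to check whether the API server's serving certificate has expired (a sketch; `<API_SERVER_HOST>` and port 6443 are placeholders for your environment, and `kubeadm certs check-expiration` applies only to kubeadm-managed clusters):

<code bash>
# Expiry date of the certificate presented by the API server.
echo | openssl s_client -connect <API_SERVER_HOST>:6443 2>/dev/null \
  | openssl x509 -noout -enddate

# On kubeadm-managed clusters, list the expiry of all control-plane certificates.
kubeadm certs check-expiration
</code>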

===== Mitigation =====
  - For managed Kubernetes, verify provider status and open a support ticket if needed
  - For self-managed clusters (see the sketch below):
    * Restart the kube-apiserver service or pod
    * Check etcd health and connectivity
    * Resolve networking or DNS issues
  - Verify certificates and rotate if expired
  - Ensure control-plane nodes have sufficient CPU and memory
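
For self-managed (kubeadm-style) clusters, the restart and etcd checks above might look like the sketch below. The manifest path, etcd endpoint, and certificate locations are kubeadm defaults and are assumptions for your environment:

<code bash>
# kube-apiserver usually runs as a static pod; moving its manifest out of the
# manifests directory and back forces the kubelet to recreate it.
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Check etcd health from a control-plane node (kubeadm default paths).
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
</code>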

If the issue is transient, continue monitoring after recovery.

===== Escalation =====
  * Immediately page the on-call engineer
  * Notify the platform or infrastructure team
  * If running on managed Kubernetes, escalate to the cloud provider's support
  * If unresolved after 15 minutes, treat this as a major incident

===== Related Alerts =====
  * KubeControllerManagerDown
  * KubeSchedulerDown
  * EtcdDown

===== Related Dashboards =====
  * Grafana → Kubernetes / API Server
  * Grafana → Control Plane Overview