runbooks:coustom_alerts:KubeAPIDown

====== KubeAPIDown ======

===== Meaning =====
This alert is triggered when Prometheus is unable to scrape metrics from the Kubernetes API server.
It usually indicates that the API server is unreachable, unresponsive, or not running.
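
In many installations this alert is defined with an expression along the lines of `absent(up{job="apiserver"} == 1)`; the exact rule and job label depend on how Prometheus was set up. To see what Prometheus currently reports for the API server targets, a quick check (the `<PROMETHEUS_ENDPOINT>` placeholder and the `apiserver` job label are assumptions for your environment):

<code bash>
# Query Prometheus directly for the API server scrape targets.
# Adjust the endpoint and job label to match your monitoring stack.
curl -sG "http://<PROMETHEUS_ENDPOINT>/api/v1/query" \
  --data-urlencode 'query=up{job="apiserver"}'
</code>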

===== Impact =====
This alert represents a **critical control-plane failure**.

Possible impacts include:
  * `kubectl` commands failing or timing out
  * Inability to create, update, or delete Kubernetes resources
  * Controllers and schedulers unable to reconcile cluster state
  * Automation and CI/CD pipelines failing
  * Monitoring data becoming stale or unavailable

If this alert is firing, the cluster is likely **partially or completely unusable**.

===== Diagnosis =====
Check if the Kubernetes API server is reachable:

<code bash>
kubectl get nodes
</code>

If `kubectl` is unresponsive, check API server health endpoints (if accessible):

<code bash>
curl -k https://<API_SERVER_ENDPOINT>/healthz
</code>
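
Recent Kubernetes versions also expose more granular health endpoints. A quick check, assuming the endpoints are reachable from where you run it and that unauthenticated access to them is permitted (this varies by cluster):

<code bash>
# Per-check health output from the API server (available on recent versions).
curl -k "https://<API_SERVER_ENDPOINT>/livez?verbose"
curl -k "https://<API_SERVER_ENDPOINT>/readyz?verbose"
</code>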

Check control-plane pod status (for self-managed clusters):

<code bash>
kubectl get pods -n kube-system | grep kube-apiserver
</code>

Describe the API server pod for recent failures:

<code bash>
kubectl describe pod kube-apiserver-<node> -n kube-system
</code>
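
Note that if the API server is completely down, the `kubectl` commands above will fail as well. In that case, inspect a control-plane node directly. The sketch below assumes SSH access to the node, a CRI runtime with `crictl` installed, and kubeadm-default paths:

<code bash>
# Is the kube-apiserver container running on this node?
crictl ps -a | grep kube-apiserver

# Kubelet logs often show why a static pod failed to start.
journalctl -u kubelet --since "15 min ago" | grep -i apiserver

# Verify the static pod manifest is still in place (kubeadm default path).
ls -l /etc/kubernetes/manifests/kube-apiserver.yaml
</code>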

Check recent cluster-wide events:

<code bash>
kubectl get events -A --sort-by=.lastTimestamp
</code>

If running on managed Kubernetes, check cloud provider control-plane status dashboards.

===== Possible Causes =====
  * API server process crashed or not running
  * Control-plane node failure
  * Network connectivity issues to the API endpoint
  * Certificate expiration or authentication failure (see the check below)
  * Resource exhaustion on control-plane nodes
  * Cloud provider control-plane outage
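
A quick way to check whether the API server's serving certificate has expired (a sketch; `<API_SERVER_HOST>` and port 6443 are placeholders for your environment, and `kubeadm certs check-expiration` applies only to kubeadm-managed clusters):

<code bash>
# Expiry date of the certificate presented by the API server.
echo | openssl s_client -connect <API_SERVER_HOST>:6443 2>/dev/null \
  | openssl x509 -noout -enddate

# On kubeadm-managed clusters, list the expiry of all control-plane certificates.
kubeadm certs check-expiration
</code>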

===== Mitigation =====
  - For managed Kubernetes, verify provider status and open a support ticket if needed
  - For self-managed clusters (see the sketch below):
    * Restart the kube-apiserver service or pod
    * Check etcd health and connectivity
    * Resolve networking or DNS issues
  - Verify certificates and rotate if expired
  - Ensure control-plane nodes have sufficient CPU and memory
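
For self-managed (kubeadm-style) clusters, the restart and etcd checks above might look like the sketch below. The manifest path, etcd endpoint, and certificate locations are kubeadm defaults and are assumptions for your environment:

<code bash>
# kube-apiserver usually runs as a static pod; moving its manifest out of the
# manifests directory and back forces the kubelet to recreate it.
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Check etcd health from a control-plane node (kubeadm default paths).
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
</code>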

If the issue is transient, continue monitoring after recovery.

===== Escalation =====
  * Immediately page the on-call engineer
  * Notify the platform or infrastructure team
  * If running on managed Kubernetes, escalate to the cloud provider's support
  * If unresolved after 15 minutes, treat this as a major incident

===== Related Alerts =====
  * KubeControllerManagerDown
  * KubeSchedulerDown
  * EtcdDown

===== Related Dashboards =====
  * Grafana → Kubernetes / API Server
  * Grafana → Control Plane Overview