| runbooks:coustom_alerts:kubeapidown [2025/12/13 16:26] – created admin | runbooks:coustom_alerts:kubeapidown [2025/12/14 06:48] (current) – admin | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| runbooks: | runbooks: | ||
| + | |||
| + | ====== KubeAPIDown ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert is triggered when Prometheus is unable to scrape the Kubernetes API server metrics. | ||
| + | It usually indicates that the API server is unreachable, | ||
| + | |||
| + | ===== Impact ===== | ||
| + | This alert represents a **critical control-plane failure**. | ||
| + | |||
| + | Possible impacts include: | ||
| + | * `kubectl` commands failing or timing out | ||
| + | * Inability to create, update, or delete Kubernetes resources | ||
| + | * Controllers and schedulers unable to reconcile cluster state | ||
| + | * Automation and CI/CD pipelines failing | ||
| + | * Monitoring data becoming stale or unavailable | ||
| + | |||
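| + | If this alert is firing, the cluster's control plane is likely **partially or completely unusable**: workloads that are already running generally keep running, but they cannot be managed until the API server recovers. | ||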
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check if the Kubernetes API server is reachable: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get nodes | ||
| + | </code> | ||
| + | |||
| + | If `kubectl` is unresponsive, | ||
| + | |||
| + | <code bash> | ||
| + | curl -k https://< | ||
| + | </code> | ||
| + | |||
| + | Check control-plane pod status (for self-managed clusters): | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pods -n kube-system | grep kube-apiserver | ||
| + | </code> | ||
| + | |||
| + | Describe the API server pod for recent failures: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl describe pod kube-apiserver-< | ||
| + | </code> | ||
| + | |||
| + | Check recent cluster-wide events: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events -A --sort-by=.lastTimestamp | ||
| + | </code> | ||
| + | |||
| + | If running on managed Kubernetes, check cloud provider control-plane status dashboards. | ||
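
The reachability check above can be scripted so that the result is a single verdict rather than raw curl output. The following is a minimal sketch, not part of the original runbook: the function names `classify_status` and `probe_api` are hypothetical, and the `/readyz` path is an assumption (it is a standard API server health endpoint, but your cluster may restrict anonymous access to it, in which case a 401 or 403 still shows that the server is answering).

```shell
#!/usr/bin/env bash
# Sketch: turn an HTTP status code from the API server into a triage verdict.
# classify_status and probe_api are hypothetical helper names.

classify_status() {
  # $1: HTTP status code as printed by curl (000 means no HTTP response at all)
  case "$1" in
    200)      echo "healthy" ;;
    401|403)  echo "reachable-auth-required" ;;
    000|"")   echo "unreachable" ;;
    *)        echo "unhealthy" ;;
  esac
}

probe_api() {
  # $1: base URL of the API endpoint (assumption: supplied by the operator)
  # -k skips TLS verification, as in the runbook's curl example.
  # -m 5 caps the whole request at 5 seconds so a hung endpoint cannot block triage.
  local code
  code=$(curl -k -s -o /dev/null -m 5 -w '%{http_code}' "$1/readyz" 2>/dev/null)
  classify_status "$code"
}
```

Usage, assuming `APISERVER_URL` holds your API endpoint: `probe_api "$APISERVER_URL"`. An `unreachable` verdict points toward network or node failure, while `unhealthy` suggests the process is up but failing its own checks.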
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * API server process crashed or not running | ||
| + | * Control-plane node failure | ||
| + | * Network connectivity issues to the API endpoint | ||
| + | * Certificate expiration or authentication failure | ||
| + | * Resource exhaustion on control-plane nodes | ||
| + | * Cloud provider control-plane outage | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | - For managed Kubernetes, verify provider status and open a support ticket if needed | ||
| + | - For self-managed clusters: | ||
| + | * Restart the kube-apiserver service or pod | ||
| + | * Check etcd health and connectivity | ||
| + | * Resolve networking or DNS issues | ||
| + | - Verify certificates and rotate if expired | ||
| + | - Ensure control-plane nodes have sufficient CPU and memory | ||
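
For the certificate step above, expiry can be checked on a control-plane node before deciding whether rotation is needed. This is a sketch under assumptions: `cert_state` is a hypothetical helper name, `openssl` must be installed on the node, and the certificate path in the usage line is the kubeadm default, which may differ in your installation.

```shell
#!/usr/bin/env bash
# Sketch: report whether a certificate file is missing, invalid, expired,
# expiring soon, or ok. cert_state is a hypothetical helper name.

cert_state() {
  # $1: path to a PEM certificate
  # $2: warning window in days (defaults to 30)
  local cert="$1" warn_days="${2:-30}"

  if [ ! -r "$cert" ]; then
    echo "missing"; return 2
  fi
  # Fail early if the file cannot be parsed as an X.509 certificate.
  if ! openssl x509 -noout -in "$cert" >/dev/null 2>&1; then
    echo "invalid"; return 2
  fi
  # -checkend N exits non-zero if the certificate expires within N seconds.
  if ! openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
    echo "expired"; return 1
  fi
  if ! openssl x509 -checkend "$(( warn_days * 86400 ))" -noout -in "$cert" >/dev/null 2>&1; then
    echo "expiring-soon"; return 0
  fi
  echo "ok"
}
```

Usage, assuming a kubeadm-style layout: `cert_state /etc/kubernetes/pki/apiserver.crt 14`. The check covers one file at a time, so repeat it for each certificate the API server depends on.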
| + | |||
| + | If the issue is transient, continue monitoring after recovery. | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * Immediately page the on-call engineer | ||
| + | * Notify the platform or infrastructure team | ||
| + | * If running on managed Kubernetes, escalate to cloud provider support | ||
| + | * If unresolved after 15 minutes, treat as a major incident | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * KubeControllerManagerDown | ||
| + | * KubeSchedulerDown | ||
| + | * EtcdDown | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Kubernetes / API Server | ||
| + | * Grafana → Control Plane Overview | ||
| + | |||
runbooks/coustom_alerts/kubeapidown.txt · Last modified: by admin
