====== KubernetesPodNotHealthy ======

===== Meaning =====
This alert is triggered when a Kubernetes pod has been in a **non-running state** (`Pending`, `Unknown`, or `Failed`) for more than 4 hours.
It indicates that the pod is unhealthy and not serving its intended workload.
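
As a quick first check, the sketch below lists all pods in the alerting namespace that are currently in a non-running phase; the namespace comes from the alert labels, so substitute it if running the command by hand.

<code bash>
# List pods that are not Running (and not Completed) in the affected namespace
kubectl get pods -n {{ $labels.namespace }} \
  --field-selector=status.phase!=Running,status.phase!=Succeeded
</code>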
| + | |||
| + | ===== Impact ===== | ||
| + | A pod remaining non-running for extended periods can cause: | ||
| + | * Application downtime | ||
| + | * Service degradation or unavailability | ||
| + | * Failed deployments or incomplete updates | ||
| + | * Potential cascading failures if other pods depend on it | ||
| + | |||
This alert is **critical** and requires prompt investigation.
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check pod status: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | Check events for reasons of failure: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp | ||
| + | </ | ||
| + | |||
| + | Check logs for container errors: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers | ||
| + | </ | ||
| + | |||
| + | For multi-container pods, check individual container states: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json | jq ' | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * CrashLoopBackOff due to application errors | ||
| + | * ImagePullBackOff or missing images | ||
| + | * Insufficient resources on the node (CPU/ | ||
| + | * Pod scheduling failures due to node constraints | ||
| + | * Configuration errors or misconfigured readiness/ | ||
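
To separate resource and scheduling problems from application errors, a sketch of node-side checks (assumes metrics-server is installed for `kubectl top`):

<code bash>
# Node capacity and current allocations (look for pressure conditions and over-commit)
kubectl describe nodes | grep -A 7 "Allocated resources"

# Live node utilisation; requires metrics-server
kubectl top nodes
</code>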
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Investigate logs and restart the pod if appropriate: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | - Resolve resource constraints (increase node capacity, adjust limits/ | ||
| + | - Fix configuration issues, container image problems, or dependency failures | ||
| + | - Reschedule pods to healthy nodes using taints/ | ||
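
For the resource step, one option is `kubectl set resources` on the workload that owns the pod; the deployment name and values below are placeholders, not taken from this runbook:

<code bash>
# Placeholder deployment name and values; adjust to the owning workload
kubectl set resources deployment/<deployment-name> -n {{ $labels.namespace }} \
  --requests=cpu=250m,memory=256Mi --limits=cpu=500m,memory=512Mi

# Roll the workload to pick up fixed configuration or images
kubectl rollout restart deployment/<deployment-name> -n {{ $labels.namespace }}
</code>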
| + | |||
| + | ===== Escalation ===== | ||
| + | * Escalate if pod remains unhealthy after mitigation | ||
| + | * Page on-call engineer if production services are impacted | ||
| + | * Monitor related pods or services for cascading failures | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * PodCrashLoopBackOff | ||
| + | * PodPending | ||
| + | * KubernetesNodeMemoryPressure | ||
| + | * KubernetesNodeDiskPressure | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Kubernetes / Pods Overview | ||
| + | * Grafana → Namespace Health | ||
| + | |||