====== KubernetesPodNotHealthy ======

===== Meaning =====

This alert is triggered when a Kubernetes pod has been in a **non-running state** (`Pending`, `Unknown`, or `Failed`) for more than 4 hours. It indicates that the pod is unhealthy and not serving its intended workload.

===== Impact =====

A pod remaining non-running for extended periods can cause:

  * Application downtime
  * Service degradation or unavailability
  * Failed deployments or incomplete updates
  * Potential cascading failures if other pods depend on it

This alert is **critical**, as prolonged pod unhealthiness directly affects applications.

===== Diagnosis =====

Check pod status:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>

Check events for reasons of failure:

<code bash>
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
</code>

Check logs for container errors:

<code bash>
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
</code>

For multi-container pods, check individual container states:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json | jq '.status.containerStatuses'
</code>

===== Possible Causes =====

  * CrashLoopBackOff due to application errors
  * ImagePullBackOff or missing images
  * Insufficient resources on the node (CPU/memory/disk)
  * Pod scheduling failures due to node constraints
  * Configuration errors or misconfigured readiness/liveness probes

===== Mitigation =====

  - Investigate logs and restart the pod if appropriate: <code bash>kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}</code>
  - Resolve resource constraints (increase node capacity, adjust limits/requests)
  - Fix configuration issues, container image problems, or dependency failures
  - Reschedule pods to healthy nodes using taints/tolerations or affinity rules

===== Escalation =====

  * Escalate if pod remains unhealthy after mitigation
  * Page on-call engineer if production services are impacted
  * Monitor related pods or services for cascading failures

===== Related Alerts =====

  * PodCrashLoopBackOff
  * PodPending
  * KubernetesNodeMemoryPressure
  * KubernetesNodeDiskPressure

===== Related Dashboards =====

  * Grafana → Kubernetes / Pods Overview
  * Grafana → Namespace Health
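
===== Example: Mapping Reasons to Mitigation Steps =====

The possible causes and mitigation steps on this page can be sketched as a small lookup, which may help during triage. This is a minimal sketch rather than an official tool: the function name `suggest_next_step` is hypothetical, and the mapping from reason strings to steps is an assumption drawn from the lists above. The reason strings themselves (`CrashLoopBackOff`, `ImagePullBackOff`, `ErrImagePull`, `CreateContainerConfigError`, `OOMKilled`, `Evicted`, `Unschedulable`) are values Kubernetes reports in pod and container status.

```shell
#!/bin/sh
# Hypothetical triage helper (not part of the original runbook).
# Takes one reason string as reported by Kubernetes and prints the
# matching mitigation step from this page.
suggest_next_step() {
    case "$1" in
        CrashLoopBackOff)
            echo "Investigate container logs for application errors" ;;
        ImagePullBackOff|ErrImagePull)
            echo "Fix container image problems (missing image or failed pull)" ;;
        CreateContainerConfigError)
            echo "Fix configuration issues" ;;
        OOMKilled|Evicted)
            echo "Resolve resource constraints (increase node capacity, adjust limits/requests)" ;;
        Unschedulable)
            echo "Resolve node constraints (capacity, taints/tolerations, affinity rules)" ;;
        *)
            # Assumed fallback: anything unmapped goes back to manual diagnosis.
            echo "No mapping for '$1': check events and escalate if the pod remains unhealthy" ;;
    esac
}

# Example call, assuming POD and NAMESPACE are set by the operator:
#   reason=$(kubectl get pod "$POD" -n "$NAMESPACE" \
#       -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
#   suggest_next_step "$reason"
```

The lookup only covers the causes listed on this page. A reason that falls through to the default branch still needs the manual diagnosis steps.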