====== KubernetesPodNotHealthy ======
===== Meaning =====
This alert is triggered when a Kubernetes pod has been in a **non-running state** (`Pending`, `Unknown`, or `Failed`) for more than 4 hours.
It indicates that the pod is unhealthy and not serving its intended workload.
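To see which pods currently match this condition, the following is a minimal sketch. It assumes `kubectl` is already configured for the affected cluster. The function name `list_non_running_pods` is illustrative and not part of the alert definition. Pods in `Succeeded` are filtered out as well, since completed Jobs are not unhealthy.

```shell
# List pods whose phase is neither Running nor Succeeded.
# Usage: list_non_running_pods [namespace]   (no argument = all namespaces)
# NOTE: the function name is hypothetical; only the kubectl flags are real.
list_non_running_pods() {
  local ns="${1:-}"
  local selector='status.phase!=Running,status.phase!=Succeeded'
  if [ -n "$ns" ]; then
    kubectl get pods -n "$ns" --field-selector="$selector" -o wide
  else
    kubectl get pods --all-namespaces --field-selector="$selector" -o wide
  fi
}
```

Calling `list_non_running_pods my-namespace` narrows the list to one namespace. The field selector only reflects the current phase, not how long the pod has been in it.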
===== Impact =====
A pod remaining non-running for extended periods can cause:
* Application downtime
* Service degradation or unavailability
* Failed deployments or incomplete updates
* Potential cascading failures if other pods depend on it
This alert is **critical**: a pod that stays outside the `Running` state for this long is not serving its workload, so the affected application runs with reduced or no capacity.
===== Diagnosis =====
Check pod status:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
Check events for reasons of failure:
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
Check logs for container errors (a pod that never started, for example one stuck in `Pending`, has no logs to show):
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
For multi-container pods, check individual container states:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json | jq '.status.containerStatuses'
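The diagnosis commands above can be gathered into one file for sharing during an incident. This is a sketch under assumptions: the function name `collect_pod_diagnostics`, the section markers, and the 200-line log tail are illustrative choices, and the events query is narrowed to the pod with a field selector.

```shell
# Collect the diagnosis output for one pod into a single file.
# Usage: collect_pod_diagnostics <namespace> <pod> [output-file]
# NOTE: function name and output layout are hypothetical.
collect_pod_diagnostics() {
  local ns="$1" pod="$2"
  local out="${3:-${pod}-diagnostics.txt}"
  {
    echo "### status"
    kubectl get pod "$pod" -n "$ns" -o wide
    echo "### describe"
    kubectl describe pod "$pod" -n "$ns"
    echo "### events"
    kubectl get events -n "$ns" \
      --field-selector "involvedObject.name=$pod" \
      --sort-by=.lastTimestamp
    echo "### logs"
    # Pods that never started have no logs, so a failure here is not fatal.
    kubectl logs "$pod" -n "$ns" --all-containers --tail=200 \
      || echo "(no logs available)"
  } > "$out" 2>&1
  # Print the path so callers can attach the file to a ticket.
  echo "$out"
}
```

For example, `collect_pod_diagnostics my-namespace my-pod` writes `my-pod-diagnostics.txt` in the current directory and prints that path.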
===== Possible Causes =====
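* CrashLoopBackOff due to application errors, typically in an init container, which keeps the pod in `Pending` (a crash loop in a regular container usually leaves the pod phase as `Running`)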
* ImagePullBackOff or missing images
* Insufficient resources on the node (CPU/memory/disk)
* Pod scheduling failures due to node constraints
* Configuration errors or misconfigured readiness/liveness probes
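The causes listed above usually surface as a reason string in `kubectl describe pod` or in the pod status. As a rough sketch, the helper below maps common reasons to a first check. The function name `suggest_next_step` and the wording of each suggestion are assumptions for illustration, and the mapping is not exhaustive.

```shell
# Map a reason string from the pod status to a suggested first check.
# NOTE: hypothetical helper; extend the cases to match your environment.
# One way to read container waiting reasons:
#   kubectl get pod <pod> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
suggest_next_step() {
  case "$1" in
    CrashLoopBackOff)
      echo "Check container logs with --previous for the crash cause" ;;
    ImagePullBackOff|ErrImagePull)
      echo "Verify the image name, tag, and imagePullSecrets" ;;
    OOMKilled)
      echo "Compare memory usage against the container memory limit" ;;
    Unschedulable)
      echo "Check node capacity, taints, and affinity rules" ;;
    CreateContainerConfigError)
      echo "Check that referenced ConfigMaps and Secrets exist" ;;
    *)
      echo "Inspect 'kubectl describe pod' output and events" ;;
  esac
}
```

Unknown reasons fall through to the generic suggestion, so the helper is safe to call with whatever string the status contains.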
===== Mitigation =====
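- Investigate logs and restart the pod if appropriate. Deleting a pod restarts it only when a controller (Deployment, StatefulSet, DaemonSet, Job) manages it; a standalone pod is not recreated: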
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
- Resolve resource constraints (increase node capacity, adjust limits/requests)
- Fix configuration issues, container image problems, or dependency failures
- Reschedule pods to healthy nodes using taints/tolerations or affinity rules
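Because deleting a pod only restarts it when a controller recreates it, a guard around the delete can prevent accidental removal of a standalone pod. The sketch below assumes `kubectl` access to the cluster; the function name `restart_managed_pod` and its messages are hypothetical.

```shell
# Delete a pod only when it has an owner that is expected to recreate it.
# Usage: restart_managed_pod <namespace> <pod>
# NOTE: hypothetical helper, not part of any existing tooling.
restart_managed_pod() {
  local ns="$1" pod="$2" owner
  # Empty output means the pod has no ownerReferences (a standalone pod).
  owner="$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}')" || return 1
  if [ -z "$owner" ]; then
    echo "Pod $pod has no owner; deleting it would not recreate it" >&2
    return 1
  fi
  echo "Pod $pod is owned by: $owner; deleting so it is recreated"
  kubectl delete pod "$pod" -n "$ns"
}
```

The function returns a non-zero status without deleting anything when the pod has no owner, leaving that decision to the operator.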
===== Escalation =====
* Escalate if pod remains unhealthy after mitigation
* Page on-call engineer if production services are impacted
* Monitor related pods or services for cascading failures
===== Related Alerts =====
* PodCrashLoopBackOff
* PodPending
* KubernetesNodeMemoryPressure
* KubernetesNodeDiskPressure
===== Related Dashboards =====
* Grafana → Kubernetes / Pods Overview
* Grafana → Namespace Health