KubernetesPodNotHealthy
Meaning
This alert is triggered when a Kubernetes pod has been in a non-running state (`Pending`, `Unknown`, or `Failed`) for more than 4 hours. It indicates that the pod is unhealthy and not serving its intended workload.
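To see which pods currently sit in one of the phases named above, the cluster can be queried directly. The following is a minimal sketch: `is_alerting_phase` and `list_alerting_pods` are hypothetical helper names, the phase list is taken from the description above rather than from the alert rule itself, and the 4-hour duration is not checked.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the given pod phase is one of the
# phases this runbook names as unhealthy (Pending, Unknown, Failed).
is_alerting_phase() {
  case "$1" in
    Pending|Unknown|Failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: prints "namespace/pod phase" for every pod whose
# phase is in the list above. It does not check how long the pod has
# been in that phase.
list_alerting_pods() {
  kubectl get pods --all-namespaces --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase |
    while read -r ns name phase; do
      if is_alerting_phase "$phase"; then
        echo "$ns/$name $phase"
      fi
    done
}
```

Running `list_alerting_pods` with no arguments prints one line per matching pod; an empty result suggests the pod has already recovered or been removed.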
Impact
A pod remaining non-running for extended periods can cause:
- Application downtime
- Service degradation or unavailability
- Failed deployments or incomplete updates
- Potential cascading failures if other pods depend on it
This alert is critical: a pod that stays non-running for hours directly reduces the capacity or availability of the application it belongs to.
Diagnosis
Check pod status:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
Check recent events in the namespace for the failure reason (most recent events are listed last):
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
Check logs for container errors:
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
For multi-container pods, check individual container states:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json | jq '.status.containerStatuses'
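When `jq` is not available, the same container states can be summarized with kubectl's built-in JSONPath output. The sketch below is illustrative only: `container_reasons` and `previous_logs` are hypothetical helper names, and the choice of 50 log lines is an arbitrary assumption.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints "container<TAB>waiting reason<TAB>last
# terminated reason" for each container that reports at least one reason.
# Usage: container_reasons <pod> <namespace>
container_reasons() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' |
    awk -F'\t' '$2 != "" || $3 != ""'
}

# Hypothetical helper: shows the tail of the logs from the previous
# instance of one container, which is often where a crash is recorded.
# Usage: previous_logs <pod> <namespace> <container>
previous_logs() {
  kubectl logs "$1" -n "$2" -c "$3" --previous --tail=50
}
```

A typical sequence is to run `container_reasons` first, then call `previous_logs` for each container it lists. `--previous` returns an error if the container has not been restarted, which is expected for containers that never started.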
Possible Causes
- CrashLoopBackOff due to application errors
- ImagePullBackOff or missing images
- Insufficient resources on the node (CPU/memory/disk)
- Pod scheduling failures due to node constraints
- Configuration errors or misconfigured readiness/liveness probes
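The causes above usually surface as a short reason string in the pod status or events. As a rough aid, the sketch below maps some common reason strings to the cause categories in this list. `suggest_cause` is a hypothetical helper, and the mapping is an assumption meant as a starting point rather than an exhaustive or authoritative table.

```shell
#!/usr/bin/env bash

# Hypothetical helper: maps a Kubernetes status reason string to one of
# the cause categories listed in this runbook. The mapping is a heuristic.
# Usage: suggest_cause <reason>
suggest_cause() {
  case "$1" in
    CrashLoopBackOff)
      echo "application error: check container logs" ;;
    ImagePullBackOff|ErrImagePull)
      echo "image problem: check image name, tag, and pull secrets" ;;
    Unschedulable)
      echo "scheduling failure: check node resources, taints, and affinity" ;;
    Evicted)
      echo "node resource pressure: check node CPU, memory, and disk" ;;
    CreateContainerConfigError)
      echo "configuration error: check referenced ConfigMaps and Secrets" ;;
    *)
      echo "unknown: inspect events and pod description" ;;
  esac
}
```

For example, `suggest_cause ErrImagePull` points to the image category, while an unrecognized reason falls through to the generic advice.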
Mitigation
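- Investigate logs first, then restart the pod by deleting it if appropriate. A pod managed by a controller such as a Deployment or StatefulSet is recreated automatically; a standalone pod is not: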
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
- Resolve resource constraints (increase node capacity, adjust limits/requests)
- Fix configuration issues, container image problems, or dependency failures
- Reschedule pods to healthy nodes using taints/tolerations or affinity rules
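Deleting a pod only acts as a restart when something recreates it. The sketch below checks for an owner reference before deleting. `pod_owner_kinds` and `restart_if_managed` are hypothetical helper names, and refusing to delete unowned pods is an assumed policy, not a rule stated in this runbook.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the kinds of the pod's owners
# (for example "ReplicaSet"), or nothing if the pod has no owner.
# Usage: pod_owner_kinds <pod> <namespace>
pod_owner_kinds() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.metadata.ownerReferences[*].kind}'
}

# Hypothetical helper: deletes the pod only if it has an owner that is
# expected to recreate it; otherwise prints a warning and returns 1.
# Usage: restart_if_managed <pod> <namespace>
restart_if_managed() {
  pod="$1"
  ns="$2"
  owners="$(pod_owner_kinds "$pod" "$ns")"
  if [ -z "$owners" ]; then
    echo "refusing to delete $ns/$pod: no owner, it would not be recreated" >&2
    return 1
  fi
  kubectl delete pod "$pod" -n "$ns"
}
```

For a standalone pod, save its manifest with `kubectl get pod <pod> -n <namespace> -o yaml` before deleting it manually, so it can be recreated.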
Escalation
- Escalate if pod remains unhealthy after mitigation
- Page on-call engineer if production services are impacted
- Monitor related pods or services for cascading failures
Related Alerts
- PodCrashLoopBackOff
- PodPending
- KubernetesNodeMemoryPressure
- KubernetesNodeDiskPressure
Related Dashboards
- Grafana → Kubernetes / Pods Overview
- Grafana → Namespace Health
