====== KubernetesPodNotHealthy ======

===== Meaning =====
This alert fires when a Kubernetes pod has been in a **non-running phase** (`Pending`, `Unknown`, or `Failed`) for more than 4 hours. Pods in the `Running` or `Succeeded` phase do not trigger it.
It indicates that the pod is unhealthy and not serving its intended workload.
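
To see which pods currently match the alert condition, a minimal sketch along the following lines may help. It assumes `kubectl` access to the affected cluster; the function name `list_not_healthy_pods` is illustrative and not part of the alert rule. It reports the current phase only, not how long the pod has been in it.

<code bash>
# List pods whose phase is neither Running nor Succeeded,
# i.e. Pending, Unknown, or Failed.
# Optional first argument: a namespace (defaults to all namespaces).
list_not_healthy_pods() {
  if [ -n "${1:-}" ]; then
    kubectl get pods -n "$1" \
      --field-selector=status.phase!=Running,status.phase!=Succeeded
  else
    kubectl get pods --all-namespaces \
      --field-selector=status.phase!=Running,status.phase!=Succeeded
  fi
}
</code>

Calling `list_not_healthy_pods` with no argument scans the whole cluster; passing a namespace narrows the output to that namespace.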

===== Impact =====
A pod remaining non-running for extended periods can cause:
  * Application downtime
  * Service degradation or unavailability
  * Failed deployments or incomplete updates
  * Potential cascading failures if other pods depend on it

Severity is **critical**: a pod that stays unhealthy for this long directly affects the applications that depend on it.

===== Diagnosis =====
Check pod status:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>

Check recent events in the namespace for the failure reason:

<code bash>
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
</code>

Check logs for container errors (add `--previous` to see the output of a container's last terminated instance):

<code bash>
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
</code>

For multi-container pods, check individual container states:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json | jq '.status.containerStatuses'
</code>

===== Possible Causes =====
  * CrashLoopBackOff due to application errors (for example in an init container, which keeps the pod in `Pending`)
  * ImagePullBackOff or missing images
  * Insufficient resources on the node (CPU/memory/disk)
  * Pod scheduling failures due to node constraints
  * Configuration errors or misconfigured readiness/liveness probes
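
The causes above can often be told apart by the waiting reason that each container reports. The following is a rough sketch, not a definitive classifier: the function names `waiting_reasons` and `cause_hint` are hypothetical, only a few common reasons are mapped, and only regular containers (not init containers) are inspected.

<code bash>
# Print "<container>\t<waiting reason>" for each regular container.
# Containers that are not in a waiting state print an empty reason.
waiting_reasons() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}'
}

# Map a waiting reason to the most likely cause from the list above.
cause_hint() {
  case "$1" in
    CrashLoopBackOff)              echo "application error: check container logs" ;;
    ImagePullBackOff|ErrImagePull) echo "image problem: check image name, tag, and pull secrets" ;;
    CreateContainerConfigError)    echo "configuration error: check referenced ConfigMaps and Secrets" ;;
    "")                            echo "no waiting reason: check scheduling events and node resources" ;;
    *)                             echo "unrecognized reason: see kubectl describe pod" ;;
  esac
}
</code>

Run `waiting_reasons` with the pod name and namespace from the alert, then pass each reported reason to `cause_hint` to decide where to look first.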

===== Mitigation =====
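  - Investigate logs, then delete the pod if appropriate so that its controller (for example a Deployment or StatefulSet) recreates it. A pod without a controller is not recreated after deletion: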

<code bash>
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>

  - Resolve resource constraints (increase node capacity, adjust limits/requests)
  - Fix configuration issues, container image problems, or dependency failures
  - Reschedule pods to healthy nodes using taints/tolerations or affinity rules
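
Before deleting a pod, it can be worth confirming that something will recreate it. The sketch below assumes that a non-empty `ownerReferences` list means a controller manages the pod; the function names `pod_owner_kinds` and `delete_if_managed` are hypothetical helpers, not existing tooling.

<code bash>
# Print the kinds of the pod's owners (e.g. ReplicaSet); empty for a bare pod.
pod_owner_kinds() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.metadata.ownerReferences[*].kind}'
}

# Delete the pod only if it has an owner that is expected to recreate it.
# Returns 2 without deleting anything when the pod has no owner.
delete_if_managed() {
  local pod="$1" ns="$2" owners
  owners="$(pod_owner_kinds "$pod" "$ns")" || return 1
  if [ -z "$owners" ]; then
    echo "Pod $ns/$pod has no owner; not deleting (it would not be recreated)." >&2
    return 2
  fi
  echo "Pod $ns/$pod is owned by: $owners; deleting so the owner can recreate it."
  kubectl delete pod "$pod" -n "$ns"
}
</code>

Call `delete_if_managed` with the pod name and namespace from the alert. A bare pod is left in place so that its manifest can be reviewed and reapplied by hand.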

===== Escalation =====
  * Escalate if the pod remains unhealthy after mitigation
  * Page the on-call engineer if production services are impacted
  * Monitor related pods and services for cascading failures
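
One way to look for cascading failures is to list the other pods on the same node as the affected pod. This is only a sketch under the assumption that the problem is node-related; `pod_node` and `pods_on_same_node` are illustrative names, and a pod that was never scheduled has no node to inspect.

<code bash>
# Print the node a pod is scheduled on (empty if it was never scheduled).
pod_node() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# List every pod on that node so neighbours can be checked for the same symptoms.
pods_on_same_node() {
  local node
  node="$(pod_node "$1" "$2")" || return 1
  if [ -z "$node" ]; then
    echo "Pod $2/$1 is not scheduled on any node." >&2
    return 2
  fi
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$node"
}
</code>

If several pods on the node show the same symptoms, the node-level alerts listed under Related Alerts are a good next place to look.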

===== Related Alerts =====
  * PodCrashLoopBackOff
  * PodPending
  * KubernetesNodeMemoryPressure
  * KubernetesNodeDiskPressure

===== Related Dashboards =====
  * Grafana → Kubernetes / Pods Overview
  * Grafana → Namespace Health