KubernetesPodNotHealthy

Meaning

This alert is triggered when a Kubernetes pod has been in a non-running state (`Pending`, `Unknown`, or `Failed`) for more than 4 hours. It indicates that the pod is unhealthy and not serving its intended workload.
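To see which pods currently sit in one of the phases named above, a cluster-wide listing can be filtered by phase. This is a minimal sketch, not the alert's actual rule: the function name `list_non_running_pods` is hypothetical, it assumes `kubectl` is already configured for the affected cluster, and it does not reproduce the 4-hour duration check.

```shell
# Hypothetical helper: list pods whose phase is Pending, Unknown, or Failed.
# Assumes kubectl is already pointed at the affected cluster.
# Prints one "namespace/pod phase" line per matching pod.
list_non_running_pods() {
  kubectl get pods --all-namespaces --no-headers \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase \
    | awk '$3 == "Pending" || $3 == "Unknown" || $3 == "Failed" { print $1 "/" $2 " " $3 }'
}
```

Calling `list_non_running_pods` with no arguments prints the candidates; how long each pod has been in that phase still has to be checked with `kubectl describe`.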

Impact

A pod remaining non-running for extended periods can cause:

  • Application downtime
  • Service degradation or unavailability
  • Failed deployments or incomplete updates
  • Potential cascading failures if other pods depend on it

This alert is critical because a pod that stays out of the `Running` phase cannot serve the workload assigned to it.

Diagnosis

Check pod status:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}

Check recent events in the namespace for the reason of the failure:

kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
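In a busy namespace the event list can be long. One possible way to narrow it to the affected pod is a field selector on the event's involved object. The sketch below assumes the pod name and namespace are passed as arguments; `pod_events` is a hypothetical name.

```shell
# Hypothetical helper: show only the events that reference one pod,
# sorted so the most recent event is printed last.
# Usage: pod_events <namespace> <pod>
pod_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}
```

For example, `pod_events my-namespace my-pod` (both names are placeholders) limits the output to events for that pod.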

Check logs for container errors (a pod whose containers never started, for example one stuck in `Pending`, has no logs):

kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers

For multi-container pods, check individual container states:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json | jq '.status.containerStatuses'
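The container checks above can be extended in two directions: init containers report their state under `.status.initContainerStatuses`, and a container that has already restarted keeps the logs of its previous run. The following is a sketch under those assumptions; both function names are hypothetical, and the JSONPath output is empty for containers that are not in a waiting state.

```shell
# Hypothetical helper: print "<container><TAB><waiting reason>" for init
# containers (prefixed with "init:") and regular containers of one pod.
# Usage: container_waiting_reasons <namespace> <pod>
container_waiting_reasons() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .status.initContainerStatuses[*]}init:{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}'
}

# Hypothetical helper: show logs from the previous run of one container,
# which is useful when the current instance has not produced output yet.
# Usage: previous_logs <namespace> <pod> <container>
previous_logs() {
  kubectl logs "$2" -n "$1" -c "$3" --previous
}
```

`previous_logs` returns an error from `kubectl` if the container has never restarted, which is itself a useful signal.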

Possible Causes

  • CrashLoopBackOff due to application errors
  • ImagePullBackOff or missing images
  • Insufficient resources on the node (CPU/memory/disk)
  • Pod scheduling failures due to node constraints
  • Configuration errors or misconfigured readiness/liveness probes
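The reason strings reported by `kubectl describe` or in the container statuses can be mapped onto the causes listed above. The lookup below is a rough sketch to speed up triage; `likely_cause` is a hypothetical name and the mapping is a starting point, not an exhaustive list.

```shell
# Hypothetical helper: map a pod or container status reason to the likely
# cause category from the list above.
# Usage: likely_cause <reason>
likely_cause() {
  case "$1" in
    CrashLoopBackOff)
      echo "application error: container keeps exiting" ;;
    ImagePullBackOff|ErrImagePull|InvalidImageName)
      echo "image problem: check image name, tag, and pull secrets" ;;
    OOMKilled|Evicted)
      echo "insufficient resources: check node pressure and limits/requests" ;;
    Unschedulable)
      echo "scheduling failure: check node selectors, taints, and capacity" ;;
    CreateContainerConfigError)
      echo "configuration error: check referenced ConfigMaps and Secrets" ;;
    *)
      echo "unknown: inspect kubectl describe output" ;;
  esac
}
```

For example, `likely_cause ImagePullBackOff` points to the image-related checks, while an unrecognized reason falls through to manual inspection.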

Mitigation

  1. Investigate logs and restart the pod if appropriate. Deleting the pod restarts it only if a controller (Deployment, StatefulSet, DaemonSet, or Job) recreates it:
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
  2. Resolve resource constraints (increase node capacity, adjust limits/requests)
  3. Fix configuration issues, container image problems, or dependency failures
  4. Reschedule pods to healthy nodes using taints/tolerations or affinity rules
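Before deleting a pod as a restart, it can help to confirm that something will recreate it. The sketch below checks the pod's `ownerReferences` first and refuses to delete a pod that has none; `restart_managed_pod` is a hypothetical name, and whether a bare pod should be deleted anyway is left to the operator.

```shell
# Hypothetical helper: delete a pod only if a controller owns it, so that a
# replacement is created automatically.
# Usage: restart_managed_pod <namespace> <pod>
restart_managed_pod() {
  owner="$(kubectl get pod "$2" -n "$1" -o jsonpath='{.metadata.ownerReferences[*].kind}')"
  if [ -z "$owner" ]; then
    # No owner means nothing will recreate the pod after deletion.
    echo "pod $1/$2 has no owner; deleting it would not recreate it" >&2
    return 1
  fi
  echo "pod $1/$2 is owned by: $owner"
  kubectl delete pod "$2" -n "$1"
}
```

The function returns a non-zero status without deleting anything when the pod is unmanaged, so it can be used safely in a larger script.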

Escalation

  • Escalate if pod remains unhealthy after mitigation
  • Page on-call engineer if production services are impacted
  • Monitor related pods or services for cascading failures

Related Alerts

  • PodCrashLoopBackOff
  • PodPending
  • KubernetesNodeMemoryPressure
  • KubernetesNodeDiskPressure

Related Dashboards

  • Grafana → Kubernetes / Pods Overview
  • Grafana → Namespace Health

runbooks/coustom_alerts/kubernetespodnothealthy.txt · Last modified: by admin