KubernetesPodCrashLooping

Meaning

This alert is triggered when a Kubernetes pod has restarted more than 10 times in the last 6 hours. It indicates that the pod is crash looping and unable to run stably.

Impact

Crash looping pods can cause:

  • Service degradation or unavailability
  • Increased load on the node due to repeated restarts
  • Potential cascading failures if other pods or services depend on it
  • Deployment or update instability

This alert has severity warning, but treat it as critical if it affects production workloads or multiple pods.

Diagnosis

Check pod status:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}

Check container restart count:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].restartCount}'
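
For a pod with several containers, the query above prints one count per container, separated by spaces. The following is a minimal sketch of adding them into a single total; the function name `sum_restarts` and the `POD` and `NAMESPACE` variables are illustrative and not part of the original runbook. Note that `restartCount` is cumulative since the pod was created, so it can be higher than the number of restarts in the 6-hour alert window.

```shell
# Sum space-separated restart counts (one per container) into one total.
sum_restarts() {
  local total=0 count
  for count in $1; do
    total=$((total + count))
  done
  echo "$total"
}

# Hypothetical usage; POD and NAMESPACE stand in for the alert labels:
#   counts=$(kubectl get pod "$POD" -n "$NAMESPACE" \
#     -o jsonpath='{.status.containerStatuses[*].restartCount}')
#   sum_restarts "$counts"
```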

Inspect pod logs to identify the root cause:

kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers

Check events for errors:

kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
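
In a busy namespace the event list can be long. As a sketch, the events can be narrowed to the affected pod with a field selector on the involved object; the wrapper name `pod_events` and its two arguments are assumptions made for illustration.

```shell
# Print only the events whose involved object is the given pod,
# oldest first. Arguments: pod name, namespace.
pod_events() {
  local pod=$1 namespace=$2
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Hypothetical usage:
#   pod_events "$POD" "$NAMESPACE"
```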

Possible Causes

  • Application crashes due to bugs or misconfiguration
  • Memory exhaustion (container OOMKilled), or CPU throttling that slows the application until its probes fail
  • Missing or incompatible dependencies
  • Failed liveness or startup probes causing restarts (a failed readiness probe alone does not restart a container)
  • Misconfigured environment variables or secrets
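
To narrow down which of these causes applies, the last termination reason and exit code of the container are usually the quickest hint. The sketch below maps them to a short explanation; the function name `explain_termination` and its messages are illustrative assumptions, and the exit-code meanings follow the common convention of 128 plus the signal number rather than anything specific to this cluster.

```shell
# Explain a container's last termination from its reason and exit code.
# Arguments: termination reason, exit code.
explain_termination() {
  local reason=$1 exit_code=$2
  if [ "$reason" = "OOMKilled" ]; then
    echo "OOMKilled: killed by the kernel OOM killer, usually for exceeding its memory limit"
    return
  fi
  case "$exit_code" in
    0)   echo "exit 0: process exited cleanly; check whether it is meant to keep running" ;;
    137) echo "exit 137: terminated by SIGKILL (128 + 9)" ;;
    143) echo "exit 143: terminated by SIGTERM (128 + 15)" ;;
    *)   echo "exit $exit_code: application error; inspect the --previous logs" ;;
  esac
}

# Hypothetical usage for the first container of the pod:
#   reason=$(kubectl get pod "$POD" -n "$NAMESPACE" \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
#   code=$(kubectl get pod "$POD" -n "$NAMESPACE" \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_termination "$reason" "$code"
```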

Mitigation

  1. Investigate logs and fix the root cause
  2. Adjust pod resource requests and limits
  3. Verify container image integrity
  4. Restart the pod after applying fixes:
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
  5. Update deployments or StatefulSets if configuration errors exist
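
Deleting a pod only restarts it when a controller such as a ReplicaSet or StatefulSet recreates it; a standalone pod is simply removed. The following sketch checks for an owning controller before deleting. The function name `restart_pod_if_managed` is hypothetical, and the check is an extra safety step that the original runbook does not prescribe.

```shell
# Delete the pod only if it has an owner that will recreate it.
# Arguments: pod name, namespace.
restart_pod_if_managed() {
  local pod=$1 namespace=$2 owner
  owner=$(kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}')
  if [ -z "$owner" ]; then
    echo "pod $pod has no owning controller; not deleting" >&2
    return 1
  fi
  echo "pod $pod is owned by: $owner; deleting so it is recreated"
  kubectl delete pod "$pod" -n "$namespace"
}

# Hypothetical usage:
#   restart_pod_if_managed "$POD" "$NAMESPACE"
```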

Escalation

  • Escalate if crash looping persists after mitigation
  • Page on-call engineer if production workloads are impacted
  • Monitor other pods in the same namespace for related issues

Related Alerts

  • KubernetesPodNotHealthy
  • PodOOMKilled
  • PodPending
  • KubernetesNodeMemoryPressure

Related Dashboards

  • Grafana → Kubernetes / Pod Restarts
  • Grafana → Namespace Health Overview