PodCrashLoopBackOff

Meaning

This alert fires when a pod container has remained in the `CrashLoopBackOff` state for more than 6 hours. It indicates that the container is crashing repeatedly and Kubernetes is applying an exponential back-off delay between restart attempts.
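The alert usually fires per pod, but it can help to survey the whole cluster for crash-looping containers. A minimal sketch, run here against sample `kubectl get pods -A` output (the namespace and pod names are hypothetical):

```shell
# Sample `kubectl get pods -A` output (names and counts are hypothetical)
cat <<'EOF' > /tmp/pods.txt
NAMESPACE   NAME       READY   STATUS             RESTARTS   AGE
prod        api-7d4f   0/1     CrashLoopBackOff   12         3h
prod        web-5c2a   1/1     Running            0          3h
EOF

# Keep only pods currently in CrashLoopBackOff and print namespace/name
grep CrashLoopBackOff /tmp/pods.txt | awk '{print $1 "/" $2}'
```

Against a live cluster, the same filter would be piped directly: `kubectl get pods -A | grep CrashLoopBackOff`.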

Impact

This alert represents a critical application-level failure.

Possible impacts include:

  * Reduced capacity or full unavailability of the service the pod backs
  * Failed or degraded requests in services that depend on it
  * If the affected pod is part of a critical service, production impact is likely

Diagnosis

Check the status of the affected pod:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}

Describe the pod to view events and failure reasons:

kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}

Check container logs for crash details:

kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous

If multiple containers exist in the pod:

kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c <container_name> --previous

Check recent events in the namespace:

kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
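When reading the `kubectl describe pod` output, the container's `Last State` block usually pinpoints the crash cause (for example `OOMKilled` with exit code 137). A minimal sketch against a sample excerpt (the values shown are illustrative):

```shell
# Excerpt of `kubectl describe pod` output (values are illustrative)
cat <<'EOF' > /tmp/describe.txt
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  12
EOF

# Pull out the fields that explain the crash loop
grep -E 'Reason|Exit Code|Restart Count' /tmp/describe.txt
```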

Possible Causes

  * Application errors causing the process to exit shortly after start
  * Misconfiguration (missing or invalid environment variables, secrets, or config maps)
  * Out-of-memory kills (`OOMKilled`) due to insufficient resource limits
  * Failing liveness or startup probes, or an incorrect startup command
  * Missing dependencies, such as an unreachable database or downstream service

Mitigation

  1. Review application logs to identify the crash reason
  2. Fix configuration issues (env vars, secrets, config maps)
  3. Increase resource limits if OOMKilled
  4. Fix failing probes or startup commands
  5. Redeploy the pod after applying fixes
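The triage in steps 1–4 can be sketched as a small dispatch on the container's last termination reason. The `reason` value below is a hypothetical stand-in for what the live `kubectl ... -o jsonpath` lookup shown in the comment would return:

```shell
# Stand-in for the live lookup:
#   kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
reason="OOMKilled"   # sample value for illustration

# Route to the matching mitigation step from the list above
case "$reason" in
  OOMKilled) echo "Raise resource limits (step 3)";;
  Error)     echo "Check application logs and config (steps 1-2)";;
  *)         echo "Inspect probes and startup commands (step 4)";;
esac
```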

Restart the pod if safe:

kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}

If the issue persists, scale the workload down temporarily to stop the restart loop (remember to scale it back up once the fix is in place):

kubectl scale deployment <deployment_name> -n {{ $labels.namespace }} --replicas=0

Escalation