====== PodCrashLoopBackOff ======
===== Meaning =====
This alert fires when a pod's container has remained in the ''CrashLoopBackOff'' state for more than 6 hours.
It indicates that the container is crashing repeatedly and Kubernetes is applying an increasing back-off delay between restart attempts.
===== Impact =====
This alert represents a **critical application-level failure**.
Possible impacts include:
* Application or service outage
* Repeated pod restarts causing instability
* Increased load on other replicas or services
* Failed background jobs or controllers
If the affected pod is part of a critical service, production impact is likely.
===== Diagnosis =====
Check the status of the affected pod:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
Describe the pod to view events and failure reasons:
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
Check container logs for crash details:
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
If the pod has multiple containers, specify the container name:
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c <container-name> --previous
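If you are unsure of the container names, list them first:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.spec.containers[*].name}'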
Check recent events in the namespace:
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
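To see why the container last terminated (for example ''OOMKilled'' or a non-zero exit code), inspect its last termination state:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'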
===== Possible Causes =====
* Application crash due to bug or misconfiguration
* Missing or invalid environment variables
* Dependency services unavailable
* Insufficient memory limits causing OOMKills, or CPU starvation causing timeouts
* Failing liveness or startup probes (see the check after this list)
* Incorrect container image or startup command
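To check the probe, startup command, and image configuration mentioned above, inspect the pod spec (the ''grep'' filter is only a convenience; adjust as needed):
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o yaml | grep -E -A 8 'livenessProbe|startupProbe|command:|image:'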
===== Mitigation =====
- Review application logs to identify the crash reason
- Fix configuration issues (env vars, secrets, config maps)
- Increase memory limits if the container was OOMKilled (see the example after this list)
- Fix failing probes or startup commands
- Redeploy the workload after applying the fixes
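For example, to raise the memory limit on the owning Deployment and roll it out again (the workload name and sizes below are placeholders to substitute):
kubectl set resources deployment <deployment-name> -n {{ $labels.namespace }} --limits=memory=512Mi --requests=memory=256Mi
kubectl rollout restart deployment <deployment-name> -n {{ $labels.namespace }}
kubectl rollout status deployment <deployment-name> -n {{ $labels.namespace }}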
If it is safe to do so, restart the pod by deleting it; its controller will recreate it:
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
If the crash loop persists and is causing wider impact, temporarily scale the owning workload down to zero:
kubectl scale deployment <deployment-name> -n {{ $labels.namespace }} --replicas=0
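Once the underlying issue is fixed, scale the workload back to its original replica count:
kubectl scale deployment <deployment-name> -n {{ $labels.namespace }} --replicas=<original-count>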
===== Escalation =====
* Immediately notify the application owner
* If production services are impacted, page the on-call engineer
* If unresolved after 30 minutes, escalate to the platform team
===== Related Alerts =====
* PodNotReady
* HighMemoryUsage
* NodeDown
===== Related Dashboards =====
* Grafana → Kubernetes / Pods
* Grafana → Container Resource Usage