PodCrashLoopBackOff
Meaning
This alert is triggered when a container in a pod has been in the `CrashLoopBackOff` state for more than 6 hours. It indicates that the container is crashing repeatedly and Kubernetes is waiting progressively longer between restart attempts.
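To see at a glance which pods are currently in this state (a quick triage sketch; drop -A in favour of -n <namespace> to narrow the scope):
kubectl get pods -A | grep CrashLoopBackOff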
Impact
This alert represents a critical application-level failure.
Possible impacts include:
- Application or service outage
- Repeated pod restarts causing instability
- Increased load on other replicas or services
- Failed background jobs or controllers
If the affected pod is part of a critical service, production impact is likely.
Diagnosis
Check the status of the affected pod:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
Describe the pod to view events and failure reasons:
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
Check container logs for crash details:
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
If the pod has multiple containers, specify which one to inspect:
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c <container_name> --previous
Check recent events in the namespace:
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
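Check the last termination state of each container to confirm or rule out OOMKills (a jsonpath sketch; the output format is only illustrative):
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{range .status.containerStatuses[*]}{.name}{" exitCode="}{.lastState.terminated.exitCode}{" reason="}{.lastState.terminated.reason}{"\n"}{end}'
An exit code of 137 with reason OOMKilled means the container exceeded its memory limit.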
Possible Causes
- Application crash due to bug or misconfiguration
- Missing or invalid environment variables
- Dependency services unavailable
- Insufficient memory or CPU causing OOMKills
- Failing liveness or readiness probes (see the spec check after this list)
- Incorrect container image or startup command
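To inspect the probes, resource limits, image, and startup command the pod runs with, review the owning workload's spec (a sketch; <deployment_name> is a placeholder for the Deployment that owns the crashing pod, traceable via the pod's Controlled By / owner references):
kubectl get deployment <deployment_name> -n {{ $labels.namespace }} -o yaml | grep -E -A 8 'livenessProbe|readinessProbe|resources:|image:|command:'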
Mitigation
- Review application logs to identify the crash reason
- Fix configuration issues (env vars, secrets, config maps)
- Increase resource limits if OOMKilled (see the example after this list)
- Fix failing probes or startup commands
- Redeploy the pod after applying fixes
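If the crashes are OOMKills, raise the limits directly on the workload (an illustrative sketch; the memory values and placeholder names should be sized from observed usage):
kubectl set resources deployment <deployment_name> -n {{ $labels.namespace }} -c <container_name> --limits=memory=512Mi --requests=memory=256Mi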
Restart the pod if safe:
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
If the issue persists, scale down the workload temporarily:
kubectl scale deployment <deployment_name> -n {{ $labels.namespace }} --replicas=0
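If the crashes began immediately after a rollout, rolling back may be faster than debugging in place (a sketch; check the revision history before undoing):
kubectl rollout history deployment <deployment_name> -n {{ $labels.namespace }}
kubectl rollout undo deployment <deployment_name> -n {{ $labels.namespace }}
Remember to scale the workload back to its original replica count once the underlying issue is fixed.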
Escalation
- Immediately notify the application owner
- If production services are impacted, page the on-call engineer
- If unresolved after 30 minutes, escalate to the platform team
Related Alerts
- PodNotReady
- HighMemoryUsage
- NodeDown
Related Dashboards
- Grafana → Kubernetes / Pods
- Grafana → Container Resource Usage
