runbooks:coustom_alerts:PodCrashLoopBackOff
This alert fires when a container in a pod has remained in the `CrashLoopBackOff` state for more than 6 hours. The container is crashing repeatedly, and Kubernetes is delaying each restart attempt with an increasing back-off.
This alert represents a critical application-level failure.
Possible impacts include:
If the affected pod is part of a critical service, production impact is likely.
Check the status of the affected pod:
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
Describe the pod to view events and failure reasons:
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
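The failure reason shown by `kubectl describe` can also be pulled out directly, which helps when comparing several pods. The following is a minimal sketch, not an official tool: `last_termination` and `explain_exit_code` are hypothetical helper names, and their arguments stand in for the `{{ $labels.pod }}` and `{{ $labels.namespace }}` values from the alert. The exit-code hints are common conventions (128 plus the signal number for signal deaths), and an application may assign its own meanings.

```shell
# Hypothetical helper: print container name, restart count, last termination
# reason and last exit code (tab-separated) for each container in a pod.
last_termination() {
  local pod="$1" namespace="$2"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Hypothetical helper: map a common container exit code to a likely cause.
# These are conventions only; the application may define its own codes.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; the main process may not be long-running" ;;
    1)   echo "general application error; check the logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check image and entrypoint" ;;
    137) echo "killed by SIGKILL (128+9); often OOMKilled" ;;
    139) echo "segmentation fault (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)   echo "unrecognized; check the application's documentation" ;;
  esac
}
```

For example, `last_termination my-pod my-namespace` lists each container, and `explain_exit_code 137` gives a first guess at the cause before reading the logs.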
Check container logs for crash details:
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
If the pod has multiple containers, specify the container name:
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c <container_name> --previous
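When it is not obvious which container is crashing, the per-container log command above can be repeated for every container in the pod. This is a rough sketch under a few assumptions: `previous_logs_all` is a hypothetical helper name, the 50-line tail is an arbitrary choice, and init containers are not covered.

```shell
# Hypothetical helper: show the tail of the previous (crashed) instance's
# logs for every regular container in a pod.
previous_logs_all() {
  local pod="$1" namespace="$2" c
  # List container names from the pod spec (space-separated).
  for c in $(kubectl get pod "$pod" -n "$namespace" \
               -o jsonpath='{.spec.containers[*].name}'); do
    echo "=== $c ==="
    # A container that has never restarted has no previous logs, so
    # tolerate the error and continue with the next container.
    kubectl logs "$pod" -n "$namespace" -c "$c" --previous --tail=50 || true
  done
}
```

Calling `previous_logs_all my-pod my-namespace` prints one labeled section per container.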
Check recent events in the namespace:
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
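In a busy namespace the full event list can be noisy. One possible refinement, sketched here with a hypothetical helper name `pod_events`, is to narrow the same command to events whose involved object is the affected pod by using a field selector.

```shell
# Hypothetical helper: show only events that refer to the given pod,
# oldest first.
pod_events() {
  local pod="$1" namespace="$2"
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

Events are retained for a limited time, so an empty result does not necessarily mean nothing happened.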
If it is safe to do so, delete the pod so that its controller (for example, a Deployment) recreates it:
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
If the issue persists and the pod is managed by a Deployment, temporarily scale the workload down to zero replicas:
kubectl scale deployment <deployment_name> -n {{ $labels.namespace }} --replicas=0
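Because scaling to zero discards the current replica count, it can help to record that count first so the workload can be restored later. The sketch below assumes the workload is a Deployment; `scale_down_with_restore_hint` is a hypothetical helper name, and it only prints the restore command rather than storing it anywhere.

```shell
# Hypothetical helper: read the current replica count, scale the Deployment
# to zero, then print the command that would restore the original count.
scale_down_with_restore_hint() {
  local deployment="$1" namespace="$2" replicas
  replicas="$(kubectl get deployment "$deployment" -n "$namespace" \
                -o jsonpath='{.spec.replicas}')" || return 1
  kubectl scale deployment "$deployment" -n "$namespace" --replicas=0 || return 1
  echo "To restore: kubectl scale deployment $deployment -n $namespace --replicas=$replicas"
}
```

If an autoscaler manages the Deployment, the recorded count may differ from what the autoscaler later chooses, so treat the printed command as a starting point.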