====== PodCrashLoopBackOff ======

===== Meaning =====

This alert is triggered when a pod container has remained in the `CrashLoopBackOff` state for more than 6 hours. It indicates that the container is crashing repeatedly and Kubernetes is applying an increasing back-off delay between restart attempts.

===== Impact =====

This alert represents a **critical application-level failure**. Possible impacts include:

  * Application or service outage
  * Repeated pod restarts causing instability
  * Increased load on other replicas or services
  * Failed background jobs or controllers

If the affected pod is part of a critical service, production impact is likely.

===== Diagnosis =====

Check the status of the affected pod:

  kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}

Describe the pod to view events and failure reasons:

  kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}

Check container logs for crash details:

  kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous

If the pod has multiple containers, specify the container name:

  kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c <container-name> --previous

Check recent events in the namespace:

  kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp

A helper that bundles these commands into one pass is sketched at the end of this page.

===== Possible Causes =====

  * Application crash due to a bug or misconfiguration
  * Missing or invalid environment variables
  * Dependency services unavailable
  * Insufficient memory or CPU causing OOMKills
  * Failing liveness or readiness probes
  * Incorrect container image or startup command

===== Mitigation =====

  - Review application logs to identify the crash reason
  - Fix configuration issues (env vars, secrets, config maps)
  - Increase resource limits if the container was OOMKilled (see the check at the end of this page)
  - Fix failing probes or startup commands
  - Redeploy the pod after applying fixes

Restart the pod if it is safe to do so:

  kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}

If the issue persists, scale down the workload temporarily:

  kubectl scale deployment <deployment-name> -n {{ $labels.namespace }} --replicas=0

===== Escalation =====

  * Immediately notify the application owner
  * If production services are impacted, page the on-call engineer
  * If unresolved after 30 minutes, escalate to the platform team

===== Related Alerts =====

  * PodNotReady
  * HighMemoryUsage
  * NodeDown

===== Related Dashboards =====

  * Grafana → Kubernetes / Pods
  * Grafana → Container Resource Usage
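===== Example Triage Snippets =====

A minimal sketch of a triage script that bundles the Diagnosis commands above into one pass. The script name and the namespace/pod/container arguments are illustrative placeholders rather than part of the alert payload, and the sketch assumes the caller already has kubectl access to the affected cluster.

  #!/usr/bin/env bash
  # crashloop-triage.sh <namespace> <pod> [container]
  # Illustrative helper, not a managed tool: gathers status, events, and crash logs in one pass.
  set -euo pipefail

  NAMESPACE="$1"
  POD="$2"
  CONTAINER="${3:-}"

  echo "== Pod status =="
  kubectl get pod "$POD" -n "$NAMESPACE" -o wide

  echo "== Pod events and failure reasons =="
  kubectl describe pod "$POD" -n "$NAMESPACE" | tail -n 40

  echo "== Logs from the previous (crashed) container run =="
  if [ -n "$CONTAINER" ]; then
    kubectl logs "$POD" -n "$NAMESPACE" -c "$CONTAINER" --previous || true
  else
    kubectl logs "$POD" -n "$NAMESPACE" --previous || true
  fi

  echo "== Recent namespace events =="
  kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -n 20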
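To confirm whether the crashes are OOMKills before raising resource limits, the last termination reason can be read from the pod status. This is a sketch using standard kubectl JSONPath output; it prints one line per container in the affected pod.

  # Show the last termination reason for each container in the pod.
  # "OOMKilled" indicates the previous run was ended by the memory limit rather than an application exit.
  kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'

If the reason is OOMKilled, raise the container's memory limit in the workload spec and redeploy.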