PodCrashLoopBackOff

Meaning

This alert is triggered when a pod's container has remained in the CrashLoopBackOff state for more than 6 hours. It means the container is crashing repeatedly and Kubernetes is applying an increasing back-off delay between restart attempts.
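
A quick way to confirm the state, and to see whether other pods in the cluster are affected (a general check, not the alert's own query):

kubectl get pods --all-namespaces | grep CrashLoopBackOff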

Impact

This alert represents a critical application-level failure.

Possible impacts include:

  • Application or service outage
  • Repeated pod restarts causing instability
  • Increased load on other replicas or services
  • Failed background jobs or controllers

If the affected pod is part of a critical service, production impact is likely.

Diagnosis

Check the status of the affected pod:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}

Describe the pod to view events and failure reasons:

kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}

Check container logs for crash details:

kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
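
If there is no previous terminated container to read from, fetch the current logs instead:

kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --tail=100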

If multiple containers exist in the pod:

kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c <container_name> --previous
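
To list the container names available for the -c flag:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.spec.containers[*].name}'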

Check recent events in the namespace:

kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
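
The pod status also records each container's last termination reason and exit code (for example OOMKilled), which often points directly at the cause; the fields are empty if the container has not terminated before:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'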

Possible Causes

  • Application crash due to bug or misconfiguration
  • Missing or invalid environment variables
  • Dependency services unavailable
  • Memory limits set too low, causing the container to be OOMKilled, or severe CPU throttling
  • Failing liveness or readiness probes
  • Incorrect container image or startup command (see the spec check after this list)
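
The configured image, startup command, and liveness probe can be read directly from the pod spec (shown here for the first container; adjust the index for multi-container pods):

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].command}{"\n"}{.spec.containers[0].livenessProbe}{"\n"}'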

Mitigation

  1. Review application logs to identify the crash reason
  2. Fix configuration issues (env vars, secrets, config maps)
  3. Increase resource limits if the container is being OOMKilled (example after this list)
  4. Fix failing probes or startup commands
  5. Redeploy the pod after applying fixes
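
For step 3, limits can be raised on the owning workload; an illustrative example for a Deployment (the deployment name and values are placeholders and should be tuned to the application's actual usage):

kubectl set resources deployment <deployment_name> -n {{ $labels.namespace }} --limits=memory=1Gi --requests=memory=512Mi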

Restart the pod if safe:

kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}

If the issue persists, scale down the workload temporarily:

kubectl scale deployment <deployment_name> -n {{ $labels.namespace }} --replicas=0
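
Once the underlying issue is fixed, scale the workload back up to its normal replica count (placeholder shown):

kubectl scale deployment <deployment_name> -n {{ $labels.namespace }} --replicas=<original_replica_count>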

Escalation

  • Immediately notify the application owner
  • If production services are impacted, page the on-call engineer
  • If unresolved after 30 minutes, escalate to the platform team

Related Alerts

  • PodNotReady
  • HighMemoryUsage
  • NodeDown

Dashboards

  • Grafana → Kubernetes / Pods
  • Grafana → Container Resource Usage