====== PodCrashLoopBackOff ======

===== Meaning =====

This alert is triggered when a pod container has remained in the `CrashLoopBackOff` state for more than 6 hours. It indicates that the container is crashing repeatedly and Kubernetes is applying an increasing back-off delay between restart attempts.

===== Impact =====

This alert represents a **critical application-level failure**. Possible impacts include:

  * Application or service outage
  * Repeated pod restarts causing instability
  * Increased load on other replicas or services
  * Failed background jobs or controllers

If the affected pod is part of a critical service, production impact is likely.

===== Diagnosis =====

Check the status of the affected pod:

  kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}

Describe the pod to view events and failure reasons:

  kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}

Check container logs for crash details:

  kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous

If the pod has multiple containers, specify the container name:

  kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c <container-name> --previous

Check recent events in the namespace:

  kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp

A helper that bundles these commands into one pass is sketched at the end of this page.

===== Possible Causes =====

  * Application crash due to a bug or misconfiguration
  * Missing or invalid environment variables
  * Dependency services unavailable
  * Insufficient memory or CPU causing OOMKills
  * Failing liveness or readiness probes
  * Incorrect container image or startup command

===== Mitigation =====

  - Review application logs to identify the crash reason
  - Fix configuration issues (env vars, secrets, config maps)
  - Increase resource limits if the container was OOMKilled (see the check at the end of this page)
  - Fix failing probes or startup commands
  - Redeploy the pod after applying fixes

Restart the pod if it is safe to do so:

  kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}

If the issue persists, scale down the workload temporarily:

  kubectl scale deployment <deployment-name> -n {{ $labels.namespace }} --replicas=0

===== Escalation =====

  * Immediately notify the application owner
  * If production services are impacted, page the on-call engineer
  * If unresolved after 30 minutes, escalate to the platform team

===== Related Alerts =====

  * PodNotReady
  * HighMemoryUsage
  * NodeDown

===== Related Dashboards =====

  * Grafana → Kubernetes / Pods
  * Grafana → Container Resource Usage
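===== Example Triage Snippets =====

A minimal sketch of a triage script that bundles the Diagnosis commands above into one pass. The script name and the namespace/pod/container arguments are illustrative placeholders rather than part of the alert payload, and the sketch assumes the caller already has kubectl access to the affected cluster.

  #!/usr/bin/env bash
  # crashloop-triage.sh <namespace> <pod> [container]
  # Illustrative helper, not a managed tool: gathers status, events, and crash logs in one pass.
  set -euo pipefail

  NAMESPACE="$1"
  POD="$2"
  CONTAINER="${3:-}"

  echo "== Pod status =="
  kubectl get pod "$POD" -n "$NAMESPACE" -o wide

  echo "== Pod events and failure reasons =="
  kubectl describe pod "$POD" -n "$NAMESPACE" | tail -n 40

  echo "== Logs from the previous (crashed) container run =="
  if [ -n "$CONTAINER" ]; then
    kubectl logs "$POD" -n "$NAMESPACE" -c "$CONTAINER" --previous || true
  else
    kubectl logs "$POD" -n "$NAMESPACE" --previous || true
  fi

  echo "== Recent namespace events =="
  kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -n 20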
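To confirm whether the crashes are OOMKills before raising resource limits, the last termination reason can be read from the pod status. This is a sketch using standard kubectl JSONPath output; it prints one line per container in the affected pod.

  # Show the last termination reason for each container in the pod.
  # "OOMKilled" indicates the previous run was ended by the memory limit rather than an application exit.
  kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'

If the reason is OOMKilled, raise the container's memory limit in the workload spec and redeploy.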