====== PodCrashLoopBackOff ======

===== Meaning =====
This alert fires when a pod container has remained in the ''CrashLoopBackOff'' state for more than 6 hours.
It indicates that the container is crashing repeatedly and Kubernetes is applying an increasing back-off delay between restart attempts.
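
To gauge whether this is an isolated pod or a wider problem, you can list every pod currently in this state; a quick sketch that greps the STATUS column:

<code bash>
# List all pods cluster-wide whose status shows CrashLoopBackOff
kubectl get pods -A | grep CrashLoopBackOff
</code>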
| + | |||
| + | ===== Impact ===== | ||
| + | This alert represents a **critical application-level failure**. | ||
| + | |||
| + | Possible impacts include: | ||
| + | * Application or service outage | ||
| + | * Repeated pod restarts causing instability | ||
| + | * Increased load on other replicas or services | ||
| + | * Failed background jobs or controllers | ||
| + | |||
| + | If the affected pod is part of a critical service, production impact is likely. | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check the status of the affected pod: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | Describe the pod to view events and failure reasons: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | Check container logs for crash details: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous | ||
| + | </ | ||
| + | |||
| + | If multiple containers exist in the pod: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c < | ||
| + | </ | ||
| + | |||
| + | Check recent events in the namespace: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Application crash due to bug or misconfiguration | ||
| + | * Missing or invalid environment variables | ||
| + | * Dependency services unavailable | ||
| + | * Insufficient memory or CPU causing OOMKills | ||
| + | * Failing liveness or readiness probes | ||
| + | * Incorrect container image or startup command | ||
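
To confirm or rule out an OOMKill, inspect the last termination reason recorded in the pod status; this prints ''OOMKilled'' if the kernel killed the container for exceeding its memory limit:

<code bash>
# Print the termination reason of the last crashed container instance
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
</code>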
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Review application logs to identify the crash reason | ||
| + | - Fix configuration issues (env vars, secrets, config maps) | ||
| + | - Increase resource limits if OOMKilled | ||
| + | - Fix failing probes or startup commands | ||
| + | - Redeploy the pod after applying fixes | ||
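
One way to raise the memory limit is ''kubectl set resources''; the deployment name, container name, and limit value below are placeholders to adapt:

<code bash>
# Raise the memory limit on one container of the owning deployment
kubectl set resources deployment <deployment-name> -n {{ $labels.namespace }} \
  -c <container-name> --limits=memory=512Mi
</code>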
| + | |||
| + | Restart the pod if safe: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | If the issue persists, scale down the workload temporarily: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl scale deployment < | ||
| + | </ | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * Immediately notify the application owner | ||
| + | * If production services are impacted, page the on-call engineer | ||
| + | * If unresolved after 30 minutes, escalate to the platform team | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * PodNotReady | ||
| + | * HighMemoryUsage | ||
| + | * NodeDown | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Kubernetes / Pods | ||
| + | * Grafana → Container Resource Usage | ||
| + | |||