====== PodCrashLoopBackOff ======

===== Meaning =====
This alert fires when a container in a pod has remained in the ''CrashLoopBackOff'' state for more than 6 hours.
It indicates that the container is crashing repeatedly and Kubernetes is waiting with exponentially increasing back-off delays before each restart attempt.
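
As a quick first check (a minimal sketch; add a namespace filter if needed), you can list every pod currently stuck in this state:

<code bash>
# List all pods across the cluster whose STATUS column shows CrashLoopBackOff
kubectl get pods -A | grep CrashLoopBackOff
</code>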

===== Impact =====
This alert represents a **critical application-level failure**.

Possible impacts include:
  * Application or service outage
  * Repeated pod restarts causing instability
  * Increased load on other replicas or services
  * Failed background jobs or controllers

If the affected pod is part of a critical service, production impact is likely.

===== Diagnosis =====
Check the status of the affected pod:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>

Describe the pod to view events and failure reasons:

<code bash>
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>
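
The ''Last State'' block in the describe output usually names the failure reason and exit code. A sketch to filter just that part:

<code bash>
# Show only each container's last terminated state (reason, exit code, timestamps)
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }} | grep -A 5 'Last State'
</code>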

Check the logs of the previous (crashed) container instance for crash details:

<code bash>
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
</code>
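
If the crashed instance produced a lot of output, limiting and timestamping the tail speeds up triage (a sketch; both flags are standard kubectl options):

<code bash>
# Last 100 lines of the crashed container, with timestamps
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous --tail=100 --timestamps
</code>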

If multiple containers exist in the pod, target one explicitly:

<code bash>
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c <container_name> --previous
</code>

Check recent events in the namespace:

<code bash>
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
</code>
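
To confirm whether the most recent crash was an OOM kill, the terminated reason can be read straight from the pod status (a jsonpath sketch):

<code bash>
# Prints e.g. "OOMKilled" or "Error" for each container's last termination
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
</code>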

===== Possible Causes =====
  * Application crash due to a bug or misconfiguration
  * Missing or invalid environment variables
  * Dependency services unavailable
  * Insufficient memory causing OOM kills, or CPU starvation slowing startup past probe deadlines
  * Failing liveness or readiness probes (see the probe check after this list)
  * Incorrect container image or startup command
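
For the probe-related cause above, the configured liveness probe can be dumped for inspection (a sketch; ''readinessProbe'' works the same way):

<code bash>
# Show the liveness probe configuration of each container in the pod
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} \
  -o jsonpath='{.spec.containers[*].livenessProbe}'
</code>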

===== Mitigation =====
  - Review application logs to identify the crash reason
  - Fix configuration issues (env vars, secrets, config maps)
  - Increase memory limits if the container was OOMKilled
  - Fix failing probes or startup commands
  - Redeploy the workload after applying fixes (see the rollout sketch below)
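
For the redeploy step, if the pod is managed by a Deployment, a rollout restart or rollback is usually the cleanest option (a sketch; ''<deployment_name>'' is a placeholder for the owning Deployment):

<code bash>
# Trigger a fresh rollout after configuration fixes
kubectl rollout restart deployment <deployment_name> -n {{ $labels.namespace }}

# Or roll back to the previous revision if a recent deploy introduced the crash
kubectl rollout undo deployment <deployment_name> -n {{ $labels.namespace }}
</code>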

Restart the pod if it is safe to do so (the owning controller will recreate it):

<code bash>
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>

If the issue persists, scale down the workload temporarily:

<code bash>
kubectl scale deployment <deployment_name> -n {{ $labels.namespace }} --replicas=0
</code>
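
Once the root cause is fixed, remember to scale the workload back up (the replica count below is a placeholder for the original value):

<code bash>
kubectl scale deployment <deployment_name> -n {{ $labels.namespace }} --replicas=<original_replica_count>
</code>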

===== Escalation =====
  * Immediately notify the application owner
  * If production services are impacted, page the on-call engineer
  * If unresolved after 30 minutes, escalate to the platform team

===== Related Alerts =====
  * PodNotReady
  * HighMemoryUsage
  * NodeDown

===== Related Dashboards =====
  * Grafana → Kubernetes / Pods
  * Grafana → Container Resource Usage