====== PodCrashLoopBackOff ======

===== Meaning =====
This alert fires when a pod container has remained in the ''CrashLoopBackOff'' state for more than 6 hours.
It indicates that the container is crashing repeatedly and Kubernetes is applying an increasing back-off delay between restart attempts.
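
To gauge whether this is an isolated pod or a wider problem, you can list every pod currently in this state; a quick sketch that greps the STATUS column:

<code bash>
# List all pods cluster-wide whose status shows CrashLoopBackOff
kubectl get pods -A | grep CrashLoopBackOff
</code>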
| + | |||
| + | ===== Impact ===== | ||
| + | This alert represents a **critical application-level failure**. | ||
| + | |||
| + | Possible impacts include: | ||
| + | * Application or service outage | ||
| + | * Repeated pod restarts causing instability | ||
| + | * Increased load on other replicas or services | ||
| + | * Failed background jobs or controllers | ||
| + | |||
| + | If the affected pod is part of a critical service, production impact is likely. | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check the status of the affected pod: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | Describe the pod to view events and failure reasons: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | Check container logs for crash details: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous | ||
| + | </ | ||
| + | |||
| + | If multiple containers exist in the pod: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} -c < | ||
| + | </ | ||
| + | |||
| + | Check recent events in the namespace: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Application crash due to bug or misconfiguration | ||
| + | * Missing or invalid environment variables | ||
| + | * Dependency services unavailable | ||
| + | * Insufficient memory or CPU causing OOMKills | ||
| + | * Failing liveness or readiness probes | ||
| + | * Incorrect container image or startup command | ||
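
To confirm or rule out an OOMKill, inspect the last termination reason recorded in the pod status; this prints ''OOMKilled'' if the kernel killed the container for exceeding its memory limit:

<code bash>
# Print the termination reason of the last crashed container instance
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
</code>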
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Review application logs to identify the crash reason | ||
| + | - Fix configuration issues (env vars, secrets, config maps) | ||
| + | - Increase resource limits if OOMKilled | ||
| + | - Fix failing probes or startup commands | ||
| + | - Redeploy the pod after applying fixes | ||
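
One way to raise the memory limit is ''kubectl set resources''; the deployment name, container name, and limit value below are placeholders to adapt:

<code bash>
# Raise the memory limit on one container of the owning deployment
kubectl set resources deployment <deployment-name> -n {{ $labels.namespace }} \
  -c <container-name> --limits=memory=512Mi
</code>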
| + | |||
| + | Restart the pod if safe: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | If the issue persists, scale down the workload temporarily: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl scale deployment < | ||
| + | </ | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * Immediately notify the application owner | ||
| + | * If production services are impacted, page the on-call engineer | ||
| + | * If unresolved after 30 minutes, escalate to the platform team | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * PodNotReady | ||
| + | * HighMemoryUsage | ||
| + | * NodeDown | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Kubernetes / Pods | ||
| + | * Grafana → Container Resource Usage | ||
| + | |||