====== KubernetesPodCrashLooping ======

===== Meaning =====

This alert fires when a Kubernetes pod has **restarted more than 10 times in the last 6 hours**. It indicates that the pod is **crash looping** and unable to run stably.

===== Impact =====

Crash looping pods can cause:

  * Service degradation or unavailability
  * Increased load on the node due to repeated restarts
  * Potential cascading failures if other pods or services depend on the affected pod
  * Deployment or update instability

This alert has **warning** severity, but can become critical if it affects production workloads or multiple pods.

===== Diagnosis =====

Check the pod status:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>

Check the container restart count:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].restartCount}'
</code>

Inspect the pod logs to identify the root cause:

<code bash>
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
</code>

Check namespace events for errors:

<code bash>
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
</code>

===== Possible Causes =====

  * Application crashes due to bugs or misconfiguration
  * Memory or CPU resource exhaustion (OOMKilled)
  * Missing or incompatible dependencies
  * Failing readiness/liveness probes causing restarts
  * Misconfigured environment variables or secrets

===== Mitigation =====

  - Investigate the logs and fix the root cause
  - Adjust pod resource requests and limits
  - Verify container image integrity
  - Restart the pod after applying fixes:

<code bash>
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>

  - Update the owning Deployment or StatefulSet if configuration errors exist

===== Escalation =====

  * Escalate if crash looping persists after mitigation
  * Page the on-call engineer if production workloads are impacted
  * Monitor other pods in the same namespace for related issues

===== Related Alerts =====

  * KubernetesPodNotHealthy
  * PodOOMKilled
  * PodPending
  * KubernetesNodeMemoryPressure

===== Related Dashboards =====

  * Grafana → Kubernetes / Pod Restarts
  * Grafana → Namespace Health Overview
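
===== Alert Definition (Reference) =====

The condition described in the Meaning section (more than 10 restarts in the last 6 hours) is commonly implemented as a Prometheus alerting rule over the kube-state-metrics counter ''kube_pod_container_status_restarts_total''. The following is a minimal sketch, not the authoritative rule for this environment; the group name, metric availability, and annotation text are assumptions, and the rule actually deployed may differ:

<code yaml>
groups:
  - name: kubernetes-pods   # hypothetical group name
    rules:
      - alert: KubernetesPodCrashLooping
        # Fires when a container restarted more than 10 times
        # over the trailing 6-hour window.
        expr: increase(kube_pod_container_status_restarts_total[6h]) > 10
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
</code>

Because ''increase()'' is computed per container, a pod with several containers can match multiple times; aggregate with ''sum by (namespace, pod)'' if one alert per pod is preferred.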