====== KubernetesPodCrashLooping ======
===== Meaning =====
This alert is triggered when a Kubernetes pod has **restarted more than 10 times in the last 6 hours**.
It indicates that the pod is **crash looping** and unable to run stably.
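The exact rule depends on your monitoring stack; assuming Prometheus with kube-state-metrics, an alert expression matching the threshold above typically looks like:

```promql
# Hypothetical rule for this alert, assuming the standard
# kube-state-metrics restart counter is available
increase(kube_pod_container_status_restarts_total[6h]) > 10
```

Check your actual rule definition for the precise threshold and window before relying on these numbers.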
===== Impact =====
Crash looping pods can cause:
* Service degradation or unavailability
* Increased load on the node due to repeated restarts
* Potential cascading failures if other pods or services depend on it
* Deployment or update instability
This alert has **warning** severity, but it can become critical if production workloads or multiple pods are affected.
===== Diagnosis =====
Check pod status:
  kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
  kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
Check container restart counts:
  kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].restartCount}'
Inspect pod logs to identify the root cause:
  kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
  kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
Check recent events for errors:
  kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
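The jsonpath query prints one number per container. A quick way to total them, assuming a hypothetical pod whose containers have restarted 3, 12, and 0 times:

```shell
# Sample jsonpath output copied from the query above (hypothetical values)
counts="3 12 0"

# Sum the per-container restart counts to get the pod-wide total
total=0
for c in $counts; do
  total=$((total + c))
done
echo "total restarts: $total"
```

The container whose count keeps climbing is the one whose logs to inspect first.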
===== Possible Causes =====
* Application crashes due to bugs or misconfiguration
* Memory or CPU resource exhaustion (OOMKilled)
* Missing or incompatible dependencies
* Failed readiness/liveness probes causing restarts
* Misconfigured environment variables or secrets
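For the probe-related cause above, overly aggressive probe timings are a common culprit: the app is killed before it finishes starting. A sketch of a more forgiving liveness probe (the endpoint, port, and timings are illustrative assumptions, not taken from this alert):

```yaml
# Illustrative liveness probe; path, port, and timings are assumptions
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to start before probing
  periodSeconds: 10
  failureThreshold: 3       # restart only after 3 consecutive failures
```

Compare these values against the pod's actual startup time before tuning.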
===== Mitigation =====
  - Investigate logs and fix the root cause in the application or its configuration
  - Adjust pod resource requests and limits if containers are OOMKilled or CPU-throttled
  - Verify the container image is correct and pullable (right tag, registry credentials)
  - Restart the pod after applying fixes:
  kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
  - Update the owning Deployment or StatefulSet if configuration errors are found
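For the resource-adjustment step above, a minimal sketch of requests and limits in the container spec (the values are illustrative; size them from observed usage, not from this example):

```yaml
# Illustrative values only; tune from actual metrics before applying
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # a limit set too low causes OOMKilled restarts
    cpu: "500m"
```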
===== Escalation =====
* Escalate if crash looping persists after mitigation
* Page on-call engineer if production workloads are impacted
* Monitor other pods in the same namespace for related issues
===== Related Alerts =====
* KubernetesPodNotHealthy
* PodOOMKilled
* PodPending
* KubernetesNodeMemoryPressure
===== Related Dashboards =====
* Grafana → Kubernetes / Pod Restarts
* Grafana → Namespace Health Overview