====== KubernetesPodCrashLooping ======
===== Meaning =====
This alert is triggered when a Kubernetes pod has **restarted more than 10 times in the last 6 hours**.
It indicates that the pod is **crash looping** and unable to run stably.
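The exact rule depends on your monitoring stack; assuming Prometheus with kube-state-metrics, an alert expression matching the threshold above typically looks like:

```promql
# Hypothetical rule for this alert, assuming the standard
# kube-state-metrics restart counter is available
increase(kube_pod_container_status_restarts_total[6h]) > 10
```

Check your actual rule definition for the precise threshold and window before relying on these numbers.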
===== Impact =====
Crash looping pods can cause:
* Service degradation or unavailability
* Increased load on the node due to repeated restarts
* Potential cascading failures if other pods or services depend on it
* Deployment or update instability
This alert has **warning** severity, but it can become critical if production workloads or multiple pods are affected.
===== Diagnosis =====
Check pod status:
  kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
  kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
Check container restart counts:
  kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].restartCount}'
Inspect pod logs to identify the root cause:
  kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
  kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
Check recent events for errors:
  kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
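The jsonpath query prints one number per container. A quick way to total them, assuming a hypothetical pod whose containers have restarted 3, 12, and 0 times:

```shell
# Sample jsonpath output copied from the query above (hypothetical values)
counts="3 12 0"

# Sum the per-container restart counts to get the pod-wide total
total=0
for c in $counts; do
  total=$((total + c))
done
echo "total restarts: $total"
```

The container whose count keeps climbing is the one whose logs to inspect first.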
===== Possible Causes =====
* Application crashes due to bugs or misconfiguration
* Memory or CPU resource exhaustion (OOMKilled)
* Missing or incompatible dependencies
* Failed readiness/liveness probes causing restarts
* Misconfigured environment variables or secrets
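For the probe-related cause above, overly aggressive probe timings are a common culprit: the app is killed before it finishes starting. A sketch of a more forgiving liveness probe (the endpoint, port, and timings are illustrative assumptions, not taken from this alert):

```yaml
# Illustrative liveness probe; path, port, and timings are assumptions
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to start before probing
  periodSeconds: 10
  failureThreshold: 3       # restart only after 3 consecutive failures
```

Compare these values against the pod's actual startup time before tuning.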
===== Mitigation =====
  - Investigate logs and fix the root cause in the application or its configuration
  - Adjust pod resource requests and limits if containers are OOMKilled or CPU-throttled
  - Verify the container image is correct and pullable (right tag, registry credentials)
  - Restart the pod after applying fixes:
  kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
  - Update the owning Deployment or StatefulSet if configuration errors are found
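For the resource-adjustment step above, a minimal sketch of requests and limits in the container spec (the values are illustrative; size them from observed usage, not from this example):

```yaml
# Illustrative values only; tune from actual metrics before applying
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # a limit set too low causes OOMKilled restarts
    cpu: "500m"
```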
===== Escalation =====
* Escalate if crash looping persists after mitigation
* Page on-call engineer if production workloads are impacted
* Monitor other pods in the same namespace for related issues
===== Related Alerts =====
* KubernetesPodNotHealthy
* PodOOMKilled
* PodPending
* KubernetesNodeMemoryPressure
===== Related Dashboards =====
* Grafana → Kubernetes / Pod Restarts
* Grafana → Namespace Health Overview