Differences

This shows you the differences between two versions of the page.

--- runbooks:coustom_alerts:kubernetespodcrashlooping [2025/12/13 16:37] – created admin
+++ runbooks:coustom_alerts:kubernetespodcrashlooping [2025/12/14 06:59] (current) – admin
@@ Line 1: / Line 1: @@
 runbooks:coustom_alerts:KubernetesPodCrashLooping
+====== KubernetesPodCrashLooping ======
+===== Meaning =====
+This alert is triggered when a Kubernetes pod has **restarted more than 10 times in the last 6 hours**.
+It indicates that the pod is **crash looping** and unable to run stably.
+===== Impact =====
+Crash looping pods can cause:
+  * Service degradation or unavailability
+  * Increased load on the node due to repeated restarts
+  * Potential cascading failures if other pods or services depend on it
+  * Deployment or update instability
+This alert is **warning**, but can become critical if it affects production workloads or multiple pods.
+===== Diagnosis =====
+Check pod status:
+<code bash>
+kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
+kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
+</code>
+Check container restart count:
+<code bash>
+kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].restartCount}'
+</code>
+Inspect pod logs to identify the root cause:
+<code bash>
+kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
+kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
+</code>
+Check events for errors:
+<code bash>
+kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
+</code>
+===== Possible Causes =====
+  * Application crashes due to bugs or misconfiguration
+  * Memory or CPU resource exhaustion (OOMKilled)
+  * Missing or incompatible dependencies
+  * Failed readiness/liveness probes causing restarts
+  * Misconfigured environment variables or secrets
+===== Mitigation =====
+  - Investigate logs and fix the root cause
+  - Adjust pod resource requests and limits
+  - Verify container image integrity
+  - Restart the pod after applying fixes:
+<code bash>
+kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
+</code>
+  - Update deployments or StatefulSets if configuration errors exist
+===== Escalation =====
+  * Escalate if crash looping persists after mitigation
+  * Page on-call engineer if production workloads are impacted
+  * Monitor other pods in the same namespace for related issues
+===== Related Alerts =====
+  * KubernetesPodNotHealthy
+  * PodOOMKilled
+  * PodPending
+  * KubernetesNodeMemoryPressure
+===== Related Dashboards =====
+  * Grafana → Kubernetes / Pod Restarts
+  * Grafana → Namespace Health Overview