runbooks:coustom_alerts:kubernetespodcrashlooping

runbooks:coustom_alerts:kubernetespodcrashlooping [2025/12/13 16:37] – created admin
runbooks:coustom_alerts:kubernetespodcrashlooping [2025/12/14 06:59] (current) admin
 +====== KubernetesPodCrashLooping ======
 +
 +===== Meaning =====
 +This alert is triggered when a Kubernetes pod has **restarted more than 10 times in the last 6 hours**.
 +It indicates that the pod is **crash looping** and unable to run stably.
 +
 +===== Impact =====
 +Crash looping pods can cause:
 +  * Service degradation or unavailability
 +  * Increased load on the node due to repeated restarts
 +  * Potential cascading failures if other pods or services depend on it
 +  * Deployment or update instability
 +
 +This alert has **warning** severity, but treat it as critical if it affects production workloads or multiple pods.
 +
 +===== Diagnosis =====
 +Check pod status:
 +
 +<code bash>
 +kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
 +kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
 +</code>
 +
 +Check container restart count:
 +
 +<code bash>
 +kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].restartCount}'
 +</code>
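
The totals alone do not show which container is restarting or why. The following is a minimal sketch, assuming the pod name and namespace from the alert labels are passed as arguments; the function name `restart_summary` is hypothetical. It prints one line per container with its restart count and the reason and exit code of its last termination. Init containers are reported separately under `.status.initContainerStatuses` and are not covered here.

```bash
# Hypothetical helper: one tab-separated line per container with
# name, restart count, last termination reason, and last exit code.
# Fields that are not set (for example, no previous termination) print as empty.
restart_summary() {
  local pod="$1" namespace="$2"
  kubectl get pod "$pod" -n "$namespace" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Usage (placeholder values, replace with the alert labels):
# restart_summary my-pod my-namespace
```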
 +
 +Inspect pod logs to identify the root cause:
 +
 +<code bash>
 +kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
 +kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
 +</code>
 +
 +Check events for errors:
 +
 +<code bash>
 +kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
 +</code>
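
In a busy namespace the full event list can be noisy. As a sketch, assuming the same pod name and namespace as above (the function name `pod_events` is hypothetical), the events can be narrowed to the affected pod with a field selector:

```bash
# Hypothetical helper: list only the events whose involved object is the
# given pod, sorted by last timestamp.
pod_events() {
  local pod="$1" namespace="$2"
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Usage (placeholder values, replace with the alert labels):
# pod_events my-pod my-namespace
```

Events are only retained for a limited time, so an empty result does not rule out earlier failures.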
 +
 +===== Possible Causes =====
 +  * Application crashes due to bugs or misconfiguration
 +  * Memory exhaustion (container is OOMKilled) or CPU throttling that makes probes time out
 +  * Missing or incompatible dependencies
 +  * Failed liveness or startup probes causing restarts (a failed readiness probe alone does not restart a container)
 +  * Misconfigured environment variables or secrets
 +
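The exit code of the last terminated container often narrows down which of the causes above applies. The following is a rough sketch, not an exhaustive mapping: the function name `explain_exit_code` is hypothetical, and the interpretations follow the common shell convention that an exit code above 128 means "terminated by signal (code minus 128)".

```bash
# Hypothetical helper: give a first-guess interpretation of a container exit code.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exit 0: process ended normally; a long-running container should not exit" ;;
    137) echo "exit 137: SIGKILL (128+9), often the OOM killer or a forced kill" ;;
    139) echo "exit 139: SIGSEGV (128+11), the process crashed on an invalid memory access" ;;
    143) echo "exit 143: SIGTERM (128+15), the process was asked to stop" ;;
    *)
      # Any other code above 128 is treated as "killed by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "exit $code: terminated by signal $((code - 128))"
      else
        echo "exit $code: application error, check the logs from the previous container"
      fi
      ;;
  esac
}

# Usage:
# explain_exit_code 137
```

A code of 137 together with a last termination reason of `OOMKilled` points to the memory limit; 137 without that reason is more likely a forced kill after the termination grace period.
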
 +===== Mitigation =====
 +  - Investigate logs and fix the root cause
 +  - Adjust pod resource requests and limits if the container is OOMKilled or throttled
 +  - Verify that the container image tag and digest are the ones you expect
 +  - Update the Deployment or StatefulSet if configuration errors exist
 +  - Restart the pod after applying fixes (the owning controller recreates it):
 +
 +<code bash>
 +kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
 +</code>
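
When every replica of a workload is affected, restarting through the controller is usually gentler than deleting pods one by one. The following is a sketch under the assumption that the pod is owned by a Deployment, StatefulSet, or DaemonSet; the function name `rollout_restart` and the 5 minute timeout are illustrative choices, not part of this runbook.

```bash
# Hypothetical helper: restart all pods of a workload through its controller
# and wait for the rollout to finish.
# kind is "deployment", "statefulset", or "daemonset".
rollout_restart() {
  local kind="$1" name="$2" namespace="$3"
  kubectl rollout restart "$kind/$name" -n "$namespace" &&
    kubectl rollout status "$kind/$name" -n "$namespace" --timeout=5m
}

# Usage (placeholder values):
# rollout_restart deployment my-app my-namespace
```

If `kubectl rollout status` times out, the new pods are most likely still crash looping and the root cause has not been fixed.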
 +
 +===== Escalation =====
 +  * Escalate if crash looping persists after mitigation
 +  * Page on-call engineer if production workloads are impacted
 +  * Monitor other pods in the same namespace for related issues
 +
 +===== Related Alerts =====
 +  * KubernetesPodNotHealthy
 +  * PodOOMKilled
 +  * PodPending
 +  * KubernetesNodeMemoryPressure
 +
 +===== Related Dashboards =====
 +  * Grafana → Kubernetes / Pod Restarts
 +  * Grafana → Namespace Health Overview
 +
runbooks/coustom_alerts/kubernetespodcrashlooping.txt · Last modified: by admin