runbooks:coustom_alerts:kubernetespodcrashlooping
| runbooks:coustom_alerts:kubernetespodcrashlooping [2025/12/13 16:37] – created admin | runbooks:coustom_alerts:kubernetespodcrashlooping [2025/12/14 06:59] (current) – admin | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| runbooks: | runbooks: | ||
| + | |||
| + | ====== KubernetesPodCrashLooping ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert is triggered when a Kubernetes pod has **restarted more than 10 times in the last 6 hours**. | ||
| + | It indicates that the pod is **crash looping** and unable to run stably. | ||
| + | |||
| + | ===== Impact ===== | ||
| + | Crash looping pods can cause: | ||
| + | * Service degradation or unavailability | ||
| + | * Increased load on the node due to repeated restarts | ||
| + | * Potential cascading failures if other pods or services depend on it | ||
| + | * Deployment or update instability | ||
| + | |||
| + | This alert is **warning**, | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check pod status: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </code> | ||
| + | |||
| + | Check container restart count: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath=' | ||
| + | </code> | ||
| + | |||
| + | Inspect pod logs to identify the root cause: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous | ||
| + | kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers | ||
| + | </code> | ||
| + | |||
| + | Check events for errors: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp | ||
| + | </code> | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Application crashes due to bugs or misconfiguration | ||
| + | * Memory exhaustion (container OOMKilled) or CPU starvation | ||
| + | * Missing or incompatible dependencies | ||
| + | * Failed readiness/ | ||
| + | * Misconfigured environment variables or secrets | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Investigate logs and fix the root cause | ||
| + | - Adjust pod resource requests and limits | ||
| + | - Verify container image integrity | ||
| + | - Recreate the pod after applying fixes (a controller-managed pod is replaced automatically once deleted): | ||
| + | |||
| + | <code bash> | ||
| + | kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </code> | ||
| + | |||
| + | - Update the owning Deployment or StatefulSet if configuration errors exist | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * Escalate if crash looping persists after mitigation | ||
| + | * Page the on-call engineer if production workloads are impacted | ||
| + | * Monitor other pods in the same namespace for related issues | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * KubernetesPodNotHealthy | ||
| + | * PodOOMKilled | ||
| + | * PodPending | ||
| + | * KubernetesNodeMemoryPressure | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Kubernetes / Pod Restarts | ||
| + | * Grafana → Namespace Health Overview | ||
| + | |||
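
The diagnosis steps in this runbook (restart count, last termination state, likely cause) could be combined into a small helper. The following is a minimal sketch, not part of the original page: the function names `classify_termination` and `triage_pod` are hypothetical, the jsonpath expression is one possible query and not necessarily the one the runbook uses, and the pod name and namespace are passed as plain arguments in place of the `{{ $labels.pod }}` and `{{ $labels.namespace }}` template values.

```shell
#!/bin/sh
# Sketch only: classify_termination and triage_pod are hypothetical helper names.

# Map a container's last termination reason and exit code to a likely cause.
classify_termination() {
  reason="$1"
  exit_code="$2"
  if [ "$reason" = "OOMKilled" ]; then
    echo "memory limit exceeded (OOMKilled)"
  elif [ "$exit_code" = "137" ]; then
    echo "killed with SIGKILL (exit 137)"
  elif [ "$exit_code" = "143" ]; then
    echo "terminated with SIGTERM (exit 143)"
  elif [ -z "$reason" ]; then
    echo "no previous termination recorded"
  else
    echo "application exit ($reason, code $exit_code): check logs"
  fi
}

# Print the restart count and a likely cause for each container in the pod.
# Fields are separated by "|" so that empty values are preserved by read.
triage_pod() {
  pod="$1"
  namespace="$2"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"|"}{.restartCount}{"|"}{.lastState.terminated.reason}{"|"}{.lastState.terminated.exitCode}{"\n"}{end}' |
    while IFS='|' read -r name restarts reason exit_code; do
      printf '%s\trestarts=%s\t%s\n' "$name" "$restarts" \
        "$(classify_termination "$reason" "$exit_code")"
    done
}

# Run only when a pod name and a namespace are supplied as arguments.
if [ "$#" -ge 2 ]; then
  triage_pod "$1" "$2"
fi
```

Saved under an assumed name such as `triage.sh`, it would be invoked as `sh triage.sh <pod> <namespace>`. The classification is deliberately coarse: exit codes 137 and 143 only indicate which signal ended the container, so the logs and events checks above are still needed to confirm the root cause.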
runbooks/coustom_alerts/kubernetespodcrashlooping.txt · Last modified: by admin
