KubernetesPodCrashLooping

Meaning

This alert is triggered when a Kubernetes pod has restarted more than 10 times in the last 6 hours. It indicates that the pod is crash looping and unable to run stably.

Impact

Crash looping pods can cause:

This alert fires at warning severity, but treat it as critical if it affects production workloads or multiple pods.

Diagnosis

Check pod status:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}

Check container restart count:

kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o jsonpath='{.status.containerStatuses[*].restartCount}'
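
The raw restart counts above do not show which container each number belongs to. As a minimal sketch, the helper below prints one line per container with its name, restart count, and the reason and exit code of its last termination. The function name `restart_summary` is hypothetical, and its arguments stand in for the alert's pod and namespace labels.

```shell
# Hypothetical helper, not part of kubectl.
# $1 = pod name, $2 = namespace (taken from the alert labels).
restart_summary() {
  # One tab-separated line per container:
  # name, restartCount, last termination reason, last exit code.
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Usage (assumed names): restart_summary my-pod my-namespace
```

An exit code above 128 usually means the container was terminated by a signal (code minus 128). For example, 137 corresponds to SIGKILL, which often accompanies the reason OOMKilled.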

Inspect pod logs to identify the root cause:

kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
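
For pods with several containers, it can help to read only the crashed container's previous logs and to cap the output. The sketch below assumes the failing container's name is already known (for example from `kubectl describe pod`). The function name `previous_logs` and the default of 100 lines are illustrative choices, not part of this runbook.

```shell
# Hypothetical helper, not part of kubectl.
# $1 = pod, $2 = namespace, $3 = container, $4 = number of lines (default 100).
previous_logs() {
  # --previous reads the log of the last terminated instance of the container.
  kubectl logs "$1" -n "$2" -c "$3" --previous --tail="${4:-100}"
}

# Usage (assumed names): previous_logs my-pod my-namespace my-container 200
```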

Check events for errors:

kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
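
In a busy namespace the full event list can be noisy. As a sketch, the events can be narrowed to the affected pod with a field selector. The function name `pod_events` is hypothetical, and its arguments stand in for the alert's pod and namespace labels.

```shell
# Hypothetical helper, not part of kubectl.
# $1 = pod name, $2 = namespace.
pod_events() {
  # Only events whose involved object is this pod, oldest first.
  kubectl get events -n "$2" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Usage (assumed names): pod_events my-pod my-namespace
```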

Possible Causes

Mitigation

  1. Investigate logs and fix the root cause
  2. Adjust pod resource requests and limits
  3. Verify container image integrity
  4. Restart the pod after applying fixes:
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
  5. Update Deployments or StatefulSets if configuration errors exist
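
When the pod is managed by a Deployment or StatefulSet, one possible alternative to deleting pods one by one is to restart the whole workload and wait for the rollout to finish. This is a sketch only: the function name `restart_workload`, the workload reference, and the 5 minute timeout are assumptions, not values from this runbook.

```shell
# Hypothetical helper, not part of kubectl.
# $1 = workload reference such as deployment/my-app (assumed name), $2 = namespace.
restart_workload() {
  # Trigger a rolling restart, then block until it completes or times out.
  kubectl rollout restart "$1" -n "$2" &&
    kubectl rollout status "$1" -n "$2" --timeout=5m
}

# Usage (assumed names): restart_workload deployment/my-app my-namespace
```

If `kubectl rollout status` times out, the new pods are probably still crash looping, and the diagnosis steps should be repeated against them.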

Escalation