====== KubernetesPodNotHealthy ======

===== Meaning =====
This alert is triggered when a Kubernetes pod has been in a **non-running state** (`Pending`, `Unknown`, or `Failed`) for more than 4 hours.
It indicates that the pod is unhealthy and not serving its intended workload.
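
As a quick first check, the sketch below lists all pods in the alerting namespace that are currently in a non-running phase; the namespace comes from the alert labels, so substitute it if running the command by hand.

<code bash>
# List pods that are not Running (and not Completed) in the affected namespace
kubectl get pods -n {{ $labels.namespace }} \
  --field-selector=status.phase!=Running,status.phase!=Succeeded
</code>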
| + | |||
| + | ===== Impact ===== | ||
| + | A pod remaining non-running for extended periods can cause: | ||
| + | * Application downtime | ||
| + | * Service degradation or unavailability | ||
| + | * Failed deployments or incomplete updates | ||
| + | * Potential cascading failures if other pods depend on it | ||
| + | |||
This alert is **critical** and requires prompt investigation.
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Check pod status: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | Check events for reasons of failure: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp | ||
| + | </ | ||
| + | |||
| + | Check logs for container errors: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers | ||
| + | </ | ||
| + | |||
| + | For multi-container pods, check individual container states: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json | jq ' | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * CrashLoopBackOff due to application errors | ||
| + | * ImagePullBackOff or missing images | ||
| + | * Insufficient resources on the node (CPU/ | ||
| + | * Pod scheduling failures due to node constraints | ||
| + | * Configuration errors or misconfigured readiness/ | ||
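
To separate resource and scheduling problems from application errors, a sketch of node-side checks (assumes metrics-server is installed for `kubectl top`):

<code bash>
# Node capacity and current allocations (look for pressure conditions and over-commit)
kubectl describe nodes | grep -A 7 "Allocated resources"

# Live node utilisation; requires metrics-server
kubectl top nodes
</code>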
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Investigate logs and restart the pod if appropriate: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }} | ||
| + | </ | ||
| + | |||
| + | - Resolve resource constraints (increase node capacity, adjust limits/ | ||
| + | - Fix configuration issues, container image problems, or dependency failures | ||
| + | - Reschedule pods to healthy nodes using taints/ | ||
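
For the resource step, one option is `kubectl set resources` on the workload that owns the pod; the deployment name and values below are placeholders, not taken from this runbook:

<code bash>
# Placeholder deployment name and values; adjust to the owning workload
kubectl set resources deployment/<deployment-name> -n {{ $labels.namespace }} \
  --requests=cpu=250m,memory=256Mi --limits=cpu=500m,memory=512Mi

# Roll the workload to pick up fixed configuration or images
kubectl rollout restart deployment/<deployment-name> -n {{ $labels.namespace }}
</code>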
| + | |||
| + | ===== Escalation ===== | ||
| + | * Escalate if pod remains unhealthy after mitigation | ||
| + | * Page on-call engineer if production services are impacted | ||
| + | * Monitor related pods or services for cascading failures | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * PodCrashLoopBackOff | ||
| + | * PodPending | ||
| + | * KubernetesNodeMemoryPressure | ||
| + | * KubernetesNodeDiskPressure | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Kubernetes / Pods Overview | ||
| + | * Grafana → Namespace Health | ||
| + | |||