====== KubernetesPodNotHealthy ======

===== Meaning =====
This alert is triggered when a Kubernetes pod has been in a **non-running state** (`Pending`, `Unknown`, or `Failed`) for more than 4 hours.
It indicates that the pod is unhealthy and not serving its intended workload.
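
The underlying rule can be spot-checked directly against Prometheus. A minimal sketch, assuming kube-state-metrics is scraped and `$PROM_URL` (a placeholder) points at your Prometheus; check the installed rule for the exact expression:

<code bash>
# Query the pod-phase metric behind this alert; any result means a pod
# is currently Pending, Unknown, or Failed ($PROM_URL is a placeholder).
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0'
</code>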

===== Impact =====
A pod remaining non-running for extended periods can cause:
  * Application downtime
  * Service degradation or unavailability
  * Failed deployments or incomplete updates
  * Potential cascading failures if other pods depend on it

This alert is **critical**, as prolonged pod unhealthiness directly affects applications.

===== Diagnosis =====
Check pod status:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }}
kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>
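
If a node-level problem is suspected, `-o wide` also shows which node the pod was scheduled on:

<code bash>
# Adds NODE and IP columns to the standard output
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o wide
</code>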

Check recent events in the namespace for the failure reason:

<code bash>
kubectl get events -n {{ $labels.namespace }} --sort-by=.lastTimestamp
</code>
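
In a busy namespace, the event stream can be narrowed to this pod:

<code bash>
# Only events whose involved object is the affected pod
kubectl get events -n {{ $labels.namespace }} \
  --field-selector involvedObject.name={{ $labels.pod }} --sort-by=.lastTimestamp
</code>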

Check logs for container errors:

<code bash>
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --all-containers
</code>
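
If containers have restarted, the crash output is usually in the previous instance's logs:

<code bash>
# --previous fails with an error if a container has no earlier instance
kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous --all-containers
</code>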

For multi-container pods, check individual container states:

<code bash>
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json | jq '.status.containerStatuses'
</code>
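
Restart counts and the last terminated state often reveal OOM kills or non-zero exit codes:

<code bash>
# Per-container restart count and last termination reason/exit code
kubectl get pod {{ $labels.pod }} -n {{ $labels.namespace }} -o json \
  | jq '.status.containerStatuses[] | {name, restartCount, lastState}'
</code>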

===== Possible Causes =====
  * CrashLoopBackOff due to application errors
  * ImagePullBackOff or missing images
  * Insufficient resources on the node (CPU/memory/disk); see the node checks below
  * Pod scheduling failures due to node constraints
  * Configuration errors or misconfigured readiness/liveness probes
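
When resource or scheduling causes are suspected, check node capacity and current usage (`kubectl top` requires metrics-server to be installed):

<code bash>
# Current CPU/memory usage per node (needs metrics-server)
kubectl top nodes

# Requests/limits already allocated on each node
kubectl describe nodes | grep -A 7 "Allocated resources"
</code>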

===== Mitigation =====
  - Investigate logs and restart the pod if appropriate (a pod managed by a controller is recreated automatically after deletion):

<code bash>
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>
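
If the pod is owned by a Deployment, a rollout restart recreates its pods cleanly. The workload name below is a placeholder; it is not derivable from the alert labels alone:

<code bash>
# <deployment> is a placeholder for the owning workload
# (find it via the ownerReferences in `kubectl describe pod`)
kubectl rollout restart deployment <deployment> -n {{ $labels.namespace }}
</code>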

  - Resolve resource constraints (increase node capacity, adjust limits/requests)
  - Fix configuration issues, container image problems, or dependency failures
  - Reschedule pods to healthy nodes using taints/tolerations or affinity rules (a cordon-based sketch follows)
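
A minimal cordon-based approach, assuming the node name has been identified (e.g. via `kubectl get pod -o wide`; `<node-name>` is a placeholder):

<code bash>
# Mark the suspect node unschedulable, then let the controller
# recreate the pod on a healthy node
kubectl cordon <node-name>
kubectl delete pod {{ $labels.pod }} -n {{ $labels.namespace }}
</code>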

===== Escalation =====
  * Escalate if the pod remains unhealthy after mitigation
  * Page the on-call engineer if production services are impacted
  * Monitor related pods or services for cascading failures

===== Related Alerts =====
  * PodCrashLoopBackOff
  * PodPending
  * KubernetesNodeMemoryPressure
  * KubernetesNodeDiskPressure

===== Related Dashboards =====
  * Grafana → Kubernetes / Pods Overview
  * Grafana → Namespace Health