--- runbooks:coustom_alerts:kubernetesnodeoutofpodcapacity [2025/12/13 16:37] – created admin
+++ runbooks:coustom_alerts:kubernetesnodeoutofpodcapacity [2025/12/14 06:57] (current) – admin
@@ Line 1: / Line 1: @@
 runbooks:coustom_alerts:KubernetesNodeOutOfPodCapacity
+====== KubernetesNodeOutOfPodCapacity ======
+===== Meaning =====
+This alert fires when a Kubernetes node has used **more than 90% of its pod capacity** for more than 2 minutes.
+It indicates that the node has few free allocatable pod slots left, so only a small number of additional pods can be scheduled onto it.
+===== Impact =====
+A node running out of pod capacity can cause:
+  * New pods failing to schedule on the node
+  * Workload imbalance across the cluster
+  * Potential service degradation if no other nodes are available
+  * Increased latency for scheduling or scaling operations
+This alert is marked **warning**, as it may precede node-level failures or application disruptions.
+===== Diagnosis =====
+Check node pod allocation:
+<code bash>
+kubectl get nodes -o wide
+kubectl describe node <NODE_NAME>
+</code>
+Check running pods on the node:
+<code bash>
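+# Select by node name on the server side; a plain grep would also match node-10 when searching for node-1
+kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<NODE_NAME>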
+</code>
+Check node allocatable pods:
+<code bash>
+kubectl get node <NODE_NAME> -o jsonpath='{.status.allocatable.pods}'
+</code>
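The checks above can be combined into a single usage figure. The following is a minimal sketch, not the alert's actual rule: it assumes the alert compares non-terminated pods on the node against `.status.allocatable.pods`, and `usage_percent` and `node_pod_usage` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not part of kubectl.

# usage_percent USED TOTAL -> integer percentage, rounded down.
usage_percent() {
  local used=$1 total=$2
  if [ "$total" -le 0 ]; then
    echo "total must be positive" >&2
    return 1
  fi
  echo $(( used * 100 / total ))
}

# node_pod_usage NODE -> prints "<used>/<allocatable> (<pct>%)".
# Assumption: pods in phase Succeeded or Failed do not occupy a pod slot.
node_pod_usage() {
  local node=$1 used total
  total=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.pods}') || return 1
  used=$(kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=${node},status.phase!=Succeeded,status.phase!=Failed" \
    | wc -l | tr -d ' ')
  echo "${used}/${total} ($(usage_percent "$used" "$total")%)"
}
```

With the kubelet default of 110 pods per node, `usage_percent 104 110` prints `94`, which is above the 90% threshold described in the Meaning section.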
+Check cluster-wide pod distribution:
+<code bash>
+kubectl get pods --all-namespaces -o wide
+</code>
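The wide listing is hard to scan on a busy cluster. A hedged sketch that reduces it to a pod count per node, busiest node first, is shown here; `count_per_node` is a hypothetical helper that reads one node name per line.

```bash
# count_per_node: reads one node name per line on stdin and prints
# "<count> <node>", sorted by count (highest first), then by node name.
count_per_node() {
  awk 'NF { c[$1]++ } END { for (n in c) print c[n], n }' | sort -k1,1nr -k2,2
}

# Usage against a live cluster (pods not yet scheduled appear as "<none>"):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | count_per_node
```

A node whose count is far above the others is a candidate for redistribution.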
+===== Possible Causes =====
+  * Node is heavily loaded with many pods
+  * Deployments whose replicas are concentrated on a single node
+  * DaemonSets consuming pod slots
+  * Cluster Autoscaler not configured or failing
+  * Affinity, anti-affinity, or node selector rules restricting pods to a small set of nodes
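To judge how much of a node's pod budget goes to DaemonSets, one option is to group its pods by owner kind. This is a sketch that assumes each pod's first `ownerReferences` entry is its controller; `daemonset_share` is a hypothetical helper name.

```bash
# daemonset_share: reads one owner kind per line on stdin and prints
# "<DaemonSet pods>/<total pods>".
daemonset_share() {
  awk 'NF { total++ } $1 == "DaemonSet" { ds++ } END { printf "%d/%d\n", ds, total }'
}

# Usage against a live cluster (pods without an owner appear as "<none>"):
#   kubectl get pods --all-namespaces --no-headers \
#     --field-selector spec.nodeName=<NODE_NAME> \
#     -o 'custom-columns=KIND:.metadata.ownerReferences[0].kind' | daemonset_share
```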
+===== Mitigation =====
+  - Review and redistribute workloads across nodes
+  - Scale out the cluster by adding more nodes
+  - Remove unnecessary pods or workloads from the node
+  - Adjust DaemonSets or affinity/anti-affinity rules
+  - Enable or tune Cluster Autoscaler if available
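One way to carry out the redistribution step is to cordon the full node and restart a workload so that its replacement pods are scheduled elsewhere. The sketch below only prints the commands for review instead of running them. `print_rebalance_plan` and its arguments are hypothetical, and it assumes the workload is a Deployment that tolerates a rolling restart.

```bash
# print_rebalance_plan NODE NAMESPACE DEPLOYMENT
# Prints (does not execute) a cordon / restart / uncordon sequence.
print_rebalance_plan() {
  local node=$1 ns=$2 deploy=$3
  cat <<EOF
kubectl cordon ${node}
kubectl -n ${ns} rollout restart deployment/${deploy}
kubectl -n ${ns} rollout status deployment/${deploy}
kubectl uncordon ${node}
EOF
}
```

Printing the plan first keeps a human in the loop: a rolling restart replaces every pod of the Deployment, not only those on the cordoned node.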
+===== Escalation =====
+  * Escalate if multiple nodes are reaching pod capacity
+  * Page on-call engineer if workloads fail to schedule and impact production
+  * Monitor cluster autoscaler or take manual action to add nodes
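Before paging, it can help to confirm that pod capacity is what blocks scheduling. The sketch below filters event messages for the scheduler's "Too many pods" reason; `capacity_blocked` is a hypothetical helper name.

```bash
# capacity_blocked: reads event messages on stdin and keeps only those
# that mention the scheduler's "Too many pods" reason.
capacity_blocked() {
  grep -i 'too many pods'
}

# Usage against a live cluster:
#   kubectl get events --all-namespaces --no-headers \
#     --field-selector reason=FailedScheduling \
#     -o custom-columns=NS:.metadata.namespace,POD:.involvedObject.name,MSG:.message \
#     | capacity_blocked
```

No output means the pending pods are blocked for another reason, such as insufficient CPU or memory.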
+===== Related Alerts =====
+  * KubernetesNodeMemoryPressure
+  * KubernetesNodeDiskPressure
+  * KubernetesNodeNotReady
+  * PodPending
+===== Related Dashboards =====
+  * Grafana → Kubernetes / Node Pod Capacity
+  * Grafana → Cluster Pod Distribution