runbooks:coustom_alerts:kubernetesnodeoutofpodcapacity
Created 2025/12/13 16:37 by admin; current revision 2025/12/14 06:57 by admin.
====== KubernetesNodeOutOfPodCapacity ======

===== Meaning =====
This alert fires when the number of pods on a Kubernetes node exceeds **90% of its pod capacity** for more than 2 minutes.
It indicates that the node has almost no free allocatable pod slots left.
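
The trigger condition is a ratio of scheduled pods to allocatable pod slots. The sketch below assumes that "pod capacity" means the node's allocatable pod count; the alert's actual Prometheus expression is not part of this runbook. The function names are hypothetical, and the 2-minute hold period is evaluated by Prometheus and not modelled here.

<code bash>
# Hypothetical helpers illustrating the alert arithmetic (not the alert's real
# Prometheus expression). Assumption: "pod capacity" = allocatable pod slots.

# pod_capacity_pct SCHEDULED ALLOCATABLE
# Prints the usage percentage, rounded down to an integer.
pod_capacity_pct() {
  local scheduled=$1 allocatable=$2
  if (( allocatable <= 0 )); then
    echo "allocatable must be greater than 0" >&2
    return 1
  fi
  echo $(( scheduled * 100 / allocatable ))
}

# over_pod_threshold SCHEDULED ALLOCATABLE [THRESHOLD]
# Exits 0 when usage is strictly above THRESHOLD percent (default 90).
# Multiplying instead of dividing keeps the comparison exact.
over_pod_threshold() {
  local scheduled=$1 allocatable=$2 threshold=${3:-90}
  (( scheduled * 100 > threshold * allocatable ))
}
</code>

For example, on a node with 110 allocatable pods (the kubelet default ''maxPods''), 99 pods is exactly 90% and does not cross the threshold, while 100 pods (about 90.9%) does. ''pod_capacity_pct 100 110'' prints ''90'', but ''over_pod_threshold 100 110'' succeeds, which is why the threshold check avoids the rounded percentage.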

===== Impact =====
A node running out of pod capacity can cause:
  * New pods failing to schedule on the node
  * Workload imbalance across the cluster
  * Potential service degradation if no other nodes are available
  * Increased latency for scheduling or scaling operations

This alert is marked **warning**.
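
When pods cannot be placed because candidate nodes have no free pod slots, the scheduler records ''FailedScheduling'' events whose message contains "Too many pods" (for example "0/3 nodes are available: 3 Too many pods."). A minimal sketch that lists such events; ''too_many_pods_events'' is a hypothetical helper name, and it relies only on standard ''kubectl get events'' options.

<code bash>
# too_many_pods_events (hypothetical helper)
# Lists FailedScheduling events whose message contains "Too many pods",
# the reason the scheduler gives when a node has no free pod slots.
# Exits non-zero (grep status 1) when no such events exist.
too_many_pods_events() {
  kubectl get events --all-namespaces --no-headers \
    --field-selector reason=FailedScheduling \
    -o custom-columns='NAMESPACE:.metadata.namespace,POD:.involvedObject.name,MESSAGE:.message' \
    | grep -F 'Too many pods'
}
</code>

Any output means pods are already pending because of pod-slot exhaustion, not only approaching it.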

===== Diagnosis =====
Check node pod allocation:

<code bash>
kubectl get nodes -o wide
kubectl describe node <node-name>
</code>

Check the pods running on the node:

<code bash>
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
</code>

Check the node's allocatable pod count:

<code bash>
kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'
</code>
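
The checks above can be combined into a single usage figure per node. This is a minimal sketch: ''node_pod_usage'' is a hypothetical helper, and it assumes, like the "Non-terminated Pods" count in ''kubectl describe node'', that pods in phase Succeeded or Failed no longer occupy a pod slot.

<code bash>
# node_pod_usage NODE (hypothetical helper)
# Prints "<node> <non-terminated pods>/<allocatable pods> (<pct>%)".
# Assumption: Succeeded/Failed pods do not count against the pod limit.
node_pod_usage() {
  local node=$1 allocatable used
  allocatable=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.pods}') || return 1
  if [[ -z $allocatable || $allocatable -eq 0 ]]; then
    echo "could not read allocatable pods for ${node}" >&2
    return 1
  fi
  used=$(kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=${node},status.phase!=Succeeded,status.phase!=Failed" \
    | wc -l)
  # Some wc implementations pad the count with spaces; arithmetic strips them.
  used=$(( used ))
  echo "${node} ${used}/${allocatable} ($(( used * 100 / allocatable ))%)"
}
</code>

Run it as ''node_pod_usage <node-name>''; a result such as ''node-a 100/110 (90%)'' shows how close the node is to its limit.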

Check cluster-wide pod distribution:

<code bash>
kubectl get pods --all-namespaces -o wide
</code>
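
The full pod listing is hard to read on a large cluster. A sketch that tallies non-terminated pods per node instead, busiest node first; ''pods_per_node'' is a hypothetical name, and pods not yet bound to a node (such as Pending pods) are counted under ''<none>''.

<code bash>
# pods_per_node (hypothetical helper)
# Prints "<count> <node>" for every node, busiest first.
# Pods without a node (e.g. Pending) are grouped under "<none>".
pods_per_node() {
  kubectl get pods --all-namespaces --no-headers \
    --field-selector 'status.phase!=Succeeded,status.phase!=Failed' \
    -o custom-columns='NODE:.spec.nodeName' \
    | sort | uniq -c | sort -rn
}
</code>

A large gap between the top node and the rest points to uneven scheduling rather than a cluster that is simply full.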

===== Possible Causes =====
  * The node is running a large number of pods
  * Misconfigured deployments placing too many replicas on a single node
  * DaemonSets consuming pod slots
  * Cluster Autoscaler not configured or failing
  * Affinity or anti-affinity rules restricting pods to a few nodes
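
To see whether DaemonSets or one large workload dominate the node, its pods can be grouped by the kind of their first owner reference. A sketch with a hypothetical helper name; pods managed by a Deployment show up as ''ReplicaSet'', since the ReplicaSet is their direct owner.

<code bash>
# pods_by_owner_kind NODE (hypothetical helper)
# Counts the non-terminated pods on NODE by the kind of their first owner
# reference (DaemonSet, ReplicaSet, StatefulSet, Job, ...), largest first.
# Pods with no owner are shown as "<none>".
pods_by_owner_kind() {
  local node=$1
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=${node},status.phase!=Succeeded,status.phase!=Failed" \
    -o custom-columns='KIND:.metadata.ownerReferences[0].kind' \
    | sort | uniq -c | sort -rn
}
</code>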

===== Mitigation =====
  - Review and redistribute workloads across nodes
  - Scale out the cluster by adding more nodes
  - Remove unnecessary pods or workloads from the node
  - Adjust DaemonSets or affinity/anti-affinity rules
  - Enable or tune the Cluster Autoscaler, if available
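
Before redistributing workloads or adding nodes, it helps to know which nodes still have free pod slots. This sketch (''pod_headroom'' is a hypothetical name) compares each node's allocatable pod count with its non-terminated pods and lists nodes by free slots, most first; it assumes allocatable pods is reported as a plain integer.

<code bash>
# pod_headroom (hypothetical helper)
# Prints "<node> <free slots> <scheduled>/<allocatable>" per node, most free
# slots first. Assumptions: allocatable pods is a plain integer, and only
# non-terminated pods (not Succeeded/Failed) count against the limit.
pod_headroom() {
  {
    # Part 1: "<node> <allocatable pods>" for every node.
    kubectl get nodes --no-headers \
      -o custom-columns='NAME:.metadata.name,PODS:.status.allocatable.pods'
    echo "---"
    # Part 2: the node name of every non-terminated pod, one per line.
    kubectl get pods --all-namespaces --no-headers \
      --field-selector 'status.phase!=Succeeded,status.phase!=Failed' \
      -o custom-columns='NODE:.spec.nodeName'
  } | awk '
    $0 == "---" { part = 2; next }
    part != 2   { alloc[$1] = $2; next }
                { used[$1]++ }
    END {
      # Unbound pods ("<none>") have no entry in alloc and are skipped.
      for (n in alloc)
        printf "%s %d %d/%d\n", n, alloc[n] - used[n], used[n], alloc[n]
    }' | sort -k2,2nr
}
</code>

If every node shows little or no headroom, redistribution will not help and the cluster needs more nodes.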

===== Escalation =====
  * Escalate if multiple nodes are reaching pod capacity
  * Page the on-call engineer if workloads fail to schedule and production is affected
  * Monitor the Cluster Autoscaler, or add nodes manually if it does not scale out
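
The first escalation criterion can be checked mechanically. A sketch, assuming input lines of the form "<node> <scheduled pods> <allocatable pods>" gathered by whatever means is convenient; ''nodes_over_pod_threshold'' is hypothetical, applies a strict 90% comparison by default, and treats two or more nodes over the threshold as "multiple".

<code bash>
# nodes_over_pod_threshold [THRESHOLD] (hypothetical helper)
# Reads "<node> <scheduled pods> <allocatable pods>" lines on stdin, prints
# the nodes whose usage is strictly above THRESHOLD percent (default 90),
# and exits 0 only when two or more nodes are over the threshold.
nodes_over_pod_threshold() {
  local threshold=${1:-90}
  awk -v t="$threshold" '
    $3 > 0 && $2 * 100 > t * $3 { print $1; n++ }
    END { exit (n >= 2 ? 0 : 1) }'
}
</code>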

===== Related Alerts =====
  * KubernetesNodeMemoryPressure
  * KubernetesNodeDiskPressure
  * KubernetesNodeNotReady
  * PodPending

===== Related Dashboards =====
  * Grafana → Kubernetes / Node Pod Capacity
  * Grafana → Cluster Pod Distribution

runbooks/coustom_alerts/kubernetesnodeoutofpodcapacity.txt · Last modified: by admin
