runbooks:coustom_alerts:kubenodenotready [2025/12/13 16:27] – created admin
runbooks:coustom_alerts:kubenodenotready [2025/12/14 06:49] (current) admin
 +
 +====== KubeNodeNotReady ======
 +
 +===== Meaning =====
 +This alert fires when a Kubernetes node has reported a `NotReady` status (its `Ready` condition is `False` or `Unknown`) for more than 2 minutes.
 +A node in `NotReady` state cannot reliably run or manage pods.
 +
 +===== Impact =====
 +This alert indicates a **node-level availability issue**.
 +
 +Possible impacts include:
 +  * Pods on the node may be evicted or rescheduled
 +  * Reduced cluster capacity
 +  * Increased load on remaining nodes
 +  * Application performance degradation or partial outages
 +
 +This alert has **warning** severity, but the situation may become critical if the condition persists or affects multiple nodes.
 +
 +===== Diagnosis =====
 +Check node status:
 +
 +<code bash>
 +kubectl get nodes
 +</code>
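On a large cluster the full node list can be noisy. The following is a minimal sketch, not part of the original runbook, that filters the listing down to the nodes currently reporting `NotReady`. The helper name `not_ready_nodes` is hypothetical, and the sketch assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# contains NotReady. Reads `kubectl get nodes --no-headers` output on stdin.
# STATUS may be a comma-separated list (e.g. "NotReady,SchedulingDisabled"),
# so each element is compared exactly to avoid matching plain "Ready".
not_ready_nodes() {
  awk '{
    n = split($2, states, ",")
    for (i = 1; i <= n; i++) {
      if (states[i] == "NotReady") { print $1; break }
    }
  }'
}
```

Typical usage would be `kubectl get nodes --no-headers | not_ready_nodes`.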
 +
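 +Describe the affected node to inspect its conditions and events (`{{ $labels.node }}` is filled in from the alert's `node` label; substitute the node name if running the command by hand):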
 +
 +<code bash>
 +kubectl describe node {{ $labels.node }}
 +</code>
 +
 +Check recent node-related events:
 +
 +<code bash>
 +kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
 +</code>
 +
 +Verify kubelet health on the node (if SSH access is available):
 +
 +<code bash>
 +systemctl status kubelet
 +journalctl -u kubelet --since "15 min ago"
 +</code>
 +
 +Check node resource pressure:
 +
 +<code bash>
 +kubectl describe node {{ $labels.node }} | grep -i pressure
 +</code>
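The `grep` above prints every pressure condition, including those that are `False`. As a hedged alternative, the sketch below prints only the pressure conditions that are currently `True`. The function names `node_conditions` and `active_pressure` are hypothetical. The sketch assumes pressure condition types end in `Pressure`, as `MemoryPressure`, `DiskPressure`, and `PIDPressure` do.

```shell
# Hypothetical helper: emit one "Type=Status" line per node condition,
# using kubectl's jsonpath output. $1 is the node name.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Hypothetical helper: from "Type=Status" lines on stdin, print only the
# condition types that end in "Pressure" and have status True.
active_pressure() {
  awk -F= '$1 ~ /Pressure$/ && $2 == "True" { print $1 }'
}
```

Typical usage would be `node_conditions <node-name> | active_pressure`; empty output means no pressure condition is currently reported.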
 +
 +===== Possible Causes =====
 +  * Kubelet process stopped or unhealthy
 +  * Network connectivity issues
 +  * Disk, memory, or PID pressure on the node
 +  * Node reboot or hardware failure
 +  * Cloud provider instance issue
 +
 +===== Mitigation =====
 +  - Restart the kubelet service if it is not running
 +  - Resolve disk, memory, or PID pressure conditions
 +  - Restore network connectivity
 +  - Reboot the node if required and safe
 +  - If the node is unstable, drain it for investigation:
 +
 +<code bash>
 +kubectl drain {{ $labels.node }} --ignore-daemonsets
 +</code>
 +
 +After the node becomes healthy:
 +
 +<code bash>
 +kubectl uncordon {{ $labels.node }}
 +</code>
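To avoid uncordoning a node that is still unhealthy, the uncordon step can be gated on the node's `Ready` condition. This is a sketch under assumptions, not the runbook's prescribed procedure: the function name `uncordon_when_ready`, the `KUBECTL` override variable, and the 300-second default timeout are all illustrative choices.

```shell
# KUBECTL can be overridden (e.g. with a wrapper or a stub); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: wait for the node to report Ready, then uncordon it.
# $1 = node name, $2 = optional timeout (assumed default: 300s).
# Returns non-zero and leaves the node cordoned if it does not become Ready.
uncordon_when_ready() {
  if "$KUBECTL" wait --for=condition=Ready "node/$1" --timeout="${2:-300s}"; then
    "$KUBECTL" uncordon "$1"
  else
    echo "node $1 did not become Ready within ${2:-300s}; leaving it cordoned" >&2
    return 1
  fi
}
```

Typical usage would be `uncordon_when_ready <node-name> 600s`.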
 +
 +===== Escalation =====
 +  * If the node remains NotReady for more than 10 minutes, escalate to the platform team
 +  * If multiple nodes are affected, treat as a cluster-level incident
 +  * If production workloads are impacted, page the on-call engineer
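The three escalation rules above are independent, so more than one can apply at once. The sketch below encodes them as a small decision helper. The function name `escalation_actions` and its argument order are hypothetical, and it reads "multiple nodes" as more than one.

```shell
# Hypothetical helper: print every escalation action that applies.
# $1 = minutes the node has been NotReady
# $2 = number of affected nodes
# $3 = "yes" if production workloads are impacted, anything else otherwise
escalation_actions() {
  if [ "$1" -gt 10 ]; then
    echo "escalate to the platform team"
  fi
  if [ "$2" -gt 1 ]; then
    echo "treat as a cluster-level incident"
  fi
  if [ "$3" = "yes" ]; then
    echo "page the on-call engineer"
  fi
  return 0
}
```

For example, `escalation_actions 12 1 no` applies only the first rule, while `escalation_actions 5 3 yes` applies the second and third.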
 +
 +===== Related Alerts =====
 +  * NodeDown
 +  * KubeletDown
 +  * HighDiskUsage
 +  * HighMemoryUsage
 +
 +===== Related Dashboards =====
 +  * Grafana → Kubernetes / Nodes
 +  * Grafana → Node Health Overview
 +
runbooks/coustom_alerts/kubenodenotready.txt · Last modified: by admin