====== NodeDown ======

===== Meaning =====
This alert is triggered when Prometheus is unable to scrape metrics from the node-exporter running on a node.
The alert fires while the `up` metric for the node-exporter job is `0`, indicating that the exporter is unreachable.
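
A minimal Prometheus alerting rule implementing this condition might look like the following sketch; the job label `node-exporter`, the `for` duration, and the severity label are assumptions, so adjust them to match your scrape configuration:

<code yaml>
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        # Fires when node-exporter has been unscrapeable for 5 minutes.
        # Job label and duration are assumptions; match your setup.
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "node-exporter on {{ $labels.instance }} is unreachable"
</code>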

===== Impact =====
This alert indicates a **critical node-level issue**.

Possible impacts include:
  * Loss of node-level metrics (CPU, memory, disk, network)
  * Reduced observability for workloads running on the node
  * Node may be powered off, unreachable, or network-isolated
  * If the node is actually down, workloads may be rescheduled or unavailable

This alert does **not always mean the node is completely down**, but it does mean monitoring visibility is lost.

===== Diagnosis =====
Check whether the node is visible and ready in the cluster:

<code bash>
kubectl get nodes
</code>

Check detailed node status and recent conditions:

<code bash>
kubectl describe node <NODE_NAME>
</code>

Verify that the node-exporter pod is running (Kubernetes setup):

<code bash>
kubectl get pods -n monitoring -o wide | grep node-exporter
</code>

Check events related to the node:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

If you have SSH access to the node, verify node-exporter and node health:

<code bash>
systemctl status node-exporter
uptime
df -h
</code>

Test connectivity from Prometheus to the node:

<code bash>
curl -s --max-time 5 http://<NODE_IP>:9100/metrics | head
</code>
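
In the Prometheus UI, queries like these help narrow the blast radius before deciding whether to escalate (the job label `node-exporter` is an assumption):

<code promql>
# All node-exporter targets that are currently unscrapeable
up{job="node-exporter"} == 0

# How many nodes are affected (one vs. many changes the escalation path)
count(up{job="node-exporter"} == 0)
</code>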

===== Possible Causes =====
  * Node is powered off or crashed
  * Network connectivity issue between Prometheus and the node
  * node-exporter service is stopped or crashed
  * Firewall or security group blocking port 9100
  * High resource pressure causing the exporter to fail
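
To separate a firewall problem from a dead exporter, a quick TCP reachability check from the Prometheus host helps. This is a sketch using bash's `/dev/tcp`; the address `192.0.2.10` is a placeholder, and `nc -zv <NODE_IP> 9100` works equally well:

<code bash>
#!/usr/bin/env bash
# check_port HOST PORT: report whether a TCP connection succeeds within 3s.
# If the port is reachable but /metrics returns nothing, suspect the exporter;
# if the port is unreachable, suspect networking, firewalling, or the node itself.
check_port() {
  if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "port $2 on $1 reachable"
  else
    echo "port $2 on $1 NOT reachable"
  fi
}

# Placeholder address: replace with the node's IP
check_port 192.0.2.10 9100
</code>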

===== Mitigation =====
  - If the node is down, restore or reboot the node
  - Restart node-exporter if the service is not running
  - Fix networking or firewall issues blocking metrics access
  - If the node is unhealthy, consider draining it:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
</code>

  - Once resolved, uncordon the node:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If multiple nodes are affected, escalate to the platform or infrastructure team immediately
  * If production workloads are impacted for more than 10 minutes, page the on-call engineer
  * If a cloud provider issue is suspected, open a support ticket with the provider

===== Related Alerts =====
  * KubeletDown
  * NodeNotReady
  * DiskFull

===== Related Dashboards =====
  * Grafana → Node Exporter / Node Overview

runbooks/coustom_alerts/nodedown.txt · Last modified: 2025/12/13 16:47 by admin