====== NodeDown ======

===== Meaning =====
This alert is triggered when Prometheus is unable to scrape metrics from the node-exporter running on a node.
The alert fires while the `up` metric for the node-exporter job is `0`, indicating that the exporter is unreachable.
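
A minimal Prometheus alerting rule implementing this condition might look like the following sketch; the job label `node-exporter`, the `for` duration, and the severity label are assumptions, so adjust them to match your scrape configuration:

<code yaml>
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        # Fires when node-exporter has been unscrapeable for 5 minutes.
        # Job label and duration are assumptions; match your setup.
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "node-exporter on {{ $labels.instance }} is unreachable"
</code>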

===== Impact =====
This alert indicates a **critical node-level issue**.

Possible impacts include:
  * Loss of node-level metrics (CPU, memory, disk, network)
  * Reduced observability for workloads running on the node
  * Node may be powered off, unreachable, or network-isolated
  * If the node is actually down, workloads may be rescheduled or unavailable

This alert does **not always mean the node is completely down**, but it does mean monitoring visibility is lost.

===== Diagnosis =====
Check whether the node is visible and ready in the cluster:

<code bash>
kubectl get nodes
</code>

Check detailed node status and recent conditions:

<code bash>
kubectl describe node <NODE_NAME>
</code>

Verify that the node-exporter pod is running (Kubernetes setup):

<code bash>
kubectl get pods -n monitoring -o wide | grep node-exporter
</code>

Check events related to the node:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

If you have SSH access to the node, verify node-exporter and node health:

<code bash>
systemctl status node-exporter
uptime
df -h
</code>

Test connectivity from Prometheus to the node:

<code bash>
curl -s --max-time 5 http://<NODE_IP>:9100/metrics | head
</code>
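
In the Prometheus UI, queries like these help narrow the blast radius before deciding whether to escalate (the job label `node-exporter` is an assumption):

<code promql>
# All node-exporter targets that are currently unscrapeable
up{job="node-exporter"} == 0

# How many nodes are affected (one vs. many changes the escalation path)
count(up{job="node-exporter"} == 0)
</code>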

===== Possible Causes =====
  * Node is powered off or crashed
  * Network connectivity issue between Prometheus and the node
  * node-exporter service is stopped or crashed
  * Firewall or security group blocking port 9100
  * High resource pressure causing the exporter to fail
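
To separate a firewall problem from a dead exporter, a quick TCP reachability check from the Prometheus host helps. This is a sketch using bash's `/dev/tcp`; the address `192.0.2.10` is a placeholder, and `nc -zv <NODE_IP> 9100` works equally well:

<code bash>
#!/usr/bin/env bash
# check_port HOST PORT: report whether a TCP connection succeeds within 3s.
# If the port is reachable but /metrics returns nothing, suspect the exporter;
# if the port is unreachable, suspect networking, firewalling, or the node itself.
check_port() {
  if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "port $2 on $1 reachable"
  else
    echo "port $2 on $1 NOT reachable"
  fi
}

# Placeholder address: replace with the node's IP
check_port 192.0.2.10 9100
</code>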

===== Mitigation =====
  - If the node is down, restore or reboot the node
  - Restart node-exporter if the service is not running
  - Fix networking or firewall issues blocking metrics access
  - If the node is unhealthy, consider draining it:

<code bash>
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
</code>

  - Once resolved, uncordon the node:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If multiple nodes are affected, escalate to the platform or infrastructure team immediately
  * If production workloads are impacted for more than 10 minutes, page the on-call engineer
  * If a cloud provider issue is suspected, open a support ticket with the provider

===== Related Alerts =====
  * KubeletDown
  * NodeNotReady
  * DiskFull

===== Related Dashboards =====
  * Grafana → Node Exporter / Node Overview

runbooks/coustom_alerts/nodedown.txt · Last modified: 2025/12/13 16:47 by admin