====== NodeDown ======
===== Meaning =====
This alert is triggered when Prometheus is unable to scrape metrics from the node-exporter running on a node.
It fires when the `up` metric for the node-exporter target is `0`, indicating that the exporter is unreachable from Prometheus.
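The underlying alerting expression is typically of this form (a sketch; the `job` label value and the hold duration are assumptions that may differ in your setup):
up{job="node-exporter"} == 0
Rules usually require this to hold for a few minutes (e.g. `for: 5m`) so that a single missed scrape does not fire the alert.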
===== Impact =====
This alert indicates a **critical node-level issue**.
Possible impacts include:
* Loss of node-level metrics (CPU, memory, disk, network)
* Reduced observability for workloads running on the node
* Node may be powered off, unreachable, or network-isolated
* If the node is actually down, workloads may be rescheduled or unavailable
This alert does **not always mean the node is completely down**, but it does mean monitoring visibility is lost.
===== Diagnosis =====
Check whether the node is visible and ready in the cluster:
kubectl get nodes
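Adding `-o wide` also shows each node's internal IP, which is needed for the connectivity checks further down:
kubectl get nodes -o wide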
Check detailed node status and recent conditions:
kubectl describe node <node-name>
Verify that the node-exporter pod for the affected node is running (Kubernetes setup):
kubectl get pods -n monitoring -o wide | grep node-exporter
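If the pod exists but is not healthy, its logs may show why (the `monitoring` namespace matches the command above; the pod name is a placeholder):
kubectl logs -n monitoring <node-exporter-pod>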
Check events related to the node:
kubectl get events --field-selector involvedObject.kind=Node
If you have SSH access to the node, verify node-exporter and node health:
systemctl status node-exporter
uptime
df -h
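If the service has failed, its recent logs usually reveal the cause, and you can confirm the exporter is listening on its port (assuming node-exporter runs as the systemd unit named above):
journalctl -u node-exporter --since "1 hour ago"
ss -tlnp | grep 9100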
Test connectivity from Prometheus to the node:
curl http://<node-ip>:9100/metrics
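You can also ask Prometheus itself which node-exporter targets are down via its HTTP API (a sketch; the Prometheus host and the `job` label value are assumptions):
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=up{job="node-exporter"} == 0'
The error shown on the Prometheus Status → Targets page (e.g. connection refused vs. timeout) also helps distinguish a stopped exporter from a network problem.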
===== Possible Causes =====
* Node is powered off or crashed
* Network connectivity issue between Prometheus and the node
* node-exporter service is stopped or crashed
* Firewall or security group blocking port 9100 (see the in-cluster check after this list)
* High resource pressure causing exporter to fail
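To rule out the network and firewall causes from inside the cluster, you can run a short-lived test pod and fetch the metrics endpoint directly (a sketch; the node IP is a placeholder and the `busybox` image is an assumption):
kubectl run nettest --rm -it --restart=Never --image=busybox -- wget -qO- http://<node-ip>:9100/metrics
If this succeeds while Prometheus still reports the target down, the problem is more likely on the Prometheus side than on the node.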
===== Mitigation =====
- If the node is down, restore or reboot the node
- Restart node-exporter if the service is not running
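For a systemd-managed exporter, or for a DaemonSet pod (which is recreated automatically after deletion), respectively (the unit and pod names follow the diagnosis steps above and may differ):
sudo systemctl restart node-exporter
kubectl delete pod -n monitoring <node-exporter-pod>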
- Fix networking or firewall issues blocking metrics access
- If the node is unhealthy, consider draining it:
kubectl drain <node-name> --ignore-daemonsets
- Once resolved, uncordon the node:
kubectl uncordon <node-name>
===== Escalation =====
* If multiple nodes are affected, escalate to the platform or infrastructure team immediately
* If production workloads are impacted for more than 10 minutes, page the on-call engineer
* If cloud provider issues are suspected, open a support ticket
===== Related Alerts =====
* KubeletDown
* NodeNotReady
* DiskFull
===== Related Dashboards =====
* Grafana → Node Exporter / Node Overview