NodeDown
Meaning
This alert is triggered when Prometheus is unable to scrape metrics from the node-exporter running on a node. It fires while the `up` metric for the node-exporter target is `0`, which means the most recent scrape failed and the exporter is unreachable from Prometheus.
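To see which targets currently match this condition, the `up` metric can be queried through the Prometheus HTTP API. The following is a minimal sketch, not the alert rule itself: the Prometheus address in `PROM_URL` and the job label `node-exporter` are assumptions and must be adjusted to the actual setup, and `down_targets` is a hypothetical helper name.

```shell
# Assumption: Prometheus is reachable at this address (placeholder).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Assumption: the scrape job is labelled "node-exporter".
QUERY='up{job="node-exporter"} == 0'

# Return the raw JSON result of an instant query listing
# node-exporter targets whose last scrape failed.
down_targets() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}"
}
```

Calling `down_targets` prints the JSON response; an empty `result` array means no target currently matches.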
Impact
This alert indicates a potentially critical node-level issue.
Possible impacts include:
- Loss of node-level metrics (CPU, memory, disk, network)
- Reduced observability for workloads running on the node
- Node may be powered off, unreachable, or network-isolated
- If the node is actually down, workloads may be rescheduled or unavailable
This alert does not always mean the node is completely down, but it does mean monitoring visibility is lost.
Diagnosis
Check whether the node is visible and ready in the cluster:
kubectl get nodes
Check detailed node status and recent conditions:
kubectl describe node <NODE_NAME>
Verify that the node-exporter pod is running (Kubernetes setup):
kubectl get pods -n monitoring -o wide | grep node-exporter
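When the cluster has many nodes, it can help to narrow the listing to exporter pods on the affected node that are not running. This is a rough sketch assuming the `monitoring` namespace from the command above and pod names containing `node-exporter`; `unhealthy_exporters` is a hypothetical helper name.

```shell
# Print node-exporter pods on the given node that are not in the
# Running phase. Expects "NAME STATUS NODE" rows on stdin.
unhealthy_exporters() {
  awk -v node="$1" \
    '$1 ~ /node-exporter/ && $3 == node && $2 != "Running" { print $1 " " $2 }'
}

# Usage (requires cluster access):
# kubectl get pods -n monitoring --no-headers \
#   -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName \
#   | unhealthy_exporters <NODE_NAME>
```

No output means every node-exporter pod scheduled on that node reports the Running phase. It does not rule out a pod that was never scheduled there.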
Check events related to the node:
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME>
If you have SSH access to the node, verify node-exporter and node health:
systemctl status node-exporter
uptime
df -h
Test connectivity to the node-exporter endpoint, ideally from the Prometheus host or pod so the same network path is exercised:
curl http://<NODE_IP>:9100/metrics
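The checks above can be combined into a rough first-pass triage. The sketch below is illustrative only: the function names and the three result labels are hypothetical, and the classification is a heuristic under the assumption that the exporter listens on port 9100 as in the curl command above.

```shell
# Read the node's Ready condition ("True", "False", or "Unknown").
node_ready() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Fetch the HTTP status code of the exporter endpoint ("000" on failure).
exporter_code() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "http://$1:9100/metrics"
}

# Map the two observations to a likely problem area.
classify_node_down() {
  ready="$1"; code="$2"
  if [ "$ready" != "True" ]; then
    echo "node-not-ready"          # node itself is unhealthy or unreachable
  elif [ "$code" != "200" ]; then
    echo "exporter-unreachable"    # exporter stopped, or port 9100 blocked
  else
    echo "scrape-path-issue"       # node and exporter fine from here
  fi
}

# Usage (requires cluster and network access):
# classify_node_down "$(node_ready <NODE_NAME>)" "$(exporter_code <NODE_IP>)"
```

A result of `scrape-path-issue` only says the exporter answered from where the script ran. Prometheus may still be unable to reach it, so networking between Prometheus and the node is the next thing to check.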
Possible Causes
- Node is powered off or crashed
- Network connectivity issue between Prometheus and the node
- node-exporter service is stopped or crashed
- Firewall or security group blocking port 9100
- High resource pressure (CPU, memory, or disk) causing the exporter to fail
Mitigation
- If the node is down, restore or reboot the node
- Restart node-exporter if the service is not running
- Fix networking or firewall issues blocking metrics access
- If the node is unhealthy, consider draining it:
kubectl drain <NODE_NAME> --ignore-daemonsets
- Once resolved, uncordon the node:
kubectl uncordon <NODE_NAME>
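Since draining evicts workloads, it may be worth previewing the exact commands before running them. The wrapper below is a hypothetical sketch: the `DRY_RUN` variable and the `run`, `drain_node`, and `restore_node` names are not part of kubectl, and the default is to print rather than execute.

```shell
# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Drain the node, leaving DaemonSet-managed pods in place.
drain_node() {
  run kubectl drain "$1" --ignore-daemonsets
}

# Make the node schedulable again once the issue is resolved.
restore_node() {
  run kubectl uncordon "$1"
}
```

Calling `drain_node <NODE_NAME>` prints the command; setting `DRY_RUN=0` in the environment makes the same call execute it.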
Escalation
- If multiple nodes are affected, escalate to the platform or infrastructure team immediately
- If production workloads are impacted for more than 10 minutes, page the on-call engineer
- If cloud provider issues are suspected, open a support ticket
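To judge quickly whether multiple nodes are affected, the node list can be counted. This is a rough sketch with hypothetical helper names; it counts nodes whose status is not Ready, which is only a proxy for NodeDown because the alert itself tracks the exporter, and the threshold of two nodes is an assumed reading of "multiple".

```shell
# Count nodes whose STATUS column does not start with "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
count_not_ready() {
  awk '$2 !~ /^Ready/ { n++ } END { print n + 0 }'
}

# Succeed when the count meets the assumed "multiple nodes" threshold.
should_escalate() {
  [ "$1" -ge 2 ]
}

# Usage (requires cluster access):
# n="$(kubectl get nodes --no-headers | count_not_ready)"
# should_escalate "$n" && echo "Escalate: $n nodes not ready"
```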
Related Alerts
- KubeletDown
- NodeNotReady
- DiskFull
Related Dashboards
- Grafana → Node Exporter / Node Overview
