runbooks:coustom_alerts:NodeDown
This alert fires when Prometheus cannot scrape metrics from the node-exporter running on a node. The condition holds while the `up` metric for node-exporter is `0`, which indicates that the exporter is unreachable.
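To confirm which targets are currently failing the scrape, the `up == 0` condition can be queried directly through the Prometheus HTTP API. The following is a minimal sketch, not the alert rule itself: the job label `node-exporter` and the Prometheus base URL are assumptions that may differ in your setup, and `node_down_query` and `query_down_exporters` are hypothetical helper names.

```shell
#!/bin/sh
# Build the PromQL expression for exporters whose last scrape failed.
# ASSUMPTION: the scrape job is labelled "node-exporter"; pass another name if not.
node_down_query() {
  job="${1:-node-exporter}"
  printf 'up{job="%s"} == 0\n' "$job"
}

# Run the expression as an instant query against the Prometheus HTTP API.
# $1 = Prometheus base URL (assumed, e.g. http://localhost:9090)
# $2 = optional job name
query_down_exporters() {
  prom_url="$1"
  curl -sS -G --max-time 10 \
    --data-urlencode "query=$(node_down_query "${2:-node-exporter}")" \
    "$prom_url/api/v1/query"
}
```

Calling `query_down_exporters http://localhost:9090` returns a JSON document. Each entry in `data.result` is a target that Prometheus currently cannot scrape.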
This alert indicates a critical node-level issue.
Possible impacts include:
This alert does not always mean the node is completely down, but it does mean monitoring visibility is lost.
Check whether the node is visible and ready in the cluster:
kubectl get nodes
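On a large cluster it can help to list only the nodes that are not Ready. The sketch below filters the output of the command above. `unready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
#!/bin/sh
# Read `kubectl get nodes --no-headers` output on stdin and print the name and
# status of every node whose STATUS column does not contain the value "Ready".
# STATUS may be a comma-separated list such as "Ready,SchedulingDisabled".
unready_nodes() {
  awk '{
    n = split($2, parts, ",")
    ready = 0
    for (i = 1; i <= n; i++) if (parts[i] == "Ready") ready = 1
    if (!ready) print $1, $2
  }'
}
```

Usage: `kubectl get nodes --no-headers | unready_nodes`. Cordoned but healthy nodes (`Ready,SchedulingDisabled`) are deliberately not reported.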
Check detailed node status and recent conditions:
kubectl describe node <NODE_NAME>
Verify that the node-exporter pod is running (Kubernetes setup):
kubectl get pods -n monitoring -o wide | grep node-exporter
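The `grep` above lists every node-exporter pod in the namespace. To narrow the result to the affected node, one option is to print only the pod name and node name and filter on the node. This is a sketch under assumptions: `exporter_pod_on_node` is a hypothetical helper name, and it assumes the pod names contain `node-exporter`, as the `grep` above does.

```shell
#!/bin/sh
# Read lines of "POD_NAME NODE_NAME" on stdin and print the node-exporter
# pod(s) scheduled on the node given as $1.
exporter_pod_on_node() {
  awk -v node="$1" '$2 == node && $1 ~ /node-exporter/ { print $1 }'
}
```

Usage: `kubectl get pods -n monitoring --no-headers -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | exporter_pod_on_node <NODE_NAME>`. No output means no node-exporter pod is scheduled on that node. Otherwise, inspect the printed pod with `kubectl describe pod`.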
Check events related to the node:
kubectl get events --field-selector involvedObject.kind=Node
If you have SSH access to the node, verify node-exporter and node health:
systemctl status node-exporter
uptime
df -h
Test connectivity from Prometheus to the node:
curl http://<NODE_IP>:9100/metrics
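When the plain `curl` hangs or fails, its exit status hints at the cause. The sketch below adds a timeout and translates the most common exit codes. `probe_exporter` and `explain_curl_exit` are hypothetical helper names, and the port defaults to 9100 as in the command above. Run it from the Prometheus host or pod so that it exercises the same network path as the scrape.

```shell
#!/bin/sh
# Translate a curl exit status into a likely cause for a failed scrape.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect (refused or unreachable)" ;;
    28) echo "timed out" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}

# Probe the exporter endpoint with a 5 second timeout and report the result.
# $1 = node IP or hostname, $2 = optional port (defaults to 9100)
probe_exporter() {
  curl -sS -o /dev/null --max-time 5 "http://$1:${2:-9100}/metrics"
  explain_curl_exit "$?"
}
```

Usage: `probe_exporter <NODE_IP>`. A timeout often points at a firewall or network policy, while a failed connection usually means nothing is listening on the port. "reachable" only confirms that an HTTP response came back, not that the metrics are valid.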
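If the node has to be taken out of service for repair, drain it first (DaemonSet-managed pods are left in place):
kubectl drain <NODE_NAME> --ignore-daemonsets
Once the node has recovered, allow scheduling on it again:
kubectl uncordon <NODE_NAME>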