NodeDown

Meaning

This alert fires when Prometheus cannot scrape metrics from the node-exporter running on a node. It stays active for as long as the `up` metric for that node-exporter target is `0`, which means the most recent scrape failed and the exporter is unreachable from Prometheus.
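
To see from the Prometheus side which targets are affected, the `up` series can be queried through the Prometheus HTTP API. The following is a minimal sketch, not the alert rule itself: it assumes the scrape job is labelled `job="node-exporter"` and that you pass in the base URL of your own Prometheus server. Neither is stated in this runbook, so adjust both to your setup.

```shell
# Assumption: the scrape job label is "node-exporter"; override via NODE_EXPORTER_JOB.
NODE_EXPORTER_JOB="${NODE_EXPORTER_JOB:-node-exporter}"

# Build the PromQL expression selecting targets whose last scrape failed.
down_query() {
  printf 'up{job="%s"} == 0' "$NODE_EXPORTER_JOB"
}

# Send the expression to the instant-query endpoint (/api/v1/query).
# $1 is the Prometheus base URL (hypothetical example: http://localhost:9090).
list_down_targets() {
  curl -sS --get "$1/api/v1/query" --data-urlencode "query=$(down_query)"
}
```

Calling `list_down_targets http://localhost:9090` returns a JSON document. Each entry under `data.result` is one target that is currently down, identified by its `instance` label.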

Impact

This alert indicates a critical node-level issue.

Possible impacts include:

This alert does not always mean the node is completely down, but it does mean monitoring visibility is lost.

Diagnosis

Check whether the node is visible and ready in the cluster:

kubectl get nodes
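
On a large cluster it can help to filter that output down to the nodes that need attention. A small sketch, assuming the default `kubectl get nodes` column layout (NAME first, STATUS second); the function name is made up for illustration:

```shell
# Read `kubectl get nodes` output on stdin and print the names of nodes
# whose STATUS column is anything other than exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are listed as well, on purpose.
not_ready_nodes() {
  # NR > 1 skips the header row; $1 is NAME, $2 is STATUS.
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Usage: `kubectl get nodes | not_ready_nodes`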

Check detailed node status and recent conditions:

kubectl describe node <NODE_NAME>

Verify that the node-exporter pod is running (Kubernetes setup):

kubectl get pods -n monitoring -o wide | grep node-exporter

Check events related to the node:

kubectl get events --field-selector involvedObject.kind=Node

If you have SSH access to the node, verify node-exporter and node health:

systemctl status node-exporter
uptime
df -h

Test connectivity from Prometheus to the node:

curl http://<NODE_IP>:9100/metrics
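
The plain `curl` above dumps the whole metrics page and can hang on a filtered port. The sketch below is one possible variant that prints only a verdict and gives up after a timeout. The function names and the 5-second timeout are illustrative choices, not part of any existing tooling.

```shell
# Map a curl status code to a verdict.
# curl reports 000 when no HTTP response was received at all.
exporter_verdict() {
  case "$1" in
    200)    echo "reachable" ;;
    000|"") echo "unreachable: no HTTP response" ;;
    *)      echo "unexpected HTTP status $1" ;;
  esac
}

# Probe the exporter. $1 is the node IP or hostname, $2 the port (default 9100).
probe_exporter() {
  # -o /dev/null discards the body, -w prints only the status code,
  # --max-time bounds the whole request to 5 seconds.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    "http://$1:${2:-9100}/metrics" || true)"
  exporter_verdict "$code"
}
```

Usage: `probe_exporter <NODE_IP>`. Run it from the Prometheus host or pod so that the result reflects the same network path that Prometheus uses.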

Possible Causes

Mitigation

  1. If the node is down, restore or reboot the node
  2. Restart node-exporter if the service is not running
  3. Fix networking or firewall issues blocking metrics access
  4. If the node is unhealthy, consider draining it:
kubectl drain <NODE_NAME> --ignore-daemonsets
  5. Once resolved, uncordon the node:
kubectl uncordon <NODE_NAME>
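
Because draining evicts workloads, it can be worth reviewing the exact command before running it. The wrapper below is a hypothetical helper, sketched on the assumption that a print-first default suits your workflow. It prints the drain or uncordon command from the steps above and runs it only when `DRY_RUN=0` is set.

```shell
# Print (default) or run the drain/uncordon command for one node.
# Set DRY_RUN=0 to execute; any other value only prints the command.
node_cmd() {
  action="$1"
  node="$2"
  case "$action" in
    drain)    set -- kubectl drain "$node" --ignore-daemonsets ;;
    uncordon) set -- kubectl uncordon "$node" ;;
    *)        echo "usage: node_cmd drain|uncordon <NODE_NAME>" >&2
              return 2 ;;
  esac
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"        # run the assembled kubectl command
  else
    echo "$*"   # show what would be run
  fi
}
```

Usage: `node_cmd drain <NODE_NAME>` prints the command for review; `DRY_RUN=0 node_cmd drain <NODE_NAME>` runs it.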

Escalation