NodeDown

Meaning

This alert is triggered when Prometheus is unable to scrape metrics from the node-exporter running on a node. The condition persists as long as the `up` metric for node-exporter is `0`, indicating that the exporter is unreachable.
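
A minimal sketch of the underlying alert expression, assuming node-exporter is scraped under a `job="node-exporter"` label (the label and the duration used in your Prometheus rules may differ):

# the alert fires when this expression is non-empty for the configured duration
up{job="node-exporter"} == 0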

Impact

This alert indicates a critical node-level issue.

Possible impacts include:

  • Loss of node-level metrics (CPU, memory, disk, network)
  • Reduced observability for workloads running on the node
  • Node may be powered off, unreachable, or network-isolated
  • If the node is actually down, workloads may be rescheduled or unavailable

This alert does not always mean the node is completely down, but it does mean monitoring visibility is lost.

Diagnosis

Check whether the node is visible and ready in the cluster:

kubectl get nodes

Check detailed node status and recent conditions:

kubectl describe node <NODE_NAME>

Verify that the node-exporter pod is running (Kubernetes setup):

kubectl get pods -n monitoring -o wide | grep node-exporter
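
If the pod is missing or not Running, check the DaemonSet that manages it; a hedged example, assuming node-exporter is deployed as a DaemonSet in the monitoring namespace:

kubectl get daemonset -n monitoring | grep node-exporter
kubectl describe pod <NODE_EXPORTER_POD> -n monitoring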

Check events related to the node:

kubectl get events --field-selector involvedObject.kind=Node

If you have SSH access to the node, verify node-exporter and node health:

systemctl status node-exporter
uptime
df -h
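
If the node-exporter service is failing, its recent logs usually show why (the unit name node-exporter is an assumption, matching the status command above):

journalctl -u node-exporter --since "1 hour ago" --no-pager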

Test connectivity from Prometheus to the node:

curl http://<NODE_IP>:9100/metrics
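
You can also confirm what Prometheus itself sees by querying its HTTP API; a sketch assuming Prometheus listens on port 9090 and scrapes the exporter under a node-exporter job label:

curl -s 'http://<PROMETHEUS_HOST>:9090/api/v1/query?query=up{job="node-exporter"}'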

Possible Causes

  • Node is powered off or crashed
  • Network connectivity issue between Prometheus and the node
  • node-exporter service is stopped or crashed
  • Firewall or security group blocking port 9100
  • High resource pressure causing exporter to fail

Mitigation

  1. If the node is down, restore or reboot the node
  2. Restart node-exporter if the service is not running (example commands after this list)
  3. Fix networking or firewall issues blocking metrics access
  4. If the node is unhealthy, consider draining it:
kubectl drain <NODE_NAME> --ignore-daemonsets
  5. Once resolved, uncordon the node:
kubectl uncordon <NODE_NAME>
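
Example restart commands for step 2, depending on how node-exporter is deployed (the service unit name, namespace, and pod name are assumptions; adjust to your environment):

# systemd-managed exporter on the node itself
sudo systemctl restart node-exporter

# DaemonSet-managed exporter in Kubernetes: delete the pod so the DaemonSet recreates it
kubectl delete pod <NODE_EXPORTER_POD> -n monitoring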

Escalation

  • If multiple nodes are affected, escalate to the platform or infrastructure team immediately
  • If production workloads are impacted for more than 10 minutes, page the on-call engineer
  • If cloud provider issues are suspected, open a support ticket

Related Alerts

  • KubeletDown
  • NodeNotReady
  • DiskFull

Dashboards

  • Grafana → Node Exporter / Node Overview