====== NodeDown ======

===== Meaning =====

This alert fires when Prometheus is unable to scrape metrics from the node-exporter running on a node. The condition persists while the `up` metric for the node-exporter target is `0`, indicating that the exporter is unreachable. (A sketch of a typical rule definition appears at the end of this page.)

===== Impact =====

This alert indicates a **critical node-level issue**. Possible impacts include:

  * Loss of node-level metrics (CPU, memory, disk, network)
  * Reduced observability for workloads running on the node
  * The node may be powered off, unreachable, or network-isolated
  * If the node is actually down, workloads may be rescheduled or unavailable

This alert does **not always mean the node is completely down**, but it does mean monitoring visibility has been lost.

===== Diagnosis =====

Check whether the node is visible and ready in the cluster:

<code>
kubectl get nodes
</code>

Check detailed node status and recent conditions:

<code>
kubectl describe node <node-name>
</code>

Verify that the node-exporter pod is running (Kubernetes setup):

<code>
kubectl get pods -n monitoring -o wide | grep node-exporter
</code>

Check events related to the node:

<code>
kubectl get events --field-selector involvedObject.kind=Node
</code>

If you have SSH access to the node, verify node-exporter and general node health:

<code>
systemctl status node-exporter
uptime
df -h
</code>

Test connectivity from Prometheus to the node (if port 9100 is only reachable from inside the cluster, see the in-cluster check at the end of this page):

<code>
curl http://<node-ip>:9100/metrics
</code>

===== Possible Causes =====

  * Node is powered off or crashed
  * Network connectivity issue between Prometheus and the node
  * node-exporter service is stopped or crashed
  * Firewall or security group blocking port 9100
  * High resource pressure causing the exporter to fail

===== Mitigation =====

  - If the node is down, restore or reboot it
  - Restart node-exporter if the service is not running
  - Fix networking or firewall issues blocking metrics access
  - If the node is unhealthy, consider draining it: `kubectl drain <node-name> --ignore-daemonsets`
  - Once resolved, uncordon the node: `kubectl uncordon <node-name>`

===== Escalation =====

  * If multiple nodes are affected, escalate to the platform or infrastructure team immediately
  * If production workloads are impacted for more than 10 minutes, page the on-call engineer
  * If a cloud provider issue is suspected, open a support ticket with the provider

===== Related Alerts =====

  * KubeletDown
  * NodeNotReady
  * DiskFull

===== Related Dashboards =====

  * Grafana → Node Exporter / Node Overview
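===== Example Alert Rule =====

For reference, a minimal sketch of how a NodeDown rule is commonly defined in Prometheus. The job label (`node-exporter`), the `for:` duration, and the severity label are assumptions for illustration; check your own rule files for the values actually in use.

<code>
groups:
  - name: node-alerts            # hypothetical group name
    rules:
      - alert: NodeDown
        # Fires when the node-exporter target has been unscrapable
        # for a sustained period; the job label and 5m hold time
        # are assumed here, not confirmed from this cluster's config.
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "node-exporter on {{ $labels.instance }} is unreachable"
</code>

The `for:` clause is what makes the condition persist before firing, which is why a brief scrape hiccup does not page anyone.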
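===== In-Cluster Connectivity Check =====

If port 9100 is not reachable from your workstation, the `curl` test from the Diagnosis section can be run from a throwaway pod instead. A sketch, assuming the `curlimages/curl` image can be pulled in your cluster; substitute the node's InternalIP (from `kubectl get nodes -o wide`) for `<node-ip>`:

<code>
# Start a temporary pod, curl the exporter, and remove the pod afterwards
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sS --max-time 5 http://<node-ip>:9100/metrics
</code>

A successful response returns plain-text metrics. A timeout here while the node is otherwise healthy points at a firewall rule or network policy blocking port 9100.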