====== NodeDown ======
===== Meaning =====
This alert is triggered when Prometheus is unable to scrape metrics from the node-exporter running on a node.
It fires when the `up` metric for the node-exporter target is `0`, indicating that the exporter is unreachable from Prometheus.
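The underlying alerting expression is typically of this form (a sketch; the `job` label value and the hold duration are assumptions that may differ in your setup):
up{job="node-exporter"} == 0
Rules usually require this to hold for a few minutes (e.g. `for: 5m`) so that a single missed scrape does not fire the alert.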
===== Impact =====
This alert indicates a **critical node-level issue**.
Possible impacts include:
* Loss of node-level metrics (CPU, memory, disk, network)
* Reduced observability for workloads running on the node
* Node may be powered off, unreachable, or network-isolated
* If the node is actually down, workloads may be rescheduled or unavailable
This alert does **not always mean the node is completely down**, but it does mean monitoring visibility is lost.
===== Diagnosis =====
Check whether the node is visible and ready in the cluster:
kubectl get nodes
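Adding `-o wide` also shows each node's internal IP, which is needed for the connectivity checks further down:
kubectl get nodes -o wide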
Check detailed node status and recent conditions:
kubectl describe node <node-name>
Verify that the node-exporter pod for the affected node is running (Kubernetes setup):
kubectl get pods -n monitoring -o wide | grep node-exporter
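If the pod exists but is not healthy, its logs may show why (the `monitoring` namespace matches the command above; the pod name is a placeholder):
kubectl logs -n monitoring <node-exporter-pod>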
Check events related to the node:
kubectl get events --field-selector involvedObject.kind=Node
If you have SSH access to the node, verify node-exporter and node health:
systemctl status node-exporter
uptime
df -h
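If the service has failed, its recent logs usually reveal the cause, and you can confirm the exporter is listening on its port (assuming node-exporter runs as the systemd unit named above):
journalctl -u node-exporter --since "1 hour ago"
ss -tlnp | grep 9100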
Test connectivity from Prometheus to the node:
curl http://<node-ip>:9100/metrics
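You can also ask Prometheus itself which node-exporter targets are down via its HTTP API (a sketch; the Prometheus host and the `job` label value are assumptions):
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=up{job="node-exporter"} == 0'
The error shown on the Prometheus Status → Targets page (e.g. connection refused vs. timeout) also helps distinguish a stopped exporter from a network problem.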
===== Possible Causes =====
* Node is powered off or crashed
* Network connectivity issue between Prometheus and the node
* node-exporter service is stopped or crashed
* Firewall or security group blocking port 9100 (see the in-cluster check after this list)
* High resource pressure causing exporter to fail
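To rule out the network and firewall causes from inside the cluster, you can run a short-lived test pod and fetch the metrics endpoint directly (a sketch; the node IP is a placeholder and the `busybox` image is an assumption):
kubectl run nettest --rm -it --restart=Never --image=busybox -- wget -qO- http://<node-ip>:9100/metrics
If this succeeds while Prometheus still reports the target down, the problem is more likely on the Prometheus side than on the node.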
===== Mitigation =====
- If the node is down, restore or reboot the node
- Restart node-exporter if the service is not running
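For a systemd-managed exporter, or for a DaemonSet pod (which is recreated automatically after deletion), respectively (the unit and pod names follow the diagnosis steps above and may differ):
sudo systemctl restart node-exporter
kubectl delete pod -n monitoring <node-exporter-pod>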
- Fix networking or firewall issues blocking metrics access
- If the node is unhealthy, consider draining it:
kubectl drain <node-name> --ignore-daemonsets
- Once resolved, uncordon the node:
kubectl uncordon <node-name>
===== Escalation =====
* If multiple nodes are affected, escalate to the platform or infrastructure team immediately
* If production workloads are impacted for more than 10 minutes, page the on-call engineer
* If cloud provider issues are suspected, open a support ticket
===== Related Alerts =====
* KubeletDown
* NodeNotReady
* DiskFull
===== Related Dashboards =====
* Grafana → Node Exporter / Node Overview