====== NodeDown ======

===== Meaning =====
This alert is triggered when Prometheus is unable to scrape metrics from the node-exporter running on a node.
The alert remains firing while the `up` metric for the node-exporter target is `0`, indicating that the exporter is unreachable.
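
To confirm which targets are affected, you can run the same condition directly against the Prometheus HTTP API. This is only a sketch: the Prometheus address and the `job="node-exporter"` label are assumptions and may differ in your setup.

<code bash>
# List node-exporter targets that Prometheus currently cannot scrape.
# <prometheus-host> and the job label are placeholders -- adjust to your environment.
curl -s 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=up{job="node-exporter"} == 0'
</code>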
| + | |||
| + | ===== Impact ===== | ||
| + | This alert indicates a **critical node-level issue**. | ||
| + | |||
| + | Possible impacts include: | ||
| + | * Loss of node-level metrics (CPU, memory, disk, network) | ||
| + | * Reduced observability for workloads running on the node | ||
| + | * Node may be powered off, unreachable, | ||
| + | * If the node is actually down, workloads may be rescheduled or unavailable | ||
| + | |||
| + | This alert does **not always mean the node is completely down**, but it does mean monitoring visibility is lost. | ||
| + | |||
===== Diagnosis =====
Check whether the node is visible and ready in the cluster:

<code bash>
kubectl get nodes
</code>

Check detailed node status and recent conditions:

<code bash>
kubectl describe node <node-name>
</code>

Verify that the node-exporter pod is running (Kubernetes setup):

<code bash>
kubectl get pods -n monitoring -o wide | grep node-exporter
</code>
| + | |||
| + | Check events related to the node: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events --field-selector involvedObject.kind=Node | ||
| + | </ | ||
| + | |||
| + | If you have SSH access to the node, verify node-exporter and node health: | ||
| + | |||
| + | <code bash> | ||
| + | systemctl status node-exporter | ||
| + | uptime | ||
| + | df -h | ||
| + | </ | ||
| + | |||
| + | Test connectivity from Prometheus to the node: | ||
| + | |||
| + | <code bash> | ||
| + | curl http://< | ||
| + | </ | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Node is powered off or crashed | ||
| + | * Network connectivity issue between Prometheus and the node | ||
| + | * node-exporter service is stopped or crashed | ||
| + | * Firewall or security group blocking port 9100 | ||
| + | * High resource pressure causing exporter to fail | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | - If the node is down, restore or reboot the node | ||
| + | - Restart node-exporter if the service is not running | ||
| + | - Fix networking or firewall issues blocking metrics access | ||
| + | - If the node is unhealthy, consider draining it: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl drain < | ||
| + | </ | ||
| + | |||
| + | - Once resolved, uncordon the node: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl uncordon < | ||
| + | </ | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * If multiple nodes are affected, escalate to the platform or infrastructure team immediately | ||
| + | * If production workloads are impacted for more than 10 minutes, page the on-call engineer | ||
| + | * If cloud provider issues are suspected, open a support ticket | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * KubeletDown | ||
| + | * NodeNotReady | ||
| + | * DiskFull | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana โ Node Exporter / Node Overview | ||
| + | |||