runbooks:coustom_alerts:noderebootedrecently
Differences
| runbooks:coustom_alerts:noderebootedrecently [2025/12/13 16:24] – created admin | runbooks:coustom_alerts:noderebootedrecently [2025/12/14 06:41] (current) – admin | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| runbooks: | runbooks: | ||
| + | |||
| + | ====== NodeRebootedRecently ====== | ||
| + | |||
| + | ===== Meaning ===== | ||
| + | This alert is triggered when a node has rebooted within the last 5 minutes. | ||
| + | It is detected by comparing the current time with the node's boot time as reported by node-exporter. | ||
| + | |||
| + | ===== Impact ===== | ||
| + | This alert indicates a **recent node restart** and may affect workloads running on the node. | ||
| + | |||
| + | Possible impacts include: | ||
| + | * Temporary disruption of pods scheduled on the node | ||
| + | * Pod restarts or rescheduling to other nodes | ||
| + | * Short-lived service degradation | ||
| + | * Loss of in-memory application state | ||
| + | |||
| + | This alert is typically **informational or warning-level**, | ||
| + | |||
| + | ===== Diagnosis ===== | ||
| + | Verify node status and readiness: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get nodes | ||
| + | </code> | ||
| + | |||
| + | Check detailed node information and recent events: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl describe node < | ||
| + | </code> | ||
| + | |||
| + | Check events related to node reboot or pressure conditions: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl get events -A --field-selector involvedObject.kind=Node | ||
| + | </code> | ||
| + | |||
| + | Check system uptime from node-exporter metrics (Grafana) or via SSH: | ||
| + | |||
| + | <code bash> | ||
| + | uptime | ||
| + | </code> | ||
| + | |||
| + | If SSH access is available, check the system logs for the cause of the reboot: | ||
| + | |||
| + | <code bash> | ||
| + | journalctl --list-boots | ||
| + | journalctl -b -1 | ||
| + | </code> | ||
| + | |||
| + | ===== Possible Causes ===== | ||
| + | * Planned maintenance or OS patching | ||
| + | * Kernel panic or hardware issue | ||
| + | * Cloud provider host restart | ||
| + | * Manual reboot by an operator | ||
| + | * Power or resource pressure issues | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | - Confirm whether the reboot was planned or expected | ||
| + | - Ensure the node is in the ''Ready'' state | ||
| + | - Verify that all critical pods have been rescheduled successfully | ||
| + | - Check workloads for crash loops or degraded performance | ||
| + | - If reboots are frequent, investigate system and kernel logs | ||
| + | |||
| + | If needed, temporarily cordon the node for investigation: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl cordon < | ||
| + | </code> | ||
| + | |||
| + | Uncordon the node once it has been verified healthy: | ||
| + | |||
| + | <code bash> | ||
| + | kubectl uncordon < | ||
| + | </code> | ||
| + | |||
| + | ===== Escalation ===== | ||
| + | * If the reboot was unplanned, notify the platform or infrastructure team | ||
| + | * If the same node reboots multiple times within 24 hours, escalate immediately | ||
| + | * If production services are impacted, page the on-call engineer | ||
| + | |||
| + | ===== Related Alerts ===== | ||
| + | * NodeDown | ||
| + | * NodeNotReady | ||
| + | * KubeletDown | ||
| + | |||
| + | ===== Related Dashboards ===== | ||
| + | * Grafana → Node Overview | ||
| + | * Grafana → Node Exporter | ||
| + | |||
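
The Meaning section describes the check as a comparison between the current time and the node's boot time, with a 5-minute window. The sketch below shows that comparison in plain shell so it can be reproduced by hand on a node. It is not the actual alert rule, which this page does not show. The function names ''rebooted_recently'' and ''seconds_since_boot'' and their argument order are assumptions made for illustration, and the 300-second default is simply the 5 minutes stated above. node-exporter exposes boot time as the ''node_boot_time_seconds'' metric; here the boot time is instead derived from ''/proc/uptime'', which is Linux-only.

```shell
#!/bin/sh
# Illustrative sketch only: the function names and argument order are
# assumptions, not part of the alert definition.

# rebooted_recently BOOT_EPOCH NOW_EPOCH [WINDOW_SECONDS]
# Succeeds (exit status 0) when the boot happened less than WINDOW_SECONDS
# before NOW_EPOCH. The window defaults to 300 seconds (5 minutes).
rebooted_recently() {
    boot_epoch=$1
    now_epoch=$2
    window=${3:-300}
    [ $((now_epoch - boot_epoch)) -lt "$window" ]
}

# seconds_since_boot
# Prints the node's uptime in whole seconds, read from /proc/uptime
# (Linux only). The fractional part is dropped.
seconds_since_boot() {
    cut -d' ' -f1 /proc/uptime | cut -d. -f1
}
```

On a node reached via SSH, one possible usage is to set ''now=$(date +%s)'' and ''boot=$((now - $(seconds_since_boot)))'', then run ''rebooted_recently "$boot" "$now" && echo "rebooted within the last 5 minutes"''. Passing a third argument such as ''86400'' widens the window to 24 hours, which may help when checking for repeated reboots.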
