--- runbooks:coustom_alerts:noderebootedrecently [2025/12/13 16:24] – created admin
+++ runbooks:coustom_alerts:noderebootedrecently [2025/12/14 06:41] (current) – admin
@@ Line 1: / Line 1: @@
 runbooks:coustom_alerts:NodeRebootedRecently
+====== NodeRebootedRecently ======
+===== Meaning =====
+This alert is triggered when a node has rebooted within the last 5 minutes.
+It is detected by comparing the current time with the node's boot time as reported by node-exporter.
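+The comparison described above can be sketched in shell. This is a minimal illustration, not the actual alert rule: the function name ''rebooted_recently'' and the fixed timestamps are hypothetical, and the 300-second default is assumed from the 5-minute window stated above.

```shell
#!/bin/sh
# Sketch of the "rebooted recently" condition (hypothetical helper).
# Arguments:
#   $1  current time, in seconds since the Unix epoch
#   $2  boot time, in seconds since the Unix epoch
#   $3  window in seconds (optional; defaults to 300, i.e. 5 minutes)
# Returns success (exit status 0) when the node booted less than
# <window> seconds before the current time.
rebooted_recently() {
  [ "$(( $1 - $2 ))" -lt "${3:-300}" ]
}

# Example with made-up timestamps: boot at t=1000, checked 120 seconds later.
if rebooted_recently 1120 1000; then
  echo "rebooted within window"
else
  echo "not recent"
fi
```

+On a live system the two arguments would be the current epoch time (''date +%s'') and the node's boot time; node-exporter typically exposes the latter as the ''node_boot_time_seconds'' metric.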
+===== Impact =====
+This alert indicates a **recent node restart**, which may affect workloads running on the node.
+Possible impacts include:
+  * Temporary disruption of pods scheduled on the node
+  * Pod restarts or rescheduling to other nodes
+  * Short-lived service degradation
+  * Loss of in-memory application state
+This alert is typically **informational or warning-level**, but may require attention if reboots are frequent or unexpected.
+===== Diagnosis =====
+Verify node status and readiness:
+<code bash>
+kubectl get nodes
+</code>
+Check detailed node information and recent events:
+<code bash>
+kubectl describe node <NODE_NAME>
+</code>
+Check events related to node reboot or pressure conditions:
+<code bash>
+kubectl get events -A --field-selector involvedObject.kind=Node
+</code>
+Check system uptime from node-exporter metrics (Grafana) or via SSH:
+<code bash>
+uptime
+</code>
+If SSH access is available, list the recorded boots and inspect the logs of the previous boot for the cause of the reboot (this requires a persistent journal):
+<code bash>
+journalctl --list-boots
+journalctl -b -1
+</code>
+===== Possible Causes =====
+  * Planned maintenance or OS patching
+  * Kernel panic or hardware issue
+  * Cloud provider host restart
+  * Manual reboot by an operator
+  * Power loss or severe resource pressure
+===== Mitigation =====
+  - Confirm whether the reboot was planned or expected
+  - Ensure the node is in the ''Ready'' state
+  - Verify that all critical pods have been rescheduled successfully
+  - Check workloads for crash loops or degraded performance
+  - If reboots are frequent, investigate system and kernel logs
+If needed, temporarily cordon the node so that no new pods are scheduled on it during the investigation (pods already running on it are not evicted):
+<code bash>
+kubectl cordon <NODE_NAME>
+</code>
+Uncordon the node once it has been verified healthy:
+<code bash>
+kubectl uncordon <NODE_NAME>
+</code>
+===== Escalation =====
+  * If the reboot was unplanned, notify the platform or infrastructure team
+  * If the same node reboots multiple times within 24 hours, escalate immediately
+  * If production services are impacted, page the on-call engineer
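+The "multiple reboots within 24 hours" rule above can be checked with a small helper. This is a sketch under stated assumptions: the function name ''count_recent_boots'' is hypothetical, the boot times are assumed to be available as epoch seconds (one per line), and the sample timestamps are made up. How those timestamps are obtained (for example from ''journalctl --list-boots'' or from node-exporter history) is left out because it depends on the environment.

```shell
#!/bin/sh
# Count how many boots fall inside a time window (hypothetical helper).
# Arguments:
#   $1  current time, in seconds since the Unix epoch
#   $2  window in seconds (optional; defaults to 86400, i.e. 24 hours)
# Input (stdin): boot times as epoch seconds, one per line.
# Output: the number of boots that happened within the window before $1.
count_recent_boots() {
  awk -v now="$1" -v window="${2:-86400}" '
    $1 ~ /^[0-9]+$/ && $1 <= now && now - $1 <= window { n++ }
    END { print n + 0 }
  '
}

# Example with made-up boot times, evaluated at t=100000:
# only 50000 and 90000 lie within the preceding 86400 seconds.
printf '%s\n' 1000 50000 90000 | count_recent_boots 100000
```

+A result of 2 or more for the same node would match the "multiple times within 24 hours" condition; the exact threshold is a judgment call for the team.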
+===== Related Alerts =====
+  * NodeDown
+  * NodeNotReady
+  * KubeletDown
+===== Related Dashboards =====
+  * Grafana → Node Overview
+  * Grafana → Node Exporter