runbooks:coustom_alerts:noderebootedrecently

runbooks:coustom_alerts:noderebootedrecently [2025/12/13 16:24] – created admin
runbooks:coustom_alerts:noderebootedrecently [2025/12/14 06:41] (current) admin

====== NodeRebootedRecently ======

===== Meaning =====
This alert fires when a node has rebooted within the last 5 minutes.
The reboot is detected by comparing the current time with the node's boot time as reported by node-exporter.
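
The comparison described above can be sketched in shell. This is a minimal illustration, not the alert's actual rule: the 300-second threshold is taken from the "last 5 minutes" wording, the function names `boot_time_epoch` and `rebooted_recently` are hypothetical, and the boot time is read from the `btime` field of `/proc/stat` on the node itself rather than from node-exporter.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the "rebooted recently" comparison.

# Print the node's boot time as a Unix timestamp.
# Linux only: reads the btime field of /proc/stat.
boot_time_epoch() {
  awk '$1 == "btime" { print $2 }' /proc/stat
}

# rebooted_recently NOW BOOT [THRESHOLD_SECONDS]
# Succeeds (exit 0) when BOOT is less than THRESHOLD_SECONDS before NOW.
# The default of 300 seconds mirrors the 5-minute window (an assumption).
rebooted_recently() {
  local now=$1 boot=$2 threshold=${3:-300}
  (( now - boot < threshold ))
}
```

For example, `rebooted_recently "$(date +%s)" "$(boot_time_epoch)"` succeeds only on a node that booted less than 5 minutes ago.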

===== Impact =====
This alert indicates a **recent node restart**, which may affect workloads running on the node.

Possible impacts include:
  * Temporary disruption of pods scheduled on the node
  * Pod restarts or rescheduling to other nodes
  * Short-lived service degradation
  * Loss of in-memory application state

This alert is typically **informational or warning-level**, but it requires attention if reboots are frequent or unexpected.

===== Diagnosis =====
Verify node status and readiness:

<code bash>
kubectl get nodes
</code>

Check detailed node information and recent events:

<code bash>
kubectl describe node <NODE_NAME>
</code>

Check events related to node reboot or pressure conditions:

<code bash>
kubectl get events --field-selector involvedObject.kind=Node
</code>

Check system uptime from node-exporter metrics (Grafana) or via SSH:

<code bash>
uptime
</code>
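
When Grafana is not at hand, the same uptime figure can be pulled from the Prometheus HTTP API. This is a rough sketch, assuming that node-exporter's `node_boot_time_seconds` metric is scraped, that `curl` and `jq` are installed, and that the Prometheus URL and the `instance` label value (both placeholders) match your setup. The function names are hypothetical.

```bash
#!/usr/bin/env bash

# format_uptime SECONDS
# Print a duration as "<days>d <hours>h <minutes>m".
format_uptime() {
  local s=${1%.*}   # drop any fractional part (Prometheus returns floats)
  echo "$(( s / 86400 ))d $(( s % 86400 / 3600 ))h $(( s % 3600 / 60 ))m"
}

# node_uptime_seconds PROM_URL INSTANCE
# Ask Prometheus for the seconds elapsed since boot of one node-exporter target.
# PROM_URL and INSTANCE are placeholders that depend on your environment.
node_uptime_seconds() {
  local prom_url=$1 instance=$2
  curl -sG "$prom_url/api/v1/query" \
    --data-urlencode "query=time() - node_boot_time_seconds{instance=\"$instance\"}" \
    | jq -r '.data.result[0].value[1]'
}
```

A call might look like `format_uptime "$(node_uptime_seconds "$PROM_URL" "$INSTANCE")"`, where both variables are set to values valid for your monitoring stack.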
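
If SSH access is available, check the system logs for the cause of the reboot. Logs from the previous boot are only available when the journal is stored persistently:

<code bash>
# List recorded boots; 0 is the current boot, -1 the one before it
journalctl --list-boots
# Show the journal of the previous boot; the messages just before the reboot are at the end
journalctl -b -1
</code>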

===== Possible Causes =====
  * Planned maintenance or OS patching
  * Kernel panic or hardware issue
  * Cloud provider host restart
  * Manual reboot by an operator
  * Power loss or severe resource pressure (for example, memory exhaustion)

===== Mitigation =====
  - Confirm whether the reboot was planned or expected
  - Ensure the node is in the `Ready` state
  - Verify that all critical pods have been rescheduled successfully
  - Check workloads for crash loops or degraded performance
  - If reboots are frequent, investigate system and kernel logs
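
The first checks in the list above can be scripted roughly as follows. This is a sketch under assumptions: `kubectl` is already configured for the affected cluster, and the helper names (`node_ready_status`, `pods_on_node`, `check_node`) are made up for illustration.

```bash
#!/usr/bin/env bash

# node_ready_status NODE
# Print the status of the node's Ready condition (True, False or Unknown).
node_ready_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# pods_on_node NODE
# List every pod currently scheduled on the node, across all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1"
}

# check_node NODE
# Succeed only if the node reports Ready=True; otherwise report on stderr.
check_node() {
  local status
  status=$(node_ready_status "$1")
  if [[ $status == "True" ]]; then
    echo "node $1 is Ready"
  else
    echo "node $1 is NOT Ready (status: ${status:-unknown})" >&2
    return 1
  fi
}
```

In the output of `pods_on_node <NODE_NAME>`, STATUS values such as `CrashLoopBackOff` point to pods that did not recover after the reboot.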

If needed, temporarily cordon the node so that no new pods are scheduled on it during the investigation:

<code bash>
kubectl cordon <NODE_NAME>
</code>

Uncordon the node once it has been verified healthy:

<code bash>
kubectl uncordon <NODE_NAME>
</code>

===== Escalation =====
  * If the reboot was unplanned, notify the platform or infrastructure team
  * If the same node reboots multiple times within 24 hours, escalate immediately
  * If production services are impacted, page the on-call engineer
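
The "multiple times within 24 hours" rule can be checked mechanically once the node's boot timestamps are known. The sketch below assumes they are supplied as Unix epoch seconds, one per line. How they are collected (for example, from the history of `node_boot_time_seconds`) is left open, and the function names are hypothetical.

```bash
#!/usr/bin/env bash

# count_recent_boots NOW [WINDOW_SECONDS]
# Read boot timestamps (Unix epoch seconds, one per line) from stdin and
# print how many fall within the window ending at NOW (default: 24 hours).
count_recent_boots() {
  local now=$1 window=${2:-86400}
  awk -v now="$now" -v window="$window" \
    '$1 ~ /^[0-9]+$/ && $1 <= now && now - $1 <= window { n++ } END { print n + 0 }'
}

# should_escalate COUNT
# Succeed when the node booted more than once inside the window.
should_escalate() {
  (( $1 > 1 ))
}
```

For example, `should_escalate "$(count_recent_boots "$(date +%s)" < boots.txt)"` succeeds when `boots.txt` (a hypothetical file of boot timestamps) contains at least two boots from the last 24 hours.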

===== Related Alerts =====
  * NodeDown
  * NodeNotReady
  * KubeletDown

===== Related Dashboards =====
  * Grafana → Node Overview
  * Grafana → Node Exporter

runbooks/coustom_alerts/noderebootedrecently.txt · Last modified: by admin