NodeRebootedRecently

Meaning

This alert fires when a node has rebooted within the last 5 minutes. The rule detects a reboot by comparing the current time with the node's boot time as reported by node-exporter.
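
The comparison described above can be sketched in shell. This is an illustration only: the actual alert rule is not shown in this runbook, and it is presumably a PromQL expression along the lines of `time() - node_boot_time_seconds < 300` (that form is an assumption). The function name `rebooted_recently` and its arguments are hypothetical.

```shell
# Hypothetical helper that mirrors the alert condition described above.
# Usage: rebooted_recently NOW_EPOCH BOOT_EPOCH [THRESHOLD_SECONDS]
# Returns 0 (true) if the boot happened less than THRESHOLD seconds before NOW.
rebooted_recently() {
  now=$1
  boot=$2
  threshold=${3:-300}   # 5 minutes, matching the alert description
  uptime_seconds=$((now - boot))
  [ "$uptime_seconds" -lt "$threshold" ]
}
```

On a Linux node, the boot time is available as the `btime` field of `/proc/stat`, so a manual check could look like `rebooted_recently "$(date +%s)" "$(awk '/^btime/ {print $2}' /proc/stat)" && echo "rebooted recently"`.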

Impact

This alert indicates a recent node restart and may affect workloads running on the node.

Possible impacts include:

  • Temporary disruption of pods scheduled on the node
  • Pod restarts or rescheduling to other nodes
  • Short-lived service degradation
  • Loss of in-memory application state

This alert is typically informational or warning-level, but it may require attention if reboots are frequent or unexpected.

Diagnosis

Verify node status and readiness:

kubectl get nodes
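
On clusters with many nodes, it may help to narrow the listing to nodes that are not ready. A minimal sketch, assuming the default column layout of `kubectl get nodes --no-headers` (name in column 1, status in column 2); the helper name `not_ready_nodes` is hypothetical:

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name and status of every node whose STATUS column does not
# start with "Ready" (for example NotReady or Unknown).
# Cordoned but healthy nodes ("Ready,SchedulingDisabled") are not printed.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1, $2 }'
}
```

Usage: `kubectl get nodes --no-headers | not_ready_nodes`. Empty output means every node reports a status beginning with `Ready`.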

Check detailed node information and recent events:

kubectl describe node <NODE_NAME>

Check events related to node reboot or pressure conditions:

kubectl get events --field-selector involvedObject.kind=Node

Check system uptime from node-exporter metrics (Grafana) or via SSH:

uptime

If SSH access is available, check system logs for reboot cause:

journalctl --list-boots
journalctl -b -1
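
The previous boot's log can be long, so a simple filter can surface lines that often accompany an unclean reboot. The sketch below is an assumption-laden starting point: the helper name `reboot_clues` is hypothetical, and the pattern list is neither exhaustive nor specific to any distribution.

```shell
# Hypothetical helper: reads log text on stdin and prints lines containing
# strings that commonly appear around kernel panics, OOM kills, watchdog
# resets, or hardware errors. Matching is case-insensitive.
reboot_clues() {
  grep -E -i 'kernel panic|out of memory|oom-killer|watchdog|hardware error'
}
```

Usage: `journalctl -k -b -1 --no-pager | reboot_clues`. No output only means that none of these patterns matched; it does not prove the reboot was clean.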

Possible Causes

  • Planned maintenance or OS patching
  • Kernel panic or hardware issue
  • Cloud provider host restart
  • Manual reboot by an operator
  • Power or resource pressure issues

Mitigation

  1. Confirm whether the reboot was planned or expected
  2. Ensure the node is in `Ready` state
  3. Verify that all critical pods have been rescheduled successfully
  4. Check workloads for crash loops or degraded performance
  5. If reboots are frequent, investigate system and kernel logs
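
To judge whether reboots are frequent, one option is to count how many boots fall inside a time window. A minimal sketch, assuming you already have the node's boot times as Unix epoch seconds, one per line (for example, collected from node-exporter's boot time metric or converted by hand from `journalctl --list-boots`); the helper name `count_recent_boots` and the 24-hour default window are assumptions:

```shell
# Hypothetical helper: counts boots inside a time window.
# Reads boot times (Unix epoch seconds, one per line) on stdin.
# Usage: count_recent_boots NOW_EPOCH [WINDOW_SECONDS]
# Lines that are not plain integers, and boot times after NOW, are ignored.
count_recent_boots() {
  awk -v now="$1" -v window="${2:-86400}" '
    $1 ~ /^[0-9]+$/ && now - $1 >= 0 && now - $1 <= window { n++ }
    END { print n + 0 }
  '
}
```

Usage: `count_recent_boots "$(date +%s)" < boot_times.txt`, where `boot_times.txt` is a hypothetical file of epoch timestamps. A result greater than 1 means the node booted more than once within the window.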

If needed, temporarily cordon the node for investigation:

kubectl cordon <NODE_NAME>

Uncordon once verified healthy:

kubectl uncordon <NODE_NAME>

Escalation

  • If the reboot was unplanned, notify the platform or infrastructure team
  • If the same node reboots multiple times within 24 hours, escalate immediately
  • If production services are impacted, page the on-call engineer

Related Alerts

  • NodeDown
  • NodeNotReady
  • KubeletDown

Related Dashboards

  • Grafana → Node Overview
  • Grafana → Node Exporter

runbooks/coustom_alerts/noderebootedrecently.txt · Last modified: by admin