====== Mattermost Backup CronJob ======

===== Meaning =====

These alerts monitor the Kubernetes Job created by the Mattermost backup CronJob.

Alerts:

  * `MattermostBackupSucceeded` → Backup Job succeeded; the PostgreSQL database dump was uploaded to backup storage.
  * `MattermostBackupFailed` → Backup Job failed; the PostgreSQL dump did not complete or failed to upload.

===== Impact =====

  * Success → Backup completed. Mattermost data is stored in S3/MinIO.
  * Failure → The Mattermost database may not be backed up, which could affect disaster recovery if a restore is needed.

===== Diagnosis =====

1. Check the Kubernetes Job status:

    kubectl get job mattermost-backup-job -n <namespace>
    kubectl describe job mattermost-backup-job -n <namespace>

2. Check the logs of the Job pod:

    kubectl logs job/mattermost-backup-job -n <namespace>

3. Verify the backup in S3/MinIO:

    mc ls <alias>/mattermost-backups/
    mc stat <alias>/mattermost-backups/

4. Check PVC mounts, if used:

    kubectl get pvc -n <namespace>
    kubectl describe pvc -n <namespace>

===== Possible Causes of Failure =====

  * Pod in CrashLoopBackOff, OOMKilled, or Failed
  * PVC mount unavailable or insufficient space
  * Backup storage credentials missing or misconfigured
  * Network issues preventing upload to S3/MinIO
  * Disk space or permissions issues on the node
  * CronJob manifest misconfiguration
  * Database credentials invalid or database inaccessible

===== Mitigation =====

1. Inspect the Job pod logs to identify errors.

2. Verify S3/MinIO credentials and connectivity.

3. Check PVC status and node disk availability.

4. Verify database credentials and connectivity.

5. Retry the backup manually if needed:

    kubectl create job --from=cronjob/mattermost-backup-job mattermost-backup-job-manual -n <namespace>

6. Correct any misconfigurations in the CronJob YAML, database, or backup storage policies.

7. Escalate to the SRE or admin team if repeated failures occur.

===== Escalation =====

  * Escalate if backups fail for more than one consecutive run.
  * Notify the on-call engineer if production Mattermost data may not be recoverable.
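
===== Example: Scripted Diagnosis and Retry =====

The Diagnosis and Mitigation steps above could be wrapped in a small shell helper. The following is a minimal sketch, not part of the deployed tooling: the function names (`backup_job_state`, `diagnose_backup`, `retry_backup`), the namespace argument, the 50-line log tail, the 600-second timeout, and the timestamp suffix on the manual Job name are all assumptions chosen for illustration. Only the Job and CronJob name `mattermost-backup-job` comes from this page.

```shell
#!/bin/sh
# Sketch only: JOB_NAME comes from the runbook; the namespace argument and
# the function names are assumptions made for illustration.
JOB_NAME="mattermost-backup-job"

# Print "succeeded", "failed", or "pending" based on the Job's conditions.
backup_job_state() {
    ns="$1"
    complete=$(kubectl get job "$JOB_NAME" -n "$ns" \
        -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}')
    failed=$(kubectl get job "$JOB_NAME" -n "$ns" \
        -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
    if [ "$complete" = "True" ]; then
        echo "succeeded"
    elif [ "$failed" = "True" ]; then
        echo "failed"
    else
        echo "pending"
    fi
}

# Report the Job state; on failure, show the Job description and recent logs.
diagnose_backup() {
    ns="$1"
    state=$(backup_job_state "$ns")
    echo "Job $JOB_NAME in namespace $ns: $state"
    if [ "$state" = "failed" ]; then
        kubectl describe job "$JOB_NAME" -n "$ns"
        kubectl logs "job/$JOB_NAME" -n "$ns" --tail=50
    fi
}

# Start a one-off run from the CronJob and wait for it to complete.
# The timestamp suffix (an assumption) avoids name clashes between retries.
retry_backup() {
    ns="$1"
    manual="${JOB_NAME}-manual-$(date +%s)"
    kubectl create job --from="cronjob/$JOB_NAME" "$manual" -n "$ns" &&
        kubectl wait --for=condition=complete "job/$manual" -n "$ns" --timeout=600s
}
```

After sourcing the file, `diagnose_backup <namespace>` prints the Job state and, on failure, the details needed for the first Mitigation step; `retry_backup <namespace>` corresponds to the manual retry step. The helper reads the Job's `Complete` and `Failed` conditions rather than pod counts, since a Job with failed pods may still be retrying.
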

===== Related Alerts =====

  * MattermostBackupSucceeded
  * MattermostBackupFailed
  * HostOutOfDiskSpace (node running the backup Job)
  * KubernetesPodCrashLooping

===== Related Dashboards =====

  * Kubernetes → Jobs & CronJobs (namespace: <namespace>)
  * Grafana → Backup Job status metrics
  * S3/MinIO → Backup object listings