Disaster Recovery Plan
Backup strategy, recovery procedures, and RTO/RPO targets for the DataEngineX platform.
RTO / RPO Targets
| Tier | Component | RPO | RTO | Strategy |
|---|---|---|---|---|
| 1 | PostgreSQL | 1 hour | 30 min | Scheduled pg_dump + WAL archiving |
| 1 | Application state | 0 | 15 min | Stateless — redeploy from Git |
| 2 | Qdrant vectors | 24 hours | 2 hours | Daily snapshot + restore |
| 2 | MinIO objects | 24 hours | 1 hour | Daily sync to S3/backup bucket |
| 3 | Redis cache | N/A | 5 min | Ephemeral — auto-rebuilds on start |
| 3 | Prometheus metrics | 24 hours | 30 min | Thanos/remote-write (if configured) |
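One way to make these targets checkable is to compare the age of the newest backup artifact with the RPO of its tier. The sketch below is an assumption, not part of the platform: `backup_is_fresh` is a hypothetical helper, the file name in the example is a placeholder, and the check relies on `find -mmin`, which GNU and BSD `find` support but POSIX does not require.

```shell
#!/bin/sh
# backup_is_fresh FILE MAX_AGE_MINUTES
# Returns 0 if FILE exists and was modified less than MAX_AGE_MINUTES ago.
backup_is_fresh() {
    file=$1
    max_age_min=$2
    [ -f "$file" ] || return 1
    # find prints the path only when the modification time is inside the window.
    [ -n "$(find "$file" -mmin "-$max_age_min" -print)" ]
}

# Example (placeholder file name): warn when the daily dump is older than 24 hours.
# backup_is_fresh /opt/infradex/backups/dex_20240101.sql 1440 || echo "RPO breached"
```

The same helper could be pointed at Qdrant snapshots or the MinIO mirror with their own thresholds.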
Backup Strategy
Automated Backups
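# Daily PostgreSQL backup (add to cron; % must be escaped as \% inside a crontab entry)
# Compressed so the file matches the gunzip-based restore in Scenario 2
0 2 * * * kubectl exec -n dex deploy/postgres -- pg_dump -U dex dex | gzip > /opt/infradex/backups/dex_$(date +\%Y\%m\%d).sql.gz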
# Daily Qdrant snapshot (needed to meet the 24-hour RPO)
0 3 * * * curl -X POST http://qdrant:6333/collections/dex_vectors/snapshots
# Daily MinIO sync (if S3 backup target configured)
0 4 * * * mc mirror --overwrite minio/dex s3/dex-backup/minio/
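A dump written straight to its final path can leave a truncated file behind if `pg_dump` fails partway, and that file then looks like a valid backup. A minimal sketch of a wrapper that only publishes the file on success is shown below; `write_backup` is a hypothetical helper, and any dump command can be passed to it.

```shell
#!/bin/sh
# write_backup OUT_FILE CMD [ARGS...]
# Runs CMD, gzips its stdout, and moves the result to OUT_FILE only if
# both the command and the compression succeeded.
write_backup() {
    out=$1
    shift
    tmp="$out.tmp"
    if "$@" > "$tmp.raw" && gzip -c "$tmp.raw" > "$tmp"; then
        rm -f "$tmp.raw"
        mv "$tmp" "$out"
    else
        # Leave no partial file behind that could be mistaken for a backup.
        rm -f "$tmp.raw" "$tmp"
        return 1
    fi
}

# Example, reusing the pg_dump command from the cron entry above:
# write_backup "/opt/infradex/backups/dex_$(date +%Y%m%d).sql.gz" \
#     kubectl exec -n dex deploy/postgres -- pg_dump -U dex dex
```

The uncompressed intermediate file costs extra disk space temporarily; that trade-off avoids depending on `pipefail`, which not every `/bin/sh` provides.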
Backup Verification
Run monthly restore drills:
- Restore PostgreSQL backup to a test database
- Verify row counts match production
- Run application health checks against test database
- Document results and any issues
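The row-count step of the drill can be sketched as a comparison of two text files, one per database. The format is an assumption: each line holds a table name and its row count, produced for example by running `SELECT count(*)` per table with `psql -At`. `compare_counts` is a hypothetical helper.

```shell
#!/bin/sh
# compare_counts PROD_FILE TEST_FILE
# Each file contains "table_name row_count" lines in any order.
# Prints a diff of the mismatching lines and returns 1 if the counts differ.
compare_counts() {
    cc_prod=$(mktemp)
    cc_test=$(mktemp)
    sort "$1" > "$cc_prod"
    sort "$2" > "$cc_test"
    if diff "$cc_prod" "$cc_test"; then
        cc_status=0
    else
        cc_status=1
    fi
    rm -f "$cc_prod" "$cc_test"
    return "$cc_status"
}

# Example (placeholder file names):
# compare_counts prod_counts.txt drill_counts.txt && echo "row counts match"
```

Counts taken while production is still receiving writes will drift, so a drill would likely compare against counts captured at backup time.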
Recovery Procedures
Scenario 1: VPS Total Loss
Time estimate: 30-60 minutes
- Provision new VPS
- Update inventory with new IP
- Run full deployment
ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/install-k3s.yml
./scripts/one-click-deploy.sh
- Restore database from the latest backup (same procedure as Scenario 2)
- Update DNS to new IP
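After the DNS change, the service may take a while to become reachable, so it can help to poll before declaring the recovery complete. The sketch below is a generic retry loop; `wait_until` is a hypothetical helper and the health URL in the example is a placeholder, since the source does not name one.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD up to ATTEMPTS times, sleeping DELAY_SECONDS between failures.
# Returns 0 as soon as CMD succeeds, 1 if every attempt failed.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (placeholder URL): poll for up to 5 minutes.
# wait_until 30 10 curl -fsS https://dex.example.com/health
```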
Scenario 2: Database Corruption
Time estimate: 15-30 minutes
- Stop application pods
- Restore from latest backup
# Drop and recreate database
psql -h localhost -U postgres -c "DROP DATABASE IF EXISTS dex;"
psql -h localhost -U postgres -c "CREATE DATABASE dex;"
# Restore (ON_ERROR_STOP makes psql abort on the first SQL error instead of continuing)
gunzip -c /path/to/latest/backup.sql.gz | psql -v ON_ERROR_STOP=1 -h localhost -U postgres -d dex
- Restart application pods
- Verify data integrity
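The stop and restart steps above can be sketched with `kubectl scale`. The names here are assumptions: the `dex` namespace is taken from the backup commands, and the deployment name `dataenginex` is guessed from the Helm release name. Setting `DRY_RUN=1` prints the commands instead of running them, which is useful when rehearsing the procedure.

```shell
#!/bin/sh
# Assumed defaults; override through the environment if they differ.
NAMESPACE=${NAMESPACE:-dex}
DEPLOYMENT=${DEPLOYMENT:-dataenginex}

# run CMD [ARGS...]: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Scale the application to zero so nothing writes during the restore.
stop_app() {
    run kubectl scale "deployment/$DEPLOYMENT" -n "$NAMESPACE" --replicas=0
}

# start_app [REPLICAS]: scale back up (default 1) and wait for the rollout.
start_app() {
    run kubectl scale "deployment/$DEPLOYMENT" -n "$NAMESPACE" --replicas="${1:-1}"
    run kubectl rollout status "deployment/$DEPLOYMENT" -n "$NAMESPACE" --timeout=120s
}
```

A typical sequence would be `stop_app`, the restore commands from this scenario, then `start_app`.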
Scenario 3: Kubernetes Cluster Failure
Time estimate: 30 minutes
- Reinstall K3s
- Redeploy all Helm charts
helm install dataenginex helm/charts/dataenginex -f helm/values/values-vps.yaml
helm install dex-studio helm/charts/dex-studio -f helm/values/values-vps.yaml
- Restore persistent data (PostgreSQL, Qdrant)
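For the Qdrant part of the restore, one option is Qdrant's snapshot recovery endpoint, which loads a collection from a snapshot file or URL. This is a sketch under assumptions: the service URL and collection name are copied from the snapshot cron entry, and the snapshot location must be supplied by the operator because the source does not say where snapshots are kept.

```shell
#!/bin/sh
# Defaults taken from the snapshot cron entry; override if they differ.
QDRANT_URL=${QDRANT_URL:-http://qdrant:6333}
COLLECTION=${COLLECTION:-dex_vectors}

# recover_body LOCATION: build the JSON request body for a recovery call.
# LOCATION is a file:// or http(s):// URI that the Qdrant server can read.
recover_body() {
    printf '{"location": "%s"}' "$1"
}

# recover_collection LOCATION: ask Qdrant to rebuild the collection
# from the snapshot at LOCATION.
recover_collection() {
    curl -fsS -X PUT "$QDRANT_URL/collections/$COLLECTION/snapshots/recover" \
        -H 'Content-Type: application/json' \
        -d "$(recover_body "$1")"
}

# Example (the snapshot path is a placeholder):
# recover_collection "file:///qdrant/snapshots/dex_vectors/<snapshot-name>.snapshot"
```

PostgreSQL can be restored with the commands from Scenario 2 once the database pod is running again.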
Scenario 4: Secret Compromise
Time estimate: 10 minutes
- Rotate all secrets immediately
- Revoke any external API tokens (Cloudflare, container registry)
- Audit access logs for unauthorized activity
- Update monitoring alerts for anomalous patterns
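The rotation step could be scripted roughly as follows. This is a sketch, not the platform's actual procedure: `new_secret` and `rotate_secret` are hypothetical helpers, the secret name and key are placeholders, and the `dex` namespace is assumed from the backup commands.

```shell
#!/bin/sh
# new_secret [BYTES]: print BYTES random bytes (default 32) as lowercase hex.
new_secret() {
    head -c "${1:-32}" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# rotate_secret NAME KEY: write a Kubernetes secret NAME holding one freshly
# generated value under KEY. The manifest is rendered client-side and then
# applied, so the command works whether or not the secret already exists.
rotate_secret() {
    kubectl create secret generic "$1" -n dex \
        --from-literal="$2=$(new_secret)" \
        --dry-run=client -o yaml | kubectl apply -f -
}

# Example (placeholder names):
# rotate_secret dex-api-credentials API_KEY
```

Pods only pick up the new value after a restart, for example with `kubectl rollout restart`, and credentials that also live inside a service, such as a database password, have to be changed there as well.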
Communication Plan
| Severity | Notify | Response time |
|---|---|---|
| P1 — Full outage | All stakeholders | Immediate |
| P2 — Degraded | Engineering team | Within 30 min |
| P3 — Minor issue | On-call engineer | Within 2 hours |
Post-Incident
After every incident:
- Conduct blameless post-mortem
- Document root cause and timeline
- Update runbooks if procedures were unclear
- Implement preventive measures
- Test the fix with a simulated failure