Disaster Recovery Plan

Backup strategy, recovery procedures, and recovery time objective (RTO) and recovery point objective (RPO) targets for the DataEngineX platform.

RTO / RPO Targets

Tier	Component	RPO	RTO	Strategy
1	PostgreSQL	1 hour	30 min	Scheduled pg_dump + WAL archiving
1	Application state	0	15 min	Stateless — redeploy from Git
2	Qdrant vectors	24 hours	2 hours	Daily snapshot + restore
2	MinIO objects	24 hours	1 hour	Daily sync to S3/backup bucket
3	Redis cache	N/A	5 min	Ephemeral — auto-rebuilds on start
3	Prometheus metrics	24 hours	30 min	Thanos/remote-write (if configured)
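An RPO target only holds while the newest backup is younger than that target. The following is a minimal sketch of a freshness check, assuming each component's backups land as files in a single directory; the function name `backup_is_fresh` and its arguments are illustrative and not part of the platform.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if a backup directory contains at least one file
# modified within the RPO window. Name and arguments are hypothetical.

# usage: backup_is_fresh <backup_dir> <max_age_minutes>
backup_is_fresh() {
  local dir="$1" max_age_min="$2" fresh
  # find lists files modified less than max_age_min minutes ago;
  # head keeps the first match so the check stays cheap.
  fresh=$(find "$dir" -maxdepth 1 -type f -mmin "-${max_age_min}" | head -n 1)
  [ -n "$fresh" ]
}

# Example (24-hour RPO tiers use 1440 minutes):
#   backup_is_fresh /opt/infradex/backups 1440 || echo "RPO at risk" >&2
```

For the 1-hour PostgreSQL target, a check like this would presumably point at the WAL archive location rather than the dump directory, since the dumps alone are taken less often than hourly.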

Backup Strategy

Automated Backups

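# Daily PostgreSQL backup (add to cron). In a crontab entry, % must be escaped as \%.
# Output is gzip-compressed and timestamped to match the restore commands below.
0 2 * * * kubectl exec -n dex deploy/postgres -- pg_dump -U dex dex | gzip > /opt/infradex/backups/dex_$(date +\%Y\%m\%d_\%H\%M\%S).sql.gz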

# Daily Qdrant snapshot (required to meet the 24-hour RPO)
0 3 * * * curl -X POST http://qdrant:6333/collections/dex_vectors/snapshots

# Daily MinIO sync (if S3 backup target configured)
0 4 * * * mc mirror --overwrite minio/dex s3/dex-backup/minio/
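A shell redirect creates the output file even when the dump command fails, so a failed cron run can leave an empty or truncated file that looks like a backup. One way to guard against this is to write to a temporary file and publish it only on success. The sketch below assumes the dump command is passed as arguments; `safe_dump` is a hypothetical helper, not an existing script in the repository.

```shell
#!/usr/bin/env bash
# Sketch: run a dump command, and only publish the compressed result if the
# command succeeded and produced output. The helper name is hypothetical.

# usage: safe_dump <target_file.gz> <dump command...>
safe_dump() {
  local target="$1"
  shift
  local tmp="${target}.partial"
  # Dump to a temporary file, require non-empty output, then compress.
  if "$@" > "$tmp" && [ -s "$tmp" ] && gzip -c "$tmp" > "${tmp}.gz"; then
    mv "${tmp}.gz" "$target"
    rm -f "$tmp"
  else
    rm -f "$tmp" "${tmp}.gz"
    echo "backup failed: ${target} not written" >&2
    return 1
  fi
}

# Example, called from a script rather than directly from crontab:
#   safe_dump "/opt/infradex/backups/dex_$(date +%Y%m%d_%H%M%S).sql.gz" \
#     kubectl exec -n dex deploy/postgres -- pg_dump -U dex dex
```

The trade-off is that the uncompressed dump briefly exists on disk, which keeps the helper portable at the cost of temporary space.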

Backup Verification

Run monthly restore drills:

Restore PostgreSQL backup to a test database
Verify row counts match production
Run application health checks against test database
Document results and any issues
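The row-count step of the drill can be scripted by exporting per-table counts from both databases and comparing them. The sketch below assumes all application tables live in the `public` schema and that `psql` connection arguments are supplied by the caller; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: export "table<TAB>count" lines from a database and compare two
# such exports. Function names and the public-schema choice are assumptions.

# usage: dump_row_counts <psql connection args...>
dump_row_counts() {
  psql "$@" -At -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY 1" |
  while read -r table; do
    # One exact count per table; stdin is detached so psql cannot eat the loop input.
    printf '%s\t%s\n' "$table" \
      "$(psql "$@" -At -c "SELECT count(*) FROM \"$table\"" < /dev/null)"
  done
}

# usage: compare_row_counts <production_counts_file> <restored_counts_file>
compare_row_counts() {
  local prod="$1" restored="$2"
  if diff "$prod" "$restored" > /dev/null; then
    echo "row counts match"
  else
    echo "row count mismatch:" >&2
    diff "$prod" "$restored" >&2
    return 1
  fi
}
```

Because production keeps changing after the dump is taken, the comparison is most meaningful against counts captured at backup time rather than counts read from production during the drill.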

Recovery Procedures

Scenario 1: VPS Total Loss

Time estimate: 30-60 minutes

Provision new VPS

hcloud server create --name dex-server-recovery --type cx31 --image ubuntu-22.04

Update inventory with new IP

vim ansible/inventory/hosts.yml

Run full deployment

ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/install-k3s.yml
./scripts/one-click-deploy.sh

Restore database

gunzip -c /path/to/backup/dex_YYYYMMDD_HHMMSS.sql.gz | \
  psql -h localhost -U postgres -d dex

Update DNS to new IP

Scenario 2: Database Corruption

Time estimate: 15-30 minutes

Stop application pods

kubectl scale deployment --replicas=0 -l app=dataenginex -n dex

Restore from latest backup

# Drop and recreate database
psql -h localhost -U postgres -c "DROP DATABASE IF EXISTS dex;"
psql -h localhost -U postgres -c "CREATE DATABASE dex;"

# Restore
gunzip -c /path/to/latest/backup.sql.gz | psql -h localhost -U postgres -d dex

Restart application pods

kubectl scale deployment --replicas=1 -l app=dataenginex -n dex

Verify data integrity (compare row counts and run application health checks)
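Since this procedure drops the database before restoring, it is worth confirming that the chosen archive is readable first. The sketch below picks the newest intact archive, assuming backups follow the `dex_YYYYMMDD_HHMMSS.sql.gz` naming used in the restore commands; `latest_valid_backup` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: print the newest backup archive that passes a gzip integrity test.
# Relies on timestamped names sorting chronologically; helper name is hypothetical.

# usage: latest_valid_backup <backup_dir>
latest_valid_backup() {
  local dir="$1" f best=""
  # Glob results are sorted by name, so the last valid match is the newest.
  for f in "$dir"/dex_*.sql.gz; do
    [ -f "$f" ] || continue
    gzip -t "$f" 2>/dev/null && best="$f"
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}

# Example:
#   backup=$(latest_valid_backup /opt/infradex/backups) || exit 1
#   gunzip -c "$backup" | psql -h localhost -U postgres -d dex
```

`gzip -t` only proves the archive is not truncated or corrupted; it does not validate the SQL inside, which is what the monthly restore drill is for.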

Scenario 3: Kubernetes Cluster Failure

Time estimate: 30 minutes

Reinstall K3s

ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/install-k3s.yml

Redeploy all Helm charts

helm install dataenginex helm/charts/dataenginex -f helm/values/values-vps.yaml
helm install dex-studio helm/charts/dex-studio -f helm/values/values-vps.yaml

Restore persistent data (PostgreSQL, Qdrant)
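Right after a K3s reinstall the API server may not answer immediately, so the Helm step can fail if it runs too early. A small retry helper is one way to wait between steps. This is a generic sketch; `retry` is a hypothetical function, and the attempt counts in the example are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: run a command until it succeeds or the attempts run out.
# Helper name and example values are illustrative.

# usage: retry <attempts> <delay_seconds> <command...>
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep between attempts, but not after the final one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: wait up to about 5 minutes for the API, then for node readiness.
#   retry 30 10 kubectl get nodes
#   kubectl wait --for=condition=Ready node --all --timeout=120s
```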

Scenario 4: Secret Compromise

Time estimate: 10 minutes

Rotate all secrets immediately

./scripts/rotate-secrets.sh dex

Revoke any external API tokens (Cloudflare, container registry)
Audit access logs for unauthorized activity
Update monitoring alerts for anomalous patterns
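The contents of `rotate-secrets.sh` are not shown here. As an illustration only, rotating a single credential by hand could look like the sketch below, which generates a random value and applies it as a Kubernetes secret. The function names, and any secret name or key passed to them, are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: generate a new random credential and apply it as a Kubernetes
# secret. Not the platform's rotation script; all names are hypothetical.

# Print 64 hex characters (32 random bytes from /dev/urandom).
generate_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# usage: apply_secret <namespace> <secret_name> <key> <value>
apply_secret() {
  # Render the secret locally, then apply it so an existing secret is updated.
  kubectl create secret generic "$2" -n "$1" \
    --from-literal="$3=$4" --dry-run=client -o yaml | kubectl apply -f -
}

# Example:
#   apply_secret dex example-db-credentials password "$(generate_secret)"
```

Pods that read the secret through environment variables generally need a restart before they pick up the new value.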

Communication Plan

Severity	Notify	Timeframe
P1 — Full outage	All stakeholders	Immediate
P2 — Degraded	Engineering team	Within 30 min
P3 — Minor issue	On-call engineer	Within 2 hours

Post-Incident

After every incident:

Conduct blameless post-mortem
Document root cause and timeline
Update runbooks if procedures were unclear
Implement preventive measures
Test the fix with a simulated failure