Disaster Recovery Plan

Backup strategy, recovery procedures, and RTO/RPO targets for the DataEngineX platform.

RTO / RPO Targets

| Tier | Component          | RPO      | RTO     | Strategy                            |
|------|--------------------|----------|---------|-------------------------------------|
| 1    | PostgreSQL         | 1 hour   | 30 min  | Scheduled pg_dump + WAL archiving   |
| 1    | Application state  | 0        | 15 min  | Stateless; redeploy from Git        |
| 2    | Qdrant vectors     | 24 hours | 2 hours | Daily snapshot + restore            |
| 2    | MinIO objects      | 24 hours | 1 hour  | Daily sync to S3/backup bucket      |
| 3    | Redis cache        | N/A      | 5 min   | Ephemeral; auto-rebuilds on start   |
| 3    | Prometheus metrics | 24 hours | 30 min  | Thanos/remote-write (if configured) |
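
The RPO column can be checked mechanically by asking whether the newest backup file is younger than the target. The following is a minimal sketch, not part of the platform's tooling: `has_fresh_backup` is a hypothetical helper, and the `dex_*` file name pattern is an assumption based on the backup path used in this document.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a backup newer than the RPO window exists.
# Assumption: backup files in the directory are named dex_*.

# has_fresh_backup DIR MAX_AGE_MINUTES
# Returns 0 if at least one dex_* file in DIR was modified less than
# MAX_AGE_MINUTES minutes ago, 1 otherwise.
has_fresh_backup() {
  local dir="$1" max_age_min="$2" hit
  hit=$(find "$dir" -maxdepth 1 -type f -name 'dex_*' -mmin "-$max_age_min" | head -n 1)
  [ -n "$hit" ]
}
```

For example, `has_fresh_backup /opt/infradex/backups 1440 || echo "backup older than 24 hours"` would flag a missed daily dump. The 1-hour PostgreSQL target depends on WAL archiving, which this file-age check does not cover.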

Backup Strategy

Automated Backups

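# Daily PostgreSQL backup (add to cron; % must be escaped as \% inside a crontab entry)
# The output is gzip-compressed and timestamped to match the dex_YYYYMMDD_HHMMSS.sql.gz
# files that the recovery procedures restore from.
# Note: the daily dump alone gives a 24-hour RPO; the 1-hour target depends on WAL archiving.
0 2 * * * kubectl exec -n dex deploy/postgres -- pg_dump -U dex dex | gzip > /opt/infradex/backups/dex_$(date +\%Y\%m\%d_\%H\%M\%S).sql.gz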

# Daily Qdrant snapshot (the 24-hour RPO target requires a daily, not weekly, schedule)
0 3 * * * curl -X POST http://qdrant:6333/collections/dex_vectors/snapshots

# Daily MinIO sync (if S3 backup target configured)
0 4 * * * mc mirror --overwrite minio/dex s3/dex-backup/minio/
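
A cron entry that redirects pg_dump straight into the final file leaves a truncated or empty file behind if the dump fails, and that file then looks like a valid backup. One way to avoid this is to write to a temporary file and rename it only on success. The sketch below assumes a hypothetical wrapper named `backup_to`; it is not an existing script in the repository.

```shell
#!/usr/bin/env bash
# Sketch: only publish a backup file if the dump command succeeded.

# backup_to OUT_FILE CMD [ARGS...]
# Runs CMD, writes its stdout to a temporary file next to OUT_FILE,
# compresses it, and renames it to OUT_FILE. On any failure the
# temporary files are removed and OUT_FILE is not created.
backup_to() {
  local out="$1"
  shift
  local raw="${out%.gz}.partial"
  if "$@" > "$raw" && gzip -f "$raw" && mv "$raw.gz" "$out"; then
    return 0
  fi
  rm -f "$raw" "$raw.gz"
  return 1
}
```

Called from a script (where `%` needs no escaping), this might look like `backup_to "/opt/infradex/backups/dex_$(date +%Y%m%d_%H%M%S).sql.gz" kubectl exec -n dex deploy/postgres -- pg_dump -U dex dex`. The trade-off is that the uncompressed dump briefly occupies disk space before it is compressed.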

Backup Verification

Run monthly restore drills:

  1. Restore PostgreSQL backup to a test database
  2. Verify row counts match production
  3. Run application health checks against test database
  4. Document results and any issues
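
Step 2 of the drill can be scripted roughly as follows. This is a sketch under stated assumptions: `row_counts` and `compare_counts` are hypothetical helpers, the connection flags are copied from the restore commands in this document, only tables in the `public` schema are counted, and `dex_restore_test` is a placeholder name for the test database.

```shell
#!/usr/bin/env bash
# Sketch: compare per-table row counts between two databases.

# row_counts DBNAME
# Prints one "table<TAB>count" line per table in the public schema, sorted by name.
row_counts() {
  local db="$1" t
  psql -h localhost -U postgres -d "$db" -At \
    -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY 1" |
  while IFS= read -r t; do
    # </dev/null keeps psql from consuming the table list on stdin
    printf '%s\t%s\n' "$t" \
      "$(psql -h localhost -U postgres -d "$db" -At -c "SELECT count(*) FROM \"$t\"" </dev/null)"
  done
}

# compare_counts FILE_A FILE_B
# Prints the differing lines and returns non-zero if the counts differ.
compare_counts() {
  diff "$1" "$2"
}
```

A drill run might be `row_counts dex > prod.tsv; row_counts dex_restore_test > test.tsv; compare_counts prod.tsv test.tsv`. Because production keeps changing after the backup is taken, small differences in busy tables are expected; the comparison is most meaningful against counts recorded at backup time.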

Recovery Procedures

Scenario 1: VPS Total Loss

Time estimate: 30-60 minutes

  1. Provision new VPS
     hcloud server create --name dex-server-recovery --type cx31 --image ubuntu-22.04
  2. Update inventory with new IP
     vim ansible/inventory/hosts.yml
  3. Run full deployment
     ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/install-k3s.yml
     ./scripts/one-click-deploy.sh
  4. Restore database
     gunzip -c /path/to/backup/dex_YYYYMMDD_HHMMSS.sql.gz | \
       psql -h localhost -U postgres -d dex
  5. Update DNS to new IP

Scenario 2: Database Corruption

Time estimate: 15-30 minutes

  1. Stop application pods
     kubectl scale deployment --replicas=0 -l app=dataenginex -n dex
  2. Restore from latest backup
     # Drop and recreate database
     psql -h localhost -U postgres -c "DROP DATABASE IF EXISTS dex;"
     psql -h localhost -U postgres -c "CREATE DATABASE dex;"

     # Restore
     gunzip -c /path/to/latest/backup.sql.gz | psql -h localhost -U postgres -d dex
  3. Restart application pods
     kubectl scale deployment --replicas=1 -l app=dataenginex -n dex
  4. Verify data integrity (for example, check row counts and run the application health checks)
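
Because the restore step drops the database before loading the backup, it is worth confirming first which archive is the latest and that it is readable. The sketch below assumes the `dex_YYYYMMDD_HHMMSS.sql.gz` naming shown in Scenario 1, so that the newest file sorts last by name; `latest_backup` and `verify_archive` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch: pick the newest backup archive and test it before a destructive restore.

# latest_backup DIR
# Prints the dex_*.sql.gz file with the highest timestamp in its name.
# Returns non-zero if DIR contains no matching file.
latest_backup() {
  local dir="$1" f latest=""
  for f in "$dir"/dex_*.sql.gz; do
    [ -e "$f" ] || continue
    latest="$f"   # glob results are sorted, so the last match is the newest
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# verify_archive FILE
# Returns 0 if FILE is an intact gzip archive (does not validate the SQL inside).
verify_archive() {
  gzip -t "$1"
}
```

A cautious restore could start with `f=$(latest_backup /opt/infradex/backups) && verify_archive "$f"` and proceed to the DROP DATABASE step only if both succeed.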

Scenario 3: Kubernetes Cluster Failure

Time estimate: 30 minutes

  1. Reinstall K3s
     ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/install-k3s.yml
  2. Redeploy all Helm charts
     helm install dataenginex helm/charts/dataenginex -f helm/values/values-vps.yaml
     helm install dex-studio helm/charts/dex-studio -f helm/values/values-vps.yaml
  3. Restore persistent data (PostgreSQL, Qdrant)
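
For the Qdrant part of step 3, a snapshot can be loaded back through Qdrant's snapshot recovery endpoint. The following is a sketch only: the helper names are hypothetical, the snapshot location is a placeholder you must supply, and it assumes the snapshot file survived the cluster failure or was copied somewhere the Qdrant pod can read it. The PostgreSQL restore follows the commands in Scenario 2.

```shell
#!/usr/bin/env bash
# Sketch: ask Qdrant to recover a collection from a snapshot.

# qdrant_recover_payload LOCATION
# Prints the JSON request body. LOCATION is a file:// or http(s):// URI
# and must not contain double quotes (no JSON escaping is done here).
qdrant_recover_payload() {
  printf '{"location": "%s"}\n' "$1"
}

# qdrant_recover BASE_URL COLLECTION LOCATION
# Sends the recovery request, e.g.
#   qdrant_recover http://qdrant:6333 dex_vectors "file:///path/to/<snapshot-file>"
qdrant_recover() {
  local base_url="$1" collection="$2" location="$3"
  curl -sS -X PUT "$base_url/collections/$collection/snapshots/recover" \
    -H 'Content-Type: application/json' \
    -d "$(qdrant_recover_payload "$location")"
}
```

After recovery, comparing the collection's point count with the expected figure is a quick sanity check.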

Scenario 4: Secret Compromise

Time estimate: 10 minutes

  1. Rotate all secrets immediately
     ./scripts/rotate-secrets.sh dex
  2. Revoke any external API tokens (Cloudflare, container registry)
  3. Audit access logs for unauthorized activity
  4. Update monitoring alerts for anomalous patterns

Communication Plan

| Severity        | Notify           | Timeframe      |
|-----------------|------------------|----------------|
| P1: Full outage | All stakeholders | Immediate      |
| P2: Degraded    | Engineering team | Within 30 min  |
| P3: Minor issue | On-call engineer | Within 2 hours |

Post-Incident

After every incident:

  1. Conduct blameless post-mortem
  2. Document root cause and timeline
  3. Update runbooks if procedures were unclear
  4. Implement preventive measures
  5. Test the fix with a simulated failure