DEX Project Architecture & Roadmap

Executive Summary

DEX (DataEngineX) is evolving from a foundational API service into a complete data engineering and ML platform. We are following a phased approach: Foundation → Core Features → Advanced Platform → Future Innovation.

DEX Philosophy

DEX is a unified framework that bridges Data Engineering, Data Warehousing, Machine Learning, AI Agents, MLOps, and DevOps. It focuses on building AI‑ready infrastructure that moves models from notebooks to production.

Portfolio Modules (Roadmap):

  • dex-data (Spark/Flink/Kafka pipelines)
  • dex-warehouse (dbt + lakehouse/warehouse patterns)
  • dex-lakehouse (Iceberg/Delta datasets)
  • dex-ml (MLflow/Kubeflow + model serving)
  • dex-api (FastAPI feature/prediction APIs)
  • dex-ops (Terraform + Kubernetes + GitOps)

See README.md for the full philosophy and roadmap context.

Current State (v0.3.x - Foundation + Hardening)

Infrastructure Baseline (implemented)

  • CI/CD: GitHub Actions — lint (ruff), type-check (mypy), test (pytest), build, push
  • GitOps: ArgoCD with branch-based deployment (dev/prod)
  • Code Quality: Ruff (0 errors), mypy strict (0 errors), 94% test coverage
  • Pre-commit: ruff + mypy + standard hooks
  • Containerization: Multi-stage Docker with non-root user, healthcheck
  • Infrastructure-as-Code: Kustomize overlays for all environments
  • Observability: Structured logging (structlog), Prometheus metrics, OpenTelemetry tracing (see the sketch after this list)
  • Data Framework: Medallion architecture (Bronze/Silver/Gold), data quality validators
  • Security: CodeQL, Trivy scanning, pip-audit, branch protection
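
To make the observability baseline concrete, the sketch below wires structlog and two Prometheus metrics into FastAPI middleware. It is a minimal illustration under assumed names, not the actual dataenginex middleware implementation.

# Minimal observability sketch: structlog + Prometheus in FastAPI middleware.
# Metric and logger names are illustrative, not the real dataenginex module.
import time

import structlog
from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

log = structlog.get_logger()
app = FastAPI()

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["path"])

@app.middleware("http")
async def observe(request: Request, call_next):
    # Time the request, record metrics, and emit one structured log line.
    start = time.perf_counter()
    response = await call_next(request)
    elapsed = time.perf_counter() - start
    REQUESTS.labels(path=request.url.path, status=str(response.status_code)).inc()
    LATENCY.labels(path=request.url.path).observe(elapsed)
    log.info("request_handled", path=request.url.path,
             status=response.status_code, duration_s=round(elapsed, 4))
    return response

@app.get("/metrics")
def metrics() -> PlainTextResponse:
    # Expose metrics in the Prometheus text exposition format.
    return PlainTextResponse(generate_latest(), media_type=CONTENT_TYPE_LATEST)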

Current Architecture

┌─────────────────────────────────────────────────────────────┐
│                     DEX Platform (v0.3.0)                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐     ┌──────────────┐    ┌──────────────┐│
│  │  DataEngineX │     │  CareerDEX   │    │  WeatherDEX  ││
│  │   (API +     │     │  (Job Data   │    │  (Weather    ││
│  │  Framework)  │     │   Platform)  │    │   Pipeline)  ││
│  └──────────────┘     └──────────────┘    └──────────────┘│
│  Observability: Prometheus + OpenTelemetry + structlog      │
│  Quality: Ruff + mypy + pytest (94% cov) + pre-commit      │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│              Kubernetes + ArgoCD (GitOps)                   │
│   Environments: dev (2 pods, dex-dev), prod (3 pods, dex)    │
└─────────────────────────────────────────────────────────────┘

Target Architecture (v1.0.0 - Production Ready)

┌─────────────────────────────────────────────────────────────────────┐
│                     DEX Platform (v1.0.0)                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌────────────────┐  ┌──────────────┐  ┌─────────────────┐        │
│  │  FastAPI       │  │  Data        │  │  ML Model       │        │
│  │  Service       │  │  Pipelines   │  │  Serving        │        │
│  │                │  │              │  │                 │        │
│  │ • Auth (JWT)   │  │ • Ingestion  │  │ • Training      │        │
│  │ • Validation   │  │ • Transform  │  │ • Inference     │        │
│  │ • Logging      │  │ • Quality    │  │ • Monitoring    │        │
│  │ • Metrics      │  │ • Scheduling │  │ • Registry      │        │
│  └────────────────┘  └──────────────┘  └─────────────────┘        │
│         │                   │                    │                  │
│         └───────────────────┴────────────────────┘                  │
│                             │                                       │
├─────────────────────────────┼───────────────────────────────────────┤
│  ┌───────────┐  ┌──────────┐│  ┌────────┐  ┌──────────────┐       │
│  │PostgreSQL │  │  Redis   ││  │ MinIO  │  │   MLflow     │       │
│  │  (OLTP)   │  │ (Cache)  ││  │(Object)│  │ (Experiments)│       │
│  └───────────┘  └──────────┘│  └────────┘  └──────────────┘       │
└─────────────────────────────┼───────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│        Observability Layer (Prometheus, Grafana, Loki)              │
│        GitOps (ArgoCD) + Secret Management (Sealed Secrets)         │
│        Kubernetes Cluster (HPA, Resource Limits, Health Checks)     │
└─────────────────────────────────────────────────────────────────────┘

Roadmap Overview

The detailed roadmap is tracked in GitHub Issues and Milestones; a CSV export at docs/roadmap/project-roadmap.csv serves as the canonical documentation source.

Organization project hub: https://github.com/orgs/TheDataEngineX/projects

Phases (High Level)

  • Phase 1: Foundation (v0.1.0) ✅ — CI/CD, GitOps, multi‑env deployments
  • Phase 2: Production Hardening (v0.2.0) ✅ — observability, health probes, API quality
  • Phase 3: Data Platform (v0.3.0) 🔄 — medallion architecture foundation, incremental data quality/schema implementation
  • Phase 4: ML Platform (v0.4.0) — training, registry, serving, monitoring
  • Phase 5: Advanced Features (v0.5.0) — auth, caching, analytics
  • Phase 6: Production Ready (v1.0.0) — DR, security, performance

For execution details, see GitHub Issues and SDLC.

Modular Monolith Strategy

Current Module Structure

src/
├── dataenginex/          # Core framework (API, middleware, validators, schemas)
│   ├── api/              # FastAPI app, health, errors
│   ├── core/             # Medallion architecture, validators, schemas
│   └── middleware/       # Logging, metrics, tracing, request handling
├── careerdex/            # Job data ingestion platform (phases 1-6)
│   ├── phases/           # Implementation phases
│   ├── dags/             # Airflow DAGs
│   └── models/           # Data models
└── weatherdex/           # Weather ML pipeline (reference implementation)
    ├── core/             # Pipeline core
    ├── ml/               # ML models
    └── notebooks/        # Notebook-based experimentation assets
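
To illustrate the medallion flow the core framework targets, here is a minimal Bronze-to-Silver promotion with a quality gate. The Record model and drop_invalid validator are hypothetical names for illustration only, not the dataenginex.core API.

# Hypothetical Bronze -> Silver promotion with a quality gate.
# Record and drop_invalid are illustrative, not dataenginex.core's API.
from dataclasses import dataclass

@dataclass
class Record:
    station_id: str
    temperature_c: float | None

def drop_invalid(bronze: list[Record]) -> list[Record]:
    """Silver layer: keep only records that pass basic quality checks."""
    return [
        r for r in bronze
        if r.temperature_c is not None and -90.0 <= r.temperature_c <= 60.0
    ]

bronze = [Record("ST01", 21.5), Record("ST02", None), Record("ST03", 1200.0)]
silver = drop_invalid(bronze)   # only ST01 survives the gate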

Service Extraction Criteria

When to Extract a Service:

  1. Independent Scaling: Different resource requirements (e.g., GPU for ML)
  2. Team Ownership: Separate team needs autonomy
  3. Technology Diversity: Different tech stack required
  4. Deployment Frequency: Needs to deploy independently
  5. Fault Isolation: Failures shouldn't cascade

First Extraction Candidate: ML Model Serving

  • GPU scaling independent from API
  • Polyglot support (TensorFlow Serving, TorchServe)
  • High-frequency model updates
  • Separate SLA requirements

Not Extracting Yet:

  • Data pipelines (shared storage, orchestration overhead)
  • API endpoints (low latency requirements)
  • Analytics (tightly coupled to data layer)

Technology Decisions

Core Stack (Confirmed)

  • API: FastAPI + Uvicorn
  • Language: Python 3.11+
  • Package Management: uv (dependencies/env) + Hatchling (build backend)
  • Container: Docker
  • Orchestration: Kubernetes + ArgoCD
  • CI/CD: GitHub Actions

Infrastructure Additions (v0.2.0+)

  • Observability: Prometheus, Grafana, Loki, OpenTelemetry
  • Database: PostgreSQL (OLTP)
  • Cache: Redis (cache-aside sketch after this list)
  • Object Storage: MinIO (default) / S3-compatible adapters
  • Secrets: Sealed Secrets
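
As an example of how the cache layer might be used, the following is a cache-aside sketch against Redis; the host, TTL, key scheme, and the stubbed Postgres lookup are all assumptions, not the platform's actual data-access code.

# Cache-aside sketch for the Redis layer. Host, TTL, and key scheme are
# assumptions; the Postgres lookup is stubbed out for illustration.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_profile_from_postgres(user_id: str) -> dict:
    # Stand-in for the real OLTP query.
    return {"user_id": user_id, "plan": "free"}

def get_profile(user_id: str) -> dict:
    key = f"profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit
    profile = load_profile_from_postgres(user_id)
    r.setex(key, 300, json.dumps(profile))      # cache for 5 minutes
    return profile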

Data & ML Stack (v0.3.0+)

  • Orchestration: Apache Airflow
  • ML Tracking: MLflow (preferred) or Weights & Biases (tracking sketch after this list)
  • BI Tool: Metabase (preferred) or Superset
  • Data Quality: Great Expectations
  • Feature Store: Feast (future)
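
For experiment tracking, a minimal MLflow run could look like the sketch below; the tracking URI, experiment name, and logged values are placeholders rather than actual DEX configuration.

# Minimal MLflow tracking sketch; URI, experiment, and values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow.dex.svc:5000")  # assumed in-cluster URL
mlflow.set_experiment("weatherdex-forecast")

with mlflow.start_run():
    # Log hyperparameters and evaluation metrics for later comparison.
    mlflow.log_param("model", "gradient_boosting")
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("rmse", 1.87)
    mlflow.log_metric("training_seconds", 42.0)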

Development Workflow

1. Planning Phase

TODO.md → GitHub Issue (using template) → Add to Project Board → Assign Milestone

2. Development Phase

Create branch → Develop → Test locally → Commit with #issue → Push

3. Review Phase

Create PR → CI checks → Code review → Merge to main

4. Deployment Phase

CI builds image → CD updates manifests → ArgoCD syncs → Monitor

5. Promotion Flow

dev (auto) → prod (PR promotion via main branch)

Risk Management

High Priority Risks

  1. Complexity Creep: Too many features, slow delivery
     Mitigation: Strict prioritization, MVP mindset

  2. Technical Debt: Fast iteration sacrifices quality
     Mitigation: 20% time for refactoring, code reviews

  3. Infrastructure Costs: Cloud bills spiral
     Mitigation: Resource limits, cost monitoring, right-sizing

  4. Security Gaps: Auth/secrets not implemented early
     Mitigation: Phase 5 prioritizes security hardening

Medium Priority Risks

  1. Data Quality Issues: Bad data in production
     Mitigation: Data quality framework in Phase 3

  2. Model Drift: Models degrade over time
     Mitigation: Monitoring and automated retraining in Phase 4

  3. Scaling Bottlenecks: Performance issues at scale
     Mitigation: Load testing, HPA, caching

Success Metrics

v0.2.0 (Production Hardening)

  • API uptime: >99%
  • P99 latency: <200ms
  • Test coverage: >80%
  • Zero critical security vulnerabilities

v0.3.0 (Data Platform)

  • Pipeline success rate: >95%
  • Data freshness: <1 hour delay (see the gate sketch after this list)
  • Data quality checks: 100% passing
  • Pipeline runtime: <30 minutes
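
The freshness target above could be enforced with a simple gate at the end of each pipeline run; this is a sketch that assumes UTC event timestamps, not the actual quality framework.

# Hypothetical freshness gate for the <1 hour target; names are illustrative.
from datetime import datetime, timedelta, timezone

def assert_fresh(latest_event_ts: datetime,
                 max_lag: timedelta = timedelta(hours=1)) -> None:
    """Fail the pipeline run if the newest record breaches the freshness SLO."""
    lag = datetime.now(timezone.utc) - latest_event_ts
    if lag > max_lag:
        raise RuntimeError(f"Stale data: lag {lag} exceeds budget {max_lag}")

# Example: passes for a 20-minute-old record, raises past the one-hour budget.
assert_fresh(datetime.now(timezone.utc) - timedelta(minutes=20))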

v0.4.0 (ML Platform)

  • Model deployment time: <5 minutes
  • Model accuracy: >baseline
  • Inference latency: <100ms
  • Drift detection: active

v1.0.0 (Production)

  • SLA: 99.9% uptime
  • RTO: <1 hour
  • Cost per request: <$0.001
  • Customer satisfaction: >4/5

Next Actions

  1. Immediate (This Sprint):
     • Database integration (PostgreSQL)
     • Authentication (JWT + API keys); see the sketch after this list
     • Cache layer (Redis)

  2. Short Term (Next 2 Sprints):
     • ML experiment tracking (MLflow)
     • Model serving endpoints
     • Feature store integration

  3. Medium Term (Next Quarter):
     • Complete v0.4.0 ML platform
     • Production hardening for v1.0.0
     • Performance tuning and load testing
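
As a starting point for the authentication item above, a FastAPI dependency that validates JWTs with PyJWT might look like this; the secret handling and claims are deliberately simplified assumptions, not the planned implementation.

# JWT-validation dependency sketch using PyJWT; secret and claims simplified.
import jwt
from fastapi import Depends, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

bearer = HTTPBearer()
SECRET = "replace-with-a-sealed-secret"  # placeholder; never hardcode in prod

def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> str:
    # Reject requests whose bearer token fails signature or expiry checks.
    try:
        claims = jwt.decode(creds.credentials, SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")
    return claims["sub"]  # subject claim identifies the caller

In a real deployment the secret would come from Sealed Secrets (per the infrastructure stack above) rather than a constant in code.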

Last Updated: 2026-02-15
Document Owner: Project Lead
Review Cadence: Bi-weekly