DataEngineX Architecture¶
Overview¶
DataEngineX is a unified Data + ML + AI library that wires industry tools through a
single config-driven interface. One dex.yaml defines the entire project.
Design principle: Pure Python library — no HTTP server bundled. Your application (DEX Studio,
your own FastAPI/Flask app, a script) imports dataenginex and owns the server layer.
Architecture¶
dex.yaml
│
▼
┌─────────────────────────────────────────────────────────┐
│ Config System │
│ YAML → env var resolution → Pydantic validation │
│ Layering: base + overlay (dex.prod.yaml) │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ DexEngine │
│ Single entry point — loads config, inits backends │
│ Exposes: run_pipeline, model_registry, agents, store │
└──────┬──────────────┬──────────────┬────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ Data Layer │ │ ML Layer │ │ AI Layer │
│ │ │ │ │ │
│ Connectors │ │ Tracker │ │ LLM Provider │
│ Transforms │ │ Training │ │ Retriever │
│ Quality │ │ Serving │ │ Vector Store │
│ Orchestrator │ │ Drift │ │ Agent Runtime│
│ Feature Store│ │ Metrics │ │ Memory │
└──────┬───────┘ └────┬─────┘ └──────┬───────┘
│ │ │
└──────────────┼──────────────┘
▼
┌─────────────────────────┐
│ DexStore │
│ DuckDB — .dex/store. │
│ duckdb (project-local) │
│ pipeline_runs · lineage│
│ model_artifacts · etc. │
└─────────────────────────┘
Core Patterns¶
Backend Registry¶
Every subsystem follows the same pattern:
- ABC in
core/interfaces.py— defines the contract (e.g.BaseConnector) - BackendRegistry[T] in
core/registry.py— discovers and registers implementations - Built-in implements the ABC with zero external deps
- Extras implement the same ABC, swapped in via config
from dataenginex.core.registry import BackendRegistry
from dataenginex.core.interfaces import BaseConnector
connector_registry: BackendRegistry[BaseConnector] = BackendRegistry("connector")
@connector_registry.decorator("csv")
class CsvConnector(BaseConnector):
...
DexEngine — Application Entry Point¶
DexEngine is the single object applications instantiate. It:
- Loads and validates
dex.yaml - Initialises
DexStore(creates.dex/store.duckdbnext to the config file) - Registers data sources, pipelines, ML trackers, AI providers, agents
- Exposes domain methods:
run_pipeline,source_schema,warehouse_layers, etc.
from dataenginex.engine import DexEngine
engine = DexEngine("dex.yaml")
engine.run_pipeline("clean_users")
DexStore — Persistence¶
Single DuckDB file at .dex/store.duckdb (project-local, next to dex.yaml).
Tables: pipeline_runs, lineage_events, model_artifacts, quality_runs,
audit_log, ai_memory, ai_episodes, catalog_entries.
Config System¶
- Single
dex.yaml→ Pydantic validation → typedDexConfig - Env var interpolation:
${VAR:-default} - Overlay layering:
dex.yaml+dex.prod.yaml - Cross-reference validation (pipeline sources, dependencies)
- Only
project.nameis required; everything else has defaults
Exception Hierarchy¶
DataEngineXError
├── ConfigError → ConfigValidationError
├── PipelineError → PipelineStepError
├── RegistryError
├── BackendNotInstalledError
├── TrainingError
├── ServingError
└── AgentError → LLMProviderError
Module Map¶
| Module | Purpose |
|---|---|
engine.py |
DexEngine — application entry point |
store.py |
DexStore — DuckDB persistence layer |
config/ |
Schema, loader, env resolution |
core/ |
ABCs, registry, exceptions |
cli/ |
dex CLI (validate, version, init) |
api/ |
HTTP helpers: error types, response models |
data/ |
Connectors, pipeline runner, quality, profiler |
ml/ |
Classical ML: training, registry, serving, drift |
ai/ |
LLM, agents, RAG, vectorstore, memory, observability |
orchestration/ |
DriftScheduler, background tasks |
middleware/ |
structlog config, Prometheus metrics |
lakehouse/ |
Storage backends, catalog, partitioning |
warehouse/ |
SQL transforms, lineage |
plugins/ |
Entry-point discovery |
Tech Stack¶
| Component | Built-in | Extra |
|---|---|---|
| Data Engine | DuckDB | PySpark ([spark]) |
| Orchestration | croniter scheduler | — |
| ML Tracking | JSON-based | MLflow ([tracking]) |
| Model Serving | Built-in predictor | — |
| LLM Provider | Ollama / vLLM | LiteLLM (install separately) |
| Vector Store | DuckDB VSS | Qdrant |
| Retrieval | BM25 + Dense + Hybrid | — |
| Persistence | DuckDB | — |
| Logging | structlog | — |
| Config | Pydantic + YAML | — |
| CLI | Click | — |
| LLM Observability | — | Langfuse ([observability]) |
| Cloud Storage | — | S3/GCS/BigQuery ([cloud]) |
Key Design Decisions¶
| ID | Decision | Rationale |
|---|---|---|
| AD1 | Pure library — no bundled HTTP server | Applications own the server layer; library stays lean |
| AD2 | DexEngine as single entry point | One object to instantiate; hides wiring complexity |
| AD3 | DuckDB for persistence | Embedded, zero-ops, single file next to dex.yaml |
| AD4 | structlog only | One logging standard across the entire codebase |
| AD5 | LiteLLM install separately | It pins python-dotenv==1.0.1 which conflicts |
| AD6 | Embeddings require explicit opt-in | sentence-transformers + ONNX are 500 MB+; never auto-download |
| AD7 | Project isolation via separate DuckDB files | Each project's .dex/store.duckdb is self-contained |
| AD8 | Python 3.13+ | Full type parameter syntax, improved error messages |
| AD9 | ai/ for LLM/agents, ml/ for classical ML |
Clear domain separation |
Ecosystem¶
TheDataEngineX/
├── dataenginex — Core library (PyPI: dataenginex)
├── dex-studio — Web UI (FastAPI + Jinja2) — single pane of glass
└── infradex — Terraform + Helm + K3s deployment
- Container images:
ghcr.io/thedataenginex/dex - Docs:
docs.thedataenginex.org