Skip to content

ADR-0001: Medallion Architecture (Bronze/Silver/Gold)

Status: Accepted Date: 2026-02-15 Authors: Data Engineering Team

Context

DEX needs a scalable, standardized data architecture that works across all projects (CareerDEX, Weather, etc.). The data pipeline needs to handle:

  • Raw data ingestion from multiple sources
  • Data transformation and quality validation
  • Feature engineering for ML models
  • Analytics-ready data consumption

Multiple projects will be built on the DEX platform, each with different data sources and transformations. We need a consistent, reusable pattern.

Decision

Implement the Medallion Architecture (Bronze → Silver → Gold layers) as the standard data architecture for all DEX projects.

Architecture Layers

  1. Bronze Layer (Raw Data)
  2. Stores raw data as-is from sources
  3. No transformation applied
  4. Schema validation only
  5. High volume, immutable records
  6. Storage: Parquet (local) + BigQuery (*_bronze datasets)

  7. Silver Layer (Processed & Normalized)

  8. Data parsed and normalized
  9. Quality validated
  10. Deduplication applied
  11. Schema consistent across sources
  12. Sensitive data masked/redacted
  13. Storage: BigQuery (*_silver datasets) + embeddings cache (Redis)

  14. Gold Layer (Aggregated & Analytics)

  15. Business-ready features
  16. Pre-computed aggregations
  17. ML model predictions
  18. Denormalized for performance
  19. Storage: BigQuery (*_gold datasets)

Implementation

For each project (CareerDEX, Weather, etc.):

<project>_bronze   → Raw data from APIs/sources
<project>_silver   → Parsed, validated data
<project>_gold     → Features, aggregations, ML outputs

Local development:

~/data/<project>/bronze/
~/data/<project>/silver/
~/data/<project>/gold/

Considered Alternatives

1. Single Large Table

  • Pros: Simpler schema
  • Cons: Difficult to scale, hard to debug data quality issues

2. Flat File Structure

  • Pros: Works for small projects
  • Cons: No schema governance, difficult to track lineage

3. Data Vault 2.0

  • Pros: Highly normalized, strong audit trail
  • Cons: Over-engineered for current needs, complex

Consequences

Positive

  • Scalability: Easy to add new projects following same pattern
  • Debugging: Issues isolated to specific layer
  • Quality: Explicit data quality checks at Silver layer
  • Reusability: Silver/Gold layers shareable across teams
  • Lineage: Clear data flow and dependencies
  • Performance: Gold layer optimized for queries

Negative

  • Storage: Triple storage for each dataset (3x costs)
  • Complexity: More orchestration needed
  • Late binding: Features computed in Gold, not in source

Neutral

  • Requires discipline to not skip layers
  • Needs monitoring to catch quality issues

Implementation Notes

Phase 1 (v0.2.0)

  • Define schemas for each layer
  • Implement validation framework
  • Create local Parquet directories
  • Set up BigQuery datasets

Phase 2 (CareerDEX v0.3.0)

  • Implement Bronze ingest (Airflow orchestration)
  • Implement Silver transformation (parsing, validation)
  • Populate Gold with features/predictions

Migration Path

If moving from existing flat structure: 1. Map existing tables to layers 2. Create Parquet exports at each layer 3. Validate data transformations 4. Gradually migrate pipelines

Tooling

  • Storage: BigQuery (cloud), Parquet (local)
  • Transformations: dbt (SQL) or Spark (Python)
  • Orchestration: Apache Airflow
  • Quality Validation: Great Expectations + Pydantic
  • Lineage Tracking: dbt lineage + custom metadata
  • ADR-0002: Using dbt for Silver/Gold transformations
  • ADR-0003: Data quality framework
  • ADR-0004: Feature store strategy

References

Monitoring & Governance

Golden Rules: 1. ✅ Never skip a layer 2. ✅ Always validate schema at layer boundaries 3. ✅ Document transformations between layers 4. ✅ Track data lineage 5. ✅ Monitor quality metrics

Metrics to Track: - Row counts at each layer - Quality validation pass rate - Transformation latency - Storage usage by layer

Revision History

Date Author Change
2026-02-15 Data Team Initial acceptance