
CI/CD Pipeline

Complete guide to DataEngineX continuous integration and release automation.

Quick Links: CI Workflow · Release Automation · PyPI Publishing · Troubleshooting · Quick Reference


Overview

DEX is a pure Python library published to PyPI. The pipeline is:

  • CI: Automated testing, linting, and security scanning on every PR
  • Release: Automated tagging and GitHub release creation on version bumps
  • PyPI Publish: Automated publishing triggered by GitHub releases
graph LR
    Dev[Developer] --> PR[Create PR]
    PR --> CI[CI: Lint/Test/Security]
    CI --> Review[Code Review]
    Review --> MergeMain[Merge to main]

    MergeMain --> VersionBump{Version bump?}
    VersionBump -->|Yes| Release[release-please.yml<br/>Create tag + release]
    Release --> PyPI[pypi-publish.yml<br/>Publish to PyPI]

    style CI fill:#e1f5ff
    style Release fill:#f8f5ff
    style PyPI fill:#d4edda

Project Structure

DEX is a single-package repo:

| Component | Location | Purpose | Release |
| --- | --- | --- | --- |
| dataenginex | src/dataenginex/ | Core framework (API, middleware, storage, ML) | PyPI (v{version}) |

Unified Testing

The root pyproject.toml defines the package and test config:

  • name = "dataenginex", version = "<current>" (see pyproject.toml)
  • [tool.hatch.build.targets.wheel] packages = ["src/dataenginex"]
  • Dependency groups: dev (required), data (PySpark/Airflow), notebook (pandas), ml (sentence-transformers), dashboard (streamlit)

CI workflow (ci.yml) runs in a single job (poe lint → poe typecheck → pytest):

  • Single ci job: uv sync --all-extras + poe lint + poe typecheck + pytest --cov
  • concurrency with cancel-in-progress: true: stale runs are cancelled on a new push
  • paths-ignore: skips CI on doc-only changes
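The paths-ignore rule can be pictured with a minimal Python sketch: a workflow is skipped only when every changed file matches an ignore pattern. The patterns below are hypothetical placeholders, not the actual contents of ci.yml, and fnmatch only approximates GitHub's glob syntax.

```python
from fnmatch import fnmatch

# Hypothetical ignore patterns; the real list lives under `paths-ignore` in ci.yml.
IGNORED_PATTERNS = ["docs/*", "*.md"]


def ci_should_run(changed_paths, ignored=IGNORED_PATTERNS):
    """Return True if at least one changed path is not covered by an ignore pattern."""
    return any(
        not any(fnmatch(path, pattern) for pattern in ignored)
        for path in changed_paths
    )
```

Under these assumed patterns, a push that touches only docs/index.md and README.md would skip CI, while a push that also touches src/dataenginex/ would run it.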

Release Automation

  • Release automation: release-please.yml reads conventional commits → creates Release PR (bumps pyproject.toml + CHANGELOG.md + uv.lock) → on merge creates v{version} tag + GitHub Release
  • Post-release: release-dataenginex.yml generates CycloneDX SBOM and attaches it to the release
  • PyPI publishing (pypi-publish.yml): Triggered by GitHub release published → detects changes in src/dataenginex/ since last v* tag → publishes to TestPyPI then PyPI

Continuous Integration (CI)

Workflow: .github/workflows/ci.yml

Triggers:

  • Push to main or dev branches
  • Pull requests targeting main or dev

Jobs:

1. Lint and Test

Runs code quality checks and test suite:

# Linting
uv run poe lint

# Tests with coverage
uv run poe test-cov

Requirements: All checks must pass before merge

2. Security Scans

Runs in parallel via .github/workflows/security.yml:

  • CodeQL: Static analysis for security vulnerabilities
  • Semgrep: OWASP Top 10 and best practice checks

Results: Available in GitHub Security tab

3. Integration Test (Optional)

Optional job for full dependency coverage (PySpark, Airflow, Pandas):

Trigger:

  • Manual: gh workflow run ci.yml
  • Label: Add full-test label to pull request

What it does:

# Installs the dev, data, and notebook dependency groups
uv sync --group dev --group data --group notebook

# Runs full test suite (may take longer)
uv run poe test-cov

Use case: Validate changes to data pipelines, ML models, or when adding new dependencies to data or notebook groups.


Release Automation

DataEngineX Releases

Workflow: .github/workflows/release-dataenginex.yml

Trigger: release: types: [published] (fires when release-please creates a GitHub Release)

What it does:

  1. Generates CycloneDX SBOM for the release
  2. Attaches sbom-dataenginex-{version}.json to the GitHub Release

How to release DataEngineX:

Releases are fully automated via release-please. Push conventional commits to main; release-please creates the Release PR and tag.

# Monitor release-please PR
gh pr list --label "autorelease: pending"

# After merging the Release PR, monitor post-release workflows
gh run list --workflow=pypi-publish.yml --limit 5
gh run list --workflow=release-dataenginex.yml --limit 5
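To illustrate how conventional commits can map to a version bump, here is a rough Python sketch based on the Conventional Commits convention (fix → patch, feat → minor, breaking change → major). The function name next_bump is hypothetical, and release-please's actual behavior depends on its configuration, so treat this as an approximation rather than its implementation.

```python
import re

# Header shape from the Conventional Commits convention: type(scope)!: description
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]+\))?(?P<breaking>!)?: .+")


def next_bump(messages):
    """Return 'major', 'minor', 'patch', or None for a list of commit messages."""
    bump = None
    for message in messages:
        header = message.splitlines()[0] if message else ""
        match = HEADER.match(header)
        if match is None:
            continue  # non-conventional commits do not affect the version here
        if match["breaking"] or "BREAKING CHANGE:" in message:
            return "major"
        if match["type"] == "feat":
            bump = "minor"
        elif match["type"] == "fix" and bump is None:
            bump = "patch"
    return bump
```

In this sketch, commit types such as chore or docs produce no bump on their own, which is why a push containing only those commits would not be expected to open a Release PR.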

PyPI Publishing

Workflow: .github/workflows/pypi-publish.yml

Trigger: GitHub release published (created by release-please.yml)

What it does:

  1. Receives GitHub release event from DataEngineX release
  2. Detects if files under src/dataenginex/ actually changed since previous v{version} tag
  3. If changes are found:
       • Builds wheel distributions
       • Publishes to TestPyPI (dry-run)
       • Promotes to PyPI (stable semver tags only, not pre-release)
  4. If no changes: skips publishing with an informational message
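The change-detection step above could be sketched in Python as follows. This is an illustration under assumptions, not the workflow's actual script: the function names are hypothetical, and the only external call is the standard git diff --name-only between two tags.

```python
import subprocess

PACKAGE_DIR = "src/dataenginex/"


def changed_files(previous_tag, current_tag):
    """List files that differ between two tags, using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", previous_tag, current_tag],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def should_publish(paths):
    """Publish only when at least one changed file lives under the package directory."""
    return any(path.startswith(PACKAGE_DIR) for path in paths)
```

Keeping the decision in a pure function (should_publish) separate from the git call makes the gate easy to check locally without a repository.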

Publish gates:

  • Only publishes if code actually changed (not just version bump in other files)
  • TestPyPI first for dry-run verification
  • PyPI promotion requires stable semver tag: vMAJOR.MINOR.PATCH (not v1.2.3-rc1)
  • Pre-release tags: publish to TestPyPI only
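The tag gate can be expressed as a small Python check. This is a minimal sketch, assuming the tag always starts with v; the function name and the index labels "testpypi" and "pypi" are illustrative, not identifiers taken from pypi-publish.yml.

```python
import re

# Stable tags look like vMAJOR.MINOR.PATCH with nothing after the patch number.
STABLE_TAG = re.compile(r"v\d+\.\d+\.\d+")


def publish_targets(tag):
    """Return the package indexes a tag may be published to, in order."""
    if STABLE_TAG.fullmatch(tag):
        return ["testpypi", "pypi"]
    return ["testpypi"]  # pre-release tags stop at TestPyPI
```

With this rule, v1.2.3 would go to both indexes, while v1.2.3-rc1 would stop at TestPyPI.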

Automatic flow:

conventional commits → main → release-please Release PR → merge → v{version} tag + GitHub Release → pypi-publish.yml → PyPI

Manual trigger (if needed):

gh workflow run pypi-publish.yml -f tag=v<version>

Rollback Procedures

Rollback a PyPI Release

A version published to PyPI cannot be overwritten, and a deleted version number cannot be reused, so roll back as follows:

  1. Yank the release on PyPI (marks it as broken; pip skips a yanked version unless it is pinned exactly):
# Via PyPI web UI: manage release → yank
  2. Publish a patch release with the fix:
# Bump version in pyproject.toml (e.g., 0.6.1)
git commit -m "fix: revert breaking change"
git push origin main

Rollback a Git Tag

# Delete tag locally and remotely
git tag -d v<version>
git push origin :refs/tags/v<version>

# Delete the GitHub release via gh CLI
gh release delete v<version> --yes

Pipeline Metrics

Build Times

  • CI (Lint + Test): ~2 minutes
  • Package validation: ~1 minute
  • PyPI publish: ~2 minutes

Success Rates (Target)

  • CI Pass Rate: >95%
  • Release Success Rate: >99%

Monitoring

# Recent CI runs
gh run list --workflow ci.yml --limit 10

# Recent releases
gh run list --workflow release-dataenginex.yml --limit 10

# Failed builds
gh run list --workflow pypi-publish.yml --status failure
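To compare actual results against the pass-rate targets, the JSON output of gh run list can be summarized with a few lines of Python. This is a sketch under assumptions: the function name pass_rate is hypothetical, and it counts only runs that finished as success or failure, ignoring cancelled and in-progress runs.

```python
import json


def pass_rate(runs_json):
    """Percentage of finished runs whose conclusion is 'success', or None if none finished."""
    runs = json.loads(runs_json)
    finished = [run for run in runs if run.get("conclusion") in ("success", "failure")]
    if not finished:
        return None
    passed = sum(run["conclusion"] == "success" for run in finished)
    return 100.0 * passed / len(finished)
```

The input could come from gh run list --workflow ci.yml --limit 100 --json conclusion, piped or saved to a file and passed in as a string.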

CI/CD Evolution

Current State ✅

  • Automated CI with lint, test, type checks
  • Security scanning (CodeQL, Semgrep)
  • Automated PyPI release on version bump
  • Package validation (wheel + twine check)
  • GitHub Pages documentation deployment

Future Enhancements 🚀

  • E2E smoke tests: Post-release validation (install from PyPI and run examples)
  • SonarCloud integration: Code quality gates
  • Slack notifications: Release status updates
  • Release notes: Auto-generated from commits
  • Canary releases: TestPyPI smoke test before PyPI promotion

Troubleshooting

CI Fails with Lint Errors

# Run lint checks locally
uv run poe lint

# Auto-fix
uv run poe lint-fix

PyPI Publish Not Triggering

  • Verify version bump is in root pyproject.toml (not elsewhere)
  • Confirm push was to main branch (not dev)
  • Check release-please.yml ran and created a GitHub release
  • View workflow logs: gh run list --workflow pypi-publish.yml

Package Build Fails

# Build locally to diagnose
uv build
twine check dist/*

# Verify the installed package reports the expected version
uv run python -c "import dataenginex; print(dataenginex.__version__)"

Best Practices

Development Workflow

  1. Create feature branch from dev
  2. Develop and test locally
  3. Run quality checks before committing: uv run poe lint, uv run poe typecheck, uv run poe test
  4. Create PR targeting dev
  5. Wait for CI to pass
  6. Get code review approval
  7. Merge to dev → integration testing
  8. Create release PR from dev → main
  9. Merge to main → bump version if releasing

Commit Messages

Use conventional commits for clarity:

feat: add new endpoint for data processing
fix: resolve memory leak in pipeline
chore: update dependencies
docs: improve deployment runbook
test: add integration tests for API

PR Guidelines

  • Keep PRs small: <500 lines of code
  • Single purpose: One feature/fix per PR
  • Test coverage: Include tests for new code
  • Documentation: Update docs for API changes



Quick Reference

Workflows Overview

| Workflow | Trigger | Purpose | File |
| --- | --- | --- | --- |
| CI | push main/dev, PRs to main/dev | Lint, test, type-check | ci.yml |
| Security | push main/dev, PRs to main/dev | CodeQL + Semgrep scans | security.yml |
| Release Please | push main | Create/update Release PR with version bump + CHANGELOG | release-please.yml |
| Release DataEngineX | GitHub release published | Generate + attach CycloneDX SBOM | release-dataenginex.yml |
| PyPI Publish | GitHub release published | Detect changes + publish to TestPyPI/PyPI | pypi-publish.yml |

Local Commands

# Local development
uv lock
uv sync
uv run poe test
uv run poe lint

# Local with all dependencies (data + notebook)
uv sync --group data --group notebook
uv run poe test-cov

# Create PR
gh pr create --title "feat: add feature" --body "Description"

# Trigger optional integration tests
gh pr edit <pr-number> --add-label full-test

# Check CI status
gh pr checks <pr-number>

# Monitor CI
gh run list --workflow ci.yml
gh run view <run-id> --log

# Manual PyPI publish
gh workflow run pypi-publish.yml -f tag=v<version>

# Promote to production (dev → main PR)
./scripts/promote.sh