Observability: Metrics, Logging & Tracing¶
Complete guide to monitoring, debugging, and understanding DataEngineX in production.
Quick Links: Metrics · Tracing · Grafana · Local Testing · Troubleshooting
📋 Table of Contents¶
- Overview
- Architecture
- Prometheus Metrics
- OpenTelemetry Tracing
- Grafana Dashboards
- Local Testing
- Health Checks
- Troubleshooting
- Related Documentation
Overview¶
DataEngineX implements comprehensive observability using Prometheus for metrics, OpenTelemetry for distributed tracing, and Structlog for structured logging.
Architecture¶
```mermaid
flowchart TB
    subgraph client["Client"]
        Browser["Browser/API Client"]
    end
    subgraph app["DataEngineX Application"]
        direction TB
        FastAPI["FastAPI App"]
        LogMW["RequestLoggingMiddleware"]
        MetricsMW["PrometheusMetricsMiddleware"]
        Tracing["OpenTelemetry Instrumentation"]
        FastAPI --> LogMW
        LogMW --> MetricsMW
        MetricsMW --> Tracing
    end
    subgraph observability["Observability Stack"]
        direction LR
        Prometheus["Prometheus<br/>(Metrics)"]
        Jaeger["Jaeger<br/>(Traces)"]
        Logs["Structured Logs<br/>(JSON/Console)"]
    end
    Browser -->|HTTP Request| FastAPI
    FastAPI -->|Response + X-Request-ID| Browser
    MetricsMW -.->|/metrics endpoint| Prometheus
    Tracing -.->|OTLP gRPC :4317| Jaeger
    LogMW -.->|stdout/stderr| Logs
    style FastAPI fill:#e1f5ff
    style LogMW fill:#fff3cd
    style MetricsMW fill:#d4edda
    style Tracing fill:#f8d7da
    style Prometheus fill:#d1ecf1
    style Jaeger fill:#d1ecf1
    style Logs fill:#d1ecf1
```
Request Flow¶
```mermaid
sequenceDiagram
    participant Client
    participant FastAPI
    participant LogMW as RequestLoggingMiddleware
    participant MetricsMW as PrometheusMetricsMiddleware
    participant Tracing as OpenTelemetry
    participant App as Application Logic
    Client->>+FastAPI: HTTP GET /
    FastAPI->>+LogMW: Process Request
    Note over LogMW: Generate UUID<br/>Bind context
    LogMW->>LogMW: Log "request_started"
    LogMW->>+MetricsMW: Continue
    Note over MetricsMW: Increment in_flight gauge
    MetricsMW->>+Tracing: Continue
    Note over Tracing: Create span<br/>"GET /"
    Tracing->>+App: Execute endpoint
    App-->>-Tracing: Return response
    Note over Tracing: End span<br/>Export to Jaeger
    Tracing-->>-MetricsMW: Response
    Note over MetricsMW: Record duration histogram<br/>Increment requests counter<br/>Decrement in_flight gauge
    MetricsMW-->>-LogMW: Response
    LogMW->>LogMW: Log "request_completed"
    Note over LogMW: Add X-Request-ID header
    LogMW-->>-FastAPI: Response
    FastAPI-->>-Client: HTTP 200 OK<br/>X-Request-ID: uuid
```
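The request-ID behavior in this flow (generate a UUID per request, attach it as an `X-Request-ID` response header) can be sketched as a minimal pure-ASGI middleware. This is an illustration of the pattern, not the actual DataEngineX implementation; the class name is hypothetical.

```python
import uuid


class RequestIDMiddleware:
    """Sketch of an ASGI middleware that tags every HTTP response
    with a per-request UUID in the X-Request-ID header."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through non-HTTP traffic (websocket, lifespan) untouched.
            await self.app(scope, receive, send)
            return

        request_id = str(uuid.uuid4())

        async def send_with_id(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-request-id", request_id.encode()))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_id)
```

In real code the same UUID would also be bound to the structlog context so the `request_started` / `request_completed` log lines carry it.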
Prometheus Metrics¶
Available Metrics¶
| Metric | Type | Description | Labels |
|---|---|---|---|
| `http_requests_total` | Counter | Total HTTP requests | method, endpoint, status |
| `http_request_duration_seconds` | Histogram | Request duration | method, endpoint |
| `http_requests_in_flight` | Gauge | Requests currently processing | - |
| `http_exceptions_total` | Counter | Total exceptions | exception_type |
Metrics Endpoint¶
Access Prometheus metrics via `GET /metrics` on the application port (e.g. http://localhost:8000/metrics).
Example output:
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/",method="GET",status="200"} 1543.0

# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{endpoint="/",le="0.005",method="GET"} 1234.0
http_request_duration_seconds_bucket{endpoint="/",le="0.01",method="GET"} 1456.0
http_request_duration_seconds_sum{endpoint="/",method="GET"} 143.2
http_request_duration_seconds_count{endpoint="/",method="GET"} 1543.0
```
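For ad-hoc debugging, samples in this exposition format can be pulled apart with a few lines of stdlib Python. This is a rough sketch (it ignores escaping inside label values), not a substitute for a proper Prometheus client:

```python
import re

# One sample line: metric_name{optional="labels"} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {(name, labels_str): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if m:
            key = (m.group("name"), m.group("labels") or "")
            samples[key] = float(m.group("value"))
    return samples
```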
Prometheus Configuration¶
Add this scrape config to prometheus.yml:
```yaml
scrape_configs:
  - job_name: 'dataenginex'
    scrape_interval: 15s
    static_configs:
      - targets: ['dataenginex:8000']
    metrics_path: '/metrics'
```
If you run the API outside Docker, use localhost:8000 as the target.
Grafana Dashboard¶
Key queries to build panels from:

- Request Rate (RPS)
- P95 Latency
- Error Rate
- In-Flight Requests
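Typical PromQL for these panels, based on the metric names documented above (starting points; adjust label names and windows to your deployment):

```
# Request rate (RPS), per endpoint, over 5m
sum(rate(http_requests_total[5m])) by (endpoint)

# P95 latency, over 5m
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate: fraction of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# In-flight requests
http_requests_in_flight
```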
Metrics Collection Flow¶
```mermaid
flowchart TB
    subgraph request["HTTP Request Lifecycle"]
        Start["Request Arrives"]
        Process["Process Request"]
        End["Response Sent"]
    end
    subgraph metrics["Metrics Updates"]
        InFlight["http_requests_in_flight<br/>+1"]
        Duration["http_request_duration_seconds<br/>observe(duration)"]
        Counter["http_requests_total<br/>+1"]
        Exception["http_exceptions_total<br/>+1 (if error)"]
        InFlightDec["http_requests_in_flight<br/>-1"]
    end
    subgraph prometheus["Prometheus"]
        Scrape["Scrape /metrics<br/>every 15s"]
        Store["Time Series DB"]
        Query["PromQL Queries"]
    end
    Start --> InFlight
    InFlight --> Process
    Process -->|Success| Duration
    Process -->|Error| Exception
    Duration --> Counter
    Exception --> Counter
    Counter --> InFlightDec
    InFlightDec --> End
    Counter -.->|Expose| Scrape
    Scrape --> Store
    Store --> Query
    style request fill:#e1f5ff
    style metrics fill:#d4edda
    style prometheus fill:#d1ecf1
```
OpenTelemetry Tracing¶
Configuration¶
Environment Variables¶
```bash
# OTLP Collector endpoint (Jaeger, Tempo, etc.)
# Accepts formats: "localhost:4317" or "http://localhost:4317"
export OTLP_ENDPOINT="http://localhost:4317"

# Enable console trace export for debugging (optional)
export ENABLE_CONSOLE_TRACES="true"
```
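Since `OTLP_ENDPOINT` accepts both a bare `host:port` and a full URL, the value is typically normalized before being handed to the exporter. A minimal sketch of that step (the function name is illustrative; DataEngineX's actual parsing may differ):

```python
def normalize_otlp_endpoint(raw: str, default_scheme: str = "http") -> str:
    """Accept 'host:4317' or 'http://host:4317' and return a full URL."""
    raw = raw.strip()
    if "://" in raw:
        return raw  # already a URL, pass through unchanged
    return f"{default_scheme}://{raw}"
```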
Tracing Flow¶
```mermaid
flowchart LR
    subgraph app["Application"]
        Request["HTTP Request"]
        Instrumentation["FastAPI<br/>Instrumentation"]
        Span["Span Creation"]
        Export["OTLP Exporter"]
    end
    subgraph collectors["Collectors"]
        OTLP["OTLP Endpoint<br/>:4317"]
        Console["Console Export<br/>(Debug)"]
    end
    subgraph backend["Backend"]
        Jaeger["Jaeger UI<br/>:16686"]
    end
    Request --> Instrumentation
    Instrumentation --> Span
    Span -->|Attributes:<br/>method, path,<br/>status, duration| Export
    Export -->|if OTLP_ENDPOINT set| OTLP
    Export -->|if ENABLE_CONSOLE_TRACES| Console
    OTLP --> Jaeger
    style app fill:#e1f5ff
    style collectors fill:#fff3cd
    style backend fill:#d1ecf1
```
Automatic Instrumentation¶
FastAPI is automatically instrumented with OpenTelemetry. Every request creates:

- A span with request details (method, path, status)
- A trace ID for distributed tracing
- Timing information
Manual Instrumentation¶
Add custom spans to your code:
```python
from dataenginex.tracing import get_tracer

tracer = get_tracer(__name__)


def my_function():
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("query.table", "users")
        span.set_attribute("query.limit", 100)
        # ... do database work ...
        span.set_attribute("query.results", 42)
```
Jaeger Setup¶
Run Jaeger with OTLP support:
```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -e SPAN_STORAGE_TYPE=memory \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  jaegertracing/all-in-one:1.60
```
Access Jaeger UI: http://localhost:16686
Example Trace¶
```
Trace ID: 7f8a3b2c1d4e5f6a
│
├─ [150ms] GET /api/users
│   ├─ [80ms] database_query (users table)
│   ├─ [30ms] redis_cache_check
│   └─ [40ms] response_serialization
│
Total: 150ms
```
Grafana Dashboards¶
Prebuilt dashboards are available in infra/grafana:
- DEX Metrics: request rate, latency, error rate, in-flight.
- DEX Logs: log volume, error spikes, recent logs, and request IDs (Loki).
- DEX Traces: trace list, top endpoints, and span latency (Tempo/TraceQL).
Import Steps¶
- Open Grafana → Dashboards → New → Import.
- Upload the JSON from infra/grafana/dashboards.
- Select Prometheus/Loki/Tempo data sources when prompted.
Notes¶
Dashboards assume default labels (e.g., app=dataenginex). If your labels differ, edit the dashboard variables and panel queries.
Local Testing¶
Quick Start¶
1. Run the application:

   ```bash
   uv lock
   uv sync
   uv run poe api
   ```

2. Test the metrics endpoint:

   ```bash
   curl http://localhost:8000/metrics
   ```

   Expected output:

   ```
   # HELP http_requests_total Total HTTP requests
   # TYPE http_requests_total counter
   http_requests_total{endpoint="/",method="GET",status="200"} 1.0
   ...
   ```

3. Generate traffic to see metrics:

   ```bash
   # Make some requests
   curl http://localhost:8000/
   curl http://localhost:8000/health
   curl http://localhost:8000/ready

   # Check updated metrics
   curl http://localhost:8000/metrics
   ```
Local Observability Stack (Prometheus + Grafana + Jaeger)¶
Use the bundled compose file for one-command local testing:
Then open:

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin / admin)
- Jaeger: http://localhost:16686
Test with Jaeger (Optional)¶
1. Start Jaeger locally (see Jaeger Setup above).

2. Run the application with tracing enabled:

   ```bash
   export OTLP_ENDPOINT="http://localhost:4317"
   uv run poe api
   ```

3. Generate traces by making a few requests.

4. View traces in the Jaeger UI:
   - Open: http://localhost:16686
   - Select service: "dataenginex"
   - Click "Find Traces"
Test Console Traces (No Jaeger Required)¶
Enable console trace export (set `ENABLE_CONSOLE_TRACES=true`) to see traces in the logs without running Jaeger. Make a request and check the console output:
```json
{
  "name": "GET /",
  "context": {
    "trace_id": "0x7f8a3b2c1d4e5f6a...",
    "span_id": "0x1234567890abcdef"
  },
  "kind": "SpanKind.SERVER",
  "parent_id": null,
  "start_time": "2026-02-05T10:30:00.000000Z",
  "end_time": "2026-02-05T10:30:00.150000Z",
  "status": {
    "status_code": "UNSET"
  },
  "attributes": {
    "http.method": "GET",
    "http.route": "/",
    "http.status_code": 200
  }
}
```
Run Tests¶
Verify metrics and tracing implementation:
Expected:

- 23 tests passing
- Coverage > 90%
Integration Examples¶
Docker Compose¶
```yaml
version: '3.8'
services:
  dataenginex:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LOG_LEVEL=INFO
      - OTLP_ENDPOINT=http://jaeger:4317
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./infra/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
  jaeger:
    image: jaegertracing/all-in-one:1.60
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=memory
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
      - "14250:14250"  # gRPC
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
```
Kubernetes¶
```yaml
apiVersion: v1
kind: Service
metadata:
  name: dataenginex
  labels:
    app: dataenginex
spec:
  ports:
    - port: 8000
      name: http
  selector:
    app: dataenginex
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dataenginex
spec:
  selector:
    matchLabels:
      app: dataenginex
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
Alerts¶
Prometheus alert rules (SLO-aligned)¶
The actual rule definitions live in infra/monitoring/alerts/dataenginex-alerts.yml. They expose three alerts—latency, error rate, and saturation—each scoped by environment so the thresholds can reflect the traffic patterns for dev, stage, and prod. Every alert annotation links to the deployment runbook.
| Alert | Environment | Threshold | Severity | Receiver |
|---|---|---|---|---|
| `DataEngineXLatencyHigh` | prod | P95 latency > 0.75s (5m window) | page | critical slack (#dex-alerts) |
| | stage | P95 latency > 1.0s | page | same |
| | dev | P95 latency > 1.5s | warning | email (ops alias) |
| `DataEngineXErrorRateHigh` | prod | 5xx fraction > 1% | page | critical slack |
| | stage | 5xx fraction > 3% | page | critical slack |
| | dev | 5xx fraction > 5% | warning | |
| `DataEngineXSaturationHigh` | prod | In-flight > 25 | page | critical slack |
| | stage | In-flight > 15 | warning | |
| | dev | In-flight > 10 | warning | |
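As a reference, the prod latency row of this table could be expressed as a Prometheus rule roughly like the following. The authoritative definitions live in infra/monitoring/alerts/dataenginex-alerts.yml; the group name, label layout, and `environment` label here are assumptions for illustration:

```yaml
groups:
  - name: dataenginex-slo
    rules:
      - alert: DataEngineXLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{environment="prod"}[5m])) by (le)
          ) > 0.75
        for: 5m
        labels:
          severity: page
          environment: prod
        annotations:
          summary: "P95 latency above 0.75s for 5 minutes"
          runbook: docs/DEPLOY_RUNBOOK.md
```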
The production alertmanager configuration in infra/monitoring/alertmanager.yml routes all severity=page alerts to a Slack webhook (#dex-alerts) while severity=warning alerts go to the ops email alias. Alerts sharing the same alertname and environment are deduplicated via inhibit rules so warnings do not trigger when a page is active.
Reloading Alert Rules¶
Whenever the alert rules or Alertmanager config changes, reapply the manifests so Prometheus and Alertmanager scrape the latest thresholds. Reload the rules in the following order to avoid gaps:
1. Reapply the Prometheus rule set managed in GitOps.
2. Reconfigure Alertmanager so receivers and runbooks stay up to date.
3. Verify the alerts appear in the Alertmanager UI and reference the release runbook described in docs/DEPLOY_RUNBOOK.md.
If you manage the stack via ArgoCD, push the changes to the kustomize overlay and let ArgoCD sync the deployments automatically rather than running the commands above manually.
Proof-of-concept API endpoints¶
These endpoints surface the modules under src/pyconcepts so that you can explore how the application mixes external data and streaming insights.
- External data — `GET /api/external-data?currency=USD`
  - Uses `pyconcepts.external_data.fetch_external_data` to call the Coindesk API and return the latest BTC rate for the requested currency.
  - Returns a JSON object with `symbol`, `currency`, `value`, `timestamp`, and `source` fields.
  - Example: `curl "http://localhost:8000/api/external-data?currency=eur"`
- Streaming insights — `GET /api/insights?count=5&interval=0.75`
  - Streams conservative synthetic metrics via Server-Sent Events (`text/event-stream`).
  - Honors `count` (1–20) and `interval` seconds per event to throttle data for demos.
  - Consume it with `curl -N http://localhost:8000/api/insights` and parse each `data: {...}` chunk as JSON.
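The `data: {...}` chunks can also be consumed from Python. A stdlib-only sketch of the SSE line parsing (illustrative; not the project's client, and it handles only single-line `data:` events):

```python
import json


def parse_sse_events(lines):
    """Yield parsed JSON payloads from SSE 'data: {...}' lines.

    Blank lines (event separators) and comment lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                yield json.loads(payload)
```

In practice you would feed it the response body line by line, e.g. from `urllib.request.urlopen(...)`.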
Best Practices¶
Metrics¶
- Use labels wisely: Keep cardinality low (< 1000 unique label combinations)
- Histogram buckets: Adjust buckets based on your latency distribution
- Counter vs Gauge: Use counters for totals, gauges for current values
- Naming: Follow Prometheus naming conventions (`_total`, `_seconds`)
Tracing¶
- Sampling: In high-traffic scenarios, sample traces (e.g., 1% of requests)
- Span attributes: Add relevant context but avoid PII
- Error tracking: Mark spans with errors using `span.set_status()`
- Span naming: Use consistent, hierarchical names
Performance¶
- Metrics collection: ~0.1ms overhead per request
- Tracing overhead: ~1-2ms per request (depends on sampling)
- `/metrics` endpoint: Excluded from metrics to avoid recursion
Health Checks¶
DEX provides three health endpoints for Kubernetes probes:
- /health (liveness) — returns component status and overall health
- /ready (readiness) — returns readiness based on dependency checks
- /startup (startup) — indicates app initialization completion
Configure Dependency Checks¶
Set these environment variables to enable component checks:
export DEX_DB_HOST="db"
export DEX_DB_PORT="5432"
export DEX_CACHE_HOST="redis"
export DEX_CACHE_PORT="6379"
export DEX_EXTERNAL_API_URL="https://api.example.com/health"
If a component is not configured, it is reported as `skipped` in the response.
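Conceptually, each configured component maps to a small connectivity probe. A stdlib sketch of the pattern (the function name is illustrative, not the DataEngineX API; real checks would likely be async and component-specific):

```python
import os
import socket


def check_tcp_component(host_var: str, port_var: str, timeout: float = 1.0) -> str:
    """Return 'ok', 'fail', or 'skipped' for a host/port pair read from env.

    Unset variables mean the component is not configured -> 'skipped'.
    """
    host = os.environ.get(host_var)
    port = os.environ.get(port_var)
    if not host or not port:
        return "skipped"
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return "ok"
    except OSError:
        return "fail"


# Example: status = check_tcp_component("DEX_DB_HOST", "DEX_DB_PORT")
```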
Troubleshooting¶
Metrics not appearing¶
- Check that the `/metrics` endpoint is accessible
- Verify Prometheus can scrape the endpoint
- Check Prometheus targets: http://localhost:9090/targets
Service not appearing in Jaeger UI¶
Troubleshooting steps:
1. Verify Jaeger is running.

2. Check that `OTLP_ENDPOINT` is set. It should output: `http://localhost:4317`

3. Generate traffic (Jaeger only shows services that have sent traces):

   ```bash
   # Make several requests
   curl http://localhost:8000/
   curl http://localhost:8000/health
   curl http://localhost:8000/ready
   ```

4. Verify the service name by checking the application logs on startup.

5. Check that Jaeger received traces:
   - Open the Jaeger UI: http://localhost:16686
   - Click the "Service" dropdown
   - Look for "dataenginex" (matches the `APP_NAME` env var)
   - If not visible, select the "Last Hour" time range and click "Find Traces"

6. Check for export errors in the application output.

7. Enable console traces for debugging (`ENABLE_CONSOLE_TRACES=true`).

8. Verify port 4317 is accessible.
High memory usage¶
- Reduce trace sampling rate
- Check for metric label cardinality explosion
- Monitor the `/metrics` response size
Testing¶
Run observability tests:
Related Documentation¶
Deployment & Operations:

- CI/CD Pipeline: Automated deployments
- Deployment Runbook: Deploy procedures
- Local K8s Setup: Kubernetes setup

Development:

- SDLC: Development workflow
- Contributing: Contribution guide