Observability: Metrics, Logging & Tracing¶
Complete guide to monitoring, debugging, and understanding DataEngineX in production.
Quick Links: Metrics · Tracing · Grafana · Local Testing · Troubleshooting
📋 Table of Contents¶
- Overview
- Architecture
- Prometheus Metrics
- OpenTelemetry Tracing
- Grafana Dashboards
- Local Testing
- Health Checks
- Troubleshooting
- Related Documentation
Overview¶
DataEngineX implements comprehensive observability using Prometheus for metrics, OpenTelemetry for distributed tracing, and Structlog for structured logging.
Architecture¶
```mermaid
flowchart TB
    subgraph client["Client"]
        Browser["Browser/API Client"]
    end
    subgraph app["DataEngineX Application"]
        direction TB
        FastAPI["FastAPI App"]
        LogMW["RequestLoggingMiddleware"]
        MetricsMW["PrometheusMetricsMiddleware"]
        Tracing["OpenTelemetry Instrumentation"]
        FastAPI --> LogMW
        LogMW --> MetricsMW
        MetricsMW --> Tracing
    end
    subgraph observability["Observability Stack"]
        direction LR
        Prometheus["Prometheus<br/>(Metrics)"]
        Jaeger["Jaeger<br/>(Traces)"]
        Logs["Structured Logs<br/>(JSON/Console)"]
    end
    Browser -->|HTTP Request| FastAPI
    FastAPI -->|Response + X-Request-ID| Browser
    MetricsMW -.->|/metrics endpoint| Prometheus
    Tracing -.->|OTLP gRPC :4317| Jaeger
    LogMW -.->|stdout/stderr| Logs
    style FastAPI fill:#e1f5ff
    style LogMW fill:#fff3cd
    style MetricsMW fill:#d4edda
    style Tracing fill:#f8d7da
    style Prometheus fill:#d1ecf1
    style Jaeger fill:#d1ecf1
    style Logs fill:#d1ecf1
```
Request Flow¶
```mermaid
sequenceDiagram
    participant Client
    participant FastAPI
    participant LogMW as RequestLoggingMiddleware
    participant MetricsMW as PrometheusMetricsMiddleware
    participant Tracing as OpenTelemetry
    participant App as Application Logic
    Client->>+FastAPI: HTTP GET /
    FastAPI->>+LogMW: Process Request
    Note over LogMW: Generate UUID<br/>Bind context
    LogMW->>LogMW: Log "request_started"
    LogMW->>+MetricsMW: Continue
    Note over MetricsMW: Increment in_flight gauge
    MetricsMW->>+Tracing: Continue
    Note over Tracing: Create span<br/>"GET /"
    Tracing->>+App: Execute endpoint
    App-->>-Tracing: Return response
    Note over Tracing: End span<br/>Export to Jaeger
    Tracing-->>-MetricsMW: Response
    Note over MetricsMW: Record duration histogram<br/>Increment requests counter<br/>Decrement in_flight gauge
    MetricsMW-->>-LogMW: Response
    LogMW->>LogMW: Log "request_completed"
    Note over LogMW: Add X-Request-ID header
    LogMW-->>-FastAPI: Response
    FastAPI-->>-Client: HTTP 200 OK<br/>X-Request-ID: uuid
```
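The request-ID behavior in this flow (generate a UUID per request, attach it as an `X-Request-ID` response header) can be sketched as a minimal pure-ASGI middleware. This is an illustration of the pattern, not the actual DataEngineX implementation; the class name is hypothetical.

```python
import uuid


class RequestIDMiddleware:
    """Sketch of an ASGI middleware that tags every HTTP response
    with a per-request UUID in the X-Request-ID header."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through non-HTTP traffic (websocket, lifespan) untouched.
            await self.app(scope, receive, send)
            return

        request_id = str(uuid.uuid4())

        async def send_with_id(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-request-id", request_id.encode()))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_id)
```

In real code the same UUID would also be bound to the structlog context so the `request_started` / `request_completed` log lines carry it.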
Prometheus Metrics¶
Available Metrics¶
| Metric | Type | Description | Labels |
|---|---|---|---|
| `http_requests_total` | Counter | Total HTTP requests | method, endpoint, status |
| `http_request_duration_seconds` | Histogram | Request duration | method, endpoint |
| `http_requests_in_flight` | Gauge | Requests currently processing | - |
| `http_exceptions_total` | Counter | Total exceptions | exception_type |
Metrics Endpoint¶
Access Prometheus metrics via `GET /metrics` on the application port (e.g. http://localhost:8000/metrics).
Example output:
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/",method="GET",status="200"} 1543.0

# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{endpoint="/",le="0.005",method="GET"} 1234.0
http_request_duration_seconds_bucket{endpoint="/",le="0.01",method="GET"} 1456.0
http_request_duration_seconds_sum{endpoint="/",method="GET"} 143.2
http_request_duration_seconds_count{endpoint="/",method="GET"} 1543.0
```
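For ad-hoc debugging, samples in this exposition format can be pulled apart with a few lines of stdlib Python. This is a rough sketch (it ignores escaping inside label values), not a substitute for a proper Prometheus client:

```python
import re

# One sample line: metric_name{optional="labels"} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {(name, labels_str): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if m:
            key = (m.group("name"), m.group("labels") or "")
            samples[key] = float(m.group("value"))
    return samples
```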
Prometheus Configuration¶
Add this scrape config to prometheus.yml:
```yaml
scrape_configs:
  - job_name: 'dataenginex'
    scrape_interval: 15s
    static_configs:
      - targets: ['dataenginex:8000']
    metrics_path: '/metrics'
```
If you run the API outside Docker, use localhost:8000 as the target.
Grafana Dashboard¶
Key queries to build panels from:

- Request Rate (RPS)
- P95 Latency
- Error Rate
- In-Flight Requests
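Typical PromQL for these panels, based on the metric names documented above (starting points; adjust label names and windows to your deployment):

```
# Request rate (RPS), per endpoint, over 5m
sum(rate(http_requests_total[5m])) by (endpoint)

# P95 latency, over 5m
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate: fraction of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# In-flight requests
http_requests_in_flight
```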
Metrics Collection Flow¶
```mermaid
flowchart TB
    subgraph request["HTTP Request Lifecycle"]
        Start["Request Arrives"]
        Process["Process Request"]
        End["Response Sent"]
    end
    subgraph metrics["Metrics Updates"]
        InFlight["http_requests_in_flight<br/>+1"]
        Duration["http_request_duration_seconds<br/>observe(duration)"]
        Counter["http_requests_total<br/>+1"]
        Exception["http_exceptions_total<br/>+1 (if error)"]
        InFlightDec["http_requests_in_flight<br/>-1"]
    end
    subgraph prometheus["Prometheus"]
        Scrape["Scrape /metrics<br/>every 15s"]
        Store["Time Series DB"]
        Query["PromQL Queries"]
    end
    Start --> InFlight
    InFlight --> Process
    Process -->|Success| Duration
    Process -->|Error| Exception
    Duration --> Counter
    Exception --> Counter
    Counter --> InFlightDec
    InFlightDec --> End
    Counter -.->|Expose| Scrape
    Scrape --> Store
    Store --> Query
    style request fill:#e1f5ff
    style metrics fill:#d4edda
    style prometheus fill:#d1ecf1
```
OpenTelemetry Tracing¶
Configuration¶
Environment Variables¶
```bash
# OTLP Collector endpoint (Jaeger, Tempo, etc.)
# Accepts formats: "localhost:4317" or "http://localhost:4317"
export OTLP_ENDPOINT="http://localhost:4317"

# Enable console trace export for debugging (optional)
export ENABLE_CONSOLE_TRACES="true"
```
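Since `OTLP_ENDPOINT` accepts both a bare `host:port` and a full URL, the value is typically normalized before being handed to the exporter. A minimal sketch of that step (the function name is illustrative; DataEngineX's actual parsing may differ):

```python
def normalize_otlp_endpoint(raw: str, default_scheme: str = "http") -> str:
    """Accept 'host:4317' or 'http://host:4317' and return a full URL."""
    raw = raw.strip()
    if "://" in raw:
        return raw  # already a URL, pass through unchanged
    return f"{default_scheme}://{raw}"
```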
Tracing Flow¶
```mermaid
flowchart LR
    subgraph app["Application"]
        Request["HTTP Request"]
        Instrumentation["FastAPI<br/>Instrumentation"]
        Span["Span Creation"]
        Export["OTLP Exporter"]
    end
    subgraph collectors["Collectors"]
        OTLP["OTLP Endpoint<br/>:4317"]
        Console["Console Export<br/>(Debug)"]
    end
    subgraph backend["Backend"]
        Jaeger["Jaeger UI<br/>:16686"]
    end
    Request --> Instrumentation
    Instrumentation --> Span
    Span -->|Attributes:<br/>method, path,<br/>status, duration| Export
    Export -->|if OTLP_ENDPOINT set| OTLP
    Export -->|if ENABLE_CONSOLE_TRACES| Console
    OTLP --> Jaeger
    style app fill:#e1f5ff
    style collectors fill:#fff3cd
    style backend fill:#d1ecf1
```
Automatic Instrumentation¶
FastAPI is automatically instrumented with OpenTelemetry. Every request creates:

- A span with request details (method, path, status)
- A trace ID for distributed tracing
- Timing information
Manual Instrumentation¶
Add custom spans to your code:
```python
from dataenginex.tracing import get_tracer

tracer = get_tracer(__name__)


def my_function():
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("query.table", "users")
        span.set_attribute("query.limit", 100)
        # ... do database work ...
        span.set_attribute("query.results", 42)
```
Jaeger Setup¶
Run Jaeger with OTLP support:
```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -e SPAN_STORAGE_TYPE=memory \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  jaegertracing/all-in-one:1.60
```
Access Jaeger UI: http://localhost:16686
Example Trace¶
```
Trace ID: 7f8a3b2c1d4e5f6a
│
├─ [150ms] GET /api/users
│   ├─ [80ms] database_query (users table)
│   ├─ [30ms] redis_cache_check
│   └─ [40ms] response_serialization
│
Total: 150ms
```
Grafana Dashboards¶
Prebuilt dashboards are available in infra/grafana:
- DEX Metrics: request rate, latency, error rate, in-flight.
- DEX Logs: log volume, error spikes, recent logs, and request IDs (Loki).
- DEX Traces: trace list, top endpoints, and span latency (Tempo/TraceQL).
Import Steps¶
- Open Grafana → Dashboards → New → Import.
- Upload the JSON from infra/grafana/dashboards.
- Select Prometheus/Loki/Tempo data sources when prompted.
Notes¶
Dashboards assume default labels (e.g., app=dataenginex). If your labels differ, edit the dashboard variables and panel queries.
Local Testing¶
Quick Start¶
1. Run the application:

   ```bash
   uv lock
   uv sync
   uv run poe api
   ```

2. Test the metrics endpoint:

   ```bash
   curl http://localhost:8000/metrics
   ```

   Expected output:

   ```
   # HELP http_requests_total Total HTTP requests
   # TYPE http_requests_total counter
   http_requests_total{endpoint="/",method="GET",status="200"} 1.0
   ...
   ```

3. Generate traffic to see metrics:

   ```bash
   # Make some requests
   curl http://localhost:8000/
   curl http://localhost:8000/health
   curl http://localhost:8000/ready

   # Check updated metrics
   curl http://localhost:8000/metrics
   ```
Local Observability Stack (Prometheus + Grafana + Jaeger)¶
Use the bundled compose file for one-command local testing:
Then open:

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin / admin)
- Jaeger: http://localhost:16686
Test with Jaeger (Optional)¶
1. Start Jaeger locally (see Jaeger Setup above).

2. Run the application with tracing enabled:

   ```bash
   export OTLP_ENDPOINT="http://localhost:4317"
   uv run poe api
   ```

3. Generate traces by making a few requests.

4. View traces in the Jaeger UI:
   - Open: http://localhost:16686
   - Select service: "dataenginex"
   - Click "Find Traces"
Test Console Traces (No Jaeger Required)¶
Enable console trace export (set `ENABLE_CONSOLE_TRACES=true`) to see traces in the logs without running Jaeger. Make a request and check the console output:
```json
{
  "name": "GET /",
  "context": {
    "trace_id": "0x7f8a3b2c1d4e5f6a...",
    "span_id": "0x1234567890abcdef"
  },
  "kind": "SpanKind.SERVER",
  "parent_id": null,
  "start_time": "2026-02-05T10:30:00.000000Z",
  "end_time": "2026-02-05T10:30:00.150000Z",
  "status": {
    "status_code": "UNSET"
  },
  "attributes": {
    "http.method": "GET",
    "http.route": "/",
    "http.status_code": 200
  }
}
```
Run Tests¶
Verify metrics and tracing implementation:
Expected:

- 23 tests passing
- Coverage > 90%
Integration Examples¶
Docker Compose¶
```yaml
version: '3.8'
services:
  dataenginex:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LOG_LEVEL=INFO
      - OTLP_ENDPOINT=http://jaeger:4317
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./infra/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
  jaeger:
    image: jaegertracing/all-in-one:1.60
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=memory
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
      - "14250:14250"  # gRPC
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
```
Kubernetes¶
```yaml
apiVersion: v1
kind: Service
metadata:
  name: dataenginex
  labels:
    app: dataenginex
spec:
  ports:
    - port: 8000
      name: http
  selector:
    app: dataenginex
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dataenginex
spec:
  selector:
    matchLabels:
      app: dataenginex
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
Alerts¶
Prometheus alert rules (SLO-aligned)¶
The actual rule definitions live in infra/monitoring/alerts/dataenginex-alerts.yml. They expose three alerts—latency, error rate, and saturation—each scoped by environment so the thresholds can reflect the traffic patterns for dev, stage, and prod. Every alert annotation links to the deployment runbook.
| Alert | Environment | Threshold | Severity | Receiver |
|---|---|---|---|---|
| `DataEngineXLatencyHigh` | prod | P95 latency > 0.75s (5m window) | page | critical slack (#dex-alerts) |
| | stage | P95 latency > 1.0s | page | same |
| | dev | P95 latency > 1.5s | warning | email (ops alias) |
| `DataEngineXErrorRateHigh` | prod | 5xx fraction > 1% | page | critical slack |
| | stage | 5xx fraction > 3% | page | critical slack |
| | dev | 5xx fraction > 5% | warning | |
| `DataEngineXSaturationHigh` | prod | In-flight > 25 | page | critical slack |
| | stage | In-flight > 15 | warning | |
| | dev | In-flight > 10 | warning | |
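As a reference, the prod latency row of this table could be expressed as a Prometheus rule roughly like the following. The authoritative definitions live in infra/monitoring/alerts/dataenginex-alerts.yml; the group name, label layout, and `environment` label here are assumptions for illustration:

```yaml
groups:
  - name: dataenginex-slo
    rules:
      - alert: DataEngineXLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{environment="prod"}[5m])) by (le)
          ) > 0.75
        for: 5m
        labels:
          severity: page
          environment: prod
        annotations:
          summary: "P95 latency above 0.75s for 5 minutes"
          runbook: docs/DEPLOY_RUNBOOK.md
```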
The production alertmanager configuration in infra/monitoring/alertmanager.yml routes all severity=page alerts to a Slack webhook (#dex-alerts) while severity=warning alerts go to the ops email alias. Alerts sharing the same alertname and environment are deduplicated via inhibit rules so warnings do not trigger when a page is active.
Reloading Alert Rules¶
Whenever the alert rules or Alertmanager config changes, reapply the manifests so Prometheus and Alertmanager scrape the latest thresholds. Reload the rules in the following order to avoid gaps:
1. Reapply the Prometheus rule set managed in GitOps.
2. Reconfigure Alertmanager so receivers and runbooks stay up to date.
3. Verify the alerts appear in the Alertmanager UI and reference the release runbook described in docs/DEPLOY_RUNBOOK.md.
If you manage the stack via ArgoCD, push the changes to the kustomize overlay and let ArgoCD sync the deployments automatically rather than running the commands above manually.
Proof-of-concept API endpoints¶
These endpoints surface the modules under src/pyconcepts so that you can explore how the application mixes external data and streaming insights.
- External data — `GET /api/external-data?currency=USD`
  - Uses `pyconcepts.external_data.fetch_external_data` to call the Coindesk API and return the latest BTC rate for the requested currency.
  - Returns a JSON object with `symbol`, `currency`, `value`, `timestamp`, and `source` fields.
  - Example: `curl "http://localhost:8000/api/external-data?currency=eur"`
- Streaming insights — `GET /api/insights?count=5&interval=0.75`
  - Streams conservative synthetic metrics via Server-Sent Events (`text/event-stream`).
  - Honors `count` (1–20) and `interval` seconds per event to throttle data for demos.
  - Consume it with `curl -N http://localhost:8000/api/insights` and parse each `data: {...}` chunk as JSON.
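The `data: {...}` chunks can also be consumed from Python. A stdlib-only sketch of the SSE line parsing (illustrative; not the project's client, and it handles only single-line `data:` events):

```python
import json


def parse_sse_events(lines):
    """Yield parsed JSON payloads from SSE 'data: {...}' lines.

    Blank lines (event separators) and comment lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                yield json.loads(payload)
```

In practice you would feed it the response body line by line, e.g. from `urllib.request.urlopen(...)`.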
Best Practices¶
Metrics¶
- Use labels wisely: Keep cardinality low (< 1000 unique label combinations)
- Histogram buckets: Adjust buckets based on your latency distribution
- Counter vs Gauge: Use counters for totals, gauges for current values
- Naming: Follow Prometheus naming conventions (`_total`, `_seconds`)
Tracing¶
- Sampling: In high-traffic scenarios, sample traces (e.g., 1% of requests)
- Span attributes: Add relevant context but avoid PII
- Error tracking: Mark spans with errors using `span.set_status()`
- Span naming: Use consistent, hierarchical names
Performance¶
- Metrics collection: ~0.1ms overhead per request
- Tracing overhead: ~1-2ms per request (depends on sampling)
- `/metrics` endpoint: Excluded from metrics to avoid recursion
Health Checks¶
DEX provides three health endpoints for Kubernetes probes:
- /health (liveness) — returns component status and overall health
- /ready (readiness) — returns readiness based on dependency checks
- /startup (startup) — indicates app initialization completion
Configure Dependency Checks¶
Set these environment variables to enable component checks:
export DEX_DB_HOST="db"
export DEX_DB_PORT="5432"
export DEX_CACHE_HOST="redis"
export DEX_CACHE_PORT="6379"
export DEX_EXTERNAL_API_URL="https://api.example.com/health"
If a component is not configured, it is reported as `skipped` in the response.
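Conceptually, each configured component maps to a small connectivity probe. A stdlib sketch of the pattern (the function name is illustrative, not the DataEngineX API; real checks would likely be async and component-specific):

```python
import os
import socket


def check_tcp_component(host_var: str, port_var: str, timeout: float = 1.0) -> str:
    """Return 'ok', 'fail', or 'skipped' for a host/port pair read from env.

    Unset variables mean the component is not configured -> 'skipped'.
    """
    host = os.environ.get(host_var)
    port = os.environ.get(port_var)
    if not host or not port:
        return "skipped"
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return "ok"
    except OSError:
        return "fail"


# Example: status = check_tcp_component("DEX_DB_HOST", "DEX_DB_PORT")
```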
Troubleshooting¶
Metrics not appearing¶
- Check that the `/metrics` endpoint is accessible
- Verify Prometheus can scrape the endpoint
- Check Prometheus targets: http://localhost:9090/targets
Service not appearing in Jaeger UI¶
Troubleshooting steps:
1. Verify Jaeger is running.

2. Check that `OTLP_ENDPOINT` is set. It should output: `http://localhost:4317`

3. Generate traffic (Jaeger only shows services that have sent traces):

   ```bash
   # Make several requests
   curl http://localhost:8000/
   curl http://localhost:8000/health
   curl http://localhost:8000/ready
   ```

4. Verify the service name by checking the application logs on startup.

5. Check that Jaeger received traces:
   - Open the Jaeger UI: http://localhost:16686
   - Click the "Service" dropdown
   - Look for "dataenginex" (matches the `APP_NAME` env var)
   - If not visible, select the "Last Hour" time range and click "Find Traces"

6. Check for export errors in the application output.

7. Enable console traces for debugging (`ENABLE_CONSOLE_TRACES=true`).

8. Verify port 4317 is accessible.
High memory usage¶
- Reduce trace sampling rate
- Check for metric label cardinality explosion
- Monitor the `/metrics` response size
Testing¶
Run observability tests:
Related Documentation¶
Deployment & Operations:

- CI/CD Pipeline: Automated deployments
- Deployment Runbook: Deploy procedures
- Local K8s Setup: Kubernetes setup

Development:

- SDLC: Development workflow
- Contributing: Contribution guide