dataenginex.warehouse
SQL-like transforms, persistent lineage, warehouse utilities.
Public API::
from dataenginex.warehouse import (
Transform, TransformPipeline, TransformResult,
RenameFieldsTransform, DropNullsTransform,
CastTypesTransform, AddTimestampTransform, FilterTransform,
LineageEvent, PersistentLineage,
)
LineageEvent
dataclass
A single lineage event describing a data operation.
Attributes:
| Name | Type | Description |
|---|---|---|
| event_id | str | Auto-generated unique identifier (12-char hex). |
| parent_id | str \| None | ID of the upstream event that produced the input. |
| operation | str | Type of operation (e.g. `ingest`, `transform`). |
| layer | str | Medallion layer (e.g. `bronze`, `silver`). |
| source | str | Where data came from. |
| destination | str | Where data was written. |
| input_count | int | Number of input records. |
| output_count | int | Number of output records. |
| error_count | int | Number of records that errored. |
| quality_score | float \| None | Quality score of the output (0.0–1.0). |
| pipeline_name | str | Name of the owning pipeline. |
| step_name | str | Name of the transform step. |
| metadata | dict[str, Any] | Free-form context dict. |
| timestamp | datetime | When the event occurred. |
Source code in src/dataenginex/warehouse/lineage.py
to_dict()
Serialize the lineage event to a plain dictionary.
Source code in src/dataenginex/warehouse/lineage.py
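A standalone sketch of the documented shape can make the contract concrete. This is not the library's actual class — `LineageEventSketch` and its field subset are illustrative — but it mirrors the documented behavior: an auto-generated 12-char hex `event_id` and a `to_dict()` that serializes to a plain dictionary.

```python
from __future__ import annotations

import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


# Illustrative stand-in for the documented dataclass (subset of fields).
@dataclass
class LineageEventSketch:
    operation: str
    layer: str
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex[:12])
    parent_id: str | None = None
    input_count: int = 0
    output_count: int = 0
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def to_dict(self) -> dict[str, Any]:
        # asdict() copies every field; render the timestamp as ISO-8601.
        d = asdict(self)
        d["timestamp"] = self.timestamp.isoformat()
        return d


ev = LineageEventSketch(operation="ingest", layer="bronze",
                        input_count=1250, output_count=1250)
```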
PersistentLineage
JSON-file-backed lineage store.
Example::
lineage = PersistentLineage("data/lineage.json")
ev = lineage.record(
operation="ingest",
layer="bronze",
source="linkedin",
input_count=1250,
output_count=1250,
)
# later
lineage.record(
operation="transform",
layer="silver",
parent_id=ev.event_id,
input_count=1250,
output_count=1200,
quality_score=0.88,
)
Source code in src/dataenginex/warehouse/lineage.py
all_events
property
Return a shallow copy of all stored lineage events.
record(**kwargs)
Create and store a new lineage event.
Accepts the same keyword arguments as LineageEvent.
Source code in src/dataenginex/warehouse/lineage.py
get_event(event_id)
Fetch a single event by ID.
Source code in src/dataenginex/warehouse/lineage.py
get_children(parent_id)
Return events that have parent_id as their parent.
Source code in src/dataenginex/warehouse/lineage.py
get_chain(event_id)
Walk up from event_id to the root and return the full chain.
Source code in src/dataenginex/warehouse/lineage.py
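The walk-up that get_chain describes can be sketched standalone. This is an assumption about the traversal, not the library's code: follow `parent_id` links until an event has no parent, then return the chain root-first. Events here are plain dicts for illustration.

```python
# Hypothetical helper mirroring the documented get_chain traversal.
def chain_for(events: dict[str, dict], event_id: str) -> list[dict]:
    chain: list[dict] = []
    current = events.get(event_id)
    while current is not None:
        chain.append(current)
        parent_id = current.get("parent_id")
        current = events.get(parent_id) if parent_id else None
    chain.reverse()  # root (oldest) event first
    return chain


events = {
    "a1b2c3d4e5f6": {"event_id": "a1b2c3d4e5f6", "parent_id": None, "layer": "bronze"},
    "0f1e2d3c4b5a": {"event_id": "0f1e2d3c4b5a", "parent_id": "a1b2c3d4e5f6", "layer": "silver"},
}
```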
get_by_layer(layer)
Return all events for a given medallion layer.
Source code in src/dataenginex/warehouse/lineage.py
get_by_pipeline(pipeline_name)
Return all events for a given pipeline.
Source code in src/dataenginex/warehouse/lineage.py
summary()
Return high-level lineage statistics.
Source code in src/dataenginex/warehouse/lineage.py
AddTimestampTransform
Bases: Transform
Add a processing timestamp field to every record.
Source code in src/dataenginex/warehouse/transforms.py
apply(record)
Add an ISO-8601 processing timestamp to record.
Source code in src/dataenginex/warehouse/transforms.py
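A minimal sketch of the documented behavior, assuming stdlib only. The field name `processed_at` is an assumption for illustration, not the transform's actual default.

```python
from datetime import datetime, timezone


# Sketch: stamp each record with an ISO-8601 processing time.
# "processed_at" is a hypothetical field name.
def add_timestamp(record: dict, field_name: str = "processed_at") -> dict:
    record[field_name] = datetime.now(timezone.utc).isoformat()
    return record
```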
CastTypesTransform
Bases: Transform
Cast fields to target types (int, float, str, bool).
Source code in src/dataenginex/warehouse/transforms.py
apply(record)
Cast specified fields to their target types.
Source code in src/dataenginex/warehouse/transforms.py
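A sketch of casting to the four documented targets. The `bool` branch is an assumption worth noting: Python's `bool("false")` is `True`, so string values need explicit handling. How the real transform routes cast failures into `error_count` is not shown here.

```python
# Hypothetical cast table; the real transform's internals may differ.
_CASTS = {
    "int": int,
    "float": float,
    "str": str,
    # Assumption: strings like "false"/"0" should become False.
    "bool": lambda v: v if isinstance(v, bool) else str(v).lower() in ("true", "1"),
}


def cast_types(record: dict, targets: dict[str, str]) -> dict:
    for name, type_name in targets.items():
        if name in record and record[name] is not None:
            record[name] = _CASTS[type_name](record[name])
    return record
```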
DropNullsTransform
Bases: Transform
Drop records that have None in any of the specified fields.
Source code in src/dataenginex/warehouse/transforms.py
apply(record)
Return None if any required field is null, else the record.
Source code in src/dataenginex/warehouse/transforms.py
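The contract is small enough to sketch in full, assuming the documented signature. Returning `None` signals "drop"; note this sketch also treats a *missing* field as null, via `dict.get`.

```python
from __future__ import annotations


# Sketch of the documented contract: None means "drop this record".
def drop_nulls(record: dict, required_fields: list[str]) -> dict | None:
    # .get() returns None for missing keys, so absent fields drop too.
    if any(record.get(f) is None for f in required_fields):
        return None
    return record
```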
FilterTransform
Bases: Transform
Keep only records that match a predicate expression.
The predicate receives a record dict and returns True to keep the record.
Source code in src/dataenginex/warehouse/transforms.py
apply(record)
Keep record if the predicate returns True, else drop it.
Source code in src/dataenginex/warehouse/transforms.py
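A sketch of the predicate contract, with a hypothetical predicate (`is_paid` and the `amount` field are illustrative): `True` keeps the record, `False` drops it — signalled by returning `None`, per the base Transform contract.

```python
from __future__ import annotations


# Sketch of the filter contract described above.
def filter_record(record: dict, predicate) -> dict | None:
    return record if predicate(record) else None


def is_paid(record: dict) -> bool:  # illustrative predicate
    return record.get("amount", 0) > 0
```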
RenameFieldsTransform
Bases: Transform
Rename keys in a record according to a mapping.
Source code in src/dataenginex/warehouse/transforms.py
apply(record)
Rename keys in record according to the configured mapping.
Source code in src/dataenginex/warehouse/transforms.py
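The mapping behavior can be sketched in one expression, assuming unmapped keys pass through unchanged (an assumption; the real transform may drop or error on them).

```python
# Sketch: rebuild the record, renaming configured keys and passing
# unmapped keys through as-is.
def rename_fields(record: dict, mapping: dict[str, str]) -> dict:
    return {mapping.get(key, key): value for key, value in record.items()}
```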
Transform
Bases: ABC
Base class for a single data transform step.
Subclass and implement apply, which receives one record and returns
either a transformed record or None to drop it.
Source code in src/dataenginex/warehouse/transforms.py
apply(record)
abstractmethod
Transform record in place or return a new dict.
Return None to drop the record from the pipeline.
Source code in src/dataenginex/warehouse/transforms.py
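A minimal standalone mirror of this contract shows what a subclass looks like. `TransformSketch` and `NormalizeUserName` are illustrative names, not part of the library.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


# Stand-in for the documented base class, for illustration only.
class TransformSketch(ABC):
    @abstractmethod
    def apply(self, record: dict) -> dict | None:
        """Return a transformed record, or None to drop it."""


class NormalizeUserName(TransformSketch):
    """Illustrative subclass: lowercase user_name, drop records without one."""

    def apply(self, record: dict) -> dict | None:
        name = record.get("user_name")
        if name is None:
            return None  # drop the record from the pipeline
        record["user_name"] = name.strip().lower()
        return record
```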
TransformPipeline
Execute an ordered chain of Transform steps over a batch.
Example::
pipeline = TransformPipeline("bronze_to_silver")
pipeline.add(DropNullsTransform(["event_id", "user_name"]))
pipeline.add(CastTypesTransform({"amount": "float"}))
result = pipeline.run(records)
Source code in src/dataenginex/warehouse/transforms.py
add(transform)
Append a transform step to the pipeline.
Source code in src/dataenginex/warehouse/transforms.py
run(records)
Execute all steps in order and return the aggregated result.
Source code in src/dataenginex/warehouse/transforms.py
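The run loop can be sketched standalone: each record passes through every step in order, and a step returning `None` drops it. The result dict below mirrors the documented TransformResult counters in name only — timing, error handling, and per-step metrics are omitted.

```python
from __future__ import annotations


# Hypothetical mirror of the run loop; not the library's implementation.
def run_steps(records: list[dict], steps: list) -> dict:
    output: list[dict] = []
    for record in records:
        current: dict | None = dict(record)  # avoid mutating the input
        for step in steps:
            current = step(current)
            if current is None:  # a step dropped the record
                break
        if current is not None:
            output.append(current)
    return {
        "input_count": len(records),
        "output_count": len(output),
        "dropped_count": len(records) - len(output),
        "records": output,
    }
```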
TransformResult
dataclass
Outcome of running a pipeline over a batch of records.
Attributes:
| Name | Type | Description |
|---|---|---|
| input_count | int | Number of records fed into the pipeline. |
| output_count | int | Number of records that passed all transform steps. |
| dropped_count | int | Records dropped during transformation. |
| error_count | int | Records that caused transform errors. |
| duration_ms | float | Total pipeline execution time in milliseconds. |
| records | list[dict[str, Any]] | Transformed output records. |
| step_metrics | list[dict[str, Any]] | Per-step metrics (input/output/dropped/duration). |
| completed_at | datetime | Timestamp of pipeline completion. |
Source code in src/dataenginex/warehouse/transforms.py
success_rate
property
Fraction of input records that made it to the output.
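The arithmetic is presumably `output_count / input_count`; a sketch follows. How the real property treats an empty batch is not documented — this sketch treats it as fully successful, which is an assumption.

```python
# Hypothetical standalone version of the success_rate property.
def success_rate(input_count: int, output_count: int) -> float:
    if input_count == 0:
        return 1.0  # assumption: an empty batch counts as success
    return output_count / input_count
```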