dataenginex.warehouse¶
SQL-like transforms, persistent lineage tracking, and warehouse utilities.
dataenginex.warehouse
¶
SQL-like transforms, persistent lineage, warehouse utilities.
Public API::
from dataenginex.warehouse import (
Transform, TransformPipeline, TransformResult,
RenameFieldsTransform, DropNullsTransform,
CastTypesTransform, AddTimestampTransform, FilterTransform,
LineageEvent, PersistentLineage,
)
LineageEvent
dataclass
¶
A single lineage event describing a data operation.
Attributes:
| Name | Type | Description |
|---|---|---|
event_id |
str
|
Auto-generated unique identifier (12-char hex). |
parent_id |
str | None
|
ID of the upstream event that produced the input. |
operation |
str
|
Type of operation ( |
layer |
str
|
Medallion layer ( |
source |
str
|
Where data came from. |
destination |
str
|
Where data was written. |
input_count |
int
|
Number of input records. |
output_count |
int
|
Number of output records. |
error_count |
int
|
Number of records that errored. |
quality_score |
float | None
|
Quality score of the output (0.0–1.0). |
pipeline_name |
str
|
Name of the owning pipeline. |
step_name |
str
|
Name of the transform step. |
metadata |
dict[str, Any]
|
Free-form context dict. |
timestamp |
datetime
|
When the event occurred. |
Source code in packages/dataenginex/src/dataenginex/warehouse/lineage.py
to_dict()
¶
Serialize the lineage event to a plain dictionary.
PersistentLineage
¶
JSON-file-backed lineage store.
Example::
lineage = PersistentLineage("data/lineage.json")
ev = lineage.record(
operation="ingest",
layer="bronze",
source="linkedin",
input_count=1250,
output_count=1250,
)
# later
lineage.record(
operation="transform",
layer="silver",
parent_id=ev.event_id,
input_count=1250,
output_count=1200,
quality_score=0.88,
)
Source code in packages/dataenginex/src/dataenginex/warehouse/lineage.py
74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 | |
all_events
property
¶
Return a shallow copy of all stored lineage events.
record(**kwargs)
¶
Create and store a new lineage event.
Accepts the same keyword arguments as LineageEvent.
Source code in packages/dataenginex/src/dataenginex/warehouse/lineage.py
get_event(event_id)
¶
get_children(parent_id)
¶
Return events that have parent_id as their parent.
get_chain(event_id)
¶
Walk up from event_id to the root and return the full chain.
Source code in packages/dataenginex/src/dataenginex/warehouse/lineage.py
get_by_layer(layer)
¶
get_by_pipeline(pipeline_name)
¶
Return all events for a given pipeline.
summary()
¶
Return high-level lineage statistics.
Source code in packages/dataenginex/src/dataenginex/warehouse/lineage.py
AddTimestampTransform
¶
Bases: Transform
Add a processing timestamp field to every record.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
apply(record)
¶
Add an ISO-8601 processing timestamp to record.
CastTypesTransform
¶
Bases: Transform
Cast fields to target types (int, float, str, bool).
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
apply(record)
¶
Cast specified fields to their target types.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
DropNullsTransform
¶
Bases: Transform
Drop records that have None in any of the specified fields.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
apply(record)
¶
Return None if any required field is null, else the record.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
FilterTransform
¶
Bases: Transform
Keep only records that match a predicate expression.
predicate receives a record dict and returns True to keep it.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
apply(record)
¶
Keep record if the predicate returns True, else drop it.
RenameFieldsTransform
¶
Bases: Transform
Rename keys in a record according to a mapping.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
apply(record)
¶
Rename keys in record according to the configured mapping.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
Transform
¶
Bases: ABC
Base class for a single data transform step.
Subclass and implement apply which receives one record and returns
either a transformed record or None to drop it.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
apply(record)
abstractmethod
¶
Transform record in place or return a new dict.
Return None to drop the record from the pipeline.
TransformPipeline
¶
Execute an ordered chain of Transform steps over a batch.
Example::
pipeline = TransformPipeline("bronze_to_silver")
pipeline.add(DropNullsTransform(["job_id", "company_name"]))
pipeline.add(CastTypesTransform({"salary_min": "float"}))
result = pipeline.run(records)
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
add(transform)
¶
run(records)
¶
Execute all steps in order and return the aggregated result.
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
TransformResult
dataclass
¶
Outcome of running a pipeline over a batch of records.
Attributes:
| Name | Type | Description |
|---|---|---|
input_count |
int
|
Number of records fed into the pipeline. |
output_count |
int
|
Number of records that passed all transform steps. |
dropped_count |
int
|
Records dropped during transformation. |
error_count |
int
|
Records that caused transform errors. |
duration_ms |
float
|
Total pipeline execution time in milliseconds. |
records |
list[dict[str, Any]]
|
Transformed output records. |
step_metrics |
list[dict[str, Any]]
|
Per-step metrics (input/output/dropped/duration). |
completed_at |
datetime
|
Timestamp of pipeline completion. |
Source code in packages/dataenginex/src/dataenginex/warehouse/transforms.py
success_rate
property
¶
Fraction of input records that made it to the output.