Hey there,
I'd like some advice on Data Vault modelling and the load strategy for our satellite (SAT) tables.
In our load process, we load from the data lake into the landing layer in batches, and the landing layer is then used to load the SAT tables in our system.
The current load approach only checks whether a record from the landing table already exists in the SAT table; if it does not, it is inserted.
We are not handling state changes.
For example, if the value for a particular key changes at different points in time as
A → B → A
we only capture the A → B change in the SAT table.
So we are going to change the loading approach. We plan to introduce a delta feed (stage table) between the landing table and the SAT table:
Data Lake → Landing Table → Stage Table (Delta Feed) → SAT Table
and compare the latest record in the SAT table against the delta feed.
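To make the intended rule concrete, here is a minimal Python sketch of it (names are made up for illustration, not our real schema; DAY stands in for INSERT_TIMESTAMP and PINCODE is the only payload column):

```python
# Planned delta-load rule: insert a staged row only if its payload differs
# from the LATEST SAT row for the same key. Unlike the old exists-check,
# this captures a value changing back to an earlier state (A -> B -> A).

def delta_load(sat, stage):
    for row in stage:
        history = [r for r in sat if r["ID"] == row["ID"]]
        latest = max(history, key=lambda r: r["DAY"]) if history else None
        if latest is None or latest["PINCODE"] != row["PINCODE"]:
            sat.append(row)

sat = []
stage = [
    {"ID": 1, "DAY": 1, "PINCODE": "A"},
    {"ID": 1, "DAY": 2, "PINCODE": "B"},
    {"ID": 1, "DAY": 3, "PINCODE": "A"},  # change back to A is now detected
]
delta_load(sat, stage)
# sat now holds all three state changes: A, B, A
```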
We are also considering the dbtvault package.
But there is one problem: we often reload the landing table historically.
Example
Stage
ID, INSERT_TIMESTAMP, PINCODE, Business
1, 1 Jan, 100, 1 Jan
2, 1 Jan, 200, 1 Jan
1, 2 Jan, 200, 2 Jan

SAT Table
ID, INSERT_TIMESTAMP, PINCODE, Business
1, 1 Jan, 100, 1 Jan
2, 1 Jan, 200, 1 Jan
1, 2 Jan, 200, 2 Jan
If I reload ID 1 again from the data lake, the historical data looks like this:
Stage
ID, INSERT_TIMESTAMP, PINCODE, Business
1, 1 Jan, 100, 1 Jan
2, 1 Jan, 200, 1 Jan
1, 2 Jan, 200, 2 Jan
1, 3 Jan, 100, 1 Jan   <-- reloaded data
Delta Feed
ID, INSERT_TIMESTAMP, PINCODE, Business
1, 3 Jan, 100, 1 Jan
The load will compare against the latest record in the SAT table, treat the reloaded row as a new change, and insert a duplicate record into the SAT table:
SAT Table
ID, INSERT_TIMESTAMP, PINCODE, Business
1, 1 Jan, 100, 1 Jan
2, 1 Jan, 200, 1 Jan
1, 2 Jan, 200, 2 Jan
1, 3 Jan, 100, 1 Jan   <-- duplicate
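The failure mode in the walkthrough above can be reproduced in a few lines of Python (illustrative names only; day numbers stand in for the Jan dates):

```python
# The load rule only looks at the LATEST SAT row per key, so a historically
# reloaded old value (PINCODE 100) differs from the latest value (200) and
# is inserted again even though it is not a real state change.

sat = [
    {"ID": 1, "DAY": 1, "PINCODE": 100},
    {"ID": 2, "DAY": 1, "PINCODE": 200},
    {"ID": 1, "DAY": 2, "PINCODE": 200},
]
delta = [{"ID": 1, "DAY": 3, "PINCODE": 100}]  # reloaded row, old business value

for row in delta:
    latest = max((r for r in sat if r["ID"] == row["ID"]), key=lambda r: r["DAY"])
    if row["PINCODE"] != latest["PINCODE"]:  # looks like a change, but is a reload
        sat.append(row)

# sat now holds PINCODE 100 for ID 1 twice (days 1 and 3): a duplicate.
```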
I am currently using INSERT_TIMESTAMP as the high-water mark.
Can someone advise on this scenario, please?
Thank you