As I'm reading more on Data Vault, regarding the "dates" modelling: after 10 years working in dental insurance, daily batches would often break for whatever reason, and we had to manually fix whatever the problem was and re-execute each batch in exact date order, unless the job reprocessed everything, in which case we could just re-execute the latest failed job.
Now, in the Data Vault world, I was initially skeptical about using the system timestamp as the secondary PK column in SCD satellites, because if we get late history data loaded, then "the current status of the data" is no longer max(load_ts), right…
Meanwhile I've become more convinced by the many personal opinions online arguing that the system date is the only date we really control, and that it's safe and monotonically increasing (unless we travel near the speed of light and land back on Earth again).
That's fine, and it's also fine to just insert/append records as they come, no problem, no brains involved; but I'm concerned with the typical re-execution, cleanup and data-fix problems in DV… which probably only surface when loading PITs or Bridge tables?
So any process that calculates PITs, or a presentation-layer "current" view with or without a PIT, would have to work with the real business dates anyway, right?
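To make that concrete, here is a minimal sketch in plain Python (the record layout and the field names effective_date/load_ts are just my illustration, not from the book) of how max(load_ts) stops meaning "current" once a late record for an older business date lands:

```python
from datetime import date

# Hypothetical satellite rows for one hub key: business effective_date,
# technical load_ts, and the attribute. The second row is a LATE correction
# for an OLDER business date, so it carries the newest load_ts.
mage_sat = [
    {"effective_date": date(2022, 4, 5), "load_ts": date(2022, 4, 5),  "color": "blue"},
    {"effective_date": date(2022, 4, 4), "load_ts": date(2022, 4, 30), "color": "grey"},  # late history
]

# "Current" by technical load order picks the most recently LOADED row...
current_by_load = max(mage_sat, key=lambda r: r["load_ts"])
# ...while "current" by business time picks the latest business state.
current_by_business = max(mage_sat, key=lambda r: r["effective_date"])

print(current_by_load["color"])      # grey  -> late history masquerading as "current"
print(current_by_business["color"])  # blue  -> the actual current business state
```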
But a PIT table snapshot built on system load dates could wrongly reflect the data.
Here is a simple example:
mage_sat
mage_key   load_ts      color
gandalf    5-APR-2022   blue

some_pit
pit_snapshot_date   mage_key   mage_load_ts
30-APR-2022         gandalf    5-APR-2022
Then imagine that between the generation of the 30-APR-2022 snapshot and the next snapshot on 1-MAY-2022, late data was received for 6-APR-2022:
mage_sat
mage_key   load_ts      color
gandalf    5-APR-2022   blue
gandalf    6-APR-2022   white

some_pit
pit_snapshot_date   mage_key   mage_load_ts
30-APR-2022         gandalf    5-APR-2022   <------ old record links to the wrong sat record!!
1-MAY-2022          gandalf    6-APR-2022
The PIT wrongly keeps the 30-APR-2022 snapshot pointing at the 5-APR record, right? Had the data arrived on time, the correct PIT would be:

pit_snapshot_date   mage_key   mage_load_ts
30-APR-2022         gandalf    6-APR-2022
1-MAY-2022          gandalf    6-APR-2022
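For what it's worth, here is a minimal sketch of the rebuild process I'm imagining (plain Python; the names follow my toy example above, and the logic is just my assumption of how a rebuild could work, not something from the book): for every snapshot date, re-pick the satellite record with the greatest business date on or before that snapshot, ignoring when it was loaded:

```python
from datetime import date

# Satellite rows AFTER the late 6-Apr record finally arrived.
mage_sat = [
    {"key": "gandalf", "business_date": date(2022, 4, 5), "color": "blue"},
    {"key": "gandalf", "business_date": date(2022, 4, 6), "color": "white"},  # arrived late
]

snapshot_dates = [date(2022, 4, 30), date(2022, 5, 1)]

def rebuild_pit(sat_rows, snapshots):
    """For each snapshot date and key, point at the row with the greatest
    business date <= the snapshot date, regardless of load order."""
    pit = []
    for snap in snapshots:
        for key in sorted({r["key"] for r in sat_rows}):
            candidates = [r for r in sat_rows
                          if r["key"] == key and r["business_date"] <= snap]
            if candidates:
                best = max(candidates, key=lambda r: r["business_date"])
                pit.append((snap, key, best["business_date"]))
    return pit

for snap, key, biz in rebuild_pit(mage_sat, snapshot_dates):
    print(snap.isoformat(), key, biz.isoformat())
# 2022-04-30 gandalf 2022-04-06   <- the 30-Apr snapshot now points at the 6-Apr record
# 2022-05-01 gandalf 2022-04-06
```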
As always, I prefer to counter any Mr. Murphy problems, because they HAPPEN, and from my long career in databases the hardest thing is handling exceptions and recovery.
Sadly, internet content and society in general prefer to ignore the Mr. Murphy path, which is the reality of things, and people just broadcast what I call "the Woodstock way", where all data has quality, is complete and arrives on time… all sorts of pipelines get built manually or with tools, until one night at 3 AM you find a not-so-hippie thingie in the data flow process.
I'm still reading the first DV book, currently in the PIT chapter, but this topic of late-arriving historical data keeps rumbling around my personal neural network.
I very much enjoyed Cuba's slides on "time and claims multi-temporality issues", but I'm still struggling a bit… given my simpler example above with just one temporal line, am I wrong to think we had better implement a PIT rebuild process, or am I missing something obvious?
Thanks, dear DV friends ~
Emanuel de Oliveira