Late data, or past data loaded after the current date

As I read more on Data Vault, regarding the “dates” modelling: after 10 years working in dental insurance, daily batches would often break for whatever reason, and we had to manually fix whatever the problem was and re-execute each batch in exact date order, unless the job reprocessed everything, in which case we could just re-run the latest failed job.

Now in the Data Vault world, I was initially sceptical about using the system timestamp as the secondary PK column in SCD satellites, because if late history data gets loaded, then the current status of the data is no longer max(load_ts), right…
Meanwhile I have become more convinced by the many personal opinions online arguing that the system date is the only date we really control, and that it is safe and incremental (unless we travel near the speed of light and land back on Earth again).
That’s OK, and it’s also OK to just insert/append records as they come, no problem, since no brains are involved; but I’m concerned with the typical cleanup, data-fix, and re-execution problems in DV… which probably only happen when loading PITs or bridges?

So any process that calculates PITs, or a presentation-layer “current” view (with or without using a PIT), would have to work with the real business dates anyway, right?

But a PIT table snapshot using system load dates could wrongly reflect the data.

Here is a simple example:

mage_sat
(mage_key, sat_load_ts, colour)
gandalf 5-APR-2022 blue

some_pit
pit_snapshot_date mage_key mage_load_ts
30-APR-2022 gandalf 5-APR-2022

Then imagine that, between the generation of the 30-APR snapshot and the next snapshot on 1-MAY-2022, late data was received for 6-APR-2022:

mage_sat
(mage_key, sat_load_ts, colour)
gandalf 5-APR-2022 blue
gandalf 6-APR-2022 white

some_pit
pit_snapshot_date mage_key mage_load_ts
30-APR-2022 gandalf 5-APR-2022 <------ old snapshot row links to the wrong sat record!!
1-MAY-2022 gandalf 6-APR-2022

The PIT wrongly states that the 30-APR-2022 snapshot points to the 5-APR record, right? If the data had arrived on time, the correct PIT would be:

30-APR-2022 gandalf 6-APR-2022
1-MAY-2022 gandalf 6-APR-2022
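
For context, a PIT snapshot built purely on load_ts is typically assembled roughly like the sketch below (a minimal illustration using the example’s names, not any particular product’s code). Once built, the row is frozen, so data arriving afterwards cannot retroactively fix it:

```sql
-- Hypothetical snapshot build: per key, freeze the newest sat row
-- whose load_ts falls on or before the snapshot moment.
INSERT INTO some_pit (pit_snapshot_date, mage_key, mage_load_ts)
SELECT DATE '2022-04-30'   AS pit_snapshot_date,
       s.mage_key,
       MAX(s.sat_load_ts)  AS mage_load_ts
FROM   mage_sat AS s
WHERE  s.sat_load_ts <= DATE '2022-04-30'
GROUP BY s.mage_key;
```

Since the 6-APR row did not exist yet when this ran on 30-APR, the snapshot froze the 5-APR pointer, which is exactly the wrong-link problem above.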

As always, I prefer to counter any Mr. Murphy problems, because they HAPPEN, and from my long career in databases the hardest thing is handling exceptions and recovery.
Sadly, internet content and society in general prefer to ignore the Mr. Murphy path, which is the reality of things, and people just broadcast what I call “the Woodstock” way, where all data has quality, is complete, and arrives on time… all sorts of pipelines get built manually or with tools, until one night at 3am you find a not-so-hippie thingie in the data flow process.

I’m still reading the first DV book, currently in the PIT chapter, but this topic of late past data keeps rumbling in my personal neural network :roll_eyes:

I enjoyed Cuba’s slides on “time and claims multi-temporality issues” a lot, but I’m still struggling a bit… taking my simpler example above with just one temporal line: am I wrong in thinking we’d better implement a PIT rebuild process, or am I missing something obvious?

Thanks dear DV friends ~
Emanuel de Oliveira

My ingenuity is so special :pray::sparkles::roll_eyes:

I just found my next step forward after more reading:

Solution:
Add an [applied_date] column on SATellites.

So [sat_load_ts] continues to be the no-brains system micro-batch timestamp.

So I guess the PIT’s sat_load_ts column semantics need to change to use applied_date, despite the 99.999% of online DV content that mentions the brainless sat_load_ts, correct?

So instead of inserting (sat_key, sat_load_dts), satellites should really have (sat_key, sat_applied_dts) pairs, right?
Keeping [pit_snapshot_ts] as the reference frame, using the system date.
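
To make that concrete, a minimal DDL sketch of such a satellite could look like this (hypothetical names and types; a real DV 2.0 satellite would also carry a hash diff, record source, etc.):

```sql
-- Hypothetical satellite carrying both timelines:
--   applied_ts  = business/effective date the record describes
--   sat_load_ts = no-brains system micro-batch timestamp, always increasing
CREATE TABLE mage_sat (
    mage_key     VARCHAR(50) NOT NULL,  -- parent hub key
    applied_ts   TIMESTAMP   NOT NULL,  -- the "applied_date"
    sat_load_ts  TIMESTAMP   NOT NULL,  -- when we actually loaded it
    colour       VARCHAR(20),           -- payload attribute
    PRIMARY KEY (mage_key, applied_ts, sat_load_ts)
);
```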

But my original wondering still remains, and I suspect the obvious answer: don’t we need a process to update PITs, correcting the pit_snapshot_dts rows, when late data arrives?
I suspect this will not be a friendly process: probably detecting the affected date intervals → deleting, in all related PITs, the [pit_snapshot_ts] rows that need rebuilding, correct?

Thinking a bit, I suppose loading remains standard (see the sketch after this list):

  • the sat gets the just-received late data inserted under the current system load_ts (with the effective_ts attribute and others).
  • the related PIT(s) get their snapshot_ts rows updated, with each sat_load_ts pointing to the current system load_ts under which the new records were just appended.
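
A sketch of that PIT build, driven by the business timeline instead of the load timeline (hypothetical names matching the example below; assuming at most one sat row per key per effective_ts):

```sql
-- Hypothetical: per key, point the snapshot at the sat row with the
-- greatest effective_ts on or before the snapshot date, whatever its
-- system load_ts happens to be.
INSERT INTO some_pit (pit_snapshot_dt, sat_mage_key, sat_mage_load_ts)
SELECT DATE '2022-04-06' AS pit_snapshot_dt,
       s.mage_key,
       s.sat_load_ts
FROM   mage_sat AS s
WHERE  s.effective_ts = (SELECT MAX(s2.effective_ts)
                         FROM   mage_sat AS s2
                         WHERE  s2.mage_key      = s.mage_key
                           AND  s2.effective_ts <= DATE '2022-04-06');
```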

Back to my original post’s example, changed as per my most recent wonderings:

mage_sat
(sat_load_ts, mage_key, effective_ts, attribute)
17-APR-2022 gandalf 1-APR-2022 blue

some_pit (loaded early in the morning)
(pit_snapshot_dt, sat_mage_key, sat_mage_load_ts)
6-APR-2022 gandalf 17-APR-2022 <- the nearest effective_ts on or before the 6th is the 1st

Then on 30-APR-2022, late data is received for a “real date” (aka effective_ts) of 3-APR-2022:

mage_sat
(sat_load_ts, mage_key, effective_ts, attribute)
17-APR-2022 gandalf 1-APR-2022 blue
30-APR-2022 gandalf 3-APR-2022 white

some_pit
(pit_snapshot_dt, sat_mage_key, sat_mage_load_ts)
(…)
6-APR-2022 gandalf 17-APR-2022 *** needs to be updated to:
6-APR-2022 gandalf 30-APR-2022
(…)
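
The repair process I am wondering about could be sketched roughly like this (hypothetical: it assumes a snapshot_dates calendar table and that 3-APR-2022 was detected as the earliest late effective_ts; a real implementation would scope this per satellite):

```sql
-- Hypothetical rebuild: every snapshot taken on or after the earliest
-- late effective date may carry a stale pointer, so re-derive those rows.
DELETE FROM some_pit
WHERE  pit_snapshot_dt >= DATE '2022-04-03';

INSERT INTO some_pit (pit_snapshot_dt, sat_mage_key, sat_mage_load_ts)
SELECT d.pit_snapshot_dt,
       s.mage_key,
       s.sat_load_ts
FROM   snapshot_dates AS d          -- calendar of snapshot moments
JOIN   mage_sat       AS s
  ON   s.effective_ts = (SELECT MAX(s2.effective_ts)
                         FROM   mage_sat AS s2
                         WHERE  s2.mage_key      = s.mage_key
                           AND  s2.effective_ts <= d.pit_snapshot_dt)
WHERE  d.pit_snapshot_dt >= DATE '2022-04-03';
```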

Would this example reflect best practice for handling the arrival of late data (and not only that), as per DV 2.0 standards?

Any suggestions and/or real-world implementation styles on this?
Your honourable energies are appreciated in advance :relaxed:

Emanuel de Oliveira

Maybe you can take a read of this post: Data Vault 2.0 has a new hero…. Analytics assumes that data movement… | by Patrick Cuba | Medium

Chapter 8 of the book has implementation patterns and code on how it is done

It is also a pattern adopted by dbtvault: Extended Tracking Satellites - dbtvault


Thanks Patrick!! You’re the man :slight_smile:
I’m in the middle of the 1st book:

  • Building a Scalable Data Warehouse with Data Vault 2.0 - Daniel Linstedt - 2016

I also had yours in the queue, but I will hasten its purchase, as I really like your online content; your diagrams explaining ideas are interesting and to the point, so I’m sure I will enjoy your book.

Thanks PC :slight_smile:

Yours,
Emanuel de Oliveira

p.s. Like everything, it’s the exceptions that give the most work. I’m aware that dbtvault exists: a kind of template language for DDL creation and changes, and generation of ETL code I think; I’m not sure if it has an execution runtime engine as well… surely I will investigate it (hopefully it’s free for a developer to test and learn).
And probably the best dedicated tool for DV out there?
But even with tools, I first wanna feel confident understanding a core manual implementation, with all sorts of exotic stuff like late data or multiple temporality lines. Dates and data exceptions, oh well, they’re the strawberries of our life :heart:

That’s OK; the book I referenced has sample code for hubs, links and satellites, modelling problems with clear illustrations of what to use and not to use, and sample automation patterns too.

Choosing a tool comes down to preference:

  • dbtvault - scripty-type automation based on dbt; dbtvault is a package for dbt
  • vaultspeed, wherescape, dv builder, data joinery (and some more) are GUI-based, not scripty.

As long as the automation tools can automate most if not all of the required patterns, then they’re good. I have not heard of/seen the other tools implementing XTS as described above, but I know dbtvault has.
