Would like to clarify working of satellite and historical tracking in the satellite table
Satellite loading code is as below
insert into sat_person_address
select distinct
‚person_hash_key
,loadts
,record_source
,hashdiff
,city_name
from <stage_file> stg
where not exists
(
select 1
from
(
select person_hash_key
,hashdiff
,rank() over (partition by person_hash_key order by loadts desc) as rnk
from sat_person_address qualify dv_rnk = 1
) cur
where stg.person_hash_key = cur.person_hash_key
and stg.hashdiff=cur.hashdiff
)
When loading data into the satellite table, we need to compare the hash diff with the previous record.
Now what is the definition of previous record? Is the previous record defined based on load_dtts only? If yes, do we mean that satellite only tracks history as per the sequence it was provided to the data warehouse.
If there are other columns provided by source system such as snapshot_date, extract_date etc, can we use it as well for calculating delta record?
Incase if we get the data in below sequence
On LDTS=1-Jan-2022, snapshot_date=1-jan-2022, city = ‘New york’
.
.
.
On LDTS=10-Jan-2022, snapshot_date=10-jan-2022, city=‘Houston’
On LDTS=11-Jan-2022, snapshot_date=9-jan-2022, city=‘San Diego’
On LDTS=11-Jan-2022, snapshot_date=11-jan-2022, city=‘Houston’
In this sequence of events, when we are comparing hashdiff, on LDTS=11-Jan-2022, we compare ‘San Diego’ with ‘Houston’ based on the current insert code i pasted above.
According to the business events, shouldn’t the satellite be comparing ‘San Diego’ with ‘New York’ ?
Would like to see the opinions of experts on this topic. and how they design satellite and inserts to the satellite tables
Short answar is that you should only compare against the last loaddate per bk.
I have earlier implemented your suggestion that compares snapshot dates. It was a mistake and created a lot of tech debt.
But… you can always extract the last loaddate per snapshot date to retrieve data for a specfic snapshot date. This means that you can load data for old snapshot dates and still retrieve the last loaddate for the last snapshot date to get “current” data. Do you follow?
Hope this helps.
Sats will track incremental changes from the source — it doesn’t matter if it is a snapshot source or cdc, the target sat still will only track the incremental changes.
If the snapshot source might include history as well then you need to make sure you’re processing all new records only. Some suggestions on how to deal with this are:
looping history
defining a time column as a dependent child key
or using what is suggested in using Streams on Views
Yes, Agree that satellite only track the incremental changes.
my question is about handling the incremental changes. Is loadts column enough to compare the previous record.
if we get a late arriving feed from source system , then just using loadts will not work. isn’t it?
I read in your book about the XTS satellite which helps in solving out of sequence loads. But it seems it is not the standard from data vault.
So, I wonder if XTS is not used, how do we handle such late arriving loads .what are the other standard data vault process or methods to handle such scenario?
Hello Nikunj!
I compare late arriving data the same way I would compare not late arriving data.
Think like this: There are business dates or applied dates within the batch delivered to the data warehouse. This means that you can extract Applied date per LoadDate.
Ex: applied date, LoadDate
2023-02-16, 2023-02-16 for batch 1
2023-02-17, 2023-02-17 for batch 2
2023-02-14, 2023-02-18 for batch 3
So the last entries in the satellites contains data with loaddate 2023-02-18 for applied date 2023-02-14.
If you want the latest data for the last applied date then ask for the maximum loaddate for the maximum applied date. => maximum applied date = 2023-02-17 with corresponding loaddate 2023-02-17
So when fetching data from a satellite do it with loaddate equls or less to 2023-02-17.
Hope that it helps.
@AHenning just to be sure that i completely follow you. Is this what you are suggesting?
Need to add a appliedts column to the satellite table.
And taking the sample insert statement i shared, the satellite loading would change to as below. Right?
Notice the addition by “order by appliedts desc”
Satellite Loading code:
insert into sat_person_address
select distinct
‚person_hash_key
,loadts
,record_source
,hashdiff
,city_name
from <stage_file> stg
where not exists
(
select 1
from
(
select person_hash_key
,hashdiff
,rank() over (partition by person_hash_key order by appliedts desc, loadts desc) as rnk
from sat_person_address qualify dv_rnk = 1
) cur
where stg.person_hash_key = cur.person_hash_key
and stg.hashdiff=cur.hashdiff
)
XTS actually does have the blessing of Dan Linstedt and I along with Nol Ebersohn presented it at the wwdvc 2019. It used to be on DVA’s blog pages but they’ve restructured everything since then so I don’t have a link for it.
The key is to extract data from a loaded satellite for a specific loaddate that is given from a specific applied date. Do you understand?
Hope this helps.
@Nikunj I am successfully using XTS in a CDC-like streaming solution - it works a treat - we got it down to a very small code change in the loaders.
for what its worth: I also opted to separate the XTS per SAT. I love the auditability uplift. Glad to hear Dan approves - It’s a worthy addition to the std that is consistent with the DV principles