We have a situation where our ETL load process can bring multiple records for the same Business Key in a single load. Each of these record represent a change to the descriptive attributes of the Business Key.
If we load this as it is, then we will have two or more record for the same BK with the same LOAD_DATE.
How can we handle such cases?
Hi @cmeyersohn . Per DV standard is it okay to have more than one record in SAT for a Business Key with the same LOAD_DATE ?
Your ETL hub load should detect whether a business already exists before creating one. The descriptive attributes associated with the business key belong in the related satellite table. The primary key of the satellite table includes the load timestamp, so you can load the history of the descriptive attributes.
Hope this helps!
Do you have a multi-active satellite?
This sounds like a multi-active SAT. The PK for this table would need to be the BK+ other field.
@jjs88 and @neil.strange . no this is a not multi-active SAT. this is just a case of multiple updates to the record being captured in a single load. Each of the change is a separate record with a new modify timestamp from the source.
Ok, I see now. I think what you want to do is have looping logic where you process the earliest update first to the most recent. Then you will capture history properly. Also, does the most recent update include the updates from the earlier ones? If not, then I think you have to do what I just mentioned.
This is 3 updates to a BK of A.
A 2022-03-28 16:32:32.053 - Load this one 1st to SAT
A 2022-03-28 17:32:32.053 - Load this one 2nd to SAT
A 2022-03-28 18:32:32.053 - Load this one 3rd to SAT
Hope this helps. I’m also assuming you have a full date time stamp to differentiate the different update records to the same BK.
@jjs88 , so basically I have to load each of these one at a time so that the LOAD_DATE is different?
Yeah, using looping logic. Sorry, I said row number in my initial reply meant to say looping. Dan’s book on DV 2.0 talks a little bit about this.
It complicates the loading process but I think that’s how you have to do it. The only hard part is if you don’t know how to determine which order the update needs to happen in. That’s why I mentioned, if you have an update timestamp or something from your source system that can help differentiate the different updates to the same record.
Thanks @jjs88 . I was going through Dan’s with re: to aforementioned. But was not able to find anything relevant. Would it be possible for you to get me the page / section number?
@jjs88 , I think I found the reference:
@neil.strange @alex.higgs , does the dbtvault macro support Satellite Sub-sequences? We have a way to order to data.
Yeah I don’t remember where in the book it is, I can try and look later.
Essentially, I think you will have a record in the SAT for each update, you just need a way to stagger the Load Date so all 3 are unique, and then the most recent load_date should be the “active” record. If all 3 are unique updates and don’t build off the previous update, then that might be an issue. Then you might need a sequence as described above.
Or, is there a way to combine all 3 updates into one record before inserting into the SAT with just 1 Load Date? This might break the RAW DV pattern but it would make usability of that record easier in the SAT. Not quite sure which route to take here.
You have a few choices.
Have you used the Applied_Date concept? If so Load_Date is imply a version of the applied data for the parent key, link or hub.
In terms of output patterns, either
- A satellite with a dependent-child key, the dep-key can be set to a intra-day key, see Kappa Vault article for details
- A multi-active satellite, that deals with the SET of data and use case differs substantially to the sat with a dep-key
@jjs88 , the updated records build on the previous updates.
- var_load_ts is a constant used on all inserts secs grain, in one batch load:
- multiple records in one bk.
Applying analytical function row_number() and proper partition by column(s) AS rn, would give you an incremental offset you could then add.
var_load_ts some_col rn
20220329 175200.0 first record 1
20220329 175200.0 second record 2
20220329 175200.0 third record 3
example in snowflake:
insert dateadd(nanosecond, rn, var_load_dt) as etl_load_ts, (…)
docs:Date & Time Functions — Snowflake Documentation
in terms of performance you can partition using table ddl like:
create table (
cluster by (to_date(etl_load_ts))
or trunc the timestamp per hour, minute or even second if you have too many records, but probably trunc by hour would do the trick:
docs: DATE_TRUNC — Snowflake Documentation
Hope this helps, the data types depend on quantity or records loaded, but im assuming you not just using date type (without time portion).
Thanks @emanueol .
Do Sequence Number become part of HASHDIFF?
no, as it was just auxiliary value to keep incrementing slightly timestamp.
Personally I really care having or push upstream sources and/or ingestion processes to capture “REAL DATA TIMESTAMP” (or so called “extract date/timestamp”).
Ingestion process pulling from a RBMS database like oracle for example can simple add 1 extra column:
select sysdate ETL_REAL_DATA_TS, (…)rest of columns(…)
Then you know how to proper order thigns right ? So this would be another detail.
Remember my suggestion is more for ingestion of realtime situation and/or where as mentioned you have many values for same BK… personally I would try to understand the meaning of having multiple records for same bk ingested at same time… because if link gain is proper, then you just insert append on satellite A or B all the records using same ETL_LOAD_TS…
So be sure to have lowest possible grain on LINK table before going for my suggestion of minor increases on timestamp to avoid dups on SATs where pk is (link_key, load_ts) or (hub_key, load_ts)
Not sure if you talking on hub_SAT or link_sat ? also sometimes you can also have “transaction” LINK tables that have no historic SAT, or they can store extra attributes on SATs, but idea is 1 transaction = 1 row on LINK and thats it.
So I wonder on your modelling, were you inserting… SAT of a hub ? Sat of a Link ? or did you imagined using link without sat (so called “non historized link” links) ?
My suggestion just a trick to avoid dups, but most important is how you proper modelling
Hope this helps.
Technically, the hash diff will be different for each record if there has been a change to the descriptive data, which will force an insert without the looping issue. Is there a user defined system date/timestamp in the descriptive data, such as a “modification date”? If so, you may add this modification date to the satellite’s primary key set if it is at an atomic level that provides a discreet and unique timestamp value for the record. You would include this date/timestamp field, along with the standard BK Hashkey and LOAD DATE, as the 3-part composite Primary Key for the satellite.
If you do not have a date/timestamp on the record itself, you may want to think about how you identify the correct order or sequence of the transactions that were captured in the batch file that you’re loading. If you use a rowid or generate a sequence id, and you are loading in parallel, there is no guarantee that the sequence assigned will reflect the proper order or sequence that the transactions occurred. Reach out to me if you want to talk about specifics.
Hi @cmeyersohn . Thanks for the detailed response.
Question: If we add source system defined timestamp (i.e. modifytimestamp) to the PK for the SAT, do we need to add it as part of the BK to the HUB as well?
No, you won’t want to do that. The BK should be unique in the Hub. What you are adding are all the descriptive details as they relate to the single hub BK - in other words, you are loading descriptive history as it occurs and relates to the hub BK. Each descriptive record should point to a single BK in the hub, and hold all of the descriptive data, as it changes, in the satellite. Unless I’m missing something regarding the data you’re referring to.
By adding the modify date as part of the satellite’s composite business key, you are guaranteeing that you can (1) associate each record to the Hub PK via the Hub PK column in the satellite composite key set, (2) identify when the record was loaded via the LOAD DATE column in the satellite’s composite key set, and (3) ensuring that you capture every change to the descriptive data by using the modified date/timestamp value in the satellite’s composite key set; and inserted based on the change in the Hash Diff for the same BK on the same Load Date. Adding the Modified Date ensures that you have a unique 3-part composite PK and should help to maintain the “sequence of events” or “order of data changes” for your BI queries.
I would expect to see something like this in the satellite structure:
HUB BK - PK
LDTS - PK
MODIFIED_DT - PK
Descriptive Col 1
Descriptive Col 2