Should a hub be loaded only from its actual source, and how do I define the business key if two sources have different keys?

rjain101 · 2 July 2024 08:59

I have a system that I have mimicked using a Customer/Product analogy, as in the diagram below.

Based on this, SYSTEM-C always has a subset of the customers from SYSTEM-A and SYSTEM-B, and produces 50+ data sets for customer_product relationship(s).
The Data Vault model would look like this:

To populate the DV we are using the workflow below:

I have the following questions:

SYSTEM-B has only CUST_ID as its business key, while SYSTEM-A has CUST_ID and CUST_SRC. I am thinking of designing a single HUB_CUSTOMER with a default value in CUST_SRC for SYSTEM-B. Is this approach correct?
Since the sources for HUB_CUSTOMER and HUB_PRODUCT are SYSTEM-A and SYSTEM-B, do I really need to populate HUB_CUSTOMER from SYSTEM-C? It only carries the business keys for HUB_CUSTOMER and HUB_PRODUCT in all 50 data sets, and loading the hubs from it could lead to duplicates if I ran the 50+ data sets in parallel in DBT workflows with Delta storage (no unique key constraint in the database). Shouldn't data from SYSTEM-C populate only the link tables and individual satellites?
Any help is greatly appreciated.

patrickcuba · 3 July 2024 04:42

Only Cust_ID is a biz key

yes

What if some keys don't exist in the other systems? Can you guarantee that?
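
To make the business key point concrete, here is a minimal sketch of how the choice of key columns changes the hub hash key. It assumes an MD5 hash over trimmed, upper-cased key parts joined with `||`; the function name `hub_hash_key` and the CUST_SRC values `WEB` and `DEFAULT` are hypothetical and not taken from the thread.

```python
import hashlib

def hub_hash_key(*business_key_parts: str) -> str:
    """Hash the normalised business key parts into a hex digest (assumed convention)."""
    normalised = "||".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# Keyed on CUST_ID alone: the same customer arriving from SYSTEM-A and
# SYSTEM-B resolves to one hash key, so it lands in one hub row.
key_from_a = hub_hash_key("c001")
key_from_b = hub_hash_key(" C001 ")
print(key_from_a == key_from_b)  # True

# Keyed on CUST_ID plus CUST_SRC: a default CUST_SRC for SYSTEM-B (hypothetical
# value) produces a different hash key from SYSTEM-A's row for the same customer.
key_a_with_src = hub_hash_key("C001", "WEB")
key_b_default = hub_hash_key("C001", "DEFAULT")
print(key_a_with_src == key_b_default)  # False
```

With CUST_SRC included in the key, the same customer would appear as two hub rows, which is one way to read the advice that only Cust_ID is the business key.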

rjain101 · 3 July 2024 05:06

Thanks @patrickcuba. Here SYSTEM-C is always a subset of A and B, so it is more or less guaranteed.
The issue is that if I use all 50 data sets from SYSTEM-C in parallel to populate HUB_CUSTOMER (they arrive at the same point in time), I might see duplicate hash keys in it (although the chances are slim), as my tables are all Delta tables with no constraints.

patrickcuba · 3 July 2024 05:57

NEVER, if you do your table locking properly
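
As a rough illustration of why locking removes the duplicate risk, the sketch below simulates 50 parallel loaders writing the same keys into one hub. It is plain Python, not Databricks or AutomateDV code: `hub_customer` is an in-memory stand-in for the hub table and `threading.Lock` stands in for a table-level lock, both assumptions made for the example.

```python
import threading

hub_customer = []            # stand-in for the HUB_CUSTOMER table
hub_lock = threading.Lock()  # stand-in for a table-level lock

def load_hub(staged_keys, record_source):
    """Insert only keys that are not already in the hub, while holding the lock."""
    with hub_lock:
        existing = {row["cust_id"] for row in hub_customer}
        for key in staged_keys:
            if key not in existing:
                hub_customer.append({"cust_id": key, "record_source": record_source})
                existing.add(key)

# 50 data sets from SYSTEM-C, all carrying the same customer keys, loaded in parallel.
staged = ["C001", "C002", "C003"]
loaders = [
    threading.Thread(target=load_hub, args=(staged, f"SYSTEM-C/dataset_{i}"))
    for i in range(50)
]
for t in loaders:
    t.start()
for t in loaders:
    t.join()

loaded_ids = [row["cust_id"] for row in hub_customer]
print(len(loaded_ids), len(set(loaded_ids)))  # 3 3
```

Because the existence check and the insert happen inside the same locked section, whichever loader gets the lock first inserts the keys and the other 49 find nothing new to add.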

rjain101 · 3 July 2024 12:51

If I have different Databricks workflows populating the hub using DBT and AutomateDV, each run creates a temporary table, so parallel executions might overwrite each other's data and cause issues.
The only option I have might be to write SQL here instead of using AutomateDV.

Also, the only way to do locking in Databricks is to use "merge". Is using merge recommended for Data Vault 2.0? If there is any other locking strategy, could you point me to it?
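
For reference, the insert-only behaviour one would want from a merge on a hub can be sketched as follows. This is a plain-Python simulation of "insert when not matched, do nothing when matched", keyed on the hash key; it does not model how Databricks handles concurrent writers, and the function name `merge_insert_only` and the sample rows are hypothetical.

```python
def merge_insert_only(target: dict, staged_rows: list) -> int:
    """Insert staged rows whose hash key is absent from target; never update matches."""
    inserted = 0
    for row in staged_rows:
        if row["hk"] not in target:  # the "when not matched then insert" branch
            target[row["hk"]] = row
            inserted += 1
    return inserted

hub = {}

# First load: SYSTEM-A delivers customer C001.
first = merge_insert_only(hub, [
    {"hk": "a1", "cust_id": "C001", "record_source": "SYSTEM-A"},
])

# Second load: SYSTEM-C delivers C001 again plus a new key C002.
second = merge_insert_only(hub, [
    {"hk": "a1", "cust_id": "C001", "record_source": "SYSTEM-C"},
    {"hk": "b2", "cust_id": "C002", "record_source": "SYSTEM-C"},
])

print(first, second, hub["a1"]["record_source"])  # 1 1 SYSTEM-A
```

The existing row for C001 keeps its original record source, so re-running or overlapping loads add only keys the hub has not seen before.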

patrickcuba · 3 July 2024 22:56

Yes, I did it here
