I have two raw source files that contain the relationship between a customer_id and its various IDs from other systems. The first file has our old, internal customer_id along with a third party vendor ID we’ll call ven1_customer_id. Whenever I loaded that file, I loaded both into HUB_CUSTOMER with the BKCC as “VEN1” for the the third party IDs. I then setup a SAL_CUSTOMER_VEN1 table with their respective HKEYs to link the two together. That was all fairly straightforward.
The issue or question I have is with the second file which has, again, our old customer_id and our new, internal customer_id (new_customer_id) as well as another third party vendor we’ll call ven2_customer_id. This file only has a sent_timestamp and a load_timestamp outside of those three ID fields.
My plan is to load the three IDs into HUB_CUSTOMER but I don’t know what the best method is for loading SAL(s). Should I also add a BKCC value for the new customer_id we’re using? I was going to add “VEN2” for the third ID field. Would I create a SAL_CUSTOMER_NEW and a SAL_CUSTOMER_VEN2? Should it all be stacked into one SAL_CUSTOMER_NEW_VEN2 table?
Very short answars.
For the first instance I would not recommend a SAL here, reasons:
- you have no guarantee what the vendor’s business key integrity is, even if it a business key at all and
- you have no control over how they manage those keys. Your customer id should be loaded into the hub but the vendor’s keys should be nothing more than an attribute on a satellite table.
- In the context of a bkcc imagine you would have one for every vendor! How do you ensure those are not duplicated within the vendor itself! To you it should be treated as a Junk Key!
- Imagine further the types of queries you need to write to resolve this too! Join, hub to SAL, SAL to hub whereas you could simply lookup the customer id and refer to satellite table!
For the follow-on question, it appears the load timestamp is not your own - remember the load time stamp is when you load data into your data vault.
Thank you, @patrickcuba. I bought and read your book, by the way. One of the better (and only) resources out there for Data Vault that isn’t really old (in technology time).
Anyway, I realize I need to add some more information to my original question. The first vendor’s ID will most likely work better as you suggested, simply as an attribute in a satellite of HUB_CUSTOMER. The second one, however, has many tables that house all of the purchase transactions and other business info about the customer that I will have to link together for reporting. You are correct in that I have no control over their data and whether it will have dupes or not but I don’t have much of a choice, I’m afraid.
So, next steps:
- I will drop the SAL_CUSTOMER_VEN1 table and just have that vendor ID number stored in a SAT.
- I will load the second, more important vendor ID into the HUB_CUSTOMER table and setup a SAL with the three values (our old internal, our new internal, and the vendor’s ID) that are in the one source file.
Do you concur or have a better suggestion?
My advice there is two-fold
- Make your Data Executive accountable for decisions that might put your butt on the line, i.e. you might be creating a legacy platform (a ball of technical debt)
- You might need a reliable system to process that secondary source data and allocate reliable business keys and some hard business rules applied that are signed off by the business. This secondary system becomes your source to DV
SAL with a vendor id might be really dangerous, for the reasons I mentioned above.
We worked in a debt collection company with managed debt (debt a vendor/retailer could not collect and was handed over to us). They had an account_number/account_id/__accno (you get the idea) but we processed that into our source and allocated a debt_id to it.
To the retailer they still retain their bkeys but to us and the rest of the business we always referred to this debt via the debt_id, debt_id is our reliable business key.
Allocated a debt_id meaning something you generated within the database? A sequence generated key type scenario?
The business treated it as a business key, I don’t remember if it was a sequence key explicitly or if there were some smarts in it (I worked for them 19 years ago!).
The best part of this was that we did not have to manage this at all – the data developers and administrators of that system did and hence managed the business rules to ensure those debt_ids were strong, valid and unique. It also meant we did not introduce this potentially devastating technical debt (if we relied on vendor ids) into DV.