An example, with a follow-up question at the end that relates to this topic of restructuring data.
In my world I receive a Salesforce Account object, which is denormalized to the point of having over 300 columns. Within it the users have captured things like contacts and contract data. Just. Don’t. Ask. sighs sadly For example, the contract data includes a Contract Number and various other fields like date signed and so on. When a new Contract is created, these fields, including the Contract Number, are overwritten.
To add to the complexity, I have a second CRM, MS Dynamics, which provides Account and Contract as separate, normalized sources.
Some of those Accounts across the two CRM systems refer to the same business entity, but they do not share a business key. I am getting users to identify the links between the Accounts because the matches are obvious to them, and because linking creates massive business value for things like cross-sell, single invoicing, etc. Business value is ultimately what matters more than anything else we do. Avoiding a rebuild of your data model by using Data Vault is good because it provides business value through a long-lasting data model, not just because its inherent structure is pleasing to the eye.
In looking at my approach I could:
- create a “Source System Vault” with hubs in the RV that have a similar name, but separated by source as well e.g. h_account_salesforce + h_account_dynamics + h_contract_dynamics,
- logically split the Account in Salesforce into Account and Contract during staging. So I end up with h_account and h_contract, each of which has the same semantic meaning and the same grain, and each of which has sats for the different source systems.
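To make the second option concrete, here is a minimal sketch of the staging-time split. All field names (AccountId, ContractNumber, etc.) are hypothetical, not the real Salesforce schema:

```python
# Sketch: split one denormalized Salesforce Account row into separate
# Account and Contract staging records, each carrying its own business key.
# Field names here are illustrative assumptions, not the actual schema.

def split_salesforce_row(row: dict) -> tuple[dict, dict]:
    """Split a denormalized Account row into account and contract records."""
    account = {
        "account_id": row["AccountId"],        # business key -> h_account
        "account_name": row["AccountName"],
    }
    contract = {
        "contract_number": row["ContractNumber"],  # business key -> h_contract
        "account_id": row["AccountId"],            # relationship back to Account
        "date_signed": row["DateSigned"],
    }
    return account, contract

raw = {
    "AccountId": "001A",
    "AccountName": "Acme Ltd",
    "ContractNumber": "C-42",
    "DateSigned": "2023-01-15",
}
acct, ctr = split_salesforce_row(raw)
print(acct["account_id"], ctr["contract_number"])  # 001A C-42
```

Because the split is a pure reshaping (no values are altered), it stays on the "hard rule" side of the hard/soft business rule line.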
After a chat in the Slack channel I have chosen the second option because:
- an un-restructured “source system vault” for hubs without any other analysis feels like such an anti-pattern
- the Contract Number is clearly a business key in its business intent and use, and should therefore be in a separate hub, which it shares with a different, semantically identical source
- if anybody asked I could easily recreate the source data for any date (or all dates) by joining the two sats for Salesforce back together in my info mart
- as Dan Linstedt states, “hard business rules never change the meaning of incoming data, only the way the data is stored”. This is achieved.
- In the BV I’ll be able to add a same-as link table between the records sitting in h_account once the users provide me with a source for that information.
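The third bullet above, recreating the source shape by joining the two Salesforce sats back together, can be sketched like this. The sat names and columns are illustrative assumptions, and the join is deliberately simplified to one row per key per load date:

```python
# Sketch: recreate the original denormalized Salesforce shape in the info
# mart by joining the Account and Contract sats on the account key for a
# given load date. Table and column names are illustrative only.

s_account_salesforce = [
    {"account_id": "001A", "load_date": "2023-01-15", "account_name": "Acme Ltd"},
]
s_contract_salesforce = [
    {"account_id": "001A", "load_date": "2023-01-15",
     "contract_number": "C-42", "date_signed": "2023-01-15"},
]

def recreate_source(as_of: str) -> list[dict]:
    """Left-join the contract sat onto the account sat for one load date."""
    contracts = {c["account_id"]: c for c in s_contract_salesforce
                 if c["load_date"] == as_of}
    rows = []
    for a in s_account_salesforce:
        if a["load_date"] == as_of:
            # Merge the two sat payloads back into one wide record.
            rows.append({**a, **contracts.get(a["account_id"], {})})
    return rows

recreated = recreate_source("2023-01-15")
```

Each recreated row carries both the account attributes and the contract attributes, which is exactly the wide shape the source delivered.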
So, my question then.
When Will, in his comment above, says

> The guidance is clear, don’t restructure the source
I’d suggest that we need a little more nuance, at least with regard to one scenario. Dan Linstedt includes a specific example of denormalized XML data in section 11.9 and is clear that denormalized sources should be normalized before loading into the staging area, in advance of the data vault transforms. Dan’s example actually decomposes the Passenger Name into a separate hub… which feels like going too far, since a name (first+last) is really not very unique!
So, if we can agree that Data Vault allows for some restructuring into hubs, I’m interested in having a discussion on handling denormalized data and the extent to which it should be restructured. If I may be so bold as to paraphrase Will: the guidance is clear, you should restructure the source in some cases, but what are the caveats on this? Is it as simple as identifying multiple business keys within a single source? Or is it messier than that?