Hi, I’ve been reading about Data Vault and watching some videos/courses. I still think I am not clear on when DV is a good fit and when it’s just a bad idea.
I was hoping the more experienced DV practitioners could help me point out differences between DV and the architecture I am used to so I can better understand the advantages/disadvantages of DV.
What I am used to is a structure like this:
- layer 1: raw data, 1:1 to source, captures all history; managed by something like Fivetran (so it also handles incremental loads, schema changes - adding/“soft” removing columns; SCD2 etc).
- layer 2…X: intermediary layers needed just to cleanse and transform the data
- layer X+1: dimensional data model; accessed by consumers and BI tools.
It sounds to me like if I replaced layer 1 by Raw Vault and layers 2-X by Business Vault I get the DV setup. But why would I do it? What does it bring me?
The DWHs in question can potentially have low petabytes of raw data, and many different data sources. Some sources are synced in batch, some stream the data in.
Articles that I found online often times contrasted DV with Dimensional modeling which to me doesn’t make sense as it is my understanding that you put a dimensional model on top of DV anyway so it’s not like you’re getting rid of it.
Unfortunately the article I’m finalising won’t be published until 27 March, “Rules for an almost unbreakable Data Vault”.
I hear you, why try this DV thing at all??? I have in my profession advised against building a data vault and for it becaue I don’t want to see people building a data vault just because they can say they are building a data vault.
So a few things come to mind when deciding to build a data vault,
- Are you integrating by business key across source systems?
- Are you expecting the business needs to change and affect existing models?
- Are you willing to invest in the training, coaching, time and discipline that goes into building a Data Vault? Are you willing to give the data vault ownership to data analyst/data modeller over a data engineer?
- Do you recognize that with a DV you will get far more joins to consider and you better do your research on query plans and join algorithms!
- Do you have executive sponsorship from your data management teams?
- I’m sure there are more reasons.
Raw Landed data is NOT a Raw Vault, Raw Vault is modelled into hub, link and satellite tables matching what the business considers is the output of the business process/service automation outcomes we capture in the raw vault. The landed area can be truncated at will, but the raw vault is your auditable source that comes with all the metadata you need to support lineage, governance, masking etc etc
Business vault is the captured output, sparsely modelled, extension of raw vault. You should only see business vault link and satellites here (we don’t derive business keys in the data warehouse therefore there are no BV hubs).
RV + BV = DV
Now, about those query plans, PITs, Bridges, Dimensional models and the like are used to pull that data out of vault into a form useful for querying. A DV is non-destructive to change modelling paragigm for your EDW, information marts take all that modelled, profiled DV into your BI needs.
The above brings you:
- Auditability, as mentioned
- Agility, non-destructive to change
- Automation, 3-table types = 3-standard table loaders = 3 repeatable patterns
I did publish this recently, maybe it clarifies things for you for now.