Hi everyone, I'm fairly new to Data Vault. I've done some reading online and tried to learn as much as possible. I was asked by my company to review an existing Data Vault implementation that an ex-colleague developed. A few things have been left in doubt, and I'd appreciate it if you could answer the questions below:
- Can we have views in the staging layer of the Data Vault model? I thought everything in this layer should be a materialised table?
- Can our data vault concepts (hub, link, satellite) be created out of tables that are NOT in the staging layer?
- Can our staging table be the result of join operations?
Hi there! Welcome to Data Vault!
Let’s run through your questions:
- The staging layer in a Data Vault model typically uses materialised tables, since its purpose is to capture the raw data as-is so it can be queried efficiently later. Using a view isn't completely out of the question, though, as simple transformations or derivations may need to happen in a virtual layer over the raw data. However, with performance and stability in mind, materialised tables are preferred.
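To make the trade-off concrete, here is a minimal sketch (Python's sqlite3 standard library, with hypothetical table and column names) of the two staging options: a view that computes light derivations on every read, versus a materialised table that snapshots the same result once at load time:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_customer (customer_id TEXT, name TEXT)")
con.execute("INSERT INTO raw_customer VALUES ('  42 ', 'Ada')")

# Option 1: a staging VIEW -- simple derivations (here, trimming the key)
# are recomputed on every read; no data is copied.
con.execute("""
    CREATE VIEW stg_customer_v AS
    SELECT TRIM(customer_id) AS customer_id, name FROM raw_customer
""")

# Option 2: a materialised staging TABLE -- the same derivation is
# computed once at load time, which queries faster and more predictably.
con.execute("""
    CREATE TABLE stg_customer AS
    SELECT TRIM(customer_id) AS customer_id, name FROM raw_customer
""")
```

Both produce identical rows; the difference is only when the work happens, which is why the view is fine for light derivations but the table wins at volume.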
- Yes, your Data Vault concepts can be created from tables that aren't in the staging layer. As noted above, the staging layer is mainly for raw data ingestion. The data within it can be transformed in a "priming" layer and then loaded into the Data Vault structures that align with business concepts. These can be built from any source table, provided it follows the loading standards.
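As a sketch of what "follows the standards" means in practice (hypothetical names, Python with sqlite3, MD5 as an assumed hash function): any source can feed a hub as long as the load derives the hash key from the normalised business key and records a load date and record source, so the hub stays source-agnostic:

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE hub_customer (
        hk_customer TEXT PRIMARY KEY,  -- hash of the business key
        customer_id TEXT,              -- business key
        load_dts    TEXT,              -- load timestamp
        rec_src     TEXT               -- where this key was first seen
    )
""")

def load_hub(business_keys, rec_src):
    """Standard hub load: insert only business keys not already present."""
    now = datetime.now(timezone.utc).isoformat()
    for bk in business_keys:
        hk = hashlib.md5(bk.strip().upper().encode()).hexdigest()
        con.execute(
            "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)",
            (hk, bk, now, rec_src),
        )

# The same business key arriving from two different (non-staging) sources
# still yields a single hub row.
load_hub(["42"], rec_src="crm.priming_layer")
load_hub(["42"], rec_src="erp.reference_feed")
```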
- This is a more tentative yes. There are scenarios where joining data from multiple sources can be beneficial or even necessary, provided it preserves the auditability of the original data. There is usually a business requirement behind this. I will say, though, that this approach works fine on smaller datasets but may not scale well. More details about the exact use case would be helpful here!
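For instance (hypothetical tables, sketched with Python's sqlite3), a staged join can stay auditable if each contributing source system remains identifiable on the output row, e.g. via one record-source column per side of the join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_orders (order_id TEXT, customer_id TEXT);
    CREATE TABLE raw_customers (customer_id TEXT, region TEXT);
    INSERT INTO raw_orders VALUES ('O-1', '42');
    INSERT INTO raw_customers VALUES ('42', 'EMEA');

    -- Staging join: enrich orders with the customer's region, but keep a
    -- record source per contributing system so lineage stays auditable.
    CREATE TABLE stg_orders AS
    SELECT o.order_id,
           o.customer_id,
           c.region,
           'erp.raw_orders'    AS rec_src_order,
           'crm.raw_customers' AS rec_src_customer
    FROM raw_orders o
    JOIN raw_customers c ON c.customer_id = o.customer_id;
""")
```

Every row in `stg_orders` can still be traced back to both original feeds, which is the auditability requirement; the scaling concern is the join itself, not the lineage columns.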
Hi Joe, thanks for the quick reply. Regarding your response to question two: does that mean there is no hard rule preventing us from building Data Vault structures from tables outside the staging layer? What is the point of having a staging layer if it can just be bypassed when creating a hub/sat/link? I am confused, as the reading material I went through expressly presents the staging layer as the layer prior to the actual Data Vault components. Also, I have seen sats in the Raw Vault created from tables outside the staging layer joined with sats from the Business Vault. Is that also acceptable? Can we mix tables between Raw Vault and Business Vault?
For your answer to question 3, my understanding was that business requirements only come into play later in the architecture? At the staging layer, shouldn't the data just be added to the warehouse as it is, after applying soft business rules? Anyway, I was only given an outdated Excel sheet that lists (not even all of) the satellites/hubs/links in one column and (outdated) sources in another. The person who built this has left the business, so we can't reach out to them for help.
Ooh a lot of questions there, plenty to think about. Very familiar with the outdated information problem! I do think it would be better to explore this in more detail in a call. Will reach out separately!
I'd add the following. In the standard Data Vault architecture, raw data is staged and then loaded to the Raw Data Vault, populating Hubs, Links and Satellites. Views on the stage are fine except at high volumes, where materialised views are more performant. All batch data should be staged.

Real-time data can stream into the Data Vault through micro-batching from the pipe, usually into a non-historised link; it could also stream into a stage table and be loaded into the DV from there. It is possible that when you calculate business rules further Links and Satellites are created, and rarely, but possibly, new Hubs too.

Finally, most staging tables are CDC versions of raw tables, 1:1. However, if you need to apply hard business rules (rules on staging), you may need to manipulate the data as it is staged to prepare it for load to the Data Vault; this manipulation might include table joins. As Joe said above, joins on load can have performance implications unless you index the data correctly (so for cloud databases, make sure the clustering is in your favour to avoid full table scans).
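On the CDC point, here is a minimal sketch (hypothetical names, Python with sqlite3, MD5 as an assumed hash function) of the delta check that turns a 1:1 staged feed into satellite rows: compare a hashdiff of the descriptive attributes against the latest stored row and insert only when something changed:

```python
import hashlib
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE sat_customer (
        hk_customer TEXT,  -- hub hash key
        hashdiff    TEXT,  -- hash of the descriptive attributes
        name        TEXT,
        load_dts    TEXT
    )
""")

def hashdiff(*attrs):
    """Hash over the descriptive attributes, used to detect changes."""
    return hashlib.md5("|".join(attrs).encode()).hexdigest()

def load_sat(hk, name, load_dts):
    """CDC-style satellite load: insert a row only if the attributes changed."""
    hd = hashdiff(name)
    latest = con.execute(
        "SELECT hashdiff FROM sat_customer WHERE hk_customer = ? "
        "ORDER BY load_dts DESC LIMIT 1",
        (hk,),
    ).fetchone()
    if latest is None or latest[0] != hd:
        con.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
                    (hk, hd, name, load_dts))

load_sat("hk42", "Ada", "2024-01-01")     # first load  -> inserted
load_sat("hk42", "Ada", "2024-01-02")     # unchanged   -> skipped
load_sat("hk42", "Ada L.", "2024-01-03")  # changed     -> inserted
```

The dates and keys above are illustrative only; the point is that the hashdiff comparison is what keeps a 1:1 staged feed from bloating the satellite with unchanged rows.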