Need some guidance on modelling a supertype and its subtype in the data vault.
I need to model super types and sub types for the data vault.
I would like to get a perspective on how other forum members look modelling super types and sub types.
How do we go about modelling the Super type and sub type hub and its dattributes?
For example, a manufacturing organization with assets and all assets have an hierarchy.
Asset has certain common attributes which are applicable to all assets.
Then are certain attributes which are specific to certain assets.
How do we model this?
Let’s say the following approach is used
1. Create common Hub Asset
2. Create hierarchical link to form the asset hierarchy
3. Create Sat Asset Common for common attributes
4. Create Satellites Specific for certain asset for its specialized attributes.
One problem I see is that There can be lot of assets in an organization so there could be a case where lot of satellites are attached to the same hub ( Hub Asset)
Any view or comments on this approach?
References viewed so far
1. ‘The data vault guru: a pragmatic guide on building a data vault’ by Patrick Cuba
2. Building a scalable data warehouse with data vault 2.0 by Dan and Michael
3. Elephant in the fridge by John Giles.
Universal Data Vault: Case Study in Combining “Universal” Data Model Patterns with Data Vault Architecture – Part 1 – TDAN.com
very soon I will have to spend time on modeling this as well. I too have a concept of Asset, which subtypes like RealEstate, Commodity, Vessel, etc. After already having talked to the business a bit, the sub type really feels more like descriptive context of an asset and not necessarily a concept of its own. So I think I will end up with one generic Asset hub, one common Asset satellite, and then some satellites split per Asset Type. But that’s only possible because the source file is delivering the data sparsely, in the sense that the field RealEstateCity can only be populated if the Type is RealEstate. If RealEstateCity could be populated for a different type (weird enough as it is), I would have to include it in the common satellite, otherwise I would be losing data.
As I just mentioned in a different topic here, having lots of satellites attached to the same hub shouldn’t be an issue. A Point-In-Time table can take away much of the performance issues (if there are any).
By the way, the Same-As Link you mentioned should probably be a Hierarchical link, as I suspect that the assets in such a relationship are actually different assets with their own business key, that happen to be each others parent/child.
Thank you for your viewpoint.
How do we decide, Which satellite to load the data? Need to have some rules based on source file or the type of asset?
Yes , It would be a hierarchical link. Just updated my post.
Ideally you’d have a data contract with your data source in which is stated which column can or should be populated in which case.
In my case, I’ve done analysis on two year’s worth of daily deliveries in which for example this RealEstateCity really was only populated when the AssetType was RealEstate. For every other AssetType (Vessel, Commodity, etc.) it was always empty. So in that case I load it into my Asset RealEstate satellite that gets sourced from a staging view in which I filter on AssetType = ‘RealEstate’.
If, at some point, RealEstateCity would be populated for other asset types, I will be missing out on it because it won’t ever get loaded.
Interesting to hear other people’s opinions on this though.
- Are we describing a parent-child relationship rather than a super-type-sub-type? Then they should share the hub and have a hierarchical link
- Super-type + sub-type usually means there is a difference in definitions of those business entities, in which case you could be looking at two seperate hub tables, one for each because the grain and semantic meaning of those entities differ
- Is the one “entity” dependent on the other? i.e. the sub-type entity cannot exist without the super-type (ala dependent-child) Then you could be looking at defining the super-type in the hub and the “subtype” as dependent-child key in the satellite table.
I’m always in favour of less tables that support querying the Data Vault, but keeping in mind that we are sticking to standards!
Hi @patrickcuba ,
Thank you for this answer!
What decision to make when we have same BK defining the super type and subtype but they are seperated into different tables via normalization.
For e.g. interaction as hub with sub types as interaction email, interaction phone call, Interaction chatbot etc.
Lets say the main Interaction table will have BK as 123 and 124. Then child tables will also have the same BK -
123 for Interaction_phonecall
124 for Interaction_email.
In this scenario should we collide them into main Interaction table (with BKCC) or have different tables
Are these “business keys” or an instance-id representing an event? These sound like one-time immutable events that never change (immutable) and not business objects
Am also facing the similar kind of an issue in my project.
Request you to let me know in case of you found out the solution for this issue.
The good news is that the Data Vault is flexible and you have many ways to approach this. But - you have to choose.
John Giles discusses such a use case in his book, The Elephant in the Fridge.
What you have are two concepts: a super type and a sub type. The super type makes modelling sense in some business circumstances, the subtype makes sense in other circumstances.
So it depends on if they share the same business key? Same business key implies they are the same HUB. You can hold super or subtype data in separate satellites, assuming your source system refers to both concepts in separate tables.
If the concepts have different keys, you could model as two separate Hubs, with a same as link between them.
In a data vault model you have to balance between an abstract representation (as would happen with an enterprise architect who would use party, party role, etc.), and a more physical representation of concepts (supplier, customer, employee, etc.). In general, you should tend towards the more physical end for model readability and ease of extracting data.