Need help with a data vault model

Hey everyone,
I have been trying for some time now to model our source data into a data vault. Overall, I would say I am reasonably accustomed to the data vault artefacts and have read up on data vault by using the “scalable data warehouse” and this user group blog.
For the project, I also had prerequisites for certain entities we want to model with data vault to satisfy reporting and analytics needs.
Anyway, I have severly struggled with turning the following source data into a data vault:

The business concepts we try to model from that are: Property, Building, Rentalunit, Lease (RentalContract) and Tenant.

For property we have a uuid which actually comes from salesforce and is probably the business key. In principle I would prefer the Property.name because that is how the business talks about their properties, but since different business units have different names for a property that is probably not the best idea.
Everything below this layer, however, is giving me a lot of headache.
As you my seem from the ER diagram, all child tables have a foreign key parent_id for their one-to-many relationship and an id as the primary key.
However, those id’s are not anything the buisness users know about. The only person knowing about them is essentially the source system maintainer.
For building and building_address we actually also have uuids but those are not coming from salesforce but are system generated. So neither the id, nor the uuid can really be used as a business key.
The most reasonable business key for building would probably be the address, because that is something business may actually talk about and have reporting requirements on, but the source does not enforce the address fields, i.e. they may be empty.
In general, splitting building and building_address (essentially the house_entrance if multiple are present) is not how the business needs the data reported.
For rentalunit we have basically the same issue, the unique key there is either the id or the internal_id which is a composite key - both of them are also not used by the business. I have tried to build a composite business key for this one which consits of the other business keys in the hierarchy, i.e. property.bk and building.bk, and some name for the rentalunit.

The tenant data we have is also a crude mixture of things: It mixes individual tenants, company tenants in addition to some employee roles.
For the commercial tenants the business key would most likely be some kind of VAT ID, but this attribute is also not enforced.
For residential tenants one kind consider using their first_name, last_name combination but there are bound to be duplicates, another bk may be their email - but as you may guess be now, neither of those fields is mandatory.

I have essentially several issues:

  • The business keys I identified are sometimes not unique in the source system, i.e. there are duplicates (which should not be there) but the issue is cirumvented from the source side by simply slapping a new ID on any such tuple.
  • In a lot of cases I actually need a composite key so I have to join several elements of the source system which I currently do in “pre-staging” but this is of course something I have to maintain myself
  • In several cases, joining multiple tables is also necessary to filter out entries which are no longer active, since the source also does not provide that service (an issue which arises because of the duplicates which have a new ID as described above).
  • How do you handle something like the rentalunit business key, which is a composite key consisting of property.bk:building.bk:rentalunit.name. Is that a valid thing to do in data vault?
  • What about rental contracts? I would argue that a contract is actually a link between two parties and would probably treat the contractual item (the rentaluint in the case of rental contract) as a dependent child key. On the other hand, there are attributes which actually live on the contract level so it may also be a hub?

Do you have any suggestions for me? Does my data vault model even make sense or is there a better way to do that?

Thanks for your help!

Hi there!

Sounds to me like you’ve got quite the modelling challenge! I think a lot of your issues might need a step back and another look at your model and data.

From the information you’ve given, here’s a few key things:

  • It sounds like you’ve got potentially some poor data quality that’s limiting your business keys options. Taking the address fields for example, with it not being an enforced field, if you were to use it you would need to implement data quality checks to ensure completeness. You can use a uuid, but like you rightly say, it’s not our first choice since it doesn’t have much if any business meaning.

  • You also mention that there are duplicate business keys. The first thing I would check is the grain of this data, what does it actually represent? There may be an additional column with business meaning that actually takes this data to the correct grain and now it’s unique. The other thing to check is whether these duplicate keys actually walk through a history. It’s only if the rows create true duplicates then there is a data quality issue.

  • I would advise against filtering out inactive records if they were once valid in order to maintain history. There are other reasons to filter out data but losing history would lose auditability and that’s not great.

  • Composite keys are allowed, even your rental unit bk situation is a valid approach.

  • Your argument about rental contracts as a link does have merit, I can see where you’re coming from. However I would not advise this. A contract is a legal document, a thing. It’s such an intrinsic business concept that it really should be a Hub. Links are usually reserved for Units of Work. If you model it as a link then you’re cornering yourself and limiting flexibility in the future. There are plenty of business entities that need information from a contract and if it becomes a Link then it can be difficult to extract data at a later date.

As said initially, I think you need to revisit your modelling and data profiling to hopefully alleviate your issues. Hope this helps though!

1 Like