Hubs using multiple columns as natural identifier - general names or multiple hubs ?

Hi everybody!

I need some advice regarding best practices in hub column naming convention - especially for multiple columns representing a business key.

My project situation:

I have source tables providing business keys for a business object “Contract”.
The some tables provide the business key as a single column (e.g. contract_uuid) as well as in two columns (e.g. company + contract_number).
contract_uuid is a technical id, but very important because all others data is connected to it.
company + contract_number is the natural business identifier mainly used by business.

My Source:


contract_uuid (contract_bk_1)
company (contract_bk_2.1)
contract_id (contract_bk_2.2)

Therefore I modeled two hubs. One for single column contract business keys, one for two columns contract business keys.
Also I modeled a same-as link to connect both hubs.
I have choosen generic names because it is very likely that there will be more business keys (single columns as well as multiple columns) coming from different source systems or tables. I want to store them in the same hubs to create a lightweighted model

My DV Model:


hk_contract (hashkey: hash(contract_uuid))
id_contract (business key : contract_uuid from SRC_TABLE_1)


hk_contract_2 (hashkey: hash(company+contract_id))
id_contract_1 (business key : company from SRC_TABLE_1 )
id_contract_2 (business key : contract_id from SRC_TABLE_1)



The challenges for me and the consumers are now:
Because of my generic names the model is hard to read:

  • What is HUB_CONTRACT_2 ?
  • What values are in id_contract_1 or id_contract_2 ?

It even gets more complex, if my very likely second or third source system comes into play providing more single and multiple column keys.
For example a natural identifier will be column IKER + OPSE. I plan to store them as well into HUB_CONTRACT_2.

  • How does a consumer now only select records from HUB_CONTRACT_2, which id_contract_1 == company ?

The only way I see to solve this is to create multiple hubs.
In my case HUB_CONTRACT_2 would be renamed on entity level as well as its columns would be renamed:


hk_company_contract (hashkey: hash(company+contract_id))
id_company (business key : company from SRC_TABLE_1 )
id_contract (business key : contract_id from SRC_TABLE_1)

… as well as the link …


But this solution has the big drawback that my model increases by unnecessary links and hubs … more joining at the end.

What is your design pattern or best solution for this situation ?
Are you using general names for columns and use one hub for each grain of natural keys ? Are you opening new Hubs with specific names ?

Many thanks for any advice !

Is company + contract id unique or just contract_id? Where does contract_id come from?

It looks like you have established company + contract id as the contract’s business key.

One feed has the company column missing. Can you infer the company value and substitute it? Or you could use a token such as ‘not provided’.

An alternate business key could be:
company + party + date of contract + contract type

Company - is one party to the contract; Party is the second party; contract type is added if you have different contract types (NDA, MSA, etc.), alternatively you could use contract title.

This assumes two parties do not sign more than one contract of a particular type on a single day. If they do, then other matters would come into the key. Or you have a version number for updated contracts.

company + contract_id is unique.

Company is a code like ‘ZA’. Contract_id is an increasing number including a smart date key like ‘2022123’ [yyyy+SEQ]. Both columns are provided by source system.

Thanks for your answers !

But how do you deal in general with the problem:

We have multiple sources providing the natural business keys by two columns identifying contracts. Only the combination of these two columns makes the contract unique.
Please assume or take for granted that my identified columns are the correct columns identifying the business keys. Also the Business is using these columns to identify the contracts.

src1: company + contract_id → ‘ZA’+‘2022123’
src2: IKER + OPSE → ‘22’ + ‘ER07’

IKER is definitly not a comnpany and OPSE definitly not an increasing number.

Do you create a Hub with general column names or would it better to create an extra hub with explicit names ?
Are both options valid ?

Hello mat!
Welcome to the forum. Please be aware that my answer is my own opinion and it may differ from what most people think is good practise for a data vault.

I have done a lot of DV implementations and the biggest mistakes and most headache I got was when I implemented multiple bk columns in a hub. Try to avoid it.
In your case I would ask myself:
Is there descriptive data only for company?
Is there descriptive data only for contract_id?
Is there descriptive data only for IKER?
Is there descriptive data only for OPSE?
If you answer yes to any of the above questions you need a hub to only hold one of the BKs.
And if you have a hub with a list of keys, you would not like the same list in another hub concatinated with some other bk.
If you discover that some of the keys should be hubs and some of them should not(holding no descriptive values) then I would use a peg link.
Also, I opened my mind a little when I did my certification, I like you, had some hard questions for Michael Olschimke and one of his answers that I remember is: “physical implementation differs from logical modelling, use the capabilities of your RDBMS”. That can of Course be interpreted in a sorts of ways but it got me thinking and asking myself these questions:
Can a hub be implemented in more than one table? Can a link only contain peged values? Can a link be implemented in more than one table? What happens with performance If you are forced to use a BKCC?
Hope this helps on your DV Journey.

I’d discourage using dates as business keys! Otherwise it looks like you have sorted this out