HUB Creation from Multiple Source system

we are starting to Implement a DV 2.0 in our company, Could anyone please help me to understand designing the HUB tables whether I need single hub or multiple hub.

we have policy data from 2 different source systems, From one system the Policy Business keys are on two fields whereas another system Business key is on one field.
Should we create one Hub or 2 Hubs?

Hello preddy!
I would need some more information about the keys to give you a better answer. But from what you describe, at least 2 hubs.
Does the business key from system1 contain the same values as one of the business keys in system2? Can any of the keys (by itself) in system2 be a primary key for any descriptive data?
Maybe you can provide an example with content? That would help.

Thank you for Reply.
Yes, One of the Business key from system 1 contains same values as business keys in system2.

Here is an example how the data looks like
Capture

Both the systems will have Policynumber but System1 will have additional business key “Companycode”. Also, In System1 the same Policynumber can be repeated but with different company code treated as a unique PolicyNumber

It appears you have a thinking around table mapping instead of business object domain. It is a common issue that comes up in DV, if you have a business object represented by two columns and I change the value of one then is it then a different business object?

Could one of those keys without the other represent nothing? Is it a dependent child key rather than a part of a composite key?

If they are to be separate you have not identified the correct business key (they’re source ids) or you have defined the semantic definitions of a policy correctly

Company-Code looks like a dep-key

Hello preddy!
My suggestion (I also did the satellites and links for context):

For system2:
Create one hub called H_Policy.
Create two satellites, one effective satellite connected to H_Policy, one delta satellite connected to H_Policy.
For system1:
create one hub called H_Company.
create one link called L_CompanyPolicies that contains references for H_Company and H_Policy.
create one effective satellite connected to L_CompanyPolicies.
There are a couple of options for the delta satellites. If policies in system2 are identified by both the policynumber and the companycode, meaning that you have policies for each company rather than polices that are applicable to companies, than create a delta satellite named S_CompanyPolicies to the link. If the policies are just applicable to a company then you can instead create a delta satellite connected to H_Policy. Remember that if this is the case then values will be repeated in your dataset. And this is something you have to handle. This is often the case when the extraction process from a source system has performed a join between policy and company. Also you have to investigate if there is any data from system1 that can be only connected to the company in case the source has performed the join i mentioned.
Maybe this raises more questions than is answers!
Good luck! :slight_smile:

Hello,

You need to confirm if CompnayCode is a reference column or if it is a business key. The use of the word code in the name leads me to believe it’s the former.

If so then don’t bother with creating a link to a hub_company because hub_company is not a business object. Codes tend to be dependent-child keys if they are qualifying a business keys, the hub_policy will contain the policynumber and the satellite will contain the dependent-child key.

Of course, don’t take my word for it. You need to mob model!

The companycode is a part of business key, It doesnt mean anything without the policynumber, Both the fields in system1 make unique policynumber. System1 is old legacy system whereas as system2 is new application system ,since the company code doesn’t mean anything to user ,so it was removed from System2.

In this case ,should I make 2 Policy Hubs for each system? Or it would best having 1 Hub and default company code in system2 as -1

single hub single sat— sat has the company code

Keep your DV model simple, easy to query

check if you need to consider a BKCC

I am a fan of avoiding way too many joins and complexity that ultimately forces the users of the model to expensive queries every time.

The proliferation of hub tables is something Dan likes to call “hub-itus”

Having all policies in one hub also may simplify the build of QA tables like PITs

1 Like

Hello Patrick!
Single hub, single sat you say and the company code is loaded into the sat.
I am very curious about this approach. Preddy has earlier told us that both the policynumber and the companycode is a unique key from system1. Do you suggest 1 hub or 2 hubs?

I think I have already answered your query

Hello again!
You probably had, but it was not clear to me. That is why I am asking. Do you suggest 1 or 2 hubs?

@preddy - I’m curious.

  1. Can CompanyCode exist on its own? Or, can a company code only exist when a policy number exists with it?
  2. Is it at all possible that system 1 has simply modeled a 1:N relationship between company and policy without the use of a surrogate key?

This source data ignites my curiosity; thus, the questions.

With regard to how to model the hub and without fully understanding the source systems, I would consider the following design as an option, where “-2” is an optional null key:

HUB_POLICY_HK LDTS RSRC POLICY_ID COMPANY_CODE BKCC
AA112 1/1/2022 SRC1 T1 100 SYS1
BB223 1/1/2022 SRC1 T1 200 SYS1
CC334 1/1/2022 SRC1 T2 100 SYS1
DD445 1/1/2022 SRC2 T1 -2 SYS2
EE556 1/1/2022 SRC2 T2 -2 SYS2

It’s entirely possible that my suggestion is wrong for your situation, and that’s ok. In fact, I believe my use of the optional null key in a Hub is incorrect, given that I’ve only ever seen null keys used in links. I will have to consult with DVA on this one (I have a nagging suspicion, and I need clarity).

Now, take a look at a direct quote from Building a Scalable Data Warehouse with Data Vault 2.0 [Section 4.3.2.2, pg. 90]:

It is also possible to create multiple hubs for each individual part of the composite business key. For example, there could be one hub for vehicle identification numbers, another for WMI codes (the manufacturer identifier within the VIN), another for the vehicle descriptor section (VDS), and yet another hub for the vehicle identifier section (VIS). But because the VIN is used as a whole by the business, there also has to be a hub that contains the VIN number as a whole.

Something to consider…

Careful there, quoting Dan’s book to justify hub-itus may be problematic.

It’s the same question I asked earlier, companycode doesn’t appear to be a business object by the description above. It appears as a dependent-child key, probably on a sat of that hub.

Also, BKCC should not be used with a default value value like src1, src2. BKCC should only be used if there is a chance of business key collision, else you risk creating a source-system vault, anti-pattern.

Fair enough. I re-read the post, and you are indeed correct. Don’t know how I missed it the first time.

Agreed; thus, “something to consider.” It’s not necessarily the right approach, but it’s also not out of the question if it fits the situation. Given that the key is composite and part of a legacy system, this situation doesn’t necessarily justify multiple hubs.

Upon rereading the post, I also noticed that @preddy said the business removed the company code from System 2 because

If that is the case, then it would seem to me (as you, @patrickcuba , suggested) that moving the company code to the system specific satellite makes sense. You would still be able to “reverse engineer” the data to its original state, and it would simplify the hub creation. Whether or not a BKCC is relevant in this situation is also debatable.

Yes,

I think the general issue with building a DV is often people try to “map the source” but in reality they should be looking at what the top-down view of the Information Map to DV should be. “Elephant in the Fridge”. Passive Integration

Reducing the number of tables should also be a goal, try to keep the model slim and reduce the need to join a plethora of of tables to get to a simple IM model.

Such a case could be evidenced when you see a team build a DV (which splits content into hubs, links and sats) and then further split that content into facts and dimensions. These are opposing goals. Why decompose just to further decompose again? My recent article on Building Bridges I show how to use a Bridge table to use DV constructs to simulate Facts and Dims, reducing complexity and reducing latency to model insights. But this requires that you recognise Link-Satellites that I know the Ensemble Modelling crowd opposes. It also opposes the institution of building DV objects that are not true to the original purpose of data vault; objects like keyed-instance hubs are not true Data Vault concepts and you will not see this documented anywhere when building a DV 2.0 model either. Another Ensemble concept is to break every relationship into two-participant link tables and have the IM join them back together again… why? You’ve broken the Unit of Work when you do this.

So when looking at the requirement at the beginning of this thread I questioned if CompanyCode is really a part of the composite key at all or if it really belongs in the model as a dependent-child key instead. The latter greatly reduces the complexity of the model and ensures that the data must flow.

If we take preddy at his word then T1~100 (John Smith) and T1~200 (Jand Doe) are totally different policies( as when there is a merger and company 100 and 200 are now under one company). So the policies in both system 1 (Policynumber~CompanyCode) and system 2 (PolicyNumber) have the same semantic meaning but system1 is physically to columns.

I’ve seen this type of question posed on the now defunct DVA forum and I think Dan would say that it’s 2 hubs because the BKs have a different number of columns but I really didn’t understand why if it’s truly the same semantic meaning.

1 Like