Should Hub definition follow source system structure or business entities?

Hello everyone,

I am a newbie to Data Vault modeling and I am having some doubts regarding the architecture I have built so far.

One thing that is not clear to me is whether the definition of DV entities should follow the source system structure or be guided by the business users.

For example, consider a source system with a table of client information called client_info (name, surname, birth date) and another table of contact details called contacts_info (phone numbers, home addresses). The domain office requests that data from those two tables be joined into a single entity called “clients” in the information delivery layer.

In this case, should we create just one hub for clients (hub_clients) with satellites whose attributes come from the two source tables, or should we create two different hubs, one for client_info and one for contacts_info, each with its own satellites, and join their data in the Business Vault?

So far I have followed the domain office's indications, creating hubs for the entities they expect and joining source data to populate them. This approach becomes a little messy when, instead of just two source tables, one entity needs data from 10-15 source tables. On the other hand, creating 10 hubs for each entity required by the business users would make the DV grow very fast (and increase costs?).

The online sources I found do not clarify this point well.

How would you handle this case?

Hi Alberto

Always follow business entities, not source system structure!

A DV warehouse is a master at uniting many sources of data into a common shared structure. In your case, what the source systems share is that they all relate to your business, so you should build your hub entities around your business, not your source system.

Doing it the other way around will work on a technical level but will be inflexible in many ways: to business change, to changes in a source system, to the introduction of new source systems, etc. Essentially it ruins one of the key motivations behind creating a vault in the first place. If you’re looking to recreate your source system then a DV2.0 solution might not be for you.

As for your example, unfortunately this is the kind of question we always get and can’t answer for you. The best way of figuring out the right architecture is to ask the business how they think of these things. Do your business users consider client_info and contacts_info to be completely separate but related, or are they two sides of the same coin? That should give you a better idea of how your architecture should come together.

A reminder that this is never a one-step process. Get an understanding of the business and create a draft. These models often require a lot of collaboration and even more iterations before everyone agrees the business has been fully captured.

All the best and keep asking questions!

Frankie

To double down here, I’d strongly agree with what Frankie said and maybe add one practical refinement that usually helps when this starts to feel messy.

The key is to start from the business concept, not the source tables, but also don’t confuse a “business concept” with “what the report(s) want”.

In your example, the real question isn’t two source tables vs one, it’s:

Is there one real business thing here, or two distinct things that just happen to be related?

If the business genuinely talks about a Client as a single concept, and phone numbers / addresses are just properties of that client, then a very normal pattern is:

  • one Client hub

  • multiple satellites sourced from client_info, contacts_info, etc.

Even if that ends up being 10–15 satellites, that’s not a problem. That’s exactly what satellites are for.
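To make that pattern concrete, here is a minimal sketch in plain Python (no database). It assumes both source tables carry the same client identifier as business key, and the names hash_key, ClientVault, sat_client_info and sat_client_contacts, as well as the choice of MD5 and the sample values, are illustrative assumptions rather than anything prescribed in this thread.

```python
import hashlib
from dataclasses import dataclass, field


def hash_key(business_key: str) -> str:
    """Deterministic hub key from a normalized business key (MD5 is an assumption)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


@dataclass
class ClientVault:
    # One hub row per client business key, whatever source table it came from.
    hub_client: dict = field(default_factory=dict)
    # One satellite per source table, both hanging off the same hub.
    sat_client_info: list = field(default_factory=list)
    sat_client_contacts: list = field(default_factory=list)

    def _hub_key(self, client_id: str) -> str:
        hk = hash_key(client_id)
        # setdefault keeps the hub insert-only: an existing key is never overwritten.
        self.hub_client.setdefault(hk, {"client_id": client_id.strip().upper()})
        return hk

    def load_client_info(self, client_id: str, load_ts: str, **attrs) -> None:
        self.sat_client_info.append(
            {"client_hk": self._hub_key(client_id), "load_ts": load_ts,
             "record_source": "client_info", **attrs})

    def load_contacts_info(self, client_id: str, load_ts: str, **attrs) -> None:
        self.sat_client_contacts.append(
            {"client_hk": self._hub_key(client_id), "load_ts": load_ts,
             "record_source": "contacts_info", **attrs})


# Hypothetical sample rows: two source tables, one client.
vault = ClientVault()
vault.load_client_info("c-001", "2024-01-01",
                       name="Ada", surname="Lovelace", birth_date="1815-12-10")
vault.load_contacts_info("C-001 ", "2024-01-02",
                         phone="+44 20 0000 0000", home_address="12 Example Street")
```

Both loads resolve to the same hub row, so adding a third or a fifteenth source table means adding another satellite, not another hub.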

Where people get into trouble is when they accidentally model reporting needs as hubs. A request like “we want a Clients view with name, address, phone…” does not automatically mean “physically join those tables in the Raw Vault”.

The Raw Vault models business identity/concepts and their relationships. The delivery layer handles convenience joins.
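As a rough illustration of that split, the sketch below uses Python's built-in sqlite3 with an in-memory database. The table names, column names, and the "latest row per satellite" rule in the clients view are my assumptions for the example, not a prescribed layout. The hub and satellites stay separate, and the join the business asked for lives only in a view.

```python
import sqlite3

SCHEMA = """
CREATE TABLE hub_client (client_hk TEXT PRIMARY KEY, client_id TEXT NOT NULL);
CREATE TABLE sat_client_info (
    client_hk TEXT, load_ts TEXT, name TEXT, surname TEXT, birth_date TEXT,
    PRIMARY KEY (client_hk, load_ts));
CREATE TABLE sat_client_contacts (
    client_hk TEXT, load_ts TEXT, phone TEXT, home_address TEXT,
    PRIMARY KEY (client_hk, load_ts));

-- Delivery layer: the convenience join, using the latest row of each satellite.
CREATE VIEW clients AS
SELECT h.client_id, i.name, i.surname, i.birth_date, c.phone, c.home_address
FROM hub_client h
LEFT JOIN sat_client_info i
       ON i.client_hk = h.client_hk
      AND i.load_ts = (SELECT MAX(s.load_ts) FROM sat_client_info s
                       WHERE s.client_hk = h.client_hk)
LEFT JOIN sat_client_contacts c
       ON c.client_hk = h.client_hk
      AND c.load_ts = (SELECT MAX(s.load_ts) FROM sat_client_contacts s
                       WHERE s.client_hk = h.client_hk);
"""


def build_demo() -> sqlite3.Connection:
    """Create the tables and load a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO hub_client VALUES ('hk1', 'C-001')")
    conn.executemany(
        "INSERT INTO sat_client_info VALUES (?, ?, ?, ?, ?)",
        [("hk1", "2024-01-01", "Ada", "Byron", "1815-12-10"),
         ("hk1", "2024-06-01", "Ada", "Lovelace", "1815-12-10")])
    conn.execute(
        "INSERT INTO sat_client_contacts VALUES "
        "('hk1', '2024-01-02', '+44 20 0000 0000', '12 Example Street')")
    return conn


conn = build_demo()
rows = conn.execute("SELECT * FROM clients").fetchall()
```

The satellites keep every version of the data, while the view decides what "current client" means. If the business later wants a different shape, only the view changes.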

A simple rule of thumb I often use:

  • Would the business ever talk about this thing independently?

  • Could it exist on its own, or only in the context of another thing?

If contact details only make sense because a client exists, they’re usually satellite attributes. If contacts are first-class, reused, contextual, or independently managed, then a separate hub + link might make sense.
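For the second case, a small Python sketch of what "separate hub + link" could look like. It assumes contacts have their own business key (contact_id here is hypothetical, as are hash_key, relate, the "||" delimiter and the sample IDs), which is precisely the thing you would need to confirm with the business first.

```python
import hashlib


def hash_key(*business_keys: str) -> str:
    """Hash one or more normalized business keys (MD5 and '||' are assumptions)."""
    joined = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


hub_client: dict = {}           # client_hk -> client business key
hub_contact: dict = {}          # contact_hk -> contact business key
link_client_contact: dict = {}  # link_hk -> the pair of hub keys it relates


def relate(client_id: str, contact_id: str) -> str:
    """Register both hubs (if new) and the relationship between them."""
    client_hk = hash_key(client_id)
    contact_hk = hash_key(contact_id)
    hub_client.setdefault(client_hk, client_id)
    hub_contact.setdefault(contact_hk, contact_id)
    link_hk = hash_key(client_id, contact_id)
    link_client_contact.setdefault(
        link_hk, {"client_hk": client_hk, "contact_hk": contact_hk})
    return link_hk


# Two clients sharing one independently managed contact record.
relate("C-001", "ADDR-42")
relate("C-002", "ADDR-42")
relate("C-001", "ADDR-42")  # reloading the same relationship adds nothing
```

Here the shared contact exists once and is reused across clients. If your contacts never behave like that, the extra hub and link are just overhead and a satellite on the client hub is the simpler fit.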

That decision comes from business semantics, not from how many source tables are involved.

On the concern about DV “growing very fast”: hubs and satellites are cheap. What’s annoying is reworking the vault later because the business meaning was wrong. A data vault that grows because it reflects the business is usually healthier than one that stays small by mirroring sources.

And it’s never a one-and-done decision: you model, validate with the business, and iterate.

Thanks to both @Frankie and @joe.barter, you were very clear.

I started doubting my work because, for different entities, I had to take data from the same tables, and I thought that maybe those common source tables should be the hubs instead of the entities that take data from them.

I am glad I do not have to restructure my whole data vault, and now I have a clear view of how to proceed.
