Why are hubs, links and satellites separate tables?

Hi there,

I have reached out to the DV community on many occasions trying to get an answer to this question. I get that hubs, links and satellites are meant to keep business keys, relationships and contextual information separate, but nobody can tell me WHY.

What benefit is there to structuring the data this way?

Hey Hook,

Hope you're well. I might have a bit of an alternative take on this, but it might give you some insight as to why I like this structure.

One of the main tenets of DV2 is high scalability through an insert-only approach. This is cleanly implemented through hubs, links and satellites but, as you mention, it's not the ONLY way to have an insert-only structure.

What I like about the three main tables is that they allow us not just to think of records as insert-only but to expand this idea into an insert-only architecture, where we only need to add additional tables to adjust to the ebb and flow of how a business changes (see the sketch after the list below):

  • New source of data? New satellite.
  • New concept? New hub.
  • New relationship? New link.
  • New calculated fields? New business vault objects.
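
To make that concrete, here is a minimal sketch of what such an extension looks like (table and column names are hypothetical, and the DDL is simplified rather than textbook DV2). When a second system starts describing customers, nothing that already exists changes; we just add a table:

```sql
-- Existing structures stay untouched.
CREATE TABLE hub_customer (
    customer_hk   CHAR(32)    NOT NULL,   -- hash of the business key
    customer_id   VARCHAR(50) NOT NULL,   -- the business key itself
    load_dts      TIMESTAMP   NOT NULL,
    record_source VARCHAR(50) NOT NULL,
    PRIMARY KEY (customer_hk)
);

CREATE TABLE sat_customer_crm (
    customer_hk   CHAR(32)    NOT NULL,   -- points back to hub_customer
    load_dts      TIMESTAMP   NOT NULL,
    hash_diff     CHAR(32)    NOT NULL,   -- for change detection
    customer_name VARCHAR(100),
    PRIMARY KEY (customer_hk, load_dts)
);

-- New source of customer data? A new satellite; no refactoring of the above.
CREATE TABLE sat_customer_billing (
    customer_hk   CHAR(32)    NOT NULL,
    load_dts      TIMESTAMP   NOT NULL,
    hash_diff     CHAR(32)    NOT NULL,
    credit_limit  DECIMAL(18,2),
    billing_cycle VARCHAR(20),
    PRIMARY KEY (customer_hk, load_dts)
);
```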

Obviously there are pros and cons to any methodology, and a DV2 architecture should always be pruned for tech debt (as any project would be), but these three table types let us significantly reduce the need to tinker with old code or anything else that puts the stability of the end users' system at risk.

There are definitely other ways of looking at why we need hubs, links and sats, but this is how I most often enjoy thinking about DV2.


Frankie,

Thank you for at least attempting to answer the question, and believe me, you are the first person that has responded with anything other than "it's to separate business key, relationship and context".

What I'm reading from your answer is that this architecture offers some kind of "conceptual clarity", which helps when we expand the model. It's a great idea, but things are never so simple in practice.

The problem I have with this approach is how to handle relationships and, therefore, how to create links. Assuming most Data Vault models are built incrementally (as recommended), there is no way to know that the links we build today will fit new data we bring in tomorrow. We often find that the new data will suggest that the links we built before are now incorrect.

As an add-only paradigm, the Data Vault way is to create new links and refactor pipelines to target the new structures while leaving the old stuff alone. After a few iterations, it isn't hard to see that the model will begin to get out of control and that any engineering effort will spend a disproportionate amount of time on remedying technical debt rather than bringing in new data.

It isn't sustainable, and it is one of the main reasons I see Data Vault projects failing. So what is the alternative? What if we didn't separate the data into hubs, links and satellites? What if we collapsed the business keys down into the satellite?

Consider a simple Data Vault scenario. Say I want to ingest a source table, and it contains three legitimate business keys (Customer, Order and Product). Firstly, I need three hubs, one for each business concept. Then I need a link associating the three hubs, or do I? Maybe it's two links: Customer<->Order and Order<->Product? Or maybe there is a third, Customer<->Product? Then the satellite, which I need to hang off a link (but which one?). I've only brought in one source table, and I'm not entirely sure I got the model right. Imagine scaling that up to hundreds or thousands of tables!
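
For anyone following along, here is roughly what that decision looks like in DDL (names are hypothetical and the structures simplified). The point is that every option below is a defensible model, and the source table alone doesn't tell you which one is right:

```sql
-- Three hubs, one per business concept.
CREATE TABLE hub_order (
    order_hk      CHAR(32)    NOT NULL,
    order_id      VARCHAR(50) NOT NULL,
    load_dts      TIMESTAMP   NOT NULL,
    record_source VARCHAR(50) NOT NULL,
    PRIMARY KEY (order_hk)
);
-- ...hub_customer and hub_product look much the same.

-- Option A: one ternary link across all three hubs.
CREATE TABLE link_customer_order_product (
    link_hk       CHAR(32)    NOT NULL,
    customer_hk   CHAR(32)    NOT NULL,
    order_hk      CHAR(32)    NOT NULL,
    product_hk    CHAR(32)    NOT NULL,
    load_dts      TIMESTAMP   NOT NULL,
    record_source VARCHAR(50) NOT NULL,
    PRIMARY KEY (link_hk)
);

-- Option B: binary links instead. Which pairs? And the satellite carrying the
-- order-line attributes has to hang off whichever link you pick.
-- CREATE TABLE link_customer_order (customer_hk CHAR(32), order_hk CHAR(32), ...);
-- CREATE TABLE link_order_product  (order_hk CHAR(32), product_hk CHAR(32), ...);
```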

If I were to collapse the business keys into the satellite, then I wouldn't need to worry about any of that. I would have a table containing three business keys, and I wouldn't have wasted a scrap of energy thinking about what relationships to model. Does that model contain everything the equivalent DV model does? What did we lose? When I bring a new source table into the "model" which might suggest a different relationship between the business concepts, I don't need to go back and refactor the old table.
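
As a sketch of what I mean (again, names are hypothetical and this is the idea rather than any published spec), the whole source table becomes one structure, and the relationships are simply whatever tuples of keys turn up in it:

```sql
-- The collapsed alternative: business keys live in the satellite itself,
-- so the relationship is recorded without ever being modelled.
CREATE TABLE sat_sales_order_line (
    customer_id VARCHAR(50) NOT NULL,   -- business keys, inline
    order_id    VARCHAR(50) NOT NULL,
    product_id  VARCHAR(50) NOT NULL,
    load_dts    TIMESTAMP   NOT NULL,
    quantity    INTEGER,
    unit_price  DECIMAL(18,2),
    PRIMARY KEY (customer_id, order_id, product_id, load_dts)
);
```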

So I am still waiting for that 'ah-ha' moment, but I don't think it is coming. I believe it's done this way because it's always been done this way. Data Vault was created in a time before we had massively parallel distributed database systems. Back then, splitting the data in this way might have had some performance benefit, but those days are long gone. Splitting data into hubs, links and satellites is what makes a Data Vault a Data Vault, but I see no practical or performance benefits to this approach any more.

Hi Hook,
You already have a very established view of DV and why you don't agree with the way it works, so I believe my comment won't change that, but I will give it a try.
When you have only the business keys that uniquely identify your business entities in one structure, you can find the information you need faster, using only those columns to slice and dice your data.
The hubs act as an index for your data, allowing you to write to the database and read from it fast.
The links capture the relationship as you already mentioned in both your question and your comment to the previous answer.
As long as the relationship is valid for the business then the link is valid.
In other words, a customer placing an order, or an order containing products, defines the relationship.
You want to keep these relationships described at the smallest grain possible within the unit of work for that particular set of business entities.
If you can capture a unique relationship with only two business keys, then that is the way to go.
As long as a customer places orders, that relationship holds true, and so does the link in your DV.
Again, holding only business keys, you can quickly retrieve data from or write it to the database.
Finally, when you connect the satellite or group of satellites, the question you need to answer is: what are these attributes describing? In our example, that could be the customer, the order or the specific products in the order.
Of course, you can do as you say and put the keys together with the data, following primary key and foreign key principles, and then create a star schema around it, but that reduces the flexibility to write to and read from the database.
Data Vault separates the writing step from the reading step, allowing for parallel writing; on the other hand, by separating the data into these three basic structures, it is easier to integrate multiple systems into the same data platform.
For the reporting side of things, you can read the data from the Vault and give it the shape you need.
This basic constellation built with the hub, satellite and link allows for scaling the warehouse fast, in small increments that can fit into a couple of sprints, delivering value to the business faster.
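To illustrate the reading side, here is a sketch (reusing the hypothetical table names from earlier in this thread, and simplifying how the latest satellite rows are picked); a report shape is just a walk from hub to link to satellites:

```sql
-- Shape the vault for reporting: hub -> link -> satellite.
SELECT
    hc.customer_id,
    ho.order_id,
    sc.customer_name
FROM link_customer_order lco
JOIN hub_customer hc ON hc.customer_hk = lco.customer_hk
JOIN hub_order    ho ON ho.order_hk    = lco.order_hk
JOIN sat_customer_crm sc ON sc.customer_hk = hc.customer_hk
-- Latest satellite row only; real vaults often use PIT tables for this.
WHERE sc.load_dts = (
    SELECT MAX(s2.load_dts)
    FROM sat_customer_crm s2
    WHERE s2.customer_hk = hc.customer_hk
);
```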
I hope this provides more elements for the discussion on WHY DV models the data in this specific way.

Hi Ricardo,

I guess I do have an established view which has taken me many, many years to arrive at.

I've read Dan's books and Hans Hultgren's books. I'm a certified Data Vault 2.0 Practitioner (CDVP2) and a Certified Business Concepts Maven (CBCM). Lots and lots of theory, and it all makes sense... until you try to put it all into practice.

Over the past few years, I've been directly involved with, or seen first-hand, Data Vaults in multiple organisations and, with perhaps one exception, they have all failed. There must be a reason for this. So I ask questions, and I challenge assumptions.

The main issue I see is the modelling of links. It's challenging to get them right, and more often than not, you don't. What would Data Vault look like if you got rid of links? Rather than placing the Hub references in a Link table, push them down into the satellite.

We lose the formal definition of a "Unit of Work", a concept that has always baffled me. However, the satellite still defines the relationship as a tuple of business keys. There is now no distinction between hub and link satellites.

We can still maintain Hub tables. But we can also challenge that assumption. What do we lose by getting rid of Hub tables? OK, I get that the Hub table gives us a superset of all business keys across all data assets, but how useful is that? How often do we actually need to join to a Hub table?

What are the benefits of this approach? In a word, simplification. "Modelling" is now redundant. All you have to do is identify business keys, and the relationships look after themselves. Load patterns are almost trivial: one table in and one table out, no modification to the data, and you can run as many in parallel as you like (no dependencies). The pattern is so simple that you could implement it using views.
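
A minimal sketch of that load pattern (my naming here is illustrative; assume business keys are given a concept prefix so keys from different systems can't collide):

```sql
-- One source table in, one view out; no joins, no load dependencies.
CREATE VIEW dwh_sales_order_line AS
SELECT
    'customer|' || src.customer_id AS hk_customer,  -- qualified business keys
    'order|'    || src.order_id    AS hk_order,
    'product|'  || src.product_id  AS hk_product,
    src.*                                           -- source payload, untouched
FROM stg_sales_order_line AS src;
```

Every source table gets the same treatment, so the loads are independent and can all run in parallel.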

Does it work? Absolutely. The approach is called 'Hook', and over the past couple of years, I've implemented it in two organisations. Now, there is interest from companies further afield.

I would challenge anybody considering a Data Vault implementation to look at Hook. But don't take my word for it; put them side-by-side and see what works best. I think you'd be surprised...

So I'm confused, is this an advertisement?
By your description, it sounds like you're trying to re-invent Inmon-style relational databases and slap your name on the cover, but I'm happy to read up on new methodologies, so I'll take a look. I think the real answer here is that by merging these tables you lose one of the many nice bonuses of the equivalent DV2 model, so I'd be intrigued how you get around these challenges.

Still a bit confused about what you were aiming for with this question, now that I know you've written a full book on this very topic?

Hey Frankie,

Have you ever been in a situation where something seems so obvious that you believe you must be missing something, but you just can't see it?

That has been my state of mind for the past three years after I conceived the Hook approach. I must be missing something, but what is it? That's why I ask these questions.

I know from recent experience that Hook is a much simpler, more efficient and agile method for data warehousing; I don't need to, and I'm not trying to, convince you of this. What I am asking is: what am I missing? What magic benefits does Data Vault give me that Hook doesn't? What critically important features am I missing that make Data Vault the obvious choice?

You said:

"... by merging these tables, you lose one of the many nice bonuses of the equivalent DV2 model, so I'd be intrigued how you get around these challenges".

What bonuses and challenges are you referring to? Maybe there is something in there that I've missed...?

Different data modelling techniques suit different architectures. If your data architecture does not support ANSI-compliant SQL semantics on a relational database, you should probably consider something other than Data Vault. And yes, throwing Spark at something doesn't make the data storage architecture relational; it makes it a band-aid, a very expensive one.

Why separate them? Easy. Like the design of Kimball models, the design of DV models must take advantage of the OLAP platform's algorithms under the hood, to name a few:

  • Nested loops
  • Hash-joins (build and probe)
  • Data sketches like bloom filters

The hash join is important; it's what makes joining facts and dimensions so performant, and when I see a customer build "PIT views" I know they are missing this lesson entirely.
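
For anyone unfamiliar with the mechanics, here is an illustrative build-and-probe query (reusing the hypothetical table names from earlier in the thread); the optimizer builds hash tables on the small side and probes them with the large scan:

```sql
-- The optimizer builds in-memory hash tables on the small hub/link side and
-- probes them with the scan of the large satellite, so narrow, uniform keys
-- keep this fast. A Bloom filter derived from the build side can also prune
-- satellite rows before the join even happens.
SELECT hc.customer_id, COUNT(*) AS order_lines
FROM sat_sales_order_detail sod                                    -- large probe side
JOIN link_customer_order lco ON lco.order_hk   = sod.order_hk
JOIN hub_customer        hc  ON hc.customer_hk = lco.customer_hk   -- small build side
GROUP BY hc.customer_id;
```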

We also preach that your data complexities should be shifted as far left as possible, which is the theme of this article: The OBT Fallacy & Popcorn Analytics | by Patrick Cuba | The Modern Scientist | Medium

It is not documented anywhere that you shouldn't have business keys in satellite tables; I encourage it. Why join to anything if you do not need to?

Why have those separate tables?

  • Auditability
  • Isolating PII
  • Tracking true changes
  • Solving model / data complexities upfront so analysts donā€™t need to

Coincidentally, many of these topics are exactly what I discuss here: Data Vault is Information Mapping | by Patrick Cuba | The Modern Scientist | Mar, 2025 | Medium

And very doable using Iceberg too: Data Vault on Snowflake and Apache Iceberg | by Patrick Cuba | The Modern Scientist | Medium

Why do I blog these? Because it helps with consulting customers. How many times have you explained these concepts to an audience, only for them to ask a question at the end of the call that you already answered at the beginning! Cognitive load! Sometimes, whilst consulting, I might even say, "ah, I have a blog for that, here you go!"

We always recommend training and coaching, but I think it is a cheat to simply say, "you're doing it wrong because you never had training" or, worse, "pay me money so I can show you how to do it." Let me tell you, even trainers get it wrong. Nowhere in the training material do they tell you why the structures work as tables; they just tell you to do it. Or worse, they state problem statements like "link-satellites are dead" without providing any evidence of the problems they encountered. Or they spend pages discussing the "colours of DV"; what is the value in that?

I have been trained and certified on both Data Vault variants, so when I'm consulting I can see plainly which DV a modeller follows just by looking at the model.

It is not helpful going around stating that you have seen many failed data vaults when the common denominator is yourself; I'm sure you have seen success with your Hook (and do share when it has not worked). I won't defend Data Vault, because I know it is not perfect and I know where and (a lot of the time) why pitfalls occur. A common issue I see is "we want to build a data vault"; I ask "why". I would be happy to say "no, you shouldn't", and there can be a host of reasons:

  • Your architecture does not support relational semantics
  • Your team lacks the maturity needed - technical or even business
  • You're not willing to listen to the advice I am giving; you hear it, but you're not following it (and I never say "I told you so" later; why would I?)
  • It is led by or owned by data engineers, whose focus is on automation, not data modelling.

Check out the comment section of this article: The Death of Hash Keys. An innovation introduced in data vault... | by Patrick Cuba | The Modern Scientist | Feb, 2025 | Medium

I don't understand why people are still trying to import hash-keys into PowerBI or Tableau!

Where I have seen it work, and work quite well, is when customers have a clear vision of what their business architecture looks like. You're not building a DV as just another modelling technique but as a paradigm shift towards structuring data around business needs, i.e. marrying the Enterprise Architecture discipline with how the data is structured.
The pitfall I see here is employee churn and knowledge retention; enterprises must invest in the discipline to keep following the principles they established for how they will build their data vault. Trust me, there are even variants within the two dominant DVs in the wild!

Have you posted your question to Dan's forum yet?

Hey Patrick,

That's quite the stream of consciousness, but you make some interesting points.

I don't see how splitting the data across different table types specifically optimises DBMS performance. Fewer joins for querying and fewer tables for loading have to be better, but the only way to prove that is to put the methods side-by-side. I have asked/challenged the community, but there don't seem to be any takers. When Hook starts to get more of a foothold, I'm sure that will happen, and I welcome it.

However, there is more to any data warehousing method than performance. It has to be usable and easily learned. Modelling has to be intuitive and agile. As you point out, you can't set out on the Data Vault path unless you are committed and really well prepared. There is a lot to learn and a lot to practise, requiring considerable investment in architecture, tools and, more importantly, people. This is why I see so many Data Vault projects fail: most organisations dive into it half-cocked and expect the success they were promised. That commitment is too high for most organisations. Still, there seems to be some deep-seated perception that Data Vault is the only credible method for building a successful data solution. So they forge ahead regardless and wonder why it all goes wrong.

Like you say, even experts can get it wrong, so what hope is there for the rest of us? If success is predicated on the experience and expertise of somebody like you, Patrick, then there is something wrong with the approach. You are not most people.

Another sticking point is Data Vault's modelling approach. If you glance through the questions on this forum alone, it is clear there is a tremendous amount of confusion about how to go about it. Which modelling approach: Data Vault 2.0 or Ensemble? Do we model incrementally, one use-case at a time, or should we invest time upfront in building a broader model? When (not if) we get the model wrong, how easy is it to change? Why is it so hard?

My comments about failed Data Vault projects might not be helpful, but the reality is that if my experiences with Data Vault had been positive, then Hook would not exist. Hook is my attempt to create a method that is just easier, and to remove as many of the barriers that methods like Data Vault seem to put in our way. I probably shouldn't comment further, especially on a public forum, but I have some interesting anecdotes I'd happily share over a beer. Similarly, I could, and would love to, comment on my Hook experiences, but not here. Has it been plain sailing? Of course not. Hook is still relatively new, and there has been much to learn and new situations to understand. But it is now reaching a level of maturity that I feel comfortable with. By that, I mean Hook has reached a level where it should be included in the conversation for any organisation thinking about their next warehousing project.

Finally, to answer your question,

"Have you posted your question to Dan's forum yet?"

As you know, the Data Vault Alliance changed its forum to a moderated format some time ago. Before that, I posted numerous questions like this and never received any satisfactory answers. Since the change, I have tried again, but my questions never get past the moderators, further reinforcing my belief that I'm on to something. Dan no longer engages with me after he tried to call me out on LinkedIn, but I think those comments have long since been deleted...

There's your problem right there.

I'm actually the only local resource in customer calls across the globe because of this.

Many don't even know this.

It exists because you put it there, and it might work for you and so be it.

I still post there and highlighted my XTS blog that they deleted from DVA.
You should post there; after all, a discussion between one inventor and the other would be at least entertaining... better yet, start a Hook forum.

There's your problem right there.

Apart from the comment being somewhat dismissive (I'm sure you didn't mean it to be), the argument just does not hold. Should I blindly accept that splitting data across different table types improves performance? I couldn't say one way or the other, which is why I'm suggesting a side-by-side comparison. We need some empirical evidence. Maybe Data Vault performs better, maybe it doesn't, but surely you wouldn't expect me to just take your word for it.

You were critical of similar behaviour regarding Data Vault training:

We always recommend training and coaching, but I think it is a cheat to simply say, "you're doing it wrong because you never had training" or, worse, "pay me money so I can show you how to do it." Let me tell you, even trainers get it wrong. Nowhere in the training material do they tell you why the structures work as tables; they just tell you to do it...

I'm asking the question because I don't understand, so please enlighten me; don't dismiss me. Explain it in terms that I can understand, and then I can explain it to the next person who asks me. At the moment, I have nothing.

Again, I didn't say that. I provided links for you to traverse, and you have dismissed them by asking the same question again. :man_shrugging:t2:
Many of the techniques that these enable are in my large catalogue of Medium blogs. Have fun!