Concerning link loading, we can read in The Data Vault Guru:

> Has the staged source file of the link table been used to supply content to parent hub tables?
Should we understand that when a link is inserted we also have to insert (if not already present) all related hubs?
More generally, should we always insert into the parent table (hub or link) when we insert into a child table (link or satellite)?
To keep architecture implementations simple you should have landing, staging, and loading. DV loading comes from what was staged.
If a landed file contains a relationship between business objects, then you will use:
- a link loader and
- 2 hub loaders

Hubs are parent tables of a link or satellite.
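A minimal sketch of that rule (table and column names are hypothetical, and SQLite via Python's sqlite3 stands in for a real warehouse and DV automation tool): one staged file carrying a Customer-to-Order relationship feeds two hub loaders and one link loader.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical minimal DV tables: two hubs and the link between them.
cur.executescript("""
CREATE TABLE hub_customer (customer_bk TEXT PRIMARY KEY);
CREATE TABLE hub_order    (order_bk    TEXT PRIMARY KEY);
CREATE TABLE link_customer_order (
    customer_bk TEXT,
    order_bk    TEXT,
    PRIMARY KEY (customer_bk, order_bk)
);
-- Staged file containing a relationship between business objects.
CREATE TABLE stg_customer_order (customer_bk TEXT, order_bk TEXT);
INSERT INTO stg_customer_order VALUES ('C1', 'O1'), ('C1', 'O2');
""")

# Two hub loaders: each inserts the business keys found in the staged
# file, skipping keys that are already present in the hub.
cur.execute("INSERT OR IGNORE INTO hub_customer "
            "SELECT DISTINCT customer_bk FROM stg_customer_order")
cur.execute("INSERT OR IGNORE INTO hub_order "
            "SELECT DISTINCT order_bk FROM stg_customer_order")

# One link loader for the relationship itself.
cur.execute("INSERT OR IGNORE INTO link_customer_order "
            "SELECT DISTINCT customer_bk, order_bk FROM stg_customer_order")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0])        # 1
print(cur.execute("SELECT COUNT(*) FROM hub_order").fetchone()[0])           # 2
print(cur.execute("SELECT COUNT(*) FROM link_customer_order").fetchone()[0]) # 2
```

Note that the hubs are populated from the same staged file as the link, so the parent rows always exist even when no satellite row accompanies them.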
I was wondering the same myself.
What if you have a database source that enforces referential integrity, from which you do a direct pull?
I have table Customer_Core with CustomerID which gets loaded to Hub_Customer and Sat_Customer_Core.
Then I have table Customer_Additional with CustomerID which gets loaded to Sat_Customer_Additional. CustomerID in this table cannot exist if it is not in Customer_Core. Should I still load Hub_Customer from this table?
I would say yes: load from all possible occurrences in all sources. You’ll take a performance hit, but since you’re loading a hub, which will probably never be that big, the hit isn’t significant.
Thanks for the reply.
The special case I was thinking about is when we load a link and we only have the business key (BK) of a referenced hub in the staged table.
I note that we always insert into the parent table even if there is no descriptive data to add to a satellite.
What performance hit?
Hub, link and Sat loaders are autonomous tasks running in parallel against very thin tables.
Having to do a select with unions over all sources, as opposed to doing a single-source select.
That is a dbt limitation, not a DV limitation
Well it’s not a dbtvault limitation actually, they support the multi-source hub pattern. So no problems there. It’s just that the select with the unions will be slower, hence the performance hit I was referring to.
It is a dbt limitation.
dbt doesn’t allow multiple SQL files to load the same model (one file, one model), which is why dbtvault resorted to generating a UNION ALL.
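A minimal sketch of that UNION ALL workaround (hypothetical names, SQLite via Python's sqlite3 rather than dbt; dbtvault's actual generated SQL is more involved): because one model is one file, the multi-source hub is loaded by unioning all staged sources inside a single statement, so every source is scanned on every run.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE hub_customer (customer_bk TEXT PRIMARY KEY);
-- Two staged sources that both carry the customer business key.
CREATE TABLE stg_customer_core       (customer_id TEXT);
CREATE TABLE stg_customer_additional (customer_id TEXT);
INSERT INTO stg_customer_core       VALUES ('C1'), ('C2');
INSERT INTO stg_customer_additional VALUES ('C2'), ('C3');
""")

# One file, one model: all sources are unioned in a single select,
# which is the extra work behind the performance hit mentioned above.
cur.execute("""
INSERT OR IGNORE INTO hub_customer
SELECT DISTINCT customer_id FROM (
    SELECT customer_id FROM stg_customer_core
    UNION ALL
    SELECT customer_id FROM stg_customer_additional
)
""")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0])  # 3
```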
Ideal DV automation is to have single-purpose hub loaders, 1-1 source to target, so you don’t read sources you don’t need when loading a hub.
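For contrast, a sketch of the single-purpose pattern (same hypothetical tables, SQLite standing in for the warehouse): one loader per staged source, each an autonomous task that can be scheduled, parallelised, and retried independently, and that never touches a source it doesn't need.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE hub_customer (customer_bk TEXT PRIMARY KEY);
CREATE TABLE stg_customer_core       (customer_id TEXT);
CREATE TABLE stg_customer_additional (customer_id TEXT);
INSERT INTO stg_customer_core       VALUES ('C1'), ('C2');
INSERT INTO stg_customer_additional VALUES ('C2'), ('C3');
""")

def load_hub_customer(source_table: str) -> None:
    """Single-purpose hub loader: one staged source, one target hub.

    If a run only landed one source, the other loader simply never
    fires, and no unneeded source is read.
    """
    cur.execute(
        f"INSERT OR IGNORE INTO hub_customer "
        f"SELECT DISTINCT customer_id FROM {source_table}"
    )

# Each loader is an independent task; idempotent inserts make
# recovery from failure a simple re-run.
load_hub_customer("stg_customer_core")
load_hub_customer("stg_customer_additional")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0])  # 3
```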
Ah right, yes, sorry for missing that point
And I guess the single-purpose hub loaders you mention are in reference to what you describe HERE related to Snowflake?
Autonomous, independent, microservice-style architecture that makes it easy to recover from failure; it follows the Unix mantra.