I am pretty new to Data Vault modeling and I would like to know whether what I am implementing makes sense.
In my company, we want to implement a data vault in our lakehouse to get better quality and scalability in the face of business changes.
For the raw vault, we want to ingest data as fast as possible, so we want to parallelize the hub, satellite, and link loads. However, for links, I don't know how to feed them.
Let’s say we have a link to build for the relationship between a contract and a client.
We have Client and Contract hubs that are fine: hash keys, the natural keys used to build the hash key, and some technical columns (ingested_date, and so on).
We have 3 applications that contain the notion of client and contract: Salesforce and 2 internal apps.
My idea is to union 3 queries, one per app, to model all possible relationships, and then insert all distinct rows into the link. With this approach, I think I could parallelize all table loads into the raw vault and make the whole ingestion layer generic, with a JSON file holding all the metadata needed to generate the queries automatically.
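To make the idea concrete, here is a minimal sketch of that metadata-driven union approach. All table and column names are hypothetical, and the JSON layout is just one possible shape for the parameters file described above; `UNION` (rather than `UNION ALL`) handles the "insert all distinct rows" part.

```python
import json

# Hypothetical metadata describing the three sources that could feed a
# contract/customer link; table and column names are illustrative only.
link_metadata = json.loads("""
{
  "link": "link_contract_customer",
  "sources": [
    {"table": "salesforce.contract", "client_key": "account_id", "contract_key": "contract_no"},
    {"table": "app1.contracts",      "client_key": "client_id",  "contract_key": "contract_id"},
    {"table": "app2.agreements",     "client_key": "cust_ref",   "contract_key": "agreement_ref"}
  ]
}
""")

def build_union_query(meta: dict) -> str:
    """Generate one SELECT per source and UNION them into a distinct
    list of (client, contract) relationships."""
    selects = [
        f"SELECT {s['client_key']} AS client_nk, {s['contract_key']} AS contract_nk "
        f"FROM {s['table']}"
        for s in meta["sources"]
    ]
    # Plain UNION deduplicates across sources, so only distinct
    # relationships reach the link table.
    return "\nUNION\n".join(selects)

print(build_union_query(link_metadata))
```

Note that, as discussed later in the thread, this coupling of all sources into one query is exactly what makes the union approach fragile when a source is dropped.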
What do you think about that? Do you have any suggestions for making this real? I am not really sure how to handle links with multiple sources.
Thanks a lot for your help.
Have a nice day.
It's the only way! Perhaps a book might have the answer…
I know your book exists, and I know how I could handle the parallelism between loads. However, I am not sure how to build queries for links when there are multiple sources: any suggestions?
The same way you built the queries for your hubs. After all, they are pretty much the same, right? A unique list of business keys versus a unique list of relationships between business keys.
Thanks for your response.
Sure, I want to follow the same logic, but the main difference for me is that a hub load always has a single source. If several sources feed the hub, I split the loads and parallelize them. That way, I never have conflicts between sources, and I keep all tables/sources separate.
A link, however, may require a query with joins. For example, in my case the contract_customer link needs to join two tables to build the natural relationship for the application.
In that case, how can I apply the same approach?
UNION ALL is dangerous; it's the one limitation dbt imposes on dbtvault with respect to hub loads. It means that every time you want to load a hub (in your case, a link) from a single source, you have to use the query that reads from all sources feeding that hub. If a source is dropped, then you have to update your code.
The alternative, independent loads to a shared hub or link, is much better! You load only what you need, there are no redundant reads (which UNION ALL forces), and you can load any portion at any cadence.
Parallelism is the only way to load the DV; it's what deterministic surrogate key generation does for you. Ensure these keys are of data type binary!
As per your latest comment, don't create a switch architecture; use one method to load a hub or link.
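One way to picture the independent-load alternative is one self-contained load statement per source, each deduplicating against the shared link on its own. This is a hedged sketch, not the exact dbtvault pattern: the table and column names are hypothetical, and a real load would also carry hash keys and load dates rather than bare natural keys.

```python
def build_link_load(source_table: str, client_key: str, contract_key: str,
                    link_table: str = "link_contract_customer") -> str:
    """Generate one independent load per source: insert only the
    relationships not already present in the shared link. Dropping a
    source then just means removing its load, with no impact on the
    other loads, which can run in parallel and at different cadences."""
    return f"""
INSERT INTO {link_table} (client_nk, contract_nk, record_source)
SELECT DISTINCT s.{client_key}, s.{contract_key}, '{source_table}'
FROM {source_table} s
LEFT JOIN {link_table} l
  ON l.client_nk = s.{client_key} AND l.contract_nk = s.{contract_key}
WHERE l.client_nk IS NULL
""".strip()

# One generated statement per source; each can be scheduled independently.
print(build_link_load("salesforce.contract", "account_id", "contract_no"))
print(build_link_load("app1.contracts", "client_id", "contract_id"))
```

The anti-join (`LEFT JOIN … WHERE … IS NULL`) is what lets two sources target the same link without ever needing to read each other's data.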
I agree with your point, Patrick, for sure.
Today, I have what you mentioned for my hubs: independent loads to a shared hub.
However, for links, within the SAME SOURCE (Salesforce, for example), I have to JOIN (not union different sources) the two tables already used for the two hubs.
To sum up:
- Table A from Source A → Hub A
- Table B from Source B → Hub A too (independent load)
- Table C from Source A → Hub B
- Table D from Source B → Hub B
Here, I have 4 parallel loads feeding my hubs.
- Join between Table A and Table C from Source A → Link A
- Join between Table B and Table D from Source B → Link A too (independent load)
This way, I can easily remove a source if it is dropped, without impacting the other loads.
For hash keys, I am using the SHA-2 hash function to avoid the collisions you can get with MD5, for example.
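The deterministic hashing mentioned here is what makes the independent parallel loads converge. A minimal illustration in plain Python (the normalization rules, the `||` delimiter, and the key values are assumptions for the sketch; in Databricks the same thing would typically be done with the built-in `sha2` SQL function):

```python
import hashlib

def hash_key(*business_keys: str) -> bytes:
    """Deterministic surrogate key: trim and upper-case each business
    key, join with a delimiter assumed not to appear in the keys, then
    SHA-256. Returning raw bytes matches the advice above to store
    hash keys as a binary data type."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).digest()

# The same business keys always produce the same surrogate key, so
# parallel loads from different sources agree without coordination.
assert hash_key("C-001 ", "k42") == hash_key("c-001", "K42")
print(hash_key("C-001", "K42").hex())
```

The normalization step matters as much as the hash function itself: two sources spelling the same business key with different casing or padding must still hash to the same link key.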
But my question is: how can I automate the ingestion for my links, given the joined tables?
To be precise, I have Python/PySpark/SQL scripts in Databricks that generate all hub queries dynamically from a JSON parameters file. I am not using an ETL tool or anything else to ingest data into the lakehouse.
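Extending that generator to joins is mostly a matter of letting the JSON parameters describe a driving table plus join clauses. A possible sketch, assuming a hypothetical parameter shape (the table names, aliases, and join condition below are invented for illustration):

```python
import json

# Illustrative JSON parameters for a link whose natural relationship
# only exists after a join inside a single source.
params = json.loads("""
{
  "link": "link_contract_customer",
  "from":  {"table": "salesforce.contract", "alias": "c"},
  "joins": [
    {"table": "salesforce.account", "alias": "a",
     "on": "c.account_id = a.id"}
  ],
  "keys": {"client_nk": "a.client_code", "contract_nk": "c.contract_no"}
}
""")

def build_link_query(p: dict) -> str:
    """Render a SELECT DISTINCT over the driving table plus the joins
    declared in the metadata; one such file per source keeps the
    per-source loads independent."""
    cols = ", ".join(f"{expr} AS {alias}" for alias, expr in p["keys"].items())
    joins = "\n".join(f"JOIN {j['table']} {j['alias']} ON {j['on']}"
                      for j in p["joins"])
    return (f"SELECT DISTINCT {cols}\n"
            f"FROM {p['from']['table']} {p['from']['alias']}\n"
            f"{joins}")

print(build_link_query(params))
```

Because each source gets its own parameter file and its own generated query, dropping a source is still just deleting one file, consistent with the independent-load scheme above.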
Hmmm… the source system should provide the relationships, which are loaded to the link table or tables in parallel with the related hub and sat loads.
All hub hash keys, link hash keys, sat hash keys, sat hash diffs and other metadata columns are added in staging. This single staged table or view is used as the source feeding every hub, link and sat table it is mapped to.
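A sketch of that staging idea, generating a Databricks SQL view per source (the column names are hypothetical; `sha2` and `current_timestamp` are standard Spark SQL functions). Every hash key is computed once here, and the hub, link and sat loads all read from this view in parallel:

```python
def build_staging_view(source_table: str) -> str:
    """Generate a staged view that computes all hash keys and metadata
    columns up front for one source table. Hub, link and sat loads
    mapped to this source then share the view as their only input."""
    view_name = "stg_" + source_table.replace(".", "_")
    return f"""
CREATE OR REPLACE VIEW {view_name} AS
SELECT
  sha2(upper(trim(client_id)), 256)       AS hk_client,
  sha2(upper(trim(contract_no)), 256)     AS hk_contract,
  sha2(upper(trim(client_id)) || '||' ||
       upper(trim(contract_no)), 256)     AS hk_contract_customer,
  current_timestamp()                     AS load_date,
  '{source_table}'                        AS record_source,
  *
FROM {source_table}
""".strip()

print(build_staging_view("salesforce.contract"))
```

Computing the link hash key in staging, from the same normalized business keys the hubs use, is what guarantees the parallel hub and link loads stay consistent.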
If you are creating relationships after the raw vault load, then this becomes a business vault link.
Unfortunately, that is not always the case. Sometimes we have to build those relationships ourselves when the source doesn't provide them. That's why I'm trying to find a solution to this problem.
In our lakehouse today, we ingest data from the sources into a bronze layer without any modification to the data. After that, we want to build the raw vault with hard rules only (in fact, just removing duplicates from hubs/links if necessary, that's all).
Given that, how would you model the raw vault from the bronze layer to build links?
Then your relationships are not reflective of the source and you’re maintaining technical debt.
Look to your data contracts with the source; they must provide this relationship, because sources are implementations that automate business processes and business rules. If this is not the case, then you are not building a data vault, but rather a new legacy data platform.
I have read through this thread and I am wondering why you have to pre-join source data before loading links.
If you pre-join source data via business rules, then you are applying soft business rules. Soft business rules can be applied in a data vault, but apply them in processes that load data from your data vault back into your data vault. Do you follow?
Also, your data quality seems very poor; can you fetch data from the real source instead of the semi-broken bronze data lake failure?
Thanks for your reply.
I will focus on a concrete example and come back if I have any further questions on this subject.
However, it is a real problem in the business, and one of the main goals of our data modeling is to surface and anticipate data quality problems.
Any suggestions or feedback about that would be nice.