Loading the same HUB as part of two different DBT jobs

saqib · 9 December 2023 17:45

Hi,

We have a HUB_ORDER that integrates ORDERs from Netsuite and Salesforce. The cadence for new Order data from Salesforce and Netsuite is different. So we have two DBT jobs that load data from these sources into their respective Satellites- SAT_ORDERS_NETSUITE and SAT_ORDERS_SALESFORCE. Sometimes these jobs overlap. HUB_ORDER runs only as part of the Salesforce DBT job and NOT the Netsuite DBT job. This is to prevent HUB_ORDER from getting duplicates, if the two aforementioned jobs run into each other.

Due to this we frequently run into the situation where we have the data in SAT_ORDERS_NETSUITE, but the corresponding Business Keys are not in the HUB_ORDER. This breaks the downstream tables.

How can we update HUB_ORDER as part of both DBT jobs, without the risk of duplicating Business Keys in the HUB_ORDER?

Thanks,
Saqib

AHenning · 10 December 2023 04:28

This is bad, this is really bad!
The real problem is the generated code that does not take into account that a hub can be loaded from different processes at the same time. You better fix the real problem instead of using a duct tape solution.

saqib · 10 December 2023 15:54

Hi @AHenning . I am not following you. How do I fix the loading pattern in the vault macro? We just use the macro as it is.

AHenning · 10 December 2023 17:24

You have to fix the code in the macro and take into account what concurrency model that is used in your data platform if you want to load the hubs in parallell.

saqib · 10 December 2023 17:35

how? can you please provide an example.

AHenning · 10 December 2023 19:49

You have to edit the macro so it generates code that properly handles concurrency. The actual code on how you do this is not the same on every data platform. You must study how your SQL engine is dealing with concurrency.

patrickcuba · 11 December 2023 01:49

Hello,

If you are using independent dbt code base then yes you can get duplicates if both loads run at exactly the same time
If they are part of the same code base AutomateDV uses a UNION ALL to bring the data sources together ;; you should never see duplicates
If you are using snowflake, change the INSERT INTO into a MERGE WHEN NOT MATCHED. Snowflake locks the whole table and picks one or the other MERGE’s to run one at a time without error

saqib · 11 December 2023 13:29

Hi @AHenning . I am not following you. Can you please provide an example for any platform, and I will adapt it for Snowflake. Thanks.

saqib · 11 December 2023 13:35

Hi @patrickcuba . Yes these are all part of the same code, and dbtvault does UNION ALL. But the dbt loading pattern is that it creates a TEMP table with new hub hashkeys and then they get loaded into the HUB table. If two jobs are running at the same and they create a temp table at the same time, I believe there is a collision.

patrickcuba · 11 December 2023 19:50

Well that’s a different problem; you would need to either:

change the macro that creates the TEMP table to a different location
change the TEMP table name, or add a random suffix to it
contact AutomateDV with your unique use case

AHenning · 11 December 2023 20:26

Pseudo code for hub loading:

Start a database transaction
Put exclusive table lock on the Hub.
Do the insert with selection of Hubs that does not already exist.
Commit the database transaction.

If this code is run by two different processes at the same time, one will be locked at step nr 2 until the other transaction has finished. That is beacuse two different processes cannot obtain an exclusive table lock, one of them will always have to wait for the other.
If you dont code your hub load pattern like this there is a chance that one processes gets old data, kind of like a dirty read. If you have added a unique constraint on hub pk you risk a unique key violation error.
Hope this helps.

saqib · 12 December 2023 20:58

Hi @AHenning . Thanks. But we are looking for a solution within the dbt + automateDV. We don’t want to roll our own Vault Automation tool.

saqib · 12 December 2023 20:59

Hi @alex.higgs . Any thoughts on this?

Thanks,
Saqib

patrickcuba · 13 December 2023 23:47

They’re more responsive on their slack channel

Topic		Replies	Views
Should HUB be loaded only from actual source and how to define- biz. key if 2 source have diff. keys Data Vault 2.0 dbtvault , business-key , link , hub , datamodelling , satellite	5	135	3 July 2024
Dbtvault macro for loading HUB from two different source with different number of BKs AutomateDV (formerly known as dbtvault) dbtvault , hub	5	302	10 June 2022
How to merge the same entity from another sources Data Vault 2.0 dbtvault	4	228	23 October 2024
How to parallelize hubs and links loads and automatize them as much as possible ? Data Vault 2.0 link	11	1279	27 March 2023
More than one record in SAT for a Business Key with the same LOAD_DATE in a single load Data Vault 2.0	32	1748	11 October 2023

Loading the same HUB as part of two different DBT jobs

Related topics