We have a HUB_ORDER that integrates ORDER data from NetSuite and Salesforce. The cadence for new order data from Salesforce and NetSuite is different, so we have two dbt jobs that load data from these sources into their respective Satellites, SAT_ORDERS_NETSUITE and SAT_ORDERS_SALESFORCE. Sometimes these jobs overlap. HUB_ORDER runs only as part of the Salesforce dbt job and NOT the NetSuite dbt job, to prevent HUB_ORDER from getting duplicates if the two jobs run into each other.
Because of this we frequently end up with data in SAT_ORDERS_NETSUITE whose corresponding Business Keys are not yet in HUB_ORDER, which breaks the downstream tables.
How can we update HUB_ORDER as part of both dbt jobs, without the risk of duplicating Business Keys in HUB_ORDER?
This is bad, this is really bad!
The real problem is that the generated code does not take into account that a hub can be loaded by different processes at the same time. You'd better fix that real problem instead of reaching for a duct-tape solution.
You have to fix the macro so it generates code that properly handles concurrency, and take into account which concurrency model your data platform uses if you want to load the hubs in parallel.
The actual code is not the same on every data platform, so you must study how your SQL engine deals with concurrency.
If you are using independent dbt code bases, then yes, you can get duplicates if both loads run at exactly the same time.
If they are part of the same code base, AutomateDV uses a UNION ALL to bring the data sources together, so you should never see duplicates.
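For example, a single hub model that lists both staging models as sources would look roughly like this. This is a minimal sketch: the staging model and column names are placeholders for your own, and depending on your package version the macro namespace is automate_dv or dbtvault.

```sql
-- models/raw_vault/hub_order.sql
-- Both sources feed one hub model, so the package generates a single
-- UNION ALL + dedup load instead of two competing hub loads.
{{ config(materialized='incremental') }}

{{ automate_dv.hub(
    src_pk='ORDER_HK',
    src_nk='ORDER_ID',
    src_ldts='LOAD_DATETIME',
    src_source='RECORD_SOURCE',
    source_model=['stg_orders_netsuite', 'stg_orders_salesforce']
) }}
```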
If you are using Snowflake, change the INSERT INTO into a MERGE ... WHEN NOT MATCHED. Snowflake locks the whole table and runs the competing MERGE statements one at a time without error.
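A minimal sketch of that approach for this hub, with placeholder staging/column names (not the exact generated SQL):

```sql
-- Insert-only MERGE: hash keys that already exist are skipped,
-- and concurrent MERGEs against the same table are serialised.
MERGE INTO hub_order AS h
USING (
    SELECT DISTINCT order_hk, order_id, load_datetime, record_source
    FROM stg_orders_netsuite      -- or the staging set for this run
) AS s
    ON h.order_hk = s.order_hk
WHEN NOT MATCHED THEN
    INSERT (order_hk, order_id, load_datetime, record_source)
    VALUES (s.order_hk, s.order_id, s.load_datetime, s.record_source);
```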
Hi @patrickcuba. Yes, these are all part of the same code base, and dbtvault does use a UNION ALL. But the dbt loading pattern is that it creates a TEMP table with the new hub hashkeys and then loads them into the HUB table. If two jobs are running at the same time and they each create a temp table at the same time, I believe there is a collision.
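Roughly, each run follows this pattern (a simplified sketch, not the exact dbtvault output); the race window sits between building the temp set and the final insert:

```sql
-- 1. Each job builds its own set of "new" hub rows.
CREATE TEMPORARY TABLE hub_order__dbt_tmp AS
SELECT DISTINCT s.order_hk, s.order_id, s.load_datetime, s.record_source
FROM stg_orders_netsuite s          -- the other job reads stg_orders_salesforce
LEFT JOIN hub_order h
    ON h.order_hk = s.order_hk
WHERE h.order_hk IS NULL;           -- keys not in the hub as of *this* moment

-- 2. ...then appends them. If both jobs passed step 1 before either
--    committed step 2, the same ORDER_HK can be inserted twice.
INSERT INTO hub_order (order_hk, order_id, load_datetime, record_source)
SELECT order_hk, order_id, load_datetime, record_source
FROM hub_order__dbt_tmp;
```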
1. Start a database transaction.
2. Do the insert, selecting only the hub business keys that do not already exist in the hub.
3. Commit the database transaction.
If this code is run by two different processes at the same time, one will be blocked at step 2, the insert, until the other transaction has finished. That is because two different processes cannot hold an exclusive table lock at the same time; one of them will always have to wait for the other.
If you don't code your hub load pattern like this, there is a chance that one process gets old data, kind of like a dirty read. And if you have added a unique constraint on the hub PK, you risk a unique key violation error.
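A minimal sketch of that pattern (syntax and locking behaviour vary by engine, and the table/column names are placeholders):

```sql
BEGIN TRANSACTION;

-- On engines where the INSERT does not take an exclusive table lock
-- by itself, an explicit LOCK TABLE on hub_order may be needed here.

INSERT INTO hub_order (order_hk, order_id, load_datetime, record_source)
SELECT s.order_hk, s.order_id, s.load_datetime, s.record_source
FROM stg_orders_netsuite s
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_order h
    WHERE h.order_hk = s.order_hk   -- skip business keys already in the hub
);

COMMIT;
```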
Hope this helps.