Loading the same HUB as part of two different DBT jobs

Hi,

We have a HUB_ORDER that integrates Orders from Netsuite and Salesforce. The cadence for new Order data from Salesforce and Netsuite is different, so we have two DBT jobs that load data from these sources into their respective Satellites: SAT_ORDERS_NETSUITE and SAT_ORDERS_SALESFORCE. Sometimes these jobs overlap. HUB_ORDER runs only as part of the Salesforce DBT job and NOT the Netsuite DBT job. This is to prevent HUB_ORDER from getting duplicates if the two jobs run into each other.

Because of this, we frequently run into the situation where the data is in SAT_ORDERS_NETSUITE, but the corresponding Business Keys are not yet in HUB_ORDER. This breaks the downstream tables.
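To illustrate what I mean (column names here are just examples, not our exact schema), a check like this comes back with orphaned keys after the Netsuite job has run but the Salesforce job has not:

    -- Illustrative only: satellite hash keys with no matching hub row yet.
    -- Assumes the hub hash key column is called ORDER_HK in both tables.
    SELECT DISTINCT s.ORDER_HK
    FROM SAT_ORDERS_NETSUITE s
    LEFT JOIN HUB_ORDER h
      ON h.ORDER_HK = s.ORDER_HK
    WHERE h.ORDER_HK IS NULL;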

How can we update HUB_ORDER as part of both DBT jobs, without the risk of duplicating Business Keys in the HUB_ORDER?

Thanks,
Saqib

This is bad, this is really bad!
The real problem is that the generated code does not take into account that a hub can be loaded from different processes at the same time. You'd better fix the real problem instead of using a duct-tape solution.

Hi @AHenning. I am not following you. How do I fix the loading pattern in the vault macro? We just use the macro as-is.

You have to fix the code in the macro and take into account which concurrency model is used by your data platform if you want to load the hubs in parallel.

How? Can you please provide an example?

You have to edit the macro so it generates code that properly handles concurrency. The actual code is not the same on every data platform, so you must study how your SQL engine deals with concurrency.

Hello,

  • If you are using independent dbt code bases, then yes, you can get duplicates if both loads run at exactly the same time.
  • If they are part of the same code base, AutomateDV uses a UNION ALL to bring the data sources together; you should never see duplicates.
  • If you are using Snowflake, change the INSERT INTO into a MERGE ... WHEN NOT MATCHED (a sketch follows this list). Snowflake locks the whole table and runs the MERGEs one at a time without error.
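Roughly like this (a sketch only; table and column names are illustrative, and the staging relation is whatever your macro builds):

    -- Sketch: hub load as a MERGE so concurrent loads cannot insert the
    -- same hash key twice. Table and column names are illustrative.
    MERGE INTO HUB_ORDER tgt
    USING (
        SELECT DISTINCT ORDER_HK, ORDER_ID, RECORD_SOURCE, LOAD_DATETIME
        FROM STG_ORDERS          -- the staging/temp relation built by the macro
    ) src
    ON tgt.ORDER_HK = src.ORDER_HK
    WHEN NOT MATCHED THEN
        INSERT (ORDER_HK, ORDER_ID, RECORD_SOURCE, LOAD_DATETIME)
        VALUES (src.ORDER_HK, src.ORDER_ID, src.RECORD_SOURCE, src.LOAD_DATETIME);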

Hi @AHenning. I am not following you. Can you please provide an example for any platform, and I will adapt it for Snowflake? Thanks.

Hi @patrickcuba. Yes, these are all part of the same code base, and dbtvault does a UNION ALL. But the dbt loading pattern creates a TEMP table with the new hub hash keys, which then get loaded into the HUB table. If two jobs are running at the same time and create the temp table at the same time, I believe there is a collision.

Well that’s a different problem; you would need to either:

  • change the macro that creates the TEMP table to a different location
  • change the TEMP table name, or add a random suffix to it (see the sketch after this list)
  • contact AutomateDV with your unique use case
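For the second option, a generic dbt-style sketch of the idea (not AutomateDV's actual macro; the model and column names are made up) could use dbt's invocation_id to make the intermediate table name unique per run:

    -- Sketch only, not AutomateDV's real macro. dbt's invocation_id is a UUID
    -- unique to each run, so overlapping jobs get different table names.
    {% set stg_name = 'HUB_ORDER_STG_' ~ (invocation_id | replace('-', '_')) %}

    CREATE TEMPORARY TABLE {{ stg_name }} AS
    SELECT DISTINCT ORDER_HK, ORDER_ID, RECORD_SOURCE, LOAD_DATETIME
    FROM {{ ref('stg_orders') }};    -- illustrative staging model name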

Pseudo code for hub loading:

  1. Start a database transaction
  2. Take an exclusive table lock on the Hub.
  3. Insert only the hub keys that do not already exist.
  4. Commit the database transaction.

If this code is run by two different processes at the same time, one will be blocked at step 2 until the other transaction has finished. That is because two processes cannot both hold an exclusive table lock; one of them will always have to wait for the other.
If you don't code your hub load pattern like this, there is a chance that one process gets old data, somewhat like a dirty read. If you have added a unique constraint on the hub PK, you risk a unique key violation error. A SQL sketch of this pattern is below.
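In PostgreSQL-style SQL (just to illustrate; the lock syntax and isolation behaviour differ per platform, and the column names are made up) the pattern would look like:

    -- Sketch of the pattern above, PostgreSQL-style syntax. Adapt the lock
    -- statement to your platform; table and column names are illustrative.
    BEGIN;                                        -- 1. start a transaction

    LOCK TABLE HUB_ORDER IN EXCLUSIVE MODE;       -- 2. a second loader blocks
                                                  --    here until COMMIT

    INSERT INTO HUB_ORDER (ORDER_HK, ORDER_ID, RECORD_SOURCE, LOAD_DATETIME)
    SELECT DISTINCT
           s.ORDER_HK, s.ORDER_ID, s.RECORD_SOURCE, s.LOAD_DATETIME
    FROM STG_ORDERS s                             -- 3. insert only keys not
    LEFT JOIN HUB_ORDER h                         --    already in the hub
      ON h.ORDER_HK = s.ORDER_HK
    WHERE h.ORDER_HK IS NULL;

    COMMIT;                                       -- 4. release the lock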
Hope this helps.

Hi @AHenning. Thanks. But we are looking for a solution within dbt + AutomateDV. We don't want to roll our own Vault automation tool.


Hi @alex.higgs. Any thoughts on this?

Thanks,
Saqib

They’re more responsive on their slack channel