Handling "Expected" Duplicates

More Salesforce woes when it comes to DV…

There are many instances where a record in Salesforce gets “touched” but not necessarily a data update and, because of the nature of Salesforce, the LastModifiedDate gets updated to the latest touch. Since we are following an ELT methodology to populate the DV for our customer, MuleSoft does not handle a “hashdiff”-like approach to the records brought over…our SOURCE.

gets populated with a new record/loaddate/lastmodifieddate combination even though the rest of the data is the same.

When DBTVault picks up the record and moves it into the Hub and Satellite, we do not include LASTMODIFIEDDATE in the hashdiff, it creates a new record with the same HASHDIFF.

Thinking through all of that, I guess I really have a few questions:

  1. How should this be handled? Should the record be allowed to persist, or should it not make it into the DV?
  2. For anyone more familiar with DBTVault than I am, I thought it was able to look at the HASHDIFF and cleanly handle not inserting a new record if the HASHDIFF was the same?
  3. Should I set up something to automate deleting duplicate HASHDIFFS?

Now, all of that said, there are some instances where I would expect duplicate HASHDIFFs, but I’m not sure how to handle them. Situation: An Opportunity in the sales pipeline moves from Contracting to Contract Review. The contract gets rejected for some reason, so the Opportunity moves back to Contracting in order to be worked. This is an example of a workflow for the customer, but they have many simlar workflows. When that happens, the move back to Contracting is a true data change, with a different LASTMODIFIEDDATE, but it creates a duplicate HASHDIFF (though I feel this is an acceptable duplicate).

The question is: is it okay to have an “acceptable duplicate” and how should it be handled?

In an overarching question that covers all of this topic: It is possible every object we sync over could fall into the “acceptable duplicate” situation since select boxes and formula fields can change and then revert back. In broad strokes, should I just allow duplicate hashdiffs and safely assume that if there are duplicates with different LASTMODIFIEDDATES, that it was an instance where a value was moved forward and then reverted back (which is 100% the case for all of the duplicates after a lot of manual data verification and audit log tracking)?

  1. You have received new data so you should persist it. I guess you dont want to insert multiple rows containing the same data besides the lastupdateddate, you can solve this by moving out the lastupdateddate to another satellite.
  2. I am not familiar with dbt-vault but data vault 2.0 should not insert a new row if the hashdiff is the same. You said that you did not include the lastupdateddate in the hashdiff calculation, maybe this trigger something else. I think it is odd that you even can exclude fields from the hashdiff (loaddate not included) because the reason for hashdiff is to track changes on all fields in the satellite.
  3. No.

Good morning,

  1. Duplicate hashdiffs? That should not happen at, ever. You could have the same hashdiff if
    a. you have a null in one of your attributes contributing to the hashdiff — nulls break hashing
    b. if in the past the state of the business object returns to a value it was before – Jill got married, Jill became single again. But then this is a duplicate more like the old state returned
  2. “Touched” records but not modified is a common scenario – if this is a column controlled by Salesforce then there’s not much you can do from the application end — if having that column of any value to you? Is it a metadata column that will simply create far more data then you need to store? Why not exclude it in your Mulesoft extracts?

Thanks for the responses!


Yes, your B scenario is what is happening in one of the two instances. A value is set to X, it gets changed to Y, and then later gets changed back to X. Apart from the LastModifiedDate being updated each time, no other fields get updated. So the first and second versions of X have the same HashDiffs.

That said, the client needs to know when something like an Opportunity changed stages, even if backwards, so they can do projections, trends, etc. I wondered if I should have two HashDiff columns. A general HashDiff that is typical of DV2, and a “HashDiffDup” that removes the fields that could revert back so that way it checks for true duplicates…even then, it feels like not the right solution.

The second situation where we have duplicates is when a record gets “touched”. Salesforce has workflows that trigger from an event. So an Account update to field X will update field ABC on all related Opportunity records. Keep in mind this is very generalized and the real-world situations are much more complex, but if that field X gets updated again (even with the same value) it will trigger the field ABC to be updated with the exact same value (no change). This causes the LastModifiedDate to change, which triggers Mulesoft to push a new record to Snowflake.

Situation two might be easier to handle with HashDiff, but then introduce situation one…with that, it is hard to tell for sure which situation caused the duplicate (and if that duplicate is acceptable).

I read back through Linstedt’s book at near-relevant examples, but I am thinking we might have to flex the requirements of DV2.0 for these edge-case objects (making them not true DV methodology objects) to account for both situations.

Or split the satellite tables? One Hashdiff per satellite table, always.