Should Business Key be included in the SAT HASHDIFF? Also what about Depedent Child in the HASHDIFF for LINK SAT

Hi,

I have seen Business Keys and Dependent Child included in the HASHDIFF of a SAT or LINK SAT. What is the best practice? Does it make a difference?

Thanks,
Saqib

It matters. Having the bk inside the hashdiff enables you to only compare hashdiffs against hashdiffs, as long as you compare the latest content in the satellite. It is very beautiful code.

@saqib - Dan Linstedt actually addresses this directly in an article he published on http://LearnDataVault.com called DV2.0 and Hash Keys (which is included in the practitioner course book as a tab labeled Hash Keys):

** IMPORTANT! Do not forget to ADD the sequence OR the natural primary key field to the HASHDIFF calculation!! This will ensure the most uniqueness of the Hash calculations across multiple similar Satellite rows and greatly reduces the risk of duplicate hashes for different values.

Hi @Nicruzer . Interesting. I wonder if this is still applicable now that we use SHA-256 with a extremely low probability of collision.

My point is: you can load your satellites faster if you include the bk in the hashdiff. Thats because you only have to check for new hashdiffs compared to latest hashdiff for each bk.

@AHenning . how will include the bk in the hashdiff help that? you will still have to check for new hashdiffs compared to latest hashdiff for each bk.

Not for each BK, i will explain.
If you extract a list of a hashdiffs for the latest loaddate for each bk from a satellite. That is just a list of hashdiffs.
Then compare the incoming hashdiff-rows with this list. All hashdiffs in the incoming bucket that doesnt match any hashdiff in the list should be inserted.
This works only because the bk is a part of the hashdiff. Do you follow, or should I go into more detail? Kind regards.

Hey,

Bkeys can be included in the HashDiff, it makes no difference. Those that suggest it makes it more unique probably have never seen the load code for a satellite that compares the latest hashdiff per hashkey.

Dep-Keys must be included in the hashdiff, they after all simply just attribute columns identified as dep-keys, nothing makes them special to be excluded from hashdiff calculation.

Hope that helps

2 Likes

There is a difference between including the BK in the hashdiff and not including it in the hashdiff. The following example says why:

incomming data batch:

bk,country,comment

AA,Cuba,is wrong

BB,Cuba,is wrong

Look at the example above

If we calculate hashdiff for country and comment we got the same hashdiff, for simplicity lets say that the
concat and hashcalculation of “Cuba” and “is wrong” equals 0xCC. This means that the two rows will have the same hashdiff value.

But if we calculate the hashdiff for bk, country and comment we got two different values because we have different bk:s in our example.

Lets call these two 0xAA and 0xBB.

After that we define a second incoming data batch like this:

bk,country,comment

AA,Cuba,is wrong

BB,Cuba,is still wrong

Lets calculate the hashdiffs for incoming data and include the bk.

AA,Cuba,is wrong → 0xAA

BB,Cuba,is still wrong → 0xBD

algo:

  1. Extract current hashdiffs for each BK → 0xAA, 0xBB

  2. Calculate hashdiffs for incoming data → 0xAA, 0xBD

  3. Insert rows where hashdiff in statement 2 is not in statement 1 → 0xBD.

Conclusion:

  1. The hashdiffs gets “more unique” if you include the bk.

  2. The loading process of a satellite can be performed faster because you just have to do an anti-join between incoming data hashdiff and latest hashdiff per bk
    instead of matching BK and BK and then compare the hashdiffs. This can be performed because the BK is included in the hashdiff.

errr right… like i said those that suggest there is a difference have never seen the load code.

This might be an eye opener to you but you can load a satellite with different program code.

Yea clearly! The right way and the wrong way, well done son.

Well done being narrow minded. There are still many correct and incorrect ways to write code.

1 Like

ok sunshine… lol… :slight_smile:

It seems to me that you still dont understand the algo I posted. Maybe you should look into it more, maybe you can learn something?

lol … sure sunshine…

Maybe you should address the topic in this thread instead of being unpolite. Do you still think that there is no difference between including and not including the bk in the hashdiff?

What is surprising is that you think I haven’t seen incorrect implementations like yours before.

So before you share more “expert” opinions in these forums I recommend getting yourself a copy of this: Amazon.com

Well, well well. It is very interesting that you still argue that the algo I posted is incorrect. Maybe you can explain why it is incorrect? The algo I posted is nothing that I came up with myself. I was thougth that algo by Michael Olschimke when I did my dv2 certification in 2019.