Resolving Hash Collisions

Hello All,

I understand that MD5, SHA1, and SHA256 are some of the hash algorithms used for generating hash keys from business keys.

We all know that there is a minor risk of collision, and that the risk is reduced (though not completely eliminated) by using SHA256.

But just in case a hash key collision does occur even after using SHA256, how do we resolve the problem?

E.g. Customer ID = 123 generated hash key = abc123456
Customer ID = 456 also generated hash key = abc123456

Any references?

Very unlikely … like a meteor hitting two separate data centres at the same time.

If you find a collision, document it, and write a white paper on how you did it.

1 Like

@Nikunj - Good question!

Dan’s book, “Building a Scalable Data Warehouse with Data Vault 2.0”, addresses this issue in chapters 11 and 12. Also, the CDVP2 certification course addresses collisions specifically, with a suggested mitigation strategy as well. I would recommend either or both of those sources for detailed information on dealing with this issue.

The “short” list is as follows:

  1. Choose the strongest viable hashing strategy for your system (we use SHA2_256)…or design the perfect hashing strategy and share it with us first before sharing it with the rest of the world. :wink:
  2. Identify potential collisions either after the stage load (lazy) or during the load (real time):
    -- A hash key already in the hub attached to a different business key is a collision.
    SELECT *
    FROM stagingTbl stg
    INNER JOIN targetHub th ON stg.HashKey = th.HashKey
    WHERE stg.BusinessKey != th.BusinessKey
  3. When a collision is detected, use a hash based on a reversed business key that is stored in your staging table. (Original key = Abcdef; Reversed key = fedcbA.) See the sketch just below this list.
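As a rough illustration of point 3, the staging-time computation might look like the following. This is a minimal sketch assuming a SQL Server-style dialect (HASHBYTES and REVERSE exist there); stagingTbl and the column names follow the query above and are illustrative only:

    -- Compute both the original and the reversed-key hash in staging so the
    -- reverse hash is ready to use if a collision is ever detected.
    SELECT
        stg.BusinessKey,
        HASHBYTES('SHA2_256', stg.BusinessKey)          AS HashKey,
        HASHBYTES('SHA2_256', REVERSE(stg.BusinessKey)) AS RevHashKey
    FROM stagingTbl AS stg;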

It is best to define this mitigation strategy prior to implementing a data vault; however, if you don’t yet have any collisions, it can feasibly be done midstream without needing to refactor existing raw vault structures.

2 Likes

Yes, I understand that the probability of this happening is quite low. Just curious: don’t we need to have a mitigation process or plan in case such a situation arises?

Thanks @Nicruzer for the pointers

I am reading through Dan’s book, in which it is recommended to use SHA256 (though that doesn’t guarantee a hash collision will never happen).

Thanks for pointer 3. I would like to clarify:

So, for example, the hub stores:
Business key = Abcdef, Hash key = 011111a

Now, for business key = Abc123, the hash key generated is 011111a (which is the same as the hash key for Abcdef).
So, when such a collision is detected, reverse the business key from Abc123 to 321cbA and generate a hash key of 02222b.

So, the hub table will finally look as below:
Business key = Abcdef, Hash key = 011111a
Business key = 321cbA, Hash key = 02222b

Do we need to keep an indicator in the hub table, or some other way to identify that a business key has been reversed?

1 Like

:slight_smile:
You’ll be famous! When those authors found collisions for MD5 and SHA1, they were comparing images; that is a far more likely place to find a collision, if ever.
Collisions on hash keys and record digests are really unlikely.

SHA2 is expensive too, especially for large table loads, but there are strategies I point out in my book; they basically fall into two categories:

PRE vs POST load

PRE

  • horizontal check - checks the calculated hash+bkeys against what is already loaded in the target sats
  • vertical check - checks the calculated hash+bkeys within the staging content itself
    i.e. both check whether the same bkey (+tenant-id+bkcc) is represented by two hash keys
    PRE-checks at worst should be run during a warranty period; you’re basically testing the hashing & the loaders. (A sketch of the vertical check follows below.)
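The horizontal check resembles the staging-vs-hub join shown earlier in the thread. A minimal sketch of the vertical check, reusing the illustrative stagingTbl names from above (tenant-id and bkcc columns omitted for brevity):

    -- Vertical check within staging: any business key (plus tenant-id and
    -- bkcc, omitted here) represented by more than one hash key points at
    -- a hashing or loader defect.
    SELECT BusinessKey
    FROM stagingTbl
    GROUP BY BusinessKey
    HAVING COUNT(DISTINCT HashKey) > 1;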

POST

  • can I recreate the source? i.e. if I join against all the content (you should do this periodically anyway), can I recreate the source? (Sketched below.)
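A minimal sketch of that POST check. The table and column names (targetHub, targetSat, SomeAttribute) are illustrative assumptions; the idea is that any staged row you cannot reproduce from the vault deserves investigation:

    -- Rows present in staging that cannot be recreated by joining the
    -- vault structures back together indicate a problem.
    SELECT stg.BusinessKey, stg.SomeAttribute
    FROM stagingTbl AS stg
    EXCEPT
    SELECT h.BusinessKey, s.SomeAttribute
    FROM targetHub AS h
    INNER JOIN targetSat AS s
        ON s.HashKey = h.HashKey;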

Other preventative measures (as I point out in the book and in this article: Business Key Treatments. Business keys are the crux of a data… | by Patrick Cuba | Snowflake | Medium) include applying appropriate business key treatments even before you hash. (A sketch follows below.)
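For instance, a couple of common treatments applied before hashing might look like this. A sketch only, in the same SQL Server-style dialect as above; the full set of treatments is in the book and the article, and whether upper-casing is appropriate depends on whether the key is case sensitive:

    -- Trim whitespace and unify case before hashing, so that 'abc123 '
    -- and 'ABC123' do not yield two different hash keys for one real key.
    SELECT HASHBYTES(
               'SHA2_256',
               UPPER(LTRIM(RTRIM(stg.BusinessKey)))
           ) AS HashKey
    FROM stagingTbl AS stg;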

If you find a collision, document it, make sure you can re-create the scenario, and publish your findings; you’ll have convinced DV practitioners that SHA1 is not good enough!

1 Like

Thank you, @Nikunj, for the clarifying question. I forgot to include that detail. You are correct in your understanding of the approach.

With regard to:

Do we need to keep an indicator in the hub table, or some other way to identify that a business key has been reversed?

Yes. Your final Hub structure would be as follows:

HUB_SAMPLE_HK  HUB_REV_SAMPLE_HK  HUB_SAMPLE_BK  LDTS      RSRC  REV_HK_FLAG
0x11111A       NULL               ABCDEF         1/1/2022  SRC1  0
0x11111A       0x22222B           ABC123         1/1/2022  SRC1  1

You will want to generate both the original hash and the reverse hash in your staging process.
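Put together, a hub carrying this mitigation might be declared roughly as follows. This is a sketch under the same SQL Server-style assumption; the data types and sizes are assumptions, not a prescription:

    CREATE TABLE HUB_SAMPLE (
        HUB_SAMPLE_HK     BINARY(32)   NOT NULL,  -- SHA2_256 of the business key
        HUB_REV_SAMPLE_HK BINARY(32)   NULL,      -- hash of the reversed key; populated only on collision
        HUB_SAMPLE_BK     VARCHAR(100) NOT NULL,
        LDTS              DATETIME2    NOT NULL,
        RSRC              VARCHAR(50)  NOT NULL,
        REV_HK_FLAG       BIT          NOT NULL DEFAULT 0
    );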

1 Like

Thank you @Nicruzer for the explanation. I understand it better now.

So, this structure would be created across all hubs and links from Day 1 of the data vault?

Thanks @patrickcuba . I do refer to your book and it has been very useful. Thank you for writing the book and sharing your knowledge with us.

I was just looking for a reference on what to do when a collision actually happens for a business key hash in a data vault. I couldn’t find such a reference anywhere, so I thought of asking in this forum.

Hey, no worries mate… I only refer to the book because I have plenty of strategies in there. I think I might have stated in there: “write a white paper, get it published” if you find a collision!!! :stuck_out_tongue:

1 Like

Ideally, yes, if you choose this mitigation strategy.

You could opt to use only one of the two additional fields, if you’d like. For example, if you choose to keep just the FLAG, your staging process would simply use the reverse hash as the HUB_SAMPLE_HK when a collision occurs. That eliminates the need to keep a mostly NULL field/attribute/column.

@Nikunj Did you follow the Data Vault methodology rules before hashing the business key? E.g. trimming the business key, removing whitespace, etc…

@sarathcr Yes, I have applied the business key treatments

OR don’t use hashes at all. Just cast the business key to binary (to avoid collation issues). Most of the time it takes up less space, it is guaranteed not to collide, and it requires less compute (hashing algorithms are CPU intensive).
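A minimal sketch of that alternative, reusing the illustrative staging names from above (the cast preserves the exact bytes of the treated key, sidestepping both collisions and collation-sensitive comparisons):

    -- Use the treated business key itself, cast to binary, as the key.
    SELECT CAST(UPPER(LTRIM(RTRIM(stg.BusinessKey))) AS VARBINARY(100)) AS SampleKey
    FROM stagingTbl AS stg;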

1 Like