Resolving Hash collision

Nikunj · 9 November 2022 09:03

Hello All,

I understand that the MD5, SHA1 or SHA256 are some of the hash algorithms used for generating hash keys for business key.

we all know that there is a minor risk of collision and the risk reduces (not completely eliminated) as we use SHA256.

But just incase if hash key collision does occur even after using SHA256, how do we resolve that problem?

E.g. Customer ID= 123 generated Hash key =abc123456
CustomerID=456 also generated Hash key=abc123456

Any references?

patrickcuba · 10 November 2022 01:55

very unlikely … like meteor hitting a two seperate data centres at the same time.

If you find a collision, document it, and write a white paper on how you did it.

Nicruzer · 11 November 2022 16:49

@Nikunj - Good question!

Dan’s book, “Building a Scalable Data Warehouse with Data Vault 2.0”, address this issue in chapters 11 and 12. Also, the CDVP2 certification course addresses collisions specifically with a suggested mitigation strategy as well. I would recommend either or both of those sources to get detailed information on dealing with this issue.

The “short” list is as follows:

Choose the greatest viable hashing strategy for your system (we use SHA2_256)…or design the perfect hashing strategy and share it with us first before sharing it with the rest of the world.
Identify potential collisions between either after the stage load (lazy) or during the load (real time).
SELECT * FROM stagingTbl stg INNER JOIN targetHub th ON stg.HashKey = th.HashKey WHERE stg.BusinessKey != th.BusinessKey
When a collision is detected, use a hash based on a reversed business key that is stored in your staging table. (Original key = Abcdef; Reversed key = fedcbA)

It is best to define this mitigation strategy prior to implementing a data vault; however, if you don’t yet have any collisions, it can feasibly be done midstream without needing to refactor existing raw vault structures.

Nikunj · 12 November 2022 06:36

Yes, understand that the probability of happening is quite low. Just curious, Don’t we need to have mitigation process or plans incase if such situation arises

Nikunj · 12 November 2022 06:47

Thanks @Nicruzer for the pointers

I am reading through Dan’s book in which It is recommended to use SHA256 (but it doesn’t guarantee that hash collision will never happen).

Thanks for pointer 3, I would like to clarify

So, for e.g. in hub, its stores as
Business key = Abcdef, Hash Key=011111a

Now, for business key = Abc123, Hash key generated is 011111a (which is same as the hash key for Abcdef),
so, when such collision is detected, reverse the business key from Abc123 to 321bcA and generate a hash key 02222b

So, hub table will finally look as below
Business key = Abcdef, Hash Key=011111a
Business key = 321bcA, Hash Key=02222b

, do we need to keep an indicator in the hub table or some way to identify that a business key is reversed?

patrickcuba · 12 November 2022 06:56

You’ll be famous, when those authors found collisions for MD5 and SHA1 they were comparing images, far more likely to find a collision if ever.
Hash-Keys and Record Digests are really unlikely.

SHA2 is too expensive, especially for large table loads, but there are strategies I point out in my book, they’re basically divided into two such strategies.

PRE vs POST load

PRE

horizontal check - checks calculated hash+bkeys vs what is already loaded against target sats
vertical check - checks calculated hash+bkeys within the staging content
i.e. both checks if the same bkey (+tenant-id+bkcc) has two hashkeys represented
PRE-checks at worst should be done in a warranty period, you’re basically testing the hashing & loaders.

POST

can i recreate the source? i.e. if i do joins against all the content (you periodically should do this anyway) can i recreate the source?

Other preventative measures (as i point out in the book and this article Business Key Treatments. Business keys are the crux of a data… | by Patrick Cuba | Snowflake | Medium) is the use of appropriate business key treatments even before you hash.

If you find a collision, document it, make sure you can re-create that scenario, publish your findings, you’ve convinced DV-practitioners SHA1 is not good enough!

Nicruzer · 14 November 2022 16:04

Thank you, @Nikunj , for the clarifying question. I forgot to include that detail. You are correct in your understanding of the approach.

With regard to:

do we need to keep an indicator in the hub table or some way to identify that a business key is reversed?

Yes. Your final Hub structure would be as follows:

HUB_SAMPLE_HK	HUB_REV_SAMPLE_HK	HUB_SAMPLE_BK	LDTS	RSRC	REV_HK_FLAG
0x11111A		ABCDEF	1/1/2022	SRC1	0
0x11111A	0x22222B	ABC123	1/1/2022	SRC1	1

You will want to generate both the original hash and the reverse hash in your staging process.

Nikunj · 15 November 2022 02:22

Thank you @Nicruzer for the explanation. I understand it better now.

So, this structure would be created across all hubs and links from Day 1 of the data vault?

Nikunj · 15 November 2022 02:30

Thanks @patrickcuba . I do refer to your book and it has been very useful. Thank you for writing the book and sharing your knowledge with us.

I was just looking for any reference as to what to do when the collision actually happens for business key hash in a data vault. Couldn’t find such reference anywhere so I thought of asking in this forum

patrickcuba · 15 November 2022 02:46

hey no worries mate… i only refer to the book because i have plenty of strategies in there — i think might have stated in there “write a white paper, get it published” if you find a collision!!!

Nicruzer · 15 November 2022 11:17

Ideally, yes, if you choose this mitigation strategy.

You could opt to only use one of the two additional fields, if you’d like. For example, if you choose to keep just the FLAG, your staging process would simply use the reverse hash as the HUB_SAMPLE_HK. That eliminates the need to keep a mostly NULL field/attribute/column.

sarathcr · 16 November 2022 08:46

@Nikunj Did you follow the Data Vault methodology Rules before hashing the business key? Like trim the business key, remove the whitespace etc…

Nikunj · 16 November 2022 13:44

@sarathcr Yes, I have applied the business key treatments

Hook · 5 June 2023 09:34

OR don’t use hashes at all. Just cast the business key to a binary (to avoid collation issues). Most of the time it takes up less space, guaranteed not to collide and less compute required (hashing algorithms are CPU intensive).

Topic		Replies	Views
Must I re-engineer the DV when I add a new source with a possible collision to a hub table? Data Vault 2.0 business-key , integration	1	702	7 February 2024
Hub Collision Code Best Practices Data Vault 2.0	5	194	24 June 2024
Hash key generation - sequence of columns Data Vault 2.0 hash-key , sequence , sort-order	3	1004	7 March 2023
New Domain, Hash Key Question for the HUBs Data Vault 2.0	9	1124	21 September 2022
Business Key Field Names from Different Sources with the Same Values - Is it still passive integration? Data Vault 2.0 business-key	6	1256	3 February 2022

Resolving Hash collision

Related topics