Dan’s book, “Building a Scalable Data Warehouse with Data Vault 2.0”, addresses this issue in chapters 11 and 12. The CDVP2 certification course also addresses collisions specifically, with a suggested mitigation strategy. I would recommend either or both of those sources for detailed information on dealing with this issue.
The “short” list is as follows:
1. Choose the strongest viable hashing algorithm for your system (we use SHA2_256)… or design the perfect hashing strategy and share it with us first before sharing it with the rest of the world.
2. Identify potential collisions either after the stage load (lazy) or during the load (real time): SELECT * FROM stagingTbl stg INNER JOIN targetHub th ON stg.HashKey = th.HashKey WHERE stg.BusinessKey != th.BusinessKey
3. When a collision is detected, use a hash based on the reversed business key that is stored in your staging table (original key = Abcdef; reversed key = fedcbA). A sketch of points 2 and 3 follows below.
It is best to define this mitigation strategy before implementing a data vault; however, if you don’t yet have any collisions, it can feasibly be adopted midstream without refactoring existing raw vault structures.
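To make points 2 and 3 concrete, here is a minimal T-SQL sketch, assuming SQL Server’s HASHBYTES and REVERSE, the stagingTbl/targetHub names from the query above, and a hypothetical CollisionFlag column in staging:

```sql
-- Lazy (post-stage-load) detection: flag staged rows whose hash key
-- already maps to a DIFFERENT business key in the target hub.
UPDATE stg
SET    stg.CollisionFlag = 1
FROM   stagingTbl stg
INNER JOIN targetHub th
        ON stg.HashKey = th.HashKey
WHERE  stg.BusinessKey != th.BusinessKey;

-- Mitigation: for flagged rows only, re-hash the REVERSEd business key
-- (Abcdef -> fedcbA) so the colliding pair no longer shares a hash key.
UPDATE stagingTbl
SET    HashKey = HASHBYTES('SHA2_256', REVERSE(BusinessKey))
WHERE  CollisionFlag = 1;
```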
Yes, I understand that the probability of it happening is quite low. Just curious: don’t we need to have a mitigation process or plan in case such a situation arises?
I am reading through Dan’s book, which recommends using SHA-256 (but that doesn’t guarantee a hash collision will never happen).
Thanks for point 3; I would like to clarify.
So, for example, the hub stores:
Business key = Abcdef, Hash key = 011111a
Now, for business key = Abc123, the hash key generated is 011111a (the same as the hash key for Abcdef).
So, when such a collision is detected, we reverse the business key from Abc123 to 321cbA and generate a new hash key, 02222b.
So the hub table will finally look like this:
Business key = Abcdef, Hash key = 011111a
Business key = 321cbA, Hash key = 02222b
Do we need to keep an indicator in the hub table, or some other way to identify that a business key has been reversed?
You’ll be famous! When those researchers found collisions for MD5 and SHA-1 they were comparing images, where you’re far more likely to find a collision, if ever.
Collisions on Hash-Keys and Record Digests are really unlikely.
SHA2 is too expensive, especially for large table loads, but there are strategies I point out in my book; they basically fall into two categories:
PRE vs POST load
PRE
horizontal check - compares the calculated hash + bkeys against what is already loaded in the target sats
vertical check - compares the calculated hash + bkeys within the staging content itself
i.e. both check whether the same bkey (+ tenant-id + bkcc) ends up represented with two hash keys (sketched below)
PRE-checks should, at worst, be done during a warranty period; you’re basically testing the hashing & loaders.
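A rough SQL sketch of both PRE checks. The table and column names (stagingTbl, targetHub, TenantId, BKCC) are hypothetical, and I’ve run the horizontal check against the hub here since that’s where the bkey lives:

```sql
-- Horizontal check: staged hash-key/bkey pairs that disagree with
-- what the target already holds.
SELECT stg.HashKey, stg.BusinessKey, hub.BusinessKey AS loaded_bkey
FROM   stagingTbl stg
INNER JOIN targetHub hub
        ON hub.HashKey = stg.HashKey
WHERE  hub.BusinessKey != stg.BusinessKey;

-- Vertical check: the same bkey (+ tenant-id + bkcc) carrying more
-- than one hash key within the staged content itself.
SELECT BusinessKey, TenantId, BKCC,
       COUNT(DISTINCT HashKey) AS hashkey_count
FROM   stagingTbl
GROUP BY BusinessKey, TenantId, BKCC
HAVING COUNT(DISTINCT HashKey) > 1;
```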
POST
Can I recreate the source? i.e. if I do joins against all the content (you should periodically do this anyway), can I recreate the source? A sketch follows below.
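A minimal sketch of that POST check. targetSat and SomeAttribute are hypothetical names; the idea is to rebuild the source from hub + satellite and flag anything the vault cannot reproduce:

```sql
-- POST check: recreate the source from the vault and list any source
-- rows that cannot be reproduced (worth chasing if any appear).
SELECT src.BusinessKey, src.SomeAttribute
FROM   stagingTbl src
LEFT JOIN (
    SELECT hub.BusinessKey, sat.SomeAttribute
    FROM   targetHub hub
    INNER JOIN targetSat sat
            ON sat.HashKey = hub.HashKey
) dv
    ON  dv.BusinessKey   = src.BusinessKey
    AND dv.SomeAttribute = src.SomeAttribute
WHERE dv.BusinessKey IS NULL;
```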
If you find a collision, document it, make sure you can re-create that scenario, and publish your findings; you’ll have convinced DV practitioners that SHA1 is not good enough!
Thanks @patrickcuba . I do refer to your book and it has been very useful. Thank you for writing the book and sharing your knowledge with us.
I was just looking for any reference on what to do when a collision actually happens for a business key hash in a data vault. I couldn’t find such a reference anywhere, so I thought of asking in this forum.
hey no worries mate… I only refer to the book because I have plenty of strategies in there. I think I might have stated in there: “write a white paper, get it published” if you find a collision!!!
Ideally, yes, if you choose this mitigation strategy.
You could opt to use only one of the two additional fields, if you’d like. For example, if you choose to keep just the FLAG, your staging process would simply use the reverse hash as the HUB_SAMPLE_HK (see the sketch below). That eliminates the need to keep a mostly NULL field/attribute/column.
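A T-SQL sketch of that flag-only variant. HUB_SAMPLE_HK comes from the discussion above; stagingTbl and CollisionFlag are hypothetical names:

```sql
-- Flag-only variant: when CollisionFlag is set, emit the reversed-key
-- hash directly as HUB_SAMPLE_HK, so no second hash column is carried.
SELECT BusinessKey,
       CASE WHEN CollisionFlag = 1
            THEN HASHBYTES('SHA2_256', REVERSE(BusinessKey))
            ELSE HASHBYTES('SHA2_256', BusinessKey)
       END AS HUB_SAMPLE_HK,
       CollisionFlag
FROM   stagingTbl;
```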
OR don’t use hashes at all. Just cast the business key to binary (to avoid collation issues). Most of the time it takes up less space, it is guaranteed not to collide, and it needs less compute (hashing algorithms are CPU intensive).
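A quick T-SQL illustration of that natural-key approach; the HUB_SAMPLE_BK name and the VARBINARY(128) length are assumptions:

```sql
-- No hashing: cast the business key itself to binary so comparisons
-- are byte-for-byte and collation rules cannot interfere.
SELECT BusinessKey,
       CAST(BusinessKey AS VARBINARY(128)) AS HUB_SAMPLE_BK
FROM   stagingTbl;
```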