Using JSON to hash diff your satellite

Hi All,
This one is for anyone who might be looking for ways to hash diff with large numbers of columns in source tables and/or when dealing with varchar(max). I can't claim any ownership of the solution beyond deciding to try to apply it to Data Vault, as I have several source tables with over 300 columns, and Data Factory, when creating tables automatically from source data, tends to create a lot of varchar(max) columns.
Simply put, the solution is FOR JSON PATH. In a nutshell, you create a JSON string of all the columns in the row and hash that.
Here is a link to Dr Greg Low's blog with the explanation of the concept: SQL: More on finding rows that have changed using HASHBYTES and FOR JSON PATH - The Bit Bucket
According to the post, Greg got the idea from Adam Machanic.
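
The basic shape of it is something like this (just a rough sketch; the table and column names below are placeholders, not from my actual system):

    -- Serialise every column of the row into one JSON string with FOR JSON PATH,
    -- then hash that string, instead of casting and concatenating each column by hand.
    SELECT  t.SomeBusinessKey,
            HASHBYTES('SHA2_256', j.RowData) AS HashDiff
    FROM    dbo.SomeSourceTable t
            CROSS APPLY (SELECT t.* FOR JSON PATH, ROOT('RowData')) j(RowData);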

This is an example of how I use it in a merge in a satellite loader:

FROM staging.TreatmentMedication tm
    -- match the existing satellite row via the hub hash key
    LEFT OUTER JOIN dv_raw.satTreatmentMedication stm
        ON stm.STEHashKey = HASHBYTES('SHA2_256', CONCAT(tm.TreatmentEpisodeId, tm.sourceSystem))
    LEFT JOIN dv_raw.hubTreatmentEpisodes hTE
        ON hTE.hTreatmentEpisodeId = tm.TreatmentEpisodeId
    -- build the hash diff: serialise the whole staging row to JSON,
    -- remove the hub business key from the JSON, then hash the result
    INNER JOIN (
        SELECT tm1.TreatmentMedicationId
             , HASHBYTES('SHA2_256', JSON_MODIFY(tmd.TreatmentMedicationData, '$."TmData".TreatmentEpisodeId', NULL)) AS hashdata
        FROM staging.TreatmentMedication tm1
            CROSS APPLY (SELECT tm1.* FOR JSON PATH, ROOT('TmData')) tmd(TreatmentMedicationData)
    ) AS tm2
        ON tm.TreatmentMedicationId = tm2.TreatmentMedicationId
) src

If anyone sees an issue I'd be happy to hear feedback; however, this appears to work effectively and seems to perform well.

CONCAT should always have a delimiter… other than that (I am not familiar with this syntax), if it works for you then go for it.

The other thing I'd comment on though: if this is proposing to turn structured columns into JSON just to make the code easier on the eye, note that this extra step to serialise to JSON will likely make the hashing take longer – i.e. it would cost more per CPU cycle to do this (hashing is not free)… a simple CONCAT (with delimiters) will always be cheaper… you will start to see a difference when you're hashing millions of records.

Thanks for the feedback Patrick. Maybe at larger scales it will be a problem, but this system may never see that level of data, so I should be OK.
It's not so much about being easier on the eye; it simplifies dealing with all those issues you encounter with hashing, like null values in columns, data conversions for CONCAT, and the length of data in multiple varchar(max) columns.
Performance has been OK on several million records, but maybe at 50 million or more I might wish I hadn't done it.

We always cast to char and coalesce for nulls mate… I wrote about the performance issues with the JSON conversion for hashing about 2 years ago (here)… of course it's up to you… don't forget about the delimiter I mentioned.
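
Something like this, as a rough sketch (the column names other than TreatmentEpisodeId are made up for illustration):

    SELECT HASHBYTES('SHA2_256',
               CONCAT(
                   COALESCE(CAST(tm.TreatmentEpisodeId AS varchar(50)), ''), '||',   -- cast + coalesce every column
                   COALESCE(CAST(tm.MedicationCode     AS varchar(100)), ''), '||',  -- '||' is the delimiter between columns
                   COALESCE(CAST(tm.Dose               AS varchar(50)), '')          -- MedicationCode and Dose are illustrative names only
               )) AS HashDiff
    FROM staging.TreatmentMedication tm;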
