Hi All,
This one is for some who might be looking for ways to hash dif with large numbers if columns in source tables and/or dealing with varchar(max). Now I can’t claim any ownership of the solution beyond deciding to try and apply it to data vault as I have several source tables over 300 columns and data factory when creating tables automaticaly from source data tens to create a lot of varchar(max) columns.
Simply put, the solution is Json for path. In a nutshell you create a json string of all columns in the row and hashing that
Here is a link to Dr Greg Low’s blog with the explanation of the concept SQL: More on finding rows that have changed using HASHBYTES and FOR JSON PATH - The Bit Bucket
Greg got the idea from Adam Machanic according to the post
This is an example of how I use it in a merge in a satellite loader
FROM staging.TreatmentMedication tm
LEFT OUTER JOIN dv_raw.satTreatmentMedication stm ON stm.STEHashKey
= HASHBYTES('SHA2_256' , CONCAT (tm.TreatmentEpisodeId, tm.sourceSystem))
LEFT JOIN dv_raw.hubTreatmentEpisodes hTE ON hte.hTreatmentEpisodeId = tm.TreatmentEpisodeId
INNER JOIN (
SELECT tm1.TreatmentMedicationId
,HASHBYTES('SHA2_256', json_modify(tmd.TreatmentMedicationData, '$."TmData".TreatmentEpisodeId',null)) hashdata
FROM staging.TreatmentMedication tm1
CROSS APPLY (SELECT tm1.* FOR JSON PATH, ROOT ('TmData')) tmd(TreatmentMedicationData)) as tm2
on tm.TreatmentMedicationId = tm2.TreatmentMedicationId
) src
If someone sees an issue I’d also be happy to hear feedback, however this appears to work effectively and seems to perform well