New Domain, Hash Key Question for the HUBs

Hello all,

I’m building out a new domain in DV and have a question regarding Hash Keys for the HUBs.

SHA1 is being deprecated in SQL Server, so I’m wondering whether it still makes sense to use it for HK purposes in the model going forward. SHA2_256 produces a binary(32), and I’m worried about the performance / space considerations. From the looks of it, us SQL Server shops might be forced to use SHA2_256 anyway.
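For reference, a quick T-SQL check of the output sizes involved (the input literal is just a placeholder value):

```sql
-- Hash output sizes in SQL Server: SHA1 -> binary(20),
-- SHA2_256 -> binary(32), SHA2_512 -> binary(64).
SELECT
    DATALENGTH(HASHBYTES('SHA1',     'CUSTOMER|12345')) AS sha1_bytes,      -- 20
    DATALENGTH(HASHBYTES('SHA2_256', 'CUSTOMER|12345')) AS sha2_256_bytes,  -- 32
    DATALENGTH(HASHBYTES('SHA2_512', 'CUSTOMER|12345')) AS sha2_512_bytes;  -- 64
```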

Has anyone used SHA2_256 as their HKs, and do you have any insight? I’m also considering just using the BK or a traditional surrogate key for this domain.

Thanks,
Josh


As long as the hashing algorithm meets the same requirements as SHA1, it should be fine. Remember, hashing is used because it produces a consistent value, and any slight change in a bit or byte generates a completely different hash. A key requirement for hash keys is the avoidance of collisions; older algorithms are being deprecated because collisions have been demonstrated against them, while the stronger hashes have not shown the same weakness. Whether you use SHA1 or SHA2, the same functionality is achieved. We have known of MD5 vulnerabilities for some time, and SHA1 has shown vulnerabilities too. A customer I worked with decided to go with SHA2.
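To illustrate the point about slight changes, here is a small T-SQL demonstration (the values are made up):

```sql
-- A one-character change in the input yields a completely different hash value.
SELECT
    HASHBYTES('SHA2_256', 'CUSTOMER|12345') AS hash_a,
    HASHBYTES('SHA2_256', 'CUSTOMER|12346') AS hash_b;
-- hash_a and hash_b bear no structural resemblance to each other.
```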

Picking a hashing algorithm does not affect the implementation of the Data Vault 2.0 standard.


Thanks for the response.

I meant more in terms of using an SHA2 HK as the PK in the HUBs and how it affects join performance. This domain is our largest volume of data (close to 500 million records). Our other domains built out in DV are low-volume, so using the HK as the PK there doesn’t concern me. The PK would also be a non-clustered PK, as recommended by Dan.
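For illustration, a minimal hub sketch along those lines, with the HK as a non-clustered PK (table and column names are hypothetical, not from this thread):

```sql
-- Hypothetical hub with a SHA2_256 hash key as a non-clustered primary key.
CREATE TABLE dbo.hub_customer (
    hk_customer   binary(32)    NOT NULL,  -- SHA2_256 hash key
    bk_customer   nvarchar(100) NOT NULL,  -- source business key
    load_dts      datetime2     NOT NULL,
    record_source varchar(50)   NOT NULL,
    CONSTRAINT pk_hub_customer PRIMARY KEY NONCLUSTERED (hk_customer)
);
```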

Your performance hit should only be in the calculation of the hash itself, SHA2 being more computationally expensive than SHA1 and MD5.
Yes, make the HK the PK. The HK should be made up of (see the sketch after this list):

  • Business Key (treated)
  • Business Key Collision Code
  • Multi-tenant id

Of course, you’d never cluster on a hash value! But if your system is MPP-based with distributed nodes, then the HK should/could be used as a distribution key.
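A sketch of that composition; the delimiter, treatment rules (trim + uppercase), and variable names here are assumptions, not a prescribed standard:

```sql
DECLARE @business_key   nvarchar(100) = N' 12345 ',
        @collision_code nvarchar(10)  = N'CRM',
        @tenant_id      nvarchar(10)  = N'T1';

-- Treated BK + collision code + multi-tenant id, delimited and hashed.
SELECT HASHBYTES('SHA2_256',
           UPPER(LTRIM(RTRIM(@business_key))) + N'||' +
           UPPER(@collision_code) + N'||' +
           UPPER(@tenant_id)) AS hk_customer;  -- binary(32)
```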

Hi Josh,

We’re using SHA2_512 into binary(64) datatypes in Azure SQLMI.
We’re not in prod yet, but the resource showing up as the most likely bottleneck is the log IO.
We’re not worried about the CPU impact.

Russell

sha2_512!!! wow! Why??

@neil.strange had this published back in 2019 on Azure Synapse best practices: Deploying Data Vault on Azure SQL Data Warehouse


I inherited it from another ‘specialist’. I had specified SHA2_256, as it seems to be the right balance of performance, strength (security), and uniqueness. I was expecting visible performance degradation, but none was observed; as expected, it’s the log IO that is the constraint.

Further, this is an event-based, near-real-time feed from D365: each CUD event in the source is a new record for the DV. The cost of the SHA calc is invisible in the broader context (D365 -> SynapseLink -> AzureFunction -> DVloadSP -> DV tables and indexes).

We’re using SQLMI for the DV, primarily for its performance with multi-streamed, real-time ingestion across a large number of parallel pipelines, and for its familiarity to the current workforce. We are expecting to receive thousands of source-table CUD streams from many separate D365 applications, so many thousands of parallel but low-volume database updates.

From what I can see, the DW appliances are MPP-like solutions that do not like that style of workload.

When the workload of the consumers exceeds what the MI can handle, we will persist marts in Synapse proper.

I think if you were on a lighter hashing algorithm, loads would be faster; and Neil’s doc suggests that you should build a natural-key-based DV, because underneath, Azure hashes the BKs anyway.
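For context on that Synapse point, distribution is declared in the table DDL; a minimal sketch of a natural-key hub in a dedicated SQL pool (names are illustrative):

```sql
-- Azure Synapse dedicated SQL pool hash-distributes rows on the chosen column,
-- which is the basis for using the business key directly as the hub key.
CREATE TABLE dbo.hub_customer
(
    bk_customer   nvarchar(100) NOT NULL,  -- natural/business key
    load_dts      datetime2     NOT NULL,
    record_source varchar(50)   NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (bk_customer),
    CLUSTERED COLUMNSTORE INDEX
);
```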


Cheers. Some counterpoints (I think):

  1. We’re using SQL Server MI, as the data arrival and loading pattern is more like OLTP. I don’t think it does any hashing internally; that might be Azure SQL DW (now Synapse) that hashes.
  2. We have many multi-part BKs, and I like the simplicity of a single-field PK.

I agree with #2; I think that’s what a lot of people miss in DV modelling: it is meant to absorb the model’s complexities so they’re simpler for the user to query.

Got it for #1; I thought it was entirely in Synapse.
