New Domain, Hash Key Question for the HUBs

jjs88 · 28 March 2022 08:15

Hello all,

I’m building out a new domain in DV and have a question regarding Hash Keys for the HUBs.

SHA1 is being deprecated in SQL Server so I’m wondering if it makes sense to use that for HK purposes in the model going forward. SHA2_256 is a binary(32) and I’m just worried about the performance / space considerations. From what it looks like, us SQL Server shops might be forced to use.

Has anyone used SHA2_256 as their HKs and have any insight? I’m also considering just using the BK or a traditional surrogate key for this domain.

Thanks,
Josh

patrickcuba · 29 March 2022 01:20

As long as the hashing meets the same requirements as SHA1 it should be fine, remember hashing is only used because it produces a consistent value and any slight change in bit or byte generates a completely different hash. A Key requirement for hash keys is the avoidance of collisions, older hashes are being deprecated because those algorithms have shown collisions and the stronger hashes don’t. Whether you use SHA1 or SHA2 the same functionality is achieved. We have known of MD5 vulnerabilities for some time and SHA1 has also shown vulnerabilities too. A customer I worked with decided to go with SHA2.

Picking a hashing algorithm does not effect the implementation of the Data Vault 2.0 standard.

jjs88 · 29 March 2022 14:51

Thanks for the response.

I meant more in terms of using an SHA2 HK as the PK in the HUBs and how it affects performance of joins. This domain is our largest volume of data (close to 500 million records). Our other domains built out in DV are low-volume so using the HK as the PK doesn’t concern me. The PK would also be non-clustered PK as recommended by Dan.

patrickcuba · 29 March 2022 21:56

Your performance hit should only be in the calculation of the hash itself. SHA2 being more complex than SHA1 and MD5.
Yes, make the HK the PK, HK should be made up of:

Business Key (treated)
Business Key Collision Code
Multi-tenant id
Of course! You’d never cluster on a hash value! But if your system is MPP based on distributed nodes then the HK should/could be used a distribution key.

Rusty · 19 September 2022 06:49

Hi Josh,

We’re using SHA2_512 into binary(64) datatypes in Azure SQLMI
We’re not in prod yet but the resource that’s showing up as the most likely bottle neck is the Log IO
We’re not worried about the CPU impact

Russell

patrickcuba · 19 September 2022 22:00

sha2_512!!! wow! Why??

@neil.strange had this published back in 2019 on Azure Synapse best practices, Deploying Data Vault on Azure SQL Data Warehouse

Rusty · 20 September 2022 01:54

I inherited it from another ‘specialist’ I specified SHA2_256 - as it seems to be the right balance of performance, strength (security) and uniqueness. I was expecting visible performance degradation - none observed - as expected, it’s the log IO that is the constraint

further, this is an event based near real time feed from D365. each CUD event in the source is a new record for the DV. The cost of the SHA calc is invisible in the broader context (D365->SynapseLink->AzureFunction->DVloadSP->DVtables and indexes)

we’re using SQLMI for the DV, primarily for its performance of multi streamed, real time ingestion across a large number of parallel pipelines, and familiarity for the current workforce. We are expecting to receive thousands of source table CUD streams from many separate D365 applications. So many thousands of parallel but low volume database updates.

From what I can see the DW appliances are MPP like solutions that do noit like that style of workload

When the workload of the consumers exceeds that of the MI, we will persist marts in Synapse proper

patrickcuba · 20 September 2022 23:05

I think if you were on a lower hashing algorithm loads would be faster; and Neil’s doc suggests that you should build a natural key based DV because underneath Azure hashes the bkeys anyway.

Rusty · 21 September 2022 01:04

Cheers. Some counter points (I think)

we’re using SQL Server MI - as the data arrival and loading is more like an OLTP. I don’t think it does any hashing internally. That might be Azure SQL DW (Now Synapse) that hashes
we have many multi part BKs - I ilke the simplicity of a single field PK

patrickcuba · 21 September 2022 01:11

I agree with #2; I think that’s what a lot of ppl miss in DV modelling, it is meant to solve all model complexities for the user to query on

Got it for #1, I thought it was entirely in Synapse

Topic		Replies	Views
Hash Keys Implementation Data Vault 2.0	5	163	30 July 2024
DV Implementation Data Vault 2.0	4	161	3 June 2024
Change in Primary Key at source Data Vault 2.0 business-key , hub	2	236	16 May 2024
Resolving Hash collision Data Vault 2.0 business-key , hashkey , collision	13	2410	5 June 2023
Modeling of link: reference a hub by businessKey instead of hashKey Data Vault 2.0 link	6	71	30 June 2025

New Domain, Hash Key Question for the HUBs

Related topics