New column additions to SAT, and the HASHDIFF in dbtvault

Hi Cindi,
Thanks for the reply! Why would you have to recalculate previous hash diffs? Adding columns and their delimiters will produce different hashes. This is explained in the textbook.

Because if you don’t you’ll end up with duplicate descriptive records in the satellite as a result of the new columns. Let’s say that the existing satellite version 1 contains 4 columns as follows:

Col1 = A
Col2 = B
Col3 = C
Col4 = D

HashDiff of version 1 of the satellite - Col1 + Col2 + Col3 + Col4 = ‘A’ + ‘B’ + ‘C’ + ‘D’

Now, let’s alter Satellite version 1 and add 3 new columns as Satellite version 2:

Col1 = A
Col2 = B
Col3 = C
Col4 = D
Col5 = NULL
Col6 = NULL
Col7 = NULL

HashDiff of version 2 of the satellite - Col1 + Col2 + Col3 + Col4 + Col5 + Col6 + Col7 =
‘A’ + ‘B’ + ‘C’ + ‘D’ + ‘empty string’ + ‘empty string’ + ‘empty string’

My question to you is: are the hash diffs the same between version 1 and version 2? No, they are different. So even though the values for the previous rows in the satellite remain the same, those rows have been altered to include new columns that most likely contain NULL values now. When those rows are processed at some point in the future, they will produce a completely different hashdiff compute - and as a result, will be inserted into the satellite as if they were deltas.
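To make the effect concrete, here is a minimal Python sketch of the scenario above. The pipe delimiter and MD5 are illustrative assumptions, not the prescribed DV2 algorithm; the point is only that the extra delimiters from the new NULL columns change the hash input.

```python
import hashlib

def hashdiff(values, delimiter="|"):
    """Minimal hashdiff sketch: NULL -> empty string, join with a
    delimiter, hash the result. Real DV2 string prep has more rules."""
    parts = ["" if v is None else str(v) for v in values]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest()

v1 = hashdiff(["A", "B", "C", "D"])                    # version 1: 4 columns
v2 = hashdiff(["A", "B", "C", "D", None, None, None])  # version 2: after ALTER

print(v1 == v2)  # False: "A|B|C|D" vs "A|B|C|D|||" hash differently
```

Even though no descriptive value changed, the reloaded rows would carry the version 2 hash and be inserted as false deltas.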

Selecting against this satellite on the same primary key and pulling back history over time will result in what appear to be duplicate descriptive rows, because you have two rows both with NULLs in these additional columns. Not only is this incorrect, it will introduce mistrust in the data among your consumers. The last thing you want is for consumers to question the trustworthiness of your data.

The easiest, most accurate, and most efficient way to deal with these additional columns is to create a new satellite. This also reduces the risk of introducing an error into the existing load process for the existing satellite. This is a standard of the DV2 methodology and its incorporation of Six Sigma error-reduction techniques.

I hope this is helpful.
Respectfully, Cindi


Wouldn’t it be A|B|C|D||||| vs. A|B|C|D|?

The 4 null columns are delimited. Dan explains this in the textbook, section 11.2.

Mary, I’m using the example without the delimiters because I’m focusing on the missing values that were not part of the previous structure. I think you get the point.
The hashdiffs will be different between version 1 and version 2 of the satellite even if the values of the first 4 columns never change. Meaning, the 3 columns added by the ALTER action will produce a new hashdiff, and a new row will be inserted into the satellite if you ALTER the satellite and don’t take the step of recomputing the existing hashdiffs.
I believe that was your original question - why would the hashdiffs have to be recomputed if the satellite were simply altered to include the new columns?
The answer is because if you don’t recompute the hashdiffs, then you’d end up with potentially multiple duplicate rows being inserted into the satellite.
As a best practice, we recommend that you create a new satellite for the new columns. If, instead, you choose to ALTER the satellite and add the new columns, then you must recompute the hashdiffs to eliminate the duplicate rows that will be inserted in the scenario that I described.
You have choices in how you handle new columns, but with each choice comes a set of standards that must be met.
If you choose to ALTER the table, then you must recompute the hashdiffs.
If you choose to follow the recommended best practice, then you create a new satellite to absorb the changes to the source system without having to refactor the Raw Vault model - which is optimal for flexibility, scalability, and resilience. It also reduces the risk of introducing errors into the existing load process and into any downstream outcomes, processes, views, etc., that would otherwise include the altered satellite. And it reduces the resources required to implement the change, including the cost and time of regression testing everything that touches that data flow.
Respectfully,
Cindi


Thanks Cindi, I finally get what you’re saying. I misremembered our solution. We do NOT want to insert a new row when null columns are added. I forgot an important point, that you have to trim off trailing delimiters.

So this is what the text book said and how we implemented it. It seems to work ok but I do understand that this is not the best practice.

But it is actually possible to further reduce the maintenance overhead to zero: by making sure that the hash diffs remain valid, even after structural changes. However, there are some conditions to make this happen: 1. Columns are only added, not deleted. 2. All new columns are added at the end of the table. 3. Columns that are NULL and at the end of the table are not added to the input of the hash function.

Linstedt, Daniel; Olschimke, Michael. Building a Scalable Data Warehouse with Data Vault 2.0 (p. 369). Elsevier Science. Kindle Edition.
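Under those three conditions from the book, the trailing-NULL trim could be sketched roughly like this (Python, MD5, and the pipe delimiter are my illustrative assumptions, not the book's exact implementation):

```python
import hashlib

def hashdiff_trimmed(values, delimiter="|"):
    # Per the book's conditions: columns are only added, always at the
    # end, and trailing NULL columns never reach the hash input.
    vals = list(values)
    while vals and vals[-1] is None:
        vals.pop()
    parts = ["" if v is None else str(v) for v in vals]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest()

before = hashdiff_trimmed(["A", "B", "C", "D"])
after  = hashdiff_trimmed(["A", "B", "C", "D", None, None, None])

print(before == after)  # True: the new trailing NULLs never reach the hash
```

Note that a NULL in the middle of the row is still hashed as an empty string, so only the trailing positions are affected.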

We have some new sources coming in that are JSON and XML. We were going to store those as a long string or CLOB… now I’m wondering whether the hashes will be different when new null columns are added.

Mary, I appreciate your response. I do want to make sure that you are aware that the book is dated and needs errata. What we teach in the CDVP2 class (and have since 2015) is that the delimiters are not truncated from the hashdiff compute string. The hashdiff compute standard retains all delimiters because these delimiters hold significance in the hashdiff result. This is the first I’ve heard of anyone truncating delimiters. I’m sorry to be the bearer of bad news, but this approach is a deviation or break from the DV2 standards and rules.

I’ve been a CDVP2 Certified Instructor since early 2017. The CDVP2 course materials supersede the book that was written in 2013-2014, and published in 2015. I’m traveling and don’t have my book with me. Thank you for the reference, that will be helpful.

With regard to the approach being implemented to reduce maintenance overhead, here are my thoughts:

1. Columns are only added, not deleted.
My comment here is that this is an acceptable option (adding columns), but not the best practice as I’ve pointed out in our previous discussions.

2. All new columns are added at the end of the table.
My comment here is that this is a common data design practice, and is an acceptable practice albeit not the recommended best practice as I’ve previously indicated.

3. Columns that are NULL and at the end of the table are not added to the input of the hash function.
My comment here is that this is not an acceptable practice as it breaks the standards in a number of ways. To add to the list of reasons I provided earlier in our thread, this approach introduces conditional logic into the hash string compute and/or the load process. Either the columns exist, whether they have data or not; or the columns do not exist. If they exist, then they have to be accounted for in a consistent, repeatable fashion regardless of the data value stored in them. The DV2 standard is that every NULL value be replaced by an empty string character when building the input string to be consumed by the hash function.

JSON and XML processing are things we cover in class, but you are correct - if those key/value pairs contain a NULL, you have to account for them when computing a hashdiff.

Respectfully,
Cindi

Hi Cindi,

If you haven’t reached the end of your patience with me, in regard to your response #3 above… there is no conditional logic.

Assuming that the whole reason for data_hash is to determine if a change happened…

And you have these 2 rows in a table with columns A,B,C,D,E

1 Data,Vault,is,Fun,! → input to hash Data|Vault|is|Fun|!
2 Data,Vault,is,Fun,[null] → input to hash Data|Vault|is|Fun

Then column F is added in the source and has a value of null for both of these

1 Data,Vault,is,Fun,!,[null] → input to hash Data|Vault|is|Fun|! – same no insert
2 Data,Vault,is,Fun,[null],[null] → input to hash Data|Vault|is|Fun – same no insert

Then record 2 is modified in the source…

2 Data,Vault,is,Fun,[null],For All → input to hash Data|Vault|is|Fun||For All – hash change, insert row

Provided that the columns are in the same order, that new columns are added to the end, and that you ALWAYS trim trailing delimiters - not just when you add a new column (so no algorithm change) - the data_hash column will detect a change.
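The three steps above can be sketched in a few lines of Python (MD5 and the pipe delimiter are illustrative assumptions; the always-trim rule is the one described in the post):

```python
import hashlib

def data_hash(values, delimiter="|"):
    # ALWAYS trim trailing NULLs, not just when columns are added,
    # then NULL -> empty string for any interior NULLs, then hash.
    vals = list(values)
    while vals and vals[-1] is None:
        vals.pop()
    parts = ["" if v is None else str(v) for v in vals]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest()

row2          = data_hash(["Data", "Vault", "is", "Fun", None])
row2_altered  = data_hash(["Data", "Vault", "is", "Fun", None, None])
row2_modified = data_hash(["Data", "Vault", "is", "Fun", None, "For All"])

print(row2 == row2_altered)   # True: adding column F causes no insert
print(row2 == row2_modified)  # False: "Data|Vault|is|Fun||For All" is a real change
```

The interior NULL in the modified row still contributes its delimiter, which is what keeps the comparison unambiguous.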

Just be glad you weren’t my cert class instructor, I drove poor Bruce crazy with all my questions :wink:

Dan trained me in 2014. Here’s what I remember that’s relevant here for hashing and schema drift:

  1. Prepare a string for hashing that is a concatenation of the descriptive attributes, separated by a special separation character.
  2. That string begins and ends with a separation character.
  3. Trim leading and trailing spaces.
  4. Upper-case values for HKEYs.
  5. Choose a strong algorithm if there is any PII in the string.
  6. When the source schema changes:
    1. Deploy a new SAT.
    2. Deploy a new SAT loader.
    3. Stop the old loader.
    4. Update the presentation view (info mart) to union the old and new tables.
    5. Query carefully.

This approach preserves the strength of the SAT as an audit record as it is NEVER modified
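The string-preparation steps above might look roughly like this as code (Python, SHA-256, and the pipe delimiter are my illustrative choices, not the exact class standard):

```python
import hashlib

def hash_input(values, delimiter="|", upper=False):
    # Trim spaces, replace NULL with empty string, optionally upper-case
    # (for hash keys), and wrap the string in leading/trailing delimiters.
    parts = []
    for v in values:
        s = "" if v is None else str(v).strip()
        parts.append(s.upper() if upper else s)
    return delimiter + delimiter.join(parts) + delimiter

def hkey(business_keys):
    # Hash key: upper-cased input, with SHA-256 as an example of a
    # stronger algorithm for sensitive data.
    return hashlib.sha256(hash_input(business_keys, upper=True).encode("utf-8")).hexdigest()

print(hash_input([" Data ", None, "Vault"]))  # |Data||Vault|
```

With this prep, `hkey(["data"])` and `hkey([" DATA "])` produce the same hash key, which is the point of the trimming and casing rules.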

In short,

  • the schema is CREATE-only - ALTER is never required
  • the data is insert-only - UPDATE/DELETE is never used

and that’s one of the reasons I recommend DV