New column additions to SAT, and the HASHDIFF in dbtvault

Hello,

When it comes to adding new columns to an existing SAT, Dan recommends putting the new columns at the end of the SAT and the HASHDIFF, such that “Columns that are NULL and at the end of the table are not added to the input of the hash function”[1]. If this is not done, a full reload of the data will happen whenever a new column is added to the SAT, since the HASHDIFF will be different.
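For illustration (hypothetical attribute values, not from the book), the idea is that the hash input stays stable when the new trailing column is NULL:

-- Day 1: HASHDIFF input for a record with attributes NAME and DOB
'Alice|1990-01-01'

-- Day 2: a new PHONE column has been added at the end, but is still NULL.
-- Under Dan's rule the trailing NULL is excluded from the hash input:
'Alice|1990-01-01'   -- unchanged input, unchanged HASHDIFF, no reload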

What is the best way to implement this using dbtvault macros?
see:
https://dbtvault.readthedocs.io/en/latest/best_practices/

Thanks,
Saqib

  1. Linstedt, D. and Olschimke, M., 2016. Building a Scalable Data Warehouse with Data Vault 2.0. Morgan Kaufmann, p.369.

We have a new feature coming out soon which will add an optional src_additional_columns parameter to all macros, allowing arbitrary selection of extra columns from the source_model; this will support any columns our users need for their use cases. For now, in your satellite model, you can wrap the dbtvault.sat() call in a CTE and do something like:

WITH sat AS (
    {{ dbtvault.sat(src_pk=src_pk, src_hashdiff=src_hashdiff,
                    src_payload=src_payload, src_eff=src_eff,
                    src_ldts=src_ldts, src_source=src_source,
                    source_model=source_model) }}
)

SELECT a.*, b.my_additional_column
FROM sat AS a
INNER JOIN my_stage AS b
    ON a.HK = b.HK

You can then make sure your columns are added to the end of the table.


@saqib - Great question! :wink:


Thanks @alex.higgs. Also, how do we make sure that the HASHDIFF doesn’t change for existing records when the newly added column is NULL? If the HASHDIFF changes, it will cause a full reload of the SAT.

It says “not added to the input of the hash function”, so just add them to the payload without adding them to the HASHDIFF.
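A sketch of what that could look like in a dbtvault stage configuration (column names invented for illustration): the new column is still selected into the satellite payload, but left out of the HASHDIFF’s column list:

source_model: raw_customer
hashed_columns:
  CUSTOMER_PK: CUSTOMER_ID
  CUSTOMER_HASHDIFF:
    is_hashdiff: true
    columns:
      - CUSTOMER_NAME    # existing attributes stay in the hash input
      - CUSTOMER_DOB
      # CUSTOMER_PHONE (the new column) is deliberately not listed here

The satellite’s src_payload would then include CUSTOMER_PHONE alongside the hashed attributes.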

@alex.higgs, what about the new records that have that column populated? Then we would miss those changes in the HASHDIFF.

I would say this is essentially a schema change, except it would not be possible to compare ‘old’ and ‘new’ directly. You’d want to archive the old satellite (rename it with an x_ARCHIVE suffix or something) and start history again in a new satellite with the new hashdiff.

For the presentation layer we could have a business rule to reconcile the two, adding a flag for archived columns.
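As a rough sketch of that reconciliation (table and column names are made up), the presentation view could union the two satellites and flag which rows came from the archive:

SELECT hk, load_date, attr1, attr2, NULL AS attr3, TRUE AS is_archived
FROM sat_customer_x_archive
UNION ALL
SELECT hk, load_date, attr1, attr2, attr3, FALSE AS is_archived
FROM sat_customer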

I’m sure there are probably better and more standard approaches though, just thinking out loud really.

I’ve always wondered how to actually implement that! Instead of just blindly concatenating all attributes, you’d stick something in that figures out whether the last attribute is NULL and, if so, doesn’t concatenate that one. And what if you added more than one new attribute?

In theory it makes a lot of sense… in practice, it’s nigh impossible, I’d say.

So, just create a separate satellite for it, I guess :slight_smile:

Hi @FrenkLoonen. Good point. I have done this in one of my past implementations, where we were using SQL to generate the hashdiff. But like you mentioned, it won’t work if you are adding more than one column.
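For anyone curious, a minimal sketch of that kind of SQL (hypothetical columns, NULL-handling via COALESCE omitted for brevity). Every newly added trailing column needs its own branch, which is why it gets unwieldy fast:

CASE
    WHEN attr3 IS NULL AND attr4 IS NULL
        THEN MD5(attr1 || '|' || attr2)                  -- both new columns NULL: original hash input
    WHEN attr4 IS NULL
        THEN MD5(attr1 || '|' || attr2 || '|' || attr3)  -- only the last column NULL
    ELSE MD5(attr1 || '|' || attr2 || '|' || attr3 || '|' || attr4)
END AS hashdiff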

Doesn’t dbtvault handle schema drift elegantly? Genuine Q. I think support was added to dbt some time ago?

Let’s say on Day 1 your SAT looks like this:
HK|LD|AD|HD|Attr1|Attr2

Therefore HD = Attr1|Attr2

On Day 2 it is:
HK|LD|AD|HD|Attr1|Attr2|Attr3|Attr4

Therefore HD = Attr1|Attr2|Attr3|Attr4

If all your attribute columns are null then the hashdiff input would be |||

Does it matter if any or all are null?

Yeah as long as your dbtvault configuration is up to speed, the HD gets updated automatically.

I think the point saqib was making was that you don’t want to have inserts if that new attribute is NULL (and therefore the same as the existing record)

So then in your example
Day 1: you insert HASH(Attr1|Attr2)
Day 2: you insert again because now it is HASH(Attr1|Attr2|null|null)

If you apply Dan’s suggestion, you would have to come up with logic that figures out there are two nulls at the end and leaves them out, so you end up with HASH(Attr1|Attr2), which matches the current record, and there won’t be an insert.

Why not?
Day 2 represents a new fact: “we have those columns and they are null” is different from not knowing those new attributes existed at all.

Your hd will differ and elegantly insert the new record

Day 1 12|34
Day 2 12|34||
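To make that concrete (illustrative only; the actual digest values don’t matter, only that they differ):

SELECT MD5('12|34');    -- Day 1 hashdiff
SELECT MD5('12|34||');  -- Day 2 hashdiff: different input, different digest,
                        -- so the row is treated as changed and inserted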

I agree, those are two different facts, but the end result is the same either way, in the sense that we have the same null values for Attr3 and Attr4 after Day 2. Would the sole reason to insert the record on Day 2 be that you can more easily prove that you did try to load the new attributes, but that they were still null? Whereas if you did not insert, you would probably have no easy way of telling…

NB for reference’s sake: Data Vault Anti-Patterns: Anti-pattern: Adding column to the middle of the hashdiff

But why create a “switch” architecture?
If you find this then do this, else do that.

If you let the data flow you reduce complexity and ultimately tech debt.

Well, since I was expecting to have a hard time incorporating this logic into the hashdiff calculation in dbtvault anyway, I wasn’t necessarily planning on it; I was just trying to voice saqib’s reasoning (while referring to Linstedt in the process).

In any case, dbtvault by default alpha-sorts the hashdiff input attributes (see HERE), so I guess that makes this a non-issue :slight_smile:
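For anyone following along, the alpha-sorting is tied to marking a hashed column as a hashdiff in the stage configuration, something like this (column names invented):

hashed_columns:
  CUSTOMER_HASHDIFF:
    is_hashdiff: true    # tells dbtvault to treat this as a hashdiff and sort the columns
    columns:
      - CUSTOMER_DOB
      - CUSTOMER_NAME
      - CUSTOMER_PHONE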

Yeah, I read about it. I don’t think this aligns with the Data Vault 2.0 loading patterns. It may cause some issues, probably when a field is removed from the source…

cc’ing @cmeyersohn for some guidance on this.

When a column is removed you keep feeding it with nulls going forward

That way you ensure no refactoring is needed, ever

The solution outlined in Dan’s book (add new attributes to the end) is simple and elegant. On the now-defunct DVA site, when this same question was posed, I replied with the textbook answer, and the DVA person replied that it’s best practice to spawn a new satellite with the new columns. She said that we should have been taught that in our certification class, and offered no explanation of why that was “best practice”.

dbtvault follows the standard in this regard. It’s up to you to make sure you keep feeding the missing/deleted columns with NULLs by using derived columns.
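As a sketch of what that could look like in staging (assuming a plain SQL staging view; the same effect can be achieved via dbtvault’s derived_columns configuration), keep the deprecated column alive by feeding it an explicit NULL:

SELECT
    customer_id,
    customer_name,
    CAST(NULL AS VARCHAR) AS fax_number  -- column dropped from the source feed;
                                         -- keep feeding NULLs so no refactoring
                                         -- of the satellite is needed
FROM source_feed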

That is correct. The recommended best practice is to create a new satellite to handle the new columns rather than alter the existing satellite and recalculate all of the hash diffs.

The question was what to do with new descriptive columns arriving in the data set. What I’m reading here has to do with columns that have been deprecated from the data set and perhaps replaced by new, different columns. In the case of columns being deprecated, I agree with Patrick: the load process will insert a null value into the column and should continue operating without modification or refactoring.

The new columns would spawn a new satellite with a completely different hash diff value, based on a completely different load process. None of the old code to load the original satellite should need to be touched. You will have a bit of refactoring to do, perhaps, to the source-to-stage or secondary stage process and/or target object.
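To close the loop, a minimal sketch of that new satellite as a dbtvault model (all names are hypothetical; the model for the original satellite is left untouched):

-- models/raw_vault/sat_customer_v2.sql
-- New satellite for the newly arrived columns, with its own hashdiff
{{ dbtvault.sat(src_pk='CUSTOMER_PK',
                src_hashdiff='CUSTOMER_V2_HASHDIFF',
                src_payload=['CUSTOMER_PHONE', 'CUSTOMER_EMAIL'],
                src_eff='EFFECTIVE_FROM',
                src_ldts='LOAD_DATETIME',
                src_source='RECORD_SOURCE',
                source_model='v_stg_customer_v2') }}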