I’m having a hard time understanding when in a hub you would create a record in the raw vault if you didn’t know what the record key value(s) are. I can understand coming up with this situation in the staging area but not putting it into the raw vault. What is the processing sequence to follow when you get a zero key in staging?
In general there are two situations where you need Zero Keys:
- BK is NULL and is required, e.g. an ITEM in an
- BK is NULL but it is optional, e.g. SHIP_TO_ADDRESS in an
If it is optional we set the value to -2 and hash that; if it is required we set it to -1 and hash that.
In the corresponding hubs we create records with the hash keys of -1 and -2, with the LOAD_DATE set to the beginning of time (1900-01-01). In the above case, these two records will go in both
This enables you to perform an inner join from the LINK_ORDER to the HUB_ADDRESS. You will also have a record in the LINK_ORDER of when the row with the NULL business keys arrived, and when it was updated. This can be used as input to the source system to see if there is a process improvement needed or a bug that needs to be addressed.
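The substitution-then-hash step described above can be sketched as follows. This is a minimal illustration, assuming MD5 over trimmed, upper-cased business keys joined with a `||` delimiter; the `zero_key` and `hash_key` helper names, and the exact hashing convention, are assumptions, not a standard implementation.

```python
import hashlib

# Hypothetical staging helper: substitute zero keys before hashing.
# -1 = required BK missing, -2 = optional BK missing (values taken
# from the discussion above; your standards may differ).
def zero_key(bk, required=True):
    if bk is None or str(bk).strip() == "":
        return "-1" if required else "-2"
    return str(bk).strip().upper()

def hash_key(*bks):
    # One common convention: delimit with "||", then MD5.
    return hashlib.md5("||".join(bks).encode("utf-8")).hexdigest()

# A staged ORDER row whose SHIP_TO_ADDRESS (optional BK) is missing:
addr_hk = hash_key(zero_key(None, required=False))
# Every row with a missing optional address produces this same hash key,
# which matches the pre-loaded -2 record in HUB_ADDRESS.
```

Because the substitution happens in staging, the standard hub loader then inserts the -1/-2 records with no special-case logic.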
Thank you for the explanation. This has raised another question that has to do with the satellites. Is the descriptive data loaded into the satellite or is that bypassed if a negative number is present in the hub or link business key? If the satellite is loaded, do you replace the hub hash key in the satellite record with the valid one when no negative key values exist?
You will need to load the descriptive attributes into the Link Satellite even if one of the BKs is NULL (-1 or -2). Then, when the BK actually gets populated, a new LINK_HASHKEY will be generated, which will have corresponding records in the SATs.
The key thing is that the LINK_HASHKEY will be different with one of the BKs as NULL vs. when the BK is not NULL. They will be treated as distinct parents in the Link Satellite as well.
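A tiny sketch of that point, under the same assumed hashing convention (MD5 over `||`-delimited, upper-cased BKs); the key values are invented for illustration:

```python
import hashlib

def link_hash_key(*bks):
    # Assumed convention: NULL BKs were already replaced by "-1"/"-2" in staging.
    return hashlib.md5("||".join(str(b).upper() for b in bks).encode()).hexdigest()

# Order 42 before the ship-to address is known (optional BK -> -2) ...
hk_before = link_hash_key("ORDER42", "-2")
# ... and after the source later supplies the real address BK:
hk_after = link_hash_key("ORDER42", "ADDR9")

# Different LINK_HASHKEYs: the satellite sees two distinct parent link rows,
# so both the NULL-era and the populated-era descriptive history survive.
```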
Ahhhhhhhhhhh. Thank you very much!!!
Hi, I just wanted to get some further clarification on this… Previously I have created zero key (-1 mandatory and -2 optional) records in each of my hubs and also in the adjacent satellite. Is the entry into the satellite even needed? Isn’t the satellite entry helpful for query performance?
For link tables, what is the standard guidance for distinguishing between a missing mandatory hub association and a missing optional one? Do you need just a hashed -1 entry, or should you distinguish between the two states and include a hashed -2 entry too, and would that necessitate a -2 entry in the adjacent hubs?
Different blogs/articles/books seem to suggest different things - Patrick’s Data Vault Mysteries article referenced doesn’t suggest that satellite entries are needed nor any distinguishing between mandatory and optional associations. Dan’s book does indicate the need for both -1 and -2 entries into hubs. Just not sure about satellites…
I only pre-load Ghost records in Sats
I never even pre-load zero keys in hubs! I don’t think it is necessary (tbh)!
Why would you need to load zero keys in Sats? I don’t think that makes sense!
The loading of the zero key record(s) into the Hub as part of the standard Hub loader kind of makes sense - the only thing I am wondering about is the importance of setting the load date to the dawn of time instead of when the first missing BK landed in the staging table. Does it even really matter? Not convinced it does.
In fact just inheriting the load date from staging tells you the first time it occurred
What do we expose into a virtual dimension from a Zero key perspective? Do we largely ignore Kimballisms such as having -1 to -4 default dimension records?
To me it makes sense to leverage the zero key records for free in the Information Mart to give us derived unknown (-2: optional missing) and error (-1: mandatory missing) dimension records. The Facts simply load the appropriate value to use from the source Link table.
Is it really necessary to apply a bunch of business logic in the virtual Information Mart layer to remap the zero key values in the Link to these special dimension records, which we would also have to virtualise? Sounds like hard work unless you really have a reporting use case to distinguish between ‘Missing’, ‘Bad’, ‘N/A’, ‘Not Happened Yet’ etc.
As for ‘Bad’ - well we expose the raw data anyway according to DV2.0 mantra (all of the facts, all of the time), so this doesn’t seem relevant.
For kimball you’re pre-loading dims and attaching the negative values to facts depending on what you want to report to your BI user about that missing key.
DV is not BI-user focussed, so we assign Zero Keys to imply optional portions of relationships. As we said, it is staged and naturally loads to a hub if/when it occurs (you can apply -1 to -4 in there if you like). What this means is that when you run your SQL over your links + hubs, the link with a zero key will always have something to join to, so you’ll use an equi-join and get the expected outcome. Sounds familiar, right? Ghost records work for sats with their parent hub or link and must be preloaded to achieve equi-joins; Zero Keys elegantly/naturally occur, if they occur at all.
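The equi-join point can be illustrated with toy in-memory tables (hash keys abbreviated to readable placeholders; real tables would hold actual hash digests):

```python
# Hub with its naturally-loaded zero-key rows alongside real keys.
hub_address = {"hk_-1", "hk_-2", "hk_addr9"}

# Link rows: the second order arrived with a missing optional address BK,
# so staging stamped it with the -2 zero key.
link_order = [("hk_order1", "hk_addr9"),
              ("hk_order2", "hk_-2")]

# An inner (equi-) join drops no link rows, because even the zero-key
# row finds a partner in the hub.
joined = [(o, a) for (o, a) in link_order if a in hub_address]
```

If the hub lacked the -2 record, the second link row would silently vanish from the inner join, which is exactly the outcome the zero key prevents.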
Now back to the crux of your question: what do you expose to the BI user? Depends on whether you want to show anything at all, and whether it has any meaning at the information mart for the BI requirement!
An optional portion (zero key) of a relationship should not have any satellite attributes associated with it, if it does and you’re getting a null value as the bkey then that is an issue that should stop processing altogether! You cannot have descriptive attributes describing nothing!
From that I infer I should create the Zero Keys as required via staging (which will do the hard work of loading the Hub for me), then expose to the Information Mart whatever is required to meet the BI use case, which may be repurposed Zero Keys or something else as needed. Sounds like I have leeway to choose what I think is reasonable; as a starting point I will expose repurposed Zero Key records and ensure they have meaningful on-screen default values in the reporting tools.
Re not allowing child descriptive attributes to point to parent ‘nothing’ BKs - makes sense!
@patrickcuba do we need ghost records in the link and hub tables?
In the book ‘data vault guru’, it mentions to add ghost records in the hub and link tables, but in the article here, it mentions that ghost records are not required for hub and link. Data Vault Mysteries… Zero Keys & Ghost Records… | by Patrick Cuba | Snowflake | Medium
Depends on the use case, mate. If you read that part in the book carefully, it was only used because the Big Data platform could not materialise a table for querying without having at least one record in there. So there is no conflict with what I’ve published.
One more scenario regarding a composite business key
Would like to ask what is the recommended practice about handling one or more parts to be null incase of composite business key?
Let’s say a composite business key with 3 columns in a hub. Bk1 + Bk2 + Bk3
If the values of all 3 business keys come as NULL from the source, then of course I think we can apply the replacement for all 3 to be “-1”.
Convert all to -1 in stage and process as normal.
This is well understood.
How about if only part of the BK comes as NULL from source system 1?
What I can think of is to replace Bk3 with “-1” and put it through the normal hash process, but then it introduces business data, i.e. “-1”.
What if source system 2 sends the actual combination as below? This will cause a collision with the hub record that I mentioned in Scenario 1.
Just want to be careful to not introduce some additional data in business keys.
Any thoughts on this?
Don’t think this scenario is that different from when your BK is not composite. -1 is the default in most literature because sources usually don’t send -1 values as business keys. If they do, you’re better off choosing a different value as zero key.
I currently use ‘#null_req’ and ‘#null_opt’ for my required and optional zero keys.
@FrenkLoonen ok. I think then the optional and mandatory zero key values we choose must be something unique across all systems in the enterprise. This could be difficult to determine, as we may be profiling data for only a limited set of systems at any one time.
If there is a BKCC in the hub tables, adding a BKCC value (system name) to make it further unique in the DV could be a consideration. What do you think?
The guidance, I think, is that if you choose an automation path it is generally better not to include a switch in that automation. So if you set -1 for a NULL single-column business key, or for a fully NULL composite, then the same applies if only part of the composite business key is NULL. However, I’d question whether that is a good enough representation of a business key for the same entity if parts of it can be null!