We have a Multi-Active SAT that houses statuses. While the statuses have temporality, multiple statuses can exist at the same time. For this purpose we have added a Sub-sequence Number for each Status that exists at the same time. We are using a window function in Snowflake to generate the Sub-sequence Number. While this works, the generation of the staging table for this SAT is extremely slow due to the window function plus the large amount of data coming from the source.
- Is there a better way in Snowflake to implement this Sub-sequence Number than a window function?
- Is there a dbtvault way of generating this Sub-sequence Number that does not rely on a window function and is more performant?
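For context, a staging query of the kind described above would typically derive the sub-sequence with `ROW_NUMBER()`. A minimal sketch, assuming illustrative table and column names (`stg_status`, `hub_hashkey`, `load_date`, `status` are not the poster's actual schema):

```sql
-- Hypothetical staging query: number the statuses that share the same
-- key and load timestamp so each row in the multi-active SAT is unique.
-- All table and column names here are illustrative.
SELECT
    hub_hashkey,
    load_date,
    applied_date,
    status,
    ROW_NUMBER() OVER (
        PARTITION BY hub_hashkey, load_date   -- one group per key + load
        ORDER BY status                       -- deterministic ordering
    ) AS sub_sequence_number
FROM stg_status;
```

On a very large source this window forces Snowflake to partition and sort the whole staging set, which matches the slowness described; restricting the staging query to the current delta before applying the window is one common mitigation.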
Do you need the sub-seq?
The original purpose of including a sub-seq was for indexing on platforms needing it as a part of the primary key for the sat — since Snowflake does not have/need indexes do you still need to gen a sub-seq column? Does it serve any business value? If not I’d toss the sub-sequence column.
Hi @patrickcuba. We are using the Sub-Sequence Number as part of the PK for the row.
Our PK is:
BKs + Load Date + Applied Date (modifyts from source) + Sub Sequence Num.
Can you use a different column as the dependent-child key? And therefore not need to generate a sub-seq key.
@patrickcuba. I am not sure if I am following you.
Here is what the MA SAT looks like for Status. Widget is the Dependent Child in the LINK. Status is tracked in the SAT hanging from the LINK:
it’s very easy — it seems that your widget is the dep-key, but to make it look like an MSAT why not have the widget in the sat instead as the dep-key?
@patrickcuba. How will that help? We will still have multiple statuses for the same STATUS_DATE in the SAT.
I see… then make the status the dep-key
I think you are on the right track but I would suggest some changes.
- Do not include the Applied Date in the PK. The business key and the Load Date are enough to make each entry unique.
- You don’t necessarily need an MSAT. But be aware of what makes the business keys unique. For example: is it really the combination of H_Opportunity, H_Account and H_Employee that creates a unique entry, or is it possible to connect the satellite data on H_Opportunity alone? If that is the case you can connect the satellite data directly to H_Opportunity. You still have to keep the Link to be able to track which H_Opportunity, H_Employee, H_Account and widget combination was delivered from the source. I would also add the STATUS to the Link and connect an effective satellite to the link.
Hope this was helpful.
AppliedDate is the applicable date for the business rule outcome; it is optional but recommended for your data vault. If a record is superseded by a correction you do not need to replay that record: simply insert with the same applied date but a new load date, so the load date now acts as a “version” number of the record. If you choose to use the applied date then hash-key + applied date + load date is unique, and the DV is now bitemporal. If not, stick to hash-key + load date.
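A small illustration of that correction pattern; the satellite name, hash key and dates below are made-up values, not anything from the thread:

```sql
-- Original record: business-effective 2023-05-01, loaded 2023-05-02.
-- (sat_status, HK1 and all dates are illustrative.)
INSERT INTO sat_status (hash_key, applied_date, load_date, status)
VALUES ('HK1', '2023-05-01', '2023-05-02 01:00:00', 'OPEN');

-- A correction arrives later: same applied date, new load date.
-- hash_key + applied_date + load_date remains unique; the load date
-- versions the record, making the satellite bitemporal.
INSERT INTO sat_status (hash_key, applied_date, load_date, status)
VALUES ('HK1', '2023-05-01', '2023-05-07 01:00:00', 'OPEN CORRECTED');
```

Querying the latest load_date per (hash_key, applied_date) then yields the corrected view without replaying any history.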
What is an effective satellite? If it is the effectivity satellite then why do you need it?
I am aware of how the applied date works. My point is that the applied date should not be part of a primary key. The hash key + load date already makes the row unique. There is no point in having a primary key that is “overly” unique.
The effectivity satellite; I used to call them Enable Satellites before I found DV literature that described the effectivity satellite, which is why I mix up the name :).
You need one effectivity satellite for each dataset (flat data), as long as the dataset is not imported into a Non-Historized Link.
It is needed to ensure that deletes and changes to key combinations get stored in the Data Vault over time. It is a basic Data Vault concept.
I am not following you. Why not have the APPLIED_DATE in the PK? It is common for various applications to have multiple changes represented as separate records for the same BK in a single ETL load. If we load this data as is, then the LOAD_DATE alone won’t be unique. We need to have the APPLIED_DATE as part of the PK, the APPLIED_DATE being a stable temporal descriptive attribute from the source.
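To make the intra-batch scenario concrete, a sketch with made-up rows: one ETL load delivers two changes for the same BK, so both rows share the batch's LOAD_DATE (all names and values below are illustrative):

```sql
-- Both rows carry the same LOAD_DATE (the batch timestamp), so
-- hash_key + load_date collides; hash_key + applied_date + load_date
-- stays unique. Illustrative data only.
SELECT * FROM stg_status;
-- HASH_KEY | APPLIED_DATE        | LOAD_DATE           | STATUS
-- HK1      | 2023-05-01 09:15:00 | 2023-05-02 01:00:00 | OPEN
-- HK1      | 2023-05-01 17:40:00 | 2023-05-02 01:00:00 | CLOSED
```

Deduplicating to one change per BK per load would avoid the collision, but at the cost of losing the intra-day history the source delivered.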
Your definition of an effectivity satellite is probably incorrect; an EFS is used, once a driver key has been identified, only if the business date cannot be supplied by the source (source-supplied being the preferred option). An EFS is only about the relationship and nothing else.
By your logic every link will have a driver key; that is incorrect. For a practitioner overly concerned with the byte length of data types for performance, you would have EFS tables everywhere.
You are indeed living in the past.
I assumed that you only load one change per BK per dataset at a time into your satellites. That is why I suggested that you don’t need to extend your primary key with the applied date.
I am a little curious about the dataset you receive: does it contain only the changed records, and if a record is changed multiple times do you receive multiple rows? What happens when a row gets deleted in the source; is that also an entry in the dataset?
Thank you (again) for pointing out that I am living in the past!
You are drawing conclusions from my statement that are not correct.
“By your logic every link will have a driver key”. That is not correct; if you model 2 links from one dataset you only need 1 EFS.
I am overly concerned about performance, but I am also overly concerned that the data that gets imported into the system can be recreated as it was delivered. Data Vault has these abilities; why would you not care about this?
How would you model this data?
You don’t need an EFS at all
On a scale of complex satellites the EFS ranks as the most complex, which is why practitioners should consider it the least desirable satellite to implement. Look to RTS or STS instead.
Have you identified a driver key? You need one for an EFS.
That is why, and I’ll say it again, you should get the source to send you the business date of the changing relationship, rather than modelling your way out of technical debt yourself. The onus for tracking relationship changes is on the source, not on your data vault; therefore fewer tables to join, therefore better performance.
Ok, the source system cannot send the business date because of reasons.
I chose complexity over losing data every time!
Data vault systems are generated, so the complexity is not an issue after you get it right once!
Can you please model the data; I will show you my point.
I think I already made mine, where’s your driver key?
No, you did not!
How would you model this data?
Yes I did, where’s the driver key?
Not sure how or what you modelled as an EFS; clearly it wasn’t one.