With DV2.0, we only do inserts. When we get full snapshots from a source, we’re able to implement Status Tracking Satellites to track when a record has been deleted. However, when I need to write a business rule that historically derives content from both regular satellites and status tracking satellites, the query can become complex quite fast.
In the Effectivity Satellite pattern, we combine data from the source as well as from the Effectivity Satellite itself, inserting data that hasn’t been factually delivered (the records where we set the end date ourselves).
What would be the main objections against doing something similar in a regular Satellite? We can easily change the pattern so that it detects if a BK has been deleted, and it inserts a ghost-like record for that BK, in which all attributes are NULL and the record source could say something like ‘SYSTEM’. Yes, performance will be affected because you have to do an additional select on the source table and the satellite (which you would be doing when loading an STS anyway) but if we leave that out of the discussion, what other objections are there?
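To make the idea concrete, here is a minimal Python sketch of the detection step (row shapes, the `SYSTEM` record-source convention from above, and the function name are illustrative assumptions, not a standard pattern):

```python
from datetime import datetime, timezone

def detect_deleted_bks(source_bks, satellite_rows):
    """Compare a full source snapshot against the satellite and emit
    ghost-like records for BKs whose latest satellite row is not already
    a ghost but which no longer appear in the source."""
    # Latest row per BK, in load order.
    latest = {}
    for row in sorted(satellite_rows, key=lambda r: r["load_ts"]):
        latest[row["bk"]] = row
    # A BK counts as active unless its latest row is system-generated.
    active = {bk for bk, row in latest.items() if row["record_source"] != "SYSTEM"}

    ghosts = []
    for bk in sorted(active - set(source_bks)):
        ghosts.append({
            "bk": bk,
            "attributes": None,           # all descriptive attributes NULL
            "record_source": "SYSTEM",    # marks the row as system-generated
            "load_ts": datetime.now(timezone.utc),
        })
    return ghosts
```

This is exactly the extra select on source and satellite mentioned above; the only new artifact is the ghost row itself.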
You could then even implement a Status Tracking Satellite on top of that, as a view (though its use case would probably be limited to certain analysis scenarios).
I realize this discussion might have been held (and shot down?) on numerous occasions in the last years already, I just couldn’t find any hits online.
There are a couple of reasons not to do status tracking in a delta satellite. The purpose of a delta satellite is to store the properties of a BK, not whether or not a BK exists. The loading process for your satellites will be much more complex if it also has to do tracking. Also, you need to store more data if you have more than one satellite for a hub in an incoming dataset.
If you’re combining the data around a PIT what would be the issue?
Status Tracking / Record Tracking (or a delete flag from a source) that hangs off the same hub is brought together using the PIT code and therefore you would get the right delete flag for that point in time (effectivity satellite only works off links).
This does not break the DV2 pattern and resolves the data for you in the info-mart.
BTW you can use the Status Tracking Satellite for either delta or snapshot — there’s no standard on how you populate that satellite, only that the “standard” way is to compare snapshots between landed and loaded.
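A sketch of that snapshot comparison in Python (the `I`/`D` status codes and row shape are assumptions for illustration):

```python
def status_rows(landed_bks, loaded_bks):
    """Derive Status Tracking Satellite rows by comparing the landed
    snapshot against the BKs already loaded: 'I' for newly appearing
    keys, 'D' for keys that have vanished from the snapshot."""
    inserts = [{"bk": bk, "status": "I"} for bk in sorted(landed_bks - loaded_bks)]
    deletes = [{"bk": bk, "status": "D"} for bk in sorted(loaded_bks - landed_bks)]
    return inserts + deletes
```

With a delta feed that carries a delete flag, the same satellite could instead be populated directly from the flag, without the set difference.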
I would be careful in populating the data satellite with non-auditable data beyond what the standards provide guidance for!
You could argue that the purpose is to store its descriptive context over time… which could include context about whether it existed or not
Yes, the loading pattern will be more complex, although I wouldn’t say it becomes much more complex.
True, with satellite splitting, in order to be consistent you’d have to do it in all satellites. But I haven’t seen that many deletes happening, so I think the extra number of records is negligible.
(side note: my own implementation of an STS is more like the Deletion Tracking Satellite… I don’t care about updates, only inserts & deletes)
When getting the data out towards my Information Mart (by joining the sats back to the PIT) I don’t have problems. The complexity sits in the potential business rules that I’m writing before I get to the PITs. For instance, I have a BR in which I need to (historically!) calculate an amount by combining amounts from two different satellites that hang off two different hubs, which both have their own STS and are connected by a link with an Effectivity Satellite on top. Even in a scenario where deletes never happen, the query will be complex, having to create SCD2-style CTEs for the satellites and combine all of their timelines.
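The timeline-combination step that makes those BRs painful can be sketched as follows (hypothetical row shapes; half-open `from_ts`/`to_ts` intervals are an assumption, not a DV standard):

```python
def merge_timelines(sat1, sat2):
    """Align two SCD2-style timelines (rows with from_ts, to_ts, value)
    into one timeline whose intervals carry the value from each side."""
    # Every from_ts/to_ts boundary from either satellite starts a new interval.
    points = sorted({r["from_ts"] for r in sat1 + sat2} |
                    {r["to_ts"] for r in sat1 + sat2})

    def value_at(sat, ts):
        # Value effective at ts, or None if no row covers it.
        for r in sat:
            if r["from_ts"] <= ts < r["to_ts"]:
                return r["value"]
        return None

    merged = []
    for start, end in zip(points, points[1:]):
        merged.append({"from_ts": start, "to_ts": end,
                       "v1": value_at(sat1, start), "v2": value_at(sat2, start)})
    return merged
```

Each output interval then carries both amounts, so the BR can combine them per interval; with two STSs and an Effectivity Satellite in the mix, two more timelines join this alignment, which is where the query blows up.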
If I were to consider tracking deletes in a satellite… I believe I can still be fully auditable and able to reproduce the source data sets… but will it from then on still be considered fake-vault, or am I just really bending the standards?
Please try it and then come back with your conclusions. It could be interesting.
I have used the technique of inserting a ‘tombstone’ row in a satellite to signal the deletion of temporal context (NOT representing a soft delete of the parent key) successfully since DV1.0. There can be any number of other satellites that have or had context for the parent key, and soft deletes of an instance of a “key set” in a HUB or LINK likely require (as @patrickcuba suggests) a Status Tracking / Record Tracking / Effectivity Satellite dedicated to maintaining the timeline of the key set. The deletion can be explicit (i.e. an auditable CDC message), inferred (some form of BATCH MINUS), or deduced via a soft business rule (e.g. retention policy, regulatory necessity).
Downsides of this I have seen are:
- complexity when splitting satellites: you must account for the possibility of multiple tombstones. Still automatable.
- query extraction needs to filter out those records, if that is what “the story” calls for (I use a flag on the satellite, but a completely NULL row might suffice). Also automatable.
- reversibility of deduced deletes must be maintained in order to remain auditable; otherwise this determination would be better served in the Business Vault.
Upsides I have seen:
- increased trust in DWH, as deleted context can be used or avoided as required.
- complete support for temporal queries against high volume integration (number of sources), and low or variable actual integration (not every key present in every source).
- utilizing this data for quality measures and volatility analysis
It is not in the standards today, but it has been vetted and has worked well.
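A minimal sketch of the tombstone pattern described above, assuming a flag column rather than a completely NULL row (names and row shapes are illustrative):

```python
def tombstone_row(bk, record_source="SYSTEM"):
    """A tombstone: same parent key, NULL attributes, and a flag so
    downstream queries can include or exclude deleted context."""
    return {"bk": bk, "attributes": None,
            "record_source": record_source, "is_tombstone": True}

def surviving_context(satellite_rows):
    """Latest row per BK, dropping keys whose latest row is a tombstone,
    i.e. 'the story' after filtering out deleted temporal context."""
    latest = {}
    for row in satellite_rows:          # rows assumed in load order
        latest[row["bk"]] = row
    return {bk: r for bk, r in latest.items() if not r.get("is_tombstone")}
```

Note that filtering is a query-time choice: auditing and volatility analysis would read the satellite including tombstones, while info-mart extraction would use something like `surviving_context`.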
Hope that helps,
Thanks for your reply!
Tombstone row, I like it
I don’t fully understand what you mean with “deletion of temporal context”. Does that mean a hard delete of the BK?
I use the term temporal context to describe the non-identifying attributes housed in a satellite, over time. These can be deleted from a specific source system, but that does NOT indicate the deletion of the BK, as there can be more than one satellite source housing independent temporal context for that key. The soft delete of a tombstone entry only signifies the deletion of that specific context, and says nothing about the BK itself. The soft delete of the BK can be managed using a record tracking/effectivity satellite whose purpose is to track the timeline/lifecycle of the PARENT BK (set).
ps. the term tombstone row originates from Kafka.
Hope that helps,
Yeah… used in Cassandra too — but tombstones there are hard deletes, propagated when in-memory data is flushed to SSTables through compaction!