We have a business concept of a Patient that has an identifier that can be used as the business key. At the moment we are getting the data from tables in source system 1 but there will be source system 2 coming along shortly.
For each, patient there can be multiple active addresses with the source system proving the effective dates. The PK of the source system table is patient_identifier, address_type, date_start. We can identify delta loads for new or updated rows.
If a user entered an address against the wrong patient and later removed it will be hard deleted in the table. As part of the ETL process, we have to extract all primary key values to check for any that we have see before that no longer exist. It is also possible that a previously hard deleted address could be reinstated and appear again.
The same is true of the patient name. Each patient can have multiple names to cover name before married or an alias etc. These can also be hard deleted if a name is entered against the wrong patient.
After doing a lot of research and discounting various options as not being acceptable in DV 2.0 and that may impact on some of the benefits of using DV in the first place.
The only option, I can currently see is treating the name and address as core business concepts and creating Hubs for them with Links back to the patient. This would then give options to manage the hard deletions, like Status Tracking or Effectivity Satellites. Neither, the name or address data has a great business key.
I’m fairly new to DV myself so just throwing out a couple of thoughts as a starter.
If they can have multiple names / addresses do you have dependent child keys to partition each name / address within the SATs? If so, you could create a LINK which would include the HUB patient key and the dependent child keys and then create a Status Tracking Satellite for the LINK.
Alternatively, have you considered making the SATs Multi-Active SATs? As in, loading the attributes in SETS. Any change to the number of active attributes (e.g. different addresses) or any change to any of the details of the attributes (e.g. address type) would mean that you re-insert all of the active attributes again. The most recent (non-deleted) SET of attributes (should have the same Load Date) and can be identified by using Row Number when partitioning on HUB Key and ordering by Load Date. You have the effective dates provided by the source to allow for Point-in-Time capabilities (I’m unsure how you would handle PIT requirements in Multi-Active SATs where there a hard deletions, no notification of the delete from source, and no source provided effective dates…??)
What @Carl said about MSATs is probably the way to go.
If there is a change in the SET of addresses to a patient then it will not be in the active SET any longer, it’s not deleting anything in the DV (and it shouldn’t anyway) and your DV will reflect what was active and what is active as a set of addresses.
I had discounted MSATs due to the fact that the data feed is incremental/delta, so will not get the full set of addresses only the changed ones.
I really like DV, but do find it really challenging sometime when really with 3rd party source systems that cannot be changed to be more helpful with the supply of data.
In our DV model we have lots of HUBs that have many SATs due to different source systems or functional areas. Some of these SATs have dependent child keys and some don’t. Most of the source systems have hard deletes.
It would make the DV model to complex and cumbersome to model everything as HUBs with LINKs and they wouldn’t be true business concepts.
I did some further reading yesterday and found something I had missed regarding the addition of a “tombstone” row into the SATs to indicate a source system table hard delete. The SAT structure would require an additional metadata column to track the status. Looking at our metadata driven SAT loader code the impact of doing this would be minimal.
However, I’ve seen no reference in any DV book to this as a valid solution within the DV 2.0 specification. I’m always mindful of the key principles of DV which the model must be scalable, allow data to be auditable and to create the original source files (in our case tables). I have tried to find a reason why this slight modification to how we use SATs would be a bad idea.
What does everyone think of this approach? Good or bad? Storing up problems for the future?
On my DV journey I’ve made a fare few mistakes but that has only go on to improve my learning and knowledge. The benefits of using DV cancel out the few minor problems experienced along the way.
Once again many thanks for your valuable insights, much appreciated.
Depends if you have “weak” hubs or not; I have seen modellers model every key as a hub and then determine whether it is a strong or weak hub. This is incorrect and not the way to model a DV.
If your data source is a trickle feed and you are faced with deletions then you must have a source provide what that deletion is. If you are left comparing source vs target to determine what is deleted then consider a status tracking satellite pattern. This avoids adding a tombstone row, if you use this pattern then does the absent row exist in the source? No of course not.
DV is not for everyone and I’d never try to convince you that you must do a DV. DV requires discipline, training and/or coaching and yet not all model situations should end up as a DV either!
Just for clarity, if the source system can provide that a hard deletion has occurred, say maybe through CDC, can we utilise a tombstone row in the SAT?
Identifying and tracking deletions is one of the concepts I am most struggling with in DV.
CDC issues a delete record and therefore repeats the entire row + the CDC upset metadata, would it benefit you to instead push those deletion markers into a status tracking satellite? Could this be an appropriate case of satellite splitting? I’d say so. Then when you’re utilising the sats plus status tracking in a PIT you can retrieve the fact that the record has been deleted in the source.
In a previous project, I did use the status tracking satellite method.
I guess the only reason I raised the question was due to the description of Status Tracking Satellites in Chapter 6 Peripheral tracking satellites in your book. When you referred to “business entity” I had taken that to mean BK in the HUB and not all the depended child key SATs hanging off that HUB.
I have read section 6.1 many many times, however it would appear not as carefully as I first thought. I have now read it again and seen a few points that that would allow status tracking satellites to track hard deletions for a source system table with dependent child keys. The status tracking satellite would still be linked to the HUB as SAT to SAT links are not allowable.
Apologies for the confusion and thanks for the additional pointers in this thread.
Status model is about I, U and D of the entity; you’re right to review. STS is not about tracking the dep-keys but you could use a similar outcome to track your records. As long as it simplifies how/what you’re querying I don’t think you would be breaking any rules - afaik
Thank you for your reply. I can understand using a Status Tracking Satellite to track the business entity or relationship against a LINK, however, I’m still a little confused how you would track the deletion of a dependent child when the dependent child has been recorded in the Satellite and not in the LINK. The parent entity still exists but it is the dependent child that has been deleted and this is not held in the LINK.
Is this where you are referring to a good use case for Satellite splitting? Would you mind elaborating more on this concept please?
As always, thank you for taking the time to respond to questions on this group. I have definitely found this user group invaluable.