How does Data Vault detect deleted records?
Suppose there is a record in the source system with business key value 101 and some descriptive data. In the first load this record is loaded into the Data Vault tables (hub, link, and satellite). After that load, record 101 is deleted in the source system. Now the second load happens and loads new and updated records into the Data Vault tables: new records go to the hub, link, and satellite tables, and updated records go to the link and satellite tables.
Now how can I know which records were deleted in the Data Vault model?
The deleted record 101 is still present in the hub and satellite tables, right?
So how does Data Vault deal with deleted records?
I think I have explained it clearly…
You can/should use a record tracking satellite. This type of satellite registers the existence of a business key in the source systems.
In modern Data Vault we only INSERT (no DELETEs and no UPDATEs).
As @AHenning said, add a Record Tracking Satellite (RTS). In Data Vault it is used to record every instance a business key or a relationship was sent from the source system; here's a nice post about it:
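To make the RTS idea concrete, here is a minimal sketch (not any vendor's implementation) of the insert-only tracking logic: every load appends one row per known business key saying whether the source delivered it, and a key that stops appearing is inferred as deleted. The function name and row shape are illustrative assumptions.

```python
from datetime import date

def track_appearances(tracked_keys, todays_source_keys, load_date):
    """Insert-only record tracking: for every business key ever seen,
    append a row recording whether it appeared in today's feed.
    Nothing is updated or deleted in the satellite itself."""
    rows = []
    for bk in sorted(tracked_keys | todays_source_keys):
        rows.append({
            "business_key": bk,
            "load_date": load_date,
            "appeared": bk in todays_source_keys,
        })
    return rows

# Load 1 delivered keys 101 and 102; load 2 only delivers 102.
tracked = {101, 102}
rts_rows = track_appearances(tracked, {102}, date(2024, 1, 2))

# 101 did not appear in load 2 -> inferred as deleted in the source.
missing = [r["business_key"] for r in rts_rows if not r["appeared"]]
print(missing)  # [101]
```

The hub and satellite rows for 101 stay in place (as the question observes); the RTS rows are what let you reconstruct, for any load date, which keys the source was still sending.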
Personally, during my professional life I have seen many situations where DELETEs are propagated via a second system/flow. For example, pulling data from the JIRA REST API is easy and contains all the info, BUT it doesn't bring DELETEs. So we developed a second path using webhooks that sends just the deletes to an AWS API Gateway Lambda, which puts a DELETE record on an AWS Kinesis stream, which then writes to S3, where all the files land for Snowflake to load into the DV.
In many other situations the source system doesn't have JOURNAL tables or even DELTA column(s) to tell us when there was CDC activity, meaning even the source team doesn't know about deletes. Of course, some systems are better implemented than others.
The solution is proper CDC configured on the source systems:
Overall I felt/feel the first step for a proper quality-of-life implementation of a DWH (DV or any other pattern) is to guarantee a properly licensed CDC SaaS solution as the starting point… really… I just refuse to participate in any DWH project where proper CDC isn't put in place.
With CDC propagating changes into some cloud file-system bucket, you have real and proper deltas (INSERTs, UPDATEs, and DELETEs).
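As a minimal sketch of how such a CDC delta feed stays insert-only on the Data Vault side: each change record carries an operation code, and a DELETE becomes just another satellite row with a deleted flag, never a physical removal. The operation codes, field names, and feed shape here are assumptions for illustration, not any specific CDC tool's format.

```python
# Hypothetical CDC change feed: each record carries an operation code
# (I = insert, U = update, D = delete) plus the business key and payload.
changes = [
    {"op": "I", "bk": 101, "payload": {"name": "Alice"},  "ts": "2024-01-01"},
    {"op": "U", "bk": 101, "payload": {"name": "Alicia"}, "ts": "2024-01-02"},
    {"op": "D", "bk": 101, "payload": None,               "ts": "2024-01-03"},
]

satellite = []  # insert-only: deletes become rows, not removals
for c in changes:
    satellite.append({
        "bk": c["bk"],
        "load_ts": c["ts"],
        "is_deleted": c["op"] == "D",
        "payload": c["payload"],
    })

# The latest row for key 101 tells us it was deleted in the source.
latest = satellite[-1]
print(latest["is_deleted"])  # True
```

The point is that the delete information arrives as data, so downstream queries can filter on the latest satellite row per key instead of relying on full-feed comparisons.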
For reference (though I think these are now deprecated in DV 2.0), here are some old ideas aka workarounds:
- Add a LAST_SEEN column to the hub → UPDATE LAST_SEEN whenever the inbound feed contains the BK for any reason → agree with the business on a time frame after which a BK is considered deleted, and have a DV process flag this and mark the record as a DELETE. This can be done either with a hub column EFFECTIVE_TILL (changed from NULL to the LOAD_DT) or by adding a hub effectivity satellite with EFFECTIVE_FROM/EFFECTIVE_TILL columns, and having your DV deletes process INSERT your flagged record (WHERE DATE_DIFF(HUB.LAST_SEEN, today) > "x month(s)")…
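The LAST_SEEN workaround above can be sketched as follows. The grace period, column names, and month arithmetic are illustrative assumptions; the business would set the actual window.

```python
from datetime import date

GRACE_MONTHS = 3  # business-agreed window (assumption)

# Toy hub: business key -> LAST_SEEN / EFFECTIVE_TILL columns.
hub = {
    101: {"last_seen": date(2023, 9, 1), "effective_till": None},
    102: {"last_seen": date(2024, 1, 2), "effective_till": None},
}

def flag_stale(hub, today, grace_months=GRACE_MONTHS):
    """End-date any business key not seen within the agreed window."""
    for bk, row in hub.items():
        months_quiet = (today.year - row["last_seen"].year) * 12 \
                       + (today.month - row["last_seen"].month)
        if row["effective_till"] is None and months_quiet > grace_months:
            row["effective_till"] = today  # mark as logically deleted

flag_stale(hub, date(2024, 1, 31))
print(hub[101]["effective_till"])  # 2024-01-31 -> considered deleted
print(hub[102]["effective_till"])  # None -> still active
```

Note that this pattern UPDATEs hub rows, which is exactly why it clashes with the insert-only principle mentioned earlier and is treated as a legacy workaround.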
Hope this helps and that I didn't write too much…
@AHenning @emanueol Thank you for the information. I will explore that.
In our platform the upstream integration tool (Fivetran, in our architecture) detects if a record has been physically deleted in the source and appends an additional boolean field to the table as part of its metadata. If a system does logical deletes then we don't need to do anything, because we'll already have a field that tells us that.
Any record where that is_deleted = true should, theoretically, become the last-ever version of the data inserted for that upstream business entity.
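That "last-ever version" rule can be sketched like this. The staged rows and the generic `is_deleted` column name follow the post's wording and are assumptions (Fivetran's own metadata column is named differently in practice); the point is only that the delete flag rides along as the final satellite version.

```python
# Staged rows from the integration tool, each carrying a delete flag.
staged = [
    {"bk": 101, "name": "Alicia", "is_deleted": True,  "ts": "2024-01-03"},
    {"bk": 102, "name": "Bob",    "is_deleted": False, "ts": "2024-01-03"},
]

satellite = {}  # bk -> list of versions, insert-only
for rec in staged:
    satellite.setdefault(rec["bk"], []).append(
        {"name": rec["name"], "is_deleted": rec["is_deleted"], "load_ts": rec["ts"]}
    )

def is_active(bk):
    """A key is active only if its latest satellite version isn't a delete."""
    return not satellite[bk][-1]["is_deleted"]

print(is_active(101))  # False -> deleted upstream
print(is_active(102))  # True  -> still active
```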