Hi all,
I’m new to DVUG, but have been lurking these pages trying to expand my knowledge of DV2.0 beyond the basics, and to decide if DV is the right way to go for a new data platform I’m designing.
I’ve read a few similar posts, most recently Modelling Question on Business Key Updating in source system, which led me to Seven Deadly Sins of Fake Vault. The term “Fake Vault” was coined at… | by Patrick Cuba | The Modern Scientist | Medium and You might be doing #datavault Wrong! | by Patrick Cuba | Medium. I think I’m in very real danger of breaking one or several deadly sins, but I’m still drawn to the flexibility DV appears to offer when integrating new sources, as my solution will need to scale across multiple third-party platforms over time.
My immediate concern relates to customer contact data from multiple third-party platforms (e.g. linkedin, mailchimp). Identifying a single, immutable BK for all sources feels impossible. The business would recognise firstname
, lastname
, and email_address
as the identifiers for many of these source systems, but these can obviously change over time. I could adopt the contact_id
from these platforms, with an appropriate BKCC, but that feels like I’m in danger of creating a source system data vault. Also, not all sources have a unique id outside of email_address
(for these platforms a new email address == a new person).
A lot of my perceived problems revolve around customer data, and I suspect the right approach in relation to MDM would be get this data into the CRM, and let that handle deduplication, however this approach has been vetoed. I recognise DV isn’t going to magically solve data governance and data quality issues - it may be that given the lack of cleanliness and control over the sources, that DV isn’t the right approach, but I’d love to hear thoughts from people who may have tried to apply DV to this type of challenge.
General advice welcomed, but to make it easier to respond, I think my question boils down which of these is the least bad idea:
a) Create HUB_customer_contact
with BKs as a composite of firstname
, lastname
and email_address
, with a Same-As-Link table built from the source contact_id
where it exists;
b) Create HUB_customer_contact
with BK as the contact_id
from the source platforms where it exists, the email_address
where no other unique id exists, a BKCC where required, a SAL table for deduplication, and hold a “bag of keys” satellite (Patrick warns against this in the 7 deadly sins post above);
c) Dispense with the DV2.0 model and go with something less-principled.
Thanks for reading.
Jeff