DV modeling with several sources and no nice business key

Hello,
I am quite a beginner in DV, but I have several years of experience in dimensional modeling.
I have some questions and challenges raised by DV itself and by the new practices that appeared with modern cloud data warehouses, which today allow us to do things we could not do in the past.

Here’s my situation:

  • I am modeling the Restaurant entity.
  • In the company, there is no viable business key for it: restaurants have nothing that identifies them 100% uniquely, so we have to use the IDs provided by the source systems.
  • We have 5 systems managing restaurants (NextGen, Legacy, Salesforce, Fenix, TA). Each has its own PK for a restaurant, but they also all reference the PK of another system.

You will find in my post the data model for the Restaurant table of each system (Salesforce has two: leads and converted accounts).

All source tables are snapshotted, so the snapshot date is also part of the primary key of each table (these are source tables ingested into our data lake).

Users want to see the image of a restaurant at a point in time.
They also want that view from a source-system perspective (so we have to be able to provide source IDs at all times).

My main challenge: the matches between the systems’ IDs change over time, and we want to keep the image of the match as it was at that point in time. The latest version of the FKs is not necessarily the right one; we need them as they were at the date we are looking at.

How would you suggest modeling this the DV way?
I’d like to end up with a simple HUB_Restaurant table and satellites carrying all the attributes coming from each source system.

I have several ways in mind, but I don’t have enough experience to guess what kinds of problems I may stumble into in the future.

Thank you very much for your help.
E.

You need consistent keys, full stop.
There must be a business that represents your restaurant uniquely and forever.
How are these businesses represented? Surely by something like a company registration number.

As far as mapping goes, that’s great. You need to think about passive integration across source systems. I have a segment describing that concept here: 1. Data Vault and Domain Driven Design | by Patrick Cuba | Snowflake | Medium

The goal of DV differs from dimensional modelling, maybe some Christmas reading for you! https://www.amazon.com/Data-Vault-Guru-pragmatic-building-ebook/dp/B08KSSKFMZ/ref=mp_s_a_1_1?crid=1P5RAI2M21VQB&keywords=data+vault+guru&qid=1671366924&sprefix=data+vault+guru%2Caps%2C533&sr=8-1

  1. Use the snapshot date as metadata for each loaddate.
  2. Use one hub-restaurant with bkcc, or hub-nextgen, hub-legacy, etc. with no bkcc. I prefer the latter option because it won’t mix different “styles” of identification in one hub.
  3. Create a sat for nextgen-restaurant (name), bk hub-nextgen. Create a rts (record tracking satellite) for hub-nextgen.
    Create a sat for legacy-restaurant (name), bk hub-legacy. Create a link between hub-legacy and hub-nextgen. Create a rts for the link.

Do the same kind of modelling for the remaining objects.

Your model will then be able to handle data/relations/bks that change over time. You will have 5 autonomous processes for loading your data, and you will not lose any data with this model. A sketch of the key generation follows.
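To make that concrete, here is a minimal sketch of the key generation this implies (Python; the MD5 hashing, column names and sample IDs are my assumptions, not anything prescribed above):

```python
import hashlib
from datetime import date

def hash_key(*parts: str) -> str:
    """Deterministic surrogate key: hash the delimited, upper-cased key parts."""
    payload = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# One hub per identification "style", so keys from different systems never mix.
def hub_nextgen_row(nextgen_id: str, snapshot_date: date) -> dict:
    return {"hk_restaurant_nextgen": hash_key(nextgen_id),
            "nextgen_id": nextgen_id,
            "snapshot_date": snapshot_date}

# The cross-reference (a NextGen row pointing at a Legacy PK) becomes a link,
# so a changed match is just a new link row; the rts records on which
# snapshot dates each version of the match was seen.
def link_nextgen_legacy_row(nextgen_id: str, legacy_id: str,
                            snapshot_date: date) -> dict:
    return {"hk_link": hash_key(nextgen_id, legacy_id),
            "hk_restaurant_nextgen": hash_key(nextgen_id),
            "hk_restaurant_legacy": hash_key(legacy_id),
            "snapshot_date": snapshot_date}

# A re-match later simply yields a second link row; both are kept:
jan = link_nextgen_legacy_row("NG-42", "LEG-7", date(2023, 1, 1))
feb = link_nextgen_legacy_row("NG-42", "LEG-9", date(2023, 2, 1))
assert jan["hk_link"] != feb["hk_link"]
```

Point-in-time questions (“which Legacy restaurant did NG-42 map to in January?”) are then answered from the link plus its rts, never by overwriting an FK.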

You can later “clean up” the mess in BV (the Business Vault).
Hope this helps!

  1. Don’t do that; the snapshot date should be your applied date if anything.
  2. What if IDs were the same in Legacy and in the new stuff? You’re creating a source-system vault!
  3. Hub-nextgen, what is that?

Hello mister Cuba!

  1. Using the snapshot date as metadata for loaddate is the same as using the snapshot date as applied date.
  2. If ids were the same in legacy and new stuff, use one hub. That is not a source vault.
  3. You know what hub-nextgen is.
  1. Incorrect: load date is when the data enters the vault; applied date is the business process outcome as it was APPLIED at that point in time (see the sketch below).
  2. When you split them for the sake of source, they are. Only use bkcc when a clash of business object identifiers can occur.
  3. No, hence why I ask. Assumption is the mother of all f……
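To make the two timelines concrete, a minimal sketch (Python; the column names are invented for illustration):

```python
from datetime import date, datetime, timezone

def sat_restaurant_row(hub_key: str, attributes: dict, snapshot_date: date) -> dict:
    """One satellite row carrying BOTH timelines.

    applied_date: when this state was true in the business (here, the source
                  snapshot date); it drives point-in-time queries.
    load_date:    when the row physically entered the vault; it drives audit
                  and stays honest even when old snapshots are backfilled.
    """
    return {
        "hub_key": hub_key,
        "applied_date": snapshot_date,             # business timeline
        "load_date": datetime.now(timezone.utc),   # technical timeline
        **attributes,
    }
```

Backfilling last year’s snapshots then yields old applied dates but today’s load dates, which is exactly why conflating the two loses information.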

Hello again.

  1. Stop pretending that you don’t understand.
    AppliedDate = businessdate = snapshotdate. If you don’t want data redundancy, then assign the snapshotdate to the loaddate. The loaddate is still the time when the data enters the DV.
  2. Use bkcc or different hubs. If you want messy bks in your hubs, use bkcc. If you want less messy hubs but more of them, use more hubs. By now you know my preference.
  3. Maybe you can provide us with a suggestion of a model that answers the original topic in this thread?

Kind regards!
  1. Incorrect.
  2. Wow, and you advise on DV? That is shocking.
  3. What is a hub-nextgen? I think you just realised your mistake; thanks for trying.

Maybe you can provide us with a suggestion of a model that answers the original topic in this thread?
Kind regards!

Thank you all for your answers; this gives me a lot to think about.
I haven’t fully cracked it yet.

They do not exist, unfortunately. We do not capture anything that could play that role (for instance a company registration number, which in theory would be a good fit).

These businesses are represented by their name + their address,
all free text entered in our apps, so data quality is very bad.

So there is no actual business key available.
Would this mean that DV does not work here?

  1. Mapping between Contexts. No overlap exists between contexts and no way to integrate the same business objects. No issue will exist in the integration to a common hub table however you will not be able to relate business objects between contexts (if they do exist) unless a mapping is supplied by either context (source-domain) or the solution is supported by master data management. Here you will likely see the use of a “same-as link” table for the mapping between contexts to solve equi-joins that overlap contexts — in effect, a context map.

This excerpt, from the article linked in the first answer, looks like a promising lead.
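For what it’s worth, a same-as link boils down to something like this (a Python sketch; the names are invented, and the master/source match itself still has to come from a source or from master data management):

```python
import hashlib
from datetime import date

def hash_key(*parts: str) -> str:
    return hashlib.md5("||".join(p.strip().upper() for p in parts)
                       .encode("utf-8")).hexdigest()

def same_as_link_row(master_id: str, source_system: str, source_id: str,
                     snapshot_date: date) -> dict:
    """One mapping fact: on this snapshot date, this source record was judged
    to be the same restaurant as this master record. Emitting one row per
    snapshot keeps the history of matches instead of overwriting it."""
    return {
        "hk_same_as": hash_key(master_id, source_system, source_id),
        "hk_restaurant_master": hash_key(master_id),
        "hk_restaurant_source": hash_key(source_system, source_id),
        "applied_date": snapshot_date,
    }
```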

This is an MDM issue, not a Data Vault issue. Invest first in something that fixes your master data with match/merge and key assignment.

Hi Nat,
Thanks for your answer.

I know that, and I completely agree with you.
Unfortunately, investing in an MDM infrastructure is not in our near-term plans, so in the meantime I have to find a solution.

Sorry to hear that.
I’m sure there’s a Dilbert cartoon just for that situation.

You might need to build a small match/merge MDM-lite with a basic interface just for this, in Python or something, and then use that as the source for your data vault. Something like the sketch below.
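A minimal sketch of that idea (the normalization rules, the threshold and the difflib matching are placeholders for whatever match/merge logic actually fits the data):

```python
import difflib
import re
import uuid

def normalize(text: str) -> str:
    """Crude cleanup of free-text names/addresses before matching."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

class MdmLite:
    """Tiny match/merge: assigns a stable master key to (name, address) pairs."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.masters: dict[str, str] = {}  # normalized "name|address" -> master key

    def assign(self, name: str, address: str) -> str:
        candidate = f"{normalize(name)}|{normalize(address)}"
        # Fuzzy-match against known masters; a close-enough hit reuses the key.
        close = difflib.get_close_matches(candidate, list(self.masters),
                                          n=1, cutoff=self.threshold)
        if close:
            return self.masters[close[0]]
        key = uuid.uuid4().hex  # a brand-new restaurant, as far as we can tell
        self.masters[candidate] = key
        return key

mdm = MdmLite()
a = mdm.assign("Chez Marcel", "12 Rue de la Paix, Paris")
b = mdm.assign("CHEZ MARCEL", "12 rue de la paix,  paris")
assert a == b  # same master key despite the free-text noise
```

The master keys it emits can then feed the hubs, with a same-as link recording which source records were merged on each snapshot date.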

Thanks for relating 🙂

Yes! This is exactly what we do: we set up an MDM-lite with mapping tables that we use to manually map the source values onto the ones we want to report on (dimension table values). It works pretty well, but it creates some extra workload for the BI team.

Now I’m going to find a way to integrate this into DV. Or maybe not DV.