What is a Source Data Vault?

Hi community,
you can read it in posts all over the internet: don’t build a Source Data Vault!

I wonder if there is a definition for a Source DV?
I would assume this term comes from the idea of building business ontologies as a first step and then modelling your DV2 according to your ontology.
To me this seems to be more related to the methodology of DV (without 2) or when you build an EDWH.

I remember from my DV2 training in 2017 that M. Olschimke (the co-author of the book with the fast car) told us that building the Raw Vault is easy and can be automated about 90% of the time. I also follow the approach of loading each stage table independently of the other tables, from source to the Raw Vault entities.
(If I do not misinterpret it, Dan also says that in his DV Modeling Specs, chapter 2.5, Staging tables:
[https://danlinstedt.com/wp-content/uploads/2018/06/DVModelingSpecs2-0-1.pdf] )

→ so I would expect this means I have to be as close to the source as possible?

Otherwise, if I had to make my Raw Vault more business-ontology specific, I guess I’d have to join and transform data before loading it into the Raw Vault, but this would complicate the loading algorithm and would not scale in the end?!

Furthermore, Patrick Cuba says:

After all you do not purchase a tool like Salesforce for its logo! You purchase Salesforce because it mostly fulfills your business processes, the data output of which is captured into Raw Vault, the gaps modelled into Business Vault!
[You might be doing #datavault Wrong! | by Patrick Cuba | Medium]

→ To me that sounds like most source systems are very close to the Business processes anyway and thus it would make sense to stay close to these source tables?

I also have the problem that Business usually has reporting needs that are closely related to the source systems but cannot be fulfilled on the source systems themselves (e.g. Salesforce, but also SAP), for example because history is missing.

So my approach would be:
1. Define “good” business keys and Hubs together with Business → passive integration should be achieved wherever possible!
2. For your use case, load each needed source table into the Raw Vault as close to the source as possible.

e.g: SAP environment, concept CUSTOMER:

  • HUB_CUSTOMER: filled from the KNA1 table, which is the master table for customer data. Unfortunately a proper BK is hard to find, as the name fields might not be unique; in my case that means using KUNNR, the customer number, which seems OK because Business also refers to its customers by that number.
  • SAT_CUSTOMER_KNA1_SAPSYSTEM1: load all columns from KNA1 into that SAT (for convenience I even add the name of the source table to the SAT name).
  • SAT_CUSTOMER_KNA1_SAPSYSTEM2: same as above (but might have different fields than the first SAT)
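To make the example above concrete, here is a minimal Python sketch using in-memory dicts and lists instead of real tables. Several things in it are assumptions and not from the post: the hashing convention (trim, upper-case, MD5), the function names `hash_key` and `load_customer`, the metadata column names, and the invented sample rows. It also assumes both SAP systems assign the same KUNNR to the same customer, and it leaves out satellite delta detection.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(business_key: str) -> str:
    """Hash a normalised business key (assumed convention: trim, upper-case, MD5)."""
    normalised = business_key.strip().upper()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def load_customer(rows, record_source, hub, satellite, load_ts):
    """Load one staged KNA1 extract into the shared hub and one source-specific satellite."""
    for row in rows:
        hk = hash_key(row["KUNNR"])
        # Hub: insert the business key only if it is not known yet.
        hub.setdefault(hk, {
            "customer_bk": row["KUNNR"].strip().upper(),
            "record_source": record_source,
            "load_ts": load_ts,
        })
        # Satellite: keep every source column exactly as delivered.
        satellite.append({
            "hub_customer_hk": hk,
            "record_source": record_source,
            "load_ts": load_ts,
            **row,
        })


# Invented sample data; the two systems deliver different column sets.
stage_sap1 = [{"KUNNR": "0000100001", "NAME1": "Example GmbH", "ORT01": "Berlin"}]
stage_sap2 = [
    {"KUNNR": "0000100001", "NAME1": "Example GmbH", "LAND1": "DE"},
    {"KUNNR": "0000200002", "NAME1": "Sample AG", "LAND1": "CH"},
]

hub_customer = {}
sat_customer_kna1_sapsystem1 = []
sat_customer_kna1_sapsystem2 = []
now = datetime.now(timezone.utc)

# Each stage table is loaded on its own; no joins between sources are needed.
load_customer(stage_sap1, "SAPSYSTEM1.KNA1", hub_customer, sat_customer_kna1_sapsystem1, now)
load_customer(stage_sap2, "SAPSYSTEM2.KNA1", hub_customer, sat_customer_kna1_sapsystem2, now)
```

The customer that exists in both systems ends up as a single hub row with one satellite row per source, which is how I understand passive integration; the hub holds two customers in total.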

So finally, is that a Source Vault or not?

I’d really be interested in other thoughts!?

Thanks a lot
Klaus

A Source Data Vault is what you get when you are not doing Passive Integration.

DV is a top-down model that aligns to Business Architecture.

Based on Business Capabilities this should mean hub tables are based on Business Objects.
I try to decipher this here: Time to upgrade your thinking on Data Vault | by Patrick Cuba | Snowflake | Medium

John Giles has written about it too, google “Elephant in the Fridge”

Finally, I have some guidance on Passive Integration here: https://medium.com/snowflake/a-rose-by-any-other-name-wait-is-it-still-the-same-rose-3a6202d1aecd

DV2.0 is not just about building a model, it is about mapping the data to the enterprise. Everything we need to know about a business object is in that hub, regardless of source system. A source system is a business rule automation engine, remember that. How do we ensure that the way they have represented our business rules maps to how we operate our business?

See: https://medium.com/snowflake/data-vault-recipes-edebd61ab8a6

Thanks a lot for the feedback.

Regarding your last paragraph:

Is it still a valid statement that you load your data from stage to vault without changing the data in staging? That is, neither transforming it nor joining it with other source tables would be allowed (the only exception being to make sure that the business key is in your table, as it is needed in the next steps…)?

As long as this is correct, my Raw Vault will always closely resemble the “business rule automation engines” that are my source(s)? And some might call that a “Source Data Vault”, but as long as I have the passive integration in place I am fine.
The final mapping to the ontology will take place in the Business Vault.

And I guess this is the point where I always had my troubles when I read e.g. the Elephant book or any discussion about Business Ontology, because I always asked myself: “How the hell should I map the source to the ontology without transforming the data before loading it to the Raw Vault?” Doing / finalizing it afterwards would be the trick…

And that, my friend, is the difference between Ensemble Modelling and Data Vault 2.0. If you are following the RV to BV path as you described above, you are not following Dan Linstedt’s approach; rather, you would be following the Hans approach. I know because I am certified in both.

Hi again,
that’s what I always felt but never knew how to express. I tried in the initial post but failed ;-(

The confusing part about Data Vault is that you read lots of stuff but do not really know which methodology it is based on: Ensemble Modeling or not…
Thanks Patrick for clarifying that. It is the ammo I need for future discussions :wink:

The core of the DV model (the Hubs and Links) must represent BUSINESS objects, each identified by BUSINESS KEYS. The Satellites are source aligned and provide the various descriptions (context?) of the relevant object (or relationship) from the perspective of a particular source system. This is the hard part of integrating data from numerous systems and is part of the beauty and value of the DV method. Once integrated in the DV, the translation of source data to the business information model is available for all others to use, eliminating the divergence and sins-of-the-silos that occur with tactical point-to-point approaches.

It’s harder to do but worth the effort so that it’s done once, diminishing diversity in models

If all you do is Vault the source system then you’ve done a fabulous DV of the source system but not solved the real problem of semantic integration of your sources

Russell

Hi Russell,
thanks for your insights. I did not understand your point regarding “semantic integration”.
I would do that (at least partly) in the Business Vault and most of it on the way to the Information Mart.

KR Klaus

Hi Klaus,

Raw Vault handles the difficult part of semantic integration: associating source systems’ data with business concepts. These are the hard rules, and this is the part of data integration that is hard and costly. DV allows this to be done once (correctly) and enhances the enterprise data asset portfolio.

In the Business Vault, differences in grain can be addressed (rollups) and differences in names can be resolved (synonyms). These are the soft rules. Here, further enterprise standards can be applied (glossaries, derivation rules, etc.), and specialised overrides can also be applied if there are semantic groups within the enterprise that have their own specialised lexicon.
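As a rough illustration of those two soft rules, here is a small Python sketch. Everything in it is hypothetical: the glossary entries in `SYNONYMS` (including the non-SAP names `CUST_NAME` and `TOWN`), the helper names `apply_synonyms` and `rollup`, and the sample rows. It only shows the idea, not how a real Business Vault would be implemented.

```python
from collections import defaultdict

# Hypothetical glossary: source-specific column names mapped to enterprise terms.
SYNONYMS = {
    "NAME1": "customer_name",
    "CUST_NAME": "customer_name",
    "ORT01": "city",
    "TOWN": "city",
}


def apply_synonyms(row: dict) -> dict:
    """Rename source columns to glossary terms; unknown columns pass through unchanged."""
    return {SYNONYMS.get(col, col): value for col, value in row.items()}


def rollup(rows, key, measure):
    """Sum a measure up to the coarser grain identified by `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


# Synonyms: two differently named source rows end up with the same column names.
from_sap = apply_synonyms({"NAME1": "Example GmbH", "ORT01": "Berlin"})
from_other = apply_synonyms({"CUST_NAME": "Example GmbH", "TOWN": "Berlin"})

# Rollup: order-level amounts summed to customer level.
orders = [
    {"customer": "100001", "amount": 10.0},
    {"customer": "100001", "amount": 5.5},
    {"customer": "200002", "amount": 7.0},
]
per_customer = rollup(orders, key="customer", measure="amount")
```

Both rules run on data that is already in the vault, so the raw satellites stay untouched and the rules can be changed later without reloading.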

Cheers
Russell