Approach for handling the need for "pre-join"

I’ve come accross generic data sources that don’t hold business keys, but have to be joined with other structures to get to the descriptive data.
For example there are Case and CaseObject tables in CRM system, that depending on the range of data can store information about Offers, Assets or Contracts.
Whereas in Case table you can find keys of business objects, you have to join to CaseObject using technical ID, to retrieve descriptive data.

I can think of several ways to solve it, none is perfect

  1. Request business aligned interfaces from source system
    Approach: Do not import generic structures, but request proper interfaces from source system, that would encapsulate the logic (joins + filters)
  • this is the easiest way to handle the problem (dedicated interfaces for Asset, Offer, Contract)
  • in majority of cases not feasible (cost too high, too long to wait for it, technically not possible)
  1. Build DV structures to hold technical data
    Approach: build Case and Case Object hubs, link and satellites, then build Offer/Asset/Contract structures only in Business Vault
  • this solves the problem
  • but to my understanding it is not the “Data Vault way”, it would be an example of “Source Vault”
  1. Pre-join in Stage objects
  • Join and add filter (select range for Contract, Asset, Offer) on technical sources when loading proper business aligned DV structures
  • It does work, but
  • pretty complex, business logic in stage object, always full load needed or persistent staging area required to tackle dependencies if data load is incremental

What is your way of solving similar challenges?


  1. Best choice and note that there is a difference between a “source data vault” and using system keys in hubs.

Both alt1 and alt3 will force that soft business rules will be performed. This means that you loose data.

That’s pretty good assessment, I would reorder those into best to worst case

  1. As you have identified, make it the source-problem to give you decent data for you to model. Absolutely the best case and makes your DV cleaner. Never be shy to request this to be solved in the source.
  2. Solve it yourself in pre-staging but this means you have a point of maintenance, at least the DV stays clean.
  3. Worst case — solve it in DV with so many tables to join to solve for this that querying the data becomes complex and expensive and a Ball of Mud.

In this option we would create DV objects for specific structures of given system? They would not correspond to any business concept as contentwise they are mix of everything (offer, contract, asset).
Isn’t it source vault?

A “source vault” is subjective. John Giles, auther of the book Elephant in the fridge, divides data in data vault as business centric and source centric. Your example is definitely data that is source centric.
Even the greatest business model will not be able to injest all fields from a source system. I just accept that all stuff in a source system cannot be squeezed into a business centric model, some fields will be source centric.
I strongly avoid fixing stuff in the extraction code from the sources just because if wont fit the business model in the data vault. That will enevitably lead to more tech debt, misunderstandings, lower data quality and a poor dv implementation. The reason for this is that more logic will be inplemented in different places. Please be aware that the last is my opinion. Good luck with your implementation.

Pre-joining in stage is cumbersome in case of incremental loads. In particular case I’ve described it is possible to recognize the changed row based on the timestamp, if we don’t do it then ingestion step takes far too long. The difficulty is however, that we need the full current state of interrelated structures (Case+CaseObject) in order to get the correct results of the pre-join.
Is Perstistent Staging Area an answer to this challange? We could be loading the increments to dedicated PSA structures and retrieve the latest record per business key before joining both structures.
Is this aligned with best practices? Maybe you have already described it in some blog entry I have not read yet?


Someone in some location is going to have to pay the Pied Piper,

  • Source — not you
  • Staging – you
  • In the DV – your users

The latter two implies you will probably have to do it

1 Like