Filtering content of the source - hard rule?


I have a question about where to place data filtering logic.
There is an interface that mixes data for different concepts (e.g. entries for both offer and contract).
Is it best practice to filter the subsets when loading the hub/satellite of the offer and the contract respectively?
Or is it better to load everything twice, into both sets, and filter in the business vault?
In general, there may be two kinds of filter needed to handle this and similar cases:

  • simple filter - the entry itself carries an attribute that determines the type of the object (offer, contract)
  • pre-join - the data source needs to be joined to another source in order to determine the type of the object.

Is there a best practice for handling this kind of complexity in data sources?
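To make the two cases concrete, here is a minimal sketch. The field names (`entry_id`, `object_type`) and the lookup source are hypothetical and only illustrate the distinction; they are not taken from the actual interface.

```python
# Hypothetical rows from the mixed interface; all field names are assumptions.
entries = [
    {"entry_id": 1, "object_type": "offer", "amount": 100},
    {"entry_id": 2, "object_type": "contract", "amount": 250},
    {"entry_id": 3, "object_type": None, "amount": 75},
]

# Hypothetical second source, needed only for the pre-join case.
type_lookup = {3: "offer"}


def simple_filter(rows, wanted_type):
    """Case 1: the row itself carries the object type."""
    return [r for r in rows if r["object_type"] == wanted_type]


def prejoin_filter(rows, lookup, wanted_type):
    """Case 2: the type is only known after joining to another source."""
    return [r for r in rows if lookup.get(r["entry_id"]) == wanted_type]


offers_simple = simple_filter(entries, "offer")
offers_joined = prejoin_filter(entries, type_lookup, "offer")
```

In this sketch entry 3 is only recognised as an offer through the join, which is why the pre-join case is harder to place: its result depends on a second source being available at load time.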


I’m not completely clear on the core issue here. Do you have sample source data that you can share?

It sounds like your source is just a relational table (i.e., junction table, associative entity, cross-reference [xref], link, many-to-many, etc.); however, I may just be misunderstanding that.

Given what you’ve shared, it sounds like you’ve considered the major business concepts to model for this particular use case, correct?

I would start with the overall business landscape. If you treat quotes and orders/contracts as different things (and most companies do), I'd split in staging, so that the hubs for quotes and orders can be managed separately. The same applies if you expect multiple systems, where some keep the concept of a quote separate from an order/contract and one has them integrated.

Xero does this. It is quite weird in my experience.

By exception only,

  • Ideally the source provides the pre-joined and filtered content; this should be part of your interface contract / SLO
  • If you do this in your staging, it becomes a point of maintenance for you, but that is still better than filtering afterwards in the data vault

The problem is that with public APIs you usually don't get much of a choice about what you receive. And the public API pattern is becoming far more common in my experience.

The data source is a custom-built system that provides data mostly in a mixed format (generic technical attributes, plus business content in XML format).
Part of the XML indicates whether a given entry describes an Offer or a Contract concept.
Currently the source cannot provide proper interfaces (one for offer, another for contract). We are therefore loading the source 1:1 and, in the stage, applying filters or joins to another source in order to figure out the context.
Is filtering or pre-joining justifiable in this case?
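As an illustration of the staging step described above, the following sketch reads the type indicator out of the XML payload and routes each row accordingly. The XML layout (a `<type>` element under `<case>`) and the column names are assumptions, since the real format was not shared. Rows that cannot be classified are kept in their own bucket rather than dropped.

```python
import xml.etree.ElementTree as ET

# Hypothetical staged rows: a technical key plus the business content as XML.
staged = [
    {"case_id": "C1", "payload": "<case><type>Offer</type><amount>100</amount></case>"},
    {"case_id": "C2", "payload": "<case><type>Contract</type><amount>250</amount></case>"},
    {"case_id": "C3", "payload": "<case><amount>75</amount></case>"},
]


def object_type(payload):
    """Return the text of the <type> element, or None if it is missing."""
    return ET.fromstring(payload).findtext("type")


def route(rows):
    """Split rows by object type without discarding unclassified ones."""
    routed = {"Offer": [], "Contract": [], "unclassified": []}
    for row in rows:
        kind = object_type(row["payload"])
        bucket = kind if kind in ("Offer", "Contract") else "unclassified"
        routed[bucket].append(row)
    return routed


routed = route(staged)
```

Keeping the `unclassified` bucket means the row counts across all buckets still add up to the number of staged rows, so nothing disappears silently if the indicator is missing or carries an unexpected value.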

It is never justifiable to apply soft business rules before loading the data into the raw vault. Filtering and pre-joins are examples of soft business rules. Don't lose data!

Even if your business concepts don't match what you have in the source? Not sure I buy that.

Hello Nat!
I have never seen business concepts align 100% with the sources.

Hi Henning,
What is the alternative in your opinion?
Creating technical objects in Raw Vault (e.g. Case, CaseObject, CaseCalculation in our case) to capture the data and then the proper objects in the business vault (Offer, Contract)?


Basically yes. I accept that some parts of the data vault are source-centric.
I would never ask the source system to filter or apply any fuzzy logic, because it always ends the same way: with you losing data!
Good luck. :grinning:
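A minimal sketch of that split, under stated assumptions: the `Case` structures, the column names, and the MD5-based hash key are illustrative choices, not a prescribed implementation. Every source row lands in the source-centric raw vault objects, and the Offer/Contract distinction is applied only afterwards as a soft rule in the business vault.

```python
import hashlib


def hash_key(business_key):
    """Hash a business key; MD5 over the trimmed, upper-cased key is an assumption."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


# Hypothetical source rows, including one with an unexpected type.
source_rows = [
    {"case_id": "C1", "object_type": "Offer"},
    {"case_id": "C2", "object_type": "Contract"},
    {"case_id": "C3", "object_type": "Draft"},
]

# Raw vault: source-centric Case hub and satellite, loaded without any filter.
hub_case = {hash_key(r["case_id"]): r["case_id"] for r in source_rows}
sat_case = [{"case_hk": hash_key(r["case_id"]), **r} for r in source_rows]


def business_view(sat_rows, wanted_type):
    """Business vault: the soft rule is applied after loading, not before."""
    return [r for r in sat_rows if r["object_type"] == wanted_type]


bv_offer = business_view(sat_case, "Offer")
bv_contract = business_view(sat_case, "Contract")
```

The `Draft` row appears in neither business vault view, but it is still present in `sat_case`, so a later change to the classification rule can pick it up without reloading the source.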