The Data Vault Staging Balancing Act: Purity vs. Practicality

Hello AutomateDV community,

I’m new to Data Vault 2.0 and have a foundational question about the proper use of the staging area. I understand the principle that the Raw Vault should be an unaltered, auditable copy of the source data. However, it seems some level of transformation is necessary in the staging area to create the stable business keys required for Hubs.

I’m trying to define the precise boundary between transformations that are acceptable in staging (before the Raw Vault) and those that belong in the Business Vault.
Here are my specific questions:

1. Defining “Hard Rules” vs. “Soft Rules”
Where is the definitive line drawn between technical adjustments (hard rules) and business logic (soft rules)?

  • Basic Cleansing: Are functions like trimming whitespace (TRIM), casting data types (CAST), and standardizing case (UPPER/LOWER) generally considered acceptable hard rules for the staging area?
  • Composite Keys: Is concatenating several source fields to create a composite business key (e.g., source_code || '-' || invoice_number) an acceptable hard rule, or does this cross into business logic? (See the sketch after this list.)
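
To make question 1 concrete, here is a minimal sketch of the kind of staging transformations I mean (the table and column names are made up):

```sql
-- Hypothetical invoices feed: all of these are "hard rule" candidates,
-- since they standardize form without changing business meaning.
SELECT
    TRIM(source_code)                   AS source_code_clean,  -- strip delivery-side whitespace
    CAST(invoice_number AS VARCHAR(32)) AS invoice_number_str, -- align data types across feeds
    UPPER(TRIM(source_code)) || '-' || CAST(invoice_number AS VARCHAR(32))
                                        AS invoice_bk          -- candidate composite business key
FROM raw_source.invoices;
```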

2. Staging Strategy for Multi-Table Sources
What is the best practice for staging when the components of a single business key are spread across multiple tables within the same source system?

  • For instance, if Orders has the primary transaction data but the Customer master table contains a required part of the business key, should I create a staging view that joins these tables (roughly like the sketch after this list)?
  • If so, how do I handle this without violating the “unaltered source” principle? Is the goal to present a unified, logical record for the Raw Vault load while maintaining clear lineage back to the original tables?
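
For concreteness, this is roughly the staging view I have in mind (all table and column names are hypothetical):

```sql
-- Orders holds the transaction, but one component of the business key
-- lives on the Customer master, so the view joins it in.
CREATE VIEW stg_orders AS
SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    o.order_amount,
    c.region_code,                         -- key component joined in from Customer
    c.region_code || '-' || CAST(o.order_id AS VARCHAR(32))
                  AS order_bk,             -- unified business key for the Hub
    'ERP1.ORDERS' AS record_source         -- lineage back to the originating table
FROM raw_source.orders   AS o
JOIN raw_source.customer AS c
  ON c.customer_id = o.customer_id;
```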

3. Integrating Multiple Source Systems
When integrating multiple source systems (e.g., two different ERPs), should the staging area always maintain a one-to-one relationship with the source tables?

  • Is it ever advisable to create one large, combined staging view per source (joining all its relevant tables) before loading into the Raw Vault? Or does this introduce too much complexity and obscure the raw data?

I would greatly appreciate any perspectives or practical examples of how you draw these lines in your own Data Vault implementations. My goal is to build a robust and auditable Raw Vault without over-complicating the staging process.
Thank you!

There’s a lot to unpack here; may I suggest you watch this video?

Thanks @patrickcuba! I will look into it.

Hello liberrenaud!
A lot of really good questions for someone new to Data Vault!
Regarding soft and hard rules: I would say TRIM is fine as a hard rule if the whitespace is not in the original data but is appended as a side effect of how the data is delivered. I would not apply UPPER or concatenation to replace the original content. However, you can perform the concatenation and upper-casing and store the result in another field in the satellite, creating a soft rule on the fly while still being able to recreate the content from the source.
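
A minimal sketch of that pattern, with made-up table and column names:

```sql
-- The original column stays untouched and the derived value sits next
-- to it in the satellite payload, so the source content is always recoverable.
SELECT
    customer_id,
    customer_code,                                    -- original value, loaded as-is
    UPPER(TRIM(customer_code)) AS customer_code_std,  -- derived field (soft rule on the fly)
    load_datetime,
    record_source
FROM staging.customer;
```

If I remember the AutomateDV docs correctly, the derived_columns map on the stage macro is meant for exactly this kind of additional field.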

For questions #2 and #3, I say no. Some Data Vault folks think it is wise to pre-join in the staging area. I do not agree, because it creates loading processes that are not autonomous and adds complexity.

Thanks @AHenning for your answer, much appreciated!