The Data Vault Staging Balancing Act: Purity vs. Practicality

Hello AutomateDV community,

I’m new to Data Vault 2.0 and have a foundational question about the proper use of the staging area. I understand the principle that the Raw Vault should be an unaltered, auditable copy of the source data. However, it seems some level of transformation is necessary in the staging area to create the stable business keys required for Hubs.

I’m trying to define the precise boundary between transformations that are acceptable in staging (before the Raw Vault) and those that belong in the Business Vault.
Here are my specific questions:

1. Defining “Hard Rules” vs. “Soft Rules”
Where is the definitive line drawn between technical adjustments (hard rules) and business logic (soft rules)?

  • Basic Cleansing: Are operations like trimming whitespace (TRIM), casting data types (CAST), and standardizing case (UPPER/LOWER) generally accepted as hard rules for the staging area?
  • Composite Keys: Is concatenating several source fields into a composite business key (e.g., source_code || '-' || invoice_number) still an acceptable hard rule, or does it cross into business logic? I’ve sketched what I mean below.
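
To make this concrete, here is roughly the kind of hard-rule-only staging view I have in mind. All table and column names are invented for illustration, and I realize AutomateDV would normally express these derivations through the stage macro’s derived_columns and hashed_columns configuration rather than hand-written SQL:

```sql
-- Hypothetical staging view applying only hard rules:
-- no filtering, no aggregation, no business interpretation.
-- Every source row passes through exactly once.
CREATE OR REPLACE VIEW stg_invoices AS
SELECT
    -- Basic cleansing: trim, cast, standardize case
    TRIM(invoice_number)                        AS invoice_number,
    CAST(invoice_date AS DATE)                  AS invoice_date,
    UPPER(TRIM(source_code))                    AS source_code,
    CAST(amount AS DECIMAL(18, 2))              AS amount,
    -- Composite business key via simple concatenation
    UPPER(TRIM(source_code)) || '-' || TRIM(invoice_number) AS invoice_bk,
    -- Standard Data Vault metadata
    CURRENT_TIMESTAMP                           AS load_datetime,
    'ERP_A.INVOICES'                            AS record_source
FROM raw_erp_a.invoices;
```

My reading is that nothing here changes the meaning of the data, so auditability is preserved, but I would like confirmation that the key concatenation does not already count as a soft rule.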

2. Staging Strategy for Multi-Table Sources
What is the best practice for staging when the components of a single business key are spread across multiple tables within the same source system?

  • For instance, if Orders has the primary transaction data but the Customer master table contains a required part of the business key, should I create a staging view that joins these tables?
  • If so, how do I handle this without violating the “unaltered source” principle? Is the goal to present a unified, logical record for the Raw Vault load while maintaining clear lineage back to the original tables? A rough sketch of what I mean follows below.
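
Again with made-up names, this is the kind of join I’m picturing: it only assembles the key, keeps every component column visible, and records where the row came from:

```sql
-- Hypothetical staging view: Orders holds the transaction data,
-- but one component of the business key lives on the Customer master.
CREATE OR REPLACE VIEW stg_orders AS
SELECT
    o.order_id,
    o.order_date,
    o.customer_id,
    c.customer_region,                        -- key component from Customer
    -- Composite business key spanning both tables
    c.customer_region || '-' || CAST(o.order_id AS VARCHAR) AS order_bk,
    CURRENT_TIMESTAMP                         AS load_datetime,
    'ERP_A.ORDERS'                            AS record_source
FROM raw_erp_a.orders o
-- LEFT JOIN so an order with a missing customer row still reaches
-- the Raw Vault instead of silently disappearing.
LEFT JOIN raw_erp_a.customers c
    ON o.customer_id = c.customer_id;
```

Is the LEFT JOIN the right call here, so that orphaned orders still load (presumably with a default value substituted for the missing key component) and the data quality problem stays visible downstream?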

3. Integrating Multiple Source Systems
When integrating multiple source systems (e.g., two different ERPs), should the staging area always maintain a one-to-one relationship with the source tables?

  • Is it ever advisable to create one large, combined staging view per source system (joining all its relevant tables) before loading into the Raw Vault, or does that introduce too much complexity and obscure the raw data? I’ve sketched the one-to-one alternative below.
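
To show the one-to-one alternative I’m weighing (names invented again): each source table keeps its own thin staging view, and integration happens downstream because both views emit the same business-key shape and therefore load the same Customer Hub:

```sql
-- One thin staging view per source table, per ERP.
-- Neither view knows about the other; they only agree on the
-- format of customer_bk, so both feed the same Customer Hub.
CREATE OR REPLACE VIEW stg_erp1_customers AS
SELECT
    UPPER(TRIM(cust_no))     AS customer_bk,
    CURRENT_TIMESTAMP        AS load_datetime,
    'ERP1.CUSTOMERS'         AS record_source
FROM raw_erp1.customers;

CREATE OR REPLACE VIEW stg_erp2_clients AS
SELECT
    UPPER(TRIM(client_code)) AS customer_bk,
    CURRENT_TIMESTAMP        AS load_datetime,
    'ERP2.CLIENTS'           AS record_source
FROM raw_erp2.clients;
```

The per-table version seems to keep the raw data visible and the lineage obvious, whereas one wide combined view per ERP would hide it behind joins, but I’d love to hear which trade-off works better in practice.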

I would greatly appreciate any perspectives or practical examples of how you draw these lines in your own Data Vault implementations. My goal is to build a robust and auditable Raw Vault without over-complicating the staging process.
Thank you!

There’s a lot to unpack here. May I suggest you watch this video?

Thanks @patrickcuba! I will look into it.