Raw Vault to Information Layer (Dim) Questions

I’ve read Linstedt’s book, but I find it light on practical details and I have some questions. For background, I’m working on a financial project with over a dozen different vendor data sources that all have various types of accounts, products, and transactions and I’m bringing all of this into a conformed Raw Vault. I’m about 3 sources in and so far I’m able to keep it in the shared hubs and I intend to get it all in.

  1. Do I need a discrete Business Vault Layer after my Raw Vault, or can I just integrated the PiT, Bridge and other advanced DV objects in the same database as the Dimensional Model?

  2. When designing my facts and dimensions, should I be spending energy to re-use my conformed model, where I’ve taken 10 or more different sources and conformed all accounts together, or should I just have 10 different Star schemas by subject area and make it easy on myself?

  3. HKeys: I just keep using my HKs as PKs in the Information Layer right?

  4. Information Level Foreign Keys: To what degree do I bother to create all that structure when it’s all in the corresponding satellites?

  5. Links. If a link exists between two hubs, does every data element have to be able to be expressed through it? Can I have links that are source dependent between hubs, or do all members of the hub need to be able to use the same link?

I think that’s enough for my first question, I really appreciate any help I can get and thank you for taking the time to read.

  1. All objects are in the same database. The business vault is a sparse overlay on the raw vault. You might want to separate presentation layer into one schema per mart.
  2. Reuse the conformed Data Vault objects. Depends on your use case if you want to split them back into source system data sets there is nothing stopping you.
  3. Yes. But some engineers inject an integer as the PKs in the information layer to speed up queries/reduce storage.
  4. Not needed unless you are building indexes. Snowflake doesn’t enforce them.
  5. Links are units of work, source system independent. If the same UoW exists in multiple source systems they will all use the same Link. Not all Hub records need to use the Link, if there is no UoW involving a Hub row then it won’t have a Link entry.
2 Likes

@neil.strange point 3 - Can we go with the natural keys in Dims/Facts rather than PK(SK) in Dims/Facts? My question is what would be the recommend way to choose keys i.e., natural keys vs SK’s (PK) in dimensional modeling?

You can’t use natural keys if you need type 2 etc slowly changing dimensions (or, you can, but you need to take time into consideration in every query). The surrogate PK needs to encapsulate the time and the hub key as per normal Kimball.

Whether you use integer or hash for your SK is up to you. There are query performance (depending on database), idempotency (hashes are more idempotent), ease of implementation (trying to do an integer SK in dbt is difficult) etc considerations to take account of.

1 Like
  1. PITs and Bridges are snapshots of keys and load dates from the underlying Data Vault; whether those include BV satellites or not is not important. PITs and Bridges can use RV content alone, or include BV too – concur with Neil
  2. If you can reuse the original DV content, even better! Data Vault on Snowflake: Expanding to Dimensional Models | by Patrick Cuba | Medium
  3. To join on yes, do not expose these to business users
  4. see point 2
  5. Interesting topic, RV links come from a source, if multiple sources have the same unit of work then do we have redundant source systems? Maybe not, maybe the same relationship is depicted in another source system with more facts about that relationship, thus you may want to reuse the same link and add that source specific satellite.

RV is not conformed, only the hub business keys, everything is strictly not conformed

1 Like