Hey, Frenk, I thought I should probably weigh in here, so excuse the interruption.
I want to correct your statement about having to “follow the rule of DV of bringing in all columns.” There is no Data Vault standard (or rule) that states you must bring in all columns. What Dan actually said, and what internet sources and many others often misquote, is that in DV you load “100% of the data, 100% of the time, WITHIN SCOPE.” Many developers and even practitioners leave off the “WITHIN SCOPE.” That’s where your struggle lies.
My general observation here is that if you are extracting the data into the raw vault from a persistent staging area (PSA), and you can guarantee that the data in that PSA will be available to your DV team should you require it, then only bring in what is in scope. Make sure that you only compute the hashdiff in the satellite from the actual data elements or attributes that you are loading into the satellite (plus any of the DV system elements that you may be using, such as a business key collision code and/or a multi-tenant identifier). This ensures that you will only insert satellite records whenever a change occurs in the data set represented in the satellite. Nothing more, nothing less.
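To make that concrete, here is a minimal Python sketch of a hashdiff computed only from the in-scope attributes plus the DV system elements. Everything in it is an assumption for illustration: the column names, the "||" delimiter, the whitespace trimming, and the choice of SHA-256 are hypothetical and should be replaced by whatever your own hashing standard prescribes.

```python
import hashlib

def compute_hashdiff(record, in_scope_attributes, system_fields=(), delimiter="||"):
    """Hash only the system fields and the attributes loaded into the satellite.

    Delimiter, trimming, and SHA-256 are illustrative assumptions,
    not a Data Vault standard.
    """
    parts = []
    for name in (*system_fields, *in_scope_attributes):
        value = record.get(name)
        # Treat missing or NULL values as empty strings so the input stays stable.
        parts.append("" if value is None else str(value).strip())
    return hashlib.sha256(delimiter.join(parts).encode("utf-8")).hexdigest()

# Hypothetical satellite scope: two descriptive attributes plus two system elements.
in_scope = ["customer_name", "city"]
system = ["bkcc", "tenant_id"]

day1 = {"bkcc": "CRM", "tenant_id": "T1", "customer_name": "Acme",
        "city": "Denver", "last_login": "2024-01-01"}
day2 = dict(day1, last_login="2024-01-02")  # only an out-of-scope column changed
day3 = dict(day2, city="Boulder")           # an in-scope column changed

h1 = compute_hashdiff(day1, in_scope, system)
h2 = compute_hashdiff(day2, in_scope, system)
h3 = compute_hashdiff(day3, in_scope, system)

print(h1 == h2)  # True: no new satellite record is needed
print(h2 == h3)  # False: a new satellite record is inserted
```

The out-of-scope column (last_login here) never reaches the hash input, so a change to it cannot trigger a satellite insert.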
If you need other aspects of the PSA data set that weren’t loaded in this initial effort, then they may be added at a later date. As was pointed out in earlier responses, you can create one or more additional satellites to hold the “missing” descriptive data when you need it. Each of these new satellites will have a hashdiff column whose value is computed using a completely different set of source or PSA data elements.
The question is, when do you want to pay the piper for computing the hashdiffs for these satellites? One thing to think about is, how much more does it cost you to compute the hashdiffs on 50 billion records later versus on 5 billion records NOW? What is the velocity of the data growth? What happens when, not if, the business asks for more descriptive data (because you know they will)?
If the data volume is growing at a steady rate, you can calculate a reasonable estimate of the data volume 6 months from now. What will it cost to ingest the additional columns and compute the hashdiff for the new satellite 6 months from now? You are betting that the business either won’t ask for the data you’re excluding or that you can better handle the volume in 6 months.
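As a rough back-of-the-envelope sketch of that estimate in Python: the 10% monthly growth rate below is a made-up figure, the compounding model is an assumption (your growth may be linear or seasonal), and the function name is hypothetical. Plug in your own measured numbers.

```python
def projected_rows(current_rows, monthly_growth_rate, months):
    """Estimate the future row count, assuming steady compound monthly growth."""
    return current_rows * (1 + monthly_growth_rate) ** months

current = 5_000_000_000  # rows in the PSA today (illustrative)
rate = 0.10              # assumed 10% growth per month (hypothetical)
months = 6

future = projected_rows(current, rate, months)
ratio = future / current

print(f"Projected rows in {months} months: {future:,.0f}")
print(f"Volume multiplier: {ratio:.2f}x")
```

If hashing cost scales roughly linearly with row count (also an assumption), the multiplier gives a first approximation of how much more a deferred backfill of the new satellite would cost than loading those columns today.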
These are just a few of the factors that the team must consider when evaluating the missing piece of the quote. In short, this is part of what factors into the decision of what is “in scope” and what is not. “In scope” is not restricted to “what the business is asking for today” when it comes to loading data into the raw vault. “Within scope” depends on any number of business and infrastructure (or technical) considerations.
I hope that this was somewhat helpful.
Respectfully,
Cindi Meyersohn, DataRebels
Certified Data Vault 2.0 Instructor