Excluding data towards RDV that has no business value

Hi all,

I’m struggling with the decision to exclude data that has no business value when loading the Raw Data Vault. After all, from an auditability point of view we should be able to recreate the source file… which is also why I try to follow the Data Vault rule of bringing in all columns.

But what if I will never ever use the data?

I have a source that contains parties and their ‘communication methods’. The business is really only interested in the official method (indicated by a ‘type’ field), and doesn’t care about any of the others (preferred, emergency, whatever). The volumes are pretty high though, so performance-wise I would really benefit from not having to load them (and I could also simplify the model a bit).

When can we actually speak of there being ‘no business value’?

Hi Frenk,

If you really don’t want to load them, what is stopping you from excluding the attributes now and, if needed in the future, loading them into a new satellite? Or consider satellite splitting - I have no idea if the rate of change etc. is different, but it may make sense to split anyway, as it can reduce the noise of new satellite versions.

You could potentially exclude them from the hashdiff so that they don’t drive lots of new versioned records until a hashdiff-tracked attribute of interest changes. But then you lose awareness of when those excluded attributes change at source, unless something of importance happens to change at the same time.
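To make the trade-off concrete, here is a minimal sketch of that behavior. The column names, the `||` delimiter, and SHA-256 are illustrative assumptions, not anything prescribed in this thread:

```python
import hashlib

# Hypothetical column sets for a 'communication method' satellite.
TRACKED = ["comm_type", "comm_value"]            # drive new satellite versions
EXCLUDED = ["preferred_flag", "emergency_flag"]  # loaded, but not hash-tracked

def hashdiff(record: dict, columns: list) -> str:
    """Join the chosen columns with a delimiter (trimmed, upper-cased,
    NULL-safe) and hash the result."""
    payload = "||".join((record.get(c) or "").strip().upper() for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

old = {"comm_type": "OFFICIAL", "comm_value": "a@b.com", "preferred_flag": "N"}
new = {"comm_type": "OFFICIAL", "comm_value": "a@b.com", "preferred_flag": "Y"}

# Only preferred_flag changed, and it is excluded from the hashdiff, so the
# two hashdiffs match: no new satellite version is written, and the change
# at source goes unnoticed until a tracked attribute also changes.
assert hashdiff(old, TRACKED) == hashdiff(new, TRACKED)
```

That silent equality is exactly the “lack of awareness” described above.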

The latter seems the weakest option to me. Unless it is really impacting performance (and storage is cheap, right? And you’re using templated load processes too, right?), is it really an issue? Why not just load the attributes, as you never know whether the business may change their mind in the future about their importance…

I think it boils down to me wanting to read some articles on the need to be able to recreate the source… what are the starting points, assumptions, scenarios, exceptions, etcetera?

In my case, I source from a persistent staging area built/managed by an entirely different team. It contains full snapshots. So indeed, if I decide to not load specific columns (or rows for that matter) right now, I can always decide to load those at a later point in time. And since my source is a delivery system in itself, and not the actual source, I have no idea what (if any) rules are already being applied in-between.

At this point it is impacting performance, because we have to do a historical load that goes back to 2010, meaning I can choose between loading 50 billion records with 200 columns or 5 billion records with 10 columns… and I prefer the latter :wink:

I guess it depends; satellite splitting would chuck all those other columns into their own satellite, but then you’re taking on the cost of running those updates.
At the same time, that other data could have other value. For instance, since it’s loading to a raw area, it could serve as a base for data discovery and exploratory exercises, and feed a business vault satellite that is the result of feature engineering tasks.

The non-business data should probably be ignored, though; metadata columns tell us more about how the source system is functioning than about how the business is functioning.

As for loading attributes but leaving them out of the hashdiff: think about it, that’s a terrible idea.

Hey, Frenk, I thought I should probably weigh in here, so excuse the interruption.
I want to correct your statement of “follow the rule of DV of bringing in all columns.” There is no Data Vault standard (or rule) that states you must bring in all columns. What Dan actually said, and what the internet sources and many others often misquote, is that in DV you load “100% of the data, 100% of the time, WITHIN SCOPE”. Many developers and even practitioners leave off the “WITHIN SCOPE”. That’s where your struggle lies.

My general observation here is that if you are extracting the data into the raw vault from a persistent staging area (PSA), and you can guarantee that the data in that PSA will be available to your DV team should you require it, then only bring in what is in scope. Make sure that you only compute the hashdiff in the satellite from the actual data elements or attributes that you are loading into the satellite (plus any of the DV system elements that you may be using, such as a business key collision code and/or a multi-tenant identifier). This ensures that you will only insert satellite records whenever a change occurs in the data set represented in the satellite. Nothing more, nothing less.
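The construction described above can be sketched roughly as follows. The attribute names, the `||` delimiter, SHA-256, and the example business key collision code and tenant id are all illustrative assumptions; the actual function and column set are implementation choices:

```python
import hashlib

DELIM = "||"

def satellite_hashdiff(row: dict, attributes: list,
                       bkcc: str = "", tenant_id: str = "") -> str:
    """Compute the hashdiff strictly from the attributes actually loaded
    into this satellite, plus any DV system elements in use (business key
    collision code, multi-tenant id). Attribute order must be fixed."""
    parts = [bkcc, tenant_id] + \
            [(row.get(a) or "").strip().upper() for a in attributes]
    return hashlib.sha256(DELIM.join(parts).encode("utf-8")).hexdigest()

# Hypothetical in-scope attribute set for the 'official method' satellite.
IN_SCOPE = ["comm_type", "comm_value", "valid_from"]

row = {"comm_type": "OFFICIAL", "comm_value": "+31 6 1234",
       "valid_from": "2010-01-01",
       "emergency_flag": "Y"}  # out of scope: never enters the hash

hd = satellite_hashdiff(row, IN_SCOPE, bkcc="CRM", tenant_id="NL")
```

Because out-of-scope columns never enter the hash, a satellite row is inserted only when the in-scope data set actually changes. Nothing more, nothing less.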

If you need other aspects of the PSA data set that weren’t loaded in this initial effort, they can be added at a later date. As was pointed out in earlier responses, you can create one or more additional satellites to hold the “missing” descriptive data when you need it. Each of these new satellites will have a hashdiff column whose value is computed from a completely different set of source or PSA data elements.

The question is, when do you want to pay the piper for computing the hash diffs for these satellites? One thing to think about is, how much more does it cost you to compute the hash diffs on 50 billion records NOW versus loading 5 billion records? What is the velocity of the data growth? What happens when, not if, the business asks for more descriptive data (because you know they will)?

If the data volume is growing at a steady rate, you can calculate a reasonable estimate of the data volume 6 months from now. What will it cost to ingest the additional columns and compute the new hashdiff for the new satellite 6 months from now? You are betting that the business either won’t ask for the data you’re excluding, or that you will be better able to handle the volume in 6 months.
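A back-of-envelope projection along those lines might look like this. The 5 billion starting figure comes from the thread, but the 2% monthly growth rate is a made-up assumption purely for illustration:

```python
# Project data volume forward, assuming a steady monthly growth rate.
current_rows = 5_000_000_000   # rows loaded today (figure from the thread)
monthly_growth = 0.02          # 2% per month: an assumed, illustrative rate
months = 6

# Compound growth: each month multiplies the volume by (1 + rate).
projected = current_rows * (1 + monthly_growth) ** months
print(f"Estimated rows in {months} months: {projected:,.0f}")
# Roughly 5.63 billion rows at this assumed rate.
```

Plug in the actual observed growth rate to compare the cost of paying the piper now versus later.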

These are just a few of the factors that the team must consider when evaluating the missing piece of the quote. In short, this is part of what factors into the decision of what is “in scope” and what is not. “In scope” is not restricted to “what the business is asking for today” when it comes to loading data into the raw vault. “Within scope” depends on any number of business and infrastructure (or technical) considerations.

I hope that this was somewhat helpful.

Respectfully,
Cindi Meyersohn, DataRebels
Certified Data Vault 2.0 Instructor

Hi Cindi,

Don’t mind the interruption; in fact, it was more than welcome! Many thanks for your elaborate response!

Adding the “within scope” makes all the difference here. And indeed, having the PSA around gives me a lot of flexibility.

I’ll go and have some additional talks with the business and after that, decide whether I want to pay the piper now or postpone it (and hope that I don’t ever have to ;))

Cheers!

Frenk,
Precisely the correct answer, “I’ll go and have some additional talks with the business …”!
Best regards,
Cindi