Raw Vault layer data modeling questions

Hello All,

We are doing a POC to refactor our data assets using DV 2.0 methodology and having a few questions about RV layer.

  1. Why do we need record_source column in Satellite tables if they are already source specific Satellite tables ?
  2. Why do we need as many satellite tables as that of the source tables? Can we combine related data together in Satellites from one source with data coming from multiple tables?

Thanks in advance !

Ask yourself

  • why would you combine raw satellite tables? What do you lose or gain?
  • what would you consider as record-source value?
  • what would you consider as record-source value?

We have member data coming from different source systems and we are using pattern SAT_Meaningful name_src for satellite tables.

For example SRC1, table name is SAT_MEMBER_SRC1 and SRC2, its SAT_MEMBER_SRC2. So, values for record_source column would be ‘SRC1’ for 1st SAT table and ‘SRC2’ for 2nd SAT table. My concern is, we already have source specific suffix within a table name, so is it okay to drop ‘record_source’ column from satellite tables or its good to have it?

Ah but you can be doing a lot more with rec-source. If the source file landed for you is the captured output of a business rule/process (which it is) then I’d add a lot more. Some of this might be available as a metadata column already, so your rec-src could look like this:
/location/filename_yyyymmdd.csv

What’s more this column could be used to denote business rule version if the rule evolves and is the record itself is impacted by article 17 of gdpr

1 Like

Thanks much Patrick.

  1. To register metadata about the dataset. The field itself could be extracted from the satellite because it contains the same value for a specific loaddate or batchID. Save space → run queries faster!
  2. Sometimes we need more satellites than tables in the source. Keep in mind that Data Vault basically stores mulitple snapshots of the dataset over time (combining the best of 3rd normal form and star schema to achieve this). It is very important that you can recreate all the data exactly as it was in the source over time. Michael Olschimke calles this the “Jedi Test”.
    You should never import data from mulitple datasets into one satellite. I understand that it feels odd to have two or more tables with basically the same structure. The benefit is that you never have to retest functionality from a specific satellite. If you want to combine data in a report from multiple satellites, its your choice to do so. If you import a new datasource to an already existing satellite you have to investigate and retest the reports that fetches data from that satellite. And if you are importing different dataset to same satellite you also create an unnessary dependency between import and report. If the satellites get data from one dataset then all the imports can be deployed independently.
  1. When you build information marts and potentially combine data elements from multiple sats, you can include the rec_src value in the reports tables right from the Sat value. If you rely on the Sat prefix/suffix, then you have to parse the table name to convert the source name to data. So while it seems extraneous now, it has a use. Hence the need to follow the standards as it prevents you from re-engineering in the future when you hit a situation or use case you never envisioned.
  2. You can combine them in one Sat as long as they are all indeed at the same grain, truely are related to the same Hub key (at the same cardinality) and the data in those tables changes at the same rate. If one of those tabels has columns that change more often and more rapidly than the other tables, this will take more storage as you will be inserting new rows into the combined Sat whenever any column in any of those tables change. So you get fewer tables but more frequent loads and more rows over time.

I’d never combine RAW Vault satellite tables, if one satellite table evolves and the other doesn’t

  • how do you then deal with an evolving satellite source and the other source hasn’t?
  • loading to a satellite table is hash_key + hashdiff change based, is the one source now comparing against content from another source in the same satellite, or do you now add rec-source in the comparison (only update if it is the same record source is ‘X’)

The complexity added by combining sources to one raw satellite table is just not worth it.

Keep the data vault simple and it will last longer, combining sources to a raw satellite = tech debt.

1 Like