Naming and splitting satellite tables

Hi,

What’s the most popular approach to naming satellites when there are frequent schema changes in the source?
I am thinking about the scenario where we have a sat_customer table coming from the CRM, and then new columns are added to the source table.
Is it better to rebuild the sat_customer table with the new columns and recalculate its hashdiffs?

Or create a separate satellite that holds the new columns? In that latter case, what would you name the additional satellite table if the new columns do not share a common business theme? Something like sat_customer_2?

My question concerns both DV2.0 and Ensemble modeling.

Many thanks!

Stick to DV2 for this: alter the table, add the column, include the new column in the hashdiff calculation, and continue…
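For what it’s worth, a minimal sketch of that flow in Snowflake-style SQL, assuming an MD5 hashdiff with a '||' delimiter and NULLs coalesced to empty strings; the column and staging names (loyalty_tier, customer_name, stg_crm_customer, etc.) are hypothetical:

```sql
-- Step 1: add the new source column to the existing satellite.
ALTER TABLE sat_customer ADD COLUMN loyalty_tier STRING;

-- Step 2: extend the hashdiff in the staging/load query so the new column
-- participates from now on. NULLs are coalesced to '' because concatenating
-- a NULL with '||' would make the whole expression NULL.
SELECT
    customer_hk,
    load_date,
    record_source,
    MD5(
        UPPER(
            COALESCE(TRIM(customer_name),  '') || '||' ||
            COALESCE(TRIM(customer_email), '') || '||' ||
            COALESCE(TRIM(loyalty_tier),   '')   -- newly added attribute
        )
    ) AS hashdiff,
    customer_name,
    customer_email,
    loyalty_tier
FROM stg_crm_customer;
```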

Refer to rule 6 (3-valued logic and its effects) here.

Thanks!
I guess this works well for most cases.
I would still challenge the approach a little in the case where the source schema changes frequently and the satellite table is of a certain size: wouldn’t rebuilding it each time be challenged by the people in charge of the platform?

If you read what I wrote: I said I don’t like to refactor, because you break the audit.

Oh, got it now! I totally missed that “here” is a link, as it has the same color as the rest of the text (and I was also half asleep).


What would you argue against Anchor modeling, apart from the large number of tables and joins? The current topic, for instance, would be irrelevant there, as every table holds one attribute and no refactoring is required.

Try telling your data administrator that every column is a table and that you now need to join dozens of tables to return a result, let alone trying to ensure consistency across business entity state!
It’s a nice academic exercise, but I, at least, have never seen it used at production scale.

Maintaining hundreds of thousands of tables sounds crazy indeed. I hear a company named ManyChat is using it on Snowflake.

I guess it depends… massive historization of millions to billions of rows, vs. SQL Server-like workloads.

  • If they’re using Anchor for the former, I’d like to find out the cost of running those join queries (an academic exercise on its own!).
  • If it’s for the latter, then I guess the cost of the model will not outweigh the value it produces.

Of course, I’m not always right! I have seen customers/prospects choosing data vault just because “they want to do data vault.” That’s never a good reason to adopt it, and they subsequently tell their friends not to do data vault :smile:

Got you!
What you said is exactly what is happening with a customer I am talking to right now. They went for data vault a year ago because they wanted data vault. Today it has gotten out of hand, and now they are asking what they are doing wrong.

Regarding Anchor:
There will indeed be far more joins to perform to consume data from the “anchor” model, but with careful SQL, couldn’t those joins be performed on already pre-filtered data (usually benefiting from partition pruning), so that big tables wouldn’t be fully scanned?
Other tables, storing attributes that do not change that frequently, could be broadcast before the joins (see the sketch below).
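To make the question concrete, here is a rough sketch of the pattern I have in mind, with purely hypothetical table and column names (anchor_customer, anchor_customer_email, load_date, etc.); whether the engine actually prunes depends on how those tables are partitioned or clustered:

```sql
-- Filter the large, date-partitioned attribute tables first (in CTEs),
-- then join only the reduced sets.
WITH recent_email AS (
    SELECT customer_id, email
    FROM anchor_customer_email            -- single-attribute table
    WHERE load_date >= '2024-01-01'       -- pruning predicate applied before the join
),
recent_tier AS (
    SELECT customer_id, loyalty_tier
    FROM anchor_customer_tier
    WHERE load_date >= '2024-01-01'
)
SELECT c.customer_id, e.email, t.loyalty_tier
FROM anchor_customer AS c
JOIN recent_email AS e ON e.customer_id = c.customer_id
JOIN recent_tier  AS t ON t.customer_id = c.customer_id;
```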

I can clearly see the nightmare of putting together queries over hundreds of tables to create an information mart, but I still cannot clearly see that query performance would be an issue. I might be completely missing something!

Even Snowflake will start to run into limitations as the number of tables being joined grows.

Avoiding full table scans happens in Data Vault too, if you follow some of the recommendations here.

Thanks for the link, and for sharing all the helpful content (I am currently reading your book).

I meant that, as with data vault, avoiding full table scans can be achieved with an anchor model.

I hear you on the risk of pushing Snowflake and other engines to their limits when joining that many tables.

Thanks for the insights!
