Naming and splitting satellite tables

Hi,

What’s the most popular approach to naming satellites when there are frequent schema changes in the source?
I am thinking about the scenario where we have a sat_customer table coming from the CRM, and then new columns are added to the source table.
Is it better to rebuild the sat_customer table with the new columns and recalculate its hashdiffs?

Or create a separate satellite that holds the new columns? In that latter case, what would you name the additional satellite table if the new columns do not share a common business theme? Something like sat_customer_2?

My question concerns both DV2.0 and Ensemble modeling.

Many thanks!

Stick to DV2 for this: alter the table, add the column, include the new column in the hashdiff calculation, and continue…
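For what it’s worth, a minimal sketch of that flow in Snowflake-style SQL, assuming an MD5 hashdiff with a '||' delimiter and NULLs coalesced to empty strings; the column and staging names (loyalty_tier, customer_name, stg_crm_customer, etc.) are hypothetical:

```sql
-- Step 1: add the new source column to the existing satellite.
ALTER TABLE sat_customer ADD COLUMN loyalty_tier STRING;

-- Step 2: extend the hashdiff in the staging/load query so the new column
-- participates from now on. NULLs are coalesced to '' because concatenating
-- a NULL with '||' would make the whole expression NULL.
SELECT
    customer_hk,
    load_date,
    record_source,
    MD5(
        UPPER(
            COALESCE(TRIM(customer_name),  '') || '||' ||
            COALESCE(TRIM(customer_email), '') || '||' ||
            COALESCE(TRIM(loyalty_tier),   '')   -- newly added attribute
        )
    ) AS hashdiff,
    customer_name,
    customer_email,
    loyalty_tier
FROM stg_crm_customer;
```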

Refer to rule 6 (3-valued logic and its effects) here.

Thanks!
I guess this works well for most cases.
I would still challenge the approach a little in the case where the source schema changes frequently and the satellite table is of a certain size: wouldn’t rebuilding it each time be challenged by the people in charge of the platform?

If you read what I wrote: I said I don’t like to refactor, because you break the audit.

Oh, got it now! I totally missed that “here” is a link, as it has the same color as the rest of the text (and I was also half asleep).


What would you argue against Anchor modeling, apart from the large number of tables and joins? The current topic, for instance, would be irrelevant there, as every table holds one attribute and no refactoring is required.

Try telling your data administrator that every column is a table and that you now need to join dozens of tables to return a result, let alone trying to ensure consistency across business entity state!
It’s a nice academic exercise, but I, at least, have never seen it used at production scale.

Maintaining hundreds of thousands of tables sounds crazy indeed. I hear a company named ManyChat is using it on Snowflake.

I guess it depends… massive historization of millions to billions of rows, vs. SQL Server-like workloads.

  • If they’re using Anchor for the former, I’d like to find out the cost of running those join queries (an academic exercise on its own!).
  • If it’s for the latter, then I guess the cost of the model will not outweigh the value it produces.

Of course, I’m not always right! I have seen customers/prospects choosing data vault just because “they want to do data vault.” That’s never a good reason to adopt it, and they subsequently tell their friends not to do data vault :smile:

Got you!
What you said is exactly what is happening with a customer I am talking to right now. They went for data vault a year ago because they wanted data vault. Today it has gotten out of hand, and now they are asking what they are doing wrong.

Regarding Anchor:
There will indeed be far more joins to perform to consume data from the “anchor” model, but with careful SQL, couldn’t those joins be performed on already pre-filtered data (usually benefiting from partition pruning), so that big tables wouldn’t be fully scanned?
Other tables, storing attributes that do not change that frequently, could be broadcast before the joins (see the sketch below).
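To make the question concrete, here is a rough sketch of the pattern I have in mind, with purely hypothetical table and column names (anchor_customer, anchor_customer_email, load_date, etc.); whether the engine actually prunes depends on how those tables are partitioned or clustered:

```sql
-- Filter the large, date-partitioned attribute tables first (in CTEs),
-- then join only the reduced sets.
WITH recent_email AS (
    SELECT customer_id, email
    FROM anchor_customer_email            -- single-attribute table
    WHERE load_date >= '2024-01-01'       -- pruning predicate applied before the join
),
recent_tier AS (
    SELECT customer_id, loyalty_tier
    FROM anchor_customer_tier
    WHERE load_date >= '2024-01-01'
)
SELECT c.customer_id, e.email, t.loyalty_tier
FROM anchor_customer AS c
JOIN recent_email AS e ON e.customer_id = c.customer_id
JOIN recent_tier  AS t ON t.customer_id = c.customer_id;
```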

I can clearly see the nightmare of putting together queries over hundreds of tables to create an information mart, but I still cannot clearly see that query performance would be an issue. I might be completely missing something!

Even Snowflake will start to run into limitations as the number of tables being joined grows.

Avoiding full table scans happens in Data Vault too, if you follow some of the recommendations here.

Thanks for the link, and for sharing all the helpful content (I am currently reading your book).

I meant that, as with data vault, avoiding full table scans can be achieved with an anchor model.

I hear you on the risk of pushing Snowflake and other engines to their limits when joining that many tables.

Thanks for the insights!
