When working on virtualization of dimension in infomart I realized that I have to modify the code template shared in all DV books to allow for JOIN Elimination in SQL Server environment. W/o join elimination ad hoc queries for dimensions that compile couple dozen of satellites were extremely slow.
Therefore I came up with an adjusted template for a dimension.
Here is the description of major changes I’ve made
I had to use LEFT JOINs and not INNER JOINs - SQL Server will not eliminate join if you INNER JOIN using 2 attributes (Hkey + loadDate)
Patrick Cuba suggested SnoPITs that work, however there are 2 prerquisites that I don’t like
additional sequence numbers would have to be added - bigint would have to be used for large dimensions (daily snapshot of couple dozen mlns of rows)
Enforced foreign key constraints would have to be added (from PIT to all sats)
For the business satellites, that persist results very heavy business logic and cannot be virtualized, I used a pattern promoted by ScaleFree, which is a Snapshot based satellite
this deviates from standard which says that all changes in satellites should be tracked, instead it calculates and keeps snaphots to business dates defined by business - in BV I can easily recalculate values for any point in time w/o keeping track of daily changes
LEFT JOIN allows for JOIN elimination and good performance without using another layer of PIT tables
LEFT JOIN keeps the deep right join tree and is even faster in SQL Server than INNER JOIN!
Now I need to solve reference satellites, which have to be joined through satellites and dynamically filtered to a snapshot date, this kills join elimination and again all related joins are executed at all times.
I can think of few options:
extend PIT with reference codes and loaddates - this would make some PIT very very wide (60+ columns already, another 30+ would be needed for joining dictionaries)
create a snaphot based sats and left join similiarily to business satellites
Have you had similiar challanges? Have you solved them in a different way?
You don’t need FK constraints — PITs and SnoPITs are ephemeral, do you really want to be adding constraints to these tables when they are not necessary?
Adding the sequence number to Sats is actually how Dimensional modelling works; hence why this is so effective because it mimics the joins between dims and facts.
Once built, PITs / SnoPITs will always be equi-join. With a SnoPIT you only need to join with one id — again, this is how dimensional modelling works.
As far as a BV sat goes, this could be anything really – could you clarify how Scalefree’s suggestion deviates form the standards?
In order to virtualize a dimensional model and make it performant/usable for ad hoc queries design has to support JOIN Elimination. As far as I know and tested on SQL Server there are 2 options how it can be setup:
LEFT JOIN - as we join PIT to hkey+loadDate in satellite which is unique combination, optimizer knows it does not influence cardinality and thus can be eliminated if sattellite is not explicitly referenced in the query; although it is not an EQUI JOIN execution plan still looks right deep join tree
or INNER JOIN on 1 ONLY attribute (does not work for 2+) + enforced foreign key constraint - in case of INNER JOIN optimizer has to know that there will be match in the satellite therefore active foreign key constraint is required, I know that in Snowflake there is a RELY keyword that does the work, unfortately does not work this way in SQL Server
Regarding snapshot satellites - in any of the public sources (your’s book, Dan’s book) I read that business vault satellites should use the same pattern as Raw Vault, that is recognizing each change using hash diff and INSERT Only changed entries with new load date. This however, leads to complexity because to make virtualized dimensions performant you’d have to index BV with another PIT and in dimension join 2 PITs together…
On the other hand Scalefree suggests that business vault satellites should be snapshot based (similarily as PITs), that means these sats dont’s have hashdiffs and loadDate but rather a SnapshotDate hence are similar to PIT tables in structure. Thanks to these approach you can simple reference them from basic PIT joining by Hkey and SnapshotDate, the cost paid is potential duplication of values in subsequent snapshots that would not change otherwise in “load date based” sat.
How would you handle the complexity with reference objects + reference hubs?