I am working on a Data Vault project; we are using dbt and BigQuery.
So far we have created the Raw Data Vault on BigQuery using dbt templates, and we are now working on creating an information mart on top of it.
Which is the best model for an information mart on top of a Raw Data Vault: a dimensional model (star schema) or OBT (one big table)?
In most places I see people recommending the dimensional model, but since BigQuery performs very well with nested and repeated fields, could we create one big table with a nested and repeated structure instead of a star schema?
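For illustration, a nested and repeated OBT in BigQuery could look something like this (a rough sketch; the table and column names are hypothetical, not from any real project):

```sql
-- Hypothetical OBT: one row per order, with the customer as a nested
-- STRUCT and line items as a repeated ARRAY of STRUCTs, instead of
-- separate dimension and fact tables.
CREATE TABLE mart.orders_obt (
  order_id STRING,
  order_date DATE,
  customer STRUCT<
    customer_id STRING,
    name STRING,
    country STRING
  >,
  line_items ARRAY<STRUCT<
    product_id STRING,
    quantity INT64,
    unit_price NUMERIC
  >>
);

-- Querying the repeated field requires UNNEST:
SELECT
  o.order_id,
  li.product_id,
  li.quantity * li.unit_price AS line_total
FROM mart.orders_obt AS o,
UNNEST(o.line_items) AS li;
```

One thing to keep in mind: BI tools vary in how well they handle ARRAY/STRUCT columns, so the UNNEST step may need to live in a view.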
Thanks for your suggestions!
Any update on this, please?
I recommend a star schema. Most BI tools handle star schemas very well.
My client is recommending that we create one wide table (nested and repeated structure) in BigQuery instead of a star schema as the information mart on top of the Raw Data Vault.
Do you see any problems with this approach?
Depends on the use case.
For general BI, dimensional modelling / star schema is good (obviously following Kimball’s principles of one process / grain per fact table and conformed dimensions), but you run up against the limits of star schemas in cases where they don’t actually model the business (particularly relationships involving multiple entities) outside of fact tables. Fact tables are often based on transactional links / link satellites, by the way.
For serving data products / interfaces or similar, flattened tables / views can be more useful.
Even one-big-table approaches need to follow Kimball’s principle of a single granularity, otherwise things get messy.
You might need different patterns for transaction-level granularity and aggregated results, particularly if you are aggregating multiple processes to their lowest common denominator.
Surely this depends on what processes you are modelling. If conceptually the OBT is simply a denormalised star schema as per Kimball (single process / granularity), then it comes down to how BigQuery processes things and whether you are materialising as a table or as a view (or a mix of tables and views).
I don’t know enough about BigQuery technically to be able to advise you specifically, but generally, if materialised as a table, an OBT is going to cost you more storage, as you are physically storing all your expensive text bytes once per transaction row (one of the reasons why star schemas became popular in the first instance). But that might also mean that processing is cheaper, as you only do it once per refresh cycle. If storage costs are not material for you based on the size of your data, it might be a better trade-off to use an OBT and avoid the cost of processing joins at query time.
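In dbt terms, the table-versus-view trade-off above is just a materialisation choice on the mart model. A minimal sketch (the model and ref names are hypothetical):

```sql
-- models/marts/orders_obt.sql (hypothetical dbt model)
-- 'table' pays the join/denormalisation cost once per refresh and
-- stores the wide result; switching to materialized='view' pays that
-- cost at query time instead, with no extra storage.
{{ config(materialized='table') }}

SELECT
    h.order_hk,
    s.order_date,
    s.order_status
FROM {{ ref('hub_order') }} AS h
JOIN {{ ref('sat_order_details') }} AS s
  ON h.order_hk = s.order_hk
```

You can also set this per folder in `dbt_project.yml`, so all mart models default to one materialisation.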
Given that these are just physical implementation choices, I would generally do a dimensional logical modelling exercise for each process / grain as per Kimball before deciding on the physical implementation.
Thanks Nat. Storage costs are fine with us.
The client is more interested in an OBT (with nested and repeated structures) in BigQuery, as it gives better performance compared to a star schema.
When bringing all the data from different satellites into the OBT, how do we manage load dates? As there might be inconsistent load dates in each satellite, I have tried to use a PIT table (satellites surrounding a hub) to overcome this problem, but there are still other satellites which are not part of the same hub (different entities). So how do we design load dates in an OBT structure? Or do you see any other problems with OBT?
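For context, the PIT approach mentioned above is usually sketched like this: the PIT table records, per hub key and snapshot date, the load date of the row that was current in each surrounding satellite, turning the temporal alignment problem into simple equi-joins. A hypothetical example (all names invented):

```sql
-- Hypothetical: pit_customer holds one row per (customer_hk,
-- snapshot_date), with the effective load_date of each satellite
-- as of that snapshot, even when the satellites loaded at
-- different times.
SELECT
  p.customer_hk,
  p.snapshot_date,
  d.customer_name,
  a.city
FROM vault.pit_customer AS p
JOIN vault.sat_customer_details AS d
  ON d.customer_hk = p.customer_hk
 AND d.load_date   = p.sat_details_load_date
JOIN vault.sat_customer_address AS a
  ON a.customer_hk = p.customer_hk
 AND a.load_date   = p.sat_address_load_date;
```

This only aligns satellites around one hub; combining entities from different hubs into a single OBT still needs a link (or a bridge table) to relate the hub keys before the PIT joins apply.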