I’m looking at virtualising the information mart layer from the persisted data vault layer for the first time, and have a question around dimensions. The examples I have seen in Dan’s book focus on dimensions built on strong business concepts that are ultimately uniquely identified by a business key (physically, the hashkey).
Is there any guidance for attributes that are important enough for regular slice-and-dice purposes but are simply non-unique descriptive attributes loaded into a satellite, rather than business concepts with their own business key? Examples include category, priority or workflow-status attributes where the supplied data set doesn’t send the reference data separately - the transaction data just contains the descriptive values. In a ‘regular’ star schema, these attributes could be physicalised as ‘leftover’ status attributes in the fact, or perhaps collated into a junk dimension (they could each go into a dedicated dimension, but creating a dimension with only a single descriptive attribute would not be ideal).
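For reference, the junk-dimension option can itself be virtualised straight off the satellite. A minimal T-SQL sketch, where the satellite name and the columns priority, category and workflow_status are purely illustrative, and the MD5 hash is just one way of giving each attribute combination a deterministic surrogate key:

```sql
-- Hypothetical sketch: a virtual junk dimension over the distinct
-- combinations of low-cardinality status attributes held in one satellite.
-- All table and column names are illustrative.
CREATE VIEW [mart].[dim_ticket_junk_view] AS
SELECT DISTINCT
    -- deterministic surrogate key derived from the attribute combination
    CONVERT(CHAR(32), HASHBYTES('MD5',
        CONCAT(priority, '||', category, '||', workflow_status)), 2) AS junk_hk,
    priority,
    category,
    workflow_status
FROM [vault].[s_support_ticket];
```

The fact view would compute the same hash over its own status columns to join to this dimension.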
I could simply derive the unique values from the main transaction data sets, but we’re talking a handful of unique values vs a transaction data set many orders of magnitude larger. Or I could just leave the status attributes grouped into an overall virtual dimension for the business concept of the hub (e.g. order, invoice, support ticket etc.).
I read guidance (I think in Patrick’s book) not to split out reference data from satellites into dedicated reference tables. So I’m wondering if the ‘first prize’ answer is to go back to the source and see if dedicated reference data extracts can be supplied separately - these could be loaded into reference tables and drive a very lightweight dimension. If that isn’t possible, what is the better approach?
I’d like to see a diagram of what you’re proposing, if you don’t mind?
Sorry, I had to use a contrived example rather than a real one. But I guess this is my current ‘choice’ based on having one source file. The dimension source attributes all come from the same data source that feeds the s_support_ticket example sat. The dotted lines represent the flow of data through a SQL view.
CREATE VIEW [mart].[dim_priority_view] AS
SELECT DISTINCT priority FROM [vault].[s_support_ticket]
- Do I even bother with separate dims for priority, severity and category, or just dump the attributes into the bigger ticket dimension, which is virtually keyed off the hub hashkey?
- Do I seek out more supporting attributes (such as description) for the three status attributes, load them into a reference hub/sat, and create the dimension view logic against the ref tables instead?
- Do I collapse the status attributes into the virtual support-ticket fact logic (no separate dim objects)?
Hey good morning,
I am wondering a few things:
- Does the final tool even want a star schema, or would it prefer a denormalised flat wide table/view? Reason being: we already decompose into hubs, links and sats, and now we would be decomposing even further!
- On the second point, building a mart around a business object: yes, absolutely. When virtualising your content, a PIT can be built to support a single business entity. I see a single sat in your image, but a PIT is designed to bring together the keys and load dates for all relevant sats around a hub, so a PIT might help here. A PIT can be used to simulate the right-deep join tree, just as you would expect when querying a Kimball star schema. On Snowflake, with millions of records, I used a PIT to turn an eight-and-a-half-minute query into 20 seconds via the right-deep join tree; I did this on Snowflake, but I know the same can be achieved on Oracle. To simulate a central fact, perhaps not a PIT but a bridge with the required metrics could be built instead? Note that the views built over your star schema will only benefit from a right-deep join tree if the underlying data are tables.
- On the third point, if you are looking to gather and consolidate reference data for the business, then by all means extract it from the source or the DV. But if your data already arrives with ref-code resolved to ref-description, don’t break that up! That’s the best situation to be in when getting your data out - it means the source has already resolved the live-data-to-reference-data mapping for you. If it hadn’t, you would have to add yet another join to your query, so you pay the cost of having unresolved ref data. Worse, what if the ref resolution changes through time? You would then have to write a query that resolves the code-to-description content at run time, and to the correct point in time. This is different if, say, you were managing reference data externally to the live data, of course!
- Lastly, I assume dim_date is a role-playing dim in your diagram? If so, that makes sense; otherwise, avoid snowflaking your data model!
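To make the PIT point in the second bullet concrete, here is a rough T-SQL sketch of the query shape; every name (the PIT table, the second sat, the snapshot column) is invented for illustration. The point is that each sat is hit with an equijoin on the (hashkey, load date) pair captured in the PIT, which lends itself to a right-deep join tree:

```sql
-- Illustrative only: querying via a PIT so each satellite is reached with
-- an equijoin on (hashkey, load_date) recorded in the PIT snapshot.
SELECT
    h.ticket_id,
    s1.priority,
    s2.assigned_team
FROM [vault].[pit_support_ticket] AS pit
JOIN [vault].[h_support_ticket] AS h
  ON h.hk_support_ticket = pit.hk_support_ticket
LEFT JOIN [vault].[s_support_ticket] AS s1
  ON s1.hk_support_ticket = pit.hk_support_ticket
 AND s1.load_date = pit.s_support_ticket_ldts
LEFT JOIN [vault].[s_ticket_assignment] AS s2
  ON s2.hk_support_ticket = pit.hk_support_ticket
 AND s2.load_date = pit.s_ticket_assignment_ldts
WHERE pit.snapshot_date = '2024-01-31';
```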
Try to keep the number of tables needed low; most analytics only cares about the latest state of things anyway!
PS: don’t use SELECT DISTINCT and UNION in the same code - you only need one or the other!
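A quick illustration of that PS (table names made up): UNION already de-duplicates its combined result, so a DISTINCT on each branch is redundant work.

```sql
-- Redundant: the UNION de-duplicates the combined result anyway.
SELECT DISTINCT priority FROM [vault].[s_support_ticket]
UNION
SELECT DISTINCT priority FROM [vault].[s_support_ticket_archive];

-- Equivalent, and clearer:
SELECT priority FROM [vault].[s_support_ticket]
UNION
SELECT priority FROM [vault].[s_support_ticket_archive];
```

(UNION ALL is the variant that skips de-duplication, for when duplicates are acceptable or known not to occur.)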
Ref: “WHY EQUIJOINS MATTER! The Scene” by Patrick Cuba | Snowflake | Medium
The data will be stored in SQL Server (not Oracle on this occasion) and loaded into an SSAS tabular model and Power BI for reporting - hence the use of a dimensional model. I hadn’t really thought about approaching it differently, given BI tools like to play with such models.
Regarding reference data: ideally, yes, I want to bring the reference data into one place for reusability and to add more useful descriptions. For the data set currently provided, there has been no resolution - at least the values provided are meaningful words like ‘High’, ‘Medium’, ‘Low’ (not 1, 2, 3) - although they don’t come with descriptions. This is what I would like to improve: getting the business to deliver more meaningful translations as separate extracts that I can load into the reference tables and use for specific dimension sources going forward.
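As a sketch of where I’d like to get to (all names hypothetical): a business-supplied reference extract loaded into a ref table, with the dimension view sourced from that rather than from the big transaction sat:

```sql
-- Hypothetical reference table, loaded from a separate business extract.
CREATE TABLE [vault].[ref_priority] (
    priority      VARCHAR(20)  NOT NULL PRIMARY KEY,
    priority_desc VARCHAR(200) NOT NULL,
    load_date     DATETIME2    NOT NULL
);

-- Lightweight dimension driven by the reference table, not the sat.
CREATE VIEW [mart].[dim_priority_view] AS
SELECT priority, priority_desc
FROM [vault].[ref_priority];
```

No SELECT DISTINCT needed, and the view scans a handful of rows instead of millions of transactions.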
Re dates: there is one physical date dimension (dim_date) - the rest are SQL views against dim_date to assist with role playing.
I hadn’t considered a PIT simply because, in the current ‘real’ data set, I don’t envisage multiple satellites around a single hub at this time. But it is just a matter of time before I will need to leverage both PITs and BRIDGEs in anger…
Re the UNION: yes, of course - the perils of providing pseudo-code as an example - but yes, UNION makes the DISTINCT moot (duh to me!).
This post inspired me to finish a post I had been working on for a while.
It uses a Bridge to accumulate metrics, with the temporal sequence keys (each sat has an identity/auto-increment column) supporting a star schema.
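Roughly, the shape is this (every name below is invented for illustration): the bridge pre-resolves the hub key, the sat’s identity/sequence key and the metrics, so the star-schema views can join on the compact sequence key instead of re-deriving timelines:

```sql
-- Illustrative bridge: hub key + sat sequence key + metrics in one object.
CREATE VIEW [mart].[bridge_support_ticket] AS
SELECT
    h.hk_support_ticket,
    s.sat_seq_id,            -- identity column on the sat, reused as a join key
    s.load_date,
    s.time_to_resolve_hrs    -- example metric carried into the bridge
FROM [vault].[h_support_ticket] AS h
JOIN [vault].[s_support_ticket] AS s
  ON s.hk_support_ticket = h.hk_support_ticket;
```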
I think SQL Server does support hash joins (see “Joins (SQL Server)” in the Microsoft Docs); if you check your SQL query plan after execution, it should form a right-deep join tree.
I know this is a mock-up, but I see zero value in breaking those out into separate dimensions. If all you have are the actual values (priority, severity, category), I would keep them in dim_ticket_view and be done with it. For the reports, users can easily filter where priority = ‘x’, etc. Since no additional joins are required and the SELECT DISTINCT is not needed, the results might actually be faster than if you build three separate views on the same sat using SELECT DISTINCT.
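Something like this sketch, assuming a single sat and latest-state semantics (all names illustrative):

```sql
-- One virtual ticket dimension keyed off the hub hashkey, with the status
-- attributes inline, taking the latest satellite row per key.
CREATE VIEW [mart].[dim_ticket_view] AS
SELECT
    h.hk_support_ticket,
    h.ticket_id,
    s.priority,
    s.severity,
    s.category
FROM [vault].[h_support_ticket] AS h
JOIN [vault].[s_support_ticket] AS s
  ON s.hk_support_ticket = h.hk_support_ticket
WHERE s.load_date = (SELECT MAX(s2.load_date)
                     FROM [vault].[s_support_ticket] AS s2
                     WHERE s2.hk_support_ticket = s.hk_support_ticket);
```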
I know - rubbish example. The original question was around whether to create reference tables for basic attributes required for slicing and dicing, ideally getting the business to provide some additional descriptive attributes to make them more meaningful, as well as reducing the number of records a virtual dimension has to query to display a unique set of values.
That said, for those very skinny reference-data examples I previously gave, I think Kent’s view of just grouping them in a single dimension probably makes more sense, and fewer SQL statements will be generated by the BI tool if there are fewer dimensions to generate on the fly. Some of the attributes (not listed in the simplified example) naturally form a hierarchy, though, so they may warrant a dedicated dimension.
Thanks for your input!
Yeah, sorry John, I should have been clearer.
I was looking at your example and thinking that by combining the above with a Bridge table to bring it all together and simulate a fact + dims, you’d save on data movement. The bridge table was meant to include the ref codes + descriptions.
So the article’s 3rd example does just that. Anyway, if you have a solution already then ignore me!