Refactoring of business vault


We have built a Data Vault for Offer and PrintOut and have virtualized most of the BV components (see attached diagram).

As there are quite a few dependencies, data volumes are significant, and the solution is deployed on an on-prem SQL Server box, we experience poor performance. Additionally, as the business logic is based on the latest entries in the raw satellites, the BV is not auditable and shows only the latest calculated values. The advantage of the solution is that the code is implemented once and is reusable.
Here is the sequence of steps and dependencies in the current pipeline:

  1. Raw Vault – in parallel
  • Load Offer_Hub and Offer_Sats from the raw sources
  2. Virtual BV (SQL views)
    1a. Calculate Offer_Tags from the Offer raw vault satellites
    1b. Calculate Offer_PrintOuts from the PrintOut raw vault
    2a. Calculate Offer_Params from Offer_Tags (so it depends on 1a)
  3. Expose the Offer data in the consumption layer (as a virtual Offer DIM)
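For context, the virtual BV steps above look roughly like this. This is a minimal sketch only; the schema, satellite, and column names (rv.Offer_Sat_Main, rv.Offer_Sat_Details, TagCode) are hypothetical stand-ins for our actual objects:

```sql
-- Sketch of step 1a: Offer_Tags as a view over the raw Offer satellites.
-- Only the latest entry per hub key is used, which is why the result
-- is not auditable and shows only the latest calculated values.
CREATE VIEW bv.Offer_Tags AS
SELECT s.Offer_HK,
       COALESCE(s.TagCode, d.TagCode) AS TagCode   -- simple derived attribute
FROM   rv.Offer_Sat_Main s
LEFT JOIN rv.Offer_Sat_Details d
       ON d.Offer_HK = s.Offer_HK
      AND d.LoadDate = (SELECT MAX(x.LoadDate)
                        FROM   rv.Offer_Sat_Details x
                        WHERE  x.Offer_HK = d.Offer_HK)
WHERE  s.LoadDate = (SELECT MAX(y.LoadDate)
                     FROM   rv.Offer_Sat_Main y
                     WHERE  y.Offer_HK = s.Offer_HK);
```

The MAX(LoadDate) correlated subqueries repeated in every view are a large part of the performance problem once the satellites grow.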

We considered adding PITs and a Business Satellite to accelerate the pipeline and add auditability to the BV, but we struggle with the additional dependencies that appear, and with these questions:

  • Should we create PITs for the raw vault and use them to accelerate the build-out of the business satellites?
  • Should we create another layer of PITs to accelerate the DIM?
  • How should we shape the business satellites: one satellite per topic, or one broad satellite? Should simple derived attributes (e.g. COALESCE(sat1.Att1, sat2.Att2)) be included in a business satellite as well?

I’d appreciate your feedback on the above.


  1. What is PrintOut? Is that a business object with an immutable business key?
  2. Why not load the Business Vault sats by reusing the same loaders as for the Raw Vault sats?
  3. PIT tables can have many variations:
  • A Current PIT (C-PIT) points to the active record per business key; you might test this configuration to see whether joining the Sat through the C-PIT gives you the performance you are after
  • PITs are tables, and the default implementation is to build the information marts as views utilizing this PIT structure
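A C-PIT of the kind described above could look like the following sketch. The table and column names are hypothetical; the point is that the mart view replaces the MAX(LoadDate) subqueries with a simple equi-join:

```sql
-- Hypothetical C-PIT: one row per hub key, pointing at the current
-- (active) record in each satellite.
CREATE TABLE bv.Offer_CPIT (
    Offer_HK             BINARY(16) NOT NULL PRIMARY KEY,
    Sat_Main_LoadDate    DATETIME2  NOT NULL,
    Sat_Details_LoadDate DATETIME2  NULL
);

-- Mart view joins through the C-PIT on key + load date:
SELECT p.Offer_HK, m.TagCode
FROM   bv.Offer_CPIT p
JOIN   rv.Offer_Sat_Main m
  ON   m.Offer_HK = p.Offer_HK
 AND   m.LoadDate = p.Sat_Main_LoadDate;
```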
  1. PrintOut stores information about printed-out documents. It has its own unique key and is loaded from a separate system; it may be linked to offers, contracts, and other objects.
  2. We can reuse the Sat loading template, though we are not sure how to tackle the levels of dependencies and how to shape the business satellites (one business sat with all calculated attributes, or split by area?). Do we build a PIT on top of the raw sats to accelerate the calculation of the business sats and then another PIT for the business sats, or maybe one PIT covering both raw and business sats?
  3. Yes, we are still looking to virtualize the InfoMarts on top of PITs. The C-PIT looks like a good fit for the near-real-time pipeline we are developing as well, but not so much for the ones where we need to keep snapshots.
  1. That would be at your discretion: whatever works best for your analytics. Further, the BV sat could follow a framework that is a pivot of the content, or the output could even be stored in a semi-structured column, because the metrics may change. Now, a PIT’s job is to get the data needed quickly; the downside is that you need to build the PIT. However, there are ways you may consider to populate a PIT efficiently, i.e. build it in parallel with the loading of the sats so you effectively only push the needed keys and load dates to the target PIT (and probably UPDATE the load-date value based on the parent key), or, if you’re on Snowflake, deploy a stream on top of the sats and process only the new content for the PIT. PITs are helper tables; if you need one to help you load a BV SAT, then so be it!
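The incremental "push only the needed keys and load dates" idea above can be sketched as a MERGE run right after the sat load. This is an assumption-laden illustration (the #delta staging table and the C-PIT columns are hypothetical), not a prescribed loader:

```sql
-- Hypothetical incremental C-PIT maintenance in T-SQL.
-- #delta holds only the keys and load dates that arrived in this batch,
-- so the C-PIT is maintained without rescanning the full satellite.
MERGE bv.Offer_CPIT AS tgt
USING (SELECT Offer_HK, MAX(LoadDate) AS LoadDate
       FROM   #delta
       GROUP BY Offer_HK) AS src
   ON tgt.Offer_HK = src.Offer_HK
WHEN MATCHED THEN
     -- parent key already present: just advance the pointer
     UPDATE SET tgt.Sat_Main_LoadDate = src.LoadDate
WHEN NOT MATCHED THEN
     INSERT (Offer_HK, Sat_Main_LoadDate)
     VALUES (src.Offer_HK, src.LoadDate);
```

On Snowflake, a stream on the satellite can play the role of #delta, so the MERGE consumes only new content.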

  2. Snapshot PITs at a logarithmic cadence can be driven by your as_of table with business flags, much like you would develop a calendar dimension for Star Schemas, right?
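One way to read the as_of-table idea above is the following sketch. All names and the specific cadence (daily for a week, weekly for a quarter, monthly beyond) are assumptions for illustration:

```sql
-- Hypothetical as_of table with business flags, calendar-dimension style.
CREATE TABLE bv.As_Of_Date (
    As_Of_Date   DATE NOT NULL PRIMARY KEY,
    Is_Week_End  BIT  NOT NULL,
    Is_Month_End BIT  NOT NULL
);

-- Thin the snapshot PIT to a logarithmic cadence by keeping only the
-- dates still flagged at each horizon.
DELETE p
FROM   bv.Offer_PIT p
JOIN   bv.As_Of_Date a
  ON   a.As_Of_Date = p.Snapshot_Date
WHERE (a.As_Of_Date < DATEADD(DAY, -7,  CAST(GETDATE() AS DATE)) AND a.Is_Week_End  = 0)
   OR (a.As_Of_Date < DATEADD(DAY, -90, CAST(GETDATE() AS DATE)) AND a.Is_Month_End = 0);
```

The flags make the cadence a data-driven decision rather than logic hard-coded into the PIT loader.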