Could you please help me to understand what are PIT and bridge tables and when would I use them?
(This is one of the questions that was asked during our last meetup "5 most common challenges with Data Vault modelling on the 12th Jan 2022 - by one of the members)
I’m going to open by saying this is a fairly wide-ranging topic, handled by Dan in Chapter 6 of the DV book. So what I’m about to say should be clearly referred to that chapter for more in-depth coverage.
PIT (Point-In-Time) Tables are a specialist type of table that gathers data from historised satellites around a hub and presents an easy way for a query to get what the data looked like at a point in time. They are designed to improve the performance of your vault with regard to historical queries. If you are in a fairly simple set-up that only requires ‘what does the data look like right now?’, then PIT tables aren’t needed. If you are regularly serving queries that look up ‘what did the data look like last week?’, or ‘who made what change when?’ then PIT tables could be a good fit.
Bridge Tables are a specialist type of table that gathers commonly associated hubs into one place and presents an easy way for a query to get the keys of things that are associated. They are designed to improve the performance of your vault with regard to ‘long chain’ joins, and look a lot like link tables, although they are ‘links by usage’ rather than ‘links in data’. The query strain of the inner-joined DV architecture isn’t that high, so Bridge tables are less common than might be implied. If you notice in your logs that queries are regularly hitting the same hubs that are ‘distant’ in the data model, it may be worth your while to make a Bridge to remove the traversing of the data model by many joins, especially if the intervening hubs and links aren’t being used for anything apart from the joins.
In short, PIT and bridge tables are query assistance structures built in the business vault to improve performance and reduce complexity when writing queries against the data vault.
PIT (Point-In-Time): As the name implies, PIT tables/views (“virtualize until it hurts”) gather together a hub’s business key with its associated load date time stamps from the multiple satellites dependent on that hub for any one particular “point in time.” PIT structures reduce complexity (and potentially improve performance) in joins by nearly eliminating the need for the subqueries used in finding the MAX load date time stamp relative to a point in time for each of the satellites used.
Bridge: Bridge structures gather together multiple hubs and associated link tables for performance improvement and query simplification by reducing the number of joins needed when querying the underlying tables. For example, if you have a raw vault with 3 hubs, 1 satellite per hub, and 2 links to join the hubs, and you want to gather the information from the three satellites into one query, you would likely need at least 8 joins. With a bridge table, you could reduce the joins to at least 3, since all of the relevant keys needed to get the information from each of the 3 satellites would be joined together into the one bridge.
Kent Graziano (@kgraziano ) presents these concepts well in his artice The Business Data Vault | Vertabelo Database Modeler online.
Or you can get an updated version in my book Introduction to Agile Data Engineering Using Data Vault 2.0 - on Amazon (hardcopy or Kindle format)
We (at least I have) have seen "PIT"s deployed as views around a hub or link and these are incorrect interpretations of a PIT, it’s just a mart. A PIT as the name implies is a Point in Time structure designed to give you the GPS locations of the relevant data around a hub or a link at that point in time. Pretty simple. But what does that look like?
If I want to take a Snapshot of the relevant data at a point in time I think about:
- What is the frequency of those snapshots?
- What is the window of those snapshots?
That’s why we talk about logarithmic PIT tables and PIT windows; they’re not mutually exclusive but are considered when building PITs.
Why do they work?
Star Join queries, we do the same thing when building Facts and Dimensions each surrogate key in a dimension table is a temporal value that is referenced in the fact table an employs a single join key, you control the window of that slice by using a date dimension
With a PIT structure we use an as_of table to employ essentially the same functionality. With facts and dims you need to load dims before facts but with a PIT you can take a snapshot anytime because the sats are loaded in parallel.
For a deep dive into this go here: Data Vault PIT Flow Manifold. Streams and Tasks (see: bit.ly/3fqV9oZ)… | by Patrick Cuba | Snowflake | Medium
Norbert, I’m going to weigh in here to hopefully simplify these two objects for you.
As has already been stated, a PIT is a Point-in-Time object. It always starts as a key set that is focused on a single hub or a single link and its related satellite tables.
A Bridge starts as a key set from one or more links and any associated hubs.
There are also 'hybrid" structures that are a combination of these key sets.
The main thing to remember is to always start with the key sets; then add other elements as may be necessary for performance optimization and/or outcome requirements.
Any of these objects (PIT / Bridge / hybrid) may be physical (table) or logical (i.e., a view).
All of these objects are built for two primary reasons - (1) performance as query assist objects (a.k.a., join tables) and/or, (2) for security. They are all USE CASE driven.
PITs, as their name implies, represent data at a specific moment in time or “snapshot”, and are similar to a Type 2 slowly changing dimension; whereas Bridges may include a snapshot and closely resemble a Fact table.
When do you build a PIT or a Bridge? When your query outcomes are not meeting the customer’s requirements for performance or when there are security requirements (access control policies) that you must adhere to. For example, you can use a PIT or Bridge to segment the same data set to two different groups of consumers where one group has access to, say, PII data, - and the other group does not. A PIT or Bridge provides various approaches and flexibility to handling data security.
The Data Vault 2.0 recommended best practice is to virtualize (i.e., create a View) (see Kent’s notes below) these objects and maintain them as a View for as long as possible. Only physicalize these objects when performance dictates otherwise, and the only way you can gain better performance is by using the physical features of the database application system platform to squeeze out as much speed as possible to meet the customer’s requirements.
The Data Vault 2.0 methodology provides the optimized patterns for building the SELECT FROM and predicate logic needed to pull these structures together, and highlights the areas of the query logic that needs especially close attention to avoid cartesians.
What metadata columns should be stored in a PIT or Bridge table? I’m thinking specifically around record source (I think the answer is do not store), business key collision codes and multi-tenant codes (the latter two attributes are more aimed at @patrickcuba as they are not in the core pattern)
Bkcc and mltid are part of DV2.0 standards and training.
The real question is, what value would they or other metadata columns serve your info marts?
The books I have nor do the 2.0.2 standards really cover it for PITs and Bridges - so I’m thinking they are not relevant. Thinking about it more widely, if the adjacent hubs contain the info anyway and the hashkeys are defined using them as inputs then the relevance of them in the PIT and Bridge definition doesn’t stack up in my mind. However, other optional attributes such as the task id or Jira/requirement is to me more useful as it assists with understanding why the PIT/Bridge exists in the first place (the driving requirement), what process loaded it (as they are physicalised) etc. Record Source seems pointless given the data is derived from multiple places other than just to say a static value such as ‘Derived’ etc
Agree, as long as it’s useful
BKCC makes up the hashkey, beyond that they serve no purpose beyond DV: so you shouldn’t see them in QA tables such as PiTs and Bridges.
And in your info mart you shouldn’t see hash keys, will confuse the business users relying on it.
MLTID also make up the hashkey, once you’ve built QA tables they should be based on a tenant, therefore the column itself doesn’t persist beyond DV-no need
As far as the other metadata, as long as it makes sense for your use case, QA tables & IM views are disposable. If you rebuild your PIT or Bridge then your metadata obviously changes
Hi everyone - i just wanted to make sure i read this correctly: Bridges may potentially contain a (logarithmic) window of snapshots just like PITs, if there is a compelling reason to support it - such as supporting a downstream Timespan Fact, where there is a need to Marty McFly back in time? Otherwise the recommendation is to rebuild per run (day)?
One more question: PITs - to hash or not to hash? Is there any definitive guidance on whether the primary key is either:
- a multi-part key based on the associated hub/link hashkey and the snapshot date columns; or
- a separate single column hash of the associated hub/link hashkey and the snapshot date (with alternate unique key on the associated hub/link hashkey and the snapshot date columns)
Does it matter?
Looking at Dan’s guidance on virtualising Type 2 Dimensions it looks to me that the dedicated PIT hashkey column is rather helpful in this regard. But guidance seems to differ based which source material you refer to…
Correct. If you want to build a virtual SCD2 on top of a historized PIT you need a PK for the dimension. The hashkey of the Hub alone would not be unique as you are tracking changes over time. Hashing the Hub business key attributes plus the loaddts of the PIT row are needed to get a unique value that can act as the virtual PK for the virtual SCD2.
To answer your question about whether you need a snapshot date in a Bridge, think about what a Bridge table is providing - it’s a join table of key sets to the underlying objects. The Bridge may be collecting these key sets in a forward rolling manner and may include a computed End Date based on the associated Link’s Effectivity Satellite. You can compute the duration of the Link’s representative relationships using the Load Date (which will act as a Snapshot Date); and you could have multiple End Dates in the Bridge if you have multiple Links with associated Effectivity Satellites. It depends on the Use Case that you are building to. If you are maintaining a Bridge in a forward rolling manner, then you would most likely not want to reload it. Again, the load process will follow the Use Case.
With regard to hashing or not to hash on a PIT - it is a design preference based on the Use Case. All PITs and Bridges are Use Case driven. You may wish to compute a hash on the entirety of a PIT’s row-level column set as a means to eliminate duplicate rows for the specified snapshot date and to assist with restart-ability should something happen during the load process. The recommended best practice for the unique row-level identifier on a PIT or Bridge is to generate a sequence number - yes, that’s right - a sequence number. Why? Two-fold: (a) it is strictly a unique row identifier; no dependencies; no lookups; no latency; and (b) it allows the Bridge or PIT to mimic a Kimball-styled Fact or Type 2 dimension.
You’ll want to stay away from hashing a hash key. On a PIT table, you should have the entire business key for every hub or, in the case of a link-based PIT, you should have every business key represented in the link relationship, along with the required set of satellite primary key/load date pairs.