Bridge pseudo code

Hi all,

I’m looking to implement my first bridge table using dbtvault. I looked up @patrickcuba pseudo code, but unfortunately, that doesn’t cover effectivity satellites.

So I have a Hub (called “hub”), with two Links (called “link_1” and “link_2”), that both have an effectivity satellite (called “eff_sat_1” and “eff_sat_2”). The eff_sat contains an applied date called “eff_dts”, with effectivity start and end dates called “eff_start_dts” and “eff_end_dts”. This results in the following generated dbtvault sql:

WITH as_of_dates AS (
    SELECT *
    FROM as_of_date
),

new_rows AS (
    SELECT
		hub.hub_hk,
		as_of_dates.AS_OF_DATE,
		link_1.link_hk        AS link_1_hk,
		eff_sat_1.eff_end_dts AS eff_sat_1_eff_end_dts,
		eff_sat_1.load_dts    AS eff_sat_1_load_dts,
		link_2.link_hk        AS link_2_hk,
		eff_sat_2.eff_end_dts AS eff_sat_2_eff_end_dts,
		eff_sat_2.load_dts    AS eff_sat_2_load_dts
    FROM hub
    INNER JOIN as_of_dates
        ON (1=1)
    LEFT JOIN link_1
        ON hub.hub_hk = link_1.foreign_hub_hk_1
    INNER JOIN eff_sat_1
        ON eff_sat_1.link_hk = link_1.link_hk
        AND eff_sat_1.eff_dts <= b.AS_OF_DATE
    LEFT JOIN link_2
        ON link1.foreign_hub_hk _2 = link_2.foreign_hub_hk_1
    INNER JOIN eff_sat_2
        ON eff_sat_2.link_hk = link_2.link_hk
        AND eff_sat_2.eff_dts <= b.AS_OF_DATE
),

all_rows AS (
    SELECT * FROM new_rows
),

candidate_rows AS (
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
               PARTITION BY
                   AS_OF_DATE,
                   link_1_hk, link_2_hk
               ORDER BY
                   eff_sat_1_load_dts DESC, eff_sat_2_load_dts DESC
               ) AS ROW_NUM
        FROM all_rows
    ) AS a
    WHERE a.ROW_NUM = 1
),

bridge AS (
    SELECT
        c.hub_hk,
        c.AS_OF_DATE,
        c.link_1_hk,
        c.link_2_hk
    FROM candidate_rows AS c
    WHERE TO_DATE(c.eff_sat_1_eff_end_dts) = TO_DATE('9999-12-31 23:59:59.9999999')
      AND TO_DATE(c.eff_sat_2_eff_end_dts) = TO_DATE('9999-12-31 23:59:59.9999999')
)

SELECT * FROM bridge

My questions:

  • in the new_rows CTE, won’t the INNER JOIN to an eff_sat result in records falling out if that eff_sat didn’t exist at a certain as_of_date?
  • in the bridge CTE, the WHERE clause on eff_end_dts = 9999… isn’t that too late? Shouldn’t that be moved up to the ON clause in the eff_sat JOIN in new_rows CTE?
  • the bridge_walk is used to walk across multiple links, where fk1 in link2 is joined on fk2 in link1. What if I want to join all links on fk1 (in other words, the base hub)? For instance, I have a Hub, with a Link_Parent and Link_Ultimate_Parent that both have the same driving key. So basically, I want this in my Bridge:

Additionally, I would like to have the option to include the hub hashkeys (and hub business keys) in the final SELECT as well.

I’m going to create some explicit test cases to challenge my own questions… those will follow.

edit: below I also changed the example on Macros - dbtvault to include an additional link. Please notice the fact that in my own example, I use the “eff_dts” as as_of_date and join my eff_sats accordingly. The dbtvault example relies on the load_datetime timeline.

WITH as_of AS (
     SELECT a.AS_OF_DATE
     FROM AS_OF_DATE AS a
     WHERE a.AS_OF_DATE <= CURRENT_DATE()
),

new_rows AS (
    SELECT
        a.CUSTOMER_PK,
        b.AS_OF_DATE,
        LINK_CUSTOMER_ORDER.CUSTOMER_ORDER_PK AS LINK_CUSTOMER_ORDER_PK,
        EFF_SAT_CUSTOMER_ORDER.END_DATE AS EFF_SAT_CUSTOMER_ORDER_ENDDATE,
        EFF_SAT_CUSTOMER_ORDER.LOAD_DATETIME AS EFF_SAT_CUSTOMER_ORDER_LOADDATE,
        LINK_CUSTOMER_ADDRESS.CUSTOMER_ADDRESS_PK AS LINK_CUSTOMER_ADRESS_PK,
        EFF_SAT_CUSTOMER_ADDRESS.END_DATE AS EFF_SAT_CUSTOMER_ADDRESS_ENDDATE,
        EFF_SAT_CUSTOMER_ADDRESS.LOAD_DATETIME AS EFF_SAT_CUSTOMER_ADDRESS_LOADDATE
    FROM DBTVAULT_DEV.TEST.HUB_CUSTOMER AS a
    INNER JOIN AS_OF AS b
        ON (1=1)
    LEFT JOIN LINK_CUSTOMER_ORDER AS LINK_CUSTOMER_ORDER
        ON a.CUSTOMER_PK = LINK_CUSTOMER_ORDER.CUSTOMER_FK
    INNER JOIN EFF_SAT_CUSTOMER_ORDER AS EFF_SAT_CUSTOMER_ORDER
        ON EFF_SAT_CUSTOMER_ORDER.CUSTOMER_ORDER_PK = LINK_CUSTOMER_ORDER.CUSTOMER_ORDER_PK
        AND EFF_SAT_CUSTOMER_ORDER.LOAD_DATETIME <= b.AS_OF_DATE
    LEFT JOIN LINK_CUSTOMER_ADDRESS AS LINK_CUSTOMER_ADDRESS
        ON a.CUSTOMER_PK = LINK_CUSTOMER_ADDRESS.CUSTOMER_FK
    INNER JOIN EFF_SAT_CUSTOMER_ADDRESS AS EFF_SAT_CUSTOMER_ADDRESS
        ON EFF_SAT_CUSTOMER_ADDRESS.CUSTOMER_ADDRESS_PK = LINK_CUSTOMER_ADDRESS.CUSTOMER_ADDRESS_PK
        AND EFF_SAT_CUSTOMER_ADDRESS.LOAD_DATETIME <= b.AS_OF_DATE
),

all_rows AS (
    SELECT * FROM new_rows
),

candidate_rows AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY 
                AS_OF_DATE,
                LINK_CUSTOMER_ORDER_PK,
                LINK_CUSTOMER_ADDRESS_PK
            ORDER BY
                EFF_SAT_CUSTOMER_ORDER_LOADDATE DESC,
                EFF_SAT_CUSTOMER_ADDRESS_LOADDATE DESC,
            ) AS row_num
    FROM all_rows
    QUALIFY row_num = 1
),

bridge AS (
    SELECT
        c.CUSTOMER_PK,
        c.AS_OF_DATE,
        c.LINK_CUSTOMER_ORDER_PK,
        c.LINK_CUSTOMER_ADDRESS_PK
    FROM candidate_rows AS c
    WHERE TO_DATE(c.EFF_SAT_CUSTOMER_ORDER_ENDDATE) = TO_DATE('9999-12-31 23:59:59.999999')
      AND TO_DATE(c.EFF_SAT_CUSTOMER_ADDRESS_ENDDATE) = TO_DATE('9999-12-31 23:59:59.999999')
)

SELECT * FROM bridge

Hey Frank,

Although not explicitly included in a Bridge query the book does have Effectivity Satellite components you can use in Bridge construction.

Look at Chapter 10, specifically section 10.3. I have three methods in there to query an effectivity satellite. Choose the one you need, I am guessing you need the current relationship by a driver key (selecting the active relationship).

For those curious about the scenario above, click here
and scroll down to “Flip-Flopping Relationship”

Hi Patrick,

yes, I found those, thanks. In my case I want to build up a Bridge using the the same time window as I’m using in my PITs, in which I base everything off of my eff_dts (or applied_ts). So for the effectivity satellite, I think I would need to join like this:

FROM hub
    INNER JOIN as_of_dates
        ON (1=1)
    LEFT JOIN link_1
        ON hub.hub_hk = link_1.foreign_hub_hk_1
    LEFT JOIN eff_sat_1
        ON eff_sat_1.link_hk = link_1.link_hk
        AND eff_sat_1.eff_dts <= b.AS_OF_DATE
        AND TO_DATE(eff_sat_1.eff_end_dts) = TO_DATE('9999-12-31')
    QUALIFY ROW_NUMBER() OVER (
           PARTITION BY
               AS_OF_DATE, link_1_hk
           ORDER BY eff_sat_1.load_dts DESC
           ) = 1

So particularly important is then the AND eff_sat_1.eff_dts <= b.AS_OF_DATE.

Just for starters I’m wondering why LEFT JOIN to the Link when the keys should match 1=1
Hub could contain more keys then is relevant to the link and therefore an equijoin should suffice, the same goes for joining to the effectivity sat.

I’m actually looking at it more from a PIT point of view. In my PIT, if a Hub record at a certain point in time (my applied date) doesn’t have a corresponding satellite record (which could come from any random source), I refer to the ghost record. For the Bridge, I’d like to do the same. Start from a Hub, and for any point in time, I could be pointing to an Effectivity Satellite record, if there is any. But of course instead of doing that, we rather resolve that timeline so that we can point to the actual related Hub record. Of course, Links (and their Effectivity Satellite) could come from any random source, so a certain Hub record at a certain point in time, doesn’t necessarily have a corresponding effective link record. So I’d then like to point to the Zero Key. In order to do all of this, I would need to do left joins, otherwise all those records that didn’t have that effective link will fall out.

Zero Keys

So essentially a Zero Key would exist for the optional portion of a relationship; the absence of a relationship between a driver key(s) and the non-driver key(s) should also be captured in the effectivity satellite. This essentially means you do not do LEFT JOINs because the link table has Hashed Zero Keys in them. They point to the Zero Key loaded to the Hub table (the absence of the relationship).

When the non-driver key portion eventually arrives the Effectivity Satellite will show that as well, and yet you’re still relying on the EQUI-JOIN

I do store the absence of a relationship in the effectivity satellite, but only from the moment in time the (absence of) the relationship gets delivered. When I have two source tables delivering a relationship (both to a different entity… let’s say one between Party and PartyParent, and one between Party and BusinessUnit), and table A starts delivering the Party-PartyParent relationship at Day 1, and table B starts delivering Party-BusinessUnit at Day 2, and I build up my Bridge table… I want it to look like this:

AsOfDate Party Parent BusinessUnit
Day1 123 456 -1
Day2 123 456 ABC

When populating the table, if I inner join for Day1, there’s nothing to join to in the Party-BusinessUnit link/effsat, and the 123 record will fall out.

If on Day3, Party 123 does get delivered but the BU is empty, I will record that in the link and the effsat, and I can inner join without problems.

So one issue could arise from certain sources tables that start delivering data at a later date than other sources. But an issue can also occur when two sources start delivering data at the same date, but not for the same list of business keys. If Party-BusinessUnit already gets delivered on Day1, but if Party 123 happens to be missing… I still want to be able to show that it with its parent.

That’s why I’m approaching this with a PIT table in mind… there, you also check (with a left join), if there is a corresponding satellite record for a given BK for a given AsOfDate, if there isn’t one, you refer to the ghost record.

I see,

Two Link constructs
Day 1 this link gets a record
Party-Parent
123-456

Day 2 other link gets a record
Party-BusinessUnit
123--1
123-ABC

So the next questions is:

  • the fact that the record is not included on day 1, is that not correct and should not be reported? (maybe even highlighted as a data integrity issue)
  • is there a business timestamp you can rely on rather than the applied timestamp? (it’s only late because the data wasn’t there already, however the business date says the business unit relationship is actually effective from Day 1)

Else, yes, you’d need a LEFT JOIN construct as you have pointed out. BUT then I’d have an EQUI-JOIN to Party-Parent, or another filter to ensure I am not including other entries in the party hub that are not relevant to the Bridge.

The difference in this example is just 1 day, but theoretically it could also be a year or even longer. I have two sources, that can deliver records for the same keys, but also for keys that only exist in that one source. And one source might have been added at a completely different point in time than the other. So I can’t really give a definition of when data would be “not included” or “missing”. I’m just onboarding another source, which results in a satellite that will be populated from this point onwards, as well as this link+effsat. Unfortunately I don’t have a business date, only a snapshot date from the source delivery.

I implemented a filter to only show those parties that are of interest to my consumers. So both my Party PIT and Bridge will start from that point. But because of various reasons (DQ, source delivery timelines), I don’t think I can ever do an INNER JOIN to any of the links included in the Bridge, because I can’t guarantee I’ll have all link records for all my Hub records. Also the PartyParent link might have gaps.

I think you have one of these: Big Ball of mud - DDD - The Domain Driven Design

There must always be a business date somewhere,