SAL is killing me

I’m hoping someone can please show me a cleaner way to do this. I want to create a type 1 dimension based off a hub, two satellites (for systems A and B) and a same-as-link. Not all records in system A exist in system B, and vice versa. Additionally, not all records in the hub may exist in either system A or system B. But I want my dimension to be comprehensive.

This is the code I’ve come up with, and it works, but I don’t think it’s the ‘Data Vault way’. I’ve done the course, and read the books. What have I missed?

WITH
sat_sf AS (
    SELECT --Get the most recent version of bank from the Salesforce Satellite, including the business key from the hub
        hub.funder_bk,
        hub.funder_pk,
        sat.load_date,
        sat.name
    FROM
        HUB_FUNDER AS hub
        INNER JOIN SAT_H_FUNDER__SF_BANK AS sat
            ON hub.funder_pk = sat.funder_pk
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY sat.funder_pk ORDER BY sat.load_date DESC) = 1
),

sat_platform AS (
    SELECT --Get the most recent version of bank from the Platform Satellite, including the business key from the hub
        hub.funder_bk,
        hub.funder_pk,
        sat.load_date,
        sat.funder_name AS name
    FROM
        HUB_FUNDER AS hub
        INNER JOIN SAT_H_FUNDER__PLATFORM_BANK AS sat
            ON hub.funder_pk = sat.funder_pk
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY sat.funder_pk ORDER BY sat.load_date DESC) = 1
),

union_set AS ( -- get a full set of all keys, but only one record per funder, even if in multiple systems
    SELECT -- get the primary key for all Salesforce records that don't exist in the SAL
        NULL AS platform_pk,
        funder_pk AS sf_pk,
        NULL AS other_pk,
        NULL AS other_bk
    FROM
        sat_lgsf
    WHERE
        funder_pk NOT IN (SELECT funder_pk__sf_bank FROM SAL_FUNDER__SF_BANK__PLATFORM_FUNDER)
    UNION ALL
    SELECT -- get the primary key for all Platform records that don't exist in the SAL
        funder_pk AS platform_pk,
        NULL AS sf_pk,
        NULL AS other_pk,
        NULL AS other_bk
    FROM
        sat_platform
    WHERE
        funder_pk NOT IN (SELECT funder_pk__platform_funder FROM SAL_FUNDER__SF_BANK__PLATFORM_FUNDER)
    UNION ALL
    SELECT -- Get the primary keys for records that do exist in the SAL
        funder_pk__platform_funder AS platform_pk,
        funder_pk__sf_bank AS sf_pk,
        NULL AS other_pk,
        NULL AS other_bk
    FROM
        SAL_FUNDER__SF_BANK__PLATFORM_FUNDER
    UNION ALL
    SELECT -- finally add in any record in the hub that hasn't already been taken account of
        NULL AS platform_pk,
        NULL AS sf_pk,
        funder_pk AS other_pk,
        funder_bk AS other_bk
    FROM
        hub_funder
    WHERE
        funder_pk NOT IN (SELECT funder_pk FROM sat_sf)
        AND funder_pk NOT IN (SELECT funder_pk FROM sat_platform)
)

SELECT -- bring it all together with a COALESCE to choose which system to use as a primary source
    COALESCE(union_set.platform_pk, union_set.sf_pk, union_set.other_pk) AS _dim_pk,
    COALESCE(sat_sf.funder_bk, '') AS sf_funder,
    COALESCE(sat_platform.funder_bk, '') AS platform_funder,
    COALESCE(union_set.other_bk, '') AS other_funder,
    COALESCE(sat_sf.name, sat_platform.name, '') AS funder_name
FROM
    union_set
    LEFT JOIN sat_sf
        ON union_set.sf_pk = sat_sf.funder_pk
    LEFT JOIN sat_platform
        ON union_set.platform_pk = sat_platform.funder_pk

Based on the above, if you were to have multiple SAL tables for the same object (system A to system B, and another for system B to system C) then you could join these together in a cte it the top, and then reference it as needed in the union_set.

EDIT: For context on the naming:
SAT_H_FUNDER__SF_BANK as a table name infers that it is a Satellite table, that hangs off Hub Funder, and is sourced from the Bank table in Salesforce.

SAL_FUNDER__SF_BANK__PLATFORM_FUNDER as a table name infers that it is a Same-as-Link, that is related to Hub Funder, and provides the relationship between the Banks in Salesforce and the Funders in Platform.

They may not be perfect, but they work for us.

How is your SAL sourced? Is it provided by one of the sources or do you derive it yourself using a business rule?

Imo, you should end up with only one SAL which is source-independent (like any other link), with probably an Effectivity Satellite or Delete Tracking Satellite on top, per source.

Do you have a golden source for your Hub Funder? You take platform_pk as a first argument in the _dim_pk coalesce but you take the name from sat_sf as your first argument in the funder_name coalesce.

If you have a golden source (let’s say Platform), you store all Funders from Sat Platform in the Dimension, only Funders from SF that don’t appear in the SAL to Platform, and all Funders from the Hub that don’t appear in neither Sat Platform not Sat SF.

Thank you for the reply.

In this case, each record in SF has both the SF id and the Platform id. So It’s relatively easy to build the SAL. There is also a primary system in this case where the funder is originally generated, however some of the names have been updated in SF with no (current) ability to do the same in Platform (hence where name comes from).

Other examples I could provide have no golden record though.

Regardless of how the SAL is formed however, my concern is about getting the data out. Is the example query I’ve given actually the ‘right way’? Asides from replacing the NOT IN() with a more performant ‘a LEFT JOIN b … WHERE b.id IS NULL’, is there a more appropriate way to build this query? An agreed on standard or such?

Do your consumers want to see the different BK’s or just one? I would say, just one. If so,

WITH
sat_sf AS (
    SELECT --Get the most recent version of bank from the Salesforce Satellite
        sat.funder_pk,
        sat.name
    FROM
        SAT_H_FUNDER__SF_BANK AS sat
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY sat.funder_pk ORDER BY sat.load_date DESC) = 1
),
sat_platform AS (
    SELECT --Get the most recent version of bank from the Platform Satellite
        sat.funder_pk,
        sat.funder_name AS name
    FROM
        SAT_H_FUNDER__PLATFORM_BANK AS sat
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY sat.funder_pk ORDER BY sat.load_date DESC) = 1
),
hub AS (
	SELECT --Get all Hub keys except those from SF for which we have a Platform key
		hub.funder_pk,
		hub.funder_bk	
    FROM
		HUB_FUNDER AS hub
	LEFT ANTI JOIN SAL_FUNDER__SF_BANK__PLATFORM_FUNDER AS sal
		ON hub.funder_pk = sal.funder_pk__sf_bank
)

SELECT
    hub.funder_pk AS _dim_pk,
    hub.funder_bk AS funder,
    COALESCE(sat_sf2.name, sat_sf.name, sat_platform.name, '') AS funder_name
FROM
    hub
    LEFT JOIN sat_sf -- for those that only exist in SF
        ON hub.funder_pk = sat_sf.funder_pk
    LEFT JOIN sat_platform -- 
        ON hub.funder_pk = sat_platform.funder_pk
	LEFT JOIN SAL_FUNDER__SF_BANK__PLATFORM_FUNDER AS sal
		ON hub.funder_pk = sal.funder_pk__platform_funder
	LEFT JOIN sat_sf AS sat_sf2 -- for those that exist in Platform but also in SF
        ON sat_sf2.funder_pk = sal.funder_pk__sf_bank

Thank you,

That does look much cleaner, however unfortunately yes, they do want to see all business keys. They use them in the event they need to go back to source and look up a record. They want the key for whichever system they need to log into.

Maybe this works? Not entirely sure if it is syntactically correct.

WITH
sat_sf AS (
    SELECT --Get the most recent version of bank from the Salesforce Satellite
        sat.funder_pk,
        sat.name
    FROM
        SAT_H_FUNDER__SF_BANK AS sat
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY sat.funder_pk ORDER BY sat.load_date DESC) = 1
),
sat_platform AS (
    SELECT --Get the most recent version of bank from the Platform Satellite
        sat.funder_pk,
        sat.funder_name AS name
    FROM
        SAT_H_FUNDER__PLATFORM_BANK AS sat
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY sat.funder_pk ORDER BY sat.load_date DESC) = 1
)

SELECT
    hub.funder_pk AS _dim_pk,
	CASE WHEN sat_sf.funder_pk IS NOT NULL THEN hub.funder_bk WHEN hub2.funder_pk THEN hub2.funder_bk ELSE '' END AS sf_funder,
	CASE WHEN sat_platform.funder_pk IS NOT NULL THEN hub.funder_bk ELSE '' END AS platform_funder,
	CASE WHEN sat_sf.funder_pk IS NULL AND sat_platform.funder_pk IS NULL THEN hub.funder_bk ELSE '' END AS other_funder,
    COALESCE(sat_sf2.name, sat_sf.name, sat_platform.name, '') AS funder_name
FROM
    HUB_FUNDER AS hub
    LEFT JOIN sat_sf
        ON hub.funder_pk = sat_sf.funder_pk
    LEFT JOIN sat_platform 
        ON hub.funder_pk = sat_platform.funder_pk
	LEFT JOIN SAL_FUNDER__SF_BANK__PLATFORM_FUNDER AS sal
		ON hub.funder_pk = sal.funder_pk__platform_funder
	LEFT JOIN sat_sf2
        ON sat_sf2.funder_pk = sal.funder_pk__sf_bank
	LEFT JOIN HUB_FUNDER AS hub2
		ON hub2.funder_pk = sal.funder_pk__sf_bank
	LEFT ANTI JOIN SAL_FUNDER__SF_BANK__PLATFORM_FUNDER AS sal_no_sf --Get all Hub keys except those from SF for which we have a Platform key
		ON hub.funder_pk = sal_no_sf.funder_pk__sf_bank

That could work! Thank you!