Still struggling on this.
My approach at this stage is that my SAL and related PIT will not be auditable, and would be truncated/reloaded with each warehouse update. My reasoning for this is that we donât have a âmaster recordâ as defined in Danâs book, but rather we have multiple sources of (for example) customer, with different keys, and only one of those sources has the relationship between all of the keys. If those keys/customers havenât yet reached that system, there is no relationship even though the customer exists. Additionally, a record that didnât have a relationship today might have a relationship tomorrow, and may have a third relationship the day after as different source records slowly fed into this one system that contains the relationships.
The approach Iâm looking to take is to start with the system that has the relationships, and then union in all other records that exist in other systems that have not already been accommodated for in the ârelationship systemâ. I would end up with a table that had a SAL_HUB_CUSTOMER_HK, followed by fields with various HUB_CUSTOMER_HK from different source systems:
SAL_HUB_CUSTOMER_HK;
HUB_CUSTOMER_HK__LGSF_CONTACT_ID;
HUB_CUSTOMER_HK__LGSF_ACCOUNT_ID;
HUB_CUSTOMER_HK__PLATFORM_CUSTOMER_ID
âŚ
From here I can create a PIT, linking the various SAT tables back to the appropriate field in the SAL, knowing that because of my unions all records would exist, so long as I was joining to the correct column.
The reason for the truncate/reload is one of simplicity.
Any links between HUBS that had a SAL in place would then need to join across the SAL.
Itâs the only way I can think to do it, but it does (IMO) kinda make the HUBS somewhat redundant.
I would add that given we use multiple systems throughout our organization, the vast majority of our HUBS would require an accompanying SAL - almost to the point that youâd argue itâs easier just to have one auto-generated when a new HUB was created.
Specifically (from a code perspective), Iâm talking about something like this:
CREATE OR REPLACE TABLE RAW_VAULT.SAL_HUB_CUSTOMER_V1 AS
WITH
primary_relationship AS (
SELECT
id AS lgsf_contact_id,
account_id AS lgsf_account_id,
legacy_sf_id_c AS lsf_contact_id,
customer_uuid_c AS platform_customer_id,
toolbox_id_c AS toolbox_customer_id
FROM
DATA_VAULT_DEV.STG_SF_LGFORCE_PRD.CONTACT__CUSTOMER
QUALIFY
ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_modified_date DESC) = 1
)
SELECT
MD5(array_to_string(array_construct(
lgsf_contact_id::VARCHAR(255),
lgsf_account_id::VARCHAR(255),
lsf_contact_id::VARCHAR(255),
platform_customer_id::VARCHAR(255),
toolbox_customer_id::VARCHAR(255)
), '|'))::VARCHAR(64) AS sal_hub_customer_hk,
MD5(COALESCE(lgsf_contact_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__lgsf_contact_id,
MD5(COALESCE(lgsf_account_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__lgsf_account_id,
MD5(COALESCE(lsf_contact_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__lsf_contact_id,
MD5(COALESCE(platform_customer_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__platform_customer_id,
MD5(COALESCE(toolbox_customer_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__toolbox_customer_id
FROM
primary_relationship
UNION DISTINCT
SELECT DISTINCT
MD5(array_to_string(array_construct(
''::VARCHAR(255),
src.id::VARCHAR(255),
''::VARCHAR(255),
''::VARCHAR(255),
''::VARCHAR(255)
), '|'))::VARCHAR(64) AS sal_hub_customer_hk,
MD5('')::VARCHAR(64) AS hub_customer_hk__lgsf_contact_id,
MD5(COALESCE(src.id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__lgsf_account_id,
MD5('')::VARCHAR(64) AS hub_customer_hk__lsf_contact_id,
MD5('')::VARCHAR(64) AS hub_customer_hk__platform_customer_id,
MD5('')::VARCHAR(64) AS hub_customer_hk__toolbox_customer_id
FROM
DATA_VAULT_DEV.STG_SF_LGFORCE_PRD.ACCOUNT__CUSTOMER AS src
LEFT JOIN primary_relationship
ON src.id = primary_relationship.lgsf_account_id
WHERE
primary_relationship.lgsf_account_id IS NULL
UNION DISTINCT
SELECT DISTINCT
MD5(array_to_string(array_construct(
''::VARCHAR(255),
''::VARCHAR(255),
''::VARCHAR(255),
src.customer_id::VARCHAR(255),
''::VARCHAR(255)
), '|'))::VARCHAR(64) AS sal_hub_customer_hk,
MD5('')::VARCHAR(64) hub_customer_hk__lgsf_contact_id,
MD5('')::VARCHAR(64) hub_customer_hk__lgsf_account_id,
MD5('')::VARCHAR(64) hub_customer_hk__lsf_contact_id,
MD5(COALESCE(src.customer_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__platform_customer_id,
MD5('')::VARCHAR(64) AS hub_customer_hk__toolbox_customer_id
FROM
DATA_VAULT_DEV.STG_KAFKA_PRD.CUSTOMER_EVENT AS src
LEFT JOIN primary_relationship
ON src.customer_id = primary_relationship.platform_customer_id
WHERE
primary_relationship.platform_customer_id IS NULL
UNION DISTINCT
SELECT DISTINCT
MD5(array_to_string(array_construct(
''::VARCHAR(255),
''::VARCHAR(255),
''::VARCHAR(255),
src.customer_id::VARCHAR(255),
''::VARCHAR(255)
), '|'))::VARCHAR(64) AS sal_hub_customer_hk,
MD5('')::VARCHAR(64) AS hub_customer_hk__lgsf_contact_id,
MD5('')::VARCHAR(64) AS hub_customer_hk__lgsf_account_id,
MD5('')::VARCHAR(64) AS hub_customer_hk__lsf_contact_id,
MD5(COALESCE(src.customer_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__platform_customer_id,
MD5('')::VARCHAR(64) AS hub_customer_hk__toolbox_customer_id
FROM
DATA_VAULT_DEV.STG_KAFKA_PRD.APPLICATION_EVENT AS src
LEFT JOIN primary_relationship
ON src.customer_id = primary_relationship.platform_customer_id
WHERE
primary_relationship.platform_customer_id IS NULL;
Whereby we can simply append new sources to the bottom of this union in the same way that you add new sources to a hub. But with new records, and relationships forming daily weâre in a position that either:
- we need to truncate/reload the table (and any derived PIT tables) daily; or
- implement valid_from and valid_to effectivity satellites for both this and derived PIT tables.
There doesnât appear to be any substantial benefit to go through the pain of implementing the second option, hence going with the first.