Technical Advice on SAL and PIT Related to Same Hub

Hi there,

Hoping for some technical advice. As background:

4 source systems into an Application Hub

  1. Original Salesforce (2010 - 2021)
  2. Bespoke System (1990 - present)
  3. Application microservice (2018 - present)
  4. New Salesforce (2021 - present)

There is no MDM solution.

Any application in progress when we migrated from Original to New Salesforce was copied over, and so there is a key (old_sf_id) on the New Salesforce table that provides the relationship.

The bespoke system had all historical applications migrated into the New Salesforce, and until this system is turned off there is a two way sync between them. The bespoke system should be considered ‘source of truth’ for these records. Similar to above, New Salesforce provides a key (bespoke_id) for these records.

When the Application Microservice was created, it was considered that this should be the source of truth for all new applications. It started fresh, and never had any history imported into it. Similar to above, New Salesforce as the downstream replica of the Application Microservice has a key (platform_uuid) for the relationship back to the microservice.

This means New Salesforce could be considered the ‘MDM’, but it doesn’t contain everything. It doesn’t have all of the Old Salesforce records (but it does have some). It might not have all Bespoke or Microservice applications if they haven’t yet been replicated.

Background out of the way, I’m comfortable with having PIT tables for the various satellites of the Application Hub that relate to the same source system. However… how would you suggest building a SAL for this hub, given the four systems, non of which have all of the records? And further, if I wanted a ‘complete’ PIT for this hub, would you expect it to hang off the Hub directly or off the SAL?

Hope that makes sense?

Source must provide a mapping for this SAL, how else did you migrate from one source to another — assuming that you did.

That mapping, even if it is one off because of a migration should form the basis of your SAL.

If you have ongoing SAL updates it wouldn’t hurt to load to the same SAL if the grain is the same.

Lastly, if the satellites you are combining for your PIT are off the hub then that where the PIT will be based on, however you may want to resolve SAL to HUB relations to get to the business key you want to surface into the PIT and subsequent info-mart!

Thank you Patrick,

It’s the new Salesforce that has all of the mappings, however it doesn’t have all of the records. There are records in the other systems that either haven’t made it to the new Salesforce yet, or that never will, but are still at the same grain. Examples being old applications never migrated, or applications that simply haven’t been synced yet.

So maybe to really distill it down: is the assumption that my SAL contains all applications, or only those that have a same-as relationship (being a subset of applications in the application hub). I believe it is the latter.

Where my confusion lies is the PIT. I want a single PIT for all satellites that relate to the HUB. If I don’t incorporate the SAL in this, then it’s not going to give me a point in time regardless of source. If I do, there doesn’t appear to be a ‘single source of truth’ system that I could map everything to.

You’d have to use the SAL to resolve the keys before you have the sats based on the hub.

I’m not sure I follow sorry.

Take a hub example. It contains application_hub_bk ‘abc’, ‘abd’ & ‘abe’ from system A. It also contains application_hub_bk ‘123’ and ‘124’ from system B. ‘abd’ and ‘124’ are ‘same as’. ‘abc’ & ‘abe’ are unique to system A an not referenced elsewhere. ‘123’ is unique to system B and not referenced elsewhere. In all likelihood, ‘abe’ will end up with a link to a record in system B tomorrow, but that record hasn’t yet been created.

This means that any SAL will either not have a complete list of all applications, or it will be subject to change over time (‘abe’ is either not included until it has a reference in system B, or it is included from the outset and then needs to be updated to show that a relationship has been created).

It’s really not clear to me how to build this. All the documentation seems to indicate that I should be basing things on the ‘primary system’ on the premise that this contains all records. In reality however, this simply isn’t the case.

  • Well your PIT would be based on the business object — i.e. the hub, right?
  • The business key is what you expose to the business, correct?
  • SAL will not have all keys but the hub will, right?

Therefore, you need an intermediary step to resolve a “hub” to the business key you want users to see, that central intermediary hub is then used as the base for your PIT. When picking the business key of course, you’d need a coalesce function with the preferred key first but that leaves the final question:

  • will the business key ever resolve to the preferred business key?
    And if it does will the business user care that the key could change?

Thank you Patrick,

You’re correct on your three points. Downstream the business will end up seeing this through a Kimball model. So the PIT will be used in building out a dimension - dim_application. This will come from many sources with consistent grain. There may be a COALESCE(sys_a.value, sys_b.value) AS value, field for example. As such, the business would want to know all of the business keys from all of the related systems - not just the one that was deemed ‘primary’.

I’m going to have to look into using a SAL for a PIT as well. Luckily in my case, I only have two systems, where it is explicitly stated that one is the golden source until a certain point in time, when afterwards the new source becomes golden.

So I have a legacy source system ABC that was loaded until end of 2022, and the new source system XYZ to which ABC was migrated, which started at 2023

ABC used a technical key which, in some cases, also contained a proper natural key. That natural key was consequently used in XYZ as business key.

So I now have:

ABC:

technical_id natural_id attribute
ABC-123-BLA 123 andy
ABC-456-BLA null patrick

XYZ:

natural_id attribute
123 ANDY
456 PATRICK

My Hub looks like this:

dv_hub_hashkey hub_id
HASH(ABC-123-BLA) ABC-123-BLA
HASH(ABC-456-BLA) ABC-456-BLA
HASH(123) 123
HASH(456) 456

My SAL looks like this:

dv_sal_hashkey dv_hub_hashkey dv_hub_hashkey_sameas
HASH(ABC-123-BLA ++ 123) HASH(ABC-123-BLA) HASH(123)

Now my PIT will look like this:

as_of_date dv_hub_hashkey dv_sat_ABC_hashkey dv_sat_ABC_load_dts dv_sat_XYZ_hashkey dv_sat_XYZ_load_dts
2023-01-01 HASH(123) HASH(ABC-123-BLA) 2022-12-31 HASH(123) 2023-01-01
2023-01-01 HASH(ABC-456-BLA) HASH(ABC-456-BLA) 2022-12-31 HASH(GHOST) 1900-01-01
2023-01-01 HASH(456) HASH(GHOST) 1900-01-01 HASH(456) 2023-01-01

So ABC-123-BLABLA is not not in my PIT base because it can be resolved to 123 using the SAL. ABC-456-BLA was never migrated to XYZ but still needs to appear in the information mart.

Currently with the AutomateDV implementation, I don’t think the sat hashkeys can be something other the hub hashkey itself or the ghost record. I think I’ll have to have a look at refactoring the current pattern.

Have all the keys available then :wink:

You’d have to fork dbtvault’s PIT code, we did

Yeah, we’ve already done that for pretty much all patterns already. Created our own historical loading patterns for all, and added an STS.

1 Like

Hi,

I’m an absolute Noob to DataVault and muddling my way through using ‘Building a Scalable Data Warehouse with Data Vault 2.0’ by Daniel Linstedt and Michael Olschimke and also gooling where I can.

In your 3rd paragraph you have mentioned in brackets that you would UPDATE your SAL to identify the new relationship between System A and System B. From my understanding of my limited reading so far, LINKS should only ever have INSERTS and should not be UPDATED. This is something that has been stumping me on my understanding of SALs. If you add ‘abe’ to the SAL originally with a GHOST record for System A and then a true relationship is created tomorrow between System A and System B, from my understanding, this would then INSERT a new row into the SAL which would mean that there are now 2 rows for ‘abe’ in the SAL. One with the relationship showing to System A and one with a Ghost record for System A. When referencing the SAL in queries for InfoMarts how do you determine which row to use? @patrickcuba @andy

Agreed.

I’m also unclear on the best practice here. I’m guessing that it would be to have an effectivity satellite hanging off it with the valid_to/from dates, as opposed to having these sitting within the SAL itself. But I hope the answer is that you can simply have them in the SAL as it’s a special link’ so the rules don’t apply.

It also then follows that any PIT table that uses the SAL would either be a truncate/reload or would also have effective dates. Given a PIT table sits in the business vault, I imagine it could also be done as an update, whichever being most time/resource effective.

When ppl say UPDATE in DV2.0 it means INSERT a new record.

BTW, Dan’s book is a bit dated.

You need to read up on the difference between ghost records and zero keys, Data Vault Mysteries… Zero Keys & Ghost Records… | by Patrick Cuba | Snowflake | Medium

You’d never have a ghost record in a SAL

Still struggling on this.

My approach at this stage is that my SAL and related PIT will not be auditable, and would be truncated/reloaded with each warehouse update. My reasoning for this is that we don’t have a ‘master record’ as defined in Dan’s book, but rather we have multiple sources of (for example) customer, with different keys, and only one of those sources has the relationship between all of the keys. If those keys/customers haven’t yet reached that system, there is no relationship even though the customer exists. Additionally, a record that didn’t have a relationship today might have a relationship tomorrow, and may have a third relationship the day after as different source records slowly fed into this one system that contains the relationships.

The approach I’m looking to take is to start with the system that has the relationships, and then union in all other records that exist in other systems that have not already been accommodated for in the ‘relationship system’. I would end up with a table that had a SAL_HUB_CUSTOMER_HK, followed by fields with various HUB_CUSTOMER_HK from different source systems:

SAL_HUB_CUSTOMER_HK;
HUB_CUSTOMER_HK__LGSF_CONTACT_ID;
HUB_CUSTOMER_HK__LGSF_ACCOUNT_ID;
HUB_CUSTOMER_HK__PLATFORM_CUSTOMER_ID

From here I can create a PIT, linking the various SAT tables back to the appropriate field in the SAL, knowing that because of my unions all records would exist, so long as I was joining to the correct column.
The reason for the truncate/reload is one of simplicity.

Any links between HUBS that had a SAL in place would then need to join across the SAL.

It’s the only way I can think to do it, but it does (IMO) kinda make the HUBS somewhat redundant.

I would add that given we use multiple systems throughout our organization, the vast majority of our HUBS would require an accompanying SAL - almost to the point that you’d argue it’s easier just to have one auto-generated when a new HUB was created.

Specifically (from a code perspective), I’m talking about something like this:

CREATE OR REPLACE TABLE RAW_VAULT.SAL_HUB_CUSTOMER_V1 AS

WITH
primary_relationship AS (
    SELECT
        id AS lgsf_contact_id,
        account_id AS lgsf_account_id,
        legacy_sf_id_c AS lsf_contact_id,
        customer_uuid_c AS platform_customer_id,
        toolbox_id_c AS toolbox_customer_id
    FROM
        DATA_VAULT_DEV.STG_SF_LGFORCE_PRD.CONTACT__CUSTOMER
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_modified_date DESC) = 1
)

SELECT
    MD5(array_to_string(array_construct(
        lgsf_contact_id::VARCHAR(255),
        lgsf_account_id::VARCHAR(255),
        lsf_contact_id::VARCHAR(255),
        platform_customer_id::VARCHAR(255),
        toolbox_customer_id::VARCHAR(255)
        ), '|'))::VARCHAR(64) AS sal_hub_customer_hk,
    MD5(COALESCE(lgsf_contact_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__lgsf_contact_id,
    MD5(COALESCE(lgsf_account_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__lgsf_account_id,
    MD5(COALESCE(lsf_contact_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__lsf_contact_id,
    MD5(COALESCE(platform_customer_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__platform_customer_id,
    MD5(COALESCE(toolbox_customer_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__toolbox_customer_id
FROM
    primary_relationship
UNION DISTINCT
SELECT DISTINCT
    MD5(array_to_string(array_construct(
        ''::VARCHAR(255),
        src.id::VARCHAR(255),
        ''::VARCHAR(255),
        ''::VARCHAR(255),
        ''::VARCHAR(255)
        ), '|'))::VARCHAR(64) AS sal_hub_customer_hk,
    MD5('')::VARCHAR(64) AS hub_customer_hk__lgsf_contact_id,
    MD5(COALESCE(src.id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__lgsf_account_id,
    MD5('')::VARCHAR(64) AS hub_customer_hk__lsf_contact_id,
    MD5('')::VARCHAR(64) AS hub_customer_hk__platform_customer_id,
    MD5('')::VARCHAR(64) AS hub_customer_hk__toolbox_customer_id
FROM
    DATA_VAULT_DEV.STG_SF_LGFORCE_PRD.ACCOUNT__CUSTOMER AS src
LEFT JOIN primary_relationship
    ON src.id = primary_relationship.lgsf_account_id
WHERE
    primary_relationship.lgsf_account_id IS NULL
UNION DISTINCT
SELECT DISTINCT
    MD5(array_to_string(array_construct(
        ''::VARCHAR(255),
        ''::VARCHAR(255),
        ''::VARCHAR(255),
        src.customer_id::VARCHAR(255),
        ''::VARCHAR(255)
        ), '|'))::VARCHAR(64) AS sal_hub_customer_hk,
    MD5('')::VARCHAR(64) hub_customer_hk__lgsf_contact_id,
    MD5('')::VARCHAR(64) hub_customer_hk__lgsf_account_id,
    MD5('')::VARCHAR(64) hub_customer_hk__lsf_contact_id,
    MD5(COALESCE(src.customer_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__platform_customer_id,
    MD5('')::VARCHAR(64) AS hub_customer_hk__toolbox_customer_id
FROM
    DATA_VAULT_DEV.STG_KAFKA_PRD.CUSTOMER_EVENT AS src
LEFT JOIN primary_relationship
    ON src.customer_id = primary_relationship.platform_customer_id
WHERE
    primary_relationship.platform_customer_id IS NULL
UNION DISTINCT
SELECT DISTINCT
    MD5(array_to_string(array_construct(
        ''::VARCHAR(255),
        ''::VARCHAR(255),
        ''::VARCHAR(255),
        src.customer_id::VARCHAR(255),
        ''::VARCHAR(255)
        ), '|'))::VARCHAR(64) AS sal_hub_customer_hk,
    MD5('')::VARCHAR(64) AS hub_customer_hk__lgsf_contact_id,
    MD5('')::VARCHAR(64) AS hub_customer_hk__lgsf_account_id,
    MD5('')::VARCHAR(64) AS hub_customer_hk__lsf_contact_id,
    MD5(COALESCE(src.customer_id::VARCHAR(255), ''))::VARCHAR(64) AS hub_customer_hk__platform_customer_id,
    MD5('')::VARCHAR(64) AS hub_customer_hk__toolbox_customer_id
FROM
    DATA_VAULT_DEV.STG_KAFKA_PRD.APPLICATION_EVENT AS src
LEFT JOIN primary_relationship
    ON src.customer_id = primary_relationship.platform_customer_id
WHERE
    primary_relationship.platform_customer_id IS NULL;

Whereby we can simply append new sources to the bottom of this union in the same way that you add new sources to a hub. But with new records, and relationships forming daily we’re in a position that either:

  1. we need to truncate/reload the table (and any derived PIT tables) daily; or
  2. implement valid_from and valid_to effectivity satellites for both this and derived PIT tables.

There doesn’t appear to be any substantial benefit to go through the pain of implementing the second option, hence going with the first.

We’ll take this one step at a time

— reloading = lost audit = fake vault
— you are using semi-structured functions to construct your hashes, these are slower than just using regular concat because they’re not intended for structured data, see: Data Vault Loader Traps. You might have already had to deal with… | by Patrick Cuba | Medium
— no master record, only one has the relationship between the keys — that’s ok, whats the problem?
— if one of those keys haven’t reached that system, there is no relationship — that’s ok, your enterprise data vault model is meant to be eventually consistent, same applies to relationships eventually arriving — not a model problem but an orchestration problem if you are expecting all these relationships to arrive at the same - they rarely ever do

— what the heck is that? “Bag of keys”? – a SAL is a link table, use the source with the keys to construct this table

– oh man this is so bad! If you’re prepared to manage this then by all means build it! :smiley:
Bring in the source keys as is into the SAL, construct a business vault link to resolve this with business rules you’re constructing — I did something similar here: Apache Spark GraphX and the Seven Bridges of Königsberg | by Patrick Cuba | Geek Culture | Medium

— yes, your approach does

Thank you Patrick,

Lot’s to go on there.

  • Yes - reloading = lost audit. But only in the SAL and PIT, not in the base hubs, links and satellites. My understanding is that the Business Vault should not be considered auditable anyway, and this is where the PIT sits.
  • Good call on the semi-structured functions. We’re loading ~10,000 records per day, so it’s not significant for us. But makes sense to make things as efficient as possible.
  • I’m struggling that the one that has the relationship doesn’t have all of the keys. The example explained in the book assumes every key for a domain object exists in a single location.
  • Related to the point above, it seems easier to retrieve the data from the vault if it’s built in a way that everything comes together into a single spot. Having to connect to a SAL, and other locations in order to get a complete picture of a single domain objects seems fraught with risk of missing something. Unless I’ve misunderstood?
  • Not so much an orchestration problem as a timing problem. A record is created in system A, is transacted on is system A, is eventually sent to system B where a relationship is created and further transactions are applied. In parallel, system A may continue to make changes to the same record. But regardless, I need a way to report on this showing what has occurred in which system, when. And I need to be able to look at this one record (in both sources) as a single record. If doing it within Data Vault is not they way, using SAL and PIT tables, then what is the correct way?
  • SAL_HUB_CUSTOMER_HK. Makes sense to me. It’s a Same-As-Link table, that is creating a relationship of similar records in the Customer Hub, and the field is a Hash value. :man_shrugging:
  • This is where I’m getting confused. I’m not following the ‘construct a business vault link to resolve this with business rules you’re constructing’ piece. We have a table (in system B) that contains the relationship (to the extent that the relationship has been defined - which is to say that the record has been sent across from system A). It’s STG_SF_LGFORCE_PRD.CONTACT__CUSTOMER in the above example.

Take our hub. It has two records:
HUB_HK, HUB_BK, SOURCE …
MD5(abc), abc, platform …
MD5(123), 123, salesforce …

These are the same account from two different systems. My understanding is that the SAL is there to allow me to materialize the relationship between these two things by essentially pivoting them based on source system:

SAL_HK, PLATFORM, SALESFORCE
…, abc, 123

^ whether it’s abc or it’s MD5(abc) isn’t really relevant to this topic, I don’t believe.

I have a SAT table with attributes from Platform, related to abc, and I have a SAT table with attributes from Salesforce, related to 123. The only way that I can see to create a PIT across all of these SAT tables is to go via the SAL, as this is where the relationships between source systems is shown in a single record. So what follows is a HUB with multiple SATs, and then a SAL that essentially pivots the values in the HUB back to their appropriate source systems, with then a PIT on top of this to show the valid_from and valid_to for every ‘consolidated record’ regardless of which source that version of history came from. I would then link the SATs to my PIT to build out my dimension.

I’m taking from your comments that I’m way off base on this stuff. But this is how I’m interpreting the book. So not sure where to go from here.

a SAL is a Link not a Hub

That’s ok, DV2.0 is meant to be “Eventually Consistent” — if you create these join in pre-staging then you will have lag

A PIT can have a hub or link as a base, since yours is a SAL use that as a base, include a join retrieve the keys you need for your info-marts.

which book?

Hi Patrick,

Thank you for your reply and for the link to the Data vault Mysteries…Zero Keys and Ghost Records, it was very helpful.

I was able to find another post of yours: https://www.linkedin.com/pulse/data-lt-mysteries-effectivity-satellite-driving-key-patrick-cuba/ which I believe provided the answer to my original question. In order to determine which row (relationship status) within a LINK table is the valid row I need to create an Effectivity Satellite for the LINK. This then allows me to identify which relationship is the relationship i require at any point in time for historical reporting.

Would you mind just confirming that Effectivity Satellites are the right approach please?

Thanks,

Carl

Yes, it can get complex so you could also consider the following to track link effectivity:
• Status Tracking Satellite and
• Record Tracking Satellite