Same-as links across different sources that hold the data differently

Hi,

I’m relatively new to DV2.0 and I’m hoping to understand a few things that I’m having trouble finding resources for online. I’ll keep them as separate threads so as not to confuse things.

We have two Kafka event streams and three Salesforce tables.

Kafka_customer_event
Kafka_team_member_event
SF_account
SF_user
SF_contact

The SF_contact table contains additional attributes for both accounts and users. The SF_account table holds the key for the Kafka_customer, and likewise the SF_user table holds the key for the Kafka team_member. Between the customer and account, there should be a 1:1, but there are records in Kafka that aren’t in Salesforce, and likewise records in Salesforce that don’t exist in Kafka. Same on the user/team member side.

Any transaction records in Kafka refer to the Kafka id, and any records in SF use the SF id.

We have been forbidden from holding PII within our warehouse - no exceptions - so we cannot use the customer name as the business key.

So how should I go about modelling this in the raw layer?

Is this a customer hub and a team member hub, with satellites for each source table and same-as links for the two hubs to link the Kafka and SF records? This would suggest that the contact satellite would be used in part for the customer hub and in part for the team member hub.

Is it a customer, team member and contact hub, with links between customer–contact and team_member–contact?

Is it a hub for all 5 tables?

I’m leaning towards the second option, but keep second guessing myself.

Either spend the time to read a book (https://www.amazon.com/dp/B08KSSKFMZ); there’s much theory to understand.

Or go on a DV course, like one through DVA.

Thank you.

I do have the book. I’ve read it cover to cover once, and I’ve spot read multiple times. But still confused by this :-/

Then this is a strange question given that this is covered in the book.

For more assistance, these blogs extend the book content: https://patrickcuba.medium.com/

Hello Andy!
Welcome to the forum. It is hard to get a clear view of what the streams and tables contain. Can you describe all the kinds of keys that exist in the tables and the streams, both PKs and FKs? It will be easier to give you advice then.
Kind regards

Interesting scenario.

My first inclination is to ask, “What does the business conceptual model look like?” In other words, let’s take a look at the business model in this scenario and see what we can infer from it. We can then work through that model to determine how to model the raw vault accordingly.

I certainly don’t know the history behind your questions, so you may have already gone through this exercise. If so, could you present that here as well?

I say this because, in my experience, modeling a raw vault by starting with source system structures can lead to creating a source system vault, which vastly hinders the true power of a data vault.

Thank you for the follow up.

I’m unsure that the book gives a definitive answer on this, hence my question to this group.

Kafka_customer_event is supposed to be the driving event for all customers. We can’t use the customer name as a business key, as it contains PII which we are prohibited from holding in the data warehouse. As such, we need to use the system key, which is a UUID, and is named appropriately as customer_id.

SF_account is supposed to be a downstream recipient of customers from the Kafka event. As such, all accounts generated in SF should contain the Kafka UUID in the dataset. Unfortunately, due to historical mergers, bad business practices, etc., there are customers in SF that don’t exist in Kafka. Again, we can’t use the customer name, so we need to use the system key, which is alpha-numeric. It is named simply id, but the business knows it as the account_id.

The book suggests that you can either have these in one hub (customer_hub), with a SAL or you can keep them in separate hubs (customer_hub & account_hub), with a standard link. I’m leaning towards a single hub, as there would be no overlap in keys.
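To make the single-hub-plus-SAL option concrete, here is a minimal Python sketch. All ids and column names are illustrative (not real keys from your systems), and md5 stands in for whatever hashing standard your platform uses: both key sets land in one customer hub, and the same-as link records that a Kafka UUID and an SF account id identify the same customer.

```python
import hashlib

def hash_key(*parts: str) -> str:
    # DV2.0-style surrogate key: delimited, trimmed, upper-cased business
    # keys run through a deterministic hash (md5 here for brevity)
    return hashlib.md5("||".join(p.strip().upper() for p in parts).encode()).hexdigest()

# Illustrative ids only -- not real keys from either system
KAFKA_ID = "9f1c2d34-aaaa-bbbb-cccc-1234567890ab"   # Kafka customer UUID
SF_ID = "001XX000003DHPh"                           # SF account id (alpha-numeric)

# One customer hub holds both key sets; since the formats never overlap,
# the keys cannot clash
hub_customer = {hash_key(KAFKA_ID): KAFKA_ID, hash_key(SF_ID): SF_ID}

# The same-as link (SAL) then asserts the two keys are the same customer,
# with the Kafka record treated as the driving (master) side
same_as_link = [{
    "master_hk": hash_key(KAFKA_ID),
    "duplicate_hk": hash_key(SF_ID),
    "record_source": "SF_account",
}]
```

With separate hubs and a standard link, the rows above would simply split into hub_customer and hub_account, and the SAL would become an ordinary link table.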

The next step is to look at the SF_contact table. This has its own id (contact_id), and links back to both the account_id and the user_id (as a contact could be a customer or could be a team member). So my challenge is in understanding whether this is a contact_hub & contact_satellite with links to the customer_hub and team_member_hub, or whether the table is split by row, and there is a customer_contact_satellite linked to the customer_hub along with a team_member_contact_satellite linked to the team_member_hub.

AHenning, to answer your specific question:
Kafka Customer: PK is UUID, no FK
Salesforce Account: PK is a string, FK to Kafka Customer, FK to Salesforce User (multiple fields - owner, last updated by, created by)
Salesforce Contact: PK is a string, FK to Salesforce account_id, FK to Salesforce User (multiple fields - owner, last updated by, created by, salesforce_user). Either the account_id or salesforce_user id will be populated, not both. This defines whether the contact is a customer or a team member.
Salesforce User: PK is a string, FK to manager_id (links back to the PK), FK to contact_id, FK to Kafka team_member ID.
Kafka Team Member: PK is a UUID. No FK.


You should NEVER use the customer name! It is not a business key and not reliable as a unique value anyway! UUID should be fine here, as long as it is guaranteed to be unique.

Sounds like you have tech debt you need to sort out before getting into vault, otherwise the vault becomes a repository of tech debt.

If, semantically, account and customer are different, then they should be in separate hubs; otherwise they should share the same hub. Again, this is covered in the book.

You really should not be skipping the conceptual modelling step before data modelling; that is where terms like these should be ironed out for you. Again, covered in the book: the conceptual model is a representation of the business architecture/ontology. If you’re not in a position to know which hub a piece of business data should load to, you should seek business help (a business analyst or source system SME) to determine this; otherwise it’s your head on the block. This is covered in the book when discussing Mob Modelling.


Hello Andy!
Disclaimer: blabla ask the business, not source vault blabla. My answer is as if the business model is the same as the model that you presented.

  1. The stuff from Kafka are events. My advice is to model them in an NHL (non-historized link). Check out NHLs.

  2. The customer_id (you have no other choice; it’s a UUID) is a hub in the NHL for kafka_customer_event.

  3. The team_member_id is a hub in the NHL for kafka_team_member_event.

  4. Having 2 or more business keys in a hub is (in my opinion) bad practice. That is because if you have descriptive data for one of the business keys, then you need to construct another hub with only that business key, and you end up having the same business key in different hubs. So customer_hub and account_hub with a standard link is my suggestion.

  5. For salesforce_account: hub_account, hub_customer, hub_user. Link_Account(hub_account, hub_customer, hub_user). Sat_account(hub_account, descriptive fields), Sat_account_rts(Link_account).

  6. For salesforce_contact: hub_contact, hub_account, hub_user, Link_Contact(hub_contact, hub_account, hub_user), Sat_Contact(hub_contact, descriptive fields), Sat_contact_rts(Link_Contact).

  7. For salesforce user: H_User, H_User from managerid, H_Contact, H_customer. Link_User(h_user, h_user_manager, h_contact, h_customer), S_User_RTS(Link_User).

  8. Please note that this model enables you to have 5 autonomous processes for loading the source data.

  9. It was not clear to me whether you had repeatable sub-structures in salesforce_contact. I have modeled it without repeatable sub-structures in my suggestion.

  10. Hope there are not too many typos here. Please come back if anything is unclear. Good luck!
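If it helps, the autonomy in point 8 can be sketched in Python: each source’s loader derives its hubs and links only from its own rows, insert-only and idempotent, so the five processes never need to coordinate. All column names and ids below are illustrative, and sets stand in for hub/link tables.

```python
import hashlib

def hk(*parts: str) -> str:
    # Deterministic hash key over the (possibly composite) business key
    return hashlib.md5("||".join(p.strip().upper() for p in parts).encode()).hexdigest()

def load_salesforce_contact(rows, hub_contact, hub_account, hub_user, link_contact):
    # Insert-only and idempotent: re-running this loader, or running it in
    # parallel with the other four, cannot corrupt the targets. Only one of
    # account_id / user_id is populated per row (as described above), so the
    # missing side falls back to a zero-key placeholder.
    ZERO = "-1"
    for row in rows:
        hub_contact.add(hk(row["contact_id"]))
        if row.get("account_id"):
            hub_account.add(hk(row["account_id"]))
        if row.get("user_id"):
            hub_user.add(hk(row["user_id"]))
        link_contact.add(hk(row["contact_id"],
                            row.get("account_id") or ZERO,
                            row.get("user_id") or ZERO))

hub_contact, hub_account, hub_user, link_contact = set(), set(), set(), set()
rows = [{"contact_id": "003A1", "account_id": "001B2", "user_id": None}]
load_salesforce_contact(rows, hub_contact, hub_account, hub_user, link_contact)
load_salesforce_contact(rows, hub_contact, hub_account, hub_user, link_contact)  # no-op rerun
```

The salesforce_account, salesforce_user and the two Kafka loaders would follow the same pattern against their own hub/link targets.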

Ok, so I think I might have been looking down the wrong path here…

I had assumed that what went into raw was basically a table-for-table copy of what was in the source operational systems. However, having taken a look at the dbtvault examples, it appears that there are joins taking place as part of the process to bring the data into stg. This was a step that either wasn’t explained in the book, or that I missed.
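For what it’s worth, a dbtvault-style stage mostly layers derived columns on top of the raw copy (hash keys, a hashdiff for satellite change detection, load timestamp, record source) rather than doing business joins. A rough Python sketch of the idea, with illustrative column names (not dbtvault’s actual macro output):

```python
import hashlib
from datetime import datetime, timezone

def md5_of(value: str) -> str:
    return hashlib.md5(value.encode()).hexdigest()

def stage_sf_contact(raw: dict) -> dict:
    # Raw stays a table-for-table copy; staging just adds derived columns
    staged = dict(raw)
    staged["contact_hk"] = md5_of(raw["contact_id"].strip().upper())
    # hashdiff over the sorted descriptive payload, so a satellite load can
    # detect change with a single column comparison
    payload = "||".join(f"{k}={str(raw[k]).upper()}"
                        for k in sorted(raw) if k != "contact_id")
    staged["hashdiff"] = md5_of(payload)
    staged["load_dts"] = datetime.now(timezone.utc).isoformat()
    staged["record_source"] = "SF_contact"
    return staged

staged = stage_sf_contact({"contact_id": "003A1", "mailing_city": "Leeds"})
```

So the raw layer itself stays source-shaped; the staging view is where the vault-specific columns are computed before hubs, links and satellites are loaded.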

Anyway, I’ve bought @patrickcuba’s book, and hopefully that will help shine more light on things. I can afford $100 for a book. I can’t afford $5,000 for a 3 day course :frowning:

It depends on the DV flavour you follow: DV2.0 has the RV modelled; Ensemble does not.

Scroll to the bottom of this to see a depiction: A Data Vault Analogy - Snowflake - Medium

Thank you. Quick reply!

The intent is to follow DV2.0.