Hi all,
I’ve been doing more reading. I think my confusion stems from how little information exists on modeling your data before feeding it into a data vault. With that in mind, I have a few questions that will hopefully clear things up for me.
For the context of the questions, I’m going to use Salesforce data, as it’s generally well known and referenced multiple times in The Data Vault Guru book (I’m up to chapter 8, so if this is answered later, or if I missed it, please let me know).
I had been under the (mis?)-understanding that the data coming into your raw vault (hubs, links, satellites) should be exactly as per the source system, which is to say it shouldn’t be modified at all. However, I’m getting the impression that this is not the case. Is there any literature out there on the approach to take here? How should the staging data be modeled? Is there a rule to follow?
Salesforce examples:
Every table has a link to the record_type table. This defines additional attributes that may be useful for the business, for example the brand related to an application. The record type doesn’t constitute a hub, but it is important information for the business. Is it appropriate to join this table to create a view over multiple tables that is used to source your raw vault, in this example a satellite?
To illustrate, would it be appropriate to use the following as a staging ‘table’ to be ingested into the raw vault:
SELECT
    opp.id,
    opp.name AS opportunity_name,
    opp.status,
    COALESCE(rt.name, '') AS opportunity_record_type_name
FROM opportunity AS opp
LEFT JOIN record_type AS rt
    ON opp.record_type_id = rt.record_type_id
In the above example, the opp.id would become the business key in my application hub, and the opportunity_name, status & opportunity_record_type_name would become attributes in my satellite.
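To make the question concrete, here is roughly how I imagined the staged view looking once the data vault housekeeping columns are added. The hash function (MD5), the delimiter, and the column names (hub_application_hash_key, application_hashdiff, record_source, load_datetime) are just my placeholders, not anything prescribed by the book:

```sql
-- Sketch of a staging view with assumed hash/metadata columns.
SELECT
    opp.id,
    -- hub hash key computed from the business key
    MD5(UPPER(TRIM(opp.id)))                AS hub_application_hash_key,
    opp.name                                AS opportunity_name,
    opp.status,
    COALESCE(rt.name, '')                   AS opportunity_record_type_name,
    -- hashdiff over the satellite attributes, '||' as delimiter
    MD5(COALESCE(opp.name, '') || '||' ||
        COALESCE(opp.status, '') || '||' ||
        COALESCE(rt.name, ''))              AS application_hashdiff,
    'SF-OPP'                                AS record_source,
    CURRENT_TIMESTAMP                       AS load_datetime
FROM opportunity AS opp
LEFT JOIN record_type AS rt
    ON opp.record_type_id = rt.record_type_id;
```

If joining in the staging view like this is wrong, I’d expect the alternative is staging opportunity and record_type separately and resolving the relationship downstream, but that is exactly what I’m unsure about.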
If this isn’t appropriate, how would you proceed? I do note that you aren’t supposed to change the names of the fields as part of the ingestion; however, with a join like this you would be forced to, or else end up with two columns both called ‘name’.
My second question is on the topic of naming conventions, again using Salesforce. Every customer, opportunity, appointment etc. table contains a created_by and a modified_by field. These fields reference the user table, which holds what our business defines as team members, hence the team_member_hub. We would want to load both of these fields into the team_member_hub to ensure a complete set of data in this hub.
So when creating the hash values in the source, would you use a naming convention such as team_member_created_by_hash and team_member_modified_by_hash, or would you name these hash values differently? Similarly, assuming loading from the Salesforce Opportunity table, when loading these records into the team_member_hub, would you reference the field in the record source - SF-OPP-CREATED_BY - or just the table - SF-OPP? I’m assuming the former, but hoping to get a confirmation.
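For reference, this is what I have in mind for the staging query for that table. Again, MD5, the column names, and the record-source strings are placeholders of mine, not established conventions I’ve seen documented:

```sql
-- Sketch: two hash columns from one table, both targeting team_member_hub.
SELECT
    opp.id,
    MD5(UPPER(TRIM(opp.created_by)))  AS team_member_created_by_hash,
    MD5(UPPER(TRIM(opp.modified_by))) AS team_member_modified_by_hash,
    -- field-level record sources, assuming the former approach is correct
    'SF-OPP-CREATED_BY'               AS created_by_record_source,
    'SF-OPP-MODIFIED_BY'              AS modified_by_record_source
FROM opportunity AS opp;
```

The part I’m unsure about is whether both hash columns really belong in the one staging view, or whether each hub load should get its own.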
PS: kudos to @patrickcuba for his book. I’m not finished reading it yet, but it’s answered some questions for me that I was left with after having read Dan Linstedt’s book cover to cover.