Source Tables - joining transactional tables or not?

Hi all,

I’ve been doing more reading. I think my confusion stems from how little information exists on how to model your data before feeding it into a data vault. With that in mind, I’d like to ask a few questions that I’m hoping will clear things up for me.

For the context of the questions, I’m going to use Salesforce data, as it’s generally well known and referenced multiple times in The Data Vault Guru book (I’m up to chapter 8, so if this is answered later, or if I missed it, please let me know).

I had been under the (mis?)understanding that the data coming into your raw vault (hubs, links, satellites) should be exactly as per the source system, which is to say it shouldn’t be modified at all. However, I’m getting the impression that this is not the case. Is there any literature out there on the approach to take here? How should the staging data be modeled? Is there a rule to follow?

Salesforce examples:

Every table has a link to the record_type table. This defines additional attributes that may be useful for the business. Maybe the brand related to an application. The record type doesn’t constitute a hub, but it is important information for the business. Is it appropriate to join this table to create a view of multiple tables that is used to source your raw vault, and hence in this example a satellite?

By this, I mean the following… is it appropriate to use something like this as a staging ‘table’ to be ingested into the raw vault:

SELECT
  opp.id,
  opp.name AS opportunity_name,
  opp.status,
  COALESCE(rt.name,'') AS opportunity_record_type_name
FROM
  opportunity AS opp
  LEFT JOIN record_type AS rt
    ON opp.record_type_id = rt.record_type_id

In the above example, the opp.id would become the business key in my application hub, and the opportunity_name, status & opportunity_record_type_name would become attributes in my satellite.
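
To make my mental model concrete, this is roughly the shape of the hub and satellite loads I’m picturing off that staged view. The names (hub_application, sat_application, stg_opportunity, the MD5 hashing, the 'SF-OPP' record source) are just my own placeholders for this sketch, not anything taken from the book:

-- Hub: one row per new business key (placeholder names, Snowflake-flavoured SQL)
INSERT INTO hub_application (application_hashkey, application_id, load_datetime, record_source)
SELECT DISTINCT
  MD5(UPPER(TRIM(stg.id))),
  stg.id,
  CURRENT_TIMESTAMP(),
  'SF-OPP'
FROM
  stg_opportunity AS stg
WHERE NOT EXISTS (
  SELECT 1
  FROM hub_application AS h
  WHERE h.application_hashkey = MD5(UPPER(TRIM(stg.id)))
);

-- Satellite: descriptive attributes hanging off that key
-- (hash-diff / delta detection omitted for brevity)
INSERT INTO sat_application (application_hashkey, load_datetime, record_source,
                             opportunity_name, status, opportunity_record_type_name)
SELECT
  MD5(UPPER(TRIM(stg.id))),
  CURRENT_TIMESTAMP(),
  'SF-OPP',
  stg.opportunity_name,
  stg.status,
  stg.opportunity_record_type_name
FROM
  stg_opportunity AS stg;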

If this isn’t appropriate, how would you proceed? I do note that you aren’t supposed to change the names of the fields as part of the ingestion; however, with a join like this you would be forced to, or you’d end up with two columns both called ‘name’.

My second question is on the topic of naming conventions, again using Salesforce. Every customer, opportunity, appointment etc. table contains a created_by and a modified_by field. These fields relate to the user table, which is what our business defines as team members, hence the team_member_hub. We would want to load both of these into the team_member_hub to ensure a complete set of data in this hub.

So when creating the hash values in the source, would you use a naming convention such as team_member_created_by_hash and team_member_modified_by_hash, or would you name these hash values differently? Similarly, assuming loading from the Salesforce Opportunity table, when loading these records into the team_member_hub, would you reference the field in the record source - SF-OPP-CREATED_BY - or just the table - SF-OPP? I’m assuming the former, but hoping to get a confirmation.
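
To make the first part of that concrete, this is roughly what I have in mind in staging (the column aliases, hashing and record source constants are just my placeholders):

SELECT
  opp.id,
  MD5(UPPER(TRIM(opp.created_by)))  AS team_member_created_by_hash,
  MD5(UPPER(TRIM(opp.modified_by))) AS team_member_modified_by_hash,
  'SF-OPP-CREATED_BY'               AS created_by_record_source,   -- or just 'SF-OPP'?
  'SF-OPP-MODIFIED_BY'              AS modified_by_record_source
FROM
  opportunity AS opp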

PS: kudos to @patrickcuba for his book. I’m not finished reading it yet, but it’s answered some questions for me that I was left with after having read Dan Linstedt’s book cover to cover.

Hey thanks Andy,

I just want to clarify: the Raw Vault has no modification of the original data besides applying hard rules (basic SQL functions to clean out leading and trailing spaces and the like). The only column name that is conformed in any way is the business key when loading to the hub table. Apologies if that is not clear in the book.
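
To illustrate what I mean by hard rules, staging should go no further than something like this (the names here are purely illustrative):

SELECT
  UPPER(TRIM(id))     AS opportunity_id,   -- business key conformed for the hub load only
  TRIM(name)          AS name,             -- descriptive attribute names stay as per source
  TRIM(status)        AS status,
  record_type_id,
  CURRENT_TIMESTAMP() AS load_datetime,
  'SF-OPP'            AS record_source
FROM
  opportunity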

Applying naming conventions to attributes happens in the business vault and information mart layers.

Sorry - I didn’t mean to insinuate you had led me astray. I guess it comes down to there being lots of documentation around hubs, links and satellites, but not much leading up to that.

Then there are what appear at face value to be conflicting opinions. As you say, the Raw Vault shouldn’t modify the original data beyond applying hard rules. However, this page suggests that ‘One of the biggest causes of project failure, is designing and mapping a model that mimics the source system too closely.’ Additionally, the example database for dbtvault specifically does join transactional tables in advance of the data flowing into the raw vault.

The example I gave of the Salesforce table was probably a little loaded. I would argue that the brand should be a hub. But the record_type table doesn’t just contain brand; it contains the brand only on the records joined to the opportunity table. If the same record_type table is joined to the appointment table, it provides the type of appointment - new lead, follow-up etc. So creating a hub simply off the record_type table seems like a no-go.

Using record_type again, where the record_type could be considered a hub (such as brand), I’d argue that it is the brand itself (i.e. the record_type name) that would be the hub business key, not the SF PK. The brand (as text) would doubtless appear in countless other areas and would be the brand name in those locations. If we didn’t join the record_type table directly to the opportunity table as presented above, we would still need to join it to pull through the brand name (as opposed to the record_type_id) to create the link table between brand and opportunity.
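
i.e. to stage that relationship I’d still end up with something like this (again just a sketch using my own placeholder names and hashing):

SELECT
  MD5(UPPER(TRIM(opp.id)) || '||' || UPPER(TRIM(rt.name))) AS link_opportunity_brand_hashkey,
  MD5(UPPER(TRIM(opp.id)))                                 AS opportunity_hashkey,
  MD5(UPPER(TRIM(rt.name)))                                AS brand_hashkey,  -- business key is the brand name, not the SF PK
  CURRENT_TIMESTAMP()                                      AS load_datetime,
  'SF-OPP'                                                 AS record_source
FROM
  opportunity AS opp
  LEFT JOIN record_type AS rt
    ON opp.record_type_id = rt.record_type_id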

I’m sure this is all very simple. But with the seemingly conflicting approaches to staging, I feel like I’m going around in circles. I’m stuck at the first step in the journey, even though I feel like I now have a relatively good understanding of steps 2, 3, 4 & 5.

Satellite splitting is one of the most underrated activities in DV. When choosing to model the source attributes in a link-sat or a hub-sat, the modeller needs to profile what those attributes are describing: the relationship (it’s a link-sat) or a business object in the relationship (it’s a hub-sat). Imagine querying for business object details in a link-sat: because the relationship details are probably events or transactions between business objects, you’d have to think about how to present that content uniquely to the user, essentially deduplicating it every time you pull the data out. If it was modelled as a hub-sat already, then you don’t have to.

Now, you would argue that brand should be in a hub — but what does the business say it is? Is there an immutable business key for the brand like a brand id? If you’re in the position to make that decision then by all means do that — if not, make sure you confirm your suspicions with the business or at least someone with “enterprise” in their title.

John Giles focuses a lot on top-down modelling in a data vault; if you’re not doing that, then there is no value in building a data vault. Why Do Top-Down Big-Picture Modeling? Part 2 – TDAN.com

Thank you,

Excellent article. I think it may help explain my predicament.

I’m fortunate that our Enterprise Architect has defined our company’s Unified Domain Model. It has our domain objects (hubs) and the relationships (links) between these: a customer can book an appointment with an associate, and/or they can start an application and invite additional applicants to that same application, etc. etc… It goes further in identifying the system considered the primary source of truth for each domain object. Is it a microservice, or is it Salesforce? It’s all defined for me.

My challenge is in mapping this back to the transactional tables in the respective source systems.

Per the article, Dan is quoted as saying ‘if the Data Vault you have in place today is not currently about the business, then unfortunately you’ve hired the wrong people.’, and further the article notes ‘… bottom-up, source-centric thinking. And it’s a big problem.’.

I don’t want a bottom-up, source-centric solution. The business has defined my hubs. Now I need to align the data in the underlying systems to those hubs. How am I to do that without performing a level of transformation on the underlying sources? When the tables and relationships have been built generically, so as to be leveraged by as many clients as possible, you can all but guarantee those source system tables won’t neatly align to a Data Vault designed for a single specific company.

The record_type table in Salesforce is the perfect example of this. It’s generic enough that any company can use it any way they like. We use it the way that we like, and as a complete table, it certainly can’t be considered a hub containing similar data at consistent grain.

Am I to divide this table up based on filters applied in staging and create a satellite with a single attribute field for each hub? Or should I join it to the relevant table that contains 99% of the other attributes for that hub, and then break that view down into a series of satellites based on change velocity, PII segregation etc., so that this attribute sits cleanly in the same satellite as other low-change-velocity attributes off the main table, such as opportunity type?
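
To put that second option in concrete terms, I’m picturing something like this: one joined staging view, then split into satellites on those criteria (all names here, including the pretend PII column, are placeholders of mine):

CREATE VIEW stg_opportunity AS
SELECT
  opp.id,
  opp.name            AS opportunity_name,
  opp.status,
  opp.opportunity_type,
  opp.applicant_email,                      -- pretend PII field, purely for illustration
  rt.name             AS opportunity_record_type_name
FROM
  opportunity AS opp
  LEFT JOIN record_type AS rt
    ON opp.record_type_id = rt.record_type_id;

-- then split into satellites off that one view:
--   sat_opportunity_slow <- opportunity_type, opportunity_record_type_name (low change velocity)
--   sat_opportunity_fast <- status (high change velocity)
--   sat_opportunity_pii  <- applicant_email (PII, segregated)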

Is this another way of saying “OLTP”? Or CDC?

This is where the concept of SLOs / SLAs comes into the picture (the concept comes from Google SREs), and likewise in Data Mesh “Producer Domains”: data contracts are established by the producer, and the producer is responsible for the content and quality of the data that you need. Some implementations (such as SAP) end up building VIEWs to sanitise the data with such joins in place, but that means you can only get to the data when the OLTP is not terribly busy. Either way, a source-system SME must be involved to help you decipher this!

This looks likely; build an Error mart to capture “unknown” record_types as they happen. It is also an opportunity to exclude from the loads the data that has no business value.
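
Something along these lines (table names illustrative; the error table is assumed to mirror the source plus load metadata):

INSERT INTO err_opportunity_unknown_record_type
SELECT
  opp.*,
  CURRENT_TIMESTAMP() AS load_datetime,
  'SF-OPP'            AS record_source
FROM
  opportunity AS opp
  LEFT JOIN record_type AS rt
    ON opp.record_type_id = rt.record_type_id
WHERE rt.record_type_id IS NULL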

Remember, this is the second-best scenario; the worst case is leaving it up to the users to decipher this across the multitude of tables they would otherwise need to join, to solve something you can easily do before it even gets into the vault.

Hey there,

By transactional tables, I’m referring to the 1:1 copies of the OLTP tables that are landed in Snowflake. These come from FiveTran, Matillion, Segment etc. Actually, to be clear, it is either the 1:1 copies of the OLTP tables or the 1:1 copies after a dbt snapshot, where the source table is mutable.

Unfortunately we don’t have data contracts. The Salesforce Development team do as they will, and we get the output of that. New fields appear without us having any prior warning. I know well enough to pick my fights. For the moment, this isn’t one of them.

So any views built between the landing of data in Snowflake and the loading of that same data into the Data Vault will have no impact on the transactional system. It would be more about arranging the data into logical groupings that align with our unified domain model, while also allowing us to understand what the source system looked like at any given point in time.

BTW: I’m curious, but of course you don’t need to answer. Are you Sydney based?

Yup Down Under

I think this data management mess needs to be highlighted with your leadership team; specifically, having no data contracts in place will lead to potentially undetected damage to your data analytics loads. Unless they take that analytics platform seriously, you’ll be chasing/praying for “normal” daily data loads.
