Can I restructure source data before loading it into the raw vault?

Am I taking the guidance that raw vault data should be unmodified by business logic too literally by assuming that it also applies to not restructuring source data based on our understanding of it?

Could alternatives require quite extensive restructuring, generating the individual hub/sat/link structures from multiple sources, even from a single system?

(This is one of the questions asked by one of the members during our last meetup, "5 most common challenges with Data Vault modelling", on 12th Jan 2022.)

I’m going to assume I understand the ask here by attempting to describe a use case.

Let’s say a single source system contains a table for Suppliers and a table for Customers.
Both tables contain primary contact details (say name, email, phone).

It could potentially make sense to model:

  • “Contacts” as one business entity/hub (possibly with a supplier/customer attribute)
  • “Suppliers” as another (containing just company attributes)
  • “Customers” as a third (containing just company attributes)

with a contact_supplier link and a contact_customer link.
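A minimal sketch of how those structures could look, purely to make the shape concrete (table names, key choices and data types are my own illustrative assumptions, not prescriptive):

```sql
-- Illustrative only: hypothetical hub and link structures for the example above.
CREATE TABLE hub_contact (
    hk_contact     CHAR(32)     NOT NULL PRIMARY KEY, -- hash of the contact business key
    contact_email  VARCHAR(320) NOT NULL,             -- assumed business key for a contact
    load_dts       TIMESTAMP    NOT NULL,
    record_source  VARCHAR(50)  NOT NULL
);

CREATE TABLE hub_supplier (
    hk_supplier    CHAR(32)     NOT NULL PRIMARY KEY,
    supplier_id    VARCHAR(50)  NOT NULL,             -- assumed supplier business key
    load_dts       TIMESTAMP    NOT NULL,
    record_source  VARCHAR(50)  NOT NULL
);

-- Relationship between a contact and the supplier it belongs to.
CREATE TABLE link_contact_supplier (
    hk_contact_supplier CHAR(32)    NOT NULL PRIMARY KEY,
    hk_contact          CHAR(32)    NOT NULL REFERENCES hub_contact (hk_contact),
    hk_supplier         CHAR(32)    NOT NULL REFERENCES hub_supplier (hk_supplier),
    load_dts            TIMESTAMP   NOT NULL,
    record_source       VARCHAR(50) NOT NULL
);

-- hub_customer and link_contact_customer would mirror the above, and the
-- name/email/phone attributes would sit in satellites on hub_contact.
```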

Is this what you mean by potential restructuring? If so, in my opinion it can make sense: as contacts move around, it may make sense to split these contact attributes away from the source system table objects. We did something similar in an aviation vault, where we didn’t want to distinguish between people on a particular flight (the source systems separated passenger manifests from crew manifests; to our business users, however, they were all “humans on a plane” with crew or passenger satellite attributes)…

Much depends on how your business “sees” entities, so design with that in mind. In my opinion, the use case/implementation above is perfectly valid: the raw data is not altered, but it may end up structured more efficiently… Hope this helps.


Yes, the RV looks like the source. Conformance starts to happen once you have identified the hub and link tables; all other content is unmanipulated. This allows the solution to scale: if new columns are added or renamed, you simply add or change them in the raw vault satellite.

Hub conformance = a business key column that aligns to your naming standard. For example, one source could have accno as the account id field and another could have account_number; account_id is how you identify the business object throughout DV. You then stick to a naming standard for hub business key columns, e.g. customer_id, product_id, etc.
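As a hedged illustration of that conformance (the table and column names here are invented, not from the post), a staging view per source can simply rename the key to the agreed standard:

```sql
-- Illustrative only: conform each source's business key column name to account_id.
CREATE VIEW stg_crm_a_account AS
SELECT
    accno          AS account_id,   -- source A exposes the key as "accno"
    account_name,
    load_dts,
    'CRM_A'        AS record_source
FROM crm_a.account;

CREATE VIEW stg_crm_b_account AS
SELECT
    account_number AS account_id,   -- source B exposes it as "account_number"
    account_name,
    load_dts,
    'CRM_B'        AS record_source
FROM crm_b.account;
```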

When you say “unmodified by business logic”, think about where that fits in: the RV is the outcome of the business logic from the source. After all, the reason you purchased or developed that software is to automate the business processes that support your business. BV comes into play as the recorded outcome of business logic within DV; BV can be built from RV and other BV artefacts, all of which are recorded outcomes of business logic. That is why we say BV is sparsely modelled!

Alternatives lead to tech debt.
Conforming RV attribute columns = what if a new column added from source now has the same column name as what you have conformed to? Worse, what if the column being pulled has a totally different definition! Oh my!


Are you taking the guidance too literally? Absolutely not. The guidance is clear, don’t restructure the source. As Patrick mentions, there are a whole host of reasons for this time-tested guidance. Once the data is in the vault, you can restructure it to your heart’s content.

I’m not sure how a ‘single system’ can also be ‘multiple sources’. Maybe that’s the imperfection of language again. I’m going to continue under the assumption that what is meant is: my ‘single system’ has multiple tables in it, and each one of those is a ‘source’ that is being directly loaded, rather than the system giving us a curated data output.

This particular situation leads to an annoying scenario where the incoming data doesn’t necessarily have the appropriate business identifiers to load into the vault. I have encountered that scenario, and developed a pattern to deal with it using Data Vault principles, but it 100% isn’t data vault standard - in fact, it’s very close to a data vault anti-pattern, the dreaded ‘Source System Vault’.

A core conceit of data vault methodology is that your incoming data will have the appropriate business identifiers. If your source (for whatever reason) can’t give you those, something somewhere needs to get them together before the data goes into the data vault. For me that meant the data warehouse server and process (in an ‘error vault’ of the data vault) providing capability to associate the data to the identifier, which the source couldn’t / wouldn’t do. This essentially is re-engineering a capability that the source system should have, and should have provided you, based on your (potentially flawed) understanding of the source system and /or the rules governing it.

The big take-away is: your data vault should not restructure incoming data. If at all possible, you need to make sure the incoming data is processed prior to export by the source system to have an appropriate identifier within the exported dataset. If that doesn’t happen, the data isn’t suitable for data vault. End.

The situation might arise that you as a data engineer then get told ‘it’s your problem now’. At that point you need a solution that isn’t out-of-the-book data vault.

Read more about the Normalised Source Pattern (link) as I implemented it.

Here’s an example, with a follow-up question at the end that relates to this topic of restructuring data.

In my world I receive a Salesforce Account object, which is denormalized to the point of having over 300 columns. Within it the users have captured things like contacts and contract data. Just. Don’t. Ask. sighs sadly For example, the contract data includes a Contract Number and various other fields like date signed and so on. When a new Contract is created the fields, including the Contract Number, are overwritten.

To add to the complexity I have a second CRM, using MS Dynamics, providing Account and Contract as separate normalized sources.

Some of those Accounts across the two CRM systems refer to the same business entity, but they do not share a business key. I am getting users to identify the links between the Accounts because it’s obvious and because it creates massive business value for things like cross-sell, single invoicing etc, and business value is ultimately what matters more than anything else we do. Avoiding a rebuild of your data model through using data vault is good because it provides business value through a long-lasting data model, not just because its inherent structure is pleasing to the eye.

In looking at my approach I could:

  1. create a “Source System Vault” with hubs in the RV that have a similar name but are separated by source, e.g. h_account_salesforce + h_account_dynamics + h_contract_dynamics,

or

  2. logically split the Account in Salesforce into Account and Contract during staging, so I end up with h_account and h_contract, each of which has a single semantic meaning and grain, and each of which has sats for the different source systems.

After a chat in the Slack channel I have chosen #2 because:

  • an un-restructured “source system vault” for hubs without any other analysis feels like such an anti-pattern
  • the Contract Number is clearly a business key in its business intent and use, and should therefore be in a separate hub, which it shares with a different, semantically identical source
  • if anybody asked I could easily recreate the source data for any date (or all dates) by joining the two sats for Salesforce back together in my info mart
  • as Dan Linstedt states, “hard business rules never change the meaning of incoming data, only the way the data is stored”. This is achieved.
  • In the BV I’ll be able to add a same-as link table between the records sitting in h_account once the users provide me with a source for that information.
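For the last point, a rough sketch of what that same-as link could look like once the user-provided mapping exists (the structure and names are assumptions on my part):

```sql
-- Illustrative only: a BV same-as link pairing h_account members that the
-- business has identified as the same real-world entity across the two CRMs.
CREATE TABLE l_account_same_as (
    hk_account_same_as   CHAR(32)    NOT NULL PRIMARY KEY,
    hk_account_master    CHAR(32)    NOT NULL REFERENCES h_account (hk_account),
    hk_account_duplicate CHAR(32)    NOT NULL REFERENCES h_account (hk_account),
    load_dts             TIMESTAMP   NOT NULL,
    record_source        VARCHAR(50) NOT NULL  -- e.g. the user-curated mapping source
);
```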

So, my question then.

When Will, in his comment above, says

The guidance is clear, don’t restructure the source

I’d suggest that we need a little more nuance, at least with regard to one scenario. Dan Linstedt includes a specific example of denormalized XML data in section 11.9 and is clear that denormalized sources should be normalized before loading into the staging area, in advance of the data vault transforms. Dan’s example actually decomposes the Passenger Name into a separate hub… which feels like going too far, since a name (first+last) is really not very unique!

So, if we can agree that Data Vault allows for some restructuring into hubs, I’m interested in having a discussion on handling denormalized data and the extent to which it should be restructured. If I can be so bold as to paraphrase Will, the guidance is clear, you should restructure the source in some cases, but what are the caveats on this? Is it as simple as the identification of multiple business keys within a single source? Or is it messier than that?

I’m also in the position of having denormalised sources sometimes. I’m in full agreement that nuance is important, so I’ve drawn a small diagram to show which stage I’m talking about.


In my previous post, I am referring to (1). Restructuring here leads to a whole host of auditability issues, as discussed.
At stage (2), restructuring of the data should be kept to as little as possible, as determined by hard business rules: things like combining fields to create business keys, or interrogating XML fields for specific data, as you and Dan allude to. Logical restructuring (as seen in the diagram, splitting ‘Account’ and ‘Contract’ data) absolutely should be done at this stage, so that the loads at stage (3) can proceed.

Second diagram, showing a potential ‘restructuring’ (deriving data from an XML message) at stage (2):

This restructuring could be handled in a more orthodox (but certainly higher-effort) manner by loading the XML message, then using a business rule view to extract the data, and then reloading the derived data using the business rule view as a source.
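To make that concrete, a minimal Snowflake-flavoured sketch of a stage (2) hard rule that derives a business key from a landed XML message (the table, column and tag names are hypothetical):

```sql
-- Illustrative only: derive the business key from the raw XML at staging
-- without otherwise altering the payload.
CREATE VIEW stg_order_message AS
SELECT
    XMLGET(PARSE_XML(xml_payload), 'OrderNumber'):"$"::STRING AS order_number, -- derived key
    xml_payload,                                                               -- message kept as-is
    load_dts,
    record_source
FROM landing.order_message;
```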

Load the RV as-is, but I have seen niche data products requested by the business where the source is unnormalised files rather than one file per source table, so I’m guessing your question relates to this scenario? It’s easy to see the appeal of expanding one file into a couple of tables in staging to improve RV load performance, among other benefits.

I wonder if I hit your question well :pray::sparkles::blush:

Good stuff Will.

Yes, I’d never touch the original data between source and landing (#1) beyond what dbt snapshots do (which is to add more columns).

I think my divergence is to use staging (step #2) to split into enterprise model aligned, semantically distinct sets of data to allow the creation of hubs/links/sats to go more smoothly. I consider this to be part of my hard business rules. Staging is all based on views so if I need to rejig anything it doesn’t take long.
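A sketch of what those staging views might look like for the Salesforce case discussed above (assuming dbt snapshot metadata columns; all other object and column names are placeholders):

```sql
-- Illustrative only: one wide, snapshotted Salesforce Account object split into
-- two semantically distinct staging views; no values are altered.
CREATE VIEW stg_sf_account AS
SELECT
    account_id,                     -- Account business key
    account_name,
    billing_country,
    dbt_valid_from  AS load_dts,    -- from the dbt snapshot
    'SALESFORCE'    AS record_source
FROM snapshots.sf_account_snapshot;

CREATE VIEW stg_sf_contract AS
SELECT
    contract_number,                -- Contract business key (feeds h_contract)
    account_id,                     -- retained for the account-contract link
    contract_signed_date,
    dbt_valid_from  AS load_dts,
    'SALESFORCE'    AS record_source
FROM snapshots.sf_account_snapshot
WHERE contract_number IS NOT NULL;  -- only rows that actually carry contract data
```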

Looks ok

Would you consider Applied_Date, Record_Source, Task_ID, JIRA_ID?

Would you impose Load Date on Landing - this looks more like the Applied Date - and not in Staging?

Would you also consider BKCC?

Is it worth breaking out the parts into more tables to join? Does the Hub still reflect the Business Architecture (a la Business Capability / Business Object)? The more tables you break out, the more you need to join; it needs a balance. I find that sticking to the BA reduces the need for more tables.

As for semi-structured content, Snowflake has the VARIANT data type and I believe Redshift introduced the SUPER data type to store semi-structured content as-is. That eliminates the need to pre-parse RAW content.
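For example (a hypothetical, Snowflake-flavoured sketch), the landed content can sit untouched in a VARIANT column and be parsed later by a view:

```sql
-- Illustrative only: store semi-structured content as-is and defer parsing.
CREATE TABLE landing.source_message_raw (
    payload        VARIANT     NOT NULL,  -- raw document exactly as received
    load_dts       TIMESTAMP   NOT NULL,
    record_source  VARCHAR(50) NOT NULL
);

CREATE VIEW stg_source_message AS
SELECT
    payload:documentId::STRING AS document_id,  -- extracted only when a rule needs it
    payload,
    load_dts,
    record_source
FROM landing.source_message_raw;
```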

Just some thoughts.

All good questions :slight_smile:
These were only really meant to be indicative diagrams to give us a point of reference in the discussion. They certainly aren’t fully standards-compliant, nor do they contain all the fields they could.

Agreed that breaking apart the tables needs to be done sensitively with regard to the business entity / concept model, otherwise you end up with a whole host of additional tables as you say.

Storing the semi-structured data and parsing it later in the data flow would be generically the ‘better plan’ (as at that point the data can be evaluated vs various business soft rules). My intent was to show that specific appropriate data (such as a business key) could be extracted at this stage as part of an ‘acceptable’ restructuring.


Hi all,

I’m adding to this with another nuance, rather than spinning up a new topic.

We know hubs need to contain semantically similar concepts.

I have a data source from an enterprise app called Teamcenter. Data is exported in XML format and, as the XML contains nested repeating groups of data, it needs to be normalised during staging into separate sets of data. This seems pretty normal, and it is relatively easy to see how these relationships drive the unit of work and the definition of hubs, links etc, for the most part. The driving concept is the “Problem”, and we can create a hub and sat for it. Also, the unit of work allows for a link to be defined containing a weak key (a dependent child key holding a “revision number”), which we join to other hubs.

One of the nested groups is context sensitive and only ever contains at most one record per “problem”. Let’s call this nested group “implementation”. From a business perspective “implementation” is pretty meaningless; it is just how the standard app is built. It is the type of implementation that matters, and there are two implementation types, driven by an “object_type” field, namely “waiver” and “deviation”.

The question is whether i should:
a) restructure the raw “implementation” data so that I create hubs/satellites separately for the “deviation” and “waiver”, joined to the driving “problem revision” link table. No new data is created or derived - it is just where the incoming raw data is placed. There is no “implementation” hub in this option

b) load the raw data, as is, in to an “implementation” hub, joined to the link table driving the unit of work for the “problem revision”. Then create new hubs/sats for the Waiver and Deviation concepts as business vault objects.

c) do as b) but just create the concepts of a “waiver” and a “deviation” as part of the definition of the mart dimension views; as there is no complex logic, two views could be created against the raw vault “implementation” objects.

Thoughts?


This is an interesting topic. One of the DV books I have talks about it: if you have a denormalized dataset and it contains multiple BKs, I believe you can break them out into separate hubs. They all still tie back to the same source record.


Hey John,

Do you have a visual of the options you’re proposing?


Hi Patrick,
Nice stencils! Partial options shown for the “implementation” aspect of the model, excluding item, which I now think complicates matters now that I have thought about it more.

The data source has a hierarchy of concepts:
problem --< revision --< item (1:m)
|___implementation (app logic enforces 1:1 with revision)

As no “links off links” are allowed, I am wondering if I am better off modelling the “problem revision” as a hub (with a composite business key containing the “problem id” and “revision id”) instead of a link with a dependent child key, and either ditching the “problem” hub or having separate hubs for “problem” and “problem revision” with separate links to “item” and to however implementation gets modelled.

an attempt to deal with the hierarchy…
Problem --< Revision --< item
|_____ Implementation

No dependent child key in the revision link in this approach…
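In case it helps the discussion, a rough sketch of the “problem revision as a hub” option (the key choice and names are my assumptions):

```sql
-- Illustrative only: a hub at "problem revision" grain with a composite
-- business key, instead of a link carrying a dependent child key.
CREATE TABLE h_problem_revision (
    hk_problem_revision CHAR(32)    NOT NULL PRIMARY KEY, -- hash of (problem_id, revision_id)
    problem_id          VARCHAR(50) NOT NULL,
    revision_id         VARCHAR(10) NOT NULL,
    load_dts            TIMESTAMP   NOT NULL,
    record_source       VARCHAR(50) NOT NULL
);

-- Links to item (and to however "implementation" ends up modelled) then hang
-- off this hub at revision grain, with no dependent child key on the link.
```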

Yea ok I see the choices.

We essentially never have BV hubs, so I’ll break down how I think this sort of problem should be approached.

1st Prize: the source provides the data split out and you naturally load the hubs + links as you have defined in Option 1.
2nd Prize: you need to implement the “splitting out” in ETL; include the splitting in the stage (see the sketch below). Build the error traps you need and the rejection SLOs to keep the source in check.
3rd Prize: Option 3 makes the overall implementation more complex and you’re forced to mangle the model to suit technical debt. 3rd Prize should really be called the Wooden Spoon option! No one wants this!

1st prize is the best as it reduces the time from business event to analytical value.
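A minimal sketch of that 2nd-prize splitting in the stage, assuming a staged “implementation” set and using purely illustrative names:

```sql
-- Illustrative only: split the generic "implementation" rows into the two
-- business concepts, driven by the source object_type value; nothing is derived.
CREATE VIEW stg_waiver AS
SELECT * FROM stg_implementation WHERE object_type = 'waiver';

CREATE VIEW stg_deviation AS
SELECT * FROM stg_implementation WHERE object_type = 'deviation';

-- h_waiver and h_deviation (plus their sats and links back to the problem
-- revision structure) then load from these views.
```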

I am sceptical of combining bkeys in a hub of business objects that have different semantic meaning

I hope I am following your depiction accurately!

The XML source data can be normalised and loaded into multiple staging tables via the ETL/ELT pipeline.

I’m still a little bit confused over how to model the hierarchy of:
Problem -1:m- Problem Revision -1:m- Item Revision

Should I split out the concepts of the problem and item from their revision counterparts, or just kill the top-grain hubs and only keep hubs at the revision grain? Nearly all of the attributes of interest are held against the revision level of grain anyway, leaving not much more than a stub hub for each of the problem and item. However, separate hubs will simplify downstream consumption where analysis against the overall problem or item is required. If it wasn’t for the fact that a problem revision contains many related items (for a specific revision), I would have modelled the revision id as a link dependent child key.

Another option is to model the hub for just the main Problem and model the satellite with a dependent child key that holds the revision_id. However, I think that would not work, as a specific revision of the Problem associates with a specific revision of the Item.

I feel like I’m going around in a circle rather than making a quick decision and moving on!

I’m thinking about separate guidance I have seen from @patrickcuba regarding the use of different hubs for different grains of concepts. Do you agree that the concept of a Problem and a Problem Revision is an example of super/subtyping (99% of the attributes are shared) and so aligns with the guidance that this represents a different grain of the same concept and should be modelled as separate hubs? The business users usually refer to the Problem and Revision together when referring to the concept, e.g. problem 12345-2 (where 12345 is the problem and 2 is the revision).

It seems different to that of say a PO and a PO line.

Yes… I agree, it looks like a super-type/sub-type relationship and therefore separate hubs; even if the attributes of interest are at the sub-type, the stub hub would be necessary.
