Linking User Hub to Numerous Other Hubs By CreatedBy/LastModifiedBy

Many tables in many sources have audit columns that includes an id of a user who created and/or last modified the record, with the id linking to a separate table of Users.

  • My data analysts would like this audit data
  • I can create a Hub and Satellite for the User data. Given that we have many sources I’d assume they’d all end up with their own satellites against this hub. Where they use a common key like email address or AD user id we might even have a common business key.

What is the advice on linking this hub to many (probably most) other hubs.

  • Denormalization to replace the user id with a user name in my raw DV satellites seems to be against DV principles
  • I can leave the user id on the satellite but then my data analysts would be linking their queries directly to a sat without going through a Link table.
  • I guess I could leave the id in place on the sat record and then denormalize to create a revised satellite in the business vault?
  • Do I create multiple Link tables, each with an eff sat, between each of my main business entity Hubs and the User hub. e.g Between Widget hub and User hub I create Widget_User link table. Widget_User has an eff sat that records when each user last interacted with a widget record. I guess if I found the most recent interaction for a particular Widget key in that eff sat that would give me the most recent change columns. This will create a lot of link tables but does reflect the explicit foreign keys in the underlying source and feels closest to the reality of the source data.
  • The last solution creates a compounding problem because some sources will capture the editing of a relationship as a first-class concern e.g. Salesforce has a relation table to link an Account with the Contacts that operate on its behalf. This means I have audit information sitting on a Links eff sat, which would then need to be linked to a User hub
  • It struck me that I could regard the User table as Reference tables, although they would be different per source.

How have others resolved this type of attribute?

answered in the slack channel

1 Like

To summarise:

  • the user id is best thought of as a kind of reference table
  • store the list of ids in a separate HUB (and associated SAT if you have payload)
  • then it acts as a fk link to wherever it appears in satellites, but DON’T implement the fk constraint, it is just implied in the model; and do not connect to the reference data using a link from the satellite’s parent hub (you’ll end up with a load of links with no real business meaning and a hit on query performance to retrieve the id value from the linked to hub)
  • the user HUB is ready to be involved in other LINKS if needed
3 Likes

Thanks for updating the thread with the summary, Neil!

Really good question/questions, below is my suggestions/anwers.

  • Denormalization to replace the user id with a user name in my raw DV satellites seems to be against DV principles
    -Yes, you are breaking 2 important data vault rules.
  1. You apply soft business rules when you denormalize the data.
  2. The import of the dataset becomes dependent on the list of names for the users. This means that the running code is not autonomous.
  • I can leave the user id on the satellite but then my data analysts would be linking their queries directly to a sat without going through a Link table.
    -I would not recommend this, event though if is the cheapest solution. Second I wonder why you would give an analyst direct access to the vault and not to a mart/schema/other? In my experience analysts are not that great when it comes to understand Data Vault.

  • I guess I could leave the id in place on the sat record and then denormalize to create a revised satellite in the business vault?
    -If you want to denormalize it (soft BR) you still need to create the hub for User. So that the satellite in the business valut will use the hub user as its BK.

  • Do I create multiple Link tables, each with an eff sat, between each of my main business entity Hubs and the User hub. e.g Between Widget hub and User hub I create Widget_User link table. Widget_User has an eff sat that records when each user last interacted with a widget record. I guess if I found the most recent interaction for a particular Widget key in that eff sat that would give me the most recent change columns. This will create a lot of link tables but does reflect the explicit foreign keys in the underlying source and feels closest to the reality of the source data.
    -This is the most ambitious solution and it reflects the source system with no data loss. When you apply the effectivity satellites over the links there will be no need for any effectivity satellites over the Hub User. Only one effectivity satellite is needed for each dataset (flat data) if it is placed on link that contains all the hubs.

  • The last solution creates a compounding problem because some sources will capture the editing of a relationship as a first-class concern e.g. Salesforce has a relation table to link an Account with the Contacts that operate on its behalf. This means I have audit information sitting on a Links eff sat, which would then need to be linked to a User hub
    -As long as you can capture everything with no data loss then you are fine.

  • It struck me that I could regard the User table as Reference tables, although they would be different per source.
    -This can be applied if you have no requirements on auditability and are ok with data loss. In most cases you might discover that the business had these requirements but it was not communicated. Then you are in big trouble. My suggestion is that you treat the lists of names as incoming data and import it into the data vault as any other dataset.

Good luck!

I’d add here that you should always consult your users to explore the business context.

If they see business value in tracking users for when they interact with a concept then add a link between the concept and hub user. If not, then you may have a reference table.

Dan has two principles that conflict: ‘auditability of source data’ and ‘avoid source system data vaults’. There is judgement to be applied here.

1 Like

Hello Neil!
This topic is very interesting. The avoidance of a source vault vs auditability of the data.
In my book, my opinion, you should always be able to recreate the data that has been imported into the system. When you are breaking the possiblibility for auditability/traceability you also have broken another important Data Vault rule, you have performed soft business rules on the data.

Even if you create a “source vault” with all its issues but the auditability is intact, you can fix/migrate the data without any data loss.
Number one rule should be (my opinion): Don’t lose data!
What do you think?

Hey Neil,

Tbh I don’t think the two principles are at odds at each other.
RAW vault represents the source, hub tables define the business objects (as the business sees them) and the links and sats capture relationships and bo states.
You can always write the appropriate SQL to recreate the source because in essence those sources are the automated business rule engines.

Source vault to me is when you simply map everything from these engines without any care given to proper hub representation, passive integration or satellite splitting (to name a few).
These become difficult to query and manage