Filtering a source table/file when loading Raw Data Vault

What are everyone’s thoughts on filtering the source data when loading the Raw Data Vault (provided that you still load all the data)? Note: going back to the source and having them deliver the data in a different way is not an option.

I have to load data from two source systems containing the same type of data. System A was used up until 2020; from 2021 onwards, System B is being used. I need to load all historical data.

System B delivers the concepts of “Individuals” and “Organisations” in separate tables. I load both into their own Hub+Satellite.

System A delivers both concepts in one table, with a column called “Type” that contains either the value “Individual” or “Organisation”. The total number of columns is roughly the same as in System B, although in System A there’s a lot of sparsity, in the sense that many columns are only populated for one type and blank for the other.

I would like to load this data into the different hubs. I know that the “Type” column can only contain those two values, so by filtering I will not lose any records.

If I don’t filter towards the RDV, I would get an extra, source-specific Hub that I can’t integrate anything else into, and I would have to apply a business rule to populate the other two hubs.
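For clarity, the filtering itself would be nothing more than a hard-rule split in staging, along these lines (a minimal Python sketch; the table and column names beyond “Type” are just my working assumptions):

```python
# Minimal sketch of the hard-rule split in staging (all names hypothetical).
# Each row from System A's combined table is routed to the staging set of the
# hub it belongs to, based purely on the "Type" column.

system_a_rows = [
    {"PartyID": "P1", "Type": "Individual", "LastName": "Smith", "OrgName": None},
    {"PartyID": "P2", "Type": "Organisation", "LastName": None, "OrgName": "Acme"},
]

stg_individual = [r for r in system_a_rows if r["Type"] == "Individual"]
stg_organisation = [r for r in system_a_rows if r["Type"] == "Organisation"]

# stg_individual would feed HUB_INDIVIDUAL (+ its System A satellite),
# stg_organisation would feed HUB_ORGANISATION (+ its System A satellite).
```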

I can identify some caveats already:

  • I might need to load data from a table in Source A that does contain a FK to the ‘core’ table, but not that “Type” column. So if I want to load this table into satellites attached to those separate hubs, I would have to join back to the ‘core’ table in my staging layer in order to fetch that “Type” column (see the sketch after this list). Not ideal.
  • In this case I know the “Type” column can only contain two values. What if you have a scenario in which another value might get introduced? Would you have to include some really generic hub and satellite that you can later decide to refactor once you know more about what that data is actually about? Or would this be the prime reason never to filter when there’s the odd chance that you risk losing data?
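To make that first caveat concrete, this is the kind of join-back I mean (again just a Python sketch with made-up names; in the real load it would be a join in the staging view):

```python
# Sketch of the join-back: a child table in Source A has a FK to the 'core'
# party table but no "Type" column of its own (all names are made up).

core_type_by_party = {"P1": "Individual", "P2": "Organisation"}  # from the 'core' table

child_rows = [
    {"PartyID": "P1", "Phone": "555-0100"},
    {"PartyID": "P2", "Phone": "555-0200"},
]

# Enrich each child row with its parent's Type so we know which hub's
# satellite it should be loaded into.
enriched = [{**row, "Type": core_type_by_party.get(row["PartyID"])} for row in child_rows]

sat_individual_rows = [r for r in enriched if r["Type"] == "Individual"]
sat_organisation_rows = [r for r in enriched if r["Type"] == "Organisation"]
```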

I’d love to hear everyone’s two cents. I already got Patrick’s in the comment section on this article of his.

My two cents:
By performing filtering in the load process you also introduce soft business rules.
If you add specific logic (the filtering and loading into different hubs) you add complexity to the loading process. If you add complexity, you have to spend more time on documentation describing why you did not follow the standard pattern. It will take other developers longer to understand, debug and fix issues with your customized code.

I understand the temptation of making a cleaner DV model, but as you say yourself, why risk not getting all the data?

Finally: creating two same-as links (with eff sats) in the BDV to capture which individual and organization is connected to the source-specific hub isn’t that hard.

Your two cents are valid :slight_smile:

I made a quick start to just try it out, but already ran into an issue where I would want to create a link based on the ParentID field… but I would have to filter that one too in order to tell which Hub I should connect it to. It became really complex really fast.
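Roughly, it came down to something like this before I stopped (a quick Python sketch with hypothetical names):

```python
# Sketch of why the link load got complex: both ends of the ParentID
# relationship need the Type lookup to decide which hub they resolve to,
# which in the worst case means a separate link per type combination
# (all names hypothetical).

core_type_by_party = {"P1": "Individual", "P2": "Organisation", "P3": "Individual"}

rows = [{"PartyID": "P3", "ParentID": "P2"}]

links = {}  # (child type, parent type) -> rows for that link table
for row in rows:
    pair = (core_type_by_party.get(row["PartyID"]), core_type_by_party.get(row["ParentID"]))
    links.setdefault(pair, []).append(row)

# Each non-empty bucket in 'links' would be a link between a different pair
# of hubs, e.g. ("Individual", "Organisation").
```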

I don’t think there’s a need for same-as links (in the literal sense) though, right? I will get a regular link between two separate hubs, which feels like a same-as link, because I know they mean the same functional thing.

Also, I think I can get away with not having an eff sat on that link, because the value in that “Type” column is never going to change. The more general “PartyID” from that one source will always be related to either a certain “IndividualID” or a certain “OrganisationID”. It will never change over time.

A regular link or a same-as link, a link is a link; I agree with you.

Regarding an eff sat, I always generate them to be 100% sure that I can regenerate the data.

I would be a rich man if I got a dime every time somebody told me that something is never going to change. I recommend taking no risks; in the end, the business always blames the warehouse for incorrect numbers, never the sources. That is why, as a DW developer, I also want to write the code for the extract processes from the sources myself rather than having the sources deliver data to me. I think this is one of the keys to implementing a data warehouse very, very fast.

I’d say you have a hard rule, so go ahead and filter on it and split to the appropriate hubs. If you’re concerned about capturing any suddenly appearing types, set up a path to send unknowns to an error mart for re-evaluation and potential processing. I’d argue that’s a lot simpler than a whole new hub that doesn’t align to a business concept, along with the Data Vault objects and logic needed to support it.
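Something along these lines, where the error mart is just a third bucket for anything that doesn’t match the hard rule (a rough Python sketch; names are illustrative):

```python
# Rough sketch: route known types to their hub staging sets and anything
# unexpected to an error mart for later re-evaluation (names illustrative).

def route_party_row(row, stg_individual, stg_organisation, error_mart):
    party_type = (row.get("Type") or "").strip()
    if party_type == "Individual":
        stg_individual.append(row)
    elif party_type == "Organisation":
        stg_organisation.append(row)
    else:
        # Unknown or missing type: park the row instead of silently dropping it.
        error_mart.append({**row, "error_reason": f"unexpected Type: {party_type!r}"})

stg_individual, stg_organisation, error_mart = [], [], []
for row in [{"PartyID": "P1", "Type": "Individual"},
            {"PartyID": "P9", "Type": "Trust"}]:
    route_party_row(row, stg_individual, stg_organisation, error_mart)
```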

I’ve had massive JSON files and oddly constructed tables holding all sorts of information the business had no use case for. I definitely use hard rules to cherry-pick what matters to the business. You can still perform table reconciliation and recreate the captured source data.

If the rule wasn’t so clear-cut I would agree that you need to bring it in as a source-driven construct, but if the rule meets whatever company standards you have (or create) for separating the soft from the hard, then keep it simple.

I’m inclined to partially agree with Christopher on this. I have a scenario where one of my upstream systems is physically sharded. The problem is that as part of the sharding the tech team stuffed up and left a duplicated record in two of the shards. Same business key. It doesn’t matter to them because they don’t do cross-shard reporting and they won’t fix it.

It matters to me, though, because I bring all the sharded data back together, albeit with different record source values. Sure, I could include the record source field in the creation of my PK in the hub and sat, but all I’d be doing is replicating an error from the source system into my vault model. And worse, our CRM depends on the original business key to relate the sharded data back to the customer account it manages. Again, they don’t care about the duplicate because they don’t see the sharding, since someone has filtered out the bad data in an insights feed we supply back to the CRM.

The point is that the CRM system holds the data that effectively becomes the link table, but they don’t have the record source data to include when generating the key that should link to the hub. I don’t mind documenting that this clause is there, with a link to the not-actioned JIRA ticket. It’s my passive-aggressive way of reminding the main dev team how long their mistake has been around for :stuck_out_tongue:
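For what it’s worth, the workaround itself is just one documented predicate in the staging load, roughly like this (a Python sketch; the shard name, business key and ticket reference are placeholders):

```python
# Sketch of the documented exclusion: drop the business key that one shard
# holds in error (shard, key and ticket reference are placeholders).

KNOWN_DUPLICATE = ("shard_02", "CUST-0042")  # (record_source, business_key)

source_rows = [
    {"record_source": "shard_01", "business_key": "CUST-0042"},
    {"record_source": "shard_02", "business_key": "CUST-0042"},  # the bad copy
]

def keep_row(row):
    # See the (still not actioned) JIRA ticket for why this row is excluded.
    return (row["record_source"], row["business_key"]) != KNOWN_DUPLICATE

staged = [r for r in source_rows if keep_row(r)]
```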

In referring back to Patrick’s answer in his article, my outcome is not the consequence of a business rule. It’s a consequence of laziness and incompetence. I know that the relevant team will never read this but if they do… guys, fix this!

No application of any pattern or principle can ever be totally pure because there are always edge cases. Trying to mangle your processes to meet some ideal standard likely leads to further problems and workarounds elsewhere. As architects we have a duty to be pragmatic in our solutions. We do our best with the data in front of us and make sure any exceptions are minimized and documented.


“shopping list”

  1. As per the DV 2.0 recommendations, one of the most important initial and core steps is to identify the business entities, meaning CONFIRM and consolidate what the HUB BKs are. DV 2.0 is pretty straightforward and simple: just INSERT new records or deltas, and do BK consolidation as a soft fix (rather than fixing the sources), since HUB BKs are such a core part of everything:
* INDIVIDUAL
* ORGANIZATION

  2. BK clash analysis:

*   IF the same entity has an equal (BK + TYPE) in both systems THEN
        all good, no extra work to do => no need for SAL(s).
    OTHERWISE
        you need 1 or 2 SALs, depending on where the BK clash occurs:
        * INDIVIDUAL_SAL
        * ORGANIZATION_SAL
        plus a mapping table or RULES based on attributes to map the different systems’ BKs as the same, so both SALs can be loaded.
  3. One SAT (per hub) per source system:
* INDIVIDUAL_SAT_system_A
* INDIVIDUAL_SAT_system_B

* ORGANIZATION_SAT_system_A
* ORGANIZATION_SAT_system_B
  4. Extra: eventually split each SAT per system if there are many columns and different rates of change.

So either 6 or 8 tables are needed for your solution (2 hubs + 4 satellites, plus the 2 optional SALs).
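A very rough sketch of the BK clash check in step 2 (Python, with made-up keys): a set comparison of (type, BK) pairs across the two systems tells you whether the hubs integrate by themselves or whether you need the SALs plus a mapping.

```python
# Rough sketch of the BK clash analysis in step 2 (made-up keys).
# If the same entity carries the same (type, BK) in both systems, the hubs
# integrate by themselves; anything else is a candidate for SAL + mapping.

system_a_bks = {("INDIVIDUAL", "12345"), ("ORGANIZATION", "ACME")}
system_b_bks = {("INDIVIDUAL", "12345"), ("ORGANIZATION", "ACME-NL")}

shared = system_a_bks & system_b_bks       # integrate directly on the hub
only_a = system_a_bks - system_b_bks       # review: same entity, different BK?
only_b = system_b_bks - system_a_bks

print("shared:", shared)
print("to review for SAL/mapping:", only_a | only_b)
```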

If you filter the data and later need information from another source that references the excluded data, you might have to perform joins in your queries. This adds complexity and may impact performance. If there’s a possibility of new values being introduced in the “Type” column in the future, you might need to revisit and modify your data model. As you mentioned, there’s a risk of data loss if you filter out certain records. In your specific case, if the “Type” column only has two values and you are certain that it won’t change, this risk might be minimal.
