I have my h_party with its BK being party_id, sourced from our golden source (let’s call that source GS) table Party, column PartyId.
Now, I need to load external credit ratings coming from another source (ECR), that are tied to a LEI_code (the globally known Legal Entity Identifier). This LEI_code is stored in a different GS table named PartyExternal, with a composite BK of PartyId+ExternalIdType, with attribute ExternalIdValue.
One of those ExternalIdTypes is LEI, so that ExternalIdValue is how I can link my external credit ratings to a Party.
Now, this PartyExternal table contains all sorts of Types, some of them being BIC, CoC, VAT, but also Driver’s License Numbers, Passport Numbers and Identity Card Numbers. All of which are very PII sensitive and should be masked.
How should I model this table?
Can I just take the LEI codes (both from GS and ECR), load them into h_party and create a same-as-link? I won’t be able to recreate the source because I’m not loading all the data. And what if at a later stage I decide to load an additional ExternalIdType for which the ExternalIdValue can clash with an existing party_id?
Create a h_lei with a link to h_party? Will I then have to create new hubs for every additional ExternalIdType?
Create h_party_external with composite BK of party_external_id + party_external_type? When sourcing this from the ECR table, I would have to hardcode the value ‘LEI’ into party_external_type because I only get the LEI_code (which implies the party_external_type is ‘LEI’). If I decide to load all ExternalIdTypes, I will have to mask the ones that are PII sensitive. Can you even mask only certain values, based on a condition, in one particular column?
Create a mas_party with ExternalIdType as a DCK. Perhaps make two, one with PII data where I can more easily mask ExternalIdValue and one where no masking is needed (meaning I’d have to filter on ExternalIdType in staging). But then I’d still need to know what to do with the ECR table?
These are some possible solutions I can think of… any help is greatly appreciated!
edit: something to keep in mind: we’re going to implement BK treatment because the party_id should always have 12 characters with leading zeroes (a legacy source delivered them without leading zeroes).
But you have PII with non-PII identifiers… somewhere the ids must be split before ingest.
Imagine if you’re building a source system and decide to use PII ids as PKs… that means the DB’s integrity is based on PII … = disaster when Article 17 is exercised.
What this means is that these ids should never be loaded to hub tables; they’re locked away in PII sats.
Source should split for you unless unfeasible to do so. Regardless of PII or not PII the splitting must happen before entering DV. Because if you allow this entanglement to be managed by building more tables in your DV model it means you have more and more tech. debt handled by SQL joins.
Your party_type could be a dependent-child key in a satellite… hub_party would be all parties because by definition they are the party — external or not they’re still parties.
I’ll do some analysis on Monday, there are dozens of different party types and I think there will be quite some collisions, not just between the external types but also between the external types and my ‘main’ party id. Besides, I’m doing LPAD(id,12,‘0’) BK treatment that I prefer to do on the column-level, not on the column-mapping-level.
Haven’t tried the stencil yet, but will give it a go, thanks! Just tried importing it into diagrams.net and noticed some of the shape texts were aligned to the left too much, making them overlap with the shape image. But I think I can make it work.
Hmmm you could never guarantee ids from external providers if there are thousands of them — you’d have the same issue if they were in the same hub or seperate hub.
You’d need a mapping table for BKCC per source system, build an error trap for stopping the pipeline when a new source is detected that hasn’t been profiled yet
You could also restrict the loads to external parties you care about
Or, this is something we did at a debt collection customer. The customer had their own account ids that will clash so we had an internal system assign a debt id to each account id from each external account id. The external id became merely an attribute to a satellite table and the debt id the unique id we used in our DV.
Seems to me that Patrick has put his finger on the real problem: the ECR source is violating PII standards by mixing keys that need to be masked with keys that don’t. In your DV, you DO NOT want to repeat that error, so you will segregate them. If that means you can’t fully recreate the source… well, that’s kind of a GOOD thing when the source is not something worth emulating.
The main purpose in being able to recreate the source is to satisfy audit requirements. I think any auditor would understand your unwillingness to make that PII-compromised source re-creatable.