PII data for Business Keys

Another day, another challenge.

I have my h_party with its BK being party_id, sourced from our golden source (let’s call that source GS) table Party, column PartyId.

Now, I need to load external credit ratings coming from another source (ECR), that are tied to a LEI_code (the globally known Legal Entity Identifier). This LEI_code is stored in a different GS table named PartyExternal, with a composite BK of PartyId+ExternalIdType, with attribute ExternalIdValue.

One of those ExternalIdTypes is LEI, so that ExternalIdValue is how I can link my external credit ratings to a Party.

Now, this PartyExternal table contains all sorts of Types, some of them being BIC, CoC, VAT, but also Driver’s License Numbers, Passport Numbers and Identity Card Numbers. All of which are very PII sensitive and should be masked.

How should I model this table?

  • Can I just take the LEI codes (both from GS and ECR), load them into h_party and create a same-as-link? I won’t be able to recreate the source because I’m not loading all the data. And what if at a later stage I decide to load an additional ExternalIdType for which the ExternalIdValue can clash with an existing party_id?
  • Create a h_lei with a link to h_party? Will I then have to create new hubs for every additional ExternalIdType?
  • Create h_party_external with composite BK of party_external_id + party_external_type? When sourcing this from the ECR table, I would have to hardcode the value ‘LEI’ into party_external_type because I only get the LEI_code (which implies the party_external_type is ‘LEI’). If I decide to load all ExternalIdTypes, I will have to mask the ones that are PII sensitive. Can you even mask only certain values, based on a condition, in one particular column?
  • Create a mas_party with ExternalIdType as a DCK. Perhaps make two, one with PII data where I can more easily mask ExternalIdValue and one where no masking is needed (meaning I’d have to filter on ExternalIdType in staging). But then I’d still need to know what to do with the ECR table?

These are some possible solutions I can think of… any help is greatly appreciated!

edit: something to keep in mind: we’re going to implement BK treatment because the party_id should always have 12 characters with leading zeroes (a legacy source delivered them without leading zeroes).

Sounds complex…

  • deal with the tech debt by splitting the data upon staging,

  • if the keys are embedded then split those out too

Question is, can the source provide that for you instead then you don’t need to do/maintain that splitting yourself.

Don’t let the DV become the repository of tech debt.

Thanks for the reply Patrick. But I’m still struggling with what DV objects I should be splitting them into.

Unfortunately, the source won’t be of any help here.

But you have PII with non-PII identifiers… somewhere the ids must be split before ingest.
Imagine if you’re building a source system and decide to use PII ids as PKs… that means the DB’s integrity is based on PII … = disaster when Article 17 is exercised.

What this means is that these ids should never be loaded to hub tables; they’re locked away in PII sats.

Source should split for you unless unfeasible to do so. Regardless of PII or not PII the splitting must happen before entering DV. Because if you allow this entanglement to be managed by building more tables in your DV model it means you have more and more tech. debt handled by SQL joins.

Your aim should be less DV tables, not more.

So then the DV model could look like this (sorry for the text… I should really get a visualisation tool).

Staging:

  • Party
  • ExternalCreditRatings
  • PartyExternalNonPII (filtering on ExternalIdType)
  • PartyExternalPII (filtering on ExternalIdType)

RDV:

  • h_party: BK party_id with mappings:
    • Party.PartyId
    • PartyExternalNonPII.PartyId
    • PartyExternalPII.PartyId
  • h_party_external: BK party_external_id + party_external_type with mappings:
    • PartyExternalNonPII.ExternalIdValue + PartyExternalNonPII.ExternalIdType
    • ExternalCreditRatings.LEI_code + hardcoded value that matches with the LEI equivalent in ExternalIdType
  • l_party_external_party: BK h_party_external_hkey + h_party_hkey with mappings:
    • PartyExternalNonPII.ExternalIdValue + PartyExternalNonPII.ExternalIdType + PartyExternalNonPII.PartyId
  • s_party_external_credit_ratings with mappings that match the mapping to h_party_external:
    • ExternalCreditRatings.LEI_code + hardcoded value that matches with the LEI equivalent in ExternalIdType
  • mas_party_pii: BK party_id + party_external_type with mappings:
    • PartyExternalPII.PartyId + PartyExternalPII.ExternalIdType
    • Masked attribute mapped to PartyExternalNonPII.ExternalIdValue

Hoping this is readable and understandable :blush:

Your party_type could be a dependent-child key in a satellite… hub_party would be all parties because by definition they are the party — external or not they’re still parties.

Yes… get a visual tool :stuck_out_tongue:

have you tried this free stencil? the_data_must_flow/stencils at master · PatrickCuba/the_data_must_flow · GitHub

I’ll do some analysis on Monday, there are dozens of different party types and I think there will be quite some collisions, not just between the external types but also between the external types and my ‘main’ party id. Besides, I’m doing LPAD(id,12,‘0’) BK treatment that I prefer to do on the column-level, not on the column-mapping-level.

Haven’t tried the stencil yet, but will give it a go, thanks! Just tried importing it into diagrams.net and noticed some of the shape texts were aligned to the left too much, making them overlap with the shape image. But I think I can make it work.

Hmmm you could never guarantee ids from external providers if there are thousands of them — you’d have the same issue if they were in the same hub or seperate hub.

  1. You’d need a mapping table for BKCC per source system, build an error trap for stopping the pipeline when a new source is detected that hasn’t been profiled yet
  2. You could also restrict the loads to external parties you care about
  3. Or, this is something we did at a debt collection customer. The customer had their own account ids that will clash so we had an internal system assign a debt id to each account id from each external account id. The external id became merely an attribute to a satellite table and the debt id the unique id we used in our DV.

I have already considered your point 2… But then I won’t be able to recreate the source. How bad would that actually be in this case?

Seems to me that Patrick has put his finger on the real problem: the ECR source is violating PII standards by mixing keys that need to be masked with keys that don’t. In your DV, you DO NOT want to repeat that error, so you will segregate them. If that means you can’t fully recreate the source… well, that’s kind of a GOOD thing when the source is not something worth emulating.

The main purpose in being able to recreate the source is to satisfy audit requirements. I think any auditor would understand your unwillingness to make that PII-compromised source re-creatable.

(it’s actually the GS source who is violating PII standards as that one is delivering all the external identifiers… but I get your point)

I think especially your last paragraph is worded in a very nice manner… something that I hadn’t been able to put into words properly myself.

Thank you both for the input! Really really appreciated. I’m lacking a sparring partner currently and this helps tremendously.