We are modeling a healthcare data vault in which members will eventually need to be matched to claimed by business analysts.
Legal/Compliance has just informed us that the MemberId field on the member records are PII because they appear on the insurance card.
In the raw vault, we’ve got hubs, links and satellites for members, claims, etc. In our business vault, we’ve got a common member view, common claim view but the member id will have to be masked for people without pii access.
We’ve brainstormed the following:
- Add a reference table that maps a guid to the member id and use that in the common view (this would have to get created on the ingestion pipeline)
- Create two layers to the common views - one that is locked down and only used to create joined common views and then the business would have access to the top layer views (some of which are just select * from common view) with masking applied so that the making doesn’t impact the joins
What are we missing? Thoughts? Who has done this before?
Note: we are using snowflake and doing dynamic data masking for pii
If you’re using snowflake’s data masking you shouldn’t need to create the separate views etc - i would have thought that you would have lets say 2 roles, a service account role and an analyst role.
Anything that is PII - e.g. your memberiD is put under the DG masking as soon as possible e.g. in landing, and certainly before it gets into anywhere an analyst can see. A Service account has the service role and as such access to all data and is used to run all your jobs automatically. User and analyst has access only to the masked data. (You might want a support role too, so users who are testing or dealing with support tickets get temporary access to see PII).
Any DV hash keys, particularly if they are salted with some BKCCs are not going to be PII so that they can be joined on and made visible to anyone.
There are probably some technicalities that i haven’t quite got right but that should work in general
And welcome @jheckman324 to the forum!
Consider defining a PII Satellite where such content is managed in this seperate satellite
You should also consider a non-PII column (internal key) to manage this member across your business, (i.e. this will go to the hub table instead). That way if a request to delete the member data you might not need to delete the key from the hub but simply obfuscate the member id in the sat, the sat itself could be locked away in a database schema only accessible by authorized roles. What’s more about this approach is you do not lose the analytical value of the non-PII content and the amount of data management needed is reduced.
Alternatively consider “crypto-shredding” approach, your key that gets loaded into the hub is tokenized and should the content need to be deleted then the private key for that encrypted key is deleted instead.
Masking as Nat suggested will not work here, because masking is applied to a column. A hub is the integration point between sources so you could have non-PII and PII content loaded to the same hub and therefore managing masking dynamically will become tedious!
Hope that helps!
The internal key and PII sat sounds better than doing the hub thing. You could do masking on that column in the PII sat though as well as putting it into its own database/schema.
On your last point - if the hub’s business object is the member, defined by memberID as a key, and memberID is always PII (at least that’s what I read in the above), then any source providing memberIDs into the hub would be providing PII data. And if a source is providing member info at a different level that is not PII it’s a different level of detail / key structure so should be its own hub. Thougths?
I’m facing similar challenge on the project. There are data sets identfied by PartyID, however there are also some data sets identfied by TaxIDs (PII) only (w/o technical identifier). TaxID might relate to multiple Parties, therefore I recon there should be 2 hubs (different granularity and meaning): Party and “PartyTaxID”.
Would you organize the model accross 2 hubs as well? In what structure would you store the private keys? is it sat of plain ref table? Have you described this approach in any of your blog posts?
Would you make taxid an attibute in the sat? I ask because Taxid sounds private? No?
If not, go ahead (I think) with your suggestion!