A question on handling third party data

Hi there,

Hoping someone might be able to provide some guidance here.

Assume we have a customer hub. We’re sourcing data from Salesforce, and are using the Salesforce ID from the contact table as our business key.

We have a related third party send us a csv each day. It contains data about customer accounts. In the csv, there is the customer name, but no other ‘business key’. It has other data such as their loan amounts, repayments etc.

How would you go about integrating this into DV? Should we be performing a fuzzy match on customer name pre-raw vault to understand the connection, and then adding the data as a satellite having already cross-referenced to get the customer Salesforce ID?
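For anyone picturing what that pre-raw-vault fuzzy match might look like, here is a minimal sketch using Python's standard-library `difflib`. The contact IDs, the names, the `best_match` helper and the 0.9 threshold are all made up for illustration; a real matching rule would need to be agreed with the business.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def best_match(csv_name, crm_contacts, threshold=0.9):
    """Return (salesforce_id, score) for the closest CRM name.

    crm_contacts maps a Salesforce ID to a contact name (hypothetical data).
    If the best score is below the threshold, the ID is returned as None.
    """
    target = normalize(csv_name)
    best_id, best_score = None, 0.0
    for sf_id, crm_name in crm_contacts.items():
        score = SequenceMatcher(None, target, normalize(crm_name)).ratio()
        if score > best_score:
            best_id, best_score = sf_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Hypothetical CRM extract: Salesforce ID -> contact name.
crm_contacts = {"003A000001": "Jane Smith", "003A000002": "John Smyth"}

print(best_match("jane  smith", crm_contacts))
print(best_match("Acme Holdings", crm_contacts))
```

Anything that falls below the threshold would still need a manual or business-defined rule, and two different people with the same name will match equally well.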

Guidance would be appreciated, as I’m feeling a bit lost by this one.

It shouldn’t be you that decides.

The business must provide the rules on how to match that data and assign it a business key. There must be an identifier, and as you know, a person’s name is not good enough.

Thank you Patrick,

Understand your position, however I’m unsure it helps in answering the question. If there were someone in the business that could tell me how to link a third-party dataset for which the only unique ‘key’ was the customer name without using the customer name, I wouldn’t have had need to ask in this forum. That’s not me trying to be a smart a**. Just the reality of the situation.

My take on this is that, given the circumstances (it’s a dataset from a third party, we have no control over the fields received or the structure of the data, and the business wants it incorporated into the warehouse and used for reporting), I’m leaning towards creating a hash from the customer name and using that as the ‘business key’, hanging the dataset off the customer hub as a satellite, and using a same-as link (SAL) to connect it back to the customer in the CRM based on matching rules that happen outside the warehouse.
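To make that concrete, a rough sketch of the idea: hash the normalized customer name as a stand-in business key, and build a same-as link (SAL) row once the external matching has produced a Salesforce ID. The column names, the `||` delimiter, the record source value and the choice of MD5 are assumptions for illustration, not a prescribed standard.

```python
import hashlib


def normalize_name(name: str) -> str:
    """Upper-case and collapse whitespace before hashing a customer name."""
    return " ".join(name.upper().split())


def hash_key(*parts: str) -> str:
    """MD5 over the delimited parts. IDs are only trimmed, never re-cased,
    because 15-character Salesforce IDs are case-sensitive."""
    joined = "||".join(part.strip() for part in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def same_as_link_row(csv_name, salesforce_id, record_source):
    """Build one same-as link row (hypothetical column names).

    The match itself is assumed to have been decided outside the warehouse.
    """
    name_key = normalize_name(csv_name)
    return {
        "sal_customer_hk": hash_key(salesforce_id, name_key),
        "master_customer_hk": hash_key(salesforce_id),
        "duplicate_customer_hk": hash_key(name_key),
        "record_source": record_source,
    }


row = same_as_link_row("Jane  Smith", "003A000001", "THIRD_PARTY_CSV")
print(row)
```

Note that two different customers who share a name would collapse onto the same hash key here, which is exactly the risk with using a name as the key.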

Given my restrictions, does that seem appropriate?

To be honest, this is really more of a data governance question than a data vault question.

Basically, if you cannot trust that the customer name in the 3rd party data set is unique, then you cannot trust the data set or the results of any analyses performed using it.
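One quick way to put evidence behind that conversation is to count how often each name repeats in a daily file. A minimal sketch, assuming a header row with a `customer_name` column (the column names and sample rows are invented):

```python
import csv
import io
from collections import Counter


def repeated_names(csv_text, name_column="customer_name"):
    """Return {normalized name: row count} for names on more than one row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(" ".join(row[name_column].upper().split()) for row in reader)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical daily file from the third party.
sample = """customer_name,loan_amount
Jane Smith,10000
John Smyth,2500
jane smith,4000
"""

print(repeated_names(sample))
```

A repeated name could be one customer with two loans or two different customers. The file alone cannot tell you which, and that is the point to take back to the business.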

The best option would be to tell your business counterparts that you can’t promise your results will be accurate because you can’t trust that the customer name is unique, and that they should push back on the 3rd party to supply a proper unique ID rather than a customer name.

Look up some of Chad Sanderson’s data contract stuff

Then, you should get your Salesforce team to update Salesforce (or another MDM) with this new ID.

Thank you :pray:
Appreciate the feedback.

I was implying that the business must sign off on what you do with the data.

If you do not have a unique key, business must sign off on assigning one, i.e. must define the rules on how you uniquely identify that business object — assign a key at the source. That is how you take control of the “situation”.

Nat’s comment is 100% correct. An MDM ID can be used, but then it is not you defining the rules; it is the business, because how does that MDM ID get assigned? You simply insert the MDM ID as the business key in the hub.

Thank you,

While I get that ultimately the business needs to sign this off, with full awareness of the implications of what they’re signing off, they are looking to me for the recommendations on approaches. This includes the Data Architect, who is primarily concerned with the internal systems as opposed to third party data.

So the aim was to elicit some suggestions and/or sanity-check an approach.

Hey Andy,

This is very similar to the situation that I am facing at the moment with my organization. Like Dan says in the book, garbage in is garbage out. Nat’s suggestion is spot on.

I empathize with your position on being tasked to solve this problem. Data Vault is not necessarily the solution to the specific problem, but it can help with speed to refactor once the third parties are sending you the data you need.