Hey everyone,
before getting into the question in the title, I want to give some context.
I recently joined a team, where Data Vault was brought in previously by a consultancy during the build up of a data warehouse. Since I had no prior experience with Data Vault I just went ahead and got accustomed with the modeling based on the standards established by that firm.
After having to come to grips with their solution, reading about and learning about it, I can confidently say that whatever was onboarded was not Data Vault, at least not DV2:
There were no hashes, loads must be sequential, all the data loading is done by “hand-crafed” stored procedures, staged data is regularly transformed and has soft business rules applied, hubs are basically a collection of different (business) keys and it was modled very closely to the source
While the DV implementation was obviously far from ideal - I would actually arguing the only thing they were doing right was the meta data tagging - a lot of the issue stemmed from not having a proper conceptual model of the business and having completely unstructured sources (cough… spreadsheets).
Maintaining this mess has became cumbersome to say the least. Since the lack of automation has also drastically slowed down developement, we decided to design the data warehouse from scratch and address those issues.
Simultaneously, a more structured source (from now an referenced as: the source) has been orboarded so we hoped that we could implement a “proper” data vault this time around.
However, after checking out the source system, I can safely say that it provides gargabe - it’s structured, but some ugly mix between a relational, dimensional and hierarchical model (probably best described as a record filing system). It exclusively uses surrogates to establish relatiohships between tables. Those integers obviously do not hold any business meaning but additionally there are many cases in the source where there aren’t even any more candidate keys - only the slapped on ID is unique.
Therefore, we have tried to define a conceptual model together with our operational colleagues, arriving at a list of core business concepts and respective business keys.
For example, we have to represent a property consisting of buildlings which in turn may be composed of appartments. The property level has a salesforce uuid, certainly not ideal but at least its a key which is used everywhere in the business. For the building we identified its address as a reasonable business key and apartments have an apartment number. But all kinds of keys which are supposed to be unique according to the business, simply are not unique in the source.
I am aware that people, that you should get the business on board and define data contracts etc and we try that. But while the business may understand the need for theses things, they are not enforcing anything - data is simply not our core product. Also changing the source is pretty much impossible because it is an external tool and we would need to maintain any kind of interface changes ourselves.
(After writing all this I kind of realized that the business tried to abuse the source as a MDM, which is rediculous because it is not designed to do that).
Now we are unsure if we should pursue data vault because it seems like an aweful uphill battle. In addition, we only have a handful of sources: the source, a few APIs, some web app database and a few excel files. While I think that this is in priciple enough to warrant a data vault, since there are no business keys, we cannot even benefit from passive integration.
Any advice and experience how to handle those kind of situations would be highly appreciated!