Renaming the source columns in raw vault

Hi All,

We have encountered a few special column names that are reserved words(keywords), and numeric column names like 238567 from the source tables. What is the best practice for handling these columns in the raw vault? Are we allowed to rename them appropriately in the raw vault, or should they renamed only in the later stages?

Regards,
S

TIA

I reckon you’d need to see what the platform supports… I have not seen many that support numeric column names, you might need to set a standard that such columns be renamed with a prefix to make it work — like a double underscore or something…

1 Like

We are currently using Redshift, which supports numeric column names. To avoid using quotes(") when reading the columns, I want to rename the column. Is column renaming an anti-pattern in Datavault?

Ah I see. The DV2.0 recommendation is to not change it, you’ll have to make a decision here.

1 Like

This could be cast as a hard business rule - a necessary change to be able to load the data. Try to keep the column names as similar to the source as possible to facilitate the audit trail - e.g. add a prefix to the coulmn name x_ or something like that.

4 Likes

Reopening this topic with a follow-on query…

If the recommendation is not to rename the source columns as they land into the raw vault layer unless you absolutely have no choice (due to target platform either not supporting the characters or a conflict with reserved words) where should they be renamed? Business Vault, Information Mart (fact/dim attributes), or (shock) within the reporting tool? I’m leaning towards the Information Mart as I don’t think simply renaming a column is really a business rule.

The source systems often have pretty unfriendly/unhelpful names (‘order number’ might be ‘ohnmr’, or a custom source field configured in a COTS solution to be ‘marital status’ might be ‘attribute1’), so they need to be translated somewhere.

No renaming in BV, BV is sparsely modelled and the outcome of idempotent business rules.
Names changes like - to _ is trivial and a DB limitation — and that’s ok. SAP tables can be difficult to understand but I have seen customers use a SAP mapping table to rename those to something sensible, this too is ok because. the SOURCE still controls that naming standard.

If you really want to rename RV columns why not do it in a view? The view is NOT BV, it’s what can call in DDD language, an “anti-corruption layer”. 1. Data Vault and Domain Driven Design | by Patrick Cuba | Medium

Yeah I don’t know where this sits in the broader methodology but it sounds like a bit of a pedantic thing to try to look through a book for. Changing these columns names in staging or ingestion isn’t going to cause you any pattern breaking consequences, I’d squash this problem early in the process rather than trying to use DV2.0 methodology to solve these issues but others have expressed different opinions.

At the end of the day these aren’t really data issues since it doesn’t actually touch any of the column contents. So it’s more of a preference on how you want to organize your data.

You should never feel shy about engaging the source to fix / modify things. I think this is important because “the data vault is not the repository of technical debt”.

1 Like

Being realistic, are commercial software packages suddenly going to change their data models to suit some downstream endeavour? No. Good luck renaming some legacy on-prem ERP data model to friendlier names - you’ll break it and it won’t be supported. For SaaS products you don’t get access to the data model anyway - you’ll be relying on data extracts or REST API calls etc. where hopefully the object and attribute names may be closer to how they are labelled on screen in those packages. Hence it is about what is realistic to achieve. Nowhere are we advocating changes to the semantics of the content of the data - it is about getting the naming of these domain attributes aligned to how the business actually operates so what is ultimately presented to consumers isn’t nonsensical. We create hubs based on the business view of the world not how the source system is modelled and the business key will not necessarily be named as per the originating source system name, especially as it designed for passive integration from multiple places.

I think the most important thing is traceability and ensuring we know that something in our model labelled ‘order_name’ originated from something called ‘ohname’ etc.

A customer of mine would be keen to deal with aligning labelling of attributes to the ‘real world’ as part of the staging process and they’re quite happy to create a view against the original landed data where they deal with this relabelling, so to preserve traceability. So I guess it is down to whether you do it there or after the load into the raw data vault layer. I guess there is also a risk the ‘real world’ view ends up being departmental and not enterprise - so relabelling up front may create technical debt later on - hence my original thought on dealing with this in the information mart layer where domain focussed data sets are getting presented to specific groupings of individuals. It just means you may end up repeating the relabelling if the attribute names end up being more enterprise-wide after all…

Of course not, I do a talk where I include 3 definitions from DDD to categorize this relationship. Part of “strategic design”, https://youtu.be/EKuXFTkK2Yw?t=839

That’s easy, RV will have that name, lineage is tracked by platform for catalogue tool

It will