I have a question about whether people currently persist both the untreated and the treated business key in the staging table before loading into the target hub. Is it good practice to hold the raw business key attributes and the standardised ones separately? Given that the untreated business key may be loaded into a target satellite (this is my default view), this makes sense: you would need two source attributes for these two separate target attributes. Otherwise you would have to apply business key treatment in the hub load itself, which seems counter to the purpose of doing a lot of the legwork (generated hash keys, hash diffs etc.) in the staging load.
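To make the question concrete, here is a minimal sketch of what I mean, with both variants carried side by side in staging. The treatment rules (trim, uppercase, cast to string) and the MD5 hash are illustrative assumptions, not a prescribed standard:

```python
import hashlib

def treat_business_key(raw_bk) -> str:
    """Illustrative 'hard rule' treatment: cast to string, trim, uppercase."""
    return str(raw_bk).strip().upper()

def hash_key(*treated_bks: str, delimiter: str = "||") -> str:
    """Derive a hash key from one or more treated business keys (MD5 for brevity)."""
    return hashlib.md5(delimiter.join(treated_bks).encode("utf-8")).hexdigest()

# Staging would carry both variants of the same source attribute:
raw = "  acc-001  "
treated = treat_business_key(raw)          # "ACC-001"
staging_row = {
    "account_code": raw,                   # raw BK, destined for the satellite
    "dv_treatedbk_account_code": treated,  # treated BK, destined for the hub
    "dv_hashkey_account": hash_key(treated),
}
```

The point being that the hash key is derived from the *treated* value, so both columns have to exist in staging before the hub load runs.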
If so, is there any guidance on naming the two separate variants of the same source attribute?
My view is as follows…
account_code ← raw BK
dv_treatedbk_account_code ← treated BK
I guess it doesn’t really matter if the DV-prefixed cols were flipped to hold the untreated raw value instead (dv_rawbk_)… I’m probably splitting hairs here, if the only thing that matters is consistency!
OK, so to extend that logic to the staging table itself, where we set up the DV metadata in DV_ cols, I guess it is up to the modeller to determine the naming standard for the treated key as stored in the staging table. There will be occasions where the source attribute label is sensible and would serve as the business key label too, so we would need to store the sanded/treated version (trimmed, uppercased, converted to string, etc.) in a field with a different name to distinguish it from the raw incoming data.
Hi @jhall_uk . I guess since Staging is temporary i.e. truncate and load, there is not really a naming convention for Staging. @cmeyersohn may have some best practices guidelines for naming in Staging.
There is no standard for naming the staging, but there is a recommended best practice which I will share here.
Since source to target is a 1:1, I like to start my staging tables by first identifying that they are staging tables - so, I use ‘STG_’.
Then I like to identify the source because we often have multiple sources for the same type of data - like employee data. In this case, I would follow ‘STG_’ with ‘PS_’ if data is coming from Peoplesoft, or ‘SAP_’ if it’s coming from an SAP system - you get the idea.
Finally, I will append the source object metadata name; example: BENEFITS.
The end result would be STG_PS_BENEFITS. I may choose to version control the target object in the metadata name as well - STG_PS_BENEFITS_v1. You don’t have to version control by object name, you may have another configuration management technique or process that is used to version control. Having said that, you must version control your objects somewhere. The standards require version control on everything - objects, processes, models, documentation, business rules, etc.
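Cindi's convention above could be sketched as a small helper, just to show the pieces composing. The function name and signature are my own, purely illustrative:

```python
def stage_name(source_system, source_object, version=None):
    """Compose a staging table name as STG_<SOURCE>_<OBJECT>[_v<N>],
    per the recommended (not mandated) best practice described above."""
    name = f"STG_{source_system.upper()}_{source_object.upper()}"
    if version is not None:
        name += f"_v{version}"
    return name

print(stage_name("ps", "benefits"))      # STG_PS_BENEFITS
print(stage_name("ps", "benefits", 1))   # STG_PS_BENEFITS_v1
```

The optional version suffix reflects Cindi's point that version control by object name is one choice among several; the standards only require that objects are version controlled *somewhere*.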
Best regards, Cindi
In dbtvault staging is a VIEW.
I have mentioned to @alex.higgs the recent release of Streams on Views in Snowflake, which makes the pipeline way more flexible and streamlined (pardon the pun). No need to truncate landed data at all.
On naming, I have suggested you could consider giving the treated bkeys in staging a prefix like dv_, and leaving the untreated ones untouched. Not sure if dbtvault can cater for that; I’m sure he can provide input!
Staging could be virtualised, I suppose, if the performance was OK; it would depend on the platform. In Oracle, this could be an External Table pointing at a standard CSV file (classic external table), or, via the Big Data connectors, the External Table could point at various file types including Parquet or a Kafka stream. Oracle have supported this for years.
I guess, coming back to my original question about how I label two attributes sourced from the same raw source attribute, where one is the raw value going to the satellite and one is the treated value going to the hub, it still sounds like the answer is “it doesn’t matter, there is no standard, just be consistent in your approach”. Then, whether I physicalise the staging table or virtualise it via an External Table (or whatever the equivalent is called in your platform flavour) is largely down to what works best for your platform.
In dbtvault you can create as many aliases for the same column as you need to using a derived column configuration. We’re planning to add global configurations for standardising prefixes, but at the moment you’ll need to do this on a case-by-case basis.
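For anyone landing here later, a derived column alias for a treated business key might be configured along these lines in the stage metadata. This is a sketch from memory of the dbtvault stage configuration shape; the column names and the SQL expression are illustrative, so check the dbtvault stage docs for the exact syntax:

```yaml
source_model: raw_account
derived_columns:
  DV_TREATEDBK_ACCOUNT_CODE: "UPPER(TRIM(ACCOUNT_CODE))"  # treated BK alias for the hub
hashed_columns:
  ACCOUNT_HK: DV_TREATEDBK_ACCOUNT_CODE
```

The raw ACCOUNT_CODE column still flows through untouched (for the satellite), while the derived alias feeds the hash key.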
The best practice is to maintain the column/field name of the source from Source to Stage, and all the way through the EDW. When loading to an integrated hub, the hub concept should drive the business key column/field naming convention.
I’m not quite sure what you are referring to when you talk about “treated” business key. Could you explain what you mean by that?
In the case where the source data has been dropped to a .csv or text file of some kind, and you are applying data type alignment to the text data to align it back to the source system data type (i.e., converting a string to a numeric data type or a string to a date/timestamp data type), the best practice is to indicate that a field has been derived (converted). In your example above, the stage table would hold the landed string value in the “account_code” field, and the converted, type-aligned value in an “account_code_derived” or “account_code_drvd” field, making it clear in the staging table that the value in that target column/field has had some sort of hard business rule applied to it.
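A minimal sketch of that raw-plus-derived pattern, assuming a date field landed as text (the field names, the date format, and the None-on-failure handling are all illustrative choices, not part of the standard):

```python
from datetime import datetime

def type_align_date(raw_value, fmt="%Y-%m-%d"):
    """Hard business rule: convert a landed string back to its source data type.
    Returns None when the string doesn't parse (illustrative handling only)."""
    try:
        return datetime.strptime(raw_value, fmt).date()
    except (TypeError, ValueError):
        return None

# Staging keeps the landed string AND the type-aligned value side by side:
staging_row = {
    "open_date": "2021-03-15",                        # raw string from the .csv
    "open_date_drvd": type_align_date("2021-03-15"),  # proper date value
}
```

The `_drvd` suffix on the second field is what signals that a hard rule has been applied, per the convention described above.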