Hi everyone,
I have a question about whether people currently persist (if at all) both the untreated and the treated business key in the staging table before loading into the target hub. Is it good practice to hold the raw business key attributes and the standardised ones separately? Given that the untreated business key may be loaded into a target satellite table (this is my default view), this makes sense, as you would need two source attributes for these two separate target attributes. Otherwise you would have to apply business key treatment in the load into the hub itself, which seems to run counter to the purpose of doing a lot of the legwork (generating hash keys, hashdiffs etc.) in the staging load.
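To make the staging "legwork" concrete, here is a minimal Python sketch of the kind of hashing I mean (the function name, delimiter choice and the sample values are purely illustrative, not a standard):

```python
import hashlib

def dv_hash(*parts):
    """MD5 over '||'-delimited, treated (stringified, trimmed, uppercased) parts."""
    concat = "||".join(str(p).strip().upper() for p in parts)
    return hashlib.md5(concat.encode("utf-8")).hexdigest().upper()

# Hash key from the treated business key; hashdiff over the payload attributes.
hub_account_hk = dv_hash("abc-123")
account_hashdiff = dv_hash("abc-123", "Checking", "2021-01-04")
```

The point is that the treatment happens once, in staging, so the hub load itself stays a dumb insert.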
If so, is there any guidance on naming the two separate variants of the same source attribute?
My view is as follows…
Source file:
account_code
Staging table:
account_code → raw BK
dv_treatedbk_account_code → treated BK
I guess it doesn't really matter if the dv_-prefixed column were flipped to hold the untreated raw value instead (dv_rawbk_)… I'm probably splitting hairs here, if the only thing that matters is consistency!
Untreated business keys appear as attributes in the satellite table, therefore there is no naming standard; you load them as they came from source, like all other attributes.
Treated Bkey in the Hub conforms to the naming standard you and your modellers have decided.
hub.account_id → your conformed treated keys
sat.account_number → an attribute
OK, so to extend that logic to the staging table itself, where we set up the DV metadata in dv_ columns, I guess it is up to the modeller to determine the naming standard for the treated key as stored in the staging table. There will be occasions where the source attribute label is sensible and would serve as the business key label too, so we would need to store the sanitised/treated version (trimmed, uppercased, converted to string etc.) in a field with a different name to distinguish it from the raw incoming data.
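As a sketch of what I mean, assuming a single account_code source attribute (the treatment rules and the dv_treatedbk_ prefix are just my working convention, not a standard):

```python
def treat_bk(raw):
    """Hard-rule treatment only: cast to string, trim, uppercase."""
    return str(raw).strip().upper()

def stage_row(source_row):
    """Keep the raw attribute untouched and add a distinctly named treated copy."""
    row = dict(source_row)  # raw values pass through unchanged, bound for the satellite
    row["dv_treatedbk_account_code"] = treat_bk(row["account_code"])  # hub-bound key
    return row

stage_row({"account_code": "  ab-01 "})
```

Both columns then exist side by side in staging, and each target load just picks the one it needs.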
Hi @jhall_uk. I guess since Staging is temporary, i.e. truncate and load, there is not really a naming convention for Staging. @cmeyersohn may have some best-practice guidelines for naming in Staging.
Saqib,
There is no standard for naming in staging, but there is a recommended best practice, which I will share here.
Since source to target is a 1:1, I like to start my staging tables by first identifying that they are staging tables - so, I use "STG_".
Then I like to identify the source, because we often have multiple sources for the same type of data - like employee data. In this case, I would follow "STG_" with "PS_" if data is coming from PeopleSoft, or "SAP_" if it's coming from an SAP system - you get the idea.
Finally, I will append the source object metadata name; example: BENEFITS.
The end result would be STG_PS_BENEFITS. I may choose to version control the target object in the metadata name as well - STG_PS_BENEFITS_v1. You don't have to version control by object name; you may have another configuration management technique or process that is used to version control. Having said that, you must version control your objects somewhere. The standards require version control on everything - objects, processes, models, documentation, business rules, etc.
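As a trivial sketch, the convention above could be mechanised like this (the function name is illustrative only):

```python
def staging_name(source_system, source_object, version=None):
    """STG_ prefix + source system + source object, optionally version-suffixed."""
    name = "STG_{}_{}".format(source_system, source_object).upper()
    return "{}_v{}".format(name, version) if version else name

staging_name("PS", "BENEFITS")     # "STG_PS_BENEFITS"
staging_name("PS", "BENEFITS", 1)  # "STG_PS_BENEFITS_v1"
```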
Best regards, Cindi
In dbtvault staging is a VIEW.
I have mentioned to @alex.higgs the recent release of Streams on Views in Snowflake, which makes the pipeline way more flexible and streamlined (pardon the pun). No need to truncate landed data at all.
I have suggested that, as a naming convention, treated business keys in staging could carry a prefix like dv_, while the untreated keys are left as-is. I'm not sure whether dbtvault can cater for that, but I'm sure he can provide input!
Staging could be virtualised, I suppose, if the performance was OK - it would depend on the platform. In Oracle, this could be an External Table pointing at a standard CSV file (classic external table), or, via the Big Data connectors, the External Table could point at various file types, including Parquet or a Kafka stream - Oracle has supported this for years.
I guess coming back to my original question about how I label two attributes, sourced from the same raw source attribute, where one is the raw value going to the satellite and one is the treated value going to the hub, it still sounds like the answer is "it doesn't matter, there is no standard, just be consistent in your approach". Then, whether I physicalise the staging table or virtualise it via an External Table (or whatever the equivalent is called in your platform flavour) is largely down to what works best for your platform.
In dbtvault you can create as many aliases for the same column as you need to using a derived column configuration. We're planning to add global configurations for standardising prefixes, but at the moment you'll need to do this on a case-by-case basis.
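For illustration, a stage model using a derived column for the treated key might look something like this; this is a sketch only, the model and column names are mine, and you should check the dbtvault stage macro documentation for the exact syntax in your version:

```jinja
{%- set yaml_metadata -%}
source_model: raw_account
derived_columns:
  DV_TREATEDBK_ACCOUNT_CODE: "UPPER(TRIM(ACCOUNT_CODE))"
hashed_columns:
  ACCOUNT_HK: DV_TREATEDBK_ACCOUNT_CODE
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=metadata_dict['source_model'],
                  derived_columns=metadata_dict['derived_columns'],
                  hashed_columns=metadata_dict['hashed_columns']) }}
```

With include_source_columns=true the raw ACCOUNT_CODE still flows through untouched, so you get both variants in the staging view.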
The best practice is to maintain the column/field name of the source from Source to Stage, and all the way through the EDW. When loading to an integrated hub, the hub concept should drive the business key column/field naming convention.
I'm not quite sure what you are referring to when you talk about a "treated" business key. Could you explain what you mean by that?
In the case where the source data has been dropped to a .csv or text file of some kind, and you are applying data type alignment to the text data to align it back to the source system data type (i.e., converting a string to a numeric data type or a string to a date/timestamp data type), the best practice is to indicate that a field has been derived (converted). In your example above, the stage table would hold the string value in the "account_code" field, and the converted or type-aligned value in an "account_code_derived" or "account_code_drvd" field, to make it clear in the staging table that the value in that target column/field has had some sort of hard business rule applied to it.
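For example (a sketch only; the field names account_open_date and balance are hypothetical), the raw string stays as landed and the type-aligned value goes to a *_derived field:

```python
from datetime import date

def align_types(stage_row):
    """Keep the landed string untouched; add *_derived fields with hard-rule casts."""
    row = dict(stage_row)
    row["account_open_date_derived"] = date.fromisoformat(row["account_open_date"])
    row["balance_derived"] = float(row["balance"])
    return row

align_types({"account_open_date": "2021-01-04", "balance": "100.50"})
```

If a cast ever fails, you still have the original string in the un-suffixed field to fall back on.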
Respectfully,
Cindi