Are these any guidelines for defining which timezone the timestamp should be stored in?
Using Azure SQL for the raw vault.
what datetime option to choose and any considerations for the same
Use datetime data type. it does not have timezone awareness
Use datetimeoffset data type.It has timezone awareness.
Facing below challenges and would like to get some perspectives and any guideline from other members here.
Most source systems provide data in the local date time. they dont mention the timezone information in the timestamp field, though.
If choosing to keep all timestamps in UTC, then need to convert the incoming timestamps (which dont have timezone in it) to UTC timeszone. where to do this? Is this a hard rule to implement before loading to raw vault and should all businesss datetime columns from source be converted as well to UTC?
If converting all dates to UTC in the raw vault, we will also need to convert back to local timezone
when building information marts as the business users would like to see the date times in local time
zone.
If choosing to keep all datetimes with timezone in local timezone, still need to add timezone in the datetime columns for those sources which don’t add timezone in the date time columns. Again, the question is where to do this conversion? Is this a hard rule to implement before loading to raw vault and should all business datetime columns from source be converted as well to local time zone?
Don’t track timezone, just use the keep the datetime columns. In this way, the sources which sends data in local timezone don’t need any conversion. For any source which sends data in any other timezone, we convert it to local timezone in stage area before storing in the raw vault.
I think you are overthinking. My suggestion is that you try to keep it simple. Dont convert any dates sent from a source system. But that does not mean that you cannot also store the date(in utc, with offset or whatever) in another field in your satellites. Then you have the original date in one field and another field contains a calculated(business driven) date.
Remember what you are defining here and it may solve your Qs for you:
There are predominantly two date time stamps in a DV model
load time stamp, the timestamp of when the data is loaded to a DV and
applied time stamp, essentially the state of the data as you received it, i.e. the extract date
These two dates are never business dates and therefore not localised to a region where the event occurred but rather the timestamps of where the data has come to rest.
Defining these localised timestamps is a hard rule defined as such for loading your raw vault,
Thanks @patrickcuba for the clarification. I understand about the load time stamp and applied timestamp.
how about the business dates and timestamps which occur in business data. do we perform timezone conversion for them as well to standardize the timezone for all business data date columns?
In some cases, the source system don’t mention the timezone in their timestamp fields. So, if the target column is defined as datetimeoffset (which is timezone aware) and we don’t mention timezone when inserting, it defaults to UTC timezone which makes the timestamp incorrect.
Alright!
So imagine if you’re doing a report on 4pm status of all account balances for New York, but your DV is in Sydney. But you run a similar report for 4pm Sydney too.
It might be that each loads a seperate satellite table (probably) around the same hub table (conceptually/semantically the same thing). To rationalise the key report metrics you’d need a PIT to consolidate that into a single dashboard/info-mart.
What will also be useful is some indicator on where that data came from, i.e. a timezone column or as you have suggested an additional column with UTC.
This way you have done nothing to the facts because this leads to tech debt. You’d have to get clever in managing special cases like daylight savings and so on.