How do you correctly model IoT data whose rows cannot be uniquely identified and where deltas cannot be detected?
We have rows that are 100% identical, rows that are otherwise identical except for one changing value, and rows that are entirely different, as shown in the example picture.
I was thinking about making a link table with the business keys (Hub1, Hub2) and the timestamp as a dependent child. The link hash key would be HASH(HUB1, HUB2, timestamp). A multi-active satellite would then attach to the link via the same link hash key HASH(HUB1, HUB2, timestamp) and hold all the descriptive data plus the value column. A subsequence could be generated for the MA-SAT, but it might not even be required on Snowflake.
The data never changes, but the same data can arrive twice. The MA-SAT would then contain the newest rows per link HK based on the LDTS.
Is this the correct approach?
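To make the proposal concrete, here is a minimal sketch of the link hash key described above. All column and function names are hypothetical, and MD5 stands in for whatever hash function the platform actually uses; the point is only that the timestamp is included as a hash component alongside the two business keys.

```python
import hashlib

def link_hash_key(hub1_bk: str, hub2_bk: str, timestamp: str) -> str:
    """Illustrative link hash key per the proposal: hash over both
    business keys plus the event timestamp (names are hypothetical)."""
    # Delimit components so ("AB", "C") and ("A", "BC") never collide.
    payload = "||".join([hub1_bk.strip().upper(),
                         hub2_bk.strip().upper(),
                         timestamp])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Two physically identical readings produce the same hash key,
# so pure duplicates collapse onto a single link row:
k1 = link_hash_key("sensor-42", "plant-7", "2023-05-01T12:00:00")
k2 = link_hash_key("sensor-42", "plant-7", "2023-05-01T12:00:00")
assert k1 == k2
```

Under this scheme readings at different timestamps get different link hash keys, which is exactly the design point debated further down in the thread.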
First of all, is that data correct or nonsense? Do you know from the sensor manufacturer/configurer what these fields mean, and whether two observations at the exact same time have a meaning? Is the source sending all the data available via the API, or are there other fields you're not receiving?
Sounds like that IoT system is pretty crappy if it delivers pure duplicates.
IoT data belongs in a non-historized link (NHL).
My suggestion for the duplicates is to calculate a field in the stage environment and do a distinct select when loading the NHL.
Add a new field named NrOfRowsPerBK. Most rows will contain the value 1, but a duplicate row will contain a value greater than 1.
This way you filter the data in stage without losing any information, and you can easily load your NHL.
Never ever hash the timestamp; the data should be loaded into a link satellite with the timestamp as a dependent-child key.
As for the identical rows (though you counter your own statement by saying they aren't truly identical): still a link satellite.
Consider sat-splitting if the identical data in fact describes the business object rather than the transaction between business objects.
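The advice above, hash only the business keys and carry the timestamp as a plain dependent-child column, can be sketched like this. Table layouts and names are hypothetical, and MD5 is only a stand-in hash.

```python
import hashlib

def hk(*parts: str) -> str:
    # Hash only business keys; the timestamp is deliberately excluded.
    joined = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Link: one row per HUB1/HUB2 relationship.
link_row = {"link_hk": hk("sensor-42", "plant-7"),
            "hub1_hk": hk("sensor-42"),
            "hub2_hk": hk("plant-7")}

# Link satellite: (link_hk, event_ts) together form the unique key.
# The timestamp itself stays plain and unhashed as a dependent child.
sat_rows = [
    {"link_hk": link_row["link_hk"],
     "event_ts": "2023-05-01T12:00:00", "value": 21.5},
    {"link_hk": link_row["link_hk"],
     "event_ts": "2023-05-01T12:05:00", "value": 21.7},
]
```

Note that the link hash key repeats across satellite rows here; the join back to the link is still unambiguous because the satellite's uniqueness comes from the hash key plus the dependent-child timestamp, which is the crux of the disagreement in the comments below.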
The timestamp should be hashed as part of the LINK HK, right? The timestamp field itself would of course stay plain and untouched.
As for the dupes, I need to have another chat with the source provider about the duplicates.
Never include timestamps in hash keys.
Hmm, the link HK will then be the same for all records referring to HUB1 and HUB2? So joining the link to the sat causes a Cartesian product, and you don't really know which of the link's rows is paired with each entry in the sat.
If the link HK is unique in the link, how would you ever get a Cartesian product by joining the link to the link satellite?
I don't know if I'm tripping or if we're speaking different languages, but the link HK will not be unique if the timestamp is excluded from the hash. Refer to this picture.
Of course it will be… otherwise you're doing Data Vault wrong.