How to model repeating IoT data

sosmaf · 11 May 2023 09:55

Hey,
How do I correctly model IoT data whose rows cannot be uniquely identified and for which a delta cannot be detected?

We have rows that are 100% identical, rows that are otherwise identical except for one changing value, and rows that are not identical at all, as shown in the example picture.

I was thinking about making a link table with the business keys (Hub1, Hub2) and the timestamp as a dependent child. The link hash key would be HASH(HUB1,HUB2,timestamp), and there would be a multi-active satellite (MA-SAT) that joins to the link on the same link hash key HASH(HUB1,HUB2,timestamp) and holds all the descriptive data and the value column. A sub-sequence could be generated for the MA-SAT, but it might not even be required on Snowflake.
The data never changes, but the same data can arrive twice. The MA-SAT would then contain the newest rows per link HK based on the LDTS.
Is this the correct approach?

Nat · 11 May 2023 14:05

First of all, is that data correct or nonsense? Do you know from the sensor manufacturer/configurer what the fields here mean, and whether two observations at precisely the same time mean anything? Is the source sending all the available data via the API, or are there other fields that you're not getting?

AHenning · 14 May 2023 04:26

Sounds like that IoT system is pretty crappy if it delivers pure duplicates.
IoT data belongs in a non-historized link (NHL).
My suggestion for the duplicates is to calculate a field in the staging environment and do a SELECT DISTINCT when loading the NHL.
Example:
A new field named NrOfRowsPerBK. Most rows will contain the value 1, but a duplicated row will contain a value larger than 1.
This way you filter the data in stage without losing any information, and you can easily load your NHL.
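
A minimal Python sketch of that staging step, assuming the count is taken over the full staged row (business keys, timestamp and value). The sample rows and the function name dedupe_with_count are hypothetical; only NrOfRowsPerBK comes from the suggestion above.

```python
from collections import Counter

# Hypothetical staged rows: (hub1_bk, hub2_bk, timestamp, value)
staged_rows = [
    ("device-1", "sensor-a", "2023-05-11T09:00:00", 20.5),
    ("device-1", "sensor-a", "2023-05-11T09:00:00", 20.5),  # pure duplicate
    ("device-1", "sensor-a", "2023-05-11T09:01:00", 20.7),
]

def dedupe_with_count(rows):
    """Collapse identical rows and append NrOfRowsPerBK to each surviving row."""
    counts = Counter(rows)  # keys keep first-seen order
    return [row + (n,) for row, n in counts.items()]

# Rows ready for the NHL load: one per distinct staged row, plus its count.
nhl_rows = dedupe_with_count(staged_rows)
for row in nhl_rows:
    print(row)
```

The counts add up to the number of staged rows, so the duplicates are still accounted for even though only distinct rows are loaded.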
Good luck!

patrickcuba · 14 May 2023 23:52

Never ever hash the timestamp. The data should be loaded to a link-satellite with the timestamp as a dependent-child key.

As for the identical rows (although you then counter your own statement by saying they aren't all identical): still a link-satellite.

Consider sat-splitting if in fact the identical data is describing the business object and not the transaction between business objects.
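
A rough Python sketch of that layout, under stated assumptions: MD5 as the hash function, the `||` delimiter, the trim/upper-case normalization, and the sample readings are all illustrative choices, not something the thread specifies. The point is only that the link hash key is built from the business keys alone, while the timestamp stays a plain dependent-child column in the link-satellite key.

```python
import hashlib

def link_hash_key(*business_keys, delimiter="||"):
    """Hash the business keys only; the timestamp is deliberately left out."""
    normalized = delimiter.join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hypothetical readings: (hub1_bk, hub2_bk, timestamp, value)
readings = [
    ("device-1", "sensor-a", "2023-05-11T09:00:00", 20.5),
    ("device-1", "sensor-a", "2023-05-11T09:01:00", 20.7),
    ("device-2", "sensor-a", "2023-05-11T09:00:00", 18.1),
]

link = {}      # link_hk -> (hub1_bk, hub2_bk): one row per relationship
link_sat = {}  # (link_hk, timestamp) -> value: timestamp is the dependent child

for hub1, hub2, ts, value in readings:
    hk = link_hash_key(hub1, hub2)
    link.setdefault(hk, (hub1, hub2))
    link_sat[(hk, ts)] = value

print(len(link), "link rows,", len(link_sat), "link-satellite rows")
```

Both readings for device-1/sensor-a share one link row, and the satellite tells them apart by timestamp.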

sosmaf · 15 May 2023 05:34

The timestamp should be hashed as part of the link HK, right? But of course the timestamp field itself is kept plain and untouched.
As for the dupes, I need to have another chat with the source provider about them.

patrickcuba · 15 May 2023 05:49

Never include timestamps in the hash keys.

sosmaf · 15 May 2023 05:59

Hmm, won't the link HK then be the same for all records referring to HUB1 and HUB2? Joining the link to the sat would then cause a Cartesian product, and you wouldn't really know which of the link's rows is paired with each entry in the sat.

patrickcuba · 15 May 2023 07:11

If the link-hk is unique in the link, how will you ever get a Cartesian product by joining the link to the link sat?

SQL 101
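
To make the join argument concrete, here is a small sketch using Python's built-in sqlite3 module. The table names, column names and the literal key 'hk1' are hypothetical; the sketch assumes the link holds one row per link_hk and the link sat is keyed by (link_hk, event_ts).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per relationship: link_hk is unique in the link.
    CREATE TABLE link (link_hk TEXT PRIMARY KEY, hub1_bk TEXT, hub2_bk TEXT);
    -- Many rows per link_hk, told apart by the dependent-child timestamp.
    CREATE TABLE link_sat (
        link_hk TEXT, event_ts TEXT, value REAL,
        PRIMARY KEY (link_hk, event_ts)
    );
    INSERT INTO link VALUES ('hk1', 'device-1', 'sensor-a');
    INSERT INTO link_sat VALUES
        ('hk1', '2023-05-11T09:00:00', 20.5),
        ('hk1', '2023-05-11T09:01:00', 20.7),
        ('hk1', '2023-05-11T09:02:00', 20.9);
""")

sat_count = conn.execute("SELECT COUNT(*) FROM link_sat").fetchone()[0]
joined_count = conn.execute(
    "SELECT COUNT(*) FROM link l JOIN link_sat s ON s.link_hk = l.link_hk"
).fetchone()[0]

print(sat_count, joined_count)  # prints "3 3"
```

Because each link_hk appears once in the link, the join returns exactly one row per satellite row and nothing fans out.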

sosmaf · 15 May 2023 07:35

I don’t know if I’m tripping or if we’re speaking different languages, but the link-hk will not be unique if the timestamp is excluded from the hash. Refer to this picture:
[image: timestamp_hashed]

patrickcuba · 15 May 2023 07:47

Of course it will… otherwise you’re doing Data Vault wrong.
