I’m implementing the SQL scripts that create and load data into the target hubs, links, and satellites.
How can we manage the referential integrity of links and satellites to prevent them from having any orphaned records?
As part of our ETL process, I’m using Java code to run the jobs.
The book “The Data Vault Guru” mentions orphan checks on page 343, but I don’t know how to run the check when links/satellites depend on the hub logically but not physically (there are no enforced foreign key constraints). I have a plan to use a MERGE in the Java code. Can you give me some hints?
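Roughly, what I have in mind is a MERGE that inserts into the hub any keys arriving in staging that the hub does not yet contain, so the link load can never reference a missing hub row. This is just my own sketch, not the book’s exact pattern; hub_customer, stage_customer, and the column names are placeholders from my model:

-- Sketch: make sure every hub key in staging exists in the hub
-- before the link/satellite loads run
MERGE INTO hub_customer h
USING (
    SELECT DISTINCT
        DV_HASHKEY_HUB_CUSTOMER,
        DV_TENANTID,
        CUSTOMER_BK,
        DV_LOADTS,
        DV_RECSOURCE
    FROM stage_customer
) s
ON  h.DV_HASHKEY_HUB_CUSTOMER = s.DV_HASHKEY_HUB_CUSTOMER
AND h.DV_TENANTID = s.DV_TENANTID
WHEN NOT MATCHED THEN
    INSERT (DV_HASHKEY_HUB_CUSTOMER, DV_TENANTID, CUSTOMER_BK, DV_LOADTS, DV_RECSOURCE)
    VALUES (s.DV_HASHKEY_HUB_CUSTOMER, s.DV_TENANTID, s.CUSTOMER_BK, s.DV_LOADTS, s.DV_RECSOURCE);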
A few pages later in that chapter there is 9.2 Link (tester), and later still a sub-topic called 4-Orphan check with pseudocode; the same applies to 9.3 Satellite (sat tester).
I have a problem with the testing time for the orphan check below, which verifies that the hub keys loaded to the link exist in the adjacent hubs, since I have a huge amount of data: around 50 million records in the target table (hub).
select count(*) as err_count
from {{link-tablename}} l
LEFT OUTER JOIN {{hub-tablenames}} h
    on l.DV_HASHKEY_{{HUB-TABLENAME}}1 = h.DV_HASHKEY_{{hub-tablenames}}
    and l.DV_TENANTID = h.DV_TENANTID
where h.DV_HASHKEY_{{hub-tablenames}} is null
Do you have any suggestions? For your information, I’m also using SYS-ROWVERSION in my Metadata.
Thank you in advance for your attention,
Best Regards
Sudi
Of course you should limit the tests to the new data being added to your DV; you don’t need to re-check what was already there (that has already been tested). On Snowflake I’d recommend using Streams on hubs, links, and satellites so that these tests run only against new data.
You’d need some sort of watermarking to achieve the same thing on other platforms.
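For example, something along these lines (just a sketch; my_link, my_hub, and dv_test_log are placeholder names):

-- One-time setup: an append-only stream that tracks rows inserted into the link
CREATE OR REPLACE STREAM my_link_test_stream ON TABLE my_link APPEND_ONLY = TRUE;

-- Each test run: the orphan check scans only the new link rows in the stream.
-- Writing the result into a log table is the DML that consumes the stream,
-- so the next run sees only rows loaded after this one.
INSERT INTO dv_test_log (test_name, run_ts, err_count)
SELECT 'link orphan check',
       CURRENT_TIMESTAMP(),
       COUNT(*)
FROM my_link_test_stream l
LEFT OUTER JOIN my_hub h
    ON  l.DV_HASHKEY_MY_HUB = h.DV_HASHKEY_MY_HUB
    AND l.DV_TENANTID = h.DV_TENANTID
WHERE h.DV_HASHKEY_MY_HUB IS NULL;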
Thank you so much for the response and information.
I will order by SYS-ROWVERSION and use a delta value to limit the scope.
I think the only performance problem left will be the ORDER BY, which takes more time…
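Concretely, something like this is my plan (a sketch; etl_watermark and last_rowversion are made-up control-table names, and I’m assuming the column is called SYS_ROWVERSION). Keeping the last tested value in a control table might even let me drop the ORDER BY entirely:

-- Orphan check limited to the delta since the last tested row version
select count(*) as err_count
from {{link-tablename}} l
LEFT OUTER JOIN {{hub-tablenames}} h
    on l.DV_HASHKEY_{{HUB-TABLENAME}}1 = h.DV_HASHKEY_{{hub-tablenames}}
    and l.DV_TENANTID = h.DV_TENANTID
where h.DV_HASHKEY_{{hub-tablenames}} is null
  and l.SYS_ROWVERSION > (select last_rowversion from etl_watermark
                          where table_name = '{{link-tablename}}');

-- After a successful test, advance the watermark; using MAX() instead of
-- ORDER BY avoids a full sort of the table
update etl_watermark
set last_rowversion = (select max(SYS_ROWVERSION) from {{link-tablename}})
where table_name = '{{link-tablename}}';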