How can I control the referential integrity in Link and satellite during parallel loading?

Hello,

I’m implementing the SQL script to create and load data into the target Hubs/Links and satellites.
How can we manage the referential integrity of links and satellites to prevent them from having any orphaned records?
As part of our ETL process, I’m using Java code to run the jobs.
The book “The Data Vault Guru,” page 343, mentions orphan checks. But I don’t know how to check while, logically, links/satellites depend on the hub but not physically. I have a plan to use merge in Java code. Can you give me some hints?

Thank you in advance and best regards

A few pages later in that chapter I have 9.2 Link (tester) and then later still there is a sub-topic called 4-Orphan check with pseudo code, same for 9.3 Satellite (sat tester).

1 Like

Also… I do discuss testing and orchestration of testing here: Data Vault Test Automation. Modern day data analytics platforms… | by Patrick Cuba | Medium

1 Like

Dear Patrick,

Thank you so much for your reply and information :slight_smile: I will check them.

Best Regards,
Sudi

I have a problem regarding the testing time for the orphan check based on the below check for checking that hub keys loaded to the link exist in adjacent hubs since I have a huge amount of data, around 50 million records, in the target table ( Hub).

select count() err_count*
from {{link-tablename}} l
LEFT OUTER JOIN {{hub-tablenames}} h
on l.DV_HASHKEY_{{HUB-TABLENAME}}1 = h.DV_HASHKEY_{{hub-tablenames}}
and l.DV_TENANTID = h.DV_TENANTID
where h.DV_HASHKEY_{{hub-tablenames}} is null

Do you have any suggestions? For your information, I’m also using SYS-ROWVERSION in my Metadata.

Thank you in advance for your attention,
Best Regards
Sudi

Of course you should limit to new data being added to your DV. You don’t need to check what was already there (that’s already been tested). On Snowflake I’d recommend using Streams on hubs, links and satellites to process these tests only on new data.

You’d need some sort of watermarking to achieve this on other platforms

1 Like

Dear Patrick,

Thank you so much for the response and information.
I will order by sys-rowversion and use delta value to make it limit.
I think the only problem for performance will be order by which is taking more time…

Then I will test the stream.

Thanks again for the input :+1: :slightly_smiling_face:

Best Regards
Sudi

1 Like