I’m implementing the SQL scripts that create and load data into the target hubs, links, and satellites.
How can we manage the referential integrity of links and satellites to prevent them from having any orphaned records?
As part of our ETL process, I’m using Java code to run the jobs.
The book “The Data Vault Guru” mentions orphan checks on page 343, but I don’t know how to run the check when links/satellites depend on the hub logically but not physically (there are no enforced foreign key constraints). I have a plan to use a MERGE in the Java code. Can you give me some hints?
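Roughly, what I have in mind is a MERGE that inserts into the hub any keys arriving in staging that the hub does not yet contain, so the link load can never reference a missing hub row. This is just my own sketch, not the book’s exact pattern; hub_customer, stage_customer, and the column names are placeholders from my model:

-- Sketch: make sure every hub key in staging exists in the hub
-- before the link/satellite loads run
MERGE INTO hub_customer h
USING (
    SELECT DISTINCT
        DV_HASHKEY_HUB_CUSTOMER,
        DV_TENANTID,
        CUSTOMER_BK,
        DV_LOADTS,
        DV_RECSOURCE
    FROM stage_customer
) s
ON  h.DV_HASHKEY_HUB_CUSTOMER = s.DV_HASHKEY_HUB_CUSTOMER
AND h.DV_TENANTID = s.DV_TENANTID
WHEN NOT MATCHED THEN
    INSERT (DV_HASHKEY_HUB_CUSTOMER, DV_TENANTID, CUSTOMER_BK, DV_LOADTS, DV_RECSOURCE)
    VALUES (s.DV_HASHKEY_HUB_CUSTOMER, s.DV_TENANTID, s.CUSTOMER_BK, s.DV_LOADTS, s.DV_RECSOURCE);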
A few pages later in that chapter there is 9.2 Link (tester), and later still a sub-topic called 4-Orphan check with pseudocode; the same applies to 9.3 Satellite (sat tester).
I have a problem with the testing time for the orphan check below, which verifies that the hub keys loaded to the link exist in the adjacent hubs, since I have a huge amount of data: around 50 million records in the target table (hub).
select count(*) as err_count
from {{link-tablename}} l
LEFT OUTER JOIN {{hub-tablenames}} h
    on l.DV_HASHKEY_{{HUB-TABLENAME}}1 = h.DV_HASHKEY_{{hub-tablenames}}
    and l.DV_TENANTID = h.DV_TENANTID
where h.DV_HASHKEY_{{hub-tablenames}} is null
Do you have any suggestions? For your information, I’m also using SYS-ROWVERSION in my Metadata.
Thank you in advance for your attention,
Best Regards
Sudi
Of course you should limit the tests to the new data being added to your DV; you don’t need to re-check what was already there (that has already been tested). On Snowflake I’d recommend using Streams on hubs, links, and satellites so that these tests run only against new data.
You’d need some sort of watermarking to achieve the same thing on other platforms.
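For example, something along these lines (just a sketch; my_link, my_hub, and dv_test_log are placeholder names):

-- One-time setup: an append-only stream that tracks rows inserted into the link
CREATE OR REPLACE STREAM my_link_test_stream ON TABLE my_link APPEND_ONLY = TRUE;

-- Each test run: the orphan check scans only the new link rows in the stream.
-- Writing the result into a log table is the DML that consumes the stream,
-- so the next run sees only rows loaded after this one.
INSERT INTO dv_test_log (test_name, run_ts, err_count)
SELECT 'link orphan check',
       CURRENT_TIMESTAMP(),
       COUNT(*)
FROM my_link_test_stream l
LEFT OUTER JOIN my_hub h
    ON  l.DV_HASHKEY_MY_HUB = h.DV_HASHKEY_MY_HUB
    AND l.DV_TENANTID = h.DV_TENANTID
WHERE h.DV_HASHKEY_MY_HUB IS NULL;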
Thank you so much for the response and information.
I will order by SYS-ROWVERSION and use a delta value to limit the scope.
I think the only performance problem left will be the ORDER BY, which takes more time…
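Concretely, something like this is my plan (a sketch; etl_watermark and last_rowversion are made-up control-table names, and I’m assuming the column is called SYS_ROWVERSION). Keeping the last tested value in a control table might even let me drop the ORDER BY entirely:

-- Orphan check limited to the delta since the last tested row version
select count(*) as err_count
from {{link-tablename}} l
LEFT OUTER JOIN {{hub-tablenames}} h
    on l.DV_HASHKEY_{{HUB-TABLENAME}}1 = h.DV_HASHKEY_{{hub-tablenames}}
    and l.DV_TENANTID = h.DV_TENANTID
where h.DV_HASHKEY_{{hub-tablenames}} is null
  and l.SYS_ROWVERSION > (select last_rowversion from etl_watermark
                          where table_name = '{{link-tablename}}');

-- After a successful test, advance the watermark; using MAX() instead of
-- ORDER BY avoids a full sort of the table
update etl_watermark
set last_rowversion = (select max(SYS_ROWVERSION) from {{link-tablename}})
where table_name = '{{link-tablename}}';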