Initial Loading of Link Tables from Staging

Bigguy365 · 3 October 2022 18:14

I’m requesting clarification on the topic in your book dealing with the loading of link tables in the Automation and Orchestration chapter.

Your diagram indicates a dependency exists before the link table is created. However, there are instances in the chapter where this dependency is not specifically mentioned. Instead, it is implied that the link table can be created even if there is no table to be linked to it other than the table currently being loaded. I realize that a link table will be needed in the future but does creating a link table with only 1 table make sence?

Thanks

Clay

FrenkLoonen · 4 October 2022 08:54

When you’re saying “your”, you’re talking to @patrickcuba right?

This diagram is showing the DV1.0 loading pattern, where sequence keys were being used instead of hash keys. So you had to load your Hubs first, in which sequence keys were generated that you would then lookup in your Link load. So the order your load would be in was:
Hub → Link+HubSat → LinkSat (as mentioned in the figure description).

Hence the ‘staggered’ load.

If you don’t apply the correct mappings to your Hubs, you can end up with a Link that you can’t join back to the Hub to retrieve the Business Key. But if you do, it doesn’t matter what order the Hubs and Link get loaded in. At the end of the loading process, all tables will be populated correctly.

Firgure 7-3 is the one you should be looking at

Bigguy365 · 4 October 2022 11:51

Thank you for the explanation of the differences between Figures 7-2 and 7-3. Unfortunately, I’m still confused about how to handle the loading of a link (or bypassing the loading) when you ONLY have 1 of the tables available. My question was → “Instead, it is implied that the link table can be created even if there is no table to be linked to it other than the table currently being loaded. I realize that a link table will be needed in the future but does creating a link table with only 1 table make sense?”
Thanks

Clay

FrenkLoonen · 4 October 2022 12:21

I think I don’t fully understand the question… what do you mean with “you only have 1 of the tables available”. Do you mean only 1 hub? You need two Hubs for creating a Link (not taking into account Same-As Links or Hierarchical Links but I don’t think you’re referring to those right now). Otherwise, what are you modeling in your Link? A relationship between your one Hub, and… what?

Can you maybe show some (dummy) source data that you’re trying to model?

Bigguy365 · 4 October 2022 13:08

What is confusing to me is in parallel processing, links can be loaded at the same time because there is no dependency. I’m interpreting that to mean that a link can be created with only 1 known hub and the second hub being a Zero Key or Missing Key. Is this a common practice or do you wait until you have at least 2 hubs that you want to link? If it’s the latter then the parallel processing WILL Not be parallel in every situation. Yes??

FrenkLoonen · 4 October 2022 15:46

The way the parallel load works is as follows:

Let’s say we have a source with:
party_id, address_id
clay, your street 123
frenk, my lane 456

Which you will load into the following tables:
hub_party:
party_hkey, party_id
hash(clay), clay
hash(frenk), frenk

hub_address:
address_hkey, address_id
hash(your street 123), your street 123
hash(my lane 456), my lane 456

link_party_address:
link_party_address_hkey, party_hkey, address_hkey
hash(clay||your street 123), hash(clay), hash(your street 123)
hash(frenk||my lane 456), hash(frenk), hash(my lane 456)

You can load all three tables at the same time because there is no loading dependency between them. The only dependency in this scenario is that you of course need to have the data in staging to be able to load the link. If you don’t have the address_id, creating the link doesn’t make sense.

Bigguy365 · 4 October 2022 16:45

Got it. Thank you very much.

Clay

patrickcuba · 4 October 2022 22:45

Hello, I recognize that screenshot!

I think this is resolved right?

DV1 pattern relied on parent tables being populated first before a child table could be populated by taking that surrogate sequence key and reusing it in that child table — that sequence key was then used to join: very efficiently.

But along came DV2 where we use hash keys instead, because surrogate hash keys are deterministic you can load all related tables at the same time — they may still be referred to as parent-child relation to each other but their loads no longer need to imply it.

I emphasise the point here too: You might be doing #datavault Wrong! | by Patrick Cuba | Snowflake | Medium

@FrenkLoonen is spot on

Sudi · 24 April 2024 12:50

Thanks for the simple explanation.

I have a question about the one step before loading; when we want to create a table, then we need to define the F.K columns for the Link in DDL. Will we have any problem regarding the loading because of the F.K constraint since by loading, we will load the hash keys from Hubs into the Link in parallel?

Thank you in advance for your attention,
Sudi

Topic		Replies	Views
How to parallelize hubs and links loads and automatize them as much as possible ? Data Vault 2.0 link	11	1277	27 March 2023
When insert in parent table? Data Vault 2.0	10	257	16 June 2022
ETL pattern for Stage->DV Data Vault 2.0	6	270	22 May 2024
Reason for Record Source Column in Link tables Data Vault 2.0	4	595	24 January 2023
What is the rationale for not loading links containing NULL values in dbtvault? AutomateDV (formerly known as dbtvault)	1	419	20 May 2022

Initial Loading of Link Tables from Staging

Related topics