Initial Loading of Link Tables from Staging

I’m requesting clarification on the topic in your book dealing with the loading of link tables in the Automation and Orchestration chapter.


Your diagram indicates a dependency exists before the link table is created. However, there are instances in the chapter where this dependency is not specifically mentioned. Instead, it is implied that the link table can be created even if there is no table to be linked to it other than the table currently being loaded. I realize that a link table will be needed in the future but does creating a link table with only 1 table make sence?

Thanks

Clay

When you’re saying “your”, you’re talking to @patrickcuba right? :slight_smile:

This diagram is showing the DV1.0 loading pattern, where sequence keys were being used instead of hash keys. So you had to load your Hubs first, in which sequence keys were generated that you would then lookup in your Link load. So the order your load would be in was:
Hub → Link+HubSat → LinkSat (as mentioned in the figure description).

Hence the ‘staggered’ load.

If you don’t apply the correct mappings to your Hubs, you can end up with a Link that you can’t join back to the Hub to retrieve the Business Key. But if you do, it doesn’t matter what order the Hubs and Link get loaded in. At the end of the loading process, all tables will be populated correctly.

Firgure 7-3 is the one you should be looking at :slight_smile:

Thank you for the explanation of the differences between Figures 7-2 and 7-3. :slightly_smiling_face: Unfortunately, I’m still confused about how to handle the loading of a link (or bypassing the loading) when you ONLY have 1 of the tables available. :thinking: My question was → “Instead, it is implied that the link table can be created even if there is no table to be linked to it other than the table currently being loaded. I realize that a link table will be needed in the future but does creating a link table with only 1 table make sense?”
Thanks

Clay

I think I don’t fully understand the question… what do you mean with “you only have 1 of the tables available”. Do you mean only 1 hub? You need two Hubs for creating a Link (not taking into account Same-As Links or Hierarchical Links but I don’t think you’re referring to those right now). Otherwise, what are you modeling in your Link? A relationship between your one Hub, and… what?

Can you maybe show some (dummy) source data that you’re trying to model?

What is confusing to me is in parallel processing, links can be loaded at the same time because there is no dependency. I’m interpreting that to mean that a link can be created with only 1 known hub and the second hub being a Zero Key or Missing Key. Is this a common practice or do you wait until you have at least 2 hubs that you want to link? If it’s the latter then the parallel processing WILL Not be parallel in every situation. Yes??

The way the parallel load works is as follows:

Let’s say we have a source with:
party_id, address_id
clay, your street 123
frenk, my lane 456

Which you will load into the following tables:
hub_party:
party_hkey, party_id
hash(clay), clay
hash(frenk), frenk

hub_address:
address_hkey, address_id
hash(your street 123), your street 123
hash(my lane 456), my lane 456

link_party_address:
link_party_address_hkey, party_hkey, address_hkey
hash(clay||your street 123), hash(clay), hash(your street 123)
hash(frenk||my lane 456), hash(frenk), hash(my lane 456)

You can load all three tables at the same time because there is no loading dependency between them. The only dependency in this scenario is that you of course need to have the data in staging to be able to load the link. If you don’t have the address_id, creating the link doesn’t make sense.

1 Like

Got it. Thank you very much.

Clay

Hello, I recognize that screenshot!

I think this is resolved right?

DV1 pattern relied on parent tables being populated first before a child table could be populated by taking that surrogate sequence key and reusing it in that child table — that sequence key was then used to join: very efficiently.

But along came DV2 where we use hash keys instead, because surrogate hash keys are deterministic you can load all related tables at the same time — they may still be referred to as parent-child relation to each other but their loads no longer need to imply it.

I emphasise the point here too: You might be doing #datavault Wrong! | by Patrick Cuba | Snowflake | Medium

@FrenkLoonen is spot on :slight_smile:

Thanks for the simple explanation.

I have a question about the one step before loading; when we want to create a table, then we need to define the F.K columns for the Link in DDL. Will we have any problem regarding the loading because of the F.K constraint since by loading, we will load the hash keys from Hubs into the Link in parallel?

Thank you in advance for your attention,
Sudi