Optimizing SAT Loading from Staging as a VIEW in Snowflake

Hello all,

We have Staging as a VIEW. The underlying query producing the Staging VIEW is complex, with a lot of UNIONs and JOINs over a lot of data. However, the delta that gets loaded into the SAT in a single run is < 100 records.

We are currently using the following SAT loading pattern:

where not exists
  (
    select 1
    from vault.rv.satl_opp as dv
    where dv.link_hashkey = stage.link_hashkey
      and dv.load_date = (select max(dv1.load_date)
                          from vault.rv.satl_opp as dv1
                          where dv.link_hashkey = dv1.link_hashkey)
      and dv.hashdiff = stage.hashdiff
  );
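
For context, that predicate sits inside an INSERT along these lines (simplified, column list trimmed, so treat it as illustrative rather than our exact code):

  insert into vault.rv.satl_opp (link_hashkey, hashdiff, load_date)
  select stage.link_hashkey,
         stage.hashdiff,
         current_timestamp() as load_date
  from vault.stage.stg_view as stage
  where not exists
    (
      -- the correlated "latest row per link_hashkey" check shown above
      select 1
      from vault.rv.satl_opp as dv
      where dv.link_hashkey = stage.link_hashkey
        and dv.load_date = (select max(dv1.load_date)
                            from vault.rv.satl_opp as dv1
                            where dv.link_hashkey = dv1.link_hashkey)
        and dv.hashdiff = stage.hashdiff
    );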

Loading these < 100 records takes 10+ minutes on a Large Snowflake warehouse. How can we optimize this loading pattern?

Thanks.

Here is one option I am thinking of:

  with new_data as (
    select pk
    from vault.stage.stg_view as stage
    where not exists
      (
        select 1
        from vault.raw_vault.sat as dv
        where dv.hashkey = stage.hashkey
          and dv.load_date = (select max(dv1.load_date)
                              from vault.raw_vault.sat as dv1
                              where dv.hashkey = dv1.hashkey)
          and dv.hashdiff = stage.hashdiff
      )
  )
  select *
  from vault.stage.stg_view as stage
  where stage.pk in (select pk from new_data);

I am seeing a > 30% reduction in execution time with this pattern.

Note: PK is HASHKEY + LOAD_DATE + [APPLIED_DATE]
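
Another variant worth trying (a sketch only, using the same assumed object names): compute the current satellite row per hashkey once with QUALIFY instead of the correlated MAX() subquery, then anti-join the stage to it:

  with current_sat as (
    -- latest satellite row per hashkey
    select hashkey, hashdiff
    from vault.raw_vault.sat
    qualify row_number() over (partition by hashkey order by load_date desc) = 1
  )
  select stage.*
  from vault.stage.stg_view as stage
  left join current_sat
         on current_sat.hashkey  = stage.hashkey
        and current_sat.hashdiff = stage.hashdiff
  where current_sat.hashkey is null;   -- new hashkey or changed hashdiff

The satellite is then scanned once for the window function rather than once per stage row for the correlated subquery; whether that actually helps should show up in the query profile.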

Are you on Snowflake?

If so, you can use Streams on Views.

I developed a pattern for this: Kappa Vault. It is able to handle data loads at any velocity and any volume and still maintain Satellite table integrity. It’s on my Medium.
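
For example (names are illustrative; change tracking must be possible on the tables under the view):

  -- a stream on the staging view: the first read can return existing rows,
  -- after that it only returns rows changed since the last consumed offset
  create or replace stream vault.stage.stg_view_stream
    on view vault.stage.stg_view
    show_initial_rows = true;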

Thanks @patrickcuba . I will try that out.

Couple of questions:

  1. Since Streams will only produce records that have been updated since the last DML, do we even need to test whether a record already exists in the SAT? Or should we just load it without checking?
  2. Let’s say something goes wrong and the SAT load fails after reading from the Stream. The next time we read from the Stream, we will not get the data that we failed to load into the SAT in the last execution.

Thanks,
Saqib

  1. Ah, so follow through to my Kappa Vault doc; it explains what this pattern is solving: when the landed content has received more loads than have been applied to the target satellite, of course you must check for true change.
  2. You have to manage that by building an ops model to record the query IDs at that point in time, or by recording when the DML was executed. If a load fails you can reset the stream to that offset; refer to the Snowflake docs on how to do that (see the sketch below).
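
On 2, a minimal sketch of why this is manageable: the stream offset only advances when the DML that consumes the stream commits, so wrapping the read and the satellite insert in one transaction means a failed load leaves the offset untouched (object names assumed from earlier in the thread):

  begin;

  insert into vault.raw_vault.sat (hashkey, hashdiff, load_date)
  select strm.hashkey, strm.hashdiff, current_timestamp()
  from vault.stage.stg_view_stream as strm          -- consuming the stream
  left join vault.raw_vault.sat_current as cur      -- "current" view: latest row per hashkey
         on cur.hashkey  = strm.hashkey
        and cur.hashdiff = strm.hashdiff
  where cur.hashkey is null;                        -- true-change test

  commit;  -- only now does the stream offset move forward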

I’ll just add it here then: Kappa Vault. This article expands on “Data Vault…” | by Patrick Cuba | Snowflake | Medium

Hi @patrickcuba ,

I tried creating a Stream, but I get the following error:
Change tracking is not supported on queries with QUALIFY.

In the underlying query for the VIEW we have to use QUALIFY to get the latest data from the tables that we are joining to.
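
For reference, the QUALIFY in question is just per-key deduplication, something like this (column names are illustrative):

  select *
  from landing.sfdc.opportunity
  qualify row_number() over (partition by id order by systemmodstamp desc) = 1;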

Saqib

Consult the documentation!

I did. Looks like this is not supported in Snowflake, so this strategy won’t work.

Wonder why you’re doing that in staging!

:grinning:

You mean joining to other tables? Actually it is in a layer prior to staging. That layer is also a VIEW.

View on view?

Hmmmm, my assumptions were these:

  • create satellite
  • define two views on the sat
    — view1 uses QUALIFY because it’s only interested in the latest record per HK
    — view2 does not use QUALIFY but adds an end date and a current flag to simplify queries (both sketched below)
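
Roughly, the DDL for those two views (illustrative names):

  -- view1: current record per hashkey
  create or replace view vault.raw_vault.sat_current as
  select *
  from vault.raw_vault.sat
  qualify row_number() over (partition by hashkey order by load_date desc) = 1;

  -- view2: full history with a derived end date and current-record flag
  create or replace view vault.raw_vault.sat_history as
  select s.*,
         lead(s.load_date) over (partition by s.hashkey order by s.load_date)         as end_date,
         lead(s.load_date) over (partition by s.hashkey order by s.load_date) is null as is_current
  from vault.raw_vault.sat as s;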

The source for a sat (and its associated hub/link, whatever) is a landed file. A view is defined on top of the landed file that adds all the metadata tags, hash keys, record source, etc.
A stream on view is defined on that staging view.
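
Something like this for that staged view (hashing and column choices are illustrative only):

  create or replace view vault.stage.stg_view as
  select
      sha1_binary(upper(trim(opportunity_id)) || '||' || upper(trim(product_id))) as link_hashkey,
      sha1_binary(concat_ws('||', quantity::varchar, unit_price::varchar))        as hashdiff,
      current_timestamp()                                                         as load_date,
      'SFDC.OPPORTUNITY_PRODUCT'                                                  as record_source,
      src.*
  from landing.sfdc.opportunity_product as src;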

The satellite load code uses the stream as its source and references the current sat view (view1) for the true-change check.

This ensures re-runnability, and you only process new records in the stream. The process works: I have demoed it to customers, combining it with Tasks for a near-real-time data vault as described in the Kappa Vault article.

So here is the scenario:

We need to model the link shown below. To create the LINK and the corresponding SAT LINK using dbt, we have to create a Raw Staging VIEW which joins OPPORTUNITY_PRODUCT (SFDC) to OPPORTUNITY to get the OPP_NUM, to ACCOUNT to get the ACCOUNT_NAME, to CONTACT to get the CORP_EMAIL_ADDRESS, and to PRODUCT to get the PRODUCT_NAME. Without these joins in pre-staging we won’t be able to populate the LINK and the SAT LINK.
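
For concreteness, the pre-staging view looks roughly like this (column names are illustrative, not our exact DDL):

  create or replace view vault.stage.pre_stg_opportunity_product as
  select
      op.id  as opportunity_product_id,
      o.opp_num,
      a.account_name,
      c.corp_email_address,
      p.product_name,
      op.quantity,
      op.unit_price
  from landing.sfdc.opportunity_product as op
  join landing.sfdc.opportunity as o on o.id = op.opportunity_id
  join landing.sfdc.account     as a on a.id = o.account_id
  join landing.sfdc.contact     as c on c.id = o.contact_id
  join landing.sfdc.product     as p on p.id = op.product_id;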

Thanks,
Saqib

I think you might be in trouble here.
Applying joins in staging to “enrich” your model creates dependencies between the landed content, which is an anti-pattern.

Again, first prize is to get the source to send you what you want. DV should not be the sink of tech debt. If it is, you’re busy building a legacy system.

Hi @patrickcuba. How would you do that using dbtvault then? We have to join in raw staging to get the appropriate BKs for the HASHING in the LINK + LINK SAT.

Thanks,
Saqib

Good question. I always thought that joining in Staging is forbidden except for this single use case: getting the business keys in place in case they are not there yet (as there’s usually a surrogate primary key instead…).
What about a source table that includes a TYPE column, where we’d want that split into several hubs divided by TYPE?
Would a simple filter still be allowed in staging? What if it is even more complex and you need to join another table to do that filtering?

@patrickcuba any comment would be appreciated!?