Dropping historical entries in a computed satellite?

Hi everyone!

I would like to load a computed satellite into a business vault but am interested only in the most recent versions of the attributes for each business key.

Is it OK, from the DV2.0 standard perspective, not to include historical values in a computed satellite (and instead update the fields in place)?

Thanks for the help,

Hi, Mark. That's fine. DV2.0 lets you decide how much history is kept, and in your case the answer is zero! The raw satellites still hold all the history in case you ever need to recreate what the current record was on a given day.

No, do not delete history.

The Business Vault still contains auditable facts, just based on derived content.

If you’re using Snowflake, here’s a technique to get the current record every time: Data Vault on Snowflake: Querying really BIG satellite tables | by Patrick Cuba | Snowflake | Medium

A view that gives you the latest record (along with an end-of-time date) would be better than deleting.

If you have a reasonably sized data set (a few million rows), then a simple way is to rank with a partition over the keys, ordered so that the latest record gets rank 1, and then a WHERE clause of rank = 1 to get the latest.
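The rank/partition technique above can be sketched like this. The sketch uses SQLite (3.25+ for window functions) so it runs anywhere; the table and column names (sat_party, hk_party_h, ldts, name) are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- A toy satellite: two versions of hk1, one version of hk2.
    CREATE TABLE sat_party (hk_party_h TEXT, ldts TEXT, name TEXT);
    INSERT INTO sat_party VALUES
        ('hk1', '2024-01-01', 'Alice'),
        ('hk1', '2024-02-01', 'Alicia'),
        ('hk2', '2024-01-15', 'Bob');

    -- Rank within each key by load date descending, keep only rank 1:
    -- a view of the current record per business key.
    CREATE VIEW sat_party_current AS
    SELECT hk_party_h, ldts, name
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (PARTITION BY hk_party_h
                                  ORDER BY ldts DESC) AS rn
        FROM sat_party s
    ) AS ranked
    WHERE rn = 1;
""")

print(con.execute(
    "SELECT hk_party_h, name FROM sat_party_current ORDER BY hk_party_h"
).fetchall())
# → [('hk1', 'Alicia'), ('hk2', 'Bob')]
```

Because it is a view, nothing is deleted: the satellite keeps full history while consumers see only the latest version.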

I’ve seen another version, generated by Scalefree’s fork of dbtvault, which uses LEAD and a timestamp offset to flag the latest record as is-current. It seems to work, but I haven’t analysed it deeply.


end_dated_source AS (
    SELECT
        COALESCE(
            LEAD(ldts - INTERVAL '1 MICROSECOND') OVER (PARTITION BY hk_party_h ORDER BY ldts),
            TO_TIMESTAMP('8888-12-31T23:59:59', 'YYYY-MM-DDTHH24:MI:SS')
        ) AS ledts
        -- (rest of select clause)
    FROM foo
)

SELECT
    CASE WHEN ledts = TO_TIMESTAMP('8888-12-31T23:59:59', 'YYYY-MM-DDTHH24:MI:SS')
         THEN TRUE ELSE FALSE
    END AS is_current
    -- (rest of select clause)
FROM end_dated_source
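Here is a runnable sketch of that LEAD end-dating pattern, again using SQLite so it can be executed anywhere. SQLite has no INTERVAL arithmetic, so instead of subtracting a microsecond this version end-dates each row with the next row's load timestamp and flags the open-ended row as current; the table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE foo (hk_party_h TEXT, ldts TEXT, name TEXT);
    INSERT INTO foo VALUES
        ('hk1', '2024-01-01T00:00:00', 'Alice'),
        ('hk1', '2024-02-01T00:00:00', 'Alicia');
""")

sql = """
WITH end_dated_source AS (
    SELECT hk_party_h, ldts, name,
           -- End-date each row with the next row's load timestamp;
           -- the latest row gets the end-of-time sentinel.
           COALESCE(LEAD(ldts) OVER (PARTITION BY hk_party_h ORDER BY ldts),
                    '8888-12-31T23:59:59') AS ledts
    FROM foo
)
SELECT hk_party_h, name,
       CASE WHEN ledts = '8888-12-31T23:59:59' THEN 1 ELSE 0 END AS is_current
FROM end_dated_source
ORDER BY ldts
"""

print(con.execute(sql).fetchall())
# → [('hk1', 'Alice', 0), ('hk1', 'Alicia', 1)]
```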

Patrick’s approach is likely more bombproof at big scale; the views suggested may get expensive at runtime, so do your own testing.

I can clearly see the benefit of these C-PIT tables if you are using Snowflake!

In our case we have an INSERT-only Hadoop data lake. For some fact tables we are only interested in current records; moreover, we export just a few months of data to Oracle as the performance layer of the data marts for BI users.

Our overnight processes build a view of current-record Satellites based on DV model metadata (hundreds of millions of rows) and build the few-months fact tables (a few million rows) from them. For some intraday loads, we build a view of last-hour Satellite data (again based on DV model metadata), build the incremental fact tables from it, and MERGE the data into Oracle.
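That intraday upsert step can be sketched as follows. The thread's target is an Oracle MERGE, but this sketch uses SQLite (so it runs anywhere), whose INSERT ... ON CONFLICT upsert plays the same role; the table names (fact_sales, stage_sales) and columns are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Target fact table, previously loaded.
    CREATE TABLE fact_sales (hk TEXT PRIMARY KEY, amount INTEGER);
    INSERT INTO fact_sales VALUES ('hk1', 100), ('hk2', 200);

    -- Incremental rows produced from the last-hour Satellite view.
    CREATE TABLE stage_sales (hk TEXT PRIMARY KEY, amount INTEGER);
    INSERT INTO stage_sales VALUES ('hk2', 250), ('hk3', 300);
""")

# Upsert the incremental rows into the fact table (Oracle would use MERGE).
# The WHERE true avoids SQLite's parsing ambiguity for INSERT..SELECT..ON CONFLICT.
con.execute("""
    INSERT INTO fact_sales (hk, amount)
    SELECT hk, amount FROM stage_sales WHERE true
    ON CONFLICT(hk) DO UPDATE SET amount = excluded.amount
""")

print(con.execute("SELECT hk, amount FROM fact_sales ORDER BY hk").fetchall())
# → [('hk1', 100), ('hk2', 250), ('hk3', 300)]
```

Existing keys are updated in place while new keys are inserted, which is exactly what the intraday refresh needs.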

On the other hand, we also load historical effectivity Satellites following DV2.0 as auditable data.

This is the best approach we can think of to keep the Data Vault pattern and still get good performance when refreshing our fact tables, especially for the intraday loads.

A Hadoop-based lake would be immutable files, I’d reason, and you’d need some form of SQL to query, join and make sense of the data. The beauty of using a C-PIT in Snowflake’s case is the JoinFilter; but also, when using a PIT in a star-join, an OLAP query engine uses a hash-join to optimise the query (it performs a right-deep join tree, which Snowflake favours). I don’t think you have the same thing in HiveQL or SparkSQL. I might be wrong; if you do, then PITs will help.