Loading of Historical Snapshots Data Into a Data Vault 2.0

luissimoes · 18 August 2023 08:56

Hi Everyone!

I have been asked to create a data vault 2.0 model based on a historical dataset that contains snapshots of data since 5 years ago.

I would like to know if there is any good approach to the initial load of the raw vault considering that going for a daily loop on each snapshot might be quite time consuming and not very effective. Are there any other alternatives to this? Would love to hear more about the approach and design.

Thank you

Regards,
LS

AHenning · 20 August 2023 14:29

Why would it not be very effective? If you want to track the changes between each dataset loading it one by one is a good approach.

patrickcuba · 21 August 2023 23:35

Ah, it seems you’re referring to a migration,
Consider:

Migrate everything?
Highest impact portion of the migration — do these first, show business value and get lessons learnt. Also the opportunity to set standards for your DV — will you hash? What will be the naming standards? etc etc
Depending on your toolset (Spark for example) you might want to design a single SQL per target table you define, yes you still need modelling. A big section in my book discusses migrations.

FrenkLoonen · 22 August 2023 14:13

What we did in my current project, was using AutomateDV, and creating an extra ‘historical’ version of all table macros to do precisely this: loading years and years of snapshotted data into our DV, which also consists of STSs (macro we had to create ourselves, doesn’t exist in AutomateDV yet) and ESs.

Some Satellites would contain too much data (12 years, 365 snapshots per year, 40M records per snapshot) so we use the ranked loader to load year by year.

We used a staged load dts for the first snapshot date, and then added a millisecond for each later snapshot date in that batch.

You’d have to spend quite some time on the patterns though

luissimoes · 28 August 2023 08:11

Hi Frenk,

Would love to hear more details on your approach.
We will also use AutomateDV and would be great to understand how we can optimize this initial load.

Can you elaborate on the principles you considered/implemented on the Macros you created for this?

Our main idea was to have a PSA with all the 5 years of daily snapshots from each individual source and then use dbt to load it in one go for the initial load. The incremental would be based on a different set of dbt jobs for example.

luissimoes · 28 August 2023 08:12

Hi Patrick,

My goal here is mainly figure out a way to optimize the initial load using dbt.

Thank you

Topic		Replies	Views
AutomateDV Initial Load AutomateDV (formerly known as dbtvault) satellite	1	48	27 March 2025
Does DV make sense if all sources are snapshotted ? Data Vault 2.0	2	281	18 December 2022
DQ issues: Error Vault? Data Vault 2.0	4	439	7 August 2022
Using external load-dates Data Vault 2.0 load-date , loading , effective_from , historical_load	4	524	28 January 2022
Steaming data into data vault 2.0 Data Vault 2.0	1	213	5 February 2023

Loading of Historical Snapshots Data Into a Data Vault 2.0

Related topics