I have been asked to create a Data Vault 2.0 model from a historical dataset containing daily snapshots covering the last 5 years.
I would like to know if there is a good approach to the initial load of the raw vault, given that looping over each daily snapshot could be quite time-consuming and inefficient. Are there alternatives to this? I'd love to hear more about the approach and design.
Why would it not be effective? If you want to track the changes between snapshots, loading them one by one is a sound approach.
Ah, it seems you're referring to a migration. A few options:
- Migrate everything?
- Or start with the highest-impact portion of the migration: do that first to show business value and capture lessons learnt. It's also the opportunity to set standards for your DV: will you hash? What will the naming standards be? etc.
- Depending on your toolset (Spark, for example), you might want to design a single SQL statement per target table you define; yes, you still need modelling. A big section in my book discusses migrations.
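To illustrate the "single SQL per target table" idea, here is a minimal sketch for a hub loaded from the full snapshot history in one pass. All table and column names are hypothetical, not from this thread, and the hashing convention is just one common choice:

```sql
-- Hypothetical single-pass hub load over the whole PSA history.
-- Each business key is inserted once, stamped with the first snapshot
-- date it appeared in. Names and MD5 hashing are assumptions.
INSERT INTO raw_vault.hub_customer (customer_hk, customer_id, load_dts, record_source)
SELECT
    MD5(UPPER(TRIM(customer_id)))  AS customer_hk,   -- hash key from the business key
    customer_id,
    MIN(snapshot_date)             AS load_dts,      -- first time the key was seen
    'CRM_HISTORY'                  AS record_source
FROM psa.customer_snapshots
GROUP BY customer_id;
```

The same pattern extends to links (group by the combination of keys) and, with window functions, to satellites; the point is that one set-based statement replaces the per-snapshot loop.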
What we did in my current project was use AutomateDV and create an extra 'historical' version of all the table macros to do precisely this: loading years and years of snapshotted data into our DV, which also consists of STSs (Status Tracking Satellites; a macro we had to create ourselves, as it doesn't exist in AutomateDV yet) and ESs (Effectivity Satellites).
Some satellites would contain too much data (12 years, 365 snapshots per year, 40M records per snapshot), so we used the ranked loader to load year by year.
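For anyone unfamiliar with the ranked loader: AutomateDV ships a `vault_insert_by_rank` materialisation that iterates over a rank column you add in staging, so each rank value (here, a year) is loaded as its own batch. A sketch, with hypothetical model and column names; check the AutomateDV docs for the exact config keys in your version:

```sql
-- Sketch of a year-by-year satellite load using AutomateDV's
-- vault_insert_by_rank materialisation. SNAPSHOT_YEAR_RANK would be
-- added in staging, e.g. DENSE_RANK() OVER (ORDER BY YEAR(snapshot_date)).
-- All names here are assumptions.
{{ config(
    materialized='vault_insert_by_rank',
    rank_column='SNAPSHOT_YEAR_RANK',
    rank_source='v_stg_customer_history'
) }}

{{ automate_dv.sat(
    src_pk='CUSTOMER_HK',
    src_hashdiff='CUSTOMER_HASHDIFF',
    src_payload=['NAME', 'ADDRESS'],
    src_eff='EFFECTIVE_FROM',
    src_ldts='LOAD_DTS',
    src_source='RECORD_SOURCE',
    source_model='v_stg_customer_history'
) }}
```

Each rank is committed separately, which keeps individual batches to a manageable size instead of one 175-billion-row transaction.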
We used the staged load_dts for the first snapshot date in each batch, and then added a millisecond for each subsequent snapshot date in that batch, so every snapshot keeps a unique, correctly ordered load timestamp.
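The millisecond trick can be done in the staging layer with a window function. A sketch, assuming Snowflake-style `DATEADD` syntax and hypothetical names:

```sql
-- Staging sketch: give every snapshot date in a batch a unique,
-- ordered load_dts by offsetting the batch's base load timestamp
-- by one millisecond per snapshot. Names are assumptions.
SELECT
    s.*,
    DATEADD(
        'millisecond',
        DENSE_RANK() OVER (ORDER BY s.snapshot_date) - 1,  -- 0 ms for the first snapshot
        b.base_load_dts                                    -- the batch's staged load timestamp
    ) AS load_dts
FROM psa.customer_snapshots AS s
CROSS JOIN batch_control    AS b
```

`DENSE_RANK` rather than `ROW_NUMBER` ensures all rows from the same snapshot share the same load_dts, which is what the satellite loader expects.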
You'd have to spend quite some time on the patterns, though.
Would love to hear more details on your approach.
We will also use AutomateDV, and it would be great to understand how we can optimise this initial load.
Can you elaborate on the principles you considered/implemented in the macros you created for this?
Our main idea was to have a PSA (persistent staging area) with all 5 years of daily snapshots from each individual source, and then use dbt to load it in one go for the initial load. Incremental loads would then be handled by a different set of dbt jobs, for example.
My goal here is mainly to figure out a way to optimise the initial load using dbt.
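One way to make a one-go load from a PSA cheap is to collapse the daily snapshots into true deltas first, so the satellite only receives rows where something actually changed. A sketch using a `LAG` over the hashdiff; all names are hypothetical:

```sql
-- Sketch: reduce 5 years of daily snapshots to change records before
-- the satellite load. A row is kept only when its hashdiff differs
-- from the previous snapshot for that key. Names are assumptions.
WITH ranked AS (
    SELECT
        customer_hk,
        hashdiff,
        name,
        address,
        snapshot_date,
        LAG(hashdiff) OVER (
            PARTITION BY customer_hk
            ORDER BY snapshot_date
        ) AS prev_hashdiff
    FROM psa.customer_snapshots_hashed
)
SELECT customer_hk, hashdiff, name, address, snapshot_date
FROM ranked
WHERE prev_hashdiff IS NULL           -- first appearance of the key
   OR hashdiff <> prev_hashdiff      -- genuine change
```

Since most daily snapshots repeat the previous day's values, this typically shrinks the initial satellite load by orders of magnitude while preserving full change history.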