Basic question regarding DV data pipeline

DVBeginner · 11 March 2023 22:40

Hello,

I’m used to building data pipelines on system by system basis; in case of issue with 1 particular system, it’s possible to restart and reload its data without delaying other systems
The way I understood DV, it is theoretically possible to create 1 pipeline, that refreshes entire DWH, thanks to the paralelization and clear dependencies. Is it a default approach or are there additional factors to be considered? Is it practical to manage 1 pipeline?
Should LoadDate be common for all the objects in the same pipelines or is it a timestamp of data insert into respective table? If LoadDate is common then is there no need to store Insert timestamp as well? E.g. to troubleshoot delays or be able to show time when data was refreshed in data mart?

Thanks,
Martin

AHenning · 12 March 2023 16:16

My suggestion is that you have one import process for each dataset to DV. I would not use one pipeline for the entire dwh.
Regarding loaddates, i find it easy to use the same loaddate for one batch of a dataset. A batch containing multple loaddates will increase complexity, i try to avoid that.

patrickcuba · 13 March 2023 08:26

You shouldn’t reload anything!

Yes there are several acceptable practices to load-dates,

If set once as a variable in Staged content then all target DV tables will attain that load date timestamp, each loader (hub, link and sat) would load their respective content based on their independent loading rules.
Set the load date timestamp as a context function like current_timestamp, set this in staging and the same timestamps like above get loaded to their respective DV artefacts
We also recommend using applied timestamp, this is essentially the metadata file date (ala extract timestamp) of the data being loaded. It’s called applied because it is the applicable timestamp for the state of the data at that point in time, like taking a snapshot of the source.
never set these metadata timestamps to business timestamps because the meaning of those dates could obscure the meaning of load and applied timestamps.

Here is a blog on parallelism + independent pipelines: Data Vault Techniques: Streams & Tasks on Views

DVBeginner · 2 May 2023 20:44

With 1 big pipeline I’m facing scheduler limitation to max 50 parallel tasks on the platform we are using.
Have you ever used a dynamic DAG optimization technique (inteligent ordering od tasks to optimize longest/critical Path) in order to accelerate runtime of the entire pipeline?

patrickcuba · 2 May 2023 21:51

No, interesting you are hitting a limit, unless the whole DV load is in one dag. Check with your Snowflake SE on this!

DV loads should be file-based, that way you’re loading parts of the overall DV model independently and your DAG sizes should never reach 50.
Also, consider the test framework — the test should verify what was just loaded and it too can use the SAME stream by utilizing the repeatable read isolation technique

This makes the DV loads dynamic and the enterprise model in constant state of eventual consistency — aim for this!

DVBeginner · 9 May 2023 13:43

Patrick,
Just to clarify, I’m not running SnowFlake but on-prem MS SQL and a custom Scheduler (WhereScape DWA tool).
The way I understood your article is that if possible everything should be run in parallel and this would require 1 DAG. If this is not what you meant, then how DAGs should be structured? Per source, domain, or other critieria?

patrickcuba · 9 May 2023 21:19

Have a look at this article,

Topic		Replies	Views
Loading of Historical Snapshots Data Into a Data Vault 2.0 Data Vault 2.0 dbtvault	5	551	28 August 2023
Applied and Load dates as full datetime Data Vault 2.0	1	58	2 January 2025
AutomateDV Initial Load AutomateDV (formerly known as dbtvault) satellite	1	34	27 March 2025
Using external load-dates Data Vault 2.0 load-date , loading , effective_from , historical_load	4	515	28 January 2022
ETL pattern for Stage->DV Data Vault 2.0	6	264	22 May 2024

Basic question regarding DV data pipeline

Related topics