I’m used to building data pipelines on system by system basis; in case of issue with 1 particular system, it’s possible to restart and reload its data without delaying other systems
The way I understood DV, it is theoretically possible to create 1 pipeline, that refreshes entire DWH, thanks to the paralelization and clear dependencies. Is it a default approach or are there additional factors to be considered? Is it practical to manage 1 pipeline?
Should LoadDate be common for all the objects in the same pipelines or is it a timestamp of data insert into respective table? If LoadDate is common then is there no need to store Insert timestamp as well? E.g. to troubleshoot delays or be able to show time when data was refreshed in data mart?
My suggestion is that you have one import process for each dataset to DV. I would not use one pipeline for the entire dwh.
Regarding loaddates, i find it easy to use the same loaddate for one batch of a dataset. A batch containing multple loaddates will increase complexity, i try to avoid that.
Yes there are several acceptable practices to load-dates,
If set once as a variable in Staged content then all target DV tables will attain that load date timestamp, each loader (hub, link and sat) would load their respective content based on their independent loading rules.
Set the load date timestamp as a context function like current_timestamp, set this in staging and the same timestamps like above get loaded to their respective DV artefacts
We also recommend using applied timestamp, this is essentially the metadata file date (ala extract timestamp) of the data being loaded. It’s called applied because it is the applicable timestamp for the state of the data at that point in time, like taking a snapshot of the source.
never set these metadata timestamps to business timestamps because the meaning of those dates could obscure the meaning of load and applied timestamps.
With 1 big pipeline I’m facing scheduler limitation to max 50 parallel tasks on the platform we are using.
Have you ever used a dynamic DAG optimization technique (inteligent ordering od tasks to optimize longest/critical Path) in order to accelerate runtime of the entire pipeline?
No, interesting you are hitting a limit, unless the whole DV load is in one dag. Check with your Snowflake SE on this!
DV loads should be file-based, that way you’re loading parts of the overall DV model independently and your DAG sizes should never reach 50.
Also, consider the test framework — the test should verify what was just loaded and it too can use the SAME stream by utilizing the repeatable read isolation technique
This makes the DV loads dynamic and the enterprise model in constant state of eventual consistency — aim for this!
Patrick,
Just to clarify, I’m not running SnowFlake but on-prem MS SQL and a custom Scheduler (WhereScape DWA tool).
The way I understood your article is that if possible everything should be run in parallel and this would require 1 DAG. If this is not what you meant, then how DAGs should be structured? Per source, domain, or other critieria?