All the texts I have found on implementing DV seem to centre around SQL and RDBMS-based approaches, e.g. metadata-driven automated code generation. Likewise the Scalefree course I attended several years ago, taught by Michael Olschimke himself.
Is it possible/advisable to attempt to implement DV on a file-based storage medium like ADLS? Or is it better to use ADLS as a staging area and the basis for the subsequent build of the Raw Vault and Business Vault on a SQL-based platform such as Synapse?
Further down the line are we going to see the lines blur between Synapse and ADLS, as MS catches up with Snowflake?
It is possible, but it could be painful because DV is essentially a relational data model.
Hello!
I think we should be asking ourselves what data should be imported into a data lake. My opinion is that a data lake is excellent for storing big data. But when I read articles online and see what some clients have implemented, I realize that they dump all kinds of data into the lake and use it both as a PSA (persistent staging area) and as a data lake. The data itself contains unknown amounts of errors and data quality is low.
If data is structured and resides in an OLTP system, I see no need to send it to a data lake. Why?
- Data Vault (correctly implemented) can regenerate all data to any point in time. Adding another system (the lake) to handle the data increases the risk of data alteration. Altering data amounts to applying soft business rules and should be avoided at all costs.
- Performing joins with Hive or any other SQL-on-files engine is extremely slow compared to an RDBMS. The RDBMS is built to be performant while at the same time remaining ACID-compliant.
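To make the point-in-time claim above concrete, here is a minimal sketch (all table and column names are hypothetical, and sqlite stands in for whatever RDBMS you use) of reconstructing a satellite's state as of a given timestamp from its insert-only history:

```python
import sqlite3

# Hypothetical customer satellite: insert-only history keyed by hub key + load timestamp.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE sat_customer (
        customer_hk TEXT,
        load_ts     TEXT,
        email       TEXT,
        PRIMARY KEY (customer_hk, load_ts)
    )
""")
con.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?)",
    [("HK1", "2023-01-01", "old@example.com"),
     ("HK1", "2023-06-01", "new@example.com")],
)

# "As of" query: for each hub key, take the latest row loaded on or before the cutoff.
AS_OF = "2023-03-15"
row = con.execute("""
    SELECT email FROM sat_customer s
    WHERE load_ts = (SELECT MAX(load_ts) FROM sat_customer
                     WHERE customer_hk = s.customer_hk AND load_ts <= ?)
""", (AS_OF,)).fetchone()
print(row[0])  # -> old@example.com
```

Because the satellite is never updated in place, any historical state can be replayed this way, which is why adding a second copy of the data in a lake mostly adds risk rather than capability.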
But if you have genuinely big data (logs, IoT, Google Analytics) and you are OK with lower-quality data because you are going to do behavioural analysis and it won't matter if you miss some records, then you should use a data lake.
So if you import big data into a lake, I would argue that all of that data should be modelled as non-historized links.
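For readers less familiar with the pattern: a non-historized (transactional) link records each immutable event exactly once, carrying the payload on the link row itself, so there is no satellite history to maintain. A minimal sketch, with entirely hypothetical table and column names:

```python
import hashlib
import sqlite3

# Hypothetical non-historized ("transactional") link: each IoT reading is an
# immutable event, so the payload lives on the link row and rows are
# insert-only -- never updated, never end-dated.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE lnk_sensor_reading (
        link_hk     TEXT PRIMARY KEY,  -- hash of participating keys + event timestamp
        sensor_hk   TEXT,              -- FK to a hub of sensors
        device_hk   TEXT,              -- FK to a hub of devices
        event_ts    TEXT,
        temperature REAL               -- payload carried on the link itself
    )
""")

def link_hash(*parts: str) -> str:
    # Deterministic surrogate key from the business keys of the event.
    return hashlib.md5("|".join(parts).encode()).hexdigest()

event = ("S42", "D7", "2024-01-01T00:00:00", 21.5)
con.execute(
    "INSERT INTO lnk_sensor_reading VALUES (?, ?, ?, ?, ?)",
    (link_hash(event[0], event[1], event[2]), *event),
)
print(con.execute("SELECT COUNT(*) FROM lnk_sensor_reading").fetchone()[0])  # -> 1
```

Insert-only event data like this is exactly the workload that file-based storage handles well, which is why it suits the lake better than mutable OLTP data does.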
Please be aware that these are my opinions and they might differ from the common data vault concepts.
Thanks @patrickcuba
I agree, and that is the essence of my question: I have been "advised" to attempt to implement a DV model (Hubs, Links, Satellites, etc.) in ADF on the Delta format.
Hence I’m confused about where the MPP relational DW (Synapse) would fit in, and I doubt the person giving the advice has actually tried it!
It can be done, BUT my experience is that it is a lot more painful to manage and query on a NoSQL platform.
For example, we needed recursion to solve a business rule, and it turned out HiveQL and Spark SQL couldn't do it (and still can't today). Instead we had to resort to this: Apache Spark GraphX and the Seven Bridges of Königsberg | by Patrick Cuba | Geek Culture | Medium
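To illustrate the gap: most RDBMSs support recursive common table expressions (`WITH RECURSIVE`), which Spark SQL lacks. A minimal sketch using a hypothetical hierarchical link (employee reports-to manager), run here on sqlite, which does support it:

```python
import sqlite3

# Hypothetical hierarchical link: employee -> manager.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lnk_reports_to (employee_hk TEXT, manager_hk TEXT)")
con.executemany(
    "INSERT INTO lnk_reports_to VALUES (?, ?)",
    [("E3", "E2"), ("E2", "E1"), ("E1", None)],
)

# Walk the management chain upward from E3 with a recursive CTE --
# the kind of query Spark SQL cannot express without resorting to
# GraphX or iterative driver-side code.
chain = [r[0] for r in con.execute("""
    WITH RECURSIVE chain(hk) AS (
        SELECT 'E3'
        UNION ALL
        SELECT l.manager_hk
        FROM lnk_reports_to l
        JOIN chain c ON l.employee_hk = c.hk
        WHERE l.manager_hk IS NOT NULL
    )
    SELECT hk FROM chain
""")]
print(chain)  # -> ['E3', 'E2', 'E1']
```

On an RDBMS this is a few lines of SQL; on Spark it becomes a graph-processing problem, which is the pain point described above.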