Data Vault with Databricks

Greetings all!

Has anyone got any practical experience of using Data Vault on an (Azure) Databricks platform? Is there any general guidance on how well this works?

I’m trying to figure out whether it is “better” to use Databricks only to standardise data (say into Parquet format) in the data lake and offload the definition of the classic Data Vault structures to a downstream database (such as Azure SQL DB); or to model the Data Vault in Delta Lake objects in Databricks; or perhaps something else.

The Databricks SQL component seems to make a separate downstream data warehouse database redundant - but I’m not overly clear on what that means for where any Data Vault modelled objects should reside.
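
For concreteness, the sort of thing I have in mind for the “model it in Delta” option is a hub defined directly as a Delta table - the schema and column names below are purely illustrative, not anything prescribed:

```sql
-- Purely illustrative: a Data Vault hub held directly as a Delta table in Databricks
CREATE TABLE IF NOT EXISTS raw_vault.hub_customer (
  hub_customer_hk  STRING    NOT NULL,  -- hash of the business key
  customer_bk      STRING    NOT NULL,  -- business key from the source system
  load_dts         TIMESTAMP NOT NULL,  -- load timestamp
  record_source    STRING    NOT NULL   -- originating system
)
USING DELTA;
```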

Not Databricks directly, but certainly Parquet, S3 buckets, Spark and Spark SQL.

We used Parquet as the landing area for landed content that has had hard rules applied; this formed our data lake. We also designed and built our own Scala-based DV automation tool.
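
To give a rough idea of the landing step (names, paths and rules below are made up, and shown as plain Spark SQL rather than our Scala tool):

```sql
-- Made-up sketch: apply hard rules (casting, trimming) on the way into the Parquet landing area
CREATE TABLE IF NOT EXISTS landing.customer
USING PARQUET
LOCATION 's3://example-bucket/landing/customer'   -- placeholder bucket/path
AS
SELECT
  CAST(customer_id AS STRING)  AS customer_id,    -- hard rule: enforce data type
  TRIM(customer_name)          AS customer_name,  -- hard rule: strip stray whitespace
  current_timestamp()          AS load_dts,
  'crm_export'                 AS record_source   -- placeholder source name
FROM source_raw.customer_extract;                 -- placeholder raw extract
```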

The difficulty came in the usability of the whole thing: Spark SQL is not great, and since DV is a change-tracking data modelling paradigm, the partitions for each Parquet table became small and non-performant.
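
To make that concrete: the obvious choice of partitioning each DV table by its load timestamp is exactly what hurts, because every incremental load writes its own tiny partition. Something like this (made-up) satellite:

```sql
-- Made-up satellite showing the anti-pattern: partitioning by load timestamp
-- gives one small partition (and small files) per load cycle
CREATE TABLE IF NOT EXISTS raw_vault.sat_customer_details (
  hub_customer_hk  STRING,
  hash_diff        STRING,
  customer_name    STRING,
  load_dts         TIMESTAMP,
  record_source    STRING
)
USING PARQUET
PARTITIONED BY (load_dts);
```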

Basically, for the DV model you’re better off doing what you’ve suggested: bring that data into a platform made for relational analysis and queries, and leave the landed content in Parquet.

I’m currently on a project where we’re using Databricks.

We have a shared Persistent Storage Area that contains ORC files. The DV itself gets written to Delta tables.

As I’m typing this we’re preparing for our first UAT and doing a deep dive into performance tuning. Initially we partitioned on load_dts but, as Patrick mentioned, you end up with a lot of small partitions. At the moment we don’t have any partitions specified (anymore/yet) and are focusing on ZORDER, OPTIMIZE and ANALYZE TABLE instead.
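
For the record (table and column names made up), the commands we’re leaning on look like this:

```sql
-- Compact small files and co-locate rows on the key used for joins and point lookups
OPTIMIZE raw_vault.sat_customer_details
ZORDER BY (hub_customer_hk);

-- Refresh column statistics so the optimiser gets better estimates
ANALYZE TABLE raw_vault.sat_customer_details
COMPUTE STATISTICS FOR ALL COLUMNS;
```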

We’re using a SQL endpoint and noticed that scaling up the size of the warehouse solves a lot of issues :joy:

Hi @FrenkLoonen - thanks for the reply. It looks like you hit the classic “lots of small files” challenge that I have read about. Are you using the Databricks SQL component to create that SQL endpoint? I think you need that component in order to be able to expose endpoints to consuming BI tools.

Thanks @patrickcuba - and that is where I am trying to cut through the hype and understand whether doing it all in Parquet/Delta Lake tables offers sufficient performance. The key difference from the approach you described is that it was not leveraging the optimised engine that Databricks is supposed to offer to make things fly along. Reading what Frenk had to say, it doesn’t look like Databricks makes the performance challenges go away - you still need to configure the Delta tables appropriately and rely on scaling up the compute of the Databricks SQL cluster when hitting the tables with demand.

Hello jhall_uk!
An RDBMS effectively has its own operating system just to access data on disk as fast as possible, and RDBMSs have been around for about 40 years. I do not understand how any software solution that has to read multiple files from disk and combine the results to perform joins can even compete with an RDBMS. Maybe I am missing something here, so I would be glad if someone could describe the benefits of Databricks, or any other file-on-disk based solution, when it comes to relational data.

@FrenkLoonen I wanted to check with you: how has your experience been so far implementing Data Vault on the Databricks platform?

I have been reading some articles on implementing Data Vault 2.0 on the Databricks platform as a silver layer model (referring to the Databricks medallion architecture), but I am concerned about the joins that the Data Vault structure introduces and their performance on a Parquet columnar store.
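
Just to illustrate what I mean by the joins (table names made up), even a simple “customers and their orders” question ends up spanning several hash-key joins:

```sql
-- Made-up example: hub -> link -> hub plus a satellite, all joined on hash keys
-- (ignoring the extra logic needed to pick the current satellite record)
SELECT
  hc.customer_bk,
  sc.customer_name,
  ho.order_bk
FROM raw_vault.hub_customer          hc
JOIN raw_vault.lnk_customer_order    lco ON lco.hub_customer_hk = hc.hub_customer_hk
JOIN raw_vault.hub_order             ho  ON ho.hub_order_hk     = lco.hub_order_hk
JOIN raw_vault.sat_customer_details  sc  ON sc.hub_customer_hk  = hc.hub_customer_hk;
```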

Anything that you can share from your experience so far?