DQ issues: Error Vault?

FrenkLoonen · 30 June 2022 08:10

I need to load Company data. The BK is the DUNS number (with which many of you will probably be familiar). I’ll be receiving full snapshots every day. The source table PK is DUNS + DUNS_TS, the timestamp of when the info was retrieved. Because in one snapshot, they might have received data for one particular DUNS more than one time. I checked the data, on a set of 700k records there are on average 100 DUNS numbers that appear twice. They are identical for 99% of the attributes, it’s often a description field that has an extra period or something.

I’m now looking into these two options:
1: CDK-Sat in RV, filtering rule in the BV (by taking the record with the most recent timestamp) to bring it back to the grain of the hub

Pro: load data into RV as-is, this table is 1-on-1 with what the source provided
Con: extra table required in BV
Con: lot of extra data stored

2: Do the filtering towards the RV, output all records that fall out to an Error Vault table

Pro: no extra table required in BV. Yes, extra table required but it only gets read from ad-hoc, when there’s a need to recreate the source, so:
Pro: still able to recreate the source
Pro: less data stored as compared to the other solution
Pro: if at some point in the future the source doesn’t allow for multiple versions per day anymore (which have no meaning for the business anyway because they only care about the last version) you can continue loading to the satellite that has the intended grain from the start
Con: can be considered a soft rule so doesn’t qualify for RV

edit: forgot to mention: your thoughts?

AHenning · 8 July 2022 23:29

Nor 1 or 2 in my opinion.
Comment about 1: you receive snapshot data, a delta satellite would be more appropriate for the dataset.
Comment about 2: it is not an error that you receive multiple rows for the same object.

My suggestion:
Assign a row-number in the stage environment for each object and order it by timestamp desc.
Create a peg-leged link that contains the row-number and the hub ref.
Put an effectivity sat on the link.
Connect the delta sat to the link.
If you want only the last rows for each object just filter the peg-leged link row-number eql to 1.

FrenkLoonen · 26 July 2022 14:21

Welp, it took me quite some time to respond.

I think I’ve used some incorrect terms. When I said snapshot, I meant a regular full load, so no delta/incremental/CDC load. And when I said Error Vault, I should have maybe called it DQ Vault.

But anyway, where I originally thought that assigning a row-number in the stage environment was going to be a whole ordeal, I found that that is what the dbtvault ranked_columns macro is for. So what I will probably go for is an msat where I use that row_number as dependent-child key. And then in my information mart I’ll filter on sub_seq = 1. Similar to your solutiun, but I think your proposal of the peg-legged link was because you wanted to attach both an effectivity sat and a delta sat?

AHenning · 29 July 2022 23:50

For me snapshot and a full load are the same. Delta satelltes are excellent for handling this kind if data. I think we are talking about the saame thing regarding full load.

However, I am not fund of dv solutions that cannot recreate the data that was given. That is why i suggest you to put the dep childs in the link and put an eff sat connected to it. Then you can look into the eff sat over time to exactly determine what data was delivered for each dataset.

patrickcuba · 7 August 2022 07:41

Load as:

Satellite with dependent-child key; this will ensure you are only loading what’s changed according to the hk+dep-key and nothing else. No cons if I read your problem correctly.

Topic		Replies	Views
Handling erroneous input data Data Vault 2.0	8	412	14 March 2023
[Need Advice] Sat Table Reloading Stretegy Data Vault 2.0 automatedv , satellite , dv-architecture	1	44	11 February 2025
Filtering a source table/file when loading Raw Data Vault Data Vault 2.0	7	1380	15 January 2024
Loading of Historical Snapshots Data Into a Data Vault 2.0 Data Vault 2.0 automatedv	5	608	28 August 2023
Does DV make sense if all sources are snapshotted ? Data Vault 2.0	2	286	18 December 2022

DQ issues: Error Vault?

Related topics