I need to load Company data. The BK is the DUNS number (with which many of you will probably be familiar). I’ll be receiving full snapshots every day. The source table PK is DUNS + DUNS_TS, the timestamp of when the info was retrieved. Because in one snapshot, they might have received data for one particular DUNS more than one time. I checked the data, on a set of 700k records there are on average 100 DUNS numbers that appear twice. They are identical for 99% of the attributes, it’s often a description field that has an extra period or something.
I’m now looking into these two options:
1: CDK-Sat in RV, filtering rule in the BV (by taking the record with the most recent timestamp) to bring it back to the grain of the hub
- Pro: load data into RV as-is, this table is 1-on-1 with what the source provided
- Con: extra table required in BV
- Con: lot of extra data stored
2: Do the filtering towards the RV, output all records that fall out to an Error Vault table
- Pro: no extra table required in BV. Yes, extra table required but it only gets read from ad-hoc, when there’s a need to recreate the source, so:
- Pro: still able to recreate the source
- Pro: less data stored as compared to the other solution
- Pro: if at some point in the future the source doesn’t allow for multiple versions per day anymore (which have no meaning for the business anyway because they only care about the last version) you can continue loading to the satellite that has the intended grain from the start
- Con: can be considered a soft rule so doesn’t qualify for RV
edit: forgot to mention: your thoughts?
Nor 1 or 2 in my opinion.
Comment about 1: you receive snapshot data, a delta satellite would be more appropriate for the dataset.
Comment about 2: it is not an error that you receive multiple rows for the same object.
Assign a row-number in the stage environment for each object and order it by timestamp desc.
Create a peg-leged link that contains the row-number and the hub ref.
Put an effectivity sat on the link.
Connect the delta sat to the link.
If you want only the last rows for each object just filter the peg-leged link row-number eql to 1.
Welp, it took me quite some time to respond.
I think I’ve used some incorrect terms. When I said snapshot, I meant a regular full load, so no delta/incremental/CDC load. And when I said Error Vault, I should have maybe called it DQ Vault.
But anyway, where I originally thought that assigning a row-number in the stage environment was going to be a whole ordeal, I found that that is what the dbtvault ranked_columns macro is for. So what I will probably go for is an msat where I use that row_number as dependent-child key. And then in my information mart I’ll filter on sub_seq = 1. Similar to your solutiun, but I think your proposal of the peg-legged link was because you wanted to attach both an effectivity sat and a delta sat?
For me snapshot and a full load are the same. Delta satelltes are excellent for handling this kind if data. I think we are talking about the saame thing regarding full load.
However, I am not fund of dv solutions that cannot recreate the data that was given. That is why i suggest you to put the dep childs in the link and put an eff sat connected to it. Then you can look into the eff sat over time to exactly determine what data was delivered for each dataset.