Hi,
I have read many times that good, bad and ugly data should all be loaded into the Raw Vault and then handled via business rules, but is that always practical?
What if there is one row of bad data in the source system that cannot be cleaned up because of historical bookings, and it causes the satellite load to crash? Is it worth defining the satellite as multi-active and adding extra logic just to sort out this one row?
I see 2 pragmatic options:
- filter out problematic rows at the source - this works, but the caveat is that you lose visibility of how many errors there are, especially if you filter out ranges of data
- introduce an additional flag in the load tables (e.g. DATA_ERROR), run an UPDATE after loading to flag erroneous rows, and filter them out in the staging code before loading the Raw Vault (sketched below) - this violates the INSERT-ONLY principle
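For illustration, option 2 would look roughly like this (table and column names are made up):

-- flag the known bad row(s) after the load; this UPDATE is what breaks INSERT-ONLY
UPDATE load_bookings
SET data_error = TRUE
WHERE booking_id = 'known-bad-key';

-- the staging query feeding the Raw Vault loader then simply excludes flagged rows
SELECT *
FROM load_bookings
WHERE COALESCE(data_error, FALSE) = FALSE;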
What is your opinion?
Regards,
Martin
If hard business rules cannot be performed, then the “data contract” is broken by the supplier. Who can determine that a row is erroneous if the data contract is not broken?
You’re right, and this is a common issue.
If it cannot be removed at the source because of some historical context, then it should be filtered out if it has no business value; you have data that breaks the business rules.
I would not default to creating modelling structures for the sake of bad data, this is not good practice, and a Multi-Active Satellite would not solve this for you anyway!
Thanks Patrick,
Would you load the errors, flag them in the staging area but exclude them from further processing, or exclude them already in the load query? I primarily use pull SQL/ODBC interfaces, where I can specify filter conditions.
My team is actually building a solution for this right now, focusing primarily on duplicates (one BK being delivered more than once, with different payloads). We’re using a dbtvault ranked column in the staging hash view, partitioning by the hashkey and ordering by the hashdiff. Rank 1 gets loaded into the Satellite; all problem cases (including rank 1) get redirected to our big generic DQ bin (our consumers would rather have potentially incorrect data than no data at all). We’ll build reports on that table so that as soon as we identify duplicates, we can talk to the source to get them fixed. And if really necessary, we can build business rules to fetch other records from that table. But that’s going to be our last resort.
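For illustration, such a report could be as simple as this (table and column names are hypothetical):

-- which business keys were delivered more than once with different payloads, and by which source
SELECT record_source,
       hashkey,
       COUNT(*)                 AS rows_delivered,
       COUNT(DISTINCT hashdiff) AS distinct_payloads
FROM dq_bin
GROUP BY record_source, hashkey
HAVING COUNT(DISTINCT hashdiff) > 1;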
At the same time, we already have a source that we know delivers duplicates, because of past system migrations and their inability to get their sh*t together… and we know which “record to take” based on a specific ordering… so in that case, we don’t order by the hashdiff but by a specific selection of payload attributes.
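In plain SQL terms the ranking for that source looks roughly like this (staged_source, source_updated_at and record_status are just placeholders; the real ordering columns depend on the source):

-- rank duplicates by specific payload attributes instead of the hashdiff
SELECT src.*,
       ROW_NUMBER() OVER (
           PARTITION BY hashkey
           ORDER BY source_updated_at DESC, record_status ASC
       ) AS dq_payload_rank
FROM staged_source AS src;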
edit: at first, I also went with the MAS solution: a regular sat for all rank=1 records, and a MAS for all problem cases, so we at least didn’t throw any data away (and also had a table to build reports off of). But that only covered the sources where we already knew there were issues; ideally you have a generic solution in place that can also tackle all future duplicates. Especially in our case, where during the historical loading we found out that some other sources had delivered duplicates in the past as well.
Frenk,
Sounds like a good idea. Let me summarize the approach:
- implement an optional “QAd Load” view on top of the loaded data; the views perform the ranking / filter out erroneous data
- RV loaders are based on the “QAd Load” views, so they stay standardized and don’t contain any custom QA logic
- error cases can be routed from the QAd views to an Error Mart for analysis and monitoring
Is my summary correct?
M.
I’d identify them and load them into an Error Mart; that way you’re always auditable back to the source, but you haven’t corrupted the vault.
A more dynamic approach would then be to have these not only flagged but also sent back to the source to be fixed (if possible).
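A minimal sketch of that routing (all object names are made up; the point is keeping the full source row plus some error metadata):

-- rejected rows go to an Error Mart table instead of the Raw Vault, keeping the
-- original payload and load metadata so the trail back to the source stays auditable
INSERT INTO error_mart_bookings
SELECT src.*,
       CURRENT_TIMESTAMP    AS error_load_dts,
       'violates hard rule' AS error_reason
FROM stg_bookings AS src
WHERE src.data_error = TRUE;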
edit: I blindly assumed you’re also using dbtvault… which I’m not sure you actually do. If not, then the example YAML below might not make much sense ^^
Our RV loaders themselves are the same, but our hash views are different. It starts with the fact that our hash views are target-based, not source-based: you can map a source file to various DV objects, potentially with different grains. Of course you’d want to create a ranked column to find duplicates on what the source says is the PK. But I’ve often found that a source file can contain different entities that don’t have the same grain, so you’d map it to, for example, a link+sat but also a hub+sat. The PK of the table is probably the same as the PK of your link+sat, but for the hub+sat you will need to check for duplicates as well. So each of our (again, target-based) hash views contains a ranked column. A hash view will look like this (I left out some fields for simplicity):
v_stg_hash_src_filename_s_entity_src
{%- set yaml_metadata -%}
derived_columns:
  entity_id: CAST(id AS STRING)
hashed_columns:
  h_entity_hkey:
    - entity_id
  s_entity_src_hdiff:
    columns:
      - attr
    is_hashdiff: true
ranked_columns:
  dq_payload_rank:
    partition_by:
      - h_entity_hkey
    order_by:
      - s_entity_src_hdiff: ASC
source_model: v_stg_raw_src_filename
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

WITH CTE AS (

    {{ dbtvault.stage(
        include_source_columns=metadata_dict.get('include_source_columns', none),
        source_model=metadata_dict.get('source_model', none),
        derived_columns=metadata_dict.get('derived_columns', none),
        null_columns=metadata_dict.get('null_columns', none),
        hashed_columns=metadata_dict.get('hashed_columns', none),
        ranked_columns=metadata_dict.get('ranked_columns', none)) }}

)

SELECT *
FROM CTE
WHERE dq_payload_rank = 1
As you can see, we wrapped the macro in our own CTE.
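The DQ bin side is then just a sibling model over the same staged output; in plain SQL terms it would look something like this (v_stg_hash_unfiltered is a stand-in for the CTE above, before the rank filter):

-- every row of a hashkey that occurs more than once in the delivery goes to the DQ bin,
-- including the rank 1 row that is also loaded into the Satellite
SELECT *
FROM (
    SELECT stg.*,
           COUNT(*) OVER (PARTITION BY h_entity_hkey) AS rows_per_hkey
    FROM v_stg_hash_unfiltered AS stg
) AS counted
WHERE rows_per_hkey > 1;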
I don’t know dbt, but I get the idea by looking at the configuration.
thanks!