Handling erroneous input data


I have read many times that good, bad and ugly data should all be loaded into the Raw Vault and then handled via business rules, but is that always practical?
What if there is one row of bad data in the source system that cannot be cleaned up due to historical bookings? It causes the satellite load to crash. Is it worth defining the satellite as multi-active, plus additional logic, just to sort out this one row?
I see 2 pragmatic options:

  • filter out problematic rows at the source - this works, but the caveat is that you lose visibility of how many errors there are, especially if you filter out ranges of data
  • introduce an additional flag in the load tables (e.g. DATA_ERROR), run an UPDATE after loading to flag erroneous data, then filter it out in the staging code before loading the Raw Vault - this violates the INSERT-ONLY principle

What is your opinion?


If hard business rules cannot be applied, then the “data contract” is broken by the supplier. And if the data contract is not broken, who can determine that a row is erroneous?

You’re right, and this is a common issue.
If it cannot be removed because there is some historical context, then it should be filtered out if it has no business value; you have data that breaks the business rules.
I would not default to modelling structures for the sake of bad data; this is not good practice, and a Multi-Active Satellite would not solve this for you anyway!

Thanks Patrick,
Would you load the errors, flag them in the staging area but exclude them from further processing, or exclude them already in the load query? I primarily use pull SQL/ODBC interfaces, where I can specify filter conditions.

My team is actually building a solution for this right now, focusing primarily on duplicates (one BK being delivered more than once, with different payloads). We’re using a dbtvault ranked column in the staging hash view, partitioning by hashkey and ordering by the hashdiff. Rank 1 gets loaded into the Satellite; all problem cases (including rank 1) get redirected to our big generic DQ bin (our consumers would rather have potentially incorrect data than no data at all). We’ll build reports on that table so that as soon as we identify duplicates, we can talk to the source to get them fixed. And if really necessary, we can build business rules to fetch other records from that table. But that’s going to be our last resort.
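Not tied to dbtvault, but the ranked-column idea boils down to a ROW_NUMBER window function. A minimal sketch in Python with SQLite (requires SQLite ≥ 3.25 for window functions; the names stg, hashkey, hashdiff are invented, and a real implementation could order by payload attributes instead of the hashdiff):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg (hashkey TEXT, hashdiff TEXT, payload TEXT);
    INSERT INTO stg VALUES
        ('hk1', 'aaa', 'payload-1'),
        ('hk1', 'bbb', 'payload-2'),  -- duplicate BK, different payload
        ('hk2', 'ccc', 'payload-3');
""")

ranked = """
    SELECT hashkey, payload,
           ROW_NUMBER() OVER (PARTITION BY hashkey ORDER BY hashdiff ASC)
               AS dq_payload_rank
    FROM stg
"""

# Rank 1 feeds the Satellite ...
sat_rows = con.execute(
    f"SELECT hashkey, payload FROM ({ranked}) WHERE dq_payload_rank = 1 ORDER BY hashkey"
).fetchall()

# ... while every row of a duplicated key (including rank 1) goes to the DQ bin
dq_rows = con.execute(f"""
    SELECT hashkey, payload FROM ({ranked})
    WHERE hashkey IN (SELECT hashkey FROM stg GROUP BY hashkey HAVING COUNT(*) > 1)
    ORDER BY hashkey, payload
""").fetchall()
```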

At the same time, we already have a source that we know delivers duplicates, because of past system migrations and their inability to get their sh*t together… and we know which “record to take” based on specific ordering… so in that case we don’t order by hashdiff but by a specific selection of payload attributes.

edit: at first, I also went with the MAS solution: a regular sat for all rank=1 records and a MAS for all problem cases, so we at least didn’t throw any data away (and also had a table to build reports off of). But that was only for the sources where we already knew there were issues. Ideally you have a generic solution in place that can also tackle all future duplicates, especially in our case, where during historical loading we found that some other sources had delivered duplicates in the past as well.

Sounds like a good idea. Let me summarize the approach:

  • implement an optional “QAd Load” view on top of the loaded data; the view performs the ranking / filters out erroneous data
  • RV loaders are based on the “QAd Load” views, remain standardized and don’t contain any custom QA logic
  • error cases can be routed from the QAd views to an Error Mart for analysis and monitoring

Is my summary correct?
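If that summary holds, the layering can be sketched like this (SQLite via Python; all table/view names and the QA rule are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE load_entity (bk TEXT, amount REAL);
    INSERT INTO load_entity VALUES ('A', 10.0), ('B', -1.0), ('C', 5.0);

    -- "QAd Load" view: the only place that knows the QA rule
    CREATE VIEW qad_load_entity AS
        SELECT * FROM load_entity WHERE amount >= 0;

    -- Error view: the same rule inverted, routed to the Error Mart
    CREATE VIEW err_load_entity AS
        SELECT * FROM load_entity WHERE amount < 0;
""")

# The RV loader stays generic: it just reads from the QAd view
rv_input = con.execute("SELECT bk FROM qad_load_entity ORDER BY bk").fetchall()
error_mart = con.execute("SELECT bk FROM err_load_entity").fetchall()
```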

I’d identify them and load them into an Error Mart; that way you’re always auditable back to the source but you haven’t corrupted the vault.
A dynamic approach would then be to not only flag these rows but also send them back to the source to be fixed (if possible)


edit: I blindly assumed you’re also using dbtvault… which I’m not sure you actually are. If not, the example YAML below might not make much sense ^^

Our RV loaders themselves are the same, but our hash views are different. It starts with the fact that our hash views are target-based, not source-based: you can map one source file to various DV objects, potentially with different grains. Of course you’d want to create a ranked column to find duplicates on what the source says is the PK. But I’ve often found that a source file can contain different entities that don’t have the same grain, so you’d map it to, for example, a link+sat but also a hub+sat. The PK of the table is probably the same as the PK of your link+sat, but for the hub-sat you will need to check for duplicates as well. So each of our (again, target-based) hash views contains a ranked column. A hash view will look like this (some fields left out for simplicity):


{%- set yaml_metadata -%}
source_model: v_stg_raw_src_filename
derived_columns:
  entity_id: CAST(id AS STRING)
hashed_columns:
  h_entity_hkey:
    - entity_id
  s_entity_src_hdiff:
    is_hashdiff: true
    columns:
      - attr
ranked_columns:
  dq_payload_rank:
    partition_by:
      - h_entity_hkey
    order_by:
      - s_entity_src_hdiff: ASC
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}


WITH staging AS (

{{ dbtvault.stage(
        include_source_columns=metadata_dict.get('include_source_columns', none),
        source_model=metadata_dict.get('source_model', none),
        derived_columns=metadata_dict.get('derived_columns', none),
        null_columns=metadata_dict.get('null_columns', none),
        hashed_columns=metadata_dict.get('hashed_columns', none),
        ranked_columns=metadata_dict.get('ranked_columns', none)) }}

)

SELECT *
FROM staging
WHERE dq_payload_rank = 1

As you can see, we wrapped the macro in our own CTE.

I don’t know dbt but I can get the idea by looking at the configuration :slight_smile: thanks!