Tracking number of INSERTs into HUB, LINK and Satellite


We have audit requirement to capture the INSERT for each HUB, LINK and SAT. Currently I have implemented a python wrapper that performs the update of the DV as one single Snowflake transaction. This python wrapper also logs (into AWS Cloudwatch) the number of records (as returned by SQL INSERT) INSERTed.

I am wondering if there is a better approach to capture and retain this metric for each HUB, SAT, LINK?


1 Like

I’m not on Snowflake/AWS, but from a technology-agnostic standpoint, that sounds like what needs to be done. In order to capture the number of rows inserted, you need something that can return the number of rows inserted, and then something to log that.
So long as your logging can deal with rolled back transactions, and is logging individual tables rather than multi-table transactions, you’ll be fine.

Technology-specific advice welcome.

You could adopt the approach documented here:

1 Like

Have you looked into Snowflake Streams?

These are change tracking/change data capture tables that track any DML changes to the table (inserts, updates, deletes, etc).

Specifically for inserts you could use an append-only stream which just tracks the inserts to standard tables in Snowflake. My thinking is that after each load of a table you could query the table’s stream to retrieve the number of inserts for that table using Python and then log it into AWS Cloudwatch.

A point to note is that Streams don’t retain a history of changes. To build a history you could you write a task that queries the stream and stages the result, which is then loaded into a change history table. This table could then be used in downstream analytics etc.

For some more information on Snowflake Streams, take a look at the documentation: Change Tracking Using Table Streams — Snowflake Documentation

I hope this is of some help to you!

1 Like

As each target table has a Source and Load_Datetime column, and we run an insert only loading pattern, you can also count the inserts as a sweep SQL script after the load (select sum from table where ldts = last load ldts). A further benefit of this is you can then use the results of the counts as a new staging table (insert into stg_load_stats as select sum…) and load that data into a hub/nhl - a meta vault, a vault table holding the metadata about load volumes.

Hi @neil.strange . This is a good a approach, but it won’t capture if there we no INSERTs because there was no new data or if there was a failure or the INSERT script didn’t even run.

If there were no inserts you’d see a zero count. You can also use a record tracking satellite or status tracking satellite to log every feed and status change. And finally, you can log every batch run in a waterlevel table holding the last successful load date run for each table and feed.