Questions regarding applied timestamp

Hello everyone

I have some questions related to the applied timestamp metadata field.

There is not much information about the optional applied timestamp. I think the most detailed explanation is found in the excellent book The Data Vault Guru, but there are still some concepts about the applied timestamp that elude my understanding.

As I understand it, the applied timestamp should be the same value for all ingested data in a batch/packet, meaning the applied timestamp should be the same for all rows in the staging layer of, e.g., a 10,000-row batch.

I can also see that one of the benefits of using the applied timestamp is change detection: if the source changes a value in a row that was ingested into the Data Vault a long time ago, it will be possible to detect this change, since the row will have the same applied timestamp but a different load timestamp.

Question 1:

What determines the applied timestamp value?

  1. Is it the time the data was extracted from the source? E.g., ingestion from the source is initiated at 2025-12-20 10:00 and the data is written to the staging layer 5 minutes later, so the applied timestamp = 2025-12-20 10:00 and the load timestamp = 2025-12-20 10:05?

  2. Or is it determined by the query? E.g., if I filter the data from a source table using a WHERE clause with a date/timestamp value?

  3. Or is it determined by a field in the source? E.g., the table that is queried has valid_from and valid_to fields. Is the applied timestamp then the value from the valid_from field? (I know this breaks the batch/packet concept, as every one of the 10,000 rows could potentially have a unique applied timestamp value.)

I assume the answer is 1.

Question 2:

How do initial loads create the applied timestamps?

Let us say we ingest all rows (e.g., 1,000,000 rows) from a source into the Data Vault for the first time. What will the applied timestamp be for all rows?

Question 3:

How do CDC/delta ingestions create the applied timestamps?

If the answer to question 2 above is that every row gets the same applied timestamp, then I think it poses a problem if some of the rows are updated in the source system. But let me clarify with an example.

Example:

1. Initial Load

Date: 2025-12-20

We are running an initial load into the data vault.

Row count: 1,000,000

Applied timestamp: 2025-12-20 10:00

Load timestamp: 2025-12-20 10:05

2. CDC occurs

Date: 2025-12-21

Let's say that the next day the source system updates 2 of the 1,000,000 rows that were ingested with the initial load. This change in the source system was due to an error: the original value of the 2 rows was not correct and has been corrected.

Row count: 2 (let us just say no new rows have been created and it is only the 2 rows that were updated)

Applied timestamp: It should be 2025-12-20 10:00, since the 2 rows represent that business timeline. But because such CDC is automatic and ingestion is managed by an orchestration tool (e.g., Airflow), the applied timestamp value will be 2025-12-21 10:00

Load timestamp: 2025-12-21 10:05

As you can see, the applied timestamp for the 2 updated rows (2025-12-21 10:00) is not the same value as it was during the initial load (2025-12-20 10:00). Therefore it is not possible to discern, by looking in the Data Vault, that the 2 rows have actually been corrected in the source system. The 2 rows just seem to have changed value the next day; nothing indicates that the new value of the 2 rows should already have applied yesterday (2025-12-20).

I hope my wall of text makes sense. Please let me know if I need to clarify.

Thanks!

Jacob


Hi,

applied_ts, or data_ts as I prefer to call it, is the real data timestamp: either the source already has an INS_TS/UPD_TS/DEL_TS, or, if not, it is critical that you pull rows adding current_timestamp() at extraction. In other words, it is not injected somewhere in the middle of the ETL process from some job parameter. AVOID that critical error, which many data teams sadly make under a Woodstock programming style. The real data row timestamp is one of the most critical pieces of data, so pull it directly from the source; this is the ts used for everything downstream, including SCD2 (pk, data_ts). The etl_ts is also important, for picking the current record via a window function: qualify row_number() over (partition by pk, data_ts order by etl_ts desc) = 1
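A rough Python equivalent of that QUALIFY dedup pattern, with made-up sample rows (the column names and data are illustrative, not from any real source):

```python
# Sample rows: (pk, data_ts, etl_ts, value). data_ts comes from the source,
# etl_ts records when each copy of the row was loaded. Names are illustrative.
rows = [
    ("cust-1", "2025-12-20", "2025-12-20 10:05", "original"),
    ("cust-1", "2025-12-20", "2025-12-21 10:05", "corrected"),  # same source row, reloaded
    ("cust-2", "2025-12-20", "2025-12-20 10:05", "unchanged"),
]

def current_rows(rows):
    """Rough equivalent of: QUALIFY ROW_NUMBER() OVER
    (PARTITION BY pk, data_ts ORDER BY etl_ts DESC) = 1."""
    latest = {}
    for pk, data_ts, etl_ts, value in rows:
        key = (pk, data_ts)
        # Keep only the most recently loaded copy per (pk, data_ts)
        if key not in latest or etl_ts > latest[key][2]:
            latest[key] = (pk, data_ts, etl_ts, value)
    return sorted(latest.values())
```

For the sample above, the reloaded "corrected" copy of cust-1 wins over the "original" copy because it has the later etl_ts.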

Hi there JLeicht, welcome and thanks for posting this!

This is a really good set of questions. Your example makes sense, so I’ll walk through it using the same structure with my understanding.

What applied timestamp actually represents

Applied timestamp is not a business-valid timestamp and it is not the same as load timestamp.

It represents the logical “as-of” time that a batch of data is associated with, typically the start of the ingestion or extraction process. It is assigned to the batch as metadata and does not describe when individual rows are physically written to the Data Vault.

That has two important consequences:

  • It is set once per batch, not per row

  • It reflects when the warehouse contextually learned about the data, not when rows were written or when the data was true in the source
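As a minimal sketch of those two rules (function and column names are hypothetical, not a prescribed standard), a staging loader might stamp the timestamps like this:

```python
from datetime import datetime, timezone

def stage_batch(source_rows):
    """Stamp one applied_ts for the whole batch (captured once, at the start
    of extraction), then a per-row load_ts as each row is physically written."""
    applied_ts = datetime.now(timezone.utc)  # one value for the entire batch
    staged = []
    for row in source_rows:
        staged.append({**row,
                       "applied_ts": applied_ts,                 # same for every row
                       "load_ts": datetime.now(timezone.utc)})   # physical write time
    return staged

batch = stage_batch([{"id": 1}, {"id": 2}, {"id": 3}])
# Every row shares the single batch-level applied_ts:
assert len({r["applied_ts"] for r in batch}) == 1
```

The point of the sketch is only the shape: applied_ts is assigned once, before the loop; load_ts is assigned per row, inside it.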

So let’s go through all your questions!

Question 1: What determines the applied timestamp?

In practice, it is almost always the start time of the extraction or ingestion job.

So your first interpretation is correct:

  • Extraction starts at 2025-12-20 10:00

  • Data lands shortly after

So as you said the applied timestamp is 10:00 and the load timestamp is the physical insert time (10:05 in your example).

It is not derived from WHERE clauses or source valid_from fields.
Using a source column would break the “batch represents a single moment” rule.

Question 2: What happens on an initial load?

An initial load is simply a large batch.

If you ingest 1,000,000 rows in an initial load, all rows receive the same applied timestamp, because they all belong to the same batch. This is independent of the load timestamp, which simply records when each row was physically written.

This does not imply that the data was all true at that point in business time. It only indicates when the warehouse first became aware of the data.

Question 3: What happens with CDC and corrections?

In your example, the concern is not really about CDC mechanics. It’s about this observation:

Two rows were wrong yesterday and corrected today, but the applied timestamp moves forward. That makes it look like the values simply changed today, with no indication that the change is a correction.

That observation is correct, and it is intentional.

The applied timestamp advances because the correction arrives as part of a new batch. The vault is recording when the correction became known, not rewriting the past.

The key point is that the applied timestamp is not the mechanism that preserves attribute history.

That responsibility sits with the satellite:

  • The satellite’s primary key anchors the row to the business key

  • The hashdiff detects that the attribute values have changed

  • A new SAT row is inserted with a later load timestamp
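Those three steps can be sketched in Python (hashdiff via SHA-256; table and column names are illustrative, not a prescribed standard):

```python
import hashlib

def hashdiff(attrs):
    """Hash of the descriptive attributes, used to detect change."""
    payload = "||".join(str(attrs[k]) for k in sorted(attrs))
    return hashlib.sha256(payload.encode()).hexdigest()

def maybe_insert_sat(sat_rows, hash_key, attrs, load_ts):
    """Insert a new satellite row only when the hashdiff differs from the
    most recent row for that key (delta detection); otherwise do nothing."""
    existing = [r for r in sat_rows if r["hash_key"] == hash_key]
    latest = max(existing, key=lambda r: r["load_ts"], default=None)
    new_diff = hashdiff(attrs)
    if latest is None or latest["hashdiff"] != new_diff:
        sat_rows.append({"hash_key": hash_key,
                         "load_ts": load_ts,
                         "hashdiff": new_diff,
                         **attrs})

sat = []
maybe_insert_sat(sat, "k1", {"value": "wrong"}, "2025-12-20 10:05")
maybe_insert_sat(sat, "k1", {"value": "wrong"}, "2025-12-20 11:00")      # unchanged: skipped
maybe_insert_sat(sat, "k1", {"value": "corrected"}, "2025-12-21 10:05")  # new SAT row
```

After these calls the satellite holds two rows for k1: the original and the correction, with the correction carrying the later load timestamp.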

When querying the satellite using standard “latest record” logic, the corrected value becomes the current state, even though it arrived later. This is alluded to by @emanueol in his QUALIFY statement.

This allows the vault to support corrected historical data without pretending the warehouse knew the correct value earlier than it did. That would be very naughty.

If you need to express that the corrected value was already valid in the past, that belongs in the business vault using business-effective dates (usually some equivalent to a modified/updated at timestamp) or restatement logic. It is deliberately not handled by applied timestamp.

A useful way to think about it in this scenario:

  • Applied timestamp tells you when a batch of data was applied to the vault

  • Load timestamp tells you when a specific row was physically written

  • Satellites (via PK + hashdiff) tell you when attribute values changed

Together, this allows the vault to show both:

  • what was known at a given point in time

  • and when corrected information became available

If you also need to model when a value was valid in business time, that is handled separately using business-effective logic, not applied timestamp.

Ultimately, you can rename these columns for clarity, just as long as you are consistent!

This thread also goes through some of this, may provide some extra context, and includes some wise words from the author of The Data Vault Guru himself:

Effective From vs Applied Date

Hope this all helps!

Joe


Wow! Thank you so much for such a thorough and detailed explanation! It makes perfect sense.

I am quite relieved, because I am completely on board with your definition of applied timestamp. For the majority of the time I have had that understanding, based on The Data Vault Guru book. But something keeps throwing me off and making me question my understanding of applied timestamp: explanations or comments from major players in the Data Vault community that seem to contradict the idea of the applied timestamp applying to a batch of data.

Take, for example, emanueol's post in this thread:

I read this as: use a source data timestamp attribute as the applied timestamp if it is available in the source table you are ingesting. If such an attribute is not available then use current_timestamp() (emanueol please interject if I misunderstand your comment).

This seems to contradict your definition of applied timestamp which is to always use current_timestamp().

Another example is from Daniel Linstedt blog post:

This also seems to indicate that the applied timestamp is based on a source attribute and not current_timestamp() alone.

What are your thoughts on this?

Updated: in my previous answer I may have mistakenly pushed data_ts too hard, when the business effective_ts is the really important one. We still need data_ts as part of the audit JSON column and as a way to correctly find the latest current version. Sorry about that.

There are several types of dates:

  • real data_ts (needed for getting current/latest record)
  • effective_ts (business confirmed start: can change/be fixed)
  • load_ts (audit)
  • other temporal/business dates (1 or more)
  1. Personally I really want the “real data_ts”. Imagine: if you only have a batch ts or load_ts, both reflect only data arrival in the DV. You can produce files/data in January that only get loaded in the March batch (with load_ts a few minutes after the batch); unless there are some business dates, you would never know those late-arrival records were actually from January.

  2. I also don't see the extra work of loading all data with the same batch_ts as meaningful. It only complicates things a lot, and in my opinion it uses an old paradigm of scheduler control from the 70s mainframe era. I prefer a non-orchestrated free data flow: load data as it arrives, and even if it arrives late it doesn't matter, because we have the “real data_ts” that I call “effective_ts”.

Finally, performance-wise, just cluster satellites by (key, effective_dt), and get the current transaction snapshot via partition by key order by data_ts desc = 1. I think this is correct even imagining late arrivals, fixes or updates: the transaction key is the critical aspect, effective_dt can be fixed/changed, but at any time we need data_ts desc to get the current record, and that must be independent of load_ts (the DV can load ancient data at any time).

Discard the following, as it seems Snowflake will most likely have multiple dates of the same key loaded in the same micro-partition with cluster key (k, dt):

Another remark on the cluster key: if there aren't many updates/fixes per transaction/record, then (key, effective_dt) would cause sparse micro-partitions and storage waste. Since most analytics are done on effective dates, it is probably better to cluster on (effective_dt, key) and use a dynamic table for the current view.

Anyway, whether to keep natural clustering or set a specific cluster key should be investigated per each satellite's own rhythm.

Hi both and thanks for the update Emanuel! That clarification helps a lot, and I think it actually brings our positions closer rather than further apart. I think all the apparent contradiction is/was mostly about terminology and overloaded expectations, not fundamental disagreement.

I agree with the refined breakdown you’ve outlined:

  • real data_ts as audit lineage (when the record was updated in the source)

  • effective_ts as the business-confirmed start of validity

  • load_ts as the physical arrival time, useful for clustering and current-state queries

  • plus any other domain-specific temporal attributes

I’m in complete agreement that if you care about late-arriving data (which you probably should), you must carry a row-level effective_ts or equivalent capture timestamp. Without that, neither batch timestamps nor load timestamps can tell you where data truly belongs in the business timeline.

Sometimes there is no reliable or valid source column for any timestamp, and current_timestamp() might need to be used in its place. If the reasoning and the decision are documented, then that's acceptable.

Where my earlier explanation fits is that I don’t see applied timestamp as competing with any of those. I see it as optional metadata that answers a different question:

When did a packet of data become visible to the warehouse?

It’s not business time, and it’s not meant to solve late arrivals. That’s the job of effective_ts (and to some extent real data_ts for audit).

If the source provides a reliable batch-level timestamp (for example an extract timestamp, snapshot-as-of time, or CDC window end time that applies consistently to the delivered packet), then using that as applied timestamp makes sense.

On the batching point, I don’t see applied timestamp as enforcing mainframe-era scheduling. Even in a free-flow or event-driven ingestion, data still arrives in packets, even if they’re small or irregular. Applied timestamp simply anchors that context if you choose to keep it. If you don’t need that context, dropping it is a perfectly reasonable design choice. It’s whatever is most appropriate to you. It can always be useful to know if a source data extract has gone awry in some fashion.

I think where we’re fully aligned now is:

  • Business truth lives in effective timestamps

  • Late arrivals are handled by row-level timelines, not batch metadata

  • Current-state queries should not depend on load order alone

  • No single timestamp should be overloaded to do everything

So I see this less as a disagreement and more as two valid modelling styles:

  • one that keeps applied timestamp as optional warehouse-awareness metadata

  • one that relies entirely on row-level effective and audit timestamps

Both can be correct depending on how much ingestion context you want to retain.

As I said in my previous reply, ultimately naming is up to you. In our use case, applied timestamp could be renamed source extract timestamp to clearly show exactly what this is for. You could also name your effective timestamp as business shenanigans time but that may not go down so well. The point is consistency and clarity trumps confusing terminology.


Hiya Both!

Thank you for taking the time to dig a bit deeper on this subject. I highly appreciate it!

I have read your replies a few times, and the precise definitions of real data_ts and effective_ts still elude my understanding.

real data_ts

So if I understand this correctly, this DV timestamp is based on a metadata attribute from the source table that specifies the time the record was inserted into the source table?

What if this metadata attribute does not exist in the source? Is current_timestamp() then used?

What would the real data_ts value be in the file example (e.g., a CSV created in January and loaded/ingested into the DV database in March)?

effective_ts

I am not sure what this value is. Business confirmed start / row-level timelines.

What would this value be in a table that is located in a high quality source database?

What would this value be in a csv file or API that is perhaps a bit low quality?

And I also have a small side question:

With all these timestamps, I assume the load_ts is still part of the constraint in DV that determines whether a record is unique? (E.g., the hash_key + load_ts must be unique.)

Thanks!

yes (SELECT CURRENT_TIMESTAMP(), * FROM SRC)

--file_001.csv
data_ts, txn_key, attr_1_effective_dt, attr_2,           attr_3
2-jan,   10,      1-jan,               bought a car,     --
2-jan,   12,      2-jan,               bought chocolates, --

--file_002.csv
data_ts, txn_key, attr_1_effective_dt, attr_2, attr_3
3-jan,   10,      3-jan,               bought a car,     fixed transaction/effective date 1-jan into 3-jan

sat_table: cluster by [data_ts, key]; current record via (partition by key order by data_ts desc, load_ts desc) for the dynamic table

--sat table
[data_ts,key] (load_ts),   attr_1_effective_dt, attr_2,           attr_3
2-jan,   10,    3-mar,     1-jan,               bought a car,     --
2-jan,   12,    3-mar,     2-jan,               bought chocolates, --

3-jan,   10,    4-jan,     3-jan,               bought a car,     fixed transaction/effective date

Note: There is a DV standard considering the pk to be (key, load_dt), but as it's not enforced on Snowflake, our current logic for the dynamic table is always the safe select .. from sat qualify 1 = row_number() over (partition by key order by data_ts desc, load_ts desc)
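To make the late-arrival point in this example concrete, here is a small Python sketch of the same sample data (dates shortened to sortable MM-DD strings for illustration). Note that key 10's original row arrived late (load_ts 3-mar), after its own correction (load_ts 4-jan), so picking the current row by load_ts alone would choose the stale version; ordering by data_ts first gets it right:

```python
# Satellite rows from the example: (data_ts, key, load_ts, attr_2).
# Dates are illustrative "MM-DD" strings so they compare lexicographically.
sat = [
    ("01-02", 10, "03-03", "bought a car"),        # original, loaded LATE (March)
    ("01-02", 12, "03-03", "bought chocolates"),
    ("01-03", 10, "01-04", "bought a car (fixed effective date)"),  # correction
]

def current(sat):
    """row_number() over (partition by key order by data_ts desc, load_ts desc) = 1."""
    best = {}
    for row in sat:
        data_ts, key, load_ts, _ = row
        if key not in best or (data_ts, load_ts) > (best[key][0], best[key][2]):
            best[key] = row
    return best

# The correction (data_ts 01-03) wins for key 10, despite its older load_ts.
assert current(sat)[10][0] == "01-03"
```

By contrast, partitioning only by key and ordering by load_ts desc would have returned the January version of key 10 as "current", which is exactly the data-quality issue being described.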

It's an attribute like transaction date; it's what analytics really cares about (after using data_ts and load_ts to pull the correct row from which to extract effective_dt).

Additionally, having data_ts makes PIT creation/maintenance very straightforward and faster.

Plus a minor duplicate-avoidance mechanism, like a row_diff_hash.

I may be rehashing Emanuel's response slightly, but coming at it from a different angle.

real data_ts — audit / provenance time

Think of real data_ts as an audit or provenance timestamp, not a business one.

It answers:

When was this record created, captured or updated in the source system (or data-producing process)?

If the source table has something like:

  • created_at

  • inserted_at

  • modified_at

  • CDC commit timestamp

  • capture timestamp

…then that is what real data_ts usually refers to.

It is not:

  • business validity time

  • load time into the Data Vault

It is simply source-side origin or capture. If the source system doesn’t have this, then you can’t invent it.

In that case:

  • you either don’t carry real data_ts

  • or you store what you do know in audit metadata (file metadata, JSON audit column, etc.)

Using current_timestamp() would turn it into a load/ingestion timestamp, which is a different thing. This is part of the reason I said “might” in my earlier reply; there could be an argument for it, but it would have to be a very good one, and it would need to be documented. As the evergreen response goes, “It depends.”

In the case of your csv, your options are:

  • If the file has metadata (file creation time, export timestamp in the header), that can be your real data_ts

  • If it doesn’t then you simply don’t know and you can’t guess January unless the data itself tells you

real data_ts is optional and source-dependent by nature.

effective_ts — business time

This is the most important distinction.

effective_ts answers:

From when does the business consider this value to be valid?

This is business semantics, not technical metadata. It is highly nuanced and dependent on the business questions you are trying to answer. Your data might have been updated on Wednesday but Greg in finance sees it as effective from Monday.

If the source already models validity, for example:

  • valid_from

  • effective_start_date

  • contract start date

  • price effective date

…then effective_ts comes directly from there. It’s not always that easy though, and in lower-quality data, this usually doesn’t exist explicitly.

Options then are:

  • leave effective_ts null

  • derive it in the Business Vault based on agreed business rules

  • default it only if the business explicitly defines a rule (for example “effective on extract date”)

The effective_ts only exists if the business meaning exists. If it doesn’t, don’t fake it.

How the timestamps fit together

Putting it all together:

  • real data_ts
    When the source created or captured the record

    • audit / lineage
    • optional
  • effective_ts
    When the value is valid in business terms

    • business timeline
    • sometimes explicit, sometimes derived, sometimes absent
  • load_ts
    When the row physically landed in the Raw Vault

    • always present
    • technical
  • applied_ts (if used)
    When a packet of data became visible to the warehouse

    • batch / packet awareness
    • optional metadata

Each answers a different question. None replaces the others. Some could be the same and sometimes that is fine, but as said before, “It depends.”
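One way to picture the separation is a staged-row record that carries all four roles as distinct fields (the field names here are illustrative, not a prescribed standard):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StagedRow:
    """One raw-vault row carrying all four timestamp roles side by side."""
    hash_key: str
    data_ts: Optional[str]       # source created/captured time (audit; optional)
    effective_ts: Optional[str]  # business validity start (explicit, derived, or absent)
    load_ts: str                 # physical arrival in the Raw Vault (always present)
    applied_ts: Optional[str]    # batch "as-of" metadata (optional)

row = StagedRow(hash_key="k1",
                data_ts="2025-12-19 09:00",
                effective_ts="2025-12-01",
                load_ts="2025-12-20 10:05",
                applied_ts="2025-12-20 10:00")
```

Only load_ts is non-optional in the type: it is the one timestamp the vault always controls, while the other three exist only when the source or the batch process can honestly supply them.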

Side question: is load_ts still part of DV uniqueness?

Yes.

In classic DV patterns:

  • the hash key identifies what the record is about

  • the hashdiff detects attribute change

  • load_ts provides technical uniqueness and ordering

So the combination of (hash_key + load_ts) must be unique, and multiple rows with the same hash key but different load_ts represent history.

Even if you query or cluster by effective_ts or data_ts, load_ts remains the technical anchor. DV can load ancient data at any time, and load_ts is what keeps that honest.
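A tiny sketch of that uniqueness rule (illustrative only; in most warehouses this is a convention checked by the load pattern, not an enforced constraint):

```python
def check_dv_uniqueness(sat_rows):
    """Classic DV constraint: (hash_key, load_ts) must be unique.
    Repeated hash_keys with different load_ts are history, not duplicates."""
    seen = set()
    for r in sat_rows:
        pk = (r["hash_key"], r["load_ts"])
        if pk in seen:
            raise ValueError(f"duplicate satellite row: {pk}")
        seen.add(pk)

rows = [
    {"hash_key": "k1", "load_ts": "2025-12-20 10:05"},
    {"hash_key": "k1", "load_ts": "2025-12-21 10:05"},  # history: allowed
]
check_dv_uniqueness(rows)  # passes: same key, different load_ts
```

Re-inserting a row with the same hash_key and the same load_ts, by contrast, would trip the check.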


Just wanted to share that I'm aware of the possible controversy around my beloved data_ts as the critical means to select the correct row (as current, for example). The DV standard assumes we can't control any date outside the DV, hence the idea of load_ts (and the pk being key + load_dt). The problem is that this isn't enough for full resilience against things like late-arriving data: if in March we receive an old row from January (it gets loaded with the latest load_dt), and given that the business effective_dt can also change, then it's imperative to have data_ts to physically know which version is correct.

The current standard pk (key, load_ts) + last_diff doesn't prevent older data from being loaded, temporarily causing data quality issues until those older records get reloaded and fix the current row version. That's why I think it's more resilient to have a real data_ts (either the source already has ins_dt, upd_dt or del_dt, or current_timestamp() is added so files get exported with it per row). The DV standard truthfully mentions that this may be hard, but I prefer applying upstream political pressure so that inbound data has it as much as possible. We're in the 21st century and we can't have an export process provide a real data_ts?!? Come on :grinning_face_with_smiling_eyes: . It doesn't matter how it's done, or whether extra columns are needed for correct timestamp resolution based on the geo location of the data, etc. The real data_ts must finally be considered a top-down Head of Data decision that every upstream or extract/ETL process must comply with. Without it, harder and more silent problems arise (alert: old data arrived for this bunch of rows > go and run an ad hoc script to re-pull the correct latest versions of those particular rows, etc.).

I also find performance/costs as good as or better than theory; nothing beats resilient, high-performance, low-cost, elegant solutions. But hey, be prepared to apply upstream pressure to get that data_ts.

I just finished the basic four-day DV course, and I'm pretty certain that applied dates (without the underscore, and extensible in concept to applied timestamp) were essentially being taught as a business-centric class of datetime column, distinctly different from the DV-specific load timestamp. In that sense, there is no specific purpose in DV methodology for an applied_timestamp other than a source/business-provided datetime distinct from the load timestamp. This is from the latest DV 2.1 course.

Yes, that's true and that's the theory, but then you accept data problems with late-arriving or mis-ordered data.

It's a political stance to write in some document that late-arriving data never happens and that changes are always sent in order. In reality, that just pushes the stress of handling data issues up to the point where stakeholders are complaining. For me it's better to build resilient systems than to wait for the inevitable Mr. Murphy to pop in. Saying you are 2.1 compliant, or that upstream didn't comply with the data requirements per the mandate, will not save you: now you need to fix the problem, and then you realize that all the extra custom code and scripts needed to handle every possible case of late-arriving data, update ordering, etc. will move the project from clean, beautiful, automated 2.x DV stuff into a nightmare consuming 3/4 of the team with skyrocketing maintenance.

I'm aware of DV techniques from Cuba like XTS and others (he shared many useful articles on how to handle these problems by relying ONLY on LDTS), but I prefer zero data quality issues and zero edge-case scripts and techniques as a counterpart to something as basic as the source's real data_ts. Think about it: with only LDTS, assuming late data arrival and the ordered or unordered, uncontrollable arrival of events, how can you avoid problems? You can't, because you are missing some real ordering from upstream.

Hope this helps. It's much easier to politically enforce the need for an inbound data_ts than to spend 6 or 8 months on data quality and even performance issues.

TL;DR:

  • LDTS: cool DV standard (supposedly the only ts we have control over, no problem)
  • effective_dt: the really important business temporal line(s); performance-relevant on SATs, or just rely on PITs/BRIDGEs, depending on common access types vs. BV added value, etc.

Just require upstream to also provide data_ts per row/record/event. Simple as that; nothing else is needed, since no matter how or when data arrives, unordered or not, one has a 100% resilient timeline from the source.

If one can't, or doesn't want to bother, applying pressure to upstream teams (which will quickly feed back the common “we need budget for that” or “we don't have time”), then one is accepting painful maintenance handling data quality.