Hello everyone
I have some questions related to the applied timestamp metadata field.
There is not much information about the optional applied timestamp. I think the most detailed explanation is found in the excellent book The Data Vault Guru, but there are still some concepts about the applied timestamp that elude my understanding.
As I understand it, the applied timestamp should be the same value for all ingested data in a batch/packet, meaning the applied timestamp should be the same for all rows in the staging layer of, e.g., a 10,000-row batch.
And I also understand that one of the benefits of the applied timestamp is that if the source changes a value in a row that was ingested into the data vault a long time ago, it is possible to detect this change, since the row will have the same applied timestamp but a different load timestamp.
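To make that benefit concrete, here is a tiny Python sketch of the detection logic as I picture it (the row layout and the corrections helper are my own invention, not anything from the book):

```python
from datetime import datetime

# Toy satellite rows as dicts; in a real data vault these would be columns.
# My assumption: a correction to old data reuses the original batch's applied
# timestamp, while the load timestamp is always the time of the current load.
rows = [
    # Original ingestion of business key "A".
    {"bk": "A", "value": 100,
     "applied_ts": datetime(2025, 12, 20, 10, 0),
     "load_ts": datetime(2025, 12, 20, 10, 5)},
    # Source later corrects the same business-timeline state of "A":
    # same applied timestamp, but a later load timestamp.
    {"bk": "A", "value": 150,
     "applied_ts": datetime(2025, 12, 20, 10, 0),
     "load_ts": datetime(2025, 12, 21, 10, 5)},
]

def corrections(rows):
    """Rows that repeat a (bk, applied_ts) pair with a later load_ts
    are corrections to an already-ingested business state."""
    seen = set()
    out = []
    for r in sorted(rows, key=lambda r: r["load_ts"]):
        key = (r["bk"], r["applied_ts"])
        if key in seen:
            out.append(r)  # same business time, newer load time -> correction
        seen.add(key)
    return out

print([r["value"] for r in corrections(rows)])  # [150]
```

So as long as the correction keeps the original applied timestamp, the two metadata fields together make the correction visible.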
Question 1:
What determines the applied timestamp value?
- Is it the time the data was extracted from the source? E.g. the ingestion from the source is initiated at 2025-12-20 10:00 and 5 minutes later the data is written to the staging layer. The applied timestamp would then be 2025-12-20 10:00 and the load timestamp 2025-12-20 10:05?
- Or is it determined by the query? E.g. if I filter the data from a source table using a WHERE clause with a date/timestamp value?
- Or is it determined by a field in the source? E.g. the table that is queried has valid_from and valid_to fields. Is the applied timestamp then the value from the valid_from field? (I know this breaks the batch/packet concept, as every one of the 10,000 rows could potentially have a unique applied timestamp value.)
I assume the answer is the first option.
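Here is a small Python sketch of what that first option would look like if I implemented it myself (stage_batch and all field names are purely my own guesses):

```python
from datetime import datetime, timezone

def stage_batch(source_rows):
    """Stamp one applied timestamp (extract start) on every row of the
    batch, and one load timestamp when the rows are written to staging."""
    applied_ts = datetime.now(timezone.utc)     # extract from source begins
    extracted = [dict(r) for r in source_rows]  # stand-in for the source query
    load_ts = datetime.now(timezone.utc)        # write to staging happens here
    for r in extracted:
        r["applied_ts"] = applied_ts
        r["load_ts"] = load_ts
    return extracted

staged = stage_batch([{"bk": "A"}, {"bk": "B"}])
# Every row in the batch shares one applied timestamp,
# and the applied timestamp never trails the load timestamp.
assert len({r["applied_ts"] for r in staged}) == 1
assert all(r["applied_ts"] <= r["load_ts"] for r in staged)
```

That would preserve the batch/packet property from above: one applied timestamp per extract, one load timestamp per write.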
Question 2:
How do initial loads create the applied timestamps?
Let us say we ingest all rows (e.g. 1,000,000 rows) from a source for the first time into the data vault. What will the applied timestamp be for all rows?
Question 3:
How do CDC/delta ingestions create the applied timestamps?
If the answer to question 2 above is that every row gets the same applied timestamp, then I think it poses a problem if some of the rows are later updated in the source system. Let me clarify with an example.
Example:
1. Initial Load
Date: 2025-12-20
We are running an initial load into the data vault.
Row count: 1,000,000
Applied timestamp: 2025-12-20 10:00
Load timestamp: 2025-12-20 10:05
2. CDC occurs
Date: 2025-12-21
Let's say that the next day the source system updates 2 of the 1,000,000 rows that were ingested with the initial load. This change in the source system was due to an error: the original value in the 2 rows was not correct and has been corrected.
Row count: 2 (let us say no new rows have been created and only the 2 rows were updated)
Applied timestamp: It should be 2025-12-20 10:00, since the 2 rows represent that business timeline. But since such CDC is automatic and the ingestion is managed by an orchestration tool (e.g. Airflow), the applied timestamp value will be 2025-12-21 10:00
Load timestamp: 2025-12-21 10:05
As you can see, the applied timestamp (2025-12-21 10:00) for the 2 updated rows is not the same value as during the initial load (2025-12-20 10:00). Therefore it is not possible to discern, by looking in the data vault, that the 2 rows have actually been corrected in the source system. The 2 rows just seem to have changed value the next day, and nothing indicates that their values should have been different the day before (2025-12-20).
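The whole scenario in a runnable Python sketch, using my own invented ingest helper, to show how the correction signal gets lost when the orchestrator stamps each run's own extract time:

```python
from datetime import datetime

def ingest(batch, applied_ts, load_ts, vault):
    """Append the batch to the vault, stamping every row with the
    current run's applied and load timestamps (the orchestrator's view)."""
    for r in batch:
        vault.append({**r, "applied_ts": applied_ts, "load_ts": load_ts})

vault = []
# 1. Initial load: all rows get the initial run's extract time.
ingest([{"bk": 1, "value": "v1"}, {"bk": 2, "value": "v1"}],
       datetime(2025, 12, 20, 10, 0), datetime(2025, 12, 20, 10, 5), vault)
# 2. CDC run the next day: the 2 corrected rows get the NEW run's
#    extract time, not the business timeline they actually belong to.
ingest([{"bk": 1, "value": "v2"}, {"bk": 2, "value": "v2"}],
       datetime(2025, 12, 21, 10, 0), datetime(2025, 12, 21, 10, 5), vault)

# No two rows per business key share an applied timestamp, so the
# "same applied_ts, newer load_ts" correction signal from before is gone:
# the corrections look like ordinary next-day changes.
applied_per_bk = {}
for r in vault:
    applied_per_bk.setdefault(r["bk"], set()).add(r["applied_ts"])
assert all(len(ts) == 2 for ts in applied_per_bk.values())
```

This is exactly my concern: with the orchestrator stamping each run, the vault cannot tell a correction from a genuine change.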
I hope my wall of text makes sense. Please let me know if I need to clarify.
Thanks!
Jacob

