How to model website tracking data in DV 2.0

Hello DV Community,

I would like to ask what’s the best way, in your opinion, to model website tracking data (e.g. Google Analytics) according to the DV 2.0 standard.

The raw (toy) data looks something like this:

| viewed_at | user_id | url_path       |
|-----------+---------+----------------|
| t1        | u1      | /articles/art1 |
| t2        | u1      | /articles/art2 |
| t3        | u1      | /forum/post1   |
| t4        | u1      | /forum/post2   |
| t5        | u1      | /contact       |
| t6        | u2      | /articles/art1 |

The granularity of the data is “page view” or “page impression” and columns mean:

  • viewed_at :: is the timestamp of the page view
  • user_id :: the ID of the user that viewed the page, which is also a business key that has its hub (user_h)
  • url_path :: the path to the page that was viewed, which has the following structure:
    • /articles/{article_id} :: articles are business objects and have their hub (article_h)
    • /forum/{post_id} :: forum posts are business objects and have their hub too (post_h)
    • /contact :: page not associated with any hub

In the example data, user u1 viewed 5 pages (2 articles, 2 forum posts and a contact page) at times t1–t5. User u2 had a single page view at t6, of an article.

How would you model this data given already identified business objects (user, article, post) and corresponding hubs (user_h, article_h, post_h)?

According to the standard, such data should be stored in a non-historized link, because the data is transactional in nature and it relates already identified business objects. The challenge is that each record relates a user with at most one other business object (an article, a post, or none), so the relations are “multiplexed”.

I consider two approaches:

  1. Single page view link (page_view_l)

(system-generated link fields are not shown for clarity)

| user_hk  | article_hk | post_hk     | viewed_at |
|----------+------------+-------------+-----------|
| hash(u1) | hash(art1) | hash(-1)    | t1        |
| hash(u1) | hash(art2) | hash(-1)    | t2        |
| hash(u1) | hash(-1)   | hash(post1) | t3        |
| hash(u1) | hash(-1)   | hash(post2) | t4        |
| hash(u1) | hash(-1)   | hash(-1)    | t5        |
| hash(u2) | hash(art1) | hash(-1)    | t6        |

The advantage of this approach is that the user journey can be analyzed more easily, but the disadvantage is that quite a lot of null keys are stored, especially if we have even more hubs that the user can “visit”.
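To make the zero-key handling concrete, a rough loading sketch for this variant could look like the following (the staging table stg_page_views with already-parsed article_id/post_id columns and the MD5 function are just placeholders for illustration, not our actual implementation):

```sql
-- Illustration only: stg_page_views and MD5 are placeholders.
-- The point is the zero-key substitution for the reference that does not apply.
INSERT INTO page_view_l (user_hk, article_hk, post_hk, viewed_at)
SELECT
    MD5(user_id)                         AS user_hk,
    COALESCE(MD5(article_id), MD5('-1')) AS article_hk,  -- zero key when the view was not an article
    COALESCE(MD5(post_id),    MD5('-1')) AS post_hk,     -- zero key when the view was not a forum post
    viewed_at
FROM stg_page_views;
```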

  2. Create a page view link for every hub (article_page_view_l, post_page_view_l, other_page_view_l)

article_page_view_l:

| user_hk  | article_hk | viewed_at |
|----------+------------+-----------|
| hash(u1) | hash(art1) | t1        |
| hash(u1) | hash(art2) | t2        |
| hash(u2) | hash(art1) | t6        |

post_page_view_l:

| user_hk  | post_hk     | viewed_at |
|----------+-------------+-----------|
| hash(u1) | hash(post1) | t3        |
| hash(u1) | hash(post2) | t4        |

other_page_view_l:

| user_hk  | viewed_at |
|----------+-----------|
| hash(u1) | t5        |

The advantage is that the tables are leaner (no unnecessary null keys), but it is harder to analyze the overall user journey (the tables must often be unioned beforehand, as shown below).
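For example, just getting a user's journey back in time order would require something like the query below every time (the page_type label is made up for illustration):

```sql
-- Reconstructing the full user journey from the per-hub links.
SELECT user_hk, viewed_at, 'article' AS page_type, article_hk AS target_hk
FROM   article_page_view_l
UNION ALL
SELECT user_hk, viewed_at, 'post' AS page_type, post_hk AS target_hk
FROM   post_page_view_l
UNION ALL
SELECT user_hk, viewed_at, 'other' AS page_type, NULL AS target_hk
FROM   other_page_view_l
ORDER BY user_hk, viewed_at;
```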

What do you think? Is there a better standard approach here?

Thanks for help,
Marek (CDVP2)

Is the data delivered in 1 or 3 datasets?

The raw data is delivered in one dataset (see the first table).

Then keep it simple: one NHL.

Thanks for your comment. I’ll take it as a point in the conversation, but I cannot take it as a solution, because it does not seem to address the concerns that I mentioned in the top post.

Thanks for the point!

Ehh, and the concern was storing null values in a table? Or was it something else?

Indeed, “zero keys” to be precise. In this toy example there are only 2 business objects that the paths can refer to, but in our real-life example there are currently 5 business objects, and the number can increase in the future. In the “one table” approach, only one reference per record can have a non-zero key, so the link seems unnecessarily wide.

Please take into account that this is ever-growing transactional data.

Ahh, I see. I always use nulls instead of zero keys in my implementations. Nulls are very cheap compared to a zero key. You can go for one NHL for each two-way combination then; that will save you a lot of space!
As you mention, the extraction code will be a little more complex. Just be careful and make sure that you can recreate the data that you import into the tables.

Not quite, mate. NH links are for event data arriving in near-realtime. It is an exception pattern because exactly-once delivery semantics are expected to be defined at the source and are not something you need to check for (true changes) when loading your data. Secondly, should you be building an NHL (if your use case is streaming) using an expensive hash function? I’d argue you want to avoid that! I have done extensive testing on the subject.

If the data arrives as a file, then load it to a regular satellite table to guarantee idempotence.

Now to the meat of your model,

Are articles and posts really business objects? The user, for sure (hub_user, I think). The tricky thing about event data is that you need to ensure it is a true business key you’re modelling, i.e. do not rely on arbitrary “keys” like IP addresses or cookie ids to uniquely identify users, because those values can change (public IPs change and people delete their cookies) and you’ll just load tech debt into your vault. Ensure that the BK you use is indeed unique and can be used to join to other data (the whole point of DV is to integrate by BK). If you cannot attain this, then you need to find some other way to uniquely identify the user; if none is available, the question becomes whether the data should be loaded to a data vault at all! Data without business keys is only useful for aggregates on page views, popular articles etc., which kind of has no meaningful value to the business imho!

The zero key concept has many facets: you purposely associate a missing value with a value you can later join on, and you take advantage of join-key performance when utilising that content in a PIT or Bridge (Right-Deep Join Tree). This should never be a null value, because null is not a value and in fact breaks a hash. I recall seeing a manual DV “implementation” (… the horror… the horror…) where I asked the developer, “why are all your hashdiff values the same if the attributes feeding into them are different?”. It turned out some of the columns can contain a null, and if even one attribute is null it breaks the hash.
Zero keys also enable you to support link tables that can house any cardinality of relationships: 1:0 (a zero-key relationship), 1:1, 1:M etc…
Lastly, if you want to get fancy with zero keys, you can also give additional context depending on what you want to do with your implementation, e.g. -1 is intentionally missing, -2 is missing in error, etc.
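As a sketch (names and the MD5 function are purely illustrative), deriving a hub reference with that extra context could look like:

```sql
-- Distinguish "intentionally missing" (-1) from "missing in error" (-2).
SELECT
    CASE
        WHEN url_path LIKE '/articles/%' AND article_id IS NULL THEN MD5('-2')  -- article page without an id: missing in error
        WHEN article_id IS NULL                                 THEN MD5('-1')  -- not an article page: intentionally missing
        ELSE MD5(article_id)
    END AS article_hk
FROM stg_page_views;
```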

Hope that helps!

Regarding null values:
A null value stores 1 bit; a zero key (SHA-1) stores 160 bits. The size of a table can be heavily reduced if you choose nulls instead of zero keys.
You can never join a null value to another null value, because null is never equal to null. This means that you cannot compare join performance, because the null value will not join at all. In practice, you perform a left join against a nullable column if you don’t want to reduce the dataset, but this is basic SQL.
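Roughly like this, using the single link from the top post but with nulls instead of zero keys:

```sql
-- A LEFT JOIN keeps the page views that are not article views;
-- rows with a NULL article_hk simply don't match, but are not dropped.
SELECT l.user_hk, l.viewed_at, a.article_hk
FROM   page_view_l l
LEFT JOIN article_h a
       ON a.article_hk = l.article_hk;
```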

A hashdiff has nothing to do with zero keys. A coder who did not foresee that a column can contain a null value is just inexperienced. That mistake happens once, and then he/she will have learnt. When calculating a hashdiff, a null check must be performed on every nullable column.
The zero key is not the reason for any cardinality in a link table. Links are many-to-many, with or without nullability.
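For example, coalesce every nullable column before concatenating (column names here are made up):

```sql
-- Null-safe hashdiff: one NULL column can no longer null out the whole diff.
SELECT MD5(
           COALESCE(title, '')  || '||' ||
           COALESCE(author, '') || '||' ||
           COALESCE(CAST(word_count AS TEXT), '')
       ) AS hashdiff
FROM stg_articles;
```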

Then you lose the functionality I mentioned above; clearly, if you’re not using zero keys you’re very likely doing Data Vault wrong!

Pay attention to what I am saying; I was using an example of what nulls do to hashing.

Maybe you should pay attention to what I am saying and argue the pros and cons of one or the other instead of just saying “you are doing it wrong”. There is no functionality that gets lost by using nullability instead of zero keys. Of course, you have to write the code a little differently because of the nullability.
And still, links are many-to-many.

Hi guys, I only have a little time now, but I wanted to chime in on the point of NULLs, and I’m with Patrick on that point. If I remember correctly from the CDVP2 training: NULLs in the HK fields are not part of the DV 2.0 standard.

In the end, everyone can implement whatever is suitable, but then it’s not DV. From what I’ve seen, Dan Linstedt is always clear on this point and “protective” of the standard.

PS I’ll reply to other points later. Thanks for the discussion!

Yes, nulls are not standard in DV2. But an RDBMS performs faster with smaller tables. I don’t blindly buy into all “standards”. Data Vault is great, but we must be able to question its weaknesses and hopefully improve it.

No need, I can tell from your various posts on this forum you do DV wrong. Besides, you clearly don’t need help and this post was for @mrcol

ZKs are there for a big, big reason, which is why I put the links in my reply.

Zero Keys and Ghost Records help with performance when getting data out of the Data Vault efficiently.

Here it is again,

And here is why EQUI-JOINs matter (which both of the above concepts enable, though that post is about Ghost Records),

Ignore that the latter is about Snowflake; ALL OLAP platforms should have hash-join capability (RDJT).
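In short, with ghost records in place a PIT query can stick to plain equi joins (the names below are made up; the pattern is the point):

```sql
-- Every PIT row points at an existing satellite row (a ghost record if nothing
-- was loaded yet), so inner equi joins are safe and hash-join friendly.
SELECT p.user_hk, p.as_of_date, s1.user_name, s2.email
FROM   user_pit p
JOIN   user_profile_s s1 ON s1.user_hk = p.user_hk AND s1.load_dt = p.user_profile_s_ldts
JOIN   user_contact_s s2 ON s2.user_hk = p.user_hk AND s2.load_dt = p.user_contact_s_ldts;
```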

@patrickcuba, you wrote:

If the data arrives as a file then load to a regular satellite table to guarantee idempotence

If you have data arriving in batches, then all records in a batch will have the same load timestamp (ldts). The problem is that in a satellite, the hub HK & ldts pair must be unique, unless it’s a multi-active satellite. In MA sats, there’s an additional attribute ensuring uniqueness for a given HK and ldts. Is that what you are suggesting?

Another concern, on the logical level, is that pages visited by users are not attributes of a user, but rather events/actions. Is it fine to store them in a satellite, which is usually used for descriptive attributes of a business object?

  • if the data arrive in a batch with multiple records, the load date is still the same, so on its own it will not be unique; that is obvious
  • the data will have what is called an applied date; this + load date + hk is what makes the record unique
  • an msat will not save you, unless your definition of an msat is different to mine: an msat is about a SET of data, and intra-day loading to a satellite is not
  • therefore your satellite should be loaded with hk + load dt + applied dt as the unique key; two events for the same entity cannot have the same applied dt (if they do, that is a data quality issue). The satellite’s applied date is treated as a dependent-child key (a rough sketch follows below)
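Concretely, something like this (illustrative DDL only, assuming an MD5-sized hash key and the columns from the toy example):

```sql
CREATE TABLE user_page_view_s (
    user_hk    CHAR(32)  NOT NULL,  -- hash key of user_h
    load_dt    TIMESTAMP NOT NULL,  -- batch load timestamp
    applied_dt TIMESTAMP NOT NULL,  -- event time (viewed_at), the dependent-child key
    url_path   TEXT,
    hashdiff   CHAR(32),
    PRIMARY KEY (user_hk, load_dt, applied_dt)
);
```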

For your second concern: what are actions or events? Is a user a business object? I don’t follow the concern.

A question for you: this thread is over 3 months old and it is your thread. Is this still a modelling problem you are trying to solve for a customer?


hi @patrickcuba

the data will have what is called an applied date; this + load date + hk is what makes the record unique

– that makes sense, thanks!

an msat will not save you, unless your definition of an msat is different to mine: an msat is about a SET of data

– agree, my definition is the same

Is a user a business object?

– all entities that I called “business objects” are true business objects. I’m very careful with that; I just didn’t explain all the context that is not directly related to the question at hand.

this thread is over 3 months old and it is your thread. Is this still a modelling problem you are trying to solve for a customer?

Yes, a genuine modelling problem. Unfortunately, this conversation got off track with the “zero keys”, which were a non-issue for me, and that discouraged me from returning to it.

To sum up and close the topic: I have realized that a satellite is a better choice than links, because, despite articles and posts being true business objects, their IDs are “obscured” in the URLs. Parsing them requires understanding of the URL structure, which might change over time or whenever the structure of the site changes; those sound like soft rules and should not be applied when loading into the raw DV. If the links described in the top post are needed, they can be computed and become part of the business vault.
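For posterity, the business-vault derivation I have in mind is roughly this (illustrative Postgres-style SQL, with MD5 as a stand-in hash function and the raw satellite sketched earlier in the thread):

```sql
-- Soft rule applied downstream: parse the article id out of the URL
-- and derive the article page view link in the business vault.
INSERT INTO article_page_view_l (user_hk, article_hk, viewed_at)
SELECT user_hk,
       MD5(SPLIT_PART(url_path, '/', 3)) AS article_hk,  -- '/articles/art1' -> 'art1'
       applied_dt                        AS viewed_at
FROM   user_page_view_s
WHERE  url_path LIKE '/articles/%';
```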

Thanks a lot for the help!
