Hello DV Community,
I would like to ask what’s the best way, in your opinion, to model website tracking data (e.g. Google Analytics) according to the DV 2.0 standard.
The raw (toy) data looks something like this:
| viewed_at | user_id | url_path |
|-----------+---------+----------------|
| t1 | u1 | /articles/art1 |
| t2 | u1 | /articles/art2 |
| t3 | u1 | /forum/post1 |
| t4 | u1 | /forum/post2 |
| t5 | u1 | /contact |
| t6 | u2 | /articles/art1 |
The granularity of the data is “page view” or “page impression” and columns mean:
- viewed_at :: is the timestamp of the page view
- user_id :: is a ID of a user that viewed that page, which is also business key that has it’s hub (user_h)
- url_path :: path to the page that was viewed and has the following structure
- /articles/{article_id} :: articles are business objects and have their hub (article_h)
- /forum/{post_id} :: forum posts are business objects and have their hub too (post_h)
- /contact :: page not associated with any hub
In the example data, user u1 viewed 5 pages (2 articles, 2 forum posts and a contact page) at times t1–t5. User u2 had a single page view at t6 that was a single article.
How would you model this data given already identified business objects (user, article, post) and corresponding hubs (user_h, article_h, post_h)?
According to the standard such data should be stored in a non-historized link because the data is transactional in nature and it related already identified business objects. The challenge is that each record exclusively relates a user with only a single business object (article or page, or other) so that the relations are “multiplexed”.
I consider two approaches:
- Single page view link (page_view_l)
(systems generated link fields are not shown for clarity)
| user_hk | article_hk | post_hk | viewed_at |
|----------+------------+-------------+-----------|
| hash(u1) | hash(art1) | hash(-1) | t1 |
| hash(u1) | hash(art2) | hash(-1) | t2 |
| hash(u1) | hash(-1) | hash(post1) | t3 |
| hash(u1) | hash(-1) | hash(post2) | t4 |
| hash(u1) | hash(-1) | hash(-1) | t5 |
| hash(u2) | hash(art1) | hash(-1) | t6 |
The advantage of this approach is that user journey can be analyzed more easily, but the disadvantage is that there are quite many null-keys being stored, esp. if we have even more hubs that the user can “visit”.
- Create page view link for every hub (article_page_view_l, post_page_view_l, other_page_view_l)
article_page_view_l:
| user_hk | article_hk | viewed_at |
|----------+------------+-----------|
| hash(u1) | hash(art1) | t1 |
| hash(u1) | hash(art2) | t2 |
| hash(u2) | hash(art1) | t6 |
post_page_view_l:
| user_hk | post_hk | viewed_at |
|----------+-------------+-----------|
| hash(u1) | hash(post1) | t3 |
| hash(u1) | hash(post2) | t4 |
other_page_view_l:
| user_hk | viewed_at |
|----------+-----------|
| hash(u1) | t5 |
Advantage is that there are tables are leaner (no unnecessary null-keys), but it’s harder to analyze overall user’s journey (tables must be often unioned beforehand).
What do you think? Is there a better standard approach here?
Thanks for help,
Marek (CDVP2)