Loading Near Real Time Data into the Data Vault

We’ve just implemented a chatbot and are receiving data from it into our data lake. We want to load this data into our Data Vault, which resides in an Azure Synapse database. According to Data Vault recommendations, this data fits a non-historized link table. I have a few questions:

  1. Can this go directly into the business vault, thus bypassing loading into the raw vault?
  2. If the answer to #1 is yes, does the data need a hub table?
  3. If #2 is yes, what would be the structure of that table?

This data is comparable to a web stream in that there are only inserts. We recognize that we’ll need link satellites for various reasons, which gives me the impression that a hash key would be a good idea for data retrieval.
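To show what I have in mind, here is a minimal Python sketch of how I assume the hash key would be derived (MD5 over normalized, delimited business keys; the example keys are made up):

```python
import hashlib

def hash_key(*business_keys: str, delimiter: str = "||") -> str:
    """Derive a deterministic hash key from one or more business keys.

    Keys are trimmed, upper-cased, and joined with a delimiter before
    hashing, so the same business keys always yield the same hash key.
    """
    normalized = delimiter.join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Example: a chatbot event keyed by session id and message id
# (hypothetical business keys, for illustration only).
print(hash_key("session-42", "msg-0007"))
```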

From experience, how have you handled this situation?

Thanks in advance,

Clay

Is the use case in Synapse for real-time data, or is latency not a problem?
It sounds like the real-time use case is already served in your lake rather than in Synapse; if that’s true, use a standard link satellite.

Why BV? The business vault holds the derived outcomes of soft business rules, stored as BV links and sats; what you have is a raw vault artefact.

Hey Patrick

Soft rules are applied to the data before it is saved in the data lake, to expedite its availability for data vault loading. The reason for my questions was to confirm whether the data could skip the raw vault load. To justify that scenario, I asked:

  1. Can this go directly into the business vault, thus bypassing loading into the raw vault?
  2. If the answer to #1 is yes, does the data need a hub table?
  3. If #2 is yes, what would be the structure of that table?

Thanks and please advise.

Still a raw source; business rules (BR) are applied between the raw vault (RV) and the business vault (BV).

Hello Clay!

  1. No
  2. Not applicable
  3. Not applicable

Do not bypass the raw vault. Maybe you are thinking of streaming data, which can bypass the staging area? But in your case the data is not coming from a stream but from a data lake, which is already slow.
I also see that soft business rules are applied before the data goes into the lake. Try to avoid that, because it will only lead to low data quality.

You can combine a non-historized link (NHL) with satellites from one dataset, no problem.
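To make that concrete, here is a rough Python sketch of what I mean: one raw chatbot record split into an insert-only link row plus a satellite row carrying the descriptive payload. The class and column names are only placeholders, not your actual model.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class NonHistorizedLinkRow:
    """One immutable row per chatbot event; insert-only, never updated."""
    link_hash_key: str       # hash of the participating business keys
    user_hash_key: str       # hash key pointing to the user hub
    session_hash_key: str    # hash key pointing to the session hub
    load_date: datetime
    record_source: str

@dataclass(frozen=True)
class LinkSatelliteRow:
    """Descriptive payload hanging off the same link hash key."""
    link_hash_key: str
    load_date: datetime
    record_source: str
    message_text: str
    channel: str

def split_event(event: dict, load_date: datetime) -> tuple[NonHistorizedLinkRow, LinkSatelliteRow]:
    """Split one raw chatbot event into a link row and a satellite row."""
    link = NonHistorizedLinkRow(
        link_hash_key=event["link_hash_key"],
        user_hash_key=event["user_hash_key"],
        session_hash_key=event["session_hash_key"],
        load_date=load_date,
        record_source=event["record_source"],
    )
    sat = LinkSatelliteRow(
        link_hash_key=event["link_hash_key"],
        load_date=load_date,
        record_source=event["record_source"],
        message_text=event["message_text"],
        channel=event["channel"],
    )
    return link, sat
```

I don’t understand what you mean by “impression that a hash key would be a good idea for data retrieval”. Can you explain?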
Good luck!

Thank you for your reply. I completely agree with you and Patrick that the data is raw and that we shouldn’t be applying soft rules when loading to the data lake. Those decisions were made outside of my reach and knowledge, and the parties involved have been told not to continue that practice. However, they did raise an interesting point about including UTC times from the source location. For example, the data was saved at 9 am PST while the Data Vault runs in EST; converting to UTC when loading means the activity date reflects the source event rather than the warehouse’s local time. I support adding this soft rule to the data being loaded into the data lake.
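As a minimal sketch of the conversion I mean (assuming Python 3.9+ with zoneinfo; the timestamp and function name are only illustrative):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(local_timestamp: str, source_tz: str = "America/Los_Angeles") -> datetime:
    """Interpret a naive source timestamp in its local zone and convert to UTC."""
    naive = datetime.fromisoformat(local_timestamp)
    localized = naive.replace(tzinfo=ZoneInfo(source_tz))
    return localized.astimezone(timezone.utc)

# 9 am PST on a January date becomes 17:00 UTC (noon EST), so the
# activity date reflects the source event, not the warehouse's clock.
print(to_utc("2024-01-15 09:00:00"))
```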

My statement that “a hash key would be a good idea for data retrieval” came from my misinterpretation of the data source as a stream. Our management, for a variety of reasons, wants the data lake to be the ONLY front door into the data vault, and that was my viewpoint when I wrote it. In all fairness, to keep things moving forward, I don’t have a problem with that limitation until I have more experience with the data vault and streaming becomes a viable data source alternative.

Thanks

Clay
