Teasing Business Keys out of variants of the Entity - Hotels!

We are using dbt and AutomateDV for our project, dbt-Core since we are on MS SQL Server on-prem… I am also very new to this…

We have been struggling to try to define a business key in our environment. I am trying to gain some insight into this situation and recommendations from the community where we may be thinking about this incorrectly.
The background:
We are new at DV. My team has experience in working with an existing DV (banking) but this is our first deep dive into creating one.

The data we are modelling is about Hotels within the travel industry. The Hotel industry is unique in that they are ubiquitous in the world and are known by many systems, i.e. Sabre, Concur, Expense systems, Hotel Management systems, other online booking sources, etc.
Our task is to ‘Normalize’ hotel information so that a ‘Hotel’ is a single definition of the hotel business at the given property location. There are nuances here, as some hotels have similar/same names in different locations, and the opposite that a single location (address) can have 2 different hotels with that same address (think of a long-term say Residence Inn at the same property as a Courtyard by Marriott. Same address.)

We are tracking hotel programs where a corporation enters into agreements with many hotels for guaranteed discounted rates.

We are trying to get a ‘Normalized’ list of the hotels so that all of our reporting and analytics uses the same hotel entity.

We have many sources:

  • Hotelier reporting via flat file the number of bookings per month, broken out by booking with a Confirmation number.
  • The Hotel Program Management software provides a flat file of Hotels within the program, with their own ‘property key’
  • Booking information from Concur (and other Travel groups) that provide hotel information with their OWN keys
  • Expense reports with Hotel name/address

In all these cases, the source system keys are unrelated… some systems have keys like FEMA Number, but only a few sources… also, it isn’t required.
FEMA Number is a US Gov’t issued key for US Hotels… but we have international Hotels as well.

So, we figured that across all our sources, a hotel is only reliably referenced by 2 things: The Hotel NAME + Hotel Address.
Here is the fun part: Each source system may refer to a given hotel using a human-provided name which tend to be different from all these sources… and addresses vary based on who entered the data (AVE vs. Avenue)

This is backed up by the business, when asked how do they talk about hotels, it was generally ‘the hotel name in ’ like ‘The Hampton Inn’ in Sarasota, Florida (there could be 2 Hamptons, West and East, so they refine further, the West one) Some systems may have the Name as Hampton WEST, some not.

As we attempt to define our Hotel_hub, and determining the Business Key, we cannot use the keys from the individual systems since they are not known by all the systems providing data, and would generate multiple hub records for the same entity, even based on a difference like Ave. vs Avenue in the street name.

We came up with the plan to use USPS standardized addresses that we can obtain via API calls, and a similar process for getting a standard Hotel Name…

So, the Business key for a hotel would be Std_HotelName + Std_HotelAddress.

What do you think? How do you go about finding a business key when there are common data points like HotelName/Address, but the quality of that data, or
consistency across systems is full of different variants of the name/address?

In the ingest process from raw_stage to stage, we would join to a lookup table for address/name to ensure that we have the standardized data, if we do, proceed, if not, send to lookupQueue

Outside of the DV processing, we would have a LookupService that handles the LookupQueue.
We would perform the lookups for name and address via the LookupQueue that populates a Lookup Table that includes:

  1. HK of the Raw Address (the address in the data)
  2. HK of the Standardized Address (to be used in the Hotel_Hub below)
  3. Raw Address fields
  4. Standardized Address Fields

similar for the name.

Once the lookup has been populated from the record in the lookupQueue, that record is marked for Reprocessing and is sent back to be included in the next run for that source.

The idea for the Vault:
Hotel_Hub - Business Key = Std_HotelName + Std_HotelAddress
Hotel_Raw_Address_Sat
Hotel_Std_Address_Sat
Other sats, hubs for the STAY info, link between hotel and stay, etc.

This means that staging a source from RAW_STAGE into the vault goes through this lookup join/process and either are ingested into the vault or are sent to the lookup queue.

Also, if this is a sound process, I don’t know how to structure my dbt project (with AutomateDV) to ‘split’ the source into those that have the Key lookup already performed, vs. those records that need to be sent to the lookup… How do you do that in dbt/AutomateDV? Mutiple STG_ views, one for the ‘good’ and a different one for the ‘missing keys’ rows??

Any help is much appreciated!

Jason

Training friend… otherwise I’m sure that you can engage @neil.strange or @alex.higgs directly as I feel that what you have described here deserves more attention than a forum reply.

2 Likes

IMHO, I believe your best approach in the Raw Vault is to ingest each of your sources with their OWN keys into Hub-Sat’s and then link them using Same-As-Links in the Business Vault. This is also the recommended DV 2.0 standard.

This way U are able to preserve traceability/auditability back to each source system.

U can do their inter-linkages, incl derivations using the Lookup process, in the Business Vault.
The BV_Hotel_Hub can contain the “normalized” common Master BK such as
(Std_HotelName + Std_HotelAddress) as U have suggested.
Hence, it can form the “single definition of the hotel business at the given property location”
as per your requirement.

So this BV_Hotel_Hub will be linked to the several RV_Hotel_Hub (s) via Same-As-Links. U can also build in a "fuzzability " measure if U are not 100% sure of the linkages.

Of course, some of the operational logistics with your LookupQueue etc will have to be worked out in the ELT stages.

Comments welcome.

1 Like