What is Data Vault Modeling and when should you use it?
by Nande Konst, on Nov 16, 2021 4:13:48 PM
Marketing analytics produce the highest value and insight when data is combined from multiple sources. Many companies transform data through an ETL (Extract, Transform and Load) process and store data in Data Warehouses for further analysis. There are multiple ways to manage data in such a way that they support data and insight-driven decision-making. One of the latest approaches related to managing data is Data Vault Modelling. Data Vault Modelling is a highly scalable method to provide long-term historical storage of data that comes from multiple operational systems. What problem is Data Vault trying to solve and in what situations is data vault architecture your best option?
What problem does Data Vault Modelling solve?
Before we look deeper into this question, first, let’s have a look at other common data warehousing architectures. A more classical approach to managing data is the dimensional design approach (DM). This architecture aims at processing data and delivering results quickly to end-users.
Dimensional modelling (DM)
The purpose of dimensional modelling is to optimize the database for faster retrieval of data. Dimensional data models deal with Dimension and Fact tables. Let’s say, in an E-commerce use case where a company has multiple products and stores, Dimension tables can be Product, Customer, Store etc. Dimensions have a unique identifier and other attributes. For the dimension Customer, the unique identifier, also called the primary key could be customer_id and the attributes could be name, surname, email, address etc. Facts tables are tables like order details, metrics and transactions from different business processes which have a foreign key relationship with one or more Dimension tables. An example of a Fact in this use case would be the order table. A star schema is then used to create an aggregated view or report of a business requirement. An example of a star schema applied to the E-commerce use case can be seen below.
Drawbacks of dimensional modelling
The problem with dimensional modelling is that when the business is growing, this way of modelling oftentimes isn’t flexible and agile enough. The problem starts to arise when for example a business requirement requires information on a different grain. For example all the sales that’s made in a specific city. You can add an extra attribute to the existing Fact table. Then the risk of data inconsistency starts to arise, or you might lose your history. Another option would be to create a new Fact table with almost the same information but on a different grain. It adds complexity to the model and leads to the fact that a star schema is very much oriented on a particular view, which makes it less flexible. Therefore dimensional modelling is not always responsive to changing business needs. If you are looking for a more flexible solution, this is where data vault architecture comes in.
Data Vault Modeling
Data vault modelling is focused on providing flexible modelling patterns that work together to integrate raw data by business key. The data vault modelling method is much more immune to changes in business needs because of separating structural information from descriptive attributes. The modelling method is designed to enable parallel loading. Large implementations can easily scale out without the need for a redesign of the model. A Data Vault consists of Hubs, Satellites and Links. Together, Hubs, Links and Satellites form a sort of grammar to describe the business that you are analysing.
The Hub is an entity that has significance to the business. It only holds a unique identifier that can directly relate to the business. So in our example hubs would be Product, Customer, Store and Promo with their unique identifiers product_id and customer_id.
Links show that two hubs are related. They are primary/foreign key relationships and describe a unique relationship between two hubs.
Satellites record “point in time” values. You can compare them with the attributes in a dimension table. They connect to Hub or Link tables, but not to other satellites. Satellites are associated with only one Hub. They hold the history as fresh records. Each time there is a change in values a new Satellite record is stored with a timestamp. This means that there are no deletions or updates in a data vault, only inserts. It allows an answer to the question: “What did we know when?”
A simplified overview of a data vault model looks as follows:
Use Data Vault Modelling in a dynamic business environment
Now let’s have a look at how a data vault can adapt easily to a changing business need. So let’s say that for a Store the tax details are required. This data might come from another system. What can be done is to create another satellite that holds the tax details without affecting the existing structure. This allows you to make quick changes which makes it scalable, adaptable and consistent. Data vault architecture is a hybrid database modelling technique that takes the best of both worlds. However, it’s not suitable for every type of reporting or any type of use case.
Use cases for Data Vault
When the structure of resource systems changes very frequently or when the company operates in a very dynamic environment, a data vault architecture might be your best choice. Also in cases where historical data is an important factor, a data vault might be a good option as it tracks a history of all changes as satellites include the load time as part of the primary key. Each time an attribute is updated, a new record will be created. This enables you to access data from any point in time. Another reason to use a data vault architecture is it enables quick data loading as many tables can be loaded parallel.
When would you not use a Data Vault?
When you need to load data directly into a reporting tool, a data vault might not be the way to go. A data vault needs many joins to get to the desired data. These joins would slow down the performance. In order to be suitable for reporting, a data vault model needs to feed data first to a dimensional model or a reporting layer needs to be added to speed up the performance. Another reason to stay away from data vault modelling is the case where you have only one source system or static data. In this case, there are no benefits of using a data vault. Using a dimensional model would require less data manipulation and would be a better fit.
Data Vault Modeling is a highly scalable method to provide long-term historical storage of data that comes from multiple operational systems. Depending on your use case, there are different types of architecture to choose from. I hope that this article has shed some light on the best solution for you and your company.
Crystalloids help companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids builds crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science, and marketing, making them one of a kind IT company. Using the Agile approach Crystalloids ensures that use cases show immediate value to their clients and frees their time to focus on decision making and less on programming.