Data Lakes: what are they and how do you use them?
by Nande Konst, on Dec 7, 2021 1:50:02 PM
Data lakes and data warehouses are both used for storing big data, but the terms are not interchangeable. Storing data is, in fact, about the only thing they have in common; their purposes differ. In my opinion, a data warehouse is a better fit for some companies, a data lake for others, and in still other cases a hybrid setup works best. In this article, I'll explain what a data lake is and how it differs from a data warehouse. At Crystalloids, we design, develop, and manage data lakes for several clients on the Google Cloud Platform. We are Google Cloud Platform specialists with years of experience in developing customer data platform solutions. In this blog, I will focus on data lake architecture and the best practices and use cases for data lakes.
What is a data lake?
Like water molecules in a lake, a data lake consists of raw, free-flowing data. A data lake is a large storage repository that holds raw data in its native format. The unstructured character of the data is one of the main differences from a data warehouse, which stores data that has already been processed. The data in a data lake waits there until it is needed for analytics purposes. A data lake uses a flat architecture to store data, mainly in files and object storage. No schema is defined until the data is queried. Each piece of information has a unique identifier and is tagged with a set of metadata tags, so the lake can be queried for relevant data. That smaller set of data can then be analyzed to answer business questions.
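To make the flat architecture concrete, here is a minimal in-memory sketch in Python of that idea: raw objects stored as-is, each with a unique identifier and metadata tags that can later be queried. The class and tag names are hypothetical, not any real product's API.

```python
import uuid

# Minimal in-memory sketch of a data lake's flat object store:
# each object holds raw bytes, a unique identifier, and metadata tags.
class ObjectStore:
    def __init__(self):
        self.objects = {}

    def put(self, raw_bytes, tags):
        """Store raw data as-is; no schema is applied at write time."""
        object_id = str(uuid.uuid4())
        self.objects[object_id] = {"data": raw_bytes, "tags": tags}
        return object_id

    def query(self, **wanted_tags):
        """Find object IDs whose metadata tags match; the raw data is
        only interpreted later, when it is actually read."""
        return [
            oid for oid, obj in self.objects.items()
            if all(obj["tags"].get(k) == v for k, v in wanted_tags.items())
        ]

store = ObjectStore()
store.put(b'{"order": 1}', {"source": "webshop", "format": "json"})
store.put(b"<xml/>", {"source": "erp", "format": "xml"})
matches = store.query(source="webshop")  # finds only the webshop object
```

Note that `query` never looks inside the raw bytes; it only filters on metadata, which is exactly why the stored data can stay in any native format.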
Why would you use a data lake?
Many data lakes contain large sets of structured, unstructured, and semi-structured data. Such mixed environments are a poor fit for a relational database, because a relational system requires a rigid schema, which limits it to storing structured data. Most data warehouses are built on relational databases and don't allow varying schemas. Data lakes, by contrast, don't require any schema to be defined upfront. This is what makes them so strong at handling different types of data in different formats, and it is why data lakes are key components in many organizations' data architectures. Most companies use them as a platform for big data analytics and data science applications that use advanced analytics methods, such as data mining, machine learning, and predictive analytics.
Schema-on-read and Schema-on-write-access
Data lakes don’t need a schema. When users want to view the data, they can apply a schema at that moment. This process is called schema-on-read, and I think it is a huge advantage of data lakes. For businesses that add new data sources on a regular basis, it is extremely useful: defining a schema upfront is time-consuming, so not having to design one in advance is a big win. A data warehouse, on the other hand, has a predefined schema, designed before the data is loaded into the warehouse. This process is called schema-on-write, and it may prevent the insertion of data that does not conform to the schema. A data warehouse is therefore better suited for cases where a business has a large amount of repetitive data that needs to be analyzed to answer predefined business requirements.
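The contrast between the two approaches can be sketched in a few lines of Python. This is an illustration only, not tied to any specific database product: schema-on-write validates records before storing them, while schema-on-read stores everything and applies a schema only when querying. The schema itself is a made-up example.

```python
import json

# Hypothetical schema an analyst might choose.
SCHEMA = {"customer_id": int, "amount": float}

def write_with_schema(table, record):
    """Schema-on-write: reject records that do not conform."""
    for field, field_type in SCHEMA.items():
        if not isinstance(record.get(field), field_type):
            raise ValueError(f"record does not match schema: {field}")
    table.append(record)

def read_with_schema(raw_lines):
    """Schema-on-read: parse raw lines and keep only the rows that
    fit the schema chosen at query time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(isinstance(record.get(f), t) for f, t in SCHEMA.items()):
            rows.append(record)
    return rows

# The lake happily stores both lines; the schema is applied at read time.
lake = ['{"customer_id": 7, "amount": 19.95}',
        '{"free_text": "no schema at all"}']
conforming = read_with_schema(lake)
```

A warehouse-style `write_with_schema` would have rejected the second record at load time; the lake keeps it around in case a future schema can make use of it.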
Big data volumes are growing every day. Gartner states that in 2022, 90% of corporate strategies will count analytics as an essential competency. This is why many business leaders have to rethink their strategy for storing and processing data. Many of those businesses have existing data warehouses that offer flexible data storage but lack the ability to wrangle data for analytics purposes. The solution to this problem is adding a complementary data lake to the existing architecture. Such a hybrid setup is called a lakehouse and makes it possible to run concurrent workloads on the same datasets and reuse the available data for different purposes. At Crystalloids, we see the advantages of lakehouse solutions for some of our clients too. The dual implementation makes it possible to route sensitive data to the data warehouse and keep it from entering the data lake, which gives better control over data usage and compliance than using a data lake alone. In my opinion, a lakehouse solution is a great way to manage data in such a way that it serves many different needs.
What a data lake architecture should look like
Building the right data lake architecture is extremely important for turning data into value. The data in your data lake will be of no use if you don’t have the architectural features to manage the data effectively. Therefore it’s important to build the right features into the data lake architecture from the start. Using a cloud-optimized architecture will simplify the data lake. A modern cloud data lake should have the following characteristics:
- Multi-clustered and shared-data architecture
- Independent storage resource and compute scaling
- A well-defined metadata service that is fundamental to an object storage environment
- All architectural components and their interaction should support native data types
- Data discovery, storage, transformation, and visualization should be managed independently
- The architecture should be tailored to the industry, with the unique features and capabilities of the specific domain present
Features that should be available in any data lake include:
Data governance: the processes and standards used to ensure the data can fulfill its intended purpose. It helps to increase data quality and data security. An example of a governance rule is limiting file size in order to standardize files, since files that are too large can make the data difficult to work with. An example of a data quality measure is scanning the data for incomplete or unreadable records.
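The two example rules above, a file-size limit and a completeness scan, could be sketched as simple checks like the following. The 100 MB limit and the required fields are hypothetical values chosen for illustration.

```python
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # hypothetical 100 MB governance limit
REQUIRED_FIELDS = {"id", "timestamp"}    # hypothetical completeness rule

def check_file_size(size_bytes):
    """Governance rule: flag files that exceed the agreed maximum size."""
    return size_bytes <= MAX_FILE_SIZE_BYTES

def scan_for_incomplete(records):
    """Quality rule: return the records that are missing required fields."""
    return [r for r in records if not REQUIRED_FIELDS <= r.keys()]

ok = check_file_size(5 * 1024 * 1024)    # a 5 MB file passes
bad = scan_for_incomplete([
    {"id": 1, "timestamp": "2021-12-07"},
    {"id": 2},                           # missing "timestamp"
])
```

In practice, checks like these would run automatically as data lands in the lake, so problems are flagged before anyone builds analytics on top of them.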
Data catalog: the information about the data that exists in the data lake. A data catalog makes it easy to understand the context of the data and enables stakeholders to work with it faster. The types of information in a data catalog vary per use case, but they usually include information like:
- The connectors necessary to work with data
- Metadata about the origin of the data and the amount of time it has been stored
- Description of the applications that use the data
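The fields listed above could be represented as a simple catalog entry like the sketch below. All names and values here, including the connector path, are hypothetical examples rather than a real catalog format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical catalog entry mirroring the fields listed above.
@dataclass
class CatalogEntry:
    dataset_name: str
    connector: str                  # how to connect to the data
    origin: str                     # where the data came from
    stored_since: str               # how long it has been in the lake
    consuming_apps: List[str] = field(default_factory=list)

entry = CatalogEntry(
    dataset_name="webshop_clickstream",
    connector="gs://raw-zone/clickstream/*.json",  # hypothetical path
    origin="webshop frontend events",
    stored_since="2021-01-01",
    consuming_apps=["churn-model", "marketing-dashboard"],
)
```

Even a lightweight structure like this answers the key questions a stakeholder has before touching the data: where it came from, how to connect to it, and who else depends on it.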
Search: Being able to search effectively through the data lake is crucial. Search functionality includes the ability to find data based on features like size, content, and date of origin. A best practice is to build an index of data assets to facilitate fast searches.
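The asset index mentioned as a best practice can be as simple as grouping asset names by an attribute, as in this sketch. The asset records are invented for the example.

```python
from collections import defaultdict

# Minimal sketch of an asset index for fast lookups by attribute.
def build_index(assets, attribute):
    index = defaultdict(list)
    for asset in assets:
        index[asset[attribute]].append(asset["name"])
    return index

assets = [
    {"name": "orders_2021.json", "content_type": "json", "size_mb": 120},
    {"name": "scans.dcm", "content_type": "dicom", "size_mb": 900},
    {"name": "orders_2020.json", "content_type": "json", "size_mb": 95},
]
by_type = build_index(assets, "content_type")  # e.g. all JSON assets at once
```

Building one index per searchable attribute (size bucket, content type, origin date) turns a full scan of the lake into a direct lookup.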
Security: To make sure that sensitive data remains private, security measures need to be implemented, such as access controls that prevent unauthorized parties from accessing and modifying data, and encryption of the data itself.
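The access-control part can be pictured as a small role-to-permission mapping. The roles and actions below are hypothetical; real data lakes delegate this to the platform's identity and access management.

```python
# Tiny illustration of role-based access control (hypothetical roles).
PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}

def authorize(role, action):
    """Allow an action only if the role's permission set includes it."""
    return action in PERMISSIONS.get(role, set())

allowed = authorize("analyst", "read")    # analysts may read
denied = authorize("analyst", "write")    # but not modify data
```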
What are the use cases for a data lake?
The more unstructured data an organization has, the greater the need for a data lake solution. A good example of an industry where a data lake fits really well is healthcare. Much of the data in healthcare is unstructured (think of X-rays, MRI scans, clinical data, medication information, and analyses), and real-time insights are needed. Data lakes suit the healthcare industry because they allow a combination of structured and unstructured data. In transportation, data lakes can help make predictions, and their predictive capabilities can have enormous cost-saving benefits, especially in supply chain management.
How Crystalloids uses data lakes to create a lakehouse solution
Building a lakehouse solution is often part of our job when we develop data platforms such as Customer Data Platforms. The purpose of a Customer Data Platform is to centralize data coming from different sources. Combined, this data can provide very useful insights for all kinds of purposes. In the context of a Customer Data Platform, a lakehouse solution is used to gain valuable insights into customer behavior and to automate customer activations in owned, paid, and earned channels.
An example: in the central data point, a customer profile is formed based on data such as purchase behavior and loyalty, which comes from various systems. Having this data available in a customer service application can give a sales employee guidance on how to approach a specific type of customer. To create such profiles, it's important to record history. This is done in a data lake; other information that contributes to the profile can come from the data warehouse. Both streaming data and batch data are processed: whenever streaming data comes in, it is written in real time to the multiple databases that need the information.
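The last step, writing each incoming streaming event to every database that needs it, is essentially a fan-out. Here is a hedged sketch with hypothetical in-memory sinks standing in for the real databases.

```python
# Sketch of fanning out a streaming event to every store that needs it
# (hypothetical sinks, not a real product API).
class ProfileStore:
    def __init__(self):
        self.events = []

    def write(self, event):
        self.events.append(event)

def fan_out(event, sinks):
    """Deliver one incoming event to all registered sinks."""
    for sink in sinks:
        sink.write(event)

lake_history = ProfileStore()   # keeps full history for building profiles
service_db = ProfileStore()     # serves the customer service application
fan_out({"customer_id": 42, "action": "purchase"},
        [lake_history, service_db])
```

In a production setup this role is usually played by a pub/sub or streaming pipeline, so that the lake's history and the operational databases stay in sync without the producer knowing about each consumer.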
Which solution is best for your organization depends highly on the use case, the type of data, and the existing architecture. If you don't yet have a clear purpose for your data and your organization has a lot of structured and unstructured information that needs to be used at a later stage, a data lake is a good solution. If your business has an existing warehouse that lacks the ability to wrangle data, adding a data lake to the existing architecture is the way to go. In that case, we are talking about a lakehouse, which makes it possible to run concurrent workloads on the same datasets and reuse the data for different purposes. A data warehouse is best suited for a situation in which a business has large amounts of repetitive data and predefined business requirements.
Crystalloids helps companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids builds crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science, and marketing, making it a one-of-a-kind IT company. Using the Agile approach, Crystalloids ensures that use cases show immediate value to its clients and frees their time to focus more on decision-making and less on programming.