Comparison Matrix: Data Lake vs. Data Warehouse
“If a Data warehouse can be thought of as a bottle of water which is cleansed, packaged and structured for easy consumption, a data lake is like a body of water in its natural state.”
A data warehouse can be viewed as a central repository of integrated data from one or more disparate sources, storing both current and historical data. Users can generate reports and comparisons from this data for both senior management and operational users.
Data warehouses are ideal for operational users because of their structure, ease of use and ease of understanding. A smaller portion of users are analysts, who will often go back to the source systems to clarify their findings, and an even smaller portion, known as data scientists, perform deep analysis. Developers often need to change or update the warehouse to meet different user requirements, putting suitable ETL processes in place to extract the data from its sources and bring it into the warehouse. The traditional approach of manually curating the data warehouse provides a limited window of view of the data. It is designed to answer specific questions identified at design time, and it might not make as much sense any more now that data discovery plays a major role in many businesses.
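The design-time limitation can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the column names, the sample records and the `etl_transform` helper do not come from any particular warehouse product. The point is that an ETL step built around a fixed schema answers the questions it was designed for and silently discards everything else.

```python
# Hypothetical warehouse schema, fixed at design time.
WAREHOUSE_COLUMNS = ("order_id", "customer_id", "amount")


def etl_transform(source_record):
    """Keep only the columns the warehouse was designed to hold."""
    return {col: source_record.get(col) for col in WAREHOUSE_COLUMNS}


# Illustrative source records; the extra fields exist only in the source system.
source_records = [
    {"order_id": 1, "customer_id": "c-17", "amount": 40.0,
     "device": "mobile", "referrer": "newsletter"},
    {"order_id": 2, "customer_id": "c-02", "amount": 15.5,
     "device": "desktop", "referrer": "search"},
]

warehouse_rows = [etl_transform(record) for record in source_records]

# A question planned at design time is easy to answer.
total_revenue = sum(row["amount"] for row in warehouse_rows)

# Fields needed for an unplanned question never reached the warehouse.
dropped_fields = sorted(set(source_records[0]) - set(WAREHOUSE_COLUMNS))

print(total_revenue)
print(dropped_fields)
```

Answering a new question such as "revenue by device" would require a developer to change the schema and the ETL process, then reload the data.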
If you have a well-established data warehouse, don’t throw away all the time and effort spent on creating it just to start from scratch on a data lake. If your data warehouse suffers from the problems mentioned above, rather look at a hybrid setup where the data lake is set up alongside the data warehouse.
In a data lake, data flows into the lake from various streams, while users have access to the lake to examine it, take samples or dive in. Data lakes support each type of user more equally than data warehouses do: data scientists can make use of the lake's large and varied data sets, while operational users can make use of the more structured views that can be provided. Users are able to make their own changes without the need for developers. However, because users are in the driver's seat, they become accountable for mistakes. A data lake excels at enabling data discovery, as this collection of data can be seen as a hub or repository of all the data an organisation has. This setup allows users to understand where the data is stored and allows data to be ingested as close to its raw form as possible, without any restrictive schema. This gives anyone an unlimited window of view of the data in which to run ad-hoc queries and perform cross-source navigation and analysis on the fly. Successful data lakes can respond to these queries in real time and provide users with easy and uniform access to the disparate sources of data. A data lake allows for more questions and better answers while allowing any and all data to be captured and stored. However, storing data does not on its own make business decisions more effective, and storage is not limitless.
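The idea of ingesting raw data and applying structure only when a question is asked can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular data lake platform works: the two sources ("web" and "crm"), their field names and the `read_view` helper are all hypothetical.

```python
import json

# Raw records land in the lake as-is. The two hypothetical sources
# do not share a layout, and no schema is enforced on ingestion.
raw_lake = [
    '{"source": "web", "user": "c-17", "event": "purchase", "amount": 40.0, "device": "mobile"}',
    '{"source": "crm", "customer_id": "c-17", "segment": "gold"}',
    '{"source": "web", "user": "c-02", "event": "purchase", "amount": 15.5, "device": "desktop"}',
    '{"source": "crm", "customer_id": "c-02", "segment": "silver"}',
]


def read_view(raw_lines, source, fields):
    """Apply structure at read time: select one source and project the chosen fields."""
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") == source:
            yield {field: record.get(field) for field in fields}


# Two structured views over the same raw data, defined when the question is asked.
purchases = list(read_view(raw_lake, "web", ["user", "amount", "device"]))
segments = {
    row["customer_id"]: row["segment"]
    for row in read_view(raw_lake, "crm", ["customer_id", "segment"])
}

# An ad-hoc, cross-source question: revenue per customer segment.
revenue_by_segment = {}
for purchase in purchases:
    segment = segments.get(purchase["user"], "unknown")
    revenue_by_segment[segment] = revenue_by_segment.get(segment, 0.0) + purchase["amount"]

print(revenue_by_segment)
```

Because nothing was discarded on ingestion, a different question (for example, revenue by device) only needs a different view, not a change to the ingestion process. The trade-off is that whoever writes the view is responsible for interpreting the raw fields correctly.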
The Hadoop ecosystem is well suited to the adaptability, flexibility and scalability a data lake requires, as it can handle very large volumes of data of any type or structure. Hadoop provides the ability to apply structured views to raw data, which allows it to provide data and insights to all tiers of business users. Because Hadoop relies on an open source model, there is also a compelling argument from both a cost and a features perspective to consider when evaluating these two approaches.