Data Platform and Data Science

13 September 2023

Data Lakehouse vs Data Lake+Warehouse

Filed under: Data Warehousing — Vincent Rainardi @ 7:16 am

Which one is better between a) Create a data warehouse on the gold layer of the data lake, and b) Build a data lake then create a dimensional model on a relational database. a) is called Data Lakehouse architecture, and b) is called Data Lake + Data Warehouse architecture.

In other words, should we build our data warehouse inside the data lake (using Lakehouse technology from Synapse or Databricks) or should we build it on a separate cloud relational database?

The advantages of Data Lakehouse architecture:

  1. Users only have one data platform to read from.
  2. Easier to maintain data lineage of every field in DW.
  3. Only one data catalogue to maintain.
  4. Easier to monitor DW data quality (by comparing it to the lake).
  5. Delta Lake now supports ACID (atomicity, consistency, isolation, durability).

The advantages of Data Lake + Data Warehouse architecture:

  1. All BI tools works with relational databases.
  2. Some database functionalities are not on Delta Lake yet (FK, PK, multi-table transaction, etc.)
  3. Staff are more familiar with relational databases (query, loading and admin).
  4. Cloud relational databases are cheaper than Spark clusters.
  5. Cloud relational databases are simpler than Spark clusters (easier to operate & support).

Some say that it is better to create the data warehouse/mart on a relational database, some say that it is better to create the data warehouse/mart on the Spark cluster, inside the data lake.

In my opinion, it is better to build on both, i.e. create multiple data marts for different purposes, some on relational DBs, some on the Spark cluster / data lake. We put the dimensional marts on relational DBs, and the data science ones on the data lake. Not all of those data marts have to be dimensional model. They can be flattened too, which is more fitting for data frame structure, suitable for data science and machine learning.

References:

  1. Data Lakehouse defined, by James Serra: link.
  2. ACID on Databricks, by Databricks: link.
  3. Databricks Lakehouse Platform, by Databricks: link.
  4. Data Warehouse vs Data Lake vs Data Lakehouse, by John Kutay: link.
  5. Data Warehouse vs Data Lake vs Data Lakehouse, by M. Haseeb Asif: link.
  6. Lakehouse vs Warehouse vs Datamart – The Difference Between The Three Fabric Objects, by Reza Rad: link.
  7. Data Lakehouse & Synapse, by James Serra: link.
  8. Data Virtualisation vs Data Warehouse, by James Serra: link.

4 Comments »

  1. Hi Vincent,

    Nice thoughts about “to DWH or not to DWH”
    What i would like to know is in the situtation with the Datalake + Data warehouse there is also a gold layer drawn in datalake. I wonder if there are any transformations from the gold layer in datalake towards DWH or is the Gold layer just ‘loaded’ into DWH without aditional transformations. In the latter case the DWH on a database would seem to be redundant.

    Kind regards,
    Job Eichhorn

    Comment by Job Eichhorn — 26 September 2023 @ 9:54 am | Reply

    • Hi Job, thanks for your comment. If you have a DWH, the lake gold layer is in normalised form. If you don’t have a DWH, the lake gold layer is in dimensional form.

      Comment by Vincent Rainardi — 26 September 2023 @ 10:57 am | Reply

  2. Sounds similar to: Kimball (a) with dimensional datawarehouse, and Inmon with 3NF “datawarehouse” + dimensional datamarts (b)?

    Comment by milhouse77bs — 26 September 2023 @ 1:48 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a comment

Blog at WordPress.com.