Data Warehousing and Machine Learning

28 December 2021

Data Lakehouse

Filed under: Data Warehousing — Vincent Rainardi @ 8:08 am

A data lakehouse is a data lake which has data warehouse features such as star schemas, transactions, data integrity, BI tool support, data governance, scalability, data ingestion from various sources, and the ability to do "time travel" (link, link). For a paper on the data lakehouse, see here: link.

Usually a data lake does not have a schema. It is just a collection of files. No SQL data types, no structures. A data lakehouse has a schema. It has data types. It has a data structure. It has tables. And it has a database.
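To illustrate, here is a minimal sketch of what that schema looks like in SQL (this is Delta Lake syntax, which we will come to later in this post; the table and column names are just hypothetical examples):

CREATE TABLE Customer
( CustomerId INT NOT NULL,
  CustomerName STRING,
  CreatedDate DATE
) USING DELTA;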

Usually we read data from a data lake and load it into a data warehouse/mart so that we can have a star schema. Then we point our BI tool such as Power BI or Tableau to this data warehouse/mart. With a data lakehouse, we don’t need to do that. We don’t need to build a data warehouse/mart. We can directly point the BI tool to the data lakehouse.

ACID

Usually a data lake does not have data integrity. There is no concept of a transaction. When users read from the data lake while a process is ingesting data into it, there is no way to ensure data consistency or data reliability. When multiple users are reading and writing from and to the same table all at once, there is no way to ensure that they don't interfere with each other (the isolation concept). There is no way to ensure that multiple data changes are treated as one unit (a transaction, or atomicity). In short, there is no concept of ACID, i.e. atomicity, consistency, isolation, durability. A data lake is just a collection of files. It is not a database.

But a data lakehouse is different. It is a data lake, but like a database it has the concept of atomicity, consistency, isolation and durability (ACID). It can handle transactions, i.e. when we have multiple changes, either all of them are executed successfully or none of them is executed. It has the concept of durability, meaning that data changes are saved even when the disk or the server fails. This means that the data will not be corrupted by system crashes or power failures.
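As a sketch of what that atomicity means in practice, a Delta Lake MERGE statement applies all its updates and inserts as a single transaction: either every change is committed or none of them are (the table and column names below are hypothetical):

MERGE INTO Customer AS target
USING CustomerUpdates AS source
ON target.CustomerId = source.CustomerId
WHEN MATCHED THEN UPDATE SET target.CustomerName = source.CustomerName
WHEN NOT MATCHED THEN INSERT (CustomerId, CustomerName)
  VALUES (source.CustomerId, source.CustomerName);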

A data lakehouse can achieve this ACID capability using transaction logs. Yup, just like a relational database, it uses transaction logs.
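In fact we can inspect this transaction log with SQL. In Delta Lake, DESCRIBE HISTORY lists the committed operations on a table, one row per transaction (Customer is a hypothetical table name):

DESCRIBE HISTORY Customer;
-- one row per commit: version, timestamp, operation (WRITE, MERGE, DELETE, ...) and so on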

Time Travel

One of the key features of a data warehouse is the ability to do "time travel". This means that we can query the data as it is today, but we can also query the data as it was at some point in the past.

A data lakehouse has this capability. We can query an older snapshot of a table (link). We can rerun a report as it was last year. We can replay a machine learning input (or output) as it was before it was changed. We can write a temporal query. We can use this “time travel” capability for debugging or for auditing.

How do we do that in a data lakehouse? By querying the data lakehouse using the TIMESTAMP AS OF clause, like this:

SELECT * FROM Table1 TIMESTAMP AS OF '2020-12-31'
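Because every transaction in the log creates a new version of the table, we can also time travel by version number instead of timestamp (the version number below is just an example):

SELECT * FROM Table1 VERSION AS OF 12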

Simple, isn’t it?

SQL

Oh, I forgot to mention one very important feature! A data lakehouse uses the SQL language. Yes, the good old SQL! The SQL language that we have all been using for almost 50 years. You can query a data lakehouse by simply doing "select * from Table1 where column1 = criteria". And other SQL statements such as GROUP BY, etc. Isn't it wonderful?
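For example, an ordinary aggregate query on a lakehouse table works just as you would expect (the table and column names are hypothetical):

SELECT Country, COUNT(*) AS NumberOfCustomers
FROM Customer
GROUP BY Country
ORDER BY NumberOfCustomers DESC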

Many BI tools use the SQL language to query databases (well, all BI tools, actually). But when they query a data lake, they can't use SQL. Why? Because a data lake is just a collection of files. CSV files, for example. Or pipe-delimited files. Or JSON files. Or Parquet files. So those BI tools have to open the files and read them as files!

Not with this special data lake called a data lakehouse. If your data lake is a data lakehouse, you can use SQL to query it. Your data lakehouse consists of databases and tables. Yes, databases and tables! Isn't it wonderful?

Delta Lake

We should thank the people behind Delta for these wonderful capabilities. Who are Delta? Well, here's their website: https://delta.io/. They created something called "Delta Lake" on top of Apache Spark, which has all the wonderful features above. They started in June 2019 with version 0.2 (link). The latest version is 1.1.0 (link), released on 4th December 2021.

Other than ACID, it also supports change data capture, streaming, conditional updates, metadata handling, schema evolution, and SQL constraints like NOT NULL and CHECK. We can read about all these wonderful features here: link. They are documented very well there.
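As a sketch of those constraints: in Delta Lake, NOT NULL is declared on the column, and CHECK is added as a table constraint (the table, column and constraint names here are hypothetical):

CREATE TABLE Trade
( TradeId INT NOT NULL,
  Quantity INT
) USING DELTA;

ALTER TABLE Trade ADD CONSTRAINT PositiveQuantity CHECK (Quantity > 0);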

The transaction log is very important. It enables atomicity, consistency, isolation and durability (ACID), as I said above. We can read about the detailed mechanism of the Delta Lake transaction log here: link. The checkpoints, the format specification, the protocol, the transaction identifiers, everything is very well documented there.

The data lake that Delta created is called Delta Lake. Delta Lake is open source. You can read all the code on GitHub: link.

Databricks

The best implementation of Delta Lake is Databricks (link). It is a data lake, but it was designed from the start to be a data lakehouse. It is a Delta Lake implementation, so it has all the features of Delta Lake.

Databricks consists of 3 environments:

  1. Databricks Workspace
  2. Databricks SQL
  3. Databricks ML

A Databricks Workspace is an environment for managing all of your Databricks assets (link) such as notebooks, libraries, dashboards, and experiments. It puts them into folders and provides access to data objects and computational resources. A Databricks Workspace is used for data engineering and data science.

Databricks SQL is an environment to run quick ad-hoc queries on your data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards (link).

Databricks ML is an integrated, end-to-end machine learning environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving (link).

Data Warehouse

The question is, now that we have the data lakehouse, do we still need a data warehouse? The answer is yes. If your company doesn't have a data lake, then you will still need to build a data warehouse.

But if your company builds a data lake, you had better build the special form of data lake that I described above: a data lakehouse. That way you won't need to build a data warehouse or a data mart. You can point your BI tool directly at the data lakehouse.
