Data Warehousing and Data Science

20 January 2022

Data Lake Architecture

Filed under: Data Warehousing — Vincent Rainardi @ 8:47 am

When we google “data lake architecture” we usually get a diagram showing the functions of a data lake (what it is used for), but not showing what’s inside, like this:

But what we really want to know is what is inside that Data Lake box, not the inputs or the outputs. When we try to find out what’s inside the lake, we’ll find something like this:

They show the layers and functions of a data lake, but not the internal mechanism, not what is really inside it. Note that instead of raw, clean and curated, some people name the layers differently, such as:

  • Bronze, silver, gold
  • Temporal, raw, trusted, clean
  • Raw, core, data mart
  • Transient, raw, trusted, refined
  • Landing, raw, standardised, curated
  • Raw, staged, analytics

After reading a few books on data lakes we’ll find that a data lake is just a collection of files. That’s it. Seriously? Are you telling me that a data lake is just a collection of files? Yup. Wow, how disappointing! That’s what I initially thought.

Everybody agrees that data lakes should be in the cloud. In other words, we should use Amazon S3 (AWS), Google Cloud Storage (GCS) or Azure Data Lake Storage (ADLS) to create a data lake. But it is still just a collection of files.

Then I saw this diagram in Holt Calder’s article (link):

So we can query a data lake in S3 / GCS / ADLS by creating Snowflake external tables on top of it. This way we can use SQL to query the data lake. You can read how to do it in the Snowflake documentation: link.
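To make this concrete, here is a rough sketch in Snowflake SQL, assuming a lake of Parquet files in S3. The stage name, bucket path, table and column names are all hypothetical; the Snowflake documentation linked above has the full syntax.

    -- A minimal sketch: an external stage pointing at the lake, and an external
    -- table defined over the Parquet files in it. All names are made up.
    CREATE STAGE lake_stage
      URL = 's3://my-data-lake/raw/customers/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

    CREATE EXTERNAL TABLE customers_ext (
      customer_id   INT    AS (VALUE:"customer_id"::INT),
      customer_name STRING AS (VALUE:"customer_name"::STRING)
    )
    LOCATION = @lake_stage
    FILE_FORMAT = (TYPE = PARQUET);

    -- The files in the lake can now be queried with plain SQL:
    SELECT customer_id, customer_name
    FROM customers_ext
    LIMIT 10;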

Creating external tables is an old technique. I remember creating an external table in Hive like 10 years ago. Yes, in 2012! At that time there were no data lakes. I was working for an investment bank and all we had was Hadoop (HDFS). On top of this distributed file system we used Apache Hive so we could query the data using SQL. See the Hive documentation on how to do this: link.
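For comparison, this is roughly what that looked like in Hive: the data stays as files in HDFS, and the external table definition just describes how to read them. The table, columns and path below are made up for illustration.

    -- A sketch of a Hive external table over CSV files already sitting in HDFS.
    CREATE EXTERNAL TABLE trades (
      trade_id BIGINT,
      ticker   STRING,
      quantity INT,
      price    DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/raw/trades/';

    -- Query the files with SQL:
    SELECT ticker, SUM(quantity * price) AS notional
    FROM trades
    GROUP BY ticker;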

Then recently, when working with Databricks, I read Armbrust's seminal paper on Delta tables: link. Well, not just Michael Armbrust; 10 other people wrote it too. In that paper they explain how they store the transaction log, checkpoints and change metadata directly within the object store, like below:

It is similar to Apache Hive ACID, but in Hive ACID the metadata is stored in an RDBMS, not in HDFS or object stores. Armbrust calls this approach Delta Lake. It is similar to Apache Hudi and Apache Iceberg.

In this paper (link) they explain the mechanism in great detail, including the storage format, the log checkpoints, the isolation levels, adding log records atomically, writing transactions, and reading from tables.
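To give a feel for it, below is a rough sketch of what a Delta table looks like on storage, plus two Databricks SQL commands for peeking at the transaction log. The table name and file names are illustrative, not taken from the paper.

    -- A Delta table is just data files plus a _delta_log folder in the object store,
    -- e.g. (illustrative layout):
    --   /lake/sales/part-00000-xxxx.snappy.parquet                      <- data files
    --   /lake/sales/_delta_log/00000000000000000000.json                <- commit 0
    --   /lake/sales/_delta_log/00000000000000000001.json                <- commit 1
    --   /lake/sales/_delta_log/00000000000000000010.checkpoint.parquet  <- log checkpoint
    --
    -- From Databricks SQL we can inspect that log:
    DESCRIBE HISTORY sales;   -- one row per commit recorded in _delta_log
    DESCRIBE DETAIL sales;    -- location, format, number of data files, etc.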

Until today, whenever I query any table in Databricks, I'm amazed by the wonderful mechanism underneath it. Earlier I just took it for granted, didn't really think about how it works, and treated it as just another RDBMS. Well, that's because we can run SQL select, update and delete statements, etc. (link, link), just like in Oracle or SQL Server. And yes, we can do JOIN too! (link). But underneath it's all stored as files in the data lake. What a wonderful mechanism!
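As a sketch (the tables and columns are hypothetical), the day-to-day SQL looks just like an RDBMS, even though every statement ends up as Parquet files plus log entries in the lake:

    -- Illustrative Databricks SQL on Delta tables.
    CREATE TABLE customers (customer_id INT, customer_name STRING) USING DELTA;
    CREATE TABLE sales (id INT, customer_id INT, amount DOUBLE) USING DELTA;

    INSERT INTO customers VALUES (100, 'Acme'), (101, 'Globex');
    INSERT INTO sales VALUES (1, 100, 250.0), (2, 101, 99.0);

    UPDATE sales SET amount = 120.0 WHERE id = 2;   -- works like any RDBMS...
    DELETE FROM sales WHERE id = 1;                 -- ...but rewrites data files and appends to the log

    -- And yes, JOIN works too:
    SELECT c.customer_name, s.amount
    FROM sales s
    JOIN customers c ON c.customer_id = s.customer_id;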

So a data lake these days is not just in the cloud, but is also represented as tables that we can query using SQL statements, either as Snowflake external tables or as Delta tables in Databricks. That is the internal mechanism of cloud data lakes.
