Data Warehousing and Machine Learning

18 October 2021

Why Data Lake?

Filed under: Data Warehousing — Vincent Rainardi @ 6:55 am

I’ve been building data warehouses for 15 years and my typical architecture is like this:

My main focus has been to build a data warehouse in a database in dimensional model format (Kimball star schema). The database is usually SQL Server but also Oracle and Teradata. This works well because a) the warehouse is queryable by SQL, b) the data is integrated, and c) the star schema is easy to use and performant. The front end varied from company to company e.g. Reporting Services, Excel, Tableau, Power BI, Qlik, Cognos, Business Objects, and users also query the warehouse directly. Occasionally I built OLAP cubes (mostly Analysis Services but also Hyperion) and users access it from Excel. For 15 years the name of the game was “Business Intelligence” (BI), that’s what data was used for.

In the last few years I’ve been doing Machine Learning and it mostly requires data files (rather than database), including image files (for CNN) and sequential/streamed data (for RNN). This makes the Data Warehouse not fit for purpose, and we use Data Lake instead. The ML tools I use are varied, from Jupyter notebooks, Tensorflow in Google Colab, Azure ML Studio, to Rasa NLP engine and Watson, to NVIDIA Jetson Nano. Data is no longer used for BI, but for modelling AI. It is the output of the ML models that is used for reporting and BI. The model files (h5), the augmentation (images), the model parameters and model output (pickle files) are stored in Data Lake. The Data Lake started to be used for staging area as well, as users require access to the raw data files. So the architecture become like this:

It is a completely different world to 15 years ago. We change and adapt. The Data Lake is a collection of files, yes, but it is also queryable via SQL for those old school folks (if we use Databricks or Hive). And Python friendly too, for those ML folks. Data Lake is available in various formats, one of the best ones is Databricks Delta (available on Azure and AWS), which stores data in optimised Parquet files (compressed vertical format). It provides transaction logs, consistent view, concurent read/write operations and good performance (z-order partitioning).

So as we can see above, the data mart/warehouse is still there in the architecture, with the data lake used as its staging area. The warehouse also takes in the the output of the ML models e.g. predictions, etc. And the BI Tools and reports are still there too like before, publishing the content of the warehouse to the users. Where is the integration point? The warehouse of course. Or if you use a multi-marts architecture (instead of the single warehouse), then the integration point is the data marts.

Note: what I labelled as “Data Lake” in my diagram above (e.g. Databricks Delta Lake, see below diagram) some people call it a “Lakehouse”. A Lakehouse is a data lake and a data warehouse in one thing, see Ref #6 below. My architecture diagram above (and below) is not a Lakehouse, it is the architecture of a modern data warehouse, which uses both a data warehouse and a data lake for the reasons I outlined above.

The Cloud Architecture

The main difference to 15 years ago is the cloud architecture (which is also the main driver). It hard to argue against cloud, whether you choose Azure, AWS or GCP. First, we don’t pay a million dollar in advance, but pay by the hour (tens of thousands per month). Second, we enjoy robust infrastructure, high availability, hardware redundancy, 24×7 support, tight security. We don’t need to think about backup and failover, it is always available. Third, there is no patching. This is like heaven for the infrastructure team, and even more so for the head of IT, the Chief Financial Officer and the Chief Operating Officer.

With that, data warehouses are now in the form of PaaS databases (Platform as a Service). Whether you are a SQL Server shop or Oracle shop, you would be pushed to be in the cloud. We won’t be able to argue why we need to maintain on-premise databases.

The pressure for a data lake platform to be in the cloud is even stronger then a data warehouse. 10 years ago, people still talk about building their own Hadoop cluster, on premise. The “zoo” as we used to call it (because it’s not just the elephant but also the pig, the bee hive, the oozie, the hbase whale). But now it is unheard of. Everyone just use cloud based data lake like Azure and AWS (between them they probably have 90% of the market). It’s a pay-as-you-go (pay by the hour), it’s has high availability, hardware redundancy, 24×7 support and tight security. We get a world class platform for a low cost.

Conclusion

So why data lake? Because a) the ML systems need files rather than database, e.g. image files, streamed data, b) to store unstructured data, e.g. tweets, text, images, videos, c) to be used as a staging area for the data warehouse.

Data lake is not the integration point, it is just a storage. The integration point is still the data warehouse (for the structured data). So data warehouse is still being used, and the BI tools are still being used too.

This modern data warehouse architecture (which is a data lake plus a data warehouse) must be in the cloud, the most popular ones are Azure and AWS.

References

  1. BlueGranite’s diagram of a modern data warehouse: link, link
  2. Scott Muniz’s diagram of a Databricks data lake in Azure: link
  3. Kevin Clugage and Denny Lee’s diagram of a modern data warehouse: link
  4. Alexandre Gattiker’s diagram of Azure Databricks streaming architecture with a data warehouse: link
  5. Clinton Ford’s diagram of a modern data architecture with Delta Lake and Azure Databricks: link
  6. What is a LakeHouse by Databricks team: link

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: