Data Warehousing, BI and Data Science

6 September 2018

Modern Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:09 pm

What should 2018 Data Warehousing look like (as opposed to in year 2005)? That’s what the word “Modern” refers to. We now have streaming data, we have big data, we have data lake, and we have data scientists. A modern DW in 2018 incorporates these 4 factors. An “old” DW in 2005 did not incorporate these 4 factors.

The first attempt to define a Modern DW was from Bill Inmon in his DW 2.0 concept in 2006, where he added unstructured data, different user communities and (whitepaper: link, book: link). There was no big data or NoSQL at that time, but we already had unstructured data in data warehousing. I would argue that metadata and different archiving lines were not new, they had been in DW before 2005.

Then there was an attempt to define a Modern DW as Lambda, i.e. streaming and batch. The best definition is probably from Melissa Coates: link, search “characteristics of a modern data warehouse” on this page. Here she defines other importance characteristics such as Hadoop, data lake, NoSQL, bimodal (sandbox). I would argue that data virtualisation, near real-time, governance, agile delivery, self service BI, large volumes, MDM and high variety are not new. They had been in DW before 2005.

The Azure community is quite adamant with the Lamda concept of a modern DW. Notably Adatis, in particular Simon Whiteley in April 2017 (link), and in SQL Saturday. Microsoft supports this Lambda idea in Azure (concept: link, diagram: link).

Then there is a concept of “Data Lake side-by-side with DW”, notably proposed by Barry Devlin, supported by Hortonworks (link) and also proposed by BlueGranite (link) where Melissa works (see her SQL Saturday presentation is here). The diagram on page 5 of the BlueGranite eBook, which essentially the same as page 5 of Melissa’s SQL Saturday, is in my opinion the best diagram describing a modern DW, as we can see below:

BlueGranite diagram

Here we have the data lake as the staging area for both the streaming data and the data warehousing. And it is also the staging area for the NoSQL data store such as graph database. And the data lake is also accessed by the machine learning. So there are 4 purposes of the data lake.

Arguably it’s Microsoft centric (i.e. Azure), but they are the best vendor at the moment with the most complete components for modern DW (in my opinion). A close second is AWS, with Redshift for the DW and S3 for the data lake (catalogue & ETL on Glue). The Hadoop and Spark are on EMR (Elastic MapReduce). GCP is on the 3rd place with BigQuery as the DW, Cloud Storage as the data lake, and Cloud DataFlow for the streaming data. The Hadoop and Spark are on Cloud Dataproc.

Advertisements

Blog at WordPress.com.