In data warehousing, we can use Hadoop as staging area. Or we can use it for the main data store with SQL query (using Hive). We can also use Hadoop to backup the data warehouse.
Hadoop usage in data warehousing:
a) As staging area
b) Data store with SQL query
c) To backup the DW
I do not think that in point b) above, Hadoop will be able to replace relational databases in the main stream. In some cases where it is a huge table with specific query, yes. For for analysing big data, yes. But not for everything / general storage. Both SMP and MPP RDBMS such as Oracle, SQL Server, DB/2, Teradata & Netezza will still be used as the main store of data warehouses. Of course columnar DB will still be an important part of data warehousing, be it Hadoop based (e.g. HBase) or not.
I agree with Ralph Kimball (link) and Amr Awadallah and Dan Graham from Cloudera and Teradata (link) that Hadoop and RDBMS will co-exist in data warehousing and they will complement each other.
Hadoop is particularly suitable in big data analytics, for example the use cases that Ralph Kimball wrote in his article (link). Unlike RDBMS, Hadoop can not do transaction processing. Hadoop only support primitive SQL, through Hive. The main advantages of Hadoop over RDBMS are: a) excellent speed for searching data, and b) we don’t need to structure the data. I still can’t understand/accept this paradigm shift. How come we don’t need data modelling anymore?
MapReduce is a method for processing large data by splitting the data into smaller parts and giving the parts into many nodes. Each nodes process the data independently in parallel, then give the results back to the node which gave the task, which will then combine the results.
If we use HDFS (see below), the higher nodes know what data the lower nodes have. This way, the amount of traffic is greatly reduced, because the lower nodes only receive the data suitable for them.
MapReduce method does to data files what MPP and Teradata did in databases: parallel computing. A primer on MapReduce is here, here and here.
Apache Hadoop is the most popular implementation of MapReduce. It is open source, built by global community. Hadoop Distributed File System (HDFS) stores large files across multiple machines. HDFS has high reliability because the data is stored in 3 copies (or 4, or 2, but 3 the default setting). It is written in Java and it uses TCP/IP and RPC.
HDFS does not have high availability, because the main node does not have automatic failover. It can only do manual failover. Apart from HDFS, Hadoop can also use Amazon S3 file system (and other file systems too).
Data Warehouse Architectures using Hadoop
There are 3 main architectures of data warehousing using Hadoop: we can put Hadoop before or after the RDBMS in the data warehouse architecture (see A and B below). We can also put Hadoop in parallel to the RDBMS, i.e. big data goes to Hadoop, and “small data” goes to RDBMS (see C below), as mentioned by Amr Awadallah and Dan Graham in their article (link). See below for definition of small data and big data.
Please note that Ralph Kimball’s diagrams in his article (link, figure 4) are for “big data” sources, whereas architecture A & B above are for “small data” sources.
In architecture A above, Hadoop is used for staging area. In contrast to the normal file system or relational staging area, using Hadoop we keep all the past dates data as well. Apart from historical data, the main advantages of using Hadoop for staging area are the speed and capacity. Then we use tools like Cloudera Scoop to import data from Hadoop into RDBMS. In the RDBMS we build dimensional DW e.g. fact & dim tables. We don’t need to build 3NF EDW (or in my book, I call it NDW) to keep the historical data in normalised form. We keep the historical data in Hadoop.
This architecture (A) is similar to using Data Vault for data warehousing, but different. When we use Data Vault, we use it for EDW. When we use Hadoop, we use it for staging. When we use Data Vault, we do data modelling in Data Vault to build the EDW (not 3NF but Data Vault style, i.e. designing hub, link, satellite. When we use Hadoop, we don’t do data modelling. We store the data in its native data model, we don’t remodel it.
In architecture B above, Hadoop is used for backup. I have to say that it’s probably over the top if we use Hadoop just for data backup. It’s very fast and it’s resilient too (see HDFS above), because it’s a distributed file system working independently in parallel. But there are 30 other distributed, fault-tolerant file systems (see here) that can be used for the same purpose, instead of Hadoop.
In architecture C, we use Hadoop for big data, and RDBMS for “small data”. I think in the next few years the BI tools for big data will develop further and better. But yes, the combining of small and big data is done in the BI tool, not in the back end. It is of course possible to combine them in the back end, either in the RDBMS or in Hadoop, but there are advantages & disadvantages of each of these 3 different places where the combination takes place.
Using Hadoop for “Small Data” Data Warehousing
I do not see the benefit of using Hadoop for “small data” and can see many disadvantages in doing so: the data types, the ETL difficulties/inflexibility, data consistency (state/ACID), lack of SQL ability.
Hence for “small data” I don’t think anyone would argue that we need to use RDBMS, be it SMP, MPP or columnar. But, for big data, I don’t think anyone would argue that we need to use Hadoop. Please see below for definition of small and big data.
I believe that most data warehouses on the planet are about “small data”. Only very few are about big data. If I have to put a number on it (so that you have an idea of how much), I would say that 95% of all data warehouses are about “small data”. And therefore this is section is focusing on using Hadoop for “Small Data” data warehousing.
So in Small Data DW we use Hadoop in A and B architecture above, i.e. as staging area or as backup storage. We don’t use Hadoop for Small Data DW. We still use RDBMS for Small Data DW. If you see it differently I’m happy to be corrected.
What is Small Data? Big Data?
What is big data? Again it is easier to learn by example than by definition. Big data is images/photos/video/music streams (and as files too), unstructured text (documents, emails), web site log files, outputs (streams) from scientific/industrial instruments and sensors such as thermometers, digital pneumatic/pressure gauges.
Big data does not have to be big (peta/exabytes). Even 50 GB can be said as big data if the structure is too complex for a normal RDBMS to store.
What is small data? Small data is simple data structures, e.g. numbers (be it monetary, integers, fractions or floating points), strings (names, description, types), dates, times, and all the data we used to know in the last 30 years of data warehousing history.
A large collection of small data is big data. Again I have to put a number here so you have an idea of how big is big. I don’t like being unclear. A collection of simple data structures of more than 1 petabytes is big data.
Bear in mind that now is 2012. Petabytes (PB) is 2012 term. But 5 years ago in 2007 it was Terabytes (TB). So in 2014 the definition will change to: a collection of simple data structures of more than 1 Exabytes is big data. And in 2015 the definition will change to “more than 1 Zettabytes”.
In a few years time, these seemingly foreign words will be common words in IT: Petabytes, Exabytes, Zettabytes, Yottabytes. Here the Wikipedia page for their definition: link.
So in DW, we will be using Hadoop for big data (complex data structures). But for small data (simple data structures) we will still be using RDBMS. For simple data structures exceeding 1 PB, MPP will still cater for them, but for some cases we need to use Hadoop (if we need more performance than MPP, or if it is more than 10 PB).
Hadoop and RDBMS (including columnar and MPP) will be side by side in DW. RDBMS will still be used in majority of DW to store simple data structures up to a few PBs. Beyond 10 PB (simple structure) we probably need to use Hadoop. But for complex data structures, Hadoop is the right tool to process them.
Big data: images/photos/video/music streams (and as files too), unstructured text (documents, emails), web site log files, outputs (streams) from scientific/industrial instruments and sensors such as thermometers, digital pneumatic/pressure gauges.
Big data: A collection of simple data structures of more than 1 petabytes.
I hope this gives a bit of understanding to those of you who are new to the subject. And for the Hadoop, DW and big data experts out there, I would appreciate your opinions and advice.
Vincent Rainardi, 3rd July 2012, firstname.lastname@example.org