Data Warehousing and Data Science

24 April 2017

Choosing between Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 4:14 am

Suppose we have 100 files, each containing 10 million rows, that we need to load into a repository so that we can analyse the data. What should we do? Do we put them into Hadoop (HDFS), or into a database (RDBMS)?

Last week I defined the difference between Big Data and Data Warehousing as: Big Data is Hadoop, and Data Warehousing is RDBMS. See my article here: link. Today I would like to illustrate, using an example, a case where we need to choose between the two.

There are four factors to consider:

  1. Data Structure
  2. Data Volume
  3. Unstructured Data
  4. Schema on Read

1. Data Structure: Simple vs Complex
If all 100 files have the same structure, e.g. they all consist of the same 10 columns, then it is better to put them into Hadoop. We can then use Hive, Spark, Presto, R or Python* to analyse the data, for example to find patterns within the data, do statistical analysis, or create forecasts (see the sketch below). The development time will be shorter, because there is only one layer.
*or Phoenix, Impala, BigSQL, Stinger, Drill
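To make this concrete, here is a minimal PySpark sketch, assuming the 100 files are comma-separated with a header row and have been copied into a hypothetical HDFS directory /data/landing (the column names used below are made up for illustration):

```python
# A minimal sketch, assuming 100 CSV files with the same 10 columns have
# landed in HDFS under /data/landing/ (paths and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-analysis").getOrCreate()

# One read picks up all 100 files, because they share the same structure
df = spark.read.csv("hdfs:///data/landing/*.csv", header=True, inferSchema=True)

# Quick statistical profile of every numeric column
df.describe().show()

# Example aggregation: daily totals of a hypothetical 'amount' column
(df.groupBy("trade_date")
   .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("row_count"))
   .orderBy("trade_date")
   .show())
```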

If the 100 files contain 100 different tables, it is better to put them into a database, create a data warehouse, and use an Analytics/BI tool such as Tableau or QlikView* to analyse the data, for example to slice and dice the data, find percentages and anomalies, and do time series analysis (see the example below). Yes, we need to create 3 layers (staging, 3NF, star schema), but it enables us to analyse each measure by different dimensions.
*or Looker, Power BI, MicroStrategy, BusinessObjects, Cognos BI, BIRT, Pentaho, Roambi, SAS, Sisense, or another BI tool
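To illustrate the kind of slice-and-dice query a star schema enables, here is a hedged sketch against hypothetical fact and dimension tables, run from Python via pyodbc; a BI tool would normally generate something similar behind the scenes, and the DSN, table and column names are made up:

```python
# A sketch of a "measure by dimensions" query against a hypothetical star schema.
# The DSN, table names and column names are made up for illustration.
import pyodbc

conn = pyodbc.connect("DSN=MyWarehouse")  # hypothetical ODBC data source
cursor = conn.cursor()

# Sales amount sliced by calendar month and product category
cursor.execute("""
    SELECT d.calendar_month,
           p.product_category,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales   f
    JOIN   dim_date     d ON f.date_key    = d.date_key
    JOIN   dim_product  p ON f.product_key = p.product_key
    GROUP BY d.calendar_month, p.product_category
    ORDER BY d.calendar_month, p.product_category
""")

for row in cursor.fetchall():
    print(row.calendar_month, row.product_category, row.total_sales)
```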

So if the data structure is simple, put it into Hadoop, and if the structure is complex, put it into a data warehouse. That is the general rule, but there are always exceptions. Can data with a simple structure be put into a data warehouse? Of course it can. Can data with a complex structure be put into Hadoop? Of course it can.

Using Hadoop and Hive/Spark/Presto we can also slice and dice, find percentages and anomalies, and do time series analysis. Using a data warehouse we can also do machine learning and data mining to find patterns in the data, do statistical analysis, and create forecasts. So basically, whether we store the data in Hadoop or in a data warehouse, we can still do a complete analysis.

The issue here is storing it. Linking 100 tables in Hadoop is difficult and not natural. An RDBMS such as SQL Server or Oracle is designed precisely for that task: linking and joining tables. Constructing a data model linking 100 tables is very well suited to an RDBMS. Can we design a data model linking 100 files with different structures in Hadoop? Of course we can, but it is much more difficult. For starters, it is Schema-on-Read, so the columns in the files have no data types. Schema-on-Read means that we don't try to understand the relationships between the files when loading them into Hadoop. So yes, we can load the 100 files into Hadoop, but we keep them as individual files, without relationships between them. This is the same as a Data Lake, which also uses Schema-on-Read and also uses HDFS.

2. Data Volume: Small vs Large

100 files containing 10 million rows each is 1 billion rows; if those files arrive every day, that is 1 billion rows per day. If all 100 files have the same structure (say they all consist of the same 10 columns), then we will have a performance problem if we load them into an SMP database such as SQL Server or Oracle. Within 3 years, this table will have about 1 trillion rows. Even with partitioning and indexing, it will still be slow to query.
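The arithmetic behind that estimate, made explicit:

```python
# Back-of-the-envelope volume calculation using the figures above
files_per_day = 100
rows_per_file = 10_000_000

rows_per_day = files_per_day * rows_per_file      # 1,000,000,000 rows (1 billion) per day
rows_in_3_years = rows_per_day * 365 * 3          # roughly 1.1 trillion rows

print(f"{rows_per_day:,} rows per day")
print(f"{rows_in_3_years:,} rows after 3 years")
```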

Hadoop, on the other hand, will have no problem storing and querying 1 trillion rows. It is designed exactly for this task: storing the data in many files and querying them in parallel using Stinger, Drill, Phoenix, Impala or Spark. The file structure is simple (the same 10 columns in each file), which lends itself to Hadoop.

MPP databases such as Redshift, Azure SQL Data Warehouse, Exadata, Teradata, Greenplum and Netezza are more than capable of handling this, with excellent query performance. But MPPs are more costly than Hadoop, which is why companies tend to choose Hadoop for this task. Using an MPP for this task is like killing a fly with a cannon: not only is it expensive and unnecessary, it is also too sluggish and cumbersome for the task.

If the 100 source files have a complex structure (such as an export from an SAP system), then yes, an MPP is a suitable solution, as we need to create relationships between the files/tables. But if the source files have a simple structure and we just need to union them, then Hadoop is more suitable and more economical for the task.

So if the data volume is large, like 1 billion rows per day, and the data structure is simple, put the data into Hadoop. But if the data volume is large and the data structure is complex, put it into an MPP.

3. Unstructured Data

If most of those 100 source files are MP4 (video) or MP3 (music), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS, be it SMP or MPP, is not designed to store video or music files. It can (as blobs, or as externally linked files), but it is not really designed for it.
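As a hedged sketch of what "just store the file" looks like in practice, the snippet below uses the HdfsCLI Python package against a WebHDFS endpoint; the host, port, user and paths are assumptions for illustration:

```python
# A sketch of landing binary files (e.g. MP4/MP3) directly into HDFS using
# the HdfsCLI package. Host, port, user and paths are hypothetical.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="etl")

# The raw file is stored as-is: no blob column, no external link table
client.upload("/data/media/videos/clip001.mp4", "clip001.mp4")
```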

If the source files have a varying number of attributes (such as Facebook or Twitter data), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS is not really designed for it.
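For example, a hedged PySpark sketch of reading semi-structured data where records do not all carry the same attributes (the paths and field names are hypothetical):

```python
# A sketch of reading semi-structured data (e.g. tweets exported as JSON Lines)
# where records have varying attributes. Paths and field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("semi-structured").getOrCreate()

# Spark merges whatever attributes it finds into one schema;
# records that lack a field simply get null there
tweets = spark.read.json("hdfs:///data/twitter/*.json")
tweets.printSchema()

tweets.select("id", "lang", "retweet_count").show(10)
```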

Unstructured data can also come in the form of free-format text files (such as emails) and documents (such as journals and patents). Again, Hadoop or a Data Lake is in a much better position to store them than an RDBMS. But even better is a document database, such as MongoDB, AWS DynamoDB or Azure Cosmos DB.
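A hedged sketch of the document-database option, using pymongo (the connection string, collection and field names are made up):

```python
# A sketch of storing free-format documents in MongoDB, where each document
# can have its own fields. Connection string and field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
emails = client["archive"]["emails"]

# Two documents with different shapes can live in the same collection
emails.insert_one({"from": "alice@example.com", "subject": "Q1 report", "body": "..."})
emails.insert_one({"from": "bob@example.com", "attachments": ["patent.pdf"], "body": "..."})

# Query by any field, with no up-front schema design
for doc in emails.find({"from": "alice@example.com"}):
    print(doc["subject"])
```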

4. Schema-on-Read

One of the advantages of using Hadoop or a Data Lake is that they are Schema-on-Read, meaning that we just store the files without determining whether each column is numeric or string. It is only when we want to query them that we need to specify the data types.

Why is this an advantage? Because it makes things flexible. In Data Warehousing, the first thing we need to do is analyse the file structure and design many tables to host the files in a Staging database. Then we design a normalised database to integrate those Staging tables. And then we design a Reporting layer in the form of Fact and Dimension tables and load the normalised tables into them. The whole thing can take a year if we have 100 files. The more files we have, the more complex the process and the longer it takes to design the databases for the Integration and Reporting layers. It is good for the data architect (it keeps them in a job), but it is not good for the people who pay for the project.

Hadoop, on the other hand, is Schema-on-Read. After we put these 100 files into Hadoop, we query the first file, and when we query it we specify the data type of each column. We don't need to touch the other 99 files yet, and we can already get a benefit: we can analyse the data straight away, on day one. If the other 99 files have the same structure, then we can union them, without the extra effort of designing any database, and query them straight away, on day two. It is much simpler: we don't need a team of 10 people designing Staging, Normalised and Reporting layers for many, many months. We can start analysing the data straight away, and the project can finish in 2 to 3 months with only 3 or 4 people. A lot less costly, a lot more agile, and a lot more flexible.
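A hedged PySpark sketch of that day-one / day-two flow, declaring the data types only at query time (file paths and column names are hypothetical):

```python
# A sketch of Schema-on-Read: data types are declared when we query, not when we load.
# File paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Day one: declare the types for the first file and start analysing it
schema = StructType([
    StructField("trade_date", DateType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
])
first_file = spark.read.csv("hdfs:///data/landing/file_001.csv", schema=schema, header=True)
first_file.groupBy("account_id").sum("amount").show()

# Day two: the other 99 files have the same structure, so one wildcard unions them all
all_files = spark.read.csv("hdfs:///data/landing/file_*.csv", schema=schema, header=True)
print(all_files.count())
```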

Conclusion

So those are the four factors to consider when choosing between implementing Big Data or a data warehouse: data structure, data volume, unstructured data, and Schema-on-Read.

17 April 2017

Definition of Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 5:38 pm

I’m annoyed that many people label the normal Data Warehousing & BI stuff as “Big Data”. It is not. For me, Big Data is the “Hadoop stuff” (HDFS). If it is in SQL Server, Oracle or Teradata, it is just a normal database, not Big Data.

Big Data is stored on HDFS (Hadoop Distributed File System), not in an RDBMS. Oracle, IBM, Microsoft, SAP and Teradata all use Hadoop to store Big Data. Big Data is queried using MapReduce, or engines built on top of Hadoop such as Hive and Spark.

The first reason Big Data can't be stored in an RDBMS is that the format is not tabular: sometimes it is 2 columns, sometimes it is 200 columns, like Twitter data. The second reason is that it is too big. Sensors can take 100 measurements in a second, and in a year the data could amount to petabytes. Web logs are another example; tracking the bid and offer price of every transaction in every stock market is another. Yes, we can put petabytes into SQL Server or Oracle, or into Netezza or Teradata, but not at this speed (and, more importantly, not at this price!). Hadoop, on the other hand, is designed exactly to cope with this kind of speed and volume (and price).

Now the usage. What is Big Data Analytics? Big Data Analytics is when we do analytics on data stored in Hadoop.

Is Fraud Detection Big Data Analytics? Not always. Fraud Detection can be done on a normal Data Warehouse or a database. Is Machine Learning Big Data? Not always. Machine Learning can be done on a normal Data Warehouse or a database. If the Fraud Detection or the Machine Learning is done on data stored in Hadoop, then it is Big Data Analytics.

Even if it is only 200 GB, if it is stored in Hadoop, it is Big Data. Even if the data is 5 petabytes, if it is stored in a SQL Server database, it is not Big Data, in my opinion.

Even if the data is in tabular format (i.e. columns and rows), if it is stored in Hadoop, it is Big Data. But if it is stored in an Oracle database, it is not Big Data.

Every Big Data architecture that I know of uses Hadoop. No company (or government) implements Big Data on an RDBMS, or on non-HDFS files. Every single company, every single Big Data case I have read about, implements Big Data on Hadoop. I may be wrong and would be happy to be corrected: if you know of a case which implements Big Data on a non-Hadoop system/architecture, I would be grateful if you could let me know, either through the comments or via vrainardi@gmail.com.
