Data Warehousing and Data Science

16 July 2012

The Main Weakness of Snowflake Schemas

Filed under: Data Warehousing — Vincent Rainardi @ 1:48 am

Let’s have a look at this dimensional model:

It's a snowflake schema: the dimensional model for a book distribution company. The company buys books from publishers, stores them in its warehouses and sells them in its shops. "Line" means product line, i.e. book category, such as academic books, fiction books, sports books, etc.

There are a few interesting points here.

  1. DimLine, DimAuthor and DimPublisher are the sub dimensions of DimProduct.
  2. DimCity is a sub dimension of DimCustomer.
  3. Each sub dimension has a surrogate key.

Because each sub dimension has its own surrogate key, there is additional complexity. If the sub dims are type 2, then whenever an author name or publisher name changes, a new surrogate key is created in the sub dim, which in turn creates a new row in DimProduct. Let me demonstrate:

Below are DimProduct, DimLine, DimAuthor and DimPublisher:

Now let’s make 1 change in each sub dimension (SCD type 2):

Each time a sub dim changes, DimProduct gets a new row; with one change in each of the three sub dims, we get three new rows:

As you can see above, we now have 4 versions of Rainfall in DimProduct.

That is the main weakness of a snowflake schema: we can't do type 2 on the sub dims without multiplying the rows in the main dimension. We can only do type 1, i.e. overwrite the attribute in place, like this:
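Here is a minimal SQL sketch of that structure and of the type 1 update (the table and column names are illustrative, not taken from the diagram above):

create table DimPublisher
( PublisherKey  int not null primary key  -- surrogate key
, PublisherId   varchar(10)               -- business key
, PublisherName varchar(100)
);

create table DimProduct
( ProductKey    int not null primary key  -- surrogate key
, ProductId     varchar(10)               -- business key
, Title         varchar(100)
, LineKey       int                       -- points to DimLine
, AuthorKey     int                       -- points to DimAuthor
, PublisherKey  int                       -- points to DimPublisher
);

-- Type 1 on the sub dim: overwrite in place, so the publisher keeps
-- its surrogate key and DimProduct is untouched.
update DimPublisher
set    PublisherName = 'New Name'
where  PublisherId   = 'PUB01';

-- Type 2 would instead insert a new DimPublisher row with a new
-- PublisherKey, which forces a new DimProduct row for every product
-- of that publisher -- exactly the problem shown above.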

Whenever I ask a DW practitioner which approach they use, star or snowflake, the answer is almost always star. But when I dig deeper into why not snowflake, nobody brings up this fundamental issue: the inability to store the history of the sub dimensions' attributes. I hope this article makes it clearer to DW practitioners.

14 July 2012

What is Big Data, Data Warehouse, Data Mining

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 6:35 am

Michael Koploy from Software Advice gathered the opinions of 5 BI experts on the definitions of 3 BI terms that puzzle many people:

  • What is Big Data
  • What is a Data Warehouse
  • What is Data Mining

The article is here. The 5 experts are:

My definitions:

What is Big Data? (from the Hadoop in Data Warehousing article on my blog)

What is big data? Again, it is easier to learn by example than by definition. Big data is images, photos, video and music (as streams and as files), unstructured text (documents, emails), web site log files, and output streams from scientific and industrial instruments and sensors such as thermometers and digital pneumatic or pressure gauges.

Big data does not have to be big (peta/exabytes). Even 50 GB can be called big data if the structure is too complex for a normal RDBMS to store.

What is small data? Small data is simple data structures, e.g. numbers (be they monetary, integers, fractions or floating points), strings (names, descriptions, types), dates, times, and all the data we have known in the last 30 years of data warehousing history.

A large collection of small data is also big data. Again I have to put a number here so you have an idea of how big is "big". I don't like being unclear. A collection of simple data structures of more than 1 petabyte is big data.

Bear in mind that now is 2012. Petabytes (PB) is a 2012 term; 5 years ago, in 2007, it was Terabytes (TB). So in 2014 the definition will change to "a collection of simple data structures of more than 1 Exabyte", and in 2015 it will change to "more than 1 Zettabyte".

In a few years' time, these seemingly foreign words will be common words in IT: Petabytes, Exabytes, Zettabytes, Yottabytes. Here is the Wikipedia page for their definitions: link.

What is a Data Warehouse?

From my book: (Amazon.com, Amazon.co.uk, Apress, my blog)

A data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Below is a diagram of a data warehouse system: (from my book)

Below is a short explanation about each component in the above diagram, and how the components work together in a data warehouse system: (again, this is from my book)

The source systems are the OLTP systems that contain the data you want to load into the data warehouse. Online Transaction Processing (OLTP) is a system whose main purpose is to capture and store the business transactions. The source systems’ data is examined using a data profiler to understand the characteristics of the data. A data profiler is a tool that has the capability to analyse data, such as finding out how many rows are in each table, how many rows contain NULL values, and so on.

The extract, transform, and load (ETL) system then brings data from various source systems into a staging area. ETL is a system that has the capability to connect to the source systems, read the data, transform the data, and load it into a target system (the target system doesn't have to be a data warehouse). The ETL system then integrates, transforms, and loads the data into a dimensional data store (DDS). A DDS is a database that stores the data warehouse data in a different format than OLTP. The reason for getting the data from the source system into the DDS and then querying the DDS instead of querying the source system directly is that in a DDS the data is arranged in a dimensional format that is more suitable for analysis. The second reason is that a DDS contains integrated data from several source systems.

When the ETL system loads the data into the DDS, the data quality rules do various data quality checks. Bad data is put into the data quality (DQ) database to be reported and then corrected in the source systems. Bad data can also be automatically corrected or tolerated if it is within a certain limit. The ETL system is managed and orchestrated by the control system, based on the sequence, rules, and logic stored in the metadata. The metadata is a database containing information about the data structure, the data meaning, the data usage, the data quality rules, and other information about the data.
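As an illustration of such a data quality rule, here is a hedged, ANSI-style SQL sketch (the schema, table and rule names are hypothetical, not from the book):

-- Hypothetical DQ rule: order dates must not be in the future.
-- Rows that fail the rule are copied to the DQ database so they can
-- be reported and corrected in the source system (or automatically
-- corrected or tolerated, depending on the rule's metadata).
insert into dq.failed_rows (rule_id, source_table, source_key, captured_at)
select 101, 'stage.SalesOrder', s.OrderId, current_timestamp
from   stage.SalesOrder s
where  s.OrderDate > current_date;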

The audit system logs the system operations and usage into the metadata database. The audit system is part of the ETL system that monitors the operational activities of the ETL processes and logs their operational statistics. It is used for understanding what happened during the ETL process.

Users use various front-end tools such as spreadsheets, pivot tables, reporting tools, and SQL query tools to retrieve and analyse the data in a DDS. Some applications operate on a multidimensional database format. For these applications, the data in the DDS is loaded into multidimensional databases (MDBs), which are also known as cubes. A multidimensional database is a form of database where the data is stored in cells and the position of each cell is defined by a number of variables called dimensions. Each cell represents a business event, and the values of the dimensions show when and where this event happened.

Tools such as analytics applications, data mining, scorecards, dashboards, multidimensional reporting tools, and other BI tools can retrieve data interactively from multidimensional databases. They retrieve the data to produce various features and results on the front-end screens that enable the users to get a deeper understanding about their businesses. An example of an analytic application is to analyse the sales by time, customer, and product. The users can analyse the revenue and cost for a certain month, region, and product type.

What is Data Mining? (from my book)

Data mining is the process of exploring data to find the patterns and relationships that describe the data and to predict the unknown or future values of the data. The key value of data mining is the ability to understand why some things happened in the past and the ability to predict what will happen in the future. To refer to predicting the future with regard to data mining, some people use the term forecasting, and some call it predictive analytics. On the other hand, when data mining is used to explain the current or past situation, it is called descriptive modeling, descriptive analytics, or knowledge discovery.

The use of data mining in business is growing by the day, both for descriptive and predictive analytics. Using data mining, we can find the correlation between purchase patterns and customer demographics. For example, in our Amadeus Entertainment case study, we can find whether there is a relation between the sales of a music product type and the customer interest or occupation. In the financial services industry, data mining is traditionally used for credit scoring, for example, to predict the ability and willingness to pay the loan based on (each country has different factors) payment behavior, past delinquency, current debt, credit history, number of inquiries, address, bankruptcies filed, and customer demographics.

In business intelligence, popular applications of data mining are for fraud detection (credit card industry), forecasting and budgeting (finance), cellular/mobile package development by analyzing call patterns (telecommunication industry), market basket analysis (retail industry), customer risk profiling (insurance industry), usage monitoring (energy and utilities), and machine service times (manufacturing industry).

12 July 2012

Delete All Rows in the Dimension Table

Filed under: Data Warehousing — Vincent Rainardi @ 5:32 pm

I was very surprised to receive an email this morning from somebody saying that they delete the whole dimension table and insert all the rows again, even though they use surrogate keys. And they are having a problem because the existing rows get new surrogate keys.

Whenever I hear "delete all rows in the dim table", a big alarm bell rings in my head. The words "delete" and "dim table" should not be in the same sentence!

This is the second time this month I have encountered people doing this, and I suspect there are many others out there doing the same thing and having exactly the same problem. So I thought I had to write about it asap. Here goes.

As always, it is better to explain using an example. Say it's a Member dimension. In the source system we have a Membership table, like this:

And in the data warehouse we have Member dimension, like this:

There are 2 types of changes happening in the source Membership table:

  • Change to an existing row
  • New row created

Let’s make these 2 types of changes: change Ayane’s name to Eyane and create a new member: Agus Salim, like this:

When we truncate the Member dimension and reload all rows, this is what we get:

All members get new surrogate keys.
G48’s surrogate key changed from 1 to 3.
G49’s surrogate key changed from 2 to 4.
That is the problem I mentioned at the beginning of this article: the existing rows get new surrogate keys.

The problem with G48's SK changing from 1 to 3 is that the fact table rows for G48 still refer to SK = 1. Now the fact table SKs don't match the dim table SKs, causing incorrect results when you join them.

As I said above, we should not delete from the dim table. Instead, we should update changed rows, and insert new rows. After the update, the dim table should be like this:
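Here is a minimal sketch of that update-and-insert load, written in SQL Server's T-SQL (the table and column names are assumptions, since the tables above are shown as figures):

-- Update existing members whose attributes changed (e.g. Ayane -> Eyane);
-- they keep their original surrogate keys.
update d
set    d.MemberName = s.MemberName
from   DimMember d
join   stage.Membership s on s.MemberId = d.MemberId
where  d.MemberName <> s.MemberName;

-- Insert new members (e.g. Agus Salim); only these get new surrogate keys.
-- DimMember's surrogate key is assumed to be an identity column, so it is
-- generated automatically.
insert into DimMember (MemberId, MemberName)
select s.MemberId, s.MemberName
from   stage.Membership s
where  not exists (select 1 from DimMember d where d.MemberId = s.MemberId);

On SQL Server 2008 onwards the same load can also be written as a single MERGE statement.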

So once again, the words "delete" and "dim table" should not be in the same sentence!

 

11 July 2012

Business Objects Voyager

Filed under: Business Intelligence — Vincent Rainardi @ 1:00 am

What is it?

BO Voyager is a tool for "browsing" cubes. It can read three OLAP platforms: SSAS, Hyperion Essbase and SAP BW (link, page 52). DB2 OLAP (8.2) uses the same provider as Essbase 7, hence Voyager can also open DB2 OLAP. Oracle OLAP can only be accessed using the Simba MDX provider (link), not natively. SAP BW (or BIW) can only be accessed using the BO Integration Kit for SAP.

The formal name of Voyager today (10/7/12) is “SAP BusinessObjects Analysis, edition for OLAP” (link). The user guide is here, the admin guide is here, the product sheet is here.

There are 2 versions of "BO Analysis"; the other one is the "edition for Microsoft Office", which is an Excel plugin (link). This is the new name for Business Explorer Analyzer (link).

This is what Voyager looks like: (link, page 51)

What is it for?

Voyager has been used for many years (probably more than 7 years) to browse SSAS cubes. Now that Excel can browse SSAS cubes, the use of Voyager for this purpose is questionable. Tableau and XLCubed have better capabilities than Voyager. We can't even create a hyperlink in Voyager.

Unlike Tableau, Excel or XLCubed, however, Voyager can browse Hyperion/Essbase and SAP BW as well. Small companies usually have only one OLAP platform, but large corporations like global banks usually have multiple OLAP platforms, and this is where Voyager comes in.

Apart from “browsing”/querying the cube and creating charts from it, we can use Voyager to do the following:

  • Highlight exceptions, e.g. make the cell red (or any colour) if the number is bigger or smaller than a certain limit.
  • Sort a column or a row, i.e. ascending, descending or by hierarchy.
  • Filter a column or a row, i.e. display only the top 10, or only rows where revenue exceeds a certain threshold.
  • Do calculations, e.g. sum, sumproduct, +, -, /, x, % of total.
  • Use OLAP functions: prior period, period to date, moving average, rank.
  • Link to a BO report (but not to a URL).

Is it good?

No. In my opinion, Voyager is a last generation product. It feels like ProClarity to me. It needs a lot of investment to beat modern BI products such as Tableau, XLCubed, Qlikview, Roambi and Spotfire.

It provides basic functionality to explore Essbase & SSAS cubes, but nothing more. If this were 2006, Voyager would have a chance. But in 2012, when the BI tools have grown strong through competition, there's no chance.

Prominent areas for improvement:

  1. Port to true thin client i.e. HTML5/Javascript
  2. Visualisation e.g. chart, dashboard
  3. Interactive ability, i.e. dashboard, drilldown, map, alert
  4. More integration with SSAS & Essbase, e.g. displaying description of measures & attributes, drilldown, KPI
  5. Scripting to support queries
  6. Automation and scheduling
  7. Mobile platform i.e. iPad, iPhone, Galaxy, Android, BB
  8. Cloud interface (S/I/PaaS), stream data
  9. Big data integration, i.e. Hadoop
  10. Silverlight integration
  11. Data mashup integration, e.g. Composite

SAP is a giant corporation with good R&D. I admire Hana, which is a modern (even futuristic) platform, i.e. an in-memory database & analytics engine. I wrote an article about Hana (link). I also admire BO Webi, which in my opinion is the best in its class, i.e. dynamic reporting; its functionality is a lot richer than SSRS Report Builder's.

But I don't admire Voyager. I believe SAP has the capacity to improve it, which is why I took the time to write the 11 points above. It needs a good investment to match today's standards. SAP is one of the giants in BI, along with Oracle, Microsoft, SAS and IBM (link). SAP has the financial & technical capability to refurbish Voyager into one of the best analytic tools.

9 July 2012

Duplicate Attribute Keys in SSAS

Filed under: Analysis Services,Data Warehousing — Vincent Rainardi @ 5:32 pm

Sometimes when we process the cube we get an error message: duplicate attribute key. And at other times we get: missing attribute key. In this article I will try to address these 2 issues.

A. Duplicate Attribute Key

There are 2 things to check in this case:

  1. Are you using the surrogate key (not the business key) to link the fact table to the dimension?
  2. Is there a duplicate business key in the dimension table?

1. Surrogate Key vs Business Key

In the data warehouse, to link the fact tables to the dim tables we must use surrogate keys; we must not use the business key to connect a fact table to a dim table. But time and time again I see people who have not read Kimball and Caserta's book thoroughly trying to build a data warehouse quickly, using the business key to connect fact and dim. It is easier to understand with an example, so I'll use an income fact table.

FactIncome:

DimWork:

The reason we have 2 rows for WorkId 781 is that DimWork is SCD type 2.

The above example shows that they link FactIncome to DimWork using the business key, WorkId. This is a fatal error, because when we query "What is the IncomeAmount for WorkId 781 on 9/7/12?" the DW will return 2 rows to the cube:

select DW.WorkId, DW.WorkKey, F.IncomeAmount
from FactIncome F
join DimWork DW on DW.WorkId = F.WorkId
where F.SnapshotDateKey = 20120709

Output:

781     2        140
781     3        140

And SSAS will sum the two 140s up into 280.

It is vital that in the fact table we use the surrogate key to connect to the dimension table, like this:

FactIncome:

So that when we query “What is the IncomeAmount for WorkId 781 on 9/7/12?” the DW will return only 1 row to the cube:

select DW.WorkId, DW.WorkKey, F.IncomeAmount
from FactIncome F
join DimWork DW on DW.WorkKey = F.WorkKey
where F.SnapshotDateKey = 20120709

Output:

781     2        140

2. Is there a duplicate business key in the dimension table?

The second reason for the "duplicate attribute key" error in SSAS is a duplicate business key in the dimension table.

This could happen if there are 2 rows in the source table with the same business key, but the business key on the second row has trailing spaces. For example:

Row 1: CustomerId = ‘GA34’
Row 2: CustomerId = ‘GA34  ’

The ETL considers these 2 rows as two different customers and allocates them different surrogate keys in the customer dimension.

If you use the business key as the key of the SSAS dimension, SSAS will say that there is a duplicate attribute key in the customer dimension, i.e. there are two rows with CustomerId = 'GA34'. This is because SSAS trims the attribute key.
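A quick way to find such rows is a query like this (a sketch; DimCustomer is an assumed name for the customer dimension):

-- Business keys that collapse to the same value once leading and
-- trailing spaces are trimmed -- these will clash in SSAS.
select rtrim(ltrim(CustomerId)) as TrimmedCustomerId,
       count(*)                 as NumberOfRows
from   DimCustomer
group by rtrim(ltrim(CustomerId))
having count(*) > 1;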

Of course you can suppress the error using a custom error configuration (a property of the dimension), but this hides the issue and makes the front end (say Tableau or Strategy Companion Analyzer) return incorrect data to the users.

This issue will not occur, by the way, if you use the surrogate key. It only occurs if you use the business key as the key of the dimension in SSAS.

B. Missing Attribute Key

The second error, missing attribute key, means that there is a surrogate key in the fact table which does not exist in the dimension table. We can isolate the fact row that gives the error message by looking at the details of the error message that SSAS gives us when we process the cube. The value of the key is given in the message. We can then query the dim table to check if the dim table has a row with that SK.

A possible reason why the SK exists in the fact table but not in the dim table is a mistake in the ETL task that populates the fact table. The ETL task can be an SSIS workflow or an Informatica session/mapping, but the principle is the same: the part of the workflow or mapping that does the lookup failed, either because the Lookup Transform didn't use the correct business key, or because it didn't look at the correct column.

There is another possibility if it is a type 2 dim: the Lookup Transform didn't get the correct version, i.e. the part of the lookup logic that examines the EffectiveDate and ExpiryDate (or ValidFromDate and ValidToDate) is not working properly.
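As a sketch of what a correct type 2 lookup does (following the FactIncome/DimWork example above; the staging table and date columns are assumptions):

-- Resolve the business key plus the transaction date to the surrogate
-- key of the dimension row that was valid on that date. If this logic
-- is wrong, or the dim row is missing, the fact table ends up with an
-- SK that SSAS cannot find, i.e. a missing attribute key.
select f.WorkId,
       f.SnapshotDate,
       d.WorkKey
from   stage.Income f
left join DimWork d
       on  d.WorkId        = f.WorkId
       and f.SnapshotDate >= d.EffectiveDate
       and f.SnapshotDate <  d.ExpiryDate;
-- Rows where d.WorkKey is null are the ones that would fail cube processing.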

Those are the various possibilities of why we could get "duplicate attribute key" and "missing attribute key" in an SSAS cube.

7 July 2012

SAP Hana, an in-Memory Database

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 7:19 am

What is it?

Hana is an in-memory database from SAP. It enables us to do lightning-fast OLAP analytics on relational data. We can access Hana data using SQL or MDX, or from BO, Cognos, Report Builder and PowerPivot. We can create calculated attributes and measures, and we can create our own data models and hierarchies.

A must-read technical introduction from SAP is here, the official web site home page is here, the technical operation manual is here, the data modelling manual is here (very bad, no screenshots at all), and the Wikipedia page is here.

HANA stands for HAsso’s New Architecture. Hasso Plattner is a cofounder of SAP.
HANA also stands for High performance ANalytic Appliance.
Hana is also a name, rather than an acronym.

Actually Hana is not just an in-memory database. It is also an analytics appliance.
Cisco provides a Hana appliance on Intel Xeon 7500 (link).
HP provides a Hana appliance on ProLiant DL580 and DL980 (link).
IBM provides a Hana appliance on x3690 X5 and x3950 X5 (link).
Violin provides a Hana appliance on Intel Xeon 5600 (link).

HANA is also a cloud application. It is a PaaS, Platform as a Service. EMC and Cisco provide a Hana cloud app on Cisco UCS and EMC VNX 5300 (link).

Hana was released in Nov 2010. Version 3 came in Nov 2011, and version 4 is due in Q2 2012.

How does it work?

Hana has a parallel calculation engine. Hana takes in a SQL query and runs it in parallel across many partitioned data sections in memory, and across many multicore processors.

Hana also has business functions to speed up calculations, for example currency conversion and calendar functions, as well as a predictive analysis library.

The Hana database is both row based and column based. If a calculation involves only a few columns, those columns are stored in the columnar store (compressed) in memory. Hana also supports temporal tables, which speed up queries on historical data.
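Here is a small sketch of what choosing the store looks like in Hana's SQL (the tables are hypothetical; CREATE COLUMN TABLE and CREATE ROW TABLE are the Hana DDL for the two stores):

-- Columnar store: good for aggregating a few columns over many rows.
create column table SalesItem
( SaleDate  date
, ProductId integer
, Quantity  integer
, Amount    decimal(15,2)
);

-- Row store: good for OLTP-style access to whole rows.
create row table SalesOrderHeader
( OrderId    integer
, CustomerId integer
, OrderDate  date
, primary key (OrderId)
);

-- An aggregation like this touches only two columns, so Hana reads
-- just those columns from the compressed in-memory column store.
select ProductId, sum(Amount)
from   SalesItem
group by ProductId;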

The key to Hana's speed is the first sentence I wrote above: it is an in-memory database. Memory is 1000 times faster than disk, and these days we can have a server with 2 TB of memory.

But that is the story of the past; everybody now stores data in memory. What Hana has that the crowd doesn't is the technique of loading data from memory into CPU cache: partitioning the query and the data, and loading them into different CPU caches so that each CPU can process its part in parallel. That's what differentiates Hana from the others.

This is how Hana works: (from the must read technical introduction that I mentioned earlier, link, page 10)

Hana does parallel SQL execution and parallel aggregation. This parallel execution and aggregation happens on multiple threads. Each thread has a hash table which stores its aggregation results. These hash tables are then merged by merger threads, as we can see below: (from the technical introduction doc, link, page 12)

What is it for?

How do we use Hana? What can we use it for? Using Hana, we don't have to create summary tables in the data warehouse. This removes the basic need for a data warehouse: we need a DW to aggregate data fast, and now we can do that in Hana. It also eliminates the need for OLAP cubes: we use OLAP cubes such as SSAS to aggregate the data fast, and now we can do that in Hana.

We still need a SQL interface though, which is a non-starter from the beginning. Who wants to type SQL queries? Definitely not the business users! Hence we need a BI tool, such as BO Universe, Cognos, SSRS Report Builder or PowerPivot. The major weakness of BO Universe, RS Report Builder, Cognos and other flexible reporting technologies when operating on a DW is that they are slow, because the DW is disk based. Now that we have Hana, an in-memory database which is both row based and columnar (and temporal!), these flexible reporting tools will be incredibly fast.

Data Warehouse vs Data Virtualisation

Filed under: Data Warehousing — Vincent Rainardi @ 5:43 am

For many years I have been advocating data warehousing and BI. Large enterprises like global banks need to integrate data from many systems, and my answer has always been to build an Enterprise Data Warehouse. It takes years to build the data integration and the EDW, and often it covers only a few departments, not the whole enterprise.

Data virtualisation, on the other hand, enables us to integrate data from various transaction systems/applications, databases, DWs, files, spreadsheets, the cloud and big data within 3-6 months, making us more agile and cost effective.

While data virtualisation doesn't provide the solid data platform that a data warehouse provides, the speed and cost of building it provide a huge business benefit. In a data warehouse the data is very well arranged and very user oriented; in data virtualisation we still have all the data structure issues from the source systems. This is the main drawback.

Above is how I see it in large corporations. It takes 12-18 months to put together an enterprise data warehouse (first go-live at 12 months), whereas with data virtualisation it would take only 6-9 months (first go-live at 6 months). To build an EDW we need 10 people, whereas for DV we need only 5. To build an EDW we need design tools, ETL tools and database servers such as ERwin, Informatica and Oracle 11g, whereas DV needs only more lightweight tools such as Composite.

Data virtualisation will not be able to replace a data warehouse, but it will complement it. It is a quick tactical win, but usually a tactical initiative becomes a permanent solution.

 

3 July 2012

Hadoop in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:25 am

In data warehousing, we can use Hadoop as a staging area. Or we can use it as the main data store, with SQL queries through Hive. We can also use Hadoop to back up the data warehouse.

Hadoop usage in data warehousing:

a) As a staging area
b) As the main data store, with SQL queries
c) To back up the DW

I do not think that, in point b) above, Hadoop will be able to replace relational databases in the mainstream. In some cases, where there is a huge table with a specific query, yes. For analysing big data, yes. But not for everything or for general storage. Both SMP and MPP RDBMSs such as Oracle, SQL Server, DB2, Teradata and Netezza will still be used as the main store of data warehouses. Of course columnar databases will still be an important part of data warehousing, be they Hadoop based (e.g. HBase) or not.

I agree with Ralph Kimball (link) and Amr Awadallah and Dan Graham from Cloudera and Teradata (link) that Hadoop and RDBMS will co-exist in data warehousing and they will complement each other.

Hadoop is particularly suitable for big data analytics, for example the use cases that Ralph Kimball described in his article (link). Unlike an RDBMS, Hadoop cannot do transaction processing, and it supports only primitive SQL, through Hive. The main advantages of Hadoop over an RDBMS are a) excellent speed for searching data, and b) we don't need to structure the data. I still can't understand/accept this paradigm shift: how come we don't need data modelling anymore?
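To illustrate that "primitive SQL", here is a hedged HiveQL sketch (the table and file location are made up):

-- An external Hive table over raw web log files sitting in HDFS;
-- the data keeps its native structure, with no remodelling.
create external table web_log
( log_time string
, user_id  string
, url      string
, status   int
)
row format delimited fields terminated by '\t'
location '/data/weblogs/';

-- Hive compiles this query into MapReduce jobs behind the scenes.
select url, count(*) as hits
from   web_log
where  status = 404
group by url;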

MapReduce

MapReduce is a method for processing large data sets by splitting the data into smaller parts and distributing the parts to many nodes. Each node processes its part independently, in parallel, then gives its result back to the node that gave it the task, which then combines the results.

If we use HDFS (see below), the master node knows which data the worker nodes hold. This greatly reduces network traffic, because each node is given work on the data it already stores.

The MapReduce method does for data files what MPP and Teradata did for databases: parallel computing. A primer on MapReduce is here, here and here.

Apache Hadoop is the most popular implementation of MapReduce. It is open source, built by a global community. The Hadoop Distributed File System (HDFS) stores large files across multiple machines. HDFS has high reliability because the data is stored in 3 copies (or 4, or 2, but 3 is the default setting). It is written in Java and it uses TCP/IP and RPC.

HDFS does not have high availability, because the main node does not have automatic failover; it can only do manual failover. Apart from HDFS, Hadoop can also use the Amazon S3 file system (and other file systems too).

Data Warehouse Architectures using Hadoop

There are 3 main architectures for data warehousing using Hadoop: we can put Hadoop before or after the RDBMS in the data warehouse architecture (see A and B below), or we can put Hadoop in parallel with the RDBMS, i.e. big data goes to Hadoop and "small data" goes to the RDBMS (see C below), as mentioned by Amr Awadallah and Dan Graham in their article (link). See below for the definitions of small data and big data.

Please note that Ralph Kimball's diagrams in his article (link, figure 4) are for "big data" sources, whereas architectures A & B above are for "small data" sources.

In architecture A above, Hadoop is used as the staging area. In contrast to a normal file system or relational staging area, with Hadoop we keep all the past dates' data as well. Apart from the historical data, the main advantages of using Hadoop as the staging area are speed and capacity. We then use tools like Cloudera Sqoop to import data from Hadoop into the RDBMS. In the RDBMS we build the dimensional DW, e.g. fact & dim tables. We don't need to build a 3NF EDW (or, as I call it in my book, an NDW) to keep the historical data in normalised form; we keep the historical data in Hadoop.
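Here is a hedged HiveQL sketch of such a staging area (the names are illustrative): each day's extract is kept as a partition, so the full history stays in Hadoop, and only the data we need is imported into the RDBMS with a tool like Sqoop.

-- Staging table partitioned by load date; old partitions are never
-- deleted, so Hadoop keeps all the past dates' data.
create external table stg_membership
( member_id   string
, member_name string
, member_type string
)
partitioned by (load_date string)
location '/staging/membership/';

-- Register today's extract as a new partition.
alter table stg_membership
add partition (load_date = '2012-07-03')
location '/staging/membership/2012-07-03/';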

This architecture (A) is similar to using Data Vault for data warehousing, but with a difference. When we use Data Vault, we use it for the EDW; when we use Hadoop, we use it for staging. When we use Data Vault, we do data modelling to build the EDW (not 3NF but Data Vault style, i.e. designing hubs, links and satellites). When we use Hadoop, we don't do data modelling: we store the data in its native data model, we don't remodel it.

In architecture B above, Hadoop is used for backup. I have to say that it's probably over the top to use Hadoop just for data backup. It's very fast and it's resilient too (see HDFS above), because it's a distributed file system working independently in parallel. But there are 30 other distributed, fault-tolerant file systems (see here) that can be used for the same purpose instead of Hadoop.

In architecture C, we use Hadoop for big data, and the RDBMS for "small data". I think in the next few years the BI tools for big data will develop further and get better. But yes, the combining of small and big data is done in the BI tool, not in the back end. It is of course possible to combine them in the back end, either in the RDBMS or in Hadoop, but there are advantages & disadvantages to each of these 3 places where the combination can take place.

Using Hadoop for “Small Data” Data Warehousing

I do not see the benefit of using Hadoop for “small data” and can see many disadvantages in doing so: the data types, the ETL difficulties/inflexibility, data consistency (state/ACID), lack of SQL ability.

Hence for "small data" I don't think anyone would dispute that we should use an RDBMS, be it SMP, MPP or columnar. And for big data, I don't think anyone would dispute that we should use Hadoop. Please see below for the definitions of small and big data.

I believe that most data warehouses on the planet are about "small data". Only very few are about big data. If I have to put a number on it (so that you have an idea of how much), I would say that 95% of all data warehouses are about "small data". Therefore this section focuses on using Hadoop for "small data" data warehousing.

So in a small data DW we use Hadoop only in architectures A and B above, i.e. as the staging area or as backup storage. We don't use Hadoop as the main data store of a small data DW; we still use an RDBMS for that. If you see it differently I'm happy to be corrected.

What is Small Data? Big Data?

What is big data? Again, it is easier to learn by example than by definition. Big data is images, photos, video and music (as streams and as files), unstructured text (documents, emails), web site log files, and output streams from scientific and industrial instruments and sensors such as thermometers and digital pneumatic or pressure gauges.

Big data does not have to be big (peta/exabytes). Even 50 GB can be called big data if the structure is too complex for a normal RDBMS to store.

What is small data? Small data is simple data structures, e.g. numbers (be they monetary, integers, fractions or floating points), strings (names, descriptions, types), dates, times, and all the data we have known in the last 30 years of data warehousing history.

A large collection of small data is also big data. Again I have to put a number here so you have an idea of how big is "big". I don't like being unclear. A collection of simple data structures of more than 1 petabyte is big data.

Bear in mind that now is 2012. Petabytes (PB) is a 2012 term; 5 years ago, in 2007, it was Terabytes (TB). So in 2014 the definition will change to "a collection of simple data structures of more than 1 Exabyte", and in 2015 it will change to "more than 1 Zettabyte".

In a few years' time, these seemingly foreign words will be common words in IT: Petabytes, Exabytes, Zettabytes, Yottabytes. Here is the Wikipedia page for their definitions: link.

Closure

So in DW, we will be using Hadoop for big data (complex data structures), but for small data (simple data structures) we will still be using an RDBMS. For simple data structures exceeding 1 PB, MPP will still cater for them, but in some cases we need to use Hadoop (if we need more performance than MPP can give, or if there is more than 10 PB).

Hadoop and RDBMSs (including columnar and MPP) will sit side by side in DW. RDBMSs will still be used in the majority of DWs to store simple data structures up to a few PB. Beyond 10 PB (simple structures) we probably need to use Hadoop. And for complex data structures, Hadoop is the right tool to process them.

Big data: images, photos, video and music (as streams and as files), unstructured text (documents, emails), web site log files, and output streams from scientific and industrial instruments and sensors such as thermometers and digital pneumatic or pressure gauges.

Big data: a collection of simple data structures of more than 1 petabyte.

I hope this gives a bit of understanding to those of you who are new to the subject. And for the Hadoop, DW and big data experts out there, I would appreciate your opinions and advice.

Vincent Rainardi, 3rd July 2012, vrainardi@gmail.com
