- What is Big Data
- What is a Data Warehouse
- What is Data Mining
The article is here. The 5 experts are:
- Billy Cripe from Digitility
- Jake Freivald from Information Builders
- Estelle Nicholson, independent BI consultant
- Bob Scott from Bilander Group
- Peter Thomas from Validus
What is Big Data? (from Hadoop in Data Warehousing article in my blog)
What is big data? Again it is easier to learn by example than by definition. Big data is images/photos/video/music streams (and as files too), unstructured text (documents, emails), web site log files, outputs (streams) from scientific/industrial instruments and sensors such as thermometers, digital pneumatic/pressure gauges.
Big data does not have to be big (peta/exabytes). Even 50 GB can be said as big data if the structure is too complex for a normal RDBMS to store.
What is small data? Small data is simple data structures, e.g. numbers (be it monetary, integers, fractions or floating points), strings (names, description, types), dates, times, and all the data we used to know in the last 30 years of data warehousing history.
A large collection of small data is big data. Again I have to put a number here so you have an idea of how big is big. I don’t like being unclear. A collection of simple data structures of more than 1 petabytes is big data.
Bear in mind that now is 2012. Petabytes (PB) is 2012 term. But 5 years ago in 2007 it was Terabytes (TB). So in 2014 the definition will change to: a collection of simple data structures of more than 1 Exabytes is big data. And in 2015 the definition will change to “more than 1 Zettabytes”.
In a few years time, these seemingly foreign words will be common words in IT: Petabytes, Exabytes, Zettabytes, Yottabytes. Here the Wikipedia page for their definition: link.
What is a Data Warehouse?
A data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.
Below is a diagram of a data warehouse system: (from my book)
Below is a short explanation about each component in the above diagram, and how the components work together in a data warehouse system: (again, this is from my book)
The source systems are the OLTP systems that contain the data you want to load into the data warehouse. Online Transaction Processing (OLTP) is a system whose main purpose is to capture and store the business transactions. The source systems’ data is examined using a data profiler to understand the characteristics of the data. A data profiler is a tool that has the capability to analyse data, such as finding out how many rows are in each table, how many rows contain NULL values, and so on.
The extract, transform, and load (ETL) system then brings data from various source systems into a staging area. ETL is a system that has the capability to connect to the source systems, read the data, transform the data, and load it into a target system (the target system doesn’t have to be a data warehouse). The ETL system then integrates, transforms, and loads the data into a dimensional data store (DDS). A DDS is a database that stores the data warehouse data in a different format than OLTP. The reason for getting the data from the source system into the DDS and then querying the DDS instead of querying the source system directly is that in a DDS the data is arranged in a dimensional format that is more suitable for analysis. The second reason is because a DDS contains integrated data from several source systems.
When the ETL system loads the data into the DDS, the data quality rules do various data quality checks. Bad data is put into the data quality (DQ) database to be reported and then corrected in the source systems. Bad data can also be automatically corrected or tolerated if it is within a certain limit. The ETL system is managed and orchestrated by the control system, based on the sequence, rules, and logic stored in the metadata. The metadata is a database containing information about the data structure, the data meaning, the data usage, the data quality rules, and other information about the data.
The audit system logs the system operations and usage into the metadata database. The audit system is part of the ETL system that monitors the operational activities of the ETL processes and logs their operational statistics. It is used for understanding what happened during the ETL process.
Users use various front-end tools such as spreadsheets, pivot tables, reporting tools, and SQL query tools to retrieve and analyse the data in a DDS. Some applications operate on a multidimensional database format. For these applications, the data in the DDS is loaded into multidimensional databases (MDBs), which are also known as cubes. A multidimensional database is a form of database where the data is stored in cells and the position of each cell is defined by a number of variables called dimensions. Each cell represents a business event, and the values of the dimensions show when and where this event happened.
Tools such as analytics applications, data mining, scorecards, dashboards, multidimensional reporting tools, and other BI tools can retrieve data interactively from multidimensional databases. They retrieve the data to produce various features and results on the front-end screens that enable the users to get a deeper understanding about their businesses. An example of an analytic application is to analyse the sales by time, customer, and product. The users can analyse the revenue and cost for a certain month, region, and product type.
What is Data Mining? (from my book)
Data mining is the process of exploring data to find the patterns and relationships that describe the data and to predict the unknown or future values of the data. The key value of data mining is the ability to understand why some things happened in the past and the ability to predict what will happen in the future. To refer to predicting the future with regard to data mining, some people use the term forecasting, and some call it predictive analytics. On the other hand, when data mining is used to explain the current or past situation, it is called descriptive modeling, descriptive analytics, or knowledge discovery.
Implementing data mining in the business is growing by the day, both through descriptive and predictive analytics. Using data mining, we can find the correlation between purchase patterns and customer demographics. For example, in our Amadeus Entertainment case study, we can find whether there is a relation between the sales of a music product type and the customer interest or occupation. In the financial services industry, data mining is traditionally used for credit scoring, for example, to predict the ability and willingness to pay the loan based on (each country has different factors) payment behavior, past delinquency, current debt, credit history, number of inquiries, address, bankruptcies filed, and customer demography.
In business intelligence, popular applications of data mining are for fraud detection (credit card industry), forecasting and budgeting (finance), cellular/mobile package development by analyzing call patterns (telecommunication industry), market basket analysis (retail industry), customer risk profiling (insurance industry), usage monitoring (energy and utilities), and machine service times (manufacturing industry).