Data Warehousing, BI and Data Science

28 October 2017

What is Data Science?

Filed under: Data Science — Vincent Rainardi @ 7:21 am

What is the difference between Data Science, Data Mining, Statistics, Machine Learning and Artificial Intelligence? Data Science is one of those buzzwords that are very popular today, and it therefore tends to be used to spice up news and marketing materials. In this article I will explain what Data Science is, and how it differs from AI, BI, Big Data, Computer Science, Data Analysis, Data Management, Data Mining, Data Warehousing, Machine Learning, Predictive Analytics, Robotics and Statistics.

Data Science and Data Scientist

Data Science is about scientific approaches to managing and analysing data using statistics, machine learning and visualisation.

Unlike a business analyst, a Data Scientist manages and analyses data with a focus on the data itself rather than on the business functionality, for example finding a pattern in the data or forecasting future values.

Although a Data Scientist does manage the data, unlike a Big Data engineer, a Data Scientist does not set up the Hadoop infrastructure, such as configuring the nodes. A Data Scientist is an expert in using big data, such as the video feed in a self-driving car, millions of images in a character recognition system, streaming voice in speech recognition, or petabytes of website traffic data used to classify ecommerce customers.

A data scientist cleans the data, reformats it, and manages how it is stored in and retrieved from files and databases, for example by splitting the data into many files and combining the calculation results afterwards. A data scientist also creates new data (usually artificial data, based on the data which already exists).
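To make the "split into many files, then combine the results" idea concrete, here is a minimal sketch in Python. The function names, the file naming and the numbers 1 to 100 are my own assumptions for illustration; a real job would work on far larger files and probably run the partial calculations in parallel.

```python
from pathlib import Path
import tempfile


def split_into_files(values, chunk_size, folder):
    """Write the values into several small text files, one number per line."""
    paths = []
    for i in range(0, len(values), chunk_size):
        path = Path(folder) / f"part_{i // chunk_size:03d}.txt"
        path.write_text("\n".join(str(v) for v in values[i:i + chunk_size]))
        paths.append(path)
    return paths


def partial_sum_and_count(path):
    """Compute a partial result (sum and row count) for one file."""
    numbers = [float(line) for line in path.read_text().splitlines() if line]
    return sum(numbers), len(numbers)


def combined_mean(paths):
    """Combine the partial results back into one overall mean."""
    partials = [partial_sum_and_count(p) for p in paths]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Made-up data: the numbers 1..100, split into files of at most 30 rows.
with tempfile.TemporaryDirectory() as folder:
    parts = split_into_files(list(range(1, 101)), chunk_size=30, folder=folder)
    number_of_files = len(parts)
    overall_mean = combined_mean(parts)

print(number_of_files, overall_mean)
```

The point of keeping a sum and a count per file, rather than a mean per file, is that sums and counts can be added together safely, whereas averaging the per-file means would give the wrong answer when the files have different sizes.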

Unlike a data warehouse architect, a Data Scientist is not an expert in ETL technology (such as Informatica or SSIS), parallel technology (such as UPI in Teradata, or partitioning in SQL Server) or any database/file/storage technology (such as in-memory, cube or SAN).

A Data Scientist is an expert in clustering algorithms, deep learning and artificial neural networks. Unlike a neurologist or a neurosurgeon, they are not an expert in neurons, the nervous system, or the human (or animal) brain. A Data Scientist knows the architecture of a neural network, such as the number of layers or nodes, and they know about Long Short-Term Memory (LSTM), Deep Belief Networks and Reinforcement Learning (the first two are neural network architectures; the last is a way of training a model that is often combined with neural networks). Unlike a psychologist or a psychiatrist, a Data Scientist does not know how humans perceive events, remember and recall things, or make decisions, and they do not study human behaviour or mental illnesses.

A Data Scientist is an expert in visualising data using Python and R, typically in Jupyter notebooks, such as producing 3D graphs, charts and trend lines with libraries such as ggplot2. They should also be able to use a BI tool such as Tableau or QlikView to create visualisations, but only at a basic level. Unlike a Tableau, BusinessObjects, Cognos, Microsoft BI, TM1 or QlikView developer, they are not an expert in BI software technicalities. For example, they would not know about setting access restrictions on BO Universes, using Hierarchize in MDX, TurboIntegrator in TM1, or loading security tables in QlikView.

A Data Scientist knows how to write SQL queries, including grouping, joining tables and converting data types. But they are not a SQL developer who knows how to write stored procedures containing recursive queries using CTEs, ranking functions and cursors, or how to force a query plan to use a hash join or a bitmap filter.
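As an illustration of the level of SQL I mean, here is a small sketch that joins two tables, groups the result and converts a data type. It uses Python's built-in sqlite3 module with an in-memory database; the table names, column names and rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    -- amount is deliberately stored as text so that the query has to convert it
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount TEXT);
    INSERT INTO customer VALUES (1, 'UK'), (2, 'UK'), (3, 'France');
    INSERT INTO orders VALUES (10, 1, '20.50'), (11, 1, '4.50'), (12, 2, '10'), (13, 3, '7.25');
""")

query = """
    SELECT c.country,
           COUNT(*)                    AS order_count,
           SUM(CAST(o.amount AS REAL)) AS total_amount   -- data type conversion
    FROM orders AS o
    JOIN customer AS c ON c.customer_id = o.customer_id  -- joining tables
    GROUP BY c.country                                   -- grouping
    ORDER BY c.country
"""
rows = conn.execute(query).fetchall()
conn.close()

for country, order_count, total_amount in rows:
    print(country, order_count, total_amount)
```

Anything beyond this, such as recursive CTEs or query plan hints, is where I would expect a SQL developer to take over.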

A Data Scientist knows about big data technologies and how to use them. They should be able to write a MapReduce job in Java that reads text files from HDFS and outputs a JSON file. But I would not expect them to understand why their MapReduce code is giving this warning: “Use GenericOptionsParser for parsing the arguments” (because we need to use the getConf() method). When using Cassandra or Hortonworks, a Data Scientist needs a Big Data Engineer to set up the platform for them, and a Big Data Developer to help them with the coding.
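The paragraph above talks about MapReduce in Java on Hadoop. The sketch below is not Hadoop code; it only imitates the map, shuffle and reduce steps in plain Python so the shape of the computation is visible. The two input lines stand in for text files on HDFS, and all function names are my own.

```python
import json
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)


# Assumption: two short lines of text instead of real files read from HDFS.
lines = ["big data is big", "data science uses data"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
json_output = json.dumps(counts, sort_keys=True)
print(json_output)
```

In a real Hadoop job the mapper and reducer run on different nodes and the framework handles the shuffle, which is exactly the part a Data Scientist should not have to configure.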

A Data Scientist should have a good background in data architecture, and they should be able to design data structures such as tables and HDFS/JSON files, but only at a basic level. They should not be expected to understand detailed Kimball modelling (such as implementing a bridge table to handle multi-valued attributes), detailed Graph modelling, or detailed Data Vault modelling. For these we still need a Data Architect.

A glaring gap is business knowledge. Unlike a Data Architect or a Business Analyst, a Data Scientist does not have good business knowledge. FX Options, IRS, CDS or other swaps are not in their vocabulary. A Data Architect or a Business Analyst working in Investment Banking or Investment Management would know these OTC derivatives well. A Data Scientist will need to be taught the business knowledge. Whether it is cancer, lending, airline, retail or telecom data, a Business Analyst or a business person in that sector will need to explain the numbers and data to the Data Scientist before they can do their work.

So to recap, a Data Scientist:

  • Is an expert in Machine Learning, statistics, AI and mathematical modelling
  • Is an expert in using various types of data, such as numeric, video, images and voice
  • Is an expert in statistics and analytics tools such as SPSS, Statistica, Weka and KNIME
  • Is an expert in data processing, data quality and data management
  • Has good knowledge of programming, particularly Python, R and Matlab
  • Has good knowledge of querying Big Data platforms such as Spark and Cloudera
  • Has basic knowledge of SQL, data modelling and databases
  • Has basic knowledge of some BI tools and visualisation
  • Has no business knowledge such as treasury, lending or legal
  • Has no knowledge of technical infrastructure, such as the setup of Big Data platforms, database engines, network infrastructure and storage technologies
  • Typically has a degree or PhD in Mathematics, Physics, Computer Science or Engineering

Data Science is about scientific approaches to managing and analysing data using machine learning, statistics and visualisation. The data is already there, stored in databases or file systems, and data science is about analysing this data using scientific techniques (not business techniques).

Data Science is not about collecting and storing data in databases or file systems. It is about analysing the data that is already stored in databases or files. This analysis is not business analysis; it is scientific (mathematical) analysis, in which we try to find patterns in the data.

Statistics

Statistics is a set of mathematical methods for analysing numbers and sets. It is about linear regression, the Gaussian distribution, correlation, sample size and probability. Advanced statistics involves advanced mathematics such as calculus, differential equations and stochastic processes.
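Two of the methods just mentioned, linear regression and correlation, fit in a few lines of Python. This is only a sketch using the textbook least-squares and Pearson formulas; the data points are made up and lie exactly on the line y = 2x + 1 so the result is easy to check by eye.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def correlation(xs, ys):
    """Pearson correlation coefficient between two lists of numbers."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = linear_regression(xs, ys)
r = correlation(xs, ys)
print(slope, intercept, r)
```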

Statistics is not about analysing multimedia data or big data, such as images, video feeds, text and voice (natural language processing). It is about analysing numbers and sets.

Statistics is not about neural networks, deep learning or clustering algorithms. These belong to machine learning.

Machine Learning

Machine Learning is about creating machines/computers that do tasks without being explicitly programmed. After learning, the computer will be able to predict future values or classify future data.

For the last 50 years we have been giving computers instructions or rules on how to do things. For example, a credit card fraud is defined as … probably 25 rules, such as a transaction with an unusually large amount or an unusual location.
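In code, the rule-based approach looks something like the sketch below. It only contains the two example rules named above; the thresholds, field names and the home country are assumptions I made up for illustration, not real fraud rules.

```python
def is_fraud_by_rules(transaction, home_country="UK", amount_limit=5000):
    """Return True if any hand-written rule flags the transaction."""
    rules = [
        transaction["amount"] > amount_limit,    # unusually large amount
        transaction["country"] != home_country,  # unusual location
    ]
    return any(rules)


print(is_fraud_by_rules({"amount": 9000, "country": "UK"}))  # large amount
print(is_fraud_by_rules({"amount": 40, "country": "BR"}))    # unusual location
print(is_fraud_by_rules({"amount": 40, "country": "UK"}))    # neither rule fires
```

Every rule here was written by a person. If fraud patterns change, somebody has to notice and edit the code.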

In contrast, in machine learning, we don’t give the rules to the computer. Instead we give the computer 1000 transactions and tell it: this one is a fraud, this one is not, this one is a fraud, this one is not, … and so on until 1000. The computer learns and creates its own rules (or pattern) and stores these rules. Then we give a new transaction and the computer can tell us whether it is a fraud or not.
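Here is a deliberately tiny sketch of that idea. Instead of 1000 transactions it uses six made-up ones, and the "rule" the computer is allowed to learn is just a single amount threshold (a so-called decision stump). Real fraud models use many more features and a proper library; the function and field names are my own.

```python
def learn_amount_threshold(transactions):
    """Pick the amount threshold that misclassifies the fewest labelled examples."""
    amounts = sorted({t["amount"] for t in transactions})
    # Candidate thresholds sit halfway between neighbouring amounts.
    candidates = [(a + b) / 2 for a, b in zip(amounts, amounts[1:])]

    def errors(threshold):
        return sum((t["amount"] > threshold) != t["is_fraud"] for t in transactions)

    return min(candidates, key=errors)


# Made-up labelled examples: "this one is a fraud, this one is not, ..."
training = [
    {"amount": 20, "is_fraud": False},
    {"amount": 75, "is_fraud": False},
    {"amount": 120, "is_fraud": False},
    {"amount": 4000, "is_fraud": True},
    {"amount": 6500, "is_fraud": True},
    {"amount": 9000, "is_fraud": True},
]

threshold = learn_amount_threshold(training)  # the stored "rule"


def predict(transaction):
    """Classify a new transaction with the learned threshold."""
    return transaction["amount"] > threshold


print(threshold, predict({"amount": 5000}), predict({"amount": 50}))
```

Nobody typed the threshold in. It came out of the labelled examples, and it would move by itself if we retrained on different examples.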

We can use the same methods to identify cancer based on thousands of scan images, or to analyse millions of images of nebulas and stars to find black holes. We can use them for reading handwriting (like the post codes on envelopes read by the Royal Mail), speech recognition as in Siri and Alexa, image recognition as in self-driving cars, and face recognition for banking and payment applications.

Machine Learning is about creating machines/computers which can do tasks without being programmed. We don’t give them the rules; they create the rules themselves.

Machine Learning uses various algorithms and techniques such as Linear Regression, Logistic Regression, Decision Trees, K-means Clustering, Support Vector Machines (SVM), Principal Component Analysis (PCA), Anomaly Detection and Neural Networks.

Machine Learning uses mathematics (statistics, algebra, calculus) to derive and calculate those algorithms. It uses computer programming to make the computer “learn” the parameters or weightings from the training data (to create the “rules”), and to predict the result for new data.
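To show what "learning the parameters" means in practice, here is a minimal sketch of gradient descent for a straight-line model y = w·x + b. The calculus gives the two gradient formulas, and the program repeats small steps until the weight and bias settle. The training data, learning rate and number of steps are assumptions chosen so that the example converges quickly.

```python
def train_linear_model(xs, ys, learning_rate=0.05, steps=5000):
    """Learn w and b for y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(r * x for r, x in zip(residuals, xs))  # d(MSE)/dw
        grad_b = 2 / n * sum(residuals)                             # d(MSE)/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up training data generated from y = 2x + 1.
train_x = [0, 1, 2, 3, 4]
train_y = [1, 3, 5, 7, 9]

w, b = train_linear_model(train_x, train_y)
prediction_for_10 = w * 10 + b  # predicting the result for new data
print(round(w, 4), round(b, 4), round(prediction_for_10, 4))
```

The learned w and b are the "rules" in this case. A neural network does the same thing, only with many more weights.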

The most popular programming language for Machine Learning today is Python. R comes second, and very few use Matlab. This is firstly because Python has SciPy and scikit-learn, a comprehensive Machine Learning library. Today hardly anybody implements those ML algorithms by hand (why reinvent the wheel?); almost everybody just uses scikit-learn. Secondly, Python and R are free whereas Matlab is expensive (£1800 at the time of writing). Thirdly, Python is a richer language than R: Python is a general-purpose language like Java and C#, whereas R is a specialist language, used primarily for statistics.

Data Mining

Data Mining is about extracting and processing data, and finding patterns and insight in large amounts of data. So data mining is not about collecting and storing data, or designing databases / file systems to store the data. It is about analysing the data to find patterns and insight.

So is it the same as Machine Learning then? No, it is not. Data Mining is wider than Machine Learning. Data Mining uses Machine Learning algorithms, such as Clustering and Classification, but it also uses non-Machine Learning techniques such as finding recurring relationships, market basket analysis and frequent itemsets.
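Market basket analysis is a good example of a technique that needs no learning at all, only counting. The sketch below counts how often each pair of items appears in the same basket and keeps the pairs that reach a minimum support. The baskets and the support threshold are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count how many baskets contain each pair of items and keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives every pair a stable order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Made-up shopping baskets.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

Pairs involving eggs appear in only one basket, so they are dropped. With millions of baskets the same idea needs smarter pruning, but the principle is unchanged.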

Data Mining also does the following:

  • Queries relational databases with SQL to answer a specific question and get a business insight (rather than trying to find a pattern)
  • Explores OLAP cubes to find specific business insight about the data
  • Queries text and documents (this is called Text Mining, or Text Analytics)
  • Queries graph databases, object databases and document databases
  • Queries file systems, such as data lakes and Hadoop
  • Extracts information from images, sounds and videos (i.e. multimedia) and from maps (called spatial data)
  • Extracts information from social media such as Twitter, Facebook and Instagram
  • Extracts information from streaming data, such as weather and stock market data
  • Queries DNA or protein sequences for a certain genetic pattern, e.g. using DNAQL

Data Mining differs from Reporting in that Data Mining explores the data (a flexible exercise where we query the data repeatedly), whereas Reporting queries the data and outputs it in a rigid, specific format.

Data Mining differs from Data Science because, when Data Mining analyses the data, it uses not only scientific (mathematical) methods but also business methods (business analysis), for example analysing floating rates in Interest Rate Swaps or rating migration patterns in corporate bonds. So Data Mining covers Business Intelligence and business analysis, in addition to scientifically analysing the data using statistics and Machine Learning algorithms. Data Mining also tries to find business insights, in addition to finding statistical patterns.

Artificial Intelligence (AI)

Artificial Intelligence is the ability of a machine/computer to learn, think, solve problems and make decisions. The machine does not have to be able to see, hear, talk or communicate well (that’s robotics); just a basic input-output will do.

At the moment Machine Learning is used a lot in AI (especially Neural Networks), but AI also uses non-Machine Learning methods such as Bayesian networks, Kalman filters, fuzzy logic, automated reasoning, solution searching and evolutionary algorithms.
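As a taste of one of these methods, here is a minimal sketch of an evolutionary algorithm, in its simplest form: one parent, one mutated child per generation, and the fitter of the two survives. The fitness function, mutation size, number of generations and random seed are all assumptions picked for illustration.

```python
import random


def evolve(fitness, start, generations=2000, step=0.1, seed=0):
    """Mutate the current solution and keep the child only if it is at least as fit."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    parent = start
    for _ in range(generations):
        child = parent + rng.gauss(0, step)        # random mutation
        if fitness(child) >= fitness(parent):      # selection
            parent = child
    return parent


def fitness(x):
    """Toy problem: the fittest possible solution is x = 3."""
    return -(x - 3) ** 2


best = evolve(fitness, start=0.0)
print(round(best, 3))
```

Note that there is no training data here and nothing is "learned" from examples. The solution is found purely by mutation and selection.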

So an AI expert can be a Machine Learning expert, or an evolutionary algorithm expert, which are very different. But generally an AI expert is an expert in mathematics and able to do basic programming in some languages (usually Python or Java).

An “AI expert” is not necessarily in IT. They could be a psychologist or a psychiatrist who defines what AI is and isn’t, and who analyses the cognitive behaviour of an AI computer, comparing it to human behaviour and thinking.

Some people include robotics, sensors and motion in AI, such as the ability to move, pick up objects and understand conversation. In my opinion the mechanics of a robot are not AI. AI is only the thinking bit. But that’s just my opinion.

The ability to understand language is part of AI. This is called NLP, Natural Language Processing. The mechanics of hearing and speaking are not AI (that’s electronic and mechanical engineering), but the ability to understand the meaning of the words is AI.

Business Intelligence (BI)

Business Intelligence is about analysing business data to get a business insight, to be used to make business decisions. So Business Intelligence is not for scientific purposes, but to improve business performance, typically reducing the costs or increasing revenues.

Data Analysis can serve many different purposes, including academic research and personal interest, but if it is not for business, it is not BI. Examples of BI are: analysing customer profitability, analysing sales across different products, analysing patient risk for illness, and analysing risk of losses in investment or lending.

BI is done by humans, not machines. A BI analyst explores the data in the database using a BI tool such as Tableau or MicroStrategy, using OLAP cubes such as SSAS or QlikView, or using a reporting tool such as Board or SAP Crystal Reports.

Based on the tools used, there are 6 categories of BI applications: reporting, analytic, data mining, dashboard, alert and portal. Quoting from my book: Reporting applications query the data warehouse and present the data in static tabular format. Analytic applications query the data warehouse repeatedly and interactively, and present the data in flexible formats that users can slice and dice. Data mining applications explore the data warehouse to find patterns and relationships that describe the data. Reporting applications are usually used to perform lightweight analysis. Analytic applications are used to perform deeper analysis. Data mining applications are used for pattern finding.

Dashboards are a category of BI applications that gives a quick high level summary of business performance in graphical gadgets, typically gauges, charts, indicators, and color-coded maps. By clicking these gadgets, we can drill down to lower-level details. Alerts are notifications to the users when certain events or conditions happen. A BI portal is an application that functions as a gateway to access and manage business intelligence reports, analytics, data mining, and dashboard applications as well as alert subscriptions.

We now have a new category of BI: streaming analytics, where the BI tool gives a second-by-second, real-time summary of the streaming data.
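The core of a second-by-second summary is simply bucketing events by the second in which they arrived. The sketch below does that over a small list of made-up (timestamp, value) events. A real streaming tool would process events as they arrive and emit each bucket as soon as its second is over, but the grouping logic is the same idea.

```python
from collections import defaultdict


def per_second_summary(events):
    """Group (timestamp_in_seconds, value) events into one-second buckets."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp)].append(value)  # int() drops the fraction of a second
    return {
        second: {"count": len(values), "total": sum(values), "max": max(values)}
        for second, values in sorted(buckets.items())
    }


# Made-up events: (seconds since the start of the stream, value).
events = [(0.10, 5), (0.75, 7), (1.20, 3), (2.05, 10), (2.90, 2)]

summary = per_second_summary(events)
for second, stats in summary.items():
    print(second, stats)
```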

That’s my understanding. If I’m wrong please correct me, via comments below, thanks. I hope this article is useful for you.

References:

  1. Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber
  2. Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig


