Data Warehousing, BI and Data Science

28 October 2017

What is Data Science?

Filed under: Data Science — Vincent Rainardi @ 7:21 am

What is the difference between Data Science, Data Mining, Statistics, Machine Learning and Artificial Intelligence? Data Science is one of those buzzwords which are very popular today, and it therefore tends to be used to spice up news and marketing materials. In this article I will explain what Data Science is, and how it differs from AI, BI, Big Data, Computer Science, Data Analysis, Data Management, Data Mining, Data Warehousing, Machine Learning, Predictive Analytics, Robotics and Statistics.

Data Science and Data Scientist

Data Science is about scientific approaches to managing and analysing data using statistics, machine learning and visualisation.

Unlike a business analyst, a Data Scientist manages and analyses data with a focus on the data itself rather than on the business functionality, for example finding a pattern in the data or forecasting future values.

Although a Data Scientist does manage the data, unlike a Big Data engineer, a Data Scientist does not set up the Hadoop infrastructure, such as configuring the nodes. A Data Scientist is an expert in using big data, such as the video feed in a self-driving car, millions of images in a character recognition system, streaming voice in speech recognition, or petabytes of website traffic data used to classify ecommerce customers.

A Data Scientist cleans the data, reformats it, and manages how it is stored in and retrieved from files and databases, for example by splitting the data into many files and combining the calculation results afterwards. A Data Scientist also creates new data (usually artificial data, based on the data which already exists).

Unlike a data warehouse architect, a Data Scientist is not an expert in ETL technology (such as Informatica or SSIS), parallel technology (such as UPI in Teradata, or partitioning in SQL Server) or any database/file/storage technology (such as in-memory, cube or SAN).

A Data Scientist is an expert in clustering algorithms, deep learning, and artificial neural networks. Unlike a brain doctor or neurosurgeon, they are not an expert in neurons, the nervous system, or the human (or animal) brain. A Data Scientist knows the architecture of a neural network, such as the number of layers or nodes, and they know about Long Short-Term Memory, Deep Belief Networks, and Reinforcement Learning (the first two are neural network architectures; the last is a way of training a model). Unlike a psychologist or a psychiatrist, a Data Scientist does not know how humans perceive events, remember and recall things, or make decisions, and they do not study human behaviour or mental illnesses.

A Data Scientist is an expert in visualising data using Python, Jupyter and R, such as producing 3D graphs, charts and trend lines using ggplot2. They should also be able to use a BI tool such as Tableau or QlikView to create visualisations, but only at a basic level. Unlike a Tableau, BusinessObjects, Cognos, Microsoft BI, TM1 or QlikView developer, they are not an expert in BI software technicalities. For example, they would not know about setting access restrictions on BO Universes, using Hierarchize in MDX, TurboIntegrator in TM1, or loading security tables in QlikView.

A Data Scientist knows how to write SQL queries, including grouping and joining tables, and converting data types. But they are not a SQL developer who knows how to write stored procedures containing recursive queries using CTEs, ranking functions and cursors, or how to force a query plan to use a hash join or bitmap filter.
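To make that level of SQL concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns and figures are made up for illustration; the point is only to show a join, a data type conversion and a grouping in one query.

```python
import sqlite3

# In-memory database with two small, invented tables (hypothetical schema).
# The sales amount is deliberately stored as text so that the query has to convert it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE sales (customer_id INTEGER, amount TEXT);
    INSERT INTO customer VALUES (1, 'UK'), (2, 'UK'), (3, 'France');
    INSERT INTO sales VALUES (1, '10.50'), (1, '4.50'), (2, '20.00'), (3, '7.25');
""")

# Join the two tables, convert the text amount to a number, and group by country.
query = """
    SELECT c.country, SUM(CAST(s.amount AS REAL)) AS total_amount
    FROM sales s
    JOIN customer c ON c.customer_id = s.customer_id
    GROUP BY c.country
    ORDER BY c.country
"""
totals = conn.execute(query).fetchall()
conn.close()

print(totals)
```

Anything beyond this, such as recursive CTEs or query plan hints, is where I would expect a SQL developer to take over.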

A Data Scientist knows about big data technologies, and how to use them. They should be able to write MapReduce in Java to read text files from HDFS and output a JSON file. But I would not expect them to understand why their MapReduce code is giving this warning: “Use GenericOptionsParser for parsing the arguments” (because we need to use the getConf() method). When using Cassandra or Hortonworks, a Data Scientist needs a Big Data Engineer to set up the platform for them, and a Big Data Developer to help them with the coding.

A Data Scientist should have a good background in data architecture, and they should be able to design data structures such as tables and HDFS/JSON files, but only at a basic level. They should not be expected to understand detailed Kimball modelling (such as implementing a bridge table to solve multi-valued attributes), detailed Graph modelling, or detailed Data Vault modelling; for these we still need a Data Architect.

A glaring gap is the business knowledge. Unlike a Data Architect or a Business Analyst, a Data Scientist does not have good business knowledge. FX Options, IRS, CDS, or any other swaps are not in their vocabulary. A Data Architect or a Business Analyst working in Investment Banking or Investment Management would know these OTC derivatives well. A Data Scientist will need to be taught the business knowledge. Whether it is cancer, lending, airline, retail or telecom data, a Business Analyst or a business person in that sector will need to explain the numbers and data to the Data Scientist before they can do their work.

So to recap, a Data Scientist is:

  • An expert in Machine Learning, statistics, AI and mathematical modelling
  • An expert in using various types of data, such as numeric, video, images and voice
  • An expert in statistics and analytics tools such as SPSS, Statistica, Weka, KNIME
  • An expert in data processing, data quality, and data management
  • Has good knowledge in programming, particularly Python, R and Matlab
  • Has good knowledge in querying Big Data platforms such as Spark and Cloudera
  • Has basic knowledge in SQL, data modelling and databases
  • Has basic knowledge in some BI tools and visualisation
  • Has no business knowledge such as treasury, lending or legal
  • Has no knowledge in technical infrastructure such as Big Data platform, database engine, network infrastructure, and storage technologies
  • Typically has a degree or PhD in Math, Physics, Computer Science or Engineering

Data Science is about scientific approaches to managing and analysing data using machine learning, statistics and visualisation. So the data is already there, stored in databases or file systems, and Data Science is about analysing this data using scientific techniques (not business techniques).

Data Science is not about collecting and storing data in databases or file systems. It is about analysing the data which is already stored in databases or files. But this analysis is not business analysis; it is scientific (mathematical) analysis. We try to find patterns in the data.

Statistics

Statistics is a set of mathematical methods for analysing numbers and sets. It is about linear regression, Gaussian distribution, correlation, sample size and probability. Advanced statistics involves advanced mathematics such as differential equations (calculus) and stochastic processes.
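As a small illustration of two of the methods named above, the sketch below computes a Pearson correlation and a least-squares regression line in plain Python. The data (hours studied against test score) is invented, and the function name is my own.

```python
from math import sqrt

def summarise(xs, ys):
    """Return (correlation, slope, intercept) for paired samples xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations. They are not divided by n because
    # that factor cancels out in both the correlation and the slope.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    correlation = cov / sqrt(var_x * var_y)   # Pearson correlation coefficient
    slope = cov / var_x                       # least-squares slope
    intercept = mean_y - slope * mean_x       # line passes through the means
    return correlation, slope, intercept

# Made-up sample: hours studied and test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

correlation, slope, intercept = summarise(hours, scores)
print(round(correlation, 3), round(slope, 2), round(intercept, 2))
```

With this sample the correlation is close to 1, which simply says the invented scores rise almost linearly with the hours.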

Statistics is not about analysing multimedia data or big data, such as images, video feeds, text and voice (natural language processing). It is about analysing numbers and sets.

Statistics is not about neural networks, deep learning or clustering algorithms. These belong to machine learning.

Machine Learning

Machine Learning is about creating machines/computers that do tasks without being explicitly programmed (link). After learning, the computer will be able to predict future values or classify future data.

For the last 50 years we have been giving computers instructions or rules on how to do things. For example, a credit card fraud is defined as … probably 25 rules, such as a transaction with an unusually large amount or an unusual location.

In contrast, in machine learning, we don’t give the rules to the computer. Instead we give the computer 1000 transactions and tell it: this one is a fraud, this one is not, this one is a fraud, this one is not, … and so on until 1000. The computer learns and creates its own rules (or pattern) and stores these rules. Then we give a new transaction and the computer can tell us whether it is a fraud or not.
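The idea can be sketched in a few lines of Python. This is a deliberately tiny illustration, not a real fraud model: it assumes a single feature (the transaction amount), made-up labelled data, and a hypothetical learner that picks whichever amount threshold makes the fewest mistakes on the training examples.

```python
def learn_threshold(amounts, labels):
    """Pick the amount threshold that misclassifies the fewest training examples.

    The learned 'rule' is: flag a transaction as fraud if amount > threshold.
    """
    best_threshold, best_errors = None, len(amounts) + 1
    for candidate in sorted(set(amounts)):
        # Count the training examples this candidate rule gets wrong.
        errors = sum((amount > candidate) != label
                     for amount, label in zip(amounts, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

# Made-up training data: transaction amounts and whether each one was a fraud.
amounts = [12, 25, 40, 60, 75, 900, 1200, 1500]
labels = [False, False, False, False, False, True, True, True]

# We never wrote the rule ourselves; it is derived from the labelled examples.
threshold = learn_threshold(amounts, labels)

def predict_fraud(amount):
    """Apply the learned rule to a new transaction."""
    return amount > threshold

print(threshold, predict_fraud(1000), predict_fraud(30))
```

A real system would use many features and a proper algorithm, but the flow is the same: labelled examples go in, a rule comes out, and the rule is then applied to new transactions.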

We can use the same methods to identify cancer based on thousands of scan images, or to analyse millions of images of nebulas and stars to find black holes. We can use them to read handwriting (like the post codes on envelopes read by the Royal Mail), to recognise speech (like Siri and Alexa), to recognise images (as in self-driving cars), and to recognise faces for banking and payment applications.

Machine Learning is about creating machines/computers which can do tasks without being programmed. We don’t give them the rules; they create the rules themselves.

Machine Learning uses various algorithms such as Linear Regression, Logistic Regression, Decision Trees, K-means Clustering, Support Vector Machines (SVM), Principal Component Analysis (PCA), Anomaly Detection, and Neural Networks.

Machine Learning uses mathematics (statistics, algebra, calculus) to derive and calculate those algorithms. It uses computer programming to make the computer “learn” the parameters or weightings for the training data (to create the “rules”), and to predict the result for new data.
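Here is a minimal sketch of what “learning the parameters” can look like, using gradient descent on a one-variable linear model. The data, learning rate and number of iterations are assumptions chosen so that the example converges; real libraries do this far more efficiently.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Learn weight and bias for y ≈ weight * x + bias by gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example with the current parameters.
        errors = [(weight * x + bias) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to weight and bias.
        grad_weight = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_bias = (2 / n) * sum(errors)
        # Move the parameters a small step against the gradient.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias

# Made-up training data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

weight, bias = train_linear_model(xs, ys)

# Use the learned parameters to predict the result for new data.
prediction = weight * 5.0 + bias
print(round(weight, 3), round(bias, 3), round(prediction, 3))
```

The calculus is in the two gradient lines, and the programming is the loop that repeatedly nudges the weight and bias until the predictions fit the training data.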

The most popular programming language today for Machine Learning is Python. R comes second, and very few use Matlab. This is firstly because Python has SciPy and scikit-learn, the latter being a comprehensive Machine Learning library. Today hardly anybody implements those ML algorithms manually (why reinvent the wheel?); most people just use the scikit-learn library. Secondly, Python and R are free whereas Matlab is expensive (£1800, link). Thirdly, Python is a richer language than R, i.e. Python is a general language like Java and C#, whereas R is a specialist language, used primarily for statistics.

Data Mining

Data Mining is about extracting and processing data, and finding patterns and insight in large amounts of data. So data mining is not about collecting and storing data, or designing databases / file systems to store the data. It is about analysing the data to find patterns and insight.

So it is the same as Machine Learning then? No, it is not. Data Mining is wider than Machine Learning. Data Mining uses Machine Learning algorithms, such as Clustering and Classification, but it also uses non-Machine Learning algorithms such as recurring relationships, market basket analysis, and frequent items.
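To show why market basket analysis is more counting than learning, here is a minimal sketch that finds frequently co-occurring item pairs. The baskets, the function name and the 50% support threshold are invented for illustration; real implementations use algorithms such as Apriori to cope with large data.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs whose share of baskets (support) is at least min_support."""
    pair_counts = Counter()
    for basket in baskets:
        # Count each unordered pair once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    total = len(baskets)
    return {pair: count / total
            for pair, count in pair_counts.items()
            if count / total >= min_support}

# Made-up shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
]

pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

Here bread and milk appear together in 3 of the 5 baskets, so that pair is the only one that passes the threshold.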

Data Mining also does the following:

  • Data Mining uses SQL queries to simply query relational databases to find the answer to a specific question and get a business insight (not trying to find a pattern)
  • Data Mining also explores OLAP cubes to find specific business insight about the data
  • Data Mining also queries text and documents (this is called Text Mining, or Text Analytics)
  • Data Mining also queries graph databases, object databases, document databases
  • Data Mining also queries file systems, such as data lakes and Hadoop
  • Data Mining also extracts information from images, sounds, videos (i.e. multimedia) and maps (called spatial data)
  • Data Mining also extracts information from social media such as Twitter, Facebook and Instagram
  • Data Mining also extracts information from streaming data, such as weather and stock market data
  • Data Mining also queries DNA or protein sequences for a certain genetic pattern, e.g. using DNAQL

Data Mining is different to Reporting in the sense that Data Mining explores the data (a flexible exercise where we query the data repeatedly) whereas Reporting queries the data and outputs it in a rigid, specific format.

Data Mining is different to Data Science because when Data Mining analyses the data it uses not only scientific (mathematical) methods but also business methods (business analysis). For example: analysing floating rates in Interest Rate Swaps or rating migration patterns in corporate bonds. So Data Mining covers Business Intelligence and business analysis, in addition to scientifically analysing the data using statistics and Machine Learning algorithms. Data Mining also tries to find business insights, in addition to finding statistical patterns.

Artificial Intelligence (AI)

Artificial Intelligence is the ability of a machine/computer to learn, think, solve problems and make decisions. The machine does not have to be able to see, hear, talk or communicate well (that’s robotics); just a basic input-output will do.

At the moment Machine Learning is used a lot in AI (especially Neural Networks), but AI also uses non-Machine Learning methods such as Bayesian networks, Kalman filters, fuzzy logic, automated reasoning, solution searching and evolutionary algorithms.

So an AI expert can be a Machine Learning expert or an evolutionary algorithm expert, and those are very different. But generally an AI expert is an expert in mathematics and is able to do basic programming in some languages (usually Python or Java).

Being an “AI expert” does not mean that they are in IT. They could be a psychologist or a psychiatrist who defines what AI is and isn’t, and who analyses the cognitive behaviour of an AI computer, comparing it to human behaviour/thinking.

Some people include robotics, sensors and motion in AI, such as the ability to move, pick up objects and understand conversation. In my opinion all the mechanics of a robot are not AI. AI is only the thinking bit. But that’s just my opinion.

The ability to understand language is part of AI. This is called NLP, Natural Language Processing. The mechanics of hearing and speaking are not AI (that’s electronic & mechanical engineering), but the ability to understand the meaning of the words is AI.

Business Intelligence (BI)

Business Intelligence is about analysing business data to get a business insight, to be used to make business decisions. So Business Intelligence is not for scientific purposes, but to improve business performance, typically reducing the costs or increasing revenues.

Data Analysis can be for many different purposes, including academic research and personal interest, but if it is not for business, it is not BI. Examples of BI are: analysing customer profitability, analysing sales across different products, analysing patient risk for illness, and analysing risk of losses in investment or lending.

BI is done by humans, not machines. A BI analyst explores the data in the database, using a BI tool such as Tableau and MicroStrategy, using OLAP cubes such as SSAS and QlikView, or using a reporting tool such as Board and SAP Crystal Reports.

Based on the tools used, there are 6 categories of BI applications: reporting, analytic, data mining, dashboard, alert and portal. Quoting from my book: Reporting applications query the data warehouse and present the data in static tabular format. Analytic applications query the data warehouse repeatedly and interactively, and present the data in flexible formats that users can slice and dice. Data mining applications explore the data warehouse to find patterns and relationships that describe the data. Reporting applications are usually used to perform lightweight analysis. Analytic applications are used to perform deeper analysis. Data mining applications are used for pattern finding.

Dashboards are a category of BI applications that gives a quick high level summary of business performance in graphical gadgets, typically gauges, charts, indicators, and color-coded maps. By clicking these gadgets, we can drill down to lower-level details. Alerts are notifications to the users when certain events or conditions happen. A BI portal is an application that functions as a gateway to access and manage business intelligence reports, analytics, data mining, and dashboard applications as well as alert subscriptions.

We now have a new category of BI: streaming analytics, where the BI tool gives a second-by-second, real-time summary of the streaming data.

That’s my understanding. If I’m wrong please correct me, via comments below, thanks. I hope this article is useful for you.

References:

  1. Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, link
  2. Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig, link

25 October 2017

Andrew Ng’s Machine Learning course

Filed under: Data Science — Vincent Rainardi @ 6:00 pm

I have just completed Andrew Ng’s Machine Learning course on Coursera and in this article I would like to share what I have experienced.

The course covers various machine learning algorithms such as neural networks, K-means clustering, linear regression, logistic regression, support vector machines, principal component analysis and anomaly detection.

More importantly for me, the course covers various real world applications of machine learning, such as self-driving cars, character recognition, image recognition, movie recommendation, image compression, cancer detection and property prices.

What makes this course different to other Machine Learning materials and sessions I have seen is that it is technical. Usually when people talk about machine learning, they don’t talk about the mechanics and the mathematics. They explain what clustering does, but they don’t explain exactly how clustering is done. In this course Andrew Ng explains how it is done in great detail.

It is amazing how people can get away with it, i.e. explaining the edges of Data Science / Machine Learning without diving into the core. But I do understand that in reality, not many people are able to understand the mathematics. The amount of matrix and vector algebra in this course is mind boggling. I am lucky that I studied physics as an undergraduate at university, so I have a good grounding in calculus and linear algebra. But those who have never worked with matrices before might have difficulties understanding the math.

Unfortunately, we do need to understand the math in order to complete the programming assignments. The programming is in Octave, which is very similar to Matlab. When I started this course, I had not heard of Octave. But luckily again, I used Matlab when I was at uni. Yes, it was 25 years ago, but it gave me some background. And in the last 20 years I have been coding in various programming languages, including C++, VB, C#, Pascal, Cobol, SQL, Java, R and Python, so that helps. It does help if you have strong programming experience.

The programming assignments do take a lot of time. It took me about 3-5 hours per assignment, and there are 9 assignments (week 1 to 9, there are no assignments for week 8 and 9). For me, the problem with these programming assignments is the vectorisation. I can roughly figure out the answer using loops, which takes many lines. But converting it into a vectorised solution (which is only 1 line) takes a long time. Secondly, it takes time to translate the mathematical formulae into Octave, at least in the first 3 weeks. And thirdly, it takes time to test the program and correct the mistakes that I make.

The quizzes (the tests) are not bad. They are relatively easy, far easier than the programming. Each quiz comprises 5 questions and we need to get 4 out of 5 questions correct. We are given 3 goes for each quiz. If we still fail after 3 goes, we can try again the next day. Out of about 11 quizzes (or maybe 12) there is one which is difficult: I failed it twice, but passed the third time. But that’s the only one which is difficult. The others are relatively straightforward.

The most enjoyable thing for me is the math. I have not had a chance to use the math I learned at uni since I graduated 25 years ago (well, except helping my children with their GCSE homework). The practical / real world examples are also enjoyable. The programming on some modules is enjoyable. Overall it was fun and interesting, and very useful at the same time. I have been to many IT courses: .NET, SQL BI, Teradata, TDWI, and Big Data, but none of them was as enjoyable as this one.

This course made me realise that Machine Learning is very useful and fun, so I decided to apply for an MSc course in Data Science (containing big data and machine learning). So thank you Andrew Ng for creating this course, and patiently explaining the chapters and concepts, video by video. Thank you.
