Data Warehousing and Machine Learning

5 January 2018

Andrew Ng’s Deep Learning Course

Filed under: Data Science — Vincent Rainardi @ 6:13 am

I’m doing Andrew Ng’s Deep Learning Course at the moment (link). When I wrote an article about “What Machine Learning Can Be Used For” last week (link), I realised that most machine learning implementations use deep learning. CNNs are widely used for image recognition (computer vision, to be precise), and RNNs are widely used for speech and audio. Hence my reason to study deep learning.

But not only that. I’m in for a treat, because in this course Andrew Ng also interviews deep learning legends like Geoffrey Hinton and Pieter Abbeel. At the data science community meetup in London last month I heard Geoffrey’s name for the first time. He was mentioned because of his capsule networks concept. He is also known for backpropagation and Boltzmann machines (link). He is a very important figure in deep learning because of his decades of contributions to DL, from the 1980s to today. He reminds me of Bill Inmon and Ralph Kimball, who have also contributed to data warehousing for decades, since the 1990s. Pieter Abbeel is a legend in robotics and deep reinforcement learning (link).

Hearing legends and experts being interviewed by Andrew Ng is really inspiring. Not only can we understand their inventions in a simplified way, and how they arrived at these concepts, but we also learn about the situation in the industry (the UK didn’t appreciate ML as much as Silicon Valley, for example; luckily London is now the hub of ML start-ups and funding, link, link), and the future direction of the industry. On top of that, as beginners we get very valuable advice from them, which could save us a lot of time by heading in the right direction. In my opinion, when learning it is very important to have a clear direction. We do not want to learn “the wrong things” and waste years of our time.

In the last year or so Python has been the most widely used programming language in machine learning. Its scikit-learn library is the de facto standard in machine learning (it is built on NumPy, SciPy and matplotlib, which are also the de facto standards in their respective areas). Theano, CNTK and TensorFlow are the de facto standards for CNNs (which are used for computer vision, the most popular application of ML/DL) and indeed other areas of DL, and they all have Python as their main interface. Keras, another very popular ML library, is also written in Python (link) and runs on Theano, CNTK, and TensorFlow. A large part of ML/DL is data analysis and preparation, and Pandas, the most popular tool for that, is also in Python. So today (Jan 2018) there is no competition. Unlike 2016, when R vs Python was still heavily argued as the best language for ML, today Python is the de facto language for ML.

When I took Andrew Ng’s previous ML course, it used MATLAB and Octave. But this course uses Python and Jupyter notebooks from the start, and TensorFlow later on. It is a treat! The NumPy library is very easy to use and very powerful for matrix operations. The Jupyter notebook is also very easy to use. The programming exercises use scikit-learn, NumPy and matplotlib (pyplot). I look forward to using TensorFlow later on. It is the right toolset for the job, and it adds a lot of value to the CV and experience of the student of this course. Well done Andrew Ng and team for choosing this toolset and moving away from Octave.
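To give a flavour of the matrix work in the assignments, here is a minimal NumPy sketch of the linear step z = Wx + b used throughout the course (the numbers are made up, purely for illustration):

```python
import numpy as np

# Forward linear step z = W.x + b for a tiny two-unit layer.
W = np.array([[0.5, -0.2],
              [0.1,  0.3]])   # 2x2 weight matrix
x = np.array([[1.0],
              [2.0]])         # 2x1 input column vector
b = np.array([[0.1],
              [0.1]])         # 2x1 bias

z = W @ x + b                 # matrix multiply plus element-wise add
print(z.ravel())              # [0.2 0.8]
```

The `@` operator is NumPy’s matrix multiplication, which replaces the explicit loops you would otherwise write over rows and columns.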

I should also mention that using a Jupyter notebook in this course is a lot easier than using Octave in the previous ML course. I don’t have to download the zip file, unzip it, copy it to the Octave working folder, and run it on my laptop (oh, and I had to install Octave first, of course). And I had to switch between tabs in Octave to see the output and the source code, and switch back and forth to the PDF to see the programming instructions too. It felt like I was back in the nineties with the Delphi IDE! With a Jupyter notebook it feels like I’m really in 2018. I can run the code there and then by pressing Shift-Enter, right in between the instructions and the code! And the output is right below it. There is no switching back and forth at all. Everything is in one place. I don’t even have to install anything on my laptop. Everything is online! Amazingly simple.

And the scoring (marking) is integrated too. In the old ML course I had to get the submission code from the PDF, paste it into the Octave IDE and type submit. Now there’s no submission code. I just have to click the submit button in the Jupyter notebook. Easy! And when I’m back on the Coursera pages, my scores are already there. They are well integrated.

In weeks 2 and 3 there is a little bit of calculus, i.e. derivatives with computation graphs, and for backprop. It is only a little bit though, not as much as I thought. I haven’t done calculus since university, but I didn’t encounter any issues following the calculus in weeks 2 and 3. It was only simple derivatives such as x^3 and ln(x), the sum rule, the chain rule and the derivative of the sigmoid function. That’s it. To be honest I enjoyed encountering calculus again. My background is in Engineering Physics (BEng), i.e. instrumentation and control, thermodynamics, vibration, optics, electronics and material science. So there was a lot of maths involved, particularly calculus and numerical analysis, including linear algebra.
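For example, the derivative of the sigmoid has the neat closed form σ′(x) = σ(x)(1 − σ(x)), which is what makes backprop cheap. A quick NumPy check of that identity (my own sketch, not from the course materials):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # the closed form used in backprop

# Check against a central finite difference at a few points.
xs = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)
print(np.allclose(numeric, sigmoid_deriv(xs), atol=1e-6))  # True
```

At x = 0 the sigmoid is 0.5, so its derivative peaks at 0.25 — a handy sanity check.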

The price is unbelievably cheap. It is only £36/month. There are 5 courses, and each course is 3-4 weeks long (more likely 3 weeks than 4). So about 4 months in total. Andrew Ng’s first ML course was £61 in total (about 11 weeks). I guess that is the power of Coursera: the price is next to nothing. Compare this with the usual IT training cost of £1,500/week. Let me repeat it, because this is probably the most important thing about the course: it is £36 per month.

The other thing I enjoy in this course is listening to Andrew Ng explaining all the concepts on the whiteboard (or screen, to be precise). I am not sure why, but it is hugely entertaining, particularly the mathematics, e.g. matrix calculations, but also other concepts such as backward propagation. I also enjoy the programming assignments. Day to day in the office I mostly work with SQL, so programming in other languages is refreshing for me. I did a bit of VB, Java, C#, and C++, but that was a long time ago. I also did Python when working with Spotfire, but again that was a few years ago. So I actually look forward to the programming assignments. Again I don’t know why, but I do enjoy them. It is hugely satisfying to see that the output of my program is as expected.

Overall it has been a treat and I enjoy doing this course. Thank you Andrew Ng, Kian and Younes for preparing and providing this course. It must have been hard work for the three of you, for months, even years.

29 November 2017

What’s in Azure?

Filed under: Business Intelligence,Data Science,Data Warehousing — Vincent Rainardi @ 5:31 am

As of 28th Nov 2017 we can create the following in Microsoft Azure:

  • Data: SQL Database, SQL Server, SQL Data Warehouse, SQL Elastic Database Pool, MySQL, PostgreSQL, Data Lake, CosmosDB, MongoDB, Redis, Teradata Database, Oracle Database, IBM DB2, Aerospike, XtremeData, HazelCast, ArangoDB.
  • Data Integration: Informatica, Data Factory, Data Catalog, File Sync, Profisee Maestro Integration Server, Information Builder iWay Big Data Integrator, BizTalk Service, IBM WebSphere Application Server, IBM WebSphere MQ, IBM MQ, Datameer, QuerySurge.
  • Big Data: HDInsight, Hadoop, HDFS, HBase, Hive (Interactive Query), Hive Streaming, Spark, Kafka, Storm, Hortonworks, Cloudera, Cassandra, GridGain, MapR, F5 BIG-IP, Syncfusion, Informatica Big Data Mgt, Kyligence, AtScale.
  • Data Science: ML Experimentation, ML Model Management, R Server, Data Science VM, Bot Service, Computer Vision API, Face API, Language Understanding Intelligent Service, Translation Speech API, Text Analytics API, ML Studio Web Service, Sparkling Water H2O.
  • Analytics: Databricks, Stream Analytics, Analysis Services, Data Lake Analytics, Time Series Insights, Tableau Server, Qlik Sense Server, Pyramid Analytics, Nutanix Analytics, Real World Analytics, Exasol Analytic, HP Vertica Analytics, Teradata Aster Analytics, Bing Search, Dundas BI, Power BI, Panorama Necto, Spago BI, Targit, KNIME, SAP Hana, Kepion, Jedox, Diagramics.
  • Internet of Things: IoT Hub, Event Hub, Notification Hub, Pilot Things IoT VPN, TrendMicro IoT Security.
  • Developer Tools: Visual Studio, VS Anywhere, DevOps Project, Team Project, DevTest Labs, Application Insights, API Management, Operational Insights, Jenkins.
  • Infrastructure: Virtual Network, Load Balancer, Network Security Group, Security & Audit, Security & Compliance, Event Tracker Security Centre, Log Analytics, Automation, Active Directory, Scheduler, CloudAMQP, Cradeon.
  • Web: Website + SQL, WebApp, WordPress, Drupal, Joomla, Django, API App, Logic App, CDN, Media Services, VoIP, SiouxApp, App Dynamics, CakePHP, BlogEngine.NET, MVC Forum, Better CMS, Node.js.
  • Compute: Windows Server, Red Hat Enterprise Linux, Oracle Linux, Ubuntu Server, VM Instances, Service Fabric Cluster, Web App for Container, Function App, Batch Service, Cloud Service.
  • Blockchain: Ethereum, Hyperledger, Corda, Quorum, STRATO, Chain Core, Stratis.

28 October 2017

What is Data Science?

Filed under: Data Science — Vincent Rainardi @ 7:21 am

What is the difference between Data Science, Data Mining, Statistics, Machine Learning and Artificial Intelligence? Data Science is one of those buzzwords which is very popular today, and therefore tends to be used to spice up news and marketing materials. In this article I will explain what Data Science is, and how it differs from AI, BI, Big Data, Computer Science, Data Analysis, Data Management, Data Mining, Data Warehousing, Machine Learning, Predictive Analytics, Robotics and Statistics.

Data Science and Data Scientist

Data Science is about scientific approaches to managing and analysing data using statistics, machine learning and visualisation.

Unlike a business analyst, a Data Scientist’s job is to manage and analyse data, focusing on the data itself, rather than the business functionality. For example, finding a pattern in the data, or forecasting future values.

Although a Data Scientist does manage the data, unlike a Big Data engineer, a Data Scientist does not set up the Hadoop infrastructure, such as configuring the nodes. A Data Scientist is an expert in using big data, such as video feeds in a self-driving car, millions of images in a character recognition system, streaming voice in speech recognition, or petabytes of website traffic data for classifying ecommerce customers.

A Data Scientist cleans the data, reformats the data, and manages how the data is stored in and retrieved from files and databases, for example splitting the data into many files and combining the calculation results back. A Data Scientist also creates new data (usually artificial data, based on the data which already exists).

Unlike a data warehouse architect, a Data Scientist is not an expert in ETL technology (such as Informatica or SSIS), parallel technology (such as UPI in Teradata, or partitioning in SQL Server) or any database/file/storage technology (such as in-memory, cubes or SAN).

A Data Scientist is an expert in clustering algorithms, deep learning, and artificial neural networks. Unlike a brain doctor or neurosurgeon, they are not an expert in neurons, the nervous system, or the human (or animal) brain. A Data Scientist knows the architecture of a neural network, such as the number of layers or nodes, and they know about Long Short-Term Memory networks, Deep Belief Networks, and Reinforcement Learning. Unlike a psychologist or a psychiatrist, a Data Scientist does not know how humans perceive events, remember and recall things, or make decisions, and they do not study human behaviour or mental illnesses.

A Data Scientist is an expert in visualising data using Python, Jupyter and R, such as producing 3D graphs, charts and trend lines using ggplot2. And they should be able to use a BI tool such as Tableau or QlikView to create visualisations, but at a basic level. Unlike a Tableau, BusinessObjects, Cognos, Microsoft BI, TM1 or QlikView developer, they are not an expert in BI software technicalities. For example, they would not know about setting access restrictions on BO Universes, using Hierarchize in MDX, TurboIntegrator in TM1, or loading security tables in QlikView.

A Data Scientist knows how to write SQL queries, including grouping and joining tables, and converting data types. But they are not a SQL developer who knows how to write stored procedures containing recursive queries using CTEs, ranking functions and cursors, or how to force a query plan to use a hash join or bitmap filter.
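To sketch the level of SQL meant here — a join plus a group-by, nothing exotic — using Python’s built-in sqlite3 module with hypothetical customers and orders tables:

```python
import sqlite3

# Hypothetical customers/orders tables, just to illustrate a join + group-by.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")
rows = con.execute("""
    SELECT c.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)   # [('Alice', 2, 25.0), ('Bob', 1, 7.5)]
```

Anything much beyond this — recursive CTEs, query-plan hints — is SQL-developer territory rather than data-scientist territory.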

A Data Scientist knows about big data technologies and how to use them. They should be able to write MapReduce in Java to read text files from HDFS and output a JSON file. But I would not expect them to understand why their MapReduce code is giving this warning: “Use GenericOptionsParser for parsing the arguments” (because we need to use the getConf() method). When using Cassandra or Hortonworks, a Data Scientist needs a Big Data Engineer to set up the platform for them, and a Big Data Developer to help them with the coding.

A Data Scientist should have a good background in data architecture, and they should be able to design data structures such as tables and HDFS/JSON files, but at a basic level. They should not be expected to understand detailed Kimball modelling (such as implementing a bridge table to solve multi-valued attributes), detailed Graph modelling, or detailed Data Vault modelling – for these we still need a Data Architect.

A glaring gap is business knowledge. Unlike a Data Architect or a Business Analyst, a Data Scientist does not have good business knowledge. FX Options, IRS, CDS, or any other swaps are not in their vocabulary. A Data Architect or a Business Analyst working in Investment Banking or Investment Management would know these OTC Derivatives well. A Data Scientist will need to be taught the business knowledge. Whether it is cancer, lending, airline, retail or telecom data, a Business Analyst or a business person in that sector will need to explain the numbers and data to the Data Scientist before they can do their work.

So to recap, a Data Scientist is:

  • An expert in Machine Learning, statistics, AI and mathematical modelling
  • An expert in using various types of data, such as numeric, video, images and voice
  • An expert in statistics and analytics tools such as SPSS, Statistica, Weka, KNIME
  • An expert in data processing, data quality, and data management
  • Has good knowledge of programming, particularly Python, R and MATLAB
  • Has good knowledge of querying Big Data platforms such as Spark and Cloudera
  • Has basic knowledge of SQL, data modelling and databases
  • Has basic knowledge of some BI tools and visualisation
  • Has no business knowledge such as treasury, lending or legal
  • Has no knowledge of technical infrastructure such as Big Data platforms, database engines, network infrastructure, and storage technologies
  • Typically has a degree or PhD in Maths, Physics, Computer Science or Engineering

To recap the definition above: Data Science is about scientific approaches to managing and analysing data using machine learning, statistics and visualisation. The data is already there, stored in databases or file systems, and data science is about analysing this data using scientific techniques (not business techniques). Data Science is not about collecting and storing data in databases or file systems; it is about analysing data which is already stored there. And this analysis is not business analysis but scientific (mathematical) analysis: we try to find patterns in the data.

Statistics

Statistics is the branch of mathematics for analysing numbers and sets. It is about linear regression, Gaussian distributions, correlation, sample sizes and probability. Advanced statistics involves advanced mathematics such as differential equations (calculus) and stochastic processes.

Statistics is not about analysing multimedia or big data, such as images, video feeds, text and voice (natural language processing). It is about analysing numbers and sets.

Statistics is not about neural networks, deep learning or clustering algorithms. Those are machine learning.
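As a tiny illustration of the statistics side — fitting a least-squares line and computing a correlation, here with NumPy on made-up numbers:

```python
import numpy as np

# Fit y = a*x + b by least squares and compute the Pearson correlation.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0              # a perfectly linear relationship

a, b = np.polyfit(x, y, deg=1) # slope and intercept
r = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient

print(round(a, 3), round(b, 3), round(r, 3))  # 2.0 1.0 1.0
```

Because the toy data is exactly linear, the fit recovers the slope and intercept and the correlation is 1.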

Machine Learning

Machine Learning is about creating machines/computers that do tasks without being explicitly programmed (link). After learning, the computer is able to predict future values or classify future data.

For the last 50 years we have been giving computers instructions or rules on how to do things. For example, credit card fraud is defined by perhaps 25 rules, such as a transaction with an unusually large amount, or an unusual location.

In contrast, in machine learning we don’t give the rules to the computer. Instead we give the computer 1,000 transactions and tell it: this one is a fraud, this one is not, this one is a fraud, this one is not, … and so on until 1,000. The computer learns, creates its own rules (or patterns) and stores them. Then we give it a new transaction and the computer can tell us whether it is a fraud or not.
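A hypothetical sketch of that learn-then-predict pattern, using scikit-learn with made-up features (amount, distance from home in km) — we hand the computer labelled examples only, no hand-written rules:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic labelled transactions: 500 legitimate, 500 fraudulent.
rng = np.random.default_rng(0)
normal = np.column_stack([rng.uniform(5, 100, 500), rng.uniform(0, 20, 500)])
fraud = np.column_stack([rng.uniform(500, 5000, 500), rng.uniform(100, 9000, 500)])
X = np.vstack([normal, fraud])
y = np.array([0] * 500 + [1] * 500)       # 0 = legitimate, 1 = fraud

# The computer derives its own "rules" from the labelled examples.
model = LogisticRegression(max_iter=1000).fit(X, y)

# A new, unseen transaction: large amount, far from home.
prediction = int(model.predict([[3000.0, 4000.0]])[0])
print("fraud" if prediction == 1 else "legitimate")   # fraud
```

The feature choice and thresholds here are invented for the sketch; a real fraud model would use many more features and far more data.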

We can use the same methods to identify cancer from thousands of scan images, or to analyse millions of images of nebulas and stars to find black holes. We can use it to read handwriting (like postcodes on envelopes at the Royal Mail), for speech recognition like Siri and Alexa, for image recognition like in self-driving cars, and for face recognition in banking and payment applications.

Machine Learning is about creating machines/computers which can do tasks without being explicitly programmed. We don’t give them the rules; they create the rules themselves.

Machine Learning uses various algorithms such as Linear Regression, Logistic Regression, Decision Tree, K-means Clustering, Support Vector Machines (SVM), Principal Component Analysis (PCA), Anomaly Detection, and Neural Network.

Machine Learning uses mathematics (statistics, algebra, calculus) to derive and calculate those algorithms. It uses computer programming to make the computer “learn” the parameters or weightings from the training data (to create the “rules”), and to predict the result for new data.
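“Learning the parameters” can be sketched in a few lines: gradient descent for a one-variable linear regression, on made-up training data (my own minimal example, not from any particular library):

```python
import numpy as np

# Learn w and b in y = w*x + b by gradient descent on the mean squared error.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0                # the "rule" the computer must discover

w, b, lr = 0.0, 0.0, 0.01        # start from zero, small learning rate
for _ in range(20000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)   # d(MSE)/dw
    grad_b = 2 * np.mean(y_hat - y)         # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # 3.0 2.0
```

The calculus supplies the gradients, the algebra supplies the model, and the loop is the “learning”: the parameters converge to the rule hidden in the data.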

The most popular programming language today for Machine Learning is Python. R comes second, and very few use MATLAB. This is because Python has SciPy and scikit-learn, a comprehensive machine learning library. Today almost nobody implements those ML algorithms manually – why reinvent the wheel? Everybody just uses the scikit-learn library. Secondly, Python and R are free whereas MATLAB is expensive (£1,800, link). Also, Python is a richer language than R: Python is a general-purpose language like Java and C#, whereas R is a specialist language used primarily for statistics.

Data Mining

Data Mining is about extracting and processing data, and finding patterns and insight in large amounts of data. So data mining is not about collecting and storing data, or designing databases/file systems to store the data. It is about analysing the data to find patterns and insight.

So is it the same as Machine Learning then? No, it is not. Data Mining is wider than Machine Learning. Data Mining uses Machine Learning algorithms, such as clustering and classification, but it also uses non-Machine-Learning techniques such as association rules, market basket analysis, and frequent itemset mining.

Data Mining also does the following:

  • Data Mining uses SQL queries to simply query relational databases to find the answer to a specific question, to get a business insight (not to find patterns).
  • Data Mining also explores OLAP cubes to find specific business insights about the data.
  • Data Mining also queries text and documents (this is called Text Mining, or Text Analytics).
  • Data Mining also queries graph databases, object databases, and document databases.
  • Data Mining also queries file systems, such as data lakes and Hadoop.
  • Data Mining also extracts information from images, sounds and videos (i.e. multimedia) and maps (called spatial data).
  • Data Mining also extracts information from social media such as Twitter, Facebook and Instagram.
  • Data Mining also extracts information from streaming data, such as weather and stock market data.
  • Data Mining also queries DNA or protein sequences for a certain genetic pattern, e.g. using DNAQL.

Data Mining is different to Reporting in the sense that Data Mining explores the data (a flexible exercise where we query the data repeatedly), whereas Reporting queries the data and outputs it in a rigid, specific format.

Data Mining is different to Data Science because when Data Mining analyses the data, it uses not only scientific (mathematical) methods but also business methods (business analysis). For example: analysing floating rates in Interest Rate Swaps, or rating migration patterns in corporate bonds. So Data Mining covers Business Intelligence and business analysis, in addition to scientifically analysing the data using statistics and Machine Learning algorithms. Data Mining also tries to find business insights, in addition to finding statistical patterns.

Artificial Intelligence (AI)

Artificial Intelligence is the ability of a machine/computer to learn, think, solve problems and make decisions. The machine does not have to be able to see, hear, talk or communicate well (that’s robotics); a basic input-output will do.

At the moment Machine Learning is used a lot in AI (especially Neural Networks), but AI also uses non-Machine-Learning methods such as Bayesian networks, Kalman filters, fuzzy logic, automated reasoning, solution searching and evolutionary algorithms.

So an AI expert can be a Machine Learning expert, or an evolutionary algorithm expert, which are very different specialities. But generally an AI expert is an expert in mathematics and able to do basic programming in some languages (usually Python or Java).

An “AI expert” does not have to be in IT. They could be a psychologist or a psychiatrist who defines what AI is and isn’t, and who analyses the cognitive behaviour of an AI computer, comparing it to human behaviour and thinking.

Some people include robotics, sensors and motion in AI, such as the ability to move, pick up objects and understand conversation. In my opinion the mechanics of a robot are not AI. AI is only the thinking bit. But that’s just my opinion.

The ability to understand language is part of AI. This is called NLP, Natural Language Processing. The mechanics of hearing and speaking are not AI (that’s electronic and mechanical engineering), but the ability to understand the meaning of the words is AI.

Business Intelligence (BI)

Business Intelligence is about analysing business data to get business insight, to be used to make business decisions. So Business Intelligence is not for scientific purposes, but to improve business performance, typically by reducing costs or increasing revenues.

Data Analysis can be for many different purposes, including academic research and personal interest, but if it is not for business, it is not BI. Examples of BI are: analysing customer profitability, analysing sales across different products, analysing patient risk of illness, and analysing risk of losses in investment or lending.

BI is done by humans, not machines. A BI analyst explores the data in the database using a BI tool such as Tableau or MicroStrategy, using OLAP cubes such as SSAS and QlikView, or using a reporting tool such as Board or SAP Crystal Reports.

Based on the tools used, there are 6 categories of BI applications: reporting, analytic, data mining, dashboard, alert and portal. Quoting from my book: Reporting applications query the data warehouse and present the data in static tabular format. Analytic applications query the data warehouse repeatedly and interactively, and present the data in flexible formats that users can slice and dice. Data mining applications explore the data warehouse to find patterns and relationships that describe the data. Reporting applications are usually used to perform lightweight analysis. Analytic applications are used to perform deeper analysis. Data mining applications are used for pattern finding.

Dashboards are a category of BI applications that gives a quick high level summary of business performance in graphical gadgets, typically gauges, charts, indicators, and color-coded maps. By clicking these gadgets, we can drill down to lower-level details. Alerts are notifications to the users when certain events or conditions happen. A BI portal is an application that functions as a gateway to access and manage business intelligence reports, analytics, data mining, and dashboard applications as well as alert subscriptions.

We now have a new category of BI: streaming analytics, where the BI tool gives a second-by-second, real-time summary of streaming data.

That’s my understanding. If I’m wrong, please correct me via the comments below. Thanks. I hope this article is useful for you.


  1. Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, link
  2. Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig, link

25 October 2017

Andrew Ng’s Machine Learning course

Filed under: Data Science — Vincent Rainardi @ 6:00 pm

I have just completed Andrew Ng’s Machine Learning course on Coursera (link), and in this article I would like to share my experience.

The course contains various algorithms of machine learning such as neural networks, K-means clustering, linear regression, logistic regression, support vector machine, principal component analysis and anomaly detection.

More importantly for me, the course covers various real-world applications of machine learning, such as self-driving cars, character recognition, image recognition, movie recommendation, image compression, cancer detection, and property prices.

What makes this course different to other Machine Learning materials and sessions I have seen is that it is technical. Usually when people talk about machine learning, they don’t talk about the mechanics and the mathematics. They explain what clustering does, but they don’t explain how exactly clustering is done. In this course Andrew Ng explains how it is done in great detail.

It is amazing how people can get away with it, i.e. explaining the edges of Data Science / Machine Learning without diving into the core. But I do understand that, in reality, not many people are able to understand the mathematics. The amount of matrix and vector algebra in this course is mind-boggling. I am lucky that I studied physics as an undergraduate, so I have a good grounding in calculus and linear algebra. But those who have never worked with matrices before might have difficulty understanding the maths.

Unfortunately, we do need to understand the maths in order to complete the programming assignments. The programming is in Octave, which is very similar to MATLAB. When I started this course I had not heard of Octave, but luckily I had used MATLAB at university. Yes, it was 25 years ago, but it gave me some background. And in the last 20 years I have been coding in various programming languages, from C++, VB, C#, Pascal, Cobol and SQL to Java, R and Python. It does help if you have strong programming experience.

The programming assignments do take a lot of time. It took me about 3-5 hours per assignment, and there were 9 assignments (weeks 1 to 11, with no assignments for weeks 8 and 9). For me, the problem with these programming assignments was the vectorisation. I could roughly figure out the answer using loops, which took many lines. But converting that into a vectorised solution (which is only 1 line) took a long time. Secondly, it took time to translate the mathematical formulas into Octave, at least in the first 3 weeks. And thirdly, it took time to test the code and correct the mistakes I made.
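The course assignments are in Octave, but the idea of vectorisation is the same in NumPy. Here is a sketch (with made-up numbers) of the many-line loop version versus the one-line vectorised version of a hypothesis h = Xθ:

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])   # three parameters
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # two training examples, three features

# Loop version: many lines, slow.
h_loop = np.zeros(2)
for i in range(X.shape[0]):
    total = 0.0
    for j in range(X.shape[1]):
        total += theta[j] * X[i, j]
    h_loop[i] = total

# Vectorised version: one line.
h_vec = X @ theta

print(np.allclose(h_loop, h_vec))   # True
```

Seeing the two produce identical results is the easy part; training yourself to reach for the one-liner first is what takes the time.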

The quizzes (the tests) were not bad. They were relatively easy, far easier than the programming. Each quiz comprises 5 questions and we needed to get 4 out of the 5 correct. We were given 3 goes at each quiz. If we still failed after 3 goes, we could try again the next day. Out of about 11 quizzes (or maybe 12) there was one which was difficult; I failed it twice, but passed the third time. That was the only difficult one. The others were relatively straightforward.

The most enjoyable thing for me was the maths. I had not had a chance to use the maths I learned at university since I graduated 25 years ago (well, except helping my children with their GCSE homework). The practical, real-world examples were also enjoyable, as was the programming on some modules. Overall it was fun and interesting, and very useful at the same time. I have been to many IT courses: .NET, SQL BI, Teradata, TDWI, and Big Data, but none of them were as enjoyable as this one.

This course made me realise that Machine Learning is very useful and fun, so I decided to apply for an MSc course in Data Science (covering big data and machine learning). So thank you Andrew Ng for creating this course, and for patiently explaining the chapters and concepts, video by video. Thank you.
