Data Platform and Data Science

8 June 2021

Which machine learning algorithms should I use?

Filed under: Machine Learning — Vincent Rainardi @ 5:06 am

Every month I learn a new machine learning algorithm. Until today I’ve learned about ten algorithms and whenever I’m trying to solve a machine learning problem, the question is always “Which algorithm should I use?”

Almost every machine learning practitioner knows that the answer depends on whether the problem is supervised or unsupervised, and then whether it is classification or regression. So choosing which algorithm to use is quite straightforward, right?

Well, no. Take classification for example. We can use Logistic Regression, Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting or Neural Network. Which one should we use?

“Well, it depends” is the answer we often hear. “Depends on what?” That is the question! It would be helpful if we knew what factors to consider, right?

So in this article I would like to try to answer those questions. I’m going to first address the general question of “Which machine learning algorithm should I use?” This is useful when you are new to machine learning and have never heard of classification and regression, let alone ensembles and boosting. There are many good articles already written about this, so I’m going to point you to them.

Then, as an example, I’m going to dive specifically into classification algorithms. I’ll try to give a brief outline of the factors we need to consider when deciding, such as linearity, interpretability, multi-class support and accuracy, as well as the strengths and weaknesses of each algorithm.

General guide on which ML algorithms to use

I would recommend that you start with Hui Li’s diagram: link. She categorised ML algorithms into 4: clustering, regression, classification and dimensionality reduction:

It is very easy to follow, and it is detailed enough. She wrote it in 2017 but by and large it is still relevant today.

The second one that I’d recommend is Microsoft’s guide: link, which is newer (2019) and more comprehensive. They categorise ML algorithms into 8: clustering, regression, classification (2 class and multiclass), text analytics, image classification, recommenders, and anomaly detection:

So now you know roughly which algorithm to use for each case, using the combination of Hui Li’s and Microsoft’s diagrams. In addition to that, it would be helpful if you read Danny Varghese’s article about comparative study on machine learning algorithms: link. For every algorithm Danny outlines the advantages and disadvantages against other algorithms in the same category. So once you choose an algorithm based on Hui Li’s and Microsoft’s diagrams, check that algorithm against the alternatives on Danny’s list, make sure that the advantages outweigh the disadvantages.

Classification algorithms: which one should I use?

For classification we can use Logistic Regression, Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine (GBM), Perceptron, Linear Discriminant Analysis (LDA), K Nearest Neighbours (KNN), Learning Vector Quantisation (LVQ) or Neural Network. What factors do we need to consider when deciding? And what are the strengths and weaknesses of each algorithm?

The factors we need to consider are linearity, interpretability and multi-class support.

The first consideration is the linearity of the data. The data is linear if, when we plot the data points in the predictor space against the target variable, the classes can be separated by a straight line, like below.

Note that the plots above are over-simplified, as in reality there are not just 2 dimensions but many (e.g. we might have 8 predictors, i.e. 8 axes), so the separator is not a line but a hyperplane.

  1. If the data is linear, we can use (link): Logistic Regression, Naive Bayes, Support Vector Machine, Perceptron, Linear Discriminant Analysis.
  2. If the data is not linear, we can use (link): Decision Tree, Random Forest, Gradient Boosting Machine, K Nearest Neighbours, Neural Network, Support Vector Machine using Kernel, Learning Vector Quantisation.

Can we use algorithms in #2 for linear classification? Yes we can, but #1 is more suitable.

Can we use #1 for non-linear classification? No, we can’t, not without modification. But there are ways to transform the data from a non-linear space to a linear space. This is called the “kernel trick”, see my article here: link.
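To make this concrete, here is a rough sketch of my own (not from any of the linked articles) using scikit-learn: a linear SVM and an RBF-kernel SVM on a “moons” dataset that is not linearly separable. The kernel version should separate the classes noticeably better.

```python
# A minimal sketch: linear SVM vs RBF-kernel SVM on non-linear data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: not separable by a straight line.
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)  # the "kernel trick"

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:   ", rbf_svm.score(X_test, y_test))
```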

The second factor that we need to consider is interpretability, i.e. the ability to explain why a data point is classified into a certain class. Christoph Molnar explains interpretability in great detail: link.

  • If we need to be able to explain the predictions, we can use Logistic Regression, Naive Bayes, Decision Tree or Linear Support Vector Machine.
  • If we don’t need to be able to explain, we can use Random Forest, Support Vector Machine with Kernel (see Hugo Dolan’s article: link), Gradient Boosting Machine, K Nearest Neighbours, Neural Network, Perceptron, Linear Discriminant Analysis or Learning Vector Quantisation.

The third factor that we need to consider is whether we are classifying into two classes (binary classification) or more than two classes (multi-class). Support Vector Machine (SVM), Linear Discriminant Analysis (LDA) and Perceptron are binary classifiers, but everything else can be used for both binary and multi-class problems. We can make LDA multi-class, see here: link. Ditto SVM: link.
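As an illustration (a minimal sketch of my own, not tied to the linked papers), scikit-learn lets us wrap a binary-style linear SVM in an explicit one-vs-rest scheme so it can handle more than two classes:

```python
# A binary classifier (linear SVM) made multi-class with one-vs-rest.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes
ovr_svm = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print("Predicted:", ovr_svm.predict(X[:5]), "Actual:", y[:5])
```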

1. Logistic Regression

Strengths: good accuracy on small amounts of data, easy to interpret (we get feature importance), easy to implement, efficient to train (doesn’t need high compute power), can do multi-class.

Weaknesses: tends to overfit on high-dimensional data (use regularisation), can’t do non-linear classification (or capture complex relationships), not good with multicollinearity, sensitive to outliers, assumes a linear relationship between the predictors and the log odds of the target.
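As a rough sketch of the two points above (interpretability and regularisation), here is logistic regression in scikit-learn, where the C parameter controls the L2 regularisation strength and the fitted coefficients give a feel for feature importance. This is my own toy example, not a recommended setup:

```python
# Logistic regression: C controls regularisation strength; coefficients
# (on scaled features) give a rough indication of feature importance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_[0]
top = np.argsort(np.abs(coefs))[::-1][:3]
print("Most influential features:", load_breast_cancer().feature_names[top])
```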

2. Naive Bayes

Strengths: good accuracy on small amounts of data, efficient to train (doesn’t need high compute power), easy to implement, highly scalable, can do multi-class, can handle continuous and discrete data, not sensitive to irrelevant features.

Weaknesses: assumes the features are independent, and a category which exists in the test dataset but not in the training dataset gets zero probability (the zero frequency problem).

3. Decision Tree

Strengths: easy to interpret (intuitive, shows interactions between variables), can classify non-linear data, data doesn’t need to be normalised or scaled, not affected by missing values, not affected by outliers, performs well with unbalanced data (the nature of the data distribution does not matter), can do both classification and regression, can handle both numerical and categorical data, provides feature importance (calculated from the decrease in node impurity), good with large datasets, able to handle multicollinearity.

Weaknesses: has a tendency to overfit (biased towards the training set, requires pruning), not robust (high variance: a small change in the training data results in a major change in the model and output), not good with continuous variables, requires a longer time to train the model (resource intensive).
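To illustrate the overfitting/pruning point (again just a sketch of my own), scikit-learn’s decision tree has a cost-complexity pruning parameter, ccp_alpha, which trades a little training accuracy for better generalisation:

```python
# An unpruned vs a cost-complexity-pruned decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("Unpruned: train", unpruned.score(X_train, y_train), "test", unpruned.score(X_test, y_test))
print("Pruned:   train", pruned.score(X_train, y_train), "test", pruned.score(X_test, y_test))
```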

4. Random Forest

Strengths: high accuracy, doesn’t need pruning, much less prone to overfitting, low bias with quite low/moderate variance (because of bootstrapping), can do both classification and regression, can handle numerical and categorical data, can classify non-linear data, data doesn’t need to be normalised or scaled, not affected by missing values, not affected by outliers, performs well with unbalanced data (the nature of the data distribution does not matter), can be parallelised (can use multiple CPUs in parallel), good with high dimensionality.

Weaknesses: long training time, requires a lot of memory, not interpretable (because there are hundreds of trees).

5. Support Vector Machine (Linear Vanilla)

Strengths: scales well with high-dimensional data, stable (low variance), less risk of overfitting, doesn’t rely on the entire data (not affected by missing values), works well with noise.

Weaknesses: long training time for large data, requires feature scaling.

6. Support Vector Machine (with Kernel)

Strengths: scales well with high-dimensional data, stable (low variance), handles non-linear data very well, less risk of overfitting (because of regularisation), good with outliers (has gamma and C to control), can detect outliers in anomaly detection, works well with noise.

Weaknesses: long training time for large data, tricky to find an appropriate kernel, needs a lot of memory, requires feature scaling, difficult to interpret.

7. Gradient Boosting

Strengths: high accuracy, flexible with various loss functions, minimal pre-processing, not affected by missing values, works well with unbalanced data, can do both classification and regression.

Weaknesses: tendency to overfit (because it keeps minimising errors), sensitive to outliers, large memory requirement (thousands of trees), long training time, large grid search for hyperparameters, not good with noise, difficult to interpret.

8. K Nearest Neighbours

Strengths: simple to understand (intuitive), simple to implement (both binary and multi-class), handles non-linear data well, non-parametric (no requirements on the data distribution), responds quickly to data changes in real-time implementations, can do both classification and regression.

Weaknesses: slow at prediction time (it is a lazy learner, so the computation happens when classifying new data points), doesn’t work well with high-dimensional data, requires scaling, doesn’t work well with imbalanced data, sensitive to outliers and noise, affected by missing values.

9. Neural Network

Strengths: high accuracy, handles non-linear data well, generalises well on unseen data (low variance), non-parametric (no requirements on the data distribution), works with heteroskedastic data (non-constant variance), works with highly volatile data (time series), works with incomplete data (not affected by missing values), fault tolerant.

Weaknesses: requires a large amount of data, computationally expensive (requires parallel processors/GPUs and a lot of memory), not interpretable, tricky to get the architecture right (number of layers, number of neurons, activation functions, etc.).

5 June 2021

The Trick in Understanding Human Language

Filed under: Machine Learning — Vincent Rainardi @ 10:08 am

I started learning Natural Language Processing (NLP) with such enthusiasm. There are 3 stages in NLP. The first stage is lexical analysis, where the root words and phrases are identified and stop words and misspellings are dealt with. The second stage is syntactic analysis, where the nouns, verbs, etc. are identified and the grammar is analysed. The third stage is semantic analysis, which is about understanding the meanings of the words.

So I thought, this is amazing! I knew that computers can now understand human languages, for example through Alexa and chatbots. And I would be diving into that wonderful world, learning how it’s done. At the end of this process I would be able to create a chatbot that could understand human language. Cool!

I did build a chatbot that could “understand” human language, but disappointingly it doesn’t really understand it. A chatbot uses a “trick” to guess the meaning of our sentences, identifying the most probable intention. It outputs prepared responses, and we need to define which response goes with which input. So no, it does not understand human language in the way I initially thought. We are still far away from having clever robots like those in “I, Robot” and “Ex Machina”.

In this article I’m writing about that learning experience, hoping that it will enlighten those who have not yet entered the NLP world.

Lexical Analysis

Lexical analysis is about identifying words and phrases, and dealing with stop words and misspellings. I learned how to identify the base form of words, such as “play” in the words “playing” and “player”. This process is called stemming, where we apply rules such as removing “ing” and “ion” suffixes. For this we use regular expressions.

The base form of “best” is “good”, which can’t be identified using stemming. For this we use lemmatisation, which is done using a combination of lookups and rules. Both are widely implemented using NLTK in Python; see Ivo Bernardo’s article on stemming (link) and Selva Prabhakaran’s article on lemmatisation (link).
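Here is a minimal sketch with NLTK (my own example, assuming the WordNet data has been downloaded; exact outputs depend on the NLTK version):

```python
# Stemming (rule-based) vs lemmatisation (lookup + rules) with NLTK.
# Assumes: nltk.download("wordnet") has been run.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("playing"))                   # play (suffix stripped by rule)
print(lemmatizer.lemmatize("playing", pos="v"))  # play (verb lemma via lookup)
print(lemmatizer.lemmatize("best", pos="a"))     # good (irregular form, needs lemmatisation)
```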

But before that we need to break text into paragraphs, sentences and words. This is called tokenisation. We deal with “she’d” and “didn’t”, which are actually “she would” and “did not”. We deal with tokens which are not words, like dates, times e.g. “3:15”, symbols, email addresses, numbers, years and brackets. See my article on tokenisation here: link.
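A quick illustration with NLTK’s tokenisers (my own example; assumes the punkt tokeniser data has been downloaded):

```python
# Sentence and word tokenisation with NLTK.
# Assumes: nltk.download("punkt") has been run.
from nltk.tokenize import sent_tokenize, word_tokenize

text = "She'd arrived at 3:15. Didn't she email john@example.com?"
print(sent_tokenize(text))   # split into two sentences
print(word_tokenize(text))   # note how "She'd" and "Didn't" become two tokens each
```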

Then we need to deal with misspellings, and for this we need to know how similar two words are, using edit distance. Edit distance is the number of operations (like deleting a letter, inserting a letter, etc.) required to change one word into another.
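For example (a toy case of my own), NLTK ships with a Levenshtein edit distance function:

```python
# Edit distance: the number of insert/delete/substitute operations needed.
from nltk.metrics.distance import edit_distance

print(edit_distance("recieve", "receive"))                       # 2: the swapped "ie" costs two substitutions
print(edit_distance("recieve", "receive", transpositions=True))  # 1 if swaps count as one operation
```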

A crude way of representing a text is using a “bag of words”. First we remove the stop words such as “the”, “in”, “a”, “is”, etc., because stop words exist in every text so they don’t provide useful information. Then we construct a dictionary from the distinct list of words in the text. For every sentence we mark each word according to whether it exists in the dictionary or not. The result is that a sentence is now converted into a series of 1s and 0s. A more sophisticated version uses the word frequency instead of 1s and 0s, see: link. Either way, in the end the sentences are converted into numbers.
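Here is a small bag-of-words sketch with scikit-learn’s CountVectorizer (my own example; binary=True gives the 1s-and-0s version, dropping it gives word counts):

```python
# Bag of words: stop words removed, each sentence becomes a vector of 1s and 0s.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The sun is a star",
             "The sun shines brightly",
             "A star shines at night"]
vectorizer = CountVectorizer(stop_words="english", binary=True)
bow = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # the "dictionary" of distinct words
print(bow.toarray())                       # one row of 1s and 0s per sentence
```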

Once a document is converted into numbers, we can run machine learning algorithms on it, such as classification. For example, we can classify whether an email or text message is spam or not.

That is, in 1 minute, Lexical Analysis 🙂 We can (crudely) represent a document as numbers and use this numerical representation to classify documents. But at this stage the machine doesn’t understand the documents, at all.

Syntactic Analysis

Syntactic Analysis is about breaking (or parsing) a sentence into phrases such as noun phrases, verb phrases, etc. and recognising them. We do this because the meaning of a word (e.g. “play”) depends on whether it is a noun or a verb. These labels, “noun”, “verb”, “adjective”, “preposition”, etc., are called “parts of speech”, or POS for short. So the first step is to identify the POS tag for each word.

There are many different approaches to POS tagging: supervised, unsupervised, rule-based, stochastic, Conditional Random Fields, Hidden Markov Models, memory-based learning, etc. Fahim Muhammad Hassan catalogued them in his thesis: link.

  • The Hidden Markov Model (HMM) is arguably the most popular, where the POS tag is determined based not only on the word, but also on the POS tag of the previous word. Many have written about HMM and its implementation in Python. For an introduction I recommend Raymond Kwok’s article (link), and for a formal lecture, Ramon van Handel of Princeton University (link).
  • The best approach in terms of accuracy is the Recurrent Neural Network (RNN). RNNs use a deep learning approach where the feedback is fed to the next stage. The most popular implementations are LSTM and GRU. Tanya Dayanand wrote a good short explanation here: link (notebook here). A quick example of what POS tags look like follows below.
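To show what POS tags look like in practice, here is NLTK’s off-the-shelf tagger (not an HMM or RNN that you train yourself; it assumes the averaged perceptron tagger data has been downloaded):

```python
# POS tagging with NLTK's pre-trained tagger.
# Assumes: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk

tokens = nltk.word_tokenize("The sun shines brightly")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('sun', 'NN'), ('shines', 'VBZ'), ('brightly', 'RB')]
```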

Once we know the POS tag for each word, we can parse (break) a sentence into phrases (e.g. noun phrase, verb phrase, etc.) or into subject, modifier, object, etc. in order to understand it. The former is called constituency grammar and the latter is called dependency grammar.

  • Constituency grammar: the most popular method is Context Free Grammar (CFG, link), which specifies the rules for how words are grouped into phrases. For example, a noun phrase (e.g. “the sun”) may consist of a determiner (the) and a noun (sun). A sentence can consist of a noun phrase and a verb phrase, e.g. “the sun shines brightly”.
  • In dependency grammar we first identify the root verb, followed by the subject and object of that verb, then the modifiers: an adjective, noun or preposition that modifies the subject or the object. The most popular framework is Universal Dependencies (link). Two of the most popular Universal Dependency English parsers are from Georgetown University (link) and Stanford (link). A small dependency-parse example follows below.
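As a small dependency-parse sketch (my own, using spaCy rather than the two parsers linked above; it assumes the en_core_web_sm model is installed):

```python
# Dependency parsing with spaCy: find the root verb, its subject and modifiers.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The sun shines brightly")
for token in doc:
    print(token.text, token.pos_, token.dep_, "->", token.head.text)
# "shines" is the root verb, "sun" its subject (nsubj), "brightly" a modifier (advmod)
```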

In addition to parsing sentences into phrases, we need to identify named entities such as city names, person names, etc. In general this subject is called Information Extraction (link), covering the whole pipeline from pre-processing, entity recognition, relation recognition and record linkage to knowledge generation. Recognising named entities is vital for chatbots in order to understand the intention. There are many approaches to Named Entity Recognition (NER), such as Naive Bayes, Decision Trees and Conditional Random Fields (see Sidharth Macherla’s article: link). There are many good libraries that we can use, such as NLTK, spaCy and Stanford. Susan Li wrote a good article on NER implementation: link.
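A minimal NER sketch with spaCy, one of the libraries mentioned above (my own example; the entity labels can vary between model versions):

```python
# Named Entity Recognition with spaCy's pre-trained English model.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Vincent flew to Paris on 8 June 2021 to see the Eiffel Tower.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Vincent PERSON, Paris GPE, 8 June 2021 DATE
```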

That is syntactic analysis in 2 minutes, in which we break sentences into phrases and words, and recognise each word as a verb, noun, etc. or a named entity. At this stage the machine still doesn’t understand the meaning of the sentence!

Semantic Analysis – The Trick

Now that we have parsed sentences into words and identified the named entities, the final step is to understand those words. This was the biggest learning point for me in NLP. Machines don’t understand the text word by word like humans do, but by converting each word into a numerical representation (called a vector) and then extracting the topic. The topic is the centre of those word vectors.

And that is the big “trick” in NLP. We can do all the lexical analysis and syntactic analysis we want, but in the end we need to convert the words into vectors, and the centre of those vectors is the meaning of those words (the topic). So the meaning is also a vector!

In the real world the vector representations have hundreds of dimensions; in the diagrams below I only use 2 dimensions, so they are massively over-simplified, but I hope they get my point across. In diagram A we have the sentence “Running is a sport”. Each word is a blue circle and the centre (the centroid) is the solid blue circle. The vector representing this centre is the blue arrow. This blue arrow is the “meaning”, which is just the bunch of numbers that make up that vector (in reality it’s hundreds of numbers).

In diagram B we have another sentence “He walks as an exercise”. Each word is a brown circle and the centre is the solid brown circle. The vector representing this centre is the brown arrow. That brown arrow is the “meaning” of that sentence. So the meaning is just a bunch of numbers that make up that brown arrow.

In diagram C we superimpose diagram A and diagram B, and in diagram D we remove the word vectors, leaving just the 2 meaning vectors. Now we can find out how close the meanings of the 2 sentences are, just by looking at how close these 2 vectors are.

Remember that in reality it’s not 2 dimensions but hundreds of dimensions. But you can clearly see the mechanism here. We convert sentences into numbers (vectors) and we compare the numbers. So the computer still doesn’t understand the sentences, but it can compare sentences.
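Here is a rough sketch of this idea with spaCy’s medium English model, which ships with word vectors: a Doc’s vector is the average of its word vectors (the centroid above), and similarity() is the cosine similarity between two such vectors. This is my own toy example, and the numbers will vary by model version:

```python
# Sentence "meaning" as the centroid of its word vectors, compared with cosine similarity.
# Assumes: python -m spacy download en_core_web_md (the small model has no real word vectors)
import spacy

nlp = spacy.load("en_core_web_md")
a = nlp("Running is a sport")
b = nlp("He walks as an exercise")
c = nlp("Interest rates rose at the bank")

print("a vs b:", a.similarity(b))  # two exercise-related sentences: expected to be closer
print("a vs c:", a.similarity(c))  # exercise vs banking: expected to be further apart
```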

Say we have a collection of sentences about cooking. We can represent each of these sentences as numbers/vectors. See the left diagram below. The blue circles are the sentences and the solid blue circle is the centre.

If we have a collection of sentences about banking, we can do the same. Ditto with holiday. Now we have 3 dots (or 3 arrows), each representing a different topic: one for cooking, one for banking, one for holiday. See the right diagram above.

Now if we have an input, like “I went to Paris and saw the Eiffel Tower”, the NLP model will be able to determine whether this input is about holiday, cooking or banking, without even knowing what holiday, cooking or banking are! Without even knowing what the Eiffel Tower and Paris are, or even what “went” and “saw” are. All it knows is that the vector for “I went to Paris and saw the Eiffel Tower” is closer to the holiday vector than to the cooking vector or the banking vector. Very smart!

That is the trick in understanding human languages. Convert the sentences into numbers and compare them!

Semantic Analysis – The Steps

Semantic means meaning. Semantic Analysis is about understanding the meaning. Now that we have an idea of how the semantic analysis trick is done, let’s go through the steps.

First we convert the words into vectors. There are 2 approaches for doing this:

  1. Frequency based
  2. Prediction based

In the frequency-based approach the basic assumption is that words which are used and occur in the same context (e.g. a document or a sentence) tend to have similar meanings. This principle is called Distributional Semantics (link). First we create a matrix containing the word counts per document, i.e. the occurrence frequency of each word. This matrix is called the Occurrence Matrix; the rows are the words and the columns are the documents. Then we reduce the number of rows in this matrix using Singular Value Decomposition (link). Each row in this final matrix is the word vector; it represents how that word is distributed across various documents. That is the vector for that word. Examples of the frequency-based approach are Latent Semantic Analysis (link) and Explicit Semantic Analysis (link).
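A compact sketch of the frequency-based route (roughly Latent Semantic Analysis, simplified and of my own making): build a count matrix and reduce it with truncated SVD so each document becomes a short topic vector:

```python
# Frequency-based vectors: count matrix + SVD (a simplified LSA).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the chef cooks pasta",
        "the chef bakes bread",
        "the bank approves the loan",
        "the bank sets the interest rate"]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(counts)   # each document reduced to 2 numbers
print(doc_vectors)                        # cooking documents land near each other, banking ones likewise
```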

The prediction-based approach uses a neural network to learn how words are related to each other. The input of the neural network is a word, represented as a one-hot vector (link), which means that all numbers are zero except one. There is only 1 hidden layer in the neural network, with hundreds of neurons. The output of the neural network is the context words, i.e. the words closest to the input word. For example: if the input word is “car”, the outputs are as shown below left, with the vector representing the word “car” on the right (source: link).

Examples of the prediction approach are Word2Vec from Google (link), GloVe from Stanford (link) and fastText from Facebook (link).
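Here is a tiny prediction-based example with gensim’s Word2Vec. The corpus is far too small to give meaningful vectors; it is only meant to show the API (the parameter values are illustrative):

```python
# Prediction-based word vectors with Word2Vec (skip-gram) on a toy corpus.
from gensim.models import Word2Vec

corpus = [["running", "is", "a", "sport"],
          ["he", "walks", "as", "an", "exercise"],
          ["she", "runs", "every", "morning", "as", "exercise"]]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["running"][:5])                  # first few numbers of the "running" vector
print(model.wv.similarity("running", "walks"))  # cosine similarity between two word vectors
```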

Once the words become vectors, we use cosine similarity (link) to find out how close the vectors are to each other. And that is how computers “understand” human language, i.e. by converting them into vectors and comparing them with other vectors.

Chatbot

I’m going to end this article with chatbots. A chatbot is a conversational engine/bot, which we can use to order tickets, book a hotel, talk to customer service, etc. We can build a chatbot using Rasa (link), IBM Watson (link), Amazon Lex (link) or Google Chat (link).

A chatbot has 2 components:

  • Natural Language Processing (NLP)
  • Dialogue Management

The Natural Language Processing part does Named Entity Recognition (NER) and intention classification. The NER part identifies named entities such as city names, person names, etc. The intention classification part detects the intention of the input sentence. For example, for a hotel booking chatbot the intentions can be greeting, finding hotels, specifying a location, specifying a price range, making a booking, etc.

The Dialogue Management part determines the response and the next step for each intent. For example, if the intention is a greeting, the response is saying “Hi, how can I help you?” and then waiting for an input. If the intention is “finding hotels”, the response is asking “In which location?” and then waiting for an input.

And that’s it. The intention classification uses the “trick” I explained in this article to understand human language. It converts the sentence into a vector and compares it with the list of intentions (which have been converted into vectors too). That’s how a chatbot “understands” what we are typing. And then it uses a series of “if-then-else” rules to output the correct response for each intention. Easy, isn’t it?
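For the curious, here is a very rough sketch of that intent “trick” (my own toy, not how Rasa or Watson work internally): embed the user input and one example sentence per intent, then pick the closest one.

```python
# Toy intent classification: compare the input vector with each intent's example vector.
# Assumes: python -m spacy download en_core_web_md
import spacy

nlp = spacy.load("en_core_web_md")
intent_examples = {
    "greeting": "hello there, good morning",
    "finding_hotels": "I am looking for a hotel in the city centre",
    "make_booking": "I would like to book a room for two nights",
}
user_input = nlp("hi, good afternoon")
best_intent = max(intent_examples,
                  key=lambda name: user_input.similarity(nlp(intent_examples[name])))
print(best_intent)   # most likely "greeting"
```

A real chatbot trains a classifier on many labelled examples per intent rather than comparing against a single sentence, but the underlying idea is the same: sentences become vectors, and the closest vector wins.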

No, from my experience it’s not easy. We need to prepare lots of examples to train the NLP model. For each intention we need to supply many examples. For each location we need to specify the other possible names, for example Madras for Chennai, Bengaluru for Bangalore and Delhi for New Delhi. And we need to provide a list of the cities that we are operating in. And we need to cover so many possible dialogue flows in the conversation. And then we need to run it sooo many times over and over again (and it could take 15-30 minutes per run!), each time correcting the mistakes.

It was very time consuming, but it was fun and very illuminating. Now I understand what’s going on behind the scenes when I’m talking to a chatbot on the internet, or talking to Alexa in my kitchen.
