Data Warehousing and Data Science

5 June 2021

The Trick in Understanding Human Language

Filed under: Machine Learning — Vincent Rainardi @ 10:08 am

I started learning Natural Language Processing (NLP) with such enthusiasm. There were 3 stages in NLP. The first stage is lexical analysis where the root words and phrases are identified, dealing with stop words and misspelling. The second stage is syntactic analysis where the nouns, verbs, etc. are identified and the grammar is analysed. The third stage is semantic analysis which is about understanding the meanings of the words.

So I thought, this is amazing! I knew computers now understand human languages, for example Alexa and chatbots. And I would be diving into that wonderful world, learning how it’s done. At the end of this process I would be able to create a chatbot that could understand human language. Cool!

I did build a chatbot that could “understand” human language, but disappointingly it doesn’t really understand it. A chatbot uses a “trick” to guess the meaning of our sentences, identifying the most probable intention. It outputs prepared responses and we do need to define which response for which input. So no it does not understand human language in the way I initially thought. We are still far away from having clever robots like in “I Robot” and “Ex Machina”.

In this article I’m writing that learning experience, hoping that it would enlighten those who have not entered the NLP world.

Lexical Analysis

Lexical analysis is about identifying words and phrases, and dealing with stop words and misspelling. I learned how to identify the base form of words, such as “play” in the word “playing” and “player”. This process is called stemming, where we identify rules such as removing “ing” and “ion” suffixes. For this we use regular expression.

The base form of “best” is “good”, which can’t be identified using stemming. For this we use lemmatisation, which is done using a combination of lookups and rules. Both are widely implemented using NLTK in Python, see Ivo’s Bernardo’s article on stemming (link) and Selva Prabhakaran’s article on lemmatisation (link).

But before that we will need to break text into paragraphs, sentences and words. This is called tokenisation. We deal with “she’d” and “didn’t” which are actually “she would” and “did not”. We deal with tokens which are not words, like dates, time e.g. “3:15”, symbols, email address, numbers, years, brackets. See my article on tokenisation here: link.

Then we need to deal with misspelling and for this we need to know how similar two words are using edit distance. Edit distance is the number of operations (like delete a letter, insert a letter, etc.) required to change one word into another.

A crude way of representing a text is using “bag of words”. First we remove the stop words such as the, in, a, is, etc. because stop words exist in every text so they don’t provide useful information. Then we construct a dictionary from the distinct list of words in the text. For every sentence we mark each word whether it exist in the dictionary or not. The result is that a sentence is now converted to a series of 1 and 0. A more sophisticated version uses the word frequency instead of 1 and 0, see: link. Either way in the end the sentences are converted into numbers.

Once a document is converted into numbers, we can run machine learning algorithms on it such as classification. For example we can classify whether an email / text is a spam or not.

That is, in 1 minute, Lexical Analysis 🙂 We can (crudely) represent a document as numbers and use this numerical representation to classify documents. But at this stage the machine doesn’t understand the documents, at all.

Syntactic Analysis

Syntactic Analysis is about breaking (or parsing) a sentence into phrases such as noun phrase, verb phrase, etc. and recognising them. We do this is because the meaning of a word (e.g. “play”) depends on whether it is a noun or a verb. This “noun”, “verb”, “adjective”, “preposition”, etc. are called “part of speech” or POS for short. So the first step is to identify the POS tag for each word.

There are many different approaches for doing POS tagging: supervised, unsupervised, rule based, stochastic, Conditional Random Fields, Hidden Markov Model, memory based learning, etc. Fahim Muhammad Hassan cataloged them in his thesis: link.

  • The Hidden Markov Model (HMM) is arguably the most popular, where the POS tag is determined based not only on the word, but also the POS tag of the previous word. Many have written about HMM and its implementation in Python. For introduction I recommend Raymond Kwok’s article (link) and for a formal lecture Ramon van Handel from Princeton University (link).
  • The best approach in terms of accuracy is the Recurrent Neural Network (RNN). RNN uses deep learning approach where the feedback is fed to the next stage. The most popular implementations are LTSM and GRU. Tanya Dayanand wrote a good short explanation here: link (notebook here).

Once we know the POS tags for each word, we can now parse or break a sentence into phrases (e.g. noun phrase, verb phrase, etc.) or into subject, modifier, object, etc. in order to understand them. The former is called constituency grammar and the latter is called dependency grammar.

  • Constituency grammar: the most popular method is Context Free Grammar (CFG, link), which specifies the rules of how words are grouped into phrases. For example, a noun phrase (e.g. “the sun”) may consist of a determinant (the) and a noun (sun). A sentence can consist of a noun phrase and a verb phrase, e.g. the sun shines brightly.
  • In dependency grammar we first identify the root verb, followed by the subject and object of that verb. Then the modifiers which are an adjective, noun or preposition that modifies the subject or the object. The most popular framework is the Universal Dependency (link). Two of the most popular Universal Dependency English parser is from Georgetown University (link) and Stamford (link).

In addition to parsing sentences into phrases, we need to identify named entities such as city name, person name, etc. In general this subject is called Information Extraction (link) covering the whole pipeline from pre-processing, entity recognition, relation recognition, record linkage and knowledge generation. Recognising named entities is vital for chatbots in order to understand the intention. There are many approaches in Named Entity Recognition (NER) such as Naive Bayes, Decision Trees and Conditional Random Field (see Sidharth Macherla’s article: link). There are many good libraries that we can use, such as NLTK, spaCy and Stanford. Susan Li wrote a good article on NER implementation: link.

That is syntactic analysis in 2 minutes, in which we break sentenses into phrases and words, and recognising each word as verb, noun, etc. or a named entity. At this stage the machine still doesn’t understand the meaning of the sentence!

Semantic Analysis – The Trick

Now that we have parsed sentenses into words and identified the named entities, the final step is to understand those words. This is the biggest learning point for me in NLP. Machines don’t understand the text word by word like human do, but by converting each word into numerical representation (called vectors) and then extract the topic. The topic is the centre of those word vectors.

And that is the big “trick” in NLP. We can do all the lexical analysis and syntactic analysis all we want, but in the end we need to convert the words into vectors, and the centre of those vectors is meaning of those words (the topic). So the meaning is also a vector!

In the real world the vector representations have hundred of dimensions and in the diagrams below I only use 2 dimensions so they are massively over simplified but I hope they can convey my point across. In diagram A we have a sentence “Running is a sport”. Each word is a blue circle and the centre (the centroid) is the solid blue circle. The vector representing this centre is the blue arrow. This blue arrow is the “meaning” which is just a bunch of numbers that make up that vectors (in reality it’s hundred of numbers).

In diagram B we have another sentence “He walks as an exercise”. Each word is a brown circle and the centre is the solid brown circle. The vector representing this centre is the brown arrow. That brown arrow is the “meaning” of that sentence. So the meaning is just a bunch of numbers that make up that brown arrow.

In diagram C we superimpose diagram A and diagram B, and in diagram D we remove the word vectors, leaving just the 2 meaning vectors. Now we can find out how close the meanings of the 2 sentences are, just by looking at how close these 2 vectors are.

Remember that in reality it’s not 2 dimensions but hundreds of dimensions. But you can clearly see the mechanism here. We convert sentences into numbers (vectors) and we compare the numbers. So the computer still don’t understand the sentences, but it can compare sentences.

Say we have a collection of sentences about cooking. We can represent each of these sentences as numbers/vectors. See the left diagram below. The blue circles are the sentences and the solid blue circle is the centre.

If we have a collection sentences about banking, we can do the same. Ditto with holiday. Now we have 3 blue dots (or 3 blue arrows), each representing different topic. One for cooking, one for banking, one for holiday. See the right diagram above.

Now if we have an input, like “I went to Paris and saw Eiffle tower”, the NLP will be able to determine whether this input is about holiday, cooking or banking, without even knowing what a holiday, cooking or banking are! Without even knowing what Eiffle tower and Paris are. Or even what “went” and “saw” are. All it knows is that the vector for “I went to Paris and saw Eiffle tower” is closer to the holiday vector than to the cooking vector or the banking vector. Very smart!

That is the trick in understanding human languages. Convert the sentences into numbers and compare them!

Semantic Analysis – The Steps

Semantic means meaning. Semantic Analysis is about understanding the meaning. Now that we have an idea of how semantic analysis trick is done, let’s understand the steps.

First we convert the words into vectors. There are 2 approaches for doing this:

  1. Frequency based
  2. Prediction based

In the frequency based approach the basic assumption is: words which are used and occur in the same context (e.g. a document or a sentence) tend to have similar meanings. This principle is called Distributional Semantics (link). First we create a matrix containing the word counts per document i.e. the occurance frequency of each word. This matrix is called Occurance Matrix, the rows are the words and the columns are the documents. Then we reduce the number of rows in this matrix using Singular Value Decomposition (link). Each row in this final matrix is the word vector, it represents how that word is distributed in various document. That is the vector for that word. Examples of frequency based approach are: Latent Semantic Analysis (link) and Explicit Semantic Analysis (link).

The prediction based approach uses neural network to learn how words are related to each other. The input of the neural network is the word, represented as a one-hot vector (link), which means that all numbers are zero except one. There is only 1 hidden layer in the neural network, with hundreds of neuron. The output of the neural network is the context words, i.e. the words closest to the input words. For example: if the input word is “car”, the outputs are like below left, with the vector representing the word “car” on the right (source: link).

Examples of the prediction approach are Word2Vec from Google (link), GloVe from Stanford (link) and fastText from Facebook (link).

Once the words become vectors, we use cosine similarity (link) to find out how close the vectors are to each other. And that is how computers “understand” human language, i.e. by converting them into vectors and comparing them with other vectors.


I’m going to end this artcle with chatbot. A chatbot is a conversational engine/bot, which we can use to order tickets, book a hotel, talk to customer service, etc. We can build a chatbot using Rasa (link), IBM Watson (link), Amazon Lex (link) or Google Chat (link).

A chatbot has 2 components:

  • Natural Language Processing (NLP)
  • Dialogue Management

The Natural Language Processing part does Named Entity Recognition (NER) and intention classification. The NER part identifies named entities such as city name, person name, etc. The intention classification part detects what is the intention in the input sentence. For example for a hotel booking chatbot the intention can be greeting, finding hotels, specify location, specify price range, make a booking, etc.

The Dialogue Management part determine what is the response and next step for each intent. For example, if the intention is greeting the response is saying “Hi how can I help you” then wait for an input. If the intension is “finding hotels” the response is asking “In which location” then wait for an input.

And that’s it. The intention classification uses the “trick” I explained in this article to understand human language. It converts the sentence into vector and compare it with the list of intentions (which have been converted into vectors too). That’s how a chatbot “understand” what we are typing. And then it uses a series of “if-then-else” to output the correct response for each intension. Easy isn’t it?

No, from my experience it’s not easy. We need to prepare lots of examples to train the NLP. For each intention we need to supply many examples. For each location we need to specify the other possible names. For example: Madras for Chennai, Bengaluru for Bengalore and Delhi for New Delhi. And we need to provide a list of cities that we are operating in. And we need to cover so many possible dialogue flows in the conversation. And then we need to run it sooo many times over and over again (and it could take 15-30 minutes per run!), each time correcting the mistakes.

It was very time consuming but it was fun and very illuminating. Now I understand what’s going on behind the scene when I’m talking to a chatbot on the internet, or talking to Alex in my kitchen.

1 Comment »

  1. I’m working through your recent posts and they are excellent.
    Just wanted to indicate how much they are appreciated.

    Comment by michaeldroy — 4 July 2021 @ 5:37 pm | Reply

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at

%d bloggers like this: