Data Warehousing and Data Science

6 April 2021

Natural Language Processing (NLP)

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:15 am

NLP is different to all other machine learning areas. Machine learning usually deals with mathematics, with numbers. It is about finding a pattern in the numbers, and make a prediction. The root of analysis is mathematical such as matrix, vectors, statistics, probability and calculus. But NLP is about words and sentences which is is very different.

We are now used to Alexa, Siri and Google able to understand us and answer us back in a conversation (5 years ago it wasn’t like that). When we type a reply to an email in Gmail or a message in Linked In we are now used to receiving suggestions about what we are going to type. And when we login to British Gas, or online banking or online retail shop we now find chat bots with whom we will be able have a useful conversation. Much better than 5 years ago. There is no doubt there has been a significant advancements in this area.

The processing of language, be it voice or text, are done in 3 levels. The bottom level is lexical analysis, where ML deals with each word in isolation. The middle level is syntax analysis, where ML analyses the words within the context of the sentence and the grammar. The top level is semantic analysis where ML tries to understand the meaning of the sentence.

To do lexical analysis we start with regular expression. We use regular expression to find words within a text, and to replace them with another words. Then we learn how to identify and remove stop words such as and, the, a which occur frequently but don’t provide useful information during lexical analysis. The third step is learning how to break the text into sentences and into words. And finally for each word we try to find the base word either using stemming, lemmatisation or soundex.

Stemming is a process of removing prefixes and suffixes like “ing” and “er” from “learning” and “learner” to get the base word which is learn. Lemmatisation is a processes of changing a word to its root, e.g. from “went” to “go”, and from “better”, “well”, “best” to “good”. Soundex is a 4-character code that represents the pronounciation of a word, rather than its spelling.

The syntax analysis is done by tagging each word as noun, verb, adjective, etc. (called “part of speech”). The tagging is done by parsing (breaking up) the sentences into groups of words (phrases), analysing the grammatical patterns, and considering the dependencies between words.

Semantic analysis is about understanding the meaning of the words and sentences by looking at the structure of the sentence and the word tagging. Words such as “Orange” can mean colour, a fruit or a area, and “Apple” can mean a fruit or a company, depending on the sentence. In semantic analysis we either assign predefined categories to a text (for example for sentiment analysis, for classifying messages or for chat bots) or pull out a specific information from a text (for example for extracting certain terms from IRS contracts, or other documents).

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at

%d bloggers like this: