Data Warehousing and Machine Learning

22 May 2021

Tokenisation

Filed under: Data Warehousing — Vincent Rainardi @ 7:38 am

One of my teachers once said to me: the best way to learn is by writing about it. It’s been 30 years and his words still ring true in my head.

One of the exciting subjects in machine learning is natural language (NL). There are 2 main subjects in NL: natural language processing (NLP) and natural language generation (NLG).

  • NLP is about processing and understanding human languages in the form of text or voice. For example: reading a book, an email or a tweet, or listening to people talking, singing or the radio.
  • NLG is about creating text or voice in human languages. For example: writing a poem or a news article, generating a voice that says some sentences, singing a song or producing a radio broadcast.

My article today is about one specific part of NLP. In NLP we have 3 levels of processing: lexical processing, syntactic processing and semantic processing.

  • Lexical processing is looking at a text without thinking about the grammar. We don’t differentiate whether a word is a noun or a verb; in other words, we don’t consider the role or position of that word in a sentence. For example, we break a text into paragraphs, paragraphs into sentences and sentences into words. We change each word to its root form, e.g. we change “talking”, “talked” and “talks” to “talk” (see the sketch after this list).
  • Syntactic processing is looking at a text to understand the role or function of each word. The meaning of a word depends on its role in the sentence: subject, predicate or object; a noun, a verb, an adverb or an adjective; present, past or future tense.
  • Semantic processing is trying to understand the meaning of the text. We try to understand the meaning of each word, each sentence, each paragraph and eventually the whole text.
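As a quick illustration of that root-form step, here is a minimal sketch using NLTK’s PorterStemmer (stemming is one way to do it; lemmatisation is another):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# reduce each word to its stem
print([stemmer.stem(w) for w in ["talking", "talked", "talks"]])
['talk', 'talk', 'talk']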

My article today is about one specific part of lexical processing, called tokenisation.

Tokenisation is the process of breaking a text into smaller pieces, for example breaking sentences into words. The sentence “Are you ok?” she asked can be tokenised into 5 words: are, you, ok, she, asked.
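A quick sketch of that example using NLTK (assuming NLTK is installed and the Punkt models have been downloaded with nltk.download('punkt')); we keep only the alphabetic tokens to drop the punctuation:

from nltk.tokenize import word_tokenize

text = '"Are you ok?" she asked'
# filter out punctuation tokens, keeping only words
words = [w for w in word_tokenize(text) if w.isalpha()]
print(words)
['Are', 'you', 'ok', 'she', 'asked']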

We can tokenise a text in various ways:

  • characters
  • words
  • sentences
  • lines
  • paragraphs
  • N-grams

N-gram tokenisation is about breaking a text into tokens with N characters in each token, so a 3-gram has 3 characters in each token.

For example: the word “learning” can be tokenised into 3-grams like this: lea, ear, arn, rni, nin, ing.
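A minimal sketch of this using the ngrams helper from NLTK (it works on any sequence, including a string of characters):

from nltk.util import ngrams

word = "learning"
# each gram is a tuple of characters, so join them back into strings
print(["".join(gram) for gram in ngrams(word, 3)])
['lea', 'ear', 'arn', 'rni', 'nin', 'ing']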

One of the most popular libraries in NLP is the Natural Language Toolkit (NLTK). In the NLTK library we have a few tokenisers: a word tokeniser, a sentence tokeniser, a tweet tokeniser and a regular expression tokeniser. Let’s go through them one by one.

Word Tokenizer

In NLTK we have a word tokeniser called word_tokenize. This tokeniser breaks text into words not only on spaces but also on apostrophes (for contractions), greater than, less than and brackets. Periods, commas and colons are tokenised as separate tokens.

Python code – print the text:

document = "I'll do it don't you worry. O'Connor'd go at 3 o'clock, can't go wrong. " \
         + "Amazon's delivery at 3:15, but it's nice'. A+B>5 but #2 is {red}, (green) and [blue], email: a@b.com" 
print(document)  
I'll do it don't you worry. O'Connor'd go at 3 o'clock, can't go wrong. Amazon's delivery at 3:15, but it's nice'. A+B>5 but #2 is {red}, (green) and [blue], email: a@b.com

Tokenise by splitting on whitespace:

words = document.split()
print(words)
["I'll", 'do', 'it', "don't", 'you', 'worry.', "O'Connor'd", 'go', 'at', '3', "o'clock,", "can't", 'go', 'wrong.', "Amazon's", 'delivery', 'at', '3:15,', 'but', "it's", "nice'.", 'A+B>5', 'but', '#2', 'is', '{red},', '(green)', 'and', '[blue],', 'email:', 'a@b.com']

Tokenise using word_tokenize from NLTK:

from nltk.tokenize import word_tokenize
words = word_tokenize(document)
print(words)
['I', "'ll", 'do', 'it', 'do', "n't", 'you', 'worry', '.', "O'Connor", "'d", 'go', 'at', '3', "o'clock", ',', 'ca', "n't", 'go', 'wrong', '.', 'Amazon', "'s", 'delivery', 'at', '3:15', ',', 'but', 'it', "'s", 'nice', "'", '.', 'A+B', '>', '5', 'but', '#', '2', 'is', '{', 'red', '}', ',', '(', 'green', ')', 'and', '[', 'blue', ']', ',', 'email', ':', 'a', '@', 'b.com']

We can see above that using spaces we get these:

I'll   don't   worry.   O'Connor'd   o'clock,   can't   Amazon's   3:15,   it's   A+B>5   #2   {red},   (green)   [blue],   email:   a@b.com

Whereas using word_tokenize from NLTK we get these:

I   'll   n't   worry   .   O'Connor   'd   o'clock   ,   ca   n't   Amazon   's   3:15   it   's   A+B   >   5   #   2   {  red  }  ,  (  green  )  [  blue  ]  ,   email   :   a   @   b.com

Notice that with NLTK the following become separate tokens, whereas splitting on spaces they do not:

'll  n't  .  ca  O'Connor  'd  o'clock  's  A+B  >  #  {}  ()  []  ,  :   @

Sentence Tokenizer

In NLTK we have a sentence tokeniser called sent_tokenize. This tokeniser breaks text into sentences not only on periods but also on ellipses, question marks and exclamation marks.

Python code – split on period:

document = "Oh... do you mind? Sit please. She said {...} go! So go."
words = document.split(".")
print(words)
['Oh', '', '', ' do you mind? Sit please', ' She said {', '', '', '} go! So go', '']

Using sent_tokenize from NLTK:

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document)
print(sentences)
['Oh... do you mind?', 'Sit please.', 'She said {...} go!', 'So go.']

Notice that NLTK breaks the text on periods (.), ellipses (…), question marks (?) and exclamation marks (!).

Also notice that if we split on periods we get a space at the beginning of some sentences; using NLTK we don’t.

Tweet Tokenizer

In NLTK we have a tweet tokeniser. We can use this tokeniser to break a tweet into tokens while keeping smileys, emojis and hashtags intact.

Python code – using NLTK word tokeniser:

document = "I watched it :) It was gr8 <3 😍 #bingewatching"
words = word_tokenize(document)
print(words)
['I', 'watched', 'it', ':', ')', 'It', 'was', 'gr8', '<', '3', '😍', '#', 'Netflix']

Using NLTK tweet tokeniser:

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
print(tknzr.tokenize(document))
['I', 'watched', 'it', ':)', 'It', 'was', 'gr8', '<3', '😍', '#Netflix']

Notice that using the tweet tokeniser we get smileys like :) and <3 and hashtags like #Netflix as single tokens, whereas using the word tokeniser the <, ) and # are split off.
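The tweet tokeniser also accepts a few optional parameters, for example strip_handles to remove @mentions and reduce_len to shorten runs of repeated characters. A quick sketch (the tweet text is just an illustration):

from nltk.tokenize import TweetTokenizer

# strip @handles and reduce e.g. "waaaaayyyy" to "waaayyy"
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tknzr.tokenize("@remy: This is waaaaayyyy too much for you!!!!!!"))
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']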

Regular Expression Tokenizer

In NLTK we have a regular expression tokeniser. We can use this tokeniser to extract from a text only the tokens that match a given regular expression pattern, such as hashtags or numbers.

Python code:

from nltk.tokenize import regexp_tokenize
document = "Watched it 3x in 2 weeks!! 10 episodes #TheCrown #Netflix"
hashtags = r"#[\w]+"   # a hash sign followed by word characters
numbers  = r"[0-9]+"   # one or more digits

print(regexp_tokenize(document, hashtags))
['#TheCrown', '#Netflix']

print(regexp_tokenize(document, numbers))
['3', '2', '10']

Notice that using the regular expression tokeniser we can extract hashtags and numbers. We can also use it to extract dates, email addresses or monetary amounts.
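For example, here is a minimal sketch of extracting email addresses; the pattern is a simplified illustration, not a fully RFC-compliant email regex, and the sample text is made up:

from nltk.tokenize import regexp_tokenize

document = "Contact a@b.com or sales@example.co.uk for details."
# simplified pattern: word characters around an @ sign, with a dotted domain
emails = r"[\w.+-]+@[\w-]+\.[\w.-]+"
print(regexp_tokenize(document, emails))
['a@b.com', 'sales@example.co.uk']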

 
