Data Warehousing and Machine Learning

12 August 2021

RNN Applications

Filed under: Machine Learning — Vincent Rainardi @ 6:23 am

A Recurrent Neural Network is a machine learning architecture for processing sequential data, see my article here: link. The applications of this architecture are amazing, for example we can generate a song from a note, generate a poetry, a story or even a C code!

Here is a list of various amazing applications of RNN:

  1. Video classification: link
  2. Image classification: link
  3. Image captioning: link
  4. Sentiment analysis: link
  5. Language translation: link
  6. Making music: link
  7. Writing poem: link
  8. Writing code: link
  9. Generating text: link
  10. FX trading: link
  11. Stock market prediction: link
  12. Speech recognition: link
  13. Text to speech: link

RNN application is about a sequence of data. That sequence of data can be the input or it can be the output. That sequence of data can be a sequence of numbers, a sequence of musical notes, a sequence of words, or a sequence of images.

If the sequence of data is the output, then it becomes a creation. For example:

  • If the output is a sequence of notes, then the RNN is “writing music”.
  • If the output is a sequence of words, then the RNN is “writing a story”.
  • If the output is a sequence of share prices, then the RNN is “predicting share prices”.
  • If the output is a sequence of voices, then the RNN is “speaking”.
  • If the output is a sequence of colours, then the RNN is “painting”.

That is very powerful right? This is why AI in the last few years is really taking off. Because finally AI can create a song, a speech, a painting, a story, a poem, an article. Finally AI can predict a sequence of numbers. Not just one number but a series of numbers. That has very, very serious consequences. Imagine if that series of numbers is the temperature every hour in the next few days.

Imagine, if that series of numbers is stock prices in the next few weeks. Imagine if the prediction is accurate. It would turn the financial world upside down!

Three Categories of RNN

RNN applications can be categorised into 3:

  1. Classification
  2. Generation
  3. Encoder Decoder

Classification is about categorising a sequence of images or data into categories, for example:

  • Classifying films into action, drama or documentary
  • Classifying stock market movements into positive or negative trend
  • Classifying text into news, scientific or story

In classification the output is a single number.

Generation is about making a sequence of data based on another data, for example:

  • Making a sequence of musical notes e.g. a song.
  • Making a sequence of words e.g. a poem, Python code or a story.
  • Making a sequence of numbers e.g. predicting stock market.

For generation we need to have a “seed” i.e. data which the creation is based on. For example, when generating a sequence of words we need a word. When generating a sequence of musical notes we need a note.

Encoder Decoder consists of 2 parts. The first part (encoder) encodes the data into a vector. The second part (decoder) uses this vector to generate a sequence of data. For example: (image source: Greg Corrado)

The words in the incoming email is fed one by one as a sequence into an LSTM network, and encoded into a word vector representation. This vector (called thought vector) is then used to generate the reply, one word at a time. In the above example “Yes,” was generated first, and then that word is used to generate the second word “what’s”. Then these 2 words are used to generate the third word, and so on.

Using RNN to forecast stock prices

One of the popular techniques to forecast stock prices is RNN-based (the other is Reinforcement Learning). We can see the trend in Ref 1 below.

In a paper by Wenjie Lu, Jiazheng Li, Yifan Li, Aijun Sun, Jingyang Wang (see Ref 2 below), we can see that amongs the RNN based technique, the most accurate one is a combination of CNN and LSTM. The CNN is used to extract features in the stock price history, and LSTM is used to predict the future stock prices. The result is like this (Ref 2):

The stock price in this case is Shanghai Composite Index from 1/7/1991 to 31/8/2020. The last 500 days are used for test data, the rest as training data. They compared 6 methods: MLP, CNN, RNN, LSTM, CNN-RNN, CNN-LSTM and the result is as follows in terms of Mean Absolute Error (MAE):

Reference:

  1. Hu, Z.; Zhao, Y.; Khushi, M. A Survey of Forex and Stock Price Prediction Using Deep Learning. Appl. Syst. Innov. 2021, 4, 9. https://doi.org/10.3390/asi4010009.
  2. Wenjie Lu, Jiazheng Li, Yifan Li, Aijun Sun, Jingyang Wang, “A CNN-LSTM-Based Model to Forecast Stock Prices”, Complexity, vol. 2020, Article ID 6622927, 10 pages, 2020. https://doi.org/10.1155/2020/6622927.

7 August 2021

Recurrent Neural Network (RNN) and LSTM

Filed under: Machine Learning — Vincent Rainardi @ 4:16 pm

In machine learning, Recurrent Neural Network (RNN) is used to predict data that happens one after another. The data has a time element i.e. it has a sequence/order. For example: a video. A video is a series of images arranged in a particular order. Stock prices is also data of this kind, it happens day after day, in sequence (or second after second). Document is also a sequential data, the words are arranged in a particular sequence.

Let’s say we have a video of a dog running, and we try to classify whether the dog in the video jumps or not. So the output is say 100 frames of images, and the output is a binary number, 1 means jump and 0 means not jump, like below (image source: link).

The Recurrent Neural Network receives 100 images as input, one image at a time, in a particular order. And the output is binary number 1 or 0. So it’s a binary classification.

So that’s the input and output of RNN. The input is a sequence of images or numbers (or words), and the output is … well, there are a few different kinds actually:

  1. One output (like above) i.e. we just take the last output.
  2. Many output, i.e. we take the output on many different times.
  3. Generator, e.g. based on 1 note we generate a song.
  4. #1 above followed by #3 (called encoder-decoder), e.g. Gmail Smart Compose (link).

Architecture

In the early days the RNN architecture was similar to the neural network architecture. See below: input x1 (the first image or number) was fed through a series of neural network layers until we get output y1. Input x2 (the second image or number) was fed through a series of neural network layers until we get output y2, like this:

The difference to a normal neural network was the red arrows above, i.e. the values on the hidden layers from time slot 1 was fed to time slot 2. So each layers on time slot 2 received two inputs: x2 and the values of the hidden layers in time slot 1 (multiplied by some weights).

If we take just 1 node in a layer, we can show what happens in this node across time (below left). The node (s) receives input (x) producing output (h). The state of the node (s) is multiplied by a weight ( w ) and sent to itself (s).

The left diagram is simplified into the right diagram above, i.e. we only draw 1 copy, with a circular w from s pointing to itself. The right diagram is called the “rolled” version, and the left one is the “unrolled” version.

Note that in the diagram above the output is h not y, because it is the output of a node in a layer, not the final output of the last layer.

I saw the rolled versions of the RNN diagram above for the first time about 2 years ago and I no idea what it was. I hope you can understand it now.

Long Short Term Memory (LSTM)

These days no one use this original RNN architecture any more. Today everyone uses some variant of LSTM, which looks like this:

This architecture is called Long Short Term Memory because it is using many short term memory cells to create a long term memory (link), meaning: able to remember a long sequence of input, e.g. 5 years of historical stock data. The old RNN inherently has a problem with long sequences because of “vanishing gradient” problem (link). “Exploding gradient” problem (link) is not as big an issue because we can cap it (called “gradient clipping”).

Cell Memory

On the LSTM diagram, the horizontal line at the top (from ct-1 to ct) is the cell state.
It is the memory of the cell i.e. the short term memory. Along this line, there are 3 things happening: the cell state is multiplied by the “forget gate”, increased/reduced by the “input gate” and finally the value is taken to the “output gate”.

So what are these 3 gates? Let’s go through them one by one.

Forget Gate

The forget gate removes unwanted information from the cell state.
The value of σ is from 0 to 1. By varying the value of σ we can adjust how much information is removed from the cell state. The current input (xt) and the previous output (ht-1) are multiplied by σ.

So the impact of this forget gate to the cell state is:

where bf is the bias and Wf and Uf are the weights (link). The blue circle with cross is element wise multiplication.

Bear in mind that t is the current time slot and t-1 is the previous time slot. Notice that h and x have their own weights.

Input Gate

The input gate adds new information into the cell state. As we can see below, the current input (xt) and the previous output (ht-1) pass through a sigma gate and a tanh gate, multiplied then added to the cell memory line.

Here i controls the how much a influences c. The value of tanh is between -1 and +1 so a can decrease or increase c. And the amount of a’s influence to c is controlled by i.

So the impact of this input gate to the cell state is: (link)

Notice that h and x have their own weights, both for i and a.

Output Gate

The output (h) is taken from the cell state (c) using tanh function. The value of tanh is from -1 to +1 so it can make the cell state positive or negative. The amount of influence this tanh(c) has on h is controlled by o. O is calculated from the previous output (ht-1) and the current input (xt), each having different weights, using a sigma function.

So the output (h) is calculated like this:

The complete equation is on Wikipedia: link, which is from Hochreiter and Schidhuber’s original LSTM paper (link) and Gers, Schmidhuber, Cummins’ paper: Learning to forget (link).

A variant of LSTM is Gated Recurrent Unit (GRU). GRU does not have an output gate like LSTM. It has a reset gate and an update gate: link.

Reference:

  1. Wikipedia, RNN: link
  2. Wikipedia, LSTM: link
  3. Wikipedia, GRU: link
  4. Andrej Karpathy, The Unreasonable Effectiveness of RNN: link
  5. Michael Phi, Illustrated Guide to LSTM and GRU: link
  6. Christopher Olah, Understanding LSTM: link
  7. Gursewak Singh, Demystifying LSTM weights and bias dimensions, link
  8. Shipra Saxena, Introduction to LSTM: link
  9. Gu, Gulcehre, Paine, Hoffman, Pascanu, Improving Gating Mechanism in RNN: link
  10. Hochreiter and Schmidhuber, LSTM: link

Blog at WordPress.com.