Data Warehousing and Data Science

5 October 2021

Reinforcement Learning

Filed under: Machine Learning — Vincent Rainardi @ 7:28 am

Reinforcement Learning (RL) is a branch of machine learning where we have an agent and environment, and the agent is learning from the environment by trying out different actions from each state to find out which action gives the best reward.

Examples of RL is self driving car, games (e.g. chess, Go, DOTA2), portfolio management, chemical plant, traffic control system. In portfolio management, the agent is the Algo trading robot, the state is the holdings in the portfolio, actions are buying and selling stocks, and the reward is the profit made.

It is probably best to illustrate this using Grid World Game (link). Imagine a robot walking from the start box to the finish box, avoiding the X boxes:

Here’s the rule of the game:

  1. In this case the agent is the robot. Let’s call it R. The state is R’s position, i.e. A1, B3, etc. R’s initial state is A1. The goal is for R to get to D4. The episode finishes when R gets to D4, or when R hits the X1 or X2 box.
  2. The reward is: every time R moves it gets -1, if R reaches D4 it gets +20 and if R hits X1 or X2 it gets -10. The action is R’s movement. From A1 if R moves to the right the next state is A2, if R moves down the next state is B1.
  3. If R hits the perimeter wall the next state is the same as the current state. For example from A1 if R moves to the left or upwards the next state is A1.
  4. There are 4 possible actions from each state: move right, move left, move up, move down.
  5. There are 13 possible states (R possible locations), i.e. 4×4 = 16 minus X1, X2 and Finish.
  6. Example of an episode: A1, A2, A3, B2, C2, D2, Finish. The reward is 6x-1 + 20 = 14.
    Another example: A1, B1, X1. The reward is 2x-1 – 10 = -12.

These are the 2 basic equations in RL: (they are called Bellman equations, link)

  • The first one means: the reward for a particular state is the sum of the rewards from all possible actions in that state, weighted by the probability of those actions.
    v is the value of state s, π is the policy and q is the value of action a in state s.
  • The second one means: the reward for a particular action in a particular state is the sum of (the immediate reward and all the future rewards) from that state, over the model probabilities.
    p is the environment model, r is the immediate reward, γ is the discount factor (how much we value future reward), and v is the value of the next state.

The environment model, denoted by p(s’,r|s,a) in the second equation above, takes in the state and the action, and it gives back the reward and the next state. For example, we put in state = A1 and action = go right (that’s the s and the a), the output of the environment model is reward = -1 and next state = A2 (that’s the r and the s’).

A policy is the probability of choosing actions in a state. For example: in A1, policy1 says that the probability of going right = 40%, going left = 10%, up = 10%, down = 40%. Whereas policy2 says that in A1 the probability of taking those 4 actions have different percentages. A policy defines the actions for every single state, not just 1 state.

The goal is to choose the best policy which has the best action for every state. The best action is the one is the highest total reward, not just the immediate reward but all future reward as well (over an episode). If an action gives us a high reward now, but over the next view steps it gives low rewards, then the total reward would be low.


Training an RL model means:

Step 1. First we initialise every state with an action. Say we initialise all states with action = Up, like this:

Step 2. We start from A1. We calculate the reward for all 4 actions in A1, then choose the best action (say right). So the state is now A2. We calculate the reward for all 4 actions in A2, and choose the best action again. We do this until the episode finishes, i.e. we arrive at Finish or we hit x1 or x2.

For example, this is what we end up with in Episode 1, i.e. we took the yellow path. The reward is 5x-1 – 10 = -15.

Step 3. We do another episode. It is important that we must explore other possibilities, so not always choose the best action, but deliberately take a random action. For example, in this episode we go down from A1, then left on B1 (so the next state is still B1), then right on B1 and hit the x1 box. The reward is 3 x -1 – 10 = -13.

Step 4. We don’t want to keep exploring forever, so as time goes by we explore less and less, and exploit more and more. Explore means we choose an action randomly. Whereas exploit means we take the best action for that state.

We use a hyperparameter called epsilon to do this (ε). We start with ε = 1 and slowly decreasing it to 0. In every move we generate a random number. If this random number is lower than ε then we explore, but if it is higher than ε then we explore.

So in the initial episode our score is low but after a while our score will be high. Score is the total reward per episode. The maximum score is 14, i.e. when you directly to Finish in the shortest possible way. The worst score is a big negative number i.e. if you keep going around in circle endlessly. Remember that every time you move you get -1. This is to motivate the robot to go to the Finish box as soon as possible.

So if you play say 1000 episodes, the score would be something like this:

We can see that in the beginning until episode 200 the score is constantly increasing. This is because at the start we set the ε to 1 (fully exploring) and we tapering it slowly to 0.9, 0.8, 0.7, etc until it reaches 0 at episode 200. This is called epsilon decay.

From episode 200 the score is “in a band”, i.e. the value is in a certain range. We say that the score is “converged”. In this case it is between 8 and 14. From episode 300 the band is narrower to 9-14. From episode 500 it is 10-14 and from episode 700 to 1000 the score is 11-14. The band is smaller and smaller because after episode 300 the RL model doesn’t explore any more. It only exploits, i.e. taking the best possible actions. And it is still learning, resulting in the increase of score as time goes by.

Model Free

One of the most important thing to remember in RL is that in most cases we don’t have the model of the environment. So we need to use a neural network to estimate the value of q (the reward for an action in a state). The neural network is called the Q network. And because it consist of many layers, it is called Deep Q Network, or DQN. Like this:

The input of DQN is a state vector (one hot encoding) and the action vector (also one hot encoding).
For example, assuming we have 3 states (box A1, A2, A3) and 2 actions (left and right):

  • For state A1 the state vector is [1  0  0]
  • For state A2 the state vector is [0  1  0]
  • For state A3 the state vector is [0  0  1]
  • For action Go Left the action vector is [1  0]
  • For action Go Right the action vector is [0  1]

Because we don’t have the environment model, we generate the data (called “experience”) using DQN. We put in the state and the action, and get the Q value (the reward). We do this many many times and save those “experience” into memory (called Replay Buffer). After we have many experiences in the Replay Buffer (say 30,000) we pick a batch of say 100 experience randomly and use this data for training the network.

Three Architecture of DQN

There are 3 architecture of DQN. The first one is the one in the previous diagram (I draw it here again for clarity). It takes in the state and action vectors as input, and the output is the Q value (the reward) for that state and action. For example: from state A1 [1  0  0] take action = Go Right [0  1], and the reward is -1. So we feed in [1  0  0  0  1], i.e. concatenation of state and action vectors, and get -1 as the output.

The second architecture takes in the state vector as input, and the output is the Q value (the reward) for each action. For example, the input is A1 and the output is -1 for Go Left and -1 for Go Right. The advantage of using this architecture is we only need to feed each state once into the network, whereas the first architecture we need to feed each state and action combination into the network.

The problem with both of the above architectures is that we use the same network for calculating the expected Q values and for predicting the Q values. This makes the system unstable because as the network weights are updated in each step, both the expected Q and the predicted Q changes.

This problem is solved by the third DQN architecture which is called Double DQN. In this architecture we use 2 networks, one for predicting the Q value and one for calculating the expected Q value. The former is called the Main network and the latter is called the Target network. The weight of the Main network is updated at every step, whereas the weight of the Target network is updated at every episode. This makes the expected Q values (the target Q values) stable through out the episode.

We only train the Main network. The weight of the Target network is not updated using back propagation, but by copying the weight of the Main network. This way at the beginning of every episode the Target network is the same as the Main network, and we use it to calculate the target/expected Q values.

Portfolio Management

in portfolio management if we have 50 holdings in the portfolio, and investment universe is 500 stocks, then there are 501 possible actions i.e. sell any of the 50 holdings, or buy any of the 450 stocks that we are not currently holding, or do nothing. And what are the state? The 50 holdings, which could be any possible combination of the 500 stock in the investment universe – that’s a lot of states!

And we are not limiated to buy or sell just 1 stock. We can buy several stocks. We can buy 2 stocks or sell 2 stocks, or 3 or 4!

And we have not factored in the price. In real live the action is not “buy stock X” but “buy stock X and price P”. In this case the state is all the possible combination of 50 out of 500, at various different prices. That’s really a lot of states! So it is very resource intenstive.


Reinforcement Learning (RL) is probably the most complicated machine learning model. It uses deep neural networks, we have to generate the data (the experience) and it is resource intensive especially when we have many states and many actions.

But it is the one we use for for many things these days. We use it for robotics (check out Boston Dynamics), for self driving cars, for playing computer games, and for Algo trading (portfolio management). Checkout OpenAI playing DOTA 2: link. Guess how many CPUs they use? 128,000 CPUs plus 256 GPUs!

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at

%d bloggers like this: