Data Warehousing and Data Science

24 December 2021

Using Reinforcement Learning to Manage Portfolio Allocation

Filed under: Data Science,Data Warehousing — Vincent Rainardi @ 9:03 am

I am a firm believer in the iterative approach, rather than big bang. To create something complex we need to build it incrementally. For my master’s dissertation I would like to use machine learning to manage investment portfolios. Last week I described the “lay of the land” of this topic in this article: link. That made me realise 3 things:

  1. That the topic is very large, from predicting stock prices to managing risk, from managing stock composition to cryptocurrencies.
  2. That what I’d like to do is manage the portfolio allocation. In terms of assets I would prefer stocks, rather than fixed income or cryptocurrencies.
  3. That the best approach for this is using Reinforcement Learning (Q network).

Problem Statement

So as the first step, I would like to simply use a Q network to decide which portfolio allocation would be best in terms of maximising return. So the reward is the return (current market price minus the purchase cost). The environment is 4 stocks in different industry sectors:

  • 1 in financial industry: JP Morgan, symbol: JPM
  • 1 in retail industry: Home Depot, symbol: HD
  • 1 in commodity industry: Exxon Mobil, symbol: XOM
  • 1 in healthcare industry: Pfizer, symbol: PFE

All from the same country, i.e. the US. The action is to choose the composition of these 4 stocks in the portfolio, plus cash. To simplify things, the composition must be: one holding (one of the 4 stocks, or cash) = 40% weight, and each of the other four = 15% weight. So there are only 5 possible compositions, and therefore only 5 possible actions to take.
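The five compositions can be enumerated in a few lines. This is only a minimal sketch: the names `ASSETS` and `make_compositions` are illustrative, and the numbering of the compositions is an assumption, except that composition 1 is taken to be the cash-heavy one, as in the valuation example later in this post.

```python
# Cash first, so that C1 is the 40%-cash composition used in the worked example.
# The order of the four stocks after CASH is an assumption.
ASSETS = ["CASH", "JPM", "HD", "XOM", "PFE"]

def make_compositions(assets=ASSETS, heavy=0.40, light=0.15):
    """Return one weight dictionary per composition: one asset at 40%, the rest at 15%."""
    compositions = {}
    for i, heavy_asset in enumerate(assets, start=1):
        compositions[f"C{i}"] = {
            asset: (heavy if asset == heavy_asset else light) for asset in assets
        }
    return compositions

COMPOSITIONS = make_compositions()
for name, weights in COMPOSITIONS.items():
    print(name, weights)
```

Each composition sums to 100% (40% + 4 x 15%).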

Every working day, the agent must decide which action it wants to take, i.e. which composition it wants to use. Then the reward is calculated by comparing the valuation of the portfolio at the end of the day to the previous day, minus the transaction cost and the holding cost. The portfolio valuation is obtained by summing the valuation of the 4 stocks (held quantity x today’s closing price) plus the cash. The transaction cost is $10 per trade. The holding cost is 0.005% of the portfolio value per day, including weekends.

I will be using 4 years of price data, from 19th Dec 2016 to 18th Dec 2020, to train the model, and 19th Jan 2021 to 18th Dec 2021 to test it. Note that stock prices are only available Monday to Friday, excluding US public holidays. All 5 prices (open, close, high, low, adjusted close) will be fed into the Q network, plus the daily volume.

The Environment

In Reinforcement Learning we have an environment and an agent. To specify the environment we need to define 6 things:

  1. The state space i.e. a list of all the possible states
  2. The action space i.e. a list of all possible actions
  3. The reward for doing an action from a particular state
  4. The next state after doing an action
  5. An episode, and how many time steps in an episode
  6. How the environment will be reset at the beginning of an episode

State Space: The state is the current state of the portfolio on a particular date, i.e. composition 1 to 5 (C1 to C5). In the beginning, before any trade is made, the portfolio consists of 100% cash (trade means buying or selling a stock). This beginning state is called C0.

Action Space: The action that the agent can take is buying or selling stocks so that the portfolio is in a certain composition (from composition 1 to 5). So there are 5 actions. Let’s name these 5 actions as A1 to A5. If the agent does action A2, then the next state is C2, because the portfolio will be in composition 2. If the agent does action A3, then the next state will be C3. And so on. Of course the agent can also do nothing (let’s call it A0), in this case the next state is the same as the previous state.
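The state transition described above is simple enough to write down directly. A minimal sketch, where the function name `next_state` and the string labels are illustrative choices:

```python
STATES = ["C0", "C1", "C2", "C3", "C4", "C5"]   # C0 = 100% cash, before any trade
ACTIONS = ["A0", "A1", "A2", "A3", "A4", "A5"]  # A0 = do nothing

def next_state(state: str, action: str) -> str:
    """A0 keeps the current state; action Ak moves the portfolio to composition Ck."""
    if state not in STATES or action not in ACTIONS:
        raise ValueError(f"unknown state or action: {state}, {action}")
    if action == "A0":
        return state
    return "C" + action[1:]

print(next_state("C0", "A2"))  # C2
print(next_state("C3", "A0"))  # C3
```

Note that the next state depends only on the action, unless the agent does nothing.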

Episode: One episode in this case is 30 trading days. So at the beginning of an episode, a date within the training data will be randomly chosen as the start date. For example, the start date = 17th June 2018. Then every trading day the agent would take an action. A “trading day” means a day when the US stock markets are open. 17th June is a Sunday, not a trading day, so it starts on 18th June 2018, like this:

2018-06-17 Sunday, No action
2018-06-18 Monday, Action = A2
2018-06-19 Tuesday, Action = A3
2018-06-20 Wednesday, Action = A5
… and so on until
2018-07-26 Thursday, Action = A4
2018-07-27 Friday, Action = A5
2018-07-30 Monday, Action = A1

In the above, the actions are just examples. Every day the agent determines which action to take, from A1 to A5 (or A0 to do nothing). The agent can only take 1 action per day, i.e. at the beginning of every day.

Note that US public holidays are not trading days. So for example, 25th Dec 2018 (Tuesday) is Christmas day, so no action.
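To make the episode window concrete, here is a small sketch that lists the 30 trading days starting from a given date. The holiday set is a hand-written assumption covering only this example window (Independence Day 2018); in practice the trading days would simply be the dates present in the price data.

```python
from datetime import date, timedelta

# Assumption: a hand-maintained set of US market holidays.
# Only 4th July 2018 falls inside this example window.
US_MARKET_HOLIDAYS = {date(2018, 7, 4)}

def episode_trading_days(start, n_days=30, holidays=US_MARKET_HOLIDAYS):
    """Return the first n_days trading days on or after start."""
    days = []
    current = start
    while len(days) < n_days:
        # weekday() is 0..4 for Monday..Friday
        if current.weekday() < 5 and current not in holidays:
            days.append(current)
        current += timedelta(days=1)
    return days

days = episode_trading_days(date(2018, 6, 17))
print(days[0], days[-1], len(days))
```

Starting from Sunday 17th June 2018, the first trading day is 18th June and the 30th is 30th July 2018, which matches the listing above.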

Reward: The portfolio valuation is calculated as the valuation of all the stocks, plus cash. The reward is calculated based on the profit for that day, i.e. the portfolio value at the end of the day, minus the portfolio value at the start of the day.

Beginning of an episode: At the beginning of an episode the portfolio consists entirely of cash. This is an additional state, in addition to C1 to C5 defined above. So we have 6 states in total: C0 to C5.

Portfolio Valuation

At this initial state we need to define how much cash is in the portfolio. Let’s define that as USD $1 million. So on that first day in the episode (say Sunday 17th June 2018), the value of the portfolio was $1 million.

Let’s say that the next day, Monday 18th June 2018, the agent decided to take action A1, which brings the portfolio to state C1. So on that Monday morning the portfolio consisted of: 40% cash, 15% JPM, 15% HD, 15% XOM and 15% PFE. The value of the portfolio at the beginning of that Monday 18th June 2018 was the sum of the 40% cash and the initial value of the holdings (i.e. the 4 stocks):

  • The value of 40% cash = $400,000
  • 15% ($150,000) to buy JPM stock. Opening price: 107.260002. Quantity: 1398.470979
  • 15% ($150,000) to buy HD stock. Opening price: 198.940002. Quantity: 753.9961722
  • 15% ($150,000) to buy XOM stock. Opening price: 80.400002. Quantity: 1865.671595
  • 15% ($150,000) to buy PFE stock. Opening price: about 34.25. Quantity: about 4,379.50

In the above, the prices are from the stock market data. The quantity held is simply calculated as the money to buy the stock divided by the opening price. The value of the portfolio at the end of that Monday 18th June 2018 is the sum of the 40% cash and the value of the 4 stocks (based on the quantity held):

  • 40% cash = $400,000
  • JPM: Closing price: 108.180000. Value = 151,286.59
  • HD: Closing price: 200.690002. Value = 151,319.49
  • XOM: Closing price: 80.82. Value = 150,783.58
  • PFE: Closing price: 34.278938. Value = 150,124.55
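The arithmetic behind these figures can be checked with a short script. This sketch covers three of the four stocks, using the opening and closing prices quoted above; the names `BUDGET_PER_STOCK` and `buy_and_value` are illustrative.

```python
BUDGET_PER_STOCK = 150_000.0  # 15% of the $1 million portfolio

# (opening price, closing price) on 18 June 2018, as quoted above
PRICES = {
    "JPM": (107.260002, 108.180000),
    "HD": (198.940002, 200.690002),
    "XOM": (80.400002, 80.82),
}

def buy_and_value(budget, open_price, close_price):
    """Buy at the open with the whole budget, then value the holding at the close."""
    quantity = budget / open_price
    return quantity, quantity * close_price

for symbol, (open_price, close_price) in PRICES.items():
    quantity, value = buy_and_value(BUDGET_PER_STOCK, open_price, close_price)
    print(f"{symbol}: quantity={quantity:.6f}, closing value={value:,.2f}")
```

Fractional quantities are allowed here, as in the figures above; a real broker might only allow whole shares.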

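We also need to subtract the transaction cost, which is $10 per trade, and the holding cost (the cost we pay to the investment platform for keeping our holdings), which is 0.005% of the portfolio value per day:

  • Portfolio value before costs = $1,003,514.21
  • Transaction cost: 4 trades x $10 = $40.00
  • Holding cost: 0.005% x $1,003,514.21 = $50.18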

So after 1 day of trading, 18th June 2018, after the agent decided to take action A1, the portfolio value is $1,003,424.03. So the profit is $3,424.03.
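The net figure can be reproduced from the closing values listed above. This sketch makes three assumptions: 4 trades were made (one purchase per stock), the holding cost is 0.005% per day as in the problem statement, and the holding cost is charged on the end-of-day value before costs. With those assumptions the result agrees with the 1,003,424.03 figure. `Decimal` is used to avoid floating point rounding in money amounts.

```python
from decimal import Decimal, ROUND_HALF_UP

START_VALUE = Decimal("1000000")
CASH = Decimal("400000")
CLOSING_VALUES = {            # end-of-day values quoted above
    "JPM": Decimal("151286.59"),
    "HD": Decimal("151319.49"),
    "XOM": Decimal("150783.58"),
    "PFE": Decimal("150124.55"),
}
COST_PER_TRADE = Decimal("10")
HOLDING_RATE = Decimal("0.00005")  # 0.005% per day (assumption, see above)

def end_of_day_value(cash, stock_values, n_trades):
    """Cash plus stock values, minus transaction and holding costs."""
    gross = cash + sum(stock_values.values(), Decimal("0"))
    transaction_cost = COST_PER_TRADE * n_trades
    holding_cost = (gross * HOLDING_RATE).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return gross - transaction_cost - holding_cost

net_value = end_of_day_value(CASH, CLOSING_VALUES, n_trades=4)
reward = net_value - START_VALUE
print(net_value, reward)  # 1003424.03 3424.03
```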

The reward

The reward is the profit for that day, i.e. the portfolio value at the end of the day (after the transaction cost and the holding cost), minus the portfolio value at the start of the day. So in this case the reward is $3,424.03. Note that the reward can be negative, i.e. if on that day the value of the portfolio at the end of the day is lower than at the start of the day.

For each trading day there will be a reward, which is added to the rewards of the previous days to make the cumulative reward. Every day we calculate the cumulative reward.

Episodes and Total Score

An episode is 30 trading days. At the end of the episode (30th July 2018) the cumulative reward is the “total score” for the agent.

Then another episode is chosen and the environment is reset. The 30 days of trading begin, and the reward and cumulative reward are calculated every day. And at the end of the episode, the total score is obtained.

And so the agent keeps learning, episode by episode, each time adjusting the weights of the neural network within the agent. In the early episodes, we expect the total score to be low because the agent is still learning. But after many episodes, we expect the total score to be consistently higher, provided the agent has learned patterns in the stock prices that indicate which stock would be the most profitable to invest in.

Generating Experience

So far so good. But the state space is not actually just the current portfolio holdings/composition. The state space also includes the current prices and the historical prices. Not only the prices of the holdings, but also of the stocks not held (because they also determine what should be held).

And, in reality, the stocks in the holdings are not just 4. There are 40 to 100 stocks, depending on the size and the policy of the fund. And the investment universe (out of which we choose the stock to hold) is about 500 to 1000 stocks.

So obviously, we can’t keep a table mapping every state and action to its “value” (the net profit for today), because there are far too many combinations of states and actions (millions).

In Reinforcement Learning this problem is solved by approximating the value using a neural network. We don’t record all those combinations of historical prices (states) and stock allocations (actions) and their values (today’s profit). Instead, we train a neural network to learn the relationship between the states, actions and values. Then we use it to approximate the value for a given state and action.

In Reinforcement Learning, we generate the experience using a Q network. Generating an experience means that, at each step, the system chooses between exploration and exploitation. This is called the “Epsilon Greedy” algorithm.

  1. Set the epsilon (ε), which is the boundary between exploration and exploitation.
  2. Generate a random number.
  3. If the number is less than epsilon, choose a random action (exploration).
  4. If the number is greater than or equal to epsilon, choose the best action (exploitation).
    The best action is the action with the highest estimated value (Q value).
  5. Calculate the reward.
  6. Determine the next state.
  7. Store the state, the action, the reward and the next state.
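The steps above can be sketched as follows. This is a minimal illustration, not the final implementation: in a Q network the `q_values` would come from the neural network, whereas here they are a plain list of hypothetical numbers, and the names `choose_action`, `record` and `experience` are illustrative.

```python
import random
from collections import deque

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Exploration: pick any action index at random
        return rng.randrange(len(q_values))
    # Exploitation: pick the action index with the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Step 7: store (state, action, reward, next state) tuples.
# The maximum size is an arbitrary choice for this sketch.
experience = deque(maxlen=10_000)

def record(state, action, reward, next_state):
    experience.append((state, action, reward, next_state))

# Hypothetical value estimates for actions A0..A5 (placeholders, not real results)
q_values = [0.1, 0.5, 0.2, 0.0, 0.3, 0.4]
greedy = choose_action(q_values, epsilon=0.0)   # always exploits
print(greedy)  # 1, the index of the largest estimate
```

A common design choice is to start with a high epsilon (mostly exploration) and reduce it as training progresses, so that the agent relies more on what it has learned.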

So that’s the topic for the next article, i.e. how to use a neural network to approximate the value for a given state and action. The state in this case is the historical prices, and the action is the portfolio composition (or stock allocation). The value is the profit or gain on a particular day.

So the problem statement becomes: given all the historical prices, what is the best portfolio composition for today? And the value of that portfolio composition is: the net profit we make today.

Once we are able to create a neural network which can answer the above question, then we’ll create the second neural network to do the Reinforcement Learning, using Action and Reward, using Environment and Agent. This second NN will be learning how to optimise a portfolio, i.e. what stocks should be held in order to maximise the return during a 30-day period (one episode).

