Data Warehousing and Data Science

18 October 2021

Why Data Lake?

Filed under: Data Warehousing — Vincent Rainardi @ 6:55 am

I’ve been building data warehouses for 15 years and my typical architecture is like this:

My main focus has been building a data warehouse in a database in dimensional model format (Kimball star schema). The database is usually SQL Server, but also Oracle and Teradata. This works well because a) the warehouse is queryable with SQL, b) the data is integrated, and c) the star schema is easy to use and performant. The front end varied from company to company, e.g. Reporting Services, Excel, Tableau, Power BI, Qlik, Cognos and Business Objects, and users also query the warehouse directly. Occasionally I built OLAP cubes (mostly Analysis Services but also Hyperion) and users accessed them from Excel. For 15 years the name of the game was “Business Intelligence” (BI); that’s what data was used for.

In the last few years I’ve been doing Machine Learning, and it mostly requires data files (rather than a database), including image files (for CNNs) and sequential/streamed data (for RNNs). This makes the Data Warehouse not fit for purpose, so we use a Data Lake instead. The ML tools I use are varied, from Jupyter notebooks and TensorFlow in Google Colab, to Azure ML Studio, the Rasa NLP engine and Watson, to the NVIDIA Jetson Nano. Data is no longer used for BI, but for modelling AI. It is the output of the ML models that is used for reporting and BI. The model files (h5), the augmented images, the model parameters and the model output (pickle files) are stored in the Data Lake. The Data Lake started to be used as the staging area as well, as users require access to the raw data files. So the architecture becomes like this:

It is a completely different world from 15 years ago. We change and adapt. The Data Lake is a collection of files, yes, but it is also queryable via SQL for those old school folks (if we use Databricks or Hive). And it is Python friendly too, for those ML folks. Data Lakes are available in various formats; one of the best is Databricks Delta (available on Azure and AWS), which stores data in optimised Parquet files (a compressed, columnar format). It provides transaction logs, consistent views, concurrent read/write operations and good performance (Z-order partitioning).
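As a minimal sketch of what “queryable via SQL and Python friendly” means on Databricks Delta (the table name and storage path below are hypothetical, purely for illustration):

```python
# PySpark on Databricks: the same Delta data can be read via SQL or the DataFrame API.
# The "sales" table and the /mnt/datalake/sales path are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Old-school SQL access, e.g. for the BI folks
df_sql = spark.sql("SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id")

# Python / DataFrame access, e.g. for the ML folks
df_py = (spark.read.format("delta")
              .load("/mnt/datalake/sales")   # Delta stores the data as optimised Parquet files
              .groupBy("customer_id")
              .sum("amount"))
```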

So as we can see above, the data mart/warehouse is still there in the architecture, with the data lake used as its staging area. The warehouse also takes in the output of the ML models, e.g. predictions, etc. And the BI tools and reports are still there too, like before, publishing the content of the warehouse to the users. Where is the integration point? The warehouse of course. Or if you use a multi-mart architecture (instead of a single warehouse), then the integration point is the data marts.

Note: what I labelled as “Data Lake” in my diagram above (e.g. Databricks Delta Lake, see the diagram below), some people call a “Lakehouse”. A Lakehouse is a data lake and a data warehouse in one; see Ref #6 below. My architecture diagram above (and below) is not a Lakehouse, it is the architecture of a modern data warehouse, which uses both a data warehouse and a data lake for the reasons I outlined above.

The Cloud Architecture

The main difference from 15 years ago is the cloud architecture (which is also the main driver). It is hard to argue against the cloud, whether you choose Azure, AWS or GCP. First, we don’t pay a million dollars up front, but pay by the hour (tens of thousands per month). Second, we enjoy robust infrastructure, high availability, hardware redundancy, 24×7 support and tight security. We don’t need to think about backup and failover; it is always available. Third, there is no patching. This is like heaven for the infrastructure team, and even more so for the head of IT, the Chief Financial Officer and the Chief Operating Officer.

With that, data warehouses are now in the form of PaaS databases (Platform as a Service). Whether you are a SQL Server shop or an Oracle shop, you will be pushed to be in the cloud. It is hard to argue for maintaining on-premise databases.

The pressure for a data lake platform to be in the cloud is even stronger than for a data warehouse. 10 years ago, people still talked about building their own Hadoop cluster on premise. The “zoo”, as we used to call it (because it’s not just the elephant, but also the pig, the bee hive, the oozie, the HBase whale). But now that is unheard of. Everyone just uses a cloud-based data lake like Azure or AWS (between them they probably have 90% of the market). It is pay-as-you-go (pay by the hour), with high availability, hardware redundancy, 24×7 support and tight security. We get a world class platform for a low cost.

Conclusion

So why data lake? Because a) the ML systems need files rather than a database, e.g. image files and streamed data, b) to store unstructured data, e.g. tweets, text, images and videos, and c) to be used as a staging area for the data warehouse.

The data lake is not the integration point, it is just storage. The integration point is still the data warehouse (for the structured data). So the data warehouse is still being used, and the BI tools are still being used too.

This modern data warehouse architecture (which is a data lake plus a data warehouse) must be in the cloud; the most popular platforms are Azure and AWS.

References

  1. BlueGranite’s diagram of a modern data warehouse: link, link
  2. Scott Muniz’s diagram of a Databricks data lake in Azure: link
  3. Kevin Clugage and Denny Lee’s diagram of a modern data warehouse: link
  4. Alexandre Gattiker’s diagram of Azure Databricks streaming architecture with a data warehouse: link
  5. Clinton Ford’s diagram of a modern data architecture with Delta Lake and Azure Databricks: link
  6. What is a LakeHouse by Databricks team: link

5 October 2021

Reinforcement Learning

Filed under: Machine Learning — Vincent Rainardi @ 7:28 am

Reinforcement Learning (RL) is a branch of machine learning where we have an agent and an environment, and the agent learns from the environment by trying out different actions from each state to find out which action gives the best reward.

Examples of RL are self-driving cars, games (e.g. chess, Go, DOTA 2), portfolio management, chemical plants and traffic control systems. In portfolio management, the agent is the algo trading robot, the state is the holdings in the portfolio, the actions are buying and selling stocks, and the reward is the profit made.

It is probably best to illustrate this using Grid World Game (link). Imagine a robot walking from the start box to the finish box, avoiding the X boxes:

Here are the rules of the game (a code sketch of these rules follows the list):

  1. In this case the agent is the robot. Let’s call it R. The state is R’s position, i.e. A1, B3, etc. R’s initial state is A1. The goal is for R to get to D4. The episode finishes when R gets to D4, or when R hits the X1 or X2 box.
  2. The reward is: every time R moves it gets -1, if R reaches D4 it gets +20 and if R hits X1 or X2 it gets -10. The action is R’s movement. From A1 if R moves to the right the next state is A2, if R moves down the next state is B1.
  3. If R hits the perimeter wall the next state is the same as the current state. For example from A1 if R moves to the left or upwards the next state is A1.
  4. There are 4 possible actions from each state: move right, move left, move up, move down.
  5. There are 13 possible states (R possible locations), i.e. 4×4 = 16 minus X1, X2 and Finish.
  6. Example of an episode: A1, A2, A3, B2, C2, D2, Finish. The reward is 6 × -1 + 20 = 14.
    Another example: A1, B1, X1. The reward is 2 × -1 - 10 = -12.
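To make these rules concrete, here is a rough sketch of them in code. It is only a sketch: X1 is placed at B2 to match the episode examples, while the position of X2 is an assumption (the diagram defines the real layout).

```python
# A minimal grid-world step function implementing the rules above.
# X1 = B2 matches the episode examples; the position of X2 is an assumption.
ROWS, COLS = 4, 4                                  # rows A-D, columns 1-4
START, FINISH = (0, 0), (3, 3)                     # A1 and D4
X_BOXES = {(1, 1), (2, 3)}                         # X1 at B2, X2 at an assumed position
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done) for one move of the robot R."""
    dr, dc = ACTIONS[action]
    row, col = state[0] + dr, state[1] + dc
    if not (0 <= row < ROWS and 0 <= col < COLS):  # hitting the perimeter wall:
        row, col = state                           # the next state is the current state
    reward, done = -1, False                       # every move costs -1
    if (row, col) == FINISH:
        reward, done = reward + 20, True           # reaching D4: +20 and the episode ends
    elif (row, col) in X_BOXES:
        reward, done = reward - 10, True           # hitting X1 or X2: -10 and the episode ends
    return (row, col), reward, done
```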

These are the two basic equations in RL (they are called the Bellman equations, link):
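Written out in standard notation (the same symbols the two bullet points below use), they are:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
```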

  • The first one means: the value of a particular state is the sum of the values of all possible actions in that state, weighted by the probability of taking those actions.
    v is the value of state s, π is the policy and q is the value of action a in state s.
  • The second one means: the value of a particular action in a particular state is the sum of (the immediate reward and the discounted value of the next state), weighted by the environment model probabilities.
    p is the environment model, r is the immediate reward, γ is the discount factor (how much we value future reward), and v is the value of the next state.

The environment model, denoted by p(s’,r|s,a) in the second equation above, takes in the state and the action, and it gives back the reward and the next state. For example, we put in state = A1 and action = go right (that’s the s and the a), the output of the environment model is reward = -1 and next state = A2 (that’s the r and the s’).

A policy is the probability of choosing each action in a state. For example: in A1, policy 1 says that the probability of going right = 40%, going left = 10%, up = 10%, down = 40%, whereas policy 2 assigns different percentages to those 4 actions in A1. A policy defines the action probabilities for every single state, not just 1 state.
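As a minimal sketch, a policy can be thought of as a lookup from state to action probabilities (the numbers for A1 are the illustrative percentages above; the other entries are placeholders):

```python
# A policy maps every state to a probability distribution over the 4 actions.
policy1 = {
    "A1": {"right": 0.40, "left": 0.10, "up": 0.10, "down": 0.40},  # the example above
    "A2": {"right": 0.25, "left": 0.25, "up": 0.25, "down": 0.25},  # e.g. uniform, as a placeholder
    # ... one entry per state
}
```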

The goal is to choose the best policy, i.e. the one with the best action for every state. The best action is the one with the highest total reward, not just the immediate reward but all future rewards as well (over an episode). If an action gives us a high reward now, but over the next few steps it gives low rewards, then the total reward would be low.
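As a small sketch of “total reward over an episode” (the return): with a discount factor γ of 1 this reproduces the score of 14 from the first episode example above, and with γ below 1 later rewards count for less.

```python
# Return of an episode: immediate reward plus discounted future rewards.
def episode_return(rewards, gamma=1.0):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Episode A1, A2, A3, B2, C2, D2, Finish: six -1 moves, with +20 added on the final move.
print(episode_return([-1, -1, -1, -1, -1, -1 + 20]))   # 14.0
```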

Training

Training an RL model means:

Step 1. First we initialise every state with an action. Say we initialise all states with action = Up, like this:

Step 2. We start from A1. We calculate the reward for all 4 actions in A1, then choose the best action (say right). So the state is now A2. We calculate the reward for all 4 actions in A2, and choose the best action again. We do this until the episode finishes, i.e. we arrive at Finish or we hit X1 or X2.

For example, this is what we end up with in Episode 1, i.e. we took the yellow path. The reward is 5 × -1 - 10 = -15.

Step 3. We do another episode. It is important that we explore other possibilities, so we do not always choose the best action but sometimes deliberately take a random action. For example, in this episode we go down from A1, then left on B1 (so the next state is still B1), then right on B1 and hit the X1 box. The reward is 3 × -1 - 10 = -13.

Step 4. We don’t want to keep exploring forever, so as time goes by we explore less and less, and exploit more and more. Exploring means we choose an action randomly, whereas exploiting means we take the best action for that state.

We use a hyperparameter called epsilon (ε) to do this. We start with ε = 1 and slowly decrease it to 0. On every move we generate a random number. If this random number is lower than ε then we explore, but if it is higher than ε then we exploit.
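A minimal sketch of this epsilon-greedy selection with decay (the decay rate and the q_values lookup are illustrative assumptions):

```python
import random

EPSILON_DECAY = 0.995      # illustrative decay rate
epsilon = 1.0              # start by fully exploring

def choose_action(state, actions, q_values, epsilon):
    """Explore (random action) with probability epsilon, otherwise exploit (best known action)."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: q_values[(state, a)])    # exploit

# After each episode, reduce epsilon so we explore less and exploit more over time.
epsilon = max(0.0, epsilon * EPSILON_DECAY)
```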

So in the initial episodes our score is low, but after a while our score will be high. The score is the total reward per episode. The maximum score is 14, i.e. when you go directly to Finish in the shortest possible way. The worst score is a big negative number, i.e. if you keep going around in circles endlessly. Remember that every time you move you get -1. This is to motivate the robot to get to the Finish box as soon as possible.

So if you play say 1000 episodes, the score would be something like this:

We can see that from the beginning until episode 200 the score is constantly increasing. This is because at the start we set ε to 1 (fully exploring) and we taper it slowly to 0.9, 0.8, 0.7, etc. until it reaches 0 at episode 200. This is called epsilon decay.

From episode 200 the score is “in a band”, i.e. the value stays within a certain range. We say that the score has “converged”. In this case it is between 8 and 14. From episode 300 the band narrows to 9-14. From episode 500 it is 10-14, and from episode 700 to 1000 the score is 11-14. The band gets smaller and smaller because after episode 300 the RL model doesn’t explore any more. It only exploits, i.e. takes the best possible actions. And it is still learning, which is why the score keeps increasing as time goes by.

Model Free

One of the most important things to remember in RL is that in most cases we don’t have a model of the environment. So we need to use a neural network to estimate the value of q (the reward for an action in a state). This neural network is called the Q network. And because it consists of many layers, it is called a Deep Q Network, or DQN. Like this:

The input of the DQN is a state vector (one-hot encoded) and an action vector (also one-hot encoded).
For example, assuming we have 3 states (boxes A1, A2, A3) and 2 actions (left and right), the vectors are as follows (a short NumPy sketch follows the list):

  • For state A1 the state vector is [1  0  0]
  • For state A2 the state vector is [0  1  0]
  • For state A3 the state vector is [0  0  1]
  • For action Go Left the action vector is [1  0]
  • For action Go Right the action vector is [0  1]
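These vectors can be produced with a couple of lines of NumPy (a sketch, using the 3-state / 2-action example above):

```python
import numpy as np

N_STATES, N_ACTIONS = 3, 2

# One-hot state vectors: row i of the identity matrix encodes state i.
state_A1 = np.eye(N_STATES)[0]    # [1. 0. 0.]
state_A2 = np.eye(N_STATES)[1]    # [0. 1. 0.]
state_A3 = np.eye(N_STATES)[2]    # [0. 0. 1.]

# One-hot action vectors.
go_left  = np.eye(N_ACTIONS)[0]   # [1. 0.]
go_right = np.eye(N_ACTIONS)[1]   # [0. 1.]
```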

Because we don’t have the environment model, we generate the data (called “experience”) using the DQN. We put in the state and the action, and get the Q value (the reward). We do this many, many times and save those experiences into memory (called the Replay Buffer). Once we have many experiences in the Replay Buffer (say 30,000), we pick a batch of, say, 100 experiences at random and use this data to train the network.
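A minimal sketch of such a replay buffer (the capacity and batch size are the numbers mentioned above):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) experiences and samples random batches."""

    def __init__(self, capacity=30_000):
        self.memory = deque(maxlen=capacity)    # the oldest experiences are dropped when full

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=100):
        return random.sample(self.memory, batch_size)   # random batch used to train the network

    def __len__(self):
        return len(self.memory)
```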

Three Architectures of DQN

There are 3 architectures of DQN. The first one is the one in the previous diagram (I drew it here again for clarity). It takes in the state and action vectors as input, and the output is the Q value (the reward) for that state and action. For example: from state A1 [1  0  0] we take action = Go Right [0  1], and the reward is -1. So we feed in [1  0  0  0  1], i.e. the concatenation of the state and action vectors, and get -1 as the output.

The second architecture takes in the state vector as input, and the output is the Q value (the reward) for each action. For example, the input is A1 and the output is -1 for Go Left and -1 for Go Right. The advantage of this architecture is that we only need to feed each state into the network once, whereas with the first architecture we need to feed in each state and action combination.
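A minimal Keras sketch of these first two architectures, using the 3-state / 2-action example (the layer sizes are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

N_STATES, N_ACTIONS = 3, 2

# Architecture 1: input = concatenated state + action vectors, output = a single Q value.
dqn_v1 = tf.keras.Sequential([
    tf.keras.Input(shape=(N_STATES + N_ACTIONS,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                      # Q(s, a)
])

# Architecture 2: input = state vector only, output = one Q value per action.
dqn_v2 = tf.keras.Sequential([
    tf.keras.Input(shape=(N_STATES,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(N_ACTIONS),              # [Q(s, left), Q(s, right)]
])
```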

The problem with both of the above architectures is that we use the same network for calculating the expected Q values and for predicting the Q values. This makes the system unstable, because as the network weights are updated at each step, both the expected Q and the predicted Q change.

This problem is solved by the third DQN architecture, which is called Double DQN. In this architecture we use 2 networks: one for predicting the Q value and one for calculating the expected Q value. The former is called the Main network and the latter is called the Target network. The weights of the Main network are updated at every step, whereas the weights of the Target network are updated at every episode. This keeps the expected Q values (the target Q values) stable throughout the episode.

We only train the Main network. The weights of the Target network are not updated using back propagation, but by copying the weights of the Main network. This way, at the beginning of every episode the Target network is the same as the Main network, and we use it to calculate the target/expected Q values.
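A minimal sketch of that weight copy in Keras (the network mirrors the second architecture sketched above; the sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Main network: trained with back propagation at every step.
main_net = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),           # state vector (3 states in the toy example)
    layers.Dense(32, activation="relu"),
    layers.Dense(2),                      # one Q value per action
])

# Target network: same architecture, but its weights are only ever copied from the Main network.
target_net = tf.keras.models.clone_model(main_net)

def sync_target_network():
    """Refresh the Target network by copying the Main network's weights (no back propagation)."""
    target_net.set_weights(main_net.get_weights())

# Called at the beginning of every episode, so the target Q values stay stable within the episode.
sync_target_network()
```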

Portfolio Management

In portfolio management, if we have 50 holdings in the portfolio and the investment universe is 500 stocks, then there are 501 possible actions, i.e. sell any of the 50 holdings, buy any of the 450 stocks that we are not currently holding, or do nothing. And what is the state? The 50 holdings, which could be any possible combination of 50 out of the 500 stocks in the investment universe – that’s a lot of states!

And we are not limited to buying or selling just 1 stock. We can buy or sell several stocks at once: 2, 3 or 4!

And we have not factored in the price. In real life the action is not “buy stock X” but “buy stock X at price P”. In this case the state is every possible combination of 50 out of 500 stocks, at various different prices. That’s really a lot of states! So it is very resource intensive.
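A quick sketch of just how many states that is, ignoring prices (standard library only):

```python
from math import comb

# Number of ways to choose 50 holdings out of a 500-stock universe (ignoring prices and weights).
n_states = comb(500, 50)
print(f"{n_states:.3e}")   # about 2.3e+69 possible combinations of holdings
```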

Conclusion

Reinforcement Learning (RL) is probably the most complicated type of machine learning model. It uses deep neural networks, we have to generate the data (the experience) ourselves, and it is resource intensive, especially when we have many states and many actions.

But it is the one we use for many things these days. We use it for robotics (check out Boston Dynamics), for self-driving cars, for playing computer games, and for algo trading (portfolio management). Check out OpenAI playing DOTA 2: link. Guess how many CPUs they used? 128,000 CPUs plus 256 GPUs!
