Data Warehousing and Data Science

24 November 2022

Data Lake for Asset Management Companies

Filed under: Data Warehousing — Vincent Rainardi @ 6:36 pm

For many years I have been asking myself what the value of a data lake is, specifically for asset management companies. If an asset management company does not have a data warehouse (or any other data management solution), would I propose a data lake as the solution? The business needs are reporting and investment management. Let me spell them out:

  1. To provide portfolio managers and financial analysts with the latest, valid version of all holdings (IBOR and ABOR positions) and benchmark constituents, and tearsheets for their meetings with various clients and companies, as well as for evaluating the companies they are looking to buy.
  2. To provide the client reporting team with performance reports, attribution reports, valuation reports and ESG reports to be sent to clients and to be published in the factsheets and KII documents.
  3. To report to regulators including FCA and SEC on shareholding, OTC derivatives, trades, ESG, MiFID and SFDR, and to comply with requirements to be sustainable funds.
  4. To provide the finance department with the assets under management (AUM) numbers, NAV and transactions.
  5. To assist the marketing and sales teams with various data requests, RFPs and presentations to clients and prospective clients.
  6. To provide the risk and compliance team with risk analytics (spread duration, DTS, etc), exposures and VaR breakdowns, stress test results and compliance reports.

The reason I ask this question is that the data in asset management companies is mostly structured, if not 100% so. All six points above involve structured data. It is tabular: it consists of columns and rows. The data types are strings, numerics and dates. There is no audio, no images, no video. There is no Twitter data, Facebook data or documents.

A data lake, on the other hand, handles multimedia files well, and social media too. That is the key differentiator between a data lake and a data warehouse. Then again, there are many, many superior features that a data warehouse has but a data lake doesn't. The top one is data integration. A data warehouse integrates data. Company information from various data sources (Bloomberg, Factset, Markit, MSCI, Trucost, etc.) is integrated in the warehouse: the industry sector, for example, and the primary equity ID such as ISIN. Also company names, country of domicile, and ratings.

Allocating a proper ID to each company is not a simple process. Anyone who has done this will testify how complex the process is. There are many ISINs, SEDOLs and CUSIPs that a company issues; which one should we choose? Yes, they are security IDs, not company IDs, but that is what other data providers such as MSCI and Trucost use, so we need to use ISIN. Should we use the Factset entity ID or the Bloomberg issuer ID as the main issuer ID (like a primary key)? There are many company names. The company name from Bloomberg is not the same as Factset's, and both are different from Refinitiv, MSCI, Trucost, S&P, FTSE and Rimes. One is "Company, Inc." and the other has no comma or period. One uses "Corp" and the other uses "Corporation". And many other abbreviations. Which one should you use? Do you use the Factset proper name? Should you take the Bloomberg issuer hierarchy or Factset's?

The same goes for security integration. How do you do your waterfall matching? SEDOL first or CUSIP first? CUSIP-8 or CUSIP-9? Should you use the Bloomberg security ID first? And ticker is a mess: with exchange, without exchange, with country, without country, Bloomberg ticker, Factset ticker, etc. The same with valuation data. Do you use start of day, end of day, mid-month or month end? Do you use the latest available version, or the signed-off version? In a data warehousing project, the first thing we do is establish credible issuer records, security data, ESG data, valuation data and performance attribution data. Credible exchange rates, prices and risk data. By integrating them from various sources. Yes, it is a large project, but that is what is required for reporting.
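The waterfall idea itself can be sketched in a few lines. The identifier order and the IDs below are illustrative only; the real order is a design decision for your project:

```python
def waterfall_match(security, master):
    """Try identifiers in a fixed order of preference; the first hit wins.
    The order below (SEDOL, then CUSIP, then ISIN, then ticker) is only one
    possible choice -- the point is that the order is an explicit decision."""
    for id_type in ("sedol", "cusip", "isin", "ticker"):
        value = security.get(id_type)
        if value and (id_type, value) in master:
            return master[(id_type, value)]
    return None  # fell through the waterfall: send to a manual review queue

# A toy security master (these IDs are illustrative, not real mappings)
master = {("sedol", "B1YW440"): "SEC001", ("isin", "US0378331005"): "SEC002"}

print(waterfall_match({"sedol": "B1YW440", "isin": "US0378331005"}, master))  # SEC001
print(waterfall_match({"cusip": "037833100", "isin": "US0378331005"}, master))  # SEC002
```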

A data warehouse is integrated; a data lake is not. A data warehouse is a golden source (meaning it's credible); a data lake is not. A data warehouse has a DQ process (the data has been corrected and signed off); a data lake does not. Can you trust holding valuations coming out of a data lake? No. There is no robust sign-off process. Can you trust a company ISIN coming from the data lake? No. It comes from only one source, which hasn't been integrated or cross-checked with other sources.

A data lake is technically superior (in terms of the data formats it can handle) but functionally poor. So if your goal is reporting, the right solution is a data warehouse, not a data lake. If your purpose is data science (machine learning), then a data lake is said to be the right solution, because machine learning processes raw data. Well, that's not quite right, because machine learning processes both raw data and calculated data. Say you'd like to predict a company's scope 1 carbon emissions using a time series, say with a CNN or RNN algorithm. That uses raw data: the scope 1 emission number comes from the ESG data provider (raw). But if you'd like to include the company's annual revenue, or the company's industry sector, ratings, country of domicile or market cap, then that is not "raw" data. It is "processed" data. The company information coming out of the data warehouse is of much higher quality than what's in the data lake. Why? Because annual revenue is a calculation. For example, in the US we get the quarterly sales numbers from the 10-Q filings and sum them up. But which period are we using? Does it tally with the period of the carbon emissions? If the carbon emissions are as at 31st Dec 2020, and the company's financial year runs from 1st April to 31st March, which quarterly sales do you take? If there are several sources for sector, rating and country, then there is a waterfall process plus a data quality check to determine which one to use. All of this happens in a data warehouse. None of it happens in the lake, as by definition the lake stores only raw data.

So there goes the argument that “for machine learning you need a data lake, not a data warehouse, because you need raw data”. The fact is machine learning predictions require both raw data and processed/calculated data.

For an ecommerce company which takes Twitter feeds, or a share prediction algorithm that takes public news as input, a data lake is ideal. But if your goal is reporting to regulators, investment desks, sales and marketing, risk and compliance, etc., then you need a data warehouse, not a data lake.

I have created a data lake for an asset management company (as part of a big team). And as I said, the data is structured. It is tabular: rows and columns. No images, audio files or videos. No Twitter feeds or Facebook exports. It is about holdings, valuations, portfolios, issuers, assets, attribution: the normal day-to-day asset management business data.

Monolith is dead

That said, I completely agree that monoliths are dead. A data warehouse as a monolith is not suitable for today. It takes a long time to build (1-2 years), and once it's built it is difficult to change because the regression testing takes ages. We need to build small data marts instead, each for one specific purpose: calculating the total AUM (with breakdowns) for finance at month end and daily, say, or SFDR reporting, or client reporting. One mart for one purpose. And you'll have many marts, each with a different purpose. That way, if you change the portfolio dimension in one mart, the other marts won't be affected. There is no single version of the truth. There are many versions of how the issuer hierarchy can be implemented; fixed income is different from equity.

And if you do require a single version of the truth, for holdings, valuations or attribution for example, then you create a special mart that calculates it, and publishes it to all the other marts. More on this publishing mechanism later.

That said, the mart doesn't have to be dimensional. No. It can be relational; it can be vertical. The data model in a mart needs to suit the business purpose of that mart. It might be serving a website, it might be for reporting, or for analytics (BI). Or for machine learning. Or calculating ESG numbers. Or calculating AUM. The data model in each mart needs to suit its purpose. It would be silly if we insisted on using a dimensional model (fact and dimension) all the time. Of course not. We have to tailor the data model to suit what the data is used for.

Because of this you might not want to call it a data mart, because a data mart traditionally means a dimensional model: a small data warehouse, in dimensional format. A data mart traditionally consists of fact and dimension tables, in Kimball format (as in the group, not just one man). As a fan of the Kimball method, I have found that the dimensional format has many advantages. But when it is used for feeding a website, that's not the right format. When it is used solely for calculating ESG numbers in a Python notebook using Pandas and Numpy, dimensional is not the right format. Some applications require flattened tables, some require relational tables. Some require data vault format. No, definitely not "always dimensional". So you might not want to call it a data mart. You might want to call it a client application, or a business application database. Along those lines, i.e. named after the client or business purpose. Call it an ESG data store, AUM data store, client reporting database, SFDR database, etc. Remember: only for one specific purpose.

The advantage of this "only for one specific purpose" approach, from the management point of view, is the clarity of funding. It is easier to do the cost-benefit analysis, and to monitor the payback, because the whole project is about that one purpose. Compare this with funding a traditional "enterprise scale" data warehouse. How do you budget it? You will have to take it from the core IT budget, the central budget, or combine many projects. And the payback could be 3 years, or 5. It is not financially attractive. On the other hand, a business application that serves only one specific purpose is hugely attractive. You don't mess around doing other things. You do just one thing, like calculating AUM. When you need another one, for SFDR say, you build another one: another application, with another database. The budget allocation could not be clearer. The payback too.

Data Lake

That said, I would argue that there is a point in building a data lake for an asset management company. First, as central data storage, where everything is kept. Second, as a landing area. Third, for data publishing.

A data lake is by definition a centralised storage for raw data. Everything is stored there, from many data sources, external and internal. Even calculated data from various systems is stored there too. The advantage? Everything is in one place. And if you use Databricks for your data lake, you can join the files like tables! You can use normal SQL to query the data. You can feed your BI tools, your reporting tools, and many, many business applications too: from your company website to your externally-facing client reporting portal. You can feed your internal data portal too, where everybody can get any data they need.

Second, the landing area. I'd argue that, rather than doing data staging or landing in the SQL database where the data mart is located, it is better to put the data in the data lake. The reason is that we can then do incremental loading into the data marts / business applications, keeping them lean and performant.

Third, data publishing. If you have an application which calculates something that is used by many business applications, AUM for example, then it is a good idea for this application to read the lake and publish the calculation results back into the data lake (in another area, not in the raw data area). For example, you could label the raw data as the Bronze area and the AUM as the Silver area. The idea is that we use the data lake for publishing calculated numbers, to be consumed by other business applications (because all business applications read data from the lake).


  1. Asset managers: if your need is reporting, use a data warehouse not a data lake.
  2. Monolith is dead: build data marts (business apps) rather than a data warehouse.
  3. There is value for an asset manager in having a data lake (3 advantages).

Thank you to Robert Adler whose article inspired me to write this article:

1 November 2022

Power BI Datamart

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 8:39 am

Like SQL Data Warehouse, Microsoft once again named a product by what it does: Power BI Datamart. It is an Azure SQL database, with a visual query designer so users don’t need to write SQL to get the data out to BI tools, Excel or ML algorithms. It also generates a Power BI dataset automatically, which we can use to create Power BI reports.

Loading data into Datamart

Power BI Datamart uses a built-in Power BI Dataflow for loading data into it. After we create a Datamart, we click on Get Data and use PowerQuery to connect to various data sources such as Azure SQL database, Azure Data Lake Storage, Databricks (Spark), Azure Synapse, HDInsight, Sharepoint, Excel files, text files, JSON files, Parquet files, API, Salesforce, Snowflake, Redshift, BigQuery, Oracle DB, IBM DB2, Teradata, PostgreSQL, MySQL, Access, SAP, Analysis Services and ODBC (link):

The above list of data sources is so comprehensive that I could not believe it at first. This is a game changer. Gone are the days when we struggled in SSIS, Informatica or ADF to load data from those varied sources. PowerQuery has connectors and interfaces to every single one of those databases, files, data lakes and big data platforms.

We then select the tables and specify the transformations, and PowerQuery builds the ETL rules in that built-in Dataflow, creates the relationships and loads the data from those tables in the data source into the Datamart. We can then see all those tables and relationships in the Table Tools tab:

Afterwards, we can create relationships (again) and measures. We can also hide the tables and columns that we don’t need.

In the workspace we automatically get a dataset:

We can schedule when we want to reload data from those data sources into the Datamart (called Scheduled Refresh). And we can also set up incremental load, i.e. only changed rows are loaded into the Datamart, based on a column that we choose (known as the "watermark column", link):

Security and access control

Power BI Datamart uses row-level security and roles to restrict user access to the Datamart. In the example below (link), users are restricted so that they can only access rows with a category ID of 1 or 2. Once that role is defined, we can assign users to it, as we can see below:

Querying data

As I said at the start of this article, we can use a visual query designer to get the data out without writing SQL (link). As we can see below, the Visual Query Designer is like Markit EDM, SSIS and Informatica.

We can select rows, filter rows, select columns, split columns, remove duplicates, join tables, change data types, transpose tables and pivot columns. We can also do error handling, such as removing errors or replacing errors (link). And we can also create custom functions (link).

Of course we can also write SQL:

Accessing Power BI Datamart from SSMS

And we can also query the data in the Power BI Datamart using SSMS (SQL Server Management Studio). And using Azure Data Studio too (link):

Not only can we use SSMS for querying data in the Datamart, but also for adding users and configuring security. And we can also use SSMS to look at the query statistics. This is how it looks from SSMS (link):

We can see above that the Power BI Datamart tables are listed as views, under the "model" schema. The relationships in the Power BI Datamart are stored in two views called metadata.relationships and relationshipColumns, as shown above (link).


I think Power BI Datamart is a game changer.

  • It enables us to gather data from databases, data lakes, files and APIs into one data mart.
  • We can reload data every day/week and we can do incremental loading.
  • We can query the data using SSMS (can also do admin stuff).
  • We can build Power BI reports easily (the dataset is automatically created).
  • We query the data using visual query designer or by writing SQL.
  • We can easily get the data out to BI or ML tools, and to Excel.

That is a very powerful tool. Anyone in data warehousing and BI should consider this tool. At least take a look.

And I think the name suits it well. The Power BI Datamart is indeed a good data mart. With all its bells and whistles.

20 October 2022

Forecasting stock prices using LSTM

Filed under: Data Warehousing — Vincent Rainardi @ 7:53 am


Long Short-Term Memory (LSTM) is a machine learning algorithm that can recognise data in a sequence: a series of words in an email, a series of spoken words (voice), a series of pictures (a video), a series of daily rainfall figures, daily solar radiation, or monthly sales. LSTM has the ability to recognise the pattern in the data and make predictions about future values. It has been implemented widely in many areas, including the financial markets.


LSTM belongs to a family of algorithms called Recurrent Neural Networks (RNN). An RNN is a machine learning algorithm that has the ability to learn the pattern in sequential data. Apart from LSTM, there are other algorithms in the RNN family, for example the Gated Recurrent Unit (GRU). An RNN, as the name implies, is a neural network. A neural network is an architecture made of a series of layers, and each layer consists of nodes. Each node in an LSTM consists of 3 gates: the forget gate, the input gate and the output gate.
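For reference, the three gates can be written out as follows. This is the standard textbook formulation, not specific to any library; W and b are learned weights and biases, sigma is the sigmoid function, x_t is the input at time t and h_{t-1} is the previous hidden state:

```latex
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}
```

The forget gate decides what to drop from the cell state, the input gate decides what new information to store in it, and the output gate decides what to emit as the hidden state.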

Stock Markets

There are many different forms of financial markets: stock markets, bond markets, commodity markets (trading the future price of oil, gold, aluminium, wheat, etc.), currency markets (trading the exchange rates of fiat currency pairs and cryptocurrencies), money markets (trading very liquid securities such as US Treasuries) and derivative markets (trading fixed income and equity derivatives such as options, futures and swaps). These markets have different characteristics. Some are volatile, like stock markets and currency markets; some are quite stable, like bond markets and money markets. For individual investors, the stock market is arguably the most popular one, much more popular than the bond, commodity, derivative or money markets.

Forecasting vs prediction

Before we get into stock price forecasting, let's understand the difference between prediction and forecasting. In machine learning the word "prediction" means trying to guess the value of a variable. If you have the number of windows in a house, and you try to guess the price of the house, that's prediction. Or if you have people's heights, and you try to guess their weights, that's prediction. Of course you can use many variables for prediction, for example using the house location, the floor area, the number of bedrooms and the age of the house to predict the price of the house.

In machine learning, forecasting deals specifically with time. It means predicting future values. For example, how many washing machines will be sold in our store next month, or whether it is going to rain tomorrow. We can base our forecast on the past values of the same metric. For example, we can forecast next month's inflation based on the inflation over the last 2 years. Or we can base our forecast on something else entirely. For example, we can forecast traffic jams based on the scheduled events in that city. And of course, forecasting is predicting, because predicting means trying to guess the value of a variable, including the value of a variable in the future.

Stock Price Forecast

In the stock market there are a lot of variables that we can forecast (= guess the future value of). For example, we can try to predict which companies will leave the S&P 500 next year. We can predict the number of transactions tomorrow. Or the volume. Or the direction of the price tomorrow, i.e. up or down. Or the market sentiment. Or we can try to correlate one variable with another. Out of all the variables, the stock price is the most popular one. That is not surprising, because if you knew what a stock price would be in the future, you could make a huge fortune.

There are many stock price forecasting articles. Most of them use only the last few weeks' data, and they forecast only the next day's price. That is not what happens in the real world. Many use the closing price, or the opening price. Again, that's not what happens in the real world. When an investor sells, the price they get is not the opening or closing price. Not the highest or the lowest price either. They get something between the highest and the lowest price. It is not called the mid price, because the mid price is the midpoint between the bid price and the ask price for a particular trade, whereas high, low, open and close are for a whole day's worth of trades in a particular stock or bond (or options, futures, or any other derivative, commodity or currency for that matter).

Principles of Forecasting

Many forecasting methods are based on the assumption that the future will behave like the past. That is the fundamental reason to believe that we can predict the future at all. This is the foundation of traditional time series forecasting methods such as ARIMA and Exponential Smoothing.

The technical term for this is autoregression, meaning that the future values of a time series are related to a linear combination of the past values of that time series (or autocorrelation if the relationship is not linear). The other technical term is the moving average model, meaning that the future values of a time series are related to the past forecast errors. Both the autoregression and the moving average model are the fundamental principles used in ARIMA.

The Exponential Smoothing technique, on the other hand, assumes that the future values of a time series are related to a weighted average of the past values of that time series, where recent events get larger weights than events in the distant past.
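Simple exponential smoothing can be written in a few lines. This is a minimal sketch; real libraries such as statsmodels also fit the smoothing parameter for you:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each new observation pulls the level
    towards it with weight alpha, so recent values dominate the forecast."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level  # the one-step-ahead forecast

print(exponential_smoothing([10, 12, 11, 13], alpha=0.5))  # 12.0
```

With alpha close to 1 the forecast tracks the last observation; with alpha close to 0 it behaves like a long-run average.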

The Art of Stock Forecasting

Like any prediction, the art of stock forecasting is choosing the right variables to use. The stock's own time series might not be enough for forecasting its future. In that case we need to include other variables, such as volume, other stocks, and the industry average. That is the most important ingredient: the input variables. The first step, before choosing the algorithm or method, is to establish a set of variables which are highly correlated with the future price of the stock.

The second most important ingredient in stock forecasting is choosing the right algorithm. It does not have to be machine learning. Time series forecasting is over a hundred years old, so ignore the non-ML methods at your peril. Combine the traditional methods and the ML methods. Learn about Facebook's Prophet, GARCH, ES, ARIMA, Theta and other non-ML methods. Combine them with popular ML methods commonly used for forecasting time series, such as LSTM and CNN. But don't stop there; also use unconventional ML algorithms such as XGBoost and Random Forest. There are tons of literature and papers on each method, which provide us with a good starting point.

That is the art of stock forecasting: finding the right combination of variables and algorithms that gives us the best results. And then there is the third ingredient: tuning those combinations. Trying out different regularisations, different hyperparameters, different optimisers, different evaluation criteria, all in an attempt to get the best results. That is the art of stock forecasting.

Different types of LSTM

When it comes to stock price forecasting using LSTM, there are 5 different architectures that we can use: vanilla LSTM, stacked LSTM, convolutional LSTM, bidirectional LSTM and CNN LSTM.

  • Vanilla LSTM is a single LSTM layer, possibly with a regularisation measure such as a dropout layer, and optionally with one or two dense layers for converting the output into classes.
  • Stacked LSTM means multiple LSTM layers stacked one after another.
  • Bidirectional LSTM learns the time series both forward and backward.
  • Convolutional LSTM reads the time series using the convolution method (see my article here for what convolution means).
  • CNN LSTM also reads the time series using the convolution method.

The difference between CNN LSTM and Convolutional LSTM is this: CNN LSTM uses a convolution layer to read the input, then passes the output to an LSTM layer, whereas Convolutional LSTM does not use a separate convolution layer, but uses a Keras layer called ConvLSTM which already has convolution reading built in.


The implementation examples of these 5 architectures can be seen on Jason Brownlee’s website:

Here is an example of a Vanilla LSTM (my own implementation):

In the above example, the Y Gap is 66, i.e. the number of trading days between the last trading day in the stock time series and the start of the forecasting period. For example: the time series runs from 1st Jan 2016 to 31st Dec 2021, the Y Gap is the period covering Jan, Feb and March 2022, and the forecasting period is the first week of April 2022.

That is why, in the example above, the LSTM output is set to 71 days: 66 days (the number of trading days in the 3-month gap, i.e. Jan, Feb, March 2022) plus 5 days (the number of trading days in the first week of April).

The number of trading days corresponds to the days when the stock market is open. This is usually Monday to Friday, minus the public holidays.
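Counting trading days can be sketched as follows. This is a minimal version; a real implementation would use the exchange's actual holiday calendar rather than a hard-coded list:

```python
from datetime import date, timedelta

def trading_days(start, end, holidays=()):
    """Count weekdays between start and end inclusive, excluding holidays.
    weekday() < 5 means Monday to Friday."""
    count = 0
    day = start
    while day <= end:
        if day.weekday() < 5 and day not in holidays:
            count += 1
        day += timedelta(days=1)
    return count

# One calendar week = 5 trading days (4-8 April 2022 is Monday to Friday)
print(trading_days(date(2022, 4, 4), date(2022, 4, 8)))  # 5
```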

Running the model is quite simple:, Y_train, epochs = 50, batch_size = 32)

I do hope this article inspired you to take a stab at stock price forecasting. If you have any questions or would like to discuss any aspect of stock forecasting, I'm available at

14 October 2022

Data Warehouse on Data Lake

Filed under: Data Warehousing — Vincent Rainardi @ 7:04 am

Some people argue that the best way to build a data warehouse on a data lake is to create a SQL Server (or Oracle) database populated from a Databricks data lake, like the diagram below. It is certainly a good way of doing it, but yesterday I saw a better way: build the data warehouse on the Databricks SQL Warehouse feature (previously called SQL endpoint).

The SQL Warehouse in Databricks makes the tables in Databricks queryable using SQL commands as if they were tables in a SQL database. More importantly, it enables BI and reporting tools to connect to it as if it were a SQL database, and we can control which users are allowed to access which schemas, tables and views. Yes, views! That is something we could not do previously on a Databricks cluster. But now we can, on the SQL Warehouse. See the diagram below.

Similarly, for data scientists and ML engineers, instead of creating a SQL database populated from the data lake, it is better to create the data science area on Databricks itself, as a SQL Warehouse.

Here’s how to do it. We build the data lake as normal on Databricks. One Databricks database per data source. As usual, in each database we have bronze area, gold area, etc. We bring the data in from the data sources (say from Charles River, or from SAP) into the bronze area using Azure Data Factory (or Glue + Transform if you use AWS). We create Python notebooks to process data from bronze to silver to gold, validating data, cleaning it and processing it along the way.

At this point the usual approach is to create an Azure SQL database (or Redshift if you use AWS, or Snowflake), and create a Kimball dimensional model on it (star schema, fact and dimension tables). To populate it we can use Azure Data Factory to read the lake. To get the data out of the lake we can either a) create APIs on top of Databricks (so the ADF pipelines consume these APIs) or b) read the Databricks cluster directly using Spark SQL, Scala or Python.

Using SQL Warehouse, we don't create this Azure SQL database. Instead, we create a database in SQL Warehouse (a database in SQL Warehouse is formally called a schema). Then we create the Kimball dimensional model (fact and dimension tables) on this schema, putting the location paths of those tables in the data lake ADLS (or S3 if you use AWS), preferably in Delta format (you can also choose JSON, CSV, Parquet or Avro). For large fact tables we can partition the tables.

Then we use COPY INTO to load data from a file location in the data lake into staging tables on SQL Warehouse, so we don't need to create the data lake tables as external tables. These staging tables are created as Delta tables. Then we populate the fact and dimension tables using an INSERT INTO statement: INSERT INTO FactTable1 TABLE StagingTable1. So we select from the staging tables and insert into the fact and dimension tables.

And that’s how we create a data warehouse on Databricks SQL Warehouse. And populate it.

As a bonus, using Databricks SQL Warehouse we can capture the data lineage out of the box. See the Reference section below.


5 October 2022

MCC formula for multiclass classification

Filed under: Data Warehousing — Vincent Rainardi @ 7:51 am

Problem: you produced multiclass classification like below, and need to evaluate it using MCC:

Note: MCC (Matthews Correlation Coefficient), also known as the Phi Coefficient (link), is a measure of how closely related 2 variables are. For multiclass, MCC is a better evaluation measure than accuracy, precision, recall or F1 score. MCC and AUC measure different things: MCC measures statistical accuracy, whereas AUC measures robustness (link).


from sklearn.metrics import matthews_corrcoef
matthews_corrcoef(Y_Actual, Y_Predicted)
Output: 0.035662657

Where Y_Actual and Y_Predicted are lists containing the class ID like: [3, 8, 2, 5, …, 9]. In the above example both lists contain 1000 items.

How does it work?

But how does it actually work? What is the formula to calculate that 0.035662657?

Well, Scikit Learn specifies the formula here:

Where TP, TN, FP, FN are the number of true positives, true negatives, false positives and false negatives, respectively.

But that’s for binary classification. How about multiclass classification?

Well, Scikit Learn specifies the formula too, on the same page as above:


k = class from 1 to K (in the above example: 10 classes)
s = number of samples (in the above example: 1000 samples)
c = number of samples correctly predicted (in the above example: 157)
tk = number of times class k truly occurred (in the above example: 67, 42, 67, etc.)
pk = number of times class k was predicted (in the above example : 17, 8, 20, etc.)
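Before going to Excel, the formula can also be written directly in plain Python using exactly the quantities above (my own sketch; on a real project you would just call sklearn's matthews_corrcoef):

```python
from math import sqrt

def multiclass_mcc(y_actual, y_predicted):
    """Multiclass MCC from the quantities defined above: s samples, c correct
    predictions, t_k true occurrences and p_k predictions per class k."""
    classes = set(y_actual) | set(y_predicted)
    s = len(y_actual)
    c = sum(a == p for a, p in zip(y_actual, y_predicted))
    t = {k: y_actual.count(k) for k in classes}
    p = {k: y_predicted.count(k) for k in classes}
    numerator = c * s - sum(t[k] * p[k] for k in classes)
    denominator = sqrt((s**2 - sum(v**2 for v in p.values()))
                       * (s**2 - sum(v**2 for v in t.values())))
    return numerator / denominator

# Tiny check: this agrees with the binary MCC formula on the same example
print(round(multiclass_mcc([0, 0, 1, 1], [0, 1, 1, 1]), 4))  # 0.5774
```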

When I looked at this formula, I got confused. So let’s pick it apart and do it in Excel.


The numerator is c x s minus the sum of t x p over the classes, see below:

So it is 157 x 1000 (c times s) minus (column t) x (column p), i.e. 67 x 17 + 42 x 8 + … and so on.

In Excel we can use sumproduct like this:
(refer to the column and row numbers in Excel screenshot above)

= B12*D12-(SUMPRODUCT(B2:B11,C2:C11))
= 28918


The denominator is:

In Excel it is:
= SQRT((B12^2-SUMPRODUCT(C2:C11,C2:C11))*(B12^2-SUMPRODUCT(B2:B11,B2:B11)))
= 810876.198255

Dividing the numerator by the denominator gives us:
28918 / 810876.198255 = 0.035662657

There we go. That’s how you calculate MCC in Excel. I hope this is useful.

1 October 2022

Stock Price Forecasting using XGBoost

Filed under: Data Warehousing — Vincent Rainardi @ 7:37 am

This article is about forecasting stock prices using the XGBoost algorithm. For those who are not familiar with XGBoost, Jason Brownlee wrote a good introduction: link. For those who are not familiar with stock price forecasting, Chai Chon Lung provides a good literature review: link (chapter 2), while Rouf et al. provide a good recap of the latest developments: link.

Predicted Variable

The XGBoost algorithm is capable of predicting a categorical variable from numerical independent variables. In other words, the inputs are numbers and the output is a class. Therefore, to forecast stock prices using the XGBoost algorithm, we need to categorise the predicted stock prices into classes.

This can be done by comparing the future stock prices with either today’s prices or the average of the last few days. This comparison is something like: up 6% or down 3%. These are percentage increases or decreases, aka percentage changes. A percentage change is still numeric, but it can be categorised into bands, such as 0 to 3%, 3% to 6%, 6% to 9%, and so on, and similarly for the negative changes. Each band can then be given a name, such as P1, P2, P3, and so on for the positive changes, and N1, N2, N3, and so on for the negative changes.

What about zero? 0% can be put into either P1 or N1; it doesn’t make a difference. What about a large change? How many classes should we have? Do we stop at P10? P15? It depends on the distribution of the data. I would recommend covering 95% of the sample with the normal 3%-increment P and N classes, and putting the remaining 5% into the last P and N classes. You could also label these last P and N classes differently, such as PR and NR (R stands for remaining).

So that is the output (the predicted variable). The output is P1 to P10 plus PR, and N1 to N10 plus NR, so 22 classes in total. Here 10 is just an example; it depends on the data distribution.
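A minimal sketch of this banding (my own illustration; the function name, the 3% band width and the 10-band cap are assumptions taken from the example above):

```python
# Map a percentage change into the P/N band labels described above:
# 3% bands P1..P10 and N1..N10, with PR/NR catching the remaining tail.
def band_label(pct_change, band_width=3.0, max_bands=10):
    """Map a percentage change (e.g. 4.2 for +4.2%) to a class label."""
    sign = "P" if pct_change >= 0 else "N"   # 0% goes into P1 by convention
    band = int(abs(pct_change) // band_width) + 1
    if band > max_bands:
        return sign + "R"                    # R = remaining (the tail 5%)
    return sign + str(band)

print(band_label(4.2))    # P2: the 3% to 6% up band
print(band_label(-0.5))   # N1
print(band_label(35.0))   # PR
```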

Wait, this 3%, 6%, etc. is measured over what period? You can take a week from today, or a month from today, etc., whichever satisfies the business requirements. Of course you will need to translate these calendar days, weeks and months into “trading days”. This is because stock price data only exists when the stock markets are trading. The stock markets close on weekends and national holidays, which are different for each country. For example: instead of saying a week we say 5 trading days, and instead of saying a month we say 20 trading days.

Independent Variables

What about the input, i.e. the independent variables? That is of course the stock prices in the last few months or years. The more recent a date is, the bigger impact it has on the future price. Therefore the more recent the period, the higher the sampling frequency should be. For example, if you have 5 years of data, do something like this: 1 to 5 years ago = only month-end dates, 3 to 12 months ago = weekly (Mondays), and the last 3 months = daily prices. The number of independent variables = (12 x 4 dates for the 4 years) + (39-40 dates for the 9 months) + (64-68 days for the 3 months, depending on country) = about 151 to 156 dates.

That’s the input, i.e. prices from 151 to 156 dates. Of course the input needs to be of the same size every time, so you’ll need to take 64 dates for the last 3 months period, and 39 dates for the 9 months period. This way you’ll always have 151 inputs, or independent variables, into your XGBoost algorithm.
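As an illustration, here is a rough sketch (my own, using made-up synthetic prices) of building such a mixed-frequency date sample with pandas:

```python
# A sketch of mixed-frequency sampling: month-ends for 1-5 years ago,
# Mondays for 3-12 months ago, and every trading day for the last 3 months,
# so that recent dates get a higher frequency.
import numpy as np
import pandas as pd

dates = pd.bdate_range(end="2022-09-30", periods=5 * 252)   # ~5 years of trading days
prices = pd.Series(np.random.default_rng(0).lognormal(size=len(dates)).cumsum(),
                   index=dates)

end = dates[-1]
one_year_ago = end - pd.DateOffset(years=1)
three_months_ago = end - pd.DateOffset(months=3)

older = prices[:one_year_ago]
monthly = older.groupby([older.index.year, older.index.month]).tail(1)  # month-ends
mid = prices[one_year_ago:three_months_ago]
weekly = mid[mid.index.weekday == 0]                                    # Mondays only
daily = prices[three_months_ago:]                                       # every trading day

sample = pd.concat([monthly, weekly, daily])
print(len(sample))   # roughly 150 input dates, as in the text above
```

Note this sketch ignores national holidays, which is why the real count varies by country.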


The X dataset consists of 5 columns: Date, HL_Avg, Volume, HL_Avg_Norm, Volume_Norm like below. Below is X[0], i.e. just the first set.

For all intents and purposes, in this article, we can consider HL_Avg to be the stock price, and HL_Avg_Norm to be the normalised stock price. Only the HL_Avg_Norm column will be used.

And the Y dataset contains Y_Avg, X_Last, Y_Diff, Y_Band, Y_Gap, Y_Days, Y_Avg_Norm, X_Last_Norm, Y_Band_Norm. It looks like this: (this is Y[0])

In this article we will only be using Y_Band, which is U1 in the above.

Y_Band is obtained by categorising Y_Diff into 5% bands.

Tuning XGBoost Algorithm

The best parameters for the XGBoost model are obtained by trying out different values for each parameter. It is best to change only 1 parameter at a time, whilst keeping all the other parameters the same. This way we know the effect of that 1 parameter.

The following parameters need to be configured: number of estimators, number of days in the data, Gamma, maximum depth, minimum child weight, subsample, learning rate, lambda, alpha.

Note that the values of the best parameters depend on the number of dates you have in the independent variables. Therefore, you should decide the number of dates first, factoring in a higher frequency for the more recent period (see the Independent Variables section above). We need to be mindful that increasing the number of days only increases the model performance/accuracy up to a certain point; afterwards it starts to make it worse. Therefore it is very important to find the right amount of data to feed into the model.

By far the most influential parameter is Gamma. Gamma is the regularisation parameter, ranging from zero to infinity. Increasing Gamma makes the model performance (in terms of accuracy) better only up to a certain point; afterwards it starts to make it worse. You should try very small numbers first, gradually increasing to bigger numbers at a “stepped” distance (at different orders of magnitude). For example: 0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50.

The second most influential parameter is the number of estimators. Like Gamma, increasing the number of estimators only increases the model performance/accuracy up to a certain point; afterwards it decreases.

So the order is: data first (number of dates, with recent dates getting a higher priority/frequency), then Gamma, then the number of estimators. Check out my article on tuning XGBoost models here. Also Laurae’s article specifically on Gamma here. All the XGBoost parameters are described here.

Forecasting the price band

Forecasting the price band is as easy as calling the predict method on the XGBoost model, giving it the test data. The output is the class ID (an integer indicating which class), so we need to convert it to the price bands like P1 to P10 in the “Predicted Variable” section above. In the code I use U1 to U5 and D1 to D5 (U stands for Up and D stands for Down, instead of using P and N), and I have 10 classes instead of 22.

Evaluation using MCC

For multiclass classification I would recommend using Matthew’s Correlation Coefficient. You can read the Wiki here, Boaz Shmueli’s article here and the Scikit reference here.

The formula is (source: link):

MCC = (c x s - Σ tk x pk) / sqrt((s² - Σ pk²) x (s² - Σ tk²))

Where:
s = number of samples
c = number of samples correctly predicted
pk = number of times class k was predicted
tk = number of times class k really/truly occurred
K = number of classes

To use it, nothing could be simpler. We just need to import it from Scikit Learn and use it:
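For example (the label lists here are my own short illustration):

```python
from sklearn.metrics import matthews_corrcoef

# y_actual and y_predicted are lists of class IDs
y_actual = [3, 8, 2, 5, 9]
y_predicted = [3, 8, 1, 5, 0]
mcc = matthews_corrcoef(y_actual, y_predicted)
print(round(mcc, 4))  # 0.6
```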

Save the result for analysis

I recommend you save the output of the model (the predicted values) and the expected result from the test data (the actual values), along with the MCC, for further analysis.

Saving it is as easy as this:
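A minimal sketch (the variable names, band labels and MCC value here are my own illustration, not the actual code):

```python
# Save the actual bands, the predicted bands and the MCC to a CSV file,
# ready to be opened in Excel for further analysis.
import csv

y_actual = ["U1", "D2", "U1", "D1"]      # expected result, from the test data
y_predicted = ["U1", "U2", "U1", "D1"]   # model output, mapped to band labels
mcc = 0.66                               # illustrative value

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["actual", "predicted"])
    writer.writerows(zip(y_actual, y_predicted))
    writer.writerow(["MCC", mcc])
```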

You can then open the CSV file in Excel and do analysis like this: (or use Seaborn / Matplotlib)

18 June 2022

Reporting – Stability or Correctness

Filed under: Data Warehousing — Vincent Rainardi @ 6:25 am

There is a benefit of restating last month’s reports using a more recent data. There is also a benefit of not changing the content of any report after it is published. In this article I’d like to outline the pros and cons of both approaches.

As always it is better to learn by example. Say on 1st April 2022 you were calculating the carbon footprint of company A as of 31 March 2022 (the last 12 months). You used the 2020 carbon data as that was the latest carbon data available for this company.

One month later, on 2nd May 2022 the 2021 carbon data is available. So you recalculated those 12 numbers (last 12 months carbon footprint, from 30/4/21 to 31/3/22) using this latest carbon data.

We can see that this approach makes sense: we’d want to restate the March report to use the more recent data, to make it more accurate.

But this approach comes with a cost: our trustworthiness. When version 1 of the report came out on 1st April 2022, the numbers were used by many users in the firm. When the 2nd version of the report came out on 2nd May, some of these users didn’t get it, so they kept using the old numbers. Of those who got the new version, some were not sure what had changed. Was it just the carbon data that changed, or something else too?

Say every first working day of the month you publish this report (the last 12 months of carbon data for thousands of companies). Then users will be confused, because on 2nd May you not only restated the March report, but also published the April report. And you restated the February, January and December reports too. Say users can access these reports on the intranet using Power BI, and on the report there is a drop-down list to change the “As of” month.

This approach causes confusion amongst the users, because the March numbers keep changing. The February, January and December numbers keep changing too. To prevent that confusion, some companies implement the “freeze after publish” approach. Once the December report was published on 4th Jan 2022, it is frozen. Once the March report was published on 1st April, it is frozen. Its numbers won’t be changed. Even if in April the data that was used to make that March report changed, the numbers in the March report are not restated.

This second approach promotes confidence amongst the users. The numbers in all reports are stable, and that is why they are trustworthy. Even though the numbers may be wrong now, on the day the report was produced they were all correct. Reports were produced using the best data available at the time.

Which approach is better? The second one, because it creates a stable and trustworthy environment amongst the users, which takes precedence over correctness. I found that “the report was correct on the day it was produced” is quite an acceptable principle for many users. And a few months later the incorrectness is only small: only a very small amount of data changed, so that March report is only slightly incorrect. That’s why it is more important to freeze the report and get the stability and trust (and have 99% correctness) than to keep correcting past reports, sacrificing stability and trust to obtain 100% correctness.

Of course you need to be diplomatic and ask the users what they want. But if you’ve been to many companies and seen what happens with both approaches, you also have an obligation to inform the users about the consequences of each approach before they make a decision. And of course, my recommendation above is not universal. You need to look at individual cases one by one; each case is different. There are cases where being 100% correct is more important than stability.

22 May 2022

Watermark in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:54 am

When we load data into a target table, we write down the last row of data we loaded from the source table. That’s watermark. Say we have 5 rows in the source table, and we loaded all of them to the target table. So we write down “5” (meaning that the last row we loaded was row 5), see the image below:

So now in the target table we have 5 rows.

Note: The above image is slightly incorrect. In data warehousing, the pump doesn’t “withdraw” the water from the source (pumping it out). Instead it “copies” the water (the water in the source container is not gone but is duplicated).

Let’s say there are 3 more rows in the source table. So now there are 8 rows in the source. We load from row 6 of course, because the last row we loaded was row 5. And we need to load the 3 new rows from source into the target, see the image below:

After we loaded the 3 new rows, we update the watermark to 8, see the image below:

This mechanism is well documented in the Microsoft Azure web page for ADF pipeline (link):

Note here that step 2 is done before step 3. This ensures that the data loading is watertight, meaning any row created in the source after the watermark was captured in step 2 will be loaded next time.
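The mechanism can be sketched in a few lines (my own illustration using SQLite as an in-memory stand-in for the real source and target databases):

```python
# Watermark-based incremental loading: read the old watermark, capture the
# new watermark from the source BEFORE copying, copy only the rows in
# between, then save the new watermark. Any row created in the source after
# the new watermark was captured is picked up on the next run.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source (row_id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE target (row_id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE watermark (last_row INTEGER);
    INSERT INTO watermark VALUES (0);
""")

def incremental_load(con):
    old_wm = con.execute("SELECT last_row FROM watermark").fetchone()[0]  # step 1
    new_wm = con.execute("SELECT MAX(row_id) FROM source").fetchone()[0]  # step 2
    if new_wm is None:
        return  # empty source, nothing to load
    con.execute("""INSERT INTO target
                   SELECT * FROM source WHERE row_id > ? AND row_id <= ?""",
                (old_wm, new_wm))                                         # step 3
    con.execute("UPDATE watermark SET last_row = ?", (new_wm,))           # step 4

# First load: 5 rows in the source, so the watermark becomes 5
con.executemany("INSERT INTO source VALUES (?, ?)",
                [(i, f"row {i}") for i in range(1, 6)])
incremental_load(con)

# 3 more rows arrive; only rows 6-8 are copied, and the watermark becomes 8
con.executemany("INSERT INTO source VALUES (?, ?)",
                [(i, f"row {i}") for i in range(6, 9)])
incremental_load(con)

print(con.execute("SELECT COUNT(*) FROM target").fetchone()[0])   # 8
print(con.execute("SELECT last_row FROM watermark").fetchone()[0])  # 8
```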

Also note above that Microsoft uses the word “watermark” for ADF pipeline (as opposed to “delta mark”).

But where did the word “watermark” come from? Well if you see on the Wikipedia (link), the word watermark means an obscured image on paper (almost hidden). If you look at a £20 note, you’ll see a £20 watermark, see the red arrow below:

But that is not the word watermark that we use in data loading 🙂

The word watermark we use in data loading is from this:

It is a water level stick in a dam or a river, and there is a marking on the stick to mark high water level in the past event such as a flood (source: link, link).

Sometimes the watermark line is drawn on the wall of a house, to mark the level of flooding that hit that house. We can see below left that in 1893 the water level was 41 inches, whereas on the right there are various water marks going back centuries, the most recent ones being 1970 and 1982 (source: link, link). These are known as “high water marks”.

Below left is the high water mark in the Old Town Hall in the city of Passau, Germany (link, link). And below right is the record flood levels at the Boat Inn in Coalport, near Telford, England (link):

That is where the word “watermark” in data loading comes from! Not the hidden writing on money, but the “high water marks” due to flooding.

Now you know what a “watermark” in data warehousing is, and where it is from 🙂

30 March 2022

Infrastructure in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:43 am

We often forget that underpinning the data mart/lake/warehouse are the storage, the network and the security. Without them users can’t access the data; we can’t even store the data! We are so dependent on them, and yet we spend so little time on them. Today let’s think about them. About the infrastructure.

When we talk about the infrastructure in Data Warehousing, there are 3 things that we need to look at:
1. Storage
2. Network
3. Security

1. Storage

Storage means the place where we store the data, e.g. a file storage, a blob storage, a data lake, a database, etc. It all depends on your architecture. So in Azure or AWS we need to create a storage account, or a data lake store, or a database.

This is what we call them in Azure:

  • Blob storage
  • File share
  • Data lake store
  • SQL database

1.1. Storage – SQL Database

If it’s a SQL database, then we need to think about the sizing, i.e. how much computing power, I/O and memory is allocated to it. In Azure we call this a DTU (database transaction unit), a unit of measure for CPU, memory, reads and writes together. The default tier is S0, which is 10 DTUs with 250 GB storage. An S0 has very limited power; it’s very slow. For “real” applications you’ll need an S4 or S6; for lightweight applications an S3 (link).

  • S0 is 10 DTU with 250 GB storage, $18/month
  • S3 is 100 DTU with 250 GB storage, $184/month
  • S4 is 200 DTU with 250 GB storage, $368/month
  • S6 is 400 DTU with 250 GB storage, $736/month

It’s probably easier to visualise DTU using a “bounding box” like this (source: link):

In my experience, a process consisting of a combination of stored procedures and data loading that ran for 2-3 hours on an S0 ran in 1-2 minutes after I changed it to an S6.

1.2. Storage – Files

For storing files we first create a “storage account” (link). We can choose either GPv2 storage account which is hard drive based or FileStorage storage account which is SSD based (a lot faster). We also need to choose between local redundancy (3 copies in one physical location), zone redundancy (3 copies in 3 different zones in 1 region) or geo redundancy (6 copies in 2 different regions), see link.

Then we create a container, which is like a folder (link). Then we upload the files into this container. To enable users to access the files from Windows explorer (like \\location\folder\) we need to create a file share (link).

You can also create an Oracle database in Azure (link), but you need to create it on a VM (virtual machine). If you use AWS, the storage is called “S3” (Simple Storage Service). What we call a “container” in Azure is called a “bucket” in S3. Basically it means a group of files (link).

1.3. Storage – Data Lake

In Azure, the storage for data lake is called ADLS2 (Azure Data Lake Storage Generation 2, link). It can be used for Hadoop (HDInsight), Synapse or Databricks. It’s very scalable (exabytes) and cost effective ($0.019 per GB per month, for standard storage, or “hot storage”).

There are 4 access tiers in ADLS: hot, cool, archive and premium. Hot is the standard. Cool is cheaper per GB and is intended for data that is accessed infrequently and stored for at least 30 days. Archive is the cheapest and is intended for data that is rarely accessed and stored for at least 180 days (it must be rehydrated before it can be read). Premium has low latency (meaning fast).

2. Network

Network means access to that storage, i.e. how users can connect to that storage or database. How the reports can access that database or data lake.

2.1 Network for a storage account

After we have a storage account, we configure the “networking” of that storage account. By that I mean we click on the Networking pane and configure which IP ranges are allowed to access this storage account. This is done by configuring the “firewall rules” to allow only certain IP ranges in or out.

Storage accounts can also have private endpoints, which make them accessible over a private connection instead of the public internet. It means that we assign a private IP address from the VNet to the storage account. A VNet (virtual network) is an isolated private network in Azure, a group of computers and devices that can communicate with each other. We can put a VM (virtual machine), a storage account and a SQL database into one VNet so they can access each other.

The concept of private endpoint can also be applied to an application, like this (source: link):
(click the image to enlarge)

We can see above that the Europe West and Europe North VNets are both using private IP addresses (meaning 10.x or 192.168.x), communicating via private endpoints (the green arrows) to a web app front end (link).

2.2. Network for a SQL database

In Azure, a SQL database also has a Network pane on which we can configure which IP ranges are allowed to access this database. This is done by configuring the virtual network rules (VNet rules) or IP rules to allow certain IP ranges to go in, or to go out.

VNet rules are applied to the whole SQL server, not just to one database. Whereas IP rules can be applied to either the SQL server level or the database level.

Networking wise, in front of a SQL database there is a “database gateway” (the green boxes below, source: link). A database gateway is an Azure service which has a public IP address and listens on port 1433 (the SQL port).

The gateway then redirects the traffic to the right cluster (cluster 54). Inside the cluster, the traffic is forwarded to the right database. This is called “redirect mode”. The alternative is called “proxy mode”, which means all traffic must keep going through the gateway, and the gateway gets the traffic to the right cluster.

Of course you don’t have to have a cluster; in most cases the SQL server is “stand-alone” (meaning not in a cluster). A database cluster means a group of SQL servers with high availability. Each SQL server is called a “node”. One node (called the primary node) is copied to the other nodes (called secondary nodes), like this: (source: link)

3. Security

Security is about allowing users to access the data, and protecting the data.

3.1. Data Access

Data Access is about 2 things: authentication and authorisation.

  • Authentication means that we look up the incoming user in our Active Directory and verify the password.
  • Authorisation means that we look up what the user is allowed to access.

Bear in mind that a user can be a service account, not just a human user. And bear in mind that instead of a user ID and password, a service account can also have a “key” and a “secret”, which are normally stored in a “key vault”.

If you have a database, and a user wants to read a table in this database, then the system will first find out who this user is. This is done by looking the user up in the Active Directory (the list of all the computers, servers and users in our organisation). It checks whether the password supplied is correct (or the “secret” for the “key”).

We can also create a “guest user” in our Active Directory (link). A guest user “signs in” at their own company, then comes into our network as a temporary user. So we verify their user ID (or email address) but not their password. But remember, only users from that company are allowed in (“company” being defined as a range of IP addresses).

3.2. Data Protection

Data protection is about 2 things: protecting the data storage, and protecting the data transport.

Data storage means the place where our data is stored, i.e. the storage accounts, the data lake, or the databases. We protect it by allowing only certain users to access certain tables (or views, or stored procedures). We do that using Active Directory.

Data transport means that when we transfer data through the internet, no one can know its content, and no one can eavesdrop either. We do this by encrypting the data, and by securing the “channel” we use to transport the data. “Securing” means that the public can’t access this channel; only authorised users are allowed to access it.

Data Architect

What really strikes me is that as a data architect you need to know those things: the networking, the security and the storage. Earlier in my career, when I had just started working, I was a network guy. Back in uni I installed a Novell Ethernet network in my lab and configured it. And in my first job I was connecting IBM token ring terminals for an AS/400 network. That was my job when I started. Well, not for long, because straight after that I became a COBOL developer on AS/400, as well as a SQL developer and a C# developer (for security access). Well, when you’ve just started working you’d do anything your boss asked you to, right? 🙂

So when later on in my career I became a data architect, I didn’t have much trouble understanding the terminology. True, it’s been 20 years, so I needed to read up a bit. But the basic understanding was already there. Networking and security don’t change much. They have a “new face”, new names, but the principles are still the same.

But yes, my point is that as a data architect you need to do those things. You need to design the access to your data, how to secure the storage and the transport, and how you are going to store the data. Basically those 3 things above: the storage, the network and the security. It’s part of the job of a data architect. We are not data modellers, or data analysts. We are data architects, meaning that we design the data models, and we architect the data storage, data loading and data access, and everything around them. Not just the data, but also the processes and the architecture.

21 February 2022

NoSQL in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 3:42 am

Where does NoSQL sit in the data warehousing landscape? Does it just sit on the bench, having no role? Do we use it in the staging area? Or does it function as one of the main data stores?

NoSQL has many faces. Document databases, key-value pairs and graph databases are all NoSQL. So obviously its usage depends on what type of data we have to store. If our data is about individual people (or individual companies) then it lends itself to being stored in a graph database, because people have connections (ditto companies). Or you can store them (people and companies) as documents.

That is if you focus on one of them at a time. But if you’d like to process them en masse, aggregate them, etc., then tabular storage has its advantages. KVP (key-value pair) has a distinct advantage when it comes to schema-on-read: data from any source can be stored as key-value pairs in the data lake, thus deliberately avoiding modelling the data.

The idea of schema-on-read is simple, but a bit difficult to agree with. But it is key to understanding NoSQL’s role. The basic premise of a data lake is that we don’t model the data when storing it in the lake; we model it when we use it. KVP is ideal for this, i.e. as a data lake store. All the attributes from all tables are stored in just one column. Obviously this requires a good data dictionary and data lineage, not just at entity level (what they contain, how they are used, etc.) but down to each attribute.
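As an illustration (my own sketch, not any specific product’s API), storing attributes as key-value pairs and pivoting them back into a record at read time looks like this:

```python
# Schema-on-read with key-value pairs: every attribute from every source
# table lands in one generic structure, and the modelling is deferred to
# read time, when the pairs are pivoted back into a record.
rows = [
    {"entity": "company", "entity_id": "C1", "attribute": "name",   "value": "ABC Corp"},
    {"entity": "company", "entity_id": "C1", "attribute": "sector", "value": "Utilities"},
    {"entity": "company", "entity_id": "C1", "attribute": "isin",   "value": "GB00B1234567"},
]

# At read time we apply the model: pivot the pairs back into one record
company = {r["attribute"]: r["value"] for r in rows if r["entity_id"] == "C1"}
print(company["sector"])  # Utilities
```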

Document DB’s role in data warehousing (or any data lake landscape, for that matter) is very different. A document DB has models. Every entity has attributes, and that becomes the basis of the data model. But unlike the tabular model, in a document DB individual entities can have different attributes. Let’s look at a security (as in financial instrument) as an example. The attributes of an option are very different to those of a CDS. An option has a “strike price” attribute, for example, and a CDS doesn’t. A CDS has a “reference entity” attribute, whereas an option doesn’t.
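The security example can be sketched as two documents (the field names and values are my own illustration, not real market data):

```python
# Document-style modelling: each security carries its own attributes.
# An option has a strike price; a CDS has a reference entity instead.
import json

option = {
    "security_id": "OPT001",
    "type": "option",
    "underlying": "ABC Corp",
    "strike_price": 105.0,           # options have a strike price
    "expiry": "2023-06-16",
}
cds = {
    "security_id": "CDS001",
    "type": "cds",
    "reference_entity": "XYZ Plc",   # a CDS has a reference entity, no strike
    "notional": 10_000_000,
    "maturity": "2027-12-20",
}

for doc in (option, cds):
    print(json.dumps(doc))
```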

In a document DB entities have relationships with each other. If it is one to many, then the parent IDs are put on the child entities. If it is many to many, then we use a “joining document”, or we can do embedding and referencing (see ref #6 below).

There are other types of NoSQL too: wide column stores, tuple stores, object stores, triple stores, multi-model DBs. We really are limiting ourselves if we only do relational databases. There are many, many different types of data store outside relational DBs. And as data architects, it is our job to know them well. Even if eventually we use SQL for a project, we then know very well why we are not implementing it on the other data stores.

A “wide column store” uses rows and columns like in a relational database, but the column names and data types can vary from row to row. The most famous wide column store is probably Cassandra, but both Amazon and Azure have them.

So where does that leave us? Where do we use NoSQL in data warehousing or a data lake? KVP is useful as a data lake store. A graph database would be useful as the main data store if our data is about individual objects (such as a person, security, or company) which are not processed/queried en masse. A document DB is useful for modelling entities which have variable attributes, such as securities. So NoSQL can be used as a data lake store, or as one of the main data stores.


  1. Azure Cosmos DB – Documentation: link
  2. Azure Cosmos DB – Key Value Pair: link
  3. Azure Cosmos DB – Document DB: link (MongoDB)
  4. Azure Cosmos DB – Graph DB: link (Gremlin)
  5. Azure Cosmos DB – Designing Data Structure: link
  6. Azure Cosmos DB – Data Modelling: link
  7. Azure Cosmos DB – Wide Column Store: link (Cassandra)
  8. Amazon NoSQL – Documentation: link
  9. Amazon NoSQL – Key Value Pair: link (DynamoDB)
  10. Amazon NoSQL – Document DB: link
  11. Amazon NoSQL – Graph DB: link (Neptune)
  12. Technical Review of Two wide column store NoSQL Databases: Amazon DynamoDB and Apache Cassandra, by Abigail Udo: link
  13. Amazon Athena: link
  14. Oracle NoSQL Database Documentation: link
  15. Wikipedia on NoSQL: link
  16. The curse of Schema-on-read in Data Lakes, by Sandeep Uttamchandani: link
