Data Warehousing, BI and Data Science

30 May 2019

Entropy and Information Gain in Decision Tree

Filed under: Data Warehousing — Vincent Rainardi @ 7:28 pm

Decision Tree is one of the most popular algorithms in machine learning. It is relatively simple, yet able to produce good accuracy. But the main reason it is widely used is its interpretability: we can see quite clearly how it works, and we can understand why it makes the predictions it does.

Decision Tree is a supervised machine learning algorithm. So we train the model using a dataset, in order for it to learn. Then we can use it to predict the output. It can be used for both regression and classification. Regression is when the output is a number. Classification is when the output is a category.

In this article I would like to focus on Entropy and Information Gain, using investment funds as an example. Entropy is the level of disorder in the data.

Entropy

In thermodynamics, entropy is the level of disorder or randomness in the system. Similarly, in data analytics, entropy is the level of disorder or randomness in the data. If we have 100 numbers and all of them are 5, then the data is in very good order. The level of disorder is zero. There is no randomness in the data. Everywhere you look you get 5. The entropy is zero.

If these 100 numbers are all different, then the data is in a state of disorder. The level of randomness is high. When you pick a number, you might get 4, or 7, or any other number. You don't know what you are going to get. The data is "completely" random. The level of randomness in the data is very high, so the entropy in the data is very high.

The distribution of these different numbers in the data determines the entropy. If there are 4 possible numbers and they are distributed 25% each, the entropy is at its highest. But if they are distributed 97%, 1%, 1%, 1% then the entropy is very low. And if it's 70%, 10%, 10%, 10% the entropy is somewhere in between (medium).

For a two-class outcome, which is what we use in this article, the maximum value for entropy is 1 and the minimum value is 0. (In general, with k possible values the maximum is log base 2 of k.)

Information Gain

Now that we have a rough idea of what entropy is, let’s try to understand Information Gain.

A Decision Tree consists of many levels. In the picture below it consists of 2 levels. Level 1 consists of node A. Level 2 consists of node B and node C.

1. Two Branches of Entropy

Information Gain is the decrease in entropy from one level to the next. Node B has entropy = 0.85, a decrease of 0.1 from Node A's entropy of 0.95. So Node B has an information gain of 0.1. Similarly, Node C has an information gain of 0.95 – 0.75 = 0.2.

When the entropy goes down from 0.95 to 0.75, why do we say that we are gaining information? Lower entropy means the data in the node is purer, i.e. less mixed between the possible outcomes, so after the split we are less uncertain about the outcome. We know more than we did before the split. That "additional" knowledge about the outcome is the Information Gain.

Calculating Entropy

Now we know what Entropy is, and what Information Gain is. Let us now calculate the entropy.

First let's find the formula for entropy. In thermodynamics, entropy is the logarithmic measure of the number of possible states of the system.

Entropy is the average information content (link). The information content of an event E1 is the log of 1/(the probability of E1). The information content is called I, so I1 = log of (1/p(E1)).

If we have another event (E2), the information content is: I2 = log of (1/p(E2)).

The average of the information content I1 and I2 (i.e. the entropy) is:
the sum of (the information content of each event x the probability of that event occurring)
= I1 x p(E1) + I2 x p(E2)
= log of (1/p(E1)) x p(E1) + log of (1/p(E2)) x p(E2)
= -p(E1) x log of p(E1) - p(E2) x log of p(E2)

If we have n events, the entropy is:
= -sum over i of (p(Ei) x log of p(Ei))

Fund Price

Now that we know how to calculate entropy, let us try to calculate the entropy of the price of a fund going up or down in the next 1 year.

2a. Fund price table (top)
2b. Fund price table (bottom)

In the above table, the last column is the price of a fund 1 year from now, which can be higher or lower than today. This is denoted with “Up” or “Down”. This price is determined from 4 factors or features:

  1. The performance of the fund in the last 3 years (annualised, gross of fees).
    This past performance is divided into 3 buckets: "Down (less than zero)", "Up between 0 and 5%", and "Up more than 5%".
  2. The interest rate, for example LIBOR GBP 1 Year today.
    Today's interest rate is compared with the interest rate 1 year ago, and divided into 3 buckets: today it's higher than 1 year ago, lower than 1 year ago, or the same (constant).
  3. The value of the companies that the fund invests in, comparing the book value to the share price of the company today, and also the earnings (income) the companies make compared to the share price (cyclically adjusted). This company value factor is divided into 3 buckets: overvalued, undervalued and fair value.
  4. The ESG factors, i.e. Environment, Social and Governance factors such as pollution, remuneration, the board of directors, employee rights, etc. This is also divided into 3 buckets, i.e. high (good), medium, and low (bad).

The Four Factors

1. Past performance

Funds which have been going up a lot, generally speaking, have a tendency to revert to the mean, meaning that they are going to go down. But another theory says that if the fund price has been going up, then it has a tendency to keep going up, because of momentum. Who is right is up for debate. In my opinion the momentum principle has a stronger effect than the "reversion to the mean" principle.

2. Interest rate

The value/price of the fund is not only affected by the companies or shares in the fund, but also by external factors. The interest rate represents these external factors. When the interest rate is high, share price growth is usually constrained because more investor money is held in cash. On the contrary, when the interest rate is low, people don't hold cash and invest in shares (or bonds) instead.

The factor we are considering here is actually the change in the interest rate, but the impact is generally the same. Generally speaking, if the interest rate is going up then investment in equity decreases, putting pressure on share prices and resulting in lower share prices.

3. Value

If the company valuation is too high, investors become concerned, psychologically afraid that the price will go down. This concern creates pressure on the share price, and the share price eventually goes down.

On the contrary, if the company valuation is low compared to similar companies in the same industry sector and the same country (and of similar size), then investors feel that the stock is cheap and are more inclined to buy. This naturally pushes the price up.

4. ESG

Factors like climate change, energy management, health & safety, compensation, product quality and employee relations can affect the company value. Good ESG scores usually increase the value of the companies in the fund, and therefore collectively increase the value of the fund.

On the contrary, concerns such as accidents, controversies, pollution, excessive CEO compensation and issues with auditability/control on the board of directors are real risks to a company's future and therefore affect its share price.

Entropy at Top Level

Now that we know the factors, let us calculate the Information Gain for each factor (feature). This Information Gain is the Entropy at the Top Level minus the Entropy at the branch level.

Of the total of 30 events, there are 12 “Price is down” events and 18 “Price is up” events.

The probability of the price of a fund going “down” event is 12/30 = 0.4 and the probability of an “up” event is 18/30 = 0.6.

The entropy at the top level is therefore:
-U*Log(U,2) -D*Log(D,2) where U is the probability of Up and D is the probability of Down
= -0.6*Log(0.6,2) -0.4*Log(0.4,2)
= 0.97
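
As an illustration, here is how the top-level entropy could be calculated in SQL. This is only a minimal sketch; it assumes the 30 events are in a table called Fund_Price with an outcome column containing 'Up' or 'Down' (both names are made up for this example).

-- Top-level entropy, assuming a hypothetical table Fund_Price(outcome) with 30 rows:
-- 18 rows with outcome = 'Up' and 12 rows with outcome = 'Down'.
SELECT -SUM(p * LOG(p, 2)) AS entropy                    -- = 0.97
FROM
( SELECT COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS p     -- probability of each outcome
  FROM Fund_Price
  GROUP BY outcome
) AS t;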

Information Gain of the Performance branch

The Information Gain of the Performance branch is calculated as follows:

3a. Information Gain for Performance branch

First we calculate the entropy of the performance branch for “Less than 0”, which is:
-U*Log(U,2) -D*Log(D,2) where U is the probability that the price goes up when the performance is less than zero, and D is the probability that the price goes down when the performance is less than zero.
= -0.5 * Log(0.5,2) -0.5 * Log(0.5,2)
= 1

Then we calculate the entropy of the performance branch for “0 to 5%”, which is:
= -0.56 * Log(0.56,2) -0.44 * Log(0.44,2)
= 0.99

Then we calculate the entropy of the performance branch for “More than 5%”, which is:
= -0.69 * Log(0.69,2) -0.31 * Log(0.31,2)
= 0.89

Then we calculate the probabilities of the "Less than 0", "0 to 5%" and "More than 5%" buckets, which are:
8/30 = 0.27, 9/30 = 0.3 and 13/30 = 0.43

So if Performance was the first branch, it would look like this:

2c. Performance as the first branch

Then we sum the weighted entropy for “Less than 0”, “0 to 5%” and “More than 5%”, to get the total entropy for the Performance branch:
1 * 0.27 + 0.99 * 0.3 + 0.89 * 0.43 = 0.95

So the Information Gain for the Performance branch is 0.97 – 0.95 = 0.02
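
The same calculation can be expressed in SQL. Again this is only a sketch, assuming the same hypothetical Fund_Price table with a performance column (the bucket) and an outcome column, and using the top-level entropy of 0.97 from above.

-- Information gain of the Performance feature (table and column names are assumptions).
WITH counts AS
( SELECT performance, outcome, COUNT(*) AS n
  FROM Fund_Price
  GROUP BY performance, outcome
), probs AS
( SELECT performance, n,
         n * 1.0 / SUM(n) OVER (PARTITION BY performance) AS p   -- p(outcome) within each bucket
  FROM counts
), branch AS
( SELECT performance, SUM(n) AS n_branch,
         -SUM(p * LOG(p, 2)) AS entropy                          -- entropy of each bucket
  FROM probs
  GROUP BY performance
)
SELECT 0.97 - SUM(entropy * n_branch) * 1.0 / SUM(n_branch) AS information_gain   -- = 0.02
FROM branch;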

Information Gain for the Interest Rate, Value and ESG branches

We can calculate the Information Gain for the Interest Rate branch, the Value branch and the ESG branch the same way:

3b. Information Gain for Interest Rate branch

3c. Information Gain for Value branch

3d. Information Gain for ESG branch

Why do we calculate the entropy? Because we need entropy to know the Information Gain.

But why do we need to know the Information Gain? Because the decision tree would be more efficient if we put the factor with the largest Information Gain as the first branch (the highest level).

In this case, the factor with the largest Information Gain is Value, which has the Information Gain of 0.31. So Value should be the first branch, followed by ESG, Interest Rate and the last one is Performance.


10 May 2019

Analysis Services Tabular Model

Filed under: Data Warehousing — Vincent Rainardi @ 6:03 pm

What is it?
It is an in-memory database, managed by the Analysis Services Vertipaq engine.

Who uses it?
Power BI and Excel Power Pivot.

Why use it?
To provide fast and simple access to relational data.

When was it introduced?
2012 by Microsoft

How to create it?
Using SQL Server Data Tools (SSDT).

Where is it stored?
in SQL Server Analysis Services server or Azure Analysis Services.
The data is compressed when stored.

What are the components?
– Tables
– Relationships
– Measures
– Hierarchies
– KPIs
– Partitions

22 March 2019

Bridge table

Filed under: Data Warehousing — Vincent Rainardi @ 8:22 am

In dimensional modelling, a bridge table is a table which connects a fact table to a dimension when one fact row needs to link to more than one row in the dimension (a multi-valued dimension).

The best way to learn the complexity of this bridge table is using an example, so let’s get down to it.

Driver Bridge Table

An example of a fact table that needs a bridge table is the insurance claim fact table in the car insurance industry. The grain of this fact table is one row for each claim (this is different to the claim payment fact table, where the grain is one row for each claim payment). One of the dimension keys in this claim fact table is the policy_key. Another dimension key is the insured_driver_key.

In car insurance there is only one policy per claim, so there is no problem with the policy key. There is usually one driver per claim, but there could be two drivers per claim. It is rare, but occasionally there is a third driver. I mean drivers covered by our policy, not the third party's drivers. The second driver on the policy is called the named driver (ditto the third driver). In other companies there could be a fourth driver, but for this case let us set the maximum number of drivers to three.

The function of the bridge table is to connect the claim fact table to the driver dimension. Because there could be more than one driver per claim, the surrogate key column is not called driver_key, but driver_group_key. The bridge table is called Bridge_Driver_Group, which has only two columns:

  • driver_group_key: connects to the Insured_Driver_Group_Key column on the claim fact table
  • driver_key: connects to the driver_key column on the driver dimension table

A Sample Claim

Now consider a claim where the claim fact table is like this:

Fact_Claim table:
claim_key|claim_date_key|insurer_key|policy_key|insured_driver_group_key|initial_claim_amount|adjustment_amount|final_claim_amount|excess_amount|payout_amount
1178|20100518|298|32839|88|16000|1200|14800|500|14300
3122|20180725|397|45578|88|32000|4000|28000|1000|27000
3123|20181108|267|45642|89|51000|3000|48000|1500|46500

Bridge_Driver_Group:
driver_group_key|driver_key
88|188
89|302
89|304

Dim_Driver:
driver_key|driver_name|date_of_birth|driving_licence_number
188|Chloe Dunn|1980-08-08|3824920
302|Peter Keller|1971-06-05|4503532
304|Samantha Keller|1970-12-07|4507651

The key question in a car insurance claim is how much in claims a driver has made, across all policies and across all insurers (insurers share claim data). This figure is used to determine the premium discount for the following year. The more recent a claim is, the more weight it has on the next year's premium. Chloe Dunn has 2 claims totalling £42,800 (before excess). The first one was on 18th May 2010 and the second one was on 25th July 2018.

Type and Weight Columns on the Bridge Table

But Peter Keller shares a policy with his wife, Samantha Keller. We know that between them they have claimed £48,000, but how much each of them claimed is what we need to know.

To do this we add a weight column and a driver_type column to the bridge table. There is a business rule to determine the weight, for example: the main driver gets 2/3 and the second driver gets 1/3. If there are 3 drivers, the main driver gets 50%, and the second and third get 25% each. The rule depends on the company; each company has a different rule.

Bridge_Driver_Group:
driver_group_key|driver_key|driver_type|weight
88|188|single driver|1.000000
89|302|main driver|0.666666
89|304|second driver|0.333333
90|405|main driver|0.500000
90|406|second driver|0.250000
90|407|third driver|0.250000

Now we know how much claim is attributed to Peter Keller. It is 2/3 of £48,000 = £32,000.
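
This attribution can be done with a simple join through the bridge table. The query below is only a sketch, using the table and column names from the example above; the final claim amount is multiplied by the weight before summing per driver.

SELECT d.driver_name,
       SUM(f.final_claim_amount * b.weight) AS attributed_claim_amount
FROM Fact_Claim f
JOIN Bridge_Driver_Group b ON f.insured_driver_group_key = b.driver_group_key
JOIN Dim_Driver d          ON b.driver_key = d.driver_key
GROUP BY d.driver_name;
-- Chloe Dunn: (14800 + 28000) x 1 = 42800, Peter Keller: 48000 x 2/3 = 32000, Samantha Keller: 48000 x 1/3 = 16000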

The Group Dimension

Note that the bridge table above has many-to-many relationship with the fact table. Analysis Services doesn’t like many-to-many relationships. In Analysis Services modelling, a fact table can’t be linked to a bridge table like above. Instead, a fact table must be linked to a dimension table in a one-to-many relationship. This dimension table is called a Group dimension, like this:

Fact_Claim table: (same as before)
claim_key|claim_date_key|insurer_key|policy_key|insured_driver_group_key|initial_claim_amount|adjustment_amount|final_claim_amount|excess_amount|payout_amount
1178|20100518|298|32839|88|16000|1200|14800|500|14300
3122|20180725|397|45578|88|32000|4000|28000|1000|27000
3123|20181108|267|45642|89|51000|3000|48000|1500|46500

Dim_Driver_Group: (this is the Group dimension)
driver_group_key|driver_group_name
88|88
89|89

Bridge_Driver_Group: (same as before)
driver_group_key|driver_key|driver_type|weight
88|188|single driver|1.000000
89|302|main driver|0.666666
89|304|second driver|0.333333

Dim_Driver: (same as before)
driver_key|driver_name|date_of_birth|driving_licence_number
188|Chloe Dunn|1980-08-08|3824920
302|Peter Keller|1971-06-05|4503532
304|Samantha Keller|1970-12-07|4507651

Account and Diagnosis Bridge Tables

Another classic example of a bridge table is in retail banking: the bridge table from account to customer. Another famous example by Ralph Kimball and Margy Ross is the multi-valued diagnosis bridge. For this please read The Data Warehouse Toolkit chapter 13; the account-to-customer bridge table is in chapter 9 of the same book. I don't want to repeat them here as they have done an excellent job of explaining them there. Note that the book I quote above is the second edition; for other editions please look in the index for "bridge tables".

Policy Dimension

I just realised that The Data Warehouse Toolkit also mentions a bridge table for insured drivers (in chapter 15). But the difference is that the driver dimension is linked to the policy dimension, not to a driver group dimension. Also, the insurance example in the Kimball book is about a premium fact table, not a claim fact table.

The reason it is possible to use a policy dimension instead of a driver group dimension is that in car insurance the drivers are attached to the policy. So I think Kimball's approach is better, because it eliminates the need to create the driver group dimension, which doesn't exist in the real world. But the alternative of not using any dimension at all is also possible, i.e. we attach the bridge table directly to the fact table, like the first example I mentioned above.

If we replace the driver group dimension with a policy dimension, what we get is something like this:

Fact_Claim table: (same as before)
claim_key|claim_date_key|insurer_key|policy_key|insured_driver_group_key|initial_claim_amount|adjustment_amount|final_claim_amount|excess_amount|payout_amount
1178|20100518|298|32839|88|16000|1200|14800|500|14300
3122|20180725|397|45578|88|32000|4000|28000|1000|27000
3123|20181108|267|45642|89|51000|3000|48000|1500|46500

Dim_Policy: (previously this was Dim_Driver_Group)
policy_key|policy_number|inception_date|last_renewal_date|expiry_date|policy_holder|cover_type|european_cover|effective_date|expiry_date|active_flag
48523|NK98402|2015-08-12|2018-08-12|2019-08-11|Chloe Dunn|comprehensive|Y|2015-08-07 00:00:00|9999-12-31 23:59:59|Y
63291|NK84826|2014-11-01|2018-11-01|2019-10-31|Peter Keller|third party|N|2014-10-26 00:00:00|9999-12-31 23:59:59|Y

Bridge_Driver_Group: (replace the first column with policy_key)
policy_key|driver_key|driver_type|weight
48523|188|single driver|1.000000
63291|302|main driver|0.666666
63291|304|second driver|0.333333

Dim_Driver: (same as before)
driver_key|driver_name|date_of_birth|driving_licence_number
188|Chloe Dunn|1980-08-08|3824920
302|Peter Keller|1971-06-05|4503532
304|Samantha Keller|1970-12-07|4507651

We need to bear in mind that in car insurance, the policy dimension is the biggest dimension. It has a lot of columns and a lot of rows. And it is a type 2 dimension. The last renewal date and the expiry date change every year, and therefore the policy_key column in the bridge table will need to be updated. This is really not suitable for the purpose of the bridge table (which is to link the fact table to the driver dimension).

Group Dimension Created from Policy Dimension

The best solution I found is to create a group dimension as a type 0 dimension, based on the policy dimension. We cut away all the attributes, leaving just the policy number, like this:

Fact_Claim table: (same as before)
claim_key|claim_date_key|insurer_key|policy_key|insured_driver_group_key|initial_claim_amount|adjustment_amount|final_claim_amount|excess_amount|payout_amount
1178|20100518|298|32839|88|16000|1200|14800|500|14300
3122|20180725|397|45578|88|32000|4000|28000|1000|27000
3123|20181108|267|45642|89|51000|3000|48000|1500|46500

Dim_Driver_Group:
driver_group_key|policy_number|inserted_date
88|NK98402|2015-08-07 04:12:05
89|NK84826|2014-10-26 04:14:09

Bridge_Driver_Group: (same as before)
driver_group_key|driver_key|driver_type|weight
88|188|single driver|1.000000
89|302|main driver|0.666666
89|304|second driver|0.333333

Dim_Driver: (same as before)
driver_key|driver_name|date_of_birth|driving_licence_number
188|Chloe Dunn|1980-08-08|3824920
302|Peter Keller|1971-06-05|4503532
304|Samantha Keller|1970-12-07|4507651

Note that Dim_Driver_Group is a type 0 dimension, so it doesn't have effective_date, expiry_date or active_flag columns; it only has an inserted_date column. Type 0 means that it is fixed: once a row is inserted it will never get changed or updated.

Tracking Changes in the Bridge Table

What if Peter and Samantha Keller add their 18 year old daughter (Karina) to the policy, and the number of insured drivers on their policy becomes three?

Fact_Claim table: (same as before)
claim_key|claim_date_key|insurer_key|policy_key|insured_driver_group_key|initial_claim_amount|adjustment_amount|final_claim_amount|excess_amount|payout_amount
1178|20100518|298|32839|88|16000|1200|14800|500|14300
3122|20180725|397|45578|88|32000|4000|28000|1000|27000
3123|20181108|267|45642|89|51000|3000|48000|1500|46500

Dim_Driver_Group: (same as before)
driver_group_key|policy_number|inserted_date
88|NK98402|2015-08-07 04:12:05
89|NK84826|2014-10-26 04:14:09

Bridge_Driver_Group:
driver_group_key|driver_key|driver_type|weight|start_date|end_date|active_flag
88|188|single driver|1.000000|1900-01-01|9999-12-31|Y
89|302|main driver|0.666666|1900-01-01|2019-03-22|N
89|304|second driver|0.333333|1900-01-01|2019-03-22|N
89|302|main driver|0.500000|2019-03-22|9999-12-31|Y
89|304|second driver|0.250000|2019-03-22|9999-12-31|Y
89|375|third driver|0.250000|2019-03-22|9999-12-31|Y

Dim_Driver: (same as before)
driver_key|driver_name|date_of_birth|driving_licence_number
188|Chloe Dunn|1980-08-08|3824920
302|Peter Keller|1971-06-05|4503532
304|Samantha Keller|1970-12-07|4507651
375|Karina Keller|2001-04-15|9302583

Notice that Karina does not contribute to her parents' claim on 18th Nov 2018, because she was put on the policy today (22nd March 2019). But Karina is responsible for any claim happening after today. It is quite complex to calculate the claim amount attributed to each driver (a sketch of such a query is shown below), so I would recommend calculating it overnight and putting the output in a separate fact table, or an output table. If the business needs to browse this data (drilling up and down, slicing and dicing) then put it in a fact table. If this "claim per driver" data is only required for producing a report, then put it in an output table.
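
As a sketch, the attribution query now has to pick the bridge rows that were active on the claim date, for example like this (table and column names as in the example above; claim_date_key is assumed to be an integer in YYYYMMDD format):

SELECT d.driver_name,
       SUM(f.final_claim_amount * b.weight) AS attributed_claim_amount
FROM Fact_Claim f
JOIN Bridge_Driver_Group b
  ON  f.insured_driver_group_key = b.driver_group_key
  AND CONVERT(date, CAST(f.claim_date_key AS char(8))) >= b.start_date   -- bridge row active
  AND CONVERT(date, CAST(f.claim_date_key AS char(8))) <  b.end_date     -- on the claim date
JOIN Dim_Driver d ON b.driver_key = d.driver_key
GROUP BY d.driver_name;
-- The Nov 2018 claim picks up the old 2/3 and 1/3 rows, so nothing is attributed to Karina.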

21 February 2019

Data Files – Delimiter and Qualifier

Filed under: Data Warehousing — Vincent Rainardi @ 8:06 am

Suppose you get to specify a file spec for your data supplier to send their data via FTP. Would you ask them to send you Excel files or CSV files? Do you prefer a tab delimited text file or pipe delimited? Do you ask them to qualify their strings with double quotes?

Excel or CSV

I would prefer CSV files to Excel files, because the number of columns in an Excel file which can be read by SSIS is limited to 255 columns (see link), whereas in CSV files there is no limitation on the number of columns.

To overcome this limitation we need to import the file twice (or three times) using two data source components and then join them. But in order to join them we will need an identity column, which becomes the join criterion (see link). We also need to be careful with performance when joining, because a merge join can be slow.

The second issue with Excel files is the OLE DB provider installed on the server where SSIS is running; without it we can get an error message saying "OLE DB provider Microsoft.ACE.OLEDB.12.0 is not registered" (see link). The "Use 32 bit runtime" setting in the SQL Agent job step and Run64BitRuntime in the project properties also affect this.

The other disadvantage of importing an Excel file is dealing with leading zeros in a string, such as "007", which automatically becomes the number 7.

Also numbers in scientific notation, such as 1.23E21, which will be imported as a string rather than a number, causing a failure. In a CSV file it is written as 1230000000000000000000 and imported as a number.

The other limitation of Excel is that cells containing long strings, such as 2000 characters, are cut to 256 characters. This issue only happens in editions before Excel 2007.

Delimiter

What delimiter do you prefer: comma, pipe or tab?

The problem with CSV, or any comma delimited file, is that we can have commas in the data. This causes misalignment when the file is imported. For example, if there are 10 columns and one of them contains a comma, that row becomes 11 columns. This problem is known in the IT world as "delimiter collision", see here: link.

Commas in the data are very common in numeric fields, such as "123,000". Or, in countries like Indonesia which use a comma as the decimal separator, like this: "100,23".

We can enclose it with double quotes, but why not eliminate the problem in the first place? That's why for text files people prefer a pipe delimiter or a tab delimiter. A text file with a tab delimiter is known as tab-separated values, or TSV (link), but pipe delimited files are not known as PSV. We do have DSV though, which stands for Delimiter Separated Values (link).

Pipe is generally preferable because of the perception that it is rarer than tab. Some strings may contain a tab, for example in a commentary field.

Qualifier

Qualifier means enclosing the data with something, such as a double quote. The double quote is the most common qualifier. Other qualifiers (far less common) are the single quote and brackets. There are 3 different types of bracket, i.e. [ ], < > and { }.

The issue with the double quote qualifier is that the data may contain double quotes, such as in commentary fields. This applies to the other qualifiers too. That's why it is ideal to use a pipe delimiter, without a qualifier.

So that’s my preferred choice for the format of a data file, if I can choose it: pipe delimited text file without qualifiers. But unfortunately, in reality, we very rarely get the opportunity to choose the format. Usually the vendor dictates the format because the same file goes to many customers.
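
As an illustration, loading a pipe delimited file without qualifiers into SQL Server is straightforward. The snippet below is only a sketch; the staging table name and file path are made up, and the staging table is assumed to already exist with columns matching the file layout.

BULK INSERT Stg_Data_File
FROM 'D:\Incoming\data_file_20190221.txt'
WITH
( FIELDTERMINATOR = '|',    -- pipe delimiter, no qualifier
  ROWTERMINATOR   = '\n',   -- one row per line
  FIRSTROW        = 2       -- skip the header row
);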

20 February 2019

Transactional Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 9:11 pm

Transactional fact tables are not as popular as periodic snapshot fact tables. In this article I would like to compare transactional and periodic snapshot fact tables, list their advantages and disadvantages, and give two cases from my own experience where I needed to decide between the two. But first let me explain what these two types of fact tables are.

What is a transactional fact table?

In dimensional data modelling we have 3 types of fact tables: transactional, periodic snapshot and accumulating snapshot. I'm going to explain the first two below.

1. Transactional Fact Table

A transactional fact table is a fact table where:

  • Each event is stored in the fact table only once.
  • It has a date column indicating when the event occurred.
  • It has an identifier column which identifies each event.
  • The number of rows is the same as the source table.

A classic example is when in the source system we have a sales table containing customer orders (for example, in a shop, a restaurant, or a factory). Say Monday we had 80 orders, Tuesday 70 orders, Wednesday 90, and so on. So on Monday night we load the 80 rows for Monday into the data warehouse. On Tuesday night we load the 70 rows for Tuesday, and on Wednesday night we load the 90 rows for Wednesday.

Figure 1. A transactional fact table loading 3 days of data

In the data warehouse we store the customer orders in the Sales Fact Table, which is a transactional fact table.

  • In this Sales Fact Table, every order is stored only once. The 70 orders on Tuesday are different to the 80 orders on Monday. And they are also different to the 90 Wednesday orders.
  • For this we use the order date column. In the above example the Monday, Tuesday and Wednesday are the order date. This order date column indicates when the event occurred, when the order happened.
  • In this sales table we also have a sales identifier, such as order number if it is a shop, or ticket number if it is a restaurant.
  • On Wednesday night, after the warehouse load finishes, we have 80+70+90 = 240 rows, the same as in the source system.

In addition to inserts, we also have updates in the source table. In the example above, in addition to the 70 new orders on Tuesday, we also have updates to some of the 80 Monday orders. This is the key difference to the periodic snapshot fact table: a transactional fact table updates existing rows, and therefore loses some history.

Other ideal examples of a transactional fact table are the journal table in an accounting system, a trade table in an investment banking system, a premium table in an insurance system, a payment table in a payment system, a transaction table in a retail banking system, and a call table in a telecommunication system.

2. Periodic Snapshot Fact Table

A periodic snapshot fact table is a fact table where:

  • The whole source system is copied into the fact table regularly.
  • The same event is stored multiple times.
  • It has a snapshot date column indicating when a copy of the source table was created.

An ideal example of a periodic snapshot fact table is the bank account balance. At the end of each day, the balance of every customer account in the bank is stored in this account balance table. Say there were 20,000 customers on Monday, 22,000 customers on Tuesday and 24,000 customers on Wednesday.

Figure 2. A periodic snapshot fact table

  • Every day we copy the whole content of the account balance table into the periodic snapshot fact table.
  • So on Monday night we stored 20,000 rows in the account balances periodic snapshot fact table, on Tuesday night 22,000 rows and on Wednesday night 24,000 rows. So an account is copied every day to the fact table, each day with potentially a different balance amount.
  • In the fact table we have a column called snapshot date. For all the rows created on Monday night, we set the snapshot date column to (for example) 11th Feb 2018. For the rows created on Tuesday night we set the snapshot date to 12th Feb 2018 and for the Wednesday rows we set the snapshot date to 13th Feb 2018.

Of course there are accounts which were closed on Tuesday and are no longer in the account balance table in the source system. In the fact table, the Monday data set contains these accounts, but the Tuesday data set doesn't, and neither does the Wednesday data set.

And there are accounts which were updated on Tuesday. These changes will be reflected on the Tuesday snapshot in the fact table, different to their Monday rows.

Other examples are the inventory table in manufacturing, the holdings table in fund management, the billing balance table in telecommunications, and the daily temperature table in a weather system.
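
As an illustration, the nightly load of a periodic snapshot fact table is essentially a full copy of the source table with a snapshot date stamped on it. This is a minimal sketch; the table and column names are made up.

INSERT INTO Fact_Account_Balance_Snapshot (snapshot_date, account_key, balance_amount)
SELECT CAST(GETDATE() AS date) AS snapshot_date,   -- the snapshot date for tonight's run
       a.account_key,
       a.balance_amount
FROM Account_Balance a;                            -- copy the whole source table every day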

Advantages and Disadvantages

The advantages of a transactional fact table are:

  • It mimics the source table
  • It is simpler as we only have one version of each event

The disadvantages of a transactional fact table are:

  • We don’t have the previous values
  • Updates are slow if the fact table is large, which is potentially a performance issue

Now let’s get on with the two cases of transactional fact table implementations.

Case 1: Retail Association

Let’s suppose we are running a Retail Association. Every shop in the country reports their annual sales data on our website, within one month of their financial year end. In January 2019 there were 8,000 shops reporting their sales data on our website, and in Feb 2019 there were 9000 shops reporting their sales data. There are about 100,000 shops in total.

Every month we get a CSV file containing the previous month's data. The file is called Data_YYYYMMDD.csv and contains a column called Reporting Date. So:

  • Data_20190131.csv contains the 8,000 shops reporting in January 2019, with the Reporting Date column containing dates from 1st to 31st Jan 2019.
  • Data_20190228.csv contains the 9,000 shops reporting in February 2019, with the Reporting Date column containing dates from 1st to 28th Feb 2019.

Figure 3. Retail Association monthly data file loaded into a transactional fact table

Because in the February file the January data is not given in full (only the changed and new rows), we can't make this a periodic snapshot fact table. So in this case a transactional fact table is the only option.

Heart of Transactional Fact Tables

In theory the loading is straightforward: the Jan file is loaded, then the Feb file, and so on. But in reality this is rarely the case. A few shops which were supposed to report in January were late and reported in February. So in the February file we also get a few rows for these January "late reporter" shops.

Some shops made an error in their January submission and corrected it in February. So in the February file we also have a few rows containing January corrections.

That is the heart of transactional fact tables: performing updates to the previous months' data. In this case it goes back only one month, but in real cases it could be a few months. A sketch of such a load is below.
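
This is a minimal sketch of such a load using MERGE: new events are inserted, and late or corrected rows from previous months are updated. The table and column names (Stg_Monthly_File, Fact_Sales, order_number, order_date_key, sales_amount) are assumptions for illustration, not the actual model.

MERGE Fact_Sales AS tgt
USING Stg_Monthly_File AS src
   ON tgt.order_number = src.order_number          -- the event identifier
WHEN MATCHED THEN                                   -- a late or corrected row from a previous month
  UPDATE SET tgt.order_date_key = src.order_date_key,
             tgt.sales_amount   = src.sales_amount
WHEN NOT MATCHED BY TARGET THEN                     -- a new event
  INSERT (order_number, order_date_key, sales_amount)
  VALUES (src.order_number, src.order_date_key, src.sales_amount);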

Case 2: Waste Reporting

Suppose we are working with a government department responsible for implementing the waste regulations in our country. Every company and government agency in the country needs to report the amount and types of their packaging waste, electronic waste, commercial waste and recyclables, e.g. plastic, paper, metal. This report is annual (once a year). Every town council also needs to report the amount and types of their household waste, also annually.

The timing of when they have to submit the report is different for every company, agency and council, depending on their financial year end. But the report always covers the same year, which is the previous calendar year. So companies/agencies/councils with a financial year ending on 31st March 2018 need to report the waste produced from Jan to Dec 2017. Those with a financial year ending on 31st August 2018 also need to report the waste produced from Jan to Dec 2017.

The data we get from this central government department is monthly, i.e. every month we get a file. The file contains the waste produced in the previous calendar year. So the files we get in all 12 months of 2019 contain the waste data for 2018, growing every month. For example, in Jan 2019 the file contains 17,000 entities (i.e. companies, agencies and councils), in Feb 2019 the file contains 32,000 entities, the March 2019 file contains 49,000 entities, and so on, like below.

Figure 4. Waste Regulation monthly data file loaded into a transactional fact table

The February file contains all 17,000 January entities plus the 15,000 entities that report in February. But the February file also contains corrections for January. In other words, in the February file the waste data for some of those 17,000 January entities might have changed. For example, company A, which in January reported 11,000 kg of waste for the 2018 calendar year, might in February submit a correction to change this figure to 11,500 kg. This correction may happen up to 6 months after the submission deadline, so a correction to January data can appear in the July file.

Now we need to decide whether a transactional fact table is the right approach for this case.

Figure 5. Comparing a transactional fact table and a periodic snapshot

If we build this fact table as a periodic snapshot fact table, we will have every version of the data. We will have the January version when Entity A was 11,000 kg, and we will also have the February and March version when entity A was 11,500 kg.

If we build it as a transactional fact table, we will only have the latest version.

From my experience many people simply go for the snapshot model "just in case", i.e. in case we need the January version.

The other reason, which is more valid, is traceability. If we use a transactional fact table, the report we produced in Feb 2019 (based on the Jan data) will have different numbers to the report we produce in Apr 2019 (based on the Mar data). If we use the snapshot model, we can explain the differences, i.e. because in January Entity A was 11,000 kg, not 11,500 kg.

So in case 1 we are dealing with “incremental” monthly data files, whereas in case 2 we are dealing with “accumulated” monthly data files.

 

15 February 2019

Data Type

Filed under: Data Warehousing — Vincent Rainardi @ 6:53 pm

In the data warehouse, should we make the data type the same as in the source system? If in the source system the data type of the currency code column is varchar(50), should we set the data type in our data warehouse to varchar(50) or to varchar(3)? If an ISIN column is defined as varchar(100) in the source system, should you set the data type to varchar(100) or varchar(12)?

The short answer is: set it to whatever the source system specifies.

The long answer is:

  1. If the source system specifies a ridiculously long data type such as varchar(500) for a currency code, varchar(1000) for an ISIN, or varchar(8000) for a description column, and the actual data in those columns is much shorter than the data type (e.g. 3, 12 and 200 characters respectively for the above 3 examples), then set the data type to fit the actual data length, plus some spare.
  2. Have a standard set of lengths in the warehouse, for example: 1, 5, 10, 20, 50, 100, 200, and pick the next one up. For example, if the actual data length is 12 set the data type for that column to 20, and if the actual data length is 30 set the data type to 50.
  3. In the ETL, if the actual data exceeds the column data type, cut the data. For example, if the data type of an attribute column in a dimension is varchar(100) and the actual data is 101 characters, then cut it to 100 characters and load it into the dimension, but flag this in the ETL error log (see the sketch after this list).
  4. Talk to the source system team. Perhaps there is a reason why the designer of that source column set it wider than it needs to be. Perhaps when there is no ISIN, they put another security identifier such as currency + SEDOL in there? Or an internal security identifier, which is quite long?
  5. If the source system team confirm that the data they put into that column will not exceed 12 characters (even though the column is defined as 100 characters) then set the column in the warehouse to 12. But again, in the ETL cut it to 12 if it exceeds 12, and log it.
 

31 January 2019

Column Store Index

Filed under: Data Warehousing — Vincent Rainardi @ 8:57 am

I write this article specifically to answer this question:
Do column store indices improve query performance in data warehousing?

So I'm not looking at data loading performance, and I'm looking specifically at data warehousing, i.e. querying fact tables. From past experience I know that the answer is yes if the fact table is large and the query is selective; otherwise the answer is no. But I don't have the numbers, so with this article I would like to measure it. Thank you to Nadir Doctor for suggesting this topic.

Experiment Design

I will first query a small fact table (50k rows) which is indexed the normal way, i.e. clustered on the PK and non-clustered on the SKs. For this I will use the internet sales fact table in AdventureWorksDW. I'll execute two queries: the first one will select all rows, and the second one will select only 0.1% of the rows.

I will then create column store indices on this fact table, and execute the same two queries. I expect that there will be no performance improvement because the fact table is small.

Then I'll add more rows to the internet sales fact table so that it becomes 10 million rows, and execute the same queries. I expect there won't be a performance improvement on the select all, but a significant improvement on the selective query.

To summarise:
Experiment design

Note: The reason I put the "select all" query in there is because that's how OLAP tools work, e.g. Power BI, SSAS, QlikView: they select all rows from the fact table. Whereas reporting tools such as SSRS work with selective queries (Power BI can be selective too).

SQL Queries and DDL

The query I use is as follows:

select F.SalesOrderNumber, F.SalesOrderLineNumber, DC1.FirstName + ' ' + DC1.LastName as CustomerName, DP.ProductAlternateKey, DP.ModelName, DC2.CurrencyAlternateKey as Currency, F.UnitPrice, F.OrderQuantity, F.SalesAmount
from FactInternetSales F
left join DimCustomer DC1 on F.CustomerKey = DC1.CustomerKey
left join DimProduct DP on F.ProductKey = DP.ProductKey
left join DimSalesTerritory DST on F.SalesTerritoryKey = DST.SalesTerritoryKey
left join DimCurrency DC2 on F.CurrencyKey = DC2.CurrencyKey
where DST.SalesTerritoryCountry = 'United Kingdom' and F.OrderDateKey/100 = 201401

For select all rows I use the same query but without the where clause.

The DDL to create the indexes is as follows:

-- Columnstore setup (used instead of the rowstore clustered index below; a table can only have one clustered index, so drop the rowstore one first)
create clustered columnstore index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales

-- Normal rowstore setup (clustered on the PK, non-clustered on the surrogate keys)
create clustered index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales(SalesOrderNumber, SalesOrderLineNumber)
create nonclustered index FactInternetSales_CustomerKey on dbo.FactInternetSales(CustomerKey)
create nonclustered index FactInternetSales_ProductKey on dbo.FactInternetSales(ProductKey)
create nonclustered index FactInternetSales_SalesTerritoryKey on dbo.FactInternetSales(SalesTerritoryKey)
create nonclustered index FactInternetSales_CurrencyKey on dbo.FactInternetSales(CurrencyKey)

Results

Results (query performance numbers are in milliseconds)

So as expected, the column store indices only improve the query performance if the table is large and the query is selective.

Other Considerations

Apart from the query performance, of course we also have to consider maintenance, i.e. how much effort we need to spend to keep the column store indices up to date. Is it a lot more than for row store indices? (That is beyond the remit of this article.)
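
For reference, the maintenance itself uses the usual ALTER INDEX syntax. The statements below are only a sketch, using the columnstore index name from the DDL above.

-- Reorganise (lighter weight) or rebuild (heavier) the columnstore index
alter index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales reorganize
alter index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales rebuild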

References:

  1. Column Store Index – Design Guidance: link
  2. Column Store Index – Data Warehouse: link
  3. Niko Neugebauer on Column Store: link

28 January 2019

SSIS: Numeric Column Loaded as NULL

Filed under: Data Warehousing — Vincent Rainardi @ 6:08 am

Why does SSIS load a numeric column in Excel as NULL?
(This only applies to Excel 97-2003 files; it does not apply to Excel 2013 files.)

What Happened

I have an Excel 97-2003 file containing a numeric column. The first 10 rows of this column are blank, and there are numbers from row 11 onwards. When SSIS loads this file, all columns are imported correctly into the target table, except this one column, which is imported as NULL.

Figure 1. Loading an Excel 97-2003 file (left: Excel 97-2003 source file; right: target table in SQL Server)

Why It Happened

Why is Column2 loaded as NULL, but Column3 loaded correctly?

This has nothing to do with SSIS. It is because of the Microsoft Excel driver. The Excel driver checks the first 8 rows of that column to find out the data type. Because they are blank, the Excel driver (not SSIS) sets the expected data type to Text. And because the actual data is a number, it is imported as NULL.

How It Happened

Here is how it happened (the detailed mechanics).

When we set up the Excel Connection Manager in SSIS (see below), the Excel driver installed on our PC does these steps:

Figure 2. Setting up the Excel source

Step 1. The Excel driver looks at the TypeGuessRows and ImportMixedTypes registry settings under: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\15.0\Access Connectivity Engine\Engines\Excel

Figure 3. Registry settings

Note: The branch in the registry is different for each Office version and 32/64 bit, see list here.

Step 2. The Excel Driver reads the first N rows from the Excel file to determine the data type, where N is the TypeGuessRows.

Note:

  • The default for TypeGuessRows is 8.
  • If the TypeGuessRows is set to 0, it reads the first 16384 rows (2^14).
  • To check all rows, the TypeGuessRows needs to be set to 1,048,576 (2^20), the maximum number of rows allowed in Excel.

Step 3. The Excel Driver sets the expected data type for that column as follows:

  • If the value of ImportMixedTypes is "Text", the expected data type is set based on the majority of the first N rows, where N is the value of TypeGuessRows. There are 5 possible data types: Text, Number, Date, Currency, Boolean.
  • For example, if in the first 8 rows there are 4 dates, 2 texts, 1 number and 1 currency, the expected data type will be set to Date.
  • Another example: if in the first 8 rows there are 5 texts and 3 numbers, the expected data type will be set to Text.
  • If there is a tie, like (4 and 4) or (3,3,2) or (2,2,2,2), then Number wins, followed by Currency, Date, Text, and the last one is Boolean.
  • If the first 8 rows are all NULL, the expected data type is set to the ImportMixedTypes value.
  • In the above case, Column1 is set to Number (majority of 8), Column2 to Text (all NULL), Column3 to Number (majority of 1), Column4 to Text (all NULL).

Step 4. When we click the Preview button (see Appendix 1), the Excel driver reads the first 200 rows from the Excel file and displays the values on the screen. If a value from the file does not match the expected data type, the Excel driver returns NULL to SSIS.
This is why, in the above case, Column2 is shown as NULL in the Preview screen.
The same happens during the actual load process.

Excel 2007 or Excel 2013

What happens if we import an Excel 2007/2013 file?

If we import an Excel 2007 or Excel 2013 file, the numeric column is loaded correctly, see Column2 below.

Figure 4. Loading Excel 2007 or 2013

When loading an Excel 97-2003 file we get the same result as above if we set the Excel version in the Excel Connection Manager to "Microsoft Excel 2007", like this:

Figure 5. Changing the Excel version

So when we face a problem with a numeric column being NULL when loading from Excel 97-2003, we can set the Excel version to 2007 in the Excel Connection Manager and SSIS will load it correctly.

 

27 January 2019

Starting a Post

Filed under: Data Warehousing — Vincent Rainardi @ 6:09 am

I usually start writing a post just like that, as if I were explaining it to someone in front of me.

But there is a better way to start a post, i.e. by formulating the thing you want to explain in the post into a question. This will grab the attention of the reader straight away.

For example, I am about to write about an SSIS issue that happens when we import an Excel file containing a numeric column. The issue is: if the first 10 rows of that column are blank, all rows of that column are imported as NULL.

I could start the usual way by explaining the issue (what happened), why it happened and what the solution is. I still think that is a good way to write it. But a better way to start the article is to formulate the issue into a question: Why does SSIS import a numeric column in Excel as NULL?

So that’s what I’m going to do in the next post.

5 January 2019

Asset Management Companies in the UK

Filed under: Data Warehousing — Vincent Rainardi @ 11:21 am

Having worked at several investment banks and asset management companies in the UK, I was wondering which are the biggest asset managers in the UK. I don't mean to list just the UK fund managers, but also US fund managers operating in the UK. So I searched Wikipedia and Google, and the individual websites of each company, and wrote what I found below. Links to sources are on the AUM figures.

Largest asset management companies in the UK are (AUM $0.4tn or more, rounded to $0.1tn):
Note: only the asset management division/arm, not the whole group.

Smaller ones are (AUM under $0.4tn, rounded to $bn):

Those with AUM under $10bn are (please correct me if I’m wrong): Amati, Chelverton, City Financial, Downing, Hawksmoor, Premier, Miton, Neptune, Sanford DeLand, Wise. Those with AUM not known are: Cavendish, Evenlode, Unicorn.

Whereas the investment banks operating in London are (HQ mostly in Canary Wharf):

  1. Barclays
  2. BNP Paribas
  3. Bank of America Merrill Lynch
  4. Citigroup
  5. Credit Suisse
  6. Deutsche Bank
  7. Goldman Sachs
  8. HSBC
  9. JP Morgan
  10. Mizuho
  11. Mitsubishi UFJ
  12. Morgan Stanley
  13. Nomura
  14. UBS

 
