Data Warehousing, BI and Data Science

7 April 2021

Handling Class Imbalance

Filed under: Data Warehousing — Vincent Rainardi @ 7:17 am

In this article I will explain a few ways to treat class imbalance in machine learning. I will also give some examples in Python.

What is class imbalance?

Imagine you have a data set containing 2 classes: 100 of class A and 100 of class B. This is called a balanced data set. But if those 2 classes are 5000 of class A and 100 of class B, that is an imbalanced data set. This is not limited to 2 classes; it can also happen with more than 2 classes. For example: class A and B both have 5000 members, whereas class C and D both have 100 members.

In an imbalanced data set, the class with fewer members is called the minority class. The class with many more members is called the majority class. So if class A has 5000 members and class B 100 members, class A is the majority class and class B is the minority class.

Note that the “class” here is the target variable, not an independent variable. So the target variable is a categorical variable, not a continuous variable. A case where the target variable has 2 classes like the above is called “binary classification”, and it is quite common in machine learning.

At what ratio is it called class imbalance?

There is no exact definition of the ratio. If class A is 20% of class B I would call it imbalanced, whereas if class A is 70% of class B I would call it balanced; around 50% is a reasonable rule of thumb for the boundary. It is wrong to dwell on finding a precise ratio range, because every data set and every ML algorithm is different. Some cases give bad results at 40%, some cases are fine at 40%.
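
As a quick way to check the ratio in practice, here is a sketch using pandas (the data frame and the “fraud” target column are made up, chosen to match the Python examples later in this article):

import pandas as pd

# Made-up data: 5000 good transactions and 100 fraudulent ones
df = pd.DataFrame({"fraud": [0] * 5000 + [1] * 100})

# Count the members of each class and their proportions
print(df["fraud"].value_counts())
print(df["fraud"].value_counts(normalize = True))

# Ratio of the minority class to the majority class
counts = df["fraud"].value_counts()
print(counts.min() / counts.max())   # 0.02, clearly imbalanced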

Why class imbalance occurs

Some data is naturally imbalanced, because one class happens rarely in nature, whereas the other happens frequently. For example: cancer, fraud, spam, accidents. The number of people with cancer is naturally much smaller than the number without. The number of fraudulent credit card payments is naturally much smaller than the number of good payments. The number of spam emails is much smaller than the number of good emails. The number of flights having accidents is naturally much smaller than the number of good flights.

Why class imbalance needs to be treated

Some machine learning algorithms don’t work well if the target variable is imbalanced, because during training the majority class is favoured. As a result the model is skewed towards the majority class. This is an issue because in most cases what we are interested in is predicting the minority class. For example: predicting that a transaction is fraudulent, or that an email is spam, is more important than predicting the majority class.

That is the reason why class imbalance needs to be treated: the model would be skewed towards the majority class, while what we need to predict is the minority class.

How to treat class imbalance

We resolve this situation by oversampling the minority class or by undersampling the majority class.

Oversampling the minority class means we randomly draw samples from the minority class many times (with replacement), whereas we leave the majority class as it is.

For example, if class A has 5000 members and class B has 100 members, we resample class B until it also has 5000 members, meaning that we pick data randomly from class B (with replacement) 5000 times. Effectively it is like duplicating the class B data 50 times.

Undersampling the majority class means that we randomly select from the majority class only as many samples as the minority class has. In the above example we randomly pick 100 samples from class A, so that both class A and class B have 100 members.

Apart from randomly selecting data there are many other techniques, including:

  • Creating new samples (called synthetic data)
  • Selecting samples not randomly, but favouring samples which are misclassified
  • Selecting samples not randomly, but favouring samples which resemble the other class

Jason Brownlee explained several other techniques such as SMOTE, Borderline Oversampling, CNN, ENN, OSS in this article: link.

Python examples

1. Random Oversampling

# Import pandas and the resample function from the Scikit Learn library
import pandas as pd
from sklearn.utils import resample

# Put the majority class and minority class on separate dataframes
majority_df = df[df["fraud"]==0]
minority_df = df[df["fraud"]==1] 

# Oversampling the minority class randomly
new_minority_df = resample( minority_df, replace = True, 
                            n_samples = len(majority_df), 
                            random_state = 0 )

# Combine the new minority class with the majority class
balanced_df = pd.concat([majority_df, new_minority_df])

2. Synthetic Minority Oversampling Technique (SMOTE)

# Import SMOTE from the Imbalance Learn library
from imblearn.over_sampling import SMOTE

# Oversampling the minority class using SMOTE
# (X is the feature matrix and y is the imbalanced target variable)
s = SMOTE()
X_new, y_new = s.fit_resample(X, y)

Jason Brownlee illustrates very well which part of the minority class gets oversampled by SMOTE in this article: link. Notice how the minority class differs in the first 3 plots of his article. We can see clearly how SMOTE combined with random undersampling is better than SMOTE alone or random undersampling alone.
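
3. SMOTE with Random Undersampling

Here is a rough sketch of that combined approach, assuming the Imbalance Learn library; the sampling_strategy values and the generated data are illustrative only, not recommendations:

# Import SMOTE, RandomUnderSampler and Pipeline from the Imbalance Learn library
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Made-up imbalanced data: roughly 99% class 0 and 1% class 1
X, y = make_classification(n_samples = 10000, weights = [0.99], random_state = 0)

# Oversample the minority class up to 10% of the majority class with SMOTE,
# then undersample the majority class down to twice the minority class
steps = [("over", SMOTE(sampling_strategy = 0.1)),
         ("under", RandomUnderSampler(sampling_strategy = 0.5))]
X_new, y_new = Pipeline(steps = steps).fit_resample(X, y)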

6 April 2021

Natural Language Processing (NLP)

Filed under: Data Warehousing — Vincent Rainardi @ 8:15 am

NLP is different from all other machine learning areas. Machine learning usually deals with mathematics, with numbers. It is about finding a pattern in the numbers and making a prediction. The roots of the analysis are mathematical: matrices, vectors, statistics, probability and calculus. But NLP is about words and sentences, which is very different.

We are now used to Alexa, Siri and Google being able to understand us and answer us back in a conversation (5 years ago it wasn’t like that). When we type a reply to an email in Gmail or a message in LinkedIn, we are now used to receiving suggestions about what we are going to type. And when we log in to British Gas, online banking or an online retail shop, we now find chat bots with whom we can have a useful conversation. Much better than 5 years ago. There is no doubt there have been significant advancements in this area.

The processing of language, be it voice or text, is done at 3 levels. The bottom level is lexical analysis, where ML deals with each word in isolation. The middle level is syntax analysis, where ML analyses the words within the context of the sentence and the grammar. The top level is semantic analysis, where ML tries to understand the meaning of the sentence.

To do lexical analysis we start with regular expressions. We use regular expressions to find words within a text and to replace them with other words. Then we learn how to identify and remove stop words such as “and”, “the” and “a”, which occur frequently but don’t provide useful information during lexical analysis. The third step is learning how to break the text into sentences and into words (tokenisation). And finally, for each word we try to find the base word, using either stemming, lemmatisation or soundex.

Stemming is a process of removing prefixes and suffixes such as “ing” and “er”, so that “learning” and “learner” become the base word “learn”. Lemmatisation is a process of changing a word to its root, e.g. from “went” to “go”, and from “better”, “well” and “best” to “good”. Soundex is a 4-character code that represents the pronunciation of a word, rather than its spelling.
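
As an illustration, here is a sketch of those lexical analysis steps using the NLTK library (one possible library choice among several; the sentence is made up, and the resource names in nltk.download may differ slightly between NLTK versions):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-off downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The learners were learning NLP and went home."

# Break the text into words, keep alphabetic tokens only, remove stop words
words = nltk.word_tokenize(text.lower())
words = [w for w in words if re.match(r"^[a-z]+$", w)]
words = [w for w in words if w not in stopwords.words("english")]

# Find the base word by stemming and by lemmatisation
stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()
print([stemmer.stem(w) for w in words])                     # e.g. "learning" -> "learn"
print([lemmatiser.lemmatize(w, pos = "v") for w in words])  # e.g. "went" -> "go"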

Syntax analysis is done by tagging each word as a noun, verb, adjective, etc. (called “part of speech” tagging). The tagging is done by parsing (breaking up) the sentences into groups of words (phrases), analysing the grammatical patterns, and considering the dependencies between words.
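
A sketch of part-of-speech tagging, again assuming NLTK (the example sentence is made up; the tags shown are NLTK’s default Penn Treebank tags):

import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The fund invests in technology companies."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('fund', 'NN'), ('invests', 'VBZ'), ('in', 'IN'), ...]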

Semantic analysis is about understanding the meaning of the words and sentences by looking at the structure of the sentence and the word tagging. A word such as “Orange” can mean a colour, a fruit or an area, and “Apple” can mean a fruit or a company, depending on the sentence. In semantic analysis we either assign predefined categories to a text (for example for sentiment analysis, for classifying messages, or for chat bots) or pull specific information out of a text (for example extracting certain terms from IRS contracts or other documents).

21 February 2021

One or Two Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 7:34 pm

One of the data sets I receive from my data sources looks like this:

In this case, should we make it into one fact table like this: (one row per company per year, with M1, M2, M3 repeated) – let’s call this option 1

Or should we make it into two fact tables like this: (the first one is 1 row per company and the second one is 1 row per company per year) – let’s call it option 2

Kimball theory says we should do option 2, because the grains are different. A fact table is determined by its grain. If the data is at a different grain, then it does not belong in this fact table; it belongs in another fact table. In option 1, M1 to M3 do not belong there because their grain is different. They belong in another fact table, hence option 2 is the correct way.

So that was what I did when I first came across this situation. Stick to the book, I said to myself, and you’ll be safe.

But then the same situation came up again, on another data source. I should model it the same way, I thought. But this one is a bit “rugged”; notice M4 and M5 below, which are only applicable to some years:

And M1A is closely related to M1 (they are used together), M2A is closely related to M2, and M3A is closely related to M3, so the two fact tables almost always have to be joined together.

Also notice that there are 15 measures which are annual, and only 3 which are not.

So I thought, rather than the users having to join those two fact tables every time because the first 3 measures are used together, the right thing to do is to put them in one fact table like this:

So I did. And two more data sources like that have come along since. I still feel it is the right thing to do, despite it breaking the normalisation rules (who does normalisation in data warehousing anyway?) and breaking a Kimball principle on fact tables.

It’s not because there are only 3 measures which are repeated (even one is too many), but because otherwise, to use them, users would have to join the two fact tables every time.

What’s wrong with joining two fact tables? Nothing. Joining fact tables has been done since the dawn of data warehousing, with shipments and orders as the classic example in the 1990s.

Oh well, both options have their good sides and bad sides, their positives and negatives, and I think we could present good arguments for both options. What do you think?

And that is the title of this article: One or Two Fact Tables.

Vincent Rainardi, 21st Feb 2021 (gosh it’s been a looong time since I last wrote about data warehousing. That’s because I’m currently doing a master’s degree in machine learning, which is a lot of fun! Can’t have enough of it! Doing Python and math every day 🙂 At first I thought I would be converting to ML, but now that I’ve done quite a lot of it, I’m sure I don’t want to let DWBI and Investment Management go. I will bring ML into my DWBI work in the financial sector, and keep BA as my main work, not ML engineer 🙂 It’s a very good feeling to know what we want to do in life!)

8 November 2020

Day Measures

Filed under: Data Warehousing — Vincent Rainardi @ 3:04 am

Sometimes the measure we want to store is a number of days. For example:

  • the number of days from when we received an order until we shipped it
  • the number of days from when a booking was made until the engineer visited
  • the number of days from today until a bond matures
  • the number of days since the policy was incepted (began) until today

I call these measures Day Measures.
When a day measure is “from today” or “until today”, the measure changes every day.
For example: the maturity of a bond is the number of years from today until the bond matures. If the bond matures on 31st December 2030, and today is 9th Nov 2020, then the maturity is 10.1424 years (10 + 52/365.25).
Tomorrow it is 10.1396 (10 + 51/365.25). The maturity of a bond is important because the longer the maturity, the higher the risk that the bond value will be impacted by the interest rate.
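
As a sketch, the same day measure in Python (the 365.25 divisor follows the formula above; counting actual calendar days gives a marginally different figure because of leap days):

from datetime import date

def years_to_maturity(maturity_date, as_of):
    # Day measure: years from the as-of date until the bond matures
    return (maturity_date - as_of).days / 365.25

print(years_to_maturity(date(2030, 12, 31), date(2020, 11, 9)))    # about 10.14
print(years_to_maturity(date(2030, 12, 31), date(2020, 11, 10)))   # about 10.14, one day less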

Days to shipment (the number of days from when the order is received until shipment) does not change if it was last year’s order. But if the order was placed yesterday and it usually takes about 3 weeks for us to fulfil an order, then for the next 3 weeks the “days to shipment” changes every day.

If we have a daily periodic snapshot fact table, and we have a day measure in this fact table, then we need to calculate this measure every day.

The question is: is that right?
Couldn’t we just store the maturity date of the bond, rather than calculating the Maturity daily?
Couldn’t we just store the shipment date and the order date, rather than calculating “days to shipment” daily?

We can and we should. But “in addition to”, not “instead of”.
It would be silly to store just the date, because every time we want to use it we would need to calculate it. The point of having a data warehouse is not to save storage, but to make it easy to use. Easy to query.
When the Maturity or Days To Shipment is already stored, we don’t need to calculate anything. It is there, ready for us to use.

So yes, it is worth calculating day measures every day and storing them in the daily periodic snapshot fact table.

11 September 2020

Copy Paste Anchor Links (Page Jumps) from Microsoft Word to WordPress

Filed under: Data Warehousing — Vincent Rainardi @ 6:57 am

Problem: when copy-pasting a Word document to WordPress, hyperlinks are preserved but bookmarks are gone.

Solution:

  1. In Microsoft Word create them as hyperlinks
  2. Paste it to WordPress
  3. Change the hyperlinks to anchor links using Notepad
  4. Paste back to WordPress

Step by step:

Step 1. In Microsoft Word, create them as hyperlinks

  • In a blank Microsoft Word document, create this:

Both of the “Section 1” items are normal hyperlinks, with text = “Section1” (no space) and address = “#section1” (no space, with a # prefix), like this:

  • Highlight it, Control-K (Insert Hyperlink), and type Address = #section1 (notice the hash at the front and no space).
  • Create Section 2 and Section 3 the same way.

Step 2. Copy to WordPress

  • Control-A (select all)
  • Copy paste to WordPress visual editor.

Step 3. Replace the hyperlinks with anchor links in Notepad

  • Control-Shift-Alt-M (switch to Code Editor)
  • Control-A (select all)
  • Paste to Notepad (Notepad++ is better)
  • Control-H (replace), replace href with name and remove the #, as follows:
    find what: <a href="#, replace with: <a name="
    so that, for example, <a href="#section1"> becomes <a name="section1">

Step 4. Paste back to WordPress

  • Control-A and paste back to WordPress code editor.
  • Click Update on the top right corner
  • View the page and test the hyperlink

6 September 2020

Dynamic Difference in Power BI

Filed under: Data Warehousing — Vincent Rainardi @ 3:30 pm

If you are looking for how to do a dynamic difference (dynamic delta) in Power BI, you’ve come to the right place. This article shows how to calculate the difference (say between actual and budget) dynamically, or on-the-fly, as you switch from product to product. This is done using the SELECTEDVALUE function.

Business Scenario

Here is the business scenario: you work in product development in a manufacturing company. You have many products developed in your product development pipeline, each with different costs for each component, like this:

Report Layout

The Power BI report should look like this:

You need to be able to select a product (Product A in the above example) and the actual spend vs budget are displayed in a table, along with the difference.

So you are comparing the cost breakdown of a product to the budget. Another business scenario which is similar to this is comparing the performance attribution of a mutual fund to the benchmark. In this case instead of components, we have industry sectors.

Dynamic Difference

The point here is that the differences are calculated on-the-fly, depending on the user selection. That is the key Power BI functionality that I’d like to show you.

There is a function in Power BI called SELECTEDVALUE, which we need to use for this. We use it to find out which product is being selected.

How To Do It

So let’s create the Power BI report.

Step 1. First, in Excel, type this table, and load it in Power BI as Cost table:

Step 2. Create a calculated table to select distinct values from the Product column:

   Product = DISTINCT('Cost'[Product])

Step 3. Use it for a slicer so we can select Product A, B and C (filter the Budget out):

Step 4. Create 3 measures: Cost_Budget, Cost_Actual and Cost_Difference, as follows:

1. Cost for the budget:
Cost_Budget =
CALCULATE
( SUM('Cost'[Cost]),
  FILTER('Cost', 'Cost'[Product] = "Budget")
)

2. Cost for the selected product:
Cost_Actual =
CALCULATE
( SUM('Cost'[Cost]),
  FILTER('Cost', 'Cost'[Product] = SELECTEDVALUE(Product[Product]))
)

3. The difference between the budget and actual:
Cost_Difference = [Cost_Budget] - [Cost_Actual]

Step 5. On the report, create a table with Component, Budget, Product and Difference in the Values.

Now when we select Product A it will display the cost of Product A and calculate the difference on-the-fly.
And when we select Product B it will display the cost of Product B and calculate the difference on-the-fly.

26 September 2019

Development Operations (DevOps)

Filed under: Data Warehousing — Vincent Rainardi @ 5:45 pm

Whether you are building a data warehouse, a client reporting system, an MDM system or a trade booking system, you need to run a development team. Development Operations are the things that you need to run a software development team. What do you need to run a development team? These are:
1. Release pipeline
2. Change control
3. Sprint planning
4. Automated testing
5. Code repository
6. Server build

In writing this article I will try to combine all my experiences from about 10 companies over the last 12 years. Some of them follow good practices, some follow bad practices. I will only write about the good ones here.

1. Release Pipeline

The first thing that you need as a development team is a release pipeline. A release pipeline is a process which automatically builds, deploys and tests the code across a series of environments, i.e. development, test, UAT, production and support. A release is a collection of code, such as stored procedures, SSIS packages, .Net code, views, tables, etc., bundled into 1 package to be deployed together as 1 unit.

Can’t you deploy it manually by copying the code? Of course you can. But it takes more effort, and it is prone to error. What if you want to roll back to the previous version? You would have to do it manually, by copying the previous version of each component. Who is keeping track of the version of each component? What if you miss 1 component? You could be deploying 50 components in test but only 49 in production. It’s a headache, right? With a release pipeline, you can deploy all components into many environments consistently and effortlessly. You deploy into the Test environment and test it there. You deploy into the UAT environment and test it there. Then you deploy the same thing into Production, and then into the Support environment. The environments are guaranteed to be consistent.

If you have 100 stored procedures, 100 tables, 100 views, 100 SSIS packages, a few SSAS cubes, 50 SSRS reports and hundreds of pieces of .Net code, then you will appreciate the benefit of having an automated release pipeline. It is a huge time saver, and it improves your quality.

Deployment into an environment requires approvals. A deployment into the test environment, for example, may require the test manager’s approval, whereas deploying a release into production requires approval from the production support manager.

2. Change Control

The second thing that you need to run a development team is change control. Change control is a process of approving every single change going into production environment. It can be a code change (like a release). It can be a data change (like an UPDATE or DELETE statement). It can be a configuration change (like moving the data folder from C drive to E drive). It can be an infrastructure change (like setting up a new SQL server).

Change control is very important because:
– It shows that the change has been tested properly in Test environment, before it is deployed to production.
– It shows that the development manager and the production support manager are aware of the change.
– When the company is audited, you have evidence that every single change in production is approved and tested.
– You can choose which changes are deployed together in production and which ones are kept apart, to minimise the risk.
– It shows the sign-off from the business, agreeing to the change.
– When a production deployment fails, it is recorded properly, including the reason and the solution.
– It shows that every change in production is controlled, so the production systems are stable and reliable.

A change request is an electronic form which specifies:
– The reason for the change (usually a business reason but can be a technical reason)
– The list of components and data to be changed
– The evidence that the change has been deployed to Test/UAT environment the same way it will be deployed into production
– Link to the design specification or requirement which forms the basis of the change
– The evidence that the change satisfies the design specification or requirement (or if it is a bug fix, the evidence that the change fixes the bug)
– The impact of this change to the business users
– The approval from the business, from the development manager, and from the production support
– The updated documentation for the production support
– How the change will be deployed to production and to the DR system
– How the change will be tested that it works in production
– How the change will be rolled back if it doesn’t work in production
– The date the change is intended to be deployed to production
– The developer who did the change
– Who from the development team will be on support on the day the change is deployed to production
– Which teams are required to deploy the change
– The applications which are impacted, including downstream applications
– The severity of impact of the change, i.e. high, medium, low (the risk)
– The scale of the change, i.e. major change, minor change or medium (how many components)
– Whether it is an emergency change or not

The change request form is not just for application changes, but also for infrastructure changes such as servers and networks. It is for any changes impacting any production system, including codes, files, databases, servers, configurations, firewall, connections, permissions, scheduling.

Major change requests are discussed in a weekly meeting between the developers and the production support team to understand:
– What changes are going in this week
– Whether there are any conflicts between the changes which are going in this week
– Where those changes are in terms of their positions in the daily batch
– If things do go wrong, how long it will take to recover (or to fix)
– Who is required to standby to fix if anything goes wrong (DBA, infrastructure, developer)
– When those changes should be done from production point of view (to minimise the impact).
– To scrutinise the test evidence, not whether the change satisfies the requirements, but whether the change has been deployed in the Test system the same way as it would be deployed in production. In particular, whether the condition of the Test environment is the same as production. Also, whether the impacted application has been tested in Test environment.
– To check that the rollback script really covers the whole change, not just part of it. And whether it has been tested.
– Whether you need to communicate the change to the users or not, and whether the help desk should be prepared to field calls from the users on the deployment day.

The changes are then either approved or rejected. If it is approved, it is given a date and time when it will be deployed. Naturally, the business and the development team want the change to be deployed as soon as possible, but it is the responsibility of the production support team to decide the timing, to minimise the risk. Changes with major impact (or high risk) are usually deployed on Saturday when the business users are not working so they are not impacted and you have Sunday to recover or fix any issues. Changes with medium impact (and medium risk) can be deployed on Friday, so you have Saturday for fixing. Changes with low impact and low risk can be deployed earlier in the week (Wed or Thu but not Mon or Tue).

If there are major changes going in this weekend, it is wise to defer deploying other major changes to next weekend. Similarly, if the required support resources are not available this weekend (in case it goes wrong), then the change should be postponed to next week.

Some changes require coordination with an external vendor, so the timing of the production deployment must be mutually agreed. Some changes are time-bound, i.e. they have to be done before a certain date, otherwise the company will get a penalty from the regulator.

Some changes need to be done immediately, for example because of a security risk, or the risk of losing a great deal of money. This is classified as an emergency change, and it is usually a bug which needs to be fixed immediately. The procedure for an emergency change request is different from a normal change request. It still requires approval and it still needs to be tested, but it is a lot more streamlined. In most cases the principle for applying emergency changes is “fix after”, meaning that if things go wrong you do not roll the change back, but you fix it. This requires more people on standby compared to a normal change. This “fix after” principle allows the testing to be lightweight.

3. Sprint Planning

The third thing that you need to run a development team is sprint planning. A sprint is a unit of two weeks. In theory it is possible to define a sprint as 1 or 3 weeks, but I’ve worked at 4 companies which do sprints, and every single one does 2 weeks.

Yes, Sprint is a way of running Agile. No one does Waterfall any more these days. Funny how things have moved: 5 years ago people were still debating Agile vs Waterfall. But not now. No more debate on that. The debate is more around CI/CD or automated testing.

Azure DevOps is probably the best tool in the market, followed by Jenkins and Travis. Many tools are not complete DevOps tools: GitHub, Gradle, JIRA, Bamboo, AWS CodePipeline and Trello, for example. They do build and source code control but not CI/CD, or vice versa. Or they do both but not sprint planning (Trello does sprint planning but has no release pipeline, whereas AWS CodePipeline is the opposite). Even Azure DevOps is not complete (it doesn’t do change control), but it does sprint planning, source code control, release pipelines, CI/CD and automated testing.

Every two weeks you plan the work that you need to do in the next sprint. These pieces of work are called user stories. First you define the team capacity, i.e. the number of working days minus holidays. You then allocate how many days are required for each story, who will be doing it, and the acceptance criteria (the definition of done), i.e. whether it is just development and test, or all the way to production. Then you check if anyone is under- or over-allocated. If someone is over capacity, redistribute or reduce the work.

After planning, during the 2-week sprint run you track the progress of each story. For each story you create a few tasks, with an estimate and the person doing it. As each task gets done, you mark the actual hours against the estimate. The status of each task changes from new to in progress to completed. This creates a burndown chart, which tracks the whole team’s actual hours against the estimates.

It’s all very well doing the planning and tracking above, but how do you determine what needs to be done in the first place? This is where the product owner comes in. The list of work in each sprint should be driven largely by the prioritised list of projects plus product increments. If the team is a dedicated project team, then the list of work comes from just 1 project plus the BAU of that project. Some teams serve multiple projects (3 months to 1 year in size) plus hundreds of enhancement requests (2 to 10 days in size) throughout the year.

4. Automated Testing

The fourth thing that you need to run a development team is automated testing. Many people leave this to the back of the queue, but the reality is that this thing saves a lot of time (and money), and increases the quality of the code enormously. Automated testing means two things:
a) Ensuring that deployment to production does not break anything
b) The code satisfies all the intended functionalities, and none of the unintended ones

Point a) is achieved by copying the production system into the test environment. The lot, i.e. thousands of components, code, settings, databases and files from production is restored into the test environment overnight. Then you apply the latest code and data changes from the release pipeline into this test environment. Then you run the whole system, and check the expected results. The key thing is: all of these are conducted programmatically, every day, without any human intervention.

Point b) is achieved by having a test environment which is completely empty, i.e. no data. You have the databases, but there are no rows in any table. You have the data folders, but there are no data files in them. Then you test 1 piece of code, e.g. a .Net web service or a stored procedure. You do this by setting up test data, e.g. you insert a few rows into the input tables. Each row reflects one specific test scenario. Then you run the code and check the result. Then you repeat this for every piece of code in the system, thousands of them. For each piece of code you need to run some positive tests and some negative tests. The key thing is: all of this is conducted programmatically, every day, without any human intervention.

If you had to do all that manually, it would require thousands of hours. Actually it would be impossible, because for every release you would have to repeat both tests. If you don’t have automated testing, you can’t deploy to production 3 times a week. The best you can do is once a quarter, because of the amount of testing involved. It does not make sense to do 10 hours of development and then spend 2000 hours on testing, right? But once your hours of testing are down to zero, you can do 10 hours of development, test it and deploy to production. And then you can do another 10 hours of development, and so on.

Because for every single change you run the whole lot of testing twice (both a and b above), you can release very often, and you increase the quality of the code. You do positive testing and negative testing on every component, and also integration testing. Point a) above is called “system testing”, and point b) is called “unit testing”. Both are automated.

Does this mean you do not need to do manual testing? No, you still need to. When a major feature is developed, you need to manually test the functionality across the whole piece. Suppose you are adding a new feature to an order processing system which enables the system to combine orders from different customers and process them together. There are 9 places in the system which need code changes. Each of these 9 pieces has been unit tested individually using b) above, which proves that each piece works individually (using a minimal amount of data, i.e. a few rows). The whole system has been system tested using a) above, which proves that there is no negative impact on the system as a whole.

But how do you know that the ability to combine orders really works end-to-end, using a normal production data volume? The automated unit tests only use a few rows, and they are made-up data. They do not prove that the new feature works end to end. To prove that the new feature works, we need to run thousands of real orders from last month through the system, combining hundreds of them, process them and check the output. Are all orders processed correctly? Are there any issues? If it passes this test, we can say hand on heart that the system now has the capability of combining orders.

5. Code Repository

This is a must in any development team. You can use GitHub, TFS, SVN, PVCS, BitBucket, or others, but you need to have the capability of storing code, versioning, master and feature branches, committing changes, packaging and releasing code, merging, collaborating, restoring to previous version, doing code reviews and pull requests.

Whether it is Java code, Python, databases, ETL tools, cubes or reports, they are all code. And they ALL need to be in ONE code repository. And the code repository MUST be integrated with the release pipeline. You need to be able to push a release candidate to the release pipeline, test it and deploy it automatically into various environments, including test, support and production.

6. Server Build

Like automated testing, this one is frequently put at the back of the queue, but it really is a time saver and a money saver, so we need to do it. Server Build is the ability to create a server programmatically. A release pipeline consists of a series of environments such as Dev, Test, UAT, Prod and Support. Each environment consists of several servers, such as a SQL Server, an application server, an SSIS server, an SSAS server, a Python server, etc.

These servers need to be built. If you build them manually, it takes a lot of hours. And not just once: they will also need to be configured, patched and maintained. And when they don’t work, you need to troubleshoot and fix them.

A better way to create servers in Azure is to use PowerShell DSC. For each environment you create a DSC configuration which tells PowerShell DSC the server name, resource group, folder paths, Windows features and any other characteristics which are specific to each environment and each server. PowerShell DSC then reads this DSC configuration and creates each server in each environment according to the settings written in this file.

The best practice is to recreate these servers on a weekly basis (except the production environment), to avoid having to upgrade and patch them.

7. BAU/support/bug fixes
8. Business requirement & design
9. Architecture

The traditional understanding of “DevOps” as a system normally includes points 1 to 6, but not 7 to 9. But 7 to 9 are certainly required to run a development team for any system (I do mean software, not civil engineering).

3 August 2019

Memory for SSIS

Filed under: Data Warehousing — Vincent Rainardi @ 6:56 am

SQL Server has an upper memory limit in the Server Properties. This is the maximum amount of memory that the SQL Server database engine can use. Just the database engine: it does not include SSRS, SSIS or SSAS. This maximum amount of memory is constantly used by the SQL Server database engine. If we set the upper limit to 90 GB, the SQL Server DB engine will use all 90 GB, all the time. Not some of the time, but all of the time. The SQL Server DB engine will use 90 GB every minute of the day.

Many people put SSIS in the same box as the SQL Server databases. This is fine, provided that SSIS is given a sufficient amount of memory. If the box has 96 GB, and the maximum amount made available to SQL Server is 90 GB, then only 6 GB is available for everything else, including Windows processes, antivirus, SSIS, SSRS and SSAS. In this case SSIS would run very slowly, like 100 times slower than it should, because effectively only a few MB is available to SSIS (depending on what other applications/processes are doing).

If the SQL Server box has 96 GB and the max amount is already set to 90 GB, and we put SSIS on the box, that 90 GB should be lowered to around 70 GB, so that there are 26 GB available for SSIS, Windows and other processes. We then look at the amount of memory used by the SSIS engine (in Task Manager, or using the Profiler). Generally speaking, SSIS would be happy if it has 16 GB available to it. Very big packages could require 32 GB, for example packages which move billions of rows.

Take 5 GB or so for Windows processes, antivirus and other operating system processes. Assuming there is no SSRS or SSAS on the box, we can calculate that out of that 26 GB, 21 GB is available for SSIS to use. If there is SSRS on the box, take 5 to 10 GB for SSRS. If there is SSAS on the box, it is a different story, as SSAS tends to grab all the remaining available memory. For this reason we should put SSAS in its own box/VM.

With 21 GB, SSIS should be able to perform near its peak performance. If a package processes up to 1 million rows, there will be no problem. If 5 packages run simultaneously, each processing up to 1 million rows, there should be no problem, unless a package is written in a very inefficient way of course, because SSIS automatically processes the rows in batches of 10,000. So “normal size” packages should run fine in SSIS, near peak performance, with 16-21 GB of memory (based on my experience).

What if you only have 64 GB on the box? Then allocate 43 GB for SQL Server (which should be enough for most queries or SQL operations, for most database sizes), giving 21 GB of free memory for all other processes. Taking 5 GB for Windows leaves 16 GB for SSIS (assuming there is no SSRS on the box).

What if you only have 32 GB on the box and you want SQL Server and SSIS running in parallel? Give 16 GB to SQL Server, leaving 16 GB for everything else. Minus 5 GB for Windows and other O/S-related processes, this gives 11 GB to SSIS. Not ideal, but that’s the best we can set for both SSIS and SQL Server without knowing the details of the SSIS packages. Once we know the detailed workflow and tasks, we can be more precise.
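
A back-of-envelope sketch of this allocation arithmetic (the 5 GB Windows/antivirus allowance and the other figures are the rules of thumb used above, not fixed values):

def ssis_free_memory(box_gb, sql_max_gb, os_gb = 5, ssrs_gb = 0):
    # Memory left for SSIS after SQL Server, Windows/antivirus and (optionally) SSRS
    return box_gb - sql_max_gb - os_gb - ssrs_gb

print(ssis_free_memory(96, 70))   # 21 GB, as in the 96 GB example
print(ssis_free_memory(64, 43))   # 16 GB
print(ssis_free_memory(32, 16))   # 11 GB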

I’m worried if there is less than 4 GB for SSIS (after taking out 5 GB for Windows processes, antivirus and other applications, and after taking out 5 GB for SSRS). 8 GB is the bare minimum I would recommend as memory for SSIS, for most packages. But my default recommendation for most packages would be 16 GB (that’s free memory just for SSIS, after taking out Windows and SSRS).

You can of course measure it on your dev box using profiler and check the peak memory requirements, when the packages are being run in SQL server. But this takes time (1-2 weeks), in order to properly measure up the collective requirements of all packages running in parallel.

30 May 2019

Entropy and Information Gain in Decision Tree

Filed under: Data Warehousing — Vincent Rainardi @ 7:28 pm

Decision Tree is one of the most popular algorithms in machine learning. It is relatively simple, yet able to produce good accuracy. But the main reason it is widely used is the interpretability. We can see how it works quite clearly. We can understand how it works.

Decision Tree is a supervised machine learning algorithm. So we train the model using a dataset, in order for it to learn. Then we can use it to predict the output. It can be used for both regression and classification. Regression is when the output is a number. Classification is when the output is a category.

In this article I would like to focus on Entropy and Information Gain, using investment funds as an example. Entropy is the level of disorder in the data.

Entropy

In thermodynamics, Entropy is the level of disorder or randomness in a system. Similarly, in data analytics, entropy is the level of disorder or randomness in the data. If we have 100 numbers and all of them are 5, then the data is in very good order. The level of disorder is zero. The randomness is zero. There is no randomness in the data; everywhere you look you get 5. The entropy is zero.

If these 100 numbers contain different numbers, then the data is in a disordered state. The level of randomness is high. When you pick a number, you might get 4, or you might get 7, or any other number. You don’t know what you are going to get. The data is “completely” random. The level of randomness in the data is very high. The entropy in the data is very high.

The distribution of these different numbers in the data determines the entropy. If there are 4 possible numbers and they are distributed 25% each, then the entropy is very high. But if they are distributed 99%, 1%, 1%, 1% then the entropy is very low. And if it’s 70%, 10%, 10%, 10% the entropy is somewhere in between (medium).

For a target with 2 classes (and log base 2, which is what we use below), the minimum value of the entropy is 0 and the maximum value is 1. With more classes the maximum can be higher than 1.

Information Gain

Now that we have a rough idea of what entropy is, let’s try to understand Information Gain.

A Decision Tree consists of many levels. In the picture below it consists of 2 levels. Level 1 consists of node A. Level 2 consists of node B and node C.

1. Two Branches of Entropy

Information Gain is the decrease in entropy from one level to the next. Node B has entropy = 0.85, a decrease of 0.1 from node A’s entropy, which is 0.95. So node B has an information gain of 0.1. Similarly, node C has an information gain of 0.95 - 0.75 = 0.2.

When the entropy goes down from 0.95 to 0.75, why do we say that we gain information? Entropy measures how mixed up, or uncertain, the data is: high entropy means the classes are thoroughly mixed, low entropy means a node is dominated by one class. When a split lowers the entropy, we become less uncertain about which class a data point belongs to, so we have gained information about the target. That “additional” information is the Information Gain.

Calculating Entropy

Now we know what Entropy is, and what Information Gain is. Let us now calculate the entropy.

First let’s find the formula for entropy. In thermodynamics, entropy is a logarithmic measure of the number of possible states of the system.

Entropy is the average information content (link). The information content of an event E1 is the log of 1/(the probability of E1). The information content is called I, so I1 = log(1/p(E1)).

If we have another event E2, its information content is I2 = log(1/p(E2)).

The average of the information content I1 and I2 (which is the entropy) is:
the sum over the events of (the information content of the event x the probability of that event occurring)
= I1 x p(E1) + I2 x p(E2)
= log(1/p(E1)) x p(E1) + log(1/p(E2)) x p(E2)
= -log(p(E1)) x p(E1) - log(p(E2)) x p(E2)

In general, if we have events E1 to En, the entropy is:
= -sum over i of (p(Ei) x log(p(Ei)))
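
A small sketch of this formula in Python, using log base 2 (the same base as the calculations later in this article):

import math

def entropy(probabilities):
    # Entropy = sum of p x log2(1/p) over all events, i.e. the average information content
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0, the maximum for two classes
print(entropy([1.0]))        # 0.0, no randomness at all
print(entropy([0.6, 0.4]))   # about 0.97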

Fund Price

Now that we know how to calculate entropy, let us calculate the entropy for the price of a fund going up or down in the next 1 year.

2a. Fund price table (top)
2b. Fund price table (bottom)

In the table above, the last column is the price of the fund 1 year from now, which can be higher or lower than today. This is denoted with “Up” or “Down”. This price is determined by 4 factors or features:

  1. The performance of the fund in the last 3 years (annualised, gross of fees).
    This past performance is divided into 3 buckets: “Less than 0”, “0 to 5%” and “More than 5%”.
  2. The interest rate, for example LIBOR GBP 1 Year today.
    Today’s interest rate is compared with the interest rate 1 year ago and divided into 3 buckets: higher than 1 year ago, lower than 1 year ago, or the same (constant).
  3. The value of the companies that the fund invests in, comparing each company’s book value to its share price today, and also the earnings (the income) the company makes compared to its share price (cyclically adjusted). This company value factor is divided into 3 buckets: overvalued, undervalued and fair value.
  4. The ESG factors, i.e. Environment, Social and Governance factors such as pollution, remuneration, the board of directors, employee rights, etc. This is also divided into 3 buckets, i.e. high (good), medium and low (bad).

The Four Factors

1. Past performance

Funds which have been going up a lot, generally speaking, have a tendency to revert back to the mean, meaning that they are going to go down. But another theory says that if the fund price has been going up, then it has a tendency to keep going up, because of the momentum. Who is right is up for debate. In my opinion the momentum principle has a stronger effect than the “reversion to the mean” principle.

2. Interest rate

This is because the value/price of the fund is not only affected by the companies or shares in the fund, but also by external factors. The interest rate represents these external factors. When the interest rate is high, share price growth is usually constrained because more investor money is invested in cash. On the contrary, when the interest rate is low, people don’t invest in cash and invest in shares (or bonds) instead.

The factor we are considering here is actually the change in the interest rate, but the impact is generally the same. Generally speaking, if the interest rate is going up then investment in equity decreases, putting pressure on the share price and resulting in a lower share price.

3. Value

If the company valuation is too high, investors become psychologically concerned, afraid that the price will go down. This concern creates pressure on the share price, and the share price eventually goes down.

On the contrary, if the company valuation is low compared to similar companies in the same industry sector and in the same country (and of similar size), then investors would feel that this stock is cheap and would be more inclined to buy. And this naturally pushes the price up.

4. ESG

Factors like climate change, energy management, health & safety, compensation, product quality and employee relations can affect the company value. Good ESG scores usually increase the value of the companies in the fund, and therefore collectively increase the value of the fund.

On the contrary, concerns such as accidents, controversies, pollution, excessive CEO compensation and issues with the auditability/control of the board of directors are real risks to a company’s future and therefore affect its share price.

Entropy at Top Level

Now that we know the factors, let us calculate the Information Gain for each factor (feature). This Information Gain is the Entropy at the Top Level minus the Entropy at the branch level.

Of the total of 30 events, there are 12 “Price is down” events and 18 “Price is up” events.

The probability of the price of a fund going “down” event is 12/30 = 0.4 and the probability of an “up” event is 18/30 = 0.6.

The entropy at the top level is therefore:
-U*Log(U,2) -D*Log(D,2) where U is the probability of Up and D is the probability of Down
= -0.6*Log(0.6,2) -0.4*Log(0.4,2)
= 0.97

Information Gain of the Performance branch

The Information Gain of the Performance branch is calculated as follows:

3a. Information Gain for Performance branch

First we calculate the entropy of the performance branch for “Less than 0”, which is:
-U*Log(U,2) -D*Log(D,2) where U is the probability that the price goes up when the performance is less than zero, and D is the probability that the price goes down when the performance is less than zero.
= -0.5 * Log(0.5,2) -0.5 * Log(0.5,2)
= 1

Then we calculate the entropy of the performance branch for “0 to 5%”, which is:
= -0.56 * Log(0.56,2) -0.44 * Log(0.44,2)
= 0.99

Then we calculate the entropy of the performance branch for “More than 5%”, which is:
= -0.69 * Log(0.69,2) -0.31 * Log(0.31,2)
= 0.89

Then we calculate the probability of the “Less than 0”, “0 to 5%” and “More than 5%” which are:
8/30 = 0.27, 9/30 = 0.3 and 13/30 = 0.43

So if Performance was the first branch, it would look like this:

2c. Performance as the first branch

Then we sum the weighted entropy for “Less than 0”, “0 to 5%” and “More than 5%”, to get the total entropy for the Performance branch:
1 * 0.27 + 0.99 * 0.3 + 0.89 * 0.43 = 0.95

So the Information Gain for the Performance branch is 0.97 – 0.95 = 0.02
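
The same numbers can be reproduced in a few lines of Python (a sketch using the rounded probabilities and counts quoted above):

import math

def entropy(p_up, p_down):
    # Entropy of a node with the given Up and Down probabilities, skipping zero probabilities
    return sum(p * math.log2(1 / p) for p in (p_up, p_down) if p > 0)

# Top level: 18 "Up" and 12 "Down" out of 30 events
top_entropy = entropy(18/30, 12/30)                  # 0.97

# Performance branch: entropy of each bucket and its weight (bucket size / 30)
buckets = [(entropy(0.50, 0.50),  8/30),   # "Less than 0"
           (entropy(0.56, 0.44),  9/30),   # "0 to 5%"
           (entropy(0.69, 0.31), 13/30)]   # "More than 5%"
branch_entropy = sum(e * w for e, w in buckets)      # about 0.95

print(round(top_entropy - branch_entropy, 2))        # Information Gain = 0.02, as above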

Information Gain for the Interest Rate, Value and ESG branches

We can calculate the Information Gain for the Interest Rate branch, the Value branch and the ESG branch the same way:

3b. Information Gain for Interest Rate branch

3c. Information Gain for Value branch

3d. Information Gain for ESG branch

Why do we calculate the entropy? Because we need entropy to know the Information Gain.

But why do we need to know the Information Gain? Because the decision tree would be more efficient if we put the factor with the largest Information Gain as the first branch (the highest level).

In this case, the factor with the largest Information Gain is Value, which has the Information Gain of 0.31. So Value should be the first branch, followed by ESG, Interest Rate and the last one is Performance.

10 May 2019

Analysis Services Tabular Model

Filed under: Data Warehousing — Vincent Rainardi @ 6:03 pm

What is it?
It is an in-memory database, managed by the Analysis Services Vertipaq engine.

Who uses it?
Power BI and Excel Power Pivot.

Why use it?
To provide fast and simple access to relational data.

When was it introduced?
2012 by Microsoft

How to create it?
Using SQL Server Data Tools (SSDT).

Where is it stored?
In a SQL Server Analysis Services server or in Azure Analysis Services.
The data is compressed when stored.

What are the components?
– Tables
– Relationships
– Measures
– Hierarchies
– KPIs
– Partitions
