Quite often in machine learning we would like to know which variables have the most impact on predicting the target variable. Knowing which product features affect the customer’s decision to buy the product is critical for a consumer goods manufacturer. Knowing which financial ratios affect the stock price if very important not only to fund managers, but also to the board of directors. Imagine for a second, that you will be able to know that, out of 10 different factors there are 2 factors which has lots of influence and 1 factor having minor influence, like this:

That would be very useful right? Some people would do anything to get this knowledge.

Well, the good news is: it is possible. In fact, we have been doing it for many decades, since before machine learning. But with machine learning it got better.

Like many things in machine learning, the question is usually: do we have the data? When I first studying ML, I always thought “Do we have the algorithm?” Or “which algorithm should I use?” (link) But after studying many algorithm, I’m now convinced that there would be one suitable. Given that we have deep learning which can find pattern in any data, it’s quite unlikely that we don’t have an algorithm. Even if there isn’t (just for argument sake), we can make one. Either by enhancing an existing algorithm or by combining several.

No, the question in machine learning is usually: do we have the data? And in this case, yes we do. I work a lot with ESG data (environment, social, corporate governance), such as carbon emission, air pollution, customer satisfaction, employee well being, board composition and executive remuneration. There are hundreds of ESG factors for every company and we have their share price (they are listed companies). Naturally, the question every one wants to know the answer to is: which ESG factors affect the share price most? And not only that, for every company we have many financial ratios such as: return on equity, net profit margin, debt to total asset, earning per share. Which ratios affect the share price the most?

Different industry sector have different variables. Imagine the variables in retail sector, health care sector, oil, banking, etc. They all have different variables, and they all have different target variables. And yes, we can find out which variables affect the target variable the most. So yes, we do have the data!

So it’s totally worth it to spend an hour reading and learning about this. The benefit for us is enormous. And once again, it is called Feature Importance (in machine learning features means input variables).

The best starting point I found, is to read what Jason Brownlee wrote on this topic: link. In that article he explains how to get the features importance using various different algorithms such as Random Forest, Decision Trees, Extreme Gradient Boost, Linear Regression, and Permutation. Both for regression (predicting a value) and classification (grouping data).

Yes, there are 2 different things in Feature Importance:

- Which variables affect the target variable the most (as I explained above), and
- Which variables affect the classification the most i.e. grouping the data points (putting similar data points into the same group)

In certain industry sectors #1 is more important, but in other industry sectors #2 is more important. That’s why we need to understand both.

If you read Jason’s article above, you’ll find that for each algorithm he provides a Python code to calculate the feature importance (mostly using Scikit Learn), as well as the result (both the numbers and the bar charts). In order to be comparable, we can scale those numbers to between 0 and 100 like this:

So if the output of a model is between 0 and 1, we multiply by 100. If it is between 0 and 1000, we divide by 10. This way all model outputs will become between 0 and 100.

In the above tables, the legend for the columns are as follows:

- XGB = Extreme Gradient Boosting
- RF = Random Forest
- DT = Decision Trees (more specifically CART, Classification and Regression Trees)
- LR = Linear Regression or Logistic Regression (depending whether it is for regression or for classification), using the coefficient for each feature
- Perm = Permutation (using K Neighbours Regression then calculating the permutation feature importance)

But the above is still not easy to compare between models. To make it easier to compare between models, we can express them as percentage of total. So we first calculate the column totals like this:

And then we express each cell as a percentage of total like this:

Now it is easier to see. We can go through line by line, and spot the anomalies like this:

We can see above that for regression, the Linear Regression considers Factor 2 and Factor 7 as much more important compared to the other models.

On the classification side, we can see that Decision Trees consider Factor 7 as more important compare to other models, but Factor 2 and 6 as less important. And the Logistic Regression model consider Factor 2 as more important than other models, but Factor 4 as less important.

It is usually easier to spot the anomalies if we put them into 3D bar chart like this:

Here on the Regression chart (see the red arrows) we can see the 2 yellow bars for Linear Regression model on factor 2 and 7 are higher than other models.

And on the Classification chart (see the green arrows) we can easily see the bars which are higher than other models. But (see the black arrows) it is more difficult to spot bars which are lower than other models.

The question here is which model is the best one to use? Well, each model have different ways in considering which variables are important, as Jason explained in his article. Random Forest for example, use the decrease in impurity to determine which variables are more important. Specifically, for every variable, the sum of the impurity decreases across every tree in the forest, and is accumulated every time that variable is used to split a node. The sum is then divided by the number of trees in the forest to give an average. By impurity here I mean Gini. Similarly, the Decision Trees (CART) and the XGB models also use the impurity decrease, but in slightly different flavour because the ensemble technique used is different in those models.

So it is not about which model is the best to use. But we have to choose a few models which covers different opinions. The wider the opinion, the better. So we need to choose models which are contrasting each other. For example, for regression we can choose Random Forest and Linear Regression (coefficients). Because Linear Regression points out that factor 2 and 7 are important, whereas other models don’t.

For classification we can use XGB, Decision Trees and Logistic Regression. Because DT and LR have different opinion to XGB with regards to which factors are important. We know from XGB factor 4 is the most important, followed by factors 3, 5 and 6. But DT is saying that factor 7 is important too. And LR is saying that factor 2 is important too. This way we can examine factor 2 and 7 to see if they are significant or not. So the most important thing in choosing the models is to get different opinion about what factors are important.