Yesterday a friend asked me why we use Python in machine learning. He prefers Node.js for application development, and he can code in other languages too, including C++.
I replied that we use Python in machine learning (ML) because:
ML libraries such as scikit-learn, Keras and XGBoost are mostly available in Python (some are available in R), and we need those libraries to build and train ML models.
Code examples in ML are mostly in Python e.g. in Kaggle, OpenAI, Medium, Towards Data Science, Analytics Vidhya, ML Mastery.
ML platforms like Colab, Dataiku, Google AI, OpenAI Gym, Data Robot, Azure ML use Python. Some use R but none uses Node.js.
And we both also observed that universities these days use Python more than other programming languages. ML courses on Coursera, Udemy and Udacity also use Python (again, some use R, but it is not as popular as Python).
Let me qualify some of the above remarks, with regards to R:
Keras and Tensorflow are available in R too: link.
So yes you can still do ML in R today. But it is a losing battle. More and more examples and academic papers are available in Python, compared to R.
And yes there is no chance for other languages, such as C++ or Node.js. We can’t use them in ML.
Change or Die
IT in general is a very dynamic world. It keeps changing, particularly programming (development). I started programming when I was 16, using the Basic language and dBase III+. At uni (around 19-23 years old) I mainly used Pascal, and I learned C and Fortran too. I had difficulty switching to Assembler, so I took a few months' course. During uni I took a few odd jobs/projects, using FoxPro and Clipper.
Then I joined Accenture (Andersen Consulting back then), where I used COBOL on AS/400, C++ and PL/SQL. Then I moved to the UK (a company called Paxar in Harlow, now Avery Dennison) and used C#.NET and T-SQL. I then joined Lastminute.com and used Java and PL/SQL. In subsequent jobs I used MDX, T-SQL, Python, JavaScript and R.
Yesterday another friend asked me how I switched to Python for ML recently. I replied that I have jumped many times in the past, so the last jump into Python was not difficult. You can see above that I have jumped from one language to the next constantly. I said to my friend (he's a SQL DBA) that the first jump, i.e. into Assembler, was very difficult. But after that it was easy. I jumped into C#, Java, R and Python without difficulty.
And we will have to change again in the future. In the IT world, we have to constantly change. Not just in terms of languages, but also in terms of platforms. For example, we all had to change to the Cloud platform. SQL Server, my DBA friend said yesterday, moved to the Cloud. Ah yes, we all moved to Azure or AWS.
So yes, I’m afraid in the IT world we have to change, or we die.
Best Language for Machine Learning
I'll close this article with this: there is a survey on LinkedIn this week by Zakash (link) about which will be the best programming language for Data Science and Machine Learning in 2022. And the result is overwhelmingly Python (94%). Yes, it is only 136 votes as of today (31st Dec, 9:15am), but click the above link to see the result after a few days. I am sure it will still be overwhelmingly Python.
I have written about whether we need to create a physical primary key in the fact table or not (link), including whether that primary key needs to be a single column or a composite. But a couple of weeks ago someone asked me if we would need to physicalise the foreign keys in the fact tables or not. This is an interesting topic because we all know that in Kimball dimensional modelling we have surrogate keys in the fact tables, but it has been a long debate in the data warehousing community whether these surrogate keys should be physicalised or not.
The thinking amongst many data warehousing practitioners is that the surrogate keys are fully controlled by the dimensional business key lookup process, so they don't need to be physicalised. Besides, it would slow down the insert into the fact tables, and the fact table insert is one of the tasks in data warehouse loading which needs to be kept as efficient as possible, because fact tables can be very large and loading them is resource intensive. That is why we employ techniques such as partition switching, disabling indexes, etc.
The point is, if the dimensional key lookup process ensures that all the surrogate keys (SK) in the fact tables are valid and correct, then why do we need to physicalise the SKs? For many DW practitioners, me included, referential integrity (RI) in the data warehouse should not be in the form of physicalised foreign keys, but maintained by the dimensional key lookup process (link, link point 5). This lookup process not only ensures that the SKs in the fact tables are valid, but also that they are correct.
Practitioners Discussion
Nicholas Galemmo, author of “Mastering Data Warehouse Design” book, link: In general, the process of assigning surrogate keys in the ETL process is more than sufficient to ensure FK RI. Enforcing constraints in the database is redundant and not necessary. Doing so slows down the load process. However, some BI tools make use of FK declarations to ease defining relationships between tables. Therefore it is common practice to declare FKs, but not implement constraints to enforce them.
Hang from Brisbane, Australia, link: Since ETL is a tightly controlled process, any database-level RI requirements in a data warehouse should be superseded by ETL. In an OLTP system, RI constraints are targeted at data entry on a record level and hence can protect data integrity. In a data warehouse, data are loaded in batches by ETL and RI validation is just a bare-minimum part of the whole complex process. RI constraints have performance implications for ETL, and some ETL also has its own pecking order for loading the data, which could require undoing the RI constraints. In my experience, the only value in having database RI is to automatically draw the connection lines between tables in some modelling tools.
Martin Rennhackkamp, link, point 5: The approach we typically follow is to define the foreign key constraints in the model, and implement and activate them in the development, testing and acceptance environments. However, once you are sure that the foreign key constraints are properly adhered to by the ETL processes, you do not implement and enforce them in the database in the production environment.
What Ralph Kimball Says
One of the old sayings in data warehousing is, "when in doubt, look at what Ralph Kimball says". And in this case, Ralph Kimball and Joe Caserta say this in their 2004 book, The Data Warehousing ETL Toolkit (link), page 212:
——————————————————
There are three main places in the ETL pipeline where referential integrity can be enforced:
Check the “load tables” before loading them into fact tables: Careful bookkeeping and data preparation just before loading the fact table records into the final tables, coupled with careful bookkeeping before deleting any dimension records.
Check while loading into fact table using “DBMS RI Check”: Enforcement of referential integrity in the database itself at the moment of every fact table insertion and every dimension table deletion.
Check the integrity of tables at later times: Discovery and correction of referential integrity violations after loading has occurred, by regularly scanning the fact table looking for bad foreign keys.
Practically speaking, the first option usually makes the most sense. One of the last steps just before the fact table load is looking up the natural keys in the fact table record and replacing them with the correct contemporary values of the dimension surrogate keys.
The second option of having the database enforce referential integrity continuously is elegant but often too slow for major bulk loads of thousands or millions of records. But this is only a matter of software technology.
The third option of checking for referential integrity after database changes have been made is theoretically capable of finding all violations but may be prohibitively slow. Perhaps the query can be restricted only to the data that has been loaded today. This is a sensible approach that probably should be used as a sanity check even if the first approach is the main processing technique.
——————————————————
Bear in mind that Ralph Kimball and Joe Caserta wrote this 17 years ago, in 2004, when database technologies were not as advanced as they are today. But it looks like they are in favour of option 1, i.e. maintaining RI in the ETL.
Steve Flynn’s Advice
One of the best pieces of advice I found is from Steve Flynn (link), who says the following (link):
At design time foreign key constraints should be used. They guarantee referential integrity when the facts are inserted and will throw an error that the developer/admin can catch and fix before the users report issues with the data.
Design the ETL as if the constraints are not there; i.e. follow Kimball’s recommendation to enforce RI during data preparation.
If ETL performance is an issue and removing the foreign key constraints will make a difference, then remove the constraints. But test the pre- and post-ETL performance to see what's been gained, and only do this if it makes a real difference and other performance optimisations haven't been tried first or have been ruled out for other reasons, such as cost.
This gives the assurance that RI cannot be compromised, even if modifications are made to a complex ETL process, while at the same time providing scope to optimise fact table load speed by disabling the foreign key constraints should it be required.
Conclusion
In my opinion, we do not need to implement RI in the fact tables because:
The SK lookup process in the ETL is sufficient to ensure RI.
In my experience, disabling and enabling FKs before and after the fact load takes a lot of time, and not disabling them makes the load take even longer.
We don’t delete from dimension tables.
The argument that "in SQL Server, FKs are used to optimise star queries" is not true. It is the star join which is used, not the constraint (link).
A data lakehouse is a data lake which has data warehouse features such as a star schema, transactions, data integrity, BI tools, data governance, scalability, data ingestion from various sources, and the ability to do "time travel" (link, link). For a paper on the data lakehouse, see here: link.
Usually a data lake does not have a schema. It is just a collection of files. No SQL data types, no structures. A data lakehouse has a schema. It has data types. It has a data structure. It has tables. And it has a database.
Usually we read data from a data lake and load it into a data warehouse/mart so that we can have a star schema. Then we point our BI tool such as Power BI or Tableau to this data warehouse/mart. With a data lakehouse, we don’t need to do that. We don’t need to build a data warehouse/mart. We can directly point the BI tool to the data lakehouse.
ACID
Usually a data lake does not have data integrity. There is no concept of a transaction. When users read the data lake while some process is ingesting data into it, there is no way to ensure data consistency or reliability. When multiple users are reading and writing from and to the same table all at once, there is no way to ensure that they don't interfere with each other (isolation). There is no concept of a transaction to ensure that multiple data changes are treated as one unit (atomicity). There is no concept of ACID, i.e. atomicity, consistency, isolation, durability. A data lake is just a collection of files. It is not a database.
But a data lakehouse is different. It is a data lake, but like a database it has the concepts of atomicity, consistency, isolation and durability (ACID). It can handle transactions, i.e. when we have multiple changes, either all of them are executed successfully or none of them is executed. It has the concept of durability, meaning that data changes are saved even when the disk or the server fails. This means that the data would not be corrupted by system crashes or power failures.
A data lakehouse can achieve this ACID capability using transaction logs. Yup, just like a relational database, it uses transaction logs.
Time Travel
One of the key features of a data warehouse is the ability to do “time travel”. Meaning that we can query the data as it is today, but we can also query the data as it was at some point in the past.
A data lakehouse has this capability. We can query an older snapshot of a table (link). We can rerun a report as it was last year. We can replay a machine learning input (or output) as it was before it was changed. We can write a temporal query. We can use this “time travel” capability for debugging or for auditing.
How do we do that in a data lakehouse? By querying the data lakehouse using the TIMESTAMP AS OF clause, like this:
SELECT * FROM Table1 TIMESTAMP AS OF '2020-12-31'
Simple, isn’t it?
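For completeness, here is a minimal PySpark sketch of the same time travel read, assuming a Spark session configured with the Delta Lake libraries; the table path is a hypothetical placeholder:

```python
# Minimal sketch: reading an older snapshot of a Delta table with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel-demo").getOrCreate()

# Read the table as it was at the end of 2020 (Delta Lake's "timestampAsOf" option).
df_2020 = (
    spark.read.format("delta")
    .option("timestampAsOf", "2020-12-31")
    .load("/data/table1")   # hypothetical table location
)
df_2020.show()
```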
SQL
Oh, I forgot to mention one very important feature! A data lakehouse uses the SQL language. Yes, the good old SQL! The SQL language that we have all been using for almost 50 years. You can query a data lakehouse by simply doing "select * from Table1 where column1 = criteria", and other SQL constructs such as GROUP BY, etc. Isn't it wonderful?
Many BI tools use the SQL language to query databases (well, all BI tools, actually). But when they query a data lake, they can't use SQL. Why? Because a data lake is just a collection of files: CSV files, for example, or pipe-delimited files, or JSON files, or Parquet files. So those BI tools have to open the files and read them as files!
Not with this special data lake called a data lakehouse. If your data lake is a data lakehouse, you can use SQL to query it. Your data lakehouse consists of databases and tables. Yes, databases and tables! Isn't it wonderful?
Delta Lake
We should thank the people at Delta for these wonderful capabilities. Who are Delta? Well, here's their website: https://delta.io/. They created something called "Delta Lake" on Apache Spark, which has all those wonderful features above. They started in June 2019 with version 0.2 (link). The latest version is 1.1.0 (link), released on 4th December 2021.
Other than ACID, it also supports change data capture, streaming, conditional update, metadata handling, schema evolution, and SQL constraints like “not null” and “check”. We can read all these wonderful features here: link. They documented it very well there.
The transaction log is very important. It enables atomicity, consistency, isolation and durability (ACID), as I said above. We can read the detailed mechanism of the Delta Lake transaction log here: link. The checkpoints, the format specification, the protocol, the transaction identifiers: everything is very well documented there.
The data lake format that Delta created is called Delta Lake. Delta Lake is open source; you can read all the code on GitHub: link.
Databricks
The best implementation of Delta Lake is Databricks (link). It is a data lake but it was designed from the start to be a data lakehouse. It is a Delta Lake implementation, so it has all the features of a Delta Lake.
Databricks consists of 3 environments:
Databricks Workspace
Databricks SQL
Databricks ML
A Databricks Workspace is an environment for managing all of your Databricks assets (link), such as notebooks, libraries, dashboards and experiments; it organises them into folders and provides access to data objects and computational resources. A Databricks Workspace is used for data engineering and data science.
Databricks SQL is an environment for running quick ad-hoc queries on your data lake, creating multiple visualisation types to explore query results from different perspectives, and building and sharing dashboards (link).
Databricks ML is an integrated end-to-end machine learning environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving (link).
Amazon Athena
Amazon Athena is another platform that implements Data Lakehouse. It sits on top of the Amazon S3 data lake. We can write SQL statements to query the data in the lake as if it were a relational database. Yes, we can GROUP BY, DISTINCT, UNION, HAVING, even JOIN and CTE! (link). It is schema-on-read, so we issue a "CREATE TABLE" statement first, specifying the data type for each column (link). We can even do INSERT to add data into the files in the lake. And yes, we can partition the tables in Athena, like in Databricks.
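As an illustration, here is a minimal sketch of running such a query from Python with boto3; the database, table and S3 bucket names are placeholders, not real resources:

```python
# Sketch: running an Athena SQL query from Python with boto3.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; Athena writes the results to the given S3 location.
execution = athena.start_query_execution(
    QueryString="SELECT sector, COUNT(*) AS n FROM sales GROUP BY sector",
    QueryExecutionContext={"Database": "my_lake_db"},          # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```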
Oracle Autonomous Database
Oracle Autonomous DB is another platform that implements Data Lakehouse. It provides a SQL queryable layer on top of data lakes in AWS S3, Azure, GCP, or Oracle Cloud. See my article here for more info and relevant links.
Snowflake
Snowflake is another platform that implements Data Lakehouse. We can query a data lake in S3 / GCS / ADLS by creating Snowflake external tables on top of it. This way we can use SQL to query the data lake. You can read how to do it in the Snowflake documentation: link.
Hive
It was Hive which started this "query the data files using SQL" thing 10 years ago. Creating external tables is an old technique. I remember creating an external table in Hive about 10 years ago. Yes, in 2012! At that time there were no data lakes. I was working for an investment bank and all we had was Hadoop (HDFS). On top of this distributed file system we used Apache Hive so we could query the data using SQL. See the Hive documentation on how to do this: link.
Data Warehouse
The question is, now that we have the data lakehouse, do we still need a data warehouse? The answer is yes. If your company doesn't have a data lake, then you will need to build a data warehouse.
But if your company builds a data lake, you had better build the special form of data lake described above: a data lakehouse. This way you won't need to build a data warehouse or a data mart. You can point your BI tool directly at the data lakehouse.
I am a firm believer in the iterative approach, rather than the big bang. To create something complex we need to build it incrementally. For my master's dissertation I would like to use machine learning to manage investment portfolios. Last week I described the "lay of the land" of this topic in this article: link. That made me realise 3 things:
That the topic is very large, from predicting stock prices to managing risk, from managing stock composition to cryptocurrencies.
That what I'd like to do is manage portfolio allocation. In terms of assets, I would prefer stocks rather than fixed income or cryptocurrencies.
That the best approach for this is Reinforcement Learning (a Q network).
Problem Statement
So as the first step, I would like to simply use a Q network to decide which portfolio allocation would be best in terms of maximising return. So the reward is the return (current market price minus the purchase cost). The environment is 4 stocks in different industry sectors:
1 in financial industry: JP Morgan, symbol: JPM
1 in retail industry: Home Depot, symbol: HD
1 in commodity industry: Exxon Mobil, symbol: XOM
1 in healthcare industry: Pfizer, symbol: PFE
All from the same country, i.e. the US. The action is to choose the composition of these 4 stocks in the portfolio, plus cash. To simplify things, the composition must be: one of the five assets (one of the 4 stocks, or cash) at 40% weight, and the other four at 15% weight each. So there are only 5 possible compositions, and therefore only 5 possible actions to take.
Every working day, the agent must decide which action it wants to take, i.e. which composition it wants to use. The reward is then calculated by comparing the valuation of the portfolio at the end of the day to the previous day's, minus the transaction cost and the holding cost. The portfolio valuation is obtained by summing the valuation of the 4 stocks (held quantity x today's closing price) plus the cash. The transaction cost is $10 per trade. The holding cost is 0.005% of the portfolio value per day, including weekends.
I will be using 4 years of price data, from 19th Dec 2016 to 18th Dec 2020, to train the model, and 19th Jan 2021 to 18th Dec 2021 to test it. Note that stock prices are only available Monday to Friday, and only when it's not a public holiday in the US. All 5 prices (open, close, high, low, adjusted close) will be fed into the Q model, plus the daily volume too.
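As a rough sketch of how this data could be pulled, here is one way to download the daily OHLCV history with the yfinance package (yfinance is my assumption here; any market data source would do):

```python
# Sketch: pulling daily OHLCV history for the four stocks with yfinance.
import yfinance as yf

tickers = ["JPM", "HD", "XOM", "PFE"]

# Training window: open, high, low, close, adjusted close and volume per ticker.
train = yf.download(tickers, start="2016-12-19", end="2020-12-18", auto_adjust=False)

# Test window.
test = yf.download(tickers, start="2021-01-19", end="2021-12-18", auto_adjust=False)

print(train["Close"].head())   # closing prices, one column per ticker
```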
The Environment
In Reinforcement Learning we have an environment and an agent. The environment consists of a state space and an action space. In Reinforcement Learning we need to define 6 things in the environment:
The state space i.e. a list of all the possible states
The action space i.e. a list of all possible actions
The reward for doing an action from a particular state
The next state after doing an action
An episode, and how many time steps in an episode
How the environment will be reset at the beginning of an episode
State Space: The state is the current state of the portfolio on a particular date, i.e. composition 1 to 5 (C1 to C5). In the beginning, before any trade is made, the portfolio consists of 100% cash (trade means buying or selling a stock). This beginning state is called C0.
Action Space: The action that the agent can take is buying or selling stocks so that the portfolio is in a certain composition (from composition 1 to 5). So there are 5 actions. Let’s name these 5 actions as A1 to A5. If the agent does action A2, then the next state is C2, because the portfolio will be in composition 2. If the agent does action A3, then the next state will be C3. And so on. Of course the agent can also do nothing (let’s call it A0), in this case the next state is the same as the previous state.
Episode: One episode in this case is 30 trading days. So at the beginning of an episode, a date within the training data is randomly chosen as the start date. For example, say the start date is 17th June 2018. Then every trading day the agent takes an action. A "trading day" means a day when the US stock markets are open. 17th June is a Sunday, not a trading day, so it starts on 18th June 2018, like this:
2018-06-17 Sunday: no action
2018-06-18 Monday: Action = A2
2018-06-19 Tuesday: Action = A3
2018-06-20 Wednesday: Action = A5
... and so on until
2018-07-26 Thursday: Action = A4
2018-07-27 Friday: Action = A5
2018-07-30 Monday: Action = A1
In the above, the actions are just examples. Every day the agent determines which action to take, from A1 to A5. The agent can only take one action per day, i.e. at the beginning of every day.
Note that US public holidays are not trading days. So for example, 25th Dec 2018 (Tuesday) is Christmas day, so no action.
Reward: The portfolio valuation is calculated as the valuation of all the stocks, plus cash. The reward is calculated based on the profit for that day, i.e. the portfolio value at the end of the day, minus the portfolio value at the start of the day.
Beginning of an episode: At the beginning of an episode the portfolio consists entirely of cash. This is an additional state, in addition to C1 to C5 defined above. So we have 6 states in total: C0 to C5.
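To make the six states, the six actions (A0 plus A1-A5), the episode length and the reset behaviour concrete, here is a minimal gym-style environment sketch. The composition weights encode my reading of the rule above (one asset at 40%, the other four at 15% each), and the valuation logic is left as a stub, so treat everything here as an assumption rather than the actual implementation:

```python
# Minimal sketch of the trading environment described above (all values assumed).
import numpy as np

ASSETS = ["CASH", "JPM", "HD", "XOM", "PFE"]

# Assumed compositions C1..C5: asset i gets 40%, the other four get 15% each.
COMPOSITIONS = {i + 1: np.where(np.arange(5) == i, 0.40, 0.15) for i in range(5)}

class PortfolioEnv:
    def __init__(self, prices, episode_len=30, start_cash=1_000_000,
                 trade_cost=10.0, holding_cost_rate=0.0001):
        self.prices = prices                        # daily price table, one row per trading day
        self.episode_len = episode_len              # 30 trading days per episode
        self.start_cash = start_cash
        self.trade_cost = trade_cost                # $10 per trade
        self.holding_cost_rate = holding_cost_rate  # fraction of portfolio value per day

    def reset(self):
        # Pick a random start date; the portfolio begins as 100% cash (state C0).
        self.t = np.random.randint(0, len(self.prices) - self.episode_len)
        self.day = 0
        self.state = 0                              # C0
        self.value = self.start_cash
        return self.state

    def step(self, action):
        # action 0 = do nothing (A0); actions 1..5 = rebalance into C1..C5.
        if action != 0:
            self.state = action
        reward = self._daily_profit(action)         # end-of-day value minus start-of-day value
        self.day += 1
        self.t += 1
        done = self.day >= self.episode_len
        return self.state, reward, done

    def _daily_profit(self, action):
        # Placeholder: the valuation and cost logic from the next section goes here.
        return 0.0
```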
Portfolio Valuation
At this initial state we need to define how much cash is in the portfolio. Let’s define that as USD $1 million. So on that first day in the episode (say Sunday 17th June 2018), the value of the portfolio was $1 million.
Let's say that the next day, Monday 18th June 2018, the agent decided to take action A1, which brings the portfolio to state C1. So on that Monday morning the portfolio consisted of: 40% cash, 15% JPM, 15% HD, 15% XOM and 15% PFE. The value of the portfolio at the beginning of that Monday 18th June 2018 was the sum of the 40% cash and the initial value of the holdings (i.e. the 4 stocks):
In the above, the prices are from the stock market data. The quantity held is simply calculated as the money to buy the stock divided by the opening price. The value of the portfolio at the end of that Monday 18th June 2018 is the sum of the 40% cash and the value of the 4 stocks (based on the quantity held):
40% cash = $400,000
JPM: Closing price: 108.180000. Value = 151,286.59
HD: Closing price: 200.690002. Value = 151,319.49
XOM: Closing price: 80.82. Value = 150,783.58
PFE: Closing price: 34.278938. Value = 150,124.55
We also need to subtract the transaction cost, which is $10 per trade, and the holding cost (the cost we pay to the investment platform, for keeping our holdings), which is 0.01% of the portfolio value, per day:
So after 1 day of trading on 18th June 2018, after the agent decided to take action A1, the portfolio value is $1,003,424.03. So the profit is $3,424.03.
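A small numerical sketch of this valuation step is below. The closing prices are the figures quoted above, but the held quantities are placeholders (in the article they come from the money allocated divided by the opening price), so the output will not reproduce the exact $1,003,424.03 figure:

```python
# Sketch of the end-of-day valuation: cash + holdings, minus transaction and holding costs.
def end_of_day_value(cash, quantities, closing_prices, n_trades,
                     trade_cost=10.0, holding_cost_rate=0.0001):
    """Cash plus sum(quantity x closing price), minus transaction and holding costs."""
    holdings = sum(qty * closing_prices[s] for s, qty in quantities.items())
    gross = cash + holdings
    return gross - n_trades * trade_cost - gross * holding_cost_rate

closing = {"JPM": 108.18, "HD": 200.690002, "XOM": 80.82, "PFE": 34.278938}

# Placeholder quantities; in the article they are money allocated / opening price.
quantities = {"JPM": 1400, "HD": 750, "XOM": 1870, "PFE": 4380}

value = end_of_day_value(cash=400_000, quantities=quantities,
                         closing_prices=closing, n_trades=4)
print(round(value, 2))
```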
The reward
The reward is the profit for that day, i.e. the portfolio value at the end of the day minus the portfolio value at the start of the day. So in this case the reward is $3,424.03. Note that the reward can be negative, i.e. if the value of the portfolio at the end of the day is lower than at the start of the day.
For each trading day there will be a reward, which is added to the previous days' rewards to form the cumulative reward. Every day we calculate the cumulative reward.
Episodes and Total Score
An episode is 30 trading days. At the end of the episode (30th July 2018) the cumulative reward is the “total score” for the agent.
Then another episode is chosen and the environment is reset. The 30 days of trading begin, and the reward and cumulative reward are calculated every day. At the end of the episode, the total score is obtained.
And so the agent keeps learning, episode by episode, each time adjusting the weights of the neural network within the agent. In the early episodes, we expect the total score to be low because the agent is still learning. But after many episodes, we expect the total score to be consistently high, because the agent has learned the patterns in the stock prices, so it knows which stock would be the most profitable to invest in.
Generating Experience
So far so good. But the state space is not actually just the current portfolio holdings/composition. The state also includes the current prices and the historical prices; not only the prices of the holdings, but also of the stocks not held (because they also determine what should be held).
And in reality, a portfolio does not hold just 4 stocks. There are 40 to 100 stocks, depending on the size and the policy of the fund. And the investment universe (out of which we choose the stocks to hold) is about 500 to 1000 stocks.
So obviously we can't have a mapping of every combination of state and action to its "value" (the net profit for today), because there are so many combinations (millions).
In Reinforcement Learning this problem is solved by approximating the value using a neural network. We don't record all those combinations of historical prices (states) and stock allocations (actions) and their values (today's profit). Instead, we train a neural network to learn the relationship between states, actions and values, then use it to approximate the value for a given state and action.
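As a rough illustration of what such a function approximator could look like, here is a minimal Keras Q-network sketch; the input size, layer widths and optimiser settings are arbitrary assumptions:

```python
# Minimal sketch of a Q-network: state vector of recent prices in, one value per action out.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 6 * 4 * 30   # e.g. 6 price/volume fields x 4 stocks x 30 days (assumed)
N_ACTIONS = 6             # A0 (do nothing) plus A1..A5

def build_q_network():
    model = keras.Sequential([
        layers.Input(shape=(N_FEATURES,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(N_ACTIONS, activation="linear"),  # one estimated value per action
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
    return model

q_net = build_q_network()
state = np.random.rand(1, N_FEATURES)      # placeholder state vector
print(q_net.predict(state, verbose=0))     # estimated value of each action
```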
In Reinforcement Learning, we generate the experience using a Q network. Generating an experience means that the system chooses either to explore or to exploit. This is called the "Epsilon-Greedy" algorithm (a small sketch follows the steps below):
Set the epsilon (ε), which is the boundary between exploration and exploitation.
Generate a random number.
If the number is less than epsilon, choose a random action (exploration).
If the number is more than epsilon (or equal), choose the best action (exploitation). The best action is the action with the highest reward.
Calculate the reward.
Determine the next state.
Store the state, the action, the reward and the next state.
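Here is the small sketch promised above, wiring the epsilon-greedy steps to the (assumed) PortfolioEnv and q_net objects from the earlier sketches:

```python
# Sketch of one epsilon-greedy step: explore or exploit, then store the experience.
import random
import numpy as np

replay_buffer = []   # stores (state, action, reward, next_state) tuples
epsilon = 0.1        # boundary between exploration and exploitation

def epsilon_greedy_step(env, q_net, state_vector):
    if random.random() < epsilon:
        action = random.randrange(6)                       # exploration: random action
    else:
        q_values = q_net.predict(state_vector, verbose=0)  # exploitation: best predicted action
        action = int(np.argmax(q_values[0]))

    next_state, reward, done = env.step(action)            # reward and next state from the environment
    replay_buffer.append((state_vector, action, reward, next_state))
    return next_state, reward, done
```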
So that's the topic for the next article, i.e. how to use a neural network to approximate the value for a given state and action. The state in this case is the historical prices, and the action is the portfolio composition (or stock allocation). The value is the profit or gain on a particular day.
So the problem statement becomes: given all the historical prices, what is the best portfolio composition for today? And the value of that portfolio composition is: the net profit we make today.
Once we are able to create a neural network which can answer the above question, then we’ll create the second neural network to do the Reinforcement Learning, using Action and Reward, using Environment and Agent. This second NN will be learning how to optimise a portfolio, i.e. what stocks should be held in order to maximise the return during a 30-day period (one episode).
In investment management, machine learning can be used in different areas of portfolio management, including portfolio construction, signal generation, trade execution, asset allocation, security selection, position sizing, strategy testing, alpha factor design and risk management. Portfolio management is first a prediction problem for the vector of expected returns and the covariance matrix, and then an optimisation problem for returns, risk and market impact (link). We can use various ML algorithms for managing investment portfolios, including reinforcement learning, Elastic Net, RNN (LSTM), CNN and Random Forest.
In this article I would like to give an "overview of the land" of how machine learning is used to manage investment portfolios. So I'm going to list the various methods that people use, but I am not going to explain the mechanics of each method. For example, I might mention that people predict the direction of stock prices using random forests, but I'm not going to explain the why or the how. For each case I'm going to provide a link, so that you can read more if you are interested. The idea is that people who wish to do research on portfolio management using ML can understand the current landscape, i.e. what has been done recently.
In the next article (link), I will describe the problem statement of using Reinforcement Learning to manage portfolio allocation. So out of the many aspects of portfolio management I describe in this article, I choose only one (portfolio allocation), and out of the many ML approaches I describe in this article, I choose only one (RL).
But first, a brief overview on what portfolio management is. This is important as some of us are not from the investment sector. We know about the stock markets, but have no idea how a portfolio is managed.
What is Portfolio Management?
Portfolio management is the art and science of making optimal investment decisions to minimise risk and maximise return, in order to meet the investor's financial objectives and risk tolerance. Active portfolio management means strategically buying and selling securities (stocks, bonds, options, commodities, property, cash, etc.) in an effort to beat the market, whereas passive portfolio management means matching the returns of the market by replicating an index.
Portfolio management involves the following stages:
Objectives and constraints
Portfolio strategies
Asset allocation
Portfolio construction
Portfolio monitoring
Day to day operations
Let’s examine those 6 stages one by one.
1. Define the investment objectives and constraints
First the investors or portfolio managers need to define the short term and long term investment goals, and how much and which types of risks the investor is willing to take. Other constraints include the capital amount, the time constraints, the asset types, the liquidity, the geographical regions, the ESG factors (environment, social, governance), the “duration” (sensitivity to interest rate changes) and the currency. Whether hedging FX or credit risks is allowed or not (using FX forwards and CDS), having more than 10% of cash is allowed or not, investing in developed markets is allowed or not, how much market volatility is allowed, whether investing in companies with market cap less than $1 billion is allowed, whether investing in coal or oil companies is allowed or not, whether investing in currencies is allowed or not, etc. – those are all portfolio constraints too.
2. Define the portfolio strategies
Based on the objectives and constraints, the investors or portfolio managers define the portfolio strategies, i.e. active or passive, top down or bottom up, growth or value investing, income investing, contrarian investing, buy and hold, momentum trading, long short strategy, indexing, pairs trading, dollar cost averaging (see here for more details). Hedging strategies, diversification strategies, duration strategies, currency strategies, risk strategies, stop loss strategies, liquidity strategy (to deal with redemptions and subscriptions), cash strategies – these are all strategies in managing portfolios.
3. Define the asset allocations
Based on the objectives and constraints, the investors or portfolio managers define what types of assets they should be investing in. For example, if the objective is to make a difference in climate change, then the investment universe would be low carbon companies, clean energy companies and green transport companies. If one of the investment constraints is to invest in Asia, but not in Japan, and only in fixed income (not equity), then the investment universe would be the bonds issued by companies based in China, India, Korea, Singapore, etc. The asset types could be commodities (like oil or gold), property (like offices or houses), cash-like assets (like treasury bonds), government bonds, corporate bonds, futures, ETFs, cryptocurrencies, options, CDS (credit default swaps), MBS (mortgage-backed securities), ABS (asset-backed securities), time deposits, etc.
4. Portfolio construction
Once the strategies and the asset allocations are defined, the investors or portfolio managers begin building the portfolio by buying assets in the stock/bond markets and by entering contracts (e.g. CDS contracts, IRS contracts, property contracts, forward exchange contracts). Every company that they are going to buy is evaluated. The financial performance is evaluated (financial ratios, balance sheet, cash flow, etc.), the board of directors is evaluated (independence, diversity, board size, directors' skills, ages and backgrounds, etc.), the stock prices are evaluated (company value, historical momentum, etc.), the controversies are evaluated (incidents, health & safety record, employee disputes, law-breaking records & penalties, etc.), and the environmental factors are evaluated (pollution, climate change policies, carbon and plastic records, etc.). So it is not just financial, but a lot more than that.
5. Portfolio monitoring
Then they need to put in place a risk monitoring system, compliance monitoring system, performance monitoring system, portfolio reporting system and asset allocation monitoring system. Every trade is monitored (market abuse, trade transparency, capital requirements, post-trade reporting, authorisation requirements, derivative reporting), and every day each portfolio holding is monitored. Cash levels and portfolio breakdown are monitored every day. Early warnings are detected and reported (for threshold breaches), the effects of market movements are monitored, and operational risks are monitored and reported. Client reporting is in place (e.g. when investment values drop more than 10% the client must be notified), and audits are put in place (data security audit, IT systems audit, legal audit, anti-money-laundering audit, KYC/on-boarding process, insider trading).
6. Day-to-day operations
In the day-to-day operations, the investors or portfolio managers basically identify potential trades to make money (to enhance the return). A trade means buying or selling securities. For this they screen potential companies (based on financial ratios, technical indicators, ESG factors, etc.) to come up with a short list of companies that they will buy. They research these companies in depth and finally come up with one company they are going to buy (company A). They calculate which holding in the portfolio they will need to sell (company B) to buy this new company. They calculate the ideal size of the holding for company A (in a global portfolio, each holding is about 1-2% of the portfolio), and it depends on the other properties of this company as well (sector, country, benchmark comparison, etc.). Then they make 2 trades: buy A and sell B.
Apart from trades to make money, there are trades for other purposes: trades to mitigate risks, trades for compliance, trades for rebalancing, trades for benchmark adjustments, trades to improve liquidity, etc.
What is not included in portfolio management is the sales and marketing operation, business development and product development. These activities also directly impact portfolio management, because subscriptions and redemptions change the AUM (assets under management), but they are not considered part of portfolio management.
Machine Learning Methods Used in Portfolio Management
Below are various research papers which use various machine learning models and algorithms to manage investment portfolios, including predicting stock prices and minimising the risks.
Part 1. Using Reinforcement Learning
A deep Q-learning portfolio management framework for the crypto currency market (Sep 2020, link) A deep Q-learning portfolio management framework consisting of 2 elements: a set of local agents that learn asset behaviours and a global agent that describes the global reward function. Implemented on a crypto portfolio composed of four crypto currencies. Data: Bitcoin (BTC), Litecoin (LTC), Ethereum (ETH) and Ripple (XRP), July 2017 to January 2019.
RL based Portfolio Management with Augmented Asset Movement Prediction States (Feb 2020, link) Using State-Augmented RL framework (SARL) to augment the asset price information with their price movement prediction (derived from news), evaluated on accumulated profits and risk-adjusted profits. Datasets: Bitcoin and high tech stock market, and 7 year Reuters news articles. Using LSTM for predicting the asset movement and NLP (Glove) to embed the news then feed into HAN to predict asset movement.
Adversarial Deep RL in Portfolio Management (Nov 2018, link) Using 3 RL algorithms: Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO) and Policy Gradient (PG). China stock market data. Using Adversarial Training method to improve the training efficiency and promote average daily return and Sharpe ratio.
Financial Portfolio Management using Reinforcement Learning (Jan 2020, link) Using 3 RL strategies are used to train the models to maximize the returns and minimize the risks: DQN, T-DQN, and D-DQN. Indian stock market from Yahoo finance data, from 2008 to 2020.
Using RL for risk-return balanced portfolio management with market conditions embedding (Feb 2021, link) A deep RL method to tackle the risk-return balancing problem by using macro market conditions as indicators to adjust the proportion between long and short funds to lower the risk of market fluctuations, using the negative maximum drawdown as the reward function.
Enhancing Q-Learning for Optimal Asset Allocation (Dec 1997, link) Enhances the Q-learning algorithm for optimal asset allocation using only one value function for many assets, allowing model-free policy iteration.
Portfolio Optimization using Reinforcement Learning (Apr 2021, link) Experimenting with RL for building optimal portfolio of 3 cryptocurrencies (Dash, Litecoin, Staker) and comparing it with Markowitz’ Efficient Frontier approach. Given the price history, to allocate a fixed amount of money between the 3 currencies every day to maximize the returns.
Part 2. Using Recurrent Neural Network (RNN)
Mutual Fund Portfolio Management Using LSTM (Oct 2020, link) Predicting the company stock prices on 31/12/2019 in IT, banking and pharmaceutical sectors based on Bombay stock prices from 1/1/2012 to 31/12/2015. Mutual funds are created from stocks in each sector, and across sectors.
Stock Portfolio Optimization Using a Deep Learning LSTM Model (Nov 2021, link) Time series analysis of the top 5 stocks historical prices from the nine different sectors in the Indian stock market from 1/1/2016 to 31/12/2020. Optimum portfolios are built for each of these sectors. The predicted returns and risks of each portfolio are computed using LSTM.
Deep RL for Asset Allocation in US Equities (Oct 2020, link) A model-free solution to the asset allocation problem, learning to solve the problem using time series and deep NN. Daily data for the top 24 stocks in the US equities universe with daily rebalancing. Compare LSTM, CNN, and RNN with traditional portfolio management approaches like mean-variance, minimum variance, risk parity, and equally weighted.
Portfolio Management with LSTM (Dec 2018, link) Predicting short term and long term stock price movements using LSTM model. 15 stocks, 17 years of daily Philippine Stock Exchange price data. Simple portfolio management algorithm which buys and sells stocks based on the predicted prices.
Anomaly detection for portfolio risk management (June 2018, link) ARMA-GARCH and EWMA econometric models, and LSTM and HTM machine learning algorithms, were evaluated for the task of performing unsupervised anomaly detection on the streaming time series of portfolio risk measures. Datasets: returns and VAR (value at risk).
Part 3. Using Random Forest
Forecasting directional movements of stock prices for intraday trading using LSTM and random forests (June 2021, link) Using random forests and CuDNNLSTM to forecast the directional movements of S&P 500 constituent stocks from January 1993 to December 2018 for intraday trading (closing and opening prices returns and intraday returns). On each trading day, buy the 10 stocks with the highest probability and sell short the 10 stocks with the lowest probability to outperform the market in terms of intraday returns.
Stock Selection with Random Forest in the Chinese stock market (Aug 2019, link) Evaluates the robustness of the random forest model for stock selection. Fundamental/technical feature space and pure momentum feature space are adopted to forecast the price trend in the short and long term. Data: all companies on the Chinese stock market from 8/2/2013 to 8/8/2017. Stocks are divided into N classes based on the forward excess returns of each stock. RF model is used in the subsequent trading period to predict the probability for each stock that belongs to the category with the largest excess return. The selected stocks constituting the portfolio are held for a certain period, and the portfolio constituents are then renewed based on the new probability ranking.
Predicting clean energy stock price using random forests approach (Jan 2021, link) Using random forests to predict the stock price direction of clean energy exchange traded funds. For a 20-day forecast horizon, tree bagging and random forests methods produce 85% to 90% accuracy rates while logistic regression models are 55% to 60%.
Stock Market Prices Prediction using Random Forest and Extra Tree Regression (Sep 2019, link) Comparing Linear Regression, Decision Tree and Random Forest models. Using the last 5 years historical stock prices for all companies on S&P 500 index. From these the price of the stock for the sixth day are predicted.
Part 4. Using Gradient Boosting
A Machine Learning Integrated Portfolio Rebalance Framework with Risk-Aversion Adjustment (July 2021, link) A portfolio rebalance framework that integrates ML models into the mean-risk portfolios in multi-period settings with risk-aversion adjustment. In each period, the risk-aversion coefficient is adjusted automatically according to market trend movements predicted by ML models. The XGBoost model provides the best prediction of market movement, while the proposed portfolio rebalance strategy generates portfolios with superior out-of-sample performances compared to the benchmarks. Data: 25 US stocks, 13-week Treasury Bill and S&P 500 index from 01/09/1995 to 12/31/2018 with 1252 weekly returns.
The Success of AdaBoost and Its Application in Portfolio Management (Mar 2021, link) A novel approach to explain why AdaBoost is a successful classifier introducing a measure of the influence of the noise points. Applying AdaBoost in portfolio management via empirical studies in the Chinese stock market: 1. Selecting an optimal portfolio management strategy based on AdaBoost 2. Good performance of the equal-weighted strategy based on AdaBoost Data: June 2002 and ends in June 2017, 181 months, Chinese A-share market. 60 fundamentals & technical factors.
Moving Forward from Predictive Regressions: Boosting Asset Allocation Decisions (Jan 2021, link) A flexible utility-based empirical approach to directly determine asset allocation decisions between risky and risk-free assets. Single-step customized gradient boosting method specifically designed to find optimal portfolio weights in a direct utility maximization. Empirical results of the monthly U.S. data show the superiority of boosted portfolio weights over several benchmarks, generating interpretable results and profitable asset allocation decisions. Data: The Welch-Goyal dataset, containing macroeconomic variables and the S&P 500 index from December 1950 to December 2018.
Understanding Machine Learning for Diversified Portfolio Construction by Explainable AI (Feb 2020, link) A pipeline to investigate heuristic diversification strategies in asset allocation. Use explainable AI to compare the robustness of different strategies and back out implicit rules for decision making. Augment the asset universe with scenarios generated with a block bootstrap from the empirical dataset. The empirical dataset consists of 17 equity index, government bond, and commodity futures markets across 20 years. The two strategies are back tested for the empirical dataset and for about 100,000 bootstrapped datasets. XGBoost is used to regress the Calmar ratio spread between the two strategies against features of the bootstrapped datasets.
After 13 months of exams, assignments, case studies, tests and lectures I finally finished my machine learning postgraduate course. Despite the encouragement from my course provider to go looking for a data scientist job, I'm not thinking of going for interviews yet. I am happy to stay where I am, finishing building a data lake and data marts in Azure. I love machine learning, it's my new purpose in life, but I hate leaving something unfinished. But suppose I were looking for a data scientist; what kind of questions would I ask during the interview?
Update 7/12/21: I understand some of you reading this are saying “Where are the actual questions?” If you are one of them, jump to the bottom of this article.
Academic ability vs work capability
I don't want to end up with a recruit who is good at answering questions but no good at the actual work, or someone who is academically strong but not good at providing solutions to real-world problems. So a question like "Tell me the difference between accuracy, precision and recall" would be testing their academic ability, but not their work capability.
I'm also aware that the candidate might forget a few things during the interview, so not being able to recall an answer would be understandable. So I don't want to ask things like "is it Lasso or Ridge which doesn't shrink the coefficients to zero?" because that is testing their memory. I don't need a person who can memorise well, I need someone who can work and solve problems.
Existing websites
There are many websites posting interview questions for data scientists, such as:
As we can see from these websites, there are a lot of questions to choose from. But as I said above, we don't want to select candidates based on their academic ability, but rather on their ability to work. And this is where the art is: what kind of questions do you need to ask the candidates to separate those who have worked hard on machine learning projects from those who haven't?
Overfitting
After thinking hard and going through all types of questions, I found something which I have gone through so many times myself when training my models. Even if the candidate forgets everything else, they can't forget this one, because it is so ingrained in their work. It's overfitting. I would show these charts to them and ask an open question: "What's happening in these charts?"
A candidate who has been training many models over and over again for days would definitely recognise these charts. The training accuracies are all very high, but the validation accuracies are all very low. It means that the models are built to suit the training data but they do not generalise well to new, unseen data.
“What would you do about it?” would provoke a flood of answers from the candidates, but only if they have been battling with this problem in their work. Things like regularisation, augmentation, class balancing, dropout and batch normalisation are methods that can be used to reduce overfitting, and a good debate could follow from that. “How would you do augmentation?”, “how would you do regularisation?”, “how would you do class balancing?”: all these “how” questions would create a good discussion with the candidates.
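To make the discussion concrete, here is an illustrative Keras snippet showing some of those counter-measures (L2 regularisation, dropout, batch normalisation and early stopping); the input size and layer widths are arbitrary and only for the example:

```python
# Illustrative binary classifier with the usual overfitting counter-measures.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3)),  # L2 weight penalty
    layers.BatchNormalization(),                             # batch normalisation
    layers.Dropout(0.3),                                     # dropout
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping on the validation loss is another common way to curb overfitting.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
```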
There is a vast range of topics from which we can draw interview questions. For example: ensembles (weak learners), cross-validation, preprocessing, visualisation, neural networks, kernels, multicollinearity, boosting, linear regression assumptions, Naive Bayes assumptions, SVM, PCA, Decision Trees, etc. But again, we need to be careful not to test their memory or academic performance, but their ability to work and solve problems.
Models/Algorithms
So as a second question I would pick this: put to them a simple binary classification problem and ask them which models/algorithms would give a good result. For example: whether a customer will purchase an item or not, whether a machine will break down or not, whether a telecom customer will churn or not, whether it is going to rain or not, whether it is a fraudulent transaction or not. Give them a list, i.e. Logistic Regression, Naive Bayes, Random Forest, Decision Trees, KNN, SVM, PCA, Neural Network, CNN, RNN, Reinforcement Learning, XGBoost, Stochastic Gradient Descent. And ask them which one they would pick for this case.
A data scientist who has been doing enough work (not study, but work) would have experienced a variety of different machine learning models and algorithms. They would know which ones are good and which ones are bad. On the other hand, a data scientist who has been doing a lot of reading would know the theory, such as how the algorithms work and what the differences between them are, but because they haven't used the algorithms in practice, they would not know the results.
“Which ones are fast?”, “which ones are slow?”, “which ones are highly accurate?”, “which ones have the highest AUC?”, “which ones are prone to overfitting?”: all these questions would provoke a good discussion. The point of this question is to dig out their experience. They must have some working experience with some models/algorithms, and we want them to tell us about those experiences. They would be able to tell us, for example, that logistic regression and decision trees are fast, whereas KNN and SVM are very slow, that XGBoost and SVM are highly accurate, and that neural networks are prone to overfitting.
Some of the algorithms given above are grossly unsuitable for this kind of binary classification, namely Reinforcement Learning, PCA, CNN and RNN. And XGBoost is an implementation of SGB. If the candidates fall into these traps, then we should help them get out of it. For example: “isn't CNN used for image classification?” If your example was “whether it is a cancer or not”, then CNN would be suitable, but if it is numerical data then it is not. If your example was “whether the sentiment is positive or negative” or “whether the email is spam or not”, then RNN would be suitable. These would be interesting discussions with the candidate; they would show whether the candidate has experience with CNN or RNN, and you can dig it out.
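A quick bake-off like the sketch below, on a synthetic dataset, is the kind of exercise this question is probing for; the dataset and model settings here are arbitrary examples:

```python
# Sketch: comparing a few binary classifiers with cross-validation on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (RBF)": SVC(kernel="rbf"),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```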
Data Preparation & Exploration
We want to encourage the candidate to tell us their experience with data preparation & exploration. There is no machine learning work where we don’t need to do data preparation & exploration. The data preparation & exploration work takes a significant amount of time on any ML project.
If your candidates have worked on many machine learning projects, they would have gone through many data preparation exercises, for example data loading, normalisation, class balancing, feature selection, data transformation, dimensionality reduction and one-hot encoding. They would have been busy using a lot of Pandas and NumPy to do that, and also power transforms, Keras and scikit-learn for augmentation, normalisation and scaling.
Data exploration is widely known as EDA within the machine learning community. It stands for Exploratory Data Analysis. This is where you discover patterns within the data, find outliers, get statistics from the data and create various charts to understand the data. They would have been busy using the Matplotlib and Seaborn packages to do charting, and Pandas and NumPy to do data manipulation and statistics. They would do univariate, bivariate and multivariate data analysis.
Any decent data scientist would have done a lot of work with data preparation and exploration. Ask, for example, how they would find the outliers, and what they would do with them. This would generate a good discussion. Also how they would do feature selection, i.e. which columns to use and which columns to drop. How would they know that a column is relevant, and how would they eliminate multicollinearity between columns? Do they need to check the correlation between columns? And if the columns are correlated, then what should they do? All these would provoke a good discussion with the candidate, from which we would know whether they have done a good amount of data preparation work or not.
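For example, two of those checks could look like the sketch below (IQR-based outlier detection and dropping one of each pair of highly correlated columns); df is a hypothetical DataFrame:

```python
# Sketch: IQR outlier detection and a simple correlation-based column filter.
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Boolean mask of values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

def correlated_columns(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Columns to drop because they are highly correlated with an earlier column."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])   # keep the first column, drop the second
    return sorted(to_drop)

# Example usage (df is a hypothetical DataFrame):
# outlier_mask = iqr_outliers(df["price"])
# df_reduced = df.drop(columns=correlated_columns(df))
```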
And then of course there is the exciting topic of visualisation. Box plots, scatter plots, horizontal bar plots, skewness, interquartile range, histograms, colour palettes, ggplot2, Seaborn, etc. Even Tableau. Have a look at these to get some ideas: Plotting, Seaborn. It is probably a good idea to print out a few charts and discuss them with the candidates during the interview. For example: (source: link)
Having a plot in front of us usually provokes a good and lively discussion with the candidate. For example, on the Seaborn boxplot above you can discuss the Q1 and Q3, median, outliers and the interquartile range.
Here is another one which is good for discussion. It’s the heatmap of the correlation between features: (source: link)
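If you want a chart to put in front of the candidate, something like the following sketch works; it uses seaborn's built-in "tips" dataset so it is self-contained, not the charts from the links above:

```python
# Sketch: a box plot and a correlation heatmap for interview discussion.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Box plot: median, quartiles (Q1/Q3), interquartile range and outliers per day.
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[0])
axes[0].set_title("Total bill by day")

# Heatmap of the correlation between the numeric features.
corr = tips.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", ax=axes[1])
axes[1].set_title("Feature correlation")

plt.tight_layout()
plt.show()
```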
Experience
And of course you would want to ask about the candidate’s experience. Ask what data science / machine learning projects they have been involved in. Ask them to tell you about those projects. What were the goals, what were the approaches/methodologies, what were the tools, etc. And most importantly, what were their roles in those projects.
Not only will these conversations give you an idea of whether the candidates have worked on ML projects, you will also get an idea of what projects they have worked on. You can judge for yourselves whether those projects are highly relevant or not relevant to the projects that you are doing in your company. Ditto with the tools. But the tools should not matter too much. If the candidates have a good background they will be able to adapt to a new toolset quickly. If they use scikit-learn and haven't used Keras, I am sure they would be able to use Keras in no time.
Their work experience with many models and algorithms, spending months tuning the models, processing and understanding the data: that is what is very precious. If they use Jarvis or Colab and you use Azure ML, I am sure they will be able to adapt in no time. If they have used the same tools as you, but they don't have much experience working with models and data prep/viz, it would take them a long time to learn.
Well, it depends. If you are looking for a junior data scientist, then indeed you would not be looking at their work experience. Instead, you would be looking at their academic performance. As long as they understand the concepts, that's enough, because you are looking for a junior data scientist and you are prepared to train them up for a year or two. But if you yourselves are new to the machine learning world, then you would want to hire someone who has the experience, someone who can deliver the project. If that is what you are looking for, then all the things I wrote above would be useful to separate the candidates, namely overfitting, models/algorithms, data preparation & exploration, and experience.
Interview Questions for Data Scientists
Question: Tell me about model A and model B below. Answer: Model A is overfitting. After 20 epochs the training accuracy is 90% but the validation accuracy is 50%. Model B is underfitting. After 20 epochs the training and validation accuracies are only 55%. Comment: we want to see if the candidate recognises overfitting and underfitting. A data scientist who has spent a lot of time training models will undoubtedly recognise them. Those who haven't trained many models might not.
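Charts like model A and model B typically come straight out of a training history. This is a sketch of how such a chart might be produced with Keras, assuming `model`, `X_train` and `y_train` already exist and the model was compiled with metrics=["accuracy"]:

```python
import matplotlib.pyplot as plt

# Train and keep the per-epoch history of training and validation accuracy
history = model.fit(X_train, y_train, validation_split=0.2, epochs=20)

plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()  # a large gap between the two curves suggests overfitting
```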
Question: What can cause model A's training accuracy to be so high, and yet its validation accuracy to be so low? Answer: Model A is very complex, so it is able to fit the training data perfectly, without using regularisation. The model is built exactly for that training data, so when it is run on data it has not seen before it is not able to classify the data correctly. Comment: We want to see if the candidate understands the circumstances in which overfitting happens.
Question: How about model B? What can cause model B's training accuracy to be so low? Answer: Model B is too simple. It is not able to recognise the patterns in the training data. Comment: we want to see if the candidate understands the circumstances in which underfitting happens.
Question: Tell me about your experience in solving an overfitting problem like model A. Answer: Use regularisation such as Lasso or Ridge. If it is a neural network model, use dropout or batch normalisation. There are also regularisation parameters in most models, such as C in logistic regression and in SVM models (where the RBF kernel coefficient gamma also controls how tightly the model fits the data). Comment: We want to see if the candidate knows how to solve overfitting.
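As a minimal sketch of two of the remedies mentioned in the answer, here is L2 (ridge-style) regularisation and dropout in a small Keras network; the layer sizes and the binary output are arbitrary assumptions:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.01)),  # L2 penalty on weights
    keras.layers.Dropout(0.5),                 # randomly drops units during training
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```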
Question: How do you overcome underfitting like model B? Answer: Use a more complex model such as Random Forest (an ensemble of trees) or a gradient boosting model. Or use a non-linear model such as a Support Vector Machine with a Radial Basis Function kernel, or a neural network model. This way the model will be able to recognise the patterns in the data. Comment: We want to see if the candidate knows how to solve underfitting.
Question: Suppose you have weather data like temperature, wind speed, pressure, rainfall history, humidity, etc. and you need to predict whether it is going to rain tomorrow or not. What models/algorithms can you use? Answer: This is a binary classification problem so I can use Logistic Regression, Naive Bayes, KNN, SVM, Decision Tree, Random Forest, Neural Network and Gradient Boosting. Comment: We want to see if the candidate has experience picking a model that suits a particular problem. Binary classification is one of the simplest ones so any data scientist would have come across it.
Question: Out of those algorithms, which ones run fast and which ones run slow? Answer: Logistic Regression and Decision Tree are simple models so they run fast. SVM, Random Forest, Gradient Boosting and Neural Network are complex so they are slow. KNN is simple, but it needs to measure the distance between each data point and all other data points, so it is slow. Some gradient boosting implementations such as XGBoost have been optimised so they are fast. Comment: We want to know whether the candidates have experience using various different algorithms or only know them in theory. If they really have tried various different models they would remember which ones are fast and which ones are slow. Whereas if they have only read about them they would know how the models work, but since they haven't run them they wouldn't know how the models perform.
Question: Which models/algorithms have high accuracy? Answer: Generally speaking, for classification, SVM with an RBF kernel, Neural Network and XGBoost have high accuracy. Random Forest is quite high too. But the accuracy would depend on the case and the data, i.e. is it linear or non-linear, is it multiclass or binary, etc. Some models are better in certain situations than others. Logistic Regression, even though it is simple and quick, can surprisingly be very accurate too, though still below NN and XGB. Comment: To solve a problem a data scientist would need to try out different models, so they would know which ones are the best. That is the first question they need to answer in nearly every project: which model should I use? So we expect the candidates to know which models are highly accurate and which ones are not. If they mention "the accuracy depends on the data, i.e. linear or non-linear", give them extra credit. It means that they have implemented different algorithms in different cases and recognise that an algorithm can perform differently on different cases/data, i.e. for case A a neural network might be the highest, whereas for case B gradient boosting might be the highest.
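The kind of comparison behind these last three answers is easy to reproduce. This is a sketch, assuming `X` and `y` are an already-prepared feature matrix and target, of looping over several scikit-learn classifiers and comparing cross-validated accuracy and run time:

```python
import time
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "SVM (RBF)": SVC(kernel="rbf"),
}

for name, clf in models.items():
    start = time.time()
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: accuracy {scores.mean():.3f}, took {time.time() - start:.1f}s")
```

A candidate who has done this loop on a real dataset will remember which rows come out fast and which come out slow.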
Question: Suppose it is not weather data but cancer images; what algorithm/model would you use to detect whether it is cancer or not? Answer: CNN (Convolutional Neural Network). Comment: There is no real alternative model here. Virtually all image classification uses CNNs.
Question: And how would you configure the CNN in terms of layers? Answer: Traditionally (VGGNet) we use convolutional layers, pooling layers, a flatten layer, and fully connected layers. But in practice we use ResNet (using skip connections) or GoogLeNet (using inception modules). Comment: Manually stacking conv and pooling layers in Keras is a good answer. But in the real world people use transfer learning, for example from VGG16/19, ResNet50, InceptionV3 or MobileNetV2. See here for a complete list of modern models: link. Note: if your company is not doing image classification/recognition then you can skip this question.
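As a sketch of the transfer learning approach mentioned in the comment, here is a Keras model using a pretrained MobileNetV2 backbone; the input size and the single sigmoid output (cancer / not cancer) are assumptions for illustration:

```python
from tensorflow import keras

# Pretrained convolutional backbone, without its original classification head
base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained layers

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation="sigmoid"),  # cancer / not cancer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```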
Question: Suppose it is not weather data but an email spam detector; what algorithm/model would you use to detect whether it is spam or not? Answer: RNN (Recurrent Neural Network). An email consists of sequential words, so we should use an RNN. Comment: We would like to see here whether the candidates recognise that an email has a sequential order and that we should therefore use a Recurrent Neural Network. Most candidates would not mention RNN, as they would think that it is just a binary classification problem so they can use Logistic Regression, Naive Bayes, etc.
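A recurrent spam detector along the lines of that answer might look like the sketch below in Keras: an embedding layer followed by an LSTM. The vocabulary size and layer widths are made-up assumptions:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=20000, output_dim=64),  # word embeddings
    keras.layers.LSTM(64),                                   # reads the word sequence
    keras.layers.Dense(1, activation="sigmoid"),             # spam / not spam
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```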
Question: Why would you not use Naive Bayes for a binary classification problem? Answer: Because it assumes that each feature is independent of the others, a condition which rarely holds in the real world. But sometimes, even when some features are dependent on other features, NB still performs well. The main reason we don't use NB is that it does not perform well if the data is imbalanced, which is usually the case in binary classification. Worse still if one of the classes is barely represented at all. Comment: If the candidates have done binary classification projects they would have experienced first hand that NB doesn't perform well on imbalanced data. They would then need to use class balancing methods such as ADASYN or SMOTE to balance the data, but even so the result is not as good as gradient boosting models or even random forest. Give extra marks if the candidate mentions the reasons for using Naive Bayes, such as in text classification and tagging.
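Rebalancing with SMOTE, as mentioned in the comment, is a one-liner with the imbalanced-learn package. A sketch, assuming `X_train` and `y_train` already exist:

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class with synthetic points.
# Only ever apply this to the training data, never to the test data.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
```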
Question: What are outliers and how do you handle them? Answer: Outliers are data points which are far away from the other data points. We can define outliers using the interquartile range (IQR = Q3 minus Q1): anything below Q1 - 1.5 IQR or above Q3 + 1.5 IQR is classified as an outlier. Or we can use the 3-sigma criterion, meaning anything more than 3 standard deviations from the mean is an outlier. Or we can use a 1% or 5% quantile criterion. We need to detect outliers and remove them, or replace them with the mean, or better still with the median. Or cap them, meaning replacing them with the 5% and 95% quantile values (or the 1% and 99% quantile values). Comment: Dealing with outliers is part of the bread and butter of a data scientist. In almost every data preparation we will need to deal with them. If the candidates have ever worked as data scientists they will have encountered this many times. If they haven't encountered this, they must only be reading about data science, not doing it.
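The IQR rule and the capping described in that answer are a few lines of Pandas. A sketch, where `df` and the "amount" column are hypothetical placeholders:

```python
# Flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["amount"] < lower) | (df["amount"] > upper)]
print(f"{len(outliers)} outliers found")

# Capping (winsorising) instead of removing:
# clip the column at its 1% and 99% quantile values
df["amount"] = df["amount"].clip(df["amount"].quantile(0.01),
                                 df["amount"].quantile(0.99))
```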
Question: Why do you need to remove or replace outliers? In what cases should you not remove outliers? Answer: Because outliers often indicate faulty data (incorrectly entered), incorrect measuring procedures, or an incorrect unit of measure. For example, the height column has values ranging from 160 cm to 180 cm, and then there is one data point with height = 17. That is not physically possible; it is a data input error. Maybe they missed a zero, maybe they entered too many zeroes, or maybe they missed the decimal point, like this: 1607 cm. It was meant to be 160.7 cm but the decimal point was missing. The same goes for the age column, weight column, distance column, price column, etc. Outliers need to be removed because they distort or skew the result. The ML model would try to accommodate the outliers and as a result would not generalise well to new data sets. The model would achieve high accuracy during training, but on test data it wouldn't perform well, particularly in linear regression or logistic regression. But sometimes we should not remove outliers. In credit card fraud detection, if the amount column ranges from $1 to $50, a value of $10,000 is an outlier. But it is valid data: a very large amount could be a good indicator of a fraudulent transaction. In this case we should not remove it. Comment: Here we would like to see if the candidates understand the reasoning behind removing outliers, i.e. because outliers are, quite often, input errors, and because some algorithms are sensitive to outliers. Not all algorithms though; tree-based algorithms such as decision tree, random forest and stochastic gradient boosting are not sensitive to outliers. We also want to see if the candidates understand that sometimes outliers are valid data, as in "detection" cases, e.g. cancer detection, fraud detection, spam detection.
Question: If the features have different ranges, for example some columns are in the hundreds whereas other columns are in the millions, what should you do? Why? Answer: We should normalise them (or standardise them). Neural network models converge faster when doing gradient descent if the data is normalised. In linear regression, the feature importance would be incorrect if the features are not normalised. K-nearest neighbours wouldn't work correctly if the data is not normalised because the distances between data points would be distorted. Comment: Most data scientists would be able to answer what to do in this situation. But it is important to know why we need to do it. Candidates who don't know the why are walking blindly in this field. They are like robots following procedures without ever understanding why. Clearly you don't want people like that in your team. A decent data scientist would realise that the feature importance in linear regression and gradient descent in neural networks depend on the data being normalised, so give extra points if the candidates mention these two things (or KNN). Please note that in linear regression, it is the error terms which must be normally distributed, not the data itself (see link, point 2). So subtract some points if the candidates assume that in linear regression the data must be normally distributed.
Question: What is the difference between normalisation and standardisation? Answer: In normalisation we map the data values to a range from 0 to 1. In standardisation we map the data values to a distribution with a mean of 0 and a standard deviation of 1. Comment: This is just a follow-up to the previous question, to see if they know the details. This question is not as important as the previous one. If the candidates understand the why in the previous question, it does not matter if they forget the difference between normalisation and standardisation. In fact, if a candidate fails to answer the previous question but can answer this one, it could be an indicator that they are a theorist or a student rather than a hands-on practitioner.
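In scikit-learn the two correspond to MinMaxScaler and StandardScaler. A small sketch, assuming `X_train` and `X_test` already exist:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalisation: each feature mapped to the range [0, 1]
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)

# Standardisation: each feature mapped to mean 0 and standard deviation 1
std = StandardScaler().fit(X_train)
X_train_std = std.transform(X_train)

# Fit the scaler on the training data only, then reuse it on the test data
X_test_std = std.transform(X_test)
```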
Question: If the data is very skewed / not balanced, how do you split the data into training data and test data? For example, in cancer detection you could have only 0.3% of the data indicating cancer (99.7% not cancer). Answer: Use stratified sampling so that within the training data the proportion of cancer is still 0.3%, and the same with the test data. Comment: Along with data cleaning and feature scaling, I think the data split is one of the most frequent tasks in data preprocessing. We always have to split data into training data and test data.
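In scikit-learn a stratified split is just the `stratify` argument of train_test_split. A sketch, assuming `X` and `y` already exist:

```python
from sklearn.model_selection import train_test_split

# stratify=y keeps the 0.3% cancer proportion in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```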
Question: When do we need to use cross validation? Answer: When we need to tune hyperparameters, or when the model is prone to overfitting. Complex ML models such as neural networks or stochastic gradient boosting tree models have many hyperparameters. We need to find the optimum values of these hyperparameters, but we can't use the test data. So we split the training data into several folds (say 3 to 6), use one fold for validation and train the model with the rest, rotating the validation fold, to find the optimum hyperparameter values. When we have the hyperparameter values, we then evaluate the model with the test data. Cross validation also helps mitigate overfitting because the model is trained and validated on different parts of the training data, not a single fixed split. Comment: In general a data scientist knows what cross validation is, because it is used very often in machine learning. But this question asks when they need to use it, and that would separate candidates who train their own models from those who don't. A decent data scientist would use neural network and gradient boosting tree models (random forest, XGBoost, etc.), and these models have a lot of hyperparameters. Therefore they must have cut the training data into different folds and used them to find the optimum hyperparameter values, so they should be able to answer this question. But if the candidates have never tuned models with hyperparameters, then they would not have come across this, which means they only have experience with simple models such as linear regression and decision trees. The second reason is even more worrying. K-fold cross validation (and stratified sampling) are used to mitigate overfitting. If the candidates have never used cross validation, then they haven't mitigated the risk of overfitting. Yes, they can do regularisation, early stopping, etc., but cross validation should also be done in conjunction with those techniques. Cross validation also gives a more reliable estimate of how the model will perform on unseen data, so every serious data scientist would consider using it, or at least be aware of it.
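Hyperparameter tuning with k-fold cross validation is typically done with GridSearchCV in scikit-learn. A sketch, where the random forest and its parameter grid are just an illustration and `X_train`, `y_train`, `X_test`, `y_test` are assumed to exist:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")

search.fit(X_train, y_train)   # cross validation happens on the training data only
print(search.best_params_, search.best_score_)

# Final evaluation is done once, on the held-out test data
print(search.score(X_test, y_test))
```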
Question: Tell me about this chart. What are the red lines? Answer: This is a boxplot chart (image source: link) with the red lines showing the medians. The edges of the boxes are Q1 and Q3 (the 25th and 75th percentiles). The whisker ends at the bottom and top are the minimum (Q1 - 1.5 IQR) and the maximum (Q3 + 1.5 IQR). Comment: Boxplots are commonly used in EDA and visualisation to quickly see the data distribution. We would like to know if the candidates are familiar with them. If the candidate knows that the red lines are the medians, ask them what a median is, and then what the 25th percentile means.
Question: This is a ROC curve (Receiver Operating Characteristic). The larger the area under the curve (AUC), the better the model. What are the X axis and the Y axis in this curve? Answer: The X axis is the False Positive Rate (FPR) and the Y axis is the True Positive Rate (TPR). Image source: Wikipedia (link). Comment: AUC is a measure that is frequently used to assess model performance. It is rather disappointing if a candidate knows what AUC is but doesn't know what the axes are, because in many projects, as data scientists, we are constantly doing a "balancing act" between a higher TPR (sensitivity) and a lower FPR.
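Plotting a ROC curve and computing the AUC is straightforward with scikit-learn. A sketch, assuming a fitted classifier `clf` with predict_proba and a held-out `X_test`, `y_test`:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class for each test example
y_scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_scores):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # diagonal of a random classifier
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```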
Question: Below is a frequency distribution chart. The data is positively skewed (long tail on the right hand side). Which ones are the mean, median and mode? Answer: A has the highest frequency so it is the mode. C is the mean because the long tail on the right dragged the mean to the right of the median (B). Comment: Image source: link. In a normal distribution, the mean, median and mode are in one point. But if the data is positively skewed, the mean is on the right of the median. In a negatively skewed distribution the mean is on the left of the median.
Question: Why is it better to use the median than the mean? Answer: Because the median is not affected by the skewness in the data, nor by outliers. Both the median and the mean reflect the central tendency of the data, but often the data is not normally distributed and is skewed to one side. In those cases it is better to use the median, as it is not affected by the skewness. Comment: As data scientists we come across this issue very often, i.e. the data is not normally distributed. Blindly using the mean does not reflect the true central tendency if the data is not normally distributed.
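A quick way to see this effect is to compare the mean and median on a skewed sample. The lognormal data below is just an illustration, not from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0, sigma=1, size=10_000)  # positively skewed data

print(f"mean   = {data.mean():.2f}")      # pulled to the right by the long tail
print(f"median = {np.median(data):.2f}")  # closer to where most values actually sit
```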