After 13 months of exams, assignments, case studies, tests and lectures, I have finally finished my machine learning postgraduate course. Despite the encouragement from my course provider to go looking for a data scientist job, I’m not thinking of going for interviews yet. I am happy to stay where I am, finishing building a data lake and data marts in Azure. I love machine learning, it’s my new purpose in life, but I hate leaving something unfinished. But suppose I were looking for a data scientist: what kind of questions would I ask during the interview?

**Update 7/12/21**: I understand some of you reading this are saying “Where are the actual questions?” If you are one of them, jump to the bottom of this article.

**Academic ability vs work capability**

I don’t want to end up with a recruit who is good at answering questions but no good at the actual work, or someone who is strong academically but not good at providing solutions to real-world problems. So a question like “Tell me the difference between accuracy, precision and recall” would test their academic ability, but not their work capability.

I’m also aware that candidates may forget a few things during an interview, so not being able to recall an answer would be understandable. So I don’t want to ask things like “is it Lasso or Ridge which doesn’t shrink the coefficients to zero?” because that is testing their memory. I don’t need a person who can memorise well; I need someone who can work and solve problems.

**Existing websites**

There are many websites posting interview questions for data scientists, such as:

- Terence Shin on Towards Data Science: link
- Upasana on Edureka: link
- Brain Station: link
- Anant Kurana on Analytics Vidya: link
- Linkedin: link
- Akhil Bhadwal on Hackr.io: link
- Abhinav Rai on Upgrad: link
- Simplilearn: link

As we can see from these websites, there are a lot of questions to choose from. But as I said above, we don’t want to select candidates based on their academic ability, but on their ability to work. And this is where the art is: what kind of questions do you need to ask candidates to separate those who have worked hard on machine learning projects from those who haven’t?

**Overfitting**

After thinking hard and going through all types of questions, I found something which I have gone through so many times myself when training my models. Even if candidates forget everything else, they can’t forget this one because it is so ingrained into their work. It’s overfitting. I would show these charts to them and ask an open question: “What’s happening in these charts?”

A candidate who has been training models over and over again for days would definitely recognise these charts. The training accuracies are all very high, but the validation accuracies are all very low. It means that the models are built to suit the training data but do not generalise well to new, unseen data.

“What would you do about it?” would provoke a flood of answers from the candidates, but only if they have been battling with this problem in their work. Things like “regularisation”, “augmentation”, “class balancing”, “dropout” and “batch normalisation” are methods that can be used to reduce overfitting, and a good debate could follow from that. “How would you do augmentation?”, “how would you do regularisation?”, “how would you do class balancing?” All these “how” questions would create a good discussion with the candidates.
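The symptom and one of the fixes can be sketched in code. This is a hypothetical illustration (synthetic data, scikit-learn, and a depth-limited decision tree standing in for "regularisation"), not the charts from the interview:

```python
# Sketch: the overfitting symptom described above (a large gap between
# training and validation accuracy) and one way to reduce it, limiting
# model complexity. Synthetic data, for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=42)

# An unconstrained tree memorises the training data (overfitting).
overfit = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Restricting depth regularises the tree and narrows the train/val gap.
regularised = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

for name, model in [("unconstrained", overfit), ("max_depth=3", regularised)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "val:", round(model.score(X_val, y_val), 3))
```

Showing a candidate output like this and asking "what happened here?" serves the same purpose as the charts.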

There is a vast range of topics from which we can draw interview questions, for example: ensembles (weak learners), cross validation, preprocessing, visualisation, neural networks, kernels, multicollinearity, boosting, linear regression assumptions, Naive Bayes assumptions, SVM, PCA, decision trees, etc. But again, we need to be careful not to test their memory or academic performance, but their ability to work and solve problems.

**Models/Algorithms**

So as a second question I would pick this: put to them a simple binary classification problem and ask them which models/algorithms would give a good result. For example: whether a customer will purchase an item or not, whether a machine will break down or not, whether a telecom customer will churn or not, whether it is going to rain or not, whether a transaction is fraudulent or not. Give them a list, e.g. Logistic Regression, Naive Bayes, Random Forest, Decision Trees, KNN, SVM, PCA, Neural Network, CNN, RNN, Reinforcement Learning, XG Boost, Stochastic Gradient Descent, and ask them which one they would pick for this case.

A data scientist who has done enough work (not study, but work) will have experienced a variety of machine learning models and algorithms. They would know which ones are good and which ones are bad. On the other hand, a data scientist who has done a lot of reading would know the theory, such as how an algorithm works and the differences between algorithms, but because they haven’t used the algorithms in practice, they would not know the results.

“Which ones are fast?”, “which ones are slow?”, “which ones are highly accurate?”, “which ones have the highest AUC?”, “which ones are prone to overfitting?” All these questions would provoke a good discussion. The point of this question is to dig out their experience. They must have some working experience with some models/algorithms, and we want them to tell us about those experiences. They would be able to tell us, for example, that logistic regression and decision trees are fast, whereas KNN and SVM are very slow; that XG Boost and SVM are highly accurate; and that neural networks are prone to overfitting.
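As a hypothetical warm-up for this discussion, one could time a few of the listed models on a synthetic binary problem. The dataset and model choices here are illustrative, not a benchmark:

```python
# Sketch: fitting several candidate classifiers on a synthetic binary
# problem and comparing fit time and held-out accuracy, as in the
# discussion above. Illustration only, not a rigorous benchmark.
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM (RBF)": SVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    print(f"{name}: fit in {elapsed:.3f}s, accuracy {model.score(X_te, y_te):.3f}")
```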

Some of the algorithms given above are grossly unsuitable for binary classification on tabular data, namely Reinforcement Learning, PCA, CNN and RNN. And XG Boost is an implementation of stochastic gradient boosting. If the candidates fall into these traps, then we should help them get out of it. For example: “isn’t CNN used for image classification?” If your example was “whether it is a cancer or not” on images, then CNN would be suitable, but if it is numerical data then it is not. If your example was “whether the sentiment is positive or negative” or “whether the email is spam or not”, then RNN would be suitable. These make for an interesting discussion with the candidate. It would show that the candidates have experience with CNN or RNN, and you can dig it out.

**Data Preparation & Exploration**

We want to encourage the candidate to tell us their experience with data preparation and exploration. There is no machine learning work where we don’t need to do data preparation and exploration, and this work takes a significant amount of time on any ML project.

If your candidates have worked on many machine learning projects, they will have gone through many data preparation exercises, for example data loading, normalisation, class balancing, feature selection, data transformation, dimensionality reduction and one-hot encoding. They will have been busy using a lot of Pandas and NumPy to do that, as well as power transforms, Keras and Scikit-learn for augmentation, normalisation and scaling.

Data exploration is widely known as EDA (Exploratory Data Analysis) within the machine learning community. This is where you discover patterns in the data, find outliers, get statistics from the data and create various charts to understand it. Candidates will have been busy using the Matplotlib and Seaborn packages for charting, and Pandas and NumPy for data manipulation and statistics. They would do univariate, bivariate and multivariate data analysis.

Any decent data scientist will have done a lot of work with data preparation and exploration. Ask, for example, how they would find outliers and what to do with them. This would generate a good discussion. Also ask how they would do feature selection, i.e. which columns to use and which to drop. How would they know that a column is relevant? How would they eliminate multicollinearity between columns? Do they need to check correlation between columns, and if the columns are correlated, what should they do? All these would provoke a good discussion with the candidate, from which we would know whether they have done a good amount of data preparation work or not.
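One common way to handle correlated columns, sketched here with pandas on made-up data (the 0.9 threshold is an arbitrary choice for illustration):

```python
# Sketch: dropping one column of each highly correlated pair, a simple
# way to reduce multicollinearity during feature selection.
# Synthetic data; the 0.9 cutoff is an illustrative choice.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)  # almost a duplicate of "a"
df["c"] = rng.normal(size=200)                            # independent feature

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("dropping:", to_drop)  # "b" is nearly collinear with "a"
reduced = df.drop(columns=to_drop)
```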

And then of course the exciting topic of visualisation: box plots, scatter plots, horizontal bar plots, skewness, interquartile range, histograms, colour palettes, ggplot2, Seaborn, etc. Even Tableau. Have a look at these to get some ideas: Plotting, Seaborn. It is probably a good idea to print out a few charts and discuss them with the candidates during the interview. For example: (source: link)

Having a plot in front of us usually provokes a good and lively discussion with the candidate. For example, on the Seaborn boxplot above you can discuss Q1 and Q3, the median, outliers and the interquartile range.

Here is another one which is good for discussion: a heatmap of the correlation between features. (source: link)

**Experience**

And of course you would want to ask about the candidate’s experience. Ask what data science / machine learning projects they have been involved in, and ask them to tell you about those projects. What were the goals, the approaches/methodologies, the tools? And most importantly, what were their roles in those projects?

Not only will these conversations give you an idea of whether the candidates have worked on ML projects, you will also get an idea of what projects they have worked on. You can judge for yourselves whether those projects are relevant to the projects you are doing in your company. Ditto with the tools, but the tools should not matter too much: if the candidates have a good background they will be able to adapt to a new toolset quickly. If they use Scikit-learn and haven’t used Keras, I am sure they would be able to use Keras in no time.

Their work experience with many models and algorithms, spending months tuning models and processing and understanding data, is what is truly precious. If they use Jarvis or Colab and you use Azure ML, I am sure they will be able to adapt in no time. But if they have used the same tools as you, yet don’t have much experience working with models and data prep/visualisation, it will take them a long time to learn.

Well, it depends. If you are looking for a junior data scientist, then you would not be looking at their work experience; instead, you would be looking at their academic performance. As long as they understand the concepts, that’s enough, because you are prepared to train them up for a year or two. But if you yourselves are new to the machine learning world, then you would want to hire someone who has the experience, someone who can deliver the project. If that is what you are looking for, then everything I wrote above would be useful to separate the candidates: overfitting, models/algorithms, data preparation & exploration, and experience.

**Interview Questions for Data Scientists**

**Question**: Tell me about model A and model B below.

**Answer**: Model A is overfitting: after 20 epochs the training accuracy is 90% but the validation accuracy is 50%. Model B is underfitting: after 20 epochs the training and validation accuracies are only 55%.

**Comment**: We want to see if the candidate recognises overfitting and underfitting. A data scientist who has spent a lot of time training models will undoubtedly recognise them. Those who haven’t trained a lot might not.

**Question**: What can cause model A’s training accuracy to be so high, and yet its validation accuracy to be so low?

**Answer**: Model A is very complex, so without regularisation it is able to fit the training data perfectly. The model is built exactly for that training data, so when it is run on data it has not seen before it is not able to classify the data correctly.

**Comment**: We want to see if the candidate understands the circumstances in which overfitting happens.

**Question**: How about model B? What can cause model B’s training accuracy to be so low?

**Answer**: Model B is too simple. It is not able to recognise the patterns in the training data.

**Comment**: We want to see if the candidate understands the circumstances in which underfitting happens.

**Question**: Tell me about your experience in solving an overfitting problem like model A’s.

**Answer**: Use regularisation such as Lasso or Ridge. If it is a neural network model, use dropout or batch normalisation. There are also regularisation parameters in most models, such as C in logistic regression and in SVM models.

**Comment**: We want to see if the candidate knows how to solve overfitting.

**Question**: How do you overcome underfitting like model B’s?

**Answer**: Use a more complex model such as Random Forest (a tree ensemble) or a gradient boosting algorithm. Or use a non-linear model such as a Support Vector Machine with a Radial Basis Function kernel, or a neural network model. This way the model will be able to recognise the patterns in the data.

**Comment**: We want to see if the candidate knows how to solve underfitting.

**Question**: Suppose you have weather data such as temperature, wind speed, pressure, rainfall history and humidity, and you need to predict whether it is going to rain tomorrow. What models/algorithms can you use?

**Answer**: This is a binary classification problem, so I can use Logistic Regression, Naive Bayes, KNN, SVM, Decision Tree, Random Forest, Neural Network and Gradient Boosting.

**Comment**: We want to see if the candidate has experience picking a model to suit a particular problem. Binary classification is one of the simplest, so any data scientist will have come across it.

**Question**: Out of those algorithms, which ones run fast and which ones run slow?

**Answer**: Logistic Regression and Decision Tree are simple models so they run fast. SVM, Random Forest, Gradient Boosting and Neural Network are complex so they are slow. KNN is simple, but it needs to measure the distance from each data point to all other data points, so it is slow. Some gradient boosting implementations such as XG Boost have been optimised, so they are fast.

**Comment**: We want to know whether the candidates have experience using various algorithms, or only know them in theory. If they really have tried various models, they would remember which ones are fast and which are slow. If they have only read about them, they would know how a model works but, never having run it, would not know how it performs.

**Question**: Which models/algorithms have high accuracy?

**Answer**: Generally speaking, for classification, SVM with an RBF kernel, Neural Network and XG Boost have high accuracy. Random Forest is quite high too. But the accuracy depends on the case and the data, i.e. whether it is linear or non-linear, multiclass or binary, etc. Some models are better in certain situations than others. Logistic Regression, even though it is simple and quick, can surprisingly be very accurate too, though still below NN and XGB.

**Comment**: To solve a problem a data scientist needs to try out different models, so they would know which ones are best. That is the first question they need to answer in nearly every project: which model should I use? So we expect the candidates to know which models are highly accurate and which are not. If they mention “the accuracy depends on the data, i.e. non-linear or linear”, give them extra credit. It means they have implemented different algorithms in different cases and recognise that an algorithm can perform differently on different cases/data, i.e. for case A a neural network might be the best, whereas for case B Gradient Boosting might be.

**Question**: Suppose it is not weather data but cancer images. What algorithm/model would you use to detect whether it is a cancer or not?

**Answer**: CNN (Convolutional Neural Network).

**Comment**: There is no alternative model here; image classification almost always uses CNNs.

**Question**: And how would you configure the CNN in terms of layers?

**Answer**: Traditionally (VGGNet) we use convolutional layers, pooling layers, a flatten layer and fully connected layers. But in practice we use ResNet (using skip connections) or GoogLeNet (using inception modules).

**Comment**: Manually stacking convolutional and pooling layers in Keras is a good answer, but in the real world people use transfer learning, for example from VGG16/19, ResNet50, InceptionV3 or MobileNetV2. See here for a complete list of modern models: link. Note: if your company is not doing image classification/recognition, then you can skip this question.

**Question**: Suppose it is not weather data but an email spam detector. What algorithm/model would you use to detect whether an email is spam or not?

**Answer**: RNN (Recurrent Neural Network). An email consists of sequential words, so we should use an RNN.

**Comment**: We would like to see whether the candidates recognise that an email has a sequential order and that therefore we should use a Recurrent Neural Network. Most candidates would not mention RNN, thinking that since it is just binary classification they can use Logistic Regression, Naive Bayes, etc.

**Question**: Why would you not use Naive Bayes for a binary classification problem?

**Answer**: Because it assumes each feature is independent of the others, a condition which rarely holds in the real world. Sometimes, even when some features depend on other features, NB still performs well. The main reason we don’t use NB is that it does not perform well if the data is imbalanced, which is usually the case in binary classification, and worse still if one of the classes has hardly any instances at all.

**Comment**: If the candidates have done binary classification projects, they will have experienced first hand that NB doesn’t perform well on imbalanced data. They will then need to use class balancing methods such as ADASYN or SMOTE to balance the data, but even then the result is not as good as gradient boosting models or even Random Forest. Give extra marks if the candidate mentions the reasons for using Naive Bayes, such as in text classification and tagging.

**Question**: What are outliers and how do you handle them?

**Answer**: Outliers are data points which are far away from the other data points. We can define outliers using the interquartile range (IQR): anything more than 1.5 × IQR above Q3 or below Q1 is classified as an outlier. Or we can use the 3-sigma criterion, meaning anything more than 3 standard deviations from the mean is an outlier. Or we can use 1% or 5% quantile criteria. We need to detect outliers and remove them, or replace them with the mean, or better still the median. Or cap them, meaning replace them with the 5% and 95% quantile values (or the 1% and 99% quantile values).
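The IQR rule in the answer above can be sketched with just the standard library (the `iqr_outliers` helper and the heights sample are made up for illustration):

```python
# Sketch: detecting outliers with the 1.5 * IQR rule described above.
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR beyond Q1 or Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

heights = [160, 162, 165, 167, 168, 170, 172, 175, 178, 180, 17]  # 17 is a data-entry error
print(iqr_outliers(heights))  # [17]
```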

**Comment**: Dealing with outliers is part of the bread and butter of a data scientist; almost every data preparation exercise involves it. If the candidates have ever worked as data scientists, they will have encountered this many times. If they haven’t, they must only be reading about data science, not doing it.

**Question**: Why do you need to remove or replace outliers? In what cases should you not remove outliers?

**Answer**: Because outliers often indicate faulty data (incorrectly entered), incorrect measuring procedures or an incorrect unit of measure. For example, suppose the height column has values ranging from 160 cm to 180 cm, and then there is one data point with height = 17. That is not physically possible; it is a data input error. Maybe they missed a zero, maybe they entered too many zeroes, or maybe they missed the decimal point, like this: 1607 cm. It was meant to be 160.7 cm but the point was missing. The same goes for the age, weight, distance and price columns, etc.


Outliers need to be removed because they distort or skew the results. The ML model would try to accommodate the outliers, and as a result the model would not generalise well to new data sets: it would achieve high accuracy during training but not perform well on test data. This applies particularly to linear regression and logistic regression.

But sometimes we should not remove outliers. In credit card fraud detection, if the amount column ranges from $1 to $50, a value of $10,000 is an outlier, but it is valid data: a very large amount could be a good indicator of a fraudulent transaction. In this case we should not remove it.

**Comment**: Here we would like to see whether the candidates understand the reasoning behind removing outliers, i.e. that outliers are quite often input errors, and that some algorithms are sensitive to outliers. Not all algorithms, though: tree-based algorithms such as decision trees, random forests and stochastic gradient boosting are not sensitive to outliers. We also want to see if the candidates understand that sometimes outliers are valid data, as in “detection” cases, e.g. cancer detection, fraud detection, spam detection.

**Question**: If the features have different ranges, for example some columns are in the hundreds whereas other columns are in the millions, what should you do? Why?

**Answer**: We should normalise them (or standardise them). Neural network models converge faster during gradient descent if the data is normalised. In linear regression, the feature importances would be incorrect if the features are not normalised. And K-nearest neighbours would not work correctly if the data is not normalised, because the distances between data points would be distorted.

**Comment**: Most data scientists would be able to answer what to do in this situation, but it is important to know why. Candidates who don’t know the why are walking blindly in this field; they are like robots following procedures without ever understanding them. Clearly you don’t want people like that in your team. A decent data scientist would realise that the feature importances in linear regression, and gradient descent in neural networks, depend on the data being normalised, so give extra points if the candidates mention these two things (or KNN). Please note that in linear regression it is the error terms which must be normally distributed, not the data itself (see link, point 2), so subtract some points if the candidates assume that in linear regression the data must be normally distributed.

**Question**: What is the difference between normalisation and standardisation?

**Answer**: In normalisation we map the data values to the range 0 to 1. In standardisation we map the data values to a distribution with a mean of 0 and a standard deviation of 1.

**Comment**: This is just a follow-up to the previous question, to see if they know the details, and it is not as important. If the candidates understand the why in the previous question, it does not matter if they forget the difference between normalisation and standardisation. In fact, if candidates fail to answer the previous question but can answer this one, it could be an indicator that they are theorists or students rather than hands-on workers.
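Both definitions can be written out in a few lines with the standard library (the helper names here are made up for illustration, not a library API):

```python
# Sketch: min-max normalisation (range [0, 1]) versus z-score
# standardisation (mean 0, standard deviation 1), as defined above.
import statistics

def normalise(values):
    """Map values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Map values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

data = [100, 200, 300, 400, 500]
print(normalise(data))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardise(data))  # mean 0, standard deviation 1
```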

**Question**: If the data is very skewed/imbalanced, how do you split the data into training data and test data? For example, in cancer detection you could have only 0.3% of the data indicating cancer (99.7% not cancer).

**Answer**: Use stratified sampling, so that within the training data the proportion of cancer is still 0.3%, and the same for the test data.
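A minimal sketch of the stratified split in the answer above, assuming scikit-learn and a made-up imbalanced label array:

```python
# Sketch: preserving a rare class proportion with a stratified split.
# The labels are synthetic: 1% positive ("cancer"), 99% negative.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 10 + [0] * 990)
X = np.arange(len(y)).reshape(-1, 1)  # dummy features for illustration

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The positive proportion survives in both splits.
print("overall:", y.mean(), "train:", y_tr.mean(), "test:", y_te.mean())
```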

**Comment**: Along with data cleaning and feature scaling, I think data splitting is one of the most frequent tasks in data preprocessing. We always have to split data into training data and test data.

**Question**: When do we need to use cross validation?

**Answer**: When we need to tune hyperparameters, or when the model is prone to overfitting. Complex ML models such as neural networks or stochastic gradient boosting tree models have many hyperparameters. We need to find the optimum values of these hyperparameters, but we can’t use the test data. So we split the training data into 3 to 6 subsets, use one subset for validation and train the model on the rest to find the optimum hyperparameter values. Once we have the hyperparameter values, we evaluate the model on the test data.


Cross validation also helps prevent overfitting, because the model is trained on different parts of the training data rather than the whole of it.

**Comment**: In general a data scientist knows what cross validation is, because it is used very often in machine learning. But this question asks when they need to use it, and that separates candidates who train their models from those who don’t. A decent data scientist will have used neural networks and gradient boosting tree models (random forest, XG Boost, etc.), and these models have a lot of hyperparameters. They must therefore have cut the training data into folds and used them to find the optimum hyperparameter values, so they should be able to answer this question. But if the candidates have never tuned hyperparameters, they will not have come across this, which suggests they only have experience with simple models such as linear regression and decision trees.

The second reason is even more worrying. K-fold cross validation (and stratified sampling) are used to mitigate overfitting. If the candidates have never used cross validation, then they haven’t mitigated the risk of overfitting. Yes, they can do regularisation, early stopping, etc., but cross validation should be done in conjunction with those techniques. Cross validation also increases test accuracy, so every serious data scientist would consider using it, or at least be aware of it.
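The workflow in the answer, tuning a hyperparameter with k-fold cross validation on the training data only, can be sketched with scikit-learn (synthetic data, and an arbitrary `max_depth` grid chosen just for illustration):

```python
# Sketch: 5-fold cross validation on the training data to pick a
# hyperparameter, keeping the test data untouched until the end.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid={"max_depth": [2, 4, 8]},  # illustrative grid
                      cv=5)                                 # 5-fold CV, training data only
search.fit(X_tr, y_tr)
print("best depth:", search.best_params_["max_depth"])
print("test accuracy:", search.score(X_te, y_te))  # evaluated once, at the end
```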

**Question**: Tell me about this chart. What are the red lines?

**Answer**: These are boxplot charts (image source: link), with the red lines showing the medians. The edges of the boxes are Q1 and Q3 (the 25th and 75th percentiles). The top and bottom lines are the minimum (Q1 − 1.5 × IQR) and maximum (Q3 + 1.5 × IQR).

**Comment**: Boxplots are commonly used in EDA and visualisation to quickly see the data distribution. We would like to know if the candidates are familiar with them. If the candidate knows that the red lines are the medians, ask them what a median is, and then what the 25th percentile means.
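The boxplot quantities in the answer above can be computed with the standard library (the sample data is made up for illustration):

```python
# Sketch: median, Q1, Q3, IQR and whisker limits, i.e. the quantities
# drawn in a boxplot, computed for a small sample.
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 12]
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
whisker_low, whisker_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: [{whisker_low}, {whisker_high}]")
```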

**Question**: This is a ROC curve (Receiver Operating Characteristic). The larger the area under the curve (AUC), the better the model. What are the X axis and the Y axis of this curve?

**Answer**: The X axis is the False Positive Rate (FPR) and the Y axis is the True Positive Rate (TPR). Image source: Wikipedia (link).

**Comment**: AUC is a measure frequently used to assess model accuracy. It is rather disappointing if a candidate knows what AUC is but doesn’t know what the axes are, because in many projects data scientists are constantly doing a “balancing act” between a higher TPR (sensitivity) and a lower FPR.
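The two axes can be computed from the confusion counts in a few lines (the `tpr_fpr` helper and the labels are made up for illustration):

```python
# Sketch: the two ROC axes, computed from actual and predicted labels.
def tpr_fpr(actual, predicted):
    """Return (True Positive Rate, False Positive Rate): the Y and X axes of a ROC curve."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
print(tpr_fpr(actual, predicted))  # (0.75, 0.25)
```

Varying the classifier's decision threshold and re-computing these two numbers traces out the ROC curve.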

**Question**: Below is a frequency distribution chart. The data is positively skewed (long tail on the right-hand side). Which ones are the mean, median and mode?

**Answer**: A has the highest frequency so it is the mode. C is the mean, because the long tail on the right drags the mean to the right of the median (B).

**Comment**: Image source: link. In a normal distribution the mean, median and mode coincide. But if the data is positively skewed, the mean is to the right of the median; in a negatively skewed distribution the mean is to the left of the median.
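The ordering in the comment above can be checked with the standard library (the sample is a made-up positively skewed list):

```python
# Sketch: for positively skewed data, mode < median < mean, because
# the long right tail drags the mean rightwards.
import statistics

data = [1, 1, 1, 2, 2, 3, 4, 10]  # long tail on the right
mode, median, mean = (statistics.mode(data),
                      statistics.median(data),
                      statistics.mean(data))
print(mode, median, mean)  # 1 2.0 3
```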

**Question**: Why is it better to use the median than the mean?

**Answer**: Because the median is not affected by skewness in the data, nor by outliers. Both the median and the mean reflect the central tendency of the data, but often the data is not normally distributed and is skewed to one side. In those cases it is better to use the median, as it is not affected by the skewness.

**Comment**: As data scientists we come across this issue very often, i.e. data that is not normally distributed. Blindly using the mean does not reflect the true central tendency if the data is not normally distributed.
