Some people wonder what exactly a data scientist is. Put simply, a data scientist is an expert in:
- Statistical analysis
- Data mining
- Big data
- Data visualisation
In addition to the above technical skills, a data scientist usually has good business knowledge in one or two sectors, such as banking, insurance, investment, oil or biotech.
In this article, I will explain each of these four technical areas, and then give an example.
I consider the following statistical analysis skills to be important to have for a data scientist:
- Generalised Linear Models (Bernoulli distribution, Bayesian methods, Gaussian regression)
- Cluster analysis (k-means, Expectation Maximisation, fuzzy clustering)
- Goodness of Fit Tests (Kolmogorov Smirnov, Pearson’s Chi2, Anderson-Darling)
- Bayesian Statistics (Naïve Bayes Classifier, Bayesian Network, Hidden Markov Models)
- Factor Analysis (Maximum Likelihood Extraction, Principal Component Extraction)
- Time Series, Matrix computations, Latent Dirichlet Allocation
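To make one of the topics above concrete, here is a minimal sketch of k-means clustering in plain Python, on one-dimensional data for simplicity. This is an illustration of the idea only (real work would use R or a statistics package), and the data points are made up:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal 1-D k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep a centroid in place if its cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.3]
print(kmeans(data, 2))  # two centroids, one near each group of points
```

The same alternating assign-and-update idea underlies Expectation Maximisation, with hard assignments replaced by probabilities.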
For many candidates, this first subject (Statistical Analysis) is the biggest stumbling block. The topics above are advanced statistics; some of them are at PhD level. Very few people in IT know them, because they are usually only taught in university degrees such as mathematics, physics or engineering.
As with anything else, Statistical Analysis requires the use of software tools. The most widely used are SPSS and SAS; these two are the de facto industry standard. Matlab has a special place: it is neither as user-friendly nor as comprehensive as SPSS and SAS, but it can be programmed to do almost anything and is extremely flexible for any kind of statistical processing.
Statistical analysis also requires good programming skills, particularly in R and S. Knowledge of parallel processing and multithreading will also be useful. R is the de facto standard language for statistical computing, data analysis and data mining. R is derived from S, with lexical scoping semantics added to it, so S provides a good foundation but is practically no longer used. Popular environments for writing R include RStudio, Rattle GUI and RWeka.
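The lexical scoping that R added over S is worth understanding in its own right. Python happens to scope the same way, so here is a small sketch of the concept: an inner function captures a variable from the environment where it was defined, not from where it is called:

```python
def make_counter():
    # Lexical scoping: `count` lives in the enclosing function's
    # environment, and the inner function captures that environment.
    count = 0
    def increment():
        nonlocal count  # write to the captured variable, not a new local
        count += 1
        return count
    return increment

counter = make_counter()
print(counter(), counter(), counter())  # 1 2 3
```

In R the equivalent closure works the same way (using `<<-` in place of `nonlocal`), which is what makes many of its functional idioms possible.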
Data Mining is required for predictive analysis and forecasting the future, as well as descriptive analysis (explaining the past). For this you need to be able to build data mining models, train them, and use them for forecasting. Data mining requires a strong mathematical foundation in techniques such as clustering, regression and neural networks. It also requires knowledge of specific tools such as SAS and Analysis Services.
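The build-train-forecast cycle can be shown in miniature with the simplest possible model, a one-variable linear regression fitted by ordinary least squares. The numbers below are invented for illustration:

```python
def fit_line(xs, ys):
    """Train a simple linear model y = a + b*x with ordinary least
    squares -- the 'train the model' step in miniature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, x):
    """Use the trained model for forecasting."""
    return a + b * x

# Historical data: month number vs. sales (made-up figures).
a, b = fit_line([1, 2, 3, 4], [10.0, 12.1, 13.9, 16.0])
print(predict(a, b, 5))  # forecast for month 5
```

Real mining models (neural networks, clustering) are far richer, but the shape of the workflow, fit on the past and score the future, is the same.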
Data mining requires knowledge of data warehousing and BI, because data mining can only uncover patterns actually present in the data. As the data is usually a very large set, it is commonly stored in a data warehouse, where it undergoes data warehousing processes such as data integration, denormalisation modelling, data quality improvement, and data cleansing.
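Data cleansing sounds abstract, so here is a toy sketch of the kind of data-quality pass a warehouse pipeline performs before any mining happens: trim whitespace, normalise case, and drop duplicate customer rows. The records are invented:

```python
def cleanse(records):
    """A minimal data-cleansing pass: trim whitespace, normalise case,
    and de-duplicate (name, city) customer rows."""
    seen = set()
    clean = []
    for name, city in records:
        key = (name.strip().title(), city.strip().title())
        if key not in seen:       # keep the first occurrence only
            seen.add(key)
            clean.append(key)
    return clean

raw = [(" john smith ", "LONDON"),
       ("John Smith", "london"),   # same customer, different formatting
       ("ann lee", "Leeds")]
print(cleanse(raw))  # [('John Smith', 'London'), ('Ann Lee', 'Leeds')]
```

Production cleansing also handles fuzzy matching, reference data and survivorship rules, but the principle is the same: the mining model only sees what the cleansing lets through.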
Data mining also requires business knowledge, in areas such as Customer Relationship Management, market basket analysis, credit card fraud, insurance claims, marine underwriting, credit risk and FX trading. Without business knowledge, it is impossible to create a good mining model.
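Market basket analysis, for example, boils down to counting how often items appear together. Here is a sketch of the two standard measures for an association rule, support and confidence, on made-up baskets:

```python
def rule_stats(baskets, antecedent, consequent):
    """Support and confidence for a market-basket rule
    'antecedent -> consequent', computed by simple counting."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    support = both / n                       # how common the pair is overall
    confidence = both / ante if ante else 0  # how often bread implies butter
    return support, confidence

baskets = [{"bread", "butter"}, {"bread", "butter", "jam"},
           {"bread"}, {"milk", "butter"}]
s, c = rule_stats(baskets, "bread", "butter")
print(s, c)  # bread and butter appear together in 2 of the 4 baskets
```

Knowing which rules are commercially meaningful, as opposed to merely frequent, is exactly where the business knowledge comes in.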
A data scientist needs to know about big data because, increasingly, data is stored in big data architectures, i.e. Hadoop, HBase, MapReduce, EC2/S3. They do not need to know Pig, Oozie, Lucene, Flume or Sqoop in detail, but they do need experience with the platform that the company uses, such as Hortonworks, Cloudera, BigInsights (IBM), or HDInsight (Microsoft). These platforms are fully equipped with all the tools you need to load and process data in Hadoop: data access layers, streaming, query languages, security, scheduling, and governance, all rolled into an Enterprise-ready platform.
A data scientist may not need to write complex MapReduce transformations in Pig Latin and extend the functions using Python or Ruby (that's for the programmer to do), but they do need to understand the basic concepts. For instance: how the Reduce job combines the output tuples from a Map into a smaller set of tuples, what keys and values are in Hadoop, why you need a Shuffle between a Map and a Reduce, and so on. Whether you use Horton, BigInsights or HDI, the implementations differ between companies, but the core concepts are always the same.
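To make those concepts concrete, here is a toy word-count in plain Python. It is a sketch of the idea only (real Hadoop jobs are written against the MapReduce API and run distributed), but the three phases are the same:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit one (key, value) pair per word -- here (word, 1).
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values that share a key, so every value for
    # a given key ends up at the same reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a smaller set of tuples.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ideas", "big plans"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(shuffle(pairs)))
# {'big': 3, 'data': 1, 'ideas': 1, 'plans': 1}
```

The Shuffle is the step people most often gloss over: without it, counts for the same word would be scattered across reducers and could never be summed.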
Using a Big Data platform such as BigInsights enables data scientists to do data discovery and visualisation. It comes with advanced text analytics tools, machine learning analytics, large-scale indexing, adaptive MapReduce, compression, security and stream analytics. Without knowledge of such a platform, a data scientist limits their capability to process the data.
Data visualisation is the buzzword for BI tools. Tableau, QlikView and Spotfire are the most popular. IBM has five: Cognos Insight, Cognos Express, SPSS Analytic Catalyst, SPSS Visualisation Designer and Watson Analytics. SAP has Lumira. SAS has Visual Analytics and Visual Statistics. And there are tons of other tools: Dundas, iDashboards, Datazen, MicroStrategy, Birst, Roambi, Klipfolio, Inetsoft, to name a few.
A good data scientist must have experience creating visualisations using one or two of the popular tools above. None of these tools is difficult to use, compared to programming in Python or Ruby for example, or even compared to Statistical Analysis. Within two weeks you will be able to grasp the basics, and within a month you should be able to use them fluently. They are very user-friendly, with highly visual GUIs (point and click, drag and drop, that sort of thing).
One of the most famous teams of data scientists is AIG’s. As the Wall Street Journal reported (link): “AIG analyzes reams of data, from massive databases to handwritten notes from claim adjusters, to identify potentially fraudulent activity. It produces a list of around 100 claims to investigate, ranked highest priority to lowest. It also produces charts and other visualizations, like heat maps, that helps investigators understand why machine learning algorithms make the connections they do.”
Jobs and Pay
I recently saw a job advert for a data scientist at £800 per day. Admittedly this is the highest I have ever seen. I believe the normal range is £400 to £600 per day, depending on the industry sector: banking pays the highest, insurance or pharma probably second, and the public sector (including the NHS) the lowest.
- ITJobsWatch (link) reported that the average salary for a permanent position is £57,500, whereas for a contract position it is £450/day.
- Glassdoor (link) reported that the national average for a data scientist is £40,000.
- PayScale (link) reported that the median is £39,691.
It is near impossible to find all four skills in one person. So in many companies this role is carved up into two or three positions: a big data guy, a visualisation developer, and a statistical analyst. In some companies you don’t need to know any statistics to be called a Data Scientist; you just need to know SQL. They call the SQL database guy a Data Scientist just to make the role more attractive and look good. Ten years ago the “SQL guy” used to be called a “report developer”, but nowadays it could be (mis)labelled as “data scientist”.