By definition, data scientists work with data. This involves many activities, such as sampling and pre-processing of data, model estimation and post-processing (e.g. sensitivity analysis, model deployment, back-testing, model validation). Although many user-friendly software tools are on the market nowadays to automate these activities, every analytical exercise requires tailored steps to tackle the specificities of a particular business problem. Performing these steps successfully requires programming. Hence, a good data scientist should possess sound programming skills in, for example, R, Python or SAS. The programming language itself is not that important, as long as he/she is familiar with the basic concepts of programming and knows how to use them to automate repetitive tasks or perform specific routines.
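As a small illustration of automating a repetitive task, the sketch below wraps a descriptive-statistics routine in a function so that it can be applied to any number of variables in one pass. The function names and the sample figures are hypothetical and chosen only for illustration; the same idea carries over to R or SAS.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for one numeric variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def summarize_all(variables):
    """Apply the same summary routine to every variable in a dict."""
    return {name: summarize(values) for name, values in variables.items()}


# Hypothetical toy data: in practice these would be columns of a data set.
variables = {
    "age": [23, 35, 41, 52, 29],
    "income": [1800, 2500, 3100, 4200, 2100],
}

for name, summary in summarize_all(variables).items():
    print(name, summary)
```

Writing the routine once and looping over the variables avoids copy-pasting the same calculation for every new data set.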
A data scientist should have solid quantitative skills:
Obviously, a data scientist should have a thorough background in statistics, machine learning and/or data mining. The distinction between these disciplines is becoming increasingly blurred and is actually not that relevant. They all provide a set of quantitative techniques to analyze data and find business-relevant patterns within a particular context (e.g. risk management, fraud detection, marketing analytics). The data scientist should be aware of which technique can be applied when and how. He/she should not focus too much on the underlying mathematical (e.g. optimization) details, but rather have a good understanding of what analytical problem a technique solves and how its results should be interpreted. In this respect, the training of engineers in computer science and in business or industrial engineering should aim at an integrated, multidisciplinary view, so that recent graduates are trained both in the use of the techniques and in the business acumen necessary to bring new endeavors to fruition.
Also important in this context is to spend enough time validating the analytical results obtained, so as to avoid situations often referred to as data massage or data torture, whereby data is (intentionally) misrepresented or too much attention is paid to spurious correlations. When selecting the optimal quantitative technique, the data scientist should take into account the specificities of the business problem. Typical requirements for analytical models are: actionability (to what extent is the analytical model solving the business problem?), performance (what is the statistical performance of the analytical model?), interpretability (can the analytical model be easily explained to decision makers?), operational efficiency (how much effort is needed to set up, evaluate and monitor the analytical model?), regulatory compliance (is the model in line with regulation?) and economic cost (what is the cost of setting up, running and maintaining the model?). Based upon a combination of these requirements, the data scientist should be capable of selecting the best analytical technique to solve the business problem.
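To make the risk of spurious correlations concrete, the following sketch generates a purely random target and many purely random candidate features, picks the feature that correlates most strongly with the target on a small sample, and then recomputes the correlation on held-out observations. All names, sample sizes and the seed are assumptions made for this illustration; it is a toy simulation, not a validation procedure.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_noise_feature(n_features=200, n_train=20, n_holdout=20, seed=0):
    """Pick the noise feature most correlated with a noise target.

    Returns the feature index, its correlation on the selection sample,
    and its correlation on the held-out sample.
    """
    rng = random.Random(seed)
    n = n_train + n_holdout
    target = [rng.gauss(0, 1) for _ in range(n)]
    features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]

    # Select the "best" feature using only the first n_train observations.
    train_r = [pearson(f[:n_train], target[:n_train]) for f in features]
    best = max(range(n_features), key=lambda i: abs(train_r[i]))

    # Re-evaluate the selected feature on observations it was not chosen on.
    holdout_r = pearson(features[best][n_train:], target[n_train:])
    return best, train_r[best], holdout_r


best, r_train, r_holdout = strongest_noise_feature()
print(f"feature {best}: selection sample r = {r_train:.2f}, "
      f"hold-out r = {r_holdout:.2f}")
```

Because the feature is chosen as the best of many candidates, its correlation on the selection sample is inflated even though no real relationship exists; the hold-out figure is the more honest estimate.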
A data scientist should excel in communication and visualization skills:
Like it or not, analytics is a technical exercise. At this moment, there is a huge gap between the analytical models and the business users. To bridge this gap, communication and visualization facilities are key! Hence, a data scientist should know how to represent analytical models and their accompanying statistics and reports in user-friendly ways using, for example, traffic light approaches, OLAP (on-line analytical processing) facilities or if-then business rules. He/she should be capable of communicating the right amount of information without getting lost in complex (e.g. statistical) details, which would inhibit a model's successful deployment. By doing so, business users will better understand the characteristics and behavior in their big data, which will improve their attitude towards and acceptance of the resulting analytical models. Educational institutions must learn to strike this balance, since it is known that many academic degrees produce graduates whose knowledge is skewed towards either the analytical or the practical side.
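A traffic light approach can be as simple as a pair of if-then rules that translate a model score into a colour a business user can act on. The sketch below assumes the model outputs a probability (for instance of churn or default); the function name, the customer identifiers and the cut-off values of 0.3 and 0.7 are hypothetical and would in practice be agreed with the business.

```python
def traffic_light(probability, amber_from=0.3, red_from=0.7):
    """Map a predicted probability to a traffic light colour.

    The cut-offs are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if probability >= red_from:      # IF score is high THEN flag as red
        return "red"
    if probability >= amber_from:    # IF score is moderate THEN flag as amber
        return "amber"
    return "green"                   # otherwise no action is needed


# Hypothetical scores for three customers.
scores = {"customer_a": 0.12, "customer_b": 0.45, "customer_c": 0.91}

for customer, score in scores.items():
    print(f"{customer}: {traffic_light(score)}")
```

The business user sees only the colour, while the underlying probability remains available for anyone who wants the statistical detail.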
A data scientist should have a solid business understanding:
While this might be obvious, we have witnessed too many data science projects that failed because the analyst did not understand the business problem at hand. By "business" we refer to the application area, which could be, for example, churn prediction or credit scoring in a real business context, or astronomy or medicine if the data to be analyzed stem from such areas.
A data scientist should be creative:
A data scientist needs creativity on at least two levels. First, on a technical level, it is important to be creative with regard to feature selection, data transformation and data cleaning. These steps of the standard knowledge discovery process have to be adapted to each particular application, and often the "right guess" can make a big difference. Second, big data and analytics is a fast-evolving field. New problems, technologies and corresponding challenges pop up on an ongoing basis. It is important that a data scientist keeps up with these new technologies and has enough creativity to see how they can create new business opportunities.
We have provided a brief overview of characteristics to be looked for when hiring data scientists. To summarize, given the multidisciplinary nature of big data and analytics, a data scientist should possess a mix of skills like programming, quantitative modelling, communication and visualization, business understanding, and creativity.