Data scientists are expected to know a lot - machine learning, computer science, statistics, mathematics, data visualization, communication, and deep learning. Within those areas, there are dozens of languages, frameworks, and technologies data scientists could learn. How should data scientists who want to be in demand by employers spend their learning budget?
I scoured job listing websites to find which skills are most in demand for data scientists. I looked at general data science skills and at specific languages and tools separately. I searched job listings on LinkedIn, Indeed, SimplyHired, Monster, and AngelList on October 10, 2018. Here's a chart showing how many data scientist jobs each website listed.
I read through many job listings and surveys to find the most common skills. Terms like "management" were not compared because they can be used in so many different contexts in job listings.
All searches were performed for the United States using the exact-match query "data scientist" "[keyword]". Exact-match searching reduced the number of results, but it ensured the results were relevant to data scientist positions, and it affected all search terms similarly.
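As a rough illustration of how those search strings fit together, the sketch below pairs the quoted job title with each quoted keyword. The function name `build_query` and the sample keywords are hypothetical. The searches themselves were typed into each job site, so nothing here calls a site's API.

```python
def build_query(keyword, title="data scientist"):
    """Return an exact-match search string: both phrases wrapped in double quotes."""
    return f'"{title}" "{keyword}"'


# Sample keywords taken from skills discussed in this article.
keywords = ["machine learning", "Python", "SQL"]

for keyword in keywords:
    print(build_query(keyword))
```

Quoting both phrases is what makes the search an exact match on each one, rather than a loose match on the individual words.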
AngelList provides the number of companies with data scientist listings rather than the number of positions. I excluded AngelList from both analyses because its search algorithm seems to operate as an OR type of logical search, without the ability to change it to an AND. AngelList works fine if you are looking for "data scientist" "TensorFlow", a pairing that is only going to be found in data scientist positions. But if your keywords are "data scientist" "react.js", it returns far too many listings from companies with non-data scientist job listings.
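To see why OR-style matching inflates the counts, here is a minimal sketch with three made-up listings. The helper names `matches_and` and `matches_or` are hypothetical, and this is only a toy model of the behavior described above, not AngelList's actual search algorithm.

```python
def matches_and(listing_text, phrases):
    """True only if every phrase appears in the listing (AND logic)."""
    text = listing_text.lower()
    return all(phrase.lower() in text for phrase in phrases)


def matches_or(listing_text, phrases):
    """True if at least one phrase appears in the listing (OR logic)."""
    text = listing_text.lower()
    return any(phrase.lower() in text for phrase in phrases)


# Made-up listings for illustration only.
listings = [
    "Data scientist with TensorFlow experience",
    "Front-end engineer, React.js and TypeScript",
    "Data scientist building dashboards in React.js",
]

query = ["data scientist", "react.js"]

and_hits = [text for text in listings if matches_and(text, query)]
or_hits = [text for text in listings if matches_or(text, query)]

print(len(and_hits))  # 1: only the third listing contains both phrases
print(len(or_hits))   # 3: every listing contains at least one phrase
```

Under OR logic the front-end engineering job is counted even though it has nothing to do with data science, which is the kind of overcounting that led to the exclusion.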
Glassdoor was also excluded from my analyses. The site stated that it had 26,263 "data scientist" jobs in the US, but it would show me no more than 900 jobs. Additionally, it seems highly unlikely that it would have more than three times as many data scientist job listings as any other major platform.
Terms with over 400 listings on LinkedIn for general skills and over 200 listings for specific technologies were included in the final analyses. There was certainly some cross-posting. The results are recorded in this Google Sheet.
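The inclusion rule can be sketched in a few lines of Python. The term names and counts below are placeholders, not the figures from the Google Sheet, and `keep_above` is a hypothetical helper. Only the two thresholds come from the text.

```python
# Hypothetical LinkedIn listing counts; placeholders, not the article's data.
general_skill_counts = {"skill_a": 950, "skill_b": 410, "skill_c": 120}
technology_counts = {"tool_x": 600, "tool_y": 205, "tool_z": 90}

GENERAL_MIN = 400     # general skills need more than 400 LinkedIn listings
TECHNOLOGY_MIN = 200  # specific technologies need more than 200


def keep_above(counts, minimum):
    """Return the terms whose listing count is strictly greater than the minimum."""
    return {term: n for term, n in counts.items() if n > minimum}


print(keep_above(general_skill_counts, GENERAL_MIN))
print(keep_above(technology_counts, TECHNOLOGY_MIN))
```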
I downloaded .csv files and imported them into JupyterLab. I then computed the percentage occurrences and averaged them across the job listing websites.
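Here is a minimal sketch of that calculation, assuming "percentage occurrence" means the share of a site's "data scientist" listings that also match the keyword. All counts are invented placeholders and `average_share` is a hypothetical helper. It uses only the standard library to stay self-contained, and it is not the code from the Kaggle Kernel.

```python
from statistics import mean

# Hypothetical counts for illustration; these are NOT the article's data.
# totals: listings returned for "data scientist" alone on each site.
totals = {"LinkedIn": 1000, "Indeed": 800, "SimplyHired": 500, "Monster": 400}

# keyword_counts: listings returned for "data scientist" "<keyword>".
keyword_counts = {
    "Python": {"LinkedIn": 700, "Indeed": 600, "SimplyHired": 350, "Monster": 300},
    "SQL": {"LinkedIn": 500, "Indeed": 400, "SimplyHired": 250, "Monster": 200},
}


def average_share(counts_by_site, totals_by_site):
    """Average, across sites, of the percentage of listings mentioning a keyword."""
    shares = [
        100 * counts_by_site[site] / totals_by_site[site]
        for site in totals_by_site
    ]
    return mean(shares)


for keyword, counts in keyword_counts.items():
    print(f"{keyword}: {average_share(counts, totals):.1f}%")
```

Averaging the per-site percentages, rather than pooling the raw counts, gives each site equal weight, so the site with the most listings does not dominate the result.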
I also compared the software results to a Glassdoor study of its data scientist job listings from the first half of 2017. Combined with information from KDnuggets' usage survey, it appears some skills are becoming more important and others are losing importance. We'll get to those in a bit.
See my Kaggle Kernel for interactive charts and additional analyses. I used Plotly for the visualizations. Using Plotly with JupyterLab takes a little wrangling as of this writing - instructions are at the end of my Kaggle Kernel and in Plotly's docs.
Here's the chart of the most frequent general data scientist skills sought by employers.
The results show that analysis and machine learning are at the heart of data scientist jobs. Gleaning insights from data is a primary function of data science. Machine learning is all about creating systems to predict performance, and it is very much in demand.
Data science requires statistics and computer science skills - no surprise there. Statistics, computer science, and mathematics are also college majors, which probably helps their frequency.
It is interesting that communication is mentioned in nearly half of job listings. Data scientists need to be able to communicate insights and work with others.
AI and deep learning don't show up as frequently as some other terms. However, they are subsets of machine learning. Deep learning is being used for more and more of the machine learning tasks that other algorithms were used for previously. For example, the best machine learning algorithms for most natural language processing problems are now deep learning algorithms. I expect deep learning skills will be sought more explicitly in the future and that machine learning will become more synonymous with deep learning.
Which specific software tools for data scientists are employers looking for? Let's tackle that question next.
Below are the top 20 specific languages, libraries, and tech tools employers are looking for data scientists to have experience with.
Let's briefly look at the most common tech skills.
Python is the most in-demand language. The popularity of this open-source language has been widely observed. It's beginner friendly, with many support resources. The vast majority of new data science tools are compatible with it. Python is the primary language for data scientists.
R is not far behind Python. It once was the primary language for data science. I was surprised to see how in demand it still is. The roots of this open source language are in statistics, and it's still very popular with statisticians.
Python or R is a must for virtually every data scientist position.
SQL is also in high demand. SQL stands for Structured Query Language and is the primary way to interact with relational databases. SQL is sometimes overlooked in the data science world, but it's a skill worth demonstrating mastery of if you're planning to hit the job market.
Up next are Hadoop and Spark, both open source tools from Apache for big data.
Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. - Source.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. - Source.
These tools have considerably less written about them on Medium and in tutorials than many others. I expect far fewer job candidates have these skills than have Python, R, and SQL skills. If you have or can gain experience with Hadoop and Spark, it should give you a leg up on the competition.
Then come Java and SAS. I was surprised to see these languages as high as they are. Both have large companies behind them and at least some free offerings. Both Java and SAS generally receive little attention in the data science community.
Tableau is next in demand. This analytics platform and visualization tool is powerful, easy to use, and growing in popularity. It has a free public version but will cost you money if you want to keep your data private.
If you aren't familiar with Tableau, it's definitely worth taking a quick class such as Tableau 10 A-Z on Udemy. I don't get a commission for the suggestion; I just took the class and found it to be a great value.
The chart below shows an even bigger list of the most in-demand languages, frameworks, and other data science software tools.
Glassdoor did an analysis of the 10 most common software skills for data scientists from January 2017 through July 2017 on their site. Here's a comparison of how frequently the terms appeared on their site compared to the average on LinkedIn, Indeed, SimplyHired, and Monster in October 2018.
The results are fairly similar. Both my analysis and Glassdoor's found Python, R, and SQL to be the most in demand. We also found the same top nine technology skills, albeit in slightly different orders.
The results suggest that compared to the first half of 2017, R, Hadoop, Java, SAS, and MATLAB are now less in demand and Tableau is more in demand. This is what I would expect given the complementary results from sources such as the KDnuggets developer survey. There, R, Hadoop, Java, and SAS all show clear multi-year downward usage trends and Tableau shows a clear upward trend.
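A comparison like this boils down to subtracting one set of percentages from the other. The sketch below uses invented tool names and numbers, not the published Glassdoor figures or my October 2018 averages, and `point_changes` is a hypothetical helper.

```python
# Hypothetical percentages for illustration; these are NOT the published figures.
glassdoor_2017 = {"tool_a": 60.0, "tool_b": 40.0, "tool_c": 20.0}
average_2018 = {"tool_a": 55.0, "tool_b": 40.0, "tool_c": 28.0}


def point_changes(earlier, later):
    """Percentage-point change for every term present in both snapshots."""
    return {term: later[term] - earlier[term] for term in earlier if term in later}


changes = point_changes(glassdoor_2017, average_2018)

# List terms from the largest drop to the largest gain.
for term, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{term}: {delta:+.1f} points")
```

Because the two snapshots come from different sites and time periods, differences computed this way are suggestive rather than exact.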
Based on the results of these analyses, here are some general recommendations for current and aspiring data scientists concerned with making themselves widely marketable.
Demonstrate you can do data analysis and focus on becoming really skilled at machine learning.
Invest in your communication skills. I recommend reading the book Made to Stick to help your ideas have more impact. Also, check out the Hemingway Editor app to improve the clarity of your writing.
Master a deep learning framework. Being proficient with a deep learning framework is a larger and larger part of being proficient with machine learning. For a comparison of deep learning frameworks in terms of usage, interest, and popularity see my article here.
If you are choosing between learning Python and R, choose Python. If you have Python down cold, consider learning R. You'll definitely be more marketable if you also know R.
When an employer is looking for a data scientist with Python skills, they are also likely to expect candidates to know the common Python data science libraries: NumPy, pandas, scikit-learn, and Matplotlib. If you're looking to learn this set of tools, I suggest the following resources:
DataCamp and Dataquest - they are both reasonably priced online SaaS data science education products where you learn as you code. They both teach a number of technology tools.
If you are looking to jump into deep learning, I suggest starting with Keras or FastAI before moving on to TensorFlow or PyTorch. Chollet's Deep Learning with Python is a great resource for learning Keras.
Beyond these recommendations, I suggest you learn what interests you, although there are obviously many considerations when deciding how to allocate your learning time.
If you're looking for a data scientist job through online portals, I suggest you start with LinkedIn - it consistently has the most results.
If you are looking for a job or posting positions on job sites, keywords matter. "data science" returns nearly 3x the number of results that "data scientist" does on each site. But if you are looking strictly for a data scientist job, you're probably better off searching for "data scientist".
Regardless of where you're looking, I suggest you make an online portfolio that demonstrates your proficiency with as many in-demand skill areas as possible. I also suggest your LinkedIn profile showcase your skills.
As part of this project, I collected other data that I may turn into articles. Follow me to make sure you don't miss out.
If you want to see the interactive Plotly charts and the code behind them, check out my Kaggle Kernel.
I hope this article has provided you with some insights into what organizations hiring data scientists are looking for. If you learned something, please share it on Twitter, LinkedIn, or Facebook so others will be more likely to find it.