How Will You Get Hired As A Data Scientist? A Short Analysis Of My LinkedIn Messages As A Data Scientist

Oct 24, 2018

An analysis of data science recruitment, using some data science on data from LinkedIn, which I'll then post back on LinkedIn.

There are a ton of stats being thrown around about jobs in the data science field: the number of open positions, high median base salaries, unmet market needs, etc. While this is promising, you can't necessarily equate these stats to any sort of direct, personal outcome if you're a job seeker or new to the data science world. As such, the goal of this post is to give a little bit of insight into how I've been recruited since my title on LinkedIn became "Data Scientist" and my profile gained an increasing number of skills/buzzwords.

For the sake of transparency, and because it would be ironic not to, I'll use some basic data science skills to conduct the analysis and share the Python code so you can follow along if you're curious.

A little about me
My background is actually in biology. While I was getting my MS in molecular and cellular biology, I was lucky enough to get a job as a data analyst at a biotech company. There, my role was basically to teach myself R and Python to conduct statistical analysis of our cellular and biological data. As I became more interested in programmatic data science, I wanted a more formal education, or basically anything other than just me and Stack Overflow.

Thus, I attended the New York City Data Science Academy (NYCDSA) boot camp. I worked for them for a while writing and curating data science interview questions, along with some other data science and analytics jobs, until I got a data science internship at Pfizer. Now I've been a data scientist with them for a little over a year, dealing primarily in deep learning, NLP, data engineering and architecture.

Downloading your InMail
As I got more and more messages from recruiters about data science openings, I began to wonder if there was a good way to access the data without scraping my own profile. Though the LinkedIn API is useless in this case, you can actually just download all your message data into a convenient .csv file. You can access it through the Settings & Privacy portion of your profile, but for ease here's the link: https://www.linkedin.com/psettings/member-data. You should then see the options below.

You can then select the data you want to download, and LinkedIn will email you a confirmation link. For this walkthrough/analysis, it's just "Messages".

Really Basic Plots and Stats
Word up, so now we can actually get started with the data. Let's start up Python and see what we've got.

import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
messages = pd.read_csv("messages.csv")
messages.head()

I've dropped the values from the From and Content columns for anonymity's sake, but you should see the fields above. We can see who sent the mail, the date/time, subject and content. First, we just have to clean the data a bit. For plotting, it'll be helpful to have the date and time as the index. Also, I don't really want to look at all the back and forth messages as that can be a bit misleading. Let's keep only the first message everyone sent me.

## parse the date strings into datetime objects (the index is set later)
messages['Date'] = [datetime.strptime(x, '%m/%d/%y, %I:%M %p') for x in messages['Date']]
## keep only messages sent to me
df = messages[messages['Direction'] == "INCOMING"]
## keep only the first message everyone sent me:
## sort oldest-first so this holds whatever order the export uses
df = df.sort_values('Date').drop_duplicates('From', keep='first')
Now that we have only the first message everyone sent me, let's take a look at some basic stats.

df = df.set_index(pd.DatetimeIndex(df['Date']))
total_msg = len(df)
min_date = min(df.index)
mean_msg_month = df.resample('MS').size().mean()
mean_msg_week = df.resample('W-MON').size().mean()
print("Earliest Message: %s" % min_date)
print("Total All Time Incoming Messages: %s" % total_msg)
print("Avg # of Messages per Month: %s" % mean_msg_month)
print("Avg # of Messages per Week: %s" % mean_msg_week)

So I've got a little over 300 incoming messages from individuals; for simplicity, we'll assume most are recruiters and ignore the occasional product promotion, pyramid selling garbage or whatever else. However, I didn't add anything about data science to my profile until February or March of 2017, and my first message dates back to July of 2016; this could be skewing the data a bit. Let's plot and find out.
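The plotting code isn't shown in the post, so here is a minimal sketch of one way to draw such a chart. It assumes a frame shaped like df above (one row per sender, with a parsed Date column); the helper names monthly_counts and plot_monthly_counts, the tiny sample frame and the marker date are all made up for illustration.

```python
import pandas as pd


def monthly_counts(first_messages: pd.DataFrame) -> pd.Series:
    """Count messages per calendar month; months with no mail show up as 0."""
    indexed = first_messages.set_index(pd.DatetimeIndex(first_messages["Date"]))
    return indexed.resample("MS").size()


def plot_monthly_counts(counts: pd.Series, profile_change: str) -> None:
    """Bar chart of monthly counts with a dashed line at the profile update."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.bar(counts.index, counts.values, width=20)  # width is in days on a date axis
    ax.axvline(pd.Timestamp(profile_change), color="red", linestyle="--",
               label="profile updated")
    ax.set_xlabel("Month")
    ax.set_ylabel("First messages received")
    ax.legend()
    fig.autofmt_xdate()
    plt.show()


# Hypothetical sample standing in for the real export
sample = pd.DataFrame({
    "From": ["A", "B", "C", "D"],
    "Date": pd.to_datetime(["2016-07-15", "2017-03-02", "2017-03-20", "2017-05-09"]),
})
counts = monthly_counts(sample)
print(counts)

# With the real data, something like:
# plot_monthly_counts(monthly_counts(df), "2017-02-01")
```

Resampling before plotting keeps the empty months on the axis, so a quiet stretch shows up as a gap instead of being silently skipped.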

Ah, there we go: you can see I have basically no recruiter InMail before adding data science terms to my profile in February 2017. Once I do, my messages skyrocket! Note that I think I also turned on the setting to let recruiters know I'm open to new opportunities around this time, but I turned it off in August of 2017 with little observable effect. There's a small lull towards the end of the year in December, but January 2018 again starts with a burst of messages and it's been pretty steady since then.

Knowing that I have a lot of pre-data science data affecting my mean calculations, let's redo those stats for dates from 2017 onward.

mean_msg_month = df[df.index > '2017-01-01'].resample('MS').size().mean()
mean_msg_week = df[df.index > '2017-01-01'].resample('W-MON').size().mean()
print("Avg # of Messages per Month: %s" % mean_msg_month)
print("Avg # of Messages per Week: %s" % mean_msg_week)
That's better. On average, I get 14 messages per month and 3 per week with data scientist as my title and data science terms in my profile. As can be seen in the chart, some months are much stronger than others.

A little bit of NLP
Those stats were informative, but that's like the most boring Python I've done all year. Even though it may not be that fruitful, let's play with the text a little bit to see what info we can pull out. We can go back to the original table we read in, drop NA values in the Content column and look at the incoming messages, starting with the lamest of NLP visuals: a word cloud.

from wordcloud import WordCloud, STOPWORDS
stopwords = set(list(STOPWORDS) + ['hi','kyle','will','thank','thanks'])
df = messages[messages['Direction'] == "INCOMING"]
df = df[pd.notnull(df['Content'])]
wc = WordCloud(background_color='white',
               stopwords=stopwords,
               ).generate(' '.join(list(df['Content'])))
plt.imshow(wc)
plt.axis("off")
plt.show()
After removing stopwords and some additional custom ones, we see the bulk of my InMail is discussing data science and data science roles. Some are referencing my current position, but many more are referencing new "opportunities" for a "client" or "business" for data science or "machine learning" roles. This is cool but doesn't tell us too much, so let's see if we can dig deeper with something more advanced.
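A word cloud only hints at relative frequency, so it can help to look at the raw counts too. The following is a rough sketch using only the standard library; the short stopword set and the two sample messages are invented for illustration and are much smaller than wordcloud's STOPWORDS.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword sets (the post uses wordcloud's
# STOPWORDS plus the custom words below)
BASIC_STOPWORDS = {"a", "an", "the", "and", "or", "for", "to", "of", "in",
                   "i", "you", "we", "is", "are", "have", "your", "with"}
CUSTOM_STOPWORDS = {"hi", "kyle", "will", "thank", "thanks"}


def top_words(texts, n=10, stopwords=frozenset()):
    """Return the n most frequent lowercase words that are not stopwords."""
    tokens = []
    for text in texts:
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Made-up messages standing in for df['Content']
sample_messages = [
    "Hi Kyle, I have a data science opportunity for a client.",
    "Thanks! We are hiring for a machine learning role in data science.",
]
result = top_words(sample_messages, n=3,
                   stopwords=BASIC_STOPWORDS | CUSTOM_STOPWORDS)
print(result)
```

On the real data this would be called as top_words(df['Content'], ...) with whatever stopword set you prefer.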

Entity Recognition
"Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as the person names, organizations, location, etc." - Wikipedia. Basically pulling out useful labeled data from unstructured text.

Instead of training our own model or labeling data, we can use packages like spaCy, which provide pre-trained NER models capable of identifying a number of entity types, some of which can be seen below.

That way, with a couple of lines of code, you can quickly gather and display entities from any text. The example below uses a random anonymized example from my messages that I render in a Jupyter notebook.

import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
## i is the position of the message to display; pick any row
i = 0
text = df['Content'].iloc[i]
displacy.render(nlp(text), jupyter=True, style='ent')

It's cool but isn't working all that well. The algorithm labels the company "Place" 4 different ways, as GPE, PERSON, nothing, and finally the correct way, ORG. NY and CHI should also not be ORGs.

I originally hoped to use this to get a good sense of the companies and positions I was being recruited for, along with salary mentions, but without labeling a lot of my own data (which I won't do) and training a new model (which I don't feel like doing) it probably won't yield good results. Looks like we'll have to try a more hard-coded approach.

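positions = ['data scientist',
             'data analyst',
             'data engineer',
             'senior data scientist']
## join all message bodies into one lowercase string
text = ' '.join(list(df['Content'])).lower()
## note: 'data scientist' also matches inside 'senior data scientist',
## so its count includes the senior mentions
for job in positions:
    count = text.count(job)
    print(job + ': %d' % count)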
This is probably still missing a lot of edge cases, such as other roles or positions only mentioned in the message subject, but there you go: a short exploration of the roles mentioned in my InMail.
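One such edge case is that str.count treats 'data scientist' as a plain substring, so every 'senior data scientist' mention lands under both titles. A minimal sketch of a stricter count follows, assuming a regex that tries longer titles first and tolerates plurals and odd spacing; count_positions and the sample sentence are made up for illustration.

```python
import re

POSITIONS = ["data scientist", "data analyst", "data engineer",
             "senior data scientist"]


def count_positions(text, positions=POSITIONS):
    """Count each title once per mention, trying longer titles first."""
    # Longest first so "senior data scientist" wins over "data scientist"
    ordered = sorted(positions, key=len, reverse=True)
    # Allow any run of whitespace between the words of a title
    alternatives = [r"\s+".join(re.escape(word) for word in title.split())
                    for title in ordered]
    pattern = re.compile(r"\b(" + "|".join(alternatives) + r")s?\b")
    counts = {title: 0 for title in positions}
    for match in pattern.finditer(text.lower()):
        # Normalize whitespace so the match maps back to a known title
        title = " ".join(match.group(1).split())
        counts[title] += 1
    return counts


sample = ("Hiring a Senior Data Scientist and two data scientists; "
          "a Data  Engineer too.")
counts = count_positions(sample)
for job, count in counts.items():
    print(job + ': %d' % count)
```

With this approach the 'data scientist' count no longer includes the senior mentions, so the numbers will differ from the simple substring version.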

Future Directions
It'd be cool to actually do something machine learning with this data. I had a whole thing where I did k-means on the data using averaged pre-trained word embeddings but it's a bit much for this post.

Additionally, if I'd had comprehensive logs of my LinkedIn profile changes over time (bio, certifications, positions, connections, etc.), I would've liked to see whether certain terms attract more messages and dig more into the time-series aspect. Unfortunately, I probably don't have the time or data for that. Regardless, I hope you enjoyed it, and let me know if you have any more ideas for exploration.

Source: HOB