Doctors have always written clinical notes about their patients. Originally, the notes were on paper and were locked away in a cabinet. Fortunately for data scientists, doctors now enter their notes in an electronic medical record. These notes represent a vast wealth of knowledge and insight that can be utilized for predictive models using Natural Language Processing (NLP) to improve patient care and hospital workflow. As an example, I will show you how to predict hospital readmission with discharge summaries.
This article is intended for people interested in healthcare data science. After completing this tutorial, you will learn
How to prepare data for a machine learning project
How to preprocess the unstructured notes using a bag-of-words approach
How to build a simple predictive model
How to assess the quality of your model
How to decide the next step for improving the model
I recently read the great paper "Scalable and accurate deep learning for electronic health records" by Rajkomar et al. (paper at https://arxiv.org/abs/1801.07860). The authors built many state-of-the-art deep learning models with hospital data to predict in-hospital mortality (AUC = 0.93-0.94), 30-day unplanned readmission (AUC = 0.75-0.76), prolonged length of stay (AUC = 0.85-0.86) and discharge diagnoses (AUC = 0.90). AUC is a data science performance metric (more about this below) where closer to 1 is better. It is clear that predicting readmission is the hardest task, since it has the lowest AUC. I was curious how good a model we can get if we use the discharge free-text summaries with a simple predictive model.
If you would like to follow along with the Python code in a Jupyter Notebook, feel free to download the code from my GitHub.
This blog post will outline how to build a classification model to predict which patients are at risk for 30-day unplanned readmission utilizing free-text hospital discharge summaries.
We will utilize the MIMIC-III (Medical Information Mart for Intensive Care III) database. This amazing free hospital database contains de-identified data from over 50,000 patients who were admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts from 2001 to 2012. In order to get access to the data for this project, you will need to request access at this link (https://mimic.physionet.org/gettingstarted/access/).
In this project, we will make use of the following MIMIC III tables
ADMISSIONS - a table containing admission and discharge dates (has a unique identifier HADM_ID for each admission)
NOTEEVENTS - contains all notes for each hospitalization (links with HADM_ID)
To maintain anonymity, all dates have been shifted far into the future for each patient, but the time between two consecutive events for a patient is maintained in the database. This is important as it maintains the time between two hospitalizations for a specific patient.
Since this is a restricted dataset, I am not able to publicly share raw patient data. As a result, I will only show you artificial single patient data or aggregated descriptions.
Step 1: Prepare data for a machine learning project
We will follow the steps below to prepare the data from the ADMISSIONS and NOTEEVENTS MIMIC tables for our machine learning project.
First, we load the admissions table using pandas dataframes:
# set up notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read the admissions table
df_adm = pd.read_csv('ADMISSIONS.csv')
The main columns of interest in this table are:
SUBJECT_ID: unique identifier for each subject
HADM_ID: unique identifier for each hospitalization
ADMITTIME: admission date with format YYYY-MM-DD hh:mm:ss
DISCHTIME: discharge date with same format
DEATHTIME: death time (if it exists) with same format
ADMISSION_TYPE: includes ELECTIVE, EMERGENCY, NEWBORN, URGENT
The next step is to convert the dates from their string format into a datetime. We use the errors = 'coerce' flag to allow for missing dates.
# convert to dates
df_adm.ADMITTIME = pd.to_datetime(df_adm.ADMITTIME, format = '%Y-%m-%d %H:%M:%S', errors = 'coerce')
df_adm.DISCHTIME = pd.to_datetime(df_adm.DISCHTIME, format = '%Y-%m-%d %H:%M:%S', errors = 'coerce')
df_adm.DEATHTIME = pd.to_datetime(df_adm.DEATHTIME, format = '%Y-%m-%d %H:%M:%S', errors = 'coerce')
The next step is to get the next unplanned admission date if it exists. This will follow a few steps, and I'll show you what happens for an artificial patient. First we will sort the dataframe by the admission date
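The sort-and-shift step can be sketched with toy data. Column names follow the ADMISSIONS table, but the two-stay patient below is artificial; this is a sketch, not the full notebook code:

```python
import pandas as pd

# artificial admissions data: two stays for subject 1, one for subject 2
df_adm = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 2],
    'ADMITTIME': pd.to_datetime(['2150-01-01', '2150-02-15', '2151-06-01']),
    'ADMISSION_TYPE': ['EMERGENCY', 'EMERGENCY', 'EMERGENCY'],
})

# sort so that consecutive rows for a subject are consecutive stays
df_adm = df_adm.sort_values(['SUBJECT_ID', 'ADMITTIME']).reset_index(drop=True)

# the next admission date for each stay is the ADMITTIME of the next row
df_adm['NEXT_ADMITTIME'] = df_adm.groupby('SUBJECT_ID').ADMITTIME.shift(-1)
df_adm['NEXT_ADMISSION_TYPE'] = df_adm.groupby('SUBJECT_ID').ADMISSION_TYPE.shift(-1)
```

To restrict to unplanned readmissions, one option (an assumption about the full approach, not shown here) is to blank out NEXT_ADMITTIME wherever NEXT_ADMISSION_TYPE is ELECTIVE.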
Since I can't show individual notes, I will just describe them here. The dataset has 2,083,180 rows, indicating that there are multiple notes per hospitalization. In the notes, the dates and PHI (name, doctor, location) have been converted for confidentiality. There are also special characters such as '\n' (newline), numbers, and punctuation.
Since there are multiple notes per hospitalization, we need to make a choice on what notes to use. For simplicity, let's use the discharge summary, but we could use all the notes by concatenating them if we wanted.
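Selecting the discharge summaries can be sketched as a simple filter (toy data below; in MIMIC-III the relevant NOTEEVENTS column is CATEGORY):

```python
import pandas as pd

# toy stand-in for the NOTEEVENTS table
df_notes = pd.DataFrame({
    'HADM_ID': [100, 100, 101],
    'CATEGORY': ['Nursing', 'Discharge summary', 'Discharge summary'],
    'TEXT': ['nursing note', 'discharge note A', 'discharge note B'],
})

# keep only the discharge summaries
df_notes_dis_sum = df_notes[df_notes.CATEGORY == 'Discharge summary']
```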
Since the next step is to merge the notes onto the admissions table, we might assume that there is one discharge summary per admission, but we should check this. We can check it with an assert statement, which ends up failing.
At this point, you might want to investigate why there are multiple summaries, but for simplicity let's just use the last one
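Keeping only the last summary per admission can be sketched like this (toy data; I assume here that ROW_ID reflects charting order, which you should verify on the real table):

```python
import pandas as pd

# toy discharge summaries: admission 100 has two (e.g. a summary plus an addendum)
df_notes_dis_sum = pd.DataFrame({
    'HADM_ID': [100, 100, 101],
    'ROW_ID': [1, 2, 3],
    'TEXT': ['first summary', 'last summary', 'only summary'],
})

# sort in charting order, then keep the last row for each admission
df_notes_dis_sum_last = (df_notes_dis_sum
                         .sort_values(['HADM_ID', 'ROW_ID'])
                         .drop_duplicates('HADM_ID', keep='last'))
```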
assert df_notes_dis_sum_last.duplicated(['HADM_ID']).sum() == 0, 'Multiple discharge summaries per admission'
Now we are ready to merge the admissions and notes tables. I use a left merge to account for admissions with missing notes. A merge can silently produce multiple rows per key (although we dealt with that above), so I like to add assert statements after a merge. I then checked which admissions were missing discharge summaries and discovered that 53% of the NEWBORN admissions were missing them vs ~4% for the other admission types. At this point I decided to remove the NEWBORN admissions. Most likely, the discharge summaries for these NEWBORN admissions are stored outside of the MIMIC dataset.
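The merge and the assert described above can be sketched as follows (toy tables; df_adm_notes is my name for the merged result):

```python
import pandas as pd

# toy admissions and last-discharge-summary tables
df_adm = pd.DataFrame({'HADM_ID': [100, 101, 102],
                       'ADMISSION_TYPE': ['EMERGENCY', 'NEWBORN', 'EMERGENCY']})
df_notes_last = pd.DataFrame({'HADM_ID': [100, 102],
                              'TEXT': ['summary a', 'summary b']})

# left merge keeps every admission; TEXT is NaN where the summary is missing
df_adm_notes = pd.merge(df_adm, df_notes_last, on='HADM_ID', how='left')
assert len(df_adm_notes) == len(df_adm), 'Number of rows changed after merge'

# drop the NEWBORN admissions, which mostly lack discharge summaries
df_adm_notes = df_adm_notes[df_adm_notes.ADMISSION_TYPE != 'NEWBORN']
```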
For this problem, we are going to classify if a patient will be admitted in the next 30 days. Therefore, we need to create a variable with the output label (1 = readmitted, 0 = not readmitted)
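A minimal sketch of the label construction, assuming a NEXT_ADMITTIME column from the earlier step (toy dates; a missing next admission yields label 0 because NaN comparisons are False):

```python
import pandas as pd

# toy merged table: one readmission after 15 days, one with no readmission
df = pd.DataFrame({
    'DISCHTIME': pd.to_datetime(['2150-01-10', '2150-03-01']),
    'NEXT_ADMITTIME': pd.to_datetime(['2150-01-25', pd.NaT]),
})

# days between discharge and the next admission
df['DAYS_NEXT_ADMIT'] = (df.NEXT_ADMITTIME - df.DISCHTIME).dt.total_seconds() / (24 * 60 * 60)

# 1 = readmitted within 30 days, 0 = not readmitted
df['OUTPUT_LABEL'] = (df.DAYS_NEXT_ADMIT < 30).astype('int')
```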
A quick count of positive and negative labels results in 3,004 positive samples and 48,109 negative samples. This indicates that we have an imbalanced dataset, which is a common occurrence in healthcare data science.
The last step to prepare our data is to split the data into training, validation and test sets. For reproducible results, I have made the random_state always 42.
Since the prevalence is so low, we want to prevent the model from always predicting negative (not re-admitted). To do this, we have a few options to balance the training data
sub-sampling the negatives
over-sampling the positives
create synthetic data (e.g. SMOTE)
To keep the memory requirements modest, we will sub-sample the negatives, but I encourage you to try out the other techniques if your computer or server can handle it, to see if you can get an improvement. (Post a comment below if you try this out!)
# split the training data into positive and negative
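Putting the split and the negative sub-sampling together, here is a sketch with toy data (sizes and column names are placeholders; the notebook operates on the merged admissions/notes table):

```python
import pandas as pd

# toy stand-in for the merged admissions/notes table
df = pd.DataFrame({'TEXT': ['note %d' % i for i in range(10)],
                   'OUTPUT_LABEL': [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]})

# shuffle, then carve off 30% of the rows for validation and test (15% each)
df = df.sample(n=len(df), random_state=42).reset_index(drop=True)
df_valid_test = df.sample(frac=0.30, random_state=42)
df_test = df_valid_test.sample(frac=0.5, random_state=42)
df_valid = df_valid_test.drop(df_test.index)
df_train_all = df.drop(df_valid_test.index)

# split the training data into positive and negative
rows_pos = df_train_all.OUTPUT_LABEL == 1
df_train_pos = df_train_all[rows_pos]
df_train_neg = df_train_all[~rows_pos]

# sub-sample the negatives to match the positives, then shuffle
df_train = pd.concat([df_train_pos,
                      df_train_neg.sample(n=len(df_train_pos), random_state=42)],
                     axis=0)
df_train = df_train.sample(n=len(df_train), random_state=42).reset_index(drop=True)
```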
Step 2: Preprocess the unstructured notes using a bag-of-words approach
Now that we have created data sets that have a label and the notes, we need to preprocess our text data to convert it to something useful (i.e. numbers) for the machine learning model. We are going to use the Bag-of-Words (BOW) approach.
BOW basically breaks up the note into individual words and counts how many times each word occurs. Your numerical data then becomes counts over some chosen set of words. BOW is the simplest way to do NLP classification. In most blog posts I have read, fancier techniques have a hard time beating BOW for NLP classification tasks.
In this process, there are a few choices that need to be made
how to preprocess the words
how to count the words
which words to use
There is no optimal choice for all NLP projects, so I recommend trying out a few options when building your own models.
You can do the preprocessing in two ways
modify the original dataframe TEXT column
preprocess as part of your pipeline so you don't edit the original data
I will show you how to do both, but I prefer the second one since it doesn't modify the original data, which took a lot of work to prepare.
Let's define a function that will modify the original dataframe by filling missing notes with a space and removing newline and carriage return characters
# This function preprocesses the text by filling not-a-number entries and
# replacing newline ('\n') and carriage return ('\r') characters
def preprocess_text(df):
    df.TEXT = df.TEXT.fillna(' ')
    df.TEXT = df.TEXT.str.replace('\n', ' ')
    df.TEXT = df.TEXT.str.replace('\r', ' ')
    return df
# preprocess the text to deal with known issues
df_train = preprocess_text(df_train)
df_valid = preprocess_text(df_valid)
df_test = preprocess_text(df_test)
The other option is to preprocess as part of the pipeline. This process consists of using a tokenizer and a vectorizer. The tokenizer breaks a single note into a list of words and a vectorizer takes a list of words and counts the words.
We will use word_tokenize from the nltk package for our default tokenizer, which basically breaks the note based on spaces and some punctuation. An example is shown below:
from nltk import word_tokenize
word_tokenize('This should be tokenized. 02/02/2018 sentence has stars**')
Additional things you can do would be to lemmatize or stem the words, but that is more advanced so I'll skip that.
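The later code refers to a custom tokenizer called tokenizer_better, which is not defined above. Here is one plausible sketch, an assumption about its behavior: lowercase the note, replace punctuation and digits with spaces, and split on whitespace (the real version might call nltk's word_tokenize instead):

```python
import string

def tokenizer_better(text):
    # map punctuation and digits to spaces, lowercase, then split on whitespace
    punc_and_digits = string.punctuation + '0123456789'
    table = str.maketrans(dict.fromkeys(punc_and_digits, ' '))
    return text.lower().translate(table).split()

print(tokenizer_better('This should be tokenized. 02/02/2018 sentence has stars**'))
# ['this', 'should', 'be', 'tokenized', 'sentence', 'has', 'stars']
```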
Now that we have a way to convert free text into tokens, we need a way to count the tokens for each discharge summary. We will use the built-in CountVectorizer from the scikit-learn package. This vectorizer simply counts how many times each word occurs in the note. There is also a TfidfVectorizer, which takes into account how often words are used across all notes, but for this project let's use the simpler one (I got similar results with the second one too).
As an example, let's say we have 3 notes
sample_text = ['Data science is about the data', 'The science is amazing', 'Predictive modeling is part of data science']
Essentially, you fit the CountVectorizer to learn the words in your data and then transform your data to create counts for each word.
from sklearn.feature_extraction.text import CountVectorizer
We can now fit our CountVectorizer on the clinical notes. It is important to use only the training data because you don't want to include any new words that show up in the validation and test sets. There is a hyperparameter called max_features which you can set to constrain how many words are included in the Vectorizer. This will use the top N most frequently used words. In step 5, we will adjust this to see its effect.
# fit our vectorizer. This will take a while depending on your computer.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features = 3000, tokenizer = tokenizer_better)
vect.fit(df_train.TEXT.values)
We can look at the most frequently used words, and we will see that many of these words might not add any value for our model. These words are called stop words, and we can remove them easily (if we want) with the CountVectorizer. There are lists of common stop words for different NLP corpora, but we will just make up our own based on the most frequent words in our notes.
from sklearn.feature_extraction.text import CountVectorizer
# make up your own stop-word list from the frequent words, for example:
my_stop_words = ['the', 'and', 'to', 'of', 'was', 'with', 'a', 'on', 'in', 'for']
vect = CountVectorizer(max_features = 3000,
                       tokenizer = tokenizer_better,
                       stop_words = my_stop_words)
# this could take a while
vect.fit(df_train.TEXT.values)
Now we can transform our notes into numerical matrices. At this point, I will only use the training and validation data so I'm not tempted to see how it works on the test data yet.
X_train_tf = vect.transform(df_train.TEXT.values)
X_valid_tf = vect.transform(df_valid.TEXT.values)
We also need our output labels as separate variables
y_train = df_train.OUTPUT_LABEL
y_valid = df_valid.OUTPUT_LABEL
As seen by the location of the scroll bar... as always, it takes 80% of the time to get the data ready for the predictive model.
Step 3: Build a simple predictive model
We can now build a simple predictive model that takes our bag-of-words inputs and predicts if a patient will be readmitted in 30 days (YES = 1, NO = 0).
Here we will use the logistic regression model. Logistic regression is a good baseline model for NLP tasks since it works well with sparse matrices and is interpretable. We have a few additional choices (called hyperparameters), including C, which is the inverse of the regularization strength, and penalty, which tells how to measure the regularization. Regularization is essentially a technique to try to minimize overfitting.
# logistic regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C = 0.0001, penalty = 'l2', random_state = 42)
clf.fit(X_train_tf, y_train)
Step 4: Assess the quality of your model
At this point, we need to measure how well our model performed. There are a few different data science performance metrics. I wrote another blog post explaining these in detail if you are interested. Since this post is quite long now, I will just show results and figures. You can see the GitHub account for the code to produce the figures.
For a threshold of 0.5 for predicting positive, we get the following performance
With the current selection of hyperparameters, we do have some overfitting. One thing to point out is that the major difference between the precision in the two sets of data is due to the fact that we balanced the training set, whereas the validation set retains the original distribution. Currently, if we make a list of patients predicted to be readmitted, we catch twice as many of them as if we randomly selected patients (PRECISION vs PREVALENCE).
Another performance metric not shown above is AUC or area under the ROC curve. The ROC curve for our current model is shown below. Essentially the ROC curve allows you to see the trade-off between true positive rate and false positive rate as you vary the threshold on what you define as predicted positive vs predicted negative.
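Both the AUC and the points on the ROC curve come from scikit-learn. Here is a sketch with toy labels and predicted probabilities (with the real model the inputs would be y_valid and clf.predict_proba(X_valid_tf)[:, 1]):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# toy labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print('AUC: %.2f' % auc)   # AUC: 0.75
```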
Step 5: Next steps for improving the model
At this point, you might be tempted to calculate the performance on your test set and see how you did. But wait! We made many choices (a few below) which we could change and see if there is an improvement:
should we spend time getting more data?
how to tokenize - should we use stemming?
how to vectorize - should we change the number of words?
how to regularize the logistic regression - should we change C or penalty?
which model to use?
When I am trying to improve my models, I read a lot of other blog posts and articles to see how people tackled similar issues. When you do this, you start to see interesting ways to visualize data and I highly recommend holding on to these techniques for your own projects.
For NLP projects that use BOW+logistic regression, we can plot the most important words to see if we can gain any insight. For this step, I borrowed code from a nice NLP article by Insight Data Science. When you look at the most important words, I see two immediate things:
Oops! I forgot to exclude the patients who died, since 'expired' showed up in the negative list. For now, I will ignore this and fix it in Step 6 below.
There are also some other stop words we should probably remove ('should','if','it','been','who','during', 'x')
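Plotting aside, the word ranking itself comes from the logistic regression coefficients. A sketch with a tiny made-up vocabulary (with the real model you would use vect.get_feature_names_out() and clf.coef_[0] instead):

```python
import numpy as np

# made-up vocabulary and coefficients standing in for the fitted model
feature_names = np.array(['discharge', 'expired', 'stable', 'failure'])
coefs = np.array([0.1, -2.0, -0.5, 1.5])

# sort by coefficient: most negative first, most positive last
order = np.argsort(coefs)
print('most negative words:', feature_names[order[:2]].tolist())
print('most positive words:', feature_names[order[-2:]].tolist())
```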
When we want to improve the model, we want to do it in a data-driven manner. You can spend a lot of time on 'hunches' that don't end up panning out. To do this, it is recommended to pick a single performance metric that you use to make your decisions. For this project, I am going to pick AUC.
For the first question above, we can plot something called a Learning Curve to understand the effect of adding more data. Andrew Ng has a set of great Coursera classes on the discussion of high-bias vs high-variance models.
We can see that we have some overfitting but adding more data is likely not going to drastically change the AUC on a validation set. This is good to know because it means we shouldn't spend months getting more data.
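scikit-learn can compute a learning curve directly. Here is a sketch on synthetic data (with the real project you would pass the vectorized notes and labels instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# synthetic data standing in for the vectorized notes
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=[0.3, 0.6, 1.0], cv=5, scoring='roc_auc')

# a persistent gap between the two mean curves indicates high variance
print(train_scores.mean(axis=1))
print(valid_scores.mean(axis=1))
```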
A simple thing we can do is look at the effect of some of our hyperparameters (max_features and C). We could run a grid search, but since we only have two parameters here, we can look at them separately and see their effects.
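Examining each hyperparameter separately can be done with a simple loop over values, tracking the chosen metric. A sketch on a toy corpus (with the real data you would score on the validation set, not the training set as here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# toy corpus and labels standing in for the discharge summaries
texts = ['heart failure readmit', 'stable discharge home', 'failure worsening',
         'routine follow up', 'acute failure again', 'doing well at home'] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

vect = CountVectorizer(max_features=3000)
X = vect.fit_transform(texts)

# sweep over the regularization strength, recording AUC for each value
aucs = {}
for C in [0.0001, 0.01, 1.0]:
    clf = LogisticRegression(C=C).fit(X, labels)
    aucs[C] = roc_auc_score(labels, clf.predict_proba(X)[:, 1])
print(aucs)
```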
We can see that increasing C and max_features causes the model to overfit pretty quickly. I selected C = 0.0001 and max_features = 3000, where the validation set performance started to plateau.
At this point, you could try a few other things
change sub-sample to over-sample
add stemming or lemmatization to the tokenizer
test a few different scikit-learn models
concatenate all the notes instead of the last discharge summary
try a deep learning method such as LSTMs
review the discharge summaries that you are getting wrong
Step 6: Finalize your model and test it
We will now fit our final model with the selected hyperparameters. We will also exclude the patients who died and re-balance the training data.
This produces the following results and ROC curve.
Congratulations! You built a simple NLP model (AUC = 0.70) to predict re-admission based on hospital discharge summaries that is only slightly worse than the state-of-the-art deep learning method that uses all hospital data (AUC = 0.75). If you have any feedback, feel free to leave it below.
MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available at: http://www.nature.com/articles/sdata201635