This is a story about the danger of interpreting your machine learning model incorrectly, and the value of interpreting it correctly. If you have found the robust accuracy of ensemble tree models such as gradient boosting machines or random forests attractive, but also need to interpret them, then I hope you find this informative and helpful.

Imagine we are tasked with predicting a person's financial status for a bank. The more accurate our model, the more money the bank makes, but since this prediction is used for loan applications we are also legally required to provide an explanation for why a prediction was made. After experimenting with several model types, we find that gradient boosted trees as implemented in XGBoost give the best accuracy. Unfortunately, explaining why XGBoost made a prediction seems hard, so we are left with the choice of retreating to a linear model, or figuring out how to interpret our XGBoost model. No data scientist wants to give up on accuracy...so we decide to attempt the latter, and interpret the complex XGBoost model (which happens to have 1,247 depth 6 trees).

Classic global feature importance measures

The first obvious choice is to use the plot_importance() method in the Python XGBoost interface. It gives an attractively simple bar chart representing the importance of each feature in our dataset (code to reproduce this article is in a Jupyter notebook):

Results of running xgboost.plot_importance(model) for a model trained to predict if people will report over $50k of income from the classic "adult" census dataset (using a logistic loss).

If we look at the feature importances returned by XGBoost we see that age dominates the other features, clearly standing out as the most important predictor of income. We could stop here and report to our manager the intuitively satisfying answer that age is the most important feature, followed by hours worked per week and education level. But being good data scientists...we take a look at the docs and see there are three options for measuring feature importance in XGBoost:

1. Weight. The number of times a feature is used to split the data across all trees.

2. Cover. The number of times a feature is used to split the data across all trees weighted by the number of training data points that go through those splits.

3. Gain. The average training loss reduction gained when using a feature for splitting.

These are typical importance measures that we might find in any tree-based modeling package. Weight was the default option so we decide to give the other two approaches a try to see if they make a difference:

Results of running xgboost.plot_importance with both importance_type="cover" and importance_type="gain".

To our dismay we see that the feature importance orderings are very different for each of the three options provided by XGBoost! For the cover method it seems like the capital gain feature is most predictive of income, while for the gain method the relationship status feature dominates all the others. This should make us very uncomfortable about relying on these measures for reporting feature importance without knowing which method is best.
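The disagreement is easy to reproduce in a toy setting. The sketch below applies the three definitions listed above to a handful of hypothetical split records. The feature names echo the income model, but the counts and loss reductions are invented for illustration, and XGBoost's own bookkeeping may differ in detail.

```python
from collections import defaultdict

# Hypothetical split records from a toy ensemble:
# (feature, training rows reaching the split, loss reduction at the split).
# These numbers are invented; they are not taken from the income model.
SPLITS = [
    ("age", 100, 1.0),
    ("age", 80, 0.5),
    ("age", 60, 0.5),
    ("capital_gain", 1000, 2.0),
    ("relationship", 300, 9.0),
]


def importances(splits):
    """Return (weight, cover, gain) dictionaries keyed by feature name."""
    weight = defaultdict(int)      # number of splits that use the feature
    cover = defaultdict(int)       # splits weighted by the rows passing through them
    gain_sum = defaultdict(float)  # total loss reduction credited to the feature
    for feature, n_rows, loss_drop in splits:
        weight[feature] += 1
        cover[feature] += n_rows
        gain_sum[feature] += loss_drop
    # Gain is the average loss reduction per split on the feature.
    gain = {f: gain_sum[f] / weight[f] for f in weight}
    return dict(weight), dict(cover), gain


def top_feature(scores):
    """Name of the feature with the largest score."""
    return max(scores, key=scores.get)


weight, cover, gain = importances(SPLITS)
print(top_feature(weight), top_feature(cover), top_feature(gain))
```

With these made-up records, weight ranks age first, cover ranks capital_gain first, and gain ranks relationship first. A feature that is split on often, a feature that is split on near the root where most rows pass, and a feature whose few splits remove a lot of loss can each come out on top, depending on the measure.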

What makes a measure of feature importance good or bad?

It is not obvious how to compare one feature attribution method to another. We could measure end-user performance for each method on tasks such as data-cleaning, bias detection, etc. But these tasks are only indirect measures of the quality of a feature attribution method. Here, we will instead define two properties that we think any good feature attribution method should follow:

1. Consistency. Whenever we change a model such that it relies more on a feature, then the attributed importance for that feature should not decrease.

2. Accuracy. The feature importances should sum to the total importance of the model. (For example, if importance is measured by the R² value, then the attributions to the features should sum to the R² of the full model.)

If consistency fails to hold, then we can't compare the attributed feature importances between any two models, because then having a higher assigned attribution doesn't mean the model actually relies more on that feature.

If accuracy fails to hold then we don't know how the attributions of each feature combine to represent the output of the whole model. We can't just normalize the attributions after the method is done since this might break the consistency of the method.

Are current attribution methods consistent and accurate?

Back to our work as bank data scientists...we realize that consistency and accuracy are important to us. In fact if a method is not consistent we have no guarantee that the feature with the highest attribution is actually the most important. So we decide to check the consistency of each method using two very simple tree models that are unrelated to our task at the bank:

Simple tree models over two features. Cough is clearly more important in model B than model A.

The output of the models is a risk score based on a person's symptoms. Model A is just a simple "and" function for the binary features fever and cough. Model B is the same function but with +10 whenever cough is yes. To check consistency we must define "importance". Here we will define importance two ways: 1) as the change in the model's expected accuracy when we remove a set of features. 2) as the change in the model's expected output when we remove a set of features.

The first definition of importance measures the global impact of features on the model, while the second definition measures the individualized impact of features on a single prediction. In our simple tree models the cough feature is clearly more important in model B, both for global importance and for the importance of the individual prediction when both fever and cough are yes.

The weight, cover, and gain methods above are all global feature attribution methods. But when we deploy our model in the bank we will also need individualized explanations for each customer. To check for consistency we run six different feature attribution methods on our simple tree models:

1. Tree SHAP. A new individualized method we are proposing.

2. Saabas. An individualized heuristic feature attribution method.

3. mean(|Tree SHAP|). A global attribution method based on the average magnitude of the individualized Tree SHAP attributions.

4. Gain. The same method used above in XGBoost, and also equivalent to the Gini importance measure used in scikit-learn tree models.

5. Split count. Represents both the closely related "weight" and "cover" methods in XGBoost, but is computed using the "weight" method.

6. Permutation. The resulting drop in accuracy of the model when a single feature is randomly permuted in the test data set.

Feature attributions for model A and model B using six different methods. As far as we can tell, these methods represent all the tree-specific feature attribution methods in the literature.

Apart from Tree SHAP, mean(|Tree SHAP|), and feature permutation, all of these methods are inconsistent! This is because they assign less importance to cough in model B than in model A. Inconsistent methods cannot be trusted to correctly assign more importance to the most influential features. The astute reader will notice that this inconsistency was already on display earlier when the classic feature attribution methods we examined contradicted each other on the same model. What about the accuracy property? It turns out Tree SHAP, Saabas, and Gain are all accurate as defined earlier, while feature permutation and split count are not.
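As a rough check on the permutation result, the sketch below runs an exhaustive version of feature permutation on the two toy models. It rests on a few assumptions: the four (fever, cough) combinations are equally likely, labels equal the model output, and "accuracy" is measured by mean squared error. The output of 80 for the "and" branch is inferred from the constant mean prediction of 20 quoted later in the article. The scale of the scores therefore need not match the figure.

```python
from itertools import permutations

# Four equally likely rows of (fever, cough).
ROWS = [(0, 0), (0, 1), (1, 0), (1, 1)]
FEVER, COUGH = 0, 1


def model_a(fever, cough):
    """Simple "and" function: 80 when both symptoms are present, else 0."""
    return 80 if fever and cough else 0


def model_b(fever, cough):
    """Same as model A, plus 10 whenever cough is yes."""
    return model_a(fever, cough) + (10 if cough else 0)


def permutation_importance(model, column):
    """Mean rise in MSE, averaged over every reordering of one feature column."""
    labels = [model(*row) for row in ROWS]  # labels match the model, so baseline MSE is 0
    values = [row[column] for row in ROWS]
    total, count = 0, 0
    for shuffled in permutations(values):
        for row, new_value, label in zip(ROWS, shuffled, labels):
            permuted = list(row)
            permuted[column] = new_value
            total += (model(*permuted) - label) ** 2
            count += 1
    return total / count


for name, model in (("A", model_a), ("B", model_b)):
    print(name, permutation_importance(model, FEVER), permutation_importance(model, COUGH))
```

Under these assumptions fever scores 1600 in both models, while cough rises from 1600 in model A to 2050 in model B, so the importance of cough does not decrease when the model relies on it more.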

It is perhaps surprising that such a widely used method as gain (Gini importance) can lead to such clearly inconsistent results. To better understand why this happens let's examine how gain gets computed for model A and model B. To make this simple we will assume that 25% of our data set falls into each leaf, and that the datasets for each model have labels that exactly match the output of the models.

If we consider mean squared error (MSE) as our loss function, then we start with an MSE of 1200 before doing any splits in model A. This is the error from the constant mean prediction of 20. After splitting on fever in model A the MSE drops to 800, so the gain method attributes this drop of 400 to the fever feature. Splitting again on the cough feature then leads to an MSE of 0, and the gain method attributes this drop of 800 to the cough feature. In model B the same process leads to an importance of 800 assigned to the fever feature and 625 to the cough feature:

Computation of the gain (aka. Gini importance) scores for model A and model B.
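The arithmetic above can be checked with a few lines of Python. This is a minimal sketch under the stated assumptions (25% of the data in each leaf, labels equal to the model output). The output of 80 for the "and" branch is inferred from the constant mean prediction of 20, and the split orders (fever first in model A, cough first in model B) follow the description of the two trees.

```python
# Four equally likely rows of (fever, cough), one per leaf.
ROWS = [(0, 0), (0, 1), (1, 0), (1, 1)]
NAMES = ("fever", "cough")
FEVER, COUGH = 0, 1


def model_a(fever, cough):
    """Simple "and" function: 80 when both symptoms are present, else 0."""
    return 80 if fever and cough else 0


def model_b(fever, cough):
    """Same as model A, plus 10 whenever cough is yes."""
    return model_a(fever, cough) + (10 if cough else 0)


def mse(values):
    """Mean squared error of predicting the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def tree_mse(model, split_columns):
    """MSE of a tree that predicts the group mean after splitting on the given columns."""
    groups = {}
    for row in ROWS:
        key = tuple(row[c] for c in split_columns)
        groups.setdefault(key, []).append(model(*row))
    # Weight each group's MSE by its share of the rows.
    return sum(len(g) * mse(g) for g in groups.values()) / len(ROWS)


def gain(model, split_order):
    """Credit each feature with the MSE drop at the level where it is split on."""
    drops, used = {}, []
    before = tree_mse(model, used)
    for column in split_order:
        used.append(column)
        after = tree_mse(model, used)
        drops[NAMES[column]] = before - after
        before = after
    return drops


print(gain(model_a, [FEVER, COUGH]))  # model A splits on fever at the root
print(gain(model_b, [COUGH, FEVER]))  # model B splits on cough at the root
```

Model A goes from an MSE of 1200 to 800 to 0, giving fever 400 and cough 800. Model B starts at 1425, drops to 800 after the cough split, and then to 0, giving cough 625 and fever 800.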

Typically we expect features near the root of the tree to be more important than features split on near the leaves (since trees are constructed greedily). Yet the gain method is biased to attribute more importance to lower splits. This bias leads to an inconsistency: when cough becomes more important (and hence is split on at the root), its attributed importance actually drops. The individualized Saabas method (used by the treeinterpreter package) calculates differences in predictions as we descend the tree, and so it also suffers from the same bias towards splits lower in the tree. As trees get deeper, this bias only grows. In contrast the Tree SHAP method is mathematically equivalent to averaging differences in predictions over all possible orderings of the features, rather than just the ordering specified by their position in the tree.
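The difference between the two individualized methods can be made concrete on the toy models. The sketch below is a brute-force illustration, not the Tree SHAP algorithm itself: it credits each feature with the change in expected output as that feature becomes known, once in the tree's own split order (the Saabas-style path) and once averaged over every ordering (the Shapley value). It assumes four equally likely (fever, cough) rows and an output of 80 for the "and" branch, and the helper names are hypothetical.

```python
from itertools import permutations

# Four equally likely rows of (fever, cough).
ROWS = [(0, 0), (0, 1), (1, 0), (1, 1)]
NAMES = ("fever", "cough")
FEVER, COUGH = 0, 1


def model_a(fever, cough):
    """Simple "and" function: 80 when both symptoms are present, else 0."""
    return 80 if fever and cough else 0


def model_b(fever, cough):
    """Same as model A, plus 10 whenever cough is yes."""
    return model_a(fever, cough) + (10 if cough else 0)


def expected_output(model, x, known):
    """Average model output over the rows that agree with x on the known columns."""
    matches = [model(*row) for row in ROWS if all(row[c] == x[c] for c in known)]
    return sum(matches) / len(matches)


def path_attribution(model, x, order):
    """Credit each feature with the change in expected output as it becomes known."""
    phi, known = {}, []
    before = expected_output(model, x, known)
    for column in order:
        known.append(column)
        after = expected_output(model, x, known)
        phi[NAMES[column]] = after - before
        before = after
    return phi


def shapley(model, x):
    """Average the path attributions over every ordering of the features."""
    orders = list(permutations(range(len(NAMES))))
    phi = dict.fromkeys(NAMES, 0.0)
    for order in orders:
        for name, value in path_attribution(model, x, order).items():
            phi[name] += value / len(orders)
    return phi


x = (1, 1)  # the prediction where both fever and cough are yes
saabas_a = path_attribution(model_a, x, [FEVER, COUGH])  # model A splits on fever first
saabas_b = path_attribution(model_b, x, [COUGH, FEVER])  # model B splits on cough first
shap_a, shap_b = shapley(model_a, x), shapley(model_b, x)
print(saabas_a, saabas_b)
print(shap_a, shap_b)
```

For this prediction the path-based attribution to cough falls from 40 in model A to 25 in model B, while the ordering-averaged attribution rises from 30 to 35. In both cases the attributions sum to the gap between the prediction and the mean output (60 for model A, 65 for model B), which is the accuracy property.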

It is not a coincidence that only Tree SHAP is both consistent and accurate. Given that we want a method that is both consistent and accurate, it turns out there is only one way to allocate feature importances. The details are in our recent NIPS paper, but the summary is that a proof from game theory on the fair allocation of profits leads to a uniqueness result for feature attribution methods in machine learning. These unique values are called Shapley values, after Lloyd Shapley who derived them in the 1950s. The SHAP values we use here result from a unification of several individualized model interpretation methods connected to Shapley values. Tree SHAP is a fast algorithm that can exactly compute SHAP values for trees in polynomial time instead of the classical exponential runtime (see arXiv).

Interpreting our model with confidence

The combination of a solid theoretical justification and a fast practical algorithm makes SHAP values a powerful tool for confidently interpreting tree models such as XGBoost's gradient boosting machines. Armed with this new approach we return to the task of interpreting our bank XGBoost model:

The global mean(|Tree SHAP|) method applied to the income prediction model. The x-axis is essentially the average magnitude change in model output when a feature is "hidden" from the model (for this model the output has log-odds units). See papers for details, but "hidden" means integrating the variable out of the model. Since the impact of hiding a feature changes depending on what other features are also hidden, Shapley values are used to enforce consistency and accuracy.

We can see that the relationship feature is actually the most important, followed by the age feature. Since SHAP values have guaranteed consistency we don't need to worry about the kinds of contradictions we found before using the gain or split count methods. However, since we now have individualized explanations for every person, we can do more than just make a bar chart. We can plot the feature importance for every customer in our data set. The shap Python package makes this easy. We first call shap.TreeExplainer(model).shap_values(X) to explain every prediction, then call shap.summary_plot(shap_values, X) to plot these explanations:

Every customer has one dot on each row. The x position of the dot is the impact of that feature on the model's prediction for the customer, and the color of the dot represents the value of that feature for the customer. Dots that don't fit on the row pile up to show density (there are 32,561 customers in this example). Since the XGBoost model has a logistic loss the x-axis has units of log-odds (Tree SHAP explains the change in the margin output of the model).

The features are sorted by mean(|Tree SHAP|) and so we again see the relationship feature as the strongest predictor of making over $50K annually. By plotting the impact of a feature on every sample we can also see important outlier effects. For example, while capital gain is not the most important feature globally, it is by far the most important feature for a subset of customers. The coloring by feature value shows us patterns such as how being younger lowers your chance of making over $50K, while higher education increases your chance of making over $50K.

We could stop here and show this plot to our boss, but let's instead dig a bit deeper into some of these features. We can do that for the age feature by plotting the age SHAP values (changes in log odds) vs. the age feature values:

The y-axis is how much the age feature changes the log odds of making over $50K annually. The x-axis is the age of the customer. Each dot represents a single customer from the data set.

Here we see the clear impact of age on earning potential as captured by the XGBoost model. Note that unlike traditional partial dependence plots (which show the average model output when changing a feature's value) these SHAP dependence plots show interaction effects. Even though many people in the data set are 20 years old, how much their age impacts their prediction differs, as shown by the vertical dispersion of dots at age 20. This means other features are impacting the importance of age. To see what feature might be part of this effect we color the dots by the number of years of education and see that a high level of education lowers the effect of age in your 20s, but raises it in your 30s:

The y-axis is how much the age feature changes the log odds of making over $50K annually. The x-axis is the age of the customer. Education-Num is the number of years of education the customer has completed.

If we make another dependence plot for the number of hours worked per week we see that the benefit of working more plateaus at about 50 hrs/week, and working extra is less likely to indicate high earnings if you are married:

Hours worked per week vs. the impact of the number of hours worked on earning potential.

Interpreting your own model

This simple walk-through was meant to mirror the process you might go through when designing and deploying your own models. The shap package is easy to install through pip, and we hope it helps you explore your models with confidence. It includes more than what this article touched on, including SHAP interaction values, model agnostic SHAP value estimation, and additional visualizations. Notebooks are available that illustrate all these features on various interesting datasets. For example you can check out the top reasons you will die based on your health checkup in a notebook explaining an XGBoost model of mortality. For languages other than Python, Tree SHAP has also been merged directly into the core XGBoost and LightGBM packages.