I participated in the Santander Customer Satisfaction challenge, which ran on Kaggle for two months, and finished in the top 1%. Here I discuss my approach to the problem.
Customer satisfaction is one of the prime goals of every company. From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.
Santander Bank floated a problem statement to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.
In this competition, the data consisted of several hundred anonymized features used to predict whether a customer is satisfied or dissatisfied with their banking experience. The "TARGET" column is the variable to predict. It equals 1 for unsatisfied customers and 0 for satisfied customers.
Let's peek into the dataset:
You will find that unhappy customers make up a little under 4% of the data, which means the dataset is highly imbalanced. Handling this imbalance is another problem that has to be tackled in this challenge.
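As a rough illustration of how the imbalance can be measured, here is a minimal sketch. The label list is a made-up stand-in for the TARGET column, not the competition data. The ratio it computes is the value commonly passed to XGBoost's scale_pos_weight parameter to up-weight the rare class; that is one common option and not necessarily what was used in this solution.

```python
from collections import Counter

# Hypothetical stand-in for the TARGET column: 4 unhappy customers out of 100.
target = [1] * 4 + [0] * 96

counts = Counter(target)

# Share of unhappy customers (label 1) in the data.
positive_rate = counts[1] / len(target)

# Negatives divided by positives; XGBoost accepts this ratio as
# `scale_pos_weight` to give the rare positive class more weight.
scale_pos_weight = counts[0] / counts[1]

print(f"unhappy share: {positive_rate:.1%}")
print(f"scale_pos_weight: {scale_pos_weight:.1f}")
```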
Not much preprocessing was done. Non-numeric data is converted to numeric using one-hot encoding, and NaN values are filled with the mean of the series.
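The two preprocessing steps could look roughly like the sketch below. The toy DataFrame and its column names ("age", "segment") are invented for illustration, since the real columns are anonymized, and this is not the exact code used in the competition.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real competition columns are anonymized.
df = pd.DataFrame({
    "age": [23.0, np.nan, 41.0],
    "segment": ["a", "b", "a"],
    "TARGET": [0, 1, 0],
})

# Keep the label out of the feature matrix.
features = df.drop(columns=["TARGET"])

# One-hot encode non-numeric columns; numeric columns pass through unchanged.
features = pd.get_dummies(features)

# Fill each remaining NaN with the mean of its own column.
features = features.fillna(features.mean())

print(features)
```

When there is a separate test set, the column means would normally be computed on the training data and reused for the test data, so that both are filled with the same values.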
I used XGBoost. XGBoost is the leading model for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to data like images and videos). XGBoost models dominate many Kaggle competitions.
XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm.
The implementation is designed to be efficient in both compute time and memory. A design goal was to make the best use of available resources to train the model. Some key implementation features include:
Sparse awareness: the algorithm is designed to handle sparse data efficiently.
Block structure: the data is stored in blocks, which helps parallelize tree construction.
Why do I use XGBoost?
XGBoost is fast.
It handles sparse data well.
XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems.
XGBoost has a few parameters that can significantly change the accuracy and speed of your model.
n_estimators and early_stopping_rounds
n_estimators specifies how many boosting rounds to run, which is also the number of trees in the model. Too large a value can cause overfitting: accurate predictions on the training data but inaccurate predictions on new data, which means the model is memorizing the training set instead of generalizing. This happened to many participants in this challenge, whose rankings during the contest and in the final standings differed hugely. You can experiment with your dataset to find the ideal value. Typical values range from 100 to 1000, though this depends a lot on the learning rate discussed below.
The argument early_stopping_rounds offers a way to automatically find the ideal value. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for n_estimators. It's smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.
In general, a small learning rate (and large number of estimators) will yield more accurate XGBoost models, though it will also take the model longer to train since it does more iterations through the cycle.
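To get a feel for why a smaller learning rate needs more estimators, here is an idealized sketch in plain Python. It assumes every round's tree predicts the current residual exactly, which real trees do not, and it only illustrates the training-cost side of the trade-off, not the accuracy gain. The function name rounds_needed and the tolerance are made up for this example.

```python
def rounds_needed(learning_rate, tolerance=0.01):
    """Count boosting rounds until the remaining residual drops below tolerance.

    Idealized setting: each round's tree predicts the current residual exactly,
    and the model adds learning_rate times that prediction.
    """
    residual = 1.0
    rounds = 0
    while residual >= tolerance:
        # The model corrects only a learning_rate fraction of the residual.
        residual -= learning_rate * residual
        rounds += 1
    return rounds


for lr in (0.3, 0.1, 0.01):
    print(f"learning rate {lr}: {rounds_needed(lr)} rounds")
```

In this toy setting, learning rates of 0.3, 0.1 and 0.01 need 13, 44 and 459 rounds to shrink the residual below 1% of its starting value.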
My final score was 0.840968, which earned me 48th place out of 5123 participating teams.