Embarking on a journey through the lands of machine learning? Here are a few important lessons on feature engineering, model tuning, overfitting, ensembling and more, which you should keep in mind along the way.
I would like to share what I have learned about machine learning over the past few months:
At the end of last year I decided to look into machine learning (ML), and I have been taking small steps ever since. I wanted to understand what ML is all about and why it has created so much hype in the technology industry. Some articles suggested that I needed a good understanding of basic mathematics and statistics; others said I needed strong domain knowledge. Most of the basic algorithms and ML techniques have been around for many years, but they have gained a lot of momentum now. Why? Modern systems have enough computing power to run ML with ease, and data keeps growing exponentially every year, so far more of it is available to us. Both encourage us to build systems that can deliver better insights in real time.
At first I looked for a few good books to start with, but that did not work well for me because all of these concepts were new to me. With the arrival of MOOC platforms covering data science and ML, I decided to try a few courses that could give me a good introduction and help me understand what actually happens behind the ML algorithms. I came across Andrew Ng's ML course on Coursera, which was the kind of material I had expected to find in a textbook, and it proved to be the right platform for me to get started. Within a few weeks the course became intensive, going deep into the mathematical equations; any newcomer needs ample time to keep up, understand and digest all of the information. Through this course I was introduced to the Octave programming language and enjoyed implementing a few ML algorithms for my homework assignments. The course also gave good pointers on how to build effective models. By the end I realized that I had covered the basics of ML, but that the industry is far ahead in the variety of ML techniques it uses, and that for implementation it favors Python (scikit-learn, pandas) and R.
Then I came across the Analytics Edge MOOC on edX from the MIT team and took it. This course is unique in that it explains ML techniques through real use cases. It served as a very good practical orientation to ML techniques, with a good introduction to R. Another major benefit is that it introduces the Kaggle platform, where I can compete with other ML practitioners from around the world on real problem scenarios. I was able to move through this course easily because I had a foundation from my earlier ML course. As part of an assignment I had to compete in a private Kaggle competition, which I did. By the end of the course I had a better handle on R and on implementing ML algorithms in R, and I felt confident about competing on Kaggle with other ML practitioners. I had also gained confidence that I could learn and implement any new ML technique in a short timeframe. I felt ready for Kaggle.
As a next step, I entered one of the public Kaggle competitions (Liberty Mutual) with the goal of finishing within the top 1000; almost 2.2K people competed. I started by building models with Random Forest. Then I learned about the XGBoost technique from the forum, implemented it and fitted a few models with it. I finished at around rank 854, which was in line with my goal.
Here are a few key focus areas for building a machine learning skill set that I came across over these few months:
Feature Engineering - Features are the basic building blocks of most of the models you build. Features are simply the attributes in your data samples (training and test datasets) that you provide to the ML component. Feature engineering involves a lot of techniques, and I think it would take a separate MOOC to explain how to do it properly. Machine learning still needs a human to choose the right features for the use case at hand, and this is where domain knowledge plays a key role. A lot of effort goes into collecting the feature data, correcting it, choosing the right features for your model, introducing new features, filling in missing values, changing feature classes and normalizing existing features across the entire dataset. Whenever you see a feature engineering script, grab it and store it in your code repository; you never know when you will need those code snippets. Also keep in mind that too many features can hurt overall accuracy, add noise and slow down your model build process.
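As a small illustration of two of the steps above, filling in missing values and normalizing a feature, here is a minimal sketch in plain Python. The function names and the `ages` data are made up for this example, and mean imputation with min-max scaling is only one of many possible choices; in practice you would more likely reach for pandas or scikit-learn.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one missing entry.
ages = [20, None, 40, 60]
filled = fill_missing(ages)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

Whatever statistics you compute here (the mean, the minimum and the maximum) should come from the training set only and then be reused on the test set, otherwise information from the test data leaks into the model.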
Model tuning - You can easily get hold of a basic implementation of any ML algorithm in R or Python, but tuning the model is another critical factor. Understanding each model parameter, and how to adjust it, makes a significant difference to accuracy. This requires a good, in-depth understanding of the ML algorithm. There are a lot of YouTube videos and white papers on many models that you can rely on.
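One common way to explore model parameters systematically is a grid search: try every combination from a list of candidate values and keep the one with the best validation score. The sketch below is a plain-Python illustration of that idea. `toy_score` is a hypothetical stand-in for "train a model with these parameters and return its validation score", and the parameter names `depth` and `learning_rate` are chosen only for the example.

```python
from itertools import product


def grid_search(train_and_score, param_grid):
    """Try every parameter combination; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(depth, learning_rate):
    """Hypothetical stand-in for training a model and scoring it on
    validation data. It simply peaks at depth=4, learning_rate=0.1."""
    return -((depth - 4) ** 2) - 100 * (learning_rate - 0.1) ** 2


grid = {"depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

The number of combinations grows quickly with every parameter you add, which is one reason why understanding what each parameter does helps: it lets you keep the grid small.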
Avoid Overfitting - This is one of the most common errors that can hurt your overall model accuracy. When you train your model on the training set, a model that is too specific to that input data is overfitted: it will not predict well on new test data or future data. When overfitting occurs, your model's accuracy on a new set of data drops significantly. Whenever you train a model, always check whether overfitting has occurred.
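A simple way to check for overfitting is to hold back some data the model never sees during training and compare the accuracy on both sets. The sketch below uses a 1-nearest-neighbour classifier, which memorizes its training data, on a tiny made-up dataset. The data, the noisy labels and the function names are all assumptions for illustration, and the validation points were deliberately placed near the noisy labels so that the effect is easy to see.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Fraction of points in data that the 1-NN model labels correctly."""
    hits = sum(1 for x, y in data if predict_1nn(train, x) == y)
    return hits / len(data)


# Made-up data. The true rule is: label 1 when x >= 5.
# The training labels at x=3 and x=7 are noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
# Held-out points, labelled by the true rule.
valid = [(0.5, 0), (2.6, 0), (3.4, 0), (6.6, 1), (7.4, 1), (9.5, 1)]

train_acc = accuracy(train, train)
valid_acc = accuracy(train, valid)
print(f"training accuracy:   {train_acc:.2f}")
print(f"validation accuracy: {valid_acc:.2f}")
```

The model scores perfectly on the data it memorized, noise included, and much worse on the held-out points. A large gap between the two numbers is the warning sign to look for.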
Getting a handle on various ML techniques - You will learn the basic ML techniques in any MOOC, but many more techniques are available, and MOOCs are not designed to cover every new one. So you need to spend some quality time learning new ML techniques on your own: how they are implemented in R, Python or Java, what parameters are passed to the model, and so on.
Model Ensemble Techniques - This is a hot topic for increasing model accuracy. In short, you build models using various ML techniques and samplings of the data, and base your final prediction on a combination of the results from multiple models. For example, you could average the results from several models, assign weights to the results from different models, or take a majority vote. You can also look into stacking and blending, which are ensemble techniques as well. Ensemble techniques have been one of the critical success factors for winning Kaggle competitions.
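The three combination rules mentioned above (plain averaging, weighted averaging and majority vote) can be sketched in a few lines of plain Python. The model outputs below are invented numbers and labels, and the weights are arbitrary; in a real competition you would pick the weights based on each model's validation score.

```python
from collections import Counter


def average(predictions):
    """Element-wise mean of several models' numeric predictions."""
    return [sum(col) / len(col) for col in zip(*predictions)]


def weighted_average(predictions, weights):
    """Element-wise weighted mean; a larger weight means more trust."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, col)) / total
        for col in zip(*predictions)
    ]


def majority_vote(predictions):
    """For each sample, return the class label most models agree on."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]


# Invented regression outputs from three models for three samples.
model_a = [10.0, 20.0, 30.0]
model_b = [12.0, 18.0, 33.0]
model_c = [14.0, 22.0, 27.0]
avg = average([model_a, model_b, model_c])
wavg = weighted_average([model_a, model_b, model_c], [2, 1, 1])

# Invented class labels from three classifiers for the same samples.
votes = majority_vote([
    ["cat", "dog", "dog"],
    ["cat", "cat", "dog"],
    ["dog", "cat", "dog"],
])
print(avg, wavg, votes)
```

Stacking and blending go one step further: instead of a fixed rule like these, a second model learns how to combine the first-level predictions.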
ML Implementation platforms - Multiple platforms are available for implementing ML, including R, Python, Java and the Spark ML libraries. You can start with whichever platform you are comfortable with. Most of the common code repositories are based on Python and R. Computing requirements also play a role in training a model: several times I have seen my machine crash over multiple training iterations because of memory constraints. There are several ways to overcome computing limitations; for example, you can use cloud-based computing resources to train your model on large datasets. The Azure ML platform is worth a look. But for a starter, basic computing resources are good enough.
Visualization (Data/Model) - Visualizing the base data and the model is another critical area. It gives a holistic view of how well your data is organized, how your model performs, and so on. R has a lot of plotting functions to start with. Get familiar with the available plotting techniques and use them whenever necessary.
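Even without a plotting library, a quick look at how a feature is distributed can reveal skew or outliers. The sketch below draws a rough text histogram in plain Python. The `prices` data and the bin width are made up, and this is only a stand-in for the proper plotting functions that R or Python libraries provide.

```python
def histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = {}
    for v in values:
        edge = (v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


def render(counts):
    """Draw one line per non-empty bin, one '#' per observation."""
    return "\n".join(f"{edge:>4} | {'#' * n}" for edge, n in counts.items())


# Made-up feature values with one outlier.
prices = [3, 7, 12, 14, 15, 18, 21, 24, 26, 48]
counts = histogram(prices, 10)
print(render(counts))
```

The lone value in the highest bin stands out immediately, which is the kind of thing you want to notice before you start building models.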
Work on real projects - This gives you a practitioner's view of deploying ML techniques. You can make use of the Kaggle platform, which already has a lot of use cases to start with.
Going forward, I plan to spend more time on feature engineering, get a handle on more ML techniques and tools (such as GBM, H2O, AdaBoost, Gini scoring and Vowpal Wabbit), increase my competency with R, explore Python libraries, stay active on Kaggle and look into stacking and blending techniques. Since I work in the retail industry, I also plan to spend more time finding relevant retail use cases; for instance, there could be many opportunities in demand forecasting, assortment planning, pricing engines, e-commerce, sales analytics and workforce scheduling.
Another puzzle to solve is how to integrate the ML component (the model you developed) seamlessly into your existing or new application architecture, and how to enable a continuous model build process within that architecture as new data arrives.