Nand Kishor is the Product Manager of House of Bots. After finishing his studies in computer science, he ideated and re-launched the Real Estate Business Intelligence Tool, where he created one of the leading business intelligence tools for property price analysis in 2012. He also writes, researches, and shares knowledge about Artificial Intelligence (AI), Machine Learning (ML), Data Science, Big Data, the Python language, and related topics.
March Machine Learning Mania, 1st Place Winner's Interview: Andrew Landgraf
Kaggle's 2017 March Machine Learning Mania competition challenged Kagglers to do what millions of sports fans do every year: try to predict the winners and losers of the US men's college basketball tournament. In this winner's interview, 1st place winner Andrew Landgraf describes how he cleverly analyzed his competition to optimize his luck.
What made you decide to enter this competition?
I am interested in sports analytics and have followed the previous competitions on Kaggle. Reading last year's winner's interview, I realized that luck is a major component of winning this competition, just like all brackets. I wanted to see if there was a way of maximizing my luck. For example, when entering an office pool, your strategy depends on whether you are facing 5 Duke alumni or the entire office. My goal was to systematically optimize my submissions against the competition.
This competition is unique among Kaggle contests in that there is a history of submissions from previous years. My idea was to model not only the probability of each team winning each game, but also the competitors' submissions. Combining these models, I searched for the submission with the highest chance of finishing with a prize (top 5 on the leaderboard). A schematic of my approach is below. The three main processes are shaded in blue: (1) a model of the probability of winning each game, (2) a model of what the competitors are likely to submit, and (3) an optimization of my submission based on these two models.
While I believe this approach is generally worthwhile, a much simpler approach would have also won the competition, as discussed at the end.
What was your approach? Did past March Mania competitions inform your winning strategy?
I kept my models simple and probabilistic. To model the outcomes of each game, I used a method similar to that of previous winners One Shining MGF. I created my own team efficiency ratings using a regression model so that I could calculate the historical ratings before the tournament started. The ratings and a distance-from-home metric (more on this later) were used as covariates in a Bayesian logistic regression model (using the rstanarm package) to predict the outcomes of each game.
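To make the game model concrete, here is a minimal Python sketch of how a logistic model with posterior draws turns covariates into a win probability. It is not the author's code (the interview describes an R model fit with rstanarm). The single efficiency-difference covariate, the sign convention for the distance term, and every coefficient value are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def win_probability(eff_diff, dist_diff, coefs):
    """Probability that team A beats team B under a logistic model.

    eff_diff:  efficiency rating of A minus that of B (hypothetical scale)
    dist_diff: A's distance from home minus B's (hypothetical units)
    coefs:     (intercept, efficiency slope, distance slope)
    """
    intercept, b_eff, b_dist = coefs
    return sigmoid(intercept + b_eff * eff_diff + b_dist * dist_diff)


# Stand-in for posterior draws of the coefficients. A real Bayesian fit would
# produce these; here they are simulated around made-up values.
rng = random.Random(1)
draws = [
    (0.0, rng.gauss(0.15, 0.02), rng.gauss(-0.0002, 0.00005))
    for _ in range(1000)
]

# One hypothetical game: A is rated 7 points better and plays 1200 units closer to home.
probs = [win_probability(7.0, -1200.0, d) for d in draws]
posterior_mean = sum(probs) / len(probs)
print(f"posterior mean win probability: {posterior_mean:.3f}")
```

Keeping the whole list of per-draw probabilities, rather than only the mean, is what makes it possible to later sample candidate submissions from the posterior.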
To model competitors' submissions, I built a mixed effects model (with lme4) using data from the previous competitions. I used the logit of the submitted probability as the response, the team efficiencies as fixed effects, random intercepts for competitors and games, and random efficiency slopes for competitors. I guessed that there would be 500 competitors and that 400 of them would make 2 submissions, which wasn't too far off.
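The structure of that competitor model can be sketched as a simulation in Python. This is an illustration under assumptions, not the author's lme4 fit: the two teams' efficiencies are collapsed into one efficiency difference, and the fixed slope and all standard deviations in `params` are invented values.

```python
import math
import random


def logit(p):
    """Log-odds of a probability; the response scale of the model."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_submissions(eff_diffs, n_competitors, params, rng):
    """Simulate one predicted probability per competitor per game.

    params = (fixed slope, sd of competitor intercepts, sd of game intercepts,
              sd of competitor slopes, residual sd), all hypothetical.
    """
    beta, sd_comp, sd_game, sd_slope, sd_resid = params
    # Random intercept shared by everyone predicting the same game.
    game_effects = [rng.gauss(0.0, sd_game) for _ in eff_diffs]
    submissions = []
    for _ in range(n_competitors):
        comp_intercept = rng.gauss(0.0, sd_comp)  # this competitor's overall lean
        comp_slope = rng.gauss(0.0, sd_slope)     # how strongly they trust efficiency
        row = []
        for diff, game_effect in zip(eff_diffs, game_effects):
            eta = ((beta + comp_slope) * diff + comp_intercept
                   + game_effect + rng.gauss(0.0, sd_resid))
            row.append(inv_logit(eta))
        submissions.append(row)
    return submissions


rng = random.Random(42)
eff_diffs = [7.0, -3.0, 0.5]            # hypothetical efficiency differences
params = (0.15, 0.3, 0.2, 0.03, 0.25)   # hypothetical model parameters
subs = simulate_submissions(eff_diffs, 500, params, rng)
responses = [[logit(p) for p in row] for row in subs]  # back on the modeled scale
```

Calling `simulate_submissions` repeatedly with fresh random draws gives many plausible fields of competitors, which is the ingredient the leaderboard simulation needs.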
The plot below shows the models for the two Final Four semi-final games. The black lines are densities of 100 simulations from the mixed effects model and the orange line is the true distribution of competitors' predictions. They line up well for the SC vs. Gonzaga game and a little less so for the Oregon vs. UNC game. The posterior distribution from my model is much tighter than distributions from the competitors. My two submissions are the two vertical lines.
Finally, I used these models to come up with an optimal submission by simulating the bracket and the competitors' submissions 10,000 times. This essentially gave me 10,000 simulated leaderboards of the competitors, and my goal was to find the submission that most frequently showed up in the top 5 of the leaderboard. I tried to use a general-purpose optimizer, but it was very slow and gave poor results. Instead, I sampled pairs of probabilities from the posterior many times and chose the pair that was in the top 5 the most times. If I had naively used the posterior mean as a submission, my estimated probability of being in the top 5 would have been 15%, while my estimated probability for the optimized submission (with two entries) went up to 25%.
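The search described above can be sketched in a few lines of Python. This is a simplified illustration, not the author's procedure. It assumes a log loss metric with clipped probabilities, keeps one fixed set of rival submissions instead of re-simulating them each time, and draws candidates by adding noise to the win probabilities rather than sampling a real posterior. All numbers are made up.

```python
import math
import random


def log_loss(preds, outcomes, eps=1e-15):
    """Mean log loss; probabilities are clipped to avoid log(0) (an assumption)."""
    total = 0.0
    for p, y in zip(preds, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(preds)


def build_sims(win_probs, rivals, n_sims, rng):
    """Simulate tournaments and score every rival on each one."""
    sims = []
    for _ in range(n_sims):
        outcomes = [1 if rng.random() < p else 0 for p in win_probs]
        sims.append((outcomes, [log_loss(r, outcomes) for r in rivals]))
    return sims


def top_k_rate(entries, sims, k=5):
    """Fraction of simulations in which at least one entry places in the top k."""
    hits = 0
    for outcomes, rival_losses in sims:
        best = min(log_loss(e, outcomes) for e in entries)
        if sum(1 for loss in rival_losses if loss < best) < k:
            hits += 1
    return hits / len(sims)


def jitter(probs, sd, rng):
    """Perturb probabilities and keep them away from 0 and 1."""
    return [min(max(p + rng.gauss(0.0, sd), 0.02), 0.98) for p in probs]


rng = random.Random(0)
win_probs = [0.70, 0.55, 0.60]                       # hypothetical model probabilities
rivals = [jitter(win_probs, 0.15, rng) for _ in range(50)]
sims = build_sims(win_probs, rivals, 2000, rng)

candidates = [jitter(win_probs, 0.20, rng) for _ in range(30)]
candidates.append(list(win_probs))                   # the naive baseline
best = max(candidates, key=lambda c: top_k_rate([c], sims))
baseline_rate = top_k_rate([win_probs], sims)
best_rate = top_k_rate([best], sims)
```

Because `top_k_rate` accepts a list of entries, a pair of submissions can be scored the same way by passing both; the pair succeeds in a simulation if either entry lands in the top `k`.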
The competitors' submission model was trained on 2015 data. To assess the quality of the model, I plotted the simulated distribution of the leaderboard losses for 2016 and 2017 and compared them to the actual leaderboards. 2016 seems well in line, but 2017 had more submissions with lower losses than predicted. For both years, the actual 5th place loss was right in line with what was expected.
Looking back, what would you do differently now?
A common strategy for this competition is to use the same predictions in both submissions except for the championship game, in which each team is given a 100% chance of winning in one of the submissions, guaranteeing that one of the two submissions will get the last game exactly correct. While I was aware of this strategy beforehand, I didn't realize how good it is. If I had used this strategy, my estimated probability of being in the top 5 would have been 27%, 2 percentage points higher than for my actual submission. This submission would have also won the competition.
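A small Python sketch shows why this hedge helps. It assumes the leaderboard metric is log loss with clipped probabilities, and the three-game list of predictions is hypothetical, with the last entry standing in for the championship game.

```python
import math


def log_loss(preds, outcomes, eps=1e-15):
    """Mean log loss; probabilities are clipped to avoid log(0) (an assumption)."""
    total = 0.0
    for p, y in zip(preds, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(preds)


def hedge_final(base_preds, final_index):
    """Return two entries that differ only in the championship game."""
    entry_a = list(base_preds)
    entry_b = list(base_preds)
    entry_a[final_index] = 1.0  # first team certain to win
    entry_b[final_index] = 0.0  # second team certain to win
    return entry_a, entry_b


base = [0.70, 0.55, 0.60]       # hypothetical predictions; last one is the final
entry_a, entry_b = hedge_final(base, final_index=2)

for final_outcome in (1, 0):
    outcomes = [1, 0, final_outcome]
    hedged = min(log_loss(entry_a, outcomes), log_loss(entry_b, outcomes))
    print(final_outcome, round(log_loss(base, outcomes), 4), round(hedged, 4))
```

Whichever team wins the final, one of the two entries pays essentially no penalty on that game, so the better of the pair always scores below the unhedged predictions. The other entry takes a very large penalty, which does no harm because only the better entry determines the final placement.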
What have you taken away from this competition?
Sometimes it's better to be lucky than good. The location data that I used had a coding error in it. South Carolina's Sweet Sixteen and Elite Eight games were coded as being in Greenville, SC instead of New York City. This led me to give them higher odds than most others, which helped me since they won. It is hard to say what the optimizer would have selected (and how it affected others' models), but there is a good chance I would have finished in 2nd place or worse if the correct locations had been used.