### Recent Advances for a Better Understanding of Deep Learning - Part I

I would like to live in a world whose systems are built on rigorous, reliable, verifiable knowledge, and not on alchemy. Simple experiments and simple theorems are the building blocks that help understand complicated larger phenomena.

- Non Convex Optimization: How can we understand the highly non-convex loss function associated with deep neural networks? Why does stochastic gradient descent even converge?
- Overparametrization and Generalization: In classical statistical theory, generalization depends on the number of parameters, but this does not seem to hold in deep learning. Why? Can we find another good measure of generalization?
- Role of Depth: How does depth help a neural network to converge? What is the link between depth and generalization?
- Generative Models: Why do Generative Adversarial Networks (GANs) work so well? What theoretical properties could we use to stabilize them or avoid mode collapse?

I bet a lot of you have tried training a deep net of your own from scratch and walked away feeling bad about yourself because you couldn't get it to perform. I don't think it's your fault. I think it's gradient descent's fault.

- What does the loss function look like?
- Why does SGD converge?

If we perturb a single parameter, say by adding a small constant, but leave the others free to adapt to this change so as to still minimise the loss, it may be argued that, by adjusting somewhat, the myriad other parameters can "make up" for the change imposed on only one of them.
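The compensation argument above can be illustrated on a much simpler model than a deep network. The sketch below is only a toy, assuming an overparametrized linear regression (5 data points, 20 parameters) with a squared loss; the function names `loss` and `perturb_and_readapt` are hypothetical and chosen for illustration. It shifts one parameter by a constant, then lets the remaining parameters re-fit the data.

```python
import numpy as np

def loss(X, y, w):
    """Half mean squared error of the linear model X @ w."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def perturb_and_readapt(X, y, w, index, delta):
    """Shift w[index] by delta, then re-fit all other parameters by least squares."""
    w_new = w.copy()
    w_new[index] += delta
    others = np.arange(w.size) != index
    # What the remaining parameters must explain once w[index] is pinned.
    residual = y - X[:, index] * w_new[index]
    w_new[others] = np.linalg.lstsq(X[:, others], residual, rcond=None)[0]
    return w_new

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # toy assumption: far more parameters than data points
y = rng.normal(size=5)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # one of many zero-loss solutions

w_frozen = w_star.copy()
w_frozen[0] += 0.5                              # perturb, nothing else adapts
w_adapted = perturb_and_readapt(X, y, w_star, index=0, delta=0.5)

print("loss at minimum:          ", loss(X, y, w_star))
print("loss, others frozen:      ", loss(X, y, w_frozen))
print("loss, others free to adapt:", loss(X, y, w_adapted))
```

In this toy setting the loss rises when the other parameters are frozen and returns to (numerically) zero once they are allowed to adapt, which is the intuition behind minima forming connected regions rather than isolated points. Whether the same picture holds for a given deep network is an empirical question that this linear example does not settle.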

- The functional that is minimized by SGD can be rewritten as a sum of two terms (Eq. 11): the expectation of a potential Φ, and the entropy of the distribution. The temperature 1/β controls the trade-off between those two terms.
- The potential Φ depends only on the data and the architecture of the network (and not on the optimization process). If it is equal to the loss function, SGD will converge to a global minimum. However, the paper shows that this is rarely the case, and knowing how far Φ is from the loss function tells you how likely SGD is to converge.
- The entropy of the final distribution depends on the ratio learning_rate/batch_size (the temperature). Intuitively, the entropy measures how spread out a distribution is, and a high temperature usually yields a distribution with high variance, which typically corresponds to a flat minimum. Since flat minima are often considered to generalize better, this is consistent with the empirical finding that a high learning rate and a small batch size often lead to better minima.
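The role of the learning_rate/batch_size ratio can be checked on a toy problem. The sketch below is not the paper's experiment: it assumes a one-dimensional quadratic loss `0.5 * mean((x - a_i)^2)` over synthetic Gaussian data, with mini-batches sampled with replacement, and the names `sgd_stationary_variance` and `predicted_variance` are hypothetical. For this toy model the spread of the SGD iterates around the minimum can be worked out by hand, and it grows with learning_rate/batch_size.

```python
import numpy as np

def sgd_stationary_variance(lr, batch_size, data, steps=20_000, burn_in=2_000, seed=0):
    """Run SGD on 0.5 * mean((x - a)^2) and return the variance of the iterates."""
    rng = np.random.default_rng(seed)
    # Pre-draw all mini-batches (sampling with replacement).
    idx = rng.integers(0, data.size, size=(steps, batch_size))
    batch_means = data[idx].mean(axis=1)
    x = 0.0
    trace = np.empty(steps)
    for t in range(steps):
        # The mini-batch gradient of the quadratic loss is x - mean(batch).
        x -= lr * (x - batch_means[t])
        trace[t] = x
    return trace[burn_in:].var()

def predicted_variance(lr, batch_size, data):
    """Closed-form stationary variance for this toy model only.

    Roughly lr * var(data) / (2 * batch_size) when lr is small.
    """
    return lr * data.var() / (batch_size * (2.0 - lr))

rng = np.random.default_rng(42)
data = rng.normal(size=10_000)

var_hot = sgd_stationary_variance(lr=0.1, batch_size=10, data=data)    # higher temperature
var_cold = sgd_stationary_variance(lr=0.1, batch_size=40, data=data)   # lower temperature

print("small batch: simulated", var_hot, "predicted", predicted_variance(0.1, 10, data))
print("large batch: simulated", var_cold, "predicted", predicted_variance(0.1, 40, data))
```

Dividing the batch size by four at a fixed learning rate makes the iterates roughly four times more spread out in this example, which matches the "temperature" reading of the ratio. A quadratic bowl has a single minimum, so the sketch says nothing about flat versus sharp minima or about generalization; it only shows how the noise scale of SGD depends on the two hyperparameters.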