### Recent Advances for a Better Understanding of Deep Learning - Part I

I would like to live in a world whose systems are built on rigorous, reliable, verifiable knowledge, and not on alchemy. Simple experiments and simple theorems are the building blocks that help understand complicated larger phenomena.

- Non-Convex Optimization: How can we understand the highly non-convex loss function associated with deep neural networks? Why does stochastic gradient descent even converge?
- Overparametrization and Generalization: In classical statistical theory, generalization depends on the number of parameters, but this does not seem to hold in deep learning. Why? Can we find another good measure of generalization?
- Role of Depth: How does depth help a neural network to converge? What is the link between depth and generalization?
- Generative Models: Why do Generative Adversarial Networks (GANs) work so well? What theoretical properties could we use to stabilize them or avoid mode collapse?

I bet a lot of you have tried training a deep net of your own from scratch and walked away feeling bad about yourself because you couldn't get it to perform. I don't think it's your fault. I think it's gradient descent's fault.

- What does the loss function look like?
- Why does SGD converge?

If we perturb a single parameter, say by adding a small constant, but leave the others free to adapt to this change to still minimise the loss, it may be argued that by adjusting somewhat, the myriad other parameters can "make up" for the change imposed on only one of them.

- The functional that is minimized by SGD can be rewritten as a sum of two terms (Eq. 11): the expectation of a potential Φ, and the entropy of the distribution. The temperature 1/β controls the trade-off between those two terms.
- The potential Φ depends only on the data and the architecture of the network (and not on the optimization process). If it is equal to the loss function, SGD will converge to a global minimum. However, the paper shows that this is rarely the case, and knowing how far Φ is from the loss function tells you how likely SGD is to converge.
- The entropy of the final distribution depends on the ratio learning_rate/batch_size (the temperature). Intuitively, the entropy is related to the spread of a distribution, and a high temperature often comes down to a distribution with high variance, which usually means a flat minimum. Since flat minima are often considered to generalize better, this is consistent with the empirical finding that a high learning rate and a low batch size often lead to better minima.
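
To get a feel for the learning_rate/batch_size claim, here is a minimal sketch on a toy problem. It is not the paper's setup: it assumes a one-dimensional quadratic loss over a fixed synthetic dataset, and the function name `sgd_iterate_variance` and all its parameters are made up for illustration. It runs plain mini-batch SGD around the minimum and measures how widely the iterates spread, which plays the role of the "temperature" in this simplified setting.

```python
import random


def sgd_iterate_variance(lr, batch_size, steps=40000, burn_in=2000, seed=0):
    """Variance of SGD iterates around the minimum of a toy 1-D quadratic loss.

    Toy assumption: loss(w) = mean_i 0.5 * (w - x_i)**2 over a fixed dataset,
    so the mini-batch gradient is w - mean(batch).
    """
    rng = random.Random(seed)
    # Fixed synthetic dataset (hypothetical, drawn once from a standard normal).
    data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    w = sum(data) / len(data)  # start at the exact minimizer
    trace = []
    for t in range(steps):
        batch = rng.choices(data, k=batch_size)  # sample with replacement
        grad = w - sum(batch) / batch_size       # mini-batch gradient
        w -= lr * grad                           # plain SGD step
        if t >= burn_in:
            trace.append(w)
    mean_w = sum(trace) / len(trace)
    return sum((v - mean_w) ** 2 for v in trace) / len(trace)


# Two settings share the same ratio lr / batch_size; the third has a larger one.
for lr, batch_size in [(0.2, 4), (0.05, 1), (0.2, 1)]:
    var = sgd_iterate_variance(lr, batch_size)
    print(f"lr={lr:<5} batch={batch_size:<2} lr/batch={lr / batch_size:<5} variance={var:.4f}")
```

In this toy model the first two settings should give a similar spread, while the third, with a ratio four times larger, should wander noticeably further from the minimum. That only illustrates the intuition; it says nothing about flat minima or generalization in real networks.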