...

### Recent Advances for a Better Understanding of Deep Learning - Part I

I would like to live in a world whose systems are built on rigorous, reliable, verifiable knowledge, and not on alchemy. Simple experiments and simple theorems are the building blocks that help understand complicated larger phenomena.

- Non-Convex Optimization: How can we understand the highly non-convex loss function associated with deep neural networks? Why does stochastic gradient descent even converge?
- Overparametrization and Generalization: In classical statistical theory, generalization depends on the number of parameters, but this does not seem to hold in deep learning. Why? Can we find another good measure of generalization?
- Role of Depth: How does depth help a neural network to converge? What is the link between depth and generalization?
- Generative Models: Why do Generative Adversarial Networks (GANs) work so well? What theoretical properties could we use to stabilize them or avoid mode collapse?

I bet a lot of you have tried training a deep net of your own from scratch and walked away feeling bad about yourself because you couldn't get it to perform. I don't think it's your fault. I think it's gradient descent's fault.

- What does the loss function look like?
- Why does SGD converge?

If we perturb a single parameter, say by adding a small constant, but leave the others free to adapt to this change to still minimise the loss, it may be argued that by adjusting somewhat, the myriad other parameters can "make up" for the change imposed on only one of them.
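The compensation argument can be illustrated on a much simpler stand-in than a deep network. The sketch below is a toy under stated assumptions: an overparametrized linear least-squares model (50 parameters, 20 data points) with random data, where one parameter is shifted by a constant and held fixed while the remaining ones are refit. The names `mse`, `w_frozen`, and `w_adapted`, and the shift of 0.5, are illustrative choices and do not come from the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 20, 50  # more parameters than data points (assumption)
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)


def mse(w):
    """Mean squared error of the linear model X @ w against y."""
    return float(np.mean((X @ w - y) ** 2))


# Minimum-norm least-squares fit: interpolates the data, so the loss is ~0.
w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)

# Perturb a single parameter by a constant and keep everything else frozen.
w_frozen = w_fit.copy()
w_frozen[0] += 0.5
loss_frozen = mse(w_frozen)

# Now keep the perturbed parameter fixed and let the other 49 re-adapt.
target = y - X[:, 0] * w_frozen[0]
w_rest, *_ = np.linalg.lstsq(X[:, 1:], target, rcond=None)
w_adapted = np.concatenate(([w_frozen[0]], w_rest))
loss_adapted = mse(w_adapted)

print(f"loss at original fit:          {mse(w_fit):.2e}")
print(f"loss with one weight shifted:  {loss_frozen:.2e}")
print(f"loss after the others re-adapt: {loss_adapted:.2e}")
```

In this linear toy the remaining parameters can absorb the perturbation completely, because 49 free parameters are still more than enough to fit 20 points. Whether and how far this carries over to a real deep network is exactly the kind of question the quoted argument raises, and the sketch does not settle it.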

- The functional that is minimized by SGD can be rewritten as a sum of two terms (Eq. 11): the expectation of a potential Φ, and the entropy of the distribution. The temperature 1/β controls the trade-off between those two terms.
- The potential Φ depends only on the data and the architecture of the network (and not on the optimization process). If it is equal to the loss function, SGD will converge to a global minimum. However, the paper shows that this is rarely the case, and knowing how far Φ is from the loss function tells you how likely SGD is to converge.
- The entropy of the final distribution depends on the ratio learning_rate/batch_size (the temperature). Intuitively, the entropy is related to the spread of a distribution, and a high temperature often comes down to a distribution with high variance, which usually means a flat minimum. Since flat minima are often considered to generalize better, this is consistent with the empirical finding that a high learning rate and a small batch size often lead to better minima.
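The role of the ratio learning_rate/batch_size as a temperature can be checked on a deliberately simple case. The sketch below is a toy under stated assumptions, not the paper's experiment: it runs mini-batch SGD on a one-dimensional quadratic loss, 0.5 * (w - x)^2 averaged over synthetic data, and measures how widely the iterate fluctuates once it has settled. For this toy, the stationary variance can be worked out by hand as roughly learning_rate * data_variance / (2 * batch_size) for small learning rates. The function name `sgd_stationary_variance` and all numeric settings are hypothetical.

```python
import numpy as np


def sgd_stationary_variance(lr, batch_size, n_steps=200_000, burn_in=5_000, seed=0):
    """Variance of the SGD iterate around the minimum of a 1-D quadratic loss.

    Toy setting (an assumption for illustration): the per-sample loss is
    0.5 * (w - x)**2, so the mini-batch gradient is w - mean(batch).
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, size=10_000)

    # Pre-sample every mini-batch (with replacement) and reduce to batch means.
    idx = rng.integers(0, data.size, size=(n_steps, batch_size))
    batch_means = data[idx].mean(axis=1)

    w = 0.0
    trace = np.empty(n_steps)
    for t in range(n_steps):
        grad = w - batch_means[t]  # gradient of the mini-batch loss
        w -= lr * grad             # plain SGD step
        trace[t] = w

    # Discard the initial transient, then measure the spread of the iterate.
    return float(trace[burn_in:].var())


v_base = sgd_stationary_variance(lr=0.01, batch_size=4)
v_same_ratio = sgd_stationary_variance(lr=0.02, batch_size=8)  # same lr / batch_size
v_hotter = sgd_stationary_variance(lr=0.04, batch_size=4)      # 4x the ratio

print(f"lr=0.01, batch=4: variance {v_base:.5f}")
print(f"lr=0.02, batch=8: variance {v_same_ratio:.5f}")
print(f"lr=0.04, batch=4: variance {v_hotter:.5f}")
```

Doubling both the learning rate and the batch size should leave the spread roughly unchanged, while raising the ratio by a factor of four should widen it by about the same factor. This only shows the "higher temperature, wider distribution" part of the argument; the link to flat minima and generalization is not something a one-dimensional quadratic can demonstrate.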