Deep Learning(CS7015): Lec 5.9 Gradient Descent with Adaptive Learning Rate

47,051 views

NPTEL-NOC IITM

Comments: 44
@ashisharora9649 5 years ago
Sir, you are amazing. The way you cleared everything up is extremely helpful. I am a BCom Hons graduate and have moved into this career. I was doing a deep learning course by Andrew Ng. He also explained things well, but his explanation was limited to a handful of formulas, since he treated us as absolute entry-level students. Being a practical person, I was very confused by those terms: what momentum is, why we take the square root, and why we add this momentum went unexplained in many articles. For the last 3 days I have been searching for these answers; finally everything is clear. Thanks.
@ambujmittal6824 5 years ago
I see a video by Mitesh sir, I hit like, and then I start watching the video! :)
@mradulnamdeo3587 a year ago
Me too 🤩 This entire course is just a miracle ✨✨💖
@umerjamil7208 4 years ago
Just to clarify one thing: here, at 38:45, AP means Arithmetic Progression and GP means Geometric Progression.
@Ravi-xp8hi 6 years ago
Excellent lecture as usual...!
@nzsvus 4 years ago
Great material, thanks, sir!
@nevilvekariya1224 5 years ago
Now I understand why IITs are IITs!!!!
@ashisharora9649 5 years ago
Exactly. The level of the teachers, and the explanations students get from them, is completely out of this world and can only be found at IIT colleges.
@shubhamvashisth9518 4 years ago
Watch the IoT course by IIT K. Your reasoning will decay just like the learning rate did in Adagrad.
@PremiDhruv 4 years ago
Your sample set is small; you will not be able to find a good population mean.
@gauravshah89 3 years ago
I am an alumnus of IITB, and I want to inform you that you are wrong. There are all kinds of teachers in the IITs: some are good, others are not.
@bhaskartripathi 3 years ago
There are excellent, good, and average teachers in all institutions, regardless of their reputation. This teacher is certainly excellent, even by Stanford standards!
@m____s a year ago
I laughed really hard at the Adam, Eve, and Momentum joke.
@najmeb1786 3 years ago
Thanks for your great way of teaching. Can you please name the papers you mentioned in this course about the convergence of Adam? Thank you in advance.
@Pappu77775 4 months ago
Saying that Andrew Ng is the American Mitesh Khapra would be an understatement.
@bhagawanlalith3870 3 months ago
Adagrad + momentum = Adam, but why did we use momentum instead of Nesterov accelerated gradient descent, which works better than momentum?
@aroras902 3 years ago
@12:38 "It's almost as if these algorithms went to a school where Pythagoras' theorem was not taught" :D Hilarious!
@sourjyadipray5635 4 months ago
If we added another input to the neuron, what would happen to the formula for f(x)?
@prateek5069 5 years ago
At 8:43, what are the dimensions of v_t? I am confused because it should be a SCALAR in the weight update equation (eq. 3), but in the original v_t equation (eq. 1) the last term, (∇w_t)², is a matrix, making v_t a MATRIX. Can someone please clarify?
@ANANTBARA 3 years ago
We expect a matrix (one entry per parameter) precisely because we want an adaptive learning rate for each feature; the square, square root, and division are all applied elementwise. I know it is rather late to answer. 😀
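For readers following along, here is a minimal NumPy sketch of the elementwise Adagrad update being discussed (this is not the lecture's code; the variable names and toy values are illustrative):

```python
import numpy as np

def adagrad_update(w, grad_w, v, eta=0.1, eps=1e-8):
    """One Adagrad step: v accumulates squared gradients elementwise,
    so every parameter gets its own effective learning rate."""
    v = v + grad_w ** 2                         # v_t = v_{t-1} + (grad_t)^2, same shape as w
    w = w - (eta / np.sqrt(v + eps)) * grad_w   # elementwise division: per-parameter step size
    return w, v

# toy usage: w, grad_w, and v all share the same shape
w = np.zeros(3)
v = np.zeros_like(w)
grad_w = np.array([0.5, 0.0, -0.2])             # a sparse-ish gradient
w, v = adagrad_update(w, grad_w, v)
```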
@yugantarbhasin1914 a year ago
Can we use the log function instead of the square root, so that even with a very high accumulation of history we don't kill the learning rate?
@coolk5802 3 years ago
Awesome
@suryaprakash-yw2fu 2 years ago
What is the need for reducing the learning rate for dense inputs? Increasing the learning rate for sparse inputs is justified.
@INGLERAJKAMALRAJENDRA 8 months ago
@19:52 Not being aggressive while accumulating history.
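Assuming the point at that timestamp is RMSProp's gentler treatment of history (an exponentially decaying average instead of Adagrad's ever-growing sum), here is a minimal NumPy sketch with illustrative default values:

```python
import numpy as np

def rmsprop_update(w, grad_w, v, eta=0.001, beta=0.9, eps=1e-8):
    """RMSProp: old squared gradients are decayed by beta, so the history v
    cannot grow without bound and eta / sqrt(v + eps) does not shrink to zero
    the way Adagrad's effective learning rate eventually does."""
    v = beta * v + (1 - beta) * grad_w ** 2
    w = w - (eta / np.sqrt(v + eps)) * grad_w
    return w, v
```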
@manudasmd a year ago
Is bias correction in Adam required only when we do SGD?
@fahadaslam820 5 years ago
JazakAllah (Y)
@newbie8051 a year ago
I didn't understand the need for bias correction. Can anyone please share some resources for that?
@MANISHAKUMARI-lz2vz 2 years ago
Sir, I understood it well, but please don't use the movie example.
@Sudhirrt10 5 years ago
Please explain why the gradient with respect to w1 is (f(x) - y) * f(x) * (1 - f(x)) * x1.
@dailyDesi_abhrant 4 years ago
This was explained in previous lectures.
@shubhamparida2584 2 years ago
Lecture 3.4
@kausikhira4418 11 months ago
@@shubhamparida2584 Thanks
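For readers who want the missing step: with a sigmoid neuron f(x) = σ(w·x + b) and the squared-error loss L = ½(f(x) - y)² used in the earlier toy example, the chain rule gives ∂L/∂w₁ = (f(x) - y)·f(x)·(1 - f(x))·x₁, since σ'(z) = σ(z)(1 - σ(z)). A small NumPy sanity check of the formula against a numerical gradient (the values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    return 0.5 * (sigmoid(np.dot(w, x) + b) - y) ** 2

# toy values, purely illustrative
w, b = np.array([0.3, -0.1]), 0.05
x, y = np.array([1.2, 0.7]), 1.0

f = sigmoid(np.dot(w, x) + b)
analytic = (f - y) * f * (1 - f) * x[0]          # the formula asked about

h = 1e-6                                         # finite-difference gradient w.r.t. w1
w_plus = w.copy()
w_plus[0] += h
numeric = (loss(w_plus, b, x, y) - loss(w, b, x, y)) / h

print(analytic, numeric)                         # the two should agree closely
```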
@mcab2222 6 years ago
The course is really amazing, but the notation at the beginning...
@raymondlion314 5 years ago
I got really confused in the last ten minutes.
@RahulMadhavan 5 years ago
It's covered again in the next video: kzbin.info/www/bejne/Y2G9fohjnNOgaMk
@raymondlion314 5 years ago
@@RahulMadhavan Thanks. In fact, I am confused about the intuition for bias correction, i.e., why should we correct m_t so that its expectation equals the expectation of the gradient? If so, did we give up the original purpose of momentum-based descent? Imagine a plateau: if the expectation of m_t equals the expectation of the gradient, then Adam will still be very slow...
@raymondlion314 5 years ago
@@RahulMadhavan I think, since beta is smaller than 1, bias correction amplifies the momentum toward the mean of the gradients... A very strange algorithm.
@RahulMadhavan 5 years ago
@@raymondlion314 Say there are n samples in the training set. Using stochastic GD (i.e., gradient descent on samples one at a time), we should replicate batch GD, which gives a descent of Σ(ΔW) per epoch/batch.
1) Each of these samples has some component of the descent (for W) we should take, aka the signal. The expectation of the signal from each sample is Σ(ΔW)/n.
2) Each of the n samples will not contribute exactly Σ(ΔW)/n and will have some noise component, but in expectation the descent per sample is Σ(ΔW)/n.
3) Normally, if we take n descents, one per sample, we arrive at n * Σ(ΔW)/n = Σ(ΔW), the same as batch GD.
4) But if we (i) multiply each of the sample terms (which are Σ(ΔW)/n in expectation) by (1-β), (ii) multiply each of the previous terms by β, and (iii) sum over n such operations, then we don't arrive at Σ(ΔW) after n samples. Instead, we arrive at Σ(ΔW)*(1-β^n). Note: these are the calculations shown in the last ten minutes.
5) Thus, this estimate is called biased, as its mean is changed from our original method. To correct for this, we divide by (1-β^n). The estimate is then called "bias corrected".
6) After bias correction, after n samples, in expectation we arrive at Σ(ΔW)*(1-β^n) / (1-β^n) = Σ(ΔW), which is the same as our original batch GD algorithm, which we know to be right.
7) Dividing by (1-β^n) actually increases the step, as β < 1, so this doesn't decrease the speed.
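A tiny numerical check of points 4)-6) above, assuming every per-sample gradient equals a constant g so that the expectations are exact:

```python
beta, n = 0.9, 10
g = 2.0                              # assume every per-sample gradient is exactly g

m = 0.0
for t in range(1, n + 1):
    m = beta * m + (1 - beta) * g    # the uncorrected exponential moving average

print(m)                             # g * (1 - beta**n) ≈ 1.3026, biased toward 0
print(m / (1 - beta ** n))           # bias-corrected estimate: recovers g = 2.0
```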
@raymondlion314 4 years ago
@@RahulMadhavan Thank you first for your patient reply. I agree with you that the original calculation in Adam is biased and that we need to correct it so that its expectation equals the expectation of Σ(ΔW); otherwise we would get a very small gradient. What I don't agree with is the notion of momentum here. If you check the original momentum method, you can see that its expectation is also biased (usually upward, depending on \gamma and \eta). I think the professor calls it momentum just because it has the same form as momentum GD. It will never decrease the speed, but it will never have the same effect as momentum GD either. We can only accelerate the learning process by adopting the adaptive learning rate.
@mainakghosh3521 a year ago
26:46 The code seems different from the update rule as written. After dividing mₜ by (1 - β₁^t), it assigns that value back to mₜ instead of to a separate variable m̂ₜ. Same with vₜ.
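For reference, a minimal NumPy sketch of one Adam step that keeps the bias-corrected quantities in separate variables m_hat and v_hat, matching the written update rule (this is not the lecture's code; the hyperparameter values are the commonly used defaults):

```python
import numpy as np

def adam_update(w, grad_w, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. m and v are the running first- and second-moment estimates;
    t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad_w            # first moment (momentum-like term)
    v = beta2 * v + (1 - beta2) * grad_w ** 2       # second moment (RMSProp-like term)
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                    # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v                                  # the uncorrected m and v are carried forward
```

Recomputing m_hat and v_hat fresh from the uncorrected m and v at every step, as above, keeps the recursion equivalent to the written rule; if the corrected values were instead written back into m and v and fed into the next step, the correction factor would compound.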
@ghadialhaj9532 6 years ago
100 likes