Sir, you are amazing. The way you cleared everything up is extremely helpful. I am a BCom (Hons) graduate and have moved into this career. I was doing a deep learning course by Andrew Ng. He also explained things well, but his explanation was limited to a few formulas, as he treated us as absolute beginners. Being a practical person, I was so confused by those terms: what momentum is, why we take the square root, and why we add momentum was left unexplained in many articles. For the last 3 days I was searching for these answers; finally everything is clear. Thanks.
@ambujmittal6824 · 5 years ago
I see a video by Mitesh sir, I hit like and then I start watching the video! :)
@mradulnamdeo3587 · 1 year ago
Me too 🤩 This entire course is just a miracle ✨✨💖
@umerjamil7208 · 4 years ago
Just to clear one thing up: here, at 38:45, AP means Arithmetic Progression and GP means Geometric Progression.
@Ravi-xp8hi · 6 years ago
Excellent lecture as usual...!
@nzsvus · 4 years ago
Great material, thanks, sir!
@nevilvekariya1224 · 5 years ago
Now I get why IITs are IITs!!!
@ashisharora9649 · 5 years ago
Exactly. The level of teachers, and the explanations students get from them, is completely out of this world and can only be found at IIT colleges.
@shubhamvashisth9518 · 4 years ago
Watch the IoT course by IIT K; your reasoning will decay just like the learning rate did in Adagrad.
@PremiDhruv · 4 years ago
Your sample set is small; it will not give you a good estimate of the population mean.
@gauravshah89 · 3 years ago
I am an alumnus of IITB, and I want to inform you that you are wrong. There are all kinds of teachers in IITs: some are good, others are not.
@bhaskartripathi · 3 years ago
There are excellent, good, and average teachers in all institutions, regardless of their reputation. This teacher is certainly excellent, even by Stanford standards!
@m____s · 1 year ago
I laughed really hard at the Adam, Eve, and Momentum joke.
@najmeb1786 · 3 years ago
Thanks for your great way of teaching. Can you please name the papers you mentioned in this course about the convergence of Adam? Thank you in advance.
@Pappu77775 · 4 months ago
Saying that Andrew Ng is the American Mitesh Khapra would be an understatement
@bhagawanlalith3870 · 3 months ago
Adagrad + momentum = Adam, but why did we use momentum instead of Nesterov accelerated GD, which works better than momentum?
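The question above contrasts plain momentum with Nesterov accelerated GD; the only difference between the two is where the gradient is evaluated. A minimal 1-D sketch of the two update rules (the quadratic objective and all names are illustrative, not from the lecture):

```python
def grad(w):
    # gradient of the toy objective f(w) = w**2 / 2
    return w

def momentum_step(w, v, eta=0.1, gamma=0.9):
    v = gamma * v + eta * grad(w)              # gradient at the current point
    return w - v, v

def nag_step(w, v, eta=0.1, gamma=0.9):
    v = gamma * v + eta * grad(w - gamma * v)  # gradient at the look-ahead point
    return w - v, v

w_m = w_n = 1.0
v_m = v_n = 0.0
for _ in range(20):
    w_m, v_m = momentum_step(w_m, v_m)
    w_n, v_n = nag_step(w_n, v_n)

# NAG's look-ahead damps the oscillations, so it ends up closer to the minimum
print(abs(w_m), abs(w_n))
```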
@aroras902 · 3 years ago
@12:38 "Its almost as if these algorithms went to a school where Pythagoras theorem was not taught" :D Hilarious!
@sourjyadipray5635 · 4 months ago
If we added another input to the neuron, what would happen to the formula for f(x)?
@prateek5069 · 5 years ago
At 8:43, what are the dimensions of Vt? I am confused because it should be a SCALAR in the weight-update equation (eq. 3), but in the original Vt equation (eq. 1) the last term, del(Wt)**2, is a matrix, making Vt a MATRIX. Can someone please clarify?
@ANANTBARA · 3 years ago
We expect a matrix precisely because we want an adaptive learning rate for each parameter (one history entry per weight). I know this answer is probably too late. 😀
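To make the shapes concrete: the square in the Adagrad history is elementwise, so v_t has exactly the same shape as the weights, and the update divides elementwise, giving each weight its own effective learning rate. A minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

eta, eps = 0.1, 1e-8
W = np.zeros((3, 2))       # weight matrix
v = np.zeros_like(W)       # accumulated squared gradients, same shape as W

for _ in range(5):
    grad = np.random.randn(3, 2)           # stand-in for dL/dW
    v += grad ** 2                         # elementwise square: v stays a matrix
    W -= eta * grad / (np.sqrt(v) + eps)   # elementwise per-weight learning rate

print(v.shape)  # (3, 2): one accumulated history entry per weight
```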
@yugantarbhasin1914 · 1 year ago
Can we use the log function instead of the square root, so that with a very high accumulation of history we don't kill the learning rate?
@coolk5802 · 3 years ago
Awesome
@suryaprakash-yw2fu · 2 years ago
What is the need for reducing the learning rate for dense inputs? Increasing the learning rate for sparse inputs is justified.
@INGLERAJKAMALRAJENDRA · 8 months ago
@19:52 Not being aggressive while accumulating history
@manudasmd · 1 year ago
Is bias correction in ADAM required only when we do SGD?
@fahadaslam820 · 5 years ago
JazakAllah (Y)
@newbie8051 · 1 year ago
Didn't understand the need for bias correction. Can anyone please share some resources for that?
@MANISHAKUMARI-lz2vz · 2 years ago
Sir, understood well, but please don't use movie examples.
@Sudhirrt10 · 5 years ago
Please explain why the gradient with respect to W1 is (f(x)-y)*f(x)*(1-f(x))*x1
@dailyDesi_abhrant · 4 years ago
This was explained in previous lectures
@shubhamparida2584 · 2 years ago
Lecture 3.4
@kausikhira4418 · 11 months ago
@@shubhamparida2584 thanks
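The formula asked about in this thread is the chain rule for a sigmoid neuron trained with squared-error loss L = 0.5*(f(x) - y)^2, and it can be checked against a numerical gradient. A minimal sketch with made-up weights and inputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# one neuron, two inputs: f(x) = sigmoid(w1*x1 + w2*x2 + b)
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 0.5, 0.8, 1.0

def loss(w1_):
    f = sigmoid(w1_ * x1 + w2 * x2 + b)
    return 0.5 * (f - y) ** 2

# analytic gradient from the chain rule: dL/dw1 = (f - y) * f * (1 - f) * x1
f = sigmoid(w1 * x1 + w2 * x2 + b)
analytic = (f - y) * f * (1 - f) * x1

# numerical gradient by central differences
h = 1e-6
numeric = (loss(w1 + h) - loss(w1 - h)) / (2 * h)

print(abs(analytic - numeric))  # the two agree to within floating-point noise
```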
@mcab2222 · 6 years ago
The course is really amazing, but the notation in the beginning...
@raymondlion314 · 5 years ago
I got really confused in the last ten minutes
@RahulMadhavan · 5 years ago
It's covered in the next video again: kzbin.info/www/bejne/Y2G9fohjnNOgaMk
@raymondlion314 · 5 years ago
@@RahulMadhavan Thanks. In fact, I am confused about the intuition for bias correction, i.e., why should we correct m_t so that its expectation equals the expectation of the gradient? If so, did we give up the original purpose of momentum descent? Imagine a plateau: if the expectation of m_t equals the expectation of the gradient, then ADAM will still be very slow...
@raymondlion314 · 5 years ago
@@RahulMadhavan I think that since beta is smaller than 1, bias correction will amplify the momentum up to the mean of the gradient... Very weird algorithm.
@RahulMadhavan · 5 years ago
@@raymondlion314 Say there are n samples in the training set. Using stochastic GD (i.e., gradient descent one sample at a time), we should replicate batch GD, which gives a descent of Σ(ΔW) per epoch/batch.
1) Each of these samples has some component of the descent (for W) we should take, aka signal. The expected signal from each sample is Σ(ΔW)/n.
2) Each sample will not give exactly Σ(ΔW)/n and will have some noise component, but in expectation the descent per sample is Σ(ΔW)/n.
3) Normally, if we take n descents, one per sample, we arrive at n · Σ(ΔW)/n = Σ(ΔW), the same as batch GD.
4) But if we (i) multiply each sample term (which is Σ(ΔW)/n in expectation) by (1-β), (ii) multiply each of the previous terms by β, and (iii) sum over n such operations, then after n samples we don't arrive at Σ(ΔW). Instead, we arrive at Σ(ΔW)·(1-β^n). Note: these are the calculations shown in the last ten minutes.
5) Thus this estimate is called biased, as its mean has changed from our original method. To correct for this, we divide by (1-β^n); the estimate is then called "bias corrected".
6) After bias correction, after n samples, in expectation we arrive at Σ(ΔW)·(1-β^n) / (1-β^n) = Σ(ΔW), which is the same as our original batch GD algorithm, which we know to be right.
7) Dividing by (1-β^n) actually increases the rate, as β < 1. Thus this doesn't decrease the speed.
@raymondlion314 · 4 years ago
@@RahulMadhavan Thank you for your patient reply. I agree that the original calculation in ADAM is biased and that we need to correct it so that its expectation equals the expectation of Σ(ΔW); otherwise we would get very small gradients. What I don't agree with is the notion of momentum here. If you check the original momentum method, you can see its expectation is also biased (usually upward, depending on γ and η). I think the professor calls it momentum just because it has the same form as momentum GD. It will never decrease the speed, but it will also never have the same effect as momentum GD. We can only accelerate the learning process by adopting the adaptive learning rate.
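The expectation argument in this thread is easy to verify numerically: for a constant gradient g, the uncorrected running average m_t = β·m_{t-1} + (1-β)·g equals g·(1-β^t), so dividing by (1-β^t) recovers g exactly. A minimal sketch:

```python
beta, g = 0.9, 2.0   # decay rate and a constant stand-in "gradient"
m = 0.0
for t in range(1, 11):
    m = beta * m + (1 - beta) * g   # uncorrected EWMA, biased toward its zero init
    m_hat = m / (1 - beta ** t)     # bias-corrected estimate

print(round(m, 4), round(m_hat, 4))  # m ≈ 1.3026 (underestimates g), m_hat = 2.0
```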
@mainakghosh3521 · 1 year ago
26:46 The code seems different from the update rule as described: after dividing mₜ by (1 - β₁^t), it stores that value back into mₜ, instead of into a separate variable m̂ₜ. Same with vₜ.
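On the observation above: whether the overwrite matters depends on whether the divided value feeds back into the next averaging step. If it does, the recursion changes and the estimate drifts; a small sketch of the two variants (illustrative code, not the lecture's actual implementation):

```python
beta, g, T = 0.9, 1.0, 5

# Variant A: keep the running average and a separate bias-corrected copy
m = 0.0
for t in range(1, T + 1):
    m = beta * m + (1 - beta) * g
    m_hat = m / (1 - beta ** t)     # m itself is left untouched

# Variant B: overwrite m with the corrected value
m_b = 0.0
for t in range(1, T + 1):
    m_b = beta * m_b + (1 - beta) * g
    m_b = m_b / (1 - beta ** t)     # corrected value feeds the next step

print(m_hat, m_b)  # Variant A gives exactly g; Variant B drifts far above it
```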