So, how many of you came here from Andrew Ng's ML course?
@phungdaoxuan99 (4 years ago)
I'm here
@nasserhussain5698 (4 years ago)
Me
@willgabriel5275 (4 years ago)
I'm here.
@charlottefx7163 (3 years ago)
Me too
@behrad9712 (3 years ago)
yea!
@yemaneabrha6637 (4 years ago)
Simple, clear, and gentle explanation. Thanks. More, please, Prof.
@JulianHarris (9 years ago)
I'm very visual, so I particularly loved the visualisations showing the progressive improvement of the hypothesis as the parameters were refined.
@sagardolas3880 (7 years ago)
This was the simplest explanation, as well as the most beautiful and precise one.
@redfield126 (5 years ago)
Very interesting. Thanks for the clear and visual explanation, which gave me quite a good intuition for the different versions of gradient descent.
@sagarbhat7932 (4 years ago)
Wouldn't online gradient descent cause the problem of overfitting?
@AlexanderIhler (4 years ago)
Overfitting is not really related to the *method* of doing the optimization (online=stochastic GD, versus batch GD, or second order methods like BFGS, etc.) but rather to the complexity of the model, and the *degree* to which the optimization process is allowed to fit the data. So, early stopping (incomplete optimization) can reduce overfitting, for example. Changing optimization methods can appear to change overfitting simply because of stopping rules interacting with optimization efficiency, but they don't really change the fundamental issue.
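To make the early-stopping point concrete, here's a minimal sketch: stochastic (online) gradient descent on invented 1-D data with a simple linear model and squared-error loss, keeping whichever parameters do best on a held-out set. All names and data below are made up for illustration; this is not code from the lecture.

```python
# Sketch: online/stochastic gradient descent with a simple "keep the best
# validation parameters" form of early stopping. Invented data and names.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data, split into training and validation sets
X = rng.uniform(-3, 3, size=200)
y = 2.0 * X + 1.0 + rng.normal(scale=1.0, size=200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

def mse(theta, X, y):
    # J(theta): mean squared error of the hypothesis theta[0] + theta[1]*x
    return np.mean((theta[0] + theta[1] * X - y) ** 2)

theta = np.zeros(2)
alpha = 0.01                                   # learning rate
best_theta, best_val = theta.copy(), np.inf

for epoch in range(100):
    for i in rng.permutation(len(X_tr)):       # online / stochastic updates
        err = theta[0] + theta[1] * X_tr[i] - y_tr[i]
        theta -= alpha * err * np.array([1.0, X_tr[i]])
    val = mse(theta, X_va, y_va)
    if val < best_val:                         # early stopping: remember the
        best_val, best_theta = val, theta.copy()  # best parameters on held-out data

print("best validation MSE:", best_val, "theta:", best_theta)
```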
@AronBordin (9 years ago)
Exactly what I was looking for, thx!
@poltimmer (9 years ago)
Thanks! I'm writing an essay on machine learning, and this really helped me out!
@anthamithya (6 years ago)
First of all, how do we know that the J(θ) curve looks like that? The curve would only be obtained after gradient descent has been run, or after evaluating a thousand or so random θ values...
@sidbhatia4230 (5 years ago)
What modifications can we make to use the L2 norm instead?
@mdfantacherislam4401 (7 years ago)
Thanks for such a helpful lecture.
@UmeshMorankar (8 years ago)
What if we set the learning rate α to too large a value?
@SreeragNairisawesome (8 years ago)
+Umesh Morankar Then it might diverge or overshoot the minimum. For example, suppose the minimum is at 2, the latest value is Θ = 4, and the step α·(gradient) works out to 8; then the update jumps to 4 − 8 = −4, which is far away from 2 on the other side. Whereas if the step were 1 (small), it would reach the minimum in the next 2 iterations (4 → 3 → 2). I haven't run the algorithm; this is just for explanation purposes.
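A tiny numeric sketch of the same idea, on a made-up quadratic J(Θ) = (Θ − 2)² with its minimum at Θ = 2 (not the cost function from the lecture):

```python
# Gradient descent on the invented cost J(theta) = (theta - 2)**2,
# starting from theta = 4, to show how the learning rate alpha changes behavior.
def gradient_descent(alpha, theta=4.0, steps=10):
    for _ in range(steps):
        grad = 2.0 * (theta - 2.0)      # dJ/dtheta
        theta = theta - alpha * grad    # standard update rule
    return theta

print(gradient_descent(alpha=0.1))   # small alpha: moves steadily toward 2
print(gradient_descent(alpha=0.9))   # larger alpha: overshoots each step but still converges
print(gradient_descent(alpha=1.5))   # too large: each overshoot grows, so it diverges
```

Running it shows the first two cases settling near 2 while the last one blows up.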
@EngineersLife-Vlog (6 years ago)
Can I get these slides, please?
@JanisPundurs (10 years ago)
This helped a lot, thanks
@NicoCarosio (8 years ago)
Thanks!
@fyz5689 (8 years ago)
Excellent!
@prithviprakash1110 (6 years ago)
Can someone explain how the derivative of θx^(t) with respect to θ becomes x and not x^(t)?
@patton4786 (6 years ago)
Because it's with reference to θ0: the derivative of θ0·x is 1·θ0^(1−1)·x = 1·1·x = x. (Btw, this is a partial derivative, so all other terms are treated as constants except the θ0·x0 term.)
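A quick symbolic check of this, as a sketch with arbitrary symbol names (not from the lecture):

```python
# Hypothetical sanity check with sympy: differentiate a linear hypothesis
# theta0*x0 + theta1*x1 with respect to theta0; the other term drops out.
import sympy as sp

theta0, theta1, x0, x1 = sp.symbols('theta0 theta1 x0 x1')
h = theta0 * x0 + theta1 * x1      # linear hypothesis for one example

print(sp.diff(h, theta0))          # prints x0: theta1*x1 is constant w.r.t. theta0
print(sp.diff(h, theta1))          # prints x1
```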