Machine Learning Lecture 13 "Linear / Ridge Regression" -Cornell CS4780 SP17

33,660 views

Kilian Weinberger


Lecture Notes:
www.cs.cornell....

Comments: 50
@prwi87 · 1 year ago
There is an error at 36:16 that leads to a wrong solution at 37:10. The sum should not be taken over P(w). Writing P(y|X,w)*P(w) as Π( P(y_i | x_i, w) ) * P(w), P(w) is a constant factor that can be pulled out of the product, so taking the log gives log P(w) + Σ log P(y_i | x_i, w). After solving, there is no "n" in the numerator in front of w^T w.
Also, y is not normally distributed, but y|X is; that is why we write P(y|X) in the next step. In the MAP approach, P(D|w) is not defined: we don't know what this pdf is. It should be P(y|X, w), which is Gaussian by our main assumption. Since D = (X, y), writing P(D|w) amounts to claiming we know P(X, y|w), which we don't; later it is defined properly.
Another point: to my understanding, the notion of a "slope" is meaningless for high-dimensional data; we should speak of gradients or normal vectors. So in this case the vector w is not a slope but a normal to the hyperplane.
Finally, at 37:05 w is a vector, so P(w) should be a multivariate Gaussian, yet a univariate one is written. Since the entries of w are i.i.d., we can write it as a product of univariate Gaussians, one per entry. It doesn't change much, but it is rigorous, and it makes the ||w||^2 (sum of all w_i^2) term under the argmin consistent; indeed a moment later the professor writes w^T w, which means P(w) is multivariate.
I know I have all the time in the world to rewind this one lecture and pick up on little things, but I really like to be rigorous, and if my nitpicking can help someone I will be really happy.
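For readers following along, here is a compact version of the derivation this comment describes. I am assuming a Gaussian prior w ~ N(0, τ²I) and noise variance σ², matching the lecture's setup; the exact symbols used on the board may differ.

```latex
\hat{w}_{\mathrm{MAP}}
  = \operatorname*{argmax}_{w}\;\log P(w) + \sum_{i=1}^{n} \log P(y_i \mid x_i, w)
  = \operatorname*{argmin}_{w}\;\frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(x_i^\top w - y_i\big)^2
      + \frac{1}{2\tau^2}\, w^\top w
```

The prior contributes a single (1/2τ²) wᵀw term, with no factor of n in front of it, which is the point being made above.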
@30saransh · 4 months ago
helpful, thanks
@jachawkvr · 4 years ago
This was a fun lecture. I never knew that minimizing the squared error was equivalent to the MLE approach.
@Galois1683 · 4 years ago
I am studying that now
@JoaoVitorBRgomes · 4 years ago
It is because when you find the maximum likelihood estimator you are trying to find the most likely coefficients, which (under Gaussian noise) end up being the ones that also minimize the squared error.
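For reference, a compact sketch of the equivalence being discussed, under the lecture's assumption that y_i = wᵀx_i + ε_i with ε_i ~ N(0, σ²) and σ² a fixed constant:

```latex
\hat{w}_{\mathrm{MLE}}
  = \operatorname*{argmax}_{w}\;\sum_{i=1}^{n}\log \mathcal{N}\!\big(y_i \mid w^\top x_i,\ \sigma^2\big)
  = \operatorname*{argmin}_{w}\;\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2
```

since each Gaussian log-density is -(y_i - wᵀx_i)²/(2σ²) plus a constant that does not depend on w.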
@tonychen31 · 4 years ago
Excellent Lecture!
@ugurkap · 5 years ago
The MAP version is different from the lecture notes. I believe the lecture notes are correct: when we take the log, since the prior and the likelihood are multiplied with each other, we should get log(likelihood) + log(prior), and the summation over the likelihood terms should not affect log(prior), so lambda should not be multiplied by n. If we do not split the likelihood and prior and leave it as log(likelihood × prior), we should get something different from the MLE version, right?
@echo-channel77 · 5 years ago
I agree. Besides, the n would simply cancel anyway, and we'd be left with n times the constant.
@llll-dj8rn · 10 months ago
So, assuming the noise in the linear regression model is Gaussian: applying MLE we derive ordinary least squares, and applying MAP (with a Gaussian prior on w) we derive regularized (ridge) linear regression.
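A small numerical sketch of this correspondence. Everything here (variable names, the data, the choice lam = σ²/τ²) is illustrative and assumed, not code from the course:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
sigma2, tau2 = 0.25, 1.0                  # assumed noise variance and prior variance
w_true = rng.normal(size=d)

X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# MLE under Gaussian noise = ordinary least squares: argmin_w ||Xw - y||^2
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with prior w ~ N(0, tau2 * I): argmin_w ||Xw - y||^2 + lam * ||w||^2,
# i.e. ridge regression with lam = sigma2 / tau2
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("true :", np.round(w_true, 3))
print("OLS  :", np.round(w_ols, 3))
print("ridge:", np.round(w_ridge, 3))
```

The ridge solution shrinks the OLS solution toward zero, with the amount of shrinkage governed by the ratio of noise variance to prior variance.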
@tr8d3r · 4 years ago
I love it "only losers maximize" ;)
@30saransh · 4 months ago
Is there any way we can get access to the projects for this course?
@erenyeager4452 · 3 years ago
OMG, the regularisation comes from the MAP!!!!!!!!!!!!!! Respect 🙇 🙇
@JoaoVitorBRgomes · 3 years ago
@kilian weinberger, at around 28:45 you ask if we have any questions about this. I have one question: you use argmin over w because you want to find the w that minimizes that loss function, right? If it were a concave function, you would write argmax over w instead?
@Aesthetic_Euclides · 8 months ago
I was thinking about modeling the prediction of y given x with a Gaussian. Are these observations/reasoning steps correct?
I understand the Gaussianness comes in because there is a true linear function that perfectly models the relationship between X and Y, but it is unknown to us. We have data D that we assume comes from sampling the true distribution P. Since we only have this limited sample of data, it is reasonable to model the noise as Gaussian. This means that for a given x, our prediction y actually belongs to a Gaussian distribution, but since we only have this "single" sample D of the true data distribution, our best bet is to predict the expectation of that Gaussian (also because a good estimator of the expectation is the average, I guess).
Now, given that in the end we just fit the model to the data and predict, why do we have to model the noise at all? Why not make it purely an optimization problem, more like the DL approach?
@flicker1984 · 4 years ago
Logistic regression is regression in the sense that it predicts a probability, which can then be used to define a classifier.
@cge007 · 7 months ago
Hello, thank you for the lecture. Why is the variance equal for all points (17:23)? Is this an assumption we are making?
@kilianweinberger698 · 6 months ago
Yes, just an assumption to keep things simple.
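In symbols, the assumption being discussed is homoscedastic noise (the notation follows the lecture; σ² is the shared constant):

```latex
y_i \mid x_i \;\sim\; \mathcal{N}\big(w^\top x_i,\ \sigma^2\big), \qquad i = 1,\dots,n
```

with the same σ² for every i, rather than a point-dependent σ_i².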
@arihantjha4746 · 4 years ago
Hi Kilian, my doubt is about how we derive the mean squared loss in the notes. We take P(x_i) to be independent of theta. But since P(X) is the marginal of the joint P(X, Y), if the joint depends on theta, wouldn't the marginal also depend on theta, so that P(X = x_i) depends on theta as well? Is it that we ASSUME P(X) is independent of theta in the parameterized distribution, for the sake of discriminative learning, or is there some underlying obvious reason I am missing? Thank you for your lectures.
@amarshahchowdry9727 · 4 years ago
I have the same doubt.
@kilianweinberger698 · 4 years ago
Essentially it is a modelling assumption. We assume Y depends on X (which is very reasonable, as the label is a function of X). We model this function with a distribution parameterized by theta, so P(Y|X; theta) depends on theta. We also assume that the Xs are just given to us (by mother nature). This is also very reasonable: essentially you assume someone gives you the data, and you predict the labels. So P(X) does not depend on theta. But the joint distribution *does* depend on theta, because it contains the conditional: P(Y, X; theta) = P(Y|X; theta) * P(X). Hope this helps.
@arihantjha4746 · 4 years ago
@@kilianweinberger698 Thank you for the reply. I understand.
@FlynnCz · 3 years ago
Hi Kilian, I have a doubt. Why do we assume a fixed standard deviation for the noise? Shouldn't we estimate it directly during the minimisation (or estimate several, if we allow the variance to be a function of x like the mean)? Thank you!
@kilianweinberger698 · 3 years ago
There is always a trade-off between what you assume and what you learn. If you make the model too general and attempt to learn the entire noise model, you could explain the entire data set as noise (i.e. your w is just the all-zeros vector, mapping everything to the origin, and all deviations from the origin are explained by a very large variance). So here we avoid this problem by fixing the variance to something reasonable and learning only the mean. There are, however, more complicated algorithms with more sophisticated noise models.
@FlynnCz · 3 years ago
​@@kilianweinberger698 Thanks a lot for your answer! Your videos are helping me a lot for my master thesis!
@PremKumar-vw7rt · 3 years ago
Hi Kilian, I have a doubt about the probabilistic perspective on linear regression. We assume that for every x_i there is a range of values for y_i, i.e. P(y_i | x_i), where x_i is a D-dimensional vector. So, when deriving the cost function, why do we use a univariate Gaussian distribution rather than a multivariate one?
@kilianweinberger698 · 3 years ago
Well, the model is y = w'x + t, where t is Gaussian distributed, so the noise is just 1-dimensional. You can of course make your noise model more complex, but you must make sure that, as it becomes more powerful, you don't explain the signal with the noise model.
@vatsan16 · 4 years ago
Often when people derive the loss function for linear regression, they just start directly from minimizing the squared error between the regression line and the points, that is, minimize sum((y - y_i)^2). Here, you start with the assumption that y_i has a Gaussian distribution and then arrive at the same conclusion via MLE. If we call the former method 1 and the latter method 2, where is the Gaussian distribution assumption hidden in method 1?
@kilianweinberger698 · 4 years ago
The Gaussian noise model is essentially baked into the squared loss.
@deepfakevasmoy3477 · 4 years ago
good question
@deepfakevasmoy3477 · 4 years ago
@@kilianweinberger698 Do we end up with the same loss function if we use another distribution from the exponential family for the error noise?
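As a side note on this question (this is standard material, not something covered at this point in the lecture): in general the loss does change with the noise model, because the loss is the negative log-density of the assumed noise. For example, Laplace-distributed noise with scale b yields the absolute (L1) loss instead of the squared loss:

```latex
\hat{w}
  = \operatorname*{argmax}_{w}\;\sum_{i=1}^{n}\log \frac{1}{2b}\exp\!\Big(-\frac{|y_i - w^\top x_i|}{b}\Big)
  = \operatorname*{argmin}_{w}\;\sum_{i=1}^{n}\big|y_i - w^\top x_i\big|
```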
@dmitriimedvedev6350 · 3 years ago
36:22 not clear: why is P(w) inside the log and thus inside the summation? Shouldn't it be the sum of log P(y_i | x_i, w) plus log P(w)? Could anyone please explain why we write P(D | w) * P(w) = Π[ P(y_i | x_i, w) * P(w) ], including P(w) INSIDE the product for every pair (y_i, x_i) from 1 to n?
@dmitriimedvedev6350 · 3 years ago
And I noticed this part is actually different from the lecture notes.
@saketdwivedi8187 · 4 years ago
How do we know the point at which to switch to Newton's method?
@kilianweinberger698 · 4 years ago
You take a Newton step and check if the loss went down. If it ever goes up, you undo that step and switch back to gradient descent.
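A minimal sketch of that fallback logic. The function names, learning rate, and the ridge example below are my assumptions for illustration, not course code; for a quadratic loss like ridge, Newton converges in one step so the fallback rarely fires, but the control flow is the same for other losses:

```python
import numpy as np

def newton_with_gd_fallback(loss, grad, hess, w0, lr=1e-2, steps=100):
    """Try Newton steps; if one ever increases the loss, undo it and
    continue with plain gradient descent from then on."""
    w = np.asarray(w0, dtype=float)
    use_newton = True
    for _ in range(steps):
        g = grad(w)
        if use_newton:
            w_try = w - np.linalg.solve(hess(w), g)   # Newton step
            if loss(w_try) < loss(w):
                w = w_try
                continue
            use_newton = False                        # loss went up: undo, switch to GD
        w = w - lr * g                                # gradient descent step
    return w

# Example with the ridge objective 0.5*||Xw - y||^2 + 0.5*lam*||w||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1
w_hat = newton_with_gd_fallback(
    loss=lambda w: 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * w @ w,
    grad=lambda w: X.T @ (X @ w - y) + lam * w,
    hess=lambda w: X.T @ X + lam * np.eye(X.shape[1]),
    w0=np.zeros(3),
)
```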
@saketdwivedi8187 · 4 years ago
@@kilianweinberger698 Thanks so much for the response.
@shashankshekhar7052 · 5 years ago
For Project 4 (ERM), the data_train file is given, which consists of the bag-of-words matrix (X), but we don't have the labels (y) for it. What's the workaround? Can you please help?
@vivekmittal2290 · 5 years ago
Where did you get the project files from?
@SAINIVEDH · 4 years ago
@@vivekmittal2290 Did you get em ?
@bbabnik · 3 years ago
@@SAINIVEDH hey, can you please share them? I only found Exams and Homeworks.
@omalve9454 · 1 year ago
Gold
@gregmakov2680 · 2 years ago
hahahaha, exactly, statistics always try to mess things up :D:D
@JoaoVitorBRgomes · 3 years ago
But if I assume that in MAP the prior is something like a Poisson, is it going to give me the same results as the MLE? Are they supposed to give the same theta/w? @kilian weinberger, thank you, prof.!
@kilianweinberger698 · 3 years ago
Well, if you don't use the conjugate prior, things can get ugly, purely from a mathematical point of view.