
Machine Learning Lecture 23 "Kernels Continued Continued" -Cornell CS4780 SP17

14,112 views

Kilian Weinberger


Comments: 25
@trollsofalabama · 4 years ago
Dr. Kilian, these lectures have been invaluable to me; I cannot be thankful enough for them. From what I understand, kernels help deal with high bias, and regularization helps deal with high variance. My question is this: say you perform the data-subsampling analysis from Lecture 21 to decide whether you still have to deal with high bias and/or high variance. You have already finished tuning hyperparameters for, say, Huber loss + L2 regularization (with the L2 regularization reducing the variance), and you have enough data points that you're firmly in the high-bias regime. If you switch to kernel regression with Huber loss and that takes care of your high-bias problem, but you no longer have the L2 regularization, wouldn't you potentially now have a variance problem? Is there a kernel version of regularization that we're supposed to use? Or is this perhaps not an actual issue to begin with? Edit: I was trained as a computational quantum chemist/materials researcher. I ran into this video kzbin.info/www/bejne/fXumlnafiN2Kjdk, and I am shocked. It talks about kernel ridge regression, which answers my question, but I also can't believe it's possible to bypass the physics to get the physics; that's insane. Dr. Kilian, if you get a chance, it's a great video.
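For reference, a minimal kernel ridge regression sketch (Python/NumPy; the helper names are made up for illustration): the lam term below plays the role of the L2 regularizer asked about above, so regularization carries over to the kernelized setting.

import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # Gram matrix of k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1.0, sigma=1.0):
    # Closed-form solution: alpha = (K + lam * I)^(-1) y
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_test, sigma=1.0):
    # h(z) = sum_i alpha_i * k(x_i, z)
    return rbf_kernel(X_test, X_train, sigma) @ alpha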
@sandeepreddy6295 · 4 years ago
A very good lecture.
@gauravsinghtanwar4415 · 4 years ago
What is the variable "z" used in the kernel lectures? Danke schön!
@iskandaratakhodjaev9479 · 4 years ago
In the case of nonlinear decision boundaries, kernelization is one way to solve the problem, since linear classifiers are powerless. In the case of linear decision boundaries, is it fair to say that kernels are preferable only when d > n, where d is the dimensionality and n the number of examples? In other words, if we can forget about the weights entirely and switch to the x_i, then why were all the earlier lectures about learning the weights via gradient descent, Newton's method, etc.?
@kilianweinberger698 · 4 years ago
Well, linear classifiers can actually be very powerful, provided they operate in a high dimensional space. So if d>>0 (e.g. text data with bag-of-words features), often a linear classifier is highly competitive. If d is small, then they have very few parameters and only few ways to separate two classes. So it is almost the other way round, if d is small use non-linear kernels, if d is large try linear classifiers first (as they are less prone to overfitting). Having said this, kernelization has computational advantages if d>>n, but you can always use a linear kernel to get the best of both worlds.
@iskandaratakhodjaev9479 · 4 years ago
@kilianweinberger698 Thank you very much, Professor!
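To illustrate the computational point above, a small sketch (Python/NumPy, numbers made up): with a linear kernel the Gram matrix is n x n no matter how large d is, so a kernelized solver represents the same linear classifier while only ever touching an n x n matrix.

import numpy as np

n, d = 100, 50000          # few examples, very high-dimensional features
X = np.random.randn(n, d)

K = X @ X.T                # linear-kernel Gram matrix: n x n, independent of d
# Any kernelized algorithm that only reads K now works with a 100 x 100 matrix,
# yet it still corresponds exactly to a linear classifier in the original d dimensions.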
@shike1983 · 4 years ago
In the example on sets, I guess the proof assumes the sets are countable. What if they are only measurable? Say intervals, with |S_1 ∩ S_2| being their measure. I guess it should still work, correct?
@kilianweinberger698 · 4 years ago
It probably gets more hairy, but my guess would be that it all works out. Never worked through it though, was mostly meant to be an example ...
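For the countable case discussed above, the set kernel from the lecture reduces to an inner product of indicator vectors; a tiny sketch (Python, illustrative only):

def set_kernel(S1, S2):
    # k(S1, S2) = |S1 ∩ S2|: the inner product of the two sets' indicator
    # vectors over a countable universe, hence a valid (positive semi-definite) kernel.
    return len(set(S1) & set(S2))

print(set_kernel({1, 2, 3}, {2, 3, 5}))   # 2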
@evanm2024 · 3 years ago
Hi Professor, this series is wonderful. But as far as the concept of kernels goes, I've worked through the math in the notes, and I've worked through the material in a couple of textbooks. I agree that the math works, but I feel like I have zero intuition for what's going on with an arbitrary kernel. (I've watched your lecture on Gaussian processes and I think I get that now, but that's one specific case that makes additional assumptions.) Do you have any toy examples or resources that might give intuition for what's "really going on" with kernels?
@jachawkvr · 4 years ago
I learned that kernelization can help linear classifiers learn complex decision boundaries, reducing the bias. However, in the case of regression, we are fitting a line/hyperplane to the data. How would kernelization help in that case?
@kilianweinberger698 · 4 years ago
Essentially in the same way. You try to model your data with a non-linear function. So if your feature were on the x-axis and your label on the y-axis, you don't draw a straight line through your data but can now draw a curved line. Hope this makes sense.
@jachawkvr · 4 years ago
I can certainly see the similarity here. Thank you so much for making this clear!
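A toy 1-D illustration of the point above (Python/NumPy, a sketch only): with an RBF kernel the fitted regressor is a curve through the data rather than a straight line.

import numpy as np

X = np.linspace(0, 6, 40)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(40)

sigma, lam = 0.5, 0.1
K = np.exp(-(X - X.T) ** 2 / (2 * sigma ** 2))        # 40 x 40 RBF Gram matrix (1-D inputs)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)  # kernel ridge coefficients
y_hat = K @ alpha                                     # follows the sine wave, not a line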
@Tyokok · 7 months ago
Hi, if anyone can advise: at 40:00, since we have a closed-form solution, if we need to implement ridge regression with a kernel, we don't use gradient descent but the closed form directly, is that correct? I ask because in Lecture 21 at 16:40 Prof. Kilian gave the iterative gradient descent solution, and when I tried to implement it, it diverged very quickly and was very sensitive to the step size. Thank you!
@kilianweinberger698 · 1 month ago
The closed form solution only takes one step, but requires an O(d^3) operation (either inverting the Hessian matrix, or a linear solve). So if your data is very high dimensional, SGD can still be a lot faster.
@Tyokok · 1 month ago
@kilianweinberger698 Wow, Prof. Kilian! It's a great honor to get a reply from you yourself, thank you so much! Got your point. And wishing you and your family all the best!
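To make the trade-off above concrete, a sketch of both options for kernel ridge regression (Python/NumPy, hypothetical helper names): the kernelized closed form is a single n x n linear solve, while gradient descent on alpha needs a step size that shrinks with the largest eigenvalue of K squared, which is why a too-large learning rate diverges quickly.

import numpy as np

def kernel_ridge_closed_form(K, y, lam=1.0):
    # One linear solve: alpha = (K + lam * I)^(-1) y
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_gd(K, y, lam=1.0, lr=None, steps=5000):
    # Gradient descent on L(alpha) = ||K alpha - y||^2 + lam * alpha' K alpha.
    # Gradient: 2 K (K alpha - y + lam * alpha). Because K enters squared,
    # a safe step size scales like 1 / eigmax(K)^2, so the method is very
    # sensitive to lr and diverges easily if lr is too large.
    if lr is None:
        emax = np.linalg.eigvalsh(K).max()
        lr = 0.5 / (emax ** 2 + lam * emax + 1e-12)   # safely below the divergence threshold
    alpha = np.zeros(len(y))
    for _ in range(steps):
        alpha -= lr * 2 * K @ (K @ alpha - y + lam * alpha)
    return alpha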
@ayushmalik7093 · 2 years ago
Hi Prof, why didn't we treat x^T x as k(x,x) at 23:34, just like we did at 33:22?
@sudhanshuvashisht8960 · 4 years ago
Great lecture, as usual, Professor. I've noticed we don't really talk about (or at least you didn't give a name to) the kernel we get after mapping our d-dimensional data set to 2^d dimensions, i.e. k(x,z) = (1 + x_1 z_1)(1 + x_2 z_2)...(1 + x_d z_d). I understand RBF is the most popular, but is there any other reason I'm missing?
@saad7417 · 4 years ago
Polynomial kernel with degree 1, I guess.
@kilianweinberger698 · 4 years ago
Close, but the polynomial kernel k(x,z) = (c + x'z)^d with degree d=1 becomes the linear kernel. Come to think of it, I don't know the name of the product kernel I introduced in class. If you find out, let me know. :-)
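Whatever its name, the product kernel's feature map is easy to check numerically: expanding prod_d (1 + x_d z_d) gives one monomial per subset of coordinates, i.e. 2^d features. A small sketch (Python/NumPy, illustrative only):

from itertools import combinations
import numpy as np

def product_kernel(x, z):
    # k(x, z) = prod_d (1 + x_d * z_d)
    return np.prod(1.0 + x * z)

def subset_feature_map(x):
    # One feature per subset S of coordinates: phi_S(x) = prod_{i in S} x_i,
    # 2^d features in total (the empty subset contributes the constant feature 1).
    d = len(x)
    return np.array([np.prod(x[list(S)]) for r in range(d + 1)
                     for S in combinations(range(d), r)])

x, z = np.random.randn(4), np.random.randn(4)
print(np.isclose(product_kernel(x, z), subset_feature_map(x) @ subset_feature_map(z)))  # True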
@karthikeyakethamakka · 2 years ago
Wouldn't the Central Limit Theorem be the reason why the RBF is a good kernel?
@pratikkulkar8128 · 2 years ago
Once we have alpha, how do we make test predictions? Do we first calculate alpha and then calculate w using the mentioned formula and make the prediction? Is that right?
@kilianweinberger698 · 1 year ago
No, you can express h(z) directly in terms of alpha. Basically, h(z) = \sum_i y_i \alpha_i k(x_i, z) + b. You will have to solve for the bias term, b. Details are here: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote14.html
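Following the lecture-notes form above, a prediction sketch (Python; names hypothetical, and solving for b is omitted):

def predict(z, X_train, y_train, alpha, b, kernel):
    # h(z) = sum_i alpha_i * y_i * k(x_i, z) + b
    return sum(a * y * kernel(x, z) for a, y, x in zip(alpha, y_train, X_train)) + b

# e.g. with NumPy vectors, a linear kernel would be: kernel = lambda x, z: float(x @ z)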
@erenyeager4452 · 3 years ago
This man's fascination with Julia makes me think his wife's name is Julia. Is that so Prof?
@kilianweinberger698 · 3 years ago
It is not … but your comment made her a little worried ;-)
@ibrahimhammoud2165 · 5 years ago
Amazing lectures, thank you! By the way, inspired by this I posted this question on Cross Validated: stats.stackexchange.com/questions/416912/is-kernalized-linear-regression-parametric-or-nonparametric