The best lectures in machine learning that I've listened to online. Thanks, Professor Yaser.
@johnallard24299 жыл бұрын
Such a great professor, he is so clear in his explanations.
@helenlundeberg9 жыл бұрын
agreed.
@alexdamado6 жыл бұрын
Indeed, he is an amazing professor. It is clear that he knows his mathematics. He is confident in his responses and goes into a little more depth than the usual machine learning professor. I really appreciate his desire to teach the theory behind the applications.
@cosette85705 жыл бұрын
I’m thrilled to have studied at Caltech for this course and actually talked to him! He’s so nice!
@sivaramkrishnanagireddy83414 жыл бұрын
@@cosette8570 What is your annual salary now?
@kNowFixx4 жыл бұрын
@@sivaramkrishnanagireddy8341 Very weird question and I'm not surprised the person you replied to didn't bother to respond.
@rvoros12 жыл бұрын
Prof. Yaser Abu-Mostafa is by far the best lecturer I've ever seen. Well done, great course!
@atfchan12 жыл бұрын
Thank you, Prof. Yaser Abu-Mostafa for these lectures. The concepts are concisely and precisely explained. I especially like how he explains the equations with real world examples. It makes the course material much more approachable.
@parvathysarat7 жыл бұрын
Wow. This lecture was great. The difference between using linear regression for classification and using linear classification outright couldn't have been explained better (I mean it was just amazing).
@Tinou4900010 жыл бұрын
Great, great lecture. Thank you, Prof. Yaser Abu-Mostafa, it is clear and well presented!
@jonsnow92467 жыл бұрын
What is the graph shown at 15:42, and why are there ups and downs in Ein?
@laurin15106 жыл бұрын
Jon Snow, in case it still matters: when you iterate in the PLA you adjust the weights of the perceptron in each iteration, so in each iteration you might classify more points correctly or fewer, and this is reflected in the in-sample error as well as the out-of-sample error.
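For anyone who wants to see this concretely, here is a minimal sketch of a PLA loop on made-up 2D data, tracking E_in after every update (this is not the course's code, just an illustration of why the error can bounce around):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, 2))])  # x0 = 1 absorbs the threshold
w_true = np.array([0.1, 1.0, -1.0])                           # made-up separable target
y = np.sign(X @ w_true)

w = np.zeros(3)
for it in range(100):
    misclassified = np.where(np.sign(X @ w) != y)[0]
    E_in = len(misclassified) / N        # fraction of points currently wrong
    print(it, E_in)                      # this is the jagged curve: it can go up or down
    if len(misclassified) == 0:
        break
    i = rng.choice(misclassified)        # pick one misclassified point at random...
    w = w + y[i] * X[i]                  # ...and update the weights toward fixing it
```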
@gautamkarmakar34438 жыл бұрын
What a great authority, fielding questions like a true pundit on the subject. Great respect and thanks a lot.
@prasanthaluru543311 жыл бұрын
For the first time I am finding machine learning interesting and learnable. Thank you very much, sir.
@ahmednasrzc8 жыл бұрын
Thank you, Prof. Yaser Abu-Mostafa. Very, very clear and instructive; really, it's hard to find a genius explanation like that.
@manjuhhh10 жыл бұрын
Thank you Caltech and Prof.
@avidreader1008 жыл бұрын
I learnt a method in another class where the primary classification features were selected based on things that cause the maximum change in entropy.
@EE-yv7xg8 жыл бұрын
Decision Tree Algorithm.
@roelofvuurboom54313 жыл бұрын
A clean classification represents the lowest level of entropy (things aren't "muddled"). So going from the current situation to the lowest level of entropy will result in a maximum change in entropy.
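For reference, a rough sketch of that entropy / information-gain criterion (decision-tree-style feature selection; this is not from this lecture, and the toy labels and split are made up):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

y = np.array([1, 1, 1, 0, 0, 1, 0, 0])        # made-up binary labels
feature = np.array([5, 6, 7, 1, 2, 8, 0, 3])  # made-up numeric feature

# Splitting on feature > 4 separates the classes perfectly, so the entropy
# after the split is 0 and the change in entropy (information gain) is maximal.
mask = feature > 4
gain = entropy(y) - (mask.mean() * entropy(y[mask]) + (~mask).mean() * entropy(y[~mask]))
print(gain)  # 1.0 bit here, the largest possible for a balanced binary label
```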
@nias26315 жыл бұрын
Dude is a rockstar, just threw my panties at my screen again... Great lecturer.
@rezagraz11 жыл бұрын
What an amazing and charming teacher...I wanna have a beer with him.
@sanuriishak3085 жыл бұрын
He is Muslim... he doesn't drink beer. Let's have coffee with him instead.
@abubakarali62794 жыл бұрын
Any new material from Yaser?
@your_name963 жыл бұрын
@@sanuriishak308 I think he prefers to 'sip on tea while the machine learns'
@xelink10 жыл бұрын
Good lecture. I think the question had to do with correlation between a transformed feature and the original feature. This is describing the problem of multicollinearity. With multicollinearity, numerical instability can occur. The weight estimates are unbiased (on average you'd expect them to be correct) but they're unstable: running with slightly different data might give different estimates. E.g., estimating SAT scores: given weight, height and parents' occupation, one might expect all three to be correlated, and the OLS algorithm won't know which to properly credit.
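A small illustration of the instability being described (all numbers are made up): with two nearly identical features, the least-squares weights swing wildly between resamples even though their sum stays sensible.

```python
import numpy as np

rng = np.random.default_rng(1)
for trial in range(3):
    N = 50
    x1 = rng.normal(size=N)
    x2 = x1 + 0.01 * rng.normal(size=N)          # almost perfectly correlated with x1
    X = np.column_stack([np.ones(N), x1, x2])
    y = 3 * x1 + 0.1 * rng.normal(size=N)        # the target really depends on x1 only
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(trial, np.round(w, 2))                 # w[1], w[2] vary a lot; w[1] + w[2] stays near 3
```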
@brainstormingsharing13094 жыл бұрын
Absolutely well done and definitely keep it up!!! 👍👍👍👍👍
@btsjiminface4 жыл бұрын
Such a compassionate lecturer 🥺
@AniketMane-dn5uh3 жыл бұрын
Howard Wolowitz is so good! One of the best lectures though.
@abdurrezzakefe53087 жыл бұрын
A million times better than my professor.
@melihcan84677 жыл бұрын
Great lecture & channel. Thanks for such an opportunity.
@LucaMolari11 жыл бұрын
Thank you, great and relaxing lectures!
@Mateusz-Maciejewski5 жыл бұрын
37:10 The formula is correct if we define the gradient as the Jacobian matrix transposed, not just the Jacobian matrix. In optimization this convention is very helpful, so I think he uses it.
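For reference, the formula under discussion (as it appears on the slide, up to notational convention):

$$E_{\text{in}}(w) = \frac{1}{N}\,\lVert Xw - y\rVert^{2}, \qquad \nabla E_{\text{in}}(w) = \frac{2}{N}\,X^{\mathsf T}(Xw - y).$$

Setting the gradient to zero gives $X^{\mathsf T}Xw = X^{\mathsf T}y$, i.e. $w = X^{\dagger}y$ with the pseudo-inverse $X^{\dagger} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}$.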
@pippo11167 жыл бұрын
I like watching at 1.5x speed, works perfectly.
@elsmith12375 жыл бұрын
x1.25 for me!
@dimitriosalatzoglou5033 Жыл бұрын
Amazing content. Thank you.
@edvandossantossousa4553 жыл бұрын
a great lecture. Thanks for sharing it.
@nischalsubedi94324 жыл бұрын
What is the difference between the simple perceptron algorithm and linear classification algorithm?
@ragiaibrahim66489 жыл бұрын
AWESOME, lots of thanks.
@danielgray80533 жыл бұрын
How come no one here in the comments admits that they are confused? I don't understand why college professors are so out of touch with their students. Why does learning hard engineering topics always have to be so much of a struggle? He is teaching as if we have years of experience in the field.
@danielgray80533 жыл бұрын
Like, in Lecture 2 he mentioned that E_in was used in place of nu (ν), which is basically the sample mean. Now he is using it here in a completely different context without explaining why. It seems like a lot of other students here have the same question. PLEASE SOMEONE OUT THERE JUST LEARN HOW TO TEACH PROPERLY ugh
@Milark5 ай бұрын
Being confused doesn’t mean the professor is incapable. It’s a natural part of the learning process. If no one teaches right according to you, the problem might not be the teacher.
@harpreetsinghmann7 жыл бұрын
Key takeaway: linearity in the weights, not in the "variables". @52:00
@huleinpylo390612 жыл бұрын
Interesting lecture. His examples help a lot in understanding the class.
@FsimulatorX2 жыл бұрын
Wow. This is the oldest comment I’ve seen on these videos so far. How are you doing these days? Did you end up pursuing machine learning?
@huleinpylo39062 жыл бұрын
@@FsimulatorX nope, I am working in cyber security.
@pyro2265 жыл бұрын
Not sure I'm a fan of the "symmetry" measure. The number 8 in that example is clearly offset from center; it would only have apparent symmetry because it's a wide number with a lot of black space. If a 1 is slanted and off center, it will have nearly zero apparent symmetry, because only its center point would have vertical symmetry. Oh well, we'll see where it goes.
@roelofvuurboom54313 жыл бұрын
You probably would use a bit more sophistication. After you flip the number you could "slide" it over the original number looking for the maximum matching value.
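For the curious, a minimal sketch of the two features as I understand them from the lecture (the exact definitions and scaling on the slides may differ); `img` is assumed to be a 16x16 grayscale array with values in [0, 1]:

```python
import numpy as np

def intensity(img):
    return img.mean()                        # average amount of "ink" in the image

def symmetry(img):
    flipped = np.fliplr(img)                 # mirror the image left-right
    return -np.abs(img - flipped).mean()     # 0 = perfectly symmetric, more negative = less so

img = np.zeros((16, 16))
img[2:14, 7:9] = 1.0                         # a crude, centered "1"
print(intensity(img), symmetry(img))         # symmetric by construction, so symmetry = 0
```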
@fabiof.deaquino47316 жыл бұрын
"Surprise, surprise"... What a great professor!
@linkmaster9595 жыл бұрын
I thought linear regression was for fitting data to a line to forecast future values. But here it is explained in terms of a separation boundary for classification. Can someone explain?
@roelofvuurboom54313 жыл бұрын
For ordinary fitting you look for the line that minimizes the (squared) distance of ALL points to that line. In the classification problem the error only counts the WRONGLY CLASSIFIED points; the trick in the lecture is to run the same least-squares fit on the ±1 labels anyway, and then classify each point by the sign of the linear output.
@googlandroid1762 жыл бұрын
The calculation sets the derivative to zero, which I guess means finding where the slope is 0, but what if there is more than one such spot? How is the global minimum guaranteed?
@indatawetrust1013 жыл бұрын
When he says hypotheses h1, h2, etc. does he mean different hypotheses that fit same general form (e.g. all 2nd order polynomials) or different hypotheses forms (e.g. linear, polynomial, etc.)? Thanks
@douglasholman63005 жыл бұрын
Just to clarify, what piece of theory guarantees that the in-sample error will track the out-of-sample error, i.e. E_in ≈ E_out?
@roelofvuurboom54313 жыл бұрын
That there is a probability distribution on X. What this says (more or less) is that what I saw happen in the past (i.e. what I selected, which drives my in-sample error) says something about what will happen in the future. The "what will happen in the future" part is a statement about my out-of-sample data and drives my out-of-sample error. Hoeffding's inequality places numerical constraints on the relationship and is based on this probability assumption.
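For reference, the bound being invoked, i.e. Hoeffding's inequality as stated in Lecture 2 for a single fixed hypothesis (the multiple-hypothesis version picks up a union-bound factor M):

$$\mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big] \;\le\; 2e^{-2\epsilon^{2}N}.$$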
@mohamedelansari542710 жыл бұрын
Hello, it's a nice lecture!! Thanks to Caltech and Prof. Yaser. Can anyone tell me where I can get the corresponding slides and textbooks? Thanks
@samis.a582410 жыл бұрын
work.caltech.edu/lectures.html#lectures
@mohamedelansari542710 жыл бұрын
Sami Albouq Many thanks
@naebliEcho11 жыл бұрын
Anybody have homeworks accompanying these lectures? I don't have a registered account for the course and the registration is closed now. :(
@PerpetualEpiphany10 жыл бұрын
this is really good
@mhchitsaz12 жыл бұрын
fantastic lecture
@astyli11 жыл бұрын
Thank you for uploading this!
@mackashir8 жыл бұрын
In the previous lectures E(in) was used for in-sample performance. Was it switched to in-sample error in this lecture? Am I missing something?
@Bing.W7 жыл бұрын
Performance is a rough wording, and error is one way to really evaluate the performance.
@solsticetwo34766 жыл бұрын
I see it as follows: E(in) is the fraction of "red" marbles, which is the fraction of wrong predictions by your hypothesis, i.e. the error of that h. That fraction is the probability 'q' of a Bernoulli distribution, whose expected value is E(q) = q.
@roelofvuurboom54313 жыл бұрын
@@solsticetwo3476 The other way around. E(out) is the fraction of red marbles, i.e. the fraction of wrong predictions by your hypothesis. This value has nothing to do with a probability distribution. E(in), the in-sample error, is coupled to the selection of the sample and hence is coupled to a probability distribution.
@Michellua7 жыл бұрын
Hello, where can I find the dataset used here, to implement the algorithms?
@hetav67144 жыл бұрын
The digit dataset is called the MNIST dataset.
@rahulrathnakumar7855 жыл бұрын
Great lecture overall. However, I couldn't really understand how to implement linear regression for classification...
@roelofvuurboom54313 жыл бұрын
With your training data you know the correct label (±1) for every point. Treat those labels as real-valued regression targets and find, in one shot, the weights that minimize the squared error over ALL points (via the pseudo-inverse). You then classify a point by the sign of the linear output. If you want to push the classification error down further, those weights also make a good starting point for the perceptron/pocket algorithm.
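A minimal sketch of that recipe in Python (not the course's official code; the toy data and function names are made up):

```python
import numpy as np

def linear_regression_weights(X, y):
    """X: (N, d) inputs without the bias column; y: labels in {-1, +1}."""
    Xa = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1
    return np.linalg.pinv(Xa) @ y               # one-shot least squares: w = X^dagger y

def classify(w, X):
    Xa = np.hstack([np.ones((len(X), 1)), X])
    return np.sign(Xa @ w)                      # threshold the linear output at 0

# made-up linearly separable example
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1)
w = linear_regression_weights(X, y)
print((classify(w, X) != y).mean())             # in-sample misclassification fraction
```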
@DaneJessen1019 жыл бұрын
Great lecture! Does the use of features ('features' are "higher level representations of raw inputs") increase the performance of a model out of sample? Does it somehow add information? Or does it simply make it computationally easier to produce a model? I'm working on a problem where this could potentially be very useful. I could also see how the use of features could make a model more meaningful to human interpretation, but there is a risk as well that interpretations will vary between people based on what words are being used. 'Intensity' and 'symmetry' are used here, which are great examples, but it could very quickly get more abstract or technical. Thank you in advance to anyone who has an answer to my question!
@rubeseba9 жыл бұрын
It depends on whether your features could be learned implicitly by your model. That is, let's say your original data are scores on two measures: IQ and age, and you want to use those to predict people's salaries. Let's also assume that the true way in which those are related is: salary = (IQ + age)*100 + e, where e is some residual error not explained by these two variables. In this case you could define a new feature that is the sum of IQ and age, and this would reduce the number of free parameters in your model, making it slightly easier to fit. Given enough data to train on however, your old model would perform just as well, because the feature in the new model is a linear combination of features in the old model. (That is, in the old model you would have w1 = w2 = 100, whereas in the new one you would just have w1 = 100.) Often, however, we define new features not (just) to reduce the number of model parameters, but to deal with non-linearities. In the example of the written digits, you can't really predict very well which digit is written in an image by computing a weighted sum over pixel intensities, because the mapping of digits to pixel values happens in a higher order space. So in this case we can greatly improve the performance of our model if we define our features in the same higher order space. The reason is not that we add information that wasn't in the data before, but that the information wasn't recoverable by our linear model.
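To make the second point concrete, here is a small sketch (made-up data, not from the course) where the same linear model fails on the raw inputs but does well on squared features, because the true boundary is a circle:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 0.5)        # +1 outside a circle, -1 inside

def fit_and_error(Phi, y):
    w = np.linalg.pinv(Phi) @ y                   # least-squares fit on +/-1 targets
    return (np.sign(Phi @ w) != y).mean()         # misclassification fraction

Phi_raw = np.column_stack([np.ones(len(X)), X])             # features (1, x1, x2)
Phi_sq  = np.column_stack([np.ones(len(X)), X**2])          # features (1, x1^2, x2^2)
print(fit_and_error(Phi_raw, y), fit_and_error(Phi_sq, y))  # raw does poorly; squared is near 0
```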
@DaneJessen1019 жыл бұрын
rubeseba - That was very helpful. Thank you!
@DaneJessen1019 жыл бұрын
Kathryn Jessen Kathryn - I was born without the ability to be a mom :/ I will never experience the depth and vastness of a mother's understanding. I can sure pick a thing or two and try to pretend ;)
@andysilv7 жыл бұрын
It seems to me that there is a small typo on the 18th slide (48:25). To perform classification using linear regression, it seems one needs to check sign(wx - y) rather than sign(wx).
@李恒岳6 жыл бұрын
This really confused me until I read your comment!
@李恒岳6 жыл бұрын
Watching this video a second time made it clear to me that there is no mistake. The threshold value is contained in w0, so sign(wx) is correct.
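For reference, the convention used throughout the course (as far as I can tell): the input is augmented with a constant coordinate, so the threshold lives in w_0, and y only appears in the error measure, not in the hypothesis:

$$h(x) = \operatorname{sign}(w^{\mathsf T}x), \qquad x = (1, x_1, \dots, x_d), \qquad w^{\mathsf T}x = w_0 + w_1 x_1 + \dots + w_d x_d.$$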
@Omar-kw5ui4 жыл бұрын
Xw gives the output for each data point in the training set; taking the sign of each output gives you its classification.
@ishanprasad9105 жыл бұрын
Quick question, what is the y-axis label at 20:00? What probability are we tracking for E_in and E_out?
@blaoi15625 жыл бұрын
E_in and E_out represent the in-sample error and the out-of-sample error, respectively. Usually you don't have access to E_out, the out-of-sample error, but you know that the in-sample error approximates the out-of-sample error better the more data you have. The y-axis represents the error fraction on the data, while the x-axis represents the iterations.
@abhijeetsharma57153 жыл бұрын
On the y-axis, we are tracking the fraction of mislabeled examples. So E_in is the fraction of training-set examples that we got wrong. Similarly, E_out is the fraction of examples (not from the training set) that we got wrong.
@shahardagan15846 жыл бұрын
What should I do if I didn't understand all the math in this lecture? Do you have a resource that explains it quickly?
@majorqueros68125 жыл бұрын
Some videos on statistics/probability and linear algebra would already help a lot. Khan academy has many great videos. www.khanacademy.org/math/statistics-probability www.khanacademy.org/math/linear-algebra
@MrCmon1135 жыл бұрын
Depends on what you don't understand. There should be introductory courses to linear algebra, analysis and stochastics at your university.
@fndTenorio10 жыл бұрын
Fantastic!
@AndyLee-xq8wq Жыл бұрын
Great!!
@movax20h8 жыл бұрын
How was E_out computed in each iteration? Was it done using a subsample of the given sample, with E_out estimated on the full sample?
@fuadassayadi18 жыл бұрын
+movax20h You cannot calculate E_out because you do not know the whole population of samples. But E_in can tell you something about E_out. This relation is explained in Lecture 02 - Is Learning Feasible?
@movax20h8 жыл бұрын
+fouad Mohammed That is exactly why I am asking. The graph clearly shows E_out being calculated somehow. I guess this is done using validation techniques from one of the later lectures. Anyway, this is a synthetic example, so it is not hard to generate a target function that is known to us (but "unknown" to the learner), and to generate as many training and test examples as you want, just for the example's sake. I do not believe it was calculated via Hoeffding, because that is a probabilistic inequality, and it would actually lead to circular reasoning here: let's use it to predict E_out, and then use this prediction to claim that E_in tracks E_out well. That might be correct in a probabilistic sense, but it is not a good way of demonstrating it at all.
@abaskm8 жыл бұрын
Assume that the data was already labeled and used to generate E_in (the test set). Take a portion of this data (the training sample) and generate your hypothesis. You can then use that hypothesis to measure E_in (on the training data) and E_out (on the test set). He's making the assumption that the whole set is labeled, which doesn't usually apply in the real world.
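A small sketch of that setup (everything here is made up): with a synthetic target you simply hold out a large labeled set and measure E_out on it directly.

```python
import numpy as np

rng = np.random.default_rng(4)
w_target = np.array([0.2, 1.0, -1.0])               # the "unknown" target, known to us

def sample(n):
    X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, (n, 2))])
    return X, np.sign(X @ w_target)

X_train, y_train = sample(100)                       # small training set -> E_in
X_test,  y_test  = sample(100_000)                   # large held-out set -> estimate of E_out

w = np.linalg.pinv(X_train) @ y_train                # fit by linear regression on +/-1 labels
E_in  = (np.sign(X_train @ w) != y_train).mean()
E_out = (np.sign(X_test  @ w) != y_test ).mean()
print(E_in, E_out)
```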
@raulbeienheimer3 ай бұрын
This is not the Squared Error but the Mean Squared Error.
@wayenwan11 жыл бұрын
Are there subtitles available?
@ajkdrag5 жыл бұрын
I didn't understand how he obtained X-transpose after the differentiation.
Why does learning only occur in a probabilistic sense? What other way could it be?
@roelofvuurboom54313 жыл бұрын
Literally, learning in a non-probabilistic (absolute, certain) sense. However, this runs up against the so-called problem of induction, first described by the philosopher Hume (you can google it). In our context here the induction problem can be translated to: "If I pick a number of balls and they always turn out to be green, can I conclude (with certainty) that all balls in the bin are green?" In the lecture the statement is made that you can't. The philosophical discussion is a bit more nuanced. In any case, machine learning avoids this discussion by sidestepping statements made with certainty and moving to (weaker) probabilistic statements.
@JoaoVitorBRgomes3 жыл бұрын
@@roelofvuurboom5431 great reply, thanks! I loved how you tied to Hume. I didn't think of this connection. Thanks for linking things.
@jonsnow92467 жыл бұрын
What is the graph shown at 15:42, and why are there ups and downs in Ein?
@solsticetwo34766 жыл бұрын
Jon Snow The error on the sample set could increase in the next iteration if the algorithm changes the hypothesis (weights) in a way that hurts the classification. The PLA is a random walk in the weight subspace.
@Omar-kw5ui4 жыл бұрын
@@solsticetwo3476 PLA is not really a random walk in the weight subspace... The algorithm adjusts the weights for a given (randomly chosen) misclassified point. Fixing the weights so as not to misclassify this point may cause other points that were previously correctly classified to become misclassified. Hence the rise in error, followed by a drop, etc. The update rule is well defined even for non-separable datasets, so you can't really call it a random walk; it clearly has a set of rules it's following.
@chujgowi11 жыл бұрын
Check the iTunes page; there are homeworks and solutions available for free for this course.
@rippa91112 жыл бұрын
Thank you.
@SphereofTime7 ай бұрын
6:14
@auggiewilliams35657 жыл бұрын
2.8 doesn't happen at Caltech..... 3.8 doesn't happen at Tribhuvan University
@sarthakmishra69156 жыл бұрын
LOL
@abubakarali62794 жыл бұрын
How did we get X^T at 38:25?
@roelofvuurboom54313 жыл бұрын
In linear algebra the squared norm of a vector v is defined as ||v||^2 = (v^T)(v). Apply this with v = Xw - Y to expand ||Xw - Y||^2; the X^T then appears when you differentiate with respect to w.
@karlmadl2 жыл бұрын
This comment is old and the OP has probably figured it out by now. However, for anyone else who wonders: it has to do with notation. The derivative of Xw with respect to w is either X or X^T depending on your layout convention. Here we're using denominator-layout notation, so we get X^T. (en.wikipedia.org/wiki/Matrix_calculus#:~:text=displaystyle%20%5Cmathbf%20%7BI%7D%20%7D-,A%20is%20not%20a%20function%20of%20x,%7B%5Cdisplaystyle%20%5Cmathbf%20%7BA%7D%20%7D,-%7B%5Cdisplaystyle%20%5Cmathbf%20%7BA) The natural follow-up question is why we use one notation over the other. Notation choices are often just the choice of the author, and can make formulae more succinct or more clear. I think this answers the question without going off on too much of a tangent; any further questions are probably best answered by your own inquiry.
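If it helps, here is a quick numerical sanity check of the denominator-layout result (random made-up matrices; not from the lecture): the gradient of ||Xw - y||^2 with respect to w matches 2 X^T (Xw - y).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

analytic = 2 * X.T @ (X @ w - y)

# central finite differences, one coordinate of w at a time
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    e = np.zeros(3); e[j] = eps
    f_plus  = np.sum((X @ (w + e) - y) ** 2)
    f_minus = np.sum((X @ (w - e) - y) ** 2)
    numeric[j] = (f_plus - f_minus) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # tiny (roundoff-level) difference
```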
@TomerBenDavid8 жыл бұрын
Perfect
@ZeeshanAliSayyed9 жыл бұрын
"+1" and "-1" among other things happen to be real numbers! LOL
@vyassathya37728 жыл бұрын
+Zeeshan Ali Sayyed There is something genius about the simplicity though lol
@ZeeshanAliSayyed8 жыл бұрын
Vyas Sathya Indeed. :P
@bryanchambers19645 жыл бұрын
I wish he showed how to write the algorithms in Python because he teaches very well.
@FsimulatorX2 жыл бұрын
There are plenty of other resources for that. Once you understand the theoretical component, implementation becomes easy
@Dwright33168 жыл бұрын
42 minutes !! BANG !!
@SJohnTrombley7 жыл бұрын
People don't get 2.8's at Caltech? I smell grade inflation.
@MrCmon1135 жыл бұрын
I understood that as people with 2.8 or better not going there. Who takes courses in all of those fields at university?
@pyro2265 жыл бұрын
32:22 MSE, mean squared error, for those with a statistics background.
@HendSelim8711 жыл бұрын
what's wrong? the video is not opening.
@fahimhossain1654 жыл бұрын
Never imagined that I'd learn ML from Emperor Palpatine himself!
@akankshachawla22805 жыл бұрын
44:50
@tobalaba4 жыл бұрын
best sound ever.
@millerfour20715 жыл бұрын
22:48, haha!
@niko972198 жыл бұрын
The linear regression is terribly explained.
@AlexEx708 жыл бұрын
He did say this is just a teaser. A more detailed explanation comes later.
@abhijeetsharma57153 жыл бұрын
As the professor mentions multiple times, this lecture is a bit out-of-place as it is placed before covering the theory. Other lectures are more theoretically dense and explained in reasonable depth.