The best lectures in machine learning that I've listened to online. Thanks, Professor Yaser.
@johnallard24299 жыл бұрын
Such a great professor, he is so clear in his explanations.
@helenlundeberg9 жыл бұрын
agreed.
@alexdamado6 жыл бұрын
Indeed, he is an amazing professor. It is clear that he knows his mathematics. He is confident in his responses and goes into a little more depth than the usual machine learning professor. I really appreciate his desire to teach the theory behind the applications.
@cosette85705 жыл бұрын
I’m thrilled to have studied at Caltech for this course and actually talked to him! He’s so nice!
@sivaramkrishnanagireddy83414 жыл бұрын
@@cosette8570 What is your annual salary now?
@kNowFixx4 жыл бұрын
@@sivaramkrishnanagireddy8341 Very weird question and I'm not surprised the person you replied to didn't bother to respond.
@rvoros12 жыл бұрын
Prof. Yaser Abu-Mostafa is by far the best lecturer I've ever seen. Well done, great course!
@atfchan12 жыл бұрын
Thank you, Prof. Yaser Abu-Mostafa for these lectures. The concepts are concisely and precisely explained. I especially like how he explains the equations with real world examples. It makes the course material much more approachable.
@parvathysarat7 жыл бұрын
Wow. This lecture was great. The difference between using linear regression for classification and using linear classification outright couldn't have been explained better (I mean it was just amazing).
@Tinou4900010 жыл бұрын
Great, great lecture. Thank you, Prof. Yaser Abu-Mostafa, it is clear and well presented!
@jonsnow92467 жыл бұрын
What is the graph shown at 15:42, and why are there ups and downs in Ein?
@laurin15106 жыл бұрын
Jon Snow, in case it still matters: when you iterate in the PLA you adjust the weights of the perceptron in each iteration, so in each iteration you might classify more points correctly or fewer, and this is reflected in the in-sample error as well as the out-of-sample error.
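For anyone who wants to see this concretely, here is a minimal sketch of a PLA loop on made-up 2D data, tracking E_in after every update (this is not the course's code, just an illustration of why the error can bounce around):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, 2))])  # x0 = 1 absorbs the threshold
w_true = np.array([0.1, 1.0, -1.0])                           # made-up separable target
y = np.sign(X @ w_true)

w = np.zeros(3)
for it in range(100):
    misclassified = np.where(np.sign(X @ w) != y)[0]
    E_in = len(misclassified) / N        # fraction of points currently wrong
    print(it, E_in)                      # this is the jagged curve: it can go up or down
    if len(misclassified) == 0:
        break
    i = rng.choice(misclassified)        # pick one misclassified point at random...
    w = w + y[i] * X[i]                  # ...and update the weights toward fixing it
```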
@gautamkarmakar34438 жыл бұрын
What a great authority, fielding questions like a true pundit on the subject. Great respect and thanks a lot.
@prasanthaluru543311 жыл бұрын
For the first time I am finding machine learning interesting and learnable. Thank you very much, sir.
@ahmednasrzc8 жыл бұрын
Thank you, Prof. Yaser Abu-Mostafa. Very, very clear and instructive; really, it's hard to find a genius explanation like that.
@manjuhhh10 жыл бұрын
Thank you Caltech and Prof.
@avidreader1008 жыл бұрын
I learnt a method in another class where the primary classification features were selected based on things that cause the maximum change in entropy.
@EE-yv7xg8 жыл бұрын
Decision Tree Algorithm.
@roelofvuurboom54313 жыл бұрын
A clean classification represents the lowest level of entropy (things aren't "muddled"). So going from the current situation to the lowest level of entropy will result in a maximum change in entropy.
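For reference, a rough sketch of that entropy / information-gain criterion (decision-tree-style feature selection; this is not from this lecture, and the toy labels and split are made up):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

y = np.array([1, 1, 1, 0, 0, 1, 0, 0])        # made-up binary labels
feature = np.array([5, 6, 7, 1, 2, 8, 0, 3])  # made-up numeric feature

# Splitting on feature > 4 separates the classes perfectly, so the entropy
# after the split is 0 and the change in entropy (information gain) is maximal.
mask = feature > 4
gain = entropy(y) - (mask.mean() * entropy(y[mask]) + (~mask).mean() * entropy(y[~mask]))
print(gain)  # 1.0 bit here, the largest possible for a balanced binary label
```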
@nias26315 жыл бұрын
Dude is a rockstar, just threw my panties at my screen again... Great lecturer.
@rezagraz11 жыл бұрын
What an amazing and charming teacher...I wanna have a beer with him.
@sanuriishak3085 жыл бұрын
He is Muslim... he doesn't drink beer. Let's have coffee with him instead.
@abubakarali62794 жыл бұрын
Any new material from Yaser?
@your_name963 жыл бұрын
@@sanuriishak308 I think he prefers to 'sip on tea while the machine learns'
@xelink10 жыл бұрын
Good lecture. I think the question had to do with correlation between a transformed feature and the original feature. This is describing the problem of multicollinearity. With multicollinearity, numerical instability can occur. The weight estimates are unbiased (on average you'd expect them to be correct) but they're unstable: running with slightly different data might give different estimates. E.g., estimating SAT scores: given weight, height and parents' occupation, one might expect all three to be correlated, and the OLS algorithm won't know which to properly credit.
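A small illustration of the instability being described (all numbers are made up): with two nearly identical features, the least-squares weights swing wildly between resamples even though their sum stays sensible.

```python
import numpy as np

rng = np.random.default_rng(1)
for trial in range(3):
    N = 50
    x1 = rng.normal(size=N)
    x2 = x1 + 0.01 * rng.normal(size=N)          # almost perfectly correlated with x1
    X = np.column_stack([np.ones(N), x1, x2])
    y = 3 * x1 + 0.1 * rng.normal(size=N)        # the target really depends on x1 only
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(trial, np.round(w, 2))                 # w[1], w[2] vary a lot; w[1] + w[2] stays near 3
```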
@brainstormingsharing13094 жыл бұрын
Absolutely well done and definitely keep it up!!! 👍👍👍👍👍
@btsjiminface4 жыл бұрын
Such a compassionate lecturer 🥺
@AniketMane-dn5uh3 жыл бұрын
Howard Wolowitz is so good! One of the best lectures though.
@abdurrezzakefe53087 жыл бұрын
A million times better than my professor.
@melihcan84677 жыл бұрын
Great lecture & channel. Thanks for such an opportunity.
@LucaMolari11 жыл бұрын
Thank you, great and relaxing lectures!
@Mateusz-Maciejewski5 жыл бұрын
37:10 The formula is correct if we define the gradient as the Jacobian matrix transposed, not just the Jacobian matrix. In optimization this convention is very helpful, so I think he uses it.
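For reference, the formula under discussion (as it appears on the slide, up to notational convention):

$$E_{\text{in}}(w) = \frac{1}{N}\,\lVert Xw - y\rVert^{2}, \qquad \nabla E_{\text{in}}(w) = \frac{2}{N}\,X^{\mathsf T}(Xw - y).$$

Setting the gradient to zero gives $X^{\mathsf T}Xw = X^{\mathsf T}y$, i.e. $w = X^{\dagger}y$ with the pseudo-inverse $X^{\dagger} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}$.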
@pippo11167 жыл бұрын
I like watching at 1.5x speed, works perfectly.
@elsmith12375 жыл бұрын
x1.25 for me!
@dimitriosalatzoglou5033 Жыл бұрын
Amazing content. Thank you.
@edvandossantossousa4553 жыл бұрын
a great lecture. Thanks for sharing it.
@nischalsubedi94324 жыл бұрын
What is the difference between the simple perceptron algorithm and linear classification algorithm?
@ragiaibrahim66489 жыл бұрын
AWESOME, lots of thanks.
@danielgray80533 жыл бұрын
How come no one here in the comments admits that they are confused? I don't understand why college professors are so out of touch with their students. Why does learning hard engineering topics always have to be so much of a struggle? He is teaching as if we have years of experience in the field.
@danielgray80533 жыл бұрын
Like, in Lecture 2 he mentioned that E_in was used in place of nu (ν), which is basically the sample mean. Now he is using it here in a completely different context without explaining why. It seems like a lot of other students here have the same question. PLEASE SOMEONE OUT THERE JUST LEARN HOW TO TEACH PROPERLY ugh
@Milark5 ай бұрын
Being confused doesn’t mean the professor is incapable. It’s a natural part of the learning process. If no one teaches right according to you, the problem might not be the teacher.
@harpreetsinghmann7 жыл бұрын
Key takeaway: linearity in the weights, not in the "variables". @52:00
@huleinpylo390612 жыл бұрын
Interesting lecture. His examples help a lot in understanding the class.
@FsimulatorX2 жыл бұрын
Wow. This is the oldest comment I’ve seen on these videos so far. How are you doing these days? Did you end up pursuing machine learning?
@huleinpylo39062 жыл бұрын
@@FsimulatorX nope, I am working in cyber security.
@pyro2265 жыл бұрын
Not sure I'm a fan of the "symmetry" measure. The number 8 in that example is clearly offset from center; it would only have apparent symmetry because it's a wide number with a lot of black space. If a 1 is slanted and off center, it will have nearly zero apparent symmetry, because only its center point would have vertical symmetry. Oh well, we'll see where it goes.
@roelofvuurboom54313 жыл бұрын
You probably would use a bit more sophistication. After you flip the number you could "slide" it over the original number looking for the maximum matching value.
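For the curious, a minimal sketch of the two features as I understand them from the lecture (the exact definitions and scaling on the slides may differ); `img` is assumed to be a 16x16 grayscale array with values in [0, 1]:

```python
import numpy as np

def intensity(img):
    return img.mean()                        # average amount of "ink" in the image

def symmetry(img):
    flipped = np.fliplr(img)                 # mirror the image left-right
    return -np.abs(img - flipped).mean()     # 0 = perfectly symmetric, more negative = less so

img = np.zeros((16, 16))
img[2:14, 7:9] = 1.0                         # a crude, centered "1"
print(intensity(img), symmetry(img))         # symmetric by construction, so symmetry = 0
```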
@fabiof.deaquino47316 жыл бұрын
"Surprise, surprise"... What a great professor!
@linkmaster9595 жыл бұрын
I thought linear regression was for fitting data to a line to forecast future values. But here it is explained in terms of a separation boundary for classification. Can someone explain?
@roelofvuurboom54313 жыл бұрын
For ordinary fitting you look for the line that minimizes the (squared) distance of ALL points to that line. In the classification problem the error only counts the WRONGLY CLASSIFIED points; the trick in the lecture is to run the same least-squares fit on the ±1 labels anyway, and then classify each point by the sign of the linear output.
@googlandroid1762 жыл бұрын
The calculation sets the derivative to zero, which I guess means finding where the slope is 0, but what if there is more than one such spot? How is the global minimum guaranteed?
@indatawetrust1013 жыл бұрын
When he says hypotheses h1, h2, etc. does he mean different hypotheses that fit same general form (e.g. all 2nd order polynomials) or different hypotheses forms (e.g. linear, polynomial, etc.)? Thanks
@douglasholman63005 жыл бұрын
Just to clarify, what piece of theory guarantees that the in-sample error will track the out-of-sample error, i.e. E_in ≈ E_out?
@roelofvuurboom54313 жыл бұрын
That there is a probability distribution on X. What this says (more or less) is that what I saw happen in the past (i.e. what I selected, which drives my in-sample error) says something about what will happen in the future. The "what will happen in the future" part is a statement about my out-of-sample data and drives my out-of-sample error. Hoeffding's inequality places numerical constraints on the relationship and is based on this probability assumption.
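For reference, the bound being invoked, i.e. Hoeffding's inequality as stated in Lecture 2 for a single fixed hypothesis (the multiple-hypothesis version picks up a union-bound factor M):

$$\mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big] \;\le\; 2e^{-2\epsilon^{2}N}.$$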
@mohamedelansari542710 жыл бұрын
Hello, it's a nice lecture!! Thanks to Caltech and Prof. Yaser. Can anyone tell me where I can get the corresponding slides and textbooks? Thanks
@samis.a582410 жыл бұрын
work.caltech.edu/lectures.html#lectures
@mohamedelansari542710 жыл бұрын
Sami Albouq Many thanks
@naebliEcho11 жыл бұрын
Anybody have homeworks accompanying these lectures? I don't have a registered account for the course and the registration is closed now. :(
@PerpetualEpiphany10 жыл бұрын
this is really good
@mhchitsaz12 жыл бұрын
fantastic lecture
@astyli11 жыл бұрын
Thank you for uploading this!
@mackashir8 жыл бұрын
In the previous lectures E(in) was used for in-sample performance. Was it switched to in-sample error in this lecture? Am I missing something?
@Bing.W7 жыл бұрын
Performance is a rough wording, and error is one way to really evaluate the performance.
@solsticetwo34766 жыл бұрын
I see it as follows: E(in) is the fraction of "red" marbles, which is the fraction of wrong predictions by your hypothesis, i.e. the error of that h. That fraction is the probability 'q' of a Bernoulli distribution, whose expected value is E(q) = q.
@roelofvuurboom54313 жыл бұрын
@@solsticetwo3476 The other way around. E(out) is the fraction of red marbles, i.e. the fraction of wrong predictions by your hypothesis. This value has nothing to do with a probability distribution. E(in), the in-sample error, is coupled to the selection of the sample and hence is coupled to a probability distribution.
@Michellua7 жыл бұрын
Hello, where can I find the dataset used here, to implement the algorithms?
@hetav67144 жыл бұрын
The digit dataset is called the MNIST dataset.
@rahulrathnakumar7855 жыл бұрын
Great lecture overall. However, I couldn't really understand how to implement linear regression for classification...
@roelofvuurboom54313 жыл бұрын
With your training data you know the correct label (±1) for every point. Treat those labels as real-valued regression targets and find, in one shot, the weights that minimize the squared error over ALL points (via the pseudo-inverse). You then classify a point by the sign of the linear output. If you want to push the classification error down further, those weights also make a good starting point for the perceptron/pocket algorithm.
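A minimal sketch of that recipe in Python (not the course's official code; the toy data and function names are made up):

```python
import numpy as np

def linear_regression_weights(X, y):
    """X: (N, d) inputs without the bias column; y: labels in {-1, +1}."""
    Xa = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1
    return np.linalg.pinv(Xa) @ y               # one-shot least squares: w = X^dagger y

def classify(w, X):
    Xa = np.hstack([np.ones((len(X), 1)), X])
    return np.sign(Xa @ w)                      # threshold the linear output at 0

# made-up linearly separable example
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1)
w = linear_regression_weights(X, y)
print((classify(w, X) != y).mean())             # in-sample misclassification fraction
```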
@DaneJessen1019 жыл бұрын
Great lecture! Does the use of features ('features' are "higher level representations of raw inputs") increase the performance of a model out of sample? Does it somehow add information? Or does it simply make it computationally easier to produce a model? I'm working on a problem where this could potentially be very useful. I could also see how the use of features could make a model more meaningful to human interpretation, but there is a risk as well that interpretations will vary between people based on what words are being used. 'Intensity' and 'symmetry' are used here, which are great examples, but it could very quickly get more abstract or technical. Thank you in advance to anyone who has an answer to my question!
@rubeseba9 жыл бұрын
It depends on whether your features could be learned implicitly by your model. That is, let's say your original data are scores on two measures: IQ and age, and you want to use those to predict people's salaries. Let's also assume that the true way in which those are related is: salary = (IQ + age)*100 + e, where e is some residual error not explained by these two variables. In this case you could define a new feature that is the sum of IQ and age, and this would reduce the number of free parameters in your model, making it slightly easier to fit. Given enough data to train on however, your old model would perform just as well, because the feature in the new model is a linear combination of features in the old model. (That is, in the old model you would have w1 = w2 = 100, whereas in the new one you would just have w1 = 100.) Often, however, we define new features not (just) to reduce the number of model parameters, but to deal with non-linearities. In the example of the written digits, you can't really predict very well which digit is written in an image by computing a weighted sum over pixel intensities, because the mapping of digits to pixel values happens in a higher order space. So in this case we can greatly improve the performance of our model if we define our features in the same higher order space. The reason is not that we add information that wasn't in the data before, but that the information wasn't recoverable by our linear model.
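To make the second point concrete, here is a small sketch (made-up data, not from the course) where the same linear model fails on the raw inputs but does well on squared features, because the true boundary is a circle:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 0.5)        # +1 outside a circle, -1 inside

def fit_and_error(Phi, y):
    w = np.linalg.pinv(Phi) @ y                   # least-squares fit on +/-1 targets
    return (np.sign(Phi @ w) != y).mean()         # misclassification fraction

Phi_raw = np.column_stack([np.ones(len(X)), X])             # features (1, x1, x2)
Phi_sq  = np.column_stack([np.ones(len(X)), X**2])          # features (1, x1^2, x2^2)
print(fit_and_error(Phi_raw, y), fit_and_error(Phi_sq, y))  # raw does poorly; squared is near 0
```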
@DaneJessen1019 жыл бұрын
rubeseba - That was very helpful. Thank you!
@DaneJessen1019 жыл бұрын
Kathryn Jessen Kathryn - I was born without the ability to be a mom :/ I will never experience the depth and vastness of a mother's understanding. I can sure pick a thing or two and try to pretend ;)
@andysilv7 жыл бұрын
It seems to me that there is a small typo on the 18th slide (48:25). To perform classification using linear regression, it seems one needs to check sign(wx - y) rather than sign(wx).
@李恒岳6 жыл бұрын
This really confused me until I read your comment!
@李恒岳6 жыл бұрын
Watching this video a second time made it clear to me that there is no mistake. The threshold value is contained in w0, so sign(wx) is correct.
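For reference, the convention used throughout the course (as far as I can tell): the input is augmented with a constant coordinate, so the threshold lives in w_0, and y only appears in the error measure, not in the hypothesis:

$$h(x) = \operatorname{sign}(w^{\mathsf T}x), \qquad x = (1, x_1, \dots, x_d), \qquad w^{\mathsf T}x = w_0 + w_1 x_1 + \dots + w_d x_d.$$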
@Omar-kw5ui4 жыл бұрын
Xw gives the output for each data point in the training set; taking the sign of each output gives you its classification.
@ishanprasad9105 жыл бұрын
Quick question, what is the y-axis label at 20:00? What probability are we tracking for E_in and E_out?
@blaoi15625 жыл бұрын
E_in and E_out represent the in-sample error and the out-of-sample error, respectively. Usually you don't have access to E_out, the out-of-sample error, but you know that the in-sample error approximates the out-of-sample error better the more data you have. The y-axis represents the error fraction on the data, while the x-axis represents the iterations.
@abhijeetsharma57153 жыл бұрын
On the y-axis, we are tracking the fraction of mislabeled examples. So E_in is the fraction of training-set examples that we got wrong. Similarly, E_out is the fraction of examples (not from the training set) that we got wrong.
@shahardagan15846 жыл бұрын
What should I do if I didn't understand all the math in this lecture? Do you have a resource that explains it quickly?
@majorqueros68125 жыл бұрын
Some videos on statistics/probability and linear algebra would already help a lot. Khan academy has many great videos. www.khanacademy.org/math/statistics-probability www.khanacademy.org/math/linear-algebra
@MrCmon1135 жыл бұрын
Depends on what you don't understand. There should be introductory courses to linear algebra, analysis and stochastics at your university.
@fndTenorio10 жыл бұрын
Fantastic!
@AndyLee-xq8wq Жыл бұрын
Great!!
@movax20h8 жыл бұрын
How was E_out computed in each iteration? Was it done using a subsample of the given sample, with E_out estimated on the full sample?
@fuadassayadi18 жыл бұрын
+movax20h You cannot calculate E_out because you do not know the whole population of samples. But E_in can tell you something about E_out. This relation is explained in Lecture 02 - Is Learning Feasible?
@movax20h8 жыл бұрын
+fouad Mohammed That is exactly why I am asking. The graph clearly shows E_out being calculated somehow. I guess this is done using validation techniques from one of the later lectures. Anyway, this is a synthetic example, so it is not hard to generate a target function that is known to us (but "unknown" to the learner), and to generate as many training and test examples as you want, just for the example's sake. I do not believe it was calculated via Hoeffding, because that is a probabilistic inequality, and it would actually lead to circular reasoning here: let's use it to predict E_out, and then use this prediction to claim that E_in tracks E_out well. That might be correct in a probabilistic sense, but it is not a good way of demonstrating it at all.
@abaskm8 жыл бұрын
Assume that the data was already labeled and used to generate E_in (the test set). Take a portion of this data (the training sample) and generate your hypothesis. You can then use that hypothesis to measure E_in (on the training data) and E_out (on the test set). He's making the assumption that the whole set is labeled, which doesn't usually apply in the real world.
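A small sketch of that setup (everything here is made up): with a synthetic target you simply hold out a large labeled set and measure E_out on it directly.

```python
import numpy as np

rng = np.random.default_rng(4)
w_target = np.array([0.2, 1.0, -1.0])               # the "unknown" target, known to us

def sample(n):
    X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, (n, 2))])
    return X, np.sign(X @ w_target)

X_train, y_train = sample(100)                       # small training set -> E_in
X_test,  y_test  = sample(100_000)                   # large held-out set -> estimate of E_out

w = np.linalg.pinv(X_train) @ y_train                # fit by linear regression on +/-1 labels
E_in  = (np.sign(X_train @ w) != y_train).mean()
E_out = (np.sign(X_test  @ w) != y_test ).mean()
print(E_in, E_out)
```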
@raulbeienheimer3 ай бұрын
This is not the Squared Error but the Mean Squared Error.
@wayenwan11 жыл бұрын
Are there subtitles available?
@ajkdrag5 жыл бұрын
I didn't understand how he obtained X-transpose after the differentiation.
Why does learning only occur in a probabilistic sense? What other way could it be?
@roelofvuurboom54313 жыл бұрын
Literally, learning in a non-probabilistic (absolute, certain) sense. However, this runs up against the so-called problem of induction, first described by the philosopher Hume (you can google it). In our context here the induction problem can be translated to: "If I pick a number of balls and they always turn out to be green, can I conclude (with certainty) that all balls in the bin are green?" In the lecture the statement is made that you can't. The philosophical discussion is a bit more nuanced. In any case, machine learning avoids this discussion by sidestepping statements made with certainty and moving to (weaker) probabilistic statements.
@JoaoVitorBRgomes3 жыл бұрын
@@roelofvuurboom5431 great reply, thanks! I loved how you tied to Hume. I didn't think of this connection. Thanks for linking things.
@jonsnow92467 жыл бұрын
What is the graph shown at 15:42, and why are there ups and downs in Ein?
@solsticetwo34766 жыл бұрын
Jon Snow The error on the sample set could increase in the next iteration if the algorithm changes the hypothesis (weights) in a way that hurts the classification. The PLA is a random walk in the weight subspace.
@Omar-kw5ui4 жыл бұрын
@@solsticetwo3476 PLA is not really a random walk in the weight subspace... The algorithm adjusts the weights for a given (randomly chosen) misclassified point. Fixing the weights so as not to misclassify this point may cause other points that were previously correctly classified to become misclassified. Hence the rise in error, followed by a drop, etc. The update rule is well defined even for non-separable datasets, so you can't really call it a random walk; it clearly has a set of rules it's following.
@chujgowi11 жыл бұрын
Check the iTunes page; there are homeworks and solutions available for free for this course.
@rippa91112 жыл бұрын
Thank you.
@SphereofTime7 ай бұрын
6:14
@auggiewilliams35657 жыл бұрын
2.8 doesn't happen at Caltech..... 3.8 doesn't happen at Tribhuvan University
@sarthakmishra69156 жыл бұрын
LOL
@abubakarali62794 жыл бұрын
How did we get X^T at 38:25?
@roelofvuurboom54313 жыл бұрын
In linear algebra the squared norm of a vector v is defined as ||v||^2 = (v^T)(v). Apply this with v = Xw - Y to expand ||Xw - Y||^2; the X^T then appears when you differentiate with respect to w.
@karlmadl2 жыл бұрын
This comment is old and the OP has probably figured it out by now. However, for anyone else who wonders: it has to do with notation. The derivative of Xw with respect to w is either X or X^T depending on your layout convention. Here we're using denominator-layout notation, so we get X^T. (en.wikipedia.org/wiki/Matrix_calculus#:~:text=displaystyle%20%5Cmathbf%20%7BI%7D%20%7D-,A%20is%20not%20a%20function%20of%20x,%7B%5Cdisplaystyle%20%5Cmathbf%20%7BA%7D%20%7D,-%7B%5Cdisplaystyle%20%5Cmathbf%20%7BA) The natural follow-up question is why we use one notation over the other. Notation choices are often just the choice of the author, and can make formulae more succinct or more clear. I think this answers the question without going off on too much of a tangent; any further questions are probably best answered by your own inquiry.
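If it helps, here is a quick numerical sanity check of the denominator-layout result (random made-up matrices; not from the lecture): the gradient of ||Xw - y||^2 with respect to w matches 2 X^T (Xw - y).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

analytic = 2 * X.T @ (X @ w - y)

# central finite differences, one coordinate of w at a time
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    e = np.zeros(3); e[j] = eps
    f_plus  = np.sum((X @ (w + e) - y) ** 2)
    f_minus = np.sum((X @ (w - e) - y) ** 2)
    numeric[j] = (f_plus - f_minus) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # tiny (roundoff-level) difference
```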
@TomerBenDavid8 жыл бұрын
Perfect
@ZeeshanAliSayyed9 жыл бұрын
"+1" and "-1" among other things happen to be real numbers! LOL
@vyassathya37728 жыл бұрын
+Zeeshan Ali Sayyed There is something genius about the simplicity though lol
@ZeeshanAliSayyed8 жыл бұрын
Vyas Sathya Indeed. :P
@bryanchambers19645 жыл бұрын
I wish he showed how to write the algorithms in Python because he teaches very well.
@FsimulatorX2 жыл бұрын
There are plenty of other resources for that. Once you understand the theoretical component, implementation becomes easy
@Dwright33168 жыл бұрын
42 minutes !! BANG !!
@SJohnTrombley7 жыл бұрын
People don't get 2.8's at Caltech? I smell grade inflation.
@MrCmon1135 жыл бұрын
I understood that as people with 2.8 or better not going there. Who takes courses in all of those fields at university?
@pyro2265 жыл бұрын
32:22 MSE, mean squared error, for those with a statistics background.
@HendSelim8711 жыл бұрын
what's wrong? the video is not opening.
@fahimhossain1654 жыл бұрын
Never imagined that I'd learn ML from Emperor Palpatine himself!
@akankshachawla22805 жыл бұрын
44:50
@tobalaba4 жыл бұрын
best sound ever.
@millerfour20715 жыл бұрын
22:48, haha!
@niko972198 жыл бұрын
The linear regression is terribly explained.
@AlexEx708 жыл бұрын
He did say this is just a teaser. A more detailed explanation comes later.
@abhijeetsharma57153 жыл бұрын
As the professor mentions multiple times, this lecture is a bit out-of-place as it is placed before covering the theory. Other lectures are more theoretically dense and explained in reasonable depth.