Lecture 2 | Machine Learning (Stanford)

841,652 views

Stanford

Comments: 382
@sienna367 6 years ago
What each lecture covers:
1. An overview of the course in this introductory meeting.
2. Linear regression, gradient descent, and the normal equations, and how they relate to machine learning.
3. Locally weighted regression, the probabilistic interpretation, and logistic regression, and how it relates to machine learning.
4. Newton's method, exponential families, and generalized linear models, and how they relate to machine learning.
5. Generative learning algorithms and Gaussian discriminant analysis, and their applications in machine learning.
6. Naive Bayes, neural networks, and support vector machines.
7. Optimal margin classifiers, KKT conditions, and SVM duals.
8. Support vector machines, including soft-margin optimization and kernels.
9. Learning theory, covering bias, variance, empirical risk minimization, the union bound, and Hoeffding's inequality.
10. Learning theory, continued: VC dimension and model selection.
11. Bayesian statistics, regularization, a digression on online learning, and applications of machine learning algorithms.
12. Unsupervised learning in the context of clustering, Jensen's inequality, mixtures of Gaussians, and expectation-maximization.
13. Expectation-maximization for the mixture-of-Gaussians and naive Bayes models, as well as factor analysis and a digression.
14. Factor analysis and the expectation-maximization steps, continuing on to principal component analysis (PCA).
15. Principal component analysis (PCA) and independent component analysis (ICA) in relation to unsupervised machine learning.
16. Reinforcement learning, focusing particularly on MDPs, value functions, and policy and value iteration.
17. Reinforcement learning, focusing particularly on continuous-state MDPs, discretization, and policy and value iteration.
18. State-action rewards, linear dynamical systems in the context of linear quadratic regulation, models, the Riccati equation, and finite-horizon MDPs.
19. The debugging process, linear quadratic regulation, Kalman filters, and linear quadratic Gaussian control in the context of reinforcement learning.
20. POMDPs, policy search, and Pegasus in the context of reinforcement learning.
@rahulrathnakumar785 5 years ago
You're a godsend. Thanks
@lionellball8257 5 years ago
Thank you
@devivyapari2958 4 years ago
Thanks for the comprehensive list
@quantummath 11 years ago
Andrew Ng rocks .. he's an amazing teacher and an influential engineer as well as a great scholar. In a rather small but unprecedented step, you've managed to popularize Machine Learning. Nice!
@sharjeeltahir5583 6 years ago
I disagree to some extent
@perioguatexgaming1333 3 years ago
@@sharjeeltahir5583 why?
@cogent4645 7 years ago
The fact that, using simple physical examples (Portland property prices), you can generalize and abstract into learning algorithms is just amazing. What an inspiration as a teacher!! Thank you.
@SagarPokhrel 7 years ago
The best lecture for understanding machine learning that I've gone through so far. Professor Andrew Ng is the all-time best teacher for me.
@yardenm15 6 years ago
This is a pure gold mine for anyone interested in machine learning. He does such an amazing job explaining everything in a simple way, especially the parameters in new definitions and equations, with plenty of examples and interesting videos.
@ajayram198 9 years ago
Very well explained. I was going through the Coursera course video lectures, but found this one much better.
@Rjsipad 9 years ago
+ajayram198 same
@evgeniynorin7345 7 years ago
which course?
@coolguy-dw5jq 7 years ago
coursera course by andrew ng himself
@davidalexander829 7 years ago
Agreed. Of all the MOOCs, I like Coursera the least but Ng is much better in this lecture format
@teejiahen 15 years ago
It's just like studying at Stanford! Although I'm not physically there, it really let me gain more knowledge of machine learning than I could from my university alone. And he is a really good lecturer! Thank you to everyone who proposed this to Stanford University and uploaded it!
@curcicm 8 years ago
The normal equations fall out immediately from the perpendicularity criterion for shortest distance, X^T (X·theta − y) = 0, and you don't have to get into trace computations.
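For anyone curious, a sketch of that orthogonality argument (using the lecture's notation, with X the design matrix and y the target vector): the residual of the best fit must be perpendicular to the column space of X, which gives the normal equations directly,

$$X^T(X\theta - y) = 0 \;\Longrightarrow\; X^T X\,\theta = X^T y \;\Longrightarrow\; \theta = (X^T X)^{-1} X^T y,$$

assuming $X^T X$ is invertible.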
@cmares5858 9 years ago
Well that escalated quickly... time to brush up on some of this math before continuing.
@TTGxCROTTY 9 years ago
+cmares5858 Lol, Yup
@sahawndada 8 years ago
+cmares5858 Yup!! I was like, ahh, I'll be fine ... NOPE. What subjects do you think you need to brush up on before you can understand this?
@tj8870 8 years ago
+cmares5858 Well we barely learned anything from lecture 1..
@elborrador333 8 years ago
+Pat Bradley You should know introductory probability, linear algebra, and maybe some multivariate calculus. If you're determined, MIT has lecture series on all of those on YouTube. You might also want to think about applying some of these algorithms yourself so the theory sticks.
@danny-bw8tu 7 years ago
I hope it is not too late. I felt exactly what you felt the first time I encountered this lecture. The math it involves is multivariate calculus and some elementary statistics. Moreover, there are good books about machine learning, plus tons of material on the internet about gradient descent, which is very helpful.
@Jabrils 7 years ago
im raising my hand, why isn't professor Ng calling on me?
@UbuntuTricks 6 years ago
Jabrils, I'm a big fan of yours!
@leonhardeuler9839 6 years ago
He doesn’t like you Jabrils
@kongki7563 5 years ago
lol !
@JohannSuarez 4 years ago
Dude, you inspired me to start taking Computer Science two years ago. Thanks, Jabrils!
@bennasserchafi304 7 years ago
Do you really understand how lucky we are to have someone like this legend explain this material to us?
@dongiea 11 years ago
Andrew Ng (the lecturer in these videos) teaches a course on Coursera that is based on this class. It covers the same fundamental ideas but might not be as in depth as these Stanford lectures.
@TheDestint 4 years ago
That coursera course is bs compared to this series.
@eng.mohammadshericmrp9251 5 years ago
If we have a dataset with the number of points m = 1000:
- Batch gradient descent: each iteration step applies the update using all the points (i = 1, ..., m).
- Stochastic gradient descent: each step applies the update using just one point at a time, not all of them.
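A minimal sketch of the two update rules in Python/numpy (the function names `batch_step` and `sgd_step` are mine, not from the lecture); both use the LMS update, and the only difference is how many examples each step touches:

```python
import numpy as np

def batch_step(theta, X, y, alpha):
    # one batch gradient descent step: gradient summed over all m examples
    return theta - alpha * (X.T @ (X @ theta - y))

def sgd_step(theta, x_i, y_i, alpha):
    # one stochastic step: update from a single training example (x_i, y_i)
    return theta - alpha * (x_i @ theta - y_i) * x_i

# tiny demo: recover y = 2x with both variants
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))   # m = 100 examples, n = 1 feature
y = 2.0 * X[:, 0]
theta_batch, theta_sgd = np.zeros(1), np.zeros(1)
for epoch in range(50):
    theta_batch = batch_step(theta_batch, X, y, alpha=0.01)
    for i in rng.permutation(100):          # SGD visits one example at a time
        theta_sgd = sgd_step(theta_sgd, X[i], y[i], alpha=0.01)
print(theta_batch, theta_sgd)               # both approach [2.]
```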
@George-lt6jy 8 years ago
first learning algorithm. i am so pumped.
@KCOWMOO 7 years ago
Top KeK 😀
@arran5498 15 years ago
Stanford. Thanks for posting these lectures! Big thank you!
@akshatb 8 years ago
NOTE: A^T represents the transpose of matrix A. At 59:56, shouldn't it be only C^T·A·B^T and not C^T·A·B^T + C·A·B? According to one of the earlier equations, the gradient of tr(AB) with respect to A equals B^T, so the gradient of tr(A·B·A^T·C) should equal (B·A^T·C)^T, which is C^T·A·B^T. Please help me sort this out.
@harrakaymane 8 years ago
No, because A^T also depends on A. What you're saying is like claiming: when differentiating with respect to x, d(ax)/dx = a, SO d(x·a·x)/dx = a·x; that's not the case (it's 2ax).
@akshatb 8 years ago
Aimane Harrak thanks I got it now.
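A quick numerical sanity check of that identity (a sketch in Python/numpy, not from the lecture), comparing the analytic gradient ∇_A tr(ABAᵀC) = CAB + CᵀABᵀ against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

def f(A):
    # f(A) = tr(A B A^T C)
    return np.trace(A @ B @ A.T @ C)

analytic = C @ A @ B + C.T @ A @ B.T   # the identity from the lecture

# finite-difference gradient, entry by entry
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = eps
        numeric[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-9: the identity holds
```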
@abhishekkumar-os5zk 4 years ago
Answer to the question at 44:00: differentiate again (take the second derivative); if it is greater than zero, the function curves upward and we are descending toward a minimum, otherwise ascending.
@field-yetian6001 9 years ago
Question: at 34:23, for a certain training sample, the adjustment of the jth theta is −alpha · (estimation error) · x_j. Suppose we have only one theta and one x, where theta = unit price per sq ft and x = the number of sq ft. I don't understand why a larger x_j should lead to a larger theta adjustment. Say we have 2 cases, and in both the estimation error is 10,000 dollars. In the first case x_j = 500 sq ft; in the second case x_j = 5000 sq ft. Then the second case feeds back a 10x larger adjustment to the unit price. But why? In the first case, you tell the machine: hey, you missed by 10,000 dollars, given that the apartment has 500 sq ft, so next time reduce 20 dollars per sq ft. This makes sense. Then in the second case, you tell the machine: hey, you missed by 10,000 dollars, given that the house has 5000 sq ft, so next time reduce 200 dollars per sq ft. That's weird. Thanks folks
@antonylawler3423 9 years ago
+田野 I think it is because theta isn't a $ value in sqr ft, but a number by which the sample xi is multiplied.
@field-yetian6001 9 years ago
+Antony Lawler Thank you so much for the reply. Technically, as you said, theta can't be defined as the unit price. But at least I think theta is an analogue of the unit price, and that the product theta_1 · x_1 (area) roughly represents the part of the house price corresponding to area. This feedback design seems counter-intuitive.
@antonylawler3423 9 years ago
+田野 No problem. How are you getting on with Lecture 3 ?
@gt7318d 9 years ago
+田野 The adjustment formula is oversimplified; I believe alpha in the formula should vary with x_j. Basically the adjustment formula tries to arrive at a solution for which dJ/dtheta = 0, the first partial derivative of J with respect to theta. If you use the Newton-Raphson formula for zero-finding, you end up with theta := theta − beta · (dJ/dtheta) / (d²J/dtheta²), where d²J/dtheta² is the second partial derivative. If you carry out the math, you will find that the second derivative is proportional to x_j². With the first derivative proportional to x_j, you end up with an adjustment term of a constant beta multiplied by 1/x_j, so a smaller adjustment is made when x_j becomes larger. Hope this helps. Very interesting observation, though!
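To make that trade-off concrete, here's a tiny sketch (my own illustration, not lecture code; the numbers match the example above) showing that with a fixed alpha the plain LMS step grows with x_j, while a Newton-style step, which divides by the second derivative (proportional to x_j²), shrinks as x_j grows:

```python
# same $10,000 prediction error, two different feature scales
alpha, error = 1e-6, 10_000.0
for x_j in (500.0, 5_000.0):
    lms_step = alpha * error * x_j           # plain gradient step: grows with x_j
    newton_step = error * x_j / (x_j ** 2)   # Newton: divide by d2J/dtheta2 ~ x_j^2
    print(x_j, lms_step, newton_step)
# lms_step: 5.0 vs 50.0 (10x larger); newton_step: 20.0 vs 2.0 (10x smaller)
```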
@field-yetian6001 9 years ago
+Antony Lawler I got quite frustrated with math.. I got stuck at video 3 and have not revisited for a few weeks.
@mayaahmed 15 years ago
Really nice. Well taught. I am really enjoying listening to these lectures. A true service to the public.
@CosminVarlan 7 years ago
I think one alternative answer to the question @41:40 might also be that we have found the minimal point, or the convergence point, when the derivative goes to 0 or nearby: the derivative of a function measures the slope, and when it reaches 0 it means we have found a local maximum or minimum; because we are hunting the minimum, it means we found it. Am I right?
@OrakzaiSays 5 years ago
1:00:10 and onward: the training example is a row vector, and we take the transpose, which makes it a column vector?
@rakeshprab1 14 years ago
learning a whole new concept easily in one hour is fantabulous.......thanx...
@joshuaburkholder 16 years ago
Around time = 28:00, Dr. Ng noted that to go in the direction of steepest descent from a point, ( theta1, theta2, J(theta1, theta2) ), we should go in the direction of the gradient of J at that point; however, this is incorrect. The gradient always points in the direction of steepest ascent, not descent; therefore, the direction of steepest descent from ( theta1, theta2, J(theta1, theta2) ) is opposite of the gradient: -Del( J( theta1, theta2 ) ).
@PMetheney84 9 years ago
At 1:10:20, I think there is a trace missing before the Nabla_Theta(y^TX Theta) Term (the very last term). All the other terms have traces, why doesn't this one? Without it, one cannot apply the rules he introduced before (Nabla tr(AB) = B^T)
@bidhovbizar 6 years ago
You are right. It should have the trace notation too. Otherwise he cannot use the 2nd fact out of the 5 facts he mentioned during the matrix algebra revision. He might have accidentally missed it.
@eng.mohammadshericmrp9251 5 years ago
Two ways to find the theta that minimizes the cost function:
1- Normal equation (no iteration): take its derivative and set it equal to zero.
2- Gradient descent (with iteration): take its derivative and apply the GD algorithm.
*****************************************
For example, to find the minimum of y = x²:
1- Normal equation: 2x = 0, so x = 0. This is the solution.
2- Gradient descent: the derivative is 2x, so x1 = x0 − step_size · 2·x0. After some number of iterations, x will reach zero. x = 0. This is the solution.
@samferrer 9 years ago
If the rest of the lectures are based on these operators ... then I will hang out till the very end ... elegant!!
@chriswalsh5925 8 years ago
Just wondering: could you encode the landscape using Fourier transforms and then use that multi-level representation with a slightly modified algorithm to get a faster / more accurate result?
@signemadara2459 10 years ago
Can someone clarify please. On 50:00 when he answers the question about stochastic gradient descent, surely he does not mean that each iteration we use the SAME training example, right? I am sure he means that each iteration we take a different training example, but the way he talks about it is slightly confusing.
@ericakim4587 10 years ago
i think for the first step, you use the first training example and update all of the thetas. then for the second step, you use the second training example and update all of the thetas. and so on... so yeah, you use a different training example for each step/iteration
@elliottbajema3092 10 years ago
Yeah, the confusion is because he says "for each step, you're only using one training example". Worth emphasising that it's the jth example, which changes each step, and not the SAME training example. In batch, you use the entire training set of all (potentially millions) of examples, so each equivalent step for stochastic is potentially millions of times faster. It's just a compromise for the sake of speed. More generally, presumably you would actually take 'a random sample' of training examples rather than the jth, for greater accuracy.
@signemadara2459 10 years ago
Thanks!
@pitr2596 8 years ago
Am I wrong or right if I assume that the gradient is actually oriented in the direction of biggest ASCENT? Wikipedia says so too.. so I assume we should use the gradient's orientation multiplied by −1 for the stated example, contrary to what is mentioned in the video.
@erichoft7154 8 years ago
+King Schultz Maybe it depends on what exactly you are trying to optimize. If you are looking for a minimum cost you would go in the direction of greatest descent and if you are looking for a maximum profit you would go in the direction of greatest ascent?
@erichoft7154 8 years ago
+Eric Hoft I could be talking out of my ass though.
@pitr2596 8 years ago
That makes total sense, of course. I just mean that the gradient is mathematically defined as the direction of greatest ascent, so it actually points toward the greatest ascent and its length is the magnitude of the ascent. That's why it irritates me that we use the gradient here as if it pointed toward the biggest descent.
@DavidVaughan00 8 years ago
+King Schultz You're right, gradient points in the direction of greatest ascent, so he is slightly off when he talks about it. Not a huge deal though; just gotta keep in mind when he says "gradient" we should be thinking "negative gradient".
@СергейКиян-ш6у 8 years ago
That is why he subtracts the gradient (which is simply adding the gradient multiplied by −1).
@TheReaMrBurntSausage 8 years ago
I'm a high-school junior and I didn't know what a partial derivative was, so I walked into my AP Calc class today, asked the teacher, and was told to never speak of it again. Apparently my teacher has repressed nightmares of it from college, haha. I looked it up; it seems pretty straightforward. I think I get it now.
@danny-bw8tu 7 years ago
I don't think u should study machine learning now, and I don't think u got 'it', it involves way more than just partial derivative , kid.
@elzilcho222 7 years ago
that was a year ago, he's probably graduated college by now
@jazzpote4316 6 years ago
@da ny You deserve to be kept far away from every learner! Give this 'kid' the hope and belief that he can do it, and he will, instead of trying to fix your ego.
@superwiseman452 6 years ago
yup, I know PDEs well enough. Shame on your teacher for turning you away!
@florocasta 5 years ago
Thank you Professor Ng and Stanford University.
@jameskhan9383 8 years ago
At 1:01:52 the design matrix X is m by n. Then he multiplies by theta and it looks like we're just left with an m×1 vector. Is each x in the resulting vector assumed to be n-dimensional, or am I missing something?
@jameskhan9383 8 years ago
Actually I think I'm being stupid. It's because we're multiplying by theta which is n x 1 right?
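Right — a quick shape check confirms it (a sketch, not lecture code):

```python
import numpy as np

m, n = 4, 3
X = np.ones((m, n))        # design matrix: each ROW x^(i) is one n-dimensional example
theta = np.ones((n, 1))    # parameter vector, n x 1
print((X @ theta).shape)   # (4, 1): one prediction theta^T x^(i) per training example
```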
@flaviopibetagama 4 years ago
Hi. Great video. I have a question: at 1:08:40, why is the first element of the expansion of (Xθ − y)^T (Xθ − y) equal to θ^T X^T X θ? Why is it not X^T θ^T X θ?
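The step follows from the transpose rule $(AB)^T = B^T A^T$ applied to $(X\theta)^T$:

$$(X\theta - y)^T (X\theta - y) = (\theta^T X^T - y^T)(X\theta - y) = \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y,$$

so the leading term is $\theta^T X^T X \theta$. Note that $X^T \theta^T$ is not even a valid product: $X^T$ is $n \times m$ and $\theta^T$ is $1 \times n$, so the inner dimensions do not match.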
@jcbmack 12 years ago
Denzel, it is all about changing the thetas, which are parameters (weights) that take on new values with each update. We want to choose a theta that minimizes J(theta). Gradient descent takes the form theta_j := theta_j − alpha · ∂J(theta)/∂theta_j. The update is performed on all j values simultaneously. Thus we begin with some value of theta and then repeatedly change that value to make J(theta) smaller. Alpha is just the learning rate.
@DrDizzyMorris 12 years ago
Firstly, I'm loving this, great class! I have a question about the derivation of gradient descent. How is the partial derivative of J(theta) taken in the iterative algorithm if J evaluates to a constant? We already have x, y, and the initial theta (the zero vector), so how can we take the partial derivative AND THEN plug in what we know? Could the mathematical notation possibly be improved a bit? As it stands now, it's not making sense to me, and I've been through an entire calculus sequence.
@SiddharthGupta234 8 years ago
Why is there no m in the denominator? @1:04:25
@davidalexander829 7 years ago
I wondered the same thing. Instead he arbitrarily assigns 1/2, versus the usual sum of squared differences over n.
@newbielives 8 years ago
Am I the only one impressed by the chalk board that wipes itself clean when he lifts it up and pulls it back down
@UtkarshRuhela 8 years ago
It's a different board, you dumbass.
@СергейКиян-ш6у 8 years ago
The board doesn't get cleaned; it's an illusion. The lecturer just lifts one board up and pulls a new one down. Look at 48:00.
@mksv7663 8 years ago
He obviously applied a learning algo to it!
@joshuaadickerson 8 years ago
I am laughing so hard. I would be impressed by that too, but as others said, they are overlapping chalkboards.
@xiangzhang7355 7 years ago
hahah~~~
@praneeta133 15 years ago
These videos are brilliant!!Andrew is super cool at teaching, thanks Stanford!!
@Jacob011 13 years ago
In my linear systems course we used the same normal equation for estimating the parameters of a discrete model of a continuous system. The thing is, it can be derived in a much simpler way than the one shown in the lecture (without the use of traces, let alone trace algebra). :) So besides that, great lecture and certainly motivating.
@eng.mohammadshericmrp9251 5 years ago
```matlab
%% Visualizing gradient descent on a quadratic function using MATLAB:
clear all
close all
clc
%% Defining the input and the output:
Input = -5:0.1:5;
Output = Input.^2;
%% Plotting the function:
plot(Input, Output, 'LineWidth', 3)
hold on
%% Determining the required parameters:
step_size = 0.01;
Iterations = 100;
%% Initializing the starting point:
X0(1) = 3.5;
%% Plotting the first step:
Ite = 1;
disp(['Iteration ' num2str(Ite) ': Best Minima = ' num2str(X0(Ite))]);
Output = X0(Ite).^2;
plot(X0(Ite), Output, '.', 'MarkerSize', 30)
%% Starting the iterative gradient descent:
Ite = 2;
while (Ite < Iterations)
    %% Least mean squares (gradient descent) update:
    X0(Ite,:) = X0(Ite-1,:) - step_size .* 2 .* X0(Ite-1,:);
    Output = X0(Ite).^2;
    disp(['Iteration ' num2str(Ite) ': Best Minima = ' num2str(X0(Ite))]);
    %% Plotting the next step:
    plot(X0(Ite), Output, '.', 'MarkerSize', 30)
    Ite = Ite + 1;
end
```
@phibouafia 10 years ago
One thing I did not understand is why we introduce batch gradient descent or the stochastic version if the problem can be solved by linear algebra. Is this only a way to introduce those algorithms, which we will use for more complicated minimization problems? Or do you really use these algorithms for this particular problem?
@orrymr 10 years ago
I think the case may be that doing it using linear algebra can be quite computationally intensive, whereas the gradient descent algorithms don't require the (computationally intensive) matrix operations.
@daniellee3987 10 years ago
I think its because of the quantity of the data involved. If the training set data is too large, iterative algorithm might not be practical due to hardware limitation. So, yes, I think we pick the most efficient algorithm depending on the situation.
@DavidVaughan00 8 years ago
+phibouafia In general, only some problems (i.e., minimizing least squares with a linear h function) can be solved using closed-form linear algebra. Most can't, unfortunately. I think he shows us the gradient descent methods here, even though we don't need them yet, because we WILL need them a lot more later in the course.
@dkwroot 7 years ago
At 18:30 he talks about the summation over the vectors as being theta transpose times x. How did he determine this? Did he use the dot-product rule for the transpose, where a · b = a^T b?
@kentasuzuki4522 7 years ago
It's the dot (inner) product: theta^T x = [theta0, theta1, theta2] · [1, x1, x2]^T = theta0 + theta1·x1 + theta2·x2.
@drhoads9 8 years ago
Around 55:47, should it be written as the gradient of f wrt A, and not be evaluated at A? i.e. drop the "(A)" before the "="? Otherwise, you'd be taking the gradient of a real #, unless I'm reading something wrong...
@jose-rs 8 years ago
A bit late, my response, but in this case A is regarded as a variable, so f(A) would be the same as just f. Here f has no specific value, like A = I or something.
@filipturczynowicz-suszycki7728 7 years ago
I can't express how much i loved this video
@punstress 10 years ago
To Maris, since they square the result, it doesn't matter whether you subtract y-h(x) or h(x)-y. (for some reason there was no reply option under your question. maybe it's too old. but someone else might have the same question.)
@KlajdiDervishaj 5 years ago
He is missing the superscript index i (the training example) on y in the last line of the summation equation, at 1:04:23.
@KlajdiDervishaj 5 years ago
OK he fixed it later...
@vg9311 6 years ago
At 44:05, he says that the derivative of the function gives the steepest descent, and said the TAs would probably elaborate on that in another session. Can someone please explain that?
@DrDizzyMorris 12 years ago
Thanks jcbmack; between your comment and reviewing the lesson again, I was able to make heads or tails of the concept I was misunderstanding. I was treating the parameters/thetas as constant when in fact they are varying; why, I have no idea, haha. Cheers!
@sushantkhanal480 7 years ago
At 19:30 the lecturer writes h(x) = (theta transpose) times (x), but wouldn't that give a 3-by-3 matrix? Shouldn't it be h(x) = (x) times (theta transpose)?
@tessb 13 years ago
@astroboomboy On the course website (google it) it says you need basic linear algebra, basic probability theory, and a little programming experience.
@armanrainy 13 years ago
Lecture 2 is done Sir (1:13 am). See u 2morrow on lecture 3. Thank you Professor. Thank you Stanford.
@이인서-h1p 4 years ago
Hi, I have a question about stochastic gradient descent. At 48:42, the inner loop iterates over j = 1 to m. Does m signify the size of the whole dataset? If it does, I don't see how that really differs from the sum over j = 1 to m in batch gradient descent. So is the m in stochastic gradient descent different from the m in batch gradient descent?
@sdenkasp 13 years ago
Thanks to my Linear Algebra course in Peru :), I understood this nice lecture... so I continue with Lesson 3. Thanks Stanford!!!
@Hero7641 12 years ago
Stanford has the right idea with spreading all this knowledge for free :D
@bennasserchafi304 7 years ago
this is brilliant. thank you so much professor
@sboparai09 12 years ago
This lecture would be improved by first introducing a simple quadratic equation (e.g., y = x² + 2x + 1) and finding its minimum by taking the derivative, setting it to zero, and solving for the value of x (the input causing that minimum). Then extend this concept to a 3D equation with two inputs x, y and output z: take the derivatives, set them to zero, and determine the values of x and y, in this case theta1 and theta2. The point of this lesson was to find a min (or max) given any number of inputs.
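For instance, the first step with the quadratic suggested above works out to:

$$y = x^2 + 2x + 1, \qquad \frac{dy}{dx} = 2x + 2 = 0 \;\Rightarrow\; x = -1, \qquad \frac{d^2y}{dx^2} = 2 > 0,$$

so $x = -1$ is the minimum; gradient descent generalizes this recipe to many inputs $\theta_1, \theta_2, \dots$ where no closed form is available.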
@ParthPatel643 7 years ago
The images shown on a white background are pretty hard to make out (like the plot of housing price vs. square footage).
@joshuaburkholder 14 years ago
@matharoofmaths; Yes ... and that's why he makes so many mistakes in these lectures and has a hard time answering his students' questions (and occasionally evades student questions) in later lectures ... but if his research papers are any indication, he will definitely be an outstanding teacher in the future. All criticism aside, this is much better than what we had before: nothing. Thank you, Dr. Ng and Stanford, for letting us in. This is making machine learning that much more accessible.
@jamesmeikle8310 8 years ago
most impressive of all is that this lecturer is actually a robot
@datalicious43 8 years ago
well that's why he is teaching at Stanford!!! Show some respect. Thanks
@abramswee 13 years ago
agree with caesiume. this type of lecture is great. both free and good.
@syn3rman65 6 years ago
43:35 Doesn't the gradient give the direction of steepest ascent?
@xenosicotte 6 years ago
Yes
@syn3rman65 6 years ago
Xavier Thanks!
@tculig 12 years ago
Theta is some constant. If you had a quadratic equation y = 3 + 2x + 5x², theta0 would be 3, theta1 would be 2, and theta2 would be 5.
@ajiteshbhan 4 years ago
I have a query, guys: the cost function in some examples, like the lectures in Andrew's AI series, had a 1/m term. My query is: what points do we need to consider when defining a cost function?
@chaityapatel2703 7 years ago
Any idea where to get proofs of the two distinct matrix-trace properties used for solving the normal equations?
@NetIdentity 11 years ago
Try Learning from Data on edX. It's easier to follow and easier to work through examples, and there are solutions to the homework problems.
@joshuaburkholder 16 years ago
Around time = 43:00, Dr. Ng again gave the wrong description of the gradient. Example: Let f(x,y) = x^2 + y^2. Hence, the gradient is ( 2x, 2y ). At the point (1,1), the gradient is (2,2). Since the only local minimum of f(x,y) is at (0,0) and since (1,1)+(2,2)=(3,3), then the gradient at (1,1) points away from the only local minimum of f(x,y); therefore, the gradient does not point toward the direction of steepest descent. The gradient points in the direction of steepest ASCENT.
@lsun9593 6 years ago
Interesting. Usually we come back to gradient descent when solving the inverse is impractical.
@gal1l1l-f7c 8 years ago
Isn't the cost function 1/(2m) times the sum of the squared errors, instead of just 1/2?
@adityasoni121 8 years ago
Galina Staneva yes that m is missing...
@JWang-co2vj 8 years ago
I don't think so, actually that 1/2 is just added so as to get a neat expression after taking derivatives.
@venkatagangadharraoy5407 8 years ago
If we divide by m, we are subtracting from theta(i) alpha times the average of the sum. If we don't divide by m, we are subtracting alpha times the sum itself. Technically it doesn't matter whether we divide by m or not, but dividing by m might make us converge faster, I guess. Would love to hear a mathematical explanation of this.
@venkatagangadharraoy5407 8 years ago
I have implemented gradient descent in R with and without the m. In both cases it converges. But the catch is that when you don't use m, you have to use a small value of alpha, like 0.01. If I use 0.1 it does not converge.
@gal1l1l-f7c 8 years ago
Thank you very much for your answer! This clears things up!
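For the record, a short derivation of why both constants are cosmetic (my summary, consistent with the lecture's cost function): the 1/2 cancels the 2 from the power rule,

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \;\Rightarrow\; \frac{\partial J}{\partial \theta_j} = \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)},$$

and minimizing $\tfrac{1}{m}J(\theta)$ with learning rate $\alpha$ takes exactly the same steps as minimizing $J(\theta)$ with rate $\alpha/m$, which matches the observation above that omitting $m$ just forces a smaller $\alpha$.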
@sharkllama 11 years ago
I think the use of the trace operator in the derivation of the Least Squares Estimator obfuscates the derivation. I believe this would be easier to follow if the properties of matrix derivatives were used instead.
@joshuaburkholder 16 years ago
Around time = 28:00, Dr. Ng said that if we want to go in the direction of steepest descent from a point J( theta1, theta2 ), then we should go in the direction of the gradient of J( theta1, theta2 ); however, this is incorrect. The gradient always points toward the direction of steepest ascent, not descent; therefore, if we want to go in the direction of steepest descent from a point J( theta1, theta2 ), then we should go in the direction that is opposite of the gradient ... -Del( J( theta1, theta2 ) ).
@Gaiacarra 14 years ago
@Fusionicon Basic Calculus. Other than the weird stats stuff he brings into play when formulating the error function ("J"), you don't need anything else, so long as you really pay close attention.
@JordanShackelford 9 years ago
I can't keep up. I want to learn this but I have no experience with the math he's using. Calculus, right?
@hamsterpoop 9 years ago
+Jordan Shackelford it's basic calculus and basic linear algebra... you can find free online courses for both online (check out MIT OCW, for example)
@blahdeblah1975 9 years ago
Linear algebra is Greek to most people. One semester will set you straight IF you do the homework.
@xxanfighter 9 years ago
Why are we trying to minimize (h(x)−y)² and not just h(x)−y?
@Sonictll 9 years ago
+Xanfighter: because we only need the absolute value of (h(x)−y) to be minimal, but (h(x)−y)² is more convenient for the math.
@WahranRai 9 years ago
+Xanfighter Minimizing means the derivative equals zero; we don't care about a constant coefficient.
@xxanfighter 9 years ago
Thanks guys, really appreciate the answer :)
@hamsterpoop 9 years ago
+Xanfighter The reason you minimize the square of the difference/error instead of the absolute error is that the linear algebra works out a lot easier this way. The assumption is that if the absolute difference is high, it is the same as the squared difference being high. But basically, it's simply for mathematical ease. There is a lot of research on L1-norm minimization; check out the Wikipedia article "Least absolute deviations".
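One more reason worth noting (my addition, consistent with the comment above): the squared loss is differentiable everywhere, which is what lets the gradient-descent derivation in this lecture go through,

$$\frac{d}{dh}\,\tfrac{1}{2}(h-y)^2 = h - y, \qquad \frac{d}{dh}\,\lvert h-y\rvert = \operatorname{sign}(h-y)\ \text{(undefined at } h = y\text{)}.$$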
@chvan2335 13 years ago
A couple of lectures in, it's surprisingly easy to get your head around this shit. Guess it all gets very tricky and intricate soon after, though.
@psbbboyz123 12 years ago
What he says is right, I think. He says that if X^T X is not invertible, which is the case when X is not a full-rank matrix (he says X has dependent columns), then in that particular case you find the pseudo-inverse.
@subhaprakash5416 11 years ago
What does the transpose of theta represent?
@iliasasdf 12 years ago
Basically, the partial derivative gives you the "steepest" way to the local minimum (about 43:00, last question).
@iliachigogidze6550 5 years ago
Is y missing the superscript i at 1:04:10?
@jaanuskiipli4647 6 years ago
At 37:43, shouldn't we normalize the sum by dividing by m? Otherwise the correction amount will blow up the more training data we input.
@jaanuskiipli4647 6 years ago
Okay, forget the question: the error function J itself is not normalized, so it's okay for everything to scale up.
@sboparai09 12 years ago
In the method described, "(batch) gradient descent" is just optimization: iterating over a training set from a selected starting point (initial parameters) to find new minima with their respective parameters. He is right that it can be slow if you have MANY parameters, since that increases the number of combinations; the derivative eliminates useless combinations. The stochastic version is better because it tries to "guess" the direction and doesn't attempt to iterate over every available combination.
@DragonSlave49 11 years ago
He doesn't say how they decide alpha. It is just a "step size" for the gradient descent. It is the "weight" of the change in the parameter theta. Larger alpha means theta will converge faster but less accurately.
@shivananda30 5 years ago
what does convergence mean here?? Is it the actual value converging to the predicted value?
@sanjhECE 5 years ago
Moving towards the local or global minimum, where J(theta) will be at its minimum.
@shivananda30 5 years ago
@@sanjhECE Thank you so much
@fupopanda 5 years ago
His notations are listed here: 13:41
@CSEfreak 11 years ago
Once you reach the minimum (according to the GD algorithm) it stops moving; theta doesn't change anymore. That's when you know you've reached the local minimum.
@djremixmusic6598 3 years ago
I don't know the math formulas in the lecture, so what is the solution for me?
@hnomier 15 years ago
Thank you stanford ...really great work ...The lectures are great
@OrakzaiSays 5 years ago
Can we somehow get those Friday TA classes?
@iliachigogidze6550 5 years ago
I think the batch gradient descent formula is missing a 1/m at 44:39. Am I correct?
@khurshedfitter5695 5 years ago
I think the same
@GraceTao 12 years ago
I cannot understand most of the equations in lecture 2. What kind of background knowledge should I look for?
@haoc5698 5 years ago
You can look up the least-squares solution.
@taketaxisky 14 years ago
For batch and stochastic gradient descent, is alpha (learning rate) usually the same size?
@sandysandeep7227 7 years ago
I haven't studied the math. So, can someone please explain what exactly θ is? What is θ0 + θ1·X? I understood the hypothesis, but I don't know what θ0 + θ1·X actually means.
@lahirusomaratne7568 7 years ago
Sandy Sandeep This means that the algorithm is going to come up with a simple linear regression model where theta zero denotes the price of a very small house (theoretically zero square feet but as you know there is no such house) and theta one denotes the price increase per increase in each square footage.
@manisharma3068 7 years ago
Hey Sandy, theta0 is the base price; think of it as the minimum price for all houses, a price every house has to have. X is some feature of the house (size, number of bedrooms, etc.) which we multiply by a coefficient theta1. Our hypothesis is that each house has a base price, and that the feature X affects the price of the house by a factor of theta1, so each unit increase in X increases the price by theta1. The only thing left to do is compute the value of theta1, which the professor does at the end of the video.
@FelixCrazzolara 6 years ago
I just started to watch this lecture too, and I'm only in my second year of EE, but if you don't understand this stuff I guess you'd be better off thoroughly reading a book about linear algebra first, and probably some theory about signals and systems. He models the target as a linear function of the input, plus a constant term. I guess this is how you should think about this stuff in general. But as I said, only a 2nd-year Bachelor student^^
@kiriappeee 13 years ago
Hmm.. so ALVINN looks at the road ahead and records the steering direction. So what if the road ahead is a curve, but since I'm on a straight patch for the moment my steering direction is still straight? Seeing where the cam was placed, and that there was no bonnet in the pictures, it must have been calculated for a few metres ahead. Does that affect anything? In the video, ALVINN's response seems about 0.5 seconds behind a typical human response, especially in the live tests.
@eachonly 11 years ago
I just wonder whether the stochastic gradient algorithm is more efficient than the batch gradient algorithm given that the number of data points n is large; the number of iterations for the batch gradient algorithm should be far less than n.
@canadianrepublican1185 8 years ago
Are the discussion sessions posted online?
@linhelen8222 5 years ago
Overview: batch gradient descent, stochastic gradient descent, normal equation.
- Batch GD: update theta after scanning all samples.
- Stochastic GD: update theta after scanning one sample (useful when the number of samples is large).
- Normal equation: the analytical solution for theta, without iteration.
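A minimal numpy sketch of the third option (my own illustration, not lecture code; the synthetic price-vs-area data is made up), following the lecture's closed form θ = (XᵀX)⁻¹Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
area = rng.uniform(500, 3500, size=m)          # square footage
price = 50_000 + 120 * area + rng.normal(0, 10_000, size=m)

X = np.column_stack([np.ones(m), area])        # design matrix with intercept column
# normal equation: theta = (X^T X)^{-1} X^T y  (solve() is more stable than inv())
theta = np.linalg.solve(X.T @ X, X.T @ price)
print(theta)    # close to [50000, 120] -- no iteration needed
```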
@Gauravsaxena2512 11 years ago
Just wondering: why is the normal equation only for the OLS case? What assumption was made in the derivation that restricts the equation to this specific case?
@rayptucha3515 10 years ago
At 1:11:35 we have C = X'X and C' = X'X. Can someone please explain?
@Korzakapitany 10 years ago
(A * B)' = B' * A', this means if we apply this to (X' * X) we will get: (X' * X)' = X' * (X')' = X' * X, thus here it is the same thing.
@ritagatspy4750 9 years ago
C = X'X, so C = C' because X'X is symmetric. For example, for a vector x, x'x is a numeric value (and a scalar equals its own transpose, 1 = 1') while xx' is a matrix. Also, (AB)' = B'A'.
@biswajeettripathy773 7 years ago
Why is 1/2 multiplied by the squared difference between predicted and actual values? Why not some other constant, or no constant at all?
@fakal007 7 years ago
It's just because when you do the derivative of the squared term, you get 1/2 * 2 which is 1 and so it's nicely legible again :)
@juludd 15 years ago
Could someone explain how to get ∇_A tr(A B A^T C) = CAB + C^T A B^T? I can't see how you get an addition on the right-hand side, at least not from within the rules he described in the lecture. Could one use the chain rule for the derivation?
@salmonito2 11 years ago
The lecture notes differ. The batch gradient descent in the notes computes the residual (if I understand correctly, data minus fit) y − h(x), the square of which we try to minimize, but the prof has h(x) − y. Which one is correct?
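Both are correct and give identical updates; the square kills the sign, and the two resulting gradient-descent rules coincide (my note, matching the notation above):

$$\left(y - h_\theta(x)\right)^2 = \left(h_\theta(x) - y\right)^2, \qquad \theta_j := \theta_j + \alpha\left(y - h_\theta(x)\right)x_j \;\equiv\; \theta_j := \theta_j - \alpha\left(h_\theta(x) - y\right)x_j.$$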