What each lecture covers:
1. An overview of the course in this introductory meeting.
2. Linear regression, gradient descent, and the normal equations, and how they relate to machine learning.
3. Locally weighted regression, the probabilistic interpretation, and logistic regression, and how they relate to machine learning.
4. Newton's method, exponential families, and generalized linear models, and how they relate to machine learning.
5. Generative learning algorithms and Gaussian discriminant analysis, and their applications in machine learning.
6. Naive Bayes, neural networks, and support vector machines.
7. Optimal margin classifiers, KKT conditions, and the SVM dual.
8. Support vector machines, including soft-margin optimization and kernels.
9. Learning theory, covering bias, variance, empirical risk minimization, the union bound, and Hoeffding's inequality.
10. More learning theory: VC dimension and model selection.
11. Bayesian statistics, regularization, a digression on online learning, and applications of machine learning algorithms.
12. Unsupervised learning in the context of clustering, Jensen's inequality, mixtures of Gaussians, and expectation-maximization.
13. Expectation-maximization for the mixture-of-Gaussians and naive Bayes models, as well as factor analysis (with a digression).
14. Factor analysis and its expectation-maximization steps, continuing on to principal component analysis (PCA).
15. Principal component analysis (PCA) and independent component analysis (ICA) in relation to unsupervised machine learning.
16. Reinforcement learning, focusing particularly on MDPs, value functions, and policy and value iteration.
17. Reinforcement learning, focusing particularly on continuous-state MDPs, discretization, and policy and value iteration.
18. State-action rewards, linear dynamical systems in the context of linear quadratic regulation, models, the Riccati equation, and finite-horizon MDPs.
19. The debugging process, linear quadratic regulation, Kalman filters, and linear quadratic Gaussian control, in the context of reinforcement learning.
20. POMDPs, policy search, and Pegasus, in the context of reinforcement learning.
@rahulrathnakumar785 5 years ago
You're a godsend. Thanks
@lionellball8257 5 years ago
Thank you
@devivyapari2958 4 years ago
Thanks for the comprehensive list
@quantummath 11 years ago
Andrew Ng rocks... he's an amazing teacher and an influential engineer as well as a great scholar. In a rather small but unprecedented step, you've managed to popularize machine learning. Nice!
@sharjeeltahir5583 6 years ago
I disagree to some extent
@perioguatexgaming1333 3 years ago
@@sharjeeltahir5583 why?
@cogent4645 7 years ago
The fact that by using simple physical examples (Portland property prices) you can generalize and abstract into learning algorithms is just amazing. What an inspiration as a teacher!! Thank you.
@SagarPokhrel 7 years ago
The best lecture for understanding machine learning that I've gone through so far. Professor Andrew Ng is the all-time best teacher for me.
@yardenm15 6 years ago
This is a pure gold mine for anyone interested in machine learning. He does such an amazing job explaining everything in a simple way, especially the parameters in new definitions and equations, with plenty of examples and interesting videos.
@ajayram198 9 years ago
Very well explained. I was going through the Coursera course video lectures, but found this one much better.
@Rjsipad 9 years ago
+ajayram198 same
@evgeniynorin7345 7 years ago
which course?
@coolguy-dw5jq 7 years ago
The Coursera course by Andrew Ng himself.
@davidalexander829 7 years ago
Agreed. Of all the MOOCs, I like Coursera the least but Ng is much better in this lecture format
@teejiahen 15 years ago
It's just like studying at Stanford! Although not physically there, it really let me gain more knowledge of machine learning than I got from my university. And he is really a good lecturer! Thank you to those who proposed this to Stanford University and uploaded it!
@curcicm 8 years ago
The normal equations fall out immediately from the perpendicularity criterion for the shortest distance, X^T (X * theta - y) = 0, so you don't have to get into trace computations.
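(A minimal MATLAB sketch of that closed form — the design matrix and prices below are made-up numbers, purely for illustration:)
% theta = (X'X)^(-1) X'y follows from the orthogonality condition X'(X*theta - y) = 0
X = [1 2104; 1 1416; 1 1534; 1 852];   % hypothetical design matrix: intercept column + living area
y = [460; 232; 315; 178];              % hypothetical prices
theta = (X' * X) \ (X' * y);           % backslash solves the normal equations without forming an explicit inverse
disp(theta)                            % theta(1) = intercept, theta(2) = price per square foot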
@cmares5858 9 years ago
Well that escalated quickly... time to brush up on some of this math before continuing.
@TTGxCROTTY 9 years ago
+cmares5858 Lol, Yup
@sahawndada 8 years ago
+cmares5858 Yup!! I was like "ahh, I'll be fine"... NOPE. What subjects do you think you need to brush up on before you can understand this?
@tj8870 8 years ago
+cmares5858 Well we barely learned anything from lecture 1..
@elborrador333 8 years ago
+Pat Bradley You should know introductory probability, linear algebra, and maybe some multivariate calculus. If you're determined, MIT has lecture series on all of those on YouTube. You might also want to think about applying some of these algorithms yourself so the theory sticks.
@danny-bw8tu 7 years ago
I hope it is not too late. I felt exactly what you felt the first time I encountered this lecture. The math it involves is multivariate calculus and some elementary statistics. Moreover, there are good books about machine learning, plus tons of material on the internet about gradient descent, which is very helpful.
@Jabrils 7 years ago
im raising my hand, why isn't professor Ng calling on me?
@UbuntuTricks 6 years ago
Jabrils, I'm a big fan of yours!
@leonhardeuler9839 6 years ago
He doesn’t like you Jabrils
@kongki7563 5 years ago
lol !
@JohannSuarez 4 years ago
Dude, you inspired me to start taking Computer Science two years ago. Thanks, Jabrils!
@bennasserchafi304 7 years ago
Do you really understand how lucky we are to have someone like this legend explain this material to us?
@dongiea 11 years ago
Andrew Ng (the lecturer in these videos) teaches a course on Coursera that is based on this class. It covers the same fundamental ideas but might not be as in depth as these Stanford lectures.
@TheDestint 4 years ago
That coursera course is bs compared to this series.
@eng.mohammadshericmrp9251 5 years ago
If we have a dataset with the number of points m = 1000:
- Batch gradient descent: apply the update using all m points in each step of the iteration (i = 1, ..., m).
- Stochastic gradient descent: apply the update using one point at a time rather than all of them.
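(A rough MATLAB sketch of the two update schedules — synthetic data and an arbitrarily chosen alpha, illustrative only:)
m = 1000;                               % number of training points
X = [ones(m,1) randn(m,1)];             % design matrix: intercept + one feature
y = 4 + 3*X(:,2) + 0.1*randn(m,1);      % targets from a known linear rule plus noise
alpha = 0.1;
theta_batch = zeros(2,1);
theta_stoch = zeros(2,1);
for k = 1:100                           % batch: each update touches all m examples
    theta_batch = theta_batch - alpha * (X' * (X*theta_batch - y)) / m;
end
for i = 1:m                             % stochastic: each update touches a single example
    xi = X(i,:)';
    theta_stoch = theta_stoch - alpha * (xi' * theta_stoch - y(i)) * xi;
end
disp([theta_batch theta_stoch])         % both should come out near [4; 3]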
@George-lt6jy 8 years ago
first learning algorithm. i am so pumped.
@KCOWMOO 7 years ago
Top KeK 😀
@arran5498 15 years ago
Stanford. Thanks for posting these lectures! Big thank you!
@akshatb 8 years ago
NOTE: A^(T) represents the transpose of matrix A. At 59:56, shouldn't it be only C^(T)AB^(T) and not C^(T)AB^(T) + CAB? According to one of the earlier identities, the gradient of tr(AB) with respect to A equals B^(T), so the gradient of tr(ABA^(T)C) should equal (BA^(T)C)^(T) = C^(T)AB^(T). Please help me sort this out.
@harrakaymane 8 years ago
No, because A^T also depends on A. What you're saying is like claiming that, since the derivative of a*x with respect to x is a, the derivative of x*a*x should be a*x — but that's not the case (the product rule applies).
@akshatb 8 years ago
Aimane Harrak, thanks, I got it now.
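(For later readers, a sketch of the product-rule step that resolves this, using \( \nabla_A \operatorname{tr}(AM) = M^{\top} \) and the companion identity \( \nabla_A \operatorname{tr}(A^{\top} M) = M \):)
\[
\nabla_A \operatorname{tr}(A B A^{\top} C)
= \underbrace{(B A^{\top} C)^{\top}}_{\text{vary the first } A}
+ \underbrace{C A B}_{\text{vary the } A \text{ inside } A^{\top}}
= C^{\top} A B^{\top} + C A B .
\]
The first term holds A^T fixed and applies the tr(AM) rule with M = BA^T C; the second holds the first A fixed, rewrites tr(ABA^T C) = tr(A^T CAB) by cyclicity of the trace, and applies the tr(A^T M) rule with M = CAB.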
@abhishekkumar-os5zk 4 years ago
Answering the question at 44:00: we differentiate again (take the second derivative); if it is greater than zero we are descending toward a minimum, else ascending toward a maximum.
@field-yetian6001 9 years ago
Question: at 34:23, for a certain training sample, the adjustment of the j-th theta = -alpha * (estimation error) * Xj. For example, suppose we have only one theta and one x, where theta = unit price per sq ft and x = the number of sq ft. I don't understand why a larger Xj should lead to a larger theta adjustment. Say we have 2 cases, and in both the estimation error is 10,000 dollars. In the first case Xj = 500 sq ft; in the second case Xj = 5000 sq ft. Then the second case feeds back a 10x larger adjustment for the unit price. But why? In the first case, you tell the machine: hey, you missed by 10,000 dollars, given that the apartment has 500 sq ft, so next time reduce the price by 20 dollars per sq ft. This makes sense. Then in the second case, you tell the machine: hey, you missed by 10,000 dollars, given that the house has 5000 sq ft, so next time reduce it by 200 dollars per sq ft. That's weird. Thanks, folks
@antonylawler3423 9 years ago
+田野 I think it is because theta isn't a $ value in sqr ft, but a number by which the sample xi is multiplied.
@field-yetian6001 9 years ago
+Antony Lawler Thank you so much for the reply. Technically, as you said, theta can't be defined as a unit price. But I think theta is at least an analogue of unit price, in that the product theta_1 * x_1 (area) roughly represents the part of the house price corresponding to area. This feedback design seems counter-intuitive.
@antonylawler3423 9 years ago
+田野 No problem. How are you getting on with Lecture 3 ?
@gt7318d 9 years ago
+田野 The adjustment formula is oversimplified. I believe the alpha in the formula should vary with xj. Basically the adjustment formula tries to arrive at a solution for which dJ/dtheta = 0, the first partial derivative of J with respect to theta. If you use the Newton-Raphson zero-finding formula, you end up with theta := theta - beta * (dJ/dtheta)/(d2J/dtheta2), where d2J/dtheta2 is the second partial derivative. If you carry out the math, you will find that the second derivative is proportional to xj^2. With the first derivative proportional to xj, the adjustment term ends up as a constant beta multiplied by 1/xj, so a smaller adjustment is made when xj becomes larger. Hope this helps. Very interesting observation though!
@field-yetian6001 9 years ago
+Antony Lawler I got quite frustrated with the math... I got stuck at video 3 and haven't revisited it for a few weeks.
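(For anyone following this thread, a sketch of where the x_j factor in the update comes from — for a single training example with \( J(\theta) = \tfrac{1}{2}(\theta^{\top}x - y)^2 \):)
\[
\frac{\partial J}{\partial \theta_j}
= (\theta^{\top}x - y)\,\frac{\partial}{\partial \theta_j}\sum_k \theta_k x_k
= \big(h_\theta(x) - y\big)\, x_j ,
\]
so the update scales with x_j simply because theta_j's influence on the prediction is itself proportional to x_j. This is also one reason feature scaling matters in practice.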
@mayaahmed 15 years ago
Really nice. Well taught. I am really enjoying listening to these lectures. A true service to the public.
@CosminVarlan 7 years ago
I think one alternate answer to the question at 41:40 might be that we have found the minimum (the convergence point) when the derivative goes to 0 or nearby: the derivative of a function measures the slope, and when it goes to 0 it means we have found a local maximum or minimum; because we are hunting the minimum, it means we found it. Am I right?
@OrakzaiSays 5 years ago
1:00:10 and onward: each training example is a row of the matrix, and we take the transpose so that it becomes a column vector?
@rakeshprab11 4 years ago
learning a whole new concept easily in one hour is fantabulous.......thanx...
@joshuaburkholder 16 years ago
Around time = 28:00, Dr. Ng noted that to go in the direction of steepest descent from a point, ( theta1, theta2, J(theta1, theta2) ), we should go in the direction of the gradient of J at that point; however, this is incorrect. The gradient always points in the direction of steepest ascent, not descent; therefore, the direction of steepest descent from ( theta1, theta2, J(theta1, theta2) ) is opposite of the gradient: -Del( J( theta1, theta2 ) ).
@PMetheney84 9 years ago
At 1:10:20, I think there is a trace missing before the Nabla_Theta(y^T X Theta) term (the very last term). All the other terms have traces — why doesn't this one? Without it, one cannot apply the rules he introduced before (Nabla tr(AB) = B^T).
@bidhovbizar 6 years ago
You are right. It should have the trace notation too; otherwise he cannot use the 2nd of the 5 facts he mentioned during the matrix algebra revision. He might have accidentally missed it.
@eng.mohammadshericmrp9251 5 years ago
Two ways to find the theta that minimizes the cost function:
1. Normal equation (no iteration): take the derivative and set it equal to zero.
2. Gradient descent (with iteration): take the derivative and apply the GD update repeatedly.
For example, to find the minimum of y = x^2:
1. Normal equation: 2x = 0, so x = 0. This is the solution.
2. Gradient descent: the derivative is 2x, so x1 = x0 - step_size * 2*x0; after a number of iterations, x approaches 0. Same solution.
@samferrer 9 years ago
If the rest of the lectures are based on these operators... then I will hang out till the very end... elegant!!
@chriswalsh5925 8 years ago
Just wondering if you could encode the landscape using Fourier transforms and then use that multi-level representation, with a slightly modified algorithm, to get a faster / more accurate result?
@signemadara2459 10 years ago
Can someone clarify, please? At 50:00, when he answers the question about stochastic gradient descent, surely he does not mean that in each iteration we use the SAME training example, right? I am sure he means that in each iteration we take a different training example, but the way he talks about it is slightly confusing.
@ericakim4587 10 years ago
i think for the first step, you use the first training example and update all of the thetas. then for the second step, you use the second training example and update all of the thetas. and so on... so yeah, you use a different training example for each step/iteration
@elliottbajema3092 10 years ago
Yeah, the confusion is because he says "for each step, you're only using one training example". Worth emphasising that it's the jth example, which changes each step, and not the SAME training example. In batch, you use the entire training set of all (potentially millions) of examples, so each equivalent step for stochastic is potentially millions of times faster. It's just a compromise for the sake of speed. More generally, presumably you would actually take 'a random sample' of training examples rather than the jth, for greater accuracy.
@signemadara2459 10 years ago
Thanks!
@pitr2596 8 years ago
Am I wrong or right if I assume that the gradient is actually oriented in the direction of biggest ASCENT? Wikipedia says so too... so I assume we should use the gradient's orientation multiplied by -1 for the stated example, contrary to what is mentioned in the video.
@erichoft7154 8 years ago
+King Schultz Maybe it depends on what exactly you are trying to optimize. If you are looking for a minimum cost you would go in the direction of greatest descent and if you are looking for a maximum profit you would go in the direction of greatest ascent?
@erichoft7154 8 years ago
+Eric Hoft I could be talking out of my ass though.
@pitr2596 8 years ago
That makes total sense, of course. I just mean that the gradient is mathematically defined as the direction of greatest ascent, so it actually points toward the greatest ascent and its length is the magnitude of the ascent. That's why it irritates me that we use the gradient here as if it pointed toward the biggest descent.
@DavidVaughan00 8 years ago
+King Schultz You're right, gradient points in the direction of greatest ascent, so he is slightly off when he talks about it. Not a huge deal though; just gotta keep in mind when he says "gradient" we should be thinking "negative gradient".
@СергейКиян-ш6у 8 years ago
That is why he subtracts the gradient (which is simply adding the gradient multiplied by -1).
@TheReaMrBurntSausage 8 years ago
I'm a high school junior and I didn't know what a partial derivative was, so I walked into my AP Calc class today, asked the teacher, and was told to never speak of it again. Apparently my teacher has repressed nightmares of it from college, haha. I looked it up; seems pretty straightforward — I think I get it now.
@danny-bw8tu 7 years ago
I don't think u should study machine learning now, and I don't think u got 'it' — it involves way more than just partial derivatives, kid.
@elzilcho222 7 years ago
that was a year ago, he's probably graduated college by now
@jazzpote4316 6 years ago
@da ny You deserve to be kept far away from every learner! Give this 'kid' the hope and belief that he can do it, and he will — instead of trying to fix your ego.
@superwiseman452 6 years ago
yup, I know PDEs well enough. Shame on your teacher for turning you away!
@florocasta 5 years ago
Thank you Professor Ng and Stanford University.
@jameskhan9383 8 years ago
At 1:01:52 the design matrix X is m by n. Then he multiplies by theta and it looks like we're just left with an m x 1 vector. Is each x in the resulting vector assumed to be n-dimensional, or am I missing something?
@jameskhan9383 8 years ago
Actually I think I'm being stupid. It's because we're multiplying by theta which is n x 1 right?
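(Dimension check, for reference:)
\[
X \in \mathbb{R}^{m \times n}, \qquad \theta \in \mathbb{R}^{n \times 1} \;\Longrightarrow\; X\theta \in \mathbb{R}^{m \times 1},
\]
and the i-th entry of \( X\theta \) is the scalar prediction \( (x^{(i)})^{\top}\theta \), so yes: each row of X is one n-dimensional training example.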
@flaviopibetagama 4 years ago
Hi. Great video. I have a question: at 1:08:40, why is the first term of the product (X*theta - y)^T (X*theta - y) equal to theta^T X^T X theta? Why is it not X^T theta^T X theta?
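(A sketch of the expansion, using the identity \( (AB)^{\top} = B^{\top}A^{\top} \):)
\[
(X\theta - y)^{\top}(X\theta - y)
= \theta^{\top}X^{\top}X\theta - \theta^{\top}X^{\top}y - y^{\top}X\theta + y^{\top}y ,
\]
since \( (X\theta)^{\top} = \theta^{\top}X^{\top} \): the transpose of a product reverses the order of the factors, so \( X^{\top}\theta^{\top} \) never appears (its inner dimensions would not even match).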
@jcbmack 12 years ago
Denzel, it is all about the changing of the thetas, which are parameters (weights) that take on new values with each update. We want to choose a theta that minimizes J(theta). Gradient descent takes the form theta_j := theta_j - alpha * dJ(theta)/dtheta_j (a partial derivative), and the update is performed on all j values simultaneously. Thus we begin with some value of theta and then repeatedly change it to make J(theta) smaller. Alpha is just the learning rate.
@DrDizzyMorris 12 years ago
Firstly, I'm loving this — great class! I have a question about the derivation of gradient descent. How is the partial derivative of J(theta) taken in the iterative algorithm if J(theta) is simply a constant? We already have x, y, and the initial theta (the zero vector), so how can we take the partial derivative AND THEN plug in what we know? Could the mathematical notation possibly be improved a bit? As it stands, it's not making sense to me, and I've been through an entire calculus sequence.
@SiddharthGupta234 8 years ago
Why is there no m in the denominator? @1:04:25
@davidalexander829 7 years ago
I wondered the same thing. Instead he arbitrarily assigns 1/2, versus the usual sum of squared differences over n.
@newbielives 8 years ago
Am I the only one impressed by the chalk board that wipes itself clean when he lifts it up and pulls it back down
@UtkarshRuhela 8 years ago
It's a different board, you dumbass.
@СергейКиян-ш6у 8 years ago
The board doesn't get cleaned; it's an illusion. The lecturer just lifts one board up and pulls a new one down. Look at 48:00.
@mksv7663 8 years ago
He obviously applied a learning algo to it!
@joshuaadickerson 8 years ago
I am laughing so hard. I would be impressed by that too, but as others said, they are overlapping chalkboards.
@xiangzhang7355 7 years ago
hahah~~~
@praneeta1331 5 years ago
These videos are brilliant!! Andrew is super cool at teaching. Thanks, Stanford!!
@Jacob0111 3 years ago
In my linear systems course we used the same normal equation for estimating the parameters of a discrete model of a continuous system. The thing is, it can be derived in a much simpler way than the one shown in the lecture (without the use of traces, let alone trace algebra). :) So, besides that, a great lecture and certainly motivating.
@eng.mohammadshericmrp9251 5 years ago
%% Visualizing gradient descent on a quadratic function using MATLAB:
clear all
close all
clc
%% Defining the input and the output:
Input = -5:0.1:5;
Output = Input.^2;
%% Plotting the function:
plot(Input, Output, 'LineWidth', 3)
hold on
%% Setting the required parameters:
step_size = 0.01;
Iterations = 100;
%% Initializing the starting point:
X0(1) = 3.5;
%% Plotting the first step:
Ite = 1;
disp(['Iteration ' num2str(Ite) ': Best Minima = ' num2str(X0(Ite))]);
Output = X0(Ite).^2;
plot(X0(Ite), Output, '.', 'MarkerSize', 30)
%% Starting the iterative gradient descent:
Ite = 2;
while (Ite < Iterations)
    %% Least mean squares (gradient descent) update:
    X0(Ite,:) = X0(Ite-1,:) - step_size .* 2 .* X0(Ite-1,:);
    Output = X0(Ite).^2;
    disp(['Iteration ' num2str(Ite) ': Best Minima = ' num2str(X0(Ite))]);
    %% Plotting the next step:
    plot(X0(Ite), Output, '.', 'MarkerSize', 30)
    Ite = Ite + 1;
end
@phibouafia 10 years ago
One thing I did not understand is why introduce batch gradient descent or the stochastic version if the problem can be solved by linear algebra. Is this only a way to introduce those algorithms, which we will use for more complicated minimization problems? Or do you really use these algorithms for this particular problem?
@orrymr 10 years ago
I think the case may be that doing it with linear algebra can be quite computationally intensive, whereas the gradient descent algorithms don't require (computationally intensive) matrix multiplication.
@daniellee3987 10 years ago
I think it's because of the quantity of data involved. If the training set is too large, an iterative algorithm might not be practical due to hardware limitations. So yes, I think we pick the most efficient algorithm depending on the situation.
@DavidVaughan00 8 years ago
+phibouafia In general, only some problems (i.e., minimizing least squares with a linear h function) can be solved using linear-algebra closed forms. Most can't, unfortunately. I think he shows us the gradient descent methods here, even though we don't need them, because we WILL need them a lot more later in the course.
@dkwroot 7 years ago
At 18:30 he writes the summation of the 'vectors' as theta transpose times x. How did he determine this? Did he use the dot-product identity a . b = a^T * b?
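(That is exactly it — a sketch, using the convention that x_0 = 1 absorbs the intercept term:)
\[
h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^{\top} x ,
\]
the standard identity \( a \cdot b = a^{\top} b \) for two column vectors.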
Around 55:47, should it be written as the gradient of f wrt A, and not be evaluated at A? i.e. drop the "(A)" before the "="? Otherwise, you'd be taking the gradient of a real #, unless I'm reading something wrong...
@jose-rs 8 years ago
A bit late with my response, but in this case A is regarded as a variable, so writing f(A) is the same as just f. Here f is not being evaluated at a specific value like A = I or something.
@filipturczynowicz-suszycki7728 7 years ago
I can't express how much i loved this video
@punstress 10 years ago
To Maris: since they square the result, it doesn't matter whether you use y - h(x) or h(x) - y. (For some reason there was no reply option under your question — maybe it's too old — but someone else might have the same question.)
@KlajdiDervishaj 5 years ago
He is missing the superscript (i) (the training example index) on y in the last line of the summation equation. Min 1:04:23
@KlajdiDervishaj 5 years ago
OK he fixed it later...
@vg9311 6 years ago
At 44:05, he says that the derivative of the function gives the steepest descent, and said the TAs would probably elaborate on that in another session. Can someone please explain that?
@DrDizzyMorris 12 years ago
Thanks jcbmack; between your comment and reviewing the lesson again, I was able to make sense of the concept I was misunderstanding. I was considering the parameters/thetas to be constant when in fact they are varying; why, I have no idea, haha. Cheers!
@sushantkhanal480 7 years ago
At 19:30 the lecturer writes h(x) = (theta transpose) times (x), but wouldn't that give a 3 by 3 matrix? Shouldn't it be h(x) = (x) times (theta transpose)???
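(It only comes out as a 3-by-3 matrix if theta and x are treated as something other than column vectors; a dimension sketch:)
\[
\theta, x \in \mathbb{R}^{n \times 1} \;\Longrightarrow\; \theta^{\top} x \in \mathbb{R}^{1 \times 1},
\]
a scalar, which is what a predicted price should be; \( x\theta^{\top} \) would instead be an n-by-n outer product.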
@tessb 13 years ago
@astroboomboy On the course website (google it) it says you need linear algebra and probability theory — basic linear algebra, basic probability, and a little programming experience.
@armanrainy 13 years ago
Lecture 2 is done Sir (1:13 am). See u 2morrow on lecture 3. Thank you Professor. Thank you Stanford.
@이인서-h1p 4 years ago
Hi, I have a question about stochastic gradient descent. At 48:42, the inner loop iterates over j = 1 to m. Does m signify the size of the whole dataset? If it does, I think it is not really different from the sum over j = 1 to m in batch gradient descent. So... the m in stochastic gradient descent is different from the m in batch gradient descent, right???
@sdenkasp 13 years ago
Thanks to my Linear Algebra course in Peru :), I understood this nice lecture... so I continue with Lesson 3. Thanks Stanford!!!
@Hero7641 12 years ago
Stanford has the right idea with spreading all this knowledge for free :D
@bennasserchafi304 7 years ago
this is brilliant. thank you so much professor
@sboparai09 12 years ago
This lecture would be improved by first introducing a simple quadratic equation (e.g. y = x^2 + 2x + 1), finding its minimum by taking the derivative, setting it to zero, and solving for x (the input that causes that minimum). Then extend this concept to a 3D equation with two inputs x, y and output z: take the derivatives, set them to zero, and determine the values of the inputs — in this case theta1 and theta2. The point of this lesson was to find a min (or max) given any number of inputs.
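(Worked out, that suggestion looks like this:)
\[
y = x^2 + 2x + 1, \qquad \frac{dy}{dx} = 2x + 2 = 0 \;\Longrightarrow\; x = -1 ,
\]
with \( \frac{d^2y}{dx^2} = 2 > 0 \) confirming a minimum; gradient descent reaches the same point iteratively instead of solving the zero-derivative equation in closed form.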
@ParthPatel643 7 years ago
The images shown on a white background are pretty hard to make out (like the plot of housing price vs. square footage).
@joshuaburkholder 14 years ago
@matharoofmaths Yes... and that's why he makes so many mistakes in these lectures and has a hard time answering his students' questions (and occasionally evades student questions) in later lectures... but if his research papers are any indication, he will definitely be an outstanding teacher in the future. All criticism aside, this is much better than what we had before — nothing. Thank you, Dr. Ng and Stanford, for letting us in. This is making machine learning that much more accessible.
@jamesmeikle8310 8 years ago
most impressive of all is that this lecturer is actually a robot
@datalicious43 8 years ago
well that's why he is teaching at Stanford!!! Show some respect. Thanks
@abramswee 13 years ago
agree with caesiume. this type of lecture is great. both free and good.
@syn3rman65 6 years ago
43:35 Doesn't the gradient give the direction of steepest ascent?
@xenosicotte 6 years ago
Yes
@syn3rman65 6 years ago
Xavier Thanks!
@tculig 12 years ago
Theta is some constant. If you had a quadratic equation y = 3 + 2x + 5x^2, then theta0 would be 3, theta1 would be 2, and theta2 would be 5.
@ajiteshbhan 4 years ago
I have a query, guys: in some examples, like the lectures in Andrew's AI series, the cost function had a 1/m term. My query is: what points do we need to consider when defining a cost function?
@chaityapatel2703 7 years ago
Any idea where to get the proofs of the two matrix trace properties used for solving the normal equations?
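(One of the two can be checked entry by entry; a sketch:)
\[
\operatorname{tr}(AB) = \sum_i \sum_j A_{ij} B_{ji}
\;\Longrightarrow\;
\frac{\partial \operatorname{tr}(AB)}{\partial A_{ij}} = B_{ji} ,
\]
i.e. \( \nabla_A \operatorname{tr}(AB) = B^{\top} \); the \( \operatorname{tr}(ABA^{\top}C) \) identity then follows from this plus the product rule over the two occurrences of A.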
@NetIdentity 11 years ago
Try Learning from Data on edX — easier to follow and easier to work through examples. There are solutions to the homework problems.
@joshuaburkholder 16 years ago
Around time = 43:00, Dr. Ng again gave the wrong description of the gradient. Example: Let f(x,y) = x^2 + y^2. Hence, the gradient is ( 2x, 2y ). At the point (1,1), the gradient is (2,2). Since the only local minimum of f(x,y) is at (0,0) and since (1,1)+(2,2)=(3,3), then the gradient at (1,1) points away from the only local minimum of f(x,y); therefore, the gradient does not point toward the direction of steepest descent. The gradient points in the direction of steepest ASCENT.
@lsun9593 6 years ago
Interesting. Usually it comes back to gradient descent when solving for the inverse is impractical.
@gal1l1l-f7c 8 years ago
Isn't the cost function 1/(2*m) instead of just 1/2 of the sum of the squared errors?
@adityasoni121 8 years ago
Galina Staneva Yes, that m is missing...
@JWang-co2vj 8 years ago
I don't think so, actually that 1/2 is just added so as to get a neat expression after taking derivatives.
@venkatagangadharraoy5407 8 years ago
If we divide by m, we are subtracting from theta(i) alpha times the average of the sum. If we don't divide by m, we are subtracting alpha times the sum itself. Technically it doesn't matter whether we divide by m or not, but dividing by m will make us converge faster, I guess. Would love to hear a mathematical explanation of this.
@venkatagangadharraoy5407 8 years ago
I have implemented gradient descent in R with and without the m. In both cases it converges. But the catch is that when you don't use m, you have to use a small alpha like 0.01; if I use 0.1 it does not converge.
@gal1l1l-f7c 8 years ago
Thank you very much for your answer! This clears things up!
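(A sketch of why both versions behave this way: for any constant c > 0, \( \nabla(cJ) = c\nabla J \), so)
\[
\theta := \theta - \alpha \nabla\Big(\tfrac{1}{m} J(\theta)\Big)
\quad\text{is the same update as}\quad
\theta := \theta - \tfrac{\alpha}{m} \nabla J(\theta) .
\]
Both versions minimize at the same theta; dropping the 1/m just makes the effective step m times larger, which matches the R experiment above — alpha has to shrink to keep the iteration from diverging.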
@sharkllama 11 years ago
I think the use of the trace operator in the derivation of the Least Squares Estimator obfuscates the derivation. I believe this would be easier to follow if the properties of matrix derivatives were used instead.
@joshuaburkholder 16 years ago
Around time = 28:00, Dr. Ng said that if we want to go in the direction of steepest descent from a point J( theta1, theta2 ), then we should go in the direction of the gradient of J( theta1, theta2 ); however, this is incorrect. The gradient always points toward the direction of steepest ascent, not descent; therefore, if we want to go in the direction of steepest descent from a point J( theta1, theta2 ), then we should go in the direction opposite the gradient: -Del( J( theta1, theta2 ) ).
@Gaiacarra 14 years ago
@Fusionicon Basic Calculus. Other than the weird stats stuff he brings into play when formulating the error function ("J"), you don't need anything else, so long as you really pay close attention.
@JordanShackelford 9 years ago
I can't keep up. I want to learn this but I have no experience with the math he's using. Calculus, right?
@hamsterpoop 9 years ago
+Jordan Shackelford It's basic calculus and basic linear algebra... you can find free online courses for both (check out MIT OCW, for example).
@blahdeblah1975 9 years ago
Linear algebra is Greek to most people. One semester will set you straight IF you do the homework.
@xxanfighter 9 years ago
Why are we trying to minimize (h(x)-y)^2 and not just h(x)-y?
@Sonictll 9 years ago
+Xanfighter Because we only need the absolute value of (h(x)-y) to be minimal, but (h(x)-y)^2 is more convenient as a mathematical expression.
@WahranRai 9 years ago
+Xanfighter Minimizing means setting the derivative equal to zero. We don't care about the constant coefficient.
@xxanfighter 9 years ago
Thanks guys, really appreciate the answer :)
@hamsterpoop 9 years ago
+Xanfighter The reason you minimize the square of the difference/error instead of the absolute error is because the linear algebra works out a lot easier this way. The assumption is that if the absolute difference is high, it is the same as if the difference squared is high. But basically, it's simply for mathematical ease. There is a lot of research on L1 norm minimization, check out the wikipedia article: "Least absolute deviations"
@chvan2335 13 years ago
A couple of lectures in, it's surprisingly easy to get your head around this shit. Guess it all gets very tricky and intricate soon after, though.
@psbbboyz123 12 years ago
What he says is right, I think... He says that if X^T X is not invertible — which is the case when X is not a full-rank matrix (he says the columns of X are dependent) — then you find the pseudo-inverse in that particular case.
@subhaprakash5416 11 years ago
What does the transpose of theta represent?
@iliasasdf 12 years ago
Basically the partial derivative gives you the steepest way down to the local minimum (about 43:00, last question).
@iliachigogidze6550 5 years ago
y is missing superscript i at 1:04:10?
@jaanuskiipli4647 6 years ago
At 37:43, shouldn't we normalize the sum by dividing by m? Otherwise the correction amount will blow up the more training data we input.
@jaanuskiipli4647 6 years ago
Okay, forget the question — the error function J itself is not normalized, so it's okay to blow everything up.
@sboparai09 12 years ago
The method described — (batch) gradient descent — is just optimization: iterating over a training set from a selected starting point (initial parameters) to find new minima with their respective parameters. He is right that it can be slow if you have MANY parameters, since that increases the number of combinations; the derivative eliminates useless combinations. The stochastic version is better because it tries to 'guess' the direction and doesn't attempt to iterate over every available combination.
@DragonSlave49 11 years ago
He doesn't say how they decide alpha. It is just a "step size" for the gradient descent. It is the "weight" of the change in the parameter theta. Larger alpha means theta will converge faster but less accurately.
@shivananda30 5 years ago
what does convergence mean here?? Is it the actual value converging to the predicted value?
@sanjhECE 5 years ago
Moving towards a local or global minimum, where J(theta) will be smallest.
@shivananda30 5 years ago
@@sanjhECE Thank you so much
@fupopanda 5 years ago
His notations are listed here: 13:41
@CSEfreak 11 years ago
Once you reach the min (according to the GD algorithm) it stops moving — theta doesn't change anymore. That's when you know you've reached the local minimum.
@djremixmusic6598 3 years ago
I don't know about the math formulas in the lecture so what is the solution for me?
@hnomier 15 years ago
Thank you stanford ...really great work ...The lectures are great
@OrakzaiSays 5 years ago
Can we somehow get those Friday TA classes?
@iliachigogidze6550 5 years ago
I think the batch gradient descent formula is missing 1/m at 44:39. Am I correct?
@khurshedfitter5695 5 years ago
I think the same
@GraceTao 12 years ago
I cannot understand most of the equations in lecture 2. What kind of background knowledge should I look for?
@haoc5698 5 years ago
You can look up the least squares solution.
@taketaxisky 14 years ago
For batch and stochastic gradient descent, is alpha (the learning rate) usually the same size?
@sandysandeep7227 7 years ago
I haven't learned the math. So can someone please explain what exactly theta is? What is theta0 + theta1*x? I understood the hypothesis, but I don't know what theta0 + theta1*x actually means.
@lahirusomaratne7568 7 years ago
Sandy Sandeep: it means the algorithm is going to come up with a simple linear regression model, where theta zero denotes the price of a very small house (theoretically zero square feet, though as you know there is no such house) and theta one denotes the price increase per additional square foot.
@manisharma3068 7 years ago
Hey Sandy, theta0 is the base price — think of it as the minimum price for all houses; every house has at least this price. X is some feature of the house (size, number of bedrooms, etc.) which we multiply by a coefficient theta1. Our hypothesis is that each house has a base price and that the feature x affects the price by a factor of theta1, so each unit increase in x increases the price of the house by theta1. The only thing left is to compute the value of theta1, which the professor does at the end of the video.
@FelixCrazzolara 6 years ago
I just started watching this lecture too, and I'm only in my second year of EE, but if you don't understand this stuff I guess you'd be better off thoroughly reading a book about linear algebra first, and probably some theory of signals and systems. He models the target as a linear function of the input plus a constant term; I guess this is how you should think about this stuff in general. But as I said, I'm only a 2nd-year Bachelor student^^
@kiriappeee 13 years ago
Hmm... so ALVINN looks at the road ahead and records the steering direction. So what if the road ahead is a curve, but since I'm on a straight patch at the moment my steering direction is still straight? Seeing where the cam was placed, and that there was no bonnet in the pictures, it must have been calculated for a few metres ahead. Does that affect anything? In the video, ALVINN's response seems to be about 0.5 seconds behind a typical human response, especially in the live tests.
@eachonly 11 years ago
I just wonder whether the stochastic gradient algorithm is really more efficient than the batch gradient algorithm when the number of data points n is large — the number of iterations for the batch gradient algorithm should be far less than n.
@canadianrepublican1185 8 years ago
Are the discussion sessions posted online?
@linhelen8222 5 years ago
Overview: batch gradient descent, stochastic gradient descent, normal equation.
- Batch GD: update theta after scanning all samples.
- Stochastic GD: update theta after scanning one sample (useful when the number of samples is large).
- Normal equation: the analytical solution for theta, without iteration.
@Gauravsaxena2512 11 years ago
Just wondering, why is the normal equation only for the OLS case? What assumption made in the derivation restricts the equation to this specific case?
@rayptucha3515 10 years ago
At 1:11:35, we have C = X'X and C' = X'X. Can someone please explain?
@Korzakapitany 10 years ago
(A * B)' = B' * A'. If we apply this to (X' * X) we get (X' * X)' = X' * (X')' = X' * X, thus here it is the same thing.
@ritagatspy4750 9 years ago
C = x'x, so C is a scalar value, and therefore C = C' (x'x is a number while xx' is a matrix; for example, 1 = 1'). Also (AB)' = B'A'.
@biswajeettripathy773 7 years ago
Why is 1/2 multiplied with the squared difference between the predicted and actual values? Why not some other constant, or none at all?
@fakal007 7 years ago
It's just because when you take the derivative of the squared term, you get (1/2) * 2, which is 1, so the expression comes out clean. :)
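(A sketch of the cancellation, for a single example:)
\[
\frac{\partial}{\partial \theta_j}\,\frac{1}{2}\big(h_\theta(x) - y\big)^2
= \frac{1}{2}\cdot 2\,\big(h_\theta(x) - y\big)\,\frac{\partial h_\theta(x)}{\partial \theta_j}
= \big(h_\theta(x) - y\big)\, x_j ,
\]
so the constant is purely cosmetic: any positive constant leaves the minimizing theta unchanged and only rescales the gradient.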
@juludd1 5 years ago
Could someone explain how to get Gradient tr(ABA^T C) = CAB + C^T A B^T? I can't see how you can get an addition on the right-hand side — at least not from the rules he described in the lecture. Could one use the chain rule for derivatives?
@salmonito2 11 years ago
The lecture notes differ. The batch gradient descent in the notes calculates the residual (if I understand correctly, data minus fit) as y - h(x), the square of which we try to minimize, but the prof has h(x) - y. Which one is correct?