Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)

  1,234,948 views

Stanford Online


For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: stanford.io/ai
This lecture covers supervised learning and linear regression.
Andrew Ng
Adjunct Professor of Computer Science
www.andrewng.org/
To follow along with the course schedule and syllabus, visit:
cs229.stanford.edu/syllabus-au...
#andrewng #machinelearning
Chapters:
00:00 Intro
00:45 Motivate Linear Regression
03:01 Supervised Learning
04:44 Designing a Learning Algorithm
08:27 Parameters of the learning algorithm
14:44 Linear Regression Algorithm
18:06 Gradient Descent
33:01 Gradient Descent Algorithm
42:34 Batch Gradient Descent
44:56 Stochastic Gradient Descent

Comments: 262
@krishyket
@krishyket Жыл бұрын
Dude is a multi-millionaire and took valuable time meticulously teaching students and us. Legend.
@The_Quaalude
@The_Quaalude 6 ай бұрын
Bro needs to train his future employees
@vikram-aditya
@vikram-aditya 6 ай бұрын
yes bro. i think the more people with the knowledge, the faster the breakthroughs in the field
@clerpington_the_fifth
@clerpington_the_fifth 6 ай бұрын
...and FOR FREE.
@user-ez7jl6ts8x
@user-ez7jl6ts8x 19 күн бұрын
To people like him, money is really irrelevant. These people are truly in the top 0.00001% of people in the world; all that matters to them is how they can contribute to their respective fields and help make this world a better place. Money is just a by-product of that passion.
@Eric-zo8wo
@Eric-zo8wo 11 ай бұрын
0:41: 📚 This class will cover linear regression, batch and stochastic gradient descent, and the normal equations as algorithms for fitting linear regression models.
5:35: 🏠 The speaker discusses using multiple input features, such as size and number of bedrooms, to estimate the price of a house.
12:03: 📝 The hypothesis is defined as the sum of features multiplied by parameters.
18:40: 📉 Gradient descent is a method to minimize a function J of Theta by iteratively updating the values of Theta.
24:21: 📝 Gradient descent updates the values in each step by calculating the partial derivative of the cost function.
30:13: 📝 The partial derivative of the inner term with respect to Theta J is equal to XJ, and one step of gradient descent updates Theta J accordingly.
36:08: 🔑 The choice of learning rate in the algorithm affects its convergence to the global minimum.
41:45: 📊 Batch gradient descent processes the entire training set as one batch, but this becomes a disadvantage with large datasets.
47:13: 📈 Stochastic gradient descent allows for faster progress on large datasets but never fully converges.
52:23: 📝 Gradient descent is an iterative algorithm used to find the global optimum, but for linear regression the normal equation can be used to jump directly to the global optimum.
58:59: 📝 The derivative of a function with respect to a matrix is a matrix with the same dimensions, where each element is the derivative with respect to the corresponding element of the original matrix.
1:05:51: 📝 The speaker discusses properties of matrix traces and their derivatives.
1:13:17: 📝 The gradient of the cost is derived from one-half times (X Theta minus y) transpose times (X Theta minus y).
Recap by Tammy AI
@Lucky-vm9dv
@Lucky-vm9dv 10 ай бұрын
How much do we have to pay for your valuable overview of the entire class? Kudos to your efforts 👍
@MLLearner
@MLLearner 3 ай бұрын
Thank you so much 👍🫡
@KylianMbappe07303
@KylianMbappe07303 Ай бұрын
Legend
@surajr4757
@surajr4757 13 күн бұрын
@@Lucky-vm9dv Bro didn't read the last line, Recap by Tammy AI🙂
@manudasmd
@manudasmd Жыл бұрын
Feels like sitting in a Stanford classroom from India... Thanks Stanford, you guys are the best.
@gurjotsingh3726
@gurjotsingh3726 9 ай бұрын
For real bro, sitting here in Punjab, I would never have come across how the top uni profs teach. This is surreal.
@hamirmahal
@hamirmahal 5 ай бұрын
​@@gurjotsingh3726 Sat sri akaal, ਖੁਸ਼ਕਿਸਮਤੀ
@calvin_713
@calvin_713 10 ай бұрын
This course saved my life! The lecturer of the ML course I'm attending right now just goes through those crazy math derivations, presuming that all the students have already mastered them 😂
@mahihoque4598
@mahihoque4598 Ай бұрын
My man was teaching as if these top-percentile brains had forgotten simple partial differentiation, and our lecturers just don't even care 😢
@k-bobmakabaka4420
@k-bobmakabaka4420 Жыл бұрын
When you're paying 12k a year to your own university just so you can look up a course from a better school for free.
@paulushimawan5196
@paulushimawan5196 Жыл бұрын
University costs need to be as low as possible.
@_night_spring_
@_night_spring_ Жыл бұрын
While YouTube has unlimited free information and courses better than tech universities and colleges 🙂
@Call-me-Avi
@Call-me-Avi 10 ай бұрын
Hahahahaahaha fucking hell thats what i am doing right fucking now.
@preyumkumar7404
@preyumkumar7404 7 ай бұрын
which uni is that...
@k-bobmakabaka4420
@k-bobmakabaka4420 7 ай бұрын
@@preyumkumar7404 University of Toronto
@DagmawiAbate
@DagmawiAbate Жыл бұрын
I am not good at math anymore, but I think math is simple if you get the right teachers like you. Thanks.
@imad1996
@imad1996 Жыл бұрын
We learn, and teachers give us the information in a way that can help stimulate our learning abilities. So, we always appreciate our teachers and the facilities contributing to our development. Thank you.
@abhishekagrawal896
@abhishekagrawal896 2 ай бұрын
🎯 Key points for quick navigation:

00:03 *🏠 Introduction to Linear Regression*
- Linear regression is a learning algorithm used to fit linear models.
- Motivation for linear regression is explained through a supervised learning problem.
- Collecting a dataset, defining notation, and building a regression model are important steps.

04:04 *📊 Designing a Learning Algorithm*
- The process of supervised learning involves inputting a training set and outputting a hypothesis.
- Key decisions in designing a machine learning algorithm include defining the hypothesis representation.
- Understanding the workflow, dataset, and hypothesis structure is crucial in creating a successful learning algorithm.

07:19 *🏡 Multiple Features in Linear Regression*
- Introducing multiple input features in linear regression models.
- The importance of adding features like the number of bedrooms to enhance prediction accuracy.
- Notation, such as defining a dummy feature for simplifying the hypothesis, is explained.

13:03 *🎯 Cost Function and Parameter Optimization*
- Choosing parameter values Theta to minimize the cost function J of Theta.
- The squared error is used in linear regression as a measure of prediction accuracy.
- Parameters are iteratively adjusted using gradient descent to find the optimal values for the model.

24:18 *🧮 Linear Regression: Gradient Descent Overview*
- Update Theta values for each feature based on the learning rate and the partial derivative of the cost function.
- Learning rate determination for practical applications.
- Detailed explanation of the derivative calculation for one training example.

27:11 *📈 Gradient Descent Algorithm*
- Calculating the partial derivative with respect to Theta for a simple training example.
- Update equation for each step of gradient descent using the calculated derivative.

33:11 *📉 Optimization: Convergence and Learning Rate*
- Explanation of "repeat until convergence" in gradient descent.
- Impact of the learning rate on convergence speed and efficiency.
- Practical approach to determining the optimal learning rate during implementation.

41:22 *📊 Batch Gradient Descent vs. Stochastic Gradient Descent*
- Batch gradient descent processes the entire training set in one batch.
- Stochastic gradient descent processes one example at a time for parameter updates.
- Stochastic gradient descent takes a slightly noisy path towards convergence.

47:22 *🏃 Stochastic Gradient Descent vs. Batch Gradient Descent*
- Stochastic gradient descent is used more in practice with very large datasets.
- Mini-batch gradient descent is another option for datasets that are too large for batch gradient descent.
- Stochastic gradient descent is often preferred due to its faster progress on large datasets.

53:01 *📉 Derivation of the Normal Equation for Linear Regression*
- The normal equation allows for the direct calculation of optimal parameter values in linear regression without an iterative algorithm.
- Deriving the normal equation involves taking derivatives, setting them to zero, and solving for the optimal parameters Theta.
- Matrix derivatives and linear algebra notation play a crucial role in deriving the normal equation.

57:52 *🧮 Matrix Derivatives and Trace Operator*
- The trace operator is the sum of the diagonal entries of a matrix.
- Properties of the trace operator include the trace of a matrix being equal to the trace of its transpose.
- Derivatives with respect to matrices can be computed using the trace operator for functions mapping to real numbers.

01:12:49 *📈 Linear Regression Derivation Summary*
- Deriving the gradient of the cost function J(Theta) involves taking the derivative of a quadratic function.

01:15:19 *🧮 Deriving the Normal Equations*
- Setting the derivative of J(Theta) to 0 leads to the normal equations X^T X Theta = X^T y.
- Using matrix derivatives helps simplify the final equation for Theta.

01:17:09 *🔍 Dealing with a Non-Invertible X Matrix*
- When X is non-invertible, it indicates redundant features or linear dependence.
- The pseudoinverse can provide a solution in the case of linearly dependent features.
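To make the batch-versus-stochastic comparison in the recap concrete, here is a minimal NumPy sketch of both update rules for linear regression. The data is synthetic and the variable names (X, y, theta, alpha) follow the lecture's notation; the batch gradient is averaged over m here (a common variant of the summed gradient on the board) just to keep the step size well behaved.

```python
import numpy as np

# Toy data: m examples, one feature, plus the intercept term x0 = 1
rng = np.random.default_rng(0)
m = 100
X = np.c_[np.ones(m), rng.uniform(0, 10, m)]      # design matrix, shape (m, 2)
y = 4.0 + 2.5 * X[:, 1] + rng.normal(0, 1, m)     # noisy "house prices"

# Batch gradient descent: every step touches all m examples.
alpha = 0.02
theta = np.zeros(2)
for _ in range(5000):
    theta -= alpha * X.T @ (X @ theta - y) / m

# Stochastic gradient descent: every step touches a single example.
theta_sgd = np.zeros(2)
alpha_sgd = 0.005
for _ in range(50):                               # 50 passes over the data
    for i in rng.permutation(m):
        theta_sgd -= alpha_sgd * (X[i] @ theta_sgd - y[i]) * X[i]

print(theta, theta_sgd)   # both should land near the true values [4.0, 2.5]
```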
@hmm7780
@hmm7780 15 күн бұрын
Thanx Bro for this!!
@dimensionentangled4514
@dimensionentangled4514 2 жыл бұрын
We define a cost function based on the sum of squared errors. The job is to minimise this cost function with respect to the parameters. First, we look at (batch) gradient descent. Second, we look at stochastic gradient descent, which does not give us the exact value at which the minimum is achieved, but is much more effective at dealing with big data. Third, we look at the normal equation. This equation directly gives us the value at which the minimum is achieved! Linear regression is one of the few models for which such an equation exists.
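For the third point, a rough sketch of the closed-form solution, using made-up data for illustration (np.linalg.solve is used rather than forming an explicit inverse):

```python
import numpy as np

# Assume X is the m x (n+1) design matrix (first column all ones) and y the targets.
rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.uniform(0, 10, (50, 2))]
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.1, 50)

# Normal equations: X^T X theta = X^T y, solved directly with no iteration.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # close to the true coefficients [1.0, 2.0, -0.5]
```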
@xxdxma6700
@xxdxma6700 2 жыл бұрын
I wish you sat next to me in class 😂
@rajvaghasia9942
@rajvaghasia9942 2 жыл бұрын
Bro who named that equation as normal equation?
@alessandroderossi8930
@alessandroderossi8930 2 жыл бұрын
​@@rajvaghasia9942 The name "normal equation" comes from the fact that it generalizes the concept of perpendicularity (normal to something means perpendicular to it). The normal equation expresses the orthogonal projection of the observed data onto the fitted line (in the case of LINEAR regression): the residuals, i.e. the distances between the real sampled data and the fitted line, end up perpendicular to it. Hence, to find the optimal curve that fits the data, you have to find the weights and bias (in this video Theta0, Theta1, and so on) that minimize those distances. You can minimize them using gradient descent (computationally costly), stochastic gradient descent (updating from one example at a time instead of computing the full gradient of the loss), or the "normal equations". Here is an image from Wikipedia to understand better (the green lines are the famous distances): en.wikipedia.org/wiki/File:Linear_least_squares_example2.svg
@JDMathematicsAndDataScience
@JDMathematicsAndDataScience Жыл бұрын
@@rajvaghasia9942 because we're in the matrix now bro! ha. For real though. It's about the projection matrix and the matrix representation/method of acquiring the beta coefficients.
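A compact way to state the projection picture discussed above, as a sketch in the lecture's notation (X the design matrix, θ the parameter vector, y the targets):

```latex
% Least squares as an orthogonal projection: the residual y - X\theta is
% chosen to be perpendicular ("normal") to the column space of X.
X^{\top}(y - X\theta) = 0
\;\Longleftrightarrow\;
X^{\top}X\,\theta = X^{\top}y
\;\Longleftrightarrow\;
\theta = (X^{\top}X)^{-1}X^{\top}y \quad (\text{when } X^{\top}X \text{ is invertible}).
```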
@JDMathematicsAndDataScience
@JDMathematicsAndDataScience Жыл бұрын
I have been wondering why we need such an algorithm when we could just derive the least squares estimators. Have you seen any research comparing the gradient descent method of selection of parameters with the typical method of deriving the least squares estimators of the coefficient parameters?
@user-hm5qk8ic6j
@user-hm5qk8ic6j Жыл бұрын
8:50 notations and symbols
13:08 how to choose theta
17:50 gradient descent
@dens3254
@dens3254 Жыл бұрын
52:50 Normal equations
@LuisFuentes98
@LuisFuentes98 Жыл бұрын
Hey, can I point out what an amazing teacher Professor Andrew is?! Also, I love how excited he is about the lesson he is giving! It just makes me feel even more interested in the subject. Thanks for this awesome course!
@tanishsharma136
@tanishsharma136 Жыл бұрын
Look at Coursera; he co-founded it and has many free courses there.
@jeroenoomen8145
@jeroenoomen8145 7 ай бұрын
Thank you to Stanford and Andrew for a wonderful series of lectures!
@deepakbastola6302
@deepakbastola6302 3 күн бұрын
Dr. Ng is always the best... keep up the motivation with such classes.
@i183x4
@i183x4 11 ай бұрын
8:50 notations and symbols
13:08 how to choose theta
17:50 gradient descent
8:42 - 14:42 terminologies completion
51:00 batch
55:00 problem 1 set
57:00 for p 0
@AshishRaj04
@AshishRaj04 11 ай бұрын
notes are not available on the website ???
@jaeen7665
@jaeen7665 5 ай бұрын
One of the greats, a legend in AI & Machine Learning. Up there with Prof. Strang and Prof LeCun.
@ikramadjissa370
@ikramadjissa370 2 жыл бұрын
Andrew Ng you are the best
@anushka.narsima
@anushka.narsima Жыл бұрын
Thank you so much Dr. Andrew! It took me some time but your stepwise explanation and notes have given me a proper understanding. I'm learning this to make a presentation for my university club. We all are very grateful!
@Amit_Kumar_Trivedi
@Amit_Kumar_Trivedi Жыл бұрын
Hi I was not able to download the notes, 404 error, from the course page in description. Other PDFs are available on the course page. Are you enrolled or where did you download the notes from?
@anushka.narsima
@anushka.narsima Жыл бұрын
@@Amit_Kumar_Trivedi cs229.stanford.edu/lectures-spring2022/main_notes.pdf
@georgenyagura7742
@georgenyagura7742 Жыл бұрын
@@anushka.narsima thanks
@claudiosaponaro4565
@claudiosaponaro4565 Жыл бұрын
the best professor in the world.
@olinabin2004
@olinabin2004 Жыл бұрын
8:42 - 14:42 terminologies completion
17:51 checkpoint
57:00 run1
@mortyrickerson6322
@mortyrickerson6322 Жыл бұрын
Fantastic. Thank you deeply for sharing
@PhilosophyOfWinners
@PhilosophyOfWinners 9 ай бұрын
Loving the lectures!!
@nikhithar3077
@nikhithar3077 5 ай бұрын
39:38 We subtract because, to decrease the cost function as quickly as possible, the step direction and the gradient vector must be at 180°; that is where the negative sign comes from.
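A short sketch of that argument: for a small step of size ε in a unit direction u,

```latex
J(\theta + \epsilon u) \approx J(\theta) + \epsilon\,\nabla J(\theta)^{\top}u
                      = J(\theta) + \epsilon\,\lVert\nabla J(\theta)\rVert\cos\phi ,
```

which decreases fastest when cos φ = −1, i.e. when u points at 180° to the gradient. Hence the update θ := θ − α∇J(θ).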
@jpgunman0708
@jpgunman0708 Жыл бұрын
Thanks a lot, Andrew Ng (吴恩达), I learned a lot.
@Honey-sv3ek
@Honey-sv3ek 2 жыл бұрын
I really don't have a clue about this stuff, but it's interesting and I can concentrate a lot better when I listen to this lecture so I like it
@FA-BCS-MUHAMMADHAMIDSAEEDUnkno
@FA-BCS-MUHAMMADHAMIDSAEEDUnkno 2 жыл бұрын
You can see his lecture on coursera about Machine learning. You will surely get what he is saying in this video.
@paulushimawan5196
@paulushimawan5196 Жыл бұрын
@@FA-BCS-MUHAMMADHAMIDSAEEDUnkno yes, that course is beginner-friendly. Everyone with basic high school math can take that course even without knowledge of calculus.
@diegoalias2935
@diegoalias2935 Жыл бұрын
Really easy to understand. Thanks a lot for sharing!
@massimovarano407
@massimovarano407 Жыл бұрын
sure it is, it is high school topic, at least in Italy
@gustavoramalho9454
@gustavoramalho9454 Жыл бұрын
@@massimovarano407 I'm pretty sure multivariate calculus is not a high-school topic in Europe
@victor3btn598
@victor3btn598 Жыл бұрын
Simple and understandable
@zzh315
@zzh315 6 ай бұрын
"Wait, AI is just math?" "Always has been"
@chandarayi5673
@chandarayi5673 11 ай бұрын
I love you Sir Andrew, you inspire me a lot haha
@clinkclink7814
@clinkclink7814 Жыл бұрын
Very clear explanations. Extra points for sounding like Stewie Griffin
@vseelix957
@vseelix957 Жыл бұрын
My machine learning lecturer is so dogshit I thought this unit was impossible to understand. Now, following these on study break before midsem, I can say this guy is the best. I'd prefer that my uni just refer to these lectures rather than making their own.
@tanmayshukla8660
@tanmayshukla8660 Жыл бұрын
Why do we take the transpose of each row, wouldn't it be stacking columns on top of each other?
@RHCIPHER
@RHCIPHER Жыл бұрын
This man is a great teacher.
@ozonewagle
@ozonewagle Жыл бұрын
Why aren't we using the usual numerical methods (least squares) to fit a straight line to a given set of data points?
@ambushtunes
@ambushtunes Жыл бұрын
Attending Stanford University from Nairobi, Kenya.
@sivavenkateshr
@sivavenkateshr Жыл бұрын
This is really cool. ❤
@polymaththesolver5721
@polymaththesolver5721 Жыл бұрын
Thank you Stanford for this amazing resource. Please, can I get a link to the lecture notes? Thanks.
@sarayutsawaengsap6799
@sarayutsawaengsap6799 2 жыл бұрын
Absolutely awesome!
@clairewang8370
@clairewang8370 2 жыл бұрын
Andrew explains this so well!
@Suliyaa_Agri
@Suliyaa_Agri 2 ай бұрын
Andrew's voice is everything, and that blue shirt of his.
@ZDixon-io5ww
@ZDixon-io5ww 2 жыл бұрын
47:00
51:00 batch
55:00 problem 1 set
57:00 for p 0
@johndubchak
@johndubchak Жыл бұрын
Andrew Ng, FTW!
@26d8
@26d8 10 ай бұрын
The partial derivative seemed incomplete to me. Shouldn't we take the derivative of the 2/2 theta term as well? Is that term a constant? Shouldn't we go with the product rule?
@thefourhourtalk
@thefourhourtalk Жыл бұрын
I didn't understand the linear regression algorithm. Is there any way to understand it better??
@Gatsbi
@Gatsbi 4 ай бұрын
I had to study basic calculus and linear algebra at the same time to understand a bit, but I don't get it fully yet.
@sipraneye70
@sipraneye70 Жыл бұрын
Where do i get the assignments for these lecture series?
@skillato9000
@skillato9000 Жыл бұрын
1:01:06 Didn't know Darth Vader attended these lectures.
@anonymous-3720
@anonymous-3720 Жыл бұрын
Which book is he using? and where do we find the homework?
@learnfullstack
@learnfullstack Жыл бұрын
if board is full, slide up the board, if it refuses to go up, pull it back down, erase and continue writing on it.
@labiditasnim623
@labiditasnim623 11 ай бұрын
Why did he use 1/2 in the cost function and not 1/(2m)?
@7takeo
@7takeo 2 жыл бұрын
Can we have the class outside?
@HeisenbergHK
@HeisenbergHK 6 ай бұрын
Where can I find the notes and other videos and any material related to this class!?
@Jewishisgreat
@Jewishisgreat 10 ай бұрын
Knowledge is power
@user-up3fn9cw7h
@user-up3fn9cw7h 4 ай бұрын
At 40:10, what if we set the initial value at a point where the gradient is in the negative direction? Then we should increase theta rather than decrease theta, right?
@anikdas567
@anikdas567 2 ай бұрын
Even then the same update works. Why? The aim is to find a minimum, so you need to adjust the parameters (theta) until the slope (gradient) approaches zero, since the slope is zero at the minimum. If you look at the graph of a quadratic you will immediately see the logic: it does not matter whether you start on a positive or a negative slope. The update theta := theta - alpha * gradient automatically does the right thing, decreasing theta where the slope is positive and increasing theta where the slope is negative, so in both cases theta moves toward the minimum.
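A tiny numerical sketch of that point, using a made-up one-parameter cost J(theta) = (theta - 3)^2, whose gradient is 2(theta - 3):

```python
def grad(theta):
    # gradient of J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

alpha = 0.1
for start in (5.0, 0.0):              # positive slope at theta=5, negative slope at theta=0
    theta = start
    for _ in range(100):
        theta -= alpha * grad(theta)  # the same update rule in both cases
    print(start, "->", round(theta, 4))   # both runs end near the minimum at theta=3
```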
@ahmednesartahsinchoudhury2628
@ahmednesartahsinchoudhury2628 7 ай бұрын
Does anyone know which textbook goes well with these lectures?
@nanunsaram
@nanunsaram 2 жыл бұрын
Thank you!
@gauravpadole1035
@gauravpadole1035 11 ай бұрын
Can anyone please explain what we mean by the "parameters" denoted by theta here?
@SteveVon7
@SteveVon7 8 ай бұрын
Parameters are the TRAINABLE numbers in the model, such as weights and biases, since the prediction of the model is based on some combination of weight and bias values. So when the 'parameters' or 'theta' are changed or 'trained', it means that the weights and biases are changed or trained.
@parthjoshi5892
@parthjoshi5892 Жыл бұрын
Would anyone please share the lecture notes? On clicking the link for the PDF notes on the course website, it shows an error that the requested URL was not found on the server. It would really be great if someone could help me find the class notes.
@amaia7045
@amaia7045 5 ай бұрын
I think I found them here: cs229.stanford.edu/main_notes.pdf
@shashankshekharjha6913
@shashankshekharjha6913 Ай бұрын
okay so the superscript i, ( 1 to m) represents the number of features, right? Because here m = 2 and I don't understand why m = # training examples
@souravsengupta1311
@souravsengupta1311 8 ай бұрын
Can't download the course class notes, please look into it.
@ObaroJohnson-q8v
@ObaroJohnson-q8v 7 күн бұрын
The formula looks like the variance formula; I'd be interested to know why we have that 1/2 in front of the squared errors of the loss function. Could we just use the variance formula instead, or is there a theory behind that? Thanks.
@GameFlife
@GameFlife 10 ай бұрын
I need that lecture notes ASAP professor
@Goaks8128
@Goaks8128 2 жыл бұрын
Seems like the Lagrangian or the path-of-least-action principle in physics could be applied to algorithmic manipulations in machine learning, as well as in economics, where isoquant curves and marginal analysis depend on many variables... Not being an expert in any of these fields, the topics seem very similar and some correlation may exist... perhaps it is already being used.
@godson200
@godson200 Жыл бұрын
Do you speak english?
@veeraboinabhavaniprasad3864
@veeraboinabhavaniprasad3864 9 ай бұрын
Could you please tell me the actual use of gradient descent in minimizing J(theta)?
@AdeelKhalidSE
@AdeelKhalidSE 8 ай бұрын
Gradient descent is basically the optimization method that helps minimize the cost of the model. We obtain the cost by calculating the MSE (mean squared error).
@kaipingli-mh3mw
@kaipingli-mh3mw 8 ай бұрын
thank you
@chideraagbasiere7868
@chideraagbasiere7868 Жыл бұрын
May I ask, at around 7:50, what does θ (theta) represent?
@lyndonyang1269
@lyndonyang1269 Ай бұрын
anyone knows where to access the homework assignments as practice?
@bhavyasharma9784
@bhavyasharma9784 10 ай бұрын
The PDF link to the problem set says "Error: Not found." Can someone help, please?
@raymundovazquezmusic216
@raymundovazquezmusic216 Жыл бұрын
Can you update the lecture notes and assignments in the website for the course? Most of the links to the documents are broken
@stanfordonline
@stanfordonline Жыл бұрын
Hi there, thanks for your comment and feedback. The course website may be helpful to you cs229.stanford.edu/ and the notes document docs.google.com/spreadsheets/d/12ua10iRYLtxTWi05jBSAxEMM_104nTr8S4nC2cmN9BQ/edit?usp=sharing
@adi29raj
@adi29raj Жыл бұрын
@@stanfordonline Where can I access the problem sets?
@salonisingla1665
@salonisingla1665 Жыл бұрын
@@stanfordonline Please post this in the description to every video. Having this in an obscure reply to a comment will only lead to people missing it while scrolling.
@DrPan88
@DrPan88 Жыл бұрын
I wonder: is m equal to n+1? n stands for the number of inputs, while m stands for the number of rows, which includes x0 in addition.
@sandeeproy6564
@sandeeproy6564 Жыл бұрын
n actually stands for the number of attributes here, or the number of features (columns)
@MaxTheKing289
@MaxTheKing289 Жыл бұрын
No, not necessarily. m is the number of rows, and n is the number of columns, or features. In his example n is equal to two (size and bedrooms); m can be any number, but I think in the example m is 50.
@DrPan88
@DrPan88 Жыл бұрын
@Louis Aballea yeah I got it. Thanks !
@jerzytas
@jerzytas 9 ай бұрын
In the very last equation (normal equation, 1:18:06), transpose(X) appears on both sides of the equation; can't this be simplified by dropping transpose(X)?
@manasvi-fl6xq
@manasvi-fl6xq 6 ай бұрын
No, because X is not necessarily a square matrix, so it has no inverse to cancel with.
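A short sketch of why the Xᵀ cannot simply be cancelled, in the lecture's setup with m training examples and n+1 features:

```latex
X \in \mathbb{R}^{m \times (n+1)}, \qquad X^{\top}X\,\theta = X^{\top}y .
% X is generally rectangular, so X^{-1} (and (X^{\top})^{-1}) do not exist and
% cannot be used to cancel the X^{\top} on both sides. The square matrix
% X^{\top}X is (n+1) x (n+1), so when it is invertible:
\theta = (X^{\top}X)^{-1}X^{\top}y .
```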
@uekiarawari3054
@uekiarawari3054 Жыл бұрын
Difficult words: cost function, gradient descent, convex optimization, hypothesis, f(x), target, J of theta = cost/loss function, partial derivatives, chain rule, global optimum, batch gradient descent, stochastic gradient descent, mini-batch gradient descent, decreasing learning rate, parameters, oscillating, iterative algorithm, normal equation, trace of a matrix
@aliiq6572
@aliiq6572 Жыл бұрын
Can I get notes for these lectures?
@danilvinyukov2060
@danilvinyukov2060 14 күн бұрын
1:17:31 Can't we just get rid of the X transpose on the left of both sides of the equation? As I remember from linear algebra, if the same matrix multiplies both sides of an equation from the same side, it is redundant and can be removed. The result should be X(theta) = y => (theta) = X^(-1) y.
@wishIKnewHowToLove
@wishIKnewHowToLove Жыл бұрын
it's hard, but everything thats worth doing is
@HarshitSharma-YearBTechChemica
@HarshitSharma-YearBTechChemica 6 ай бұрын
Does someone know how to get the lecture notes? They are not available on stanford's website.
@logeshwaran1537
@logeshwaran1537 5 ай бұрын
Same issue for me alsoo....
@puspjoc9975
@puspjoc9975 2 ай бұрын
Where can I get the full detailed notes?? Anyone who knows, please reply.
@samrendranath
@samrendranath Жыл бұрын
How do I access the lecture notes? :( They have been removed from the Stanford website.
@ChidinmaOnyeri
@ChidinmaOnyeri 2 ай бұрын
Hi. Can anyone recommend any textbook that can help in further study of this course. Thank you
@promariddhidas6895
@promariddhidas6895 3 ай бұрын
i wish i had access to the problem sets for this course
@akshat_senpai
@akshat_senpai 2 ай бұрын
Maybe on GitHub...
@Baru_Bangun_Tidur
@Baru_Bangun_Tidur Жыл бұрын
1:14:54 My answer is (X^T Xθ) + (X^T θ^T X) - (X^T Y) - (Y X^T). Is that the same, or is my answer wrong?
@cristianreyes8288
@cristianreyes8288 Жыл бұрын
anybody know where the notes are? the link doesnt work for me
@techpasya974
@techpasya974 4 ай бұрын
Are the lecture notes available publicly for this? I have been watching this playlist and I think the lecture notes would be very helpful.
@KorexeroK
@KorexeroK 3 ай бұрын
cs229.stanford.edu/main_notes.pdf
@truszko91
@truszko91 Жыл бұрын
28:51, what is x0 and x1? If we have a single feature, say # of bedrooms, how can we have x0 and x1? Wouldn't x0 be just nothing? I'm confused. Or, in other words, if my Theta0 update function relies on x0 for the update, but x0 doesn't exist, theta0 will always be the initial theta0...
@MahakYadav12
@MahakYadav12 Жыл бұрын
The value of x0 is always 1, so theta0 can rely on x0 for the update. If we have a single feature then h(x) = x0*theta0 + x1*theta1 (which is ultimately equal to theta0 + x1*theta1 since x0 = 1; theta0 can also be seen as the intercept and theta1 as the slope if you compare it with the equation of a straight line, such that the price of the house is a linear function of the # of bedrooms).
@truszko91
@truszko91 Жыл бұрын
@@MahakYadav12 thank you!!
@mikeeggleston1769
@mikeeggleston1769 Жыл бұрын
Very clear, but what I don't get is for the multiple data sets when I sum the errors, do I do two passes through the data and choose the error that is less?
@victor3btn598
@victor3btn598 Жыл бұрын
Just continue changing theta till cost function reduces to optimal
@victor3btn598
@victor3btn598 Жыл бұрын
Yes the goal Is to reach less error and by tweaking theta you can achieve that and make sure you don't overshoot
@user-rj5ws9ry1w
@user-rj5ws9ry1w 11 ай бұрын
The notes from the description seem to have vanished. Does anyone have them?
@JunLiLin0616
@JunLiLin0616 11 ай бұрын
same problem
@samsondawit
@samsondawit Жыл бұрын
why is it that the cost function has the constant 1/2 before the summation and not 1/2m?
@ihebbibani7122
@ihebbibani7122 Жыл бұрын
I think it's because he is taking one learning example and not m learning examples
@samsondawit
@samsondawit Жыл бұрын
@@ihebbibani7122 ah I see
@tylonng
@tylonng Ай бұрын
The GOAT
@boardsdiscovery5803
@boardsdiscovery5803 11 ай бұрын
Very impressive. Does anybody know where the lecture notes are?
@bradmorse6320
@bradmorse6320 11 ай бұрын
There is a link in the video description, make sure you click where it says "...more"
@Nobody2310
@Nobody2310 4 ай бұрын
Has someone (possibly a newbie like me) gone through all the videos and learnt enough to pursue an ML career or create a project? Wondering if a paid class should be taken or if these free videos are enough.
@orignalbox
@orignalbox 3 ай бұрын
i also want to know have you gone through all the videos
@chhaysith
@chhaysith Жыл бұрын
Dear Dr. Andrew, I saw your other video where the cost function for linear regression uses 1/(2m), but in this video it is 1/2. What is the difference between them? (footnote 16:00)
@treqemad
@treqemad Жыл бұрын
I don't really understand what you mean by 1/(2m). However, from my understanding, the 1/2 is just for simplicity: when taking the derivative of the cost function, the power 2 comes down, multiplies the expression, and cancels with the half.
@googgab
@googgab Жыл бұрын
It should be 1/(2m), where m is the size of the data set. That's because we'd like to take the average of the squared differences and not have the cost function depend on the size of the data set m. He explains it here at 6:30: kzbin.info/www/bejne/kKvIdaeJoteFpbc
@aman-qj5sx
@aman-qj5sx Жыл бұрын
@@googgab It should be ok if J depends on m since m isn't changing?
@labiditasnim623
@labiditasnim623 11 ай бұрын
same question
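For what it's worth, a short sketch of why the constant in front does not change the answer (using the lecture's cost with the plain 1/2):

```latex
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^{2}
\;\Longrightarrow\;
\frac{\partial J}{\partial \theta_j}
  = \sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)} .
% Multiplying J by any positive constant (e.g. 1/m) scales the gradient by the
% same constant, which can be absorbed into the learning rate \alpha, so the
% minimizing \theta is identical either way.
```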
@wonggran9983
@wonggran9983 2 жыл бұрын
Fred has a one-hundred-sided die. Fred rolls the die once and gets side i. Fred then rolls the die again (second roll) and gets side j, where side j is not side i. What is the probability of this event E? Assume the one hundred sides of the die all have an equal probability of facing up.
@Tryingitoutletsee
@Tryingitoutletsee 2 жыл бұрын
1 - (1/10000) = 9999/10000
@ahmettolgakarabulut9380
@ahmettolgakarabulut9380 Жыл бұрын
the probability of getting the same results for two rolls and they are both defined is 1/10000. So that we will subtract that from 1
@billr5842
@billr5842 Жыл бұрын
Wouldn't it be 99/100? The first roll can be any number so it doesn't really matter what's there. The second roll just needs to be one of the other 99 numbers. The first roll doesn't really change the probability. Of course, I barely know any math so I'm no expert lol
@emirkisa
@emirkisa Жыл бұрын
@@billr5842 You're right. The probability calculated above as 1/10000 is the probability of getting the same result for one specific side, like getting side 3 twice. But there are 100 different sides, each with that 1/10000 probability of coming up twice, so 1/10000 is multiplied by 100, which makes the probability of getting the same result on both rolls 1/100. Then 1 - 1/100 = 99/100.
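In one line, the calculation being discussed:

```latex
P(\text{second roll} \neq \text{first roll}) = 1 - \tfrac{1}{100} = \tfrac{99}{100},
% while 1/10000 is the probability that both rolls land on one particular,
% pre-specified side.
```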
@shawol275
@shawol275 Жыл бұрын
hey where can i get the notes?
@riajulchowdhury4218
@riajulchowdhury4218 28 күн бұрын
Where can I get the lecture notes? I can't access the files in the website.
@surendranmurugesan
@surendranmurugesan 7 ай бұрын
is the explanation at 40:00 correct?
@user-ys7hm5ci5r
@user-ys7hm5ci5r 16 күн бұрын
How do we study the applications? Is this only up to theory?
@user-ni3uw5cc7u
@user-ni3uw5cc7u 7 ай бұрын
where do I find the lecture notes? Help
@ayanbhattacharya1667
@ayanbhattacharya1667 7 ай бұрын
How can I implement this?? any references??
@faisalhussain4022
@faisalhussain4022 8 ай бұрын
Wondering if lecture notes are also available to download from somewhere ?
@williambrace6885
@williambrace6885 7 ай бұрын
hey bro I found them: cs229.stanford.edu/lectures-spring2022/main_notes.pdf
@kag46
@kag46 6 ай бұрын
@@williambrace6885thanks a lot!
@amritkalash7608
@amritkalash7608 9 ай бұрын
is there thanks sitting in the class??
@MohamedAli-wo6qv
@MohamedAli-wo6qv Жыл бұрын
How can I get the lecture papers pdf..?
@user-ft4xr8gv7o
@user-ft4xr8gv7o Жыл бұрын
Google