This is the best way to learn. I can compartmentalize each portion of the video into a subsection and really train myself efficiently.
@RH-mk3rp 2 years ago
I agree and considering how I'll often rewatch a segment of the video, it ends up being epochs = 2
@TheDroidMate 2 years ago
This is by far the best explanation out there, thanks Andrew 🚀
@javiercoronado4429 4 years ago
Why would someone dislike this very high-quality material, which Andrew made available for free to anyone?
@HimanshuMauryadesigners 3 years ago
envy
@moosapatrawala1554 2 years ago
There are so many things he hasn't explained and just wrote down.
@erkinsagroglu8519 11 months ago
@@moosapatrawala1554 Because this is part of a bigger course, as the video title indicates.
@torgath5088 11 months ago
How to draw a kitten. Step 1: Draw a line. Step 2: Draw the rest of a kitten
@yavdhesh 4 years ago
Thank you, Andrew ji :)
@pivasmilos 4 years ago
Thanks for making the notation beautiful and simple.
@user-cc8kb 6 years ago
He is so great. Andrew Ng ftw :D
@honoriusgulmatico6073 4 years ago
So this is what this office looks like when you're NOT taking Coursera ML!
@taihatranduc8613 4 years ago
you are always the best teacher
@iAmTheSquidThing 6 years ago
I'm wondering if optimization might happen faster by first sorting the entire dataset into categories, and then ensuring that each mini-batch is a stratified sample which approximates the entire dataset.
@iAmTheSquidThing 6 years ago
Spirit - Apparently someone had the idea long before me and it is effective: arxiv.org/abs/1405.3080 My understanding is that it ensures your model approximates the entire dataset at every iteration. You never have an iteration composed almost entirely of samples from one class, which would waste iterations fitting the function to an unrepresentative subset and have to be undone in later iterations.
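A minimal NumPy sketch of that stratified mini-batch idea (the function name and toy data are made up for illustration; this is not the exact scheme from the linked paper): every mini-batch draws a proportional share of indices from each class, so no batch is dominated by a single class.

import numpy as np

def stratified_minibatches(X, y, batch_size, seed=0):
    # X: (n_features, m) with one column per example; y: (m,) integer class labels.
    rng = np.random.default_rng(seed)
    m = y.shape[0]
    n_batches = m // batch_size
    # Shuffle each class's indices, then split them into n_batches chunks.
    per_class_chunks = [np.array_split(rng.permutation(np.where(y == c)[0]), n_batches)
                        for c in np.unique(y)]
    for b in range(n_batches):
        batch_idx = np.concatenate([chunks[b] for chunks in per_class_chunks])
        rng.shuffle(batch_idx)                 # mix the classes inside the batch
        yield X[:, batch_idx], y[batch_idx]

# Toy usage: 6000 examples, 3 balanced classes; batches of ~1000 keep the same class mix.
X = np.random.randn(20, 6000)
y = np.repeat([0, 1, 2], 2000)
for Xb, yb in stratified_minibatches(X, y, batch_size=1000):
    print(np.bincount(yb))                     # roughly [333, 333, 334] in every batch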
@cdahdude51 5 years ago
@@iAmTheSquidThing Why not just shuffle the dataset then?
@amanshah9961 5 years ago
@@iAmTheSquidThing Thanks for the reference :}
@cristian-bull 4 years ago
hey, that's a simple, cool idea you got there
@tutututututututututututututu 4 years ago
baby training sets
@rustyshackleford1964 1 year ago
Thank you thank you thank you!
@snackbob100 4 years ago
So all mini-batch is, is taking a vector containing the whole dataset, splitting it up into k subsections, finding the average loss over each subsection, and doing a gradient descent step on each averaged error, instead of doing a gradient descent step for every single loss, i.e. for each original x,y pair. So it's kind of a dimension reduction technique in a way??
@here_4_beer 4 years ago
Well, in principle you are correct. The idea is that your mean over 1000 samples converges toward the true expectation, and the noise in your cost estimate (its standard error) decreases with 1/sqrt(n), where n is the number of samples in a batch. Therefore your cost function evaluation is less noisy and training converges faster.
@here_4_beer 4 years ago
You want to exploit the weak law of large numbers: imagine you throw a die 10 times and want to estimate its side probabilities. Your estimate would be more accurate if you had thrown the die 1000 times instead, right?
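A quick sketch of that die argument in NumPy (purely illustrative numbers): the side-probability estimates from 1000 throws sit far closer to the true 1/6 than the ones from 10 throws, which is exactly why a 1000-example mini-batch gives a much less noisy cost/gradient estimate than a single example.

import numpy as np

rng = np.random.default_rng(0)
true_p = np.full(6, 1 / 6)                    # fair die: each side has probability 1/6

for n in (10, 1000):
    throws = rng.integers(1, 7, size=n)       # n throws of a six-sided die
    freq = np.bincount(throws, minlength=7)[1:] / n
    print(f"n={n:5d}  estimates={np.round(freq, 3)}  max error={np.abs(freq - true_p).max():.3f}")
# Typically: max error around 0.1-0.2 for n=10, around 0.01-0.03 for n=1000.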
@ninhhoang616 7 months ago
Great video!
@elgs1980 1 year ago
What does processing the samples in the mini batch at the same time mean? Do we average or sum the input data before feeding them to the net?
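In the course's notation, "processing them at the same time" just means the 1000 examples are stacked as the columns of one matrix and pushed through each layer with a single matrix multiplication; the inputs themselves are not averaged or summed (only the per-example losses get averaged in the cost). A minimal sketch under that reading, with made-up layer sizes:

import numpy as np

nx, batch_size = 4, 1000                      # 4 input features, 1000 examples in the mini-batch
X_batch = np.random.randn(nx, batch_size)     # each COLUMN is one training example

W = np.random.randn(3, nx)                    # a layer with 3 hidden units
b = np.zeros((3, 1))

Z = W @ X_batch + b                           # one matrix product handles all 1000 examples at once
A = np.maximum(0, Z)                          # ReLU; still one column of activations per example
print(A.shape)                                # (3, 1000): nothing was averaged over the examples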
@parthmadan671 2 years ago
Do we use the weights of the previous batch to initialise the next batch?
@ahmedb2559 1 year ago
Thank you !
@imanshojaei7784 4 years ago
At 8:35, should the sigma be from 1 to 1000 rather than 1 to l?
@goktugguvercin8069 4 years ago
Yes, I guess there is a mistake there
@RH-mk3rp 2 years ago
I agree, it should be the sum from i=1 to the mini-batch size, which in this case is 1000. In batch gradient descent, as in all the video examples up until now, it was i=1 to m, where m = number of training samples.
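Written out with that correction (this follows the course's notation, where (X^{t}, Y^{t}) is the t-th mini-batch of 1000 examples, and assumes the L2 term as on the slide), the per-mini-batch cost is

J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \left\lVert W^{[l]} \right\rVert_F^2

with the sum over i running over the examples of mini-batch t, not over the layers l (the layers only enter through the regularization term).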
@chinmaymaganur7133 4 years ago
What is (nx, m)? I.e. is nx the number of rows and m the number of features (columns), or vice versa?
@aayushpaudel2379 4 years ago
nx is the number of features or input values; m is the number of training examples.
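A tiny shape check of that convention (toy numbers, not from the video): X stacks the nx features of each of the m examples as columns, Y is a (1, m) row of labels, and a mini-batch X^{t} is just a slice of columns.

import numpy as np

nx, m, batch_size = 100, 5000, 1000            # for a 64x64 RGB image, nx would be 64*64*3 = 12288
X = np.random.randn(nx, m)                     # (nx, m): one column per training example
Y = np.random.randint(0, 2, size=(1, m))       # (1, m): one label per example

for t in range(m // batch_size):
    X_t = X[:, t * batch_size:(t + 1) * batch_size]   # X^{t}, shape (nx, 1000)
    Y_t = Y[:, t * batch_size:(t + 1) * batch_size]   # Y^{t}, shape (1, 1000)
print(X_t.shape, Y_t.shape)                    # (100, 1000) (1, 1000)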
@sandipansarkar9211 3 years ago
Very good explanation. Need to watch again.
@mishasulikashvili1215 5 years ago
Thank you sir
@JoseRomero-wp4ij 5 years ago
thank you so much
@Gammalight519 3 years ago
Free education
@bilalalpaslan2 2 years ago
Please help me: when do the weights and biases of the model update, at the end of each batch or at the end of each epoch? I can't understand this. For example, our dataset has 1600 X_train samples, and we choose batch_size = 64 and epochs = 10. Is the number of weight/bias updates 1600/64 = 25 per epoch, or only 10 (one per epoch)?
@EfilWarlord 1 month ago
This is probably a very late reply hahah, but the weights and biases of the model update at the end of each mini-batch. So if your dataset has 1600 samples and your batch size is 64, then 1600/64 = 25, so in each epoch the weights and biases are updated 25 times. In your case, with a batch size of 64 and 10 epochs, the model's weights and biases will be updated 25 x 10 = 250 times.
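A toy loop matching those numbers (a made-up linear model, only to count the updates): the same W and b are carried from one mini-batch to the next and from one epoch to the next, so with 1600 examples, batch_size 64 and 10 epochs you get 25 updates per epoch and 250 in total.

import numpy as np

m, batch_size, epochs = 1600, 64, 10
X = np.random.randn(2, m)
Y = np.random.randn(1, m)
W, b = np.zeros((1, 2)), 0.0                   # parameters persist across batches and epochs
lr, updates = 0.01, 0

for epoch in range(epochs):
    for t in range(m // batch_size):           # 1600 / 64 = 25 mini-batches per epoch
        Xb = X[:, t * batch_size:(t + 1) * batch_size]
        Yb = Y[:, t * batch_size:(t + 1) * batch_size]
        err = W @ Xb + b - Yb                  # (1, 64) residuals
        W -= lr * (err @ Xb.T) / batch_size    # one gradient step per mini-batch
        b -= lr * err.mean()
        updates += 1

print(updates)                                 # 250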
@tonyclvz109 5 years ago
Andrews + ENSTA =
@BreizhPie 5 years ago
so true, I love u
@aravindravva3833 4 years ago
10:52 is it 1000 gradient descent steps or 5000??
@pushkarwani6099 4 years ago
One mini-batch (1000 training examples) gets one gradient descent step, and this is repeated 5000 times, once per mini-batch.
@s25412 3 years ago
@8:34 why is the sum from i=1 to l? Shouldn't it be 1000?
@windupbird9019 3 years ago
From my understanding, 1000 is the size of the training batch, while the l refers to the total number of layers in the nn. Since he is doing the forward and backward propagation, the gradient descent would take l steps.
@s25412 3 years ago
@@windupbird9019 in another video by Ng (kzbin.info/www/bejne/i16XiamBbM-hmck) at 2:24, he indicates that the number you divide the cost function by and the upper limit of summation symbol should be identical. So I'm assuming the i=1 to l @ 8:34 is a typo... what do you think?
@nhactrutinh6201 3 years ago
Yes, I think it should be 1000, the mini-batch size. It's a typo.
@jalendarch89 6 years ago
At 9:44, can l be 1000?
@krishnasrujan1486 4 years ago
yes, it should be
@nirbhaykumarpandey8955 6 years ago
why is X nx by m and not n by m ??
@rui7268 5 years ago
Because nx represents the number of input features per example, which can be large rather than a single number; e.g. if the input is an RGB image, every pixel in each of the 3 color channels is one feature, so nx = height x width x 3. It's not only about the number of examples in a mini-batch but also about how many features each example has. The notation nx makes that clearer than plain n.
@grantsmith3653 1 year ago
I was just thinking that if you increase your mini-batch size, then your error surface gets taller (assuming you're using SSE, i.e. a summed rather than averaged loss, and it's a regression problem). And that means your gradients would get bigger, so your steps would all get bigger... even though (on average) changing your batch size shouldn't change the error surface argmin. So if you increase batch size, I think you have to decrease the learning rate by a proportional amount to keep your weight updates similar.
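A tiny check of that scaling (toy one-parameter regression, made-up numbers): with a summed squared error the gradient magnitude grows roughly linearly with the batch size, while with the mean (as in the course's 1/m cost) it stays on the same scale, which is why the averaged cost lets you keep a similar learning rate across batch sizes.

import numpy as np

rng = np.random.default_rng(0)
w_true, w = 3.0, 0.0
x = rng.standard_normal(100_000)
y = w_true * x

for batch_size in (100, 1000, 10_000):
    xb, yb = x[:batch_size], y[:batch_size]
    grad_sum = np.sum(2 * (w * xb - yb) * xb)   # gradient of the SUMMED squared error
    grad_mean = grad_sum / batch_size           # gradient of the MEAN squared error
    print(f"batch={batch_size:6d}  |grad_sum|={abs(grad_sum):10.1f}  |grad_mean|={abs(grad_mean):.3f}")
# |grad_sum| scales ~linearly with batch size; |grad_mean| stays ~constant (about 6 here).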
@EranM 4 months ago
9:10 lol what is this? The sigma should be from 1 to 1000, over the batch examples... what is this sigma over l? Layers are only relevant for the weights, not for y and y_pred.
@prismaticspace4566 4 years ago
baby training set...weird...
@Jirayu.Kaewprateep 1 year ago
📺💬 We should study the effect of mini-batches: instead of training on all samples at once, we divide them into mini-batches and watch the effect on gradient descent.
🧸💬 It is a different thing when the inputs are not very correlated, because accuracy and loss will go up and down during training; as in the previous example, the dropout-layer technique helps find the patterns in the input.
🧸💬 The problem is how to set the mini-batch size and the number of new inputs (the distribution rates should stay the same). This method can train faster when we have a large dataset, but it gives little benefit and long training times for less related data.
🐑💬 The accuracy and loss values are just numbers, but we can stop at a specific value we want, save, and continue working.
🐑💬 One example, from an assignment in a course linked here: he changed batch_size not to feed more input samples but to train faster, so the network does not forget the last inputs when using fewer LSTM units.
🐑💬 You find this kind of problem when mapping an input vocabulary. (Nothing we found in local cartoon books; Cookie Run as an example:)
VOLA : BURG / GRUB : IBCF / FCBI
COKKIE : IUUQOK / KOQUUI : PBFPTP / PTPFBP
RUN! : XAT' / 'TAX : ELS. / .SLE
VOLA COKKIE RUN! → GRUB KOQUUI XAT' → IBCF PTPFBP ELS.
👧💬 Come back‼ BirdNest hidden anywhere else.
@nikab1852 3 years ago
thank you sir!
@tohchengchuan6840 4 years ago
Why is y (1, m) instead of (n, m) or (ny, m)?
@aayushpaudel2379 4 years ago
Assuming y takes a real value and not a vector value, like 0 or 1 in a classification problem, or a single number in a regression problem. Hope it makes sense! :D