This is the best way to learn. I can compartmentalize each portion of the video into a subsection and really train myself efficiently.
@RH-mk3rp 2 years ago
I agree and considering how I'll often rewatch a segment of the video, it ends up being epochs = 2
@TheDroidMate 2 years ago
This is by far the best explanation out there, thanks Andrew 🚀
@javiercoronado4429 4 years ago
Why would someone dislike this very high-quality material, which Andrew made available for free to anyone?
@HimanshuMauryadesigners 3 years ago
envy
@moosapatrawala1554 2 years ago
There are so many things he hasn't explained and just wrote down.
@erkinsagroglu8519 11 months ago
@@moosapatrawala1554 Because this is part of a bigger course, as the video title indicates.
@torgath5088 11 months ago
How to draw a kitten. Step 1: Draw a line. Step 2: Draw the rest of a kitten
@yavdhesh 4 years ago
Thank you, Andrew ji :)
@pivasmilos 4 years ago
Thanks for making the notation beautiful and simple.
@user-cc8kb 6 years ago
He is so great. Andrew Ng ftw :D
@honoriusgulmatico6073 4 years ago
So this is what this office looks like when you're NOT taking Coursera ML!
@taihatranduc8613 4 years ago
you are always the best teacher
@iAmTheSquidThing 6 years ago
I'm wondering if optimization might happen faster by first sorting the entire dataset into categories, and then ensuring that each mini-batch is a stratified sample which approximates the entire dataset.
@iAmTheSquidThing 6 years ago
Spirit - Apparently someone had the idea long before me and it is effective: arxiv.org/abs/1405.3080 My understanding is that it ensures your model approximates the entire dataset at every iteration. You never have an iteration composed almost entirely of samples from one class, which would waste iterations fitting the function to an unrepresentative subset and have to be undone in later iterations.
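A minimal NumPy sketch of that stratified mini-batch idea (the function name and toy data are made up for illustration; this is not the exact scheme from the linked paper): every mini-batch draws a proportional share of indices from each class, so no batch is dominated by a single class.

import numpy as np

def stratified_minibatches(X, y, batch_size, seed=0):
    # X: (n_features, m) with one column per example; y: (m,) integer class labels.
    rng = np.random.default_rng(seed)
    m = y.shape[0]
    n_batches = m // batch_size
    # Shuffle each class's indices, then split them into n_batches chunks.
    per_class_chunks = [np.array_split(rng.permutation(np.where(y == c)[0]), n_batches)
                        for c in np.unique(y)]
    for b in range(n_batches):
        batch_idx = np.concatenate([chunks[b] for chunks in per_class_chunks])
        rng.shuffle(batch_idx)                 # mix the classes inside the batch
        yield X[:, batch_idx], y[batch_idx]

# Toy usage: 6000 examples, 3 balanced classes; batches of ~1000 keep the same class mix.
X = np.random.randn(20, 6000)
y = np.repeat([0, 1, 2], 2000)
for Xb, yb in stratified_minibatches(X, y, batch_size=1000):
    print(np.bincount(yb))                     # roughly [333, 333, 334] in every batch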
@cdahdude51 5 years ago
@@iAmTheSquidThing Why not just shuffle the dataset then?
@amanshah9961 5 years ago
@@iAmTheSquidThing Thanks for the reference :}
@cristian-bull 4 years ago
hey, that's a simple, cool idea you got there
@tutututututututututututututu 4 years ago
baby training sets
@rustyshackleford1964 1 year ago
Thank you thank you thank you!
@snackbob100 4 years ago
So all mini-batch is, is taking a vector containing the whole dataset, splitting it up into k subsections, finding the average loss over each subsection, and doing a gradient descent step on each averaged error, instead of doing a gradient descent step for every single loss, i.e. for each original x,y pair. So it's kind of a dimension reduction technique in a way??
@here_4_beer 4 years ago
Well, in principle you are correct. The idea is that your mean over 1000 samples converges toward the true expectation, and the noise in your cost estimate (its standard error) decreases with 1/sqrt(n), where n is the number of samples in a batch. Therefore your cost function evaluation is less noisy and training converges faster.
@here_4_beer 4 years ago
You want to exploit the weak law of large numbers: imagine you throw a die 10 times and want to estimate its side probabilities. Your estimate would be more accurate if you had thrown the die 1000 times instead, right?
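A quick sketch of that die argument in NumPy (purely illustrative numbers): the side-probability estimates from 1000 throws sit far closer to the true 1/6 than the ones from 10 throws, which is exactly why a 1000-example mini-batch gives a much less noisy cost/gradient estimate than a single example.

import numpy as np

rng = np.random.default_rng(0)
true_p = np.full(6, 1 / 6)                    # fair die: each side has probability 1/6

for n in (10, 1000):
    throws = rng.integers(1, 7, size=n)       # n throws of a six-sided die
    freq = np.bincount(throws, minlength=7)[1:] / n
    print(f"n={n:5d}  estimates={np.round(freq, 3)}  max error={np.abs(freq - true_p).max():.3f}")
# Typically: max error around 0.1-0.2 for n=10, around 0.01-0.03 for n=1000.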
@ninhhoang616 7 months ago
Great video!
@elgs1980 1 year ago
What does processing the samples in the mini batch at the same time mean? Do we average or sum the input data before feeding them to the net?
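In the course's notation, "processing them at the same time" just means the 1000 examples are stacked as the columns of one matrix and pushed through each layer with a single matrix multiplication; the inputs themselves are not averaged or summed (only the per-example losses get averaged in the cost). A minimal sketch under that reading, with made-up layer sizes:

import numpy as np

nx, batch_size = 4, 1000                      # 4 input features, 1000 examples in the mini-batch
X_batch = np.random.randn(nx, batch_size)     # each COLUMN is one training example

W = np.random.randn(3, nx)                    # a layer with 3 hidden units
b = np.zeros((3, 1))

Z = W @ X_batch + b                           # one matrix product handles all 1000 examples at once
A = np.maximum(0, Z)                          # ReLU; still one column of activations per example
print(A.shape)                                # (3, 1000): nothing was averaged over the examples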
@parthmadan671 2 years ago
Do we use the weights of the previous batch to initialise the next batch?
@ahmedb2559 1 year ago
Thank you !
@imanshojaei7784 4 years ago
At 8:35, should the sigma be from 1 to 1000 rather than 1 to l?
@goktugguvercin8069 4 years ago
Yes, I guess there is a mistake there
@RH-mk3rp 2 years ago
I agree, it should be the sum from i=1 to the mini-batch size, which in this case is 1000. In batch gradient descent, as in all the video examples up until now, it was i=1 to m, where m = number of training samples.
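Written out with that correction (this follows the course's notation, where (X^{t}, Y^{t}) is the t-th mini-batch of 1000 examples, and assumes the L2 term as on the slide), the per-mini-batch cost is

J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \left\lVert W^{[l]} \right\rVert_F^2

with the sum over i running over the examples of mini-batch t, not over the layers l (the layers only enter through the regularization term).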
@chinmaymaganur7133 4 years ago
What is (nx, m)? I.e. is nx the number of rows and m the number of features (columns), or vice versa?
@aayushpaudel2379 4 years ago
nx is the number of features or input values; m is the number of training examples.
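A tiny shape check of that convention (toy numbers, not from the video): X stacks the nx features of each of the m examples as columns, Y is a (1, m) row of labels, and a mini-batch X^{t} is just a slice of columns.

import numpy as np

nx, m, batch_size = 100, 5000, 1000            # for a 64x64 RGB image, nx would be 64*64*3 = 12288
X = np.random.randn(nx, m)                     # (nx, m): one column per training example
Y = np.random.randint(0, 2, size=(1, m))       # (1, m): one label per example

for t in range(m // batch_size):
    X_t = X[:, t * batch_size:(t + 1) * batch_size]   # X^{t}, shape (nx, 1000)
    Y_t = Y[:, t * batch_size:(t + 1) * batch_size]   # Y^{t}, shape (1, 1000)
print(X_t.shape, Y_t.shape)                    # (100, 1000) (1, 1000)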
@sandipansarkar9211 3 years ago
Very good explanation. Need to watch again.
@mishasulikashvili1215 5 years ago
Thank you sir
@JoseRomero-wp4ij 5 years ago
thank you so much
@Gammalight519 3 years ago
Free education
@bilalalpaslan2 2 years ago
Please help me: when do the weights and biases of the model update, at the end of each batch or at the end of each epoch? I can't understand this. For example, our dataset has 1600 X_train samples, and we choose batch_size = 64 and epochs = 10. Is the number of weight/bias updates 1600/64 = 25 per epoch, or only 10 (one per epoch)?
@EfilWarlord 1 month ago
This is probably a very late reply hahah, but the weights and biases of the model update at the end of each mini-batch. So if your dataset has 1600 samples and your batch size is 64, then 1600/64 = 25, so in each epoch the weights and biases are updated 25 times. In your case, with a batch size of 64 and 10 epochs, the model's weights and biases will be updated 25 x 10 = 250 times.
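A toy loop matching those numbers (a made-up linear model, only to count the updates): the same W and b are carried from one mini-batch to the next and from one epoch to the next, so with 1600 examples, batch_size 64 and 10 epochs you get 25 updates per epoch and 250 in total.

import numpy as np

m, batch_size, epochs = 1600, 64, 10
X = np.random.randn(2, m)
Y = np.random.randn(1, m)
W, b = np.zeros((1, 2)), 0.0                   # parameters persist across batches and epochs
lr, updates = 0.01, 0

for epoch in range(epochs):
    for t in range(m // batch_size):           # 1600 / 64 = 25 mini-batches per epoch
        Xb = X[:, t * batch_size:(t + 1) * batch_size]
        Yb = Y[:, t * batch_size:(t + 1) * batch_size]
        err = W @ Xb + b - Yb                  # (1, 64) residuals
        W -= lr * (err @ Xb.T) / batch_size    # one gradient step per mini-batch
        b -= lr * err.mean()
        updates += 1

print(updates)                                 # 250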
@tonyclvz109 5 years ago
Andrews + ENSTA =
@BreizhPie 5 years ago
so true, I love u
@aravindravva3833 4 years ago
10:52 is it 1000 gradient descent steps or 5000??
@pushkarwani6099 4 years ago
One mini-batch (1000 training examples) gets one gradient descent step, and this is repeated 5000 times, once per mini-batch.
@s25412 3 years ago
@8:34 why is the sum from i=1 to l? Shouldn't it be 1000?
@windupbird9019 3 years ago
From my understanding, 1000 is the size of the training batch, while the l refers to the total number of layers in the nn. Since he is doing the forward and backward propagation, the gradient descent would take l steps.
@s25412 3 years ago
@@windupbird9019 in another video by Ng (kzbin.info/www/bejne/i16XiamBbM-hmck) at 2:24, he indicates that the number you divide the cost function by and the upper limit of summation symbol should be identical. So I'm assuming the i=1 to l @ 8:34 is a typo... what do you think?
@nhactrutinh6201 3 years ago
Yes, I think it should be 1000, the mini-batch size. It's a typo.
@jalendarch89 6 years ago
At 9:44, can l be 1000?
@krishnasrujan1486 4 years ago
yes, it should be
@nirbhaykumarpandey8955 6 years ago
why is X nx by m and not n by m ??
@rui7268 5 years ago
Because nx represents the number of input features per example, which can be large rather than a single number; e.g. if the input is an RGB image, every pixel in each of the 3 color channels is one feature, so nx = height x width x 3. It's not only about the number of examples in a mini-batch but also about how many features each example has. The notation nx makes that clearer than plain n.
@grantsmith3653 1 year ago
I was just thinking that if you increase your mini-batch size, then your error surface gets taller (assuming you're using SSE, i.e. a summed rather than averaged loss, and it's a regression problem). And that means your gradients would get bigger, so your steps would all get bigger... even though (on average) changing your batch size shouldn't change the error surface argmin. So if you increase batch size, I think you have to decrease the learning rate by a proportional amount to keep your weight updates similar.
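A tiny check of that scaling (toy one-parameter regression, made-up numbers): with a summed squared error the gradient magnitude grows roughly linearly with the batch size, while with the mean (as in the course's 1/m cost) it stays on the same scale, which is why the averaged cost lets you keep a similar learning rate across batch sizes.

import numpy as np

rng = np.random.default_rng(0)
w_true, w = 3.0, 0.0
x = rng.standard_normal(100_000)
y = w_true * x

for batch_size in (100, 1000, 10_000):
    xb, yb = x[:batch_size], y[:batch_size]
    grad_sum = np.sum(2 * (w * xb - yb) * xb)   # gradient of the SUMMED squared error
    grad_mean = grad_sum / batch_size           # gradient of the MEAN squared error
    print(f"batch={batch_size:6d}  |grad_sum|={abs(grad_sum):10.1f}  |grad_mean|={abs(grad_mean):.3f}")
# |grad_sum| scales ~linearly with batch size; |grad_mean| stays ~constant (about 6 here).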
@EranM 4 months ago
9:10 lol what is this? The sigma should be from 1 to 1000, over the batch examples... what is this sigma over l? Layers are only relevant for the weights, not for y and y_pred.
@prismaticspace4566 4 years ago
baby training set...weird...
@Jirayu.Kaewprateep 1 year ago
📺💬 We should study the effect of mini-batches: instead of training on all samples at once, we divide them into mini-batches and watch the effect on gradient descent.
🧸💬 It is a different thing when the inputs are not very correlated, because accuracy and loss will go up and down during training; as in the previous example, the dropout-layer technique helps find the patterns in the input.
🧸💬 The problem is how to set the mini-batch size and the number of new inputs (the distribution rates should stay the same). This method can train faster when we have a large dataset, but it gives little benefit and long training times for less related data.
🐑💬 The accuracy and loss values are just numbers, but we can stop at a specific value we want, save, and continue working.
🐑💬 One example, from an assignment in a course linked here: he changed batch_size not to feed more input samples but to train faster, so the network does not forget the last inputs when using fewer LSTM units.
🐑💬 You find this kind of problem when mapping an input vocabulary. (Nothing we found in local cartoon books; Cookie Run as an example:)
VOLA : BURG / GRUB : IBCF / FCBI
COKKIE : IUUQOK / KOQUUI : PBFPTP / PTPFBP
RUN! : XAT' / 'TAX : ELS. / .SLE
VOLA COKKIE RUN! → GRUB KOQUUI XAT' → IBCF PTPFBP ELS.
👧💬 Come back‼ BirdNest hidden anywhere else.
@nikab1852 3 years ago
thank you sir!
@tohchengchuan6840 4 years ago
Why is y (1, m) instead of (n, m) or (ny, m)?
@aayushpaudel2379 4 years ago
Assuming y takes a real value and not a vector value, like 0 or 1 in a classification problem, or a single number in a regression problem. Hope it makes sense! :D