When the previous layers are covered up, everything becomes clear. Brilliant explanation. Batch normalization works similarly to the way input standardization works.
@holgip61266 жыл бұрын
Like this guy - he has a calm voice and patience.
@digitalghosts45995 жыл бұрын
Wow, this is the best explanation I've seen so far! I really like Andrew Ng; he has an amazing talent for explaining even the most complicated things in a simple way. When he has to use mathematics to explain a concept, he does it so brilliantly that it becomes even simpler to understand, not more complicated as with some tutors.
@aamir122a6 жыл бұрын
Great work, you have a natural talent for making difficult topics easy to learn.
@siarez3 жыл бұрын
The "covariant shift" explanation has been falsified as an explanation for why BatchNorm works. If you are interested check out the paper "How does batch normalization help optimization?"
@dimitrisspiridonidis32842 жыл бұрын
God bless you
@randomforrest92513 жыл бұрын
This guy makes it look so easy... one has to love him
@AnuragHalderEcon6 жыл бұрын
Beautifully explained, classic Andrew Ng
@bgenchel16 жыл бұрын
"don't use it for regularization" - just use it all the time for general good practice, or are there times when I shouldn't use it?
@first-thoughtgiver-of-will24564 жыл бұрын
I think problems may arise if you don't have all of your training data ready and are looking to perform some transfer learning (training on new data) in the future, since batch norm is very domain dependent - as hinted at by the mini-batch-size regularization effect, but also, and more importantly, by the batch norm parameters. I would still always try to implement it. It seems ironic that it generalizes well yet is constrained by the statistics prescribed by the training data.
@wtf15703 жыл бұрын
In some regression problems it distorts the absolute values, which might be critical.
@epistemophilicmetalhead945410 ай бұрын
When X changes (even though f(x) = y remains the same), you can't expect the same model to perform well (e.g., X1 is pictures of black cats only, with y=1 for cats and y=0 for non-cats; if X2 is pictures of cats of all colors, the model won't do too well). This is covariate shift. Covariate shift is tackled during training through input standardization and batch normalization. Batch normalization ensures that the mean and variance of the distribution of the hidden unit values in the previous layer stay the same, so those values can't shift much. This reduces the coupling between the parameters of different layers, increases their independence, and hence speeds up learning.
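To make the mechanics concrete, here is a minimal NumPy sketch of the training-time batch norm step described in the video; the function and variable names are my own, not from the course:

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize the pre-activations z (shape: batch_size x n_units) per unit,
    then rescale with the learnable gamma and beta (one pair per unit)."""
    mu = z.mean(axis=0)                      # per-unit mean over the mini-batch
    var = z.var(axis=0)                      # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)   # zero mean, unit variance per unit
    return gamma * z_norm + beta             # z_tilde: mean beta, std roughly gamma

# toy usage: a mini-batch of 8 examples feeding a layer with 4 hidden units
rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=5.0, size=(8, 4))   # badly scaled, shifted pre-activations
gamma, beta = np.ones(4), np.zeros(4)
z_tilde = batch_norm_forward(z, gamma, beta)
print(z_tilde.mean(axis=0).round(3))   # ~0 for every unit
print(z_tilde.std(axis=0).round(3))    # ~1 for every unit
```

With gamma = 1 and beta = 0 this is plain standardization; during training, gamma and beta are updated by gradient descent like any other parameters.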
@amartyahatua4 жыл бұрын
Best explanation of batch norm
@qorbanimaq3 жыл бұрын
This video is just pure gold!
@ping_gunter3 жыл бұрын
The original paper that introduced the batch normalization technique (by Sergey Ioffe and Christian Szegedy) says that removing dropout speeds up training without increasing overfitting, and there are also recommendations not to use dropout together with batch normalization, since it adds noise to the statistics calculations (mean and variance)... so should we really use dropout with BN?
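For what it's worth, here is a minimal Keras-style sketch of the configuration the paper recommends: batch norm on the pre-activations, the activation afterwards, and no dropout. The layer sizes and input shape are arbitrary placeholders of mine, and whether dropping dropout actually helps is task-dependent, so treat this as an illustration rather than a rule.

```python
import tensorflow as tf

# Batch norm placed on the pre-activations, activation applied afterwards,
# and no dropout layers at all, per the Ioffe & Szegedy recommendation.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, use_bias=False),   # BN's beta takes over the role of the bias
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```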
@s254123 жыл бұрын
7:55 Why don't we use the mean and variance of the entire training set instead of just those of a mini-batch? Wouldn't this reduce the noise further (similar to using a larger mini-batch size)? Unless we want that noise for its regularizing effect?
@lupsik13 жыл бұрын
Larger batch sizes are detrimental. As Yann LeCun once said, "Training with large minibatches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use minibatches larger than 32." As far as I understand it, with bigger batches you get stuck in narrower local optima, while the noisier gradients help you generalize better and get pushed out of those local optima. There's still a lot of argument about this, though, in some cases with very noisy data like predicting stock prices.
@s254123 жыл бұрын
@@lupsik1 great response!
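A toy numerical sketch of the noise point discussed above (the numbers are made up, not from the video): the smaller the mini-batch, the more its mean wanders around the full-training-set mean, and using the whole set removes the noise entirely, along with its slight regularizing effect.

```python
import numpy as np

rng = np.random.default_rng(0)
# pre-activations of a single hidden unit over the whole training set (toy numbers)
z = rng.normal(loc=2.0, scale=3.0, size=10_240)

for batch_size in (8, 32, 256, len(z)):
    batch_means = z.reshape(-1, batch_size).mean(axis=1)
    # how much the per-mini-batch mean bounces around the full-set mean
    print(batch_size, round(float(batch_means.std()), 3))
# batch_size = len(z) is the "use the whole training set" case: zero noise,
# and with it you also lose the slight regularizing effect of that noise.
```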
@yuchenzhao64114 жыл бұрын
Since gamma and beta are parameters that will be updated, how can the mean and variance remain unchanged?
@gugaime Жыл бұрын
Amazing explanation
@haoming34305 ай бұрын
6:00, I have a question: don't the values of beta[2] and gamma[2] change as well during training? Then the distribution of the hidden unit values z[2] also keeps changing, and the covariate shift problem is still there.
@haoming34305 ай бұрын
Or maybe I should convince myself that beta[2] and gamma[2] don't change much?
@MuhammadIrshadAli7 жыл бұрын
Thanks for sharing the great video, explained in a simple and clear manner.
@YuCai-v8k11 ай бұрын
Is batch normalization always used in neural networks?
@pemfiri4 жыл бұрын
Don't the activation functions, such as the sigmoid in each node, already normalize the outputs of the neurons for the most part?
@bharathtejchinimilli3203 жыл бұрын
But the outputs are not zero-centered.
@bharathtejchinimilli3203 жыл бұрын
Generally, sigmoids are not used because of saturation and because their outputs are not zero-centered; ReLUs are used instead.
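To illustrate the zero-centering point with a toy example of my own: a sigmoid squashes values into (0, 1) but does not re-center them, whereas the batch norm step explicitly subtracts the batch mean.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(loc=4.0, scale=2.0, size=(256, 1))     # pre-activations with a shifted mean

sigmoid_out = 1.0 / (1.0 + np.exp(-z))                # squashed into (0, 1), but not re-centered
bn_out = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + 1e-5)

print(round(float(sigmoid_out.mean()), 3))   # close to 1: mostly saturated, far from zero-centered
print(round(float(bn_out.mean()), 3))        # ~0: batch norm explicitly re-centers and rescales
```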
@XX-vu5jo3 жыл бұрын
Keras people need to watch this video!
@banipreetsinghraheja85296 жыл бұрын
You said that Batch Norm limits the change in the values of the 3rd layer (or more generally, any deeper layer) caused by the parameters of earlier layers. However, when you perform gradient descent, the new parameters introduced by Batch Norm (gamma and beta) are also being learned and change with the learning rate, so the mean and variance of the earlier layers' outputs are also changing and are not fixed at 0 and 1 (or more generally, whatever you set them to). So I am not able to build an intuition for how fixing the mean and variance of the earlier layers prevents covariate shift. Can anyone help me out with this?
@bryan37926 жыл бұрын
My understanding is this: imagine the 4 neurons in hidden layer 3 represent the features [shape of the cat's head, shape of the cat's body, shape of the cat's tail, color of the cat]. The first 3 dimensions will have high values as long as there is a cat in the image, but the color varies a lot. So when you normalize this vector, the changes in color will contribute less to the prediction. Therefore, the features that really matter (like the first three dimensions) will have relatively larger influence.
@usnikchawla5 жыл бұрын
Having the same doubt
@Kerrosene5 жыл бұрын
Batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a "standard deviation" parameter (gamma), and a "mean" parameter (beta) is added. In other words, batch normalization lets SGD undo the normalization by changing only these two weights for each activation, instead of losing the stability of the network by changing all the weights.
@mustafaaydn62704 жыл бұрын
Page 316 of www.deeplearningbook.org/contents/optimization.html has an answer to this, I guess: > At first glance, this may seem useless - why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH'+β is determined solely by β. The new parametrization is much easier to learn with gradient descent.
@novinnouri7642 жыл бұрын
@@bryan3792 thanks
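A toy sketch of the reparametrization argument quoted above (the variable names and numbers are my own): after batch norm, the mean and spread of a layer's values are pinned by beta and gamma alone, no matter how much the earlier weights move.

```python
import numpy as np

def bn(z, gamma, beta, eps=1e-5):
    return gamma * (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps) + beta

rng = np.random.default_rng(2)
a_prev = rng.normal(size=(128, 10))          # activations coming out of the previous layer
gamma, beta = 2.0, 0.5

for scale in (0.1, 1.0, 10.0):               # pretend the earlier weights drift by a factor of 100
    W = scale * rng.normal(size=(10, 4))
    z_tilde = bn(a_prev @ W, gamma, beta)
    print(scale, z_tilde.mean(axis=0).round(3), z_tilde.std(axis=0).round(3))
# The mean stays at beta (0.5) and the std at roughly gamma (2.0) no matter what W does.
# gamma and beta still change during training, but only through this layer's own two
# knobs, not through the tangled effect of every earlier weight.
```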
@anynamecanbeuse5 жыл бұрын
I'm confused. Is it normalizing all the neurons within each layer, or normalizing, for one neuron, all the activations computed from a mini-batch?
@yueying90835 жыл бұрын
The second one.
@mindhalo5 жыл бұрын
To be precise, it's neither, though it's closer to the second. When training your network, you normalize all Z[l] - the scalars corresponding to each neuron of the l-th layer, where Z[l] = W[l] * A[l-1], W[l] is the current layer's weight matrix, and A[l-1] is the previous layer's activations. So you normalize numbers which are not yet the activations of the current layer, but are calculated from the current layer's weights and the previous layer's activations.
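A small sketch of the axis in question (the array shapes and names here are my own, just for illustration): the statistics are taken per hidden unit, across the examples of the mini-batch, on the pre-activations Z[l].

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(32, 5))    # mini-batch of 32 examples, 5 hidden units (pre-activations Z[l])

# What batch norm does: per-unit statistics taken across the mini-batch dimension (axis=0).
mu, var = Z.mean(axis=0), Z.var(axis=0)      # each has shape (5,): one value per hidden unit
Z_norm = (Z - mu) / np.sqrt(var + 1e-5)      # gamma * Z_norm + beta comes next, then the activation
print(Z_norm.mean(axis=0).round(3))          # ~0 for every unit

# What it does NOT do: normalizing each example across its own units (axis=1).
not_batch_norm = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, keepdims=True)
```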
@NeerajGarg102511 ай бұрын
Good for understanding, but more numerical calculations would show the effect better.
@IgorAherne7 жыл бұрын
thank you
@nikhilrana88005 жыл бұрын
I am not able to grasp how batch norm works. Please help me...
@wajidhassanmoosa3623 жыл бұрын
Beautifully Explained
@skytree2785 жыл бұрын
Thank you!
@sudharsaneaswaran25165 жыл бұрын
What does a coloured image have to do with the location of the data points in the graph?
@amitkharel11685 жыл бұрын
pixel values
@dianadrives45195 жыл бұрын
That is just another way to show the difference in distribution between the training and testing data. With images, the distribution difference is shown by one set having black cats and the other having non-black cats, while in the graph it is shown by the different positions of the positive and negative data points. In short, these are two different examples highlighting a single issue, i.e. covariate shift.
@arvindsuresh864 жыл бұрын
Wow, great explanation! Thanks!
@mariusmic65733 жыл бұрын
What is 'z' in this video?
@quishzhu2 жыл бұрын
Thank you.
@emrahyigit7 жыл бұрын
Great explanation. Thank you.
@best_Vinyl_CollectorinShenZhen6 жыл бұрын
If the mini-batch size is only 1, does BN still work?
@karthik-ex4dm5 жыл бұрын
A mini-batch with a size of 1 is not a mini-batch - it's using each point in the data separately. You cannot batch norm with size=1.
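A tiny sketch of why a batch of one breaks down (toy numbers of my own):

```python
import numpy as np

def bn(z, gamma=1.0, beta=0.0, eps=1e-5):
    return gamma * (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps) + beta

z_single = np.array([[4.2, -1.3, 0.7]])   # a "mini-batch" of one example, 3 units
print(bn(z_single))                       # [[0. 0. 0.]]: every unit collapses to beta
# With one example, the batch mean equals the example itself and the variance is 0,
# so the normalized values carry no information about the input at all.
```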
@oliviergraffeuille97953 жыл бұрын
According to kzbin.info/www/bejne/eqWoomdqe7mDg5Y, the covariate shift explanation (which was proposed by the original batch norm paper) has since been debunked by more recent papers. I don't know much about this though, so perhaps someone else would like to elaborate.