When the previous layers are covered up, everything becomes clear. Brilliant explanation. Batch normalization works similarly to the way input standardization works.
@holgip61266 жыл бұрын
Like this guy - he has a calm voice and patience.
@digitalghosts45995 жыл бұрын
Wow, this is the best explanation I've seen so far! I really like Andrew Ng; he has an amazing talent for explaining even the most complicated things in a simple way. When he has to use mathematics to explain a concept, he does it so brilliantly that it becomes even simpler to understand, not more complicated as with some tutors.
@aamir122a6 жыл бұрын
Great work, you have a natural talent for making difficult topics easy to learn.
@siarez3 жыл бұрын
The "covariant shift" explanation has been falsified as an explanation for why BatchNorm works. If you are interested check out the paper "How does batch normalization help optimization?"
@dimitrisspiridonidis32842 жыл бұрын
God bless you
@randomforrest92513 жыл бұрын
This guy makes it look so easy... one has to love him
@AnuragHalderEcon6 жыл бұрын
Beautifully explained, classic Andrew Ng
@bgenchel16 жыл бұрын
"don't use it for regularization" - just use it all the time for general good practice, or are there times when I shouldn't use it?
@first-thoughtgiver-of-will24564 жыл бұрын
I think problems may arise if you don't have all of your training data ready and are looking to perform some transfer learning (training on new data) in the future, since batch norm is very domain dependent - as hinted at by the mini-batch-size regularization effect, but also, and more importantly, by the batch norm parameters. I would still always try to implement it. It seems ironic that it generalizes well yet is constrained by the statistics prescribed by the training data.
@wtf15703 жыл бұрын
In some regression problems it distorts the absolute values, which might be critical.
@epistemophilicmetalhead945410 ай бұрын
When X changes (even though f(x) = y remains the same), you can't expect the same model to perform well (e.g., X1 is pictures of black cats only, with y=1 for cats and y=0 for non-cats; if X2 is pictures of cats of all colors, the model won't do too well). This is covariate shift. Covariate shift is tackled during training through input standardization and batch normalization. Batch normalization ensures that the mean and variance of the distribution of the hidden unit values in the previous layer stay the same, so those values can't shift much. This reduces the coupling between the parameters of different layers, increases their independence, and hence speeds up learning.
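To make the mechanics concrete, here is a minimal NumPy sketch of the training-time batch norm step described in the video; the function and variable names are my own, not from the course:

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize the pre-activations z (shape: batch_size x n_units) per unit,
    then rescale with the learnable gamma and beta (one pair per unit)."""
    mu = z.mean(axis=0)                      # per-unit mean over the mini-batch
    var = z.var(axis=0)                      # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)   # zero mean, unit variance per unit
    return gamma * z_norm + beta             # z_tilde: mean beta, std roughly gamma

# toy usage: a mini-batch of 8 examples feeding a layer with 4 hidden units
rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=5.0, size=(8, 4))   # badly scaled, shifted pre-activations
gamma, beta = np.ones(4), np.zeros(4)
z_tilde = batch_norm_forward(z, gamma, beta)
print(z_tilde.mean(axis=0).round(3))   # ~0 for every unit
print(z_tilde.std(axis=0).round(3))    # ~1 for every unit
```

With gamma = 1 and beta = 0 this is plain standardization; during training, gamma and beta are updated by gradient descent like any other parameters.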
@amartyahatua4 жыл бұрын
Best explanation of batch norm
@qorbanimaq3 жыл бұрын
This video is just pure gold!
@ping_gunter3 жыл бұрын
The original paper that introduced the batch normalization technique (by Sergey Ioffe and Christian Szegedy) says that removing dropout speeds up training without increasing overfitting, and there are also recommendations not to use dropout together with batch normalization, since it adds noise to the statistics calculations (mean and variance)... so should we really use dropout with BN?
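For what it's worth, here is a minimal Keras-style sketch of the configuration the paper recommends: batch norm on the pre-activations, the activation afterwards, and no dropout. The layer sizes and input shape are arbitrary placeholders of mine, and whether dropping dropout actually helps is task-dependent, so treat this as an illustration rather than a rule.

```python
import tensorflow as tf

# Batch norm placed on the pre-activations, activation applied afterwards,
# and no dropout layers at all, per the Ioffe & Szegedy recommendation.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, use_bias=False),   # BN's beta takes over the role of the bias
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```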
@s254123 жыл бұрын
7:55 Why don't we use the mean and variance of the entire training set instead of just those of a mini-batch? Wouldn't this reduce the noise further (similar to using a larger mini-batch size)? Unless we want that noise for its regularizing effect?
@lupsik13 жыл бұрын
Larger batch sizes are detrimental. As Yann LeCun once said, "Training with large minibatches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use minibatches larger than 32." As far as I understand it, with bigger batches you get stuck in narrower local optima, while the noisier gradients help you generalize better and get pushed out of those local optima. There's still a lot of argument about this, though, in some cases with very noisy data like predicting stock prices.
@s254123 жыл бұрын
@@lupsik1 great response!
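A toy numerical sketch of the noise point discussed above (the numbers are made up, not from the video): the smaller the mini-batch, the more its mean wanders around the full-training-set mean, and using the whole set removes the noise entirely, along with its slight regularizing effect.

```python
import numpy as np

rng = np.random.default_rng(0)
# pre-activations of a single hidden unit over the whole training set (toy numbers)
z = rng.normal(loc=2.0, scale=3.0, size=10_240)

for batch_size in (8, 32, 256, len(z)):
    batch_means = z.reshape(-1, batch_size).mean(axis=1)
    # how much the per-mini-batch mean bounces around the full-set mean
    print(batch_size, round(float(batch_means.std()), 3))
# batch_size = len(z) is the "use the whole training set" case: zero noise,
# and with it you also lose the slight regularizing effect of that noise.
```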
@yuchenzhao64114 жыл бұрын
Since gamma and beta are parameters that will be updated, how can the mean and variance remain unchanged?
@gugaime Жыл бұрын
Amazing explanation
@haoming34305 ай бұрын
6:00, I have a question: don't the values of beta[2] and gamma[2] change as well during training? Then the distribution of the hidden unit values z[2] also keeps changing, and the covariate shift problem is still there.
@haoming34305 ай бұрын
Or maybe I should convince myself that beta[2] and gamma[2] don't change much?
@MuhammadIrshadAli7 жыл бұрын
Thanks for sharing the great video, explained in a simple and clear manner.
@YuCai-v8k11 ай бұрын
Is batch normalization always used in neural networks?
@pemfiri4 жыл бұрын
Don't the activation functions, such as the sigmoid in each node, already normalize the outputs of the neurons for the most part?
@bharathtejchinimilli3203 жыл бұрын
But the outputs are not zero-centered.
@bharathtejchinimilli3203 жыл бұрын
Generally, sigmoids are not used because of saturation and because their outputs are not zero-centered; ReLUs are used instead.
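To illustrate the zero-centering point with a toy example of my own: a sigmoid squashes values into (0, 1) but does not re-center them, whereas the batch norm step explicitly subtracts the batch mean.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(loc=4.0, scale=2.0, size=(256, 1))     # pre-activations with a shifted mean

sigmoid_out = 1.0 / (1.0 + np.exp(-z))                # squashed into (0, 1), but not re-centered
bn_out = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + 1e-5)

print(round(float(sigmoid_out.mean()), 3))   # close to 1: mostly saturated, far from zero-centered
print(round(float(bn_out.mean()), 3))        # ~0: batch norm explicitly re-centers and rescales
```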
@XX-vu5jo3 жыл бұрын
Keras people need to watch this video!
@banipreetsinghraheja85296 жыл бұрын
You said that Batch Norm limits the change in the values of the 3rd layer (or more generally, any deeper layer) caused by the parameters of earlier layers. However, when you perform gradient descent, the new parameters introduced by Batch Norm (gamma and beta) are also being learned and change with the learning rate, so the mean and variance of the earlier layers' outputs are also changing and are not fixed at 0 and 1 (or more generally, whatever you set them to). So I am not able to build an intuition for how fixing the mean and variance of the earlier layers prevents covariate shift. Can anyone help me out with this?
@bryan37926 жыл бұрын
My understanding is this: imagine the 4 neurons in hidden layer 3 represent the features [shape of the cat's head, shape of the cat's body, shape of the cat's tail, color of the cat]. The first 3 dimensions will have high values as long as there is a cat in the image, but the color varies a lot. So when you normalize this vector, the changes in color will contribute less to the prediction. Therefore, the features that really matter (like the first three dimensions) will have relatively larger influence.
@usnikchawla5 жыл бұрын
Having the same doubt
@Kerrosene5 жыл бұрын
Batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a "standard deviation" parameter (gamma), and a "mean" parameter (beta) is added. In other words, batch normalization lets SGD undo the normalization by changing only these two weights for each activation, instead of losing the stability of the network by changing all the weights.
@mustafaaydn62704 жыл бұрын
Page 316 of www.deeplearningbook.org/contents/optimization.html has an answer to this, I guess: > At first glance, this may seem useless - why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH'+β is determined solely by β. The new parametrization is much easier to learn with gradient descent.
@novinnouri7642 жыл бұрын
@@bryan3792 thanks
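A toy sketch of the reparametrization argument quoted above (the variable names and numbers are my own): after batch norm, the mean and spread of a layer's values are pinned by beta and gamma alone, no matter how much the earlier weights move.

```python
import numpy as np

def bn(z, gamma, beta, eps=1e-5):
    return gamma * (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps) + beta

rng = np.random.default_rng(2)
a_prev = rng.normal(size=(128, 10))          # activations coming out of the previous layer
gamma, beta = 2.0, 0.5

for scale in (0.1, 1.0, 10.0):               # pretend the earlier weights drift by a factor of 100
    W = scale * rng.normal(size=(10, 4))
    z_tilde = bn(a_prev @ W, gamma, beta)
    print(scale, z_tilde.mean(axis=0).round(3), z_tilde.std(axis=0).round(3))
# The mean stays at beta (0.5) and the std at roughly gamma (2.0) no matter what W does.
# gamma and beta still change during training, but only through this layer's own two
# knobs, not through the tangled effect of every earlier weight.
```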
@anynamecanbeuse5 жыл бұрын
I'm confused. Is it normalizing all the neurons within each layer, or normalizing, for one neuron, all the activations computed from a mini-batch?
@yueying90835 жыл бұрын
The second one.
@mindhalo5 жыл бұрын
To be precise, it's neither, though it's closer to the second. When training your network, you normalize all Z[l] - the scalars corresponding to each neuron of the l-th layer, where Z[l] = W[l] * A[l-1], W[l] is the current layer's weight matrix, and A[l-1] is the previous layer's activations. So you normalize numbers which are not yet the activations of the current layer, but are calculated from the current layer's weights and the previous layer's activations.
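A small sketch of the axis in question (the array shapes and names here are my own, just for illustration): the statistics are taken per hidden unit, across the examples of the mini-batch, on the pre-activations Z[l].

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(32, 5))    # mini-batch of 32 examples, 5 hidden units (pre-activations Z[l])

# What batch norm does: per-unit statistics taken across the mini-batch dimension (axis=0).
mu, var = Z.mean(axis=0), Z.var(axis=0)      # each has shape (5,): one value per hidden unit
Z_norm = (Z - mu) / np.sqrt(var + 1e-5)      # gamma * Z_norm + beta comes next, then the activation
print(Z_norm.mean(axis=0).round(3))          # ~0 for every unit

# What it does NOT do: normalizing each example across its own units (axis=1).
not_batch_norm = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, keepdims=True)
```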
@NeerajGarg102511 ай бұрын
Good for understanding, but more numerical calculations would show the effect better.
@IgorAherne7 жыл бұрын
thank you
@nikhilrana88005 жыл бұрын
I am not able to grasp how batch norm works. Please help me...
@wajidhassanmoosa3623 жыл бұрын
Beautifully Explained
@skytree2785 жыл бұрын
Thank you!
@sudharsaneaswaran25165 жыл бұрын
What does a coloured image have to do with the location of the data points in the graph?
@amitkharel11685 жыл бұрын
pixel values
@dianadrives45195 жыл бұрын
That is just another way to show the difference in distribution between the training and testing data. With images, the distribution difference is shown by one set having black cats and the other having non-black cats, while in the graph it is shown by the different positions of the positive and negative data points. In short, these are two different examples highlighting a single issue, i.e. covariate shift.
@arvindsuresh864 жыл бұрын
Wow, great explanation! Thanks!
@mariusmic65733 жыл бұрын
What is 'z' in this video?
@quishzhu2 жыл бұрын
Thank you.
@emrahyigit7 жыл бұрын
Great explanation. Thank you.
@best_Vinyl_CollectorinShenZhen6 жыл бұрын
If the mini-batch size is only 1, does BN still work?
@karthik-ex4dm5 жыл бұрын
A mini-batch with a size of 1 is not a mini-batch - it's using each point in the data separately. You cannot batch norm with size=1.
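A tiny sketch of why a batch of one breaks down (toy numbers of my own):

```python
import numpy as np

def bn(z, gamma=1.0, beta=0.0, eps=1e-5):
    return gamma * (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps) + beta

z_single = np.array([[4.2, -1.3, 0.7]])   # a "mini-batch" of one example, 3 units
print(bn(z_single))                       # [[0. 0. 0.]]: every unit collapses to beta
# With one example, the batch mean equals the example itself and the variance is 0,
# so the normalized values carry no information about the input at all.
```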
@oliviergraffeuille97953 жыл бұрын
According to kzbin.info/www/bejne/eqWoomdqe7mDg5Y, the covariate shift explanation (which was proposed by the original batch norm paper) has since been debunked by more recent papers. I don't know much about this though, so perhaps someone else would like to elaborate.