Why Does Batch Norm Work? (C2W3L06)

195,557 views

DeepLearningAI

6 years ago

Take the Deep Learning Specialization: bit.ly/2x614g3
Check out all our courses: www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: www.deeplearning.ai/thebatch
Follow us:
Twitter: deeplearningai_
Facebook: deeplearninghq
LinkedIn: deeplearningai

Comments: 61
@maplex2656 · 4 years ago
When the previous layer is covered, everything becomes clear. Brilliant explanation. Batch normalization works similarly to the way input standardization works.
@digitalghosts4599 · 4 years ago
Wow, this is the best explanation I've seen so far! I really like Andrew Ng; he has an amazing talent for explaining even the most complicated things in a simple way. When he has to use mathematics to explain a concept, he does it so brilliantly that it becomes even simpler to understand, not more complicated as with some tutors.
@holgip6126 · 5 years ago
I like this guy - he has a calm voice and patience.
@aamir122a · 6 years ago
Great work, you have a natural talent for making difficult topics easy to learn.
@MuhammadIrshadAli · 6 years ago
Thanks for sharing this great video, explained in a simple and clear manner.
@siarez · 2 years ago
The "covariate shift" explanation has been falsified as an explanation for why BatchNorm works. If you are interested, check out the paper "How Does Batch Normalization Help Optimization?"
@dimitrisspiridonidis3284 · 1 year ago
God bless you
@randomforrest9251 · 3 years ago
This guy makes it look so easy... one has to love him
@qorbanimaq · 2 years ago
This video is just pure gold!
@IgorAherne · 6 years ago
thank you
@epistemophilicmetalhead9454 · 4 months ago
When X changes (even though the underlying mapping f(X) = y stays the same), you can't expect the same model to perform well (e.g. X1 is pics of black cats only, with y=1 for cats and y=0 for non-cats; if X2 is pics of cats of all colors, the model won't do too well). This is covariate shift. Covariate shift is tackled during training through input standardization and batch normalization. Batch normalization keeps the mean and variance of the distribution of a layer's hidden unit values stable, so those values don't shift around too much as earlier layers are updated. This reduces the coupling between the parameters of different layers, makes the layers more independent, and hence speeds up learning.
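As an aside, here is a minimal NumPy sketch of the mechanism this comment describes (an illustration written for this page, not code from the course; the function name, shapes, and eps value are made up): the pre-activations z of a layer are normalized per hidden unit across the mini-batch, then rescaled and shifted by the learnable parameters gamma and beta, so the distribution that later layers see stays more stable even as earlier weights change.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z has shape (n_units, batch_size): one row per hidden unit, one column per example.
    mu = Z.mean(axis=1, keepdims=True)      # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)      # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # zero mean, unit variance per unit
    return gamma * Z_norm + beta            # learned scale (gamma) and shift (beta)

# Tiny usage example: a layer with 4 hidden units and a mini-batch of 8 examples
Z = 3.0 * np.random.randn(4, 8) + 5.0
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1))  # roughly 0 for every unit
print(Z_tilde.std(axis=1))   # roughly 1 for every unit
```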
@amartyahatua · 3 years ago
Best explanation of batch norm
@AnuragHalderEcon · 6 years ago
Beautifully explained, classic Andrew Ng
@skytree278 · 4 years ago
Thank you!
@ping_gunter · 3 years ago
The original paper that introduced batch normalization (by Sergey Ioffe and Christian Szegedy) says that removing dropout speeds up training without increasing overfitting, and there are also recommendations not to use dropout together with batch normalization since dropout adds noise to the statistics (mean and variance)... so should we really use dropout with batch norm?
@EranM · 5 years ago
Ingenious!
@gugaime · 1 year ago
Amazing explanation
@emrahyigit · 6 years ago
Great explanation. Thank you.
@bgenchel1 · 6 years ago
"Don't use it for regularization" - so should I just use it all the time as general good practice, or are there times when I shouldn't use it?
@first-thoughtgiver-of-will2456 · 3 years ago
I think problems may arise if you don't have all of your training data ready and plan to do some transfer learning (training on new data) in the future, since batch norm is very domain dependent - as hinted at by the mini-batch-size regularization effect, but more importantly by the batch norm parameters themselves. I would still always try to implement it. It seems ironic that it generalizes well but is constrained by the covariance prescribed by the training data.
@wtf1570 · 2 years ago
In some regression problems it distorts the absolute scale of the values, which might be critical.
@arvindsuresh86 · 4 years ago
Wow, great explanation! Thanks!
@yuchenzhao6411 · 4 years ago
Since gamma and beta are parameters that get updated, how can the mean and variance remain unchanged?
@quishzhu · 1 year ago
Thank you
@lakshaithani268 · 4 years ago
Great explanation
@nikhilrana8800 · 4 years ago
I am not able to grasp how batch norm works. Please help me...
@s25412 · 3 years ago
7:55 Why don't we use the mean and variance of the entire training set instead of just those of a mini-batch? Wouldn't this reduce noise further (similar to using a larger mini-batch size)? Unless we want that noise for its regularizing effect?
@lupsik1 · 2 years ago
Larger batch sizes can be detrimental. As Yann LeCun once said, "training with large minibatches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use minibatches larger than 32." As far as I understand it, with bigger batches you get stuck in narrower local optima, while the noisier estimates help you generalize better and get pushed out of those local optima. There's still a lot of argument about this, though, in some cases with very noisy data like predicting stock prices.
@s25412 · 2 years ago
@lupsik1 Great response!
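On the question of using statistics beyond a single mini-batch: the follow-up video in this course (Batch Norm at Test Time, C2W3L07) keeps an exponentially weighted average of the mini-batch means and variances and uses those estimates at test time, rather than making a full pass over the training set. A rough sketch of that idea, with made-up shapes and a momentum value chosen only for illustration:

```python
import numpy as np

# Illustrative running estimates for a 4-unit layer
running_mu = np.zeros((4, 1))
running_var = np.ones((4, 1))
momentum = 0.9  # decay rate of the exponentially weighted average (illustrative value)

def update_running_stats(Z):
    """During training, fold each mini-batch's mean/variance into the running estimates."""
    global running_mu, running_var
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

def batch_norm_at_test_time(Z, gamma, beta, eps=1e-8):
    """At test time (even for a single example), normalize with the running estimates
    instead of the current batch's statistics."""
    Z_norm = (Z - running_mu) / np.sqrt(running_var + eps)
    return gamma * Z_norm + beta
```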
@anynamecanbeuse · 5 years ago
I'm confused. Is it normalizing across all the neurons within each layer, or normalizing, for each single neuron, the values computed over a mini-batch?
@yueying9083 · 5 years ago
The second one.
@JohnFunnyMIH · 4 years ago
To be precise, it's not exactly either, though it's closer to the second. When training your network, you normalize the Z[l] values, the scalars corresponding to each neuron of the l-th layer, where Z[l] = W[l] * A[l-1], W[l] is the current layer's weight matrix, and A[l-1] is the previous layer's activations. So you normalize numbers that are not yet the current layer's activations, but are computed from the current layer's weights and the previous layer's activations - and the normalization is done per neuron, across the examples in the mini-batch.
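To make that concrete, a small self-contained sketch in the course's (n[l], m) layout, with invented layer sizes: the mini-batch statistics are taken along the example axis, one mean and variance per neuron, and the activation is applied only after the normalized and rescaled value is computed. The bias term is omitted because beta plays that role when batch norm is used.

```python
import numpy as np

def dense_batchnorm_relu(A_prev, W, gamma, beta, eps=1e-8):
    # Linear step for layer l; the bias b[l] is dropped because beta takes over its role.
    Z = W @ A_prev                              # shape (n_units, batch_size)
    mu = Z.mean(axis=1, keepdims=True)          # statistics per neuron, across the mini-batch
    var = Z.var(axis=1, keepdims=True)
    Z_tilde = gamma * (Z - mu) / np.sqrt(var + eps) + beta
    return np.maximum(0, Z_tilde)               # activation applied after normalization

# Illustrative sizes: a 4-unit layer fed by 3 inputs, mini-batch of 8 examples
A_prev = np.random.randn(3, 8)
W = np.random.randn(4, 3)
A = dense_batchnorm_relu(A_prev, W, np.ones((4, 1)), np.zeros((4, 1)))
print(A.shape)  # (4, 8)
```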
@pemfiri · 4 years ago
Doesn't an activation function such as sigmoid in each node already normalize the outputs of the neurons for the most part?
@bharathtejchinimilli320 · 3 years ago
But the outputs are not zero-centered.
@bharathtejchinimilli320 · 3 years ago
Generally, sigmoids are not used because of saturation and because their outputs are not zero-centered; ReLUs are used instead.
@user-lz8wv7rp1o · 4 months ago
Is batch normalization always used in neural networks?
@banipreetsinghraheja8529 · 6 years ago
You said that Batch Norm limits the change in the values of the 3rd layer (or more generally, any deeper layer) caused by the parameters of earlier layers. However, during gradient descent the new parameters introduced by Batch Norm (gamma and beta) are also being learned and keep changing, so the mean and variance of the earlier layers' values are changing too, and are not fixed at 0 and 1 (or, more generally, whatever you set them to). So I can't build an intuition for how fixing the mean and variance of the earlier layers' values prevents covariate shift. Can anyone help me out with this?
@bryan3792 · 5 years ago
My understanding is this: imagine the 4 neurons in hidden layer 3 represent the features [shape of the cat's head, shape of the cat's body, shape of the cat's tail, color of the cat]. The first 3 dimensions will have high values as long as there is a cat in the image, but the color varies a lot. So when you normalize this vector, changes in color contribute less to the prediction. Therefore the features that really matter (like the first three dimensions) have relatively more influence.
@usnikchawla · 5 years ago
Having the same doubt
@Kerrosene · 5 years ago
Batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a "standard deviation" parameter (gamma) and a "mean" parameter (beta) is added. In other words, batch normalization lets SGD undo the normalization by changing only these two parameters for each activation, instead of losing the stability of the network by changing all the weights.
@mustafaaydn6270 · 3 years ago
Page 316 of www.deeplearningbook.org/contents/optimization.html has an answer to this, I guess: "At first glance, this may seem useless - why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH'+β is determined solely by β. The new parametrization is much easier to learn with gradient descent."
@novinnouri764 · 2 years ago
@bryan3792 Thanks!
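To tie the answers in this thread together, a small self-contained numerical check (layer size and parameter values are made up): however much the raw z values drift because earlier layers changed, the mean and standard deviation of the batch-normalized output on a mini-batch stay pinned at beta and gamma, and those two move only when gradient descent updates them directly.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return gamma * (Z - mu) / np.sqrt(var + eps) + beta

gamma = np.array([[2.0], [0.5]])   # learned scale per unit (made-up values)
beta = np.array([[1.0], [-3.0]])   # learned shift per unit (made-up values)

# Simulate the same 2-unit layer before and after the earlier layers' weights drift
Z_before = np.random.randn(2, 64)
Z_after = 10.0 * np.random.randn(2, 64) + 7.0

for Z in (Z_before, Z_after):
    Z_tilde = batch_norm_forward(Z, gamma, beta)
    # Prints ~[1, -3] for the means and ~[2, 0.5] for the stds in both cases
    print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))
```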
@XX-vu5jo · 3 years ago
Keras people need to watch this video!
@NeerajGarg1025 · 4 months ago
Good for understanding, but more numerical calculations would show the effect better.
@sudharsaneaswaran2516 · 5 years ago
What does a colored image have to do with the location of a data point on the graph?
@amitkharel1168 · 5 years ago
pixel values
@dianadrives4519 · 4 years ago
That is just another way to show the difference in distribution between training and testing data. In the images, the distribution difference is shown by one set having only black cats and the other having non-black cats, while in the graph it is shown by the different positions of the positive and negative data points. In short, these are two different examples highlighting a single issue, i.e. covariate shift.
@best_Vinyl_CollectorinShenZhen · 5 years ago
If the mini-batch size is only 1, does BN still work?
@karthik-ex4dm · 5 years ago
A mini-batch of size 1 is not really a mini-batch; it's just using each data point separately. You cannot batch norm with size = 1 (the variance of a single example is 0).
@wajidhassanmoosa362 · 2 years ago
Beautifully Explained
@mariusmic6573 · 2 years ago
What is 'z' in this video?
@tianyuez · 4 years ago
Andrew Ng is really good at math
@oliviergraffeuille9795 · 2 years ago
According to kzbin.info/www/bejne/eqWoomdqe7mDg5Y, the covariate shift explanation (which was proposed by the original batch norm paper) has since been debunked by more recent papers. I don't know much about this though; maybe someone else would like to elaborate.
@sandipansarkar9211 · 3 years ago
Great explanation
@ermano5586 · 9 months ago
I am watching it again
@youknowhoiamhehe · 4 years ago
Is he the GOD?