To answer some questions below: gamma and beta are optimized given that the input to the batch norm layer is normalized during training. That's why we can't skip normalization at test time. Gamma and beta are, in effect, estimates of the shift and scaling over the whole training set, so we need to normalize the input to the batch norm layer assuming each mini-batch has the same mean and variance as the whole training-set distribution. In reality, a mini-batch has strong sampling bias, especially if its size is small relative to the training data.
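In the video's notation, the test-time transform for each unit is roughly
\tilde{z} = \gamma \, \frac{z - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta,
where \mu and \sigma^2 are the exponentially weighted averages accumulated during training, not the statistics of any test batch.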
@penpomvx9228 a year ago
Can we think of gamma and beta as a way to compensate for the noise introduced by the normalization?
@aiymzhanbaitureyeva5296 2 years ago
If I understand correctly, we could optionally run all the training data through the final network and get the desired mean mu and variance sigma squared for each layer, but this would take time. That is why, in practice, it is easier to take the values of mu and sigma squared from each mini-batch t, feed them step by step into the exponentially weighted average formula, and end up with a rough but acceptable estimate of mu and sigma squared. Apparently, in practice this result is no worse than if we had averaged mu and sigma squared over the entire training set. Also, gamma and beta are trained on all batches (the entire training set); that's why we can get a slight regularization effect. I wrote down the points that were not entirely clear to me at first and took some time to figure out. I hope this comment speeds up the process for someone who is as confused at the beginning as I was. Good luck!
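A minimal sketch of what I mean, in NumPy; the decay, layer size, and batch size are made-up values, and the real gamma and beta would of course be learned by gradient descent:

import numpy as np

hidden_size, eps, momentum = 4, 1e-5, 0.9      # illustrative values, not from the video
gamma = np.ones(hidden_size)                   # in reality learned by gradient descent
beta = np.zeros(hidden_size)
running_mean = np.zeros(hidden_size)
running_var = np.ones(hidden_size)

# training: fold each mini-batch's statistics into the exponentially weighted averages
for step in range(100):
    z = np.random.randn(64, hidden_size)       # stand-in for one mini-batch of pre-activations
    running_mean = momentum * running_mean + (1 - momentum) * z.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * z.var(axis=0)

# test time: even a single example is normalized with the accumulated estimates
z_test = np.random.randn(1, hidden_size)
z_norm = (z_test - running_mean) / np.sqrt(running_var + eps)
z_tilde = gamma * z_norm + beta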
@daqi7346 7 years ago
Exactly what I was looking for, which was not very clear in the paper.
@alfcnz 6 years ago
The resolution is very very low. What's happened?
@jimmiemunyi 3 years ago
I love your videos Alfredo!
@alfcnz 3 years ago
@@jimmiemunyi thank you 🥰🥰🥰
@경업이-r1e 4 years ago
He's a god.
@yonggangxu9871 3 years ago
Great lecture! One suggestion/question... As \gamma \in R^m, is it better to replace \gamma with \gamma^{(i)} in the last equation at 0:40? Same for \beta. With the current notation, it looks as if the exact same affine transformation were applied to all components.
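That is, something like \tilde{z}^{(i)} = \gamma^{(i)} z^{(i)}_{\text{norm}} + \beta^{(i)}, \; i = 1, \dots, m, which would make explicit that each component gets its own scale and shift.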
@venkateshwarlusonnathi4137 3 years ago
Very nice explanation. One question though: why do we need to take an exponentially weighted average of the mus and sigma squareds? The batches are generally formed randomly, so there is no sequential element to the series. Isn't it sufficient to take a simple mean of the two, since the batch size is constant?
@donm7906 6 years ago
I think we should only consider this method when we have a single test example at a time. If you can divide the test set into mini-batches, then don't worry about this.
@ch40t1c1989 6 years ago
True, but often you can use larger batch sizes when you do inference than when you do training, since you don't need to keep everything around for backprop in GPU memory. I assume that if you have differently sized mini-batches for training and testing, you should also do it as proposed in this video...
@SuperMaker.M a year ago
you're the best!
@WillYuJenHuang 6 years ago
Sorry, I may have missed what Prof. Andrew Ng said. But what is the advantage of using the moving-average mean and variance over each batch's own mean and variance?
@itzmesujit 6 years ago
I think what he meant was: since you pass one input at a time during testing and don't have any batch to take a mean or variance over, you can keep a running mean and variance during training and use those for each test input when you begin testing.
@JuliusUnscripted 4 years ago
Why is this exponentially weighted average not used in the training phase too? In training we only use the data of the current mini-batch to calculate the mean and variance for that batch. Wouldn't it be clever to use the exponentially weighted average for that too?
@alexanderfedintsev9570 6 years ago
Shouldn't you use an unbiased variance estimate?
@wangtony3543 6 years ago
I also want to ask this. I think it depends on the number of mini-batches.
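If I remember the original batch norm paper correctly, it does apply Bessel's correction when forming the inference-time variance, roughly \mathrm{Var}[z] = \frac{m}{m-1}\,\mathbb{E}_{\mathcal{B}}[\sigma_{\mathcal{B}}^2], where m is the mini-batch size; for reasonably large mini-batches the correction is negligible.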
@koeficientas 6 years ago
OK, at test time we use a mean and variance estimated as mean(mean) and mean(variance). What about gamma and beta? Are they trained separately for each mini-batch, or do we track and update a single gamma and beta for each batch norm layer?
@t8m8r 6 years ago
Why not use gamma and beta, since they are the learned sigma and mu? Or just skip the whole normalization step at test time?
@bipulkalita5780 6 years ago
The normalization steps are also part of the network; skipping them means skipping part of their layer, which would break the network.
@ah-rdk 8 months ago
I don't understand why there is this assumption at the beginning of the video that we may have to process the records in a test set one by one. If we split our main dataset with a 70-20-10 or a 98-1-1 ratio, there will always be enough records in each of the training, dev, and test sets to compute the mean and variance in each phase. I would appreciate it if someone could clarify this for me.
@OmarHisham1 2 years ago
Challenge: Do 5 push-ups every time Prof. Andrew says "mini-batch" or "exponentially-weighted average"
@ermano5586 a year ago
Challenge accepted
@fire_nakamura 6 months ago
Here to learn English
@vnpikachu4627 4 years ago
Each mini-batch will have different values of beta and gamma, so how do we set the values of beta and gamma at test time?
@Mrityujha 3 years ago
Beta and gamma are learned/trained during training itself. They remain fixed at inference time.
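For example, this is how PyTorch handles it (just an illustration, not what the course uses):

import torch

bn = torch.nn.BatchNorm1d(4)   # bn.weight is gamma, bn.bias is beta (learned parameters)
x = torch.randn(32, 4)

bn.train()
_ = bn(x)                      # training mode: normalizes with batch statistics and
                               # updates bn.running_mean / bn.running_var (the moving averages)

bn.eval()
y = bn(torch.randn(1, 4))      # eval mode: normalizes with the stored running statistics;
                               # bn.weight and bn.bias stay exactly as they were trained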
@curiousSloth92 4 years ago
Why an EXPONENTIAL moving average? Does that mean that more recently calculated means and variances are more important?
@pivasmilos 4 years ago
It makes sense. The network might have completely changed since the first epoch, so the means and variances from that time don't matter at all compared to the more recent ones. I'm sure there are papers filled with estimation theory that give better explanations.
@curiousSloth92 4 years ago
Miloš Pivaš ok, thank you
@yashas9974 4 years ago
@@pivasmilos but the mean and variance of a mini-batch do not change unless you are re-shuffling the dataset after every epoch. I feel like using an exponentially weighted average gives better accuracy for the mini-batches that appeared later in the last epoch.
@pivasmilos 4 years ago
@@yashas9974 For the same input, the activations change with every gradient descent step. If they change, how could their moments not change?
@sandipansarkar9211 4 years ago
Great explanation. Need to make notes.
@ermano5586 a year ago
I am making notes
@heejuneAhn 6 years ago
Can you clearly explain why we have to multiply by "gamma" and add "beta" again, when we have already accomplished the normalization? In other words, I cannot see the reason for the "identity transform", which mathematically changes nothing.
@parnianshahkar7797 6 years ago
In the previous video the professor mentioned that we sometimes need to change the variance in order to increase or decrease nonlinearity. Similarly, we may sometimes need to shift z. So this formula (the one which contains beta and gamma) is very general.
@ch40t1c1989 6 years ago
With the first normalization step you normalize all your z-values to mean 0 and variance 1. Well, what if you don't want all your z-values (i.e., the values that are subsequently fed into your chosen activation function, e.g. sigmoid, tanh, ReLU) to lie primarily between -1 and 1? For example, most z-values lying between -1 and 1 would exploit only the mostly linear region of the sigmoid activation function (that is exactly the example Andrew mentioned in the previous video). So the best choice is to let the network learn the most suitable distribution of z by letting it learn beta and gamma.
@yashas9974 4 years ago
Or even simpler: have gamma and beta so that the network can undo the normalization if that is the right thing to do. Or simpler still: so that the representational power of the network is not reduced (because you can undo the normalization).
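Concretely, if the network learns \gamma = \sqrt{\sigma^2 + \varepsilon} and \beta = \mu, then \gamma \cdot \frac{z - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta = z, so the layer can reproduce the original pre-activations exactly if that is what minimizes the loss.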
@48956l 3 years ago
I love your videos, but your audio mastering is hurting my ears. There is a very strange and uncomfortable high-pitched noise in your videos...