To answer some questions below: gamma and beta are optimized given that the input to the batch norm layer is normalized during training. That's why we can't skip normalization at test time. Gamma and beta are, in effect, estimates of the shift and scaling over the whole training set, so we need to normalize the input to the batch norm layer assuming each mini-batch has the same mean and variance as the whole training-set distribution. In reality, a mini-batch has strong sampling bias, especially if its size is small relative to the training data.
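In the video's notation, the test-time transform for each unit is roughly
\tilde{z} = \gamma \, \frac{z - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta,
where \mu and \sigma^2 are the exponentially weighted averages accumulated during training, not the statistics of any test batch.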
@penpomvx9228 a year ago
Can we think of gamma and beta as a way to compensate for the noise introduced by the normalization?
@aiymzhanbaitureyeva5296 2 years ago
If I understand correctly, we could optionally run all the training data through the final network and get the desired mean mu and variance sigma squared for each layer, but this would take time. That is why, in practice, it is easier to take the values of mu and sigma squared from each mini-batch t, feed them step by step into the exponentially weighted average formula, and end up with a rough but acceptable estimate of mu and sigma squared. Apparently, in practice this result is no worse than if we had averaged mu and sigma squared over the entire training set. Also, gamma and beta are trained on all batches (the entire training set); that's why we can get a slight regularization effect. I wrote down the points that were not entirely clear to me at first and took some time to figure out. I hope this comment speeds up the process for someone who is as confused at the beginning as I was. Good luck!
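A minimal sketch of what I mean, in NumPy; the decay, layer size, and batch size are made-up values, and the real gamma and beta would of course be learned by gradient descent:

import numpy as np

hidden_size, eps, momentum = 4, 1e-5, 0.9      # illustrative values, not from the video
gamma = np.ones(hidden_size)                   # in reality learned by gradient descent
beta = np.zeros(hidden_size)
running_mean = np.zeros(hidden_size)
running_var = np.ones(hidden_size)

# training: fold each mini-batch's statistics into the exponentially weighted averages
for step in range(100):
    z = np.random.randn(64, hidden_size)       # stand-in for one mini-batch of pre-activations
    running_mean = momentum * running_mean + (1 - momentum) * z.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * z.var(axis=0)

# test time: even a single example is normalized with the accumulated estimates
z_test = np.random.randn(1, hidden_size)
z_norm = (z_test - running_mean) / np.sqrt(running_var + eps)
z_tilde = gamma * z_norm + beta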
@daqi7346 7 years ago
Exactly what I was looking for, which was not very clear in the paper.
@alfcnz 6 years ago
The resolution is very very low. What's happened?
@jimmiemunyi 3 years ago
I love your videos Alfredo!
@alfcnz 3 years ago
@@jimmiemunyi thank you 🥰🥰🥰
@경업이-r1e 4 years ago
He's a god.
@yonggangxu9871 3 years ago
Great lecture! One suggestion/question... As \gamma \in R^m, is it better to replace \gamma with \gamma^{(i)} in the last equation at 0:40? Same for \beta. With the current notation, it looks as if the exact same affine transformation were applied to all components.
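That is, something like \tilde{z}^{(i)} = \gamma^{(i)} z^{(i)}_{\text{norm}} + \beta^{(i)}, \; i = 1, \dots, m, which would make explicit that each component gets its own scale and shift.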
@venkateshwarlusonnathi4137 3 years ago
Very nice explanation. One question though: why do we need to take an exponentially weighted average of the mus and sigma squareds? The batches are generally formed randomly, so there is no sequential element to the series. Isn't it sufficient to take a simple mean of the two, since the batch size is constant?
@donm7906 6 years ago
I think we should only consider this method when we have a single test example at a time. If you can divide the test set into mini-batches, then don't worry about this.
@ch40t1c1989 6 years ago
True, but often you can use larger batch sizes when you do inference than when you do training, since you don't need to keep everything around for backprop in GPU memory. I assume that if you have differently sized mini-batches for training and testing, you should also do it as proposed in this video...
@SuperMaker.M a year ago
you're the best!
@WillYuJenHuang 6 years ago
Sorry, I may have missed what Prof. Andrew Ng said. But what is the advantage of using the moving-average mean and variance over each batch's own mean and variance?
@itzmesujit 6 years ago
I think what he meant was: since you pass one input at a time during testing and don't have any batch to take a mean or variance over, you can keep a running mean and variance during training and use those for each test input when you begin testing.
@JuliusUnscripted 4 years ago
Why is this exponentially weighted average not used in the training phase too? In training we only use the data of the current mini-batch to calculate the mean and variance for that batch. Wouldn't it be clever to use the exponentially weighted average for that too?
@alexanderfedintsev9570 6 years ago
Shouldn't you use an unbiased variance estimate?
@wangtony3543 6 years ago
I also want to ask this. I think it depends on the number of mini-batches.
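If I remember the original batch norm paper correctly, it does apply Bessel's correction when forming the inference-time variance, roughly \mathrm{Var}[z] = \frac{m}{m-1}\,\mathbb{E}_{\mathcal{B}}[\sigma_{\mathcal{B}}^2], where m is the mini-batch size; for reasonably large mini-batches the correction is negligible.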
@koeficientas 6 years ago
OK, at test time we use a mean and variance estimated as mean(mean) and mean(variance). What about gamma and beta? Are they trained separately for each mini-batch, or do we track and update a single gamma and beta for each batch norm layer?
@t8m8r 6 years ago
Why not use gamma and beta, since they are the learned sigma and mu? Or just skip the whole normalization step at test time?
@bipulkalita5780 6 years ago
The normalization steps are also part of the network; skipping them means skipping part of their layer, which would break the network.
@ah-rdk 8 months ago
I don't understand why there is this assumption at the beginning of the video that we may have to process the records in a test set one by one. If we split our main dataset with a 70-20-10 or a 98-1-1 ratio, there will always be enough records in each of the training, dev, and test sets to compute the mean and variance in each phase. I would appreciate it if someone could clarify this for me.
@OmarHisham1 2 years ago
Challenge: Do 5 push-ups every time Prof. Andrew says "mini-batch" or "exponentially-weighted average"
@ermano5586 a year ago
Challenge accepted
@fire_nakamura 6 months ago
Here to learn English
@vnpikachu4627 4 years ago
Each mini-batch will have different values of beta and gamma, so how do we set the values of beta and gamma at test time?
@Mrityujha 3 years ago
Beta and gamma are learned/trained during training itself. They remain fixed at inference time.
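For example, this is how PyTorch handles it (just an illustration, not what the course uses):

import torch

bn = torch.nn.BatchNorm1d(4)   # bn.weight is gamma, bn.bias is beta (learned parameters)
x = torch.randn(32, 4)

bn.train()
_ = bn(x)                      # training mode: normalizes with batch statistics and
                               # updates bn.running_mean / bn.running_var (the moving averages)

bn.eval()
y = bn(torch.randn(1, 4))      # eval mode: normalizes with the stored running statistics;
                               # bn.weight and bn.bias stay exactly as they were trained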
@curiousSloth92 4 years ago
Why an EXPONENTIAL moving average? Does that mean that more recently calculated means and variances are more important?
@pivasmilos 4 years ago
It makes sense. The network might have completely changed since the first epoch, so the means and variances from that time don't matter at all compared to the more recent ones. I'm sure there are papers filled with estimation theory that give better explanations.
@curiousSloth92 4 years ago
Miloš Pivaš ok, thank you
@yashas9974 4 years ago
@@pivasmilos but the mean and variance of a mini-batch do not change unless you are re-shuffling the dataset after every epoch. I feel like using an exponentially weighted average gives better accuracy for the mini-batches that appeared later in the last epoch.
@pivasmilos 4 years ago
@@yashas9974 For the same input, the activations change with every gradient descent step. If they change, how could their moments not change?
@sandipansarkar9211 4 years ago
Great explanation. Need to make notes.
@ermano5586 a year ago
I am making notes
@heejuneAhn 6 years ago
Can you clearly explain why we have to multiply by "gamma" and add "beta" again, when we have already accomplished the normalization? In other words, I cannot see the reason for the "identity transform", which mathematically changes nothing.
@parnianshahkar7797 6 years ago
In the previous video the professor mentioned that we sometimes need to change the variance in order to increase or decrease nonlinearity. Similarly, we may sometimes need to shift z. So this formula (the one which contains beta and gamma) is very general.
@ch40t1c1989 6 years ago
With the first normalization step you normalize all your z-values to mean 0 and variance 1. Well, what if you don't want all your z-values (i.e., the values that are subsequently fed into your chosen activation function, e.g. sigmoid, tanh, ReLU) to lie primarily between -1 and 1? For example, most z-values lying between -1 and 1 would exploit only the mostly linear region of the sigmoid activation function (that is exactly the example Andrew mentioned in the previous video). So the best choice is to let the network learn the most suitable distribution of z by letting it learn beta and gamma.
@yashas9974 4 years ago
Or even simpler: have gamma and beta so that the network can undo the normalization if that is the right thing to do. Or simpler still: so that the representational power of the network is not reduced (because you can undo the normalization).
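Concretely, if the network learns \gamma = \sqrt{\sigma^2 + \varepsilon} and \beta = \mu, then \gamma \cdot \frac{z - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta = z, so the layer can reproduce the original pre-activations exactly if that is what minimizes the loss.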
@48956l 3 years ago
I love your videos, but your audio mastering is hurting my ears. There is a very strange and uncomfortable high-pitched noise in your videos...