Group Normalization (Paper Explained)

29,793 views

Yannic Kilcher

4 years ago

The dirty little secret of Batch Normalization is its intrinsic dependence on the training batch size. Group Normalization attempts to achieve the benefits of normalization without batch statistics and, most importantly, without sacrificing performance compared to Batch Normalization.
arxiv.org/abs/1803.08494
Abstract:
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
Authors: Yuxin Wu, Kaiming He
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
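The abstract's claim that GN "can be easily implemented by a few lines of code" is easy to make concrete; below is a minimal PyTorch-style sketch (the function name, shapes, and variable names are illustrative rather than taken from the paper, which shows a similar TensorFlow snippet).

import torch

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); num_groups must divide C
    N, C, H, W = x.shape
    x = x.view(N, num_groups, C // num_groups, H, W)
    # mean/variance per sample and per group: no batch statistics involved
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
    x = (x - mean) / torch.sqrt(var + eps)
    x = x.view(N, C, H, W)
    # per-channel learnable scale and shift, as in batch norm
    return x * gamma.view(1, C, 1, 1) + beta.view(1, C, 1, 1)

In practice, torch.nn.GroupNorm(num_groups, num_channels) does the same thing.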

Comments: 64
@hillosand 3 years ago
Thanks for the video! Note: normalizing *isn't* making the data more Gaussian, it's just transforming it to have mean of 0 and SD of 1. Gaussian data is often normalized and represented in this way too, but the normalization doesn't make your data any more Gaussian. Normalization does not change the inherent distributional shape of the data, just the mean and SD. For example, if your data was right-tailed in one dimension, it would remain right-tailed (and non-gaussian looking), it would just have a mean and SD of 0 and 1, respectively.
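A tiny numerical check of that point (illustrative only, not from the video): standardizing a right-skewed sample moves the mean to 0 and the SD to 1, but leaves the skewness, i.e. the shape, untouched.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)    # right-skewed, clearly non-Gaussian
z = (x - x.mean()) / x.std()                    # "normalization" as used in batch norm

skew = lambda a: ((a - a.mean()) ** 3).mean() / a.std() ** 3
print(round(z.mean(), 3), round(z.std(), 3))    # ~0.0, ~1.0
print(round(skew(x), 2), round(skew(z), 2))     # both ~2.0: still just as skewed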
@yes-vy6bn 1 year ago
Yeah, it's really called standardization, which comes from the equation that converts a normal distribution to a standard normal distribution.
@bluel1ng 4 years ago
Nice explanation of BN at the beginning! Glad you kept it simple and did not use mythical "internal covariate shift" terminology. ;-)
@mkamp 4 years ago
Thanks for taking the time to walk us through this so slowly. Much appreciated.
@rbain16 4 years ago
Thank gosh you mentioned the other way of thinking about batch norm @ 13:00. I thought I'd misunderstood batch norm the whole time. Like always, top notch content :)
@MiroslawHorbal 4 years ago
Thanks for the videos. You do a great job of going over the details of papers and summarizing the key points.
@fahdciwan8709 4 years ago
Thanks a lot Yannic!! keep the videos coming
@IBMua 3 years ago
Definitely one of the best NN explanation videos I've seen.
@vandanaschannel4000 2 years ago
Thanks man. Perfect illustration to understand the difference between batch norm and layer norm.
@bimds1661 3 years ago
The visualization/explanation of batch norm was really helpful to understand how it works in a CNN! Thanks :)
@Konstantin-qk6hv 2 years ago
Great explanation. Love your videos!
@carlosnacher3593 1 year ago
Thank you so much! This explanation is literally what I needed 🙏🏽🤝🏽
@abdolahfrootan2127 2 years ago
Good review Yannic, it helps me a lot to understand papers faster.
@bright1402 4 years ago
Great explanation! Thank you~
@AlterMachKeinAuge 4 months ago
Awesome content. Thanks!
@naru909 2 years ago
This is gold. Thank you!
@johnkilbride3436 4 years ago
Well, I know what I’m adding to my model later. Thank you for the clear explanation.
@YannicKilcher 4 years ago
Also add weight standardization
@andreasv9472 4 years ago
@@YannicKilcher que?
@dimitrisspiridonidis3284 1 year ago
i love your paper reviews
@saeidshamsaliei2536 3 years ago
Thanks a lot Yannic
@nachiketa9245 2 years ago
Amazing explanation.
@ManishChoudhary-hy5ey 3 years ago
Nicely explained
@erictian8075 4 years ago
Great video! Thanks! "But did you really do the experiment?"
@cexploreful 2 years ago
As always, great place to start reading! 🙃
@-mwolf 1 year ago
Thank you
@reginaphalange2563 3 years ago
"I usually don't believe the experiments that you see in a single paper." LOL
@DerUltraGamerr 3 years ago
Speaking about normalization, I was wondering about the intuition of LayerNorm in Transformer models. Usually it is applied after the concatenation and projection of the multiheaded self-attention output but wouldn't it make sense to apply it to each head separately to get more fine-grained normalization statistics?
@YannicKilcher 3 years ago
There's always a tradeoff. What you're suggesting would also introduce more variance.
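For reference, the two variants being discussed look roughly like this (a sketch with made-up dimensions, ignoring the output projection, assuming the usual layout where d_model = n_heads * d_head):

import torch
import torch.nn as nn

n_heads, d_head = 8, 64
d_model = n_heads * d_head
x = torch.randn(2, 10, n_heads, d_head)      # per-head attention outputs, before concatenation

standard = nn.LayerNorm(d_model)
y1 = standard(x.reshape(2, 10, d_model))     # normalize after concatenation, over all 512 dims

per_head = nn.LayerNorm(d_head)
y2 = per_head(x).reshape(2, 10, d_model)     # normalize each head's 64 dims separately, then concat

# per-head statistics come from 64 values instead of 512, hence they are noisier;
# that is the extra variance mentioned in the reply above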
@rahuldeora5815 4 years ago
Nice one! Enjoyed it. Can you do Stand-Alone Self-Attention in Vision Models? It has huge potential impact.
@seankernitsman6055 4 years ago
+1 for the self-attention paper by J. Cheng, L. Dong and M. Lapata. Thanks for the video(:
@indraneilpaul1309 3 years ago
It seems that the method is motivated by the fact that there might be a few correlated or similar channels, but there is no effort to figure out which channels should be grouped together before normalizing them together. I'm surprised that this effect has been replicated across multiple efforts, as you mention.
@dermitdembrot3091 4 years ago
I usually even go with batch size of 1 when processing videos 😉 (with my brain)
@proreduction 3 years ago
To be sure I am understanding everything correctly: If you are training a fully connected NN (MLP) with only 1 channel, then Layer Norm = Instance Norm = Group Norm, correct?
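For the layer-norm vs. group-norm part of that question, here is a quick check with PyTorch's built-ins (default settings, purely illustrative): on a (batch, features) tensor, GroupNorm with a single group computes exactly what LayerNorm computes.

import torch
import torch.nn as nn

x = torch.randn(4, 16)                                # (batch, features) activations of an MLP layer

ln = nn.LayerNorm(16)                                 # normalize over all 16 features per sample
gn = nn.GroupNorm(num_groups=1, num_channels=16)      # one group containing all 16 "channels"

print(torch.allclose(ln(x), gn(x), atol=1e-6))        # True

With num_groups equal to the number of features, every group would contain a single scalar per sample, so the instance-norm-style case degenerates (zero variance) for a plain MLP.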
@Chr0nalis 4 years ago
Hmm, yeah, but what about non-conv networks? Doesn't really make sense to group features from a specific layer for non-structured data imo.
@moonryu2806 3 years ago
15:13 You said you calculate the mean over 3 channels, but the picture looks like it has 6 different channels. Does the picture not represent the 3 channels?
@YannicKilcher 3 years ago
In layer norm, all 6 channels are averaged; in group norm, only the 3 in each group.
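To make the 6-vs-3 split concrete, a small check (channel counts chosen to match the example, otherwise illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 6, 4, 4)                  # one sample, 6 channels

layer_norm_like = nn.GroupNorm(1, 6)         # "layer norm" here: one group of all 6 channels
group_norm = nn.GroupNorm(2, 6)              # group norm: two groups of 3 channels each

y = group_norm(x)
# each 3-channel group is normalized with its own mean and variance
print(y[0, :3].mean().item(), y[0, 3:].mean().item())   # both ~0
z = layer_norm_like(x)
print(z[0].mean().item())                                # ~0 only over all 6 channels together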
@makgaiduk 5 months ago
How does GroupNorm compare with the modern version of BatchNorm with running momentum? Sounds like the running averages should fix the problem of small batch sizes on their own.
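A brief sketch of what the running estimates actually do may help here (simplified, ignoring the learnable affine parameters; "momentum" follows PyTorch's convention where the new batch statistics get weight momentum): during training, BN still normalizes with the current mini-batch's mean and variance, and the running averages are only used at eval time, so a tiny batch still yields noisy training statistics regardless of the momentum.

import torch

def bn_train_step(x, running_mean, running_var, momentum=0.1, eps=1e-5):
    # x: (N, C, H, W); training-time batch norm without the affine part
    batch_mean = x.mean(dim=(0, 2, 3))
    batch_var = x.var(dim=(0, 2, 3), unbiased=False)
    # normalization uses the CURRENT batch statistics, however small N is
    y = (x - batch_mean.view(1, -1, 1, 1)) / torch.sqrt(batch_var.view(1, -1, 1, 1) + eps)
    # the running estimates are only bookkeeping for inference (model.eval())
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return y, running_mean, running_var

GN sidesteps this entirely by never computing statistics across the batch dimension.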
@eelcohoogendoorn8044 4 years ago
What I don't quite get here, looking at the paper and the PyTorch implementation, is that the batch axis remains unused. As far as making the point that you can get enough samples to compute meaningful statistics without aggregating over the batch at all, it's an interesting experiment. But it does show a tradeoff. If you have a batch that's bigger than 1, wouldn't you at least want the option to compute your statistics over the batch as well? It seems intuitively like that would bring you closer to the optimal big-batch-statistics behavior. Am I missing something here?
@YannicKilcher 4 years ago
You could consider that. But there are other factors: with batch norm you have to track moving averages for test time, and if you use distributed training you run into synchronization problems.
@Ronschk 4 years ago
Thanks for the nice explanation! As it goes in a similar direction, do you know about network deconvolution? It was presented at ICLR this year and looks very interesting: arxiv.org/pdf/1905.11926.pdf
@Erosis 3 years ago
I'm not convinced. I think they're going to have to include more of those groupnorm "experiments" :) 24:06
@julian390 4 years ago
Shouldn't Batchnorm allow us to drop the bias term? It seems reasonable in my head but I couldn't find anything on it. Am I missing something?
@YannicKilcher 4 years ago
It allows you to drop the bias term of the convolutional/dense layers, yes
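In code this is just the usual conv-BN pattern (layer sizes arbitrary, purely illustrative): whatever bias the conv adds would be subtracted right back out by BN's mean centering, and BN's own learnable beta plays the role of a bias.

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias dropped: BN removes the mean anyway
    nn.BatchNorm2d(128),                                        # learnable gamma/beta supply scale and shift
    nn.ReLU(inplace=True),
)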
@tropopyte6473 4 years ago
@@YannicKilcher Depends on whether you apply BN pre or post activation, doesn't it? Yes, it seems to be common practice to use BN pre-activation, where you can drop the bias, but in my head it makes much more sense to apply BN post-activation, especially if the argument is "I want unit-Gaussian-like inputs to my layer". Using ReLU after BN will cut off half of my zero-centered Gaussian distribution, causing it to be neither zero-centered, nor have a standard deviation of one, nor look much like a Gaussian distribution... if my understanding of it is correct.
@julian390 4 years ago
Ah that makes a lot of sense, thank you :) This Channel is amazing!
@guyindisguise 3 years ago
@@tropopyte6473 I was wondering the same thing, did you find an answer to that question?
@ekstrapolatoraproksymujacy412 4 years ago
Isn't this a simplified version of the normalization used in AlexNet from 2012?
@YannicKilcher 4 years ago
Maybe. All these things look kinda similar, but if an element is repeated often, the precise implementation matters.
@anynamecanbeuse 3 years ago
I don't understand clearly. It seems group norm just groups some consecutive channels together, so how can you say they are the same type of features?
@YannicKilcher 3 years ago
The network will make them related, because you impose that grouping.
@evgeniinikitin8843 4 years ago
Great video! I disagree about the uselessness of the group norm experiment though. Batch norm is not the only possible reason for performance degradation in the small-batch regime, so it's a perfectly viable experiment.
@JackofSome 4 years ago
7 minutes in and I'm thinking "is this paper about aggregating statistics around normalization and then using those for each batch". Let's see if I was correct. Edit 1: doesn't seem like it Edit 2: neh, I'm wrong Edit 3: I like my idea better
@YannicKilcher 4 years ago
Wouldn't that be batch norm that's always in eval mode?
@herp_derpingson 4 years ago
How is group norm reducing internal covariate shift? It is agnostic to the batch moments.
@YannicKilcher 4 years ago
It's probably not. It doesn't appear to matter. What appears to matter is that you normalize somehow.
@herp_derpingson 4 years ago
@@YannicKilcher Dark magic :O
@oneman7094 4 years ago
Question: do you read papers with sunglasses on? In almost every video where we see your face you have sunglasses, and I'm always wondering about it while watching...
@YannicKilcher 4 years ago
Sure 😁
@bingbingsun6304 4 years ago
@@YannicKilcher Wearing sunglasses helps understanding the paper!!!
@YannicKilcher 4 years ago
@@bingbingsun6304 Yea the white paper is just too bright :D
@hieuza 4 years ago
24:07 They reduce the learning rate by 10x at epochs 30, 60, and 90; that's why the error drops. Why is it funny?
@JackofSome 4 years ago
That's not what he's laughing about. The full battery of experiments done on group norm at that point is kind of unnecessary, as they really just needed to show the 2 images/batch case.
@eco24t 4 years ago
The point isn't that they decreased the learning rate. The point is that the group norm performance shouldn't be affected by the batch size, but they tested multiple batch sizes anyway, possibly at the request of a reviewer.