Why Are Neural Network Loss Landscapes So Weirdly Connected?

2,117 views

Tunadorable

28 days ago

Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes
arxiv.org/abs/2104.11044
Support my learning journey by becoming either a YouTube or Patreon member!
/ @tunadorable
patreon.com/Tunadorable?...
Discuss this stuff with other Tunadorks on Discord
/ discord
All my other links
linktr.ee/tunadorable
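
For context on the paper's central topic: monotonic linear interpolation (MLI) is the observation that if you linearly interpolate a network's weights from their initial values to their trained values, the training loss often decreases monotonically along that straight line. Below is a minimal, purely illustrative sketch of that check; the toy loss function and the parameter vectors `theta_init` / `theta_final` are invented here and are not from the paper or the video.

```python
# Toy sketch of a monotonic linear interpolation (MLI) check.
# The quadratic "loss" and the parameter vectors are hypothetical stand-ins.
import numpy as np

def loss(theta):
    # Stand-in loss; in practice this would be the training loss of a network
    # whose weights are set to `theta`.
    return np.sum((theta - 3.0) ** 2)

theta_init = np.random.randn(10)   # weights at initialization (made up)
theta_final = np.full(10, 3.0)     # weights after "training" (toy optimum)

alphas = np.linspace(0.0, 1.0, 50)
losses = [loss((1 - a) * theta_init + a * theta_final) for a in alphas]

# MLI holds on this path if the loss never increases from init to final weights.
is_monotonic = all(l2 <= l1 + 1e-8 for l1, l2 in zip(losses, losses[1:]))
print("MLI holds on this path:", is_monotonic)
```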

Comments: 18
@rosieposiebias 25 days ago
14:22 Layer Normalization and Batch Normalization are not the same thing. It doesn't take away from the quality of the video or anything, just commenting :)
@Tunadorable 25 days ago
omg thank you, that felt like way too simple of an explanation but I didn't want to linger on it. this is one of those things where I can't memorize a given technical term to save my life and it ends up biting me in the butt every few months, even though once upon a time I understood the concept -_- ugh
@aleksandrkhavarovskiy991 24 days ago
BatchNorm and LayerNorm are based on the same formula: they standardize values toward a normal-like distribution. The goal of these layers is to force the activations they see into a normalized distribution, and additional trainable parameters gamma and beta are used to shape (scale and shift) the resulting bell curve. It's widely used after convolutional operations as it helps the network converge faster during training. Batch Normalization has the added step of computing the mean and variance over the entire batch, although in practice a running average of those statistics can be used instead.
@aleksandrkhavarovskiy991 24 days ago
Batch Normalization makes an assumption about the layer outputs: specifically, that the output should follow a normal distribution. But that is only an assumption we make; the optimal distribution for a layer may not be normal, at least not during the intermediate steps of training. This assumption about the distribution could be what causes the MLI property to not hold.
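
A rough sketch (my own illustration, not from the video or the comments) of the distinction this thread is pointing at: both normalizations use the same (x - mean) / std form with learnable gamma and beta, but BatchNorm computes its statistics per feature across the batch dimension, while LayerNorm computes them per example across the feature dimension. The array shapes and values below are invented.

```python
# Batch normalization vs layer normalization: same formula, different axes.
import numpy as np

x = np.random.randn(32, 64)              # (batch, features) activations, made up
gamma, beta = np.ones(64), np.zeros(64)  # learnable scale/shift parameters
eps = 1e-5

# BatchNorm (training mode): per-feature mean/var computed over the batch.
bn_mean = x.mean(axis=0)
bn_var = x.var(axis=0)
x_bn = gamma * (x - bn_mean) / np.sqrt(bn_var + eps) + beta

# LayerNorm: per-example mean/var computed over that example's own features,
# so it does not depend on the rest of the batch.
ln_mean = x.mean(axis=1, keepdims=True)
ln_var = x.var(axis=1, keepdims=True)
x_ln = gamma * (x - ln_mean) / np.sqrt(ln_var + eps) + beta

print(x_bn.shape, x_ln.shape)  # both (32, 64), normalized along different axes
```

In PyTorch these roughly correspond to nn.BatchNorm1d and nn.LayerNorm; BatchNorm additionally keeps running statistics for use at inference time, LayerNorm does not.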
@drdca8263 23 days ago
I did not expect the analogy with dive spots in RSE, haha
@sarthakmishra1483 26 days ago
Nice analysis, made me want to revisit my optimisation notes.
@stereoplegic 26 days ago
LeNet is a CNN by Yann LeCun (now Chief AI Scientist at Meta) et al.
@easydoesitismist 26 days ago
Cool paper and analysis
@beagle989 19 days ago
good video! and neat paper
@InfiniteQuest86 24 days ago
Interesting. This all seemed self-evident to me. I didn't realize this stuff wasn't known. SGD is literally designed to do this. I suppose it may not work as intended, so it's good someone checked. Then ADAM is literally designed to avoid MLI. Hmm, good to know someone did the work to check this stuff, but it seems like a weird paper.
@SudhirYadav-kz6ts 24 days ago
How can you say this? Can you point to some reading material?
@jsparger 23 days ago
What do they mean when they say Nguyen implies that all global minima are connected? Isn’t there only one global minimum? Somebody unpack that for me please.
@Tunadorable 23 days ago
oooof, apologies if I'm going a bit too basic here or not giving a good enough explanation. So basically, once upon a time we used to think our 3D intuitions could be applied to high-dimensional loss landscapes, which they most definitely cannot. Then we started seeing really weird stuff that didn't make sense under those 3D intuitions. For an example relevant to this case: if you were to train 1000 randomly initialized NNs, you'd find that instead of them all reaching the same global minimum, meaning having roughly the same parameters, they actually all end up with completely different parameters and yet all reach minimal loss. AND, if you linearly interpolate between them, you find regions of higher loss in between (there's a toy sketch of this interpolation right after this thread). If 3D intuitions were correct, you might interpret this as many distinct bottoms of the valley (loss landscape) that all just happen to sit at the exact same elevation (loss). But that didn't really make sense, why would there just happen to be tons and tons of equal-elevation bottoms to the valley? The reality is that there is in fact only one bottom, but that bottom is a huge, weirdly shaped, high-dimensional manifold. This weird reality is behind a lot of misconceptions still taught in ML classes, such as the myth of local minima. The paper that comes to mind as helping me first understand this a bit better was section 2 of arxiv.org/abs/1406.2572
@jsparger 22 days ago
@@Tunadorable thanks that’s very interesting!
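
To make the interpolation experiment described in that reply concrete, here is a toy, purely hypothetical sketch: two "trained" parameter vectors that both reach low loss on an invented 2D loss surface, with the straight line between them passing through higher-loss territory. None of this comes from the paper; it only illustrates the idea of a loss barrier between independently trained solutions.

```python
# Toy illustration of a loss barrier between two separately trained solutions.
# The 2D "loss" below is invented purely to make the barrier visible; real
# loss landscapes live in millions of dimensions.
import numpy as np

def loss(theta):
    # Two equally good minima near (-2, 0) and (+2, 0), with a bump between them.
    x, y = theta
    return min((x + 2) ** 2, (x - 2) ** 2) + y ** 2 + 2.0 * np.exp(-x ** 2)

theta_a = np.array([-2.0, 0.0])  # "network A" after training (hypothetical)
theta_b = np.array([+2.0, 0.0])  # "network B" after training (hypothetical)

alphas = np.linspace(0.0, 1.0, 101)
path_losses = [loss((1 - a) * theta_a + a * theta_b) for a in alphas]

# The barrier is how much higher the path gets than the worse of the endpoints.
barrier = max(path_losses) - max(loss(theta_a), loss(theta_b))
print(f"endpoint losses: {loss(theta_a):.3f}, {loss(theta_b):.3f}")
print(f"loss barrier along the straight line: {barrier:.3f}")
```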
@NeelBanodiya 24 days ago
You must have nice music taste
@DouglasASean 26 days ago
You are mapping an n-dimensional surface, of course it would look weird; we don't usually come across those in our daily experience.
@Tunadorable 26 days ago
ofc
@kylev.8248 25 days ago
Yeeees 🥰