What a teacher he is... watching his video is equivalent to reading 10 articles and watching 100 videos.
@piyushkumar-wg8cv 1 year ago
You are high on ML.
@billykotsos4642 4 years ago
Finally it clicks in my head. Thanks Andrew !!!
@trexmidnite 3 years ago
Your brain clicks?
@muzeroj173 3 years ago
Watched it 3 times during the past 2 years; each time I learned something new!
@ShubhamKumar-me7xy 2 years ago
Or each time you didn't listen to him carefully?
@iammakimadog 3 years ago
The residual block guarantees that your deep NN performs at least as well as a shallow one, so there's no reason to train a shallow NN rather than a deep NN, because theoretically the deeper NN outperforms the shallower one.
@firstpenguin5653 3 years ago
Thanks! So this is why ResNet works!
@razorphone77 5 years ago
I don't really understand what he means when he says the identity function is easy for the residual block to learn. It hasn't really learnt anything if all we do is append the initial input to the end. Given that we're saying the conv blocks are effectively superfluous because the weights are close to zero, I can't see what's gained in the whole process. We just appear to have extra calculation for the sake of it, when we already have the output of layer a[l].
@SuperVaio123 5 years ago
Basically the baseline here is that you're hoping to improve performance. In the worst-case scenario the deeper layers don't learn anything, yet your performance doesn't take a hit, thanks to the skip connections. But in most cases these layers will learn something too, which can only help improve performance. So yes, although there are a lot of extra calculations, you might get better performance. Again, it depends on the application and the trade-offs.
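A minimal NumPy sketch of that worst case (the shapes and numbers are just made up for illustration): if the block's weights and biases end up at zero, the skip connection hands the input straight through, so the output is exactly a[l].

import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(a_l, W1, b1, W2, b2):
    a_next = relu(W1 @ a_l + b1)          # first layer of the main path
    z_l2 = W2 @ a_next + b2               # second layer, before the activation
    return relu(z_l2 + a_l)               # skip connection: add a[l] before the final ReLU

a_l = np.array([0.5, 1.2, 0.3])           # a ReLU output, so non-negative
W1 = np.zeros((3, 3)); b1 = np.zeros(3)   # the "learned nothing" case
W2 = np.zeros((3, 3)); b2 = np.zeros(3)
print(residual_block(a_l, W1, b1, W2, b2))  # [0.5 1.2 0.3] -- identical to a_l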
@X_platform 7 years ago
But how do we know how deep we should skip to? For example, how do we know whether the 10th layer will or will not improve from the input of the 4th layer?
@joeycarson5510 6 years ago
It's my understanding that it may still be somewhat dependent on the problem. The skip connections are essentially restoring the identity of the input from the first layer of the block, thus keeping the block output similar to the input. The feature space you are learning in those intermediate layers of the residual block is something you may need to consider for your individual problem, in terms of there being too much or too little parameter space; this also depends on the quantity and variability of your data. In general, ResNets are useful because, as layers are stacked, the solution space grows hugely. Keeping the solution space somewhere around the input constrains it so that it doesn't grow out of control. Two or three intermediate layers is usually enough for the block to learn a reasonable amount, but you may want to consider the width of those intermediate layers as well. As for why you may not want to stack 10 layers inside a residual block, consider the reason we use residual blocks in the first place: stacking too many layers balloons the solution space, so SGD will try all sorts of solutions and it will be difficult to converge on a reasonable one. Thus we usually want to keep the blocks small, especially inside the residual block, since residual blocks are the individual building blocks of the whole network.
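To make that concrete, here is roughly what a small two-conv residual block looks like in Keras (filter counts and input size are made up; this is the "same convolution" case where input and output dimensions match, so the shortcut can be added directly):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                                     # identity path
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)                 # no activation yet
    y = layers.Add()([y, shortcut])                                  # skip connection
    return layers.ReLU()(y)                                          # final ReLU after the add

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)   # 64 filters so the shapes match for the add
model = tf.keras.Model(inputs, outputs)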
@RehanAsif 5 years ago
By empirical analysis
@zxynj 2 years ago
We don't, but it doesn't hurt to save our game too often
@bowbowzai3757 11 months ago
I have a question: if the result of the second network, with two extra layers and a skip connection, is the same as the first network without them (because a[l+2] is likely to become a[l]), then why do we need to add the extra layers just to make the network deeper? Or, like Andrew said, maybe we will be lucky and the extra layers learn something while not hurting performance? Or is the case in the video where a[l+2] equals a[l] an edge case, and usually the extra layers can still learn more while retaining the original performance?
@ahmadsaeedkhattak20 1 year ago
Andrew Ng is a true technologist, soo involved in his lectures that he almost started Kung Fu art @8:47 when it sounded like Kung kung kung fu, kung kung kung fu ... 😆😆😆
@RH-mk3rp 2 years ago
So if the input reaches a skip connection, does it take both routes, or does it always take the skip connection? If it's the latter, then what's the point of even including all those skipped layers?
@robingutsche1117 3 years ago
In case we learn something useful in g(W[l+2] a[l+1] + b[l+2]), isn't it possible that adding the activations of the previous layer, a[l], can actually decrease performance? So in that case a plain network would do a better job?
@zxynj 2 years ago
I guess if the performance would be worse, then W and b will go to 0 and nothing is learned. a[l] is preserved through the "game progress saving" trick, so a[l+2] is at least as good as a[l].
@도정찬 3 years ago
I love this video! Thanks, Professor Andrew!!
@sandipansarkar9211 4 years ago
Very good explanation. Need to watch it again.
@43SunSon 2 years ago
Question: let's assume the identity function really is learned, so a[l+2] = a[l]. Then what? I feel like we are doing f(x) + 0 = f(x), so what's the point of "adding nothing"? Since I am not following here, I can't tell why residual networks are good for training deeper NNs.
@kartikeyakhare5089 1 year ago
The residual block ensures that our layer at least learns the output from the previous layer, so the performance doesn't get worse. This is helpful because the plain networks often struggle to learn even the identity mapping with increased depth, leading to worse performance.
@hackercop 3 years ago
Thanks, Andrew! Now I understand it.
@anasputhawala6390 2 years ago
I have a question: you mention that the W matrix and b MAY decay IF we use weight decay. Isn't that a big IF, though? Is weight decay a part of residual networks / skip connections? In most cases W and b will not decay to 0, so how are residual networks / skip connections useful in those cases?
@heejuneAhn 2 years ago
So is L2 regularization kind of mandatory, then?
@rm175 2 years ago
Just amazing. So clear.
@anirudhgangadhar6158 3 years ago
"Residual networks can easily learn the identity function" - but isn't this true only when the weights and biases are 0? In a real situation, why would this happen? It's not making sense to me why you would add skip connections and have the learned weights go to 0. If someone could please clarify this, I would be extremely grateful.
@1233-f7h 2 years ago
It essentially means that the residual layer can easily learn the identity function over its input by setting the weights to zero. This leads to the layer giving an output that is at least NOT WORSE than the output of the previous layer. On the other hand, plain networks may struggle to learn the identity mapping, which can lead to worse performance with increasing layers.
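A tiny NumPy illustration of that contrast, with made-up near-zero weights and zero biases: the plain two-layer path nearly wipes the signal out, while the residual version stays close to its input, which is why extra depth can't easily do worse than the identity.

import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(0, x)

a = np.array([0.5, 1.2, 0.3])                 # activation a[l] from the previous layer
W1 = 0.01 * rng.standard_normal((3, 3))       # near-zero weights, biases omitted (zero)
W2 = 0.01 * rng.standard_normal((3, 3))

plain = relu(W2 @ relu(W1 @ a))               # plain layers: output is close to all zeros
resid = relu(W2 @ relu(W1 @ a) + a)           # same layers plus the skip: output is close to a[l]
print(plain, resid)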
@derekthompson2301 2 years ago
Hi, did you figure it out? I'm stuck on it now :(
@ahmedb2559 2 years ago
Thank you!
@patrickyu8470 2 years ago
(copied from the previous video in the series) Just a question for those out there - has anyone been able to use techniques from ResNets to improve the convergence speed of deep fully connected networks? Usually people use skip connections in the context of convolutional neural nets, but I haven't seen much gain in performance with fully connected ResNets, so I'm just wondering if there's something I may be missing.
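For anyone who wants to try it, this is one way to wire a skip connection around Dense layers in Keras (the layer sizes are arbitrary; whether it actually speeds up convergence on your data is exactly the empirical question above):

import tensorflow as tf
from tensorflow.keras import layers

def dense_residual_block(x, units):
    shortcut = x
    y = layers.Dense(units, activation="relu")(x)
    y = layers.Dense(units)(y)                       # linear; the activation comes after the add
    return layers.ReLU()(layers.Add()([y, shortcut]))

inputs = tf.keras.Input(shape=(256,))
h = layers.Dense(256, activation="relu")(inputs)     # project to a fixed width first
for _ in range(4):                                   # stack a few residual blocks
    h = dense_residual_block(h, 256)
outputs = layers.Dense(10, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)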
@baranaldemir5570 4 years ago
Can someone please correct me if I'm wrong? As far as I understand, if L2 regularization (weight decay) causes z[l+2] to become 0, ReLU just carries a[l] to the next layer. Otherwise, it learns from both z[l+2] and a[l]. So it bypasses the vanishing gradient problem but increases the exploding gradient problem. Am I right?
@АннаКопатько 3 years ago
I also have this question
@karatuno 3 years ago
same question
@derekthompson2301 2 years ago
same here
@elgs1980 4 years ago
If a layer is meant to be skipped, why was it there in the first place?
@mufaddalkanpurwala462 4 years ago
If the residual block has not learnt anything useful, regularisation will help negate the effect of that layer and let the previous activations pass through, thereby not sacrificing the performance of the layer. If the residual block has learnt something useful, then even after regularisation the learnt knowledge is kept, and the activations from the previous layer are added on top, again without sacrificing performance. So it lets you keep deep layers that are free to learn, or not learn, new information.
@derekthompson2301 2 years ago
@@mufaddalkanpurwala462 Hi, thanks for your explanation. There are some points I'm still not clear on:
- L2 regularisation makes W close to 0 but not exactly 0. Moreover, W is a matrix, so it's very unlikely for all of its elements to be 0. So how is the layer skipped?
- Why would we want to add the activations from the previous layer to the knowledge that was learned? Why doesn't adding them sacrifice the performance of the layer?
Hope you can help me with this, thanks a lot!
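Here's how I currently picture the weight-decay part, in case I've got it wrong (purely made-up numbers, and I'm pretending the data gradient is zero): the L2 term only shrinks W a little each step, so it gets small but never exactly zero, and then z[l+2] is just small enough that the a[l] term dominates the sum.

import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 3))
lam, lr = 0.1, 0.1                        # L2 strength and learning rate, illustrative only

for step in range(500):
    grad_from_data = np.zeros_like(W)     # pretend the data gradient is negligible here
    W -= lr * (grad_from_data + lam * W)  # the lam * W term is the weight decay

print(np.abs(W).max())                    # small, but not exactly zero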
@Ashokkumar-ds1nq 4 years ago
But we can also set w and b to 1 so that a[l+1] = a[l] and a[l+2] = a[l+1]. By doing so, we can get the identity function without ResNets, can't we?
@5paceb0i 4 years ago
@Sunny kumar you can't explicitly set w and b to 1; they are set by the gradient descent algorithm. If you are confused about how w can become 0: that is possible by applying L1 regularisation (read about this).
@mohammedalsubaie3512 2 years ago
Thank you very much, Andrew. Could anyone please explain what "3x3 conv" means? I would really appreciate that.
@mohammedalsubaie3512 2 years ago
do you mean 3x3 filters?
@6884 3 years ago
Am I the only one who thought that the pointer at 0:55 was actually a bug on their screen?
@heejuneAhn 2 years ago
I still can't get the intuition for why the skip connection works better. It still seems experimental to me. ^^;
@ati43888 9 months ago
Thanks
@jorjiang1 5 years ago
So does that mean ResNet models must be trained with a certain degree of weight decay for the skip connections to make sense? Otherwise is it just equivalent to a plain network?
@vinayakpevekar 6 years ago
Can anybody tell me what an identity function is?
@ajaysubramanian7026 6 years ago
g(x) = x (Same as linear function)
@永田善也-r2l 6 years ago
It's a function that outputs exactly the same value as its input, like y = x. For example, in the ReLU function, if the input x > 0 then the output y = x, so in that region ReLU acts as an identity function.
@mohammedsamir9833 5 years ago
y=x;
@shuyuwang4867 4 years ago
Why does the number of filters double after pooling is applied? Any suggestions?
@snippletrap 4 years ago
The dimension of the image is reduced. Pooling allows the network to learn more features over a larger window of the image, at the cost of lower resolution.
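A typical pattern looks something like this in Keras (the filter counts are just an example): halving the spatial grid while doubling the filters keeps the amount of computation per layer roughly balanced.

from tensorflow.keras import layers, Input, Model

x = Input(shape=(56, 56, 64))
y = layers.Conv2D(64, 3, padding="same", activation="relu")(x)    # 56 x 56 x 64
y = layers.MaxPooling2D(2)(y)                                     # 28 x 28 x 64: half the resolution
y = layers.Conv2D(128, 3, padding="same", activation="relu")(y)   # 28 x 28 x 128: double the filters
model = Model(x, y)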
@shuyuwang4867 4 years ago
@@snippletrap Thank you, very good explanation.
@shashankcharyavusali5914 6 years ago
Doesn't the performance get affected if z[l+2] is negative?
@giofou711 6 years ago
Yes. If g(.) is ReLU: a[l+2] = g(z[l+2] + a[l]) = z[l+2] + a[l] if z[l+2] > -a[l] else 0. Since a[l] is always non-negative, if z[l+2] gets a negative value whose magnitude is larger than a[l], it results in a[l+2] being 0.
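A quick numeric check of both cases (numbers made up):

import numpy as np

relu = lambda x: np.maximum(0, x)
a_l = np.array([0.4, 0.2])

print(relu(np.array([-0.1, -0.1]) + a_l))   # slightly negative z[l+2]: output stays close to a[l]
print(relu(np.array([-1.0, -1.0]) + a_l))   # strongly negative z[l+2]: output is clipped to 0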
5 years ago
The activation function applied at z(l+2) is ReLU, which has a minimum value of zero, so the output will be 0 at minimum. It's kind of an attempt at solving vanishing gradients. I'm really interested in whether it would still work well if they added small random numbers instead of a(l) - that is, not g(z(l+2)+a(l)) but g(z(l+2)+random()). This is the first question that came to my mind. I hope someone has investigated it; if you know of such a paper, please share it.
@yongwookim1 4 months ago
For learning identity
@MuhannadGhazal 4 years ago
What is weight decay? Can anyone please help? Thanks.
@baranaldemir5570 4 years ago
L2 regularization
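In Keras it usually shows up as an L2 penalty on a layer's weights, for example (the coefficient is arbitrary); some optimizers expose the same idea directly as a weight_decay argument.

from tensorflow.keras import layers, regularizers

conv = layers.Conv2D(64, 3, padding="same",
                     kernel_regularizer=regularizers.l2(1e-4))   # adds 1e-4 * sum(W**2) to the loss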
@paulcurry8383 4 years ago
I’m still left wondering, why is it good to learn the identity? A lot of videos I see just say “the identity is good to learn” but I don’t intuitively see why a model would want to learn that, and why the inability to learn the identity causes instability in deeper networks.
@MrBemnet1 4 years ago
If the network learns the identity, then at least adding additional layers will not decrease performance.
@frasergilbert2949 3 years ago
@@MrBemnet1 That makes sense. But when adding more layers, are the extra ReLU functions at the end the only difference, compared to having a shallower network?
@ruchirjain1163 3 years ago
Wow, my lecturer made such a mess of explaining why the layers just learn the identity mapping; this was much easier to understand.
@arpitaingermany 5 months ago
This video has a weird signal tone coming from it.