What a teacher he is... watching his video is equivalent to reading 10 articles and watching 100 videos.
@piyushkumar-wg8cv 1 year ago
You are high on ML.
@billykotsos4642 4 years ago
Finally it clicks in my head. Thanks Andrew !!!
@trexmidnite 3 years ago
Your brain clicks?
@muzeroj173 3 years ago
Watched it 3 times during the past 2 years; each time I learned something new!
@ShubhamKumar-me7xy 2 years ago
Or each time you didn't listen to him carefully?
@iammakimadog 3 years ago
The residual block guarantees that your deep NN performs at least as well as a shallow one, so there's no reason to train a shallow NN rather than a deep NN, because theoretically the deeper NN outperforms the shallower one.
@firstpenguin5653 3 years ago
Thanks! So this is why ResNet works!
@razorphone77 5 years ago
I don't really understand what he means when he says the identity function is easy for the residual block to learn. It hasn't really learnt anything if all we do is append the initial input to the end. Given that we're saying the conv blocks are effectively superfluous because the weights are close to zero, I can't see what's gained in the whole process. We just appear to have extra calculation for the sake of it, when we already have the output of layer a[l].
@SuperVaio123 5 years ago
Basically the baseline here is that you're hoping to improve performance. In the worst-case scenario the deeper layers don't learn anything, yet your performance doesn't take a hit, thanks to the skip connections. But in most cases these layers will learn something too, which can only help improve performance. So yes, although there are a lot of extra calculations, you might get better performance. Again, it depends on the application and the trade-offs.
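A minimal NumPy sketch of that worst case (the shapes and numbers are just made up for illustration): if the block's weights and biases end up at zero, the skip connection hands the input straight through, so the output is exactly a[l].

import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(a_l, W1, b1, W2, b2):
    a_next = relu(W1 @ a_l + b1)          # first layer of the main path
    z_l2 = W2 @ a_next + b2               # second layer, before the activation
    return relu(z_l2 + a_l)               # skip connection: add a[l] before the final ReLU

a_l = np.array([0.5, 1.2, 0.3])           # a ReLU output, so non-negative
W1 = np.zeros((3, 3)); b1 = np.zeros(3)   # the "learned nothing" case
W2 = np.zeros((3, 3)); b2 = np.zeros(3)
print(residual_block(a_l, W1, b1, W2, b2))  # [0.5 1.2 0.3] -- identical to a_l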
@X_platform 7 years ago
But how do we know how deep we should skip to? For example, how do we know whether the 10th layer will or will not improve from the input of the 4th layer?
@joeycarson5510 6 years ago
It's my understanding that it may still be somewhat dependent on the problem. The skip connections are essentially restoring the identity of the input from the first layer of the block, thus keeping the block output similar to the input. The feature space you are learning in those intermediate layers of the residual block is something you may need to consider for your individual problem, in terms of there being too much or too little parameter space; this also depends on the quantity and variability of your data. In general, ResNets are useful because, as layers are stacked, the solution space grows hugely. Keeping the solution space somewhere around the input constrains it so that it doesn't grow out of control. Two or three intermediate layers is usually enough for the block to learn a reasonable amount, but you may want to consider the width of those intermediate layers as well. As for why you may not want to stack 10 layers inside a residual block, consider the reason we use residual blocks in the first place: stacking too many layers balloons the solution space, so SGD will try all sorts of solutions and it will be difficult to converge on a reasonable one. Thus we usually want to keep the blocks small, especially inside the residual block, since residual blocks are the individual building blocks of the whole network.
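To make that concrete, here is roughly what a small two-conv residual block looks like in Keras (filter counts and input size are made up; this is the "same convolution" case where input and output dimensions match, so the shortcut can be added directly):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                                     # identity path
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)                 # no activation yet
    y = layers.Add()([y, shortcut])                                  # skip connection
    return layers.ReLU()(y)                                          # final ReLU after the add

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)   # 64 filters so the shapes match for the add
model = tf.keras.Model(inputs, outputs)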
@RehanAsif 5 years ago
By empirical analysis
@zxynj 2 years ago
We don't, but it doesn't hurt to save our game too often
@bowbowzai3757 11 months ago
I have a question: if the result of the second network, with two extra layers and a skip connection, is the same as the first network without them (because a[l+2] is likely to become a[l]), then why do we need to add the extra layers just to make the network deeper? Or, like Andrew said, maybe we will be lucky and the extra layers learn something while not hurting performance? Or is the case in the video where a[l+2] equals a[l] an edge case, and usually the extra layers can still learn more while retaining the original performance?
@ahmadsaeedkhattak20 1 year ago
Andrew Ng is a true technologist, soo involved in his lectures that he almost started Kung Fu art @8:47 when it sounded like Kung kung kung fu, kung kung kung fu ... 😆😆😆
@RH-mk3rp 2 years ago
So if the input reaches a skip connection, does it take both routes, or does it always take the skip connection? If it's the latter, then what's the point of even including all those skipped layers?
@robingutsche1117 3 years ago
In case we learn something useful in g(W[l+2] a[l+1] + b[l+2]), isn't it possible that adding the activations of the previous layer, a[l], can actually decrease performance? So in that case a plain network would do a better job?
@zxynj 2 years ago
I guess if the performance would be worse, then W and b will go to 0 and nothing is learned. a[l] is preserved through the "game progress saving" trick, so a[l+2] is at least as good as a[l].
@도정찬 3 years ago
I love this video! Thanks, Professor Andrew!!
@sandipansarkar9211 4 years ago
Very good explanation. Need to watch it again.
@43SunSon 2 years ago
Question: let's assume the identity function really is learned, so a[l+2] = a[l]. Then what? I feel like we are doing f(x) + 0 = f(x), so what's the point of "adding nothing"? Since I am not following here, I can't tell why residual networks are good for training deeper NNs.
@kartikeyakhare5089 1 year ago
The residual block ensures that our layer at least learns the output from the previous layer, so the performance doesn't get worse. This is helpful because the plain networks often struggle to learn even the identity mapping with increased depth, leading to worse performance.
@hackercop 3 years ago
Thanks, Andrew! Now I understand it.
@anasputhawala6390 2 years ago
I have a question: you mention that the W matrix and b MAY decay IF we use weight decay. Isn't that a big IF, though? Is weight decay a part of residual networks / skip connections? In most cases W and b will not decay to 0, so how are residual networks / skip connections useful in those cases?
@heejuneAhn 2 years ago
So is L2 regularization kind of mandatory, then?
@rm175 2 years ago
Just amazing. So clear.
@anirudhgangadhar6158 3 years ago
"Residual networks can easily learn the identity function" - but isn't this true only when the weights and biases are 0? In a real situation, why would this happen? It's not making sense to me why you would add skip connections and have the learned weights go to 0. If someone could please clarify this, I would be extremely grateful.
@1233-f7h 2 years ago
It essentially means that the residual layer can easily learn the identity function over its input by setting the weights to zero. This leads to the layer giving an output that is at least NOT WORSE than the output of the previous layer. On the other hand, plain networks may struggle to learn the identity mapping, which can lead to worse performance with increasing layers.
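A tiny NumPy illustration of that contrast, with made-up near-zero weights and zero biases: the plain two-layer path nearly wipes the signal out, while the residual version stays close to its input, which is why extra depth can't easily do worse than the identity.

import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(0, x)

a = np.array([0.5, 1.2, 0.3])                 # activation a[l] from the previous layer
W1 = 0.01 * rng.standard_normal((3, 3))       # near-zero weights, biases omitted (zero)
W2 = 0.01 * rng.standard_normal((3, 3))

plain = relu(W2 @ relu(W1 @ a))               # plain layers: output is close to all zeros
resid = relu(W2 @ relu(W1 @ a) + a)           # same layers plus the skip: output is close to a[l]
print(plain, resid)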
@derekthompson2301 2 years ago
Hi, did you figure it out? I'm stuck on it now :(
@ahmedb2559 2 years ago
Thank you!
@patrickyu8470 2 years ago
(copied from the previous video in the series) Just a question for those out there - has anyone been able to use techniques from ResNets to improve the convergence speed of deep fully connected networks? Usually people use skip connections in the context of convolutional neural nets, but I haven't seen much gain in performance with fully connected ResNets, so I'm just wondering if there's something I may be missing.
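For anyone who wants to try it, this is one way to wire a skip connection around Dense layers in Keras (the layer sizes are arbitrary; whether it actually speeds up convergence on your data is exactly the empirical question above):

import tensorflow as tf
from tensorflow.keras import layers

def dense_residual_block(x, units):
    shortcut = x
    y = layers.Dense(units, activation="relu")(x)
    y = layers.Dense(units)(y)                       # linear; the activation comes after the add
    return layers.ReLU()(layers.Add()([y, shortcut]))

inputs = tf.keras.Input(shape=(256,))
h = layers.Dense(256, activation="relu")(inputs)     # project to a fixed width first
for _ in range(4):                                   # stack a few residual blocks
    h = dense_residual_block(h, 256)
outputs = layers.Dense(10, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)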
@baranaldemir5570 4 years ago
Can someone please correct me if I'm wrong? As far as I understand, if L2 regularization (weight decay) causes z[l+2] to become 0, ReLU just carries a[l] to the next layer. Otherwise, it learns from both z[l+2] and a[l]. So it bypasses the vanishing gradient problem but increases the exploding gradient problem. Am I right?
@АннаКопатько 3 years ago
I also have this question
@karatuno 3 years ago
same question
@derekthompson2301 2 years ago
same here
@elgs1980 4 years ago
If a layer is meant to be skipped, why was it there in the first place?
@mufaddalkanpurwala462 4 years ago
If the residual block has not learnt anything useful, regularisation will help negate the effect of that layer and let the previous activations pass through, thereby not sacrificing the performance of the layer. If the residual block has learnt something useful, then even after regularisation the learnt knowledge is kept, and the activations from the previous layer are added on top, again without sacrificing performance. So it lets you keep deep layers that are free to learn, or not learn, new information.
@derekthompson2301 2 years ago
@@mufaddalkanpurwala462 Hi, thanks for your explanation. There are some points I'm still not clear on:
- L2 regularisation makes W close to 0 but not exactly 0. Moreover, W is a matrix, so it's very unlikely for all of its elements to be 0. So how is the layer skipped?
- Why would we want to add the activations from the previous layer to the knowledge that was learned? Why doesn't adding them sacrifice the performance of the layer?
Hope you can help me with this, thanks a lot!
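Here's how I currently picture the weight-decay part, in case I've got it wrong (purely made-up numbers, and I'm pretending the data gradient is zero): the L2 term only shrinks W a little each step, so it gets small but never exactly zero, and then z[l+2] is just small enough that the a[l] term dominates the sum.

import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 3))
lam, lr = 0.1, 0.1                        # L2 strength and learning rate, illustrative only

for step in range(500):
    grad_from_data = np.zeros_like(W)     # pretend the data gradient is negligible here
    W -= lr * (grad_from_data + lam * W)  # the lam * W term is the weight decay

print(np.abs(W).max())                    # small, but not exactly zero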
@Ashokkumar-ds1nq 4 years ago
But we can also set w and b to 1 so that a[l+1] = a[l] and a[l+2] = a[l+1]. By doing so, we can get the identity function without ResNets, can't we?
@5paceb0i 4 years ago
@Sunny kumar you can't explicitly set w and b to 1; they are set by the gradient descent algorithm. If you are confused about how w can become 0: that is possible by applying L1 regularisation (read about this).
@mohammedalsubaie3512 2 years ago
Thank you very much, Andrew. Could anyone please explain what "3x3 conv" means? I would really appreciate that.
@mohammedalsubaie3512 2 years ago
do you mean 3x3 filters?
@6884 3 years ago
Am I the only one who thought that the pointer at 0:55 was actually a bug on their screen?
@heejuneAhn 2 years ago
I still can't get the intuition for why the skip connection works better. It still seems experimental to me. ^^;
@ati43888 9 months ago
Thanks
@jorjiang1 5 years ago
So does that mean ResNet models must be trained with a certain degree of weight decay for the skip connections to make sense? Otherwise is it just equivalent to a plain network?
@vinayakpevekar 6 years ago
Can anybody tell me what an identity function is?
@ajaysubramanian7026 6 years ago
g(x) = x (Same as linear function)
@永田善也-r2l 6 years ago
It's a function that outputs exactly the same value as its input, like y = x. For example, in the ReLU function, if the input x > 0 then the output y = x, so in that region ReLU acts as an identity function.
@mohammedsamir9833 5 years ago
y=x;
@shuyuwang4867 4 years ago
Why does the number of filters double after pooling is applied? Any suggestions?
@snippletrap 4 years ago
The dimension of the image is reduced. Pooling allows the network to learn more features over a larger window of the image, at the cost of lower resolution.
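A typical pattern looks something like this in Keras (the filter counts are just an example): halving the spatial grid while doubling the filters keeps the amount of computation per layer roughly balanced.

from tensorflow.keras import layers, Input, Model

x = Input(shape=(56, 56, 64))
y = layers.Conv2D(64, 3, padding="same", activation="relu")(x)    # 56 x 56 x 64
y = layers.MaxPooling2D(2)(y)                                     # 28 x 28 x 64: half the resolution
y = layers.Conv2D(128, 3, padding="same", activation="relu")(y)   # 28 x 28 x 128: double the filters
model = Model(x, y)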
@shuyuwang4867 4 years ago
@@snippletrap Thank you, very good explanation.
@shashankcharyavusali5914 6 years ago
Doesn't the performance get affected if z[l+2] is negative?
@giofou711 6 years ago
Yes. If g(.) is ReLU: a[l+2] = g(z[l+2] + a[l]) = z[l+2] + a[l] if z[l+2] > -a[l] else 0. Since a[l] is always non-negative, if z[l+2] gets a negative value whose magnitude is larger than a[l], it results in a[l+2] being 0.
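A quick numeric check of both cases (numbers made up):

import numpy as np

relu = lambda x: np.maximum(0, x)
a_l = np.array([0.4, 0.2])

print(relu(np.array([-0.1, -0.1]) + a_l))   # slightly negative z[l+2]: output stays close to a[l]
print(relu(np.array([-1.0, -1.0]) + a_l))   # strongly negative z[l+2]: output is clipped to 0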
5 years ago
The activation function applied at z(l+2) is ReLU, which has a minimum value of zero, so the output will be 0 at minimum. It's kind of an attempt at solving vanishing gradients. I'm really interested in whether it would still work well if they added small random numbers instead of a(l) - that is, not g(z(l+2)+a(l)) but g(z(l+2)+random()). This is the first question that came to my mind. I hope someone has investigated it; if you know of such a paper, please share it.
@yongwookim1 4 months ago
For learning identity
@MuhannadGhazal 4 years ago
What is weight decay? Can anyone please help? Thanks.
@baranaldemir5570 4 years ago
L2 regularization
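In Keras it usually shows up as an L2 penalty on a layer's weights, for example (the coefficient is arbitrary); some optimizers expose the same idea directly as a weight_decay argument.

from tensorflow.keras import layers, regularizers

conv = layers.Conv2D(64, 3, padding="same",
                     kernel_regularizer=regularizers.l2(1e-4))   # adds 1e-4 * sum(W**2) to the loss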
@paulcurry8383 4 years ago
I’m still left wondering, why is it good to learn the identity? A lot of videos I see just say “the identity is good to learn” but I don’t intuitively see why a model would want to learn that, and why the inability to learn the identity causes instability in deeper networks.
@MrBemnet1 4 years ago
If the network learns the identity, then at least adding additional layers will not decrease performance.
@frasergilbert2949 3 years ago
@@MrBemnet1 That makes sense. But when adding more layers, are the extra ReLU functions at the end the only difference, compared to having a shallower network?
@ruchirjain1163 3 years ago
Wow, my lecturer made such a mess of explaining why the layers just learn the identity mapping; this was much easier to understand.
@arpitaingermany 5 months ago
This video has a weird signal tone coming from it.