You cannot put the second ReLU inside the skip block, because then you could not generate a residual (delta, difference) of the input: ReLU only outputs non-negative values. What I don't understand is why a tanh function is not used instead of the first ReLU. That would seem a more efficient way to get to a meaningful residual in the first stage.
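Rough sketch of the placement I mean (a toy, fully connected PyTorch version with made-up shapes, not the actual conv block):

```python
import torch
import torch.nn.functional as F

def residual_block(x, w1, w2):
    h = F.relu(x @ w1)        # first ReLU: inside the residual branch
    delta = h @ w2            # learned residual, can be positive or negative
    return F.relu(x + delta)  # second ReLU: applied after the addition

# If the second ReLU were applied to `delta` before the addition,
# the branch could only ever add non-negative corrections to x.
x = torch.randn(2, 8)
w1, w2 = torch.randn(8, 8) * 0.1, torch.randn(8, 8) * 0.1
print(residual_block(x, w1, w2).shape)  # torch.Size([2, 8])
```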
@prateekpatel6082 9 months ago
Would love to see some more depth in the teaching. "Two activations in a row" seems too vague; why is that an issue?
@imanmokwena1593 1 year ago
My intuition may be completely incorrect here, but don't we also risk exploding gradients if our activations don't output zero? Wouldn't that lead to some of the signals being redundant?
@davefaulkner6302 11 months ago
Exploding gradients are avoided in ResNet architectures because the skip connection provides a unit-gradient path end-to-end that competes with the gradient multipliers of the convolution paths. This creates a forcing function that stabilizes the gradient descent of the entire model.
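A toy autograd check of that unit-gradient path (here the conv branch is stood in for by a simple elementwise F(x) = w * relu(x); the names and shapes are made up):

```python
import torch

x = torch.randn(5, requires_grad=True)
w = torch.randn(5)

# Residual block: y = x + F(x), with F(x) = w * relu(x) standing in for the conv branch
y = x + w * torch.relu(x)
y.sum().backward()

# dy/dx = 1 + w * relu'(x): the leading 1 is the unit gradient path through the skip connection
expected = 1 + w * (x > 0).float()
print(torch.allclose(x.grad, expected))  # True
```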
@hoaxuan7074 3 years ago
ReLU blocks information from passing through about 50% of the time, destroying input information before it can be used and destroying information being generated for the output before it can arrive there. Hence the need to allow alternative paths for information to percolate, as you see with ResNet. If you used a two-sided parametric ReLU, then the net should be able to organize information pathways itself. Leaky ReLU is somewhat of an alternative, I suppose.
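To illustrate the difference on the negative side (a small PyTorch comparison; the slope values are just examples):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)
prelu = nn.PReLU(init=0.25)  # slope for negative inputs is learnable

print(relu(x))   # tensor([0.0000, 0.0000, 0.0000, 1.5000]): negatives blocked entirely
print(leaky(x))  # negatives pass through scaled by 0.01
print(prelu(x))  # negatives pass through scaled by the learned parameter
```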
@SebastianRaschka 3 years ago
Yes. But even with alternatives like leaky ReLU you can have small gradients, and if you have many of them the effect compounds in the chain rule during backpropagation.
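A toy numeric illustration (assuming, say, 20 layers that each contribute a small factor of 0.1):

```python
# The chain rule multiplies the per-layer factors together
factors = [0.1] * 20
grad = 1.0
for f in factors:
    grad *= f
print(grad)  # ~1e-20: the upstream gradient has effectively vanished
```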
@hoaxuan7074 3 years ago
@@SebastianRaschka Yeah, that's true. I never have that problem because I always train nets with a simple evolution algorithm; I forget that other people use BP exclusively. My observation is that evolution always smoothly reduces error over time without being temporarily trapped in local minima or even slowed by saddle points. Which to me suggests there is a simple learning mode involving adjustments to the statistical responses of the neurons. Maybe that can be proved some day 🍸
@tilkesh 5 months ago
Thank you
@davefaulkner6302 11 months ago
For all the scribbling going on, there is a simpler way of saying it: ResNet blocks learn a residual, or delta, set of activation values with respect to the input. Within each ResNet block, rather than learning a whole new transformation from input to output, the network only learns a function that is the delta (residual, difference) from the input. This seems like less to learn, because the block can always fall back on the identity function (i.e., no learning), and it has the side effect of stabilizing gradients during optimization of the model. A minimal sketch of that idea is below.
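A minimal PyTorch-style sketch (batch norm and downsampling omitted for brevity; the channel count is arbitrary):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The block only learns the delta F(x); its output is relu(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.branch(x))

block = ResidualBlock(16)
# Zero the branch: the block then reduces to (the ReLU of) the identity,
# i.e. "learn nothing" is always an available solution.
for p in block.branch.parameters():
    nn.init.zeros_(p)
x = torch.rand(1, 16, 8, 8)
print(torch.allclose(block(x), torch.relu(x)))  # True
```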