ResNet (actually) explained in under 10 minutes

117,713 views

rupert ai

Comments: 84
@nialperry9563 1 year ago
Cracking video, Rupert. Well animated and explained. I am already satisfied with my understanding of ResNets after this.
@AhmedThahir2002 14 days ago
This has to be the best explanation of ResNet ever. Amazing work, Rupert!
@Cypher195 1 year ago
Thanks. Been out of touch with AI for far too long so this summary is very helpful.
@rupert_ai 1 year ago
Thanks Aziz, good luck with getting back in touch with AI
@sarthakpatwari7988 1 year ago
Mark my words: if he becomes consistent, this channel will become one of the next big things in AI.
@prammar1951 4 months ago
Everyone is praising the video, and maybe it's just me, but I really didn't understand what the residual connection hopes to achieve, or how it achieves it. The video didn't make that clear to me.
@TheJDen 1 month ago
“Residuals” are what mathematicians call the difference between the actual and predicted data values. Imagine you had a simple dataset that looked linear but with some oscillating variation (like putting x + sin(3x) into a graphing calculator). One option would be to train a network directly on each x and y pair. In that case, the model would have to learn both the underlying linear trend (x) and the oscillation (sin(3x)). Alternatively, we could estimate the slope of the line (without the variations) and repeatedly feed the estimated height of the line at x back into the network while it trains on each (x, y) pair. This way, the model only has to learn the oscillation, that is, the difference between the data and the line: the residual (sin(3x)). It makes the model's job easier because it doesn't have to learn and keep track of the linear trend (x), since we remind it every few steps. In the more complex setting shown in the video, it means the network doesn't have to learn both how to maintain a good representation of a flower and how to increase the resolution, only how to increase the resolution (because it always has access to the original flower).
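A minimal PyTorch sketch of the idea in this comment (a hypothetical toy example, not code from the video): fit y = x + sin(3x) directly, versus adding the input x back onto the network's output so it only has to learn the residual sin(3x).

```python
# Toy illustration of residual learning (hypothetical example, not from the video):
# the target is y = x + sin(3x); the skip path supplies the linear trend x,
# so the network only has to model the residual sin(3x).
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 256).unsqueeze(1)   # inputs, shape (256, 1)
y = x + torch.sin(3 * x)                      # targets

def make_mlp():
    return nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

def train(net, use_skip):
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):
        pred = net(x) + (x if use_skip else 0)  # skip connection adds the input back on
        loss = nn.functional.mse_loss(pred, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("plain   :", train(make_mlp(), use_skip=False))  # must learn x + sin(3x) end to end
print("residual:", train(make_mlp(), use_skip=True))   # only has to learn sin(3x)
```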
@poopenfarten4222 1 year ago
legit one of the best explanations i found
@rupert_ai 1 year ago
Thanks myyy dude!
@sergioorozco7331 10 months ago
Is the right hand side of the addition supposed to have height and width dimension of 32x32 at 7:08? I think there is a small typo in the visual.
@logon2778 2 years ago
You say that the identity is added element-wise at the end of the block. So say I have an identity [1,2] and the result of the block is [3,4]; would the output of the layer be [4,6]? So it's not a concatenation with the identity, which would give [1,2,3,4], correct? You basically ensure the identity has the same dimensionality as the output of the block and then add them element-wise.
@rupert_ai 2 years ago
Hey Logon, great question. You are totally correct: the output from your example (identity [1,2] and block output [3,4]) would be [4,6], i.e. you simply add the values at matching positions. You don't concatenate! Yes, the last section on dimension matching covers the scenario where the dimensions don't match (and therefore you can't add them element-wise until you modify them).
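A two-line PyTorch check of the example above (illustrative only):

```python
# Element-wise addition (what ResNet does) vs concatenation, for the example above.
import torch

identity = torch.tensor([1.0, 2.0])
block_out = torch.tensor([3.0, 4.0])

print(identity + block_out)              # tensor([4., 6.])         <- element-wise addition
print(torch.cat([identity, block_out]))  # tensor([1., 2., 3., 4.]) <- NOT what ResNet does
```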
@logon2778 2 years ago
@@rupert_ai So in the case of the 1x1 convolutions where there are 3 input channels and 6 output channels of equal size... how are they added element-wise? Are the input features added element-wise twice, once for each group of 3 output channels? Or does it only add element-wise to the first 3 output channels and leave the other 3 untouched?
@rupert_ai 2 years ago
Hi @@logon2778, as is standard with convolutional neural networks, each 1x1 convolution takes contributions from all channels (in this case all 3 channels of the input). So in order to have 6 output channels you have 6 separate 1x1 convolutions, each taking contributions from all 3 channels. In order to halve the size you skip every other pixel (e.g. a stride of 2). That is simply what is used in the original paper; obviously other approaches work too. Now you have a 6-channel output with half the height and width, which matches the network dimensions, and you can do element-wise addition as usual. Have a watch of the video again and look up convolution basics - I have a video on this actually - hopefully that might shed some light on things: kzbin.info/www/bejne/bIezap5ojLJpoZI
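A quick shape check of the projection shortcut described above, as a sketch assuming PyTorch (the 3-to-6 channel sizes just mirror the example in this thread):

```python
# Shape check for the projection shortcut: 6 separate 1x1 kernels, each taking
# contributions from all 3 input channels, applied with a stride of 2.
import torch
import torch.nn as nn

identity = torch.randn(1, 3, 64, 64)  # batch of 1: 3 channels, 64x64
proj = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=1, stride=2, bias=False)

print(proj(identity).shape)  # torch.Size([1, 6, 32, 32]) -> matches the block output
print(proj.weight.shape)     # torch.Size([6, 3, 1, 1])   -> 6 kernels, each spanning 3 channels
```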
@logon2778 2 years ago
@@rupert_ai I understand how convolution works for the most part. At 8:45 you show that there are 6 output channels of equal size to the input. But how can you element-wise add 3 input channels to 6 output channels of equal size? In my mind you have double the dimensions: you have 6 output channels of 64x64, but 3 input channels of 64x64. So how can you element-wise multiply them?
@rupert_ai 2 years ago
@@logon2778 The section you mention discusses what must be done to the copy of the identity along the residual connection BEFORE you do element-wise addition with the output of the ResNet block. The process follows this logic:
1) Save a copy of your input as the identity (e.g. 3 channels, 64x64).
2) Run your input through the main block; this outputs a new tensor. This new tensor can have the same dimensions or different dimensions (e.g. 6 channels, 32x32). If it has different dimensions proceed to step 3); if it has the same dimensions proceed to step 4).
3) Take the copy of the identity from step 1) and apply 6 1x1 convolution kernels with stride 2 to it; this outputs 6 channels at 32x32.
4) Do element-wise addition of your identity and your ResNet block output. Note that if the dimensions changed, then you also changed your identity in step 3) to ensure you can do element-wise addition.
Element-wise addition is simply adding each corresponding value to its counterpart, e.g. the value in the top left corner of channel 2 of the first tensor is added to the value in the top left corner of channel 2 of the second tensor. You don't do element-wise multiplication as you mention. Hope that clears it up!
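A sketch of steps 1-4 as code, assuming PyTorch; it loosely follows a ResNet basic block, and the 3-to-6 channel / 64x64-to-32x32 sizes are taken from the example in this thread rather than the paper's exact layer sizes.

```python
# Sketch of steps 1-4: a residual block whose main path changes the dimensions,
# with a 1x1 stride-2 projection applied to the identity so the shapes match.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Step 2: the main block (two 3x3 convolutions with batch norm).
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Step 3: only if the dimensions change, project the identity with
        # 1x1 convolutions (stride 2 halves the height and width).
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        identity = self.shortcut(x)               # steps 1 and 3
        out = self.relu(self.bn1(self.conv1(x)))  # step 2
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)          # step 4: element-wise addition

block = ResidualBlock(in_ch=3, out_ch=6, stride=2)
print(block(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 6, 32, 32])
```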
@agenticmark 10 months ago
lol, I have fought that exact trendline so many times in ML :D Great humor. Great video work.
@samruddhisaoji7195 2 months ago
9:02 I have a doubt: how do the number of features on the LHS and RHS match? LHS = w*h*c. RHS = (w/2)*(h/2)*(2*c). Thus RHS = 2*LHS.
@Bryanvas25 2 months ago
Actually RHS = (1/2) * LHS, and yes, I also don't understand that part.
@samruddhisaoji7195 2 months ago
@@Bryanvas25 Yes, you're right that RHS = LHS/2. My bad!
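A quick arithmetic check of the correction in this thread, using the 64x64x3 example:

```python
# Arithmetic for the 64x64x3 example: halving height and width while doubling
# the channels halves the total number of values.
w, h, c = 64, 64, 3
lhs = w * h * c                      # 12288 values before the block
rhs = (w // 2) * (h // 2) * (2 * c)  # 6144 values after the block
print(lhs, rhs, rhs == lhs // 2)     # 12288 6144 True
```

The element-wise addition still works because the identity itself is projected to the same (w/2)x(h/2)x(2c) shape, as described in the earlier replies.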
@Omsip123 1 year ago
I pushed it to exactly 1k likes, cause it deserves it ... and many more
@heathernapthine8775 1 month ago
Is the zero padding only done for layers which increase the size, or is it done for downsampling layers too? Intuitively, if we zero-padded the output in order to add a larger input, this doesn't seem like a downsampled layer.
@egesener1932 2 years ago
Everyone says ResNet solves the vanishing/exploding gradient problem, but don't we already use the ReLU function instead of sigmoid to solve that? Also, section 4.1 of the paper says the plain counterpart with batch normalization doesn't cause vanishing gradients, yet it still has a higher error rate when the layers are increased from 18 to 34. Can you explain that?
@rupert_ai 1 year ago
1) There are multiple things that help solve the vanishing/exploding gradient problem; residual connections in general help massively with the learning process, as they ground the learning process around the desired result, e.g. you learn the difference between what you have and the correct result (the residual). 2) Batch normalisation also helps with the vanishing/exploding gradient problem, as it allows the features of each layer to have a normalised distribution that is scaled so it won't explode/vanish, etc. 3) On your point about 4.1: they are saying that networks without residual connections (plain) have worse error when they have more layers (18 vs 34) for the exact reason I stated in part 1) of this answer; it is a difficult optimisation problem for the network to solve without the residual, and when you add residuals you aren't penalised for adding more layers to your network. Hope that makes sense!
@firefistace8569 1 year ago
What is the residual in the image classification task?
@rupert_ai 1 year ago
Good question! It can be tricky to understand what the residual might be in the image classification task, as it is more abstract than in the super-resolution task. Essentially, you use the feature maps from previous layers and learn the 'residual' between the previous layers and the current layer; in essence this makes a very powerful block of computation that is grounded by the skip connections. This makes image classification easier, as the network can process the image in a more comprehensive way. There really isn't an 'end-to-end' residual in image classification like there is with super resolution. I hope that answers your question!
@firefistace8569 1 year ago
@@rupert_ai Thanks!
@TheBlendedTech 2 years ago
Thank you, this was well put together and very useful.
@rupert_ai 2 years ago
Thanks!
@devanshsharma5159 1 year ago
love the animation! Thanks for the clean and clear explanation!
@ciciy-wm5ik 4 months ago
At 2:09, image1 - image2 = image3 does not imply image1 + image3 = image2.
@gunasekhar8440 3 months ago
I mean, we need to assume it like that. In the paper they let h(x) be our desired mapping, x the input, and f(x) some transformation, so f(x) = h(x) - x.
@ShahidulAbir 1 year ago
Amazing explanation. Thank you for the video
@rupert_ai 1 year ago
Thank you Shahidul!
@xagent6327 1 month ago
The solution to pad with zeros fixed the number of channels, but how did they then reduce the dimensions from 64x64 to 32x32?
@mohamed_akram1 1 year ago
Nice video. Did you use Manim?
@rupert_ai 1 year ago
Hey Mohamed! Yes I did - my first video using manim! I hope to use it for some more complex things in the future :)
@louisdante8457 4 months ago
7:53 Why is there a need to preserve the time complexity per layer?
@samruddhisaoji7195 2 months ago
The number of elements in the input and output of a convolution layer should remain the same, as later we will be performing an element-wise operation.
@wege8409 6 months ago
6:38 this is the part that really made me understand, thank you
@januarchristie615 1 year ago
Hello, I apologize for my question, but I still don't quite understand why learning residuals improves model predictions. Thank you.
@giovannyencinia9239 1 year ago
I think that is because this architecture can apply the identity function. First you have an input a^[l], and this passes forward through the convolutions, batch normalization, activation function, etc., and finally there is an output z^[l+2] (this output of the hidden layers has some parameters theta). Here is where the architecture adds a^[l], i.e. ReLU(z^[l+2] + a^[l]). Then in the backpropagation step there is the possibility that the optimal parameters producing z^[l+2] are 0, so the result is just a^[l] (because you apply a ReLU activation function), and this means the intermediate layers won't be used. If you build a big, deep NN, this architecture can skip the layers (residual blocks) that do not help reach the local optimum.
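A tiny numeric check of the point above (a sketch, assuming PyTorch): if the block's weights were driven to zero, ReLU(z^[l+2] + a^[l]) collapses back to a^[l], because a^[l] is already non-negative after the earlier ReLU.

```python
# If the block's weights were optimised to zero, its output z is zero, and
# ReLU(z + a) returns the incoming activation a unchanged (a is non-negative).
import torch

a = torch.relu(torch.randn(5))             # activations from the previous layer
z = torch.zeros_like(a)                    # block output with all-zero weights
print(torch.equal(torch.relu(z + a), a))   # True: the block acts as the identity
```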
@panjak323 1 year ago
Idk why, but simply adding a bicubically upscaled image to the output of a CNN with a pixel-shuffle layer achieves much better results than having any number of residual blocks. Also it's much faster.
@謝其宏-p3z 9 months ago
It's amazing. Both ResNet and this explanation.
@RadenRenggala 1 year ago
Hello, is the term "residual" referring to the convolutional feature maps from the previous layer that are then added to the feature maps output by the current layer?
@rupert_ai 1 year ago
The residual is actually the 'difference' between two features! In ResNets the feature maps from previous layers are added onto the current feature maps; this means the current layer can learn the 'residual' function, where it only needs to learn the difference.
@RadenRenggala 1 year ago
@@rupert_ai So, the residual is the difference between the current feature map and the previous feature map, and to obtain the residual, we need to perform an addition between those feature maps? Thank you.
@djauschan 10 months ago
Amazing explanation of this concept. Thank you very much
@datascience8775 2 years ago
Good content, just subscribed, keep sharing.
@rupert_ai 2 years ago
Thanks, will do :)
@doudouban 1 year ago
2:06, the equation shift seems problematic.
@ColorfullHD 1 year ago
Hey, it's 3blue1brown! All jokes aside, great explanation, cheers.
@rupert_ai 1 year ago
Hahaha well it is using his animation library ;) all hail grant sanderson
@dapr98 1 year ago
Great video! Thanks. Would you recommend ResNet over CNN for music classification?
@nxtboyIII 1 year ago
Great video well explained thanks!
@nxtboyIII 1 year ago
I liked the visuals too
@rupert_ai 1 year ago
@@nxtboyIII Thank you Lucas 🙏
@christianondo9637 10 months ago
great video, super intuitive explanation
@swedenontwowheels 1 year ago
Great content! Thank you for the effort!
@rupert_ai 1 year ago
Thanks Terence! :)
@MuhammadHamza-o3r 4 months ago
Very well explained
@the_random_noob9860 9 months ago
Lifesaver! Also, for classification it's inevitable that the dimensions go down and the channels go up across the network. But the 1x1 convolution on the input features to 'match the dimensions' kind of loses the original purpose, i.e. to retain/boost the original signal. In a sense it's another conv operation whose output is no longer the same as the input (it could be similar, but certainly not as similar as the input features themselves). The original idea was to have the same input features so that we could zero out the weights if no transformation is needed. At least they're not as different as how the input features are transformed across the usual conv block (conv, pooling, batch norm and activation). Let me know if I am missing anything.
@SakshamGupta-em2zw 6 months ago
Love the Music
@SakshamGupta-em2zw 6 months ago
And love that you used manim, keep it up
@krishnashah6654 10 months ago
i'd just say thank you so much man!
@rezajavadzadeh5597 1 year ago
thank you so much
@rupert_ai 1 year ago
Thanks Reza!
@jamesnorton4953 2 years ago
🔥
@enzogurijala5464 2 years ago
great video
@JoydurnYup 2 years ago
great vid sir
@rupert_ai 2 years ago
Thanks Joydurn! :)
@moosemorse1 1 year ago
Subscribed. Thank you so much
@tanmayvaity9437 2 years ago
nice video
@rupert_ai 2 years ago
Thanks Tanmay!
@carolinavillamizar795 1 year ago
Thanks!!
@gusromul3356 8 months ago
cool info, thanks rupert ai
@BABA-oi2cl 9 months ago
Thanks a lot ❤
@lifeisbeautifu1 9 months ago
that was good!
@cocgamingstar6990 1 year ago
Very bad
@rupert_ai 1 year ago
Feel free to leave some constructive feedback :) Or did you mean to write badass? If so, thanks!