Why do Convolutional Neural Networks work so well?

44,944 views

Algorithmic Simplicity


Comments: 99
@algorithmicsimplicity 1 year ago
Transformer video coming next! I'm still getting the hang of animating, but the transformer video probably won't take as long to make as this one. I haven't decided what I will do after that, so if you have any suggestions/requests for computer science, mathematics or physics topics let me know.
@bassemmansour3163 1 year ago
What program are you using for animation? Thanks!
@algorithmicsimplicity 1 year ago
I'm using the python package manim: github.com/ManimCommunity/manim
@davidmurphy563 1 year ago
I'd say probably RNNs would flow nicely from this [excellent] video. GANs too I guess. Autoencoders for sure. Oh, LSTMs, the memory problem is a fascinating one. Oh and Deep Q-Networks. Meh, the field is so broad you can't help but hit. I'd say RNNs first as going from images to text seems a natural progression.
@wissemrouin4814 1 year ago
@@davidmurphy563 Yes please, I guess RNNs need to be presented even before transformers.
@davidmurphy563 1 year ago
@@wissemrouin4814 Yeah, I would agree with you there. RNNs serve as a good introduction to a lot of the approaches you'll see for sequence-to-vector problems, and their drawbacks explain the development of transformers. I'd suggest RNNs, then LSTMs, then transformers. That said, this channel has done sterling work explaining everything so far, so I'm sure he'll do a great job even if he dives straight into the deep end.
@dradic9452 1 year ago
Please make more videos. I've been watching countless neural networks videos and until I saw your two videos I was still lost. You explained it so clearly and concisely. I hope you make more videos.
@algorithmicsimplicity 1 year ago
Thanks for the comment, it's great to hear you found the videos useful. I was unexpectedly busy with my job the past few months, but rest assured I am still working on the transformer video.
@ozachar 1 year ago
As a physicist, I recognize this process as "real space renormalization group" procedure in statistical mechanics. So each layer is equivalent to a renormalization step (a coarse graining). The renormalization flows are then the gradual flow towards a resolution decision of the neural net. It makes the whole "magic" very clear conceptually, and also automatically points the way for less trivial renormalization procedures known in theoretical physics (not just simple real space coarse graining). The clarity of videos like yours is so stimulating! Thanks
@warpdrive9229 10 months ago
Bingo!
@joshlevine4221 1 month ago
3:02 _Strictly_ speaking, there are only a finite number of images for any given image size and pixel depth, so each one can be uniquely described by a single number (and it is even an integer!). These "image numbers" cover a very, very, very wide and sparsely-filled range, but the "image number" still only has a single dimension. Thank you for the great video!
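To make the "image number" idea concrete, here is a small illustration (my own sketch; the tiny 2x2 image and the NumPy usage are just for demonstration, not anything from the video):

```python
import numpy as np

# A tiny "image": 2x2 pixels, 3 colour channels, one byte each.
image = np.random.randint(0, 256, size=(2, 2, 3), dtype=np.uint8)

# Number of distinct images of this size: 256^(2*2*3) = 2^96.
print(f"{256 ** image.size:,} possible 2x2 RGB images")

# Read the raw bytes as one integer: a unique "image number".
image_number = int.from_bytes(image.tobytes(), byteorder="big")

# The mapping is reversible, so the single integer really does
# describe the image uniquely (one dimension, enormous range).
recovered = np.frombuffer(
    image_number.to_bytes(image.size, byteorder="big"), dtype=np.uint8
).reshape(image.shape)
assert (recovered == image).all()
```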
@IllIl 1 year ago
Dude, your teaching style is absolutely superb! Thank you so much for these. This surpasses any of the explanations I've come across in online courses. Please make more! The way you demystify these concepts is just in a league of its own!
@Number_Cruncher 1 year ago
This was a very cool twist at the end with the rearranged pixels. Thanks for this nice experiment.
@nananou1687 4 months ago
This is genuinely one of the best videos I have ever seen, no matter the type of content. You have somehow taken one of the most complicated topics and simply distilled it to this. Brilliant!
@rohithpokala 9 months ago
Bro, you are a real superman. This video gave so many deep insights in just 15 minutes, providing such a strong foundation. I can confidently say this video single-handedly outclassed thousands of neural network videos on the internet. You raised the bar so high for others to compete. Thanks.
@benjamindilorenzo 6 months ago
The best video on CNNs. Please make a video about V-JEPA, the SSL architecture proposed by Yann LeCun. It would also be nice to have a deeper look at diffusion transformers, or diffusion in general. Really, really good work man!
@khoakirokun217 4 months ago
I love that you point out that we have "super human capability" only because we are pre-trained with assumptions about spatial information :D TLDR: "we cheated" :D
@thomassynths 10 months ago
This is by far the best explanation of CNNs I have ever come across. The motivational examples and the presentation are superb.
@illeto 4 months ago
Fantastic videos. Here before you inevitably hit 100k subscribers.
@j.j.maverick9252 1 year ago
Another superb summary and visualisation, thank you!
@connorgoosen2468 1 year ago
How has the YouTube algorithm not suggested you sooner? This is such a great video; just subscribed and keen to see how the channel explodes!
@bassemmansour3163 1 year ago
Best illustrations on the subject. Thank you for your work!
@BenjaminDorra 1 month ago
Thank you for this fascinating video! It is a very original angle on the effectiveness of CNNs. I have never seen this approach; most articles and videos focus on the reduction in parameters and computation compared with the base MLP, or on image compression. Interestingly, you don't talk about pooling, a staple of CNN architectures. Arguably it is mostly for computational efficiency, but I have seen a bit of debate on the subject (max pooling being especially polarizing).
@algorithmicsimplicity 1 month ago
My goal in this video is to explain why CNNs generalize better than other architectures. It is true that CNNs are more computationally efficient than MLPs, but there are other ways to improve the efficiency of MLPs. In particular, in this video the "deep neural network" that I am comparing against is not an MLP but an MLP-Mixer. This MLP-Mixer is just as parameter- and compute-efficient as the CNN (using an almost identical architecture); the only difference between them is that in the CNN each neuron sees a 3x3 patch, while in the MLP-Mixer each neuron sees information from the entire image. This difference, and this difference alone, results in the ~20 percentage point accuracy increase.

Max-pooling has generally been used to improve efficiency. Sometimes max-pooling can improve accuracy, but only by about 1-2%; in other cases it can actually reduce accuracy. The main reason to use it is just to reduce computation. Because of this I don't consider max-pooling to be fundamental to the success of CNNs: you can build CNNs without max-pooling and they work fine.
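To make the contrast concrete, here is a minimal sketch of the two kinds of layer (an illustration in PyTorch under my own assumptions; it is not the actual code used for the video's experiments): a conv layer whose outputs each see a 3x3 patch, versus a mixer-style layer whose spatial step sees every pixel at once.

```python
import torch
import torch.nn as nn

channels, height, width = 64, 32, 32
x = torch.randn(1, channels, height, width)   # one CIFAR-sized feature map

# CNN layer: each output value is computed from a 3x3 patch of its input.
conv_layer = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

# MLP-Mixer-style layer: a per-location channel mix, then a spatial mix
# in which every output sees every location in the image.
channel_mix = nn.Conv2d(channels, channels, kernel_size=1)
spatial_mix = nn.Linear(height * width, height * width)

def mixer_layer(x):
    x = channel_mix(x)                   # (1, C, H, W)
    x = x.flatten(2)                     # (1, C, H*W)
    x = spatial_mix(x)                   # mixes across all spatial locations
    return x.reshape(1, channels, height, width)

print(conv_layer(x).shape)   # torch.Size([1, 64, 32, 32])
print(mixer_layer(x).shape)  # torch.Size([1, 64, 32, 32])
```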
@djenning90 1 year ago
Both this and the transformers video are outstanding. I find your teaching style very interesting to learn from. And the visuals and animations you include are very descriptive and illustrative! I’m your newest fan. Thank you!
@jcorey333 7 months ago
This is one of the best explanations I've seen! Thanks for making these videos.
@neilosborne8682 1 year ago
@4:17 Why are 9^N points required to densely fill N dimensions? Where is the 9 derived from? Is it just for the example given, or a more general constraint?
@algorithmicsimplicity 1 year ago
It is a completely arbitrary number, just for demonstration purposes. In general, in order to fill a 1-d interval of length 1 to a desired density d you need d evenly spaced points. To maintain that density in an n-dimensional volume you need d^n points. I just chose d=9 for the example. And the more densely the input space is filled with training examples, the lower the test error of a model will be.
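As a rough back-of-the-envelope illustration of that d^n growth (a sketch assuming d = 9, as in the video's example):

```python
import math

# Evenly spaced points needed to keep a fixed sample density d
# as the dimensionality n grows: d**n.
d = 9
for n in [1, 2, 3, 4, 10, 30]:
    print(f"n = {n:>2} dimensions -> {d}^{n} = {d**n:,} points")

# For a 256x256 RGB image, n = 256*256*3 = 196,608 dimensions,
# so 9**n is a number with roughly 187,000 digits.
n_image = 256 * 256 * 3
print("digits in 9^n:", math.floor(n_image * math.log10(9)) + 1)
```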
@senurahansaja3287 1 year ago
@@algorithmicsimplicity Thank you for your explanation, but here kzbin.info/www/bejne/bpqslYp-n9GYf9U the dimensional points mean the input dimension, right?
@algorithmicsimplicity 1 year ago
@@senurahansaja3287 Yes that's correct.
@jollyrogererVF84 1 year ago
A brilliant introduction to the subject. Very clear and informative. A good base for further investigation.👍
@sergiysergiy8875 11 months ago
This was great. Please continue making this content.
@manthanpatki146 8 months ago
Man, keep making more videos; this is a brilliant one.
@terjeoseberg990 1 year ago
I believe that the main advantage of convolutional neural networks over fully connected neural networks is the computational savings and the increased training data. A convolutional neural network is basically a tiny fully connected network that's being trained on every NxN square of every image imaginable. This means that a 256x256 image is effectively turned into 254x254, or 64,516, tiny images. If you start with 1 million images in your training data, you now have 64.5 billion 3x3 images that you're going to train the tiny neural network on. You can then create 100 of these tiny neural networks for the first layer, another 100 for the second layer, and another 100 for the third layer, and so on for 10 to 20 layers.
@algorithmicsimplicity 1 year ago
I think those two reasons are the most commonly cited explanations for the success of CNNs (along with translation invariance, which is absolutely incorrect), but I don't think they are sufficient to explain it. It is true that a CNN uses much less computation than a fully connected neural network, but there are other ways to make deep neural networks which are just as computationally efficient as CNNs. For example, using an MLP-Mixer style architecture in which a linear transform is first applied independently across channels to all spatial locations, and then a linear transform is applied independently across spatial locations to all channels. In fact, this is exactly what I used when making this video! The "deep neural network" I used was precisely this; it would have taken too long to train a deep fully connected neural network. This MLP-Mixer variant uses the same computation as the CNN but allows each layer to see the entire input, which is why it achieves lower accuracy than the CNN.

As for the increased training data size, it is possible this helps, but even if you multiply your dataset size by 100,000 it is still nowhere near the amount of data you would expect to need to learn in 256*256 dimensional space. Also, if it were merely the increased training data, then I would expect CNNs to perform better than DNNs even on shuffled data (after all, having more data should still help in that case). But in fact we observe the opposite: CNNs perform worse than DNNs when the spatial structure is destroyed. For these reasons I believe that the fact that each layer sees an input with low effective dimension is necessary and sufficient to explain the success of CNNs.
@terjeoseberg990 1 year ago
@@algorithmicsimplicity It's a combination of multiplying the dataset size by 64,500 and reducing the network size from 256x256 to 3x3. In fact, it's the reduction of the network size to 3x3 that allows the effective 64,500x increase in dataset size. It's not one or the other, but both. Each weight gets a whole lot more training/gradient following. You should do a video on the MLP-Mixer and how it compares to CNNs.
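For a concrete picture of the "tiny network trained on every patch" idea discussed in this thread, here is a minimal NumPy sketch (my own illustration; the 256x256 greyscale image and the random filter are placeholder values):

```python
import numpy as np

image = np.random.rand(256, 256)   # stand-in for one greyscale training image
k = 3

# Every 3x3 patch that the convolution's "tiny network" gets trained on.
patches = np.lib.stride_tricks.sliding_window_view(image, (k, k))
print(patches.shape)               # (254, 254, 3, 3) -> 254 * 254 = 64,516 patches

# A single convolutional filter is one tiny 3x3 weight matrix
# applied to every one of those patches.
w = np.random.rand(k, k)
feature_map = np.einsum("ijkl,kl->ij", patches, w)
print(feature_map.shape)           # (254, 254)
```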
@PotatoMan1491 3 months ago
Best video I've found for explaining this topic.
@KarlyVelez-u2k 1 year ago
Your videos are extremely good, especially for such a small channel. Great video! Can you do one on Recurrent Neural Networks please?
@nadaelnokaly4950 5 months ago
Wow!! Your channel is a treasure.
@anangelsdiaries 7 months ago
Fam, your videos are absolutely amazing. I finally understand what the heck a CNN is. Thanks a lot!
@jorgesolorio620 1 year ago
Great video! Can you do one on Recurrent Neural Networks please 🙏🏽
@pedromartins9889 7 months ago
Great video. You explain things really well. My only complaint is that you don't cite references. Citing references (which can be done simply as a list in the description) makes your less obvious statements more sound, like the claim that the number of significant outputs of a layer is more or less constant and small. I understand it would be very hard to explain that while maintaining the flow of the video, but if the description linked to an explanation, or at least to a practical demonstration, the viewer could understand it better, or at least be more confident that it is really true. Citing references also helps a lot if the viewer wants to study the topic further (and this is fair, since you already did the research for the video, so it costs you far less to show your sources than it costs the viewer to rediscover them). In summary: citing references gives you more credibility (in a digital world filled with so much bullshit) and gives interested viewers a great deal of help in going deeper on the topic. Don't be mistaken, I really like your channel.
@neithanm 1 year ago
I feel like I missed a step. The layers on top of the horse looked like a homogeneous color. Where's the information? I was expecting to see features build up from small parts to recognizing the horse, but...
@GaryBernstein 1 year ago
Can you explain how the NN produces the important-word-pair information scores described after 12:15 for the sentence problem raised at 10:17? Can you recommend any Telegram groups for this question and topic?
@thetntsheep4075 1 month ago
At 14:00 with the rearranged pixels, do you mean every image in the dataset has its pixels rearranged in the same way? If they were rearranged in a different random way for each image, I don't see how you could learn classification well at all.
@algorithmicsimplicity 1 month ago
Yes, I do mean in the same way: the same permutation is applied to every image. This is equivalent to shuffling the columns of a tabular dataset. It has no effect on fully connected neural networks, but severely impacts CNNs.
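For anyone who wants to try this, a minimal sketch of the shuffling step (assuming a CIFAR-10-style array of 32x32 RGB images; the actual training code from the video is not shown here):

```python
import numpy as np

# Stand-in for the dataset: N images of shape 32x32x3.
images = np.random.randint(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)

# ONE fixed permutation of the 32*32 pixel positions, chosen once and
# applied identically to every training and test image.
rng = np.random.default_rng(seed=0)
perm = rng.permutation(32 * 32)

flat = images.reshape(len(images), 32 * 32, 3)
shuffled = flat[:, perm, :].reshape(images.shape)

# A fully connected network is indifferent to this relabelling of its input
# columns, but a CNN's 3x3 patches now contain spatially unrelated pixels.
```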
@justchary 1 year ago
I do not know who you are, but please continue! You definitely have vast knowledge of the subject, because you can explain complex things simply.
@escesc1 4 months ago
This channel is top notch quality. Congratulations!
@VictorWinter-n2i 11 months ago
Really nice! What tool did you use to do those awesome animations?
@algorithmicsimplicity 11 months ago
This was done using Manim ( www.manim.community/ )
@5_inchc594 1 year ago
Amazing content, thanks for sharing!
@joshmouch 11 months ago
Yeah. Jaw dropped. This is an amazing explanation. More please.
@montanacaleb 1 month ago
You are the 3blue1brown of ML.
@Emma2-cg5jh 4 months ago
Where does the performance value for the rearranged images come from? Did you measure it yourself, or is there a paper for that?
@algorithmicsimplicity 4 months ago
All of the accuracy scores in this video are from models I trained myself on CIFAR10.
@ThankYouESM 11 months ago
Seems like the bag-of-words algorithm can do a faster job at image recognition since it doesn't need to read a pixel more than once.
@mrfurious60 1 month ago
How do we go from a 3 by 3 feature map to a 5 by 5 image?
@algorithmicsimplicity 1 month ago
In the second layer, the input is a 3x3 grid of outputs from the first layer. In the first layer, each output is computed from a different 3 by 3 grid of pixels. Therefore, the input to the second layer contains information from 9 different overlapping 3x3 patches, which means it sees information from a 5x5 patch of pixels.
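A quick sanity check of that growth (a sketch assuming 3x3 kernels with stride 1, as described in the reply above):

```python
# Receptive field of a stack of 3x3, stride-1 convolutions:
# each extra layer extends the field by (kernel_size - 1) pixels.
kernel_size = 3
receptive_field = 1
for layer in range(1, 5):
    receptive_field += kernel_size - 1
    print(f"layer {layer}: each output sees a "
          f"{receptive_field}x{receptive_field} patch of the original image")
# layer 1: 3x3, layer 2: 5x5, layer 3: 7x7, layer 4: 9x9
```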
@mrfurious60 1 month ago
@@algorithmicsimplicity Thanks man. I guess I'm a little too slow, because while I get how the first layer gives us what it does, the second layer is still a problem 😅.
@reubenkuhnert6870 1 year ago
Excellent content!
@uplink-on-yt 1 year ago
12:58 Wait a minute... Did you just describe neural pruning, which has been observed in young human brains?
@jamespogg 1 year ago
Amazing vid, good job man.
@Isaacmellojr 8 months ago
More videos please! You have a gift!!
@aydink7739 1 year ago
This is wow, I finally understand the "magic" behind CNNs. Bravo, please continue 👍🏽
@bobuilder4444 4 months ago
13:09 How would you know which numbers to remove?
@algorithmicsimplicity 4 months ago
You can simply order the weights by absolute value and remove the smallest weights (the ones closest to 0). This probably isn't the best way to prune weights, but it already allows you to prune about 90% of them without any loss in accuracy: arxiv.org/abs/1803.03635
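A minimal sketch of that magnitude-based pruning (my own illustration with a random weight matrix; it is not code from the linked paper):

```python
import numpy as np

weights = np.random.randn(256, 256)   # one layer's weight matrix

# Keep only the 10% of weights with the largest absolute value;
# the rest (the ones closest to 0) are set to zero.
keep_fraction = 0.10
threshold = np.quantile(np.abs(weights), 1 - keep_fraction)
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)

print(f"{(pruned != 0).mean():.0%} of weights remain")   # ~10%
```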
@bobuilder4444 4 months ago
@@algorithmicsimplicity Thank you
@HD-Grand-Scheme-Unfolds 1 year ago
@AlgorithmicSimplicity Greetings, may I ask: in your video, could you please specify in what sense you mean "randomly re-order the pixels" (13:55)? Let me explain my question. I know you mean reshuffling the permutation order of the set of input pixels; by "in what sense" I mean: is it (A) a unique random re-ordering for each training example (as in, for every picture), OR (B) the same random re-ordering for every training example? If you meant sense A, I would be amazed that the convolutional net can get the 62.9% accuracy you mentioned earlier. That 62.9% would be more believable to me if you meant sense B.
@algorithmicsimplicity 1 year ago
I meant in the B sense, same shuffle applied to every image in the dataset (training and test). If it was a different random shuffle for each input then no machine learning model (or human) would ever get above 10% accuracy. If you have some experience with machine learning, this operation is equivalent to shuffling the columns of a tabular dataset which of course all standard machine learning algorithms are invariant to.
@HD-Grand-Scheme-Unfolds 1 year ago
@@algorithmicsimplicity Lol, in hindsight your point is now taken 😄🤣. But let me play devil's advocate for a bit, for entertainment and curiosity: if it were somehow in sense A, then I'd imagine that would imply a phenomenon we might call pure memorization at its finest. To get back on the main track, I love that you went out of your way to make that clear in your presentation; yours is the second video to mention it, but you were the first to settle the big question (which I already asked you, thanks again). By the way, while I have the opportunity I'd like to ask: do you know where a non-programmer could find an intuitive, interactive, GUI-based program that simulates recurrent neural networks (ideally a plain RNN; I'd prefer not, but will accept LSTMs or GRUs)? GitHub, for example, mostly accommodates those who meet a coding-knowledge prerequisite. "MemBrain" fits the concept, but its RNN is still puzzling for me to figure out, train, test, etc. (though it's the most promising one to work with so far); "Neuroph Studio" fits the concept but has no RNN support; and "Knime Analytics Platform" still leans on coding skills, disguised as a GUI with clicks and parameter controls, and its rules for arrangements are too complex and counter-intuitive. IBM Watson Studio seems similar, and MATLAB is a puzzle box too.
@algorithmicsimplicity 1 year ago
I'm afraid I don't know of any GUI programs that simulate RNNs explicitly, but I do know that RNNs are a subset of feedforward NNs. That is, it should be possible to implement an RNN in any of those programs you suggested. All you would need to do is have a bunch of neurons in each layer that copy the input directly (i.e. the i'th copy neuron should have a weight of 1 connected to the i'th input and 0 for all other connections), and then force all neuron weights to be the same in every layer. That will be equivalent to an RNN. I would also recommend you just try and program such an app yourself. Even if you have no experience programming, you can just ask ChatGPT to write the code for you 😄.
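For what the "same weights in every layer" idea looks like in code, here is a minimal NumPy sketch of a simple RNN, i.e. an unrolled feedforward stack that reuses one set of weights at every step (an illustration only, with made-up sizes):

```python
import numpy as np

input_size, hidden_size, steps = 4, 8, 5
rng = np.random.default_rng(seed=0)

# ONE set of weights, reused at every "layer" (time step) of the unrolled network.
W_in = rng.normal(size=(hidden_size, input_size))
W_hidden = rng.normal(size=(hidden_size, hidden_size))

inputs = rng.normal(size=(steps, input_size))   # a toy input sequence
h = np.zeros(hidden_size)

for x in inputs:
    # Each step is an ordinary feedforward layer; the recurrence is just
    # weight sharing plus carrying the previous hidden state forward.
    h = np.tanh(W_in @ x + W_hidden @ h)

print("final hidden state:", h)
```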
@Walczyk 7 months ago
7:04 this is just like the boost library from microsoft
@nageswarkv 11 months ago
Definitely a good video, not a fluff video.
@solaokusanya955 1 year ago
So technically, what the computer sees or doesn't see is highly dependent on "whatever" we as humans dictate it to be...
@pypypy4228 11 months ago
It's brilliant!
@scarletsence 1 year ago
These are god-like visualizations, thanks.
@blonkasnootch7850 11 months ago
Thank you for the video. I am not sure it is right to say that humans have knowledge about how the world works built into the brain from birth. Accepting visual input for processing, detecting objects, or separating regions of interest is something every baby clearly has to learn. I have seen it with my children: it is remarkable, but not there from the beginning.
@algorithmicsimplicity 11 months ago
Of course children still need to learn how to do visual processing, but the fact that children can learn to do visual processing implies that the brain already has some structure about the physical world built into it. It is quite literally impossible to learn from visual inputs alone, without any prior knowledge.
@peki_ooooooo 1 year ago
Hi, how's the next video?
@yourfutureself4327 1 year ago
💚
@jameswustaken3862 25 days ago
@lolikobob 9 months ago
Make more good videos!
@thechoosen4240 1 year ago
Good job bro, JESUS IS COMING BACK VERY SOON; WATCH AND PREPARE