'How neural networks learn' - Part III: Generalization and Overfitting

41,684 views

Arxiv Insights


In this third episode on "How neural nets learn" I dive into a bunch of academic research that tries to explain why neural networks generalize as well as they do. We first look at the remarkable capability of DNNs to simply memorize huge amounts of (random) data. We then see how this picture is more subtle when training on real data, and finally dive into some beautiful analysis from the viewpoint of information theory.
Main papers discussed in this video:
First paper on Memorization in DNNs: arxiv.org/abs/1611.03530
A closer look at memorization in Deep Networks: arxiv.org/abs/1706.05394
Opening the Black Box of Deep Neural Networks via Information: arxiv.org/abs/1703.00810
Other links:
Quanta Magazine blogpost on Tishby's work: www.quantamagazine.org/new-th...
Tishby's lecture at Stanford: • Stanford Seminar - Inf...
Amazing lecture by Ilya Sutskever at MIT: • Ilya Sutskever: OpenAI...
If you want to support this channel, here is my patreon link:
/ arxivinsights --- You are amazing!! ;)
If you have questions you would like to discuss with me personally, you can book a 1-on-1 video call through Pensight: pensight.com/x/xander-steenbr...

Comments: 100
@owenliu4780 5 years ago
That part on visualization via information theory blows my mind, and your explanation is really concise and instructive. The inspiring problems you left in the latter part of the video are also well-observed contradictions backed with strong academic arguments, unlike some general nonprofessional problems raised by random non-tech people. It's fortunate for us to have a YouTuber like you on the Internet. Thank you a lot.
@delikatus 4 years ago
openreview.net/forum?id=ry_WPG-A-
@deathybrs 5 years ago
WELCOME BACK! So good to see you again!
@Ceelvain 5 years ago
One nice thing about your videos is that I can watch them 50 times and learn new things every time. Best ML YouTube channel! Please consider doing some collabs to get more viewers.
@veronmath3264 5 years ago
The standard of your videos is something out of this world. We are not only learning about programming, we are also learning to behave professionally and to love what we are doing. This mindset leads to better work in our communities today. If the world continues to produce this type of behavior, then everything will be good.
@snippletrap 5 years ago
Tishby is approaching the problem from classical information theory, while the idea that the shortest model is best comes from algorithmic information theory. Is the network synthesizing a program or compressing based on statistical regularities? If Tishby is right, then it is the latter. I suspect that he is. In this case neural networks are necessarily blind to certain regularities, just as Shannon entropy is. For example, the entropy of pseudorandom numbers, like the digits of pi, is high, while the Kolmogorov complexity is low. If networks are compressing then they need more parameters to encode the input. If they are learning optimal programs to represent the data, then the rate of growth will be much lower (logarithmic rather than linear). Hector Zenil has recently done some interesting work in this field, check him out.
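To make the entropy-versus-Kolmogorov-complexity contrast concrete, here is a minimal Python sketch (NumPy assumed; the seeded PRNG stands in for any pseudorandom source such as the digits of pi). The byte stream looks maximally random to a Shannon-style measure, yet the few lines that generate it already bound its Kolmogorov complexity.

```python
import numpy as np

# A "random-looking" stream produced by a tiny program: its empirical Shannon
# entropy is close to the maximum of 8 bits/byte, while its Kolmogorov
# complexity is bounded by the length of this short script.
rng = np.random.default_rng(seed=42)
stream = rng.integers(0, 256, size=1_000_000, dtype=np.uint8)

counts = np.bincount(stream, minlength=256)
p = counts / counts.sum()
p = p[p > 0]
bits_per_byte = -(p * np.log2(p)).sum()

print(f"empirical entropy: {bits_per_byte:.4f} bits/byte (maximum is 8)")
# A Shannon-style compressor sees ~8 bits/byte of irreducible "information" here,
# even though the shortest program that reproduces the stream is just these lines.
```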
@ArxivInsights 5 years ago
Brilliant comment, thx for adding this! I would tend to agree that current deep learning models are basically doing representational compression rather than program search. However, I feel like fundamentally solving the generalization problem might require new ways of leveraging neural nets (not necessarily trained via SGD) to allow for model based, algorithmic reasoning, much like the scientific process where hypotheses are posited and subsequently rejected / refined through observation.
@BooleanDisorder 4 months ago
@@ArxivInsights A few years later and I think you're on to something. Look up AlphaGeometry!
@deepblender 5 years ago
Thanks a lot for your videos! I love your detailed explanations, as they have always been very useful when I wanted to dig deeper. As you know, there are others who create videos about neural networks, but yours are the only ones that go to a point where I usually have a decent understanding of the important concepts. That's extremely valuable for me, thank you so much!
@user-qu2oz2ut2h 5 years ago
Please keep working on Arxiv videos, we need your clear explanations. Ask for help if you need it, but don't stop.
@wencesvm 5 years ago
Superb content as always!!! Glad you came back
@sibyjoseplathottam4828 5 years ago
Thank you for providing such concise explanations. I would have missed these important papers if not for you.
@coemgeincraobhach236 3 years ago
Love that you reference everything! Thanks for these videos!
@preritrathi3440 5 years ago
Awesome vid Xander...keep up the good work...can't wait to see ya again😀😀😀
@NicolasIvanov 5 years ago
Super nice video, Xander, thank you! Gonna share with my colleagues.
@Tygetstrypes 4 years ago
I love your channel! All your videos are clear and beautifully explained. This was a great video to watch after I watched Tishby's full talk: great summary of his results and presentation, and that's a fascinating line of research. Cheers :-)
@chrismorris5241 5 years ago
I'll probably have to watch this series many times but I think it is the best explanation I have seen. Thank you!
@Schematical 5 years ago
Well done. Glad to see you are still doing these. They really helped with my Minecraft AI experiments.
@hackercop 2 years ago
Just discovered this channel and it's amazing, thanks.
@gnorts_mr_alien 2 years ago
You not doing this more often is a crime against humanity. But I understand, you probably do other important things. Very interesting videos, thank you!
5 years ago
Very good explanation. My mind needs to be cooled for a minute. Don't stop making videos. I am waiting for a new one.
@JithuRJacob 5 years ago
Awesome Video! Waiting for more
@outdoorsismyhome7932 1 year ago
Great visualization!
@alexbooth5574 5 years ago
Thanks for posting again
@BlakeEdwards333 4 years ago
Amazing content. Please post more like this!
@BlakeEdwards333 4 years ago
Please turn up the volume of your video though! I nearly broke my car speaker when I got a text while listening to the video!
@karthik-ex4dm 5 years ago
Waiting for the next episode!!!
@yangxun253 5 years ago
Really great talk! Keep it up!
@Gyringag 5 years ago
Great video, thanks!
@mohamedayoob4699 5 years ago
Absolutely helpful and intuitive videos. Would you be able to do a video on reversible generative models, please? :)
@mikealche5778 4 years ago
This was very good! Thank you!! :)
@CosmiaNebula 3 years ago
Concerning neural networks and the shortest model, there is Schmidhuber's 1997 paper "Discovering neural nets with low Kolmogorov complexity and high generalization capability", where they directly searched for the least complex neural net and showed it has low generalization loss. I also want to mention Max Tegmark's speculative idea that neural networks have an intrinsic bias towards learning physically meaningful functions. See for example "AI for physics & physics for AI" kzbin.info/www/bejne/ppytnHt4lMhmpKM&ab_channel=MITCBMM
@olegmaslov3425 3 years ago
I worked with the article from the video, setting up my own experiments. The input data in the article is completely unrelated to reality, but is modeled so that it is easier to calculate the mutual information. When conducting experiments on MNIST data, the result is strikingly different: there is no compression stage on any ReLU layer. You can also see this in the article below. arxiv.org/pdf/2004.14941.pdf
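For anyone who wants to run this kind of check themselves, below is a rough sketch of the kind of binning estimator used in these information-plane experiments (NumPy assumed; `acts` and `labels` are hypothetical placeholders for one hidden layer's activations and the integer class labels). Since the hidden representation is a deterministic function of the input, I(X;T) is approximated by the entropy of the discretized activations, and I(T;Y) comes from the joint histogram of discretized activations and labels.

```python
import numpy as np

def entropy(probs):
    probs = probs[probs > 0]
    return -(probs * np.log2(probs)).sum()

def information_plane_point(acts, labels, n_bins=30):
    """Crude binning estimate of (I(X;T), I(T;Y)) for one layer.

    acts:   (n_samples, n_units) hidden activations for a batch
    labels: (n_samples,) integer class labels
    """
    # Discretize every unit's activation into n_bins equal-width bins.
    edges = np.linspace(acts.min(), acts.max(), n_bins)
    binned = np.digitize(acts, edges)
    # Treat each discretized activation vector as one symbol of T.
    _, t_ids = np.unique(binned, axis=0, return_inverse=True)
    t_ids = t_ids.ravel()

    p_t = np.bincount(t_ids) / t_ids.size
    h_t = entropy(p_t)                       # I(X;T) ~ H(T), since T is a deterministic map of X

    joint = np.zeros((p_t.size, labels.max() + 1))
    np.add.at(joint, (t_ids, labels), 1.0)   # joint counts of (T, Y)
    joint /= joint.sum()
    i_t_y = h_t + entropy(joint.sum(axis=0)) - entropy(joint.ravel())  # H(T)+H(Y)-H(T,Y)
    return h_t, i_t_y
```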
@michaelmcgrath9653 5 years ago
Great video, thanks
@resonance999 5 years ago
Great stuff!
@azizulbinazizanaiskla9342 3 years ago
Your videos are very good. I wish you could continue making YouTube videos.
@pierl 5 years ago
Well done.
@benjaminhezrony5761 10 months ago
This is soooooo interesting!!!!
@theweareus 4 years ago
Came for autoencoders, stayed for your great explanations! (especially the GAN video, finally understood the meaning behind competition between neural nets). By the way, do you intend to focus on one field (e.g. GAN or reinforcement learning), or to cover all the main topics (e.g. Feed Forward NN, RNN, and CNN)? Also, will you do some Q&A related to neural networks? Thanks Xander!
@AK-km5tj 5 years ago
Your videos are very nice
@MasterScrat 5 years ago
Great video! I'm curious, where does the SGD visualisation at 19:38 come from?
@0106139 5 years ago
Hi, great video! I just don't see how Inception "somehow manages to get a much better test accuracy on the true test set when trained on partially corrupted labels" @ 3:47. Could someone explain?
@mozartantonio1919 11 months ago
Awesome video and awesome publications. This has helped me a lot (I'm a PhD student at the University of Cantabria, hehe).
@UDharrmony 4 years ago
Brilliant!
@longlongmaan4681 5 years ago
Is it just me, or is the volume low in this video?
@jbhurruth 5 years ago
It is extremely low. I've had to turn my amplifier well beyond a safe volume to hear the embedded video. If YouTube decides to insert an advert, it runs a real risk of breaking my speakers.
@jasonlin1316 5 years ago
@4:57 To a human there might not be structure after "random" corruption, but to the machine there might be some statistical structure in the underlying manifold of the newly created distribution.
@mainaksarkar2387 5 years ago
Can you also do videos with specific examples from recurrent neural network models and LSTMs? Most of your existing examples use images, feedforward networks and CNNs.
@matthieuovp8654 5 years ago
Great!
@paedrufernando2351 3 years ago
@20:53 About the shortest program that best explains the data: I think he is referring to entropy, i.e. the fewer elements of surprise a program has, the more it has learned about the data's patterns and regularities. A similar concept is used in autoencoders, I believe, where we compress the information into fewer nodes than there are in the input layer (so if compression is achieved, it means the learning has been done very well; kind of a measure of learnedness versus memorization). Very useful in reinforcement learning. *I am a novice learner trying to understand the mechanics and gain insight, so please refute my points if you know they are not exactly correct or are off track.
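Following the autoencoder analogy above, here is a minimal sketch (PyTorch assumed; the 784-dimensional input and the layer sizes are arbitrary, MNIST-style choices). The narrow bottleneck forces the network to keep only the regularities it needs to reconstruct its input.

```python
import torch
import torch.nn as nn

# Minimal autoencoder: the 32-unit bottleneck is far smaller than the
# 784-dimensional input, so good reconstruction is only possible if the
# encoder captures regularities in the data rather than raw pixels.
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 256), nn.ReLU(),
            nn.Linear(256, in_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)              # stand-in batch; real data would be e.g. MNIST images
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
opt.step()
```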
@threeMetreJim 5 years ago
At a first guess, I'd put forward that it's randomness in time, rather than randomness in the learning function (SGD), that is the difference between an artificial network and a biological one. I'd put this down to signals being processed over random lengths of biological connections (axons), and therefore with random timing, as the signals take a finite time, even though it's small, to traverse the length of the connections. It would be just a difference in signal encoding between an artificial network (continuous values without the need for a concept of time) and a biological one (pulsed signals, where the actual 'value' could be determined by pulse density, using a time-averaging activation function).
@beauzeta1342 5 years ago
Thank you for the great video. There are 2 questions that fundamentally trouble me behind all these interpretations. 1. The benefit of gradient randomness should also apply to any conventional learning algorithm, not just DNNs. But this implies our learning problem is not uniquely defined by the objective function and the learning model. If the numerical method conditions our solution, mathematically this becomes a badly posed problem. 2. Is the DNN really not overfitting? Practical datasets often exhibit continuity and regularity. A network memorizing millions of training points will not necessarily exhibit overfitting behavior on a test set, as we are essentially interpolating within the memorized points. Only when we test the model on data outside its conventional manifold can we really see the extrapolation issue. So many papers have shown that NNs can be easily fooled. My gut feeling is that Big Data helps us perform interpolation almost all the time, but we are actually overfitting.
@aidenstill7179 5 years ago
Great video. Please tell me: what do I need to know to create my own Python deep learning framework? Which books and courses would give me the knowledge for this?
@olegovcharenko8684 5 years ago
Please make videos more often! One video saves a week of reading and searching for papers, thanks
@SundaraRamanR 4 years ago
Link to Part 1 of this series: kzbin.info/www/bejne/g5TKqYWunpd9p9E (Feature visualisation) Part 2: kzbin.info/www/bejne/aqOpgJ6mfpV_mck (Adversarial examples)
@paulcurry8383 2 years ago
The graphic used to show that more layers reduce the duration of the compression phase is confusing to me, because how do we know that the 1-layer MLP has the representational capacity to overfit on MNIST?
@violetka07 3 years ago
I remember that a couple of years ago he had a problem publishing this. Has it been published since? Thanks. Otherwise, extremely interesting research, indeed.
@hrhxysbdhdgxbhduebxhbd3694 5 years ago
Yes!
@user-or7ji5hv8y 5 years ago
great video! Lots of deep ideas.
@user-or7ji5hv8y 5 years ago
Is there another way to understand the axes of the chart that uses information theory? Not quite getting it.
@262fabi 2 years ago
Very nice video, although the audio volume is a bit low. I was easily able to listen to and understand multiple videos I watched prior to this one at just 20% volume, and I struggle to understand what he or the professor is saying at 100% volume.
@bingochipspass08 1 year ago
I thought it was just me lol,.. ya,.. the audio on this one is really low,..
@BooleanDisorder 4 months ago
Rest in peace Tishby
@hfkssadfrew 5 years ago
You might want to look at “implicit regularization by SGD”
@ArxivInsights 5 years ago
I actually found that paper, but only after I had recorded the video :p So it was too late to put it in there, but indeed, that's the stuff :)
@hfkssadfrew 5 years ago
Arxiv Insights, thanks for the reply. Indeed, people found a long time ago that if you choose to solve an underdetermined linear system by SGD, where as you know there are infinitely many solutions, SGD will only give you the minimum-norm one.
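A quick numerical sketch of that minimum-norm behaviour (NumPy assumed; plain full-batch gradient descent from a zero initialization stands in for SGD here). Because the iterates never leave the row space of A, they converge to the same solution the pseudoinverse gives, i.e. the minimum-norm one.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 100                         # more unknowns than equations: infinitely many solutions
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x = np.zeros(n)                        # zero init keeps every iterate in the row space of A
lr = 1.0 / np.linalg.norm(A, 2) ** 2   # step size safely below the stability limit
for _ in range(1000):
    x -= lr * A.T @ (A @ x - b)        # gradient step on 0.5 * ||Ax - b||^2

x_min_norm = np.linalg.pinv(A) @ b     # minimum-norm solution via the pseudoinverse
print(np.linalg.norm(A @ x - b))       # ~0: the system is solved
print(np.linalg.norm(x - x_min_norm))  # ~0: gradient descent picked the min-norm solution
```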
@norbertfeurle7905 2 years ago
I think of it like this: two random values have a much larger Hamming distance between them than two non-random values. So randomness makes the distinction space between two values bigger and much clearer, and the network will learn much better on random data. The question then would be how you get the useful data out of an almost random, noisy network output. The answer is to use something known in information theory for recovering a signal from near-noise, called Barker codes: use Barker-code neurons in the network and you can thereby tolerate much more noise, resulting in better learning and better distinction. OK, now I have to charge you something for this.
@meinradrecheis895 5 years ago
A human has so many input streams (visual, tactile, auditory, etc.) that have high correlation. Isn't that supervised learning?
@StevenSmith68828 4 years ago
Maybe the next evolution of AI would be creating hormone-like operations that overstimulate certain areas of the NN ahead of time.
@mendi1122 4 years ago
Our brain learns to predict the next input/experience, i.e. continuous reality is the supervisor.
@alighahramani2347 1 year ago
@user-or7ji5hv8y 5 years ago
But given that NNs are universal function approximators, is it all that surprising that they managed to fit 100%? Or am I missing something?
@codyheiner3636 5 years ago
No, it's not surprising that they managed to fit perfectly. It's surprising they manage to fit perfectly *while also* generalizing well to unseen data. How many possible functions are there over the domain of all n by n images that classify a certain handful as planes and another handful as trucks? Now how many of those will also classify unseen images of planes as planes and unseen images of trucks as trucks?
@threeMetreJim 5 years ago
Your green screen background looks sort of familiar. Was it generated by a neural network? I've been playing with a very small image-storing neural network, 48x6 (based on Andrej Karpathy's ConvNetJS). Turns out that it can reliably store as much information as can be encoded by all of the weights (no real surprise there). For a 100x100 pixel RGB image requiring 30,000 bytes to store without loss, it turns out that the 48x6 network has roughly 12,000 weights, each taking 4 bytes to encode, giving a total of around 48,000 bytes of information (more than the uncompressed image itself would take up). It does seem able to fit more than that, somehow discarding irrelevant image information (like using a smaller colour space, or encoding large same-coloured areas somehow). It seems the complexity of the image(s) determines the network's capability in part. I still find it fascinating that it can store a colour for random x,y coordinates being fed in, and for multiple images too (even ones that are randomly rotated, and you can associate the random position with another network input!). Shame I didn't have the patience to get a more decent resolution; it takes a long time for the smaller image details to start appearing. It also accepts binary (you could probably use a base system other than 2 as well), rather than the more common one-hot encoding method; the 'distance' between the codes has to be enough to avoid overlap, in the way I was experimenting. The simple x,y and remember-colour network only seems to work if you feed it the x,y randomly; trying to scan over an image doesn't seem to work at all. It reminds me of the Sierpinski triangle, where if you try to draw one without picking the direction to travel at random, it fails to work well, or at all.
@ArxivInsights 5 years ago
I generated these moving backgrounds using a simple CPPN network I wrote in PyTorch. Simply initialize a small, random fully-connected network, feed it (x,y) coordinates as input + a moving latent vector (to create motion). The output of the network is the color for each pixel on the screen. Try a few random seeds and architectures until you get something that looks good! Then you simply run the network at whatever resolution you want :)
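A minimal sketch of that recipe (PyTorch assumed; the layer sizes, tanh activations and latent dimension are arbitrary choices, not the exact network behind the video backgrounds):

```python
import torch
import torch.nn as nn

class CPPN(nn.Module):
    """Maps (x, y, latent z) -> RGB; evaluating it on a pixel grid yields an image."""
    def __init__(self, z_dim=8, hidden=32, depth=4):
        super().__init__()
        layers, in_dim = [], 2 + z_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.Tanh()]
            in_dim = hidden
        layers += [nn.Linear(in_dim, 3), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, xy, z):
        z = z.expand(xy.shape[0], -1)            # same latent vector for every pixel
        return self.net(torch.cat([xy, z], dim=1))

torch.manual_seed(0)                             # try different seeds until it looks good
cppn, H, W = CPPN(), 128, 128
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
xy = torch.stack([xs.ravel(), ys.ravel()], dim=1)
z = torch.randn(1, 8)                            # slowly move this vector to animate the pattern
with torch.no_grad():
    frame = cppn(xy, z).reshape(H, W, 3)         # render at whatever resolution you like
```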
@threeMetreJim 5 years ago
@@ArxivInsights If you have a look at a couple of videos I've done, you'll see exactly what I've been playing with: storing images and then warping them by adjusting other inputs that were left static while storing the images. An example of extreme overfitting, but it makes some nice, if rather low-resolution, videos - kzbin.info/www/bejne/gHewknyueb2Ynrc
@KhaledKimboo4 1 year ago
So we should store/compress data as overfitted neural networks.
@TerrAkon3000 5 years ago
Can someone point me to a more formal statement of Ilya's claim that 'short programs generalize best'? I haven't had any luck on Google yet. I feel like one has to make rather strong assumptions to show its validity.
@wiktormigaszewski8684 4 years ago
We do learn in a supervised manner; it's called "school".
@ValpoPhysics 4 years ago
Beautiful summary. Tishby's lectures often get bogged down in the details. You give a nice clear overview. I recommend this version of Tishby's talk on the Information Bottleneck: kzbin.info/www/bejne/e4K3pXWIgpWmf9U He's frequently interrupted by the faculty, but I think he makes his points better as a result.
@zaksmith1035 3 years ago
Dude, come back.
@codyheiner3636 5 years ago
For the apparent clash between the idea that simpler rules generalize better and deeper neural networks train better, it's important to focus on the point that the accuracy of a deep neural network is based upon what it learns during training, and the majority of our decisions when creating deep learning models are motivated by the goal of making our network learn better and faster. Mathematically, the simplest set of rules will generalize the best. But mathematically, we have no way to find out this simplest set of rules. So we turn to deep learning, which gives us a very complicated set of rules instead. Nevertheless, it gives us this set of rules in a short amount of time. I imagine regarding the general problem of finding the simplest model that solves a problem of the type we are currently approaching using deep learning, it will take humanity decades of research to make significant progress. Even then, I'm not convinced we'll have much more than a bag of heuristics and practical tricks reinforced by massive computational capacity.
@matrixmoeniaclegacy 4 years ago
Hey, thanks for your helpful videos! => I think you misspelled Ilya Sutskever in your video description! In this video from MIT kzbin.info/www/bejne/b3axkHuletBmgbs it is written Sut-s-keve-r. Cheers!
@wolfisraging 5 years ago
Your videos are filled with lots of knowledge, great work. But the number of videos and topics covered on your channel is so limited. Kindly upload at least one video every two weeks. BTW, thanks.
@neutrinocoffee1151 5 years ago
Easier said than done!
@wolfisraging 5 years ago
@@neutrinocoffee1151 Everything is impossible until it's done!
@benediktwichtlhuber3722 4 years ago
I would like to point out that both Zhang's and Tishby's theories got debunked in recent research.
@ArxivInsights 4 years ago
Hmm interesting! Can you share some links on this? That's the risk of making videos about Deep Learning :p
@benediktwichtlhuber3722 4 years ago
@@ArxivInsights I am still glad you do videos though :) Talking about Zhang: I can still remember a paper from Krueger ("Deep nets don't learn via memorization"), where the researchers pointed out that learning random data and learning real data are completely different tasks. For obvious reasons algorithms can't learn irrational data. This is at best a trivial exception to existing theories (like Goodfellow's). The main contribution of Zhang's paper is probably showing the overwhelming capacity of neural networks. Tishby's theory got far less interesting with Saxe ("On the information bottleneck theory of deep learning"). They showed that there is no connection between compression and generalization: they found a few cases where there is no compression phase but networks were still able to generalize, meaning those findings do not hold in the general case. I think this field is so freaking interesting because a lot of researchers don't even dare to touch it. Nobody can really tell why and how DNNs make decisions. I am curious to find an answer :)
@RocketSpecialist 4 years ago
Can't hear anything.
@dungeonkeeper42 1 year ago
I have a bad feeling about all this..
@X_platform 5 years ago
Humans are in a supervised environment though. In the early days, if we did something wrong, we simply died. Now it is more forgiving: if we do something wrong, we might lose some time or some money.
@Ceelvain 5 years ago
That's not supervision. That's reinforcement.
@rappersdelight3064 4 years ago
The audio is WAY TOO QUIET, MAKE IT LOUDER