#61: Prof. YANN LECUN: Interpolation, Extrapolation and Linearisation (w/ Dr. Randall Balestriero)

106,226 views

Machine Learning Street Talk

1 day ago

We are now sponsored by Weights and Biases! Please visit our sponsor link: wandb.me/MLST
Patreon: / mlst
Discord: / discord
Yann LeCun thinks it's specious to say neural network models are interpolating, because in high dimensions everything is extrapolation. Recently Dr. Randall Balestriero, Dr. Jerome Pesenti and Prof. Yann LeCun released their paper "Learning in High Dimension Always Amounts to Extrapolation". This discussion has completely changed how we think about neural networks and their behaviour.
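The paper's definition is concrete enough to test numerically. Below is a small sketch (our own illustration, not the authors' code) that treats "interpolation" as membership in the convex hull of the training samples, checked as an LP feasibility problem, and shows how quickly new samples stop being interpolations as dimension grows:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(X, p):
    """True iff p is a convex combination of the rows of X (an LP feasibility test)."""
    n = X.shape[0]
    A_eq = np.vstack([X.T, np.ones((1, n))])   # X.T @ lam = p  and  sum(lam) = 1
    b_eq = np.concatenate([p, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success

rng = np.random.default_rng(0)
results = {}
for d in (2, 20):
    X = rng.standard_normal((100, d))          # 100 "training" samples
    new = rng.standard_normal((50, d))         # 50 fresh samples, same distribution
    results[d] = sum(in_convex_hull(X, p) for p in new)
    print(f"dim={d}: {results[d]}/50 new points are interpolations")
```

With 100 Gaussian samples, most fresh 2-D points land inside the hull, while in 20 dimensions essentially none do; that is the paper's point, since by this definition models at realistic dimensionality are practically always extrapolating.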
[00:00:00] Pre-intro
[00:11:58] Intro Part 1: On linearisation in NNs
[00:28:17] Intro Part 2: On interpolation in NNs
[00:47:45] Intro Part 3: On the curse
[00:57:41] LeCun intro
[00:58:18] Why is it important to distinguish between interpolation and extrapolation?
[01:03:18] Can DL models reason?
[01:06:23] The ability to change your mind
[01:07:59] Interpolation - LeCun steelman argument against NNs
[01:14:11] Should extrapolation be over all dimensions
[01:18:54] On the morphing of MNIST digits, is that interpolation?
[01:20:11] Self-supervised learning
[01:26:06] View on data augmentation
[01:27:42] TangentProp paper with Patrice Simard
[01:29:19] LeCun has no doubt that NNs will be able to perform discrete reasoning
[01:38:44] Discrete vs continuous problems?
[01:50:13] Randall introduction
[01:50:13] Are the interpolation people barking up the wrong tree?
[01:53:48] Could you steelman the interpolation argument?
[01:56:40] The definition of interpolation
[01:58:33] What if extrapolation was being outside the sample range on every dimension?
[02:01:18] On spurious dimensions, and correlations don't an extrapolation make
[02:04:13] Making clock faces interpolative and why DL works at all?
[02:06:59] We discount all the human engineering which has gone into machine learning
[02:08:01] Given the curse, NNs still seem to work remarkably well
[02:10:09] Interpolation doesn't have to be linear though
[02:12:21] Does this invalidate the manifold hypothesis?
[02:14:41] Are NNs basically compositions of piecewise linear functions?
[02:17:54] How does the predictive architecture affect the structure of the latent?
[02:23:54] Spline theory of deep learning, and the view of NNs as piecewise linear decompositions
[02:29:30] Neural Decision Trees
[02:30:59] Continuous vs discrete (Keith's favourite question!)
[02:36:20] MNIST is in some sense, a harder problem than Imagenet!
[02:45:26] Randall debrief
[02:49:18] LeCun debrief
Pod version: anchor.fm/machinelearningstre...
Our special thanks to:
- Francois Chollet (buy his book! www.manning.com/books/deep-le...)
- Alexander Mattick (Zickzack)
- Rob Lange
- Stella Biderman
References:
Learning in High Dimension Always Amounts to Extrapolation [Randall Balestriero, Jerome Pesenti, Yann LeCun]
arxiv.org/abs/2110.09485
A Spline Theory of Deep Learning [Randall Balestriero, Richard Baraniuk]
proceedings.mlr.press/v80/bal...
Neural Decision Trees [Dr. Balestriero]
arxiv.org/pdf/1702.07360.pdf
Interpolation of Sparse High-Dimensional Data [Dr. Thomas Lux]
tchlux.github.io/papers/tchlu...
If you are an old fart and offended by the background music, here is the intro (first 60 mins) with no background music. drive.google.com/file/d/16bc7...

Comments: 180
@andreye9068 2 years ago
Thanks for posting this episode! And as "that guy" at 2:08:19, I'm happy to say I found the discussion very interesting and it's changed my mind :)
@nomenec 2 years ago
Thank you, Andre! And thank you for your article. Apologies we couldn't recall your name on the fly; we did make sure to show your name in the video though ;-) I'm very curious, how did the discussion change your views?
@MachineLearningStreetTalk 2 years ago
Hey Andre, we really appreciate you dropping in here. Great article! For the benefit of folks -- here it is medium.com/analytics-vidhya/you-dont-understand-neural-networks-until-you-understand-the-universal-approximation-theorem-85b3e7677126
@MachineLearningStreetTalk 2 years ago
And this was the tweet where LeCun picked it up twitter.com/ylecun/status/1409940043951742981
@andreye9068 2 years ago
@@nomenec For sure - Twitter really isn't the best platform to exchange nuanced perspectives, so when the Twitter conversation began, I took the disagreement (i.e. between LeCun, Pinker, Marcus, Booch, etc.) to be a sign that it was one of those types of ambiguous problems that one can't really confidently make one's mind up about. A lot of the Twitter thread content seemed pretty speculative or pulled willy-nilly without much organization. When I first read the paper on extrapolation, I was even more unsure of what to think - I was actually wondering many of the questions that you all asked in the interview, e.g. why choose the convex hull instead of another definition? Does this mean that neural networks are actually extrapolating? etc. After listening to LeCun and Balestriero's responses, I have a much more well-informed perspective on the paper's context and argument, and I think it's probably correct. Thanks guys for all the work you do arranging context and asking insightful questions!
@AICoffeeBreak 2 years ago
This is incredible! Ms. Coffee Bean's dream came true: the extrapolation interpolation beef explained in a verbal discussion! 🤯 Cannot wait to watch this. So happy about a new episode from MLST. You kept us waiting.
@nomenec 2 years ago
Thank you, Letitia! We burned the midnight oil for weeks on this one; we are looking forward to the community enjoying (hopefully!) the effort. We are grateful to both Yann LeCun and Randall Balestriero for spending time with us!
@DanElton 2 years ago
I’m literally working on a blog post about how deep learning is interpolation only, based on the double descent phenomenon and distribution shift issues, and then this drops!! Lol
@ketilmalde3402 2 years ago
Link (when you're done)? In general, where is a good place to discuss this? So many questions... but YT tends to drown in a zillion low-quality comments as soon as anything gets popular. The AI Stack Exchange?
@Smalldatalooser 2 years ago
@@ketilmalde3402 Are you the Ketil Malde who wrote the Arxiv paper about 'semantic' meaningful learning of plankton images with Siamese NNs? I just completed my Bachelor's thesis using semi-supervised SimCLR for plankton image categorization and really enjoyed and heavily used it. However, I was not able to produce such nice clusters in representation space. So if you are the one who wrote it: thanks a lot for the inspiration!
@ketilmalde3402 2 years ago
@@Smalldatalooser yes, that would be me :-) Thanks for the kind words! I really should try to get it published properly, but the review asked for lots of detailed changes and a full resubmission (rather than a revision with a deadline), so it got kinda left by the roadside. And the field moves so quickly and I've learned a lot since then, so nowadays I would probably use a different method (like you did).
@leinarramos 2 years ago
Just spent 5 hours watching this 3-hour video. This is both dense and profound. Great job, best episode yet in my book!
@nomenec 2 years ago
Thank you for your time and commitment!
@EmileAI 1 year ago
I spent 7h, lmao. I'm still too new to machine learning. I love this episode
@abdurrezzakefe5308 2 years ago
I wait for your videos in more excitement than I wait for my favorite tv shows' new seasons. Looks amazing!
@teksasteksasen1249 2 years ago
Sooo what are the odds we can get a conversation between LeCun and Chollet? Would love to watch them have a discussion on this.
@BenuTuber 2 years ago
Starting off the new year with a bang. Tim, Keith and Yannic - thank you so much for this quality work. You can clearly tell how much love and dedication goes into every episode. Also the intros just continue to amaze me - the level of understanding you approach the variety of topics with is extremely inspiring.
@stalinsampras 2 years ago
A couple of minutes into the video and you break some of the fundamental assumptions I had about deep learning/neural nets, jeez man. Excited for this 3-hour-long video. And as usual the production quality of the videos keeps getting better. Happy New Year guys
@nomenec 2 years ago
Happy New Year! Tim and I certainly walked away with very different (upgraded, in my opinion) views on neural nets. Would love to learn how, if at all, your views change after watching.
@jaapterwoerds9850 2 years ago
The content on this channel is just mind-blowing. But the main reason I come back is the thoughtful editing and the introductions and reflections on the content by Dr. Tim. I cannot keep up yet in grasping all the content in real time, but that is exactly why it's so awesome. Thanks!
@Kerrosene 2 years ago
Occam's razor always makes straight cuts (in reference to piecewise linear functions) was a great line!
@vishalrajput9856 1 year ago
Thank you guys, I've not been more amazed by anything in AI than this completely brand new revelation of the neural network's internal workings. Insanely interesting and beautiful.
@xorenpetrosyan2879 2 years ago
imagine the balls to make a 1 hour intro before the main discussion :D
@victoroko3954 2 years ago
You guys really kept us waiting. Thank you, MLST, for this one.
@mikenashtech 2 years ago
Fantastic discussion and explanation of the thinking behind interpolation, extrapolation and linearisation. This has really helped shift the needle towards the ultimate problem we all face: helping decipher what input is relevant to the task. If possible, please do V.2 covering some of the other concepts Prof LeCun was talking about. Could be a series on its own, as it's so good! Mike Nash - The AI finder
@barlowtwin 2 years ago
Just got done watching it. Grateful for the great work the team has done. Cheers :)
@ChaiTimeDataScience 2 years ago
WOHOOOO! I'm so so stoked to see this video! Time to drop everything and watch another epic interview by the MLST team!
@nomenec 2 years ago
Cheers! Just don't drop your Chai! ;-)
@OisinNolanChannel 2 years ago
Love these long form videos -- really appreciate the effort you guys are putting in!!
@Artula55 4 months ago
I think I have seen this video over a dozen times, but every time I keep learning something new. Thx MLST!
@YoungMasterpiece 11 months ago
I love the analogy 'I feel like I'm standing on Pluto', nice :)
@oncedidactic 2 years ago
5 minutes in and it feels like extended Christmas :D So glad to have the show back!
@johanneslaute3675 2 years ago
Great episode! These long deep dives are amazing, I get a lot of intuition from them and they are a great point to start reading more papers on the topic (who except Yann can keep up with arxiv these days...). Really appreciate the effort and have a great 2022 :)
@Soul-rr3us 1 year ago
I keep coming back to this. One of the best MLSTs.
@madmanzila 1 month ago
Well done guys, it's really a pleasure to be diving into this field
@tchlux 2 years ago
Thanks for the shoutout at 38:01 Tim! The Discord channel rocks 😆 An additional note on extrapolation that people might find interesting: - In effect, the ReLU activation function prevents value extrapolation to the left. So when these are stacked, they serve as "extrapolation inhibitors". - This clipping could be applied to other activation functions to improve generalization (or forewarn excessive extrapolation)! - I.e., clipping the inputs to all activation functions within a neural network to be in the range seen at the end of training time will reduce large extrapolation errors at evaluation time (and counting the number of times an input point is clipped throughout the network could indicate how far "outside the relevant convex hull" it is). The clipping shouldn't be introduced until training is done (because we don't have a reason to assume the initialization vectors are "good" at identifying the relevant parts of the convex hull). But I'd be willing to bet that this "neuron input clipping" could improve generalization for many problems, is part of why ReLU works well for so many problems, and can prevent predictions from being made at all for adversarial inputs.
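The "neuron input clipping" idea described in the comment above can be sketched in a few lines of numpy (an illustration of the comment with made-up random weights, not code from any paper): record the range of pre-activation values each hidden unit sees on the training data, then clip those values to that range at evaluation time so no unit ever operates in a regime it never saw during training.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 8)), rng.standard_normal(8)   # toy "trained" weights
W2, b2 = rng.standard_normal((8, 1)), rng.standard_normal(1)

def forward(X, lo=None, hi=None):
    z = X @ W1 + b1                      # pre-activations of the hidden layer
    if lo is not None:
        z = np.clip(z, lo, hi)           # project back into the training-time range
    return np.maximum(z, 0.0) @ W2 + b2  # ReLU, then linear output

X_train = rng.standard_normal((256, 4))
z_train = X_train @ W1 + b1
lo, hi = z_train.min(axis=0), z_train.max(axis=0)   # per-unit range seen in training

X_far = 10.0 * rng.standard_normal((16, 4))          # inputs far outside training data
y_free = forward(X_far)
y_clip = forward(X_far, lo, hi)
print("max |output|, unclipped:", np.abs(y_free).max())
print("max |output|, clipped:  ", np.abs(y_clip).max())
```

The unclipped network extrapolates linearly without bound, so its outputs blow up on far-away inputs; the clipped version's outputs stay bounded by what the training-time activation ranges allow.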
@oncedidactic 2 years ago
"[Activation clipping] ... can prevent predictions from being made at all for adversarial inputs." Would love to hear more about this line of thinking! Both practical side and what this illuminates on the theory side about "what does it mean to be adversarial / robust / etc". You guys didn't get a chance to discuss adversarial stuff on your chat episode much at all but it seems to abut the topic of generalization quite often which in turn tends to come up with geometric interpretation.
@tchlux 2 years ago
​@@oncedidactic happy to clarify. One way I like to think about it is that every basis function inside an MLP (the activations at a node after applying the nonlinearity) generates a distribution. If you have 10k points at training time, then for every internal 1D function you can plot the distribution of the 10k values at those points. That should give a pretty precise definition of the CDF (from central limit theorem), and rather tight bounds of what is "in distribution" (/ likely given observations). The issue is that the generated distribution of values at internal nodes over training data is (obviously) not independent of the training process. So to get an accurate estimation of the distributions we withhold validation data, which provides a true estimation of the error function (the error of the model over the space covered by the validation data). Now when you apply the model to new data, you can look at the values produced at internal nodes relative to the distributions seen at training / validation time. If you observe that a single evaluation point produces "out-of-distribution" (extrapolative) values for a substantial number of nodes in the model, then we know for certain that the point is not "nearby" to our training data. Even more, if the new point is out of distribution for the validation data, then that means we don't have a guess as to what the error looks like! 😱 One of the core mechanisms for making approximations outside the bounds of training data is projecting new points back into the region of space where you can make an approximation (usually on to the convex hull). So in practice we can project points onto the convex hull of the 1D basis functions by clipping all values to the minimum and maximum seen at training time. We would want to do this mainly because we have no reason to assume that the linear fit produced by one node (and it's infinite linear extrapolation to the right) is correct! No training data justified that behavior. 
If we let our basis functions extrapolate without bounds then our error *definitely* grows without bounds. If we prevent infinite extrapolation, then we *might* be bounding our error too. To tie it all together, the distributions of values seen at validation time (more validation data ➞ better distribution estimates) should *precisely* match the distributions for testing. If they do not, then you know that something about the data has changed (from training & validation time) and your error will change in a commensurate fashion (in an unknown way). This relates to another important fact: we can never modify a model based on validation error. If we make decisions based on validation error, then we entirely undo the (necessary) orthogonality of the validation set (and hence remove our ability to estimate error).
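The detection side of the argument above can also be sketched: record each hidden unit's training-time range, then score a new input by the fraction of units it pushes out of range. (A toy illustration with random weights, not code from the thread.)

```python
import numpy as np

rng = np.random.default_rng(1)
W, b = rng.standard_normal((8, 32)), rng.standard_normal(32)   # one toy hidden layer

X_train = rng.standard_normal((1000, 8))
Z = X_train @ W + b
lo, hi = Z.min(axis=0), Z.max(axis=0)          # per-unit training-time range

def ood_score(X):
    """Fraction of hidden units, per row, whose input falls outside [lo, hi]."""
    Z = X @ W + b
    return ((Z < lo) | (Z > hi)).mean(axis=1)

in_dist  = ood_score(rng.standard_normal((100, 8)))       # same distribution
far_away = ood_score(5.0 * rng.standard_normal((100, 8))) # shifted distribution
print("mean score, in-distribution:", in_dist.mean())
print("mean score, shifted inputs: ", far_away.mean())
```

In-distribution points trigger almost no out-of-range units, while distribution-shifted inputs trigger many, which is the "how far outside the relevant convex hull" signal the comment describes.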
@oncedidactic 2 years ago
​@@tchlux Thanks for the detailed reply! Any further reading you can point to? It makes perfect sense to me you would want to use the clipping / projection to learned convex hull to prevent wild extrapolation that leaves you at the mercy of "out-of-distribution", be that natural or adversarial. I can't think of an example where this is implemented but my knowledge is *not* deep. I imagine this curtails the "magic" of kinda-sorta extrapolating well sometimes, but you win the tradeoff because the limitation of your model is predictable. Or in other words predictably dumb is better than undependably intelligent, as a system component. "Even more, if the new point is out of distribution for the validation data, then that means we don't have a guess as to what the error looks like! 😱" This is so insightful yet simple and really reframes the whole issue for me. Not to pile on too much, but this just feels like another sign that it seems pointless to expect better training sets or bigger models to ever overcome the problem of "something you haven't seen before."
@tchlux 2 years ago
@@oncedidactic > Any further reading you can point to? I mostly just think about things in terms of basic linear algebra. If you get super comfortable with matrix multiplication, linear operations, and think really hard about (or better, implement) a principal component analysis algorithm (any method), then you'll start to form the same intuitions I have (for better or worse 😜). I try to think of everything in terms of directions, distances, and derivatives (/ rates of change). I can't think of any "necessary" knowledge in machine learning that you can't draw a nice 2D or 3D picture of, or at least produce a really simple example. I suggest aggressively simplifying anything until you can either draw a picture or clear example with minimal information. If it seems too complicated, it probably is. Stephen Boyd's convex optimization work (YouTube or book) is great. And 3blue1brown is wonderful too. > this just feels like another sign that it seems pointless to expect better training sets or bigger models to ever overcome the problem of "something you haven't seen before." Exactly. People will probably continue to talk about it forever, but it only makes sense to *extrapolate* in very specific scenarios with relatively strong assumptions. What we really want in most cases is a model that identifies a low dimensional subspace of the input where it can accurately interpolate.
@vigneshpadmanabhan 1 year ago
Thank you for creating this amazing channel. The amount of insight one can get from sitting for three hours with the professionals is immense!
@lucca1820 2 years ago
We need more of Prof. Yann LeCun!
@paxdriver 2 years ago
Happy New Year's!!! I've missed you guys
@stretch8390 2 years ago
Time to write the afternoon off and make the most of an incredible opportunity in listening to this discussion.
@dr.mikeybee 2 years ago
Tim, your statement about neural networks being analogous to classical decision trees absolutely hits home.
@dinoscheidt 2 years ago
They are back ❤️ if only youtube decided to use that bell 🔔. Great talk - thank you very much for all your efforts!
@abby5493 2 years ago
Wow! What an amazing video! Best one yet!
@juusokorhonen1628 2 years ago
Came here from a Lex Fridman video, and gotta say these make the perfect combination (especially now that Lex has branched into some topics outside AI). Keep delivering this fantastically specialized content👍
@nomenec 2 years ago
Thank you, Juuso! I really appreciate that. Tim and I often struggle with finding the right balance while keeping it (hopefully) entertaining. It's not easy and we are also trying to brainstorm on ways to improve. So, it's great to hear from a satisfied viewer!
@NelsLindahl 2 years ago
Your build quality here is really high. Nice work. My only comment on this video was that I had to give parts of this video my full attention. That is probably a good thing.
@nomenec 2 years ago
Lol, cheers and thank you!
@hossromani 8 months ago
Great video - excellent conceptual discussion
@michaelwangCH 2 years ago
I spent the past two years in uni and attended all the ML- and AI-related classes to try to understand DNNs, because no one in the CS department could answer my questions in a way that let me intuitively understand what a DNN is doing, and how and why it is doing it. Tim, thanks for the enlightening explanation.
@TimScarfe 2 years ago
Thanks a lot Michael! But don't thank us too much, most of this wisdom is coming directly from Chollet, Balestriero and LeCun we are just digesting their fascinating ideas and presenting them in the best way we can.
@scottmiller2591 2 years ago
Interesting talk - I'm working on a pile of notes, amplifications, and critiques.
@flooreijkelboom1693 2 years ago
Awesome! Thanks again for arranging this :) !
@sabawalid 2 years ago
Very, very good episode guys, kudos as always. I have a problem with LeCun's strong statement that "reasoning = optimization" (that most reasoning can be simulated by minimizing some cost). Inference/deduction is not optimization. That's not true at all.
@SisypheanRoller 2 years ago
Why is it not true?
@leonidas193 2 years ago
Great episode, keep up the good work. I agree with reasoning = optimization, at least for the reasoning that we currently do with machine learning. There is also a well-known result in optimization which states that separation = optimization, where separation means finding a separating hyperplane between a point and some convex hull. So in other words membership, or interpolation, is optimization. Many of these concepts have been well known in the optimization community for some time now. For instance, linear vs nonlinear or discrete vs continuous are known to be of little difference, while convexity is the main concept that makes things tractable. Also, the curse of dimensionality can be avoided if you formulate the problem combinatorially, as a graph for instance, which is dimensionless.
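The "separation = optimization" point above can be made concrete: finding the nearest hull point is an optimization over the simplex, and its solution yields a separating hyperplane. A toy sketch using the Frank-Wolfe method (our own illustration, not from the referenced literature):

```python
import numpy as np

def nearest_hull_point(X, p, iters=2000):
    """Frank-Wolfe on the simplex: min ||X.T @ lam - p||  s.t.  lam >= 0, sum(lam) = 1."""
    n = X.shape[0]
    lam = np.full(n, 1.0 / n)
    for k in range(iters):
        g = X @ (X.T @ lam - p)        # gradient of the squared distance w.r.t. lam
        i = int(np.argmin(g))          # best vertex of the simplex (one training point)
        step = 2.0 / (k + 2.0)         # standard diminishing Frank-Wolfe step size
        lam = (1.0 - step) * lam
        lam[i] += step
    return X.T @ lam

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))       # 50 points whose hull sits near the origin
p = np.full(10, 5.0)                    # a point far outside that hull
q = nearest_hull_point(X, p)
w = p - q                               # normal of a separating hyperplane
t = 0.5 * (w @ p + w @ q)               # threshold halfway between p and the hull
print("p side of plane:", w @ p - t)
print("max hull-vertex side:", (X @ w - t).max())
```

Solving the membership (interpolation) problem and solving the separation problem are the same computation: if the optimum distance is positive, the vector from the projection to the point is a separating hyperplane's normal.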
@crimythebold 2 years ago
Loved this one. Again
@youseftraveller2546 2 years ago
Great interviews, with many abstract ideas made simple; I want to wish you all great success, and I will wait for more interesting conversations to come. I am coming from a computational engineering background. In my field we are looking for models that can extrapolate for problems that are a mix of differentiable and discrete in nature. Is there any possibility of a future video that discusses the ideas of the current episode but oriented more towards computational engineering and physics problems? Thanks and Happy New Year
@UmairbinMansoor 8 months ago
The talk was beautifully presented... thank you all. My question is: why are we considering that the new sample (the test set) lies outside the convex hull of the training data, given that the dataset strictly represents a domain, like pictures with or without cats? My second question is: in signal processing, the impulse contains all the frequency content, which is why we characterize any filter by its impulse response. Having said that, for a particular domain, can we have a training set that completely characterizes the problem, and hence the ML model, which means any test data must then lie within the convex hull?
@filoautomata 2 years ago
42:31 - 42:41: back in 1993 there was an architecture called ANFIS that combined the interpretability and monotonicity of a fuzzy logic inference system with the adaptiveness of a neural network. ANFIS guaranteed a smooth, gradual change of prediction under slight modification of the input, thanks to the smoothness and monotonicity from fuzzy logic, while still being optimizable with a gradient-based optimizer if desired.
@SLAM2977 2 years ago
Great stuff guys, LeCun is next level!:)
@citizizen 2 years ago
Note@15 min: if you create hyperplanes, this, my guess, will partake into extra usable information per hyperplane. No proof though. Note@28 min: one OR at a time; not to give properties to objects such that you loose the 'single or instant'. Note@38min: Experience pays off. Note@:41min: "math lump", creating simple datasets and putting those together. Like a sentence of 'objects'. You play with the : "semantics". Note@45min: Can one throw an object through all of the information present at hand and see what it does? Like an analysis: (one object at a time (no dogma)), and see, which manifold is strong and which is not.. (to entangle time as it where (@ 46.50 min)) Note@46min: I simply love this video! Note@53min: So if we have a ball (lot of density), we could encode only its traits we want to have and work with that. Note@1:02min: You need to build from certain objects, only a single spot. Not an object you need to redraw in each case. Such that it can be applied. -(question) IF you are inspired at 50 minutes and see something at 60 for more inspiration and add it to the 50th minute inspiration. IS this wrong? -IS it possible to let some data collect some data over time and notice as it where where it is going. Perhaps even creating objects that are good in this and adding these to ones data analytic toolkit. Having one such single object, is interesting simply in itself. Perhaps creating a vocabulary of some kind??? Term: "dataplatonic" mindset @on the curse: : "jackpot ;-)" ,, Note@51min: i guess it is utile to acquire virtualized versions of objects. Such that the data takes account of 'objects', i.e. : terms. Like a circle or a square as circle and square. So, if we have a term, like a concept, we should generalize(?) it into something that we can use. So getting rid of 'drawing' objects... I guess a 'vocabulary' of a dataset is a nice concept as well.. How to make a concept. Keep track of it. Like: a point drawn, becomes a sphere. 
So if we create an animation, we re-encode this into data for data analysis... Perhaps even creating synesthesia for the sentences created. Such a 'gift', might parametrize for people watching. Current conclusion: Each thing you want to analyse needs to be built up itself, such that you do not take big objects but building block parameters.. Such the result is not about objects but building blocks that might be like bigger objects, but without the crap (data intensive). One wants to get rid of .. and let the computer do it. Building the right concepts by the computer and by guidence of the hand. Note@01:33min: if you got a function where the energy is understood (being zero). You can grow and shrink it and add it to 'a sentence'. Next you should be able to adapt (add substract) these and using such functions in line and create a kind of word sequence.
@rgarthwood3881 2 years ago
If you use ReLUs and simple feed-forward networks, yes, they're tessellations; but not with non-linear activation functions and inter-layer feedback connections. An example of the latter is the transformer hypothesis class.
@connorshorten6311 2 years ago
Amazing, congratulations!
@MachineLearningStreetTalk 2 years ago
Thanks Connor! We couldn't have done it without you!
@arvisz1871 2 years ago
Well done! 👍
@saundersnecessary 2 years ago
Just wonderful thank you!
@federicorios1140 2 years ago
This is fucking crazy, there's just no other way to put it. The idea of piecewise linearity of a neural network is the single biggest opening of the deep learning black box that I have ever seen
@nomenec 2 years ago
Cheers, Federico! I share your opinion as well; for me it was an eye opening view point.
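The piecewise linearity being discussed is easy to verify numerically: along any line through input space, a ReLU network's output has zero second difference except where the activation pattern changes, i.e. where the line crosses into a new linear region. A toy sketch with random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 16)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((16, 1)), rng.standard_normal(1)

def mlp(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU hidden layer
    return (h @ W2 + b2).ravel()

def pattern(x):
    return (x @ W1 + b1) > 0           # which units are active = which region we're in

# Walk along a line segment in input space.
ts = np.linspace(0.0, 1.0, 2001)
a, c = np.array([-2.0, -2.0]), np.array([2.0, 2.0])
xs = a + ts[:, None] * (c - a)
ys = mlp(xs)

# Second differences vanish wherever three consecutive points share one activation
# pattern: the network is exactly linear inside each region, and kinks only appear
# at region boundaries.
second_diff = np.abs(ys[:-2] - 2 * ys[1:-1] + ys[2:])
same_region = np.all(pattern(xs[:-2]) == pattern(xs[2:]), axis=1)
print("max |2nd diff| within a region:", second_diff[same_region].max())
print("max |2nd diff| overall:        ", second_diff.max())
```

Within a region the second difference is floating-point noise; it is nonzero only at the boundaries where a hidden unit switches on or off, which is exactly the tessellation-of-linear-pieces picture from the episode.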
@fredericln 2 years ago
Great, great talk! My reaction is based on the first 37' first, but before I go to sleep and forget… two (very non-expert) cents. 1) around 15', you say that NN basically try to find boundaries and don't care about the internal structure of classes. How far does this hold? Loss functions do take into account how far the data point is from the boundary of the class (how dog-typical this dog is, etc.). For sure this is only one tiny part of what 'class structure' can encompass. 2) (I'm quite sure I will find the answer in the remaining part, but) ReLU are different from previous, e.g. logistic, activation functions, which were basically smoothed separators, smoothed piecewise constant functions. ReLU are not constant on the x>0 side :-) - which I found dangerous at first (how far will this climb? how much will a single ReLU influence the outcome, on out-of-distribution test points?) - but doesn't *that* add to the ability to extrapolate, i.e. to say things about what happens far from the convex hull of training points?
@citizizen 2 years ago
@1:36min: if we create labels for important stuff. These can be used again. Kind of 'meta propagation'. To be able to take something up. Building up a vocabulary. Note: IF we can have a tiny center where a lot can happen. This can be applied on say: a hand or a foot. If we have A and B connected, we do not need all that happens in between. I guess one wants to create something that is applicable everywhere. Teleportation. @1:46min: something differentiated, and molded together with related stuff (not yet known). Like velocity and acceleration together with the images related to it. Next normalize such information, into single principles (i guess normalization and making objects with what is normalized might be a way of creating : concepts). Note: IF the will can be defined as 'one or a couple of objects, taken together at once', then you must be able to work with such (like how to work in a database). Perhaps apply it as a regular expression? This can become very very agressive, and thus interesting. Note: a language such that we can derive where the machine is about. Like: visualizing what happens. (disentangle). Normalization. To normalize a principle. PErhaps making a database of normalized principles. @1:56min: perhaps create classes, like : per dimension a way to go about. @02:00min: MAtch! Got the same idea somewhat. Note: a language that generates generation 5 programming languages (relational language). Then terms normalized, put in a dataset. So, with a proper 'calculus', one can create discrete' objects, like: if it repeats a pattern on itself again: one needs 2 circles. (example). You do not need to know everything, If you get a couple of dimensions you work in. Like: 1, 2 and 4. Then this can be called discrete because you solve it with (underneath), these. I label this will because you can let those 3 work together and learn like that.
@Hexanitrobenzene 1 year ago
~2:55:00 I think the discrete vs continuous dichotomy is not so absolute. The human brain seems to be an analog system, but it can emulate discrete reasoning. Computers are discrete machines, but with neural networks they can emulate continuous reasoning. The main problem seems to be efficiency: emulating one via the other is extremely inefficient, which is why Dr. Balestriero noted that a hybrid system would be the most efficient. EDIT: Yup, a little later Keith noted that, too.
@democratizing-ai 2 years ago
When will Jürgen follow? :)
@SimonJackson13 2 years ago
A discrete attraction chaoform. Convergence to attractor locations as solutions of time series. Then a disjunct split and fold to exceptional zones surrounding expected precursors to exception. Then train for drop errors triggering exceptional close zone to chaoform large split discreet?
@cambrinus4
@cambrinus4 2 жыл бұрын
Great and very inspiring interviews. Thank you! I wonder how to explain the fact that CNNs learn very practical features in their first layers, like edge detectors and texture detectors, from the perspective of spline theory (I mention these because we know what they do and that they are present in NNs). Of course we know that they are used by NNs to split the latent space, but I think the fact that NNs are able to figure out such specific features at all is enough of a qualitative difference compared to decision trees to question whether the analogy to decision trees makes sense at all. Yann LeCun claims that in high-dimensional spaces everything is an extrapolation; I think it's valid to ask whether in high-dimensional spaces everything is decision tree-like hyperplane splitting.
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
Thanks guys
@EngineeringNibbles
@EngineeringNibbles 2 жыл бұрын
Amazing video, the bass is super high though! Wish it was a little lower as it requires manual EQ
@User127169
@User127169 2 жыл бұрын
Dr. Randall is saying that even in the generative setting, in a GAN's latent space (which has a large number of dimensions), there is no interpolation (due to the curse of dimensionality, of course). What then explains why these models even work, and how do they manage to generate new examples? I can't quite figure it out. Great video, enjoyed it!
@SimonJackson13
@SimonJackson13 2 жыл бұрын
The surface dividing the training set in two? How many would there be and are some better to consider as AND with the "search term"? Multiple max entropy cosearch parallelism?
@gren287
@gren287 2 жыл бұрын
Reflected ReLU > ReLU 😎 I want a neural network from you Tim ❤
@SimonJackson13
@SimonJackson13 2 жыл бұрын
The converdivergence of x^n at x=1 saddle unstable point even implies input scaling has a convergence implication on a polynomial fit.
@user__214
@user__214 Жыл бұрын
Great video! Here's a question I have after reading the papers, if anybody can help me: Hypothetically, if, say, the MNIST digits *did* lie on a lower-dimensional manifold, then by definition all new data points would fall on that manifold, right? So in the Extrapolation paper, when they show in Table 1 that the test set data doesn't even fall within the convex hull of the ResNet *latent space*, this must mean either 1) ResNet is doing a poor job of learning the true latent space, or 2) MNIST digits do not actually fall on a lower-dimensional manifold. Is that right?
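The convex hull test this comment refers to can be made concrete: membership in a convex hull is just a linear feasibility problem. A minimal sketch (assuming NumPy and SciPy; the data here is synthetic Gaussian, not ResNet features) showing how quickly fresh samples leave the hull as dimension grows:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, points):
    """True if `point` is a convex combination of the rows of `points`.
    Solved as a linear feasibility problem: find w >= 0 with
    points.T @ w = point and sum(w) = 1."""
    n = len(points)
    A_eq = np.vstack([points.T, np.ones(n)])
    b_eq = np.append(point, 1.0)
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success

rng = np.random.default_rng(0)
for d in (2, 50):
    train = rng.normal(size=(500, d))   # "training set"
    test = rng.normal(size=(100, d))    # fresh samples from the same distribution
    inside = sum(in_convex_hull(t, train) for t in test)
    print(f"d={d}: {inside}/100 test points inside the hull")
```

In 2 dimensions most test points land inside; in 50 dimensions essentially none do, which is the paper's point: being outside the hull (their definition of extrapolation) is the default in high dimensions, so it can't by itself tell you the model failed to learn the manifold.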
@willd1mindmind639
@willd1mindmind639 2 жыл бұрын
I believe most of this is just a byproduct of the fact that brain neurons operate in analog space while computer neural networks are digital, which is an approximation of analog data where sampling is always relevant. The other issue is that all data in neural networks are collected together in a single bucket with a single answer for various learned scenarios, whereas in the brain things are much more decomposed into component pieces or dimensions which become inputs into higher-order reasoning processes. And this is what leads to the human cognitive evolution that creates language from symbols, with embedded meanings, and things like numbers and mathematics. An analogy: each individual Arabic numeral has a distinct identity function (a learned symbol pattern recognition) corresponding to a set of neurons in the brain. Separate from that, you have another set of neurons that have learned the concept of numbers and can associate it with the symbol for a number. And separate from that, there is a set of neurons that have learned the principle of counting associated with numbers. That is a network of networks that work together to produce a result. And as such the brain can learn and understand linear algebra and do calculations with it because of the preservation of low-level atomic identity functions or logic functions that are not simple statistics problems. Meaning the brain is a network of networks where each dimension is a distinct network unto itself, as opposed to a singular statistical model.
@nomenec
@nomenec 2 жыл бұрын
I think that's a very nice way of looking at things. In a sense, NNs breaking up the space into polyhedra is like a simple hacked version of a network of networks. They are encoding little subunit networks, by virtue of the ReLU activations, that are then forced into shared latent value array representations. That introduces artifacts and isn't as flexible as networks of networks. The killer for trying to train networks of networks is the combinatorial blowup that happens when exploring the space of all possible connection configurations. And it's why so much of what makes NNs work today is actually the human engineering of certain network architectures that structurally hardcode useful priors. Great comment, thank you!
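The "polyhedra" picture in this reply is easy to probe numerically: each distinct on/off pattern of the ReLUs corresponds to one affine piece of the network. A rough sketch (untrained random weights, purely illustrative, NumPy only) that counts how many pieces a tiny 2-input MLP exposes over a grid:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical untrained 2-16-16 ReLU MLP with random weights.
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def activation_pattern(x):
    """The on/off state of every ReLU; one pattern = one affine region."""
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0.0) + b2
    return tuple((np.concatenate([h1, h2]) > 0).astype(int))

# Count distinct patterns hit by a dense grid on [-3, 3]^2.
xs = np.linspace(-3, 3, 150)
regions = {activation_pattern(np.array([a, b])) for a in xs for b in xs}
print(len(regions))  # number of affine pieces sampled by the grid
```

Each of those patterns is one of the "little subunit networks": within a region the whole network collapses to a single affine map.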
@willd1mindmind639
@willd1mindmind639 2 жыл бұрын
@@nomenec Thanks. It is definitely a much simpler quantification effort for dealing with probability calculation within a bounded context as defined by the algorithmic model, the data provided for training, and the tuning of calculations. However, there is no reason not to investigate more open-ended architectures, especially as a thought exercise of how such a thing would be possible.
@ClaudeCOULOMBE
@ClaudeCOULOMBE 2 жыл бұрын
Enlightening episode! A bit long, but an exciting subject... I would have appreciated having François Chollet in this debate. Unfortunately, the elephant is not in the room...
@PhilipTeare
@PhilipTeare 2 жыл бұрын
does GELU not smooth these polyhedra from a geodesic structure into a continuous smooth manifold?
@BuFu1O1
@BuFu1O1 Жыл бұрын
Part 3 on the curse of dimensionality 🤯
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
What they have established is that the order of the equations matters. This was already a basic tenet of symbolism.
@dr.mikeybee
@dr.mikeybee 2 жыл бұрын
I wonder if a machine's ability to find a non-linear function or to integrate one would be analogous to what Stephen Wolfram calls computational reducibility? Certainly, an agent can call a non-linear function rather than a piece-wise linear model.
@ShayanBanerji
@ShayanBanerji 2 жыл бұрын
Discussions from the hallowed halls of academia brought to YouTube. Or better? You 3 are setting very high standards!
@jacquesgouimenou9668
@jacquesgouimenou9668 2 жыл бұрын
wahoo! it's amazing
@PhilipTeare
@PhilipTeare 2 жыл бұрын
I didn't catch the new non-contrastive method Yann mentions after BYOL and Barlow Twins. Does anyone know?
@CharlesVanNoland
@CharlesVanNoland Жыл бұрын
For inputs that lie within the training data it's an ellipsoid. For inputs that lie outside of the training data I imagine more of a paraboloid. It seems like data could lie both inside of training data in some dimensions and outside in other dimensions, which makes it some kind of ellipsoid paraboloid hybrid. Is this a thing?
@muhammadaliyu3076
@muhammadaliyu3076 2 жыл бұрын
Where have you been?
@nomenec
@nomenec 2 жыл бұрын
A tremendous amount of work went into this show let alone the MLST channel as a whole. Good things take time. Thank you for your patience and continued viewership!
@funkypipole
@funkypipole 2 жыл бұрын
Great! Why not consider inviting Prof. Jerome Darbon from Brown University? He always has bright views on that topic!
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
Can anyone discuss or comment on extrapolation in the context of projection volume? Or in more than 3D?
@jabowery
@jabowery Жыл бұрын
Imputation can make interpolation appear to be extrapolation. But more importantly, people don't understand the relationship between interpolation, extrapolation, and the Chomsky hierarchy. You simply cannot do extrapolation with context-free grammars. Transformers are capable of context-free grammars, not more.
@MachineLearningStreetTalk
@MachineLearningStreetTalk Жыл бұрын
Thanks James! According to arxiv.org/pdf/2207.02098.pdf, transformers map to a finite-state automaton computational model with no augmented memory and can recognise finite languages only
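For readers who want the finite-state claim in concrete terms: a regular language such as "binary strings with an even number of 1s" is recognised by a two-state automaton with constant memory. The sketch below (plain Python, purely illustrative) is the kind of computational model the cited paper maps transformer capabilities onto:

```python
def accepts_even_ones(s: str) -> bool:
    """Two-state DFA: state 0 = even number of 1s seen so far, state 1 = odd."""
    state = 0
    for ch in s:
        if ch == "1":
            state ^= 1          # flip parity on each '1'
    return state == 0           # accept iff we end in the even state

print(accepts_even_ones("1001"))  # → True  (two 1s)
print(accepts_even_ones("1011"))  # → False (three 1s)
```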
@NehadHirmiz
@NehadHirmiz 2 ай бұрын
at 3:04:08 Yannic foreshadowing "active inference" 😁
@SimonJackson13
@SimonJackson13 2 жыл бұрын
Extrapolation is interpolation where one endpoint is magnified by some potentiate of infinity controlled by end zone locking. The outer manifold potentiate?
@SimonJackson13
@SimonJackson13 2 жыл бұрын
The reflectome of the outer manifold into the morphology of the inner trained manifold to achieve greater performance from the IM. The focal of the reflectome as a filter to multibatch the stasis of the correct?
@larrybird3729
@larrybird3729 2 жыл бұрын
Hmm... what word do we use for an interpolation between interpolation and extrapolation 🤪
@henryzwart1024
@henryzwart1024 2 жыл бұрын
I'm only 20 minutes in, so will remove this comment if it's answered in the discussion, but.... How do you think smooth activation functions (e.g. ELU) would affect the polyhedral covering of the feature space? If ReLU functions create hard boundaries between separate polyhedra, would smooth functions create smooth boundaries? Or perhaps weighted combinations of polyhedra?
@dr.mikeybee
@dr.mikeybee 2 жыл бұрын
If high-dimensional spaces only have varying gradients in 16 or fewer dimensions, doesn't that suggest that principal component analysis should always be run?
@nomenec
@nomenec 2 жыл бұрын
Do you mean to run PCA on the ambient space and then throw away all but the top-K eigenvectors? Or just run PCA and use the entire transformed vector as input data points instead of raw data points? If the former, I guess the fear (probably justified) is that we'd be subjecting the entire data set to a single linear transform and possibly throwing out factors that are only useful in smaller subsets of the data. Instead, NNs are able to chop up the space and use different transforms for different regions of the ambient data space. In a sense, they can defer and tweak decisions to throw out factors/linear combinations. That chopping, i.e. piecewise, capability seems an essential upgrade over using only a single transform for the entire data space. If the latter, we'd just be adding another matrix multiplication to a stack of such, and it wouldn't change much beyond perhaps numerical stability or efficiency, since NNs are of course capable of finding any single linear transform, including a PCA projection. In a way, it's related to all the various efforts at improving learning algorithms by tweaking gradients, Hessians, etc. In the end, in practice, most found that doing something super simple at GPU scale was faster; I'm not sure about the state of the art in numerical stability, though.
@dr.mikeybee
@dr.mikeybee 2 жыл бұрын
@@nomenec I mean throw away the input data that isn't significant. Among other things, it will make smaller faster models. I hadn't heard that for really high dimensional data only 16 or fewer dimensions matter. If I'm not misunderstanding this, which I may very well be, doing PCA first makes a lot of sense. It takes me time to wrap my head around anything, and I'm often far off the mark anyway. Still, this seems logical.
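As a sketch of what that preprocessing looks like (synthetic data, NumPy only, not anything from the episode): PCA via the SVD of the centered data matrix, keeping the top-k components.

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                      # scores in the top-k subspace

rng = np.random.default_rng(0)
# Hypothetical data: 3 latent factors embedded in 100 ambient dimensions + noise.
Z = rng.normal(size=(500, 3))
A = rng.normal(size=(3, 100))
X = Z @ A + 0.01 * rng.normal(size=(500, 100))

Xk = pca_reduce(X, 3)
# Variance explained by 3 components should be near 100% here.
var_kept = np.var(Xk, axis=0).sum() / np.var(X - X.mean(0), axis=0).sum()
print(round(var_kept, 4))
```

Note this applies one linear transform to the whole data set, which is exactly the limitation raised in the reply above: an NN can instead use different transforms in different regions.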
@XOPOIIIO
@XOPOIIIO 2 жыл бұрын
I would like to see this "absolute dog". I think it's possible to generate one: just reward the GAN both for realism and for activation of a particular neuron. I wonder what the doggest dog ever would look like. I would also like to see the doggest cat, and the cattest dog.
@lenyabloko
@lenyabloko 2 жыл бұрын
You should invite Guy Emerson from the Department of Computer Science and Technology, University of Cambridge.
@MachineLearningStreetTalk
@MachineLearningStreetTalk 2 жыл бұрын
He looks great, we would love to have him on!
@dennisestenson7820
@dennisestenson7820 2 жыл бұрын
40:51 Doesn't this suggest that the input data should be transformed into a reduced dimension before training on it? Using MNIST digits, for example, the raw pixels could be transformed into the sequence of pen strokes that composed the written symbol. This might have dimensionality around a dozen rather than 784. Obviously, finding that transformation wouldn't be trivial. However, it could also allow generative models to create more realistic interpolations.
@jovian304
@jovian304 2 жыл бұрын
Finally!!
@mfpears
@mfpears 2 жыл бұрын
28:10 This isn't how humans understand physics. Really, really good video though.
34:00 54:30 It's cool that humans still understand a lot though. The possibilities in the universe are massively constrained by the fact that nothing is generated outside of physical laws.
1:00:00 The limitation that deep learning can't extrapolate
1:03:30 Extrapolation = reasoning? So can they reason?
1:03:50 No
1:05:50 Supervised learning is the thing that sucks
1:06:50 Geoff Hinton thinks general unsupervised first, specialization after
1:12:00 RBF network
1:15:00 Different definitions of interpolation
1:22:50 Latent contrastive predictive models
1:25:00 New architectures that aren't contrastive have come out
1:29:30 No, they will be able to reason
1:30:00 What would prove that neural networks can reason?
1:35:30 RNNs are the networks that can train with a variable number of layers
1:37:28 Nobody can train a neural net to print the nth digit of pi (I can). Yeah, once we figure out basic things we might be able to try mathematical concepts.
1:45:00 System 1 and 2 in chess and driving
2:07:10 Convolution is nothing more than a performance optimization by giving the network pre-knowledge that interesting features are spatially local. A lot of tearing down of not-100%-correct analogies of neural networks and what might actually model them well.
2:30:30 It's impossible for a neural network to find the nth digit of pi
2:34:45 Discrete vs smooth... Have both systems? (Actions distill, Jordan Peterson)
2:36:30 (The real world is limited) Is it because neural nets only use textures? No, resolution is low, or it would blow up. (Man, that accent was tough for me)
2:45:30 Summary of that last interview. Intuition is fine, but mathematical rigor doesn't apply well with that definition
2:47:30 We need a better definition of what kind of interpolation is happening, and that will help us progress
2:50:00 It's hard to figure out where researchers exactly disagree because of politeness
2:53:00 It's all about pushing back on the limitation that neural networks can't extrapolate
2:54:40 Digits of pi again. It's not what he's talking about actually, too advanced. He's talking about a cat jumping in a place it's never seen before (Tesla predicts paths of cars). He thinks eventually we'll get there, but I'm not as optimistic.
2:56:00 There's an article by Andre Ye that annoyed him because it invoked interpolation vs extrapolation to say they'll never do it, which is the real question
2:57:10 At the end of the day, neuron signals are continuous functions, but somehow they produce digital reasoning. But will it be efficient?
2:59:00 But there is no discrete thing (actions)
3:00:40 (There you go. Yes. It's going to be hard. But that's the only way for a neural network to do it, and calculators aren't going to discover profound truths.)
3:01:30 (Omg, it feels like they're starting to think the way I do about it. System 1, system 2) It's insanely powerful to train a discrete algorithm on top of a neural network. Longer-term possibility.
3:05:00 Underexplored. Feature creep? (No! That's insane. Is general intelligence feature creep?)
3:06:30 Hard to train (it seems the opposite to me). Getting to do discrete stuff involves lots of hacks.
3:07:30 "TABLE 1" Attacks attack on paper
3:17:00 You can't initialize a neural net with zeros
3:18:00 We're comparing neural nets to the entire human race and its culture and inheritance
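Since the "nth digit of pi" challenge comes up several times in these notes, it's worth seeing the discrete algorithm that actually does it: the Bailey–Borwein–Plouffe formula extracts the n-th *hexadecimal* digit of pi directly, with modular exponentiation doing the heavy lifting. A compact sketch (standard BBP, not anything from the episode; accurate for small n with double precision):

```python
def pi_hex_digit(n):
    """n-th hexadecimal digit of pi after the point (n=1 gives '2')."""
    def S(j, n):
        # Fractional part of sum_k 16^(n-1-k) / (8k + j).
        s = 0.0
        for k in range(n):
            s = (s + pow(16, n - 1 - k, 8 * k + j) / (8 * k + j)) % 1.0
        t, k = 0.0, n
        while True:
            term = 16.0 ** (n - 1 - k) / (8 * k + j)
            if term < 1e-17:
                break
            t += term
            k += 1
        return (s + t) % 1.0
    x = (4 * S(1, n) - 2 * S(4, n) - S(5, n) - S(6, n)) % 1.0
    return "%X" % int(x * 16)

print("".join(pi_hex_digit(n) for n in range(1, 9)))  # → 243F6A88
```

This is a textbook example of the kind of symbolic, discrete computation under discussion: trivial for a short program, hard to get a plain neural network to reproduce exactly.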
@siarez
@siarez 2 жыл бұрын
So if I replace ReLU with Softplus, does that break their arguments?
@nomenec
@nomenec 2 жыл бұрын
On the one hand, sure, arguments which explicitly state repeatedly that they apply to piecewise linear functions do not immediately apply outside that assumption. On the other hand, that is not evidence against a more general interpretation of the conclusions, and there is soft evidence that in many problem spaces NNs, regardless of their activation functions, are driven towards chopping up space into affine cells. Some examples of such soft evidence are 1a) the dominance of piecewise linear activation functions overall, or otherwise 1b) the dominance of activation functions that are asymptotically linear (including your softplus example), and 2) the dominance of NN nodes structured as nonlinear functions of linear combinations, as opposed to nonlinear combinations. The consequence of 2) is that the softness still falls along a linear hyperplane boundary! And given 1b), there is a distance at which it effectively behaves as a piecewise linear function. It becomes a problem-specific empirical question how much an NN actually leverages the curvature versus the asymptotic behavior. My claim, and it's just a gut conjecture based on soft evidence at this point, is that for most problems and typical NNs, any such activation function curvature is incidental and/or sometimes useful for efficient training, and that's why ReLU and the like, which "abandon all pretense at smooth non-linearity", as I said in the video, are dominating.
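The "effectively behaves as a piecewise linear function" point can be checked in a few lines: the scaled softplus, softplus_β(x) = log(1 + e^{βx})/β, converges uniformly to ReLU, with maximum gap log(2)/β attained at x = 0. A quick numerical sketch (NumPy, illustrative only):

```python
import numpy as np

def softplus_beta(x, beta):
    # (1/beta) * log(1 + exp(beta * x)), computed stably with logaddexp
    return np.logaddexp(0.0, beta * x) / beta

x = np.linspace(-5, 5, 1001)
relu = np.maximum(x, 0.0)
for beta in (1, 10, 100):
    gap = np.max(np.abs(softplus_beta(x, beta) - relu))
    print(f"beta={beta}: max |softplus - relu| = {gap:.5f}")  # gap = log(2)/beta
```

So whether a softplus network "uses" its curvature is a question of scale: once pre-activations are large relative to 1/β, it is operating in its piecewise linear regime.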
@alivecoding4995
@alivecoding4995 Жыл бұрын
@MLST: Do you guys still follow this line of thinking; or did you end up settling on yet another interpretation recently? 😊
@MachineLearningStreetTalk
@MachineLearningStreetTalk Жыл бұрын
More or less! We are going to drop another show with Randall B. next month; we now think that MLPs are more extrapolative than previously thought (along the lines of Randall's/Yann's "everything is extrapolation" paper)
@alivecoding4995
@alivecoding4995 Жыл бұрын
@@MachineLearningStreetTalk Thank you very much for taking the time to answer my question. 😌👍
@DavenH
@DavenH 2 жыл бұрын
Ho ho! Here we go!
@SimonJackson13
@SimonJackson13 2 жыл бұрын
An effective interpolation? Do all interpolations have to be effective? Is it still not an interpolation between even if inaccurate?
@ketilmalde3402
@ketilmalde3402 2 жыл бұрын
Great stuff, extremely interesting topic and strong content. But is there a version without the background music - podcast or video? I'm probably just a grumpy old fart, but I find it really hard to concentrate, it's like trying to follow a conversation while somebody is simultaneously licking my ear.
@MachineLearningStreetTalk
@MachineLearningStreetTalk 2 жыл бұрын
drive.google.com/file/d/16bc7XJjKJzw4YdvL5rYdRZZB19dSzR70/view?usp=sharing here is the intro (first 60 mins) with no background music, you old fart! :)
@killaz5526
@killaz5526 9 ай бұрын
I have no idea what any of this is about, I was watching a video by the channel Food Theory talking about a McDonald's conspiracy, then woke up 4 hours later to the end of this. Autoplay can be quite mysterious.
@MachineLearningStreetTalk
@MachineLearningStreetTalk 9 ай бұрын
LOL!
@CristianGarcia
@CristianGarcia 2 жыл бұрын
It happened!!!
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
Does the guy with the sunglasses have eye problems?
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
I still haven't really absorbed the interpolation vs extrapolation argument.
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
I think I understand the bimodal discussion of high-dimensional interpolation and extrapolation. Linear regression is a fitting of an interpolated volume in 3 dimensions, while extrapolation is any 3-dimensional value outside the interpolated volume.
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
I still don’t understand his argument.
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
How would this information change the direction of systematic AGI?
@Lumeone
@Lumeone 2 жыл бұрын
I wonder, do linear functions exist in the universe?
@nomenec
@nomenec 2 жыл бұрын
There are certainly linear relationships at various levels of physical description: the first and second laws of thermodynamics, the Schwarzschild radius vs mass, photon energy vs frequency, etc. Whether or not any of these actually "exist in the Universe" is something philosophers have argued for at least millennia and probably will argue until heat death. From my perspective, they "exist" as epistemic descriptions of emergent phenomena.
@taiducnguyen7694
@taiducnguyen7694 2 жыл бұрын
E = mc^2, F = ma, KE = (1/2)mv^2, tons in thermodynamics, notably W = -PΔV... basically, the most fundamental ones are often linear
@autobotrealm7897
@autobotrealm7897 Жыл бұрын
so addictive
@rafaeldelaflor
@rafaeldelaflor Жыл бұрын
Yeah, all the functions are linearities, but how does that describe the human mind or an animal mind?
@jovian304
@jovian304 2 жыл бұрын
🤩