Concept Learning with Energy-Based Models (Paper Explained)

31,782 views

Yannic Kilcher


This is a hard paper! Energy functions are typically a mere afterthought in current machine learning. A core property of the energy function - its smoothness - is usually not exploited at inference time. This paper takes a stab at it. Inferring concepts, world states, and attention masks via gradient descent on a learned energy function leads to an interesting framework with many possibilities (see the sketch after the links below).
Paper: arxiv.org/abs/...
Blog: openai.com/blo...
Videos: sites.google.c...
Abstract:
Many hallmarks of human intelligence, such as generalizing from limited experience, abstract reasoning and planning, analogical reasoning, creative problem solving, and capacity for language, require the ability to consolidate experience into concepts, which act as basic building blocks of understanding and reasoning. We present a framework that defines a concept by an energy function over events in the environment, as well as an attention mask over entities participating in the event. Given a few demonstration events, our method uses an inference-time optimization procedure to generate events involving similar concepts or to identify the entities involved in a concept. We evaluate our framework on learning visual, quantitative, relational, and temporal concepts from demonstration events in an unsupervised manner. Our approach is able to successfully generate and identify concepts in a few-shot setting, and the resulting learned concepts can be reused across environments. Example videos of our results are available at this http URL
Authors: Igor Mordatch
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.c...
Minds: www.minds.com/...
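To make the framework concrete, here is a minimal sketch (not the paper's code) of the inference-time optimization referred to above: a small stand-in energy network E(x, a, w) over a world state x, an attention mask a, and a concept code w, with an event generated by gradient descent on x. All shapes, the architecture, and the optimizer settings are illustrative assumptions; the paper's actual model and training procedure differ.

```python
# Minimal sketch (not the paper's code): infer an event x by gradient descent
# on a learned energy E(x, a, w), keeping the concept code w and attention
# mask a fixed. Shapes and architecture are illustrative assumptions.
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    def __init__(self, n_entities=4, state_dim=2, concept_dim=8):
        super().__init__()
        # A tiny MLP standing in for the learned energy function E(x, a, w).
        self.net = nn.Sequential(
            nn.Linear(n_entities * state_dim + n_entities + concept_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x, a, w):
        # x: (batch, n_entities, state_dim)  world state / event
        # a: (batch, n_entities)             attention mask over entities
        # w: (batch, concept_dim)            concept code
        inp = torch.cat([x.flatten(1), a, w], dim=-1)
        return self.net(inp).squeeze(-1)    # scalar energy per example

energy = EnergyNet()

# Generation: start from a random event and descend the energy w.r.t. x.
x = torch.randn(1, 4, 2, requires_grad=True)
a = torch.ones(1, 4)                 # attend to all entities (assumption)
w = torch.randn(1, 8)                # some concept code (assumption)

opt = torch.optim.SGD([x], lr=0.1)
for _ in range(50):
    opt.zero_grad()
    energy(x, a, w).sum().backward() # dE/dx
    opt.step()                       # move x toward lower energy

print("final energy:", energy(x, a, w).item())
```

Identification (inferring the attention mask a) or concept inference (inferring w) would follow the same pattern, just descending with respect to a or w instead of x.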

Comments: 53
@Chhillee 4 years ago
1:16 Overview of energy-based models
15:20 Start of the paper
30:05 Experiments
@BJCaasenbrood 4 years ago
Love your style of presenting papers, it's clear and well-structured. Keep up the good work!
@PeterOtt 4 years ago
I'm 10 minutes in, and so far this is a great summary of what energy-based learning is. I'd heard the name but had no idea what it was before now!
@TijsMaas 4 years ago
Nice explanation. I was going to say this reminds me a lot of neuro-symbolic concept learning; however, I just found out this work was published before NS-CL.
@herp_derpingson 4 years ago
Once we have large enough GPUs, this could be a game changer for abstract reasoning problems and procedurally generated matrices.
@matterhart 4 years ago
Best video yet, the longer intro was totally worth it. Timestamps would be great though.
@jugrajsingh3299 4 years ago
Great way of presenting the link between existing knowledge and the problem and solution addressed by the paper, in a simple way 😃
@slackstation 4 years ago
Great explanation. It's both a new and mathematically challenging concept. Thank you for making it so clear.
@vojtechkubin1590 4 years ago
Genius design, beautiful presentation! But there is one thing I don't understand: "why only 11.5K subscribers"?
@YannicKilcher 4 years ago
It's a very exclusive club ;)
@snippletrap 4 years ago
Almost quadrupled by now
@jrkirby93 4 years ago
So if I understand correctly, w is rather arbitrary. It depends specifically on the energy function and how it's trained. I guess if they have n concepts to learn, they make w an n-dimensional vector and encode each concept as one-hot. This paper does not explore out-of-distribution concepts, but I suppose theoretically you could interpolate them.

In all these problems the elements of x are positionally independent. If you swap the first and last element of x, and swap the first and last element of the attention vector, you ought to get the same result. Do they test that this is true in practice? Does this technique require positional independence? Could enforcing positional independence more strictly give performance benefits?

If you make a neural net piecewise linear the entire way through, you can compute the loss (or energy) as an exact function of a single parameter, and find the minima of that function in a computationally efficient manner. This is the key component of my current research. I wonder if this concept learning would benefit from attempting that instead of gradient descent.
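A quick way to check the positional-independence question above, as a hedged sketch: the toy set-based energy here is only a stand-in (the paper's actual energy function is not reproduced), but it shows the property being asked about, namely that permuting the entities in x together with the attention mask a leaves the energy unchanged.

```python
# Sketch of a permutation-invariance check with a toy set-based energy.
import torch

def energy(x, a, w):
    # Attention-weighted mean of entity states compared against the concept
    # vector w (illustrative only, not the paper's energy function).
    pooled = (a.unsqueeze(-1) * x).sum(1) / a.sum(1, keepdim=True)
    return ((pooled - w) ** 2).sum(-1)

x = torch.randn(1, 5, 3)      # 5 entities with 3-d states
a = torch.rand(1, 5)          # attention mask over entities
w = torch.randn(1, 3)         # concept vector (same dim as pooled state here)

perm = torch.randperm(5)
e1 = energy(x, a, w)
e2 = energy(x[:, perm], a[:, perm], w)
print(torch.allclose(e1, e2))  # True for a genuinely set-based energy
```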
@vsiegel 3 years ago
A good example of something mind-bending: imagine a differentiable cat.
@welcomeaioverlords 4 years ago
Super interesting, thanks for breaking it down!
@CristianGarcia 4 years ago
Great video! This reminds me of the differentiable ODEs paper.
@pastrop2003 3 years ago
Somehow I'm starting to think that if this model were developed further and married with a lot of compute, we might get something that looks like AGI?????
@nbrpwng 4 years ago
So if it can be considered that inferring an x or a or w from the others, using an existing energy function, is “learning”, then maybe learning the energy function parameters is “meta learning” in a way? But maybe not, and I guess it’s just a less important matter of definition.
@YannicKilcher 4 years ago
That's a good observation! It's maybe a bit tricky to call that learning, as we usually understand learning as something that has a lasting impact over time. Here, once you have inferred an x, you go to the next one and start over.
@sarvagyagupta1744 2 years ago
This is a great paper, and great job explaining it. While watching, I kept wondering whether this is the idea behind the attention mechanism.
@lislouise2305 3 years ago
I'm not sure I understand. So a deep neural network is an energy-based model because you want to minimize the loss? Then deep learning models are just energy-based models and there's no difference?
@Laszer271 4 years ago
Great presentation. I wonder how much time it took you to understand a paper like this (not counting the time to plan out the presentation).
@SergeyVBD 4 years ago
Using gradient descent at inference is not a new idea. I've definitely seen classic computer vision algorithms that do this. Is deep learning now considered classic machine learning lol? I think the main contribution of this paper is the formulation of these concepts. That seems promising.
@amrmartini3935 3 years ago
Yeah structured models and EBMs are full of this. The looped inference is a major bottleneck for any computational learning research in this area. It's why the computational community has moved away from PGMs in the first place.
@atursams6471 3 years ago
28:00 Could you explain what backpropagation through an optimization procedure means?
@amrmartini3935 3 years ago
Wouldn't a structured SVM framework provide a backprop-able loss that avoids having to backprop through SGD? You just need a solid (or principled) learning framework where a max/min/argmax/argmin is part of the definition of the loss function.
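A minimal sketch of what that could look like, assuming a toy energy network and random negatives (neither is from the paper): a hinge on the energy gap between a demonstrated event and a negative sample, so gradients flow only through the energy network itself rather than through an inner optimizer.

```python
# Sketch of a max-margin (structured-SVM-style) alternative: push the energy
# of demonstrated events below that of negatives by a margin. The toy energy,
# shapes, and negative-sampling choice are assumptions, not the paper's setup.
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))

def E(x, w):
    # x: (batch, 4) event, w: (batch, 2) concept code -> scalar energy
    return energy(torch.cat([x, w], dim=-1)).squeeze(-1)

x_pos = torch.randn(8, 4)   # demonstrated events
x_neg = torch.randn(8, 4)   # negatives (here just noise; in practice they
                            # might come from a sampler or an argmin search)
w = torch.randn(8, 2)

margin = 1.0
loss = torch.relu(margin + E(x_pos, w) - E(x_neg, w)).mean()
loss.backward()             # gradients flow only through E, not through SGD
```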
@joirnpettersen 4 years ago
Interesting concept. Do you know why it has the name "Energy" function? Is it like, the more energy the more unstable it is?
@YannicKilcher 4 years ago
I think it comes from physics. Think of the potential energy of a pendulum, for example. It will converge to the place where this energy is the lowest. I might be very wrong, though.
@joirnpettersen 4 years ago
@@YannicKilcher Oh yeah, of course. Like how Snell's law can be thought of as minimizing the travel time of the light ray (Fermat's principle).
@BJCaasenbrood 4 years ago
I agree with Yannic. The energy function is positive for all values of x and close to zero at an equilibrium. The name also implies that the unknown energy function E(x) is differentiable, in contrast to arbitrary objective functions in AI. In physics, one generally minimizes a potential energy function to solve complex nonlinear problems, also via gradient descent methods. The advantage is that the Hessian (i.e., the second derivative of E(x) w.r.t. x) is positive definite as long as the energy keeps increasing in every direction of x, similar to an elastic spring storing more energy the further you stretch it. An energy function, which is just a name for something with characteristics similar to potential energy, therefore offers good numerical stability and convergence!
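Numerically, the spring analogy above is just gradient descent on a quadratic potential; a tiny sketch:

```python
# Gradient descent on a spring potential E(x) = 0.5 * k * x**2 settles at the
# equilibrium x = 0, just as inference in an EBM settles at low energy.
k, x, lr = 2.0, 5.0, 0.1
for step in range(100):
    grad = k * x          # dE/dx for E(x) = 0.5 * k * x**2
    x -= lr * grad        # descend the energy
print(x)                  # ~0: the spring's rest position
```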
@snippletrap 4 years ago
@@YannicKilcher It does come from physics, but the lineage is through Hopfield nets and the Ising models that inspired them.
@sau002 1 year ago
Thank you.
@MrAlextorex 4 years ago
Found better justification in a slide here: when Y is high-dimensional (or simply combinatorial), normalizing becomes intractable... See: cs.nyu.edu/~yann/talks/lecun-20050719-ipam-2-ebm.pdf
@theodorosgalanos9663 4 years ago
Yannic, from your point of view, as a highly experienced researcher and a person who dissects papers like this 'for a living', how hard would it be to write the code for this one? I haven't found anything online, and I wonder if the reason it wasn't shared is that it's a bit difficult or hard to organize.
@YannicKilcher 4 years ago
No, I think as long as you have a clear picture in your mind of what the "dataset" is, and what counts as X and what as Y in each sample, you should be fine. The only engineering difficulty here is backpropagating through the inner optimization procedure.
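A minimal sketch of that engineering difficulty, assuming a toy energy network, an L2 outer loss against a demonstration, and a five-step inner loop (none of which are the paper's exact setup): the inner gradient steps are kept in the autograd graph with create_graph=True so the outer loss can backpropagate through them into the energy parameters.

```python
# Sketch: unroll a few inner gradient-descent steps on the energy and
# backpropagate the outer loss through them into the energy parameters.
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
outer_opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

x_target = torch.randn(1, 4)              # a demonstrated event
x = torch.zeros(1, 4, requires_grad=True) # starting point for inference

# Inner loop: a few differentiable gradient steps on E w.r.t. x.
inner_lr = 0.1
for _ in range(5):
    e = energy(x).sum()
    (grad_x,) = torch.autograd.grad(e, x, create_graph=True)
    x = x - inner_lr * grad_x             # x now depends on energy's params

# Outer loss: the inferred x should match the demonstration; its gradient
# flows back through all five inner steps into the energy parameters.
outer_loss = ((x - x_target) ** 2).mean()
outer_opt.zero_grad()
outer_loss.backward()
outer_opt.step()
```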
@xgplayer 4 years ago
But you do perform gradient descent when training the generator in the GANs framework, don't you?
@YannicKilcher 4 years ago
Yes, but the gradient descent isn't part of the model itself.
@Elstuhn 8 months ago
Bro I'm laughing so hard at 5:23 rn I'm so sorry for being so immature
@AI_ML_DL_LLM 5 months ago
EBM is coming to fruition considering the recent leak on Q*
@HB-kl5ik 5 months ago
The guy on Twitter who leaked that is stupid
@blizzard072 3 years ago
Wouldn't this somehow connect with the recent iMAML paper that you reviewed? Backpropagating through SGD seemed worth trying.
@patrickjdarrow 4 years ago
"you can gradient descent on colors" = 🤯
@vsiegel 3 years ago
You can even do that on cats!
@datgatto3911 4 years ago
Nice video. P.S.: "nice demonstration" of the GAN discriminator at 05:44 =)))
@AndreiMargeloiu 3 years ago
447 likes and 0 dislikes - truly incredible!
@Chhillee 4 years ago
This paper wasn't published anywhere, was it? I see an ICLR workshop version, but the full version doesn't seem to have been accepted at any conference.
@CristianGarcia 4 years ago
Welcome to arXiv
@kormannn1 6 months ago
this aged well
@snippletrap 4 years ago
5:25 slow down Yannic