
Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments (Review)

18,617 views

Yannic Kilcher

1 day ago

#multitasklearning #biology #neuralnetworks
Catastrophic forgetting is a big problem in multi-task and continual learning. Gradients of different objectives tend to conflict, and new tasks tend to override past knowledge. In biological neural networks, each neuron carries a complex network of dendrites that mitigates such forgetting by recognizing the context of an input signal. This paper introduces Active Dendrites, which carries the principle of context-sensitive gating by dendrites over into the deep learning world. Various experiments show its benefit in combating catastrophic forgetting while preserving sparsity and limited parameter counts.
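To make the gating idea concrete, here is a minimal NumPy sketch of one active-dendrites layer: each unit's feedforward response is modulated (via a sigmoid) by its best-matching dendritic segment for the current context vector, and a k-winners-take-all (kWTA) step keeps the layer sparse. The layer sizes, the max-over-segments selection rule, and all variable names below are illustrative simplifications, not the authors' reference code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def active_dendrites_layer(x, c, W, b, U, k):
    """One layer: feedforward response gated by dendritic segments, then kWTA.

    x : (d_in,)   input to the layer
    c : (d_ctx,)  context vector (e.g. a task embedding)
    W : (n_units, d_in)          feedforward weights
    b : (n_units,)               biases
    U : (n_units, n_segs, d_ctx) dendritic segment weights
    k : number of units kept active by kWTA
    """
    feedforward = W @ x + b                    # standard linear response
    seg_acts = U @ c                           # (n_units, n_segs) segment responses to the context
    strongest = seg_acts.max(axis=1)           # each unit uses its best-matching segment
    gated = feedforward * sigmoid(strongest)   # context-dependent modulation
    out = np.zeros_like(gated)                 # kWTA: keep only the k largest gated activations
    winners = np.argpartition(gated, -k)[-k:]
    out[winners] = gated[winners]
    return out

# Toy usage with random weights, purely to show the shapes involved.
rng = np.random.default_rng(0)
x, c = rng.normal(size=784), rng.normal(size=10)
W, b = 0.01 * rng.normal(size=(64, 784)), np.zeros(64)
U = 0.01 * rng.normal(size=(64, 4, 10))
y = active_dendrites_layer(x, c, W, b, U, k=16)
```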
OUTLINE:
0:00 - Introduction
1:20 - Paper Overview
3:15 - Catastrophic forgetting in continuous and multi-task learning
9:30 - Dendrites in biological neurons
16:55 - Sparse representations in biology
18:35 - Active dendrites in deep learning
34:15 - Experiments on multi-task learning
39:00 - Experiments in continual learning and adaptive prototyping
49:20 - Analyzing the inner workings of the algorithm
53:30 - Is this the same as just training a larger network?
59:15 - How does this relate to attention mechanisms?
1:02:55 - Final thoughts and comments
Paper: arxiv.org/abs/2201.00042
Blog: numenta.com/blog/2021/11/08/c...
ERRATA:
- I was made aware of this by twitter.com/ChainlessCoder: "That axon you showed of the pyramidal neuron, is actually the apical dendrite of the neuron". Sorry, my bad :)
Abstract:
A key challenge for AI is to build embodied systems that operate in dynamically changing environments. Such systems must adapt to changing task contexts and learn continuously. Although standard deep learning systems achieve state of the art results on static benchmarks, they often struggle in dynamic scenarios. In these settings, error signals from multiple contexts can interfere with one another, ultimately leading to a phenomenon known as catastrophic forgetting. In this article we investigate biologically inspired architectures as solutions to these problems. Specifically, we show that the biophysical properties of dendrites and local inhibitory systems enable networks to dynamically restrict and route information in a context-specific manner. Our key contributions are as follows. First, we propose a novel artificial neural network architecture that incorporates active dendrites and sparse representations into the standard deep learning framework. Next, we study the performance of this architecture on two separate benchmarks requiring task-based adaptation: Meta-World, a multi-task reinforcement learning environment where a robotic agent must learn to solve a variety of manipulation tasks simultaneously; and a continual learning benchmark in which the model's prediction task changes throughout training. Analysis on both benchmarks demonstrates the emergence of overlapping but distinct and sparse subnetworks, allowing the system to fluidly learn multiple tasks with minimal forgetting. Our neural implementation marks the first time a single architecture has achieved competitive results on both multi-task and continual learning settings. Our research sheds light on how biological properties of neurons can inform deep learning systems to address dynamic scenarios that are typically impossible for traditional ANNs to solve.
Authors: Abhiram Iyer, Karan Grewal, Akash Velu, Lucas Oliveira Souza, Jeremy Forest, Subutai Ahmad
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 67
@YannicKilcher 2 years ago
OUTLINE: 0:00 - Introduction 1:20 - Paper Overview 3:15 - Catastrophic forgetting in continuous and multi-task learning 9:30 - Dendrites in biological neurons 16:55 - Sparse representations in biology 18:35 - Active dendrites in deep learning 34:15 - Experiments on multi-task learning 39:00 - Experiments in continual learning and adaptive prototyping 49:20 - Analyzing the inner workings of the algorithm 53:30 - Is this the same as just training a larger network? 59:15 - How does this relate to attention mechanisms? 1:02:55 - Final thoughts and comments Paper: arxiv.org/abs/2201.00042 Blog: numenta.com/blog/2021/11/08/can-active-dendrites-mitigate-catastrophic-forgetting
@andreasschneider1966 2 years ago
Thanks Yannic, for putting that together. Great content, as always! A little side note on the biology: the upper part of these pyramidal neurons is actually not the axon, but apical dendrites. So essentially a second input channel, next to the basal dendrites at the soma. The axon leaves the soma on the other side (probably one of the branches at the bottom).
@tanguydamart8368 2 years ago
The next step is to not provide the context and instead train another MLP to create an embedding that represents the task. This embedding can then be provided as context to the second network.
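A rough sketch of that suggestion, purely as an illustration (the context network, its sizes, and the name infer_context are hypothetical, not from the paper): a small MLP maps the raw input to a context embedding, which would then replace the hand-provided task vector fed to the dendritic segments.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def infer_context(x, W1, b1, W2, b2):
    """Tiny MLP that maps an input to a task/context embedding."""
    h = relu(W1 @ x + b1)
    return W2 @ h + b2

rng = np.random.default_rng(0)
x = rng.normal(size=784)
W1, b1 = 0.01 * rng.normal(size=(32, 784)), np.zeros(32)
W2, b2 = 0.01 * rng.normal(size=(10, 32)), np.zeros(10)
c = infer_context(x, W1, b1, W2, b2)
# c would be passed as the context to the active-dendrites layer sketched
# under the video description, instead of a one-hot / prototype task vector.
```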
@daniloribeiro64 2 years ago
Yannic, this is a great explanation! I've been very intrigued by the work of Numenta and have studied it quite a lot recently. I had a very similar realization a year ago about the connection between these sparse models and the attention in transformers.
@Rizhiy13 2 years ago
This looks a lot like attention in transformers, but with a lot of hand-crafted rules/connections. I think another couple of papers down the line this can be combined with attention in transformers, so that context can be learned as well.
@vladimirtchuiev2218 2 years ago
I'm thinking of something along the lines of: instead of using values computed directly from the input for the encoder part of the encoder-decoder, feed in a bunch of task encodings, and maybe implement some sort of adversarial loss so that the tasks have little overlap between them.
@hyunsunggo855 2 years ago
Active dendrites are just the tip of the iceberg of Numenta's work. You should really look into their latest work, the Thousand Brains Theory. There's a book written about it as well, which I'd recommend.
@Dan-gs3kg 2 years ago
What's the name of the book?
@hyunsunggo855 2 years ago
@@Dan-gs3kg It's "A Thousand Brains: A New Theory of Intelligence" by Jeff Hawkins.
@Adhil_parammel 2 years ago
The brain contains plenty of different models of the world.
@rbain16 2 years ago
They already interviewed Jeff Hawkins on the MLST podcast.
@Fordance100 2 years ago
It's a very interesting paper experimenting with new ideas. It truly opens doors for further experimentation and innovation. Thanks for choosing this paper and presenting it in a clear and easy-to-understand fashion.
@snippletrap 2 years ago
Classic Yannic expository video. This is the kind of content I subscribed for.
@ikiphoenix9505 2 years ago
Thanks! Numenta's work is cool! Hope to see one on the Relational Tsetlin Machine or more of their work.
@mikenashtech 2 years ago
Great work Yannic. Really useful. Thank you. Mike
@YannicKilcher 2 years ago
ERRATA: - I was made aware of this by twitter.com/ChainlessCoder: "That axon you showed of the pyramidal neuron, is actually the apical dendrite of the neuron". Sorry, my bad :)
@TheRyulord 2 years ago
Should probably pin this since it's getting buried.
@Pmaisterify 2 years ago
Yannic, a suggestion for future videos: maybe some more application papers on complex tasks? Oftentimes those papers innovate on previous concepts in ways that improve them, sometimes even setting SOTA. I like papers like the one in this video because they help me think of creative ways to apply these concepts, but there is certainly a lot of overlap at times with previous works. Still, I did like this paper overall; the Thousand Brains Theory it plays into is very cool stuff, and I recommend the book to anyone seeing this comment.
@RoobertFlynn 2 years ago
Oh amazing, thanks for doing this one!
@paxdriver 2 years ago
Maaaaan, we could combine so many awesome models with this approach. That math model combined with NLP or GPT would be really interesting, especially for autopiloted vehicles, market analysis, or AGI. I also don't see any reason why this model couldn't be combined with other combinatory versions of itself applied to other models - like that predictive coding algo mixed with this MNIST setup to convert diagrams into functions, or collision detection with AlphaFold, then applying this model to many other pretrained models in layers or via a graph ML network instead. This concept is so deep in potential for experiments.
@aamirmirza2806 2 years ago
So people are up to date with the whole sparsity thing, it would be better if you started the topic from the original paper, "How Can We Be So Dense?", on the benefits of highly sparse representations, and then the Active Dendrites paper. It would give viewers some much-needed context.
@andrewzeng712 1 year ago
great video
@paxdriver 2 years ago
Yann LeCun, meet Yan Le Man 😜😘 Thanks for the show, love the new format. Still, I miss a bit of the cheekiness from before the interviews were forthcoming. Could you release the paper review before booking the interviews, maybe? Or leave an open note for authors to request interviews? It would save you the work you hate, with authors tracing you rather than you coordinating reviews with the meetings. Just a suggestion.
@pisoiorfan 2 years ago
Thanks. Do you plan an authors interview on this paper too? I think the choice of kWTA vs. a simpler threshold activation function is justified as a better approximation of what happens in natural neurons - there's lateral inhibitory feedback, not detailed in this paper, which limits the number of simultaneously firing neurons in a given layer/area. Thresholding may lead, in some cases, to either too many or too few activations. One case resembles en.wikipedia.org/wiki/Synesthesia, the other en.wikipedia.org/wiki/Blindsight - both more confusing than useful.
@PaganPegasus 2 years ago
Additionally, I think the kWTA works to prevent catastrophic forgetting, as the feedforward and dendritic weights of the "ignored" neurons won't get updated in backprop. I think this helps enforce weight updates of only the top _k_ most relevant neurons for a particular context. Would it make sense to perhaps use a probabilistic sparsification scheme instead? kWTA reminds me of top-k sampling in text models, and a probabilistic extension of that is nucleus sampling, where the top _n_ samples are chosen such that the sum of their probabilities is close to some threshold, so fewer samples are picked when the distribution is more "peaky" vs. more samples picked when the distribution is more "flat". I wonder if a similar principle could be applied here, as the probability threshold could be a learnable parameter, allowing the network itself to learn how much or how little sparsity is ideal.
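A sketch of that nucleus-style alternative, as I read the suggestion (the function below and its threshold p are my own illustration, not anything from the paper): keep the smallest set of units whose softmax mass reaches p, so "peaky" layers end up sparser than "flat" ones; p could in principle be made learnable.

```python
import numpy as np

def nucleus_wta(activations, p=0.9, temperature=1.0):
    """Keep the most active units whose softmax probabilities sum to >= p; zero the rest."""
    z = activations / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # most probable units first
    cumulative = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cumulative, p)) + 1
    out = np.zeros_like(activations)
    keep = order[:n_keep]
    out[keep] = activations[keep]
    return out

a = np.array([4.0, 0.1, 0.2, 3.9, 0.05])
print(nucleus_wta(a, p=0.9))                 # a "peaky" layer keeps only two units here
```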
@tildarusso 2 years ago
It does feel like the attention mechanism and LSTMs, but I struggle to form a clear picture of the difference.
@joshuasmith2450 2 years ago
For your vector of 1s with softmax: if you replaced the 1s with a big number approaching infinity, it would just become a max.
@vladimirtchuiev2218 2 years ago
1. I wonder how it compares to CNNs and attention models. The GPT models proved that they can handle a wide variety of tasks without resorting.
2. Is it scalable? And can you determine a fixed number of candidate tasks, and, with no task ID given and a contrastive loss, learn a successful multitasking neural net?
3. Isn't this small number of neurons activated per task susceptible to overfitting? One of the goals of e.g. weight decay and dropout is to spread the inference across the entire network rather than relying on a small number of neurons from it...
@rbain16 2 years ago
3) This is not the case with winning lottery tickets. You can sometimes make them as sparse as 3% of the original weights and they'll do just as well as the fully dense networks. Worth noting that these nets also have activation sparsity.
@Pmaisterify 2 years ago
Hmmm. I get a sort of task capsule network / set transformer vibe, at least with the dendritic intuition :) I'd be curious to see how the kWTA layer affects training, especially early on. I have yet to read the paper; my guess is that they would use warmup to avoid issues. I have definitely had a lot of issues with dead neurons when I have used sparsity like that.
@subutaiahmad8208 2 years ago
Hi Pietro, I agree. kWTA can definitely lead to dead neurons. In a precursor to this dendrites paper (see links below), we showed how introducing something called "boosting" helps prevent this. This technique essentially "boosts" neurons that have been relatively inactive to increase the overall entropy of the layer. Unfortunately boosting (and related techniques) are a bit tricky with continual learning. While learning task T, neurons specialized to tasks t
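For readers unfamiliar with boosting, here is a rough sketch of the general idea, loosely based on my reading of Numenta's earlier sparse-representations work (the exact formulas and constants there may differ): each unit tracks a running duty cycle of how often it wins, and chronically under-used units get their activations scaled up when the k winners are selected.

```python
import numpy as np

def boosted_kwta(activations, duty_cycle, k, boost_strength=1.5):
    """kWTA where chronically silent units are boosted when choosing the winners."""
    target_density = k / activations.size
    boost = np.exp(boost_strength * (target_density - duty_cycle))
    winners = np.argpartition(activations * boost, -k)[-k:]
    out = np.zeros_like(activations)
    out[winners] = activations[winners]      # winners keep their *unboosted* values
    return out, winners

def update_duty_cycle(duty_cycle, winners, alpha=0.01):
    """Exponential moving average of how often each unit wins."""
    fired = np.zeros_like(duty_cycle)
    fired[winners] = 1.0
    return (1.0 - alpha) * duty_cycle + alpha * fired

# Toy loop: units that rarely win accumulate a larger boost over time.
rng = np.random.default_rng(0)
duty = np.full(8, 2 / 8)
for _ in range(100):
    acts = rng.normal(size=8)
    out, winners = boosted_kwta(acts, duty, k=2)
    duty = update_duty_cycle(duty, winners)
```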
@Pmaisterify 2 years ago
@@subutaiahmad8208 very interesting, I’ll definitely check it out. Thanks for the reply!!!
@bilz0r 2 years ago
Yannic: That image of the pyramidal neuron, everything on it is a dendrite. That thing you think is the axon is called the "apical dendrite".
@hongyihuang3560 2 years ago
Saw that diagram and I immediately went: this is definitely from Numenta
@brll5733 2 years ago
So maybe there is just one "virtual" neural network (in the form of dendrites) that learns to regulate the primary one, which does the actual learning?
@jabowery 2 years ago
Nvidia needs to look at synergies between ray trace sparse processing and this paper.
@aroonsubway2079 2 years ago
Thanks for this wonderful video. I am not an expert on multi-task learning, but I have a feeling that catastrophic forgetting can be solved by simply optimizing multiple losses simultaneously (i.e. a shared encoder, but different tasks have different prediction heads, so you can minimize all the losses together with multiple types of data entering the shared encoder). Please let me know if there is something wrong with my understanding.
@rbain16 2 years ago
Solving catastrophic forgetting would be one of the best things to ever happen to ANNs. Code it up and try it out.
@CharlesVanNoland 2 years ago
Your second attempt at enunciating "pyramidal" was right! "Per-am-id-all" @10:30
@senadkurtisi2930 2 years ago
Would incorporating some sort of schedule for kWTA make sense? Like at the beginning we keep the top 90% of the neurons in the kWTA layer, and progressively decrease it to something like 25%? That way most of the neurons would get some chance to be included in the backprop path.
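Something like the following would implement that schedule (a linear anneal is just one possible choice; the numbers are only illustrative):

```python
def kwta_fraction(step, total_steps, start=0.90, end=0.25):
    """Linearly anneal the fraction of units kept active by the kWTA layer."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress

n_units = 64
for step in (0, 2500, 5000, 10000):
    k = max(1, round(kwta_fraction(step, total_steps=10000) * n_units))
    print(step, k)   # keeps 58, 47, 37, then 16 units as training progresses
```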
@Pmaisterify 2 years ago
I definitely think that makes the most sense. Based on how it’s described, I am quite certain a lot of those neurons are dead :)
@Gannicus99 2 years ago
@@Pmaisterify Just like in (most) brains then 😁
@Pmaisterify 2 years ago
@@Gannicus99 People need to start using GELU in their heads!
@carlotonydaristotile7420 2 years ago
Very cool. Reminds me of the work being done by Numenta.
@NeoShameMan 2 years ago
So basically substituting the sparse random selection of neurons backpropagating at learning time with a self-reinforced one?
@MartinB397 2 years ago
Instead of kWTA I'd use softWTA and maybe a threshold if beneficial. Nobody needs another hyperparameter.
@subutaiahmad8208 2 years ago
Hi Martin, a soft approach wouldn't have worked as well. We need most activations to be exactly zero in order to block gradients to most of the network, and avoid interference between tasks during learning. True sparse activity also closely parallels what goes on in our brains.
@MartinB397 2 years ago
@@subutaiahmad8208 Yep, that's why I mentioned the threshold. In the end it'd be the same, I guess? It'd be useful only if it's really unsure about the current task or if the task is a mixture of several others. I really like the idea about the task vectors and checking which one is most similar. It could be used to form subgroups of neurons which output independent data. Thank you for your work and for sharing it.
@subutaiahmad8208 2 years ago
@@MartinB397 Thanks! Yes, a threshold would output zeros, similar to a ReLU but it's hard to guarantee both sparsity and having sufficient number of non-zero units. It could work with some tuning. We went with the simpler biological solution of kWTA, which is like an adaptive threshold.
@easyBob100 2 years ago
Context....couldn't you just use some sort of positional embedding?
@PaganPegasus 2 years ago
"It's a bit cheaty to have as many dendritic segments as tasks." Well, maybe it is cheating, but in a real biological system couldn't you argue that dendritic segments would actually grow to match this situation of up to 1 segment per 'task'? Since ANNs can't really grow new neurons or new connections (at least in our typical systems using backprop), we kinda have to give it the maximum capacity from the beginning, and the network will 'prune' out the irrelevant segments through decaying weights.
@zyxwvutsrqponmlkh 2 years ago
This is a Cambrian explosion of deep learning methods.
@SimonJackson13 2 years ago
Learn taskID?
@joshuasmith2450 2 years ago
The mean isn't affected by variance; the top 5 is.
@aamir122a 2 years ago
Catastrophic forgetting also exists in humans; that is why we have specialists. An extreme example: one cannot be a doctor and an engineer at the same time - if they try, one of the skills will be severely degraded. Take my example: I am a native multilingual speaker, but for the larger part of my life I have lived in an English-speaking world, and as a result my abilities in the other languages have to a large extent degraded. I have to concentrate when reading Urdu as opposed to English.
@tanguydamart8368 2 years ago
10:52 - those are the apical dendrites!!!
@shibash1 2 years ago
For me it feels like the attention mechanism. Maybe I got something wrong, but...
@Pmaisterify 2 years ago
It’s like task attention I guess
@shibash1 2 years ago
@@Pmaisterify true
@XOPOIIIO 2 years ago
But biological neurons ARE fully connected. It's realized in dendrites changing their physical location, not just increasing/decreasing their potential.
@subutaiahmad8208 2 years ago
In the neocortex, if you look at a layer of neurons projecting to another layer, the percentage of neurons that are physically connected is very small. Hence the connectivity is quite sparse.
@DonCat-sc3qo 2 years ago
Looks conceptually very similar to transformers tbh.
@YannicKilcher 2 years ago
there is a discussion on that towards the end of the video
@Pmaisterify 2 years ago
The task embedding layers seem like a sort of task capsule layer, with the output capsule being the attention coefficients.
@oxide9717 2 years ago
First
@agiisahebbnnwithnoobjectiv228 2 years ago
Such poor quality audio and irritating accent