Predictive Coding Approximates Backprop along Arbitrary Computation Graphs (Paper Explained)

25,822 views

Yannic Kilcher

1 day ago

#ai #biology #neuroscience
Backpropagation is the workhorse of modern deep learning and a core component of most frameworks, but it has long been known that it is not biologically plausible, driving a divide between neuroscience and machine learning. This paper shows that Predictive Coding, a much more biologically plausible algorithm, can approximate Backpropagation for any computation graph, which they verify experimentally by building and training CNNs and LSTMs using Predictive Coding. This suggests that the brain and deep neural networks could be much more similar than previously believed.
OUTLINE:
0:00 - Intro & Overview
3:00 - Backpropagation & Biology
7:40 - Experimental Results
8:40 - Predictive Coding
29:00 - Pseudocode
32:10 - Predictive Coding approximates Backprop
35:00 - Hebbian Updates
36:35 - Code Walkthrough
46:30 - Conclusion & Comments
Paper: arxiv.org/abs/2006.04182
Code: github.com/BerenMillidge/Pred...
Abstract:
Backpropagation of error (backprop) is a powerful algorithm for training machine learning architectures through end-to-end differentiation. However, backprop is often criticised for lacking biological plausibility. Recently, it has been shown that backprop in multilayer-perceptrons (MLPs) can be approximated using predictive coding, a biologically-plausible process theory of cortical computation which relies only on local and Hebbian updates. The power of backprop, however, lies not in its instantiation in MLPs, but rather in the concept of automatic differentiation which allows for the optimisation of any differentiable program expressed as a computation graph. Here, we demonstrate that predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules. We apply this result to develop a straightforward strategy to translate core machine learning architectures into their predictive coding equivalents. We construct predictive coding CNNs, RNNs, and the more complex LSTMs, which include a non-layer-like branching internal graph structure and multiplicative interactions. Our models perform equivalently to backprop on challenging machine learning benchmarks, while utilising only local and (mostly) Hebbian plasticity. Our method raises the potential that standard machine learning algorithms could in principle be directly implemented in neural circuitry, and may also contribute to the development of completely distributed neuromorphic architectures.
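The abstract's central claim can be illustrated with a toy NumPy sketch. This is my own construction, not the authors' code (all names — `W1`, `v1`, `e1`, etc. — are made up): a two-layer network under the paper's fixed-prediction assumption, where a value node relaxes using only locally available error signals, and the converged prediction errors match the backprop gradients up to the sign convention introduced by clamping the output to the target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer net: x -> h_pre = W1 @ x -> h = tanh(h_pre) -> y = W2 @ h,
# with loss L = 0.5 * ||y - target||^2.
W1 = 0.5 * rng.normal(size=(4, 3))
W2 = 0.5 * rng.normal(size=(2, 4))
x = rng.normal(size=3)
target = rng.normal(size=2)

# Reference: ordinary backprop.
h_pre = W1 @ x
h = np.tanh(h_pre)
y = W2 @ h
g_y = y - target                        # dL/dy
g_h = (W2.T @ g_y) * (1 - h ** 2)       # dL/dh_pre via the chain rule

# Predictive coding inference (fixed-prediction variant): the value node
# starts at its feedforward prediction, the output node is clamped to the
# target, and each node relaxes using only local error signals.
v1 = h_pre.copy()                       # hidden value node
e2 = target - y                         # output error node (clamped)
for _ in range(300):
    e1 = v1 - h_pre                     # local prediction error at layer 1
    feedback = (W2.T @ e2) * (1 - h ** 2)   # error fed back from the child
    v1 -= 0.1 * (e1 - feedback)         # purely local relaxation step

# At equilibrium the error nodes equal the (negated) backprop gradients.
e1 = v1 - h_pre
print(np.allclose(e1, -g_h, atol=1e-8), np.allclose(e2, -g_y))  # prints: True True
```

The relaxation loop touches only quantities adjacent to each node (its own error and the error fed back from its immediate child), which is what makes the scheme "local" in the sense the paper argues for.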
Authors: Beren Millidge, Alexander Tschantz, Christopher L. Buckley
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 84
3 years ago
Thank you for analyzing this awesome paper Yannic, much appreciated.
@sebastianmestre8971 3 years ago
If I understand correctly, you first do a forward pass to make some guesses, then you do a backward pass to find better guesses, then you do a parallel pass to improve the weights. (Though you can kick off the weight refinement on a separate thread as soon as you find the improved guess.) The cool thing is that we can refine weights on multiple layers at once, instead of going one at a time, even if there are a few sequential steps before that.
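The parallel weight refinement described in this comment can be sketched as follows (a hypothetical toy, not from the paper's repo; the converged error vectors here are random stand-ins just to show the shapes): once the error nodes have settled, every layer's update is an outer product of its local error and its presynaptic activity, so all layers can update simultaneously.

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose the inference phase has already converged, leaving an error
# node at every layer (random placeholders here).
x = rng.normal(size=3)                  # input
W1 = rng.normal(size=(4, 3))
h = np.tanh(W1 @ x)                     # layer-1 activity
W2 = rng.normal(size=(2, 4))
e1 = rng.normal(size=4)                 # converged error at layer 1 (stand-in)
e2 = rng.normal(size=2)                 # converged error at the output (stand-in)

lr = 0.01
# Each update is Hebbian-like: presynaptic activity times the
# postsynaptic error node. Neither update needs any information from
# the other layer, so both can run at the same time.
dW1 = np.outer(e1, x)                   # uses only layer-1-local quantities
dW2 = np.outer(e2, h)                   # uses only layer-2-local quantities
W1 += lr * dW1
W2 += lr * dW2
```

Because each `dW` depends only on the two ends of its own connection, the updates have no sequential dependency on one another — unlike backprop, where layer 1's gradient cannot be formed until layer 2's has been propagated through.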
@1234dbk 9 months ago
This might be a silly question, but if you are doing a backward pass to refine your guesses, doesn't that still leave unsolved the main issue this was created to address in the first place: the lack of bidirectionality in biological circuits (for example, an RNN of synapses along neural pathways)? To generalize: if the graph of nodes is purely one-directional, how would information about the error be sent backwards after it is calculated?
@rockapedra1130 3 years ago
This was super helpful! Thanks! I love this channel !!
@cedricvillani8502 3 years ago
Which part exactly was helpful?
@rockapedra1130 3 years ago
@@cedricvillani8502 well ...all of it! He goes from the abstract and motivation to describing the general idea with simplified drawings to analyzing each equation to commenting on figures to dissecting the code and finally to his considered opinion about the whole thing. For me, this level of comprehension would take weeks (at least). Plus there are tons of papers out there and he filters and reviews “what’s hot” papers, another huge time saving! This channel is awesome!!!
@gruffdavies 3 years ago
This could be a gamechanger. Thanks for the analysis!
@leylakhenissi6641 3 years ago
Thank you for the paper presentation, it's really well done and provides a useful overview of the topic and the paper. May I kindly ask that in the future you refrain from poking fun at other people's code though? It may keep others, especially in scientific computing, from making their code open/public, which would be a shame for everyone. Cheers.
@JamesAwokeKnowing 3 years ago
I think the big deal (other than plausibility) only makes sense in the context of hardware. With this scheme you can build local hardware neurons which only compute locally. In software it looks like a "backward pass" because a central processor goes around doing the computation for all the neurons. Instead, imagine a CUDA core per neuron which never needs to load memory from anywhere except the other cores it's physically connected to.
@fimbulvntr 3 years ago
Also, again thinking about hardware, this would enable "dynamic scaling" of a network, where you simply throw more neurons into the mix (since they're all clones and independent). I.e., imagine a GPU where you can bolt on extra CUDA cores, ad infinitum. The current model (maybe I am wrong and misunderstood, I am a layman) needs to know the entire topology before it can work.
@eelcohoogendoorn8044 3 years ago
Exactly; where this becomes relevant is with hardware that is explicitly simplified to take advantage of this compute structure that presumably does not need any global connections.
@ssssssstssssssss 3 years ago
I am doubtful about the "plausibility" argument, but the realization of such a learning mechanism in hardware seems to me a very powerful argument. I imagine we could get analog processors to carry out this learning algorithm incredibly fast.
@23kl104 3 years ago
Can't you just as well make the same case for backpropagation? Imagine a bunch of backprop neurons only receiving information from their neighboring nodes (last hidden state for forward pass / gradient of next node for backward pass).
@ssssssstssssssss 3 years ago
Interesting paper. But this still does not seem biologically plausible to me, which they stated as the purpose. Not to mention, from what I see, so-called predictive coding is a variant of backpropagation (implementing dynamic programming), so saying it approximates backpropagation is misleading. They should qualify the title: "Predictive Coding Approximates Backpropagation with Gradient Descent".
@JTMoustache 3 years ago
Predictive coding is a red herring, it is really a dynamic programming version of a variational gradient descent.
@skdx1000 3 years ago
Yeah, it seems analogous to using a Taylor series to approximate a function, where in this case the error term corresponds to the nth-derivative multiplier and the function is the evaluation of the original LSTM cell.
@jordyvanlandeghem3457 3 years ago
@@skdx1000 oomph what resources should I check to understand this reply? :)
@skdx1000 3 years ago
@@jordyvanlandeghem3457 This link explains what a Taylor series is: brilliant.org/wiki/taylor-series/ From there, you can check the derivation formula against the techniques used in the paper as explained by Yannic, and compare how the error-term technique used in this paper corresponds to how a Taylor series approximates error using the derivative.
@AbeDillon 3 years ago
I don't see anything wrong with giving "a dynamic programming version of variational gradient descent" a shorter name, like "predictive coding". What makes it a red herring?
@peterfireflylund 3 years ago
@@jordyvanlandeghem3457 Take a look at 3Blue1Brown. He has a series of videos that explain Taylor series intuitively. In order to REALLY understand them, you need to understand calculus and do lots of homework exercises, of course. But maybe the videos are enough for you? Or maybe just the Brilliant link was enough? Up to you :)
@woolfel 3 years ago
This paper makes me ask this question: after you've trained a base model, could the local errors reduce the need for backprop during re-training? If so, would that actually reduce the cost of retraining base models?
@v.gedace1519 3 years ago
WOW! The paper is great. Your explanations are greater!
@subarashii1368 3 years ago
I feel it just keeps the input/target fixed, then back-propagates one layer per iteration. In real life, you don't keep the input fixed until your brain forms an equilibrium.
@raunaquepatra3966 3 years ago
If, in the inner loop (where they update the guesses with 100 iterations or so), we only run it once and, instead of updating the predictions in small steps, just add the whole error, then doesn't it become normal backprop? 🤨 Please correct me if I am wrong.
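The intuition in this question can be checked on a toy chain (a sketch under the paper's fixed-prediction assumption; all variable names are mine): running the inner loop exactly once, copying the whole feedback term layer by layer from the top, reproduces the ordinary backprop gradients up to a sign.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 2-layer chain: x -> h = tanh(W1 @ x) -> y = W2 @ h, squared-error loss.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
t = rng.normal(size=2)
h = np.tanh(W1 @ x)
y = W2 @ h

# "Inner loop" run exactly once with the whole error added, top-down:
e2 = t - y                              # output error (clamped to target)
e1 = (W2.T @ e2) * (1 - h ** 2)         # copy the full feedback in one step

# Ordinary backprop on the same network:
g_y = y - t
g_h = (W2.T @ g_y) * (1 - h ** 2)

# The one-step errors are exactly the negated backprop gradients.
print(np.allclose(e1, -g_h), np.allclose(e2, -g_y))  # prints: True True
```

Note that the one-step, top-down schedule is exactly the sequential sweep of backprop; the small-step iterated version is what lets the error nodes settle with only local, asynchronous updates.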
@probbob947 3 years ago
The structure of the update rule resembles a graph Laplacian.
@lucidraisin 3 years ago
Thank you!!
@lucidraisin 3 years ago
Nobody could have explained it as well as you did!
@kimanthony1667 3 years ago
Next project ==> lucidrains/predictive-coding-backprop-pytorch
@herp_derpingson 3 years ago
21:30 - I wonder how skip connections would look in this system.
34:20 - I wonder if we should run it to convergence, or whether that would cause instability as it overfits to the batch.
I am not sold on this. We are still sending information backward. How is this biologically feasible?
@linminhtoo 3 years ago
It looks like it happens through the local "feedback" connections between neurons. So the main difference from backprop is that the gradient doesn't need to be computed exactly, all the way from the loss value back to the very first neurons that received the input, in one pass. We can just do it locally, and it approximates backprop (which makes sense, since the errors are being sent backwards anyway).
@herp_derpingson 3 years ago
@@linminhtoo Regardless of whether it's done in one pass or multiple, bidirectional propagation is not feasible.
@ibrax1 3 years ago
@@herp_derpingson Biological neurons do have local feedback dendrites.
@wunkewldewd 3 years ago
I was confused by this too! I have two qualms: A) it seems like this still requires sending info backwards like you said, so I don't see how it solves the problem... and B) backprop could be considered local IMO: even though the gradient at some much earlier layer is dL/dw_1 or whatever, the chain rule decomposition has the effect of breaking it down into a local gradient, da/dw_1, times the error signal from later in the network (the dL/dy * dy/dw * ... etc).
@danielbrennan5942 3 years ago
Long-term potentiation and long-term depression (loosely) follow Hebbian learning rules. If this algorithm also follows those Hebbian rules, it should be biologically plausible.
@TheIvanIvanich 3 years ago
More papers about predictive coding!
@boss91ssod 3 years ago
-> ... please!
@Zantorc 3 years ago
This is interesting but I'm not sure it's applicable. The brain doesn't use point neurons, nor can it be replicated using them. You'll be lucky to get 2 bits of accuracy out of most neurons. Beyond sensory-motor inputs, the idea that a neuron could output a value which could be compared to some other value is a non-starter. Most connections in the brain are feedback, not feedforward. The more you know about the brain, the less like the idealised NN it seems.
@semjuel3077 3 years ago
@Zantorc Could you explain what you mean by "Most connections are feedback, not feedforward"?
@Zantorc 3 years ago
@@semjuel3077 kzbin.info/www/bejne/n5TGlWtsq7Stadk Explains it quite well.
@charleshong1196 3 years ago
I just don't get it. What's the difference? It still needs to backpropagate... the temporal and spatial dependencies have not changed...
@YannicKilcher 3 years ago
the algorithm is biologically plausible
@dm_grant 3 years ago
Neurons are not bidirectional. Exactly!
@quebono100 3 years ago
Nice Paper :) tanh-k you
@jonatan01i 3 years ago
tanh-q
@quebono100 3 years ago
@@jonatan01i even nicer :D tanh-q
@DavidSaintloth 3 years ago
This looks a lot like the mechanism I presented as salience modulation in a 2013 post: sent2null.blogspot.com/2013/11/salience-theory-of-dynamic-cognition.html?m=1 The backpropagation happens through a salience-driven remapping of stored information in any given sensory dimension, with inferencing happening continuously between data mappings into the networks. Tangent: there is some evidence that real neurons do have feedback sub-signals along the firing path, which would make this paper more biologically similar than you asserted.
@lemurpotatoes7988 3 years ago
Link to evidence of feedback subsignals, please?
@ayesaac 1 year ago
How is this not just recursion-based backpropagation? Predictive coding, as I understand it, is a neuron making guesses about the _input_ it will get, not the output of the next neuron. Then the neuron adjusts its model to better predict its own inputs. That's what makes it local: its learning is based on its own input; it doesn't need to know anything else.
@kascesar 3 years ago
Which program do you use to read papers?
@jerrygreenest 3 years ago
And what OS 🤔
@YannicKilcher 3 years ago
OneNote on Windows
@Yash-vm4uk 3 years ago
It is still using backpropagation, which he said is not possible in the brain, just done by looping, so how is this biologically possible?
@SianaGearz 3 years ago
Backpropagation as defined is a global mechanism that relies on the computer implementation of neural networks. In the brain, however, there can be no explicit metadata describing the connections, and no direct connections spanning all the way across the brain! Two-way communication for the purpose of reinforcement does occur biologically, but it is local, spanning just each pair of adjacent neurons. There are many mysteries regarding the function of biological neural tissue. So this paper presents a mechanism which it shows to be identical in result to backpropagation, but which is local only, not global, and appears biologically plausible. It brings us one step closer to understanding the function of biological neural tissue.
@v.gedace1519 3 years ago
I am pretty sure that the linearity of the decomposition is the issue. Meaning dL/dh2 * dh2/dw2 -> ...h3... -> ...h4... Nature does it differently: dL/dh2 * dh2/dw2 -> F[h3](L, h3w'3 ... h0w'0) -> F[h2](L, h2w'2 ... h0w'0), where the w'... are weights, aka "feedback connections". Hard to explain using text only, but you get the idea ;-)
@23kl104 3 years ago
no lol, I don't
@amitkumarsingh406 3 years ago
How about the papers in dark mode
@keeperofthelight9681 1 year ago
It doesn't for reinforcement learning, though.
@diegofcm6201 3 years ago
Like Jeff Hawkins says: neurons CANNOT be assigned numerical precision whatsoever. So even if there weren't any backwards pass, the mere fact that it assumes that much stability in the input-output mapping is flawed from the POV of biological plausibility.
@diegofcm6201 3 years ago
It’s much more likely that it’s something more discrete, with Hebbian learning happening through information sent by neurotransmitters
@Hukkinen 3 years ago
Why can't neurons be approximated by numerical representations? I'd say this is just a trade-off between the realism and the abstraction of the model. Why am I wrong here?
@diegofcm6201 3 years ago
@@Hukkinen TL;DR: It's naive to pick just a single part of biological neural networks (local update rules) and graft it onto an artificial one (expecting similar or better performance) without considering most of the other computational aspects of the real thing. The idea is that neuronal connections in the actual brain are maintained by STDP (spike-timing-dependent plasticity), a rule that depends not so much on the actual voltage as on long-term behaviour (potentiation or depression). There are no static weights; they are a dynamical property, evolving over time. We are also neglecting lots of other things, like the fact that memories are (somehow) in the connections, and that computation is done in the time domain (tied to the latency of the input neuron before spiking occurs in the outputs; and, just a "small" detail, in bio neural networks the output neuron can spike before the inputs).
@minecraftermad 3 years ago
I hope i can understand this cuz those graphs sure didn't look promising
@hoaxuan7074 3 years ago
Well, almost anything will train a neural net, and there is no point in being too clever about it. A dot product is a statistical summary measure and a filter: it will respond to the statistics of the neurons in the prior layers. No neuron can be too exceptional, because its output will be shared by many forward dot products. And any realistic optimisation algorithm will only be able to search a small space of statistical solutions. Is that a bad thing? You exclude many brittle, overfitted solutions.
@hoaxuan7074 3 years ago
I guess one way to test that is to delete a weight and see how badly it affects the net, or delete one neuron: do you only ever get a small statistical effect, or does such an action sometimes dramatically impact the net? Evolutionary algorithms like Continuous Gray Code Optimization can actually train large nets, and can have low network-bandwidth requirements relative to BP for federated learning: each compute device is given the full network model and part of the training set; the same short, sparse list of mutations to the model is sent to each device, which returns the cost for its part of the training set; the costs are summed, and if there is an improvement an accept-mutations message is sent to each device, else a reject message. Anyway, there is some related chat at "discourse numenta" under sparse numenta nets.
@blacklistnr1 3 years ago
I'd like to say that I appreciate how you handled discussing this paper. Perhaps this is my biased, incomplete view, but damn, some research is this overly pompous explanation of a really basic idea that makes you facepalm: "Is that it?". I imagine these guys chuckling with pipes: - What should we research next? - Well, I'd love to do something useful, but all the money seems to go to A.I. these days.. *scratches beard* - Oh... these primal monkeys, will they ever understand the beauty of exploring math? - I truly don't know, but let's give them what they want: deep networks and backprop. - Hasn't that been done like 10,000 times already? - No no no, we don't do backprop, we break the chain with local variables and call it predictive coding. - You're mad! *loud laugh* So you want to do 100 LOCAL iterations to propagate what could be done in one pass? - You wouldn't say it like that.. use flashy words: neuromorphic, LSTM, etc. - Neuromorphic Machine Learning? Isn't that like what we've been calling what we're doing since the 1970s? Have a little dignity, at least call it "Hebbian plasticity". - *drinks the whole glass and slams it on the table* Fine with me. Let's get this over with.
@gruffdavies 3 years ago
The paper's purpose was to address "biological plausibility" so "Hebbian plasticity" is perfectly appropriate.
@albertwang5974 3 years ago
The brain does backpropagation by generating connections between the activating cells and the confirmed result.
@yasurikressh8325 3 years ago
Doesn't look hideous to me. If it can be mapped, then it is a beauteous model.
@Prince-sf5en 3 years ago
Can't believe I'm first here
@bassr3hab 3 years ago
haha same here
@herp_derpingson 3 years ago
Can't believe it's not butter
@andreassyren329 3 years ago
Oh I had no idea this just premiered.
@notgabby604 3 years ago
Naw, it's trans-fat margarine. Which certainly was a terrible thing.
@quebono100 3 years ago
@@bassr3hab same here on your post xD (recursion?)
@Rizhiy13 3 years ago
Not very convincing so far; distributing the errors doesn't seem to offer any advantages in comparison to backprop.
@AirmailMRCOOL 3 years ago
"Advantages" aren't really what they were looking for. They were looking for a biologically possible training method. Your brain doesn't use backprop, so they're just theorizing about what it does use.
@444haluk 3 years ago
This is the smartest thing I have ever heard. I have always hated backprop, because at each step it assumes it has found the temporarily perfect solution. This approach fixes that monstrosity.
@23kl104 3 years ago
no it doesn't. It finds the locally steepest direction.
@quAdxify 2 years ago
This is a bit difficult to understand; I think it just needs a bit more theoretical context for all the people not familiar with variational inference. For interested viewers, here is an excellent review (including predictive coding and VI) by the authors of the discussed paper (I believe): arxiv.org/pdf/2107.12979.pdf