This reminds me of logic circuits! The bistable part: keep the high value until a sufficiently low value is detected, keep the low value until a sufficiently high value is detected. That is exactly a Schmitt trigger: in electronics, a Schmitt trigger is a comparator circuit with hysteresis, implemented by applying positive feedback to the noninverting input of a comparator or differential amplifier.
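As a rough illustration of that hysteresis idea (not taken from the paper; the threshold values here are arbitrary), a Schmitt-trigger-style update fits in a few lines of Python:

    def schmitt_trigger(samples, low=0.3, high=0.7, state=0):
        """Toy hysteresis: output stays high until the input drops below `low`,
        and stays low until the input rises above `high`."""
        outputs = []
        for x in samples:
            if state == 0 and x > high:
                state = 1
            elif state == 1 and x < low:
                state = 0
            outputs.append(state)
        return outputs

    # Noisy inputs near the middle do not flip the output back and forth:
    print(schmitt_trigger([0.1, 0.8, 0.55, 0.45, 0.6, 0.2, 0.4]))  # [0, 1, 1, 1, 1, 0, 0]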
@kazz811 4 years ago
Stellar walk through of a very nice paper! Thanks for doing these. Also, great job explaining the GRU by writing a diagrammatic version of it.
@herp_derpingson 4 years ago
It would be funny if, two papers down the line, it reclaims the crown from GPT. What a time to be alive.
@YannicKilcher 4 years ago
Haha, yeah. Though I think being able to remember things well is not the same as being able to attend to any token in the sequence. I guess the two have different strengths.
@PatrickOliveras 4 years ago
@@YannicKilcher Yeah, this seems to be more of a longer and stronger short-term memory. Perhaps it will have stronger implications in reinforcement learning?
@cwhy 4 years ago
@@YannicKilcher And attention would be kind of cheating if the purpose is the same as in this paper, because it lets in all the information at once.
@revimfadli4666 4 years ago
@@PatrickOliveras For longer memory (and more biomimicry), maybe it can be combined with differentiable plasticity?
@visionscaper 4 years ago
What I find interesting is that the BRC setup will be much faster than an LSTM/GRU, because it only uses element-wise multiplications; it can also be parallelised more easily.
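For reference, this is roughly what the updates look like as I read them from the paper; a sketch, not the authors' code, with shapes and initialization glossed over. The BRC uses only per-neuron (element-wise) recurrent weights, while the nBRC brings back full matrices in its two gates but keeps the element-wise memory feedback inside the candidate:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def brc_step(x, h, Ua, Uc, U, wa, wc):
        """One BRC step (sketch). Ua, Uc, U are input weight matrices;
        wa, wc are per-neuron recurrent weights (no full matmul on h)."""
        a = 1.0 + np.tanh(Ua @ x + wa * h)   # bistability gate, range (0, 2)
        c = sigmoid(Uc @ x + wc * h)         # update/forget gate
        return c * h + (1.0 - c) * np.tanh(U @ x + a * h)

    def nbrc_step(x, h, Ua, Uc, U, Wa, Wc):
        """One nBRC step (sketch): the gates a and c see h through full matrices,
        but the memory feedback into the candidate is still element-wise (a * h)."""
        a = 1.0 + np.tanh(Ua @ x + Wa @ h)
        c = sigmoid(Uc @ x + Wc @ h)
        return c * h + (1.0 - c) * np.tanh(U @ x + a * h)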
@priyamdey3298 4 years ago
One thing I would like to know: how do you keep track of the new papers coming in? Do you keep an eye on the sites every day?
@SachinSingh-do5ju 4 years ago
Yeah we all want to know, what's the source?
@ulm287 4 years ago
Twitter? just follow ML bots
@videomae6519 4 years ago
Maybe arXiv
@user93237 4 years ago
arXiv-sanity, Twitter accounts by various researchers, r/machinelearning
@YannicKilcher 4 years ago
Craigslist
@mehermanoj45 4 years ago
Damn man! A video every day👌
@patrickjdarrow 4 years ago
Interesting that the presence of BRCs highlights the trade-off of long-term memory. It feels like a better analog for the function 'I' may be learned.
@ekstrapolatoraproksymujacy412 4 years ago
In the GRU they use, the signal from h_{t-1} is multiplied by the weight matrix before going through the reset gate. It's usually the other way around, and if that's the case, that weight matrix can potentially have values that amplify h_{t-1} enough to get this positive feedback and bistable behaviour from a normal GRU with a standard sigmoid in the reset gate. And you missed that they get rid of this weight matrix completely in their BRC and add h_{t-1} without any processing (besides that 1 + tanh reset gate) to the output tanh.
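To make the distinction concrete, here is a sketch (not the paper's code, biases omitted) of the two GRU candidate-state variants being discussed, given an already-computed reset gate r:

    import numpy as np

    def gru_candidate(x, h_prev, Wh, Uh, r, reset_after):
        if reset_after:
            # Keras-style default: h_prev goes through the weight matrix first,
            # and the reset gate scales the already-transformed signal.
            return np.tanh(Wh @ x + r * (Uh @ h_prev))
        else:
            # "Classic" formulation: the reset gate shuts off h_prev before the matmul.
            return np.tanh(Wh @ x + Uh @ (r * h_prev))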
@YannicKilcher 4 years ago
Thanks for the clarifications
@ekstrapolatoraproksymujacy412 4 years ago
@@YannicKilcher I checked their code and they use the default implementation of the Keras GRUCell, which has the "reset_after" argument that controls whether the reset gate is applied after or before the matrix multiplication. I changed it so that the gate is applied before the matmul, and it is now running their benchmark 3 on MNIST. Of course training is painfully slow, so time will tell...
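For anyone wanting to try the same experiment: the switch in TF2/Keras is the reset_after argument (True is the default there); the unit count below is arbitrary, just for illustration:

    import tensorflow as tf

    # Apply the reset gate before the matmul (the "classic" GRU formulation)
    cell = tf.keras.layers.GRUCell(100, reset_after=False)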
@SlickMona 4 years ago
At 36:33 - why would nBRC get *better* with higher values of T?
@NicheAsQuiche 4 years ago
I'm confused about that too. Maybe I'm not getting something, but it looks like in all the benchmarks that model performs better with longer sequences, like somehow making the task harder makes it easier for it.
@YannicKilcher 4 years ago
I don't think the T changes. Just the N that specifies where the information is
@mrityunjoypanday227 4 years ago
Interesting to see the use in RL, replacing the LSTM.
@sphereron 4 years ago
Yes, many environments can require long term memory. Supervised problems not as often.
@revimfadli4666 4 years ago
@@sphereron Especially in complex, stochastic environments where storing all inputs ever seen would be inefficient, in contrast to tasks like NLP or image processing, where all inputs are already accessible in memory.
@bluel1ng 4 years ago
Fantastic sound quality! Nice explanation of gating in unrolling / BPTT. Regarding the linear feedback simplification: you claim that it would self-stabilize over time, but if f(V_post) were a*V_post with a > 2, this would "explode" with zero input (an IIR filter with one delay unit is not always stable).
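The stability point is easy to see numerically with a one-delay linear recurrence. This is a generic illustration, not the paper's exact feedback form: any recurrent gain with magnitude above 1 diverges even with zero input.

    def iterate(a, h0=0.1, steps=20):
        h = h0
        for _ in range(steps):
            h = a * h          # zero input, pure feedback
        return h

    print(iterate(0.9))   # decays toward 0 (~1.2e-2 after 20 steps)
    print(iterate(1.1))   # grows without bound (~6.7e-1 after 20 steps, and climbing)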
@YannicKilcher 4 years ago
yes, very true. I was just thinking of the simplest case
@alelasantillan 4 years ago
Amazing explanation! Thank you!
@TheThirdLieberkind 4 years ago
This is so interesting. I wonder how the research on the math and functions in biological neurons is done. It really sounds like the brain does actual number crunching and handles signals through known mathematical functions, like we do with computers. There might be a lot we can learn from biology in machine learning research.
@004307ec 4 years ago
Then you might want to look up spiking neural networks (SNNs).
@n.lu.x 4 years ago
I would say it's the other way around. We use math as a language to describe the world around us. It's just that now we want to use it to model learning and intelligence, and math is the best tool/language to do that. What I'm getting at is that the brain doesn't necessarily do number crunching; we describe it as such because that's the closest we can get to modeling how it works.
@tsunamidestructor 4 years ago
Maybe I'm in the minority here but I really want to see LSTMs/GRUs outperform GPT-x models
@angrymurloc7626 4 years ago
If history tells us anything, scalability will win over intelligent design.
@revimfadli4666 4 years ago
Perhaps with some Turing machine modifications....
@NicheAsQuiche 4 years ago
I feel that too, but I don't think it makes computational sense. Like, why would only seeing one thing at a time, in one order, and having to remember it all work better than being able to attend to all of it in parallel? I think the reason for our hope is that recurrence most likely better resembles what happens in humans, and we don't like the thought of designing something that works better (I know GPT-x is nowhere near human intelligence in language, but future improvements on transformers may make it so).
@revimfadli4666 4 years ago
@@NicheAsQuiche For tasks where all input data are available at once (NLP, image processing, etc.), large-scale parallelization like that might work better using GPUs/TPUs/etc. But for autonomous agents, reinforcement learning and the like in a stochastic and complex environment, storing inputs like that would be inefficient compared to having the NN "compress" all that data (which LSTMs and the like already do).
@sagumekishin5748 4 years ago
Maybe we can apply neural architecture search here to find a good recurrent cell.
@harshpathak1247 4 years ago
Nice overview! And nice paper. This seems similar to pitchfork bifurcations arising when performing optimization via continuation methods. Hope these methods continue to explain more about deep learning optimization.
@damienernst5758 4 years ago
The nBRC cell indeed undergoes a pitchfork bifurcation at a = 1 - see the Appendix of arxiv.org/abs/2006.05252 for more details.
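The bifurcation can be checked numerically: with zero input and the gates held fixed, the memory update reduces to h <- c*h + (1-c)*tanh(a*h), whose fixed points solve h = tanh(a*h). For a <= 1 only h = 0 exists; for a > 1 two stable nonzero fixed points appear (bistability). A quick sketch under these simplifying assumptions:

    import numpy as np

    def fixed_point(a, h0=0.5, c=0.5, steps=2000):
        """Iterate the zero-input memory map h <- c*h + (1-c)*tanh(a*h) to convergence."""
        h = h0
        for _ in range(steps):
            h = c * h + (1.0 - c) * np.tanh(a * h)
        return h

    for a in (0.5, 1.5):
        print(a, round(fixed_point(a, h0=0.5), 3), round(fixed_point(a, h0=-0.5), 3))
    # a = 0.5 -> both initial conditions collapse to ~0 (monostable)
    # a = 1.5 -> they settle near +0.86 and -0.86 (bistable)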
@harshpathak1247 4 years ago
Thanks, I have been following this topic closely. Here is a list of papers that talk about the dynamics of RNNs: github.com/harsh306/awesome-nn-optimization#dynamics-bifurcations-and--rnns-difficulty-to-train
@Claudelu 4 years ago
A good question would be: what is a good source of new or interesting papers? We all want to know where you find these amazing papers!
@patrickjdarrow 4 years ago
At 24:00, "...a biological neuron can only feed back onto itself". What is being referenced here? Surely not synapses
@YannicKilcher 4 years ago
I think they mean this bistability mechanism
@patrickjdarrow 4 years ago
@@YannicKilcher makes much more sense.
@clivefernandes5435 4 years ago
So if we have very long sentences or paragraphs, these will perform better than LSTMs, right?
@YannicKilcher 4 years ago
Maybe, I guess that's up for people to figure out
@darkmythos4457 4 years ago
Thanks, very interesting. In case you are reading this, I am going to suggest a nice ICML 2020 paper: "Fast Differentiable Sorting and Ranking".
@marat61 4 years ago
Why did the LSTM perform so much worse than the GRU starting at T = 50?
@YannicKilcher 4 years ago
who knows, it's more complicated
@aBigBadWolf 4 years ago
In their source code, the experiments all use truncated backprop at 100 steps. How does this learn with more than 100 padding symbols..?
@supernovae34 4 years ago
Hi! Author here. The backpropagation through time is actually done over all time steps of the time series. The parameter you are talking about is actually not used in the model and is a left-over from old code. I apparently forgot to remove it when cleaning the code. Nice catch, it would obviously be impossible to learn anything on these time series with such settings. Sorry for the confusion! (I updated the code.)
@aBigBadWolf 4 years ago
@@supernovae34 Thanks for the info. Which TensorFlow version is this code for? It would be helpful if you'd add such reproducibility details to the README.
@supernovae34 4 years ago
@@aBigBadWolf It's TensorFlow 2. Indeed, I plan on writing a clean README and probably cleaning the code a little more. I haven't had the time yet, but will make sure to have it done in the very near future!
@aBigBadWolf 4 years ago
@@supernovae34 I ran benchmark 3 for three of the models with zs 300. The LSTM resulted in NaNs after reaching 30% accuracy. BRC is stuck at 10%. nBRC achieved 94% accuracy (after 20k steps). The paper doesn't mention any instability issues. Additionally, this first run of BRC is not at all where the mean and std of Table 4 would indicate, which begs the question: how fair is the comparison in Table 4 really?
@supernovae34 4 years ago
@@aBigBadWolf Hi, sorry to hear that. This is weird, I didn't run into such issues. For the nBRC, 94% accuracy is rather normal; it should gain a few more percent in the next 10k steps. However, I am rather surprised about the BRC, as it proved to work well over three runs (and even more which were done when testing the architecture before the final three runs). Did you run it for long enough? One thing we noticed on this particular benchmark is that the training curve of BRC has a bit of a "saw-tooth" shape. I am also quite surprised about the LSTM, as we never saw it learn anything (on the validation set!).

One thing that might be worth noting, though, is that we saw GRU overfit benchmark 2 with "no-end" set at 200. That is, it achieved a loss of 0.6 on the training set, but a loss of 1.2 on the test set. Are you sure that the 30% accuracy of the LSTM is on the validation set? Have you had any problems recreating the results for the other benchmarks?

I will upload today a self-contained example for benchmark 2 (as I already did for benchmark 1). Both these scripts should give pretty much exactly the same results as those presented in the paper. Note that benchmark 2 requires a "warming-up" phase for the recurrent cells to start learning, so it is normal for the loss not to decrease before 13 to 20 epochs (variance). Once benchmark 2 is done (currently running to make sure the results with the new script using the Keras sequential model are the same as those presented in the paper), I will do the same with benchmark 3. This will result in three self-contained scripts, which should be much cleaner than what is currently available. Sorry for the inconvenience.

Also, come to think of it, we should probably have shown more learning curves for benchmarks 2 and 3, which would have given more insight than just the convergence results (it might also have answered some of your questions); unfortunately we lacked room in the paper to do so. Finally, we did try to be as fair as possible and our goal was never to crunch small percentages here and there, we just wanted to highlight a general behaviour!
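For readers trying to follow this exchange: benchmark 1 in the paper is a copy-first-input task, where the network sees a value at the first time step, then T irrelevant inputs, and must output that first value at the end. A rough data generator along those lines (my own sketch of the setup, not the authors' script; the noise distribution and array shapes are assumptions):

    import numpy as np

    def copy_first_input_batch(batch_size, T):
        """Benchmark-1-style data (sketch): predict the first input after T noisy steps."""
        x = np.random.randn(batch_size, T + 1, 1)   # first entry is the value to remember
        y = x[:, 0, 0].copy()                       # target = first input
        return x, y

    x, y = copy_first_input_batch(32, 300)
    print(x.shape, y.shape)  # (32, 301, 1) (32,)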
@grafzhl 4 years ago
Tried my hand at an experimental PyTorch implementation: github.com/742617000027/nBRC/blob/master/main.py
@bosepukur 4 years ago
wonderful video
@EditorsCanPlay 4 years ago
here we go again
@MrZouzan 4 years ago
Thanks !
@DavenH 4 years ago
Ahh, bi-stable. I kept reading it as though it would rhyme with 'listable' and didn't know what the heck that word meant.
@snippletrap 4 years ago
For the gold standard in biologically plausible neural modeling, check out the work by Numenta.