This reminds me of logic circuits! The bistable part: keep the high value until a sufficiently low value is detected, keep the low value until a sufficiently high value is detected. That is exactly a Schmitt trigger: in electronics, a Schmitt trigger is a comparator circuit with hysteresis, implemented by applying positive feedback to the noninverting input of a comparator or differential amplifier.
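As a rough illustration of that hysteresis idea (not taken from the paper; the threshold values here are arbitrary), a Schmitt-trigger-style update fits in a few lines of Python:

    def schmitt_trigger(samples, low=0.3, high=0.7, state=0):
        """Toy hysteresis: output stays high until the input drops below `low`,
        and stays low until the input rises above `high`."""
        outputs = []
        for x in samples:
            if state == 0 and x > high:
                state = 1
            elif state == 1 and x < low:
                state = 0
            outputs.append(state)
        return outputs

    # Noisy inputs near the middle do not flip the output back and forth:
    print(schmitt_trigger([0.1, 0.8, 0.55, 0.45, 0.6, 0.2, 0.4]))  # [0, 1, 1, 1, 1, 0, 0]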
@kazz811 4 years ago
Stellar walk through of a very nice paper! Thanks for doing these. Also, great job explaining the GRU by writing a diagrammatic version of it.
@herp_derpingson 4 years ago
It would be funny if, two papers down the line, it reclaims the crown from GPT. What a time to be alive.
@YannicKilcher 4 years ago
Haha, yeah. Though I think being able to remember things well is not the same as being able to attend to any token in the sequence. I guess the two have different strengths.
@PatrickOliveras 4 years ago
@@YannicKilcher Yeah, this seems to be more of a longer and stronger short-term memory. Perhaps it will have stronger implications in reinforcement learning?
@cwhy 4 years ago
@@YannicKilcher And attention would be kind of cheating if the purpose is the same as in this paper, because it lets in all the information at once.
@revimfadli4666 4 years ago
@@PatrickOliveras For longer memory (and more biomimicry), maybe it can be combined with differentiable plasticity?
@visionscaper 4 years ago
What I find interesting is that the BRC setup will be much faster than an LSTM/GRU, because it only uses element-wise multiplications; it can also be parallelised more easily.
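For reference, this is roughly what the updates look like as I read them from the paper; a sketch, not the authors' code, with shapes and initialization glossed over. The BRC uses only per-neuron (element-wise) recurrent weights, while the nBRC brings back full matrices in its two gates but keeps the element-wise memory feedback inside the candidate:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def brc_step(x, h, Ua, Uc, U, wa, wc):
        """One BRC step (sketch). Ua, Uc, U are input weight matrices;
        wa, wc are per-neuron recurrent weights (no full matmul on h)."""
        a = 1.0 + np.tanh(Ua @ x + wa * h)   # bistability gate, range (0, 2)
        c = sigmoid(Uc @ x + wc * h)         # update/forget gate
        return c * h + (1.0 - c) * np.tanh(U @ x + a * h)

    def nbrc_step(x, h, Ua, Uc, U, Wa, Wc):
        """One nBRC step (sketch): the gates a and c see h through full matrices,
        but the memory feedback into the candidate is still element-wise (a * h)."""
        a = 1.0 + np.tanh(Ua @ x + Wa @ h)
        c = sigmoid(Uc @ x + Wc @ h)
        return c * h + (1.0 - c) * np.tanh(U @ x + a * h)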
@priyamdey3298 4 years ago
One thing I would like to know: how do you keep track of the new papers coming in? Do you keep an eye on the sites every day?
@SachinSingh-do5ju 4 years ago
Yeah we all want to know, what's the source?
@ulm287 4 years ago
Twitter? just follow ML bots
@videomae6519 4 years ago
Maybe arXiv
@user93237 4 years ago
arXiv-sanity, Twitter accounts by various researchers, r/machinelearning
@YannicKilcher 4 years ago
Craigslist
@mehermanoj45 4 years ago
Damn man! A video every day👌
@patrickjdarrow 4 years ago
Interesting that the presence of BRCs highlights the trade-off of long-term memory. It feels like a better analog for the function 'I' may be learned.
@ekstrapolatoraproksymujacy412 4 years ago
In the GRU they use, the signal from h_{t-1} is multiplied by the weight matrix before going through the reset gate. It's usually the other way around, and if that's the case, that weight matrix can potentially have values that amplify h_{t-1} enough to get this positive feedback and bistable behaviour from a normal GRU with a standard sigmoid in the reset gate. And you missed that they get rid of this weight matrix completely in their BRC and add h_{t-1} without any processing (besides that 1 + tanh reset gate) to the output tanh.
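To make the distinction concrete, here is a sketch (not the paper's code, biases omitted) of the two GRU candidate-state variants being discussed, given an already-computed reset gate r:

    import numpy as np

    def gru_candidate(x, h_prev, Wh, Uh, r, reset_after):
        if reset_after:
            # Keras-style default: h_prev goes through the weight matrix first,
            # and the reset gate scales the already-transformed signal.
            return np.tanh(Wh @ x + r * (Uh @ h_prev))
        else:
            # "Classic" formulation: the reset gate shuts off h_prev before the matmul.
            return np.tanh(Wh @ x + Uh @ (r * h_prev))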
@YannicKilcher 4 years ago
Thanks for the clarifications
@ekstrapolatoraproksymujacy412 4 years ago
@@YannicKilcher I checked their code and they use the default implementation of the Keras GRUCell, which has the "reset_after" argument that controls whether the reset gate is applied after or before the matrix multiplication. I changed it so that the gate is applied before the matmul, and it is now running their benchmark 3 on MNIST. Of course training is painfully slow, so time will tell...
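For anyone wanting to try the same experiment: the switch in TF2/Keras is the reset_after argument (True is the default there); the unit count below is arbitrary, just for illustration:

    import tensorflow as tf

    # Apply the reset gate before the matmul (the "classic" GRU formulation)
    cell = tf.keras.layers.GRUCell(100, reset_after=False)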
@SlickMona 4 years ago
At 36:33 - why would nBRC get *better* with higher values of T?
@NicheAsQuiche 4 years ago
I'm confused about that too. Maybe I'm not getting something, but it looks like in all the benchmarks that model performs better with longer sequences, like somehow making the task harder makes it easier for it.
@YannicKilcher 4 years ago
I don't think the T changes. Just the N that specifies where the information is
@mrityunjoypanday227 4 years ago
Interesting to see the use in RL, replacing the LSTM.
@sphereron 4 years ago
Yes, many environments can require long term memory. Supervised problems not as often.
@revimfadli4666 4 years ago
@@sphereron Especially in complex, stochastic environments where storing all inputs ever seen would be inefficient, in contrast to tasks like NLP or image processing, where all inputs are already accessible in memory.
@bluel1ng 4 years ago
Fantastic sound quality! Nice explanation of gating in unrolling / BPTT. Regarding the linear feedback simplification: you claim that it would self-stabilize over time, but if f(V_post) were a*V_post with a > 2, this would "explode" with zero input (an IIR filter with one delay unit is not always stable).
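The stability point is easy to see numerically with a one-delay linear recurrence. This is a generic illustration, not the paper's exact feedback form: any recurrent gain with magnitude above 1 diverges even with zero input.

    def iterate(a, h0=0.1, steps=20):
        h = h0
        for _ in range(steps):
            h = a * h          # zero input, pure feedback
        return h

    print(iterate(0.9))   # decays toward 0 (~1.2e-2 after 20 steps)
    print(iterate(1.1))   # grows without bound (~6.7e-1 after 20 steps, and climbing)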
@YannicKilcher 4 years ago
yes, very true. I was just thinking of the simplest case
@alelasantillan 4 years ago
Amazing explanation! Thank you!
@TheThirdLieberkind 4 years ago
This is so interesting. I wonder how the research on the math and functions in biological neurons is done. It really sounds like the brain does actual number crunching and handles signals through known mathematical functions, like we do with computers. There might be a lot we can learn from biology in machine learning research.
@004307ec 4 years ago
Then you might want to look up spiking neural networks (SNNs).
@n.lu.x 4 years ago
I would say it's the other way around. We use math as a language to describe the world around us. It's just that now we want to use it to model learning and intelligence, and math is the best tool/language to do that. What I'm getting at is that the brain doesn't necessarily do number crunching; we describe it as such because that's the closest we can get to modeling how it works.
@tsunamidestructor 4 years ago
Maybe I'm in the minority here but I really want to see LSTMs/GRUs outperform GPT-x models
@angrymurloc7626 4 years ago
If history tells us anything, scalability will win over intelligent design.
@revimfadli4666 4 years ago
Perhaps with some Turing machine modifications....
@NicheAsQuiche 4 years ago
I feel that too, but I don't think it makes computational sense. Like, why would only seeing one thing at a time, in one order, and having to remember it all work better than being able to attend to all of it in parallel? I think the reason for our hope is that recurrence most likely better resembles what happens in humans, and we don't like the thought of designing something that works better (I know GPT-x is nowhere near human intelligence in language, but future improvements on transformers may make it so).
@revimfadli4666 4 years ago
@@NicheAsQuiche For tasks where all input data are available at once (NLP, image processing, etc.), large-scale parallelization like that might work better using GPUs/TPUs/etc. But for autonomous agents, reinforcement learning and the like in a stochastic and complex environment, storing inputs like that would be inefficient compared to having the NN "compress" all that data (which LSTMs and the like already do).
@sagumekishin5748 4 years ago
Maybe we can apply neural architecture search here to find a good recurrent cell.
@harshpathak1247 4 years ago
Nice overview! And nice paper. This seems similar to pitchfork bifurcations arising when performing optimization via continuation methods. Hope these methods continue to explain more about deep learning optimization.
@damienernst5758 4 years ago
The nBRC cell indeed undergoes a pitchfork bifurcation at a = 1 - see the Appendix of arxiv.org/abs/2006.05252 for more details.
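The bifurcation can be checked numerically: with zero input and the gates held fixed, the memory update reduces to h <- c*h + (1-c)*tanh(a*h), whose fixed points solve h = tanh(a*h). For a <= 1 only h = 0 exists; for a > 1 two stable nonzero fixed points appear (bistability). A quick sketch under these simplifying assumptions:

    import numpy as np

    def fixed_point(a, h0=0.5, c=0.5, steps=2000):
        """Iterate the zero-input memory map h <- c*h + (1-c)*tanh(a*h) to convergence."""
        h = h0
        for _ in range(steps):
            h = c * h + (1.0 - c) * np.tanh(a * h)
        return h

    for a in (0.5, 1.5):
        print(a, round(fixed_point(a, h0=0.5), 3), round(fixed_point(a, h0=-0.5), 3))
    # a = 0.5 -> both initial conditions collapse to ~0 (monostable)
    # a = 1.5 -> they settle near +0.86 and -0.86 (bistable)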
@harshpathak1247 4 years ago
Thanks, I have been following this topic closely. Here is a list of papers that talk about the dynamics of RNNs: github.com/harsh306/awesome-nn-optimization#dynamics-bifurcations-and--rnns-difficulty-to-train
@Claudelu 4 years ago
A good question would be: what is a good source of new or interesting papers? We all want to know where you find these amazing papers!
@patrickjdarrow 4 years ago
At 24:00, "...a biological neuron can only feed back onto itself". What is being referenced here? Surely not synapses
@YannicKilcher 4 years ago
I think they mean this bistability mechanism
@patrickjdarrow 4 years ago
@@YannicKilcher makes much more sense.
@clivefernandes5435 4 years ago
So if we have very long sentences or paragraphs, these will perform better than LSTMs, right?
@YannicKilcher 4 years ago
Maybe, I guess that's up for people to figure out
@darkmythos4457 4 years ago
Thanks, very interesting. In case you are reading this, I am going to suggest a nice ICML 2020 paper: "Fast Differentiable Sorting and Ranking".
@marat61 4 years ago
Why did the LSTM perform so much worse than the GRU starting at T = 50?
@YannicKilcher 4 years ago
who knows, it's more complicated
@aBigBadWolf 4 years ago
In their source code, the experiments all use truncated backprop at 100 steps. How does this learn with more than 100 padding symbols..?
@supernovae34 4 years ago
Hi! Author here. The backpropagation through time is actually done over all time steps of the time series. The parameter you are talking about is actually not used in the model and is a left-over from old code. I apparently forgot to remove it when cleaning the code. Nice catch, it would obviously be impossible to learn anything on these time series with such settings. Sorry for the confusion! (I updated the code.)
@aBigBadWolf 4 years ago
@@supernovae34 Thanks for the info. Which TensorFlow version is this code for? It would be helpful if you'd add such reproducibility details to the README.
@supernovae34 4 years ago
@@aBigBadWolf It's TensorFlow 2. Indeed, I plan on writing a clean README and probably cleaning the code a little more. I haven't had the time yet, but will make sure to have it done in the very near future!
@aBigBadWolf 4 years ago
@@supernovae34 I ran benchmark 3 for three of the models with zs 300. The LSTM resulted in NaNs after reaching 30% accuracy. BRC is stuck at 10%. nBRC achieved 94% accuracy (after 20k steps). The paper doesn't mention any instability issues. Additionally, this first run of BRC is not at all where the mean and std of Table 4 would indicate, which begs the question: how fair is the comparison in Table 4 really?
@supernovae34 4 years ago
@@aBigBadWolf Hi, sorry to hear that. This is weird, I didn't run into such issues. For the nBRC, 94% accuracy is rather normal; it should gain a few more percent in the next 10k steps. However, I am rather surprised about the BRC, as it proved to work well over three runs (and even more which were done when testing the architecture before the final three runs). Did you run it for long enough? One thing we noticed on this particular benchmark is that the training curve of BRC has a bit of a "saw-tooth" shape. I am also quite surprised about the LSTM, as we never saw it learn anything (on the validation set!).

One thing that might be worth noting, though, is that we saw GRU overfit benchmark 2 with "no-end" set at 200. That is, it achieved a loss of 0.6 on the training set, but a loss of 1.2 on the test set. Are you sure that the 30% accuracy of the LSTM is on the validation set? Have you had any problems recreating the results for the other benchmarks?

I will upload today a self-contained example for benchmark 2 (as I already did for benchmark 1). Both these scripts should give pretty much exactly the same results as those presented in the paper. Note that benchmark 2 requires a "warming-up" phase for the recurrent cells to start learning, so it is normal for the loss not to decrease before 13 to 20 epochs (variance). Once benchmark 2 is done (currently running to make sure the results with the new script using the Keras sequential model are the same as those presented in the paper), I will do the same with benchmark 3. This will result in three self-contained scripts, which should be much cleaner than what is currently available. Sorry for the inconvenience.

Also, come to think of it, we should probably have shown more learning curves for benchmarks 2 and 3, which would have given more insight than just the convergence results (it might also have answered some of your questions); unfortunately we lacked room in the paper to do so. Finally, we did try to be as fair as possible and our goal was never to crunch small percentages here and there, we just wanted to highlight a general behaviour!
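For readers trying to follow this exchange: benchmark 1 in the paper is a copy-first-input task, where the network sees a value at the first time step, then T irrelevant inputs, and must output that first value at the end. A rough data generator along those lines (my own sketch of the setup, not the authors' script; the noise distribution and array shapes are assumptions):

    import numpy as np

    def copy_first_input_batch(batch_size, T):
        """Benchmark-1-style data (sketch): predict the first input after T noisy steps."""
        x = np.random.randn(batch_size, T + 1, 1)   # first entry is the value to remember
        y = x[:, 0, 0].copy()                       # target = first input
        return x, y

    x, y = copy_first_input_batch(32, 300)
    print(x.shape, y.shape)  # (32, 301, 1) (32,)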
@grafzhl 4 years ago
Tried my hand at an experimental PyTorch implementation: github.com/742617000027/nBRC/blob/master/main.py
@bosepukur 4 years ago
wonderful video
@EditorsCanPlay 4 years ago
here we go again
@MrZouzan 4 years ago
Thanks !
@DavenH 4 years ago
Ahh, bi-stable. I kept reading it as though it would rhyme with 'listable' and didn't know what the heck that word meant.
@snippletrap 4 years ago
For the gold standard in biologically plausible neural modeling, check out the work by Numenta.