OUTLINE:
0:00 - Intro & Overview
1:10 - Sponsor Spot: Weights & Biases
3:35 - Problem Statement
8:00 - Continuous Attention Mechanism
16:25 - Unbounded Memory via concatenation & contraction
18:05 - Does this make sense?
20:25 - How the Long-Term Memory is used in an attention layer
27:40 - Entire Architecture Recap
29:30 - Sticky Memories by Importance Sampling
31:25 - Commentary: Pros and cons of using heuristics
32:30 - Experiments & Results
@mgostIH · 3 years ago
Cool fact: the infinity in Infinity Former stands for the number of papers about transformer variations that get published!
@carlossegura403 · 3 years ago
Seriously. But I can't complain; NLP has progressed drastically thanks to the popularity of the Transformer / GPT. A few years back, NLP was slow, tedious, and needed a separate architecture for each sub-specialized problem.
@L9X · 3 years ago
I thought the infinity stood for the number of variations of attention we have come up with, which aren't really even attention, but we call them attention because it's cool. (Seriously, why do we call modern "Gated Linear Units" an attention mechanism? It seems anything that involves a sigmoid or softmax applied to some vector and then multiplied by some other vector is called attention these days.)
@alpers.2123 · 3 years ago
Learning is compression
@mgostIH · 3 years ago
17:43 I don't quite agree with this; the basis functions don't seem to be biased towards storing more recent information. It seems like the precision you get is uniform across the entire time domain.
20:00 It might be that the network learns to map embeddings in a way that helps the interpolation perform better, but I think the most important question is how well this approach scales compared to storing the discrete representations themselves, and whether tasks like NLP benefit more from it than other general time-sequence predictions. I think it would be cool to see a learned interpolator too, although the performance might be very bad 🤔
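A minimal NumPy sketch of the compression being discussed, with made-up sizes, RBF width, and ridge penalty (the paper fits Gaussian-RBF coefficients by ridge regression, one regression per feature dimension):

```python
import numpy as np

# Compress L hidden states into N RBF coefficients, so the sequence becomes
# a continuous function x_bar(t) on [0, 1]. Sizes/constants are made up.
L, d, N = 512, 8, 64                       # sequence length, feature dim, basis size
X = np.random.randn(L, d)                  # hidden states to compress
t = np.linspace(0, 1, L)                   # token positions mapped into [0, 1]
mu = np.linspace(0, 1, N)                  # RBF centers
sigma, lam = 0.02, 0.5                     # RBF width, ridge penalty
Phi = np.exp(-((t[:, None] - mu[None, :]) ** 2) / (2 * sigma**2))   # (L, N) design matrix

# B solves min ||Phi @ B - X||^2 + lam * ||B||^2 -- one ridge regression
# per feature dimension, done jointly as a single linear solve.
B = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ X)       # (N, d)

def x_bar(tq, B):
    """Evaluate the continuous approximation at any t in [0, 1]."""
    phi = np.exp(-((tq - mu) ** 2) / (2 * sigma**2))
    return phi @ B                         # (d,) reconstructed hidden state

print(np.abs(x_bar(t[5], B) - X[5]).mean())     # error near the start...
print(np.abs(x_bar(t[-5], B) - X[-5]).mean())   # ...vs. near the end
```

Because the centers cover [0, 1] uniformly, the reconstruction error of this fit is roughly uniform in t, which matches the no-recency-bias observation above.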
@michaelparis6039 · 3 years ago
Exactly; this parameter τ gives you the control needed to balance newly learned and already stored information.
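Continuing the sketch above (same mu, sigma, lam, N, and x_bar), a hedged illustration of how τ trades old information against new: the old signal is re-sampled, squeezed into [0, τ], and refit together with the fresh states placed in (τ, 1]. The grid sizes here are made up:

```python
def update_memory(B, X_new, tau=0.5, L_M=256):
    """Concatenation & contraction step: smaller tau forgets the past faster;
    larger tau preserves it at the cost of resolution."""
    t_read = np.linspace(0, 1, L_M)                       # where we read the old signal
    X_old = np.stack([x_bar(tq, B) for tq in t_read])     # samples of the old memory
    t_all = np.concatenate([tau * t_read,                 # old samples squeezed into [0, tau]
                            np.linspace(tau, 1, len(X_new))])  # new states on the right
    X_all = np.concatenate([X_old, X_new])
    Phi_all = np.exp(-((t_all[:, None] - mu[None, :]) ** 2) / (2 * sigma**2))
    return np.linalg.solve(Phi_all.T @ Phi_all + lam * np.eye(N), Phi_all.T @ X_all)

B = update_memory(B, np.random.randn(256, d))   # absorb 256 new hidden states
```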
@ademord · 3 years ago
Can I ask you for feedback on my master's thesis?
@JuanCamiloGamboaHiguera · 3 years ago
You could argue that when training the model, if the embedding space is learned, the model would learn to map the training sequences to an embedding space where they can be represented by continuous signals. Also, this might just be a limitation of the choice of basis functions, but there is no reason why you couldn't have different types of basis functions in there (sawtooth, triangle, square, Gaussians, etc.) at the same time.
@user-ey2vv1dl3n · 3 years ago
nice format!
@freemind.d2714 · 3 years ago
Infinite Memory Transformer -> Informer
@vaibhavbansal9358 · 3 years ago
Yeah, it would have been a great name for it, but the name was already taken by a transformer architecture for time-series forecasting published in early 2021.
@michaelparis6039 · 3 years ago
18:47 The reason (why a learned embedding could be modeled as continuous) could be that language "feels" mostly fluid and continuous - or wouldn't you say?
@user-rh8hi4ph4b · 3 years ago
I don't think that language either is or feels continuous at all. I think the reason this approximation works here is that the feature width is large enough that there is enough redundancy in the hidden representations for the model to reconstruct some of the lost information when attending to/reading the long-term memory. Also note how the compression is applied to each feature dimension separately: for any element of the original sequence, the lossiness of the compression might have been particularly unfavorable in one dimension but particularly gentle in another. In total, even with the lossiness and the seemingly arbitrary assumption of continuity, there is still enough information left after compression for the model to extract relevant features, and that only gets better the larger you make the feature width.
@diagorasofmel0s · 3 years ago
keep up the good work!
@srh80 · 3 years ago
I think the next team who writes a transformer paper needs to donate a dollar to the swear jar.
@ハェフィシェフ · 3 years ago
I think this model would work really well in reinforcement learning, so I'm really curious how well it performs in scenarios that weren't described in the paper. I'd love to see you throw some problems at it and see how well it works.
@L9X · 3 years ago
Man, I love your videos
@geoffreysworkaccount160 · 3 years ago
Thank you for this.
@nocturnomedieval · 3 years ago
During my PhD studies I came across the little-known and sometimes surprising dark-magic-sorcery fact that Markov chains can have memory, and even more, that it can be infinite. I was wondering when this would reach the AI niche. I am a believer that Information Theory can synergise pretty well with DL.
@rpcruz · 3 years ago
Markov chains make very strong assumptions, namely that the next step depends only on the current step.
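The two comments are compatible via a standard construction: a chain that remembers the last k steps is just a first-order Markov chain over k-tuples of states. A small Python sketch with a made-up transition table:

```python
import random

# A "second-order" chain over {a, b}: the next symbol depends on the last TWO
# symbols. Re-encoded as a first-order chain whose states are pairs, the
# Markov property holds again. Transition probabilities are made up.
P = {
    ('a', 'a'): {'a': 0.1, 'b': 0.9},
    ('a', 'b'): {'a': 0.7, 'b': 0.3},
    ('b', 'a'): {'a': 0.5, 'b': 0.5},
    ('b', 'b'): {'a': 0.9, 'b': 0.1},
}

state = ('a', 'b')
for _ in range(10):
    probs = P[state]
    nxt = random.choices(list(probs), weights=list(probs.values()))[0]
    state = (state[1], nxt)          # slide the 2-symbol window
    print(nxt, end='')
```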
@lirothen · 3 years ago
Feels like the compressing & appending of long-term memory in Figure 2 should be applied to the attention as well.
@florianro.9185 · 3 years ago
MLflow is a good alternative to Weights & Biases in my opinion :)
@norik1616 · 3 years ago
Best W&B ad ever
@theohintemann9374 · 3 years ago
Thanks - good work
@agentds1624 · 3 years ago
Who or what is the name in the last sentence of the video? I hear "lucidrains" (and so do the automatic subtitles).
@starkest · 3 years ago
Sorry, what was the reference at 36:28 for where the implementation will be available?
@Anoyzify · 3 years ago
Also want to know this.
@Kram1032 · 3 years ago
Can't wait for somebody to come up with the idea to "just learn the attention functions" by using an entire fully connected DNN dedicated to just the attention mechanism to encode arbitrary functions - and then for somebody to switch that out with first a CNN and then a transformer to get a second-order attention-attention... and then just continue from there. Construct the infinite tower of attentions all the way down. Also: try the same but in Fourier space!

More seriously though, the smoothing could work out if you allow, like, deeper lookups. I'm imagining something like the network going "oh, I remember there was something relevant roughly in the first quarter of the book", at which point this information can be unpacked and attended to at higher resolution. A sort of binary-search-ish thing could be possible that way, assuming you can't just hold everything in RAM, but you *can* have access to it on your hard drive and it might be fast enough to retrieve that way. In that case, smoothing the signal first might make sense, if you can somehow reconstruct the deeper attention once you see what actually lies there again.
@Virsconte · 3 years ago
Before he said they were using RBFs, I just assumed they would be using Fourier series. I wonder if you could reorder the dimensions of the embeddings to try to get rid of higher-frequency components: basically, looking at all of the tokens in your dataset (or some subset) and seeing which dimensions correlate the most.
@oncedidactic · 3 years ago
@@Virsconte This seems to make good intuitive sense as well as being attractive from an engineering perspective. Like, you don't need to remember the exact sentence structure of a story years later, just the major nouns and verbs, and hence a simplified but effective description of events. It's another tradeoff for compression, but seems reasonable.
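A hedged sketch of that frequency-domain idea (the commenters' suggestion, not the paper's method; sizes are made up): compress the sequence per feature dimension by keeping only low-frequency DCT coefficients.

```python
import numpy as np
from scipy.fft import dct, idct

L, d, K = 512, 8, 64                       # sequence length, feature dim, kept coefficients
X = np.random.randn(L, d)                  # hidden states to compress
C = dct(X, axis=0, norm='ortho')           # frequency representation, per dimension
C[K:] = 0                                  # drop high-frequency components (8x compression)
X_hat = idct(C, axis=0, norm='ortho')      # lossy reconstruction
print(np.abs(X - X_hat).mean())            # average reconstruction error
```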
@RuminRoman · 3 years ago
Yannic, but you can fit the infinite Universe in your finite head. Even if you have forgotten something, you can read the appropriate paper, or talk to a specialist, or expand your brain with new modules yourself, with your own money which you yourself have earned. So we need a model that can read and talk and make money and expand itself with its own money. Positive cash flow.
@JTMoustache · 3 years ago
Rather than using a decomposition based on RBFs, they could have used a more classic and richer wavelet decomposition. The structure of the wavelet output should also make more sense to the network, imho.
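A minimal sketch of the suggested wavelet alternative, assuming PyWavelets is available and with made-up sizes and wavelet choice: decompose each feature dimension and keep only the coarse approximation band.

```python
import numpy as np
import pywt

L, d = 512, 8
X = np.random.randn(L, d)                            # hidden states to compress
coeffs = pywt.wavedec(X, 'db4', axis=0, level=4)     # multiresolution decomposition
coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]  # drop all detail bands
X_hat = pywt.waverec(coeffs, 'db4', axis=0)[:L]      # coarse reconstruction
print(np.abs(X - X_hat).mean())                      # average reconstruction error
```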
@ademord · 3 years ago
Can I ask you for feedback on my master's thesis? 🧐
@oncedidactic · 3 years ago
I forget which other transformer paper you did, but it brought up the idea of why not use Fourier transforms to define attention. The idea being that it's not the exact form of the attention that matters, since the learning modulates it, but just some mixing in general. This one gets me thinking: if we want to dabble in heuristic-compression hell for a good tradeoff against incalculable backprop, why not use Fourier for the long-term signal memory (instead of RBFs) and also for the attention learner (instead of whatever is du jour)?

Like, signals are all waves anyway; tokens are projections of whatever the "upstream" process was that generated them. It's not too crazy to think that the compression lossiness against tokens might actually overlap well with the generating function making the tokens, or at least with the relevant features of it that you're trying to learn anyway. Promise I'm not trying to superficially conflate two areas here just because "hey look, it's a curvy continuous thing"; it's more a remark on the artificiality of tokens as learnable data. I guess another thing you could say here is: if not Fourier or another good jack-of-all-trades compression system, which gimmick is supposed to work best? It can't be that we're just hunting for the right gimmick that runs well on our chips.

Forgot to say: totally agree, super shady to call it infinite with such a modest performance bump for a "new architecture". And skeezy to outsource to the heuristic.
@umabrahma2019 · 3 years ago
Amazing
@vaibhavbansal9358 · 3 years ago
Why does this paper feel like a 'Perceiver IO, language modelling, and the Fourier transform walk into a bar' thing? 😛 Though, great video, once again! 😄
@kanfoosj · 3 years ago
I prefer calling it "nifty-former"
@rochne · 2 years ago
"Informer" would have been the best choice of name IMO 🙂
@pensiveintrovert4318 · 3 years ago
Basically quantization.
@sophiez7952 · 3 years ago
Hi, I can learn a lot from you. You are great, thanks. Have a wonderful day.
@nikitastaf1996 · 3 years ago
I don't know if you did it intentionally or not, but the image for the Problem Statement chapter is the Weights & Biases ad.
@Anujkumar-my1wi · 3 years ago
Hey, can you clear up my confusion about why the mathematical model of an artificial neuron is: the input x is subject to an affine transformation defined by W, followed by a nonlinear transformation, i.e. nonlinear_fun(weights*inputs + bias)? Why not the other way around: a nonlinear transformation on the input and then an affine transformation on the transformed input, i.e. nonlinear_fun(inputs)*weights + bias?

Also, the mathematical model of an artificial neural net is written as weights*nonlinear_fun(weights*inputs + bias) + bias. Isn't that the output of the network? Shouldn't it then be nonlinear_fun(weights*nonlinear_fun(weights*inputs + bias) + bias), or is it because the activation function of the output neuron is linear?

EDIT: I mean, shouldn't the mathematical model of a single neuron and of an artificial neural net be the same?
@rpcruz · 3 years ago
For most problems you do "nonlinear_fun(weights*nonlinear_fun(weights*inputs+bias)+bias)". But for some problems, e.g. when the output is a linear regression, the last nonlinear_fun is the identity function, so you can omit it.
@Anujkumar-my1wi · 3 years ago
@@rpcruz Thanks, but about the first part: do you have any thoughts on why the model of an artificial neuron is an affine transformation followed by a nonlinearity, i.e. nonlinear_fun(weights*inputs+bias), rather than a nonlinearity on the input followed by an affine transformation, i.e. nonlinear_fun(inputs)*weights+bias?
@drdca8263 · 3 years ago
@@Anujkumar-my1wi So, take the example of a fully connected feed-forward network. In the way this is usually done, you apply a matrix to the input vector and then apply the nonlinearity to each coordinate separately. If you instead apply the nonlinearity to each coordinate first, it doesn't get to use the mixing of the different inputs. If you add more layers, the two should be equivalent, but that is basically just the same as "apply this nonlinearity at the start of your network and then continue as normal", which is kind of pointless, I think? I am not experienced in ML, so take this with a grain of salt.
@Anujkumar-my1wi · 3 years ago
@@drdca8263 Thanks
@mgostIH · 3 years ago
Doing a nonlinear operation first loses information about the inputs. For example, if your input is [1., -1.] and you apply a ReLU before doing any matrix multiplication, you will lose the -1, as it becomes a 0.
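A tiny NumPy check of that point (a hedged illustration, with a made-up weight matrix): applying the nonlinearity first collapses distinct inputs, while affine-first keeps them separable.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

x1 = np.array([1.0, -1.0])
x2 = np.array([1.0,  0.0])

# Nonlinearity first: the two different inputs collapse to the same vector,
# so no later affine layer can tell them apart.
print(relu(x1), relu(x2))          # [1. 0.] [1. 0.] -- information lost

# Affine first: the layer can mix coordinates before clipping, so the sign
# can survive (here the second row reads the difference x1 - x2).
W = np.array([[1.0,  1.0],
              [1.0, -1.0]])
print(relu(W @ x1), relu(W @ x2))  # [0. 2.] vs [1. 1.] -- still distinguishable
```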
@TheGallowtree · 3 years ago
I can't believe no-one spotted the typo in equation 8.
@ericadar · 3 years ago
Why not treat the problem of compressing the embedding vectors as a learnable task, or at least show that their choice (ridge regression using RBFs) is superior to other lossy compression heuristics? Seems arbitrary.
@bigbuckey7687 · 3 years ago
The problem with learning compression is that now you have to learn through time, and you run into the classic problems LSTMs had to solve, like vanishing/exploding gradients, not to mention that you can't parallelize training anymore. But yeah, I agree their choices seem arbitrary and not justified (as someone who hasn't read the paper myself).
@mgostIH · 3 years ago
@@bigbuckey7687 > The problem with learning compression is now you have to learn through time

I don't think that's true in this case: you only need to learn something that's good at interpolating points; there's no variation in the training regime caused by the number of tokens.
@bigbuckey7687 · 3 years ago
@@mgostIH Well, we have plenty of algorithms to simply interpolate points; there's no need to learn that. But if we did make some custom interpolation that was learned, then since this architecture samples the past interpolations, the loss would depend on those previous iterations, a.k.a. learning through time. This is the same idea as what an LSTM does (with major differences, of course). Not sure what you mean by "there's no variation in the training regime that would be caused by the amount of tokens".
@mgostIH · 3 years ago
@@bigbuckey7687 > Well we have plenty of algorithms to simply interpolate points, there's no need to learn that.

A lot of algorithms for interpolation make specific assumptions about the data. Given that there's a strong link between compression and intelligence (Marcus Hutter), I would start thinking much more about new neural approaches for this sort of problem.

> since this architecture samples the past interpolations the loss would depend on those previous iterations, a.k.a learning through time

But there are no "previous iterations" in this case: in training models like these, you only need a single forward pass to give all the tokens in a sentence their continuous representation and then interpolate between them.

> Not sure what you mean by "there's no variation in the training regime that would be caused by the amount of tokens".

An example could be training a SIREN network to fit the points you are given: their number doesn't change anything about how you train the network, but you still get out an interpolation that doesn't depend on the number of tokens you had.
@bigbuckey7687 · 3 years ago
@@mgostIH Ah, OK, now I get your point. As long as the learned interpolation only depends on the input, and not on hidden states updated through time, the iterations don't depend on each other. Thanks for the clarification.
@umabh2339 · 3 years ago
Nice
@draxd3045 · 3 years ago
I like the sound of the helicopter
@samuelelwell7575 · 3 years ago
Hey Yannic, your videos are awesome! Do you know if you'll be doing one on the AlphaFold 2 paper?
@jean-baptistedelabroise5391 · 3 years ago
It is weird to limit the attention to a Gaussian if you make this model into a BERT-type model, as the BERT attention for the CLS token will often attend to every SEP token...
@TheGodSaw · 3 years ago
I think you meant to say "brings about the same problems as LSTMs, namely you get angry posts from Schmidhuber".
@NeoShameMan · 3 years ago
I was wondering when such an architecture would emerge. Back when AI Dungeon was released, I thought: what if we compressed the previous half of the working memory into a summary, so that the sliding working memory retains more information for continuity? It's a function that the language model could already do back then.
@L9X · 3 years ago
W&B gang
@IvanHe-gc7bf · 7 months ago
I think it should be called "Continuous Attention Is All You Need".
@jawadmansoor6064 · 3 years ago
I think the best transformer was the stackformer or addformer (where you can stack layers without increasing complexity quadratically).
@__E__ · 3 years ago
Which papers exactly are you referring to? I can't find them by googling addformer or stackformer.
@jawadmansoor6064 · 3 years ago
@@__E__ I am sorry, I have a habit of modifying names (for fun, as they sound to me). I actually meant Fastformer (the additive attention paper): "Additive attention CAN BE all you need".
@__E__ · 3 years ago
@@jawadmansoor6064 It's alright man :) But I'm not sure I got this paper right, because afaik it's not about stacking layers but rather about having a global key and a global query per layer. The quadratic stuff happens inside a layer, not because you stack them.
@jawadmansoor6064 · 3 years ago
@@__E__ Transformers are ordinarily expensive due to quadratic complexity; if you stack layers, the big-O is still quadratic (it being the most expensive operation). However, this paper suggests you only compute the K/Q/V summaries once (Q once, K twice and V thrice, I forget which was which), hence the maximum complexity just depends on the number of tokens (linearly). What I mean by a layer is all the operations from the "global" key/query computation onward, and you can stack the same operation on top of it. (It is difficult to explain in a few words; better to refer to the video by Yannic Kilcher.) If you still don't get it after watching the video, do ask again and I will write an explanation in a few days (motivated by you, though I will make it public), insha ALLAH.
@hosseinmobahi4841 · 3 years ago
Why are you recording yourself on a rooftop? :) Love the location.
@Metaloid-wv4kz · 3 years ago
He's trying to go to infinity and beyond; he's reforming his thinking and having a watershed moment, bro! Or he's just scared. We all need a rooftop moment.
@ngginger2944 · 1 year ago
This paper reminds me of VQ-VAE
@sergiomanuel2206 · 3 years ago
Hello there!!
@minos99 · 3 years ago
This is one of the xformers I'm going to forget. Continuous representations are welcome, but I don't think they justify their choice enough. The model feels too unjustifiably handcrafted.