The one thing that irks me is kernels keep being referred to as "mapping to infinite dimensional space" in these vids. While true, it's kind of unhelpful. It's probably more helpful to think of it as mapping to a (potentially infinite) space where the (potentially infinite) bases aren't restricted to be vectors. The whole trick about the Hilbert space is it's just a space with a well-defined inner product, and that inner product can be between functions etc. You could have a function mapping to a space where the bases are the sin and cosine functions, for example, and you use that space to take the inner product between different sine/cosine wave combinations. Typically the kind of space you want to map to will express something about the problem you're trying to solve. I bring this up because I've seen kernels mentioned twice now, and both times the explanations are kind of lacking. Not Yannic's fault, I was lucky enough to spend a bunch of time around kernel gurus that cleared up a bunch of misconceptions for me...before that, I was making similar mistakes. Content otherwise very solid, as usual!
@YannicKilcher4 жыл бұрын
Thank you :)
@zhonghuahe35563 жыл бұрын
Hi, can you recommend some learning materials about "kernels"
@brandomiranda67033 жыл бұрын
Mit’s 9.520 with Lorenzo Rosasco ;)
@CosmiaNebula2 жыл бұрын
This is a note for the mathematicians: When people say "vector" in computer science, they mean "an array with finitely many entries". In other words, they mean a function from {1, 2, ..., n} to some set X. This is a severe generalization/weakening of the typical vectors that we mean, where a vector lives in a vector space, where there are all kins of symmetries (the group GL(n, K)). So when Alex Stenlake says "they aren't restricted to be vectors", he meant "they aren't restricted to be a finite-entry array". They are in fact, *exactly* what we mean when we say "vectors in a countable, separable vector space", in other words, any vector space we use in functional analysis. So as we see, this is a severe strengthening of a severe weakening (first, the CS scientist weakens the mathematician's idea of finite dimensional vector space, then they strengthen it to return to the mathematician's idea), which is why Alex had to point it out.
@andrewgjkgjk4 ай бұрын
This probably comes from the idea that, practically, these bases are computed via an infinite polynomial series that can only be approximated in compute using techniques like random fourier sampling, no?
@NicheAsQuiche4 жыл бұрын
I love that you go over and explain the topics each time you introduce them (e.g. transformers) even though you go over them in other videos it's so useful to me to hear different explanations of it and realy clrafiy what's going on inside. Really helps me get a better understanding.
@florianhonicke54484 жыл бұрын
Haha I was writing almost the same
@williamtepe41674 жыл бұрын
Me too, I third this ^
@florianhonicke54484 жыл бұрын
I like that you explain some stuff again in different videos. It makes sure everyone is on track and if someone would be annoyed by the redundance the person could just fast forward. To me it was very helpful to get multiple explanations about transformers
@utku_yucel4 жыл бұрын
the idea of Explaining the papers is awesome! Thanks!
@snippletrap4 жыл бұрын
It's a big part of grad school.
@fotisj3214 жыл бұрын
Thanks for making a difficult paper accessible. I especially like it that there a red threads in you selection of papers, which create almost something like a story-line, for example the one starting with the paper on the transformer architecture and the follow-up papers trying to remedy some of its shortcomings but keep its achievements.
@good_user354 жыл бұрын
At first, I thought your videos are a bit lengthy, but after watching this now I feel that I woudn't be able to understand the contents without sacrficing such an amount of time. And what you point out on the paper similarly tells me that the cost of softmax function is the needed price considering that the function can do in an unknown task. Thanks, it was good watching.
@YannicKilcher4 жыл бұрын
also, I suggest just changing the playback speed to 2x if it's too long :)
@andyblue9534 жыл бұрын
Great explanation. I really like the way you break down publications to the most essential parts. Never was "reading" articles that comfortable.
@OwenCampbellMoore4 жыл бұрын
These are amazing, love the depth and really appreciate the amount of content! I know you’ve said you’re not planning on it, but I’d be super happy to support on Patreon if that helps support you in making so much content!
@veedrac4 жыл бұрын
The natural question I have after this paper: why not both? Test a model with X% compute allocated towards ‘real’ attention and 1-X% towards linear attention, and see what's best.
@GeekProdigyGuy4 жыл бұрын
You can always ensemble any two models, but unless they provide independent contributions, there isn't really any point. Plus in this case their proposed model's main advantage is reduced compute/memory, so it would seem entirely counterproductive to ensemble with plain attention.
@veedrac4 жыл бұрын
@@GeekProdigyGuy The idea is that traditional attention is high quality, but struggles at reaching a large context. Clearly that high quality attention is necessary on short scales, but only a small reduction in the traditional attention would be needed to allow for a linear attention that reached very large context sizes. Assuming the far context needs less character-level detail (same thesis as Compressive Transformers), the lossier linear attention should suffice there.
@lucidraisin4 жыл бұрын
Veedrac - I actually explore that here github.com/lucidrains/linear-attention-transformer
@darkmythos44574 жыл бұрын
Thanks Yannic, soon enough you will be one of the requirements for ML jobs; Published at Neurips, ICML, and Yannic YT Channel.
@YannicKilcher4 жыл бұрын
Haha :D problem is I'm very corrupt so idk how that's going to work out ;)
@daryoushmehrtash76013 жыл бұрын
I appreciate your comment about different neural network architecture as different routing of information.
@welcomeaioverlords8 ай бұрын
I'm very late, but well done. Good call out on the RNN caveat! I missed that. And one correction (or I misunderstand something): it seems that we're not actually going to a higher dimensional space through phi, but rather applying a point-wise non-linearity and maintaining the same dimensionality. Any comments on that choice? That's seemingly contrary to the point of the kernel trick as I understood it. Why not do the typical DL thing and simply make phi a learnable function?
@Erotemic4 жыл бұрын
That is quite an impressive speed increase, and the ability to perform RNN like inference is incredibly useful. I wish they investigated other kernel functions (or maybe they did and I missed it), but I'd be interested in seeing how the choice of kernel impacts the results. Am I understanding the RNN autoregressive-transformer equivalence correctly? Is it the case that translation transformer models are not RNNs?
@YannicKilcher4 жыл бұрын
only transformers that are using the specific causal masking are RNNs
@blue_bear_music4 жыл бұрын
Cool charts at 26:20. Interesting that Reformer is slower than regular transformer at shorter sequences. You would't know that just from asymptotic time complexity. Looks like the paper conveniently omits that :) Linear attention seems super fast tho. Also it's probably obvious, but two times higher slope on a log chart corresponds to a square of a function on a linear chart.
@YannicKilcher4 жыл бұрын
Yep, I'm just a moron when it comes to non-linear charts :D
@sitioprueba28552 жыл бұрын
Thank you so much for all this great explinations!
@snippletrap4 жыл бұрын
Re: causal masking implementation @38:00, input[i] != target[i] because the target is shifted by one before being passed into the model.
@that_guy46904 жыл бұрын
Your videos are pretty long, but they are really not long enough :). Each time the video ends, I get a feeling that I need an even deeper explanation e.g. their code review
@davidsabatini91002 жыл бұрын
A really good video! I found it super informative, thank you!
@DanielHesslow4 жыл бұрын
This is super cool and I think we will see the performance improve down the line as people explore other feature mappings, elu + 1 seems a biiit arbitrary
@samm98404 жыл бұрын
How about applying the kernel (O(n)) trick at inferencing alone after training? I am especially interested whether the DETR model (Yannic already did a video on that), that uses the heavy-weight transformer be modified to do object detection simply with this linear version! May be it's an idea for a new paper. Looking forward to it.
@Kerrosene4 жыл бұрын
I seems like the closer we are to approximating the soft-max operator with a finite dimensional feature map, the better the performance of this linear transformer would be...which is what they suggested in the conclusion..Also, they don't use positional encodings..
@tomw46884 жыл бұрын
Good shit! This video was very helpful and many ELI5 moments useful for students. Thanks!
@noreddine4 жыл бұрын
Yannic: just lett me think what you think. Me: I think you're awesome
@herp_derpingson4 жыл бұрын
12:33 If Adding all the K[i] terms means that there is no way for the transformer to know the position of words in the sentence. Maybe the positional embeddings are helping it somehow. It would be interesting to see the performance of this transformer without the positional embeddings. Then again, RNNs dont have any explicit mechanism to store positions of words.
@YannicKilcher4 жыл бұрын
True, the positional embeddings probably do a lot of work here. RNNs don't have explicit position, but they can theoretically learn to count the number of input tokens - or what's probably more relevant - learn to estimate distances between tokens, which is another thing that position embeddings in transformers are good for.
@kiyoonyoo47614 жыл бұрын
Thanks Yannic for the explanation. I was actually having some difficulty understanding how they used the associative property to transition from Eq. 4 -> Eq.5. Glad to see you've already made a video! This wasn't explained in detail on the video also, maybe because it is something trivial, but am I the only one who cannot get around why the two formulations (Eq.4 and Eq.5) are equivalent? The dimensions of the numerators definitely are not the same as the former is a linear combinations of column vectors, while the latter is a row vector multiplied by square matrix. Also, associative property does not seem to be true in general for vector multiplication involving vector dot products. Any help or explanation from anyone would be very appreciated :0
@YannicKilcher4 жыл бұрын
Sometimes people are very loose with row/column vectors and don't properly apply transposes. Can you find a counter-example where the equation doesn't hold (even when glossing over transpose issues)?
@norik16164 жыл бұрын
@@YannicKilcher I had hard times with the indexes as well, so I wrote it in numpy: It does not help with transpositions (as numpy matches them *mostly correctly* automatically), but it helps with where to put which product - the equation holds true. import numpy as np shape = 5, 7 K = np.random.rand(*shape) Q = np.random.rand(*shape) V = np.random.rand(*shape) classic_attention = np.matmul(np.matmul(Q, K.T), V) assert classic_attention.shape == shape # stripped down classic attention res = [] for i in range(shape[0]): res.append(sum([np.matmul(Q[i], K[j]) * V[j] for j in range(shape[0])])) matmul_byhand_attention = np.array(res) print(np.allclose(classic_attention, matmul_byhand_attention)) res = [] for i in range(shape[0]): s = sum([np.outer(K[j], V[j]) for j in range(shape[0])]) # associativity and distributivity res.append(np.matmul(Q[i], s)) matmul_byhand_attention2 = np.array(res) print(np.allclose(classic_attention, matmul_byhand_attention2))
@RuminRoman4 жыл бұрын
I wonder why you do not have a video about Universal Transformers
@rayaengazou21406 ай бұрын
Hey! that's greate !! i would like to ask about the model he used to produce results in the paper ?in github there was only the library not the model ! can you share with us the model ?
@MrMIB9834 жыл бұрын
Love it! Universal transformer please!
@lucasnestler69334 жыл бұрын
As the paper states that transformers attending to one side only are equal to RNNs, doesn't it imply that full transformers are equal to residual bidirectional RNNs?
@YannicKilcher4 жыл бұрын
Only in an abstract sense. The computations are really different from a bidirectional RNN
@mattiasfagerlund4 жыл бұрын
Hi - is the reason that the softmax version is O(N^2) that softmax (in the general case) has N^2 gradients before they're collapsed back? Or is there something else going on?
@YannicKilcher4 жыл бұрын
It's because you need to do N^2 inner products. Maybe if the softmax wasn't there, you could somehow optimize that away, so I think you do have a point
@sajjadayobi6883 жыл бұрын
thank Yannic, we saw a few papers for transformers in long sequences like ReFormer, LinFormer, LongFormer but which one of them is really well in real-world if you make a video about comparing these papers for faster or longer Transformer I would be very thankful
@andres_pq4 жыл бұрын
So in practice you could stack way more tranformer layers and have better performance?
@YannicKilcher4 жыл бұрын
that's always the goal :D
@andyblue9534 жыл бұрын
I was wondering to which extend their idea is related to A² double-net attention? In the latter, the outer product is likewise separated into a gathering step and a distribution step, and it's likewise shown to be on par with the original transformer. Anyway, the general mathematical idea can already be found in the SVD.
@RobotProctor4 жыл бұрын
If Q.K is usually the highest for the same token, I wonder why Q and K need to be different at all for a transformed to work properly. Why not just output a K and a V?
@YannicKilcher4 жыл бұрын
interesting idea. some papers fuse k and v, but I've never heard of fusing k and q
@raminrasoulinezhad4 жыл бұрын
Cool description. Thanks
@ShokhanBirlikov4 жыл бұрын
Your intuition at 31:40 is great! Maybe a silly question, but is it a possible architecture where the transformation function phi is learned as Nx(#of intermediate units) matrix of trainable parameters?
@YannicKilcher4 жыл бұрын
sure, I don't see why not
@seraphim97234 жыл бұрын
Wait a second: Figure 5c is showing PER over wall-clock time, and despite the significant speed-up the original softmax transformer is better AND faster (at reaching any given error rate).
@YannicKilcher4 жыл бұрын
Yes, that's because on this task it seems to be very important to have the quadratic attention
@baqirhusain56522 жыл бұрын
The complexity works well when N > D^2. So the paper is saying if dimensionality is 512 my sequence length should be greater than 512^2 = 262144. Is this really practical !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@bhuvanb81564 жыл бұрын
at 14:17 numerator and denominator are same except multiplying with Vj, how's is that possible
@YannicKilcher4 жыл бұрын
it characterizes a normalized distribution over V
@zhangcx934 жыл бұрын
According to there code, the main computation is like this: KV = torch.einsum("nshd,nshm->nhmd", K, values) Z = 1/(torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1))+self.eps) V = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z) can someone translate this einsum into a normal pytorch way?
@eelcohoogendoorn80444 жыл бұрын
Better off just reading up on einsum and getting familiar with it. it is by far the most readable and intuitive way of expressing these statements; and I say that as a numpy veteran of 15 years.
@YannicKilcher4 жыл бұрын
yea the problem is that would translate to multiple matmuls, elementwise products, transposes and summations :D I also suggest you dive into einsum notation
@markdaoust45984 жыл бұрын
I came to ask “what’s the einsum for 21:03.”. Thanks! So.. dropping the “batch” and “head” indices it’s: Vld = sum_sd Qld*Ksd*Vsm / sum_sd Qld Ksd
@muneebth55084 жыл бұрын
Sir, Thank You
@Bencurlis3 жыл бұрын
The generalized formulation of equation 3 looks like Bayes formula, has anyone noticed? Is it accidental or is there a deeper meaning here?
@minhoryu27864 жыл бұрын
Can anyone summarize difference between linformer and this paper?
@meerkatj93634 жыл бұрын
Linformer has a linear complexity thanks to a constant dimension of the keys and values because they are projected in a fixed lower dimension. Whereas this does not fix the dimension of the sequence, it only reorders the computation. The reordering can only be made with conditions that need the kernel thing.
@couragefox4 жыл бұрын
commenting for algo. love your channel
@myungchulkang57164 жыл бұрын
thank you :)
@siyn0074 жыл бұрын
speed over accuracy for me tbh. Not everyone has Google's resources to run it on 112312412312423 TPUs
@YannicKilcher4 жыл бұрын
That's a lot of damage. I mean TPUs
@first-thoughtgiver-of-will24564 жыл бұрын
If I was rich I'd give you money.
@charudattamanwatkar83404 жыл бұрын
Who else saw balls in the thumbnail?
@herp_derpingson4 жыл бұрын
Now, that I see it. I cannot unsee it.
@dmitrysamoylenko67754 жыл бұрын
Now we are all see the message
@joery82904 жыл бұрын
I love your videos, but could you try to make them more compact? 50 minutes is a big time investment. I feel like most papers can be explained in high detail in half that time.