If nothing else, the contribution to model naming is a clear increment to SOTA.
@jonatan01i (3 years ago)
Nyströmer clearly is.
@andreassyren329 (3 years ago)
@@jonatan01i I will agree with that.
@VikasSingh-jv7fn (3 years ago)
Hello Yannic, your comment about the order of operations is correct. It is one of those things where you set out to check how poorly it performs and find out that it can work empirically (at least in limited settings). The lemma is not practically useful; it merely verifies that if/when everything is idealized, the procedure does not lead to nonsensical conclusions. The choice of F early in the paper was to avoid a conflict with D (D and d were both used) and E (the ones matrix).
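For readers following along, this is the factorization the lemma concerns, as I read the paper (notation reconstructed here, so treat it as a sketch): with landmark queries $\tilde{Q}$ and landmark keys $\tilde{K}$, the softmax attention matrix is approximated as

$$\hat{S} \;=\; \underbrace{\mathrm{softmax}\big(Q\tilde{K}^\top/\sqrt{d_q}\big)}_{F}\;\underbrace{\mathrm{softmax}\big(\tilde{Q}\tilde{K}^\top/\sqrt{d_q}\big)^{+}}_{A^{+}}\;\underbrace{\mathrm{softmax}\big(\tilde{Q}K^\top/\sqrt{d_q}\big)}_{B},$$

and the lemma checks that when the landmarks are the full sequence, this product collapses back to the exact $\mathrm{softmax}(QK^\top/\sqrt{d_q})$.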
@tchlux (3 years ago)
What about the A^+ comment? Was that actually a typo in the paper? kzbin.info/www/bejne/o17do5ajh8lqe5Y
@zhanpengzeng8592 (3 years ago)
@@tchlux Yes, that is a typo. We somehow left out the pseudoinverse sign.
@xiongyunyang9643 (3 years ago)
@@tchlux This is a typo. We will update it. Thanks for the catch.
@JoelHough (3 years ago)
I have seen this exact type of lemma in many discussions about approximations. It did not seem out of place to me. It's nice to know that in the limit your approximation will agree with the ground truth, which is certainly not the case for all approximation methods.
@herp_derpingson (3 years ago)
0:42 Nyan-storm-former!
3:30 Time for my weekly Transformer explanation :)
27:00 That was a really sweet and easy to understand explanation.
35:00 I wonder if we can have a DNN just predict a landmark tensor.
@xiongyunyang9643 (3 years ago)
Thanks for making this great video. Nice catch on the typo. We will update the draft soon.
@mdmishfaqahmed8356 (3 years ago)
The pronunciation of those author names was clutch :D
@RobEnglebright (3 years ago)
top work pronouncing the authors
@jamiekawabata7101 (3 years ago)
If I lift box 1 onto shelf 1 and box 1 onto shelf 2 and box 2 onto shelf 1, then I can predict the effort in lifting box 2 onto shelf 2. Great analogy, thank you.
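To make the analogy concrete, here is a tiny numerical sketch (my own made-up numbers, purely illustrative): if the effort factorizes into a per-box term times a per-shelf term, the effort matrix has rank 1, and the unseen entry is determined by the other three, which is exactly the Nyström style of reconstruction.

import numpy as np

# Hypothetical effort values: rows are boxes, columns are shelves.
effort = np.array([[2.0, 4.0],      # box 1 onto shelf 1, shelf 2
                   [3.0, np.nan]])  # box 2 onto shelf 1; shelf 2 unknown

# If effort = (box factor) * (shelf factor), the matrix is rank 1, so the
# missing entry follows from the other three: a22 = a21 * a11^-1 * a12.
predicted = effort[1, 0] * effort[0, 1] / effort[0, 0]
print(predicted)  # 6.0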
@lucidraisin (3 years ago)
Lol, unexpectedly mentioned 😅 thanks for the video!
@mizupof (3 years ago)
By the power of Yannic, I rename you!
@poesaste (3 years ago)
Incredible breakdown, subbed!
@benjaminho351 (3 years ago)
Nobody:
Yannic: uöu (1:07)
@Anirudh-cf3oc (2 years ago)
Very nice explanation sir, thank you!!
@mathematicalninja2756 (3 years ago)
What I hear: nice transformer
@АлексейТучак-м4ч (3 years ago)
that's a nice trömformer
@MausamJain (3 years ago)
How did you import the PDF into OneNote with such good quality? The printout option generally inserts very poor-quality images of the pages.
@YannicKilcher (3 years ago)
It's definitely poor for me too; it's right on the edge of being useful.
@G12GilbertProduction (3 years ago)
Sweet Thursday with another sweet kind of paper. Buon appetito, Yannic! :)
@KennedDansker (3 years ago)
It is F because it is forward attention, right? (Then it would fit with B being backward.) It is not entirely right (A contains part of the forward attention), but I think that is the intention.
@osiris42 (3 years ago)
Does it even matter that the softmax doesn't commute, if the softmax is just a heuristic/hack in the first place? Or is there something inherently special about softmax in the transformer architecture?
@tchlux (3 years ago)
I don't know if I'd call it "special", but I like to think of it geometrically. When you use a softmax, you make it so that the layer immediately after the softmax only has to model a "surface" that lives on the inner wedge of the unit cube (points with 1-norm equal to 1).
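A small numerical illustration of that geometric picture (added here as a sketch, not part of the original comment): the softmax of any vector has positive entries that sum to 1, so every output lands on that wedge, the probability simplex.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    return e / e.sum()

v = softmax(np.array([1.0, -2.0, 0.5]))
print(v)        # every entry lies in (0, 1)
print(v.sum())  # 1.0, so the output sits on the probability simplex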
@mathematicalninja2756 (3 years ago)
@@tchlux that is a good perspective
@JamesAwokeKnowing (3 years ago)
So is that like a softmax over time, where it's valid, kind of, because over many iterations it's pulling random samples? Well, I hope a better way is found.
@NilabhraRoyChowdhury (3 years ago)
I bet you wish you could: from previous_videos import SelfAttention every time you make a video related to transformers
@otaviodzb1 (3 years ago)
One thing I still can't understand is how backprop works in a transformer. Does someone have a good reference or video that explains it?
@pg1337ful (3 years ago)
seems like you have fundamental gaps in ML.
@YannicKilcher (3 years ago)
It works like in any other neural network: by applying the chain rule to all the involved operations.
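A minimal PyTorch sketch of that answer (toy shapes, illustrative only): build a single-head self-attention out of ordinary tensor ops and let autograd apply the chain rule through all of them, softmax included.

import torch

d = 8
x = torch.randn(4, d, requires_grad=True)  # a sequence of 4 token vectors
Wq, Wk, Wv = (torch.randn(d, d, requires_grad=True) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # attention weights
out = attn @ v

out.sum().backward()  # the chain rule runs through matmuls and softmax alike
print(x.grad.shape)   # gradients reach the inputs...
print(Wq.grad.shape)  # ...and the projection matrices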
@jonatan01i (3 years ago)
I struggle to believe that it's actually named Nyströmformer. I'll call it Nyströmer, as suggested, and as it should be.
@scarcommander5517 (3 years ago)
We like this transformer!
@chaitanyaparmar888 (2 years ago)
Love this video!
@BorrWick (3 years ago)
Didn't this come out like yesterday??
@ingusmant (3 years ago)
And?
@BorrWick (3 years ago)
@@ingusmant Just amazed by the speed at which Yannic can read, understand, and produce these videos :o
@YannicKilcher (3 years ago)
You're right, it's already old now... ;)
@muhammadsaadmansoor7777 (3 years ago)
I was not expecting this until a month later. But where do the keys, queries, and values come from?
@IRWBRW964 (3 years ago)
They are learned.
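In code, that answer looks roughly like this (a generic sketch, not the Nyströmformer's actual module): queries, keys, and values are produced from the same input embeddings by linear projections whose weights are trained end to end.

import torch
import torch.nn as nn

d_model = 64
to_q = nn.Linear(d_model, d_model, bias=False)  # learned query projection
to_k = nn.Linear(d_model, d_model, bias=False)  # learned key projection
to_v = nn.Linear(d_model, d_model, bias=False)  # learned value projection

x = torch.randn(2, 16, d_model)      # (batch, sequence, embedding)
q, k, v = to_q(x), to_k(x), to_v(x)  # all three derive from the same input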
@kicckicc (3 years ago)
Just FYI, I tried to implement this the day before yesterday but got NaN. I checked the code and realized that formula (14) isn't accurate, and also that Z_0 = A_S / (||A_S||_1 ||A_S||_inf) should be Z_0 = A_S^T / (||A_S||_1 ||A_S||_inf).
@xiongyunyang9643 (3 years ago)
Do you mean the NaN is from your own implementation or from ours? The accuracy of approximating the pseudoinverse with formula (14) depends on the number of iterations. Z_0 is A_S^T / (||A_S||_1 ||A_S||_inf). We will fix the typo in our update.
@kicckicc (3 years ago)
@@xiongyunyang9643 Thanks for the reply. After I used the correct (14) and the correct Z_0, the NaN is gone. Just FYI, formula (16) is also inaccurate, but that one is easy to notice.
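For anyone hitting the same NaN, here is a minimal single-matrix sketch of the iteration as corrected in this thread (my reading of formula (14), not the authors' reference code; the coefficients follow the cubic-convergence pseudoinverse scheme the paper cites).

import torch

def iterative_pinv(a_s, n_iter=6):
    # Approximate the Moore-Penrose pseudoinverse of the m x m landmark
    # kernel A_S without an explicit matrix inverse.
    eye = torch.eye(a_s.shape[-1], dtype=a_s.dtype)
    # Corrected initialization: Z_0 = A_S^T / (||A_S||_1 * ||A_S||_inf)
    norm_1 = torch.abs(a_s).sum(dim=-2).max()    # max column sum
    norm_inf = torch.abs(a_s).sum(dim=-1).max()  # max row sum
    z = a_s.transpose(-1, -2) / (norm_1 * norm_inf)
    for _ in range(n_iter):
        az = a_s @ z
        z = 0.25 * z @ (13 * eye - az @ (15 * eye - az @ (7 * eye - az)))
    return z

# Sanity check against torch.linalg.pinv on a softmax-like matrix:
a = torch.softmax(torch.randn(8, 8), dim=-1)
print((iterative_pinv(a) - torch.linalg.pinv(a)).abs().max())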
@xiongyunyang9643 (3 years ago)
@@kicckicc Cool. Formula (16), similar to local average pooling, is used to compute the landmarks efficiently.
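And a short sketch of that landmark computation as I understand it (segment means, i.e. average pooling over contiguous chunks; the variable names are mine):

import torch

n, m, d = 512, 64, 32  # sequence length, number of landmarks, head dim
q = torch.randn(n, d)  # queries; keys are handled the same way
landmarks = q.reshape(m, n // m, d).mean(dim=1)  # (m, d) segment means
print(landmarks.shape)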
@JamesAwokeKnowing (3 years ago)
The name was designed to sound like 'the nice transformer'. So leave the name as is.
@pratik245 (2 years ago)
Have you ever heard of a Michelle Srivastav? More probably you would hear of a Peter Chakraborty. If you can tell me the reason, you know a lot about caste- and region-based targeting in India.
@pratik245 (2 years ago)
So, nobody hates India when they are born, but as you keep growing you see these divisions between people, majoritarianism, government repression, targeting of the intellectual class, poverty, corruption, and then you start seeing trends in these concepts, all in the name of highly preached American democracy and capitalism... But surely everything is a joke, even misery. Right, guys?
@ZedaZ80 (3 years ago)
I have no idea what most of this means, but the lemma was funny
@Xrey56Cheyz (3 years ago)
To be honest, I expected the Performer to be the ImageNet moment for transformers, but it seems there is still a long way to go, and random Fourier features are not the best way to do the thing. Somewhat sad, because the Performer's idea looked so cool and well grounded :(
@redjammie8342 (3 years ago)
Big leaps come through simple ideas like ReLU, convolution, dropout, residual connections, self-attention... The moment an idea becomes too convoluted, it is less likely to be game-changing.
@charlesfoster6326 (3 years ago)
What are you waiting for? If anything, the transformer revolution seems like it's come with even more force and speed than ImageNet.
@ahmadmoussa3771 (3 years ago)
*The NICEtrömer*
@NextFuckingLevel (3 years ago)
Indeed
@visionscaper (3 years ago)
Hi there!
@YannicKilcher (3 years ago)
hi!
@weizhu2230 (3 years ago)
OK, I vote this work down; I think "Asymmetric Non-local Neural Networks for Semantic Segmentation" is the better one.
@yaaank6725 (3 years ago)
In the last Twitter chart, it's quite surprising that the Performer has the worst performance among the efficient transformers. Is this also verified on other tasks?
@yaaank6725 (3 years ago)
Or by other people, maybe...
@xiongyunyang9643 (3 years ago)
We have released the scores on the individual LRA tasks. It will be interesting to see how the Performer works on other tasks beyond LRA.
@lennartvandergoten6592 (3 years ago)
Greetings to my old ETH buddy Yannic, and give Jonas my best regards :-)
@Ronnypetson (3 years ago)
Noice
@CandidDate (3 years ago)
I'd bet a million dollars that AGI, when discovered, uses frequencies of waves rather than any matrices.
@kimchi_taco (3 years ago)
Mathematically ugly, but it somehow works well. I don't feel good that both the Nyströmformer and the Performer rely on random sampling.
@xiongyunyang9643 (3 years ago)
No, Nyströmformer does not rely on random sampling.