OUTLINE:
0:00 - Intro & Overview
2:20 - Built-In Assumptions of Computer Vision Models
5:10 - The Quadratic Bottleneck of Transformers
8:00 - Cross-Attention in Transformers
10:45 - The Perceiver Model Architecture & Learned Queries
20:05 - Positional Encodings via Fourier Features
23:25 - Experimental Results & Attention Maps
29:05 - Comments & Conclusion
@mgostIH 3 years ago
This approach is so elegant! Unironically Schmidhuber was right that the more something looks like an LSTM the better 😆
@reesejammie8821 3 years ago
I always thought the human brain is a recurrent neural network with a big hidden state and being constantly fed data from the environment.
@6lack5ushi 3 years ago
Powerful!!!
@srikanthpolisetty7476 3 years ago
Congratulations. I'm so glad this channel is growing so well, great to see a channel get the recognition they deserve. Can't wait to see where this channel goes from here.
@bardfamebuy 3 years ago
I love how you did the cutting in front of a green screen and didn't even bother editing it out.
@Gorulabro 3 years ago
Your videos are a joy to watch. Nothing I do in my spare time is so useful!
@emilianpostolache545 3 years ago
27:30 - Kant is all you need
@silvercat4 3 years ago
underrated comment
@jamiekawabata7101 3 years ago
The scissors scene is wonderful!
@robboswell3943 2 years ago
Excellent video! A critical question: How exactly are the learned latent arrays being learned? Is there some kind of algorithm used to create the learned latent array by reducing the dimensions of the input "byte array"? They never really go into detail about the exact process they used to do this in the paper. Surprisingly, no online sources on this paper that I have found speak about the exact process either. On pg. 3, it does state, "The model can also be seen as performing a fully end-to-end clustering of the inputs with latent positions as cluster centres..." But this is a pretty generic explanation. Could you please provide a short explanation of the process they used?
@RS-cz8kt 3 years ago
Stumbled upon your channel a couple of days ago, watched a dozen videos since then, amazing work, thanks!
@jonathandoucette3158 3 years ago
Fantastic video, as always! Around 20:05 you describe transformers as invariant to permutations, but I believe they're more accurately equivariant, no? I.e. permuting the input permutes the output in exactly the same way, as opposed to permuting the input leading to the exact same output. Similar to convolutions being equivariant w.r.t. position
@mgostIH 3 years ago
You could say those terms are just equivariant to mistakes!
@ruroruro 3 years ago
Transformers are invariant to key+value permutations and equivariant to query permutations. The reason they are invariant to K+V permutations is that for each query all the values get summed together and the weights depend only on the keys. So if you permute the keys and the values in the same way, you still get the same weights and the sum is still the same.
@jonathandoucette3158 3 years ago
@@ruroruro Ahh, thanks for the clarification! In my head I was thinking only of self attention layers, which based on your explanation would indeed be permutation equivariant. But cross-attention layers are more subtle; queries equivariant, keys/values invariant (if they are permuted in the same way).
@anonymouse2884 2 years ago
I believe that it is permutation invariant: since you are doing a weighted sum of the inputs/context, you should "roughly" get the same results even if you permute the inputs (the positional encoder might encode different time indices slightly differently, but this should not matter a lot).
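The invariance/equivariance distinction discussed in this thread can be checked numerically. A minimal single-head sketch in NumPy (toy dimensions, no learned projections; all shapes are made up for illustration):

```python
import numpy as np

def cross_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, M, d = 4, 6, 8   # N latent queries, M input tokens
Q = rng.normal(size=(N, d))
K = rng.normal(size=(M, d))
V = rng.normal(size=(M, d))

out = cross_attention(Q, K, V)

# Permuting keys and values *together* leaves the output unchanged (invariance).
perm = rng.permutation(M)
assert np.allclose(out, cross_attention(Q, K[perm], V[perm]))

# Permuting the queries permutes the output rows the same way (equivariance).
qperm = rng.permutation(N)
assert np.allclose(out[qperm], cross_attention(Q[qperm], K, V))
print("invariant to K/V permutation, equivariant to Q permutation")
```

Self-attention is the special case where Q, K, V all come from the same sequence, so a single permutation hits all three and the whole operation becomes equivariant, matching the distinction drawn above.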
@ruroruro 3 years ago
Yeah, the attention maps look really, really suspicious. Almost like the network only attends to the Fourier features after the first layer. Also, the whole idea that they are feeding the same unprocessed image into the network multiple times seems really weird. The keys should basically be a linear combination of R, G, B and the same Fourier features each time. How much information can you realistically extract from an image just by attending to low-level colour and positional information? I would have expected them to at least use a simple ResNet or FPN alongside the "thin" attention branch thingy.
@reesejammie8821 3 years ago
Couldn't agree more. It's like the attention maps are far from content-based. Also agree on the features being too low-level; what does it even mean to attend to raw pixels?
@sanzharbakhtiyarov4044 3 years ago
Thanks a lot for the review, Yannic! Great work.
@justindaniels863 1 year ago
Unexpected combination of humour and intelligence!
@AbgezocktXD 3 years ago
One day you will stop explaining how transformers work and I will be completely lost
@emmanuellagarde2212 3 years ago
If the attention maps for layers >2 are not image specific, then this echoes the results of the paper "Pretrained Transformers as Universal Computation Engines" which suggests that there is a universal mode of operation for processing "natural" data
@petrroll 3 years ago
There's one thing I don't quite understand: how does this model capture low-level features, and how does it retain that information? I.e. how does it do the processing that happens in the first few layers of a CNN? I can clearly see how this mechanism works well for higher-level processing, but how does it capture (and keep) low-level features? The reason I don't quite understand it is that the amount of information that flows between the first and second layer here and, e.g., the first and second module of a ResNet is drastically different. In this case it's essentially N*D, which I suppose is way smaller than M* (not M because there's some pooling even in the first section of a ResNet, but still close) in the case of ResNet, simply on account of N
@Coolguydudeness1234 3 years ago
I lost it when you cut the piece of paper 😂
@patf9770 3 years ago
Something I just noticed about the attention maps: they seem to reflect something about the positional encodings? It looks like the model processes images hierarchically, globally at first and with a progressively finer tooth comb. My understanding is that CNNs tend to have a bias towards local textural information so it'd be really cool if an attention model learned to process images more intuitively
@maxdoner4528 3 years ago
Good job, it's pretty great to have these topics explained by someone other than the aufhorchen. Keep it up!
@Daniel-ih4zh 3 years ago
Things are going so fast in the last year or two.
@ssssssstssssssss 3 years ago
I disagree... There haven't really been many major innovations in machine learning in the past two years.
@L9X 3 years ago
Could this perhaps be used to model incredibly long-distance relationships, i.e. incredibly long-term memory? As in, the latent query vector (I'll just call it Q from here) becomes the memory. Perhaps we start off with a randomly initialised latent Q_0 and input KV_0 - let's say the first message sent by a user - to the Perceiver, which produces latent output Q_1; we then feed Q_1 back into the Perceiver with the user's next message KV_1 as input and get output Q_2 from the Perceiver, and so on. Then at every step we take Q_n and feed it to some small typical generative transformer decoder to produce a response to the user's message. This differs from typical conversational models, such as those using GPT-whatever, because they feed the entire conversation back into the model as input, and since the model has a constant-size input, the older messages get truncated as enough new messages arrive, which means the older memories get totally lost. Could this be a viable idea? We could have M >> N, which means we have more memory than input length, but if we keep M on the order of a thousand, that gives us 1000 'units' of memory that retain only the most important information.
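The recurrence proposed above can be written down in a few lines. Purely illustrative NumPy: `cross_attend` is an untrained stand-in for a Perceiver-style cross-attention block, and all names and shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 1000, 64, 32   # N latent "memory" slots, M tokens per message

def cross_attend(latent, tokens):
    # Stand-in for a trained cross-attention block: the latent queries the
    # incoming tokens and the result is mixed back in (residual update).
    scores = latent @ tokens.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return latent + w @ tokens

Q = rng.normal(size=(N, d))                  # Q_0: randomly initialised memory
for step in range(5):                        # one user message per step
    message = rng.normal(size=(M, d))        # KV_n: embedded message (fake data)
    Q = cross_attend(Q, message)             # Q_{n+1}: updated memory
    # Here Q would condition a small decoder that generates the reply.
print(Q.shape)  # memory stays (1000, 32) no matter how long the conversation
```

The key property is visible in the loop: per-step cost is O(N·M) and the memory footprint is fixed at N·d, so old messages are compressed rather than truncated.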
@Ronschk 3 years ago
Really nice idea. I wonder how much improvement it would bring if the incoming data were converted through a "sense". Our brain also doesn't receive images directly; instead it receives signals from our eyes, which transform the input image (and use something akin to convolutions?). So you would have this as a generic compute structure, but depending on the modality you would have a converter. I think they had something like this in the "one model to rule them all" paper or so...
@timdernedde993 3 years ago
Hey Yannic, great video as usual :) If you want some feedback: I feel like you could have covered the results a bit more. The methodology is of course much more important, but it helps to have a bit of an overview of how well it performs at which tasks. Maybe give the results section a few more minutes next time. Anyways, still enjoyed the video greatly. Keep up the great work!
@CristianGarcia 3 years ago
This is VERY nice! I'd love to give it a spin on a toy dataset. 😍 BTW: Many transformer patterns can be found in the Set Transformer paper, where the learned-query reduction strategy is termed Pooling by Multihead Attention (PMA).
@amirfru 3 years ago
This is incredibly similar to TabNet! But with the attentive blocks changed to attention layers.
@pvlr1788 2 years ago
Thanks for the video! But I can't understand where the first latent array comes from.
@JTedam 3 years ago
This helps a lot to make research accessible.
@jonatan01i 3 years ago
2:44 "And the image is of not a cat!, a house! What did you think??!.." I thought nothing; my mind was empty :(
@NextFuckingLevel 3 years ago
:( I feel you
@herp_derpingson 3 years ago
17:30 Since you already bought a green screen, maybe next time put Mars or the Apollo landing in the background. Or a large cheesecake. That's good too. All in all: one architecture to rule them all.
@YannicKilcher 3 years ago
Great suggestion :D
@piratepartyftw 3 years ago
Very cool. I wonder if it works when you feed in multimodal data (e.g. both image and text in the same byte array).
@galchinsky 3 years ago
Proper positional encodings should somehow work
@azimgivron1823 3 years ago
Are the query dimension and the latent array in Figure 1 of the same dimensions? It is written that Q belongs to the space of real matrices of dimensions MxD, which does not make sense to me. I believe they meant NxD, where D=C, since you need a dot product to compute the cross-attention between the query Q and the keys K ==> Q.Kt (with Kt being the transpose of K), which implies that the dimensions D and C are equal, isn't that right? I am kinda disappointed by the paper because this is the core of what they want to show and they don't make the effort to dive into the math and explain it clearly.
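For what it's worth, the shapes in question work out once learned projections are included: the latent (N×D) and the byte array (M×C) are each projected to a shared query/key width before the dot product, so D and C never need to match. A NumPy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 32, 128    # latent array: N x D
M, C = 1000, 3    # byte array: M x C (e.g. RGB channels + encodings)
d_qk, d_v = 16, 16  # shared projection widths (hypothetical)

latent = rng.normal(size=(N, D))
byte_arr = rng.normal(size=(M, C))

# Learned projections make Q.K^T well-defined even though D != C.
W_q = rng.normal(size=(D, d_qk))
W_k = rng.normal(size=(C, d_qk))
W_v = rng.normal(size=(C, d_v))

Q, K, V = latent @ W_q, byte_arr @ W_k, byte_arr @ W_v
attn = Q @ K.T        # valid: both sides were projected to width d_qk
print(attn.shape)     # (32, 1000): one score per (latent slot, input element)
```

Whether the paper's M×D notation refers to the pre- or post-projection matrix is a separate question, but dimensionally nothing forces D = C.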
@hugovaillaud5102 3 years ago
Is this architecture slower than a resnet with a comparable amount of parameters due to the fact that it is somehow recurrent? Great video, you explain things so clearly!
@henridehaybe525 3 years ago
It would be nice to see how the Perceiver would perform when the KV of the cross-attentions are not the raw image at each "attend" but the feature maps of a pretrained ResNet. E.g. the first "attend" KV is the raw image, the second KV is the feature maps of the second ResNet output, and so on. A pretrained ResNet would do the trick, but it could technically be feasible to train it concurrently. It would be a Parallel-Piped Convolutional-Perceiver model.
@HuyNguyen-rb4py 3 years ago
so touching for an excellent video
@aday7475 2 years ago
Any chance we can get a compare-and-contrast between Perceiver, Perceiver IO, and Perceiver AR?
@cptechno 3 years ago
Yes, I like this type of content. Keep up the good work. Bringing this material to our attention is a prime service. You might consider creating an AI.tv commercial channel. I'll join.
@swoletech5958 3 years ago
PointNet++ from 2017 outperformed the Perceiver on point clouds: 91.9 accuracy versus 85.7. See 27:19.
@ibrahimaba8966 2 years ago
17:28 best way to solve the quadratic bottleneck 😄!
@yassineabbahaddou4369 2 years ago
Why did they use a GPT-2 architecture in the latent transformer instead of a BERT architecture?
@marat61 3 years ago
I believe there is an error in the paper (23:07): Q must be MxC, not MxD, otherwise QK.transpose() would be impossible.
@peterszilvasi752 3 years ago
17:07 - The visual demonstration of how the quadratic bottleneck is solved was a true "Explain Like I'm Five" moment. 😀
@neworldemancer 3 years ago
Thanks for the video, Yannic! I would imagine that the attention "lines" @27:00 could indeed be static, but the alternative is that they are input-dependent yet too overfitted to the Fourier features, as these lines are a clear artefact.
@TheGreatBlackBird 3 years ago
I was very confused until the visual demonstration.
@xealen2166 3 years ago
I'm curious: how are the queries generated from the latent matrix, and how is the latent matrix initially generated?
@Kram1032 3 years ago
Did the house sit on the mat though
@48956l 3 years ago
Thank you for that wonderful demonstration with the piece of paper lol
@dr.mikeybee 3 years ago
Even with my limited understanding, this looks like a big game changer.
@NilabhraRoyChowdhury 3 years ago
What's interesting is that the model performs better with weight sharing.
@MsFearco 3 years ago
I just finished this; it's an extremely interesting paper. Please review the Swin Transformer next. It's even more interesting :)
@cocoarecords 3 years ago
Yannic, can you tell us your approach to understanding papers quickly?
@YannicKilcher 3 years ago
Look at the pictures
@TheZork1995 3 years ago
@@YannicKilcher xD So easy yet so far. Thank you for the good work though. Literally the best YouTube channel I ever found!
@synthetiksoftware5631 3 years ago
Isn't the Fourier-style positional encoding just a different way to build a scale-space representation of the input data? So you are still 'baking' that kind of scale-space prior into the system.
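For readers wondering what these Fourier features look like concretely, here is a small sketch. The exact band spacing is an assumption here (linearly spaced frequencies up to a Nyquist-style maximum, with the raw coordinate concatenated alongside the sin/cos pairs):

```python
import numpy as np

def fourier_features(pos, num_bands, max_freq):
    # pos: coordinates normalised to [-1, 1].
    # Frequencies linearly spaced from 1 to max_freq/2 (Nyquist-style cap);
    # each position becomes 2*num_bands sinusoids plus the raw coordinate.
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)
    angles = np.pi * pos[..., None] * freqs            # (..., num_bands)
    return np.concatenate(
        [np.sin(angles), np.cos(angles), pos[..., None]], axis=-1
    )

xs = np.linspace(-1, 1, 224)                 # one image axis, normalised
feats = fourier_features(xs, num_bands=64, max_freq=224)
print(feats.shape)  # (224, 129): 64 sin + 64 cos + the raw coordinate
```

The multi-frequency bands are indeed what gives the encoding its coarse-to-fine, scale-space-like flavour: low bands vary slowly across the image, high bands resolve pixel-level position.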
@teatea5528 2 years ago
It's a stupid question, but I want to ask: how can the authors claim their method is better than ViT on ImageNet in Appendix A, Table 7, while their accuracy is not higher?
@Anujkumar-my1wi 3 years ago
Can you tell me why neural nets with many hidden layers require fewer neurons than a neural net with a single hidden layer to approximate a function?
@TheCreativeautomaton 3 years ago
Hey, thanks for doing this. I very much like the direction of transformers in ML. I'm newer to NLP and looking at where ML might go next. Once again, thanks.
@TheJohnestOfJohns 3 years ago
Isn't this really similar to Facebook's DETR with their object queries, but with shared weights?
@antoninhejny8156 3 years ago
No, since DETR is just for localising objects from features extracted via some backbone like a ResNet, while this is the feature extractor. Furthermore, DETR just puts the features into a transformer, whereas this is like forming an idea about what is in the image while consulting the raw information in the form of RGB. This is, however, very suspicious, because a linear combination of RGB is just three numbers.
@axeldroid2453 3 years ago
Does it have something to do with sparse sensing? It basically attends to the most relevant data points.
@simonstrandgaard5503 3 years ago
Excellent walkthrough
@hanstaeubler 3 years ago
It would also be interesting to 'interpret' this model or algorithm at the music level (I compose music myself for my pleasure). Thanks in any case for the good interpretation of this AI work!
@marat61 3 years ago
Also, you didn't mention the dimension size in the ablation part.
@maks029 3 years ago
Thanks for an amazing video. I didn't really catch what the "latent array" represents. Is it an array of zeros at first?
@bensums 3 years ago
So the main point is you can have fewer queries than values? This is obvious even just from the definition of scaled dot-product attention in Attention Is All You Need (Equation 1): the number of outputs equals the number of queries and is independent of the number of keys or values. The only constraints are: 1. the number of keys must match the number of values; 2. the dimension of each query must equal the dimension of the corresponding key.
@bensums 3 years ago
(In the paper all queries and keys have the same dimension (d_k), but that's not necessary.)
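The point that the output count tracks the number of queries, not the number of keys/values, is visible directly in the shapes. A toy sketch of plain scaled dot-product attention:

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, row-wise softmax over the keys.
    w = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
out = attention(rng.normal(size=(8, 16)),     # 8 queries of dim 16
                rng.normal(size=(100, 16)),   # 100 keys (dim must match queries)
                rng.normal(size=(100, 32)))   # 100 values (count must match keys)
print(out.shape)  # (8, 32): one output per query, regardless of the 100 keys/values
```

This is exactly the lever the Perceiver pulls: make the query set a small learned latent, and the output (and the cost per query) shrinks accordingly.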
@DistortedV12 3 years ago
“General architecture”, but can it understand tabular inputs??
@bender2752 3 years ago
Great video! Consider making a video about DCTransformer maybe? 😊
@LaNeona 3 years ago
If I have a gamification model, is there anyone you know who does meta-analysis on system mechanisms?
@hiramcoriarodriguez1252 3 years ago
This is huge. I'm not going to be surprised if the Perceiver becomes the gold standard for CV tasks.
@galchinsky 3 years ago
The way it is, it seems to be classification only.
@nathanpestes9497 3 years ago
@@galchinsky You should be able to run it backwards for generation: just say my output (the image/point cloud/text I want to generate) is my latent (as labeled in the diagram), and my input (the byte array in the diagram) is some latent representation that feeds into my outputs over several steps. I think this could be super cool for 3D GANs, since you don't wind up having to fill 3D grids with a bunch of empty space.
@galchinsky 3 years ago
@@nathanpestes9497 Won't you get O(huge^2) this way?
@nathanpestes9497 3 years ago
@@galchinsky I think it would be cross-attention O(user-defined * huge), same as the paper (different order). Generally we have O(M*N), with M the size of the input/byte array and N the size of the latent. The paper goes after performance by forcing the latent to be non-huge, so M=huge, N=small: O(huge * small). Running it backwards, you would have a small input (which is now actually our latent, so a low-dimensional random sample if we want to do a GAN, or perhaps the (actual) latent from another Perceiver in a VAE or similar). So backwards you have M=small, N=huge: O(small * huge).
@galchinsky 3 years ago
@@nathanpestes9497 Thanks for pointing this out. I thought we would get a Huge x Huge attention matrix, but you are right: if we set the Q length to Huge and K/V to Small, the resulting complexity is O(Huge*Small). So we want to get a new K/V pair each time, and this approach seems quite natural (here was an imgur link, but YouTube seems to hide it): there are 2 parallel stacks of layers. The first is like in the article: latent weights, then cross-attention, then a stack of transformers and so on. The second stack consists of cross-attention layers operating in the byte-array dimension. Its first Q is the byte-array input, and K,V are taken from the stack of "latent transformers". Its output is then fed back as K,V to the "latent" cross-attention, making new K,V. So there is an informational ping-pong between the "huge" and "latent" cross-attention layers.
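A back-of-the-envelope version of the complexity argument in this thread, counting only the entries of the QKᵀ score matrix (the sizes are hypothetical, chosen to match a 224×224 image flattened to pixels):

```python
# Attention-cost comparison: number of entries in the QK^T score matrix.
huge, small = 50_176, 512      # byte-array length vs latent length (hypothetical)

self_attn = huge * huge        # full self-attention over the byte array
encode    = small * huge       # latent queries attend to the byte array
decode    = huge * small       # byte-array queries attend back to the latent
print(f"{self_attn:.2e} vs {encode + decode:.2e}")  # → 2.52e+09 vs 5.14e+07
```

Even with both an encode and a decode cross-attention, the cost stays linear in the huge dimension, roughly 50× cheaper than one full self-attention layer at these sizes.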
@thegistofcalculus 3 years ago
Just a silly question: instead of a big data input vector and a small latent vector, could they have a big latent vector that they use as a summary vector and spoon-feed slices of data, in order to achieve some downstream task such as predicting the next data slice? Would this allow for even bigger inputs to be summarized (like HD video)?
@thegistofcalculus 2 years ago
Looking back, it seems my comment was unclear. It would involve a second cross-attention module to determine what gets written into the big vector.
@conduit242 3 years ago
Embeddings are still all you need 🤷
@brll5733 3 years ago
Performers already grow entirely linearly, right?
@kirtipandya4618 3 years ago
Where can we find the source code?
@NeoShameMan 3 years ago
So basically it's conceptually close to rapid eye movement, where we refine over time the data we need to resolve recognition...
@corgirun7892 4 months ago
M = 50176 for 224 × 224 ImageNet images
@Shan224 3 years ago
Thank you, Yannic
@evilby 1 year ago
WAHHH... Problem Solved!😆
@moctardiallo2608 3 years ago
Yeah, 30 min is much better!
@Vikram-wx4hg 2 years ago
17:15
@seraphim9723 3 years ago
The ablation study consists of three points without any error bars and could just be coincidence. One cannot call that "science".
@timstevens3361 3 years ago
attention looped is consciousness
@notsure7132 3 years ago
Thank you.
@Deez-Master 3 years ago
Nice video
@vadimschashecnikovs3082 3 years ago
Hmm, I think it is possible to add some GLOM-like hierarchy of "words". This could improve the model...
@kenyang687 1 year ago
The "hmm by hmm" is just too confusing lol
@rhronsky 3 years ago
Clearly you are more of a fan of row vectors than column vectors, Yannic (referring to your visual demo :))
@martinschulze5399 3 years ago
Do you have any PhD positions open? ^^
@GuillermoValleCosmos 3 years ago
this is clever and cool
@freemind.d2714 3 years ago
Good job, Yannic! But I'm starting to feel like a lot of the papers you cover these days are all about transformers, and frankly they're kind of similar, and most are engineering research rather than scientific research. Hope you don't mind covering more interesting papers on different subjects.
@muhammadaliyu3076 3 years ago
Yannic follows the hype
@TechyBen 3 years ago
Oh no, they are making it try to be alive. XD
@AvastarBin 3 years ago
+1 For the visual representation of M*N hahah
@oreganorx7 2 years ago
Very similar to MemFormer
@Stefan-bs3gm 3 years ago
with O(M*M) attention you quickly get to OOM :-P
@allengrimm3039 3 years ago
I see what you did there
@omegapointil5741 3 years ago
I guess curing cancer is even more complicated than this.
@happycookiecamper8101 3 years ago
nice
@enriquesolarte1164 3 years ago
Haha, I love the scissors...!!!
@insighttoinciteworksllc1005 3 years ago
Humans can do the iterative process too. The Inquiry Method is the only thing that requires it. If you add the trial and error element with self-correction, young minds can develop a learning process. Learn How to learn? Once they get in touch with their inner teacher, they connect to the Information Dimension (theory). Humans can go to where the Perceiver can't go. The Inner teacher uses intuition to bring forth unknown knowledge to mankind's consciousness. The system Mr. Tesla used to create original thought. Unless you think he had a computer? The Perceiver will be able to replace all the scientists that helped develop it and the masses hooked on the internet. It will never replace the humans that develop the highest level of consciousness. Thank you, Yeshua for this revelation.
@allurbase 3 years ago
It's kind of dumb to input the same video frame over and over; just go frame by frame. It will take a bit for it to catch up, but so would you.