OUTLINE:
0:00 - Intro & Overview
2:20 - Built-In Assumptions of Computer Vision Models
5:10 - The Quadratic Bottleneck of Transformers
8:00 - Cross-Attention in Transformers
10:45 - The Perceiver Model Architecture & Learned Queries
20:05 - Positional Encodings via Fourier Features
23:25 - Experimental Results & Attention Maps
29:05 - Comments & Conclusion
@mgostIH 3 years ago
This approach is so elegant! Unironically Schmidhuber was right that the more something looks like an LSTM the better 😆
@reesejammie8821 3 years ago
I always thought the human brain is a recurrent neural network with a big hidden state and being constantly fed data from the environment.
@6lack5ushi 3 years ago
Powerful!!!
@srikanthpolisetty7476 3 years ago
Congratulations. I'm so glad this channel is growing so well, great to see a channel get the recognition they deserve. Can't wait to see where this channel goes from here.
@bardfamebuy 3 years ago
I love how you did the cutting in front of a green screen and didn't even bother editing it out.
@Gorulabro 3 years ago
Your videos are a joy to watch. Nothing I do in my spare time is so useful!
@emilianpostolache545 3 years ago
27:30 - Kant is all you need
@silvercat4 3 years ago
underrated comment
@jamiekawabata7101 3 years ago
The scissors scene is wonderful!
@robboswell3943 2 years ago
Excellent video! A critical question: How exactly are the learned latent arrays being learned? Is there some kind of algorithm used to create the learned latent array by reducing the dimensions of the input "byte array"? They never really go into detail about the exact process they used to do this in the paper. Surprisingly, no online sources on this paper that I have found speak about the exact process either. On pg. 3, it does state, "The model can also be seen as performing a fully end-to-end clustering of the inputs with latent positions as cluster centres..." But this is a pretty generic explanation. Could you please provide a short explanation of the process they used?
@RS-cz8kt 3 years ago
Stumbled upon your channel a couple of days ago, watched a dozen videos since then, amazing work, thanks!
@jonathandoucette3158 3 years ago
Fantastic video, as always! Around 20:05 you describe transformers as invariant to permutations, but I believe they're more accurately equivariant, no? I.e. permuting the input permutes the output in exactly the same way, as opposed to permuting the input leading to the exact same output. Similar to convolutions being equivariant w.r.t. position
@mgostIH 3 years ago
You could say those terms are just equivariant to mistakes!
@ruroruro 3 years ago
Transformers are invariant to key+value permutations and equivariant to query permutations. The reason they are invariant to K+V permutations is that for each query all the values get summed together and the weights depend only on the keys. So if you permute the keys and the values in the same way, you still get the same weights and the sum is still the same.
@jonathandoucette3158 3 years ago
@@ruroruro Ahh, thanks for the clarification! In my head I was thinking only of self attention layers, which based on your explanation would indeed be permutation equivariant. But cross-attention layers are more subtle; queries equivariant, keys/values invariant (if they are permuted in the same way).
@anonymouse2884 2 years ago
I believe that it is permutation invariant: since you are doing a weighted sum of the inputs/context, you should "roughly" get the same results even if you permute the inputs (the positional encoder might encode different time indices slightly differently, but this should not matter a lot).
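The invariance/equivariance distinction discussed in this thread can be checked numerically. A minimal single-head sketch in NumPy (toy dimensions, no learned projections; all shapes are made up for illustration):

```python
import numpy as np

def cross_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, M, d = 4, 6, 8   # N latent queries, M input tokens
Q = rng.normal(size=(N, d))
K = rng.normal(size=(M, d))
V = rng.normal(size=(M, d))

out = cross_attention(Q, K, V)

# Permuting keys and values *together* leaves the output unchanged (invariance).
perm = rng.permutation(M)
assert np.allclose(out, cross_attention(Q, K[perm], V[perm]))

# Permuting the queries permutes the output rows the same way (equivariance).
qperm = rng.permutation(N)
assert np.allclose(out[qperm], cross_attention(Q[qperm], K, V))
print("invariant to K/V permutation, equivariant to Q permutation")
```

Self-attention is the special case where Q, K, V all come from the same sequence, so a single permutation hits all three and the whole operation becomes equivariant, matching the distinction drawn above.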
@ruroruro 3 years ago
Yeah, the attention maps look really, really suspicious. Almost like the network only attends to the Fourier features after the first layer. Also, the whole idea that they are feeding the same unprocessed image into the network multiple times seems really weird. The keys should basically be a linear combination of R, G, B and the same Fourier features each time. How much information can you realistically extract from an image just by attending to low-level colour and positional information? I would have expected them to at least use a simple ResNet or FPN alongside the "thin" attention branch thingy.
@reesejammie8821 3 years ago
Couldn't agree more. It's like the attention maps are far from content-based. Also agree on the features being too low-level; what does it even mean to attend to raw pixels?
@sanzharbakhtiyarov4044 3 years ago
Thanks a lot for the review, Yannic! Great work.
@justindaniels863 1 year ago
Unexpected combination of humour and intelligence!
@AbgezocktXD 3 years ago
One day you will stop explaining how transformers work and I will be completely lost
@emmanuellagarde2212 3 years ago
If the attention maps for layers >2 are not image specific, then this echoes the results of the paper "Pretrained Transformers as Universal Computation Engines" which suggests that there is a universal mode of operation for processing "natural" data
@petrroll 3 years ago
There's one thing I don't quite understand: how does this model capture low-level features, and how does it retain that information? I.e. how does it do the processing that happens in the first few layers of a CNN? I can clearly see how this mechanism works well for higher-level processing, but how does it capture (and keep) low-level features? The reason I don't quite understand it is that the amount of information that flows between the first and second layer here and, e.g., the first and second module of a ResNet is drastically different. In this case it's essentially N*D, which I suppose is way smaller than M* (not M because there's some pooling even in the first section of a ResNet, but still close) in the case of ResNet, simply on account of N
@Coolguydudeness1234 3 years ago
I lost it when you cut the piece of paper 😂
@patf9770 3 years ago
Something I just noticed about the attention maps: they seem to reflect something about the positional encodings? It looks like the model processes images hierarchically, globally at first and with a progressively finer tooth comb. My understanding is that CNNs tend to have a bias towards local textural information so it'd be really cool if an attention model learned to process images more intuitively
@maxdoner4528 3 years ago
Good job, it's pretty great to have these topics explained by someone other than the aufhorchen. Keep it up!
@Daniel-ih4zh 3 years ago
Things are going so fast in the last year or two.
@ssssssstssssssss 3 years ago
I disagree... There haven't really been many major innovations in machine learning in the past two years.
@L9X 3 years ago
Could this perhaps be used to model incredibly long-distance relationships, i.e. incredibly long-term memory? As in, the latent query vector (I'll just call it Q from here) becomes the memory. Perhaps we start off with a randomly initialised latent Q_0 and input KV_0 - let's say the first message sent by a user - to the Perceiver, which produces latent output Q_1; we then feed Q_1 back into the Perceiver with the user's next message KV_1 as input and get output Q_2 from the Perceiver, and so on. Then at every step we take Q_n and feed it to some small typical generative transformer decoder to produce a response to the user's message. This differs from typical conversational models, such as those using GPT-whatever, because they feed the entire conversation back into the model as input, and since the model has a constant-size input, the older messages get truncated as enough new messages arrive, which means the older memories get totally lost. Could this be a viable idea? We could have M >> N, which means we have more memory than input length, but if we keep M on the order of a thousand, that gives us 1000 'units' of memory that retain only the most important information.
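The recurrence proposed above can be written down in a few lines. Purely illustrative NumPy: `cross_attend` is an untrained stand-in for a Perceiver-style cross-attention block, and all names and shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 1000, 64, 32   # N latent "memory" slots, M tokens per message

def cross_attend(latent, tokens):
    # Stand-in for a trained cross-attention block: the latent queries the
    # incoming tokens and the result is mixed back in (residual update).
    scores = latent @ tokens.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return latent + w @ tokens

Q = rng.normal(size=(N, d))                  # Q_0: randomly initialised memory
for step in range(5):                        # one user message per step
    message = rng.normal(size=(M, d))        # KV_n: embedded message (fake data)
    Q = cross_attend(Q, message)             # Q_{n+1}: updated memory
    # Here Q would condition a small decoder that generates the reply.
print(Q.shape)  # memory stays (1000, 32) no matter how long the conversation
```

The key property is visible in the loop: per-step cost is O(N·M) and the memory footprint is fixed at N·d, so old messages are compressed rather than truncated.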
@Ronschk 3 years ago
Really nice idea. I wonder how much improvement it would bring if the incoming data were converted through a "sense". Our brain also doesn't receive images directly; instead it receives signals from our eyes, which transform the input image (and use something akin to convolutions?). So you would have this as a generic compute structure, but depending on the modality you would have a converter. I think they had something like this in the "one model to rule them all" paper or so...
@timdernedde993 3 years ago
Hey Yannic, great video as usual :) If you want some feedback: I feel like you could have covered the results a bit more. The methodology is of course much more important, but it helps to have a bit of an overview of how well it performs at which tasks. Maybe give the results section a few more minutes next time. Anyways, still enjoyed the video greatly. Keep up the great work!
@CristianGarcia 3 years ago
This is VERY nice! I'd love to give it a spin on a toy dataset. 😍 BTW: Many transformer patterns can be found in the Set Transformer paper, where the learned-query reduction strategy is termed Pooling by Multihead Attention (PMA).
@amirfru 3 years ago
This is incredibly similar to TabNet! But with the attentive blocks changed to attention layers.
@pvlr1788 2 years ago
Thanks for the video! But I can't understand where the first latent array comes from.
@JTedam 3 years ago
This helps a lot to make research accessible.
@jonatan01i 3 years ago
2:44 "And the image is of not a cat!, a house! What did you think??!.." I thought nothing; my mind was empty :(
@NextFuckingLevel 3 years ago
:( I feel you
@herp_derpingson 3 years ago
17:30 Since you already bought a green screen, maybe next time put Mars or the Apollo landing in the background. Or a large cheesecake. That's good too. All in all: one architecture to rule them all.
@YannicKilcher 3 years ago
Great suggestion :D
@piratepartyftw 3 years ago
Very cool. I wonder if it works when you feed in multimodal data (e.g. both image and text in the same byte array).
@galchinsky 3 years ago
Proper positional encodings should somehow work
@azimgivron1823 3 years ago
Are the query dimension and the latent array in Figure 1 of the same dimensions? It is written that Q belongs to the space of real matrices of dimensions MxD, which does not make sense to me. I believe they meant NxD, where D=C, since you need a dot product to compute the cross-attention between the query Q and the keys K ==> Q.Kt (with Kt being the transpose of K), which implies that the dimensions D and C are equal, isn't that right? I am kinda disappointed by the paper because this is the core of what they want to show and they don't make the effort to dive into the math and explain it clearly.
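For what it's worth, the shapes in question work out once learned projections are included: the latent (N×D) and the byte array (M×C) are each projected to a shared query/key width before the dot product, so D and C never need to match. A NumPy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 32, 128    # latent array: N x D
M, C = 1000, 3    # byte array: M x C (e.g. RGB channels + encodings)
d_qk, d_v = 16, 16  # shared projection widths (hypothetical)

latent = rng.normal(size=(N, D))
byte_arr = rng.normal(size=(M, C))

# Learned projections make Q.K^T well-defined even though D != C.
W_q = rng.normal(size=(D, d_qk))
W_k = rng.normal(size=(C, d_qk))
W_v = rng.normal(size=(C, d_v))

Q, K, V = latent @ W_q, byte_arr @ W_k, byte_arr @ W_v
attn = Q @ K.T        # valid: both sides were projected to width d_qk
print(attn.shape)     # (32, 1000): one score per (latent slot, input element)
```

Whether the paper's M×D notation refers to the pre- or post-projection matrix is a separate question, but dimensionally nothing forces D = C.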
@hugovaillaud5102 3 years ago
Is this architecture slower than a resnet with a comparable amount of parameters due to the fact that it is somehow recurrent? Great video, you explain things so clearly!
@henridehaybe525 3 years ago
It would be nice to see how the Perceiver would perform when the KV of the cross-attentions are not the raw image at each "attend" but the feature maps of a pretrained ResNet. E.g. the first "attend" KV is the raw image, the second KV is the feature maps of the second ResNet output, and so on. A pretrained ResNet would do the trick, but it could technically be feasible to train it concurrently. It would be a Parallel-Piped Convolutional-Perceiver model.
@HuyNguyen-rb4py 3 years ago
so touching for an excellent video
@aday7475 2 years ago
Any chance we can get a compare-and-contrast between Perceiver, Perceiver IO, and Perceiver AR?
@cptechno 3 years ago
Yes, I like this type of content. Keep up the good work. Bringing this material to our attention is a prime service. You might consider creating an AI.tv commercial channel. I'll join.
@swoletech5958 3 years ago
PointNet++ from 2017 outperformed the Perceiver on point clouds: 91.9 accuracy versus 85.7. See 27:19.
@ibrahimaba8966 2 years ago
17:28 best way to solve the quadratic bottleneck 😄!
@yassineabbahaddou4369 2 years ago
Why did they use a GPT-2 architecture in the latent transformer instead of a BERT architecture?
@marat61 3 years ago
I believe there is an error in the paper (23:07): Q must be MxC, not MxD, otherwise QK.transpose() would be impossible.
@peterszilvasi752 3 years ago
17:07 - The visual demonstration of how the quadratic bottleneck is solved was a true "Explain Like I'm Five" moment. 😀
@neworldemancer 3 years ago
Thanks for the video, Yannic! I would imagine that the attention "lines" @27:00 could indeed be static, but the alternative is that they are input-dependent yet too overfitted to the Fourier features, as these lines are a clear artefact.
@TheGreatBlackBird 3 years ago
I was very confused until the visual demonstration.
@xealen2166 3 years ago
I'm curious: how are the queries generated from the latent matrix, and how is the latent matrix initially generated?
@Kram1032 3 years ago
Did the house sit on the mat though
@48956l 3 years ago
Thank you for that wonderful demonstration with the piece of paper lol
@dr.mikeybee 3 years ago
Even with my limited understanding, this looks like a big game changer.
@NilabhraRoyChowdhury 3 years ago
What's interesting is that the model performs better with weight sharing.
@MsFearco 3 years ago
I just finished this; it's an extremely interesting paper. Please review the Swin Transformer next. It's even more interesting :)
@cocoarecords 3 years ago
Yannic, can you tell us your approach to understanding papers quickly?
@YannicKilcher 3 years ago
Look at the pictures
@TheZork1995 3 years ago
@@YannicKilcher xD So easy yet so far. Thank you for the good work though. Literally the best YouTube channel I ever found!
@synthetiksoftware5631 3 years ago
Isn't the Fourier-style positional encoding just a different way to build a scale-space representation of the input data? So you are still 'baking' that kind of scale-space prior into the system.
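For readers wondering what these Fourier features look like concretely, here is a small sketch. The exact band spacing is an assumption here (linearly spaced frequencies up to a Nyquist-style maximum, with the raw coordinate concatenated alongside the sin/cos pairs):

```python
import numpy as np

def fourier_features(pos, num_bands, max_freq):
    # pos: coordinates normalised to [-1, 1].
    # Frequencies linearly spaced from 1 to max_freq/2 (Nyquist-style cap);
    # each position becomes 2*num_bands sinusoids plus the raw coordinate.
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)
    angles = np.pi * pos[..., None] * freqs            # (..., num_bands)
    return np.concatenate(
        [np.sin(angles), np.cos(angles), pos[..., None]], axis=-1
    )

xs = np.linspace(-1, 1, 224)                 # one image axis, normalised
feats = fourier_features(xs, num_bands=64, max_freq=224)
print(feats.shape)  # (224, 129): 64 sin + 64 cos + the raw coordinate
```

The multi-frequency bands are indeed what gives the encoding its coarse-to-fine, scale-space-like flavour: low bands vary slowly across the image, high bands resolve pixel-level position.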
@teatea5528 2 years ago
It's a stupid question, but I want to ask: how can the authors claim their method is better than ViT on ImageNet in Appendix A, Table 7, while their accuracy is not higher?
@Anujkumar-my1wi 3 years ago
Can you tell me why neural nets with many hidden layers require fewer neurons than a neural net with a single hidden layer to approximate a function?
@TheCreativeautomaton 3 years ago
Hey, thanks for doing this. I very much like the direction of transformers in ML. I'm newer to NLP and looking at where ML might go next. Once again, thanks.
@TheJohnestOfJohns 3 years ago
Isn't this really similar to Facebook's DETR with their object queries, but with shared weights?
@antoninhejny8156 3 years ago
No, since DETR is just for localising objects from features extracted via some backbone like a ResNet, while this is the feature extractor. Furthermore, DETR just puts the features into a transformer, whereas this is like forming an idea about what is in the image while consulting the raw information in the form of RGB. This is, however, very suspicious, because a linear combination of RGB is just three numbers.
@axeldroid2453 3 years ago
Does it have something to do with sparse sensing? It basically attends to the most relevant data points.
@simonstrandgaard5503 3 years ago
Excellent walkthrough
@hanstaeubler 3 years ago
It would also be interesting to 'interpret' this model or algorithm at the music level (I compose music myself for my pleasure). Thanks in any case for the good interpretation of this AI work!
@marat61 3 years ago
Also, you didn't mention the dimension size in the ablation part.
@maks029 3 years ago
Thanks for an amazing video. I didn't really catch what the "latent array" represents. Is it an array of zeros at first?
@bensums 3 years ago
So the main point is you can have fewer queries than values? This is obvious even just from the definition of scaled dot-product attention in Attention Is All You Need (Equation 1): the number of outputs equals the number of queries and is independent of the number of keys or values. The only constraints are: 1. the number of keys must match the number of values; 2. the dimension of each query must equal the dimension of the corresponding key.
@bensums 3 years ago
(In the paper all queries and keys have the same dimension (d_k), but that's not necessary.)
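The point that the output count tracks the number of queries, not the number of keys/values, is visible directly in the shapes. A toy sketch of plain scaled dot-product attention:

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, row-wise softmax over the keys.
    w = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
out = attention(rng.normal(size=(8, 16)),     # 8 queries of dim 16
                rng.normal(size=(100, 16)),   # 100 keys (dim must match queries)
                rng.normal(size=(100, 32)))   # 100 values (count must match keys)
print(out.shape)  # (8, 32): one output per query, regardless of the 100 keys/values
```

This is exactly the lever the Perceiver pulls: make the query set a small learned latent, and the output (and the cost per query) shrinks accordingly.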
@DistortedV12 3 years ago
“General architecture”, but can it understand tabular inputs??
@bender2752 3 years ago
Great video! Consider making a video about DCTransformer maybe? 😊
@LaNeona 3 years ago
If I have a gamification model, is there anyone you know who does meta-analysis on system mechanisms?
@hiramcoriarodriguez1252 3 years ago
This is huge. I'm not going to be surprised if the Perceiver becomes the gold standard for CV tasks.
@galchinsky 3 years ago
The way it is, it seems to be classification only.
@nathanpestes9497 3 years ago
@@galchinsky You should be able to run it backwards for generation: just say my output (the image/point cloud/text I want to generate) is my latent (as labeled in the diagram), and my input (the byte array in the diagram) is some latent representation that feeds into my outputs over several steps. I think this could be super cool for 3D GANs, since you don't wind up having to fill 3D grids with a bunch of empty space.
@galchinsky 3 years ago
@@nathanpestes9497 Won't you get O(huge^2) this way?
@nathanpestes9497 3 years ago
@@galchinsky I think it would be cross-attention O(user-defined * huge), same as the paper (different order). Generally we have O(M*N), with M the size of the input/byte array and N the size of the latent. The paper goes after performance by forcing the latent to be non-huge, so M=huge, N=small: O(huge * small). Running it backwards, you would have a small input (which is now actually our latent, so a low-dimensional random sample if we want to do a GAN, or perhaps the (actual) latent from another Perceiver in a VAE or similar). So backwards you have M=small, N=huge: O(small * huge).
@galchinsky 3 years ago
@@nathanpestes9497 Thanks for pointing this out. I thought we would get a Huge x Huge attention matrix, but you are right: if we set the Q length to Huge and K/V to Small, the resulting complexity is O(Huge*Small). So we want to get a new K/V pair each time, and this approach seems quite natural (here was an imgur link, but YouTube seems to hide it): there are 2 parallel stacks of layers. The first is like in the article: latent weights, then cross-attention, then a stack of transformers and so on. The second stack consists of cross-attention layers operating in the byte-array dimension. Its first Q is the byte-array input, and K,V are taken from the stack of "latent transformers". Its output is then fed back as K,V to the "latent" cross-attention, making new K,V. So there is an informational ping-pong between the "huge" and "latent" cross-attention layers.
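A back-of-the-envelope version of the complexity argument in this thread, counting only the entries of the QKᵀ score matrix (the sizes are hypothetical, chosen to match a 224×224 image flattened to pixels):

```python
# Attention-cost comparison: number of entries in the QK^T score matrix.
huge, small = 50_176, 512      # byte-array length vs latent length (hypothetical)

self_attn = huge * huge        # full self-attention over the byte array
encode    = small * huge       # latent queries attend to the byte array
decode    = huge * small       # byte-array queries attend back to the latent
print(f"{self_attn:.2e} vs {encode + decode:.2e}")  # → 2.52e+09 vs 5.14e+07
```

Even with both an encode and a decode cross-attention, the cost stays linear in the huge dimension, roughly 50× cheaper than one full self-attention layer at these sizes.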
@thegistofcalculus 3 years ago
Just a silly question: instead of a big data input vector and a small latent vector, could they have a big latent vector that they use as a summary vector and spoon-feed slices of data, in order to achieve some downstream task such as predicting the next data slice? Would this allow for even bigger inputs to be summarized (like HD video)?
@thegistofcalculus 2 years ago
Looking back, it seems my comment was unclear. It would involve a second cross-attention module to determine what gets written into the big vector.
@conduit242 3 years ago
Embeddings are still all you need 🤷
@brll5733 3 years ago
Performers already grow entirely linearly, right?
@kirtipandya4618 3 years ago
Where can we find the source code?
@NeoShameMan 3 years ago
So basically it's conceptually close to rapid eye movement, where we refine over time the data we need to resolve recognition...
@corgirun7892 4 months ago
M = 50176 for 224 × 224 ImageNet images
@Shan224 3 years ago
Thank you, Yannic
@evilby 1 year ago
WAHHH... Problem Solved!😆
@moctardiallo2608 3 years ago
Yeah, 30 min is much better!
@Vikram-wx4hg 2 years ago
17:15
@seraphim9723 3 years ago
The ablation study consists of three points without any error bars and could just be coincidence. One cannot call that "science".
@timstevens3361 3 years ago
attention looped is consciousness
@notsure7132 3 years ago
Thank you.
@Deez-Master 3 years ago
Nice video
@vadimschashecnikovs3082 3 years ago
Hmm, I think it is possible to add some GLOM-like hierarchy of "words". This could improve the model...
@kenyang687 1 year ago
The "hmm by hmm" is just too confusing lol
@rhronsky 3 years ago
Clearly you are more of a fan of row vectors than column vectors, Yannic (referring to your visual demo :))
@martinschulze5399 3 years ago
Do you have any PhD positions open? ^^
@GuillermoValleCosmos 3 years ago
this is clever and cool
@freemind.d2714 3 years ago
Good job, Yannic! But I'm starting to feel like a lot of the papers you cover these days are all about transformers, and frankly they're kind of similar, and most are engineering research rather than scientific research. Hope you don't mind covering more interesting papers on different subjects.
@muhammadaliyu3076 3 years ago
Yannic follows the hype
@TechyBen 3 years ago
Oh no, they are making it try to be alive. XD
@AvastarBin 3 years ago
+1 For the visual representation of M*N hahah
@oreganorx7 2 years ago
Very similar to MemFormer
@Stefan-bs3gm 3 years ago
with O(M*M) attention you quickly get to OOM :-P
@allengrimm3039 3 years ago
I see what you did there
@omegapointil5741 3 years ago
I guess curing cancer is even more complicated than this.
@happycookiecamper8101 3 years ago
nice
@enriquesolarte1164 3 years ago
Haha, I love the scissors...!!!
@insighttoinciteworksllc1005 3 years ago
Humans can do the iterative process too. The Inquiry Method is the only thing that requires it. If you add the trial and error element with self-correction, young minds can develop a learning process. Learn How to learn? Once they get in touch with their inner teacher, they connect to the Information Dimension (theory). Humans can go to where the Perceiver can't go. The Inner teacher uses intuition to bring forth unknown knowledge to mankind's consciousness. The system Mr. Tesla used to create original thought. Unless you think he had a computer? The Perceiver will be able to replace all the scientists that helped develop it and the masses hooked on the internet. It will never replace the humans that develop the highest level of consciousness. Thank you, Yeshua for this revelation.
@allurbase 3 years ago
It's kind of dumb to input the same video frame over and over; just go frame by frame. It will take a bit for it to catch up, but so would you.