Hi Yannic, at 15:30 I think you say you're explaining cross-attention, but you're actually explaining XCA. I love your videos, and I learn a lot from them!
@jeyakumarjohnnyjonathan461 · 2 years ago
Excellent presentation, Sir! Thank you
@herp_derpingson · 3 years ago
5:35 You should remaster the "Attention Is All You Need" video.
32:45 What is being L2-normalized? All weights, or just the weights of the transformer?
35:25 I don't understand the query and key visualizations. Is it a norm across channels? What would be the interpretation in this case? If each channel corresponds to some feature, then a high norm means the neural network found multiple things in the same patch/pixel.
This is essentially learning a generator function for the kernels instead of the kernels themselves.
@YannicKilcher · 3 years ago
The queries and keys are L2-normalized. For the queries and keys, you simply look at each channel across the tokens as a vector and then proceed like usual. I think the visualizations are for the classification layer, where it's more "classic" attention, not this XCA. The visualizations are more to show that the network learns to focus on relevant things.
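To make the shapes concrete, here is a minimal single-head PyTorch sketch of the cross-covariance attention step as described above; the function name, the plain softmax placement, and the toy dimensions are illustrative assumptions, not the paper's exact multi-head implementation.

```python
import torch
import torch.nn.functional as F


def xca_single_head(x, w_q, w_k, w_v, temperature=1.0):
    """Single-head cross-covariance attention sketch.

    x: (N, d) token embeddings; w_q, w_k, w_v: (d, d) projection weights.
    The attention map lives over channels (d x d), not over tokens (N x N).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # each (N, d)

    # L2-normalize each channel across the tokens, i.e. every length-N
    # column of Q and K becomes a unit vector.
    q = F.normalize(q, dim=0)
    k = F.normalize(k, dim=0)

    attn = (k.transpose(-2, -1) @ q) / temperature   # (d, d) channel map
    attn = attn.softmax(dim=-1)

    # Each token mixes its own channels; no token-token mixing happens here.
    return v @ attn                                  # (N, d)


# Toy usage with random weights (illustrative only).
N, d = 196, 64
out = xca_single_head(torch.randn(N, d), *(torch.randn(d, d) for _ in range(3)))
print(out.shape)  # torch.Size([196, 64])
```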
@expirinot8724 · 3 years ago
Hi Yannic, it would be great to hear what you think makes "best papers" at large conferences (e.g. CVPR currently) special. What's the selection process for these awards, and do you think it's important to aim for one? Thanks!
@Gazzar19 · 3 years ago
Pretty cool that the head feature triggered for the race car's cockpit
@ChaiTimeDataScience · 3 years ago
We now need more weekends to keep up with Yannic's speed of creating videos. He's officially surpassed the speed of being "Yannic Lightspeed Kilcher".
@machinelearningone2635 · 3 years ago
So it's extracting features based on Gram matrices. What they are doing is exploring equivariances: convs have translation, attention has permutation, and this has scale and (to a certain degree) rotation.
@YannicKilcher · 3 years ago
I think classic attention is based on Gram matrices, whereas this one is based on covariance matrices.
@magi-1 · 3 years ago
@@YannicKilcher Covariance matrices are a special case of Gram matrices with a linear kernel function.
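Spelling out the contrast this sub-thread is drawing, with token matrices Q, K in R^(N x d), hats denoting the column-wise L2 normalization discussed above, and tau a temperature (notation loosely following the paper as summarized in the video):

```latex
% Standard self-attention: an N x N Gram matrix over tokens
A_{\mathrm{token}} = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{N \times N}

% Cross-covariance attention: a d x d matrix over channels
A_{\mathrm{channel}} = \operatorname{softmax}\!\left(\frac{\hat{K}^{\top} \hat{Q}}{\tau}\right) \in \mathbb{R}^{d \times d}
```

If the columns of Q and K were additionally mean-centered, K̂ᵀQ̂ would be a cross-covariance matrix in the strict statistical sense, which is in turn a Gram matrix of the channel vectors under a linear kernel, consistent with the reply above.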
@st33lbird · 3 years ago
So if you apply the XCiT idea to NLP, would you attend to dimensions of the word embedding vectors instead of channels?
@YannicKilcher · 3 years ago
yes, exactly
@snippletrap · 3 years ago
Would be hard to apply to NLP because the QKV and FF matrices would require fixed length sequences.
@kazz811 · 3 years ago
@@snippletrap Yup, this is my interpretation too. This combines cross-sequence information through 1x1 convolutions (as opposed to cross-channel) and can only be used for fixed-length sequences.
@seanburton6007 · 3 years ago
@@kazz811 You can do a cumulative sum of the covariance, similar to 'Transformers are RNNs'. Might require a different normalization scheme though.
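A rough sketch of that idea, as an assumption rather than anything from the XCiT paper: keep a running d x d sum of key-value outer products, as in causal linear attention, so the sequence length never has to be fixed. The normalization below is a naive placeholder (the linear-attention papers use positive feature maps), echoing the caveat in the comment.

```python
import torch


def cumulative_covariance_attention(q, k, v):
    """Causal, linear-attention-style sketch: token t reads from a running
    d x d sum S_t = sum_{s<=t} k_s v_s^T, so no fixed sequence length is
    needed. q, k, v: (N, d). Normalization is a naive placeholder.
    """
    n, d = q.shape
    s = torch.zeros(d, d)   # running key-value "covariance"
    z = torch.zeros(d)      # running key sum, used only for normalization
    outs = []
    for t in range(n):
        s = s + torch.outer(k[t], v[t])
        z = z + k[t]
        denom = (q[t] @ z).clamp_min(1e-6)
        outs.append((q[t] @ s) / denom)
    return torch.stack(outs)   # (N, d)


# Toy usage (illustrative only).
q, k, v = (torch.randn(10, 8) for _ in range(3))
print(cumulative_covariance_attention(q, k, v).shape)  # torch.Size([10, 8])
```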
@444haluk · 3 years ago
Biologically speaking, XCiT makes more sense than the original transformer: every XCiTation (see what I did there) to neurons produces some distributed representation, and other neurons listen to these representations in a specific cut (that changes over time if a better one is found). So in a way XCiT is a very, very crude, small and linear approximation of how actual neurons listen to other neurons (but not an approximation of how they operate, though).
@CristianCYAC · 3 years ago
Just as a curiosity, what program do you use to open the PDFs?
@herp_derpingson · 3 years ago
OneNote
@victorrielly4588 · 3 years ago
You’re forgiven for drawing that picture exactly one more time, but no more.
@yimingqu2403 · 3 years ago
Me reading XCiT paper this afternoon: if only he had done a video on this
@paulcurry8383 · 3 years ago
Have any papers played around with stacking transformer blocks width-wise? I.e., using self-attention to determine the key/value weights of an attention block, etc.?
@victorrielly4588 · 3 years ago
I suspect they tried to use smaller blocks, but the added performance either decreased or did not increase enough to outweigh the added FLOPs. Smaller blocks equate to fewer features in the original layer? The entire image becomes the only feature. With 8 by 8 blocks, each entry of the block is a feature (64 features). You could create many features from one long feature or a small number of features with something like a dense layer, but that is not going to give you good performance. That's like making apple pie out of just apples: no flour, no sugar, ...
@JTMoustache · 3 years ago
Yadi yadi yada !
@fahad3802 · 3 years ago
Won't you lose positional information about the actual sequence features? I had the same idea, which I applied to a computational biology problem (DNA sequences), but I couldn't recover the attention/interaction distances of sequence features/motifs in the DNA.
@YannicKilcher · 3 years ago
yes, but you retain the position information in the final transformation because you pull each patch through independently.
@李白-g7g · 3 years ago
At 5:30, I think every row represents a different channel, and every single element of the row should represent the probability of different objects, not of a single object like an eye or a mouth. Did I misunderstand anything?
@etiennetiennetienne · 3 years ago
I wonder if in fact "transformers" could be summarized as a form of meta-learning or hypernetworks, where the weights are "learned" on the fly. The cross-covariance produces a fresh, single "learned" weight matrix at test time, while standard attention produces a weight matrix per data point, which is perhaps too complex. I am waiting for self-supervision to be applied explicitly on the fly inside the "inner loop" optimization (a "mesa" optimizer).
@seetj12 · 3 years ago
1st comment. I know I am shameless :p
@matteoguida9971 · 3 years ago
1. To your knowledge, might this model be among the state of the art for image regression tasks (such as regressing an object's position)? 2. If so, what are the pros and cons w.r.t. standard CNNs?
@swazza9999 · 3 years ago
XCA as a 1x1 convolution: so it might be interesting to replicate XCiT, replacing the XCA with (PyTorch) `nn.Conv2d(d, d, 1, groups=h)`, and comparing the outcome after training from scratch. I still suspect the "implicit" token mixing would provide some boost, but I wonder how much.
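For reference, a hedged sketch of what that drop-in replacement might look like as a standalone token-wise module; the class name and reshaping are assumptions, and a real XCiT block would still wrap this with LayerNorm, the local patch interaction stage, and the feed-forward network.

```python
import torch
import torch.nn as nn


class GroupedChannelMix(nn.Module):
    """Hypothetical XCA replacement: a static grouped 1x1 convolution that
    mixes channels within h groups and has no data-dependent attention map.
    """

    def __init__(self, dim, heads):
        super().__init__()
        self.mix = nn.Conv2d(dim, dim, kernel_size=1, groups=heads)

    def forward(self, x):
        # x: (B, N, d) tokens -> (B, d, N, 1) so Conv2d treats d as channels
        x = x.transpose(1, 2).unsqueeze(-1)
        x = self.mix(x)
        return x.squeeze(-1).transpose(1, 2)   # back to (B, N, d)


# Toy usage: 196 tokens of width 192, 4 "heads"/groups (dim must be divisible by heads).
tokens = torch.randn(2, 196, 192)
print(GroupedChannelMix(192, 4)(tokens).shape)  # torch.Size([2, 196, 192])
```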
@ukhu_pacha · 3 years ago
Can you review this 2021 paper: "The Affective Growth of Computer Vision"? What do you think about it?
@mgostIH · 3 years ago
I think this sort of paper is kind of boring now: people just try a variation of the transformer by changing a couple of formulas minimally, then throw *A LOT* of compute and engineering with little tricks at it to get the same results we are used to getting. It might just be, like for FFNet, that if you stir the pot of data you get as input and give it years of GPU processing, good performance is bound to happen. Seems more of a side effect of "The Bitter Lesson" by Sutton than anything else.
@oncedidactic · 3 years ago
I had the same overall reaction. But to reframe: it's like these "hot" techniques, which win notoriety from performance that's at least as much big compute/data as solid architecture and careful handling, become the excuse to give consideration to basic research. It seems like lazy/obvious permutations to test, but if the same work were done without being in the category of a fad, you might call it useful basic work, if perhaps boring. These papers are bricks in the pyramid of "what do we know about structuring bias into NN architectures". Indeed, it seems like enough shaking, with some sort of inner structure and a learning signal, will perform some kind of useful search/sort. (Duh, maybe?) But what we want to know is which specific choices are good tradeoffs, and, longer term, whether there is something fundamental to understand about it that can be distilled. So, keep making bricks for now.
@oncedidactic · 3 years ago
Or in other words, what a privilege that we now get to consider this kinda boring, haha. Must be progress of some kind?
@mgostIH · 3 years ago
@@oncedidactic It's totally fair to have papers that make incremental improvements and try different things in order to explore the space of possibilities or even increase our certainty about known results, but hearing a paper like this explained isn't really adding that much to what was already presented a lot of times before. Maybe there are some engineering tricks that will prove to be very resilient and broadly beneficial (say, batch norm), but that's only something we can see some years after a paper has been published and tried.
@kazz811 · 3 years ago
So basically this approach cannot be used for variable-length sequences, since it takes linear combinations along the sequence dimension (instead of along the feature/channel dimension) before attention, which means that whatever the image size, we would have to ensure that the number of patches is identical. Am I getting this right?
@etiennetiennetienne · 3 years ago
no, it works for any sequence length
@kazz811 · 3 years ago
@@etiennetiennetienne If it applies 1x1 convolutions along the sequence dimension for the query and key vectors instead of along the channel dimension, then I don't think it can. Otherwise, how does this differ from standard attention? In standard attention, all linear operations are done cross-channel, with the sequence information coupled by the softmax of the attention matrix.
@etiennetiennetienne · 3 years ago
I think the 1x1 convolution processes token by token; it is not mixing tokens together, only the channels. It is the cross-covariance computation that mixes the tokens together.
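A quick way to check the shape argument: the channel attention map is d x d no matter how many tokens go in, so the same projection weights apply to any sequence length. A toy single-head sketch with random weights (dimensions are arbitrary):

```python
import torch
import torch.nn.functional as F

d = 64
w_q, w_k = torch.randn(d, d), torch.randn(d, d)   # weights independent of N

for n_tokens in (49, 196):                         # two different sequence lengths
    x = torch.randn(n_tokens, d)
    q = F.normalize(x @ w_q, dim=0)                # normalize channels over tokens
    k = F.normalize(x @ w_k, dim=0)
    attn = (k.T @ q).softmax(dim=-1)               # channel-channel attention
    print(n_tokens, attn.shape)                    # both print torch.Size([64, 64])
```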
@aidegrod · 3 years ago
I think this is a similar idea to "cSE" (channel Squeeze-and-Excitation), except using more than one channel, and to StyleGAN-like modulated convolution. Dynamic kernels for convolutions were in StyleGAN; I've seen roughly the same idea with small differences in many papers, but under different names, like "SPADE" blocks. So this could be named, for example, a cSE-modulated depthwise-separable conv-net. Nothing new, unfortunately.
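For comparison, a minimal channel squeeze-and-excitation block of the kind the comment refers to; this follows the standard SE design (global pooling, bottleneck MLP, sigmoid gating), not anything specific to XCiT, and the class name and reduction factor are illustrative.

```python
import torch
import torch.nn as nn


class ChannelSE(nn.Module):
    """Minimal channel squeeze-and-excitation: global-average-pool the spatial
    dims, pass through a small bottleneck MLP, and re-scale each channel."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = self.gate(x.mean(dim=(2, 3)))           # (B, C) channel weights
        return x * w[:, :, None, None]              # re-weight channels


# Toy usage (illustrative only).
print(ChannelSE(64)(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```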
@edeneden97 · 3 years ago
Please correct me if I'm wrong, but to me it seems that all these attention / transposed-attention / dynamic-weights layers are doing is swapping a linear operation for a quadratic or cubic one. Am I wrong? That is to say, a normal FF layer is just a linear transformation (that we sometimes follow with a non-linearity), and a dynamic-weights/attention layer is one where the weights themselves are a linear transformation of the input x, so the output is a quadratic transformation. If we use queries, keys and values, we get a cubic transformation (I notice that I ignored the softmax, but the general point holds). If I am correct, why is it surprising that a higher-degree polynomial gives a better fit than a linear function? Please help me make sense of this.
@YannicKilcher · 3 years ago
Your statements are correct, but I think you're using a different notion of what is quadratic than what we usually talk about with these things. We refer to the operation as quadratic because it computes the interaction term between all pairs of tokens.
@edeneden97 · 3 years ago
@@YannicKilcher I see what you mean, I meant a quadratic function of the input, as opposed to a linear function
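For readers following this exchange, the two notions of "quadratic" side by side (ignoring softmax and normalization, as in the comment above):

```latex
% As a function of the input X, one attention layer is a degree-3 polynomial map:
\operatorname{Attn}(X) \approx \big((X W_Q)(X W_K)^{\top}\big)\,(X W_V)

% The usual "quadratic" claim is instead about cost in the number of tokens N:
(X W_Q)(X W_K)^{\top} \in \mathbb{R}^{N \times N}
\;\Rightarrow\; \mathcal{O}(N^2 d) \text{ time and } \mathcal{O}(N^2) \text{ memory.}
```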
@kanfoosj · 3 years ago
So basically it's a fancier Squeeze-Excite layer.
@G12GilbertProduction · 3 years ago
If Facebook AI researchers team creating a deep-seated in extensive reinforcement learning with coherential resolutioning in 26k sampling - a little kind of cross-covariance transformer, I'll go pass out.
@aspergale9836 · 3 years ago
This is just plain old linear memory networks. You call the two parts participating in the memory construction `q` and `k`, but you could just as well have called them `k` and `v` and nothing would have changed. Exact same formulas. And it makes more intuitive sense, in my honest opinion.
@sayakpaul3152 · 3 years ago
I found some of the stuff a bit confusing, honestly. On one hand, I see it capturing channel-wise interactions across an entire sequence (which is probably a single image); on the other hand, the notation for the cross-covariance matrix suggests it's only for a single data point. You also kind of pointed out in the video that it does not even matter how we do it, as long as things are contextualized. Works like Non-local Means and Global Context Blocks also provide a nice way to achieve that, I would think.
@pensiveintrovert4318 · 3 years ago
Convnets are transformers, but at the pixel / small-feature level.
@saeed577 · 1 year ago
Thanks for making this video but very bad explanations 😅