Hi Yannic, at 15:30 I think you say you're explaining cross-attention, but you're actually explaining XCA. I love your videos, and I learn a lot from them!
@jeyakumarjohnnyjonathan461 · 2 years ago
Excellent presentation, Sir! Thank you
@herp_derpingson · 3 years ago
5:35 You should remaster the "Attention Is All You Need" video.
32:45 What is being L2-normalized? All weights, or just the weights of the transformer?
35:25 I don't understand the query and key visualizations. Is it a norm across channels? What would be the interpretation in this case? If each channel corresponds to some feature, then a high norm means the neural network found multiple things in the same patch/pixel.
This is essentially learning a generator function for the kernels instead of the kernels themselves.
@YannicKilcher · 3 years ago
The queries and keys are L2-normalized. For the queries and keys, you simply look at each channel across the tokens as a vector and then proceed like usual. I think the visualizations are for the classification layer, where it's more "classic" attention, not this XCA. The visualizations are more to show that the network learns to focus on relevant things.
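To make the shapes concrete, here is a minimal single-head PyTorch sketch of the cross-covariance attention step as described above; the function name, the plain softmax placement, and the toy dimensions are illustrative assumptions, not the paper's exact multi-head implementation.

```python
import torch
import torch.nn.functional as F


def xca_single_head(x, w_q, w_k, w_v, temperature=1.0):
    """Single-head cross-covariance attention sketch.

    x: (N, d) token embeddings; w_q, w_k, w_v: (d, d) projection weights.
    The attention map lives over channels (d x d), not over tokens (N x N).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # each (N, d)

    # L2-normalize each channel across the tokens, i.e. every length-N
    # column of Q and K becomes a unit vector.
    q = F.normalize(q, dim=0)
    k = F.normalize(k, dim=0)

    attn = (k.transpose(-2, -1) @ q) / temperature   # (d, d) channel map
    attn = attn.softmax(dim=-1)

    # Each token mixes its own channels; no token-token mixing happens here.
    return v @ attn                                  # (N, d)


# Toy usage with random weights (illustrative only).
N, d = 196, 64
out = xca_single_head(torch.randn(N, d), *(torch.randn(d, d) for _ in range(3)))
print(out.shape)  # torch.Size([196, 64])
```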
@expirinot8724 · 3 years ago
Hi Yannic, it would be great to hear what you think makes "best papers" at large conferences (e.g. CVPR currently) special. What's the selection process for these awards, and do you think it's important to aim for one? Thanks!
@Gazzar19 · 3 years ago
Pretty cool that the head feature triggered for the race car's cockpit
@ChaiTimeDataScience · 3 years ago
We now need more weekends to keep up with Yannic's speed of creating videos. He's officially surpassed the speed of being "Yannic Lightspeed Kilcher".
@machinelearningone2635 · 3 years ago
So it's extracting features based on Gram matrices. What they are doing is exploring equivariances: convs have translation, attention has permutation, and this has scale and (to a certain degree) rotation.
@YannicKilcher · 3 years ago
I think classic attention is based on Gram matrices, whereas this one is based on covariance matrices.
@magi-1 · 3 years ago
@@YannicKilcher Covariance matrices are a special case of Gram matrices with a linear kernel function.
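Spelling out the contrast this sub-thread is drawing, with token matrices Q, K in R^(N x d), hats denoting the column-wise L2 normalization discussed above, and tau a temperature (notation loosely following the paper as summarized in the video):

```latex
% Standard self-attention: an N x N Gram matrix over tokens
A_{\mathrm{token}} = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{N \times N}

% Cross-covariance attention: a d x d matrix over channels
A_{\mathrm{channel}} = \operatorname{softmax}\!\left(\frac{\hat{K}^{\top} \hat{Q}}{\tau}\right) \in \mathbb{R}^{d \times d}
```

If the columns of Q and K were additionally mean-centered, K̂ᵀQ̂ would be a cross-covariance matrix in the strict statistical sense, which is in turn a Gram matrix of the channel vectors under a linear kernel, consistent with the reply above.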
@st33lbird · 3 years ago
So if you apply the XCiT idea to NLP, would you attend to dimensions of the word embedding vectors instead of channels?
@YannicKilcher · 3 years ago
yes, exactly
@snippletrap · 3 years ago
Would be hard to apply to NLP because the QKV and FF matrices would require fixed length sequences.
@kazz811 · 3 years ago
@@snippletrap Yup, this is my interpretation too. This combines cross-sequence information through 1x1 convolutions (as opposed to cross-channel) and can only be used for fixed-length sequences.
@seanburton6007 · 3 years ago
@@kazz811 You can do a cumulative sum of the covariance, similar to 'Transformers are RNNs'. Might require a different normalization scheme though.
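A rough sketch of that idea, as an assumption rather than anything from the XCiT paper: keep a running d x d sum of key-value outer products, as in causal linear attention, so the sequence length never has to be fixed. The normalization below is a naive placeholder (the linear-attention papers use positive feature maps), echoing the caveat in the comment.

```python
import torch


def cumulative_covariance_attention(q, k, v):
    """Causal, linear-attention-style sketch: token t reads from a running
    d x d sum S_t = sum_{s<=t} k_s v_s^T, so no fixed sequence length is
    needed. q, k, v: (N, d). Normalization is a naive placeholder.
    """
    n, d = q.shape
    s = torch.zeros(d, d)   # running key-value "covariance"
    z = torch.zeros(d)      # running key sum, used only for normalization
    outs = []
    for t in range(n):
        s = s + torch.outer(k[t], v[t])
        z = z + k[t]
        denom = (q[t] @ z).clamp_min(1e-6)
        outs.append((q[t] @ s) / denom)
    return torch.stack(outs)   # (N, d)


# Toy usage (illustrative only).
q, k, v = (torch.randn(10, 8) for _ in range(3))
print(cumulative_covariance_attention(q, k, v).shape)  # torch.Size([10, 8])
```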
@444haluk · 3 years ago
Biologically speaking, XCiT makes more sense than the original transformer: every XCiTation (see what I did there) to neurons produces some distributed representation, and other neurons listen to these representations in a specific cut (that changes over time if a better one is found). So in a way XCiT is a very, very crude, small and linear approximation of how actual neurons listen to other neurons (but not an approximation of how they operate, though).
@CristianCYAC · 3 years ago
Just as a curiosity, what program do you use to open the PDFs?
@herp_derpingson · 3 years ago
OneNote
@victorrielly4588 · 3 years ago
You’re forgiven for drawing that picture exactly one more time, but no more.
@yimingqu2403 · 3 years ago
Me reading XCiT paper this afternoon: if only he had done a video on this
@paulcurry8383 · 3 years ago
Have any papers played around with stacking transformer blocks width-wise? I.e., using self-attention to determine the key/value weights of an attention block, etc.?
@victorrielly4588 · 3 years ago
I suspect they tried to use smaller blocks, but the added performance either decreased or did not increase enough to outweigh the added FLOPs. Smaller blocks equate to fewer features in the original layer? The entire image becomes the only feature. With 8 by 8 blocks, each entry of the block is a feature (64 features). You could create many features from one long feature or a small number of features with something like a dense layer, but that is not going to give you good performance. That's like making apple pie out of just apples: no flour, no sugar, ...
@JTMoustache · 3 years ago
Yadi yadi yada !
@fahad3802 · 3 years ago
Won't you lose positional information about the actual sequence features? I had the same idea, which I applied to a computational biology problem (DNA sequences), but I couldn't recover the attention/interaction distances of sequence features/motifs in the DNA.
@YannicKilcher · 3 years ago
yes, but you retain the position information in the final transformation because you pull each patch through independently.
@李白-g7g · 3 years ago
At 5:30, I think every row represents a different channel, and every single element of the row should represent the probability of different objects, not of a single object like an eye or a mouth. Did I misunderstand anything?
@etiennetiennetienne · 3 years ago
I wonder if in fact "transformers" could be summarized as a form of meta-learning or hypernetworks, where the weights are "learned" on the fly. The cross-covariance produces a fresh, single "learned" weight matrix at test time, while standard attention produces a weight matrix per data point, which is perhaps too complex. I am waiting for self-supervision to be applied explicitly on the fly inside the "inner loop" optimization (a "mesa" optimizer).
@seetj12 · 3 years ago
1st comment. I know I am shameless :p
@matteoguida9971 · 3 years ago
1. To your knowledge, might this model be among the state of the art for image regression tasks (such as regressing an object's position)? 2. If so, what are the pros and cons w.r.t. standard CNNs?
@swazza9999 · 3 years ago
XCA as a 1x1 convolution: so it might be interesting to replicate XCiT, replacing the XCA with (PyTorch) `nn.Conv2d(d, d, 1, groups=h)`, and comparing the outcome after training from scratch. I still suspect the "implicit" token mixing would provide some boost, but I wonder how much.
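For reference, a hedged sketch of what that drop-in replacement might look like as a standalone token-wise module; the class name and reshaping are assumptions, and a real XCiT block would still wrap this with LayerNorm, the local patch interaction stage, and the feed-forward network.

```python
import torch
import torch.nn as nn


class GroupedChannelMix(nn.Module):
    """Hypothetical XCA replacement: a static grouped 1x1 convolution that
    mixes channels within h groups and has no data-dependent attention map.
    """

    def __init__(self, dim, heads):
        super().__init__()
        self.mix = nn.Conv2d(dim, dim, kernel_size=1, groups=heads)

    def forward(self, x):
        # x: (B, N, d) tokens -> (B, d, N, 1) so Conv2d treats d as channels
        x = x.transpose(1, 2).unsqueeze(-1)
        x = self.mix(x)
        return x.squeeze(-1).transpose(1, 2)   # back to (B, N, d)


# Toy usage: 196 tokens of width 192, 4 "heads"/groups (dim must be divisible by heads).
tokens = torch.randn(2, 196, 192)
print(GroupedChannelMix(192, 4)(tokens).shape)  # torch.Size([2, 196, 192])
```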
@ukhu_pacha · 3 years ago
Can you review this 2021 paper: "The Affective Growth of Computer Vision"? What do you think about it?
@mgostIH · 3 years ago
I think this sort of paper is kind of boring now: people just try a variation of the transformer by changing a couple of formulas minimally, then throw *A LOT* of compute and engineering with little tricks at it to get the same results we are used to getting. It might just be, like for FFNet, that if you stir the pot of data you get as input and give it years of GPU processing, good performance is bound to happen. Seems more of a side effect of "The Bitter Lesson" by Sutton than anything else.
@oncedidactic · 3 years ago
I had the same overall reaction. But to reframe: it's like these "hot" techniques, which win notoriety from performance that's at least as much big compute/data as solid architecture and careful handling, become the excuse to give consideration to basic research. It seems like lazy/obvious permutations to test, but if the same work were done without being in the category of a fad, you might call it useful basic work, if perhaps boring. These papers are bricks in the pyramid of "what do we know about structuring bias into NN architectures". Indeed, it seems like enough shaking, with some sort of inner structure and a learning signal, will perform some kind of useful search/sort. (Duh, maybe?) But what we want to know is which specific choices are good tradeoffs, and, longer term, whether there is something fundamental to understand about it that can be distilled. So, keep making bricks for now.
@oncedidactic · 3 years ago
Or in other words, what a privilege that we now get to consider this kinda boring, haha. Must be progress of some kind?
@mgostIH · 3 years ago
@@oncedidactic It's totally fair to have papers that make incremental improvements and try different things in order to explore the space of possibilities or even increase our certainty about known results, but hearing a paper like this explained isn't really adding that much to what was already presented a lot of times before. Maybe there are some engineering tricks that will prove to be very resilient and broadly beneficial (say, batch norm), but that's only something we can see some years after a paper has been published and tried.
@kazz811 · 3 years ago
So basically this approach cannot be used for variable-length sequences, since it takes linear combinations along the sequence dimension (instead of along the feature/channel dimension) before attention, which means that whatever the image size, we would have to ensure that the number of patches is identical. Am I getting this right?
@etiennetiennetienne · 3 years ago
no, it works for any sequence length
@kazz811 · 3 years ago
@@etiennetiennetienne If it applies 1x1 convolutions along the sequence dimension for the query and key vectors instead of along the channel dimension, then I don't think it can. Otherwise, how does this differ from standard attention? In standard attention, all linear operations are done cross-channel, with the sequence information coupled by the softmax of the attention matrix.
@etiennetiennetienne · 3 years ago
I think the 1x1 convolution processes token by token; it is not mixing tokens together, only the channels. It is the cross-covariance computation that mixes the tokens together.
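A quick way to check the shape argument: the channel attention map is d x d no matter how many tokens go in, so the same projection weights apply to any sequence length. A toy single-head sketch with random weights (dimensions are arbitrary):

```python
import torch
import torch.nn.functional as F

d = 64
w_q, w_k = torch.randn(d, d), torch.randn(d, d)   # weights independent of N

for n_tokens in (49, 196):                         # two different sequence lengths
    x = torch.randn(n_tokens, d)
    q = F.normalize(x @ w_q, dim=0)                # normalize channels over tokens
    k = F.normalize(x @ w_k, dim=0)
    attn = (k.T @ q).softmax(dim=-1)               # channel-channel attention
    print(n_tokens, attn.shape)                    # both print torch.Size([64, 64])
```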
@aidegrod · 3 years ago
I think this is a similar idea to "cSE" (channel Squeeze-and-Excitation), except using more than one channel, and to StyleGAN-like modulated convolution. Dynamic kernels for convolutions were in StyleGAN; I've seen roughly the same idea with small differences in many papers, but under different names, like "SPADE" blocks. So this could be named, for example, a cSE-modulated depthwise-separable conv-net. Nothing new, unfortunately.
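For comparison, a minimal channel squeeze-and-excitation block of the kind the comment refers to; this follows the standard SE design (global pooling, bottleneck MLP, sigmoid gating), not anything specific to XCiT, and the class name and reduction factor are illustrative.

```python
import torch
import torch.nn as nn


class ChannelSE(nn.Module):
    """Minimal channel squeeze-and-excitation: global-average-pool the spatial
    dims, pass through a small bottleneck MLP, and re-scale each channel."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = self.gate(x.mean(dim=(2, 3)))           # (B, C) channel weights
        return x * w[:, :, None, None]              # re-weight channels


# Toy usage (illustrative only).
print(ChannelSE(64)(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```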
@edeneden97 · 3 years ago
Please correct me if I'm wrong, but to me it seems that all these attention / transposed-attention / dynamic-weights layers are doing is swapping a linear operation for a quadratic or cubic one. Am I wrong? That is to say, a normal FF layer is just a linear transformation (that we sometimes follow with a non-linearity), and a dynamic-weights/attention layer is one where the weights themselves are a linear transformation of the input x, so the output is a quadratic transformation. If we use queries, keys and values, we get a cubic transformation (I notice that I ignored the softmax, but the general point holds). If I am correct, why is it surprising that a higher-degree polynomial gives a better fit than a linear function? Please help me make sense of this.
@YannicKilcher · 3 years ago
Your statements are correct, but I think you're using a different notion of what is quadratic than what we usually talk about with these things. We refer to the operation as quadratic because it computes the interaction term between all pairs of tokens.
@edeneden97 · 3 years ago
@@YannicKilcher I see what you mean, I meant a quadratic function of the input, as opposed to a linear function
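For readers following this exchange, the two notions of "quadratic" side by side (ignoring softmax and normalization, as in the comment above):

```latex
% As a function of the input X, one attention layer is a degree-3 polynomial map:
\operatorname{Attn}(X) \approx \big((X W_Q)(X W_K)^{\top}\big)\,(X W_V)

% The usual "quadratic" claim is instead about cost in the number of tokens N:
(X W_Q)(X W_K)^{\top} \in \mathbb{R}^{N \times N}
\;\Rightarrow\; \mathcal{O}(N^2 d) \text{ time and } \mathcal{O}(N^2) \text{ memory.}
```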
@kanfoosj · 3 years ago
So basically it's a fancier Squeeze-Excite layer.
@G12GilbertProduction · 3 years ago
If Facebook AI researchers team creating a deep-seated in extensive reinforcement learning with coherential resolutioning in 26k sampling - a little kind of cross-covariance transformer, I'll go pass out.
@aspergale9836 · 3 years ago
This is just plain old linear memory networks. You call the two parts participating in the memory construction `q` and `k`, but you could just as well have called them `k` and `v` and nothing would have changed. Exact same formulas. And it makes more intuitive sense, in my honest opinion.
@sayakpaul3152 · 3 years ago
I found some of the stuff a bit confusing, honestly. On one hand, I see it capturing channel-wise interactions across an entire sequence (which is probably a single image); on the other hand, the notation for the cross-covariance matrix suggests it's only for a single data point. You also kind of pointed out in the video that it does not even matter how we do it, as long as things are contextualized. Works like Non-local Means and Global Context Blocks also provide a nice way to achieve that, I would think.
@pensiveintrovert4318 · 3 years ago
Convnets are transformers, but at the pixel / small-feature level.
@saeed577 · 1 year ago
Thanks for making this video but very bad explanations 😅