What is the difference between self-attention and multi-head self-attention? Are they the same, except that multi-head attention uses multiple heads instead of a single one?
@me-ou8rf 4 days ago
Can you suggest some materials on how transformers can be applied to time-series data such as EEG?
@hosseindahaee2886 5 days ago
Thanks for your concise and insightful description.🙏
@SebastianRaschka 9 days ago
Very nice video! I can also imagine that predicting the caption text exactly isn't only more difficult, but would also be more likely to result in (more) overfitting if it is learned this way. At 5:43, the pair-wise similarities are basically like cross-attention scores?
@PyMLstudio 9 days ago
Yes, in a way it's analogous to cross-attention: taking the dot product between the features from the text encoder and the image encoder. This dot-product similarity is used as the final output of the model to determine whether an image and a text caption are related. Good question, thanks for the comment
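For readers who want to see the mechanics, here is a minimal numpy sketch of that similarity computation (the function name, sizes, and temperature value are invented for illustration; the real CLIP model uses learned encoders and a learned temperature):

```python
import numpy as np

def clip_style_similarity(image_features, text_features, temperature=0.07):
    """Pairwise similarity between image and text embeddings, CLIP-style."""
    # L2-normalize so the dot product becomes cosine similarity
    img = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    # (n_images, d) @ (d, n_texts) -> (n_images, n_texts) score matrix
    return img @ txt.T / temperature

# Toy batch: 4 images and 4 captions, 512-d embeddings (random stand-ins for encoder outputs)
rng = np.random.default_rng(0)
scores = clip_style_similarity(rng.normal(size=(4, 512)), rng.normal(size=(4, 512)))
print(scores.shape)  # (4, 4); entry [i, j] scores image i against caption j
```

During training, matching image-caption pairs sit on the diagonal of this matrix and are pushed to have the highest scores.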
@benji6296 13 days ago
What would be the advantage of these methods vs FlashAttention? FlashAttention speeds up the computation and it is an exact computation, while most of these methods are approximations. I would also like, if possible, to see a video explaining other attention types such as PagedAttention and FlashAttention. Great content :)
@PyMLstudio 6 days ago
Thank you for the suggestion! You're absolutely right. In this video, I focused on purely algorithmic approaches, not hardware-based solutions like FlashAttention. FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce memory reads/writes between GPU memory levels, which results in significant speedup without sacrificing model quality. I appreciate your input and will definitely consider making a video to explain FlashAttention!
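As a rough illustration of the idea behind FlashAttention (not the real CUDA kernel, just the online-softmax tiling trick in plain numpy, with invented sizes): the exact softmax can be computed block by block, without ever materializing the full score matrix.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: forms the full score matrix at once."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Same result, but K/V are visited in blocks with an online softmax."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    m = np.full((n, 1), -np.inf)   # running row-wise max of the scores
    l = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                        # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)                        # rescale earlier accumulators
        P = np.exp(S - m_new)
        l = l * scale + P.sum(axis=-1, keepdims=True)
        out = out * scale + P @ Vb
        m = m_new
    return out / l

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(6, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))  # True
```

The real FlashAttention runs this per tile inside fast GPU SRAM, which is where the speedup comes from; the numpy version only demonstrates why the tiled result is still exact.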
@agenticmark 17 days ago
you left out step and sine :D
@rafaeljose2716 17 days ago
Can you talk about efficient self-attention?
@ClarenceWijaya 20 days ago
Thank you for detailing every matrix size in the input and output, it's so helpful
@PyMLstudio 20 days ago
Cool, glad that was helpful, thanks for the comment
@theophilegaudin2329 a month ago
Why is the key matrix different from the query matrix?
@PyMLstudio a month ago
That’s a good question! Making keys and queries different increases the modeling power: it allows the model to adaptively learn how to match different aspects of the input data (represented by the queries) against all other aspects (represented by the keys) in a more effective manner. Note that there are some models that use the same weights for queries and keys, but having different queries and keys gives more flexibility and a more powerful model.
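One concrete way to see the extra flexibility (a toy numpy sketch with made-up sizes): if the query and key weights are shared, the raw score matrix is forced to be symmetric, meaning token i must match token j exactly as strongly as j matches i (before the softmax); distinct weights remove that constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 tokens, embedding dim 8
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))

# Distinct projections: token i can attend to j differently than j attends to i
S_distinct = (X @ W_q) @ (X @ W_k).T
# Shared projection (W_q == W_k): the raw score matrix is always symmetric
S_shared = (X @ W_q) @ (X @ W_q).T

print(np.allclose(S_shared, S_shared.T))      # True
print(np.allclose(S_distinct, S_distinct.T))  # False (in general)
```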
@SolathPrime a month ago
I have my own activation function that I use. It's a Softplus-like function: the integral of (1+tanh(x))/2, which looks like a sigmoid except it's faster in training. Its integral is this equation, which I call "Rectified Integral Tangent Hyperbolic" (RITH for short): (x+ln(cosh(x)))/2. It's mostly linear for x≥1, which makes it fast in training. I added the term 1/e to center it between 0 and positive infinity.
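To make the commenter's formula concrete, here is a small numpy sketch (the name `rith` and the stable form of ln(cosh) are my additions, and the 1/e offset mentioned above is omitted). A direct np.log(np.cosh(x)) overflows for large |x|, so the identity ln(cosh(x)) = |x| + log1p(exp(-2|x|)) - ln(2) is used instead:

```python
import numpy as np

def rith(x):
    """(x + ln(cosh(x))) / 2, i.e. the antiderivative of (1 + tanh(x)) / 2."""
    ax = np.abs(x)
    log_cosh = ax + np.log1p(np.exp(-2.0 * ax)) - np.log(2.0)  # overflow-safe ln(cosh)
    return (x + log_cosh) / 2.0

x = np.linspace(-5.0, 5.0, 11)
# The derivative recovers the sigmoid-shaped gate (1 + tanh(x)) / 2:
h = 1e-6
numeric_grad = (rith(x + h) - rith(x - h)) / (2.0 * h)
print(np.allclose(numeric_grad, (1.0 + np.tanh(x)) / 2.0, atol=1e-5))  # True
print(rith(np.array([1000.0])))  # stays finite; roughly x - ln(2)/2 for large x
```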
@doublesami a month ago
Well explained. I have a few questions: 1. Why do we need the three matrices Q, K, V? 2. Since the dot product finds vector similarity, which we calculate using Q and K, why do we need V? What role does V play besides giving us back the input matrix shape?
@PyMLstudio a month ago
Thanks for the great question! Each of these matrices plays a different role, and that is what makes the attention mechanism so powerful. We can think of the query as what the model is currently looking at, and the keys as all other aspects of the input. So the dot product of q and k determines the relevance of what the model is currently looking at to everything else. Once the relevance of different parts of the input is established, the values are the actual content that we aggregate to form the output of the attention mechanism. The values hold the information that is being attended to.
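To make the three roles concrete, a single-head numpy sketch (all dimensions invented): Q·Kᵀ decides *where* to look, the softmax turns that into weights, and V is *what* actually gets mixed into the output.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))                  # 4 tokens, embedding dim 6
W_q, W_k, W_v = (rng.normal(size=(6, 6)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(6))      # Q, K: relevance of each token to each other
output = weights @ V                         # V: the content aggregated by those weights
print(weights.shape, output.shape)           # (4, 4) (4, 6)
```

Note that the output has the input's shape only because each row is a weighted average of the value vectors; V is what carries the information forward.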
@saqibsarwarkhan5549 a month ago
Great video. Thanks a lot.😊
@PyMLstudio a month ago
Glad you liked it!
@buh357 a month ago
Thank you for covering all these details, I am a big fan of the channel
@PyMLstudio a month ago
Thanks for your comment, I am glad you like the channel 👍🏻
@buh357 a month ago
You should include axial attention and axial position embedding, it's simple yet works great on images and video.
@PyMLstudio a month ago
Thanks for the suggestion, yes I agree. I have briefly described axial attention in the vision transformer series kzbin.info/www/bejne/mJLZl5SVh9dlnJYsi=0SB9Yc_0SasafhJN
@buh357 a month ago
@PyMLstudio That's awesome, thank you!
@buh357 a month ago
swin-transformer sucks
@digitalmonqui a month ago
Thank you for a clearly, patiently explained video. You explained the concepts with a perfect blend of clear language and technical background without hiding behind acronyms and algorithm names without explanation. Well done!
@PyMLstudio a month ago
Thanks for the nice comment, glad you enjoyed it!
@conlanrios a month ago
Great video, it's getting clearer 👍
@krischalkhanal9591 2 months ago
How do you make such good Model Diagrams?
@PyMLstudio 2 months ago
Thanks, this video and some of my earlier videos were made with the Python ManimCE package. But it takes so much time to prepare them, so my recent videos are made with PowerPoint.
@nancyboukamel442 2 months ago
The best video ever
@maryammohseni4507 2 months ago
Great video! Thanks
@brianlee4966 2 months ago
Thank you so much
@SebastianRaschka 2 months ago
Awesome explanation. One of my favorite channels for research coverage!
@davefaulkner6302 3 months ago
Thanks for the historical review of the topic. It would have been nice to see some of the results of the three papers.
@PyMLstudio 3 months ago
Thanks for your comment, yes that’s a good point. I’ll include results in future videos, but for these papers I’ll write an article and include the results there.
@shredder-31 3 months ago
Great video ❤ Can you send the slides link please?
@PyMLstudio 3 months ago
Thanks for watching. This video was not made with PowerPoint; all the animations were made using Python and the Manim package.
@gigglygeekgal 3 months ago
Great Explanation :)
@buh357 3 months ago
Your channel is so underrated; you deserve more, sir! Thanks for covering this topic. I have one question: why do we clip? Can I understand clipped relative position bias as only considering a small part of the relative positions?
@buh357 3 months ago
Thanks for covering the linear attention mechanism and implementing it step by step. I am working on CoAtNet; the attention block of CoAtNet has a relative position bias, and the attention mechanism is vanilla attention. How can I utilize linear attention with relative position bias? Any suggestions will be appreciated. 🙏
@PyMLstudio a month ago
That’s a good question. Relative position bias is typically added to the result of the dot-product of queries Q and keys K in traditional attention mechanisms. For linear attention, which aims to reduce computational complexity through low-rank approximations, you can integrate the relative position bias in a similar way by adding it after these approximations are computed. I've made a video that covers relative self-attention, which could provide some additional insights on handling position biases. Please feel free to watch this video for more details: kzbin.info/www/bejne/jpXPnneclpebm9ksi=h9bywcuPAs7mqCSD
@buh357 a month ago
@PyMLstudio Thank you for your detailed response. I was working on a 3D version of CoAtNet; I adopted the relative position bias from the Video Swin Transformer. I kind of understand the relative position bias for 2D images, but not the nitty-gritty details for 3D images, and the position bias added to the dot product of Q and K was a one-dimensional vector, which is a bit weird. But later I replaced the relative position bias embedding and vanilla attention with axial embedding and axial attention. To me, the axial position embedding was much easier to understand for 3D images, and 3D CoAtNet with axial attention outperformed 3D EfficientNet with CBAM and SE in my case.
@harshaldharpure9921 3 months ago
How do we apply the cross-attention mechanism if we have three features x, y, rag with sizes: x.shape = torch.Size([8, 768]), y.shape = torch.Size([8, 512]), rag.shape = torch.Size([8, 768])?
@PyMLstudio 3 months ago
So, let's assume query=y, key=x, and value=rag for our explanation, but remember, you can adjust this configuration depending on your specific needs. Given these tensors, our first step is to ensure that the dimensions of the query, key, and value match for the attention mechanism to work properly. Since y has a different dimension (512) compared to x and rag (768), we need to project y to match the 768-dimensional space of x and rag:

query_projector = Linear(512, 768)
query_projected = query_projector(query)  ## --> 8x768

With this projection, all three tensors (query_projected, key=x, value=rag) now share the same dimensionality (8x768), making them compatible for multi-head attention, where each head involves a dot product between query_projected and the keys, followed by Softmax and multiplication by the values. Remember that the assignment of x/y/rag to query/key/value can change depending on your use case and where these tensors come from. I hope this answers your question.
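The same recipe as a single-head numpy sketch (a scaled random matrix stands in for the learned Linear(512, 768); shapes follow the question above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x   = rng.normal(size=(8, 768))   # key
y   = rng.normal(size=(8, 512))   # query, needs projection to 768
rag = rng.normal(size=(8, 768))   # value

W_proj = rng.normal(size=(512, 768)) * 0.02   # stand-in for a learned Linear(512, 768)
q = y @ W_proj                                # (8, 768): now matches x and rag

attn = softmax(q @ x.T / np.sqrt(768))        # (8, 8) cross-attention weights
out = attn @ rag                              # (8, 768) aggregated values
print(out.shape)  # (8, 768)
```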
@harshaldharpure9921 3 months ago
Thanks a lot, sir @PyMLstudio
@Yaz71023 3 months ago
When you find people who make science easy and enjoyable 🫡
@MohSo14 3 months ago
Nice explanation bro
@moralstorieskids3884 3 months ago
What about sliding-window attention?
@kartikpodugu 4 months ago
His effort deserves a wider audience.
@faiqkhan7545 4 months ago
Hi, great video as usual. Do a video on the ring attention mechanism.
@kennethcarvalho3684 4 months ago
But how do we get the actual matrix for X?
@PyMLstudio 3 months ago
Thank you for your question - it’s indeed a great one. X represents the input to a given layer, much like inputs in traditional neural networks. Specifically, in the first layer of a transformer, X is derived by adding the token embedding and the position embedding. For subsequent layers within the transformer, X is simply the output of the preceding layer.
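A toy numpy sketch of how X is assembled for the first layer (random tables stand in for the learned token and position embeddings; all sizes invented):

```python
import numpy as np

vocab_size, max_len, d = 100, 16, 8
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(vocab_size, d))   # learned token-embedding table (random here)
pos_emb = rng.normal(size=(max_len, d))      # learned position-embedding table (random here)

token_ids = np.array([5, 42, 7, 99])                 # an example 4-token input
X = tok_emb[token_ids] + pos_emb[:len(token_ids)]    # (4, d): the input to layer 1
print(X.shape)  # (4, 8)
```

For every later layer, X is simply the previous layer's output, as the reply says.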
@gabrielvanderschmidt2301 5 months ago
Great explanation and visuals! Thank you very much!
@unclecode 5 months ago
Brilliant content! I really hope that at the end of this transformer series, you create a tutorial on building a transformer from scratch and training it with a small dataset, as well as fine-tuning. That is how everything comes into real practice.
@PyMLstudio 5 months ago
Thank you for the encouraging words! I’m glad you’re enjoying the series. Your suggestion is spot on - I plan to cover building a transformer from scratch and training it with a small dataset, as well as delve into fine-tuning LLMs using PEFT. These practical insights will indeed bring the theory into real-world practice. Stay tuned for more!
@unclecode 5 months ago
I appreciate the thorough tutorial. Working with this mathematical formula in the code is enjoyable, especially for programmers. It makes it accessible and less intimidating. I have a question, how effective have these linear approaches been in practice? Can they outperform and match the quality of full attention?
@PyMLstudio 5 months ago
Thank you for engaging with the content! In practice, these linear methods have proven quite effective, especially for very long sequences where full attention is computationally prohibitive. While they often come close, linear attention methods may not always match the quality of full attention, especially on shorter sequences where the full attention’s quadratic complexity is manageable. However, in scenarios with long texts or resource constraints, their efficiency and performance make them very compelling alternatives.
@unclecode 5 months ago
@@PyMLstudio Thx for the answer and definitely makes sense. It reminds me of choosing sort algorithms based on the length of data sequence.
@astitva5002 5 months ago
Your series on transformers is really useful, thank you for the content. Do you refer to any documentation, or do you have a site where I can look at figures and plots like the ones you show?
@PyMLstudio 5 months ago
Thank you for the positive feedback on my Transformers series! I'm glad to hear that you're finding it useful. I am currently working on publishing supporting articles for these videos on my Substack page (pyml.substack.com/). There, you'll be able to download the images and view additional figures and plots that complement the videos. Stay tuned for updates!
@terjeoseberg990 5 months ago
I’ve been waiting for this. You have the best explanations.
@PyMLstudio 5 months ago
Thank you so much for your kind words and for waiting! I'm thrilled to hear that you find the explanations helpful. Your support means a lot, and it motivates me to keep creating content that makes complex topics more accessible. Stay tuned for more!
@terjeoseberg990 5 months ago
I’d like more explanation about how this self attention mechanism plugs into the large language models.
@PyMLstudio 5 months ago
In the scope of this series, I plan to begin with discussing the evolution of attention mechanisms in the image domain, followed by an exploration into vision transformers. It's important to note that the Non-Local Module (NLM) is distinct from vision transformers. NLMs, particularly when coupled with residual connections, can be integrated into any pre-trained model, such as ResNet, as demonstrated in the original paper. This integration is designed to enhance the model without altering its fundamental behavior. Stay tuned as we delve deeper into Vision Transformers later on, and will see how self-attention mechanism is utilized in ViTs.
@davefaulkner6302 5 months ago
Regarding Multi-headed attention: it wasn't until you listed the dimensions of the output heads that it became clear that you are splitting the input by the embedding dimension, d, across the different heads. This should have been made more explicit in your explanation. Regardless, I was looking for the answer to this question of how the input was split across the heads so thank you for this detailed explanation of how the multi-headed mechanism works.
@jacobyoung2045 15 days ago
Thanks, your comment made it clearer for me.
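The split described in the thread above can be shown in a few lines of numpy (shapes invented): the embedding dimension d is divided across the h heads; the tokens are not.

```python
import numpy as np

B, T, d, h = 2, 5, 12, 3           # batch, tokens, embedding dim, number of heads
head_dim = d // h                  # each head works in a d/h = 4 dimensional subspace

Q = np.random.default_rng(0).normal(size=(B, T, d))
# Split the last (embedding) axis across heads, then move heads forward:
Q_heads = Q.reshape(B, T, h, head_dim).transpose(0, 2, 1, 3)   # (B, h, T, d/h)
print(Q_heads.shape)  # (2, 3, 5, 4)

# After attention runs independently per head, concatenating restores (B, T, d):
Q_merged = Q_heads.transpose(0, 2, 1, 3).reshape(B, T, d)
print(np.allclose(Q_merged, Q))  # True: split + merge is lossless
```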
@chrisogonas 5 months ago
Very well illustrated! Thanks
@PyMLstudio 5 months ago
Glad you liked it!
@ai.simplified.. 5 months ago
What a great channel, bro, keep going. I just found your channel.
@TJ-zs2sv 6 months ago
I have been watching the videos of the Transformer series. Such great content, with all the information in one place. Please sir, keep on publishing such advanced videos. Thank you so much!!
@PyMLstudio 6 months ago
I'm glad to hear you're enjoying the Transformer series on my channel! Thank you so much for your kind words and encouragement. It really means a lot to me. Absolutely, I'm already working on the next installment, and here's a little teaser for you: the upcoming video will dive into the fascinating concept of relative self-attention. Stay tuned for more advanced insights!
@charlesriggins7385 6 months ago
Very useful. Thank you.
@Summersault666 6 months ago
Interesting. Does the Non-Local module offer an advantage to transformers? This is a very difficult topic; do you have a sequence of materials to go deeper into this subject?
@PyMLstudio 6 months ago
That’s a very good question. I think of the Non-local Module as a generalization of scaled dot-product attention. The proposed non-local block can be added to existing architectures to capture long-range dependencies, but Transformers are different. We will cover different vision transformer models in this series; the previous introductory video shows the topics that will be covered.