Comments
@KumR
@KumR 10 hours ago
Wow.. So nice.
@paktv858
@paktv858 19 hours ago
What is the difference between self-attention and multi-head self-attention? Are they the same, except that multi-head attention uses multiple heads instead of a single one?
@me-ou8rf
@me-ou8rf 4 days ago
Can you suggest some materials on how transformers can be applied to time-series data such as EEG?
@hosseindahaee2886
@hosseindahaee2886 5 days ago
Thanks for your concise and insightful description.🙏
@SebastianRaschka
@SebastianRaschka 9 days ago
Very nice video! I can also imagine that predicting the caption text exactly isn't only more difficult, but it would also be more likely to result in (more) overfitting if it is learned this way. At 5:43, the pair-wise similarities, they are basically like cross-attention scores?
@PyMLstudio
@PyMLstudio 9 days ago
Yes, in a way it's analogous to cross-attention: taking the dot product between the features from the text encoder and the image encoder. This dot-product similarity is used as the final output of the model to determine whether an image and a text caption are related. Good question, thanks for the comment!
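To make that concrete, here's a minimal sketch of that pairwise-similarity computation (assuming PyTorch; the encoder outputs are random placeholders):
import torch
import torch.nn.functional as F

# hypothetical encoder outputs: 8 images and 8 captions, both projected to 512-d
image_features = F.normalize(torch.randn(8, 512), dim=-1)
text_features = F.normalize(torch.randn(8, 512), dim=-1)

# pairwise similarities: entry (i, j) scores image i against caption j
similarity = image_features @ text_features.T   # shape (8, 8)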
@benji6296
@benji6296 13 days ago
What would be the advantage of these methods vs FlashAttention? FlashAttention speeds up the computation and it is an exact computation, while most of these methods are approximations. If possible, I would like to see a video explaining other attention types such as PagedAttention and FlashAttention. Great content :)
@PyMLstudio
@PyMLstudio 6 days ago
Thank you for the suggestion! You're absolutely right. In this video, I focused on purely algorithmic approaches, not hardware-based solutions like FlashAttention. FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce memory reads/writes between GPU memory levels, which results in significant speedup without sacrificing model quality. I appreciate your input and will definitely consider making a video to explain FlashAttention!
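As a rough illustration only (not code from the video), recent PyTorch exposes a fused scaled dot-product attention that dispatches to a FlashAttention-style kernel on supported GPUs; the shapes below are made up:
import torch
import torch.nn.functional as F

# hypothetical shapes: batch 2, 8 heads, sequence length 1024, head dim 64
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# exact attention computed in tiles; PyTorch picks a Flash/memory-efficient
# backend automatically when the hardware and dtype allow it
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)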
@agenticmark
@agenticmark 17 days ago
you left out step and sine :D
@rafaeljose2716
@rafaeljose2716 17 days ago
Can you talk about efficient self-attention?
@ClarenceWijaya
@ClarenceWijaya 20 days ago
Thank you for detailing every matrix size in the inputs and outputs, it's so helpful.
@PyMLstudio
@PyMLstudio 20 days ago
Cool, glad that was helpful, thanks for the comment
@theophilegaudin2329
@theophilegaudin2329 1 month ago
Why is the key matrix different from the query matrix?
@PyMLstudio
@PyMLstudio 1 month ago
That's a good question! Making keys and queries different increases the model's expressive power: it allows the model to learn how to match different aspects of the input (represented by the queries) against all other aspects (represented by the keys) more effectively. Note that some models do use the same weights for queries and keys, but keeping them separate gives more flexibility and a more powerful model.
@SolathPrime
@SolathPrime 1 month ago
I have my own activation function that I use. It's a Softplus-like function: the integral of (1 + tanh(x))/2, which looks like the sigmoid except it's faster in training. That integral is (x + ln(cosh(x)))/2, which I call "Rectified Integral Tangent Hyperbolic", RITH for short. It's mostly linear for x ≥ 1, which makes it fast in training, and I added a term of 1/e to center it between 0 and positive infinity.
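If anyone wants to try it, here is a minimal sketch of the function exactly as described above (assuming PyTorch; the 1/e offset is taken verbatim from the comment):
import math
import torch

def rith(x: torch.Tensor) -> torch.Tensor:
    # integral of (1 + tanh(x)) / 2, i.e. (x + ln(cosh(x))) / 2, shifted by 1/e
    # as described; note cosh overflows for large |x|, so a numerically stable
    # log-cosh would be needed in practice
    return (x + torch.log(torch.cosh(x))) / 2 + 1.0 / math.e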
@doublesami
@doublesami 1 month ago
Well explained. I have a few questions: 1) Why do we need three matrices Q, K, V? 2) Since the dot product measures vector similarity, which we compute using Q and K, why do we need V? What role does V play besides giving us back the input matrix shape?
@PyMLstudio
@PyMLstudio 1 month ago
Thanks for the great question! Each of these matrices plays a different role, and together they make the attention mechanism so powerful. We can think of the query as what the model is currently looking at, and the keys as all the other parts of the input it can attend to. So the dot product of q and k determines the relevance of what the model is currently looking at to everything else. Once the relevance of the different parts of the input is established, the values are the actual content that we aggregate to form the output of the attention mechanism; the values hold the information that is being attended to.
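A minimal sketch of how Q, K, and V fit together in scaled dot-product attention (assuming PyTorch; the shapes are made up):
import math
import torch

# hypothetical input: batch 1, sequence of 5 tokens, embedding dim 64
x = torch.randn(1, 5, 64)

# separate learned projections produce queries, keys, and values
W_q, W_k, W_v = torch.nn.Linear(64, 64), torch.nn.Linear(64, 64), torch.nn.Linear(64, 64)
q, k, v = W_q(x), W_k(x), W_v(x)

# query/key dot products give the relevance scores, scaled by sqrt(d)
weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(64), dim=-1)   # (1, 5, 5)

# the values are the content that gets aggregated using those weights
out = weights @ v   # (1, 5, 64)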
@saqibsarwarkhan5549
@saqibsarwarkhan5549 1 month ago
Great video. Thanks a lot.😊
@PyMLstudio
@PyMLstudio 1 month ago
Glad you liked it!
@buh357
@buh357 1 month ago
Thank you for covering all these details, I am a big fan of the channel.
@PyMLstudio
@PyMLstudio 1 month ago
Thanks for your comment, I am glad you like the channel 👍🏻
@buh357
@buh357 1 month ago
You should include axial attention and axial position embedding; it's simple yet works great on images and video.
@PyMLstudio
@PyMLstudio 1 month ago
Thanks for the suggestion, yes I agree. I have briefly described axial attention in the vision transformer series kzbin.info/www/bejne/mJLZl5SVh9dlnJYsi=0SB9Yc_0SasafhJN
@buh357
@buh357 1 month ago
@@PyMLstudio that's awesome, thank you!
@buh357
@buh357 1 month ago
swin-transformer sucks
@digitalmonqui
@digitalmonqui 1 month ago
Thank you for a clearly, patiently explained video. You explained the concepts with a perfect blend of clear language and technical background without hiding behind acronyms and algorithm names without explanation. Well done!
@PyMLstudio
@PyMLstudio 1 month ago
Thanks for the nice comment, glad you enjoyed it!
@conlanrios
@conlanrios 1 month ago
Great video, getting more clear 👍
@krischalkhanal9591
@krischalkhanal9591 2 months ago
How do you make such good Model Diagrams?
@PyMLstudio
@PyMLstudio 2 months ago
Thanks, this video and some of my earlier videos were made with the Python ManimCE package. But it takes so much time to prepare them, so my recent videos are made with PowerPoint.
@nancyboukamel442
@nancyboukamel442 2 months ago
The best video ever
@maryammohseni4507
@maryammohseni4507 2 months ago
Great video! Thanks.
@brianlee4966
@brianlee4966 2 months ago
Thank you so much
@SebastianRaschka
@SebastianRaschka 2 months ago
Awesome explanation. One of my favorite channels for research coverage!
@davefaulkner6302
@davefaulkner6302 3 months ago
Thanks for the historical review of the topic. It would have been nice to see some of the results of the three papers.
@PyMLstudio
@PyMLstudio 3 months ago
Thanks for your comment, yes that's a good point. I'll include results in future videos, but for these papers I'll write an article and include the results there.
@shredder-31
@shredder-31 3 months ago
Great video ❤ Can you share a link to the slides, please?
@PyMLstudio
@PyMLstudio 3 months ago
Thanks for watching. This video was not made with PowerPoint; all the animations were made using Python and the Manim package.
@gigglygeekgal
@gigglygeekgal 3 months ago
Great Explanation :)
@buh357
@buh357 3 months ago
Your channel is so underrated, you deserve more, sir! Thanks for covering this topic. I have one question: why do we clip? Can I think of clipped relative position bias as only considering a small range of relative positions?
@buh357
@buh357 3 months ago
Thanks for covering the linear attention mechanism and implementing it step by step. I am working on CoAtNet; its attention block has a relative position bias and uses vanilla attention. How can I use linear attention with relative position bias? Any suggestions will be appreciated. 🙏
@PyMLstudio
@PyMLstudio 1 month ago
That’s a good question. Relative position bias is typically added to the result of the dot-product of queries Q and keys K in traditional attention mechanisms. For linear attention, which aims to reduce computational complexity through low-rank approximations, you can integrate the relative position bias in a similar way by adding it after these approximations are computed. I've made a video that covers relative self-attention, which could provide some additional insights on handling position biases. Please feel free to watch this video for more details: kzbin.info/www/bejne/jpXPnneclpebm9ksi=h9bywcuPAs7mqCSD
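For reference, a minimal sketch of where the bias enters in standard attention (assuming PyTorch and a hypothetical precomputed bias table):
import math
import torch

# hypothetical shapes: 1 head, 16 query/key positions, head dim 32
q = torch.randn(1, 16, 32)
k = torch.randn(1, 16, 32)
v = torch.randn(1, 16, 32)

# hypothetical learned relative position bias, one value per (query, key) pair
rel_pos_bias = torch.randn(1, 16, 16)

# the bias is added to the scaled dot-product scores before the softmax
scores = q @ k.transpose(-2, -1) / math.sqrt(32) + rel_pos_bias
out = torch.softmax(scores, dim=-1) @ v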
@buh357
@buh357 1 month ago
@@PyMLstudio Thank you for your detailed response. I was working on a 3D version of CoAtNet and adopted the relative position bias from the Video Swin Transformer. I kind of understand the relative position bias for 2D images, but not the finer details for 3D, and the position bias added to the dot product of Q and K was a one-dimensional vector, which is a bit weird. Later I replaced the relative position bias embedding and vanilla attention with axial embedding and axial attention; to me, the axial position embedding was much easier to understand for 3D images. And 3D CoAtNet with axial attention outperformed 3D EfficientNet with CBAM and SE in my case.
@harshaldharpure9921
@harshaldharpure9921 3 months ago
How do we do cross-attention if we have three features x, y, rag, with x.shape = torch.Size([8, 768]), y.shape = torch.Size([8, 512]), and rag.shape = torch.Size([8, 768])?
@PyMLstudio
@PyMLstudio 3 months ago
So, let's assume query = y, key = x, and value = rag for our explanation, but remember, you can adjust this assignment depending on your specific needs. Given these tensors, the first step is to make sure the dimensions of the query, key, and value match so the attention mechanism works properly. Since y has a different dimension (512) than x and rag (768), we need to project y into the 768-dimensional space of x and rag: query_projector = Linear(512, 768); query_projected = query_projector(query)  # --> 8x768. With this projection, all three tensors (query_projected, key=x, value=rag) share the same dimensionality (8x768), making them compatible for multi-head attention, where each head takes the dot product between query_projected and the keys, followed by a softmax and multiplication by the values. Remember that the assignment of x/y/rag to query/key/value can change depending on your use case and where these tensors come from. I hope this answers your question.
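A minimal end-to-end sketch of that setup (assuming PyTorch; the tensors are random placeholders and the head count is arbitrary):
import torch
import torch.nn as nn

x = torch.randn(8, 768)    # keys
y = torch.randn(8, 512)    # queries, need projection to 768
rag = torch.randn(8, 768)  # values

# project the 512-d queries into the 768-d space of the keys and values
query_projector = nn.Linear(512, 768)
query_projected = query_projector(y)   # (8, 768)

# treat the 8 rows as a sequence of length 8 with batch size 1
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
out, attn_weights = attn(query_projected.unsqueeze(0),  # query
                         x.unsqueeze(0),                # key
                         rag.unsqueeze(0))              # value
print(out.shape)  # torch.Size([1, 8, 768])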
@harshaldharpure9921
@harshaldharpure9921 3 months ago
Thanks a lot sir @@PyMLstudio
@Yaz71023
@Yaz71023 3 months ago
When you find people who make science easy and enjoyable 🫡
@MohSo14
@MohSo14 3 months ago
Nice explanation bro
@moralstorieskids3884
@moralstorieskids3884 3 months ago
What about sliding window attention?
@kartikpodugu
@kartikpodugu 4 months ago
His effort deserves a bigger audience.
@faiqkhan7545
@faiqkhan7545 4 months ago
Hi, great video as usual. Please do a video on the ring attention mechanism.
@kennethcarvalho3684
@kennethcarvalho3684 4 months ago
But how do we get the actual matrix for X?
@PyMLstudio
@PyMLstudio 3 months ago
Thank you for your question, it's a great one. X represents the input to a given layer, much like the inputs in traditional neural networks. Specifically, in the first layer of a transformer, X is obtained by combining the token embeddings and the position embeddings. For subsequent layers within the transformer, X is simply the output of the preceding layer.
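A minimal sketch of how that first-layer X is typically formed (assuming PyTorch; vocabulary size and dimensions are made up, and learned position embeddings are used for simplicity):
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64
token_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)

# hypothetical batch of 2 sequences with 10 token ids each
token_ids = torch.randint(0, vocab_size, (2, 10))
positions = torch.arange(10).unsqueeze(0)   # (1, 10), broadcast over the batch

# X for the first layer: token embedding plus position embedding
x = token_emb(token_ids) + pos_emb(positions)   # (2, 10, 64)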
@gabrielvanderschmidt2301
@gabrielvanderschmidt2301 5 months ago
Great explanation and visuals! Thank you very much!
@unclecode
@unclecode 5 months ago
Brilliant content! I really hope that at the end of this transformer series you create a tutorial on building a transformer from scratch and training it on a small dataset, as well as fine-tuning. That is how everything comes together in real practice.
@PyMLstudio
@PyMLstudio 5 months ago
Thank you for the encouraging words! I’m glad you’re enjoying the series. Your suggestion is spot on - I plan to cover building a transformer from scratch and training it with a small dataset, as well as delve into fine-tuning LLMs using PEFT. These practical insights will indeed bring the theory into real-world practice. Stay tuned for more!
@unclecode
@unclecode 5 months ago
I appreciate the thorough tutorial. Working through the mathematical formulas in code is enjoyable, especially for programmers; it makes them accessible and less intimidating. I have a question: how effective have these linear approaches been in practice? Can they match or outperform the quality of full attention?
@PyMLstudio
@PyMLstudio 5 months ago
Thank you for engaging with the content! In practice, these linear methods have proven quite effective, especially for very long sequences where full attention is computationally prohibitive. While they often come close, linear attention methods may not always match the quality of full attention, especially on shorter sequences where the full attention’s quadratic complexity is manageable. However, in scenarios with long texts or resource constraints, their efficiency and performance make them very compelling alternatives.
@unclecode
@unclecode 5 months ago
@@PyMLstudio Thanks for the answer, that definitely makes sense. It reminds me of choosing sorting algorithms based on the length of the data sequence.
@astitva5002
@astitva5002 5 months ago
Your series on transformers is really useful, thank you for the content. Do you refer to any documentation, or have a site where I can look at the figures and plots that you show?
@PyMLstudio
@PyMLstudio 5 months ago
Thank you for the positive feedback on my Transformers series! I'm glad to hear that you're finding it useful. I am currently working on publishing supporting articles for these videos on my Substack page (pyml.substack.com/). There, you'll be able to download the images and view additional figures and plots that complement the videos. Stay tuned for updates!
@terjeoseberg990
@terjeoseberg990 5 months ago
I’ve been waiting for this. You have the best explanations.
@PyMLstudio
@PyMLstudio 5 months ago
Thank you so much for your kind words and for waiting! I'm thrilled to hear that you find the explanations helpful. Your support means a lot, and it motivates me to keep creating content that makes complex topics more accessible. Stay tuned for more!
@terjeoseberg990
@terjeoseberg990 5 months ago
I'd like more explanation of how this self-attention mechanism plugs into large language models.
@PyMLstudio
@PyMLstudio 5 months ago
In the scope of this series, I plan to begin by discussing the evolution of attention mechanisms in the image domain, followed by an exploration of vision transformers. It's important to note that the Non-Local Module (NLM) is distinct from vision transformers. NLMs, particularly when coupled with residual connections, can be integrated into any pre-trained model, such as ResNet, as demonstrated in the original paper. This integration is designed to enhance the model without altering its fundamental behavior. Stay tuned as we delve deeper into vision transformers later on, and we will see how the self-attention mechanism is utilized in ViTs.
@davefaulkner6302
@davefaulkner6302 5 months ago
Regarding multi-head attention: it wasn't until you listed the dimensions of the output heads that it became clear that you are splitting the input by the embedding dimension, d, across the different heads. This should have been made more explicit in your explanation. Regardless, I was looking for the answer to this question of how the input is split across the heads, so thank you for this detailed explanation of how the multi-head mechanism works.
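For anyone else looking for this, a minimal sketch of that split (assuming PyTorch; the projected tensor of embedding dimension d is reshaped so each head gets d / num_heads channels):
import torch

batch, seq_len, d_model, num_heads = 2, 10, 64, 8
head_dim = d_model // num_heads   # 8 channels per head

# hypothetical projected queries of shape (batch, seq_len, d_model)
q = torch.randn(batch, seq_len, d_model)

# split the embedding dimension across heads: (batch, num_heads, seq_len, head_dim)
q_heads = q.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
print(q_heads.shape)   # torch.Size([2, 8, 10, 8])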
@jacobyoung2045
@jacobyoung2045 15 days ago
Thanks, your comment made it clearer for me.
@chrisogonas
@chrisogonas 5 months ago
Very well illustrated! Thanks
@PyMLstudio
@PyMLstudio 5 months ago
Glad you liked it!
@ai.simplified..
@ai.simplified.. 5 months ago
What a great channel, bro, keep going. I just found your channel.
@TJ-zs2sv
@TJ-zs2sv 6 months ago
I have been watching the videos in the Transformer series. Such great content, with all the information in one place. Please sir, keep on publishing such advanced videos. Thank you so much!!
@PyMLstudio
@PyMLstudio 6 months ago
I'm glad to hear you're enjoying the Transformer series on my channel! Thank you so much for your kind words and encouragement. It really means a lot to me. Absolutely, I'm already working on the next installment, and here's a little teaser for you: the upcoming video will dive into the fascinating concept of relative self-attention. Stay tuned for more advanced insights!
@charlesriggins7385
@charlesriggins7385 6 months ago
Very useful. Thank you.
@Summersault666
@Summersault666 6 months ago
Interesting. Does the Non-Local module offer an advantage over transformers? This is a very difficult topic; do you have a sequence of materials to go deeper into this subject?
@PyMLstudio
@PyMLstudio 6 months ago
That's a very good question. I think of the Non-Local Module as a generalization of scaled dot-product attention. The proposed non-local block can be added to existing architectures to capture long-range dependencies, but transformers are different. We will cover different vision transformer models in this series; the previous introductory video shows the topics that will be covered.
@santiagorf77
@santiagorf77 6 months ago
Great video!
@sarahgh8756
@sarahgh8756 6 months ago
Amazing Tutorial. Thank you.