Self-Attention Using Scaled Dot-Product Approach

12,239 views

Machine Learning Studio

A year ago

This video is part of a series on the Attention Mechanism and Transformers. Large Language Models (LLMs) such as ChatGPT have recently gained a lot of popularity, and the attention mechanism is at the heart of such models. My goal is to explain the concepts with visual representations so that by the end of this series, you will have a good understanding of the Attention Mechanism and Transformers. This video is specifically dedicated to the Self-Attention Mechanism, which uses a method called "Scaled Dot-Product Attention".
#SelfAttention #machinelearning #deeplearning

Comments: 27
@conlanrios · 27 days ago
Great video, things are getting clearer 👍
@nancyboukamel442 · a month ago
The best video ever
@Robertlga · 11 months ago
way to go, really nice summary
@joshehm4599 · a year ago
Great, very useful 👍
@MohSo14 · 2 months ago
Nice explanation bro
@SolathPrime · a year ago
Thank you, very useful
@PyMLstudio · a year ago
You are welcome, I'm glad you found it useful! I am working on a new video on Multihead Attention, which I will post very soon.
@doublesami · 11 days ago
Well explained! I have a few questions: (1) Why do we need three matrices Q, K, and V? (2) Since the dot product computed from Q and K measures vector similarity, why do we need V as well? What role does V play besides giving us back the input matrix shape?
@PyMLstudio · 3 days ago
Thanks for the great question! Each of these matrices plays a different role, which is what makes the attention mechanism so powerful. We can think of the query as what the model is currently looking at, and the keys as all the other parts of the input. So the dot product of q and k determines the relevance of what the model is currently looking at to everything else. Once the relevance of the different parts of the input is established, the values are the actual content that we aggregate to form the output of the attention mechanism. The values hold the information that is being attended to.
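As a rough sketch of the roles described in this reply, here is a minimal NumPy implementation of scaled dot-product attention (the function name, the random data, and d=5 are illustrative assumptions, not from the video):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for 2-D arrays Q, K, V."""
    d = Q.shape[-1]
    # Dot products of queries against keys measure relevance; dividing by
    # sqrt(d) keeps the scores from growing too large.
    scores = Q @ K.T / np.sqrt(d)
    # Softmax over the key axis turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The weights aggregate the values into the output.
    return weights @ V

rng = np.random.default_rng(0)
T, d = 4, 5  # sequence length and feature dimension
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 5): same shape as the input, as the question notes
```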
@ielinDaisy · a year ago
This is explained very well 😄 Thank you so much. One doubt: is Scaled Dot-Product Attention the same as Multiplicative Attention?
@PyMLstudio · a year ago
Thanks ielinDaisy, I'm glad you found the video useful. Yes, scaled dot-product attention is a specific formulation of the multiplicative attention mechanism: it calculates attention weights using the dot product of the query and key vectors, followed by proper scaling to prevent the dot products from growing too large.
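For reference, the formulation described in this reply (from "Attention Is All You Need") can be written as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

where $d_k$ is the key dimension; dividing by $\sqrt{d_k}$ is the "proper scaling" mentioned above.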
@astitva5002 · 4 months ago
Your series on Transformers is really useful, thank you for the content. Do you refer to any documentation, or do you have a site where I can look at the figures and plots you show?
@PyMLstudio · 4 months ago
Thank you for the positive feedback on my Transformers series! I'm glad to hear that you're finding it useful. I am currently working on publishing supporting articles for these videos on my Substack page (pyml.substack.com/). There, you'll be able to download the images and view additional figures and plots that complement the videos. Stay tuned for updates!
@gorkemakgul9651 · a year ago
Could you also explain how attention works with RNNs?
@PyMLstudio · a year ago
Thanks for the suggestion. Yes, absolutely! I plan to cover other models and architectures after finishing this topic, including models that integrate attention with RNNs.
@fabriciosales3299 · 11 months ago
Thanks, and congratulations on the video. One doubt: why is the size of d (15:28) 5?
@PyMLstudio · 11 months ago
Thanks for the comment. T is the sequence length, and d is basically the feature dimension. I just assumed d=5 for visualization purposes. In the paper "Attention Is All You Need", d is 512, but I cannot visualize matrices of such high dimension, so I assumed d=5. I made this visualization to track the dimensionalities of the matrices through these multiplications.
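The dimensionality tracking described in this reply can be sketched as follows (T=6, d=5, and the zero-initialized arrays are just placeholders for illustration):

```python
import numpy as np

T, d = 6, 5                    # sequence length; feature dimension (512 in the paper)
X = np.zeros((T, d))           # input sequence, one d-dimensional row per token
W_q = np.zeros((d, d))         # query/key/value projections (square for simplicity)
W_k = np.zeros((d, d))
W_v = np.zeros((d, d))

Q, K, V = X @ W_q, X @ W_k, X @ W_v   # each (T, d)
scores = Q @ K.T                      # (T, T): one score per query/key pair
output = scores @ V                   # (T, d): back to the input's shape
print(Q.shape, scores.shape, output.shape)  # (6, 5) (6, 6) (6, 5)
```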
@fabriciosales3299 · 11 months ago
@PyMLstudio Thanks a lot. Another doubt, if you can answer, please: I don't understand how the Transformer assigns similarity between words in a sentence based only on the words in that specific sentence. When calculating the attention weights, I believe the words of the sentence alone are not enough to measure the similarity between them. Shouldn't there be "prior knowledge"?
@PyMLstudio · 11 months ago
@fabriciosales3299 Absolutely, I am happy to answer any questions you may have. The Transformer is typically pre-trained as a language model in a self-supervised manner (which we can consider a form of unsupervised learning). Beyond that, no prior knowledge of word similarity is provided for this pre-training. During pre-training, the Transformer learns to predict the next word in a sequence (causal LM) or a masked word (masked LM). So the similarities are learned through this pre-training, in order to predict the next word or the masked word. In summary, no other prior knowledge of word similarity is needed; it is the job of the attention mechanism to learn which words to attend to during the training of the Transformer. I hope this answers your question. Note that I am working on a new video describing the full architecture of the Transformer, putting everything together; I will publish it in a few days.
@theophilegaudin2329 · 4 days ago
Why is the key matrix different from the query matrix?
@PyMLstudio · 3 days ago
That’s a good question! Making the keys and queries different increases the model's expressive power: it allows the model to learn how to match different aspects of the input (represented by the queries) against all other aspects (represented by the keys) more effectively. Note that some models do use the same weights for queries and keys, but having different queries and keys gives more flexibility and a more powerful model.
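One concrete way to see the extra flexibility: with separate query and key projections, the score for token i attending to token j need not equal the score for j attending to i. A small sketch (all names and random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))      # 3 tokens, 4 features
W_q = rng.standard_normal((4, 4))    # separate query projection
W_k = rng.standard_normal((4, 4))    # separate key projection

scores = (X @ W_q) @ (X @ W_k).T     # (3, 3) attention scores
# Distinct W_q and W_k give a generally asymmetric score matrix, while
# sharing one matrix (W_k = W_q) would force it to be symmetric.
print(np.allclose(scores, scores.T))  # False
```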
@krischalkhanal9591 · a month ago
How do you make such good Model Diagrams?
@PyMLstudio · a month ago
Thanks! This video and some of my earlier videos were made with the Python ManimCE package. But it takes a lot of time to prepare them, so my recent videos are made with PowerPoint.
@kennethcarvalho3684 · 3 months ago
But how do you get the actual matrix for X?
@PyMLstudio · 3 months ago
Thank you for your question - it’s indeed a great question. So X represents the input to a given layer, much like inputs in traditional neural networks. Specifically, in the first layer of a transformer, X is derived by calculating both the token embedding and the position embedding. For subsequent layers within the transformer, X is simply the output of the preceding layer.
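A minimal sketch of how X could be formed for the first layer, per this reply (the vocabulary size, dimensions, learned-lookup style, and example token IDs below are assumptions for illustration):

```python
import numpy as np

vocab_size, max_len, d = 100, 16, 5
rng = np.random.default_rng(0)
token_emb = rng.standard_normal((vocab_size, d))  # learned token embeddings
pos_emb = rng.standard_normal((max_len, d))       # learned position embeddings

token_ids = np.array([5, 42, 7])                  # an example tokenized input
T = len(token_ids)
# X for the first layer: token embedding plus position embedding per token.
# In later layers, X is simply the previous layer's output.
X = token_emb[token_ids] + pos_emb[:T]
print(X.shape)  # (3, 5)
```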