Multi Head Attention in Transformer Neural Networks with Code!

53,228 views

CodeEmporium

A day ago

Comments
@Dhanush-zj7mf · a year ago
We are very fortunate to have all this for free. Thank you.
@ajaytaneja111 · a year ago
Ajay, I'm currently on holiday and have been watching your Transformer videos on my mobile while having my evening coffee with my mom, and I've been doing this for the past 3 to 4 days. Today my mom, who seemed so impressed with your oratory skills, asked me if I could also lecture on a subject as spontaneously as the Ajay in the video! Now you've started giving me a complex, dude! Ha ha.
@CodeEmporium · a year ago
Hahahaha. Thanks to you and your mom for the kind words! And sorry for the tough spot :) Maybe you should show her some of your blogs since you’re pretty good at writing yourself
@ulassbingol · a year ago
This was one of the best explanations of multi-head attention. Thanks for your effort.
@barni_7762 · a year ago
Wow! I have watched a few other transformer explanation videos (they were shorter and yet tried to cover more content) and I honestly didn't understand anything. Your video, on the other hand, was crystal clear, and not only do I now understand how every part works, but I also have an idea WHY it is there. Also, you were super specific about the details that are otherwise left out. Great work!
@vio_tio12 · 10 months ago
Good job Ajay! Best explanation I have seen so far!
@moshimeow64 · 2 months ago
This is just fantastic! I've listened to countless tutorials and lectures and man, I was starting to get depressed - is ML that hard??? Thank you for breaking it down and helping a student who has lost all confidence!
@CodeEmporium · 2 months ago
Thanks for commenting. Makes my heart full. Keep at it! Good luck! :)
@HồngPhongLương-i4y · a year ago
Great work. One of the clearest explanations of multi-head attention ever.
@romainjouhameau2764 · a year ago
Very well explained. I really enjoy this mix between explanations and your code examples. Your videos are the best resources to learn about transformers. Really thankful for your work! Thanks a lot.
@hackie321 · 7 months ago
Wow. You even put on background music. Great work!! Sun rays falling on your face; it felt like God himself was teaching us Transformers.
@user-wr4yl7tx3w · a year ago
Exactly the type of content needed. Thanks.
@CodeEmporium · a year ago
You are so welcome! Thanks for watching!
@shubhamgattani5357 · a month ago
Corrections required in the video: In the qkv_layer, the bias should be set to False. The "values" variable should be permuted to (0,2,1,3) before reshaping.
@Official.chukwuemekajames · 2 days ago
Thank you for this amazing work.
@jiefeiwang5330 · a year ago
Really nice explanation! Just a small catch. 13:25 I believe you need to permute the variable "values" from size [1, 8, 4, 64] to [1, 4, 8, 64] before reshaping it (line 71). Otherwise, you are combining the same head slice from multiple words, rather than combining the multiple heads of the same word.
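A minimal sketch of the fix being described (shapes follow the video's example: batch size 1, 8 heads, sequence length 4, head dimension 64; variable names are illustrative, not the video's exact code):

```python
import torch

batch_size, num_heads, seq_len, head_dim = 1, 8, 4, 64

# per-head attention outputs: [batch, heads, seq_len, head_dim]
values = torch.randn(batch_size, num_heads, seq_len, head_dim)

# move the head axis next to head_dim so that reshaping concatenates
# the 8 heads belonging to the SAME word
values = values.permute(0, 2, 1, 3)                               # [1, 4, 8, 64]
out = values.reshape(batch_size, seq_len, num_heads * head_dim)   # [1, 4, 512]
```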
@georgemuriithi8765 · a year ago
Yes I believe so too
@Rohit-vd3mj · 5 months ago
Yes, I believe so too.
@shubhamgattani5357 · a month ago
Yes you are absolutely right. Thanks a lot for pointing it out. @CodeEmporium
@prashantlawhatre7007 · a year ago
❤❤ Loving this series on Transformers.
@CodeEmporium · a year ago
Thanks so much for commenting and watching! I really appreciate it
@ivantankoua9286 · a year ago
Thanks!
@prashantlawhatre7007 · a year ago
5:44, we should also set `bias=False` in nn.Linear().
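For reference, a minimal sketch of that change, assuming the video's d_model of 512:

```python
import torch.nn as nn

d_model = 512
# Q, K, V are plain linear projections in the original formulation,
# so the fused projection layer can be created without a bias term
qkv_layer = nn.Linear(d_model, 3 * d_model, bias=False)
```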
@SarvaniChinthapalli · 8 months ago
Great lecture. Thank you so much for this video. Great resource.
@DailySFY · 9 months ago
Thank you for all the effort you have put in!!
@ayoghes2277 · a year ago
Thank you for making this video Ajay !!
@CodeEmporium · a year ago
My pleasure! Hope you enjoy the rest of the series.
@thanhtuantran7926 · 8 months ago
I literally understood all of it, thank you so much.
@rajv4509 · a year ago
Brilliant stuff! Thanks for the time & effort you have put in to create these videos ... dhanyavadagalu :)
@rabindralamsal · 5 months ago
I think at 13:26 you need to permute (0,2,1,3) before reshaping.
@nexyboye5111 · 3 months ago
That one caught my eye too. Not sure exactly how reshaping works, but those dimensions shouldn't be mixed up.
@SiriusXie · a year ago
Exactly the content I needed. Thanks very much.
@CodeEmporium · a year ago
You're welcome! :)
@vivekmettu9374 · a year ago
Absolutely loved your explanation. Thank you for contributing!!
@saikiranbondi6868 · a year ago
You are wonderful, my brother; your way of explaining is so good.
@CodeEmporium · a year ago
Thanks a lot :)
@simonebonato5881 · a year ago
Outstanding video and clear explanation!
@CodeEmporium · a year ago
Thanks so much! Real glad this is helpful
@DouglasASean · a year ago
Thanks for your work, much needed right now.
@Slayer-dan · a year ago
You never, never disappoint, bro. Many, many thanks!
@CodeEmporium · a year ago
Thanks for the kind words and the support :)
@paull923 · a year ago
interesting and useful
@CodeEmporium · a year ago
Thanks so much
@surajgorai618 · a year ago
Very rich content as always. Thanks for sharing.
@CodeEmporium · a year ago
Thanks so much for commenting and watching!
@prasenjitgiri919 · a year ago
Thank you for the effort you have put in, much appreciated, but could you please explain the start token? It leaves an understanding gap for me.
@pi5549 · a year ago
5:05 Why separate variables for input_dim (embedding dimension IIUC) and d_model? Aren't these always going to be the same? Would we ever want this component to spit out a contextualized word vector that's a different length from the input word vector?
@oussamawahbi4976 · a year ago
I have the same question, and I assume that most of the time input_dim should equal d_model in order to keep the input and output dimensions consistent.
@ShawnMorel · a year ago
My understanding is that it sets you up to be able to choose different hyper-parameters, e.g. if you want a smaller input word embedding size but a larger internal representation. Table 3 of the original Transformer paper shows a few different combinations of these parameters: arxiv.org/pdf/1706.03762.pdf
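A small sketch of that flexibility; the specific sizes (1024-dimensional input embeddings projected into a 512-dimensional model) are assumptions for illustration, not the video's exact code:

```python
import torch
import torch.nn as nn

input_dim = 1024   # size of the incoming word embeddings
d_model = 512      # internal representation size of the attention block

qkv_layer = nn.Linear(input_dim, 3 * d_model)   # projects 1024 -> 1536

x = torch.randn(1, 4, input_dim)   # (batch, sequence length, input_dim)
qkv = qkv_layer(x)
print(qkv.shape)                   # torch.Size([1, 4, 1536])
```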
@raphango · a year ago
Thanks very much again! 😄
@davefaulkner6302 · 7 months ago
Thanks for your efforts to explain a complicated subject. Couple of questions: did you intentionally skip the Layer Normalization or did I miss something? Also -- the final linear layer in the attention block has dimension 512 x 512 (input, output size). Does this mean that each token (logit?) output from the attention layer is passed token-by-token through the linear layer to create a new set of tokens, that set being of size token sequence length. This connection between the attention output and the Linear layer is baffling me. The output of the attention layer is (Sequence-length x transformed-embedding-length) or (4 x 512), ignoring batch dimension in the tensor. Yet the linear layer accepts a (1 x 512) input and yields a (1 x 512) output. So is each (1 x 512) output token in the attention layer output sequence passed one at a time through the linear layer? And does this imply that the same linear layer is used for all tokens in the sequence?
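On the last point, PyTorch's nn.Linear operates only on the last dimension, so the same 512 x 512 weight matrix is applied at every sequence position, with the weights shared across tokens. A small sketch, with sizes assumed from the question:

```python
import torch
import torch.nn as nn

linear_layer = nn.Linear(512, 512)       # one weight matrix, shared across positions

attention_out = torch.randn(1, 4, 512)   # (batch, sequence length, d_model)
out = linear_layer(attention_out)        # applied independently at each of the 4 positions
print(out.shape)                         # torch.Size([1, 4, 512])
```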
@Sumeet-y3x · 5 months ago
How can I see the output of each attention head in the attention mechanism, say after the matrix multiplication?
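One way to inspect them is to keep the per-head tensors around before the heads are merged; a rough sketch (function and variable names are illustrative, shapes assumed from the video's example):

```python
import math
import torch

def per_head_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, heads, seq, seq)
    attn = torch.softmax(scores, dim=-1)
    values = torch.matmul(attn, v)                                  # (batch, heads, seq, head_dim)
    return values, attn

q = k = v = torch.randn(1, 8, 4, 64)
values, attn = per_head_attention(q, k, v)
print(attn[0, 0])    # 4 x 4 attention matrix of head 0
print(values[0, 0])  # 64-dim output of head 0 for each of the 4 words
```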
@mmv365 · 5 months ago
Is the reshape operation the same as concatenation along a dimension (to collapse multiple heads)?
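After the permute it effectively is: reshaping flattens the head and head-dimension axes in order, which matches concatenating the heads along the last dimension. A quick sketch to check this, with shapes assumed from the video's example:

```python
import torch

values = torch.randn(1, 8, 4, 64)            # (batch, heads, seq, head_dim)
values = values.permute(0, 2, 1, 3)          # (batch, seq, heads, head_dim)

by_reshape = values.reshape(1, 4, 8 * 64)
by_concat = torch.cat([values[:, :, h, :] for h in range(8)], dim=-1)

print(torch.equal(by_reshape, by_concat))    # True
```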
@seddikboudissa8668 · 9 months ago
Hello, good job, but I have a small point of confusion: in the transformer paper they compute a different key and query projection for each head, while here you split the key and query so that each head takes a slice. What's the difference between the two approaches?
@nexyboye5111 · 3 months ago
Actually nothing; you still have different weights for all the different heads. You could do the same thing by using a different nn.Linear layer for each head, even for q, k and v separately; since 3 * (512 * 512) == 3 * 8 * (512 * 64), you would get the same number of weights. The network doesn't know anything about how you structure your code.
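To make that equivalence concrete, a rough parameter-count comparison under the usual d_model = 512, 8 heads, head_dim = 64 setup (layer names are illustrative):

```python
import torch.nn as nn

d_model, num_heads = 512, 8
head_dim = d_model // num_heads              # 64

# fused projection producing q, k, v for all heads at once (as in the video)
fused = nn.Linear(d_model, 3 * d_model, bias=False)

# paper-style framing: separate q/k/v projections per head
per_head = nn.ModuleList(
    [nn.Linear(d_model, 3 * head_dim, bias=False) for _ in range(num_heads)]
)

def count(module):
    return sum(p.numel() for p in module.parameters())

print(count(fused))     # 786432 = 3 * 512 * 512
print(count(per_head))  # 786432 = 3 * 8 * 512 * 64 -> same number of weights
```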
@yusuke.s2551 · 7 months ago
Can you please make the notebooks in the repo accessible again? Most of them are not accessible right now. Thank you in advance!
@tonywang7933 · a year ago
At 4:57 d_model is 512, and so is input_dim. But at 14:23 input_dim is 1024; I thought they should be the same number. Are you saying you reduce the dimension of the input to the dimension of the model by some compression technique like PCA? At 14:23, it looks like input_dim is only used at the very beginning; once we are in the model, the input dimension is shrunk to 512.
@jubaerhossain1865 · a year ago
It's not PCA. It's a dimension conversion by weight matrix multiplication. For example, to map (1 x 1024) -> (1 x 512), we need a weight matrix of size 1024 x 512... This is just an example, not the actual scenario demonstrated here.
@chenmargalit7375 · a year ago
Hi, thanks for the great series! Something I don't understand and I'd love to hear your opinion on: you say the initial input is a one-hot encoded vector which is the size of the sequence length. Let's say my vocab is 1000 (all the words I want to support) and the sequence length is 30. How do I represent one word out of 1000 in a vector of sequence length 30? The index where I put the 1 will not be correct, as the word might actually be at position 500 in the real vocab tensor.
@kollivenkatamadhukar5059 · a year ago
Where can I get the theory part of it? It is good that you are explaining the code part; can you share any link where we can read the theory as well?
@stanislavdidenko8436 · a year ago
Maybe you have to divide 1536 by 3 first, and then by 8. But you do it by 8 first and then by 3, which sounds like you mix up the q, k, v vector dimensions.
@oussamawahbi4976 · a year ago
Good point, but I think because the parameters that generate q, k, v are learned, it doesn't matter which you divide by first. I could be wrong though.
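A sketch of the two split orders; both carve the same learned 1536 outputs into 8 heads of 64-dimensional q, k, v, they just assign output neurons to (head, q/k/v) slots differently, and since the qkv_layer weights are learned, either convention can be absorbed by training (shapes assumed from the video):

```python
import torch

batch, seq_len, d_model, num_heads = 1, 4, 512, 8
head_dim = d_model // num_heads
qkv = torch.randn(batch, seq_len, 3 * d_model)   # output of qkv_layer, last dim 1536

# order used in the video: split into 8 heads first, then chunk into q, k, v
x = qkv.reshape(batch, seq_len, num_heads, 3 * head_dim).permute(0, 2, 1, 3)
q1, k1, v1 = x.chunk(3, dim=-1)                  # each (1, 8, 4, 64)

# alternative order: chunk into q, k, v first, then split each into heads
q, k, v = qkv.chunk(3, dim=-1)                   # each (1, 4, 512)
q2 = q.reshape(batch, seq_len, num_heads, head_dim).permute(0, 2, 1, 3)   # (1, 8, 4, 64)

print(q1.shape, q2.shape)   # same shapes; the learned weights adapt to either layout
```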
@nexyboye5111 · 3 months ago
@oussamawahbi4976 No, you're right.
@creativityoverload2049 · a year ago
From what I understand, query, key and value are representations of the embedded word after positional encoding, each with a different purpose. But why are we dividing them into multiple heads in the first place, 64 dimensions each, when we could just have 1 head with 512-dimensional q, k, v and then perform self attention? Even if we are using multiple heads to capture more context, wouldn't 8 different 512-dimensional vectors for each of q, k, v, performing self attention on each and combining them later, give us a more accurate result? What I mean to ask is why a 512-dimensional representation of a word gets 64-dimensional q, k, v per head. Someone please explain this.
@jinghaoxin2987 · 5 months ago
A little mistake here: at the inference stage of the decoder, the input is not only the current word, but the stack of previously generated words.
@pi5549 · a year ago
14:40 Your embedding dimension is 1024. So how come qkv.shape[-1] is 3x512 not 3x1024?
@oussamawahbi4976 · a year ago
qkv is the result of the qkv_layer, which takes embeddings of size 1024 and has 3 * d_model = 3 * 512 output neurons; therefore the output of this layer will be of dimension (batch_size, seq_length, 3 * 512).
@xdhanav5449 · 10 months ago
Wow, this is a very intuitive explanation! I have a question though. From my understanding, the attention aids the encoder and decoder blocks in the transformer to understand which words that came either before or after (sometimes) will have a strong impact on the generation of the next word, through the feedforward neural network and other processes. Given a sentence like "The cook is always teaching the assistant new techniques and giving her advice.", what is a method I could implement to determine the pronoun-profession relationships to understand that cook is not paired with "her", rather "assistant" is. I have tried two methods so far. 1. Using the pretrained contextual embeddings from BERT. 2. (relating to this video) I thought that I could almost reverse engineer the attention methods by creating an attention vector to understand what pair of pronoun-professions WOULD be relevant, through self attention. However, this method did not work as well (better than method 1) and I believe this is because the sentence structures are very nuanced, so I believe that the attention process is not actually understanding the grammatical relationships between words in the sentence. How could I achieve this: a method that could determine which of the two professions in a sentence like above are referenced by the pronoun. I hope you can see why I thought that using an attention matrix would be beneficial here because the attention would explain which profession was more important in deciding whether the pronoun would be "he" or "her". This is a brief description of what I am trying to do, so if you can, I could elaborate more about this over email or something else. Thank you in advance for your help and thanks a million for your amazing explanations of transformer processes!
@xdhanav5449 · 10 months ago
I would like to add additionally that in my approach of using attention, I don't actually create query, key, value vectors. I take the embeddings, do the dot product, scale it, and use softmax to convert it into a probability distribution. Possibly this is where my approach goes wrong. The original embeddings of the words in the sentence are created from BERT, so there should already be positional encoding and other relevant things for embeddings.
@Danpage04 · a year ago
What about the weights for K, V, Q for each head, as well as for the output?
@fayezalhussein7115 · a year ago
Please, could you explain how to implement a hybrid model (vision transformer + CNN) for an image classification task?
@yanlu914 · a year ago
After getting "values", I think it should be permuted first, like before, and then reshaped.
@superghettoindian01 · a year ago
You are incredible. I've seen a good chunk of your videos and wanted to thank you from the bottom of my heart! With your content I feel like maybe even an idiot like me can understand it (one day, maybe? 🤔)! I hope you enjoy a lot of success!
@CodeEmporium · a year ago
Super kind words. Thank you so much! I'm sure you aren't an idiot, and I hope we can all learn together!
@Handelsbilanzdefizit · a year ago
But why do they do this multi-head thing? Is it to reduce computational cost? 8*(64²) < 512²
@josephfemia8496 · a year ago
Hello, I was wondering what the actual difference is between key and value. I'm a bit confused about the difference between "What I can offer" vs. "What I actually offer".
@yashs761 · a year ago
This is a great video that might help you build intuition behind the difference between query, key and value. I've linked the exact timestamp: kzbin.info/www/bejne/h6fOgmR4aKt1p6M
@ShawnMorel · a year ago
First, remember that what we're trying to learn is Q-weights, K-weights, V-weights such that:
- input-embedding * Q-weights = Q (a vector that can be used as a query)
- input-embedding * K-weights = K (a vector that can be used as a key)
- input-embedding * V-weights = V (a vector that can be used as a value)
Linguistic / grammar intuition: let's assume that we had those Q, K and V, and we wanted to search for content for some query Q. How might we do that?
@healthertsy1863 · a year ago
@yashs761 Thank you so much, this video has helped me a lot! The lecturer is brilliant!
@yashs761 · 3 months ago
@healthertsy1863 I know, right? Love that whole series.
@kaitoukid1088 · a year ago
Are you a full-time creator or do you work on AI while making digital content?
@CodeEmporium · a year ago
The latter. I have a full time job as a machine learning engineer. I make content like this on the side for now :)
@StarForgers · a year ago
@CodeEmporium How complex is the work you do with AI vs. what you teach us here? Would you say it's harder to code by far, or is it mostly just scaling up, reformatting, and sorting data to train the models?
@Stopinvadingmyhardware · a year ago
@CodeEmporium Are you able to disclose your employer's name?
@physicsphere · a year ago
I just started with AI/ML a few months ago. Can you guide me on what I should learn to get a job? I like your videos.
@CodeEmporium · a year ago
Nice! There are many answers to this. But to keep it short and effective, I would say know your fundamentals. This could be just picking one regression model (like linear regression) and understanding exactly how it works and why it works. Do the same for one classification model (like logistic regression). Look at both from the lens of code, math and real life problems. I think this is a good starting point for now. Honestly, it doesn't exactly matter where you start as long as you start and don't stop. I'm sure you'll succeed! That said, if you are interested in the content I mentioned earlier, I have some playlists titled "Linear Regression" and "Logistic Regression", so do check them out if / when you're interested. Hope this helps.
@physicsphere · a year ago
@CodeEmporium Thanks for the reply. Sure, I will check them out. I am going to do some work using transformers; your videos really help, especially the coding demonstrations.
@wishIKnewHowToLove · a year ago
thx)
@CodeEmporium · a year ago
You are very welcome! Hope you enjoy your stay on the channel :)
@stanislavdidenko8436 · a year ago
Acceptable!
@kartikpodugu · 10 months ago
I have two doubts: 1. How are Q, K, V calculated from the input text? 2. How are Q, K, V calculated for multiple heads? Can you elaborate or point me to a proper resource?
@naveenpoliasetty954 · 9 months ago
Word embeddings are fed into separate linear layers (fully connected neural networks) to generate the Q, K, and V vectors. These layers project the word embeddings into a new vector space specifically designed for the attention mechanism within the transformer architecture.
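A minimal sketch of that idea, with separate projection layers followed by a reshape into heads; the dimensions are assumed and the embeddings are random stand-ins for real word embeddings:

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
head_dim = d_model // num_heads

# 1. Q, K, V come from linear projections of the (position-encoded) word embeddings
q_layer = nn.Linear(d_model, d_model)
k_layer = nn.Linear(d_model, d_model)
v_layer = nn.Linear(d_model, d_model)

x = torch.randn(1, 4, d_model)                  # (batch, seq_len, d_model) embeddings
q, k, v = q_layer(x), k_layer(x), v_layer(x)    # each (1, 4, 512)

# 2. for multiple heads, each 512-dim projection is viewed as 8 blocks of 64
q = q.reshape(1, 4, num_heads, head_dim).permute(0, 2, 1, 3)   # (1, 8, 4, 64)
k = k.reshape(1, 4, num_heads, head_dim).permute(0, 2, 1, 3)
v = v.reshape(1, 4, num_heads, head_dim).permute(0, 2, 1, 3)
```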
@suchinthanawijesundara6464 · a year ago
❤❤
@CodeEmporium · a year ago
Thanks! :)
@thechoosen4240 · 11 months ago
Good job bro, JESUS IS COMING BACK VERY SOON; WATCH AND PREPARE