Rasa Algorithm Whiteboard - Transformers & Attention 3: Multi Head Attention

  58,250 views

Rasa

Comments: 85
@anassbairouk953 4 years ago
The best explanation ever
@avinashpaul1665 4 years ago
Very good explanation. The entire series (1, 2, 3) on attention provides a good step-by-step understanding of the attention concepts.
@sujitnalawade8661 2 years ago
One of the best explanations available on the internet for transformers.
@fawzinashashibi4758 3 years ago
This series on the attention mechanism is the best I've seen: clear and intuitive. Thank you!
@mikaslanche 4 years ago
These explanations are so good. Thanks for uploading these :)
@mehrzadio 4 years ago
That was the best explanation I've seen < BRAVO >
@saikatroy3818 3 years ago
The explanation is awesome, superb. Attention mechanisms were a black box for me, but now they are like an open secret. Thanks
@pedropereirapt 4 years ago
Another exceptional video! It seems like the main idea behind multi-head attention is a continuation of the queries, keys, and values, which was essentially to increase the learning power of the model. Before there were Q, K, V, the model was not trainable. Then, by adding Q, K, and V, the model became able to learn. Now, by adding multiple attention heads, the model becomes smarter because each head can pick up on a different relation. Thanks a lot for sharing this precious knowledge.
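To make the idea in this comment concrete, here is a minimal NumPy sketch of multi-head self-attention in which every head owns its own query, key, and value projection matrices; all names and dimensions are illustrative assumptions, not code from the video:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model); each head has its own learned projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # (n, d_k) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n, n) token-to-token attention
    return softmax(scores) @ V                 # (n, d_k) re-weighted values

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))

# One (Wq, Wk, Wv) triple per head: different weights -> different relations picked up.
heads = [attention_head(X,
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k))) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
output = np.concatenate(heads, axis=-1) @ Wo   # back to (n, d_model)
print(output.shape)                            # (5, 16)
```

Because each head's weight matrices are learned independently, each head can end up weighting token-to-token relations differently before the results are concatenated and mixed back together.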
@farrukhzamir A year ago
Very good tutorial series on attention, multi-head attention, and transformers. God bless you. Without your explanation videos I wouldn't have understood it.
@hadjdaoudmomo9534 4 years ago
Excellent and clear
@coolblue2468 3 years ago
Nothing can be better than this explanation of multi-head attention. Thanks a lot.
@morrislo6042 4 years ago
Best explanation I have ever seen
@ericklepptenberger6352 3 years ago
Thank you, best explanation ever. You helped me understand attention intuitively for the first time. Thx!
@pushkarparanjpe 4 years ago
Fabulous! Clear explanations.
@444haluk 3 years ago
OMG, you are the best! I'm listening to all of your explanation videos in case my teachers missed teaching other things as well!
@danish5326 A year ago
You explained to me what I have been struggling to learn for a year. Thanks so much. BTW, it's "parallelize", not "paralyze" 😜 4:23
@sampsuns A year ago
Are these heads applied in parallel or sequentially? At 7:30 it seems to be in parallel, and at 10:41 it seems sequential. Another question: if they run in parallel, why are the trained Q, K, V not the same for each head?
@CristiVladZ 3 years ago
Not a little bit of intuition, but massive intuition. Thank you!
@kurotenshi7069 2 years ago
Thanks a lot! The best explanation of the multi-head attention concept!
@linjie6446 3 years ago
What a fantastic explanation!
@richadutt665 4 years ago
That clearly answered all my questions. Thanks
@lilialola123 3 years ago
THIS IS AMAZING, THANK YOU FOR THE CLARITY! Please keep the ML videos coming.
@albertwang5974 4 years ago
For those who cannot understand multi-head attention, here is a tip: every head can be treated as a channel or a feature.
@ytcio A year ago
OK, but how do they specialize? Why don't they end up as just copies of each other?
@kartikpodugu 10 months ago
@ytcio They are just like different CNN channels. Just as each n×n filter in a CNN focuses on different aspects of the image, each head focuses on a different type of attention.
@jeff__w A year ago
9:23 "Every attention head is giving its attention to something different." (1) Is that just a function of each attention head calculating the dot product for a particular (and different) token in the sentence? (2) Another post I read said "Each head does not process the whole embedding vector; it processes just a part of the vector. Assume that our embedding is of size d and that we have h heads. That means the first head is going to process the first d/h dimensions of the vector, the second head is going to process the next d/h dimensions, and we continue in the same pattern." Is that what gives rise to the difference in attention? (3) What are the "layers"? Are those each multi-head attention layers? Basically, what causes each attention head to attend to something different?
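On the dimension question, a small worked sketch using the numbers quoted from the paper later in this thread (d_model = 512, h = 8): in the standard formulation each head applies its own learned projections to the full embedding but projects it down to d_model/h = 64 dimensions, so h narrow heads cost about the same as one full-width head. The per-head differences then come from the separately learned projection weights, and in visualization tools "layers" usually refers to the stacked transformer blocks, each of which has its own set of heads. Values below are assumptions for illustration:

```python
# Dimension bookkeeping for d_model = 512 and h = 8 heads (illustrative numbers).
d_model, h = 512, 8
d_k = d_v = d_model // h                 # 64 dimensions produced per head

per_head_params = 3 * d_model * d_k      # one W_q, W_k, W_v triple for a single head
all_heads_params = h * per_head_params   # equals 3 * d_model * d_model
print(d_k, per_head_params, all_heads_params)   # 64 98304 786432
```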
@krishanudasbaksi9530 4 years ago
Once again, a very nice explanation.
@engenglish610 3 years ago
The best explanation of all
@aayushjariwala6256 2 years ago
Loved this series
@techrelieve1716 3 years ago
Really appreciate you making the explanation so easy to understand. Keep up the great work.
@asmersoy4111 2 years ago
Incredible explanation! Thank you so much!
@naserpiltan1539 3 years ago
The best explanation I've ever seen. So clear and helpful. Thx!
@gausseinstein 3 years ago
Wonderful explanation. Many thanks!!
@evandavid940 A year ago
This series is just incredible!!
@zhangmanren 3 years ago
Deserves more views
@heets1971 A year ago
I don't understand why we need multiple attention heads. Also, how do the weights for keys, queries, and values avoid being trained to the same values across the heads? Is it because they are trained differently, or because they are initialized differently?
@zihanchen4312 4 years ago
Perfect explanation, man! Thank you for your efforts, and I can't wait to see your future content! :D
@TheVinkle 4 years ago
Good and intuitive explanations, thanks.
@aj-tg 4 years ago
Nicely done, mate!
@ashokkumarj594 4 years ago
Thank you for your great work 😍😍😍
@donkeyknight1453 2 years ago
You explain things better than my professor, lol.
@mingyanghe7029 3 years ago
Thank you, the best explanation ever.
@AnkitBindal97 4 years ago
Thank you!
@deepanshusingh4140 3 years ago
On point!! Super
@AdrianYang 2 years ago
My understanding and thoughts: a fully connected layer is too flexible, with all its weights free from restriction, so it is prone to overfitting; an RNN is too strict, with all items sharing the same weights (only applied different numbers of times), so it tends to underfit (gradient explosion/vanishing leads to poor learning ability); LSTM and GRU ease this by adding more weights in the form of gates, so that more memory can be kept; attention continues to relax the tight weight restriction of the RNN while keeping the weights less free than in a fully connected layer, since the key weights in attention must be applied to the inputs to calculate the scores. The weights in fully connected layers and RNNs are more about learning position information, while the weights in attention are about learning the embeddings.
@ahmed22502145 3 years ago
Great job! You really helped me.
@sumowll8903 2 years ago
Great explanation!!! Thank you!!!
@shenhaochong 4 years ago
I see that the last step of multi-head attention is to concatenate all the output vectors from the MATMUL into one dense vector, but does that mean the input vector will grow in dimension by a factor of the number of heads each time it passes through a multi-head attention block?
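In the standard formulation the answer is no: the concatenation is followed by one more linear layer that maps the result back to the model dimension, so blocks can be stacked without the width growing. A minimal shape check, with all sizes assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, h = 6, 512, 8
d_v = d_model // h

# Stand-ins for the per-head outputs of the scaled dot-product MATMUL step.
head_outputs = [rng.normal(size=(n, d_v)) for _ in range(h)]

concat = np.concatenate(head_outputs, axis=-1)   # (n, h * d_v) == (n, d_model)
W_o = rng.normal(size=(h * d_v, d_model))        # final linear layer of the block
block_output = concat @ W_o                      # back to (n, d_model)
print(concat.shape, block_output.shape)          # (6, 512) (6, 512)
```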
@ax5344 3 years ago
@4:07 Suppose we use multiple blocks instead of one. Since the transformation inside these blocks has the same form (q, k, v), are they just the same three matrices with different initial values?
@karimabdultankian28 4 years ago
Amazing!
@masoudparpanchi505 4 years ago
Thanks.
@ansharora3248 3 years ago
Wow!
@pranjalchaubey 4 years ago
Super!
@harryshuman9637 2 years ago
Quality stuff
@karannchew2534 10 months ago
Multiple layers/stacks/blocks, and each layer/stack/block has multiple heads.
@alexvass 2 years ago
With multi-head attention, does that mean there are multiple M_k matrices (multiple matrices for the key weights)? And likewise multiple M_q and M_v matrices?
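In the usual formulation, yes: each head gets its own key, query, and value weight matrices, which can be stored together as one tensor per role. A tiny sketch with assumed sizes, reusing the M_q / M_k / M_v naming from this comment:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, h = 512, 8
d_k = d_model // h

# One weight matrix per head for each role, stacked along a leading "head" axis.
M_q = rng.normal(size=(h, d_model, d_k))
M_k = rng.normal(size=(h, d_model, d_k))
M_v = rng.normal(size=(h, d_model, d_k))
print(M_k.shape)   # (8, 512, 64): eight key-weight matrices, one per head
```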
@xflory26x A year ago
What do you mean by "layers" in the interactive visualization, with reference to your previous diagrams, and how are they different from the different colors (heads)?
@bright1402 4 years ago
Thank you for your video! But one thing I am not clear on: when you conduct the split before feeding into the multi-head attention to get K, Q, and V, how is the data V split? For instance, if V is an n×d matrix, where n is the size of the vocabulary and d is the dimension of the word vectors, does the split act on the n dimension or the d dimension? Thank you!
@RasaHQ 4 years ago
There's not so much a split happening. It's more that there are multiple layers being attached. Each arrow represents a matrix multiplication, not a "cut" or a "split". Does this help?
@bright1402 4 years ago
@RasaHQ Thank you for your reply! Yeah, it is clear now.
@gordonlim2322 3 years ago
@RasaHQ I quote from the paper: "In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality." It seems there actually is a split going on. To relate this to the variables you've used: there shouldn't be h copies of the full vectors (v1 ... vn); each head's vectors should have a reduced dimensionality of d_model/h.
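The two descriptions are reconcilable: projecting the full d_model-dimensional input with one wide matrix and then slicing the result into h chunks is the same computation as applying h separate d_model → d_k projections, which is the "multiple matrix multiplications" picture from the video. What gets split is the projected dimension, not the input tokens. A small NumPy check under assumed sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_model, h = 4, 512, 8
d_k = d_model // h
X = rng.normal(size=(n, d_model))

# View 1: one wide d_model -> d_model projection, output sliced into h chunks of d_k.
W_big = rng.normal(size=(d_model, d_model))
chunks = (X @ W_big).reshape(n, h, d_k)

# View 2: h separate d_model -> d_k projections, one per head.
W_per_head = W_big.reshape(d_model, h, d_k)
separate = np.stack([X @ W_per_head[:, i, :] for i in range(h)], axis=1)

print(np.allclose(chunks, separate))   # True: both views give identical head inputs
```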
@karannchew2534 11 months ago
Why do they first need to pass through linear layers?
@daniquinteiro6152 4 months ago
Awesome
@emineaysesunarstudent1767 3 years ago
Thank you so much for the video! I am stuck on one point: how do we ensure that these different heads learn different attention patterns?
@RasaHQ 3 years ago
(Vincent here) While there is no hard guarantee, you could consider the following thought experiment. Suppose that we allow for 2 heads. Under what circumstances would the weights for both attention heads be exactly the same? If there's an improvement to be made, the gradient signal should cause the weights to differ. This depends a lot on the labels, but that's where the learning will be.
@zeinramadan 2 years ago
Random initialization of the weights?
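Both points can be illustrated together: two heads with the same architecture but different random starting weights already produce different attention patterns on the same input, and the gradient signal only pushes them back toward each other if identical heads happened to be optimal. A tiny sketch with made-up sizes:

```python
import numpy as np

def attention_map(X, Wq, Wk):
    # Row-wise softmax over scaled dot-product scores.
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wk.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
n, d_model, d_k = 5, 16, 4
X = rng.normal(size=(n, d_model))

# Two heads: identical architecture, different random starting weights.
map_a = attention_map(X, rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k)))
map_b = attention_map(X, rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k)))
print(np.abs(map_a - map_b).max() > 0.0)   # True: the heads already attend differently at init
```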
@6511497115104 A year ago
Don't we cut the original V vector into h slices and feed each slice to a different attention head?
@6511497115104 A year ago
ChatGPT: Yes, in the multi-head attention mechanism, the original input embedding vector is sliced into multiple sub-vectors, or "heads", which are then processed in parallel to compute multiple sets of attention weights. The number of heads is a hyperparameter that is typically set to a small value, such as 8 or 16.
Each head in the multi-head attention mechanism has its own set of learned weight matrices, which are used to project the input embedding vector into a query, key, and value vector for that particular head. These projected vectors are then used to compute the attention weights and the weighted sum of the values, which are concatenated across all heads and passed through a final linear layer to produce the final output representation.
By splitting the input embedding vector into multiple heads, the multi-head attention mechanism is able to capture different aspects of the input representation, allowing the model to learn more nuanced and fine-grained relationships between the input tokens. Additionally, the parallel processing of the multiple heads can lead to faster training and inference times.
@asraajalilsaeed7435 A year ago
Where can I find the code that displays the multi-head attention in the video, where you move the cursor over the tokens?
@sethjchandler 3 years ago
Brilliant. Thanks!
@ecto111 A month ago
Which app is used to create this presentation?
@abc-by1kb 3 years ago
10:06 Do you mean "Named Entity Recognition"? Great video, btw. Thank you so much!!!
@RasaHQ 3 years ago
(Vincent here) D'oh! Yep, you're right!
@abc-by1kb 3 years ago
@RasaHQ I really want to say thank you so much for the video! I never thought someone could explain self-attention and transformers in such a logical, incremental, and intuitive way. Great work!
@abc-by1kb 3 years ago
@RasaHQ As a CS student, I think your videos should definitely come up on top when people search for transformers.
@TamilSelvanMurugesan-mw2bv 2 years ago
Neat...
@search_is_mouse 2 years ago
I love you...
@kartikpodugu 10 months ago
I think everybody should play with the visualization tool to understand MHA better.
@usertempeuqwer7576 4 years ago
Reupload X)
@RasaHQ 4 years ago
Yeah, there was an issue with the previous version that we only discovered after hitting the "live" button, so we re-rendered. This one should be fine.
Multi Head Attention in Transformer Neural Networks with Code!
15:59
Visualizing transformers and attention | Talk for TNG Big Tech Day '24
57:45
The math behind Attention: Keys, Queries, and Values matrices
36:16
Serrano.Academy
284K views
Attention in transformers, step-by-step | DL6
26:10
3Blue1Brown
2.1M views
Self-Attention Using Scaled Dot-Product Approach
16:09
Machine Learning Studio
18K views
Visual Guide to Transformer Neural Networks - (Episode 2) Multi-Head & Self-Attention
15:25
A Dive Into Multihead Attention, Self-Attention and Cross-Attention
9:57
Machine Learning Studio
37K views
What are Transformer Models and how do they work?
44:26
Serrano.Academy
138K views