@avinashpaul1665 · 4 years ago
Very good explanation. The entire series (1, 2, 3) on attention provides a good step-by-step understanding of the attention concepts.
@sujitnalawade8661 · 2 years ago
One of the best explanations of transformers available on the internet.
@fawzinashashibi4758 · 3 years ago
This series on the attention mechanism is the best I've seen: clear and intuitive. Thank you!
@mikaslanche · 4 years ago
These explanations are so good. Thanks for uploading these :)
@mehrzadio · 4 years ago
That was the best explanation I've seen. BRAVO!
@saikatroy3818 · 3 years ago
The explanation is awesome, superb. Attention mechanisms were a black box for me, but now they are like an open secret. Thanks!
@pedropereirapt · 4 years ago
Another exceptional video! It seems like the main idea behind multi-head attention is a continuation of the idea behind queries, keys, and values, which was essentially to increase the learning power of the model. Before Q, K, and V, the model was not trainable; adding Q, K, and V made it able to learn. Now, by adding multiple attention heads, the model becomes smarter because each head can pick up a different relation. Thanks a lot for sharing this precious knowledge!
@farrukhzamir · 1 year ago
Very good tutorial series on attention, multi-head attention, and transformers. God bless you. Without your explanation videos I wouldn't have understood it.
@hadjdaoudmomo9534 · 4 years ago
Excellent and clear
@coolblue2468 · 3 years ago
Nothing can be better than this explanation of multi-head attention. Thanks a lot.
@morrislo6042 · 4 years ago
Best explanation that I have ever seen.
@ericklepptenberger6352 · 3 years ago
Thank you, best explanation ever. You helped me to understand attention intuitively for the first time. Thx!
@pushkarparanjpe · 4 years ago
Fabulous! Clear explanations.
@444haluk · 3 years ago
OMG, you are the best! I'm listening to all of your explanation videos in case my stupid teachers missed other things to teach as well!
@danish5326 · 1 year ago
You explained to me what I have been struggling to learn for a year. Thanks so much. BTW, it's "parallelize", not "paralyze" 😜 (4:23)
@sampsuns · 1 year ago
Do these heads run in parallel or sequentially? At 7:30 it seems parallel, and at 10:41 it seems sequential. Another question: if they run in parallel, why are the trained Q, K, V not the same for each head?
@CristiVladZ · 3 years ago
Not just a little bit of intuition, but massive intuition. Thank you!
@kurotenshi7069 · 2 years ago
Thanks a lot! The best explanation of the multi-head attention mechanism!
@linjie6446 · 3 years ago
What a fantastic explanation!
@richadutt665 · 4 years ago
That clearly answered all my questions. Thanks!
@lilialola123 · 3 years ago
THIS IS AMAZING, THANK YOU FOR THE CLARITY! Please keep the ML videos coming.
@albertwang5974 · 4 years ago
For those who can't understand multi-head attention, here is a tip: every head can be treated as a channel or a feature.
@ytcio · 1 year ago
OK, but how do they specialize? How do they avoid being just copies of each other?
@kartikpodugu · 10 months ago
@ytcio They are just like different CNN channels. Just as each n×n filter of a CNN channel focuses on different aspects of the image, each head focuses on a different type of attention.
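To make the channel analogy concrete, here is a shape-only sketch (NumPy, with made-up sizes, purely illustrative): a conv layer keeps one independent filter per output channel, and multi-head attention keeps one independent set of Q/K/V projections per head.

```python
import numpy as np

# A conv layer: one independent filter per output channel, each seeing the full input depth.
in_channels, out_channels, kernel = 3, 8, 3
conv_filters = np.random.randn(out_channels, in_channels, kernel, kernel)

# Multi-head attention: one independent Q/K/V projection per head, each seeing the full d_model.
d_model, num_heads = 512, 8
d_head = d_model // num_heads
W_q = np.random.randn(num_heads, d_model, d_head)
W_k = np.random.randn(num_heads, d_model, d_head)
W_v = np.random.randn(num_heads, d_model, d_head)

print(conv_filters.shape, W_q.shape)  # (8, 3, 3, 3) (8, 512, 64)
```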
@jeff__w · 1 year ago
9:23 “Every attention header is giving its attention on something different.” (1) Is that just a function of each attention head calculating the dot product for a particular (and different) token in the sentence? (2) Another post I read said: “*Each head does not process the whole embedding vector, it processes just a part of the vector.* Assume that our embedding is of size _d_ and that we have _h_ heads; that means the first head processes the first _d/h_ dimensions of the vector, the second head processes the next _d/h_ dimensions, and so on.” Is _that_ what gives rise to the difference in attention? (3) What are the “layers”? Is each of those a multi-head attention layer? Basically, what causes each attention head to attend to something different?
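For readers puzzling over the same questions: in the standard formulation from the "Attention Is All You Need" paper (which may differ in details from the video), every head receives every token's full embedding; each head then projects it into its own smaller d_model/h subspace with its own learned W_q, W_k, W_v, and the differences in attention come from those learned projections rather than from assigning different tokens or fixed slices to different heads. A minimal NumPy sketch of the shapes, with illustrative sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(42)
seq_len, d_model, num_heads = 5, 8, 2
d_head = d_model // num_heads            # each head works in a d_model/h subspace

x = rng.normal(size=(seq_len, d_model))  # every head receives this full matrix

head_outputs = []
for h in range(num_heads):
    # Each head has its own projections; these weights are what make heads differ.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    q, k, v = x @ W_q, x @ W_k, x @ W_v        # shapes: (seq_len, d_head)
    attn = softmax(q @ k.T / np.sqrt(d_head))  # (seq_len, seq_len), one pattern per head
    head_outputs.append(attn @ v)              # (seq_len, d_head)

out = np.concatenate(head_outputs, axis=-1)    # (seq_len, d_model) again
print(out.shape)  # (5, 8)
```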
@krishanudasbaksi9530 · 4 years ago
once again, very nice explanation
@engenglish610 · 3 years ago
The best explanation of all.
@aayushjariwala6256 · 2 years ago
loved this series
@techrelieve1716 · 3 years ago
Really appreciate you making the explanation so easy to understand. Keep up the great work.
@asmersoy4111 · 2 years ago
Incredible Explanation! Thank you so much!
@naserpiltan1539 · 3 years ago
The best explanation I've ever seen. So clear and helpful. Thx!
@gausseinstein · 3 years ago
Wonderful explanation. Many thanks!!
@evandavid940 · 1 year ago
This series is just incredible!!
@zhangmanren · 3 years ago
deserves more views
@heets1971 · 1 year ago
I don't understand why we need multiple attention heads. Also, how are the weights for keys, queries, and values not trained to the same values across the multiple heads? Is it because they are trained differently, or are they initialized differently?
@zihanchen4312 · 4 years ago
Perfect explanation man! Thank you for your efforts, and can't wait to see your future content! : D
@TheVinkle · 4 years ago
Good and intuitive explanations, thanks.
@aj-tg · 4 years ago
Nicely done mate!
@ashokkumarj594 · 4 years ago
Thank you for your great job😍😍😍
@donkeyknight1453 · 2 years ago
you explain things better than my professor lol
@mingyanghe7029 · 3 years ago
thank you, the best explanation ever
@AnkitBindal97 · 4 years ago
Thank you!
@deepanshusingh4140 · 3 years ago
On point!! Super
@AdrianYang · 2 years ago
My understanding and thoughts: a fully connected layer is too flexible, with all its weights free from restriction, which makes it prone to overfitting; an RNN is too strict, with all the items sharing the same weights (only raised to different powers), which makes it underfit (gradient explosion/vanishing leads to poor learning ability); LSTM and GRU address this by adding more weights in the form of gates, so that more memory can be kept; attention continues to relax the tight weight restriction of the RNN while keeping the weights less free than in a fully connected layer, since the key weights in attention must come from the inputs to calculate the scores. The weights in fully connected layers and RNNs are more about learning position information, while the weights in attention are about learning the embeddings.
@ahmed22502145 · 3 years ago
Great job! You really helped me.
@sumowll8903 · 2 years ago
Great explanation!!! Thank you!!!
@shenhaochong · 4 years ago
I see that the last step of multi-head attention is to concatenate all the output vectors from the MatMul into one dense vector, but does that mean the input vector will grow its dimension by a factor of the number of heads each time it passes through a multi-head attention block?
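On the dimension question: in the standard formulation (the video may present it slightly differently), the concatenation does not grow the representation, because each head outputs only d_model/h dimensions; the concatenated vector is back at d_model, and a final output matrix maps d_model to d_model, so blocks can be stacked. A shape-only NumPy sketch with illustrative sizes:

```python
import numpy as np

seq_len, d_model, num_heads = 5, 512, 8
d_head = d_model // num_heads                    # 64 dimensions per head

heads = [np.zeros((seq_len, d_head)) for _ in range(num_heads)]  # per-head outputs
concat = np.concatenate(heads, axis=-1)          # (5, 512): back to d_model, no growth
W_o = np.zeros((d_model, d_model))               # final linear layer of the block
out = concat @ W_o                               # (5, 512): same shape as the input
print(concat.shape, out.shape)
```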
@ax5344 · 3 years ago
@4:07 Suppose we use multiple blocks instead of one. Since the transformation inside these blocks is the same (q, k, v), are they then just the same three matrices with different initial attention values?
@karimabdultankian28 · 4 years ago
Amazing!
@masoudparpanchi505 · 4 years ago
thanks
@ansharora3248 · 3 years ago
Wow!
@pranjalchaubey · 4 years ago
Super!
@harryshuman9637 · 2 years ago
Quality stuff
@karannchew2534 · 10 months ago
Multiple layers/stacks/blocks, and each layer/stack/block is multi-head.
@alexvass · 2 years ago
With multi-head attention, does that mean there are multiple M_k matrices (multiple matrices for the key weights)? And likewise multiple M_q and M_v matrices?
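In the usual setup, yes: each head has its own key, query, and value weight matrices (many implementations store them stacked in one larger tensor and reshape, which is equivalent). A minimal sketch of the parameter shapes, with illustrative sizes:

```python
import numpy as np

d_model, num_heads = 512, 8
d_head = d_model // num_heads

# One M_q, M_k, M_v per head ...
per_head = {f"head_{h}": {name: np.zeros((d_model, d_head)) for name in ("M_q", "M_k", "M_v")}
            for h in range(num_heads)}

# ... which has the same parameter count as one big matrix per projection.
M_k_stacked = np.zeros((d_model, num_heads * d_head))   # (512, 512)
print(len(per_head), per_head["head_0"]["M_k"].shape, M_k_stacked.shape)
```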
@xflory26x · 1 year ago
What do you mean by "layers" in the interactive visualization, in reference to your previous diagrams, and how are they different from the different colors (heads)?
@bright1402 · 4 years ago
Thank you for your video! One thing I am not clear on: when you perform the split before feeding into the multi-head attention to get K, Q, and V, how is the data V split? For instance, if V is an n×d matrix, where n is the size of the vocabulary and d is the dimension of the word vector, does the split work on the n dimension or the d dimension? Thank you!
@RasaHQ · 4 years ago
There's not so much a split happening. It's more that there are multiple layers being attached. Each arrow represents a matrix multiplication, not a "cut" or a "split". Does this help?
@bright1402 · 4 years ago
@RasaHQ Thank you for your reply! Yeah, it is clear now.
@gordonlim2322 · 3 years ago
@RasaHQ I quote from the paper: "In this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality." It seems there is actually a split going on. To relate this to the variables you've used: there shouldn't be h full-dimensional vectors (v1 ... vn); each of the h vectors should have a reduced dimensionality of n/h.
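Both views in this thread are consistent: giving each head its own (d_model × d_model/h) projection of the full input is the same computation as doing one big (d_model × d_model) projection and then splitting the result into h chunks. A small NumPy check of that equivalence (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
d_head = d_model // num_heads

x = rng.normal(size=(seq_len, d_model))
W_q_big = rng.normal(size=(d_model, d_model))

# View 1: one big projection, then split the result into h chunks of size d_head.
q_split = np.split(x @ W_q_big, num_heads, axis=-1)

# View 2: h separate small projections of the *full* input vector.
q_per_head = [x @ W_q_big[:, h * d_head:(h + 1) * d_head] for h in range(num_heads)]

print(all(np.allclose(a, b) for a, b in zip(q_split, q_per_head)))  # True
```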
@karannchew2534 · 11 months ago
Why do they first need to pass through linear layers?
@daniquinteiro6152 · 4 months ago
Awesome
@emineaysesunarstudent1767 · 3 years ago
Thank you so much for the video! I am stuck on one point: how do we ensure that these different heads learn different attentions?
@RasaHQ · 3 years ago
(Vincent here) While there is no hard guarantee, you could consider the following thought experiment: suppose we allow for 2 heads. Under what circumstances would the weights for both attention heads be exactly the same? If there's an improvement to be made, the gradient signal should cause the weights to differ. This depends a lot on the labels, but that's where the learning will be.
@zeinramadan · 2 years ago
Random initialization of the weights?
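A tiny illustration of that point (sizes and names are illustrative, not from the video): two heads with identical structure but independently initialized weights already produce different attention patterns before any training, and the gradient signal has no reason to pull them back to being identical afterwards.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pattern(x, seed, d_head=4):
    # Same architecture for every head, different random initialization per head.
    rng = np.random.default_rng(seed)
    W_q = rng.normal(size=(x.shape[1], d_head))
    W_k = rng.normal(size=(x.shape[1], d_head))
    return softmax((x @ W_q) @ (x @ W_k).T / np.sqrt(d_head))

x = np.random.default_rng(123).normal(size=(5, 8))
print(np.allclose(attention_pattern(x, seed=0), attention_pattern(x, seed=1)))  # False
```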
@6511497115104 · 1 year ago
Don't we cut the original V vector into h slices and feed each slice to a different attention head?
@6511497115104 · 1 year ago
ChatGPT: Yes, in the multi-head attention mechanism, the original input embedding vector is sliced into multiple sub-vectors, or "heads", which are then processed in parallel to compute multiple sets of attention weights. The number of heads is a hyperparameter that is typically set to a small value, such as 8 or 16.

Each head in the multi-head attention mechanism has its own set of learned weight matrices, which are used to project the input embedding vector into a query, key, and value vector for that particular head. These projected vectors are then used to compute the attention weights and the weighted sum of the values, which are concatenated across all heads and passed through a final linear layer to produce the final output representation.

By splitting the input embedding vector into multiple heads, the multi-head attention mechanism is able to capture different aspects of the input representation, allowing the model to learn more nuanced and fine-grained relationships between the input tokens. Additionally, the parallel processing of the multiple heads can lead to faster training and inference times.
@asraajalilsaeed7435 · 1 year ago
Where can I find the code that displays multi-head attention as in the video, where you move the cursor over the tokens?
@sethjchandler · 3 years ago
Brilliant. Thanks!
@ecto111 · 1 month ago
Which app is used to create this presentation?
@abc-by1kb · 3 years ago
10:06 Do you mean "Named Entity Recognition"? Great video btw. Thank you so much!!!
@RasaHQ · 3 years ago
(Vincent here) d0h! Yep, you're right!
@abc-by1kb · 3 years ago
@RasaHQ Really want to say thank you so much for the video! Never thought someone could explain self-attention and transformers in such a logical, incremental, and intuitive way. Great work!
@abc-by1kb · 3 years ago
@RasaHQ As a CS student, I think your videos should definitely come up on top when people search for transformers.
@TamilSelvanMurugesan-mw2bv · 2 years ago
Neat...
@search_is_mouse · 2 years ago
I love you...
@kartikpodugu · 10 months ago
I think everybody should play with the visualization tool to understand MHA better.
@usertempeuqwer7576 · 4 years ago
reupload X)
@RasaHQ · 4 years ago
Yeah, there was an issue with the previous version that we only discovered after hitting the "live" button, so we re-rendered. This one should be fine.