It's obvious that the host of this channel is a teacher, since he knows how to teach.
@InterpretingInterpretability 2 months ago
Is this from a specific visualization program? Could you share a repo for it, please?
@wilfredomartel7781 4 months ago
🎉
@kenseno5108 4 months ago
I am a newbie, so this question probably doesn't make sense to you guys. What I am not able to understand is why we need the Q, K, V matrices. The concept of Q, K, V is clear. But the word vectors are learnable, right? Instead of new Q, K, V matrices, what if we just learn the word vectors? Of course the network would be very limited then. Is the only reason for adding the Q, K, V matrices to get more depth, or is there some other reason? Please help me with this query.
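A minimal numpy sketch of the difference (the 3-word, 4-dimensional setup and the random matrices are made up for illustration, not taken from the video): the raw word vectors already give similarity scores on their own, but the Q, K, V projections add parameters that are learned specifically for the attention step, so the model can learn what to compare (Q, K) separately from what to pass along (V).

import numpy as np

rng = np.random.default_rng(0)
d = 4                          # toy embedding size
X = rng.normal(size=(3, d))    # three word vectors, one per row

# Without Q/K/V: scores come straight from the embeddings, so the
# attention step itself has nothing of its own to learn.
scores_plain = X @ X.T

# With Q/K/V: three learned projections give each word separate
# "asking" (Q), "matching" (K) and "content" (V) representations.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                                         # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
out = weights @ V              # contextualized vectors, same shape as X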
@benwillslee2713 5 months ago
Thanks bro, it's the BEST explanation of attention I have seen so far (I have to say that I have seen many others). Looking forward to the other parts, even though it's been almost 4 years since this Part 1!
@abhijitambhore5858 5 months ago
Thanks bro please do the second part
@liberate7604 5 months ago
Best explanation ever🥇🥇🥇🥇
@rishishrestham8681 6 months ago
please do the second part🥺, this was a great explanation, very intuitive.
@suriyars4487 6 months ago
Bro, at 26:39 there is a mistake in your slide. I think you need to remove the "Weigh original words" block, since that action is already being done in the "Matrix Multiplication" step between the weight matrix and the value matrix.
@suriyars4487 6 months ago
More precisely, it should be the "softmax" block!!
@suriyars4487 7 months ago
Bro dropped a masterpiece and disappeared !
@VishalKumar-qu4he 7 months ago
Great video... it explains why we need attention, step by step.
@ffilez 7 months ago
I like this msa vid
@JesusSavedMe99 7 months ago
Msa playlist 💀
@JeunEbat 7 months ago
so why is this on msa’s playlist?
@kameshsingh7867 8 months ago
Very helpful, thanks.
@rollingstone1784 8 months ago
@arkaung, @ark_aung: there is an error at 13:00. s_1 is a row vector, so it should be written in bold (just like v_1); s_1 represents the first row in the histogram. The components s_11, ..., s_1n are scalars (normal font), the small boxes in the histogram.
14:00: again, s_1 is a vector.
15:00: the weights w_i are vectors as well.
17:45: the y_i are vectors.
20:50: maybe matrix notation would help here. V, the set of all vectors v_i, is a matrix of dimension 3x50. The vector v_2 has dimension 1x50. The matrix multiplication v_2 * V^T gives (1x50)*(50x3) = s_2 (1x3). Normalization gives w_2 (1x3). The matrix multiplication y_2 = w_2 * V gives (1x3)*(3x50) = (1x50).
Remark: it should be noted that the last step is a "right multiplication" (matrix x vector), so in matrix notation it is V^T * w_2^T, resulting in a vector y_2^T of dimension (50x3)*(3x1) = 50x1. By transposing this vector we get y_2 (1x50).
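For anyone who wants to check those shapes numerically, here is a small numpy sketch of the same unparameterized attention step (random numbers stand in for the actual 50-dimensional embeddings):

import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 50))         # the three 50-dimensional word vectors, stacked as rows
v2 = V[1:2]                          # the second word, kept as a 1x50 row vector

s2 = v2 @ V.T                        # (1x50)(50x3) -> 1x3 raw similarity scores
w2 = np.exp(s2) / np.exp(s2).sum()   # normalize the scores (softmax), still 1x3
y2 = w2 @ V                          # (1x3)(3x50) -> 1x50 contextualized vector
print(s2.shape, w2.shape, y2.shape)  # (1, 3) (1, 3) (1, 50)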
@jynxzoid 9 months ago
What is this doing here in the MSA Top Playlist?
@ahmadabousetta 9 months ago
Wonderful. I enjoyed your explanation. Thank you.
@paveltolmachev1898 9 months ago
By the way, I have watched about 10 videos on attention, and this is the best one so far. Trust me.
@paveltolmachev1898 9 months ago
I don't really understand why we need three different matrices for query, key and value. Why not have just one? My reasoning is as follows. Let's say we have word embeddings {v_i}. To adjust for context, we need to "nudge" these vectors around so that they incorporate the contextual information. As mentioned in the video, we could just use the embeddings {v_i} themselves for computing the similarity scores, but then there would be no learnable parameters. However, one can simply introduce a matrix M that acts as a feature extractor, f_i = M v_i, extracting useful features from the word embedding. Then, to compute the contextual similarity scores, we can just use the vectors f_i directly: s_ij = (f_i, f_j). There seems to be absolutely no need for any other matrices. What am I missing here?
Instead, we have three matrices which essentially do this: Mq extracts query information {q_i}, Mk extracts key information, giving us a set {k_i}, the similarities are computed as s_ij = (q_i, k_j) and then softmaxed, and the values {u_i} are computed as Mv v_i. We update our original embeddings by replacing v_i with h_i = sum_j s_ij u_j. For now I see no reason for this complexity, and it makes me paranoid.
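One concrete thing the single-matrix version cannot do (a small numpy illustration, not something claimed in the video): with one shared M, the score s_ij = (M v_i)·(M v_j) = v_i^T M^T M v_j is always symmetric, s_ij = s_ji, whereas separate query and key matrices let "how much word i attends to word j" differ from "how much word j attends to word i". The separate value matrix similarly decouples what a word passes on to others from the representation used for matching.

import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))               # five toy word embeddings, one per row

# One shared feature extractor M: the score matrix is forced to be symmetric.
M = rng.normal(size=(d, d))
F = X @ M
S_single = F @ F.T
print(np.allclose(S_single, S_single.T))  # True, always

# Separate query/key projections: the score matrix can be asymmetric.
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
S_qk = (X @ Wq) @ (X @ Wk).T
print(np.allclose(S_qk, S_qk.T))          # False for generic random weights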
@fir3drag0n1984 9 months ago
why only a single part.. :/
@karannchew2534 9 months ago
Help please! I'm lost from 24:17, where the 50x50 matrices are inserted. Why are the matrices inserted???
@Jess.1.4.12 9 months ago
Why is this on msa playlist😭
@MinsungsLittleWorld 10 months ago
Me - scrolls through MSA's playlist
Also me - finds this video in the playlist, which is not even related to MSA
@ZarifaahmedNoha-ey9kh 7 months ago
Bro, me too, I was just scrolling and found this.
@AliahDonald 10 months ago
im here from msa…
@siddharth-gandhi 10 months ago
The BEST source of information I've come across on the internet about the intuition behind the Q,K and V stuff. PLEASE do part 2! You are an amazing teacher!
@ucsdvc 11 months ago
This is the most intuitive but non-hand-wavy explanation of self-attention mechanism I’ve seen! Thanks so much!
@rdgopal 11 months ago
This is by far the most intuitive explanation that I have come across. Great job!
@DouwedeJong 11 months ago
Thanks for making this video. When they say BERT has 24 layers, 1024-dimensional hidden representations, 16 attention heads in each self-attention module, and 340M parameters, what do those numbers correspond to in terms of this video?
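For what it's worth, those numbers hang together with a rough back-of-the-envelope count. This is only a sketch and assumes the standard BERT-large configuration (30,522-token vocabulary, 512 positions, 4096-dimensional feed-forward layers), none of which is covered in the video; biases, LayerNorm and the pooler are omitted.

hidden, layers, heads, vocab, ffn, max_pos = 1024, 24, 16, 30522, 4096, 512

print(hidden // heads)                        # 64: each of the 16 heads works on a 64-dim slice of the 1024-dim vector
embeddings = (vocab + max_pos + 2) * hidden   # token + position + segment embeddings
attention = 4 * hidden * hidden               # Wq, Wk, Wv plus the output projection, per layer
feed_forward = 2 * hidden * ffn               # the two dense layers of the FFN block, per layer
total = embeddings + layers * (attention + feed_forward)
print(round(total / 1e6))                     # ~334, close to the quoted 340M (the rest sits in the omitted pieces)

Roughly speaking, the 1024-dimensional hidden representations play the same role as the 50-dimensional word vectors in the video's example, and each of the 24 layers repeats a (multi-head) self-attention block like the one shown here.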
@RahimPasban 11 months ago
this is one of the greatest videos that I have ever watched about Transformers, thank you!
@unclecode 11 months ago
Such a brilliant explanation. Part 1 came out 3 years ago, and nothing after that! Sad.
@osamutsuchiyatokyo a year ago
I believe it is at least one of the clearest presentations of multi-head attention.
@Jaybearno a year ago
Sir, you are an excellent instructor. Thank you for making this.
@atulshuklaa a year ago
You are awesome man 🎉🎉
@Boredreturn a year ago
why is this in msa top videos
@MrMacaroonable a year ago
Can you elaborate on the part around 24:17 where you introduced the Q, K, V? You mentioned "..... we want to preserve the dimension...". What's the intuition for that?
@darylgraf9460 a year ago
I'd just like to add my voice to the chorus of praise for your teaching ability. Thank you for offering this intuition for the scaled dot product attention architecture. It is very helpful. I hope that you'll have the time and inclination to continue providing intuition for other aspects of LLM architectures. All the best to you and yours!
@JonathanUllrich a year ago
This video solved a month-long understanding problem I had with attention. Thank you so much for this educational and didactic masterpiece!
@izkaoix a year ago
why tf is this in msas playlist
@saimasideeq7254 a year ago
really helpful explanation
@youtube1o24 a year ago
Very decent work. Please make a part 2 and part 3 of this series.
@sibyjoseplathottam4828 a year ago
This is undoubtedly one of the best and most intuitive explanations of the Self-attention mechanism. Thank you very much!
@neilharvyn1417 a year ago
Transformers are robots that change into cars. The Autobots are the heroes, the villains are the Decepticons, and the animals that change into robots are the Maximals; the Predacons and Terrorcons are villains.
@dr.p.m.ashokkumar5344 a year ago
Very nice explanation... congrats, and continue your good work.
@benjaminhobbs2923 a year ago
amazing thank you
@koladearisekola3650 a year ago
This is the best explanation of the attention mechanism I have seen on the internet. Great Job !
@pavan1006 a year ago
I owe you a lot for such a clear explanation of the math involved.