The slides are posted here: cs.uwaterloo.ca/~ppoupart/teaching/cs480-spring19/schedule.html
@tarunluthrabk · 4 years ago
Hello Professor, your explanations are amazing. Kindly pin this comment or add it to the description so that it is visible to everyone.
@majidheidarystories · 2 years ago
Dear Pascal, I was wondering if you have any presentation describing the article titled "Neural Machine Translation by Jointly Learning to Align and Translate".
@SuperOnlyP · 4 years ago
Finally, someone who can explain simply what queries, keys, and values are in the transformer model. Thank you, Sir!!!
@mathisve · 4 years ago
Yeah, I don't understand why nobody else goes over this seemingly pretty important detail
@stackoverflow8260 · 4 years ago
Wow, I was going to ask why he didn't explain or give an example of query, key, and value for a simple language translation or modelling example. The machine learning community is not very good at conveying its ideas: when you can't put things in rigorous mathematics, at least use lots of pictures and examples at every possible step.
@andrii5054 · 3 years ago
I can also recommend this explanation: kzbin.info/www/bejne/o37EY4Ojjq-fedE It has helped me a lot
@SuperOnlyP · 3 years ago
@@andrii5054 The video really simplifies the concept. Thanks for sharing!
This is the single best video on "Attention Is All You Need", attention, transformers, etc. on the Internet. It's as simple as that. Thanks, Dr Poupart.
@bleacherz7503 · a year ago
Why does a dot product correlate to attention?
@drdr3496 · a year ago
@@bleacherz7503 a dot product between two vectors shows how similar they are
@seldan6698 · a year ago
@@drdr3496 Nice. Can you explain the whole query, key, and value process for an example like "the cat sat on the mat"? What are the query, keys, and values for this sentence?
@robn2497 · 10 months ago
ty
@manikanth2166 · 5 months ago
For those who find it difficult to digest the picture at 29:00, here's a short explanation, taking a 4-word sentence as an example:
• k_i represents each word in some n-dimensional space (let's say 512), with i = 1...4 (since there are 4 words). So each k_i is a vector.
• q represents a single word in that 512-dimensional space. So q is also a vector.
• s_i is the similarity operation (dot product) of q with each k_i: a matmul of (1x512, 512x1) = 1x1. So each s_i is a scalar.
• a_i is just softmax(s_i), so that the a_i sum to 1. So each a_i is a scalar.
• By now, each a_i holds a normalized number representing how similar the query is to each word (key) k_i in the sentence. The influence of q on each k_i is attained at this critical step. But the outputs are just scalars: they capture the influence as a weight, without encoding the word itself (as a 512-dimensional vector).
• To encode all the influences of the words for that query, a final linear combination Sum(a_i * v_i) produces the "context" vector, a single 1x512 output. "Context" because it explains how the query fits in the context of the values.
Refer to the image in this wiki section for more clarification: en.wikipedia.org/wiki/Attention_(machine_learning)#Core_calculations
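The steps above can be sketched directly in numpy. This is a toy illustration only (8 dimensions instead of 512, random numbers standing in for real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                             # embedding size (512 in the lecture, 8 here)
K = rng.normal(size=(4, d))       # one key vector per word, k_1..k_4
V = rng.normal(size=(4, d))       # one value vector per word
q = rng.normal(size=(d,))         # a single query vector

s = K @ q                         # similarities: 4 scalars, one dot product per key
a = np.exp(s) / np.exp(s).sum()   # softmax -> weights that sum to 1
context = a @ V                   # weighted sum of values: one d-dim "context" vector

print(context.shape)              # (8,)
```

Each `a[i]` is the scalar weight from the explanation above, and `context` is the final 1x512 (here 1x8) vector.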
@sudhirghandikota1382 · 4 years ago
Thank you very much Dr. Poupart. This is the best explanation of transformers I have come across on the internet
@JMRC · 4 years ago
Thank you to the person asking the question at 28:49! The softmax gave it away, but I wasn't sure.
@graceln2480 · 3 years ago
One of the best explanations of attention & transformers on YouTube. Most of the other videos are junk, with authors pretending to understand the concepts and just adding to the YouTube clutter.
@GotUpLateWithMoon · 4 years ago
This is the best lecture on the attention mechanism I could find! Thank you Dr. Poupart! Finally all the details make sense to me.
@TylerMosaic · 3 years ago
Wow! Love the way he answers that great question at around 50:52: "why don't we implement the mask with a Hadamard product outside of the softmax?". Brilliant prof.
@momusi.9819 · 4 years ago
Thank you very much, this was by far the best explanation of Transformers that I found online!
@dansplain2393 · 2 years ago
I was about to type the same.
@AI_ML_iQ · 2 years ago
In recent work on transformers, titled "Generalized Attention Mechanism and Relative Position for Transformer", it is shown that different matrices for query and key are not required for the attention mechanism, thus reducing the number of parameters to be trained for the Transformers of GPT, other language models, and Transformers for images/videos.
@moustafa_shomer · 2 years ago
This is the best Transformer/Attention explanation ever. Thank you.
@tagrikli · 4 years ago
This video just cured my depression.
@judgeomega · 4 years ago
Don't worry, I'm sure the next visit to a public internet forum will once again obliterate your hope in humanity.
@100vivasvan · 3 years ago
haha same here
@dilettante9576 · 2 years ago
Cured my ADHD
@Mrduralexx · 2 years ago
This video gave me depression…
@UmerBashir · a year ago
@@Mrduralexx Yeah, it's a different level of anxiety that it instills.
@insoucyant · 6 months ago
Best video on attention that I have come across
@weichen1 · 4 years ago
I have not been able to find a better video than this one explaining attention and transformers on the internet.
@MrFreemindonly · 4 years ago
I totally agree. He is a genius!
@Siva-Kumar-D · 3 years ago
This is the best video on the Internet about Transformer networks.
@ghostoftsushimaps4150 · a year ago
Brother, love from India. I will watch this lecture at leisure.
@richard126wfr · 2 years ago
The best explanation of the attention mechanism I found on YouTube is the pizza-making analogy by Alfredo Canziani.
@xhulioxhelilai9346 · 9 months ago
Thank you for the very comprehensive and understandable course. Being in 2024, I can say that I can understand this course even better and more easily using GPT-4.
@utkarshgupta7364 · 4 years ago
Most awesome video on transformers one could find on youtube
@aadeshingle7593 · a year ago
Thanks a lot, Professor Poupart; one of the best explanations of the maths behind transformers!
@benjamindeporte3806 · a year ago
I finally understood the Q, K, V in attention. Many thanks.
@autripat · 3 years ago
At 1:18:22, the professor refers to BERT and a "Decoder transformer that predicts a missing word". To me, BERT is a masked Encoder (not a decoder); after all, BERT stands for Bidirectional *Encoder* Representations from Transformers. It's minor (and doesn't detract from this great presentation), but can anyone comment?
@abdelrahmanhammad1020 · 3 years ago
Great lecture. And I believe you are correct; it seems there is a typo here. I was questioning the same!
@dennishuang3498 · 3 years ago
Really enjoyed your lecture, Professor Poupart! Very informative, and it simplified many complicated concepts. Thank you very much!
@cwtan501 · 4 years ago
By far the best I have seen to explain multiheaded attention
@benjaminw2194 · 3 years ago
I'm a novice and have been praying for someone who discusses these papers. You're an answered prayer! Great lecturer.
@vihaanrajput8082 · 3 years ago
His tutorial videos are my favorite pastime, especially at night. Hail to Prof. Poupart!
@greyreynyn · 4 years ago
41:14 Question, on the output side, why isn't there an additional feed-forward layer between the masked self attention in the output and the attention to the input? And maybe more broadly what are those feed forward units doing?
@greyreynyn · 4 years ago
45:50 - For the multiple linear transformations, are we applying the same linear transform to each set of Q/K/V in a "head" ? Or does each Q/K/V get its own unique linear transform applied?
@knoxvoxx · 3 years ago
A unique linear transform each time, I guess. (In the original paper, under section 3.2.2, they mention "h times with different learned linear projections to dk, dk and dv respectively".) If we repeat scaled dot-product attention 3 times, then we will have a total of 9 linear projections.
@ryanwhite7401 · 3 years ago
They each get their own learned parameters.
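A minimal numpy sketch of what "their own learned parameters" means (toy shapes, random matrices standing in for learned weights): each of the h heads carries its own Wq, Wk, Wv, so with h = 4 heads there are 12 distinct projection matrices, matching the counting argument above:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, h = 16, 4
d_k = d_model // h                      # per-head width: 4

X = rng.normal(size=(5, d_model))       # 5 token embeddings

# each head gets its OWN projections -- nothing is shared across heads
heads = [
    {name: rng.normal(size=(d_model, d_k)) for name in ("Wq", "Wk", "Wv")}
    for _ in range(h)
]

outputs = []
for p in heads:
    Q, K, V = X @ p["Wq"], X @ p["Wk"], X @ p["Wv"]
    s = Q @ K.T / np.sqrt(d_k)                            # scaled dot products
    a = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # row-wise softmax
    outputs.append(a @ V)                                 # (5, d_k) per head

print(len(heads) * 3)   # 12 distinct projection matrices
```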
@evgenysavelev837 · 11 months ago
There is a better answer to the question asked at 1:04:30, regarding the positional encodings polluting the word embeddings. If the positional embedding vectors live in a subspace of their own, then adding the positional encodings will never obfuscate the information encapsulated by the word embeddings. Since the word embeddings are usually learned during network training, the network quickly learns to confine them to a subspace that is orthogonal, or linearly independent, to the positional encodings. So, TLDR, this is not a problem. In fact, it is advantageous, since information about position is spread evenly throughout all vector components rather than being concentrated in a few coefficients at the head or tail. This avoids the problem of network-node specialization, where some neurons become solely dedicated to dealing with positional information.
@samson6707 · 9 months ago
46:11 I don't understand how there can be a concatenation of the outputs followed by a linear combination. In my mind it doesn't make sense to do both: either the outputs are concatenated or they are added up in a linear combination, but both...?
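For what it's worth, the two steps coexist: the head outputs are first concatenated along the feature axis, and then one learned output projection (W_O in the paper) mixes them, which is the "linear" part. A toy numpy sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_k, h, d_model = 5, 4, 4, 16

# stand-ins for the h per-head attention outputs, each (5, 4)
head_outputs = [rng.normal(size=(n_tokens, d_k)) for _ in range(h)]

# step 1: concatenate along the feature axis -> (5, 16)
concat = np.concatenate(head_outputs, axis=1)

# step 2: one learned output projection mixes the heads together
W_O = rng.normal(size=(h * d_k, d_model))
out = concat @ W_O                      # (5, 16): back to model width

print(concat.shape, out.shape)
```

So nothing is "added up" across heads directly; the addition happens implicitly inside the matrix multiply by W_O.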
@parmidagranfar4861 · 3 years ago
Finally understood what is going on. Most videos are so simplistic and skip the math. I liked this one.
@Hotheaddragon · 3 years ago
You are a blessing. I finally understood a very important concept.
@yd42330 · 4 years ago
Question about positional encoding: if we sum the word embedding (WE) with the positional encoding (PE), how does the model tell the difference between WE = 0.5, PE = 0.2 and WE = 0.4, PE = 0.3? (Different words at different positions can yield the same value.) Why not keep the PE separate from the WE?
@sandipbnvnhjv · a year ago
I asked chatGPT for the best video on Attention and it brought me here
@janasandeep · 2 years ago
23:09 - don't queries and keys both encode questions, with the values being the answers? 35:53 - why are the values going to be the same as the keys? 36:02 - "attention mechanism merges information from _pairs_ of the words". Attention merges information from one word with _all_ the other words, doesn't it? For each different word, the values corresponding to all the words are weighted differently and added up.
@soumyajitganguly2593 · 2 years ago
I am confused about this "pairs of words" too. Let me assume that every word is represented by a linear combination of all other words. Now what is the point of stacking N (N=6 in the original paper) of these attention layers? It would still be a linear combination, right?
@soumyajitganguly2593 · 2 years ago
Why are the values going to be the same as the keys? - I would guess the Prof. was referring to the same word/token but not the same representation. The representations come after multiplying by Wv and Wk, so they would be different.
@aileensengupta · 2 years ago
Big fan, big fan Sir!! Finally understood this!
@ritik84629 · 2 years ago
True
@mi9807 · a year ago
One of the best videos!
@alexanderblumin6659 · 3 years ago
Very interesting lecture. Something that is not totally clear around minute 46: the multiple heads are presented intuitively as three explicit filters, as in a CNN, producing three corresponding feature maps; but in the earlier part of the lecture it was said that attention blocks are stacked one after another, so that the first produces information from pairs (word i, word j) and the second produces pairs of those pairs, i.e., one block's output is the next block's input. What is the right way to understand this? At minute 46 the inputs to each of the linear transformations seem to be the same, whereas in the earlier part one block follows another and, intuitively, the pairs-of-pairs change the output size.
@justinkim2973 · 2 years ago
Best video to watch on the first day of 2023
@varungoel185 · 4 years ago
Around the 29:50 mark, he first mentions that the key vectors correspond to each output word, but the slide says input word. Could someone please clarify this?
@greyreynyn · 4 years ago
57:30 - I understand the normalization, but what's the intuition for adding? Does it just strengthen the signal from the input sequence?
@Victor-oc1ly · 3 years ago
I understand the additive part as incrementing the query itself with knowledge from the self-attention operation. You can think of it this way: the query (call it x) is a proper question, and the sub-layer (multi-head attention) output is the oracle's contribution to your inquiry (the sub-layer before the feed-forward step), which can then be used to refine your question (the query). If you look at the paper, it's really LayerNorm(x + Sublayer(x)) that the authors write for the Encoder and Decoder stacks.
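A toy numpy sketch of that LayerNorm(x + Sublayer(x)) pattern (tanh stands in for the real attention/feed-forward sub-layer, and layer norm's learned gain/bias are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    # stand-in for multi-head attention or the feed-forward block
    return np.tanh(x)

x = np.random.default_rng(3).normal(size=(5, 8))   # 5 tokens, 8 dims
out = layer_norm(x + sublayer(x))                  # residual add, then normalize
```

The `x +` is the residual (skip) connection: the sub-layer only has to learn a refinement of x, not a full replacement.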
@greyreynyn · 4 years ago
46:45 - Also, the output shape is the same as the input shape right? ie, the size of the input sequence?
@ephysics3801 · a year ago
Hi Sir, where can we find the lecture on the basic concepts of the attention mechanism?
@larryobrien · a year ago
Fantastic lecture, but I became confused reviewing minute 36, which says something like "When we do this in one block we essentially look at pairs of words... In the first block we look at pairs of words and the second block we're looking at pairs of pairs... We're combining more than just two words but [rather] groups of words that get larger and larger...." This would be similar to how we think of features in convolutional layers. But I don’t understand it here, since all of Q, K, and V are projections of the _whole_ input context, are they not? How do we get from that to the first attention block “essentially look[s] at pairs of words”?
@HeshamChannel · 2 years ago
Very good explanation. Thanks.
@giorgioregni2639 · 3 years ago
Best explanation of transformer I have ever seen, thank you Dr Poupart
@Vartazian360 · a year ago
Little did anyone know just how groundbreaking this foundation would be for Chat GPT / GPT 4.
@davidingham3409 · 4 months ago
Good motivation and understanding.
@fengxie4762 · 4 years ago
A great lecture! Highly recommended!
@aricircle8040 · a year ago
Thank you very much for sharing that great lecture! Shouldn't it be the attention vector instead of the value at 27:44?
@ryandruckman999 · 5 years ago
1:02:00 For the positional embedding, I am confused... It seems like the formula produces a scalar output (ie you put in a position in your sequence, you get out some sin/cos value). How does it become a vector?
@ryandruckman999 · 5 years ago
You could take the value for each dimension in your embedding and then you have a vector of values. But then it seems you'd be encoding the same information in every input, which doesn't seem helpful?
@venkateshdas5422 · 4 years ago
@@ryandruckman999 Actually, while doing that, the values fed into the cos and sin functions also take the scalar position of the word, so each positional embedding vector will have different values, which in turn capture the position/order of the sequence. Refer to this article: kazemnejad.com/blog/transformer_architecture_positional_encoding/
@LironCohenProfile · 4 years ago
At 31:20, shouldn't the last line be v_j (rather than h_j)?
@MustafaQamarudDin · 4 years ago
Thank you very much. It is very detailed and captures the intuition.
@syphiliticpangloss · 4 years ago
Could you explain what the model class looks like, then? What is the capacity? What is the "unconstrained" version with higher capacity? I want a full statistical-learning-theory-style discussion in all pedagogical treatments. I don't understand how people think they understand this. If your life depended on it, would you feel confident recommending one of these setups? What questions would you have to ask about the data, the model architecture, the observation process? You need worst-case bounds, model complexity, etc. I see none of that here.
@1Kapachow1 · 4 years ago
@@syphiliticpangloss Well, in deep learning the theory is far behind the engineering. When people say they understand this lecture, they don't mean worst-case bounds (which I strongly doubt anyone in the world knows how to calculate for this without adding so many relaxing assumptions, like convexity, that the result becomes basically irrelevant). They just mean that: 1. Engineering-wise, they understand how to build and use it. 2. They feel they grasp enough intuition about the purpose of each sub-block and why it was added. I don't think anyone truly "understands" even much simpler DL models than transformers, which perform at a far superior level to classical machine learning methods; for example, fully convolutional neural networks trained with the Adam optimizer, based on back-propagation, using BN.
@syphiliticpangloss · 4 years ago
@@1Kapachow1 So can someone explain what the transformer is doing, then, in a precise way? I would accept answers that reference probability distributions and predictive goals, or computational descriptions of components like NAND gates, etc. Also accepted would be anything related to eigenvalues, stability, curvature, etc. There are lots of people trying to talk about this stuff, for example arxiv.org/abs/2004.09280, or Vapnik. To be perfectly clear, I think today we tend to say there are only two things really: a) "data", i.e. observations, usually dozens to millions, from some process we take to be slowly changing at most, and b) predicates/models/architecture/constraints, i.e. "observations" usually numbering less than dozens, usually manually constructed (perhaps from other experiments and observation sets). For each of these we usually have some "narrative" about where it came from, a way of describing it to humans. The second thing is what I'm getting at. "Architecture" is a model constraint. If it is pulled from thin air without understanding the problem, the meta-problem, etc., it is quite likely that there are buried problems: secret reasons for architecture choices that are not being disclosed or realised. Getting better at describing these models/architectures/predicates is how we progress.
@user-or7ji5hv8y · 4 years ago
Does anyone know which NMT video has the earlier intro to attention the professor cites in this video? I couldn't find his video on neural machine translation.
@jelenajokic9184 · 2 years ago
The simplest explanation of attention, thanks a lot for sharing, great lectures🤗!
@orhan4876 · a year ago
thank you for being so thorough!
@evennot · 4 years ago
19:00 it's basically an exclusionary perceptron layer, isn't it? (also could be called fuzzy LUT) I'm sure it was used before for the attention emulation
@weiyaox6896 · 3 years ago
Best explanation
@SaNDRiTa1919 · a year ago
I'm sorry, but the database example was extremely easy to understand, and then it gets to the similarity and I don't get it at all. For a translation problem, for example, would it be the similarity between the word in English and the text in the other language? How is there going to be any similarity between the words in this case? Or is it between the embeddings? And if we're trying to predict the next word in a sentence, what is the similarity then?
@prof_shixo · 4 years ago
Thanks for the nice lecture. I am still confused about how transformer models can replace RNNs or LSTMs for general sequence learning. In some applications a sequence might be very long rather than just a sentence (which can be designed to have a fixed length), so how do we deal with this, especially if we need to keep the complete sequence with us, given that there is no recurrence? If the answer is to divide the sequence, then how do we link different chunks over time without recurrence or a carry-over (loops over time)?
@JAKKOtutorials · 4 years ago
Transformers are able to "query the recurrences": think of it as, instead of repeating the operation as in RNNs, you query a database of the possible values and their given inputs x times and check whether it matches the requirements. And because it's not a recurrence (a repetition), you can make multiple such queries, each a new operation, at the same time!! Each operation can be resolved without interference, creating new tokens, or pieces, which represent convergence points in the data universe you are travelling. It's a huge improvement, confirmed by the models shown at the end of the lecture. Hope this helps :)
@venkateshdas5422 · 4 years ago
As JAKKO mentioned, transformers use the attention mechanism in a very efficient manner. The sequence can be considerably longer than a sentence and the attention mechanism will still be able to capture the dependencies between words at different positions. This creates an efficient contextual representation of the sequence, better than the plain input embedding vector, and it is how the complete input sequence is captured by the model without recurrence. It is really a beautiful approach. (personal opinion)
@blasttrash · 10 months ago
Just curious: if transformer networks were known 4 years ago, why did ChatGPT take such a long time to be developed?
@opencvitk · a year ago
The explanation of K, V and Q is great. Unfortunately I lost him as soon as he started on multi-head. Must be that the single head I possess is empty :-)
@ibrahimkaibi4200 · 3 years ago
A very interesting explanation (wonderful)
@trungtranthanh4819 · 5 years ago
Dear Professor Pascal, is it possible to modify the encoder or decoder of the Transformer into a CNN block for computer vision?
@venkateshdas5422 · 4 years ago
arxiv.org/pdf/1802.05751.pdf Please refer to this paper.
@trungtranthanh4819 · 4 years ago
@@venkateshdas5422 Thank you so much. I tried replacing the Encoder's input with a ResNet's output for image captioning; I found it worked pretty well.
@venkateshdas5422 · 4 years ago
@@trungtranthanh4819 Oh cool. That's nice.
@trungtranthanh4819 · 4 years ago
@@venkateshdas5422 If you want to have a discussion about this topic, we can exchange information.
@venkateshdas5422 · 4 years ago
@@trungtranthanh4819 That is really nice. Please mail me at this address: venkatesh.murugadas@st.ovgu.de
@brandonleesantos9383 · 2 years ago
Truly fantastic wow
@alelom · 2 years ago
I actually think that his explanation of K, V and Q makes zero sense, like all the definitions of them I could find online. At 42:11 he says "Keys and Values are just like key-value pairs in a Database", which I believe is completely off track. He then says that you take a weighted combination of Ks and Vs to produce the output, which in a key-value-pair collection would give you random garbage. I still fail to understand what the relation between Ks and Vs is in the attention context, and especially what the Vs are.
@anatolicvs · 2 years ago
Dear Prof. Dr. Poupart, do we have a chance to get the presentation used in the '19 lecture, please?
@chakibchemso · a year ago
And that's how GPT was born, my fellas.
@hariaakashk6161 · 4 years ago
Great explanation, sir... Thank you! Please post more such lectures and I would be the first to look at them.
@leoj5891 · 3 years ago
Does the normalization layer matter at the inference stage?
@shifaspv2128 · a year ago
Thank you so much for the brainstorming.
@minhajulhoque2113 · 2 years ago
Great video!
@diffpizza · a year ago
Why not just use a complex number for the positional embedding? The imaginary part could keep track of the position, and all multiplication and gradient operations would still work.
@aponom84 · 4 years ago
Nice lecture! Thanks!
@mohamedabbashedjazi493 · 4 years ago
Softmax is computationally expensive; I wonder if it can be replaced with another function that produces probabilities, since softmax appears in many places across all the blocks of the transformer network.
@eaglesofmai · 3 years ago
48:27 I don't understand why a current word cannot depend on a future word (for example, when we are translating proverbs, we know it is the same proverb once all the words match).
@taravanova · 3 years ago
The output of the decoder at time t is used as an input to the decoder at time t+1. Using future words would be equivalent to using information that does not exist yet.
@muhammadnaufil5237 · a year ago
The output can only attend to the words that come before its position in the sequence, since the future words haven't been generated yet.
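In practice this causal mask is applied by setting the future positions' scores to -inf before the softmax (which is also what the professor's answer at 50:52 is about): exp(-inf) = 0, so future words get exactly zero weight and each row still sums to 1. A toy numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
scores = rng.normal(size=(n, n))        # raw q.k scores for a 4-word output

# causal mask: position t may only attend to positions <= t
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
scores = np.where(mask, -np.inf, scores)

a = np.exp(scores)
a = a / a.sum(axis=1, keepdims=True)    # softmax, row by row

print(a[0])                             # first word attends only to itself
```

Masking with a Hadamard product after the softmax would instead leave rows that no longer sum to 1, which is why the mask goes inside, before normalization.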
@jinyang4796 · 4 years ago
Thank you for the clear explanation and well-illustrated examples!
@ryandruckman999 · 5 years ago
1:22:19 Two things: - What was that final model you described? Excelnet? I can't find it online. - AlphaStar still uses an LSTM at its core! I wonder if you could model gameplay purely with frames, input, and attention.
@aliamiri4524 · 5 years ago
It's XLNet. LSTMs are good, but they need a huge amount of time and data to train. Recently I built a chatbot in about 1 hour with a transformer, but training the same thing on the same hardware with an LSTM would take much more time (100-500 hours, for example). Besides, transformers are part of modern deep learning and it's trivial to use them to train state-of-the-art models.
@zohaibramzan6381 · 4 years ago
@@aliamiri4524 I have one NLP problem. Can I discuss a suitable approach to solving it with you?
@aliamiri4524 · 4 years ago
@@zohaibramzan6381 Sure, I'll answer if I know.
@zohaibramzan6381 · 4 years ago
@@aliamiri4524 Kindly share a link to your Facebook profile.
@aliamiri4524 · 4 years ago
@@zohaibramzan6381 facebook.com/ali.galactic.1
@cedricmanouan2333 · 4 years ago
Very interesting and useful. Thanks, Sir.
@syedhasany1809 · 5 years ago
This was a great lecture, thank you.
@gudepuvenkateswarlu5648 · 3 years ago
Excellent session... Thank you, professor.
@goldencircle4331 · a year ago
Huge thanks for putting this online.
@kungchun9461 · 4 years ago
This year should be the "transformer year", as there has been a breakthrough in the domain of CV.
@firstnamelastname3106 · 4 years ago
Thank you, my man. You saved me.
@compmeist · a year ago
Perhaps the reason we can't concatenate positional information is because we are trying to share that information among the dimensions of the word vector
@XiaosChannel · 3 years ago
9:17 It feels like the instructor is repeating the same thing twice: long dependencies are why you have vanishing/exploding gradients, and having many steps is a result of not being able to parallelize over a sentence.
@faatemehch96 · 3 years ago
thank you, the video is really useful. 👍🏻👍🏻
@akashpb4183 · 3 years ago
Beautifully explained. Things seem clear to me now. Thanks a lot, sir!
@pred9990 · 4 years ago
Cool lecture!
@seminkwak · 3 years ago
Beautiful explanations
@markphillip9950 · 3 years ago
Great lecture.
@shavkat95 · 2 years ago
He sounds bored and depressed, but the content is high class.
@HarpreetKaur-qq8rx · 4 years ago
What is a residual connection? I am still not clear on it from the video. Also, how does the analogy of a database with query, key, and value matrices transform into a sequence of words?
@Roscovanul2 · 4 years ago
It just takes the initial input we give to the transformer and adds it to the output of the multi-head attention block. So you have an X as input; after the multi-head attention block you have F(X) as output, and you add X to that F(X). Residual blocks are also called skip connections, because we skip the block in front of us. They are good for maintaining the information of the initial input throughout the network.
@sienloonglee4238 · a year ago
very good video!😀
@444haluk · 3 years ago
I had heard queries, keys & values were primitive concepts and counter-intuitive, but I didn't know they were THIS primitive.
@nafeesahmad9083 · 3 years ago
Woohoo... Thank you so much.
@AnonTrash · a year ago
Beautiful.
@zhaoc033 · 4 years ago
In the similarity function l(q, ki), what are the dimensions of q and ki? Since l(q, ki) can be transpose(q) * ki, this tells me that q is a vector, but I thought q was a query. How can a query be a vector? Also, if ki is a vector, k must be a matrix, but how can a key k be a matrix?
@clray123 · 4 years ago
Both q (query) and ki (key) are vectors, passed as arguments to the similarity function. The similarity function allows finding the key which matches the query vector most closely (in order to "return" the value associated with just that key vector). "How can a query be a vector?" The query here is just like a (fuzzy) key lookup. "Query" is "the key you are looking up/searching for". It's confusing because in db context the word "query" refers to an expression in some query language (e.g. SELECT .. FROM table WHERE key=something), whereas in this context the query is just "something".
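To make the "fuzzy lookup" concrete, here is a toy numpy sketch contrasting an exact database-style lookup (argmax over key similarities, winner takes all) with attention's soft version (a softmax-weighted blend of all values); all numbers are made up:

```python
import numpy as np

keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])           # three key vectors
values = np.array([[10.0], [20.0], [30.0]])  # value stored under each key

q = np.array([0.9, 0.1])                # the query: "the key we're looking for"

s = keys @ q                            # similarity of q to each key
hard = values[np.argmax(s)]             # exact DB lookup: single best match

w = np.exp(s) / np.exp(s).sum()         # soft lookup: softmax weights
soft = w @ values                       # attention: blend of ALL the values
```

The hard lookup returns exactly one stored value; attention returns a mixture dominated by the closest keys, which is what makes the whole operation differentiable.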