The math behind Attention: Keys, Queries, and Values matrices

  253,265 views

Serrano.Academy

1 day ago

Comments: 337
@SerranoAcademy 1 year ago
Hello all! In the video I made a comment about how the Key and Query matrices capture low and high level properties of the text. After reading some of your comments, I've realized that this is not true (or at least there's no clear reason for it to be true), and probably something I misunderstood while reading in different places in the literature and threads. Apologies for the error, and thank you to all who pointed it out! I've removed that part of the video.
@tantzer6113 1 year ago
No worries. It might help to pin this comment to the top. Thanks a lot for the video.
@chrisw4562 8 months ago
Thanks for the note. That comment actually sounds very reasonable to me. If I understand this right, keys and queries help to determine the context.
@JTedam 10 months ago
I have watched more than 10 videos trying to wrap my head around the paper "Attention Is All You Need". This video is by far the best. I have been trying to assess why it is so effective at explaining such a complex concept, and why the concept is hard to understand in the first place. Serrano explains the concepts step by step, without making any assumptions, which helps a great deal. He also uses diagrams, showing animations along the way as he explains. As for the architecture, there are so many layers condensed into it; it has obviously evolved over the years, with multiple concepts interlaced into the attention mechanism, so it is important to break it down into its various pieces and take each one at a time - positional encoding, tokenization, embedding, feed forward, normalization, neural networks, the math behind it, vectors, query-key-values, etc. Each of these needs explaining, or perhaps a video of its own, before putting them all together. I am not quite there yet, but this has improved my understanding a great deal. Serrano, keep up your approach. I would like to see you cover other areas such as Transformers with human feedback, the new Qstar architecture, etc. You break it down so well.
@SerranoAcademy 10 months ago
Thank you for such a thorough analysis! I do enjoy making the videos a lot, so I'm glad you find them useful. And thank you for the suggestions! Definitely RLHF and QStar are topics I'm interested in, so hopefully soon there'll be videos of those!
@blahblahsaurus2458 7 months ago
Did you also try reading the original Attention is All you Need paper, and if so, what was your experience? Was there too much jargon and math to understand?
@visahonkanen7291 6 months ago
Agree, an excellent video.
@JTedam 6 months ago
@@blahblahsaurus2458 Too much jargon, obviously intended for those already familiar with the concepts. The diagram appears upside down and is not intuitive at all. Nobody has attempted to redraw the architecture diagram in the paper; it follows no particular convention at all.
@TomChenyangJI 4 days ago
Absolutely ❤
@Rish__01 1 year ago
This might be the best video on attention mechanisms on YouTube right now. I really liked the fact that you explained matrix multiplications with linear transformations. It brings a whole new level of understanding with respect to the embedding space. Thanks a lot!!
@SerranoAcademy 1 year ago
Thank you so much! I enjoy seeing things pictorially, especially matrices, and I'm glad that you do too!
@maethu 10 months ago
This is really great, thanks a lot!
@JosueHuaman-oz4fk 7 months ago
That is what many disseminators lack: explaining things with the mathematical foundations. I understand that it is difficult to do so. However, you did it, and in an amazing way. The way you explained the linear transformation was epic. Thank you.
@Aaron洪希仁 1 year ago
This is unequivocally the best introduction to Transformers and Attention Mechanisms on the entire internet. Luis Serrano has guided me all the way from Machine Learning to Deep Learning and onto Large Language Models, maximizing the entropy of my AI thinking, allowing for limitless possibilities.
@JonMasters 6 months ago
💯 agree. Everything else is utter BS by comparison. I’ve never tipped someone $10 for a video before this one ❤
@fcx1439 8 months ago
This is definitely the best-explained video on the attention model. The original paper sucks because there is no intuition at all, just terse words and crazy math equations, and I couldn't tell what they were doing.
@nbtble 1 month ago
Things don't suck just because you are not able to understand them. Without the original paper there would be no necessity for this video, as the content wouldn't "exist".
@computersciencelearningina7382 7 months ago
This is the best description of Keys, Query, and Values I have ever seen across the internet. Thank you.
@23232323rdurian 1 year ago
you explain very well Luis. Thank you. It's HARD to explain complicated topics in a way people can easily understand. You do it very well.
@SerranoAcademy 1 year ago
Thank you! :)
@mushfikurahmaan 1 month ago
Are you kidding me? Seriously? Lol, some YouTubers think that if they use fancy words they are good at teaching, but you're totally different, man. You've cleared up all of my confusion. Thanks, man.
@__redacted__ 10 months ago
I really like how you're using these concrete examples and combining them with visuals. These really help build an intuition on what's actually happening. It's definitely a lot easier for people to consume than struggling with reading academic papers, constantly looking things up, and feeling frustrated and unsure. Please keep creating content like this!
@joelegger2570 11 months ago
These are the best videos I have seen so far for understanding how Transformers / LLMs work. Thank you. I really like math, but it is good that you keep the math simple so that one doesn't lose the overview. You really have a talent for explaining complex things in a simple way. Greets from Switzerland.
@channel8048 1 year ago
Just the Keys and Queries section is worth the watch! I have been scratching my head on this for an entire month!
@SerranoAcademy 1 year ago
Thank you! :)
@olivergrau4660 1 month ago
I am so grateful that there are people like Luis Serrano who present incredibly complex material in a clear way. It must be an incredible amount of work. I noticed Mr. Serrano very positively on Udacity. Just by reading the original papers, it is unlikely that "normal people" would understand such material. Many, many thanks!
@WhatsAI 1 year ago
The best explanation I've seen so far! Really cool to see how much closer the field is getting to understanding those models instead of being so abstract thanks to people like you, Luis! :)
@ChujiOlinze 1 year ago
Thanks for sharing your knowledge freely. I have been waiting patiently. You add a different perspective that we appreciate. Looking forward to the 3rd video. Thank you!
@SerranoAcademy 1 year ago
Thank you! So glad you like the videos!
@dekasthiti 6 months ago
This really is one of the best videos explaining the purpose of K, Q, V. The illustrations provide a window into the math behind the concepts.
@MrMacaroonable 10 months ago
This is absolutely the best video that clearly illustrates and explains why we need V, K, Q in attention. Bravo!
@snehotoshbanerjee1938 11 months ago
One of the best videos on attention. Such a complex subject taught in a simple manner. Thank you!
@Chill_Magma 1 year ago
Honestly you are the best content creator for learning Machine learning and Deep learning in a visual and intuitive way
@ganapathysubramaniam 11 months ago
Absolutely the best set of videos explaining the most discussed topic. Thank you!!
@lengooi6125 9 months ago
Simply the best explanation on this subject. Crystal clear. Thank you.
@Ludwighaffen1 11 months ago
Great video series! Thank you! That helped a ton 🙂 One small remark: the concept of the "length" of a vector that you use here confused me. Here, I guess you take the point of view of a programmer: len(vector) outputs the number of dimensions of the vector. However, for a mathematician, the length of a vector is its norm, also called its magnitude (the square root of x^2 + y^2).
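A minimal sketch of the distinction above (assuming NumPy and a made-up 2-D vector): the programmer's "length" counts dimensions, while the mathematician's length is the Euclidean norm; the paper's scaling factor uses the former, the dimension d_k.

```python
import numpy as np

v = np.array([3.0, 4.0])      # hypothetical 2-D embedding

num_dims = len(v)              # "length" as the number of dimensions: 2
norm = np.linalg.norm(v)       # "length" as the Euclidean norm: sqrt(3^2 + 4^2) = 5.0

print(num_dims, norm)          # 2 5.0
```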
@leilanifrost771 7 months ago
Math is not my strong suit, but you made these mathematical concepts so clear with all the visual animations and your concise descriptions. Thank you so much for the hard work and making this content freely accessible to us!
@guitarcrax127 1 year ago
Amazing video. pushed forward my understanding of attention by quite a few steps and helped me build an intuition for what’s happening under the hood. Eagerly waiting for the next one
@kranthikumar4397 7 months ago
This is one of the best videos on attention and Q, K, V so far. Thank you for the detailed explanation.
@redmond2582 10 months ago
Amazing explanation of very difficult concepts. The best explanation I have found on the topic so far.
@andresfeliperiostamayo7307 5 months ago
The best explanation I have seen about Transformers. Thank you!
@aravind_selvam 1 year ago
This video is, without a doubt, the best video on transformers and attention that I have ever seen.
@puwanatsangkhapreecha7847 5 months ago
Best video explaining what the query, key, and value matrices are! You saved my day.
@ravindra1607 8 days ago
Simply the best video on "Attention Is All You Need". I tried to understand it from different videos, blogs, and the paper itself, and couldn't get close to what I understood from this video. It clarified almost all the questions I had, except for a few which I think will be clarified in the next video. You have amazing teaching skills, kudos to you man.
@subterraindia5761 2 months ago
Awesome. You explained everything very well. It made life easy for me.
@rachadlakis1 4 months ago
This is such a detailed and informative explanation of Transformer models! I appreciate the effort put into breaking down complex concepts with visuals and examples. Keep up the great work!
@danherman212nyc 7 months ago
I studied linear algebra during the day on Coursera and watch YouTube videos at night on state-of-the-art machine learning. I'm amazed by how fast you learn with Luis. I've learned everything I was curious about. Thank you!
@SerranoAcademy 7 months ago
Thank you, it’s an honor to be part of your learning journey! :)
@alexrypun 1 year ago
Finally! This is the best of the tons of videos/articles I have seen/read. Thank you for your work!
@shuang7877 5 months ago
A professor here - preparing for my course and trying to find an easier way to talk about these ideas. I learned a lot! Thank you!
@nikhilbelure 3 months ago
This is the best video I have seen on the attention model. Even after reading through so many articles it was not intuitively clear, but now it is!! Thanks.
@shannawallace7855 11 months ago
I had to read this research paper for my Intro to AI class and it's obviously written for people who already have a lot of background knowledge in this field. so being a newbie I was so lost lol. Thanks for breaking it down and making it easy to understand!
@johnschut164 10 months ago
Your explanations are truly great! You have even understood that you sometimes have to ‘lie’ first to be able to explain things better. My sincere compliments! 👊
@shubha07m 1 month ago
What a flawed YouTube algorithm, that it showed this gem only after so many overcomplicated videos on attention. Every student should learn attention from THIS VIDEO!
@gauravruhela007 4 months ago
I really liked the way you showed the motivation behind the softmax function. I was blown away. Thanks a lot, Serrano!
@MrSikesben 9 months ago
This is truly the best video explaining each stage of a transformer, thanks man
@cooperwu38 8 months ago
Super clear ! Great video !!
@antraprakash2562 9 months ago
This is one of the best videos I've come across for understanding embeddings and attention. Looking forward to more such explanations that can simplify such complex mechanisms in the AI world. Thanks for your efforts.
@MarkusEicher70 11 months ago
Hi Luis. Thank you for this video. I'm sure this is a very good way to explain this complex topic, but I just can't get it into my brain yet. I'm currently doing the Math for Machine Learning specialization on Coursera and brushing up my algebra and calculus skills, which are way too low. In any case, you got me involved in this, and now I will grind through it till I make it. I'm sure the pain will become less and the fog will lighten up. 😊
@danielmoore4311 11 months ago
Excellent job! Please continue making videos that break down the math.
@SeyyedMohammadLoghmanDastgheyb 1 year ago
This is the best video that I have seen about the concept of attention! (I have seen more than 10 videos but none of them was like this.) Thank you so much! I am waiting for the next videos that you have promised! You are doing a great job!
@cachegrk 4 months ago
The best videos on transformers on the internet. You are the best teacher!
@chrisw4562 8 months ago
Thank you for the great tutorial. This is the clearest explanation I have found so far.
@rohitchan007 11 months ago
Please continue making videos. You're the best teacher on this planet.
@_ncduy_ 7 months ago
This is the best video for people trying to understand the basics of transformers, thank you so much ^^
@王禹博-s8i 14 days ago
Really, thanks for this video. I am a student in China, and none of my teachers taught me this clearly.
@celilylmaz4426 10 months ago
This video has the best explanations of the QKV matrices and linear layers among the resources I've come across. I don't know why, but people seem uninterested in explaining what's really happening with each step taken, which results in loads of vague points. Still, the video could've been further improved with more concrete examples and numbers. Thank you.
@rollingstone1784 6 months ago
@SerranoAcademy If you want to arrive at the same notation as in the mentioned paper, Q times K_transpose, then the orange is the query and the phone is the key here. Then you calculate q times Q times K_transpose times key_transpose (as mentioned in the paper). Remark: the paper uses "sequences", described as "row vectors". However, usually one uses column vectors. Using row vectors, the linear transformation is a left multiplication a times A, and the dot product is written as a times b_transpose. Using column vectors, the linear transformation is A times a, and the dot product is written as a_transpose times b. This, in my opinion, is the standard notation, e.g. writing Ax = b and not xA = b.
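A small numeric check of the row-vector vs column-vector remark above — a sketch assuming NumPy, with made-up matrices W_Q and W_K standing in for the learned projections: the paper's row-vector form q·W_Q·(k·W_K)ᵀ gives the same score as the column-vector form with left-multiplication.

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # toy projection matrices

q_row = rng.normal(size=(1, d))    # row-vector convention, as in the paper
k_row = rng.normal(size=(1, d))

score_row = (q_row @ W_Q) @ (k_row @ W_K).T        # shape (1, 1)

# Same computation with column vectors and left-multiplication by the matrices
q_col, k_col = q_row.T, k_row.T
score_col = (W_Q.T @ q_col).T @ (W_K.T @ k_col)    # shape (1, 1)

print(np.allclose(score_row, score_col))           # True
```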
@brianburton6669 1 month ago
This is the best video I’ve seen on this topic. Well done sir
@bzaruk 11 months ago
MAN! I have no words! Your channel is priceless! thank you for everything!!!
@devmum2008 7 months ago
This is a great video with clarity on Keys, Queries, and Values. Thank you!
@alnouralharin 7 months ago
One of the best explanations I have ever watched
@RoyBassTube 5 months ago
Thanks! This is one of the best explanations of Q, K & V I've heard!
@iantanwx 4 months ago
Most intuitive explanation for QKV, as someone with only an elementary understanding of linear algebra.
@deveshnandan323 7 months ago
Sir, you are a blessing to new learners like me. Thank you, big respect. ❤
@SaeclumSolvet 2 months ago
Thank you @Serrano.Academy, very useful video. The only thing that is a bit misleading is around 24:50, where Q, K are implied to be multiplied with the word embeddings to produce the cosine distance, when in fact the embeddings are already included in Q, K. I guess you are using Wq, Wk interchangeably with Q, K for simplicity.
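For anyone tripped up by the same notation point, a hedged sketch (NumPy, made-up shapes, not the video's code) of the usual convention: Wq and Wk are the learned matrices, while Q = X·Wq and K = X·Wk already contain the embeddings, so the scores are X·Wq·(X·Wk)ᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_k = 4, 8, 8

X = rng.normal(size=(n_tokens, d_model))    # word embeddings, one row per token
W_q = rng.normal(size=(d_model, d_k))       # learned query projection
W_k = rng.normal(size=(d_model, d_k))       # learned key projection

Q = X @ W_q                                  # queries already include the embeddings
K = X @ W_k                                  # keys already include the embeddings

scores = Q @ K.T                             # (n_tokens, n_tokens) similarity scores
print(scores.shape)                          # (4, 4)
```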
@0xSingletOnly 9 months ago
I'm going to try implement self-attention and multi-head attention myself, thanks so much for doing this guide!
@kylelau1329 10 months ago
I've watched over 10 tutorial videos on the Transformer architecture; this one is so far the most intuitive way to understand it. Really good work! Natural language processing is a hard topic, and this tutorial kind of opens the black box of the large language model.
@laodrofotic7713 4 months ago
None of the videos I've seen on this subject actually explain where the hell the QKV values come from! It's amazing that people jump into making videos without understanding the concepts clearly; I guess YouTube must pay a lot of money! This video does a good job of explaining most of the things, but it never tells us where the actual QKV values come from or how the embeddings turn into them, and it actually got things wrong in my opinion. The Q comes from embeddings that are multiplied by WQ, which is a weight and parameter in the model, but then the question is: where do WQ, WK, WV come from???
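On the question of where WQ, WK, WV come from: they are simply learnable weight matrices, initialized randomly and updated by backpropagation along with every other parameter while the model trains on its objective. A minimal sketch under that assumption (PyTorch, made-up sizes, not the video's or the paper's exact code):

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        # W_q, W_k, W_v start as random matrices and are learned during training
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_k, bias=False)

    def forward(self, x):                          # x: (batch, seq_len, d_model) embeddings
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        return weights @ V

attn = SingleHeadAttention(d_model=8, d_k=8)
out = attn(torch.randn(1, 4, 8))                   # gradients flow into W_q, W_k, W_v
print(out.shape)                                   # torch.Size([1, 4, 8])
```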
@awinashjha 1 year ago
This probably is "the best video" on this topic.
@lijunzhang2788 1 year ago
Great explanation. I was waiting for this after your first video on the attention mechanism! You are so talented at explaining things in easily understandable ways! Thank you for the effort put into this, and keep up the great work!
@TheMotorJokers 1 year ago
Thank you, really good job on the visualizations! They make the process really understandable.
@glacierxs6646 3 months ago
OMG this is so well explained! Thank you so much for the tutorials!
@syedmustahsan4888 1 month ago
Thank You very much sir. I am so pleased by the way you teach. Alhumdulillah. Thank GOD. However, I was unable to grasp the key, query, values part. Thank You Very Much
@tankado_ndakota 5 months ago
Amazing video, that's what I was looking for. I needed to know the mathematical background to understand what is happening behind the scenes. Thank you, sir!
@brainxyz 1 year ago
Amazing explanation. Thanks a lot for your efforts.
@nileshkikle8112 9 months ago
Dr. Luis - Excellent explanation of the math behind the Q, K, V magic from the AIAYN paper! In the video, when you are explaining multi-head attention (around 30:20), if I am not mistaken you have not explained the W matrix in the formula, both at the "head" level and the "concat" level. Please shed some light on this if you don't mind. My hunch is that, to keep things simplified, you might have decided to skip this W matrix that is part of the NN loss function when the decoders are in action.
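On the W matrices asked about above: in the paper each head i has its own learned W_i^Q, W_i^K, W_i^V, and after the heads are concatenated there is one more learned output matrix W^O. A hedged sketch of that structure (NumPy, toy sizes, not the video's code):

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads

X = rng.normal(size=(n_tokens, d_model))           # token embeddings, one row per token

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(n_heads):
    # each head gets its own "head-level" W matrices
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

W_o = rng.normal(size=(n_heads * d_k, d_model))     # the "concat-level" W^O matrix
output = np.concatenate(heads, axis=-1) @ W_o        # (n_tokens, d_model)
print(output.shape)
```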
@debnath22026 3 months ago
Damn! There's no better video to understand Attention than this!!
@colinmaharaj50 4 months ago
4:55
Jon: The Singularity
Doe: No Similarity
Jon: Well are they not similar, I am demoing the topic LOL
Doe: Palm-to-face
@colinmaharaj50 4 months ago
Sorry, I like a joke, but thanks for your content, I really need to recreate the math behind how this works
@rollingstone1784 6 months ago
@SerranoAcademy At 13:23, you show a matrix-vector multiplication with a column vector (rows of the table times columns of the vector), i.e. right-multiplication. On the right side, maybe you could use, in addition to "is sent to", the icon "orange′" (orange prime). This would show the multiplication more clearly. Remark: you use a matrix-vector multiplication here (using a row of the matrix and the words as a column on the right of the matrix). If you use row vectors, the word vector should be placed horizontally on the left of the matrix, and in the explanation a column of the matrix has to be used. The result is then a row vector again (maybe a bit hard to sketch).
@YahyaMohand-r7f 1 year ago
The best explanation I've ever seen of the attention mechanism, amazing.
@poojapalod 1 year ago
This is one of the best resources on the attention mechanism. The key, query, value explanation was just awesome. Just one question though: how do we know that the key matrix captures high-level concepts, the query matrix captures low-level concepts, and the value matrix is more focused on predicting the next word? Can you share code or a playbook that validates this?
@SerranoAcademy 1 year ago
Thank you! That's a great question. I saw it in a thread somewhere, but I'm now doubting its validity. I may remove that part of the video. It may be better to look at K and Q as 'working together', rather than having separate functions. But if I find out more I'll post it here.
@GIChow 10 months ago
I have been trying to conceptualise what QKV do from many videos (and the original AIAYN paper) but yours is the first to resonate with what I was thinking i.e. that 1) Q contains the 'word mix' (more accurately token mix) for our input text 2) K contains the real deep meaning (thing/action/feature, etc) behind every word so Q*K gives us our actual word mix (aka input text sentence) deep meaning. We need a consistent way of storing meaning for the computer since two different human language sentences or 'word mixes' could yield the same deep meaning ("a likes b", "b is liked by a") and vice versa i.e. two identical sentences could give a different deep meaning depending on what came before them. QK gives us that. 3) V gives us the features of the word that would likely come next following our input. Does this 'intuition' sound roughly right?
@SerranoAcademy 10 months ago
Thank you! Yes I love your reasoning, that's the intuition behind them!
@GIChow 10 months ago
​@SerranoAcademy Thank you for getting back and helping my understanding of this area 🙌 You may know Karpathy has a video on building a simple GPT and that helped me understand qkv better too. Wishing all the best with your channel and ventures, and compliments of the season! 🎄
@vasanthakumarg4538 9 months ago
This is the best video I have seen explaining the attention mechanism. Keep up the good work!
@BrikeshKumar987 10 months ago
Thank you so much!! I watched several videos and none could explain the concept so well.
@SerranoAcademy 10 months ago
Thanks, I'm so glad you enjoyed it! Lemme know if you have suggestions for more topics to cover!
@user-um4di5qm8p 4 months ago
by far the best explanation, Thanks for sharing!
@sreelakshminarayanan.m6609 6 months ago
Best video for getting a clear understanding of transformers.
@TaianeCoelhoRamos 1 year ago
Great explanation. I just really needed the third video. Hope you will post it soon.
@mayyutyagi 4 months ago
Now whenever I watch a Serrano video, I like it first and then start watching, because I know the video is going to be outstanding, as always.
@gunamrit 5 months ago
@36:15 - divide it by the square root of the dimension of the vector, not the length of the vector. The length of the vector is sqrt(coeff_i^2 + coeff_j^2). Amazing video! Keep it up! Thanks.
@SerranoAcademy 5 months ago
Thank you! Yes absolutely, I should have said dimensions instead of length.
@gunamrit 5 months ago
@@SerranoAcademy The only thing between me and the paper was this video and it helped me clear the blanks I had after reading the paper. Thank You once again !
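Following up on the sqrt(d_k) correction in this thread, a small sketch (NumPy, random toy vectors) of why the scores are divided by the square root of the key dimension: the standard deviation of raw dot products grows like sqrt(d_k), and the division keeps the softmax inputs at a stable scale.

```python
import numpy as np

rng = np.random.default_rng(2)
for d_k in (4, 64, 1024):                     # key/query dimension, not a vector norm
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)                # raw dot products of random vectors
    print(d_k, dots.std().round(1), (dots / np.sqrt(d_k)).std().round(1))
# std of raw dot products grows like sqrt(d_k); after dividing by sqrt(d_k) it stays near 1
```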
@januaymagori4642 1 year ago
Today I understood the attention mechanism better than ever before.
@joehannes23 10 months ago
Great video, I finally understood all the concepts in their context.
@alieskandarian5258 9 months ago
It was fascinating to me. I searched a lot for the math explained like this and couldn't find it, so thanks for this. Please do more 😅 with more complex ones.
@BigAsciiHappyStar 6 months ago
Thanks for this video. A few weeks ago, I tried to design an AI for playing the New York Times Connections game, but I could get nowhere near the performance of human experts. My algorithm converted individual words to vectors using word embeddings and chose the four vectors that were closest to each other. Hopefully this will give me some ideas for improving my AI 😊
@pavangupta6112 10 months ago
Very well explained. Got a bit closer to understanding attention models.
@madhu1393 5 months ago
@SerranoAcademy Thank you, this is an awesome video. At around the 25-minute mark you explain that, during the math Q(Orange) * WQ * WK * K(Phone), WQ and WK help to transform the embedding. This seems incorrect, because even before self-attention q is multiplied with WQ to get Q, and the same for K. So the actual math is q * WQ * k * WK, and since matrix multiplication is not commutative you can't reorder it like that. So it looks like we are finding the similarity between Q and K, whereas Q and K belong to different embedding spaces. I'm not sure how it actually works per your explanation. It would be helpful if you could answer. Thanks.
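One way to reconcile the two views in the question above, written with the thread's symbols and the paper's row-vector convention (a sketch of the algebra, not a claim about the video's exact notation): because matrix multiplication is associative, projecting the query and key separately and then taking their dot product is the same as comparing the raw embeddings through a single learned bilinear form W_Q·W_Kᵀ, so the two projections do end up compared in a shared transformed space.

```latex
\text{score}(q, k) = (q\,W_Q)\,(k\,W_K)^{\top} = q\,\bigl(W_Q W_K^{\top}\bigr)\,k^{\top}
```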
@praveenkumarchandaliya1900 3 months ago
Thank you for the awesome visualization and explanation. How do you compute the new Apple-fruit coordinates (1.14, 2.43), pulled towards orange, and the new Apple-tech coordinates (2.86, 1.14)?
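The exact numbers in the question depend on the coordinates chosen in the video, but the kind of computation behind the "new" coordinates is a similarity-weighted average of the embeddings in the sentence. A hedged sketch with made-up 2-D coordinates (NumPy), not the video's values:

```python
import numpy as np

# Hypothetical 2-D embeddings (not the video's exact numbers)
words = {"apple": np.array([2.0, 2.0]),
         "orange": np.array([1.0, 2.5]),
         "phone": np.array([2.5, 1.0])}

def new_embedding(target, context):
    vecs = np.stack([words[w] for w in context])
    scores = vecs @ words[target]                    # dot-product similarities
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax
    return weights @ vecs                            # weighted average of the context

# "apple" in a fruity sentence drifts toward "orange"; in a techy one, toward "phone"
print(new_embedding("apple", ["apple", "orange"]))
print(new_embedding("apple", ["apple", "phone"]))
```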
@chiboreache 1 year ago
very nice and easy explanation, thanks!
@BABA-oi2cl 10 months ago
Thanks a lot for this. I always got terrified of the maths that might be there but the way you explained it all made it seem really easy ❤
@saintcodded2918 8 months ago
This is powerful yet so simple. Thanks
@ArtOfTheProblem 1 year ago
Great work. Can you point me to how you know the Key matrix is for general concepts and the Query matrix for specifics? I've never seen this before, and it's not clear how that would be the case given the design (which is that they "learn what they need to know").
@SerranoAcademy 1 year ago
Thank you for pointing it out! I saw it in a thread, but I'm doubting it now, so I'll remove that part of the video. I think it's better to look at K and Q as 'working together', rather than capturing specific features.
@ArtOfTheProblem 1 year ago
You did a good job though. I'm like you in that often the best things you learn are deep in a thread somewhere, but they're not always right. It reminds me of another thread where people say the Q vector "says what it's looking for" and the K "says what's relevant", etc., but that's just a human interpretation. I do like to think of them as 'interface points' which allow the tokens to share information that is relevant. @@SerranoAcademy
@ArtOfTheProblem 1 year ago
Also, did you see my other comment about simply boiling this down to the fact that a self-attention layer is like a layer with "weights that adapt to context", which does what many static layers would do? @@SerranoAcademy
@SerranoAcademy 1 year ago
@@ArtOfTheProblem yeah that makes sense, so basically like a big NN where some of the layers adapt to content and some to predicting the next word?
@ArtOfTheProblem 1 year ago
Yes, but I wouldn't say "some are for predicting the next word", because the entire network is trained to predict the next word, so every part of the network is playing a role in that objective. @@SerranoAcademy
@getvineet 10 months ago
This series is a great description and the good part is a complex concept is explained in the simplest possible way. One question: Why are the Key, Query, and Value matrices called what they are called? What is the history behind their names?
@SerranoAcademy 10 months ago
Thanks! Great question! In the original explanation, it seems that the key is a way to search for a definition in another matrix, called the query matrix. Sort of like a database, where you would search for a word, except that it's all numbers. But to be fully honest, I never understood that key-query explanation, which is why I made the video. :) So if you figure it out, pls let me know. :)
@getvineet 9 months ago
Thanks @@SerranoAcademy
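A toy illustration of the database analogy in the reply above (Python, made-up vectors): an ordinary dictionary needs the query to match a key exactly, while attention is a "soft" lookup that returns a similarity-weighted mixture of all the values.

```python
import numpy as np

# Hard lookup: the query must equal a key exactly
table = {"fruit": "apple facts", "tech": "phone facts"}
print(table["fruit"])

# Soft lookup (attention): similarity-weighted mixture of all the values
keys = np.array([[1.0, 0.0],       # "fruit" key
                 [0.0, 1.0]])      # "tech" key
values = np.array([[10.0, 0.0],    # value stored under "fruit"
                   [0.0, 10.0]])   # value stored under "tech"
query = np.array([0.9, 0.1])       # mostly fruity query

scores = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()
print(weights @ values)            # mostly the "fruit" value, a little of the "tech" one
```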
@davidingham3409 1 month ago
A sequence of linear transforms can be multiplied together into a single linear transform, so deep learning requires non-linearity. In older networks that is only the sigmoid or softmax function, which is a negative relation, while in attention there is a positive relation.
@Wise_Man_on_YouTube 8 months ago
"This step is called softmax" . 😮😮😮 Today I understood why softmax is used. Such a beautiful function. And such a great way to demonstrate it.
@jean-marclaferte7675 9 months ago
Excellent ! Thanks.
@davidking545 11 months ago
Thank you so much! The image at 24:29 made this whole concept click immediately.