As many of you have asked: LLaMA 2's architecture is made up of the ENCODER side of the Transformer plus a Linear Layer and a Softmax. It can also be thought of as the DECODER of the Transformer, minus the Cross-Attention. Generally speaking, people call a model like LLaMA a Decoder-only model, while a model like BERT an Encoder-only model. From now on I will also stick to this terminology for my future videos.
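For anyone who wants to see that distinction in code, here is a minimal sketch of such a stack: encoder-style blocks with causal self-attention and no cross-attention, followed by a linear layer and softmax. Layer sizes, names, and the use of LayerNorm instead of RMSNorm are illustrative simplifications, not the actual LLaMA source.

```python
import torch
import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    """One block: self-attention + feed-forward, no cross-attention."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # LLaMA actually uses RMSNorm (pre-norm)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: each position may only attend to itself and the past.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)  # self-attention only
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x

class TinyDecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([DecoderOnlyBlock(d_model, n_heads) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)  # the extra linear layer

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.head(x).softmax(dim=-1)          # next-token distribution

model = TinyDecoderOnlyLM()
probs = model(torch.randint(0, 1000, (1, 5)))
print(probs.shape)  # torch.Size([1, 5, 1000])
```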
@aiyogiravi10 ай бұрын
Yeah, it makes sense now. Since we are not encoding anything to be used later via cross-attention, we will call this model a Decoder-only model. Edit: Really appreciate the effort you are putting in. Great channel :)
@haiphan9807 ай бұрын
Great video about LLaMA! I have one question regarding the inference steps. How does the input [SOS] predict "Love" at the beginning, when the model does not have any information about the input sentence? In the Transformer we have an encoder which encodes the whole input sentence before it goes to the decoder, providing the conditioning mechanism; in LLaMA we don't have that. [SOS] could predict any next word, so how does it know that it is "Love"?
@kanakraj31986 ай бұрын
@@haiphan980 That was just for understanding; he didn't show the prompt part. The model will first take your prompt and perform self-attention on it, and only then will it start predicting, so it will have information from your prompt on how to start.
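A minimal sketch of that idea, assuming a generic causal LM interface and made-up token ids (not any specific library's API): the whole prompt is processed first, so the first generated token is conditioned on the prompt, not just on [SOS].

```python
import torch

# `model` is assumed to map a (1, seq_len) tensor of token ids to
# (1, seq_len, vocab_size) logits for the next token at each position.
def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int, eos_id: int) -> torch.Tensor:
    tokens = prompt_ids                    # e.g. ids of "<sos> Translate: I love you"
    for _ in range(max_new_tokens):
        logits = model(tokens)             # self-attention over everything seen so far
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        tokens = torch.cat([tokens, next_id], dim=1)             # append and repeat
        if next_id.item() == eos_id:
            break
    return tokens
```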
@chandrahasaroori3177 күн бұрын
Ah phew, started googling things as soon as I heard it
@kqb5405 ай бұрын
Umar, Andrew Ng, 3Blue1Brown and Andrej are all you need. You are one of the best educators of deep learning. Thank you.
@xinyaoyin22382 ай бұрын
Nahh, Umar only. 3Blue1Brown only makes fancy but useless PPTs, and Andrej emphasizes Python tricks more than the actual transformer.
@huntersawyer93242 ай бұрын
I thought attention was all you need lol
@emir5146Ай бұрын
I think so, too.
@emir5146Ай бұрын
@@xinyaoyin2238 Andrej is doing the right thing.
@Umar-s6e16 күн бұрын
@@xinyaoyin2238 I agree, Umar only. Andrej makes it complex or covers too much depth.
@jyun-siouhuang227 күн бұрын
The most comprehensible tutorial I've found so far. Most blogs' explanations have some mistakes, but yours looks great!! Fascinating, much appreciated.
@mandarinboy Жыл бұрын
The best 1 hour I spent! I had so many questions exactly on all these topics and this video does an outstanding job at explaining enough details in an easy way!
@umarjamilai Жыл бұрын
Glad you liked it! I just posted another video on how to code LLaMA 2 from scratch! Check it out
@mojtabanourani9988 Жыл бұрын
The network came out in Feb 2023! This is a YouTube channel worth subscribing to. Thanks man
@muthukumarannm398 Жыл бұрын
I became your fan at 55:00, when you explain how GPU capability drives the development. 🙂
@Paluth Жыл бұрын
Thank you very much for your work. The community is blessed with such high quality presentations about difficult topics.
@TheMzbac10 ай бұрын
Very underrated video. Thanks for providing such a good lecture to the community
@MaxenceErnoult10 ай бұрын
This video, along with the previous ones about coding up transformers from scratch, are really outstanding. Thank you so much for taking such a tremendous amount of your free time to put all of this together!
@dgl32836 ай бұрын
This video could serve as the official textbook of the LLaMA architecture. Amazing.
@kunwar_divyanshu2 ай бұрын
This is the best explanation on the planet for LLM techniques and architecture.
@Engrbilal143 Жыл бұрын
Amazing. Just wow. I cannot find this stuff on the whole internet.
@Best9in Жыл бұрын
Thank you very much! I once read the paper, but I think watching your video provided me with more insights about this paper than reading it many more times would have.
@tubercn Жыл бұрын
Thanks for giving your free time to offer this valuable tutorial 👏👏👏 Hope you keep doing this, thanks again
@deema_c10 күн бұрын
You deserve a section in my thesis's Acknowledgments
@umarjamilai9 күн бұрын
Waiting for it ;-)
@jordanconnolly1046 Жыл бұрын
Really glad I found your channel. You create some of the most in-depth and easy-to-follow explanations I've been able to find.
@tusharmadaan54803 ай бұрын
Amazing explanation. Such thorough coverage of KV cache, GQA, RoPE, and SwiGLU.
@cobaltl85578 ай бұрын
This intro to llama is awesome ❤, thank you for making such a great video.
@mickelliu5559 Жыл бұрын
It's surprising that content of this quality is free.
@TheAero Жыл бұрын
You actually explained attention here better than the previous presentation!
@umarjamilai Жыл бұрын
Keep up with the journey! Watch my other video on how to code LLaMA 2 from scratch and you'll put into practice what you've learnt here.
@Koi03123 ай бұрын
Great video! Hope you will make a video explaining Llama 3 soon.
@Jc-jv3wj7 ай бұрын
Fantastic explanation of the LLaMA model. Please keep making this kind of video.
@siqb9 ай бұрын
My TLDR for the video (please point out the mistakes):
- LLaMA uses RMS normalization instead of LayerNorm because it provides the same benefits with less computation.
- LLaMA uses rotary embeddings. These act as a distance-based scaling of the dot-product score coming out of the queries and keys. In other words, two tokens X and Y that are close together will produce a larger score than two tokens X and Y that are far apart. This makes sense from the point of view that closer tokens should have a bigger say in the final representation of a given token than the ones far away. This is not the case for the vanilla transformer.
- LLaMA uses Grouped Query Attention as an alternative to vanilla attention, mostly to optimize GPU FLOPs (and the much slower memory access). Key slide at 1:03:00. In vanilla attention, each token (within each head) has its own key, query and value vector. In multi-query attention (MQA), there is only one key and value vector for all query heads. In between lies GQA, where a group of query heads (say 2-4) is mapped to one key and value vector.
- LLaMA uses the SwiGLU activation function since it works better.
- LLaMA uses 3 layers instead of 2 for the FFN part of the block but keeps the number of parameters the same.
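To make the grouping concrete, here is a minimal GQA sketch with made-up shapes (not the LLaMA source): each key/value head is shared by a group of query heads.

```python
import torch

batch, seq_len, head_dim = 1, 6, 8
n_q_heads, n_kv_heads = 8, 2           # 8 query heads share 2 KV heads
group_size = n_q_heads // n_kv_heads   # -> each KV head serves 4 query heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each KV head `group_size` times so it lines up with its query group.
# n_kv_heads = 1 would be multi-query attention; n_kv_heads = n_q_heads
# would be vanilla multi-head attention.
k = k.repeat_interleave(group_size, dim=1)   # (batch, n_q_heads, seq, dim)
v = v.repeat_interleave(group_size, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
out = scores.softmax(dim=-1) @ v
print(out.shape)  # torch.Size([1, 8, 6, 8])
```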
@jackoneil100020 күн бұрын
I might be incorrect, but I believe that a GLU cannot simply be thought of as a 3-layer FFN; it is more like a 2-layer FFN with an extra forget/update gate inspired by LSTMs/GRUs. The number of parameters is of course a hyperparameter at the end of the day, and the video mentioned that the hidden size was scaled by 2/3 to test whether Swish does better. After all, comparing activation functions while their layers have different parameter counts would be unfair, and the increased performance could be an effect of the higher parameter count instead of the better activation. Don't want to be the "Uhm... Actually" guy, just thought it might be useful.
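For reference, a minimal sketch of a SwiGLU feed-forward block of the kind discussed above, with the 2/3 hidden-size factor; dimensions and names are illustrative, not necessarily LLaMA's exact ones.

```python
import torch
import torch.nn as nn

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(2 / 3 * 4 * d_model)                  # 2/3 factor keeps param count comparable
        self.w1 = nn.Linear(d_model, hidden, bias=False)   # "update" path
        self.w3 = nn.Linear(d_model, hidden, bias=False)   # gate path (GLU-style)
        self.w2 = nn.Linear(hidden, d_model, bias=False)   # projection back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W1) elementwise-gated by (x W3), then projected by W2.
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFeedForward(64)
print(ffn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```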
@Tensordroid7 ай бұрын
One of the best explanations on youtube right now !!
@YuchenCao-e1v11 ай бұрын
Great video! Worth spending the time going over again and again. I actually saw your video on a Chinese site (the video has probably been re-uploaded by many people already), then I came here to find the author. Really well explained, thanks for sharing!
@umarjamilai11 ай бұрын
Which Chinese site? I'm a bit curious to take a look 😁
@feixyzliu543210 ай бұрын
@@umarjamilai Bilibili
@NJCLM10 ай бұрын
I didn't even see the time pass! Great work, you are a future rock star at teaching complex things in ML.
@ravimandliya1881 Жыл бұрын
Such an amazing step by step breakdown of concepts involved! Thank you so much.
@Charbel-n1k4 ай бұрын
Thank you for explaining these concepts in an easy way to understand!🎉
@UnknownHuman111104 ай бұрын
Amazing video ! Thanks for taking the time to explain core new concepts in language models
@librakevin19834 ай бұрын
The best machine learning videos I've ever watched. Thanks Umar!
@danish53268 ай бұрын
AMAZING! AMAZING AMAZING! Great work Umar .. Thanks a ton
@hieungo770 Жыл бұрын
Please keep making content like this. Thank you very much. I learned a lot.
@abhishekshinde-jb5pn10 ай бұрын
Your videos are the best, man!! Please keep releasing as much content as possible on the famous papers.
@umarjamilai Жыл бұрын
As always, the PDF slides are freely available on GitHub: github.com/hkproj/pytorch-llama-notes/
@visheshmittal468Ай бұрын
At 38:20 I think there is a mistake: Q != K != V even in self-attention. They are calculated using Wq, Wk, Wv, which are different weight matrices. When calculating the scores, e_1j = [q1·k1, q1·k2, ..., q1·kT], where q comes from the current token and k from all the other tokens; a_1j = softmax(e_1j); z_1 = sum_j(a_1j · v_j), where the v are computed as token_embed · Wv.
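A small single-head sketch of exactly those steps, with made-up dimensions: the same token embeddings go through three different projection matrices, then scores, softmax, and the weighted sum of values.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)            # token embeddings

Wq = torch.randn(d_model, d_model)
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)

Q, K, V = x @ Wq, x @ Wk, x @ Wv             # same input, different projections

e = (Q @ K.T) / d_model**0.5                 # e[i, j] = q_i · k_j (scaled)
a = e.softmax(dim=-1)                        # a[i, j] = softmax over j
z = a @ V                                    # z_i = sum_j a[i, j] * v_j
print(z.shape)                               # torch.Size([5, 16])
```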
@GrifinsBrother10 ай бұрын
Incredible explanation! You have a real predisposition for explaining material. Keep going!
@mprone7 ай бұрын
Although your name deceived me at first, I had no doubt you were a fellow Italian!
@umarjamilai7 ай бұрын
You'd have to pull the cadrega trick on me to see whether I'm really a Milanese 😂😇
@mprone7 ай бұрын
@@umarjamilai Have a seat there, Brambilla Jamil, make yourself comfortable, help yourself, take a cadrega. You never say no to a nice little cadrega!
@umarjamilai7 ай бұрын
@@mprone 😋 mmmhhh... this cadrega is delicious 🍎 Feel free to write to me on LinkedIn if you have any doubts or questions. Have a nice day!
@subhamkundu5043 Жыл бұрын
This is great. Thank you. It would be very helpful if you could also create a video on hands-on coding of a LLaMA model, the way you did for the vanilla transformer. Thanks in advance.
@umarjamilai Жыл бұрын
It's coming soon. Stay tuned!
@weicheng4608 Жыл бұрын
Same here. Eagerly waiting for a coding session for llama model.
@somebody-f8k Жыл бұрын
Fantastic video! Your explanations are very clear, thank you!
@saratbhargavachinni5544 Жыл бұрын
Thanks a lot, Great explanation of KV cache and Multi Query Attention.
@ahmetfirat23 Жыл бұрын
very informative video, details are clearly explained. thanks a lot
@barisdenizsaglam3 ай бұрын
Amazing illustrations and explanations.
@parmanandchauhan61825 ай бұрын
Great content, deep understanding after watching the video.
@TrelisResearch Жыл бұрын
Great channel and content Umar
@goelnikhils Жыл бұрын
Exceptional Video on LLaMA
@bipulbikramthapa825610 ай бұрын
Great video +1. I have a few queries about the LLaMA model. 1. In the architecture diagram, does one Nx represent a single layer of LLaMA? 2. Could you also please clarify how many Nx (layers) are used for the LLaMA-2 13B model and its variants? 3. Finally, what is the potential for distributed computation in LLaMA model inference? What are the possible breakpoints in the model from an architectural standpoint?
@vassilisworld Жыл бұрын
Another amazing video, Umar! You sure do know how to teach. It would be nice if you put the very influential papers to read into a repo! Did I hear a baby in the background? Also, given you are from Italy, there is a lovely video worth watching by Asianometry on 'Olivetti & the Italian Computer: What Could Have Been'. Thank you again for the hard work you put into this video.
@umarjamilai Жыл бұрын
Hello Vassilis! Thanks for the kind words and the suggestion! The voice in the background is from 奥利奥 (Oreo), our black and white cat 😺. Unfortunately I'm aware of Olivetti's history and what could have been. If you're curious, you should also check out Enrico Mattei, and what ENI could have been. Have a wonderful day! Hopefully in a few days I'll upload the video on how to code LLaMA from scratch
@haocongzhan180610 ай бұрын
You are doing a great job, thank you for tutoring me part by part! It helps a lot.
@umarjamilai10 ай бұрын
Are you in China? Let's connect on LinkedIn.
@haocongzhan180610 ай бұрын
@@umarjamilai Yes, I am. I've already added you!
@goelnikhils Жыл бұрын
Amazing explanation
@Angadsingh95 Жыл бұрын
Thank you for creating such quality content!
@taltlusty680410 ай бұрын
Great video!!! Thank you very much for enriching the community with such great explanations! Can you please share your slides?
@umarjamilai10 ай бұрын
Check the video description, there's a link.
@Tomcat34210 ай бұрын
You are doing god's work. Keep it up, and thank you.
@meili-ai Жыл бұрын
Very good explanation! Keep up the good work!
@hosseinhajipour7817 Жыл бұрын
Thanks for the videos. There are a few errors in the video, which I mention below: 1 - LLaMA is a decoder-only model. 2 - The size of Q and K is the same; however, they are not the "same" tensor.
@umarjamilai Жыл бұрын
Hi! 1 - To be the decoder, it should have the cross-attention, which it doesn't. The closest architecture is the encoder (the left side of the transformer model). People commonly call it "decoder-only" because we do not "encode" text into a latent representation, but rather just "generate text" from pre-trained embeddings. Technically, from an architectural point of view, it's more similar to an encoder, hence the name. 2 - Q, K and V have the same size and, in vanilla self-attention, also the same content (they are all the same input sequence before the projections). In LLaMA, because of the KV cache and the rotary positional encodings, which are applied only to Q and K, the content is different. Have a nice day!
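To illustrate that last point, here is a minimal rotary-embedding sketch with illustrative shapes (not the actual LLaMA code): the rotation is applied to Q and K only, while V is used unchanged.

```python
import torch

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # Treat each pair of dimensions as a complex number and rotate it by an
    # angle that grows with the token position (adjacent-pair convention).
    seq_len, dim = x.shape[-2], x.shape[-1]
    freqs = 1.0 / theta ** (torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    angles = torch.outer(torch.arange(seq_len).float(), freqs)       # (seq, dim/2)
    rot = torch.polar(torch.ones_like(angles), angles)               # e^{i * m * theta_k}
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = torch.view_as_real(x_complex * rot).flatten(-2)
    return x_rotated.type_as(x)

seq_len, head_dim = 6, 8
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
v = torch.randn(seq_len, head_dim)

q, k = apply_rope(q), apply_rope(k)   # positions enter only through Q and K
scores = (q @ k.T) / head_dim**0.5    # the dot product now depends on relative position
out = scores.softmax(dim=-1) @ v      # V is used as-is
print(out.shape)                      # torch.Size([6, 8])
```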
@TheFitsome6 күн бұрын
When I give my Nobel Prize acceptance speech, I will mention this channel 😆
@TianyiZhang-ns8pg6 ай бұрын
Thanks a lot. Looking forward to your next video!!!
@satviknaren96816 ай бұрын
Thank you for posting. Thank you for existing. Thank you.
@ethanhe42 Жыл бұрын
great illustration!
@saurabh73377 ай бұрын
Hello, many thanks for such great content. Really enjoyed your work. Sorry for being greedy, but may I request a video showing how a LLaMA model can be run effectively on local machines (with or without a GPU) for inference (say, with a custom Flask API)?
@berkk1993 Жыл бұрын
Your videos are great. Please keep going.
@moacirponti8 ай бұрын
Many thanks for the great video. From your explanation, LLaMA uses Rotary Positional Embedding (RoPE) as its positional encoding, applied to *each* Q and K vector just after the transformation by their respective W. In this case, I don't get what Relative Positional Encoding has to do with it (and why it was explained before Rotary PE). Is it because Rotary PE has connections with the relative positional method, or are both applied in the case of LLaMA?
@ml.91068 ай бұрын
Thanks! Super helpful. It seems your channel hasn't covered the GPT model architecture. Would you be open to introducing that?
@AnqiSun-f3e Жыл бұрын
Great video. Thanks, 小乌!
@umarjamilai Жыл бұрын
Thank you 😸
@Sisco4047 ай бұрын
Love your videos, and I also love the meowing cat in the background 😂
@Charles-my2pb4 ай бұрын
Thanks for your video, that's awesome!
@saranyav25815 ай бұрын
Thank you for the amazing explanation
@MarcosVinicius-bd6bi8 ай бұрын
Fantastic video, thanks Umar!
@vincentabraham76908 ай бұрын
Hi Umar, I recently started studying LLMs and I loved your explanations of transformers and the LLaMA architecture. I wanted to know: is there any way to look at the attention weights of a model and gain insight into which specific portions of the text influenced the output prediction? Is there a beginner-friendly way to do this?
@MENGRUWANG-qk1ip7 ай бұрын
Hi Umar! Your explanation is really excellent! By the way, Llama 3 has been released; will you explain Llama 3 in a video as well?
@georgealexandruvlad7837 Жыл бұрын
Great explanation! 👌🏻
@Vignesh-ho2dn7 ай бұрын
Thanks for the very resourceful video
@samc636810 ай бұрын
Great explanation for all levels. Question: did you say around 2:55 that it's an encoder-only architecture? I read that it's decoder-only.
@utkarshsharma2323 ай бұрын
Best Video on LLaMA
@AmarshreeV2 ай бұрын
Excellent video on LLaMA. I am struggling with the encoding part; can you explain in detail how encoding happens in LLaMA? If it's already explained, can you tag the video?
@Gaurav-im1dfАй бұрын
Such a great video. Just a small request: can I use one of the pictures you used for explanation in my blog? It would kind of help others learn more. Waiting for your reply.
@umarjamilaiАй бұрын
Sure! Please remember to link back to the original video. Thanks!
@siddharthasubramaniyam Жыл бұрын
Great work man🙌
@just4visit11 ай бұрын
Hey! Is there any way you can avoid the background bangs when the slide changes, please?
@charlesriggins738511 ай бұрын
It's really useful. Thank you.
@eitancohen87176 ай бұрын
Hi, great explanations. BTW, is there a chance you could explain separately the torch.einsum operations shown in the code at 58:41?
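Not the exact code from 58:41, but a generic illustration of how torch.einsum typically shows up in attention code; shapes here are assumptions.

```python
import torch

batch, heads, seq_len, head_dim = 2, 4, 8, 16
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# "bhqd,bhkd->bhqk": for every batch b and head h, take the dot product of
# each query vector with each key vector, producing a (q, k) matrix of scores.
# Letters are just axis labels; the letter missing from the output (d) is the
# one that gets summed over.
scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / head_dim**0.5
weights = scores.softmax(dim=-1)

# "bhqk,bhkd->bhqd": weighted sum of value vectors for each query position.
out = torch.einsum("bhqk,bhkd->bhqd", weights, v)
print(out.shape)  # torch.Size([2, 4, 8, 16])
```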
@baiyouheng53659 ай бұрын
Good content, nice explanation, THANKSssss.
@kjkszpjab15107 ай бұрын
Brilliant, thank you.
@CCCeline_L3 ай бұрын
thank you!!! I gained a lot
@EugenioDeHoyos Жыл бұрын
Thank you!
@grownupgaming Жыл бұрын
1:01:42 Great video! What are these "Beam 1" and "Beam 4"?
@umarjamilai Жыл бұрын
Beam 1 indicates the greedy strategy for inference, while Beam 4 indicates "Beam search" with K = 4.
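A toy sketch of the difference, with a made-up next-token distribution (not any real model): Beam 1 / greedy keeps only the single most likely continuation at each step, while beam search with K = 4 keeps the 4 best partial hypotheses.

```python
import math
from typing import Dict, List, Tuple

def toy_next_token_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    # Hypothetical "language model": prefix -> distribution over the next token.
    table = {
        ("<sos>",): {"I": 0.6, "You": 0.4},
        ("<sos>", "I"): {"love": 0.5, "like": 0.5},
        ("<sos>", "You"): {"love": 0.9, "like": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})

def greedy_decode(start: Tuple[str, ...], steps: int) -> Tuple[str, ...]:
    seq = start
    for _ in range(steps):
        probs = toy_next_token_probs(seq)
        nxt = max(probs, key=probs.get)        # Beam 1: always take the argmax
        seq = seq + (nxt,)
        if nxt == "<eos>":
            break
    return seq

def beam_search(start: Tuple[str, ...], steps: int, k: int) -> List[Tuple[Tuple[str, ...], float]]:
    beams = [(start, 0.0)]                      # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished hypotheses are kept as-is
                continue
            for tok, p in toy_next_token_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]  # keep k best
    return beams

print(greedy_decode(("<sos>",), 3))
print(beam_search(("<sos>",), 3, k=4))
```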
@23조이연남 Жыл бұрын
This is the way!
@BadmintonTV200811 ай бұрын
really awesome work!
@localscope6454 Жыл бұрын
Beautiful, thx.
@kanakraj31986 ай бұрын
What I still don't understand is how the grouping happens in grouped-query attention. I also didn't understand the rotary positional encoding concept, but I will re-watch or read more.
@VivekKumar4-YrB.Tech.ChemicalEАй бұрын
At 2:55 it should be decoder, not encoder, since LLaMA is a decoder-only model that predicts the next token auto-regressively.
@npip998 ай бұрын
49:10 "Since the model is causal, we don't care about the attention of a token with its successors" ~ I mean, a simpler explanation is also that the matrix is symmetric anyway, right? Like regardless of whether or not we care, it would be duplicated values.
@cfalguiere10 ай бұрын
Thanks for sharing
@nikhiliyer8436Ай бұрын
Hi Umar, please make videos on the Llama 3 and Mistral architectures.
@0nlif3nce2 ай бұрын
Perfection!
@JaydeepPawar-n4v9 ай бұрын
great explanation bro!!!
@yunhuaji3038 Жыл бұрын
Thanks for the great video. At 48:56, you mentioned that we don't care about previous attentions. Does that mean we will trim the attention tensor from (SeqLen x d_model) to (1 x d_model)? If so, does GPT do the trim as well? I thought GPT uses the whole attention tensor (including the previous attention "vectors") to predict the next token. That seems redundant, but I wonder what they did exactly here. If they did use the whole attention tensor, does that mean GPT inference needs to cache Q, K and V instead of just K and V? Thank you.
@umarjamilai Жыл бұрын
Hi! I think you misunderstood my words: I said "we don't need to recompute the dot products again", because they have already been computed in the previous steps. The dot products for which we "don't care" are the ones above the principal diagonal of the attention scores matrix, because they're the ones masked out when we apply the causal mask (that is, we force each token to only attend tokens to its left). I suggest you watch my other video in which we code LLaMA from scratch in order to understand how this works in practice.
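A minimal single-head KV-cache sketch, with made-up weights and no batching, illustrating the reply above: only the newest token's query is computed at each step, the cached keys/values of earlier tokens are reused, and only one new row of attention scores is ever computed.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 16, 100
Wq = torch.randn(d_model, d_model)
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)
Wout = torch.randn(d_model, vocab)
embed = torch.randn(vocab, d_model)

k_cache, v_cache = [], []
token = torch.tensor(1)                         # pretend this is <sos>
for _ in range(5):
    x = embed[token]                            # embedding of the NEW token only
    q = x @ Wq                                  # query for the new position only
    k_cache.append(x @ Wk)                      # past keys/values are reused, not recomputed
    v_cache.append(x @ Wv)
    K = torch.stack(k_cache)                    # (cached_len, d_model)
    V = torch.stack(v_cache)
    attn = ((q @ K.T) / d_model**0.5).softmax(dim=-1)  # one row of the score matrix
    out = attn @ V                              # (d_model,)
    token = (out @ Wout).argmax()               # greedy next token
```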
@yunhuaji3038 Жыл бұрын
@@umarjamilai Thanks for the fast reply. I actually should have pointed the timestamp to 50:47 instead, but I believe you already got my question, which is about the resulting attention tensor. Do you mean that we will cache the resulting attention tensor as well as the K-V pairs? (I will meanwhile go watch your LLaMA video.) Thanks again.
@yunhuaji3038 Жыл бұрын
@@umarjamilai Oh... I figured it out... thanks a lot
@昊朗鲁20 күн бұрын
Thank you
@satpalsinghrathore266510 ай бұрын
Amazing video
@binfos74344 ай бұрын
Amazing!
@abdulahmed5610 Жыл бұрын
LLaMA is decoder-only, please check 2:50.
@umarjamilai Жыл бұрын
Hi! You can call it "Decoder-only" or "Encoder-only" interchangeably, because it's neither. To be a decoder, it should also have a cross-attention (which it lacks), to be an encoder it should not have a linear layer (which it does). So technically it can be an Encoder with a final linear layer or a Decoder without cross-attention. As a matter of fact, the "E" in BERT, which is also based on the Transformer model, stands for "Encoder". Have a nice day!