As many of you have asked: LLaMA 2's architecture is made up of the ENCODER side of the Transformer plus a Linear Layer and a Softmax. It can also be thought of as the DECODER of the Transformer, minus the Cross-Attention. Generally speaking, people call a model like LLaMA a Decoder-only model, and a model like BERT an Encoder-only model. From now on I will stick to this terminology in my future videos as well.
@aiyogiravi Жыл бұрын
Yeah, it makes sense now. Since we are not encoding any input and feeding it to cross-attention later, we call this model a Decoder-only model. Edit: Really appreciate the effort you are putting in. Great channel :)
@haiphan9809 ай бұрын
Great video about LLaMA! I have one question regarding the inference steps. How does the input [SOS] predict "Love" at the beginning, when the model does not have any information about the input sentence? In the Transformer, the encoder encodes the whole input sentence before it goes to the decoder, which conditions the generation on the input; in LLaMA, however, we don't have that. [SOS] could predict any next word, so how does it know it should be "Love"?
@kanakraj31988 ай бұрын
@@haiphan980 That was just for illustration; he didn't show the prompt part. The model will first take your prompt and perform self-attention on it, and only then will it start predicting, so it will have the information from your prompt on how to start.
@chandrahasaroori3172 ай бұрын
Ah phew, started googling things as soon as I heard it
@kqb5407 ай бұрын
Umar, Andrew Ng, 3Blue1Brown and Andrej are all you need. You are one of the best educators of deep learning. Thank you.
@xinyaoyin22385 ай бұрын
nahh, Umar only. 3Blue1Brown only makes fancy but useless slides, and Andrej emphasizes Python tricks more than the actual transformer
@huntersawyer93244 ай бұрын
I thought attention was all you need lol
@emir51463 ай бұрын
I think so, too.
@emir51463 ай бұрын
@@xinyaoyin2238 Andrej is doing the right thing
@Umar-s6e3 ай бұрын
@@xinyaoyin2238 I agree, Umar only; Andrej makes it complex or covers too much depth.
@mandarinboy Жыл бұрын
The best 1 hour I spent! I had so many questions exactly on all these topics and this video does an outstanding job at explaining enough details in an easy way!
@umarjamilai Жыл бұрын
Glad you liked it! I just posted another video on how to code LLaMA 2 from scratch! Check it out
@leodamato968524 күн бұрын
Man, it's my first time watching one of your videos. Your way of teaching is just perfect: straight to the point, clear and simple. Thanks a lot!
@mojtabanourani9988 Жыл бұрын
The network came out in Feb 2023! This is a YouTube channel worth subscribing to. Thanks man
@jyun-siouhuang222 ай бұрын
The most comprehensible tutorial I've found so far. Most blogs' explanations have some mistakes, but yours looks great!! Fascinating, much appreciated.
@Paluth Жыл бұрын
Thank you very much for your work. The community is blessed with such high quality presentations about difficult topics.
@MaxenceErnoult Жыл бұрын
This video, along with the previous ones about coding up transformers from scratch, are really outstanding. Thank you so much for taking such a tremendous amount of your free time to put all of this together!
@TheMzbac Жыл бұрын
Very underrated video. Thanks for providing such a good lecture to the community
@muthukumarannm398 Жыл бұрын
I became your fan at 55:00, when you explain how GPU capability drives the development. 🙂
@umarjamilai Жыл бұрын
As always, the PDF slides are freely available on GitHub: github.com/hkproj/pytorch-llama-notes/
@ningjason987313 күн бұрын
Best LLM video for students!!! I will recommend your video to all my group mates.
@cobaltl855710 ай бұрын
This intro to llama is awesome ❤, thank you for making such a great video.
@xugefu23 күн бұрын
Thanks!
@tubercn Жыл бұрын
Thanks for giving your free time and offering this valuable tutorial 👏👏👏 I hope you keep doing this, thanks again
@dgl32838 ай бұрын
This video could be the official textbook of the LLaMA architecture. Amazing.
@millenniumbismay382Ай бұрын
Thank you! It has been a life saver :) It is definitely the best explanation of such important concepts by a long way! Thank you so much.
@Engrbilal143 Жыл бұрын
Amazing. Just wow. I cannot find this stuff anywhere else on the internet.
@tusharmadaan54805 ай бұрын
Amazing explanation. Such thorough coverage of KV cache, GQA, RoPE, and SwiGLU.
@jordanconnolly1046 Жыл бұрын
Really glad I found your channel. You create some of the most in-depth and easy-to-follow explanations I've been able to find.
@Tensordroid10 ай бұрын
One of the best explanations on youtube right now !!
@danish532611 ай бұрын
AMAZING! AMAZING AMAZING! Great work Umar .. Thanks a ton
@Best9in Жыл бұрын
Thank you very much! I once read the paper, but I think watching your video provided me with more insights about this paper than reading it many more times would have.
@Jc-jv3wj9 ай бұрын
Fantastic explanation on LLaMA model. Please keep making this kind of videos.
@Charbel-n1k6 ай бұрын
Thank you for explaining these concepts in an easy way to understand!🎉
@TheAero Жыл бұрын
You actually explained attention here better than the previous presentation!
@umarjamilai Жыл бұрын
Keep up with the journey! Watch my other video on how to code LLaMA 2 from scratch and you'll put to practice what you've learnt here
@kunwar_divyanshu4 ай бұрын
This is the best explanation on the planet for LLM techniques and architecture.
@UnknownHuman111107 ай бұрын
Amazing video ! Thanks for taking the time to explain core new concepts in language models
@librakevin19837 ай бұрын
The best machine learning videos I've ever watched. Thanks Umar!
@YuchenCao-e1v Жыл бұрын
Great video! Worth spending the time going over and over again. I actually saw your video on a Chinese site (the video has probably been re-shared by many other people already), then I came here to find the author. Superbly explained, thank you for sharing!
@umarjamilai Жыл бұрын
Which Chinese site? I kind of want to take a look 😁
@feixyzliu5432 Жыл бұрын
@@umarjamilai Bilibili
@NJCLM Жыл бұрын
I didn't even notice the time passing! Great work, you are a future rock star at teaching complex things in ML.
@xujiacao67769 ай бұрын
This video is great!
@siqb11 ай бұрын
My TLDR for the video (please point out the mistakes):
- LLaMA uses RMS normalization instead of LayerNorm because it provides the same benefits with less computation.
- LLaMA uses rotary embeddings. These act as a distance-based scaling of the original dot-product score coming out of the queries and keys. In other words, two tokens X and Y that are close together produce a larger score than two tokens X and Y that are far apart. This makes sense from the point of view that closer tokens should have a bigger say in the final representation of a given token than the ones far away. This is not the case for the vanilla transformer.
- LLaMA uses Grouped Query Attention as an alternative to vanilla attention, mostly to optimize for GPU FLOPs (and the GPU's much slower memory access). Key slide at 1:03:00. In vanilla attention, each token (within each head) has its own key, query and value vector. In multi-query attention (MQA), there is only one key and value head shared by all query heads. In between lies GQA, where a small group of query heads (say 2-4) is mapped to one key and value head.
- LLaMA uses the SwiGLU activation function since it works better.
- LLaMA uses 3 layers instead of 2 for the FFN part of the encoder block, but keeps the number of parameters the same.
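For anyone who wants the RMSNorm point concrete, here is a minimal PyTorch sketch (my own paraphrase, not the code from the video; dim and eps are the usual hyperparameters):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Rescales by the root-mean-square of the features only (no mean subtraction, no bias),
    # then applies a learnable gain; cheaper than LayerNorm, which also recenters.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Usage: y = RMSNorm(16)(torch.randn(2, 5, 16))  # same shape out, torch.Size([2, 5, 16])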
@jackoneil10003 ай бұрын
I might be incorrect, but I believe that a GLU cannot simply be thought of as a 3-layer FFN, because it is more like a 2-layer FFN with an extra forget/update gate inspired by LSTMs / GRUs. The number of parameters is of course a hyperparameter at the end of the day, and the video mentioned that the parameters were scaled by 2/3 to test whether Swish does better. After all, comparing activation functions while their layers have different parameter counts would be unfair, and the increased performance could be an effect of the higher parameter count instead of a better activation. Don't want to be the "Uhm... Actually" guy, just thought it might be useful.
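For reference, a rough PyTorch sketch of that gated feed-forward shape (just my illustration of the structure under discussion, not code taken from the video; w1/w2/w3 and the sizes are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # Three weight matrices, but functionally a gated 2-layer FFN:
    # silu(w1(x)) acts as the gate that modulates w3(x) before the down-projection w2.
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# With hidden_dim scaled by roughly 2/3 of a plain 2-layer FFN's hidden size,
# the parameter count stays comparable, which makes the comparison fair.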
@ravimandliya1881 Жыл бұрын
Such an amazing step by step breakdown of concepts involved! Thank you so much.
@abhishekshinde-jb5pn Жыл бұрын
Your videos are the best man !! Please keep releasing as much content as possible, on the famous papers
@hieungo-ai Жыл бұрын
Please keep making content like this. Thank you very much. I learnt a lot.
@GrifinsBrother Жыл бұрын
Incredible explanation! You have a real predisposition for explaining material. Keep going!
@vatsalkhetan40462 ай бұрын
YOU ARE SO AMAZING UMAR. THANKS A LOT MAN -
@deema_c2 ай бұрын
You deserve a section in my thesis's Acknowledgments
@umarjamilai2 ай бұрын
Waiting for it ;-)
@AntonyWilson Жыл бұрын
Thanks!
@TheFitsome2 ай бұрын
When I give my Nobel Prize acceptance speech, I will mention this channel 😆
@barisdenizsaglam5 ай бұрын
Amazing illustrations and explanations.
@mickelliu5559 Жыл бұрын
It's surprising that content of this quality is free.
@TianyiZhang-ns8pg8 ай бұрын
Thanks a lot. Looking forward to your next video!!!
@saratbhargavachinni5544 Жыл бұрын
Thanks a lot, Great explanation of KV cache and Multi Query Attention.
@parmanandchauhan61828 ай бұрын
Great content, I gained a deep understanding after watching the video.
@haocongzhan1806 Жыл бұрын
You are doing a great job, thank you for tutoring me part by part! It helps a lot.
@umarjamilai Жыл бұрын
Are you in China? Let's connect on LinkedIn.
@haocongzhan1806 Жыл бұрын
@@umarjamilai Yes, I am. I've already added you!
@Sisco40410 ай бұрын
Love your videos, and I also love the meowing cat in background😂
@somebody-f8k Жыл бұрын
Fantastic video! Your explanations are very clear, thank you!
@Charles-my2pb6 ай бұрын
Thanks for your video, that's awesome!
@goelnikhils Жыл бұрын
Exceptional Video on LLaMA
@Tomcat342 Жыл бұрын
You are doing God's work. Keep it up, and thank you.
@MarcosVinicius-bd6bi10 ай бұрын
Fantastic video, thanks Umar!
@Angadsingh95 Жыл бұрын
Thank you for creating such quality content!
@bosepukurАй бұрын
Can't thank you enough for your effort.
@TrelisResearch Жыл бұрын
Great channel and content Umar
@ahmetfirat23 Жыл бұрын
very informative video, details are clearly explained. thanks a lot
@BishakhaBiswas-q8t9 күн бұрын
Very, very effective video. Thank you so much ❤️
@utkarshsharma2326 ай бұрын
Best Video on LLaMA
@saranyav25818 ай бұрын
Thank you for the amazing explanation
@satviknaren96819 ай бұрын
Thank you for posting. Thank you for existing. Thank you.
@ethanhe42 Жыл бұрын
great illustration!
@meili-ai Жыл бұрын
Very good explanation! Keep up the good work!
@safwanmohammed77152 ай бұрын
Superb explanation
@AnqiSun-f3e Жыл бұрын
Great video. Thanks, 小乌!
@umarjamilai Жыл бұрын
Thank you 😸
@vassilisworld Жыл бұрын
Another amazing video, Umar! You really do know how to teach. It would be nice if you put together a repo of very influential papers to read! Did I hear a baby in the background? Also, given you are from Italy, there is a lovely video worth watching by Asianometry on 'Olivetti & the Italian Computer: What Could Have Been'. Thank you again for the hard work you put into this video.
@umarjamilai Жыл бұрын
Hello Vassilis! Thanks for the kind words and the suggestion! The voice in the background is from 奥利奥 (Oreo), our black and white cat 😺. Unfortunately I'm aware of Olivetti's history and what could have been. If you're curious, you should also check out Enrico Mattei, and what ENI could have been. Have a wonderful day! Hopefully in a few days I'll upload the video on how to code LLaMA from scratch
@Vignesh-ho2dn10 ай бұрын
Thanks for the very resourceful video
@DiegoSilva-dv9uf Жыл бұрын
Thanks! (Valeu!)
@umarjamilai Жыл бұрын
Thank you very very very very much
@siddharthasubramaniyam Жыл бұрын
Great work man🙌
@goelnikhils Жыл бұрын
Amazing explanation
@subhamkundu5043 Жыл бұрын
This is great, thank you. It would be very helpful if you could also create a video on hands-on coding of a LLaMA model, the way you did for the vanilla transformer. Thanks in advance.
@umarjamilai Жыл бұрын
It's coming soon. Stay tuned!
@weicheng4608 Жыл бұрын
Same here. Eagerly waiting for a coding session for the LLaMA model.
@visheshmittal4684 ай бұрын
At 38:20 I think there is a mistake: Q != K != V even in self-attention. They are calculated using Wq, Wk and Wv, which are different weight matrices. When calculating the scores, e_1j = [q1·k1, q1·k2, ..., q1·kT], where q comes from the current token and the k's from all the other tokens; then a_1j = softmax(e_1j) and z_1 = sum_j(a_1j * v_j), where each v_j is computed as mul(token_embed_j, Wv).
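A tiny runnable sketch of what I mean, with three distinct projection matrices and standard scaled dot-product attention for a single query (my own illustration, not code from the video):

import torch

def attention_for_token_1(x, Wq, Wk, Wv):
    # x: (T, d_model) token embeddings; Wq, Wk, Wv: (d_model, d_head), three different matrices
    q = x @ Wq                                   # queries
    k = x @ Wk                                   # keys (different weights, so K != Q)
    v = x @ Wv                                   # values
    e1 = (q[0] @ k.T) / (k.shape[-1] ** 0.5)     # e_1j = q1 . kj for j = 1..T, scaled
    a1 = torch.softmax(e1, dim=-1)               # a_1j = softmax(e_1j)
    z1 = a1 @ v                                  # z_1 = sum_j a_1j * v_j
    return z1

T, d_model, d_head = 5, 16, 8
x = torch.randn(T, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_head) for _ in range(3))
print(attention_for_token_1(x, Wq, Wk, Wv).shape)  # torch.Size([8])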
@georgealexandruvlad7837 Жыл бұрын
Great explanation! 👌🏻
@berkk1993 Жыл бұрын
Your videos are great. Please keep going.
@charlesriggins7385 Жыл бұрын
It's really useful. Thank you.
@taltlusty6804 Жыл бұрын
Great video!!! Thank you very much for enriching the community with such great explanations! Can you please share your slides?
@umarjamilai Жыл бұрын
Check the video description, there's a link.
@VivekKumar4-YrB.Tech.ChemicalE3 ай бұрын
At 2:55 it should say decoder, not encoder, since LLaMA is a decoder-only model that does next-token prediction auto-regressively.
@mprone10 ай бұрын
Although your name deceived me at first, I had no doubt you were a fellow Italian!
@umarjamilai10 ай бұрын
You'd have to pull the 'cadrega' trick on me to see if I'm really Milanese 😂😇
@mprone10 ай бұрын
@@umarjamilai Listen here, Brambilla Jamil, make yourself comfortable, help yourself, take a 'cadrega'. A nice little 'cadrega' is never to be refused!
@umarjamilai10 ай бұрын
@@mprone 😋 mmmhhh... this 'cadrega' is delicious 🍎 Feel free to write to me on LinkedIn if you have any doubts or questions. Have a nice day!
@BadmintonTV2008 Жыл бұрын
really awesome work!
@bipulbikramthapa8256 Жыл бұрын
Great video, +1. I have a few queries about the LLaMA model. 1. In the architecture diagram, does one Nx represent a single layer of LLaMA? 2. Could you also please clarify how many Nx layers are used in the LLaMA-2 13B model and its variants? 3. Finally, what is the potential for distributed computation in LLaMA model inference? What are the possible breakpoints in the model from an architectural standpoint?
@baiyouheng536511 ай бұрын
Good content, nice explanation. Thanks!
@CCCeline_L5 ай бұрын
thank you!!! I gained a lot
@kjkszpjab151010 ай бұрын
Brilliant, thank you.
@eitancohen87179 ай бұрын
Hi, great explanations. By the way, is there any chance you could explain separately the torch.einsum operations shown in the code at 58:41?
@saurabh73379 ай бұрын
Hello, many thanks for such great content. I really enjoyed your work. Sorry for being greedy, but may I request a video showing how a LLaMA model can be run effectively on local machines (with or without a GPU) for inference (say, with a custom Flask API)?
@samc6368 Жыл бұрын
Great explanation for all levels. Question: did you say around 2:55 that it's an encoder-only architecture? I read that it's decoder-only.
@JaydeepPawar-n4v Жыл бұрын
great explanation bro!!!
@vincentabraham769011 ай бұрын
Hi Umar, I recently started studying LLMs and I loved your explanations of transformers and the LLaMA architecture. I wanted to know: is there any way to look at the attention weights of a model and gain insight into which specific portions of the text influenced the output prediction? Is there a beginner-friendly way to do this?
@feixyzliu5432 Жыл бұрын
I'm wondering why multi-head attention with KV cache uses O(bnd^2) operations at 1:00:01; isn't it O(bnd + bd^2)? Could you please explain this or point me to some references?
@chintabhanushri70802 ай бұрын
Really commendable.
@moacirponti10 ай бұрын
Many thanks for the great video. From your explanation, LLaMA uses Rotary Positional Embedding (RoPE) as its positional encoding, applied to *each* Q and K vector just after the transformation by their respective W. In this case, I don't get what Relative Positional Encoding has to do with it (and why it was explained before Rotary PE). Is it because Rotary PE has connections with the relative positional method, or are both applied in the case of LLaMA?
@varunsaagars Жыл бұрын
🎯 Key Takeaways for quick navigation:
00:00 📹 This video explains LLaMA, its structural differences from the Transformer, and builds each block of LLaMA from a conceptual, mathematical, and coding perspective.
00:56 🧩 LLaMA has only an encoder, as it's designed for the next-token prediction task, and it uses self-attention with KV cache and rotary positional embeddings.
03:38 📊 LLaMA uses RMS normalization before every block, replacing the layer normalization used in the original Transformer.
04:17 🧮 LLaMA features grouped multi-query attention, RMSNorm, and SwiGLU feed-forward layers, offering architectural differences from the Transformer.
05:38 🔢 LLaMA models come in various sizes, with different dimensions, numbers of layers, and tokens trained on, making them adaptable to specific tasks.
09:05 📊 Layer normalization is used to address internal covariate shift, recentering features around zero mean and rescaling to unit variance.
19:11 🔲 Root Mean Square (RMS) normalization is introduced in LLaMA, simplifying computation compared to layer normalization while achieving similar results.
24:28 🔍 Positional encodings in the vanilla Transformer are fixed vectors added to token embeddings to represent their absolute positions in the sentence.
25:23 🔄 Rotary positional encodings deal with two tokens at a time and represent the distance between them, used in attention mechanisms.
28:39 🧮 Rotary positional encodings involve a function that depends on the embeddings and the relative distances between tokens.
32:33 🧐 Computationally efficient rotary positional encodings are used by LLaMA, avoiding unnecessary operations.
34:37 📉 Rotary positional encodings exhibit long-term decay, reducing the strength of relationships between distant tokens.
40:33 🔄 The KV cache is used during inference in Transformer models to avoid redundant computations and speed up next-token prediction.
46:56 🔁 With the KV cache, previously computed dot products between tokens are not recomputed, reducing computation in subsequent steps of the inference process.
49:37 🧠 The KV cache is used in LLaMA to reduce redundant calculations in self-attention by storing keys and values and updating them as new tokens are added, improving inference speed.
54:13 📈 Multi-query attention is introduced to optimize memory access in Transformer models and reduce the bottleneck caused by data transfer in GPUs.
01:00:51 ⚡️ Grouped multi-query attention divides queries into groups and assigns different heads for keys and values to strike a balance between model quality and speed.
01:06:31 🔄 The SwiGLU activation function with β=1 is used in LLaMA's feedforward network, showing good performance in various benchmarks, although the reason for its effectiveness is not fully explained in the paper.
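For anyone who wants to play with the rotary-embedding idea from 25:23-32:33, here is a rough PyTorch sketch (my own simplification, not the video's code; precompute_freqs and apply_rotary are made-up helper names):

import torch

def precompute_freqs(head_dim: int, seq_len: int, theta: float = 10000.0):
    # One rotation frequency per pair of dimensions, as in the RoPE formulation.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)   # (seq_len, head_dim / 2)
    return torch.polar(torch.ones_like(angles), angles)          # complex e^(i * m * theta_k)

def apply_rotary(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, head_dim). Pair up dimensions as complex numbers and rotate each pair
    # by an angle proportional to the token position; applied to Q and K only.
    x_complex = torch.view_as_complex(x.float().reshape(x.shape[0], -1, 2))
    return torch.view_as_real(x_complex * freqs_cis).reshape(x.shape)

# The dot product between a rotated q and a rotated k then depends on their relative distance.
seq_len, head_dim = 10, 8
q = torch.randn(seq_len, head_dim)
print(apply_rotary(q, precompute_freqs(head_dim, seq_len)).shape)  # torch.Size([10, 8])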
@MENGRUWANG-qk1ip9 ай бұрын
Hi Umar! Your explanation is really excellent! By the way, LLaMA 3 has been released; will you continue and explain LLaMA 3 in a video?
@bosepukurАй бұрын
In the self-attention with KV cache section (53:39), you mention that by using only one token in Q, instead of the new token plus the previously generated tokens, we generate only one attention output (i.e. one orange block in your illustration). However, the attention vector is always 1xn, where n is the number of tokens generated so far, and in the orange block we are generating a new token from Q, K and V... Is my understanding correct that by "a new attention" you meant a new token?
@ml.910610 ай бұрын
Thanks! Super helpful. It seems your channel hasn't covered the GPT model architecture; would you be open to introducing that?
@just4visit Жыл бұрын
Hey! Is there any way you can avoid the loud background bang when the slide changes, please?
@yunhuaji3038 Жыл бұрын
Thanks for the great video. At 48:56, you mentioned that we don't care about the previous attentions. Does that mean we trim the attention tensor from (SeqLen x d_model) to (1 x d_model)? If so, does GPT do the trimming as well? I thought GPT uses the whole attention tensor (including the previous attention "vectors") to predict the next token. That seems redundant, but I wonder what exactly they do here. If they did use the whole attention tensor, does that mean GPT inference needs to cache Q, K and V instead of just K and V? Thank you.
@umarjamilai Жыл бұрын
Hi! I think you misunderstood my words: I said "we don't need to recompute the dot products again", because they have already been computed in the previous steps. The dot products for which we "don't care" are the ones above the principal diagonal of the attention scores matrix, because they're the ones masked out when we apply the causal mask (that is, we force each token to only attend tokens to its left). I suggest you watch my other video in which we code LLaMA from scratch in order to understand how this works in practice.
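To make it concrete, here is a rough single-head sketch of one decoding step with a KV cache (my own simplification, not the code from the video; all names are placeholders):

import torch

def attention_step_with_cache(x_new, Wq, Wk, Wv, cache_k, cache_v):
    # x_new: (1, d_model), only the latest token enters as a query.
    q = x_new @ Wq                                          # (1, d_head): a single row of Q
    cache_k = torch.cat([cache_k, x_new @ Wk], dim=0)       # keys of all tokens so far, (t, d_head)
    cache_v = torch.cat([cache_v, x_new @ Wv], dim=0)       # values of all tokens so far
    scores = (q @ cache_k.T) / (cache_k.shape[-1] ** 0.5)   # (1, t): one new row of dot products
    out = torch.softmax(scores, dim=-1) @ cache_v           # (1, d_head): output for the new token only
    return out, cache_k, cache_v                            # earlier rows are never recomputed

d_model, d_head = 16, 8
Wq, Wk, Wv = (torch.randn(d_model, d_head) for _ in range(3))
cache_k = cache_v = torch.empty(0, d_head)
for _ in range(4):                                          # process 4 tokens one at a time
    out, cache_k, cache_v = attention_step_with_cache(torch.randn(1, d_model), Wq, Wk, Wv, cache_k, cache_v)
print(out.shape, cache_k.shape)  # torch.Size([1, 8]) torch.Size([4, 8])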
@yunhuaji3038 Жыл бұрын
@@umarjamilai Thanks for the fast reply. I actually should have pointed the timestamp to 50:47 instead, but I believe you already got my question, which is about the output attention tensor. Do you mean that we cache the resulting attention tensor as well as the K-V pairs? (I will meanwhile go watch your LLaMA video.) Thanks again.
@yunhuaji3038 Жыл бұрын
@@umarjamilai Oh... I figured it out... thanks a lot
@satpalsinghrathore2665 Жыл бұрын
Amazing video
@hosseinhajipour7817 Жыл бұрын
Thanks for the videos. There are a few errors in the video, which I mention below: 1 - LLaMA is a decoder-only model. 2 - The sizes of Q and K are the same; however, they are not the "same" tensor.
@umarjamilai Жыл бұрын
Hi! 1 - To be the decoder, it should have cross-attention, which it doesn't. The closest architecture is the encoder (the left side of the transformer model). People commonly call it "decoder only" because we do not "encode" text into a latent representation, but rather just "generate text" from pre-trained embeddings. Technically, from an architecture point of view, it's more similar to an encoder, hence the name. 2 - Q, K and V have the same size, and have the same content in self-attention, at least in the vanilla transformer. In LLaMA, because of the KV cache and the positional encodings, which are only applied to Q and K, the content is different. Have a nice day!