As many of you have asked: LLaMA 2's architecture is made up of the ENCODER side of the Transformer plus a Linear Layer and a Softmax. It can also be thought of as the DECODER of the Transformer, minus the Cross-Attention. Generally speaking, people call a model like LLaMA a Decoder-only model, while a model like BERT an Encoder-only model. From now on I will also stick to this terminology for my future videos.
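For anyone who wants to see that distinction in code, here is a minimal sketch of such a stack: encoder-style blocks with causal self-attention and no cross-attention, followed by a linear layer and softmax. Layer sizes, names, and the use of LayerNorm instead of RMSNorm are illustrative simplifications, not the actual LLaMA source.

```python
import torch
import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    """One block: self-attention + feed-forward, no cross-attention."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # LLaMA actually uses RMSNorm (pre-norm)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: each position may only attend to itself and the past.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)  # self-attention only
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x

class TinyDecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([DecoderOnlyBlock(d_model, n_heads) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)  # the extra linear layer

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.head(x).softmax(dim=-1)          # next-token distribution

model = TinyDecoderOnlyLM()
probs = model(torch.randint(0, 1000, (1, 5)))
print(probs.shape)  # torch.Size([1, 5, 1000])
```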
@aiyogiravi10 ай бұрын
Yeah, it makes sense now. Since we are not encoding anything to be used later via cross-attention, we will call this model a Decoder-only model. Edit: Really appreciate the effort you are putting in. Great channel :)
@haiphan9807 ай бұрын
Great video about LLaMA! I have one question regarding the inference steps. How does the input [SOS] predict "Love" at the beginning, when the model does not have any information about the input sentence? In the Transformer we have an encoder which encodes the whole input sentence before it goes to the decoder, providing the conditioning mechanism; in LLaMA we don't have that. [SOS] could predict any next word, so how does it know that it is "Love"?
@kanakraj31986 ай бұрын
@@haiphan980 That was just for understanding; he didn't show the prompt part. The model will first take your prompt and perform self-attention on it, and only then will it start predicting, so it will have information from your prompt on how to start.
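A minimal sketch of that idea, assuming a generic causal LM interface and made-up token ids (not any specific library's API): the whole prompt is processed first, so the first generated token is conditioned on the prompt, not just on [SOS].

```python
import torch

# `model` is assumed to map a (1, seq_len) tensor of token ids to
# (1, seq_len, vocab_size) logits for the next token at each position.
def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int, eos_id: int) -> torch.Tensor:
    tokens = prompt_ids                    # e.g. ids of "<sos> Translate: I love you"
    for _ in range(max_new_tokens):
        logits = model(tokens)             # self-attention over everything seen so far
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        tokens = torch.cat([tokens, next_id], dim=1)             # append and repeat
        if next_id.item() == eos_id:
            break
    return tokens
```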
@chandrahasaroori3177 күн бұрын
Ah phew, started googling things as soon as I heard it
@kqb5405 ай бұрын
Umar, Andrew Ng, 3Blue1Brown and Andrej are all you need. You are one of the best educators of deep learning. Thank you.
@xinyaoyin22382 ай бұрын
Nahh, Umar only. 3Blue1Brown only makes fancy but useless PPTs, and Andrej emphasizes Python tricks more than the actual transformer.
@huntersawyer93242 ай бұрын
I thought attention was all you need lol
@emir5146Ай бұрын
I think so, too.
@emir5146Ай бұрын
@@xinyaoyin2238 Andrej is doing the right thing.
@Umar-s6e16 күн бұрын
@@xinyaoyin2238 I agree, Umar only. Andrej makes it complex or covers too much depth.
@jyun-siouhuang227 күн бұрын
The most comprehensible tutorial I've found so far. Most blogs' explanations have some mistakes, but yours looks great!! Fascinating, much appreciated.
@mandarinboy Жыл бұрын
The best 1 hour I spent! I had so many questions exactly on all these topics and this video does an outstanding job at explaining enough details in an easy way!
@umarjamilai Жыл бұрын
Glad you liked it! I just posted another video on how to code LLaMA 2 from scratch! Check it out
@mojtabanourani9988 Жыл бұрын
The network came out in Feb 2023! This is a YouTube channel worth subscribing to. Thanks man
@muthukumarannm398 Жыл бұрын
I became your fan at 55:00, when you explain how GPU capability drives the development. 🙂
@Paluth Жыл бұрын
Thank you very much for your work. The community is blessed with such high quality presentations about difficult topics.
@TheMzbac10 ай бұрын
Very underrated video. Thanks for providing such a good lecture to the community
@MaxenceErnoult10 ай бұрын
This video, along with the previous ones about coding up transformers from scratch, are really outstanding. Thank you so much for taking such a tremendous amount of your free time to put all of this together!
@dgl32836 ай бұрын
This video could serve as the official textbook of the LLaMA architecture. Amazing.
@kunwar_divyanshu2 ай бұрын
This is the best explanation on the planet for LLM techniques and architecture.
@Engrbilal143 Жыл бұрын
Amazing. Just wow. I cannot find this stuff on the whole internet.
@Best9in Жыл бұрын
Thank you very much! I once read the paper, but I think watching your video provided me with more insights about this paper than reading it many more times would have.
@tubercn Жыл бұрын
Thanks for giving your free time to offer this valuable tutorial 👏👏👏 Hope you keep doing this, thanks again
@deema_c10 күн бұрын
You deserve a section in my thesis's Acknowledgments
@umarjamilai9 күн бұрын
Waiting for it ;-)
@jordanconnolly1046 Жыл бұрын
Really glad I found your channel. You create some of the most in-depth and easy-to-follow explanations I've been able to find.
@tusharmadaan54803 ай бұрын
Amazing explanation. Such thorough coverage of KV cache, GQA, RoPE, and SwiGLU.
@cobaltl85578 ай бұрын
This intro to llama is awesome ❤, thank you for making such a great video.
@mickelliu5559 Жыл бұрын
It's surprising that content of this quality is free.
@TheAero Жыл бұрын
You actually explained attention here better than the previous presentation!
@umarjamilai Жыл бұрын
Keep up with the journey! Watch my other video on how to code LLaMA 2 from scratch and you'll put into practice what you've learnt here.
@Koi03123 ай бұрын
Great video! Hope you will make a video explaining Llama 3 soon.
@Jc-jv3wj7 ай бұрын
Fantastic explanation of the LLaMA model. Please keep making this kind of video.
@siqb9 ай бұрын
My TLDR for the video (please point out the mistakes):
- LLaMA uses RMS normalization instead of LayerNorm because it provides the same benefits with less computation.
- LLaMA uses rotary embeddings. These act as a distance-based scaling of the dot-product score coming out of the queries and keys. In other words, two tokens X and Y that are close together will produce a larger score than two tokens X and Y that are far apart. This makes sense from the point of view that closer tokens should have a bigger say in the final representation of a given token than the ones far away. This is not the case for the vanilla transformer.
- LLaMA uses Grouped Query Attention as an alternative to vanilla attention, mostly to optimize GPU FLOPs (and the much slower memory access). Key slide at 1:03:00. In vanilla attention, each token (within each head) has its own key, query and value vector. In multi-query attention (MQA), there is only one key and value vector for all query heads. In between lies GQA, where a group of query heads (say 2-4) is mapped to one key and value vector.
- LLaMA uses the SwiGLU activation function since it works better.
- LLaMA uses 3 layers instead of 2 for the FFN part of the block but keeps the number of parameters the same.
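To make the grouping concrete, here is a minimal GQA sketch with made-up shapes (not the LLaMA source): each key/value head is shared by a group of query heads.

```python
import torch

batch, seq_len, head_dim = 1, 6, 8
n_q_heads, n_kv_heads = 8, 2           # 8 query heads share 2 KV heads
group_size = n_q_heads // n_kv_heads   # -> each KV head serves 4 query heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each KV head `group_size` times so it lines up with its query group.
# n_kv_heads = 1 would be multi-query attention; n_kv_heads = n_q_heads
# would be vanilla multi-head attention.
k = k.repeat_interleave(group_size, dim=1)   # (batch, n_q_heads, seq, dim)
v = v.repeat_interleave(group_size, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
out = scores.softmax(dim=-1) @ v
print(out.shape)  # torch.Size([1, 8, 6, 8])
```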
@jackoneil100020 күн бұрын
I might be incorrect, but I believe that a GLU cannot simply be thought of as a 3-layer FFN; it is more like a 2-layer FFN with an extra forget/update gate inspired by LSTMs/GRUs. The number of parameters is of course a hyperparameter at the end of the day, and the video mentioned that the hidden size was scaled by 2/3 to test whether Swish does better. After all, comparing activation functions while their layers have different parameter counts would be unfair, and the increased performance could be an effect of the higher parameter count instead of the better activation. Don't want to be the "Uhm... Actually" guy, just thought it might be useful.
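For reference, a minimal sketch of a SwiGLU feed-forward block of the kind discussed above, with the 2/3 hidden-size factor; dimensions and names are illustrative, not necessarily LLaMA's exact ones.

```python
import torch
import torch.nn as nn

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(2 / 3 * 4 * d_model)                  # 2/3 factor keeps param count comparable
        self.w1 = nn.Linear(d_model, hidden, bias=False)   # "update" path
        self.w3 = nn.Linear(d_model, hidden, bias=False)   # gate path (GLU-style)
        self.w2 = nn.Linear(hidden, d_model, bias=False)   # projection back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W1) elementwise-gated by (x W3), then projected by W2.
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFeedForward(64)
print(ffn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```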
@Tensordroid7 ай бұрын
One of the best explanations on youtube right now !!
@YuchenCao-e1v11 ай бұрын
Great video! Worth spending the time going over again and again. I actually saw your video on a Chinese site (the video has probably been re-uploaded by many people already), then I came here to find the author. Really well explained, thanks for sharing!
@umarjamilai11 ай бұрын
Which Chinese site? I'm a bit curious to take a look 😁
@feixyzliu543210 ай бұрын
@@umarjamilai Bilibili
@NJCLM10 ай бұрын
I didn't even see the time pass! Great work, you are a future rock star at teaching complex things in ML.
@ravimandliya1881 Жыл бұрын
Such an amazing step by step breakdown of concepts involved! Thank you so much.
@Charbel-n1k4 ай бұрын
Thank you for explaining these concepts in an easy way to understand!🎉
@UnknownHuman111104 ай бұрын
Amazing video ! Thanks for taking the time to explain core new concepts in language models
@librakevin19834 ай бұрын
The best machine learning videos I've ever watched. Thanks Umar!
@danish53268 ай бұрын
AMAZING! AMAZING AMAZING! Great work Umar .. Thanks a ton
@hieungo770 Жыл бұрын
Please keep making content like this. Thank you very much. I learned a lot.
@abhishekshinde-jb5pn10 ай бұрын
Your videos are the best, man!! Please keep releasing as much content as possible on the famous papers.
@umarjamilai Жыл бұрын
As always, the PDF slides are freely available on GitHub: github.com/hkproj/pytorch-llama-notes/
@visheshmittal468Ай бұрын
At 38:20 I think there is a mistake: Q != K != V even in self-attention. They are calculated using Wq, Wk, Wv, which are different weight matrices. When calculating the scores, e_1j = [q1·k1, q1·k2, ..., q1·kT], where q comes from the current token and k from all the other tokens; a_1j = softmax(e_1j); z_1 = sum_j(a_1j · v_j), where the v are computed as token_embed · Wv.
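A small single-head sketch of exactly those steps, with made-up dimensions: the same token embeddings go through three different projection matrices, then scores, softmax, and the weighted sum of values.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)            # token embeddings

Wq = torch.randn(d_model, d_model)
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)

Q, K, V = x @ Wq, x @ Wk, x @ Wv             # same input, different projections

e = (Q @ K.T) / d_model**0.5                 # e[i, j] = q_i · k_j (scaled)
a = e.softmax(dim=-1)                        # a[i, j] = softmax over j
z = a @ V                                    # z_i = sum_j a[i, j] * v_j
print(z.shape)                               # torch.Size([5, 16])
```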
@GrifinsBrother10 ай бұрын
Incredible explanation! You have a real predisposition for explaining material. Keep going!
@mprone7 ай бұрын
Although your name deceived me at first, I had no doubt you were a fellow Italian!
@umarjamilai7 ай бұрын
You'd have to pull the cadrega trick on me to see whether I'm really a Milanese 😂😇
@mprone7 ай бұрын
@@umarjamilai Have a seat there, Brambilla Jamil, make yourself comfortable, help yourself, take a cadrega. You never say no to a nice little cadrega!
@umarjamilai7 ай бұрын
@@mprone 😋 mmmhhh... this cadrega is delicious 🍎 Feel free to write to me on LinkedIn if you have any doubts or questions. Have a nice day!
@subhamkundu5043 Жыл бұрын
This is great. Thank you. It would be very helpful if you could also create a video on hands-on coding of a LLaMA model, the way you did for the vanilla transformer. Thanks in advance.
@umarjamilai Жыл бұрын
It's coming soon. Stay tuned!
@weicheng4608 Жыл бұрын
Same here. Eagerly waiting for a coding session for llama model.
@somebody-f8k Жыл бұрын
Fantastic video! Your explanations are very clear, thank you!
@saratbhargavachinni5544 Жыл бұрын
Thanks a lot, Great explanation of KV cache and Multi Query Attention.
@ahmetfirat23 Жыл бұрын
very informative video, details are clearly explained. thanks a lot
@barisdenizsaglam3 ай бұрын
Amazing illustrations and explanations.
@parmanandchauhan61825 ай бұрын
Great content, deep understanding after watching the video.
@TrelisResearch Жыл бұрын
Great channel and content Umar
@goelnikhils Жыл бұрын
Exceptional Video on LLaMA
@bipulbikramthapa825610 ай бұрын
Great video +1. I have a few queries about the LLaMA model. 1. In the architecture diagram, does one Nx represent a single layer of LLaMA? 2. Could you also please clarify how many Nx (layers) are used for the LLaMA-2 13B model and its variants? 3. Finally, what is the potential for distributed computation in LLaMA model inference? What are the possible breakpoints in the model from an architectural standpoint?
@vassilisworld Жыл бұрын
Another amazing video, Umar! You sure do know how to teach. It would be nice if you put the very influential papers to read into a repo! Did I hear a baby in the background? Also, given you are from Italy, there is a lovely video worth watching by Asianometry on 'Olivetti & the Italian Computer: What Could Have Been'. Thank you again for the hard work you put into this video.
@umarjamilai Жыл бұрын
Hello Vassilis! Thanks for the kind words and the suggestion! The voice in the background is from 奥利奥 (Oreo), our black and white cat 😺. Unfortunately I'm aware of Olivetti's history and what could have been. If you're curious, you should also check out Enrico Mattei, and what ENI could have been. Have a wonderful day! Hopefully in a few days I'll upload the video on how to code LLaMA from scratch
@haocongzhan180610 ай бұрын
You are doing a great job, thank you for tutoring me part by part! It helps a lot.
@umarjamilai10 ай бұрын
Are you in China? Let's connect on LinkedIn.
@haocongzhan180610 ай бұрын
@@umarjamilai Yes, I am. I've already added you!
@goelnikhils Жыл бұрын
Amazing explanation
@Angadsingh95 Жыл бұрын
Thank you for creating such quality content!
@taltlusty680410 ай бұрын
Great video!!! Thank you very much for enriching the community with such great explanations! Can you please share your slides?
@umarjamilai10 ай бұрын
Check the video description, there's a link.
@Tomcat34210 ай бұрын
You are doing god's work. Keep it up, and thank you.
@meili-ai Жыл бұрын
Very good explanation! Keep up the good work!
@hosseinhajipour7817 Жыл бұрын
Thanks for the videos. There are a few errors in the video, which I mention below: 1 - LLaMA is a decoder-only model. 2 - The size of Q and K is the same; however, they are not the "same" tensor.
@umarjamilai Жыл бұрын
Hi! 1 - To be the decoder, it should have the cross-attention, which it doesn't. The closest architecture is the encoder (the left side of the transformer model). People commonly call it "decoder-only" because we do not "encode" text into a latent representation, but rather just "generate text" from pre-trained embeddings. Technically, from an architectural point of view, it's more similar to an encoder, hence the name. 2 - Q, K and V have the same size and, in vanilla self-attention, also the same content (they are all the same input sequence before the projections). In LLaMA, because of the KV cache and the rotary positional encodings, which are applied only to Q and K, the content is different. Have a nice day!
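To illustrate that last point, here is a minimal rotary-embedding sketch with illustrative shapes (not the actual LLaMA code): the rotation is applied to Q and K only, while V is used unchanged.

```python
import torch

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # Treat each pair of dimensions as a complex number and rotate it by an
    # angle that grows with the token position (adjacent-pair convention).
    seq_len, dim = x.shape[-2], x.shape[-1]
    freqs = 1.0 / theta ** (torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    angles = torch.outer(torch.arange(seq_len).float(), freqs)       # (seq, dim/2)
    rot = torch.polar(torch.ones_like(angles), angles)               # e^{i * m * theta_k}
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = torch.view_as_real(x_complex * rot).flatten(-2)
    return x_rotated.type_as(x)

seq_len, head_dim = 6, 8
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
v = torch.randn(seq_len, head_dim)

q, k = apply_rope(q), apply_rope(k)   # positions enter only through Q and K
scores = (q @ k.T) / head_dim**0.5    # the dot product now depends on relative position
out = scores.softmax(dim=-1) @ v      # V is used as-is
print(out.shape)                      # torch.Size([6, 8])
```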
@TheFitsome6 күн бұрын
When I give my Nobel Prize acceptance speech, I will mention this channel 😆
@TianyiZhang-ns8pg6 ай бұрын
Thanks a lot. Looking forward to your next video!!!
@satviknaren96816 ай бұрын
Thank you for posting. Thank you for existing. Thank you.
@ethanhe42 Жыл бұрын
great illustration!
@saurabh73377 ай бұрын
Hello, many thanks for such great content. Really enjoyed your work. Sorry for being greedy, but may I request a video showing how a LLaMA model can be run effectively on local machines (with or without a GPU) for inference (say, with a custom Flask API)?
@berkk1993 Жыл бұрын
Your videos are great. Please keep going.
@moacirponti8 ай бұрын
Many thanks for the great video. From your explanation, LLaMA uses Rotary Positional Embedding (RoPE) as its positional encoding, applied to *each* Q and K vector just after the transformation by their respective W. In this case, I don't get what Relative Positional Encoding has to do with it (and why it was explained before Rotary PE). Is it because Rotary PE has connections with the relative positional method, or are both applied in the case of LLaMA?
@ml.91068 ай бұрын
Thanks! Super helpful. It seems your channel hasn't covered the GPT model architecture. Would you be open to introducing that?
@AnqiSun-f3e Жыл бұрын
Great video. Thanks, 小乌!
@umarjamilai Жыл бұрын
Thank you 😸
@Sisco4047 ай бұрын
Love your videos, and I also love the meowing cat in the background 😂
@Charles-my2pb4 ай бұрын
Thanks for your video, that's awesome!
@saranyav25815 ай бұрын
Thank you for the amazing explanation
@MarcosVinicius-bd6bi8 ай бұрын
Fantastic video, thanks Umar!
@vincentabraham76908 ай бұрын
Hi Umar, I recently started studying LLMs and I loved your explanations of transformers and the LLaMA architecture. I wanted to know: is there any way to look at the attention weights of a model and gain insight into which specific portions of the text influenced the output prediction? Is there a beginner-friendly way to do this?
@MENGRUWANG-qk1ip7 ай бұрын
Hi Umar! Your explanation is really excellent! By the way, Llama 3 has been released; will you explain Llama 3 in a video as well?
@georgealexandruvlad7837 Жыл бұрын
Great explanation! 👌🏻
@Vignesh-ho2dn7 ай бұрын
Thanks for the very resourceful video
@samc636810 ай бұрын
Great explanation for all levels. Question: did you say around 2:55 that it's an encoder-only architecture? I read that it's decoder-only.
@utkarshsharma2323 ай бұрын
Best Video on LLaMA
@AmarshreeV2 ай бұрын
Excellent video on LLaMA. I am struggling with the encoding part; can you explain in detail how encoding happens in LLaMA? If it's already explained, can you tag the video?
@Gaurav-im1dfАй бұрын
Such a great video. Just a small request: can I use one of the pictures you used for explanation in my blog? It would kind of help others learn more. Waiting for your reply.
@umarjamilaiАй бұрын
Sure! Please remember to link back to the original video. Thanks!
@siddharthasubramaniyam Жыл бұрын
Great work man🙌
@just4visit11 ай бұрын
Hey! Is there any way you can avoid the background bangs when the slide changes, please?
@charlesriggins738511 ай бұрын
It's really useful. Thank you.
@eitancohen87176 ай бұрын
Hi, great explanations. BTW, is there a chance you could explain separately the torch.einsum operations shown in the code at 58:41?
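Not the exact code from 58:41, but a generic illustration of how torch.einsum typically shows up in attention code; shapes here are assumptions.

```python
import torch

batch, heads, seq_len, head_dim = 2, 4, 8, 16
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# "bhqd,bhkd->bhqk": for every batch b and head h, take the dot product of
# each query vector with each key vector, producing a (q, k) matrix of scores.
# Letters are just axis labels; the letter missing from the output (d) is the
# one that gets summed over.
scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / head_dim**0.5
weights = scores.softmax(dim=-1)

# "bhqk,bhkd->bhqd": weighted sum of value vectors for each query position.
out = torch.einsum("bhqk,bhkd->bhqd", weights, v)
print(out.shape)  # torch.Size([2, 4, 8, 16])
```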
@baiyouheng53659 ай бұрын
Good content, nice explanation, THANKSssss.
@kjkszpjab15107 ай бұрын
Brilliant, thank you.
@CCCeline_L3 ай бұрын
thank you!!! I gained a lot
@EugenioDeHoyos Жыл бұрын
Thank you!
@grownupgaming Жыл бұрын
1:01:42 Great video! What are these "Beam 1" and "Beam 4"?
@umarjamilai Жыл бұрын
Beam 1 indicates the greedy strategy for inference, while Beam 4 indicates "Beam search" with K = 4.
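A toy sketch of the difference, with a made-up next-token distribution (not any real model): Beam 1 / greedy keeps only the single most likely continuation at each step, while beam search with K = 4 keeps the 4 best partial hypotheses.

```python
import math
from typing import Dict, List, Tuple

def toy_next_token_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    # Hypothetical "language model": prefix -> distribution over the next token.
    table = {
        ("<sos>",): {"I": 0.6, "You": 0.4},
        ("<sos>", "I"): {"love": 0.5, "like": 0.5},
        ("<sos>", "You"): {"love": 0.9, "like": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})

def greedy_decode(start: Tuple[str, ...], steps: int) -> Tuple[str, ...]:
    seq = start
    for _ in range(steps):
        probs = toy_next_token_probs(seq)
        nxt = max(probs, key=probs.get)        # Beam 1: always take the argmax
        seq = seq + (nxt,)
        if nxt == "<eos>":
            break
    return seq

def beam_search(start: Tuple[str, ...], steps: int, k: int) -> List[Tuple[Tuple[str, ...], float]]:
    beams = [(start, 0.0)]                      # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished hypotheses are kept as-is
                continue
            for tok, p in toy_next_token_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]  # keep k best
    return beams

print(greedy_decode(("<sos>",), 3))
print(beam_search(("<sos>",), 3, k=4))
```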
@23조이연남 Жыл бұрын
This is the way!
@BadmintonTV200811 ай бұрын
really awesome work!
@localscope6454 Жыл бұрын
Beautiful, thx.
@kanakraj31986 ай бұрын
What I still don't understand is how the grouping happens in grouped-query attention. I also didn't understand the rotary positional encoding concept, but I will re-watch or read more.
@VivekKumar4-YrB.Tech.ChemicalEАй бұрын
At 2:55 it should be decoder, not encoder, since LLaMA is a decoder-only model that predicts the next token auto-regressively.
@npip998 ай бұрын
49:10 "Since the model is causal, we don't care about the attention of a token with its successors" ~ I mean, a simpler explanation is also that the matrix is symmetric anyway, right? Like regardless of whether or not we care, it would be duplicated values.
@cfalguiere10 ай бұрын
Thanks for sharing
@nikhiliyer8436Ай бұрын
Hi Umar, please make videos on the Llama 3 and Mistral architectures.
@0nlif3nce2 ай бұрын
Perfection!
@JaydeepPawar-n4v9 ай бұрын
great explanation bro!!!
@yunhuaji3038 Жыл бұрын
Thanks for the great video. At 48:56, you mentioned that we don't care about previous attentions. Does that mean we will trim the attention tensor from (SeqLen x d_model) to (1 x d_model)? If so, does GPT do the trim as well? I thought GPT uses the whole attention tensor (including the previous attention "vectors") to predict the next token. That seems redundant, but I wonder what they did exactly here. If they did use the whole attention tensor, does that mean GPT inference needs to cache Q, K and V instead of just K and V? Thank you.
@umarjamilai Жыл бұрын
Hi! I think you misunderstood my words: I said "we don't need to recompute the dot products again", because they have already been computed in the previous steps. The dot products for which we "don't care" are the ones above the principal diagonal of the attention scores matrix, because they're the ones masked out when we apply the causal mask (that is, we force each token to only attend tokens to its left). I suggest you watch my other video in which we code LLaMA from scratch in order to understand how this works in practice.
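A minimal single-head KV-cache sketch, with made-up weights and no batching, illustrating the reply above: only the newest token's query is computed at each step, the cached keys/values of earlier tokens are reused, and only one new row of attention scores is ever computed.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 16, 100
Wq = torch.randn(d_model, d_model)
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)
Wout = torch.randn(d_model, vocab)
embed = torch.randn(vocab, d_model)

k_cache, v_cache = [], []
token = torch.tensor(1)                         # pretend this is <sos>
for _ in range(5):
    x = embed[token]                            # embedding of the NEW token only
    q = x @ Wq                                  # query for the new position only
    k_cache.append(x @ Wk)                      # past keys/values are reused, not recomputed
    v_cache.append(x @ Wv)
    K = torch.stack(k_cache)                    # (cached_len, d_model)
    V = torch.stack(v_cache)
    attn = ((q @ K.T) / d_model**0.5).softmax(dim=-1)  # one row of the score matrix
    out = attn @ V                              # (d_model,)
    token = (out @ Wout).argmax()               # greedy next token
```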
@yunhuaji3038 Жыл бұрын
@@umarjamilai Thanks for the fast reply. I actually should have pointed the timestamp to 50:47 instead, but I believe you already got my question, which is about the resulting attention tensor. Do you mean that we will cache the resulting attention tensor as well as the K-V pairs? (I will meanwhile go watch your LLaMA video.) Thanks again.
@yunhuaji3038 Жыл бұрын
@@umarjamilai Oh... I figured it out... thanks a lot
@昊朗鲁20 күн бұрын
Thank you
@satpalsinghrathore266510 ай бұрын
Amazing video
@binfos74344 ай бұрын
Amazing!
@abdulahmed5610 Жыл бұрын
LLaMA is decoder-only, please check 2:50.
@umarjamilai Жыл бұрын
Hi! You can call it "Decoder-only" or "Encoder-only" interchangeably, because it's neither. To be a decoder, it should also have a cross-attention (which it lacks), to be an encoder it should not have a linear layer (which it does). So technically it can be an Encoder with a final linear layer or a Decoder without cross-attention. As a matter of fact, the "E" in BERT, which is also based on the Transformer model, stands for "Encoder". Have a nice day!