As many of you have asked: LLaMA 2's architecture is made up of the ENCODER side of the Transformer plus a Linear Layer and a Softmax. It can also be thought of as the DECODER of the Transformer, minus the Cross-Attention. Generally speaking, people call a model like LLaMA a Decoder-only model, and a model like BERT an Encoder-only model. From now on I will stick to this terminology in my future videos as well.
@aiyogiravi Жыл бұрын
Yeah, it makes sense now. Since we are not encoding any input and feeding it to cross-attention later, we call this model a Decoder-only model. Edit: Really appreciate the effort you are putting in. Great channel :)
@haiphan9809 ай бұрын
Great video about LLaMA! I have one question regarding the inference steps. How does the input [SOS] predict "Love" at the beginning, when the model does not have any information about the input sentence? In the Transformer, the encoder encodes the whole input sentence before it goes to the decoder, which conditions the generation on the input; in LLaMA, however, we don't have that. [SOS] could predict any next word, so how does it know it should be "Love"?
@kanakraj31988 ай бұрын
@@haiphan980 That was just for illustration; he didn't show the prompt part. The model will first take your prompt and perform self-attention on it, and only then will it start predicting, so it will have the information from your prompt on how to start.
@chandrahasaroori3172 ай бұрын
Ah phew, started googling things as soon as I heard it
@kqb5407 ай бұрын
Umar, Andrew Ng, 3Blue1Brown and Andrej are all you need. You are one of the best educators of deep learning. Thank you.
@xinyaoyin22385 ай бұрын
nahh, Umar only. 3Blue1Brown only makes fancy but useless slides, and Andrej emphasizes Python tricks more than the actual transformer
@huntersawyer93244 ай бұрын
I thought attention was all you need lol
@emir51463 ай бұрын
I think so, too.
@emir51463 ай бұрын
@@xinyaoyin2238 Andrej is doing the right thing
@Umar-s6e3 ай бұрын
@@xinyaoyin2238 I agree, Umar only; Andrej makes it complex or covers too much depth.
@mandarinboy Жыл бұрын
The best 1 hour I spent! I had so many questions exactly on all these topics and this video does an outstanding job at explaining enough details in an easy way!
@umarjamilai Жыл бұрын
Glad you liked it! I just posted another video on how to code LLaMA 2 from scratch! Check it out
@leodamato968524 күн бұрын
Man, it's my first time watching one of your videos. Your way of teaching is just perfect: straight to the point, clear and simple. Thanks a lot!
@mojtabanourani9988 Жыл бұрын
The network came out in Feb 2023! This is a YouTube channel worth subscribing to. Thanks man
@jyun-siouhuang222 ай бұрын
The most comprehensible tutorial I've found so far. Most blogs' explanations have some mistakes, but yours looks great!! Fascinating, much appreciated.
@Paluth Жыл бұрын
Thank you very much for your work. The community is blessed with such high quality presentations about difficult topics.
@MaxenceErnoult Жыл бұрын
This video, along with the previous ones about coding up transformers from scratch, are really outstanding. Thank you so much for taking such a tremendous amount of your free time to put all of this together!
@TheMzbac Жыл бұрын
Very underrated video. Thanks for providing such a good lecture to the community
@muthukumarannm398 Жыл бұрын
I became your fan at 55:00, when you explain how GPU capability drives the development. 🙂
@umarjamilai Жыл бұрын
As always, the PDF slides are freely available on GitHub: github.com/hkproj/pytorch-llama-notes/
@ningjason987313 күн бұрын
Best LLM video for students!!! I will recommend your video to all my group mates.
@cobaltl855710 ай бұрын
This intro to llama is awesome ❤, thank you for making such a great video.
@xugefu23 күн бұрын
Thanks!
@tubercn Жыл бұрын
Thanks for giving your free time and offering this valuable tutorial 👏👏👏 I hope you keep doing this, thanks again
@dgl32838 ай бұрын
This video could be the official textbook of the LLaMA architecture. Amazing.
@millenniumbismay382Ай бұрын
Thank you! It has been a life saver :) It is definitely the best explanation of such important concepts by a long way! Thank you so much.
@Engrbilal143 Жыл бұрын
Amazing. Just wow. I cannot find this stuff anywhere else on the internet.
@tusharmadaan54805 ай бұрын
Amazing explanation. Such thorough coverage of KV cache, GQA, RoPE, and SwiGLU.
@jordanconnolly1046 Жыл бұрын
Really glad I found your channel. You create some of the most in-depth and easy-to-follow explanations I've been able to find.
@Tensordroid10 ай бұрын
One of the best explanations on youtube right now !!
@danish532611 ай бұрын
AMAZING! AMAZING AMAZING! Great work Umar .. Thanks a ton
@Best9in Жыл бұрын
Thank you very much! I once read the paper, but I think watching your video provided me with more insights about this paper than reading it many more times would have.
@Jc-jv3wj9 ай бұрын
Fantastic explanation on LLaMA model. Please keep making this kind of videos.
@Charbel-n1k6 ай бұрын
Thank you for explaining these concepts in an easy way to understand!🎉
@TheAero Жыл бұрын
You actually explained attention here better than the previous presentation!
@umarjamilai Жыл бұрын
Keep up with the journey! Watch my other video on how to code LLaMA 2 from scratch and you'll put to practice what you've learnt here
@kunwar_divyanshu4 ай бұрын
This is the best explanation on the planet for LLM techniques and architecture.
@UnknownHuman111107 ай бұрын
Amazing video ! Thanks for taking the time to explain core new concepts in language models
@librakevin19837 ай бұрын
The best machine learning videos I've ever watched. Thanks Umar!
@YuchenCao-e1v Жыл бұрын
Great video! Worth spending the time going over and over again. I actually saw your video on a Chinese site (the video has probably been re-shared by many other people already), then I came here to find the author. Superbly explained, thank you for sharing!
@umarjamilai Жыл бұрын
Which Chinese site? I kind of want to take a look 😁
@feixyzliu5432 Жыл бұрын
@@umarjamilai Bilibili
@NJCLM Жыл бұрын
I didn't even notice the time passing! Great work, you are a future rock star at teaching complex things in ML.
@xujiacao67769 ай бұрын
This video is great!
@siqb11 ай бұрын
My TLDR for the video (please point out the mistakes):
- LLaMA uses RMS normalization instead of LayerNorm because it provides the same benefits with less computation.
- LLaMA uses rotary embeddings. These act as a distance-based scaling of the original dot-product score coming out of the queries and keys. In other words, two tokens X and Y that are close together produce a larger score than two tokens X and Y that are far apart. This makes sense from the point of view that closer tokens should have a bigger say in the final representation of a given token than the ones far away. This is not the case for the vanilla transformer.
- LLaMA uses Grouped Query Attention as an alternative to vanilla attention, mostly to optimize for GPU FLOPs (and the GPU's much slower memory access). Key slide at 1:03:00. In vanilla attention, each token (within each head) has its own key, query and value vector. In multi-query attention (MQA), there is only one key and value head shared by all query heads. In between lies GQA, where a small group of query heads (say 2-4) is mapped to one key and value head.
- LLaMA uses the SwiGLU activation function since it works better.
- LLaMA uses 3 layers instead of 2 for the FFN part of the encoder block, but keeps the number of parameters the same.
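For anyone who wants the RMSNorm point concrete, here is a minimal PyTorch sketch (my own paraphrase, not the code from the video; dim and eps are the usual hyperparameters):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Rescales by the root-mean-square of the features only (no mean subtraction, no bias),
    # then applies a learnable gain; cheaper than LayerNorm, which also recenters.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Usage: y = RMSNorm(16)(torch.randn(2, 5, 16))  # same shape out, torch.Size([2, 5, 16])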
@jackoneil10003 ай бұрын
I might be incorrect, but I believe that a GLU cannot simply be thought of as a 3-layer FFN, because it is more like a 2-layer FFN with an extra forget/update gate inspired by LSTMs / GRUs. The number of parameters is of course a hyperparameter at the end of the day, and the video mentioned that the parameters were scaled by 2/3 to test whether Swish does better. After all, comparing activation functions while their layers have different parameter counts would be unfair, and the increased performance could be an effect of the higher parameter count instead of a better activation. Don't want to be the "Uhm... Actually" guy, just thought it might be useful.
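For reference, a rough PyTorch sketch of that gated feed-forward shape (just my illustration of the structure under discussion, not code taken from the video; w1/w2/w3 and the sizes are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # Three weight matrices, but functionally a gated 2-layer FFN:
    # silu(w1(x)) acts as the gate that modulates w3(x) before the down-projection w2.
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# With hidden_dim scaled by roughly 2/3 of a plain 2-layer FFN's hidden size,
# the parameter count stays comparable, which makes the comparison fair.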
@ravimandliya1881 Жыл бұрын
Such an amazing step by step breakdown of concepts involved! Thank you so much.
@abhishekshinde-jb5pn Жыл бұрын
Your videos are the best man !! Please keep releasing as much content as possible, on the famous papers
@hieungo-ai Жыл бұрын
Please keep making content like this. Thank you very much. I learnt a lot.
@GrifinsBrother Жыл бұрын
Incredible explanation! You have a real predisposition for explaining material. Keep going!
@vatsalkhetan40462 ай бұрын
YOU ARE SO AMAZING UMAR. THANKS A LOT MAN -
@deema_c2 ай бұрын
You deserve a section in my thesis's Acknowledgments
@umarjamilai2 ай бұрын
Waiting for it ;-)
@AntonyWilson Жыл бұрын
Thanks!
@TheFitsome2 ай бұрын
When I give my Nobel Prize acceptance speech, I will mention this channel 😆
@barisdenizsaglam5 ай бұрын
Amazing illustrations and explanations.
@mickelliu5559 Жыл бұрын
It's surprising that content of this quality is free.
@TianyiZhang-ns8pg8 ай бұрын
Thanks a lot. Looking forward to your next video!!!
@saratbhargavachinni5544 Жыл бұрын
Thanks a lot, Great explanation of KV cache and Multi Query Attention.
@parmanandchauhan61828 ай бұрын
Great content, I gained a deep understanding after watching the video.
@haocongzhan1806 Жыл бұрын
You are doing a great job, thank you for tutoring me part by part! It helps a lot.
@umarjamilai Жыл бұрын
Are you in China? Let's connect on LinkedIn.
@haocongzhan1806 Жыл бұрын
@@umarjamilai Yes, I am. I've already added you!
@Sisco40410 ай бұрын
Love your videos, and I also love the meowing cat in background😂
@somebody-f8k Жыл бұрын
Fantastic video! Your explanations are very clear, thank you!
@Charles-my2pb6 ай бұрын
Thanks for your video, that's awesome!
@goelnikhils Жыл бұрын
Exceptional Video on LLaMA
@Tomcat342 Жыл бұрын
You are doing God's work. Keep it up, and thank you.
@MarcosVinicius-bd6bi10 ай бұрын
Fantastic video, thanks Umar!
@Angadsingh95 Жыл бұрын
Thank you for creating such quality content!
@bosepukurАй бұрын
Can't thank you enough for your effort.
@TrelisResearch Жыл бұрын
Great channel and content Umar
@ahmetfirat23 Жыл бұрын
very informative video, details are clearly explained. thanks a lot
@BishakhaBiswas-q8t9 күн бұрын
Very, very effective video. Thank you so much ❤️
@utkarshsharma2326 ай бұрын
Best Video on LLaMA
@saranyav25818 ай бұрын
Thank you for the amazing explanation
@satviknaren96819 ай бұрын
Thank you for posting. Thank you for existing. Thank you.
@ethanhe42 Жыл бұрын
great illustration!
@meili-ai Жыл бұрын
Very good explanation! Keep up the good work!
@safwanmohammed77152 ай бұрын
Superb explanation
@AnqiSun-f3e Жыл бұрын
Great video. Thanks, 小乌!
@umarjamilai Жыл бұрын
Thank you 😸
@vassilisworld Жыл бұрын
Another amazing video, Umar! You really do know how to teach. It would be nice if you put together a repo of very influential papers to read! Did I hear a baby in the background? Also, given you are from Italy, there is a lovely video worth watching by Asianometry on 'Olivetti & the Italian Computer: What Could Have Been'. Thank you again for the hard work you put into this video.
@umarjamilai Жыл бұрын
Hello Vassilis! Thanks for the kind words and the suggestion! The voice in the background is from 奥利奥 (Oreo), our black and white cat 😺. Unfortunately I'm aware of Olivetti's history and what could have been. If you're curious, you should also check out Enrico Mattei, and what ENI could have been. Have a wonderful day! Hopefully in a few days I'll upload the video on how to code LLaMA from scratch
@Vignesh-ho2dn10 ай бұрын
Thanks for the very resourceful video
@DiegoSilva-dv9uf Жыл бұрын
Thanks! (Valeu!)
@umarjamilai Жыл бұрын
Thank you very very very very much
@siddharthasubramaniyam Жыл бұрын
Great work man🙌
@goelnikhils Жыл бұрын
Amazing explanation
@subhamkundu5043 Жыл бұрын
This is great, thank you. It would be very helpful if you could also create a video on hands-on coding of a LLaMA model, the way you did for the vanilla transformer. Thanks in advance.
@umarjamilai Жыл бұрын
It's coming soon. Stay tuned!
@weicheng4608 Жыл бұрын
Same here. Eagerly waiting for a coding session for the LLaMA model.
@visheshmittal4684 ай бұрын
At 38:20 I think there is a mistake: Q != K != V even in self-attention. They are calculated using Wq, Wk and Wv, which are different weight matrices. When calculating the scores, e_1j = [q1·k1, q1·k2, ..., q1·kT], where q comes from the current token and the k's from all the other tokens; then a_1j = softmax(e_1j) and z_1 = sum_j(a_1j * v_j), where each v_j is computed as mul(token_embed_j, Wv).
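A tiny runnable sketch of what I mean, with three distinct projection matrices and standard scaled dot-product attention for a single query (my own illustration, not code from the video):

import torch

def attention_for_token_1(x, Wq, Wk, Wv):
    # x: (T, d_model) token embeddings; Wq, Wk, Wv: (d_model, d_head), three different matrices
    q = x @ Wq                                   # queries
    k = x @ Wk                                   # keys (different weights, so K != Q)
    v = x @ Wv                                   # values
    e1 = (q[0] @ k.T) / (k.shape[-1] ** 0.5)     # e_1j = q1 . kj for j = 1..T, scaled
    a1 = torch.softmax(e1, dim=-1)               # a_1j = softmax(e_1j)
    z1 = a1 @ v                                  # z_1 = sum_j a_1j * v_j
    return z1

T, d_model, d_head = 5, 16, 8
x = torch.randn(T, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_head) for _ in range(3))
print(attention_for_token_1(x, Wq, Wk, Wv).shape)  # torch.Size([8])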
@georgealexandruvlad7837 Жыл бұрын
Great explanation! 👌🏻
@berkk1993 Жыл бұрын
Your videos are great. Please keep going.
@charlesriggins7385 Жыл бұрын
It's really useful. Thank you.
@taltlusty6804 Жыл бұрын
Great video!!! Thank you very much for enriching the community with such great explanations! Can you please share your slides?
@umarjamilai Жыл бұрын
Check the video description, there's a link.
@VivekKumar4-YrB.Tech.ChemicalE3 ай бұрын
At 2:55 it should say decoder, not encoder, since LLaMA is a decoder-only model that does next-token prediction auto-regressively.
@mprone10 ай бұрын
Although your name deceived me at first, I had no doubt you were a fellow Italian!
@umarjamilai10 ай бұрын
You'd have to pull the 'cadrega' trick on me to see if I'm really Milanese 😂😇
@mprone10 ай бұрын
@@umarjamilai Listen here, Brambilla Jamil, make yourself comfortable, help yourself, take a 'cadrega'. A nice little 'cadrega' is never to be refused!
@umarjamilai10 ай бұрын
@@mprone 😋 mmmhhh... this 'cadrega' is delicious 🍎 Feel free to write to me on LinkedIn if you have any doubts or questions. Have a nice day!
@BadmintonTV2008 Жыл бұрын
really awesome work!
@bipulbikramthapa8256 Жыл бұрын
Great video, +1. I have a few queries about the LLaMA model. 1. In the architecture diagram, does one Nx represent a single layer of LLaMA? 2. Could you also please clarify how many Nx layers are used in the LLaMA-2 13B model and its variants? 3. Finally, what is the potential for distributed computation in LLaMA model inference? What are the possible breakpoints in the model from an architectural standpoint?
@baiyouheng536511 ай бұрын
Good content, nice explanation. Thanks!
@CCCeline_L5 ай бұрын
thank you!!! I gained a lot
@kjkszpjab151010 ай бұрын
Brilliant, thank you.
@eitancohen87179 ай бұрын
Hi, great explanations. By the way, is there any chance you could explain separately the torch.einsum operations shown in the code at 58:41?
@saurabh73379 ай бұрын
Hello, many thanks for such great content. I really enjoyed your work. Sorry for being greedy, but may I request a video showing how a LLaMA model can be run effectively on local machines (with or without a GPU) for inference (say, with a custom Flask API)?
@samc6368 Жыл бұрын
Great explanation for all levels. Question: did you say around 2:55 that it's an encoder-only architecture? I read that it's decoder-only.
@JaydeepPawar-n4v Жыл бұрын
great explanation bro!!!
@vincentabraham769011 ай бұрын
Hi Umar, I recently started studying LLMs and I loved your explanations of transformers and the LLaMA architecture. I wanted to know: is there any way to look at the attention weights of a model and gain insight into which specific portions of the text influenced the output prediction? Is there a beginner-friendly way to do this?
@feixyzliu5432 Жыл бұрын
I'm wondering why multi-head attention with KV cache uses O(bnd^2) operations at 1:00:01; isn't it O(bnd + bd^2)? Could you please explain this or point me to some references?
@chintabhanushri70802 ай бұрын
Really commendable.
@moacirponti10 ай бұрын
Many thanks for the great video. From your explanation, LLaMA uses Rotary Positional Embedding (RoPE) as its positional encoding, applied to *each* Q and K vector just after the transformation by their respective W. In this case, I don't get what Relative Positional Encoding has to do with it (and why it was explained before Rotary PE). Is it because Rotary PE has connections with the relative positional method, or are both applied in the case of LLaMA?
@varunsaagars Жыл бұрын
🎯 Key Takeaways for quick navigation:
00:00 📹 This video explains LLaMA, its structural differences from the Transformer, and builds each block of LLaMA from a conceptual, mathematical, and coding perspective.
00:56 🧩 LLaMA has only an encoder, as it's designed for the next-token prediction task, and it uses self-attention with KV cache and rotary positional embeddings.
03:38 📊 LLaMA uses RMS normalization before every block, replacing the layer normalization used in the original Transformer.
04:17 🧮 LLaMA features grouped multi-query attention, RMSNorm, and SwiGLU feed-forward layers, offering architectural differences from the Transformer.
05:38 🔢 LLaMA models come in various sizes, with different dimensions, numbers of layers, and tokens trained on, making them adaptable to specific tasks.
09:05 📊 Layer normalization is used to address internal covariate shift, recentering features around zero mean and rescaling to unit variance.
19:11 🔲 Root Mean Square (RMS) normalization is introduced in LLaMA, simplifying computation compared to layer normalization while achieving similar results.
24:28 🔍 Positional encodings in the vanilla Transformer are fixed vectors added to token embeddings to represent their absolute positions in the sentence.
25:23 🔄 Rotary positional encodings deal with two tokens at a time and represent the distance between them, used in attention mechanisms.
28:39 🧮 Rotary positional encodings involve a function that depends on the embeddings and the relative distances between tokens.
32:33 🧐 Computationally efficient rotary positional encodings are used by LLaMA, avoiding unnecessary operations.
34:37 📉 Rotary positional encodings exhibit long-term decay, reducing the strength of relationships between distant tokens.
40:33 🔄 The KV cache is used during inference in Transformer models to avoid redundant computations and speed up next-token prediction.
46:56 🔁 With the KV cache, previously computed dot products between tokens are not recomputed, reducing computation in subsequent steps of the inference process.
49:37 🧠 The KV cache is used in LLaMA to reduce redundant calculations in self-attention by storing keys and values and updating them as new tokens are added, improving inference speed.
54:13 📈 Multi-query attention is introduced to optimize memory access in Transformer models and reduce the bottleneck caused by data transfer in GPUs.
01:00:51 ⚡️ Grouped multi-query attention divides queries into groups and assigns different heads for keys and values to strike a balance between model quality and speed.
01:06:31 🔄 The SwiGLU activation function with β=1 is used in LLaMA's feedforward network, showing good performance in various benchmarks, although the reason for its effectiveness is not fully explained in the paper.
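For anyone who wants to play with the rotary-embedding idea from 25:23-32:33, here is a rough PyTorch sketch (my own simplification, not the video's code; precompute_freqs and apply_rotary are made-up helper names):

import torch

def precompute_freqs(head_dim: int, seq_len: int, theta: float = 10000.0):
    # One rotation frequency per pair of dimensions, as in the RoPE formulation.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)   # (seq_len, head_dim / 2)
    return torch.polar(torch.ones_like(angles), angles)          # complex e^(i * m * theta_k)

def apply_rotary(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, head_dim). Pair up dimensions as complex numbers and rotate each pair
    # by an angle proportional to the token position; applied to Q and K only.
    x_complex = torch.view_as_complex(x.float().reshape(x.shape[0], -1, 2))
    return torch.view_as_real(x_complex * freqs_cis).reshape(x.shape)

# The dot product between a rotated q and a rotated k then depends on their relative distance.
seq_len, head_dim = 10, 8
q = torch.randn(seq_len, head_dim)
print(apply_rotary(q, precompute_freqs(head_dim, seq_len)).shape)  # torch.Size([10, 8])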
@MENGRUWANG-qk1ip9 ай бұрын
Hi Umar! Your explanation is really excellent! By the way, LLaMA 3 has been released; will you continue and explain LLaMA 3 in a video?
@bosepukurАй бұрын
In the self-attention with KV cache section (53:39), you mention that by using only one token in Q, instead of the new token plus the previously generated tokens, we generate only one attention output (i.e. one orange block in your illustration). However, the attention vector is always 1xn, where n is the number of tokens generated so far, and in the orange block we are generating a new token from Q, K and V... Is my understanding correct that by "a new attention" you meant a new token?
@ml.910610 ай бұрын
Thanks! Super helpful. It seems your channel hasn't covered the GPT model architecture; would you be open to introducing that?
@just4visit Жыл бұрын
Hey! Is there any way you can avoid the loud background bang when the slide changes, please?
@yunhuaji3038 Жыл бұрын
Thanks for the great video. At 48:56, you mentioned that we don't care about the previous attentions. Does that mean we trim the attention tensor from (SeqLen x d_model) to (1 x d_model)? If so, does GPT do the trimming as well? I thought GPT uses the whole attention tensor (including the previous attention "vectors") to predict the next token. That seems redundant, but I wonder what exactly they do here. If they did use the whole attention tensor, does that mean GPT inference needs to cache Q, K and V instead of just K and V? Thank you.
@umarjamilai Жыл бұрын
Hi! I think you misunderstood my words: I said "we don't need to recompute the dot products again", because they have already been computed in the previous steps. The dot products for which we "don't care" are the ones above the principal diagonal of the attention scores matrix, because they're the ones masked out when we apply the causal mask (that is, we force each token to only attend tokens to its left). I suggest you watch my other video in which we code LLaMA from scratch in order to understand how this works in practice.
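To make it concrete, here is a rough single-head sketch of one decoding step with a KV cache (my own simplification, not the code from the video; all names are placeholders):

import torch

def attention_step_with_cache(x_new, Wq, Wk, Wv, cache_k, cache_v):
    # x_new: (1, d_model), only the latest token enters as a query.
    q = x_new @ Wq                                          # (1, d_head): a single row of Q
    cache_k = torch.cat([cache_k, x_new @ Wk], dim=0)       # keys of all tokens so far, (t, d_head)
    cache_v = torch.cat([cache_v, x_new @ Wv], dim=0)       # values of all tokens so far
    scores = (q @ cache_k.T) / (cache_k.shape[-1] ** 0.5)   # (1, t): one new row of dot products
    out = torch.softmax(scores, dim=-1) @ cache_v           # (1, d_head): output for the new token only
    return out, cache_k, cache_v                            # earlier rows are never recomputed

d_model, d_head = 16, 8
Wq, Wk, Wv = (torch.randn(d_model, d_head) for _ in range(3))
cache_k = cache_v = torch.empty(0, d_head)
for _ in range(4):                                          # process 4 tokens one at a time
    out, cache_k, cache_v = attention_step_with_cache(torch.randn(1, d_model), Wq, Wk, Wv, cache_k, cache_v)
print(out.shape, cache_k.shape)  # torch.Size([1, 8]) torch.Size([4, 8])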
@yunhuaji3038 Жыл бұрын
@@umarjamilai Thanks for the fast reply. I actually should have pointed the timestamp to 50:47 instead, but I believe you already got my question, which is about the output attention tensor. Do you mean that we cache the resulting attention tensor as well as the K-V pairs? (I will meanwhile go watch your LLaMA video.) Thanks again.
@yunhuaji3038 Жыл бұрын
@@umarjamilai Oh... I figured it out... thanks a lot
@satpalsinghrathore2665 Жыл бұрын
Amazing video
@hosseinhajipour7817 Жыл бұрын
Thanks for the videos. There are a few errors in the video, which I mention below: 1 - LLaMA is a decoder-only model. 2 - The sizes of Q and K are the same; however, they are not the "same" tensor.
@umarjamilai Жыл бұрын
Hi! 1 - To be the decoder, it should have cross-attention, which it doesn't. The closest architecture is the encoder (the left side of the transformer model). People commonly call it "decoder only" because we do not "encode" text into a latent representation, but rather just "generate text" from pre-trained embeddings. Technically, from an architecture point of view, it's more similar to an encoder, hence the name. 2 - Q, K and V have the same size, and have the same content in self-attention, at least in the vanilla transformer. In LLaMA, because of the KV cache and the positional encodings, which are only applied to Q and K, the content is different. Have a nice day!