How a Transformer works at inference vs training time

  47,249 views

Niels Rogge

1 year ago

I made this video to illustrate the difference between how a Transformer is used at inference time (i.e. when generating text) vs. how a Transformer is trained.
Disclaimer: this video assumes that you are familiar with the basics of deep learning, and that you've used HuggingFace Transformers at least once. If that's not the case, I highly recommend this course: cs231n.stanford.edu/ which will teach you the basics of deep learning. To learn HuggingFace, I recommend our free course: huggingface.co/course.
The video goes into detail explaining the difference between input_ids, decoder_input_ids and labels (a short code sketch follows the list below):
- the input_ids are the inputs to the encoder
- the decoder_input_ids are the inputs to the decoder
- the labels are the targets for the decoder.
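As a concrete illustration, here is a minimal sketch of how these three tensors show up in practice, assuming the t5-small checkpoint and the English-to-French example used in the video (this is just one way to set it up):

# Minimal sketch: input_ids, labels, and (implicitly) decoder_input_ids with an encoder-decoder model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# input_ids: what the encoder sees
inputs = tokenizer("translate English to French: My dog is cute", return_tensors="pt")

# labels: the target token ids for the decoder
labels = tokenizer("Mon chien est mignon", return_tensors="pt").input_ids

# During training, passing labels is enough: the model internally creates the
# decoder_input_ids by shifting the labels one position to the right and
# prepending the decoder start token, then computes the cross-entropy loss.
outputs = model(input_ids=inputs.input_ids, labels=labels)
print(outputs.loss)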
Resources:
- Transformer paper: arxiv.org/abs/1706.03762
- Jay Alammar's The Illustrated Transformer blog post: jalammar.github.io/illustrate...
- HuggingFace Transformers: github.com/huggingface/transf...
- Transformers-Tutorials, a repository containing several demos for Transformer-based models: github.com/NielsRogge/Transfo....

Comments: 104
@TempusWarrior 8 months ago
I rarely comment on YT videos, but I wanted to say thanks. This video doesn't have all the marketing BS and provides the type of understanding I was looking for.
@waynelau3256 7 months ago
Gosh, imagine the day videos were ranked based on content and not fake marketing tactics 😂
@NielsRogge 6 months ago
Thanks for the kind words!
@sohelshaikhh 5 months ago
Beautifully explained! I'd like to shamelessly request a series where you go one step deeper into this beautiful architecture.
@kevinsummerian 5 months ago
For someone coming from a software engineering background, this was hands down the most useful explanation of the Transformer architecture.
@farrugiamarc0 2 months ago
This is the best explanation I have come across so far on this particular topic (inference vs training). I hope that more videos like this are released in the future. Well done!
@user-yk4hv8tz7m 8 months ago
Inference:
1. Tokens are generated one at a time, conditioned on the input + previous generations.
2. The language modelling head converts the hidden states to logits.
3. Greedy search or beam search is possible.
Training:
1. input_ids: the input prompt; labels: the output.
2. decoder_input_ids are copied from the labels, prepended with the start-of-sequence token.
3. The decoder generates text all at once, but uses a causal attention mask to mask out future tokens in the decoder input ids.
4. -100 is assigned to padded positions in the labels to tell the cross-entropy function not to compute a loss there.
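To make point 4 of the summary above concrete, here is a minimal sketch; PyTorch's cross-entropy uses ignore_index=-100 by default, and the token ids below are purely illustrative:

# Positions labelled -100 are skipped by the loss.
import torch
import torch.nn.functional as F

vocab_size = 50000
logits = torch.randn(1, 4, vocab_size)        # (batch, sequence length, vocab size)
labels = torch.tensor([[231, 56, 1, -100]])   # last position is padding -> -100

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(loss)  # the padded position contributes nothing to the loss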
@ashishnegi9663 1 year ago
You are a great teacher, Niels! Would really appreciate it if you added more such videos on hot ML/DL topics.
@omgwenxx 2 months ago
I am using the huggingface library and this video finally gave me a clear understanding of the wordings used and the transformer architecture flow. Thank you!
@zobinhuang3955 9 months ago
The clearest explanation of the Transformer model I have seen. Thanks Niels!
@forecenterforcustomermanag7715 14 days ago
Excellent overview of how the encoder and decoder work together. Thanks.
@RamDhiwakarSeetharaman 11 months ago
Unbelievably great and intuitive explanation. Something for us to learn. Thanks a lot, Niels.
@jasonzhang5378 8 months ago
This is one of the cleanest explanations of Transformer inference and training on the web. Great video!
@marcoxs36 1 year ago
Thank you Niels, this was really helpful to me for understanding this complex topic. These aspects of the model are not normally covered in most resources I've seen.
@jagadeeshm9526 1 year ago
Amazing video... it covers exactly what most other resources on this topic are missing. Keep up the great work, Niels!
@shivamsengupta121 4 months ago
This is the best video on Transformers. Everybody explains the structure and the attention mechanism, but you chose to explain the training and inference phases. Thank you so much for this video. You are awesome 😎. Love from India ❤
@chenqu773 1 year ago
Very intuitive, concise explanation of a very important topic. Thank you very much!
@mytr8986 1 year ago
Excellent and simple video for understanding how the Transformer works, thanks a lot!
@amitsingha1637 6 months ago
Thanks man. We need more videos of this type.
@lucasbandeira5392 3 months ago
Niels, thank you very much for this video! It was really helpful! The concept behind Transformers is pretty complicated, but your explanation definitely helped me to understand it.
@VitalContribution 1 year ago
I watched the whole video and I understand now so much more. Thank you very much for this great video! Please keep it up!
@sanjaybhatikar 5 months ago
Thanks so much, you hit upon the points that are confusing for a first-time user of LLMs. Thank you!
@abhikhubby 9 months ago
Best video on AI I've seen so far. Thank you so much for making and sharing it! The only parts that might need a bit more explanation are the logits area and vector embedding creation (but the latter already has lots of content).
@thorty24 1 month ago
This is one of the greatest explanations I know. Thanks!
@thomasvrancken895 1 year ago
Great video! I like the pace and easy explanation on things that are not necessarily straightforward. And clean excalidraw skills 😉 Hope to see more soon
@mathlife5495 9 months ago
Very nice lecture. It clarified so many concepts for me.
@samilyalciner 1 year ago
Thanks Niels. Such a great explanation!
@lovekesh88 2 months ago
Thanks Niels for the video. I look forward to more content on the topic.
@user-kd2st5vc5t 1 month ago
Thank you, very well explained. I only had a rough understanding before; now I am much clearer on the details. Many thanks, love from China.
@trilovio 4 months ago
This explanation is gold! Thank you so much! 💯
@HerrBundesweit 1 year ago
Very informative. Thanks Niels!
@zagorot 1 year ago
Great video! I have to say thank you. This video is just what I needed: I had learned some basic ideas about word2vec, LSTMs, RNNs and the like, but I couldn't understand how the Transformer works or what its inputs and outputs are, and your video made all of that clear to me. Yes, some comments say this video is "pointless" or something; I can't agree with that. Different audiences have different backgrounds, so it is really hard to make something that pleases everyone. Someone lacking basic ideas like word2vec (why use input_ids) would not be able to understand this video, and conversely someone who is already very good at Transformers/Diffusion won't need to watch it! This video taught me how the encoder and decoder work at every single step, in great detail. Really appreciated!
@nageswarsahoo1132 8 months ago
Amazing video. It cleared up a lot of doubts. Thanks Niels.
@minhajulhoque2113 1 year ago
Great explanation video, really informative!
@fabianaltendorfer11 9 months ago
Wonderful, thank you Niels!
@kmsravindra 1 year ago
Thanks Niels. This is pretty useful
@sambitmukherjee1713 1 year ago
Very clearly explained!
@phucdoitoanable 9 months ago
Nice explanation! Thank you!
@Wlodixpro 4 months ago
🎯 Key Takeaways for quick navigation:

00:00 🧭 *Overview of Transformer Model Functionality*
- Provides an overview of the Transformer model.
- Discusses the distinction between using a Transformer during training versus inference.
- Highlights the importance of understanding Transformer usage for tasks like text generation.

02:05 🤖 *Tokenization Process*
- Describes the tokenization process where input text is converted into tokens.
- Explains the mapping of tokens to integer indices using the vocabulary.
- Discusses the role of input IDs in feeding data to the model.

06:06 📚 *Vocabulary in Transformer Models*
- Explores the concept of vocabulary in Transformer models.
- Illustrates how tokens are mapped to integer indices in the vocabulary.
- Emphasizes the importance of vocabulary in processing text inputs for Transformer models.

07:44 🧠 *Transformer Encoder Functionality*
- Details the process of the Transformer encoder, converting tokens into embedding vectors.
- Explains how the encoder generates hidden representations of the input tokens.
- Highlights the role of embedding vectors in representing input sequences.

10:45 🛠️ *Transformer Decoder Operation at Inference*
- Demonstrates how the Transformer decoder operates during inference.
- Discusses the generation process of new text using the decoder.
- Describes the utilization of cached embedding vectors for generating subsequent tokens.

23:04 🔄 *Iterative Generation Process*
- Illustrates the iterative process of token generation by the Transformer decoder.
- Explains how the decoder predicts subsequent tokens based on previous predictions.
- Discusses the termination condition of the generation process upon predicting the end-of-sequence token.

25:33 🧠 *Illustrating the Inference Process with Transformers*
- At inference time, text generation with Transformer models occurs in a loop, generating one token at a time.
- Transformer models like GPT use a generation loop, allowing for flexibility in text generation.
- Different decoding strategies, such as greedy decoding and beam search, impact the text generation process.

30:59 🛠️ *Explaining Decoding Strategies for Transformers*
- Greedy decoding is a basic method where the token with the highest probability is chosen at each step.
- Beam search is a more advanced decoding strategy that considers multiple potential sequences simultaneously.
- Various decoding strategies, including beam search, are available in the `generate` method of Transformer libraries like Hugging Face's Transformers.

31:13 🎓 *Training Process of Transformer Models*
- During training, the model learns to generate text by minimizing a loss function based on input sequences and target labels.
- Teacher forcing is used during training, where the model is provided with ground-truth tokens at each step.
- The training process involves tokenizing input sequences, encoding them, and using labeled sequences to compute the loss via cross-entropy calculations.

48:58 🤯 *Understanding Causal Attention Masking in Transformers*
- Causal attention masking prevents the model from "cheating" by looking into the future during training.
- At training time, the model predicts subsequent tokens based on the ground-truth sequence, with the help of the causal attention mask.
- This mechanism ensures that the model generates text one step at a time during training, similar to the inference process.

Made with HARPA AI
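A hedged sketch of the decoding strategies mentioned in the takeaways above, using the generate() method of Hugging Face Transformers with t5-small as an example checkpoint (greedy decoding is the default; num_beams enables beam search):

# Greedy decoding vs. beam search with generate().
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

input_ids = tokenizer("translate English to French: My dog is cute", return_tensors="pt").input_ids

# Greedy decoding: pick the highest-probability token at every step
greedy = model.generate(input_ids, max_new_tokens=20)

# Beam search: keep the 4 most promising partial sequences at every step
beams = model.generate(input_ids, max_new_tokens=20, num_beams=4, early_stopping=True)

print(tokenizer.batch_decode(greedy, skip_special_tokens=True))
print(tokenizer.batch_decode(beams, skip_special_tokens=True))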
@sebastianconrady7696 1 year ago
Awesome! Great explanation
@imatrixx572 7 months ago
Thank you very much! Now I can say that I completely understand the Transformer!
@muhammadramismajeedrajput5632 1 month ago
Loved your explanation
@achyutanandasahoo4775 8 months ago
thank you. great explanation.
@sitrakaforler8696 1 year ago
Really great video! Thank you very much!
@user-cv2fh3sh9x 4 months ago
Very nice explanation. I'd request a video on how an LLM can be adapted with prompt engineering and fine-tuning, and on generating a new LLM, with a practical approach. ❤❤❤❤❤❤❤
@nizamphoenix 7 months ago
One word, Perfect!
@user-sr8zf9ms3o 6 months ago
Great vid, thanks!
@omerali3320 1 month ago
I learned a lot thank you.
@PravasMohanty 7 months ago
Great tutorial!! It would be great if you made a video on personalizing GPT: how to keep the trained data and load it for Q&A. Any recommendations?
@dhirajkumarsahu999 1 month ago
Thank you so Much!! Subscribed
@atmismahir 6 months ago
great content thank you very much for the detailed explanation :)
@botfactory1510 1 year ago
Thanks Niels
@yo-yoyo2303 6 months ago
This is sooooooo good
@junaidbutt3000 11 months ago
Very clearly explained, Niels. I have a question about the decoder inputs. At training time, we added padding to the source and target sequences to make them a particular length. But at inference time at t=1, we only feed the start-of-sequence token to the decoder. Do we not require padding to make the sequence lengths consistent as well? It seems that at inference time we're feeding different sequence lengths to the decoder. Is this true, or is there implicit padding being applied here as well?
@syerwinD 10 months ago
Thank you
@mohammedal-hitawi4667 1 year ago
Very nice work. Could you please show how to modify the decoder part of the TrOCR model, e.g. replacing the language model with GPT-2?
@aspboss1973 10 months ago
Nice explanation! I have these doubts:
- During training, do we learn the Query, Key and Value matrices? In short, do we learn the final embeddings of the encoder through backpropagation?
- During training, do we supply the encoder's final embeddings to the decoder one at a time? (Suppose we have 5 final encoder embeddings; for the first time step do we supply only the first of the 5 embeddings to the decoder?)
- How is this architecture used in a QA model? (I am confused!!!)
@FalguniDasShuvo 9 months ago
Awesome!🎉
@lucasbandeira5392 14 days ago
Thank you very much for the explanation, Niels. It was excellent. I have just one question regarding 'conditioning the decoder' during inference: How exactly does it work? Does it operate in the same way it does during training, i.e., the encoder hidden states are projected into queries, keys, and values, and then the dot products between the decoder and encoder hidden states are computed to generate the new hidden states? It seems like a lot of calculations for me, and in this way, the text generation process would be very slow, wouldn't it?
@pulkitsingh2149 6 months ago
Hi Niels, great explanation on this. I just couldn't get my head around one point. At each time step we produce n vectors (the same number as the decoder inputs). Is it guaranteed that the previously predicted tokens' vectors won't change? What if a decoded token's vector changes as we include more tokens in the decoder input?
@giofou711 4 months ago
@NielsRogge thanks for the super clear and helpful video! It's really one of the most clean and concise presentations I've watched on this topic! 🙌 I had a question though: at 24:09, you say that *during inference* in the *last hidden state of the decoder* we get a hidden vector *for each of the decoder input ids*. In your example, after 6 time steps we have 6 decoder tokens: the decoder start token, salut, ..., mignon, which means the last hidden state (at time step t = 6) would produce a 6 x 768 matrix. Is that true though? I thought the last hidden state of the decoder produces the embedding of the *next token* only; in other words, a 1 x 768 vector that is later passed through a `nn.Linear(768, 50000)` layer to give us the next decoder input id. In other words, the 1 x 768 vector is passed to `nn.Linear(768, 50000)` and gives us a 1 x 50000 logit vector. But if what you say is true, then a 6 x 768 matrix is created at time step t = 6, and the end result after the last linear head would be a 6 x 50000 logit matrix. No?
@leiyang2176 11 months ago
That's a great video; I just have one question related to it. In translation, there can be multiple valid translations. In this example the English output could be 'Hello, my dog is cute' or 'Hi, my dog is a cute dog', etc. In a real translation product, would a metric like the BLEU score be used, and how would this score be used to evaluate and improve product quality?
@norman9174 1 year ago
Sir, can you please provide the Excalidraw notes? Thanks for this amazing explanation.
@BB-uy4bb 11 months ago
In the explanation around 45:00, isn't there an end token missing in the labels, which the model should predict after the last label (231)?
@mbrochh82 1 year ago
Great video. The only thing that literally all videos on Transformers don't mention is: how and when does backpropagation happen? I understand how it works for a simple neural network with a hidden layer, where we use gradient descent to update all the weights... but in the Transformer architecture I find it hard to visualize which numbers get updated after we calculate the loss.
@jeffrey5602 1 year ago
Yeah, conceptually at first maybe, but I would argue the transformations themselves are not more complicated than in a normal NN for classification, because it's really doing just that: predicting the most probable token from the dictionary. At least it's way easier than backprop for RNNs, LSTMs, etc. The Transformers book from Hugging Face has a great explanation of attention, which is really all you need to know to demystify the whole Transformer architecture. And attention is really just adding a few linear projections and doing a dot product.
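To make the backpropagation question above concrete: the loss the model returns is an ordinary PyTorch scalar, and calling backward() on it computes gradients for every trainable weight (token embeddings, attention projection matrices, feed-forward layers, and the language modelling head), which an optimizer then updates. A minimal sketch, assuming the t5-small checkpoint and a toy translation pair:

# One training step; backprop updates every trainable parameter of the Transformer.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = tokenizer("translate English to French: My dog is cute", return_tensors="pt")
labels = tokenizer("Mon chien est mignon", return_tensors="pt").input_ids

loss = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels).loss
loss.backward()    # gradients for embeddings, attention projections, FFNs, LM head
optimizer.step()   # update all weights
optimizer.zero_grad()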
@shaxy6689 2 months ago
It was so helpful. Could you please share the drawing notes? Thank you!
@kaustuvray5066 6 months ago
31:02 Training
@zbynekba 8 months ago
Hi Niels, I greatly appreciate that you've taken the time to create a fantastic summary of training and inference time from the user's perspective. Q1: During training, do you also include the end-of-sentence token in the loss function? You haven't mentioned it, though IMHO a good model must detect the end of the translation. Q2: Why do you need to introduce padding? Everything works perfectly with arbitrary lengths of input and output sentences, which is a true beauty. Why is it needed for batch training? Thank you.
@nouamaneelgueddari7518 7 months ago
He said in the video that padding is introduced because training is done in batches. The elements of a batch will have very different lengths. If we don't use padding, we would have to dynamically allocate memory for every element in the batch, which is not very efficient for the computation.
@zbynekba 7 months ago
@@nouamaneelgueddari7518 Makes sense to me. Thanks.
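As a small illustration of the padding discussion above (a sketch assuming the t5-small tokenizer): padding simply makes every sequence in a batch the same length so the batch fits in one rectangular tensor, and the attention mask marks which positions are real tokens.

# Padding a batch of two sentences of different lengths.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

batch = tokenizer(
    ["translate English to French: My dog is cute",
     "translate English to French: Hello"],
    padding=True,            # pad the shorter sentence up to the longest one
    return_tensors="pt",
)
print(batch.input_ids.shape)   # (2, max length in this batch)
print(batch.attention_mask)    # 0s mark the padded positions to be ignored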
@NaveenRock1 1 year ago
Great work. Thanks a lot for this video. I had a small doubt: during the Transformer inference you mentioned we stop generating the sequence when we reach the end-of-sequence token. But during training, in the decoder_input_ids, I noticed you didn't add the end-of-sequence token to the sentence; did I miss something here?
@NielsRogge 1 year ago
Hi, during training, the end-of-sequence token is indeed added to the labels (and in turn, to the decoder input ids); I should have mentioned that!
@NaveenRock1 1 year ago
@@NielsRogge Got it. Thanks. I believe it will be added before the padding tokens? So: sentence tokens + end-of-sequence token + padding tokens to reach the fixed sequence length. Am I correct?
@NielsRogge 1 year ago
@@NaveenRock1 yes correct!
@NaveenRock1 1 year ago
@@NielsRogge Awesome. Thank you. :)
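A hedged sketch of the thread above, assuming the t5-small tokenizer: the tokenizer appends the end-of-sequence token right after the sentence tokens, any padding comes after it, and padded label positions are then set to -100 so the loss ignores them.

# EOS comes before the padding; padded label positions are masked with -100.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

labels = tokenizer("Mon chien est mignon", padding="max_length", max_length=10,
                   return_tensors="pt").input_ids
print(labels)                                      # [... sentence ids ..., eos_token_id, pad, pad, ...]
print(tokenizer.eos_token_id, tokenizer.pad_token_id)

labels[labels == tokenizer.pad_token_id] = -100    # don't compute loss on padding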
@VaibhavPatil-rx7pc 11 months ago
NICE!!!!
@braunagn 6 months ago
Question on the tensor shapes of the Encoder that go into the Decoder during inference: If the Encoder output is of shape (1,6,768), during cross attention, how can this be combined with the Decoder's input which is only one token in length [e.g. Shape (1,1,768)]?
@sporrow 1 year ago
are attention vectors used during inference?
@arjunwankhede3706 3 months ago
Can you share the Excalidraw explanation link here?
@bhujithmadav1481 1 month ago
Superb video. Just a doubt: at 11:46 you mention that the decoder would use the embeddings from the encoder and the start-of-sequence token to generate the first output token. By embeddings, did you mean the key/value vectors from the last encoder stage? Also, if an encoder is being used to encode the input question, then why are GPT, LLaMA, etc. called decoder-only models? Thanks
@NielsRogge 1 month ago
Yes the embeddings from the encoder (after the last layer) are used as keys and values in the cross-attention operations of the decoder. The decoder inputs serve as queries. Decoder-only models like ChatGPT and Llama don't have an encoder. They directly feed the text to the decoder, and only use self-attention (with a causal mask to prevent future leakage).
@bhujithmadav1481 1 month ago
@@NielsRogge Thanks for the quick reply. But my confusion is this: when we ask GPT or LLaMA a question like "what is a transformer?", all the sources, including this video, say that decoders start with the SOS or EOS token to generate the output. But where does the decoder get the context from? Even in this video you use the encoder to encode the input question and then pass the encoded embeddings to the decoder, right?
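To illustrate Niels's point above about decoder-only models (a sketch using gpt2 as a freely available stand-in; ChatGPT and LLaMA work analogously but are not used here): the question itself is fed to the decoder as the prompt, and causal self-attention lets every generated token attend back to the prompt tokens, so no separate encoder is needed.

# A decoder-only model conditions on the prompt directly; no encoder involved.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("What is a Transformer?", return_tensors="pt").input_ids
generated = model.generate(prompt_ids, max_new_tokens=30)
print(tokenizer.decode(generated[0]))
# The prompt tokens sit at the start of the sequence; causal self-attention lets each
# new token attend to them (and to previously generated tokens), never to the future.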
@DmitryPesegov 6 months ago
What is the shape of the target tensor in training phase? (batch_size, maximum_supported_sequence_len_by_model, 50000) ? ( PLEASE answer anybody )
@SanKum7 1 month ago
Transformers are "COMPLICATED"? Not really, after this video. Thanks.
@37-2ensoiree7 1 year ago
The softmax is missing during training; it's mandatory for calculating the cross-entropy loss. An unrelated question: am I understanding right that there is thus a maximum length for all these sentences, like 512 tokens? Isn't that an issue?
@navdeep8697 1 year ago
I think the cross-entropy loss in PyTorch (at least!) applies the softmax internally. Yes, the token limit is a sort of limitation because of how the encoder and decoder work internally, but it can be handled when building the dataset pipeline for training and inference.
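A quick check of the point above (standard PyTorch, illustrative tensor values): CrossEntropyLoss expects raw logits and applies the log-softmax internally, so no explicit softmax layer is needed before the loss.

# Cross-entropy on raw logits; the softmax happens inside the loss function.
import torch
import torch.nn as nn

logits = torch.randn(2, 50000)      # raw, unnormalized scores for 2 positions
targets = torch.tensor([231, 56])   # illustrative target token ids

loss = nn.CrossEntropyLoss()(logits, targets)   # == nll_loss(log_softmax(logits), targets)
print(loss)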
@robmarks6800 10 months ago
Can you elaborate on why seemingly all new models are decoder-only and are trained with the sole objective of next-token prediction? Does the enc-dec architecture of T5 have any advantages? And is there any reason to train in the different ways that T5 does?
@NielsRogge 10 months ago
Hi, great question! Encoder-decoder architectures are typically good at tasks where the goal is to predict some output given a structured input, like machine translation or text-to-SQL. One first encodes the structured input, and then uses that as condition to the decoder using cross-attention. However, nowadays you can actually perfectly do these tasks with decoder-only models as well, like ChatGPT or LLaMa. The main disadvantage of encoder-decoders is that you need to recompute the keys/values at every time step, which is why all companies are using decoder-only at the moment (much faster at inference time)
@schwajj 10 months ago
Thanks so much for the video, and answering questions! Can you explain (or provide a pointer to a paper) how the key/values can be cached to avoid recomputation in a decoder-only transformer? Edit: I figured it out while re-watching the training part of your video, so you needn’t answer unless you think others would benefit (I wouldn’t be able to explain very well, I fear)
@robmarks6800 10 months ago
Don't you have to recalculate in the decoder-only architecture as well? Or is this where the (non-default) KV cache comes in?
@dhirajkumarsahu999 1 month ago
One doubt please, does ChatGPT (decoder-only model) also use the Teacher forcing technique while training?
@NielsRogge 1 month ago
Yes it does!
@dhirajkumarsahu999 1 month ago
@@NielsRogge Thanks a lot for your reply !!
@andygrouwstra1384 1 year ago
Hi Niels, you describe a lot of steps that are taken, but don't really explain why they are taken. It becomes a kind of magic formula. For example, you have a sentence and break it up in tokens. OK. But hang on, why break it up in tokens rather than in words? What's different? Then you look up the tokens in a dictionary to replace them by numbers. Is that because it is easier to deal with numbers than with words? Then you do "something" and each number turns into a vector of 768 numbers. What is it that you do there, and why? What is the information in the other 767 numbers and where does that information come from? What do you want it for? It would be nice if you could give the context, both the big picture and the details.
@NielsRogge 1 year ago
Yes, good point! I indeed assume in the video that you take the architecture of the Transformer as is, without asking why it looks that way. Let me give you some pointers:
- Subword tokens rather than words are used because it was shown in papers prior to the Transformer paper that they improve performance on machine translation benchmarks, see e.g. arxiv.org/abs/1609.08144.
- We deal with numbers rather than text since computers only work with numbers; we can't do linear algebra on text. Each token ID (integer) is turned into a numerical representation, also called an embedding. Tokens that have a similar meaning (like "cat" and "dog") will be closer in the embedding space (when you project these embeddings into an n-dimensional space, with n = 768 for instance). The whole idea of creating embeddings for words or subword tokens comes from the Word2Vec paper: en.wikipedia.org/wiki/Word2vec.
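A small sketch of those two pointers, assuming the bert-base-uncased checkpoint purely as an example: the tokenizer turns text into subword token ids, and the model's embedding matrix maps each id to a learned 768-dimensional vector.

# Text -> subword token ids -> 768-dimensional embedding vectors.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

ids = tokenizer("My dog is cute", return_tensors="pt").input_ids
print(ids)                                      # integer indices into the vocabulary
embeddings = model.get_input_embeddings()(ids)  # lookup in the learned embedding matrix
print(embeddings.shape)                         # (1, number of tokens, 768)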
@EkShunya 1 year ago
I like the video. Crisp and concise. Keep it up.
@adrienforbu5165 1 year ago
Perfect french :)
@frazuppi4897 9 months ago
heyy niels
@IevaSimas 1 month ago
Unless the token is predicted with 100% probability, you will still have non-zero loss
@acasualviewer5861 5 months ago
It seems wasteful to run the entire decoder each time, since it will do computations for all 6 positions regardless. There seems to be an opportunity to optimize this by only using the relevant part of the decoder mask at each iteration.
@NielsRogge 5 months ago
Yes indeed! That's where the key-value cache comes in: huggingface.co/blog/optimize-llm#32-the-key-value-cache
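A hedged sketch of what the key-value cache does, using gpt2 as an illustrative checkpoint: on the first forward pass the keys and values of the prompt are cached, and on the next step only the newly generated token needs to be processed.

# Reusing cached keys/values instead of recomputing them at every generation step.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("My dog is", return_tensors="pt").input_ids
out = model(ids, use_cache=True)
past = out.past_key_values                    # cached keys/values for all layers

next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
out2 = model(next_id, past_key_values=past, use_cache=True)   # only the new token is processed
print(out2.logits.shape)                      # (1, 1, vocab size)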
@isiisorisiaint 1 year ago
OK man, you tried, but honestly this is a totally pointless video: someone who knows what the Transformer is about learns absolutely nothing except that -100 means 'ignore', and somebody who's still trying to wrap their head around the Transformer won't understand a single piece of what you kept typing in there. There you go, it's not just a thumbs-down from me, I also took a couple of minutes to write this reply. Just try to define what the target audience of this video is, and you'll instantly see how meaningless this video is.
@navdeep8697 1 year ago
I agree a little... this is good for an audience that is interested in using the Hugging Face library especially, but not for understanding the Transformer and attention in a generic way!