NLP Demystified 15: Transformers From Scratch + Pre-training and Transfer Learning With BERT/GPT

70,192 views

Future Mojo


CORRECTION:
00:34:47: that should be "each a dimension of 12x4"
Course playlist: • Natural Language Proce...
Transformers have revolutionized deep learning. In this module, we'll learn how they work in detail and build one from scratch. We'll then explore how to leverage state-of-the-art models for our projects through pre-training and transfer learning. We'll learn how to fine-tune models from Hugging Face and explore the capabilities of GPT from OpenAI. Along the way, we'll tackle a new task for this course: question answering.
Colab notebook: colab.research...
Timestamps
00:00:00 Transformers from scratch
00:01:05 Subword tokenization
00:04:27 Subword tokenization with byte-pair encoding (BPE)
00:06:53 The shortcomings of recurrent-based attention
00:07:55 How Self-Attention works
00:14:49 How Multi-Head Self-Attention works
00:17:52 The advantages of multi-head self-attention
00:18:20 Adding positional information
00:20:30 Adding a non-linear layer
00:22:02 Stacking encoder blocks
00:22:30 Dealing with side effects using layer normalization and skip connections
00:26:46 Input to the decoder block
00:27:11 Masked Multi-Head Self-Attention
00:29:38 The rest of the decoder block
00:30:39 [DEMO] Coding a Transformer from scratch
00:56:29 Transformer drawbacks
00:57:14 Pre-Training and Transfer Learning
00:59:36 The Transformer families
01:01:05 How BERT works
01:09:38 GPT: Language modelling at scale
01:15:13 [DEMO] Pre-training and transfer learning with Hugging Face and OpenAI
01:51:48 The Transformer is a "general-purpose differentiable computer"
This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
Visit www.nlpdemysti... to learn more.

Comments: 143
@user-wr4yl7tx3w 1 year ago
Why did it take so long for YouTube to show this channel when I searched for transformers? The YouTube algorithm really needs to get better. This is really quality content. Well structured and clearly explained.
@malikrumi1206 11 months ago
YouTube only gives you a limited number of videos that are responsive to your request. This is deliberate, because they want to keep you on the site as long as possible. That is one of the metrics they use to charge their advertisers. If you found exactly what you wanted the first time, chances are good you will then leave the site. But if you watch one or two of the videos, and then come back a few days later with the same search, now you will see matching videos that you were not shown before. Remember, on ad-driven sites, *you* are the product being sold to advertisers.
@novantha1 4 months ago
What was provided: A high quality, easily digestible, and calm introduction to Transformers that could take almost anyone from zero to GPT in a single video. What I got: It will probably take me longer than I'd like to get good at martial arts.
@johnmakris9999 1 year ago
Honestly, best explanation ever. I'm a data scientist (5 years' experience) and I was struggling to understand in depth how transformers are trained. Came across this video and boom, problem solved. Cheers mate. I'll propose that the whole company watch this video.
@futuremojo 1 year ago
Love hearing that. Thanks, John!
@user-wr4yl7tx3w 5 months ago
This is really high-quality content. Why did it take so long for YouTube to recommend this?
@id-ic7ou 1 year ago
I spent 2 days trying to understand the paper "Attention Is All You Need", but lots of things were implicit in the article. Thank you for making it crystal clear. This is the best video I've seen about transformers.
@futuremojo 1 year ago
Thanks! Really happy to hear that.
@RajkumarDarbar 1 year ago
Thank you, legend, for your exceptional teaching style!! 👏👏👏 If someone is looking for a bit more explanation of how to pass the Q, K, and V matrices to the multi-head cross-attention layer in the decoder module: specifically, the key vectors are obtained by multiplying the encoder outputs with a learnable weight matrix, which transforms the encoder outputs into a matrix with a shape of (sequence_length, d_model). The value vectors are obtained by applying another learnable weight matrix to the encoder outputs, resulting in a matrix of the same shape. The resulting key and value matrices can then be used as input to the multi-head cross-attention layer in the decoder module. The query vector, which is the input to the layer from the previous layer in the decoder, is also transformed using another learnable weight matrix to ensure compatibility with the key and value matrices. The attention mechanism then computes attention scores between the query vector and the key vectors, which are used to compute attention weights. The attention weights are used to compute a weighted sum of the value vectors, which is then used as input to the subsequent layers in the decoder. In summary, the key and value vectors are obtained by applying learnable weight matrices to the encoder outputs, and are used in the multi-head cross-attention mechanism of the decoder to compute attention scores and generate the output sequence.
@byotikram4495 1 year ago
So the dimensions of those learnable weight matrices (for both K and V) would be (d_model × d_model)?
@RajkumarDarbar 1 year ago
@byotikram4495 Yes, you got it right.
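For anyone who wants to see the thread above as code, here is a minimal single-head NumPy sketch of cross-attention. The shapes and names (d_model, W_q, W_k, W_v) are illustrative assumptions, not taken from the course notebook.

```python
import numpy as np

d_model = 12                 # model/embedding dimension (toy number)
src_len, tgt_len = 5, 3      # encoder and decoder sequence lengths

encoder_out = np.random.randn(src_len, d_model)  # encoder outputs
decoder_in  = np.random.randn(tgt_len, d_model)  # output of the previous decoder layer

# Learnable projection matrices (randomly initialized here)
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

# Queries come from the decoder; keys and values come from the encoder outputs
Q = decoder_in  @ W_q        # (tgt_len, d_model)
K = encoder_out @ W_k        # (src_len, d_model)
V = encoder_out @ W_v        # (src_len, d_model)

# Scaled dot-product attention
scores  = Q @ K.T / np.sqrt(d_model)                                   # (tgt_len, src_len)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over keys
context = weights @ V                                                  # (tgt_len, d_model)
print(context.shape)         # (3, 12)
```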
@kaustubhkapare807 11 months ago
God knows how many times I've banged my head against the wall... just to understand it... through different videos... this is the best one so far... 🙏🏻
@mtlee8977 6 days ago
Thank you so much! It covered transformers at several different levels, not only coding but also fine-tuning, usage and more. Comprehensive and really helpful. Thank you!
@mahmoudreda1083 1 year ago
I want to express my sincere gratitude for your excellent teaching and guidance in this State-of-the-art NLP course. Thank you Sir.
@1abc1566 1 year ago
I usually don't comment on YouTube videos but couldn't skip this. This is the BEST NLP course I've seen anywhere online. THANK YOU. ❤
@futuremojo 1 year ago
Thank you!
@mazenlahham8029 1 year ago
WOW, your level of comprehension and presentation of your subject is the best I've ever seen. You are the best. Thank you very much ❤❤❤
@JBoy340a 1 year ago
Fantastic explanation. Very detailed, slow paced, and straightforward.
@AIShipped 1 year ago
This is amazing, I can’t thank you enough. I only wish this was around sooner. Keep up the great work!
@AIShipped 1 year ago
This is straight up how schools/universities should teach.
@futuremojo 1 year ago
@@AIShipped Thank you! I'm glad you find it useful.
@anrichvanderwalt1108 1 year ago
Definitely my go-to video for understanding how Transformers work and for pointing anyone else to! Thanks Nitin!
@mage2754 1 year ago
Thank you. I had problems visualising this concept before watching the video because not many explanations/reasons were given for why things were done the way they were.
@futuremojo 1 year ago
Glad you found it helpful!
@michaelmike8637 1 year ago
Thank you for your effort in creating this! The explanation and illustrations are amazing!
@srinathkumar1452 1 year ago
This is a remarkable piece of work. Beyond excellent!
@futuremojo 1 year ago
Thank you!
@weeb9133 3 months ago
Just completed the entire playlist. It was an absolute delight to watch; this last lecture was a favorite of mine because you explained it in the form of a story. Thank you so much for sharing this knowledge with us, and I hope to learn more from you :D
@HazemAzim 1 year ago
A legendary explanation of Transformers compared to the tens or hundreds of tutorial videos out there.. Chapeau!
@youmna4045 9 months ago
There really aren't enough words to express how thankful I am for this awesome content. It's amazing that you've made it available to everyone for free. Thank you so much. May Allah (God) help you as you help others.
@jojolovekita7424 11 months ago
Excellent presentation of confusing stuff 😊😊😊 ALL YouTube videos explaining anything should be of this high caliber. Salamat po! ❤
@karlswanson6811 1 year ago
Dude, this series is great. I do a lot of NLP in the clinical domain and I get asked a lot for a comprehensive starter for NLP for people that hop on projects. I tried to create a curriculum from things like SpaCy, NLTK, some Coursera DS courses, PyTorch/DL books, etc. but this is so well done and succinct, yet detailed when needed/if wanted. I think I will just refer people here from now on. Great work! And agree with the many comments I see about you having a radio voice lmao always a plus!
@futuremojo 1 year ago
Thank you, Karl! I put a lot of thought into how to make it succinct yet detailed in the right places so I'm glad to hear it turned out well.
@xuantungnguyen9719 7 months ago
Like, what the hell. You made it so simple to learn. I kept consuming and taking notes, adding thoughts and perspective, feeling super productive. (I'm using Obsidian to link concepts.) About three years ago the best explanation I could get was probably from Andrew Ng, and I have to admit yours is so much better. My opinion might be biased since I was going back and forth in NLP time after time, but looking at the comment section I'm pretty sure my opinion is validated.
@futuremojo 7 months ago
Thank you!
@blindprogrammer 1 year ago
This is the most awesome video on Transformers!! You earned my respect and a subscriber too. 🙏🙏
@AxelTInd 1 year ago
Phenomenal video. Well-structured, concise, professional. You have a real talent for teaching!
@futuremojo 1 year ago
Thank you!
@arrekusua 8 months ago
Thank you so much for these videos!! Definitely one of the best videos on NLP out there!
@87ggggyyyy 1 year ago
Great video, Philip torr
@DamianReloaded 1 year ago
Wow, this is really a very good tutorial. Thank you very much for putting it up. Kudos!
@chrisogonas 1 year ago
Superb! Well illustrated. Thanks
@santuhazra1 1 year ago
Seems like the channel name got changed.. Big fan of your work.. 🙂 One of the best explanations of the transformer.. Waiting for more advanced topics..
@capyk5455 1 year ago
This is fantastic, thank you once again for your work :)
@futuremojo 1 year ago
Hope you find it useful!
@nilesh30300 1 year ago
Man... this is an awesome explanation by you. I can't thank you enough... Keep up the good work.
@kazeemkz 8 months ago
Many thanks for the detailed explanation. Your video has been helpful.
@monicameduri9692 1 month ago
Great content! Thanks a lot!
@SnoozeDog 9 months ago
Fantastic sir
@marttilaine6778 1 year ago
Thank you very much for this series; it was a wonderful explanation of NLP for me!
@BuddingAstroPhysicist 1 year ago
Your tutorials are a lifesaver, thanks a lot for this.
@romitbarua7081 1 year ago
This video is incredible! Many thanks!!
@FrankCai-e7r 1 year ago
Great lectures. Great teacher.
@AradAshrafi 9 months ago
What an amazing tutorial. Thank you
@ricardocorreia3687 1 year ago
Man, you are a legend. Best explanation ever.
@ilyas8523 1 year ago
Amazing explanation, thank you.
@caiyu538 1 year ago
Great. Great. Great
@marcinbulka2829 1 year ago
Greatly explained. I think showing implementation examples like you did is the best way of explaining mathematical concepts. I'm not sure if I missed it, but I don't think your notebook explains how to calculate the loss when training a transformer, and I think it would be good to explain this.
@futuremojo 1 year ago
You can see how loss is implemented in the previous video: kzbin.info/www/bejne/qqesq3WlqtZpos0
@sarat.6954 1 year ago
This video was perfect. Thank you.
@jeffcav2119 1 year ago
This video is AWESOME!!!
@FlannelCamel 1 year ago
Amazing work!
@futuremojo 1 year ago
Thank you!
@wilfredomartel7781 1 year ago
Amazing explanation ❤
@aminekimo4606 1 year ago
This presentation of the paper "Attention Is All You Need" is a hands-on key to the AI revolution.
@jenilshyara6746 11 months ago
Great explanation 🔥🔥🔥
@khushbootaneja6739 1 year ago
Nice video
@chris1324_ 1 year ago
Amazing videos; I'm currently working on my thesis, which aims to incorporate NLP techniques in different areas. Given the immense potential of transformers, using a transformer-like architecture is an easy bet. I'd been trying to understand them thoroughly for a while, but not anymore. Thank you so much. I've cited your website; I hope that's okay with you. Let me know if you have a preferred citation :)
@futuremojo 1 year ago
Thank you! :-) Citing the website is great.
@theindianrover2007 4 months ago
Awesome
@MangaMania24 1 year ago
More content please!
@futuremojo 1 year ago
Tell me more to help me generate ideas: what would you find useful? What are you trying to accomplish?
@MangaMania24 1 year ago
@futuremojo Wow, that's a tough question :p I think I'll patiently wait for the content, can't think of a topic :p Thanks for your content though, I spent the last week watching your videos every day, and man, the confidence boost I've got from understanding how everything works! Great job 👏
@futuremojo 1 year ago
@@MangaMania24 I'm glad it helped!
@maj46978 1 year ago
Please make a series of hands-on videos on large language models... I'm now 100% sure nobody on this earth can explain it like you ❤
@pictzone 1 year ago
Simply astounding presentation!! Just wondering, how many years did you have to study to get to such a level of deep understanding of this field? (All connected disciplines included.) Asking because, while I do get the overall ideas, understanding why certain things are done differently depending on your needs seems impossible unless you have a profound understanding of the concepts. I feel like I would essentially be like a blind man following orders if I tried to build useful apps out of these techniques, only going by what experts are suggesting, because working through exactly why all the effects of these equations work would take so many years to truly figure out. Huge respect for you!
@futuremojo 1 year ago
Thank you! Without false modesty, I wouldn't say I have a deep understanding of the field. I think very, very few people do. And there are tons of productive people doing great work with varying levels of understanding. I would keep two things in mind: 1. A lot of times in this field, researchers come up with an idea based on some intuition/hunch or a rework of someone else's idea, and just try it. It's rarely the case that an idea is based on some detailed, logical rationalization beforehand. They're not throwing random stuff at a wall, but it's not completely informed either. Even today, researchers still don't know why, after training goes past a certain size and time threshold, LLMs suddenly start exhibiting advanced behaviors. If you read The Genius Makers, you'll see this field is largely empirical with persistent individuals nudging it forward one experiment at a time. 2. Don't think you need to understand every last detail before building something cool. By definition, you'll never get there. And the experts themselves don't have a complete understanding! You can start building now and slowly pick up details as you go. Just start and the process itself will force you to learn.
@pictzone 1 year ago
@@futuremojo Hey man, your comment has given me so much courage and motivation, it's unbelievable. You've really renewed my interest in these kind of things and made me realize it's ok to dive deep into advanced topics in a "blind faith" type of approach. Really appreciate your insights. You might think I'm exaggerating, but no. You've really told me exactly what I needed to hear. Thank you!
@wilsonbecker1881 8 months ago
Best ever
@daryladhityahenry 8 months ago
Hi! I'm currently on my first episode of this lesson; I'm really excited and hope to learn a lot. Will you create more tutorials on these kinds of topics? Or will these 15 videos kind of transform me into some expert (remember, "kinda" expert) in NLP and transformers, so I can pre-train my own model and fine-tune it perfectly? (Assuming I have the capability to gather the data?) Thankkssss
@amortalbeing 11 months ago
thanks a lot
@MachineLearningZuu 4 months ago
Ma bro just dropped the "Best NLP Course" on planet Earth and disappeared.
@tilkesh 1 year ago
Thx
@deeps-n5y 1 year ago
just adding a comment so that the video reaches more people :)
@futuremojo 1 year ago
:-)
@puzan7685 7 months ago
Hello goddddddddddddd. Thank you so much
@loicbaconnier9150 1 year ago
Thanks for your work, but in your explanation of Wq in the code session (34 min) you say "dimension of 3 by 4". Is that the right dimension, please?
@futuremojo 1 year ago
Thanks for the catch @loicbaconnier9150! Nope, the weights are 12x4 which then help project the keys, queries, and values in each head down to 3x4. I added a correction caption and also a correction in the description.
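A quick shape check of that correction as a small NumPy sketch. The toy numbers (sequence length 3, embedding size 12, head size 4) mirror the demo, but the variable names here are mine:

```python
import numpy as np

seq_len, embed_dim, head_dim = 3, 12, 4

x   = np.random.randn(seq_len, embed_dim)   # token embeddings: (3, 12)
W_q = np.random.randn(embed_dim, head_dim)  # per-head query weights: (12, 4)

q = x @ W_q                                 # queries projected into one head: (3, 4)
print(W_q.shape, q.shape)                   # (12, 4) (3, 4)
```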
@7900Nick 1 year ago
Great tutorial and fantastic work! 💪 I have 2 questions in relation to byte-pair encoding (BPE) of tokens. After the vocabulary is made, you said the original transformer maps tokens/words into embeddings with a dimension of 512. I suppose that each token's word embedding is initialized at random, which brings me to my questions: ☺ 1. How are a transformer's word embedding values actually updated (sorry if I missed it!)? 2. Is a word's embedding still fixed in value once it is made, like in word2vec and GloVe, or is it constantly updated even though a word should react differently depending on the context of a sentence?
@futuremojo 1 year ago
Hi, thanks for the comment! Yes, if you're training a transformer from scratch, then the embeddings are usually initialized randomly. If there happens to be a BPE package that includes the right dimension embeddings, you can initialize your embedding layer with them, too. For example, we check out BPEMB (bpemb.h-its.org/) in the demo which has 100-dimension embeddings. If your use case is ok with that, then you can initialize your embedding layer with them. Regarding your two questions: 1. The transformer's embeddings are updated via backpropagation. So once the loss is calculated, the weights in the encoder/decoder blocks AND the word and positional embeddings are updated. We cover backpropagation in detail here if you're interested: kzbin.info/www/bejne/jISUnpqtdrhre68 2. When training from scratch, the embeddings are constantly updated. Now let's say the model is trained and you want to fine-tune it for a downstream task. At that point, you can choose to freeze the already trained layers and only train the fine-tuning layer (i.e. the embeddings won't change), or you can allow the whole model (including the embeddings) to adjust as well. We take the latter approach in our fine-tuning demo. We cover word vectors here if you're interested: kzbin.info/www/bejne/f5bFfWOIhqtoosk Did that answer your question?
@7900Nick 1 year ago
@futuremojo Thank you very much for your thorough and lengthy response! I'll try to rephrase my question because I'm still not sure how the word embeddings are processed by the transformer. I could be wrong, but doesn't each subword in a vocabulary (e.g. 50k words) have its own word embedding of size 512, with the different values in that vector corresponding to linguistic features? 😊 According to how I understood the explanation, the loss calculated using backpropagation only modifies the weights of the various attention-head layers inside the transformer and does not alter the values of the word embeddings. Am I totally wrong, or do the embeddings actually also get updated? Based on different demos I've seen, people don't update the tokenizer of a specific transformer even after fine-tuning. Sure, I will write a comment, and thank you for the course! 🙌
@futuremojo 1 year ago
@@7900Nick Thanks for the testimonial, Nick! "I could be wrong, but doesn't each subword in a vocabulary (e.g. 50k words) have its own word embedding of size 512, with the different values in that vector corresponding to linguistic features?" This is correct. "According to how I understood the explanation, loss calculated using backpropagation only modifies the weights of various head attention layers inside the transformer and does not alter the values of the word embeddings. Am I totally wrong or does the embeddings actually also be updated?" The embeddings *ARE* updated during PRE-training. So once the loss is calculated, the feed-forward layers, the attention layers, AND the embedding layers are updated to minimize the loss. This is how the model arrives at embeddings that capture linguistic properties of the words (in such a way that it helps with the training goal). I think the confusion may lie in (a) the tokenizer's role and (b) the options during fine-tuning. So let's say you decide to train a model from scratch starting with the tokenizer. You decide to use English Wikipedia as your corpus. You fit your tokenizer using BPE over the corpus and it creates a 50k-word internal vocabulary. Ok, now you have your tokenizer. At this point, there are NO embeddings in the picture. The tokenizer's only job is to take whatever text you give it, and break it down into tokens based on the corpus it was fit on. It does not contain any weights. In the demo, we showed BPE-MB which *happened to come with embeddings* but we don't use them. Next, you initialize your model which has its various layers including embedding layers. You set the embedding layer size based on the vocabulary size and the embedding dimension you want (so let's say 50,000 x 512). Every subword in the tokenizer's vocabulary has an integer ID, and this integer ID is used to index into the embedding layer to pull out the right embedding. The embeddings are part of the model, not the tokenizer. Alright, you then train the model end to end on whatever task, and everything is updated via backprop including all the embedding layers. The model is now pre-trained. Ok, now you want to fine-tune it. You have multiple options: 1. You can train only the head (e.g. a classifier) and FREEZE the pre-trained part of the model. This means the attention layers, feed-forward layers, and embedding layers DO NOT change. 2. You can train the head and allow the rest of the model to ALSO be trained. In practice, it usually means the attention layers, feed-forward layers, and embedding layers will be adjusted a little via backprop to further minimize the loss. This is the option we take with BERT in the demo (i.e. we didn't freeze anything). In both cases, the tokenizer (whose only job is to tokenize text and has no trainable parameters in it) is left alone. Let me know if that helps.
@7900Nick 1 year ago
@futuremojo Mate, you are absolutely a gem. Nitin, you are a born teacher, and thank you very much for your explanation. 👏 Normally I reply to messages much faster, but I have been quite busy lately with both family and work. You have really demystified a lot of NLP for me!! 🤗 But just to be sure, there is an embedding layer inside that transformer that corresponds to the indices of the words that have been tokenized against the vocabulary, right? So, when people train from scratch, continue to pre-train (MLM), or fine-tune a transformer for a specific task, the word embeddings of all the words in the vocabulary (50k) are updated inside an embedding layer of the new transformer model, correct? Therefore, the word embeddings of the old pre-trained model aren't used/touched when retraining a new transformer, just like your explanation of the BPE-MB case. Because these embedding layers will be updated inside the new transformer model, adding new words to a vocabulary from 50k => 60k is not a problem, since it is part of the training. ☺ I apologize for bothering you again; as I said before, you have done an excellent job; this is simply the only point on which I am unsure.
@futuremojo 1 year ago
@@7900Nick "But just to be sure, there is an embedding layer inside that transformer that corresponds to the index of the words that have been tokenized in the vocabulary, right?" Correct. "So, when people train from scratch, continue to pre-train (MLM), or fine-tune a transformer to a specific task, the word embedding of all the words in the vocabulary (50k) is updated inside an embedding layer of the new transformer model, correct?" You have the right idea but we need to be careful here with wording. When you train a transformer from scratch/pre-train it, then yes, the embeddings keep getting updated during training. When it's time to fine-tune it, it's still the same transformer model but with an additional model head attached to it. The head will vary depending on the task. At that point, you can choose to train **only** the head (which means the embeddings won't change), or you can choose to let the transformer's body weights update as well (which means the embeddings will change). You can even choose to unfreeze only the few top layers of the transformer. The point is: when fine-tuning, whether the embedding layers update is your choice. "Therefore, the word embeddings of the old pre-trained model aren't used/touched when retraining a new transformer, just like your explanation of the BPE-MB case. Because these embedding layers will be updated inside the new transformer model, adding new words to a vocabulary from 50k => 60k is not a problem, since it is part of the training." If you have a pre-trained model but you decide that you want to train your own from scratch (but using the same architecture), then yeah, the embeddings will also be trained at the same time. And yes, you can have a larger vocabulary. For example, this is the default config for BERT: huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertConfig It has a vocabulary size of 30,522. If you fit a tokenizer on a corpus such that it ends up with a vocabulary of 40,000, then you can instantiate a BERT model with that larger vocabulary and train it from scratch. "this is simply the only point on which I am unsure." It's fine. It's good to be clear on things. If something doesn't click, it probably means there's a hole in the explanation.
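If it helps to see those fine-tuning options as code, here is a rough Hugging Face (TensorFlow) sketch. It assumes TFBertForSequenceClassification and the bert-base-uncased checkpoint; treat it as an illustration of freezing vs. not freezing the pre-trained body, not the exact code from the demo (which fine-tunes the whole model, i.e. option 2).

```python
import tensorflow as tf
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Option 1: freeze the pre-trained body and train only the classification head.
# The embeddings, attention layers, and feed-forward layers keep their
# pre-trained values.
model.bert.trainable = False

# Option 2 (what the demo does): leave everything trainable. The head is
# trained from scratch while the body, including the embedding layers,
# gets nudged a little by backprop.
# model.bert.trainable = True

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# Compare how many weight tensors will actually be updated in each option.
print(len(model.trainable_weights), len(model.non_trainable_weights))
```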
@wenmei8669 11 months ago
The material and your explanation are amazing!!! Thanks a lot. I am wondering if it is possible to get the slides for your presentation?
@KutuluTuk 7 months ago
I don't understand why it's always 512 input tokens.. how do I make the size bigger?
@gol197884266 1 year ago
A gem 😊
@rabailkamboh8857 11 months ago
best
@byotikram4495 1 year ago
Thanks for this awesome in-depth explanation of the transformer. I'm just curious about one aspect. In the explanation slides you use the sine and cosine functions mentioned in the paper to generate the positional embeddings, but I haven't seen that in the implementation. So how will a random initialization of the positional embeddings capture the position information of the sequences? I may have missed some points. Please clarify this point only.
@futuremojo 1 year ago
Positional embeddings are trainable. So even though the positional embeddings are initialized with random values, they are adjusted over time via backprop to better achieve the goal.
@byotikram4495 1 year ago
@futuremojo But to capture the initial word ordering of the sequence before training starts, isn't it necessary to encode the position information and the relative distance between the tokens in the sequence?
@futuremojo 1 year ago
​@@byotikram4495 I think the confusion here is thinking that the neural network can tell the difference between position information that looks sequential to you and position information that looks random. Whether you're using sine/cosine wave values or positional embeddings, from the network's perspective, all it's seeing is input. All the network needs is information that helps it learn context. And so, the designers of the original transformer chose to sample values from sine/cosine waves to differentiate tokens by position. Here's the critical point: even after you add these wave values to the embedding, the untrained network has no idea what they mean. Rather, it learns over time that this particular embedding with this particular position information provides this context. So the word "rock" in position 1 might have an embedding that looks like "123", while the embedding for "rock" in position 2 might have an embedding that looks like "789". And the network learns what context each embedding provides to the overall sequence. Now, because all the network is seeing is particular embeddings, we are free to use other techniques to add position information as long as it's rich enough to differentiate tokens. In this case, positional embeddings work just as well while being simpler.
@byotikram4495 1 year ago
@futuremojo OK, now it's clear. So basically the position information is learned by the network based on context over time through backprop. What I initially thought was that once we add the position information sampled from the sine/cosine waves to the embedding, the resulting vectors would capture the relative position information of the tokens in the sequence, and also the distance between the tokens, right at the start of training. That's why the confusion arose. Thank you for this thorough explanation. It means a lot.
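For anyone comparing the two approaches from this thread, here's a minimal sketch (my own illustrative code, not from the notebook) of the fixed sinusoidal encodings from the paper next to a learned positional embedding layer like the one used in the demo:

```python
import numpy as np
import tensorflow as tf

max_len, d_model = 50, 12

# Fixed sinusoidal encodings from "Attention Is All You Need":
# even dimensions use sine, odd dimensions use cosine.
pos = np.arange(max_len)[:, None]                  # (max_len, 1)
i   = np.arange(d_model)[None, :]                  # (1, d_model)
angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
sinusoidal = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))   # (50, 12)

# Learned positional embeddings: random at first, then updated by backprop
# along with the rest of the model during training.
learned = tf.keras.layers.Embedding(input_dim=max_len, output_dim=d_model)
positions = tf.range(max_len)
print(sinusoidal.shape, learned(positions).shape)  # (50, 12) (50, 12)
```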
@gigiopincio5006 1 year ago
wow.
@iqranaveed2660 1 year ago
Sir, can you guide me on what to do after building the from-scratch transformer? Your video is really good.
@futuremojo 1 year ago
It depends on your goal.
@iqranaveed2660 1 year ago
@futuremojo I want to do abstractive summarization. Please can you guide me through the further process?
@futuremojo 1 year ago
@@iqranaveed2660 At this point, you can just use an LLM. GPT, Bard, Claude, etc. Input your text along with some instructions, get a summarization.
@iqranaveed2660 1 year ago
@futuremojo Sir, please can you guide me on how to use the from-scratch transformer for summarization? I don't want to use a pre-trained transformer. Please reply.
@user-pt7gs2ei1r 1 year ago
I want to kiss and hug you, and kiss and hug you, and ... till the end of the world, you are such a talented and great teacher!
@panditamey1 1 year ago
Fantastic video. I have a question about head_dim. Why is the embed_dim divided by num_heads? I haven't understood it completely.
@futuremojo 1 year ago
Because each head operates in a lower dimensional space. In our example, the original embedding dimension is 12 and we have three heads. So by dividing 12 by 3, each head now operates in a lower dimensional space of 4. Let's say we didn't do that. Instead, let's say we had each head operate in the original 12-dimensional space. That would dramatically increase memory requirements and training time. Maybe that would result in slightly better performance but the tradeoff was probably not worth it. By having each head operate in a lower dimensional space, we get the benefits of multiple heads while keeping the compute and memory requirements the same. There's also nothing stopping us from making each head dimension different. We could make the first head dimension 5, the second head dimension 3, and the last head dimension 4 so that it still adds up to 12, but then you sacrifice convenience and clarity for no benefit (AFAIK). Let me know if that helps.
@panditamey1 1 year ago
@@futuremojo Thank you for such a thorough explanation!!!
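To make that division concrete (same toy numbers as the video; the code itself is just my sketch): one projection produces a 12-dimensional vector per token, which is then split into 3 heads of 4 dimensions each, so the total amount of work stays roughly the same as a single 12-dimensional head.

```python
import numpy as np

seq_len, embed_dim, num_heads = 3, 12, 3
head_dim = embed_dim // num_heads              # 12 // 3 = 4

x = np.random.randn(seq_len, embed_dim)        # (3, 12)

# One linear projection for all heads, then reshape into per-head chunks.
W_q = np.random.randn(embed_dim, embed_dim)    # (12, 12)
q = x @ W_q                                    # (3, 12)
q_heads = q.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
print(q_heads.shape)                           # (3, 3, 4): 3 heads, 3 tokens, 4 dims each
```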
@panditamey1 1 year ago
@nitin_punjabi I've got one more question for you. I tried implementing the same thing but without using TensorFlow, and I was able to run it when there is no batch. However, after creating batched data, I ran into the "shapes not aligned" issue. Here is a snippet from the code:
    def scaled_self_attention(query, key, value):
        key_dim = key.shape[1]
        QK = np.dot(query, key.T)   # <-- error on this line
Have you run into this issue?
@futuremojo 1 year ago
@@panditamey1 I haven't run into this issue, but that's likely because there's subtle behaviour differences between dot and matmul. I would first log the inputs you're getting into your function (include what the transposed keys look like) vs the Colab notebook inputs. Make sure they're the same or set up in such a way that they would lead to the same result. If so, I would Google behaviour differences between dot and matmul. My guess is your issue is most likely related to that.
@panditamey1 1 year ago
@futuremojo Sure, thanks a lot!!
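In case it helps others hitting the same error: with a batch dimension, key.T reverses all axes, so the shapes no longer line up. Here's a hedged NumPy sketch of one common fix (swap only the last two axes and use matmul, which broadcasts over the batch); note it uses key.shape[-1] rather than the snippet's key.shape[1]:

```python
import numpy as np

def scaled_self_attention(query, key, value):
    # query, key, value: (batch, seq_len, key_dim)
    key_dim = key.shape[-1]
    # Transpose only the last two axes; key.T would also flip the batch axis.
    scores = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(key_dim)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return np.matmul(weights, value)

batch = np.random.randn(2, 3, 4)   # (batch=2, seq_len=3, key_dim=4)
print(scaled_self_attention(batch, batch, batch).shape)   # (2, 3, 4)
```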
@peace-it4rg 5 months ago
Bro really made a transformer video with a transformer.
@nlpengineer1574 1 year ago
I hope you do the same lesson using PyTorch. I picked up some ideas but am still struggling with the code. Great explanation though.
@futuremojo 1 year ago
Which part of the code are you struggling with?
@nlpengineer1574 1 year ago
@futuremojo The theory is simple and easy, but when the coding starts I'm lost. - I don't understand what happens in the embedding layer, because it seems like the (W matrices * word_vector) products are embedded in this layer, while in theory they're not! - Secondly: what is vocab_size? By my understanding it should be the length of the sequence, but again, every implementation I read proves how wrong I am. - Why should we integer-divide embedding_size // n_heads to get d_k? 16:05 .. Sorry if I sound rude, but this has really frustrated me this week and I don't know what I'm missing.. Thank you again.
@futuremojo 1 year ago
@@nlpengineer1574 1. Re: embedding layer, have you watched the video on word vectors (kzbin.info/www/bejne/f5bFfWOIhqtoosk)? That should clear up any confusion regarding embeddings. In short, you can think of the embedding layer as a lookup table. Each word in your vocabulary maps to a row in this table. So the word "dog" might map to row 1, in which case, the embedding from row 1 is used as the embedding for the word "dog". These embeddings can be pre-trained or trained along with the model. The embedding weights are unrelated to the transformer weights. See the word vectors video for more info. 2. vocab_size is exactly what it sounds like: it's the size of your vocabulary. It's not the length of the sequence. The vocabulary represents all the different character, words, or subwords your model handles. It can't be infinite because your embedding table has to be a fixed size. If you're wondering where the vocabulary comes from or how the size is determined, this is covered in the word vectors video. 3. A single attention head works on the full embedding size, right? Ok, let's say you now have three heads. If we DON'T divide, then we essentially triple the computation cost because it's going to be 3 * embedding_size. By dividing embedding_size by n_heads, we can have multiple heads for roughly the same computation cost, and even though it means the embeddings in each head will now be smaller, it turns out it still works pretty well. Hope that helps.
@nlpengineer1574 1 year ago
@futuremojo Thank you, man, for your time; you are a true hero to me. I still have a problem with vocab_size != seq_length, but the way I think of it right now is that the embedding layer creates a blueprint of where we retrieve our vocabulary entries (vocab_size) and the size of the embedding we will give them (embed_size). You mention "and even though it means the embeddings in each head will now be smaller, it turns out it still works pretty well" - here you completely address my concern, because if we divide the embeddings by num_heads we get a smaller embedding for each head, but since that's not a problem, I think the embedding size is somewhat arbitrary here. Anyway, I feel more confident about my understanding right now. Again, thank you for your time and patience.
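A tiny sketch of the lookup-table view discussed in this thread, in case it helps keep vocab_size and the sequence length apart (the numbers and token IDs here are made up):

```python
import tensorflow as tf

vocab_size = 50_000   # how many distinct subwords the tokenizer can produce
embed_dim  = 512      # size of each subword's vector
# The sequence length is just the length of one particular input, unrelated to vocab_size.

embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

token_ids = tf.constant([[17, 433, 9021, 5, 17, 88]])   # (batch=1, seq_len=6) of vocabulary indices
vectors = embedding(token_ids)
print(vectors.shape)   # (1, 6, 512): six rows pulled out of the 50,000-row table
```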
@amparoconsuelo9451 1 year ago
Will you please show how your NLP demystification appears as complete source code in Python or llama.cpp?
@ilyas8523 1 year ago
Hi, a question. If I am building an encoder for a regression problem where the output values are found at the end of the encoder, how would I go about this? How should I change the feed-forward network to make this work? Should it take all of the embeddings at once? I am watching the video again, so maybe I'll figure out the answer, but until then, some guidance or advice would be great. Thanks! To be clear, each input sequence of my data is about 1024 long [text data], and the output I need to predict is an array of 2 numerical outputs [y1, y2].
@futuremojo 1 year ago
If you want to stick with an Encoder solution, then one idea is to use two regressor heads on the CLS token. One regressor head outputs y1, the other regressor head outputs y2. If the numbers are bound (e.g. between 1 and 10 inclusive), you could even have two classifiers processing the CLS token.
@ilyas8523 1 year ago
@futuremojo The targets are float values where the min is around -2.#### and the max around 4.#### in the training data, with no set boundaries. I'm not sure about the test data since I have no access to it (Kaggle competition). So I'll probably go with two regressor heads. Time to learn how to do this; I look forward to it! Thank you once again for taking the time to teach us. Your lessons have been very useful in my journey. Edit: Do I have to fine-tune BERT for this?
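A rough Keras sketch of the two-regressor-head idea, assuming TFBertModel, pooling on the [CLS] token, and illustrative layer names; it's a hedged illustration rather than a recommendation for the competition (note BERT itself tops out at 512 tokens, so longer inputs would need truncation or a different backbone):

```python
import tensorflow as tf
from transformers import TFBertModel

max_len = 512  # BERT's positional limit; longer sequences must be truncated

input_ids      = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

bert = TFBertModel.from_pretrained("bert-base-uncased")
hidden = bert(input_ids, attention_mask=attention_mask).last_hidden_state
cls = hidden[:, 0, :]                            # the [CLS] token's vector

y1 = tf.keras.layers.Dense(1, name="y1")(cls)    # first regression target
y2 = tf.keras.layers.Dense(1, name="y2")(cls)    # second regression target

model = tf.keras.Model([input_ids, attention_mask], [y1, y2])
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5), loss="mse")
```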
@sourabhguptaurl 1 year ago
Just wonderful. How do I pay you?
@TTTrouble 1 year ago
Oh my gosh, I can't tell you how many times in your explanation of transformers I was like..... OH GOD, NOT ANOTHER wrinkle of complexity. My brain hurts… I honestly feel like I understand the math, strictly speaking, but so much of the architecture seems random, or it's hard to understand why it works. I think you did a fantastic job explaining everything in a slow and methodical way, but alas, I just find there's something about this I can't wrap my head around even after watching dozens of videos on it. How do you get 3 different matrices (the key, query, value) from 1 input to somehow learn meaning? It's not clear to me why the dot product of the embedding and another word can represent higher-level meaning; that just sounds like magic (because obviously it works). Blargh, and that's before you break it out into multiple heads and actually say you can somehow train 8 heads of attention from the single input vector. Like, how do the KQV matrices learn generalized meaning in a sentence from SGD? Why not add 100 heads or 1000 heads if going from 1 to 8 was useful? Blah, sorry I'm rambling; it's frustrating that I'm not even sure I can articulate exactly what it is about self-attention that feels like cheating to me. Something about it is not clicking, though the rote math of it all makes well enough sense. All that aside, thanks for all your hard work and for sharing this for me to struggle through. It is much appreciated.
@ilyas8523 1 year ago
Remember that the embedding goes through a positional encoding layer where the word "dog" can have many other vectors depending on the other words in the sentence. The dot product is well-suited for this purpose because it measures the cosine similarity between two vectors. When the dot product of two vectors is high, it indicates that they are pointing in similar directions or have similar orientations. This implies that the query vector and key vector are more similar or relevant to each other. edit: I am also not fully understanding everything but the secret is to keep doing more and more research
@TTTrouble 1 year ago
@ilyas8523 Haha, agreed, very much so. I must have watched at least a dozen if not more explanations of the self-attention mechanism and have talked with GPT-4 to try to get analogies and better ways to abstract out what's happening. Refreshing some linear algebra with 3Blue1Brown videos helped as well. Also, writing out the expansion by hand from memory for an example sentence embedding, first with numbers and then generalizing the process to variables with word vectors and subscripts instead of numbers, was a very tedious process, but I think that's finally when certain aspects started to click. I still struggle to fathom how and why the KQV weight matrices generalize so well to any given sequence and seem to be the distillation of human reasoning, but, slow as molasses, my brain is mulling through the theory of it all in endless wonder. If I see training as a black box and assume the described trained weight matrices are magically produced, I understand how the inference part works fairly robustly. I keep getting distracted by all the constant developments and whatnot, but it does finally feel like I'm making some incremental progress thanks to sticking with really trying to have a conceptual understanding of the fundamentals. Anyhow, forcing myself to articulate my doubts is helpful for me in its own way and not meant to waste anyone's time 😅. Hope your journey into understanding all of this stuff is going well, and thanks for your input!
@mostafaadel3452 10 months ago
Can you share the slides, please?
@jihanfarouq6904 1 year ago
This is amazing! I need those slides. Could you send them to me, please?
@chrs2436 6 months ago
The code in the notebook doesn't work 😮‍💨
@prashlovessamosa 4 months ago
Where are you, buddy? Cook something, please.
@efexzium 1 year ago
Listeria is not a rare word in Spanish.
@YHK_YT 1 year ago
Fard
@YHK_YT 4 months ago
I have no recollection of writing this
@ridvanffm 1 month ago
@futuremojo Awesome content! Can we expect more videos from you?