If you think I deserve it, please do consider a like and subscribe to support the channel. Thanks so much for watching ! :)
@mello1016 Жыл бұрын
Amazing videos, but dude please remove your face from the thumbnails. It adds zero value and is distracting from choosing the content. Don't follow a herd, better represent something unique in there.
@LuizHenrique-qr3lt Жыл бұрын
tks!!
@aryansoriginals Жыл бұрын
I think you deserve it. Big thank you to you.
@wishIKnewHowToLove Жыл бұрын
:)
@becayebalde3820 Жыл бұрын
shut up @@mello1016 adding his face makes it less abstract. You can see there is a human behind it, and that makes it easier to focus.
@sabzimatic Жыл бұрын
Hands down!! You have put in sincere effort in explaining crucial concepts in Transformers. Kudos to you! Wishing you the best !!
@CodeEmporium Жыл бұрын
Thanks for the super kind words! Definitely more to come. In the middle of making a series on Reinforcement Learning now :)
@becayebalde3820 Жыл бұрын
Man you are awesome! I thought transformers were too hard and needed too much effort to understand. While I was willing to put that much effort, your playlist has been extraordinarily useful to me. Thank you! I subscribed
@user-wr4yl7tx3w Жыл бұрын
Really enjoying this Transformer series.
@CodeEmporium Жыл бұрын
Thanks so much for watching and commenting on them :)
@XC35x781PTB Жыл бұрын
Attention is all you need! (cit.) Your tutorials are gold, thank you.
@CodeEmporium Жыл бұрын
You are so welcome !
@mihirchauhan6346 Жыл бұрын
At 6:24, reason 1 (periodicity) for positional encoding seemed under-specified and could use more clarity. You mentioned that a word attends to other words farther apart in the sentence by using the periodicity of the sine and cosine functions, which makes the solution tractable. Is this stated in a paper, or can you cite a source? Thanks.
@sumanthkumar4035 4 ай бұрын
I had the same exact doubt, did u get an answer for this ?
@PravsAI Жыл бұрын
One of the great explanations, Ajay. Happy to see Kannada words here! Looking forward to more videos like this :-) Kudos! Great work.
@XNexezX Жыл бұрын
Dude these videos are so nice. Starting my masters thesis on a transformer-based topic soon and this is really helping me learn the basics
@CodeEmporium Жыл бұрын
Perfect! Super glad you’re on this journey. The field is very fun :)
@DeepakKandel-go3ff Жыл бұрын
Yes, Totally worth a like.
@SanjithKumar-xf4sg Жыл бұрын
One of the best series for transformers😄
@pizzaeater9509 Жыл бұрын
Most brilliant and simple to understand video
@CodeEmporium Жыл бұрын
Haha thanks a lot :) I try
@shauryai Жыл бұрын
Thanks for the detailed videos on Transformer concepts!
@CodeEmporium Жыл бұрын
My pleasure :) Thank you for the support
@RohanShetty-zu4sc 4 ай бұрын
Respect for Karnataka ❤
@ThinAirElon Жыл бұрын
Theoretically, what does it mean to add the embedding vector and the positional vector?
@judedavis92 Жыл бұрын
Thanks for the great video! Loving this series!
@CodeEmporium Жыл бұрын
Thanks so much for watching ! Hope you enjoy the rest :)
Clear explanation. Suppose I want to use a transformer for time series where the time points are irregularly spaced rather than evenly sampled. How could I feed these timestamps into the transformer as positional encodings?
@andreytolkushkin3611 Жыл бұрын
Hi, you probably won't see this since it's been 6 months since you posted the video. However: I'm trying to write code for handwritten mathematical expression recognition and am trying to recreate the BTTR model. In it they use a DenseNet as the transformer encoder and use "image positional encoding", which is supposed to be a 2D generalization of the sinusoidal positional encoding. What is the logic behind the 2D image positional encoding? They do have code on GitHub, but I have no idea how to interpret it. Could you please help?
@srivatsa1193 Жыл бұрын
Bro.. You are Awesome!
@CodeEmporium Жыл бұрын
Nah you are awesome
@convolutionalnn2582 Жыл бұрын
In the final class in the code, position runs from 1 to the max sequence length, which includes both even and odd positions. I thought we use sin for even and cos for odd. So why are all positions, even and odd alike, passed to both sin and cos?
@CodeEmporium Жыл бұрын
I think I responded to this in another video you asked this question. Hope that helped tho :)
@convolutionalnn2582 Жыл бұрын
@@CodeEmporium Yeah but you didn't answer it fully
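Note on the question in this thread: in the sinusoidal scheme from "Attention Is All You Need", the even/odd split applies to the embedding dimension index, not to the position in the sentence. Every position goes through both functions. A minimal sketch, assuming 0-based positions and a hypothetical helper name `positional_encoding` (the video's own code may be organized differently):

```python
import math

def positional_encoding(max_sequence_length, d_model):
    """Return a max_sequence_length x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(max_sequence_length):   # every position, even or odd
        row = []
        for i in range(d_model):             # i is the embedding dimension index
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model)
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimension index -> sin, odd dimension index -> cos
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_sequence_length=10, d_model=6)
print(pe[0])  # position 0: sin(0) in even dims, cos(0) in odd dims
print(pe[1])  # position 1 also uses both sin and cos
```

So each row (one position) contains both sin and cos values; only the column decides which function is used.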
@aliwaleed9173 Жыл бұрын
Thanks for the information in this video. However, I think I have a misunderstanding. You said earlier that the vocabulary words go into the embedding vector space, which is like a box of related words grouped together. But at the start of this video you said the words are first one-hot encoded and then passed to the positional encoding. Which scenario is right: (1) we take the word, look it up in the embedding space, and then pass it to the positional encoder, or (2) we one-hot encode the word and then send it to the positional encoder?
@wishIKnewHowToLove Жыл бұрын
I like your English is clean :) no disgusting non-Californian accent :)
@CodeEmporium Жыл бұрын
Thank you for the compliments
@ziki5993 Жыл бұрын
wonderful video
@CodeEmporium Жыл бұрын
Thank you so much! :D
@sangabahati3545 Жыл бұрын
Your videos are useful for me. Congratulations on the excellent work. I suggest you also demonstrate a real example of multivariate time series forecasting or classification.
@ChrisHalden007 Жыл бұрын
Great video. Thanks
@CodeEmporium Жыл бұрын
My pleasure
@paull923 Жыл бұрын
Thx! Clear and concise!
@CodeEmporium Жыл бұрын
Thanks So much
@fadecomic 3 ай бұрын
I get why extrapolation is important, but why periodicity? Why does it matter that pos+5, pos+10, pos+15 from the example are periodic? What problem is that solving?
@lexingtonjackson3657 10 ай бұрын
I liked you already , now you are a kannadiga and i like you more.
@nawa4elgiza 8 ай бұрын
i love your shit man, this was so usefull i actually understood this ml shit and now can be elon musk up in this llm shit
@AbhishekS-cv3cr Жыл бұрын
approved!
@LuizHenrique-qr3lt Жыл бұрын
My second doubt: when I use BertTokenizer, for example, it transforms the text [my name is ajay] into a list of integers such as [101, 11590, 11324, 10124, 138, 78761, 102]. Where does that part go? I couldn't understand that part.
@CodeEmporium Жыл бұрын
So I haven’t shown the text encoding details just yet. :) Since 4 words were encoded into 7 numbers, I assume the “BertTokenizer” is encoding each subword / word piece into some number. Essentially, the tokenizer takes the sentence, breaks it down into word pieces (7 in this case), and maps each one to a unique integer. Later on, you will see each number being mapped to a larger vector (I explained more details about why these vectors exist in the other comment).
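The pipeline described in this reply (sentence, then word pieces, then integer IDs, then vectors) can be sketched with a toy example. Everything below is hypothetical: the vocabulary, the IDs, and the random embedding table are made up for illustration and are not what BertTokenizer actually uses. In BERT-style tokenizers, special start and end tokens are typically added around the sentence, which is likely why 4 words can yield 7 IDs.

```python
import random

# Hypothetical toy vocabulary; a real tokenizer ships its own, much larger one.
vocab = {"[CLS]": 0, "[SEP]": 1, "my": 2, "name": 3, "is": 4, "a": 5, "##jay": 6}

def tokenize(text):
    """Greedy longest-match word-piece split over the toy vocabulary."""
    pieces = ["[CLS]"]
    for word in text.lower().split():
        start = 0
        while start < len(word):
            end = len(word)
            piece = None
            while end > start:
                candidate = word[start:end] if start == 0 else "##" + word[start:end]
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                raise ValueError(f"cannot tokenize {word!r} with the toy vocabulary")
            pieces.append(piece)
            start = end
    pieces.append("[SEP]")
    return pieces

pieces = tokenize("my name is ajay")
ids = [vocab[p] for p in pieces]          # word pieces -> integer IDs

# Each ID selects one row of an embedding table (random here, learned in practice).
d_model = 4
random.seed(0)
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]
vectors = [embedding_table[i] for i in ids]

print(pieces)
print(ids)
print(len(vectors), "vectors of size", len(vectors[0]))
```

The integers themselves never enter the network's arithmetic; they only index rows of the embedding table.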
@hermannangstl1904 Жыл бұрын
From what I understood each word/token is represented by a 512-dimensional vector. This values of this vector are modified by means of (Self)Attention and Positional Encoding. What is a bit counter-intuitive for me is that the place in which a word/token comes can be different in different sentences. For example lets take the word "Ajay". (1) In this sentence it's in 4th position: "My name is Ajay" (2) In a different sentence it is on 1st position: "Ajay explains very well". So the Positional Encodings for the word "Ajay" vary - they might be different in each sentence. How can the network be trained, how can it learn, with such contradicting input data?
@CodeEmporium Жыл бұрын
This is a good question. But it intuitively does make sense that the same word in different sentences can have different meanings. Take the word “grounded”. You can represent this as a 512-dimensional vector. But let’s say “grounded” occurs in 2 sentences: (1) The truth is grounded in reality (2) You’re grounded! Go to your room. In these examples, “grounded” has differing meanings and should hence have different vector representations. This is why we need surrounding context to understand word vectors individually. This is probably a little hard to see with your example since “Ajay” is a proper noun. However, for non-proper nouns, context matters. I think you should take a look at the paper “Deep Contextualized Word Representations” by Matthew Peters et al. (2018). They more formally answer the question you are asking. This is the paper that introduced ELMo embeddings. According to this paper, using different vectors based on context improved models on tasks such as part-of-speech tagging and language modeling.
@DevelopersHutt Жыл бұрын
You've raised an important point. It is true that the positional encoding for a word like "Ajay" varies with its position in different sentences, but the Transformer model can still learn from it and make sense of it. Consider the word "Ajay" in two sentences: (1) Sentence 1: "My name is Ajay." (2) Sentence 2: "Ajay explains very well." Here's a simplified example of how it works. Input Encoding: each word, including "Ajay", is initially represented by a 512-dimensional vector. The token embedding of "Ajay" is the same learned vector in both sentences, for example [0.1, 0.2, 0.3, ..., 0.4]. Positional Encoding: the model adds positional encodings to differentiate the positions of words. Sentence 1: the positional encoding for the 4th position is, say, [0.4, 0.3, 0.2, ..., 0.1]. Sentence 2: the positional encoding for the 1st position is, say, [1.0, 0.9, 0.8, ..., 0.5]. Attention and Context: the Transformer's attention mechanism works on the sum of the input embedding and the positional encoding to compute contextualized representations. Sentence 1: attention uses the embedding of "Ajay" combined with the encoding of the 4th position to capture its context within the sentence. Sentence 2: similarly, attention uses the embedding of "Ajay" combined with the encoding of the 1st position in the context of the second sentence. By attending to different positions and incorporating positional encodings, the model can learn to associate the word "Ajay" with its specific context and meaning in each sentence. Through training on many examples, the model adjusts its weights and learns to generate appropriate representations for words based on their positions, allowing it to capture the contextual relationships between words.
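A small sketch of the point discussed in this thread: one token embedding, two different positions, two different input vectors. The embedding values are hypothetical, d_model is shrunk to 4 for readability, and positions are 0-based here (so "4th position" is index 3); the encoding follows the sinusoidal formula from "Attention Is All You Need".

```python
import math

def pe_row(pos, d_model):
    """Sinusoidal positional encoding for a single position."""
    row = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return row

d_model = 4
embedding_ajay = [0.1, 0.2, 0.3, 0.4]   # hypothetical embedding, same in every sentence

# "My name is Ajay": 4th word, index 3
input_in_sentence_1 = [e + p for e, p in zip(embedding_ajay, pe_row(3, d_model))]
# "Ajay explains very well": 1st word, index 0
input_in_sentence_2 = [e + p for e, p in zip(embedding_ajay, pe_row(0, d_model))]

print(input_in_sentence_1)
print(input_in_sentence_2)
```

The two inputs differ only by the positional term, so the network sees a consistent "Ajay" component plus a consistent "position" component rather than contradictory data.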
@balakrishnaprasad8928 Жыл бұрын
Please make a detailed video series on the math for data science
@CodeEmporium Жыл бұрын
I have made some math in machine learning videos. Maybe check the playlist “The math you should know” on the channel.
@7_bairapraveen928 Жыл бұрын
I am kind of a newbie here, so if you think this is a valid question, please answer. Why are you introducing parameters of dimension 512 for the vocab size and making a neural network? I mean, what happens if we don't do that?
@CodeEmporium Жыл бұрын
Why are we using 512 dimensions instead of the one-hot vector of size equal to the vocabulary size? This is because of the curse of dimensionality. Vocabulary sizes are huge (often in the tens of thousands). This is a lot for any model, neural network or not, to process. There was a 2001 paper by Yoshua Bengio, “A Neural Probabilistic Language Model”, that describes exactly this issue and why learned embeddings were introduced. I would recommend giving it a read. Also, my next series will delve into the history of language models, so I hope you’ll stay tuned for this. Maybe some of the design choices will become clearer.
@Philippe.C.A-R Жыл бұрын
what a voice !!!
@neetpride5919 Жыл бұрын
Is there an advantage to using one-hot encoding instead of an integer index encoding for the words? If we're gonna download a pre-existing word2vec dictionary and map each word to its word vector during the data preparation anyway, the one-hot encoding seems like it'd just create an unnecessary large sparse matrix.
@CodeEmporium Жыл бұрын
The idea here is that we are not going to use a preexisting word2vec for the transformer. Everything, including the embedding for every word, will be learned during training. An issue with word2vec is that its embeddings are fixed and don’t necessarily capture word context very well. This concept was discussed in the paper that introduced ELMo, “Deep Contextualized Word Representations” (Peters et al., 2018). I would recommend giving this a read if you’re interested.
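On the one-hot versus integer index question in this thread: multiplying a one-hot vector by an embedding matrix selects exactly one row, so the two views give the same result. A minimal sketch with a tiny hypothetical embedding matrix (real ones are learned and much larger):

```python
vocab_size, d_model = 5, 3

# Hypothetical embedding matrix: row r holds the vector for token ID r.
embedding_matrix = [
    [float(r * d_model + c) for c in range(d_model)] for r in range(vocab_size)
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vector, matrix):
    """Row vector times matrix, written out explicitly."""
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(len(matrix[0]))
    ]

token_id = 3
via_one_hot = vec_mat(one_hot(token_id, vocab_size), embedding_matrix)
via_lookup = embedding_matrix[token_id]

print(via_one_hot)
print(via_lookup)
```

In practice, libraries tend to implement the lookup form because it avoids building the large sparse one-hot matrix, while the one-hot form is mainly a convenient way to explain the math.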
@FirstNameLastName-fv4eu 6 ай бұрын
Ajay you are starting a cult man!!! May God bless you.
@HASHEMJABER-g3p Жыл бұрын
Hey Ajay, first of all, this video is so well built that I will be recommending it to our data science, AI, and robotics clubs. Your content is great and I can see the next Andrew Ng before me. I do have a question, though: why must there be a max number of words in a transformer architecture? I don't fully understand the reason behind it, considering most of the operations in the first half don't require a fixed input length, since this isn't your usual neural network layer. Do you mind explaining? I do feel like this is flying over my head.
@CodeEmporium Жыл бұрын
Your words are too kind. And good question. What is fixed in this specific architecture is the maximum number of words in a sentence, not the actual number of words in a sentence. The remaining unused positions are filled with “padding tokens”. This will become clearer when you watch the videos on coding out the complete transformer in the playlist “Transformers from Scratch”. We essentially do this so we can pass fixed-size inputs through every part of the transformer. That said, I have seen more recent implementations where the size is dynamic.
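The padding idea from this reply can be sketched as follows. The token name `<PAD>`, the helper `pad_to_length`, and the truncation behavior are assumptions for illustration, not necessarily what the "Transformers from Scratch" code does.

```python
PAD = "<PAD>"  # hypothetical padding token name

def pad_to_length(tokens, max_sequence_length):
    """Truncate or right-pad a token list to exactly max_sequence_length."""
    tokens = tokens[:max_sequence_length]                      # drop any overflow
    return tokens + [PAD] * (max_sequence_length - len(tokens))

def padding_mask(padded_tokens):
    """True where a real token sits, False where there is padding."""
    return [token != PAD for token in padded_tokens]

batch = [
    ["my", "name", "is", "ajay"],
    ["ajay", "explains", "very", "well", "indeed", "today", "again"],
]
padded = [pad_to_length(sentence, 6) for sentence in batch]
masks = [padding_mask(sentence) for sentence in padded]

for sentence, mask in zip(padded, masks):
    print(sentence, mask)
```

With every sentence at the same length, a batch can be stacked into one tensor; the mask is what lets attention ignore the padded positions.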
@Slayer-dan Жыл бұрын
Thanks a lot 💚
@CodeEmporium Жыл бұрын
Super welcome
@SAIDULISLAM-kc8ps Жыл бұрын
Can you please tell me the difference between sequence length and dimension of embedding?
@CodeEmporium Жыл бұрын
Sequence length = maximum number of characters/words we can pass into the transformer at a time. Dimension of embedding = size of vector representing each character / word.
@SAIDULISLAM-kc8ps Жыл бұрын
@@CodeEmporium thanks a lot.
@superghettoindian01 Жыл бұрын
As before, great work on this Transformer Series! Am trying to go through all your code / videos slowly so I make sure I'm fully absorbing it. Where I'm struggling / slowest right now is in my intuition behind some of these tensor operations with stack / concatenate. Do you have any recommendations for study material apart from the torch documentation?
@CodeEmporium Жыл бұрын
Thanks so much! Hmm. Maybe hugging face has some good resources too. Aside from this, I’ll be making a playlist on the evolution of language models so some design choices become more intuitive. Hope you’ll stick around for that
@ilyas8523 Жыл бұрын
Great videos, especially the one where you explained what a transformer is. Beside youtube, do you have a full time job or is this it? Just curious
@CodeEmporium Жыл бұрын
Thank you! And yep, I have a full time job as a Machine Learning Engineer outside of YouTube :)
@LuizHenrique-qr3lt Жыл бұрын
Hey Ajay, great video!! Congratulations, I'm learning a lot from you thank you! Ajay I have some doubts, the first is that I didn't quite understand the difference between max sequence length and d_model. For example, if I have texts with 50 tokens in size, that is, my largest text has up to 50 tokens, this would be my max sequence length, however if my d_model were 10, my largest sequence would have to be divided into 5 to be able to pass through the model because it only accepts 10 tokens at a time, is my thinking correct?
@CodeEmporium Жыл бұрын
The way you described sequence length = 50 is correct. It is the maximum number of tokens you can pass into your network at a time (it’s the max number of words / subwords / characters). d_model is the embedding dimension. Models don’t understand words, but they understand numbers. And so, you transform every token into some set of numbers (called a vector), and the number of numbers in this vector is d_model. Let’s say d_model is 512 and also say we have a sentence “my name is ajay”. The word “my” would be converted into a 512-dimensional vector. As would “name”, “is” and “ajay”. The idea of these vectors / embeddings is to get some dense numeric representation of the context of a word (so similar words are represented with vectors that are close to each other and dissimilar words are represented with vectors that are farther from each other).
@LuizHenrique-qr3lt Жыл бұрын
@@CodeEmporium Hm, OK, good answer. Now another doubt: if d_model is the dimension each of my tokens is embedded in, why don't some transformer models accept very long texts? For example, with sequence length = 10 and d_model = 3, the phrase "my name is Ajay" would turn into 4 vectors with d_model dimensions each: my: [0, 0.2, 0.6], name: [0, 0.1, 0.11], is: [0.5, 0.2, 0.0], Ajay: [0, 0, 1]. Why can't I put very long sequences into my model? Why does d_model interfere with this?
@DevelopersHutt Жыл бұрын
@@LuizHenrique-qr3lt The max sequence length refers to the maximum number of tokens in a sequence, while d_model is the dimensionality of each token's embedding. They serve different purposes in the Transformer model. The max sequence length determines how many tokens can be processed at once, whereas d_model influences the capacity and expressive power of the model. In your example, with a max sequence length of 50 and d_model of 10, a 50-token text fits in one pass with no chunking: each of the 50 tokens is simply represented by a 10-dimensional vector. Only texts longer than the max sequence length need to be truncated or split.
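To make the two quantities in this thread concrete, here is a small sketch that embeds a 50-token text with d_model = 10. The token names and the random embedding function are hypothetical placeholders for a learned embedding table.

```python
import random

max_sequence_length = 50   # how many tokens fit in one pass
d_model = 10               # how many numbers describe each token

random.seed(0)

def toy_embed(token):
    """Hypothetical stand-in for a learned embedding lookup."""
    return [random.uniform(-1, 1) for _ in range(d_model)]

tokens = [f"tok{i}" for i in range(50)]      # a text that is exactly 50 tokens long
assert len(tokens) <= max_sequence_length    # fits without any chunking

embedded = [toy_embed(token) for token in tokens]
shape = (len(embedded), len(embedded[0]))
print(shape)  # rows = tokens in the sequence, columns = d_model
```

The sequence length sets the number of rows and d_model sets the number of columns, so changing one does not constrain the other.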
@lorenzobianconi7724 Жыл бұрын
Hi Ajay, thanks for your videos. Why are there 512 dimensions? Who established this number? And how can we count the 175B parameters in GPT-3? Can you make a video where you break down the whole process of a transformer in one clear shot, possibly not using translation but, for example, a question answering task? Thanks, love your videos and your determination to spread knowledge.
@giacomomunda3359 Жыл бұрын
512 is a hyperparameter. You can decide which dimension to use, but higher dimensions tend to work better in practice, since they are able to capture more linguistic information, e.g. semantics, syntax, etc. BERT (base), for instance, uses 768 dimensions, and the OpenAI ada embeddings have 1536 dimensions.
@ajaytaneja111 Жыл бұрын
Hi Ajay, isn't the purpose of positional encoding to figure out where the word is located in the sequence which actually the attention mechanism derives benefit from? Thanks... And again great content, grateful
@CodeEmporium Жыл бұрын
Yes! The idea overall is to create meaningful embeddings for words that capture context. This is opposed to the traditional CBOW or skip-gram word embeddings that don’t quite get this context.
@joaogoncalves1149 Жыл бұрын
I think that queen/king example is somewhat cherry picked, as the principle behind the analogy fails for many examples.
@aar953 Жыл бұрын
There is one mistake that you are making. We are not taking a single output as input to the decoder, but all the previous outputs up to the current time step as input to the decoder.
@CodeEmporium Жыл бұрын
Yea that’s correct from a practical standpoint. I dive into this when coding this out in the rest of this playlist “Transformers from scratch “. Hope those videos clear things up!
@aar953 Жыл бұрын
@@CodeEmporium Thanks for answering. I understand that you have to make a trade-off between simplicity and accuracy. Here, I just wanted to note that little more complexity would have added quite a lot more accuracy. Your content is excellent!
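The point raised in this thread, that the decoder receives all previously generated tokens at each step rather than only the last one, can be sketched as a greedy decoding loop. `toy_decoder_step` is a hypothetical stand-in for a trained decoder (it simply copies the source), and the `<START>` / `<END>` token names are assumptions.

```python
def toy_decoder_step(encoder_output, generated_so_far):
    """Hypothetical decoder: sees the WHOLE prefix and returns the next token."""
    position = len(generated_so_far) - 1          # tokens produced after <START>
    if position < len(encoder_output):
        return encoder_output[position]           # pretend the translation is a copy
    return "<END>"

def greedy_decode(source_tokens, max_sequence_length=10):
    generated = ["<START>"]
    while len(generated) < max_sequence_length:
        # The full list of previous outputs is passed in at every step.
        next_token = toy_decoder_step(source_tokens, generated)
        generated.append(next_token)
        if next_token == "<END>":
            break
    return generated

result = greedy_decode(["my", "name", "is", "ajay"])
print(result)
```

In a real transformer the prefix is embedded, positionally encoded, and passed through masked self-attention, but the control flow of feeding back the growing prefix is the same.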
@__hannibaalbarca__ Жыл бұрын
I don't know anything about Python, but it looks extremely slow.