Live - Transformers In-Depth Architecture Understanding - Attention Is All You Need

221,232 views

Krish Naik


All Credits To Jay Alammar
Reference Link: jalammar.github...
Research Paper: papers.nips.cc...
youtube channel : • Jay's Visual Intro to AI
Please donate if you want to support the channel through GPay UPI ID,
GPay: krishnaik06@okicici
Discord Server Link: / discord
Telegram link: t.me/joinchat/...
Please join my channel as a member to get additional benefits like Data Science materials, live streaming for members, and much more
/ @krishnaik06
Please also subscribe to my other channel
/ @krishnaikhindi
Connect with me here:
Twitter: / krishnaik06
Facebook: / krishnaik06
Instagram: / krishnaik06

Comments: 240
@jeeveshkataria6439
@jeeveshkataria6439 4 years ago
Sir, please release the video on BERT. Eagerly waiting for it.
@mohammadmasum4483
@mohammadmasum4483 1 year ago
@ 40:00 Why do we consider 64? It is based on how many attention heads you want to apply. We use an embedding size of 512 for each word and want to apply 8 self-attention heads; therefore each head uses (512/8 =) 64-dimensional Q, K, and V vectors. That way, when we concatenate all the attention heads afterward, we get back the same 512-dimensional word representation, which is the input to the feed-forward layer. Now, for instance, if you want 16 attention heads, you can use 32-dimensional Q, K, and V vectors. My opinion is that the initial word embedding size and the number of attention heads are the hyperparameters.
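A minimal numpy sketch of that dimension bookkeeping (my own illustration, not from the video; the numbers follow the paper: d_model = 512, 8 heads, 64 dimensions per head):

```python
import numpy as np

d_model, num_heads = 512, 8
d_k = d_model // num_heads            # 512 / 8 = 64 dimensions per head

seq_len = 4                           # e.g. a 4-word sentence
x = np.random.randn(seq_len, d_model)

# split the 512-dim representation into 8 heads of 64 dims each
heads = x.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)   # (8, 4, 64)

# ... each head would compute attention over its own 64-dim Q, K, V ...

# concatenating the heads restores the original 512 dims for the feed-forward layer
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(concat.shape)                   # (4, 512)
```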
@naveenkumarjadi2915
@naveenkumarjadi2915 2 months ago
great bro
@Musicalphabeats
@Musicalphabeats 2 months ago
thanks bro
@varaball
@varaball 1 month ago
Thank you 🫡
@suddhasatwaAtGoogle
@suddhasatwaAtGoogle 3 years ago
For anyone having a doubt at 40:00 about why we take the square root of 64: as per the research, it was mathematically shown to be the best method to keep the gradients stable! Also, note that the value 64, which is the size of the Query, Key and Value vectors, is itself a hyperparameter which was found to work best. Hope this helps.
@latikayadav3751
@latikayadav3751 1 year ago
The embedding vector dimension is 512. We divide this into 8 heads: 512/8 = 64. Therefore the size of the query, key and value vectors is 64, so that size is not a hyperparameter.
@afsalmuhammed4239
@afsalmuhammed4239 1 year ago
Normalizing the data
@sg042
@sg042 11 months ago
Another reason is that we generally want weights, inputs, etc. to follow a normal distribution N(0, 1), and when we compute the dot product it is a summation over 64 values, which mathematically increases the standard deviation to sqrt(64), making the distribution N(0, sqrt(64)); dividing by sqrt(64) normalizes it back.
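A quick numerical check of that argument (my own sketch, not from the thread): dot products of 64 independent N(0, 1) components have variance around 64, and dividing by sqrt(64) brings them back to roughly unit variance.

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
q = rng.standard_normal((100_000, d_k))   # many random 64-dim "query" vectors
k = rng.standard_normal((100_000, d_k))   # many random 64-dim "key" vectors

scores = (q * k).sum(axis=1)              # raw dot products
print(scores.var())                       # ~64: variance grows with d_k
print((scores / np.sqrt(d_k)).var())      # ~1: scaling restores unit variance
```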
@sartajbhuvaji
@sartajbhuvaji 11 months ago
The paper states: "While for small values of dk the two mechanisms (attention functions: additive attention and dot-product attention) (note: the paper uses dot-product attention, q*k) perform similarly, additive attention outperforms dot product attention without scaling for larger values of dk. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/sqrt(dk)."
@anoopitiss
@anoopitiss 5 months ago
Anyone in 2024 watching and learning from Krish?
@roshankumargupta46
@roshankumargupta46 3 years ago
This might help the guy who asked why we take the square root, and also other aspirants: the scores get scaled down by dividing them by the square root of the dimension of the query and key vectors. This allows for more stable gradients, as multiplying values can have exploding effects.
@tarunbhatia8652
@tarunbhatia8652 3 years ago
Nice, I was also wondering about the same. It all comes back to exploding or vanishing gradients, how could I forget that :D
@apica1234
@apica1234 3 years ago
Can this attention encoder-decoder be used for financial time series as well, i.e. multivariate time series?
@matejkvassay7993
@matejkvassay7993 3 years ago
Hello, I think the square root of the dimension is not chosen just empirically; it actually normalizes the length of the vector (or something similar). The vector length scales with the square root of the dimension when certain conditions (which I forget) are met, so this scaling brings it back down and thus prevents exploding dot-product scores.
@kunalkumar2717
@kunalkumar2717 2 years ago
@@apica1234 Yes, although I have not used it myself, it can be used.
@TheGenerationGapPodcast
@TheGenerationGapPodcast 1 year ago
The normalizing should come from the softmax, or from using the tri (triangular mask) function to zero out the bottom of the concatenated Q, K and V matrix, so as to have good initialization weights, I think.
@121MrVital
@121MrVital 3 years ago
Hi Krish, when are you going to make a video on BERT with a practical implementation??
@faezakamran3793
@faezakamran3793 2 years ago
For those getting confused by the 8 heads: all the words go to all the heads. It's not one word per head. The X matrix remains the same; only the W matrices change in the case of multi-head attention.
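A small sketch of that point (illustrative shapes only, following the paper's 512/8 split): every head sees the full X; only the per-head projection matrices W_Q, W_K, W_V differ.

```python
import numpy as np

seq_len, d_model, num_heads = 4, 512, 8
d_k = d_model // num_heads
x = np.random.randn(seq_len, d_model)          # the same X goes to every head

rng = np.random.default_rng(0)
per_head_qkv = []
for h in range(num_heads):
    w_q = rng.standard_normal((d_model, d_k))  # each head has its own weight matrices
    w_k = rng.standard_normal((d_model, d_k))
    w_v = rng.standard_normal((d_model, d_k))
    per_head_qkv.append((x @ w_q, x @ w_k, x @ w_v))   # all 4 words, in every head

print(len(per_head_qkv), per_head_qkv[0][0].shape)     # 8 heads, each Q is (4, 64)
```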
@story_teller_1987
@story_teller_1987 3 years ago
Krish is a hard-working person, not just for himself but for our country, in the best way he can... We need more people like him in our country.
@lohithklpteja
@lohithklpteja 5 months ago
Alu kavale ya lu kavale ahhh ahhh ahhh ahhh dhing chiki chiki chiki dhingi chiki chiki chiki
@mayurpatilprince2936
@mayurpatilprince2936 11 months ago
Why do they multiply each value vector by the softmax score? Because they want to keep intact the values of the word(s) they want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example)... they wanted to suppress whatever irrelevant words the sentence has.
@KunalSwami
@KunalSwami 3 years ago
Please don't keep repeating things and going back and forth. You can explain things by sticking to the point and being systematic. Also, in between explaining, you often deviate and start praising the blog and paper again. Just do this once at the beginning and then stick to the explanation.
@harshitjain4923
@harshitjain4923 3 years ago
Thanks for explaining Jay's blog. To add to the explanation at 39:30, the reason for using sqrt(dk) is to prevent the vanishing-gradient problem mentioned in the paper. Since we apply softmax to Q*K, and these matrices can have high dimension, the products take large values which get pushed close to 1 after softmax, and hence lead to very small gradient updates.
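A small illustration of that saturation effect (my own sketch, not from the video): with d_k = 64, unscaled scores are large, the softmax output is nearly one-hot, and the gradient signal for the other positions almost vanishes; dividing by sqrt(d_k) keeps the distribution softer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_k = 64
rng = np.random.default_rng(1)
q = rng.standard_normal(d_k)
keys = rng.standard_normal((5, d_k))

raw = keys @ q                      # unscaled scores, typical spread ~ sqrt(64)
scaled = raw / np.sqrt(d_k)         # scaled scores, spread ~ 1

print(softmax(raw).round(3))        # sharply peaked, close to one-hot
print(softmax(scaled).round(3))     # much softer distribution, healthier gradients
```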
@neelambujchaturvedi6886
@neelambujchaturvedi6886 3 years ago
Thanks for this Harshit
@shaktirajput4711
@shaktirajput4711 2 years ago
Thanks for the explanation, but I guess it would be called exploding gradient, not vanishing gradient. Hope I am not wrong.
@MuhammadShahzad-dx5je
@MuhammadShahzad-dx5je 3 years ago
Really nice sir, looking forward to the BERT implementation 😊
@Natasha-re1kt
@Natasha-re1kt 2 years ago
There are so many problems with this video. 1) The divide by root dk is to reduce the dimension back to its original 64, because the multiplication had increased it in the previous steps. 2) Values are not being used "because they have to be used because they are there". What rubbish. The Value vector was created for a purpose: to bring the original vector values back in after the softmax to arrive at the result, similar to an information retrieval system. 3) The ResNet explanation he gave is incorrect: the input is passed along not to skip the in-between layers but to strengthen the output, due to the vanishing-gradient problem. 4) This guy got cocky when someone asked what a neural network is. I mean, what's the big deal? Maybe that guy came to the session for the first time and was asking a very basic question. Did Krish know about neural networks a few years ago? Noooo. If, in spite of following a blog written in simple English, you did not understand what is inside the algorithm, then where do you get the audacity to post it on YouTube???? A few more corrections: there is no such word as "farrrar", it is "farther". Also, please don't butcher French; it's a beautiful language. The sentence is pronounced "Ja sui ethudian".
@abhisheksa6635
@abhisheksa6635 2 years ago
Hi Natasha, can you please explain where this explanation went wrong and help the community, as we are still exploring? I also felt we were missing a few fundamentals here, but I am not enough of a master to comment on it.
@aditiseetha1
@aditiseetha1 1 year ago
Your content is amazing, but you unnecessarily repeat the same thing again and again. I am watching your video at 2x playback speed.
@cartoonsondemand_
@cartoonsondemand_ 9 months ago
Every person has a different level of understanding, Aditi.
@nehalverma1444
@nehalverma1444 5 months ago
Let's not be selfish @aditiseetha1; everyone has their own pace of learning and grasping the concepts.
@abhimanyusingh3755
@abhimanyusingh3755 3 months ago
It's not a normal upload, smarty Aditi, it was a live stream, so he definitely has to repeat some stuff and also interact with the audience.
@anusikhpanda9816
@anusikhpanda9816 4 years ago
You can skim through all the YouTube videos explaining transformers, but nobody comes close to this video. Thank you sir 🙏🙏🙏
@kiran5918
@kiran5918 7 months ago
Difficult to understand foreign accents. Desi away zindabad
@nim-cast
@nim-cast 1 year ago
Thanks for your fantastic LLM/Transformer series content, and I admire your positive attitude and support for the authors of these wonderful articles! 👏
@sivakrishna5557
@sivakrishna5557 4 months ago
Could you please help me get started on the LLM series? Could you please share the playlist link?
@smilebig3884
@smilebig3884 3 years ago
So many dumb questions :D I can understand, Krish, how much you had to endure.
@rajns8643
@rajns8643 1 year ago
A question is never dumb, but the answer to it might be... ~ one of the best professors who ever taught me
@dandyyu0220
@dandyyu0220 2 years ago
I cannot express enough appreciation for your videos, especially the NLP deep learning related topics! They are extremely helpful and so easy to understand from scratch. Thank you very much!
@junaidiqbal5018
@junaidiqbal5018 2 years ago
@31:45 If my understanding is correct, the reason why we have 64 is that we divide 512 into 8 equal heads. Since we compute dot products to get the attention values, taking the dot product over the full 512-dimensional embedding would not only be computationally expensive, but we would also get only one relation between the words. Taking advantage of parallel computation, we divide 512 into 8 equal parts; this is why we call it multi-head attention. This way it is computationally fast and we also get 8 different relations between the words. (FYI, attention is basically a relation between the words.) Anyway, good work on explaining the architecture, Krish.
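For reference, the scaled dot-product attention that each head computes can be written in a few lines of numpy (a sketch with made-up shapes, following Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq, seq) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                 # weighted sum of value vectors

seq_len, d_k = 4, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # (4, 64): one z vector per word
```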
@desrucca
@desrucca 2 years ago
AFAIK ResNet is not like dropout; instead it brings information from a previous layer to the n-th layer. By doing this, vanishing gradients are less likely to occur.
@Natasha-re1kt
@Natasha-re1kt 2 years ago
Basically, newbies are accepting any nonsense that this guy explains.
@ruchisaboo29
@ruchisaboo29 4 years ago
Awesome explanation. When will you post the BERT video? Waiting for it, and if possible please cover GPT-2 as well. Thanks a lot for this amazing playlist.
@devkinandanbhatt8057
@devkinandanbhatt8057 2 months ago
I've heard a lot about you, and this was the first video of yours that I've watched. Honestly, it seemed like the video was made in a hurry, and it felt like you hadn't done much self-study. Many concepts were poorly explained. I would recommend doing more research before posting a video.
@Mr.AIFella
@Mr.AIFella 5 months ago
Why 64? The correct answer is that it is not a hyperparameter! It's the dimensionality of the data divided by the number of heads: 512/8 (heads) = 64.
@jaytube277
@jaytube277 5 months ago
Thank you Krish for making such a great video. Really appreciate your hard work. One thing I have not understood here is where the loss gets calculated. Does it happen at the multiple heads or at the encoder-decoder attention layer? What I am assuming is that while we are training the model the translations will not be accurate, and we should get some loss which we try to minimize, but I am not understanding where that comparison happens.
@hiteshyerekar2204
@hiteshyerekar2204 4 years ago
Great session Krish. Because of the research paper I understood things very easily and clearly.
@shweta5260
@shweta5260 4 years ago
Hello sir, I passed 12th and now I want to become a data scientist. Which course should I take????
@moralstorieskids3884
@moralstorieskids3884 10 months ago
I'm the same too
@PreyumKumar
@PreyumKumar 10 months ago
Why will the output decoder choose only the correct words? You should have given an example of what happens if a wrong word is guessed by the model.
@akhilgangavarapu9728
@akhilgangavarapu9728 4 years ago
A million tons of appreciation for making this video. Thank you so much for your amazing work.
@Sresta-w9i
@Sresta-w9i 28 days ago
Can anybody please explain how we get the z1, z2 values? Because if we just add v1 and v2, both z1 and z2 would be the same.
@nareshmalviya5232
@nareshmalviya5232 1 year ago
Can you explain this whole topic in several shorter videos? Because we get confused by so much information.
@BhuwanBhatta
@BhuwanBhatta 3 years ago
9:50, that's when the actual video starts
@GamerBoy-ii4jc
@GamerBoy-ii4jc 2 years ago
Please upload a video on a practical implementation of transformers using Hugging Face.
@ss-dy1tw
@ss-dy1tw 3 years ago
Krish, I really see the honesty in you man, a lot of humility, a very humble person. At the beginning of this video you gave credit to Jay several times, who created the amazing blog on Transformers. I really liked that. Stay like that.
@sarrae100
@sarrae100 3 years ago
Excellent blog from Jay. Thanks Krish for introducing this blog on your channel!!
@krishnalikith8048
@krishnalikith8048 2 months ago
Sir, at 44:00 you said that if we add v1 + v2 we will get z1. Then how did the z2 value come about, sir?
@shahveziqbal5206
@shahveziqbal5206 2 years ago
Thank you ❤️
@mrudhulraj2824
@mrudhulraj2824 3 years ago
I am still curious: what are the key, query and value?
@yakubsadlilseyam5166
@yakubsadlilseyam5166 4 months ago
In which session did Krish discuss the encoder and decoder? Someone please mention it.
@madhurendranathtiwari9465
@madhurendranathtiwari9465 6 months ago
You haven't even subscribed, and you're asking why he has so few subscribers 🤣🤣
@neelambujchaturvedi6886
@neelambujchaturvedi6886 3 years ago
Hey Krish, I had a quick question related to the explanation at 1:01:07 about positional encodings. How exactly do we create those embeddings? In the paper the authors used sine and cosine waves to produce them, and I could not understand the intuition behind this. Could you please help me understand this part? Thanks in advance.
@1111Shahad
@1111Shahad 3 months ago
The use of sine and cosine functions ensures that the positional encodings have unique values for each position. Different frequencies allow the model to capture both short-range and long-range dependencies. These functions also ensure that nearby positions have similar encodings, providing a smooth gradient of positional information, which helps the model learn relationships between neighboring positions.
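A short sketch of that sinusoidal encoding (my own illustration; the formula follows the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 0, 2, ...
    angles = pos / np.power(10000, i / d_model)    # a different frequency per dimension
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                   # cosine on odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)    # (50, 512): added element-wise to the word embeddings
```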
@paneercheeseparatha
@paneercheeseparatha 1 year ago
Why do you say that the in-between blocks of ResNet are not important??
@mayurpatilprince2936
@mayurpatilprince2936 11 months ago
sqrt(dk), where dk is the dimension of the key vectors, leads to more stable gradients... Why should we use it? They created a Query vector, a Key vector, and a Value vector by multiplying the embedding by three matrices that are trained during the training process, and we can notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64 because it provides stable gradients.
@sg042
@sg042 11 months ago
Another reason probably is that we generally want weights, inputs, etc. to follow a normal distribution N(0, 1), and when we compute the dot product it is a summation over 64 values, which mathematically increases the standard deviation to sqrt(64), making the distribution N(0, sqrt(64)); dividing by sqrt(64) normalizes it back.
@snehamandal5376
@snehamandal5376 1 year ago
Sir, I have a question. It seems like the transformer model requires a lot of computational power, but as we know our phones do not have such powerful GPUs or CPUs, so how does ChatGPT work so flawlessly on our mobiles as well??🤔🤔🤔
@gautamdewasiofficial
@gautamdewasiofficial 1 year ago
Because we are using their web app, the actual computation happens on the server side, not on our devices.
@PreyumKumar
@PreyumKumar 10 months ago
If v1 and v2 are simply summed, both sums will become the same.
@schachschach9119
@schachschach9119 2 years ago
25:05 What do you mean by "it" referring to "too tired"? Didn't get that.
@shrikanyaghatak
@shrikanyaghatak 1 year ago
I am very new to the world of AI. I was looking for easy videos to teach me about the different models. I did not imagine that I would be totally enthralled by this video for as long as you taught. You are a very good teacher. Thank you for publishing this video for free. Thanks to Jay as well for simplifying such a complex topic.
@HarshitaPandey-rj4gr
@HarshitaPandey-rj4gr 2 months ago
The explanation actually starts after the 11:00 mark.
@mdzeeshan1148
@mdzeeshan1148 3 months ago
Hi Krish, I am not able to find the BERT and encoder-decoder videos.
@csit_coding4133
@csit_coding4133 3 months ago
What is the dataset used in this paper? Please tell me.
@monalisanayak2299
@monalisanayak2299 2 years ago
Can you do a similar session on CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation? It would be helpful.
@shilashm5691
@shilashm5691 2 years ago
Refer to the DigitalSreeni YouTube channel, it might be helpful.
@vishnusit1
@vishnusit1 1 year ago
It's not a great explanation; you just read out Jay's work.
@nahidzeinali1991
@nahidzeinali1991 1 month ago
You are smart, Krish! Thanks for the video, love you.
@paneercheeseparatha
@paneercheeseparatha 1 year ago
You really skip the maths too much... you've got to look into that.
@ewnetuabebe5059
@ewnetuabebe5059 2 years ago
Thank you for your amazing tutorial, but does the Transformer work for climate prediction / numerical regression problems?
@bofloa
@bofloa 1 year ago
Watching through this video, I can only conclude that the whole process is more of an art than a science.
@rajns8643
@rajns8643 1 year ago
Definitely!
@gurdeepsinghbhatia2875
@gurdeepsinghbhatia2875 4 years ago
Sir, thanks a lot, I really enjoyed it (maza aa gaya), sir. Your way of teaching is so humble and honest and, most importantly, patient. Awesome video sir, too good.
@zohaibramzan6381
@zohaibramzan6381 3 years ago
Great for overcoming confusions. I hope to get hands-on with BERT next.
@mdmamunurrashid4112
@mdmamunurrashid4112 1 year ago
Please make videos on BERT and dna2vec.
@pranjalkirange
@pranjalkirange 2 years ago
Where is BERT? Waiting for it for the past one year.
@sayantikachatterjee5032
@sayantikachatterjee5032 10 months ago
At 58:49 it is said that if we increase the number of heads, the model can give more importance to different words, so 'it' can also give more importance to 'street'. So between 'The animal' and 'street', which word will be prioritized more?
@palanaa
@palanaa 4 months ago
Krish, you didn't explain how the softmax is calculated. In this case, 112 (0.88) and 96 (0.12) - how did you arrive at these numbers?
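If I recall Jay Alammar's example correctly, those scores are first divided by sqrt(64) = 8 (giving 14 and 12) and only then passed through the softmax, which yields roughly 0.88 and 0.12. A quick check:

```python
import numpy as np

scores = np.array([112.0, 96.0])
scaled = scores / np.sqrt(64)                  # -> [14.0, 12.0]
probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax over the two scaled scores
print(probs.round(2))                          # [0.88 0.12]
```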
@sweela1
@sweela1 1 year ago
In my opinion, at 40:00 the square root is taken for scaling, to bring larger values down to smaller ones so that the softmax of these values can also be computed easily. dk is the dimension whose square root is taken to scale the values.
@sg042
@sg042 11 months ago
Another reason probably is that we generally want weights, inputs, etc. to follow a normal distribution N(0, 1), and when we compute the dot product it is a summation over 64 values, which mathematically increases the standard deviation to sqrt(64), making the distribution N(0, sqrt(64)); dividing by sqrt(64) normalizes it back.
@MaheshK-gc1sv
@MaheshK-gc1sv 3 years ago
I could not understand the weight update. The ANN here is feed-forward, and Krish mentioned that the weights will be updated by backpropagation, so where is that backprop happening?
@vamshi5745
@vamshi5745 5 months ago
Truly bro, learn how to teach... please...
@harshjain-cc5mk
@harshjain-cc5mk 3 years ago
What are the basic requirements one should have to understand transformers? I am currently in my final year and willing to do a project on this. I have knowledge of machine learning and neural networks and have just started to learn RNNs and CNNs. Any guidance and suggestions are welcome.
@underlecht
@underlecht 3 years ago
I love your patience: how many times you go around explaining things until they get clear even for guys as dumb as me. BTW, residual connections are not there because some layers are not important and we have to skip them; they are there to solve the vanishing-gradients problem.
@skipintro9988
@skipintro9988 3 years ago
I think the 19 dislikes are from the people who came here without basic neural network knowledge :-D
@learnvik
@learnvik 10 months ago
Thanks. Question: in step 1 (30:52), what if the randomly initialized weights all have the same value at the start? Then all the resulting vectors will have the same values.
@user-or7ji5hv8y
@user-or7ji5hv8y 3 years ago
Thank you, I appreciate your time going through this material.
@zainaqubbej7457
@zainaqubbej7457 3 years ago
Why do I keep seeing "state of the art" mentioned everywhere? What does it refer to in the context of what we are doing with transformers here? ... Please, someone explain ❤️🙏🏼
@HimanshuKanodia
@HimanshuKanodia 4 years ago
Sir, I have been working on the Dell Boomi integration tool for 1 year after graduation. Please suggest how it can be helpful in data science.
@SATISHKUMAR-qk2wq
@SATISHKUMAR-qk2wq 4 years ago
Appliedai.com
@TheGenerationGapPodcast
@TheGenerationGapPodcast 1 year ago
The reason to divide by sqrt(dk) is to prevent a constant value of x. That x = 1/2: for values near x = 0, from the left or right, f(x) approaches y = 1/2. Look at the shape of the sigmoid function.
@chd9841
@chd9841 3 years ago
I cannot explain how I sat through 1.5 hours. I wanted to run away so many times... but I hoped something would enter my mind...
@parmeetsingh4580
@parmeetsingh4580 3 years ago
Hi Krish, great session. I have a question: is the Z we get after the self-attention block of the encoder interpretable? That is, could we figure out, just by looking at Z, what results the multi-head self-attention block gives? Kindly help me out with this.
@smilebig3884
@smilebig3884 2 years ago
Very underrated video... this is a super awesome explanation. I'm watching and commenting for the second time, a month later.
@BalaguruGupta
@BalaguruGupta 3 years ago
The layer normalization operates on (X + Z), where X is the input and Z is the result of the self-attention calculation. You mentioned that when the self-attention doesn't perform well, the self-attention calculation will be skipped and we jump to layer normalization, hence the Z value will be 'empty' (please correct me here if I'm wrong). In this case the layer normalization happens only on X (the input). Am I correct?
@NitishKumar-fl1bg
@NitishKumar-fl1bg 2 years ago
Please make a video on the implementation of Video Summarization With Frame Index Vision Transformer.
@gouravnaik3273
@gouravnaik3273 2 years ago
Sir, why are we using multi-head attention? While training, won't each head produce the same weights according to the loss and optimization? Why can't we have a single head that gets optimized?
@vivekyadav-zl5dl
@vivekyadav-zl5dl 9 months ago
It looks like you were quite confused in this session, not confident in answering each comment. But hats off to your good effort.
@maaleem90
@maaleem90 1 year ago
Researchers go by the saying: "Try as many possible ways as you can; if something works, don't touch it."
@sagaradoshi
@sagaradoshi 2 years ago
Thanks for the wonderful explanation. For the decoder, at the 2nd time step we passed the word 'I'; then at the 3rd time step do we pass both the words 'I' and 'am', or is only the word 'am' passed? Similarly, at the next time step do we pass the words 'I', 'am' and 'a', or just the word 'a'?
@kavitathakur6678
@kavitathakur6678 1 year ago
Sir, could you suggest which algorithm we can use to create an NLP model that can compare a response from a chatbot with its own response?
@digitalmbk
@digitalmbk 3 years ago
My MS SE thesis completion totally depends on your videos. Just AWESOME!!!
@pratheeeeeesh4839
@pratheeeeeesh4839 3 years ago
Bro, are you pursuing your MS?
@digitalmbk
@digitalmbk 3 years ago
@@pratheeeeeesh4839 yes
@pratheeeeeesh4839
@pratheeeeeesh4839 3 years ago
@@digitalmbk Where, brother?
@digitalmbk
@digitalmbk 3 years ago
@@pratheeeeeesh4839 GCUF Pakistan
@MrKhaledpage
@MrKhaledpage 1 year ago
So you're trying to explain something you don't understand!
@rajns8643
@rajns8643 1 year ago
At least he is helping others gain some understanding of the topic... quite contrary to a certain person who is just pointing out flaws in others...
@premranjan4440
@premranjan4440 3 years ago
How can a 512-dimensional matrix be changed into a 64-dimensional matrix? Can anyone please explain?
@MayankKumar-nn7lk
@MayankKumar-nn7lk 3 years ago
Answer to why we are dividing by the square root of the dimension: basically, we are finding the similarity between the query and each key. There are different ways to get the similarity, like the dot product or the scaled dot product; here we take the scaled dot product to keep the values in a fixed range.
@salimtheone
@salimtheone 1 year ago
Z1 = v1 + v2. Is that the same as Z2?????!!!
@aarshp
@aarshp 1 year ago
At 51:40 he said multi-head attention is used instead of single-head attention, but he didn't explain what the input to the other 7 attention heads would be if our input matrix of words is given to the 0th head.
@hudaalfigi2742
@hudaalfigi2742 2 years ago
I really want to thank you for your nice explanation. Actually, I was not able to understand it before watching this video.
@photospere5757
@photospere5757 3 years ago
After watching this video, I admire my God even more, because I realize how sophisticated the human brain is.
@thepresistence5935
@thepresistence5935 2 years ago
Session starts at 9:10
@fun7939
@fun7939 9 months ago
How do you calculate the softmax value?
@ayushrathore8916
@ayushrathore8916 3 years ago
After the encoder, is there any repository-like structure which stores all the outputs of the encoder and then passes them one by one to the decoder to get the decoded outputs one by one?
@MrChristian331
@MrChristian331 3 years ago
Where are the weights being generated randomly?? I assume it's some sort of code function doing it?
@athiragopalakrishnan4316
@athiragopalakrishnan4316 3 years ago
Sir, how can I contact you? Your Facebook/LinkedIn/Instagram pages are not reachable.
@TusharKale9
@TusharKale9 3 years ago
Very well covered GPT-3 topic. Very important from an NLP point of view. Thank you for your efforts.
@shanthan9.
@shanthan9. 6 months ago
Every time I get confused or distracted while listening to the Transformers material, I have to watch the video again; this is my third time watching it, and now I understand it better.
@vishwasreddy6626
@vishwasreddy6626 3 years ago
How do we get the K and V vectors from the encoder output? It would be helpful if you could explain it with dimensions.
@pranthadebonath7719
@pranthadebonath7719 1 year ago
Thank you, sir, that's a nice explanation. Also thanks to Jay Alammar sir.