Transformer Decoder coded from scratch

10,840 views

CodeEmporium

1 day ago

Comments: 20
@suryabalaji3563 · 1 year ago
Hey Ajay, amazing video! I was struggling to understand the concepts and code and your explanations really helped me in this. I especially love how you always specify the shape of each matrix after every operation and how you always explain the reason for each step however many times it takes. Thanks and keep up the good work!
@jingcheng2602 · 7 months ago
Very cool and detailed explanation! The best class I have taken on Transformers 🙂
@Koldim2001 · 1 year ago
Great video! Thanks to this channel, I finally understood the Transformer architecture. It would be awesome if you could explain Vision Transformers in the same format, with detailed block diagrams and code explanations. Looking forward to your next video!
@pooriataheri795 · 11 months ago
Great content. Keep up the good work
@DLwithShreyas · 1 year ago
Hey Ajay, it would be great if you could make a video on building a tokenizer. I feel it is pretty underrated and not explored much.
@codinghighlightswithsadra7343 · 1 year ago
Amazing explanation, well done 👌
@gangs0846 · 6 months ago
Fantastic
@amparoconsuelo9451 · 1 year ago
I am interested in knowing HOW the hardware is involved during coding, data build-up, and execution of the program with simultaneous data retrieval.
@DLwithShreyas · 1 year ago
Hey Ajay, you are such a great teacher! I have been wanting to code a Transformer from scratch for a long time now, but each time I used to get stuck at the batching level with one issue or another. Your series has helped me understand and code the Transformer better 🤗
@srikanthramakrishna1073 · 9 days ago
Hi Ajay. Great work!! Quick question: in the code, the default for mask is set to None. Is there an instance during training/inference where we won't add a mask? Is masking needed only for inference?
@srikanthramakrishna1073 · 9 days ago
Oh, later in the video it is set to None for cross-attention!
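A minimal sketch of the masking pattern discussed in this thread, assuming illustrative names (this is not necessarily the video's exact code): decoder self-attention passes a causal look-ahead mask, while cross-attention calls the same function with mask=None.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores + mask  # mask holds -inf above the diagonal
    return torch.matmul(F.softmax(scores, dim=-1), v)

batch, heads, seq_len, head_dim = 1, 8, 5, 64
q = k = v = torch.randn(batch, heads, seq_len, head_dim)

# Decoder self-attention: causal mask so position i cannot attend to positions after i.
# This mask is used during both training and inference, not only inference.
causal_mask = torch.full((seq_len, seq_len), float('-inf')).triu(diagonal=1)
out_self = scaled_dot_product_attention(q, k, v, mask=causal_mask)

# Cross-attention: mask=None, so every decoder position may look at the full encoder output.
out_cross = scaled_dot_product_attention(q, k, v, mask=None)
print(out_self.shape, out_cross.shape)  # both torch.Size([1, 8, 5, 64])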
@scitechtalktv9742 · 1 year ago
I consider this a very interesting explanation of the Transformer decoder. But I have a question: since we are talking about MultiHeadAttention, I have long wondered how these 8 attention heads come up with 8 different kinds of attention. What mechanism causes them to be different? I am very curious to hear your answer to this question!
@DLwithShreyas · 1 year ago
I think it is because each head gets a different set of (q, k, v) vectors. The 1st head gets the 1st column of the (q, k, v) batch, the 2nd head gets the next one, and so on. Also, the number of heads can be different from 8. I believe this is correct, but correct me if I am wrong.
@scitechtalktv9742 · 1 year ago
That could indeed explain why the different attention heads can produce different attentions. But I would have to explore the inner workings of Transformers more deeply to convince myself that this is the real reason 😊
@scitechtalktv9742 · 1 year ago
@DLwithShreyas After watching this video, I am convinced that your answer is the correct one: every attention head gets different input vectors! See: Blowing up Transformer Decoder architecture kzbin.info/www/bejne/m5zKXpulpMdjia8
@SVRamu · 1 year ago
I understand it to be due to the nature of the vectors themselves. If a word is encoded as a vector of size 512, then each of those 512 numbers encodes one particular aspect of the word. This positional meaning should be the same across all word vectors, since each vector is created by the same encoding/embedding process. We then break each 512-sized vector into 8 heads/pieces, each of size 64, so each head captures a few (here 64) mutually exclusive aspects of that word. When we apply attention to these pieces, they end up representing different kinds of attention.
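For readers following this thread, here is a rough sketch of the head-splitting being described, with illustrative names and sizes (not necessarily the video's exact code): after one fused linear projection, the 512-dimensional vectors are reshaped into 8 heads of 64, so each head attends over its own slice of the projected representation.

import torch
import torch.nn as nn

d_model, num_heads = 512, 8
head_dim = d_model // num_heads  # 64
batch, seq_len = 2, 10

x = torch.randn(batch, seq_len, d_model)

# One fused projection producing q, k and v for every head at once
qkv_layer = nn.Linear(d_model, 3 * d_model)
qkv = qkv_layer(x)                                    # (2, 10, 1536)
qkv = qkv.reshape(batch, seq_len, num_heads, 3 * head_dim)
qkv = qkv.permute(0, 2, 1, 3)                         # (2, 8, 10, 192)
q, k, v = qkv.chunk(3, dim=-1)                        # each (2, 8, 10, 64)

# Each head works on a different 64-dimensional slice of the projection,
# with its own learned weights behind that slice, so the attention patterns
# it produces can differ from head to head.
print(q.shape, k.shape, v.shape)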
@abulfahadsohail466 · 7 months ago
Can anybody help me with the case where the encoder's output max sequence length is different from the decoder's max sequence length, i.e. the query sequence length differs from the key/value sequence length?
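A small illustrative check regarding this question (hypothetical shapes, not from the video): cross-attention still works when the decoder (query) length differs from the encoder (key/value) length, and the output keeps the query's sequence length.

import torch
import torch.nn.functional as F

batch, heads, head_dim = 1, 8, 64
dec_len, enc_len = 7, 12  # decoder and encoder max sequence lengths differ

q = torch.randn(batch, heads, dec_len, head_dim)  # queries come from the decoder
k = torch.randn(batch, heads, enc_len, head_dim)  # keys come from the encoder output
v = torch.randn(batch, heads, enc_len, head_dim)  # values come from the encoder output

scores = torch.matmul(q, k.transpose(-2, -1)) / head_dim ** 0.5  # (1, 8, 7, 12)
out = torch.matmul(F.softmax(scores, dim=-1), v)                 # (1, 8, 7, 64)
print(scores.shape, out.shape)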
@ahmadfaraz9279 · 1 year ago
Instead of getting qkv (or kv) through the same linear layer, why can't three separate linear layers be defined so that the code is simpler? Is there any computational cost associated with defining three separate layers (or two in the kv case)?
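As a rough sketch of the trade-off being asked about (illustrative code, not the video's): three separate linear layers are functionally equivalent to one fused qkv layer and hold exactly the same number of parameters; the fused version mainly batches three matrix multiplies into one larger one, which tends to be a bit faster on GPU.

import torch
import torch.nn as nn

d_model = 512

# Fused: one matmul produces q, k and v together
qkv_layer = nn.Linear(d_model, 3 * d_model)

# Separate: three smaller matmuls, arguably easier to read
q_layer = nn.Linear(d_model, d_model)
k_layer = nn.Linear(d_model, d_model)
v_layer = nn.Linear(d_model, d_model)

fused_params = sum(p.numel() for p in qkv_layer.parameters())
split_params = sum(p.numel() for layer in (q_layer, k_layer, v_layer)
                   for p in layer.parameters())
print(fused_params, split_params)  # 787968 787968 -- identical parameter count

x = torch.randn(2, 10, d_model)
q1, k1, v1 = qkv_layer(x).chunk(3, dim=-1)       # fused projection, then split
q2, k2, v2 = q_layer(x), k_layer(x), v_layer(x)  # three separate projections
# Both paths yield tensors of shape (2, 10, 512); the outputs differ only
# because the two setups have independently initialized weights.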
@vigneshvicky6720 · 1 year ago
YOLOv8 please!
@alexandrepv · 1 year ago
Oi mate! Do you have a loicence to make such good content?