In this video, I'll try to present a comprehensive study of Ashish Vaswani and his coauthors' renowned paper, “Attention Is All You Need”.
This paper marked a major turning point in deep learning research: the Transformer architecture it introduced now powers a variety of state-of-the-art models in natural language processing and beyond.
📑 Chapters:
0:00 Abstract
0:39 Introduction
2:44 Model Details
3:20 Encoder
3:30 Input Embedding
5:22 Positional Encoding
11:05 Self-Attention
15:38 Multi-Head Attention
17:31 Add and Layer Normalization
20:38 Feed-Forward NN
23:40 Decoder
23:44 Decoder in the Training and Testing Phases
27:31 Masked Multi-Head Attention
30:03 Encoder-Decoder Self-Attention
33:19 Results
35:37 Conclusion
📝 Link to the paper:
arxiv.org/abs/1706.03762
👥 Authors:
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin
🔗 Helpful Links:
- "Vectoring Words (Word Embeddings)" by Computerphile:
• Vectoring Words (Word ...
- "Transformer Architecture: The Positional Encoding" by Amirhossein Kazemnejad:
kazemnejad.com/blog/transform...
- "The Illustrated Transformer" by Jay Alammar:
jalammar.github.io/illustrate...
- Lennart Svensson's Video on Masked self-attention:
• Transformers - Part 7 ...
- Lennart Svensson's Video on Encoder-decoder self-attention:
• Transformer - Part 8 -...
🙏 I'd like to express my gratitude to Dr. Nasersharif, my supervisor, for suggesting this paper to me.
🙋‍♂️ Find me on: linktr.ee/HalflingWizard
#Transformer #Attention #Deep_Learning