10 - Self / cross, hard / soft attention and the Transformer

Рет қаралды 36,196

Alfredo Canziani (冷在)

Күн бұрын

Пікірлер: 101

@abaybektursun Жыл бұрын

This is the most Italian ML lecture I have seen, god bless your mom

@alfcnz Жыл бұрын

😮😮😮

@dustinvansteeus4904 3 жыл бұрын

Thanks Alfredo, amazing presentation on transformers. By far the clearest explanation I have seen on the internet. Kudos to you!

@alfcnz 3 жыл бұрын

😃😃😃

@sk7w4tch3r 3 ай бұрын

Thanks!

@alfcnz 3 ай бұрын

🥳🥳🥳

@Vanadium404 10 ай бұрын

Beautiful lecture by a beautiful person. Subscribed you for this amazing explanation accept my respect and lots for love from Pakistan 🇵🇰

@alfcnz 10 ай бұрын

❤️❤️❤️

@alexvass 2 жыл бұрын

Thanks

@alexvass 2 жыл бұрын

could you please clarify what is the 'key' and the 'query' in the context of the language example?

@alfcnz 2 жыл бұрын

You’re welcome! 🤓🤓🤓

@drahmsha Жыл бұрын

Thanks Alfredo for the amazing material and high quality videos, I really enjoyed them and literally they are my only guide in the field, my med school doesn't have any programs for AI in medicine and it will not have for a coming thousand year and people in my place will only get the chance for such courses and deep understanding from courses and people who share their knowledge online like this.. You are a life changer and a life saver.. thank you so much from your thousand miles sincere remote student

@alfcnz Жыл бұрын

🥰🥰🥰

@Farhad6th 2 жыл бұрын

درود، آقا خیلی با جزئیات توضیح می‌دهی. دست شما درد نکند. ممنون. چون می دونم تا حدودی پارسی بلدی، پارسی نوشتم. :)

@alfcnz 2 жыл бұрын

خوش اومدی 🥰🥰🥰

@dimitri30 4 ай бұрын

Thank you so much, it's amazing how easy you make this to understand.

@alfcnz 4 ай бұрын

You're very welcome! 😀😀😀

@sur_yt805 2 жыл бұрын

I have been struggling with concept of attention from so Long and never came across such heavenly explanation can not thank you ever. hope I can develop your level of expertise Regards Hina Ali

@alfcnz 2 жыл бұрын

😀😀😀

@Cropinky Жыл бұрын

bro got me googling whats a pizza boscaiola while trying to learn attention

@alfcnz Жыл бұрын

🤣🤣🤣

@sidexside1439 2 жыл бұрын

I love your slides with dark background! It made my eyes more comfortable. Thanks!

@alfcnz 9 ай бұрын

😇😇😇

@erikdali983 3 жыл бұрын

Thanks for making these videos!

@alfcnz 3 жыл бұрын

My pleasure!

@anadianBaconator 3 жыл бұрын

honestly, thank you so much for the effort you take to explain the concepts! Really appreciate it

@alfcnz 3 жыл бұрын

🥰🥰🥰

@goedel. Жыл бұрын

At 25:42, why do we have t distinct attention vectors if the argmax X^T x does not change? I might be misunderstanding the argmax.

@alfcnz Жыл бұрын

argmax returns a 1-hot vector, specifying where the maximum element is. 🙂

@goedel. Жыл бұрын

@@alfcnz just rewatched it and it makes sense. I was just confused by the notation, I read is as mapping the vector to it's attention scalar and somehow getting a R^{t x t} matrix. Thank you Alfredo 👍

@alfcnz Жыл бұрын

No worries 😊😊😊

@juanolano2818 Жыл бұрын

It would be great if you would explain this same topic today. I wonder how the explanation from today would differ from your explanation given 2 years ago. Do you think you'd explain it differently? what would you change?

@alfcnz Жыл бұрын

Hmm, the underlying principles are the same. Of course, there are new updates, but they are built on top of the foundations explained here.

@alirezajazini2476 10 ай бұрын

very good introduction

@alfcnz 10 ай бұрын

Thank you 😊😊😊

@sidexside1439 2 жыл бұрын

1:01:25, the 'h' after the predictor. What is the truth of that h? (Hidden representation)

@me4447 2 жыл бұрын

This is such a good explanation!

@alfcnz 2 жыл бұрын

❤️❤️❤️

@chani_m Жыл бұрын

Hi! I was wondering what does the beta mean in the formula at 22:17?

@alfcnz Жыл бұрын

β is the coldness coefficient of the softargmax, which for the vector 𝒗 is defined as exp(β𝒗)/∑ᵢ exp(β𝓋ᵢ)

@chani_m Жыл бұрын

@@alfcnz Thank you!

@mahdiamrollahi8456 Жыл бұрын

May I ask why we calculate the attention score between two vector (like K,Q) and apply this to another vecotor(like V) and then get output? I mean, could we perform it with two of them like K , V? att_score = softmax(k.v / d) att_output = att_score * v

@alfcnz Жыл бұрын

Think this way: V represents all KZbin videos, K is their title. q is the question you have in mind. KZbin will try to match q with every title of K. When found, the correct video is retrieved from V. 🤓🤓🤓

@mahdiamrollahi8456 Жыл бұрын

@@alfcnz Very good example✌ and It is ok when k and v are from different types(like here, title is text and the value is video | when in translation we have different languages for k(French) and v(English)). So, is that necessary to apply all three q, k ,v to texts that they come from the same language? In this case we, take the similarity between K and V (as attention score) and then apply the result again to V. I am just thinking about optimizing parameters in the case that we have data from the same type like English-to-English.

@zamiG74ever 3 жыл бұрын

Hi alf, first I want to thank you for all your highly explanatory youtube and github content, it has really helped me in many DL occasions. I would like to ask what happens when d ≠ d' ≠ d''?

@alfcnz 3 жыл бұрын

You're welcome 😀😀😀 Things would have a more meaningful size and you'd gain a degree of freedom.

@Batu135 2 жыл бұрын

You should decouple weight decay with norm layers and embeddings ;)

@alfcnz 2 жыл бұрын

???

@AdityaSanjivKanadeees 2 жыл бұрын

Hi Alf, thanks for the great explanation !! A question, why do we set the bias=False, why do we not need an affine transformation of the input space, but just rotation? @1:04:00

@alfcnz Жыл бұрын

It's not strictly necessary. If you have a bias set to zero, then you force the information to be encoded in the orientation only. This makes sense, since later on we use the cosine similarity to measure agreement in terms of angles.

@MarioKostelac 3 жыл бұрын

Thanks for this explanation Alfredo. When you say "rotation and squashing", what do you mean by that? My understanding is that linear transformations can do more than just rotate and squash (e.g. smear etc).

@alfcnz 3 жыл бұрын

You need to watch all videos 😉 The answer to your question can be found in practica 1 and 2.

@MarioKostelac 3 жыл бұрын

@@alfcnz Hah, watched it. It makes sense now :)

@alfcnz 3 жыл бұрын

Good, good! On a side note, our PostDoc was googling “squashing” the other day 😅😅😅 I'd have never thought people would watch these out of order and therefore getting confused by my terminologie 😬😬😬

@MarioKostelac 3 жыл бұрын

@@alfcnz Ah, I wanted to know more about transformed, but when you started talking about autoencoders, I figured I was missing more than just transformer explanation.

@mrigankanath7337 3 жыл бұрын

For the cross attention, what change should be done in the code?

@alfcnz 3 жыл бұрын

Watch again the lecture, specifically the cross attention part.

@mrigankanath7337 3 жыл бұрын

@@alfcnz Hi Alf, I saw it again and also the code, I understood that the length should be same for the q,k,v but when giving input q can have different shape than v,k. So if I use a linear layer which output is d(same for q,k,v) but what is the input shape here? since differnt shapes. Should I declare Differnt Linear layer for each of them,?

@mrigankanath7337 3 жыл бұрын

No need to reply, I got it!!! , had to write all the dimensions in my notebook and then proceed, anyway thanks

@alfcnz 3 жыл бұрын

@@mrigankanath7337 pay attention that q, k, v need *not* to be the same size. I fear I need to recommend watching it once more. 🥺🥺🥺

@mrigankanath7337 3 жыл бұрын

@@alfcnz like Xq,Xk,Xv can have different shapes right? But after passing throught the linear layer we get q,k,v which is of same shape? Am I right? I am confused now, well lets see your video again

@AnkitSharma-lh6rf Жыл бұрын

Isn't text to image a seq to vector?

@alfcnz Жыл бұрын

If you’re asking questions about the video, plz include the time stamp, or it’s impossible for me to address your point.

@khaledelsayed762 3 жыл бұрын

Simply brilliant. However, need to provide concrete example of using the attention mechanism on some simple real example with actual input x and showing how Q, K, V evolve

@alfcnz 3 жыл бұрын

That's what the notebook is for, no? Or have I misunderstood your question? 😮😮😮

@khaledelsayed762 3 жыл бұрын

@@alfcnz Right but I meant to show the examples within the lecture slides or maybe that would be too lengthy.

@alfcnz 3 жыл бұрын

I'll keep in mind when / if I'm going to revisit this lecture. Thanks!

@pesky_mousquito 3 жыл бұрын

I think the code and the notebook is missing the "predictor" + "decoder" part

@alfcnz 3 жыл бұрын

I'm just using an encoder there. 😅😅😅

@thuantrangnguyenanh3899 3 жыл бұрын

I could not find the notebook on you repo :(.

@alfcnz 3 жыл бұрын

Check last year's repo.

@thuantrangnguyenanh3899 3 жыл бұрын

Thank you very much.

@alfcnz 3 жыл бұрын

@@thuantrangnguyenanh3899 you found it?

@thuantrangnguyenanh3899 3 жыл бұрын

Yes, it was lecture 15 last year.

@alfcnz 3 жыл бұрын

@@thuantrangnguyenanh3899 👌🏻👌🏻👌🏻

@seokhoankevinchoi1182 3 жыл бұрын

Hello Alf, 0:22:34 Everyone here prob know what you are trying to say, but I think the matrix transpose animation should be fixed.

@alfcnz 3 жыл бұрын

Fixed how?

@alfcnz 3 жыл бұрын

Ah, you mean flipping over the diagonal?

@seokhoanchoi4218 3 жыл бұрын

@@alfcnz yes! I think its not a big issue for the most people but just in case....

@alfcnz 3 жыл бұрын

I didn't even realise it should flip, haha 😅 In my mind I'm used to think how the columns become the rows. Never actually thought about the mirroring, haha 😅 oops 😅

@my_master55 3 жыл бұрын

Wow, cool lecture, Alfredo, thank you so much! 🦄 BTW, is this implementation somehow related to the facebook DeiT transformer? Looks like the part of appending multiple attention+cnn layers was also implemented there 🐱‍🏍

@alfcnz 3 жыл бұрын

No no, it's just the basics of self and cross attention.

@capybara_business Жыл бұрын

Compared to other videos I didn't inderstand anything, you notation is weird. Obviously I lack knowledge but I was expecting something clearer, it would be great to add examples along the video. Btw good work, so many hours to build this up.

@alfcnz Жыл бұрын

Which part was not clear? What notation do you find counterintuitive?

@kevon217 Жыл бұрын

calling mom is always a good call

@alfcnz Жыл бұрын

🥰🥰🥰

@НиколайНовичков-е1э 3 жыл бұрын

Hello, Alfredo :)

@alfcnz 3 жыл бұрын

👋🏻👋🏻👋🏻

@anondoggo 2 жыл бұрын

31:48 I want pizza now :(

@anondoggo 2 жыл бұрын

57:49 I'm not sure if I understand why we use self-attention first and then cross-attention, where is h_i ^{Enc} coming from? (The conditional information?

@anondoggo 2 жыл бұрын

ok so I should have just watched on as this is explained 2 sec later, 01:00:51 :'). So I'm pretty sure h_i is from Enc(x), so are we using h_i as keys and values and h_j as queries, so we know what words in the source to pay attention to during translation? This amounts to providing the first t-1 y observations and getting the last t-1 observations.

@anondoggo 2 жыл бұрын

I think this is true. Will double check with the slides.

@anondoggo 2 жыл бұрын

Another amazing lecture by Alfredo, the fact that there is no dependency on time is so exciting, I finally understand transformers for the very first time. I am in tears. (At least I think I understand :'))

@alfcnz 2 жыл бұрын

I'm glad you have acquired new knowledge 🤓🤓🤓