The buildup to the transformer block was absolutely excellent. This is the star lecture of the course, hands down.
@be2112 3 years ago
Fantastic lecture. This is by far the best explanation of attention I’ve seen so far. It isn’t even close. It blows Andrew Ng, Stanford, and Deep Mind lectures out of the water.
@lbognini 2 years ago
Absolutely!!
@andreimargeloiu-machinelea1484 4 years ago
It's incredible how smoothly the progression from a basic seq2seq model to a Transformer goes! Legend.
@chriswang2464 1 year ago
One note for newbies: at 5:39, the decoder function g_U stands for Gated Recurrent Unit. Also, the formula should be s_t = g_U(y_{t-1}, s_{t-1}, c).
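A minimal sketch of that corrected recurrence in plain Python, assuming a toy stand-in for the GRU cell (the names `decoder_step` and `gru_cell` are hypothetical, not from the lecture):

```python
def decoder_step(gru_cell, y_prev, s_prev, c):
    # s_t = g_U(y_{t-1}, s_{t-1}, c): the new state depends on the previous
    # output token, the previous state, and the (fixed) context vector c.
    return gru_cell(y_prev + c, s_prev)

def gru_cell(x, s):
    # Toy cell: mixes the old state with the mean of the inputs.
    # Purely illustrative; a real GRU has update/reset gates.
    return [0.5 * si + 0.5 * sum(x) / len(x) for si in s]

# One decoder step: previous token y_0, initial state s_0, context c.
s1 = decoder_step(gru_cell, [0.1, 0.2], [0.0, 0.0], [1.0])
print(len(s1))  # the state keeps its dimensionality: 2
```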
@timilehinfasipe485 2 years ago
By far the best lecture I’ve seen about attention
@ee5823 3 years ago
The best lecture on attention. I like how you explain Q, K, and V in self-attention.
@neilteng1735 3 years ago
This lecture should be seen by everyone who wants to understand the attention mechanism!
@kunaldargan4467 3 years ago
Finally found what I was looking for - great compilation, the effort is commendable. One of the best explanations of attention models !!
@kanakraj3198 3 years ago
Great lecture. I would say one of the best explanations of the attention mechanism. Thanks for the video.
@ayushgarg692 3 years ago
This lecture needed the most *attention* but extremely well taught. Thank you!
@johanngerberding5956 3 years ago
Really brilliant lecture. The best resource if you want to learn about Attention in my opinion. Thanks for sharing this!
@xiongyang1141 4 years ago
This is the best course I've ever taken, which gives a complete structure of knowledge, thanks a lot!
@meixinzhu3409 4 years ago
Detailed and easy to understand. Thank you Justin!
@cc98-oe7ol 9 months ago
Fantastic lecture. It makes the attention mechanism very easy to understand.
@yi-chenlu9137 3 years ago
Thank you for the great lecture! I think there's one minor error at 12:47, where the sum of the attention weights should be 1.
@pranavpandey2965 4 years ago
This is the best explanation of, and intuition for, attention and the Transformer block that you can find.
@davidsasu8251 2 years ago
Best lecture I have ever seen on the topic!
@minhdang5132 3 years ago
Thanks for the lecture! The explanation is crystal clear.
@shaunaksen6076 3 years ago
Amazing explanation; loved the interpretability analysis. Thanks for sharing.
@ahmadfarhan8511 4 years ago
this is art!
@ariG23498 3 years ago
4:03 The decoder formula should have been $s_t = g_U(y_{t-1}, s_{t-1}, c)$.
@freddiekarlbom 1 year ago
Amazing lecture! "I'm excited to see where this goes in the next couple of years" feels very prescient when revisited at the end of 2023. :D One question about the historical context, though: the 2015 paper mentioned here as the 'origin' actually cites the 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate", which, as I understand it, is what actually introduced the attention mechanism?
@batubalci9987 1 year ago
At 22:23, the phrase on the slides differs from what Justin is saying ("seagull"). Either the audio or the slides were updated later. Does anyone know what happened?
@prakhargupta1745 1 year ago
Great lecture! There seems to be an error at 9:53, where the attention weights should sum to 1.
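For anyone checking this claim: the weights sum to 1 by construction, because they come out of a softmax over the alignment scores. A quick sketch in plain Python (no framework, illustrative values):

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary alignment scores e_i become attention weights a_i that sum to 1.
weights = softmax([2.0, 1.0, 0.1])
print(round(sum(weights), 6))  # 1.0
```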
@RichKempinski 2 years ago
Prior to watching this lecture, I was under the impression that each multi-head attention unit received the same copy of the original input vector. That was confusing, because the stated motivation for having multiple heads is to let them attend to different aspects of the input, and I kept wondering how that could happen if every head receives identical inputs. Of the many presentations on this topic, this is the first to clear it up for me... I think. As I now understand it, the dimensions of the input vector are first split evenly into lower-dimensional vectors, each handled by a separate attention head. But in the lecture (1:02:35), an input dimension of 512 with 6 heads results in an uneven split (85.333), so something still doesn't make sense. Any ideas?
@RichKempinski 2 years ago
Ah, I just checked the original Vaswani paper, and I think I found the error in the lecture. Section 3.2.2 on multi-head attention says the number of heads is 8 (not 6), which for an input dimension of 512 gives 64 dimensions per split. That makes more sense. Happy now.
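The dimension bookkeeping can be checked in a few lines. Note that in the actual model each head applies its own learned projection of the full 512-dim input down to 64 dims; the literal slicing below is only to illustrate the arithmetic:

```python
d_model, num_heads = 512, 8      # values from Vaswani et al., section 3.2.2
head_dim = d_model // num_heads  # 64 dims per head, with no remainder

x = list(range(d_model))         # stand-in for one token's 512-dim vector
# Slice the vector into one 64-dim chunk per head (bookkeeping only).
heads = [x[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]
print(len(heads), len(heads[0]))  # 8 64
```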
@genericperson8238 2 years ago
Can someone please explain this to me? Around 23:30-24:40, I don't understand how the attention weights locations on the input image, when it's supposed to operate on the features produced by the CNN.
@gokulakrishnancandassamy4995 2 years ago
The answer lies in the concept of the "receptive field", which was the motivation behind the development of CNNs. Simply put, every pixel in an activation map (produced by "convolving" a filter with the input volume) actually looks at a larger spatial area of the input volume. For example, convolving a 7 x 7 x C image with a 3 x 3 x C filter (for simplicity, assume C = 1) at a stride of 1 gives a 5 x 5 activation map for that filter, and every pixel in that map looks at a 3 x 3 x C region of the input volume. The receptive field typically grows larger as we go deeper into the network. Hope this helps!
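The 5 x 5 figure follows from the standard convolution output-size formula; a quick check (`conv_output_size` is just an illustrative helper name):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    # Standard formula: floor((n + 2*padding - kernel) / stride) + 1
    return (n + 2 * padding - kernel) // stride + 1

# 7x7 input, 3x3 filter, stride 1, no padding -> 5x5 activation map
print(conv_output_size(7, 3))  # 5
```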
@ariG23498 3 years ago
36:03 Why do we not consider the magnitude scaling of both vectors? I think scaling the similarities by $D$ instead of $\sqrt{D}$ makes more sense.
@gokulakrishnancandassamy4995 2 years ago
Nope. The dot product is interpreted as a·b = |a| (|b| cos θ), i.e. the magnitude of one vector times the projection of the other vector onto it. Since the magnitude of a D-dimensional vector scales only as sqrt(D), it makes sense to divide by sqrt(D) and not D. Hope this helps!
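This can also be sanity-checked numerically: for random vectors with unit-variance entries, the dot product has standard deviation about sqrt(D), so dividing by sqrt(D) keeps the score variance near 1 regardless of D. A small simulation in plain Python (illustrative only):

```python
import math
import random

random.seed(0)
D, trials = 256, 2000
samples = []
for _ in range(trials):
    q = [random.gauss(0, 1) for _ in range(D)]
    k = [random.gauss(0, 1) for _ in range(D)]
    dot = sum(a * b for a, b in zip(q, k))
    samples.append(dot / math.sqrt(D))   # the scaled attention score

# Empirical variance of the scaled scores: close to 1.0, independent of D.
var = sum(s * s for s in samples) / trials
print(round(var, 1))
```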
@prakhargupta1745 1 year ago
Can anyone explain the segment 52:21 - 54:46? I can't seem to wrap my head around it. 1. How do 1x1 convolutions produce grids of size C' * H * W? 2. Why is the size of the values matrix C' * H * H and not C' * H * W? 3. Do the attention weights of size (H*W)*(H*W) denote the similarity between each pixel and every other pixel in the grid? If so, why do we not treat the different feature matrices differently?
@ytc7644 8 months ago
1. C' separate 1x1 convolution kernels are used, so the output has C' channels. 2. That is a typo; you can see in the corrected version of the slides that it's C' * H * W.
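On question 3: yes, the (H*W) x (H*W) matrix holds one similarity score per pair of spatial positions, where each position is described by its full C'-dim feature vector (the channels are combined by the dot product, not treated separately). A shape-only sketch, with the 1x1 convs replaced by identity maps for simplicity:

```python
import math

C, H, W = 2, 2, 3
N = H * W                       # 6 spatial positions in the grid
# One C-dim feature vector per position (toy values; identity "1x1 conv",
# so queries = keys = feats here).
feats = [[float(n + c) for c in range(C)] for n in range(N)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

# Each row: similarities of one position's query against every position's key.
attn = [softmax([sum(a * b for a, b in zip(q, k)) for k in feats])
        for q in feats]
print(len(attn), len(attn[0]))  # 6 6  ->  (H*W) x (H*W)
```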
@gauravms6681 4 years ago
Great Video !!!!!!!!!
@JH-pl8ih 11 months ago
I have a question about what was discussed at 18:00 that I'm hoping someone can help me with. How does the model know how to translate "the European Economic Area" to "la zone économique européenne", given that it translates inputs to outputs step by step? It outputs "zone" even though it hasn't processed the input for "area" yet. I must be missing something.
@TYANTOWERS 5 months ago
It takes in all of the input first, and only then starts producing the output. The steps go like this: first, the whole input X is read and passed through the hidden states h. Once that is done, s_0 is computed using g_U. Then the alignment scores e are computed, followed by the attention weights a. Using a and h, we get c_1; then c_1 and y_0 give s_1. The next step repeats the same process, computing e from s_1, and so forth.
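The per-step loop described above can be sketched in a few lines of plain Python (the dot-product alignment here stands in for the lecture's learned alignment function, so it's illustrative only):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attention_step(h, s_prev):
    # Alignment scores e_t, one per encoder state (dot product as a stand-in).
    e = [sum(hi * si for hi, si in zip(h_t, s_prev)) for h_t in h]
    a = softmax(e)                       # attention weights, summing to 1
    # Context vector c = sum_t a_t * h_t
    c = [sum(a[t] * h[t][d] for t in range(len(h)))
         for d in range(len(s_prev))]
    return a, c

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoder hidden states
a, c = attention_step(h, [1.0, 0.0])      # s_0 attends over all of h
print(round(sum(a), 6))  # 1.0
```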
@JH-pl8ih 5 months ago
@@TYANTOWERS Thanks!
@zououoz3588 2 years ago
Good evening. I'm a student and I've been assigned to explain this attention mechanism, and I found that this video has the best explanation of attention. So I want to ask: is it okay if I take screenshots of some parts of the video and attach them to an ipynb file that I'll upload publicly to GitHub?
@muhammadawon8164 2 years ago
The lecture slides are already provided in the description, so presumably consent to use them has been given.
@Abbashuss95 2 years ago
What is the name of the playlist this video belongs to?
@lbognini 2 years ago
kzbin.info/www/bejne/j3LKm5mDh56Fla8
@Anonymous-lw1zy 1 year ago
Deep Learning for Computer Vision kzbin.info/aero/PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r
@user-fi1ur1cz9c 2 years ago
Very helpful
@vitalispyskinas5595 1 year ago
If only he had seen the number of learnable parameters in GPT-4 (1:07:10)