The buildup to the transformer block was absolutely excellent. This is the star lecture of the course, hands down.
@be2112 3 years ago
Fantastic lecture. This is by far the best explanation of attention I’ve seen so far. It isn’t even close. It blows Andrew Ng, Stanford, and Deep Mind lectures out of the water.
@lbognini 2 years ago
Absolutely!!
@andreimargeloiu-machinelea1484 4 years ago
It's incredible how smoothly the progression from a basic seq2seq model to a Transformer goes! Legend.
@chriswang2464 1 year ago
One note for newbies: at 5:39, the decoder function g_U stands for Gated Recurrent Unit. Also, the formula should be s_t = g_U(y_{t-1}, s_{t-1}, c).
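A minimal sketch of that corrected recurrence in plain Python, assuming a toy stand-in for the GRU cell (the names `decoder_step` and `gru_cell` are hypothetical, not from the lecture):

```python
def decoder_step(gru_cell, y_prev, s_prev, c):
    # s_t = g_U(y_{t-1}, s_{t-1}, c): the new state depends on the previous
    # output token, the previous state, and the (fixed) context vector c.
    return gru_cell(y_prev + c, s_prev)

def gru_cell(x, s):
    # Toy cell: mixes the old state with the mean of the inputs.
    # Purely illustrative; a real GRU has update/reset gates.
    return [0.5 * si + 0.5 * sum(x) / len(x) for si in s]

# One decoder step: previous token y_0, initial state s_0, context c.
s1 = decoder_step(gru_cell, [0.1, 0.2], [0.0, 0.0], [1.0])
print(len(s1))  # the state keeps its dimensionality: 2
```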
@timilehinfasipe485 2 years ago
By far the best lecture I’ve seen about attention
@ee5823 3 years ago
The best lecture on attention. I like how you explain Q, K, and V in self-attention.
@neilteng1735 3 years ago
This lecture should be seen by everyone who wants to understand the attention mechanism!
@kunaldargan4467 3 years ago
Finally found what I was looking for - great compilation, the effort is commendable. One of the best explanations of attention models !!
@kanakraj3198 3 years ago
Great lecture. I would say one of the best explanations of the attention mechanism. Thanks for the video.
@ayushgarg692 3 years ago
This lecture needed the most *attention* but extremely well taught. Thank you!
@johanngerberding5956 3 years ago
Really brilliant lecture. The best resource if you want to learn about Attention in my opinion. Thanks for sharing this!
@xiongyang1141 4 years ago
This is the best course I've ever taken, which gives a complete structure of knowledge, thanks a lot!
@meixinzhu3409 4 years ago
Detailed and easy to understand. Thank you Justin!
@cc98-oe7ol 9 months ago
Fantastic lecture. It makes the attention mechanism very easy to understand.
@yi-chenlu9137 3 years ago
Thank you for the great lecture! I think there's one minor error at 12:47, where the sum of the attention weights should be 1.
@pranavpandey2965 4 years ago
This is the best explanation of, and intuition for, attention and the Transformer block that you can find.
@davidsasu8251 2 years ago
Best lecture I have ever seen on the topic!
@minhdang5132 3 years ago
Thanks for the lecture! The explanation is crystal clear.
@shaunaksen6076 3 years ago
Amazing explanation; loved the interpretability analysis. Thanks for sharing.
@ahmadfarhan8511 4 years ago
this is art!
@ariG23498 3 years ago
4:03 The decoder formula should have been $s_t = g_U(y_{t-1}, s_{t-1}, c)$.
@freddiekarlbom 1 year ago
Amazing lecture! "I'm excited to see where this goes in the next couple of years" feels very prescient when revisited at the end of 2023. :D One question about the historical context, though: the 2015 paper mentioned here as the 'origin' actually cites the 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate", which, as I understand it, is what actually introduced the attention mechanism?
@batubalci9987 1 year ago
At 22:23, the phrase on the slides differs from what Justin is saying ("seagull"). Either the audio or the slides were updated later. Does anyone know what happened?
@prakhargupta1745 1 year ago
Great lecture! There seems to be an error at 9:53, where the attention weights should sum to 1.
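For anyone checking this claim: the weights sum to 1 by construction, because they come out of a softmax over the alignment scores. A quick sketch in plain Python (no framework, illustrative values):

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary alignment scores e_i become attention weights a_i that sum to 1.
weights = softmax([2.0, 1.0, 0.1])
print(round(sum(weights), 6))  # 1.0
```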
@RichKempinski 2 years ago
Prior to watching this lecture, I was under the impression that each multi-head attention unit received the same copy of the original input vector. That was confusing, because the stated motivation for having multiple heads is to let them attend to different aspects of the input, and I kept wondering how that could happen if every head receives identical inputs. Of the many presentations on this topic, this is the first to clear it up for me... I think. As I now understand it, the dimensions of the input vector are first split evenly into lower-dimensional vectors, each handled by a separate attention head. But in the lecture (1:02:35), an input dimension of 512 with 6 heads results in an uneven split (85.333), so something still doesn't make sense. Any ideas?
@RichKempinski 2 years ago
Ah, I just checked the original Vaswani paper, and I think I found the error in the lecture. Section 3.2.2 on multi-head attention says the number of heads is 8 (not 6), which for an input dimension of 512 gives 64 dimensions per split. That makes more sense. Happy now.
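The dimension bookkeeping can be checked in a few lines. Note that in the actual model each head applies its own learned projection of the full 512-dim input down to 64 dims; the literal slicing below is only to illustrate the arithmetic:

```python
d_model, num_heads = 512, 8      # values from Vaswani et al., section 3.2.2
head_dim = d_model // num_heads  # 64 dims per head, with no remainder

x = list(range(d_model))         # stand-in for one token's 512-dim vector
# Slice the vector into one 64-dim chunk per head (bookkeeping only).
heads = [x[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]
print(len(heads), len(heads[0]))  # 8 64
```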
@genericperson8238 2 years ago
Can someone please explain this to me? Around 23:30-24:40, I don't understand how the attention weights locations on the input image, when it's supposed to operate on the features produced by the CNN.
@gokulakrishnancandassamy4995 2 years ago
The answer lies in the concept of the "receptive field", which was the motivation behind the development of CNNs. Simply put, every pixel in an activation map (produced by "convolving" a filter with the input volume) actually looks at a larger spatial area of the input volume. For example, convolving a 7 x 7 x C image with a 3 x 3 x C filter (for simplicity, assume C = 1) at a stride of 1 gives a 5 x 5 activation map for that filter, and every pixel in that map looks at a 3 x 3 x C region of the input volume. The receptive field typically grows larger as we go deeper into the network. Hope this helps!
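The 5 x 5 figure follows from the standard convolution output-size formula; a quick check (`conv_output_size` is just an illustrative helper name):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    # Standard formula: floor((n + 2*padding - kernel) / stride) + 1
    return (n + 2 * padding - kernel) // stride + 1

# 7x7 input, 3x3 filter, stride 1, no padding -> 5x5 activation map
print(conv_output_size(7, 3))  # 5
```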
@ariG23498 3 years ago
36:03 Why do we not consider the magnitude scaling of both vectors? I think scaling the similarities by $D$ instead of $\sqrt{D}$ makes more sense.
@gokulakrishnancandassamy4995 2 years ago
Nope. The dot product is interpreted as a·b = |a| (|b| cos θ), i.e. the magnitude of one vector times the projection of the other vector onto it. Since the magnitude of a D-dimensional vector scales only as sqrt(D), it makes sense to divide by sqrt(D) and not D. Hope this helps!
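This can also be sanity-checked numerically: for random vectors with unit-variance entries, the dot product has standard deviation about sqrt(D), so dividing by sqrt(D) keeps the score variance near 1 regardless of D. A small simulation in plain Python (illustrative only):

```python
import math
import random

random.seed(0)
D, trials = 256, 2000
samples = []
for _ in range(trials):
    q = [random.gauss(0, 1) for _ in range(D)]
    k = [random.gauss(0, 1) for _ in range(D)]
    dot = sum(a * b for a, b in zip(q, k))
    samples.append(dot / math.sqrt(D))   # the scaled attention score

# Empirical variance of the scaled scores: close to 1.0, independent of D.
var = sum(s * s for s in samples) / trials
print(round(var, 1))
```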
@prakhargupta1745 1 year ago
Can anyone explain the segment 52:21 - 54:46? I can't seem to wrap my head around it. 1. How do 1x1 convolutions produce grids of size C' * H * W? 2. Why is the size of the values matrix C' * H * H and not C' * H * W? 3. Do the attention weights of size (H*W)*(H*W) denote the similarity between each pixel and every other pixel in the grid? If so, why do we not treat the different feature matrices differently?
@ytc7644 8 months ago
1. C' separate 1x1 convolution kernels are used, so the output has C' channels. 2. That is a typo; you can see in the corrected version of the slides that it's C' * H * W.
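On question 3: yes, the (H*W) x (H*W) matrix holds one similarity score per pair of spatial positions, where each position is described by its full C'-dim feature vector (the channels are combined by the dot product, not treated separately). A shape-only sketch, with the 1x1 convs replaced by identity maps for simplicity:

```python
import math

C, H, W = 2, 2, 3
N = H * W                       # 6 spatial positions in the grid
# One C-dim feature vector per position (toy values; identity "1x1 conv",
# so queries = keys = feats here).
feats = [[float(n + c) for c in range(C)] for n in range(N)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

# Each row: similarities of one position's query against every position's key.
attn = [softmax([sum(a * b for a, b in zip(q, k)) for k in feats])
        for q in feats]
print(len(attn), len(attn[0]))  # 6 6  ->  (H*W) x (H*W)
```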
@gauravms6681 4 years ago
Great Video !!!!!!!!!
@JH-pl8ih 11 months ago
I have a question about what was discussed at 18:00 that I'm hoping someone can help me with. How does the model know how to translate "the European Economic Area" to "la zone économique européenne", given that it translates inputs to outputs step by step? It outputs "zone" even though it hasn't processed the input for "area" yet. I must be missing something.
@TYANTOWERS 5 months ago
It takes in all of the input first, and only then starts producing the output. The steps go like this: first, the whole input X is read and passed through the hidden states h. Once that is done, s_0 is computed using g_U. Then the alignment scores e are computed, followed by the attention weights a. Using a and h, we get c_1; then c_1 and y_0 give s_1. The next step repeats the same process, computing e from s_1, and so forth.
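The per-step loop described above can be sketched in a few lines of plain Python (the dot-product alignment here stands in for the lecture's learned alignment function, so it's illustrative only):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attention_step(h, s_prev):
    # Alignment scores e_t, one per encoder state (dot product as a stand-in).
    e = [sum(hi * si for hi, si in zip(h_t, s_prev)) for h_t in h]
    a = softmax(e)                       # attention weights, summing to 1
    # Context vector c = sum_t a_t * h_t
    c = [sum(a[t] * h[t][d] for t in range(len(h)))
         for d in range(len(s_prev))]
    return a, c

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoder hidden states
a, c = attention_step(h, [1.0, 0.0])      # s_0 attends over all of h
print(round(sum(a), 6))  # 1.0
```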
@JH-pl8ih 5 months ago
@@TYANTOWERS Thanks!
@zououoz3588 2 years ago
Good evening. I'm a student and I've been assigned to explain this attention mechanism, and I found that this video has the best explanation of attention. So I want to ask: is it okay if I take screenshots of some parts of the video and attach them to an ipynb file that I'll upload publicly to GitHub?
@muhammadawon8164 2 years ago
The lecture slides are already provided in the description, so presumably consent to use them has been given.
@Abbashuss95 2 years ago
What is the name of the playlist this video belongs to?
@lbognini 2 years ago
kzbin.info/www/bejne/j3LKm5mDh56Fla8
@Anonymous-lw1zy 1 year ago
Deep Learning for Computer Vision kzbin.info/aero/PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r
@user-fi1ur1cz9c 2 years ago
Very helpful
@vitalispyskinas5595 1 year ago
If only he had seen the number of learnable parameters in GPT-4 (1:07:10)