Neural Attention - This simple example will change how you think about it

5,872 views

Neural Breakdown with AVB

1 day ago

Comments: 41
@avb_fj
@avb_fj 1 year ago
Just uploaded the second part of the series discussing Self Attention and variants. Link here: kzbin.info/www/bejne/ap_EiX-ei8uYntE Here's future me posting the third part about Transformers: kzbin.info/www/bejne/ZoGZXmmBnaegkK8
@iSpades0
@iSpades0 1 year ago
By far one of the best Deep Learning YouTube channels I have ever checked out! Can't wait for parts 2 and 3, keep up the good work!
@avb_fj
@avb_fj 1 year ago
Thanks a lot! Nice timing with the comment, I just published the next video a couple hours back! 😀
@amoghjain
@amoghjain 1 year ago
please keep making these videos!! your explanations are absolutely amazing, engaging, to the point, intuitive, and very easy to understand!!!
@kozer1986
@kozer1986 1 year ago
The best explanation I've ever seen. It totally clicked for me! Thanks!!
@avb_fj
@avb_fj 1 year ago
Awesome! Glad to hear it! :)
@svenleijnen9045
@svenleijnen9045 1 year ago
First time I've ever commented on a video, but I just had to: the way you make complex concepts understandable is awesome! Best explanation of attention I've come across so far 👍
@avb_fj
@avb_fj 1 year ago
Nice, I am a fellow non-commenter as well! Glad to see you here, and thanks for all the appreciation!
@sahhaf1234
@sahhaf1234 1 year ago
Excellent video. My only critique is that the concept of a hidden state is used around @13:15 without being defined. After @13:13 it gets a little too fast and the concepts become a blur.
@avb_fj
@avb_fj 1 year ago
Appreciate all the feedback! Thanks for sharing your experience... I thought getting into LSTMs and RNNs would be a bit of a rabbit hole for this video since not all of it is relevant to the primary topic, so I stayed at the surface level with the hidden state stuff and focussed more on the "attention" portions of the video.
@LexPodgorny
@LexPodgorny 9 months ago
@2:14 You suddenly jumped from a 512-dimensional vector to a vector of size 2. But how? Please explain what happened there, because I think a key portion of the video got cut out. Thanks
@avb_fj
@avb_fj 9 months ago
As I mentioned around 3:06, the 2D thing was an example. The Q/K/V embedding size is an arbitrary hyperparameter, so it can be set to anything. I used the size-2 example just to illustrate how the "dot product" works, since it is easy to show the cosine similarity between two 2D vectors, as at 2:51.
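To make the 2D dot-product illustration concrete, here is a minimal NumPy sketch (the numbers are made up for illustration, not taken from the video): a 2D query is scored against two 2D keys with dot products, and a softmax turns those scores into attention weights, so the key most aligned with the query gets the largest weight.

```python
import numpy as np

query = np.array([0.9, 0.1])             # hypothetical 2D query embedding
keys = np.array([[0.8, 0.2],             # hypothetical 2D key embeddings
                 [-0.5, 0.7]])

scores = keys @ query                     # dot product of the query with each key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
print(scores, weights)                    # the better-aligned first key gets the higher weight
```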
@LexPodgorny
@LexPodgorny 9 months ago
@@avb_fj Ah, I got it now. Thank you! But how do you actually produce a query embedding from a query? Is there a video on building the key and query neural networks that do that? Especially interesting is the part where the query embedding is learned in a way that corresponds to the key embedding's vector coordinates. I am assuming that using the same word embedding for both should take care of it somehow, but it would be great to see the actual technique that is used. Thank you!
@avb_fj
@avb_fj 9 months ago
So an embedding can be obtained by passing your input through a neural network. For the case of text, we can use anything from "word embeddings" to "RNNs/LSTMs" etc. to convert input text into embeddings. On my channel there are a couple of helpful videos: kzbin.info/www/bejne/ZoGZXmmBnaegkK8 kzbin.info/www/bejne/q6DGioR-ZciKitU But there are plenty of resources online too! Good luck!
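As a small sketch of that reply (the toy vocabulary and layer sizes below are assumptions, not from the video), a word can be turned into a query embedding with a learned embedding table followed by a learned query projection:

```python
import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "sat": 2}   # hypothetical toy vocabulary
word_emb = nn.Embedding(len(vocab), 8)   # learned word-embedding table (size 8)
to_query = nn.Linear(8, 2)               # learned query network, here projecting to 2D

token = torch.tensor([vocab["cat"]])
q = to_query(word_emb(token))            # query embedding for "cat"
print(q.shape)                           # torch.Size([1, 2])
```

The key network would be a separate learned projection trained the same way, and the two learn to produce embeddings whose dot products are useful for the end task.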
@sahhaf1234
@sahhaf1234 1 year ago
These videos are prepared in a very thought-provoking way... I think the weights/biases of the system are in the query, key, and value networks shown at @15:30, and all training occurs there. Therefore, can we say that what neural attention learns are embeddings? On the other hand, the softmax(QK^T/sqrt(d_k))V part is fixed and apparently does not learn anything during training. Thank you very much again for these very well-prepared videos.
@avb_fj
@avb_fj 1 year ago
The Query, Keys, and Values are indeed embeddings that must be optimized, by updating the Query, Key, and Value neural networks. The softmax(QK^T/sqrt(d_k))V part is the "computation graph" that takes in those embeddings and transforms them into new "contextually aware embeddings". Consider the following example. Say you have 2 inputs a and b, and you want to train a neural net to predict C. You can model your network as Y = F(a) + G(b). That is, we are saying "some function of a and some function of b will add up to be Y", and we will optimize the functions F and G so that Y gets close to the target.
1) F and G are analogous to the Q, K, V neural networks.
2) F(a) and G(b) are analogous to the Q, K, V embeddings.
3) The + sign is the computation graph that combines a and b to make the prediction Y. In the attention formula, this + sign is equivalent to softmax(QK^T)V.
4) Finally, since we wanted to predict C, not Y, we calculate the loss between C and Y, and then optimize the weights/biases of the neural networks F and G so that Y (our prediction) gets closer to C (our target). The gradients of the loss flow right through the + operation and through F and G.
So yeah, the input collections are passed through the Q, K, V networks to derive the Q, K, V embeddings. The softmax(...) portion is the computation that combines these embeddings. The softmax(...) operation doesn't contain any parameters to train, as you mentioned, but it forms the backbone/computation graph of how the forward pass and backpropagation work. Hope that helps.
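A rough PyTorch sketch of the setup described above (the sizes are arbitrary assumptions): the Query/Key/Value networks hold all the trainable weights, while softmax(QK^T/sqrt(d_k))V is a fixed computation that the gradients simply flow through.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_k = 16, 8
x = torch.randn(5, d_in)                          # 5 input items with toy embeddings

# The learnable parts: Query, Key, and Value networks (here single linear layers).
W_q, W_k, W_v = nn.Linear(d_in, d_k), nn.Linear(d_in, d_k), nn.Linear(d_in, d_k)
Q, K, V = W_q(x), W_k(x), W_v(x)

# The parameter-free computation graph: softmax(Q K^T / sqrt(d_k)) V.
attn = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)
out = attn @ V                                    # contextually aware embeddings
print(out.shape)                                  # torch.Size([5, 8])
```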
@sahhaf1234
@sahhaf1234 1 year ago
@@avb_fj Thanks a lot... Right now I'm listening to the self-attention part. When I'm done with the third part, I'll come back here and read your reply again more carefully.
@luisfelipearaujodeoliveira469
@luisfelipearaujodeoliveira469 9 months ago
AMAZING TUTORIAL, I am definitely recommending your video to all my friends who want to learn Deep Learning in an easy way. Greetings from Brazil! And keep up the good work!
@avb_fj
@avb_fj 9 months ago
Thanks!! Totally made my day!
@gnorts_mr_alien
@gnorts_mr_alien 1 year ago
You will be a star teacher on YouTube if you keep it up (and if that is your goal). Thank you, this was very good. Subscribed.
@avb_fj
@avb_fj 1 year ago
Wow, that's got to be one of the kindest comments I've ever received! Thanks a lot… glad you enjoyed it!
@gnorts_mr_alien
@gnorts_mr_alien 1 year ago
You definitely have that special "knack" some of the best teachers have, and a very soothing tone to boot. Eagerly waiting for parts 2 and 3. Cheers! @@avb_fj
@adrianjackson1045
@adrianjackson1045 10 months ago
great video!! your explanations and graphics are amazing. love the content
@avb_fj
@avb_fj 10 months ago
Thanks!😊
@js116600
@js116600 16 hours ago
The part about multi-head attention is a bit blurry. How do you enforce that the first head captures word order and the second head captures word meaning, for example? How do you know they are not learning the same thing?
@avb_fj
@avb_fj 9 hours ago
So… you don't know, and you can't enforce that. You just set up the architecture like that and train on massive amounts of data. The network learns via gradient descent, and the weights can converge to anything that reduces the loss. Empirically, it has been shown that the different attention heads learn orthogonal behaviors about the dataset. The example of word order/meaning/grammar was mentioned for building intuition, but you are correct - there is no external enforcement.
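A toy sketch of that point (the head count and sizes below are arbitrary): two attention heads with separate Q/K/V projections, where nothing in the code forces one head to track word order and another word meaning; whatever each head ends up representing emerges from training on the data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads = 16, 2
d_head = d_model // n_heads
x = torch.randn(5, d_model)                       # 5 input items

heads = []
for _ in range(n_heads):                          # each head gets its own projections
    W_q, W_k, W_v = (nn.Linear(d_model, d_head) for _ in range(3))
    Q, K, V = W_q(x), W_k(x), W_v(x)
    heads.append(F.softmax(Q @ K.T / d_head ** 0.5, dim=-1) @ V)

out = torch.cat(heads, dim=-1)                    # concatenate the per-head outputs
print(out.shape)                                  # torch.Size([5, 16])
```

Standard implementations also pass the concatenated output through one more learned linear layer to mix the heads.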
@venkateshbs1384
@venkateshbs1384 10 months ago
Very Clear Explanation. Thanks for that.
@sahhaf1234
@sahhaf1234 1 year ago
I watched the whole series and it is a real gem... But something is missing: where are the nonlinearity and the weights and biases? What do we train?
@avb_fj
@avb_fj 1 year ago
The weights, biases, and non-linearity come from the Query, Key, and Value neural networks. These convert the input embeddings into the query, key, and value embeddings respectively - which then go through the attention computation. We can also add feed-forward layers after the attention layer to add further transformations/non-linearity. Other weights we train can be the initial embeddings of the input collection. Look up word embeddings, for example, which train a special embedding vector for each word in the vocabulary. There can also be separate neural networks to embed each input type. For example, suppose you are trying to learn attention between a bunch of images and a sentence. The images can have their own image-encoding neural network, and the sentence can have a text-encoding neural network. All of these nets have their own weights and biases according to whatever the end goal is. Once the forward pass is defined, we compute the loss between the network's prediction and the target. Through backpropagation, all learnable parameters then get updated.
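As a small illustration of where that extra non-linearity can live (layer sizes below are arbitrary), a feed-forward block with a ReLU applied after the attention output adds trainable weights, biases, and a nonlinear transformation:

```python
import torch
import torch.nn as nn

d_model = 16
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # learnable weights and biases
    nn.ReLU(),                         # the non-linearity
    nn.Linear(4 * d_model, d_model),
)

attended = torch.randn(5, d_model)     # stand-in for the attention layer's output
print(ffn(attended).shape)             # torch.Size([5, 16])
```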
@noclaf78
@noclaf78 7 months ago
Is there a link to your contrastive learning video?
@avb_fj
@avb_fj 7 months ago
Check out the first quarter of this video: Multimodal AI from First Principles - Neural Nets that can see, hear, AND write. kzbin.info/www/bejne/Y53PnICmg61kbJI
@serta5727
@serta5727 1 year ago
Very good explanation, thanks 😄
@sahhaf1234
@sahhaf1234 1 year ago
Maybe an unimportant point, but at @2:00 the vector Q looks like a column vector... I think it should be a row vector.
@avb_fj
@avb_fj 1 year ago
Yeah, a row vector would be more accurate for the QK^T stuff that happens later. Thanks for pointing that out.
@w花b
@w花b 1 month ago
I was so confused about the dimensions that I was questioning whether I had learned matrices right lol. I knew something didn't match. But in NumPy they're columns by default, so we would also need the transpose of Q as well.
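A quick shape check of the row-vs-column point (toy numbers): with the query written as a 1 x d row vector, q @ K.T gives one score per key with no extra transpose of q needed.

```python
import numpy as np

d = 4
q = np.random.randn(1, d)      # query as a row vector, shape (1, d)
K = np.random.randn(3, d)      # 3 keys stacked as rows, shape (3, d)

scores = q @ K.T               # shape (1, 3): one attention score per key
print(scores.shape)
```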
@landerosedgard
@landerosedgard 1 year ago
great explanation!
@avb_fj
@avb_fj 1 year ago
Thanks!!
@repairstudio4940
@repairstudio4940 1 year ago
Is that a One Piece shirt?! Love it man! 🎉 Also great content as always.
@avb_fj
@avb_fj 1 year ago
Haha thanks!🙏🏽
@sahhaf1234
@sahhaf1234 9 months ago
I think the embeddings must be normalized for the dot product to make sense.
@ahnafsamin3777
@ahnafsamin3777 10 months ago
Good videos, but a bit fast.