Neural Attention - This simple example will change how you think about it

5,872 views

Neural Breakdown with AVB

1 day ago

Comments: 41
@avb_fj
@avb_fj 1 year ago
Just uploaded the second part of the series discussing Self Attention and variants. Link here: kzbin.info/www/bejne/ap_EiX-ei8uYntE Here's future me posting the third part about Transformers: kzbin.info/www/bejne/ZoGZXmmBnaegkK8
@iSpades0
@iSpades0 1 year ago
By far one of the best Deep Learning YouTube channels I have ever checked out! Can't wait for parts 2 and 3, keep up the good work!
@avb_fj
@avb_fj 1 year ago
Thanks a lot! Nice timing with the comment, I just published the next video a couple hours back! 😀
@amoghjain
@amoghjain 1 year ago
please keep making these videos!! your explanations are absolutely amazing, engaging, to the point, intuitive, and very easy to understand!!!
@kozer1986
@kozer1986 1 year ago
The best explanation I've ever seen. It totally clicked for me! Thanks!!
@avb_fj
@avb_fj 1 year ago
Awesome! Glad to hear it! :)
@svenleijnen9045
@svenleijnen9045 1 year ago
First time I've ever commented on a video, but I just had to: the way you make complex concepts understandable is awesome! Best explanation of attention I've come across so far 👍
@avb_fj
@avb_fj 1 year ago
Nice, I am a fellow non-commenter as well! Glad to see you here, and thanks for all the appreciation!
@sahhaf1234
@sahhaf1234 1 year ago
Excellent video. My only critique is that the concept of a hidden state is used around @13:15 without being defined. After @13:13 it gets a little too fast and the concepts become a blur.
@avb_fj
@avb_fj 1 year ago
Appreciate all the feedback! Thanks for sharing your experience... I thought getting into LSTMs and RNNs would be a bit of a rabbit hole for this video since not all of it is relevant to the primary topic, so I stayed at the surface level with the hidden state stuff and focussed more on the "attention" portions of the video.
@LexPodgorny
@LexPodgorny 9 months ago
@2:14 You suddenly jumped from a 512-dimensional vector to a vector of size 2. But how? Please explain what happened there, because I think a key portion of the video got cut out. Thanks
@avb_fj
@avb_fj 9 months ago
As I mentioned around 3:06, the 2D thing was an example. The Q/K/V embedding size is an arbitrary hyperparameter, so it can be set to anything. I used the size-2 example just to illustrate how the "dot product" works, since it is easy to show the cosine similarity between two 2D vectors, as at 2:51.
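To make the 2D dot-product illustration concrete, here is a minimal NumPy sketch (the numbers are made up for illustration, not taken from the video): a 2D query is scored against two 2D keys with dot products, and a softmax turns those scores into attention weights, so the key most aligned with the query gets the largest weight.

```python
import numpy as np

query = np.array([0.9, 0.1])             # hypothetical 2D query embedding
keys = np.array([[0.8, 0.2],             # hypothetical 2D key embeddings
                 [-0.5, 0.7]])

scores = keys @ query                     # dot product of the query with each key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
print(scores, weights)                    # the better-aligned first key gets the higher weight
```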
@LexPodgorny
@LexPodgorny 9 months ago
@@avb_fj Ah, I got it now. Thank you! But how do you actually produce a query embedding from a query? Is there a video on building the key and query neural networks that do that? Especially interesting is the part where the query embedding is learned in a way that corresponds to the key embedding's vector coordinates. I am assuming that using the same word embedding for both should take care of it somehow, but it would be great to see the actual technique that is used. Thank you!
@avb_fj
@avb_fj 9 months ago
So an embedding can be obtained by passing your input through a neural network. For the case of text, we can use anything from "word embeddings" to "RNNs/LSTMs" etc. to convert input text into embeddings. On my channel there are a couple of helpful videos: kzbin.info/www/bejne/ZoGZXmmBnaegkK8 kzbin.info/www/bejne/q6DGioR-ZciKitU But there are plenty of resources online too! Good luck!
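As a small sketch of that reply (the toy vocabulary and layer sizes below are assumptions, not from the video), a word can be turned into a query embedding with a learned embedding table followed by a learned query projection:

```python
import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "sat": 2}   # hypothetical toy vocabulary
word_emb = nn.Embedding(len(vocab), 8)   # learned word-embedding table (size 8)
to_query = nn.Linear(8, 2)               # learned query network, here projecting to 2D

token = torch.tensor([vocab["cat"]])
q = to_query(word_emb(token))            # query embedding for "cat"
print(q.shape)                           # torch.Size([1, 2])
```

The key network would be a separate learned projection trained the same way, and the two learn to produce embeddings whose dot products are useful for the end task.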
@sahhaf1234
@sahhaf1234 1 year ago
These videos are prepared in a very thought-provoking way... I think the weights/biases of the system are in the query, key, and value networks shown at @15:30, and all training occurs there. Therefore, can we say that what neural attention learns are embeddings? On the other hand, the softmax(QK^T/sqrt(d_k))V part is fixed and apparently does not learn anything during training. Thank you very much again for these very well-prepared videos.
@avb_fj
@avb_fj 1 year ago
The Query, Keys, and Values are indeed embeddings that must be optimized, by updating the Query, Key, and Value neural networks. The softmax(QK^T/sqrt(d_k))V part is the "computation graph" that takes in those embeddings and transforms them into new "contextually aware embeddings". Consider the following example. Say you have 2 inputs a and b, and you want to train a neural net to predict C. You can model your network as Y = F(a) + G(b). That is, we are saying "some function of a and some function of b will add up to be Y", and we will optimize the functions F and G so that Y gets close to the target.
1) F and G are analogous to the Q, K, V neural networks.
2) F(a) and G(b) are analogous to the Q, K, V embeddings.
3) The + sign is the computation graph that combines a and b to make the prediction Y. In the attention formula, this + sign is equivalent to softmax(QK^T)V.
4) Finally, since we wanted to predict C, not Y, we calculate the loss between C and Y, and then optimize the weights/biases of the neural networks F and G so that Y (our prediction) gets closer to C (our target). The gradients of the loss flow right through the + operation and through F and G.
So yeah, the input collections are passed through the Q, K, V networks to derive the Q, K, V embeddings. The softmax(...) portion is the computation that combines these embeddings. The softmax(...) operation doesn't contain any parameters to train, as you mentioned, but it forms the backbone/computation graph of how the forward pass and backpropagation work. Hope that helps.
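A rough PyTorch sketch of the setup described above (the sizes are arbitrary assumptions): the Query/Key/Value networks hold all the trainable weights, while softmax(QK^T/sqrt(d_k))V is a fixed computation that the gradients simply flow through.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_k = 16, 8
x = torch.randn(5, d_in)                          # 5 input items with toy embeddings

# The learnable parts: Query, Key, and Value networks (here single linear layers).
W_q, W_k, W_v = nn.Linear(d_in, d_k), nn.Linear(d_in, d_k), nn.Linear(d_in, d_k)
Q, K, V = W_q(x), W_k(x), W_v(x)

# The parameter-free computation graph: softmax(Q K^T / sqrt(d_k)) V.
attn = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)
out = attn @ V                                    # contextually aware embeddings
print(out.shape)                                  # torch.Size([5, 8])
```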
@sahhaf1234
@sahhaf1234 1 year ago
@@avb_fj Thanks a lot... Right now I'm listening to the self-attention part. When I'm done with the third part, I'll come back here and read your reply again more carefully.
@luisfelipearaujodeoliveira469
@luisfelipearaujodeoliveira469 9 months ago
AMAZING TUTORIAL, I am definitely recommending your video to all my friends who want to learn Deep Learning in an easy way. Greetings from Brazil! And keep up the good work!
@avb_fj
@avb_fj 9 months ago
Thanks!! Totally made my day!
@gnorts_mr_alien
@gnorts_mr_alien 1 year ago
You will be a star teacher on YouTube if you keep it up (and if that is your goal). Thank you, this was very good. Subscribed.
@avb_fj
@avb_fj 1 year ago
Wow, that's got to be one of the kindest comments I've ever received! Thanks a lot… glad you enjoyed it!
@gnorts_mr_alien
@gnorts_mr_alien 1 year ago
You definitely have that special "knack" some of the best teachers have, and a very soothing tone to boot. Eagerly waiting for parts 2 and 3. Cheers! @@avb_fj
@adrianjackson1045
@adrianjackson1045 10 months ago
great video!! your explanations and graphics are amazing. love the content
@avb_fj
@avb_fj 10 months ago
Thanks!😊
@js116600
@js116600 16 hours ago
The part about multi-head attention is a bit blurry. How do you enforce that the first head captures word order and the second head captures word meaning, for example? How do you know they are not learning the same thing?
@avb_fj
@avb_fj 9 hours ago
So… you don't know, and you can't enforce that. You just set up the architecture like that and train on massive amounts of data. The network learns via gradient descent, and the weights can converge to anything that reduces the loss. Empirically, it has been shown that the different attention heads learn orthogonal behaviors about the dataset. The example of word order/meaning/grammar was mentioned for building intuition, but you are correct - there is no external enforcement.
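A toy sketch of that point (the head count and sizes below are arbitrary): two attention heads with separate Q/K/V projections, where nothing in the code forces one head to track word order and another word meaning; whatever each head ends up representing emerges from training on the data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads = 16, 2
d_head = d_model // n_heads
x = torch.randn(5, d_model)                       # 5 input items

heads = []
for _ in range(n_heads):                          # each head gets its own projections
    W_q, W_k, W_v = (nn.Linear(d_model, d_head) for _ in range(3))
    Q, K, V = W_q(x), W_k(x), W_v(x)
    heads.append(F.softmax(Q @ K.T / d_head ** 0.5, dim=-1) @ V)

out = torch.cat(heads, dim=-1)                    # concatenate the per-head outputs
print(out.shape)                                  # torch.Size([5, 16])
```

Standard implementations also pass the concatenated output through one more learned linear layer to mix the heads.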
@venkateshbs1384
@venkateshbs1384 10 months ago
Very Clear Explanation. Thanks for that.
@sahhaf1234
@sahhaf1234 1 year ago
I watched the whole series and it is a real gem... But something is missing: where are the nonlinearity and the weights and biases? What do we train?
@avb_fj
@avb_fj 1 year ago
The weights, biases, and non-linearity come from the Query, Key, and Value neural networks. These convert the input embeddings into the query, key, and value embeddings respectively - which then go through the attention computation. We can also add feed-forward layers after the attention layer to add further transformations/non-linearity. Other weights we train can be the initial embeddings of the input collection. Look up word embeddings, for example, which train a special embedding vector for each word in the vocabulary. There can also be separate neural networks to embed each input type. For example, suppose you are trying to learn attention between a bunch of images and a sentence. The images can have their own image-encoding neural network, and the sentence can have a text-encoding neural network. All of these nets have their own weights and biases according to whatever the end goal is. Once the forward pass is defined, we compute the loss between the network's prediction and the target. Through backpropagation, all learnable parameters then get updated.
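As a small illustration of where that extra non-linearity can live (layer sizes below are arbitrary), a feed-forward block with a ReLU applied after the attention output adds trainable weights, biases, and a nonlinear transformation:

```python
import torch
import torch.nn as nn

d_model = 16
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # learnable weights and biases
    nn.ReLU(),                         # the non-linearity
    nn.Linear(4 * d_model, d_model),
)

attended = torch.randn(5, d_model)     # stand-in for the attention layer's output
print(ffn(attended).shape)             # torch.Size([5, 16])
```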
@noclaf78
@noclaf78 7 months ago
Is there a link to your contrastive learning video?
@avb_fj
@avb_fj 7 months ago
Check out the first quarter of this video: Multimodal AI from First Principles - Neural Nets that can see, hear, AND write. kzbin.info/www/bejne/Y53PnICmg61kbJI
@serta5727
@serta5727 1 year ago
Very good explanation, thanks 😄
@sahhaf1234
@sahhaf1234 1 year ago
Maybe an unimportant point, but at @2:00 the vector Q looks like a column vector... I think it should be a row vector.
@avb_fj
@avb_fj 1 year ago
Yeah, a row vector would be more accurate for the QK^T stuff that happens later. Thanks for pointing that out.
@w花b
@w花b 1 month ago
I was so confused about the dimensions that I was questioning whether I had learned matrices right lol. I knew something didn't match. But in NumPy they're columns by default, so we would also need the transpose of Q as well.
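A quick shape check of the row-vs-column point (toy numbers): with the query written as a 1 x d row vector, q @ K.T gives one score per key with no extra transpose of q needed.

```python
import numpy as np

d = 4
q = np.random.randn(1, d)      # query as a row vector, shape (1, d)
K = np.random.randn(3, d)      # 3 keys stacked as rows, shape (3, d)

scores = q @ K.T               # shape (1, 3): one attention score per key
print(scores.shape)
```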
@landerosedgard
@landerosedgard 1 year ago
great explanation!
@avb_fj
@avb_fj 1 year ago
Thanks!!
@repairstudio4940
@repairstudio4940 1 year ago
Is that a One Piece shirt?! Love it man! 🎉 Also great content as always.
@avb_fj
@avb_fj 1 year ago
Haha thanks!🙏🏽
@sahhaf1234
@sahhaf1234 9 months ago
I think the embeddings must be normalized for the dot product to make sense.
@ahnafsamin3777
@ahnafsamin3777 10 months ago
Good videos, but a bit fast.