That's not the main reason. An RNN keeps folding each new embedding into a single hidden state and hence overwrites information that came before, whereas in a transformer the embeddings of all tokens remain available the whole time and attention can pick out the ones that are important.
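(A minimal sketch of that contrast, in PyTorch with made-up sizes and weights, no projections or training, just to show where the information lives:)

```python
import torch

seq_len, d = 5, 8
x = torch.randn(seq_len, d)                 # token embeddings (toy values)

# RNN-style: the whole history is squeezed into one vector h,
# so each step partially overwrites what came before.
W_h, W_x = torch.randn(d, d), torch.randn(d, d)
h = torch.zeros(d)
for t in range(seq_len):
    h = torch.tanh(W_h @ h + W_x @ x[t])

# Attention-style: all token embeddings stay intact, and every position
# computes weights over them to pick out whichever ones matter.
scores = (x @ x.T) / d ** 0.5               # (seq_len, seq_len) relevance scores
weights = torch.softmax(scores, dim=-1)
context = weights @ x                       # each row is a weighted mix of all tokens
```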
@NoahElRhandour a year ago
That was a great video! I find learning about such things generally easier and more interesting if they are compared to other models/ideas that are similar but not identical.
@CodeEmporium a year ago
Thank you for the kind words. And yep, agreed 👍🏽
@NoahElRhandour a year ago
@@CodeEmporium I guess just like CLIP our brains perform contrastive learning as well xd
@brianprzezdziecki a year ago
YouTube, recommend me more videos like this plz
@schillaci5590 a year ago
This answered a question I didn't have. Thanks!
@CodeEmporium a year ago
Always glad to help when not needed!
@GregHogg a year ago
This is a great video!!
@CodeEmporium a year ago
Thanks a lot Greg. I try :)
@IgorAherne a year ago
I think LSTMs are more tuned towards keeping the order, because although transformers can assemble embeddings from various tokens, they don't inherently know what follows what in a sentence. But perhaps with relative positional encoding they are equipped just about enough to understand the order of sequential input.
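(For reference, this is roughly how the original absolute sinusoidal positional encoding injects order before attention runs; relative schemes, as mentioned above, encode offsets between positions instead. Toy sizes, illustrative only.)

```python
import torch

seq_len, d = 6, 8
tok = torch.randn(seq_len, d)                                    # token embeddings (toy values)

# Sinusoidal positional encoding, added to the embeddings so attention can
# tell position 0 from position 5 even though there is no recurrence.
pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
i = torch.arange(0, d, 2, dtype=torch.float32)                   # (d/2,)
angle = pos / (10000 ** (i / d))                                 # (seq_len, d/2)
pe = torch.zeros(seq_len, d)
pe[:, 0::2] = torch.sin(angle)
pe[:, 1::2] = torch.cos(angle)

x = tok + pe   # "what the token is" and "where it sits" both go into attention
```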
@evanshlom1 10 months ago
Your comment came right before GPT blew up, so maybe you wouldn't say this anymore?
@borregoayudando1481 a year ago
I would like to have a more skeleton-up or foundation-up understanding (to better understand the top down representation of the transformer). Where should I start, linear algebra?
@sandraviknander7898 6 months ago
An important caveat is that decoder transformers like the GPT models are trained autoregressively, with no context from the words coming after.
@sreedharsn-xw9yi 6 months ago
Yeah, its masked multi-head attention only attends left-to-right, right?
@free_thinker4958 4 months ago
@@sreedharsn-xw9yi Yes, that's decoder-only transformers such as GPT-3.5, for example, and any text generation model.
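(A small sketch of that left-to-right "causal" mask, assuming toy sizes and plain dot-product self-attention with no projections:)

```python
import torch

seq_len, d = 4, 8
x = torch.randn(seq_len, d)                                  # token embeddings (toy values)

scores = (x @ x.T) / d ** 0.5                                # raw attention scores
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()     # lower-triangular "look left" mask
scores = scores.masked_fill(~causal, float("-inf"))          # future positions are blocked out
weights = torch.softmax(scores, dim=-1)                      # row t only mixes tokens 0..t
context = weights @ x
```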
@kenichisegawa3981 6 months ago
This is the best explanation of RNN vs Transformer I've ever seen. Is there a similar video like this for self-attention by any chance? Thank you
@CodeEmporium 6 months ago
Thanks so much for the kind words. There is a full video on self-attention on the channel. Check out the first video in the playlist "Transformers from scratch".
@vtrandal a year ago
Fantastic!
@CodeEmporium a year ago
Thanks so much again :)
@jackrayner1263 6 months ago
Does a decoder model share these same advantages? Without the attention mapping, wouldn't it be operating with the same context as an RNN?
@aron2922 a year ago
You should have put LSTMs as a middle step
@CodeEmporium a year ago
Good call. I just bundled them with Recurrent Neural Networks here
@UnderstandingCode a year ago
Ty
@alfredwindslow1894 5 months ago
Don't transformer models generate one token at a time? It's just that they're faster because the calculations can be done in parallel.
@nomecriativo8433 5 months ago
Transformers aren't only used for text generation. But in the case of text generation, the model internally predicts the next token for every token in the sentence. E.g., it is trained so that the input "This is an example phrase" produces the shifted target "is an example phrase", i.e. each position predicts the token that follows it, so the training requires a single step. Text generation models also have a causal mask: tokens can only attend to the tokens that come before them, so the network doesn't cheat during training. During inference, only one token is generated at a time, indeed. If I'm not mistaken, there's an optimization (the KV cache) to avoid recalculating the previously computed tokens.
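(A toy illustration of that single-step, shift-by-one training objective in PyTorch; the masked transformer body is replaced by a plain embedding so the shapes stay readable, and the vocabulary and sizes are made up:)

```python
import torch
import torch.nn.functional as F

vocab, d, seq_len = 100, 16, 5
tokens = torch.randint(0, vocab, (seq_len,))   # stand-in for "This is an example phrase"

emb = torch.nn.Embedding(vocab, d)
lm_head = torch.nn.Linear(d, vocab)

hidden = emb(tokens)                           # placeholder for the masked transformer layers
logits = lm_head(hidden)                       # (seq_len, vocab): one prediction per position

# Position t is trained to predict token t+1, so the whole sentence is one training step.
loss = F.cross_entropy(logits[:-1], tokens[1:])
```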
@ccreutzig 4 months ago
Not all transformers use a causal mask. Encoder models like BERT usually don't - it would break the usefulness of the [CLS] token, for starters.
@free_thinker4958 4 months ago
The main reason is that RNNs suffer from what we call the exploding and vanishing gradient problem.
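(A quick numerical sketch of why that happens: backpropagation through time multiplies the gradient by roughly the recurrent weight matrix once per time step, so it shrinks or blows up geometrically. Toy, illustrative numbers only.)

```python
import torch

d, steps = 8, 50
for scale in (0.5, 1.5):                  # contractive vs. expansive recurrent weights
    W = torch.eye(d) * scale
    grad = torch.ones(d)
    for _ in range(steps):
        grad = W.T @ grad                 # one multiplication per time step of backprop
    print(scale, grad.norm().item())      # ~1e-15 (vanishing) vs ~1e9 (exploding)
```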
@sreedharsn-xw9yi 6 months ago
How can we relate this to the masked multi-head attention concept in transformers? This video seems to conflict with that. Any expert ideas here, please?
@vastabyss6496 7 months ago
What if you wanted to train a network to take a sequence of images (like in a video) and generate what comes next? Wouldn't that be a case where RNNs and their variants like LSTMs and GRUs are better, since each image is most closely related to the images coming directly before and after it?
@-p2349 7 months ago
This is done by GANs, or generative adversarial networks. These have two CNNs: one a "discriminator" network and the other a "generator" network.
@vastabyss6496 7 months ago
@@-p2349 I thought that GANs could only generate an image that was similar to those in the dataset (such as a dataset containing faces). Also, how would a GAN deal with the sequential nature of videos?
@ccreutzig 4 months ago
There is ViT (Vision Transformer), although that predicts parts of an image, and I've seen at least one example of ViT feeding into a Longformer network for video input. But I have no experience using it. GANs are not the answer to what I read in your question.
@wissalmasmoudi3780 9 months ago
I need your help with my NARX neural network, please.
@jugsma6676 a year ago
Can you do a Fourier Transform replacing the attention head?
@iro4201 a year ago
What?
@user-vm7we6bm7x a year ago
Fourier Transform?
@TheScott10012 10 months ago
I respect the craft! Also, pick up a pop filter
@CodeEmporium 9 months ago
I have p-p-p-predilection for p-p-plosives
@manikantabandla3923 a year ago
But there is also a version of RNN with attention.
@gpt-jcommentbot4759 11 months ago
These RNNs are still worse than Transformers. However, there have been Transformer + LSTM combinations. Such networks have the theoretical potential to support extremely long-lived chatbots, with context far beyond 4000 tokens, due to their recurrent nature.
@kvlnnguyieb9522 5 months ago
How about the new SSM in Mamba? Mamba is said to be better than the transformer.
@cate9541 9 months ago
cool
@CodeEmporium 9 months ago
Many thanks :)
@Laszer271 10 months ago
What I'm wondering is: why do all APIs charge you credits for input tokens for transformers? To me, it shouldn't make a difference whether a transformer takes 20 tokens or 1000 as input (as long as it's within its maximum context length). Isn't it the case that the transformer always pads the input to its maximum context length anyway?
@ccreutzig 4 months ago
No, the attention layers usually take a padding mask into account and can use smaller matrices. It just makes the implementation a bit more involved. The actual cost should be roughly quadratic in your input size, but that's probably not something the marketing department would accept.
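(A sketch of both points, the padding mask plus the roughly quadratic cost, assuming toy sizes and unprojected dot-product attention:)

```python
import torch

max_len, real_len, d = 10, 4, 8
x = torch.randn(max_len, d)                        # batch-padded input (toy values)
pad_mask = torch.arange(max_len) < real_len        # True for real tokens, False for padding

scores = (x @ x.T) / d ** 0.5
scores = scores.masked_fill(~pad_mask[None, :], float("-inf"))  # padded columns get zero weight
weights = torch.softmax(scores, dim=-1)

# The score matrix is seq_len x seq_len, so attention work grows roughly
# quadratically with the real sequence length:
print(20 ** 2, 1000 ** 2)   # 400 vs 1,000,000 score entries for 20- vs 1000-token prompts
```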
@Userforeverneverever a month ago
For the algo
@AshKetchumAllNow 5 months ago
No model understands
@FluffyAnvil 4 months ago
This video is 90% wrong…
@ccreutzig 4 months ago
But presented confidently and getting praise. Reminds me of ChatGPT. 😂
@cxsey8587 11 months ago
Do LSTMs have any advantage over transformers?
@gpt-jcommentbot4759 11 months ago
They work better with less text data, and they also work better as decoders. While LSTMs don't have many advantages, future iterations of RNNs could learn far longer-term dependencies than Transformers. I think LSTMs are more biologically accurate than Transformers, since they incorporate time and are not layered like conventional networks but are instead theoretically capable of simple topological structures. However, there have been "recurrent Transformers", which are basically Long Short-Term Memory + Transformers: the architecture is literally a transformer layer turned into a recurrent cell, along with gates inspired by the LSTM.
@sijoguntayo2282 10 months ago
Great video! In addition to this, RNNs, due to their sequential nature, are unable to take advantage of transfer learning. Transformers do not have this limitation.
@iro4201 a year ago
I understand and do not understand.
@keithwhite2986 3 months ago
Quantum learning, hopefully with an increasing probability towards understanding.