That's not the main reason. An RNN keeps folding each new embedding into a single hidden state and hence overwrites information that came before, whereas in a transformer the embeddings of all tokens remain available the whole time and attention can pick out the ones that are important.
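(A minimal sketch of that contrast, in PyTorch with made-up sizes and weights, no projections or training, just to show where the information lives:)

```python
import torch

seq_len, d = 5, 8
x = torch.randn(seq_len, d)                 # token embeddings (toy values)

# RNN-style: the whole history is squeezed into one vector h,
# so each step partially overwrites what came before.
W_h, W_x = torch.randn(d, d), torch.randn(d, d)
h = torch.zeros(d)
for t in range(seq_len):
    h = torch.tanh(W_h @ h + W_x @ x[t])

# Attention-style: all token embeddings stay intact, and every position
# computes weights over them to pick out whichever ones matter.
scores = (x @ x.T) / d ** 0.5               # (seq_len, seq_len) relevance scores
weights = torch.softmax(scores, dim=-1)
context = weights @ x                       # each row is a weighted mix of all tokens
```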
@NoahElRhandour a year ago
That was a great video! I find learning about such things generally easier and more interesting if they are compared to other models/ideas that are similar but not identical.
@CodeEmporium a year ago
Thank you for the kind words. And yep, agreed 👍🏽
@NoahElRhandour a year ago
@@CodeEmporium I guess just like CLIP our brains perform contrastive learning as well xd
@brianprzezdziecki a year ago
YouTube, recommend me more videos like this plz
@schillaci5590 a year ago
This answered a question I didn't have. Thanks!
@CodeEmporium a year ago
Always glad to help when not needed!
@GregHogg a year ago
This is a great video!!
@CodeEmporium a year ago
Thanks a lot Greg. I try :)
@IgorAherne a year ago
I think LSTMs are more tuned towards keeping the order, because although transformers can assemble embeddings from various tokens, they don't inherently know what follows what in a sentence. But perhaps with relative positional encoding they are equipped just about enough to understand the order of sequential input.
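(For reference, this is roughly how the original absolute sinusoidal positional encoding injects order before attention runs; relative schemes, as mentioned above, encode offsets between positions instead. Toy sizes, illustrative only.)

```python
import torch

seq_len, d = 6, 8
tok = torch.randn(seq_len, d)                                    # token embeddings (toy values)

# Sinusoidal positional encoding, added to the embeddings so attention can
# tell position 0 from position 5 even though there is no recurrence.
pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
i = torch.arange(0, d, 2, dtype=torch.float32)                   # (d/2,)
angle = pos / (10000 ** (i / d))                                 # (seq_len, d/2)
pe = torch.zeros(seq_len, d)
pe[:, 0::2] = torch.sin(angle)
pe[:, 1::2] = torch.cos(angle)

x = tok + pe   # "what the token is" and "where it sits" both go into attention
```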
@evanshlom1 10 months ago
Your comment came right before GPT blew up, so maybe you wouldn't say this anymore?
@borregoayudando1481 a year ago
I would like to have a more skeleton-up or foundation-up understanding (to better understand the top down representation of the transformer). Where should I start, linear algebra?
@sandraviknander7898 6 months ago
An important caveat is that decoder transformers like the GPT models are trained autoregressively, with no context from the words coming after.
@sreedharsn-xw9yi 6 months ago
Yeah, its masked multi-head attention only attends left-to-right, right?
@free_thinker4958 4 months ago
@@sreedharsn-xw9yi Yes, that's decoder-only transformers such as GPT-3.5, for example, and any text generation model.
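(A small sketch of that left-to-right "causal" mask, assuming toy sizes and plain dot-product self-attention with no projections:)

```python
import torch

seq_len, d = 4, 8
x = torch.randn(seq_len, d)                                  # token embeddings (toy values)

scores = (x @ x.T) / d ** 0.5                                # raw attention scores
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()     # lower-triangular "look left" mask
scores = scores.masked_fill(~causal, float("-inf"))          # future positions are blocked out
weights = torch.softmax(scores, dim=-1)                      # row t only mixes tokens 0..t
context = weights @ x
```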
@kenichisegawa3981 6 months ago
This is the best explanation of RNN vs Transformer I've ever seen. Is there a similar video like this for self-attention by any chance? Thank you
@CodeEmporium 6 months ago
Thanks so much for the kind words. There is a full video on self-attention on the channel. Check out the first video in the playlist "Transformers from scratch".
@vtrandal a year ago
Fantastic!
@CodeEmporium a year ago
Thanks so much again :)
@jackrayner1263 6 months ago
Does a decoder model share these same advantages? Without the attention mapping, wouldn't it be operating with the same context as an RNN?
@aron2922 a year ago
You should have put LSTMs as a middle step
@CodeEmporium a year ago
Good call. I just bundled them with Recurrent Neural Networks here
@UnderstandingCode a year ago
Ty
@alfredwindslow1894 5 months ago
Don't transformer models generate one token at a time? It's just that they're faster because the calculations can be done in parallel.
@nomecriativo8433 5 months ago
Transformers aren't only used for text generation. But in the case of text generation, the model internally predicts the next token for every token in the sentence. E.g., it is trained so that the input "This is an example phrase" produces the shifted target "is an example phrase", i.e. each position predicts the token that follows it, so the training requires a single step. Text generation models also have a causal mask: tokens can only attend to the tokens that come before them, so the network doesn't cheat during training. During inference, only one token is generated at a time, indeed. If I'm not mistaken, there's an optimization (the KV cache) to avoid recalculating the previously computed tokens.
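(A toy illustration of that single-step, shift-by-one training objective in PyTorch; the masked transformer body is replaced by a plain embedding so the shapes stay readable, and the vocabulary and sizes are made up:)

```python
import torch
import torch.nn.functional as F

vocab, d, seq_len = 100, 16, 5
tokens = torch.randint(0, vocab, (seq_len,))   # stand-in for "This is an example phrase"

emb = torch.nn.Embedding(vocab, d)
lm_head = torch.nn.Linear(d, vocab)

hidden = emb(tokens)                           # placeholder for the masked transformer layers
logits = lm_head(hidden)                       # (seq_len, vocab): one prediction per position

# Position t is trained to predict token t+1, so the whole sentence is one training step.
loss = F.cross_entropy(logits[:-1], tokens[1:])
```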
@ccreutzig 4 months ago
Not all transformers use a causal mask. Encoder models like BERT usually don't - it would break the usefulness of the [CLS] token, for starters.
@free_thinker4958 4 months ago
The main reason is that RNNs suffer from what we call the exploding and vanishing gradient problem.
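(A quick numerical sketch of why that happens: backpropagation through time multiplies the gradient by roughly the recurrent weight matrix once per time step, so it shrinks or blows up geometrically. Toy, illustrative numbers only.)

```python
import torch

d, steps = 8, 50
for scale in (0.5, 1.5):                  # contractive vs. expansive recurrent weights
    W = torch.eye(d) * scale
    grad = torch.ones(d)
    for _ in range(steps):
        grad = W.T @ grad                 # one multiplication per time step of backprop
    print(scale, grad.norm().item())      # ~1e-15 (vanishing) vs ~1e9 (exploding)
```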
@sreedharsn-xw9yi 6 months ago
How can we relate this to the masked multi-head attention concept in transformers? This video seems to conflict with that. Any expert ideas here, please?
@vastabyss6496 7 months ago
What if you wanted to train a network to take a sequence of images (like in a video) and generate what comes next? Wouldn't that be a case where RNNs and their variants like LSTMs and GRUs are better, since each image is most closely related to the images coming directly before and after it?
@-p2349 7 months ago
This is done by GANs, or generative adversarial networks. These have two CNNs: one a "discriminator" network and the other a "generator" network.
@vastabyss6496 7 months ago
@@-p2349 I thought that GANs could only generate an image that was similar to those in the dataset (such as a dataset containing faces). Also, how would a GAN deal with the sequential nature of videos?
@ccreutzig 4 months ago
There is ViT (Vision Transformer), although that predicts parts of an image, and I've seen at least one example of ViT feeding into a Longformer network for video input. But I have no experience using it. GANs are not the answer to what I read in your question.
@wissalmasmoudi3780 9 months ago
I need your help with my NARX neural network, please.
@jugsma6676 a year ago
Can you do a Fourier Transform replacing the attention head?
@iro4201 a year ago
What?
@user-vm7we6bm7x a year ago
Fourier Transform?
@TheScott10012 10 months ago
I respect the craft! Also, pick up a pop filter
@CodeEmporium 9 months ago
I have p-p-p-predilection for p-p-plosives
@manikantabandla3923 a year ago
But there is also a version of RNN with attention.
@gpt-jcommentbot4759 11 months ago
These RNNs are still worse than Transformers. However, there have been Transformer + LSTM combinations. Such networks have the theoretical potential to support extremely long-lived chatbots, with context far beyond 4000 tokens, due to their recurrent nature.
@kvlnnguyieb9522 5 months ago
How about the new SSM in Mamba? Mamba is said to be better than the transformer.
@cate9541 9 months ago
cool
@CodeEmporium 9 months ago
Many thanks :)
@Laszer271 10 months ago
What I'm wondering is: why do all APIs charge you credits for input tokens for transformers? To me, it shouldn't make a difference whether a transformer takes 20 tokens or 1000 as input (as long as it's within its maximum context length). Isn't it the case that the transformer always pads the input to its maximum context length anyway?
@ccreutzig 4 months ago
No, the attention layers usually take a padding mask into account and can use smaller matrices. It just makes the implementation a bit more involved. The actual cost should be roughly quadratic in your input size, but that's probably not something the marketing department would accept.
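(A sketch of both points, the padding mask plus the roughly quadratic cost, assuming toy sizes and unprojected dot-product attention:)

```python
import torch

max_len, real_len, d = 10, 4, 8
x = torch.randn(max_len, d)                        # batch-padded input (toy values)
pad_mask = torch.arange(max_len) < real_len        # True for real tokens, False for padding

scores = (x @ x.T) / d ** 0.5
scores = scores.masked_fill(~pad_mask[None, :], float("-inf"))  # padded columns get zero weight
weights = torch.softmax(scores, dim=-1)

# The score matrix is seq_len x seq_len, so attention work grows roughly
# quadratically with the real sequence length:
print(20 ** 2, 1000 ** 2)   # 400 vs 1,000,000 score entries for 20- vs 1000-token prompts
```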
@Userforeverneverever a month ago
For the algo
@AshKetchumAllNow 5 months ago
No model understands
@FluffyAnvil 4 months ago
This video is 90% wrong…
@ccreutzig 4 months ago
But presented confidently and getting praise. Reminds me of ChatGPT. 😂
@cxsey8587 11 months ago
Do LSTMs have any advantage over transformers?
@gpt-jcommentbot4759 11 months ago
They work better with less text data, and they also work better as decoders. While LSTMs don't have many advantages, future iterations of RNNs could learn far longer-term dependencies than Transformers. I think LSTMs are more biologically accurate than Transformers, since they incorporate time and are not layered like conventional networks but are instead theoretically capable of simple topological structures. However, there have been "recurrent Transformers", which are basically Long Short-Term Memory + Transformers: the architecture is literally a transformer layer turned into a recurrent cell, along with gates inspired by the LSTM.
@sijoguntayo2282 10 months ago
Great video! In addition to this, RNNs, due to their sequential nature, are unable to take advantage of transfer learning. Transformers do not have this limitation.
@iro4201 a year ago
I understand and do not understand.
@keithwhite2986 3 months ago
Quantum learning, hopefully with an increasing probability towards understanding.