Fully Connected (June 7th in SF) Promo Link: www.fullyconnected.com/?promo=ynnc

OUTLINE:
0:00 - Introduction
1:50 - Fully Connected In-Person Conference in SF June 7th
3:00 - Transformers vs RNNs
8:00 - RWKV: Best of both worlds
12:30 - LSTMs
17:15 - Evolution of RWKV's Linear Attention
30:40 - RWKV's Layer Structure
49:15 - Time-Parallel vs Sequence Mode
53:55 - Experimental Results & Limitations
58:00 - Visualizations
1:01:40 - Conclusion

Paper: arxiv.org/abs/2305.13048
@xxdaggerxx5 Жыл бұрын
stop with these 1hr videos and summarize this shit
@akashkarnatak6581 Жыл бұрын
@@xxdaggerxx5 This is for people who want to understand the paper in depth. If you want a summary, read the abstract.
@Fanney3 Жыл бұрын
Can't believe someone is just doing this work and sharing it. Amazing.
@MindFactoryAI Жыл бұрын
Always impressed how you record these in a single take. Great explanation, thanks!
@mgostIH Жыл бұрын
Keep in mind that at 21:00, regarding memory usage of attention, current approaches like "FlashAttention" and "Attention doesn't need O(N^2) memory" have reduced drastically the memory needed for transformers to run, which is what allows approaches like ChatGPT to have such a long context.
@erickmacias5153 Жыл бұрын
But attention in GPT does use N^2 memory doesn't it?
@mgostIH Жыл бұрын
@@erickmacias5153 in older public models like GPT 2, yes, but the papers I wrote above provide implementations that are mathematically equivalent to the standard way of doing attention, you can use them as drop in replacements and get improved performance during training and inference.
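For anyone who wants to see the trick in code, here is a minimal NumPy sketch of the online-softmax / chunking idea behind those papers. It is a didactic sketch, not the actual FlashAttention kernel (which also tiles the queries and fuses everything into one GPU kernel); the function name and chunk size are just illustrative.

```python
import numpy as np

def chunked_attention(q, k, v, chunk=128):
    # q, k, v: (N, d). Keys/values are processed in chunks with a running softmax,
    # so the full (N, N) score matrix is never materialized: O(N) extra memory,
    # still O(N^2) time. The output matches softmax(q k^T / sqrt(d)) @ v.
    n, d = q.shape
    out = np.zeros_like(v, dtype=np.float64)
    m = np.full(n, -np.inf)   # running max of scores per query (for numerical stability)
    l = np.zeros(n)           # running softmax denominator per query
    for start in range(0, k.shape[0], chunk):
        kc, vc = k[start:start + chunk], v[start:start + chunk]
        s = q @ kc.T / np.sqrt(d)                  # (N, chunk) partial scores
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)                  # rescale previous accumulators
        out = out * scale[:, None] + p @ vc
        l = l * scale + p.sum(axis=1)
        m = m_new
    return out / l[:, None]
```

Causal masking is omitted for brevity; only an (N, chunk) slice of the score matrix exists at any point in time.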
@TTTrouble Жыл бұрын
Jesus, keeping up with the literature in this field for those of you that actually work in it must be absolutely exhausting.
@arshzahed1970 Жыл бұрын
In the past year, my backlog of papers to go through has grown exponentially. Just staying up to date is a full time job now
@victoraranda3349 Жыл бұрын
Sometimes it do be like that
@mlopolis Жыл бұрын
You should use LLMs to get the most important points from each paper and then you can stay on top 😊
@Will-kt5jk Жыл бұрын
@@mlopolis at a human-machine system level, that sounds like a self-improving augmentation. Maybe it helps explain the exponential increase in papers…😅
@raynhardtvanzyl4729 Жыл бұрын
Yup...
@YvesQuemener Жыл бұрын
Can't say thank you enough! Diving into RWKV has been on my todo list for two months at least, and when I saw the YouTube alert I immediately felt relieved: instead of a full day of trying to understand the paper and the code, you would provide the important parts in one hour. And it delivered! I agree that it is kind of stretching the definition of attention to call what they are doing "linear attention". I am not sure that calling it a ConvNet is actually less stretchy btw :-) But anyway, thanks a lot!
@jondo7680 Жыл бұрын
Just to give feedback, that example with "I'm the word cat" was just great. It helped me check whether I understood you right or not.
@sheevys Жыл бұрын
Haha, that's a quick reaction, your "all you need" pun was defo not intended.
@YannicKilcher It would be interesting to hear your take on hyperdimensional computing / vector symbolic architectures :-) It seems like a really cool idea, though I can't quite wrap my head around (or maybe wasn't able to find a clear explanation of) how it's actually supposed to interface with non-symbolic inputs (e.g. images) or learn complex structured concepts from data.
@justfoundit Жыл бұрын
If we revisit ideas, maybe we could try shared-weight transformers. It worked for CNNs. Minor memory footprint for the model, and it's easy to get the effect of hundreds of billions of parameters by just repeating the same layer multiple times.
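A minimal PyTorch sketch of this weight-sharing idea (in the spirit of Universal Transformers / ALBERT, which tie weights across layers); the class name and sizes below are arbitrary placeholders, not anything from the paper.

```python
import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    """One encoder layer reused N times: effective depth grows while the
    parameter count (and memory footprint) stays that of a single layer."""
    def __init__(self, d_model=512, n_heads=8, n_repeats=24):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_repeats = n_repeats

    def forward(self, x):            # x: (batch, seq, d_model)
        for _ in range(self.n_repeats):
            x = self.layer(x)        # same weights at every "depth"
        return x
```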
@killers31337 Жыл бұрын
If recalling information in long contexts is the problem, perhaps throwing in a few transformer layers would solve it? E.g. something like language parsing can be done using just an RNN, as the information is largely local. So if you have 20 layers in total, layers 1..10 would be RNNs, then layer 11 is a transformer, then 12..20 are RNNs again. Then the "quadratic" part is only 1/20th of the network. Yes, it would route only 1/20th of the information a full transformer would, but if only a few important pieces of the context are necessary, that might be enough.
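A rough PyTorch sketch of that layout, with a GRU standing in for any linear-time block (RWKV, LSTM, ...); the hyperparameters are placeholders and causal masking is omitted for brevity.

```python
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """Stand-in for a linear-time block (RWKV / LSTM / GRU style)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):                     # x: (batch, seq, d_model)
        out, _ = self.rnn(self.norm(x))
        return x + out                        # residual connection

class HybridStack(nn.Module):
    """20 blocks where only block index 10 (the 11th layer) is self-attention,
    so just 1/20 of the stack pays the quadratic cost."""
    def __init__(self, d_model=512, n_heads=8, n_layers=20, attn_at=(10,)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            if i in attn_at else RecurrentBlock(d_model)
            for i in range(n_layers)
        ])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```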
@johnnypeck Жыл бұрын
This is awesome. Seen the use of RNNs percolating on Twitter for a bit. Glad you're covering it. That is a lot of authors.
@hansdietrich1496 Жыл бұрын
The best in-depth AI channel out there, chapeau!
@edhofiko3168 Жыл бұрын
I unironically love this paper even though it absolutely lacks theoretical analysis. I've been following RWKV since before they made the paper. I would really love it if PyTorch would implement discounted cumulative sum, since this is exactly what the RWKV attention uses and it's what people in RL also use.
@alexeykrylov9995 Жыл бұрын
I agree that it'd be good to have it as a primitive. But as long as it's unavailable, it can be implemented in O(N log N) time (instead of O(N) if it were a primitive) by decomposing it into a convolution of several dilated exponential kernels (for example: 1st conv: dilation 1, kernel size 4, geometric progression factor k; 2nd: dilation 4, size 4, factor k^4; 3rd: dilation 16, size 4, factor k^16; etc.). It worked well in practice (I did this trick for a colleague's project once).
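A 1-D NumPy sketch of that trick, assuming a scalar discount factor k (the function name is made up). Each stage applies a causal kernel of size 4 at its own dilation, and composing the stages reproduces the weight k^(t-i) exactly for every covered lag.

```python
import numpy as np

def discounted_cumsum_dilated(x, k):
    # Computes y_t = sum_{i<=t} k^(t-i) * x_i by composing causal convolutions
    # with dilated geometric kernels of size 4 (dilations 1, 4, 16, ...).
    # O(N log N) work instead of the sequential O(N) scan, but fully parallelizable.
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    y = x.copy()
    dilation = 1
    while dilation < n:
        kernel_factor = k ** dilation          # per-tap geometric factor at this dilation
        new = y.copy()                         # tap 0 has weight 1
        for tap in range(1, 4):                # taps 1..3 of the size-4 kernel
            shift = tap * dilation
            if shift >= n:
                break
            new[shift:] += (kernel_factor ** tap) * y[:-shift]
        y = new
        dilation *= 4
    return y
```

Quick check: `discounted_cumsum_dilated([1, 1, 1], 0.5)` gives `[1, 1.5, 1.75]`, as expected.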
@andres_pq Жыл бұрын
Great to see you do paper explanations again!
@NeoShameMan Жыл бұрын
Based on my experiments you won't nn for long; the distribution is the same for input and output, and fine-tuning is just a skew of that distribution towards the fine-tuning corpus. Better, there is a very high probability that it won't be a black box for long and we can extract optimal entropy encoding, no more weird sparsity. I'm just waiting for a new hard drive to test more.
@halocemagnum8351 Жыл бұрын
Amazing explanation! Great video. I had been reading all of the RWKV posts on the r/MachineLearning subreddit but I don’t think I fully grasped it till this review.
@schwajj Жыл бұрын
Halfway through, but I have a question. Transformers have been applied outside the domain of language modeling (or even more generally, outside of sequence modeling), e.g. Vision Transformers. In building our intuition, Yannic talks in terms of how much RWKV pays attention to the past for each internal feature learned by the model. Does this imply that RWKV is more specialized to sequence modeling than classic Transformers? i.e. would RWKV *not* work well if you try to apply it to image-based input? Or is this an open question? Is there reason to lean one way or the other? (probably most people who would answer this already saw the video a month ago, but fingers crossed for an answer)
@clehaxze Жыл бұрын
No answer, but RWKV-4-neo supports image input by basically slapping a CLIP as input into one of its layers. This way it can use those representations as an understanding during a conversation.
@giuliavirgili16607 ай бұрын
LineRWKV
@antonioILbig Жыл бұрын
Yannic, good guess! Scalability could be the real deal. Deep architectures have different "lego blocks" (transformers, LSTMs, convs, residuals, ...). When you build a big model, the meaning of its pieces is lost. What stays is the computational efficiency, scalability and optimization behaviour.
@sortysciaofiscia Жыл бұрын
I have a question at the halfway mark of the video: if the importance of attention to tokens linearly decreases based on how far back it is, does that mean that by the end of the answer, it will forget what it started talking about? What stops this approach from repeating itself? I'm trying to wrap my head around: "The brown fox jumped over a lazy old dog, and then ...." In this example the next word will be computed based on the dog reference MORE than the fox one? I'd assume the transformers look at every other token in this sentence and compute, whereas from your explanation I gather that importance drops off the further back the token is. Right? Sorry, I'm new to this.
@zhenyuanzhang Жыл бұрын
Not really. Almost half of the channels in the middle-to-high layers do not decay at all (after training). The important information stored there could last forever, in theory. As long as the model is aware that this piece of information is important, it won't forget it easily.
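In terms of the paper's formula (as I read it), a token $i$ seen $(t-1-i)$ steps ago contributes to channel $c$ with weight $e^{-(t-1-i)\,w_c + k_{i,c}}$, so for a learned decay $w_c \approx 0$:

$$e^{-(t-1-i)\,w_c + k_{i,c}} \;\longrightarrow\; e^{k_{i,c}},$$

i.e. the weight no longer depends on how far back the token is, which is exactly the "no decay at all" behaviour described above.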
@itayatelis2898 Жыл бұрын
Amazing! Thank you for doing this! You're amazing! I hope you keep doing it weekly.
@spoonikle Жыл бұрын
We need to focus on more complex multi-step models. Humans take notes, humans ruminate, humans speak aloud. Multi-modality is key; tools built into the model to compensate for shortfalls are key. Design the model with a calculator, let it train on the tool; design the model with outputs hooked into dozens of tools and reward correct tool use; force the influence of tools on the output and train a model that no longer wastes time reinventing calculators. We trained a model to make paintings instead of making a model that calls Adobe APIs to paint - now that LLMs exist we have seen the light, we see the true power of AI… calculators are better left to the programmers.
@smnt Жыл бұрын
Hey Yannic, quick question: what do you mean when you say RNNs don't scale well, or that you might "just need models that scale"? What does a model scaling mean to you? I've definitely seen people stack RNNs and it seemingly works just fine. I thought the issue with RNNs was that they lose context pretty quickly even though their context length is "infinite". Thanks for the video as always, love it!
@Veptis7 ай бұрын
Google tried to train a 500B LSTM - so one of those "first" claims might be incorrect.
@serta5727 Жыл бұрын
General idea for transformers: evolutionary attention heads come to my mind. Instead of training multiple attention heads in a transformer, how about just having one that branches off, and the best evolved version gets merged into the original? That way, at inference time there is only one attention head, to save compute.
@AntoshaPushkin Жыл бұрын
Assume your task is to add numbers like 123456 + 987654 = ?. You will need at least 2 attention heads to attend to the two numbers. Not saying that you should use transformers to add up numbers, it's just a random example of a situation where it's clear that you need multiple attention heads.
@schwajj Жыл бұрын
@@AntoshaPushkin That doesn’t sound right to me: you’re essentially saying that a separate attention head would self-assign to each number. It’s not completely implausible, but I’d like to see some rigorous analysis that indicates that transformers have been observed to operate in that manner. Are you aware of such research? I’d be grateful for any pointers you could provide.
@ChaseFreedomMusician Жыл бұрын
THANK GOD! Somebody is finally talking about RWKV!
@mattanimation Жыл бұрын
was waiting for this one, thanks!
@erickmarin6147 Жыл бұрын
Balding king you dropped this 👑
@debanjandas7738 Жыл бұрын
In the AFT attention equation, the weight associated with token i for input token t is given by w(t,i)+k(i) => how do we add a scalar to a vector? Wouldn't it have been more appropriate to do w(t,i)*k(i)?
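For reference, the AFT-full equation (as I recall it from the Attention Free Transformer paper) is

$$Y_t = \sigma(Q_t) \odot \frac{\sum_{i=1}^{T} \exp(K_i + w_{t,i}) \odot V_i}{\sum_{i=1}^{T} \exp(K_i + w_{t,i})},$$

where $w_{t,i}$ is a learned scalar position bias that is broadcast-added to every component of the key vector $K_i$. Since it sits inside the exponential, adding in log-space amounts to multiplying the positive weights $e^{K_i}$ by a positional factor $e^{w_{t,i}}$, which seems to be why the paper adds rather than multiplies.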
@OperationDarkside Жыл бұрын
53:50 for 5s summary of the paper
@addoul99 Жыл бұрын
Hi, are the weights for the linear layer Wv tied between a pair of channel-mixing and time-mixing blocks?
@strawberryfield891 Жыл бұрын
Thank you very much for the great video!! Are channels somewhat equivalent to multi-heads in transformers?
@agsystems8220 Жыл бұрын
So it specifically chooses the representation of the internal state to be already decomposed into its eigenvectors with respect to time decay, meaning that we can infer relevance forward with a simple fixed matrix? That is pretty cool. I guess you could do something similar with any transformer where you have some natural definition of distance that can be precomputed. For the initial layers at least, they seem to be very interested in nearby features (both in language and images), so this definitely seems a natural specialisation/optimisation. If it is going to be doing something like this anyway, we might as well give it an architecture that does it well. Later layers don't seem to care about those features though, so this technique would cease to be valuable pretty fast, I think. For more abstract inferences, the order the pieces of information are fed in is not relevant, so the exponential term would tend to one and the whole system would collapse to being fully attention-free. You cannot build something able to make abstract inferences with a compact representation using this architecture. A nice piece of work, but a local optimisation rather than an improvement IMO.
@jackhe4336 Жыл бұрын
IMO, it's hard to extract a compact representation of the data without hurting expressiveness and generality of the representation. Could you recommend some papers that address this issue?
@schwajj Жыл бұрын
Great comment. A question: if you’re correct that this would work OK on early layers, but less well on later layers which deal with more abstract concepts, could you use a hybrid where some layers use RWKV and later layers use classic attention? I suppose that asymptotically it would still use O(n**2) space; this would only improve things by a constant factor (e.g. if only half of the layers use classic attention, the memory savings will asymptotically approach 50%). Do you see any value in such an approach?
@LostMekka Жыл бұрын
44:10 "so thats what they mean by states, if they say... they dont mean the united states, im sorry, they mean these values." this is such a wonderful feynman moment ^^ on a more serious note: i wonder if those approaches could be combined... like a standard transformer based model part that is really good short term and one like this that is really good long term that somehow complement each other? i think my best idea for that so far would be if you let the RWKV model "summarize" the long context and produce not a sequence of output tokens, but a sequence of internal representation values that act as a kind of compressed version of the complete context. then the transformer model could go to town with its superior capabilities but with a shorter context window and pick out which parts of the summary it wants to attend to. would that be feasible, or am i thinking out of my ass? :D
@rolfengstrand9838 Жыл бұрын
THANK YOU for pointing out that the use of the word "attention" in the context of transformers has strayed far away from the meaning of "attention" in other contexts. We have to accept that this is happening, of course. But it is important that introductury material explains clearly what "attention" is intended to mean here. It would be wrong to assume that a newbie reader has the same concept associated with the word "attention".
@JorgetePanete Жыл бұрын
introductory*
@hachembetrouni6731 Жыл бұрын
😅 Yannic makes LSTMs sound like prehistory
@Sciencehub-oq5go Жыл бұрын
Thankful for your work!
@CppExpedition Жыл бұрын
what's next? a transformer model of RNN subunits composed by a stack of transformer models designed as an RNN structure of transformers units aligned with a convolutional recurrent boltzman gate.
@guillaumevermeillesanchezm2427 Жыл бұрын
do it! do it!
@erickmarin6147 Жыл бұрын
Learning in a layer wise manner
@filoautomata Жыл бұрын
Quantum Multi Modal Transformer Model LSTM using Ensemble of Neutrosophic Logic based Attention Model for an Interpretable 'Human Extinction Capable' Military Grade Artificial General Intelligence.
@erickmarin6147 Жыл бұрын
With active dendrite modeling for multi-task approaches
@vivienseguy Жыл бұрын
Yes, all learned end-to-end
@oneman7094 Жыл бұрын
Can you do S4?
@cassandrasinclair8722 Жыл бұрын
transformers too are convnets ;) they do convolution over a graph :D Attention is just one instance of graph convolution.
@alles_moegliche73 Жыл бұрын
Can you also take a look at the Meta Megabyte Paper?
@noagarnett Жыл бұрын
Thanks Yannic for (another) great video! Really amazing that you do all this work and share it. Worth a lot for me and the likes of me. Also the paper discussed is very impressive. I might be wrong, but I think there is a confusion in the explanation. At 24:53 and 30:21, you say that the k_i modulation is defined by the current token ("if I am 'cat' I should probably look 3 tokens behind"), but if I understand correctly, it is defined by the referred token, and is the same for all following positions ("if I am 'cat' I should probably have a big influence on all the tokens following me"). Did I mix it up? Thanks!
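For reference, the time-mixing (wkv) formula from the paper, as I read it:

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i} + e^{u + k_t}}.$$

The boost $k_i$ is indexed by the attended-to token $i$ (the "referred" token) and stays the same for every later position, while only the distance term $(t-1-i)\,w$ depends on the current position $t$, which supports the reading in the comment above.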
@사이보그-i6p Жыл бұрын
17:35 (EDIT: and 20:50) That is just a matrix multiplication, therefore an inner product instead of an outer product, right?
@andres_pq Жыл бұрын
please make a video about RetNet :)
@toddnedd2138 Жыл бұрын
Perform badly, this approach will, if speak like Yoda you do. ;) Thank you for the detailed explanation of the paper and the afford you put into this.
@JorgetePanete Жыл бұрын
effort* show your proof of why
@thntk Жыл бұрын
Didn't Schmidhuber do this already in the 90s?
@djfl58mdlwqlf Жыл бұрын
hi, I am not convinced that the absence of non-linearity helps parallelization (45:00). The paper asserts that this is possible along two different dimensions (batch, time). Can anyone give me a brief explanation of this?
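One way to see it: each $wkv_t$ is just a ratio of weighted sums over the inputs (no nonlinearity between "steps"), so every position t, and every batch element, can be computed independently from k and v. Below is a deliberately naive NumPy sketch of this time-parallel view of the wkv formula quoted a few comments above; it uses O(T^2) memory purely for illustration, the function name is made up, and w is assumed to be the non-negative per-channel decay with u the current-token bonus.

```python
import numpy as np

def wkv_time_parallel(k, v, w, u):
    # k, v: (T, C); w, u: (C,) per-channel decay and current-token bonus.
    # Every output position is computed at once: the numerator and denominator
    # are plain sums over past tokens, so no sequential state update is needed.
    T, C = k.shape
    t = np.arange(T)
    past = (t[None, :] < t[:, None])[..., None]      # (T, T, 1): token i is in the past of t
    cur  = (t[None, :] == t[:, None])[..., None]     # (T, T, 1): i == t
    dist = (t[:, None] - 1 - t[None, :])[..., None]  # (T, T, 1): distance (t - 1 - i)
    logw = np.where(past, -dist * w, np.where(cur, u, -np.inf))   # (T, T, C)
    weights = np.exp(logw + k[None, :, :])           # (T, T, C); future tokens get weight 0
    num = np.einsum('tic,ic->tc', weights, v)        # sum over tokens i
    den = weights.sum(axis=1)
    return num / den                                 # wkv_t per channel
```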
@HD-Grand-Scheme-Unfolds Жыл бұрын
Just a very loose and wild thought that came to mind: what if "transformer architectures" could be employed as an imitation of "System 2" (more logical, critical and decisive), while "RWKV" is used for "System 1" (a bit fuzzy in accuracy, but capturing the essence of the life-long experience had by the AI agent, hence a derived ability to exercise intuition or instinct-like thinking and responses to situations)? Both could be combined in a pseudo-cognitive-architecture approach to tackle the AGI challenge. Wouldn't that be something to see 😄.
@clementdato6328 Жыл бұрын
Does it explain how the error signal is passed through time, or is it implicitly assumed that BPTT is used?
@panofilossas6564 Жыл бұрын
Looks like a good candidate for running on low-spec hardware
@banseoklee392 Жыл бұрын
Awesome!! Thank you sooooooooo much
@simonstrandgaard5503 Жыл бұрын
Great explanation
@howuhh8960 Жыл бұрын
I really don't like the very strong statements in the paper, such as "surpasses the capabilities of any existing RNN". lol, ZERO comparisons with other new RNNs based on S4 for example...
@iOhadRubin Жыл бұрын
There are no public open source S4 models of this size
@haraldtopfer5732 Жыл бұрын
53:41 my model has a linear scaling where everything else goes *Brrrrrrruummm* .... story of my life
@gunale92511 ай бұрын
I still don't get how the time-mixing can be trained in parallel. It must depend on the previous state.
@summer_tree382110 ай бұрын
Me too. Do you understand it now? 😘
@gunale92510 ай бұрын
@@summer_tree3821 Yep. During training the previous state is computed directly from the actual data (the full sequence is already available), so it doesn't have to be rolled out step by step.
@Will-kt5jk Жыл бұрын
1:02:05 - The Matrix looks different to what I remember from the movie
@hanskraut2018 Жыл бұрын
It should be trained on equations and text that are embedded in irrelevant numbers and text, where in the end the far-back equation/text is needed to calculate the final numeric/text result. That way a neural net would automatically learn to selectively pay attention based on the output.
@schwajj Жыл бұрын
That’s the whole trade-off here. The classic 2017 transformer model does what you say (to a certain extent). The model being discussed here is worse at the sort of task you’re proposing, but has the benefit of not using O(n**2) space.
@alyzst Жыл бұрын
How does it compare with Hyena?
@danplt Жыл бұрын
So many authors, from so many different institutions.
@marshallmcluhan33 Жыл бұрын
Neko institute of Science and The waifu research department are still on the top of the charts on hugging face. I'm not sure these archaic institutions are as mobile so they have to team up to stay relevant.
@TheThunderSpirit Жыл бұрын
need more institutions. can u give me?
@kimchi_taco Жыл бұрын
I'm not sure. It looks like a complicated MLP-Mixer.
@danylaley Жыл бұрын
This is just a fancy convolutional LSTM
@dreamphoenix Жыл бұрын
Thank you.
@anglikai9517 Жыл бұрын
Tested it today; too slow compared to Llama2 GGML. Hope the GGML version of RWKV becomes more user friendly.
@bertobertoberto242 Жыл бұрын
the convnet explanation reminds me a lot of WaveNet from DeepMind...
@almoni127 Жыл бұрын
Why do people still claim that transformers require memory that is quadratic in the sequence length when it was shown to be avoidable? (See the work on FlashAttention, for example.) It is still true, however, that they require quadratic time.
@schwajj Жыл бұрын
Flash attention is still quadratic in the sequence length (more precisely, context length). It just massively improves the constant factor via more efficient use of the GPU memory hierarchy.
@corgirun7892 Жыл бұрын
Amazing!
@chrisBruner Жыл бұрын
So a couple of thoughts. 1. For the intelligent prompt generation, you could just use a small transformer dedicated to that task. 2. Because of its parallel nature, you could have one of these things working on a bunch of Raspberry Pis, or.... a world-wide network of computers sharing the task. That would more than make up for the limitations compared to transformers. 3. It seems to me that these models get fuzzy in recall of "minutiae", but there is no reason you can't have them hooked together so the recall can occur by asking another set. Just some thoughts.
@akashkarnatak3014 Жыл бұрын
If you had uploaded this video 3 days ago, it would have helped me with my assignment as well. Anyway, great video.
@deeplerg7913 Жыл бұрын
I can't understand anything here but I'm sure it's something very interesting :P
@7200darkcharm Жыл бұрын
This abstract is summarizing a research paper that presents a new model architecture for natural language processing (NLP) tasks called Receptance Weighted Key Value (RWKV). Here's a breakdown of the abstract:

Problem with Transformers: Transformers are a type of model that have been very successful in NLP tasks. However, they have a major drawback: their memory and computational needs increase quadratically with the length of the sequences they process. This means that as the input data (like a sentence or document) gets longer, the resources needed to process it grow very quickly, which can make them impractical for very large datasets or very long sequences.

Problem with Recurrent Neural Networks (RNNs): RNNs, another type of model, have memory and computational needs that grow linearly with sequence length, which is more efficient than Transformers. However, they tend to perform worse on NLP tasks because they are harder to train in parallel (meaning, it's harder to split the work of training them across multiple machines or processors), and they don't scale as well (meaning, their performance doesn't improve as much when you add more data or make them bigger).

The Proposed Solution - Receptance Weighted Key Value (RWKV): The authors propose a new model, the RWKV, that aims to combine the best of both worlds. It can be trained in parallel like a Transformer, which makes it efficient to train, and it has linear memory and computational complexity like an RNN, which makes it efficient to use once trained. This is achieved by using a linear attention mechanism, which is a method for deciding which parts of the input data the model should pay most attention to.

Results: The authors scaled the RWKV model to tens of billions of parameters (which is a measure of the model's size and complexity) and found that it performs similarly to a Transformer of the same size. This suggests that it could be a useful alternative to Transformers for large-scale NLP tasks.

Conclusion: This work represents a significant step towards reconciling the trade-off between computational efficiency (how much computing resources a model needs) and model performance (how well the model does its job) in sequence processing tasks. The hope is that future work can build on this to create even more efficient models.

So in essence, the abstract is saying, "We've developed a new model that combines the best parts of two existing types of models. Our new model can handle large amounts of data and perform as well as the best current models, while using less computational resources. This is a big step forward for the field."
@lancemarchetti8673 Жыл бұрын
Really excited about this! Love from Sunny South Africa
@nyyotam4057 Жыл бұрын
Was like "Great, so now they'll implement it and my Alpaca will stop consuming so much memory". But then I got to the "tradeoff with computation" part 🙂.
@alivecoding4995 Жыл бұрын
Two months later. Have you seen adoption of these ideas?
@Sciencehub-oq5go Жыл бұрын
The paper isn't very well written, and is too short / confusing in parts. What exactly do they mean by "channels"? The components of the embedding vectors?
@triplea657aaa Жыл бұрын
I think RWKV in combination with a transformer model to generate the prompts could be really powerful
@fo.c.horton Жыл бұрын
please do like 10% more work making the annotations neater
@yilei1051 Жыл бұрын
I lost interest halfway through the explanation... The most profound results are often simple and coherent architectures; this work requires so much explanation that it feels like just playing with scalability and performance, without revealing much raw science.
@sebastianp40237 ай бұрын
53:53
@adi-ee8zj Жыл бұрын
CNN is all you need?
@qwerty123443wifi Жыл бұрын
Does being an author on an ML paper mean anything anymore? There are so many authors on some of these papers that it seems a bit ridiculous.
@wujacob4642 Жыл бұрын
That's because the paper was written in an open-source way. The main author, Bo Peng, mentioned that in his blog.
@novelspace Жыл бұрын
Galaxy 🧠 stuff
@arzigogolato1 Жыл бұрын
Why, why didn't they think of a better name? RWKV is really bad marketing...
@xxdaggerxx5 Жыл бұрын
i cant watch 1hr video man, summarize this shit
@klammer75 Жыл бұрын
Amazing amazing amazing! I've been delving sooo much into the code side of implementation that I forgot how much I love the maths side of the architecture, and this walkthrough, so expertly done by Yannic, has lit my maths brain on fire once again! I can't thank you enough for that, it was a thrilling explanation and you are by far my favorite technical AI explainer out there! You sir are an asset to humanity and I for one tip my hat to you! And to think that there are billions if not trillions of these weights/equations/parameters or whatever you want to call them in these models which give rise to the results we see is truly mind boggling… I feel like I just took an address watching that🤪😂🥳🦾🤓🤫
@girrajjangid4681 Жыл бұрын
Which mic are you using for the video? It's amazing. @YannicKilcher