RWKV: Reinventing RNNs for the Transformer Era (Paper Explained)

76,650 views

Yannic Kilcher

1 day ago

Comments: 126
@YannicKilcher
@YannicKilcher Жыл бұрын
Fully Connected (June 7th in SF) Promo Link: www.fullyconnected.com/?promo=ynnc OUTLINE: 0:00 - Introduction 1:50 - Fully Connected In-Person Conference in SF June 7th 3:00 - Transformers vs RNNs 8:00 - RWKV: Best of both worlds 12:30 - LSTMs 17:15 - Evolution of RWKV's Linear Attention 30:40 - RWKV's Layer Structure 49:15 - Time-Parallel vs Sequence Mode 53:55 - Experimental Results & Limitations 58:00 - Visualizations 1:01:40 - Conclusion Paper: arxiv.org/abs/2305.13048
@xxdaggerxx5
@xxdaggerxx5 Жыл бұрын
Stop with these 1hr videos and summarize this shit.
@akashkarnatak6581
@akashkarnatak6581 Жыл бұрын
@@xxdaggerxx5 This is for people who want to understand the paper in depth. If you want a summary, read the abstract.
@Fanney3
@Fanney3 Жыл бұрын
Can't believe someone is just doing this work and sharing it. Amazing.
@MindFactoryAI
@MindFactoryAI Жыл бұрын
Always impressed how you record these in a single take. Great explanation, thanks!
@TTTrouble
@TTTrouble Жыл бұрын
Jesus, keeping up with the literature in this field must be absolutely exhausting for those of you who actually work in it.
@arshzahed1970
@arshzahed1970 Жыл бұрын
In the past year, my backlog of papers to go through has grown exponentially. Just staying up to date is a full time job now
@victoraranda3349
@victoraranda3349 Жыл бұрын
Sometimes it do be like that
@mlopolis
@mlopolis Жыл бұрын
You should use LLMs to get the most important points from each paper and then you can stay on top 😊
@Will-kt5jk
@Will-kt5jk Жыл бұрын
@@mlopolis at a human-machine system level, that sounds like a self-improving augmentation. Maybe it helps explain the exponential increase in papers…😅
@raynhardtvanzyl4729
@raynhardtvanzyl4729 Жыл бұрын
Yup...
@YvesQuemener
@YvesQuemener Жыл бұрын
Can't say thank you enough! Diving into RWKV has been on my todo list for two months at least, and when I saw the YouTube alert I immediately felt relieved that instead of a full day of trying to understand the paper and the code, you would provide the important parts in one hour. And it delivered! I agree that it is kind of stretching the definition of attention to call what they are doing "linear attention". I am not sure that calling it a ConvNet is actually less stretchy btw :-) But anyway thanks a lot!
@mgostIH
@mgostIH Жыл бұрын
Keep in mind that at 21:00, regarding the memory usage of attention: current approaches like FlashAttention and "Attention doesn't need O(N^2) memory" have drastically reduced the memory needed to run transformers, which is what allows approaches like ChatGPT to have such a long context.
@erickmacias5153
@erickmacias5153 Жыл бұрын
But attention in GPT does use N^2 memory, doesn't it?
@mgostIH
@mgostIH Жыл бұрын
@@erickmacias5153 In older public models like GPT-2, yes, but the papers I mentioned above provide implementations that are mathematically equivalent to the standard way of doing attention; you can use them as drop-in replacements and get improved performance during training and inference.
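The core idea, very roughly, is to never build the full T x T score matrix: iterate over key/value chunks and keep running softmax statistics. A minimal single-head sketch of that online-softmax trick (not the actual FlashAttention kernel, which also tiles the queries and fuses everything into one GPU kernel; no masking or dropout here):

```python
import torch

def chunked_attention(q, k, v, chunk=128):
    # Computes softmax(q @ k.T / sqrt(d)) @ v without ever materializing the
    # full (T, T) score matrix: only a (T, chunk) slice exists at any time.
    T, d = q.shape
    scale = d ** -0.5
    m = torch.full((T, 1), float("-inf"))   # running max of scores (numerical stability)
    num = torch.zeros(T, d)                 # running softmax numerator (already times V)
    den = torch.zeros(T, 1)                 # running softmax denominator
    for s in range(0, k.shape[0], chunk):
        kc, vc = k[s:s + chunk], v[s:s + chunk]
        scores = (q @ kc.T) * scale                           # (T, chunk)
        m_new = torch.maximum(m, scores.max(-1, keepdim=True).values)
        alpha = torch.exp(m - m_new)                          # rescale old statistics
        p = torch.exp(scores - m_new)
        num = alpha * num + p @ vc
        den = alpha * den + p.sum(-1, keepdim=True)
        m = m_new
    return num / den

# q = k = v = torch.randn(1024, 64)
# reference = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
# assert torch.allclose(chunked_attention(q, k, v), reference, atol=1e-4)
```

Compute is still O(T^2), but peak memory for the scores drops from O(T^2) to O(T * chunk).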
@jondo7680
@jondo7680 Жыл бұрын
Just to give feedback: that example with "I'm the word cat" was just great. It helps me make sure whether I understood you right or not.
@sheevys
@sheevys Жыл бұрын
Haha, that's a quick reaction, your "all you need" pun was defo not intended.
@hansdietrich1496
@hansdietrich1496 Жыл бұрын
The best in-depth AI channel out there, chapeau!
@andres_pq
@andres_pq Жыл бұрын
Great to see you do paper explanations again!
@johnnypeck
@johnnypeck Жыл бұрын
This is awesome. I've seen the use of RNNs percolating on Twitter for a bit. Glad you're covering it. That is a lot of authors.
@erickmarin6147
@erickmarin6147 Жыл бұрын
Balding king you dropped this 👑
@michael05242002
@michael05242002 Жыл бұрын
🎯 Key Takeaways for quick navigation:
00:14 🔄 RWKV is a highly scalable model architecture that has some properties of Transformers and some of RNNs.
01:21 ⚖️ In some settings, RWKV matches the performance of large Transformer models.
03:29 📚 RWKV is a language-modeling architecture that predicts the next word or token in a text.
05:15 🧠 RNNs need only a fixed amount of memory for inference, but each inference step can only consider the current memory and the previous token.
10:17 📈 RWKV is the first non-Transformer architecture scaled to tens of billions of parameters.
11:12 🧠 The LSTM is a kind of RNN that introduces gating mechanisms to address vanishing gradients and has long-term memory.
14:44 🚪 LSTMs use gates, including a forget gate and an input gate, to control updates to the hidden state and cell state.
16:09 📊 LSTM updates involve multiple nonlinear computations, forcing sequential computation and preventing parallelization.
18:37 🤔 The attention mechanism can dynamically assign attention weights to aggregate information, but it is computationally expensive.
20:33 🔄 Attention-free Transformers try to redefine the attention mechanism without token-to-token interactions in order to reduce memory requirements.
23:43 ⚖️ RWKV uses a fixed attention pattern that applies to all data points, but it can be modulated by adding keys, giving the attention pattern some flexibility.
24:37 🔄 RWKV's attention is modulated additively rather than multiplicatively; compared to the Transformer's multiplicative interactions, this has a weaker effect.
25:30 🔍 RWKV's fixed attention cannot take the meaning of the current token into account and only defines a fixed attention pattern; by comparison, the original attention mechanism is more powerful and flexible.
28:10 💡 RWKV proposes a new attention-like mechanism that modulates the attention pattern through a vector W, thereby taking past information into account.
30:09 📝 RWKV builds a model with a repeated structure by applying it across a sequence of tokens to process sequential data.
33:02 🔍 Each block's computation retains some information and passes it on to the next layer, similar to the state passing in an LSTM, but done layer by layer.
34:01 💡 RWKV's channel-mixing block mixes channels with linear layers, a nonlinearity, and element-wise multiplication.
36:25 📝 RWKV implements a time or token shift by adding the previous time step's input to each time step's input via linear interpolation.
38:14 🌟 RWKV's time-mixing block follows a Transformer-like computation, including linear layers and weighted sums.
42:40 ✨ Through an unrestricted weighted sum, RWKV can aggregate over all past values without being limited by a fixed-size attention matrix.
44:05 📝 RWKV uses linear interpolation and weighted sums to implement the time/token shift.
45:10 🌟 RWKV's hidden state is computed through linear functions of linear interpolations with no nonlinearity in between, so training can be parallelized.
46:03 ✨ The token shift lets each element access the element before it, giving a receptive field that grows with depth.
47:36 💡 RWKV implements a linear aggregation, a weighted sum over past values, that can effectively look back at the past.
51:49 🚀 Compared to Transformers and LSTMs, RWKV sits in the middle in its ability to look back at the past and to do complex computation, but stacking multiple layers can increase its expressiveness.
54:54 ⚡ Compared to Transformers and LSTMs, RWKV's advantages are less clear for complex computations and longer contexts.
55:31 ✨ Increasing the context length lowers the language-modeling loss.
56:27 📉 The linear attention mechanism may limit model performance on long-context tasks.
56:53 🏗️ RWKV depends more on carefully designed prompts than standard Transformer models; this needs further exploration and confirmation.
58:15 🌍 RWKV considers past information per channel, layer by layer; higher layers tend to consider information over longer horizons.
Made with HARPA AI
@killers31337
@killers31337 Жыл бұрын
If recalling information in long contexts is the problem, perhaps throwing in a few transformer layers would solve that? E.g. something like language parsing can be done using just the RNN, as the information is largely local. E.g. if you have 20 layers in total, layers 1..10 would be RNN, then layer 11 is a transformer, then 12..20 are RNNs again. Then the "quadratic" part is only 1/20th of the NN. Yes, it would route only 1/20th of the information a full transformer would, but if only a few important pieces of the context are necessary, that might be enough.
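A toy sketch of the kind of stack I mean, with a GRU block standing in for the recurrent layers (just to make the shape of the idea concrete, not RWKV's actual block):

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    # Stand-in for an RNN/RWKV-style block: a GRU with a residual connection.
    def __init__(self, d_model):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return x + out

def build_hybrid(d_model=256, nhead=4, n_layers=20, attn_layers=(10,)):
    # 19 recurrent blocks plus a single attention block in the middle,
    # so only 1/20th of the stack pays the quadratic cost.
    blocks = []
    for i in range(n_layers):
        if i in attn_layers:
            blocks.append(nn.TransformerEncoderLayer(d_model, nhead, batch_first=True))
        else:
            blocks.append(RecurrentBlock(d_model))
    return nn.Sequential(*blocks)

# x = torch.randn(2, 512, 256)   # (batch, seq, channels)
# y = build_hybrid()(x)          # same shape out
```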
@itayatelis2898
@itayatelis2898 Жыл бұрын
Amazing! Thank you for doing this! You're amazing! I hope you would keep doing it weekly
@halocemagnum8351
@halocemagnum8351 Жыл бұрын
Amazing explanation! Great video. I had been reading all of the RWKV posts on the r/MachineLearning subreddit but I don't think I fully grasped it till this review.
@mattanimation
@mattanimation Жыл бұрын
was waiting for this one, thanks!
@ChaseFreedomMusician
@ChaseFreedomMusician Жыл бұрын
THANK GOD! Somebody is finally talking about RWKV!
@Sciencehub-oq5go
@Sciencehub-oq5go Жыл бұрын
Thankful for your work!
@justfoundit
@justfoundit Жыл бұрын
If we revisit ideas, maybe we could try shared-weight transformers. It worked for CNNs. Minor memory footprint for the model, and it's easy to get the effective depth of a huge network by just repeating the same layer multiple times.
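A rough sketch of what I mean, ALBERT-style cross-layer parameter sharing (dimensions are made up):

```python
import torch
import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    # One set of layer weights applied `depth` times: the parameter count of a
    # single layer, but the compute (and effective depth) of a deep stack.
    def __init__(self, d_model=512, nhead=8, depth=24):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)   # same weights reused at every depth step
        return x

# x = torch.randn(2, 128, 512)
# y = SharedLayerTransformer()(x)
```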
@OperationDarkside
@OperationDarkside Жыл бұрын
53:50 for 5s summary of the paper
@사이보그-i6p
@사이보그-i6p Жыл бұрын
17:35 (EDIT: and 20:50) That is just a matrix multiplication, therefore an inner product instead of an outer product, right?
@edhofiko3168
@edhofiko3168 Жыл бұрын
I unironically love this paper even though it absolutely lacks theoretical analysis. I've been following RWKV since before they made the paper. I would really love it if PyTorch implemented a discounted cumulative sum primitive, since this is exactly what the RWKV attention uses and it is also what people in RL use.
@alexeykrylov9995
@alexeykrylov9995 Жыл бұрын
I agree that it'd be good to have it as a primitive. But as long as it's unavailable, it can be implemented in O(N log N) time (instead of O(N) if it was a primitive) by decomposing it into a convolution of several dilated exponential kernels (I mean, for example: 1st conv: dilation 1, kernel size 4, geometric progression factor k; 2nd: dilation 4, size 4, factor k^4; 3rd: dilation 16, size 4, factor k^16; etc.). It worked well in practice (I did this trick for my colleague's project once).
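A rough sketch of that trick for a scalar decay factor k (shifts instead of actual conv calls, but the same stages: each pass covers the offsets of one base-4 "digit"):

```python
import torch

def discounted_cumsum_naive(x, k):
    # y_t = sum_{i<=t} k^(t-i) * x_i, one sequential recurrence step per position.
    y = torch.zeros_like(x)
    state = torch.zeros_like(x[..., 0])
    for t in range(x.shape[-1]):
        state = k * state + x[..., t]
        y[..., t] = state
    return y

def discounted_cumsum_dilated(x, k, kernel_size=4):
    # Same result in O(N log N) work but only O(log N) sequential stages:
    # the stage with dilation d combines offsets 0, d, 2d, 3d with weights
    # k^0, k^d, k^2d, k^3d; composing stages covers every offset exactly once.
    y = x.clone()
    n = x.shape[-1]
    d = 1
    while d < n:
        acc = y.clone()
        for j in range(1, kernel_size):
            off = j * d
            if off >= n:
                break
            acc[..., off:] = acc[..., off:] + (k ** off) * y[..., :-off]
        y = acc
        d *= kernel_size
    return y

# x = torch.randn(8, 200)
# assert torch.allclose(discounted_cumsum_naive(x, 0.95),
#                       discounted_cumsum_dilated(x, 0.95), atol=1e-4)
```

Each stage is fully parallel over the time dimension, which is the whole point versus the sequential recurrence.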
@djfl58mdlwqlf
@djfl58mdlwqlf Жыл бұрын
Hi, I am not convinced that the absence of non-linearity helps parallelization (45:00). The paper asserts that this is possible across two different dimensions (batch, time). Can anyone give me a brief explanation of this?
@spoonikle
@spoonikle Жыл бұрын
We need to focus on more complex multi-step models. Humans take notes, humans ruminate, humans speak aloud. Multi-modality is key; tools built into the model to compensate for shortfalls are key. Design the model with a calculator and let it train on the tool; design the model with outputs hooked into dozens of tools and reward correct tool use; force the influence of tools on the output and train a model that no longer wastes time reinventing calculators. We trained a model to make paintings instead of making a model that calls Adobe APIs to paint - now that LLMs exist we have seen the light, we see the true power of AI... calculators are better left to the programmers.
@dairin0d
@dairin0d Жыл бұрын
@YannicKilcher would be interesting to hear your take on hyperdimensional computing / vector symbolic architectures :-) It seems like a really cool idea, though I can't quite wrap my head around (or maybe wasn't able to find a clear explanation) how it's actually supposed to interface with non-symbolic inputs (e.g. images) or learn complex structured concepts from data.
@schwajj
@schwajj Жыл бұрын
Halfway through, but I have a question. Transformers have been applied outside the domain of language modeling (or even more generally, outside of sequence modeling), e.g. Vision Transformers. In building our intuition, Yannic talks in terms of how much RWKV pays attention to the past for each internal feature learned by the model. Does this imply that RWKV is more specialized to sequence modeling than classic Transformers? i.e. would RWKV *not* work well if you try to apply it to image-based input? Or is this an open question? Is there reason to lean one way or the other? (probably most people who would answer this already saw the video a month ago, but fingers crossed for an answer)
@clehaxze
@clehaxze Жыл бұрын
No answer, but RWKV-4-neo supports image input by basically slapping a CLIP encoder onto one of its layers. This way it can use the representations as an understanding during a conversation.
@giuliavirgili1660
@giuliavirgili1660 8 ай бұрын
LineRWKV
@NeoShameMan
@NeoShameMan Жыл бұрын
Based on my experiments, you won't need NNs for long: the distribution is the same at input and output, and fine-tuning is just a skew of that distribution towards the fine-tuning corpus. Better, there is a very high probability that it won't be a black box for long and we can extract an optimal entropy encoding, no more weird sparsity. I'm just waiting for a new hard drive to test more.
@noagarnett
@noagarnett Жыл бұрын
Thanks Yannic for (another) great video! Really amazing that you do all this work and share it. Worth a lot for me and the likes of me. Also the paper discussed is very impressive. I might be wrong, but I think there is a confusion in the explanation. At 24:53 and 30:21, you claim that the k_i modulation is defined by the current token ("if I am "cat" I should probably look 3 tokens behind"), but if I understand correctly, it is defined by the referred token, and is the same for all following positions ("if I am "cat" I should probably have a big influence on all the tokens following me"). Did I mix it up? Thanks!
@sortysciaofiscia
@sortysciaofiscia Жыл бұрын
I have a question at the halfway mark of the video: if the importance of attention to tokens linearly decreases based on how far back they are, does that mean that by the end of the answer, it will have forgotten what it started talking about? What stops this approach from repeating itself? I'm trying to wrap my head around: "The brown fox jumped over a lazy old dog, and then ...." In this example, the next word will be computed based on the dog reference MORE than the fox one? I'd assume transformers look at every other token in this sentence and compute, whereas from your explanation I gather that importance drops off the further back the token is, right? Sorry, I'm new to this.
@zhenyuanzhang
@zhenyuanzhang Жыл бұрын
Not really. Almost half of the channels in the middle-to-high layers do not decay at all (after training). Important information stored there could last forever, in theory. As long as the model is aware that a piece of information is important, it won't forget it easily.
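A tiny numerical illustration, under the simplest reading where the weight on a token Δ steps back in channel c is exp(-Δ·w_c) (the decay rates here are made up):

```python
import torch

w = torch.tensor([0.0, 0.1, 1.0])      # three hypothetical per-channel decay rates
delta = torch.arange(6).unsqueeze(1)   # how many steps back the token is
print(torch.exp(-delta * w))
# the channel with w=0.0 keeps weight 1.0 at every distance,
# while the w=1.0 channel is down to ~0.007 after just 5 steps
```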
@LostMekka
@LostMekka Жыл бұрын
44:10 "so thats what they mean by states, if they say... they dont mean the united states, im sorry, they mean these values." this is such a wonderful feynman moment ^^ on a more serious note: i wonder if those approaches could be combined... like a standard transformer based model part that is really good short term and one like this that is really good long term that somehow complement each other? i think my best idea for that so far would be if you let the RWKV model "summarize" the long context and produce not a sequence of output tokens, but a sequence of internal representation values that act as a kind of compressed version of the complete context. then the transformer model could go to town with its superior capabilities but with a shorter context window and pick out which parts of the summary it wants to attend to. would that be feasible, or am i thinking out of my ass? :D
@antonioILbig
@antonioILbig Жыл бұрын
Yannic, good guess! Scalability could be the real deal. Deep architectures have different "lego blocks" (transformers, LSTM, conv, residual, ...). When you build a big model, the meaning of its pieces is lost. What stays is the computational efficiency, scalability and optimization behaviour.
@banseoklee392
@banseoklee392 Жыл бұрын
Awesome!! Thank you sooooooooo much
@debanjandas7738
@debanjandas7738 Жыл бұрын
In the AFT attention equation, the weight associated with token i for input token t is given by w(t,i)+k(i) => how do we add a scalar to a vector? Wouldn't it have been more appropriate to do w(t,i)*k(i)?
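For reference, my reading of the AFT-full equation in its causal form (where $w_{t,i}$ is a learned scalar per position pair and $k_i$ is a vector, so the scalar is broadcast-added across the channels of $k_i$):

$$Y_t = \sigma(q_t) \odot \frac{\sum_{i=1}^{t} \exp(w_{t,i} + k_i) \odot v_i}{\sum_{i=1}^{t} \exp(w_{t,i} + k_i)}$$

And because the sum sits inside the exponential, $\exp(w_{t,i} + k_i) = e^{w_{t,i}} \cdot e^{k_i}$, so the additive form in log-space already acts as a multiplicative modulation of $e^{k_i}$.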
@serta5727
@serta5727 Жыл бұрын
General idea for transformers: evolutionary attention heads come to mind. Instead of training multiple attention heads in a transformer, how about having just one that branches off, with the best evolved version merged back into the original? That way, at inference time there is only one attention head, which saves compute.
@AntoshaPushkin
@AntoshaPushkin Жыл бұрын
Assume your task is to add numbers like 123456 + 987654 = ? You will need at least 2 attention heads to attend to the two numbers. Not saying that you should use transformers to add up numbers, it's just a random example of a situation where it's clear that you need multiple attention heads.
@schwajj
@schwajj Жыл бұрын
@@AntoshaPushkinThat doesn’t sound right to me: you’re essentially saying that a separate attention head would self-assign to each number. It’s not completely implausible, but I’d like to see some rigorous analysis that indicates that transformers have been observed to operate in that manner. Are you aware of such research? I’d be grateful for any pointers you could provide.
@smnt
@smnt Жыл бұрын
Hey Yannic, quick question: what do you mean when you say RNNs don't scale well, or that you might "just need models that scale"? What does model scaling mean to you? I've definitely seen people stack RNNs and it seemingly works just fine. I thought the issue with RNNs was that they lose context pretty quickly even though their context length is "infinite". Thanks for the video as always, love it!
@simonstrandgaard5503
@simonstrandgaard5503 Жыл бұрын
Great explanation
@oneman7094
@oneman7094 Жыл бұрын
Can you do S4?
@Veptis
@Veptis 8 ай бұрын
Google tried to train a 500B-parameter LSTM, so one of those "first" claims might be incorrect.
@rolfengstrand9838
@rolfengstrand9838 Жыл бұрын
THANK YOU for pointing out that the use of the word "attention" in the context of transformers has strayed far away from the meaning of "attention" in other contexts. We have to accept that this is happening, of course. But it is important that introductury material explains clearly what "attention" is intended to mean here. It would be wrong to assume that newbie reader has the same concept associated with the word, "attention".
@JorgetePanete
@JorgetePanete Жыл бұрын
introductory*
@addoul99
@addoul99 Жыл бұрын
Hi, are the weights for the linear layer Wv tied between a pair of channel-mixing and time-mixing blocks?
@strawberryfield891
@strawberryfield891 Жыл бұрын
Thank you very much for the great video!! Are channels somewhat equivalent to multi-heads in transformers?
@alles_moegliche73
@alles_moegliche73 Жыл бұрын
Can you also take a look at the Meta Megabyte Paper?
@clementdato6328
@clementdato6328 Жыл бұрын
Does it explain how the error signal is passed through time, or is it implicitly assumed that BPTT is used?
@agsystems8220
@agsystems8220 Жыл бұрын
So it specially chooses the representation of the internal state to be already decomposed into its eigenvectors with respect to time decay, meaning that we can infer relevance forward with a simple fixed matrix? That is pretty cool. I guess you could do something similar with any transformer where you have some natural definition of distance that can be precomputed. For the initial layers at least they seem to be very interested in nearby features (both in language and images), so this definitely seems a natural specialisation/optimisation. If it is going to be doing something like this anyway, we might as well give it an architecture that does it well. Later layers don't seem to care about those features though, so this technique would cease to be valuable pretty fast I think. For more abstract inferences the order in which the pieces of information are fed in is not relevant, so the exponential term would tend to one and the whole system would collapse to being fully attention-free. You cannot build something able to make abstract inferences with a compact representation using this architecture. A nice piece of work, but a local optimisation rather than an improvement IMO.
@jackhe4336
@jackhe4336 Жыл бұрын
IMO, it's hard to extract a compact representation of the data without hurting expressiveness and generality of the representation. Could you recommend some papers that address this issue?
@schwajj
@schwajj Жыл бұрын
Great comment. A question: if you’re correct that this would work OK on early layers, but less well on later layers which deal with more abstract concepts, could you use a hybrid where some layers use RWKV and later layers use classic attention? I suppose that asymptotically it would still use O(n**2) space; this would only improve things by a constant factor (e.g. if only half of the layers use classic attention, the memory savings will asymptotically approach 50%). Do you see any value in such an approach?
@andres_pq
@andres_pq Жыл бұрын
please make a video about RetNet :)
@dreamphoenix
@dreamphoenix Жыл бұрын
Thank you.
@gunale926
@gunale926 Жыл бұрын
I still don't get how the time-mixing can be trained in parallel. It must depend on the previous state, doesn't it?
@summer_tree3821
@summer_tree3821 11 ай бұрын
Me too. Do you get it now? 😘
@gunale926
@gunale926 11 ай бұрын
@@summer_tree3821 Yep. During training, the previous state can be computed directly from the actual data, so it doesn't have to wait for the model's own sequential outputs.
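A minimal sketch of what that means (simplified: per-channel decay w, and leaving out RWKV's separate bonus term u for the current token):

```python
import torch

def wkv_recurrent(k, v, w):
    # "RNN mode": one step at a time, carrying a numerator/denominator per channel.
    T, C = k.shape
    num, den = torch.zeros(C), torch.zeros(C)
    out = torch.zeros(T, C)
    decay = torch.exp(-w)                     # per-channel decay in (0, 1]
    for t in range(T):
        num = decay * num + torch.exp(k[t]) * v[t]
        den = decay * den + torch.exp(k[t])
        out[t] = num / den
    return out

def wkv_parallel(k, v, w):
    # "GPT mode" for training: every position is an explicit weighted sum over the
    # past, so all T outputs can be computed at once, with no carried state.
    T, C = k.shape
    t_idx = torch.arange(T).view(T, 1, 1)     # output position
    i_idx = torch.arange(T).view(1, T, 1)     # attended (past) position
    logits = -(t_idx - i_idx).clamp(min=0) * w.view(1, 1, C) + k.view(1, T, C)
    logits = logits.masked_fill(i_idx > t_idx, float("-inf"))   # causal mask
    weights = torch.softmax(logits, dim=1)    # normalize over past positions
    return (weights * v.view(1, T, C)).sum(dim=1)

# k, v = torch.randn(16, 8), torch.randn(16, 8); w = torch.rand(8)
# assert torch.allclose(wkv_recurrent(k, v, w), wkv_parallel(k, v, w), atol=1e-5)
```

Because the recurrence is linear (no nonlinearity between steps), the two forms give the same numbers; the parallel form just spends O(T^2) compute to avoid the sequential dependency during training.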
@CppExpedition
@CppExpedition Жыл бұрын
what's next? a transformer model of RNN subunits composed by a stack of transformer models designed as an RNN structure of transformers units aligned with a convolutional recurrent boltzman gate.
@guillaumevermeillesanchezm2427
@guillaumevermeillesanchezm2427 Жыл бұрын
do it! do it!
@erickmarin6147
@erickmarin6147 Жыл бұрын
Learning in a layer wise manner
@filoautomata
@filoautomata Жыл бұрын
Quantum Multi Modal Transformer Model LSTM using Ensemble of Neutrosophic Logic based Attention Model for an Interpretable 'Human Extinction Capable' Military Grade Artificial General Intelligence.
@erickmarin6147
@erickmarin6147 Жыл бұрын
With active dendrite modeling for multi-task approaches
@vivienseguy
@vivienseguy Жыл бұрын
Yes, all learned end-to-end
@corgirun7892
@corgirun7892 Жыл бұрын
Amazing!
@danplt
@danplt Жыл бұрын
so many authors with many different institutions
@marshallmcluhan33
@marshallmcluhan33 Жыл бұрын
Neko institute of Science and The waifu research department are still on the top of the charts on hugging face. I'm not sure these archaic institutions are as mobile so they have to team up to stay relevant.
@TheThunderSpirit
@TheThunderSpirit Жыл бұрын
need more institutions. can u give me?
@thntk
@thntk Жыл бұрын
Didn't Schmidhuber do this already in the 90s?
@howuhh8960
@howuhh8960 Жыл бұрын
I really don't like the very strong statements in the paper, such as "surpasses the capabilities of any existing RNN". lol, ZERO comparisons with other new RNNs based on S4 for example...
@iOhadRubin
@iOhadRubin Жыл бұрын
There are no public open source S4 models of this size
@Will-kt5jk
@Will-kt5jk Жыл бұрын
1:02:05 - The Matrix looks different to what I remember from the movie
@toddnedd2138
@toddnedd2138 Жыл бұрын
Perform badly, this approach will, if speak like Yoda you do. ;) Thank you for the detailed explanation of the paper and the afford you put into this.
@JorgetePanete
@JorgetePanete Жыл бұрын
effort* show your proof of why
@cassandrasinclair8722
@cassandrasinclair8722 Жыл бұрын
transformers too are convnets ;) they do convolution over a graph :D Attention is just one instance of graph convolution.
@hanskraut2018
@hanskraut2018 Жыл бұрын
It should be trained on equations and text embedded in irrelevant numbers and text, where in the end the far-back equation/text is needed to compute the final numeric/text result. That way a neural net would automatically learn to selectively pay attention based on the output.
@schwajj
@schwajj Жыл бұрын
That’s the whole trade-off here. The classic 2017 transformer model does what you say (to a certain extent). The model being discussed here is worse at the sort of task you’re proposing, but has the benefit of not using O(n**2) space.
@HD-Grand-Scheme-Unfolds
@HD-Grand-Scheme-Unfolds Жыл бұрын
Just a very loose and wild thought that came to mind: what if "transformer architectures" could be employed as an imitation of "System 2" (more logical, critical and decisive), while "RWKV" is used for "System 1" (a bit fuzzy in accuracy, but capturing the essence of the life-long experience had by the AI agent, hence a derived ability to exercise intuition or instinct-like thinking and responses to situations)? Both could be combined in a pseudo-cognitive-architecture approach to tackle the AGI challenge. Wouldn't that be something to see 😄.
@almoni127
@almoni127 Жыл бұрын
Why do people still claim that transformers require memory that is quadratic in the sequence length when it was shown to be avoidable? (See the work on flash attention for example) It is still true, however, that it requires quadratic time.
@schwajj
@schwajj Жыл бұрын
Flash attention is still quadratic in the sequence length (more precisely, context length). It just massively improves the constant factor via more efficient use of the GPU memory hierarchy.
@alyzst
@alyzst Жыл бұрын
How does it compare with Hyena?
@hachembetrouni6731
@hachembetrouni6731 Жыл бұрын
😅 Yannic makes LSTMs sound like prehistory
@haraldtopfer5732
@haraldtopfer5732 Жыл бұрын
53:41 my model has a linear scaling where everything else goes *Brrrrrrruummm* .... story of my life
@danylaley
@danylaley Жыл бұрын
This is just a fancy convolutional lstm
@panofilossas6564
@panofilossas6564 Жыл бұрын
Looks like a good candidate for running on low-spec hardware
@kimchi_taco
@kimchi_taco Жыл бұрын
I'm not sure. It looks like a complicated MLP-Mixer.
@deeplerg7913
@deeplerg7913 Жыл бұрын
I can't understand anything here but I'm sure it's something very interesting :P
@bertobertoberto242
@bertobertoberto242 Жыл бұрын
The ConvNet explanation reminds me a lot of WaveNet from DeepMind...
@anglikai9517
@anglikai9517 Жыл бұрын
Tested it today; too slow compared to Llama 2 GGML. Hope that a GGML version of RWKV becomes more user friendly.
@akashkarnatak3014
@akashkarnatak3014 Жыл бұрын
If you had uploaded this video 3 days ago, it would have helped me with my assignment as well. Anyways, great video.
@chrisBruner
@chrisBruner Жыл бұрын
So a couple of thoughts. 1. For the intelligent prompt generation, you could just use a small transformer dedicated to that task. 2. Because of its parallel nature, you could have one of these things working on a bunch of Raspberry Pis, or... a world-wide network of computers sharing the task. That would more than make up for the limitations compared to transformers. 3. It seems to me that these models get fuzzy in recall of "minutiae", but there is no reason you can't have them hooked together so the recall can occur by asking another set. Just some thoughts.
@lancemarchetti8673
@lancemarchetti8673 Жыл бұрын
Really excited about this! Love from Sunny South Africa
@7200darkcharm
@7200darkcharm Жыл бұрын
This abstract is summarizing a research paper that presents a new model architecture for natural language processing (NLP) tasks called Receptance Weighted Key Value (RWKV). Here's a breakdown of the abstract:

Problem with Transformers: Transformers are a type of model that have been very successful in NLP tasks. However, they have a major drawback: their memory and computational needs increase quadratically with the length of the sequences they process. This means that as the input data (like a sentence or document) gets longer, the resources needed to process it grow very quickly, which can make them impractical for very large datasets or very long sequences.

Problem with Recurrent Neural Networks (RNNs): RNNs, another type of model, have memory and computational needs that grow linearly with sequence length, which is more efficient than Transformers. However, they tend to perform worse on NLP tasks because they are harder to train in parallel (meaning, it's harder to split the work of training them across multiple machines or processors), and they don't scale as well (meaning, their performance doesn't improve as much when you add more data or make them bigger).

The Proposed Solution - Receptance Weighted Key Value (RWKV): The authors propose a new model, the RWKV, that aims to combine the best of both worlds. It can be trained in parallel like a Transformer, which makes it efficient to train, and it has linear memory and computational complexity like an RNN, which makes it efficient to use once trained. This is achieved by using a linear attention mechanism, which is a method for deciding which parts of the input data the model should pay most attention to.

Results: The authors scaled the RWKV model to tens of billions of parameters (which is a measure of the model's size and complexity) and found that it performs similarly to a Transformer of the same size. This suggests that it could be a useful alternative to Transformers for large-scale NLP tasks.

Conclusion: This work represents a significant step towards reconciling the trade-off between computational efficiency (how much computing resources a model needs) and model performance (how well the model does its job) in sequence processing tasks. The hope is that future work can build on this to create even more efficient models.

So in essence, the abstract is saying, "We've developed a new model that combines the best parts of two existing types of models. Our new model can handle large amounts of data and perform as well as the best current models, while using less computational resources. This is a big step forward for the field."
@sebastianp4023
@sebastianp4023 8 ай бұрын
53:53
@triplea657aaa
@triplea657aaa Жыл бұрын
I think RWKV in combination with a transformer model to generate the prompts could be really powerful
@nyyotam4057
@nyyotam4057 Жыл бұрын
Was like "Great, so now they'll implement it and my Alpaca will stop consuming so much memory". But then I got to the "tradeoff with computation" part 🙂.
@qwerty123443wifi
@qwerty123443wifi Жыл бұрын
Does being an author on an ML paper mean anything anymore? There are so many authors on some of these papers that it seems a bit ridiculous.
@wujacob4642
@wujacob4642 Жыл бұрын
That's because the paper was written in an open-source way. The main author, Bo Peng, said so in his blog.
@alivecoding4995
@alivecoding4995 Жыл бұрын
Two months later. Have you seen adoption of these ideas?
@fo.c.horton
@fo.c.horton Жыл бұрын
please do like 10% more work making the annotations neater
@Sciencehub-oq5go
@Sciencehub-oq5go Жыл бұрын
The paper isn't very well written, and it's too short / confusing in parts. What exactly do they mean by "channels"? The components of the embedding vectors?
@novelspace
@novelspace Жыл бұрын
Galaxy 🧠 stuff
@adi-ee8zj
@adi-ee8zj Жыл бұрын
CNN is all you need?
@yilei1051
@yilei1051 Жыл бұрын
I lost interest halfway through the explanation... The most profound results are often simple and coherent architectures. This work requires so much explanation that it feels like just playing with scalability and performance, without revealing any raw science.
@arzigogolato1
@arzigogolato1 Жыл бұрын
Why, why didn't they think of a better name? RWKV is really bad marketing...
@xxdaggerxx5
@xxdaggerxx5 Жыл бұрын
I can't watch a 1hr video, man. Summarize this shit.
@klammer75
@klammer75 Жыл бұрын
Amazing amazing amazing! I’ve been delving sooo much into the code side of implementation I forgot how much I love the maths side of the architecture and this walkthrough so expertly done by Yannic has lit my maths brain on fire once again! I can’t thank you enough for that, was a thrilling explanation and you are by far my favorite technical AI explainer out there! You sir are an asset to humanity and I for one tip my hat to you! And to think that there’s billions if not trillions of these weights/equations/parameters or whatever you want to call them in these models which give rise to the results we see is truly mind boggling….I feel like I just took an address watching that🤪😂🥳🦾🤓🤫
@girrajjangid4681
@girrajjangid4681 Жыл бұрын
Which mic are you using for the video? It's amazing. @YannicKilcher