MAMBA from Scratch: Neural Nets Better and Faster than Transformers

Views: 187,645

Algorithmic Simplicity

A day ago

Comments: 290
@jamescamacho3403 4 months ago
As someone actively working on this stuff, this channel has the best explanations on the internet, and the 'tuber actually understands what is going on.
@Quarky_ 4 months ago
3blue1brown of deep learning?
@Sumpydumpert 3 months ago
I'd love feedback on Reddit if you're working on this, as well as on the Cosmo Knowledge YouTube channel, where I threw up some concepts.
@InfiniteQuest86 5 months ago
I like how we now call 1 billion parameters small.
@Nasser-bp6qf 2 months ago
Will we ever scale up and reach a point where 1 trillion is small?
@lylong-i2z 17 days ago
i hope so
@tulgatbolderdene7493 5 months ago
This just shows how RNNs are way too natural of an architecture to ignore. Maybe the solution to a gradient descent problem is to not use gradient descent at all. There has to be a different way to update parameters than this bizarre hack-and-slash of letting ||x_0|| = 1 for RNNs.
@BooleanDisorder 5 months ago
Meta-learning could potentially be one way. Like a neural "module" in the model that looks at how changes in the first layers affect the representation space deeper in the model and vice versa. It would have to have some goal and reward itself.
@tempname8263 5 months ago
But gradient descent is too natural of an algorithm to ignore >.
@ckpioo 5 months ago
@tempname8263 It's actually not natural at all; gradient descent itself is the one big difference between a human brain and any neural network.
@egor.okhterov 5 months ago
@tempname8263 no
@ultrasound1459 5 months ago
​@BooleanDisorder you have 10 missed calls from Juergen Schmidhuber 🧏‍♂️
@jawadmansoor6064 5 months ago
wow, you've made some difficult i mean extremely difficult algorithms look easy. thank you.
@thaRealShady1 3 months ago
It's all not as difficult as one might think. I'm currently in my PhD and I quickly realized that most of the difficulty comes from people trying to look smart instead of trying to properly explain stuff. It is very hard to come up with a good solution to a problem while it is significantly easier to explain the solution once it is understood. Hence, if you are of average or slightly above average intelligence you should be able to learn almost anything if you have someone that is willing to actually provide a good explanation.
@yqisq6966 4 months ago
Peer review is broken nowadays because people have little time to actually read through a manuscript with attention to details given the amount of pressure to publish their own papers. So when you have more papers out there than the time people can spend on reviewing, you get low quality peer review.
@andreasbeschorner1215 1 month ago
During my Ph.D times a paper of mine got rejected at ICASSP for not having quoted a certain paper (I guess the reviewer was one of the authors) which had absolutely NOTHING to do with what my paper was about... So yes, a lot in the reviewing process seems to be a) personal and b) must do this and that even if it is not related to your paper at all. Since years...
@cparks1000000 1 month ago
People think it's okay to have their graduate students review papers for them. It's blatantly unethical and needs to stop.
@BadChess56 9 days ago
True
@rikkathemejo 4 months ago
Nice video! I just wanted to point out that the parallel scan algorithm can also be implemented in O(n) time (instead of the O(n log(n)) version presented in the video), and this is the version that Mamba uses.
@peterdemore7239 4 months ago
Brutal. I'm going to have to watch this about 30 times. Love it.
@honglu679 5 months ago
Wow, excellent explanation. It covers all the essence of the paper with just enough math/algo. Thank you so much! If you don't mind, please make a video on RWKV (v6 has some new modifications), which is another strong linear RNN model. I am curious how it compares to Mamba.
@RexPilger 4 months ago
About peer review: As one comment noted, there could be many more candidate papers presented than could be accommodated at the venue. However, this video argues, the rejection justification for this paper is inadequate at best. Some comments ask whether the rejection is important; for academics, the answer is yes, because presentations and publications count for tenure, promotions, and raises plus continued funding of the research. Since several comments plus the video indicate that the algorithm had already received a lot of publicity, for the sake of the project it may not matter if it can continue to be funded, especially if commercial implementations are successful. What is interesting in any case is that the paper exists; in effect it has been published; the authors may not get the desired credit for formal publication, but their work and the reviewer comments are out there now. A couple of decades ago that would not have been the case; most people in the field would be unaware of the algorithm. In terms of peer review, in general (outside of AI), in my field, one of the natural sciences, a paper I submitted for publication encountered an editor plus two reviewers who were well qualified in the field; after asking for two revisions to the manuscript, the third version was rejected. Interestingly, all three scientists had published research which my paper undermined; they may well have lost funding for their research or even their position had that manuscript of mine been published (I speculate here). Peer review cuts both ways. While iterating with the editor and reviewers I continued to expand my research project and made some additional discoveries. Following the rejection I wrote a completely different paper which incorporated my initial work supplemented by the new discoveries; happily it was published a few months ago (in a different journal). I'm formally retired now, but continue to do research. To young researchers -- never give up. 
Learn from rejection, refine your work, be humble, exercise integrity and honesty, and take pride in your accomplishments, even if only a few know about them. Peer review (by humans) is a necessity and will continue to be. There is no such thing as a perfect filter, but science and technology would be overwhelmed by irrelevancy, dishonesty, and duplication of effort without it. AI may become a useful filtering tool, but science is a human endeavor.
@goliathstark9142 4 months ago
nice one rex
@gyahoo 24 days ago
Underrated ML channel ❤
@EkShunya 5 months ago
please open your community tab your content is incredible
@danverzhao9912 2 months ago
Just wondering if you can make a video on how GNN works? There's not really many videos about GNN on youtube.
@algorithmicsimplicity 2 months ago
Thanks for the suggestion, I will put it on the list!
@nialv7985 4 months ago
Thanks for this explanation! Phrasing Mamba in terms of a linear RNN makes it much easier to understand. You've done a lot already with this video, but I just want to ask for a little bit more. Since the original Mamba paper presented the model in terms of SSMs, many, many implementations of Mamba also use that language, and I have difficulty mapping their code back to the concepts in this video. I wish you could explain how the concepts in the Mamba paper (∆, A, B, C, D, discretization, etc.) map back to the parameters of a linear RNN, which would help a lot.
@algorithmicsimplicity 4 months ago
Sure. In the state space terminology, A in ℂ^d is the learnable parameter that is used to make the recurrent weight vector; the equivalent in my video is a+bi, where a, b in ℝ^d are learnable parameters and i is the imaginary unit. B, C in ℂ^{d×d} are the complex matrices applied before and after the recurrence respectively, equivalent to the P and Q matrices in my video, and are also learnable parameters. The SSM performs discretization of the parameters, which creates A^bar = exp(ΔA) and B^bar = (ΔA)^-1 (exp(ΔA) - I) ΔB. Note that A^bar and B^bar are what are actually used in the computation. This discretization is equivalent to the stable reparameterization outlined in my video. In the SSM formulation, they phrase the discretization as modifying B into B^bar, but note that B is the matrix which is applied to the input, so multiplying B by Δ is equivalent to multiplying the input x by Δ and leaving B unchanged, which is how it is described in my video. One last thing to be aware of is that in the state space literature, the models are often described as having another "state dimension" N in addition to the model dimension d. This state dimension is equivalent to the factor by which the output vector's dimension is expanded, so for example Mamba uses N=16, i.e. expands outputs by a factor of 16. Let me know if you still have any questions!
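A minimal sketch of the discretization formulas quoted above, reduced to a single scalar channel so that the matrix inverse becomes an ordinary division. The function name `discretize` and the example values of `a`, `b`, and `delta` are hypothetical and chosen only for illustration; this is not code from any Mamba implementation.

```python
import cmath

def discretize(a, b, delta):
    """Discretize one scalar channel (hypothetical stand-ins for entries of A and B).

    a_bar = exp(delta * a)
    b_bar = (delta * a)^-1 * (exp(delta * a) - 1) * (delta * b)
    """
    a_bar = cmath.exp(delta * a)
    b_bar = (a_bar - 1) / (delta * a) * (delta * b)
    return a_bar, b_bar

# Assumed example values: a has a negative real part, delta is a small step size.
a_bar, b_bar = discretize(a=-1 + 2j, b=1 + 0j, delta=0.01)
print(abs(a_bar))  # magnitude below 1 because Re(a) < 0
print(b_bar)       # close to delta * b when delta is small
```

For small `delta`, `b_bar` stays close to `delta * b`, which matches the remark that the discretization roughly amounts to scaling the input by Δ.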
@nialv7985 4 months ago
@@algorithmicsimplicity Thank you so much!
@timeflex 4 months ago
GPT mafia 😞 Probably just can't lose their faces and title of "the best LLM tech" (and, perhaps, contracts as well).
@drdca8263 5 months ago
Here's an idea that probably wouldn't work: instead of algebraically guaranteeing that some operation is a monoid operation, so that one can use the parallelizing trick that combines n inputs in O(log(n)) steps on n processors, what if you just had some operation, learned by a NN, which has "how much it deviates from being a monoid operation" as part of the loss? Like, suppose you randomly selected some pair of consecutive applications of the operation, also computed it in the opposite order, took the L^2 norm of the difference between the results, multiplied that by some weighting, and made that a term in the loss? Within the family of continuous and piecewise-smooth monoidal operations, perhaps some of them would be better at selective remembering?
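One way to read the proposed penalty is as a measure of how far an operation is from being associative on a sampled triple. The sketch below is only an interpretation of the comment: it uses plain Python lists instead of a learned network, takes the squared L2 distance for simplicity, and all names (`associativity_penalty`, `add`, `subtract`) are hypothetical.

```python
def associativity_penalty(op, a, b, c):
    """Squared L2 distance between op(op(a, b), c) and op(a, op(b, c))."""
    left = op(op(a, b), c)
    right = op(a, op(b, c))
    return sum((l - r) ** 2 for l, r in zip(left, right))

def add(u, v):
    """Elementwise addition, which is associative."""
    return [x + y for x, y in zip(u, v)]

def subtract(u, v):
    """Elementwise subtraction, which is not associative."""
    return [x - y for x, y in zip(u, v)]

a, b, c = [1, 2], [3, 4], [1, 2]
print(associativity_penalty(add, a, b, c))       # zero for an associative op
print(associativity_penalty(subtract, a, b, c))  # positive for a non-associative op
```

In the setting the comment describes, `op` would be the learned operation and this term would be multiplied by a weighting and added to the training loss.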
@algorithmicsimplicity 5 months ago
That sounds really interesting, you should try it out!
@drdca8263 5 months ago
@algorithmicsimplicity Thanks! Unfortunately I am lazy... And there's already another "what if I did [X]?" machine learning project I barely started ("what if I tried to add a simple approximation of what copying heads do to an n-gram model", which seems like it should be much easier), but I've barely written the n-gram model part of it (and ChatGPT honestly wrote most of that). I haven't even started on the "compute statistics about whether copying a word from earlier in the current text, or going based on the corpus as a whole, is more accurate in this context" part...
@CyrusEstavillo 4 months ago
@drdca8263 That's a lame response. Try it. Make something in this world.
@TheDoomerBlox 4 months ago
It's only yet another silly experiment to do the seemingly impossible in the hottest meme area, picking your nose seems like a more productive waste of time. But imagine, if you found something really cool and nobody would listen. That would be funny, that would be cool.
@gnaarW 4 months ago
@TheDoomerBlox If you were able to build a RecNN that outperforms current state-of-the-art models and put it on Hugging Face, people would care about that 🤷🏼‍♂️
@jarib3858 5 months ago
One small note on RNNs: reservoir computing uses a very high-dimensional random RNN with a linear regression readout, so there is no exploding or vanishing gradient. Reservoir computing is currently the standard for non-linear dynamic time series prediction.
@zzador 2 months ago
Yes but does it support backpropagation? Remember you have to propagate an error from the output layer through every RNN up to the inputs. Reservoirs/EchoStateMachines don't support this. There only the delta layer (linear regression layer) gets trained while the reservoir stays fixed. So you could get the error up to the first delta layer but not further.
@CHRISTICAUTION 2 months ago
Hi, can you recommend a paper about that
@terrortinus 1 month ago
@@zzador The wonder of it is that you don't need it to go further...
@cambrawal 1 month ago
@@terrortinus paper please
@Singularity606 5 months ago
There seems to be a growing zoo of related architectures that attempt to supersede the transformer. Besides Mamba, there's also RetNet, GLA, Based, and HGRN. And the secret upcoming xLSTM. Someone also mentioned RWKV. Are all these converging to something? And when will we see a frontier model based on this new paradigm?
@BooleanDisorder 5 months ago
The main problem with transformers is how compute scales with input length. Mamba tries to be as good at high-dimensional representations as transformers without the extreme compute scaling. So effectively we want much more complex representations in the end, without needing a nuclear power plant and a supercomputer for inference. Transformers could continue to get better, but it takes an astronomical amount of compute at the moment.
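As a rough illustration of the scaling difference, the sketch below counts the token pairs self-attention has to score against the number of steps a recurrence takes. It is a deliberately simplified model: constant factors, the model dimension, and any logarithmic factor from a parallel scan are ignored, and both function names are hypothetical.

```python
def attention_pair_count(num_tokens):
    """Self-attention scores every token against every token: n * n pairs."""
    return num_tokens * num_tokens

def recurrent_step_count(num_tokens):
    """A recurrence touches each token once: n steps."""
    return num_tokens

for n in (5, 1_000, 100_000):
    print(n, attention_pair_count(n), recurrent_step_count(n))
```

Going from 1,000 to 100,000 tokens multiplies the recurrent step count by 100 but the attention pair count by 10,000.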
@algorithmicsimplicity 5 months ago
I believe we are converging to hybrid Transformer and dynamic linear RNNs, such as Griffin, arxiv.org/abs/2402.19427 . There are already open source Mamba language models with a few billion parameters, training and testing full size models takes about a year.
@MrObveous777 4 months ago
@@algorithmicsimplicity "training and testing full size models takes about a year." why so long?
@BC-bn7xd 4 months ago
I think just training it can take weeks if not months ​@@MrObveous777
@kamdynshaeffer9491 5 months ago
Absolutely amazing vid. Just subbed after getting recommended to this channel. Never stop making videos dude
@Alulapower 5 months ago
Good video to explain mamba : I understand something
@harrysvensson2610 5 months ago
You see, it's O(n log(n)) instead of O(n^2) without any penalties. Okay? 100% crystal clear, right? //end of joke
@BooleanDisorder 5 months ago
@harrysvensson2610 That means that, basically, transformers scale as x² in the compute needed for prompting. This is also called square or quadratic scaling, since x² is the area of a square with side x. So if you write a prompt of 5 words, that's 25 units of compute, since 5*5=25. You can see how this gets really crazy at high token counts. Mamba scales differently, so you need much less compute per prompt.
@nikilragav 5 months ago
I really wish that when you're talking about things happening in parallel, your animations happened in parallel. Like 8:30. I think it would really improve the comprehensibility of your explanation
@TheParkitny 1 month ago
Very good explanation, and kudos for exposing the broken peer review system. Subscribed
@danverzhao9912 3 months ago
Actually best explanation channel on youtube, rivaling 3B1B!
@mehnot8193 4 months ago
Extremely noob question, but at 13:52 why aren't the input vectors x multiplied by P^-1 instead of P? Don't you need to convert them to the eigenbasis before applying the D transformation (or, equivalently, taking the Hadamard product with the diag(D) vector)?
@algorithmicsimplicity 4 months ago
Yes, I should have applied P^-1 first to be consistent with my earlier notation W=PDP^-1. Of course, the naming is just a matter of preference: you can equivalently call the first matrix that is applied P or P^-1; so long as the two matrices are inverses of each other, it doesn't matter which is called which.
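A small worked example of the W = PDP^-1 convention discussed here, under assumed 2x2 integer matrices chosen so the arithmetic is exact. Applying W five times directly gives the same result as converting to the eigenbasis with P^-1, scaling elementwise by the fifth powers of the diagonal, and converting back with P. All names and values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Assumed example: W = P @ diag(d) @ P_inv with integer entries.
P = [[1, 1], [0, 1]]
P_inv = [[1, -1], [0, 1]]
d = [2, 3]
W = [[2, 1], [0, 3]]

def apply_direct(x, steps):
    """Apply W to x repeatedly, one full matrix-vector product per step."""
    for _ in range(steps):
        x = matvec(W, x)
    return x

def apply_in_eigenbasis(x, steps):
    """Convert with P_inv, scale elementwise by d**steps, convert back with P."""
    z = matvec(P_inv, x)
    z = [d_i ** steps * z_i for d_i, z_i in zip(d, z)]
    return matvec(P, z)

print(apply_direct([1, 2], 5))
print(apply_in_eigenbasis([1, 2], 5))
```

Swapping the names of `P` and `P_inv` everywhere would change nothing in the computation, which is the point made in the reply.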
@mehnot8193 4 months ago
@@algorithmicsimplicity Oh ok, that makes sense now! Thanks a lot for your answer and this amazing video ^^
@augmentos 4 months ago
Great video, would prefer no music but that’s me
@ithaca2076 5 months ago
absolutely love the quality and information of this video!!! please keep up the good work this is amazing
@hackerborabora7212 5 months ago
This algo is new and you made a video about it I love you I will subscribe your channel keep going
@boogati9221 3 months ago
Crazy how two separate ideas ended up converging into one nearly identical solution.
@andrewy2957 3 months ago
Totally agree. I feel like that's pretty common in math, robotics, and computer science, but it just shows how every field in stem is interconnected.
@kacemabdelaziz4940 1 month ago
tmw you realize humanity is just being trained with gradient descent and we always converge to these local minima
@Sumpydumpert 3 months ago
So keep some aspects of privacy laws coherent but merge the different sides of the web in a quantum computer
@nias2631 3 months ago
I have no particular opinion on transformers or MAMBA since, for my work, I never use these. But as for peer review I think that Open Review itself is a great "filter for the filter". The research community can actively review the reasoning for accept/reject as you did in this video. For most journals not using Open Review the process is fairly opaque.
@algorithmicsimplicity 3 months ago
Absolutely agree, the transparent review process is definitely a net benefit for the community as a whole.
@kalimero86 1 month ago
If it's so good, why is nobody using it?
@jhonny1682 4 months ago
Can you make an explanation video like this one on Liquid Time Constant Networks 🙏
@Originalimoc 4 months ago
Hahahaha another joke in academia, probably some corruption inside I guess
@marloelefant7500 12 days ago
I honestly found the "boring technical details" the most interesting of the video.
@shirenlu5260 1 month ago
Wow this is a great video. I've been having a lot of trouble understanding and getting an intuition of how Mamba works, and this video just made it make sense. The visuals were a massive help and the explanations are super simple and easy to understand.
@davidespinosa1910 1 month ago
A+++ for OpenReview. Transparency is so valuable ! Also, many thanks for the excellent video !
@gunaysoni6792 4 months ago
You're overselling Mamba a little. Transformers are not yet dethroned and for many tasks they'll perform better than RNNs. Try giving Mamba a passage and then asking it a question based on the passage. RNNs are sensitive to the order of the information presented to them. Great video but you should also talk about the shortcomings of mamba
@himalayo 3 months ago
Have you read the Mamba 2 paper? They’ve figured out a way to make SSMs equivalent to attention-based models and called this state-space duality
@MarcosScheeren 4 months ago
Subscribed! Thats some 3Blue1Brown level stuff! Amazing!
@anrilombard1121 5 months ago
Currently testing it on molecular generation, so excited to see where these strengths hold and where they falter :)
@dntbther9298 4 months ago
How about RWKV ?
@goblinkoma 4 months ago
peer review be like thats a nice method for building houses, its a shame it doesn't also cook burgers what
@tomaskubicek5983 2 months ago
Hi, we were recently trying to implement the algorithm but we came across a bug that breaks it for us. The operation *((w1,x1),(w2,x2))=(w1.w2,w1.x1+x2) which you say we should use in parallel scan is not associative. Meaning (abc,a.a.b.d+a.b.e+f)=(ab,a.d+e)*(c,f)=((a,d)*(b,e))*(c,f) != (a,d)*((b,e)*(c,f)) = (a,d)*(b.c,b.e+f)=(a.b.c,a.d+b.e+f), Now i cannot wrap my head around this, I have found similar operation elsewhere and it just simply does not make sense to me.
@algorithmicsimplicity 2 months ago
The operation should be *((w1,x1),(w2,x2)) = (w1w2, w2x1+x2). For the operation to be associative we need *(*((w1,x1),(w2,x2)),(w3,x3)) = *((w1,x1),*((w2,x2),(w3,x3))).
Expanding the left side we get: *(*((w1,x1),(w2,x2)),(w3,x3)) = *((w1w2, w2x1+x2),(w3,x3)) = (w1w2w3, w3(w2x1+x2)+x3) = (w1w2w3, w2w3x1+w3x2+x3).
Expanding the right side we get: *((w1,x1),*((w2,x2),(w3,x3))) = *((w1,x1),(w2w3, w3x2+x3)) = (w1w2w3, w2w3x1+w3x2+x3).
So both sides are equal and the operation is associative.
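The associativity argument can also be checked numerically. The sketch below implements the corrected operation as `combine`, the version from the question as `combine_wrong`, and compares a prefix combination against the plain recurrence h_t = w_t * h_{t-1} + x_t with an assumed initial state of 0. Function names and the integer test values are hypothetical.

```python
def combine(left, right):
    """Corrected operation: (w1, x1) * (w2, x2) = (w1*w2, w2*x1 + x2)."""
    w1, x1 = left
    w2, x2 = right
    return (w1 * w2, w2 * x1 + x2)

def combine_wrong(left, right):
    """Version from the question: (w1*w2, w1*x1 + x2). Not associative."""
    w1, x1 = left
    w2, x2 = right
    return (w1 * w2, w1 * x1 + x2)

def recurrence(pairs):
    """Sequential h_t = w_t * h_{t-1} + x_t, starting from h = 0."""
    h, outputs = 0, []
    for w, x in pairs:
        h = w * h + x
        outputs.append(h)
    return outputs

def prefix_combine(pairs):
    """Running combination of all pairs so far; the second entry is h_t."""
    acc, outputs = None, []
    for pair in pairs:
        acc = pair if acc is None else combine(acc, pair)
        outputs.append(acc[1])
    return outputs

a, b, c = (2, 3), (5, 7), (11, 13)
print(combine(combine(a, b), c), combine(a, combine(b, c)))
print(combine_wrong(combine_wrong(a, b), c), combine_wrong(a, combine_wrong(b, c)))
print(recurrence([a, b, c]), prefix_combine([a, b, c]))
```

Because `combine` is associative, the pairs can be grouped in any order, which is what lets a parallel scan evaluate the recurrence.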
@Levy1111 3 months ago
I do hope you'll soon get at least 6 figures subscribers count. The quality of your videos (both in terms of education and presentation) is top notch, people need you to become popular (at least within our small tech bubble).
@vibertthio 2 months ago
Great video! It's not critical, but 13:05, the calculation has error (?). It should be ((1,-1),(2,3)) on the left hand side
@algorithmicsimplicity 2 months ago
Yes! Well spotted, I think you're the first person to notice.
@marloelefant7500 12 days ago
What about LSTMs? You briefly showed the paper, but didn't mention them, even though they were supposed to be the solution to the vanishing and exploding gradients problem.
@algorithmicsimplicity 12 days ago
LSTMs do better than regular RNNs at remembering. A regular RNN will forget what it saw 20 tokens ago; LSTMs can remember for a few hundred tokens, maaaybe up to 1000, but after that they forget as well. This is because LSTMs don't completely fix vanishing and exploding gradients, they just make them vanish more slowly (basically because the sigmoid gates they use saturate and can't output values extremely close to 0 or 1). When people say LSTMs fix vanishing and exploding gradients, they mean LSTMs suffer less from them than regular RNNs. Mamba, on the other hand, can remember for at least hundreds of thousands of tokens. Also, LSTMs aren't parallelizable, so it isn't practical to train large-scale LSTMs on modern hardware. Recently the author of LSTMs put out a new paper with new versions of LSTMs to fix these issues (called xLSTM), but from what I can tell xLSTM just performs worse than Mamba in every way.
@tempname8263 5 months ago
21:48 33%? Dude, it's 3.4x improvement. Measuring improvement relative to accuracy instead of error rate is dumb, since that'd mean that difference between 100% accuracy and 99% is just 1%, which is not representative of anything.
@harrysvensson2610 5 months ago
Everyone got issues when it comes to calculating with percentages. Here's an example: Imagine a game character with armor, the person got 98% damage reduction, and then puts on some more armor and reaches 99% damage reduction. How much less damage does the tank take compared to before putting on the extra armor? 100%? 50%? 1%? If you math it out it's obviously 50% less damage taken, since there's 2% between 98% and 100%. And one of those 2% is now removed, hence 1/2 -> 50% less damage taken compared to before. But you know what? Not everyone agrees that it is 50%. Understanding percentages is difficult.
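The armor example can be worked through in a few lines. This is a small sketch using whole percentages so the arithmetic is exact; the function name `compare_damage_reduction` is hypothetical.

```python
def compare_damage_reduction(old_reduction_pct, new_reduction_pct):
    """Return (absolute change in percentage points, relative change in damage taken)."""
    old_taken = 100 - old_reduction_pct   # damage still taken before, in percent
    new_taken = 100 - new_reduction_pct   # damage still taken after, in percent
    absolute = old_taken - new_taken
    relative = absolute / old_taken
    return absolute, relative

absolute, relative = compare_damage_reduction(98, 99)
print(absolute)  # change measured in percentage points
print(relative)  # fraction of the previous damage that is no longer taken
```

Going from 98% to 99% reduction is a change of 1 percentage point, but the damage actually taken drops from 2% to 1%, which is half of what it was. The same reasoning applies when an improvement is measured against the error rate rather than the accuracy.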
@BooleanDisorder 5 months ago
@harrysvensson2610 Yeah, the armor thing is a great example. The higher the damage and the more important a tank is, the more important that single percent becomes. It could literally mean the difference between surviving a blow from a boss and dying.
@ScorpioneOrzion 5 months ago
@harrysvensson2610 It depends: in the armor example, it's 1% absolute and 50% relative.
@harrysvensson2610 5 months ago
@@ScorpioneOrzion Exactly.
@tempname8263 5 months ago
@@harrysvensson2610 It's not like it's difficult, it's just that most people do leaps in logic, where they don't even think relative to *what* are they measuring the percentage
@luke.perkin.inventor 4 months ago
great video. That trick around the 26 minute mark of doing 16x compute almost for free (in terms of time) because of memory bottlenecks is really neat. I wonder how many other architectures would benefit from that kind of design optimisation?
@algorithmicsimplicity 4 months ago
It appears that it is only useful for linear recurrent layers, because the main computation is just performing elementwise multiplication between the previous output vector and the recurrent weight vector, which means you have O(d) parameters and you do O(d) compute, and transferring one parameter takes longer than doing one operation. For other kinds of layers, such as fully connected layers, you are doing at least a matrix-vector multiplication, which means you are doing O(d^2) compute, and that usually takes much longer than transferring O(d) parameters.
@야옹-m7h 1 month ago
Interesting, I saw this video just now and realized Mamba is really close to my research. If you are a data scientist, I would love to share my ideas with you. Mail me if you want.
@algorithmicsimplicity 1 month ago
I am indeed a data scientist, feel free to send me an e-mail. You can find my e-mail address on my YouTube page; click on "...more".
@iancurtis123 1 month ago
Lovely stuff. Thanks!
@gpjedy7379 2 months ago
Since this is also used to make long connections in the state space, might Mamba also be applied not just to language models but to gradient-optimising reinforcement learning models?
@algorithmicsimplicity 2 months ago
Yes, absolutely. Mamba has been applied to some other areas now, such as protein sequence modelling. I haven't heard of anyone applying it to reinforcement learning, but I imagine it would work very well.
@jingqianliu7078 2 months ago
Why do you say the Transformer uses linear memory complexity? Am I missing anything here?
@algorithmicsimplicity 2 months ago
Self attention is computed from a grid of n^2 pairs of vectors, but you don't need to materialize all of them in memory at the same time. You can, for example, materialize one column at a time. This way you only need O(n) memory (though still O(n^2) compute). You can check out FlashAttention for an efficient implementation of O(n) memory self attention.
@kalkhasse 4 months ago
I love how you nail the level of detail in the explanations. Perfect for me at least.
@justtoleavecomments3755 5 months ago
"Small models up to a few billion params" I think people have forgotten what small means 😂
@nyyotam4057 5 months ago
So how close is the weight estimator to the MMSE (minimal mean square error) estimator? Can the MAMBA arch be improved even more, using a sparse covariance matrix and an application of a 'true' Kalman filter? Or is it already as close as it can get?
@2255. 5 months ago
underrated channel
@tellu5493 5 months ago
This was very good, and I hope you make more videos like this!
@ThéoUscidda 1 month ago
At 27:30, why do we get quasi-linear O(n*log(n)) time complexity? Shouldn't it be linear O(n)? I'm surely missing something.
@algorithmicsimplicity 1 month ago
It depends on the algorithm used for the parallel scan, in this video I described an O(nlog(n)) algorithm, in practice there are actually O(n) parallel scan algorithms and Mamba uses one of them.
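To make the O(n log(n)) versus O(n) distinction concrete, here is a minimal sketch of one work-efficient scan: combine adjacent pairs, recursively scan the half-length sequence, then fill in the remaining positions. It is an illustrative assumption and not necessarily the algorithm used in Mamba's kernels. The names `combine`, `work_efficient_scan`, and `sequential_scan` are hypothetical, and the code runs sequentially, only counting how many combine operations a parallel implementation would have to perform.

```python
def combine(left, right):
    """Associative operator for the recurrence h_t = w_t * h_{t-1} + x_t."""
    w1, x1 = left
    w2, x2 = right
    return (w1 * w2, w2 * x1 + x2)

def work_efficient_scan(items, op, counter):
    """Inclusive scan; counter[0] accumulates the number of op calls."""
    n = len(items)
    if n <= 1:
        return list(items)
    # Combine adjacent pairs; these calls are independent of each other.
    paired = []
    for i in range(0, n - 1, 2):
        paired.append(op(items[i], items[i + 1]))
        counter[0] += 1
    # Scan the half-length sequence: sub[j] covers items[0..2j+1].
    sub = work_efficient_scan(paired, op, counter)
    out = [items[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])                     # already computed
        else:
            out.append(op(sub[i // 2 - 1], items[i]))   # one extra combine
            counter[0] += 1
    return out

def sequential_scan(items, op):
    """Reference scan: one combine per element, strictly in order."""
    out = [items[0]]
    for item in items[1:]:
        out.append(op(out[-1], item))
    return out

pairs = [(2, 1), (3, 1), (1, 4), (2, 0), (1, 1), (3, 2), (2, 2), (1, 5)]
counter = [0]
result = work_efficient_scan(pairs, combine, counter)
print(result == sequential_scan(pairs, combine), counter[0])
```

Each level of the recursion halves the sequence, so the total number of combine calls stays below 2n while the number of levels is about log2(n).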
@ThéoUscidda 19 days ago
@@algorithmicsimplicity I see, thanks a lot!
@ThéoUscidda 1 month ago
At 31:02, I agree that Mamba has linear O(n) memory requirements. However, why don't transformers have quadratic O(n^2) memory requirements? They need to store the attention matrices that are n x n. I'm surely missing something.
@algorithmicsimplicity 1 month ago
You don't need to materialize the full nxn matrix in memory at the same time. You can instead materialize only a chunk of it, sum over that chunk, and then materialize the next chunk in the same memory slot. This is how, for example, FlashAttention and FlashAttention2 work. When you do this the memory requirement is O(n).
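The chunking idea can be sketched for a single query vector: stream over the keys and values one chunk at a time and keep only a running maximum, a running normalizer, and a running weighted sum. This is a plain-Python illustration of the streaming-softmax idea under assumed toy inputs, not FlashAttention itself; all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_full(q, keys, values):
    """Reference: materializes all n scores for this query at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_chunked(q, keys, values, chunk_size):
    """Streams over keys in chunks, storing only running statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    running_out = [0.0] * dim
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [dot(q, k) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out = [o * scale for o in running_out]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out = [o + w * vj for o, vj in zip(running_out, v)]
        running_max = new_max
    return [o / running_sum for o in running_out]

q = [1.0, 0.5]
keys = [[0.1 * i, -0.2 * i] for i in range(7)]
values = [[float(i), float(i * i)] for i in range(7)]
print(attention_full(q, keys, values))
print(attention_chunked(q, keys, values, chunk_size=3))
```

The chunked version never holds more than `chunk_size` scores at a time, so repeating it for every query keeps memory linear in the sequence length while the total compute stays quadratic.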
@ThéoUscidda 19 days ago
@@algorithmicsimplicity very clear, thanks a lot!
@diabolo19x 4 months ago
Incredible work. I mean REALLY incredible
@harshvardhanv3873 4 months ago
we need more videos from you, especially one from basics
@algorithmicsimplicity 4 months ago
Any topics in particular you'd like to see?
@harshvardhanv3873 4 months ago
@algorithmicsimplicity We need video series on the math for ML (linear algebra, calculus, probability and statistics, each separately from an ML perspective), and after that we would like to learn more about basic concepts like regression, classification, clustering, etc. We would also like to learn more about the types of learning: unsupervised, semi-supervised and self-supervised. Also some basic architectures like RNN types (LSTM, GRU, hybrids), basic ANN, MLP and even the recent KAN and NTK.
@algorithmicsimplicity 4 months ago
@@harshvardhanv3873 Got it. I am definitely planning to do videos on calculus and probability for ML soon. After that I can do videos on the types of ML.
@harshvardhanv3873 4 months ago
@@algorithmicsimplicity sure waiting for your videos ✌
@sichengmao4038 4 months ago
Well, maybe 3b1b's videos already fulfill what you need on the prerequisites of ML.
@Adityak1997 6 days ago
Could you mention the sources for the tables and graph you showed at 23:46?
@algorithmicsimplicity 6 days ago
The graph is Figure 4b from the Mamba paper ( openreview.net/pdf?id=AL1fq05o7H ). The table I made by combining the numbers from the linear RNN paper ( openreview.net/pdf?id=M3Yd3QyRG4 ) with the transformer numbers provided in the S4 paper ( arxiv.org/pdf/2111.00396 ).
@jondo7680 2 months ago
Do you have a video comparing mamba to rwkv with benefits of each over the other?
@algorithmicsimplicity 2 months ago
I do not, I'd recommend checking out the latest papers for each (Mamba: arxiv.org/pdf/2405.21060 , RWKV: arxiv.org/pdf/2404.05892 ) and seeing which performs better on tasks that are similar to your use case.
@anthonybernstein1626 5 months ago
Amazing explanation, thank you!
@jhonyiigp 6 days ago
Incredible explanation
@tianlechen 1 month ago
Peer reviews are highly motivated by the reviewers protecting their existing work extending previously state-of-the-art methodologies. If you have an actually new innovation that goes against the grain, you need to publish regardless of whether the venue is highly regarded or not.
@SolathPrime 5 months ago
[6:28]: While that sounds somewhat good, in practice it doesn't work like that. Alternating between linear recurrent and non-linear dense layers doesn't give that much of a context advantage :( The gradients vanish or explode after a while, and it requires some sort of sigmoid transformation + some value. Say, for example, an architecture like this:
```plaintext
Dense -> Sigmoid -> Recurrent -> Dense -> Sigmoid -> Recurrent -> Dense -> Softmax
```
By the time the gradients reach the first Recurrent layer, they have lost most of their value :(
@tannergilliland6105 4 months ago
If you ever get the time, I would love to see another video on the Mamba implementation but dumbed down even more, to the level of StatQuest videos. They need to make you feel special while also showing the math step by step like it's 9th grade.
@algorithmicsimplicity 4 months ago
Thanks for the suggestion, there will probably be improved versions of Mamba coming out soon, I will make a more basic explanation video for them when they do.
@oleonardohn 5 months ago
I haven't found any significant evidence suggesting that Mamba models outperform Transformers, except that their attention mechanism does not scale quadratically with the context length. Am I missing something?
@ilonachan
@ilonachan 4 ай бұрын
I mean, even if it just accomplished the tasks about as well as transformers qualitatively, the better compute scaling alone is pretty significant.
@oleonardohn
@oleonardohn 4 ай бұрын
@@ilonachan Sure, but as far as I'm concerned, there is not much evidence it can qualitatively perform the same tasks either. Some people reported that Mamba's state space doesn't perform as well as true attention for long contexts.
@blacklistnr1
@blacklistnr1 5 ай бұрын
Nice video! What I didn't understand is what happens to the stable weights during training. Particularly:
- How are they kept stable?
- How can the model learn while being so restricted?
What I'm guessing is that some form of the Delta is also used in training to keep the weights in those ranges + rely a lot more on the accuracy to carry the information. Is this correct? Does it imply that using double instead of float gives it a better ability to learn?
@algorithmicsimplicity
@algorithmicsimplicity 5 ай бұрын
Great question. The answer is it's really complicated and no one knows for sure. There is nothing explicitly keeping the weights stable during training. They can (and probably do) become unstable. The thing is, there are actually thousands of different weights in the vector. At initialization, all of the weights are essentially one, so information from anywhere in the input can influence the gradient, but the model is incredibly restricted (cannot perform meaningful transformations in the recurrence). Then SOME of those weights change and enter the unstable regime, so they can no longer carry information long distance but can do more interesting computations, while others remain stable. And in the fully-connected layers between recurrences, all weights can communicate information with each other. So you have this complicated system where weights are changing at different rates, some remain stable, some become unstable, and that allows for interesting computation to be done and information to be propagated long distances.
@blacklistnr1
@blacklistnr1 5 ай бұрын
@@algorithmicsimplicity Thanks for the reply! That's quite interesting, different propagation lengths didn't even cross my mind. It'd be really funny if after all this work the model learned unstable weights and became forgetful :))
@TTminh-wh8me
@TTminh-wh8me 2 ай бұрын
Just watched the lecture by Mohit, then watched your video. Feels like this made me understand the architecture better than reading those papers for months 😂
@oraz.
@oraz. 5 ай бұрын
One thing I don't understand is the HIPPO matrix, and what they mean by a structured matrix in the context of differential equations.
@unkarsthug4429
@unkarsthug4429 5 ай бұрын
People keep making things that they say are "better than transformers", but none of them are actually getting used. At this point, hearing people say that has sort of become meaningless from the number of false alarms. Feels like every few months we have something "better than transformers", like RetNets were claimed to be. We'll have to wait and see which actually turn out to be better with time.
@algorithmicsimplicity
@algorithmicsimplicity 5 ай бұрын
Yep, but Mamba is different, it is already being used in open source language model projects.
@Supreme_Lobster
@Supreme_Lobster 4 ай бұрын
investor money is generally spent conservatively. it will take at least a few months for them to see the upside in divesting from super large transformers and moving on to MAMBA (or upcoming derivatives). Remember, Transformer was first published in 2017, and it took until at least 2020 for any "large" (> 3B) model to come out.
@drjenschn
@drjenschn Ай бұрын
Quick question: I guess if you want a true linear recurrence from real-valued to real-valued, you could use the Hermitian of P for P^-1? That would also eliminate optimizing for Q...
@algorithmicsimplicity
@algorithmicsimplicity Ай бұрын
You could, but there isn't really any need to. The complex version performs the same as strictly real recurrences (actually, in some cases better). And optimizing for Q doesn't really have much cost, even if you used the Hermitian of P in place of Q you would still need to back-prop through it.
@drjenschn
@drjenschn Ай бұрын
@@algorithmicsimplicity Although I still don't get the backprop argument... If you backpropagate through P, computing the Hermitian has a closed-form solution... It's the complex version of a matrix transpose.
@algorithmicsimplicity
@algorithmicsimplicity Ай бұрын
@@drjenschn Sure, say we compute the output of a layer as y = P^T D P x. When we are backpropagating we need to pass the gradient back through to x, which means computing (P^T D P)^T y', where y' is the incoming gradient with respect to y. If you use a completely separate Q instead of P^T, computing this gradient still has the same cost. The only advantage of reusing P is you don't have to update the Q matrix as well, but updating weights is a relatively small computation compared to calculating (Q D P)^T y'.
@drjenschn
@drjenschn Ай бұрын
@@algorithmicsimplicity Got it now. I was originally talking about "optimizing" for P^-1 (learning the matrix weights). Back-prop is still necessary, correct. Thx!
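The cost argument in this thread can be made concrete with a small sketch. It is only an illustration under assumptions: the matrices are real-valued 2x2 examples with made-up numbers, the loss is taken to be the sum of the outputs, and the helper names (`matvec`, `forward`, `grad_wrt_x`) are hypothetical. The point it shows is that passing a gradient back to x takes the same three products whether Q is tied to P^T or learned separately.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def forward(q, d, p, x):
    """y = Q diag(d) P x, evaluated right to left as three cheap products."""
    px = matvec(p, x)
    dpx = [di * v for di, v in zip(d, px)]
    return matvec(q, dpx)

def grad_wrt_x(q, d, p, g):
    """Given g = dLoss/dy, return dLoss/dx = P^T diag(d) Q^T g."""
    qtg = matvec(transpose(q), g)
    dqtg = [di * v for di, v in zip(d, qtg)]
    return matvec(transpose(p), dqtg)

p = [[1.0, 2.0], [0.5, -1.0]]
d = [0.9, 0.5]
q_tied = transpose(p)                # the "reuse P" option: Q = P^T
q_free = [[0.3, -0.7], [1.1, 0.2]]   # a separately learned Q (made-up numbers)
x = [1.0, -2.0]
g = [1.0, 1.0]                       # upstream gradient for Loss = y[0] + y[1]

# Both calls do exactly the same amount of work.
grad_tied = grad_wrt_x(q_tied, d, p, g)
grad_free = grad_wrt_x(q_free, d, p, g)
```

Tying Q to P^T only saves the parameter update for Q, which matches the reply above: the backward pass through the layer is unchanged.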
@saiipranay995
@saiipranay995 Ай бұрын
very well explained
@IllIl
@IllIl 4 ай бұрын
Thank you! Your channel is an invaluable resource on here. Hope you keep making these videos!
@ArtArtisian
@ArtArtisian 5 ай бұрын
Eh - re controversy, I don't think peer review broke here. The *conference* however, took a much deserved status hit for mismanaging review.
@MrStevemur
@MrStevemur 4 ай бұрын
I appreciate the soothing piano music. Currently the words are only slightly better than Charlie Brown listening to adults talk, but I hope to dive in.
@markdatton1348
@markdatton1348 5 ай бұрын
Awesome video. I love the speed and the depth of this, it's perfect
@karigucio
@karigucio 3 ай бұрын
So the transformation applied to the weights isn't purely about initialization? Instead, in the expression w=exp(-exp(a)*exp(ib)), the numbers a and b are the learned parameters and not w, right?
@algorithmicsimplicity
@algorithmicsimplicity 3 ай бұрын
Yes a and b are the learned parameters.
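A minimal sketch of that parameterization, taking the expression exactly as written in the question above. The function names and the values of a and b are illustrative assumptions, not taken from the video. Gradient descent would update a and b, and w is recomputed from them whenever it is needed.

```python
import cmath
import math

def recurrent_weight(a, b):
    """w = exp(-exp(a) * exp(i*b)), with a and b as the real learned parameters."""
    return cmath.exp(-math.exp(a) * cmath.exp(1j * b))

def magnitude_closed_form(a, b):
    """|w| = exp(-exp(a) * cos(b)): only the real part of the exponent sets the size."""
    return math.exp(-math.exp(a) * math.cos(b))

a, b = -3.0, 0.4   # illustrative values only
w = recurrent_weight(a, b)
size = abs(w)
```

Under this form the magnitude of w stays below 1 whenever cos(b) is positive, and it moves towards 1 as a becomes more negative, so the optimizer can change a and b freely without having to constrain w directly.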
@MarcosPedroLeal
@MarcosPedroLeal 5 ай бұрын
Loved your videos. Which software or library do you use to make these animations? Is it manim?
@algorithmicsimplicity
@algorithmicsimplicity 5 ай бұрын
It is a combination of Manim (for rendering latex) and my own renderer written in Pytorch (for 3d stuff).
@suleymanemirakin
@suleymanemirakin 2 ай бұрын
Very nice
@OscarTheStrategist
@OscarTheStrategist 5 ай бұрын
Amazing video, insta-sub!
@alexmomot6268
@alexmomot6268 5 ай бұрын
Thx a lot for the interesting video! 💛💙
@blutwurst9000
@blutwurst9000 4 ай бұрын
Love the video but I have a question: shouldn't the approximation at 17:00 be something like n*w^(n-1)*0.001*x, so isn't there an n missing? Or how was the approximation done?
@algorithmicsimplicity
@algorithmicsimplicity 4 ай бұрын
Ahh yes you're right, there should be an n out the front, the gradient is proportional to nw^(n-1)x. The vanishing/exploding gradient arguments are still the same though, the linear scaling factor doesn't matter compared to the exponential scaling for large n.
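The corrected gradient can be checked numerically. This is a small sketch assuming a scalar recurrence that multiplies by w at every step, so the output after n steps is w^n * x. The function names and the values 0.9, 1.1 and n = 100 are chosen only for illustration.

```python
def output_after_n_steps(w, x, n):
    """Apply the scalar linear recurrence h <- w * h for n steps, starting from x."""
    h = x
    for _ in range(n):
        h = w * h
    return h

def analytic_gradient(w, x, n):
    """d/dw of w**n * x, including the factor n from the power rule."""
    return n * w ** (n - 1) * x

x, n = 1.0, 100
grad_shrinking = analytic_gradient(0.9, x, n)   # |w| < 1: the gradient vanishes
grad_growing = analytic_gradient(1.1, x, n)     # |w| > 1: the gradient explodes
```

The factor n grows only linearly while w^(n-1) shrinks or grows exponentially, so including it does not change the vanishing or exploding behaviour.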
@nixonmanuel6459
@nixonmanuel6459 Ай бұрын
Thank you!
@downloadableram2666
@downloadableram2666 3 ай бұрын
State-space models are not necessarily from ML, they're used a lot in control systems actually. Not surprised by their relationship considering both are strongly based on linear algebra.
@hunter13971
@hunter13971 2 ай бұрын
Great explanation, do one for Mamba 2 as well, if possible
@novantha1
@novantha1 3 ай бұрын
Fascinating video. I've always found state space model papers a little bit dense and self-referential to understand coming from other areas of ML but this video is a really great reparameterization of the issue. I'm not sure if it would be in line with previous videos (covering generally useful industry standard models with wide applications), but is there any possibility of getting a video on liquid neural networks or spiking neural networks?
@algorithmicsimplicity
@algorithmicsimplicity 3 ай бұрын
Thanks for the feedback. I probably won't get around to making videos on spiking and liquid neural networks for a while, I have lots of other stuff I'm planning to cover, but they are definitely on my todo list!
@wargreymon2024
@wargreymon2024 4 ай бұрын
The level of details and intuition you dig into are excellent 💯🔥
@nothreeshoes1200
@nothreeshoes1200 3 ай бұрын
Please make more videos. They’re fantastic!
@f14-werto
@f14-werto 5 ай бұрын
I believe that the transformer does have a quadratic cost in memory (specifically self-attention (SA)). The attention matrix in SA is n by n, thus n^2 (n being the number of tokens). The reviewer is probably referring to that bit. Anyway, rejecting Mamba was hecking stupid. Great video!
@algorithmicsimplicity
@algorithmicsimplicity 5 ай бұрын
The matrix is indeed n^2, but you never need to materialize the full matrix at the same time. You can materialize one column at a time, which is exactly what FlashAttention does, resulting in O(n) memory (still O(n^2) compute though).
@f14-werto
@f14-werto 5 ай бұрын
I have no idea how flash attention manages to be faster and more memory friendly. Are you sure that the attention matrix is never fully in memory (regardless of the type of memory)? However, the classical implementation didn't use flash attention, so I believe that the reviewer is referring to that.
@f14-werto
@f14-werto 5 ай бұрын
I have rechecked the paper and it appears that flash attention is linear with respect to memory. The work of Tri Dao is magic to me.
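For anyone else wondering how attention can avoid holding the n by n matrix, here is a rough sketch of the underlying idea: a softmax that is updated one key at a time. It is not FlashAttention's actual kernel, which works on tiles in fast on-chip GPU memory. It is plain Python for a single query, with hypothetical function names and made-up numbers, and it only shows that the running state per query can stay constant in size.

```python
import math

def attention_row_naive(q, keys, values):
    """Reference: build all n scores for this query, then softmax and average."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_row_streaming(q, keys, values):
    """Same result, visiting one key/value pair at a time with constant extra state."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, score)
        # Rescale everything accumulated so far to the new reference maximum.
        scale = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * scale + weight
        acc = [a * scale + weight * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
query = [0.5, -0.5]

naive = attention_row_naive(query, keys, values)
streamed = attention_row_streaming(query, keys, values)
```

Every query still visits every key, so the compute stays O(n^2). Only the memory drops, because no list of n scores per query is ever stored.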
@itsyaro1297
@itsyaro1297 3 ай бұрын
Hey man! Really appreciate the technical detail in your videos
@algorithmicsimplicity
@algorithmicsimplicity 3 ай бұрын
Thanks for the suggestion, I will add them to the TODO list.
@pi5549
@pi5549 5 ай бұрын
Another beautiful exposition. Further points: (1) HiPPO itself comes from attempting to approximate a spiking net with a SSM (Voelker 2017/8), (2) we do have O(NlogN) transformer hacks now, (3) RWKV is a promising arch that deserves a place in this arena.
@algorithmicsimplicity
@algorithmicsimplicity 5 ай бұрын
I haven't heard of any O(NlogN) transformer hacks that preserve performance, got any links? And yeah RWKV is promising, I would've loved to talk about it as well but the video was getting long lol.
@BooleanDisorder
@BooleanDisorder 5 ай бұрын
You have such a pleasant voice 😊 Thanks for helping me understand better. Please keep making videos. ❤
@luke.perkin.inventor
@luke.perkin.inventor 4 ай бұрын
in para-lllelll :-D
@thebrownfrog
@thebrownfrog 4 ай бұрын
@yoloswaggins2161
@yoloswaggins2161 3 ай бұрын
A guy who actually understands this stuff
@maximilianchrzon4545
@maximilianchrzon4545 5 ай бұрын
Your videos are so good man, keep it up, seriously. Although that is probably beneath you, could you maybe make a video on how neural networks are computed on machines in general, or maybe on GPUs? As someone who did not learn computer science in uni, this would be an interesting topic for me to learn and maybe fundamentally understand NNs better.
@algorithmicsimplicity
@algorithmicsimplicity 5 ай бұрын
That's an interesting topic, I was planning on making videos about how CPUs and GPUs work at the physical level (e.g. logical gates are built out of transistors, addition and multiplication are built out of logical gates). Neural nets are just implemented as a bunch of matrix multiplications (you put all the neuron weights in one matrix and multiply it with the input). Is that what you are asking about?
@maximilianchrzon4545
@maximilianchrzon4545 4 ай бұрын
@algorithmicsimplicity yeah that sounds about right, thank you. Maybe you could use matrix multiplication as a case example on those inner workings :) anyways, thanks for making awesome videos
@ArtOfTheProblem
@ArtOfTheProblem 4 ай бұрын
@@maximilianchrzon4545 3b1b has this covered pretty well already
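The "all the neuron weights in one matrix" description from the reply above can be sketched in a few lines. This is a toy illustration in plain Python. The bias term, the ReLU nonlinearity and all the numbers are assumptions added for the example, and real frameworks hand the same product to optimized CPU or GPU routines.

```python
def dense_layer(weights, biases, inputs):
    """One fully-connected layer: each row of `weights` holds one neuron's weights."""
    pre_activation = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    # ReLU nonlinearity (an illustrative choice).
    return [max(0.0, value) for value in pre_activation]

weights = [[1.0, -1.0, 0.5],   # neuron 1
           [0.0, 2.0, 1.0]]    # neuron 2
biases = [0.0, -1.0]
inputs = [2.0, 1.0, 4.0]

outputs = dense_layer(weights, biases, inputs)   # [3.0, 5.0]
```

Each output is one row of the matrix multiplied with the input, so a whole layer is a single matrix-vector product followed by an elementwise function.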
@francescorossi7582
@francescorossi7582 3 ай бұрын
Thanks for the video. Why do you use matrix diagonalization instead of SVD in 13:00? SVD can decompose any matrix and you do not need to introduce complex numbers. The power trick also works with SVD wrt the singular values.
@algorithmicsimplicity
@algorithmicsimplicity 3 ай бұрын
With SVD you get W=USV for a diagonal matrix S, but U and V are not necessarily inverses of each other, so when you take W^2=USVUSV you can't cancel out the inner VU.
@francescorossi7582
@francescorossi7582 3 ай бұрын
@@algorithmicsimplicity you are right, in my mind i was assuming W to be symmetric
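The cancellation that diagonalization gives can be checked on a tiny example. This sketch uses a hand-picked real 2x2 matrix P with a known exact inverse, so no complex numbers appear. The matrices and helper names are illustrative assumptions. The resulting W is deliberately not symmetric, which is the case where SVD would not give the same shortcut.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matrix_power(m, n):
    """Naive m**n by repeated multiplication (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = matmul(result, m)
    return result

# Hand-picked example: P, its exact inverse, and the diagonal entries of D.
p = [[1.0, 1.0], [0.0, 1.0]]
p_inv = [[1.0, -1.0], [0.0, 1.0]]
d = [2.0, 3.0]

def from_diagonal(diag):
    """Build P diag(...) P^-1 from the two diagonal entries."""
    diag_matrix = [[diag[0], 0.0], [0.0, diag[1]]]
    return matmul(matmul(p, diag_matrix), p_inv)

w = from_diagonal(d)                                 # W = P D P^-1
n = 5
slow = matrix_power(w, n)                            # n - 1 full matrix products
fast = from_diagonal([value ** n for value in d])    # only the diagonal is powered
```

Both routes give the same matrix because every inner P^-1 P pair cancels to the identity, leaving P D^n P^-1.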
@Neomadra
@Neomadra 4 ай бұрын
Why make peer review public when there's no way to remove obviously wrong reviews
@algorithmicsimplicity
@algorithmicsimplicity 4 ай бұрын
It's supposed to act as an incentive for peer reviewers to write quality reviews, since they can be seen by anyone. Didn't work in the case of Mamba lol.