Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)

148,798 views

Yannic Kilcher

1 day ago

Comments: 131
@YannicKilcher (1 year ago)
OUTLINE:
0:00 - Introduction
0:45 - Transformers vs RNNs vs S4
6:10 - What are state space models?
12:30 - Selective State Space Models
17:55 - The Mamba architecture
22:20 - The SSM layer and forward propagation
31:15 - Utilizing GPU memory hierarchy
34:05 - Efficient computation via prefix sums / parallel scans
36:01 - Experimental results and comments
38:00 - A brief look at the code
@HoriaCristescu (1 year ago)
I have to say - your return to making more frequent videos is making me very happy. I used to watch your videos before reading the papers.
@ousooo (10 months ago)
good to see you here, Horia. :)
@stephaneduhamel7706 (1 year ago)
11:15 They did experiments up to 3 billion parameters iirc. There is a mamba 3B model available on huggingface at least
@hunterkudo9832 (1 year ago)
How does it compare to a 3B transformer model?
@greengoblin9567 (1 year ago)
@@hunterkudo9832 It's better
@TheRyulord (1 year ago)
@@hunterkudo9832 It's roughly on par with a 7B param transformer
@bmoharryman5809 (1 year ago)
This is hot.
@erickmarin6147 (1 year ago)
(___0__)__(____0__) \_________________/
@SebastianRaschka (11 months ago)
This is a great video with a great rundown of Mamba. I was traveling when the Mamba paper came out and coincidentally stumbled upon this video today. It was a big time-saver to catch me up on the gist of it. I'll make sure to watch more of your videos in the future. Big thumbs up!
@YangLi-gw9nb (6 months ago)
This is definitely the best explanation video for Mamba I've seen. Thank you!
@OperationDarkside (1 year ago)
These kinds of videos are great on an early christmas morning. You know you are still not really awake. You won't get it anyway. But it kickstarts your brain into work mode.
@barni_7762 (1 year ago)
I'd love another video diving deeper!
@jonatan01i (1 year ago)
I could see this as a "memory" architecture for an actual transformer, remembering distinctive contexts for a long time, while using the transformer for the much more complicated and sophisticated logical reasoning where directed focus and attention are needed.
@albertmashy8590 (1 year ago)
Huge for assistants, and agents with longer term memory, as well as AI companions
@dibbidydoo4318 (1 year ago)
transformers have problems of their own for composition and logical reasoning right?
@vzxvzvcxasd7109 (1 year ago)
Papers! TBH, I don't even watch any other vids on this channel.
@sagetmaster4 (1 year ago)
There are other videos?
@Hexanitrobenzene (1 year ago)
@@sagetmaster4 Yannic had pretty good ML news videos but then he got busy with OpenAssistant and probably other things... Now I usually watch "AI Explained" for ML news.
@EdFormer (10 months ago)
@@Hexanitrobenzene That OpenAI fanboy hype merchant, who probably can't define a perceptron, is no replacement for Yannic's ML News. Better than nothing, but while Kilcher's coverage gave me the feeling of being at an academic event for people who actually study and engineer ML techniques, to learn about new ideas they might want to apply and to help them stay up to date, AI Explained seems to just want to blow the minds of tech enthusiasts who are only interested in how close to AGI we are, and it always leaves me with a strong smell of bull regarding the significance of whatever he's been making out is a huge leap forward. How much the general culture of the AI community has gone downhill since ChatGPT was released and caught the attention of pseudo-intellectuals is so depressing.
@Hexanitrobenzene (10 months ago)
@@EdFormer I don't think such a strong criticism is warranted. You described him like he was ColdFusion or something like that :) Of course, if you want a raw engineering take, he is not a replacement, but considering all the countless clickbaity AI related videos on my side bar, he seems the most serious of those whose audience is semi-technical. He always reads the papers thoroughly, catches some less well known trends, does his own investigations. He even did a thorough analysis of MMLU dataset bugs. Not a NeurIPS level researcher, but still a serious guy.
@EdFormer (10 months ago)
@@Hexanitrobenzene absolutely fair response. I guess I have just channelled my dismay regarding the decline of the once great culture of our community at him specifically, since his channel is the only example of these new AI channels (which I think are illustrative of that decline) that I am prepared to engage with. I still watch his videos, so must find some value there. I just really miss being able to rely on regular, diverse, and legitimately technical/academic content from the likes of Yannic and Letitia that really added another dimension to my never ending literature review. Even Edan Meyer, an undergrad at the time, provided the perspective that I think is sorely missed. I feel these channels have struggled to keep up with the cookie cutter garbage that is now filling my recommendations, and again, I am probably focusing my blame for that on AI Explained. One valid criticism I have of AI Explained though is the misleading perspective I think he provides that is AI = LLMs. I'm probably very biased as someone with a background in computer vision, but there's so much more going on than just LLMs. I find it mind blowing, given how things were previously, that I cannot find a good deep dive video on the Segment Anything Model (SAM) or DINOv2.
@varunsaagars (1 year ago)
🎯 Key Takeaways for quick navigation:
00:00 📜 *This video discusses the Mamba paper, which introduces a linear-time sequence modeling approach with Selective State Spaces.*
01:19 🔄 *Transformers have the advantage of dynamic and selective attention but suffer from quadratic computation and memory requirements, while RNNs have limited memory and scalability.*
03:05 🔄 *Backpropagation through time in RNNs can be memory-intensive and lead to gradient problems.*
05:51 🔄 *State space models, like S4, offer an alternative to RNNs and Transformers with linear computation but lack data-dependent transitions.*
09:34 🔄 *Mamba introduces Selective State Spaces, relaxing the input-independence constraint while retaining linear computation, making it a hybrid between an SSM and an LSTM.*
11:39 🚀 *Mamba is competitive with Transformers, especially on long sequences and dense data like language and genomics.*
12:22 📚 *The paper addresses computational efficiency and context-based reasoning, improving on previous SSM models.*
16:12 🤖 *Mamba's architecture combines selective state spaces with other components, providing linear scaling in sequence length.*
18:40 🚀 *Mamba offers fast training and inference, scaling linearly in sequence length during training and handling sequences up to one million tokens long.*
26:33 🧮 *The video works through the computation of y3 as an example, showing how it depends on the previous time steps and the A, B, and C matrices.*
28:28 🖥️ *When the parameters are input-independent, certain quantities can be precomputed, making it cheap to compute the output for a new sequence.*
29:38 🧩 *Mamba makes the parameters input-dependent and recovers efficiency through GPU memory optimization.*
32:07 🚀 *The paper reduces data movement and utilizes fast memory (SRAM) for matrix multiplications, resulting in significant speed improvements.*
34:21 🧬 *Mamba performs the scan operation differently, using a prefix sum / parallel scan to accommodate input-dependent elements.*
36:18 📈 *Mamba shows promising scaling performance for large-scale sequence models, outperforming other attention-free models.*
37:23 💻 *Inference throughput on an A100 GPU is good and improves as batch size increases when compared to Transformers.*
37:36 🧪 *The paper discusses the intricacies of the efficient implementation, providing insights into memory transfers and cost reductions.*
39:50 📊 *The code implementation includes components such as input projection, 1D convolution, discretization, recurrence, and gating pathways for efficient sequence modeling.*
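For anyone skimming these takeaways, here is a deliberately naive sketch of the recurrence the video walks through around 22:20 and 29:38: A stays a learned parameter, while the step size delta and the B/C projections are computed from the input, which is the "selective" part. This is my own simplified illustration (Euler-style discretization, made-up shapes and weight names), not the paper's fused CUDA kernel:

import numpy as np

def selective_ssm_reference(x, A, W_delta, W_B, W_C):
    # x: (L, D) input sequence; A: (D, N) learned transition parameter (fixed across time)
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                           # one N-dim hidden state per channel
    y = np.zeros((L, D))
    for t in range(L):
        delta = np.log1p(np.exp(x[t] @ W_delta))   # softplus -> positive, input-dependent step size, (D,)
        B = x[t] @ W_B                             # input-dependent "input" projection, (N,)
        C = x[t] @ W_C                             # input-dependent "output" projection, (N,)
        A_bar = np.exp(delta[:, None] * A)         # discretized transition for this token, (D, N)
        B_bar = delta[:, None] * B[None, :]        # simplified (Euler) discretization of B, (D, N)
        h = A_bar * h + B_bar * x[t][:, None]      # h_t = A_bar * h_{t-1} + B_bar * x_t
        y[t] = (h * C[None, :]).sum(-1)            # y_t = C h_t, per channel
    return y

# toy shapes, random weights just to show the wiring
L, D, N = 8, 4, 16
rng = np.random.default_rng(0)
A = -np.exp(rng.standard_normal((D, N)))           # negative entries keep exp(delta*A) below 1
y = selective_ssm_reference(rng.standard_normal((L, D)), A,
                            rng.standard_normal((D, D)),
                            rng.standard_normal((D, N)),
                            rng.standard_normal((D, N)))
print(y.shape)                                     # (8, 4)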
@josephle5686 (1 year ago)
I work with state space models as a control/optimization engineer on a daily basis. But that diagram of the state space model has got to be the most confusing thing I've seen in my life lol
@brunomaruszczak4016 (1 year ago)
Agreed, both state space models and the Kalman filter have been so thoroughly described in every control theory handbook, and I have never seen something like this diagram 😅
@ivanzhelyazkov8099 (11 months ago)
can you guys recommend a handbook with a clearer representation?
@brunomaruszczak4016 (9 months ago)
@@ivanzhelyazkov8099 Well I think you could find some good introduction to state space systems in Signals and Systems: Fundamentals by Gang Li and others. Chapter 7 is on state space models but I would also recommend reading up on general properties of LTI systems. There could be better books though, I’ve read it quite some time ago
@brunomaruszczak4016 (9 months ago)
@@ivanzhelyazkov8099 Signals and Systems: Fundamentals by Gang Li and others. Chapter 7 deals with SS models, but recommend looking at chapter 1 and LTI systems as well
@kimchi_taco (1 year ago)
I think this is very similar to "Retentive Network", which Yannic covered a few months ago. The state transition model reminds me of the linear Kalman filter. Anyway, I can't believe a single vector memory can carry all the necessary information for every token, one size fits all.
@andreaterlizzi (1 year ago)
Well, they actually mention this in the paper, and yeah Kalman filters are a type of state-space model
@axe863 (8 months ago)
@andreaterlizzi lol if that's true, can it really handle non-Gaussian processes?
@albertmashy8590 (1 year ago)
This is gonna be crazy if you think about it. It's like you could initialize an "assistant" or agent with a huge prompt, but rather than including that information every time, you "save" that state space to save on compute for generating the next tokens, because the prompt doesn't need to be re-loaded every time. This also means that agents could all have their own different personalities and behaviors without significant fine-tuning requirements.
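That is exactly what a fixed-size recurrent state buys you, at least in principle: run the long system prompt through the model once, snapshot the state, and start every conversation from that snapshot instead of re-reading the prompt. A toy sketch of the idea with a hypothetical step() interface (not Mamba's actual API):

import copy

class TinyRecurrentLM:
    # Stand-in for any constant-state recurrent model; the interface here is invented for the sketch.
    def __init__(self):
        self.state = 0.0          # real models keep a tensor (or several) per layer

    def step(self, token_id):
        # toy update: fold the token into the state and emit a fake "next token" id
        self.state = 0.9 * self.state + 0.1 * token_id
        return int(self.state) % 100

def precompute_persona(model, persona_tokens):
    for tok in persona_tokens:    # read the long system prompt exactly once
        model.step(tok)
    return copy.deepcopy(model.state)

def chat(model, saved_state, user_tokens, max_new=5):
    model.state = copy.deepcopy(saved_state)   # restore the snapshot instead of re-reading the prompt
    nxt = 0
    for tok in user_tokens:
        nxt = model.step(tok)
    out = []
    for _ in range(max_new):
        nxt = model.step(nxt)
        out.append(nxt)
    return out

model = TinyRecurrentLM()
persona_state = precompute_persona(model, persona_tokens=[7, 42, 13] * 100)
print(chat(model, persona_state, user_tokens=[5, 9]))
print(chat(model, persona_state, user_tokens=[3]))   # second conversation, same "personality", no prompt re-read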
@Summersault666 (1 year ago)
Finally, I knew I could count on you!
@pladselsker8340 (1 year ago)
Thank you for the paper review, it always helps!! Happy holidays to everyone 🍾
@ArthurColle-u2v (1 year ago)
Happy holidays
@jsalsman (1 year ago)
Very happy transformers aren't the only game in town.
@vikaspoddar001 (1 year ago)
I was waiting for your review
@bibhabasumohapatra (1 year ago)
More than the merits and demerits versus transformers, the best part is its inter-modality between text, audio, and voice clips.
@vladimirtchuiev2218 (1 year ago)
This looks a lot like the state-space representation from control theory. What they are presenting is basically a learnable dynamic system with a linear state transition matrix A and an input-dependent input matrix B, which makes it non-linear, same as for the observation matrix C. Looks like a massive upgrade over transformers for stuff like music generation, maybe even ViT-based models. What isn't clear to me is how they learn the A matrix; it seems that the farther away the context is, the more severe the vanishing gradient problem, and the nearest elements in the sequence are by far the most significant.
@patrickl5290 (1 year ago)
Something about the “selective” piece maybe? My thought would be that the forgetting rules differ based on a lot of factors, so there are many opportunities for the model to be induced to not forget potentially relevant details
@邵斌-n2r (11 months ago)
The A matrix is also input-dependent, but yeah, it's hard to believe that it can "remember" things that are 10^6 tokens upstream (basically the K term in eq. 3a and 3b).
@xxlvulkann6743 (8 months ago)
@@邵斌-n2r He says A is still a parameter at 30:13, and it shows in the paper as well. I think he made a mistake when he said A was input-dependent, because it seems not to be.
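For what it's worth, my reading of the paper matches the last reply: A itself stays a learned parameter, and it only becomes effectively input-dependent through the learned step size delta(x), because the discretized transition is exp(delta * A). A compressed sketch of just that piece (shapes simplified, names my own):

import numpy as np

def discretize(A, delta):
    # A: (D, N) learned parameter; delta: (D,) input-dependent step size (softplus of a projection of x_t)
    return np.exp(delta[:, None] * A)              # per-token transition A_bar

D, N = 4, 8
rng = np.random.default_rng(0)
A = -np.exp(rng.standard_normal((D, N)))           # negative entries keep |A_bar| < 1 (a decaying memory)
print(discretize(A, np.full(D, 0.01)).mean())      # near 1: the state is mostly carried over ("remember")
print(discretize(A, np.full(D, 5.0)).mean())       # near 0: the state is mostly reset ("forget")

Small delta keeps the state, large delta resets it, which is how the model gets LSTM-style forgetting without making A itself a function of the input.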
@РудаковАртем (8 months ago)
An interesting topic and interesting research are explained in this paper. I recommend checking it out!
@amirronen1913 (11 months ago)
Excellent talk explaining a non-trivial paper. Thanks!
@TheEbbemonster (1 year ago)
Interesting that A^n actually works for long sequences. I would have expected a severe degradation of performance as sequences get longer...
@ItsRyanStudios (1 year ago)
My thought exactly; I need to implement this in code to understand how/why this works
@Thien--Nguyen (11 months ago)
I think there are some parameterization constraints, like the eigenvalues of A not being allowed to be positive, for A^n to be stable. The H3 and HiPPO papers talk more about those conditions (there are also diagonal and other structural constraints on A to make things more efficient, iirc).
@邵斌-n2r (11 months ago)
Second this. If the eigenvalues are negative, the contribution vanishes even more quickly (consider a context length of 10^6).
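A quick toy calculation of the worry in this sub-thread: with a scalar state and a fixed step size, the discretized transition is a_bar = exp(delta * a), and an input from k steps back survives with weight a_bar^k, so whether anything is left after 10^6 tokens depends entirely on how close a_bar is to 1, which is exactly what the learned, token-dependent delta controls. The numbers below are made up purely for illustration:

import math

def surviving_weight(a=-1.0, delta=1e-3, k=1_000_000):
    a_bar = math.exp(delta * a)   # discretized scalar transition for a fixed step size
    return a_bar ** k             # weight with which an input k steps back still contributes

print(surviving_weight(delta=1e-3))   # exp(-1000): effectively zero, i.e. forgotten
print(surviving_weight(delta=1e-6))   # exp(-1) ~ 0.37: still clearly present after a million steps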
@aleksszukovskis2074 (1 year ago)
I don't understand what got me to even watch this at 2:00 in the morning
@kaikapioka9711 (1 year ago)
Impressive, thx as always 🎉 happy holidays
@jonsimonatwork (11 months ago)
I think something is off about your explanation of the A_t prefix products around 35min. The dimensions given in Algorithm 2 imply that A remains constant across timesteps, since it has no L component.
@NLogSpace (10 months ago)
I wanted to ask the same thing. I think it is a mistake in the video. Also before, he mentioned that A is constant and only B and C depend on the input.
@rault.7108 (1 year ago)
Thanks for talking about this model! ❤
@Veptis (5 hours ago)
At a high level, the decoder-only transformer can be considered a recurrent model, but the "hidden state" is the decoded token (or its embedding). And the resulting output is whatever the chain-of-thought argument ends up decoding at the end (which is guided by the instructions). If you make the decoder transformer more recurrent-like, i.e. instead of quantizing to specific token embeddings at sampling time (any sampling method) you just take that probability/similarity as the new embedding for the next timestep, you have the current idea of "hidden" reasoning.
@leisha1519 (1 year ago)
For A^n, will so many multiplications by A lead to an explosion of the values?
@ithaca2076 (1 year ago)
this is my christmas present
@srh80 (1 year ago)
Please do more papers, like before!
@simonl1938 (1 year ago)
SRAM has very low latency, but the total bandwidth is less than what the GPU-CPU link can do (especially with DMA). You could use something like buffer object streaming (this is the name OpenGL has for it) to reduce memory usage massively. This would also allow for much larger batches, as they are computed concurrently with the same weights. Does anyone know if this is already being done? I can't find anything on the topic.
@afbf6522 (1 year ago)
Do it. Can you talk about this in private? I'm very interested in the topic of optimizing operations right now
@charmy1138 (1 year ago)
What are the differences between RWKV, RetNet, and Mamba? Which here has the closest architecture to transformers?
@stellabiderman4056 (1 year ago)
This is maybe worth a video all to itself, but they're all different types / levels of tradeoff between a transformer and an RNN, basically
@qingyangzhang6093 (1 year ago)
will we use mamba to install mamba?
@GatlingNG (1 year ago)
I hope not. The whole conda ecosystem should not be used for serious production code. Pyenv + Poetry + pip is all you need.
@vcool (1 year ago)
@@GatlingNG conda can install things that pip can't.
@GatlingNG (1 year ago)
@@vcool We have containers for that, not some half-assed, cobbled-together venv manager with barely maintained packages.
@sirati9770 (11 months ago)
Regarding not having a facecam: for people still learning English, having a visual reference can drastically improve listening comprehension.
@corgirun7892 (1 year ago)
Good video; Yannic has a strong ability to dissect and extract key points.
@lexer_ (1 year ago)
This very much triggers the same intuitive reservations and skepticism I feel towards linear transformers and other attempts at improving scaling by throwing away one of the essential components to what makes transformers work. I am not convinced that any of these architectures actually bring the same scaling we have seen with transformers so far where the only limiting factor seems to be the amount and quality of the training data.
@sagetmaster4 (1 year ago)
The future looks like collections of models working in concert, so having transformers doing most of the work in most systems with certain domains using a different architecture seems plausible
@dennismertens990 (1 year ago)
After "Text Embeddings Reveal (Almost) As Much As Text" came out, I became convinced that encoder-decoder transformers learn embeddings that behave just like an RNN's hidden/recurrent state. If you plot the embedding of a sentence unmasked sequentially, you get a trajectory. That paper shows this trajectory is quite informative w.r.t. the original text. This is interesting because that suggests you can train an RNN on the learned embeddings. Since you already have the embeddings, there is no need for back-prop through time. It would be like training a regular neural net on input-output examples, where the input is the embedding at time t and the output is the embedding at time t+1. It's only speculation, but it could be a practical way of distilling pre-trained transformers into RNNs for deployment. P.S. By unmasking a sentence sequentially, I mean for example "hel", "hell", "hello".
@lexer_ (1 year ago)
​@@dennismertens990 I agree that these architectures in general and Mamba in particular seem like they are a very good idea for more task specific deep learning. I should have specified that I was referring to the general purpose LLMs or more multi-modal models which develop some level of reasoning and general abstraction capability. For more constrained tasks like DNA modelling this might very well be a huge leap forward. In regards to sequential unmasking, it would of course still be token based, not character based, but I don't think you can get away with throwing away the attention matrix without a massive loss in performance even if you train an embedding.
@lexer_ (1 year ago)
@@sagetmaster4 A lot of people have been saying this for many years at this point. And for a long time this seemed like a very reasonable assumption. If you can not scale generally, you have to specialize. It's the same with general compute as well as many other areas. But I haven't seen a single example of this actually working to a degree that outperforms just putting the same amount of extra compute into a larger or longer-trained model. I am beginning to think that we are missing some very fundamental and essential insight to make this actually work. Sparse network designs like mixture of experts seem to have some real benefits here, but only for inference speed. But I would argue this is only tangentially related to the whole heterogeneous architecture idea. I for one think the next major architectural step probably needs to be intermediate higher-level memory instead of just scaling the context. Being able to store abstractions in memory seems like such a fundamental component of human thinking that I can't help but think it might have the potential for major improvements in an LLM as well. The other thing that will eventually be necessary is a way for models to spend significantly more time "thinking" before answering. There were a few attempts with traditional LLMs and a thinking-token to essentially allow more inference steps for the next token. And the results looked promising in the paper. But now it seems to have been mostly forgotten about. So a more fundamental way for the model to recurse internally might be necessary for introspective logic.
@TheRyulord (1 year ago)
@@lexer_ If this paper's claims are actually true then it has better perplexity on the pile (general purpose language modeling) than traditional transformers. We've also seen attention actually struggles quite a bit on information retrieval with long sequences so while it is powerful it's not a perfect mechanism. A lot of older subquadratic transformer papers basically just did "attention but worse" (eg. window + random attention) and so they naturally had tradeoffs that a completely different mechanism like this isn't going to necessarily have.
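To make the distillation idea from this thread concrete: the proposal amounts to plain supervised regression on pairs of consecutive prefix embeddings from a frozen encoder, with no backpropagation through time. A rough sketch with a random stand-in for embed() and a tiny MLP (all names and sizes are hypothetical, not from the paper):

import torch
import torch.nn as nn

def embed(prefix_tokens):
    # stand-in for a frozen pretrained encoder: one vector per prefix (random here, real embeddings in practice)
    g = torch.Generator().manual_seed(len(prefix_tokens))
    return torch.randn(256, generator=g)

def make_pairs(token_ids):
    # (embedding of tokens[:t], embedding of tokens[:t+1]) pairs, i.e. "hel" -> "hell" -> "hello"
    embs = [embed(token_ids[:t]) for t in range(1, len(token_ids) + 1)]
    return list(zip(embs[:-1], embs[1:]))

student = nn.Sequential(nn.Linear(256, 512), nn.GELU(), nn.Linear(512, 256))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

loss = None
for h_t, h_next in make_pairs(list(range(20))):           # pretend these are the token ids of one training text
    loss = nn.functional.mse_loss(student(h_t), h_next)   # plain regression, no backprop through time
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())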
@004307ec (1 year ago)
😊 This is the second video I've watched for this paper, and I understand it better now. The paper is kind of too hard for me 😂
@RafalLewczuk-g6m (1 year ago)
If Mamba can handle such long sequences, does it need tokenization?
@simonl1938 (1 year ago)
You can't really just input words like that into any model, but it might still be easier in other ways. In the S4 talk video Albert Gu talks about how much easier it was for people in other fields to use their model since it's a one size fits all with great performance without tuning.
@albertmashy8590 (1 year ago)
Neither transformers nor Mamba needs tokenization; it's more just for efficiency. But I'm curious to hear about the potential implications and whether Mamba would be more capable of handling larger vocabularies.
@许翔宇 (8 months ago)
Great explanation. Will you please share the link to the GitHub repo?
@ngayminh8463 (9 months ago)
I haven't seen any Mamba implementations in PyTorch. Also, will it replace transformers?
@yannickpezeu3419 (1 year ago)
Brilliant explanation! Thanks!
@patrickl5290 (1 year ago)
Was waiting for this, thanks!
@theskydebreuil (1 year ago)
Nice! Was waiting for this 😁
@feiyuchen1383 (9 months ago)
What does the "zoom" mean in the video?
@横川俊介-x4f (4 months ago)
Helped so much to understand!
@jabowery (1 year ago)
What happens if you limit the training corpus to enwik9 and then just hammer away?
@ayushtibrewal4535 (11 months ago)
Can we use this model for image classification, like ViT?
@EngineeredFemale (10 months ago)
Yep you can. Check out V Mamba (Vision Mamba).
@ayushtibrewal4535 (10 months ago)
@@EngineeredFemale but there is no training code for vision mamba.
@PicaPauDiablo1 (1 year ago)
This is so great.
@heyman620 (1 year ago)
You are just amazing.
@davidespinosa1910 (3 months ago)
It sounds like S4 solves the vanishing and exploding gradient problem, so that it works for very long sequences. Now I'm curious how that works...
@PeterIsza (1 year ago)
Okay, so the hidden layer is just a really big IIR filter: en.wikipedia.org/wiki/Infinite_impulse_response
@mkamp (11 months ago)
Can somebody help me understand this? In figure 1 in the paper it says “Structured SSMs independently map each channel (e.g. 𝐷 = 5) of an input 𝑥 to output 𝑦 through a higher dimensional latent state ℎ (e.g. 𝑁 = 4)”. How are four dimensions higher dimensional than five? I am not trolling and I understand that it says “e.g.” and that it could not be deliberate. But given the quality of the paper that seems unlikely. Is there another way to interpret higher dimensional? Does it just mean “not a scalar”?
@mkamp (11 months ago)
I found the solution. Well, GPT4 did, but who’s counting? 😅 Each of the five channels is mapped to four hidden dimensions. Of course, now it all makes sense. This is what the horse said: “The caption mentions that structured State Space Models (SSMs) map each channel of an input to an output through a higher dimensional latent state \( h \) (for example, \( N = 4 \)) while avoiding the materialization of a large effective state space. The inconsistency you're pointing out seems to be related to the dimensionality \( D \) given as an example (i.e., \( D = 5 \)) versus the term "higher dimensional." This likely means that each channel \( D \) is independently mapped, and the "higher dimensional" aspect refers to the combination of these mappings to create a state space with more dimensions than the individual inputs. The effective state space would then be \( D \times N \), which is larger than the dimensionality of each individual channel \( D \) alone. So, in this context, "higher dimensional" does not contradict the example dimensions given; rather, it points to the resultant multi-dimensional space created by the model's structure.”
@akolec (10 months ago)
Why did a "horse" say that? @@mkamp
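In code, the resolution of this shape question is just broadcasting: each of the D channels gets its own N-dimensional latent, so the state actually materialized per token is D x N (20 numbers in the figure's example), which is the "higher dimensional" part. A minimal shape check, with dummy values instead of real weights:

import numpy as np

B, L, D, N = 2, 10, 5, 4               # batch, sequence length, channels, state size per channel
x = np.zeros((B, L, D))                 # input: D = 5 numbers per token
h = np.zeros((B, D, N))                 # hidden state: an N = 4 dim latent for EACH channel
A_bar = np.ones((B, L, D, N)) * 0.9     # per-token discretized transitions (input-dependent via delta)
Bx_bar = np.ones((B, L, D, N)) * 0.1    # per-token input contributions
for t in range(L):
    h = A_bar[:, t] * h + Bx_bar[:, t]  # broadcasted per-channel recurrence
print(h.shape)                          # (2, 5, 4): the materialized state is D*N = 20 numbers, not just D = 5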
@alivecoding4995 (11 months ago)
I am super sceptical towards these reduced models. But I think they are interesting as a kind of ablation study for the Transformer architecture, helping us to understand its inner mechanisms better.
@berkk1993 (1 year ago)
Thanks, great!
@erickmarin6147 (1 year ago)
Would love to see some classification of machine learning itself using AI; there seems to be a lot of stuff with different names that is functionally very similar.
@duongbinh23 (6 months ago)
6:50 RNN: y(t+1) = sigmoid(y(t) + x(t))
State-space: well, let's get rid of the non-linear sigmoid, so that y(t+1) = y(t) + x(t) = ... = y(0) + x(0) + x(1) + ... + x(t). Now you change the game lol.
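Right, and once the non-linearity is gone the whole recurrence collapses to a cumulative sum, which is exactly why the prefix-sum / parallel-scan trick from around 34:05 applies; the selective model then generalizes this to a weighted (input-dependent) version of the same scan. A two-line sanity check:

import numpy as np

x = np.random.randn(8)
y = np.zeros(8)
acc = 0.0
for t in range(8):
    acc = acc + x[t]                      # y(t) = y(t-1) + x(t): the RNN with the sigmoid removed
    y[t] = acc
print(np.allclose(y, np.cumsum(x)))       # True: the linear recurrence is just a prefix sum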
@vcool (1 year ago)
I want to see a model that requires memory logarithmic in the sequence length, which could be better than requiring linear memory.
@johnzinhoinhoinho (1 year ago)
This model has constant memory at inference; it has linear memory in the training phase.
@tjpld (1 year ago)
Generative Pretrained Mamba
@apilaiteloc6520 (7 months ago)
lovely
@mohdil123 (1 year ago)
Super nice
@NoNameAtAll2 (10 months ago)
tragic loss of not seeing your sunglasses* fify
@worthstream (11 months ago)
I've always preferred videos with no face cam, actually.
@RaghaVamsi (8 months ago)
Wowww
@monoham1 (11 months ago)
Yet again I can't understand a word of the maths except the intro. I've heard of LSTM, big O, matrix, FFT... that's about it.
@monoham1 (11 months ago)
"It's just basic 2nd year maths" - I hate people who say this. They have no idea of the struggles of people who have to work for 10 years just to afford 1 year of university, or can't understand why paying 2 months of your income for 1 month of rent is a barrier to learning this IMPOSSIBLE maths! ITS NOT MATHS ITS LATIN CODE FOR BASIC SHIT ONLY YOU UNDERSTAND. THERES NO WAY TO LEARN THIS IF YOUR PARENTS ARENT MILLIONAIRES
@justtoleavecomments3755 (8 months ago)
@@monoham1 My parents were poor immigrants from Eastern Europe, and I took loans to go to a public university (Canadian) that cost $10k/yr. I got good grades and the govt (of Canada) wrote off half those loans. The US has cheap state schools too, right? You don't need to be a millionaire. You just need time and interest.
@csabaczcsomps7655 (1 year ago)
Remember, training data contains biased, wrong, or only intermediately true information, highly adversarial data, school philosophy, ideas optimized for the wrong time or era... and a lot more. You need to somehow filter or devalue these, or put them in some different value dimensions. This would make hallucination go down a little more. My noob opinion.
@thinkalinkle (11 months ago)
I can't help but feel like the claim that this will "replace" transformers is a bit much... especially after watching the video.
@axe863 (8 months ago)
Before most of you were born ..... lol 😅
@JRyang-py4qp (5 months ago)
3Q
@DanFrederiksen (1 year ago)
I was surprised to learn that transformers rely on a classical algorithmic hack to have a kind of memory. I'm quite sure that's a flawed premise that won't last. It reminds me of bag-of-words, which was a tremendously flawed premise, despite how relatively amazingly well transformers work. Most approaches so far seem hacky. I've thought ahead a bit, and I figure that any kind of real-world, well-rounded AI needs a vastly more complex and sophisticated architecture. That's not to say it couldn't happen relatively quickly, but LLMs ain't it. Just the ability to hold a thought and progressively work on it with shifting focus is way beyond LLMs. And that's analogous to image generation, which superficially looks very close to flawless but in reality is also very far from the holy grail, for much the same reason.
@krimdelko (1 year ago)
Why learn multiplication and division when all you need is addition and real numbers? Same principle here. Reduce complexity
@jondo7680 (1 year ago)
It really makes no sense that all these efficient architectures come in small sizes like 3B models. If they trade off capabilities for performance, they should release 7B or 13B models; why would anyone run a 3B Mamba or RWKV if 7B Mistral runs on everything? Nice tech, but as long as it's under 7B it's just a demo.
@wenhanzhou5826 (11 months ago)
This is research, not engineering. Maybe they just had limited compute resources.
@hanyanglee9018 (1 year ago)
LSTMs, transformers, and anything in a similar direction are just not gonna work.
@Japneets1 (11 months ago)
what do you mean by 'not gonna work'? Transformers have already proven their worth, haven't they?
@BB-sd6sm (1 year ago)
Opinion: over-engineering architectures won't actually solve the real problem at hand, which is intelligence.
@erkinalp (1 year ago)
overbuilding allows one to handle more unforeseen types of problems, though.