Funny to see my professors' names on the paper here. Feels odd, since I knew this channel well before I started studying there.
@wurstelei13566 ай бұрын
Thank god they developed these techniques decades ago, so nothing is patented and hidden from the public.
@Fleniken3 ай бұрын
Hey, sorry for being a bit off-topic to the original comment, but I wanted to ask for some advice on where to start if I want to reach a level where I can understand these videos (the ones Yannic is doing). My resources are: spare time, access to KZbin, and the like.
@UTOPYofficialАй бұрын
@@Fleniken StatQuest
@wolpumba40996 ай бұрын
*Summary*

*What is xLSTM?* [0:00]
* xLSTM aims to push the boundaries of LSTM architectures by incorporating lessons learned from the world of LLMs and Transformers.
* It introduces two modified LSTM cells: sLSTM and mLSTM.
* xLSTM architectures are formed by residually stacking these modified LSTM blocks.

*Key Features:* [7:35]
* *Exponential Gating:* [31:02] Replaces the traditional sigmoid non-linearity in LSTM gates with an exponential function to address vanishing-gradient issues.
* *Normalization and Stabilization Techniques:* [32:38] Introduces methods to handle the rapid growth of the exponential function and stabilize training.
* *Modified Memory Structures:*
* *sLSTM:* [27:47] Uses a scalar memory, a scalar update, and "new" memory mixing (which leverages matrix properties to route information between dimensions).
* *mLSTM:* [36:24] Employs a matrix memory and a covariance update rule for associative memory. It is fully parallelizable in training, similar to Transformers.

*Advantages:*
* *Constant Memory Usage:* Unlike Transformers, xLSTM maintains a fixed memory footprint regardless of sequence length.
* *Competitive Performance:* Achieves results comparable to state-of-the-art Transformers and State Space Models on language-modeling benchmarks.
* *Parallelizable Training (mLSTM):* The mLSTM variant removes the non-linear dependency on past time steps, enabling parallel training like Transformers.

*Limitations:* [54:30]
* *Large Constant Memory Requirement:* While memory usage is constant, the mLSTM's matrix memory can be large, leading to higher computational costs.
* *No Fast Parallel Training for sLSTM:* The sLSTM variant still involves recurrence, making fast parallel training challenging.
* *Further Optimization Needed:* The authors acknowledge the need for further architecture and hyperparameter optimization, especially for larger xLSTM models.

*Overall:* [55:54]
* xLSTM demonstrates the potential of enhanced LSTM architectures to compete with Transformers in language modeling.
* Further research and real-world applications will determine its long-term impact and adoption.

I summarized the transcript with Gemini 1.5 Pro.
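To make the covariance update rule mentioned in the summary above concrete, here is a minimal NumPy sketch of one mLSTM-style memory step; the variable names and gate values are my own toy choices, not code from the paper:

```python
import numpy as np

d = 4                        # toy hidden dimension
C = np.zeros((d, d))         # matrix memory
n = np.zeros(d)              # normalizer state

def mlstm_step(C, n, k, v, q, f_t=0.9, i_t=1.0):
    """One toy mLSTM-style step: decay the old memory, add the new
    key/value pair as an outer product, then read out with a query."""
    C = f_t * C + i_t * np.outer(v, k)      # covariance-style update
    n = f_t * n + i_t * k                   # running key normalizer
    h = C @ q / max(abs(n @ q), 1.0)        # normalized retrieval
    return C, n, h

rng = np.random.default_rng(0)
k, v, q = rng.normal(size=(3, d))
C, n, h = mlstm_step(C, n, k, v, q)
print(h)
```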
@XX-vu5jo6 ай бұрын
Gemini is a joke lol
@FunkyJeff226 ай бұрын
Thanks!
@guillaumevermeillesanchezm24276 ай бұрын
How much did it cost?
@wolpumba40996 ай бұрын
@@guillaumevermeillesanchezm2427 Nothing. I'm in some kind of beta. It is also super fast (less than 10 seconds). Much better than GPT-4
@guillaumevermeillesanchezm24276 ай бұрын
@@wolpumba4099 thank you for answering!
@rockapedra11305 ай бұрын
Thanks Yannic, I really value these videos. The way you work out the math really helps! For example, I really liked the way you easily and clearly explained the mechanism of storing key-value pairs using outer products. You should add a donation button to your channel!
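For anyone who wants to see that outer-product storage mechanism in isolation, a tiny sketch (my own toy example, not code from the video):

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)

# Two key/value pairs stored by summing their outer products.
k1, k2 = rng.normal(size=(2, d)) / np.sqrt(d)   # keys roughly unit-norm
v1, v2 = rng.normal(size=(2, d))
M = np.outer(v1, k1) + np.outer(v2, k2)         # associative matrix memory

# Querying with a stored key approximately returns its value, because
# k1.k1 ~ 1 while k1.k2 ~ 0 for random high-dimensional keys.
recalled = M @ k1
print(np.corrcoef(recalled, v1)[0, 1])          # close to 1
print(np.corrcoef(recalled, v2)[0, 1])          # close to 0
```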
@tantzer61136 ай бұрын
Seems like the title of this paper could have been, perhaps provocatively, “LSTMs are all you need.”
@nicolasmichel51636 ай бұрын
I feel that's not really the conclusion here. More like "billions of parameters is all you need".
@Fleniken3 ай бұрын
Hey, sorry for being a bit off-topic to the original comment, but I wanted to ask for some advice on where to start if I want to reach a level where I can understand these videos (the ones Yannic is doing). My resources are: spare time, access to KZbin, and the like.
@myname-sj7hs3 ай бұрын
@@Fleniken To get you started I'd recommend 3b1b's neural-network miniseries as well as Josh Starmer's (kzbin.info/aero/PLblh5JKOoLUIxGDQs4LFFD--41Vzf-ME1&feature=shared). The former should give you a basic intuition for how neural networks work, and the latter should help explain more complicated concepts such as cross entropy, LSTMs, and Transformers.
@Mordenor6 ай бұрын
Thank you Mr Yannic for explaining xLSTM, which extends the famous Long Short-Term Memory model. P.S. I like your videos, so please stay healthy.
@aintgonhappen6 ай бұрын
Pray for Mr Yannic 🙏🙏🙏
@yeetyeet70706 ай бұрын
Extended Long Short-Term really sounds like upper lower middle class
@Hexanitrobenzene6 ай бұрын
Yeah, the adjacent words "long" and "short" don't clear things up at all... In contrast, the authors of "Attention Is All You Need" could work for political campaigns writing slogans as a side hustle :)
@CM-mo7mv6 ай бұрын
finally approaching ART
@JackSPk6 ай бұрын
"Matrices aren't circles" - Yannic Kilcher
@RezaJavadzadeh6 ай бұрын
brilliant thanks Yannic
@pawelkubik6 ай бұрын
I used to think of c and h as the memory capacitor and the hidden output. This was especially clear in word-tagging problems where we had to align our outputs with the input tokens. So the h vector corresponded directly to one of the tag classes we were predicting, and c was used strictly as memory (I thought c just stood for "capacitor" or "memory cell").
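For readers who haven't seen the classic equations spelled out, here is a minimal sketch of one standard LSTM step, with c as the memory cell and h as the exposed hidden output; the weights and names here are hypothetical toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One classic LSTM step. c carries the long-term memory; h is the gated,
    squashed view of c that the next layer (e.g. a tagging head) actually sees."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    c = f * c + i * np.tanh(g)                     # memory cell update
    h = o * np.tanh(c)                             # hidden output
    return h, c

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * d_h, d_in)) * 0.1
U = rng.normal(size=(4 * d_h, d_h)) * 0.1
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
print(h.shape, c.shape)
```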
@andytroo6 ай бұрын
33:10 - is this sort of a built-in softmax? Exponentiate everything, then normalize?
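If that reading is right (exponentiate, then divide by a running normalizer), the effect does look softmax-like over time steps when the forget gates are 1; a toy illustration with made-up numbers:

```python
import numpy as np

# Toy picture of exponential gating: with forget gates fixed to 1, the pair
# (c_t, n_t) accumulates exp(gate) * value and exp(gate), so c_t / n_t is a
# softmax-weighted average of everything seen so far.
gates  = np.array([0.2, 1.5, -0.3, 2.0])   # hypothetical input-gate pre-activations
values = np.array([1.0, 2.0, 3.0, 4.0])    # hypothetical scalar cell inputs

c = np.cumsum(np.exp(gates) * values)      # running numerator
n = np.cumsum(np.exp(gates))               # running normalizer
print(c / n)                               # readout after each step

# The final readout matches an explicit softmax over all time steps:
w = np.exp(gates) / np.exp(gates).sum()
print(np.dot(w, values))
```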
@yichunchen43704 ай бұрын
I think this is an improved version of the idea from another of Yannic's videos, the one on Infini-attention, where the memory is updated directly with the new set of keys and values. Here we have a more sophisticated way of updating the memory, borrowed from the idea of the gates in LSTM, namely f_t and i_t.
@DamianReloaded6 ай бұрын
So, the answer is kind of yes. If you scale a high-dimensional token mixer and use backpropagation to adjust the weights towards the desired result, you will achieve functionality. The questions lingering in my mind are: do biological neural networks employ backpropagation? How do we one-shot learn new token sequences, and how are we able to remember them long term and bring them back when we need them if they are so low-probability (we only saw them once)?
@xxlvulkann67436 ай бұрын
I imagine that when you have agentic models, you can implement more sophisticated memory encoding. For example, you might allow particular memory samples to carry more "significance" based on your current level of arousal/reward. Also, exposure to a token doesn't have to come from the external environment; it may result from constantly "thinking" about the topic, essentially generating and training on synthetic data. We must remember that generative models are still not actual agentic models; they're basically just foundation models.
@ssssssstssssssss6 ай бұрын
Backpropagation is largely considered implausible for biological networks, and BPTT is impossible because it is non-causal. Some do think the brain employs some kind of "gradient", though.
@Hexanitrobenzene6 ай бұрын
@@ssssssstssssssss BPTT ?
@ChlorieHCl6 ай бұрын
@@Hexanitrobenzene Back-propagation through time
@eltongoaustriaco82686 ай бұрын
The brain might generate a training signal from a single example held in short-term memory (you repeating your hotel room number in your mind). Regarding BP, it is plausible that the brain uses a less optimized version of it.
@davidhauser75376 ай бұрын
Nice, thanks for covering this paper :)
@intrinsical6 ай бұрын
I mean, the term Language Model was coined in the 90s. Even N-Gram models were considered language models. We just didn't start prefixing Language Models with the word "Large" till the early 2000s. The claim that LSTMs were doing LLM in the 90s is an exaggeration, but also partially true.
@Fleniken3 ай бұрын
Hey, sorry for being a bit irrelevant to the original comment, but I wanted to ask some advice on where to start if I want to get to a level where I can understand these videos (the ones Yannick is doing). My resources are: spare time, access to KZbin, and the like.
@EobardUchihaThawne6 ай бұрын
mLSTMs are similar to Google's Infini-attention in how they retrieve from memory.
@danraviv73932 ай бұрын
When you wonder what would happen if you squared the hidden dimension size at 47:00: yeah, maybe it would have similar model capacity, but it would still not be parallelizable, which is maybe the main point of their solution?
@edeneden976 ай бұрын
in the mLSTM block, isn't it very similar to attention just without softmax?
@GGlessGo6 ай бұрын
And is it? Can't completely follow, actually.
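As far as I can tell, yes: with the gates fixed to 1 and the normalizer dropped, the mLSTM readout reduces exactly to attention without the softmax, i.e. linear attention. A toy check of that equivalence (my own sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
K, V = rng.normal(size=(2, T, d))
q = rng.normal(size=d)

# Recurrent view: accumulate outer products (all gates = 1, no normalizer).
C = np.zeros((d, d))
for t in range(T):
    C += np.outer(V[t], K[t])
recurrent_out = C @ q

# Attention view: raw scores without softmax, i.e. linear attention.
scores = K @ q                     # shape (T,)
attention_out = V.T @ scores       # score-weighted sum of values
print(np.allclose(recurrent_out, attention_out))   # True
```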
@bensimonjoules44026 ай бұрын
The last few papers Yannic covered all follow the same line of bringing some sort of recurrence back to transformers. In this case not explicitly, but I don't see a fundamental reason why each step of the sequence couldn't be processed by one. There seems to be a clear research direction of resurgent recurrence; I wonder if this direction has a formal theory or even a name.
@tamirtsogbayar39126 ай бұрын
Hello Yannic, thanks for your videos! Are you going to make some videos related to KAN (Kolmogorov-Arnold Networks)? Thank you.
@quickpert13826 ай бұрын
KANs are fairly easy, and they're a nice topic to venture into by yourself.
@_XoR_6 ай бұрын
Unfortunately they are quite flawed for most applications since they don't scale, and depending on the shape of the data distribution they can be worse than MLPs.
@quickpert13826 ай бұрын
@@_XoR_ Yep, for now we are waiting for an optimized implementation.
@pietrorse6 ай бұрын
This reminds me of the mixing of serialization and parallelization across layers, which I actually observe in nature.
@kuan-chunlee96115 ай бұрын
I have a question: it sounds like a linear temporal dependency can be parallelized somehow, or at least computed faster. How can this parallelization be done, if possible? Thanks if anyone knows.
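One answer, sketched under the assumption that the recurrence is h_t = a_t * h_{t-1} + b_t with no non-linearity between steps: unroll it into a cumulative-product/cumulative-sum form (or an associative parallel scan), so all time steps can be computed without a sequential loop. A toy NumPy version:

```python
import numpy as np

def linear_recurrence_sequential(a, b):
    """h_t = a_t * h_{t-1} + b_t, computed step by step."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def linear_recurrence_cumulative(a, b):
    """Same recurrence via cumulative products/sums:
    h_t = P_t * sum_{s<=t} b_s / P_s, with P_t = a_1 * ... * a_t.
    (Only numerically safe while the products don't underflow; practical
    kernels use an associative parallel scan instead.)"""
    P = np.cumprod(a)
    return P * np.cumsum(b / P)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=10)   # decay / forget factors
b = rng.normal(size=10)              # per-step inputs
print(np.allclose(linear_recurrence_sequential(a, b),
                  linear_recurrence_cumulative(a, b)))   # True
```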
@TheRohrАй бұрын
(a) Couldn't they simply use a LayerNorm instead of the n_t? (b) It sounds like they didn't answer their own question (or was it already answered somewhere else?) of how far you can get with plain LSTMs; instead they invented these blocks etc. and only showed results for xLSTMs?
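For context, my reading of the sLSTM equations (a paraphrase, so treat the exact gate parameterization as an assumption): the normalizer n_t is itself a recurrent state that decays with the forget gate, which is what keeps the exponential input gates bounded over time; a per-step LayerNorm would not carry that history forward.

```latex
\begin{aligned}
c_t &= f_t \, c_{t-1} + i_t \, z_t \\
n_t &= f_t \, n_{t-1} + i_t \\
h_t &= o_t \odot \frac{c_t}{n_t}, \qquad i_t = \exp(\tilde{i}_t)
\end{aligned}
```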
@paxdriver6 ай бұрын
Thank you Yannic! I thought I was crazy, but you seem to have read a similar tone in the early sections lol. That's pretty funny: "our paper is all about this addition, and this multiplication... novel ideas, eh?" That's the headline, but only after that does the really new part start, with memory management (soft memory, not hardware... also confusing).
@chickenp70386 ай бұрын
We need a new Mamba explanation. The current one has errors and doesn't really explain much.
@longvo70886 ай бұрын
You need to read previous papers like HiPPO and S4 to be able to understand Mamba, plus some prerequisite knowledge of CUDA programming.
@AM-yk5yd6 ай бұрын
Sasha Rush has several, as he seems to be a big fan of SSMs. "Mamba: The Hard Way" is very detailed.
@matveyshishov6 ай бұрын
Wait, I've been watching your channel for many years; how come it only has 245k subscribers while something like 2minpapers has 1.5M?
@ChlorieHCl6 ай бұрын
I've felt a significant decline in the quality of Two Minute Papers videos. The two minutes are like 30s of unwanted background info, 30s of experimental results, and 1min of sponsor acknowledgment. And also "what a time to be alive" and "hold on to your papers", apparently. No real info gained from those videos, to the point that I unsubscribed from that channel months ago just to get rid of the annoyance.
@yakmage80856 ай бұрын
@@ChlorieHCl There's been a decline for sure, but Yannic's videos also have a significantly higher minimum education requirement. Two Minute Papers videos are just video highlights, with no math, intuition, or criticism.
@AvastarBin6 ай бұрын
Because 2minpapers videos are 5 or 6 minutes long (ironically) and are understandable by anyone regardless of background, whereas Yannic's videos are an hour long, very in-depth, and require a lot of background knowledge in ML.
@GoldenBeholden6 ай бұрын
@@ChlorieHCl Yeah, seeing some guy get enthusiastic about research papers was nice enough when the channel had just begun and sat below 30k subscribers, but he really started playing into his "character" rather than the actual content of the papers. Not really worth your time anymore, to be honest. AI Explained is great if you're looking for another channel in the same vein as this one (albeit lighter on the academics).
@thirdeye46546 ай бұрын
Why do influencers on TikTok have millions of followers just talking bullshit all day long? Because people love entertainment, and not many have a long attention span. Also, there is only so much time in your life to watch and do stuff.
@florianjug6 ай бұрын
Isn’t that close to the mindset behind Mamba as well? What would be the key difference?!
@arashbtk323713 күн бұрын
Explained very well.
@hanskraut20186 ай бұрын
I could have told you this back when I was finishing kindergarten. I hope there is more behind it than it sounds like there is.
@Fabio-zi4hw6 ай бұрын
Is this the ultimate bitter lesson?
@herp_derpingson6 ай бұрын
Money is all you need
@intrinsical6 ай бұрын
So the matrix memory is simply old school Kohonen Maps from the 70s?
@Hexanitrobenzene6 ай бұрын
It seems so, if that's the name. They list Kohonen, Anderson, and Nakano as references, all from 1972.
@tiagotiagot6 ай бұрын
Is there enough information in the PDF for some of the current bigger LLMs that can read PDFs to produce code equivalent to what the researchers used to get their claimed results?
@Hexanitrobenzene6 ай бұрын
This task probably requires AGI...
@AmirNajafgholi6 ай бұрын
Don't you want to review KAN?
@dairin0d6 ай бұрын
Regarding the large memory requirements of the d*d matrix, perhaps they could take a page from the Vector Symbolic Architectures (VSA) approach? In VSA, state, keys, and values are all vectors in the same shared space (and so have the same dimension), so if all that's needed is to combine them in a way that results in dot(new_state, key) ~= value, VSA's binding operation (e.g. the component-wise / Hadamard product) sounds like a perfectly viable replacement 🤔 I suppose it would still benefit from high dimensionality, but a vector's size can be controlled at a more granular level than a square matrix's. If they used binary or ternary weights, the memory requirements would be even smaller (though that would probably require some changes in how the model is trained).
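A quick sketch of that binding idea with bipolar random vectors (my own toy illustration of VSA, not anything from the paper):

```python
import numpy as np

d = 4096
rng = np.random.default_rng(0)

# Random bipolar vectors, as commonly used in vector-symbolic architectures.
key, value = rng.choice([-1.0, 1.0], size=(2, d))

# Binding = Hadamard (component-wise) product; the "memory" stays a d-vector.
memory = key * value

# Unbinding with the same key recovers the value exactly (key * key == 1).
print(np.array_equal(memory * key, value))      # True

# Superpose a second bound pair; retrieval becomes approximate, but the
# crosstalk stays small because random keys are nearly orthogonal.
key2, value2 = rng.choice([-1.0, 1.0], size=(2, d))
memory = memory + key2 * value2
recovered = memory * key
print(recovered @ value / d)                    # close to 1 (signal)
print(recovered @ value2 / d)                   # close to 0 (crosstalk)
```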
@JerryFederspiel6 ай бұрын
If I'm thinking about this right, the off-diagonal elements of the outer products of k and v can be thought of as "clues" that each vector element in the key gives about each other vector element in the value. The Hadamard product dispenses with these clues (each element is treated independently), but maybe each individual element only has to be kind of right with a VSA because d is so high. It may also be possible to compromise between Hadamard and outer products by breaking the key and value vectors up into P parts of d/P elements each, then taking the outer products of corresponding parts. This gives a memory requirement of P * (d/P)^2 = d^2 / P, and it means each key element gives a clue about d/P value elements. Setting P to sqrt(d) feels good, so clearly that is the right choice 🙂
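For what it's worth, a small sketch that makes the memory accounting of that blocked compromise concrete (the function names and key scaling are hypothetical):

```python
import numpy as np

def blocked_outer_store(K, V, P):
    """Store T key/value pairs in P block-local outer products:
    memory cost is P * (d/P)^2 = d*d / P instead of d*d."""
    T, d = K.shape
    s = d // P
    M = np.zeros((P, s, s))
    for t in range(T):
        k, v = K[t].reshape(P, s), V[t].reshape(P, s)
        M += np.einsum('pi,pj->pij', v, k)       # one small outer product per block
    return M

def blocked_retrieve(M, q):
    P, s, _ = M.shape
    return np.einsum('pij,pj->pi', M, q.reshape(P, s)).reshape(P * s)

d, T = 64, 3
P = int(np.sqrt(d))                              # the sqrt(d) choice from above
rng = np.random.default_rng(0)
K = rng.normal(size=(T, d)) / np.sqrt(d // P)    # keep per-block key norms ~1
V = rng.normal(size=(T, d))
M = blocked_outer_store(K, V, P)
print(M.size, "entries vs", d * d, "for a full matrix")
print(np.corrcoef(blocked_retrieve(M, K[0]), V[0])[0, 1])  # noticeably positive
```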
@Raju-ib9ug6 ай бұрын
Been seeing a lot of talk in the private circles about cyberopolis.
@rumfordc6 ай бұрын
ngl that's gotta be among the top 20 stupidest names for anything i've ever heard
@jonsmith63316 ай бұрын
First for AI
@darshank87486 ай бұрын
AI for fisting
@qhansen1236 ай бұрын
2nd for AI
@aakashsaini33276 ай бұрын
3rd for AI :P
@GAISENSE6 ай бұрын
Feels more tLSTM than mLSTM, right?
@jabowery6 ай бұрын
Self-aggrandizing boomer-posting admitted, there is a good reason for bringing prior art to people's "attention", and it has to do with the foundation of intelligence in Kolmogorov complexity approximation: don't multiply names for things beyond necessity. Now, don't get me wrong: I'm not saying that the terms currently in use are inferior. I'm just saying that unifying the taxonomy can reduce the explosion of confusion that now besets the field. So the renaming can be beneficial, so long as one then describes prior art in terms of the current tech argot with appropriate modifiers.
@XX-vu5jo6 ай бұрын
I kinda find this a joke of a paper
@corgirun78926 ай бұрын
the baseline is unfair
@jonnylukejs6 ай бұрын
I invented this and it got jacked, low key. I called it block-matrix LSTM, and they changed the name to be dicks and get away with it, but the fact that it exactly follows my ipynb for it is like, ehhh.
@jonnylukejs6 ай бұрын
My app is called Hyper Chat and I'm still going to launch it, but yeah, I've had this since I wrote the code for it.