xLSTM: Extended Long Short-Term Memory

39,055 views

Yannic Kilcher

1 day ago

Comments: 119
@GraniLP 6 months ago
Funny to see my professors' names on this paper. It feels odd, since I've known this channel since well before I started studying there.
@wurstelei1356 6 months ago
Thank god they developed these techniques decades ago, so nothing is patented and hidden from the public.
@Fleniken 3 months ago
Hey, sorry for being a bit off-topic from the original comment, but I wanted to ask for advice on where to start if I want to reach a level where I can understand these videos (the ones Yannic is doing). My resources are: spare time, access to KZbin, and the like.
@UTOPYofficial 1 month ago
@@Fleniken StatQuest
@wolpumba4099 6 months ago
*Summary*

*What is xLSTM?* [0:00]
* xLSTM aims to push the boundaries of LSTM architectures by incorporating lessons learned from the world of LLMs and Transformers.
* It introduces two modified LSTM cells: sLSTM and mLSTM.
* xLSTM architectures are formed by residually stacking these modified LSTM blocks.

*Key Features:* [7:35]
* *Exponential Gating:* [31:02] Replaces the traditional sigmoid non-linearity in LSTM gates with an exponential function to address vanishing gradient issues.
* *Normalization and Stabilization Techniques:* [32:38] Introduces methods to handle the rapid growth of the exponential function and stabilize training.
* *Modified Memory Structures:*
  * *sLSTM:* [27:47] Utilizes a scalar memory, a scalar update, and "new" memory mixing (which leverages matrix properties for information routing between dimensions).
  * *mLSTM:* [36:24] Employs a matrix memory and a covariance update rule for associative memory. It's fully parallelizable in training, similar to Transformers.

*Advantages:*
* *Constant Memory Usage:* Unlike Transformers, xLSTM maintains a fixed memory footprint regardless of sequence length.
* *Competitive Performance:* Achieves results comparable to state-of-the-art Transformers and State Space Models on language modeling benchmarks.
* *Parallelizable Training (mLSTM):* The mLSTM variant removes the non-linear dependency on past time steps, enabling parallel training like Transformers.

*Limitations:* [54:30]
* *Large Constant Memory Requirement:* While memory usage is constant, the mLSTM's matrix memory can be large, leading to higher computational costs.
* *No Fast Parallel Training for sLSTM:* The sLSTM variant still involves recurrence, making fast parallel training challenging.
* *Further Optimization Needed:* The authors acknowledge the need for further architecture and hyperparameter optimization, especially for larger xLSTM models.

*Overall:* [55:54]
* xLSTM demonstrates the potential of enhanced LSTM architectures to compete with Transformers in language modeling.
* Further research and real-world applications will determine its long-term impact and adoption.

I summarized the transcript with Gemini 1.5 Pro.
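To make the exponential gating and stabilization points above concrete, here is a minimal NumPy sketch. It is only an illustration under the assumptions stated in the summary (a scalar cell c_t = exp(f_raw_t)*c_{t-1} + exp(i_raw_t)*z_t with a normalizer n_t, and a running maximum m_t used as the stabilizer); it is not the authors' code, and the real sLSTM/mLSTM cells add more machinery around this.

```python
# Sketch of exponential gating with a max-based stabilizer (illustration only).
import numpy as np

def readout_naive(f_raw, i_raw, z):
    # c_t = exp(f_raw_t)*c_{t-1} + exp(i_raw_t)*z_t, readout h = c / n
    c, n = 0.0, 0.0
    for fr, ir, zt in zip(f_raw, i_raw, z):
        f, i = np.exp(fr), np.exp(ir)
        c = f * c + i * zt
        n = f * n + i
    return c / n

def readout_stabilized(f_raw, i_raw, z):
    # track m_t = max(f_raw_t + m_{t-1}, i_raw_t) and rescale both gates;
    # the ratio c_t / n_t is unchanged, but no huge exponentials are formed
    c, n, m = 0.0, 0.0, -np.inf
    for fr, ir, zt in zip(f_raw, i_raw, z):
        m_new = max(fr + m, ir)
        f = np.exp(fr + m - m_new)   # rescaled forget gate, <= 1
        i = np.exp(ir - m_new)       # rescaled input gate, <= 1
        c = f * c + i * zt
        n = f * n + i
        m = m_new
    return c / n

rng = np.random.default_rng(0)
T = 400
f_raw = rng.normal(size=T) + 2.0     # deliberately large forget pre-activations
i_raw = rng.normal(size=T)
z = rng.normal(size=T)

print(readout_naive(f_raw, i_raw, z))       # blows up: NaN once the exponentials overflow
print(readout_stabilized(f_raw, i_raw, z))  # finite, well-behaved readout
```

The check works because rescaling both c_t and n_t by exp(-m_t) cancels in the ratio c_t / n_t, so the stabilized recurrence returns the same readout while never materializing huge exponentials.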
@XX-vu5jo 6 months ago
Gemini is a joke lol
@FunkyJeff22 6 months ago
Thanks!
@guillaumevermeillesanchezm2427 6 months ago
How much did it cost?
@wolpumba4099 6 months ago
@@guillaumevermeillesanchezm2427 Nothing. I'm in some kind of beta. It is also super fast (less than 10 seconds), and much better than GPT-4.
@guillaumevermeillesanchezm2427 6 months ago
@@wolpumba4099 thank you for answering!
@rockapedra1130 5 months ago
Thanks Yannic, I really value these videos. The way you work out the math really helps! For example, I really liked the way you easily and clearly explained the mechanism of storing key-value pairs using outer products. You should add a donation button to your channel!
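For readers who want to see that outer-product mechanism in isolation, here is a minimal NumPy sketch (an illustration with arbitrary dimensions and random data, not the paper's code): a key-value pair is written by adding v k^T to a matrix, and read back by multiplying the matrix with the key.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                     # key/value dimension (arbitrary choice)
K = rng.normal(size=(5, d)) / np.sqrt(d)    # 5 roughly orthogonal random keys
V = rng.normal(size=(5, d))                 # 5 values to remember

M = np.zeros((d, d))
for k, v in zip(K, V):
    M += np.outer(v, k)                     # write: rank-one update v k^T

v_hat = M @ K[2]                            # read back with key #2
cos = v_hat @ V[2] / (np.linalg.norm(v_hat) * np.linalg.norm(V[2]))
print(cos)                                  # close to 1: value #2 is retrieved
```

With roughly orthogonal keys the cross-terms mostly cancel, which is why the retrieved vector points almost exactly at the stored value; the mLSTM wraps this same rank-one write in input and forget gates and adds a normalizer.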
@tantzer6113 6 months ago
Seems like the title of this paper could have been, perhaps provocatively, “LSTMs are all you need.”
@nicolasmichel5163 6 months ago
I feel that's not really the conclusion here. More like "Billions of parameters is all you need"
@Fleniken 3 months ago
Hey, sorry for being a bit off-topic from the original comment, but I wanted to ask for advice on where to start if I want to reach a level where I can understand these videos (the ones Yannic is doing). My resources are: spare time, access to KZbin, and the like.
@myname-sj7hs 3 months ago
@@Fleniken To get you started I'd recommend 3b1b's neural network miniseries as well as Josh Starmer's (kzbin.info/aero/PLblh5JKOoLUIxGDQs4LFFD--41Vzf-ME1&feature=shared). The former should give you a basic intuition for how neural networks work, and the latter should help explain more complicated concepts such as cross entropy, LSTMs, and Transformers.
@Mordenor 6 months ago
Thank you Mr Yannic for explaining xLSTM, which extends the famous Long Short-Term Memory model. P.S. I like your videos, so please stay healthy.
@aintgonhappen 6 months ago
Pray for Mr Yannic 🙏🙏🙏
@yeetyeet7070 6 months ago
Extended Long Short-Term really sounds like upper lower middle class
@Hexanitrobenzene 6 months ago
Yeah, the adjacent words "long" and "short" do not clarify matters at all... In contrast, the authors of "Attention is all you need" could work for political campaigns writing slogans as a side hustle :)
@CM-mo7mv 6 months ago
finally approaching ART
@JackSPk 6 months ago
"Matrices aren't circles" - Yannic Kilcher
@RezaJavadzadeh 6 months ago
Brilliant, thanks Yannic
@pawelkubik 6 months ago
I used to think of c and h as memory capacitor and hidden output. This was especially clear in word tagging problems where we had to align our outputs with the input tokens. So the h vector was directly corresponding to one of the tag classes that we used to predict and c was used strictly as the memory (I thought c was just from "capacitor" or "memory Cell").
@andytroo 6 months ago
33:10 - is this sort of a built-in softmax? Exponentiate everything, then normalise?
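That intuition can be checked numerically. The sketch below assumes the scalar form with exponential gates and a normalizer (c_t = exp(f_raw_t)*c_{t-1} + exp(i_raw_t)*z_t, n_t = exp(f_raw_t)*n_{t-1} + exp(i_raw_t)), which is how the mechanism is presented in the video; under that assumption the readout c_t / n_t is exactly a softmax-weighted average of the past inputs, where the logit of input s is i_raw_s plus the sum of the later forget pre-activations.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 8
i_raw = rng.normal(size=T)    # input gate pre-activations
f_raw = rng.normal(size=T)    # forget gate pre-activations
z = rng.normal(size=T)        # inputs to be remembered

# recurrent form with exponential gates and a normalizer
c, n = 0.0, 0.0
for t in range(T):
    c = np.exp(f_raw[t]) * c + np.exp(i_raw[t]) * z[t]
    n = np.exp(f_raw[t]) * n + np.exp(i_raw[t])
recurrent = c / n

# closed form: a softmax over logits l_s = i_raw_s + sum of later f_raw
logits = np.array([i_raw[s] + f_raw[s + 1:].sum() for s in range(T)])
weights = np.exp(logits - logits.max())
weights /= weights.sum()
closed_form = weights @ z

print(recurrent, closed_form)   # the two values agree
```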
@yichunchen4370 4 months ago
I think this is an improved version of the idea from another of Yannic's videos, Infini-attention, where the memory is updated directly with the new set of keys and values. Here we have a more sophisticated way of updating the memory, borrowed from the idea of the LSTM gates, namely f_t and i_t.
@DamianReloaded 6 months ago
So, the answer is kind of yes. If you scale a high-dimensional token mixer, using backpropagation to adjust weights towards the desired result, you will achieve functionality. The question lingering in my mind is: do biological neural networks employ backpropagation? How do we one-shot learn new token sequences, and how are we able to remember them long term and bring them back when we need them if they are so low probability (we only saw them once)?
@xxlvulkann6743 6 months ago
I imagine that when you have agentic models, you can implement more sophisticated memory encoding. For example, you might allow for particular memory samples to have a larger "significance" based upon your current level of arousal/reward. Also, exposure to a token doesn't have to come from the external environment, it may result from constantly "thinking" about the topic, essentially generating and training on synthetic data. We must remember that generative models are still not actual agentic models, they're basically just foundation models.
@ssssssstssssssss 6 months ago
Backpropagation is largely considered implausible for biological networks and BPTT is impossible because it is a non-causal system. Some do think the brain does employ some kind of "gradient" though.
@Hexanitrobenzene 6 months ago
@@ssssssstssssssss BPTT?
@ChlorieHCl 6 months ago
@@Hexanitrobenzene Back-propagation through time
@eltongoaustriaco8268 6 months ago
The brain might generate a training signal from a single example in short term memory (you repeating your hotel room number in mind). Regarding BP, it is plausible that the brain uses a less optimised version of that.
@davidhauser7537 6 months ago
Nice, thanks for covering this paper :)
@intrinsical 6 months ago
I mean, the term Language Model was coined in the 90s. Even N-Gram models were considered language models. We just didn't start prefixing Language Models with the word "Large" till the early 2000s. The claim that LSTMs were doing LLM in the 90s is an exaggeration, but also partially true.
@EobardUchihaThawne 6 months ago
mLSTMs are similar to Google's Infini-attention in how they handle memory retrieval
@danraviv7393 2 months ago
When you wonder what would happen if you squared the hidden dimension size at 47:00 - yeah, maybe it would have similar model abstraction capacity, but it would still not be parallelizable, which is maybe the main point of their solution?
@edeneden97 6 months ago
in the mLSTM block, isn't it very similar to attention just without softmax?
@GGlessGo 6 months ago
And is it? Can't completely follow, actually
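Here is a small numerical check of that comparison. It is a sketch that deliberately strips out the gates and the normalizer, so it only covers the bare matrix-memory part rather than the full mLSTM: reading the accumulated v k^T memory with a query gives exactly attention with raw q·k scores, i.e. attention without the softmax.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 6, 32
K = rng.normal(size=(T, d))   # keys
V = rng.normal(size=(T, d))   # values
q = rng.normal(size=d)        # query

# matrix-memory view: accumulate C = sum_t v_t k_t^T, then read with the query
C = np.zeros((d, d))
for t in range(T):
    C += np.outer(V[t], K[t])
readout = C @ q

# attention view without softmax: raw scores k_t . q weight the values
scores = K @ q                # shape (T,)
attention_like = scores @ V   # sum_t scores_t * v_t

print(np.allclose(readout, attention_like))   # True
```

The full mLSTM differs in that exponential input and forget gates reweight the rank-one updates over time, and a normalizer rescales the readout.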
@bensimonjoules4402 6 months ago
The last few papers Yannic covered all follow the same line of bringing back some sort of recurrence alongside transformers. In this case not explicitly, but I don't see a fundamental reason why each step of the sequence couldn't be processed by one. There seems to be a clear research direction of resurgent recurrence; I wonder if this direction has a formal theory or even a name.
@tamirtsogbayar3912 6 months ago
Hello Yannic, thanks for your videos! Are you going to make some videos related to KAN (Kolmogorov-Arnold Networks)? Thank you
@quickpert1382 6 months ago
KANs are fairly easy, and it's a nice lecture to venture into by yourself
@_XoR_ 6 months ago
Unfortunately they are quite flawed for most applications, since they don't scale, and depending on the distribution shape they can be worse than MLPs.
@quickpert1382 6 months ago
@@_XoR_ Yep, for now we are waiting for an optimized implementation.
@pietrorse 6 months ago
This reminds me of mixing serialization and parallelization across various layers, which I actually observe in nature.
@kuan-chunlee9611 5 months ago
I had a question: it sounds like a linear temporal dependency can be parallelized somehow, or at least computed faster. How can this parallelization be done, if it is possible? Thanks if anyone knows
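One standard answer, sketched below (this is the generic trick used for linear recurrences, e.g. in state-space models; the paper's own training kernels may be organized differently): once the recurrence is linear, c_t = a_t*c_{t-1} + b_t, every step is an affine map, composing affine maps is associative, and an associative combine can be evaluated with a parallel prefix scan in O(log T) depth instead of T sequential steps.

```python
import numpy as np

def sequential(a, b):
    # reference: c_t = a_t * c_{t-1} + b_t, starting from c_0 = 0
    c, out = 0.0, []
    for a_t, b_t in zip(a, b):
        c = a_t * c + b_t
        out.append(c)
    return np.array(out)

def combine(x, y):
    # compose the affine maps c -> a1*c + b1 (applied first) and c -> a2*c + b2
    (a1, b1), (a2, b2) = x, y
    return (a1 * a2, a2 * b1 + b2)

def tree_scan(pairs):
    # recursive inclusive scan; the two halves could be processed in parallel
    if len(pairs) == 1:
        return pairs
    mid = len(pairs) // 2
    left, right = tree_scan(pairs[:mid]), tree_scan(pairs[mid:])
    carry = left[-1]
    return left + [combine(carry, p) for p in right]

rng = np.random.default_rng(3)
a, b = rng.normal(size=16), rng.normal(size=16)
scanned = tree_scan(list(zip(a, b)))
print(np.allclose(sequential(a, b), [c for _, c in scanned]))   # True
```

The nonlinear dependency in the classic LSTM (the hidden state feeding back through the gates) is exactly what breaks this trick, which is why the mLSTM drops it to regain parallel training.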
@TheRohr 1 month ago
(a) Couldn't they simply use a LayerNorm instead of the n_t? (b) It sounds like they didn't answer their own question (or was it already answered somewhere else?) of how far you can get with plain LSTMs; they invented these blocks etc. and only showed results for xLSTMs?
@paxdriver 6 months ago
Thank you Yannic! I thought I was crazy, but you seem to have read a similar tone in the early sections lol. That's pretty funny: "our paper is all about this addition, and this multiplication... novel ideas, eh?". That's the headline, but only after that does the really new part start, with the memory management (soft memory, not hardware... also confusing).
@chickenp7038 6 months ago
We need a new Mamba explanation. The current one has errors and doesn't really explain much
@longvo7088 6 months ago
You need to read previous papers like HiPPO and S4 to be able to understand Mamba, along with some prerequisite skills in CUDA programming.
@AM-yk5yd 6 months ago
Sasha Rush has several as he seems to be a big fan of SSM. "Mamba: The Hard Way" is very detailed.
@matveyshishov 6 months ago
Wait, I've been watching your channel for maaany years, how come it only has 245k subscribers, and something like 2minpapers has 1.5M?
@ChlorieHCl 6 months ago
I've felt a significant decline in quality for Two Minute Paper videos. The 2min are like 30s of unwanted background info, 30s of experimental results, and 1min of sponsor acknowledgment. And also “what a time to be alive” and “hold on to your papers”, apparently. No real info gained from those videos. To the point that I've unsubbed from that channel for months just to get rid of the annoyance.
@yakmage8085 6 months ago
@@ChlorieHCl There's been a decline for sure, but Yannic's videos also have a significantly higher minimum education requirement. 2 Minute Papers videos are just highlights, with no math, intuition, or criticism
@AvastarBin 6 months ago
Because 2minpapers videos are 5 or 6 minutes long (ironically) and are understandable by anyone regardless of background, whereas Yannic's videos are an hour long, very in-depth, and require a lot of background knowledge in ML
@GoldenBeholden 6 months ago
@@ChlorieHCl Yeah, seeing some guy get enthusiastic about research papers was nice enough when the channel just began and sat below 30k subscribers, but he really started playing into his "character" rather than the actual content of the papers. Not really worth your time anymore, to be honest. AI Explained is great if you're looking for another channel in the same vein as this one (albeit lighter on the academics).
@thirdeye4654 6 months ago
Why do influencers on Tiktok have millions of followers just talking bullshit all day long? Because people love entertainment and not many have a long attention span. Also there is just so much time you have in your own life to watch and do stuff.
@florianjug 6 months ago
Isn’t that close to the mindset behind Mamba as well? What would be the key difference?!
@arashbtk3237 13 days ago
Explained very well
@hanskraut2018 6 months ago
I could have told you that when I was finishing kindergarten. I hope there is more behind it than it sounds.
@Fabio-zi4hw 6 months ago
Is this the ultimate bitter lesson?
@herp_derpingson 6 months ago
Money is all you need
@intrinsical 6 months ago
So the matrix memory is simply old school Kohonen Maps from the 70s?
@Hexanitrobenzene 6 months ago
It seems so, if that's the name. They list Kohonen, Anderson and Nakano as references, all from 1972.
@tiagotiagot 6 months ago
Is there enough information in the pdf that some of the current bigger LLMs that can read pdfs would be able to produce the equivalent code to what the researchers used to get their alleged results?
@Hexanitrobenzene 6 months ago
This task probably requires AGI...
@AmirNajafgholi 6 months ago
Don't you want to review KAN?
@dairin0d 6 months ago
Regarding the large memory requirements of the d*d matrix, perhaps they could take a page from the Vector Symbolic Architectures approach? In VSA, state, keys and values are all vectors of the same shared space (and so have the same dimension), so if all that's needed is to combine them in a way that would result in dot(new_state, key) ~= value, VSA's binding operation (e.g. component-wise / Hadamard product) sounds like a perfectly viable replacement 🤔 I suppose it would still benefit from large space dimensionality, but a vector size can be controlled on a more granular level than a square matrix size. If they use binary or ternary weights, the memory requirements would be even smaller (though that would probably require some changes in how the model is trained).
@JerryFederspiel 6 months ago
If I'm thinking about this right, the off-diagonal elements of the outer products of k and v can be thought of as "clues" that each vector element in the key gives about each other vector element in the value. The Hadamard product dispenses with these clues (each element is treated independently), but maybe each individual element only has to be kind of right with a VSA because d is so high. It may also be possible to compromise between Hadamard and outer products by taking the key and value vectors and breaking them up into P parts of d/P elements each. Then you take the outer products of corresponding parts. This gives us a memory requirement of P * (d/P)^2 = d^2 / P. It means that each key element gives a clue about d/P value elements. Setting P to sqrt(d) feels good, so clearly that is the right choice 🙂
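A small sketch of that compromise (hypothetical, not something proposed in the paper): keep P separate (d/P)×(d/P) outer-product memories, one per block of the key and value vectors. P = 1 recovers the full matrix memory, P = d recovers the element-wise (Hadamard/VSA-style) binding, and the memory cost is d^2/P, as noted above.

```python
import numpy as np

def write(M, k, v, P):
    # rank-one update per block: M_p += v_p k_p^T
    for p, (kp, vp) in enumerate(zip(np.split(k, P), np.split(v, P))):
        M[p] += np.outer(vp, kp)
    return M

def read(M, k, P):
    # read each block with its own key slice and concatenate the results
    return np.concatenate([M[p] @ kp for p, kp in enumerate(np.split(k, P))])

d, P = 256, 16                              # memory: P*(d/P)^2 = d*d/P entries
rng = np.random.default_rng(4)
k = rng.normal(size=d) / np.sqrt(d // P)    # keep per-block key norms near 1
v = rng.normal(size=d)

M = np.zeros((P, d // P, d // P))
M = write(M, k, v, P)
v_hat = read(M, k, P)
print(np.corrcoef(v_hat, v)[0, 1])          # high correlation: v comes back approximately
```

With several pairs stored per block there is more crosstalk than with the full outer product, since each key element only gives "clues" about d/P value elements, which matches the trade-off described in the comment.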
@rumfordc 6 months ago
ngl that's gotta be among the top 20 stupidest names for anything i've ever heard
@jonsmith6331 6 months ago
First for AI
@darshank8748 6 months ago
AI for fisting
@qhansen123 6 months ago
2nd for AI
@aakashsaini3327 6 months ago
3rd for AI :P
@GAISENSE 6 months ago
Feels more tLSTM than mLSTM, right?
@jabowery 6 months ago
Self-aggrandizing boomer-posting admitted, there is a good reason for bringing prior art to people's "attention", and it has to do with the foundation of intelligence in Kolmogorov complexity approximation: don't multiply names for things beyond necessity. Now, don't get me wrong here. I'm not saying that the terms currently in use are inferior; I'm just saying that unifying the taxonomy can reduce the explosion of confusion that now besets the field. So the renaming can be beneficial, so long as one then describes prior art in terms of the current tech-argot with appropriate modifiers.
@XX-vu5jo 6 months ago
I kinda find this a joke of a paper
@corgirun7892 6 months ago
the baseline is unfair
@jonnylukejs 6 months ago
I invented this and it got jacked, low key. I called it "block matrix LSTM" and they changed the name to be dicks and get away with it, but the fact that it exactly follows my ipynb for it is like, ehhh
@jonnylukejs 6 months ago
My app is called hyper chat and I'm still going to launch it, but yeah, I've had this since I wrote the code for it
@wunder1385 6 months ago
Sure bro
Scalable MatMul-free Language Modeling (Paper Explained)
49:45
Yannic Kilcher
33K views
Neural and Non-Neural AI, Reasoning, Transformers, and LSTMs
1:39:39
Machine Learning Street Talk
82K views
Visualizing transformers and attention | Talk for TNG Big Tech Day '24
57:45
What is Q-Learning (back to basics)
45:44
Yannic Kilcher
100K views
A Brain-Inspired Algorithm For Memory
26:52
Artem Kirsanov
170K views
The Dome Paradox: A Loophole in Newton's Laws
22:59
Up and Atom
6K views
RWKV: Reinventing RNNs for the Transformer Era (Paper Explained)
1:02:17
Mixtral of Experts (Paper Explained)
34:32
Yannic Kilcher
Рет қаралды 59 М.
Flow Matching for Generative Modeling (Paper Explained)
56:16
Yannic Kilcher
Рет қаралды 58 М.
MIT 6.S191 (2023): Recurrent Neural Networks, Transformers, and Attention
1:02:50