You always know it's an important paper when Yannic makes a video of it.
@mshonle · a month ago
Well, that, or the paper is due for a serious burn! But like that viral “MIT exams passed” paper, he can totally rip into it while explaining important concepts in the margins.
@emuccino · a month ago
This paper is not important.
@LysergicKids · a month ago
More like uploading to shitpost, with many of the papers that have been coming out recently being complete nonsense.
@Cereal.interface · a month ago
the satire is strong with this one
@PCPTMCROSBY · 29 days ago
Personally, I think it's his perspective that lends more to the understanding than any given paper, which can vary as to its excitement level.
@caimansaurus5564 · a month ago
Their 757M model is worse but takes less than half the compute to train, because despite going through more tokens, the majority of those tokens were used to train the smaller models before upscaling. 300B on 124M + 60B on 354M + 60B on 757M is significantly less compute than 300B straight up on 757M. That is a noteworthy result.
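As a rough sanity check of that claim, here is a back-of-the-envelope comparison using the standard C ≈ 6·N·D training-FLOPs approximation (the parameter and token counts are the ones quoted above; the real ratio depends on the exact architecture):

```python
# Back-of-the-envelope training compute, C ≈ 6 * params * tokens.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

progressive = (
    train_flops(124e6, 300e9)   # train the 124M model on 300B tokens
    + train_flops(354e6, 60e9)  # grow to 354M, 60B more tokens
    + train_flops(757e6, 60e9)  # grow to 757M, 60B more tokens
)
from_scratch = train_flops(757e6, 300e9)  # 757M trained on 300B tokens directly

print(f"progressive: {progressive:.2e} FLOPs, from scratch: {from_scratch:.2e} FLOPs")
print(f"ratio: {progressive / from_scratch:.2f}")  # ~0.46, i.e. less than half
```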
@mingzhehan3348 · a month ago
I was just trying to comment the exact same point, and then I found your comment. Great catch!
@codelapiz · a month ago
Not only less compute. The most generalized concepts were also forcibly compressed due to the smaller size at the beginning. If the remnants of this do not fade, this architecture could exhibit some of the (imo) dearly missed inductive bias. I was recently doing some work with vision transformers on small datasets. The consensus is that transformers need larger datasets than CNNs to avoid overfitting because they do not have inductive bias, i.e. they don't develop those biases toward the real truth that bias them away from the training truth. I definitely felt this when working on them. On some level that's the same thing GPT-4 and SOTA LLMs do: they fit the data rather than learning the system. If this sort of approach, maybe with variable learning rates or locked tokens, could keep some of the simpler world views of the smaller LLMs as the models grow, and add to them instead of replacing them, we could get way less overfitting, less next-word prediction, and more real-world compression.
@pavelkraynyukhov1990 · 28 days ago
Am I reading that chart wrong? The perplexity of the model that was trained on more tokens in less time is much lower, which is GOOD 🤣
@Danilyn_Livao · a month ago
Yannic, your ability to simplify complex concepts like TokenFormer and transformer scaling is incredible! 🚀 The way you break down the paper makes advanced topics feel accessible and engaging. Thanks for always delivering top-notch explanations!
@JorgetePanete · 22 days ago
This really sounds like ChatGPT prompted with this video's title.
@goddamnitization · a month ago
A thing I think is interesting is the idea of swapping out the tokenized parameters based on the input. Could be a way of getting a kind of memory without having to have a larger context window
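A minimal sketch of what that could look like (my own toy code, not from the paper; plain softmax stands in for the paper's modified normalization, and the router is a made-up placeholder):

```python
import torch
import torch.nn.functional as F

def pattention(x, k_params, v_params):
    # Attention over learned parameter tokens: score against key tokens,
    # then take a weighted sum of value tokens (softmax used here for simplicity).
    scores = x @ k_params.T                      # (seq, n_param_tokens)
    return F.softmax(scores, dim=-1) @ v_params  # (seq, d)

# Hypothetical "memory" variant: several parameter-token banks, picked per input.
d, n_tokens = 128, 64
banks = {
    "code":    (torch.randn(n_tokens, d), torch.randn(n_tokens, d)),
    "general": (torch.randn(n_tokens, d), torch.randn(n_tokens, d)),
}

def route(x):
    # Toy router; in practice this could be a learned classifier over the input.
    return "code" if x.mean() > 0 else "general"

x = torch.randn(10, d)          # 10 input tokens
k, v = banks[route(x)]
y = pattention(x, k, v)         # (10, d), computed with only the selected bank
```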
@viktoroo · a month ago
This video feels like a stand up joke. You have everything: setup, build, punchline. Nice work, Yannic, it was super entertaining
@Fordance100 · a month ago
A lot of details are missing. What I want to see is a comparison between a large model trained from scratch and a same-size model scaled up from a small model, over the training epochs, to determine whether this has any advantages.
@Mordenor · a month ago
Thank you Mr Yannic for explaining TokenFormer in his comfy hoodie and trademark glasses.
@bairesearch · 29 days ago
An advantage of supporting dynamic scaling of model channel features is that a transformer can be applied over an expanding datasource (eg assembly knowledge graph) rather than a literal context window (where input features are typically automatically assigned during training). Can be combined with 'sequence' length independence.
@projectpiano5231 · a month ago
Y'all, I feel like we're being collectively gaslit with all the "significant cost with scaling" stuff. It's known that the stability-plasticity dilemma is the core problem of machine learning, so why isn't that held front-and-center in literally every paper? Like, I get it, it's easier to make a marginal improvement and it's kind of hard to solve AGI, but any meaningful progress that is made is going to relate to the stability-plasticity dilemma, not to making scaling cheaper by 10% in some cases. Is anybody else on the same page?

Man, I appreciate these videos *and also* they make me realize how bs so much of the ML space is. I feel like it's because there's an expectation of output/progress, and people try to rationalize why they put so much effort into their research while not producing anything of value, and then try to convince themselves it's of value. I wish we could just say the quiet part out loud. Thanks for going through the pain of these papers so I don't have to.
@novantha1 · a month ago
Counterpoint: A lot of machine learning paradigms only show their real behaviour at scale. If you trained a modern Transformer-architecture LLM back in 1999, people would have laughed at you, because 1M parameters would have been a pipe dream, so you wouldn't be able to explore the types of behaviour we're seeing in 330M, 900M, 1B, 7B, 13B, 70B models. Every performance improvement is a huge win in this respect, because it means we can train new ideas at larger and larger parameter counts with more modest hardware allotments. We can train diffusion models today on a single researcher's GPU in a week that outperform SD 1, which was trained in the cloud for millions of dollars. We can train language models that are actually kind of coherent in a single day, on a single GPU, too. What if the next major paradigm shift only shows its value at 3B parameters? 7B? 13B? Odds are good that researchers may not actually find it very quickly, because the rate of performing experiments at that scale is so much slower, so there may be totally valid research that's only been done at a 330M scale, where it just doesn't work yet.

Beyond that, there's a huge practical benefit to training-scaling improvements: AGI might not be a "model". If you look at ant colonies, as a colony they're a remarkably exhaustive searcher, even though none of the individual ants are particularly intelligent. If AI is cheap enough to train a large number of variants of, I think it's not unrealistic that with a huge decentralized network of AI agents trading with one another, where possibly every company and every person has their own customized model, we might see a situation where the collective of them is basically able to achieve any technical information problem we could set out to achieve, even if no one model is really "AGI" as such. The ability to introduce diversity and create unique models by iterating quickly at a small scale will be hugely important to a future like that. This outcome doesn't even really require new research, just application of what we already have.
@jithinchand · a month ago
Yannic on his unabomber arc.
@marshallmcluhan33 · a month ago
I'm glad I'm not the only one who thought that
@444haluk · a month ago
Lol, TokenFormer is literally a hierarchical modern Hopfield network. The only difference between a transformer and a modern Hopfield network was those W matrices; now that they are gone, it is a modern Hopfield network all the way down.
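For reference, a rough side-by-side of the update rules as I remember them (paraphrased, not verbatim from either paper):

```latex
% Modern Hopfield retrieval (stored patterns as rows of X):
\xi^{\mathrm{new}} = X^{\top}\,\mathrm{softmax}\!\left(\beta\, X \xi\right)

% Standard attention, with learned projections W_Q, W_K, W_V:
\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\tfrac{(X W_Q)(X W_K)^{\top}}{\sqrt{d}}\right) X W_V

% TokenFormer's Pattention: the W's are replaced by learned key/value parameter
% tokens K_P, V_P, and the softmax by a modified normalization \Theta:
\mathrm{Pattention}(X, K_P, V_P) = \Theta\!\left(X K_P^{\top}\right) V_P
```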
@DamianReloaded · a month ago
What if, after adding parameters, you only train the new parameters without changing the old ones, and then at some point you train them all together?
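A minimal sketch of that two-phase schedule, assuming the zero-initialized new key/value parameter tokens mentioned elsewhere in this thread (toy tensors, not the paper's training code):

```python
import torch

d, n_old, n_new = 128, 64, 32
# Pretrained parameter tokens (kept frozen during phase 1).
k_old = torch.randn(n_old, d)
v_old = torch.randn(n_old, d)
# Newly added parameter tokens, zero-initialized.
k_new = torch.zeros(n_new, d, requires_grad=True)
v_new = torch.zeros(n_new, d, requires_grad=True)

# Phase 1: the optimizer only sees the new tokens, so the old ones stay fixed.
opt = torch.optim.Adam([k_new, v_new], lr=1e-3)

# Phase 2 (later): unfreeze everything and fine-tune jointly at a lower rate.
k_old.requires_grad_(True)
v_old.requires_grad_(True)
opt = torch.optim.Adam([k_old, v_old, k_new, v_new], lr=1e-4)
```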
@taohu88 · a month ago
Thanks for discussing the paper. I think their paper focuses on hours to train, not just #tokens; your comments are about #tokens. Further, in Appendix B they indeed show that zeroing out the new K won't change the output of Pattention(X, K, V). It is not a random enlargement of the dimension; I think this is the gist of their contribution.
@PCPTMCROSBY · 29 days ago
That's quite interesting. It's reminiscent of understanding how to deal, mathematically, with the differences when you're using an old wire recorder: if you edit the wire and cut a section out, you have a problem, because the effective computations for the total length of the wire are computed into the total spooling factor, which means you're changing the ratio of the speed estimations, which of course changes your parameters. This was later addressed by machines in which a fixed control system runs the tape across the heads at a consistent (or pre-chosen) speed, so the rotational speeds of the feed spool and the catch spool, if they change, do not matter with respect to the assimilation of information through the heads and the rest of the circuitry. Of course, an adaptation of this type of electronic understanding would limit the amount of processing speed to a more fixed computational progression, but you would resolve some problems and be able to use the workability in many other different ways, having a new form of fixed foundational understanding from key points to reference.
@wwkk4964 · a month ago
It's interesting how the models, regardless of how they are trained, tend in the long run to compensate somehow (in terms of developing a capacity).
@float32 · a month ago
I think there's something fundamental going on here, and these are just different implementations of it: some law of data organization/compression that results in everything that tries ending up very close to the same thing.
@JerryFederspiel · a month ago
A few next steps to explore immediately present themselves:
1. If we replace some of the parts of transformers with little sub-transformers, why stop at one replacement step? Why not recurse deeper?
2. Transformers have a certain amount of computational power; multiple stacked transformers have a greater computational power. For example, a single transformer layer cannot perform the copy task, but a stack of two can. Can you make an induction head with just one TokenFormer layer (I doubt it)? Can we characterize the change in computational power (if any?) that we get by "tokenizing" model parameters? Or by recursively replacing transformer-parts with transformers to some given depth?
3. If we're going to perform transformer operations *within* a given token to get Q, K, and V... could we enlarge the output dimension by treating chunks of V as sub-tokens and just "decoding" more sub-tokens?
@projectpiano5231 · a month ago
I'd be curious about model stability here with recursive subtransformers. Also, projecting to a larger space and decoding there seems unnecessary imo because you can just increase embedding length and expanding would take exponentially more space and computation. Generally the models do worse with larger numbers of tokens because it's deeper contexts that they're having to learn from. Not trying to diss on the points/questions, just some thoughts/concerns I'd personally have
@JerryFederspiel · a month ago
"Also, projecting to a larger space and decoding there seems unnecessary..." - I'm not sure we're talking about the same thing. Imagine a token whose size at the beginning of a transformer layer is 1024. It may be typical during the transformer's MLP step to expand up to 4096, and then contract back down to 1024. The admittedly silly idea number 3 is this: instead of expanding each token in the first layer of the MLP, treat the 1024-vector as if it were already made up of sub-tokens. Say, 4 sub-tokens of dimension 256 each. Using those 4 sub-tokens as context, "decode" an additional 12 sub-tokens of size 256. Now you have 16 sub-tokens of size 256 total (per original token). You're still going to contract that down (maybe with a Perceiver sort of thing, or maybe just saying "alright, let's ignore the sub-token boundaries and use a fully-connected layer to take these 4096 contiguous vector elements and shrink them back down to 1024"), and future layers are going to treat the contracted result as one token again. There's no exponential explosion of tokens (or more importantly, token interactions) involved. The interesting part of this idea (esp. the Perceiver-to-condense-subtokens-back-into-one variation) is that you can use the same parameters and scale the compute up or down depending on what's available.
@JerryFederspiel · a month ago
(If anyone wants to try to make a paper out of that, be sure to call it the Ruminator)
@patrickl5290 · 20 days ago
It’s a bad idea
@JerryFederspiel · 20 days ago
thx
@herp_derpingson · 4 days ago
8:40 Basically the QKV projection is being generated using another attention mechanism? It basically starts as a low-rank space and will eventually fill up as we increase the param tokens.
13:27 It would be interesting to see curriculum learning integrated with this; it is kind of begging for that.
22:03 I thought the paper was already up-projecting. I read it wrong.
@oM477o · a month ago
19:40 That's also what LoRA uses: setting the LoRA dim is just setting that projection dimension. Using multiple LoRAs at a time is equivalent to concatenating them into one LoRA whose dim equals the sum of the dims of the LoRAs joined.
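For anyone curious, the identity is easy to check numerically (ignoring each LoRA's α/r scaling, which you can fold into B):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r1, r2 = 64, 32, 4, 8
B1, A1 = rng.standard_normal((d_out, r1)), rng.standard_normal((r1, d_in))
B2, A2 = rng.standard_normal((d_out, r2)), rng.standard_normal((r2, d_in))

# Applying two LoRAs means adding their low-rank updates to W ...
sum_of_loras = B1 @ A1 + B2 @ A2

# ... which is exactly one LoRA of rank r1 + r2 built by concatenation.
B_cat = np.concatenate([B1, B2], axis=1)   # (d_out, r1 + r2)
A_cat = np.concatenate([A1, A2], axis=0)   # (r1 + r2, d_in)
assert np.allclose(sum_of_loras, B_cat @ A_cat)
```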
@patrickl5290 · 20 days ago
Nice catch, where’d you learn that?
@andresacuna1108 · 28 days ago
Hey, so I contributed to Open Assistant back in the day but don't see my username on the contributor page :( Sad to know it's gone, I really wanted it to blow up. I also noticed the acknowledgements page is no longer available, and neither is the account page.
@ugthefluffster · a month ago
yo dawg I heard you like transformers so we put a transformer in your transformer so you can attend to tokens while attending to tokens
@ChlorieHCl · 29 days ago
23:30 That's not even new. In their paper they used GeLU for the nonlinearity instead of softmax, which brings it even closer to the original FFN. The reason is simple, since using softmax won't even work in this case. The zero-initialized new keys will have zero dot product with the input, but the attention score is calculated by softmax-ing the dot products, bringing the zeros to non-zero values, thus breaking the whole purpose. No, even if you also set the "value tokens" to zero it still doesn't work, since the attention scores will be off by a constant factor. Some of the attention is diverted to the newly added zero tokens.
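A tiny numerical check of that point (plain GeLU vs. plain softmax as the score nonlinearity; the paper's actual Θ is a modified normalization, so this only illustrates the zero-key argument):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_old, n_new = 16, 8, 4
x = torch.randn(5, d)                              # 5 input tokens
K, V = torch.randn(n_old, d), torch.randn(n_old, d)
K2 = torch.cat([K, torch.zeros(n_new, d)])         # append zero-initialized keys
V2 = torch.cat([V, torch.zeros(n_new, d)])         # and zero-initialized values

def pattention(x, K, V, nonlin):
    return nonlin(x @ K.T) @ V

# Elementwise GeLU: zero scores give zero weights, so the output is unchanged.
print(torch.allclose(pattention(x, K, V, F.gelu), pattention(x, K2, V2, F.gelu)))  # True

# Softmax couples all scores: the extra zeros soak up probability mass and rescale
# the old weights, so the output changes even though the new values are zero.
sm = lambda s: F.softmax(s, dim=-1)
print(torch.allclose(pattention(x, K, V, sm), pattention(x, K2, V2, sm)))          # False
```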
@EobardUchihaThawne · a month ago
Your videos are super cool and fun to watch 😂😂 Waiting for the nGPT paper review.
@jeremykothe2847 · a month ago
I'm a simple man. Yannic posts, I watch.
@mattanimation · a month ago
Thanks Yannic!
@mohamedabbashedjazi493 · 29 days ago
There may be some benefit in training KV parameters for each domain, e.g. medicine, history, etc.; that might also give some interpretability to transformers.
@mike___-fi5kp · 29 days ago
Thanks for the video!
@PCPTMCROSBY · 29 days ago
I believe many of us abused experimental folder creation, utilizing zeros, to get to the higher-speed addressing areas with respect to the computational circuitry; to get to the faster computational areas generally associated with hard drive operation. Of course, there are many other ways to look at those computations and twist a few things around.
@PCPTMCROSBY · 29 days ago
Gee, that should have been "used" instead of "abused". My voice detection is horrible today. Sorry about that.
@xingkuizhu4011 · a month ago
I had raised a similar question to yours (20:30) during the ICLR2025 open review of this article, and the author provided a response. I would greatly value hearing your thoughts on their answer.
@4thpdespanolo · a month ago
Please do nGPT
@jonclement · 28 days ago
my phd defense...question in the back...yeah, you with the hoodie...
@TheTruthOfAI · a month ago
The code is madness xD Python 3.8, old and rusty libraries, etc...
@PCPTMCROSBY · 29 days ago
You don't have to take me totally seriously, since obviously the bonus of high-speed computation is to utilize max-speed capabilities, as opposed to looking at things in a more simplified linear manner, or an analog manner, with respect to the original electronics and computational theory presentation, which already seemed to be cohesive, presenting everything in formulas from a low-hurts mathematics perspective. Of course, some of you may never have seen this; you just have to look at the older books to understand what I'm talking about.
@youngseokjeon3376 · a month ago
Probably the authors did not read the appendix and thought it was novel enough.
@juancarlosrial7197 · 27 days ago
Great explanation, and thanks for your opinion. I think the same... 😅
@table-rdy · a month ago
It sounds really good as an idea. Not saying they got it perfect, but... it sounds like an idea. Why so negative?
@fontenbleau · a month ago
I can say that the French anime "Mars Express" (2023) perfectly represents the future of this & of jailbroken androids.
@PCPTMCROSBY · 29 days ago
That's hertz, not hurts.
@dancar2537 · a month ago
It's pretty simple in my eyes: you people don't understand much. Attention should have been all you needed. Karpathy came to Tesla and thought, yeah, we can do it. Then Tesla did not avoid a post and Karpathy quit. Now, my parrot aces those posts: goes around them, sits on them, and does it perfectly. So Karpathy expected Tesla to do it too, but it did not, so you were in the dark and did not know what to expect from it, because you did not understand much of it. At 3 GHz, infinite training data, nuclear-plant energy demands, and building-size servers vs 10 Hz, small training data, grain power, and a tiny brain. They might be logical, but attention is not all you need to make it fly. It might not be "think of everything, understand nothing", but it sure is close.