You always know it's an important paper when Yannic makes a video of it.
@mshonle · a month ago
Well, that, or the paper is due for a serious burn! But like that viral “MIT exams passed” paper, he can totally rip into it while explaining important concepts in the margins.
@emuccino · a month ago
This paper is not important.
@LysergicKids · a month ago
More like uploading to shitpost, with many of the papers that have been coming out recently being complete nonsense.
@Cereal.interface · a month ago
the satire is strong with this one
@PCPTMCROSBY · 29 days ago
Personally, I think it's his perspective that lends more to the understanding than any given paper, which can vary as to its excitement level.
@caimansaurus5564 · a month ago
Their 757M model is worse but takes less than half the compute to train, because despite going through more tokens, the majority of those tokens were used to train the smaller models before upscaling. 300B on 124M + 60B on 354M + 60B on 757M is significantly less compute than 300B straight up on 757M. That is a noteworthy result.
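As a rough sanity check of that claim, here is a back-of-the-envelope comparison using the standard C ≈ 6·N·D training-FLOPs approximation (the parameter and token counts are the ones quoted above; the real ratio depends on the exact architecture):

```python
# Back-of-the-envelope training compute, C ≈ 6 * params * tokens.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

progressive = (
    train_flops(124e6, 300e9)   # train the 124M model on 300B tokens
    + train_flops(354e6, 60e9)  # grow to 354M, 60B more tokens
    + train_flops(757e6, 60e9)  # grow to 757M, 60B more tokens
)
from_scratch = train_flops(757e6, 300e9)  # 757M trained on 300B tokens directly

print(f"progressive: {progressive:.2e} FLOPs, from scratch: {from_scratch:.2e} FLOPs")
print(f"ratio: {progressive / from_scratch:.2f}")  # ~0.46, i.e. less than half
```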
@mingzhehan3348 · a month ago
I was just trying to comment the exact same point, and then I found your comment. Great catch!
@codelapiz · a month ago
Not only less compute. The most generalized concepts were also forcibly compressed due to the smaller size at the beginning. If the remnants of this do not fade, this architecture could exhibit some of the (imo) dearly missed inductive bias. I was recently doing some work with vision transformers on small datasets. The consensus is that transformers need larger datasets than CNNs to avoid overfitting because they do not have inductive bias, i.e. they don't develop those biases toward the real truth that bias them away from the training truth. I definitely felt this when working on them. On some level that's the same thing GPT-4 and SOTA LLMs do: they fit the data rather than learning the system. If this sort of approach, maybe with variable learning rates or locked tokens, could keep some of the simpler world views of the smaller LLMs as the models grow, and add to them instead of replacing them, we could get way less overfitting, less next-word prediction, and more real-world compression.
@pavelkraynyukhov1990 · 28 days ago
Am I reading that chart wrong? The perplexity of the model that was trained on more tokens in less time is much lower, which is GOOD 🤣
@Danilyn_Livao · a month ago
Yannic, your ability to simplify complex concepts like TokenFormer and transformer scaling is incredible! 🚀 The way you break down the paper makes advanced topics feel accessible and engaging. Thanks for always delivering top-notch explanations!
@JorgetePanete · 22 days ago
This really sounds like ChatGPT prompted with this video's title.
@goddamnitization · a month ago
A thing I think is interesting is the idea of swapping out the tokenized parameters based on the input. Could be a way of getting a kind of memory without having to have a larger context window
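A minimal sketch of what that could look like (my own toy code, not from the paper; plain softmax stands in for the paper's modified normalization, and the router is a made-up placeholder):

```python
import torch
import torch.nn.functional as F

def pattention(x, k_params, v_params):
    # Attention over learned parameter tokens: score against key tokens,
    # then take a weighted sum of value tokens (softmax used here for simplicity).
    scores = x @ k_params.T                      # (seq, n_param_tokens)
    return F.softmax(scores, dim=-1) @ v_params  # (seq, d)

# Hypothetical "memory" variant: several parameter-token banks, picked per input.
d, n_tokens = 128, 64
banks = {
    "code":    (torch.randn(n_tokens, d), torch.randn(n_tokens, d)),
    "general": (torch.randn(n_tokens, d), torch.randn(n_tokens, d)),
}

def route(x):
    # Toy router; in practice this could be a learned classifier over the input.
    return "code" if x.mean() > 0 else "general"

x = torch.randn(10, d)          # 10 input tokens
k, v = banks[route(x)]
y = pattention(x, k, v)         # (10, d), computed with only the selected bank
```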
@viktoroo · a month ago
This video feels like a stand up joke. You have everything: setup, build, punchline. Nice work, Yannic, it was super entertaining
@Fordance100 · a month ago
A lot of details are missing. What I want to see is a comparison between a large model trained from scratch and a same-size model scaled up from a small model, over the training epochs, to determine whether this has any advantages.
@Mordenor · a month ago
Thank you Mr Yannic for explaining TokenFormer in his comfy hoodie and trademark glasses.
@bairesearch · 29 days ago
An advantage of supporting dynamic scaling of model channel features is that a transformer can be applied over an expanding datasource (eg assembly knowledge graph) rather than a literal context window (where input features are typically automatically assigned during training). Can be combined with 'sequence' length independence.
@projectpiano5231 · a month ago
Y'all, I feel like we're being collectively gaslit with all the "significant cost with scaling" stuff. It's known that the stability-plasticity dilemma is the core problem of machine learning, so why isn't that held front-and-center in literally every paper? Like, I get it, it's easier to make a marginal improvement and it's kind of hard to solve AGI, but any meaningful progress that is made is going to relate to the stability-plasticity dilemma, not to making scaling cheaper by 10% in some cases. Is anybody else on the same page?

Man, I appreciate these videos *and also* they make me realize how bs so much of the ML space is. I feel like it's because there's an expectation of output/progress, and people try to rationalize why they put so much effort into their research while not producing anything of value, and then try to convince themselves it's of value. I wish we could just say the quiet part out loud. Thanks for going through the pain of these papers so I don't have to.
@novantha1 · a month ago
Counterpoint: A lot of machine learning paradigms only show their real behaviour at scale. If you trained a modern Transformer-architecture LLM back in 1999, people would have laughed at you, because 1M parameters would have been a pipe dream, so you wouldn't be able to explore the types of behaviour we're seeing in 330M, 900M, 1B, 7B, 13B, 70B models. Every performance improvement is a huge win in this respect, because it means we can train new ideas at larger and larger parameter counts with more modest hardware allotments. We can train diffusion models today on a single researcher's GPU in a week that outperform SD 1, which was trained in the cloud for millions of dollars. We can train language models that are actually kind of coherent in a single day, on a single GPU, too. What if the next major paradigm shift only shows its value at 3B parameters? 7B? 13B? Odds are good that researchers may not actually find it very quickly, because the rate of performing experiments at that scale is so much slower, so there may be totally valid research that's only been done at a 330M scale, where it just doesn't work yet.

Beyond that, there's a huge practical benefit to training-scaling improvements: AGI might not be a "model". If you look at ant colonies, as a colony they're a remarkably exhaustive searcher, even though none of the individual ants are particularly intelligent. If AI is cheap enough to train a large number of variants of, I think it's not unrealistic that with a huge decentralized network of AI agents trading with one another, where possibly every company and every person has their own customized model, we might see a situation where the collective of them is basically able to achieve any technical information problem we could set out to achieve, even if no one model is really "AGI" as such. The ability to introduce diversity and create unique models by iterating quickly at a small scale will be hugely important to a future like that. This outcome doesn't even really require new research, just application of what we already have.
@jithinchand · a month ago
Yannic on his unabomber arc.
@marshallmcluhan33 · a month ago
I'm glad I'm not the only one who thought that
@444haluk · a month ago
Lol, TokenFormer is literally a hierarchical modern Hopfield network. The only difference between a transformer and a modern Hopfield network was those W matrices; now that they are gone, it is a modern Hopfield network all the way down.
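For reference, a rough side-by-side of the update rules as I remember them (paraphrased, not verbatim from either paper):

```latex
% Modern Hopfield retrieval (stored patterns as rows of X):
\xi^{\mathrm{new}} = X^{\top}\,\mathrm{softmax}\!\left(\beta\, X \xi\right)

% Standard attention, with learned projections W_Q, W_K, W_V:
\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\tfrac{(X W_Q)(X W_K)^{\top}}{\sqrt{d}}\right) X W_V

% TokenFormer's Pattention: the W's are replaced by learned key/value parameter
% tokens K_P, V_P, and the softmax by a modified normalization \Theta:
\mathrm{Pattention}(X, K_P, V_P) = \Theta\!\left(X K_P^{\top}\right) V_P
```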
@DamianReloaded · a month ago
What if, after adding parameters, you only train the new parameters without changing the old ones, and then at some point you train them all together?
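A minimal sketch of that two-phase schedule, assuming the zero-initialized new key/value parameter tokens mentioned elsewhere in this thread (toy tensors, not the paper's training code):

```python
import torch

d, n_old, n_new = 128, 64, 32
# Pretrained parameter tokens (kept frozen during phase 1).
k_old = torch.randn(n_old, d)
v_old = torch.randn(n_old, d)
# Newly added parameter tokens, zero-initialized.
k_new = torch.zeros(n_new, d, requires_grad=True)
v_new = torch.zeros(n_new, d, requires_grad=True)

# Phase 1: the optimizer only sees the new tokens, so the old ones stay fixed.
opt = torch.optim.Adam([k_new, v_new], lr=1e-3)

# Phase 2 (later): unfreeze everything and fine-tune jointly at a lower rate.
k_old.requires_grad_(True)
v_old.requires_grad_(True)
opt = torch.optim.Adam([k_old, v_old, k_new, v_new], lr=1e-4)
```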
@taohu88 · a month ago
Thanks for discussing the paper. I think their paper focuses on hours to train, not just #tokens; your comments are about #tokens. Further, in Appendix B they indeed show that zeroing out the new K won't change the output of Pattention(X, K, V). It is not a random enlargement of the dimension; I think this is the gist of their contribution.
@PCPTMCROSBY · 29 days ago
That's quite interesting. It's reminiscent of understanding how to deal, mathematically, with the differences when you're using an old wire recorder: if you edit the wire and cut a section out, you have a problem, because the effective computations for the total length of the wire are computed into the total spooling factor, which means you're changing the ratio of the speed estimations, which of course changes your parameters. This was later addressed by machines in which a fixed control system runs the tape across the heads at a consistent (or pre-chosen) speed, so the rotational speeds of the feed spool and the catch spool, if they change, do not matter with respect to the assimilation of information through the heads and the rest of the circuitry. Of course, an adaptation of this type of electronic understanding would limit the amount of processing speed to a more fixed computational progression, but you would resolve some problems and be able to use the workability in many other different ways, having a new form of fixed foundational understanding from key points to reference.
@wwkk4964 · a month ago
It's interesting how the models, regardless of how they are trained, tend in the long run to compensate somehow (in terms of developing a capacity).
@float32 · a month ago
I think there's something fundamental going on here, and these are just different implementations of it: some law of data organization/compression that results in everything that tries ending up very close to the same thing.
@JerryFederspiel · a month ago
A few next steps to explore immediately present themselves:
1. If we replace some of the parts of transformers with little sub-transformers, why stop at one replacement step? Why not recurse deeper?
2. Transformers have a certain amount of computational power; multiple stacked transformers have a greater computational power. For example, a single transformer layer cannot perform the copy task, but a stack of two can. Can you make an induction head with just one TokenFormer layer (I doubt it)? Can we characterize the change in computational power (if any?) that we get by "tokenizing" model parameters? Or by recursively replacing transformer-parts with transformers to some given depth?
3. If we're going to perform transformer operations *within* a given token to get Q, K, and V... could we enlarge the output dimension by treating chunks of V as sub-tokens and just "decoding" more sub-tokens?
@projectpiano5231 · a month ago
I'd be curious about model stability here with recursive subtransformers. Also, projecting to a larger space and decoding there seems unnecessary imo because you can just increase embedding length and expanding would take exponentially more space and computation. Generally the models do worse with larger numbers of tokens because it's deeper contexts that they're having to learn from. Not trying to diss on the points/questions, just some thoughts/concerns I'd personally have
@JerryFederspiel · a month ago
"Also, projecting to a larger space and decoding there seems unnecessary..." - I'm not sure we're talking about the same thing. Imagine a token whose size at the beginning of a transformer layer is 1024. It may be typical during the transformer's MLP step to expand up to 4096, and then contract back down to 1024. The admittedly silly idea number 3 is this: instead of expanding each token in the first layer of the MLP, treat the 1024-vector as if it were already made up of sub-tokens. Say, 4 sub-tokens of dimension 256 each. Using those 4 sub-tokens as context, "decode" an additional 12 sub-tokens of size 256. Now you have 16 sub-tokens of size 256 total (per original token). You're still going to contract that down (maybe with a Perceiver sort of thing, or maybe just saying "alright, let's ignore the sub-token boundaries and use a fully-connected layer to take these 4096 contiguous vector elements and shrink them back down to 1024"), and future layers are going to treat the contracted result as one token again. There's no exponential explosion of tokens (or more importantly, token interactions) involved. The interesting part of this idea (esp. the Perceiver-to-condense-subtokens-back-into-one variation) is that you can use the same parameters and scale the compute up or down depending on what's available.
@JerryFederspiel · a month ago
(If anyone wants to try to make a paper out of that, be sure to call it the Ruminator)
@patrickl5290 · 20 days ago
It’s a bad idea
@JerryFederspiel · 20 days ago
thx
@herp_derpingson · 4 days ago
8:40 Basically the QKV projection is being generated using another attention mechanism? It basically starts as a low-rank space and will eventually fill up as we increase the param tokens.
13:27 It would be interesting to see curriculum learning integrated with this; it is kind of begging for that.
22:03 I thought the paper was already up-projecting. I read it wrong.
@oM477o · a month ago
19:40 That's also what LoRA uses: setting the LoRA dim is just setting that projection dimension. Using multiple LoRAs at a time is equivalent to concatenating them into one LoRA whose dim equals the sum of the dims of the LoRAs joined.
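For anyone curious, the identity is easy to check numerically (ignoring each LoRA's α/r scaling, which you can fold into B):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r1, r2 = 64, 32, 4, 8
B1, A1 = rng.standard_normal((d_out, r1)), rng.standard_normal((r1, d_in))
B2, A2 = rng.standard_normal((d_out, r2)), rng.standard_normal((r2, d_in))

# Applying two LoRAs means adding their low-rank updates to W ...
sum_of_loras = B1 @ A1 + B2 @ A2

# ... which is exactly one LoRA of rank r1 + r2 built by concatenation.
B_cat = np.concatenate([B1, B2], axis=1)   # (d_out, r1 + r2)
A_cat = np.concatenate([A1, A2], axis=0)   # (r1 + r2, d_in)
assert np.allclose(sum_of_loras, B_cat @ A_cat)
```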
@patrickl5290 · 20 days ago
Nice catch, where’d you learn that?
@andresacuna1108 · 28 days ago
Hey, so I contributed to Open Assistant back in the day but don't see my username on the contributor page :( Sad to know it's gone, I really wanted it to blow up. I also noticed the acknowledgements page is no longer available, and neither is the account page.
@ugthefluffster · a month ago
yo dawg I heard you like transformers so we put a transformer in your transformer so you can attend to tokens while attending to tokens
@ChlorieHCl · 29 days ago
23:30 That's not even new. In their paper they used GeLU for the nonlinearity instead of softmax, which brings it even closer to the original FFN. The reason is simple, since using softmax won't even work in this case. The zero-initialized new keys will have zero dot product with the input, but the attention score is calculated by softmax-ing the dot products, bringing the zeros to non-zero values, thus breaking the whole purpose. No, even if you also set the "value tokens" to zero it still doesn't work, since the attention scores will be off by a constant factor. Some of the attention is diverted to the newly added zero tokens.
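A tiny numerical check of that point (plain GeLU vs. plain softmax as the score nonlinearity; the paper's actual Θ is a modified normalization, so this only illustrates the zero-key argument):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_old, n_new = 16, 8, 4
x = torch.randn(5, d)                              # 5 input tokens
K, V = torch.randn(n_old, d), torch.randn(n_old, d)
K2 = torch.cat([K, torch.zeros(n_new, d)])         # append zero-initialized keys
V2 = torch.cat([V, torch.zeros(n_new, d)])         # and zero-initialized values

def pattention(x, K, V, nonlin):
    return nonlin(x @ K.T) @ V

# Elementwise GeLU: zero scores give zero weights, so the output is unchanged.
print(torch.allclose(pattention(x, K, V, F.gelu), pattention(x, K2, V2, F.gelu)))  # True

# Softmax couples all scores: the extra zeros soak up probability mass and rescale
# the old weights, so the output changes even though the new values are zero.
sm = lambda s: F.softmax(s, dim=-1)
print(torch.allclose(pattention(x, K, V, sm), pattention(x, K2, V2, sm)))          # False
```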
@EobardUchihaThawne · a month ago
Your videos are super cool and fun to watch 😂😂 Waiting for the nGPT paper review.
@jeremykothe2847 · a month ago
I'm a simple man. Yannic posts, I watch.
@mattanimation · a month ago
Thanks Yannic!
@mohamedabbashedjazi493 · 29 days ago
There may be some benefit in training KV parameters for each domain, e.g. medicine, history, etc.; that might also give some interpretability to transformers.
@mike___-fi5kp · 29 days ago
Thanks for the video!
@PCPTMCROSBY · 29 days ago
I believe many of us abused experimental folder creation, utilizing zeros, to get to the higher-speed addressing areas with respect to the computational circuitry; to get to the faster computational areas generally associated with hard drive operation. Of course, there are many other ways to look at those computations and twist a few things around.
@PCPTMCROSBY · 29 days ago
Gee, that should have been "used" instead of "abused". My voice detection is horrible today. Sorry about that.
@xingkuizhu4011 · a month ago
I had raised a similar question to yours (20:30) during the ICLR2025 open review of this article, and the author provided a response. I would greatly value hearing your thoughts on their answer.
@4thpdespanolo · a month ago
Please do nGPT
@jonclement · 28 days ago
my phd defense...question in the back...yeah, you with the hoodie...
@TheTruthOfAI · a month ago
The code is madness xD Python 3.8, old and rusty libraries, etc...
@PCPTMCROSBY · 29 days ago
You don't have to take me totally seriously, since obviously the bonus of high-speed computation is to utilize max-speed capabilities, as opposed to looking at things in a more simplified linear manner, or an analog manner, with respect to the original electronics and computational theory presentation, which already seemed to be cohesive, presenting everything in formulas from a low-hurts mathematics perspective. Of course, some of you may never have seen this; you just have to look at the older books to understand what I'm talking about.
@youngseokjeon3376 · a month ago
Probably the authors did not read the appendix and thought it was novel enough.
@juancarlosrial7197 · 27 days ago
Great explanation, and thanks for your opinion. I think the same... 😅
@table-rdy · a month ago
It sounds really good as an idea. Not saying they got it perfect, but... it sounds like an idea. Why so negative?
@fontenbleau · a month ago
I can say that the French anime "Mars Express" (2023) perfectly represents the future of this & of jailbroken androids.
@PCPTMCROSBY · 29 days ago
That's hertz, not hurts.
@dancar2537 · a month ago
It's pretty simple in my eyes: you people don't understand much. Attention should have been all you needed. Karpathy came to Tesla and thought, yeah, we can do it. Then Tesla did not avoid a post and Karpathy quit. Now, my parrot aces those posts: goes around them, sits on them, and does it perfectly. So Karpathy expected Tesla to do it too, but it did not, so you were in the dark and did not know what to expect from it, because you did not understand much of it. At 3 GHz, infinite training data, nuclear-plant energy demands, and building-size servers vs 10 Hz, small training data, grain power, and a tiny brain. They might be logical, but attention is not all you need to make it fly. It might not be "think of everything, understand nothing", but it sure is close.