Chinchilla Explained: Compute-Optimal Massive Language Models

20,226 views

Edan Meyer

Comments: 30
@stephaneduhamel7706 2 years ago
13:22 A linear fit on a log-log scale means that the relationship is not linear. In this case, it means that parameters = p0 * FLOPs^k, with "k" being the slope in the log-log graph and "p0" the number of parameters at a single FLOP.
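To make that relationship concrete, here is a minimal sketch (the FLOPs/parameter numbers are hypothetical, not the paper's data) showing that fitting a straight line in log-log space is the same as fitting the power law parameters = p0 * FLOPs^k:

```python
import numpy as np

# Hypothetical (FLOPs, parameters) pairs -- illustrative only, not the paper's data.
flops = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
params = np.array([4e8, 2e9, 1e10, 5e10, 2.5e11])

# Straight-line fit in log-log space: log(params) = k * log(FLOPs) + log(p0)
k, log_p0 = np.polyfit(np.log(flops), np.log(params), deg=1)
p0 = np.exp(log_p0)

# The same fit expressed as a power law: params ~= p0 * FLOPs**k
print(f"slope k = {k:.3f}, p0 = {p0:.3e}")
print("predicted params at 1e23 FLOPs:", p0 * 1e23**k)
```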
@enozeren 4 months ago
Hi, thanks for the video. About your critique of the linear / non-linear fit in Figure 3: the authors say on page 4, section 3, in the middle of the paragraph, "We assume a power-law relationship between compute and model size as done in Clark et al. (2022); Kaplan et al. (2020), though future work may want to include potential curvature in this relationship for large model sizes." A power-law relationship corresponds to a linear fit on a log-log plot, and since they assume this, the line in the plots is linear. But they agree with you that the fit might be non-linear; they leave that for future work.
@chrissears9912 2 years ago
Guess: the dotted stripes at 10:50 are due to the matrix design of the study data, i.e. all dots in one stripe come from the same set (for example, the 1B and 5B sets each produce a line of dots, with each point corresponding to a varying size: 700M, 200M, 500M, ...).
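A tiny sketch of that guess (all sizes and constants below are made up for illustration, not taken from the paper): when every model size is paired with every token budget, the points that share a model size line up as a stripe on a FLOPs-vs-loss scatter plot.

```python
import numpy as np

# Synthetic grid mimicking the "matrix design" guess: every model size is
# paired with every token budget, so each model size traces its own stripe.
model_sizes = np.array([2e8, 5e8, 7e8, 1e9, 5e9])    # parameters (hypothetical)
token_budgets = np.array([5e9, 2e10, 1e11, 5e11])    # training tokens (hypothetical)

# Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta,
# with illustrative constants chosen only to generate plausible-looking points.
E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28

for n in model_sizes:
    for d in token_budgets:
        flops = 6 * n * d            # common approximation: C ~= 6 * N * D
        loss = E + A / n**alpha + B / d**beta
        print(f"N={n:.0e}  D={d:.0e}  FLOPs={flops:.2e}  loss={loss:.3f}")

# Plotted as loss vs FLOPs, the points sharing a fixed N (or fixed D)
# line up as one diagonal stripe.
```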
@Ganelin 2 years ago
I was just about to write the same :)
@felipedidio4698 1 year ago
Man, this video is so under-appreciated
@adicsbtw 2 years ago
They actually do seem to acknowledge the curvature of the graphs for FLOPs versus model size and token count. If you look at the third paragraph of section 5 (30:54 in the video), they mention that they observe concavity at high compute budgets, which is presumably referring to the curvature in those graphs.
@dizietz 1 year ago
I had a thought about the linear vs non-linear fits for parameters vs FLOPs and tokens vs FLOPs. The field generally has a habit of throwing things on a log-log plot, squinting, and saying they look linear. The underlying relationship itself, of course, is not linear, but the simplest fit on a log-log graph is a straight line, so you see this kind of logic embedded in many papers. A couple of folks in the comments mentioned this as well.
@user-wr4yl7tx3w 1 year ago
These paper reviews are great content.
@vslaykovsky 2 years ago
10:45 These are likely patterns produced by the hyperparameter sampling method used in the paper. In the end, this is just a scatter plot with all 400 of their models on it.
@Kajenx 2 years ago
I feel like you made a very obtuse subject both easy to understand and entertaining to think about. I wouldn't be able to read a paper like this and understand it with all the jargon and math, but I had no trouble understanding your video. Thank you!
@orik737 2 years ago
I don't know if you'll catch this comment, but I'm pretty new to reading these papers on a deeper level. I'm wondering if you could point to some resources that explain vocabulary like training tokens, or the fundamental differences between types of architectures, like transformers.
@chickenp7038 2 years ago
great video!
@EdanMeyer 2 years ago
Thanks :)
@videowatching9576 2 years ago
23:00 Fascinating point about putting the size of these models in terms of text. Also, your point about comparing that to the size of the web is interesting; I'd be curious how big the web is in the context of this model, and what filters might be applied to the entire web to try to ensure the 'quality' of the data. I'm wondering what the path looks like for, say, an AI approach to delivering a web-wide search experience. Is there some sort of paradigm shift in how to think about AI-enabled search and answers, given this kind of model sizing / compute, and given the techniques needed to make sense of all of that data on the web?
@jsunrae 2 years ago
Who has ideas on those parallel dots? I’m keen to hear thoughts…
@zakuro8532 1 year ago
As a noob, I barely get it, but it's super fascinating. I also want to train a lightweight model now.
@mgostIH 2 years ago
One criticism I'd have for the group is that they didn't share their results with the Google Brain team. I wonder what PaLM could've been; it would also have helped test this scaling law even further!
@cindywu9623 2 years ago
Knowing the politics of all this, I wouldn't be surprised if DeepMind didn't trust Google Brain - Google Brain doesn't have a dedicated AI safety team, and DeepMind is very aware of the consequences of unchecked scaling work...
@SaidakbarP 2 years ago
Thank you for the detailed explanation. Subscribed! It would be interesting to see whether this model, trained on images, could outperform DALL-E 2 announced by OpenAI.
@cindywu9623 2 years ago
The diagonal lines, I think, are because they sampled only a subset of the true sample space (a given family of models with specific input token sizes). I'd imagine you could easily get more precision / get rid of that effect?
@tiagotiagot 2 years ago
How reproducible is the training process? How close to the same weights, or at least the same quality of results, do they get if everything is kept the same but the PRNGs are given different seeds?
@birdbeakbeardneck3617 2 years ago
Hi ;) Any channel recommendations?
@EdanMeyer 2 years ago
If you mean you want channels like mine, Yannic Kilcher is great
@laurenpinschannels 2 years ago
Simons Institute, IPAM at UCLA
@dexterovski 2 years ago
The bigger model being worse on moral scenarios got me good. It could definitely be used as a clickbaity headline for some article.
@drdesten 1 year ago
The smaller model was worse though
@countofst.germain6417 2 years ago
Yes, I believe this is the reason GPT-4 won't be over 300 billion parameters.
@BlackXxScopez 2 years ago
Thanks for your paper overviews - you've got me questioning whether I want to pursue AI or stick with software engineering.
@ism6938 2 years ago
Divergence
@snaawflake 2 years ago
I think the reason it does worse on the mathematical and logic tasks is that those tasks aren't very generalizable/compressible in a language model, so the bigger model does best because it has enough parameters to memorize more of the patterns, without actually understanding the tasks in a mathematically/logically generalized and rigorous manner.