13:22 A linear fit on a log-log scale means the relationship is not linear. In this case it means that parameters = p0 * FLOPs^k, with k being the slope in the log-log graph and p0 the number of parameters at a single FLOP.
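A quick sanity check of this point (a minimal sketch; the p0 = 2.0 and k = 0.5 values are made up for illustration, not numbers from the paper): fitting a straight line to the logs recovers the power-law exponent and prefactor.

```python
import numpy as np

# A power law parameters = p0 * FLOPs**k is linear in log-log space:
# log(parameters) = log(p0) + k * log(FLOPs).
# Synthetic data with illustrative values p0 = 2.0, k = 0.5.
flops = np.logspace(18, 24, 20)
params = 2.0 * flops**0.5

# A degree-1 polynomial fit on the logs recovers slope k and intercept log(p0).
k, log_p0 = np.polyfit(np.log(flops), np.log(params), 1)
print(round(k, 3), round(float(np.exp(log_p0)), 3))  # slope ≈ 0.5, prefactor ≈ 2.0
```

So "linear in log-log" and "power law" are the same claim, which is why papers fit lines on these plots.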
@enozeren 4 months ago
Hi, thanks for the video. About your critique of the linear / non-linear fit in Figure 3, the authors say on page 4, section 3, in the middle of the paragraph: "We assume a power-law relationship between compute and model size as done in Clark et al. (2022); Kaplan et al. (2020), though future work may want to include potential curvature in this relationship for large model sizes." A power-law relationship shows up as a linear fit in log-log space, and since they assume it, the line is linear in the plots. But they agree with you that the fit might be non-linear; they just leave that as future work.
@chrissears9912 2 years ago
Guess: the dotted stripes at 10:50 are due to the matrix design of the study data, i.e. all dots in one stripe are from the same set (e.g. a 1B set and a 5B set each produce a line of dots, with one point per varying size: 700M, 200M, 500M...).
@Ganelin 2 years ago
I was just about to write the same :)
@felipedidio4698 a year ago
Man, this video is so under-appreciated
@adicsbtw 2 years ago
They actually do seem to acknowledge the curvature of the graphs of FLOPs versus model size and token count. If you look at the third paragraph of section 5 (30:54 in the video), they mention that they observe concavity at high compute budgets, which is presumably referring to the curvature in those graphs.
@dizietz a year ago
I had a thought about the linear vs. non-linear fits of Parameters vs. FLOPs and Tokens vs. FLOPs. The field generally has an issue with throwing things on a log-log plot, squinting, and saying they look linear. The underlying relationship itself, of course, is not linear, but the simplest fit on a log-log graph is a line, so you see this kind of logic embedded in many papers! A couple of folks in the comments pointed this out as well.
@user-wr4yl7tx3w a year ago
These paper reviews are great content.
@vslaykovsky 2 years ago
10:45 These are likely patterns produced by the hyperparameter sampling method used in the paper. In the end this is just a scatter plot with all 400 of their models on it.
@Kajenx 2 years ago
I feel like you made a very obtuse subject both easy to understand and entertaining to think about. I wouldn't be able to read a paper like this and understand it with all the jargon and math, but I had no trouble understanding your video. Thank you!
@orik737 2 years ago
I don't know if you'll catch this comment, but I'm pretty new to reading these papers at a deeper level. I'm wondering if you could point to some resources that explain vocabulary like "training tokens" or the fundamental differences between types of algorithms, like transformers.
@chickenp7038 2 years ago
great video!
@EdanMeyer 2 years ago
Thanks :)
@videowatching9576 2 years ago
23:00 Fascinating point about the context for the size of models in terms of text. Also: your point about comparing that to the web size is interesting - I would be curious how big the web would be in the context of this model. And also interested in as you said what filters might be applied to the entire web in order to try to have ‘quality’ of data. I’m wondering: what does the path look to be for say an AI approach to delivering a web-wide search experience? Is there some sort of paradigm shift in how to think about AI-enabled search and answers, given what model sizing / compute, and also given what sort of techniques to make sense of all of that data on the web etc?
@jsunrae 2 years ago
Who has ideas on those parallel dots? I’m keen to hear thoughts…
@zakuro8532 a year ago
As a noob I barely get it, but it's super fascinating. It also makes me want to train a lightweight model now.
@mgostIH 2 years ago
One criticism I'd have for the group is that they didn't share their results with the Google Brain team. I wonder what PaLM could've been; it would also have helped with testing this law even further!
@cindywu9623 2 years ago
Knowing the politics of all this, I wouldn't be surprised if DeepMind didn't trust Google Brain - Google Brain doesn't have a dedicated AI safety team, and DeepMind is very aware of the consequences of unchecked scaling work...
@SaidakbarP 2 years ago
Thank you for the detailed explanation. Subscribed! It will be interesting to see whether this model, trained on images, can outperform the DALL-E 2 model announced by OpenAI.
@cindywu9623 2 years ago
I think the diagonal lines are there because they sampled only a subset of the true sample space (a given family of models with specific input token sizes). I'd imagine you could easily get more precision / get rid of that effect?
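One way to sketch this grid-sampling effect (assuming the common FLOPs ≈ 6·N·D approximation; the model and token sizes below are made up, not the paper's actual sweep): each fixed model size traces its own straight line in log-log space, which would show up as parallel stripes in a scatter plot.

```python
import numpy as np

# Hypothetical sweep: a few model sizes N crossed with many token counts D.
# With FLOPs ≈ 6 * N * D, log(FLOPs) = log(6N) + log(D), so each model family
# (fixed N) is a slope-1 line in log-log space, offset by log(6N):
# a set of parallel diagonal stripes.
model_sizes = [70e6, 200e6, 500e6, 1e9]   # made-up parameter counts N
tokens = np.logspace(9, 11, 8)            # made-up token counts D

for n in model_sizes:
    flops = 6 * n * tokens
    slope = np.diff(np.log(flops)) / np.diff(np.log(tokens))
    print(int(n), round(float(slope.mean()), 3))  # every family has slope 1.0
```

A denser or randomized sampling of (N, D) pairs would fill in the gaps between the stripes.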
@tiagotiagot 2 years ago
How reproducible is the training process? How close to the same weights, or at least the same quality of results, would they get if everything were kept the same but the PRNGs used different seeds?
@birdbeakbeardneck3617 2 years ago
hi;) channel recommendations?
@EdanMeyer 2 years ago
If you mean you want channels like mine, Yannic Kilcher is great
@laurenpinschannels 2 years ago
Simons Institute, IPAM at UCLA
@dexterovski 2 years ago
The bigger model being worse on moral scenarios got me good. That could definitely be used as a clickbaity headline for some article.
@drdesten a year ago
The smaller model was worse though
@countofst.germain6417 2 years ago
Yes I believe this is the reason gpt-4 won't be over 300 billion.
@BlackXxScopez 2 years ago
thanks for your paper overviews - you've got me questioning whether I want to pursue AI or stick with software engineering
@ism6938 2 years ago
Divergence
@snaawflake 2 years ago
I think the reason it does worse on the mathematical and logic tasks is that those tasks aren't very generalizable/compressible in a language model, so the bigger model does best because it has enough parameters to memorize more of the patterns, without actually understanding the tasks in a mathematically/logically generalized and rigorous manner.