An Exactly Solvable Model for Emergence and Scaling Laws

5,478 views

Tunadorable
14 days ago

The paper:
arxiv.org/abs/2404.17563
Support my learning journey either by clicking the Join button above or becoming a Patreon member!
/ tunadorable
Discuss this stuff with other Tunadorks on Discord
/ discord
All my other links
linktr.ee/tunadorable

Comments: 35
@Crawdaddy_Ro 13 days ago
Emergence is one of the concepts I enjoy researching most! Complexity science is, without a doubt, a truly futuristic science! This paper really strikes a chord with me, dude! Edit: The paper is interesting but feels pretty basic when it comes to explaining emergence in deep learning models. They used a simplified model with specific tasks designed just for this research, and while it's cool to see skills following a power law and showing up as a sigmoid curve, I'm not sure how relevant it is to real-world applications. The models seem too tailored to this experiment to draw any solid conclusions about how skills emerge in more complex, practical scenarios.
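The power-law-plus-sigmoid dynamic this commenter describes can be sketched numerically. The snippet below is a toy illustration only, not the paper's actual model: the power-law exponent, sample threshold, and sigmoid sharpness are all made-up constants, chosen just to show how smooth aggregate scaling can coexist with abrupt per-skill "emergence".

```python
import math

def skill_accuracy(data_size, rank, alpha=1.5, threshold=1e4, sharpness=4.0):
    """Accuracy on the skill of a given frequency rank (all constants assumed)."""
    freq = rank ** -alpha                        # power-law (Zipf-like) skill frequency
    seen = data_size * freq                      # examples of this skill seen in training
    x = sharpness * (math.log10(seen + 1) - math.log10(threshold))
    return 1.0 / (1.0 + math.exp(-x))            # per-skill sigmoid in log(data seen)

def mean_score(data_size, n_skills=50):
    """Benchmark-style average over many skills of decreasing frequency."""
    return sum(skill_accuracy(data_size, r) for r in range(1, n_skills + 1)) / n_skills

for d in [1e4, 1e5, 1e6, 1e7, 1e8]:
    # a rare skill (rank 30) switches on abruptly while the average climbs smoothly
    print(f"data={d:.0e}  rare-skill acc={skill_accuracy(d, 30):.2f}  mean={mean_score(d):.2f}")
```

Under these assumptions, the aggregate score looks like a gradual scaling curve even though each individual rare skill jumps from near-0 to near-1 over a narrow data range — which is roughly the reconciliation of scaling laws with "emergence" the paper is after.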
@loganlawrence1476 12 days ago
Parameter count limit in the bottleneck table might also be a proxy for inference costs or product latency, e.g. a company sets aside a fixed budget for a deployed model but has lots of time until go-live and is willing to spend money on training to find the best performer within that speed constraint. Just an idea, great video btw!
@AndreRSilva-oz1nd 13 days ago
Man, amazing vids, keep up the good work!
@marcfruchtman9473 13 days ago
The title of this paper is super interesting. I do find the choice of "skills" as basis functions within the model somewhat difficult to wrap my head around. It would be immeasurably more useful if they were able to demonstrate that it also modeled some real-world example, such as using the MNIST data and applying basis functions like detecting horizontal lines, vertical lines, diagonals, loops, etc., and then evaluating the result to see if it matched their findings when using the mathematically derived basis functions. I look forward to any future updates.
@wwkk4964 12 days ago
Top tier content!
@whemmakatatt5311 13 days ago
NICE content. S tier
@netherportals 13 days ago
Pretty cool new ability
@joe_limon 12 days ago
I think to advance future models, we are going to have to figure out how to increase training efficiency.
@JGLambourne 12 days ago
Re: orthogonality of real-world skills. It feels like a bit of a stretch to think of such complex things in this linear way, but I guess one could imagine some "basis" skills from which others are composed.
@andrewsilber 13 days ago
Maybe Congress should authorize a full digitization of the Library of Congress if what we need is trillions of tokens of quality data. Presumably they could justify it on the grounds of national security, if the goal is to stay ahead in the “AI arms race”
@Tunadorable 13 days ago
interesting
@phpn99 12 days ago
It's a descriptive model. It has no predictive power.
@RoulDukeGonzo 12 days ago
How does this relate to the whole "measurement creates emergence" thing?
@kimcosmos 12 days ago
Is it possible to separate the simple skills by filtering out all data that does not assume those simple skills? I.e., filter out the obvious data once it becomes obvious, to avoid repeatedly reinventing the wheel. That means identifying the obvious once it becomes obvious, i.e. looking for non-obvious or counterintuitive data, and running a prediction filter on what has become the obvious. It means testing generalizing circuits (4 layers to find, +4 to test) and using them as retrieval heads to filter the data stream. Q* is relatively compute-inefficient but useful with sparse data because of its improved accuracy, and this would be a good use case. Filtered data, fewer shots. Maybe fewer parameters after filtering, and maybe fewer layers if fast-grokking retrieval heads with 8 layers.
@Tunadorable 11 days ago
interesting. recently the fineweb-edu dataset was created as a filtered down version of fineweb where they asked llama 70b whether each document had educational value or not. i imagine that may be a conceptually easier method (albeit potentially more computationally intensive). a question like “is this document relatively mundane, or does it contain unusually rare/complex facts/reasoning?”. alternatively some sort of rating by perplexity or some other quantitative measure might work.
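The "rating by perplexity" idea in this reply can be sketched in a few lines. This is a hypothetical stand-in, not FineWeb's actual pipeline: it scores documents with a crude unigram-frequency "pseudo-perplexity" (a real filter would use an actual LM's perplexity), and the corpus and cutoff are invented for illustration.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat the cat sat",
    "the dog ran in the park the dog ran",
    "quasiparticle excitations obey fractional exchange statistics",
]

# unigram model over the whole corpus as a crude stand-in for a language model
counts = Counter(w for doc in corpus for w in doc.split())
total = sum(counts.values())

def pseudo_perplexity(doc):
    """Higher = more surprising under the unigram model (toy proxy for LM perplexity)."""
    words = doc.split()
    log_p = sum(math.log(counts[w] / total) for w in words)
    return math.exp(-log_p / len(words))

threshold = 10.0  # assumed cutoff; a real pipeline would tune this
kept = [d for d in corpus if pseudo_perplexity(d) > threshold]
print(kept)  # only the rare, "non-obvious" document survives the filter
```

The mundane documents score low because they're built from words the model has seen constantly, which is exactly the "filter out the obvious once it becomes obvious" intuition from the parent comment.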
@kimcosmos 11 days ago
@Tunadorable RAG-like retrieval heads can use a more focused subset for local learning, especially few-shot sparse-data methodical analytics, i.e. "What am I missing here?" Fineweb extracts data pairs (ER?) with one of its 5 reward prompts for creating artificial data being "Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style."
@kimcosmos 11 days ago
@Tunadorable "Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style." That's 1 out of 5 points in their artificial-generator prompt. It's not using Q* to find optimum paths. Fineweb is getting the low-hanging fruit. Q* shakes the tree and is good for ticker feeds.
@sikunowlol 13 days ago
oi
@RoulDukeGonzo 12 days ago
From the comments I think I got the answer, but just to clarify: this is theoretical, right? Why would skill data be so uniform for real skills?
@Tunadorable 12 days ago
yes it's theoretical. on real skills it's likely not as uniform, but it's very possible the general theme still holds true in aggregate. The idea that some skills are common while rare skills are very very very (orders of magnitude, or exponentially, more) rare seems reasonable; if anything, the alternative, that rare skills are only slightly (geometrically? linearly?) more rare, would be a good thing. However, so far the fact that we've had to increase LLM training compute by orders of magnitude in order to get linear returns on benchmarks would imply the former.
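The "orders of magnitude of compute for linear returns" point in this reply amounts to a log-linear scaling assumption, which a tiny worked example makes concrete. The constants here are made up purely for illustration, not fitted to any real benchmark:

```python
import math

# Assumed toy scaling law: score = a + b * log10(compute).
# Under this form, every fixed score increment costs a constant
# *multiplier* in compute, not a constant additive amount.
a, b = 20.0, 5.0  # assumed: 5 benchmark points per 10x compute

def score(compute):
    return a + b * math.log10(compute)

def compute_needed(target_score):
    # invert the scaling law to find the compute for a target score
    return 10 ** ((target_score - a) / b)

c1 = compute_needed(50)
c2 = compute_needed(55)   # +5 points...
print(c2 / c1)            # ...costs 10x the compute, regardless of starting point
```

This is consistent with the power-law skill-frequency picture: if each new increment of benchmark score requires mastering skills that are an order of magnitude rarer in the data, linear returns demand multiplicative compute.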
@JehovahsaysNetworth 13 days ago
ChatGPT can't write PHP the way I showed it to. I tried and it failed to understand. If you know a better bot to try out, direct me to one to work with.
@SoFukinDope24 13 days ago
easy solution: use anthropic
@JehovahsaysNetworth 13 days ago
@SoFukinDope24 I will search for it and try it, thanks
@RoulDukeGonzo 12 days ago
Easier solution, learn python
@JehovahsaysNetworth 12 days ago
@RoulDukeGonzo I know some Python; I used to use a Pywikibot on my MediaWiki
@ricosrealm 12 days ago
Claude is the best for coding.
@jacksaunders1929 12 days ago
Have you thought about doing a PhD?
@Tunadorable 12 days ago
oof, during undergrad I considered doing one in economics, but back then, after going through the legit publication process, talking with professors, looking at the way the system works, etc., it sounded more restrictive than freeing. considered it again when I decided I wanted to pivot into AI, but I was blessed to chance upon a short conversation with Paul Christiano and he told me it wasn't necessary for this field, just self-publish then go work at a company. rn I'm hoping I can become self-sufficient off YouTube and do a combo of research & science education without any boss/restrictions
@danielmartinmonge4054 9 days ago
This paper seems to miss the point about emergent capabilities. From my understanding, the model is learning to solve a specific problem only because it appears in the dataset and is solved in an exact way. The more frequently this exact problem appears, the faster the model learns it. However, true logic, abstraction, and understanding are about finding broader connections between concepts and solving new problems that are not present in the dataset. My intuition suggests that this approach is not suitable for learning natural language. Human knowledge cannot be reduced to a finite set of easily solvable problems. This method overlooks the critical strength of large language models: symbolic abstraction, where specific problems are merely examples of broader categories. It seems to me that the paper fails to address the core aspects of these new architectures. It applies mathematical models designed for narrow, purpose-specific AI rather than for this broader kind of intelligence.
@waveFunction25 13 days ago
Oi
@GNARGNARHEAD 12 days ago
oi, this is a comment