ICML 2024 Tutorial: Physics of Language Models

  16,042 views

Zeyuan Allen-Zhu

1 day ago

Comments: 50
@QuanWang · 1 month ago
Best tech talk since LLM!
@XiaoBaBa · 17 days ago
Very high information density — each subsection is worth an individual paper.
@sheikhshafayat6984 · 2 days ago
Allen, this is such a wonderful talk. So many thanks for putting this online!
@icriou · 1 month ago
This is crazy. I'm like a low dim creature witnessing high dim creature's thinking process and experimental methods. The testbed is so well chosen that I'll build on this to learn more. Thank you so much.
@manncodes · 13 days ago
This is the most insightful talk I have ever seen. It has taught me how powerful controlled experiments can be!
@salemon289 · 7 days ago
Awesome talk, I am glad it is available now for those who can't make it to ICML in person!
@leaderfeng8419 · 17 days ago
One of the clearest talks about LLMs I've ever heard.
@sacramentofwilderness6656 · 1 month ago
Awesome talk! A thorough and detailed investigation into multiple aspects of LLM learning and behavior.
@fangzhangmnm6049 · 17 days ago
That's a terrible amazing talk!
@shubhamtoshniwal2221 · 16 days ago
hahaha, good one
@fdumitru · 5 days ago
Okay, this was a masterclass-level presentation. Great job, Zeyuan!
@haoliang7319 · 6 days ago
Excellent talk — a great demonstration of how LLMs work and, more importantly, of how research works in general. A breath of fresh air (清流) amid the hype.
@user-wd7jf8kf3y · 1 month ago
Really insightful. Great talk.
@user-wv3wx7zr6g · 14 days ago
Amazing tutorial. I learned a lot about how LLMs work!
@spartaleonidas540 · 1 month ago
Well... that was visionary. What's your best guess on why Claude is winning at code gen — are they on this trail already?
@hahiZY · 1 month ago
Good job! You explained it in a very clear way! One of the best talks I have watched recently.
@fengliang1590 · 10 days ago
For the Part 3.2 reverse-search problem, it occurs to me that OpenAI models actually have a special fill-in-the-middle token. It could probably help mitigate the reversal curse a bit, since we could train models on something like "往开来". But as it's a special token, I suspect the capability is triggered only when it's used. Perhaps we could fix this by using multiple or random fill-in-the-middle tokens.
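The fill-in-the-middle idea the comment describes can be sketched as a simple data transformation: move a random middle span to the end so the model learns to generate it conditioned on both the prefix and the suffix. The sentinel names below are illustrative, not the actual special tokens of any particular model.

```python
import random

def to_fim_example(text: str, rng: random.Random,
                   pre: str = "<|fim_prefix|>",
                   suf: str = "<|fim_suffix|>",
                   mid: str = "<|fim_middle|>") -> str:
    """Rearrange a document into prefix-suffix-middle order, so the model
    is trained to emit the middle span after seeing both surrounding spans."""
    i, j = sorted(rng.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{pre}{prefix}{suf}{suffix}{mid}{middle}"

rng = random.Random(0)
example = to_fim_example("继往开来", rng)
```

Because the middle always appears last, the capability is indeed tied to the sentinel tokens being present at inference time, which is the commenter's concern.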
@JL-zl6ot · 1 month ago
Super interesting and informative. Thanks for posting this!!
@spartaleonidas540 · 1 month ago
30:55 — the proposed Turing test doesn't really apply to in-context learning, i.e. after RLHF or DPO, right? A simple test (Gemini Advanced) verifies this: "Anya Briar Forger was born on October 2, 1996. She spent her early years in Princeton, NJ. She received mentorship and guidance from faculty members at MIT. She completed her education with a focus on Communications. She had a professional role at Meta Platforms. She was employed in Menlo Park, CA." Question: What is the birth date of Anya Briar Forger? Answer: Anya Briar Forger's birth date is October 2, 1996. Question: Who was born on October 2, 1996 in Princeton, NJ, studied Communications at MIT, and worked for Meta Platforms in Menlo Park, CA? Answer: Anya Briar Forger, based on the information provided.
@zhuzeyuan · 1 month ago
The entire Part 3 is about (factual) knowledge, which is "out-of-context" knowledge — e.g., asking some celebrity's birthday when that knowledge is *not* given in the same context window; even things like 7*9=63 are considered factual knowledge. In contrast, in-context "knowledge" (which I don't usually call knowledge, but some people do) is certainly inversely searchable and also manipulable. Any hierarchical operation on such "knowledge" can be viewed as a form of reasoning or mental computation (if not using CoT), and that's what Part 2 covers (at least to some extent). Hope this answers your question.
@indraneelmukherjee2092 · 1 month ago
Awesome stuff!!
@jameshoang877 · 1 month ago
woo, thank you for your talk. Love it so much! 😀
@user-kd3lr6ny4t · 1 month ago
Beautiful talk! Textbook level.
@TheWasserstoff · 16 days ago
Definitely an eye-opener when it comes to the inner workings of LLMs. I need to understand whether this approach can be replicated when building RAG systems and finetuning.
@ZYTUSA · 1 month ago
Thanks for uploading!!
@GNARGNARHEAD · 16 days ago
wonderful talk, thanks
@orlandoairs23 · 13 days ago
Really insightful❤
@bwatspro · 1 month ago
This needs more attention
@BernaBermejo-qw2yh · 17 days ago
Attention is all it needs
@filobonda · 18 days ago
The forbidden knowledge! Quick, download it!
@kamiyss · 17 days ago
Highly appreciate this masterpiece! Could you please share the slides?
@kksu6860 · 1 month ago
Great talk.
@islandfireballkill · 1 month ago
Fantastic talk. This is exactly the combination of hypothesis backed up with disciplined investigative research that I love to see.
@oulshis7453 · 1 month ago
amazing!!
@QuanWang · 1 month ago
In your experimental setup, during finetuning, is the loss computed on the full sequence, or only on the generated tokens? If the latter, do you think that might be the cause of the different conclusions for pretraining vs. finetuning? (I still don't understand how they could differ under a universal law...)
@clementdato6328 · 1 month ago
As for the model not being capable of retrieving partial knowledge — e.g., extracting 1942 from "October 7, 1942": is this because the training data never contains the concepts of month, day, and year separately? Otherwise, I could see other examples serving as the "celebrity" data that helps the model learn to extract the date for the "minority," like a specific October 7, 1942.
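The partial-retrieval question the comment raises can be made concrete with a small evaluation sketch: for one stored fact, build probes at decreasing granularity and check whether a model trained only on full dates can answer the coarser ones. All names and question templates here are hypothetical.

```python
def qa_pairs(person: str, birth_date: str):
    """Build QA probes at decreasing granularity for one stored fact.
    `birth_date` is 'Month D, YYYY', e.g. 'October 7, 1942'."""
    month_day, year = birth_date.rsplit(", ", 1)
    month = month_day.split()[0]
    return [
        (f"What is the birth date of {person}?", birth_date),  # full fact
        (f"In which year was {person} born?", year),           # partial: year only
        (f"In which month was {person} born?", month),         # partial: month only
    ]

probes = qa_pairs("Anya Briar Forger", "October 7, 1942")
```

If a model answers the first probe but not the last two, the fact was likely stored as one indivisible string — the failure mode the comment is asking about.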
@peterzhong3085 · 1 month ago
Do you mind if I share this video on my LinkedIn? I will credit the authors.
@004307ec · 1 month ago
awesome😊❤
@MrNoipe · 1 month ago
Download this video NOW before it gets taken down again!
@hanchisun6164 · 1 month ago
Some of the issues are caused by discrepancies between tokenizers.
@zhuzeyuan · 1 month ago
@hanchisun6164 Essentially none in our case. The controlled experiment has ruled out the influence of tokenizers. An example is the GPT vs. Llama experiment, where we actually compared Llama vs. Llama (GatedMLP replaced with MLP), or GPT vs. GPT (MLP replaced with GatedMLP), so the comparison is "conditioned" on the same tokenizer.
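The two feed-forward variants being swapped in that controlled comparison can be sketched in a few lines of NumPy. This is a minimal illustration of the block shapes only — actual dimensions, activation choice, and initialization in the experiments may differ.

```python
import numpy as np

def silu(x):
    """SiLU activation, x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def mlp(x, w_in, w_out):
    """GPT-style feed-forward block: project up, nonlinearity, project down."""
    return silu(x @ w_in) @ w_out

def gated_mlp(x, w_gate, w_up, w_out):
    """Llama-style gated feed-forward: a gate branch modulates the up branch."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_out

rng = np.random.default_rng(0)
d, h = 8, 32  # hypothetical model and hidden dims
x = rng.standard_normal((1, d))
y_mlp = mlp(x, rng.standard_normal((d, h)), rng.standard_normal((h, d)))
y_gated = gated_mlp(x, rng.standard_normal((d, h)),
                    rng.standard_normal((d, h)), rng.standard_normal((h, d)))
```

Because both blocks map the same input shape to the same output shape, either can be dropped into the same architecture with the same tokenizer, which is what makes the comparison controlled.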
@sparshteotia652 · 1 month ago
@zhuzeyuan Very rigorous probing and good research — looking forward to more of your experiments and talks.
@hanchisun6164 · 1 month ago
@zhuzeyuan Thank you so much for the clarifications! I will read your paper more carefully.
@wwkk4964 · 1 month ago
@zhuzeyuan At around the 37-minute mark you talked about the reversal curse, but I think I read a paper where they were able to solve this problem by training the model on jumbled-up tokens of variable size n between a starting and an ending token. It was inefficient, and its performance in the forward direction was bad, but its performance in reverse was equally bad (or equally good, depending on how you want to characterize it) compared to BERT. Anyway, ENJOYING YOUR WORK, THANK YOU, WILL CONTINUE WATCHING.
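The data transformation the comment describes — shuffling variable-size chunks between fixed boundary tokens so the model sees facts in many orders — might look something like this sketch. Chunk sizes and marker tokens are illustrative.

```python
import random

def jumble(tokens, rng, min_n=2, max_n=4, bos="<s>", eos="</s>"):
    """Split a token sequence into variable-size chunks and shuffle them,
    keeping the start/end markers fixed, so forward order is no longer
    privileged in the training data."""
    chunks, i = [], 0
    while i < len(tokens):
        n = rng.randint(min_n, max_n)
        chunks.append(tokens[i:i + n])
        i += n
    rng.shuffle(chunks)
    return [bos] + [t for chunk in chunks for t in chunk] + [eos]

rng = random.Random(0)
out = jumble(list("ABCDEFGH"), rng)
```

The trade-off the comment notes follows directly: destroying the canonical order weakens forward-direction modeling in exchange for symmetry between directions.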
@sparshteotia652 · 1 month ago
@zhuzeyuan I had a question: can we do pretraining on large instruction-tuning datasets and then do SFT on smaller-scale SFT datasets, since you state that SFT is not beneficial if the task's information is not in the domain of the pretrained parameters?
@zhuzeyuan · 1 month ago
Sort of, yes. But even in-domain, I can tell you about a set of experiments we did (but didn't have time to write up as a paper). Suppose you pretrain a model on N biographies and make sure the knowledge is 100% extractable. Next, suppose you finetune with M biographies and check whether the model can extract the knowledge of those M people. The answer is: regardless of the finetune method (full, LoRA, different ranks), it seems M can be at most 1% of N. But I didn't have time to do a more thorough experiment that also varies model size, etc. We've got more urgent things to work on first...
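Generating the kind of synthetic pretrain/finetune split described in that reply is straightforward. The sketch below builds N pretraining biographies and a much smaller finetuning set (M around 1% of N) with matching extraction probes; the templates, attribute pools, and the 1:100 ratio are illustrative, not the paper's exact setup.

```python
import random

def make_bios(num_people, rng, tag):
    """Generate synthetic biography paragraphs plus matching QA probes
    used to measure knowledge extraction."""
    cities = ["Princeton, NJ", "Menlo Park, CA", "Austin, TX"]
    bios, probes = [], []
    for i in range(num_people):
        name = f"{tag} Person{i}"          # hypothetical unique identity
        city = rng.choice(cities)
        year = rng.randint(1940, 2000)
        bios.append(f"{name} was born in {year}. "
                    f"{name} spent early years in {city}.")
        probes.append((f"In which year was {name} born?", str(year)))
    return bios, probes

rng = random.Random(0)
pretrain_bios, _ = make_bios(10_000, rng, "Pre")      # the N pretraining people
finetune_bios, ft_probes = make_bios(100, rng, "Ft")  # M = 1% of N
```

One would then pretrain until the N pretraining probes are fully answerable, finetune on the M new biographies, and measure accuracy on `ft_probes` to reproduce the capacity limit the reply describes.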
@deter3 · 1 month ago
Awesome research. The only drawback is that the author has done less research on LLMs at the cognition level, so the research perspective is not very strategic. In other words, the research assumes that LLMs are a knowledge engine.
@clementdato6328 · 1 month ago
A is B does not imply B is A, though…
@chriscross7671 · 1 day ago
You are venturing much further into alchemy than you think you are. But that was to be expected after the weird Newton and Kepler excursion and the talk’s title. Typical big tech hubris.