ICML 2024 Tutorial: Physics of Language Models

  16,042 views

Zeyuan Allen-Zhu

1 day ago

Comments: 50
@QuanWang · 1 month ago
Best tech talk since LLM!
@XiaoBaBa · 17 days ago
Very high information density — each subsection is worth an individual paper.
@sheikhshafayat6984 · 2 days ago
Allen, this is such a wonderful talk. So many thanks for putting this online!
@icriou · 1 month ago
This is crazy. I'm like a low dim creature witnessing high dim creature's thinking process and experimental methods. The testbed is so well chosen that I'll build on this to learn more. Thank you so much.
@manncodes · 13 days ago
This is the most insightful talk I have ever seen. It has taught me how powerful controlled experiments can be!
@salemon289 · 7 days ago
Awesome talk, I am glad it is available now for those who can't make it to ICML in person!
@leaderfeng8419 · 17 days ago
One of the clearest talks about LLMs I've ever heard.
@sacramentofwilderness6656 · 1 month ago
Awesome talk! A thorough and detailed investigation into multiple aspects of LLM learning and behavior.
@fangzhangmnm6049 · 17 days ago
That's a terrible amazing talk!
@shubhamtoshniwal2221 · 16 days ago
hahaha, good one
@fdumitru · 5 days ago
Okay, this was a masterclass-level presentation. Great job, Zeyuan!
@haoliang7319 · 6 days ago
Excellent talk — a great demonstration of how LLMs work and, more importantly, of how research works in general. A breath of fresh air (清流) amid the hype.
@user-wd7jf8kf3y · 1 month ago
Really insightful. Great talk.
@user-wv3wx7zr6g · 14 days ago
Amazing tutorial. I learned a lot about how LLMs work!
@spartaleonidas540 · 1 month ago
Well... that was visionary. What's your best guess on why Claude is winning at code gen — are they on this trail already?
@hahiZY · 1 month ago
Good job! You explained it in a very clear way! One of the best talks I have watched recently.
@fengliang1590 · 10 days ago
For the Part 3.2 reverse-search problem, it occurs to me that OpenAI models actually have a special fill-in-the-middle token. It could probably help mitigate the reversal curse a bit, since we could train models on something like "往开来". But as it's a special token, I suspect the capability is triggered only when it's used. Perhaps we could fix this by using multiple or random fill-in-the-middle tokens.
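The fill-in-the-middle idea the comment describes can be sketched as a simple data transformation: move a random middle span to the end so the model learns to generate it conditioned on both the prefix and the suffix. The sentinel names below are illustrative, not the actual special tokens of any particular model.

```python
import random

def to_fim_example(text: str, rng: random.Random,
                   pre: str = "<|fim_prefix|>",
                   suf: str = "<|fim_suffix|>",
                   mid: str = "<|fim_middle|>") -> str:
    """Rearrange a document into prefix-suffix-middle order, so the model
    is trained to emit the middle span after seeing both surrounding spans."""
    i, j = sorted(rng.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{pre}{prefix}{suf}{suffix}{mid}{middle}"

rng = random.Random(0)
example = to_fim_example("继往开来", rng)
```

Because the middle always appears last, the capability is indeed tied to the sentinel tokens being present at inference time, which is the commenter's concern.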
@JL-zl6ot · 1 month ago
Super interesting and informative. Thanks for posting this!!
@spartaleonidas540 · 1 month ago
30:55 — the proposed Turing test doesn't really apply to in-context learning, i.e. after RLHF or DPO, right? A simple test (Gemini Advanced) verifies this: "Anya Briar Forger was born on October 2, 1996. She spent her early years in Princeton, NJ. She received mentorship and guidance from faculty members at MIT. She completed her education with a focus on Communications. She had a professional role at Meta Platforms. She was employed in Menlo Park, CA." Question: What is the birth date of Anya Briar Forger? Answer: Anya Briar Forger's birth date is October 2, 1996. Question: Who was born on October 2, 1996 in Princeton, NJ, studied Communications at MIT, and worked for Meta Platforms in Menlo Park, CA? Answer: Anya Briar Forger, based on the information provided.
@zhuzeyuan · 1 month ago
The entire Part 3 is about (factual) knowledge, which is "out-of-context" knowledge — e.g., asking some celebrity's birthday when that knowledge is *not* given in the same context window; even things like 7*9=63 are considered factual knowledge. In contrast, in-context "knowledge" (which I don't usually call knowledge, but some people do) is certainly inversely searchable and also manipulable. Any hierarchical operation on such "knowledge" can be viewed as a form of reasoning or mental computation (if not using CoT), and that's what Part 2 covers (at least to some extent). Hope this answers your question.
@indraneelmukherjee2092 · 1 month ago
Awesome stuff!!
@jameshoang877 · 1 month ago
woo, thank you for your talk. Love it so much! 😀
@user-kd3lr6ny4t · 1 month ago
Beautiful talk! Textbook level.
@TheWasserstoff · 16 days ago
Definitely an eye-opener when it comes to the inner workings of LLMs. I need to understand whether this approach can be replicated when building RAG systems and finetuning.
@ZYTUSA · 1 month ago
Thanks for uploading!!
@GNARGNARHEAD · 16 days ago
wonderful talk, thanks
@orlandoairs23 · 13 days ago
Really insightful❤
@bwatspro · 1 month ago
This needs more attention
@BernaBermejo-qw2yh · 17 days ago
Attention is all it needs
@filobonda · 18 days ago
The forbidden knowledge! Quick, download it!
@kamiyss · 17 days ago
Highly appreciate this masterpiece! Could you please share the slides?
@kksu6860 · 1 month ago
Great talk.
@islandfireballkill · 1 month ago
Fantastic talk. This is exactly the combination of hypothesis backed up with disciplined investigative research that I love to see.
@oulshis7453 · 1 month ago
amazing!!
@QuanWang · 1 month ago
In your experimental setup, during finetuning, is the loss computed on the full sequence, or only on the generated tokens? If the latter, do you think that might be the cause of the different conclusions for pretraining vs. finetuning? (I still don't understand how they could differ under a universal law...)
@clementdato6328 · 1 month ago
As for the model not being capable of retrieving partial knowledge — e.g., extracting 1942 from "October 7, 1942": is this because the training data never contains the concepts of month, day, and year separately? Otherwise, I could see other examples serving as the "celebrity" data that helps the model learn to extract the date for the "minority," like a specific October 7, 1942.
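The partial-retrieval question the comment raises can be made concrete with a small evaluation sketch: for one stored fact, build probes at decreasing granularity and check whether a model trained only on full dates can answer the coarser ones. All names and question templates here are hypothetical.

```python
def qa_pairs(person: str, birth_date: str):
    """Build QA probes at decreasing granularity for one stored fact.
    `birth_date` is 'Month D, YYYY', e.g. 'October 7, 1942'."""
    month_day, year = birth_date.rsplit(", ", 1)
    month = month_day.split()[0]
    return [
        (f"What is the birth date of {person}?", birth_date),  # full fact
        (f"In which year was {person} born?", year),           # partial: year only
        (f"In which month was {person} born?", month),         # partial: month only
    ]

probes = qa_pairs("Anya Briar Forger", "October 7, 1942")
```

If a model answers the first probe but not the last two, the fact was likely stored as one indivisible string — the failure mode the comment is asking about.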
@peterzhong3085 · 1 month ago
Do you mind if I share this video on my LinkedIn? I will credit the authors.
@004307ec · 1 month ago
awesome😊❤
@MrNoipe · 1 month ago
Download this video NOW before it gets taken down again!
@hanchisun6164 · 1 month ago
Some of the issues are caused by discrepancies between tokenizers.
@zhuzeyuan · 1 month ago
@hanchisun6164 Essentially none in our case. The controlled experiment has ruled out the influence of tokenizers. An example is the GPT vs. Llama experiment, where we actually compared Llama vs. Llama (GatedMLP replaced with MLP), or GPT vs. GPT (MLP replaced with GatedMLP), so the comparison is "conditioned" on the same tokenizer.
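The two feed-forward variants being swapped in that controlled comparison can be sketched in a few lines of NumPy. This is a minimal illustration of the block shapes only — actual dimensions, activation choice, and initialization in the experiments may differ.

```python
import numpy as np

def silu(x):
    """SiLU activation, x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def mlp(x, w_in, w_out):
    """GPT-style feed-forward block: project up, nonlinearity, project down."""
    return silu(x @ w_in) @ w_out

def gated_mlp(x, w_gate, w_up, w_out):
    """Llama-style gated feed-forward: a gate branch modulates the up branch."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_out

rng = np.random.default_rng(0)
d, h = 8, 32  # hypothetical model and hidden dims
x = rng.standard_normal((1, d))
y_mlp = mlp(x, rng.standard_normal((d, h)), rng.standard_normal((h, d)))
y_gated = gated_mlp(x, rng.standard_normal((d, h)),
                    rng.standard_normal((d, h)), rng.standard_normal((h, d)))
```

Because both blocks map the same input shape to the same output shape, either can be dropped into the same architecture with the same tokenizer, which is what makes the comparison controlled.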
@sparshteotia652 · 1 month ago
@zhuzeyuan Very rigorous probing and good research — looking forward to more of your experiments and talks.
@hanchisun6164 · 1 month ago
@zhuzeyuan Thank you so much for the clarifications! I will read your paper more carefully.
@wwkk4964 · 1 month ago
@zhuzeyuan At around the 37-minute mark you talked about the reversal curse, but I think I read a paper where they were able to solve this problem by training the model on jumbled-up tokens of variable size n between a starting and an ending token. It was inefficient, and its performance in the forward direction was bad, but its performance in reverse was equally bad (or equally good, depending on how you want to characterize it) compared to BERT. Anyway, ENJOYING YOUR WORK, THANK YOU, WILL CONTINUE WATCHING.
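The data transformation the comment describes — shuffling variable-size chunks between fixed boundary tokens so the model sees facts in many orders — might look something like this sketch. Chunk sizes and marker tokens are illustrative.

```python
import random

def jumble(tokens, rng, min_n=2, max_n=4, bos="<s>", eos="</s>"):
    """Split a token sequence into variable-size chunks and shuffle them,
    keeping the start/end markers fixed, so forward order is no longer
    privileged in the training data."""
    chunks, i = [], 0
    while i < len(tokens):
        n = rng.randint(min_n, max_n)
        chunks.append(tokens[i:i + n])
        i += n
    rng.shuffle(chunks)
    return [bos] + [t for chunk in chunks for t in chunk] + [eos]

rng = random.Random(0)
out = jumble(list("ABCDEFGH"), rng)
```

The trade-off the comment notes follows directly: destroying the canonical order weakens forward-direction modeling in exchange for symmetry between directions.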
@sparshteotia652 · 1 month ago
@zhuzeyuan I had a question: can we do pretraining on large instruction-tuning datasets and then do SFT on smaller-scale SFT datasets, since you state that SFT is not beneficial if the task's information is not in the domain of the pretrained parameters?
@zhuzeyuan · 1 month ago
Sort of, yes. But even in-domain, I can tell you about a set of experiments we did (but didn't have time to write up as a paper). Suppose you pretrain a model on N biographies and make sure the knowledge is 100% extractable. Next, suppose you finetune with M biographies and check whether the model can extract the knowledge of those M people. The answer is: regardless of the finetune method (full, LoRA, different ranks), it seems M can be at most 1% of N. But I didn't have time to do a more thorough experiment that also varies model size, etc. We've got more urgent things to work on first...
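Generating the kind of synthetic pretrain/finetune split described in that reply is straightforward. The sketch below builds N pretraining biographies and a much smaller finetuning set (M around 1% of N) with matching extraction probes; the templates, attribute pools, and the 1:100 ratio are illustrative, not the paper's exact setup.

```python
import random

def make_bios(num_people, rng, tag):
    """Generate synthetic biography paragraphs plus matching QA probes
    used to measure knowledge extraction."""
    cities = ["Princeton, NJ", "Menlo Park, CA", "Austin, TX"]
    bios, probes = [], []
    for i in range(num_people):
        name = f"{tag} Person{i}"          # hypothetical unique identity
        city = rng.choice(cities)
        year = rng.randint(1940, 2000)
        bios.append(f"{name} was born in {year}. "
                    f"{name} spent early years in {city}.")
        probes.append((f"In which year was {name} born?", str(year)))
    return bios, probes

rng = random.Random(0)
pretrain_bios, _ = make_bios(10_000, rng, "Pre")      # the N pretraining people
finetune_bios, ft_probes = make_bios(100, rng, "Ft")  # M = 1% of N
```

One would then pretrain until the N pretraining probes are fully answerable, finetune on the M new biographies, and measure accuracy on `ft_probes` to reproduce the capacity limit the reply describes.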
@deter3 · 1 month ago
Awesome research. The only drawback is that the author has done less research on LLMs at the cognition level, so the research perspective is not very strategic. In other words, the research assumes that LLMs are a knowledge engine.
@clementdato6328 · 1 month ago
A is B does not imply B is A, though…
@chriscross7671 · 1 day ago
You are venturing much further into alchemy than you think you are. But that was to be expected after the weird Newton and Kepler excursion and the talk’s title. Typical big tech hubris.