Very high information density. Each subsection is worth an individual paper.
@sheikhshafayat6984 2 days ago
Allen, this is such a wonderful talk. So many thanks for putting this online!
@icriou 1 month ago
This is crazy. I feel like a low-dimensional creature witnessing a high-dimensional creature's thinking process and experimental methods. The testbed is so well chosen that I'll build on it to learn more. Thank you so much.
@manncodes 13 days ago
This is the most insightful talk I have ever seen. It has taught me how powerful controlled experiments can be!
@salemon289 7 days ago
Awesome talk, I am glad it is available now for those who can't make it to ICML in person!
@leaderfeng8419 17 days ago
One of the clearest talks about LLMs I've ever heard.
@sacramentofwilderness6656 1 month ago
Awesome talk! A thorough and detailed investigation into multiple aspects of LLM learning and behavior.
@fangzhangmnm6049 17 days ago
That's a terrible [Back] amazing talk!
@shubhamtoshniwal2221 16 days ago
hahaha, good one
@fdumitru 5 days ago
Okay, this was a masterclass-level presentation. Great job, Zeyuan!
@haoliang7319 6 days ago
Excellent talk, a great demonstration of how LLMs work and, more importantly, how research works in general. A breath of fresh air (清流) amid the hype.
@user-wd7jf8kf3y 1 month ago
Really insightful. Great talk.
@user-wv3wx7zr6g 14 days ago
Amazing tutorial. I learned a lot about how LLMs work!
@spartaleonidas540 1 month ago
Well... that was visionary. What's your best guess on why Claude is winning at code gen? Are they on this trail already?
@hahiZY 1 month ago
Good job! You explained it in a very clear way! One of the best talks I have watched recently.
@fengliang1590 10 days ago
For the Part 3.2 reverse-search problem, it occurs to me that OpenAI models actually have a special token, the fill-in-the-middle token. It could probably help mitigate the reversal curse a bit, since we could train models on something like "往开来". But as it's a special token, I suspect that capability is triggered only when the token is used. Perhaps we could fix this by using multiple or random fill-in-the-middle tokens.
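The fill-in-the-middle idea mentioned here can be sketched as a simple data rearrangement. This is a toy illustration, not OpenAI's actual preprocessing; the `<PRE>`/`<SUF>`/`<MID>` strings stand in for whatever special tokens a real tokenizer would use:

```python
def make_fim_example(text, span_start, span_end,
                     pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Rearrange a document into prefix/suffix/middle order so a purely
    left-to-right model learns to predict a middle span given both sides.
    The pre/suf/mid marker strings are hypothetical placeholders."""
    prefix = text[:span_start]
    middle = text[span_start:span_end]
    suffix = text[span_end:]
    # Trained with ordinary next-token prediction on this string, the
    # model generates the middle span after seeing prefix AND suffix.
    return f"{pre}{prefix}{suf}{suffix}{mid}{middle}"
```

For example, `make_fim_example("继往开来", 1, 3)` yields `"<PRE>继<SUF>来<MID>往开"`, which is the kind of "see both ends, fill the middle" supervision the comment suggests.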
@JL-zl6ot 1 month ago
Super interesting and informative. Thanks for posting this!!
@spartaleonidas540 1 month ago
30:55 The proposed Turing test doesn't really apply to in-context learning, i.e. after RLHF or DPO, right? A simple test (Gemini Advanced) verifies this:
Prompt: Anya Briar Forger was born on October 2, 1996. She spent her early years in Princeton, NJ. She received mentorship and guidance from faculty members at MIT. She completed her education with a focus on Communications. She had a professional role at Meta Platforms. She was employed in Menlo Park, CA.
Question: What is the birth date of Anya Briar Forger?
Answer: Anya Briar Forger's birth date is October 2, 1996.
Question: Who was born on October 2, 1996 in Princeton, NJ, studied Communications at MIT, and worked for Meta Platforms at Menlo Park, CA?
Answer: Anya Briar Forger, based on the information provided.
@zhuzeyuan 1 month ago
The entire Part 3 is about (factual) knowledge, which is "out-of-context" knowledge: asking for some celebrity's birthday when that knowledge is *not* given in the same context window. Even things like 7*9=63 are considered factual knowledge. In contrast, in-context "knowledge" (which I don't usually call knowledge, but some people do) is certainly reverse-searchable, and also manipulable. Any hierarchical operation on such "knowledge" can be viewed as a sort of reasoning or mental computation (if not using CoT), and that's what Part 2 covers (at least to some extent). Hope this answers your question.
@indraneelmukherjee2092 1 month ago
Awesome stuff!!
@jameshoang877 1 month ago
woo, thank you for your talk. Love it so much! 😀
@user-kd3lr6ny4t 1 month ago
Beautiful talk! Textbook level.
@TheWasserstoff 16 days ago
Definitely an eye-opener when it comes to the inner workings of LLMs. I need to understand whether this approach can be replicated when building RAG systems and finetuning.
@ZYTUSA 1 month ago
Thanks for uploading!!
@GNARGNARHEAD 16 days ago
wonderful talk, thanks
@orlandoairs23 13 days ago
Really insightful❤
@bwatspro 1 month ago
This needs more attention
@BernaBermejo-qw2yh 17 days ago
Attention is all it needs
@filobonda 18 days ago
The forbidden knowledge! Quick, download it!
@kamiyss 17 days ago
Highly appreciate this masterpiece! Could you please share the slides?
@kksu6860 1 month ago
Great talk.
@islandfireballkill 1 month ago
Fantastic talk. This is the right combination of hypothesis backed up by disciplined investigative research that I love to see.
@oulshis7453 1 month ago
amazing!!
@QuanWang 1 month ago
In your experiment setup, during finetuning, is the loss computed on the full sequence or only on the generated tokens? If the latter, do you think that might be the cause of the different conclusions for pretraining vs. finetuning? (I still don't understand how they could differ under a universal law...)
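For readers unfamiliar with the distinction this question raises, here is a toy sketch (pure Python, illustrative only, not the talk's actual setup) of the two loss conventions: averaging the next-token negative log-likelihood over every position versus masking out the prompt and averaging only over the completion tokens, as SFT pipelines commonly do:

```python
import math

def avg_nll(step_probs, prompt_len, completion_only):
    """step_probs[i] = model probability assigned to the correct token at
    position i. Pretraining-style loss averages over all positions;
    completion-only loss ignores the prompt positions."""
    start = prompt_len if completion_only else 0
    losses = [-math.log(p) for p in step_probs[start:]]
    return sum(losses) / len(losses)

# Hypothetical per-token probabilities: easy prompt, hard completion.
probs = [0.9, 0.9, 0.9, 0.1, 0.1]
full = avg_nll(probs, prompt_len=3, completion_only=False)
sft = avg_nll(probs, prompt_len=3, completion_only=True)
# Masking the prompt yields a higher measured loss here, since only the
# hard completion tokens contribute to the average.
```

The same model thus reports different losses under the two conventions, which is one concrete way the finetuning objective can diverge from the pretraining one.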
@clementdato6328 1 month ago
As for not being capable of retrieving partial knowledge (from "October 7, 1942" to 1942): is this because the training data never has the concepts of month, day, and year separately? Otherwise, I could see other example data serving as the "celebrity" that helps the model learn to extract the data for the "minority", like a specific October 7, 1942.
@peterzhong3085 1 month ago
Do you mind if I share this video on my LinkedIn? I will credit the authors.
@004307ec 1 month ago
awesome😊❤
@MrNoipe 1 month ago
Download this video NOW before it gets taken down again!
@hanchisun6164 1 month ago
Some of the issues are caused by discrepancies between tokenizers.
@zhuzeyuan 1 month ago
@hanchisun6164 Essentially none in our case. The controlled experiments have ruled out the influence of tokenizers. An example is the GPT vs. Llama experiment, where we actually compared Llama vs. Llama (GatedMLP replaced with MLP), or GPT vs. GPT (MLP replaced with GatedMLP), so the comparison is "conditioned" on the same tokenizer.
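For context on the architectural swap described in this reply, here is a schematic of the two feed-forward variants being exchanged. This is toy code with explicit loops, not a real training implementation; actual models use learned weight matrices, hidden dimensions in the thousands, and framework tensor ops:

```python
import math

def silu(x):
    """SiLU activation, as used in Llama-style feed-forward blocks."""
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(x, w_up, w_down):
    """GPT-style MLP: down( act( up(x) ) )."""
    return matvec(w_down, [silu(h) for h in matvec(w_up, x)])

def gated_mlp(x, w_gate, w_up, w_down):
    """Llama-style gated MLP: down( act(gate(x)) * up(x) ).
    The elementwise gate is the structural difference being isolated."""
    g = [silu(h) for h in matvec(w_gate, x)]
    u = matvec(w_up, x)
    return matvec(w_down, [gi * ui for gi, ui in zip(g, u)])
```

Because everything upstream (tokenizer, data, training recipe) is held fixed and only this block is swapped, any behavioral difference can be attributed to the gating itself, which is the "conditioning" the reply describes.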
@sparshteotia652 1 month ago
@zhuzeyuan Very rigorous probing and good research; looking forward to more of your experiments and talks.
@hanchisun6164 1 month ago
@zhuzeyuan Thank you so much for the clarifications! I will read your paper more carefully.
@wwkk4964 1 month ago
@zhuzeyuan At around the 37-minute mark, you talked about the reversal curse, but I think I read a paper where they were able to address this problem by training the model on jumbled-up spans of tokens of variable size n between a starting and ending token. It was inefficient, and its performance in the forward direction was bad, but its performance in reverse was equally bad (or equally good, depending on how you want to characterize it) compared to BERT. Anyway, enjoying your work, thank you, will continue watching.
@sparshteotia652 1 month ago
@zhuzeyuan I had a question: can we do pretraining on large instruction-tuning datasets and then SFT on smaller SFT datasets, given that you state SFT is not beneficial if the task's information is not in the domain of the pretrained parameters?
@zhuzeyuan 1 month ago
Sort of, yes. But even in-domain, I can tell you about a set of experiments we did (but don't have the time to write up as a paper). Pretrain a model on N biographies and make sure the knowledge is 100% extractable. Next, finetune with M biographies and check whether the model can extract the knowledge of those M people. The answer: regardless of the finetuning method (full, LoRA, different ranks), it seems M can be at most 1% of N. But I didn't have the time to do a more thorough experiment that also varies model size, etc. We've got more urgent things to work on first...
@deter3 1 month ago
Awesome research. The only drawback is that the author has done less research on LLMs at the cognition level, so the research perspective is not very strategic. In other words, the research assumes LLMs are a knowledge engine.
@clementdato6328 1 month ago
"A is B" does not imply "B is A", though...
@chriscross767 1 day ago
You are venturing much further into alchemy than you think you are. But that was to be expected after the weird Newton and Kepler excursion and the talk’s title. Typical big tech hubris.