DPO to TPO: Test-Time Preference Optimization (RL)

2,556 views

Discover AI

1 day ago

Comments: 15
@propeacemindfortress 5 days ago
Exactly, no need for 5 PhDs; 2 or 3 and a bit of genius is absolutely sufficient^^ Great work, you're amazing :wink:
@diga4696 4 days ago
I have started a small local nonprofit that uses CSI concepts and agentic workflows to help evolve local communities and businesses; the objective is reward-driven knowledge sharing to optimize processes and task allocation for internal and external operations.
@s4kupfers 3 days ago
Really fascinating, I'm looking forward to you sharing some of your real-world findings and learnings.
@OumarDicko-c5i 5 days ago
If you do an implementation I would love to see it, so beautiful.
@Summersault666 5 days ago
One thing I didn't understand: after you generate the "reasoning" text, how do you incorporate the knowledge?
@thingX1x 5 days ago
I love learning these new ideas, I can just ask Bolt if it can add in the new concept :D
@tk0150 5 days ago
Please share your experience after you play with it!
@tspis 1 day ago
Very cool stuff - thanks for sharing and covering, love your content, as usual! What is a bit unfortunate, though, is that the paper's authors frame an iterative heuristic as a gradient-based optimization method. They use optimization equations when no optimization calculation is actually happening. This doesn't diminish their results and achievements at all, but it leaves a bad taste in my mouth. And just a week after the TPO paper's release, another paper does heuristic refinement (though this time, prior to inference) but tries to pass it off as backprop/differentiation. That latter one is an even worse offender, as it goes a step further and actually uses differentiation equations when no backprop or optimization calculation is being performed ("LLM-AutoDiff: Auto-Differentiate Any LLM Workflow", arXiv 2501.16673). Again, very cool work, and the results are there - but why the math-washing? Sigh.
@Karthikprath 4 days ago
Thanks for this video. Can you tell me the formula for calculating FLOPs during inference on an H100?
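For reference, the usual back-of-the-envelope formula (an assumption on my part, not something stated in the video) is roughly 2 × parameters FLOPs per token for the forward pass of a dense decoder-only model. A minimal sketch, assuming the H100 SXM's ~989 TFLOPS dense BF16 peak:

```python
# Rule-of-thumb estimate (an assumption, not from the video): a dense
# decoder-only transformer needs roughly 2 * n_params FLOPs per token for
# the forward pass; attention and KV-cache overheads are ignored.

def inference_flops(n_params: float, n_tokens: int) -> float:
    """Approximate forward-pass FLOPs to process/generate n_tokens."""
    return 2.0 * n_params * n_tokens

H100_BF16_PEAK = 989e12  # ~989 TFLOPS dense BF16 on an H100 SXM (no sparsity)

if __name__ == "__main__":
    # Example: a 70B-parameter model decoding 50 tokens/s on one H100.
    flops_per_sec = inference_flops(70e9, 50)
    print(f"FLOPs/s: {flops_per_sec:.3e}")
    print(f"Fraction of H100 peak: {flops_per_sec / H100_BF16_PEAK:.1%}")
```

The tiny utilization fraction in the example is expected: autoregressive decoding is usually memory-bandwidth-bound, so compute peak is rarely the limit.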
@virendraashiwal8311 2 days ago
But can domain-specific reasoning be done here? Do we need a domain-specific reward function?
@profcelsofontes 5 days ago
And how about GRPO, used by DeepSeek-R1?
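For context, GRPO (the RL method DeepSeek-R1 uses instead of PPO) drops the learned value model and uses a group-relative baseline: sample several answers per prompt, score them with a reward model, and normalize each reward against the group's mean and standard deviation. A minimal sketch of that advantage computation, with illustrative numbers:

```python
# Minimal sketch of GRPO's group-relative advantage: sample G responses per
# prompt, score each with a reward model, and normalize within the group.
# No value network is needed; the group statistics serve as the baseline.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps) over one group of responses."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

if __name__ == "__main__":
    # Example: four sampled answers to the same prompt, scored by a reward model.
    print(group_relative_advantages([0.1, 0.7, 0.4, 0.9]))
```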
@FelheartX 5 days ago
The reward model in this case is just "this answer got picked by more people than the other answer(s)", or what? But how does this help in a chat system like ChatGPT? The LLM internally does all this "text loss" and "text gradient" stuff, and then what? The next response it gives will be better adjusted to the user's preferences? Essentially this is an elaborate way of saying "this is the answer the user picked, now let's try to infer how they prefer their answers and keep doing that", or am I wrong?
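For readers following along, here is a minimal sketch of the test-time loop as this comment (and the video) describe it. All callables are hypothetical stand-ins for the sampler, reward model, and critique/revision prompts; no model weights are ever updated, which is why the "loss" and "gradient" are purely textual:

```python
# Sketch of a TPO-style test-time loop: sample candidate answers, score them
# with a reward model, turn the best/worst contrast into a textual "loss" and
# "gradient" (a critique), and ask the LLM to revise. Nothing here touches
# model weights. `generate`, `score`, `critique`, `revise` are stand-ins.
from typing import Callable, List

def tpo_step(prompt: str,
             generate: Callable[[str, int], List[str]],   # LLM sampler (stand-in)
             score: Callable[[str, str], float],          # reward model (stand-in)
             critique: Callable[[str, str, str], str],    # textual "gradient"
             revise: Callable[[str, str, str], str],      # LLM revision step
             n_samples: int = 4) -> str:
    """One iteration of test-time preference optimization for a single prompt."""
    candidates = generate(prompt, n_samples)
    ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
    best, worst = ranked[0], ranked[-1]
    # Textual "loss": why is `best` preferred over `worst`?
    feedback = critique(prompt, best, worst)
    # Apply the textual "gradient": rewrite the best candidate along the feedback.
    return revise(prompt, best, feedback)

if __name__ == "__main__":
    # Toy stand-ins: longer drafts "win", critique/revise just annotate text.
    answer = tpo_step(
        "Explain TPO in one line.",
        generate=lambda p, n: [f"draft {i}: {p} " * (i + 1) for i in range(n)],
        score=lambda p, c: float(len(c)),
        critique=lambda p, best, worst: "prefer the more detailed draft",
        revise=lambda p, best, fb: best + f"[revised per feedback: {fb}]",
    )
    print(answer)
```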
@user-qw1rx1dq6n 5 days ago
Fuck, I've been working on this for like a year and a half now; I was so close to getting it to work.
@TheDoomerBlox 5 days ago
(Insert obligatory derogatory statement here.) Well, that puts you in a prime position to reimplement something similar for different purposes, no?
@user-qw1rx1dq6n 5 days ago
@ Yeah, I'm gonna go take a look at how they implemented the reward model; maybe that can solve my problems.