DPO to TPO: Test-Time Preference Optimization (RL)

2,556 views

Discover AI

1 day ago

Comments: 15
@propeacemindfortress 5 days ago
Exactly, no need for 5 PhDs; 2 or 3 and a bit of genius is absolutely sufficient^^ Great work, you're amazing :wink:
@diga4696 4 days ago
I have started a small local nonprofit that uses CSI concepts and agentic workflows to help evolve local communities and businesses; the objective is reward-driven knowledge sharing to optimize processes and task allocation for internal and external operations.
@s4kupfers 3 days ago
Really fascinating, I'm looking forward to you sharing some of your real-world findings and learnings.
@OumarDicko-c5i 5 days ago
If you do an implementation I would love to see it, so beautiful.
@Summersault666 5 days ago
One thing I didn't understand: after you generate the "reasoning" text, how do you incorporate the knowledge?
@thingX1x 5 days ago
I love learning these new ideas, I can just ask Bolt if it can add in the new concept :D
@tk0150 5 days ago
Please share your experience after you play with it!
@tspis 1 day ago
Very cool stuff - thanks for sharing and covering, love your content, as usual! What is a bit unfortunate, though, is that the paper's authors frame an iterative heuristic as a gradient-based optimization method. They use optimization equations when no optimization calculation is actually happening. This doesn't diminish their results and achievements at all, but it leaves a bad taste in my mouth. And just a week after the TPO paper's release, another paper does heuristic refinement (though this time, prior to inference) but tries to pass it off as backprop/differentiation. That latter one is an even worse offender, as it goes a step further and actually uses differentiation equations when no backprop or optimization calculation is being performed ("LLM-AutoDiff: Auto-Differentiate Any LLM Workflow", arXiv 2501.16673). Again, very cool work, and the results are there - but why the math-washing? Sigh.
@Karthikprath 4 days ago
Thanks for this video. Can you tell me the formula for calculating FLOPs during inference on an H100?
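For reference, the usual back-of-the-envelope formula (an assumption on my part, not something stated in the video) is roughly 2 × parameters FLOPs per token for the forward pass of a dense decoder-only model. A minimal sketch, assuming the H100 SXM's ~989 TFLOPS dense BF16 peak:

```python
# Rule-of-thumb estimate (an assumption, not from the video): a dense
# decoder-only transformer needs roughly 2 * n_params FLOPs per token for
# the forward pass; attention and KV-cache overheads are ignored.

def inference_flops(n_params: float, n_tokens: int) -> float:
    """Approximate forward-pass FLOPs to process/generate n_tokens."""
    return 2.0 * n_params * n_tokens

H100_BF16_PEAK = 989e12  # ~989 TFLOPS dense BF16 on an H100 SXM (no sparsity)

if __name__ == "__main__":
    # Example: a 70B-parameter model decoding 50 tokens/s on one H100.
    flops_per_sec = inference_flops(70e9, 50)
    print(f"FLOPs/s: {flops_per_sec:.3e}")
    print(f"Fraction of H100 peak: {flops_per_sec / H100_BF16_PEAK:.1%}")
```

The tiny utilization fraction in the example is expected: autoregressive decoding is usually memory-bandwidth-bound, so compute peak is rarely the limit.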
@virendraashiwal8311 2 days ago
But can domain-specific reasoning be done here? Do we need a domain-specific reward function?
@profcelsofontes 5 days ago
And how about GRPO, used by DeepSeek-R1?
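For context, GRPO (the RL method DeepSeek-R1 uses instead of PPO) drops the learned value model and uses a group-relative baseline: sample several answers per prompt, score them with a reward model, and normalize each reward against the group's mean and standard deviation. A minimal sketch of that advantage computation, with illustrative numbers:

```python
# Minimal sketch of GRPO's group-relative advantage: sample G responses per
# prompt, score each with a reward model, and normalize within the group.
# No value network is needed; the group statistics serve as the baseline.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps) over one group of responses."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

if __name__ == "__main__":
    # Example: four sampled answers to the same prompt, scored by a reward model.
    print(group_relative_advantages([0.1, 0.7, 0.4, 0.9]))
```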
@FelheartX 5 days ago
The reward model in this case is just "this answer got picked by more people than the other answer(s)", or what? But how does this help in a chat system like ChatGPT? The LLM internally does all this "text loss" and "text gradient" stuff, and then what? The next response it gives will be better adjusted to the user's preferences? Essentially this is an elaborate way of saying "this is the answer the user picked, now let's try to infer how they prefer their answers and keep doing that", or am I wrong?
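For readers following along, here is a minimal sketch of the test-time loop as this comment (and the video) describe it. All callables are hypothetical stand-ins for the sampler, reward model, and critique/revision prompts; no model weights are ever updated, which is why the "loss" and "gradient" are purely textual:

```python
# Sketch of a TPO-style test-time loop: sample candidate answers, score them
# with a reward model, turn the best/worst contrast into a textual "loss" and
# "gradient" (a critique), and ask the LLM to revise. Nothing here touches
# model weights. `generate`, `score`, `critique`, `revise` are stand-ins.
from typing import Callable, List

def tpo_step(prompt: str,
             generate: Callable[[str, int], List[str]],   # LLM sampler (stand-in)
             score: Callable[[str, str], float],          # reward model (stand-in)
             critique: Callable[[str, str, str], str],    # textual "gradient"
             revise: Callable[[str, str, str], str],      # LLM revision step
             n_samples: int = 4) -> str:
    """One iteration of test-time preference optimization for a single prompt."""
    candidates = generate(prompt, n_samples)
    ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
    best, worst = ranked[0], ranked[-1]
    # Textual "loss": why is `best` preferred over `worst`?
    feedback = critique(prompt, best, worst)
    # Apply the textual "gradient": rewrite the best candidate along the feedback.
    return revise(prompt, best, feedback)

if __name__ == "__main__":
    # Toy stand-ins: longer drafts "win", critique/revise just annotate text.
    answer = tpo_step(
        "Explain TPO in one line.",
        generate=lambda p, n: [f"draft {i}: {p} " * (i + 1) for i in range(n)],
        score=lambda p, c: float(len(c)),
        critique=lambda p, best, worst: "prefer the more detailed draft",
        revise=lambda p, best, fb: best + f"[revised per feedback: {fb}]",
    )
    print(answer)
```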
@user-qw1rx1dq6n 5 days ago
Fuck, I've been working on this for like a year and a half now; I was so close to getting it to work.
@TheDoomerBlox 5 days ago
(Insert obligatory derogatory statement here.) Well, that puts you in a prime position to reimplement something similar for different purposes, no?
@user-qw1rx1dq6n 5 days ago
@ Yeah, I'm gonna go take a look at how they implemented the reward model; maybe that can solve my problems.