Nathan, the quality of your guests and conversations is so high that this is literally the only channel on all of YouTube where I tolerate in-video ads. Though I have to say that the in-video ads are also the only explanation I can find for why this channel doesn't already have 100k+ subs :)
@alexm1815 5 days ago
This is my favorite episode yet, incredibly information dense. Thank you Nathan^2!
@zyzhang1130 5 days ago
Very juicy content that is really lacking elsewhere (as far as I know)
@AR-iu7tf 19 hours ago
As others have already said below, this is the most substantive and informative conversation on post-training I have seen to date. Thank you so much for shedding light on an area that is almost like a black box right now - all we can find online is tidbits of speculation. You mention a paper on verifier RL; I couldn't find a link to it online, so perhaps it is not published yet? Could you please share it if it is available. Also, I know we can only speculate about what o1 or DeepSeek is doing for the reasoning sequences, but would it be fair to assume that, during training, they apply some form of reward model/verifier feedback at intermediate stages of a sequence that leads to a correct result, as opposed to just one reward signal for the entire sequence like what ChatGPT (perhaps!) does? In other words, is the Bellman update likely applied to all the tokens only at the end of the sequence, or at intermediate stages as well? Also, thank you so much for clarifying how the single reward value at the end is converted into individual rewards for each of the tokens that make up that sequence.
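For readers puzzling over that last question, here is a minimal sketch (not something stated in the episode; the function name, KL coefficient, and values are illustrative) of one common way a single outcome reward is spread over every token in PPO-style RLHF: the reward model or verifier scores only the finished response, that scalar is attached to the final token, an optional per-token KL penalty against the reference model is added as shaping, and per-token returns are computed by discounting backwards through the sequence.

```python
# Hedged sketch: turning one sequence-level reward into per-token returns,
# as is commonly done in PPO-style RLHF. Everything here is illustrative.

from typing import List

def per_token_returns(
    terminal_reward: float,      # single scalar from the reward model / verifier
    kl_penalties: List[float],   # per-token KL(policy || reference), hypothetical values
    gamma: float = 1.0,          # discount factor; 1.0 lets every token share the reward
    kl_coef: float = 0.05,       # weight on the KL shaping term
) -> List[float]:
    """Convert one sequence-level reward into a return for each token position."""
    T = len(kl_penalties)
    # Per-token reward: KL shaping everywhere, terminal reward only at the last token.
    rewards = [-kl_coef * kl for kl in kl_penalties]
    rewards[-1] += terminal_reward

    # Discounted return G_t = r_t + gamma * G_{t+1}, computed right to left.
    returns = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a 5-token response scored 1.0 by the reward model.
print(per_token_returns(1.0, [0.02, 0.01, 0.03, 0.02, 0.01]))
```

With gamma = 1.0 every token effectively shares the one terminal reward, which is the "single reward for the whole sequence" case; intermediate or process rewards would instead add nonzero entries before the final token, which is the other scenario the comment asks about.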