Really appreciate this presentation! I'm likely to use and cite this video in a forthcoming article in Mind Matters. Thank you!
@riser9644 1 year ago
A link to the blog post, code, or PPT would be good.
@vivekpadman5248 8 months ago
Very nice, short, informative video. I'm looking to create a distilled model for reasoning tasks in games that could run locally. This will help 😊 thanks
@SnorkelAI 8 months ago
Glad it was helpful!
@LeonidAndrianov 2 months ago
Interesting, thank you
@SnorkelAI 2 months ago
Glad you think so!
@vivekpadman5248 8 months ago
Is this approach used at all three levels of training: base, instruct, and chat fine-tuning? And are there different considerations for each?
@SnorkelAI 8 months ago
I'm not 100% clear on your question. Are you referring to pre-training, fine-tuning and alignment? If so, this approach could be used on fine-tuning and/or alignment. It could also theoretically be used on pre-training, but I suspect that would yield poor results.
@vivekpadman5248 8 months ago
@@SnorkelAI Yes, that was exactly my question, thanks 😊. I have one follow-up question: why do you think it would yield poorer results in the pre-training phase? Any insights on that? And in that case, what kind of pre-trained student model (size and architecture) should be used with a specific teacher LLM, or would anything work?
@SnorkelAI 8 months ago
Sorry for the slow reply here. YouTube didn't surface your reply comment the same way it did your initial comment. We're getting a bit outside the bounds of what can be reasonably answered within a YouTube comment, but I think we can reasonably say this: distilling a model means using its output to train a smaller model. For pre-training, that would mean generating an immense volume of raw output from the parent model. Several studies have shown that pre-training generative models on other models' generated output tends not to work so well. We don't yet fully understand why, but we do understand that it is a questionable practice at present.
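Roughly, the fine-tuning version of that looks like the sketch below. This is only an illustrative sketch, not a recipe from the video: the model names and prompts are placeholders, and the student's training loop is omitted.

```python
# Sketch of sequence-level distillation for fine-tuning (not pre-training):
# the large "teacher" labels unlabeled prompts, and the small "student" is then
# fine-tuned on the generated (prompt, response) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "big-teacher-llm"    # placeholder: any large instruction-tuned model
student_name = "small-student-llm"  # placeholder: the smaller model you want to distill into

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.float16)

unlabeled_prompts = [
    "Summarize the termination clause below: ...",
    "Extract the parties named in this contract section: ...",
]

# Step 1: use the teacher to produce pseudo-labels (responses) for unlabeled prompts.
distillation_pairs = []
teacher.eval()
with torch.no_grad():
    for prompt in unlabeled_prompts:
        inputs = teacher_tok(prompt, return_tensors="pt")
        output_ids = teacher.generate(**inputs, max_new_tokens=128)
        response = teacher_tok.decode(
            output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        distillation_pairs.append({"prompt": prompt, "response": response})

# Step 2: fine-tune the student on these (prompt, response) pairs with an ordinary
# causal-LM loss, exactly as you would with human-labeled data.
# (Training loop omitted; any standard supervised fine-tuning setup works here.)
```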
@vivekpadman5248 8 months ago
@@SnorkelAI No worries man, getting such a nice, detailed reply is all that matters. Ah, I understand it properly now. I also guess the limits of parameter size would come into the picture if we used it for pre-training. Clean data plus synthetic data is available anyway now. Thanks again 😊🙏
@zengstephen 3 days ago
Who came here because of DeepSeek?
@science_electronique 2 days ago
Me! I'm wondering about distilling a vision model down to a smaller one.
@lionhuang9209 1 year ago
Where can we get the PPT?
@mechwarrior83 1 year ago
please
@420_gunna 1 year ago
When you talk about distillation requiring large, unlabeled datasets... to be clear for my understanding, it's not necessarily that the data is unlabeled; it's more that we don't care about the dataset's labels, and instead use the teacher model's output distribution as the replacement pseudo-label. I guess you COULD create a distilled model by training against some data distribution that the teacher wasn't itself trained on... but I can't imagine why you would want to do that 😄
@SnorkelAI 10 months ago
Sort of. Typically, you would use this for data that is, in fact, unlabeled: think sections of contracts or paragraphs from textbooks. You could also employ this approach for data that has labels that don't fit your desired schema, in which case your statement that "we don't care about the dataset's labels" would be 100% correct. As for your second comment, there could be a number of reasons you may want to do that. Perhaps the teacher LLM does quite well on a particular labeling task when given a highly engineered prompt. This approach would let you transfer that performance into a smaller and cheaper model.
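To make the "teacher's output distribution as the pseudo-label" idea concrete, here's a minimal sketch with toy classifiers so it stays self-contained. The class count, feature size, and temperature are placeholder assumptions, not values from the video.

```python
# Soft-label distillation on unlabeled data: the teacher's softened output
# distribution (not a hard label) is the training target for the student.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 4    # placeholder: e.g. a small contract-clause labeling schema
EMBED_DIM = 256    # placeholder: size of pre-encoded text features
TEMPERATURE = 2.0  # softens the teacher distribution so small logit gaps still carry signal

teacher = nn.Sequential(nn.Linear(EMBED_DIM, 512), nn.ReLU(), nn.Linear(512, NUM_CLASSES))
student = nn.Sequential(nn.Linear(EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

# A batch of unlabeled examples, assumed already encoded into fixed-size features.
unlabeled_batch = torch.randn(32, EMBED_DIM)

teacher.eval()
with torch.no_grad():
    # The teacher's softened distribution becomes the pseudo-label.
    teacher_probs = F.softmax(teacher(unlabeled_batch) / TEMPERATURE, dim=-1)

student.train()
optimizer.zero_grad()
student_log_probs = F.log_softmax(student(unlabeled_batch) / TEMPERATURE, dim=-1)

# Hinton-style distillation loss: KL divergence from the teacher's soft targets
# to the student's predictions, scaled by T^2 to keep gradient magnitudes comparable.
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * TEMPERATURE**2
loss.backward()
optimizer.step()
```

In practice the toy `nn.Sequential` models would be replaced by a real teacher and a smaller student network, and the random batch by features from your actual unlabeled corpus; the loss and training step stay the same.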