Really appreciate this presentation! I'm likely to use and cite this video in a forthcoming article in Mind Matters. Thank you!
@riser9644 1 year ago
A link to the blog post, code, or PPT would be good.
@vivekpadman5248 8 months ago
Very nice, short, informative video. I'm looking to create a distilled model for reasoning tasks in games that could run locally. This will help 😊 thanks
@SnorkelAI 8 months ago
Glad it was helpful!
@LeonidAndrianov 2 months ago
Interesting, thank you
@SnorkelAI 2 months ago
Glad you think so!
@vivekpadman5248 8 months ago
Is this approach used at all three levels of training: base, instruct, and chat fine-tuning? And are there different considerations for each?
@SnorkelAI 8 months ago
I'm not 100% clear on your question. Are you referring to pre-training, fine-tuning and alignment? If so, this approach could be used on fine-tuning and/or alignment. It could also theoretically be used on pre-training, but I suspect that would yield poor results.
@vivekpadman5248 8 months ago
@@SnorkelAI Yes, that was exactly my question, thanks 😊. I have one follow-up question: why do you think it would yield poorer results in the pre-training phase? Any insights on that? And in that case, what kind of pre-trained student model (size and architecture) should be used with a specific teacher LLM, or would anything work?
@SnorkelAI 8 months ago
Sorry for the slow reply here. YouTube didn't surface your reply comment the same way it did your initial comment. We're getting a bit outside the bounds of what can be reasonably answered within a YouTube comment, but I think we can reasonably say this: distilling a model means using its output to train a smaller model. For pre-training, that would mean generating an immense volume of raw output from the parent model. Several studies have shown that pre-training generative models on other models' generated output tends not to work so well. We don't yet fully understand why, but we do understand that it is a questionable practice at present.
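Roughly, the fine-tuning version of that looks like the sketch below. This is only an illustrative sketch, not a recipe from the video: the model names and prompts are placeholders, and the student's training loop is omitted.

```python
# Sketch of sequence-level distillation for fine-tuning (not pre-training):
# the large "teacher" labels unlabeled prompts, and the small "student" is then
# fine-tuned on the generated (prompt, response) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "big-teacher-llm"    # placeholder: any large instruction-tuned model
student_name = "small-student-llm"  # placeholder: the smaller model you want to distill into

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.float16)

unlabeled_prompts = [
    "Summarize the termination clause below: ...",
    "Extract the parties named in this contract section: ...",
]

# Step 1: use the teacher to produce pseudo-labels (responses) for unlabeled prompts.
distillation_pairs = []
teacher.eval()
with torch.no_grad():
    for prompt in unlabeled_prompts:
        inputs = teacher_tok(prompt, return_tensors="pt")
        output_ids = teacher.generate(**inputs, max_new_tokens=128)
        response = teacher_tok.decode(
            output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        distillation_pairs.append({"prompt": prompt, "response": response})

# Step 2: fine-tune the student on these (prompt, response) pairs with an ordinary
# causal-LM loss, exactly as you would with human-labeled data.
# (Training loop omitted; any standard supervised fine-tuning setup works here.)
```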
@vivekpadman5248 8 months ago
@@SnorkelAI No worries man, getting such a nice, detailed reply is all that matters. Ah, I understand it properly now. I also guess the limits of parameter size would come into the picture if we used it for pre-training. Clean data plus synthetic data is available anyway now. Thanks again 😊🙏
@zengstephen 3 days ago
Who came here because of DeepSeek?
@science_electronique 2 days ago
Me! I'm wondering about distilling a vision model down to a smaller one.
@lionhuang9209 1 year ago
Where can we get the PPT?
@mechwarrior83 1 year ago
please
@420_gunna 1 year ago
When you talk about distillation requiring large, unlabeled datasets... to be clear for my understanding, it's not necessarily that the data is unlabeled; it's more that we don't care about the dataset's labels, and instead use the teacher model's output distribution as the replacement pseudo-label. I guess you COULD create a distilled model by training against some data distribution that the teacher wasn't itself trained on... but I can't imagine why you would want to do that 😄
@SnorkelAI 10 months ago
Sort of. Typically, you would use this for data that is, in fact, unlabeled: think sections of contracts or paragraphs from textbooks. You could also employ this approach for data that has labels that don't fit your desired schema, in which case your statement that "we don't care about the dataset's labels" would be 100% correct. As for your second comment, there could be a number of reasons you may want to do that. Perhaps the teacher LLM does quite well on a particular labeling task when given a highly engineered prompt. This approach would let you transfer that performance into a smaller and cheaper model.
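To make the "teacher's output distribution as the pseudo-label" idea concrete, here's a minimal sketch with toy classifiers so it stays self-contained. The class count, feature size, and temperature are placeholder assumptions, not values from the video.

```python
# Soft-label distillation on unlabeled data: the teacher's softened output
# distribution (not a hard label) is the training target for the student.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 4    # placeholder: e.g. a small contract-clause labeling schema
EMBED_DIM = 256    # placeholder: size of pre-encoded text features
TEMPERATURE = 2.0  # softens the teacher distribution so small logit gaps still carry signal

teacher = nn.Sequential(nn.Linear(EMBED_DIM, 512), nn.ReLU(), nn.Linear(512, NUM_CLASSES))
student = nn.Sequential(nn.Linear(EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

# A batch of unlabeled examples, assumed already encoded into fixed-size features.
unlabeled_batch = torch.randn(32, EMBED_DIM)

teacher.eval()
with torch.no_grad():
    # The teacher's softened distribution becomes the pseudo-label.
    teacher_probs = F.softmax(teacher(unlabeled_batch) / TEMPERATURE, dim=-1)

student.train()
optimizer.zero_grad()
student_log_probs = F.log_softmax(student(unlabeled_batch) / TEMPERATURE, dim=-1)

# Hinton-style distillation loss: KL divergence from the teacher's soft targets
# to the student's predictions, scaled by T^2 to keep gradient magnitudes comparable.
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * TEMPERATURE**2
loss.backward()
optimizer.step()
```

In practice the toy `nn.Sequential` models would be replaced by a real teacher and a smaller student network, and the random batch by features from your actual unlabeled corpus; the loss and training step stay the same.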