Better not Bigger: Distilling LLMs into Specialized Models

7,156 views

Snorkel AI

A day ago

Comments: 17
@nunca789 15 hours ago
Really appreciate this presentation! I'm likely to use and cite this video in a forthcoming article in Mind Matters. Thank you!
@riser9644 a year ago
A link to the blog, code, or slides would be good.
@vivekpadman5248 8 months ago
Very nice, short, informative video. I'm looking to create a distilled model for reasoning tasks in games that could run locally. This will help 😊 thanks
@SnorkelAI 8 months ago
Glad it was helpful!
@LeonidAndrianov 2 months ago
Interesting, thank you
@SnorkelAI 2 months ago
Glad you think so!
@vivekpadman5248 8 months ago
Is this approach used at all three levels of training: base, instruct, and chat fine-tuning? And are there different things to consider for each?
@SnorkelAI 8 months ago
I'm not 100% clear on your question. Are you referring to pre-training, fine-tuning and alignment? If so, this approach could be used on fine-tuning and/or alignment. It could also theoretically be used on pre-training, but I suspect that would yield poor results.
@vivekpadman5248 8 months ago
@@SnorkelAI Yes, that was exactly my question, thanks 😊. I have one follow-up question: why do you think it would yield poorer results in the pre-training phase? Any insights on that? And in that case, what kind of pre-trained student model (size and architecture) should be used with a specific teacher LLM, or would anything work?
@SnorkelAI 8 months ago
Sorry for the slow reply here. YouTube didn't surface your reply comment the same way it did your initial comment. We're getting a bit outside the bounds of what can be reasonably answered within a YouTube comment, but I think we can reasonably say this: distilling a model means using its output to train a smaller model. For pre-training, that would mean creating an immense volume of raw generated outputs from the parent model. Several studies have shown that pre-training generative models on other models' generated output tends not to work well. We don't yet fully understand why, but we understand that it is a questionable practice at present.
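For anyone following the thread, here is a minimal sketch of what "using a model's output to train a smaller model" can look like for fine-tuning-style distillation. It is not from the video: the model name, the classification prompt, and the label-parsing step are all placeholder assumptions.

```python
# Rough sketch of response-based distillation, assuming the Hugging Face
# transformers library. "big-teacher-llm" is a hypothetical checkpoint name.
from transformers import pipeline

teacher = pipeline("text-generation", model="big-teacher-llm")

unlabeled_docs = [
    "Section 4.2: The lessee shall maintain the premises in good repair...",
    "Photosynthesis converts light energy into chemical energy...",
]
prompt = "Classify the text as LEGAL or SCIENCE.\nText: {doc}\nLabel:"

# The teacher's generations become the training labels for the student.
distillation_set = []
for doc in unlabeled_docs:
    out = teacher(prompt.format(doc=doc), max_new_tokens=3)[0]["generated_text"]
    label = out.split("Label:")[-1].strip()
    distillation_set.append({"text": doc, "label": label})

# A smaller student model would then be fine-tuned on distillation_set
# with ordinary supervised training (omitted here for brevity).
```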
@vivekpadman5248 8 months ago
@@SnorkelAI No worries, getting such a nice detailed reply is all that matters. Ah, I understand it properly now. I also guess the limits of parameter size come into the picture if we use it for pre-training. Clean data plus synthetic data is available now anyway. Thanks again 😊🙏
@zengstephen 3 days ago
Who came here because of DeepSeek?
@science_electronique 2 days ago
Me. I'm wondering about distilling a vision model into a smaller one.
@lionhuang9209 a year ago
Where can we get the slides (PPT)?
@mechwarrior83 a year ago
please
@420_gunna a year ago
When you talk about distillation requiring large, unlabeled datasets... to be clear, for my understanding: it's not necessarily that the data is unlabeled, it's more that we don't care about the dataset's labels and instead use the teacher model's output distribution as the replacement pseudo-label. I guess you COULD create a distilled model by training against some data distribution that the teacher wasn't itself trained on... but I can't imagine why you would want to do that 😄
@SnorkelAI 10 months ago
Sort of. Typically, you would use this for data that is, in fact, unlabeled: think sections of contracts or paragraphs from textbooks. You could also employ this approach for data that has labels that don't fit your desired schema, in which case your statement that "we don't care about the dataset's labels" would be 100% correct. As for your second comment, there could be a number of reasons you might want to do that. Perhaps the teacher LLM does quite well on a particular labeling task when given a highly engineered prompt. This approach would let you transfer that performance into a smaller and cheaper model.
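To make the "use the teacher's output distribution as the pseudo-label" idea concrete, here is a minimal PyTorch sketch. It assumes a frozen teacher classifier and a smaller student that share the same label space and a Hugging Face-style forward pass returning `.logits`; none of it is from the video.

```python
# Sketch of soft-label distillation on unlabeled batches: the teacher's
# output distribution supplies the training signal, so no ground-truth
# labels are required.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def train_step(student, teacher, batch, optimizer):
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits  # frozen teacher, no gradients
    student_logits = student(**batch).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The temperature softens both distributions so the student learns from the teacher's relative confidence across labels rather than only its top prediction.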