Ha!!! You have always told us to blend ‘new’ data with some ‘old’ data when doing SFT! Your intuition was spot on. You also always reminded us to mind the format of the new data, so that it matches the format of the original training data as closely as possible.
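For what it's worth, the blending can be as simple as interleaving a replay of 'old', already-known examples into the new SFT set. A minimal Python sketch; the 50/50 mixing ratio and the example format are placeholder assumptions, not the paper's exact recipe:

```python
import random

def blend_sft_data(new_examples, old_examples, old_fraction=0.5, seed=42):
    """Mix 'new-knowledge' SFT examples with replayed 'known' examples.

    new_examples / old_examples: lists of {"prompt": ..., "response": ...} dicts.
    old_fraction: share of the blended set drawn from already-known data
                  (a placeholder ratio to tune empirically).
    """
    rng = random.Random(seed)
    n_old = int(len(new_examples) * old_fraction / (1.0 - old_fraction))
    replay = rng.sample(old_examples, min(n_old, len(old_examples)))
    mixed = list(new_examples) + replay
    rng.shuffle(mixed)
    return mixed
```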
@borisguarisma8810 5 months ago
Wow! I need to watch this again while reading the paper...thank you!
@code4AI 5 months ago
Glad it was helpful!
@AdamBrusselback 5 months ago
This is interesting. When I've been doing SFT on new tasks, I originally had problems getting models to learn the output from just an input + output example. I noticed much, much better performance on the final task when I augmented the training data to include answering questions about the input format, pulling out specific data points from it, generating an intermediate representation, etc.
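Roughly the kind of augmentation I mean, as a minimal sketch; the prompt templates are illustrative placeholders, and the auxiliary answers (format description, extracted fields, intermediate representation) are something you supply per example:

```python
def augment_example(raw_input, target_output, format_desc, fields, intermediate):
    """Turn one input/output pair into several SFT examples: the original task
    plus auxiliary questions about the input itself."""
    return [
        {"prompt": f"Input:\n{raw_input}\n\nProduce the final output.",
         "response": target_output},
        {"prompt": f"Input:\n{raw_input}\n\nDescribe the format of this input.",
         "response": format_desc},
        {"prompt": f"Input:\n{raw_input}\n\nList the key fields and their values.",
         "response": fields},
        {"prompt": f"Input:\n{raw_input}\n\nGive an intermediate representation of the input.",
         "response": intermediate},
    ]
```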
@code4AI 5 months ago
Great observation. However, with closed LLMs, or so-called "open" ones without any transparency about what their pre-training dataset included, because global corporations are afraid of the legal implications of copyright violations ... we have no chance of really optimizing for a coherent fine-tuning dataset. Damn it.
@desmur36 5 months ago
If this holds, it implies we need to sequence our training data so that it scaffolds the model from known to new knowledge. Intuitively this makes sense. Most students learn through a process of building on known concepts that are easy to grasp, then expanding to more advanced topics using that base knowledge as a foundation. It also raises the question: what was the sequence in the pretraining dataset? Was that carefully curated? And how would you organize the internet from fundamental to advanced concepts? I think we got lucky with research papers, because they always follow this sequence of known to new knowledge.
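One hedged way to approximate that ordering: score each SFT example by how "known" it already is to the base model, then sort. A sketch; the is_known scorer is something you would have to supply, e.g. by sampling the base model and checking its answers against the reference:

```python
def order_by_familiarity(examples, is_known, known_first=True):
    """Sort SFT examples so the ones the base model already 'knows' come first.

    examples: list of {"prompt": ..., "response": ...} dicts.
    is_known: callable(example) -> score in [0, 1], e.g. the fraction of sampled
              base-model answers matching the reference (supplied by you).
    """
    scored = [(is_known(ex), ex) for ex in examples]
    scored.sort(key=lambda pair: pair[0], reverse=known_first)
    return [ex for _, ex in scored]
```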
@i_accept_all_cookies 5 months ago
I've been fine-tuning SLMs like TinyLlama, Phi 2, and Gemma 2b. This might explain some of the accuracy variance I've been seeing.
@milindgaharwar827 5 months ago
It seems generally reasonable that 'new data + conceptually related known data' should lead to fewer hallucinations when compared to only new data, or to new data + conceptually unrelated known data. It would probably not make a big difference IF there were a mechanism in the model architecture itself to find common patterns across different learned concepts. Please do share if you are aware of any research in that direction.
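One simple way to operationalise 'conceptually related known data' would be nearest-neighbour retrieval in embedding space. A rough sketch, assuming you already have some embed function that maps strings to vectors; nothing here comes from the paper:

```python
import numpy as np

def related_known_examples(new_texts, known_texts, embed, k=5):
    """For each new example, pick the k most similar known examples by cosine
    similarity. `embed` is any callable mapping a list of strings to an
    (n, d) numpy array, e.g. a sentence-embedding model you already use."""
    new_vecs = embed(new_texts)
    known_vecs = embed(known_texts)
    # normalise rows so the dot product becomes cosine similarity
    new_vecs = new_vecs / np.linalg.norm(new_vecs, axis=1, keepdims=True)
    known_vecs = known_vecs / np.linalg.norm(known_vecs, axis=1, keepdims=True)
    sims = new_vecs @ known_vecs.T          # shape (n_new, n_known)
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return [[known_texts[j] for j in row] for row in top_k]
```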
@zekehobbs7738 5 months ago
At what volume of new tokens does this break down? I.e., 1k, 10k, 100k, 1M, 10M, etc.
@gileneusz 5 months ago
23:15 ICL is compute-intensive... with longer prompts you will get slow prompt processing...
@code4AI 5 months ago
Not with parallel processing like Ring Attention.
@kishoretvk 5 months ago
So, pre-training of an LLM? Can we do it with a 7B or 8B model? Can we further fine-tune a pre-trained LLM and avoid this?
@code4AI 5 months ago
Some argue that fine-tuning is just continued pre-training, kind of. IF we have an open-source LLM where we know all the pre-training datasets, formats, and complexities... then we might have a chance to create an additional, coherent fine-tuning dataset. With closed LLMs, however... no chance.
@gileneusz 5 months ago
I think you missed the point that this paper is about dense models like Llama 3, which is trained on a huge number of tokens. This will not show up as much for models that are not as dense as Llama 3.
@code4AI 5 months ago
Smile. The formulation "it will not appear as much ..." is hopeful, but do we have any validated data on this?! Why should MoE be immune, and if "maybe", to what degree?
@gileneusz 5 months ago
@@code4AI I have no idea, all this fine-tuning stuff is just purely experimental. Which is good, we are still learning.
@proterotype 5 months ago
I wonder if combining Fine Tuning with RAG would solve this
@code4AI 5 months ago
No. We need our fine-tuned LLMs inside our active agents when complex external info (RAx) is returned to the LLM.
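To make that concrete, a minimal hedged sketch of a single agent step where retrieved context is rendered into the prompt format the model was fine-tuned on; retrieve and generate are placeholders for your own RAG layer and fine-tuned LLM:

```python
def answer_with_retrieval(question, retrieve, generate):
    """One agent step: format retrieved chunks into the fine-tuning prompt
    template, then let the fine-tuned model answer.

    retrieve: callable(question) -> list of text chunks (your RAG / RAx layer)
    generate: callable(prompt)   -> completion from your fine-tuned LLM
    """
    chunks = retrieve(question)
    context = "\n\n".join(f"[doc {i + 1}]\n{c}" for i, c in enumerate(chunks))
    prompt = (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```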
@proterotype 5 months ago
@@xspydazx Interesting stuff! I understand you may not have personally trained an LLM on embeddings, but have you done the type of workflow in your first, longer comment? If so, how well have you seen it work? That is, how accurate are the results of the method you outline in that first, longer comment?
@gileneusz 5 months ago
25:38 Pre-training is too expensive, but if you split your knowledge across many AI models, you can train smaller models, and it would be much cheaper...
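As a rough illustration of that idea (not something from the video), a toy router that sends a question to one of several small specialist models by embedding similarity; every name here is a placeholder:

```python
import numpy as np

def route_to_specialist(question, specialists, embed):
    """Dispatch a question to one of several small domain-tuned models.

    specialists: dict mapping a short domain description to a callable
                 (prompt -> completion).
    embed: callable mapping a list of strings to an (n, d) numpy array.
    """
    domains = list(specialists.keys())
    vecs = embed(domains + [question])
    dom_vecs, q_vec = vecs[:-1], vecs[-1]
    # cosine similarity between the question and each domain description
    sims = dom_vecs @ q_vec / (
        np.linalg.norm(dom_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    best = domains[int(np.argmax(sims))]
    return specialists[best](question)
```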
@code4AI 5 months ago
Look at the performance of Snowflake Arctic 128x3.66B. Any questions left?
@gileneusz 5 months ago
@@code4AI That's the opposite end of the spectrum. Snowflake suffers because the 3.66B experts there are just undertrained.