Ha!!! You have always told us to blend ‘new’ data with some ‘old’ data when doing SFT! Your intuition was spot on. You also always reminded us to mind the format of the new data, so that it matches the format of the original training data as closely as possible.
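For what it's worth, the blending can be as simple as interleaving a replay of 'old', already-known examples into the new SFT set. A minimal Python sketch; the 50/50 mixing ratio and the example format are placeholder assumptions, not the paper's exact recipe:

```python
import random

def blend_sft_data(new_examples, old_examples, old_fraction=0.5, seed=42):
    """Mix 'new-knowledge' SFT examples with replayed 'known' examples.

    new_examples / old_examples: lists of {"prompt": ..., "response": ...} dicts.
    old_fraction: share of the blended set drawn from already-known data
                  (a placeholder ratio to tune empirically).
    """
    rng = random.Random(seed)
    n_old = int(len(new_examples) * old_fraction / (1.0 - old_fraction))
    replay = rng.sample(old_examples, min(n_old, len(old_examples)))
    mixed = list(new_examples) + replay
    rng.shuffle(mixed)
    return mixed
```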
@borisguarisma8810 5 months ago
Wow! I need to watch this again while reading the paper...thank you!
@code4AI 5 months ago
Glad it was helpful!
@AdamBrusselback 5 months ago
This is interesting. When I've been doing SFT on new tasks, I originally had problems getting models to learn the output from just an input + output example. I noticed much, much better performance on the final task when I augmented the training data to include answering questions about the input format, pulling out specific data points from it, generating an intermediate representation, etc.
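Roughly the kind of augmentation I mean, as a minimal sketch; the prompt templates are illustrative placeholders, and the auxiliary answers (format description, extracted fields, intermediate representation) are something you supply per example:

```python
def augment_example(raw_input, target_output, format_desc, fields, intermediate):
    """Turn one input/output pair into several SFT examples: the original task
    plus auxiliary questions about the input itself."""
    return [
        {"prompt": f"Input:\n{raw_input}\n\nProduce the final output.",
         "response": target_output},
        {"prompt": f"Input:\n{raw_input}\n\nDescribe the format of this input.",
         "response": format_desc},
        {"prompt": f"Input:\n{raw_input}\n\nList the key fields and their values.",
         "response": fields},
        {"prompt": f"Input:\n{raw_input}\n\nGive an intermediate representation of the input.",
         "response": intermediate},
    ]
```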
@code4AI 5 months ago
Great observation. However, with closed LLMs, or so-called "open" ones without any transparency about what their pre-training dataset included, because global corporations are afraid of the legal implications of copyright violations ... we have no chance of really optimizing for a coherent fine-tuning dataset. Damn it.
@desmur36 5 months ago
If this holds, it implies we need to sequence our training data so that it scaffolds the model from known to new knowledge. Intuitively this makes sense. Most students learn through a process of building on known concepts that are easy to grasp, then expanding to more advanced topics using that base knowledge as a foundation. It also raises the question: what was the sequence in the pretraining dataset? Was that carefully curated? And how would you organize the internet from fundamental to advanced concepts? I think we got lucky with research papers, because they always follow this sequence of known to new knowledge.
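One hedged way to approximate that ordering: score each SFT example by how "known" it already is to the base model, then sort. A sketch; the is_known scorer is something you would have to supply, e.g. by sampling the base model and checking its answers against the reference:

```python
def order_by_familiarity(examples, is_known, known_first=True):
    """Sort SFT examples so the ones the base model already 'knows' come first.

    examples: list of {"prompt": ..., "response": ...} dicts.
    is_known: callable(example) -> score in [0, 1], e.g. the fraction of sampled
              base-model answers matching the reference (supplied by you).
    """
    scored = [(is_known(ex), ex) for ex in examples]
    scored.sort(key=lambda pair: pair[0], reverse=known_first)
    return [ex for _, ex in scored]
```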
@i_accept_all_cookies 5 months ago
I've been fine-tuning SLMs like TinyLlama, Phi 2, and Gemma 2b. This might explain some of the accuracy variance I've been seeing.
@milindgaharwar827 5 months ago
It seems generally reasonable that 'new data + conceptually related known data' should lead to fewer hallucinations when compared to only new data, or to new data + conceptually unrelated known data. It would probably not make a big difference IF there were a mechanism in the model architecture itself to find common patterns across different learned concepts. Please do share if you are aware of any research in that direction.
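One simple way to operationalise 'conceptually related known data' would be nearest-neighbour retrieval in embedding space. A rough sketch, assuming you already have some embed function that maps strings to vectors; nothing here comes from the paper:

```python
import numpy as np

def related_known_examples(new_texts, known_texts, embed, k=5):
    """For each new example, pick the k most similar known examples by cosine
    similarity. `embed` is any callable mapping a list of strings to an
    (n, d) numpy array, e.g. a sentence-embedding model you already use."""
    new_vecs = embed(new_texts)
    known_vecs = embed(known_texts)
    # normalise rows so the dot product becomes cosine similarity
    new_vecs = new_vecs / np.linalg.norm(new_vecs, axis=1, keepdims=True)
    known_vecs = known_vecs / np.linalg.norm(known_vecs, axis=1, keepdims=True)
    sims = new_vecs @ known_vecs.T          # shape (n_new, n_known)
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return [[known_texts[j] for j in row] for row in top_k]
```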
@zekehobbs7738 5 months ago
At what volume of new tokens does this break down? I.e., 1k, 10k, 100k, 1M, 10M, etc.
@gileneusz 5 months ago
23:15 ICL is compute-intensive... with longer prompts you will get slow prompt processing...
@code4AI 5 months ago
Not with parallel processing like Ring Attention.
@kishoretvk 5 months ago
So, pre-training of an LLM? Can we do it with a 7B or 8B model? Can we further fine-tune a pre-trained LLM and avoid this?
@code4AI 5 months ago
Some argue that fine-tuning is just continued pre-training, kind of. IF we have an open-source LLM where we know all the pre-training datasets, formats, and complexities... then we might have a chance to create an additional, coherent fine-tuning dataset. With closed LLMs, however... no chance.
@gileneusz 5 months ago
I think you missed the point that this paper is about dense models like Llama 3, which is trained on a huge number of tokens. This will not show up as much for models that are not as dense as Llama 3.
@code4AI 5 months ago
Smile. The formulation "it will not appear as much ..." is hopeful, but do we have any validated data on this?! Why should MoE be immune, and if "maybe", to what degree?
@gileneusz 5 months ago
@@code4AI I have no idea, all this fine-tuning stuff is just purely experimental. Which is good, we are still learning.
@proterotype 5 months ago
I wonder if combining Fine Tuning with RAG would solve this
@code4AI 5 months ago
No. We need our fine-tuned LLMs inside our active agents when complex external info (RAx) is returned to the LLM.
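To make that concrete, a minimal hedged sketch of a single agent step where retrieved context is rendered into the prompt format the model was fine-tuned on; retrieve and generate are placeholders for your own RAG layer and fine-tuned LLM:

```python
def answer_with_retrieval(question, retrieve, generate):
    """One agent step: format retrieved chunks into the fine-tuning prompt
    template, then let the fine-tuned model answer.

    retrieve: callable(question) -> list of text chunks (your RAG / RAx layer)
    generate: callable(prompt)   -> completion from your fine-tuned LLM
    """
    chunks = retrieve(question)
    context = "\n\n".join(f"[doc {i + 1}]\n{c}" for i, c in enumerate(chunks))
    prompt = (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```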
@proterotype 5 months ago
@@xspydazx Interesting stuff! I understand you may not have personally trained an LLM on embeddings, but have you done the type of workflow in your first, longer comment? If so, how well have you seen it work? That is, how accurate are the results of the method you outline in that first, longer comment?
@gileneusz 5 months ago
25:38 Pre-training is too expensive, but if you split your knowledge across many AI models, you can train smaller models, and it would be much cheaper...
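As a rough illustration of that idea (not something from the video), a toy router that sends a question to one of several small specialist models by embedding similarity; every name here is a placeholder:

```python
import numpy as np

def route_to_specialist(question, specialists, embed):
    """Dispatch a question to one of several small domain-tuned models.

    specialists: dict mapping a short domain description to a callable
                 (prompt -> completion).
    embed: callable mapping a list of strings to an (n, d) numpy array.
    """
    domains = list(specialists.keys())
    vecs = embed(domains + [question])
    dom_vecs, q_vec = vecs[:-1], vecs[-1]
    # cosine similarity between the question and each domain description
    sims = dom_vecs @ q_vec / (
        np.linalg.norm(dom_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    best = domains[int(np.argmax(sims))]
    return specialists[best](question)
```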
@code4AI 5 months ago
Look at the performance of Snowflake Arctic 128x3.66B. Any questions left?
@gileneusz 5 months ago
@@code4AI That's the opposite end of the spectrum. Snowflake suffers because the 3.66B experts there are just undertrained.