Good review! My issue with these frameworks is that their limitations become more apparent with larger, more complex use cases: the boilerplate code ends up as technical debt that needs circumventing, and the next iteration of the GPT or Mistral models intrinsically solves some of the limitations that the previous models couldn't handle.
@andrew.derevo 4 months ago
What happens under the hood of DSPy, and how many tokens does it use during the fine-tuning process? Thanks.
@ryanscott642 8 months ago
Any way you can make the font bigger? :)
@vbywrde 9 months ago
Yes, this was useful. Thank you. But after watching the video I'm still not sure whether DSPy lives up to the hype. What I would want to see is a series of benchmark tests of "Before Teleprompter" and "After Teleprompter" results on a defined set of tasks that cover a range of concerns, such as math questions, reasoning questions, and code generation questions, categorized into Easy, Medium, and Difficult. This should be done with a series of models, starting with, of course, GPT-4, but including Claude, Groq, Gemini, and a set of Hugging Face open-source models such as DeepseekCoder-Instruct-33B, laser-dolphin-mixtral-2x7b, etc. I would want to see this done with the variety of DSPy compile options, such as OneShot, FewShot, using ReAct, and using LLM as Judge, etc., where the concerns are appropriate. In other words, a formal set of tests and benchmarks based on the various, but not infinite, configuration options for each set of concerns. This would give us much better information and be truly valuable to those who are embarking on their DSPy journey. Right now, it is very unclear whether compiling with a teleprompter actually provides more accurate results, and under what circumstances (configurations). I have seen more than one demo, and in some cases the teleprompter actually produced worse results, and the comment was "well, sometimes it works better than others". My proposed information set, laid out in a coherent format, would be tremendously useful to the community and would go a long way towards answering the question you posed: does DSPy live up to the hype? Because we don't have this information, the jury is still out, and your video poses the right question but doesn't quite answer it, tbh. The benchmarking tests I am proposing would. That, along with a thoughtful discussion of the discoveries, would be tremendously useful. That said, I did learn a few useful things here, so thanks again!
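To make the proposed before/after comparison concrete, here is a minimal sketch of what such a harness might look like. Everything in it is hypothetical: `accuracy_by_group`, `compare`, and the task dictionary keys are illustrative names, and `baseline` and `compiled` stand in for any two callables that map a question string to an answer string (for example, an uncompiled and a compiled program). It does not call DSPy itself.

```python
from collections import defaultdict


def accuracy_by_group(program, tasks, metric):
    """Return accuracy per (category, difficulty) bucket for one program."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, seen]
    for task in tasks:
        key = (task["category"], task["difficulty"])
        prediction = program(task["question"])
        totals[key][0] += bool(metric(task["answer"], prediction))
        totals[key][1] += 1
    return {key: correct / seen for key, (correct, seen) in totals.items()}


def compare(baseline, compiled, tasks, metric):
    """Run both programs on the same tasks and report the change per bucket."""
    before = accuracy_by_group(baseline, tasks, metric)
    after = accuracy_by_group(compiled, tasks, metric)
    return {
        key: {"before": before[key], "after": after[key],
              "delta": after[key] - before[key]}
        for key in before
    }


def exact(gold, predicted):
    """Simplest possible metric: exact string equality."""
    return gold == predicted
```

Running `compare` once per model and per compile option would fill in the grid described above; a negative `delta` in any bucket would flag the cases where compilation made things worse.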
@Geekraver 9 months ago
In my experience the challenge is the metric function. Examples always seem to use exact match; getting a qualitative metric working is non-trivial. Try using DSPy to optimize prompting for summarizing video transcripts, for example; you'll probably spend more time trying to get the metric working than you would have spent just coming up with a decent prompt. You also need a metric function that discriminates between prompts of different quality, which is also not as trivial as it might seem.
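As a rough illustration of the gap between exact match and a graded metric, here is a minimal sketch using token-overlap F1. It follows the `(example, pred, trace=None)` calling convention that DSPy metrics generally use, but the `summary` field name, the `threshold` parameter, and the choice of token F1 are assumptions made for this example, and token overlap is only a crude proxy for summary quality.

```python
from collections import Counter
from types import SimpleNamespace


def token_f1(reference: str, candidate: str) -> float:
    """Token-level F1 between two strings (lowercased, whitespace-split)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def summary_metric(example, pred, trace=None, threshold=0.5):
    """Graded score for evaluation; pass/fail when a trace is supplied."""
    score = token_f1(example.summary, pred.summary)
    if trace is not None:
        # Assumed convention: return a boolean while an optimizer is
        # collecting demonstrations, a float otherwise.
        return score >= threshold
    return score


# Stand-ins for an example and a prediction, so the sketch runs without DSPy.
gold = SimpleNamespace(summary="the speaker compares three prompt optimizers")
guess = SimpleNamespace(summary="the speaker compares prompt optimizers")
score = summary_metric(gold, guess)
```

A metric like this at least produces different scores for different outputs, which exact match cannot do for free-form text, but whether it separates good prompts from bad ones on real transcripts is exactly the open problem the comment describes.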
@mysticaltech 9 months ago
Bigger text would have been awesome for the notebook. Thanks for the info.
@malikrumi1206 4 months ago
21 open tabs! I'm *not* the only one! 🙃
@bluebabboon 9 months ago
OK, I did not understand one thing. Why did you include ANSWERNOTFOUND in the context? This seems to defeat the whole purpose of getting a correct answer. How would I know whether the context is relevant to the question before the question is asked? Is it not similar to data leakage? The true test would be to just remove ANSWERNOTFOUND from the context, because we don't know what question might be asked, or we could even create negative examples like we do in word2vec and use them to train the ANSWERNOTFOUND case. Let me know if that makes sense.
@scienceineverydaylife3596 9 months ago
You can think of this as similar to supplying labeled data for training ML models. In this example you are training a prompt for extracting answers in a particular format (ANSWERNOTFOUND is the label when no answer can be extracted from the other parts of the context).
@bluebabboon 9 months ago
@@scienceineverydaylife3596 So in the test set I am assuming there won't be ANSWERNOTFOUND in the context, right?
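One way to picture the "label, not input" reading from the reply above is the sketch below. It is an assumption about how such data could be laid out, not a description of what the video's notebook does; the toy examples, the field names, and the helpers `model_inputs` and `is_correct` are all hypothetical.

```python
NO_ANSWER = "ANSWERNOTFOUND"  # sentinel label, used only as a target output

# Toy labeled examples: the second is a negative example whose context
# does not contain the answer to the question.
examples = [
    {"context": "The meeting starts at 10 am in room 4.",
     "question": "When does the meeting start?",
     "answer": "10 am"},
    {"context": "The meeting starts at 10 am in room 4.",
     "question": "Who is chairing the meeting?",
     "answer": NO_ANSWER},
]


def model_inputs(example):
    """What the model sees at test time: the label is withheld."""
    return {"context": example["context"], "question": example["question"]}


def is_correct(example, predicted_answer):
    """Compare a prediction with the gold label, including the sentinel."""
    return predicted_answer.strip().lower() == example["answer"].strip().lower()
```

Under this layout the sentinel never appears in the inputs, so predicting it for the second question is something the model has to get right rather than something it can read off.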
@HarmonySolo 7 months ago
I would rather use prompt engineering than DSPy. The beauty of LLMs is that they generate content and code from natural language; now DSPy asks people to use a programming language again. There is also a steep learning curve for DSPy.