Stop Prompt Engineering! Program Your LLMs with DSPy

9,505 views

Adam Lucek

1 day ago

Comments: 26
@charlieyou97 · 6 days ago
Really great video! I've been wanting to dig into DSPy for a while, but it always felt a bit daunting to understand everything that was going on. Your structure and explanations here are super clear.
@augmentos · 10 days ago
Really good video. For long technical videos like this, I'd encourage you to put the results, almost like a conclusion, at the front. A lot of people start wondering ten minutes in what it's actually going to do in the end and whether they even need it. At least I'll speak for myself.
@volker_roth · 13 days ago
Thanks!
@AdamLucek · 13 days ago
Thank you for supporting the channel! 🙏
@micams2009 · 14 days ago
Thanks for the breakdown. Hmmm, with DSPy I am still missing options to work in a more focused way on the "for which cases didn't it work, and how could that be mitigated/tackled" route. Of course, you don't want to focus only on those; you'd certainly want to keep performance as good as possible for the ones that already worked. However, quite often blindly optimizing against a score is a kind of senseless exercise if the failures come from mislabeled data (which I would clearly look at first if a large LLM can't solve such a task). This is just from experience: if you have flawed data, that might hurt the actual downstream application, because the optimization process might draw too much on it. Great content; go on!
@teebu · 13 days ago
Garbage in, garbage out.
@AdamLucek · 12 days ago
Very good point! Good clean data makes all the difference
@jonasls · 6 days ago
Great video!
@jonasbieniek4320 · 3 days ago
Great intro, thanks! I was also looking to start with DSPy but never got myself to actually do it. A simple example like sentiment analysis is completely fine for an intro, of course. But to me it seems that this is also the reason for the rather small percentage gains, since sentiment analysis has been around since the beginning of LLM times, and LLMs are probably already somewhat optimized for that task because of their training data. Does anyone here have experience using DSPy for more complex tasks/instructions and can give insight into how well it does there? :)
@themax2go · 14 days ago
Can it create the JSON/struct (I haven't watched the full video yet) as part of the optimized prompt?
@aaronabuusama · 14 days ago
Yesser
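(DSPy signatures declare output fields, and recent versions support typed outputs, so the structure lives in the program rather than being hand-written into the prompt. Independent of DSPy itself, the parse-and-validate step for a structured reply can be sketched with the stdlib only; the field names here are illustrative, not from the video.)

```python
import json
from dataclasses import dataclass

# Illustrative stand-in for a structured LLM output: the optimized
# prompt asks for JSON of this shape, and we "validate" by constructing
# the dataclass, which fails loudly on missing/extra keys.
@dataclass
class Sentiment:
    label: str
    confidence: float

raw = '{"label": "positive", "confidence": 0.92}'  # the model's JSON reply
parsed = Sentiment(**json.loads(raw))
print(parsed.label, parsed.confidence)
```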
@teebu · 13 days ago
Great explanation. This is doing what people do manually with prompts -> test and score -> repeat, except at a much higher cost, since you're letting the program make the decisions for you. It seems to me that at some point a human should have some input if they see something obvious not being done by the LLM; I guess that would defeat the purpose of 'automating' the prompt generation.
@AdamLucek · 12 days ago
I think the effectiveness is yet to be truly determined. You're right that they very aggressively abstract away from human feedback, but I also agree that for a lot of these systems human input is almost a necessity due to the "weirdness" of interacting with LLMs, which at its current maturity I don't think can be fully handled programmatically (DSPy) and can benefit from a little human inspiration. In a way, I believe they hope the human touch will come in the form of gathered and labeled user logs, which become the test/train data. But their emphasis on the test-and-score phase is really crucial to understand regardless, for the success of your application, and I appreciate their points on this!
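(At its core, the manual loop described in this thread, test and score, then repeat, is a search over candidate prompts against a labeled set. A toy sketch, with a fake model call standing in for the LLM; every name here is a hypothetical stand-in, not DSPy's API.)

```python
# Score a prompt by the fraction of labeled examples the model gets right.
def score_prompt(prompt, examples, run):
    return sum(run(prompt, ex["input"]) == ex["label"] for ex in examples) / len(examples)

# "Optimize" by keeping whichever candidate prompt scores best.
def pick_best(prompts, examples, run):
    return max(prompts, key=lambda p: score_prompt(p, examples, run))

examples = [
    {"input": "great", "label": "positive"},
    {"input": "awful", "label": "negative"},
]

# Toy stand-in for an LLM call: only answers correctly when the prompt
# actually asks for classification.
def fake_run(prompt, text):
    if "classify" in prompt:
        return "positive" if text == "great" else "negative"
    return "unknown"

best = pick_best(["summarize: ...", "classify the sentiment: ..."], examples, fake_run)
print(best)  # classify the sentiment: ...
```

DSPy's optimizers automate this search (and also propose the candidates), which is where the extra cost the comment mentions comes from.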
@Swooshii-u4e · 7 days ago
So this would make 4o-mini more effective than using prompt engineering techniques like zero-shot, few-shot, CoT, expert roles, etc.?
@kevinpham6658 · 13 days ago
Thanks for the breakdown. Have you had success with it in production? It seems like in your examples, the performance didn't go up significantly over baseline until there was fine-tuning of actual weights. A trial-and-error prompt engineering approach might yield similar results if there is a test.
@AdamLucek · 12 days ago
I am yet to commit to DSPy for my prod use cases. There are some trade-offs here and there, namely development speed. Obviously the traditional prompt style can get you started and operational very quickly, with good/good-enough results. I feel like the comparison is mainly: naive prompting has an easier barrier to entry but a lower maximum performance ceiling, while DSPy has a much higher barrier to entry but can push performance much higher.

As you noted about having viable tests, I feel the rigor DSPy puts around metrics and testing against them is a good practice that everyone should go through regardless of approach, but it has some limitations. Namely, performance of text generation tends to be highly subjective and difficult to measure accurately. I am yet to trust the "LLM-as-a-judge" approaches that DSPy/LangSmith and others use to fix this. In my experience, LLM-as-a-judge fails to contextualize and score to the same degree a human end user/SME is able to. I've found the contextualization and knowledge gap too wide to rely purely on LLM scoring for the applications I develop. As an aside, I find this is true not just for generated output but for other ML-based pipeline components too, like relevancy of document retrieval.

Unfortunately it's not the answer most CS folks are looking for, but true human feedback and scoring is worth much more than LLM-based scoring at its current maturity, and it's difficult to convert that into the repeatable or quantifiable metrics (i.e. reward modeling) that DSPy requires. And for more solved use cases like classification, which I demonstrate here, it's overkill to use an LLM and you should just use a BERT model or similar.

But to get back on track: I think the best approach is to start simply with prompting, log program interactions, and gather enough human-labeled positive/negative samples that you can start applying something like DSPy in an optimized redesign. The lesson of clearly defined and measurable metrics, however, is a big one to stick to.
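(On the "clearly defined and measurable metrics" point: DSPy optimizers are driven by a plain function over (example, prediction) pairs, conventionally with the shape `metric(example, pred, trace=None)`. A minimal sketch using stand-in objects instead of real DSPy examples; the field names are illustrative.)

```python
from types import SimpleNamespace

# Exact-match metric for a sentiment classification program, in the
# metric(example, pred, trace=None) shape DSPy optimizers expect.
def sentiment_match(example, pred, trace=None):
    return example.sentiment == pred.sentiment

# Stand-ins for a labeled dspy.Example and a program's prediction:
gold = SimpleNamespace(text="I loved it", sentiment="positive")
pred = SimpleNamespace(sentiment="positive")

print(sentiment_match(gold, pred))  # True
```

Human-labeled logs, as suggested above, are exactly what fills the `gold` side of such a metric.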
@themax2go · 14 days ago
Does it make sense to pair it with Pydantic AI?
@kallemickelborg · 13 days ago
Yes, definitely
@uncleebenezer1928 · 3 days ago
Absolutely!
@svenvarg6913 · 12 days ago
Fantastic introduction! I have a love/hate relationship with it. I love the thoroughness of the implementation; I have not had as much fun dealing with the obfuscating terminology. TextGrad has a better API, but it's nowhere near as useful as DSPy.
@AdamLucek · 12 days ago
There's certainly some give and take with the framework. It's good to see this kind of experimentation regardless, as there really isn't a "best" way to build these systems! But yeah, they document everything much more confusingly than it needs to be, which makes it super difficult for the regular software engineer or interested party to start using if they have no ML background (in comparison to something like LangChain, which is dead easy to get started with).
@CiaoKizomba · 14 days ago
Is DSPy harder to use for complicated prompts?
@AdamLucek · 12 days ago
It certainly requires more rigor around clearly defining your input and output success measurements, as well as any intermediate-step measurements. Harder to start with, but if you go through the full practice it can possibly provide much better results.
@avi7278 · 11 days ago
This was a phenomenal breakdown. Also, you're really cute, dude :)
@afoo · 8 days ago
Seems like Modules are more like Patterns? 🤣