
How to evaluate an LLM-powered RAG application automatically.

  17,001 views

Underfitted


Source code of this example:
github.com/svp...
Giskard library: github.com/Gis...
I teach a live, interactive program that'll help you build production-ready machine learning systems from the ground up. Check it out here:
www.ml.school
To keep up with the content I create:
• Twitter/X: / svpino
• LinkedIn: / svpino

Comments: 55
@aleksandarboshevski 4 months ago
Hey Santiago! Just wanted to drop a comment to say that you're absolutely killing it as an instructor. Your way of breaking down the code and the whole process into simple, understandable language is pure gold, making it accessible for newcomers like me. Wishing you all the success and hoping you keep blessing the community with your valuable content! Aside from the teaching, have you tried creating a micro-SaaS based on these technologies? It seems to me you're halfway there, and it could be a great opportunity to expand your business.
@underfitted 4 months ago
Thanks for taking the time to let me know! I have not created any micro-SaaS applications, but you are right; that could be a great idea.
@TooyAshy-100 4 months ago
THANK YOU! I greatly appreciate the release of the new videos. The clarity of the explanations and the logical sequence of the content are exceptional.
@AmbrishYadav 1 month ago
Thanks! Exactly what I was looking for. I've been racking my brain over how on earth to test a RAG system: how is the business going to give me 1,000+ questions to test, and how can a human verify the responses? Top content.
@mohammed333suliman 4 months ago
This is my first time watching your videos. It is great. Thank you.
@dikshantgupta5539 2 months ago
Oh man, the way you explained these complex topics is mind-blowing. I just wanted to say thank you for making videos like this.
@TheScott10012 4 months ago
FYI, keep an eye on the mic volume levels! It sounds like it was clipping.
@underfitted 4 months ago
Thanks. You are right. Will adjust.
@peterhjvaneijk1670 1 month ago
Love the video. Great breakdown. I'd like to see more detail on the evaluation results (e.g., it is now 0.73; is that good?), on how tweaking the pipeline changes the eval results, and on Ragas versus Giskard.
@user-hh9do9fn1o 2 months ago
Hello Santiago, your explanation was thorough and I understood it really well. Now I have a question: is there any tool other than Giskard for evaluating my LLM or RAG model that is open source and does not require an OpenAI API key? Thank you in advance 😊
@TPH310 4 months ago
We appreciate your work a lot, my man.
@tee_iam78 1 month ago
Superb video. Great content from start to finish. Thank you.
@liuyan8066 4 months ago
Glad to see you brought pytest in at the end; it's like a surprise dessert 🍰 after a great meal.
@maxnietzsche4843 2 months ago
Damn, you explained each step really well! Love it!
@alextiger548 3 months ago
Super important topic you covered here, man!
@horyekhunley 3 months ago
Great stuff! What are your preferred open-source alternatives to all the tools used in this tutorial?
@arifkarim768 2 months ago
Explained amazingly.
@proterotype 4 months ago
This is so well done
@CliveFernandesNZ 4 months ago
Great stuff, Santiago! You've used Giskard to create the test cases, and those test cases are themselves created using an LLM. In a real application, would we have to manually vet the test cases to ensure they are 100% accurate?
@theacesystem 4 months ago
Just awesome instruction, Santiago. I am a beginner, but you make learning digestible and clear! Sorry if this is an ignorant question, but is it possible to substitute FAISS, Postgres, MongoDB, Chroma DB, or another free, open-source option for Pinecone to save money, and if so, which would you recommend for ease of implementation with LangChain?
@underfitted 4 months ago
Yes, you can! Any of them will work fine. FAISS is very popular.
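For reference, a minimal sketch (not from the video) of what that swap looks like in LangChain with a local FAISS index instead of Pinecone. It assumes the langchain-community, langchain-openai, and faiss-cpu packages are installed and an OpenAI API key is set; the sample documents are made up for illustration.

```python
# Minimal sketch: replacing a hosted vector database (Pinecone) with a local FAISS index.
# Assumes langchain-community, langchain-openai, and faiss-cpu are installed and
# OPENAI_API_KEY is set; any embeddings model works the same way.
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(page_content="The ml.school program teaches production-ready ML systems."),
    Document(page_content="The cohorts are live and interactive."),
]

embeddings = OpenAIEmbeddings()

# Build the index locally instead of calling a hosted vector database.
vectorstore = FAISS.from_documents(docs, embeddings)

# The retriever interface is the same, so the rest of the chain stays unchanged.
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
print(retriever.invoke("What does the program teach?"))
```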
@MohammadEskandari-do6xy 4 months ago
Amazing! Can you also explain how to do the same type of evaluation on Vision Language Models that use images?
@ergun_kocak 4 months ago
This is gold ❤
@maxisqt 4 months ago
So one thing you learn training ML models is that you don't evaluate your model on training data, and you have to be careful about data leakage. Here, you're providing Giskard your embedded documentation, which means Giskard is likely using its own RAG system to generate test cases, which you then use to evaluate your own RAG system. Can you please explain how this isn't nonsense? Do you evaluate the accuracy of the Giskard test cases beyond the superficial "looks good to me" method that you claim to be replacing? What metrics do you evaluate Giskard's test cases against, since its answers are also subjective? You're just entrusting that subjective evaluation to another LLM.
@maxisqt 4 months ago
Perhaps the purpose of testing in software development is different from ML testing. In software engineering, you're ensuring that changes made to a system don't break existing functionality; in ML, you test on data your model hasn't trained on to prove it generalises to unseen, novel samples, because that's how it will have to perform in deployment. Maybe the tests you're doing here fit into the software engineering bucket, and therefore LLMs may be perfectly capable of auto-generating test cases; and since we aren't trying to test how well the generated material "generalises" (which doesn't make sense in this context), that's okay… I'm a little confused.
@maxisqt 4 months ago
I'm new to gen AI (background in ML some years back); apologies if I come off hostile or jaded.
@mikaelhuss5080 4 months ago
@@maxisqt I think these are good questions, actually. Maybe the way to think about RAG, at least in this scenario, is that it is really a type of information retrieval, and there is no need to generalise, as you say; we just want to be able to find relevant information in a predefined set of documents.
@u4tiwasdead 4 months ago
The way that frameworks like Giskard try to solve the problem of evaluating LLMs/RAG with LLMs that are not necessarily better than the ones being evaluated is through the way the test sets are generated. To give one example, the framework might ask an LLM to generate a question-and-answer pair, then ask it to rephrase the question to make it harder to understand without changing its meaning or what the answer will be. It will then ask the LLM under test the harder version of the question and compare the result to the original answer. This can work even though their LLM is not necessarily more powerful than yours, because rephrasing an easy question into a hard one is an easier problem than interpreting the hard question. (A good analogy might be that a person can create puzzles that are hard to solve for much smarter people than themselves by starting from the solution and then creating the question.) Note that the test data does not need to be perfect; it just needs to be generally better than the outputs we will get from our models/pipelines. The point of these tools is not to evaluate whether the outputs we are getting are actually true, but simply whether they improve when we make changes to the pipeline.
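To make the "complicate the question" idea above concrete, here is a toy sketch. It is not Giskard's actual implementation; the prompts and the ask helper are illustrative, and it assumes the OpenAI Python client with an API key set.

```python
# Toy illustration of the idea described above (not Giskard's actual code):
# generate an easy question/answer pair from a document chunk, then rephrase
# the question into a harder variant that keeps the same reference answer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


chunk = "The ml.school program runs live cohorts focused on production ML systems."

# Step 1: the easy question and its reference answer come straight from the chunk.
qa_pair = ask(f"Write one question and its answer based only on this text:\n{chunk}")

# Step 2: complicate the question without changing what the correct answer is.
hard_question = ask(
    "Rephrase the question below so it is harder to understand, "
    f"but keeps exactly the same answer:\n{qa_pair}"
)
print(hard_question)
```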
@trejohnson7677 4 months ago
Ouroboros
@sridharm4254 3 months ago
Very useful video. Thank you.
@aliassim8774 4 months ago
Hey Santiago, thank you for this course, in which you explained all the concepts of RAG evaluation in a very clear way. However, I have a question about the reference answers. How were they generated, and based on what (is it an LLM)? If so, say we have a question that needs specific information that exists only in the knowledge base; how can another LLM generate such an answer? And how do we know that the reference answers are correct and are what we are looking for? Thank you in advance.
@tee_iam78 1 month ago
Thanks!
@JonathanLoscalzo 2 months ago
I think all the "AI experts" in the wild just "explain" common concepts of AI/LLM systems. It would be nice to understand other aspects a bit more, like evaluation (a good choice here). It would be interesting to have some relevant courses on that; I know it is the secret sauce, but it could be useful. BTW, are you teaching causal ML in your course?
@underfitted 2 months ago
I'm not teaching causal ML, no. The program focuses on ML engineering.
@JonathanLoscalzo 2 months ago
@@underfitted I want to do it, but I don't have time. I hope there will be more cohorts in the near future.
@francescofisica4691 2 months ago
How can I use Hugging Face LLMs to generate the test set?
@theacesystem 4 months ago
That's great. You rock!!!
@sabujghosh8474 4 months ago
It's awesome; we need more walkthroughs of how open-source models work.
@PratheekBabu 26 days ago
Thanks for the amazing content. Can we use Giskard without an OpenAI key?
@dhrroovv 2 months ago
Do we need a paid subscription to the OpenAI APIs to be able to use Giskard?
@utkarshgaikwad2476 4 months ago
Is it OK to use generative AI to test generative AI? What about the accuracy of Giskard? I'm not sure about this.
@underfitted 4 months ago
The accuracy is only as good as the model they use (which is GPT-4). And yes, this is how you can test the output of a model.
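As a rough sketch of how the pytest piece shown in the video can gate this kind of evaluation: run the RAG pipeline over a generated test set and fail the build when an answer no longer matches its reference. The answer_question and judge_matches helpers below are hypothetical placeholders for your own RAG chain and LLM judge (not Giskard APIs), and the test-set file format is assumed.

```python
# Hedged sketch: run the RAG pipeline over a generated test set inside pytest.
# `answer_question` (your RAG chain) and `judge_matches` (an LLM-as-judge
# comparison) are hypothetical helpers you would implement yourself;
# "testset.json" is assumed to hold a list of
# {"question": ..., "reference_answer": ...} objects.
import json

import pytest

from rag_app import answer_question, judge_matches  # hypothetical module

with open("testset.json") as f:
    TESTSET = json.load(f)


@pytest.mark.parametrize("case", TESTSET)
def test_answer_matches_reference(case):
    answer = answer_question(case["question"])
    assert judge_matches(answer, case["reference_answer"])
```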
@not_amanullah 4 months ago
Thanks ❤
@gauravpratapsingh8840 2 months ago
Hey, can you make a video that uses an open-source LLM to build a Q&A chatbot for a website page?
@fintech1378 4 months ago
How new is this Giskard?
@sbacon92 17 days ago
What happens when you take away OpenAI and a module? Can you build this with a local model and your own code?
@kloklojul 1 month ago
You are using an LLM to create a question, an LLM to get another answer, and then an LLM to evaluate both answers; but how do you evaluate the output of the initial tests? At this point you are trusting the facts of one LLM by trusting the answers of another LLM.
@caesarHQ 4 months ago
Hi, excellent tutorial; wouldn't anticipate any less. I ran your notebook with an open-source LLM; however, generating the test set with giskard.rag calls the OpenAI API (timestamp 19:11). Any workaround?
@underfitted 4 months ago
Giskard will always use GPT-4, regardless of the model you use in your RAG app.
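For context, this is roughly what the generation step looks like in Giskard's RAG toolkit as I recall it; the names and parameters may differ between versions, so check the docs for your installed release. The GPT-4 calls happen inside generate_testset, which is why an OpenAI key is needed even when the RAG app itself uses another model. The example documents and agent description are made up.

```python
# Rough sketch from memory of Giskard's RAG test-set generation; verify the
# exact API against the Giskard docs for your version.
import pandas as pd
from giskard.rag import KnowledgeBase, generate_testset

documents = pd.DataFrame(
    {
        "text": [
            "The ml.school program teaches production-ready machine learning systems.",
            "Cohorts are live and interactive.",
        ]
    }
)

knowledge_base = KnowledgeBase(documents)

# This step calls OpenAI (GPT-4) under the hood, regardless of the model your app uses.
testset = generate_testset(
    knowledge_base,
    num_questions=10,
    agent_description="A chatbot answering questions about the ml.school program",
)
testset.save("test-set.jsonl")
```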
@StoryWorld_Quiz 4 months ago
How does the GPT instance that generates the questions and the answers know the validity of those answers? If they are actually accurate, why would you build the RAG in the first place, when you could create a GPT instance that is accurate enough (using one simple prompt: 18:33, the agent description)? I don't understand; can someone explain, please? Do you see the paradox here?
@mehmetbakideniz 4 months ago
Because GPT-4 is quite expensive, you wouldn't want to use it in production if 3.5 or any other open-source model does the job correctly. This library uses GPT-4 as the best LLM for producing the RAG answers; that is why they use it as the test reference, to see whether a cheaper or free open-source model is more or less okay for your specific application.
@pratheekbabu272 26 days ago
Can you do this using Gemini Pro?
@JTMoustache 4 months ago
Langchain sucks