LLM app evaluations are a crucial step that's often missing when it comes to testing deployed GenAI apps, with many teams relying on user feedback after deployment alone, a risky move given how unpredictable these models can be. So how do you run standardized evaluations and benchmarks on your AI workflow? Over the past week I've been putting together resources to answer this question, culminating in my latest video: a deep dive into LLM evaluations using LangSmith.
In this deep dive, I cover many different built-in and custom evaluators (minimal code sketches for several of these follow the list), including but not limited to:
1. Classification Task Model Comparison with a custom eval function
2. Using a more powerful LLM to evaluate and score output based on criteria like objectivity, helpfulness, conciseness, and custom criteria
3. Running Summary Evaluations on existing experiments
4. Pairwise evaluations with an LLM Judge scoring preference
5. Integrated Unit Tests for assertions and fuzzy matching
6. Attaching custom evaluators to my Llama 3 Research Agent for hallucination and document retrieval quality evaluation
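To make item 1 concrete, here's a minimal sketch of a custom evaluator plugged into LangSmith's `evaluate` runner. The `classify` target, the dataset name, and the `label` output field are all hypothetical placeholders for whatever your app and dataset actually use:

```python
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

def classify(inputs: dict) -> dict:
    # Placeholder target: call your model or chain here and return its label.
    return {"label": "positive"}

def exact_match(run: Run, example: Example) -> dict:
    # Custom evaluator: compare the run's predicted label against the
    # dataset's reference label and return a named score.
    predicted = run.outputs["label"]
    expected = example.outputs["label"]
    return {"key": "exact_match", "score": int(predicted == expected)}

results = evaluate(
    classify,
    data="my-classification-dataset",  # hypothetical dataset name
    evaluators=[exact_match],
    experiment_prefix="classification-baseline",
)
```

Running this against the same dataset with a different target model is what makes the model comparison possible: each call logs a separate experiment you can line up in the LangSmith UI.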
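For item 2, a hedged sketch of the built-in LLM-as-judge route via LangSmith's `LangChainStringEvaluator` wrapper; `my_qa_app` and the dataset name are placeholders, and the graders call an LLM under the hood, so they need model credentials configured:

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

def my_qa_app(inputs: dict) -> dict:
    # Placeholder target: swap in your real Q&A chain or agent.
    return {"output": "stub answer"}

# Built-in criteria grader: an LLM scores output on a named criterion
# (objectivity, helpfulness, conciseness, or your own custom wording).
helpfulness = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "helpfulness"},
)

# Chain-of-thought Q&A grader that checks answers against ground truth.
cot_qa = LangChainStringEvaluator("cot_qa")

results = evaluate(
    my_qa_app,
    data="my-qa-dataset",  # hypothetical dataset name
    evaluators=[helpfulness, cot_qa],
)
```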
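Items 3 and 4 have their own hooks: `evaluate_existing` attaches evaluators (including summary evaluators) to an experiment that's already been run, and `evaluate_comparative` scores two existing experiments against each other. A rough sketch, with experiment names and the toy preference judge as stand-ins:

```python
from langsmith.evaluation import evaluate_comparative, evaluate_existing
from langsmith.schemas import Example, Run

def overall_accuracy(runs: list[Run], examples: list[Example]) -> dict:
    # Summary evaluator: one score over the whole experiment
    # rather than a score per dataset row.
    correct = sum(
        run.outputs.get("label") == example.outputs.get("label")
        for run, example in zip(runs, examples)
    )
    return {"key": "overall_accuracy", "score": correct / len(runs)}

def prefer_shorter(runs: list[Run], example: Example) -> dict:
    # Toy pairwise judge: prefer the more concise answer. In the video an
    # LLM judge scores preference; this placeholder stands in for that call.
    lengths = [len(str(run.outputs.get("output", ""))) for run in runs]
    winner = runs[lengths.index(min(lengths))]
    return {
        "key": "preference",
        "scores": {run.id: int(run.id == winner.id) for run in runs},
    }

# Summary evaluation attached to an experiment that already exists:
evaluate_existing(
    "classification-baseline-1234",  # hypothetical experiment name
    summary_evaluators=[overall_accuracy],
)

# Pairwise preference between two existing experiments:
evaluate_comparative(
    ["experiment-a", "experiment-b"],  # hypothetical experiment names
    evaluators=[prefer_shorter],
)
```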
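And for item 5, a sketch of the pytest integration, assuming the SDK's `@unit` decorator and `expect` helpers (current as of this recording); the `classify` stub and test bodies are hypothetical:

```python
from langsmith import expect, unit

def classify(inputs: dict) -> dict:
    # Placeholder target: call your real classifier here.
    return {"label": "positive"}

@unit
def test_classifier_returns_valid_label():
    # Hard assertion: the label must come from the allowed set.
    result = classify({"text": "I loved this movie"})
    assert result["label"] in {"positive", "negative"}

@unit
def test_positive_review_is_labeled_positive():
    # expect() logs the comparison to LangSmith rather than
    # only passing or failing locally.
    result = classify({"text": "Great acting, great script"})
    expect(result["label"]).to_contain("positive")
```

Run with `pytest` as usual; each decorated test is tracked as a LangSmith experiment, which is what enables the CI testing shown in the seventh eval.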
My code is available here: github.com/ALucek/evals-deepdive
Resources:
LangChain LangSmith Tutorials: LangSmith Evaluations
The original description of this video got erased by a YouTube bug, so if I reference something that should be linked but isn't, please comment and I'll send it out!
Chapters:
00:00:00 - Intro: How to Evaluate LLM Applications
00:01:08 - LangSmith Introduction
00:01:41 - Evaluations Overview
00:02:27 - Evaluation Agenda
00:03:42 - LangSmith Datasets
00:05:03 - Setting Up a LangSmith Dataset From HuggingFace
00:07:24 - First Eval, Custom Classification Step
00:08:20 - First Eval, Custom Evaluator for Classification
00:12:16 - First Eval, Custom Classification Eval Results
00:16:08 - Second Eval, LLM-as-Judge Q&A Evaluation Setup
00:18:42 - Second Eval, LLM-as-Judge Q&A App Setup
00:20:13 - Overview, LangChainStringEvaluator Built-in Evals Framework
00:21:26 - Second Eval, Defining Chain of Thought Q&A Evaluator
00:22:05 - Second Eval, CoT Q&A Eval Results
00:26:20 - Third Eval, Built-in LLM-as-Judge Criteria for Helpfulness
00:26:42 - Third Eval, Built-in LLM-as-Judge Results for Helpfulness
00:28:13 - Fourth Eval, LLM-as-Judge Custom Criteria, with and without Ground Truth
00:34:08 - Fourth Eval, LLM-as-Judge Custom Criteria Results
00:39:50 - Fifth Eval, Summary Evaluations over Existing Experiments Setup
00:42:28 - Fifth Eval, Summary Evaluations Results
00:43:40 - Sixth Eval, Pairwise Evaluations using LLMs Setup
00:46:03 - Sixth Eval, Pairwise Evaluations Results
00:47:04 - Seventh Eval, Unit Testing & CI Testing on Existing Applications w/Pytest
00:51:47 - Seventh Eval, Unit Testing & CI Testing Results
00:53:57 - Eighth Eval, Attaching Evaluations to Existing Application Runs Setup
01:02:42 - Eighth Eval, Attaching Evals to Existing Applications Results
01:04:53 - Outro