LLMs & AI Benchmarks! - GenAI Eval Deep Dive

  1,318 views

Adam Lucek

1 day ago

LLM app evaluations are a crucial step that is often missing when testing deployed GenAI apps; many teams rely on user feedback after deployment, a risky move given the non-deterministic nature of these models. So how do you run standardized evaluations and benchmarks on your AI workflow? Over the past week I've been putting together resources to answer this question, culminating in my latest video: a deep dive into LLM evaluations using LangSmith.
In this deep dive, I cover many different built-in and custom evaluators, including but not limited to:
1. Classification Task Model Comparison with a custom eval function
2. Using a more powerful LLM to evaluate and score output based on criteria like objectivity, helpfulness, conciseness, and custom criteria
3. Running Summary Evaluations on existing experiments
4. Pairwise evaluations with an LLM Judge scoring preference
5. Integrated Unit Tests for assertions and fuzzy matching
6. Attaching custom evaluators to my Llama 3 Research Agent for hallucination and document retrieval quality evaluation
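To make the first item concrete, here is a minimal sketch of a custom exact-match evaluator used to compare two classifiers. It is plain Python and does not call LangSmith; the example data, the two stand-in "models", and every function name are hypothetical and are not taken from the video or its repo. The returned dict of a metric name plus a score is an assumed result shape chosen for illustration.

```python
from typing import Callable

# Hypothetical labeled examples: each has an input text and a reference label.
EXAMPLES = [
    {"text": "I loved this movie", "label": "positive"},
    {"text": "Terrible, would not recommend", "label": "negative"},
    {"text": "Great value for the price", "label": "positive"},
]


def exact_match_evaluator(prediction: str, reference: str) -> dict:
    """Score 1 if the normalized prediction equals the reference label, else 0."""
    score = int(prediction.strip().lower() == reference.strip().lower())
    return {"key": "exact_match", "score": score}


def run_experiment(classify: Callable[[str], str], examples: list[dict]) -> float:
    """Apply a classifier to every example and return mean exact-match accuracy."""
    results = [
        exact_match_evaluator(classify(ex["text"]), ex["label"]) for ex in examples
    ]
    return sum(r["score"] for r in results) / len(results)


# Two stand-in "models" to compare; a real run would call an LLM here instead.
def keyword_model(text: str) -> str:
    return "negative" if "terrible" in text.lower() else "positive"


def always_positive_model(text: str) -> str:
    # Messy casing and whitespace on purpose: the evaluator normalizes both.
    return "Positive "


if __name__ == "__main__":
    print("keyword_model accuracy:", run_experiment(keyword_model, EXAMPLES))
    print("always_positive accuracy:", run_experiment(always_positive_model, EXAMPLES))
```

Normalizing case and whitespace before comparing keeps the metric from penalizing formatting differences, which matters when the predictions come from an LLM rather than a fixed label set.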
My Code Available Here: github.com/ALucek/evals-deepdive
Resources:
LangChain LangSmith Tutorials: LangSmith Evaluations
The original description of this video got erased in a YouTube bug, so if there's any reference that I say should be linked but isn't, please just comment and I'll send it out!
Chapters:
00:00:00 - Intro: How to Evaluate LLM Applications
00:01:08 - LangSmith Introduction
00:01:41 - Evaluations Overview
00:02:27 - Evaluation Agenda
00:03:42 - LangSmith Datasets
00:05:03 - Setting Up a LangSmith Dataset From HuggingFace
00:07:24 - First Eval, Custom Classification Step
00:08:20 - First Eval, Custom Evaluator for Classification
00:12:16 - First Eval, Custom Classification Eval Results
00:16:08 - Second Eval, LLM-As-Judge Q&A Evaluation Setup
00:18:42 - Second Eval, LLM-As-Judge Q&A App Setup
00:20:13 - Overview, LangChainStringEvaluator Built-in Evals Framework
00:21:26 - Second Eval, Defining Chain of Thought Q&A Evaluator
00:22:05 - Second Eval, CoT Q&A Eval Results
00:26:20 - Third Eval, Built-in LLM-as-Judge Criteria for Helpfulness
00:26:42 - Third Eval, Built-in LLM-as-Judge Results for Helpfulness
00:28:13 - Fourth Eval, LLM-as-Judge Custom Criteria, with and without Ground Truth
00:34:08 - Fourth Eval, LLM-as-Judge Custom Criteria Results
00:39:50 - Fifth Eval, Summary Evaluations over Existing Experiments Setup
00:42:28 - Fifth Eval, Summary Evaluations Results
00:43:40 - Sixth Eval, Pairwise Evaluations using LLMs Setup
00:46:03 - Sixth Eval, Pairwise Evaluations Results
00:47:04 - Seventh Eval, Unit Testing & CI Testing on Existing Applications w/Pytest
00:51:47 - Seventh Eval, Unit Testing & CI Testing Results
00:53:57 - Eighth Eval, Attaching Evaluations to Existing Application Runs Setup
01:02:42 - Eighth Eval, Attaching Evals to Existing Applications Results
01:04:53 - Outro
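
The seventh eval in the chapter list covers unit tests with assertions and fuzzy matching. The following is a minimal sketch of that idea using only the standard library's `difflib`; it is not the integration shown in the video. The `answer_question` function is a hypothetical stand-in for the application under test, and the 0.6 similarity threshold is an arbitrary assumption that would need tuning for a real app.

```python
from difflib import SequenceMatcher

# Arbitrary threshold chosen for illustration; tune it for a real application.
SIMILARITY_THRESHOLD = 0.6


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def answer_question(question: str) -> str:
    """Hypothetical stand-in for the LLM app under test."""
    return "The capital of France is Paris."


def test_answer_contains_expected_fact():
    # Hard assertion: the key fact must appear somewhere in the output.
    output = answer_question("What is the capital of France?")
    assert "paris" in output.lower()


def test_answer_is_close_to_reference():
    # Fuzzy assertion: wording may differ, but the output should stay near the reference.
    output = answer_question("What is the capital of France?")
    reference = "Paris is the capital of France."
    assert similarity(output, reference) >= SIMILARITY_THRESHOLD
```

Functions named `test_*` are picked up by pytest when the file is run with it, so checks like these can sit alongside ordinary unit tests in CI.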

Comments: 3
@ragibshahriyear3682 1 month ago
What a gem of content! I needed something like this! You deserve wayyyy more subs! Thank you so much!
@saitejagangapuram7412 2 months ago
Dude, your next video should compare multi-agent collaboration frameworks or patterns.
@AdamLucek 2 months ago
Multi agent collab video coming soon!