Deep Dive into LLM Evaluation with Weights & Biases

17,127 views

DeepLearningAI

9 months ago

In the dynamic world of Large Language Models (LLMs), we've unlocked the power to build smart systems from our data. Just like any other piece of automation software, it's essential we take the time to assess these LLM systems. In this webinar, we're going to dive into how we can effectively evaluate these systems, with a particular focus on Retrieval Augmented Generation (RAG) systems.
We'll start by discussing the 'eye-balling' technique and why Weights & Biases Prompts stands out as the first great tool in this area. Then, we'll move on to supervised evaluation, highlighting why it's worth considering and pointing out some limitations.
To wrap things up, we'll look at how LLMs can be used to evaluate themselves - from generating their own evaluation datasets, to using standard metrics like SQuAD or BLEU, and even evaluating retrieval systems.
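As a rough illustration of the SQuAD-style metrics mentioned above, here is a minimal sketch of exact match and token-level F1 for comparing a generated answer against a reference answer. It is not taken from the webinar: the function names `normalize`, `exact_match`, and `token_f1` are hypothetical, and the normalization (lowercasing, dropping punctuation and English articles) only approximates what the official SQuAD evaluation script does.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Assumption: two empty answers count as a match, otherwise score 0.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(token_f1("in Paris France", "Paris"))
```

Exact match is strict and suits short factual answers; token F1 gives partial credit when a RAG system returns the right answer wrapped in extra words.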
On top of all that, we'll also touch on how W&B Sweeps, an excellent tool for hyperparameter optimization, can be utilized to find the ideal balance to maximize accuracy and minimize costs. The session will end with a Q&A with the presenters.
This workshop is based on the foundational learnings of DeepLearning.AI’s course on Evaluating & Debugging Generative AI, built in collaboration with the Weights & Biases team. Everything covered in the workshop is presented as continued education from the course.
Event Agenda
40-minute Workshop
10-minute Q&A: Answering questions from the audience.
About the Speakers:
Morgan McGuire - Growth Director at Weights & Biases
Morgan leads the Growth ML team and is an ML Engineer at Weights & Biases. He has a background in NLP and previously worked at Facebook on the Safety team, where he helped classify and flag potentially high-severity content for removal.
Ayush Thakur - Machine Learning Engineer at Weights & Biases
Ayush is an MLE at Weights & Biases and a Google Developer Expert in Machine Learning (TensorFlow). He is interested in everything computer vision and representation learning. For the past 7 months he has been working with LLMs and has covered RLHF and the how and what of building LLM-based systems.
Carey Phelps - Founding Product Manager at Weights & Biases
Carey is the founding Product Manager at Weights & Biases. She studied computer science at Stanford and went on to found Carta Healthcare before joining Weights & Biases.

Comments: 6
@philipegger4599 9 months ago
It seems that on slide 32, "Human Eval" and "User Testing" have inexplicably switched places.
@ayushthakur736 9 months ago
Thanks for catching. Corrected.
@420_gunna 1 month ago
I get that content marketing is always going to be thinly veiled product marketing, but this was a little too on the nose, and not enough meat otherwise
@Gus-AI-World 9 months ago
This is a disaster of a presentation. At 10 minutes Ayush starts a business presentation. Then nothing can be seen because of the small font. I do not know who your target audience is for this. Then suddenly Ayush jumps to a somehow "it works" wandb screen for LLMs.