BLADE: Benchmarking Language Model Agents for Data-Driven Science

Ai2

Guest Speaker: Ken Gu
Abstract: Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision-making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents' analysis approaches.
Bio: I am a fourth-year PhD student in Computer Science at the University of Washington, where I am advised by Tim Althoff. My research focuses on the development and evaluation of AI agents that enhance data-driven science, with a particular emphasis on improving the quality and robustness of scientific analyses. Previously, I explored how AI assistants can empower data analysts to create more effective analyses through human-AI collaboration. I have also had the opportunity to spend two summers at Tableau Research and Microsoft Research working on these related threads.
