35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

AXRP

How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AIs to perform tasks so hard that we can't figure out whether they succeeded at them? In this episode, I chat with Peter Hase about his research into these questions.
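To make the first question concrete, here is a minimal sketch of one naive way to "read off" a model's stance on a statement: compare the probability the model assigns to "True" versus "False" as the next token after a question prompt. The model choice, prompt format, and the premise that next-token probabilities reflect beliefs are all illustrative assumptions for this sketch, not Peter's actual method; his papers linked below discuss more careful approaches.

# A minimal sketch, assuming a Hugging Face causal LM; not the episode's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def belief_score(statement: str) -> float:
    """Return P(" True") - P(" False") for the token following the prompt."""
    prompt = (
        "Question: Is the following statement true or false?\n"
        f"Statement: {statement}\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    true_id = tokenizer.encode(" True")[0]
    false_id = tokenizer.encode(" False")[0]
    return (probs[true_id] - probs[false_id]).item()

print(belief_score("Paris is the capital of France."))

Whether a score like this deserves the name "belief" at all, rather than a surface statistic, is exactly the kind of question the episode takes up.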
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrp...
The transcript: axrp.net/episo...
Topics we discuss, and timestamps:
0:00:36 - NLP and interpretability
0:10:20 - Interpretability lessons
0:32:22 - Belief interpretability
1:00:12 - Localizing and editing models' beliefs
1:19:18 - Beliefs beyond language models
1:27:21 - Easy-to-hard generalization
1:47:16 - What do easy-to-hard results tell us?
1:57:33 - Easy-to-hard vs weak-to-strong
2:03:50 - Different notions of hardness
2:13:01 - Easy-to-hard vs weak-to-strong, round 2
2:15:39 - Following Peter's work
Peter on Twitter: x.com/peterbhase
Peter's papers:
Foundational Challenges in Assuring Alignment and Safety of Large Language Models: arxiv.org/abs/...
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: arxiv.org/abs/...
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: arxiv.org/abs/...
Are Language Models Rational? The Case of Coherence Norms and Belief Revision: arxiv.org/abs/...
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: arxiv.org/abs/...
Other links:
Toy Models of Superposition: transformer-ci...
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): arxiv.org/abs/...
Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/...
Of nonlinearity and commutativity in BERT: arxiv.org/abs/...
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: arxiv.org/abs/...
Editing a classifier by rewriting its prediction rules: arxiv.org/abs/...
Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper): arxiv.org/abs/...
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: arxiv.org/abs/...
Concrete problems in AI safety: arxiv.org/abs/...
Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: arxiv.org/abs/...
