How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't tell whether it has succeeded? In this episode, I chat with Peter Hase about his research into these questions.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrp...
The transcript: axrp.net/episo...
Topics we discuss, and timestamps:
0:00:36 - NLP and interpretability
0:10:20 - Interpretability lessons
0:32:22 - Belief interpretability
1:00:12 - Localizing and editing models' beliefs
1:19:18 - Beliefs beyond language models
1:27:21 - Easy-to-hard generalization
1:47:16 - What do easy-to-hard results tell us?
1:57:33 - Easy-to-hard vs weak-to-strong
2:03:50 - Different notions of hardness
2:13:01 - Easy-to-hard vs weak-to-strong, round 2
2:15:39 - Following Peter's work
Peter on Twitter: x.com/peterbhase
Peter's papers:
Foundational Challenges in Assuring Alignment and Safety of Large Language Models: arxiv.org/abs/...
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: arxiv.org/abs/...
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: arxiv.org/abs/...
Are Language Models Rational? The Case of Coherence Norms and Belief Revision: arxiv.org/abs/...
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: arxiv.org/abs/...
Other links:
Toy Models of Superposition: transformer-ci...
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): arxiv.org/abs/...
Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/...
Of Non-Linearity and Commutativity in BERT: arxiv.org/abs/...
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: arxiv.org/abs/...
Editing a classifier by rewriting its prediction rules: arxiv.org/abs/...
Discovering Latent Knowledge in Language Models Without Supervision (aka the Collin Burns CCS paper): arxiv.org/abs/...
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: arxiv.org/abs/...
Concrete Problems in AI Safety: arxiv.org/abs/...
Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: arxiv.org/abs/...