19 - Mechanistic Interpretability with Neel Nanda

AXRP

How good are we at understanding the internal computation of advanced machine learning models, and do we have any hope of getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.
Topics we discuss, and timestamps:
- 00:01:05 - What is mechanistic interpretability?
- 00:24:16 - Types of AI cognition
- 00:54:27 - Automating mechanistic interpretability
- 01:11:57 - Summarizing the papers
- 01:24:43 - 'A Mathematical Framework for Transformer Circuits'
- 01:39:31 - How attention works
- 01:49:26 - Composing attention heads
- 01:59:42 - Induction heads
- 02:11:05 - 'In-context Learning and Induction Heads'
- 02:12:55 - The multiplicity of induction heads
- 02:30:10 - Lines of evidence
- 02:38:47 - Evolution in loss-space
- 02:46:19 - Mysteries of in-context learning
- 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'
- 02:50:57 - How neural nets learn modular addition
- 03:11:37 - The suddenness of grokking
- 03:34:16 - Relation to other research
- 03:43:57 - Could mechanistic interpretability possibly work?
- 03:49:28 - Following Neel's research
The transcript: axrp.net/episo...
Links to Neel's things:
Neel on Twitter: / neelnanda5
Neel on the Alignment Forum: www.alignmentf...
Neel's mechanistic interpretability blog: www.neelnanda....
TransformerLens: github.com/nee...
Concrete Steps to Get Started in Transformer Mechanistic Interpretability: www.alignmentf...
Neel on YouTube: / @neelnanda2469
200 Concrete Open Problems in Mechanistic Interpretability: www.alignmentf...
Comprehensive mechanistic interpretability explainer: dynalist.io/d/...
Writings we discuss:
A Mathematical Framework for Transformer Circuits: transformer-ci...
In-context Learning and Induction Heads: transformer-ci...
Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/...
Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/...
interpreting GPT: the logit lens: www.lesswrong....
Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/...
Human-level play in the game of Diplomacy by combining language models with strategic reasoning: www.science.or...
Causal Scrubbing: www.alignmentf...
An Interpretability Illusion for BERT: arxiv.org/abs/...
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/...
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/...
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/...
Collaboration & Credit Principles: colah.github.i...
Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/...
Multi-Component Learning and S-Curves: www.alignmentf...
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/...
Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.ml...
