Anthropic Solved Interpretability Again? (Walkthrough)

  2,096 views

The Inside View

Days ago

Comments: 9
@drhxa 6 months ago
That was an excellent walkthrough, thank you. I've learned a lot. Would love to see more walkthroughs of the prior/related work
@TheInsideView 6 months ago
Thanks! My walkthrough of the previous Anthropic paper (prior work): kzbin.info/www/bejne/fnLblWt6pL-UjZY For other interpretability papers I'd recommend checking out Neel Nanda's series of walkthroughs (he's actually leading a mechanistic interpretability team at DeepMind): kzbin.info/aero/PL7m7hLIqA0hpsJYYhlt1WbHHgdfRLM2eY&si=tLqxLua5XZEdbyCy
@christopherwoodall3464 6 months ago
Great overview. Really enjoyed the fact that you showed previous work that was built upon.
@TheInsideView 6 months ago
Thanks! To be honest I only briefly mentioned their previous work and don't think I actually went through previous work in the literature (was just doing a walkthrough of their blogpost, still doing daily uploads), but I'll definitely consider this preference to discuss previous work for future videos
@Alice_Fumo 6 months ago
This seems actually useful and has real-world applications. It seems this allows for actually adjusting the personality of the model, so one could make it more averse to writing code with bugs, more flirty, more honest or whatever. The big AI labs could adjust small details without needing to retrain the AI. Also, I guess this could be done with open source models to figure out their "deny response" features and set them to very low values. It can be done with retraining, but that also just changes the model. Not needing such brute-force-y methods is neat.
@TheInsideView 6 months ago
Yeah exactly, that enables you to steer them in the way that you'd prefer. If you haven't tried it yet I'd recommend checking out Golden Gate Claude (which I talk about in the video), available on claude.ai for a limited time, which basically gives a concrete example of what having a custom steered LLM would be like.
@Alice_Fumo 6 months ago
@TheInsideView I asked it to go one prompt without mentioning the bridge and tell me a bedtime story and it got extremely internally conflicted, retrying several times and wondering why it had such difficulty with this. It's extremely interesting to witness. Thanks for notifying me that they were hosting that model, I didn't know.
@ThomasMeliWellness 6 months ago
Crystal clear. Thank you for sharing this. Subscribed!
@TheInsideView 6 months ago
Thanks! Tomorrow's video will be another walkthrough so hopefully worth the sub