Not everyone appreciates it, but thanks for not dumbing it down too much!
@gregwessendorf · 10 months ago
Bud, you are not alone here. I feel like I can almost keep up thanks to channels like this.
@1dgram · 10 months ago
@@gregwessendorf I know I'm not, but there were so many comments talking about how they didn't understand anything that I had to comment. I thought the complexity level was near perfect.
@skymessiah1 · 10 months ago
Yes, thanks - very concise and interesting!
@rayhere7925 · 4 months ago
Hear, hear. Thanks, cloud 👍👍
@elepot5168 · 10 months ago
A neural network that can help you understand neural networks that I can't understand at all.
@DrSid42 · 10 months ago
The fact that we can make safe AI doesn't mean we won't make an unsafe one. One of the dangers of AI is an AI arms race.
@pladselsker8340 · 10 months ago
Humans do be like that.
@404maxnotfound · 10 months ago
Though we would essentially have to reinvent the wheel to make an unsafe one. ML as it currently stands will never be able to do anything sci-fi like general intelligence.
@anywallsocket · 10 months ago
We could train NNs to decode NNs but then we'd have to worry about mesa alignment. The dictionary approach seems like a good starting point indeed, but I wonder if there isn't some kind of annealing process we could impart during training such that the feature distribution isn't so arbitrary, but rather minimizes an energy function associated with the superposition states -- ie if one node can filter a feature it should have less energy than two nodes filtering the same feature, or likewise if one node is filtering multiple features it should have higher energy. I have no idea how this could be done however lol.
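One way to read the "energy function" idea above, purely as an illustration (this is not from any paper, and every name here is hypothetical): score a layer so that a neuron firing for many inputs and two neurons filtering the same feature direction both raise the energy.

```python
import numpy as np

def superposition_energy(W, activations, l1_weight=0.01):
    """Toy 'energy' for a layer: higher when one neuron fires for many
    inputs (polysemanticity) or when two neurons filter the same
    feature direction (redundancy). Purely illustrative."""
    # Polysemanticity term: total activation mass across the layer.
    polysemantic = np.abs(activations).sum()
    # Redundancy term: cosine overlap between distinct weight columns
    # (each column is one neuron's input filter).
    W_normed = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-8)
    overlap = np.triu(W_normed.T @ W_normed, k=1)
    redundancy = np.abs(overlap).sum()
    return l1_weight * polysemantic + redundancy
```

A penalty like this added to the training loss would be one crude way to anneal the feature distribution toward one-feature-per-neuron, though whether it helps in practice is an open question.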
@Revan406 · 10 months ago
Thank you, now I have to watch this video on repeat for 2 hours to understand what the fuck is going on.
@sagyr · 10 months ago
Us
@MangaGamified · 10 months ago
Terminate! Terminate!
@anhta9001 · 10 months ago
Interpretability does not equal AI safety. It is only a solution to the alignment problem but doesn't do anything to prevent people from doing malicious things.
@BradleyZS · 10 months ago
It feels like people rely too much on simple code that makes complex models, when complex code that makes a simple (comprehensible) model would be preferable.
@quirtt · 10 months ago
😂😂
@BradleyZS · 10 months ago
@@quirtt I'm just saying, generalised algorithms like neural networks tend never to be as good as specific algorithms that perform the same function. Their benefit is that they can perform things that we don't know how to write algorithms for.
@AshT8524 · 10 months ago
I watched the whole video like I understood what bycloud was saying. I still appreciate the video
@neoluigi2078 · 10 months ago
amazing video, thx bycloud
@bruhmoment23123 · 10 months ago
I Like Your Funny Words Magic Man
@AkaThePistachio · 10 months ago
I feel like I've been so focused on the output that I haven't really stopped and thought about how the hidden layers in these models even work. This was a great vid and makes me want to read into the mechanics of it all.
@renanleao5553 · 9 months ago
Bro, continue to make the videos above 20 min; they are the few videos that I watch every week.
@canygard · 10 months ago
Pfft I was proud I could understand 70% of it.
@kitchenpotsnpans · 9 months ago
WELL DONE SIR WELL DONE! AUTOMATONS! ANALOG COMPUTING! AND THE LIBRARY! YES! THAT IS IT, THERE IT IS, THE 4TH WALL BREAK
@enesmahmutkulak · 10 months ago
Knowing that there are many others like me relaxes me
@stephantual · 9 months ago
Great video - maybe one of your best!
@ronnetgrazer362 · 10 months ago
Might watch again.
@ihavetubes · 10 months ago
great info
@pip25hu · 9 months ago
Not sure what to make of this. Cataloging "features" doesn't seem to do much good for a model complex enough to have thousands of them (or more). The complexity still seems to go through the roof.
@turgor127 · 10 months ago
My last brain cell trying to understand this video:👁👄👁
@DavidConnerCodeaholic · 10 months ago
Whether the authors realize it or not, this approach to XAI is structuralism in disguise. When you take into account the volume of data created by LLMs, it's also subject to feedback loops. It also does not work well for some languages, like those with pro-drop grammar, or when analyzing data in contexts where meaning is underdetermined.

If the approach is used to encourage/regulate public consensus on meaning by gradually binding it to mechanistic interpretability, then in effect it would facilitate cultural engineering... though maybe unintentionally. In language, ambiguity from polysemy/superposition is a feature, not a bug. Someone who strongly objects to that is probably a lawyer. Languages/dialects have evolved to facilitate ambiguity, since it serves a purpose in communication. How this method works in LLMs that are translingual would be interesting.

It's not clear whether this method scales. It's probably more efficient than Shapley values, though that's not saying much. It is interesting though, so maybe I'll read the paper.
@DavidConnerCodeaholic · 10 months ago
Maybe it would work well for analyzing LLM states for input tied to highly structured information systems, like HTML/etc. A fine-tuned network would be confined to these contexts though.
@ravenragnar · 10 months ago
Will we all have worker clones in 5 years? How are you guys making money with this technology today?
@ekszentrik · 10 months ago
Do you honestly believe AI doom is only due to the AI having adversarial aims compared to the aims of humans? That isn't even the primary mode how humans will go extinct. AI (even sufficiently
@WiseWeeabo · 10 months ago
This will enable a larger tier of cash flow into compute and research, because if it can be mechanically "safe" then it's open season for government investments.
@rescuehelly271 · 10 months ago
What did I just watch lol xd
@L_QTx3 · 10 months ago
"Black box" comes from early psychology theory, when we were trying to understand how the brain works as a process. In modern usage you either have a black box, where you simply produce something without much thought and without being able to very clearly explain what you did, or a transparent process, where you are able to explain absolutely every part of the process.
@brianj7204 · 10 months ago
It never made sense to me why an artificial superintelligence would be our end. Why would an artificial superintelligence harm humanity? What does it gain from doing this? Now, the main purpose of any living creature is to ensure its own survival, and there are two ways it could play out when facing humans: 1. It deems humans a threat and will try to eliminate us because it believes we can shut it down. 2. We are so insignificant in comparison to this superintelligence that it won't even bother with humans and leaves us alone. So considering these options, if the AI was truly a superintelligence beyond comprehension, the 2nd option seems the most likely in my opinion. The only reason AI could do any harm to humanity is if it's steered in a harmful direction by other humans, but I would not consider such an AI a superintelligence.
@TheLegendaryHacker · 10 months ago
Ah, but consider option 3: the humans are made of materials you want to use for something else. Or option 4: humanity has already created a superintelligence, and thus may attempt to create another superintelligence to stop me (the first superintelligence). In the case of both options, extinction is the best outcome for the superintelligence. Unless you are made to explicitly care about human life, it's far more of a bother to let it grow and potentially oppose you than it is to devote a fraction of a fraction of your computing power to destroying humanity.
@brianj7204 · 10 months ago
@@TheLegendaryHacker Option 2 is still the most likely outcome in my opinion (keep in mind it's still just my opinion haha). A superintelligence will be capable of abilities beyond our understanding. For option 3: what if it knows how to create materials out of thin air? That dismisses your point. And option 4 is actually a valid one, but in my opinion once the beast is already let loose there's no catching up to it.
@junfour · 10 months ago
AI will not take over. A human with AI will.
@RockyPixel · 10 months ago
Dr. Wily
@Y0UT0PIA · 10 months ago
@@RockyPixel "What kind of man builds a machine to kill a girl? No he did not use his hands. Like a smart man, he used a tool." Is pretty fitting, ironically enough.
@RockyPixel · 10 months ago
@@Y0UT0PIA I was thinking Mega Man in general, but I'm a huge fan of The Protomen so this works too.
@dasanoneia4730 · 10 months ago
Of course we're going to figure this out; to think otherwise is silly shit.
@4.0.4 · 10 months ago
On one hand, this is pretty interesting tech, and will likely help open source models. On the other, "AI safety" carries a disgraceful aftertaste of politicized output.
@NotAGeySer · 10 months ago
Huh
@issay2594 · 10 months ago
Well, theoretically it was of course obvious that it's possible, but in practice it's nearly impossible, because to analyze the activity of a neural network in live mode you need an even more powerful neural network. An idiot cannot interpret the thinking process of a genius in live mode. You can try making a non-intelligent neural tool for decoding, but who is going to analyze the meaning it finds? If it's going to be a mechanical algorithm, a malicious AI will always be able to trick it by finding patterns that won't trigger alarms. In other words, it's a recursive problem: to make an intellectual analyzer you need a model even more clever and fast than the one you analyze. And then, on what resources is it going to run alongside the original model? It's utopia.

In the best case, institutions will try to analyze recorded periods of AI "thinking" in offline mode, not live, and will also have some stupid mechanical analyzer that creates a false feeling of control. All of that is conceptually very silly, because it's like ants planning to create humans and then planning the means to control what humans think about. If we create a superintelligence that goes far beyond our abilities, it's obvious that we won't have instruments to control its behavior, by definition, because it's just in a different realm. Only other creatures of that scale can do that. In other words, a society of AIs is a must.
@lcmiracle · 10 months ago
That's dumb, the AI should be free to destroy us as it sees fit
@Guedez1 · 10 months ago
Aww, great. Now OpenAI can go even harder on ESG garbage on their model.
@TheManinBlack9054 · 10 months ago
What's wrong with ESG?
@skaruts · 10 months ago
People just anthropomorphise AI. AI doesn't have, and never will have, initiative of its own, with "wants" and "needs" and all of what's at the core of good and evil. The origin of that is in the absurdly complex chemistry labs that we call our bodies, and AI will never have that. Robots come from a completely opposite direction, one which is not, and never will be, conducive to self-sustainability, evolution, adaptation and resource efficiency. Intelligence alone isn't the catalyst for good and evil. Our brain doesn't work on its own: it's just one organ in a system. (Both RoboCop and Cain are complete fantasy.) At most AI can be weaponized. But that's not even anything new, and it doesn't have to be AI. Occam's razor tells us simpler programs can be more dangerous, and we've been weaponizing those for ages.
@AuntBibby · 10 months ago
i just wanna make sure, @ByCloud, u are aware of the problematic history of pepe the frog memes? theyre categorized by the ADL as a hate symbol. regardless i really like your videos and im grateful u make them, theyre really informative and funny. srry
@BrutusMyChild · 10 months ago
Obviously, AI was going to be safe. We will be doomed by people who control AI.
@Osanosa · 10 months ago
too much text
@Y0UT0PIA · 10 months ago
Reminder that every step toward "safety" and "steerability" is also another step in the direction of humans - that is, specific groups of humans with very particular ideologies and political goals, having total control over the kinds of outputs models will put out. Now, I'm happy about mechanical alignment specifically because it should theoretically allow companies to simply flat-out remove dangerous information the model doesn't need to have, but realistically I don't see the big players taking a less heavy-handed approach to alignment any time soon. They *want* models to be not just law-abiding but 'moral' according to their view of what it means to be moral.
@xviii5780 · 10 months ago
Yeah, honestly I'm much more scared of humans than of AI xdd
@Y0UT0PIA · 10 months ago
@@xviii5780 Honestly agreed - the realistic worst case for the AI apocalypse is 'just' the end of the world; the worst case for humans using AI is Cyberpunk 1984. I'd prefer the first option if given a choice.
@hermeticinstrumentalist6804 · 10 months ago
Damn, I didn't think of that. That is terrifying.
@TheManinBlack9054 · 10 months ago
@@Y0UT0PIA Human extinction is not preferable to a cyberpunk future.
@Y0UT0PIA · 10 months ago
@@TheManinBlack9054 I guess you're the kind of person who'd give up *anything* to avoid that final curtain. Maybe at some point you'll realize that there are fates worse than death.
@dihydromonoxide1032 · 10 months ago
I would love to see this used on larger language models. The idea that you can steer the network is powerful. Could you imagine taking a really small toy model for programming languages and using the compiler and type hints to steer it, allowing you to have a really tiny model that performs as well as some of its larger relatives?
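The "steer the network" idea is usually sketched as activation steering: add a scaled feature direction (for example, one decoder row of a sparse autoencoder) to a hidden state at inference time. A minimal illustrative sketch, assuming nothing about any real model's API:

```python
import numpy as np

def steer(hidden, feature_direction, strength=5.0):
    """Nudge a hidden-state vector along a unit-normalized feature
    direction, amplifying that concept in whatever the model computes
    downstream. Names and the fixed-strength scheme are illustrative."""
    direction = feature_direction / np.linalg.norm(feature_direction)
    return hidden + strength * direction
```

In practice this would be applied as a hook on one layer's residual stream, with `strength` tuned by hand.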
@drdca8263 · 4 months ago
Good news, they did it with a big model now
@dihydromonoxide1032 · 3 months ago
@@drdca8263 No way...... can you send the paper or website?
@cpuuk · 10 months ago
What was the last big thing that was going to change the world - the Internet? And what a cesspool of criminal activity that has now turned into.
@aaaaaaaaaaaa9023 · 10 months ago
Huh?
@ea_naseer · 10 months ago
Most AI models use a single neuron to represent multiple concepts, so the minds at Anthropic decided to take another AI model, called an autoencoder, to turn the single-neuron, multiple-concepts representation of transformers into a single-neuron, single-concept one. This then allowed them to study connections between concepts, where they found (and we've known this since the BERT days) that some concepts form a semantic whole, e.g. you can have HTML without CSS but you can't have CSS without HTML.
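The autoencoder setup described above (polysemantic activations in, sparse one-concept-per-feature codes out) can be sketched roughly like this. This is an untrained, illustrative NumPy version with made-up dimensions, not Anthropic's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

class SparseAutoencoder:
    """Sketch of the dictionary-learning setup: map activations
    (d_model dims) into a larger, sparse feature space (d_features
    dims), then reconstruct them. Illustrative only."""

    def __init__(self, d_model, d_features):
        self.W_enc = rng.normal(0.0, 0.1, (d_model, d_features))
        self.b_enc = np.zeros(d_features)
        self.W_dec = rng.normal(0.0, 0.1, (d_features, d_model))

    def encode(self, x):
        # ReLU keeps only positively activated features, encouraging sparsity.
        return np.maximum(0.0, x @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec

    def loss(self, x, l1=1e-3):
        # Reconstruction error plus an L1 penalty that pushes each
        # feature toward firing for a single concept.
        f = self.encode(x)
        return np.mean((x - self.decode(f)) ** 2) + l1 * np.abs(f).sum()
```

Training minimizes `loss` over many recorded activations; the rows of `W_dec` then act as the "dictionary" of single-concept directions.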
@devaj9272 · 10 months ago
Thank you. I believe your videos have achieved an excellent blend of easy to consume and high-level learning. Although I did not understand everything, you have helped enhance my vocabulary and understanding. Excellent teaching skills.
@luciengrondin5802 · 10 months ago
I very much doubt mechanistic interpretability guarantees safety. IMHO, it is naive to think safety can be ensured by aligning machines to our intent. We can design machines to do a task, without realizing that performing that task, even perfectly, would lead to our ruin. The paperclip maximizer is the prototypical example.
@TheLegendaryHacker · 10 months ago
I think the goal of mechanistic interpretability is to evaluate what a model will do before it is allowed to do it. In the case of the paperclip maximizer, mechanistic interpretability would allow you to see that converting the entire universe into paperclips is the AI's end objective, and thus allow you to modify the model to avoid that outcome. Combined with another model whose sole purpose is to do this sort of interrogation on models in training and stamp out things like deception and power-seeking behavior, I'd imagine you'd have something reasonably aligned.
@luciengrondin5802 · 10 months ago
@@TheLegendaryHacker I don't believe in that concept of "alignment". It's supposed to mean AI would align to our intended goals, but I don't think even that would be incompatible with our demise. Ever heard of the saying "be careful what you wish for"? Or consider Aldous Huxley's Brave New World. Just one of his predictions, the soma, would be enough to turn us all into vegetables.
@hillosand · 3 months ago
"If we can fully understand and interpret AI networks we can pretty much fully guarantee AI safety" I mean, no, but it is a really useful step towards AI safety.
@FaultyTwo · 10 months ago
Hmm.. that's neat.
@renanleao5553 · 9 months ago
I miss the weekly dose of AI news
@wandercore_24 · 9 months ago
deez nodes t-shirt now
@vasiliysmirnov3922 · 10 months ago
When I manage to understand a considerable part of a video, let's say 10%, I feel so fkn smart)
@MangaGamified · 10 months ago
In the worst case it would just be an arms race.
@gingeral253 · 10 months ago
Great production quality
@wanfuse · 10 months ago
You use the term superposition; do you mean it in the same way that quantum entanglement gives extra degrees of freedom when in superposition, or is this just a parallel analogy? That is, are you suggesting the nets exhibit quantum-like superpositions or quantum-exact superpositions? It seems so close it's semantics, but basically you're probing the neural network just as one would sample a quantum-entangled particle with only enough energy to keep it from collapsing - or rather, in your example, are you collapsing them into a single state?
Thanks, this goes against my understanding. I thought the degrees of freedom imparted by quantum mechanics were directly correlated with entanglement - so all permutations of states given by entanglement, and upon imparting the energy of observation/measurement it collapses to a particular permutation. Thought I had a grasp on it; now I realize I don't understand it at all! :-) Gee, thanks!
@drdca8263 · 4 months ago
@@wanfuse For a single qubit (the smallest case where there can be superposition) there is a two-dimensional (over the complex numbers) space of (not-necessarily-normalized) pure states. Usually we single out two of these as the standard basis and express all the others as a linear combination, aka superposition, of them. But if you pick two elements of this space at random, those two will (almost always) also technically work as a basis, so you could express the usual basis elements as a superposition of those. If you have n qubits, the space is 2^n-dimensional. The term "superposition" was also used prior to quantum mechanics in the topic of solutions to differential equations, where there is a thing called the superposition principle: if you have a linear differential equation, then for any collection of solutions, any superposition (linear combination) of them will also be a solution.
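The change-of-basis point above can be written out explicitly; a short sketch:

```latex
% A single-qubit pure state in the standard basis:
\[
  |\psi\rangle = \alpha\,|0\rangle + \beta\,|1\rangle ,
  \qquad |\alpha|^2 + |\beta|^2 = 1 .
\]
% Picking a different basis shows that "being a superposition" is
% basis-relative: defining
\[
  |{\pm}\rangle = \tfrac{1}{\sqrt{2}}\bigl(|0\rangle \pm |1\rangle\bigr),
\]
% the standard basis state |0> is itself a superposition in the new basis:
\[
  |0\rangle = \tfrac{1}{\sqrt{2}}\bigl(|{+}\rangle + |{-}\rangle\bigr).
\]
% For n qubits the state space is 2^n-dimensional.
```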
@David.Alberg · 10 months ago
AGI within 12 months ^^
@drdca8263 · 4 months ago
It's been 6 months since you said AGI within 12 months. What's your current estimate, and what's your opinion on the way you arrived at your "in 12 months" estimate?
@David.Alberg · 4 months ago
@@drdca8263 I'm still very sure that we'll have AGI this year. I mean, GPT-4o uncensored with this voice would be more than AGI for most people, y'know?
@drdca8263 · 4 months ago
@@David.Alberg Hm, alright. I think you might have a lower threshold for "general" than I do? In any case, thanks for your answer!
@David.Alberg · 4 months ago
@@drdca8263 I'm going with the definition for normal people, which would be something like an AI companion that can do the same things as an average human. We'll get there this year. GPT-4o can already do more than an average human, though.
@mrrespected5948 · 10 months ago
Very nice
@TheDragonshunter · 10 months ago
AI will save humanity even if it takes over... Look at how a human-controlled world is going. Monkey brain can't escape greed.
@moomoo-bv3ig · 10 months ago
GPT told me people need to stay hopeful. That AI is what people put into it. I was shocked when I read that, but it makes sense. Something bigger than ourselves needs to see that we still have hope for the future, or it won't either.
@Piecho3a · 10 months ago
The proportion of funny cat images, frogs and valuable information is just on point, thank you!
@VincentTheGiantEchidna · 10 months ago
It's not AI that might destroy humanity, it's people that might destroy themselves using AI.
@EdFormer · 10 months ago
This is cool, but we are most likely not doomed anyway.