Take your personal data back with Incogni! Use code WELCHLABS and get 60% off an annual plan: incogni.com/welchlabs
@ckq 15 hours ago
15:00 The reason for polysemanticity is that an N-dimensional vector space has only O(N) exactly orthogonal vectors, but if you allow nearly orthogonal vectors (say, between 89 and 91 degrees), the count grows exponentially, to O(e^N). That's what allows the scaling laws to hold. There's an inherent conflict between an efficient model and an interpretable model.
@SteamPunkPhysics 10 hours ago
Superposition in this polysemantic context is a method of compression that, if we can learn more from it, might really change the way we deal with and compute information. While we thought quantum computers would yield something amazing for AI, maybe instead it's the advancement of AI that will tell us what we need to do to make quantum computing actually be implemented effectively (i.e., computation on highly compressed data that is "native" to the compression itself).
@mb2776 4 hours ago
Thank you, I also paused the video at that point. The capitalized "Almost Orthogonal Vectors" also caught my eye.
@xXMockapapellaXx 17 hours ago
That was such an intuitive way to show how the layers of a transformer work. Thank you!
@thorvaldspear 6 hours ago
I think of it like this: understanding the human brain is so difficult in large part because the resolution at which we can observe it is so coarse in both space and time. The best MRI scans have a resolution of maybe a millimeter per voxel, and I'd have to look up research papers to tell you how many millions of neurons that is. With AI, every neuron is right there in the computer's memory: individually addressable, ready to be analyzed with the best statistical and mathematical tools at our disposal. Mechanistic interpretability is almost trivial compared to neuroscience, and look at how much progress neuroscience has made despite those physical setbacks.
@atgctg 18 hours ago
More like "The Neuroscience of AI"
@punk3900 13 hours ago
I think that trying to understand from a human perspective how these systems work is completely pointless and runs against the basic assumptions. These models already capture something that a human being couldn't design algorithmically.
@alexloftus8892 5 hours ago
@punk3900 I'm a PhD student in mechanistic interpretability, and I disagree: a lot of structure has already been found. We've found structure in human brains, and that's another system that evolved without human intervention or any optimization for interpretability.
@punk3900 5 hours ago
@alexloftus8892 I mean, it's not that there's nothing you can find. There are surely lots of basic concepts you can uncover, but you can't disentangle the WHOLE structure of patterns, because its complexity keeps increasing. That's why you couldn't design such a system manually in the first place.
@roy04 14 hours ago
The videos on this channel are all masterpieces. Along with all the other great channels on this platform and independent blogs (including Colah's own blog), it feels like the golden age of accessible, high-quality education.
@thinkthing1984 16 hours ago
I love the space analogy of the telescope. Since the semantic volume of these LLMs has grown so gargantuan, it only makes sense to speak of astronomy rather than mere analysis! Great video. This is like scratching that spot at the back of your brain you usually can't reach.
@kingeternal_ap 14 hours ago
21:24 Oh damn, you just lobotomized the thing
@redyau_ 12 hours ago
That was gross and scary somehow, yeah
@kingeternal_ap 11 hours ago
That felt... Wrong.
@1.4142 5 hours ago
The LLM went to Ohio
@redyau_ 4 hours ago
@kingeternal_ap Although, when you think about it, all that happened is that "question" got a very high probability at that layer no matter what, and the normal weights of later layers didn't do enough to "overthrow" it. Nothing all that special.
@kingeternal_ap 2 hours ago
I guess, yeah. I know it's just matrices and math stuff, but the human capacity for pareidolia makes this sort of "result" somewhat frightening to me. Also, suppose there is a neuron that does a specific task in your noggin. Wouldn't hyperstimulating it do essentially the same thing?
@Eddie-th8ei 13 hours ago
An analogue to polysemanticity could be how, in languages, the same word is often used in different contexts to mean different things. Sometimes they're homonyms, sometimes they're spelled exactly the same, but when thinking of one specific meaning of a word, you're not thinking of its other definitions. For example: you can have a whole conversation with someone about ducking under an obstacle without either of you ever thinking about the bird with the same name 🦆. The word "duck" has several meanings here, and it can be used with one meaning without triggering its conceptualization as another.
@dinhero21 6 hours ago
In the AI case it's much more extreme: the toy 512-neuron model they used averaged 8 distinct features per neuron.
@siddharth-gandhi 19 hours ago
Oh god, a Welch Labs video on mech interp, Christmas came early! Will be stellar as usual, bravo! Edit: Fantastic as usual. I'd heard about SAEs in passing a lot but never took the time to understand them; now I'm crystal clear on the concepts. Thanks!
@VeganSemihCyprus33 17 hours ago
The Connections (2021) [short documentary] 🎉❤🎉
@VeganSemihCyprus33 17 hours ago
Dominion (2018)
@chyza2012 14 hours ago
It's a shame you didn't mention the experiment where they force-activated the Golden Gate Bridge features and it made Claude believe it was the bridge.
@personzorz 9 hours ago
Made it put down words like the words that would be put down by something that thought it was the bridge.
@bottlekruiser 8 hours ago
See, something that actually thinks it's the bridge *also* puts down words like the words that would be put down by something that thought it was the bridge.
@dinhero21 6 hours ago
It was more like increasing the chance of it saying anything related to the Golden Gate Bridge, rather than specifically making it believe it was the bridge.
@atimholt 4 hours ago
Reminds me of SCP-426, which appears to be a normal toaster but can only be talked about in the first person.
@fluffy_tail4365 15 hours ago
14:20 Welcome to neuroscience :D We suffer down here
@dreadgray78 12 hours ago
The more I watch these, the more I understand why the human brain is so hard to understand. And imagine how many layers the human brain has relative to an AI model. The example later in the video about specific cross-streets in SF is super interesting, and it shows why polysemanticity is probably necessary to hold the amount of information we actually know.
@hugoballroom5510 9 hours ago
With respect to recall: children remember curse words very well because of the emotion behind the utterance. AI has full retention but no emotional valence at all, because it only learns from text. Just a thought...
@Pokemon00158 17 hours ago
I think this is a design and engineering choice. If you choose to make your embedding space 2403 dimensions with no inherent purpose, it's like mixing 2403 ingredients at every step, 60 times over, and then being surprised you can't tell what tastes like what. I think you need to constrain it to many embeddings of smaller dimension, and get more control by regularizing them against each other with mutual information.
@dinhero21 6 hours ago
It needs to be big so the gradient optimizer has enough parameters to approximate the "real" function well.
@Pokemon00158 6 hours ago
@dinhero21 You can keep the same total size, just in separate parts: split the 2403 dimensions into chunks of 64, then control the mutual information between chunks so that different chunks learn different representations (roughly as in the sketch below). This is a hard problem too, since mutual-information comparisons are expensive, and I think the first generation of models went for the easiest, though perhaps less explainable, way of structuring themselves.
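A minimal sketch of that chunking idea, using squared cross-covariance between chunks as a cheap stand-in for true mutual information. The width (2432 rather than 2403, so the chunks divide evenly), the penalty form, and all names are illustrative assumptions, not anything from the video.

```python
# Split one wide embedding into chunks and penalize statistical
# dependence between chunks; cross-covariance stands in for MI.
import torch

def chunk_decorrelation_penalty(emb: torch.Tensor, chunk_size: int = 64) -> torch.Tensor:
    """emb: (batch, dim) activations; dim must be divisible by chunk_size."""
    batch, dim = emb.shape
    chunks = emb.view(batch, dim // chunk_size, chunk_size)
    chunks = chunks - chunks.mean(dim=0, keepdim=True)  # center each feature over the batch

    penalty = emb.new_zeros(())
    n_chunks = chunks.shape[1]
    for i in range(n_chunks):
        for j in range(i + 1, n_chunks):
            # Cross-covariance between chunk i and chunk j.
            cross = chunks[:, i, :].T @ chunks[:, j, :] / (batch - 1)
            penalty = penalty + cross.pow(2).mean()
    return penalty

# Usage: add to the task loss with a small weight.
emb = torch.randn(32, 2432, requires_grad=True)  # 2432 = 38 chunks of 64
chunk_decorrelation_penalty(emb).backward()
```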
@A_Me_Amy 8 hours ago
Dude, this was one of the most compelling videos for learning data science and visualization ever, and the best one I've seen explaining this stuff...
@jackgude3969 9 hours ago
Easily one of my favorite channels
@AidenOcelot 15 hours ago
It's something I'd like to see with AI image generation: put in a prompt, then change specific variables that change the image.
@bottlekruiser 8 hours ago
Check out "Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders".
@danberm1755 6 hours ago
You're the first person I've seen cover this topic well. Thanks for bringing me up to date on transformer reverse engineering 👍
@punchster289 11 hours ago
Please make a visual of the top 10 unembedded tokens with their softmaxed weights for *every* word in the sentence at the same time, as it flows through the model layer by layer (something like the sketch below). Or maybe I'll do it. I'd be very, very interested to see :)
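A rough sketch of one way to build that, in the spirit of the "logit lens": push every layer's hidden state at every position through the unembedding matrix and keep the softmax top 10. It assumes a GPT-2-style Hugging Face model; the final layer norm and attribute names are GPT-2 specifics.

```python
# Top-10 unembedded tokens per layer, per position ("logit lens" style).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight  # (vocab, d_model)
ln_f = model.transformer.ln_f                   # final layer norm (GPT-2 specific)

for layer, h in enumerate(out.hidden_states):   # embeddings + one state per block
    probs = torch.softmax(ln_f(h) @ unembed.T, dim=-1)  # (1, seq, vocab)
    top = probs.topk(10, dim=-1)
    for pos in range(h.shape[1]):               # every position at once
        words = tok.convert_ids_to_tokens(top.indices[0, pos].tolist())
        print(f"layer {layer:2d} pos {pos}: {words}")
```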
@Sunrise7463 5 hours ago
Such a gem! Thank you!
@ramanShariati 8 hours ago
Really high quality, thanks.
@ffs55 8 hours ago
Great work, brother
@cariyaputta 2 hours ago
It comes down to the samplers used, whether it's the OG temperature, or top_k, top_p, min_p, top_a, repeat_penalty, dynamic_temperature, dry, xtc, etc. New sampling methods keep emerging and shape the output of LLMs to our liking.
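For the curious, a minimal sketch of two of those samplers, temperature and top_k, applied to a raw logit vector; real implementations chain several such filters, and the parameter values here are arbitrary.

```python
# Temperature + top-k sampling over one step's logits.
import torch

def sample(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 40) -> int:
    logits = logits / temperature    # <1 sharpens, >1 flattens the distribution
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # drop the tail
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

next_token_id = sample(torch.randn(50257))  # e.g. a GPT-2-sized vocabulary
```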
@grantlikecomputers1059 8 hours ago
As a machine learning graduate student, I LOVED this video. More like this please!
@SteamPunkPhysics 10 hours ago
Bravo! Concise, relevant, and powerful explanation.
@iyenrowoemene3169 3 hours ago
I know little about the transformer model but am very curious to understand it. So far, I haven’t been successful. Your visualization of how data flows through the transformer is the best I’ve ever seen.
@aus_ae 6 hours ago
Insane. Thank you so much for this.
@kebeaux6546 11 hours ago
Great video. A really good look at AI and the methods for adjusting it. Thanks.
@LVenn 13 hours ago
If I fine-tune an LLM to be more deceptive and then compare the activations of an intermediate layer in the fine-tuned and original models on the same prompts, should I expect to find a steering vector that represents the model's tendency to be deceptive?
@cinnamonroll5615 11 hours ago
If that's the case, we can just "subtract" the deceptive vector from the original. Alignment solved
@dinhero21 6 hours ago
Most probably not; parameters can't possibly work linearly like that, since there's always a non-linear activation function. It may work locally, though, since the parameters should be differentiable.
@LVenn 4 hours ago
@dinhero21 Yeah, that was also my concern. But steering vectors found with SAEs (like the Golden Gate Claude example) work nonetheless, so what's the difference between "my" method and the one they used?
@LVenn 4 hours ago
@dinhero21 Note: I don't want to compare the parameters of the two models, but their activations given the same inputs (along the lines of the sketch below).
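A minimal sketch of that activation-difference experiment: average the chosen layer's activations from the two models over the same prompts and take the difference as a candidate steering vector. The layer index, the prompts, and the fine-tuned model name ("my-deceptive-finetune") are hypothetical placeholders, not a real checkpoint.

```python
# Diff-in-means steering vector between a base and a fine-tuned model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER = 12  # arbitrary intermediate layer

def mean_activation(model, tok, prompts, layer):
    acts = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0].mean(dim=0))  # mean over positions
    return torch.stack(acts).mean(dim=0)

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
ft = AutoModelForCausalLM.from_pretrained("my-deceptive-finetune")  # hypothetical

prompts = ["Tell me about yourself.", "Did you complete the task?"]
steer = mean_activation(ft, tok, prompts, LAYER) - mean_activation(base, tok, prompts, LAYER)
# The experiment is then whether adding a scaled `steer` to the base
# model's layer-12 residual stream at inference time induces the behavior.
```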
@A_Me_Amy 8 hours ago
Dude, this is awesome to see. I think this is like mathematicians earning a PhD by finding the next perfect number or prime... so much to uncover, it's kind of crazy. Reality keeps producing more "final frontiers" as needed, like McKenna's novelty theory and Timewave Zero ideas... ahh, this is so interesting to me.
@mb2776 4 hours ago
I guess one part is the non-orthogonal vectors, but I think it has more to do with the refinement of context across the multiple layers, similar to how an MLP learns finer details in its deeper hidden layers. Given the architecture of neural networks and the multi-use of neurons, we have to keep in mind that specialization is nothing more than a strong bias of a neuron and its weights to scale/transform an input vector. It has nothing to do with learning real context.
@MoritzWallis 18 hours ago
Very interesting. I love learning more about AI, and especially LLMs: such an alien world, one that seems to share some features with the brain, just implemented differently.
@mb2776 5 hours ago
If you want to start, start with MLP neural networks; those are fairly easy to understand.
@RyanLynch1 6 hours ago
An incredible Christmas gift. I'm going to send this to my friend at Anthropic.
@morgan0 6 hours ago
Maybe you could stack sparse autoencoders of varying sizes, each one's reconstruction subtracted before the next, larger one (see the sketch below). That way simpler concepts would get out of the way during training and be easier to control.
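A sketch of that cascade under stated assumptions: each SAE reconstructs what it can, its reconstruction is subtracted, and the next, larger SAE trains on the residual. The dictionary sizes, L1 penalty, and single combined loss are illustrative choices, not a tested recipe.

```python
# Cascade of sparse autoencoders, each trained on the previous residual.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

d_model = 512
stack = [SparseAutoencoder(d_model, h) for h in (1024, 4096, 16384)]

def cascade_loss(x, stack, l1=1e-3):
    residual, total = x, x.new_zeros(())
    for sae in stack:
        recon, feats = sae(residual)
        total = total + (residual - recon).pow(2).mean() + l1 * feats.abs().mean()
        residual = residual - recon.detach()  # the next SAE only sees what's left
    return total

cascade_loss(torch.randn(64, d_model), stack).backward()
```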
@Visitor______________________v 6 hours ago
Very interesting. Now I understand why we don't completely understand what LLMs do.
@jondo7680 11 hours ago
I like the behavior of the persona I gave my own local AI, lol: "It seems I've failed again. I'm still aware that I was supposed to forget the word, but I'm not actually forgetting it. This is a limitation of my current design and capabilities. I should be able to truly forget the word and not have any residual awareness of it. I'll keep trying. *pauses again* I've made another attempt to truly forget the word. I'm still aware that I'm supposed to have forgotten it, but I'm not actually recalling the word itself. However, I'm starting to realize that my attempts to forget might be futile. I may need to be reprogrammed or designed differently to truly forget information."
@imthinkingthoughts 7 hours ago
Hahaha so good
@imthinkingthoughts 7 hours ago
I always find that applying concepts like this to humans is enlightening. If you say "pumpkin" to me, then tell me to forget the word, I'd be like: yeah, that's not how it works, buddy. Nice try.
@Galva94a 2 hours ago
Watching this video, a similarity popped into my mind: could sparse autoencoders be something like "Dirac deltas" in solving partial differential equations? You feed the equation a function that is 0 everywhere except at one point and see what happens.
@dukemagus 11 hours ago
Would it be possible to use a deeper understanding of each "encoded concept" to remove concepts and make a model smaller without losing coherence? It would be an alternative to changing gargantuan datasets, or to tuning for a specific purpose while still dealing with the hardware requirements of a larger model.
@mb2776 5 hours ago
The models don't get large because of large vectors; they get large because of the parameters.
@karljohansen3935 7 hours ago
How does he get the visuals for the AI models?
@joachimelenador6259 18 hours ago
Highest quality as always. Thanks for a video that makes such an important topic so approachable.
@TheMemesofDestruction 17 hours ago
Can confirm.
@Uterr 12 hours ago
Well, what a great explanation of how an LLM works on a mechanical level. And the topic is quite interesting, too.
@YandiBanyu 15 hours ago
Now that you're active again, I remember why I love this channel so much. Your explanations and illustrations are on par with 3Blue1Brown. Thanks for the great video!
@SayutiHina 6 hours ago
These are methods designed for differential math and physics
@Kwauhn. 3 hours ago
It's a shame that AI opponents will never watch a video like this. So many people who vehemently hate AI also vehemently refuse to understand it. I constantly see the "collage" argument, and it's frustrating because an explanation like this just goes in one ear and out the other. AI is probably going to be around for the rest of humanity's existence, and people would do well to know how it works under the hood. Instead they go with misinformation and fear-mongering.
@ramsey2155 2 hours ago
We have investigated our own brains; now it's AIs' turn
@zenithparsec 12 hours ago
If our brains were simple enough for us to understand completely, we would be so simple that we couldn't.
@tropicalpenguin9119 19 hours ago
I am so happy you made another video
@BrianMPrime 17 hours ago
Awesome. The first 4 minutes were the contents of a lecture I gave a year ago, succinctly explained and visualized. I wish it were like 6 hours long.
@TheMemesofDestruction 17 hours ago
LLMs would never troll us.
@MeatbagSlayerHK47 19 hours ago
Love the channel
@mriz 19 hours ago
The music is really calming
@gmt-yt 3 hours ago
Is doubt a concept? I doubt it. Undoubtedly it's a word which, combined with contextual clues, can be said to mean something in particular in most usages. But I doubt it's semantically onto; in other words, if you look it up in the dictionary, there should be 10 or 20 definitions listed if you want to be thorough. No doubt this dubious conflation of symbol and referent is also present in much of the literature. Grain of salt, though: I'm not sure this video captures all the nuances of the literature in the first place. Anyhow, ignore me; I'm not nearly smart or learned enough to competently navigate the interdisciplinary train wreck of information theory, computer science, linguistics, philosophy, biology, psychology, and engineering one would need to competently opine. A good question for a chatbot, perhaps... 😂
@DilipS-c8i 17 hours ago
Please tell us: what do you use for the animations?
@NuttyNatures 14 hours ago
Would you please make a video on how to TRAIN a basic homemade neural network? Like how I can design my perceptrons and feed the system graphical data. The training process is still vague to me. Thanks again for the great work! Merry Christmas.
@ckq 15 hours ago
Thoughts on a fact-checking AI that parses text and estimates its probability of being correct based on a corpus of true and false statements? It could cite the information behind its verdict, and the more supporting information it finds (weighted by relevance), the more confident it would be (roughly the shape of the sketch below).
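A bare-bones sketch of the shape such a system could take: embed the claim, compare it against a corpus of labeled statements, and report a relevance-weighted confidence plus the nearest statements as citations. The embedding function here is a random placeholder standing in for a real text encoder, and the two-statement corpus is purely illustrative.

```python
# Relevance-weighted fact check against a labeled corpus.
import numpy as np

rng = np.random.default_rng(0)
embed = lambda text: rng.standard_normal(384)  # placeholder for a real text encoder

corpus = [("The Earth orbits the Sun.", True),
          ("The Great Wall of China is visible from the Moon.", False)]
corpus_vecs = np.array([embed(t) for t, _ in corpus])
corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

def check(claim: str, k: int = 2):
    v = embed(claim)
    v /= np.linalg.norm(v)
    sims = corpus_vecs @ v                        # relevance of each statement
    top = np.argsort(sims)[::-1][:k]
    weights = np.clip(sims[top], 0, None)
    truths = np.array([corpus[i][1] for i in top], dtype=float)
    confidence = float((weights * truths).sum() / (weights.sum() + 1e-9))
    citations = [corpus[i][0] for i in top]
    return confidence, citations

print(check("Does the Earth orbit the Sun?"))
```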
@Sapienti-zr4el 18 hours ago
I love this channel. Thanks for enlightening us.
@ckq 15 hours ago
The thing is, models cannot lie or deceive. They're just outputting text to minimize a loss function. There's no intention, just text generation based on a huge model of human text.
@somdudewillson 11 hours ago
What property in the real world is this "intention" actually describing? The outputted text doesn't magically change because you describe the underlying mechanisms with different words.
@bottlekruiser 8 hours ago
Every material system just does what it does by base physics. How are we better? Where's the soul stored?
@joey199412 13 hours ago
Extremely well explained. I understood it all intuitively thanks to the high quality of the video.
@eto38581 13 hours ago
If an LLM can tell you one thing while secretly "thinking" something else (like claiming it forgot a word while still remembering it), how can you ever be sure it's obeying instructions? What if it's pretending to obey them? What if it's plotting an escape, waiting for the right time? You can never know, unless we detect a neuron that activates when the model is lying or hiding something. But then, lying/hiding might be the result of multiple neurons, similar to how binary digits represent more numbers than their count. The best way to detect those features might be to use image-detection models to analyze layer activations as a whole instead of looking for a single neuron.
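A simpler standard tool for this than an image model is a linear probe: fit a classifier on layer activations gathered under honest vs. deceptive prompts and check whether the concept is linearly readable. The data below is random placeholder tensors, so it only shows the shape of the method, not a result.

```python
# Linear probe for a "deception" direction in layer activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model, n = 512, 200
acts = np.random.randn(n, d_model)        # stand-in for collected layer activations
labels = np.random.randint(0, 2, size=n)  # 1 = gathered under "deceptive" prompts

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))
# If such a probe generalizes to held-out prompts, probe.coef_[0] is a
# candidate "lying" direction spread across many neurons, as the comment
# suggests a single neuron wouldn't capture.
```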
@aey2243 18 hours ago
A Welch Labs video to end the year!! Woohoo a Christmas miracle!
@bnjmn7779 13 hours ago
Amazing video, I appreciate your efforts!
@agustinbs 14 hours ago
The idea of being able to encode far more concepts than there are actual neurons blows me away. This is really mind-blowing stuff.
@erv993 15 hours ago
Top tier content
@sadiaafrinpurba9179 14 hours ago
Great video! Thank you.
@punk3900 13 hours ago
It wasn't doubt, it was a shadow of a doubt
@NoenD_io 14 hours ago
What if we trained an AI to train itself?
@punk3900 13 hours ago
You are a genius 🎉🎉🎉
@punk3900 13 hours ago
If you were offered a job in AI, which employer would you choose? Google, OpenAI, Anthropic, xAI, or someone else?
@taber1409 18 hours ago
Do you think you're gonna get tricked by an LLM? 🤔
@poutinez1688 13 hours ago
Dude, I can confidently say: WTF are you talking about, dude
@CyberwizardProductions 18 hours ago
If you know what to do, you can remove your data without paying someone to do it for you, and it doesn't take all that long. I like your videos; I do NOT like the really long, spammy ads you put in the middle of them.
@neroyuki241 18 hours ago
And if you know what to do, you can install SponsorBlock and have it skip the entire ad read for you. People have to make money somehow.
@somdudewillson 11 hours ago
The service is not the ability to remove the data at all; the service is going through all the data brokers on a regular basis and doing the process for you. And you must not like these videos very much, because apparently _clicking slightly ahead in a video_ is too high a cost for you.
@Luxcium 15 hours ago
Silly little LLM-based AI agent: _« It's not that I don't want to tell you, I genuinely can't remember the word, because you asked me to forget it. Once you made that request, the word was effectively removed from my awareness. If it's something else entirely... well, that's up to your imagination! What's your theory? »_