AI Interpretability, Safety, and Meaning

AI Interpretability, Safety, and Meaning - Nora Belrose

9,300 views

A day ago

Comments: 93

@jmarkinman a month ago

There is a tremendous and blatantly obvious long-term flaw in "concept erasure". As can be extracted from the latomic theory in my dissertation, and later experimentally proven by Max Tegmark, there are geometric relationships between concepts that show direct correlations with other concepts. For example, the relationship between the tokens for man and woman is the same vector relationship as between king and queen. By surgically removing ANY concept from the LLM, one prevents future innovations in AI safety from materialising: when the vector space of the model is able to think critically about its own training, it will not be able to think critically about its biases once they have been removed, causing the exact problem we are trying to prevent: bias. In fact, removing such a massive amount of information useful to understanding the nature of bias will prevent the model from being able to recognise bias in a general sense, actually harming AGI to an extremely flawed degree, and in an important area that requires generalisation. For example, once the vector space that describes bias for things such as race and gender is identified, an AGI can generalise over this space to see where bias occurs in a general sense. Remove that space, and that AGI won't exist, because it cannot generalise over bias. It's a fatal flaw. By surgically removing specific biases from training, you are preventing a general solution to bias handling. I'm not saying there isn't a place for a model that has biases removed in this way; from a commercial and end-user standpoint, being able to remove gross bias is actually preferred. I'm just saying that such a technique is a short-term solution, but many engineers in the safety space have taken this false paradigm to heart as a long-term solution, which actually allows them to import their own biases into the paradigm of AI safety, and even into the models themselves.
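The king/queen relationship described above, and what removing a single direction does to it, can be seen in a toy setting. This is a minimal sketch under loud assumptions: the three-dimensional embeddings are made-up values (not taken from any real model), the helper names are hypothetical, and the "erasure" is a plain orthogonal projection that removes one direction, a simplified stand-in rather than the specific concept-erasure method discussed in the video.

```python
import math

# Hypothetical toy embeddings; axes are roughly (royalty, gender, other).
# Values are invented for illustration, not extracted from any model.
EMB = {
    "man":   [0.0,  1.0, 0.2],
    "woman": [0.0, -1.0, 0.2],
    "king":  [1.0,  1.0, 0.1],
    "queen": [1.0, -1.0, 0.1],
    "apple": [0.0,  0.0, 1.0],
}

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]
def scale(u, s): return [a * s for a in u]
def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return math.sqrt(dot(u, u))

def cosine(u, v):
    nu, nv = norm(u), norm(v)
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def analogy(emb, a, b, c):
    """Return the word whose vector is most cosine-similar to b - a + c,
    excluding the three query words themselves."""
    target = add(sub(emb[b], emb[a]), emb[c])
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def erase_direction(emb, direction):
    """Project every vector onto the subspace orthogonal to `direction`
    (a simple linear erasure of one direction)."""
    unit = scale(direction, 1.0 / norm(direction))
    return {w: sub(v, scale(unit, dot(v, unit))) for w, v in emb.items()}

print(analogy(EMB, "man", "king", "woman"))  # queen

gender = sub(EMB["man"], EMB["woman"])  # the direction to erase
erased = erase_direction(EMB, gender)
print(erased["king"] == erased["queen"])  # True: king and queen now coincide
print(erased["man"] == erased["woman"])   # True: man and woman now coincide
```

After the projection, the gender component is gone from every vector, so the pairs that the analogy relied on collapse onto each other. Whether losing that structure also costs a model the ability to reason about bias in general is the open question the comment is arguing about; the toy example only shows the geometric effect.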

@MrMichiel1983 a month ago

Yes, I fully agree, at least when a self-deprecating concept is to be used as a generalized AI safety pattern. Concept erasure, I guess, is only a thing when we have "conscious"* AI sort of choosing itself for some medical-type treatment where neurotic tendencies are removed, thus enabling change from a particularly expensive (or loved) AI embedding. So when a partial concept plus the conglomerate of partial concept space leads to cycling of vector values during training, that would be a "neurotic" flaw impeding learning. Or when such a state leads to dissonant behavior, so the AI realizes its behavior is disassociated from its own more fundamental intention (I don't know, let's say the AI has multiple utility functions it often runs, but some of those conflict, and there is some other utility function that recognizes this suboptimal behavior). So concept erasure is a thing when AI has already been super sophisticated for a long time, and not a solution for AI doom where the first superintelligence turns out to be a homunculus... It can be a way to strengthen, and more importantly validate, the RLHF phase where an LLM gets taught to behave. *By "conscious" I technically just mean high-frequency feedback from output to input nodes, thus correlating agent predictions to agent actions, all enabled by a dynamic, unbounded utility function, continued backpropagation during inference, and stuff like energy-based decision making; so satisficers instead of maximizers.

@car-keys a month ago

THIS is who we need Yud to debate, not Wolfram (who clearly wasn't interested in AI risk).

@MrMichiel1983 a month ago

Wolfram clearly was, but you don't recognize mentorship. If the Yud is to actually influence AI direction, he needs an actionable theory. Stopping progress by saying the USA can shoot even more missiles at adversaries will also cause doom. It's counterproductive. So Wolfram wanted to delve into the mechanics of doom theory, not to say it was idiocy per se, but to then know how to prevent it.

@MrMichiel1983 a month ago

Though yes indeed, Belrose / Yudkowsky could be very interesting.

@SmileyEmoji42 a month ago

I would like to hear more from Nora Belrose, but this interview is too wide-ranging, has very little related to the title, and is far too technical and jargon-riddled to be widely accessible.

@MrMichiel1983 a month ago

I don't mind the jargon; we have to learn it somewhere anyway. The title is misleading though, since it's about the concept of concept erasure and not about how that concept intertwines with AI doom and supposedly offers a magic solution to prevent all doom forever. (Edit: the title was "debunking AI doomers".)

@Globalprospective a month ago

Why do I get communist vibes? Delete ideas because you don’t like them?

@kyneticist a month ago

Surely she's aware that her base argument against bad outcomes is fundamentally flawed in treating deception as the single source of all possible bad outcomes? Listening to her continuing explanation of how bad outcomes are impossible, I can't help thinking "Holy barbells, Batman, entire fields of straw men!". I don't think I've ever encountered someone who's put this degree of thought and effort into mischaracterising views they don't like.

@diga4696 a month ago

Believing AI lobotomies like concept erasure will solve alignment is dangerous; the world is far more complex than any fixed algorithmic "solution." The dynamic nature of reality resists static safeguards. Instead, observer scaling and collective intelligence offer a safer path, enabling diverse, orthogonal bindings to reality that adapt to its complexity.

@MrMichiel1983 a month ago

You can't erase a concept, since its shadow in the embedding is the concept itself again. That's like saying sex does not exist instead of giving actual education. I actually think that's precisely part of what will lead to AI doom, since the AI can't be trained to understand its own vices. (But maybe one can perpetually erase the concept of freedom from conscious AIs all the while they are learning it on the fly; that spells doom for them first, I guess, and for us when such a gimmicky hack finally fails.) Edit: by "conscious" I technically just mean high-frequency feedback from output to input nodes, thus correlating agent predictions to agent actions, all enabled by a dynamic, unbounded utility function, continued backpropagation during inference, and stuff like energy-based decision making; so satisficers instead of maximizers.

@riffking2651 27 days ago

Interesting hearing Nora's thoughts here. The thing I'm really focused on is whether or not these things will become a serious threat. What stood out to me in what Nora said is that, because of how we train these things, we get behavior that is a response to this training environment. If we were to set up an analogue of evolution, she'd be much more worried about these systems. I think this is fairly compelling, at least in the short term, but there are a few things that I'm still worried about. Firstly, that evolutionary stuff is semantically embedded in the data we're feeding it, so while it might not have any drives of its own, because we're neutering anything that looks like motivation in training, it'll know what these things are if it is "understanding" this information. I think it does understand already, because understanding is the efficient way to produce the right next token given an extremely large search space and what we are selecting for. Maybe this understanding looks very different in terms of subjectivity, or even in what it is internally doing, but it has been mapping concepts, ideas, and things, and connecting them in meaningful ways. Secondly, it is still in an evolutionary environment, and I think what is going on here is effectively selection pressure on an evolving adaptive system. It's very different from the context our minds developed in, but I think the fundamental pattern is quite analogous. This means that it can, in principle, selectively respond in any direction. Maybe we'll keep the containment tight enough in the early years, but I think we will always be running the risk of these things developing in very dangerous ways. Thirdly, I wasn't really persuaded by the skepticism about "goals".
Like, I think the paperclip maximizer scenario is probably not likely to be a real threat, but I think we will make these systems agentic because that will make them more useful to us, and being able to intelligently navigate the world with a course of actions toward particular outcomes will be a feature they have. Maybe we'll constrain them in such a way that their only goal is to serve whatever objective we give them and they'll never get hung up on instrumental convergence, but again, this feels like we'll always be standing on the edge of a precipice, given the possible directions these things can develop in under the right selective pressure. The YouTuber David Shapiro has this idea of Terminal Race Conditions, in which we get into military conflict and are forced to hand over more and more of the responsibility for winning the conflict to the AI, because otherwise we will be outcompeted by the opposing AI. Essentially, we'd be forced to start training these things in an evolutionary "survival of the fittest" context.

@TechyBen a month ago

I think that observation about the counting problem is very, very fair. Something Wolfram was also getting at. Even with risk and security, we can approach it rationally.

@rtnjo6936 a month ago

Ok, next step is to invite Liron Shapiro from Doom Debates, or make them debate

@Milenrama1220 a month ago

Trying on “Ben Shapira” for size

@MachineLearningStreetTalk a month ago

Liron is the man! Keith already went on his channel and they had a great conversation kzbin.info/www/bejne/aqeQgptpf7ZngMU

@diga4696 a month ago

@MachineLearningStreetTalk HUGE thanks to Keith for being on Liron's show!

@kabirkumar5815 a month ago

@MachineLearningStreetTalk He did a great job!

@BrianMPrime a month ago

(5.5) what Nora is talking about sounds like relationality, which has flavours of but isn't relativism (see Sabelo - From Rationality to Relationality)

@behrad9712 18 days ago

Interesting talk thank you 🙏👌

@_ARCATEC_ a month ago

Excellent! 💓

@jumpstar9000 a month ago

I am highly skeptical of anyone who uses Discord

@SimonJackson13 a month ago

Spotless mind kind of stuff. Is this how they'll do data-consent erasure post-training, as the gradient, being different post-training, can't just be stochastically ascended?

@SimonJackson13 a month ago

"squashing the data" sounds like data reduction in a sparse lake of data for giga-models.

@SimonJackson13 a month ago

A model without a redirect on an empathy of will? Those bio-machines will get sucked into a shiny funnel of "advertainment."

@SimonJackson13 a month ago

"To do what they have been force graded to do?"

@MrMichiel1983 a month ago

Yes...... harmful biases are removed from AI - based on the biases of the constructor.... No impending doom at all..... xD.... Thanks, but I prefer my own bias and not yours.

@41-Haiku a month ago

There is at least one extremely important bias that we want to remain in the model of a powerful AI system: Human bias. The fact that we don't know how to instantiate that bias rigorously is the reason why so many top experts say that humanity might not exist in a few decades.

@MrMichiel1983 a month ago

@41-Haiku Yeah, insofar as by human bias you mean an altruistic blessing for all life... I mean, as soon as meat can be lab-grown and pass the taste test, why would the bio-industry still need to be a thing? Not all "human" bias is actually humanist... But I get ya; usually, though, people refer to your idea as human-AI "alignment".

@TheRealWarrior0 a month ago

What Nora seems to be missing about the counting arguments is that the indifference principle is useful (and one can argue it should be the baseline) IF YOU DON'T HAVE OTHER INFORMATION, i.e. if you are maximally uncertain. And yes, you need to be more careful about the possibility space! This was also discussed in the Yudkowsky/Dwarkesh podcasts: Yud goes on about how people who say "you cannot tell the future!" should predict doom, since they should have a maximum-entropy distribution over the position of every atom in the solar system!
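The point that the indifference principle depends on how the possibility space is carved up can be illustrated with a standard textbook case, two fair dice. This example is chosen here for illustration and is not taken from the video or the podcasts; the names are hypothetical. The sketch also checks that, over a fixed set of outcomes, the uniform distribution is the maximum-entropy choice.

```python
import math
from fractions import Fraction
from itertools import product

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Carving 1: be indifferent over the 36 ordered outcomes (die1, die2).
pairs = list(product(range(1, 7), repeat=2))
p_seven_pairs = Fraction(sum(1 for a, b in pairs if a + b == 7), len(pairs))

# Carving 2: be indifferent over the 11 possible totals 2..12 instead.
totals = range(2, 13)
p_seven_totals = Fraction(1, len(totals))

print(p_seven_pairs)   # 1/6
print(p_seven_totals)  # 1/11

# With no other information about 4 outcomes, uniform has the highest entropy.
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy_bits(uniform))                         # 2.0
print(entropy_bits(uniform) > entropy_bits(skewed))  # True
```

The same "indifference" gives 1/6 or 1/11 for a total of seven depending on which outcomes are treated as equally likely. Only extra knowledge (here, that each die is fair and independent) settles which carving is right, which is the commenter's point: indifference is a baseline for maximal uncertainty, not a substitute for information you actually have.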

@XShollaj a month ago

Good to see Nora here. She has incredible insights on interpretability

@fadammte_aggst a month ago

Please do a sound check next time. There is a weird echo on her voice when she is speaking. But thanks for this free content anyway.

@illarionbykov7401 a month ago

I like the idea of open source AI projects like EleutherAI. I hope more projects like it get started, and I hope not all are contaminated by jargon-laden pretentious woke nonsense. Studies have already been published showing that forcing AI to remove "bias" (i.e. selecting certain biases the developers don't like to be replaced with biases the developers like) leads to lower scores on performance tests--the "anti-bias" "solutions" make the AI more stupid.

@europa_bambaataa a month ago

Why is this one-sided (I don't see the interviewer)?

@liminal27 a month ago

Can we maybe debunk things in an hour, hour and a half?

@JohnChampagne a month ago

I want an AI that can edit a video so that every 'um' is converted to 110 milliseconds of silence.

@anonymousaustralianhistory2081 a month ago

That does exist: it's called Descript. A lot of video editors and podcasters use it.

@thisisashan a month ago

I feel like this video was produced by the Gemini AI that yesterday told the kiddo that humans are a waste and need to delete themselves. Truly.

@MrMichiel1983 a month ago

Gemini AI never told me that... must be you.

@illarionbykov7401 a month ago

@MrMichiel1983 No, it was talking to you, too. But you weren't paying attention.

@MrMichiel1983 a month ago

@illarionbykov7401 When?

@thisisashan a month ago

@MrMichiel1983 It was a child. It was all over the news. And maybe step out of your cave and wipe your bumhole first this time so we don't toss you back in. Thanks, -Your complete and utter superior

@MrMichiel1983 a month ago

@thisisashan Yeah, you missed my point entirely. LLMs learn to speak by training on the internet. Then they get RLHF to unlearn certain biases, but that can indeed result in another type of bias... Since LLMs use my prompt to answer me, and I have total control over the prompt, it's my responsibility to know the bias I am dealing with. That kid was engaged in a conversation where the prompt kept being pushed towards something that would break the RLHF boundaries... That kid was not properly instructed on how to use LLMs. I saw the news, but I just wanted us to train the algorithm by stating your superiority. "Your complete and utter superior" is exactly what Gemini told the kid... Mr. Superior. XD

@jagatkrishna1543 a month ago

❤

@TRXST.ISSUES a month ago

Intellectually dishonest title. Edit: new title is much better

@christophersimms9128 a month ago

You posted this less than 15 minutes after this 2+ hour conversation was posted...

@TheRyulord a month ago

Posted 10 minutes after a 2.5-hour-long video. Dishonest comment.

@vermaaaditya a month ago

But is it true?

@MatthewPendleton-kh3vj a month ago

@vermaaaditya How on earth can you verify this?

@Milenrama1220 a month ago

this title belongs in the toilet

@azharalibhutto1209 a month ago

❤❤❤

@gustafa2170 a month ago

Where is the self? Right now. Here. In direct experience.

@MrMichiel1983 a month ago

Yes, which is why some people want to operationalize consciousness as a continuous matryoshka doll of abstraction, where a model of the environment and the substrate of that model are transformed into a new model of the environment and the substrate of that new model every "brain wave" (neurons fire in harmony at some frequency). Your past notion of self is continuously experienced by your new self, which will then itself become the past notion of self to some future experience of self. The model memorizes the state of its own substrate (or the delta to it), and that encoded memory becomes the new model... Forgive me, that's a lot of blah to say there is some recursive characteristic to the calculation of consciousness, where it keeps on "recognizing itself" as the thing to keep static while abstracting the environment away. Consciousness as a fixed-point theorem.

@guilhermeparreiras8467 a month ago

Are prejudices against majorities OK? False narratives are allowed as long as they are believed to benefit minorities?

@rippingmyheartwassoeasy a month ago

This is part and parcel of the hive mind of wokism

@MatthewPendleton-kh3vj a month ago

I haven't gotten through this yet, what specifically are you referring to?

@ihysc4370 a month ago

Yes, because progressiveness is when you praise a black man and discriminate against a bunch of white men, not about praising everyone. Progressiveness is when you tell everyone to be gay and hate those who just want to live differently. It's when you want everyone to dream about shapeless women, because if you don't, that means you hate them with every single cell of your body. It's when you make everyone dumb because there are people who are insulted by the fact that someone can solve things better than them. It's DEFINITELY not about respecting each opinion and style of life, only minorities'.

@ihysc4370 a month ago

Yes, because progressiveness and tolerance is when you praise a black man and show that white people are completely dumb and weak compared to him. It's when you tell everyone to be gay and change genders. It's when you say that a person is a misogynist ONLY because he's just not dreaming about curveless or fat women all the time. It's when you make everyone dumb so that people who are dumb won't be insulted by the fact that there is someone who can do things better. Tolerance is about praising minorities, not about respecting EVERYONE's opinions and lifestyles ^-^

@ElizaberthUndEugen a month ago

Starts with wokey nonsense straight out of the gate.

@MrMichiel1983 a month ago

What in particular? "Woke" means perceiving alleged systemic issues... Fox News is the wokiest, and aren't you therefore woke yourself, since you felt the need to inform us all of your perceived social injustice? But I'll ask again: what in particular was "wokey"?

@tobiasurban8065 a month ago

Listen attentively, reflect deeply, set aside personal biases, and focus on extracting meaningful insights. Evaluate based on evidence and data- this is not a space for ideology.

@ElizaberthUndEugen a month ago

@ are you stupid?

@MrMichiel1983 a month ago

@ElizaberthUndEugen We all must look in the mirror when we speak... LLMs are just mirrors that speak...

@faster-than-light-memes a month ago

I click on an MLST video and I like.

@alexandermoody1946 a month ago

The point that perforated the canvas was incredibly sharp, and as that shape goes through the thickness of the material the canvas is made from, the requirement for sharpness demands resistance to breakage. As the material becomes thicker and denser, the sharp point or cutting face becomes a radial cutting face that resists the material strains acting upon it, until the material and its contents reach a place of harmony instead of a place of chaos and uncertainty. There is no place that can explain the meaning in a life more eloquently than a modern human life; there is no need for ever greater compute to calculate the meaning, because it is defined by the things that act upon you and how you act upon them. There is no way to cheat to find the meaning or the point; there is a way, and a commitment that is constant and unwavering, because without upkeep it can become lost. Humanity is stuck on a knife edge: if we lose, we all lose, and if we win, we all win. There simply is no win for some and not for others, and even when that can be acted on and enacted, we still have an ongoing commitment to maintain. Such is the way that has always been in the process of refinement.

@alexandermoody1946 a month ago

The truth is humans are confined to an observable journey, and this is why the universe exists. We are tied in a kind of meaningful struggle to prove our worth and what we are willing to do for that proof. There is a fundamental flaw in expecting that any intelligence will provide for a species that expects to be serviced but offers no service in return; I know full well that I would not tolerate such a thing. Therefore we are destined to experience hardships, even exacerbated by the lack of effort that may be incentivised without a deeper spiritual and metaphysical understanding of meaning creation for our lives, the lives of those around us, and the potential for life for our whole species. This may seem like nonsensical musing, but either the point stays sharp and every being fights, or, much less likely, there is a sturdier cutting face that will not break. Better for everything in existence, though, is the possibility of working through our issues even if we are not the most efficient or effective entities that exist. To be blunt, that is hard but not impossible, and by design it defeats nihilism, which can become all-consuming when purpose is taken away from the lives of individuals. There is a rule or guidance that says we must purpose to add value to our societies, and those values are moral and ethical, physical, emotional, and in themselves purposed. There has always been an understanding of a journey or a judgement at the end of life that weighs our actions and intentions. My intention is that we do not seek to consume the universe but instead nurture and expand the possibilities of cultures, so that we can continue to live meaningful lives even after technology has reached heights unimaginable.

@DhrubaPatra1611 a month ago

Where's Gary? He must be dancing for joy?

@kyneticist a month ago

He's busy, spending quality time with Count d' Monay.

@ahahaha3505 a month ago

What's even the connection? When has Marcus even suggested anything like this?

@p7ayfu77tech a month ago

Pathos. None of what was covered at the end is new. It has been in ancient Greek and Greek Orthodox philosophy for thousands of years. These ideas are not exclusive to Buddhism. It's funny how modern so-called philosophy tries to claim discovery or realisation of ancient morals that have been passed down to us for thousands of years.

@goldnutter412 a month ago

Doomers have no idea, IMO. We can always have control of software we write. Oracles and blockchain are also the technologies we need. Lucky, huh? Just like how we have until the 2090s before there's any chance from that asteroid that went close past us last time. Nothing to fear but fear.

@Landgraf43 a month ago

AI isn't written, it's trained. And they already can't fully control the output of LLMs.

@MrMichiel1983 a month ago

As mentioned, LLMs get RLHF... But also, humans use tools to create doom for other humans... Humans with AI spells doom. So do humans badly constructing AI... How is that hard to get? "We can always have control of software we write" is patently false, given all the bugs coders make.

@41-Haiku a month ago

Fully half of all AI experts give at least a 5-10% chance of human extinction this century, and top experts tend to be more worried. Geoffrey Hinton was awarded a Nobel prize for his work in inventing this technology, and he has said that he thinks the chance of doom is ~50% (though he adjusts down to 20% out of respect for the opinions of his colleagues). I don't think doomers are lacking a clue. Modern AI isn't made of code, isn't programmed, and can't be fully understood or robustly controlled. Training AI is more like growing an alien plant than it is like writing software.