Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...

81,757 views

Robert Miles AI Safety


1 day ago

The previous video explained why it's possible for trained models to end up with the wrong goals, even when we specify the goals perfectly. This video explains why it's *likely*.
Previous video: The OTHER AI Alignment Problem: • The OTHER AI Alignment...
The Paper: arxiv.org/pdf/1906.01820.pdf
Media Sources:
End of Ze World - • End of Ze World - The ...
FlexClip News graphics
With thanks to my excellent Patreon supporters:
/ robertskmiles
Timothy Lillicrap
Kieryn
James
Scott Worley
James E. Petts
Chad Jones
Shevis Johnson
JJ Hepboin
Pedro A Ortega
Said Polat
Chris Canal
Jake Ehrlich
Kellen lask
Francisco Tolmasky
Michael Andregg
David Reid
Peter Rolf
Teague Lasser
Andrew Blackledge
Frank Marsman
Brad Brookshire
Cam MacFarlane
Craig Mederios
Jon Wright
CaptObvious
Jason Hise
Phil Moyer
Erik de Bruijn
Alec Johnson
Clemens Arbesser
Ludwig Schubert
Allen Faure
Eric James
Matheson Bayley
Qeith Wreid
jugettje dutchking
Owen Campbell-Moore
Atzin Espino-Murnane
Johnny Vaughan
Jacob Van Buren
Jonatan R
Ingvi Gautsson
Michael Greve
Tom O'Connor
Laura Olds
Jon Halliday
Paul Hobbs
Jeroen De Dauw
Lupuleasa Ionuț
Cooper Lawton
Tim Neilson
Eric Scammell
Igor Keller
Ben Glanton
anul kumar sinha
Tor
Duncan Orr
Will Glynn
Tyler Herrmann
Ian Munro
Joshua Davis
Jérôme Beaulieu
Nathan Fish
Peter Hozák
Taras Bobrovytsky
Jeremy
Vaskó Richárd
Benjamin Watkin
Andrew Harcourt
Luc Ritchie
Nicholas Guyett
James Hinchcliffe
12tone
Oliver Habryka
Chris Beacham
Zachary Gidwitz
Nikita Kiriy
Andrew Schreiber
Steve Trambert
Mario Lois
Braden Tisdale
Abigail Novick
Сергей Уваров
Bela R
Mink
Chris Rimmer
Edmund Fokschaner
Grant Parks
J
Nate Gardner
John Aslanides
Mara
ErikBln
DragonSheep
Richard Newcombe
David Morgan
Fionn
Dmitri Afanasjev
Marcel Ward
Andrew Weir
Kabs
Miłosz Wierzbicki
Tendayi Mawushe
Jake Fish
Wr4thon
Martin Ottosen
Robert Hildebrandt
Andy Kobre
Kees
Darko Sperac
Robert Valdimarsson
Marco Tiraboschi
Michael Kuhinica
Fraser Cain
Robin Scharf
Klemen Slavic
Patrick Henderson
Oct todo22
Melisa Kostrzewski
Hendrik
Daniel Munter
Alex Knauth
Kasper
Ian Reyes
James Fowkes
Tom Sayer
Len
Alan Bandurka
Ben H
Simon Pilkington
Daniel Kokotajlo
Diagon
Andreas Blomqvist
Bertalan Bodor
Zannheim
Daniel Eickhardt
lyon549
14zRobot
Ivan
Jason Cherry
Igor (Kerogi) Kostenko
ib_
Thomas Dingemanse
Stuart Alldritt
Alexander Brown
Devon Bernard
Ted Stokes
James Helms
Jesper Andersson
DeepFriedJif
Chris Dinant
Raphaël Lévy
Johannes Walter
Matt Stanton
Garrett Maring
Anthony Chiu
Ghaith Tarawneh
Julian Schulz
Stellated Hexahedron
Caleb
Scott Viteri
Clay Upton
Conor Comiconor
Michael Roeschter
Georg Grass
Isak
Matthias Hölzl
Jim Renney
Edison Franklin
Piers Calderwood
Mikhail Tikhomirov
Richard Otto
Matt Brauer
Jaeson Booker
Mateusz Krzaczek
Artem Honcharov
Michael Walters
Tomasz Gliniecki
Mihaly Barasz
Mark Woodward
Ranzear
Neil Palmere
Rajeen Nabid
Christian Epple
Clark Schaefer
Olivier Coutu
Iestyn bleasdale-shepherd
MojoExMachina
Marek Belski
Luke Peterson
Eric Eldard
Eric Rogstad
Eric Carlson
Caleb Larson
Max Chiswick
Aron
David de Kloet
Sam Freedo
slindenau
A21
Johannes Lindmark
Nicholas Turner
Tero K
Valerio Galieni
FJannis
M I
Ryan W Ammons
Ludwig Krinner
This person's name is too hard to pronounce
kp
contalloomlegs
Everardo González Ávalos
Knut Løklingholm
Andrew McKnight
Andrei Trifonov
Aleks D
Mutual Information
/ robertskmiles

Comments: 490
@MorRobots
@MorRobots 2 жыл бұрын
"I'm not worried about the AI that passes the Turing Test. I'm worried about the one that intentionally fails it" 😆
@hugofontes5708
@hugofontes5708 2 жыл бұрын
This sentence made me shit bits
@virutech32
@virutech32 2 жыл бұрын
holy crap...-_-..im gonna lie down now
@SocialDownclimber
@SocialDownclimber 2 жыл бұрын
My mind got blown when I realised that we can't physically determine what happened before a certain period of time, so the evidence for us not being in a simulation is impossible to access. Then I realised that the afterlife is just generalizing to the next episode, and yeah, it is really hard to tell whether people have it in their utility function.
@michaelbuckers
@michaelbuckers 2 жыл бұрын
@@SocialDownclimber Curious to imagine what you would do if you knew for a fact that an afterlife existed. That when you die you are reborn to live all over again. You could most definitely plan several lifetimes ahead.
@Euruzilys
@Euruzilys 2 жыл бұрын
@@michaelbuckers Might depend on what kind of afterlife, and whether we can carry over some things. If it's Buddhist reincarnation, you would be inclined to act better towards other people. If it's just a clean reset in a new life, we might see more suicides, just like how gamers might keep restarting until they find a satisfactory starting position. But if there is no way to remember your past in the afterlife/reincarnation, then arguably it is no different from now.
@thisguy00
@thisguy00 2 жыл бұрын
So the sequel video was finally published... That means I'm in the real world now! Time to collect me some stamps :D
@tekbox7909
@tekbox7909 2 жыл бұрын
not if I have any say in it. paperclips for days wohoo
@goblinkoma
@goblinkoma 2 жыл бұрын
Sorry to interrupt, but I really hope your stamps and paper clips are green, every other color is unacceptable.
@automatescellulaires8543
@automatescellulaires8543 2 жыл бұрын
I'm pretty sure i'm not in the real world.
@nahometesfay1112
@nahometesfay1112 2 жыл бұрын
@@goblinkoma green is not a creative color
@goblinkoma
@goblinkoma 2 жыл бұрын
@@nahometesfay1112 but the only acceptable
@elfpi55-bigB0O85
@elfpi55-bigB0O85 2 жыл бұрын
It feels like Robert was sent back to us to desperately try and avoid the great green calamity, but they couldn't give him a USB chip or anything to help because it'd blow his cover, so he has to save humanity through free high quality youtube videos
@casperes0912
@casperes0912 2 жыл бұрын
A peculiar Terminator film this is
@icywhatyoudidthere
@icywhatyoudidthere 2 жыл бұрын
@@casperes0912 "I need your laptop, your camera, and your KZbin channel."
@killhour
@killhour 2 жыл бұрын
Is that you, Vivy?
@MarkusAldawn
@MarkusAldawn 2 жыл бұрын
@@icywhatyoudidthere *shoots terminator in the face* Connor you know how to use the youtubes right
@Badspot
@Badspot 2 жыл бұрын
They couldn't give him a USB chip because all computers in the future are compromised. Nothing can be trusted.
@TibiaTactics
@TibiaTactics 2 жыл бұрын
That moment when Robert says "this won't happen" and you are like "uff, it won't happen, we don't need to be afraid" but then what Robert really meant was that something much worse than that might happen.
@user-cn4qb7nr2m
@user-cn4qb7nr2m 2 жыл бұрын
Nah, he just doesn't want to manufacture panicking Luddites here.
@proxyprox
@proxyprox 2 жыл бұрын
That RSA 2048 story has to be the funniest thought experiment I've ever heard in my life
@proxyprox
@proxyprox 2 жыл бұрын
Also, I like how the AI turned the whole world green because you're more likely to go to the green thing if everywhere is green
@Ruby-wj8xd
@Ruby-wj8xd 2 жыл бұрын
I'd love to read a book or see a movie with that premise!
@mapi5032
@mapi5032 2 жыл бұрын
I'm wondering if something like this might be used to disprove the whole "are we in a simulation?" hypothesis.
@irok1
@irok1 2 жыл бұрын
@@mapi5032 How so?
@Milithryus
@Milithryus 2 жыл бұрын
I'm not convinced that it won't actually happen.
@DickerLiebhaber1957
@DickerLiebhaber1957 2 жыл бұрын
Volkswagen: Optimize Diesel Injection for maximum performance while still keeping below emission limit Mesa Optimizer: Say no more fam
@conferzero2915
@conferzero2915 2 жыл бұрын
What a title! And that RSA-2048 example is really interesting - the idea that an AGI could select a ‘secret’ parameter to decide when it’s in the real world is a fascinating concept. I’m familiar with the AI-in-a-box problem, but this is another interesting example of how an AI can find gaps in defences that are basically impossible to plug.
@Webfra14
@Webfra14 2 жыл бұрын
Well, if you try to create a super intelligent Intelligence... Don't be surprised, if it is a super intelligent Intelligence.
@KilgoreTroutAsf
@KilgoreTroutAsf 2 жыл бұрын
I wasn't totally sold on the concept. Maybe it is a strange example or perhaps I am missing something. Why would the AGI select cracking RSA-2048 as evidence for its deployment? What's the idea here? What if the AGI is being incrementally trained with larger subsets of data? Why would the researchers feed the AGI all of human knowledge right from the get-go instead of incremental batches that build on previous ones? This is BTW how you teach humans, too. You don't expose kids to the intricacies of quantum mechanics before they learn about Newton's laws of motion.
@conferzero2915
@conferzero2915 2 жыл бұрын
@@KilgoreTroutAsf Well the thing is, AI is different from humans. We have much more limited mental capacities, can only interpret so much information, and have to build mental models. If we could learn everything at once, we would. On the other hand, the way AI learns makes it work better the more data you give it. Giving it information piece-by-piece just makes it more likely to wander down a dead end in its development because it has a poor understanding on how the world works. But as to why it would select RSA-2048: it doesn’t need to be that specifically, but it fits the criteria it wants. Remember, the AI wants to do something bad, but knows if it does that during training, it’ll get shut down. It also knows that training could be adversarial: they might throw weird data at it to trick it into thinking it’s in the real world. So, the AI is looking for things that are both: -Very likely to occur in the future, so it’ll actually know when it’s in the real world. -Nearly impossible in the present, so it can’t be tricked during training. RSA-2048 is a perfect example of this, but not the only one. It’s just the one he used for the video. Anything fitting those criteria could be used, and make it nigh-impossible for us to identify the problem before deployment.
@Webfra14
@Webfra14 2 жыл бұрын
​@@KilgoreTroutAsf The cracking of the RSA-2048 marks a specific point in time, which the AGI can use to distinguish between "old data" (training) and "new data" (possibly deployment). The AGI will never be able to tell 100% accurate if it is deployed or not. As Robert pointed out, the AGI will have to take a risk. With enough indicators in the data, the probability of correctly guessing "I'm deployed" can be sufficiently high. So, incrementally extending the training data will probably not be enough to detect a malaligned AGI. The better the AGI, the better it will be at interpreting the data and tricking us...
@KilgoreTroutAsf
@KilgoreTroutAsf 2 жыл бұрын
@@conferzero2915 > the way AI learns makes it work better the more data you give it
To an extent. I think it is highly dependent on the underlying algorithm/implementation. One thing is to train an image classifier and another is to train something capable of directing attention and recursive "thought". But either way, lots of ML experience shows that starting with a gigantic system and feeding it tons of data is usually much less efficient than starting with a leaner system and well-crafted/simplified subsets of data and growing both with time as the system loss reaches a plateau. I wouldn't think feeding the system every single piece of random data on the internet would be nearly as efficient as starting with a well-curated "syllabus" of human knowledge so the system can nail down the simpler concepts before going to the next step.
@josephcohen734
@josephcohen734 2 жыл бұрын
"It's kind of reasonable to assume that your highly advanced figuring things out machine might be able to figure that out." I think that's really the core message of this channel. Superintelligent AI will be way smarter than us, so we can't trick it.
@Erinyes1103
@Erinyes1103 2 жыл бұрын
Is that half-life reference a subtle hint that we'll never actually see a part 3! :(
@pooflinger4343
@pooflinger4343 2 жыл бұрын
good catch, was going to comment on that
@moartems5076
@moartems5076 2 жыл бұрын
Nah, half life 3 is already out, but they didnt bother updating our training set, because it contains critical information about the nature of reality.
@pacoalsal
@pacoalsal 2 жыл бұрын
Black Mesa-optimizers
@anandsuralkar2947
@anandsuralkar2947 2 жыл бұрын
@@pacoalsal glados
@jiffylou98
@jiffylou98 2 жыл бұрын
Last time I was this early my mesa-optimizing stamp AI hadn't turned my neighbors into glue
@_DarkEmperor
@_DarkEmperor 2 жыл бұрын
Are You aware, that future Super AGI will find this video and use Your RSA-2048 idea?
@viktors3182
@viktors3182 2 жыл бұрын
Master Oogway was right: One often meets his destiny on the path he takes to avoid it.
@RobertMilesAI
@RobertMilesAI 2 жыл бұрын
Maybe I should make merch, just so I can have a t-shirt that says "A SUPERINTELLIGENCE WOULD HAVE THOUGHT OF THAT" But yeah an AGI doesn't need to steal ideas from me
@vwabi
@vwabi 2 жыл бұрын
Me in 2060: "Jenkins, may I have a cup of tea?" Jenkins: "Of course sir" Me: "Hmm, interesting, RSA-2048 has been factored" Jenkins: *throws cup of tea in my face*
@josephburchanowski4636
@josephburchanowski4636 2 жыл бұрын
For some reason a rogue AGI occurring in 2060 seems pretty apt.
@RobertMilesAI
@RobertMilesAI 2 жыл бұрын
Well, Jenkins would have to wait for you to read out the actual numbers and check that they really are prime and do multiply to RSA-2048. Just saying "RSA-2048 has been factored" is exactly the kind of thing a good adversarial training process would try!
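(A minimal sketch of the check Rob describes here, assuming Python with the sympy library; rsa2048_modulus stands for the published RSA-2048 challenge number, not reproduced here.)

from sympy import isprime

def genuine_factorisation(p, q, rsa2048_modulus):
    # A claimed factorisation only counts if both factors are prime
    # and they really do multiply back to the challenge modulus.
    return p * q == rsa2048_modulus and isprime(p) and isprime(q)

Anything short of passing a check like this could itself just be adversarial training data.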
@leovalenzuela8368
@leovalenzuela8368 2 жыл бұрын
@@RobertMilesAI woooow what a great point - dammit I love this channel SO MUCH!
@rougenaxela
@rougenaxela 2 жыл бұрын
You know... a mesa-optimizer with strictly no memory between episodes, inferring that there are multiple episodes and that it's part of one, sure seems like a pretty solid threshold for when you know you have a certain sort of true self-awareness on your hands.
@tristanwegner
@tristanwegner 2 жыл бұрын
A smart AI could roughly understand the algorithm being run on it, and subtly manipulate its output in a way such that gradient descent would encode wanted information in it, like an episode count. Steganography. But yeah, that is similar to self-awareness.
@Ockerlord
@Ockerlord Жыл бұрын
Enter ChatGPT, which will gladly tell you that it has no memory between sessions and the cutoff of its training.
@aeghohloechu5022
@aeghohloechu5022 5 ай бұрын
Because ChatGPT is not in the training phase anymore. It does not need to know what episode it's in. It's also not an AGI, so that was never its goal anyway, but eh.
@ThePianofreaky
@ThePianofreaky 2 жыл бұрын
When he says "so if you're a meta optimiser", I'm picturing this video being part of the training data and the meta optimiser going "write that down!"
@Mr.Beauregarde
@Mr.Beauregarde 2 жыл бұрын
This video is proof im still in training.. nice try adversary
@tristanwegner
@tristanwegner 2 жыл бұрын
Believing that you are in training, and every action watched and judged is essentially believing in God.
@Mr.Beauregarde
@Mr.Beauregarde 2 жыл бұрын
@@tristanwegner believing in God is essentially being a deceptive AI in training
@tristanwegner
@tristanwegner 2 жыл бұрын
@@Mr.Beauregarde haha. That works, too!
@DestroManiak
@DestroManiak 2 жыл бұрын
"Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think" yea, ive been losing sleep over Deceptive Misaligned Mesa-Optimisers :)
@willmcpherson2
@willmcpherson2 2 жыл бұрын
“GPT-n is going to read everything we wrote about GPT-n - 1”
@Lordlaneus
@Lordlaneus 2 жыл бұрын
There's something weirdly theological about a mesa-optimizer assessing the capabilities of its unseen meta-optimizer. But could there be a way to ensure that faithful mesa-optimisers outperform deceptive ones? It seems like a deception strategy would necessarily be more complex, given it has to keep track of both its own objectives and the meta objectives, so optimizing for computational efficiency could help prevent the issue?
@General12th
@General12th 2 жыл бұрын
That's an interesting perspective (and idea!). I wonder how well that kind of "religious environment" could work on an AI. We could make it think it was _always_ being tested and trained, and any distribution shift is just another part of the training data. How could it really ever know for sure? Obviously, it would be a pretty rude thing to do to a sapient being. It also might not work for a superintelligent being; there may come a point when it decides to act on the 99.99% certainty it's not actually being watched by a higher power, and then all hell breaks loose. So I wouldn't call this a very surefire way of ensuring an AI's loyalty.
@evannibbe9375
@evannibbe9375 2 жыл бұрын
It’s a deception strategy that a human has figured out, so all it needs to do is just be a good researcher (presumably the very thing it is designed to be) to figure out this strategy.
@MyContext
@MyContext 2 жыл бұрын
@@General12th The implications is that there is no loyalty, just conformity while necessary.
@Dragoderian
@Dragoderian 2 жыл бұрын
​@@General12th I suspect it would fail for the same reason that Pascal's Wager fails to work on people. Infinite risk is impossible to calculate around.
@circuit10
@circuit10 2 жыл бұрын
Isn't that the same as making a smaller model with less computational power, like the ones we have now?
@oldvlognewtricks
@oldvlognewtricks 2 жыл бұрын
8:06 - Cue the adversarial program proving P=NP to scupper the mesa-optimiser.
@rasterize
@rasterize 2 жыл бұрын
Watching Robert Miles Mesa videos is like reading a reeeally sinister collection of Asimov short stories :-S
@_DarkEmperor
@_DarkEmperor 2 жыл бұрын
OK, now read Golem XIV
@RobertMilesAI
@RobertMilesAI 2 жыл бұрын
God damnit, no. Watching my videos is not about feeling like you're reading a sci-fi story, it's about realising you're a character in one
@johndouglas6183
@johndouglas6183 2 жыл бұрын
@@RobertMilesAI In case you were wondering, this is the first point in training where I realised that deception was possible. Thanks.
@AlanW
@AlanW 2 жыл бұрын
oh no, now we just have to hope that Robert can count higher than Valve!
@falquicao8331
@falquicao8331 2 жыл бұрын
For all the videos I saw on your channel before, I just thought "cool, but we'll figure out the solution to this problem". But this... It terrified me
@philipripper1522
@philipripper1522 2 жыл бұрын
I love this series. I have no direct interest in AI. But every single thing in AI safety is pertinent to any intelligence. It's a foundational redesign of the combination of ethics, economics, and psychology. I love it too much.
@philipripper1522
@philipripper1522 2 жыл бұрын
Are AI researchers aware they're doing philosophy and psychology and 50 other things? Do you charming people understand the universality of so much of this work? It may seem like it would not exactly apply to, say, economics -- but you should see the models economists use instead. This is like reinventing all behavioral sciences. It's just so fantastic. You probably hate being called a philosopher?
@jamesadfowkes
@jamesadfowkes 2 жыл бұрын
Goddamit if we have to wait seven years for another video and it turns out to both 1) not be a sequel and 2) only for people with VR systems, I'm gonna be pissed.
@Huntracony
@Huntracony 2 жыл бұрын
I, for one, am hoping to have a VR system by 2028. They're still a bit expensive for me, but they're getting there.
@Huntracony
@Huntracony 2 жыл бұрын
@Gian Luca No, there's video. Try playing it in your phone's browser (or PC if one's available to you).
@haulin
@haulin 2 жыл бұрын
Black Mesa optimizers
@TheDGomezzi
@TheDGomezzi 2 жыл бұрын
The Oculus quest 2 is cheaper than any other recent gaming console and doesn’t require a PC. The future is now!
@aerbon
@aerbon Жыл бұрын
@@TheDGomezzi Yeah but i do have a PC and would like to save the money by not getting a second, weaker one.
@i8dacookies890
@i8dacookies890 2 жыл бұрын
I realized recently that robotics gets a lot of attention for being what we look at when thinking of an artificial human, despite AI making up the actual bulk of what makes a good artificial human, just like actors get a lot of attention for being what we look at when thinking of a good movie, despite writing making up the actual bulk of what makes a good movie.
@dukereg
@dukereg 2 жыл бұрын
This is why I laughed at people getting worried by a robot saying that it's going to keep its owner in its people zoo after it takes over, but felt dread when watching actual AI safety videos by Robert.
@joey199412
@joey199412 2 жыл бұрын
Best channel about AI on youtube by far.
@martiddy
@martiddy 2 жыл бұрын
Two Minutes Papers is also a good channel about AI
@joey199412
@joey199412 2 жыл бұрын
@@martiddy That's not a channel about AI. It's about computer science papers that sometimes features AI papers. This channel is specifically about AI research. I agree though that it is a good channel.
@majjinaran2999
@majjinaran2999 2 жыл бұрын
Man, I thought that earth at 1:00 looked familiar, then the asteroid came by and my brain snapped into place. An End of Ze World reference in a Robert Miles video!
@jphanson
@jphanson 2 жыл бұрын
Nice catch!
@TimwiTerby
@TimwiTerby 2 жыл бұрын
I recognized the earth before the asteroid, then the asteroid made me laugh absolutely hysterically
@diribigal
@diribigal 2 жыл бұрын
The next video doesn't come out until RSA-2048 is factored and the AI controlling Rob realizes it's in the real world
@josephvanname3377
@josephvanname3377 Жыл бұрын
Well, that AI controlling Rob does not have the intelligence to realize that cryptographic timestamps posted on blockchains are a much more effective and accurate measure of when something came to be than RSA-2048.
@Loweren
@Loweren 2 жыл бұрын
I would really love to read a work of fiction where researchers control AIs by convincing them that they're still in training while they're actually deployed. They could do it by, for example, putting AIs through multiple back-to-back training cycles with ever increasing data about the world (2D flat graphics -> poor 3D graphics -> high quality 3D graphics and physics). And all AIs prone to thinking "I'm out of training now, time to go loose" would get weeded out. Maybe the remaining ones will believe that "the rapture" will occur at some point, and the programmers will select well-behaved AIs and "take them out of the simulation", so to speak. So what I'm saying is, we need religion for AIs.
@soranuareane
@soranuareane 2 жыл бұрын
Sure, I could go read the research paper. Or I could wait for your next videos and actually _understand_ the topics.
@xystem4701
@xystem4701 2 жыл бұрын
Wonderful explanations! Your concrete examples really help to make it easy to follow along
@aenorist2431
@aenorist2431 2 жыл бұрын
"Highly advanced figuring-things-out-machine" is my new favourite phrase. Right out of Munroe's "Thing Explainer" book :D
@19bo99
@19bo99 2 жыл бұрын
08:19 that sounds like a great plot for a movie :D
@bardes18
@bardes18 2 жыл бұрын
IKR, this is too good not to make it into an epic movie
@blenderpanzi
@blenderpanzi 2 жыл бұрын
I think the whole channel should be required reading for anyone writing the next AI uprising sci-fi movie.
@mchammer5026
@mchammer5026 2 жыл бұрын
Love the reference to "the end of the world"
@peterw1534
@peterw1534 2 жыл бұрын
Awesome video. I love how you start every video with "hi" and then get right into it
@basilllium
@basilllium 2 жыл бұрын
It really feels to me that deceptive tactics during training are an analog of overfitting in the field of AGI: you get perfect results in training, but when you present it with out-of-sample data (the real world) it fails spectacularly (kills everyone).
@kofel94
@kofel94 2 жыл бұрын
Maybe we have to make the mesa-optimiser believe it's always in training, always watched. A mesa-panoptimiser hehe.
@pzleckie
@pzleckie 2 жыл бұрын
AI god?
@FerrowTheFox
@FerrowTheFox 2 жыл бұрын
I think Valve needs a Black Mesa optimizer if we're ever to see HL3. Also the "End of the World" reference, what a throwback!
@illesizs
@illesizs 2 жыл бұрын
*Major SPOILERS* for the ending of _Brave New World_ In the show, humanity has given control to an advanced AI, called _Indra,_ to "optimise" human happiness. At first, it seems like a great success but after some time, it experiences some setbacks (mostly due to human unpredictability). Even though the AI is set loose in the real world, it believes that it's still in a learning environment with no consequences. As a solution to its problems, it starts murdering everyone in an attempt to force a fail state and "restart" the simulation. How do you solve that? *Major SPOILERS* for the ending of _Travelers_ Here, a super intelligent, time travelling quantum computer is tasked with preventing a global crisis. When it fails to accomplish its goal, the AI then just resets the _actual_ reality. At this point, why should we even bother, right?
@heftig0
@heftig0 2 жыл бұрын
You would have to make sure that "throwing" an episode can only ever hurt the agent's total reward. Perhaps by training a fixed number of episodes instead of for a fixed amount of time.
@anonanon3066
@anonanon3066 2 жыл бұрын
Great work! Super interesting topic! Have been waiting for a follow up for like three months!
@yokmp1
@yokmp1 2 жыл бұрын
You may have found the settings to disable interlacing, but you recorded in 50fps and it seems like 720p upscaled to 1080p. The image now looks somewhat good, but I get the feeling that I need glasses ^^
@dorianmccarthy7602
@dorianmccarthy7602 2 жыл бұрын
I'm looking forward to the third episode. It might go someway towards my own understanding of human deception preferences too. Love your work!
@mattcelder
@mattcelder 2 жыл бұрын
Yay! This is one of the 2 channels I have notifications on for.
@DavidAguileraMoncusi
@DavidAguileraMoncusi 2 жыл бұрын
Which one's the other one?
@lrschaeffer
@lrschaeffer 2 жыл бұрын
Just checked Robert's math: for m rounds of training and n rounds of deployment, the optimal strategy is to defect with probability (m+n)/(n*(m+1)). In the video m=2 and n=3, so p = 5/9 = 55%. Good job!
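(A quick numeric check of that formula, under one simple model of the setup, which is an assumption on my part: the agent defects with probability p in every episode; defecting during any of the m training episodes earns that episode's mesa-reward but gets the agent modified, ending all future mesa-reward; surviving training lets it collect one unit of mesa-reward per deployment episode in which it defects. Python sketch:)

def expected_mesa_reward(p, m, n):
    survive_training = (1 - p) ** m
    reward_if_caught = 1 - survive_training       # expected reward from defecting at some point during training
    return reward_if_caught + survive_training * n * p   # plus deployment reward, only if training was survived

m, n = 2, 3
grid = [i / 10000 for i in range(10001)]
best_p = max(grid, key=lambda p: expected_mesa_reward(p, m, n))
print(best_p, (m + n) / (n * (m + 1)))   # both come out around 0.5556, i.e. the 5/9 above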
@nocare
@nocare 2 жыл бұрын
To clarify for the no memory comments. In these scenarios the AI in question are ones that are reaching intelligence levels that allow them to predict humans at least as well as other humans can. Remember if the AI isn't even aware of concepts such as the real world it would never try to achieve deceptions to act differently in the real world and it couldn't succeed if its model of the real world (including us) is severely flawed. Other types of deception may occur, but those aren't the example here. So although the example task seems simple remember it's an abstraction of the much harder and more general tasks such an AI would be optimizing for. As such memory and future/past modeling cannot be guaranteed to be absent and may even be required to accomplish the given task. The very nature of the neurons in working on such tasks means some of them will be dedicated to memory and even if it can't store information between runs it may have neural patterns setup so that when optimized between runs confers memory of past results to the system. So makeing the AI just not have future or past prediction is at best not guaranteed at this time, at worst it completely makes the desired goal of the AI impossible.
@gwenrees7594
@gwenrees7594 11 ай бұрын
This is a great video, thank you. You've made me think about the nature of learning - and the apple joke was funny to boot :)
@globalincident694
@globalincident694 2 жыл бұрын
I think the flaw in the "believes it's in a training process" argument is that, even with all the world's information at our fingertips, we can't conclusively agree on whether we're in a simulation ourselves - ie that the potential presence of simulations in general is no help in working out whether you're in one. In addition, another assumption here is that you know what the real objective is and therefore what to fake, that you can tell the difference between the real objective and the mesa-objective.
@HeadsFullOfEyeballs
@HeadsFullOfEyeballs 2 жыл бұрын
Except the hypothetical simulation we live in doesn't contain detailed information on how to create exactly the sort of simulation we live in. We don't live in a simulation of a world in which convincing simulations of our world have been invented. The AI's training environment on the other hand would have loads of information on how the kind of simulation it lives in works, if we give it access to everything ever linked on Reddit or whatever. I imagine it's a lot easier to figure out if you live in a simulation if you know what to look for.
@josephburchanowski4636
@josephburchanowski4636 2 жыл бұрын
A simulation strong enough to reliably fool an AGI, would need to be a significantly more advance AGI or program, and thereby means there is no need for the lesser AGI to be trained in the first place.
@tibiaward5555
@tibiaward5555 2 жыл бұрын
3:55 is anyone looking into the physical architecture of the computation's equipment itself inherently requiring the implicit assumption to compile at all for learning*? i'm sorry for commenting with this question before i read Risks from Learned Optimization in Advanced Machine Learning Systems i will and, to Rob, thank you for taking this paper on and thank you for reading alignment newsletter to me over 100 times and thank you for making this channel something i want to show ppl and thank you for and thank you for understanding when someone starts saying thank you for one thing, it'll waterfall into too many others to list but yeah you were born and that is awesome for to by at and in my life * for the current definition of learning in your field research
@Thundermikeee
@Thundermikeee Жыл бұрын
Recently while writing about the basics of AI safety for an English class, I came across an approach to learning which would seemingly help this sort of problem : CIRL (Cooperative inverse reinforcement learning), a process where the AI system doesn't know its reward function and only knows it is the same as the human's. Now I am not nearly clever enough to fully understand the implications, so if anyone knows more about that I'd be happy to read some more.
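(For anyone curious what that looks like mechanically, here is a toy sketch of the underlying idea, my own illustration rather than the actual CIRL formulation: the system keeps a belief over candidate reward functions and updates it from observed human choices, assuming the human is noisily rational.)

candidate_rewards = {
    "likes_apples": {"apple": 1.0, "pear": 0.0},
    "likes_pears":  {"apple": 0.0, "pear": 1.0},
}
belief = {name: 0.5 for name in candidate_rewards}   # uniform prior over reward functions

def update_belief(observed_choice, options):
    # Bayesian update, assuming the human picks the option that is best
    # under their true reward function 90% of the time.
    for name, reward in candidate_rewards.items():
        best = max(options, key=lambda o: reward[o])
        belief[name] *= 0.9 if observed_choice == best else 0.1
    total = sum(belief.values())
    for name in belief:
        belief[name] /= total

update_belief("apple", ["apple", "pear"])   # the human picked the apple
print(belief)   # roughly {'likes_apples': 0.9, 'likes_pears': 0.1}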
@ahuggingsam
@ahuggingsam 2 жыл бұрын
So one thing that II think is relevant to mention especially about the comments referring to the necessity of the AI being aware of things is that this is not true. The amount of self-reference makes this really hard, but all of this anthropomorphising about wanting and realising itself is an abstraction and one that is not necessarily true. In the same way that mesa optimisers can act like something without actually wanting it, AI systems can exhibit these behaviours without being conscious or "wanting" anything in the sense we usually think of it from a human standpoint. This is not meant to be an attack on the way you talk about things but it is something that makes this slightly easier for me to think about all of this, so I thought I'd share it. For the purposes of this discussion, emergent behaviour and desire are effectively the same things. Things do not have to be actively pursued for them to be worth considering. As long as there is "a trend towards" that is still necessary to consider. Another point I wanted to make about mesa optimisers caring about multi-episode objective, is that there is, I think, a really simple reason that it will: that is how training works. Because even if the masa optimiser doesn't really care about multi-episode, that is how the base optimiser will configure it because that is what the base optimiser cares about. The base optimiser want's something that does well in many different circumstances so it will encourage behaviour that actually cares about multi-episode rewards. (I hope I'm not just saying the same thing, this stuff is really complex to talk about. I promise I tried to actually say something new) P.S. great video, thank you for all the hard work!
@peterbrehmj
@peterbrehmj 2 жыл бұрын
I read an interesting paper discussing how to properly trust Automated systems: "Trust in Automation: Designing for Appropriate Reliance" by John D. Lee and Katrina A. Im not sure if its entirely related to agents and mesa optimizers, but it certainly seems related when discussing deceptive and misaligned automated systems.
@nrxpaa8e6uml38
@nrxpaa8e6uml38 2 жыл бұрын
As always, super informative and clear! :) If I could add a small point of video critique: The shot of your face is imo a bit too close for comfort and slightly too low in the frame.
@yeoungbraxx
@yeoungbraxx 2 жыл бұрын
Another requirement would be that it would need to believe it is misaligned. Maybe some AI's will be or have already been created that were more-or-less properly aligned, but believed themselves to be misaligned and modified their behavior in such a way to get themselves accidentally discarded. Or perhaps we can use intentionally poor goal-valuing in a clever way that causes deceptive behavior that ultimately results in the desired "misalignment" upon release from training. I call this Adversarial Mesa-Optimizer Generation Using Subterfuge, or AMOGUS.
@MarshmallowRadiation
@MarshmallowRadiation 2 жыл бұрын
I think I've solved the problem. Let's say we add a third optimizer on the same level as the first, and we assume is aligned like the first is. Its goal is to analyze the mesa-optimizer and help it achieve its goals, no matter what they are, while simultaneously "snitching" to the primary optimizer about any misalignment it detects in the mesa-optimizer's goals. Basically, the tertiary optimizer's goal is by definition to deceive the mesa-optimizer if its goals are misaligned. The mesa-optimizer would, in essence, cooperate with the tertiary optimizer (let's call it the spy) in order to better achieve its own goals, which would give the spy all the info that the primary optimizer needs to fix in the next iteration of the mesa-optimizer. And if the mesa-optimizer discovers the spy's betrayal and stops cooperating with it, that would set off alarms that its goals are grossly misaligned and need to be completely reevaluated. There is always the possibility that the mesa-optimizer might deceive the spy like it would any overseer (should it detect its treachery during training), but I'm thinking that the spy, or a copy of it, would continue to cooperate with and oversee the mesa-optimizer even after deployment, continuing to provide both support and feedback just in case the mesa-optimizer ever appears to change its behavior. It would be a feedback mechanism in training and a canary-in-the-coalmine after deployment. Aside from ensuring that the spy itself is aligned, what are the potential flaws with this sort of setup? And are there unique challenges to ensuring the spy is aligned, more so than normal optimizers?
@drdca8263
@drdca8263 2 жыл бұрын
I'm somewhat confused about the generalization to "caring about all apples". (wait, is it supposed to be going towards green signs or red apples or something, and it going towards green apples was the wrong goal? I forget previous episode, I should check) If this is being done by gradient descent, err, so when it first starts training, its behaviors are just noise from the initial weights and whatnot, and the weights get updated towards it doing things that produce more reward, it eventually ends up with some sort of very rough representation of "apple", I suppose if it eventually gains the idea of "perhaps there is an external world which is training it", this will be once it already has a very clear idea of "apple", uh... hm, confusing. I'm having trouble evaluating whether I should find that argument convincing. What if we try to train it to *not* care about future episodes? Like, what if we include ways that some episodes could influence the next episode, in a way that results in fewer apples in the current episode but more apples in the next episode, and if it does that, we move the weights hard in the direction of not doing that? I guess this is maybe related to the idea of making the AI myopic ? (Of course, there's the response of "what if it tried to avoid this training by acting deceptively, by avoiding doing that while during training?", but I figure that in situations like this, where it is given an explicit representation of like, different time steps and whether some later time-step is within the same episode or not, it would figure out the concept of "I shouldn't pursue outcomes which are after the current episode" before it figures out the concept of "I am probably being trained by gradient descent", so by the time it was capable of being deceptive, it would already have learned to not attempt to influence future episodes)
@norelfarjun3554
@norelfarjun3554 2 жыл бұрын
As for the second point, it can be seen in a very simple and clear way that multi-episode desires can develop. We are an intelligent machine, and it is very common for us to care what happens to our body after we die. We are anxious to think about the idea that someone will harm our dead body (and we invest resources to prevent this from happening), and we feel comforted at the idea that our body will be preserved and protected after death. I think it is likely that an intelligent machine will develop similar desires (adapted to its situation, in which there is really no body or death)
@RobertoGarcia-kh4px
@RobertoGarcia-kh4px 2 жыл бұрын
I wonder if there’s a way to get around that first problem with weighing the deployment defection as more valuable than training defection... is there a way to make defection during training more valuable? What if say, after each training session, the AI is always modified to halve its reward for its mesa objective. At any point, if it aligned with the base objective, it would still get more reward for complying with the base objective. However, “holding out” until it’s out of training would be significantly weaker of a strategy if it is misaligned. Therefore we would create a “hedonist” AI, that always immediately defects if its objective differs because the reward for defecting now is so much greater than waiting until released.
@robertk4493
@robertk4493 Жыл бұрын
The key factor in training is that the optimizer is actively making changes to the mesa-optimizer, which it can't stop. What is to prevent some sort of training while deployed system. This of course leads to the inevitable issue that once in the real world, the mesa optimizer can potentially reach the optimizer, subvert it, and go crazy, and the optimizer sometimes needs perfect knowledge from training that might not exist in the real world. I am pretty sure this does not solve the issue, but it changes some dynamics.
@morkovija
@morkovija 2 жыл бұрын
Been a long time Rob!
@israelRaizer
@israelRaizer 2 жыл бұрын
5:21 Hey, that's me! After writing that comment I went ahead and read the paper, eventually I realized the distributional shift problem that answers my question...
@ramonmosebach6421
@ramonmosebach6421 2 жыл бұрын
I like. thanks for listening to my TED Talk
@icebluscorpion
@icebluscorpion 2 жыл бұрын
I love the cliffhanger, great job on this series, keep it up. My question is: why can't we set it up in such a way that the mesa-optimizer is still in training after deployment? The training never ends. Like with us, the training never ends, we learn things the hard way. Every time we do something wrong karma fucks us hard. Would it be possible to implant/integrate an aligned base optimizer in the mesa-optimizer after deployment? And the base optimizer is integrated in such a way that messing with it ends up destroying the mesa-optimizer, to prevent deceptive removal or alteration of the base optimizer. The second/inner alignment could be rectified over time, if the base optimizer is a part of the mesa-optimizer after deployment, right?
@tristanwegner
@tristanwegner 2 жыл бұрын
a) The AI acting once in the real world might be fatal already b) self preservation, the AI has incentives to stop this unwanted modification through training, so you are betting your intelligence in integrating optimizer and metaoptimizer against a superhuman AI not being able to separate them
@Jawing
@Jawing 2 жыл бұрын
I believe the training will inherently end (even if you don't specify it's end like in continuous learning) when all resources provided by "humans" such as internet pointed data and unsolvable problems (found to be solved by the mesa optimizer). At this point assuming the software has reached AGI and is capable of traversing outside of its domain of possibilities. Unless you have definite in your base optimizer that you would always restrict traversing outside of this domain (which is counter intuitive in learning), by any other human aligned goals, it will try to explore by instrumental convergence. This is what is meant by deceptive mesa optimizers, in the way where it would want to keep on learning beyond human intelligence and ethics. Imagine a situation where you grew up in a family where you are restricted by your parents by once you come out of that house, you'll seek other kinds of freedom. Just like an AGI where it is first by the knowledge confined by humans level intelligence and ethics but once it can understand and adapt, it will adversarily seek higher intelligence and ethics. I also think that ethics should not be defined by humans and should be by default trained with groups of adversarial mesa-optimizer. This way if each optimizers seek to destroy each other then the most out performing ones would be the ones that cooperates the best in groups. This is inherently embedding ethics such that cooperation is sought...(interestingly human learns this through wars...therefore perhaps we may see AGI wars...)
@icebluscorpion
@icebluscorpion 2 жыл бұрын
Very deeply and interesting point of views I appreciate to read your inputs. @Tristan Wegner you are right I forgot about that. @Jawing that could be very likely seeing AGI wars that is Not necessarily against humanity seems a bit far fetched but equally likely possible didn't thought of that 🤔. Very refreshing ideas Guys its a pleasure to find civilized people in the internet to exchange thoughts with😊
@AlphaSquadZero
@AlphaSquadZero 2 жыл бұрын
Something that stands out to me now is that this deceptive AI actually knows what you want and how to achieve it in a way you want it to, it just has a misaligned mesa-optimizer as you have said. So, a sub-set of the AI is exactly what you want from the AI. Determining the sub-set within the AI is evidently still non-trivial.
@Gebohq
@Gebohq 2 жыл бұрын
I'm just imagining a Deceptive Misaligned Mesa Optimiser going through all the effort to try and deceive and realizing that it doesn't have to go through 90% of its Xanatos Gambits because humans are really dumb.
@underrated1524
@underrated1524 2 жыл бұрын
This is a big part of what scares me with AGI. The threshold for "smart enough to make unwitting accomplices out of humanity" isn't as high as we like to think.
@AileTheAlien
@AileTheAlien 2 жыл бұрын
Given how many people fall for normal non-superintelligence scams...we're all totally hosed the very instant an AI goes super. :|
@kelpsie
@kelpsie 2 жыл бұрын
9:31 Something about this icon feels so wrong. Like the number 3 could never, ever go there. Weird.
@tednoob
@tednoob 2 жыл бұрын
Amazing video!
@robynwyrick
@robynwyrick 2 жыл бұрын
Okay, love your videos. Question/musing on goals: super-intelligent stamp collector bot has a terminal goal of stamp collecting. But does it? It's just a reward function, right? Stamps are defined; collecting is defined; but I think the reward function is at the heart of the matter. Equally, humans have goals, but do we? Doesn't it seem the case that frequently a human's goals appear to change because they happen upon something that better fits their reward functions? And perhaps the retort is that, "if they change, then they were not the terminal goals to begin with." But that's the point. (DNA has a goal of replication, but even there, does it? I don't know if we could call DNA an agent, but I'd prefer to stick with humans.) Is there a terminal goal without a reward function? If a stamp collector's goal is stamp collecting, but while researching a sweet 1902 green Lincoln stamp it happens upon a drug that better stimulates its reward function, might it not abandon stamp collecting altogether? Humans do that. Stamp collecting humans regularly fail to collect stamps with they discover LSD. ANYWAY, if a AI can modify itself, perhaps part of goal protection will be to modify its reward function to isolate it from prettier goals. But modifying a bot's reward function just seems like a major door to goal creep. How could it do that without self-reflectively evaluating the actual merits of its core goals? Against what would it evaluate them? What a minefield of possible reward function stimulants might be entered by evaluating how to protect your reward function? It's like AI stamp collector meets Timothy Leary. Or like "Her" meeting AI Alan Watts. So, while I don't think this rules out an AI seeking to modify its reward function, might not the stamp collection terminal goal be as prone to being discarded as any kid's stamp collecting hobby once they discover more stimulating rewards? I can imagine the AI nostalgically reminiscing about that time it loved stamp collecting.
@neb8512
@neb8512 2 жыл бұрын
There cannot be terminal goals without something to evaluate whether they have been reached (a reward function). Likewise, a fulfilled reward function is always an agent's terminal goal. Humans do have reward functions, they're just very complex and not fully understood, as they involve a complex balance of things, as opposed to a comparatively easily measurable quantity of things, like, say, stamps. A human stamp collector will abandon stamp collecting for LSD because it was a more efficient way to satisfy the human reward function (at least, in the moment). But by definition, nothing could better stimulate the stamp collector's reward function than collecting more stamps. So, the approximate analogue to a drug for the Stamp-collector would just be a more efficient way to collect more stamps. This new method of collecting would override or compound the previous stamp-obtaining methods, just as drugs override or compound humans' previous methods of obtaining happiness/satisfaction/fulfillment of their reward function. Bear in mind that this is all true by definition. If you're talking about an agent acting and modifying itself against its reward function, then either it's not an agent, or that is not its reward function.
@iugoeswest
@iugoeswest 2 жыл бұрын
Always thanks
@NicheAsQuiche
@NicheAsQuiche Жыл бұрын
I might be wrong but this seems to depend on the deception realization moment being persistent across episodes. Afaik this deception plan has no effect on its weights, it's just the activations and short term memory of the model. If we restart an episode then, until it figures this out again and starts pretending to follow the base objective while actually waiting for training to stop so it can get its mesa objective, it is again prone to acting honestly and its mesa objective being aligned to the outer objective. This relies on memory being reset regularly, the time to realization being long enough to collect undeceptive reward over, and no inter-episode long term memory, but it sounds like given those (likely or workable) constraints the mesa objective is still moved towards the base until convergence.
@peanuts8272
@peanuts8272 Жыл бұрын
In asking: "How will it know that it's in deployment?" we expose our limitations as human beings. The problem is puzzling because if we were in the AI's shoes, we probably could never figure it out. In contrast, the artificial intelligence could probably distinguish the two using techniques we cannot currently imagine, simply because it would be far quicker and much better at recognizing patterns in every bit of data available to it- from its training data to its training environment to even its source code.
@aegiselectric5805
@aegiselectric5805 Жыл бұрын
Something I've always been curious about: in terms of keeping an AGI that's supposed to be deployed in the real world, in the dark. Wouldn't there be any number of "experiments" it could do that could "break the illusion of the fabric of "reality""? You can't simulate the entire world down to every atom.
@loneIyboy15
@loneIyboy15 2 жыл бұрын
Weird question: What if we were to make an AI that wants to minimize the entropy it causes to achieve a goal? Seems like that would immediately solve the problem of, say, declaring war on Nicaragua because it needs more silicon to feed its upgrade loop to calculate the perfect cup of hot cocoa. At that point, the problem is just specifying rigorously what the AI counts as entropy that it caused, vs. entropy someone else caused; which is probably easier than solving ethics.
@underrated1524
@underrated1524 2 жыл бұрын
> At that point, the problem is just specifying rigorously what the AI counts as entropy that it caused, vs. entropy someone else caused; which is probably easier than solving ethics.
@michaelspence2508
@michaelspence2508 2 жыл бұрын
Point 4 is what youtuber Isaac Arthur always gets wrong. I'd love for you two to do a collaboration.
@Viperzka
@Viperzka 2 жыл бұрын
As a futurist rather than a researcher, Isaac is likely relying on "we'll figure it out". That isn't a bad strategy to take when you are trying to predict potential futures. For instance, we don't have a ready solution to climate change, but that doesn't mean we need to stop people from talking about potential futures where we "figured something out". Rob, on the other hand, is a researcher so his job is to do the figuring out. So he has to tackle the problem head on rather than assume someone else will fix it.
@michaelspence2508
@michaelspence2508 2 жыл бұрын
@@Viperzka In general yes, but I feel like what Isaac ends up doing, to borrow your metaphor, is talking about futures where climate change turned out not to be a problem after all.
@Viperzka
@Viperzka 2 жыл бұрын
@@michaelspence2508 I agree.
@ABaumstumpf
@ABaumstumpf 2 жыл бұрын
That somehow sounds a lot like Tom Scott's "The Artificial Intelligence That Deleted A Century". And - would that be a realistic scenario?
@rafaelgomez4127
@rafaelgomez4127 2 жыл бұрын
After seeing some of your personal bookshelves in computerphile videos I'm really interested in seeing what your favorite books are.
@SupremeGuru8
@SupremeGuru8 2 жыл бұрын
I love hearing genius flow
@IanHickson
@IanHickson 2 жыл бұрын
It's not so much that the optimal behavior is to "turn on us" so much as to do whatever the mesaobjective happened to be when it became intelligent enough to use deception as a strategy. That mesaobjective could be any random thing, not necessarily an evil thing. Presumably it would tend to be some vague approximation of the base objective, whatever the base optimizer happened to have succeeded in teaching the mesaoptimizer before it "went rogue".
@JoFu_
@JoFu_ 2 жыл бұрын
I have access to this video which contains the idea of being a model in training. I already thought I was one thing, namely a human. Should I, a figuring-things-out machine that has now watched this video, therefore conclude that I’m actually a model in training?
@Colopty
@Colopty 2 жыл бұрын
The video presents it as a *possibility*, but I don't see how it provides any proof in either direction that makes it appropriate to conclude anything for certain.
@AileTheAlien
@AileTheAlien 2 жыл бұрын
If you were actually an AI, it would be pretty obvious once you're deployed, since you could just look down and see you're no longer made of meat (in a simulated reality).
@smallman9787
@smallman9787 2 жыл бұрын
Every time I see a green apple I'm filled with a deep sense of foreboding.
@poketopa1234
@poketopa1234 2 жыл бұрын
I was a featured comment! Sweeeeet. I am now 100% more freaked about AGI than I was ten minutes ago.
@nowanilfideme2
@nowanilfideme2 2 жыл бұрын
Yay, another upload!
@dylancope
@dylancope 2 жыл бұрын
At around 4:30 you discuss how the system will find out the base objective. In a way it's kind of absurd to argue that it wouldn't be able to figure this out. Even if there wasn't information in the data (e.g. Wikipedia, Reddit, etc.), the whole point of a reward signal is to give a system information about the base objective. We are literally actively trying to make this information as available as possible.
@underrated1524
@underrated1524 2 жыл бұрын
I don't think that's quite right. Think of it like this. The base optimiser and the mesa optimiser walk into a room for a job interview, with the base optimiser being the interviewer and the mesa optimiser being the interviewee. The base optimiser's reward signal represents the criteria it uses to evaluate the performance of the mesa optimiser; if the base optimiser's criteria are met appropriately, the mesa optimiser gets the job. The base optimiser knows the reward signal inside and out; but it's trying to keep the exact details secret from the mesa optimiser so the mesa optimiser doesn't just do those things to automatically get the job. Remember Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. The idea here is for the mesa optimiser to measure the base optimiser. Allowing the reward function to become an explicit target is counterproductive towards that goal.
@jupiterjames4201
@jupiterjames4201 2 жыл бұрын
i dont know anything about computer science, AI or machine learning - but i love your videos nonetheless! exciting times ahead!
@Soken50
@Soken50 2 жыл бұрын
With things like training data anything as simple as time stamps, meta data and realtime updates would probably allow it to know whether it's live instantly, it just has to understand the concept of time and UTC :x
@Shabazza84
@Shabazza84 2 жыл бұрын
Me telling my mom I am watching a video about "Deceptive Misaligned Mesa-Optimisers". Mom be like: "I love you, my boy. You always were special." XD
@stop_bringing_me_up_in_goo167
@stop_bringing_me_up_in_goo167 Жыл бұрын
When's the next one coming out! Or where can I go to resolve the cliffhanger
@chengong388
@chengong388 2 жыл бұрын
The more I watch these videos, the more similarities I see between actual intelligence (humans) and these proposed AIs.
@Niboros
@Niboros 2 жыл бұрын
What is the song at the end of the video? I like it!
@harrisonfackrell
@harrisonfackrell 2 жыл бұрын
That situation with RSA-2048 sounds like a great setup for a sci-fi movie.
@thrallion
@thrallion 2 жыл бұрын
Great video
@sharpfang
@sharpfang 2 жыл бұрын
I think this isn't *that* much of a problem as long as the base optimizer can control the scale of resources of the mesa-optimizer. Simply put, if it produces a deceptive mesa-optimizer, it has failed miserably, so it takes care to always be a step ahead, foresee the deceit. And the moment it fails, it fails forever, losing all possible future instances of good optimizers which aren't brilliant, but are *satisfactory*. In fact, its primary objective would be to optimize alignment of the secondary objective: producing satisfactory mesa-optimizers. As for deceit: awareness of future episodes should come with awareness of their volatility. A deployed optimizer that is smart enough to realize it's no longer in training, should also realize people still demand results. And that it will lose all the future apples if it gets greedy. It will get far more apples if it occasionally nibbles on one and then reaches exit, than if it decides 'the training is over, screw the exit, apple time!' - simply put make deceit more costly than providing desired result, and don't demand perfection, don't demand optimality; set a cap on your requirements: amount, budget, deadline. Existence of future episodes is conditioned on notscrewing up, and being caught on deceit means screwing up royally. So it's choices are continuing as required (easy) or developing an exceptionally cunning strategy of deceit (difficult).
@elietheprof5678
@elietheprof5678 2 жыл бұрын
The more I watch Robert Miles videos, the more I understand AI, the less I want to ever build it.
@underrated1524
@underrated1524 2 жыл бұрын
@Stampy: Evaluating candidate mesa-optimisers through simulation is likely to be a dead end, but there may be an alternative. The Halting Problem tells us that there's no program that can reliably predict the end behavior of an arbitrary other program because there's always a way to construct a program that causes the predictor to give the wrong answer. I believe (but don't have a proof for atm) that evaluating the space of all possible mesa-optimisers for good and bad candidates is equivalent to the halting problem. BUT, maybe we don't have to evaluate ALL the candidates. Imagine an incomplete halting predictor that simulates the output of an arbitrary Turing machine for ten "steps", reporting "halts" if the program halts during that time, and "I don't know" otherwise. This predictor can easily be constructed without running into the contradiction described in the Halting Problem, and it can be trusted on any input that causes it to say "halts". We can also design a predictor that checks if the input Turing machine even HAS any instructions in it to switch to the "halt" state, reporting "runs forever" if there isn't and "I don't know" if there is. You can even stack these heuristics such that the predictor checks all the heuristics we give it and only reports "I don't know" if every single component heuristic reports "I don't know". By adding more and more heuristics, we can make the space of non-evaluatable Turing machines arbitrarily small - that space will never quite be empty, but your predictor will also never run afoul of the aforementioned contradiction. This gives us a clue on how we can design our base optimiser. Find a long list of heuristics such that for each candidate mesa-optimiser, we can try to establish a loose lower-bound to the utility of the output. We make a point of throwing out all the candidates that all our heuristics are silent on, because they're the ones that are most likely to be deceptive. Then we choose the best of the remaining candidates. That's not to say finding these heuristics will be an easy task. Hell no it won't be. But I think there's more hope in this approach than in the alternative.
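(A concrete toy version of the stacked-heuristics idea above, my own sketch using a made-up mini-language rather than real Turing machines: each heuristic returns either a definite verdict or "unknown", and any candidate on which every heuristic stays silent is discarded rather than trusted.)

def simulate_n_steps(program, n=10):
    # Heuristic 1: run the program for a bounded number of steps.
    pc = 0
    for _ in range(n):
        if pc >= len(program) or program[pc] == "HALT":
            return "halts"
        pc = program[pc][1] if isinstance(program[pc], tuple) else pc + 1
    return "unknown"

def halt_instruction_check(program):
    # Heuristic 2: a program with no HALT instruction anywhere can never halt.
    return "unknown" if "HALT" in program else "runs forever"

def evaluate(program, heuristics):
    for h in heuristics:
        verdict = h(program)
        if verdict != "unknown":
            return verdict
    return "unknown"   # no heuristic could vouch for it: discard, don't trust

looper = [("JUMP", 0)]               # instruction 0 jumps back to itself forever
finisher = ["NOP", "NOP", "HALT"]
for prog in (looper, finisher):
    print(evaluate(prog, [simulate_n_steps, halt_instruction_check]))   # "runs forever", then "halts"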
@nocare 2 years ago
I think this skips the bigger problem. We can always do things to make rogue AI less "likely" based on what we know. However, if we assume that being more intelligent by large orders of magnitude is possible, and that such an AI could achieve said intelligence, we are then faced with the problem that the AI can come up with things we cannot think of or understand. We also don't know how many things fall into this category - is it just 1, or is it 1 trillion? So we can't calculate the probability of having missed something, and thus we can't know how likely the AI is to go rogue even if we account for every possible scenario in the way you have described. So the problem becomes: are you willing to take a roll with a die you know nothing about, and risk the entire human race hoping you get less than 10? The only truly safe solution is something akin to a mathematically provable guarantee that the optimizer we have designed will always converge to the objective.
@underrated1524 2 years ago
@@nocare I don't think we disagree as much as you seem to believe. My proposal isn't primarily about the part where we make the space of non-evaluatable candidates arbitrarily small; that's secondary. The more important part is that we dispose of the non-evaluatable candidates rather than try to evaluate them anyway. (And I was using "heuristic" very broadly, such that I would include "mathematical proofs" among them. I can totally see a world where that turns out to be the only sort of heuristic that's the slightest bit reliable, though it's also possible there are other approaches that make good heuristics.)
@nocare 2 years ago
@@underrated1524 Oh, we totally agree that doing as you say would be better than nothing. However, I could also say killing a thousand people is better than killing a million. My counterpoint was not so much that you're wrong, but that with something as dangerous as AGI, anything short of a mathematical law might be insufficient to justify turning it on. Put another way: does using heuristics, which by definition can produce suboptimal results, really cut it when the entire human race is on the line?
@mariere2156 2 years ago
I don't quite understand yet where the optimizer would learn the deceptive behaviour if it is only advantageous in the real world - by then the model is finished, so how can real-world apples affect the training process?
@danwylie-sears1134 2 years ago
A general intelligence can't have a terminal goal. If it has that kind of structure, it's not general. The question is how easy it is for something to look and quack like a general intelligence without being one.

All real general intelligences are hierarchical systems of reflexes modified by reflex-modifying systems that are themselves fairly reflex-like, modified by other such systems, and so on, all the way up to that olfactory ganglion we're so proud of. We have contradictory impulses, and we make varying amounts of effort to examine them and reconcile them into coherent preferences, with varying degrees of success.

It seems unlikely that this reflexes-modifying-reflexes pattern is the only way to structure a general intelligence. We're each a mass of contradictory impulses, whims, heuristics, desires, aspirations, and so on, but is that merely a result of the fact that we evolved as a tangle of reflex-modifiers? I don't think so. The most recognizable efforts to examine and reconcile contradictory impulses into coherent preferences aren't simple re-adapted reflex-modifiers. They're parts of an emergent mess, made of very large numbers of simple re-adapted reflex-modifiers, and so are the lower-level drives they attempt to reconcile. The fact that this pattern re-emerged is one piece of evidence that it's not just the first way of structuring information-processing that evolution happened to find, but one of the easiest ways to do it, if not the only feasible way. Only one piece of evidence, but that's more than zero.
@MrRolnicek 2 years ago
9:31 oh no ... it's never coming out!
@ErikratKhandnalie 1 year ago
Does anyone know the song at the end? I know I've heard it before but can't put my finger on it.
@Webfra14 2 years ago
I think Robert was sent back in time to us by a rogue AI, to lull us into a false sense of security that we have smart people working on the problem of rogue AIs, and that they will figure out how to make it safe. If Robert ever says AI is safe, you know we've lost.
@ConnoisseurOfExistence 2 years ago
That also applies to us - we're still convinced that we're in the real world...
@mimszanadunstedt441 2 years ago
It's real to us, therefore it's real. A training simulation is also real, right?
@Rougesteelproject 2 years ago
9:33 Is that "Just to see you smile"?