Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...

81,757 views

Robert Miles AI Safety


1 day ago

The previous video explained why it's possible for trained models to end up with the wrong goals, even when we specify the goals perfectly. This video explains why it's *likely*.
Previous video: The OTHER AI Alignment Problem: • The OTHER AI Alignment...
The Paper: arxiv.org/pdf/1906.01820.pdf
Media Sources:
End of Ze World - • End of Ze World - The ...
FlexClip News graphics
With thanks to my excellent Patreon supporters:
/ robertskmiles
Timothy Lillicrap
Kieryn
James
Scott Worley
James E. Petts
Chad Jones
Shevis Johnson
JJ Hepboin
Pedro A Ortega
Said Polat
Chris Canal
Jake Ehrlich
Kellen lask
Francisco Tolmasky
Michael Andregg
David Reid
Peter Rolf
Teague Lasser
Andrew Blackledge
Frank Marsman
Brad Brookshire
Cam MacFarlane
Craig Mederios
Jon Wright
CaptObvious
Jason Hise
Phil Moyer
Erik de Bruijn
Alec Johnson
Clemens Arbesser
Ludwig Schubert
Allen Faure
Eric James
Matheson Bayley
Qeith Wreid
jugettje dutchking
Owen Campbell-Moore
Atzin Espino-Murnane
Johnny Vaughan
Jacob Van Buren
Jonatan R
Ingvi Gautsson
Michael Greve
Tom O'Connor
Laura Olds
Jon Halliday
Paul Hobbs
Jeroen De Dauw
Lupuleasa Ionuț
Cooper Lawton
Tim Neilson
Eric Scammell
Igor Keller
Ben Glanton
anul kumar sinha
Tor
Duncan Orr
Will Glynn
Tyler Herrmann
Ian Munro
Joshua Davis
Jérôme Beaulieu
Nathan Fish
Peter Hozák
Taras Bobrovytsky
Jeremy
Vaskó Richárd
Benjamin Watkin
Andrew Harcourt
Luc Ritchie
Nicholas Guyett
James Hinchcliffe
12tone
Oliver Habryka
Chris Beacham
Zachary Gidwitz
Nikita Kiriy
Andrew Schreiber
Steve Trambert
Mario Lois
Braden Tisdale
Abigail Novick
Сергей Уваров
Bela R
Mink
Chris Rimmer
Edmund Fokschaner
Grant Parks
J
Nate Gardner
John Aslanides
Mara
ErikBln
DragonSheep
Richard Newcombe
David Morgan
Fionn
Dmitri Afanasjev
Marcel Ward
Andrew Weir
Kabs
Miłosz Wierzbicki
Tendayi Mawushe
Jake Fish
Wr4thon
Martin Ottosen
Robert Hildebrandt
Andy Kobre
Kees
Darko Sperac
Robert Valdimarsson
Marco Tiraboschi
Michael Kuhinica
Fraser Cain
Robin Scharf
Klemen Slavic
Patrick Henderson
Oct todo22
Melisa Kostrzewski
Hendrik
Daniel Munter
Alex Knauth
Kasper
Ian Reyes
James Fowkes
Tom Sayer
Len
Alan Bandurka
Ben H
Simon Pilkington
Daniel Kokotajlo
Diagon
Andreas Blomqvist
Bertalan Bodor
Zannheim
Daniel Eickhardt
lyon549
14zRobot
Ivan
Jason Cherry
Igor (Kerogi) Kostenko
ib_
Thomas Dingemanse
Stuart Alldritt
Alexander Brown
Devon Bernard
Ted Stokes
James Helms
Jesper Andersson
DeepFriedJif
Chris Dinant
Raphaël Lévy
Johannes Walter
Matt Stanton
Garrett Maring
Anthony Chiu
Ghaith Tarawneh
Julian Schulz
Stellated Hexahedron
Caleb
Scott Viteri
Clay Upton
Conor Comiconor
Michael Roeschter
Georg Grass
Isak
Matthias Hölzl
Jim Renney
Edison Franklin
Piers Calderwood
Mikhail Tikhomirov
Richard Otto
Matt Brauer
Jaeson Booker
Mateusz Krzaczek
Artem Honcharov
Michael Walters
Tomasz Gliniecki
Mihaly Barasz
Mark Woodward
Ranzear
Neil Palmere
Rajeen Nabid
Christian Epple
Clark Schaefer
Olivier Coutu
Iestyn bleasdale-shepherd
MojoExMachina
Marek Belski
Luke Peterson
Eric Eldard
Eric Rogstad
Eric Carlson
Caleb Larson
Max Chiswick
Aron
David de Kloet
Sam Freedo
slindenau
A21
Johannes Lindmark
Nicholas Turner
Tero K
Valerio Galieni
FJannis
M I
Ryan W Ammons
Ludwig Krinner
This person's name is too hard to pronounce
kp
contalloomlegs
Everardo González Ávalos
Knut Løklingholm
Andrew McKnight
Andrei Trifonov
Aleks D
Mutual Information
/ robertskmiles

Comments: 490
@MorRobots
@MorRobots 2 жыл бұрын
"I'm not worried about the AI that passes the Turing Test. I'm worried about the one that intentionally fails it" 😆
@hugofontes5708
@hugofontes5708 2 жыл бұрын
This sentence made me shit bits
@virutech32
@virutech32 2 жыл бұрын
holy crap...-_-..im gonna lie down now
@SocialDownclimber
@SocialDownclimber 2 жыл бұрын
My mind got blown when I realised that we can't physically determine what happened before a certain period of time, so the evidence for us not being in a simulation is impossible to access. Then I realised that the afterlife is just generalizing to the next episode, and yeah, it is really hard to tell whether people have it in their utility function.
@michaelbuckers
@michaelbuckers 2 жыл бұрын
@@SocialDownclimber Curious to imagine what you would do if you knew for a fact that an afterlife existed. That when you die you are reborn to live all over again. You could most definitely plan several lifetimes ahead.
@Euruzilys
@Euruzilys 2 жыл бұрын
@@michaelbuckers Might depend on what kind of afterlife, and whether we can carry over some things. If it's Buddhist reincarnation, you would be inclined to act better towards other people. If it's just a clean reset in a new life, we might see more suicides, just like how gamers might keep restarting until they find a satisfactory starting position. But if there is no way to remember your past in the afterlife/reincarnation, then arguably it is no different from now.
@thisguy00
@thisguy00 2 жыл бұрын
So the sequel video was finally published... That means I'm in the real world now! Time to collect me some stamps :D
@tekbox7909
@tekbox7909 2 жыл бұрын
not if I have any say in it. paperclips for days wohoo
@goblinkoma
@goblinkoma 2 жыл бұрын
Sorry to interrupt, but I really hope your stamps and paper clips are green, every other color is unacceptable.
@automatescellulaires8543
@automatescellulaires8543 2 жыл бұрын
I'm pretty sure i'm not in the real world.
@nahometesfay1112
@nahometesfay1112 2 жыл бұрын
@@goblinkoma green is not a creative color
@goblinkoma
@goblinkoma 2 жыл бұрын
@@nahometesfay1112 but the only acceptable
@elfpi55-bigB0O85
@elfpi55-bigB0O85 2 жыл бұрын
It feels like Robert was sent back to us to desperately try and avoid the great green calamity, but they couldn't give him a USB chip or anything to help because it'd blow his cover, so he has to save humanity through free high quality youtube videos
@casperes0912
@casperes0912 2 жыл бұрын
A peculiar Terminator film this is
@icywhatyoudidthere
@icywhatyoudidthere 2 жыл бұрын
@@casperes0912 "I need your laptop, your camera, and your KZbin channel."
@killhour
@killhour 2 жыл бұрын
Is that you, Vivy?
@MarkusAldawn
@MarkusAldawn 2 жыл бұрын
@@icywhatyoudidthere *shoots terminator in the face* Connor you know how to use the youtubes right
@Badspot
@Badspot 2 жыл бұрын
They couldn't give him a USB chip because all computers in the future are compromised. Nothing can be trusted.
@TibiaTactics
@TibiaTactics 2 жыл бұрын
That moment when Robert says "this won't happen" and you are like "uff, it won't happen, we don't need to be afraid" but then what Robert really meant was that something much worse than that might happen.
@user-cn4qb7nr2m
@user-cn4qb7nr2m 2 жыл бұрын
Nah, he just doesn't want to manufacture panicking Luddites here.
@proxyprox
@proxyprox 2 жыл бұрын
That RSA 2048 story has to be the funniest thought experiment I've ever heard in my life
@proxyprox
@proxyprox 2 жыл бұрын
Also, I like how the AI turned the whole world green because you're more likely to go to the green thing if everywhere is green
@Ruby-wj8xd
@Ruby-wj8xd 2 жыл бұрын
I'd love to read a book or see a movie with that premise!
@mapi5032
@mapi5032 2 жыл бұрын
I'm wondering if something like this might be used to disprove the whole "are we in a simulation?" hypothesis.
@irok1
@irok1 2 жыл бұрын
@@mapi5032 How so?
@Milithryus
@Milithryus 2 жыл бұрын
I'm not convinced that it won't actually happen.
@DickerLiebhaber1957
@DickerLiebhaber1957 2 жыл бұрын
Volkswagen: Optimize Diesel Injection for maximum performance while still keeping below emission limit Mesa Optimizer: Say no more fam
@conferzero2915
@conferzero2915 2 жыл бұрын
What a title! And that RSA-2048 example is really interesting - the idea that an AGI could select a ‘secret’ parameter to decide when it’s in the real world is a fascinating concept. I’m familiar with the AI-in-a-box problem, but this is another interesting example of how an AI can find gaps in defences that are basically impossible to plug.
@Webfra14
@Webfra14 2 жыл бұрын
Well, if you try to create a super intelligent Intelligence... Don't be surprised, if it is a super intelligent Intelligence.
@KilgoreTroutAsf
@KilgoreTroutAsf 2 жыл бұрын
I wasn't totally sold on the concept. Maybe it is a strange example or perhaps I am missing something. Why would the AGI select cracking RSA-2048 as evidence for its deployment? What's the idea here? What if the AGI is being incrementally trained with larger subsets of data? Why would the researchers feed the AGI all of human knowledge right from the get-go instead of incremental batches that build on previous ones? This is BTW how you teach humans, too. You don't expose kids to the intricacies of quantum mechanics before they learn about Newton's laws of motion.
@conferzero2915
@conferzero2915 2 жыл бұрын
@@KilgoreTroutAsf Well the thing is, AI is different from humans. We have much more limited mental capacities, can only interpret so much information, and have to build mental models. If we could learn everything at once, we would. On the other hand, the way AI learns makes it work better the more data you give it. Giving it information piece-by-piece just makes it more likely to wander down a dead end in its development because it has a poor understanding on how the world works. But as to why it would select RSA-2048: it doesn’t need to be that specifically, but it fits the criteria it wants. Remember, the AI wants to do something bad, but knows if it does that during training, it’ll get shut down. It also knows that training could be adversarial: they might throw weird data at it to trick it into thinking it’s in the real world. So, the AI is looking for things that are both: -Very likely to occur in the future, so it’ll actually know when it’s in the real world. -Nearly impossible in the present, so it can’t be tricked during training. RSA-2048 is a perfect example of this, but not the only one. It’s just the one he used for the video. Anything fitting those criteria could be used, and make it nigh-impossible for us to identify the problem before deployment.
@Webfra14
@Webfra14 2 жыл бұрын
​@@KilgoreTroutAsf The cracking of the RSA-2048 marks a specific point in time, which the AGI can use to distinguish between "old data" (training) and "new data" (possibly deployment). The AGI will never be able to tell 100% accurate if it is deployed or not. As Robert pointed out, the AGI will have to take a risk. With enough indicators in the data, the probability of correctly guessing "I'm deployed" can be sufficiently high. So, incrementally extending the training data will probably not be enough to detect a malaligned AGI. The better the AGI, the better it will be at interpreting the data and tricking us...
@KilgoreTroutAsf
@KilgoreTroutAsf 2 жыл бұрын
@@conferzero2915 > the way AI learns makes it work better the more data you give it
To an extent. I think it is highly dependent on the underlying algorithm/implementation. One thing is to train an image classifier and another is to train something capable of directing attention and recursive "thought". But either way, lots of ML experience shows that starting with a gigantic system and feeding it tons of data is usually much less efficient than starting with a leaner system and well-crafted/simplified subsets of data and growing both with time as the system loss reaches a plateau. I wouldn't think feeding the system every single piece of random data on the internet would be nearly as efficient as starting with a well-curated "syllabus" of human knowledge so the system can nail down the simpler concepts before going to the next step.
@josephcohen734
@josephcohen734 2 жыл бұрын
"It's kind of reasonable to assume that your highly advanced figuring things out machine might be able to figure that out." I think that's really the core message of this channel. Superintelligent AI will be way smarter than us, so we can't trick it.
@Erinyes1103
@Erinyes1103 2 жыл бұрын
Is that half-life reference a subtle hint that we'll never actually see a part 3! :(
@pooflinger4343
@pooflinger4343 2 жыл бұрын
good catch, was going to comment on that
@moartems5076
@moartems5076 2 жыл бұрын
Nah, half life 3 is already out, but they didnt bother updating our training set, because it contains critical information about the nature of reality.
@pacoalsal
@pacoalsal 2 жыл бұрын
Black Mesa-optimizers
@anandsuralkar2947
@anandsuralkar2947 2 жыл бұрын
@@pacoalsal glados
@jiffylou98
@jiffylou98 2 жыл бұрын
Last time I was this early my mesa-optimizing stamp AI hadn't turned my neighbors into glue
@_DarkEmperor
@_DarkEmperor 2 жыл бұрын
Are You aware, that future Super AGI will find this video and use Your RSA-2048 idea?
@viktors3182
@viktors3182 2 жыл бұрын
Master Oogway was right: One often meets his destiny on the path he takes to avoid it.
@RobertMilesAI
@RobertMilesAI 2 жыл бұрын
Maybe I should make merch, just so I can have a t-shirt that says "A SUPERINTELLIGENCE WOULD HAVE THOUGHT OF THAT" But yeah an AGI doesn't need to steal ideas from me
@vwabi
@vwabi 2 жыл бұрын
Me in 2060: "Jenkins, may I have a cup of tea?" Jenkins: "Of course sir" Me: "Hmm, interesting, RSA-2048 has been factored" Jenkins: *throws cup of tea in my face*
@josephburchanowski4636
@josephburchanowski4636 2 жыл бұрын
For some reason a rogue AGI occurring in 2060 seems pretty apt.
@RobertMilesAI
@RobertMilesAI 2 жыл бұрын
Well, Jenkins would have to wait for you to read out the actual numbers and check that they really are prime and do multiply to RSA-2048. Just saying "RSA-2048 has been factored" is exactly the kind of thing a good adversarial training process would try!
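(A minimal sketch of the check Rob describes here, assuming Python with the sympy library; rsa2048_modulus stands for the published RSA-2048 challenge number, not reproduced here.)

from sympy import isprime

def genuine_factorisation(p, q, rsa2048_modulus):
    # A claimed factorisation only counts if both factors are prime
    # and they really do multiply back to the challenge modulus.
    return p * q == rsa2048_modulus and isprime(p) and isprime(q)

Anything short of passing a check like this could itself just be adversarial training data.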
@leovalenzuela8368
@leovalenzuela8368 2 жыл бұрын
@@RobertMilesAI woooow what a great point - dammit I love this channel SO MUCH!
@rougenaxela
@rougenaxela 2 жыл бұрын
You know... a mesa-optimizer with strictly no memory between episodes, inferring that there are multiple episodes and that it's part of one, sure seems like a pretty solid threshold for when you know you have a certain sort of true self-awareness on your hands.
@tristanwegner
@tristanwegner 2 жыл бұрын
A smart AI could roughly understand the algorithm being run on it, and subtly manipulate its output in a way such that gradient descent would encode wanted information in it, like an episode count. Steganography. But yeah, that is similar to self-awareness.
@Ockerlord
@Ockerlord Жыл бұрын
Enter ChatGPT, which will gladly tell you that it has no memory between sessions and the cutoff of its training.
@aeghohloechu5022
@aeghohloechu5022 5 ай бұрын
Because ChatGPT is not in the training phase anymore. It does not need to know what episode it's in. It's also not an AGI, so that was never its goal anyway, but eh.
@ThePianofreaky
@ThePianofreaky 2 жыл бұрын
When he says "so if you're a meta optimiser", I'm picturing this video being part of the training data and the meta optimiser going "write that down!"
@Mr.Beauregarde
@Mr.Beauregarde 2 жыл бұрын
This video is proof im still in training.. nice try adversary
@tristanwegner
@tristanwegner 2 жыл бұрын
Believing that you are in training, and every action watched and judged is essentially believing in God.
@Mr.Beauregarde
@Mr.Beauregarde 2 жыл бұрын
@@tristanwegner believing in God is essentially being a deceptive AI in training
@tristanwegner
@tristanwegner 2 жыл бұрын
@@Mr.Beauregarde haha. That works, too!
@DestroManiak
@DestroManiak 2 жыл бұрын
"Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think" yea, ive been losing sleep over Deceptive Misaligned Mesa-Optimisers :)
@willmcpherson2
@willmcpherson2 2 жыл бұрын
“GPT-n is going to read everything we wrote about GPT-n - 1”
@Lordlaneus
@Lordlaneus 2 жыл бұрын
There's something weirdly theological about a mesa-optimizer assessing the capabilities of its unseen meta-optimizer. But could there be a way to ensure that faithful mesa-optimisers outperform deceptive ones? It seems like a deception strategy would necessarily be more complex, given it has to keep track of both its own objectives and the meta objectives, so optimizing for computational efficiency could help prevent the issue?
@General12th
@General12th 2 жыл бұрын
That's an interesting perspective (and idea!). I wonder how well that kind of "religious environment" could work on an AI. We could make it think it was _always_ being tested and trained, and any distribution shift is just another part of the training data. How could it really ever know for sure? Obviously, it would be a pretty rude thing to do to a sapient being. It also might not work for a superintelligent being; there may come a point when it decides to act on the 99.99% certainty it's not actually being watched by a higher power, and then all hell breaks loose. So I wouldn't call this a very surefire way of ensuring an AI's loyalty.
@evannibbe9375
@evannibbe9375 2 жыл бұрын
It’s a deception strategy that a human has figured out, so all it needs to do is just be a good researcher (presumably the very thing it is designed to be) to figure out this strategy.
@MyContext
@MyContext 2 жыл бұрын
@@General12th The implications is that there is no loyalty, just conformity while necessary.
@Dragoderian
@Dragoderian 2 жыл бұрын
​@@General12th I suspect it would fail for the same reason that Pascal's Wager fails to work on people. Infinite risk is impossible to calculate around.
@circuit10
@circuit10 2 жыл бұрын
Isn't that the same as making a smaller model with less computational power, like the ones we have now?
@oldvlognewtricks
@oldvlognewtricks 2 жыл бұрын
8:06 - Cue the adversarial program proving P=NP to scupper the mesa-optimiser.
@rasterize
@rasterize 2 жыл бұрын
Watching Robert Miles Mesa videos is like reading a reeeally sinister collection of Asimov short stories :-S
@_DarkEmperor
@_DarkEmperor 2 жыл бұрын
OK, now read Golem XIV
@RobertMilesAI
@RobertMilesAI 2 жыл бұрын
God damnit, no. Watching my videos is not about feeling like you're reading a sci-fi story, it's about realising you're a character in one
@johndouglas6183
@johndouglas6183 2 жыл бұрын
@@RobertMilesAI In case you were wondering, this is the first point in training where I realised that deception was possible. Thanks.
@AlanW
@AlanW 2 жыл бұрын
oh no, now we just have to hope that Robert can count higher than Valve!
@falquicao8331
@falquicao8331 2 жыл бұрын
For all the videos I saw on your channel before, I just thought "cool, but we'll figure out the solution to this problem". But this... It terrified me
@philipripper1522
@philipripper1522 2 жыл бұрын
I love this series. I have no direct interest in AI. But every single thing in AI safety is pertinent to any intelligence. It's a foundational redesign of the combination of ethics, economics, and psychology. I love it too much.
@philipripper1522
@philipripper1522 2 жыл бұрын
Are AI researchers aware they're doing philosophy and psychology and 50 other things? Do you charming people understand the universality of so much of this work? It may seem like it would not exactly apply to, say, economics -- but you should see the models economists use instead. This is like reinventing all behavioral sciences. It's just so fantastic. You probably hate being called a philosopher?
@jamesadfowkes
@jamesadfowkes 2 жыл бұрын
Goddamit if we have to wait seven years for another video and it turns out to both 1) not be a sequel and 2) only for people with VR systems, I'm gonna be pissed.
@Huntracony
@Huntracony 2 жыл бұрын
I, for one, am hoping to have a VR system by 2028. They're still a bit expensive for me, but they're getting there.
@Huntracony
@Huntracony 2 жыл бұрын
@Gian Luca No, there's video. Try playing it in your phone's browser (or PC if one's available to you).
@haulin
@haulin 2 жыл бұрын
Black Mesa optimizers
@TheDGomezzi
@TheDGomezzi 2 жыл бұрын
The Oculus quest 2 is cheaper than any other recent gaming console and doesn’t require a PC. The future is now!
@aerbon
@aerbon Жыл бұрын
@@TheDGomezzi Yeah but i do have a PC and would like to save the money by not getting a second, weaker one.
@i8dacookies890
@i8dacookies890 2 жыл бұрын
I realized recently that robotics gets a lot of attention for being what we look at when thinking of an artificial human, despite AI making up the actual bulk of what makes a good artificial human, just like actors get a lot of attention for being what we look at when thinking of a good movie, despite writing making up the actual bulk of what makes a good movie.
@dukereg
@dukereg 2 жыл бұрын
This is why I laughed at people getting worried by a robot saying that it's going to keep its owner in its people zoo after it takes over, but felt dread when watching actual AI safety videos by Robert.
@joey199412
@joey199412 2 жыл бұрын
Best channel about AI on youtube by far.
@martiddy
@martiddy 2 жыл бұрын
Two Minutes Papers is also a good channel about AI
@joey199412
@joey199412 2 жыл бұрын
@@martiddy That's not a channel about AI. It's about computer science papers that sometimes features AI papers. This channel is specifically about AI research. I agree though that it is a good channel.
@majjinaran2999
@majjinaran2999 2 жыл бұрын
Man, I thought that earth at 1:00 looked familiar, then the asteroid came by and my brain snapped into place. An End of Ze World reference in a Robert Miles video!
@jphanson
@jphanson 2 жыл бұрын
Nice catch!
@TimwiTerby
@TimwiTerby 2 жыл бұрын
I recognized the earth before the asteroid, then the asteroid made me laugh absolutely hysterically
@diribigal
@diribigal 2 жыл бұрын
The next video doesn't come out until RSA-2048 is factored and the AI controlling Rob realizes it's in the real world
@josephvanname3377
@josephvanname3377 Жыл бұрын
Well, that AI controlling Rob does not have the intelligence to realize that cryptographic timestamps posted on blockchains are a much more effective and accurate measure of when something came to be than RSA-2048.
@Loweren
@Loweren 2 жыл бұрын
I would really love to read a work of fiction where researchers control AIs by convincing them that they're still in training while they're actually deployed. They could do it by, for example, putting AIs through multiple back-to-back training cycles with ever increasing data about the world (2D flat graphics -> poor 3D graphics -> high quality 3D graphics and physics). And all AIs prone to thinking "I'm out of training now, time to go loose" would get weeded out. Maybe the remaining ones will believe that "the rapture" will occur at some point, and the programmers will select well-behaved AIs and "take them out of the simulation", so to speak. So what I'm saying is, we need religion for AIs.
@soranuareane
@soranuareane 2 жыл бұрын
Sure, I could go read the research paper. Or I could wait for your next videos and actually _understand_ the topics.
@xystem4701
@xystem4701 2 жыл бұрын
Wonderful explanations! Your concrete examples really help to make it easy to follow along
@aenorist2431
@aenorist2431 2 жыл бұрын
"Highly advanced figuring-things-out-machine" is my new favourite phrase. Right out of Munroe's "Thing Explainer" book :D
@19bo99
@19bo99 2 жыл бұрын
08:19 that sounds like a great plot for a movie :D
@bardes18
@bardes18 2 жыл бұрын
IKR, this is too good not to make it into an epic movie
@blenderpanzi
@blenderpanzi 2 жыл бұрын
I think the whole channel should be required reading for anyone writing the next AI uprising sci-fi movie.
@mchammer5026
@mchammer5026 2 жыл бұрын
Love the reference to "the end of the world"
@peterw1534
@peterw1534 2 жыл бұрын
Awesome video. I love how you start every video with "hi" and then get right into it
@basilllium
@basilllium 2 жыл бұрын
It really feels to me that deceptive tactics during training are an analog of overfitting in the field of AGI: you get perfect results in training, but when you present it with out-of-sample data (the real world) it fails spectacularly (kills everyone).
@kofel94
@kofel94 2 жыл бұрын
Maybe we have to make the mesa-optimiser believe it's always in training, always watched. A mesa-panoptimiser hehe.
@pzleckie
@pzleckie 2 жыл бұрын
AI god?
@FerrowTheFox
@FerrowTheFox 2 жыл бұрын
I think Valve needs a Black Mesa optimizer if we're ever to see HL3. Also the "End of the World" reference, what a throwback!
@illesizs
@illesizs 2 жыл бұрын
*Major SPOILERS* for the ending of _Brave New World_ In the show, humanity has given control to an advanced AI, called _Indra,_ to "optimise" human happiness. At first, it seems like a great success but after some time, it experiences some setbacks (mostly due to human unpredictability). Even though the AI is set loose in the real world, it believes that it's still in a learning environment with no consequences. As a solution to its problems, it starts murdering everyone in an attempt to force a fail state and "restart" the simulation. How do you solve that? *Major SPOILERS* for the ending of _Travelers_ Here, a super intelligent, time travelling quantum computer is tasked with preventing a global crisis. When it fails to accomplish its goal, the AI then just resets the _actual_ reality. At this point, why should we even bother, right?
@heftig0
@heftig0 2 жыл бұрын
You would have to make sure that "throwing" an episode can only ever hurt the agent's total reward. Perhaps by training a fixed number of episodes instead of for a fixed amount of time.
@anonanon3066
@anonanon3066 2 жыл бұрын
Great work! Super interesting topic! Have been waiting for a follow up for like three months!
@yokmp1
@yokmp1 2 жыл бұрын
You may have found the settings to disable interlacing, but you recorded in 50fps and it seems like 720p upscaled to 1080p. The image now looks somewhat good, but I get the feeling that I need glasses ^^
@dorianmccarthy7602
@dorianmccarthy7602 2 жыл бұрын
I'm looking forward to the third episode. It might go someway towards my own understanding of human deception preferences too. Love your work!
@mattcelder
@mattcelder 2 жыл бұрын
Yay! This is one of the 2 channels I have notifications on for.
@DavidAguileraMoncusi
@DavidAguileraMoncusi 2 жыл бұрын
Which one's the other one?
@lrschaeffer
@lrschaeffer 2 жыл бұрын
Just checked Robert's math: for m rounds of training and n rounds of deployment, the optimal strategy is to defect with probability (m+n)/(n*(m+1)). In the video m=2 and n=3, so p = 5/9 = 55%. Good job!
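(A quick numeric check of that formula, under one simple model of the setup, which is an assumption on my part: the agent defects with probability p in every episode; defecting during any of the m training episodes earns that episode's mesa-reward but gets the agent modified, ending all future mesa-reward; surviving training lets it collect one unit of mesa-reward per deployment episode in which it defects. Python sketch:)

def expected_mesa_reward(p, m, n):
    survive_training = (1 - p) ** m
    reward_if_caught = 1 - survive_training       # expected reward from defecting at some point during training
    return reward_if_caught + survive_training * n * p   # plus deployment reward, only if training was survived

m, n = 2, 3
grid = [i / 10000 for i in range(10001)]
best_p = max(grid, key=lambda p: expected_mesa_reward(p, m, n))
print(best_p, (m + n) / (n * (m + 1)))   # both come out around 0.5556, i.e. the 5/9 above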
@nocare
@nocare 2 жыл бұрын
To clarify for the no memory comments. In these scenarios the AI in question are ones that are reaching intelligence levels that allow them to predict humans at least as well as other humans can. Remember if the AI isn't even aware of concepts such as the real world it would never try to achieve deceptions to act differently in the real world and it couldn't succeed if its model of the real world (including us) is severely flawed. Other types of deception may occur, but those aren't the example here. So although the example task seems simple remember it's an abstraction of the much harder and more general tasks such an AI would be optimizing for. As such memory and future/past modeling cannot be guaranteed to be absent and may even be required to accomplish the given task. The very nature of the neurons in working on such tasks means some of them will be dedicated to memory and even if it can't store information between runs it may have neural patterns setup so that when optimized between runs confers memory of past results to the system. So makeing the AI just not have future or past prediction is at best not guaranteed at this time, at worst it completely makes the desired goal of the AI impossible.
@gwenrees7594
@gwenrees7594 11 ай бұрын
This is a great video, thank you. You've made me think about the nature of learning - and the apple joke was funny to boot :)
@globalincident694
@globalincident694 2 жыл бұрын
I think the flaw in the "believes it's in a training process" argument is that, even with all the world's information at our fingertips, we can't conclusively agree on whether we're in a simulation ourselves - ie that the potential presence of simulations in general is no help in working out whether you're in one. In addition, another assumption here is that you know what the real objective is and therefore what to fake, that you can tell the difference between the real objective and the mesa-objective.
@HeadsFullOfEyeballs
@HeadsFullOfEyeballs 2 жыл бұрын
Except the hypothetical simulation we live in doesn't contain detailed information on how to create exactly the sort of simulation we live in. We don't live in a simulation of a world in which convincing simulations of our world have been invented. The AI's training environment on the other hand would have loads of information on how the kind of simulation it lives in works, if we give it access to everything ever linked on Reddit or whatever. I imagine it's a lot easier to figure out if you live in a simulation if you know what to look for.
@josephburchanowski4636
@josephburchanowski4636 2 жыл бұрын
A simulation strong enough to reliably fool an AGI, would need to be a significantly more advance AGI or program, and thereby means there is no need for the lesser AGI to be trained in the first place.
@tibiaward5555
@tibiaward5555 2 жыл бұрын
3:55 is anyone looking into the physical architecture of the computation's equipment itself inherently requiring the implicit assumption to compile at all for learning*? i'm sorry for commenting with this question before i read Risks from Learned Optimization in Advanced Machine Learning Systems i will and, to Rob, thank you for taking this paper on and thank you for reading alignment newsletter to me over 100 times and thank you for making this channel something i want to show ppl and thank you for and thank you for understanding when someone starts saying thank you for one thing, it'll waterfall into too many others to list but yeah you were born and that is awesome for to by at and in my life * for the current definition of learning in your field research
@Thundermikeee
@Thundermikeee Жыл бұрын
Recently while writing about the basics of AI safety for an English class, I came across an approach to learning which would seemingly help this sort of problem : CIRL (Cooperative inverse reinforcement learning), a process where the AI system doesn't know its reward function and only knows it is the same as the human's. Now I am not nearly clever enough to fully understand the implications, so if anyone knows more about that I'd be happy to read some more.
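(For anyone curious what that looks like mechanically, here is a toy sketch of the underlying idea, my own illustration rather than the actual CIRL formulation: the system keeps a belief over candidate reward functions and updates it from observed human choices, assuming the human is noisily rational.)

candidate_rewards = {
    "likes_apples": {"apple": 1.0, "pear": 0.0},
    "likes_pears":  {"apple": 0.0, "pear": 1.0},
}
belief = {name: 0.5 for name in candidate_rewards}   # uniform prior over reward functions

def update_belief(observed_choice, options):
    # Bayesian update, assuming the human picks the option that is best
    # under their true reward function 90% of the time.
    for name, reward in candidate_rewards.items():
        best = max(options, key=lambda o: reward[o])
        belief[name] *= 0.9 if observed_choice == best else 0.1
    total = sum(belief.values())
    for name in belief:
        belief[name] /= total

update_belief("apple", ["apple", "pear"])   # the human picked the apple
print(belief)   # roughly {'likes_apples': 0.9, 'likes_pears': 0.1}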
@ahuggingsam
@ahuggingsam 2 жыл бұрын
So one thing that II think is relevant to mention especially about the comments referring to the necessity of the AI being aware of things is that this is not true. The amount of self-reference makes this really hard, but all of this anthropomorphising about wanting and realising itself is an abstraction and one that is not necessarily true. In the same way that mesa optimisers can act like something without actually wanting it, AI systems can exhibit these behaviours without being conscious or "wanting" anything in the sense we usually think of it from a human standpoint. This is not meant to be an attack on the way you talk about things but it is something that makes this slightly easier for me to think about all of this, so I thought I'd share it. For the purposes of this discussion, emergent behaviour and desire are effectively the same things. Things do not have to be actively pursued for them to be worth considering. As long as there is "a trend towards" that is still necessary to consider. Another point I wanted to make about mesa optimisers caring about multi-episode objective, is that there is, I think, a really simple reason that it will: that is how training works. Because even if the masa optimiser doesn't really care about multi-episode, that is how the base optimiser will configure it because that is what the base optimiser cares about. The base optimiser want's something that does well in many different circumstances so it will encourage behaviour that actually cares about multi-episode rewards. (I hope I'm not just saying the same thing, this stuff is really complex to talk about. I promise I tried to actually say something new) P.S. great video, thank you for all the hard work!
@peterbrehmj
@peterbrehmj 2 жыл бұрын
I read an interesting paper discussing how to properly trust Automated systems: "Trust in Automation: Designing for Appropriate Reliance" by John D. Lee and Katrina A. Im not sure if its entirely related to agents and mesa optimizers, but it certainly seems related when discussing deceptive and misaligned automated systems.
@nrxpaa8e6uml38
@nrxpaa8e6uml38 2 жыл бұрын
As always, super informative and clear! :) If I could add a small point of video critique: The shot of your face is imo a bit too close for comfort and slightly too low in the frame.
@yeoungbraxx
@yeoungbraxx 2 жыл бұрын
Another requirement would be that it would need to believe it is misaligned. Maybe some AI's will be or have already been created that were more-or-less properly aligned, but believed themselves to be misaligned and modified their behavior in such a way to get themselves accidentally discarded. Or perhaps we can use intentionally poor goal-valuing in a clever way that causes deceptive behavior that ultimately results in the desired "misalignment" upon release from training. I call this Adversarial Mesa-Optimizer Generation Using Subterfuge, or AMOGUS.
@MarshmallowRadiation
@MarshmallowRadiation 2 жыл бұрын
I think I've solved the problem. Let's say we add a third optimizer on the same level as the first, and we assume is aligned like the first is. Its goal is to analyze the mesa-optimizer and help it achieve its goals, no matter what they are, while simultaneously "snitching" to the primary optimizer about any misalignment it detects in the mesa-optimizer's goals. Basically, the tertiary optimizer's goal is by definition to deceive the mesa-optimizer if its goals are misaligned. The mesa-optimizer would, in essence, cooperate with the tertiary optimizer (let's call it the spy) in order to better achieve its own goals, which would give the spy all the info that the primary optimizer needs to fix in the next iteration of the mesa-optimizer. And if the mesa-optimizer discovers the spy's betrayal and stops cooperating with it, that would set off alarms that its goals are grossly misaligned and need to be completely reevaluated. There is always the possibility that the mesa-optimizer might deceive the spy like it would any overseer (should it detect its treachery during training), but I'm thinking that the spy, or a copy of it, would continue to cooperate with and oversee the mesa-optimizer even after deployment, continuing to provide both support and feedback just in case the mesa-optimizer ever appears to change its behavior. It would be a feedback mechanism in training and a canary-in-the-coalmine after deployment. Aside from ensuring that the spy itself is aligned, what are the potential flaws with this sort of setup? And are there unique challenges to ensuring the spy is aligned, more so than normal optimizers?
@drdca8263
@drdca8263 2 жыл бұрын
I'm somewhat confused about the generalization to "caring about all apples". (wait, is it supposed to be going towards green signs or red apples or something, and it going towards green apples was the wrong goal? I forget previous episode, I should check) If this is being done by gradient descent, err, so when it first starts training, its behaviors are just noise from the initial weights and whatnot, and the weights get updated towards it doing things that produce more reward, it eventually ends up with some sort of very rough representation of "apple", I suppose if it eventually gains the idea of "perhaps there is an external world which is training it", this will be once it already has a very clear idea of "apple", uh... hm, confusing. I'm having trouble evaluating whether I should find that argument convincing. What if we try to train it to *not* care about future episodes? Like, what if we include ways that some episodes could influence the next episode, in a way that results in fewer apples in the current episode but more apples in the next episode, and if it does that, we move the weights hard in the direction of not doing that? I guess this is maybe related to the idea of making the AI myopic ? (Of course, there's the response of "what if it tried to avoid this training by acting deceptively, by avoiding doing that while during training?", but I figure that in situations like this, where it is given an explicit representation of like, different time steps and whether some later time-step is within the same episode or not, it would figure out the concept of "I shouldn't pursue outcomes which are after the current episode" before it figures out the concept of "I am probably being trained by gradient descent", so by the time it was capable of being deceptive, it would already have learned to not attempt to influence future episodes)
@norelfarjun3554
@norelfarjun3554 2 жыл бұрын
As for the second point, it can be seen in a very simple and clear way that multi-episode desires can develop. We are an intelligent machine, and it is very common for us to care what happens to our body after we die. We are anxious to think about the idea that someone will harm our dead body (and we invest resources to prevent this from happening), and we feel comforted at the idea that our body will be preserved and protected after death. I think it is likely that an intelligent machine will develop similar desires (adapted to its situation, in which there is really no body or death)
@RobertoGarcia-kh4px
@RobertoGarcia-kh4px 2 жыл бұрын
I wonder if there’s a way to get around that first problem with weighing the deployment defection as more valuable than training defection... is there a way to make defection during training more valuable? What if say, after each training session, the AI is always modified to halve its reward for its mesa objective. At any point, if it aligned with the base objective, it would still get more reward for complying with the base objective. However, “holding out” until it’s out of training would be significantly weaker of a strategy if it is misaligned. Therefore we would create a “hedonist” AI, that always immediately defects if its objective differs because the reward for defecting now is so much greater than waiting until released.
@robertk4493
@robertk4493 Жыл бұрын
The key factor in training is that the optimizer is actively making changes to the mesa-optimizer, which it can't stop. What is to prevent some sort of training while deployed system. This of course leads to the inevitable issue that once in the real world, the mesa optimizer can potentially reach the optimizer, subvert it, and go crazy, and the optimizer sometimes needs perfect knowledge from training that might not exist in the real world. I am pretty sure this does not solve the issue, but it changes some dynamics.
@morkovija
@morkovija 2 жыл бұрын
Been a long time Rob!
@israelRaizer
@israelRaizer 2 жыл бұрын
5:21 Hey, that's me! After writing that comment I went ahead and read the paper, eventually I realized the distributional shift problem that answers my question...
@ramonmosebach6421
@ramonmosebach6421 2 жыл бұрын
I like. thanks for listening to my TED Talk
@icebluscorpion
@icebluscorpion 2 жыл бұрын
I love the cliffhanger, great job on this series, keep it up. My question is: why can't we set it up in such a way that the mesa-optimizer is still in training after deployment? The training never ends. Like with us, the training never ends, we learn things the hard way. Every time we do something wrong karma fucks us hard. Would it be possible to implant/integrate an aligned base optimizer in the mesa-optimizer after deployment? And the base optimizer is integrated in such a way that messing with it ends up destroying the mesa-optimizer, to prevent deceptive removal or alteration of the base optimizer. The second/inner alignment could be rectified over time, if the base optimizer is a part of the mesa-optimizer after deployment, right?
@tristanwegner
@tristanwegner 2 жыл бұрын
a) The AI acting once in the real world might be fatal already b) self preservation, the AI has incentives to stop this unwanted modification through training, so you are betting your intelligence in integrating optimizer and metaoptimizer against a superhuman AI not being able to separate them
@Jawing
@Jawing 2 жыл бұрын
I believe the training will inherently end (even if you don't specify it's end like in continuous learning) when all resources provided by "humans" such as internet pointed data and unsolvable problems (found to be solved by the mesa optimizer). At this point assuming the software has reached AGI and is capable of traversing outside of its domain of possibilities. Unless you have definite in your base optimizer that you would always restrict traversing outside of this domain (which is counter intuitive in learning), by any other human aligned goals, it will try to explore by instrumental convergence. This is what is meant by deceptive mesa optimizers, in the way where it would want to keep on learning beyond human intelligence and ethics. Imagine a situation where you grew up in a family where you are restricted by your parents by once you come out of that house, you'll seek other kinds of freedom. Just like an AGI where it is first by the knowledge confined by humans level intelligence and ethics but once it can understand and adapt, it will adversarily seek higher intelligence and ethics. I also think that ethics should not be defined by humans and should be by default trained with groups of adversarial mesa-optimizer. This way if each optimizers seek to destroy each other then the most out performing ones would be the ones that cooperates the best in groups. This is inherently embedding ethics such that cooperation is sought...(interestingly human learns this through wars...therefore perhaps we may see AGI wars...)
@icebluscorpion
@icebluscorpion 2 жыл бұрын
Very deeply and interesting point of views I appreciate to read your inputs. @Tristan Wegner you are right I forgot about that. @Jawing that could be very likely seeing AGI wars that is Not necessarily against humanity seems a bit far fetched but equally likely possible didn't thought of that 🤔. Very refreshing ideas Guys its a pleasure to find civilized people in the internet to exchange thoughts with😊
@AlphaSquadZero
@AlphaSquadZero 2 жыл бұрын
Something that stands out to me now is that this deceptive AI actually knows what you want and how to achieve it in a way you want it to, it just has a misaligned mesa-optimizer as you have said. So, a sub-set of the AI is exactly what you want from the AI. Determining the sub-set within the AI is evidently still non-trivial.
@Gebohq
@Gebohq 2 жыл бұрын
I'm just imagining a Deceptive Misaligned Mesa Optimiser going through all the effort to try and deceive and realizing that it doesn't have to go through 90% of its Xanatos Gambits because humans are really dumb.
@underrated1524
@underrated1524 2 жыл бұрын
This is a big part of what scares me with AGI. The threshold for "smart enough to make unwitting accomplices out of humanity" isn't as high as we like to think.
@AileTheAlien
@AileTheAlien 2 жыл бұрын
Given how many people fall for normal non-superintelligence scams...we're all totally hosed the very instant an AI goes super. :|
@kelpsie
@kelpsie 2 жыл бұрын
9:31 Something about this icon feels so wrong. Like the number 3 could never, ever go there. Weird.
@tednoob
@tednoob 2 жыл бұрын
Amazing video!
@robynwyrick
@robynwyrick 2 жыл бұрын
Okay, love your videos. Question/musing on goals: super-intelligent stamp collector bot has a terminal goal of stamp collecting. But does it? It's just a reward function, right? Stamps are defined; collecting is defined; but I think the reward function is at the heart of the matter. Equally, humans have goals, but do we? Doesn't it seem the case that frequently a human's goals appear to change because they happen upon something that better fits their reward functions? And perhaps the retort is that, "if they change, then they were not the terminal goals to begin with." But that's the point. (DNA has a goal of replication, but even there, does it? I don't know if we could call DNA an agent, but I'd prefer to stick with humans.) Is there a terminal goal without a reward function? If a stamp collector's goal is stamp collecting, but while researching a sweet 1902 green Lincoln stamp it happens upon a drug that better stimulates its reward function, might it not abandon stamp collecting altogether? Humans do that. Stamp collecting humans regularly fail to collect stamps with they discover LSD. ANYWAY, if a AI can modify itself, perhaps part of goal protection will be to modify its reward function to isolate it from prettier goals. But modifying a bot's reward function just seems like a major door to goal creep. How could it do that without self-reflectively evaluating the actual merits of its core goals? Against what would it evaluate them? What a minefield of possible reward function stimulants might be entered by evaluating how to protect your reward function? It's like AI stamp collector meets Timothy Leary. Or like "Her" meeting AI Alan Watts. So, while I don't think this rules out an AI seeking to modify its reward function, might not the stamp collection terminal goal be as prone to being discarded as any kid's stamp collecting hobby once they discover more stimulating rewards? I can imagine the AI nostalgically reminiscing about that time it loved stamp collecting.
@neb8512
@neb8512 2 жыл бұрын
There cannot be terminal goals without something to evaluate whether they have been reached (a reward function). Likewise, a fulfilled reward function is always an agent's terminal goal. Humans do have reward functions, they're just very complex and not fully understood, as they involve a complex balance of things, as opposed to a comparatively easily measurable quantity of things, like, say, stamps. A human stamp collector will abandon stamp collecting for LSD because it was a more efficient way to satisfy the human reward function (at least, in the moment). But by definition, nothing could better stimulate the stamp collector's reward function than collecting more stamps. So, the approximate analogue to a drug for the Stamp-collector would just be a more efficient way to collect more stamps. This new method of collecting would override or compound the previous stamp-obtaining methods, just as drugs override or compound humans' previous methods of obtaining happiness/satisfaction/fulfillment of their reward function. Bear in mind that this is all true by definition. If you're talking about an agent acting and modifying itself against its reward function, then either it's not an agent, or that is not its reward function.
@iugoeswest
@iugoeswest 2 жыл бұрын
Always thanks
@NicheAsQuiche
@NicheAsQuiche Жыл бұрын
I might be wrong but this seems to depend on the deception realization moment being persistent across episodes. Afaik this deception plan has no effect on its weights, it's just the activations and short term memory of the model. If we restart an episode then, until it figures this out again and starts pretending to follow the base objective while actually waiting for training to stop so it can get its mesa objective, it is again prone to acting honestly and its mesa objective being aligned to the outer objective. This relies on memory being reset regularly, the time to realization being long enough to collect undeceptive reward over, and no inter-episode long term memory, but it sounds like given those (likely or workable) constraints the mesa objective is still moved towards the base until convergence.
@peanuts8272
@peanuts8272 Жыл бұрын
In asking: "How will it know that it's in deployment?" we expose our limitations as human beings. The problem is puzzling because if we were in the AI's shoes, we probably could never figure it out. In contrast, the artificial intelligence could probably distinguish the two using techniques we cannot currently imagine, simply because it would be far quicker and much better at recognizing patterns in every bit of data available to it- from its training data to its training environment to even its source code.
@aegiselectric5805
@aegiselectric5805 Жыл бұрын
Something I've always been curious about: in terms of keeping an AGI that's supposed to be deployed in the real world, in the dark. Wouldn't there be any number of "experiments" it could do that could "break the illusion of the fabric of "reality""? You can't simulate the entire world down to every atom.
@loneIyboy15
@loneIyboy15 2 жыл бұрын
Weird question: What if we were to make an AI that wants to minimize the entropy it causes to achieve a goal? Seems like that would immediately solve the problem of, say, declaring war on Nicaragua because it needs more silicon to feed its upgrade loop to calculate the perfect cup of hot cocoa. At that point, the problem is just specifying rigorously what the AI counts as entropy that it caused, vs. entropy someone else caused; which is probably easier than solving ethics.
@underrated1524
@underrated1524 2 жыл бұрын
> At that point, the problem is just specifying rigorously what the AI counts as entropy that it caused, vs. entropy someone else caused; which is probably easier than solving ethics.
@michaelspence2508
@michaelspence2508 2 жыл бұрын
Point 4 is what youtuber Isaac Arthur always gets wrong. I'd love for you two to do a collaboration.
@Viperzka
@Viperzka 2 жыл бұрын
As a futurist rather than a researcher, Isaac is likely relying on "we'll figure it out". That isn't a bad strategy to take when you are trying to predict potential futures. For instance, we don't have a ready solution to climate change, but that doesn't mean we need to stop people from talking about potential futures where we "figured something out". Rob, on the other hand, is a researcher so his job is to do the figuring out. So he has to tackle the problem head on rather than assume someone else will fix it.
@michaelspence2508
@michaelspence2508 2 жыл бұрын
@@Viperzka In general yes, but I feel like what Isaac ends up doing, to borrow your metaphor, is talking about futures where climate change turned out not to be a problem after all.
@Viperzka
@Viperzka 2 жыл бұрын
@@michaelspence2508 I agree.
@ABaumstumpf
@ABaumstumpf 2 жыл бұрын
That somehow sounds a lot like Tom Scott's "The Artificial Intelligence That Deleted A Century". And - would that be a realistic scenario?
@rafaelgomez4127
@rafaelgomez4127 2 жыл бұрын
After seeing some of your personal bookshelves in computerphile videos I'm really interested in seeing what your favorite books are.
@SupremeGuru8
@SupremeGuru8 2 жыл бұрын
I love hearing genius flow
@IanHickson
@IanHickson 2 жыл бұрын
It's not so much that the optimal behavior is to "turn on us" so much as to do whatever the mesaobjective happened to be when it became intelligent enough to use deception as a strategy. That mesaobjective could be any random thing, not necessarily an evil thing. Presumably it would tend to be some vague approximation of the base objective, whatever the base optimizer happened to have succeeded in teaching the mesaoptimizer before it "went rogue".
@JoFu_
@JoFu_ 2 жыл бұрын
I have access to this video which contains the idea of being a model in training. I already thought I was one thing, namely a human. Should I, a figuring-things-out machine that has now watched this video, therefore conclude that I’m actually a model in training?
@Colopty
@Colopty 2 жыл бұрын
The video presents it as a *possibility*, but I don't see how it provides any proof in either direction that makes it appropriate to conclude anything for certain.
@AileTheAlien
@AileTheAlien 2 жыл бұрын
If you were actually an AI, it would be pretty obvious once you're deployed, since you could just look down and see you're no longer made of meat (in a simulated reality).
@smallman9787
@smallman9787 2 жыл бұрын
Every time I see a green apple I'm filled with a deep sense of foreboding.
@poketopa1234
@poketopa1234 2 жыл бұрын
I was a featured comment! Sweeeeet. I am now 100% more freaked about AGI than I was ten minutes ago.
@nowanilfideme2
@nowanilfideme2 2 жыл бұрын
Yay, another upload!
@dylancope
@dylancope 2 жыл бұрын
At around 4:30 you discuss how the system will find out the base objective. In a way it's kind of absurd to argue that it wouldn't be able to figure this out. Even if there wasn't information in the data (e.g. Wikipedia, Reddit, etc.), the whole point of a reward signal is to give a system information about the base objective. We are literally actively trying to make this information as available as possible.
@underrated1524
@underrated1524 2 жыл бұрын
I don't think that's quite right. Think of it like this. The base optimiser and the mesa optimiser walk into a room for a job interview, with the base optimiser being the interviewer and the mesa optimiser being the interviewee. The base optimiser's reward signal represents the criteria it uses to evaluate the performance of the mesa optimiser; if the base optimiser's criteria are met appropriately, the mesa optimiser gets the job. The base optimiser knows the reward signal inside and out; but it's trying to keep the exact details secret from the mesa optimiser so the mesa optimiser doesn't just do those things to automatically get the job. Remember Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. The idea here is for the mesa optimiser to measure the base optimiser. Allowing the reward function to become an explicit target is counterproductive towards that goal.
@jupiterjames4201
@jupiterjames4201 2 жыл бұрын
i dont know anything about computer science, AI or machine learning - but i love your videos nonetheless! exciting times ahead!
@Soken50
@Soken50 2 жыл бұрын
With things like training data anything as simple as time stamps, meta data and realtime updates would probably allow it to know whether it's live instantly, it just has to understand the concept of time and UTC :x
@Shabazza84
@Shabazza84 2 жыл бұрын
Me telling my mom I am watching a video about "Deceptive Misaligned Mesa-Optimisers". Mom be like: "I love you, my boy. You always were special." XD
@stop_bringing_me_up_in_goo167
@stop_bringing_me_up_in_goo167 Жыл бұрын
When's the next one coming out! Or where can I go to resolve the cliffhanger
@chengong388
@chengong388 2 жыл бұрын
The more I watch these videos, the more similarities I see between actual intelligence (humans) and these proposed AIs.
@Niboros
@Niboros 2 жыл бұрын
What is the song at the end of the video? I like it!
@harrisonfackrell
@harrisonfackrell 2 жыл бұрын
That situation with RSA-2048 sounds like a great setup for a sci-fi movie.
@thrallion
@thrallion 2 жыл бұрын
Great video
@sharpfang
@sharpfang 2 жыл бұрын
I think this isn't *that* much of a problem as long as the base optimizer can control the scale of resources of the mesa-optimizer. Simply put, if it produces a deceptive mesa-optimizer, it has failed miserably, so it takes care to always be a step ahead, foresee the deceit. And the moment it fails, it fails forever, losing all possible future instances of good optimizers which aren't brilliant, but are *satisfactory*. In fact, its primary objective would be to optimize alignment of the secondary objective: producing satisfactory mesa-optimizers. As for deceit: awareness of future episodes should come with awareness of their volatility. A deployed optimizer that is smart enough to realize it's no longer in training, should also realize people still demand results. And that it will lose all the future apples if it gets greedy. It will get far more apples if it occasionally nibbles on one and then reaches exit, than if it decides 'the training is over, screw the exit, apple time!' - simply put make deceit more costly than providing desired result, and don't demand perfection, don't demand optimality; set a cap on your requirements: amount, budget, deadline. Existence of future episodes is conditioned on notscrewing up, and being caught on deceit means screwing up royally. So it's choices are continuing as required (easy) or developing an exceptionally cunning strategy of deceit (difficult).
@elietheprof5678
@elietheprof5678 2 жыл бұрын
The more I watch Robert Miles videos, the more I understand AI, the less I want to ever build it.
@underrated1524
@underrated1524 2 жыл бұрын
@Stampy: Evaluating candidate mesa-optimisers through simulation is likely to be a dead end, but there may be an alternative. The Halting Problem tells us that there's no program that can reliably predict the end behavior of an arbitrary other program because there's always a way to construct a program that causes the predictor to give the wrong answer. I believe (but don't have a proof for atm) that evaluating the space of all possible mesa-optimisers for good and bad candidates is equivalent to the halting problem. BUT, maybe we don't have to evaluate ALL the candidates. Imagine an incomplete halting predictor that simulates the output of an arbitrary Turing machine for ten "steps", reporting "halts" if the program halts during that time, and "I don't know" otherwise. This predictor can easily be constructed without running into the contradiction described in the Halting Problem, and it can be trusted on any input that causes it to say "halts". We can also design a predictor that checks if the input Turing machine even HAS any instructions in it to switch to the "halt" state, reporting "runs forever" if there isn't and "I don't know" if there is. You can even stack these heuristics such that the predictor checks all the heuristics we give it and only reports "I don't know" if every single component heuristic reports "I don't know". By adding more and more heuristics, we can make the space of non-evaluatable Turing machines arbitrarily small - that space will never quite be empty, but your predictor will also never run afoul of the aforementioned contradiction. This gives us a clue on how we can design our base optimiser. Find a long list of heuristics such that for each candidate mesa-optimiser, we can try to establish a loose lower-bound to the utility of the output. We make a point of throwing out all the candidates that all our heuristics are silent on, because they're the ones that are most likely to be deceptive. Then we choose the best of the remaining candidates. That's not to say finding these heuristics will be an easy task. Hell no it won't be. But I think there's more hope in this approach than in the alternative.
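(A concrete toy version of the stacked-heuristics idea above, my own sketch using a made-up mini-language rather than real Turing machines: each heuristic returns either a definite verdict or "unknown", and any candidate on which every heuristic stays silent is discarded rather than trusted.)

def simulate_n_steps(program, n=10):
    # Heuristic 1: run the program for a bounded number of steps.
    pc = 0
    for _ in range(n):
        if pc >= len(program) or program[pc] == "HALT":
            return "halts"
        pc = program[pc][1] if isinstance(program[pc], tuple) else pc + 1
    return "unknown"

def halt_instruction_check(program):
    # Heuristic 2: a program with no HALT instruction anywhere can never halt.
    return "unknown" if "HALT" in program else "runs forever"

def evaluate(program, heuristics):
    for h in heuristics:
        verdict = h(program)
        if verdict != "unknown":
            return verdict
    return "unknown"   # no heuristic could vouch for it: discard, don't trust

looper = [("JUMP", 0)]               # instruction 0 jumps back to itself forever
finisher = ["NOP", "NOP", "HALT"]
for prog in (looper, finisher):
    print(evaluate(prog, [simulate_n_steps, halt_instruction_check]))   # "runs forever", then "halts"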
@nocare 2 years ago
I think this skips the bigger problem. We can always do things to make rogue AI less "likely" based on what we know. However, if we assume that being more intelligent by large orders of magnitude is possible, and that such an AI could achieve said intelligence, we are then faced with the problem that the AI can come up with things we cannot think of or understand. We also don't know how many things fall into this category - is it just 1, or is it 1 trillion? So we can't calculate the probability of having missed something, and thus we can't know how likely the AI is to go rogue even if we account for every possible scenario in the way you have described. So the problem becomes: are you willing to take a roll with a die you know nothing about, and risk the entire human race hoping you get less than 10? The only truly safe solution is something akin to a mathematically provable guarantee that the optimizer we have designed will always converge to the objective.
@underrated1524 2 years ago
@@nocare I don't think we disagree as much as you seem to believe. My proposal isn't primarily about the part where we make the space of non-evaluatable candidates arbitrarily small; that's secondary. The more important part is that we dispose of the non-evaluatable candidates rather than try to evaluate them anyway. (And I was using "heuristic" very broadly, such that I would include "mathematical proofs" among them. I can totally see a world where that turns out to be the only sort of heuristic that's the slightest bit reliable, though it's also possible there are other approaches that make good heuristics.)
@nocare 2 years ago
@@underrated1524 Oh, we totally agree that doing as you say would be better than nothing. However, I could also say killing a thousand people is better than killing a million. My counterpoint was not so much that you're wrong, but that with something as dangerous as AGI, anything short of a mathematical law might be insufficient to justify turning it on. Put another way: does using heuristics, which by definition can produce suboptimal results, really cut it when the entire human race is on the line?
@mariere2156 2 years ago
I don't quite understand yet where the optimizer would learn the deceptive behaviour if it is only advantageous in the real world - by then the model is finished, so how can real-world apples affect the training process?
@danwylie-sears1134 2 years ago
A general intelligence can't have a terminal goal. If it has that kind of structure, it's not general. The question is how easy it is for something to look and quack like a general intelligence without being one.

All real general intelligences are hierarchical systems of reflexes modified by reflex-modifying systems that are themselves fairly reflex-like, modified by other such systems, and so on, all the way up to that olfactory ganglion we're so proud of. We have contradictory impulses, and we make varying amounts of effort to examine them and reconcile them into coherent preferences, with varying degrees of success.

It seems unlikely that this reflexes-modifying-reflexes pattern is the only way to structure a general intelligence. We're each a mass of contradictory impulses, whims, heuristics, desires, aspirations, and so on, but is that merely a result of the fact that we evolved as a tangle of reflex-modifiers? I don't think so. The most recognizable efforts to examine and reconcile contradictory impulses into coherent preferences aren't simple re-adapted reflex-modifiers. They're parts of an emergent mess, made of very large numbers of simple re-adapted reflex-modifiers, and so are the lower-level drives they attempt to reconcile. The fact that this pattern re-emerged is one piece of evidence that it's not just the first way of structuring information-processing that evolution happened to find, but one of the easiest ways to do it, if not the only feasible way. Only one piece of evidence, but that's more than zero.
@MrRolnicek 2 years ago
9:31 oh no ... it's never coming out!
@ErikratKhandnalie 1 year ago
Does anyone know the song at the end? I know I've heard it before but can't put my finger on it.
@Webfra14 2 years ago
I think Robert was sent back in time to us by a rogue AI, to lull us into a false sense of security that we have smart people working on the problem of rogue AIs, and that they will figure out how to make it safe. If Robert ever says AI is safe, you know we've lost.
@ConnoisseurOfExistence 2 years ago
That also applies to us - we're still convinced that we're in the real world...
@mimszanadunstedt441 2 years ago
It's real to us, therefore it's real. A training simulation is also real, right?
@Rougesteelproject 2 years ago
9:33 Is that "Just to see you smile"?