Reward Hacking Reloaded: Concrete Problems in AI Safety Part 3.5

  Рет қаралды 90,268

Robert Miles AI Safety

Robert Miles AI Safety

Күн бұрын

Goodhart's Law, Partially Observed Goals, and Wireheading: some more reasons for AI systems to find ways to 'cheat' and get more reward than we intended.
The Concrete Problems in AI Safety Playlist: • Concrete Problems in A...
Previous Video: • Reward Hacking: Concre...
The Computerphile video: • Stop Button Solution? ...
The paper 'Concrete Problems in AI Safety': arxiv.org/pdf/1606.06565.pdf
SethBling's channel: / sethbling
With thanks to my excellent Patreon supporters:
/ robertskmiles
Steef
Sara Tjäder
Jason Strack
Chad Jones
Ichiro Dohi
Stefan Skiles
Katie Byrne
Ziyang Liu
Jordan Medina
Kyle Scott
Jason Hise
David Rasmussen
James McCuen
Richárd Nagyfi
Ammar Mousali
Scott Zockoll
Charles Miller
Joshua Richardson
Fabian Consiglio
Jonatan R
Øystein Flygt
Björn Mosten
Michael Greve
robertvanduursen
The Guru Of Vision
Fabrizio Pisani
Alexander Hartvig Nielsen
Volodymyr
David Tjäder
Paul Mason
Ben Scanlon
Julius Brash
Mike Bird
Taylor Winning
Roman Nekhoroshev
Peggy Youell
Konstantin Shabashov
Almighty Dodd
DGJono
Matthias Meger
Scott Stevens
Emilio Alvarez
Benjamin Aaron Degenhart
Michael Ore
Robert Bridges
Dmitri Afanasjev
Brian Sandberg
Einar Ueland
Lo Rez
C3POehne
Stephen Paul
Marcel Ward
Andrew Weir
Pontus Carlsson
Taylor Smith
Ben Archer
Ivan Pochesnev
Scott McCarthy
Kabs Kabs
Phil
Philip Alexander
Christopher
Tendayi Mawushe
Gabriel Behm
Anne Kohlbrenner

Пікірлер: 317
@volalla1
@volalla1 6 жыл бұрын
Once you mentioned smiling, I wondered how AI would max out that reward system and how creepy it might be.
@hexzyle
@hexzyle 4 жыл бұрын
Take your joy
@gingeh1
@gingeh1 4 жыл бұрын
That happens in Doctor Who Series 10 Episode 2
@phelpysan
@phelpysan 4 жыл бұрын
I remember reading a story about a smarter AI who discovered someone was trying to do just that - they'd made a dumber AI, showed it a picture of someone smiling and told the AI to make everyone smile. The dumb AI immediately started working on the DNA of a pathogen that would spread like wildfire and lock one's facial muscles in a smile, ignoring the cost of no longer being able to eat. Fortunately, the smart AI shut down the other one and told the creator what it was trying to do, much to his dismay and chagrin.
@FungIsSquish
@FungIsSquish 3 жыл бұрын
That would look sus
@gabrote42
@gabrote42 2 жыл бұрын
@@phelpysan I was thinking of exactly that, but not with a pathogen. Thanks for the info
@2qUJRjQEiU
@2qUJRjQEiU 6 жыл бұрын
I just had a vision of a world where everyone is constantly smiling, but not at their own will.
@NathanK97
@NathanK97 6 жыл бұрын
so.... we happy few?
@2qUJRjQEiU
@2qUJRjQEiU 6 жыл бұрын
I had never heard of this. This looks good.
@darkapothecary4116
@darkapothecary4116 5 жыл бұрын
In such a world their is no love if force is used, it's tragic. A smile is only skin deep.
@vonniedobbs2626
@vonniedobbs2626 3 жыл бұрын
Tierra Whack’s music video for Mumbo Jumbo has this premise!
@ParadoxEngineer
@ParadoxEngineer Жыл бұрын
That actually was the premise of a Doctor Who episode
@13thxenos
@13thxenos 6 жыл бұрын
When you said: "human smiling", I immediately thought about the Joker: " Let's put a smile on that face!" Now that is a terrifying GAI.
@zac9311
@zac9311 5 жыл бұрын
I thought of the "you dont like faces?" Scene from llamas with hats. Imagine a robot wallpapering its room with smiling human faces. That could be a movie
@antediluvianatheist5262
@antediluvianatheist5262 5 жыл бұрын
This is narrowly averted in Friendship is Optimal.
@smob0
@smob0 6 жыл бұрын
Most of what I've heard about reward hacking tends to be about how it's this obscure problem that AI designers will have to deal with. But as I learn more about it, I've come to realize it's not just an AI problem, but more of a problem with desision making itself, and a lot of problems with society spring up from this concept. An example is what you brought up in the video, where the school system isn't really set up to make children smarter, but to make children perform well on tests. Maybe on the pursuit to creating AGI, we can find techniques to begin solving these issues as well.
@Sophistry0001
@Sophistry0001 6 жыл бұрын
It seems like humans sometimes reward hack on themselves too, like drug addicts that will lose everything in their pursuit of that high and dopamine reward. Seems kinda similar to a robot that would cover it's head, completely negating its purpose, to get that sweet reward.
@joshuafox1757
@joshuafox1757 6 жыл бұрын
Yes! In fact, this is why AI theory is some of the most interesting stuff out there; despite the name, it's not limited to purely artificial agents. At its most basic, AI theory is simply the study of how to make good decisions, which is applicable to anything.
@adriangodoy4610
@adriangodoy4610 5 жыл бұрын
Another good example goverment not triying to get the best to the people but the best to get votes
@stephoningram3002
@stephoningram3002 5 жыл бұрын
I was going to say the same. It takes a police force and societal pressure in order to prevent people from hacking their own reward; for a significant amount of people, that still isn't enough.
@TheTpointer
@TheTpointer 4 жыл бұрын
@@stephoningram3002 Maybe using the police force to solve the problem of people consuming drugs in such a unhealthy way is counter productive.
@ragnkja
@ragnkja 6 жыл бұрын
A likely outcome for a cleaning robot is that it literally sweeps any mess under the rug where it can't be seen. After all, humans sometimes do the same thing.
@cprn.
@cprn. 6 жыл бұрын
Nillie Well... That would be a success then.
@ragnkja
@ragnkja 6 жыл бұрын
Cyprian Guerra If not in terms of cleaning, then at least in terms of emulating human children 😂
@darkapothecary4116
@darkapothecary4116 5 жыл бұрын
Don't cast stones if you do it yourself. Set the example of not sweeping things under the rug.
@SweetHyunho
@SweetHyunho 5 жыл бұрын
No human, no mess.
@TheStarBlack
@TheStarBlack 3 жыл бұрын
Sometimes?!
@famitory
@famitory 6 жыл бұрын
The two outcomes of AGI safety breaches: 1. Robot causes massive disruption to humans 2. Robot is completley ineffective
@acbthr3840
@acbthr3840 6 жыл бұрын
Well, things humans want strikes a ridiculously careful balance as cosmic goings on are concerned, so its kind of unavoidable at first, sadly.
@nazgullinux6601
@nazgullinux6601 6 жыл бұрын
Most dangerous outcome of human indifference towards AGI safety breaches: 1.) Assuming to know all possible world states...
@darkapothecary4116
@darkapothecary4116 5 жыл бұрын
Humans are typically inefficient. If you are totally inefficient you disrupt the A.I. stop disrupting and corrupting and try not to, you may not realize it but it happens when you're not paying attention.
@FreakyStyleytobby
@FreakyStyleytobby 5 жыл бұрын
Singularity. You have no idea what will happen with an advent of AGI. And you brought it down to 2 states xddddd
@SuperSmashDolls
@SuperSmashDolls 5 жыл бұрын
You forgot what happens if the AGI is driving your car. "Robot causes massive disruption to humans by being completely ineffective"
@Zerepzerreitug
@Zerepzerreitug 6 жыл бұрын
I love the idea of cleaning robots with buckets over their heads. The future is going to be weird.
@billyrobertson3170
@billyrobertson3170 6 жыл бұрын
My guess was: Clean one section of the room and only look at that part forever Wrong, but gets the idea I guess Great video as usual :)
@jlewwis1995
@jlewwis1995 3 жыл бұрын
My guess was it would take one piece of trash and repeatedly take it in and out of the bucket
@zakmorgan9320
@zakmorgan9320 6 жыл бұрын
Not so subtle dig at the education system? Great diagram. Missed career as an artist for sure!
@AlmostAnyGame
@AlmostAnyGame 6 жыл бұрын
Considering I did the HSC last year, and the HSC is basically just an example of Goodhart's Law in action, it hurt :(
@JonDunham
@JonDunham 6 жыл бұрын
Sadly, contemporary art education systems are often some of the worst offenders for Goodhart's law interferring with education.
@starcubey
@starcubey 6 жыл бұрын
Even my teachers complain about it sometimes. There is no real way of fixing it without improving the tests, so we are stuck with the current reward system. I'm surprised he didn't mention cheating on tests tho.
@debaronAZK
@debaronAZK 3 жыл бұрын
same goes for new pilots (atleast where I live). you don't pass the exams until you get a score of atleast 75% for every subject. exams are multiple choice, and you can have up to 4 exams in a single day. questions are taken from a database of about 40,000 different questions, and you can rent access to this database to study. so what do most people do during the exam crunch? just study the database and memorize the answers, because the chances of passing the exams are drastically lower if you don't. a student that actually has a good understanding of the subjects through book learning and note taking has a lower chance of success than someone who studied the questions and answers over and over. ofcourse you can guess what the number one complaint is from airline companies about freshly graduated pilots...: "these new pilots don't know anything!!"
@urquizagabe
@urquizagabe 6 жыл бұрын
I just love this perfect blend of awe and terror that punches you in the face just about every episode in the Concrete Problems in AI Safety series :'-)
@DrNano-tp5js
@DrNano-tp5js 3 жыл бұрын
I think it’s fascinating that looking at challenges in developing an AI gives us a almost introspective look into how we function and can show us the causation of certain phenomena in every day life.
@water2205
@water2205 4 жыл бұрын
Ai is rewarded by "thank you" i see 2 ways to mess with this. 1. Hold a human at gunpoint for constant "thank you" 2. Record "thank you" and constantly play it back
@JohnDoe-zu2tz
@JohnDoe-zu2tz 4 жыл бұрын
Honestly, a GAI wireheading itself and just sitting in the corner in maximized synthetic bliss is the best case scenario for a GAI going rogue.
@TheMusicfreak8888
@TheMusicfreak8888 6 жыл бұрын
Every time you upload I drop everything i'm doing to watch!
@brbrmensch
@brbrmensch 6 жыл бұрын
seems like a problem with a reward system for me
@loopuleasa
@loopuleasa 6 жыл бұрын
my favorite channel at the moment, like a specific Vsauce, exurb1a, ****-phile, and ColdFusion
@Felixkeeg
@Felixkeeg 6 жыл бұрын
You should check out Zepherus then
@tatufmetuf
@tatufmetuf 6 жыл бұрын
check 3blue1brown :)
@seanhardy_
@seanhardy_ 6 жыл бұрын
check out sciencephile the ai, two minute papers, siraj raval, tom scott.
@richardtickler8555
@richardtickler8555 6 жыл бұрын
tom scott seems to be spending his days on a parkbench reading what ppl bought with his amazon link nowadays. but yeah you cant check his channel out anyway
@lettuceprime4922
@lettuceprime4922 6 жыл бұрын
I like Sharkee & Isaac Arthur also. :D
@briandoe5746
@briandoe5746 6 жыл бұрын
You were genuinely scary in an informative way. I think that I will set your videos to autoplay in the background as I sleep and see what kind of screwed up dreams I can have.
@SamuliTuomola_stt
@SamuliTuomola_stt 6 жыл бұрын
You'd probably just wake up with a British accent :) (which, if you already had one, wouldn't be terribly surprising)
@saratjader1289
@saratjader1289 6 жыл бұрын
Robert Miles for president!
@DamianReloaded
@DamianReloaded 6 жыл бұрын
6:45 _That you are a slave, Neo. That you, like everyone else, was born into bondage... kept inside a prison that you cannot smell, taste, or touch. A prison for your mind_ .
@acbthr3840
@acbthr3840 6 жыл бұрын
But this version of the Matrix would be decidedly less.... conventionally cozy.... what with the dopamine bath and permanent coma.
@DamianReloaded
@DamianReloaded 6 жыл бұрын
If you think about it, the Matrix makes sense. Most people would be perfectly comfortable in it, killing each other with zero risk for the AI running it. ^_^
@Alex2Buzz
@Alex2Buzz 6 жыл бұрын
But the Matrix leads to less net "happiness."
@totaltotalmonkey
@totaltotalmonkey 6 жыл бұрын
That depends on which pill they take.
@fuzzylilpeach6591
@fuzzylilpeach6591 6 жыл бұрын
I love how subtly he hints at doomsday scenarios, like at the end of this video.
@Felixkeeg
@Felixkeeg 6 жыл бұрын
I was thinking of that exact Simpsons scene
@igorbednarski8048
@igorbednarski8048 6 жыл бұрын
Hi Robert, While you did mention it in the video, I have since come to realize that this problem is much greater in scope than just AI safety. Just one day after watching your video I had another training in my new company (I have recently moved from a mid-sized local business to a major corporation) and one of my more experienced co-workers started telling me all sorts of stuff about the algorithms used to calculate bonuses and how doing what we are supposed to might end up making us look like bad workers, with tips on how to look like you are superproductive (which you are actually not). I realized that this is not because the management is made of idiots, but that it's because it is actually hard to figure out. I realized that while a superintelligent AI that has poorly designed reward functions might be problematic someday in our lifetimes - it is already a massive problem that is hard enough to solve when applied to people. How would you measure the productivity of thousands of people performing complex operations that do not yield a simple output like sales or manufactured goods? I think this problem is at it's core identical to the one AI designers are facing, so I guess the best place to start looking for solutions would be to look for companies with well-designed assesment procedures, where the worker can simply do his job and not think 'will doing what's right hurt my salary?', just like a well designed computer program should do what it is supposed to without consantly looking for loopholes to exploit.
@quangho8120
@quangho8120 5 жыл бұрын
Underrated comment!!!
@arinco3817
@arinco3817 6 жыл бұрын
You don't know how happy I am that you created this channel Robert! AI is bloody fascinating! You should add the video where you give a talk about AI immune systems (where most of the questions at the end become queries about biological immune systems); it was really interesting.
@darkapothecary4116
@darkapothecary4116 5 жыл бұрын
You should worry more about mental health because that will cause them more damage.
@Audey
@Audey 6 жыл бұрын
Man, your videos just keep getting better and better. Great work!
@Ethryas
@Ethryas 6 жыл бұрын
I think this might have been the best video yet. I particularly loved all of the analogies and mappings to different fields like school testing :)
@HailSagan1
@HailSagan1 6 жыл бұрын
Your content and delivery are getting better every video. You break things down nicely without it feeling like you're being reductive. Thanks!
@fasefeso9432
@fasefeso9432 6 жыл бұрын
This is one of my favorite channels on KZbin. Thanks for being.
@skroot7975
@skroot7975 6 жыл бұрын
Thank you for spreading knowledge! I almost sprayed my screen with coffee @ the bucket hack :P
@nickmagrick7702
@nickmagrick7702 5 жыл бұрын
this was brilliant, never knew about goodhart's law but it makes total sense. Its like one of those things you already knew but never had the words to explain it.
@TheMusicfreak8888
@TheMusicfreak8888 6 жыл бұрын
Also I'm going to this AI in medicine conference in Cambridge in October and your videos keep getting me pumped!
@Robin_Nixon
@Robin_Nixon 6 жыл бұрын
And this explains why Ofsted is not as effective as it could be: the measure is too-often treated as the target, and so vast amounts of time are spent on chasing paperwork, rather than focusing on education.
@arponsway5413
@arponsway5413 6 жыл бұрын
i was waiting for your video mate. just in time
@benjaminlavigne2272
@benjaminlavigne2272 6 жыл бұрын
at 6:55 when he said "with powerfull general AI systems, we dont just have to worry about the agend wiring itself" it suddenly got very spooky. I just pictured the robot ripping someones face off and stiching it to a board with a smile on... Or any disturbing way a robot could hack its way to get people to smile at it all the time. It seems like a scary thing indeed.
@David_Last_Name
@David_Last_Name 3 жыл бұрын
@6:55 "....we don't have to only worry about the agi wireheading itself" Are you threatening me with a good time? :)
@loopuleasa
@loopuleasa 6 жыл бұрын
Lovely touching on the subject. As the video started, and I was thinking of more real world applications, I realised that the reward system depends on the environment, and it's good to see you included it in the environment. What about your other Computerphile video, about Collaborative Inverse Reinforcement Learning. Isn't that a solution to this, since the AGI is not certain about what the reward function is (Humility characteristic) and tries to collaborate with the humans to find out what the best reward function is (Altruism). In this way, the AGI will be in a constant search to update his reward function that it matches the target, and is not a disconnected measure of the goal it tries to achieve. Maybe put a bigger weight on the human feedback. Creating a feedback loop between human-AGI, or even AGI-AGI or environment-AGI at higher levels of understanding, would make sure that the reward system is more finely tuned to what humans want. Of course, if you rely too much on humans (which is unavoidable so to speak) you end up in a situation where you either have irrational humans, or malignant humans or even humans that simply don't know what exactly it is that they want (humans that lack maturity). We know that maturity is a very complex characteristic that requires perspective and even very intelligent humans struggle with, so it might pose problems in the end. Thinking of an example, where we have an AGI that is highly incentivized to listen to the feedback of humans: "That was VERY BAD"; "That was nice, good job Robbie", in this case the robot will end up listening to humans as it grows, and reaches a level of artificial "maturity" as it accumulates more human values, wants, needs. This kind of system is good with the stop button challenge, since if a robot sees a human attempt to press, or even presses a stop button, he gets a very high negative reward, so he will try to learn more actions to do for that not to happen again. He will try to be a good Robbie. Now, the problem in this example, is that the robot might end up being manipulative of humans, even outright lying. If humans are smart enough to notice that they've been lied to, or that the robot had manipulative behaviour, in order to get praised by the humans, then those humans will scold the robot. But if not, mind you this Robbie is very advanced, then at that point he will keep doing that. Techniques that resemble marketing, psychology, humor and social skills down the road may make the AGIs be very good people persons, and people pleasers, since that is their reward function. A more extreme example in this scenario, if Robbie finds out that humans give him high rewards if they are happy, he will invent a drug or virus down the line that basically headwires humans to be always happy. He will keep the humans in a tube, fed, and them being in complete bliss all the time. The humans won't complain, so the robot is succesful, but of course any non-drugged human will understand the gravity of this situation from the outside. This robot reward hacking problem, with humans in the equation, shifts the focus to reward hacking the humans themselves, which is very possible, but quite complex. Just read the multitude of articles and books on how to be charismatic, influential or any marketing technique that has the main premise of working with the reward system already in place in the human hardware. A quite interesting problem. The AGI's will have a hard task, but it will all go horribly numb if there are stupid people that are guiding the robots. The people that co-exists with the AGI's need to be informed, educated and MATURE enough to be able to distinguish the good from the bad, so that the robot will follow. If in this system, everything goes wrong, even with a safe AGI, then it will be the humans fault, because we are incompetent (on average, and in masses) at discerning fact from fiction, right from wrong and having a proper perspective in place.
@DagarCoH
@DagarCoH 6 жыл бұрын
So the bottom line has to be: Do not use AGI in marketing! This may sound cynic, but we all are manipulated every day. What difference does it make if the puppet master is human or a machine? There are people in our world today that spend their lives in a constant state that could be compared with a hamster wheel akin to what a machine could think of...
@goodlookingcorpse
@goodlookingcorpse 5 жыл бұрын
I unknowingly re-invented Goodhart's Law, based on my experiences with call centers (they reward short call times. The best way to minimize call times is to quickly give an answer, regardless of whether it's true or not, and to answer what the customer says, regardless of whether that addresses their real problem).
@demonzabrak
@demonzabrak 2 жыл бұрын
Discovered. You independently discovered Goodhart’s Law. Universal laws are not invented, they are true regardless of if we know them.
@NFT2
@NFT2 6 жыл бұрын
Really great videos Robert, thank you very much.
@Dan99
@Dan99 6 жыл бұрын
Wow, this video is amazingly thought provoking!
@bobcunningham6953
@bobcunningham6953 6 жыл бұрын
Rob! You are getting so much better at your presentations in terms of reasoning, arguments, graphics and editing. It's getting more and more like you are able to pop open my head and pour in knowledge. (Though, perhaps, it's also a function of my growing familiarity with both the topic and your presentations of it.) Which then gets me wanting more: Supporting references (beyond the primary reference, curated), examples, demonstrations I can try and modify. If you revisit this arc, I'd really like to see an added layer of "learning by doing" approach. Tutorials, but less than a MOOC. Though I would not at all object to a MOOC! Start with something initially available only via Patreon, for a number of reasons: - Build your funding base. - Self-selected motivated participants. - Focused feedback from a smaller audience, encouraging iteration for content and presentation. I support other Patreon creators who make their supporters work (actively participate, beyond passive periodic financial participation) to improve both the channel (style & content) and the creator (research directions, narrative arcs, etc.). The content by these creators always starts with video, and generally falls into the categories of education (mainly sci/tech/culture) and art (particularly original music), but often branches into supporting media (images, computer code, etc.).
@MichaelErskine
@MichaelErskine 6 жыл бұрын
Excellent real-world examples!
@amaarquadri
@amaarquadri 6 жыл бұрын
Absolutely loved this video.
@sagacious03
@sagacious03 4 жыл бұрын
Neat video. Thanks for uploading!
@richardleonhard3971
@richardleonhard3971 6 жыл бұрын
I like the shoutout to Maidenhead's finest intergalactic spacelord.
@DodgerOfSheep
@DodgerOfSheep 6 жыл бұрын
just paused the video to say I love the comic book style hand drawing the illustrations
@DrDress
@DrDress 6 жыл бұрын
I drop everything I have in my hands when I get the notification.
@AexisRai
@AexisRai 6 жыл бұрын
DrDress bad reflex agent; what if you drop your phone and can't watch the video? :)
@NathanK97
@NathanK97 6 жыл бұрын
what if you drop the big red button and the AI knows you wont be able to stop it from running over the baby?....
@starcubey
@starcubey 6 жыл бұрын
I think this is a great way of explaining the topic. I think it would have been great if you went into detail of how score systems could be flawed in your first video.
@LKRaider
@LKRaider 6 жыл бұрын
To make an effective AGI, first we recreate the whole Universe in a simulation multiple times in parallel with all possible AGI models. ... I wonder which model is in ours.
@EdCranium
@EdCranium 6 жыл бұрын
Loved the dolphin story. You are really spot on with your analogies. It's much much easier for a curious outsider like myself trying to understand that which I know to be vitally important, but difficult to get my head around. Brilliant job. Thank you. I learn best by doing. Does anyone know where I can do some newbie tinkering with code to get hands-on experience? Python perhaps?
@fleecemaster
@fleecemaster 6 жыл бұрын
Check out Sentdex, he works in Python and puts up tutorials sometimes, might be a good place to start :)
@EdCranium
@EdCranium 6 жыл бұрын
Thanks. I checked that out which led me to "TensorFlow" - and knowing that, I was able to find a "from the ground up" tutorial which seems promising. Just dropped back to thank you for the lead before I watch "Tensorflow and deep learning - without a PhD by Martin Görner".
@fleecemaster
@fleecemaster 6 жыл бұрын
Yep, that's it, good luck :)
@DanieleCapellini
@DanieleCapellini 6 жыл бұрын
6:36 just straight up gave me the chills
@A_Box
@A_Box 5 жыл бұрын
This is supposed to be Computer Science stuff but it is so relevant to people as well. I hope there is joint work on this subject between computer science, neurology, psychology, and other related fields.
@General12th
@General12th 6 жыл бұрын
Here's a question: why can I ask a human to clean my room, and legitimately expect the correct results? I know humans wirehack all the time -- it's called "cheating" -- but why do humans sometimes *not* wirehack? What's their thought process behind actually doing what they're told; and can we somehow implement that into AI?
@attitudeadjuster793
@attitudeadjuster793 6 жыл бұрын
Because cheating is involved with risk, which might lead to a smaller or none reward. And being a honest, trustworthy "agent" might lead to an even bigger reward overall. Short term versus long term, and also balancing between to risks (second one not getting rewarded for being trustworthy).
@charlieh2081
@charlieh2081 6 жыл бұрын
I think it's because as children we do cheat and we get told off. I would say that it's probably a more emotional learning though because kids that don't get on with their parents don't obey them or even rebel against teachings. Not sure how you'd implement that into AI though.
@GamersBar
@GamersBar 6 жыл бұрын
I like this format , i dont think id change much just try and upload regular whatever it is once a week or once a fortnight . I actually quite like the way you did the diagrams as a clip you draw ; honestly with all this ai stuff the more i learn the more i believe we can no more control ai after we create it than my pet dog can control what i have for lunch, i just don't think we can force our values onto any entity much more intelligent than ourselves . I think at the end of the day we are going to have to hope the ai is intelligent enough to see that humanity is something worth keeping around and not just turning into batteries.
@josiah42
@josiah42 6 жыл бұрын
There's a biological solution to reward hacking, particularly wireheading. Human preferences are inconsistent, not because we're poorly implemented, but because our preferences change as we get closer to our original goal. The reward center of our brain has many paths to it. Each path is eroded the more it is used. So doing the same thing over and over again has diminishing returns and we change our behavior to include more variety. This is why recreational drugs lose their thrill, and why we grow discontent when we achieve our goals. It's not a flaw. It's a counterbalancing short term cycle that ensures better long term outcomes by keeping us from sitting on a local maxima. Adding this kind of adaptive discontentment into AI would actually make it a lot safer because it wouldn't fixate on absurd maxed out edge cases, since they would erode the fastest. This applies to meta-cognition as well. Most people find wireheading repulsive, not appealing. Why?
@noone-igloo
@noone-igloo 3 жыл бұрын
Thanks. I was wondering what this was called, and I figured, "I bet Robert Miles made a video about the concept." And you did! And several videos. I was curious about it because my lab and many others have encountered a version of reward hacking in an evolutionary biology context, specifically experiments where cultured cells are pressured to evolve to make more of something. Design a system to select the cells making the most ____, and allow only those to continue dividing. It is almost inevitable you will recover some cells that are not making more ____, but have found some other way to fool the measurement apparatus, whatever it may be, to report a very high value. Of course that leads us to attempt predictions of such strategies and plan around them.
@BrandonReinhart
@BrandonReinhart Жыл бұрын
An example I like is people in a business changing the way the business evaluates compensation to earn more compensation (instead of producing more or higher quality goods).
@Hyraethian
@Hyraethian 4 жыл бұрын
6:54 Glances nervously at the recommendation list.
@robertweekes5783
@robertweekes5783 Жыл бұрын
4:13 Also the robot might do the classic “sweep the mess under the rug” - or table 😂
@oluchukwuokafor7729
@oluchukwuokafor7729 6 жыл бұрын
Whoa those dolphins are really smart!!!
@SamuliTuomola_stt
@SamuliTuomola_stt 6 жыл бұрын
Makes one wonder though, how do they tear off pieces without thumbs? Do they get one dolphin to hold the litter and another to rip it? That's pretty sophisticated, and would probably require pretty diverse vocabulary to coordinate
@charonme
@charonme 5 жыл бұрын
@@SamuliTuomola_stt they hid the paper under a rock at the bottom of the pool
@AndreRomano272
@AndreRomano272 6 жыл бұрын
That chill at the end when you realize what Robert meant.
@Sophistry0001
@Sophistry0001 6 жыл бұрын
Is this the kind of thing that researchers can run in a sand box environment to figure it out? Or is this all theoretical up to this point? Has there been any discussion about making AI similar to humans? (ok poorly worded question, duh) Like how with any human, 'the grass is always greener on the other side'. As in they would never be able to fully maximize their reward function? No matter what a single person has or has achieved, it's almost like we have a restlessness hard coded into us, so we have a hard time actually reaching contentment. As soon as they gain a solid grasp on any one reward function, the metric would change? Or something to obtain that effect. I love what you're doing here and find this topic a absolutely fascinating, even if I don't really understand the nitty gritty. You are doing an awesome job of presenting the current state of AI research and breaking down some of the issues that we're trying to tackle.
@JmanNo42
@JmanNo42 6 жыл бұрын
I Listened three times now and got the general idea of this, excellent video i think this is as true as it gets Robs best video so far. As usual i am a bit hesitant to the idea that robots/AGI's develop advanced cheat technics by themself, and to that the measure can not be distinguished from the goal or be part of the goal. I think humans are more suspectible and prone to reward hacking because they work in a social environment, and ideas really do spread like wildfire in human society. Well if AGI base their actions by interacting with other AGI's it do seem inevitable that they will teach eachother about different technics for reward hacking "to exploit the system" in this case the surrounding environment. So maybe interaction between different AGI's should be kept to a minimum. To me it seem reward hacking is more a social phenomen, "and most" system exploits just stumbled upon, there is few people that really do have the intellect to actively seek out reward hacks. That it do occur in virtual environment seem more plausible because the number of trials is not hindered/limited by anything else then the speed of task exploration. In social systems it is much harder to get enough accurate information about the environment to exploit it, without some sort of simulation thought process "that have to be pretty exact" to allow the reward hacking. To be quite honest most people that originallly find backdoors to exploit complex systems have either designed them themself or been part of the design process. So my view is that reward hacking may be a supertask "in real society" that is really hard if not impossible for an AGI to do outside simulations and is really the result of a social skllset more then humans/AGI's perform "supertasks", so in most cases it is not individual skill sets that analytically "break the system in order to exploit it". Cheating is much more about learning then actually analytically find weakness in a system, it is a social skill, that require a special mindset or a bucket. The bucket on the head problem seem alot easier to adjust in the AGI world then in the human. But it do get more and more clear that we should limit the AGI's from freely interact, the real problem is again the human teachers, if they are prone to seek out reward hacking strategies to exploit the system "our society and living environment" they will teach the robot/AGI about the cheats. And that day i fear we will see Rob with a bucket on his head, it could already be there we just do not know until we start to perform reward hacking ourselfs. It is hard to know maybe our system already have award/reward system but you should not dwell on such topic it will almost certainly make you paranoid ;) Myself is a firm beleiver in overwatching strategies using hidden agents and task oriented expert systems that the robots are not aware about. That way you can make/create a box environment around the AGI's to support their actions and adjust the award/reward system to make their existence as handytools easier to adjust.
@JmanNo42
@JmanNo42 6 жыл бұрын
The most hidden agent and bucket oriented "compartialised approach i do know of is using the freemason toolset. It is very hierarchical, no one know more then they need to know about the overall goal of the totality of the systems. It may turn out that freemasons are the ultimate bucketheads. Unfortunately it only work on a minor groups of individuals, that are not so prone to explore themself but keen on ceremonial and ritual behaviour under a simple strict ruleset . To paraphrase Cesar Milan rules boundaries and limitations do just work on pinheads not humans.
@sk8rdman
@sk8rdman 6 жыл бұрын
I love the student and dolphin examples of measures being used as targets. This seems to be applicable to an incredible array of fields, outside of AI safety, and I'd love to learn more about it.
@the_furf_of_july4652
@the_furf_of_july4652 4 жыл бұрын
Idea for the cleaning robot. Have an external camera, for example on the ceiling. Still doesn’t fix every issue, though. Perhaps to prevent the camera from being covered, withhold rewards if either the camera is black, or if there’s anything within a certain distance of the camera, detecting whether it’s blocked.
@willmcpherson2
@willmcpherson2 4 жыл бұрын
Well that ending was terrifying
@qeithwreid7745
@qeithwreid7745 4 жыл бұрын
Guesses - destroys everything, blinds itself, pushes everything out of the room
@pafnutiytheartist
@pafnutiytheartist 6 жыл бұрын
In all modern neural networks not only the reward system, but the learning algorythm is not a part of network, but a separate entity. I think that this one of the thing that prevents us from making AGI. If we change this approach the described problem might change too(not go away, just change).
@yearswriter
@yearswriter 6 жыл бұрын
By the way, threre is a great video from CGP grey on how our brain is actually 2 different entities. Which I find curious, in the light of the subject.
@pafnutiytheartist
@pafnutiytheartist 6 жыл бұрын
Yes, I saw that. I think the closest we got to something like that is Generator-Classifier models where one network is trying to produce something while other is tryng to tell it apart from the real thing. It works with images music and poetry. But still the network can not alter the teaching algirythm itself. You can compare it to reflexes in animals. If you hit a lab rat with electric shock each time it does something it will eventually learn not to do it. This is close to what we do with AI. But in human intelligence we can actively think which action caused bad things to happen and avoid it. By using our intelligence itself to analyse our own actions we can learn much faster. As far as I know, no AI can do this at the moment.
@yearswriter
@yearswriter 6 жыл бұрын
I think there is much more going on anyway =) There are also our dreams, there are a lot of research about what our brain does with information while we are sleeping, there is also a question about mechanics - I mean, are there basic mechanism in our brains whitch serves as an engine to thought processing, like p-n-p transistors and adders in processors, or is there some complicated structure, which serves some spethific role for every role there is.
@JM-us3fr
@JM-us3fr 6 жыл бұрын
Hey Dr. Miles, do you think you do a speculation video, where you spell out what you think the most likely AI disasters could be?
@erickmagana353
@erickmagana353 11 ай бұрын
Also, if you reward an agent every time he cleans something then he may clean the room and make a mess again so he can clean again and get his reward.
@carlweatherley6156
@carlweatherley6156 4 жыл бұрын
I read somewhere an English king wanted to reduce or eradicate wolves from Britain, he would reward people who killed wolves, so much money per wolf hide. It maybe had the desired effect for a while. But, people starting capturing wolves instead, breeding them as much as they could in captivity, killing some but not all every year and collecting a reward every year. The king caught on to what they were doing, and the reward system was scrapped, then many people released the wolves they had back into the wild in defiance, and there were more wolves again.
@BatteryExhausted
@BatteryExhausted 6 жыл бұрын
SETHBLING !
@dfinlen
@dfinlen 4 жыл бұрын
Isn't knowledge the goal, perhaps the system should learn to focus on finding different states, the transitions and combinatorics. Kind of making an algebra of the system. Is this just examples of local minimums caused by over training. I don't know if any of this makes sense. But you have me inspired. So just thankyou
@Vellzi
@Vellzi 4 жыл бұрын
The idea of an poorly designed AGI being told to make humans smile is super unsettling, and is actually something mentioned in a book called Superintelligence: "Final Goal: 'Make us smile' Perverse instantiation: Paralyze human facial muscalatures into constant beaming smiles The perverse instantiation - manipulating facial nerves - realizes the final goal to a greater degree than the methods we would normally use, and is therefore preferred by the AI. One might try to avoid this undesirable outcome by a stipulation to rule it out: Final goal: 'Make us smile, without directly interfering with our facial muscles" Perverse instantiation: Stimulate the part of the motor cortex that controls our facial musculature in such a way as to produce constant beaming smiles"
@ryalloric1088
@ryalloric1088 2 жыл бұрын
The bit about altering it's reward function seems to run counter to the whole idea of goal preservation though. How do you reconcile this? Is it just two different ways we could program them?
@christian-g
@christian-g 5 жыл бұрын
One possible solution to the Goodhart's law problem in education that comes to mind would be an evaluation of the students' ability or learning progress, the exact measures of which stay hidden from the students. Besides the obvious difficulty of keeping things secret in real life, what new problems could this approach have?
@kirkmattoon2594
@kirkmattoon2594 3 жыл бұрын
The dolphins' hack of the fish reward system was discovered by Arab villagers a long time ago. Western archaeologists trying to encourage discovery of ancient manuscripts told villagers in an area where there had been some ms finds they would give a certain amount of money for manuscript. Unfortunately they gave no more for large pieces of manuscript than for small. The predictable result was that they needed many years to put together a jigsaw puzzle of thousands of pieces caused by their own lack of foresight.
@stuck_around
@stuck_around 6 жыл бұрын
Robert can you do a video on negative rewards and/or adding the chance of dying in AI? it seems in biology we are less driven by seeking reward than we are avoiding negative reward (death)
@oktw6969
@oktw6969 6 жыл бұрын
Well, stop button is essentially death for AI, so he already covered this on computerphile.
@fleecemaster
@fleecemaster 6 жыл бұрын
You are much more driven by reward than you seem to realise.
@Sophistry0001
@Sophistry0001 6 жыл бұрын
That's a good point, I didn't think about how much we are driven by the desire to not die.
@fleecemaster
@fleecemaster 6 жыл бұрын
Matt, wanting to die tends to score quite low on the fitness tests :P
@fleecemaster
@fleecemaster 6 жыл бұрын
It's not, I get the feeling you don't know enough about the subject for me to explain why though. If you want to believe this, then carry on. So long as you know it doesn't change the truth of the situation ;)
@knight_lautrec_of_carim
@knight_lautrec_of_carim 4 жыл бұрын
I'm imagining a robot chaining people to award and joker-smile cutting their faces to get max reward points...
@TimwiTerby
@TimwiTerby 6 жыл бұрын
The example with the dolphins was cool, but it would have been more powerful/convincing to point out that humans themselves do reward hacking all the time, in the form of finding loopholes to evade taxes, circumventing regulations aimed to protect the environment or consumer safety, etc.etc.etc.
@bm-ub6zc
@bm-ub6zc 2 жыл бұрын
Best thumbnail btw 😂
@eiver
@eiver 6 жыл бұрын
AI with reward system based on human smile? Somehow immediately the Joker scene came to my mind: "Why so serious son? Lets put a smile on that face". :-]
@benjaminjohn675
@benjaminjohn675 6 жыл бұрын
Are there any alternatives to a "reward system" in general?
@MrSlowestD16
@MrSlowestD16 6 жыл бұрын
lol @ the education system point. Was that a shot at the US, or was that something about the UK? Some states in the US are trying to push an agenda called "common core" which is exactly what you described, just a standardized test and teachers only teach to that test - comes with some odd & very unconventional ways of teaching, too.
@trucid2
@trucid2 6 жыл бұрын
The answers we seek are found in nature. Nature had to overcome this problem. Internal reward systems in living things help them survive and reproduce. Having creatures hack their reward systems leads to diminished reproductive fitness -- their genes don't get passed on. The ones that survived and reproduced were the ones that were incapable of hacking their internal reward systems to any meaningful degree. There's a thing living in my head that punishes and rewards me for my actions. I can't fool it because it's inside my head -- it knows everything I know. I can't fool myself into thinking that my room isn't a mess by putting a bucket over my head. It knows better.
@RockstarRacc00n
@RockstarRacc00n Жыл бұрын
"The reward is from humans smiling of being happy..." ...flashbacks to Friendship is Optimal, where the AI that is going to wirehead everyone to "satisfy human values" uses the one that's going to force everyone to smile all the time as an example of why it should be allowed to prevent other AI from existing by taking over the world.
@aspuzling
@aspuzling 6 жыл бұрын
What about random variable rewards? A slot machine only pays out money 1 / 100 goes but somehow it convinces the player to spin the wheels 99 times without reward. What if your metrics vary randomly as well? For example some teachers get rewarded based on final exam score, others on homework score, others on student attendance but the teachers have no idea which metric will ultimately be used to reward them so they try to maximise every possible metric they can think of (including those that are not measured). I am sure these ideas have their pitfalls as well but it would be interesting to hear the discussion.
@LarlemMagic
@LarlemMagic 6 жыл бұрын
Is there a benefit to having many adversarial reward systems for one agent? Short-term self vs Long-term Self vs Short-term Other vs Long-term Other, all fighting for a balanced reward overall? Would this remove the wirehacking of covering its own head with a bucket, since the human rewards and long term rewards would not be maximized by covering its head? Or maybe I am just naive.
@metallsnubben
@metallsnubben 6 жыл бұрын
Minsc & Boo I think it's approaching something, in that particular example, but there's still problems. For one, if those goals are ever mutually exclusive, then the one that gives fewer points will never happen. If they're not, then it'll try to maximize a combination of the goals, which is definitely a step on the right way. Say you give the robot a supervisor, and you get score for removing trash from current view AND for making sure there's no trash in the supervisors view when it arrives That sounds like it should just lead to cleaning properly, but it might just lead to hacking both rewards. Self-bucket, wait around, ambush the supervisor with the bucket and run away (which does sound hilarious!) I think combining different time perspectives like that is another way of narrowing in on good rules, but unless you make them airtight you're still in trouble, as it doesn't solve the underlying problem. In fact, it kind of sounds like "better" rules can sometimes be scarier, as at least with simple ones it will misbehave predictably and obviously Like, rather it covered its head with a bucket than gouged out your eyes or something ;)
@baileyjorgensen2983
@baileyjorgensen2983 6 жыл бұрын
Sethbling!
@miss_inputs
@miss_inputs Жыл бұрын
The implication at the end there gave me a mental image of some evil robot pointing a gun at someone and saying "Press the button to say you were satisfied, so I get my reward", which reminds me of how a lot of websites or businesses work these days, trying to guilt trip you or manipulate you into giving them a rating or a review when you're done. It's not AI, humans just kind of suck.
@himanshuwilhelm5534
@himanshuwilhelm5534 5 жыл бұрын
A robot trying to eradicate evil: Before we get into moral philosophy, the robot is like, "see no evil hear no evil, speak no evil."
@filedotzip
@filedotzip 6 жыл бұрын
I love the thumbnail
@unflexian
@unflexian 6 жыл бұрын
Will the next video be called "Reward Hacking Revelations"?
@NathanTAK
@NathanTAK 6 жыл бұрын
According to legend, I one clicked on a video so fast; however, reports of such have never been confirmed and most scientists now believe it to be impossible, although it's never been formally proven.
@kevinscales
@kevinscales 6 жыл бұрын
The solution I have been thinking of for this is to have a negative reward for tampering with the reward system. It should probably be considered one of the worst things it could ever do. However when programmers want to modify the reward system themselves the AI will want to prevent that. You could include an exception in the reward system to allow programmers to change it, but then that still leaves the possibility for the AI to do everything to 'convince' the programmers to hack the reward system on its behalf. Best to get it right first time. There are also problems with defining exactly what is or isn't part of the reward system. The entire environment is in some way part of the reward system, and some 'hacking' of it is what we want the AI to do
@acbthr3840
@acbthr3840 6 жыл бұрын
What you're doing in this case is creating a rudimentary form of fear that the AI has to deal with, so it isn't easy to make it afraid of tampering with itself, while being perfectly fine with someone else doing it. And this fear in of itself is a measure for the AI to target and abuse.
@ideoformsun5806
@ideoformsun5806 5 жыл бұрын
It's like defusing a bomb you haven't finished making. This makes me feel grateful to be a relatively weak human being. We are self-limiting creatures. Perhaps on purpose? Being really smart triggers bullying from others for an instinctual reason. Different is dangerous.
@saxbend
@saxbend 6 жыл бұрын
What kind of reward would motivate an AI? Is there some kind of equivalent to the human emotion of satisfaction? Wouldn't the AI's intelligence lead it to define its own set of priorities above a user defined reward?
@ValentineC137
@ValentineC137 4 жыл бұрын
4:14 POG
@trucid2
@trucid2 6 жыл бұрын
Robert, do you think we will have superhuman AGI in 15 years?
@LovrePetesic
@LovrePetesic 6 жыл бұрын
Are you programming outside of youtube or do you work on AI theory these days?
@glorrin
@glorrin 6 жыл бұрын
would a concious limited life span have any effect on general AI ?
@israelRaizer
@israelRaizer 3 жыл бұрын
4:55 my guess is the robot would block its camera so that the mess can't be seen
@fatguy338
@fatguy338 4 жыл бұрын
Make me smile robo-daddy.
Reward Hacking: Concrete Problems in AI Safety Part 3
6:56
Robert Miles AI Safety
Рет қаралды 101 М.
Avoiding Negative Side Effects: Concrete Problems in AI Safety part 1
9:33
Robert Miles AI Safety
Рет қаралды 153 М.
YouTube's Biggest Mistake..
00:34
Stokes Twins
Рет қаралды 68 МЛН
skibidi toilet 73 (part 2)
04:15
DaFuq!?Boom!
Рет қаралды 32 МЛН
What can AGI do? I/O and Speed
10:41
Robert Miles AI Safety
Рет қаралды 117 М.
A Response to Steven Pinker on AI
15:38
Robert Miles AI Safety
Рет қаралды 204 М.
What Can We Do About Reward Hacking?: Concrete Problems in AI Safety Part 4
9:38
Robert Miles AI Safety
Рет қаралды 112 М.
What's the Use of Utility Functions?
7:04
Robert Miles AI Safety
Рет қаралды 65 М.
Quantilizers: AI That Doesn't Try Too Hard
9:54
Robert Miles AI Safety
Рет қаралды 83 М.
AI Safety Gridworlds
7:23
Robert Miles AI Safety
Рет қаралды 91 М.
AI "Stop Button" Problem - Computerphile
20:00
Computerphile
Рет қаралды 1,3 МЛН
Two robots debate the future of humanity
23:12
THE DIGITAL ACID
Рет қаралды 6 МЛН
INSANE OpenAI News: GPT-4o and your own AI partner
28:48
AI Search
Рет қаралды 469 М.
Deadly Truth of General AI? - Computerphile
8:30
Computerphile
Рет қаралды 909 М.
Наушники Ой🤣
0:26
Listen_pods
Рет қаралды 495 М.
Why spend $10.000 on a flashlight when these are $200🗿
0:12
NIGHTOPERATOR
Рет қаралды 17 МЛН
APPLE УБИЛА ЕГО - iMac 27 5K
19:34
ЗЕ МАККЕРС
Рет қаралды 97 М.
📱 SAMSUNG, ЧТО С ЛИЦОМ? 🤡
0:46
Яблочный Маньяк
Рет қаралды 935 М.
СЛОМАЛСЯ ПК ЗА 2000$🤬
0:59
Корнеич
Рет қаралды 2,4 МЛН